A little about query optimization

I want to use a simple example to show how it is sometimes possible to optimize queries that look quite ordinary at first glance. The example uses PostgreSQL 9.3, but the principle applies to any DBMS that supports hash joins.

The task is simple: join two tables, one very large and one small. The catch is that the join condition is not straightforward; it is tricky, with an OR. (A real-world case: joining a transactions table to an accounts table, given that each transaction has two account fields, one for the debit side and one for the credit side.)

Prepare test data:

create table public.test1 
( id bigint,
id2 bigint,
value varchar(100)
);

create table public.test2
( id bigint,
value varchar(100)
) ;

insert into public.test1 
select generate_series(1,2000000), 1, 'abcdef';

insert into public.test2 
select generate_series(1,100), 'abcdefssdf';

create index ix_test2_id on public.test2 (id);
/* An index on the small table changes nothing fundamental here, but it would
   probably exist in a real system, so let it exist in the test too. */
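One step the original article does not mention, but which is worth doing after a bulk load so the planner has fresh statistics for the cost estimates below:

```sql
-- Refresh planner statistics after the bulk inserts
-- (autovacuum would get there eventually, but not instantly).
analyze public.test1;
analyze public.test2;
```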

And here is our query in its original form, with the OR condition:

select * 
from public.test1 t1
inner join public.test2 t2 on t1.id = t2.id or t1.id2 = t2.id;

A join with this condition is guaranteed to get a nested loop. Here's the plan:


"Nested Loop (cost=0.00..3532741.25 rows=2000099 width=42) (actual time=0.043..61627.189 rows=2000099 loops=1)"
"Join Filter: ((t1.id = t2.id) OR (t1.id2 = t2.id))"
"Rows Removed by Join Filter: 197999901"
"-> Seq Scan on test1 t1 (cost=0.00..32739.00 rows=2000000 width=23) (actual time=0.010..385.658 rows=2000000 loops=1)"
"-> Materialize (cost=0.00..2.50 rows=100 width=19) (actual time=0.000..0.007 rows=100 loops=2000000)"
"-> Seq Scan on test2 t2 (cost=0.00..2.00 rows=100 width=19) (actual time=0.005..0.018 rows=100 loops=1)"
"Total runtime: 61717.751 ms"


Note the two million passes over the small table. This is the main drawback of a nested loop. Even if these were two million index lookups instead of sequential scans, it would still be bad.
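The plan above is simply the output of EXPLAIN ANALYZE on the query, and its counters add up: 2,000,000 outer rows times 100 materialized inner rows gives 200,000,000 join-filter checks, of which 2,000,099 pass and 197,999,901 are removed.

```sql
-- Reproduce the plan shown above.
-- 2,000,000 loops × 100 rows = 200,000,000 filter evaluations:
-- 197,999,901 removed by the join filter + 2,000,099 kept.
explain analyze
select *
from public.test1 t1
inner join public.test2 t2 on t1.id = t2.id or t1.id2 = t2.id;
```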

What to do?

A hash join would help here: the hash of the small table fits in memory, and the big table is scanned once. Perfect. But with an OR condition the planner cannot use one. There is a solution!

Do two joins, one per field, and move the OR into the filter:

select * 
from public.test1 t1
left join public.test2 t2 on t1.id = t2.id 
left join public.test2 t3 on t1.id2 = t3.id
where t2.id is not null or t3.id is not null;
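Another common rewrite for OR join conditions, not used in the original article but worth knowing, is to split the query into two hash-joinable branches with UNION. A sketch, assuming you want the same columns as the original inner join:

```sql
-- Each branch gets a plain equality join, so each can use a hash join.
-- UNION deduplicates identical whole rows, which here matches the case
-- where both conditions hit the same (t1, t2) pair; if genuine duplicate
-- rows must be preserved, this rewrite needs more care (e.g. UNION ALL
-- with an exclusion condition in the second branch).
select t1.*, t2.*
from public.test1 t1
inner join public.test2 t2 on t1.id = t2.id
union
select t1.*, t2.*
from public.test1 t1
inner join public.test2 t2 on t1.id2 = t2.id;
```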

The plan is much faster now. A 5 kB hash table is exactly what we want: even though there are now two joins, the win is roughly tenfold! (Strictly speaking, the rewrite is not identical to the OR join: it returns one row per test1 row with both matches side by side, 2,000,000 rows here versus 2,000,099, which is fine when you mainly care about which test1 rows have a match.)

"Hash Left Join (cost=6.50..67746.50 rows=2000000 width=61) (actual time=0.124..2230.636 rows=2000000 loops=1)"
"Hash Cond: (t1.id2 = t3.id)"
"Filter: ((t2.id IS NOT NULL) OR (t3.id IS NOT NULL))"
"-> Hash Left Join (cost=3.25..40243.25 rows=2000000 width=42) (actual time=0.073..1065.822 rows=2000000 loops=1)"
"Hash Cond: (t1.id = t2.id)"
"-> Seq Scan on test1 t1 (cost=0.00..32739.00 rows=2000000 width=23) (actual time=0.012..338.759 rows=2000000 loops=1)"
"-> Hash (cost=2.00..2.00 rows=100 width=19) (actual time=0.041..0.041 rows=100 loops=1)"
"Buckets: 1024 Batches: 1 Memory Usage: 5kB"
"-> Seq Scan on test2 t2 (cost=0.00..2.00 rows=100 width=19) (actual time=0.004..0.015 rows=100 loops=1)"
"-> Hash (cost=2.00..2.00 rows=100 width=19) (actual time=0.039..0.039 rows=100 loops=1)"
"Buckets: 1024 Batches: 1 Memory Usage: 5kB"
"-> Seq Scan on test2 t3 (cost=0.00..2.00 rows=100 width=19) (actual time=0.004..0.016 rows=100 loops=1)"
"Total runtime: 2318.009 ms"

Thank you for your attention.
Article based on information from habrahabr.ru
