Since our production database grew to the size where querying data without indices became a suicide mission, we have learned that LIMIT does not always translate into faster query time and less work on the machine.
Today at work, we ran into a beautifully simple example to illustrate this point, which was perceived as counter-intuitive by more junior teammates.
Here we have a query over a 8M-row table, the condition fits squarely into an index.
EXPLAIN SELECT "record".* FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00');
QUERY PLAN
Bitmap Heap Scan on record (cost=256890.59..3475874.26 rows=8448978 width=1740) Recheck Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone)) -> Bitmap Index Scan on not_found_records_idx (cost=0.00..254778.35 rows=8448978 width=0) Index Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone)) (4 rows)
That was still lot of records, so we were hoping a LIMIT clause would shorten the execution time.
EXPLAIN SELECT "record".* FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') LIMIT 1;
QUERY PLAN
Limit (cost=0.00..0.46 rows=1 width=1740) -> Seq Scan on record (cost=0.00..3923250.74 rows=8448978 width=1740) Filter: ((updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone) AND ((status)::text = 'not_found'::text)) (3 rows)
Only that it didn’t. Adding the LIMIT clause changed the query plan from Bitmap Index Scan to a Sequence Scan, which is about the worst thing we want to do on a 8M-row table.
It is a typical “optimization” of the planner, who thought that the query with LIMIT clause was too small.
Operation wise, an index scan is more expensive than a sequence scan. An index scan would require to read the index pages first and then read the data pages for relevant rows.
The sequence scan only deal with data pages. And so when the LIMIT is small, combined with the table statistics, the planner is tempted to believe there are enough (random) rows that match the filter condition, which makes a sequence scan cheaper.
In fact, somehow this table statistics suggests that every LIMIT is too small till it is half of the table!
EXPLAIN SELECT "record".* FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') LIMIT 3000000;
QUERY PLAN
Limit (cost=0.00..1393038.57 rows=3000000 width=1740) -> Seq Scan on record (cost=0.00..3923250.74 rows=8448978 width=1740) Filter: ((updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone) AND ((status)::text = 'not_found'::text)) (3 rows) EXPLAIN SELECT "record".* FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') LIMIT 4000000;
Adding an ORDER BY clause into the query brings structure to the search and guide the planner to use the index
EXPLAIN SELECT "record".* FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') ORDER BY "record"."imported_date" LIMIT 1;
QUERY PLAN
Limit (cost=3518127.15..3518127.15 rows=1 width=1740) -> Sort (cost=3518127.15..3539249.59 rows=8448978 width=1740) Sort Key: imported_date -> Bitmap Heap Scan on record (cost=256898.59..3475882.26 rows=8448978 width=1740) Recheck Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone)) -> Bitmap Index Scan on not_found_records_idx (cost=0.00..254786.35 rows=8448978 width=0) Index Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone)) (7 rows)
Actually, any random ORDER BY would do
EXPLAIN SELECT "record".* FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') ORDER BY random() LIMIT 1;
QUERY PLAN
Limit (cost=3539253.59..3539253.60 rows=1 width=1748) -> Sort (cost=3539253.59..3560376.04 rows=8448978 width=1748) Sort Key: (random()) -> Bitmap Heap Scan on record (cost=256902.59..3497008.70 rows=8448978 width=1748) Recheck Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone)) -> Bitmap Index Scan on not_found_records_idx (cost=0.00..254790.35 rows=8448978 width=0) Index Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone)) (7 rows)
However, that is still far from optimal. In order to return the first row, the database first needs to fetch all rows matching the index condition, and only after that, sorts. This creates strains on the machine memory and maybe even disk swap.
Index is actually ordered structure (BTree). The most optimal query is the one using one of the indexed columns for sorting.
EXPLAIN SELECT "record".* FROM "record" WHERE ("record"."status" = 'not_found' AND "record"."updated_date" < '2018-12-23 15:48:23.008320+00:00') ORDER BY "record"."updated_date" LIMIT 1;
QUERY PLAN
Limit (cost=0.56..1.90 rows=1 width=1740) -> Index Scan using not_found_records_idx on record (cost=0.56..11280384.47 rows=8448978 width=1740) Index Cond: (((status)::text = 'not_found'::text) AND (updated_date < '2018-12-23 15:48:23.00832+00'::timestamp with time zone)) (3 rows)
This, in fact, is even more effective that the original no-order query, with or without LIMIT.
665 thoughts on “Simple example on why LIMIT is not always the best thing for SQL Database”