In case you were wondering the offset command in postgres does not operate in constant time.

For example:
select * from table limit 100 offset 1000;

is fast (about 75 ms). While:
select * from table limit 100 offset 11000000;

takes just shy of 3 minutes.

The reason for this is that postgres actually retrieves all 11,000,100 rows and just ignores the first 11,000,000. The upshot of this is if you’d like to page through 11.000,000 records it will start out nice and fast. You do the calculations and you figure it’ll take about 90 minutes. But when you get back into work the next day it is still running. In fact the average speed is about 200 records a second and dropping.

If you need the offset functionality on large dataset you want to give the dataset its own indexed sequential id and use that in the where clause to do limit/offset stuff.

For example:
select * from table where id > 11000000 and id < 11000100;

That query always takes double digit ms instead of growing as you get further into the dataset. You’re kind of S.O.L. if you need to sort the whole dataset. Best bet would be to copy the table into a temp table in the right order and use the above technique, buy come one do you really need to sort 14,000,000 records. It’ll take forever.

In a similar vein, if you flush your cache after each record, your process will take much longer and if you load all 14,000,000 records into memory it’ll swell memory and generally slow things down. Just thought I’d share.

This post brought to you by fun with large datasets and the leter O.