sql - Proper way to access latest row for each individual identifier?

Saturday, 1 April 2017

sql - Proper way to access latest row for each individual identifier?

I have a table core_message in Postgres, with millions of rows that looks like this (simplified):

┌────────────────┬──────────────────────────┬─────────────────┬───────────┬──────────────────────────────────────────┐
│    Colonne     │           Type           │ Collationnement │ NULL-able │                Par défaut                │
├────────────────┼──────────────────────────┼─────────────────┼───────────┼──────────────────────────────────────────┤
│ id             │ integer                  │                 │ not null  │ nextval('core_message_id_seq'::regclass) │
│ mmsi           │ integer                  │                 │ not null  │                                          │

│ time           │ timestamp with time zone │                 │ not null  │                                          │
│ point          │ geography(Point,4326)    │                 │           │                                          │
└────────────────┴──────────────────────────┴─────────────────┴───────────┴──────────────────────────────────────────┘
Index:
    "core_message_pkey" PRIMARY KEY, btree (id)
    "core_message_uniq_mmsi_time" UNIQUE CONSTRAINT, btree (mmsi, "time")
    "core_messag_mmsi_b36d69_idx" btree (mmsi, "time" DESC)
    "core_message_point_id" gist (point)

The mmsi column is a unique identifier used to identify ships in the world. I'm trying to get the latest row for each mmsi.

I can get that like this, for example:

SELECT a.* FROM core_message a
JOIN  (SELECT mmsi, max(time) AS time FROM core_message GROUP BY mmsi) b
       ON a.mmsi=b.mmsi and a.time=b.time;

But this is too slow, 2 seconds+.

So my solution was to create a distinct table containing only the latest rows (100K+ rows max) of the core_message table, called LatestMessage.

This table is populated via my application every time new rows have to be added to core_message.

It worked fine, I'm able to access the table in a matter of milliseconds.
But I'd be curious to know if there is a better way to achieve that using only one table and keep the same level of performance for data access.

Answer

This answer seems to go in the way of the DISTINCT ON answer here, however it also mentions this :

For many rows per customer (low cardinality in column
customer), a loose index scan (a.k.a. "skip scan") would be
(much) more efficient, but that's not implemented up to Postgres 12.
(An implementation for index-only scans is in development for Postgres
13. See here and here.)
For now, there are faster query techniques to substitute for this.
In particular if you have a
separate table holding unique customers, which is the typical use
case. But also if you don't:

Using this other great answer, I find a way to keep the same performance as a distinct table with the use of LATERAL.
By using a new table test_boats I can do something like this :

 CREATE TABLE test_boats AS (select distinct on (mmsi) mmsi from core_message);

This table creation take 40+ seconds which is pretty similar to the time taken by the other answer here.

Then, with the help of LATERAL :

SELECT a.mmsi, b.time
FROM test_boats a
CROSS JOIN LATERAL(
    SELECT b.time
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC

    LIMIT 1
) b LIMIT 10;

This is blazingly fast, 1+ millisecond.

This will need the modification of my program's logic and the use of a query a bit more complex but I think I can live with that.

For a fast solution without the need to create a new table, check out the
answer of @ErwinBrandstetter below

UPDATE: I feel this question is not quite answered yet, as it's not very clear why the other solutions proposed perform poorly here.

I tried the benchmark mentionned here. At first, it would seem that the DISTINCT ON way is fast enough if you do a request like the one proposed in the benchmark : +/- 30ms on my computer.
But this is because that request uses index only scan. If you include a field that is not in the index, some_column in the case of the benchmark, the performance will drop to +/- 100ms.

Not a dramatic drop in performance yet.
That is why we need a benchmark with a bigger data set. Something similar to my case : 40K customers and 8M rows. Here

Let's try again the DISTINCT ON with this new table:

SELECT DISTINCT ON (customer_id) id, customer_id, total 
FROM purchases_more 
ORDER BY customer_id, total DESC, id;

This takes about 1.5 seconds to complete.

SELECT DISTINCT ON (customer_id) *
FROM purchases_more 
ORDER BY customer_id, total DESC, id;

This takes about 35 seconds to complete.

Now, to come back to my first solution above. It is using an index only scan and a LIMIT, that's one of the reason why it is extremely fast. If I recraft that query to not use index-only scan and dump the limit :

SELECT b.*

FROM test_boats a
CROSS JOIN LATERAL(
    SELECT b.*
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC
    LIMIT 1
) b;

This will take about 500ms, which is still pretty fast.

For a more in-depth benchmark of sort, see my other answer below.

Blog

Saturday, 1 April 2017

sql - Proper way to access latest row for each individual identifier?

No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?