The Bug 🐞
We've been testing the next generation of our sync engine at Day One prior to public release, and we found a funny bug. We have an endpoint for fetching a big list of entries that have changed since the last fetch. This endpoint is also how a new phone with no content would pull down all entries from a journal. But when we did try to sync all of our journal contents down to a new device, we were missing entries!
Enter Sherlock 🔎
We investigated the internal workings of the endpoint to see what was up. The missing entry was in the database, so that wasn't the issue. We tested the query that fetched the rows and the entry showed up, so that wasn't the issue. Internally we fetch groups of entries from the database in pages and write them out to the endpoint so we figured it must have been some flawed logic with the pagination. We logged the pagination pattern, but everything looked perfect.
Then we looked closer at the response returned by the server.
We were getting the right number of entries pulled down from that endpoint, but it turned out some of the entries were missing from the feed, and others were repeated!
This was a big indicator that we had a problem with sorting. We compared the entry that showed up twice with the entry that didn't show up at all. Sure enough, we found that they had the exact same "sync date", which is the column that our query sorted by.
The Fix 🛠
All we had to do was add a unique column into the sort clause:
builder->orderBy([|"sync_date asc, id asc"|]);
And the problem was fixed.
But Why? 🤔
When paginating, the same query is executed over again each time a new page is fetched. Each time, a small window of the results are fetched by skipping over the number of items fetched by previous pages. In order for this to work, however, the overall result set has to be the same for every page**. If the result set changes as new pages are fetched, we run the risk of dropping or repeating items.
In our case, the sync date on an entry was pretty close to unique. It's got millisecond-accuracy. And most of the time, two entries aren't synced in the exact same millisecond. But, as we now know from testing on real journal data, there are entries with identical sync dates. When this happens, The order in which those two entries are returned in a result state is considered unstable. Our query had an unstable sort.
So something about skipping through the items in our result set caused Postgres to reverse the order of those two rows every time. One of them was missed, the other repeated.
Once we added a unique column to the sort, the sort became a stable sort. Which means that Postgres always knows how to order the results in the query, no matter how much skipping or limiting we do. With that change in place, skipping old rows no longer changed the order and all the entries were properly included in the result set.