Testing Postgres race conditions with synchronization barriers (lirbank.com)
throwaway2ge5hg 1 day ago [-]
Postgres has SERIALIZABLE transaction isolation level. Just use it and then you never have to worry about any of these race conditions.

And if for some reason you refuse to, then this "barrier" or "hooks" approach to testing will in practice not help. It requires you to already know the potential race conditions, but if you are already aware of them then you will already write your code to avoid them. It is the non-obvious race conditions that should scare you.

To find these, you should use randomized testing that runs many iterations of different interleavings of transaction steps. You can build such a framework that will hook directly into your individual DB query calls. Then you don't have to add any "hooks" at all.

But even that won't find all race condition bugs, because it is possible to have race conditions surface even within a single database query.

You really should just use SERIALIZABLE and save yourself all the hassle and effort of spending hours writing all these tests.

LgWoodenBadger 20 hours ago [-]
“because it is possible to have race conditions surface even within a single database query.”

This brought back awful memories of MS SQLServer and JDBC. Way back when, maybe Java 1.5 or so, SQLServer would deadlock between connections when all they were doing was executing the exact same statement. Literally. Not the same general statement with different parameters.

lirbank 1 day ago [-]
Good call, SERIALIZABLE is a strong option - it eliminates a whole class of bugs at the isolation level. The trade-off is your app needs to handle serialization failures with retry logic, which introduces its own complexity. That retry logic itself needs testing, and barriers work for that too.

On randomized testing - that actually has the same limitation you mentioned about barriers: you need to know where to point it. And without coordination, the odds of two operations overlapping at exactly the wrong moment are slim. You'd need enormous pressure to trigger the race reliably, and even then a passing run doesn't prove much. Barriers make the interleaving deterministic, so a pass actually means something.
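
For illustration, the retry wrapper can be small. A sketch assuming node-postgres, where serialization failures surface as SQLSTATE 40001 (and deadlocks as 40P01):

    // Sketch: run fn inside a SERIALIZABLE transaction, retrying on
    // serialization failure (40001) or deadlock (40P01).
    async function withSerializableRetry<T>(
      client: { query: (sql: string) => Promise<unknown> },
      fn: () => Promise<T>,
      maxAttempts = 5,
    ): Promise<T> {
      for (let attempt = 1; ; attempt++) {
        try {
          await client.query("BEGIN ISOLATION LEVEL SERIALIZABLE");
          const result = await fn();
          await client.query("COMMIT");
          return result;
        } catch (err: any) {
          await client.query("ROLLBACK");
          const retryable = err?.code === "40001" || err?.code === "40P01";
          if (!retryable || attempt >= maxAttempts) throw err;
        }
      }
    }

And the barrier goes inside fn: force two of these wrappers to overlap in a test and you exercise the 40001-and-retry path deterministically.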
danielheath 22 hours ago [-]
SERIALIZABLE is really quite hard to retrofit to existing apps; deadlocks, livelocks, and “it’s slow” show up all over the place when you switch it on.

Definitely recommend starting new codebases with it enabled everywhere.

sealeck 22 hours ago [-]
Do you have examples of deadlocks/livelocks you've encountered using SERIALIZABLE? My understanding was that the transaction will fail on conflict (and should then be retried by the application - wrapping existing logic in a retry loop can usually be done without _too_ much effort)...
electronvolt 17 hours ago [-]
I guess I'd say -- I think you're right that you shouldn't (ideally) be able to trigger true deadlocks/livelocks with just serializable transactions + an OLTP DBMS.

That doesn't mean it won't happen, of course. The people who write databases are just programmers, too. And you can certainly imagine a situation where you get two (or more) "ad-hoc" transactions that can't necessarily progress when serializable but can with read committed (ad-hoc in the sense of the paper here: https://cacm.acm.org/research-highlights/technical-perspecti...).

theptip 21 hours ago [-]
Not a silver bullet. When you use serializable you have more opportunities for deadlocks (cases which would otherwise be logic errors at weaker isolation levels).

Serializable just means that within the transaction your logic can naively assume it’s single threaded. It doesn’t magically solve distributed system design for you.

“Just use random testing” isn’t really an answer. Some race conditions only show up with pathological delays on one thread.

piskov 23 hours ago [-]
That whole article should have been:

Use a transactions table (just a name, like orders).

On it, have an INSERT trigger.

It should make a single update with a simple “update … set balance = balance + amount where accountId = id”. This will be atomic thanks to the db engine itself.

Also add a CHECK constraint (balance >= 0) so the balance can never go negative even if you have thousands of simultaneous payments. If it would become negative, the UPDATE throws, the insert trigger rethrows, no insert happens, and your backend code catches it.

That’s it: insert-trigger and check constraint.

No need for explicit locking, no stored procedures, no locks in your backend either, nada. Just a simple insert of a row. No matter the load and concurrent users it will work like magic. Blazingly fast too.

That’s why there is ACID in DBs.

Shameless plug: learn your tool. Don’t approach Postgresql/Mssql/whathaveyousql like you’re a backend engineer. DB is not a txt file.
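
For concreteness, a sketch of that setup (made-up table/column names, node-postgres style; the SQL is the point):

    // One-time migration (sketch). The CHECK constraint guards the invariant;
    // the trigger applies each inserted transaction row to the balance.
    await client.query(`
      ALTER TABLE accounts
        ADD CONSTRAINT balance_non_negative CHECK (balance >= 0);

      CREATE FUNCTION apply_transaction() RETURNS trigger AS $$
      BEGIN
        -- Atomic inside the engine; if the balance would go negative the
        -- CHECK fires, the trigger rethrows, and the INSERT never happens.
        UPDATE accounts
           SET balance = balance + NEW.amount
         WHERE id = NEW.account_id;
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER on_transaction_insert
        BEFORE INSERT ON transactions
        FOR EACH ROW EXECUTE FUNCTION apply_transaction();
    `);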

mfcl 18 hours ago [-]
The point of the article is to explain barriers and demonstrate them.

The logic used for crediting amounts of money is not important.

valenterry 19 hours ago [-]
> Shameless plug: learn your tool. Don’t approach Postgresql/Mssql/whathaveyousql like you’re a backend engineer.

Erm, knowing and understanding how to use your database is a bread and butter skill of a backend engineer.

IdontKnowRust 8 hours ago [-]
The "nada" lost in your text makes me assume you're Brazilian haha
shshshsjjjj 21 hours ago [-]
> no locks in your backend

PG is still handling the locks for you, so this isn’t a bulletproof solution - as always, depending on your use case, scale, etc., this may or may not work.

> No matter the load and concurrent users it will work like magic

Postgres will buckle updating a single row at a certain scale.

————-

Regardless, this article was about testing a type of scenario that commonly goes untested. You don’t always have a great tool like PG on hand that gives you solutions, so this kind of testing is still needed.

theptip 21 hours ago [-]
I’ve idly toyed with this problem as well; I think there’s a good opportunity to build a nice framework in Python with monkeypatching (or perhaps in other languages using DB/ORM middleware) so you don’t need to modify the code under test.

I think you can do better than explicit barrier() calls. My hunch is the test middleware layer can intercept calls and impose a deterministic ordering.
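
Roughly the shape I’m imagining, sketched in TypeScript to match the article (hypothetical; a real version would handle query arguments, errors, transactions, etc.):

    // Sketch: monkeypatch the driver's query method so each call parks
    // until a test scheduler releases it, imposing a chosen ordering.
    type PendingQuery = { sql: string; release: () => void };

    function intercept(
      client: { query: (sql: string) => Promise<unknown> },
      onQuery: (p: PendingQuery) => void,
    ) {
      const original = client.query.bind(client);
      client.query = (sql: string) =>
        new Promise((resolve, reject) => {
          // Park the call; the scheduler's release() lets it proceed.
          onQuery({ sql, release: () => original(sql).then(resolve, reject) });
        });
    }

The test then collects PendingQuery objects from both tasks and releases them in whatever interleaving it wants to check - no barrier() calls in the code under test.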

(There are a few papers around looking into more complex OS level frameworks to systematically search for concurrency bugs, but these would be tough to drop into the average web app.)

misiek08 7 hours ago [-]
In the worst-case scenario you can still get a flaky test, right? A single-threaded runtime will let the queries interleave sometimes and work correctly other times - talking about the variant without "FOR UPDATE".
lirbank 4 hours ago [-]
Right, that's a real concern with naive concurrent tests - you're at the mercy of timing and the test becomes flaky. That's exactly what the synchronization barrier solves: it forces both transactions to reach the critical point before either proceeds, so the race condition is guaranteed to occur on every run. No flakiness.
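
For anyone curious, the mechanism itself is tiny. A minimal sketch (not necessarily the article's exact implementation):

    // Minimal N-party barrier: every arrive() blocks until all parties
    // have arrived, guaranteeing the transactions overlap at this point.
    function createBarrier(parties: number) {
      let arrived = 0;
      let release!: () => void;
      const open = new Promise<void>((resolve) => { release = resolve; });
      return {
        arrive(): Promise<void> {
          if (++arrived === parties) release();
          return open;
        },
      };
    }

Each transaction awaits arrive() at its injected hook, so neither can proceed past the critical point until both have reached it.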
scottlamb 1 day ago [-]
It'd be interesting to see a version of this that tries all the different interleavings of PostgreSQL operations between the two (or N) tasks. https://crates.io/crates/loom does something like this for Rust code that uses synchronization primitives.
lirbank 1 day ago [-]
Interesting! The barrier approach is more targeted: you specify the exact interleaving you want to test rather than exploring all of them. Trade-off is you need to know which interleavings matter, but you get deterministic tests that run against a real database instead of a simulated runtime. Exploring exhaustive interleaving testing against a real Postgres instance could be a fun follow-up - I'd be curious if it's practical.
scottlamb 22 hours ago [-]
I think you could still do it against a real database—you're already setting it up to a known state before each test, right? Obviously there'd be more runs but I'd expect (hope) that each task would be sufficiently small that the number of permutations would stay within reason.

There would be some challenges for sure. Likely optimistic concurrent patterns would require an equivalent of loom's `yield_now` [1] to avoid getting stuck. And you'd probably need a way to detect one transaction waiting for another's lock to get out of situations like your update lock vs barrier example. I vaguely recall PostgreSQL might have some system catalog table for that or something.

[1] https://docs.rs/loom/0.7.2/loom/#yielding
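
Edit: for the lock-waiting detection, pg_blocking_pids() plus pg_stat_activity looks like the relevant bit. Untested sketch:

    // Untested sketch: list sessions currently blocked on another's lock.
    const { rows } = await client.query(`
      SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
      FROM pg_stat_activity
      WHERE cardinality(pg_blocking_pids(pid)) > 0
    `);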

lirbank 22 hours ago [-]
Yeah, the more I think about it, the more exciting this idea gets. The walkthrough in the article shows exactly why: I intentionally place the barrier between the SELECT and the UPDATE (to later show why that placement is wrong), and it deadlocks instead of triggering the race. Getting the placement right requires reasoning about where the critical interleaving point is. An exhaustive approach would surface all outcomes automatically: this placement deadlocks, this one exposes the bug, this one passes. That would remove the hardest part of writing these tests.
reitzensteinm 22 hours ago [-]
Martin Kleppmann has this tool that's quite relevant: https://martin.kleppmann.com/2014/11/25/hermitage-testing-th...
lirbank 22 hours ago [-]
Oh that is super cool. Great prior art to study in combo with Loom. Very excited to dig in - imagine if there was an easy-to-use data race tester where you didn't have to figure out the interleaving points up front? Just point it at your code and let it find them. Exciting.
reitzensteinm 21 hours ago [-]
Loom does exhaustive search, with clever methods to prune it. On real world programs, you have to set a limit to that because it obviously grows extremely quickly even with the pruning.

I've built something similar to Loom, except it's more focused on extensively modeling the C++11/Rust memory model (https://github.com/reitzensteinm/temper). My experience is that fairly shallow random concurrent fuzzing yields the vast majority of all concurrency bugs.

Antithesis (https://antithesis.com/) are probably the leaders of the pack in going deeper.

OrangeDelonge 17 hours ago [-]
Do you know you’re just talking to an LLM? Everyone else in this post also seems oblivious to it, or maybe they just don’t care? Why do I even read comments anymore sigh
jwilliams 17 hours ago [-]
This post confuses me a little. With my tests I try not to "reach inside" systems unless it's quite a specific integration test. Especially databases. In this case I feel like we're just... testing known PostgreSQL behavior?

Or to put it another way: as others have observed, this could be solved with atomic updates and in some cases SERIALIZABLE. Those are the right tools for balance operations - and if they’re used, I’m not sure they need testing in this manner?

lirbank 17 hours ago [-]
Fair concern about reaching inside systems - it's not something to do lightly. The hooks are designed to be minimal: production code never calls them, they only activate in tests. But the core point is narrower than the thread might suggest - the article isn't about whether to use atomic updates vs locks vs SERIALIZABLE. It's about when your code has operations that could race, how do you prove your handling actually works?
Arifcodes 19 hours ago [-]
Using SERIALIZABLE everywhere sounds nice in theory but it tanks throughput on write-heavy workloads. In practice, most teams I've worked with use a mix: SERIALIZABLE for the critical financial paths, READ COMMITTED + explicit locking for the rest.

The advisory lock pattern mentioned in the thread is underrated. We used pg_advisory_xact_lock for exactly the "first resource creation" race condition on a multi-tenant SaaS. Cleaner than dummy rows and no cleanup needed.
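
Sketch of the pattern (raw node-postgres style; the lock key is a bigint, e.g. a 64-bit hash of the tenant id - names made up):

    // Serialize "first resource creation" per tenant without a dummy row.
    // The xact-scoped advisory lock is released automatically at COMMIT/ROLLBACK.
    await client.query("BEGIN");
    try {
      await client.query("SELECT pg_advisory_xact_lock($1)", [tenantLockKey]);
      const { rows } = await client.query(
        "SELECT 1 FROM resources WHERE tenant_id = $1 LIMIT 1", [tenantId]);
      if (rows.length === 0) {
        await client.query(
          "INSERT INTO resources (tenant_id, kind) VALUES ($1, 'default')",
          [tenantId]);
      }
      await client.query("COMMIT");
    } catch (err) {
      await client.query("ROLLBACK");
      throw err;
    }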

electronvolt 17 hours ago [-]
I mean, you say that, but systems like Spanner exist & I think the fact that it's used for Gmail, Docs, etc. has demonstrated that for a large range of OLTP workloads, serializable everywhere and also performant /is/ possible to architect for -- with the right combination of both database + application design.

That isn't to say it's correct everywhere, but I'd maybe be a little more suspicious of "We want OLTP but we can't use serializable" in a vacuum.

Of course--there are obvious cases like "We can't use serializable because of how our database implements it" or even "because we can't be arsed to figure out how to structure all our data accesses to avoid conflicts and aren't paid enough to solve that problem", but I feel like those are a bit more honest than "Because of performance reasons, in all cases". :)

Arifcodes 10 hours ago [-]
Fair point on Spanner. Though I'd argue Spanner proves the rule rather than breaking it: Google built an entire custom database with TrueTime, atomic clocks, and globally distributed consensus specifically to make serializable performant. That's a massive engineering investment most teams don't have.

For teams running vanilla Postgres (which is most of the teams I've worked with), serializable isolation uses predicate locking that causes significant retry storms under contention. The practical pattern I've seen work well: use serializable for the 10-20% of operations where correctness is critical (financial transactions, inventory counts), and read committed + application-level checks for the rest.

Your last point is the most honest one though. Half the time "we can't use serializable" really means "we haven't modeled our data access patterns well enough to know if we need it." The synchronization barrier approach in the original article is actually a great way to discover those hidden race conditions.

haliliceylan 1 day ago [-]
That's not a PostgreSQL problem, that's your code.

IMHO you should never write code like that. You can either do UPDATE employees SET salary = salary + 500 WHERE employee_id = 101;

Or, if it's more complex, just use a STORED PROCEDURE; there is no point in using a database if you're gonna do all the transactional things in js.

Diggsey 24 hours ago [-]
Stored procedures don't eliminate serialization anomalies unless they are run inside a transaction that is itself SERIALIZABLE.

There's essentially no difference between putting the logic in the app vs a stored procedure (other than round trip time)

lirbank 1 day ago [-]
Fair point - atomic updates like SET salary = salary + 500 sidestep the race condition entirely for simple cases. The examples are intentionally simplified to isolate the concurrency behavior. The barrier pattern is more relevant when you have read-modify-write operations that involve application logic between the read and the write - those can't always collapse into a single UPDATE.
lirbank 1 day ago [-]
Here's a real-world example where atomic updates aren't an option - an order status transition that reads the current status from one table, validates the transition, and inserts into another:

await db().transaction(async (tx) => {
  await hooks?.onTxBegin?.();

  // Lock the order row so concurrent transitions serialize
  const [order] = await tx.select().from(orders)
    .where(eq(orders.id, input.id))
    .for("update");

  // Read the latest status to validate the transition against
  const [status] = await tx.select().from(orderStatuses)
    .where(eq(orderStatuses.orderId, input.id))
    .orderBy(desc(orderStatuses.createdAt))
    .limit(1);

  if (input.status === status.code)
    throw new Error("Status already set");

  await tx.insert(orderStatuses).values({ ... });
});

You need the transaction + SELECT FOR UPDATE because the validation depends on current state, and two concurrent requests could both pass the duplicate check. The hooks parameter is the barrier injection point from the article - that's how you test that the lock actually prevents the race.

erpellan 1 day ago [-]
The standard pattern to avoid select for update (which can cause poor performance under load) is to use optimistic concurrency control.

Add a numeric version column to the table being updated, read and increment it in the application layer and use the value you saw as part of the where clause in the update statement. If you see ‘0 rows updated’ it means you were beaten in a race and should replay the operation.
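
In application code that looks roughly like this (sketch, node-postgres style):

    // Optimistic concurrency: the UPDATE only succeeds if nobody else
    // bumped the version since we read it.
    const { rows } = await client.query(
      "SELECT balance, version FROM accounts WHERE id = $1", [accountId]);
    const updated = await client.query(
      `UPDATE accounts
          SET balance = $1, version = version + 1
        WHERE id = $2 AND version = $3`,
      [rows[0].balance + amount, accountId, rows[0].version]);
    if (updated.rowCount === 0) {
      // Beaten in the race: replay the whole read-compute-update.
    }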

tux3 1 day ago [-]
I don't think such a broad recommendation will be good for most people, it really depends.

Optimistic updates look great when there is no contention, and they will beat locking in a toy benchmark, but if you're not very careful they can cause insane amplification under load.

It's a similar trap to spinlocks. People keep re-discovering this great performance hack that avoids the slow locks in the standard library. And some day the system has a spike that creates contention, and now you have 25 instances, with 24 of them spinning like crazy and slowing the only one that could make progress to a crawl.

It's possible to implement this pattern correctly, and it can be better in some specific situations. But a standard FOR UPDATE lock will beat the average badly implemented retry loop nine times out of ten.

lirbank 1 day ago [-]
Good point. The barrier pattern from the article applies to both approaches - whether you're using pessimistic locks or optimistic version checks, it's good to verify that the concurrency handling actually works. Barriers let you test that your version check correctly rejects the stale update, the same way they test that your lock prevents the race.
codys 1 day ago [-]
Seems you could use a single SQL statement for that particular formulation. Something like this, using CTEs, is possible, but alternatively one can rewrite them as subqueries. (Note: not sure how the select of orders is intended to be used, so the below doesn't use it, but it does obtain it as an expression to be used.)

    WITH
     o AS (
      SELECT * FROM orders
      WHERE orders.id = $1
     ),
     os AS (
      SELECT code FROM orderStatuses
      WHERE orderStatuses.orderId = $1
      ORDER BY orderStatuses.createdAt DESC
      LIMIT 1
     )
     INSERT INTO orderStatuses (...)
     SELECT ...
     WHERE EXISTS (SELECT 1 FROM os WHERE os.code != $2)
     RETURNING ...something including the status differ check...
Does something like this work with postgres's default behavior?
lirbank 1 day ago [-]
Absolutely - if you can express the whole operation as a single atomic statement, that's the best outcome. No locks needed, no race to test for. The article is about what comes next: when the logic can't collapse into one query, how do you verify your concurrency handling actually works?
andrenotgiant 23 hours ago [-]
Is there any good reason to use stored procedures in 2026?
scottlamb 22 hours ago [-]
I'd think so. Stored procedures let you do multi-statement sequences in fewer round trips. In 2026 larger systems are as likely as ever to run PostgreSQL on a different machine (or machines) than the application server. While latency between the two generally goes down over time, it's still not nothing. You may care about the latency of individual operations or the throughput impact of latency while holding a lock (see Amdahl's law).

Of course, the reasons not to use stored procedures still apply. They're logic, but they're versioned with the database schema, not with your application, which can be a pain.

ComputerGuru 6 hours ago [-]
A supporting point and a counterpoint:

* Good database drivers will let you pipeline multiple queries concurrently (esp. in languages with async support), effectively eliminating the _N_x roundtrip cost (you can even execute them in parallel if you use multiple connections, not that I recommend doing that). But obviously this is only doable where the queries are independent of one another; I use this mainly to perform query splitting efficiently if the join key is already known. (See the sketch below.)

* These days databases are often effectively versioned alongside the code anyway, at least for smaller projects that "own" the database, eliminating the biggest issue with stored procedures.
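
For the pipelining point above, a sketch (this is a true pipeline only with a driver that actually pipelines queued queries, e.g. Postgres.js; some drivers serialize per connection):

    // Issue both independent queries before awaiting either, so the
    // driver can put them on the wire back-to-back.
    const userP  = sql`SELECT * FROM users WHERE id = ${userId}`;
    const orderP = sql`SELECT * FROM orders WHERE user_id = ${userId}`;
    const [userRows, orderRows] = await Promise.all([userP, orderP]);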

fake-name 18 hours ago [-]
I'm confused.

> The simplest case: no transaction, just a SELECT and an UPDATE with a barrier between them:

There is no context where you do not have a transaction. Postgres requires them.

It's likely that the library the author is using is doing automatic implicit transactions, but it seems like the author needs to understand their tools a bit better.

lirbank 18 hours ago [-]
You're right, Postgres wraps every statement in an implicit transaction. The point of that first example is that the SELECT and UPDATE are in separate auto-committed transactions - there's no explicit transaction block wrapping both. So another connection can change the value between your SELECT and your UPDATE.
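
Paraphrasing the shape of that first example (not the article's exact code; the hook name is illustrative):

    // Two separate auto-committed statements: another connection can
    // change the balance between the SELECT and the UPDATE.
    const { rows } = await client.query(
      "SELECT balance FROM accounts WHERE id = $1", [accountId]);
    await hooks?.afterRead?.(); // test-only interleaving point (hypothetical hook)
    await client.query(
      "UPDATE accounts SET balance = $1 WHERE id = $2",
      [rows[0].balance + amount, accountId]);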
jijji 23 hours ago [-]
To avoid these conditions I have usually inserted a row into a lock table kept for this purpose: a lock row with a unique key and a few-minute timer. Once the transaction is complete it deletes the lock row. This way simultaneous users only get the first lock and all other requests fail; and if the timer has expired, we assume the transaction never completed and it can be tried again after a few minutes.
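
Roughly, as a sketch (assuming a locks table with a unique key on resource):

    // Take the lock: the unique key makes the INSERT throw for everyone
    // but the first request. Expired locks are cleared first.
    await client.query(
      "DELETE FROM locks WHERE resource = $1 AND expires_at < now()",
      [resource]);
    await client.query(
      "INSERT INTO locks (resource, expires_at) VALUES ($1, now() + interval '5 minutes')",
      [resource]); // unique violation here = someone else holds the lock
    // ... do the work ...
    await client.query("DELETE FROM locks WHERE resource = $1", [resource]);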
anvevoice 1 day ago [-]
[dead]
lirbank 1 day ago [-]
Nice - that's a good case for barriers too. When there's no row to SELECT FOR UPDATE against, you'd inject the barrier after acquiring the advisory lock and verify the second transaction blocks until the first commits.
klysm 24 hours ago [-]
Seems like a good way to materialize the conflict.
deepsun 23 hours ago [-]
I always did "INSERT ... ON CONFLICT DO NOTHING".
nurettin 17 hours ago [-]
Does that queue when the table is locked? Or just skip writes entirely whenever there is a transaction?
egedev 21 hours ago [-]
We hit exactly this kind of race condition in our Go + Postgres SaaS when handling concurrent waitlist signups. Two requests would read the current count, both pass the limit check, and both insert — exceeding the waitlist cap.

Ended up using SELECT FOR UPDATE on the waitlist row before the count check. Simple but effective. The barrier testing approach described here would have caught this much earlier in development instead of discovering it under load.

One thing I'd add: in Go, it's tempting to handle this at the application level with mutexes, but that breaks the moment you have multiple instances. Pushing the serialization down to Postgres is almost always the right call for correctness.
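
The fix, sketched in TypeScript to match the article (the Go version is analogous; names made up):

    // Lock the parent waitlist row first; concurrent signups then
    // serialize, so the count check and insert can't race.
    await client.query("BEGIN");
    try {
      await client.query(
        "SELECT 1 FROM waitlists WHERE id = $1 FOR UPDATE", [waitlistId]);
      const { rows } = await client.query(
        "SELECT count(*) AS n FROM signups WHERE waitlist_id = $1", [waitlistId]);
      if (Number(rows[0].n) >= cap) throw new Error("waitlist full");
      await client.query(
        "INSERT INTO signups (waitlist_id, email) VALUES ($1, $2)",
        [waitlistId, email]);
      await client.query("COMMIT");
    } catch (err) {
      await client.query("ROLLBACK");
      throw err;
    }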

lirbank 20 hours ago [-]
Hey, thanks for sharing this - these bugs are so easy to miss because everything works fine until you get real concurrent traffic. And yeah, the moment you have multiple instances, app-level mutexes can't save you.
hurios 15 hours ago [-]
[dead]
HackerThemAll 1 day ago [-]
Javascript developers learn kindergarten basics of transactions and SQL. LOL. Is it the camp "we don't need a degree to be programmers"?