AI is pretty decent at writing TLA+ specs, but the specs are often bloated, and AI misses the simpler system. If you try to write TLA+ with LLMs and they produce a 500 LoC monstrosity with very abbreviated names, consider that this is not at all how TLA+ is supposed to feel. Leslie Lamport, the creator of TLA+, also recommends using a lot of comments in specs (the language is meant to be machine-checkable but is also notionally terse).
If you want to 'prompt your way' with TLA+, this is a decent one for ChatGPT. But don't expect to do wonders and find a better abstraction than you and if you run it 'agentically' with TLC in the loop, it will bloat the spec in no time:
Write the smallest TLA+ model that captures only the safety-relevant state of [the thing you're modeling].
Do not model implementation data structures unless they affect the property.
Use clear names, not abbreviations.
Start with: constants, variables, TypeOK, Init, and 2–4 invariants.
Use copious amounts of comments at the module level (use multi-line comments) and for each key variable, action, invariant, and temporal property (use one or more single-line comments).
Prefer fewer variables and coarser atomic steps unless finer interleavings are essential.
Keep the spec under 100 LoC (excluding comments) unless there is a specific reason not to.
After writing it, list what was intentionally abstracted away.
"Running leading LLMs across the eleven systems shows that LLMs are great at producing correct TLA+ syntax but struggle to ensure conformance and appropriate invariants."
zihotki 1 days ago [-]
I think that's what author implies by this sentence in the intro:
> It’s still your responsibility to understand your system and define what “correctness” means, and you need a high-level understanding of temporal logic.
nyrikki 1 days ago [-]
As TLA+ uses state machines that can define infinite state spaces, checking arbitrary temporal logic formulas is undecidable in the general case.
Using a TLC model checker to verify invariants or properties, finding a valid counterexample can scale non-elementarily.
In fact, in some situations it can end up being Ackermann-complete[0], but even recursively enumerable problems can be Tower-complete.
I would say that unless you focus on more detailed temporal logic pitfalls you may have issues.
So IMHO this will stay a use case specific solution, that you choose based on context.
Even a common solution like adding Circumscription causes counterintuitive changes [1][2].
IMHO, if you want to use TLA+ as a primary method, you will need some depth or be ready to abandon it by time boxing etc…
Remember that we know the open domain frame problem in [2] is equal to HALT, it will not universally apply.
It is just another tool that works well when it works well.
OTOH if you’re developing a service with a frontend and a backend you are firmly in the distributed system territory even if you only have one user, so it doesn’t have to work all the time; it’s enough if it works approximately more often than not to be an improvement.
baq 1 days ago [-]
This blog post made it more difficult than it should to find the actual tool (needed to click twice! abysmal, really): https://github.com/specula-org/Specula
Ran it on my tricky caching changes and it found bugs the standard agentic review flows missed on all frontier models. Good stuff.
leoqa 1 days ago [-]
Unfortunately the benefit of TLA+ is the act of modeling your system painstakingly. The actual checker helps confirm your hypothesis, etc. But skipping the modeling and outsourcing it is not ideal. I’ve always struggled reasoning about models my team mates wrote, and will often have to mentally go through the process of arriving at the same abstractions/invariants etc before I can understand it.
goostavos 18 hours ago [-]
Agreed. To quote Leslie Lamport, "the hardest part of TLA+ is learning to think abstractly."
There's always a moment, usually annoyingly late in the process, where I realize I've been massively overthinking everything or solving the wrong problem. Time is an essential an ingredient. Clear thinking is extremely hard.
LLMs are definitely useful along the way, but the thinking is the spiritually fulfilling part.
UltraSane 1 days ago [-]
Exactly. I view a complete TLA+ specification as a kind of metalanguage that can be used with LLMs to generate code from.
AYBABTME 22 hours ago [-]
TLA+, P, Lean... formal methods and previously esoteric testing methods (property based, mutation... testing) should become the default. I think it's the only way we can really reap the benefits of agentic coding.
I wrote about this a bit on my blog[1], different angle but along the same line. You explain TLA+ and model checking well which makes the case concrete.
I'm curious of you have thoughts on these other methods and tools like P, Lean, Dafny, etc?
They have a development pace very quick and can verify real world Rust code. There's a huge graveyard of Rust verification tools, but I hope this one gets broader usage
arpinum 1 days ago [-]
I found Quint to be a good pairing with LLM. Easier syntax to learn so you are actually collaborating instead of relying on hope.
oa335 1 days ago [-]
What are the downsides as compared to TLA+?
esafak 1 days ago [-]
What are you doing with it? Any write up?
arpinum 1 days ago [-]
data pipelines and data repair.
aerodexis 1 days ago [-]
I did this myself a few weeks ago and the technique that helped me the most was to compare the TLA output against the race-conditions I could construct by hand. I worked iteratively and unit-tested the model by constructing a model that didn't have certain race protection mechanisms and validating that the model generated the expected race.
You can also work backwards from the races it generates and ensure that they're real races.
esafak 1 days ago [-]
What did you do to get the TLA output?
aerodexis 1 days ago [-]
I was working from a design-doc, not code.
"look at @design-document and generate a TLA+ specification for the interactions between local and remote"
some_random 1 days ago [-]
So, the syntax is so bad that you have to use an LLM to generate the spec, but also you still need to understand the spec the LLM generates which is now harder because you weren't even the one to write it. It sounds like the syntax just sucks and there's no way to get around it.
This was also my experience, one look at the full spec and I tuned out.
I understand that this is not the tool for everything but scxml, state chart XML, gives you something like this but something that you can actually read without investing time learning syntax.
It also has some plugins on vs code that allow you to edit it visually in case you don't want to write a bazillion XML tags.
handoflixue 18 hours ago [-]
> Can the number of beans ever reach zero?
Well, the article disagrees, but the answer is obviously yes: if you have two beans left, regardless of whether they match, the first step is to discard them. Then you have zero beans. After that, you then add 1 additional bean.
So already we can see that the problem is underspecified: are we considering the entire action to be a single atomic "transaction" (in which case it never hits zero), or are these individual steps?
But also, I feel like if you needed TLA+ to answer that question, you're probably not capable of evaluating the TLA+ script? In particular, the TLA+ script doesn't address the "underspecified" aspect at all.
sigbottle 1 days ago [-]
Sorry, I thought this would've been a method for finding invariants, not one just for expressing them? I guess I should think about TLA+ as ultimately some kind of solver - give it a configuration, it tells me if it's defined well or not, the point is to make sure I'm not making mistakes, but not necessarily automated innovation?
vbernat 1 days ago [-]
Isn't this line incorrect?
BB == b > 1 /\ b' = b - 2 /\ w' = w + 1 \* Picked 2 black
It should b > 2, otherwise you'll get in an invalid state.
Jtsummers 1 days ago [-]
The state would be valid. If b > 1 then there are at least two (b > 1 is equivalent to b >= 2 here since b must be a non-negative integer). Remove two (b' = b - 2) and add one white (w' = w + 1). If b == 2, then we still end in a valid state, it'll just be (b = 0 & w' = w + 1).
EDIT: Actually, the original spec doesn't even say what is a valid state after initialization. It provides only a requirement on the initial state (that w + b > 0 and that w and b are both in Nat, or >= 0, when initialized). The correctly formulated rules ensure that w and b remain in Nat, but there's no other piece of the specification that seems to require it.
Once the author starts adding invariants (like NotEmpty) that adds a requirement that w + b > 0 remains true for ever, but since w is incremented and b is decremented, even BB results in NotEmpty remaining satisfied. Each of the steps does, since they all add one bean back.
UltraSane 1 days ago [-]
TLA+ should be generated by hand as a extremely detailed spec for LLMs to use to generate code
relativeadv 1 days ago [-]
right? I read the kleppman post sometime ago about formal verification taking off but i could never square away who verifies the verifier.
nextos 1 days ago [-]
That might emerge as one of the main tasks of future software engineers, writing formal specifications by hand.
It could be the case that Tony Hoare was right, just too early.
twoWhlsGud 1 days ago [-]
Yes! Formal specs could be where the understanding of what exactly the system is supposed to do gets laid out - and ideally coupled into the verification process that the code produced actually does what it is supposed to (and nothing else...). That would be a big change!
UltraSane 1 days ago [-]
It is normal to spend a huge amount of time creating incredibly detailed behavior specs in avionics and other code where people can die from bugs.
formalification 24 hours ago [-]
[dead]
uptodatenews 24 hours ago [-]
[dead]
Rendered at 23:41:11 GMT+0000 (Coordinated Universal Time) with Vercel.
If you want to 'prompt your way' with TLA+, this is a decent one for ChatGPT. But don't expect to do wonders and find a better abstraction than you and if you run it 'agentically' with TLC in the loop, it will bloat the spec in no time:
Write the smallest TLA+ model that captures only the safety-relevant state of [the thing you're modeling]. Do not model implementation data structures unless they affect the property. Use clear names, not abbreviations. Start with: constants, variables, TypeOK, Init, and 2–4 invariants. Use copious amounts of comments at the module level (use multi-line comments) and for each key variable, action, invariant, and temporal property (use one or more single-line comments). Prefer fewer variables and coarser atomic steps unless finer interleavings are essential. Keep the spec under 100 LoC (excluding comments) unless there is a specific reason not to. After writing it, list what was intentionally abstracted away.
"Running leading LLMs across the eleven systems shows that LLMs are great at producing correct TLA+ syntax but struggle to ensure conformance and appropriate invariants."
> It’s still your responsibility to understand your system and define what “correctness” means, and you need a high-level understanding of temporal logic.
Using a TLC model checker to verify invariants or properties, finding a valid counterexample can scale non-elementarily.
In fact, in some situations it can end up being Ackermann-complete[0], but even recursively enumerable problems can be Tower-complete.
I would say that unless you focus on more detailed temporal logic pitfalls you may have issues.
So IMHO this will stay a use case specific solution, that you choose based on context.
Even a common solution like adding Circumscription causes counterintuitive changes [1][2].
IMHO, if you want to use TLA+ as a primary method, you will need some depth or be ready to abandon it by time boxing etc…
Remember that we know the open domain frame problem in [2] is equal to HALT, it will not universally apply.
It is just another tool that works well when it works well.
[0] https://arxiv.org/abs/2104.13866
[1] https://arxiv.org/abs/2407.20822
[2] https://en.wikipedia.org/wiki/Circumscription_(logic)
Ran it on my tricky caching changes and it found bugs the standard agentic review flows missed on all frontier models. Good stuff.
There's always a moment, usually annoyingly late in the process, where I realize I've been massively overthinking everything or solving the wrong problem. Time is an essential an ingredient. Clear thinking is extremely hard.
LLMs are definitely useful along the way, but the thinking is the spiritually fulfilling part.
I wrote about this a bit on my blog[1], different angle but along the same line. You explain TLA+ and model checking well which makes the case concrete.
I'm curious of you have thoughts on these other methods and tools like P, Lean, Dafny, etc?
[1]: https://aybabt.me/blog/correctness-agentic-world
They have a development pace very quick and can verify real world Rust code. There's a huge graveyard of Rust verification tools, but I hope this one gets broader usage
You can also work backwards from the races it generates and ensure that they're real races.
"look at @design-document and generate a TLA+ specification for the interactions between local and remote"
I understand that this is not the tool for everything but scxml, state chart XML, gives you something like this but something that you can actually read without investing time learning syntax.
It also has some plugins on vs code that allow you to edit it visually in case you don't want to write a bazillion XML tags.
Well, the article disagrees, but the answer is obviously yes: if you have two beans left, regardless of whether they match, the first step is to discard them. Then you have zero beans. After that, you then add 1 additional bean.
So already we can see that the problem is underspecified: are we considering the entire action to be a single atomic "transaction" (in which case it never hits zero), or are these individual steps?
But also, I feel like if you needed TLA+ to answer that question, you're probably not capable of evaluating the TLA+ script? In particular, the TLA+ script doesn't address the "underspecified" aspect at all.
EDIT: Actually, the original spec doesn't even say what is a valid state after initialization. It provides only a requirement on the initial state (that w + b > 0 and that w and b are both in Nat, or >= 0, when initialized). The correctly formulated rules ensure that w and b remain in Nat, but there's no other piece of the specification that seems to require it.
Once the author starts adding invariants (like NotEmpty) that adds a requirement that w + b > 0 remains true for ever, but since w is incremented and b is decremented, even BB results in NotEmpty remaining satisfied. Each of the steps does, since they all add one bean back.
It could be the case that Tony Hoare was right, just too early.