Emacs internals: Tagged pointers vs. C++ std::variant and LLVM (Part 3) (thecloudlet.github.io)
internet_points 13 hours ago [-]
Drawn in by the Emacs, learnt something new about C and C++, thank you for this! Very readable article for someone who doesn't feel too confident with low-level bits.

Btw, is this representation the reason why OCaml's ints are not as big as C ints?

Also interesting that the Haskell pointer tagging you link to[0] was done the way it was to avoid CPU branch misprediction, and that the old way which it replaced was "the source of half of the branch misprediction events". I wonder how "branch prediction friendly" current Haskell is.

[0] https://simonmar.github.io/bib/papers/ptr-tagging.pdf
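
On the OCaml question: on 64-bit targets OCaml ints are 63 bits wide because one bit of each word marks the value as an immediate integer rather than a pointer. A minimal C++ sketch of that kind of encoding follows. All names here are hypothetical and the layout is an illustration of the general technique, not OCaml's or Emacs's actual source; it assumes a 64-bit word and an arithmetic right shift (guaranteed since C++20).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: low bit 1 = immediate integer, low bit 0 = pointer.
// Spending one bit on the tag leaves 63 bits for the integer payload.
constexpr std::int64_t kFixnumMax = (std::int64_t{1} << 62) - 1;
constexpr std::int64_t kFixnumMin = -(std::int64_t{1} << 62);

inline bool fits_fixnum(std::int64_t n) {
    return n >= kFixnumMin && n <= kFixnumMax;
}

// Shift the payload up by one and set the tag bit.
inline std::uint64_t box_int(std::int64_t n) {
    assert(fits_fixnum(n));
    return (static_cast<std::uint64_t>(n) << 1) | 1u;
}

inline bool is_int(std::uint64_t word) { return (word & 1u) != 0; }

// Drop the tag bit again; the arithmetic shift restores the sign.
inline std::int64_t unbox_int(std::uint64_t word) {
    assert(is_int(word));
    return static_cast<std::int64_t>(word) >> 1;
}
```

With this layout the largest representable integer is 2^62 - 1 rather than 2^63 - 1, which is the "missing" bit.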

ndesaulniers 1 days ago [-]
Happy to see discussion of LLVM's interesting implementation of Static Polymorphism using CRTP. Some recommended reads:

1. https://en.wikipedia.org/wiki/Curiously_recurring_template_p...

2. https://david.alvarezrosa.com/posts/devirtualization-and-sta...

3. https://llvm.org/docs/ProgrammersManual.html#the-isa-cast-an...
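
For readers who have not met CRTP before, a minimal sketch of static polymorphism may help. The class and function names are made up for illustration and are not taken from LLVM; the point is only that the base class calls into the derived class at compile time, with no vtable.

```cpp
#include <cassert>

// The base template knows its derived type, so the call below is resolved
// statically and can be inlined; there is no virtual dispatch.
template <class Derived>
struct ShapeBase {
    double area() const {
        return static_cast<const Derived&>(*this).area_impl();
    }
};

struct Square : ShapeBase<Square> {
    double side;
    explicit Square(double s) : side(s) { assert(s >= 0.0); }
    double area_impl() const { return side * side; }
};

struct Rect : ShapeBase<Rect> {
    double w, h;
    Rect(double w_, double h_) : w(w_), h(h_) { assert(w_ >= 0.0 && h_ >= 0.0); }
    double area_impl() const { return w * h; }
};

// Generic code is written against ShapeBase<D>; each instantiation binds
// directly to the concrete area_impl().
template <class D>
double doubled_area(const ShapeBase<D>& shape) {
    return 2.0 * shape.area();
}
```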

thecloudlet 19 hours ago [-]
Thanks for the links, Nick! It's fascinating how LLVM relies so heavily on CRTP.
ndesaulniers 50 minutes ago [-]
Consider amending those references to your post!
tialaramex 1 days ago [-]
It's not clear to me (and as an unsafe language it's not called out by your compiler if you do something illegal) what the correct way to spell this kind of trick is in C++

I had thought you need the pointer-sized integer types and mustn't do this directly to an actual pointer, but maybe I was wrong (in theory, obviously practice doesn't follow but that's a dangerous game)

thecloudlet 1 days ago [-]
Doing bitwise operations directly on raw pointers is a fast track to Undefined Behavior in standard C/C++. Emacs gets away with it largely due to its age, its heavy reliance on specific GCC behaviors/extensions, and how its build system configures compiler optimizations.

In modern C++, the technically "correct" and safe way to spell this trick is exactly as you suggested: using uintptr_t (or intptr_t).
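
A minimal sketch of what that looks like, assuming 8-byte-aligned objects so the low three bits of the address are free. The helper names are hypothetical, and the integer-to-pointer round trip is implementation-defined in the standard even though mainstream compilers handle it as expected.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: pointees are at least 8-byte aligned, so bits 0..2 are zero.
constexpr std::uintptr_t kTagMask = 0x7;

// Convert to an integer first, then do the bit twiddling on the integer.
inline std::uintptr_t tag_pointer(void* p, std::uintptr_t tag) {
    auto bits = reinterpret_cast<std::uintptr_t>(p);
    assert((bits & kTagMask) == 0);  // pointer must be suitably aligned
    assert(tag <= kTagMask);         // tag must fit in the free bits
    return bits | tag;
}

inline std::uintptr_t tag_of(std::uintptr_t word) {
    return word & kTagMask;
}

// Mask the tag off before converting back; never dereference the tagged value.
template <class T>
T* pointer_of(std::uintptr_t word) {
    return reinterpret_cast<T*>(word & ~kTagMask);
}
```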

trws 1 days ago [-]
There’s a paper in flight to add a stdlib type to handle pointer tagging as well while preserving pointer provenance and so forth. It’s currently best to use the intptr types, but the goal is to make it so that an implementation can provide specializations based on what bits of a pointer are insignificant, or even ignored, on a given target without user code having to be specialized. Not sure where it has landed since discussion in SG1 but it seemed like a good idea.
tialaramex 1 days ago [-]
Given you aren't sure since SG1 this might be useless but... do you have a paper number? Or, more likely, know an author's name ?
trws 1 days ago [-]
It’s Hana Dusikova’s paper IIRC.
legobmw99 1 days ago [-]
Seems like it's P3125R0
tialaramex 1 days ago [-]
Thanks! https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p31...

(is the current version of that paper, the tracking ticket insisted there's a P3125R5 and that LEWG had seen it in 2025, but it isn't listed in a mailing so it might be a mirage)

You know it's a Hana paper because it wants this to be allowed at compile time (C++ constexpr) but joking aside this seems like a nice approach for C++ which stays agnostic about future implementation details.

VorpalWay 1 days ago [-]
Do (u)intptr_t preserve provenance? Or does this count as exposed provenance when you convert back and forth?

Maybe that is not the correct C++ terminology, I'm more familiar with how provenance works in Rust, where large parts of it got stabilised a little over a year ago. (What was stabilised was "strict provenance", which is a set of rules that if you abide by them will definitely be correct, but it is possible the rules might be loosened in the future to be more lenient.)

https://doc.rust-lang.org/std/ptr/index.html#provenance

jcranmer 1 days ago [-]
Pointer provenance is not properly defined in C or C++. (There is a C TS that introduces pointer provenance, but it's not part of the main standard).

The problem of pointer provenance is more finding a workable theoretical model rather than one causing miscompiles on realistic code. While there are definitely miscompiles on carefully constructed examples, I'm not aware of any bugs on actual code. This is in comparison to topics like restrict(/noalias) semantics or lifetime semantics, where there is a steady drip of bug reports that turn out to be actual optimization failures.

tialaramex 1 days ago [-]
Well, C++ does not have any promises about how Pointer Provenance works, so AFAIK the answer is "mu" meaning that's a bad question, don't ask that.

But the likely destiny of C++ is to inherit the provenance rules that are an adjunct to C23, PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates

As that name suggests, in this model provenance is not transmitted via integers. Every 123456 is always just the integer 123456 and there aren't magic 123456 values which are different and transmit some form of provenance from a pointer to some value which happened perhaps to be stored at address 123456 in memory.

However, PNVI-ae-udi has Exposure, which means if we exposed the pointer in an approved way then the associated provenance is somehow magically "out there" in the ether. As a result, if we have exposed this pointer then just having that integer 123456 works fine, because we combine that integer 123456 with that provenance from the ether and make a working pointer. User disambiguation means that the compiler has to give you "benefit of the doubt", e.g. if you could mean to make a pointer to that Doodad which no longer exists as of a minute ago or to this other Doodad which does exist, well, benefit of the doubt means it was the latter and so your pointer is valid even though the addresses of both Doodads were the same.
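
A small C++ sketch of the exposure round trip described above. The function name is hypothetical, and the comments describe how PNVI-ae-udi would read the code; C++ itself makes no such promise today, so treat this as an illustration of the model rather than a statement about the standard.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if a write through the reconstructed pointer lands in x.
inline bool round_trip_through_integer() {
    int x = 1;

    // Pointer-to-integer cast: in the PNVI-ae-udi reading, this is the
    // step that "exposes" the storage of x.
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(&x);

    // From here on addr is just a number; it carries no provenance itself.

    // Integer-to-pointer cast: the model attaches the provenance of an
    // exposed object that lives at this address.
    int* q = reinterpret_cast<int*>(addr);
    assert(q == &x);

    *q = 2;
    return x == 2;
}
```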

jcranmer 1 days ago [-]
> But the likely destiny of C++ is to inherit the provenance rules that are an adjunct to C23, PNVI-ae-udi, Provenance Not Via Integers, Addresses Exposed, User Disambiguates

There's a competing proposal in C++ land to add provenance via angelic nondeterminism: if there's some provenance that makes the code non-UB, then use that provenance. (As you might imagine, I'm not a big fan of that proposal, but WG21 seems to love it a lot more than I do.)

VorpalWay 23 hours ago [-]
Very interesting discussion. I hadn't realised that the final provenance model hadn't yet been decided for C and C++.

Angelic non-determinism seems difficult to use to determine if an optimisation is valid. If I understand this correctly, it is basically the as-if rule, but in this case applied to something that potentially needs global program analysis. Would that be an accurate understanding?

It sounds like both of these proposals will be strictly less able to optimize than strict provenance in rust to me. In particular, Rust allows applying a closure/lambda to map a pointer while keeping the provenance. That avoids exposing the provenance as you add and remove tag bits, which should at least in theory allow LLVM to optimise better. (But this keeps the value as a pointer, and having a dangling pointer that you don't access is fine in Rust, probably not in C?)

I'm not sure why I'm surprised actually, Rust can be a more sensible language in places thanks to hindsight. We see this in being able to use LLVM noalias (restrict basically) in more places thanks to the different aliasing model, while still not having the error prone TBAA of C and C++. And it doesn't need a model of memory consisting of typed objects (rather it is all just bytes and gets materialised into a value of a type on access).

uecker 15 hours ago [-]
No, angelic non-determinism is not related to the as-if rule. It essentially says that if there is a choice to assign provenance on backconversion from integers, the one which makes the program valid is assigned. This is basically the same as the explicit UDI rule in TS 6010, except that this rule is very clear. The problem with angelic non-determinism is two-fold: a) most people will not be able to reason about it at all, and b) not even formal semantics experts know what it means in complicated cases. Demonic non-determinism essentially means that all possible executions must be valid, while angelic non-determinism means that there must exist at least one. Formally, this translates to universal and existential quantifiers. But for quantifiers, you must know where and in which order to place them in a formula, which wasn't clear at all from the wording I have seen (a while ago). The interaction with concurrency is also a can of worms.

I don't think there is a fundamental advantage to Rust regarding provenance. Yes, we lack a way to do pointer tagging without exposing the provenance in C, but we could easily add this. But this is all moot as long as compilers are still not conforming to the provenance model with respect to integer and pointer casts anyway and this breaks Rust too! Rust having decided something just means they live in a fairy tale world, while C/C++ not having decided means they acknowledge the reality that compilers haven't fixed their optimizers. (Even ignoring that "deciding" means entirely different things here anyway with C/C++ having ISO standards.)

VorpalWay 11 hours ago [-]
> But this is all moot as long as compilers are still not conforming to the provenance model with respect to integer and pointer casts anyway and this breaks Rust too! Rust having decided something just means they live in a fairy tale world, while C/C++ not having decided means they acknowledge the reality that compilers haven't fixed their optimizers.

I think this is a bit of a mischaracterization. While there can of course be bugs in LLVM (and rustc and clang), what sort of LLVM IR you generate matters. To be able to generate IR that conforms to the provenance model of the language you first need to have such a model.

As far as I know (and this matches what I found when searching the Rust issue tracker) there is currently one major known LLVM bug in this area (https://github.com/rust-lang/rust/issues/147538) with partial workarounds applied on the Rust side. There are also some issues with open questions still, such as how certain unstable features should interact with provenance.

I think calling the current situation "fairy tale world" is a gross exaggeration. Is it perfectly free of bugs? No, but if that is the criterion, then the entirety of any compiler is a fairy tale (possibly with the exception of some formally verified compiler).

uecker 7 hours ago [-]
I am not sure this is a mischaracterization. The C provenance model also exists, even in the form of an ISO TS. The Rust model copied the basic concepts and even the terminology from us. The reason the C model is not in ISO C23 but in a separate TS is that compilers are not able to implement it correctly at this time due to bugs. But neither do they implement the Rust model correctly, because of the same bugs.

One should also point out that basic provenance is already part of the ISO C standard for a long time (but not under this name). That a precise technical specification is needed is only because the exact details were not clear and there are inconsistencies and differences between and even inside compilers. Rust having a precise model does not make these problems automatically go away just as the ISO TS does not.

shadowgovt 1 days ago [-]
Is there a similar solution to doing this in Rust? I suppose inside `unsafe` you can do basically anything.
tialaramex 1 days ago [-]
Unlike C++ all of Rust's primitive types get the same first class treatment as your user defined types and so the appropriate API is provided as methods on pointer types. For this you want ptr::map_addr which takes a callable (such as your own function for this mapping or a lambda) to fiddle with the pointer.

https://doc.rust-lang.org/std/primitive.pointer.html#method....

Rust's MIRI is able to run code which uses this (a strict provenance API) because although MIRI's pointers are some mysterious internal type, it can track that we mapped them to hide our tags, and then later mapped back from the tagged pointer to recover our "real" pointer and see that's fine.

This isn't an unsafe operation. Dereferencing a pointer is unsafe, but twiddling the bits is fine, it just means whoever writes the unsafe dereferencing part of your codebase needs to be very careful about these pointers e.g. making sure the ones you've smuggled a tag in aren't dereferenced 'cos that's Undefined Behaviour.

It's clear to me how this works in Rust, it's just unclear still in C++

simonask 1 days ago [-]
Rust is basically in the same place as C++, i.e. provenance rules are currently ad-hoc/conventional, meaning that pointer tagging is a grey area.
tialaramex 1 days ago [-]
Nope. Rust stabilized strict provenance over a year ago. Some details about aliasing aren't tied down, but so long as you can obey the strict provenance rules you're golden today in Rust to hide flags in pointers etc.

https://blog.rust-lang.org/2025/01/09/Rust-1.84.0/#strict-pr...

simonask 22 hours ago [-]
Oh damn, I missed this, thanks for the correction!
uecker 1 days ago [-]
But LLVM's optimizations aren't sound and this affects Rust too.
simonask 22 hours ago [-]
Huh? Which optimizations?
tialaramex 21 hours ago [-]
LLVM is quite sure that, for example, two pointers to different objects are different. That's true even if in fact the objects both lived in the exact same spot on the stack (but at different times). That's... well it's not what Rust wants but it's not necessarily an unacceptable outcome and Rust could just ask for their addresses and compare those...

Except it turns out if we ask for their addresses, which are the same integer, LLVM remembers it believed the pointers were different and insists those are different too.

Until you call its bluff and do arithmetic on them. Then, in some cases, it snaps out of it and remembers that they're identical...

This is a compiler bug, but, apparently it's such a tricky bug to fix that I stopped even looking to see whether they'd fixed it after a few years... It affects C, C++, Rust, all of them as a result can be miscompiled by a compiler using LLVM [it's easiest to demonstrate this bug with Rust but it's the same in every language]. But as you've probably noticed, this doesn't have such an enormous impact that anybody stopped using LLVM.

trws 1 days ago [-]
Everything else in the siblings is true, but remember that the language and std types in rust all do this already. Most of the time it’s better to use a native enum or optional/result because they do this in the compiler/lib. It’s only really worth it if you need more than a few types or need precise control of the representation for C interop or something.
VorpalWay 1 days ago [-]
To expand on the sibling answer: sort of! Rust will do niche optimisation, but for references and NonNull pointers this is limited to "the value 0 is invalid and can thus be used as a niche". But Rust does not (currently) take advantage of alignment niches in pointers. Nor does it use the high bits on architectures where you know your whole theoretical address space isn't actually in use.

Is doing that manually worth it? Usually not, but for some core types (classical example is strings) or in language runtimes it can be.

Would it be awesome if this could be done automatically? Absolutely, but I understand it is a large change, and the plan is to later build upon the pattern types that are currently work in progress (and would allow you to specify custom ranged integer types).

tialaramex 1 days ago [-]
I mean, kinda, sorta? Rust's guaranteed niche optimisation means Option<&T> [which might be Some(&T) or just None] is promised to be the same size in memory as &T the reference to a T

So that's one tiny use of this sort of idea which is guaranteed unnecessary in Rust, and indeed although it isn't guaranteed the optimiser will typically spot less obvious opportunities so that Option<Option<bool>> which might be None, or Some(None) or Some(Some(true)) or Some(Some(false)) is the same size (one byte) as bool.

But hiding stuff in a pointer is applicable in places your Rust compiler won't try to take advantage unless you do something like this. A novel tiny String-like type I saw recently does this, https://crates.io/crates/cold-string ColdString is 8 bytes, if your text is 8 or fewer bytes of UTF-8 then you're done, that'll fit, but, if you have more text ColdString allocates on the heap to store not only your text but also its length and so it needs to actually "be" in some sense a raw pointer to that structure, but if the string is shorter that pointer is nonsense, we've hidden our text in the pointer itself.

Implementation requires knowing how pointers work, and how UTF-8 encoding works. I actually really like one of the other Rust tiny strings, CompactString but if you have a lot of very small strings (e.g. UK postcodes fit great) then ColdString might be as much as three times smaller than your existing Rust or C++ approach and it's really hard to beat that for such use cases.

Edited: To remove suggestion ColdString has a distinct storage capacity, this isn't intended as a conventional string buffer, it can't grow after creation

thecloudlet 1 days ago [-]
Waiting for Rust experts.
jandrewrogers 1 days ago [-]
The idiomatic way to safely do pointer tagging in C++ works through uintptr_t.

If you don't care about portability or using every theoretically available bit then it is trivial. A maximalist implementation must be architecture aware and isn't entirely knowable at compile-time. This makes standardization more complicated since the lowest common denominator is unnecessarily limited.

In C++ this really should be implemented through a tagged pointer wrapper class that abstracts the architectural assumptions and limitations.
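
A minimal sketch of such a wrapper, under the simplest possible assumption: only the low bits freed by `alignof(T)` are used, and high address bits are left alone. `TaggedPtr` and its members are hypothetical names for illustration. LLVM ships a class with a similar role, `llvm::PointerIntPair`, but this sketch is not its implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

template <class T, unsigned TagBits>
class TaggedPtr {
    // Only claim as many low bits as T's alignment guarantees are zero.
    static_assert((std::size_t{1} << TagBits) <= alignof(T),
                  "alignment of T does not leave TagBits free low bits");
    static constexpr std::uintptr_t kMask =
        (std::uintptr_t{1} << TagBits) - 1;

    std::uintptr_t bits_ = 0;  // stored as an integer, never as a T*

public:
    TaggedPtr() = default;

    TaggedPtr(T* p, unsigned tag)
        : bits_(reinterpret_cast<std::uintptr_t>(p) | tag) {
        assert((reinterpret_cast<std::uintptr_t>(p) & kMask) == 0);
        assert(tag <= kMask);
    }

    // Strip the tag before handing out a real pointer.
    T* get() const { return reinterpret_cast<T*>(bits_ & ~kMask); }
    unsigned tag() const { return static_cast<unsigned>(bits_ & kMask); }

    void set_tag(unsigned tag) {
        assert(tag <= kMask);
        bits_ = (bits_ & ~kMask) | tag;
    }

    T& operator*() const { return *get(); }
    T* operator->() const { return get(); }
};
```

Keeping the architectural assumption in one `static_assert` means a port to a target with different alignment fails at compile time instead of corrupting pointers at run time.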

db48x 1 days ago [-]
Do it the way LLVM does it.
thecloudlet 1 days ago [-]
Emacs internals part 2 HN link:

https://news.ycombinator.com/item?id=47259961

mshockwave 1 days ago [-]
LLVM now has another way to implement RTTI using the `CastInfo` trait instead of `classof`: https://llvm.org/doxygen/structllvm_1_1CastInfo.html

But it's really just an implementation difference, the idea is still to have a lightweight RTTI.
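
A hand-rolled sketch of the lightweight RTTI idea, for readers who have not seen it. Only the `classof` naming convention is borrowed from the LLVM Programmer's Manual; `Expr`, `Literal`, `Add`, `is_a` and `dyn_cast_sketch` are hypothetical stand-ins and not LLVM's `isa<>`/`dyn_cast<>` or `CastInfo` machinery.

```cpp
#include <cassert>

// The base class stores a small kind tag instead of relying on a vtable
// and the compiler's built-in RTTI.
class Expr {
public:
    enum class Kind { Literal, Add };
    Kind kind() const { return kind_; }

protected:
    explicit Expr(Kind k) : kind_(k) {}

private:
    Kind kind_;
};

class Literal : public Expr {
public:
    Literal() : Expr(Kind::Literal) {}
    // Each subclass answers "is this base pointer really one of me?"
    static bool classof(const Expr* e) { return e->kind() == Kind::Literal; }
};

class Add : public Expr {
public:
    Add() : Expr(Kind::Add) {}
    static bool classof(const Expr* e) { return e->kind() == Kind::Add; }
};

// Type test: a single integer comparison via classof.
template <class To, class From>
bool is_a(const From* value) {
    assert(value != nullptr);
    return To::classof(value);
}

// Checked downcast: returns nullptr when the kind tag does not match.
template <class To, class From>
const To* dyn_cast_sketch(const From* value) {
    return is_a<To>(value) ? static_cast<const To*>(value) : nullptr;
}
```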
