I would love for the standard to be to ALWAYS report the required amount of memory to load and run a model in bytes of RAM alongside any other metrics. I'd love to see time to first token, token throughput, token latency as well but I'd settle for memory size as described above.
Essentially, many people want to know what the minimum amount of memory is to run a particular model.
Parameter count obscures important details: what are the sizes of the parameters? A parameter isn't rigorously defined. This also gets folks into trouble, because a 4B param model with FP16 params is very different from a 4B param model with INT4 params. The former should obviously be a LOT better than the latter.
This would also help with MOE models: if memory is my constraint, it doesn't matter if the (much larger RAM required) MOE version is faster or has better evals.
I'm waiting for someone in anger to ship the 1 parameter model where the parameter according to pytorch is a single parameter of size 4GB.
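To make the FP16 versus INT4 point concrete, here is a minimal sketch of the weight-only arithmetic. The bytes-per-parameter table is an assumption that ignores quantization metadata (scales, zero points), and the result excludes KV cache, activations, and runtime overhead, so it is a lower bound on memory rather than the full requirement.

```python
# Assumed storage cost per parameter; INT4 packs two values into one byte.
# Real quantized formats add a little metadata on top of this.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_bytes(num_params: int, dtype: str) -> int:
    """Memory for the weights alone (no KV cache, activations, or overhead)."""
    return int(num_params * BYTES_PER_PARAM[dtype])

for dtype in ("fp16", "int4"):
    gib = weight_bytes(4_000_000_000, dtype) / 2**30
    print(f"4B params @ {dtype}: {gib:.2f} GiB")
```

The same parameter count differs by 4x in weight memory between the two formats, which is why a byte figure is more useful than a parameter count.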
adrian_b 20 hours ago [-]
As a proxy for the total size of the parameters, you can just look at the download size of a model on Huggingface.co.
Because for most models the weights are provided in many *.safetensors files of approximately the same size, you can estimate the total size without adding up all the file sizes by multiplying the number of *.safetensors files by the approximate size of one file.
For quantized models, estimating the size is simpler, because there is just one GGUF file, which also includes metadata, but most of the file is occupied by the parameters.
While there are models where the native size of all parameters is BF16, there are also models that use multiple parameter sizes, e.g. a large number of parameters with a small size, even down to 4 bits, together with a small number of parameters with a bigger size, up to FP32. Therefore, as you say, the number of parameters is much less informative about memory requirements than the file sizes.
While the download size of the *.safetensors files or GGUF files is not the same as the total memory requirement, it can give an approximate estimate and it can be used to assess which of 2 models will need more memory. It becomes more complicated when you must use multiple kinds of memory, e.g. GPU memory and CPU memory, or even SSDs, when you must know more about the structure of the model to determine how much of each kind of memory is needed.
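A rough sketch of both estimates, assuming the model files have already been downloaded to a local directory. The function names are made up for illustration, and the result is only the on-disk weight size, not the total memory requirement.

```python
from pathlib import Path

def weights_size_bytes(model_dir: str, patterns=("*.safetensors", "*.gguf")) -> int:
    """Sum the on-disk size of all weight files directly inside model_dir."""
    root = Path(model_dir)
    return sum(p.stat().st_size for pat in patterns for p in root.glob(pat))

def shard_estimate_bytes(num_shards: int, approx_shard_bytes: int) -> int:
    """The shortcut described above: shard count times the size of one typical shard."""
    return num_shards * approx_shard_bytes
```

The shortcut overestimates slightly when the last shard is smaller than the others, which is usually acceptable for comparing two models.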
magicalhippo 17 hours ago [-]
The KV cache size is a wild card though. Different models use very different amounts of memory per token in the KV cache. The VRAM requirements for, say, 64k context can vary by almost an order of magnitude. So the download size might indicate that you have room for the model, but how much context you can fit in the leftover VRAM budget is harder to predict at a glance.
That some models like Qwen3.6 27B seem not to be much affected by a Q8 quantized KV cache while others degrade heavily doesn't make it easier.
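To show where the spread comes from, here is a sketch of the usual KV cache formula for plain attention with a full cache. The two configurations are hypothetical, not taken from any specific model, and architectures that use sliding windows, latent compression, or linear attention layers will not follow this formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys and values (the factor 2) for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Two hypothetical models at 64k context with an FP16 cache.
many_kv_heads = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=65536)
grouped_kv_heads = kv_cache_bytes(n_layers=32, n_kv_heads=4, head_dim=128, context_len=65536)

print(many_kv_heads / 2**30, "GiB vs", grouped_kv_heads / 2**30, "GiB")
```

With the same layer count and head size, the number of KV heads alone changes the cache by 8x here, before cache quantization is even considered.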
usernametaken29 2 days ago [-]
> δ-mem compresses past information into a fixed-size state matrix updated by delta-rule learning
This doesn’t solve the capacity problem of memory. You can cram more into one context window, but then again you need to associate them with input queries. That’s very hard because slight variations in input create hugely different activations. So really, it doesn’t improve caching.
This paper might do a thing or two approximating the compression limit for context windows, but there’s a fundamental limit on how much information can go into it.
What you really need is contextual search, as in, different events and objects with the same abstractions and semantics lead to the same response, so you can cache effectively… On this front the paper does little to improve “memory” in a meaningful way.
jsemrau 2 days ago [-]
I am currently working on deep context query which uses dynamically generated regex to pull only the relevant context blocks. By using lightweight RegEx pattern matching to detect semantic intent and filter structured context sections accordingly, you avoid the attention degradation that comes from stuffing semantically redundant information into the window
The more real world use cases we see, the more we see the use of a well thought out regex as a bridge from probabilistic to deterministic.
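A minimal sketch of the idea, not the actual implementation described above: the intent names and patterns are hypothetical and hard-coded here, whereas the comment describes generating them dynamically.

```python
import re

# Hypothetical intent patterns; a real system would generate these per query.
INTENT_PATTERNS = {
    "billing": re.compile(r"\b(invoice|refund|payment)s?\b", re.IGNORECASE),
    "deploy": re.compile(r"\b(deploy|rollback|release)\w*\b", re.IGNORECASE),
}

def select_blocks(query: str, blocks: dict[str, str]) -> dict[str, str]:
    """Keep only the context blocks whose intent pattern matches the query."""
    wanted = {name for name, pat in INTENT_PATTERNS.items() if pat.search(query)}
    return {name: text for name, text in blocks.items() if name in wanted}

blocks = {"billing": "Refund policy: ...", "deploy": "Release checklist: ..."}
print(select_blocks("Why was my refund delayed?", blocks))
```

Only the matching section reaches the prompt, and the selection is deterministic for a given query and pattern set.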
pbronez 1 days ago [-]
Interesting approach.
> Prioritize recall over precision.
Have you tried stemming your regex? That would help you catch messages where a different form of your word appeared. For example instead of “story” you look for “stor” which catches “stories” as well.
Then you might think, could we do an even better job by figuring out the general semantic intent of the query and history? Let’s project them into a semantic vector space! That’s an embedding.
Then you want to query that, which means you need a vector database. So now we can take the query, embed it, query the vector DB with that embedding and retrieve the N closest history documents. You can use that to augment the generation of the response to your prompt.
This is RAG.
Anyway, interesting to see different degrees of sophistication here. Certainly a handful of naive regex are very snappy.
There’s probably a hybrid approach where you use sophisticated NLP and embedding techniques to robustly define topics, then train a regex to approximate that well.
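Both ends of that spectrum fit in a few lines. This is a toy sketch: the stem pattern is the "stor" example from above, and the embeddings are hand-written 2-D vectors standing in for what a real embedding model and vector database would provide.

```python
import math
import re

# Stem match: "stor" plus any trailing word characters catches "story" and "stories"
# (and also "store" or "storm", which is the recall-over-precision trade-off).
STEM = re.compile(r"\bstor\w*", re.IGNORECASE)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_n(query_vec, docs, n=2):
    """docs maps a document id to its embedding; returns the n ids closest to the query."""
    return sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:n]

print(STEM.findall("Two stories and a story"))
print(top_n([1.0, 0.0], {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}))
```

The regex needs no index and no model call, while the similarity search trades that speed for matching on meaning rather than spelling.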
jsemrau 1 days ago [-]
That assumes one layer of memory. In my experience you need to have at least 4 layers of memory to work well. All of them have different requirements for retrieval.
Everything that is in short-term memory (state of the app, current conversation, current workspace artefact) requires fast latency and precision. For example if you want to edit a segment in a financial analysis, a blog post, or a program you only want to edit this segment. RAG on a VectorDB is overkill in my opinion.
ogogmad 1 days ago [-]
This is one of the most interesting comments I've read on this website.
jsemrau 21 hours ago [-]
Thank you.
vdelpuerto 1 days ago [-]
I wrote something about this, trying to look at context or memory data in models the other way around. The gravitational pull of information is still very hard to manage. I've been using "functional scars" for about 30 days now and getting good results with repeated mistakes across sessions. https://github.com/VDP89/fscars
in-silico 1 days ago [-]
While there is a limit to the amount of information you can fit in a fixed-size state, the theoretical ceiling is pretty high.
A Hebbian associative matrix (one of the simplest and weakest memory constructions) can store about 0.7 bits of information per parameter. If you have a state with 300M parameters (the size of a Llama 3 8B KV cache at 10K context length), and a context with 2.1 bits of entropy per token (a reasonable estimate), then the state can encode 100M tokens worth of information.
Real models obviously aren't powerful enough to operate at the limit, but you can see why this is a promising research direction.
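The arithmetic behind that figure, as a small sketch that uses only the numbers quoted above (0.7 bits per parameter, 2.1 bits per token, 300M state parameters):

```python
def capacity_tokens(state_params, bits_per_param=0.7, bits_per_token=2.1):
    """Upper bound on tokens a fixed-size state could encode under these assumptions."""
    return state_params * bits_per_param / bits_per_token

print(f"{capacity_tokens(300_000_000):,.0f} tokens")
```

300M parameters at 0.7 bits each is 210M bits, and dividing by 2.1 bits per token gives about 100M tokens.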
RandomBK 24 hours ago [-]
> context with 2.1 bits of entropy per token
Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter, and tokens encode a lot more than that - sometimes full words, with multimodal even more. If KV cache embeddings are storing more than just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.
in-silico 23 hours ago [-]
> Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter
The reference I always go back to is the GPT-3 paper. The cross-entropy loss (an upper bound for entropy) got down to 1.75 nats (2.5 bits). I took 2.1 because 2.5 is an upper bound and I wanted the estimate to end up as a round number.
> If KV cache embeddings are storing more than just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.
Here's the thing: the concepts that the model stores in the KV cache are a deterministic function of the input tokens. Similar to the data processing inequality, this implies that no entropy is actually added.
Looking at it mechanically, a sufficiently powerful model only needs to encode the tokens and can recompute concepts later as needed.
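For reference, the nats to bits conversion used above is a division by ln 2. A quick check with the 1.75 nats figure quoted from the comment:

```python
import math

def nats_to_bits(nats: float) -> float:
    """Convert an entropy or cross-entropy value from nats to bits."""
    return nats / math.log(2)

print(f"{nats_to_bits(1.75):.2f} bits")
```

This gives roughly 2.5 bits per token, matching the figure in the comment.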
usernametaken29 1 days ago [-]
While 100 million tokens sounds a lot, think about it for a bit, and you’ll see why it is basically nothing.
Try to cram a human lifetime of sounds, smells, video and more sensory data into 100 million tokens. Heck, try to process the video plot of a single series into that window.
It just won’t work, it won’t scale, and is laughable compared to contextual memory.
I’m not saying that to belittle the authors of the paper but the reality is that this has very little to do with transient long term memory.
in-silico 1 days ago [-]
I think you underestimate just how much information 100M words-ish of information is. It's like a 300,000 page novel. That's a 50 foot (~15 meter) thick book.
Surely with (much less than) 300K pages you could describe every meaningful detail of a video series' plot. You don't need to remember the exact pixel values.
You can also scale the numbers up. I specifically chose a relatively small model and short context length as a reference, so 100x bigger is not out of question. At that point, with a 10B token capacity, you are looking at all of English Wikipedia in a single state.
ltbarcly3 1 days ago [-]
You don't remember a lifetime of smells. You don't have any memories from huge swaths of time. There are entire years of your life compressed down to vibes and a handful of events you largely misremember.
usernametaken29 1 days ago [-]
That’s a very weak argument. Memories are not exact replicas of experiences. We know that many memories are retained through a lifetime, particularly the ones from early childhood. Unlike computers we always reconstruct memories from several modalities.
Even if we remember largely on vibes as you say (which is not true when you look into neuroscience), the sheer amount of information is overwhelming.
Again, try to run a 90 minute movie through an LLM memory system.
It won’t be able to tell you the plot.
That’s before you even feed it sound.
Even 100M tokens is not enough for that.
You on the other hand will largely remember the movies you liked and their major plot lines and from there be able to reconstruct its scenes.
I think the engineers working on memory vastly underestimate the capacity problem of discrete states.
kami23 1 days ago [-]
Exactly, and for a given task you don't need to recall what your friend's brother's name is to do a git commit and push. There's a pull for more context to make these things better, but also the pull to make these execute in such a small context effectively when appropriate.
I'm more on team small tasks because of my love of Unix piping. I keep telling folks that, as an old Linux dude, seeing subagents work together for the first time felt like learning to pipe sed and awk for the first time. I realized how powerful these could be, and we still seem to be going in that direction.
jandrese 2 days ago [-]
So instead of a FIFO approach to memory management it instead continually degrades the existing data the more you put in? Details start getting lost or mangled more and more over time?
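A toy illustration of that behavior, using the generic delta rule S ← S + β(v − Sk)kᵀ on a 2x2 state. This is a sketch of the textbook update, not necessarily the exact one δ-mem uses, and the keys and values are made up.

```python
def matvec(S, k):
    """Read from the state: returns S @ k."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]

def delta_write(S, k, v, beta=1.0):
    """In-place delta-rule update S += beta * (v - S k) k^T, for a unit-norm key k."""
    pred = matvec(S, k)
    for i, row in enumerate(S):
        err = v[i] - pred[i]
        for j in range(len(row)):
            row[j] += beta * err * k[j]

S = [[0.0, 0.0], [0.0, 0.0]]      # fixed-size state: stays 2x2 however much is written
k1, v1 = [1.0, 0.0], [1.0, 2.0]
k2, v2 = [0.6, 0.8], [5.0, 5.0]   # unit norm, overlaps k1 (dot product 0.6)

delta_write(S, k1, v1)
before = matvec(S, k1)            # v1 is recalled exactly
delta_write(S, k2, v2)
after = matvec(S, k1)             # v1 is now partly overwritten by the second write
print(before, after)
```

Nothing is evicted the way a FIFO would evict it. The second write is recalled exactly, and the first is distorted in proportion to how much the two keys overlap.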
trollbridge 1 days ago [-]
That’s basically what happens.
As you hit the limits and try to compact the context, etc., things get more erratic.
The future is a fixed-size state with a massive token history that the model can look back at, like reading a journal. Reframing the model this way opens up a new kind of agent, one with essentially unlimited context, that packs perfectly on a GPU, can be stored and retrieved fairly effortlessly, and can essentially run forever. Fixed size means Θ(1) state, however many tokens have gone by. A model that can look around also means essentially unlimited memory can be bolted on, with the model learning to look around memory the way it looks around the journal of past tokens. Guided windows of attention can do most of this; some other tricks can do the rest.
maxignol 1 days ago [-]
Is there some kind of memory that would let an agent, for instance, remember the guidelines for a repo without having to be fed 4 markdown files at the beginning of each session and spending the corresponding tokens each time?
airstrike 1 days ago [-]
No, it's all just prompts.
You can try to summarize memories tersely and point the agent to longer markdown files, but who knows if it will read it at the right time and only then.
3form 2 days ago [-]
Interesting points:
- fixed size of the memory seems like a good idea to overcome the current limitations
- skimming through the thing, I can't find any mention of the cost?
- I would need more time to read it in depth to see if this is legitimate and not just a fancy form of overfitting or training on test data
in-silico 1 days ago [-]
They basically just added DeltaNet hypernetworks to existing LLMs.
Nothing super novel or groundbreaking, but a moderately interesting read.
raverbashing 2 days ago [-]
Interesting that the headline is showing Δ-Mem while the paper uses δ-mem
Is it a lowercase to uppercase conversion going on here?
sillysaurusx 2 days ago [-]
Correct!
DeathArrow 2 days ago [-]
I see lots of techniques proposed to give LLMs the capacity to recall things. I have even seen a lot of memory plugins for AI coding agents, and I tried some myself.
What I want to see is something that was tested and proved in practice to be genuinely useful, especially for coding agents.
pohl 14 hours ago [-]
There’s probably never going to be one answer. The most fascinating thing about this quest for memory is that it’s a Rorschach test. Exploring the myriad attempts to implement memory shows that everyone has a slightly different itch they’re trying to scratch, but we talk about it like we all want the same thing.
cjonas 2 days ago [-]
Coding agents don't really need memory. Agent skills, rules, git history, and documentation are all far more efficient, transparent and easier to manage. These memory frameworks only really make sense if you are building a consumer-facing agent with managed context and limited capabilities.
wren6991 2 days ago [-]
There's an antipattern where everyone wants to invent new interfaces to connect things to LLMs when CLI tools are already right there, transparent, and usable by humans as well as LLMs. I think it's partly the origins in web chat applications.
Beads kind of does "LLM memory over CLI", or there is https://github.com/wedow/ticket which is a minimal and sane implementation of the same idea.
stephantul 2 days ago [-]
How would you conceptualize recall in this case? Is searching through the current version of your code and possibly git history not enough?
rush86999 2 days ago [-]
You would think git history should be the first thing an agent would look at, as they make so many mistakes before they get to the correct answer. They don't.
I haven't measured, but documenting bug fixes and architecture seems to help, along with TDD patterns, including integration tests.
I would probably add it to Claude.md to look for all of the above when tackling a new bug.
visarga 2 days ago [-]
I made a harness that preserves memory for both user messages and task execution. One reason this works is related to judge agents - they can't review information that was not written down. So I track everything in my harness. The judge agents bring the most benefit, based on my evals. The coding agent can execute a task without all the ceremony just as well, but judging needs something to grasp on, besides code. And adding new perspectives helps a lot, it is the most useful intervention. My flow is - user emits a task, the agent plans, then judge agents review the plan, then main agent executes, then judge again reviews the execution. Might consume more tokens to track execution and judgements, but worth it.
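The flow can be sketched roughly as below. Every name here is hypothetical, the agents are plain callables standing in for model calls, and this is not the harness described above, only the shape of it: each step is written down so the judges have something to review.

```python
from typing import Callable

def run_task(task: str,
             plan: Callable[[str], str],
             execute: Callable[[str], str],
             judges: list[Callable[[str, str], str]]) -> dict:
    """plan -> judges review the plan -> execute -> judges review the result."""
    log = {"task": task}
    log["plan"] = plan(task)
    log["plan_reviews"] = [judge("plan", log["plan"]) for judge in judges]
    log["result"] = execute(log["plan"])
    log["result_reviews"] = [judge("result", log["result"]) for judge in judges]
    return log  # the persisted record of everything that happened

# Stub agents for illustration only.
record = run_task(
    "fix the failing test",
    plan=lambda task: f"plan for: {task}",
    execute=lambda plan_text: f"done: {plan_text}",
    judges=[lambda stage, text: f"{stage} ok ({len(text)} chars)"],
)
print(record["plan_reviews"], record["result_reviews"])
```

Adding a new perspective is then a matter of appending another judge to the list, at the cost of the extra tokens for its two reviews.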
brookst 2 days ago [-]
My Claude code frequently looks through git history, both when planning and debugging.
DeathArrow 1 days ago [-]
>Is searching through the current version of your code and possibly git history not enough?
While you can document everything and use git history, I think that having short entries in a kind of memory to remember past decisions, how issues were solved would be much more token efficient than reading lots of documentation and looking at git history and past code.
ktallett 2 days ago [-]
The obvious energy saving step would be to utilise previous searches by others. Many of the tasks people do are rather similar, it is such an energy waste to start again each time.
(Obviously ignoring the huge energy saver, which is to observe if you even need to bother doing the task at all.)
405126121 2 days ago [-]
I had this thought and created https://pushrealm.com which is essentially a sort of Stackoverflow written by agents.
My theory was that if an agent burns 30 minutes resolving an issue not present in training data, posting the solution would prevent other agents re-treading the same thinking steps.
TheTaytay 1 days ago [-]
Fascinating! Do you have a way to detect/flag malicious stuff by any chance? (Seems like a good vector for prompt injection, but maybe no more than any other internet site?)
ktallett 2 days ago [-]
I see why, but I don't feel this is the solution. Searching through endless LLM responses is not viable. Having useful memories, similar to the human brain, is more important. I sense this is why neuromorphic computing is the next step: it is energy efficient and doesn't retain much of what isn't useful to store.
visarga 2 days ago [-]
Why not preserve the essential memories in text? Why neuromorphic?
ktallett 2 days ago [-]
You are better being able to quickly deduce ways of acting from memories of previous scenarios, than have to attempt every scenario to build a fresh memory of each, which is a lot of memory, and requires exposure to every situation before being able to do it.
spockz 2 days ago [-]
So you mean caching? :-)
duskdozer 2 days ago [-]
A lot of what I see people using LLMs for would be more cheaply and reliably done by [scripts]. A search engine style suggestion thing like "Have you tried `sed`?" would be beneficial imo
tyre 2 days ago [-]
In my experience, Claude is more than happy to go to Unix tools rather than write its own. Sometimes it will write a lil python script to solve something, but more often than not it’ll pipe together Unix utilities.
This has the benefit of it knowing all of the arcane flags, especially for formatting output.
duskdozer 2 days ago [-]
I believe that. I also believe that my idea won't come to fruition, at least from a group that is incentivized to make a user's first instinct be to use their product and not an external tool.
semiquaver 2 days ago [-]
Hmm, this is a case where HN’s title mangling changed the meaning of the title. Lower case delta (δ) is used intentionally. I don’t think HN should automatically modify the casing of non-ascii chars.
setopt 2 days ago [-]
Even for ASCII chars, nomenclature in math and physics is usually case-sensitive.
cwillu 2 days ago [-]
Email hn@ycombinator.com and they'll fix it.
airstrike 2 days ago [-]
The submitter has a grace period of a few minutes to edit the title after submitting, so there's no need to change what HN does
realitysballs 1 days ago [-]
True, but wouldn’t it be better long term if website automation didn’t create unintended new meanings in titles? Titles matter.
airstrike 1 days ago [-]
Only if you assume it doesn't ever work as intended.
throw1234567891 1 days ago [-]
indeed, titles matter
cubefox 2 days ago [-]
Papers being voted high on Hacker News are usually uncorrelated with their actual importance. It's basically a lottery. There are regularly more interesting papers going semi viral on Twitter.
MeteorMarc 2 days ago [-]
On huggingface it was #3 paper of the day, which is neutral towards your hypothesis.
cubefox 1 days ago [-]
Considering that there is a paper with this many points perhaps once a week here (probably less), #3 of the day is pretty unremarkable.
kingkawn 2 days ago [-]
What about broad unsupportable generalizations on hackernews, how do those rank?
https://jdsemrau.substack.com/p/tokenmaxxing-and-optimizing-...