I've been making skills from arXiv papers for a while. I have one for multi-object tracking, for example. It has a SKILL.md describing all the important papers (over 30) on the subject and a folder with each paper's full content as reStructuredText.
To feed arXiv papers to LLMs, I found that RST gives the best token-count/fidelity ratio: Markdown lacks precision, and LaTeX is too verbose. I have a script that, given each paper's URL, name, and date, downloads the LaTeX archive from arXiv, extracts it, transforms it to RST, and adds it to the right folder. Then I ask an LLM to make a summary from the full text, give other LLMs the full paper together with the summary, and ask them to improve and proofread it. While this goes on I read the papers myself, and at the end I read the summaries; if I approve them, I add them to the skill. I also add, for each paper, info on how well the algorithms described do on common benchmarks.
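If it helps, here is roughly what that download-and-convert step looks like as a Python sketch. The paper ID, folder layout, and helper names are illustrative, and it assumes pandoc is on PATH; note arXiv serves the LaTeX source as a tarball at the e-print endpoint rather than a zip:

```python
import subprocess
import tarfile
import urllib.request
from pathlib import Path

def arxiv_source_url(paper_id: str) -> str:
    """arXiv serves each paper's LaTeX source at the e-print endpoint."""
    return f"https://arxiv.org/e-print/{paper_id}"

def ingest(paper_id: str, dest: Path) -> None:
    """Download one paper's source, extract it, and convert .tex files to RST."""
    work = dest / paper_id
    work.mkdir(parents=True, exist_ok=True)
    archive = work / "source.tar.gz"
    urllib.request.urlretrieve(arxiv_source_url(paper_id), archive)
    with tarfile.open(archive) as tar:
        tar.extractall(work)
    for tex in work.glob("*.tex"):
        # check=False: some papers use macros pandoc can't fully handle
        subprocess.run(
            ["pandoc", "-f", "latex", "-t", "rst",
             str(tex), "-o", str(work / (tex.stem + ".rst"))],
            check=False,
        )
```

Real papers often split across many .tex files with a single root document, so picking the right entry point usually needs a bit more logic than the glob above.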
I highly recommend doing something similar if you're working in a cutting-edge domain. Also, I'd like to know if anyone has recommendations to improve what I do.
ctoth 1 day ago [-]
I've been working on ctoth/research-papers-plugin, the pipeline to actually get LLMs to extract the notes. I really like your insight re RST over Markdown! It sounds like we're working on similar stuff and I'll absolutely reach out :)
timClicks 17 hours ago [-]
Another format that's worth investigating is Asciidoc. It supports the richness of Docbook XML but has fewer quirks than rST in my eyes.
mercer 31 minutes ago [-]
would it make sense to just go for pandoc instead?
simlevesque 1 day ago [-]
I'm gonna look at your plugin. My email is in my profile.
Honestly I think that Markdown with LaTeX code blocks would be the most efficient representation, but when doing it with Pandoc I kept having issues with loss of information and sometimes even syntax errors.
3abiton 5 hours ago [-]
I am surprised you found RST better than markdown.
paulluuk 1 day ago [-]
This sounds like it would work, but honestly, if you've already read all 30 papers fully, what do you still need the LLM to do for you? Just the boilerplate?
simlevesque 1 day ago [-]
I'm trying to make a Go library that implements a wide range of MOT algorithms and can gather metrics for all of them.
Reading all the papers once isn't the same as this. I find it very useful.
I can ask an LLM to do the basic implementations, then I can refine them (make the code better, faster, cut memory use), then I can ask the LLM if I'm still implementing the algorithms as they're described in the paper.
giancarlostoro 18 hours ago [-]
> then I can ask the LLM if I'm still implementing the algorithms as they're described in the paper.
Unit testing would save on tokens... unit testing is perfect for validating refactors. And when rewriting a project from one language to another, build the unit tests first.
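To make that concrete (in Python for brevity; the same pinning works with Go's testing package), the trick is to pin values you can check by hand against the paper's formulas, e.g. the IoU used for association in trackers like SORT. The function and values here are illustrative, not from the commenter's library:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def test_iou_matches_hand_computed_value():
    # overlap is 1x1 = 1; union is 4 + 4 - 1 = 7
    assert abs(iou([0, 0, 2, 2], [1, 1, 3, 3]) - 1 / 7) < 1e-12

def test_iou_disjoint_boxes():
    assert iou([0, 0, 1, 1], [2, 2, 3, 3]) == 0.0
```

Once such values are pinned, refactors and language ports can't silently drift from the paper without a test going red.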
gessha 20 hours ago [-]
It lets you filter out interesting papers more quickly.
I’ve been meaning to build something similar. I will report back once I have something to show.
Thanks for sharing!
satvikpendem 1 day ago [-]
Does that even fit in the context? It seems like 30 papers worth of content would just overflow it.
ctoth 1 day ago [-]
For each paper, have your agent extract a three sentence description, create a description.md, then concat those with the paper names into an INDEX.md which it should consult to find appropriate papers. Also: have your agent tag papers, then autogenerate your tagged collection on the filesystem. Then you get nice things like https://github.com/ctoth/Qlatt/tree/master/papers/tagged
Then something in your {CLAUDE,AGENTS}.md that says: when working on something with relevant context supplied by papers, read the papers before doing the work. You can find all papers plus their descriptions in ./papers/INDEX.md and papers by tag in ./papers/tagged
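A rough sketch of that generation step in Python (the helper names and file layout are my own, not necessarily what the repo does):

```python
from pathlib import Path

def build_index(papers: dict[str, str]) -> str:
    """Concatenate paper names and their short descriptions into INDEX.md text."""
    lines = ["# Paper index", ""]
    for name, description in sorted(papers.items()):
        lines += [f"## {name}", description, ""]
    return "\n".join(lines)

def write_tagged(papers_dir: Path, tags: dict[str, list[str]]) -> None:
    """Mirror papers into per-tag directories via symlinks."""
    for tag, names in tags.items():
        tag_dir = papers_dir / "tagged" / tag
        tag_dir.mkdir(parents=True, exist_ok=True)
        for name in names:
            link = tag_dir / name
            if not link.exists():
                link.symlink_to(papers_dir / name)
```

Symlinks keep the tagged view cheap to regenerate: delete ./papers/tagged and rerun, and no paper content is ever duplicated.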
Sorry to spam, I'm working on this also from a different angle. Hopefully sharing adds to the conversation.
First, about the loop: Claude's (the coding agent's) context and attention are big enough to self-reflect. Agent Tuning shows a technique that not only demonstrates this but gives a way to quantify it. [0] The difference is that autoresearch's val_bpb measures what the agent built; Agent Tuning's p̂ measures the agent itself.
> Claude's attention doesn't distinguish between "instructions I'm writing" and "instructions I'm following" -- they're both just tokens in context.
Second, on doing research: finding academic research to add to context helps. Here is an example of an implementation that creates trading strategies by reading research and recreating them in creative new ways. [1]
The biggest problem is the coding agents don't "Fail fast and loud". They fail deceivingly.
[0] https://github.com/adam-s/agent-tuning
[1] https://github.com/adam-s/alphadidactic
> The biggest problem is the coding agents don't "Fail fast and loud". They fail deceivingly.
GPT-2 and GPT-3 used to fail fast (and loud, because we could easily see them lying)
dataviz1000 24 hours ago [-]
My next exploration will be "Coding Agents: fail slow, silent, and deceivingly".
After one month working on using Claude to create trading strategies, the one thing I learned: if the strategy looks like it can profit, it is a lie. The trading-strategy agent doesn't find trading strategies that work; it is really a bug-hunting agent.
lmeyerov 21 hours ago [-]
I've found value in architectural research before R&D-tier projects, like big changes to GFQL, our OSS GPU Cypher implementation. It ends up multistage:
- deep research for papers, projects, etc. I prefer ChatGPT Pro Deep Research here, as it can quickly survey hundreds of sources for overall relevance
- deep dives into specific papers and projects, where an AI coding agent downloads relevant papers and projects for local analysis loops, performs technical breakdowns into essentially a markdown wiki, and then reduces over all of them into a findings report. Claude Code is a bit nicer here because it supports parallel subagents well.
- iterative design phase where the agent iterates between the papers repos and our own project to refine suggestions and ideas
Fundamentally, this is both exciting and limiting: it's an example of 'Software Collapse', where we get to pull in best practices and good ideas from relevant communities, but the LLM is not doing the creativity here, just mashing things up and helping pick.
Tools to automate this stuff seem nice. I'd expect it to be trained into the agents soon, as it's not far from their existing capabilities already. E.g., 'iteratively optimize function foobar; prefer GPU literature for how.'
jbergqvist 24 hours ago [-]
When I want to solve a new problem with an agent, I always ask it to search broadly for prior work in the given area online, and then analyze if we can build our solution using it as inspiration.
I see it as the solution being out there in “idea space”, and by having the agent search beforehand we can more efficiently explore this space before converging on the final solution.
dalmo3 19 hours ago [-]
Is it not safe to assume that all* publicly available prior work is in the training data?
Then you could just prompt it to propose options with pros and cons etc.
* Bar extremely new stuff from after the cutoff
technotony 18 hours ago [-]
Maybe, but having it search first to load the context with relevant information sure gets better results.
formerly_proven 8 hours ago [-]
Included in the training corpus doesn’t mean perfect or even partial recall.
ctoth 1 day ago [-]
I've been very interested in this recently. I'm pretty sure that every project should have a ./papers directory of annotated papers in it like I do in Qlatt[0].
Literally every project. If it's something that's been done a million times then that means it has good literature on it? If not, then even more important to find related stuff! And not just crunchy CS stuff like databases or compilers or whatever. Are you creating a UI? There's probably been great UI research you can base off of! Will this game loop be fun in the game you're building? There's probably been research about it!
[0]: https://github.com/ctoth/Qlatt/blob/master/papers/
That directory is huge already! I guess the index.md helps the agent find what it needs, but even the markdown file is very long - this would consume a ton of tokens.
Also I wonder who/what decides what papers go in there.
In the blog post, the agent is allowed to do its own search.
ctoth 1 day ago [-]
Check out the Researcher and Process Leads skill in ctoth/research-papers-plugin. I have basically completely automated the literature review.
pstuart 1 day ago [-]
Having an "indexed global data collection" of the markdown would be a kumbaya moment for AI. There's so much data out there but finite disk space. Maybe torrents or IPFS could work for this?
ctoth 1 day ago [-]
I'm actually sort of working on this! https://github.com/ctoth/propstore -- it's like Cyc, but there is no one answer. Plus knowledge bases are literally git repos that you can fork/merge. Research-papers-plugin is the frontend, we extract the knowledge, then we need somewhere to put it :)
pstuart 22 hours ago [-]
Awesome! TIL about Cyc, and it's quite intriguing. I'd been thinking about how being able to integrate Prolog or similar tools might be a valuable endeavor (although I've yet to write anything in Prolog myself).
zzleeper 1 day ago [-]
Wow, this is amazing. Did you write all those MD files by hand, or use an LLM for the simple stuff like extracting abstracts?
Claude is much faster and better at reading papers than Codex (some of this is nested skill dispatch) but they both work quite incredibly for this. Compile your set of papers, queue it up and hit /ingest-collection and go sleep, and come back to a remarkable knowledge base :)
KingOfCoders 1 day ago [-]
I use #PPPCDC for prompting: plan, plan, plan, then verify with: Compare the plan to the existing Code. Reread and compare the plan to the Docs. Fix the areas you're not Confident about.
love2read 21 hours ago [-]
It may sound silly, but it seems like they need to add a phase before research that finds a profiler and runs it, instead of just guessing what optimizations may be beneficial.
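With Python's built-in cProfile, for instance, a "measure first" phase can be a few lines (the helper here is just a sketch):

```python
import cProfile
import io
import pstats

def profile_top(fn, *args, n=5):
    """Run fn under cProfile and return the n most expensive entries as text."""
    pr = cProfile.Profile()
    pr.enable()
    fn(*args)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(n)
    return buf.getvalue()
```

Feeding a report like this into the agent's context before the research phase grounds the literature search in where the time actually goes.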
mechoblast 8 hours ago [-]
This seems analogous to why Vercel found an AGENTS.md better than "skills", when in reality the AGENTS.md is just the already-expanded skill.
This fits into the paradigm of finding ways to force better context engineering.
hungryhobbit 1 day ago [-]
I think anyone who uses Claude knows that it works smarter when you have it make a plan first and ask it to research the existing code as much as possible first... so the results in this article don't surprise me at all.
However, I'd be curious to hear back from others who have tried adding the shell script (at the end of the article) to their flow: does it (really) improve Claude?
maCDzP 1 day ago [-]
I have an ML project. I usually set up a team of agents: a leader, archivist, research assistant, researcher, developer, and tester. The team generates hypotheses based on papers, tests them, and iterates over that. Everything is documented in a lab notebook. It burns tokens, but I have found some promising strategies that I am testing.
throwdbaaway 17 hours ago [-]
Very nice TG improvement from the Flash Attention KQ fusion. Is it something that was already done in ik_llama.cpp? If not, then it will be a welcome addition for hybrid CPU/GPU inference.
throwdbaaway 17 hours ago [-]
> EC2 instances on shared hardware showed up to 30% variance between runs due to noisy neighbors.
Based on this finding, I suppose the better way is to rely on local hardware whenever possible?
prats226 23 hours ago [-]
A good experiment would be to also give it access to latency traces so it can identify issues. With coding agents, giving access to observability tools often improves coding/debugging ability for me.
The research step makes sense. I can also confirm that running multiple agents with diverse strategies compounds results more quickly than a single agent.
alex000kim 1 day ago [-]
I am sure this works well in general. There is a challenge in how to make the agents communicate effectively, e.g. to 1) avoid duplicative work and 2) allow them to combine/overlay each other's findings to yield even better results.
outside1234 1 day ago [-]
A research step (gather insights from across the codebase and the internet for how to accomplish the next step), a planning step (how should I sequence implementation given that research), an implementation step, and a verification step (code review of the implementation) is a super effective workflow for me.
alex000kim 1 day ago [-]
yup, as the blog says
> The full setup works with any project that has a benchmark and test suite.
so having a clear and measurable verification step is key.
Meaning you can't simply give an AI agent a vague goal e.g. "improve the quality of the codebase" because it's too general.
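One way to make that verification measurable is a simple regression gate over the benchmark number; the helper and threshold below are illustrative, not from the article:

```python
def passes_gate(baseline_ms: float, candidate_ms: float,
                tolerance: float = 0.02) -> bool:
    """Accept a change only if the benchmark didn't regress beyond tolerance.

    A 2% default tolerance absorbs run-to-run noise; tighten it on quiet
    local hardware, loosen it on shared cloud instances.
    """
    return candidate_ms <= baseline_ms * (1.0 + tolerance)
```

With a gate like this, "improve the codebase" becomes "make passes_gate return True without breaking the test suite", which is a goal an agent can actually loop on.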
hopechong 1 day ago [-]
Coding agents that read papers before writing code find optimizations that code-only agents miss.
We added a literature review phase to Karpathy’s autoresearch loop and pointed it at llama.cpp. The agent autonomously read arXiv papers, studied competing forks, and spun up VMs to run parallel experiments.
tomi_dev 24 hours ago [-]
This is interesting.
Do you see a noticeable difference in output quality when the agent reads context first vs going straight into generation?
Feels like most tools skip that step.
doctorpangloss 1 day ago [-]
The SkyPilot devs need to focus on decoupling their offering, so that their very valuable "find the cheapest cloud" functionality isn't married to a glitchy reinvention of Kubernetes JobSet and MLflow.
phendrenad2 1 day ago [-]
This is obvious, right? If you want to build a Facebook clone, you wouldn't tell the agent "build Facebook". You would provide it with a description of every page on Facebook, behaviors, interactions, UI, etc.
esafak 7 hours ago [-]
Agreed. This is nothing but RAG, which helps when the task benefits from more knowledge. They're just going into the details of their application.
faeyanpiraat 1 day ago [-]
Have you even read the TL;DR in the linked article??
phendrenad2 1 day ago [-]
You mean this part?
> TL;DR: Coding agents generate better optimizations when they read papers and study competing projects before touching code
What made you think I hadn't read the article, let alone that TL;DR? I'm really curious. Jumping to an insulting "have you read the article" is a big step, so it'll be really interesting to see where your mind went.