I'm skeptical skills will outperform training, given that Opus 4.7 already ignores a 720-byte CLAUDE.md telling it to use tidewave (a Rails MCP server with 6 tools) for db queries. When I asked a new Claude session about a record, it called
> Bash(DATABASE_URL=$(grep -E '^DATABASE_URL=' .env 2>/dev/null | head -1) echo "ok")
even though I have in CLAUDE.md:
> For database queries, use tidewave first.
I then prompted:
> use tidewave as per CLAUDE.md. also diagnose why you failed to heed that
> ● Diagnosis first: I defaulted to shell habits (env grep → psql) instead of pausing to recall the CLAUDE.md rule that tidewave is the first-line DB tool. The trigger was "look at this record" — I should have read that as "run a SQL query" and reached for tidewave immediately.
If Opus 4.7 doesn't follow simple CLAUDE.md instructions, I'm not sure what benefit other markdown files could bring. I don't trust Opus's own explanation, but it may point to the fact that the system prompt for bash is much longer than the CLAUDE.md tidewave rule.
While LLM judging could be helpful, I think the tool-call assertions (https://github.com/darkrishabh/agent-skills-eval#what-you-ge...) may be the most useful thing in agent-skills-eval, given that they're the only objective measure of compliance.
I've had minor success with chiding the clanker, after it ignores something, to "please revise AGENTS.md to never do <whatever stupid thing it did> to prevent future assistants from doing x."
So, at least heuristically, it should know _why_ it ignored whatever and hopefully pull the correct anti-matter context. It took about two repetitions of this to get it to use pg-promise instead of psql to do queries for me. I assume the longer the context goes on, the less likely any of this priming works.
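To make the tool-call assertion idea above concrete, here is a minimal, hypothetical sketch of checking a run transcript for a required tool call; the JSONL event fields are assumptions about the shapes involved, not agent-skills-eval's actual format:

```python
# Hypothetical tool-call assertion: scan the run transcript and fail unless the
# required tool was actually invoked. Field names ("type", "name") are assumed.
import json
from pathlib import Path

def tool_was_called(transcript_path: Path, tool_name: str) -> bool:
    for line in transcript_path.read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "tool_use" and event.get("name") == tool_name:
            return True
    return False

# e.g. assert tool_was_called(Path("run-01.jsonl"), "execute_sql_query")
# passes or fails regardless of how plausible the model's prose sounds.
```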
erispoe 19 hours ago [-]
Use a hook
reedlaw 18 hours ago [-]
I tried to create a hook that would detect when token usage was running out and write HANDOFF.md so I could switch to another agent and finish the current task. It never worked reliably. To make a hook for db queries, it would need to run before each bash call, check if the command looks like a query, and then exit with a new prompt, e.g.: "Use tidewave's execute_sql_query for DB access". But then the model could just ignore that prompt the same way it ignores CLAUDE.md. And what if I really wanted to use bash for a specific task? The real issue is that prompts are not tightly coupled with capabilities. If we admit that, then skills are overhyped.
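For what it's worth, the hook I'm describing would look roughly like this. It's only a sketch: the file path and settings wiring are assumptions, and exit code 2 blocking the call with stderr fed back to the model is how I understand Claude Code's PreToolUse hook contract:

```python
#!/usr/bin/env python3
# Hypothetical PreToolUse hook for the "use tidewave for DB queries" case.
# Assumed wiring (not shown): .claude/settings.json registers this script under
# hooks.PreToolUse with matcher "Bash" as a "command" hook.
import json
import re
import sys

event = json.load(sys.stdin)  # the hook payload arrives as JSON on stdin
command = event.get("tool_input", {}).get("command", "")

# Crude heuristic: does this Bash call look like a database query?
looks_like_query = re.search(r"\bpsql\b|\bDATABASE_URL\b|\bSELECT\b", command, re.IGNORECASE)

if looks_like_query:
    # Exit code 2 blocks the tool call and feeds stderr back to the model.
    print("Use tidewave's execute_sql_query for DB access.", file=sys.stderr)
    sys.exit(2)

sys.exit(0)  # allow everything else through
```

Even with that in place, the model can still argue with or route around the injected feedback, which is the loose coupling I mean.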
rirze 19 hours ago [-]
It's hard to make hooks work here, since the default approach it's using is to call the URL directly.
I think it's better to have a repo-level skill instead, titled something like "connecting_to_db.md", that demonstrates exactly how to connect. Codex has been pretty good at referring to skills, but it depends on context at the end of the day.
darkrishabh 17 hours ago [-]
[flagged]
ChairmanLmao 23 hours ago [-]
Depending on the skill, Claude already does this when creating new skills with its skill-creator skill (what a sentence); it's pretty neat. It creates ~6 subagents with and without the skill and judges whether they differ in performance.
dsmmcken 21 hours ago [-]
The Claude-provided skill-creator is a decent jumping-off point. It's easy enough to start with, but unless the skill is really simple I found it best to treat it as a scaffold for building more tailored evals and reports.
The report leaves out a lot of detail. Several changes I found useful: pairing with/without runs on the same screen as left/right columns for easier viewing; token count consumed by the skill; tokens used per run; time; pass rate; estimated cost; detailed aggregate stats; a parsed version of the conversation log (capturing the jsonl for each run, since sometimes reading the log is the only way to find out why it's screwing up); work-output logging (in my case screenshots and the script code it produced); and better formatting (syntax highlighting, log formatting).
Finally, I think the most useful addition was a self-reflection pass. After an eval is done, another agent looks at everything from that eval and tries to work out what went wrong along the way and what should be added to the skill, and conversely, from the without-skill run, what was in the skill that didn't need to be. It produces a skill-change recommendation file for each eval. A further summary agent aggregates all those recommendations in a way I can feed back to an agent.
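For concreteness, the per-run bookkeeping behind that report is roughly this shape; run_agent and the outcome field names are stand-ins for whatever harness invocation you use, not a real API:

```python
# Sketch of the paired with/without-skill bookkeeping described above.
import statistics
import time
from dataclasses import dataclass

@dataclass
class RunResult:
    task: str
    with_skill: bool
    passed: bool
    tokens_used: int
    seconds: float
    transcript_path: str  # keep the raw jsonl; reading it is often the only way to see why a run failed

def run_eval(tasks, runs_per_task, run_agent):
    results = []
    for task in tasks:
        for with_skill in (True, False):
            for run_id in range(runs_per_task):
                start = time.time()
                outcome = run_agent(task, with_skill=with_skill, run_id=run_id)
                results.append(RunResult(
                    task=task,
                    with_skill=with_skill,
                    passed=outcome["passed"],
                    tokens_used=outcome["tokens_used"],
                    seconds=time.time() - start,
                    transcript_path=outcome["transcript_path"],
                ))
    return results

def summarize(results):
    # Aggregate stats per arm so the report can show the pairing side by side.
    summary = {}
    for with_skill in (True, False):
        arm = [r for r in results if r.with_skill == with_skill]
        summary["with_skill" if with_skill else "without_skill"] = {
            "pass_rate": sum(r.passed for r in arm) / len(arm),
            "median_tokens": statistics.median(r.tokens_used for r in arm),
            "median_seconds": statistics.median(r.seconds for r in arm),
        }
    return summary
```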
darkrishabh 16 hours ago [-]
[flagged]
ssgodderidge 1 days ago [-]
The example model in the documentation is 4o-mini, you might want to update that to a more recent model.
As an aside, 4o-mini came out months before agent skills were released… I’m curious how it performs with choosing to load skills in the first place?
stingraycharles 1 days ago [-]
It’s an artifact of the documentation being AI-generated; the models usually pick GPT-4-era models without giving it further thought.
For Gemini it always seems to pick 2.5 despite 3.1 being the latest; for Claude, the 3.5-era models.
Not sure what’s preventing AI labs from ensuring this stuff is refreshed during training.
simonpure 23 hours ago [-]
I was wondering the same and learned the model doesn't know about itself during training [0]
[0] https://developers.googleblog.com/closing-the-knowledge-gap-...
The model doesn't know itself, but all these larger models generate a significant amount of synthetic data from the prior models, and the prior models are all context-bloated renditions; you fill the KV cache with whatever alignment you want and then generate synthetic data.
That training on existing models is what brings out various other things about other models. Then there are models that are just like snowballs, where you build one iteration, then you give it its identity, then you train on that with the same synthetic generation.
So a model's generations could at some point include its own name.
stingraycharles 8 hours ago [-]
I don’t think what you’re saying makes a lot of sense. You don’t “fill the KV cache with whatever alignment you want.” That doesn’t exist. The KV cache is an inference optimization, and is populated by running tokens through the model.
Synthetic data is generated by other models, and yes this is often where identity propagates.
I think by the snowballing you mean something like iterative self-distillation? That’s definitely not done unsupervised, because of the risk of model collapse; it’s typically heavily curated and/or mixed with real data.
block_dagger 1 days ago [-]
The skill is deterministically added to the prompt by the harness before the target model is invoked. There is no “choosing” to load a skill. You might be confusing skills with tools (MCP etc).
ssgodderidge 24 hours ago [-]
The metadata is loaded by the harness, but the LLM still needs to choose to load the rest of the skill, no?
albedoa 20 hours ago [-]
You are correct. I'm not sure what the parent is trying to say.
block_dagger 23 hours ago [-]
Define “load.” It follows the instructions in the prompt - its natural behavior.
ssgodderidge 22 hours ago [-]
I was using the term as you used it in your comment. I believe the official term is "Activation", however:
> Activation: When a task matches a skill’s description, the agent reads the full SKILL.md instructions into context.[1]
> Full instructions load only when a task calls for them, so agents can keep many skills on hand with only a small context footprint.
[1]: https://agentskills.io/home#how-do-agent-skills-work
Ah, I misunderstood this, thanks for the link. You are correct. I was assuming this system worked like CLAUDE.md in that it was deterministically added to the context without the LLM choosing to add it. My mistake.
hyperpape 23 hours ago [-]
Concretely, it has to decide whether it is in a circumstance where that skill is useful, pull the instructions into the context and follow them.
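As a toy sketch of that two-stage loading, assuming the usual SKILL.md layout with a name/description frontmatter header; the harness logic itself is made up:

```python
# Toy progressive disclosure: only name/description metadata goes into the prompt
# up front; the full SKILL.md body is read into context only when the model
# decides the task matches. The frontmatter parsing is deliberately naive.
from pathlib import Path

def read_skill(path: Path):
    text = path.read_text()
    _, frontmatter, body = text.split("---", 2)  # split the frontmatter block from the body
    meta = dict(line.split(":", 1) for line in frontmatter.strip().splitlines() if ":" in line)
    return {k.strip(): v.strip() for k, v in meta.items()}, body.strip()

def build_system_prompt(skills_dir: Path) -> str:
    # Stage 1: advertise every skill with a one-line description (small context footprint).
    entries = []
    for skill_file in skills_dir.glob("*/SKILL.md"):
        meta, _ = read_skill(skill_file)
        entries.append(f"- {meta['name']}: {meta['description']}")
    return "Available skills (ask to load one by name):\n" + "\n".join(entries)

def activate_skill(skills_dir: Path, name: str) -> str:
    # Stage 2: the model chose this skill, so pull the full instructions into context.
    for skill_file in skills_dir.glob("*/SKILL.md"):
        meta, body = read_skill(skill_file)
        if meta.get("name") == name:
            return body
    raise KeyError(f"no skill named {name}")
```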
cassianoleal 21 hours ago [-]
Yep, and as with any other instructions, it can sometimes not pull the skill even if the trigger conditions are there.
cyanydeez 15 hours ago [-]
it depends on the harness. opencode appears to prompt the models with tools and skills when answering questions.
TheGRS 18 hours ago [-]
This is all still really early stuff, but there was a blog post yesterday that got me thinking we need a way to send telemetry about the work agents are doing out to a central agent the org controls. It would be responsible for creating skills based on the work people are doing - or, in other words, the stuff they're correcting the agents on. And then you could develop skills for an entire department (customer service, engineering, marketing, etc.).
This tool has me thinking there's some merit to setting that up. My only real qualm is that I'm not super convinced skills are that great yet. I'm trying to get better at developing them in my workflow, but still get a lot of results where they are ignored even after spending time trying to tighten them up.
codecheers 12 hours ago [-]
With-skill vs without-skill evals are useful, but what about comparing skills against each other? Is there an emerging standard for saying one Skill is better than another, beyond custom pass/fail evals?
egeozcan 1 days ago [-]
Are there any published results gathered using this?
jarym 1 days ago [-]
Not sure, but I'm interested in trying it because I've sensed for a while that adding a SKILLS.md degraded my overall experience - most probably I wrote them wrong. But this sort of tooling, I guess, can help me figure that out?
darkrishabh 16 hours ago [-]
Definitely, and this is something that needs more community support
scosman 19 hours ago [-]
Why eval so narrowly, just with/without a skill?
The same approach is useful for everything: model, params, prompt, sub-agents, skills, RAG, etc.
darkrishabh 16 hours ago [-]
Then you go into the territory of benchmarking. But I love the idea here. Having standards around this could really help move the needle.
ianhxu 1 days ago [-]
How do you iterate on the judge prompt? Is there an auto rater?
datadrivenangel 24 hours ago [-]
That is the billion dollar question. Who watches the watchmen?
blitzar 24 hours ago [-]
the watchwatchmen
ianhxu 23 hours ago [-]
exactly
hiroto_lemon 23 hours ago [-]
having token counts surface on each side in the report would be super useful
hidai25 5 hours ago [-]
[dead]
bixxie09 1 days ago [-]
[dead]
ajaystream 22 hours ago [-]
[dead]
huflungdung 1 days ago [-]
[dead]