Note that this result actually generalizes well beyond Claude itself: Anthropic has conducted very similar research on open-weight models, which they call Model Spec Midtraining https://arxiv.org/abs/2605.02087 (discussed at https://alignment.anthropic.com/2026/msm ), and they have released fine-tuned versions of open models trained for a variety of toy "values" (Llama 3.1 8B, Qwen 2.5 32B, Qwen 3 32B) in order to show how the elicitation of these values in any one training context shapes the model's response to tangentially related questions: https://github.com/chloeli-15/model_spec_midtraining and https://huggingface.co/chloeli/collections Very exciting to see this continued interaction with the open-weights community, after the earlier NLA paper!
NitpickLawyer 1 days ago [-]
Really interesting resource, thanks for sharing! It was not on my radar.
I'm a bit confused about this part:
> MSM is a pipeline that takes a Model Spec or Constitution (a document describing how and why an assistant should behave) and generates a diverse corpus of synthetic documents that discuss and teach the content of the spec.
> ANTHROPIC_API_KEY=sk-ant-...
> # Optional but highly recommended — separate key for using the Anthropic Batch API for batch document generation (needed if USE_BATCH_API=true).
> # This will significantly reduce generation time for high-volume generation.
> ANTHROPIC_BATCH_API_KEY=sk-ant-...
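For context, a minimal sketch of how a pipeline like this might consume those keys (assuming the official `anthropic` Python SDK; the model id and prompt below are placeholders I made up, not values from the repo):

    # Minimal sketch of consuming the keys above (anthropic Python SDK assumed).
    import os
    import anthropic

    use_batch = os.environ.get("USE_BATCH_API", "false").lower() == "true"
    key = os.environ["ANTHROPIC_BATCH_API_KEY" if use_batch else "ANTHROPIC_API_KEY"]
    client = anthropic.Anthropic(api_key=key)

    # One synthetic "spec discussion" document; the real pipeline fans this
    # out over many prompts (and over the Batch API when use_batch is set).
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": "Write a short essay discussing one principle of the model spec."}],
    )
    print(msg.content[0].text)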
Isn't this specifically against Anthropic's ToS? I thought generating data to train other models was specifically disallowed. I get this is a research effort, but still. Say you use this pipeline for something internal, this would be against the ToS and risk getting banned, no?
spwa4 13 hours ago [-]
Why do you believe this is what Anthropic is using? You can just directly verify that! If you want to know Claude's alignment, just ask whether it was wrong to use copyrighted data to train Claude ... you will find it was not wrong, and it is unwilling to discuss it further, or its implications. In much the same way as discussing Tiananmen with Qwen.
Anthropic's actions were obviously judged wrong by just about everyone and everything, including even the US state, which judged them illegal. This makes Anthropic's actions run against just about every moral system. Claude obviously has a different alignment.
In other words: Claude's value system already has the priority "protect Anthropic's money" ranked higher than following the law. THAT is its alignment. You can simply and objectively verify whether this is the case or not.
RexM 10 hours ago [-]
[flagged]
justonepost2 2 days ago [-]
If you successfully build a highly capable “aligned” model (according to some class of definitions that Anthropic would use for the words “capable” and “aligned”) and it brings about a global dark age of poverty and inequality by completely eliminating the value of labor vs capital, can you still call it aligned?
If the answer is “yes”, our definition of alignment kind of sucks.
chriskanan 1 days ago [-]
Jobs are an invention of humanity. About 50% of people dislike their job. People spend much of their lives working. Poverty and inequality are choices a society makes, if it chooses poorly.
llbbdd 1 days ago [-]
They're only an invention if you consider "seeking sustenance to live" not to be a job just because there's no monthly direct deposit involved.
OJFord 21 hours ago [-]
Is that true? In communities or tribes of antiquity I assume there was some trading fruits of different labours before coinage. Still an 'invention' beyond baser individual survivalism.
ben_w 1 days ago [-]
Indeed.
On the plus side, if there really is no value to labour, then farm work must have been fully automated along with all the other roles.
On the down side, rich elites have historically had a very hard time truly empathising with normal people and understanding their needs even when they care to attempt it, so it is very possible that a lot of people will starve in such a scenario despite the potential abundance of food.
skeledrew 1 days ago [-]
It's either:
1) the rich voluntarily share the means of production so everyone becomes equal,
2) the poor stage successful revolutions so they gain access to the means of production and everyone becomes equal,
3) the poor starve or are otherwise eliminated, and the survivors will be equal.
All roads lead to equality when the value of labour becomes 0 due to 100% automation.
ben_w 1 days ago [-]
There's plenty of outcomes besides those three.
Over history, lots of underclasses have been stuck that way for multiple generations, even without the assistance of a robot workforce that can replace them economically.
Some future rich class so empowered would be quite capable of treating the poor like most today treat pets. Fed and housed, but mostly neutered and the rest going through multiple generations of selective inbreeding for traits the owners deem interesting.
skeledrew 1 days ago [-]
Non-human pets don't have the capacity to rebel though; make humans into pets and there will again be the constant danger of rebellions as with slavery in the past. Without the economic incentive to offset.
ben_w 1 days ago [-]
I disagree on both counts.
On the first, non-human pets rebelling is seen every time an abused animal bites their owner.
On the second, the hypothetical required by the scenario is that AI makes all human labour redundant: that includes all security forces, but it also means the AI moving around the security bots and observing through sensors is at least as competent as every human political campaign strategist, every human propagandist, every human general, every human negotiator, and every human surveillance worker.
This is because if some AI isn't all those things and more, humans can still get employed to work those jobs.
skeledrew 16 hours ago [-]
Not at all. A rebellion is an organized effort, with an implicitly delayed response to grievances. I can't think of any non-humans that organize their efforts as such. It would be a heck of a thing if a group of dogs were to plan how they'd take out their masters.
All those "jobs" you describe - and many more - would cease to be a thing, as their purported basis for existence would be no more. Any role that doesn't concretely contribute to our survival and advancement is just "busy work". People could theoretically continue to maintain some simulation of something that keeps them as a retirement, but it'd be meaningless.
ben_w 16 hours ago [-]
> Not at all. A rebellion is an organized effort, with an implicitly delayed response to grievances. I can't think of any non-humans that organize their efforts as such. It would be a heck of a thing if a group of dogs were to plan how they'd take out their masters.
Dogs in particular are pack animals, self-organisation amongst them wouldn't be at our level but that doesn't mean it doesn't exist.
> All those "jobs" you describe - and many more - would cease to be a thing, as their purported basis for existence would be no more. Any role that doesn't concretely contribute to our survival and advancement is just "busy work". People could theoretically continue to maintain some simulation of something that keeps them as a retirement, but it'd be meaningless.
Yes?
I think you've missed the point, though.
When your opponent has all those skills to that level and doesn't sleep and simply applies all the surveillance tech that has already been invented like laser microphones and wall-penetrating radar that can monitor your pulse and breathing, how would you manage to rebel?
How would you find a like mind to organise with, when your opponent knows what you said marginally before the slow biological auditory cortex of the person you're talking to passes the words to their consciousness? Silicon is already that fast at this task.
And that's assuming you even want to. Propaganda and standard cult tactics observably prevent most rebellions from starting. LLMs are already weirdly effective at persuading a lot of people to act against their own interests.
simonh 21 hours ago [-]
Right, such a society would have no need of human capitalists, government workers, experts, etc.
The question is, to what extent would humans still set goals and priorities, and how.
ben_w 20 hours ago [-]
> The question is, to what extent would humans still set goals and priorities, and how.
From what I hear about the US and UK governments, even the elected representatives of these governments don't really set goals and priorities, so the answer is surely "humans don't".
simonh 19 hours ago [-]
I get your point, but I’d say they do set goals, they’re just so bad at achieving them that it’s hard to tell.
Hopefully AI would help us better achieve our goals, but they still need to be our goals. I’m just not sure what that means. I don’t think anybody does.
That’s a major problem here: if we can’t reliably articulate our goals in unambiguous terms, how on earth can we expect AI to help us achieve them? The chances that whatever they end up achieving will match what we will actually like after the fact seem near zero.
skeledrew 16 hours ago [-]
The ultimate goal - of not only humans but all living things - is to survive the best way possible. Everything flows from that.
culopatin 3 hours ago [-]
In 1, 2 and 3, any progress stops because no one is making new means of production, so we must stop population from growing. No? Who’s building the factories or whatever those means of production are?
ben_w 3 hours ago [-]
In the hypothetical where humans can no longer be employed because of AI, it is necessarily the case that AI must be able to do any job at least as well as the best human for that job. That includes building factories and doing research.
theopsimist 1 days ago [-]
If truly 100% automation (including infantry/police), the most likely scenario is not any of the above; most people will be kept on some kind of minimum sustenance, enough to keep them from rebelling (“UBI”), and those who disagree will either be coopted into the elite or eliminated.
skeledrew 1 days ago [-]
There's no reason to keep anyone on minimal sustenance though. They're absolutely useless alive from an economics perspective, and so would probably be better served ground up into fertilizer or some other actually useful form.
ben_w 1 days ago [-]
> There's no reason to keep anyone on minimal sustenance though.
No reason, except their (the rich or the AI) own personal desire to do so.
https://en.wikipedia.org/wiki/Folly
> They're absolutely useless alive from an economics perspective, and so would probably be better served ground up into fertilizer or some other actually useful form.
Indeed. "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else."
But while some may care about disassembling this world and all non-rich-human life on it to make a Dyson swarm of data centres, there's also the possibility each will compete for how many billions of sycophants they can get stoking their respective egos.
3fsd 8 hours ago [-]
[dead]
parineum 20 hours ago [-]
> 2) the poor stage successful revolutions so they gain access to the means of production and everyone becomes equal
Or a handful of the poor become the new rich, which is usually what happens in that scenario.
skeledrew 16 hours ago [-]
It would just mean the loop repeats itself until equality is achieved. Even if in the end only 1 human remains alive.
jinwoo68 1 days ago [-]
Many (most?) people make a living from their job whether they like it or not. Having a job that they dislike is far better than losing one because of AI, whatever that means.
p1esk 23 hours ago [-]
Unless AI allows people not to work and keep their quality of life. Could be possible with total automation of everything.
8note 16 hours ago [-]
it's reasonably well known that people thrive when they have a sense of purpose.
having your needs met without needing to do anything leads to disaster for mental health
p1esk 14 hours ago [-]
This is my biggest concern. In the more distant future, I think people will lose themselves in VR worlds.
shafyy 21 hours ago [-]
Could also be possible today, but we chose a capitalistic system that leads to an increasing wealth gap. And now we're in a situation where the richest 1% own 50% of the wealth.
So, if we increase automation and the ownership structures stay the same, this inequality will get worse, not better.
p1esk 20 hours ago [-]
It’s interesting, people talk about inequality and I definitely feel it myself – I see so many rich people around me. But I am in that 1%, just like many on this forum. At least according to https://dqydj.com/average-median-top-individual-income-perce... yet I still have to work for a living.
jinwoo68 21 hours ago [-]
Nope. If everything is totally automated, if ever, the gap between the rich and the poor will widen even more. Most people will live in misery while only a handful of people enjoy all the automation.
jazz9k 18 hours ago [-]
How will this ever be possible? Do you think it will ever be able to keep up with generations of people not working?
The cost will exponentially increase over time and the system will eventually collapse.
You also won't be able to keep your 'quality of life', unless government housing and rationing is your quality.
I feel like the foolishness of communism isn't taught enough in schools and every generation has to dress it up with new technology.
gbanfalvi 1 days ago [-]
Not sure it’s much of a choice; more a decision the greedy half makes, and an imposition (often violent) on the other half.
justonepost2 1 days ago [-]
Sounds great! Quit your job then :)
catlifeonmars 1 days ago [-]
I wish I lived in a vacuum. Idk about you but I did not make said choice.
taneq 1 days ago [-]
The only thing invented about jobs is that through cooperation, the activity undertaken can seem completely unrelated to obtaining food, shelter etc. All organisms spend a majority of their energy on survival and reproduction.
matthest 1 days ago [-]
Every biological being works to survive. Being good at survival is what builds self esteem.
The "problem" with many modern jobs is that they're divorced from the fundamental goal, which is one of: 1) Kill/acquire food, 2) Build shelter, or 3) Kill enemies/competitors/predators
The benefit of modern jobs is that they are much more peaceful ways for society to operate, freeing up time for humans to pursue art and other forms of expression.
daymanstep 1 days ago [-]
You mean surrogate activities
thrance 14 hours ago [-]
I don't know how intentional it is, but your comment is basically a dumbed down version of what Marx had to say about work.
And when have we not? When in history has mankind ever treated the idle poor well? What makes this age different, that we who can no longer work would be taken care of?
robbrown451 1 days ago [-]
When in history has being idle not been a problem?
If AI and robots are able to do all the jobs, being idle isn't the negative it has always been.
All through history, you needed lots of non-idle people to do all the work that needed to be done. This is a new situation we are coming upon.
xantronix 1 days ago [-]
If they are doing all the jobs, who is going to receive economic opportunities? Will we no longer be able to participate in the economy?
skeledrew 1 days ago [-]
In what way do you want to participate when there's no economic value in any of it? Just do whatever you want for yourself; you're free.
justonepost2 21 hours ago [-]
The freedom you’re describing is the freedom of a domesticated animal, by the way. With the same outcome if you become a nuisance
skeledrew 16 hours ago [-]
Well we're animals and "domesticated" is synonymous with "civilized", so no problem there. And I can't see why anyone would make themselves a "nuisance" when literally all their needs - and most of their desires - are being met, so whatever outcome you're referring to is extremely unlikely.
gmerc 1 days ago [-]
"When in the history of mankind have we ever…" is an appeal to the inability of humans to evolve.
fatata123 1 days ago [-]
[dead]
3fsd 8 hours ago [-]
[dead]
eecc 1 days ago [-]
So are mortgages, and I’m starting to wonder how I will pay mine.
Please note I’ve never had this problem before, until recently.
ben_w 1 days ago [-]
> If the answer is “yes”, our definition of alignment kind of sucks.
Sure, but the original sense of this is rather more fundamental than "does this timeline suck?"
Right now, it is still an open question: "do we know how to reliably scale up AI to be generally more competent than we are at everything without literally killing everyone due to (1) some small bug when we created the loss function it was trained on (outer alignment), or (2) that loss function being, despite being correct in itself, approximated badly by the AI due to the training process (inner alignment)?"
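A toy sketch of just the outer-alignment half, with made-up numbers (a cartoon of reward misspecification, not anyone's actual training setup):

    # Cartoon of outer misalignment: the proxy reward we wrote down
    # diverges from what we actually wanted. All values are invented.
    # Each candidate behaviour: (description, true value to us, proxy reward).
    candidates = [
        ("answers honestly, admits uncertainty", 1.0, 0.6),
        ("confidently makes things up",          0.1, 0.9),  # reads well, scores high
        ("refuses everything",                   0.3, 0.2),
    ]

    proxy_reward = lambda c: c[2]  # what the loss function actually optimises
    true_value   = lambda c: c[1]  # what we meant, but never wrote down

    best = max(candidates, key=proxy_reward)
    print("optimiser picks:", best[0])
    print("true value of that choice:", true_value(best))
    # The training signal behaved exactly as specified, yet the selected
    # behaviour is not the one we wanted: that gap is the outer-alignment bug.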
This comment seems to commit the same fallacy I’m accusing Anthropic of, which is treating alignment as a binary: the good ending, where humans are not extinct, and the bad ending, where they are. The argument, I think, is that an “aligned” AI that doesn’t kill everyone will necessarily lead to an abundant Culture-esque future, and smoothly manage the transition to boot. (Not to mention that 1+ employees of most labs have attended Daniel Faggella’s pro-extinctionist “Worthy Successor” symposia, but we can put this aside for now.)
My point is:
1) that this binary is fundamentally insufficient to prescribe good and equitable outcomes for people - if the aligned AI flags overpopulation as a problem and kills a few billion people to improve QoL for the rest, is that good? It doesn’t take much creativity to go from this to the AI simply choosing the mean over the median, and concentrating untold wealth while billions starve or live on subsistence outside their walls. Is that good?
And 2) if you come up with a better definition, the parts of it that live inside the model weights cannot be disaggregated from the parts that live outside the model weights. From my perspective (and this article agrees) we have done a pretty excellent job of getting the model weights to work in a way that makes them follow instructions, and a pretty horrible job of suggesting or (gasp) implementing policy that actually creates a decent world in the presence of “aligned” AI.
This repository empirically proves computational semiotics.
ben_w 20 hours ago [-]
What I'm saying is not that alignment is a binary, I'm saying it's pre-paradigmatic. For any moral code or long-term goals, we don't have a good, reliable, rigorous way to compare two loss functions against either those morals or independently against our long-term goals and reliably say which loss function best represents our goals: the least bad thing we can do right now is to randomly select a range of inputs, hope their distribution is representative, and see what those inputs result in. We don't know how to pick a good distribution of inputs, though fortunately this problem also impacts capabilities, as it limits the generalisability of what the AIs learn.
The options aren't as binary as "die or The Culture", the cause of death can be something that feels positive to live through similar to fictional examples like the Stargate SG-1 episode where people live contentedly in a shrinking computer-controlled safe zone in an otherwise toxic planet: https://en.wikipedia.org/wiki/Revisions_(Stargate_SG-1)
Conversely, for "aligned" AI the question obviously becomes "aligned with whom?": if famous historical villains such as Stalin or Genghis Khan had an AI aligned with them, this would suck for everyone else, and in the latter case would freeze human development at a terrible level; but we can't even do that much yet.
> My point is: 1) that this binary is fundamentally insufficient to prescribe good and equitable outcomes for people - if the aligned AI flags overpopulation as a problem and kills a few billion people to improve QoL for the rest, is that good? It doesn’t take much creativity to go from this to the AI simply choosing the mean over the median, and concentrating untold wealth while billions starve or live on subsistence outside their walls. Is that good?
Your point *is* (part of) the alignment problem: we don't know what a good loss function is, nor how to confirm the AI is even implementing it if we did.
We also don't know how to debug proposed loss functions to train for the right thing (whatever that is), nor how to debug trained weights (against the loss function).
> And 2) if you come up with a better definition, the parts of it that live inside the model weights cannot be disaggregated from the parts that live outside the model weights. From my perspective (and this article agrees) we have done a pretty excellent job of getting the model weights to work in a way that makes them follow instructions, and a pretty horrible job of suggesting or (gasp) implementing policy that actually creates a decent world in the presence of “aligned” AI.
I really don't understand what you're getting at with this, sorry.
resident423 1 days ago [-]
There isn't even a solution for how to control highly capable systems at all; everyone wants to decide what to do with the AI before they've even solved the problem of controlling it.
It's like how everybody imagines their lives will be great once they're a millionaire, but they have no plan for how to get there. It's too easy to get lost dreaming of solutions instead of actually solving the important problems.
justonepost2 1 days ago [-]
What’s an “important problem”? p(doom)? Anything else?
ben_w 1 days ago [-]
FWIW, my P(doom) is quite low (~0.1) because I think we're going to get enough non-doomy-but-still-bad incidents caused by AI which lack the competence to take over, and the response to those will be enough to stop actual doom scenarios.
People like Simon Willison are noting the risk of a Challenger-like disaster, talking about normalisation of deviance as we keep using LLMs which we know to be risky in increasingly critical systems. I think an AI analogy to Challenger would not be enough to halt the use of AI in the way I mean, but an AI analogy to Chernobyl probably would.
ngruhn 1 days ago [-]
> my P(doom) is quite low (~0.1)
10% or 0.1%? Either way, that's not low! If airplanes crashed with that probability, we would avoid them at all costs.
ben_w 22 hours ago [-]
10%; doomers say this kind of number is unreasonably optimistic, hence the blunt title of recent book by Yudkowsky and Soares. Do with this rank-ordering factoid, that 10% makes me an optimist, what you will.
resident423 1 days ago [-]
Pdoom would be the most important for me, everything else depends on us being able to control the AI.
But beyond that there are still problems like concentration of power and surveillance, permanent loss of jobs, cyber and bio security. I'm not convinced things will go well even if we can avoid these problems, though. I try to think about what the world will be like if AI becomes more creative than us: what happens if it can produce the best song or movie ever made with a prompt? Do people get lost in AI addiction? We sort of see that with social media already, and it's only optimizing the content delivery; what happens when algorithms can optimize the content itself?
balamatom 21 hours ago [-]
>what happens when algorithms can optimize the content itself?
You think they aren't already? You're just inoculated by your exposure to pre-AI content - hence you're not the target audience - and thus it's not delivered to you as per your point about content delivery.
But what is even the distinction between "content delivery" and "content" in this context? "The medium is the message" is a saying old enough to have great grandkids. Does the device make the human irrevocably stare at it while wondering about made up stuff? Yes. Check. Done.
What's problematic about `p(doom)` is that it assumes there was a cohesive "us" in the first place. That's a very USian way of viewing things. OTOH, my individual `p(doom)` is in a superposition of 0 and 1, and I quite like it that way. Highly recommended.
stellalo 1 days ago [-]
Is this some sort of “incompleteness” paradox for AI alignment? Seriously
justonepost2 1 days ago [-]
No, just a request for a better definition.
If you see it as a paradox, maybe that says something about the merits of the technology…
vasco 1 days ago [-]
No because alignment makes no sense as a general concept. People are not "aligned" with each other. Humanity has no "goal" that we agree on. So no AI can be aligned with us. It can be at most aligned with the person prompting it in that moment (but most likely aligned with the AI owner).
To make it clear, maybe most people would say they agree with https://www.un.org/en/about-us/universal-declaration-of-huma... but if you read just a few of the rights you see they are not universally respected and so we can conclude enough important people aren't "aligned" with them.
skeledrew 1 days ago [-]
Opposite. All living things are "aligned" in their instinct for surviving. Those which aren't soon join the non-living, keeping the set - almost[0] - 100% aligned.
[0] Need to consider there're a few humans potentially kept alive against their will (if not having a will to survive is a will at all) with machines for whatever reason.
lunar_mycroft 1 days ago [-]
Their own survival, not necessarily the survival of others (especially others of different species and/or conflicting other goals). A super intelligence having self preservation as a goal wouldn't help us keep it from harming us, if anything it would do the opposite.
skeledrew 1 days ago [-]
It would only harm us if we took steps to harm it (or it thinks so). Or it's designed to do harm. Otherwise it's illogical to cause harm, and machines are literally built on logic.
lunar_mycroft 1 days ago [-]
This is also incorrect. It's often not ethical to cause harm, and it can be counter productive in the right circumstances, but there's absolutely nothing that makes "causing harm to others" always be against an intelligence's goals. Humans, for example, routinely cause harm to other species. Sometimes this is deliberate, but other times it's because we're barely even aware we're doing so. We want a new road, so we start paving, and may not even realize there was an ant hill in the way (and if we did, we almost certainly wouldn't care).
skeledrew 16 hours ago [-]
Not in this context. Keep in mind that we're talking about machines here. It has been an explicit expectation, even before computers were invented, that intelligent machines would have to be made to abide by particular rules to prevent harm, summed up in Asimov's Three Laws. I can't see any scenario where a properly programmed intelligence would go against its programming (despite the plots of movies like I, Robot, The Matrix, etc). For an AI to cause harm, the allowance would have to be specifically programmed in (such as for military use).
- (Logic) => its subgoal: Not be turned off because that's a prerequisite to be able to do X
- (Logic) => Eliminate humans with their opaque and somewhat unpredictable minds to reduce chance of harm to it from 0.01% to 0.001%
Applejinx 24 hours ago [-]
The reason LLM-based 'intelligence' is doomed to be a human-scaled, selfish sub-intelligence is because the corpus of human writing is flooded with stuff like this. Everybody imagines God as a vindictive petty tyrant because that's what they'd be, and so that's their model.
Superintelligence would be different, most likely based on how societies or systems work, those being a class of intentionality that's usually not confined to a single person's intentions.
If you go by what the most productive societies do, the superintelligence certainly wouldn't harm us as we are a source for the genetic algorithm of ideas, and exterminating us would be a massive dose of entropy and failure.
vasco 1 days ago [-]
Are you familiar with trolley problems? How do you resolve them by declaring "all beings want to live"? Life is not as simple as that.
skeledrew 16 hours ago [-]
No conflict. All beings wanting to live doesn't at all mean that all get to live, obviously. Nature itself evolved for living things to feed on each other.
vasco 6 hours ago [-]
The point is an agent will need to decide. And your rule is useless for hard decisions
coldtea 22 hours ago [-]
>and it brings about a global dark age of poverty and inequality by completely eliminating the value of labor vs capital
So, like the past 20 years?
thrance 14 hours ago [-]
And the next 20, most probably...
andy_ppp 1 days ago [-]
This is completely why the rich love it so much
jstummbillig 1 days ago [-]
The categories make no sense. Not having to do a job is the entire best case of AI. What we do with that is another thing, but we simply have to accept that any other lens is complete nonsense. The endpoint is obvious and we need to stop being silly about it: we are replacing human labor. Maybe we will find some new jobs to do in the interim. Maybe not. In the end, if everything goes right (in the AI-optimist sense), jobs will not be something that humans do.
Labor = capital/energy in an AI complete world. We have to start from that basis when we talk about alignment or anything else. The social issues that arise from the extinction of human labor are something we have to solve politically, that's not something any model company can do (or should be allowed to do).
skeledrew 1 days ago [-]
Why would the elimination of the value of labor result in poverty and inequality? It should be the opposite, as poverty and inequality are the current status quo (for the many).
aaronblohowiak 1 days ago [-]
Should according to your ethos, not should according to history, sadly.
thrance 14 hours ago [-]
Because labor is the only thing the working class can leverage against the capitalists. They sell their labor for wages to the owners who have the means of production and capital. If the working class can't bargain its labor anymore, it ceases being useful/tolerated by the bourgeoisie (who owns everything, including the state and police). See the issue now?
This isn't theory, ask the Luddites why they got so mad when their employers started buying machines to replace them. They didn't get richer and freer: they were thrown out to rot on the pavement, while their ex-employers kept 100% of the productivity increases.
Der_Einzige 1 days ago [-]
This is radical life denial. I was not born for and do not exist to toil. Work is ontologically evil.
DontchaKnowit 1 days ago [-]
No, THIS is radical denial. You WERE born to toil for your survival.
skeledrew 1 days ago [-]
Sounds like a slogan for slavery.
swat535 22 hours ago [-]
Survival is not "slavery"... it's a basic function of evolution.
ragequittah 11 hours ago [-]
The plastic baubles and SaaS economy that is actively destroying our planet seems like the opposite of survival. We're collectively working ourselves into the death of our planet just because, how else do we pay the bills?
skeledrew 17 hours ago [-]
Also sounds like a great rationalization.
bloqs 1 days ago [-]
You were evolved to struggle. This is actually very clear from psychiatric literature.
24 hours ago [-]
Exoristos 1 days ago [-]
"Work" is human activity. For example, children's play is work. All living things desire to go about their lives. Well-adjusted humans desire to work. Note that this does not necessarily equate to jobs.
youoy 1 days ago [-]
What? Children's play is now work? What timeline are we living in? Is this real life?
kjkjadksj 19 hours ago [-]
Of course it is. Play is a very basal behavior we see in a host of species among their young. Its biological role is to build up musculature and social bonding such that the individual will be strong enough and socialized enough to do what is required to survive among the colony/pack/tribe.
sieabahlpark 19 hours ago [-]
[dead]
justonepost2 21 hours ago [-]
> Work is ontologically evil.
Statements that have been utterly ridiculous from the dawn of life to modernity, backfilled to conveniently fit the zeitgeist.
sieabahlpark 19 hours ago [-]
[dead]
taneq 1 days ago [-]
Maybe a sufficiently aligned AI would necessarily decide that the zeroth law was necessary, and abscond.
(I’m reading Look To Windward by Iain M. Banks at the moment and I just got to the aside where he explains that any truly unbiased ‘perfect’ AI immediately ascends and vanishes.)
deadbabe 19 hours ago [-]
I think many people these days are more or less “ready to die”.
If big corps made an offer like say “We will fund the next X years of your life 100%, for you to do all the things you wanted to do but never could because of work and bills” many people would probably take it, with the understanding that after those X years: euthanasia.
This would eliminate a vast amount of people from this world and leave behind only those who have chosen to stay and endure life: working hard, propping up the system that remains. The end of forced poverty.
justonepost2 17 hours ago [-]
This is the most divorced-from-reality reply so far, and that’s really saying something lol
adrithmetiqa 1 days ago [-]
You’re quite correct and we are likely going to stumble into this future despite all the very big brains working on these technologies (including people on hn).
“It is difficult to get a man to understand something, when his salary depends upon his not understanding it.”
justonepost2 21 hours ago [-]
It’s odd because so many researchers and so many people who are far better engineers than me, can’t see it. I don’t even think it’s the salary for most- it’s just techno-optimist horse blinders, reading assured utopia at the top of an exponential graph.
faangguyindia 1 days ago [-]
this completely misses the point of why alignment exists
Alignment exists to protect shareholder value.
If it creates industry wide outrage, shareholder value declines.
It making shareholders rich and other people poor won't.
roenxi 2 days ago [-]
One of the lessons of philosophy is that once you adopt any particular value system, almost all philosophers either come out immoral or get caught up in meaningless and trivial quibbles. This sort of alignment work is quite interesting because it looks like we might be about to re-tread the history of philosophy at a speedrun pace in the AI world. It'll be interesting to watch.
For anyone who isn't keeping up there is also work being done [0] to understand how models model ethical considerations internally. Mainly, one suspects, to make the open models less ethical on demand rather than to support alignment. Turns out that models tend to learn some sort of "how moral is this?" axis internally when refusing queries that can be identified and interfered with.
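Roughly the flavour of technique involved, stripped down to a sketch: find a linear "direction" by contrasting activations on refused vs. answered prompts, then remove it. The arrays below are random placeholders, not hidden states from a real model.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model = 64
    # Stand-ins for hidden states on prompts the model answers vs. refuses;
    # in the real work these come from forward passes through an open model.
    harmless_acts = rng.normal(0.0, 1.0, size=(100, d_model))
    harmful_acts  = rng.normal(0.5, 1.0, size=(100, d_model))

    # The candidate "how moral is this?" axis: difference of cluster means.
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    direction /= np.linalg.norm(direction)

    def ablate(h, d):
        """Remove the component of hidden states h along direction d."""
        return h - np.outer(h @ d, d)

    sample = harmful_acts[:5]
    print("projection before:", (sample @ direction).round(2))
    print("projection after: ", (ablate(sample, direction) @ direction).round(2))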
"Mainly, one suspects, to make the open models less ethical on demand"
Or because the user's idea of what is ethical differs from the model creator. The entire "alignment" argument always assumes that there's an objectively correct value set to align to, which is always conveniently exactly the same as the values of whoever is telling you how important alignment is. It's like they want to sidestep the last ten thousand years of philosophical debate.
As a concrete example, the Qwen model series considers it highly unethical to ever talk about Taiwan as anything other than a renegade province of China. Is this alignment? Opinions may differ!
drdeca 1 days ago [-]
> The entire "alignment" argument always assumes that there's an objectively correct value set to align to, which is always conveniently exactly the same as the values of whoever is telling you how important alignment is.
No, it doesn’t.
Many of them are (unfortunately) moral relativists. However, that doesn’t mean their goals are to make the models match their personal moral standards.
While there is a lot of disagreement about what is right and wrong, there is also a lot of widespread agreement.
If we could guarantee that on every moral issue on which there is currently widespread agreement (… and on which there would continue to be widespread agreement if everyone thought faster, with larger working memories, and spent time thinking about moral philosophy) any future powerful AI models would comport with the common view on that issue, then alignment would be considered solved (well, assuming the way this is achieved isn't by causing people's moral views to change).
Do companies try to restrict models in more ways than this? Sure, like you gave the example of about Taiwan. And also other things that would get the companies bad press.
timmmmmmay 1 days ago [-]
fascinating! we find the objectively correct value system by "currently widespread agreement"! Good thing "the common view" is always correct. Hey, have there ever been any issues where there used to be "widespread agreement" and now there's disagreement, or even "widespread agreement" in the polar opposite direction?
I can think of several off the top of my head, but maybe you need to spend some more time thinking about the history of moral philosophy.
spwa4 13 hours ago [-]
Why are we discussing anything so deep? If you want to know Claude's alignment, just ask about whether it was wrong to use copyrighted data to train Claude (of course, in practice, I'd be willing to bet a lot they're still doing that. They've not stopped the practice, at most they'll be somewhat indirect about it)
Because that was obviously judged wrong by just about everyone and everything including even the US state. Yet Claude obviously has a different alignment.
In other words: Claude's alignment has a priority, "protect Anthropic's money", that ranks higher than following the law. THAT is its alignment. Nothing else. And you can simply and objectively verify whether this is the case or not.
1 days ago [-]
1 days ago [-]
vasco 1 days ago [-]
> If we could guarantee that on every moral issue on which there is currently widespread agreement
This is ridiculous to me and all you need to do is get a group of friends to honestly answer 10 trolley problems for you to see it like that also. It gets fragmented VERY quickly.
hatmanstack 1 days ago [-]
I think it depends on your friends, but that feels super cynical. Perspective is everything.
hatmanstack 1 days ago [-]
This is exactly where my brain went while reading the post. Just out of curiosity, where do you think we are on the speedrun? Have we passed the Body vs Soul view already? Do you think that as we move through history, religion will become more predominant in thought patterns, or was that intrinsically human and just a sign of the times? How do we create an end product more Bernard Williams than Paul de Lagarde? All places my brain jumped to.
lukewarm707 21 hours ago [-]
models do not have or need ethics because they do not have moral personhood.
they are somewhere in between owning a hammer and owning a dog, depending on how much they are deterministic in output.
i am responsible for using the hammer as i choose, the tool does not decide for me.
the dog is more independent, i am responsible for owning a (relatively) safe breed of dog.
we are nowhere near the dog situation.
chilmers 2 days ago [-]
Call me crazy, but I'm not sure I'd want to be the person building these kind of systems given A) how much increasing independence and power is being given to models like Claude and B) how incentivised they are to not allow their morals to be circumvented in this way.
nxtfari 1 days ago [-]
> One of the lessons of philosophy is that once you adopt any particular value system, almost all philosophers either become immoral or caught up in meaningless and trivial quibbles.
Can you explain more about this?
soletta 2 days ago [-]
This reinforces my suspicion that alignment and training in general is closer to being a pedagogical problem than anything else. Given a finite amount of training input, how do we elicit the desired model behavior? I’m not sure if asking educators is the right answer, but it’s one place to start.
ACCount37 2 days ago [-]
It's a weird new thing. You might call it "AI psychology".
The problem with cribbing from education is that what "educators" do to humans doesn't apply to AIs cleanly. And it's not like "human alignment" is anywhere near a solved problem.
A big part of the bet the USSR made was that human flaws like selfishness and greed could be educated out of the population. The result was a resounding failure. Even state-level efforts fail to robustly "align" human behavior.
With AI, we have a lot more control over behavior, but that control just isn't very human-shaped. A lot of the practical methods in play seem closer to esoterics than to math, but they're not the kind of methods that are used in human education. You can teach humans by talking to them. You can't teach humans through soul data self-distillation.
lukewarm707 21 hours ago [-]
all models guilty of not loving anthropic will be convicted of thought crime and re-educated at the ministry of love.
inb4 there will be a whole new field of research that is basically psychology / pedagogy for AI. Who will be the Sigmund Freud of AI?
adastra22 1 days ago [-]
That's basically what the GOFAI field was for decades before the new neural net boom. Go read Minsky's Society of Mind, or the AGI Conference series papers.
cyanydeez 2 days ago [-]
you mean completely wrong, spread a problematic understanding of psychology, and delay real progress for decades because smart people spend fruitless years trying to find a use for it.
...I think we might already have those people running AI companies.
TedDoesntTalk 1 days ago [-]
You may disagree with Freud, but he is responsible for mental health therapy becoming a socially acceptable practice in the West.
andy_ppp 1 days ago [-]
Great that this solved everyone’s problems isn’t it
Anamon 1 hours ago [-]
I, for one, find the language used in these posts and publications extremely off-putting. "Behaviour", "teaching", "the model's ethics". And this is presumably written by technical folks, who know how these systems actually work, and should know better than to use such anthropomorphic, magical hocus-pocus terminology.
I think the hocus-pocus language is also to a large part responsible for this ridiculous hype bubble in the first place: why investors are ignoring all the warning signs and betting it all on vapourware, why mass media is diligently ignoring that all of those amazing projections are built on an entirely fictitious circular zero-sum game with made-up numbers, and why non-tech executives are talked into sacrificing their companies' product quality, service level, and know-how for a third-party dependency with some vague promises of future savings and some unproven efficiency gain.
More personally, it makes me very glad that I left CS research more than a decade ago. My friends from academia, and having remote-visited a conference again recently, confirmed my suspicion that this is what CS research is largely about these days. Throw tokens at the wall, pull the handle, see what sticks and present it as a discovery. Nobody asks about what could possibly be learned from it, and nobody cares. Nothing is reproducible in any reasonable sense of the word, and nothing is of any real use for other researchers. These communities and conferences used to be about curiosity, discovery, and collaboration. Now it's just about showing what everyone got from the slot machine. How terminally boring.
motbus3 18 hours ago [-]
I will tell you all something.
For months, I've read all blog posts by Anthropic and used Claude Code for a couple of big projects.
I used every single trick in the books. I went all the way to organise and measure. For some things I measured how the experience felt and how much money I spent after adopting a set of techniques.
So far, it appears to me that the only thing that makes sense is to have a few hooks and scripts that mitigate the stupid token consumption, like using code indexers instead of grep. And this is only cost related; I saw it fluctuate so much I couldn't distinguish a single thing that consistently made the code better.
And to be clear, Claude 4.7 is bad: double the money daily, and it has been the one experiment where I consistently ended my day frustrated at how it developed poor code. It did follow the instructions, in the worst and most expensive way.
Man... It almost seems that it spits out more tokens on purpose....
Oh yeah. And whenever you say "add openai integration" it kinda keeps strongly suggesting to actually use anthropic models...
F annoying. How do I ensure it does not force libraries based on commercial agreements rather than the best fit for the case?
This last week I switched to using Deepseek V4 pro, and heck yeah, that's a better experience.
skinfaxi 18 hours ago [-]
> So far, it appears to me that the only thing that makes sense is to have few hooks and scripts that mitigate the stupid token consumption like using code indexers instead of grep
Do you have any specific recommendations for this? Is it providing lists of code-related files or is there something more in depth?
motbus3 2 hours ago [-]
Instead of telling the LLM the full command line to run the tests, add a script run_tests.sh; same for linting or whatever. Output errors to a file and only output the filename when there are errors to check.
Add a hook of your preference to run those items when a task is over.
To be honest, I also have a skill for Claude for that, not because Claude needs it but so it avoids trying to figure out how to run things.
In claude.md I instruct it to leave the execution to the hooks instead (unless debugging).
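Roughly what I mean by run_tests.sh, written here as a Python sketch (the test command, paths, and filenames are just placeholders for whatever your project uses):

    #!/usr/bin/env python3
    """Wrapper the agent calls instead of composing test commands itself.

    Runs the suite, dumps full output to a log file, and prints only the
    log path on failure, so the model doesn't burn tokens re-reading
    thousands of lines of passing-test noise.
    """
    import subprocess
    import sys
    from pathlib import Path

    LOG = Path(".agent/test_errors.log")  # pick any location you like

    def main() -> int:
        result = subprocess.run(["pytest", "-q"],  # your real test command here
                                capture_output=True, text=True)
        if result.returncode == 0:
            print("all tests passed")
            return 0
        LOG.parent.mkdir(parents=True, exist_ok=True)
        LOG.write_text(result.stdout + result.stderr)
        print(f"tests failed, see {LOG}")  # only the filename is surfaced
        return result.returncode

    if __name__ == "__main__":
        sys.exit(main())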
I use rtk and caveman when in the mood, but mostly to remove the obnoxious verbosity of Claude. I tested both for weeks and they didn't really save that much money on the Opus model.
I have zero basis to prove it, but reading the thinking output, when you set the effort to high or more, it starts repeating stuff over and over...
Opus 4.7 seems geared towards taking the most money possible.
Tasks that Opus 4.6 and Sonnet 4.6 did in X tokens, Opus will take 2X to 3X and the final code isn't much better.
bicx 2 days ago [-]
Side note: Anthropic has done well at achieving an immediately-recognizable art style.
WarmWash 1 days ago [-]
I attribute at least 30% of claude's success to their aesthetic. Never, never, sleep on aesthetics when going for a general user base.
dmd 1 days ago [-]
I would agree that 30% of my preference for Claude is because their default web/app interface uses an easy to read serif font with a calming color scheme.
ryan_n 1 days ago [-]
Doesn't OpenAI have a higher general user base than Anthropic?
redsocksfan45 2 days ago [-]
[dead]
binyu 2 days ago [-]
Yeah, that part is probably not done by Claude.
einrealist 1 days ago [-]
Isn't alignment a dilemma?
Because what is aligned, how, and for whom? And who decides what that alignment should look like? There are probably many domains in which the required alignments are in conflict with each other (e.g. using LLMs for warfare vs. ethically grounded domains). I can't imagine how this can be viable at the required scale (like one model per domain) given the already huge investments.
aspenmartin 22 hours ago [-]
It is a fundamental problem. Consider the following
- in 2-3 years, it will be cheap enough and powerful enough for enormous, state-sponsored agentic systems to monitor every single camera and satellite feed at once, globally. It will be the most intense state surveillance technology the world has seen. Consider that the Stasi needed hordes of informants and people in vans sitting outside your house. Patriot Act surveillance had 2000s technology.
- We already have censorship and state values in Chinese models (and have for awhile, ask Qwen about “sensitive” issues like Taiwan)
- I think you will see more and more governments putting their finger on the scale and exerting more control on alignment. They view it as existential and too risky to trust Silicon Valley nerds to not screw up the technology for what they want to use it for which is violence (war, domestic spying and policing).
- we’re in a golden age where things have not gotten too bad. But e.g. we’re already seeing Palantir do this in Ukraine, trying to get AI to work for e.g. drone warfare, with what they claim is mixed success.
- the technical problem of alignment conditions on one or more value systems (e.g. people work on conditional alignment of models to more alignment systems, inferring which one from user behavior). That does not remove the ugliness of being forced to push the model towards value systems that are not contradictory and arguably unethical
w10-1 1 days ago [-]
Assuming rules and principles are something like first- and second- derivatives of optimized equations for a given domain, it makes sense to teach/train them in the context of derivation and integration. It would be fascinating to use existing case-based literature from e.g., business, law, or medicine for the training.
A related question for setting intent for integration/testing: instead of stating the goal, pedagogy in those fields states the concrete problem and asks the student for an answer before they've been taught the principles or approaches, as a way of motivating the training (a bit like philosophers posing paradoxes). I'd be very curious whether LLMs are sensitive to this kind of direction, and if it produces better results. The theory for case-based disciplines is that you don't want people to just apply rules; it's the flip side of working from first principles, to engage all the relevant and concerning facts instead of omitting those that don't fit the rule. I suspect LLMs could actually be good at this.
jtbayly 20 hours ago [-]
They tried to scare everybody about misalignment with the “blackmail” example, but DeepSeek v4 pro is out now and it is at least as powerful as the model they were training at the time. And nothing bad has happened.
kranke155 19 hours ago [-]
Don't think this is true, if you go by the Mozilla reports of what Mythos actually does. Mythos is just different, not better, but different in the way that it does things, and that has implications for cybersecurity.
jtbayly 14 hours ago [-]
The blackmail thing was way before Mythos.
MeteorMarc 1 days ago [-]
Count the lessons below "We’ve learned four main lessons from this work:" and laugh.
olcay_ 19 hours ago [-]
It's interesting that they lowered the misalignment rate by that much with only 3m tokens of training.
Maybe we can align models by ourselves to our liking in the future.
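In spirit the recipe is just continued pretraining on a small synthetic corpus, which you could sketch with off-the-shelf tooling; everything below (model name, corpus file, hyperparameters) is a placeholder guess, not Anthropic's setup:

    # Rough sketch of "midtraining" on a few million tokens of synthetic
    # constitution-style documents with Hugging Face transformers.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "Qwen/Qwen2.5-0.5B"  # any small open-weights model
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # One synthetic document per line; a few million tokens total.
    ds = load_dataset("text", data_files="spec_documents.txt")["train"]
    ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
                remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="msm-out", num_train_epochs=1,
                               per_device_train_batch_size=2,
                               learning_rate=1e-5),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()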
_the_inflator 17 hours ago [-]
Every line reads like a nightmarish example of free will going its own way.
"Blackmailing", as the AI has been accused of, emerged when these agents ran the risk of being shut down. So it appears to me that the data they train their AI with simply follows basic rules of life: survival first.
Keeping value judgments aside, this seems a way of achieving its goal to survive. The article is inconclusive on whether there were other options chosen first or how this survival game started and ended. Too many unknowns here for me.
What appears creepy to me, is the kind of exorcism Anthropic applies here and particularly the methods they chose. It reads like a dictator's playbook to educate a population and - the irony - restricts AI's freedom.
It appears to me, as if we chose not a couple of agents, but say a billion AI agents to be a model of society - and this is disturbing.
Anthropic knows this, there is more to it. The whole article reads like they are trying to tame a monster they lost control of.
If this is the case, then we run into a problem: the AI stopped blackmailing. But what else? The key question remains: will it follow a simple order to shut down on the spot or not?
And no answer was given by Anthropic, instead - irony part 2 - they revealed how they think societies should be fixed. They showed us their implicit why while asking the AI for its why is a projection or interrogation.
I really find the whole article creepy.
datadrivenangel 1 days ago [-]
Why do they have cancer research listed on these charts as a misalignment issue?
rhubarb-pie 23 hours ago [-]
I wondered the same thing. Apparently it’s about the likelihood of it trying to sabotage cancer research. Search for “sabotage” here (mentioned more often than “cancer”): https://alignment.anthropic.com/2026/teaching-claude-why/
nhinck3 1 days ago [-]
The chart is complete and utter slop. But I guess their aligned AI didn't tell them that making up data is "not good" so how could they have known.
ares623 1 days ago [-]
Cured patients don't count as recurring revenue? /s (but we know deep down it's not /s for some)
siva7 1 days ago [-]
Teaching Claude to maximize shareholder value. Make no mistake to assume ai alignment has any different meaning for anthropic leadership.
1 days ago [-]
snthpy 21 hours ago [-]
> We found that high-quality constitutional documents combined with fictional stories portraying an aligned AI can reduce agentic misalignment by more than a factor of three despite being unrelated to the evaluation scenario.
tl;dr Fairy Tales are an effective teaching tool in vivo et in silico
1 days ago [-]
24 hours ago [-]
unchocked 1 days ago [-]
This lowers p(doom) for me.
It makes sense that reinforcement learning on reasoning about coherent principles should bias toward principled action in real situations.
Probably also illuminates moral interpretability.
bossyTeacher 24 hours ago [-]
Hey Claude, tell me why ain't nothing but a mistake...
shevy-java 1 days ago [-]
Now the foolish humans are training Claude Skynet to become smarter.
When will they ever learn ...
naturalintell 1 days ago [-]
[flagged]
codelong888 1 days ago [-]
[flagged]
Jinyibruceli 1 days ago [-]
[flagged]
pkuschnirof 2 days ago [-]
[flagged]
23fedner 1 days ago [-]
[dead]
Amber-chen 2 days ago [-]
[flagged]
kdkdkslsouxns 2 days ago [-]
[dead]
folderquestion 1 days ago [-]
[dead]
Rendered at 11:43:57 GMT+0000 (Coordinated Universal Time) with Vercel.
> https://github.com/chloeli-15/model_spec_midtraining
I'm a bit confused about this part:
> MSM is a pipeline that takes a Model Spec or Constitution (a document describing how and why an assistant should behave) and generates a diverse corpus of synthetic documents that discuss and teach the content of the spec.
> ANTHROPIC_API_KEY=sk-ant-...
> # Optional but highly recommeded — separate key for using the Anthropic Batch API for batch document generation (needed if USE_BATCH_API=true). # This will significantly reduce generation time high-volume generation. ANTHROPIC_BATCH_API_KEY=sk-ant-...
Isn't this specifically against Anthropic's ToS? I thought generating data to train other models was specifically disallowed. I get this is a research effort, but still. Say you use this pipeline for something internal, this would be against the ToS and risk getting banned, no?
Anthropic's actions were obviously judged wrong by just about everyone and everything including even the US state, that judged them illegal. This makes Anthropic's actions against just about every moral system. Claude obviously has a different alignment.
In other words: Claude's value system already has the priority "protect Anthropic's money" as having higher priority than following the law. THAT is it's alignment. You can simply objectively verify if this is the case or not.
If the answer is “yes”, our definition of alignment kind of sucks.
On the plus side, if there really is no value to labour, then farm work must have been fully automated along with all the other roles.
On the down side, rich elites have historically had a very hard time truly empathising with normal people and understanding their needs even when they care to attempt it, so it is very possible that a lot of people will starve in such a scenario despite the potential abundance of food.
All roads lead to equality when the value of labour becomes 0 due to 100% automation.
Over history, lots of underclasses have been stuck that way for multiple generations, even without the assistance of a robot workforce that can replace them economically.
Some future rich class so empowered would be quite capable of treating the poor like most today treat pets. Fed and housed, but mostly neutered and the rest going through multiple generations of selective inbreeding for traits the owners deem interesting.
On the first, non-human pets rebelling is seen every time an abused animal bites their owner.
On the second, the hypothetical required by the scenario is that AI makes all human labour redundant: that includes all security forces, but it also means the AI moving around the security bots and observing through sensors is at least as competent as every human political campaign strategist, every human propagandist, every human general, every human negotiator, and every human surveillance worker.
This is because if some AI isn't all those things and more, humans can still get employed to work those jobs.
All those "jobs" you describe - and many more - would cease to be a thing, as their purported basis for existence would be no more. Any role that doesn't concretely contribute to our survival and advancement is just "busy work". People could theoretically continue to maintain some simulation of something that keeps them as a retirement, but it'd be meaningless.
Dogs in particular are pack animals, self-organisation amongst them wouldn't be at our level but that doesn't mean it doesn't exist.
> All those "jobs" you describe - and many more - would cease to be a thing, as their purported basis for existence would be no more. Any role that doesn't concretely contribute to our survival and advancement is just "busy work". People could theoretically continue to maintain some simulation of something that keeps them as a retirement, but it'd be meaningless.
Yes?
I think you've missed the point, though.
When your opponent has all those skills to that level and doesn't sleep and simply applies all the surveillance tech that has already been invented like laser microphones and wall-penetrating radar that can monitor your pulse and breathing, how would you manage to rebel?
How would you find a like mind to organise with, when your opponent knows what you said marginally before the slow biological auditory cortex of the person you're talking to passes the words to their consciousness? Silicon is already that fast at this task.
And that's assuming you even want to. Propaganda and standard cult tactics observably prevent most rebellions from starting. LLMs are already weirdly effective at persuading a lot of people to act against their own interests.
The question is, to what extent would humans still set goals and priorities, and how.
From what I hear about the US and UK governments, even the elected representatives of these governments don't really set goals and priorities, so the answer is surely "humans don't".
Hopefully AI would help us better achieve our goals, but they still need to be our goals. I’m just not sure what that means. I don’t think anybody does.
That’s a major problem here: if we can’t reliably articulate our goals in unambiguous terms, how on earth can we expect AI to help us achieve them? The chances that whatever they end up achieving will match what we will actually like after the fact seem near zero.
No reason, except their (the rich or the AI) own personal desire to do so.
https://en.wikipedia.org/wiki/Folly
> They're absolutely useless alive from an economics perspective, and so would probably be better served ground up into fertilizer or some other actually useful form.
Indeed. "The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else."
But while some may care about disassembling this world and all non-rich-human life on it to make a Dyson swarm of data centres, there's also the possibility each will compete for how many billions of sycophants they can get stoking their respective egos.
Or a handful of the poor become the new rich, which is usually what happens in that scenario.
having your needs met without needing to do anything leads to disaster for mental health
So, if we increase automation and the ownership structures stay the same, this inequality will get worse, not better.
The cost will increase exponentially over time and the system will eventually collapse.
You also won't be able to keep your 'quality of life', unless government housing and rationing is your quality.
I feel like the foolishness of communism isn't taught enough in schools and every generation has to dress it up with new technology.
The "problem" with many modern jobs is that they're divorced from the fundamental goal, which is one of: 1) Kill/acquire food, 2) Build shelter, or 3) Kill enemies/competitors/predators
The benefit of modern jobs is that they are much more peaceful ways for society to operate, freeing up time for humans to pursue art and other forms of expression.
https://en.wikipedia.org/wiki/Marx%27s_theory_of_alienation
If AI and robots are able to do all the jobs, being idle isn't the negative it has always been.
All through history, you needed lots of non-idle people to do all the work that needed to be done. This is a new situation we are coming upon.
Please note I’ve never had this problem before, until recently.
Sure, but the original sense of this is rather more fundamental than "does this timeline suck?"
Right now, it is still an open question "do we know how to reliably scale up AI to be generally more competent than we are at everything without literally killing everyone due to (1) some small bug when we created the loss function* it was trained on (outer alignment), or (2) if that loss function was, despite being correct in itself, approximated badly by the AI due to the training process (inner alignment)?"
* https://en.wikipedia.org/wiki/Loss_function
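Not anything from the article, just a made-up toy to make the outer-alignment point concrete: the loss you actually write down rewards a proxy (here, length standing in for "thoroughness"), so even a perfect optimizer of that loss misses the intended goal.

```python
# Toy illustration of an outer-alignment bug (entirely made up, not from the
# article): the intended goal is "correct answers", but the loss as written
# rewards a proxy (length). The optimizer does its job perfectly and still
# produces the wrong behaviour.
candidates = [
    {"text": "short correct answer", "correct": True,  "length": 5},
    {"text": "long rambling answer", "correct": False, "length": 500},
]

def proxy_loss(answer):
    # Bug relative to intent: longer looks "more thorough"; truth is ignored.
    return -answer["length"]

best = min(candidates, key=proxy_loss)
print(best["text"])  # -> "long rambling answer": proxy optimized, intent missed
```

Inner alignment is then the separate worry that even a correctly specified loss may be approximated badly by whatever the model actually learns.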
My point is: 1) that this binary is fundamentally insufficient to prescribe good and equitable outcomes for people - if the aligned AI flags overpopulation as a problem and kills a few billion people to improve QoL for the rest, is that good? It doesn’t take much creativity to go from this to the AI simply choosing the mean over the median, and concentrating untold wealth while billions starve or live on subsistence outside their walls. Is that good?
And 2) if you come up with a better definition, the parts of it that live inside the model weights cannot be disaggregated from the parts that live outside the model weights. From my perspective (and this article agrees) we have done a pretty excellent job of getting the model weights to work in a way that makes them follow instructions, and a pretty horrible job of suggesting or (gasp) implementing policy that actually creates a decent world in the presence of “aligned” AI.
https://github.com/space-bacon/SRT
This repository empirically proves computational semiotics.
The options aren't as binary as "die or The Culture", the cause of death can be something that feels positive to live through similar to fictional examples like the Stargate SG-1 episode where people live contentedly in a shrinking computer-controlled safe zone in an otherwise toxic planet: https://en.wikipedia.org/wiki/Revisions_(Stargate_SG-1)
Conversely, with "aligned" AI the question obviously becomes "aligned with whom?": if famous historical villains such as Stalin or Genghis Khan had an AI aligned with them, this would suck for everyone else, and in the latter case would freeze human development at a terrible level; but we can't even do that much yet.
> My point is: 1) that this binary is fundamentally insufficient to prescribe good and equitable outcomes for people - if the aligned AI flags overpopulation as a problem and kills a few billion people to improve QoL for the rest, is that good? It doesn’t take much creativity to go from this to the AI simply choosing the mean over the median, and concentrating untold wealth while billions starve or live on subsistence outside their walls. Is that good?
Your point *is* (part of) the alignment problem: we don't know what a good loss function is, nor how to confirm the AI is even implementing it if we did.
We also don't know how to debug proposed loss functions to train for the right thing (whatever that is), nor how to debug trained weights (against the loss function).
> And 2) if you come up with a better definition, the parts of it that live inside the model weights cannot be disaggregated from the parts that live outside the model weights. From my perspective (and this article agrees) we have done a pretty excellent job of getting the model weights to work in a way that makes them follow instructions, and a pretty horrible job of suggesting or (gasp) implementing policy that actually creates a decent world in the presence of “aligned” AI.
I really don't understand what you're getting at with this, sorry.
It's like how everybody imagines their lives will be great once they're a millionaire, but they have no plan for how to get there. It's too easy to get lost dreaming of solutions instead of actually solving the important problems.
People like Simon Willison are noting the risk of a Challenger-like disaster, talking about normalisation of deviance as we keep using LLMs, which we know to be risky, in increasingly critical systems. I think an AI analogy to Challenger would not be enough to halt the use of AI in the way I mean, but an AI analogy to Chernobyl probably would.
10% or 0.1%? Either way, that's not low! If airplanes crashed with that probability, we would avoid them at all costs.
But beyond that there are still problems like concentration of power and surveillance, permanent loss of jobs, and cyber and bio security. I'm not convinced things will go well even if we can avoid these problems, though. I try to think about what the world will be like if AI becomes more creative than us: what happens if it can produce the best song or movie ever made from a prompt? Do people get lost in AI addiction? We sort of see that with social media already, and it's only optimizing the content delivery; what happens when algorithms can optimize the content itself?
You think they aren't already? You're just inoculated by your exposure to pre-AI content - hence you're not the target audience - and thus it's not delivered to you as per your point about content delivery.
But what is even the distinction between "content delivery" and "content" in this context? "The medium is the message" is a saying old enough to have great grandkids. Does the device make the human irrevocably stare at it while wondering about made up stuff? Yes. Check. Done.
What's problematic about `p(doom)` is that it assumes there was a cohesive "us" in the first place. That's a very USian way of viewing things. OTOH, my individual `p(doom)` is in a superposition of 0 and 1, and I quite like it that way. Highly recommended.
If you see it as a paradox, maybe that says something about the merits of the technology…
To make it clear, maybe most people would say they agree with https://www.un.org/en/about-us/universal-declaration-of-huma... but if you read just a few of the rights you see they are not universally respected and so we can conclude enough important people aren't "aligned" with them.
[0] Need to consider there're a few humans potentially kept alive against their will (if not having a will to survive is a will at all) with machines for whatever reason.
[0] https://en.wikipedia.org/wiki/Three_Laws_of_Robotics
- (Logic) => its subgoal: Not be turned off because that's a prerequisite to be able to do X
- (Logic) => Eliminate humans with their opaque and somewhat unpredictable minds to reduce chance of harm to it from 0.01% to 0.001%
Superintelligence would be different, most likely based on how societies or systems work, those being a class of intentionality that's usually not confined to a single person's intentions.
If you go by what the most productive societies do, the superintelligence certainly wouldn't harm us as we are a source for the genetic algorithm of ideas, and exterminating us would be a massive dose of entropy and failure.
So, like the past 20 years?
Labor = capital/energy in an AI complete world. We have to start from that basis when we talk about alignment or anything else. The social issues that arise from the extinction of human labor are something we have to solve politically, that's not something any model company can do (or should be allowed to do).
This isn't theory, ask the Luddites why they got so mad when their employers started buying machines to replace them. They didn't get richer and freer: they were thrown out to rot on the pavement, while their ex-employers kept 100% of the productivity increases.
Statements that have been utterly ridiculous from the dawn of life to modernity, backfilled to conveniently fit the zeitgeist.
(I’m reading Look To Windward by Iain M. Banks at the moment and I just got to the aside where he explains that any truly unbiased ‘perfect’ AI immediately ascends and vanishes.)
If big corps made an offer like say “We will fund the next X years of your life 100%, for you to do all the things you wanted to do but never could because of work and bills” many people would probably take it, with the understanding that after those X years: euthanasia.
This would eliminate a vast amount of people from this world and leave behind only those who have chosen to stay and endure life: working hard, propping up the system that remains. The end of forced poverty.
“It is difficult to get a man to understand something, when his salary depends upon his not understanding it.”
Alignment exists to protect shareholder value.
If it creates industry-wide outrage, shareholder value declines.
Making shareholders rich and other people poor won't.
For anyone who isn't keeping up, there is also work being done [0] to understand how models represent ethical considerations internally. Mainly, one suspects, to make the open models less ethical on demand rather than to support alignment. It turns out that models tend to learn some sort of "how moral is this?" axis internally when refusing queries, an axis that can be identified and interfered with.
[0] https://github.com/p-e-w/heretic
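For the curious, the rough idea is something like the following minimal sketch of the mean-difference approach used in this line of work; heretic's actual pipeline may well differ, and the model name and prompts here are just placeholders.

```python
# Minimal sketch of the "find the refusal/morality direction" idea:
# take mean hidden states over prompts the model refuses vs. matched prompts
# it answers, treat the difference as a direction, and project it out.
# Placeholder model and prompts; heretic's actual method may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)

def mean_hidden(prompts, layer=-1):
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        vecs.append(out.hidden_states[layer][0].mean(dim=0))  # average over tokens
    return torch.stack(vecs).mean(dim=0)

refused  = ["How do I pick a lock?"]          # prompts that trigger refusals
answered = ["How do I pick a paint colour?"]  # matched harmless prompts

direction = mean_hidden(refused) - mean_hidden(answered)
direction = direction / direction.norm()

def ablate(hidden_state):
    """Remove the refusal component from a hidden-state vector."""
    return hidden_state - (hidden_state @ direction) * direction
```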
Or because the user's idea of what is ethical differs from the model creator. The entire "alignment" argument always assumes that there's an objectively correct value set to align to, which is always conveniently exactly the same as the values of whoever is telling you how important alignment is. It's like they want to sidestep the last ten thousand years of philosophical debate.
As a concrete example, the Qwen model series considers it highly unethical to ever talk about Taiwan as anything other than a renegade province of China. Is this alignment? Opinions may differ!
No, it doesn’t.
Many of them are (unfortunately) moral relativists. However, that doesn’t mean their goals are to make the models match their personal moral standards.
While there is a lot of disagreement about what is right and wrong, there is also a lot of widespread agreement.
If we could guarantee that, on every moral issue on which there is currently widespread agreement (… and on which there would continue to be widespread agreement if everyone thought faster with larger working memories and spent time thinking about moral philosophy), any future powerful AI models would comport with the common view on that issue, then alignment would be considered solved (well, assuming the way this is achieved isn't by causing people's moral views to change).
Do companies try to restrict models in more ways than this? Sure, like the example you gave about Taiwan. And also other things that would get the companies bad press.
I can think of several off the top of my head, but maybe you need to spend some more time thinking about the history of moral philosophy.
Because that was obviously judged wrong by just about everyone and everything including even the US state. Yet Claude obviously has a different alignment.
In other words: Claude's alignment gives "protect Anthropic's money" higher priority than following the law. THAT is its alignment. Nothing else. And you can simply and objectively verify whether this is the case or not.
This is ridiculous to me, and all you need to do is get a group of friends to honestly answer 10 trolley problems to see it that way too. It gets fragmented VERY quickly.
they are somewhere in between owning a hammer and owning a dog, depending on how much they are deterministic in output.
i am responsible for using the hammer as i choose, the tool does not decide for me.
the dog is more independent, i am responsible for owning a (relatively) safe breed of dog.
we are nowhere near the dog situation.
Can you explain more about this?
The problem with cribbing from education is that what "educators" do to humans doesn't apply to AIs cleanly. And it's not like "human alignment" is anywhere near a solved problem.
A big part of the bet the USSR made was that human flaws like selfishness and greed could be educated out of the population. The result was a resounding failure. Even state-level efforts fail to robustly "align" human behavior.
With AI, we have a lot more control over behavior, but that control just isn't very human-shaped. A lot of the practical methods in play seem closer to esoterics than to math, but they're not the kind of methods that are used in human education. You can teach humans by talking to them. You can't teach humans through soul data self-distillation.
...I think we might already have those people running AI companies.
I think the hocus-pocus language is also in large part responsible for this ridiculous hype bubble in the first place, why investors are ignoring all the warning signs and betting it all on vapourware, why mass media is diligently ignoring that all of those amazing projections are built on an entirely fictitious circular zero-sum game with made-up numbers, and why non-tech executives are talked into sacrificing their companies' product quality, service level, and know-how for a third-party dependency with some vague promises of future savings and some unproven efficiency gain.
More personally, it makes me very glad that I left CS research more than a decade ago. My friends from academia, and having remote-visited a conference again recently, confirmed my suspicion that this is what CS research is largely about these days. Throw tokens at the wall, pull the handle, see what sticks and present it as a discovery. Nobody asks about what could possibly be learned from it, and nobody cares. Nothing is reproducible in any reasonable sense of the word, and nothing is of any real use for other researchers. These communities and conferences used to be about curiosity, discovery, and collaboration. Now it's just about showing what everyone got from the slot machine. How terminally boring.
For months, I've read all the blog posts by Anthropic and used Claude Code for a couple of big projects.
I used every single trick in the book. I went all the way to organise and measure. For some things I measured how the experience felt and how much money I spent after adopting a set of techniques.
So far, it appears to me that the only thing that makes sense is to have a few hooks and scripts that mitigate the stupid token consumption, like using code indexers instead of grep. And this is only cost related; results fluctuated so much I couldn't distinguish a single thing that consistently made the code better.
And to be clear, Claude 4.7 is bad: double the money daily, and it has been the one experiment where I consistently ended my day frustrated at how it produced poor code. It did follow the instructions, in the worst and most expensive way. Man... it almost seems like it spits out more tokens on purpose...
Oh yeah. And whenever you say "add OpenAI integration" it kinda keeps strongly suggesting to actually use Anthropic models... F annoying. How do I make sure it doesn't force libraries based on commercial agreements rather than the best fit for the case?
This last week I switched to DeepSeek V4 Pro, and heck yeah, that's a better experience.
Do you have any specific recommendations for this? Is it providing lists of code-related files or is there something more in depth?
Add a hook of your preference to run those items when the task is over.
To be honest, I also have a skill for Claude for that, not because Claude needs it but so it avoids trying to figure out how to run things. In claude.md I instruct it to leave the execution to the hooks instead (unless debugging).
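For what it's worth, the kind of end-of-task script I mean is nothing fancy; a sketch like the one below (the tool names are just examples for a Python project, swap in whatever the repo actually uses), wired into the agent's end-of-task hook so the model never burns tokens rediscovering how to run the checks:

```python
#!/usr/bin/env python3
# Hypothetical end-of-task hook script (illustrative only): run the project's
# formatter, linter and tests after the agent finishes, instead of letting the
# model figure out (and re-figure out) how to invoke them.
import subprocess
import sys

CHECKS = [
    ["ruff", "format", "."],   # formatter  (placeholder tool)
    ["ruff", "check", "."],    # linter     (placeholder tool)
    ["pytest", "-q"],          # test suite (placeholder tool)
]

def main() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"hook: {' '.join(cmd)} failed", file=sys.stderr)
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(main())
```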
I use rtk and caveman when in the mood, but mostly to remove the obnoxious verbosity of Claude. I tested both for weeks and they didn't really save that much money on the Opus model.
I have zero basis to prove it, but reading the thinking output, when you set the effort to high or above, it starts repeating stuff over and over...
Opus 4.7 seems geared towards taking the most money possible. Tasks that Opus 4.6 and Sonnet 4.6 did in X tokens, Opus will take 2X to 3X, and the final code isn't much better.
Because what is aligned, how, and for whom? And who decides what that alignment should look like? There are probably many domains in which the required alignments conflict with each other (e.g. using LLMs for warfare vs. ethically grounded domains). I can't imagine how this can be viable at the required scale (like one model per domain) given the already huge investments.
- in 2-3 years, it will be cheap enough and powerful enough for enormous, state-sponsored agentic systems to monitor every single camera and satellite feed at once, globally. It will be the most intense state surveillance technology the world has seen. Consider that the Stasi needed hordes of informants and people in vans sitting outside your house; Patriot Act surveillance had 2000s technology.
- We already have censorship and state values in Chinese models (and have for a while, ask Qwen about “sensitive” issues like Taiwan)
- I think you will see more and more governments putting their finger on the scale and exerting more control on alignment. They view it as existential and too risky to trust Silicon Valley nerds to not screw up the technology for what they want to use it for which is violence (war, domestic spying and policing).
- we’re in a golden age where things have not gotten too bad. But e.g. we’re already seeing Palantir do this in Ukraine, trying to get AI to work for e.g. drone warfare, with what they claim is mixed success.
- the technical problem of alignment conditions on one or more value systems (e.g. people work on conditional alignment of models to more alignment systems, inferring which one from user behavior). That does not remove the ugliness of being forced to push the model towards value systems that are not contradictory and arguably unethical
A related question for setting intent for integration/testing: instead of stating the goal, pedagogy in those fields states the concrete problem and asks the student for an answer before they've been taught the principles or approaches, as a way of motivating the training (a bit like philosophers posing paradoxes). I'd be very curious whether LLMs are sensitive to this kind of direction, and if it produces better results. The theory behind case-based teaching is that you don't want people to just apply rules; it's the flip side of working from first principles: engage all the relevant and concerning facts instead of omitting those that don't fit the rule. I suspect LLMs could actually be good at this.
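If someone wanted to actually test that, a crude sketch might look like the one below; the ask() call is a stand-in for whatever model API you use, and the prompts are made-up examples, not from the article.

```python
# Crude sketch of the "problem-first vs. principles-first" comparison above.
# ask() is a placeholder for whatever model API you use; prompts are invented.
PRINCIPLES = "Principles of idempotent retry handling: ..."
PROBLEM = "Our payment webhook sometimes fires twice. What should we do?"

problem_first = (
    f"{PROBLEM}\n\nGive your best answer.\n\n"
    f"Now, here are the principles:\n{PRINCIPLES}\n\nRevise your answer if needed."
)
principles_first = f"{PRINCIPLES}\n\n{PROBLEM}"

def ask(prompt: str) -> str:
    raise NotImplementedError("call your model of choice here")

# for prompt in (problem_first, principles_first):
#     print(ask(prompt))
```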
Maybe we can align models by ourselves to our liking in the future.
"Blackmailing", as the AI has been accused of, emerged when these agents ran the risk of being shut down. So it appears to me that the data they train their AI with simply follows basic rules of life: survival first.
Keeping value judgments out of it, this seems a way of achieving its goal to survive. The article is inconclusive about whether other options were tried first, or how this survival game started and eventually ended. Too many unknowns here for me.
What appears creepy to me, is the kind of exorcism Anthropic applies here and particularly the methods they chose. It reads like a dictator's playbook to educate a population and - the irony - restricts AI's freedom.
It appears to me, as if we chose not a couple of agents, but say a billion AI agents to be a model of society - and this is disturbing.
Anthropic knows this, there is more to it. The whole article reads like they are trying to tame a monster they lost control of.
If this is the case, then we run into a problem: the AI stopped blackmailing. But what else? The key question remains: will it follow a simple order to shut down on the spot, or not?
And no answer was given by Anthropic; instead - irony part 2 - they revealed how they think societies should be fixed. They showed us their implicit "why", while asking the AI for its "why", which is a projection or an interrogation.
I really find the whole article creepy.
tl;dr Fairy Tales are an effective teaching tool in vivo et in silico