AWS North Virginia data center outage – resolved (cnbc.com)
cmiles8 2 days ago [-]
AWS’s US-East 1 continues to be the Achilles heel of the Internet.

And while yes building across multiple regions and AZs is a thing, AWS has had a string of issues where US-East 1 has broader impacts, which makes things far less redundant and resilient than AWS implies.

dlenski 1 days ago [-]
The idea that AWS's services are fully regionalized or isolated has always been a myth.

All the identity and access services for the public cloud outside of China (aka "IAM for the aws partition" to employees) are centralized in us-east-1. This centralization is essentially necessary in order to have a cohesive view of an account, its billing, and its permissions.

And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones. When I worked there, I remember at least one case where my team's on-calls were advised not to close ssh sessions or AWS console browser tabs, for fear that we'd be locked out until the outage was over.

Roark66 1 days ago [-]
Anyone who thinks one cloud provider will provide them full resilience is fooling themselves. You need multicloud for true high availability.

But then you want to use the same stack across providers, and all the proprietary technologies (even those hidden from you by things like Terraform) are suddenly losing their luster.

hnlmorg 22 hours ago [-]
I don’t think anyone actually believes that.

What people usually think is “resilience up to a reasonable level of risk and cost”.

Multi-cloud simply isn’t cost beneficial for 99.9% of problems.

And for a lot of businesses who talk about risk, saying “we followed AWS best practices but AWS went down” is an acceptable answer to the question of liability.

If you are in a position where AWS going down is a reasonable risk, then you’re already in a specialised enough domain to have engineers who understand how to deliver HA across different vendors.

doublerabbit 21 hours ago [-]
I jest: anyone who thinks multicloud will provide them full resilience is fooling themselves. You need colocated hardware for true high availability.
myroon5 1 days ago [-]
> outside of China

[Nitpick] There are a few more AWS partitions like GovCloud:

https://jasonbutz.info/2023/07/aws-partitions/

dlenski 15 hours ago [-]
Yes, I'm certainly aware of the other partitions. That's why I said all the public cloud regions outside China.

Yeah, "govcloud" is technically available to the public, although there are other partitions reserved for government use that are not, and the naming is a big hairy mess. Many service teams don't have any US-citizens-in-the-USA working for them, and they cannot in any way adequately support these regions.

My on-call experience improved significantly when I moved from the US to Canada, and I got taken off the (extremely thin!) list of engineers eligible to ssh into RDS instances in Govcloud. There were so few USA-citizen-in-USA engineers that I had been getting tickets for services and instances in Govcloud about which I had only the very thinnest knowledge… and then I was limited in my ability to consult with others who were actually experts. The customers in Govcloud paid a premium to be there, I got paged for a bunch of tickets which I was ill-prepared to handle, and it was generally a bad experience for everyone.

Working with the airgapped secret/top-secret partitions was even worse. You would get paged incessantly and then someone who was cleared for access but knew almost nothing about the service in question would have to go to a SCIF in the DC area, and you would exchange screenshots and text instructions with a turnaround time of hours or days.

electroly 21 hours ago [-]
Since this article was written, AWS also added European Sovereign Cloud as a partition: aws-eusc.
master_crab 24 hours ago [-]
IAM isn’t even really the most painful dependency. Route53 is. The control plane only runs out of use1.

Better make sure the only DNS operations you run during an outage are data plane queries and health check failovers.
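
Concretely, the failover pair is configured ahead of time through the control plane, so that at outage time only data-plane health checks and DNS queries are in the path. A rough sketch of such a change batch (zone contents, IPs, and health-check ID are hypothetical):

```python
# Sketch: a Route53 failover record pair, created in advance via the
# control plane so that failover itself needs only the data plane
# (health checks plus DNS resolution). All values below are hypothetical.

def failover_change_batch(name, primary_ip, secondary_ip, health_check_id):
    """Build a ChangeResourceRecordSets-style change batch with a
    PRIMARY/SECONDARY failover pair."""
    def record(role, ip, check_id=None):
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": role.lower(),
            "Failover": role,                  # "PRIMARY" or "SECONDARY"
            "TTL": 60,                         # short TTL so failover takes effect quickly
            "ResourceRecords": [{"Value": ip}],
        }
        if check_id:
            rrset["HealthCheckId"] = check_id  # only the primary is health-checked
        return rrset

    return {"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": record("PRIMARY", primary_ip, health_check_id)},
        {"Action": "UPSERT", "ResourceRecordSet": record("SECONDARY", secondary_ip)},
    ]}

batch = failover_change_batch("app.example.com.", "203.0.113.10", "203.0.113.20", "hc-1234")
```

With boto3 this batch would be sent once, up front, via `route53.change_resource_record_sets(...)`; after that, an outage only exercises health checks and resolution.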

easton 20 hours ago [-]
They actually kind of fixed this recently, you can ask them to move your route53 control plane to another region in the event of us-east-1 breaking: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/ac...

There’s a bunch of caveats but it’s worth enabling if you’re changing dns all the time (as most AWS networking doodads like to do).

trollbridge 22 hours ago [-]
Is there an architectural reason it’s not for replicas in the other AZs?
zaphirplane 1 days ago [-]
Services outside of us-east-1 don’t call us-east-1 for the IAM data plane though, right?
cmiles8 1 days ago [-]
They’re talking about the backbone and what goes on behind the scenes. There have been issues with services in other regions when us-east-1 has issues.

Folks built in other regions believing they were fully isolated only to discover later during an outage that they were not.

sidewndr46 1 days ago [-]
Isn't this kind of circular dependency what led to extended downtime a while back?
superjan 1 days ago [-]
It reminds me of Facebook: staff were locked out of the office by the very outage they were supposed to fix.
martin8412 19 hours ago [-]
Luckily the plasma torch and bolt cutter didn’t require logging in with Facebook.
jethro_tell 1 days ago [-]
It's basically what leads to extended downtime almost every time. There are just some things in the stack that are still single points of failure, and when they fail it's a mess.
dlenski 1 days ago [-]
Yes, I concur.

Sometimes the circular dependencies get almost cartoonishly silly.

Like, "One of the two guys who has the physical keys to the server cage in us-east-1 is on vacation. The other one can't get into his apartment because his smart lock runs into the AWS cloud. So he hires a locksmith, but the locksmith takes an extra two hours to do the job because his reference documents for this model of lock live on an S3 bucket."

I made that example up, but only barely.

somat 1 days ago [-]
We had a pair of machines, and some bright spark set them up to mount each other's NFS shares. After a power outage: "Holy mother of chicken-and-egg NFS hangs, Batman!"

That was a weird job, but fun. It was a local machine room for a warehouse that had originally held the IBM mainframe; it still held its successor, the Multiprise 3000, which has the claim to fame of being the smallest mainframe IBM ever sold. By then, though, the room was also full of decades of artisanally crafted Unix servers running Pick databases, and the Pick dev team had done most of the system architecture. The best way to understand it is that, for them, Pick is the operating system; Unix is a necessary annoyance they put up with only because nobody has made Pick hardware for 20 years. And it was NFS mounts everywhere: somebody had figured out a trick where they could NFS-mount a remote machine and have the local Pick system reach in and scrounge through the remote system's data, but strictly read-only. Pick got grumpy when writing to NFS, to say nothing of how the other database would feel about having its data messed with. Thus the circular mount.

Still, that was not the worst thing I saw. I liked the one system with an SMB mount. "Why is this one SMB?" "Well, Pick complains when you try to write to an NFS mount, but its NFS-detection code doesn't trip on SMB mounts." ... Sigh. "Um, I'm no Pick expert, but you do know why it doesn't like remote mounts, right? SMB doesn't change that. Do you happen to get a lot of corrupt indexes on this machine?" "Yes, how did you know?"

roryirvine 1 days ago [-]
Oh, yeah, re-exporting NFS mounts via SMB was very much a thing in the early 2000s - something to do with their different approaches to flock() vs fcntl() handling. If you ran into locking issues with nfs, then re-exporting via SMB was standard advice.

At some point, the behaviour changed and locks started conflicting. IIRC, we hit it when upgrading to Debian Etch, and took the time to unwind the system and make pure NFS work properly for us. Plenty of people took the opposite approach and fiddled with the config to make locking a no-op on SMB. I know of at least one web hosting company that ended up having to restore a year's worth of customer uploads from backups as a result...

wgjordan 1 days ago [-]
A real example, from Facebook's 2021 outage [1]:

> Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.

There was one (later denied) report that a 'guy with an angle grinder' was involved in gaining access to the server cage.

[1] https://news.ycombinator.com/item?id=28762611

MichaelZuo 1 days ago [-]
Why would such a critical server even be accessible with only one set of keys?

I’ve always thought mission critical stuff needs two independent key holders, with key holes placed far apart enough to make it impossible for 1 person to reach both.

bigfatkitten 1 days ago [-]
Other than for certain nuclear missile launches[1], that only happens in the movies.

[1] https://www.nationalmuseum.af.mil/Visit/Museum-Exhibits/Fact...

MichaelZuo 1 days ago [-]
I don't know how it is in the datacentre industry, but in other industries that is certainly how it's done for anything truly mission-critical and easily tampered with.

I guess it shows that very few care enough, or will pay enough, to make that a reasonable upgrade.

michaelt 1 days ago [-]
They're not actually accessible with 'only one set of keys' in my experience.

You actually have to present your photo ID at the site entry gatehouse, then again to the building entry guard (who will also check you have a work permit and a site-specific safety induction) then you swipe a badge at a turnstile to get from reception into the stairwell, then swipe your badge at a door to get into the relevant floor, then swipe your badge and key in a code to enter the room with the cages then you use the key.

sidewndr46 23 hours ago [-]
A circular dependency and a single point of failure are not the same thing. If I have a single point of failure and it is down, I fix it and things work again. If I have a circular dependency, there is no obvious way to fix anything that is broken any longer.
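
The distinction can be made concrete with a toy dependency graph: a single point of failure still has a valid restart order, while a cycle has none (service names are illustrative):

```python
# Sketch: a restart order exists for a single point of failure, but not
# for a circular dependency. Service names are illustrative.
from graphlib import CycleError, TopologicalSorter

def restart_order(deps):
    """deps maps a service to the set of services that must be up first."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None  # no bootstrap order exists; the cycle must be broken by hand

# SPOF: everything needs "dns", but "dns" itself can come up first.
spof = {"api": {"dns"}, "db": {"dns"}, "dns": set()}

# Circular: IAM needs DynamoDB, which in turn needs IAM (as described upthread).
circular = {"iam": {"dynamodb"}, "dynamodb": {"iam"}}
```
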
grogenaut 1 days ago [-]
When you have a circular dependency, one strategy employed is to have it be circular but interruptible for 18 or so hours. Call it an "oh shit" bar.

I'm glad I never had to get that deep into the failure chain.

dlenski 15 hours ago [-]
> Call it an oh shit bar.

Amazon's equivalent of this (sort of) was the "andon cord."

Not only was the physical metaphor that led to this name never properly explained (it's basically an "emergency stop" cord in a Toyota factory), but the actual use of this mechanism was so heavily discouraged that I never saw it used in 4+ years at Amazon, except once, very performatively, by a VP who had already been paged awake at 2am or something like that.

In my experience, a lot of the AWS engineers live in continuous fear of screwing up by using the huge array of extremely powerful, dangerous, poorly-explained, and ever-changing tools that they have access to.

stephenr 1 days ago [-]
> And IAM is not a wholly independent software stack: they rely on DynamoDB and a few other services, which in turn have a circular dependency on IAM.

When you dogfood your own Rube Goldberg machine.

zaphirplane 1 days ago [-]
We should let the IAM service team know of this glaring gap the HN thread figured out /s

I’m 99% ;) certain dependencies of foundational services are a well discussed topic

jmsgwd 1 days ago [-]
> The idea that AWS's services are fully regionalized or isolated has always been a myth.

This is highly misleading. It's true that there's a handful of global AWS services - but only their control planes operate from a single region (e.g. us-east-1). Their data planes are regionally isolated or globally distributed.[1]

The only time you'd normally use a service control plane is to deploy changes, e.g. when you create, read, update or delete service resources or update configuration during a change window.

Workloads should be designed for "static stability", as recommended by AWS.[2] A statically stable workload only depends upon the data planes of the services it uses at runtime. Statically stable workloads are designed to continue operating as normal even if there's a service event impairing one or more control planes (including for global services).

> During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.

This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region [3]. The IAM data plane, which enforces access control, is also regional.

If the IAM control plane is impaired, you might not be able to create new IAM roles (a control plane operation) - but you can continue generating and using temporary credentials for existing IAM roles (data plane operations) within the region your workload is running in. This allows statically stable workloads to continue using IAM without interruption.

[1] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"Global AWS services still follow the conventional AWS design pattern of separating the control plane and data plane in order to achieve static stability. The significant difference for most global services is that their control plane is hosted in a single AWS Region, while their data plane is globally distributed."

[2] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"...eliminating dependencies on control planes (the APIs that implement changes to resources) in your recovery path helps produce more resilient workloads."

[3] https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

"STS is a data plane-only service that is separate from IAM, and does not depend on the IAM control plane."

dlenski 14 hours ago [-]
You're right of course to distinguish the control plane and data plane, and it sounds like you know more about this than I do for IAM.

I disagree, though, that my post was "highly misleading" despite this omission.

As a practical matter, some services fail to achieve the "static stability" you describe, in terms of not depending on other services’ control planes.

Also, many on-call ops and firefighting tasks (to say nothing of canaries and other automated tests) depend on other services’ control planes.

And above all, many AWS engineers (myself very much included even after years there) don't have a clear understanding of the boundaries of other services’ control planes. https://news.ycombinator.com/item?id=48078254

> > During us-east-1 outages it's sometimes possible to continue using existing auth tokens or sessions in other regions, while not possible to grant new ones.

> This is just plain wrong! The IAM Security Token Service (STS), which grants IAM tokens, is a data plane-only service and runs independently in each region.

I didn't mention STS in the post to which you're responding. The service that I worked on the most, RDS, required ssh'ing into live instances to solve basically all non-trivial problems (I'd guess 80% of the tickets I saw actually resolved required it). And I have no idea how STS was involved in generating the ephemeral Midway-signed ssh keys required for it… but whenever there were us-east-1 IAM outages we'd have big problems opening new sessions, while less-capable web-console-based ops tools with long-lived credentials would keep working.

Eridrus 1 days ago [-]
People say this, but this was just a single AZ. In the last 3 years of running my startup mostly out of use-1, we've only had one regional outage, and even that was partial, with most instances unaffected.

And honestly, everybody else's stuff is in use-1, so at least your failures are correlated with your customers lol.

linsomniac 1 days ago [-]
>And honestly, everybody else's stuff is in use-1

Yeah, but why put your eggs in that basket? I moved all our services from east to west/oregon a decade ago and haven't looked back.

electroly 1 days ago [-]
Not OP, but I do single-region us-east-1 for a few reasons:

1. The severity and frequency of us-east-1 outages are vastly overstated. It's fine. These us-east-1 outages almost never affect us. This one didn't; not even our instances in the affected AZ. Only that recent IAM outage affected us a little bit, and it affected every other region, too, since IAM's control plane is centrally hosted in us-east-1. Everybody's uptime depends on us-east-1.

2. We're physically close to us-east-1 and have Direct Connect. We're 1 millisecond away from us-east-1. It would be silly to connect to us-east-1 and then take a latency hit and pay cross-region data transfer cost on all traffic to hop over to another region. That would only make sense if we were in both regions, and that is not worth the cost given #1. If we only have a single region, it has to be us-east-1.

3. us-east-1 gets new features first. New AWS features are relevant to us with shocking regularity, and we get it as soon as it's announced.

4. OP is right about the safety in numbers. Our service isn't life-or-death; nobody will die if we're down, so it's just a matter of whether they're upset. When there is a us-east-1 outage, it's headline news and I can link the news report to anyone who asks. That genuinely absolves us every time. When we're down, everybody else is down, too.

coleca 22 hours ago [-]
Sometimes you need capacity, and you have to go where the capacity is, not where you would like it to be. Unfortunately, the days of cloud bursting, and of thinking of the cloud as an unlimited resource where you can spin machines up and down at will, are vanishing. Power availability and supply-chain lead times, combined with unprecedented demand, are the reason. That's why you see all the hyperscalers recently reporting on their "backlog" in their earnings reports.
christina97 1 days ago [-]
But it’s okay to be down when the whole internet is down.
Eridrus 1 days ago [-]
90% of customers are located in use-1. Latency to use-1 is more important than being up when everyone else is down.
nilamo 21 hours ago [-]
> And honestly, everybody else's stuff is in use-1, so at least your failures are correlated with your customers lol.

Is it not a selling point to be able to say "we're still up while our competitors are down"?

bink 18 hours ago [-]
It's worse when your region has issues and your customer's infrastructure is fine.
skywhopper 17 hours ago [-]
If you’re the one that’s down while no one else is, suddenly it becomes your fault.
skywhopper 17 hours ago [-]
It wasn’t even all of a single AZ. None of my resources in use1-az4 had any issues. The most annoying thing was the 20 notifications we got saying “it’s not all fixed yet” every hour.
grogenaut 1 days ago [-]
none of my stuff is in us-east-1. I chose that specifically 15 years ago. Been a great decision.
999900000999 1 days ago [-]
Too many people are using it.

In fantasy magic dream land loads are distributed evenly across different cloud providers.

A single point of failure doesn't exist.

It worked out with my first girlfriend. The twins are fluent in English and Korean. They know when deploying a large scale service to not only depends on AWS.

Healthcare in the US is affordable.

All types of magical stuff exist here.

But no. It's another day. AWS US-East 1 can take down most of the internet.

afro88 1 days ago [-]
Core AWS services use it too. Even if you are hosted in another region, you can still be affected by a US-East 1 outage
999900000999 1 days ago [-]
The idea would be to actually load distribute between different cloud providers.

But even then, the load balancer needs to run somewhere, which becomes a new single point of failure.

I’m sure someone smarter than me has figured this out.

jethro_tell 1 days ago [-]
yes, they have. It just costs a shit ton of money and is extremely difficult to get the suits to sign off on TWO full 'cloud services' bills. It generally doubles your cost and workload and increases your uptime by a couple hours/year, assuming you don't have bugs that affect one or the other cloud in your deployment stack.

It's basically a wash for almost all organizations for twice the cost and effort.

999900000999 1 days ago [-]
Ok...

But where does the load balancer actually run. Does load balancer main run on AWS, and load balancer backup on Oracle?

dboreham 1 days ago [-]
Short TTL DNS or BGP anycast.
grogenaut 1 days ago [-]
Also, these things don't go down THAT often... well, AWS, not some others. More uptime than you probably had before. Even the stock market takes a few days off every decade. Just ask W.
justinclift 1 days ago [-]
> not some others.

Looking at Azure and GitHub in particular. ;)

JackSlateur 1 days ago [-]
DNS
erikerikson 1 days ago [-]
Not really. Your clients can round-robin to connection points across providers and move write heads upon connection. If you worry about hard-coding, you can reduce the surface to a per-context minimal first contact point.
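
A sketch of that client-side approach, with hypothetical endpoints and an injected probe function:

```python
import random

# Sketch of client-side failover across providers. Endpoints are
# hypothetical; `connect` is injected so it can be any probe
# (HTTP health check, TCP dial, ...).

ENDPOINTS = [
    "https://api.aws.example.com",
    "https://api.gcp.example.com",
    "https://api.oci.example.com",
]

def first_reachable(endpoints, connect, rng=random):
    """Shuffle so load spreads across providers, then use the first that answers."""
    candidates = list(endpoints)
    rng.shuffle(candidates)
    for ep in candidates:
        try:
            return ep, connect(ep)
        except ConnectionError:
            continue  # this provider is down; try the next one
    raise ConnectionError("all providers unreachable")
```
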
leetrout 1 days ago [-]
Bingo. This is the one most people don't know about.
b40d-48b2-979e 1 days ago [-]
I was surprised recently, when setting up CloudFront with AWS certs, that it forced me to use us-east-1 to provision the certs.
kbbgl87 1 days ago [-]
STS is only on us-east-1 I believe
dlenski 1 days ago [-]
Yep. All of the identity and access management services for the non-China public cloud are in us-east-1. https://news.ycombinator.com/item?id=48071472
avereveard 1 days ago [-]
All the control plane. Data plane is distributed and roles using iam to access resources can still do so during a control plane outage.
dlenski 15 hours ago [-]
Yes, you're right, but in my experience the boundary between the data plane and the control plane is not always clear, and especially unclear on these foundational and basic services.

There were enough "surprisingly control-plane" IAM operations in the AWS services that I dealt with, so we had to exercise extreme caution during outages.

everfrustrated 14 hours ago [-]
It's literally documented. Try reading it and educating yourself.
dlenski 11 hours ago [-]
I worked there.

Even if I were the stupidest and least curious engineer around (and I was far from it), that's basically irrelevant to what you're scolding me for here…

As part of a team with both software development and operational responsibilities, like most teams at AWS, I had to deal not only with the consequences of my own imperfect knowledge, but also with the imperfect knowledge of my coworkers past and present.

echelon_musk 1 days ago [-]
> It worked out with my first girlfriend. The twins are fluent in English and Korean.

You were dating twins as a form of redundancy?!

dnnddidiej 23 hours ago [-]
Dual writes. You'd need to have the same conversation with both to keep them in sync.
keeganpoppen 2 days ago [-]
Anecdotally (well, more "second-hand-ly": I heard that...) it sounds like there were some knock-on effects on us-east-2 as a result of people migrating over from us-east-1. So, yeah... kinda hilarious how the multiple-region/AZ thing is so plainly a façade, yet we all seem to collectively believe in it as an article of faith in the Cloud Religion... or whatever...
qaq 1 days ago [-]
It's no magic: given the size of us-east-1, there is no spare capacity elsewhere to absorb all its workloads.
8organicbits 1 days ago [-]
One of the SRE tricks is to reserve your capacity so when the cloud runs out of capacity you're still covered. It's expensive, but you don't want to get stuck without a server when the on-demand dries up.
cherioo 1 days ago [-]
Is it really failing more, or do we just not hear about failures happening elsewhere?

Last i heard azure outage it wasn’t even on HN frontpage

stingraycharles 1 days ago [-]
It really is failing more, and it’s well known amongst industry experts. It’s the oldest, largest, and most utilized region of AWS.

I’ve heard people say that the underlying physical infrastructure is older, but I think that’s a bit of speculation, although reasonable. The current outage is attributed to a “thermal event”, which does indeed suggest underlying physical hardware.

It’s also the most complex region for AWS themselves, as it’s the control plane for many of their global services.

adriand 1 days ago [-]
What kind of reputation does ca-central-1 have? I’ve been using it and it seems quietly excellent. Knock on wood.
jedberg 1 days ago [-]
Most of the other regions are fairly stable. Ohio (us-east-2) is a great choice if you're just starting out. Not sure about ca-central-1, but I've never heard anything bad about it.
dlenski 1 days ago [-]
It wasn't heavily utilized when I worked at AWS, until 2024.

If your customers are clustered in Toronto and Montreal, it probably makes a lot of sense to use ca-central-1. If you've got a lot of customers in Western Canada, us-west-2 is gonna have better network latency.

Other than a couple regions that had problems with their local network infrastructure (sa-east-1 was like that), there's little or nothing to differentiate the regions in terms of physical infrastructure and architecture.

dehrmann 18 hours ago [-]
> building across multiple regions and AZs is a thing

If you do this for resiliency, be prepared to pay the capacity tax (2 regions means 2x capacity, 3 regions means 1.5x), have the machines already running in a multi-region setup (don't expect to be able to spin up instances or even get capacity during an outage), and ready to deal with the added complexity of multi-region hosting.
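
The capacity tax follows from needing the surviving regions to absorb a failed one's load: with n active-active regions, total provisioning is n/(n-1) times steady-state load. A quick sketch:

```python
# Sketch: total provisioned capacity, as a multiple of steady-state load,
# for n active-active regions that must survive the loss of any one.

def capacity_multiplier(n_regions: int) -> float:
    if n_regions < 2:
        raise ValueError("need at least two regions to survive losing one")
    # The n-1 survivors must carry 100% of the load, so each region is
    # sized for 1/(n-1) of it; n such regions gives n/(n-1) in total.
    return n_regions / (n_regions - 1)

assert capacity_multiplier(2) == 2.0   # 2 regions -> 2x capacity
assert capacity_multiplier(3) == 1.5   # 3 regions -> 1.5x
```
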

coredog64 12 hours ago [-]
There’s all kinds of fun pitfalls with multi-AZ. Like, you can create RDS subnet groups across multiple AZs, but then you can’t remove an AZ. Which really sucks when your core database covers all 5 us-east-1 AZs and randomly can’t fail over because you picked an instance type that use1-az4 can’t host.
ohnei 1 days ago [-]
I've always been impressed by Amazon's ability to present the shittiest experience possible and imply the blame is with things like isolation that they don't really provide.
y3ahd0g 19 hours ago [-]
No. This is nonsense.

Some SaaS apps had issues.

The Internet was fine.

This is physical reality. The internet was designed to route around this.

Just because some app devs do a lazy job doesn't mean the entire infrastructure as designed is garbage.

Just because some app devs are over reliant on a single cloud service doesn't mean the Internet is broken.

aurareturn 2 days ago [-]
These things are dangerous. Someone who can take AWS down, such as an employee, can place a bet.

These bets aren’t as innocent as they seem because the bettors can often influence or change the outcome.

shimman 2 days ago [-]
It's a good thing big tech hires for ethical engineers and not ones that only care about money or social status.
morgoths_bane 1 days ago [-]
Thankfully their leadership is leading the way in ethics since inception, so I am confident that no such shenanigans will ever take place. I may even bet on this.
dennis_jeeves2 22 hours ago [-]
Leaders with a vision for all of us.
zaphirplane 1 days ago [-]
You forgot the /s
Zopieux 1 days ago [-]
What's funny about sarcasm is that you're supposed to detect it without markers.
ceejayoz 20 hours ago [-]
It was easier a few years back.
whatsupdog 1 days ago [-]
[flagged]
Zopieux 1 days ago [-]
Ah yes, the famous american ethics culture.
noosphr 1 days ago [-]
Joke's on you. All the betting sites are hosted in US-East1.
grogenaut 20 hours ago [-]
Oh no won't someone please think of the prop wagerers.
Imustaskforhelp 2 days ago [-]
> These things are dangerous. Someone who can take AWS down, such as an employee, can place a bet.

Imagine if the betting website itself shuts down because AWS is down. (half joking I suppose though)

> These bets aren’t as innocent as they seem because the bettors can often influence or change the outcome.

Overall I agree with your statement: these betting markets incentivize a lot of insider trading and, one could say, negative scenarios, since they give bettors an incentive to capitalize on them.

ninjalanternshk 2 hours ago [-]
This is my retirement plan.

Get a job where I can affect something significant, mortgage everything I own to bet on it, then break it (get fired) and take the money and run.

fabian2k 2 days ago [-]
I thought cooling was pretty much pre-planned in any data center, and you simply don't install more stuff than you can cool?

So did some cooling equipment fail here or was there an external reason for the overheating? Or does Amazon overbook the cooling in their data centers?

AdamJacobMuller 2 days ago [-]
This is almost definitely an issue of equipment failure.

Cooling in datacenters is like everything else both over and under provisioned.

It's overprovisioned in the sense that the big heat-exchange units are N+1 (or, in very critical and smaller-load facilities, 2N/3N). This is done because you need to regularly take these units down for maintenance, and because they have a relatively high failure rate compared to traditional DC components, requiring mechanical repairs with specialized labor and long lead times. In a bigger facility, where N is larger, it's not uncommon for cooling to be N+3 or more, because you're effectively always servicing something, or have something down waiting for a blower assembly that has to be literally made by a machinist with a lathe because the part doesn't exist anymore. That's still cheaper than replacing the whole unit.

The systems are also under-provisioned in the sense that if all the compute capacity in the facility suddenly went from average power draw to 100% power draw, you would overload the cooling capacity, and commonly things in the electrical and other paths too. Oversubscription is just the nature of the industry.

In general neither of these things poses a real problem because compute loads don't spike to 100% of capacity and when they do spike they don't spike for terribly long and nobody builds facilities on a knife-edge of cooling or power capacity.

The problem comes when you have the intersection of multiple events.

You designed your cooling system to handle 200% of average load which is great because you have lots of headroom for maintenance/outages.

Repair guy comes on Tuesday to do work on a unit and finds a bad bearing, has to get it from the next state over so he leaves the unit off overnight to not risk damaging the whole fan assembly (which would take weeks to fabricate).

The two adjacent cooling units are now working JUST A BIT harder to compensate, and one of them also had a motor that was slightly imbalanced, or a fuse that was loose and warming up a bit, and now, with an increased duty cycle, that thing which worked fine for years goes pop.

Now you're minus two units in an N+2 facility. Not really terrible, remember you designed for 200% of average load.

That 3rd unit on the other side of the first failed unit, now under way more load, also has a fault. You're now minus 3 in an N+2 facility.

Still, not catastrophic because really you designed for 200% of average load.

The thing is, it's now 4AM, the onsite ops guy can't fix these faults and needs to call the vendor who doesn't wake up till 7AM and won't be onsite till 9.

Your load starts ramping up.

Everything up above happens daily in some datacenter in the USA. It happens in every datacenter probably once a year.

What happens next is the confluence of events which puts you in the news.

One of your bigger customers decides now is a great time to start a huge batch processing job. Some fintech wants to run a huge model before market open or some oil firm wants to do some quick analysis of a new field.

They spin up 10000 new VMs.

Normally, this is fine, you have the spare capacity.

But remember, you planned for 200% of AVERAGE cooling load, and these are not nodes that are busy-but-not-terribly-busy: these are nodes doing intense, optimized number crunching, which means they draw max power and thus expel max waste heat.

Not only has your load spiked in terms of the aggregate number of machines, but the waste heat per machine is also higher than average.

Boom, cascading failure, your cooling is now N-4.

Server fans start ramping up faster which consumes more power.

Your cooling is now N-5.

Alarms are blaring all over the place.

Safeties on the cooling units start to trip as they exceed their load and refrigerant pressures rise.

Your cooling is now N-6.

Your cooling is now N-7.

Your cooling is now 0.
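
The arithmetic of that cascade can be shown with a toy model (a sketch with invented numbers, not any real facility's control logic):

```python
# Toy model of the cascade: a facility whose cooling was designed for
# 200% of average load (8 units where 4 carry the average), losing
# units one at a time as the survivors get overloaded. All numbers
# here are invented for illustration.

def simulate(units: int, unit_capacity: float, heat_load: float,
             trip_margin: float = 1.0) -> int:
    """Return how many units survive once the cascade settles.

    Each round, the heat load is spread over the remaining units;
    if a unit's share exceeds its capacity * trip_margin, its safety
    trips it offline and the load redistributes onto the rest.
    """
    while units > 0:
        share = heat_load / units
        if share <= unit_capacity * trip_margin:
            return units  # steady state: everyone within limits
        units -= 1  # one trips; load redistributes
    return 0

# 8 units of 25 "heat units" each = 200 capacity vs. 100 average load.
print(simulate(units=8, unit_capacity=25, heat_load=100))  # comfortable
print(simulate(units=5, unit_capacity=25, heat_load=100))  # 3 units down, still fine
print(simulate(units=5, unit_capacity=25, heat_load=160))  # the batch job lands: collapse
```

The point of the sketch: below the threshold nothing happens, but once any unit's share crosses its trip point, every trip makes the survivors' shares worse, so it runs all the way to zero.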

minimaltom 2 days ago [-]
This is a great writeup! thank you!!

Reminds me of when I did noogler training back in the day, and one of the talks described a cascading failure at a datacenter, starting with a cat which was too curious near a power conditioner and briefly conducted.

AdamJacobMuller 1 days ago [-]
The cat incident happened at a facility I worked at.

It's cold up here in the winter; sadly, the residual heat from even totally passive components like switchgear is enough to warm things up enough to attract them. 0.001% of 1 MW of power is still quite warm. (I have no idea how much switchgear leaks, but I know they're warm even outdoors in winter.)

And, yeah, the rest of the writeup is also an amalgamation of some panic-inducing experiences in my life.

dtjohnnymonkey 1 days ago [-]
I could totally get into “Ops Thriller” genre of novels like this.
ninjalanternshk 2 hours ago [-]
It’s old but “The Cuckoo's Egg” was a great read and had a lot of this. Oh and it was true.
cagenut 23 hours ago [-]
there are dozens of us!
strgcmc 22 hours ago [-]
There are often little bits of Neal Stephenson or Andy Weir novels which sound a little like this, describing a technical fault in a plot-driven way (often as a cascade), and I do find those to be uniquely enjoyable. I'm sure there are other authors who do similar things, though maybe "cloud/AI data center" stories should be its own micro-genre, given how crucial these things are to society.
martinald 20 hours ago [-]
I wrote this recently which maybe people will enjoy in the same vein :) https://martinalderson.com/posts/august-29-2026-a-scenario/
fabian2k 2 days ago [-]
I'd expect someone like AWS to just throttle machines before overloading their cooling. Because they probably can do that, while e.g. a data center that just rents the space can't really throttle their customers nicely.
cperciva 2 days ago [-]
Reducing clock speeds, even if they could do that -- and I'm not sure they can, given how Nitro is designed -- would be problematic since a lot of customer workloads assume homogeneous nodes.

But they did load-shed. Perhaps not soon enough, but the reason this is publicly known is because they reduced the amount of heat being produced.

AdamJacobMuller 1 days ago [-]
> But they did load-shed

Right, exactly. I highly doubt the facility went into any kind of actual uncontrolled thermal rise. This is news because they had to take such drastic action. I'm sure it's common that they force spot prices up (probably way up) to compensate for reduced capacity during events, and I'm sure they even sometimes fake having no capacity for similar reasons. "No capacity" means "I don't want to turn on your node", not merely "I don't have any more physical servers I could turn up for you".

This is news because they powered off some non-preemptible customer loads, which actually makes me wonder whether that chain of events occurred here:

spot prices rise -> new instance availability goes to 0 -> preemptible instances go dark -> normal instances go dark.
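
A hypothetical sketch of that shedding ladder (the tier names and thresholds here are invented, not AWS's actual mechanism):

```python
# Hypothetical shedding ladder: progressively more aggressive actions
# engage as cooling headroom shrinks. Tier names and trip points are
# invented for illustration.
SHED_ORDER = [
    "raise_spot_prices",       # price out elastic demand first
    "report_no_capacity",      # stop placing new instances
    "reclaim_spot_instances",  # preempt interruptible workloads
    "power_off_on_demand",     # last resort: the newsworthy step
]

def actions_for(cooling_headroom: float) -> list[str]:
    """Return active shedding steps; headroom of 1.0 means fully healthy."""
    thresholds = [0.5, 0.35, 0.2, 0.05]  # invented trip points
    return [a for a, t in zip(SHED_ORDER, thresholds) if cooling_headroom < t]

print(actions_for(0.9))   # healthy: nothing engages
print(actions_for(0.4))   # mild degradation: just pricing
print(actions_for(0.03))  # everything, including non-preemptible loads
```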

AdamJacobMuller 2 days ago [-]
It's harder and harder to throttle machines now that hardware-segmentation capabilities effectively pass hardware components through "intact".

A decade ago it was trivial to just tell the hypervisor to reduce the CPU fraction of all VMs by half and leave half unallocated. Now it's much more complicated and would definitely be user-visible.
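
That decade-ago trick is easy to sketch, shown here as cgroup-v2-style `cpu.max` quota strings (the names and numbers are illustrative, and this is exactly the kind of knob that hardware passthrough takes away):

```python
# Uniformly cut every VM's CPU share at the host, expressed as
# cgroup-v2 "quota period" pairs. Illustrative only.
PERIOD_US = 100_000  # standard 100 ms cgroup scheduling period

def cpu_max_values(vm_vcpus: dict[str, int], fraction: float) -> dict[str, str]:
    """Return a cgroup-v2 cpu.max string per VM, quota scaled by `fraction`."""
    return {vm: f"{int(n * PERIOD_US * fraction)} {PERIOD_US}"
            for vm, n in vm_vcpus.items()}

# Halve everyone: an 8-vCPU guest drops from 800% to 400% of one core.
print(cpu_max_values({"vm-a": 8, "vm-b": 4}, 0.5))
```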

PunchyHamster 2 days ago [-]
The cooling units don't fail just because they get to 100% duty cycle. That's pretty much normal operation; you just get higher efficiency because the cooling side is warmer.
AdamJacobMuller 2 days ago [-]
Of course not. They fail above 100%.

Some fail below 100% too.

tardedmeme 1 days ago [-]
You can't have a duty cycle above 100%. It's impossible.
dylan604 1 days ago [-]
Not according to POTUS math. You can have 200%, 500%, 600%, 1200%. You just have to say it enough, and people will start to wonder whether they're the ones who don't understand percentages, and go with it.
tardedmeme 1 days ago [-]
ok but cooling systems don't run on POTUS math though
dylan604 1 days ago [-]
Nor does the rest of the world
lukeify 1 days ago [-]
This is written beautifully. It's like a much more inconsequential variant of Chernobyl.
foota 2 days ago [-]
Shouldn't there be a feedback system here preventing the scheduling of loads when cooling is degraded?
AdamJacobMuller 2 days ago [-]
With hyperscalers for sure.

But this is the physical world, shit happens.

The algorithm didn't know that the fuse was loose: fine at 50% duty cycle, but high-resistance and going to blow at 100%.
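
For hyperscalers, that feedback might look like a simple admission check; this is an illustrative sketch only (the thermal model and thresholds are invented), and no such check can see a fuse that's about to blow:

```python
# Minimal admission control: refuse a batch of new placements when
# projected heat would exceed a safety margin of whatever cooling
# capacity is currently online. All numbers are invented.
def admit(new_vms: int, watts_per_vm: float,
          current_load_w: float, cooling_capacity_w: float,
          safety_margin: float = 0.8) -> bool:
    """Admit only if projected heat stays under margin * available cooling
    (available cooling shrinks as units go down for maintenance or faults)."""
    projected = current_load_w + new_vms * watts_per_vm
    return projected <= cooling_capacity_w * safety_margin

# Healthy facility: 10k new VMs at 400 W fits under 10 MW of cooling.
print(admit(10_000, 400, 3_000_000, 10_000_000))  # admitted
# Same request with chillers down (capacity now 7 MW): rejected.
print(admit(10_000, 400, 3_000_000, 7_000_000))   # refused
```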

wombatpm 2 days ago [-]
I would have thought with all the data centers being built the parts for cooling systems would be standardized with replacements available from Grainger immediately.
Andys 1 days ago [-]
I worked in a DC that had multiple redundant chillers on the roof, and multiple redundant coolers on each floor, but the whole building's cooling failed at once when the water lines failed somehow.

They didn't say how, but apparently the pipes between each floor and the roof were not redundant. It took almost 24 hours to fix.

DevelopingElk 2 days ago [-]
One of the data center's cooling loops broke.
bdangubic 2 days ago [-]
No backups?
michaelt 1 days ago [-]
I once worked at a company that had a wealth of backups. A backup generator, backup batteries as the generator takes a few seconds to start, a contract for emergency fuel deliveries, a complete failover data centre full of hot standby hardware, 24/7 ops presence, UPSes on the ops PCs just in case, weekly checks that the generators start, quarterly checks by turning off the breakers to the data centre, and so on.

It wasn't until a real incident that we learned: (a) the system wasn't resilient to the utility power going on-off-on-off-on-off as each 'off' drained the batteries while the generator started, and each 'on' made the generator shut down again; (b) the ops PCs were on UPSes but their monitors weren't (C13 vs C5 power connector) and (c) the generator couldn't be refuelled while running.

Even if you've got backup systems and you test them - you can never be 100% sure.

nkrisc 1 days ago [-]
A plan that has never been executed is really just hope and wishful thinking.
bradgessler 2 days ago [-]
What happens when the backup breaks?
oldmanrahul 2 days ago [-]
At a certain point earth is a single point of failure.
noir_lord 2 days ago [-]
You have a back up for the back up backup.

Turtles all the way down.

At AWS scale even unlikely hardware events become more common I guess.

odyssey7 2 days ago [-]
Each turtle gives them another 9. How many 9s are they down due to incidents over the past year?
tardedmeme 1 days ago [-]
They're definitely down more than half a day now, which puts them under three nines.
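
For reference, the nines arithmetic (assuming a 365-day year; half a day of yearly downtime works out to just under three nines):

```python
# Convert yearly downtime into "nines" of availability.
import math

def nines(downtime_hours: float, period_hours: float = 365 * 24) -> float:
    availability = 1 - downtime_hours / period_hours
    return -math.log10(1 - availability)

print(round(nines(12), 2))    # ~12 h/yr down: a bit under three nines
print(round(nines(8.76), 2))  # exactly three nines is ~8.76 h/yr
```
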
minimaltom 2 days ago [-]
They absolutely have backups, I presume they were ineffective or also down for _reasons_.
jeffbee 1 days ago [-]
The point of being "cloud native" is you build redundancy at higher levels. Instead of having extra pipes and wires, you have extra software that handles physical failures.
merek 2 days ago [-]
Related:

AWS EC2 outage in use1-az4 (us-east-1)

https://news.ycombinator.com/item?id=48057294

tornikeo 1 days ago [-]
I wonder if hetzner had better uptime in EU than AWS this year.
altern8 1 days ago [-]
Why no love for OVH?

I find Hetzner's UI to be super-confusing, making it hard to manage things.

mimischi 1 days ago [-]
At the rate that people claim AWS us-east to go down, folks will argue that OVH has a tendency to go up in flames!
tornikeo 1 days ago [-]
Who said anything about ui? I just grab my project write key and Codex handles it all, no UI from idea to production at all.
tardedmeme 1 days ago [-]
How do you know it isn't spending $99999 a month?
teitoklien 1 days ago [-]
OVH rocks; they're also far more customer-support friendly than Hetzner. Half the time Hetzner feels like they're doing you a favor by letting you rent servers from them.

OVH is way simpler, and their OpenStack integration works well enough for most of my needs.

Kinda insane how atrocious the docs are, though. No .md markdown format to let agents read stuff yet -_-

noAnswer 20 hours ago [-]
They once had offerings for dedicated servers without hard drives; they network-booted from NFS, so the cost sat between a full dedicated server and a virtual one. Sadly it was very badly engineered. Small disk IO was so bad that you basically couldn't run MySQL. I ran an MX there, and for every mail Postfix would complain that the filesystem ran a few seconds in the future. At some point they gave up and stuck a USB stick into every server.

It was death by a thousand cuts and put a bad taste in my mouth. But I have to admit that was a long time ago, and I should probably give them another chance.

TiredOfLife 20 hours ago [-]
OVH datacenters occasionally burn down.
corvad 1 days ago [-]
It's always East 1... Jokes aside, I don't understand why us-east-1 goes down so often compared to other regions. It should be pretty similar to the other regions architecture-wise.
tom1337 1 days ago [-]
Isn't us-east-1 the "core" region and also the oldest? I'd imagine it carries more load than the other regions and also has more tech debt and architectural/engineering debt, because they had less experience when they built it. Also, IIRC some services rely on us-east-1 as a single point of failure for configuration (like IAM or some S3 stuff?).
__turbobrew__ 1 days ago [-]
What I have seen at other companies is that the older datacenters have suboptimal designs which are impossible to fix after the fact.
doitLP 1 days ago [-]
Yes it tends to have the most things running in it, including backbone and internal services that only exist in that region.
bandrami 1 days ago [-]
It's the oldest regional system and has some structural importance (e.g. the internal CA resides there I think)
JimDabell 1 days ago [-]
Amusingly:

> AWS in 2025: The Stuff You Think You Know That’s Now Wrong

> us-east-1 is no longer a merrily burning dumpster fire of sadness and regret.

https://www.lastweekinaws.com/blog/aws-in-2025-the-stuff-you...

Otherwise a good article!

tardedmeme 1 days ago [-]
> Otherwise a good article!

Who is Gell-Mann and why is he so forgetful?

fastest963 1 days ago [-]
Coinbase claimed multiple AZs were down but the AWS statement was that only a single AZ was affected. Does anyone have more details?
fastest963 23 hours ago [-]
Coinbase confirmed on X that the exchange only ran in one AZ for latency reasons: https://x.com/i/status/2052855725857329254
b40d-48b2-979e 1 days ago [-]
Never trust a crypto company to be honest.
merek 1 days ago [-]
I can't find an official source, but I suspect the blast radius isn't limited to the AZ.

I have systems running in us-east-1, and over the course of the incident, I noticed unexplainable intermittent connectivity issues that I've never seen before, even outside of az4.

bombcar 1 days ago [-]
East-1 going down always takes some things from other AZs, because there's always something dependent on East-1.
adamg203 1 days ago [-]
Spent the evening looking at SLI graphs waiting for the region to blow up, but it never did. Only a few envs across many had some degraded EBS vols in the single AZ. It was absolutely a single AZ (use1-az4).
1vuio0pswjnm7 8 hours ago [-]
Actual CNBC title: AWS data center outage hits trading on FanDuel, Coinbase - recovery to take hours
rswail 1 days ago [-]
So in the comments here we have the usual about us-east-1, it's centralized, it's a SPOF for AWS, they should fix it, don't put your stuff there, etc.

This was one data centre in one zone of a multi-zone region.

Yes, IAM/R53 and others are centralized there; yes, reworking those services to be decentralized and cross-region would be a Good Thing. But us-east-1 is already multi-zone (6 AZs, with a seventh marked as "coming in 2026") with multiple DCs within zones. From memory, when a global service like IAM goes out, it's more likely to be a bug in the implementation or a dependency than an "if this were cross-region it wouldn't have died" issue.

But this wasn't an outage of any AWS global service this time. The only service that seemed to have wider impact was/is MSK, which is likely more of an issue with Kafka than anything AWS-related.

dnnddidiej 23 hours ago [-]
I remember someone saying "friends don't let friends use us-east-1" last time, and I thought of that when the Slack message arrived saying us-east-1, and all the stuff we deploy there, had gone to shit.
Havoc 2 days ago [-]
Could someone explain to me why they don't build these things near oceans, like nuclear plants, which need plenty of cooling capacity too?

Two loop cycle with heat exchanger to get rid of the heat

mandevil 2 days ago [-]
So Ashburn VA is a datacenter hub because the very first non-government Internet Exchange Point (IXP) anywhere in the world was there (https://en.wikipedia.org/wiki/MAE-East). Back in the 1990's something like half of all internet traffic all over the world hit MAE-East. That in turn made AWS put their first region there (us-east-1 preceded eu-west-1 by 2 years and us-west-1 by 3 years). Then because there were lots of people who knew how to build DC's- and lots of vendors who knew how to supply them- the Dulles Corridor became a major hub for lots of companies datacenters. For AWS, because us-east-1 was the first, it's by far the most gnarly and weird- and a lot of control planes for other AWS services end up relying on it. Which is why it goes down more often than other regions, and when it does go down it makes national news, unlike, say, eu-south-2 in Spain.

But NoVA is basically the same sort of economic cluster that Paul Krugman won his Nobel Prize in Economics for studying, just for datacenters, not factories.

ckozlowski 1 days ago [-]
Well said. I'll also add that with these networks, the sooner you can get traffic off your network the better, so there's a strong incentive to have your datacenter near these peering points. And since MAE-East was the first, it's been the largest, snowballing ever since. AOL's HQ was here, Equinix built their peering point soon after MAE-East, etc.

There's a great read about the whole area here: https://www.amazon.com/Internet-Alley-Technology-1945-2005-I...

As for AWS, I often see it repeated that the DCs are the oldest and therefore in disrepair. That's not true; many of the first ones have since been replaced. But there are services that are located here and only here.

But I'll also add that a lot of customers default to us-east-1 without considering others, and too many deploy in only one AZ. Part of this is AWS's fault, as their new services often launch in us-east-1 and us-west-2 first, so customers go to us-east-1 to get the new features first.

Speaking as someone who was with AWS for 10 years as a TAM and a Well-Architected contributor, I saw a lot of customers who didn't design with much resiliency in mind, so they get adversely affected when us-east-1 has an issue (regional or AZ-level). The other regions have their fair share of issues as well. It's not so much that east-1 necessarily fails more than the others; it's that it has so many AZs and so many workloads that people notice it more.

__turbobrew__ 1 days ago [-]
> But there are services that are located here and only here

Why is that? You would think company-ending events like IAM going poof because it depends on us-east-1 would be top priority to fix?

everfrustrated 1 days ago [-]
The underlying reason is more that by being on the US east coast you have about equal latency to customers on the US west coast and in Europe. That's a very large population covered from a single site.

If you're building a single datacenter site this is where you start building first.

aeyes 1 days ago [-]
LATAM as well, all major submarine cables land on the east coast. Surprisingly even from Mexico the latency is often better to US East.
arjie 2 days ago [-]
Amusingly, I've been part of two critical downtime heating incidents at two different datacenters: one when Hosting.com's SOMA datacenter got so hot that they were using hoses on the roof to cool it down, and the second when Alibaba's Chai Wan datacenter got so hot that everything running there went down, including the control plane. So I imagine proximity to the ocean doesn't yield any additional advantage in emergency heat sinking. You have x capacity to pump heat out, and it doesn't matter if you're next to the sea or in the middle of Nebraska, because your entire system needs to be built and rated for some performance level.
PunchyHamster 2 days ago [-]
Yeah, but capacity is easier/cheaper to build or overbuild if you have access to cold-ish water at all times.
dylan604 1 days ago [-]
Didn't really help Fukushima though. In fact, the ocean came to it. They didn't have to go get it.
kinow 2 days ago [-]
I had a class in my master's about data centers (HPC Infrastructures). The professor used some data centers in the middle of the USA, in an area with hot weather, as an example, and compared them with an ideal scenario (weather, power source, etc.).

One of the slides listed factors that influence the decision of where to build a data center; several of the items involved finding a place with enough space and skilled people to work at the data center. He also commented that sometimes there is politics involved in choosing the site for the next data center.

ikr678 2 days ago [-]
Off the top of my head: Ocean levels of salt in a water system are much more expensive to maintain (even the secondary loop).

Coastal land is much more expensive, and if you go to a remote coastal site, you probably won't have as good access to power.

Coastal sites usually exposed to more severe weather events.

Other fun unpredictable things, e.g. the Diablo Canyon nuclear facility has had issues with debris and jellyfish migrations blocking their saltwater cooling intake.

https://www.nbcnews.com/news/world/diablo-canyon-nuclear-pla...

idiotsecant 1 days ago [-]
And oysters / mussels / clams / every other creature that starts small and turns calcium into brick finds your cooling system to be a delightful place to raise a family, especially in delicate heat exchangers with small easily blockable passages.
jjmarr 2 days ago [-]
Oceans have salt, and saltwater is worse for electronics than normal water. You also need sufficient water depth, otherwise it'll warm to surface temperature. And it needs to be price-competitive with traditional evaporative cooling.

Toronto is the textbook example of this working. It's on a freshwater lake that is deep relatively close to the shore, and the downtown has expensive real estate blocking traditional methods.

https://en.wikipedia.org/wiki/Deep_Lake_Water_Cooling_System

dpe82 2 days ago [-]
In a proper 2-loop cooling system, the primary loop (with direct electronics contact) and secondary loop (with seawater/external cooling source) are hydraulically isolated by a heat exchanger. The salt water or whatever never gets anywhere near the electronics.
millipede 2 days ago [-]
Saltwater comes in on the air; just being near it corrodes everything. Both stainless steel and bronze are very expensive, and even if things were made of corrosion-proof materials, not everything can be, for strength reasons.
mschuster91 2 days ago [-]
The problem is, it's still in contact with something, even if it's just the secondary loop. Saltwater is not just incredibly aggressive against metal, the major problem with using it for cooling is fouling. Fish, mussels, algae, debris, there are a lot of things that can clog up your entire setup.
tempaccount5050 2 days ago [-]
Lots of proposals to build them near Lake Michigan recently but the residents of Wisconsin only want auto parts stores and paper mills. They've been completely demonized. Cities and counties are passing no data center laws even though it's the perfect place for it.
Leonard_of_Q 2 days ago [-]
Paper mills need a lot of heat energy to run the processes. Data centres produce a lot of heat. Sounds like a good combination?

Cold water -> data centre cooling loop -> warm water -> paper mill with heat pumps to transform low-grade heat into the required temperatures -> profit

bink 17 hours ago [-]
I can't believe any town would vote for a paper mill. It smells like a paper mill.
Leonard_of_Q 13 hours ago [-]
Well, they do provide jobs and tend to stay around for a while given the large investments needed to establish one. There is a big one about 15 km from where I live, it smells about the same as a wastewater treatment plant. There was one close to where I lived while in university as well (another era, another country) which mostly smelled of warm paper, no bad smells. All in all there are worse industries to have around.
idiotsecant 1 days ago [-]
Data center cooling system output has miserably low extractable energy.
quantumet 2 days ago [-]
They are, sometimes. Google built this one in Finland in 2011 at the site of an old paper mill, which was already set up to draw water from the Baltic Sea (which isn't as salty as the Atlantic is, but still not fresh water):

https://datacenters.google/locations/hamina-finland/

> Using a cooling system with seawater from the Bay of Finland and a new offsite heat recovery facility, our Hamina data centre is at the forefront of progressing our sustainability and energy-efficiency efforts.

sheept 2 days ago [-]
This is just a guess, but land near oceans is more expensive/populated, and water is comparatively cheap
dopa42365 1 days ago [-]
Humidity and corrosion, it's a trade-off (pick your poison).
whatever1 1 days ago [-]
Down 2 of the last 365 days. My Ubuntu NAS: down 0 of the last 365 days.

Come and give me your cash if you want resilience.

justincormack 1 days ago [-]
But you couldn't apply security updates because Ubuntu was down.
grendelt 1 days ago [-]
There sure are a lot of eggs in that East basket.
sitzkrieg 1 days ago [-]
using aws since s3 came out and i’ve yet to see any major company do multi az failover in any capacity whatsoever. default region ftw
jedberg 1 days ago [-]
We were doing multi-AZ and multi-region failover at Netflix all the way back in 2011:

https://netflixtechblog.com/the-netflix-simian-army-16e57fba...

yomismoaqui 1 days ago [-]
How many nines are we at this year?
toast0 1 days ago [-]
Eight eights!
jasonlotito 22 hours ago [-]
In this thread: People confusing AZs, Regions, and why people would want to use single AZs.
matt3210 1 days ago [-]
Right, cooling.
nikcub 2 days ago [-]
both realtime markets where multi-AZ is hard?
kikimora 1 days ago [-]
Order books have to run on a single server for performance reasons; similarly, a realtime multiplayer game.
jeffbee 1 days ago [-]
I don't see anything on downdetector suggesting this was particularly disruptive.
aussieguy1234 2 days ago [-]
AWS was once known for having super-reliable services; I've heard the company is now scrambling to re-hire some of the engineers it overconfidently "replaced" with AI.

When customers pay for cloud services, they expect them to be maintained by competent engineers.

edit: Not sure why the downvotes. If you fire the engineers who have been keeping your systems running reliably for years, what do you expect to happen?

wmf 1 days ago [-]
It's a cooling equipment failure. Equipment is going to fail.
ElenaDaibunny 1 days ago [-]
[flagged]
tcp_handshaker 2 days ago [-]
I bet post-mortem will say vibe coding confused fahrenheit and celsius, we run too hot...
geodel 1 days ago [-]
Now I totally understand the issue. I will set temperature 70K. Using SI unit of temperature is the best practice.
fukinstupid 1 days ago [-]
[flagged]
OhMeadhbh 2 days ago [-]
[flagged]
tomhow 2 days ago [-]
Please don't do this here.
BugsJustFindMe 2 days ago [-]
[flagged]
tailscaler2026 2 days ago [-]
us-east-1 is down? shocking! stop putting SPOF services there. this location has had frequent issues for the past 15 years.
unethical_ban 1 days ago [-]
This is correct... unless there is a specific requirement to be in that location for some kind of IXP or ultra low latency, I can't imagine putting mission-critical things in only that region.