A single DNS race condition brought AWS to its knees
8 months ago by Matt The Horwood to c/selfhosted
Oh man. At one of my old companies, the devs would always blame the network. Even after we spent a year upgrading and removing all SPOFs, they'd still blame the network...
“Your application is somehow producing 2 billion packets per second and your SQL queries are returning 5GB of data”…. “See! The network is too slow and it has problems”
Dev: My app's getting a 400 hitting the server. Your firewall changes broke it.
Me: You're getting to the server, it's giving you back a malformed request error. Most likely it's a problem in your client.
Dev: it worked fine until you made that change in QA.
Me: Your server is in production.
After that, I just get too busy to look at it for a while.... They figure it out eventually.
I always view the source of websites like this, and this is one of the worst I've seen. 217 lines of code (including inline JavaScript?!) and a Google tag for some reason, all to put the word YES in green on black.
this made me mad so i made a single, ultra minimal html page in 5 minutes that you can just paste in your url box
data:text/html;base64,PCFkb2N0eXBlaHRtbD48Ym9keSBzdHlsZT10ZXh0LWFsaWduOmNlbnRlcjtmb250LWZhbWlseTpzYW5zLXNlcmlmO2JhY2tncm91bmQ6IzAwMDtjb2xvcjojMmYyPjxoMT5JcyBpdCBETlM/PC9oMT48cCBzdHlsZT1mb250LXNpemU6MTJyZW0+WWVz
source code:
<!doctypehtml><body style=text-align:center;font-family:sans-serif;background:#000;color:#2f2><h1>Is it DNS?</h1><p style=font-size:12rem>Yes
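For anyone who wants to check the data URL rather than trust it, here is a minimal sketch that rebuilds it from the source code line using Python's standard `base64` module. The variable names are mine, not from the comment above.

```python
import base64

# The minimal page from the comment above, copied verbatim.
html = (
    "<!doctypehtml><body style=text-align:center;font-family:sans-serif;"
    "background:#000;color:#2f2><h1>Is it DNS?</h1>"
    "<p style=font-size:12rem>Yes"
)

# Standard base64 of the UTF-8 bytes, prefixed with the data URL header.
payload = base64.b64encode(html.encode("utf-8")).decode("ascii")
data_url = "data:text/html;base64," + payload

# Decoding the payload again must give back the exact same markup.
round_trip = base64.b64decode(payload).decode("utf-8")

print(data_url)
```

Pasting the printed string into the address bar should render the same page, with no DNS lookup involved.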
Your website no longer uses DNS, invalidating its use as a diagnostic tool lmao
lmao, considering some of the meaningless comments there i'm starting to think it's "vibe coded".
There have been 209 versions of that site
it predated AI, but likely seems to have had some AI cleanup.
If it was truly just vibecoded, the comments would usually be on every element.
It’s always DNS
It’s not DNS
There’s no way it’s DNS
It was DNS
That and BGP
If I had a nickel for every time clearing the ARP tables fixed a problem, I'd have a shitload of nickels.
If clearing the ARP tables fixes the issue you have bigger problems
These things happen when a skinflint company contracts out network setup for a decade, gets acquired by another skinflint company who axes the contractors and doesn't hire on-site network personnel, gradually builds out infra on top of the unsupported foundation, and then hires c suite buddies who want to bring in their own people to further muddy the waters.
oh sure, when they fuck up DNS it's a "race condition".
when I fuck up DNS it's a "fireable offense".
It's funny the AWS report didn't mention that 40% of sysops were replaced by AI. https://blog.stackademic.com/...
Wasn't that source from a year ago?
You are right, it was from July, and there have been no other confirmed layoffs from credible sources since.
I KNEW IT. It feels good to have my suspicions validated like this. The biggest companies are the ones most hyped over useless AI, and it's going to destroy them.
They need to uphold the AI hype, at any cost possible.
Much of this stuff is automatic - I've worked with such contracted services where uptime is guaranteed. The contracts dictate the terms and conditions for refunds, we see them on a monthly basis when uptime is missed and it's not done by a person.
I imagine many companies have already seen refunds for outage time, and Amazon scrambled to stop the automation around this.
They'll have little to stand on in court for something this visible and extensive, and could easily lose their shirt with fines and penalties when a big company sues over breach of contract after choosing not to renew.
Just cause they're big doesn't mean all their clients are small or don't have legal teams of their own.
These contracts do not stipulate reimbursement for lost revenue. The “uptime guarantee” just gets you a partial discount or service refund for the impacted services.
It is on the customer to architect their environment for high availability (use multiple regions or even multiple hyperscalers, depending on the uptime need).
Source: I work at an enterprise that is bound by one of these agreements (although not with AWS).
SLA contracts can have a plethora of stipulations, including fines and damages for missing SLO. It really depends on how big and important the customer is. For example, you can imagine government contracts probably include hefty fines for causing downtime or data loss, although I am not involved with or familiar with public sector/ government contracts or their terms.
You can imagine that a customer that is big enough to contract a cloud provider to build new locations and install a bunch of new hardware just for them, would also be big enough to leverage contract terms that include fines and compensation for extended downtime or missing SLO.
I work at a data center for a major cloud provider, also not AWS
It's not at all uncommon for fines to be built into an SLA
Amazon has more money than most countries. They can outlast any company in court, or just ban you from their services in the future.
Depends on who we're talking about. Companies like finance orgs are all about legal contracts and would be able to hold their feet to the fire.
You don't want to go to court against a finance company or any very large org where contract law is their bread and butter (basically any large/multinational corp).
Amazon's not hosting just small operations.
Good luck arguing that a missed config counts as an 'unforeseen issue'. If they go that route, people will be all over them for not being SOC compliant wrt change control.
99% uptime over a year allows 3.65 days of downtime, which I think would still be within SLA (assuming nothing else happened this year). Once you get to three nines (99.9%), though, you've only got a shift and change (about 8.76 hours) you can be down before you breach SLA.
If their reliability metrics are monthly, 99% gets you less than a shift of downtime (about 7.2 hours in a 30-day month), so they'd be out of SLA and could probably yell to get money back.
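The arithmetic behind those numbers, as a quick sketch. The function name and the 30-day month are my assumptions; real SLAs define their own measurement windows and exclusions.

```python
def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at a given availability percentage."""
    return period_hours * (100.0 - availability_pct) / 100.0


HOURS_PER_YEAR = 365 * 24   # ignoring leap years
HOURS_PER_MONTH = 30 * 24   # assuming a 30-day billing month

yearly_two_nines = allowed_downtime_hours(99.0, HOURS_PER_YEAR)
yearly_three_nines = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
monthly_two_nines = allowed_downtime_hours(99.0, HOURS_PER_MONTH)

print(f"99%   per year:  {yearly_two_nines / 24:.2f} days")
print(f"99.9% per year:  {yearly_three_nines:.2f} hours")
print(f"99%   per month: {monthly_two_nines:.2f} hours")
```

That works out to 3.65 days, roughly 8.76 hours, and 7.2 hours respectively, so whether an outage breaches SLA depends heavily on whether the window is a year or a month.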
They have ORANGE ass makeup on their lips. How did THAT get there???
Oops! All slop!
Mistakes happen with or without AI
The problem is that the current internet is structured in a way that creates high risk systems that can cause a massive outage. We went from having thousands of independent companies to a handful of massive ones. A mistake by a single company shouldn't be able to black out half the internet.
There was never any evidence to even suggest that AI was the cause, but as you're on lemmy I'm sure you know that AI is currently blamed for pretty much everything.
Saying it was caused by AI despite zero evidence of AI causing it is dumb. It wasn’t AI, it was a DNS change made by a person.
The whole thing has nothing to do with AI, other than people who hate AI trying to make it about AI.
That's what you get when you let go hundreds of employees
OK but then... what happens when their boss jerk fires hundreds of thousands?
They rely on AWS because of a favourable hosting contract, and also as a proof of concept that they can be hosted securely on a hostile provider without the provider having any clue what data is being sent between the parties.
sure, proving to the audience that you can kick yourself in the nuts over and over while maintaining the privacy of your testicle's innards is impressive from a biological standpoint but it still looks stupid to a normal person. I don't hate signal, I will continue using it but this and their crypto scam makes me doubt some of their choices and how they'll operate in the future
Beat me to it!
It was the best race anyone has ever seen 🫲🍊🫱
Let's be honest, not all races are equal
🫲🍊🫱
It's funny the AWS report didn't mention that 40% of AWS sysops people were replaced by AI right before this https://blog.stackademic.com/...
this is unconfirmed and unlikely
Just one more layer bro, just one more automated planning system bro and this time it will be entirely faultless please bro one more layer
I know a dude that talks like this... Like I hear his voice when I read this.
Ironically, my pihole is blocking that link. So here’s a clean one: https://www.theregister.com/...
This is purely anecdotal, but I have been running into a lot of DNS issues over the past couple months where I work. 3 of the computers and even one of the laptops for remote work were having DNS issues that needed to be fixed. One even needed Windows reinstalled after fixing the DNS issue (Which was probably unrelated, but worth mentioning)
I'm honestly starting to think that the internet in general might be imploding. Not sure why, but replacing so many developers and programmers with AI might be responsible. Who knows, but it's definitely very strange.
The biggest issue is how centralized the internet has become. It went from a bunch of local servers to a handful of cloud providers.
We need to spread things out again
That's not how capitalism works though
But but Bezos has to pay for another rocket and yacht and he just got married!!!! Think about his quarterly statement! My god are you heartless!!!!!!!!
/s
(just in case it's not obvious)
A huge problem is developers who lack a fundamental understanding of how the internet even works. I've had to explain how short, unqualified names resolve vs how FQDNs resolve. Or why you may not even be able to reach another node in your proverbial cluster, because the nodes are on different subnets. Or why using GUIDs as hostnames is generally a bad idea and will cause things to fail in unpredictable ways, especially with deeply nested subdomains.
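To make the short name vs FQDN point concrete, here is a rough sketch of how a stub resolver expands names, modeled on the `search` and `ndots` options of a glibc-style resolv.conf. The function name and the example.com domains are hypothetical, and other resolvers (Windows, musl, systemd-resolved) behave differently in the details.

```python
def candidate_names(name: str, search: list[str], ndots: int = 1) -> list[str]:
    """Return the fully qualified names a stub resolver would try, in order."""
    # A trailing dot marks the name as already fully qualified: no search list.
    if name.endswith("."):
        return [name]

    as_is = name + "."
    from_search = [f"{name}.{domain.rstrip('.')}." for domain in search]

    # Names with at least `ndots` dots are tried as-is first, then via the
    # search list; names with fewer dots go through the search list first.
    if name.count(".") >= ndots:
        return [as_is] + from_search
    return from_search + [as_is]


# Hypothetical search list, as it might appear in resolv.conf.
search_list = ["prod.example.com", "example.com"]

short_name = candidate_names("db01", search_list)
dotted_name = candidate_names("db01.staging", search_list)
fqdn = candidate_names("db01.example.com.", search_list)

for label, names in [("short", short_name), ("dotted", dotted_name), ("fqdn", fqdn)]:
    print(label, names)
```

The same short name can land on a different host depending on which search list the machine happens to have, which is why "it works on my box" is so common with unqualified names.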
I have worked with too many devs that didn't even know what the 7 layers/OSI are or why they exist.
they didn't know what a network port was used for and why it's important to not expose 3306 to the internet.
they couldn't understand that fragmentation of a message bus occurs when you don't dedupe the contents.
you know, morons.
Ah, the common clay of the new Web
guids like these: https://guidgenerator.com/
Why the fuck would anyone use a guid as a hostname?
My favorite I've seen in the category was when they had hostnames that were basically the IP address decorated with some bullshit. Like yeeeeeeeeah, that totally makes fucking sense. 😆
Racist DNS!
I'm glad these things happen... it keeps everyone aware that cloud is fragile and Plan B should be considered for mission critical tasks.
I'm also hoping that it will improve cloud resiliency because a complete / partial restart of cloud systems needs a whole different approach than maintaining a running system.
Many different companies abruptly realized they need a DR plan for cloud outages
It is designed not to be. The RFC literally warns against single points of failure.
It's true.
It comes up at work, it comes up in discussions on Linux podcasts I listen to, it comes up here...
We have a big, dangerous impending problem in DNS.
The issue here isn't DNS. The issue here is a large portion of the internet relying on a single data centre on the US East coast. Ideally, a lot of competing hosting companies would exist so if one goes down, it's just one service and very few people notice.
So much this.
Why is Signal hosted in one location on AWS, for example? That's the sort of thing that should be in multiple places around the world with automatic fail over.
Yes, that's true, I guess it's a separate issue. But the way DNS currently runs is a problem waiting to happen.
Yeah, I don't get why they don't just put a RasPi in some corner, put PiHole on it and call it a day.
Geez, I mean, they could even charge extra for it, as they now block ads for their customers as well.
Like, imma gonna sell my advice to Amazon now, so they can clean up their act.
They got out of sync.
@lemmy.world
A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.
Rules:
Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.
No spam.
Posts here are to be centered around self-hosting. Please ensure it is clear in your post how it relates to self-hosting.
Don't duplicate the full text of your blog or git here. Just post the link for folks to click.
Submission headline should match the article title.
No trolling.
Resources:
Any issues on the community? Report it using the report flag.
Questions? DM the mods!
So it is always DNS