What's the worst way you ever broke production?

RacerX@lemm.ee · 2 years ago

What's the worst way you ever broke production?

tquid@sh.itjust.works · edit-2 2 years ago

One time I was deleting a user from our MySQL-backed RADIUS database.

DELETE FROM PASSWORDS;

And yeah, if you don’t have a WHERE clause? It just deletes everything. About 60,000 records for a decent-sized ISP.

That afternoon really, really sucked. We had only ad-hoc backups. It was not a well-run business.

Now when I interview sysadmins (or these days devops), I always ask about their worst cock-up. It tells you a lot about a candidate.
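For anyone who wants a guard against the missing-WHERE mistake above, here is a minimal sketch. It assumes a hypothetical `passwords` table keyed by `username`, and it uses Python's built-in `sqlite3` rather than MySQL only so that it runs without a server. The idea is to do the delete inside a transaction and roll back unless exactly the expected number of rows matched. On MySQL itself, the `mysql` client's `--safe-updates` option is another layer worth considering.

```python
import sqlite3


def delete_user(conn, username, expected_rows=1):
    """Delete one user, rolling back unless exactly `expected_rows` rows matched."""
    cur = conn.cursor()
    try:
        # The WHERE clause is the whole point: without it every row goes.
        cur.execute("DELETE FROM passwords WHERE username = ?", (username,))
        if cur.rowcount != expected_rows:
            raise RuntimeError(
                f"expected {expected_rows} row(s), matched {cur.rowcount}"
            )
        conn.commit()
    except Exception:
        conn.rollback()  # undo the delete before re-raising
        raise
    return cur.rowcount


# Demo on an in-memory database (table and column names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passwords (username TEXT PRIMARY KEY, secret TEXT)")
conn.executemany(
    "INSERT INTO passwords VALUES (?, ?)",
    [("alice", "a"), ("bob", "b"), ("carol", "c")],
)
conn.commit()

delete_user(conn, "bob")
remaining = [row[0] for row in conn.execute(
    "SELECT username FROM passwords ORDER BY username"
)]
print(remaining)
```

The row-count check catches both a typo'd username (zero rows) and an over-broad condition (too many rows) before anything is committed.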

RacerX@lemm.ee · 2 years ago

Always skeptical of people that don’t own up to mistakes. Would much rather they own it and speak to what they learned.

tquid@sh.itjust.works · 2 years ago

Exactly!

cobysev@lemmy.world · 2 years ago

I was a sysadmin in the US Air Force for 20 years. One of my assignments was working at the headquarters for AFCENT (Air Forces Central Command), which oversees every deployed base in the Middle East. Specifically, I worked on a tier 3 help desk, solving problems that the help desks at deployed bases couldn’t figure out.

Normally, we got our issues in tickets forwarded to us from the individual base’s Communications Squadron (IT squadron at a base). But one day, we got a call from the commander of a base’s Comm Sq. Apparently, every user account on the base had disappeared and he needed our help restoring accounts!

The first thing we did was dig through server logs to determine what caused it. No sense fixing it if an automated process was the cause and would just undo our work, right?

We found one Technical Sergeant logged in who had run a command to delete every single user account in the directory tree. We sought him out and he claimed he was trying to remove one individual, but accidentally selected the tree instead of the individual. It just so happened to be the base’s tree, not an individual office or squadron.

As his rank implies, he’s supposed to be the technical expert in his field. But this guy was an idiot who shouldn’t have been touching user accounts in the first place. Managing user accounts is an Airman job; a simple job given to our lowest-ranking members as they’re learning how to be sysadmins. And he couldn’t even do that.

It was a very large base. It took 3 days to recover all accounts from backup. The Technical Sergeant had his admin privileges revoked and spent the rest of his deployment sitting in a corner, doing administrative paperwork.

𝕱𝖎𝖗𝖊𝖜𝖎𝖙𝖈𝖍@lemmy.world · edit-2 2 years ago

Accidentally deleted an entire column in a police department’s evidence database early in my career 😬

Thankfully, it only contained filepaths that could be reconstructed via a script. But I was sweating 12+1 bullets. Spent two days rebuilding that.

aksdb@lemmy.world · 2 years ago

And if you couldn’t reconstruct, you still had backups, right? … right?!

𝕱𝖎𝖗𝖊𝖜𝖎𝖙𝖈𝖍@lemmy.world · 2 years ago

Oh sweet summer child

FartsWithAnAccent@lemmy.world · 2 years ago

What the fuck is a “backups”?

z00s@lemmy.world · 2 years ago

He’s the guy that sits next to fuckups

SuperDuper@lemmy.world · 2 years ago

deleted an entire column in a police department’s evidence database

Based and ACAB-pilled

Quazatron@lemmy.world · 2 years ago

Did you know that “Terminate” is not an appropriate way to stop an AWS EC2 instance? I sure as hell didn’t.

Billegh@lemmy.world · 2 years ago

It doesn’t help that the webui used to hide stop. I think it still does.

sexual_tomato@lemmy.dbzer0.com · 2 years ago

I didn’t call out a specific dimension on a machined part; instead I left it to the machinist to understand and figure out what needed to be done without explicitly making it clear.

That part was a 2 ton forging with two layers of explosion-bonded cladding on one side. The machinist faced all the way through a cladding layer before realizing something was off.

The replacement had a 6 month lead time.

Buglefingers@lemmy.world · 2 years ago

That’s hilarious. Pretty recently I “caused” a line stop because a marker feature (a visual aid for assembly, so a pretty meaningless dimension overall) was very much over-dimensioned (we’re talking depth, radius, width, location from step). To top it off, instead of a spot drill just doing a .01 plunge, they interpolated it! (Why, I have zero clue.) So it had been leaving dwell marks for at least the past 10 months, and because the feature was over-dimensioned, all of those parts had to be put on hold, because DOD demands perfection (aircraft engine parts).

treechicken@lemmy.world · edit-2 2 years ago

I once “biased for action” and removed some “unused” NS records to “fix” a flakey DNS resolution issue without telling anyone on a Friday afternoon before going out to dinner with family.

Turns out my fix did not work and those DNS records were actually important. Checked on the website halfway into the meal and freaked the fuck out once I realized the site went from resolving 90% of the time to not resolving at all. The worst part was when I finally got the guts to report I messed up on the group channel, DNS was somehow still resolving for both our internal monitoring and for everyone else who tried manually. My issue got shoo-shoo’d away, and I was left there not even sure of what to do next.

I spent the rest of my time on my phone, refreshing the website and resolving domain names in an online Dig tool over and over again, anxiety growing, knowing I couldn’t do anything to fix my “fix” while I was outside.

Once I came home I ended up reversing everything I did which seemed to bring it back to the original flakey state. Learned the value of SOPs and taking things slow after that (and also to not screw with DNS).

If this story has a happy ending, it’s that we did eventually fix the flakey DNS issue later, going through a more rigorous review this time. On the other hand, how and why I, a junior at the time, became the de facto owner of an entire product’s DNS infra remains a big mystery to me.

Burninator05@lemmy.world · 2 years ago

Hopefully you learned a rule I try to live by despite not listing it: “no significant changes on Friday, no changes at all on Friday afternoon”.
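As a rough sketch, that rule could be turned into a pre-flight check in a deploy script. The function name `change_allowed` and the choice of noon as the start of "afternoon" are assumptions for illustration, not anything from a real tool.

```python
from datetime import datetime

FRIDAY = 4  # datetime.weekday() counts Monday as 0, so Friday is 4


def change_allowed(when, significant):
    """Apply the rule: no significant changes on Friday,
    no changes at all on Friday afternoon."""
    if when.weekday() != FRIDAY:
        return True
    if when.hour >= 12:  # assumption: "afternoon" starts at noon local time
        return False
    return not significant


# 5 January 2024 was a Friday.
print(change_allowed(datetime(2024, 1, 5, 9, 0), significant=False))
print(change_allowed(datetime(2024, 1, 5, 14, 0), significant=False))
```

A real version would probably also want holidays and on-call coverage, but even this much forces the question before the change goes out.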

shyguyblue@lemmy.world · 2 years ago

Updated WordPress…

Previous Web Dev had a whole mess of code inside the theme that was deprecated between WP versions.

Fuck WordPress for static sites…

necrobius@lemm.ee · 2 years ago

Create a database,
Have organisation manually populate it with lots of records using a web app,
accidentally delete database.

All in between the backup window.

slazer2au@lemmy.world · 2 years ago

I took down an ISP for a couple hours because I forgot the ‘add’ keyword at the end of a Cisco configuration line

sloppy_diffuser@sh.itjust.works · 2 years ago

That’s a rite of passage for anyone working on Cisco’s shit TUI. At least it’s gotten better with some of the newer stuff. IOS-XR supported commits and diffing.

Burninator05@lemmy.world · 2 years ago

I spent over 20 years in the military in IT. I took down the network at every base I was ever at, each time finding a new way to do it. Sometimes, but rarely, intentionally.

mojofrododojo@lemmy.world · 2 years ago

took out a node center by applying the patches gd recommended… took an entire weekend to restore all the shots and my ass got fed 3/4ths into the woodchipper before it came out that the vendor was at fault for this debacle.

kindernacht@lemmy.world · 2 years ago

My first time shutting down a factory at the end of second shift for the weekend. I shut down the compressors first, and that hard stopped a bunch of other equipment that relied on the air pressure. Lessons learned. I spent another hour restarting then properly shutting down everything. Never did that again.

hperrin@lemmy.world · 2 years ago

I fixed a bug and gave everyone administrator access once. I didn’t know that bug was… in use (is that the right way to put it?) by the authentication library. So every successful login request, instead of being returned the user who just logged in, was returned the first user in the DB, “admin”.

Had to take down prod for that one. In my four years there, that was the only time we ever took down prod without an announcement.
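The kind of check that would have caught it before prod is cheap: after a login succeeds, confirm the user that came back is the user who asked. This is a minimal sketch only. `authenticate` is a stand-in for whatever the authentication library exposes, and the dict-shaped user record is an assumption, so none of this reflects the actual code involved.

```python
class LoginMismatch(Exception):
    """Raised when the returned user is not the one who logged in."""


def checked_login(authenticate, username, password):
    """Call the (hypothetical) authenticate function and verify its result."""
    user = authenticate(username, password)
    if user is None:
        return None  # failed login, nothing to verify
    if user.get("username") != username:
        raise LoginMismatch(
            f"asked for {username!r}, got {user.get('username')!r}"
        )
    return user


# Two stand-in backends: one correct, one that always returns the first user.
USERS = [{"username": "admin"}, {"username": "bob"}]


def good_auth(username, password):
    return next((u for u in USERS if u["username"] == username), None)


def buggy_auth(username, password):
    return USERS[0]


print(checked_login(good_auth, "bob", "pw")["username"])
try:
    checked_login(buggy_auth, "bob", "pw")
except LoginMismatch as exc:
    print("caught:", exc)
```

The same assertion works as a regression test in CI, which is probably the better home for it than the request path.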

pastermil@sh.itjust.works · 2 years ago

I accidentally destroyed the production system completely thru an improper partition resize. We had a database snapshot, but it was on that server as well. After scrambling around for half a day, I managed to recover some of the older data dumps.

So I spun up the new server from scratch, restored the database with some slightly outdated dump, installed the code (which was thankfully managed thru git), and configured everything to run all in an hour or two.

The best part: everybody else knows this as some trivial misconfiguration. This happened in 2021.

surewhynotlem@lemmy.world · 2 years ago

I removed the proxy settings from every user in the company. Over 80k people without Internet for the day.

EmasXP@lemmy.world · 2 years ago

Two things pop up

I once left in an alert() asking “what the fuck?”. That was mostly laughed off, so no worries.
I accidentally dropped the production database and replaced it with the staging one. That was not laughed off.

TeenieBopper@lemmy.world · 2 years ago

I once dropped a table in the production database. I did not replace it with the same table from staging.

On the bright side, we discovered our vendor wasn’t doing daily backups.

Churbleyimyam@lemm.ee · 2 years ago

It wasn’t me personally but I was working as a temp at one of the world’s biggest shoe distribution centers when a guy accidentally made all of the size 10 shoes start coming out onto the conveyor belts. Apparently it wasn’t a simple thing to stop it and for three days we basically just stood around while engineers were flown in from China and the Netherlands to try and sort it out. The guy who made the fuckup happen looked totally destroyed. On the last day I remember a group of guys in suits coming down and walking over to him in the warehouse and then he didn’t work there any more. It must have cost them an absolute fortune.