CrowdStrike Isn't the Real Problem

John Richard@lemmy.world · 1 year ago

breakingcups@lemmy.world · 1 year ago

Please, enlighten me how you’d remotely service a few thousand Bitlocker-locked machines, that won’t boot far enough to get an internet connection, with non-tech-savvy users behind them. Pray tell what common “basic hygiene” practices would’ve helped, especially with Crowdstrike reportedly ignoring and bypassing the rollout policies set by their customers.

Not saying the rest of your post is wrong, but this stood out as easily glossed over.

ramble81@lemm.ee · 1 year ago

You’d have to have something even lower level, like an OOB KVM on every workstation, which would be stupid expensive for the ROI, or something at the UEFI layer that could potentially introduce more security holes.

Leeks@lemmy.world · 1 year ago

Maybe they should offer a real time patcher for the security vulnerabilities in the OOB KVM, I know a great vulnerability database offered by a company that does this for a lot of systems world wide! /s

A_A@lemmy.world · 1 year ago

Lol 😋 ! also i need a “Out-of-Band, Keyboard, Video, and Mouse” to your “OOB, KVM” so to ~~steal the bank~~ improve security.

Leeks@lemmy.world · 1 year ago

“It’s turtles all the way down”.

Brkdncr@lemmy.world · 1 year ago

vPro is usually $20 per machine and offers OOB KVM.

John Richard@lemmy.world · 1 year ago

UEFI isn’t going away. Sorry to break the news to you.

ramble81@lemm.ee · 1 year ago

I didn’t say it was, nor did I say UEFI was the problem. My point was that additional applications or extensions at the UEFI layer increase the attack surface of a system. Just like with vPro, you’re giving hackers a method that can compromise a system below the OS. Add in laptops and computers that get plugged in at random places before VPNs and other security software are loaded, and you have a nice recipe for hidden spyware and such.

LrdThndr@lemmy.world · edit-2 1 year ago

A decade ago I worked for a regional chain of gyms with locations in 4 states.

I was in TN. When a system would go down in SC or NC, we originally had three options:

1. (The most common) Have them put it in a box and ship it to me.
2. I go there and fix it (rare).
3. I walk them through fixing it over the phone (fuck my life).

I got sick of this. So I researched options and found an open source software solution called FOG. I ran a server in our office and shipped a little OptiPlex 160 running a software client to each club. Then each machine at each club was configured to PXE boot from the FOG client.

The server contained images of every machine we commonly used. I could tell FOG which locations used which models, and it would keep the images cached on the client machines.

If everything was okay, it would chain the boot to the OS on the machine. But I could flag a machine for reimage, and at next boot the machine would check in with the local FOG client via PXE and get a complete reimage from premade images on the FOG server.
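The check-in logic described above can be sketched roughly as follows. This is not FOG's actual code or API; `ImagingNode`, its methods, the returned action strings, and the model and image names are all hypothetical and only illustrate the "chain to the local OS unless flagged" decision.

```python
from dataclasses import dataclass, field


@dataclass
class ImagingNode:
    """Hypothetical stand-in for the local imaging box at one site."""

    # Hardware model -> name of the image cached locally for that model.
    cached_images: dict[str, str]
    # MAC addresses an admin has flagged for reimage.
    reimage_flags: set[str] = field(default_factory=set)

    def flag_for_reimage(self, mac: str) -> None:
        """What the admin's checkbox would do: mark a machine for reimage."""
        self.reimage_flags.add(mac.lower())

    def on_pxe_checkin(self, mac: str, model: str) -> str:
        """Decide what a machine should do when it checks in at boot."""
        mac = mac.lower()
        if mac not in self.reimage_flags:
            # Nothing pending: hand control to the OS on the local disk.
            return "chainload-local-os"
        image = self.cached_images.get(model)
        if image is None:
            # Flag stays set so the admin can see the job never ran.
            return "no-image-for-model"
        # One-shot job: clear the flag so the next boot is a normal boot.
        self.reimage_flags.discard(mac)
        return f"deploy:{image}"
```

The point of the design is that the decision happens before the installed OS loads, so a machine whose OS is broken can still be told to reimage.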

The corporate office was physically connected to one of the clubs, so I trialed the software at our adjacent club, and when it worked great, I rolled it out company wide. It was a massive success.

So yes, I could completely reimage a computer from hundreds of miles away by clicking a few checkboxes on my computer. Since it ran in PXE, the condition of the OS didn’t matter at all. It never loaded the OS when it was flagged for reimage. It would even join the computer to the domain and set up that location’s printers and everything. All I had to tell the low-tech gymbro sales guy on the phone to do was reboot it.

This was free software. It saved us thousands in shipping fees alone. And brought our time to fix down from days to minutes.

There ARE options out there.

magikmw@lemm.ee · edit-2 1 year ago

This works great for stationary PCs and local servers, but does nothing for laptops in users’ hands on the public internet.

The only fix here is staggered and tested updates, and apparently this update bypassed even the deferred update settings that CrowdStrike themselves put into their software.

The only winning move here was to not use CrowdStrike.

LrdThndr@lemmy.world · 1 year ago

Absolutely. 100%

But don’t let perfect be the enemy of good. A fix that gets you 40% of the way there is still 40% less work you have to do by hand. Not everything has to be a fix for all situations. There’s no such thing as a panacea.

magikmw@lemm.ee · 1 year ago

Sure. At the same time one needs to manage resources.

I was all in on laptop deployment automation. It cut down on a lot of human error issues and having inconsistent configuration popping up all the time.

But it needs constant supervision, even if not constant updates. More systems and solutions lead to neglect if not supplied well. So some “would be good to have” systems just never make the cut, because as overachieving as I am, I also don’t want to think everything is taken care of when it clearly isn’t.

catloaf@lemm.ee · 1 year ago

Yeah. I find a base image and post-install config with group policy or Ansible to be far more reliable.

magikmw@lemm.ee · 1 year ago

Yea, we’re doing something similar. We only update base images for bigger OS updates or if something breaks or can break.

The general idea is to have config that works for both new PCs and the ones that are already in use. Saves on maintaining two configuration methods.

John Richard@lemmy.world · 1 year ago

You were all in, but was the company all in? How many employees? It sounds like you innovated. Let’s say that the company you worked for was spending millions on vendors that promised solutions but rarely delivered. If instead they gave you $400k a year, a $1 million/year budget & 10 employees… I’m guessing you could have managed the laptop deployment automation, along with some other significant projects as well.

Instead though, people with good ideas, even loyal to the company, are competing against sales and marketing reps from billion dollar companies, and upper management are easily swooned.

magikmw@lemm.ee · 1 year ago

I’m the only one to swoon here, and I’m as sceptical as one can be.

I’m also a cost and my budget is on paper only. Non-IT management is complicit in crappy IT.

LrdThndr@lemmy.world · 1 year ago

Completely fair, man.

wizardbeard@lemmy.dbzer0.com · 1 year ago

It also assumes that reimaging is always an option.

Yes, every company should have networked storage enforced specifically for issues like this, so no user data would be lost, but there’s often a gap between should and “has been able to find the time and get the required business side buy in to make it happen”.

Also, users constantly find new ways to do non-standard, non-supported things with business critical data.

Bluetreefrog@lemmy.world · 1 year ago

Isn’t this just more of what caused the problem in the first place? Namely, centralisation. If you store data locally and you lose a machine, that’s bad but not the end of the world. If you store it centrally and you lose the data, that’s catastrophic. Nassim Taleb nailed this stuff. Keep the downside limited, and the upside unlimited or as he says, “Don’t pick up pennies in front of a steamroller.”

John Richard@lemmy.world · 1 year ago

Almost all computers can be set to PXE boot, but work laptops usually have even more advanced remote management capabilities. You ask the employee to reboot the laptop and presto!

magikmw@lemm.ee · 1 year ago

I wonder how you’re supposed to get PXE boot to work securely over the internet. And how that helps when affected disk is still encrypted and needs unusual intervention to fix, including admin access to system files.

I’ve been doing this for a while, and I like creative solutions, so I wonder about those issues a lot. Not much comes to my mind besides let’s recall all the laptops and do it one by one.

wizardbeard@lemmy.dbzer0.com · edit-2 1 year ago

Hypothetically you could ship a company-provided router to your remote users to handle the VPN connection, so you aren’t relying on the OS being able to boot up to get connected to the VPN for the company network and PXE environment. Lots of extra cost and mess though.

LrdThndr@lemmy.world · 1 year ago

From a home user? Probably ain’t shit-all you can do with PXE booting. But if you have a field office or somewhere a user can go with a hardware vpn appliance? Well now you’re in business.

John Richard@lemmy.world · 1 year ago

I wonder how you’re supposed to get PXE boot to work securely over the internet.

PXE boot is more of a last resort IMO, but it can be used as a chainloader to a more secure option. The biggest challenge I could see security-wise is having PXE boot being run on unsecured networks. Even then though, normally a computer will have been provisioned on a secure network and will have encryption, secure boot-based encryption, and some additional signature-based image verification.

Evotech@lemmy.world · edit-2 1 year ago

Now your fog servers are dead. What now

Brkdncr@lemmy.world · 1 year ago

How removed from IT are you that you think FOG would have helped here?

LrdThndr@lemmy.world · edit-2 1 year ago

How would it not have? You got an office or field offices?

“Bring your computer by and plug it in over there.” And flag it for reimage. Yeah. It’s gonna be slow, since you have 200 of the damn things running at once, but you really want to go and manually touch every computer in your org?

The damn thing’s even boot looping, so you don’t even have to reboot it.

I’m sure the user saved all their data in OneDrive like they were supposed to, right?

I get it, it’s not a 100% fix rate. And it’s a bit of a callous answer to their data. And I don’t even know if the project is still being maintained.

But the post I replied to was lamenting the lack of an option to remotely fix unbootable machines. This was an option to remotely fix nonbootable machines. No need to be a jerk about it.

But to actually answer your question and be transparent, I’ve been doing Linux devops for 10 years now. I haven’t touched a windows server since the days of the gymbros. I DID say it’s been a decade.

Brkdncr@lemmy.world · 1 year ago

Because your imaging environment would also be down. And you’re still touching each machine and bringing users into the office.

Or your imaging process over the wan takes 3 hours since it’s dynamically installing apps and updates and not a static “gold” image. Imaging is then even slower because your source disk is only ssd and imaging slows down once you get 10+ going at once.

I’m being rude because I see a lot of armchair sysadmins that don’t seem to understand the scale of the CrowdStrike outage, what CrowdStrike even is beyond antivirus, and the workflow needed to recover from it.

LrdThndr@lemmy.world · 1 year ago

FOG ran on Linux. It wouldn’t have been down. But that’s beside the point.

I never said it was a good answer to CrowdStrike. It was just a story about how I did things 10 years ago, and an option for remotely fixing nonbooting machines. That’s it.

I get you’ve been overworked and stressed as fuck this last few days. I’ve been out of corporate IT for 10 years and I do not envy the shit you guys are going through right now. I wish I could buy you a cup of coffee or a beer or something.

Brkdncr@lemmy.world · 1 year ago

Last time I used fog it was only doing static image deployment which has been out of style for a while. I don’t know if there are any serious deployment products for windows enterprise that don’t run on windows.

I’m personally not dealing with this because I didn’t like how Crowdstrike had answered a number of questions in their sales call.

Avoiding telling me their vuln scan doesn’t probe all hosts after claiming it could replace a real vuln scanner, claiming they are somehow better than others at malware detection without bringing up 3rd party tests, claiming their product was novel when others have been doing the same for 7+ years.

My fave was them telling me how much easier it is to manage but no one on the call had ever worked as a sysadmin or even seen how their competition works.

Shitshow. I’m so glad this happened so I can block their sales team.

John Richard@lemmy.world · 1 year ago

Imaging environment down? If a sysadmin can’t figure out how to boot a machine into recovery to remove the bad update file then they have bigger problems. The fix in this instance wasn’t even re-imaging machines. It was merely removing a file. Ideal DR scenario would have a recovery image already on the system that can be booted into remotely, so there is minimal strain on the network. Furthermore, we don’t live in dial-up age anymore.

Brkdncr@lemmy.world · 1 year ago

Imaging environment would be bitlocker’d with its key stuck in AD which is also bitlocker’d.

catloaf@lemm.ee · 1 year ago

Only if you’re not practicing 3-2-1 with your backups.
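For anyone unfamiliar, 3-2-1 usually means at least three copies of the data, on at least two kinds of media, with at least one copy offsite. A minimal sketch of that check, where `BackupCopy` and `satisfies_3_2_1` are hypothetical names invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset (hypothetical model, for illustration only)."""

    media: str      # e.g. "disk", "tape", "object-storage"
    offsite: bool   # True if the copy lives outside the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Return True if the copies meet the common 3-2-1 rule of thumb."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite
```

Note that the rule says nothing about whether the keys needed to read a copy are themselves reachable during an outage.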

Brkdncr@lemmy.world · 1 year ago

Backup environment is also bitlocker’d.

John Richard@lemmy.world · 1 year ago

Thank you for sharing this. This is what I’m talking about. Larger companies not utilizing something like this already are dysfunctional. There are no excuses for why it would take them days, weeks or longer.

Dran@lemmy.world · edit-2 1 year ago

Separate persistent data and operating system partitions, ensure that every local network has small pxe servers, vpned (wireguard, etc) to a cdn with your base OS deployment images, that validate images based on CA and checksum before delivering, and give every user the ability to pxe boot and redeploy the non-data partition.

Bitlocker keys for the OS partition are irrelevant because nothing of value is stored on the OS partition, and keys for the data partition can be stored and passed via AD after the redeploy. If someone somehow deploys an image that isn’t ours, it won’t have keys to the data partition because it won’t have a trust relationship with AD.
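The "validate before delivering" step could look something like the sketch below. It covers only the checksum half (the CA signature check is left out), and the function names are made up for illustration rather than taken from any real deployment tool.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large OS images never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_trusted(path: Path, expected_sha256: str) -> bool:
    """Compare the image's digest with the expected hex digest.

    Assumption: expected_sha256 comes from a source you already trust,
    such as a signed manifest. Verifying that signature is out of scope here.
    """
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())
```

A PXE server would refuse to serve any image for which `image_is_trusted` returns False.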

(This is actually what I do at work)

I_Miss_Daniel@lemmy.world · 1 year ago

Sounds good, but can you trust an OS partition not to store things in %programdata% etc that should be encrypted?

Dran@lemmy.world · 1 year ago

With enough autism in your overlay configs, sure, but in my environment that leakage is still encrypted. It’s far simpler to just accept leakage and encrypt the OS partition with a key that’s never stored anywhere. If it gets lost, you rebuild the system from PXE. (Which is fine, because it only takes about 20 minutes and no data we care about exists there.) If it’s working correctly, the OS partition is still encrypted and protects any inadvertent data leakage from offline attacks.

pHr34kY@lemmy.world · 1 year ago

I’ve been separating OS and data partitions since I was a kid running Windows 95. It’s horrifying that people don’t expect and prepare for machines to become unbootable on a regular basis.

Hell, I bricked my work PC twice this year just by using the Windows cleanup tool - on Windows 11. The antivirus went nuclear, as antivirus products do.

Brkdncr@lemmy.world · 1 year ago

But your pxe boot server is down, your radius server providing vpn auth is down, your bitlocker keys are in AD which is down because all your domain controllers are down.

Dran@lemmy.world · 1 year ago

Yes and no. In the best case, endpoints have enough cached data to get us through that process. In the worst case, that’s still a considerably smaller footprint to fix by hand before the rest of the infrastructure can fix itself.

felbane@lemmy.world · 1 year ago

Rollout policies are the answer, and CrowdStrike should be made an example of if they were truly overriding policies set by the customer.

It seems more likely to me that nobody was expecting “fingerprint update” to have the potential to completely brick a device, and so none of the affected IT departments were setting staged rollout policies in the first place. Or if they were, they weren’t adequately testing.

Then - after the fact - it’s easy to claim that rollout policies were ignored when there’s no way to prove it.

If there’s some evidence that CS was indeed bypassing policies to force their updates I’ll eat the egg on my face.

DesertCreosote@lemm.ee · 1 year ago

I’m one of the admins who manage CrowdStrike at my company.

We have all automatic updates disabled, because when they were enabled (according to the CrowdStrike best practices guide they gave us), they pushed out a version with a bug that overwhelmed our domain servers. Now we test everything through multiple environments before things make it to production, with at least two weeks of testing before we move a version to the next environment.
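As a rough sketch of that kind of promotion gate: the environment names below are hypothetical, and the only rule taken from the description is the two-week minimum before a version moves on.

```python
from datetime import date, timedelta

# Hypothetical environment names, ordered from least to most critical.
ENVIRONMENTS = ["lab", "pilot", "production"]
MIN_SOAK = timedelta(weeks=2)


def next_environment(current: str, entered_on: date, today: date,
                     tests_passed: bool) -> str:
    """Return where a version should be after today's review.

    A version only moves forward one environment at a time, and only
    after it has passed testing and soaked for at least MIN_SOAK.
    """
    idx = ENVIRONMENTS.index(current)
    at_end = idx == len(ENVIRONMENTS) - 1
    soaked = today - entered_on >= MIN_SOAK
    if at_end or not tests_passed or not soaked:
        return current
    return ENVIRONMENTS[idx + 1]
```

A gate like this only helps for updates that actually pass through it.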

This was a channel file update, and per our TAM and account managers in our meeting after this happened, there’s no way to stop that file from being pushed, or to delay it. Supposedly they’ll be adding that functionality in now.

John Richard@lemmy.world · 1 year ago

I’d issue IPMI or remote management commands to reboot the machines. Then I’d boot into either a Linux recovery environment (yes, Linux can unlock BitLocker-encrypted drives) or WinPE (or Windows RE) and unlock the drives. Preferably the recovery environment would already be loaded on the drives, but the machines could PXE boot it instead. Just giving ideas here, but the ideal DR scenario would have an environment ready to load, since PXE would cause delays.

I’d then push a command or script that removes the update file that caused the issue and reboots. Having planned for a scenario like this already, total time to fix would be less than 2 hours.
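A minimal sketch of the "remove the file" step, assuming the BitLocker volume has already been unlocked and mounted by the recovery environment. The directory and file pattern come from CrowdStrike's publicly posted workaround for this incident, not from this thread. The function name, the `volume_root` mount point, and the dry-run default are assumptions for illustration.

```python
from pathlib import Path

# From CrowdStrike's public workaround: delete files matching this pattern
# in the CrowdStrike drivers directory of the affected Windows install.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"
DRIVER_SUBDIR = Path("Windows/System32/drivers/CrowdStrike")


def remove_bad_channel_files(volume_root: Path, dry_run: bool = True) -> list[Path]:
    """Find (and optionally delete) the faulty channel files.

    volume_root is wherever the unlocked Windows volume is mounted
    (an assumption; it depends on the recovery environment).
    Returns the matching paths so the caller can log what was touched.
    """
    target_dir = volume_root / DRIVER_SUBDIR
    if not target_dir.is_dir():
        return []
    matches = sorted(target_dir.glob(CHANNEL_FILE_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Defaulting to a dry run means a pushed script reports what it would delete before anyone lets it touch a driver directory. Unlocking the drive and rebooting afterwards are separate steps not shown here.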