This is an unpopular opinion, and I get why – people crave a scapegoat. CrowdStrike undeniably pushed a faulty update demanding a low-level fix (booting into recovery). However, this incident lays bare the fragility of corporate IT, particularly for companies entrusted with vast amounts of sensitive personal information.

Robust disaster recovery plans, including automated processes to remotely reboot and remediate thousands of machines, aren’t revolutionary. They’re basic hygiene, especially when considering the potential consequences of a breach. Yet, this incident highlights a systemic failure across many organizations. While CrowdStrike erred, the real culprit is a culture of shortcuts and misplaced priorities within corporate IT.

Too often, companies throw millions at vendor contracts, lured by flashy promises and neglecting the due diligence necessary to ensure those solutions truly fit their needs. This is exacerbated by a corporate culture where CEOs, vice presidents, and managers are often more easily swayed by vendor kickbacks, gifts, and lavish trips than by investing in innovative ideas with measurable outcomes.

This misguided approach not only results in bloated IT budgets but also leaves companies vulnerable to precisely the kind of disruptions caused by the CrowdStrike incident. When decision-makers prioritize personal gain over the long-term health and security of their IT infrastructure, it’s ultimately the customers and their data that suffer.

  • breakingcups@lemmy.world · 128 up, 5 down · 6 months ago

    Please, enlighten me how you’d remotely service a few thousand BitLocker-locked machines that won’t boot far enough to get an internet connection, with non-tech-savvy users behind them. Pray tell what common “basic hygiene” practices would’ve helped, especially with CrowdStrike reportedly ignoring and bypassing the rollout policies set by their customers.

    Not saying the rest of your post is wrong, but this stood out as easily glossed over.

    • ramble81@lemm.ee · 26 up, 2 down · 6 months ago

      You’d have to have something even lower level, like an OOB KVM on every workstation, which would be stupid expensive for the ROI, or something at the UEFI layer that could potentially introduce more security holes.

      • Leeks@lemmy.world · 10 up, 3 down · 6 months ago

        Maybe they should offer a real time patcher for the security vulnerabilities in the OOB KVM, I know a great vulnerability database offered by a company that does this for a lot of systems world wide! /s

        • A_A@lemmy.world · 2 up, 2 down · 6 months ago

          Lol 😋! Also, I need an “Out-of-Band Keyboard, Video, and Mouse” to your “OOB KVM” so as to steal the bank, er, improve security.

        • ramble81@lemm.ee · 7 up · 6 months ago

          I didn’t say it was, nor did I say UEFI was the problem. My point was that additional applications or extensions at the UEFI layer increase the attack surface of a system. Just like vPro, you’re giving hackers a method that can compromise a system below the OS. Add that to laptops and computers that get plugged in at random places before VPNs and other security software are loaded, and you have a nice recipe for hidden spyware and such.

    • LrdThndr@lemmy.world · 24 up, 8 down · 6 months ago (edited)

      A decade ago I worked for a regional chain of gyms with locations in 4 states.

      I was in TN. When a system would go down in SC or NC, we originally had three options:

      1. (The most common) have them put it in a box and ship it to me.
      2. I go there and fix it (rare)
      3. I walk them through fixing it over the phone (fuck my life)

      I got sick of this. So I researched options and found an open source software solution called FOG. I ran a server in our office and had little OptiPlex 160s running a software client that I shipped to each club. Then each machine at each club was configured to PXE boot from the FOG client.

      The server contained images of every machine we commonly used. I could tell FOG which locations used which models, and it would keep the images cached on the client machines.

      If everything was okay, it would chain the boot to the OS on the machine. But I could flag a machine for reimage, and at next boot the machine would check in with the local FOG client via PXE and get a complete reimage from premade images on the FOG server.

      The corporate office was physically connected to one of the clubs, so I trialed the software at our adjacent club, and when it worked great, I rolled it out company wide. It was a massive success.

      So yes, I could completely reimage a computer from hundreds of miles away by clicking a few checkboxes on my computer. Since it ran in PXE, the condition of the OS didn’t matter at all. It never loaded the OS when it was flagged for reimage. It would even join the computer to the domain and set up that location’s printers and everything. All I had to tell the low-tech gymbro sales guy on the phone to do was reboot it.
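
The boot flow described above (chain to the local OS unless the machine is flagged, in which case reimage without ever loading the OS) can be sketched as a small decision function. This is an illustration only, not FOG's actual code or API: the `Host` record, the `BootAction` names, the `decide_boot_action` function, and the sample image name are all hypothetical, and the fallback when no image is cached for a model is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class BootAction(Enum):
    CHAIN_TO_LOCAL_OS = "chain"   # hand off to the OS already on disk
    REIMAGE = "reimage"           # deploy a premade image, never load the OS


@dataclass
class Host:
    mac: str
    model: str
    flagged_for_reimage: bool = False


def decide_boot_action(host: Host,
                       cached_images: Dict[str, str]) -> Tuple[BootAction, Optional[str]]:
    """Return what a PXE check-in should do for this host.

    cached_images maps a hardware model to the image held on the local node.
    """
    if not host.flagged_for_reimage:
        return BootAction.CHAIN_TO_LOCAL_OS, None
    image = cached_images.get(host.model)
    if image is None:
        # Assumption: with no image for this model, fall back to a normal
        # boot rather than leave the machine with nothing to run.
        return BootAction.CHAIN_TO_LOCAL_OS, None
    return BootAction.REIMAGE, image


images = {"optiplex-160": "front-desk-v3.img"}  # hypothetical image name
healthy = Host(mac="00:11:22:33:44:55", model="optiplex-160")
broken = Host(mac="00:11:22:33:44:66", model="optiplex-160", flagged_for_reimage=True)
print(decide_boot_action(healthy, images))
print(decide_boot_action(broken, images))
```

Because the decision is made before any OS loads, a boot-looping machine is handled the same way as a healthy one: the only input that matters is the flag set on the server.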

      This was free software. It saved us thousands in shipping fees alone. And brought our time to fix down from days to minutes.

      There ARE options out there.

      • magikmw@lemm.ee · 22 up · 6 months ago (edited)

        This works great for stationary PCs and local servers, but does nothing for laptops on the public internet in the hands of users.

        The only fix here is staggered and tested updates, and apparently this update bypassed even the deferred update settings that CrowdStrike themselves put into their software.

        The only winning move here was to not use crowdstrike.

        • LrdThndr@lemmy.world · 8 up · 6 months ago

          Absolutely. 100%

          But don’t let perfect be the enemy of good. A fix that gets you 40% of the way there is still 40% less work you have to do by hand. Not everything has to be a fix for all situations. There’s no such thing as a panacea.

          • magikmw@lemm.ee · 5 up · 6 months ago

            Sure. At the same time one needs to manage resources.

            I was all in on laptop deployment automation. It cut down on a lot of human error issues and having inconsistent configuration popping up all the time.

            But it needs constant supervision, even if not constant updates. More systems and solutions lead to neglect if not resourced well. So some “would be good to have” systems just never make the cut, because as overachieving as I am, I also don’t want to think everything is taken care of when it clearly isn’t.

            • catloaf@lemm.ee · 2 up · 6 months ago

              Yeah. I find a base image and post-install config with group policy or Ansible to be far more reliable.

              • magikmw@lemm.ee · 1 up · 6 months ago

                Yeah, we’re doing something similar. We only update base images for bigger OS updates or if something breaks or can break.

                The general idea is to have config that works for both new PCs and the ones that are already in use. Saves on maintaining two configuration methods.

            • John Richard@lemmy.world (OP) · 3 up, 2 down · 6 months ago

              You were all in, but was the company all in? How many employees? It sounds like you innovated. Let’s say that the company you worked for was spending millions on vendors that promised solutions but rarely delivered. If instead they gave you $400k a year, a $1 million/year budget & 10 employees… I’m guessing you could have managed the laptop deployment automation, along with some other significant projects as well.

              Instead though, people with good ideas, even loyal to the company, are competing against sales and marketing reps from billion dollar companies, and upper management are easily swooned.

              • magikmw@lemm.ee · 3 up · 6 months ago

                I’m the only one to swoon here, and I’m as sceptical as one can be.

                I’m also a cost and my budget is on paper only. Non-IT management is complicit in crappy IT.

        • wizardbeard@lemmy.dbzer0.com · 9 up, 2 down · 6 months ago

          It also assumes that reimaging is always an option.

          Yes, every company should have networked storage enforced specifically for issues like this, so no user data would be lost, but there’s often a gap between “should” and “has been able to find the time and get the required business-side buy-in to make it happen”.

          Also, users constantly find new ways to do non-standard, non-supported things with business critical data.

          • Bluetreefrog@lemmy.world · 5 up · 6 months ago

            Isn’t this just more of what caused the problem in the first place? Namely, centralisation. If you store data locally and you lose a machine, that’s bad but not the end of the world. If you store it centrally and you lose the data, that’s catastrophic. Nassim Taleb nailed this stuff. Keep the downside limited, and the upside unlimited or as he says, “Don’t pick up pennies in front of a steamroller.”

        • John Richard@lemmy.world (OP) · 1 up, 4 down · 6 months ago

          Almost all computers can be set to PXE boot, and work laptops usually have even more advanced remote management capabilities. You ask the employee to reboot the laptop and presto!

          • magikmw@lemm.ee · 5 up · 6 months ago

            I wonder how you’re supposed to get PXE boot to work securely over the internet. And how that helps when the affected disk is still encrypted and needs unusual intervention to fix, including admin access to system files.

            I’ve been doing this for a while, and I like creative solutions, so I wonder about those issues a lot. Not much comes to my mind besides let’s recall all the laptops and do it one by one.

            • wizardbeard@lemmy.dbzer0.com · 2 up, 1 down · 6 months ago (edited)

              Hypothetically you could ship a company-provided router to your remote users to handle the VPN connection, so you aren’t relying on the OS being able to boot in order to reach the company network and PXE environment. Lots of extra cost and mess though.

            • LrdThndr@lemmy.world · 1 up · 6 months ago

              From a home user? Probably ain’t shit-all you can do with PXE booting. But if you have a field office or somewhere a user can go with a hardware vpn appliance? Well now you’re in business.

            • John Richard@lemmy.world (OP) · 2 up, 2 down · 6 months ago

              “I wonder how you’re supposed to get PXE boot to work securely over the internet.”

              PXE boot is more of a last resort IMO, but it can be used as a chainloader to a more secure option. The biggest challenge I can see security-wise is PXE boot being run on unsecured networks. Even then, a computer will normally have been provisioned on a secure network and will have disk encryption tied to Secure Boot, plus some additional signature-based image verification.

      • John Richard@lemmy.world (OP) · 3 up, 4 down · 6 months ago

        Thank you for sharing this. This is what I’m talking about. Larger companies not utilizing something like this already are dysfunctional. There are no excuses for why it would take them days, weeks or longer.

        • LrdThndr@lemmy.world · 5 up · 6 months ago (edited)

          How would it not have? You got an office or field offices?

          “Bring your computer by and plug it in over there.” And flag it for reimage. Yeah. It’s gonna be slow, since you have 200 of the damn things running at once, but you really want to go and manually touch every computer in your org?

          The damn thing’s even boot looping, so you don’t even have to reboot it.

          I’m sure the user saved all their data in OneDrive like they were supposed to, right?

          I get it, it’s not a 100% fix rate. And it’s a bit of a callous answer to their data. And I don’t even know if the project is still being maintained.

          But the post I replied to was lamenting the lack of an option to remotely fix unbootable machines. This was an option to remotely fix nonbootable machines. No need to be a jerk about it.

          But to actually answer your question and be transparent, I’ve been doing Linux devops for 10 years now. I haven’t touched a windows server since the days of the gymbros. I DID say it’s been a decade.

          • Brkdncr@lemmy.world · 5 up, 4 down · 6 months ago

            Because your imaging environment would also be down. And you’re still touching each machine and bringing users into the office.

            Or your imaging process over the wan takes 3 hours since it’s dynamically installing apps and updates and not a static “gold” image. Imaging is then even slower because your source disk is only ssd and imaging slows down once you get 10+ going at once.

            I’m being rude because I see a lot of armchair sysadmins that don’t seem to understand the scale of the CrowdStrike outage, what CrowdStrike even is beyond antivirus, and the workflow needed to recover from it.

            • LrdThndr@lemmy.world · 6 up · 6 months ago

              FOG ran on Linux. It wouldn’t have been down. But that’s beside the point.

              I never said it was a good answer to CrowdStrike. It was just a story about how I did things 10 years ago, and an option for remotely fixing nonbooting machines. That’s it.

              I get you’ve been overworked and stressed as fuck these last few days. I’ve been out of corporate IT for 10 years and I do not envy the shit you guys are going through right now. I wish I could buy you a cup of coffee or a beer or something.

              • Brkdncr@lemmy.world · 3 up · 6 months ago

                Last time I used fog it was only doing static image deployment which has been out of style for a while. I don’t know if there are any serious deployment products for windows enterprise that don’t run on windows.

                I’m personally not dealing with this because I didn’t like how Crowdstrike had answered a number of questions in their sales call.

                Avoiding telling me their vuln scan doesn’t probe all hosts after claiming it could replace a real vuln scanner, claiming they are somehow better than others at malware detection without bringing up 3rd party tests, claiming their product was novel when others have been doing the same for 7+ years.

                My fave was them telling me how much easier it is to manage but no one on the call had ever worked as a sysadmin or even seen how their competition works.

                Shitshow. I’m so glad this happened so I can block their sales team.

            • John Richard@lemmy.world (OP) · 2 up, 2 down · 6 months ago

              Imaging environment down? If a sysadmin can’t figure out how to boot a machine into recovery to remove the bad update file then they have bigger problems. The fix in this instance wasn’t even re-imaging machines. It was merely removing a file. The ideal DR scenario would have a recovery image already on the system that can be booted into remotely, so there is minimal strain on the network. Furthermore, we don’t live in the dial-up age anymore.

              • Brkdncr@lemmy.world · 1 up · 6 months ago

                Imaging environment would be bitlocker’d with its key stuck in AD which is also bitlocker’d.

    • Dran@lemmy.world · 10 up, 2 down · 6 months ago (edited)

      Separate persistent data and operating system partitions. Ensure that every local network has small PXE servers, VPNed (WireGuard, etc.) to a CDN with your base OS deployment images, that validate images based on CA and checksum before delivering them. Give every user the ability to PXE boot and redeploy the non-data partition.

      Bitlocker keys for the OS partition are irrelevant because nothing of value is stored on the OS partition, and keys for the data partition can be stored and passed via AD after the redeploy. If someone somehow deploys an image that isn’t ours, it won’t have keys to the data partition because it won’t have a trust relationship with AD.

      (This is actually what I do at work)
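
The “validate images based on checksum before delivering” step could look roughly like the following. A sketch only: it covers the checksum half and not the CA signature check, and the function names and the idea of taking the expected digest from a trusted (for example, signed) manifest are assumptions rather than the poster's actual tooling.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file so large OS images never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_trusted(path: Path, expected_sha256: str) -> bool:
    """Compare the image against a digest obtained from a trusted manifest."""
    actual = sha256_of_file(path)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, expected_sha256.lower())
```

A PXE node would call `image_is_trusted` before serving an image and refuse to deploy on a mismatch; the checksum is only as trustworthy as the manifest it came from, which is where the CA-based signature check comes in.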

      • I_Miss_Daniel@lemmy.world · 5 up · 6 months ago

        Sounds good, but can you trust an OS partition not to store things in %programdata% etc that should be encrypted?

        • Dran@lemmy.world · 3 up · 6 months ago

          With enough autism in your overlay configs, sure, but in my environment that leakage is still encrypted. It’s far simpler to just accept leakage and encrypt the OS partition with a key that’s never stored anywhere. If it gets lost, you rebuild the system from PXE (which is fine, because it only takes about 20 minutes and no data we care about exists there). If it’s working correctly, the OS partition is still encrypted and protects any inadvertent data leakage from offline attacks.

      • pHr34kY@lemmy.world · 1 up · 6 months ago

        I’ve been separating OS and data partitions since I was a kid running Windows 95. It’s horrifying that people don’t expect and prepare for machines to become unbootable on a regular basis.

        Hell, I bricked my work PC twice this year just by using the Windows cleanup tool - on Windows 11. The antivirus went nuclear, as antivirus products do.

      • Brkdncr@lemmy.world · 3 up, 2 down · 6 months ago

        But your PXE boot server is down, your RADIUS server providing VPN auth is down, and your BitLocker keys are in AD, which is down because all your domain controllers are down.

        • Dran@lemmy.world · 3 up · 6 months ago

          Yes and no. In the best case, endpoints have enough cached data to get us through that process. In the worst case, that’s still a considerably smaller footprint to fix by hand before the rest of the infrastructure can fix itself.

    • felbane@lemmy.world · 5 up, 1 down · 6 months ago

      Rollout policies are the answer, and CrowdStrike should be made an example of if they were truly overriding policies set by the customer.

      It seems more likely to me that nobody was expecting “fingerprint update” to have the potential to completely brick a device, and so none of the affected IT departments were setting staged rollout policies in the first place. Or if they were, they weren’t adequately testing.

      Then - after the fact - it’s easy to claim that rollout policies were ignored when there’s no way to prove it.

      If there’s some evidence that CS was indeed bypassing policies to force their updates I’ll eat the egg on my face.

      • DesertCreosote@lemm.ee · 1 up · 6 months ago

        I’m one of the admins who manage CrowdStrike at my company.

        We have all automatic updates disabled, because when they were enabled (according to the CrowdStrike best practices guide they gave us), they pushed out a version with a bug that overwhelmed our domain servers. Now we test everything through multiple environments before things make it to production, with at least two weeks of testing before we move a version to the next environment.

        This was a channel file update, and per our TAM and account managers in our meeting after this happened, there’s no way to stop that file from being pushed, or to delay it. Supposedly they’ll be adding that functionality in now.
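
The policy described here (test each version through multiple environments, with at least two weeks in each) amounts to a promotion gate. A minimal sketch follows, with hypothetical environment names and a hypothetical `can_promote` helper. As the comment notes, a gate like this only governs versions the customer controls; it would not have stopped a channel file pushed directly by the vendor.

```python
from datetime import date, timedelta
from typing import Optional

ENVIRONMENTS = ["lab", "pilot", "staging", "production"]  # hypothetical ring names
MIN_SOAK = timedelta(days=14)  # "at least two weeks of testing"


def next_environment(current: str) -> Optional[str]:
    """Return the ring after `current`, or None if it is the last one."""
    index = ENVIRONMENTS.index(current)
    return ENVIRONMENTS[index + 1] if index + 1 < len(ENVIRONMENTS) else None


def can_promote(current: str, deployed_on: date, today: date, open_incidents: int) -> bool:
    """A version moves on only after a clean soak period in its current ring."""
    if next_environment(current) is None:
        return False  # already in the last ring, nowhere to promote to
    soaked = today - deployed_on >= MIN_SOAK
    return soaked and open_incidents == 0
```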

    • John Richard@lemmy.world (OP) · 1 up, 4 down · 6 months ago

      I’d issue IPMI or remote management commands to reboot the machines. Then I’d boot into either a Linux recovery environment (yes, Linux can unlock BitLocker-encrypted drives) or WinPE (or Windows RE) and unlock the drives. Preferably that environment would already be loaded on the drives, but the machines could also PXE boot it. Just giving ideas here, but the ideal DR scenario would have an environment ready to load, since PXE would cause delays.

      I’d then push a command or script that removes the update file that caused the issue and reboots. Having planned for a scenario like this already, total time to fix would be less than 2 hours.
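
The “remove the update file” step can be sketched as follows, assuming the script runs from a recovery environment with the Windows volume already unlocked and mounted. The directory and the `C-00000291*.sys` pattern come from CrowdStrike's published workaround for this incident; the mount point, the function name, and the dry-run flag are assumptions for illustration. The sketch does not address the hard parts raised elsewhere in this thread, namely reaching the machine at all and obtaining the BitLocker key.

```python
from pathlib import Path

# Relative location and filename pattern from the vendor's public workaround.
# Path case matters when the volume is mounted on a case-sensitive system.
CROWDSTRIKE_DIR = Path("Windows/System32/drivers/CrowdStrike")
BAD_CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_bad_channel_files(windows_root: Path, dry_run: bool = True) -> list:
    """Delete the faulty channel file(s) under an already-mounted Windows volume.

    Returns the paths that matched. With dry_run=True nothing is deleted.
    """
    target_dir = windows_root / CROWDSTRIKE_DIR
    if not target_dir.is_dir():
        return []
    matches = sorted(target_dir.glob(BAD_CHANNEL_FILE_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Defaulting to a dry run is a deliberate choice: on thousands of machines it is safer to log what would be removed first and delete only on an explicit `dry_run=False`.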

  • Leeks@lemmy.world · 44 up, 1 down · 6 months ago

    “bloated IT budgets”

    Can you point me to one of these companies?

    In general IT is run as a “cost center”, which means they have to scrimp and save everywhere they can. Every IT department I have seen is understaffed and spread too thin. Also, since it is viewed as a cost, getting all teams to sit down and make DR plans (since these involve the entire company, not just IT) is near impossible since “we may spend a lot of time and money on a plan we never need”.

    • John Richard@lemmy.world (OP) · 6 up, 14 down · 6 months ago

      With most corporations, especially Fortune 500s… audit their budgets. The problem doesn’t start with IT, but with bad management from the top down. This “cost center” you speak of is mostly what I’d expect to hear do-nothing middle-level managers tell their in-house employees when they ask for a raise.

      • Leeks@lemmy.world · 14 up · 6 months ago

        It feels like you have an agenda that you are trying to apply to the CrowdStrike event, and you just so happen to be slandering IT, an innocent bystander to the agenda you are putting forward.

        If you had to summarize the goal of your initial post in fewer than 10 words, what would it be?

          • Leeks@lemmy.world · 6 up · 6 months ago

            Thanks for responding in good faith!

            I agree that while CS did screw up in pushing out a bad update, only having a single vendor for a critical process that can take the whole business down is equally a screw up. Ideally companies should have had CS installed on half the systems and a secondary malware prevention system on every DR and “redundant” system. Having all of a company’s eggs in a single basket is very bad.

            All the above being said, properly implementing a system that is fully redundant down to the vendor level would require either double the support team or a massive development effort to tie the management of the systems together. Either way, that is going to be very expensive. The point being: reducing the budget of IT departments will cause further consolidation of vendors and increase the number of vendor-caused complete outage events.

  • technocrit@lemmy.dbzer0.com · 36 up · 6 months ago (edited)

    An underlying problem is that legal security is mostly security theatre. Legal security provides legal cover for entities without much actual security.

    The point of legal security is not to protect privacy, users, etc., but to protect the liability of legal entities when the inevitable happens.

    “neglecting the due diligence necessary to ensure those solutions truly fit their needs.”

    CrowdStrike perfectly met their needs by providing someone else to blame. I don’t think anybody is facing any consequences for contracting with CrowdStrike. It’s the same deal with Microsoft × 10000000. These bad incentives are the whole point of the system.

  • TechNerdWizard42@lemmy.world · 20 up, 7 down · 6 months ago

    Issue is definitely corporate greed outsourcing issues to a mega monolith IT company.

    Most IT departments are idiots now. Even 15 years ago, those were the smartest nerds in most buildings. They had to know how to do it all. Now it’s just installing the corporate overlord software and the bullshit spyware. When something goes wrong, you call the vendor’s support line. That’s not IT, you’ve just outsourced all your brains to a monolith that can go at any time.

    None of my servers running windows went down. None of my infrastructure. None of the infrastructure I manage as side hustles.

    • ocassionallyaduck@lemmy.world · 6 up · 6 months ago

      Man, as someone who was cross-discipline at my former companies, the way people treat IT, and the way the company considers IT an afterthought, is just insane. The technical debt is piled high.

    • Lettuce eat lettuce@lemmy.ml · 6 up · 6 months ago (edited)

      I’ve seen the same thing. IT departments are less and less interested in building and maintaining in-house solutions.

      I get why, it requires more time, effort, money, and experienced staff to pay.

      But you gain more robust systems when it’s done well. Companies want to cut costs everywhere they can, and it’s cheaper to just pay an outside company to do X, Y, and Z for you and hire an MSP to manage your web portals for it, or maybe 2-3 internal sysadmins who are expected to do all that plus level 1 help desk support.

      Same thing has happened with end users. We spent so much time trying to make computers “friendly” to people, that we actually just made people computer illiterate.

      I find myself in a strange place where I am having to help Boomers, older Gen-X, and Gen-Z with incredibly basic computer functions.

      Things like:

      • Changing their passwords when the policy requires it.
      • Showing people where the Start menu is and how to search for programs there.
      • How to pin a shortcut to their task bar.
      • How to snap windows to half the screen.
      • How to un-mute their volume.
      • How to change their audio device in Teams or Zoom from their speakers to their headphones.
      • How to log out of their account and log back in.
      • How to move files between folders.
      • How to download attachments from emails.
      • How to attach files in an email.
      • How to create and organize Browser shortcuts.
      • How to open a hyperlink in a document.
      • How to play an audio or video file in an email.
      • How to expand a basic folder structure in a file tree.
      • How to press buttons on their desk phone to hear voicemails.

      It’s like only older Millennials and younger gen-X seem to have a general understanding of basic computer usage.

      Much of this stuff has been the same for literally 30+ years. The Start menu, folders, voicemail, email, hyperlinks, browser bookmarks, etc. The coat of paint changes every 5-7 years, but almost all the same principles are identical.

      Can you imagine people not knowing how to put a car in drive, turn on the windshield wipers, or fill it with petrol, just because every 5-7 years the body style changes a little?

  • Boozilla@lemmy.world · 12 up · 6 months ago

    I’ve worked in various and sundry IT jobs for over 35 years. In every job, they paid a lot of lip service and performed a lot of box-checking towards cybersecurity, disaster recovery, and business continuity.

    But, as important as those things are, they are not profitable in the minds of a board of directors. Nor are they sexy to a sales and marketing team. They get taken for granted as “just getting done behind the scenes”.

    Meanwhile, everyone’s real time, budget, energy, and attention is almost always focused on the next big release, or bug fixes in app code, and/or routine desktop support issues.

    It’s a huge problem. Unfortunately it’s how the modern management “style” and late stage capitalism operate. Make a fuss over these things, and you’re flagged as a problem, a human obstacle to be run over.

    • BearOfaTime@lemm.ee · 6 up · 6 months ago (edited)

      Yep - it’s a CIO/CTO/HR issue.

      Those of us designing and managing systems yell till we’re blue in the face, and CIO just doesn’t listen.

      HR is why they use crap like CrowdStrike. The funny thing is, by recording all this stuff, they become legally liable for it. So if an employee intimates they’re going to do something illegal, and the company misses it, but it’s in the database, they can be held liable in a civil case for not doing something to prevent it.

      The huge companies I’ve worked at were smart enough to not back up any comms besides email. All messaging system data was ephemeral.

      • Boozilla@lemmy.world
        link
        fedilink
        English
        arrow-up
        2
        ·
        6 months ago

        by recording all this stuff, they become legally liable for it

        That is a damned good point and kind of hilarious. Thanks for the meaningful input (and not just being another Internet Reply Guy like some others on here).

        • restingboredface@sh.itjust.works
          link
          fedilink
          English
          arrow-up
          4
          ·
          6 months ago

          I’m currently working for a place that has had recent entanglements with the govt for serious misconduct that hurt consumers. They have multiple policies with language in them to reduce documentation that could get them in trouble again. But minimal attention is paid to the actual issues that got them in trouble.

          They are more worried about having documented evidence of bad behavior than they are of it occurring.

          I’m certain this is not unique to this company.

  • computergeek125@lemmy.world
    link
    fedilink
    English
    arrow-up
    5
    ·
    6 months ago

    Getting production servers back online after a problem that needs a low-level fix is pretty straightforward if your backup system takes regular snapshots of pet VMs: just roll back a few hours. For properly managed cattle, just redeploy the OS and reconnect to the data. For physical servers of either type, you can restore a backup (potentially with IPMI integration so it happens automatically), but you might end up taking hours to restore all the data, limited by the bandwidth of your giant spinning-rust NAS that was cost-cut to sustain only a few parallel recoveries. Or you could spend a few hours with your server techs booting into safe mode over IPMI, or write a script that sends reboot commands to the IPMI until the host OS pings back.
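
    A minimal sketch of that last option, assuming the BMCs are reachable, the `ipmitool` CLI is installed, and the BMC password is in the `IPMI_PASSWORD` environment variable. The function names, attempt limit, and wait time are illustrative choices rather than anything from a specific product, and the ping flags are the Linux ones.

```python
import subprocess
import time
from typing import Callable


def ipmi_power_cycle(bmc_host: str, user: str) -> bool:
    """Ask the BMC to power-cycle the machine via the ipmitool CLI.

    -E makes ipmitool read the password from the IPMI_PASSWORD env variable.
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E",
           "chassis", "power", "cycle"]
    return subprocess.run(cmd, capture_output=True).returncode == 0


def os_responds(os_host: str) -> bool:
    """Send one ICMP echo (Linux ping flags) and report whether the OS answered."""
    cmd = ["ping", "-c", "1", "-W", "2", os_host]
    return subprocess.run(cmd, capture_output=True).returncode == 0


def reboot_until_alive(
    power_cycle: Callable[[], bool],
    is_alive: Callable[[], bool],
    max_attempts: int = 10,          # assumption: tune to your environment
    boot_wait_seconds: float = 120.0,  # assumption: time for the OS to come up
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Power-cycle until the host OS answers.

    Returns the number of power cycles used (0 if the host was already up),
    or -1 if the host never answered within max_attempts.
    """
    if is_alive():
        return 0  # never reboot a host that is already healthy
    for attempt in range(1, max_attempts + 1):
        power_cycle()
        sleep(boot_wait_seconds)
        if is_alive():
            return attempt
    return -1
```

    Something like `reboot_until_alive(lambda: ipmi_power_cycle("bmc-host01", "admin"), lambda: os_responds("host01"))` would drive one machine; both hostnames and the user are placeholders. Hosts that return -1 are the ones that still need a human.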

    All that stuff can be added to your DR plan, and many companies are probably now planning for such an event. Just as the US CDC posted a plan for preparing for a zombie apocalypse to get people thinking about preparedness, this was a fire drill for a widespread ransomware attack, and we as a world weren’t ready. There are options, but they often require humans to help things along when the problem is this widespread.

    The stinger of this event is how many workstations were affected in parallel. First, there are no good tools that provide firmware-level remote access capable of executing power controls over the internet. You have options for workstations onsite in an office building: a handful of systems can do this over existing networks, but most are highly hardware-vendor dependent.

    But do you really want to leave PXE enabled on a workstation that will be brought home and rebooted outside of your physical/electronic perimeter? The last few years have shown us that WFH isn’t going away, and those endpoints that exist to roam the world need to be configured in a way that does not leave them easily vulnerable to a low-level OS replacement the other 99.99% of the time, when you aren’t getting crypto’d or receiving a bad kernel update.

    Even if you place trust in your users and don’t use a firmware password, do you want an untrained user to be walked blindly over the phone to open the firmware settings, plug into their router’s Ethernet port, and add https://winfix.companyname.com as a custom network boot option without accidentally deleting the windows bootloader? Plus, any system that does that type of check automatically at startup makes itself potentially vulnerable to a network-based attack by a threat actor on a low security network (such as the network of an untrusted employee or a device that falls into the wrong hands). I’m not saying such a system is impossible - but it’s a super huge target for a threat actor to go after and it needs to be ironclad.

    Given all of that, a lot of companies may instead opt to treat their workstations as cattle that would simply be re-imaged if they were crypto’d. If all of your data is on the SMB server/OneDrive/Google/Nextcloud/Dropbox/SaaS whatever, and your users are following the rules, you can fix the problem by swapping a user’s laptop - just like the data problem from paragraph one. You just have a scale issue: your IT team doesn’t have enough members to handle every user having issues at once.

    The reality is there are still going to be applications and use cases that may be critical that don’t support that methodology (as we collectively as IT slowly try to deprecate their use), and that is going to throw a Windows-sized monkey wrench into your DR plan. Do you force your users to use a VDI solution? Those are pretty dang powerful, but as a Parsec user that has operated their computer from several hundred miles away, you can feel when a responsive application isn’t responding quite fast enough. That VDI system could be recovered via paragraph 1 and just use Chromebooks (or equivalent) that can self-reimage if needed as the thin clients. But would you rather have annoyed users with a slightly less performant system 99.99% of the time or plan for a widespread issue affecting all systems the other 0.01%? You’re probably already spending your energy upgrading from legacy apps to make your workstations more like cattle.

    All I’m trying to get at here with this long-winded counterpoint - this isn’t an easy problem to solve. I’d love to see the day that IT shops are valued enough to get the budget they need informed by the local experts, and I won’t deny that “C-suite went to x and came back with a bad idea” exists. In the meantime, I think we’re all going to instead be working on ensuring our update policies have better controls on them.

    As a closing thought - if you audited a vendor that has a product that could get a system back online into low-level recovery after this, would you make a budget request for that product? Or does that create the next CrowdStruckOut event? Do you dual-OS your laptops? How far do you go down the rabbit hole of preparing for the low probability? This is what you have to think about - you have to solve enough problems to get your job done, and not everyone is in an industry where regulation requires every problem to be solved. So you solve what you can by order of probability.

    • John Richard@lemmy.worldOP
      link
      fedilink
      English
      arrow-up
      1
      arrow-down
      1
      ·
      6 months ago

      I upvoted because you actually posted technical discussion and details that are accurate. PXE and remote power management are the way. Most workstation BIOSes will have IPMI functionality already included. I agree, though, that because these are remote endpoints, it can be more challenging. Still, having a script to reboot endpoints into a recovery environment would be a basic step in any DR scenario. Mounting the OS partition to delete a file and reboot wouldn’t be a significant endeavor, although it’s one they’d need to make sure they got right. It would be hard to mess up for anyone with intermediate computer skills… and you’d hope these companies at least have someone trained to do that rather quickly. They’d have to spend more time writing up a CR explaining all the steps, and then joining a conference call with like 100 people with babies crying in the background… and managers insisting they remain on the call while they write the script.
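
      A rough sketch of the "delete a file" step, assuming the recovery environment has already mounted (and, where BitLocker is in use, unlocked) the Windows volume at some path. The directory and the `C-00000291*.sys` pattern are taken from CrowdStrike's public remediation guidance for this incident and should be checked against the vendor advisory; the function name and the `dry_run` switch are my own illustration. Unlocking the volume is the hard part and is not covered here.

```python
from pathlib import Path

# Location and pattern as given in the vendor's remediation guidance for this
# incident; verify against the current advisory before relying on them.
DRIVER_DIR = Path("Windows/System32/drivers/CrowdStrike")
BAD_CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_bad_channel_files(mounted_os_root: Path, dry_run: bool = True) -> list[Path]:
    """Find (and optionally delete) the faulty channel files on a mounted volume.

    Returns the matching paths. With dry_run=True nothing is deleted, so the
    result can be reviewed before the destructive pass.
    """
    driver_dir = mounted_os_root / DRIVER_DIR
    if not driver_dir.is_dir():
        return []  # wrong mount point, or the sensor is not installed
    matches = sorted(driver_dir.glob(BAD_CHANNEL_FILE_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

      Running it once with the default `dry_run=True` and logging the result before the real pass is the "make sure they got it right" part.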

  • edric@lemm.ee
    link
    fedilink
    English
    arrow-up
    4
    ·
    edit-2
    6 months ago

    For sure there is a problem, but this issue caused computers to not be able to boot in the first place, so how are you gonna remotely reboot them if you can’t connect to them at all? Sure, there can be a way like one other comment explained, but it’s so complicated and expensive that not even all of the biggest corporations do it.

    Contrary to what a lot of people seem to think, CrowdStrike is pretty effective at what it does, that’s why they are big in the corporate IT world. I’ve worked with companies where the security team had a minority influence on choosing vendors, with the finance team being the major decision maker. So cheapest vendor wins, and CrowdStrike is not exactly cheap. If you ask most IT people, their experience is the opposite of bloated budgets. A lot of IT teams are understaffed and do not have the necessary tools to do their work. Teams have to beg every budget season.

    The failure here is hygiene, yes, but in development and testing processes. Something that wasn’t thoroughly tested got pushed into production and released. And that applies to both CrowdStrike and their customers. That is not uncommon (hence the programmer memes); it just happened to be one of the most prevalent endpoint security solutions in the world, one that needs kernel-level access to do its job. I agree with you that IT departments should be testing software updates before they deploy, so it’s also on them to make sure they at least ran it in a staging environment first. But again, this is a tool that is time critical (anti-malware) and companies need to have the capability to deploy updates fast. So you have to weigh speed vs reliability.
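
    To make the speed vs reliability trade-off concrete, here is a small sketch of a ring-based rollout gate. The `deploy` and `is_healthy` callables and the ring names are hypothetical stand-ins for whatever your deployment tooling provides, and it only helps if the vendor actually honors the customer's rollout policy, which others in this thread say was not the case here.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_ratio: float = 0.0,
) -> int:
    """Deploy ring by ring and stop as soon as a ring looks unhealthy.

    Returns the number of rings that were deployed and passed the health
    check. Rings after a failing one never receive the update.
    """
    for index, ring in enumerate(rings):
        for host in ring:
            deploy(host)
        failures = sum(1 for host in ring if not is_healthy(host))
        if ring and failures / len(ring) > max_failure_ratio:
            return index  # halt here: the blast radius stays within this ring
    return len(rings)
```

    Smaller early rings and a short soak time keep urgent anti-malware updates fast; larger rings and longer soak times trade speed for a smaller blast radius.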

    • John Richard@lemmy.worldOP
      link
      fedilink
      English
      arrow-up
      2
      arrow-down
      4
      ·
      6 months ago

      Booting a system or recovery image remotely over an IPMI or similar interface is not complicated or expensive. It is one of the most basic server management tasks. You acting like the concept is challenging seriously concerns me and I seriously wonder how anyone that thinks like that gets hired.
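
      For machines that do expose an IPMI-capable BMC, the task looks roughly like this sketch: set the next boot device to PXE, power-cycle, and fan that out across the fleet. It assumes the `ipmitool` CLI is installed, the password is in the `IPMI_PASSWORD` environment variable, and a PXE server is already serving a recovery image; the function names and worker count are illustrative.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def recovery_boot_commands(bmc_host: str, user: str) -> list[list[str]]:
    """Build the two ipmitool calls: boot from PXE next, then power-cycle."""
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E"]
    return [base + ["chassis", "bootdev", "pxe"],
            base + ["chassis", "power", "cycle"]]


def run_command(cmd: list[str]) -> bool:
    """Run one command; treat timeouts and a missing binary as failure."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=60).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def boot_fleet_into_recovery(
    bmc_hosts: Iterable[str],
    user: str,
    runner: Callable[[list[str]], bool] = run_command,
    workers: int = 32,  # assumption: size this to what your BMC network tolerates
) -> dict[str, bool]:
    """PXE-boot every host in parallel; map each BMC host to success/failure."""
    def one(host: str) -> bool:
        # all() stops at the first failure, so a host whose boot device could
        # not be set is not power-cycled back into the same broken OS.
        return all(runner(cmd) for cmd in recovery_boot_commands(host, user))

    hosts = list(bmc_hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(hosts, pool.map(one, hosts)))
```

      The hosts that come back `False` are the list that still needs hands-on attention.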

      There are exceptions, granted. However, the IT budget at most mid to large-size corporations is extremely bloated. I don’t think you can in good faith argue otherwise, unless you want to show me a budget that isn’t. Do you have a real one that you can provide?

      These companies don’t even attract smart talent. They attract people that are complacent with doing nothing & collecting a paycheck. Smart people do not continue to work at these companies. The bureaucracy and management is soul-sucking. It took me a while to accept it too. I used to be optimistic thinking there is a logical explanation that can be fixed. Turns out they don’t want to be fixed. They like to be broken. Like I said, it starts from the top down. A lot of the staff wouldn’t even have a job if people actually tried to make things better.

      • edric@lemm.ee
        link
        fedilink
        English
        arrow-up
        4
        ·
        edit-2
        6 months ago

        It is one of the most basic server management tasks.

        Except these were endpoint machines, not servers. Things ground to a halt not because servers went down, but because the computers end users interacted with crashed and wouldn’t boot, kiosk and POS systems included.

        You acting like the concept is challenging seriously concerns me and I seriously wonder how anyone that thinks like that gets hired.

        Damn, I guess all the IT people running the systems that were affected aren’t fit for the job.

        unless you want to show me a budget that isn’t. Do you have a real one that you can provide?

        Can YOU show me the bloated budgets and where they are allocated on those mid to large size corporations? You are the one who insinuated that. All I said is that my experience for all the companies I worked with is that we always had to fight hard for budget, because the sales and marketing departments bring in the $$$ and that’s only what the executives like to see, therefore they get the budget. If your entire working experience is that your IT team had too much budget, then consider yourself privileged.

        It’s weird how you’re all defensive and devolve to insults when people are just responding to your post.

        • John Richard@lemmy.worldOP
          link
          fedilink
          English
          arrow-up
          1
          arrow-down
          2
          ·
          6 months ago

          Except these were endpoint machines, not servers. Things ground to a halt not because servers went down, but because the computers end users interacted with crashed and wouldn’t boot, kiosk and POS systems included.

          Endpoint machines still have IPMI type of interfaces and PXE. When you manage thousands of machines, if you treat them all like a pet then you’re doing it wrong.

          Damn, I guess all the IT people running the systems that were affected aren’t fit for the job.

          Is it going to take them several days to weeks to recover? Then they aren’t fit for the job, or should consider another profession.

          Can you show me the bloated budgets and where they are allocated on those mid to large size corporations?

          All of them. The Form 10-K filings are available for public corporations. The ones claiming that they will be impacted for a while are the ones I’m concerned most about.

          It’s weird how you’re all defensive and devolve to insults when people are just responding to your post.

          I spent a career arguing with sales reps who had one goal in mind, and that was to make the biggest commission possible. I sound argumentative because those sales reps had every tool imaginable to show up out of nowhere.

      • SparrowRanjitScaur@lemmy.world
        link
        fedilink
        English
        arrow-up
        2
        arrow-down
        2
        ·
        6 months ago

        Thank you. Finally someone understands. Jokes aside though, I think we can acknowledge that C/C++ have caused decades of problems due to their lack of memory safety.