Fault in CrowdStrike caused airports, businesses and healthcare services to languish in ‘largest outage in history’

Services began to come back online on Friday evening after an IT failure that wreaked havoc worldwide. But full recovery could take weeks, experts have said, after airports, healthcare services and businesses were hit by the “largest outage in history”.

Flights and hospital appointments were cancelled, payroll systems seized up and TV channels went off air after a botched software upgrade hit Microsoft’s Windows operating system.

The faulty update came from the US cybersecurity company CrowdStrike, and left workers facing a “blue screen of death” as their computers failed to start. Experts said every affected PC may have to be fixed manually, but as of Friday night some services had started to recover.

As recovery continues, experts say the outage underscored concerns that many organizations are not well prepared to implement contingency plans when a single point of failure such as an IT system, or a piece of software within it, goes down. But these outages will happen again, experts say, until more contingencies are built into networks and organizations introduce better back-ups.

  • TheDemonBuer@lemmy.world
    5 months ago

    Here’s an idea: don’t give one company kernel level access to the OS of millions of PCs that are necessary to keep whole industries functioning.

    • ansiz@lemmy.world
      5 months ago

      I mean, Microsoft themselves regularly shit the bed with updates, even with Defender updates. It’s the nature of security: these tools have to have that kind of access to stop legit malware. That’s why these kinds of outages happen every few years. This one just got so much coverage because of the banking and airline issues. And I’m sure future outages will continue to get similar coverage.

      But the Crowdstrike CEO was also at McAfee in 2010 when they shit the bed and shut down millions of XP machines so it seems like he needs a different career…

      • SkyNTP@lemmy.ml
        5 months ago

        The problem is the monoculture. We are fucking addicted to convenience and efficiency at all costs.

        A diverse ecosystem, if a bit more work to manage, is much more resilient, and wouldn’t have been this catastrophe.

        Our technology is great, but our processes suck. Standardization. Just in time. These ideas create incredibly fragile organizations. Humanity is so short-sighted. We are screwed.

        • krashmo@lemmy.world
          5 months ago

          That seems like a pretty hardcore doomer view for an event that didn’t really do much in the grand scheme of things. I wouldn’t have even known it happened if it wasn’t all over the internet, and I work in tech to boot.

        • fuckwit_mcbumcrumble@lemmy.dbzer0.com
          5 months ago

          Time is money. Training all of the staff needed to manage not just one system in multiple areas, but multiple systems in multiple areas, is a horrible idea. Sure, for a one-off issue like this it would save your bacon. But how often does this really happen?

      • billwashere@lemmy.world
        5 months ago

        I’m not sure you can blame the CEO. As much as I despise C-level execs this seems like a failure at a much lower level. Now the question of whether this is a culture failure is a different story because to me that DOES come from the CEO or at least that level.

      • emax_gomax@lemmy.world
        5 months ago

        How difficult would it be for companies to have staged releases or oversee upgrades themselves? I mostly just use Linux, but upgrading itself is a relatively painless process, and logging into remote machines to trigger an update is no harder. Why is this something an independent party should be able to do without end user discretion?

  • Flying Squid@lemmy.world
    5 months ago

    C-suite to experts: Are the future risks short term or long term? Specifically longer term than my golden parachute?

    • hedgehogging_the_bed@lemmy.world
      5 months ago

      This is why “they are the biggest” isn’t a good reason to pick a vendor. If all these companies had been using different providers or even different OSes, it wouldn’t have hit so many systems simultaneously. This is a result of too much consolidation at all levels and one issue with the Microsoft OS monopoly.

  • billwashere@lemmy.world
    5 months ago

    Still say not allowing untested updates in a production environment fixes this. I don’t care if it’s a README file, don’t update without testing.
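
Several comments in this thread come back to staged releases and testing before production. As a rough illustration only, here is a minimal sketch of what a staged (canary) rollout could look like. It does not describe how CrowdStrike or any other vendor actually ships updates; `staged_rollout`, `apply_update`, `is_healthy` and the stage fractions are all hypothetical names and values made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class RolloutResult:
    updated: list[str]               # hosts that received the update
    halted_at_stage: Optional[int]   # None means every stage passed


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Sequence[float],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],     # hypothetical: post-update health check
    max_failure_rate: float = 0.0,
) -> RolloutResult:
    """Update hosts in growing cumulative stages; stop as soon as a stage looks bad."""
    updated: list[str] = []
    done = 0
    for stage, fraction in enumerate(stage_fractions):
        # Fractions are cumulative: 0.01 is the first 1% of the fleet, 1.0 is everyone.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        batch = list(hosts[done:target])
        if not batch:
            break
        for host in batch:
            apply_update(host)
        updated.extend(batch)
        done = target
        # Check only the hosts touched in this stage before widening the rollout.
        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            return RolloutResult(updated, halted_at_stage=stage)
    return RolloutResult(updated, halted_at_stage=None)


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(100)]
    # Simulate an update that breaks every machine it touches.
    result = staged_rollout(fleet, (0.01, 0.10, 1.0), lambda h: None, lambda h: False)
    print(len(result.updated), result.halted_at_stage)
```

With a simulated update that breaks every host, the rollout stops after the first stage, so only the canary slice of the fleet is affected rather than all of it. The trade-off is the one raised above: staging delays protection against new threats, which is part of why security vendors push content updates quickly.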