• Flames5123@sh.itjust.works · 2 days ago

    As someone who works at another AWS-dependent org, it took us out too. It was awful, and there was nothing I could do on my end. Why the fuck didn’t it get rolled back immediately? Why did it go to a second region? Fucking idiots on the big team’s side.

    I got paged 140 times between 12 and 4 am PDT. Then there was another incident that I had to hand off at 7am because I needed fucking sleep, and they handled it until 1pm. I love my team, but it’s so awful that this was even able to happen. All of our fuck-ups take 5–30 minutes to roll back or manually intervene on. This took them 2+ hours, and it was painful. Then it HAPPENED AGAIN! Like what the fuck.

    • douglasg14b@lemmy.world · 2 days ago

      This is a good reason to start investing in multi-region architecture at some point.

      Not trying to be smug here or anything, but we updated a single config value, made a PR, and committed the change, and within a few minutes we were switched over to a different region. Smooth sailing after that.

      (This still depends to some degree on AWS to actually execute the failover, which is something we’re still mulling over how to solve.)
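
      For anyone curious what that kind of “flip one config value” failover can look like, here’s a minimal sketch. This is not their actual setup; it assumes a Python/boto3 stack, and the config file name, keys, and regions are made up for illustration.

      ```python
      # region_failover.py -- sketch of a single-config-value region switch.
      # Hypothetical names throughout; the real stack isn't shown in this thread.
      import json

      import boto3


      def load_active_region(path: str = "deploy_config.json") -> str:
          """Read the active region from a checked-in config file.

          Failing over is then a one-line change (e.g. "us-east-1" -> "us-west-2")
          that goes through the normal PR / deploy pipeline.
          """
          with open(path) as f:
              return json.load(f)["active_region"]


      def build_clients(region: str) -> dict:
          """Construct the regional AWS clients against whichever region is active."""
          return {
              "s3": boto3.client("s3", region_name=region),
              "dynamodb": boto3.client("dynamodb", region_name=region),
              "sqs": boto3.client("sqs", region_name=region),
          }


      if __name__ == "__main__":
          region = load_active_region()
          clients = build_clients(region)
          print(f"Serving out of {region}")
      ```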

      Now, our work demands that we invest in such things; we’re even investing in multi-cloud (an actual nightmare). Not everyone can do this, and some systems just aren’t built to be able to, but if it’s within reach it’s probably worth it.

      • Flames5123@sh.itjust.works · 2 days ago

        Last night from 12–4am, almost every region was impacted, so it didn’t help that much.

        But we do have failovers that customers need to activate themselves to start working in another region.

        But our canaries and infrastructure alarms can’t do that, since they’re scoped to alerts in a single region.
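
        For context, that per-region scoping isn’t unique to us: a CloudWatch alarm is created against one region’s endpoint and only evaluates metrics there. Rough sketch, with a hypothetical alarm name, metric, and threshold:

        ```python
        # Sketch of why infrastructure alarms are region-bound: the CloudWatch
        # client (and the alarm it creates) lives in exactly one region.
        import boto3

        REGION = "us-east-1"  # the alarm only exists and evaluates in this region

        cloudwatch = boto3.client("cloudwatch", region_name=REGION)

        cloudwatch.put_metric_alarm(
            AlarmName="api-5xx-rate-high",
            Namespace="AWS/ApplicationELB",
            MetricName="HTTPCode_ELB_5XX_Count",
            Statistic="Sum",
            Period=60,
            EvaluationPeriods=5,
            Threshold=100,
            ComparisonOperator="GreaterThanThreshold",
        )
        ```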