I can understand patch updates, but what else are the devs doing?

  • fubo@lemmy.world
    1 year ago

    Over a decade ago, I worked in a big tech company that had a scheduled downtime on one Saturday a month. That was for database schema changes.

    When you’re changing the structure of how you keep track of customer data, you need to make sure that no customers are making changes at that same time. So you take the whole customer-facing service down for a little while, make the schema changes, test them, and then bring the customer-facing service back up. Ideally this takes a few minutes … but you’re prepared for it to take hours.

    As the technology improved, and as the developers learned better how to make changes to the system without requiring deep interventions, long downtime for schema changes became less necessary … for that particular business.

    Every tech company pretty much has to learn how to do these sorts of changes for themselves, though.
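
As a rough illustration of the take-down, change, test, bring-back-up sequence described above, here is a minimal sketch using Python's built-in `sqlite3` module. The `Service` class, the `customers` table, the `phone` column, and the helper names are all hypothetical, invented for this example; a real system would gate traffic at a load balancer or application layer rather than with a boolean flag.

```python
import sqlite3


class Service:
    """Toy customer-facing service guarded by a maintenance flag (hypothetical)."""

    def __init__(self, conn):
        self.conn = conn
        self.in_maintenance = False

    def update_email(self, customer_id, email):
        # Refuse customer writes while the schema is being changed.
        if self.in_maintenance:
            raise RuntimeError("service is down for scheduled maintenance")
        with self.conn:  # commits on success
            self.conn.execute(
                "UPDATE customers SET email = ? WHERE id = ?", (email, customer_id)
            )


def add_phone_column(conn):
    # The schema change itself (assumed example: one new nullable column).
    conn.execute("ALTER TABLE customers ADD COLUMN phone TEXT")


def verify_phone_column(conn):
    # Test the change before reopening; raise if the column is missing.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
    if "phone" not in columns:
        raise RuntimeError("migration did not apply")


def run_maintenance(service, migrate, verify):
    service.in_maintenance = True   # 1. take the customer-facing service down
    migrate(service.conn)           # 2. make the schema change
    verify(service.conn)            # 3. test it; on failure the service stays down
    service.in_maintenance = False  # 4. bring the service back up


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    service = Service(conn)
    run_maintenance(service, add_phone_column, verify_phone_column)
    print("maintenance finished, service reopened:", not service.in_maintenance)
```

If verification raises, the flag is deliberately left set, which matches the "prepared for it to take hours" case: the service stays closed until someone fixes the problem.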

    • Synthead@lemmy.world
      1 year ago

      This is the most informed answer in this thread. It really does come down to schema changes. There are even ways to avoid downtime during schema changes, but it’s often complicated. For example, you don’t see YouTube go offline for schema changes, but they’re willing to make this effort and investment, even for very large databases.

      Lots of other database tasks can happen while remaining online. For backups, use a read-only connection. For upgrades, you should have a distributed and scaled database, so you can take nodes down in sections during upgrades. For “cleaning up,” you can do vacuum operations on part of your database while it’s live. Etc etc.

      Ultimately, there is almost never a technical reason why a database has to go offline. It’s a matter of devotion to the stability and uptime of your infra. Toss enough engineering hours at a database problem and you can pretty much have 100% uptime in the scope of maintenance (not incidents, of course). But even with incidents, there are fail-over plans, replicas, and a ton of other things you can do to stay online. Instead of downtime, you have degraded performance that the users may not even notice.
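
One commonly described way to change a schema without downtime is an "expand, backfill, contract" migration: add the new column alongside the old one, fill it in small batches while the service keeps running, and only later remove what is no longer needed. The sketch below uses Python's built-in `sqlite3` module; the `customers` table, the `email_lower` column, and the batch size are assumptions made up for the example, and nothing here is claimed to be how any particular company does it.

```python
import sqlite3


def expand(conn):
    # Expand: add a nullable column so existing readers and writers keep working.
    conn.execute("ALTER TABLE customers ADD COLUMN email_lower TEXT")


def backfill(conn, batch_size=100):
    """Fill the new column in small batches; returns the number of batches run."""
    batches = 0
    while True:
        with conn:  # one short transaction per batch, so locks are held briefly
            cur = conn.execute(
                "UPDATE customers SET email_lower = lower(email) "
                "WHERE id IN (SELECT id FROM customers "
                "WHERE email_lower IS NULL AND email IS NOT NULL LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            return batches  # nothing left to backfill
        batches += 1


# Contract (not shown): once every reader uses email_lower, a later
# migration can drop or stop writing the old representation.

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO customers (email) VALUES (?)",
        [(f"User{i}@Example.com",) for i in range(250)],
    )
    conn.commit()
    expand(conn)
    print("batches run:", backfill(conn, batch_size=100))
```

The complication the comment mentions shows up in what this sketch leaves out: application code has to write both representations during the transition, and each phase has to be deployed and verified separately.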

  • Carighan Maconar@lemmy.world
    1 year ago

    One thing older MMORPGs in particular did was essentially just need a weekly restart. They could not figure out how to make the server not have some bug or another that slowly increased memory usage, so eventually it would just break from the bug.

    To alleviate this, they did weekly restarts. Also a good time to do longer full backups, integrity checks, etc. But the main impetus was needing to restart everything.
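
The arithmetic behind a restart schedule like this is simple: divide the memory headroom by the leak rate. The sketch below works through it with made-up numbers (a 4000 MB baseline, a 16000 MB limit, a 60 MB/hour leak, and an 80% safety margin); none of these figures come from a real game server.

```python
def hours_until_limit(baseline_mb, limit_mb, leak_mb_per_hour):
    """Hours of uptime before a steadily leaking process reaches its memory limit."""
    if leak_mb_per_hour <= 0:
        return float("inf")  # no leak, no forced restart
    headroom_mb = max(limit_mb - baseline_mb, 0)
    return headroom_mb / leak_mb_per_hour


def restart_is_due(uptime_hours, baseline_mb, limit_mb, leak_mb_per_hour,
                   safety_margin=0.8):
    """True once uptime passes a fraction (safety_margin) of the time to the limit."""
    budget = hours_until_limit(baseline_mb, limit_mb, leak_mb_per_hour)
    return uptime_hours >= safety_margin * budget


if __name__ == "__main__":
    # Hypothetical numbers: 12000 MB of headroom leaking at 60 MB/hour.
    hours = hours_until_limit(4000, 16000, 60)
    print(f"{hours:.0f} hours, about {hours / 24:.1f} days, until the limit")
    print("restart due after one week:", restart_is_due(7 * 24, 4000, 16000, 60))
```

With these assumed numbers the server would hit its limit after a little over eight days, so restarting every seven days keeps it below the limit with some margin.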

  • Dlayknee@lemmy.world
    1 year ago

    On top of what’s already been said, to your question specifically of what the devs are doing - a lot of the time it’s nothing out of the ordinary as the Ops teams are the ones conducting the maintenance. There will likely be a dev or devs on call, but that’s routine anyway so it’s ultimately just another day for them. Sure, when big patches are pushed they’re typically more attentive to the process - but even then, they’re essentially informed observers.

  • Falmarri@lemmy.world
    1 year ago

    Not just database migrations as others have mentioned, but database state. Databases can accumulate a lot of dead data because of how transactions and locks work. Cleaning that up can cause usage of the database to be blocked for a short time. It’s easiest to do this periodically if there’s downtime.
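
A small, loosely analogous demonstration of dead space and its cleanup can be run with SQLite through Python's built-in `sqlite3` module. This is an analogy only: the comment above most likely has server databases in mind, where old row versions pile up, whereas SQLite simply keeps the pages freed by deletes on a free list until `VACUUM` rebuilds the file. The `events` table, row counts, and payload size are invented for the example.

```python
import os
import sqlite3
import tempfile


def free_pages(conn):
    # Number of unused pages still sitting inside the database file.
    return conn.execute("PRAGMA freelist_count").fetchone()[0]


def demo_cleanup():
    """Returns (free pages before, free pages after, file size before, file size after)."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None gives autocommit mode; VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA auto_vacuum = NONE")  # make sure freed pages are kept
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("BEGIN")
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)",
        [("x" * 500,) for _ in range(2000)],
    )
    conn.execute("COMMIT")

    # Deleting rows frees pages, but the file does not shrink on its own.
    conn.execute("DELETE FROM events WHERE id <= 1800")
    pages_before = free_pages(conn)
    size_before = os.path.getsize(path)

    # VACUUM rebuilds the database file and returns the free pages to the OS.
    conn.execute("VACUUM")
    pages_after = free_pages(conn)
    size_after = os.path.getsize(path)
    conn.close()
    return pages_before, pages_after, size_before, size_after


if __name__ == "__main__":
    print(demo_cleanup())
```

The rebuild needs the database to itself while it runs, which is the same trade-off the comment describes: the cleanup is easiest to schedule when nobody is using the system.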

  • what_is_a_name@lemmy.world
    1 year ago

    To add to others’ posts: it can be a huge variety of things that risk making the service unstable or unresponsive, or in the worst case corrupting data in flight.

    Customers view scheduled maintenance as a minor inconvenience, an unplanned outage as an annoyance, and loss of data as a dealbreaker.

    So any time there was a chance that what we needed to do would limit functionality - or otherwise make the system unstable - it was best to take the system offline for scheduled maintenance.