do people really care about boot time?

originalucifer@moist.catsweat.com · 9 months ago

do people really care about boot time?

neidu3@sh.itjust.works · edit-2 9 months ago

These production clusters I have at work are a nightmare to (re)boot. They run in a rather hostile environment, so sometimes we need to take it all down due to external factors. The rule of thumb is that it takes an hour to shut down and two hours to start.

There are 6 servers, and they have to start (and stop) in the correct order. Each takes around 10 minutes to boot, so if all is to be done correctly, it’s roughly 40 minutes. The rest of the startup procedure is checking internal stuff as well as interfacing with various robotics and misc.

It’s possible to gamble a bit, though: start one, wait a bit and then start the next one, hoping that they come online in the correct order. But sometimes they don’t, and this gamble results in having to shut down everything and start over.

…If you follow procedure, that is. I know the system well enough that I can start all machines at the same time and just interrogate and sort out any misbehaving components, thus cutting down the startup time a lot.

So yeah, while the system takes a lot of time to start, it’s mostly due to procedural reasons. In theory it could all be booted and ready in ~15 minutes if we made the startup sequence more forgiving.
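
For anyone curious what the "by the book" ordering boils down to, here is a rough Python sketch. It is not our actual tooling: `power_on` and `is_ready` are hypothetical callbacks you would have to supply yourself (IPMI, SSH, a health endpoint, whatever fits), and the timeout and poll interval are made-up defaults. It just starts each server in turn and refuses to move on until the previous one reports ready.

```python
import time
from typing import Callable, Sequence


def start_in_order(
    servers: Sequence[str],
    power_on: Callable[[str], None],   # hypothetical: triggers boot of one server
    is_ready: Callable[[str], bool],   # hypothetical: True once the server is fully up
    timeout_s: float = 900.0,          # assumed per-server limit, not a real figure
    poll_s: float = 5.0,               # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Start servers strictly one after another, in the given order."""
    started: list[str] = []
    for name in servers:
        power_on(name)
        waited = 0.0
        # Block until this server is ready before touching the next one.
        while not is_ready(name):
            if waited >= timeout_s:
                raise TimeoutError(
                    f"{name} not ready after {timeout_s}s; started so far: {started}"
                )
            sleep(poll_s)
            waited += poll_s
        started.append(name)
    return started
```

The "gamble" variant is the same loop with a fixed sleep in place of the readiness check, which is exactly why it sometimes fails: nothing verifies that the previous machine actually came up first.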

ricecake@sh.itjust.works · 9 months ago

That’s brutal. Is it clustered data storage of some sort? All the most offensive startup and shutdown sequences I’ve seen are on giant storage systems.

neidu3@sh.itjust.works · 9 months ago

You nailed it. Each server has 36 hard drives forming three RAIDs. These 18 RAIDs form a disaster-tolerant BeeGFS volume of 1.6 PB.

On top of that, there’s a bunch of highly specialized geophysical software, an Oracle database, and misc mundane services.