Gmail users were reminded of the risks of relying on webmail services on Tuesday night, when Google's popular email website went down for about 100 minutes.
Google being Google, it knows that user rage is best quelled by providing as much information as possible about what went wrong, so it's written a tell-all blog post about the problem. The whole thing stemmed from an underestimation of the load on servers.
The company's engineers took a few servers offline to perform routine upgrades. Normally, if a server is down then the rest of the system takes up the slack, but traffic was higher than they realized. As a result, a few of the company's routers became overloaded and unavailable.
That increased the load on the remaining routers, which became overloaded too, and within a few minutes Gmail's web interface became unavailable. Google says that IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.
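To see why losing a little capacity can take down the whole pool, here is a toy model of the cascade in Python. It is only a sketch: the function name, the even split of traffic and every number in it are assumptions made for illustration, not figures or logic from Google's post.

```python
def simulate_cascade(total_load, num_routers, capacity):
    """Return how many routers are still serving once overloads settle.

    Toy model (an assumption, not Google's design): traffic is split
    evenly across healthy routers, and a router whose share exceeds
    its capacity drops out, pushing its traffic onto the rest.
    """
    healthy = num_routers
    while healthy > 0 and total_load / healthy > capacity:
        healthy -= 1  # an overloaded router stops accepting traffic
    return healthy


# Hypothetical numbers: 950 units of traffic, routers that handle 100 each.
all_online = simulate_cascade(950, 10, 100)    # 95 per router, within capacity
one_offline = simulate_cascade(950, 9, 100)    # about 106 per router, overloaded
extra_routers = simulate_cascade(950, 12, 100) # about 79 per router, with headroom
print(all_online, one_offline, extra_routers)
```

In this model, once the per-router share passes capacity, every failure raises the share for the survivors, so the pool does not settle at a smaller size but empties entirely. That is why both extra headroom and better behaviour under overload matter.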
The team brought a bunch of extra request routers online and distributed the load across them, and the site slowly creaked back into life. Google says that it's taking a number of steps to prevent such an event happening again, including adding extra headroom and improving the behaviour of overloaded routers.
Gmail currently manages 99.9% uptime, says Google, and promises that it's "committed to keeping events like today's notable for their rarity".
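For a sense of scale, the downtime that a 99.9% uptime figure allows can be worked out directly. The snippet below is a rough sketch: the article doesn't say over what period Google measures uptime, so the yearly and 30-day windows, like the function name, are assumptions.

```python
def downtime_budget_minutes(uptime_percent, period_days):
    """Minutes of downtime permitted by an uptime percentage over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - uptime_percent) / 100


OUTAGE_MINUTES = 100  # length of Tuesday's outage, as reported above

yearly_budget = downtime_budget_minutes(99.9, 365)   # assumed yearly window
monthly_budget = downtime_budget_minutes(99.9, 30)   # assumed 30-day window

print(f"Yearly budget:  {yearly_budget:.1f} minutes")
print(f"Monthly budget: {monthly_budget:.1f} minutes")
print(f"Share of yearly budget used: {OUTAGE_MINUTES / yearly_budget:.0%}")
```

Measured over a year, 99.9% leaves roughly 525.6 minutes of downtime, so a 100-minute outage uses up about a fifth of it. Measured over 30 days, the allowance is only about 43 minutes, which a single outage of this length would exceed.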