Why Twitter Didn't Go Down: From a Real Twitter SRE

Twitter reportedly lost about 80% of its workforce. Whatever the actual number is, there are now entire teams without engineers. Still, the website keeps going and the tweets keep pouring in. That left a lot of people wondering what exactly all those engineers were doing, and whether the company was overstaffed. I'd like to explain my little corner of Twitter (although it wasn't so small) and some of the work that made this thing run.

Background and history

For five years, I was a Site Reliability Engineer (SRE) at Twitter. For four of those years, I was the only SRE on the Cache team. A few came before me, and teammates came and went over the years, but for four years I was responsible for automation, reliability, and operations on the team. I designed and implemented most of the tools that kept it working, so I think I'm qualified to talk about it. (There may be only one or two other people who could.)

A cache can be used to speed things up, or to offload requests from something that is more expensive to run. If you have a server that takes 1 second to respond but returns the same response every time, you can store that response in a cache server, where it can be served in milliseconds. Or, if you have a server cluster where processing 1,000 requests per second costs $1,000, you can use a cache to store the responses and serve them from the cache servers instead. Then you'd have a small origin cluster for $100 and a cheap, large cache cluster for maybe another $100. The figures are only examples to illustrate the point.
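The idea above can be sketched in a few lines. This is a minimal, illustrative in-memory cache, not Twitter's actual cache stack (which was built on dedicated cache servers); the function names and the shortened backend delay are assumptions for the sketch.

```python
import time

def expensive_response(key):
    """Stand-in for a backend that takes ~1 second per request.

    The delay is shortened here so the sketch runs quickly.
    """
    time.sleep(0.01)
    return f"response-for-{key}"

cache = {}

def cached_response(key):
    # Serve from the cache when we can; on a miss, ask the
    # expensive backend once and remember its answer.
    if key not in cache:
        cache[key] = expensive_response(key)
    return cache[key]
```

The first call for a given key pays the backend's full latency; every repeat call is a dictionary lookup, which is the speed and cost win the paragraph describes.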

Caches absorbed most of the traffic the site saw. Tweets, all the timelines, direct messages, advertisements, authentication: all of it was served from the Cache team's servers. If something went wrong with Cache, you would know it as a user; the problems would be visible.

When I joined the team, my first project was to swap machines being decommissioned for new machines. There was no tool or automation for this; I was handed a spreadsheet with the names of the servers. I'm happy to say that this team's operations aren't like that anymore!

How the Cache Keeps Working

The first important thing that keeps the caches working is that they run as Aurora tasks on Mesos. Mesos aggregates all the servers into one pool, and Aurora finds servers in that pool to run applications on. Aurora also keeps applications running after they start. If we say a cache cluster needs 100 instances, it will do its best to keep 100 running. If a server goes completely down for some reason, Mesos detects it and removes it from the aggregated pool; Aurora then sees that only 99 caches are running, knows it needs to find a new server to run on, automatically picks one, and brings the total back to 100. No one needs to get involved.
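The keep-N-running behavior described above is a reconciliation loop. Here is a minimal sketch of one pass of that loop; the function and variable names are illustrative, not Aurora's or Mesos's actual APIs.

```python
def reconcile(running, desired, available):
    """One pass of a keep-N-running reconciliation loop.

    `running`   -- servers currently hosting a cache instance
    `desired`   -- the target instance count (e.g. 100)
    `available` -- the pool of healthy servers the scheduler
                   may place new instances on

    Illustrative only: the real scheduler (Aurora on Mesos)
    does this continuously, not as a single function call.
    """
    running = set(running)
    spares = [s for s in available if s not in running]
    # A dead server has already been dropped from `running` by the
    # time we get here; top the cluster back up from the spare pool.
    while len(running) < desired and spares:
        running.add(spares.pop())
    return running
```

Run this pass against a cluster that just lost a server and the count comes back up to the target, which is exactly the "99 back to 100" scenario in the text.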

In a data center, servers are placed in units called racks. Servers in a rack are connected to one another through a device called a switch, and from there a whole complex system of switches and routers eventually reaches the Internet. A rack can hold roughly 20 to 30 servers. A rack can fail: the switch can break, or maybe a power supply dies, taking down all of its servers at once. Another good thing Aurora and Mesos do for us is ensure that not too many instances of one application are placed on a single rack. That way an entire rack can safely fail, and Aurora and Mesos will find new servers to house the applications that were running there.
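The rack-spreading constraint can be sketched as a placement check. This is a hypothetical illustration of the idea, not Aurora's real constraint mechanism (Aurora expresses this declaratively in the job configuration rather than as a loop).

```python
from collections import Counter

def pick_server(candidates, rack_of, placed, max_per_rack=2):
    """Pick a server whose rack isn't already saturated.

    `rack_of` maps server -> rack; `placed` lists servers already
    running instances of this job. All names are illustrative.
    """
    per_rack = Counter(rack_of[s] for s in placed)
    for server in candidates:
        # Skip any server whose rack already holds the maximum
        # number of instances for this job.
        if per_rack[rack_of[server]] < max_per_rack:
            return server
    return None  # no rack-safe placement available
```

With a limit of one instance per rack, losing a whole rack costs the job at most one instance, which the scheduler then replaces elsewhere.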

That spreadsheet mentioned earlier also tracked the number of servers on the racks, and the spreadsheet writer tr...
