Thursday, July 19, 2018

How Facebook configures its millions of servers every day

When you’re a company the size of Facebook, with more than two billion users served from millions of servers, and you run thousands of configuration changes every day involving trillions of configuration checks, configuration is kind of a big deal. As with most things at Facebook, the company faces scale problems few companies have to deal with, and it often reaches the limits of mere mortal tools.

To solve its unique issues, the company developed a new configuration delivery process called Location-Aware Distribution, or LAD for short. Before developing LAD, the company had been using an open source tool called ZooKeeper to distribute configuration data, and while that tool worked, it had some fairly substantial limitations for a company the size of Facebook.

Perhaps the largest of those were a 5 MB cap on each distribution and a cap of 2,500 subscribers at a time. To give you a sense of how configuration works, it involves delivering a Facebook service like Messenger in real time with the correct configuration. That could mean delivering it in English for one user and Spanish for another, all on the fly across millions of servers.

Facebook wanted to create a tool that overcame those limitations, separated the data from the distribution mechanism, had a latency of less than five seconds and supported 10X more files than ZooKeeper. Oh yes, and it wanted all of that to run on millions of clients and handle the crazy update rates and traffic spikes that only Facebook could bring to the table.

The product the Facebook engineering team created, LAD (wonder how the Dodgers feel about this), consists of two parts. The first is a proxy that sits on every single machine in the Facebook family and delivers configuration files to any machine that wants or needs one. The second is a distributor, which, as the name implies, delivers configuration information. It does this by checking for new updates, and when it finds them, it creates a distribution tree for the set of machines that are looking for an update.

As Facebook’s Ali Haider-Zaveri wrote in a blog post announcing the new distribution method, the tree methodology helps solve a number of problems Facebook faced when distributing configuration updates at extreme volume. “By leveraging a tree, LAD ensures that updates are pushed only to interested proxies rather than to all machines in the fleet. In addition, a parent machine can directly send updates to its children, which ensures that no single machine near the root is overwhelmed,” Haider-Zaveri wrote.
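To make the tree idea concrete, here is a minimal sketch in Python. It is not Facebook's implementation: the names `build_distribution_tree`, `push_update`, `ROOT` and the `fanout` limit are all hypothetical, chosen only to illustrate the two properties described in the quote. Only subscribed proxies are placed in the tree, and every sender, including the distributor at the root, forwards an update to a bounded number of children.

```python
from collections import deque

# Hypothetical marker for the distributor at the root of the tree.
ROOT = "distributor"


def build_distribution_tree(subscribers, fanout):
    """Arrange subscribed proxies into a tree with at most `fanout` children per node.

    Returns a dict mapping each node (the distributor or a proxy) to the
    list of proxies it forwards updates to. Machines that did not
    subscribe never appear in the tree.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    children = {ROOT: []}
    for proxy in subscribers:
        children[proxy] = []
    # Nodes that may still accept children, in breadth-first order.
    open_parents = deque([ROOT])
    for proxy in subscribers:
        # Skip parents that already have their full share of children.
        while len(children[open_parents[0]]) >= fanout:
            open_parents.popleft()
        children[open_parents[0]].append(proxy)
        open_parents.append(proxy)
    return children


def push_update(children, update):
    """Push `update` down the tree, each parent sending directly to its children.

    Returns (delivered, sends): the update received by each proxy, and
    how many sends each node had to perform.
    """
    delivered = {}
    sends = {}
    pending = deque([ROOT])
    while pending:
        node = pending.popleft()
        for child in children[node]:
            delivered[child] = update
            sends[node] = sends.get(node, 0) + 1
            pending.append(child)
    return delivered, sends


# Illustrative fleet: 20 machines, of which 13 subscribe to this config.
fleet = [f"host{i:02d}" for i in range(20)]
subscribers = fleet[:13]
tree = build_distribution_tree(subscribers, fanout=3)
delivered, sends = push_update(tree, "config-v2")
print(len(delivered), "proxies updated; busiest sender made", max(sends.values()), "sends")
```

In this sketch the distributor sends the update to only a few proxies itself, and those proxies relay it onward, so the work of fanning out grows with the depth of the tree rather than landing on one machine.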

As for those limitations, the company has been able to overcome them too. Instead of a 5 MB update limit, it now supports 100 MB, and instead of a 2,500-subscriber limit, it now supports 40,000.

Such a system didn’t come easily. It required testing and retesting, but it has reached production today — at least for now, until Facebook faces another challenge and finds a new way to do things nobody considered before (because they never reached the scale of Facebook).
