Thursday 20 August 2015
Customer-centric marketing relies on collecting and processing as many relevant events as possible. Customers are everywhere, which means the amount of data is increasing exponentially. The Go language plays a very important role in our data collection technology. Today, FLXone handles 3+ billion requests per day with an in-house developed application.
Our path to achieving this level of performance started with identifying the key challenges of combining marketing and advertising with technology:
- Large amounts of data must be collected and processed.
- Clients can buy millions of impressions, increasing our load within seconds.
- Latency is KEY to real-time advertising and marketing.
In 2013 we decided that Go (version 1.1 back then) looked promising, so we built our first version of the application in less than 5 days with 2 engineers working on it. Language features such as goroutines and channels made it easy to handle massive concurrency well. Reaching thousands of requests per second on a MacBook Pro with only minor optimization was very encouraging.
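To illustrate the goroutine-and-channel style mentioned above, here is a minimal sketch of fanning events out to a fixed number of workers. It is not our production code: the `Event` type, the `processAll` function and the worker count are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical stand-in for one collected marketing event.
type Event struct {
	ID int
}

// processAll sends every event over a channel to `workers` goroutines
// and returns how many events were handled in total.
func processAll(events []Event, workers int) int {
	in := make(chan Event, 64) // buffered so the producer rarely blocks
	var wg sync.WaitGroup
	var mu sync.Mutex
	handled := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The loop ends once the channel is closed and drained.
			for range in {
				mu.Lock()
				handled++
				mu.Unlock()
			}
		}()
	}

	for _, e := range events {
		in <- e
	}
	close(in)
	wg.Wait()
	return handled
}

func main() {
	events := make([]Event, 1000)
	for i := range events {
		events[i] = Event{ID: i}
	}
	fmt.Println("handled:", processAll(events, 8))
}
```

Closing the channel is what tells the workers to stop, and the `sync.WaitGroup` makes sure the count is only read after every worker has finished.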
Once our business started growing, we noticed that the latency was becoming increasingly flaky. We have an internal SLA of 100ms per full request. As we grew even bigger this became more and more of an issue. Initially we thought it had something to do with the network connections to the servers, but even though we were generating multiple terabytes of data every day that was not the case.
We then started analyzing the behavior of our Go application. On average the application spent about 2 ms per request, which was great! It gave us 98 milliseconds to spare for network overhead, SSL handshake, DNS lookups and everything else that makes the internet work.
Unfortunately the standard deviation of the latency was high, about 100 milliseconds. Meeting our SLA became a major gamble. Using Go's “runtime” package we started profiling the entire application and found that garbage collection was the cause, resulting in 95th-percentile latencies of 279 milliseconds…
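As a rough idea of what the “runtime” package exposes, the sketch below reads recent garbage collection pause times from `runtime.MemStats`. It is an illustration under assumptions, not the profiling code we ran: `churn`, `recentPauses` and `worst` are hypothetical helpers, and the allocation loop exists only to give the demo something to collect.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sink is package-level so the allocations in churn are not optimized away.
var sink []byte

// churn allocates n short-lived 1 MiB slices and then forces a collection,
// purely to give the demo some GC activity to report on.
func churn(n int) {
	for i := 0; i < n; i++ {
		sink = make([]byte, 1<<20)
	}
	runtime.GC()
}

// recentPauses returns the per-cycle stop-the-world pause times recorded
// by the runtime, newest first. The runtime keeps at most 256 of them.
func recentPauses() []time.Duration {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	n := int(m.NumGC)
	if n > len(m.PauseNs) {
		n = len(m.PauseNs)
	}
	pauses := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		// PauseNs is a circular buffer; the latest entry is at (NumGC+255)%256.
		idx := (int(m.NumGC) + 255 - i) % 256
		pauses = append(pauses, time.Duration(m.PauseNs[idx]))
	}
	return pauses
}

// worst returns the longest pause in p, or 0 if p is empty.
func worst(p []time.Duration) time.Duration {
	var max time.Duration
	for _, d := range p {
		if d > max {
			max = d
		}
	}
	return max
}

func main() {
	churn(200)
	p := recentPauses()
	fmt.Printf("collections sampled: %d, worst pause: %v\n", len(p), worst(p))
}
```

In a long-running server the same numbers could be sampled periodically and exported to a metrics system rather than printed.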
We decided to rewrite big chunks of the application to generate minimal or no garbage at all. This effectively reduced how often garbage collection froze the rest of the application to do its cleanup magic. But we were still having issues, so we decided to add more nodes to stay within our SLA. With over 80K requests per second at peak times, even minimal garbage can become a serious issue.
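One common way to cut per-request garbage in Go is to reuse buffers instead of allocating a fresh one for every request. The sketch below uses `sync.Pool` (available since Go 1.3) for that. It shows the general technique, not necessarily what our rewrite did; `scratch`, `writeEvent`, `render` and the `id:name` line format are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strconv"
	"sync"
)

// scratch wraps a reusable byte slice; pooling a pointer avoids an extra
// allocation each time the value is put back into the pool.
type scratch struct{ b []byte }

var pool = sync.Pool{
	New: func() interface{} { return &scratch{b: make([]byte, 0, 256)} },
}

// writeEvent serializes a (hypothetical) event as "id:name\n" into a pooled
// buffer and writes it to w, so the hot path does not allocate per call
// once the pool is warm and the buffer is large enough.
func writeEvent(w io.Writer, id int64, name string) error {
	s := pool.Get().(*scratch)
	defer pool.Put(s)

	s.b = s.b[:0] // reuse the backing array, drop old contents
	s.b = strconv.AppendInt(s.b, id, 10)
	s.b = append(s.b, ':')
	s.b = append(s.b, name...)
	s.b = append(s.b, '\n')

	_, err := w.Write(s.b)
	return err
}

// render is a convenience for the demo only; it allocates a new buffer and
// string, so a hot path would call writeEvent directly instead.
func render(id int64, name string) string {
	var out bytes.Buffer
	_ = writeEvent(&out, id, name)
	return out.String()
}

func main() {
	fmt.Print(render(42, "click"))
	fmt.Print(render(43, "impression"))
}
```

The writer must not keep a reference to the slice after `Write` returns, because the buffer goes back into the pool and will be overwritten by the next caller.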
The day has come
Over the past few months there’s been a lot of talk about Go 1.5. The entire compile chain would be rewritten from C to Go, reminding me of the movie Inception. Even better, the garbage collection functionality would be completely redesigned.
Yesterday evening (19 August), the moment had finally arrived. A stable version 1.5 of Go was released, claiming:
The “stop the world” phase of the collector will almost always be under 10 milliseconds and usually much less.
Just a few hours after the release we rebuilt the application with Go 1.5 and ran our unit and functional tests; they all passed. It seemed too good to be true, so we put some effort into manually verifying the functionality. After a few hours we decided it was safe to release it to a single production node.
We let it run for 12 hours and afterwards started analyzing the new latencies: full request, application latency and last but not least the garbage collector. Below you can see the reduced deviation in latencies as well as a reduction in absolute latency.
Two histograms of the application-level latency (the only thing that really matters). X-axis: latency; Y-axis: number of requests. Left: server running Go 1.4. Right: server running Go 1.5, where the much lower variation in latency is easy to see.
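For readers who want to reproduce this kind of analysis, the figures we quote (average, standard deviation, 95th percentile) can be derived from raw latency samples roughly as follows. This is a minimal sketch with made-up sample values; `summarize` is a hypothetical helper, and the percentile uses the simple nearest-rank definition.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// summarize returns the mean, population standard deviation and
// nearest-rank 95th percentile of latencies given in milliseconds.
func summarize(ms []float64) (mean, stddev, p95 float64) {
	n := len(ms)
	if n == 0 {
		return 0, 0, 0
	}
	for _, v := range ms {
		mean += v
	}
	mean /= float64(n)

	for _, v := range ms {
		d := v - mean
		stddev += d * d
	}
	stddev = math.Sqrt(stddev / float64(n))

	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), ms...)
	sort.Float64s(sorted)
	rank := (95*n + 99) / 100 // integer ceil(0.95*n), 1-based rank
	p95 = sorted[rank-1]
	return mean, stddev, p95
}

func main() {
	// Made-up samples: mostly fast requests plus two slow outliers.
	samples := []float64{2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 300, 300}
	mean, stddev, p95 := summarize(samples)
	fmt.Printf("mean=%.1f ms stddev=%.1f ms p95=%.0f ms\n", mean, stddev, p95)
}
```

A handful of slow outliers barely moves the typical request but dominates the tail, which is why the 95th percentile is the number to watch against an SLA.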
The new version of Go reduces our 95th-percentile garbage collection pause from 279 milliseconds to just 10 ms. This is a fantastic 96% decrease in garbage collection pause time, exactly as advertised in the release notes.
Our 95th-percentile garbage collection pause time went down by 96%.
We decided to deploy the new version to the rest of our global infrastructure (12 data centers in 7 geographical areas) and saw our average request latency drop by 53%. This means we can now effortlessly meet our 100 ms SLA, plus handle a huge increase in requests per node.
Thanks to the dedication and agility of our team, the release of the new version of Go 1.5 has massively improved the performance of our platform over the span of just 24 hours.
Founded in 2012 by a team of marketing and advertising professionals with a deep understanding of scalable technology, FLXone works with leading advertisers, publishers, agencies and trading desks. Innovation is what keeps us ahead of the curve to build a platform that drives marketing effectiveness for you.
P.S. We’re hiring!