Continuous Deployment (CD)

Adap.tv is an emerging leader in the video advertising sector. It is critical for us to stay on top of this fast-moving industry. We needed a deployment methodology that allowed us to quickly develop and release changes without disrupting our clients.

Two years ago we decided to switch to continuous deployment (CD). Today it’s hard to imagine going back. CD allows us to move fast, saves time on integration, and helps to find and fix bugs. As with anything else, the hardest part is to start. In this blog we would like to share our experiences of switching to and using the CD approach.

Probably, the most important thing is not to over-analyze or over-optimize. Don’t try to solve problems you are not facing. Implement simple solutions for problems you encounter as needed. It may sound like a cliché but many fall into that trap.

Here is a list of what you need when shifting to CD:

  • Automatic validation/Immune system
  • Code reviews
  • Automatic deployment mechanism
  • Monitoring

You may already have all or some of this. There are different methodologies that can be applied to each of these systems. In any case, keep it as simple as possible.

The Process

  • Work in small incremental changes and deploy each change as soon as possible. On our last complete work day we had 28 production deployments in our system.
  • It is very important to keep each change small. The change does not have to implement the entire feature. Each submitted change usually represents a small part of the functionality in development.
  • Commit the code into a separate branch of source control and have one or two other engineers review the code. The exact process depends on the size of the team and how many engineers have approval rights. See code review section below.
  • The Immune system runs automatically and validates the changes.
  • If the immune system finds bugs or the code review was not satisfactory, fix issues and re-submit the code.
  • Once a change passes the code review and the immune system, merge the change into the main branch. It’s important to note that merging code into the main branch is only possible after the change has been approved and verified. See code review section below.
  • CD scripts monitor the main branch, automatically pick up the latest changes, and deploy approved and verified code to production. As the code is deployed, ‘self tests’ run on each machine to make sure that the services started correctly.
  • Monitor key indicators for each service following the deployment to make sure that the change doesn’t cause unexpected problems.
  • Minimize Risk: There are sensitive time periods when CD will not deploy changes to production. This is done to prevent potential crises when engineers are out of the office on weekends or holidays.
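The "sensitive time periods" rule in the last bullet can be sketched as a small gate that a deployment script consults before releasing anything. This is only an illustration in Python: the function name `deploy_allowed`, the weekend rule, and the `HOLIDAYS` set are assumptions made for the example, not the actual scripts described in this post.

```python
from datetime import date, datetime

# Hypothetical holiday calendar; a real one would come from configuration.
HOLIDAYS = {date(2012, 12, 25), date(2013, 1, 1)}

def deploy_allowed(now, holidays=HOLIDAYS):
    """Return True if an automatic deployment may start at `now`."""
    if now.weekday() >= 5:        # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in holidays:    # engineers are out of the office
        return False
    return True

if __name__ == "__main__":
    print(deploy_allowed(datetime.now()))
```

A deployment loop would simply skip its cycle while the gate returns False, leaving approved changes queued on the main branch until the next allowed window.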

Automatic validation/Immune system
In our case, the Immune system verifies each of the services via its API. A typical test involves pre-defined test-case data and definitions of expected responses. When we first started using CD, our immune system only checked the basics. It is much more developed today.
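As a rough sketch of what "pre-defined test-case data and definitions of expected responses" can look like, the Python snippet below runs a list of request/expected-response pairs against a service call and reports the cases that fail. Everything here is hypothetical: `run_immune_system`, `fake_ad_service`, and the shape of the test cases are invented for illustration and do not describe our real immune system.

```python
def run_immune_system(call_service, test_cases):
    """Return the names of test cases whose response differs from the expectation."""
    failures = []
    for case in test_cases:
        actual = call_service(case["request"])
        if actual != case["expected"]:
            failures.append(case["name"])
    return failures

# Stand-in for a real service API call (hypothetical behavior).
def fake_ad_service(request):
    return {"status": 200, "ads": min(request["slots"], 2)}

TEST_CASES = [
    {"name": "one slot", "request": {"slots": 1},
     "expected": {"status": 200, "ads": 1}},
    {"name": "caps at two ads", "request": {"slots": 5},
     "expected": {"status": 200, "ads": 2}},
]

if __name__ == "__main__":
    failed = run_immune_system(fake_ad_service, TEST_CASES)
    print("rejected" if failed else "verified", failed)
```

Starting with a handful of basic cases like these and adding a new case whenever a bug slips through is one way to grow such a system incrementally.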

Different software components have separate immune systems. It’s important to note that some changes require validation by multiple immune systems. An example of such a component is the database. Deploying changes to the database schema mandates executing immune systems for each software component that depends on that database.

Code Reviews
Code reviews are a crucial part of CD. Reviews not only find bugs but also help train the engineering team. Depending on the engineering culture, a code review system based on administrative action might be ineffective, so automation has to be put in place. We use a combination of git and gerrit to achieve that.

Gerrit defines three roles: review, approve and verify. The number of people who ‘Approve’ should start out limited and grow as more people gain expertise.

Deployment Mechanism
We use BASH shell scripts to execute the CD mechanism. The deployment script runs the immune system and distributes binaries across the cluster. On each machine we have scripts to start and stop applications once new code is deployed.

Monitoring
There are multiple types of monitoring:

  • deployment process monitoring – sends an e-mail about the deployment process. It is the responsibility of the engineer who submitted the change to make sure that the deployment succeeds.
  • business health monitoring – monitors critical business parameters. We use cacti graphs to visualize this data.
  • system performance monitoring – monitors the health of the system. We use ganglia graphs to visualize this data.

Pitfalls of CD
Although our development team is happy with CD, the flip side is that customers experience frequent changes. This is especially relevant to the GUI.

To keep continuous GUI changes from surprising our customers, we use product flags to hold back certain deployed features until our clients are ready. That lets us use CD and avoid customer confusion with frequent GUI changes. Once a feature is fully released, the product flag is retired and the code is simplified.
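A product flag can be as simple as a lookup that decides, per client, whether deployed code is actually exposed. The Python sketch below is illustrative only: the `ProductFlags` class, the flag name `new_report_layout`, and the client ids are assumptions, not our real flag system.

```python
class ProductFlags:
    """Maps a flag name to the set of client ids the feature is released to."""

    def __init__(self, enabled_by_client):
        self._enabled = enabled_by_client

    def is_on(self, flag, client_id):
        return client_id in self._enabled.get(flag, set())

def render_dashboard(flags, client_id):
    # Both code paths are deployed; the flag decides which one a client sees.
    if flags.is_on("new_report_layout", client_id):
        return "new layout"
    return "classic layout"

if __name__ == "__main__":
    flags = ProductFlags({"new_report_layout": {"client-a"}})
    print(render_dashboard(flags, "client-a"))
    print(render_dashboard(flags, "client-b"))
```

Retiring the flag then means deleting the `if` and the classic branch, which is the simplification step mentioned above.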

Conclusion
At the start of the process many of us had doubts that CD was practically possible. The most frequent concern was reliability: a fear that CD would cause the entire production environment to melt down and disrupt our business. But it never happened. In fact, our rate of serious production problems decreased and the speed of business feature development and deployment increased. Another plus is that it boosted the sense of system ownership and job satisfaction in the engineering team.

Modern memory management via vmbuf

About memory management

Memory management in high throughput servers can become an extremely difficult challenge if it is not thought through carefully beforehand. Traditional memory management relies on malloc and free, where major performance improvements can be achieved by using memory pools, and memory pools per thread, to avoid global locks. Modern memory management relies on the dramatic advancement in hardware, more specifically the MMU (Memory Management Unit). Today software can easily leverage the new hardware to further improve throughput and latency.

Relying solely on the MMU and the best implementation of malloc and free can take you far, but not far enough. Certain guidelines must be followed as well. At Adap.tv, when we discuss server architecture, we call them prime directives to signify their importance.

  • Never call ‘free’ (or delete in C++) in the code. ‘free’ consumes many CPU cycles since it needs to coalesce free blocks and deal with fragmentation. Freeing memory is allowed only when using memory mapped files and only in the case of mapping new data while discarding old.
  • Always reuse memory. In high throughput servers, all the requests follow roughly the same pattern, which means memory can be reused.
  • Avoid copying and/or moving memory regions.

There are several ways to obey the above directives; we chose what we believe is the best way for our needs. We call our solution vmbuf. It combines well-known techniques, and most of the innovation lies in the way we combine them.

64 bit architecture

Virtual memory on a 64-bit architecture is far more powerful than on its 32-bit counterpart, mainly because of the size of the address space: 128TB on 64-bit (Linux x86_64) versus ~3GB on 32-bit. Having 128TB matters when it comes to allocating and resizing buffers. When virtual memory is allocated, the OS allocates address space only; no physical memory is involved until the memory is actually used (minor page fault). Armed with that knowledge and 128TB of address space, it is now possible to double the size of the buffer upon resize, hence avoiding many system calls to mmap, without consuming physical memory beyond what is actually needed and without copying the actual data. In some cases, when enough is known about the memory usage, virtual memory can also be over-allocated to accommodate any expected usage and avoid resizing completely. On 32-bit systems, resizing can easily run out of contiguous address space, making the resize impossible, so vmbuf is more effective on 64-bit architectures.
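To make the benefit of doubling concrete, here is a small back-of-the-envelope calculation in Python. The starting size (4KB), target size (1GB), and the helper names are assumptions chosen for illustration; the point is only that doubling needs a logarithmic number of resize calls while fixed-size growth needs a linear number.

```python
def resizes_by_doubling(initial, target):
    """Count resize calls needed to reach `target` bytes by doubling."""
    count, capacity = 0, initial
    while capacity < target:
        capacity *= 2
        count += 1
    return count

def resizes_by_fixed_step(initial, target, step):
    """Count resize calls needed when growing by a fixed `step` each time."""
    if target <= initial:
        return 0
    return -(-(target - initial) // step)   # ceiling division

if __name__ == "__main__":
    print(resizes_by_doubling(4096, 1 << 30))          # 18
    print(resizes_by_fixed_step(4096, 1 << 30, 4096))  # 262143
```

Because only address space is reserved up front, the over-allocation in the doubling strategy costs no physical memory until the pages are touched.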

vmbuf

vmbuf is part of our opensource projects RIBS and RIBS2.0 which can be found on github:

https://github.com/Adaptv/ribs

https://github.com/Adaptv/ribs2

 

Capabilities 

  • Allocate memory that auto-resizes when needed, without copying physical memory blocks
  • string/memory functions: sprintf, strftime, strcpy, memcpy
  • Read and write until end of data, EAGAIN or other errors
  • Containers: hashtables, linked list, heap, vector
  • vmfile: persist the data in memory mapped file.

How does it work?

vmbuf uses mmap and mremap to allocate and resize memory regions. One can argue that malloc, free and realloc achieve exactly the same thing and are also portable. However, we chose the non-portable way because we care about the extra CPU cycles that can be saved by bypassing malloc and calling mmap directly. Furthermore, the memory copy that may take place when calling realloc can be avoided by using mremap, which offloads most of the work to the MMU.

How to use vmbuf or similar solution that uses malloc/realloc (remember no ‘free’)?

  • All the data structures that are stored in vmbuf must use offsets instead of pointers. We implemented vectors, linked lists, hashtables and more that are vmbuf based.
  • vmbufs can be allocated per connection and/or per thread (or globally if not using threads). For example, the request and response buffers are per connection, whereas a temporary buffer used to convert a timestamp to a string is allocated once per thread and reused by all requests.
  • In extremely rare cases a large chunk of memory is needed temporarily at the connection level, but only a tiny percentage of the time. In that case ‘free most’ can be used to free most of the pages, leaving enough pages for the common requests.
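The reason for storing offsets instead of pointers is that a buffer may move when it is resized, while an offset from the start of the buffer stays valid. The Python sketch below illustrates only that idea, plus reuse by resetting a write offset; it is not the vmbuf API. The names `OffsetBuffer`, `push_front`, and `to_list` are hypothetical, and a Python bytearray may copy data when it grows, unlike an mremap-based buffer.

```python
import struct

class OffsetBuffer:
    """Growable buffer that hands out offsets rather than addresses."""

    def __init__(self, capacity=64):
        self.data = bytearray(capacity)
        self.wloc = 0                        # next free offset

    def alloc(self, size):
        while self.wloc + size > len(self.data):
            self.data.extend(bytes(len(self.data)))   # double the capacity
        ofs = self.wloc
        self.wloc += size
        return ofs

    def reset(self):
        self.wloc = 0                        # reuse the memory; nothing is freed

NODE = struct.Struct("<iI")                  # (value, offset of next node)
NIL = 0xFFFFFFFF                             # "no next node" marker

def push_front(buf, head, value):
    ofs = buf.alloc(NODE.size)
    NODE.pack_into(buf.data, ofs, value, head)
    return ofs                               # the new head is an offset

def to_list(buf, head):
    out = []
    while head != NIL:
        value, head = NODE.unpack_from(buf.data, head)
        out.append(value)
    return out

if __name__ == "__main__":
    buf, head = OffsetBuffer(), NIL
    for v in range(20):                      # forces the buffer to grow twice
        head = push_front(buf, head, v)
    print(to_list(buf, head)[:3], len(buf.data))
    buf.reset()                              # ready for the next request
```

The linked list stays intact across both resizes because every link is relative to the start of the buffer, and `reset()` makes the same memory available to the next request without any call to free.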

Why we use C++ (without using STL or boost)

Audience: SaaS and SOA designers and engineers who use Linux.
Not yet audience: apps and desktop apps designers and engineers.

C++ refers to the language without using STL.

“There are two ways of constructing a software design. One is to make it so simple that there are obviously no deficiencies; the other is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.”
– C. A. R. Hoare, The Emperor’s Old Clothes, CACM, February 1981

There are plenty of reasons not to use C or C++. In http://catb.org/esr/writings/taoup/html/why_not_c.html Eric Steven Raymond documented the most common reason: “The central problem of C and C++ is that they require programmers to do their own memory management” and “C memory management is an enormous source of complication and error.” “Not so long ago, manual memory management made sense anyway. But there are no ‘small systems’ any more, not in mainstream applications programming.”

Others also complain about strings (http://www.kuro5hin.org/story/2004/2/7/144019/8872), missing types, and being trapped in the 1970s.

So why use it? Because you can build very efficient, large scale systems, fast. You are probably asking yourself how the word ‘fast’, in the sense of ‘build fast’, got in here. Read on.

Performance – It does not make sense to even try to compare C or C++ to Java, Perl, PHP, Python etc., as those are all written in C and some C++, so inherently they are slower: the additional layer will always cost something. However, if you really insist, there are many benchmarks out there that will prove the point. (Some benchmarks credit the gcc optimizer, not the language.) In my personal experience over the years, you will need 4 to 80 times the hardware just by deciding not to use C or C++.

Even with the hit on performance and cost, many still avoid C and C++, mainly because of the perception that memory management adds engineering time.

At Adap.tv, we decided to solve the problem. It is a well-known fact that the problem with managing memory is actually freeing memory, so we decided to avoid it completely and not free memory. Rather than freeing memory, we reuse memory without any locking. Doing so, together with event-driven technology, allowed us to quickly build high scale systems. By high scale I mean tens of thousands of connected clients (persistent connections) and hundreds of thousands of queries per second per server.

The key to this is large buffers and 64-bit technology. (It is possible to do the same with 32-bit; however, the 2GB user address space is a limiting factor, and extensive use of technologies like PAE is required.) With 64-bit, the address space is limited to 128TB. Those large buffers (called vmbuf) are used to store data and also to allocate structures (“objects”). When the task is done, the vmbuf’s internal offset is simply reset() to 0. vmbufs exist per reusable context, such as a server context or a client context, and per thread. Per-thread-only vmbufs are good as long as the server or client does not need to wait for a network event (remember, it is also event-driven).

There are several rules that we obey in order to keep performance high:

  • Avoid malloc(), calloc() or new
  • No free() or delete (and no need for tcmalloc)
  • No STL, boost etc.
  • Avoid locking as much as possible
  • # threads = # CPU cores (hyperthreading is a trade-off between latency and throughput)

vmbuf is included in the Robust Infrastructure for Backend Systems (RIBS) which is the infrastructure that we use to build our servers: https://github.com/Adaptv/ribs/

Our next blog post will describe the design of vmbuf in detail and will also include an example of how to use it in your servers.

Three forecasting techniques

How well are we doing when forecasting for online ads at Adap.tv? Can we do better? To start answering such questions, we used R to experiment with three well-known forecasting techniques and used an hourly metric (transformed for data privacy) across all ads for seven days (Feb 1-Feb 7) as our training data set. We then used each of the techniques to predict for the eighth day (Feb 8), and compared the predicted values with actual data. Here’s a synopsis of the three:

1. Exponential Weighted Moving Average (EWMA)

Strictly speaking, the smoothing filter that we currently use at Adap.tv for forecasting is not EWMA, but more like Weighted Moving Average. EWMA is defined by:

S_{t} = \alpha \times (Y_{t-1} + (1-\alpha) \times Y_{t-2} + (1-\alpha)^2 \times Y_{t-3} + ... + (1-\alpha)^k \times Y_{t-(k+1)}) + (1-\alpha)^{k+1} \times S_{t-(k+1)} (ref)

So, if α = 0.9, S[t] = 0.9 ( y[t-1] + 0.1*y[t-2] + 0.01*y[t-3] … 1e-6*y[t-7]) would be an implementation of EWMA. Instead, our implementation is: S[t] = 0.9 ( y[t-1] + 0.9*y[t-2] + 0.81*y[t-3] … 0.531441*y[t-7]) / (0.9+…). The critical difference is in the way past data is weighted: with EWMA, data in the remote past is heavily penalized, whereas in our implementation it is comparatively not.
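The difference between the two weighting schemes is easy to see numerically. The Python sketch below computes both sets of weights for a seven-point window; the function names are made up for this example, and the normalization (dividing by the sum of the weights) is our reading of the "/ (0.9+…)" term above rather than a copy of production code.

```python
def ewma_weights(alpha, n):
    """Weight of y[t-1-i] in the truncated EWMA sum: alpha * (1 - alpha)**i."""
    return [alpha * (1 - alpha) ** i for i in range(n)]

def normalized_power_weights(a, n):
    """Weight of y[t-1-i] proportional to a**(i+1), normalized to sum to 1."""
    raw = [a ** (i + 1) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def forecast(history, weights):
    """history[-1] is y[t-1], history[-2] is y[t-2], and so on."""
    recent_first = history[::-1][:len(weights)]
    return sum(w * y for w, y in zip(weights, recent_first))

if __name__ == "__main__":
    ewma = ewma_weights(0.9, 7)
    wma = normalized_power_weights(0.9, 7)
    # Relative weight of the oldest sample compared to the newest one.
    print(ewma[6] / ewma[0], wma[6] / wma[0])
```

With α = 0.9 the oldest of seven samples carries about one millionth of the newest sample's weight under EWMA, but a little over half of it under the weighted moving average.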

Fig. 1 below shows a plot that forecasts our metric for Feb 8 using a Weighted Moving Average akin to our implementation. The metric forecasted for Feb 8 was 68392533 whereas the observed value was 63868927, giving an absolute difference of ~7.1%. The absolute hourly differences lie between 0.6% and 29%.

Fig. 1

2. Harmonic regression

Harmonic regression attempts to find the best-fitting periodic signals that can characterize a time series. For example, a time series may be represented by a linear sum of sinusoids:

\frac{a_0}{2} + \sum_{n=1}^\infty \, [a_n \cos(nx) + b_n \sin(nx)] (ref)

Generally, a combination of Fourier transform and least-squares regression is used to find the harmonics whose linear sum best fits the training data. In our experiment, we used the top seven harmonics with the highest power density (i.e. contribution to the model), and then used linear regression to find their amplitudes. Fig. 2 shows the model (in red) and actual data (in blue). The forecast for Feb 8 was 60476559 whereas the observed value was 63868927, giving an absolute difference of ~5.3%. The absolute hourly differences lie between 0.05% and 25%.
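A minimal sketch of the idea in Python follows. It is not the R code used for the experiment: it swaps the regression step for a plain discrete Fourier transform, whose coefficients coincide with the least-squares amplitudes when the selected frequencies are exact Fourier frequencies of the training window. The function names and the synthetic series are assumptions for illustration.

```python
import cmath
import math

def top_harmonics(series, k):
    """Return the mean and the k strongest (frequency, coefficient) pairs."""
    n = len(series)
    mean = sum(series) / n
    coeffs = []
    for f in range(1, n // 2 + 1):
        c = sum((y - mean) * cmath.exp(-2j * math.pi * f * t / n)
                for t, y in enumerate(series))
        coeffs.append((abs(c) ** 2, f, c))       # power, frequency, coefficient
    coeffs.sort(key=lambda item: item[0], reverse=True)
    return mean, [(f, c) for _, f, c in coeffs[:k]]

def predict(mean, harmonics, n, t):
    """Evaluate the harmonic model at time t (t >= n extrapolates)."""
    total = mean
    for f, c in harmonics:
        scale = 1.0 if 2 * f == n else 2.0       # the Nyquist term is not doubled
        total += scale * (c * cmath.exp(2j * math.pi * f * t / n)).real / n
    return total

if __name__ == "__main__":
    # Synthetic hourly data: two days with a 24-hour cycle.
    series = [10 + 3 * math.sin(2 * math.pi * t / 24) for t in range(48)]
    mean, harmonics = top_harmonics(series, 1)
    print(harmonics[0][0], predict(mean, harmonics, 48, 54))
```

Because the model is periodic, forecasting is just evaluating it at time indices beyond the training window, which is also why it cannot follow a trend on its own.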

Fig. 2

3. Holt-Winters method

Holt-Winters method augments EWMA to account for data trends and seasonality.

\begin{align}
s_0 &= x_0\\
s_{t} &= \alpha \frac{x_{t}}{c_{t-L}} + (1-\alpha)F_t\\
b_{t} &= \beta (s_t - s_{t-1}) + (1-\beta)b_{t-1}\\
c_{t} &= \gamma \frac{x_{t}}{s_{t}}+(1-\gamma)c_{t-L}\\
F_{t+m} &= (s_t + mb_t)c_{(t+m) \pmod L}
\end{align} (ref)

The model we used assumes that the seasonal component is additive rather than multiplicative (the equations above show the multiplicative form, in which the data is divided by the seasonal factor c). Fig. 3 illustrates the breakdown of seasonality, trend and randomness in the data, Fig. 4 contrasts observed data with predicted data (note that the prediction starts after one period, and note the phase lag in the predicted data), while Fig. 5 shows the Holt-Winters prediction for Feb 8 versus observed data. The forecasted metric for Feb 8 was 63801332 whereas the observed value was 63868927, giving an absolute difference of ~0.11%. The absolute hourly differences lie between 1.3% and 20%.
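For readers who want to experiment, here is a compact Python sketch of the additive variant. It is not the R routine used for the figures: the initialization from the first two periods is one common simple choice, and the function name, parameters, and smoothing constants are assumptions for illustration. In the additive form the seasonal term is subtracted from and added to the data instead of dividing and multiplying it.

```python
def holt_winters_additive(x, period, alpha, beta, gamma, horizon):
    """Fit additive Holt-Winters on x and forecast `horizon` steps ahead."""
    L = period
    if len(x) < 2 * L:
        raise ValueError("need at least two full periods of data")
    # Simple initialization from the first two periods (an assumption).
    first = sum(x[:L]) / L
    second = sum(x[L:2 * L]) / L
    level = first
    trend = (second - first) / L
    season = [x[i] - first for i in range(L)]
    for t in range(L, len(x)):
        prev_level = level
        s_old = season[t % L]                    # seasonal term from one period ago
        level = alpha * (x[t] - s_old) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % L] = gamma * (x[t] - level) + (1 - gamma) * s_old
    n = len(x)
    return [level + m * trend + season[(n - 1 + m) % L]
            for m in range(1, horizon + 1)]

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0] * 3              # purely seasonal toy series
    print(holt_winters_additive(data, 4, 0.5, 0.5, 0.5, 4))
```

For hourly data with a daily cycle, `period` would be 24 and `horizon` 24 to produce a next-day forecast comparable to the one discussed above.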

Fig. 3
Fig. 4
Fig. 5

 

Fast file aggregation

Sorting mail at the Federal Building post office, circa 1910.

Problem: Given a very large number of files, many of which are exact replicas, how can we group all files by replicas?

There are several interesting cases of this problem:

  • Files usually fit in memory
  • Files are very large, and can only be read in blocks
  • Most files belong to one or few replica groups

The obvious and correct solution is to directly compare each file with every other file. Generally, this is the best possible solution only if each file is known to contain just a number. If the files contain long binary strings, a better solution would reduce the number of file comparisons (and memory accesses, in case files are very large) as far as possible.
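One widely used way to cut down comparisons, shown here purely as an illustration and not necessarily the approach this post goes on to describe, is to bucket files by size first and then by a content hash read in blocks. The Python function names below are hypothetical, and the sketch assumes hash collisions are negligible; a byte-by-byte comparison within each group could confirm the result.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, block_size=1 << 20):
    """Hash a file block by block so large files never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

def group_replicas(paths):
    """Return lists of paths; each list holds files with identical content."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    groups = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) == 1:
            # A unique size cannot have a replica, so the file is never read.
            groups[(size, None)].append(same_size[0])
            continue
        for p in same_size:
            groups[(size, file_digest(p))].append(p)
    return [sorted(g) for g in groups.values()]

if __name__ == "__main__":
    import sys
    for group in group_replicas(sys.argv[1:]):
        print(group)
```

Each file is read at most once, so the cost grows linearly with the total data size instead of quadratically with the number of files.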

Lean startup vs. lean software development

Lean manufacturing in action at Toyota.

The “lean startup” movement has been getting a lot of attention recently. It has a dedicated annual conference, a Stanford course, a SXSW track, and its own bootcamp. And Eric Ries, the engineer and blogger who apparently coined the term, has just completed his authoritative book on the topic (although it won’t be released for another 2 months).

I was introduced to the lean startup concept by my friend Avi Brown at Sharethrough, who informed me that he was applying lean principles to all aspects of his business. I had already read Steve Blank’s The Four Steps to the Epiphany but didn’t really get how his “customer development” ideas were related to the concept of “lean” (he doesn’t mention it explicitly in the book). After attending the Startup Lessons Learned conference in San Francisco, I did get it, and I was hooked.

The open kitchen, or why software startups need a technical blog

Adap.tv has been building great software for almost 5 years, and until now, we’ve never published a technical blog. But if you take a quick look around, it seems like a lot of well-reputed software startups do maintain a technical blog. For example:

At first glance, it seems counter-intuitive: why share the inner workings of your company’s most strategic asset with the world, in a competitive market, and risk giving your competitors a leg up? Of course, the obvious answer is that it is a way of “giving back” knowledge to the broader community. While true, that rings a bit hollow for most companies, as there are many other more impactful ways to be philanthropic.
