Tracking Application-Level Metrics in Amara

Welcome to the Amara dev blog. We could do a big first post about why we're making a blog, what will be on it, the nature of blogging itself, and so on, but instead of all that I'm just going to jump right in and talk about something interesting.

In this post I'll show you how we implemented a new bit of infrastructure to handle tracking lots of real-time, application-level metrics in Amara.


If you're not familiar with Amara, it's a site that lets you caption/subtitle videos on the web (YouTube, Vimeo, HTML5, etc) and translate the captions/subtitles into other languages, all through a web interface. I'm certainly biased, but I think we've done a pretty good job at making it easy and painless.

Other organizations like Mozilla, TED, Al Jazeera, and PBS use Amara to help coordinate the transcription and translation of subtitles for some of their videos. Some of them also use our Javascript widget to display the subtitles on those videos. So we get a fair amount of traffic.

Amara is written primarily in Python with the Django web framework. We use most of the common Djangoey bits like Haystack, Celery, South, and so on. We run on AWS, and use Vagrant to develop locally.

The Situation

When running a decent-sized application (we've got on the order of 100k LOC) with a decent amount of traffic, it helps to gather some statistics to understand how your application actually performs. Up until a month or so ago, we just tracked a bunch of basic stuff that everyone tracks.

Things like page views, referral sources, and so on are tracked through Google Analytics.

AWS also tracks things for you. The AWS metrics we look at most often are the RDS metrics like database connections and read/write throughput.

We also monitor aspects of server health -- free RAM, process counts, load, and so on -- through Nagios and some stand-alone scripts.

AWS and Google Analytics are basically zero-effort, and the scripts that Nagios polls are fairly small and self-contained. For a while this was enough, but lately we've started to want more detailed information. So I set out to figure out the best way for us to track more (many more) metrics.

The catalyst for this was me trying to debug a painful performance problem, which may be the subject of a later blog post. The problem has since been fixed, but we're still really happy about having this extra data.

The Goal

When I sat down and thought about what this new metrics infrastructure would look like, I came up with a short list of important characteristics I was looking for.

Self-Contained

Whatever setup I decided on needed to be self-contained.

If the entire metrics infrastructure catches on fire it shouldn't affect the actual functionality of our site at all.

A hole in your data sucks, but it's better than a hole in your data and your availability.

I also wanted it to be completely optional when running the site locally, so it doesn't add any extra overhead when we bring on new developers or when someone wants to send us a pull request (we're open source).

Visual and Flexible

Graphs are important. Looking at raw numbers is fine, but it's often easier to see trends when you can visualize your data on a graph.

We also wanted the ability to slice and dice metrics on the fly, as well as plot them against each other to look for correlations. Often you don't know what you should be looking for until after it's already happened, so the more we can do at viewing-time (as opposed to collection-time) the better.

Third-Party and Flexible

I didn't want to write everything from scratch, because I knew there were already a lot of options out there created by very smart people.

With that said, I also know that "one size fits all" really means "one size doesn't really fit anyone", so I'm not averse to writing some code and sending some pull requests if necessary.

Light and Modular

We don't get an enormous amount of traffic, so the system we pick doesn't need to scale up to Google-like levels. I'd prefer something light and modular where we can swap components in and out, even if it's a bit less efficient overall.

I'd also prioritize flexibility and ease of configuration over raw throughput, up to a certain point.

Fine-Grained

We need a way to collect very fine-grained stats. One-minute resolution (or, even worse, five-minute resolution) is not fine enough to see certain effects in your application accurately.

I decided to look for a system that could gather stats in intervals of 10 seconds or less. That's close enough to feel "real time" to us, so that's where I set the bar.

The Contenders

After whiteboarding for a while I ended up thinking we'd need two main components in the metrics infrastructure:

  • Something to gather and parse metrics.
  • Something to store and display the parsed metrics.

Storing and Displaying

We currently use Nagios to gather, store, and visualize some statistics, so that's naturally an option that came up. We decided against it for a few reasons.

First: the UI is just horrifying. No one on our team likes it -- in fact we actively hate it. If the UI is painful, then the whole metrics system is going to be painful, and no one will use it much. I wanted something people would enjoy using.

There were some other reasons we vetoed Nagios as well, but I won't bother with them here.

We ended up choosing Graphite for this piece of the puzzle instead.

Graphite's UI isn't the prettiest, but it's good enough, and the URL-based graphing will make it easy to create beautiful dashboards in the future.
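
To give a flavor of that: every graph is described entirely by its URL, so you can bookmark graphs, share them, or stitch a bunch of them into a dashboard. A render URL looks something like this (the hostname and metric path here are illustrative):

http://graphite.example.com/render?from=-24h&target=production.all-hosts.page-views.rate-per-second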

Its graphing abilities are very powerful. You can munge data in a lot of different ways, and plot just about anything against anything else.

Graphite handles the storage and visualization of statistics, but it needs to have them pushed to it, so there needs to be something in front of it.

Gathering and Parsing

The other component of the metrics infrastructure I came up with was something that would be an "endpoint" for all of our various processes to throw data at. This endpoint would combine and transform the data a bit, then funnel it all off to Graphite.

We could have gone the route of Coda Hale's Metrics library and had each individual process send data directly to Graphite. I didn't like this for a few reasons:

  • We're on Python, not on the JVM, so I'd need to rewrite Metrics in Python. On its own that isn't necessarily a dealbreaker, but it's still a pain.
  • When each host throws a stream of data at Graphite, you need to use Graphite's UI to aggregate the data. You can do this, but it's more flexible if you have an intermediary that can hack apart and splice together data before it gets to Graphite.
  • Going directly from host to Graphite also doesn't give you an opportunity to do things like fire off notifications as things happen.

So in the end we ended up looking at two tools for the role of stats-gatherer: Statsd and Riemann. Both have built-in support for forwarding to Graphite.

Statsd is by Etsy. The good parts:

  • It's been around (or at least open source) for longer.
  • Its client libraries seem more mature.
  • It's opinionated. There are a few types of statistics it's designed to work with and it makes working with those very easy.

The bad parts:

  • It's written in Javascript and NodeJS.
  • It's opinionated. While doing what it was designed for is easy, doing anything outside of the gauge/counter/timer box would be harder.

Riemann is by Kyle Kingsbury, who's now at Boundary. The good parts:

  • It's written in Clojure and runs on the JVM.
  • It's flexible. You write Clojure to configure streams, which means you can do just about anything you want.

The bad parts:

  • It's fairly new, at least open-source-wise.
  • It's flexible. You need to write some wrapper code if you want the same effect as Statsd's meters/etc.

We ended up choosing Riemann, mostly for its flexibility. We plan to move most of the stuff we currently use Nagios for over into Riemann/Graphite, including email alerts. Riemann will make this trivial, while Statsd would be a bit more work.

Also, I'd much rather work with Clojure than Javascript/NodeJS, but that's a post for another day.

Honestly though, statsd is still a good tool that you should definitely look at if you're in the market for this kind of thing. Just because we picked Riemann doesn't mean it's the best fit for you.

The Implementation

So after finally settling on Riemann and Graphite it was time to actually get things working.

By the time I finished getting everything in place, we ended up with several distinct pieces in our monitoring infrastructure.

I'll briefly cover each one in turn now.

The Metrics Repository

We use Vagrant for local development, and I knew I'd want to be able to test the metrics setup locally, so I decided to make a separate repository and Vagrantfile.

The repository is separate from our main codebase. Nothing too crazy here: it uses Puppet to install Graphite and Riemann onto a VM. I found the Graphite Puppet module on GitHub and wrote the Riemann one from scratch.

The most important and unusual part here is that the Vagrantfile instructs Vagrant to use host-only networking with a fixed IP. This lets our main development environment talk to the metrics VM while it's running by hitting that IP.

I could have done this with a multi-VM setup in Vagrant. I didn't because:

  • I wanted to keep everything in a separate repository.
  • I'm not as familiar with Vagrant's multi-VM setup as I am with its host-only networking.

In the future I think our newest coworker, Evan, is actually going to move us to the multi-VM setup. I don't have any strong feelings either way -- the important thing is that we have a way to locally run a metrics box that mimics the one in production.

Graphite

We're pretty much just using a standard install of Graphite.

Nothing fancy. No special config other than a basic retention schema (5 second resolution for 1 day, 1 minute resolution for 1 week, and 10 minute resolution for 5 years).
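
For reference, that retention policy is only a few lines in Graphite's storage-schemas.conf (assuming a Graphite version that accepts the time-unit shorthand; the section name and pattern are illustrative):

[everything]
pattern = .*
retentions = 5s:1d,1m:7d,10m:5y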

Some day we may set up a frontend to Graphite like Graphene or Tasseo, but for now we're just using it bare.

Riemann

Riemann is distributed as a single .jar file and a config file. You run it, it parses your config file and listens on a port. That's it.
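
Deploying it looks roughly like this (the exact jar and file names depend on the version you download):

java -jar riemann-standalone.jar riemann.config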

Java users tend to overengineer the hell out of things, but uberjars are absolutely wonderful for repeatable deployments. I would love for Python to have something like them (and no, virtualenvs do not even come close).

Contrast this to the install/run process of Graphite, which can be succinctly described as: "clusterfuck".

Our Riemann config file is where I wrote the wrapper code to take the generic "stream of events" that Riemann deals with and turn it into statsd-like meters, timers, etc. It's 89 lines of Clojure as of today.

It's pretty short and has some interesting stuff, so I'll go through the whole thing bit by bit now. If you're not familiar with Riemann, you should read Riemann's concepts and quickstart docs first, so you have some idea of how it works.

You could also just skim the rest of this section. It's up to you. Anyway, let's jump in:

(logging/init :file "/var/log/riemann.log")

(tcp-server)
(udp-server)

(def graph (graphite {:host "localhost"}))

First we tell Riemann to log errors to a file.

Then we start the TCP and UDP servers. We currently only use UDP, so we could technically get rid of the TCP server line, but it's not hurting anything, and if we want to use TCP for some more important stuff in the future we're ready for it.

Then we make a stream called graph that we can send events to; it forwards them along to Graphite. Graphite and Riemann run on the same box at the moment. It works pretty well, because Riemann is network/CPU heavy while Graphite is disk IO heavy. If we wanted to split them onto different boxes it would be a one-line change right here.

(defn add-environ-name
  "Add the environment name to the hostname, based on the presence of a tag."
  [{:keys [tags host] :as event}]
  (let [tags (set tags)]
    (cond
      (tags "production") (update-in event [:host] str ".production")
      (tags "staging")    (update-in event [:host] str ".staging")
      (tags "dev")        (update-in event [:host] str ".dev")
      :else               event)))

Next we make a function that takes a Riemann event and adds the name of the environment to its hostname, depending on what the event is tagged as. For example, it'll take an event with a host of mybox6 tagged staging and change the host to mybox6.staging.
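
Concretely, at a REPL (hostname made up):

(add-environ-name {:host "mybox6" :tags ["staging"] :service "foo"})
; => {:host "mybox6.staging" :tags ["staging"] :service "foo"}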

This may seem backwards at first, but when Riemann forwards stats to Graphite it reverses the order of the pieces of the host, so a host like db.foo.com will show up as com -> foo -> db.

This means that our hosts will be nicely grouped by environment at the top level of Graphite's metric tree (production -> mybox6, staging -> mybox6, and so on).


Next we have a similar but slightly different function:

(defn add-environ-name-combined
  [{:keys [tags host] :as event}]
  (let [tags (set tags)]
    (assoc event :host (cond
                         (tags "production") "all-hosts.production"
                         (tags "staging")    "all-hosts.staging"
                         (tags "dev")        "all-hosts.dev"
                         :else               "all-hosts"))))

This munges the host name into all-hosts.ENV. Why would we do this?

Think of two of the same event coming in from different hosts:

{:host "box1" :tags ["production"] :service "user-signed-up"}
{:host "box2" :tags ["production"] :service "user-signed-up"}

Once you pass these events through add-environ-name-combined you get:

{:host "all-hosts.production" :tags ["production"] :service "user-signed-up"}
{:host "all-hosts.production" :tags ["production"] :service "user-signed-up"}

This results in a "pseudo-host" named "all-hosts" that receives all events, which lets you aggregate all the data together for each environment. So between this function and the previous one, Graphite ends up with both aggregate entries (production -> all-hosts) and per-host entries (production -> mybox6) for every environment.


This makes it easy for us to see the statistics for our site as a whole, while also letting us drill down into individual servers. For every metric. And it only took about 16 lines of code. Really, really cool.

We're going to skip ahead to the end of the file real quick to see these in action. We'll talk about the rest soon, don't worry:

(let [,,,]
    (with {:host "riemann" :service "raw-events-processed" :metric 1.0}
          (fill-in-last 5 {:metric 0.0}
                        (rate 5 graph)))

    (adjust add-environ-name
            (by [:host :service]
                (metrics)))

    (adjust add-environ-name-combined
            (by [:host :service]
                (metrics))))

For every event that gets sent to Riemann we do three main things.

First, we have a simple "Riemann events processed" stream that tracks the rate of events Riemann is processing and sends it along to Graphite.

This is purely for us to keep an eye on Riemann and see how it's doing. I mostly added it out of curiosity. We're currently pumping in about a thousand events a second, with peaks of closer to two thousand during bursts of traffic. Riemann has no trouble keeping up with it at all.

The next two streams use the two functions we talked about before to split the events into streams based on their host (which now includes the environment thanks to the functions) and service.

For example, consider the stream of the following events:

{:host "box1" :tagged ["production"] :service "foo" ,,,}
{:host "box2" :tagged ["production"] :service "foo" ,,,}
{:host "box1" :tagged ["production"] :service "foo" ,,,}
{:host "box1" :tagged ["staging"]    :service "foo" ,,,}
{:host "box1" :tagged ["staging"]    :service "foo" ,,,}
{:host "box2" :tagged ["production"] :service "bar" ,,,}
{:host "box2" :tagged ["production"] :service "bar" ,,,}
{:host "box3" :tagged ["production"] :service "bar" ,,,}

One of our two special streams munges the hostnames of the events into this:

{:host "all-hosts.production" :tagged ["production"] :service "foo" ,,,}
{:host "all-hosts.production" :tagged ["production"] :service "foo" ,,,}
{:host "all-hosts.production" :tagged ["production"] :service "foo" ,,,}
{:host "all-hosts.staging"    :tagged ["staging"]    :service "foo" ,,,}
{:host "all-hosts.staging"    :tagged ["staging"]    :service "foo" ,,,}
{:host "all-hosts.production" :tagged ["production"] :service "bar" ,,,}
{:host "all-hosts.production" :tagged ["production"] :service "bar" ,,,}
{:host "all-hosts.production" :tagged ["production"] :service "bar" ,,,}

Then we use (by) to split the stream by host and service, resulting in three new streams:

; The all-hosts.production/foo stream
{:host "all-hosts.production" :tags ["production"] :service "foo" ,,,}
{:host "all-hosts.production" :tags ["production"] :service "foo" ,,,}
{:host "all-hosts.production" :tags ["production"] :service "foo" ,,,}

; The all-hosts.staging/foo stream
{:host "all-hosts.staging"    :tags ["staging"]    :service "foo" ,,,}
{:host "all-hosts.staging"    :tags ["staging"]    :service "foo" ,,,}

; The all-hosts.production/bar stream
{:host "all-hosts.production" :tags ["production"] :service "bar" ,,,}
{:host "all-hosts.production" :tags ["production"] :service "bar" ,,,}
{:host "all-hosts.production" :tags ["production"] :service "bar" ,,,}

The other function does something similar, but doesn't kill the original hostname, so we wind up with five resulting streams instead of three:

; The box1.production/foo stream
{:host "box1.production" :tags ["production"] :service "foo" ,,,}
{:host "box1.production" :tags ["production"] :service "foo" ,,,}

; The box2.production/foo stream
{:host "box2.production" :tags ["production"] :service "foo" ,,,}

; The box1.staging/foo stream
{:host "box1.staging"    :tags ["staging"]    :service "foo" ,,,}
{:host "box1.staging"    :tags ["staging"]    :service "foo" ,,,}

; The box2.production/bar stream
{:host "box2.production" :tags ["production"] :service "bar" ,,,}
{:host "box2.production" :tags ["production"] :service "bar" ,,,}

; The box3.production/bar stream
{:host "box3.production" :tags ["production"] :service "bar" ,,,}

All eight of these streams get passed along to (metrics), so let's look at the final bits of this config file and see how they work. We'll go from the bottom to the top:

; Shortcut to send a stream along to all metrics.

metrics #(default {}
                  (gauges) (occurrences) (meters) (histograms) (timers))

So (metrics) just sends along every event to five different underlying streams. Pretty simple. The (default {} ,,,) is just there to split the stream because Riemann doesn't seem to have something like (pass-through-to stream1 stream2 ,,,).

; Gauges are special -- they're super simple.

gauges #(where (tagged "gauge") graph)

Any events tagged as "gauge" just get their numbers piped directly into Graphite.

A "gauge" in our lingo is a single number that represents an instantaneous value, like "number of users in the database".

This is simple, but shows an important concept: events come in tagged as the kind of metric they represent. We'll see why this is important later in the post.
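
For example, a gauge event arriving from one of our app servers might look like this (names and numbers made up):

{:host "box1" :service "users-in-database" :tags ["gauge" "production"] :metric 241017}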

; Streams that take events, look at their tags, and send them along to the
; appropriate metricizing stream(s).

occurrences #(where (tagged "occurrence") (occurrenceify))

meters #(where (tagged "meter") (meterify))

histograms #(where (tagged "histogram") (histogramify))

timers #(where (tagged "timer") (histogramify))

The other types of metric are also filtered by tag.

Notice how histograms and timers do the same thing? That's an artifact of how I was originally treating them. In the end I decided they should do the same thing, so they could really be collapsed into a single tag.

All of these functions merely filter events -- the real meat of the parsing is done in helper functions:

occurrenceify #(adjust [:service str " occurences"]
                       (with {:metric 1.0} graph))

Occurrences are things that don't really have a "value" but are nice to have in Graphite. For example: deployments. When a deployment occurs we can send an Occurrence event to Riemann, which will forward it along to Graphite with the value 1.0.

In Graphite we can view that metric and tell it to graph non-zero values as infinity, so we'll get a nice vertical line where deployments occurred. Then we can overlay some other graphs on that and look for correlation.
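
Graphite's drawAsInfinite() render function is what does that trick, so the graph target ends up looking something like this (the metric path is illustrative):

&target=drawAsInfinite(production.all-hosts.deploys.occurences)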

meterify #(adjust [:service str " rate-per-second"]
                  (default {:metric 1.0}
                           (fill-in-last 5 {:metric 0.0}
                                         (rate 5 graph))))

Meters are metrics that track the rate of events that happen per second. For example, "page views" or "SELECT queries". There's no "value" for these events, the "value" comes from how often they're seen.

This function uses a bunch of built-in Riemann functions to turn a sequence of events into a stream of rates that gets sent to Graphite once every five seconds. Read the Riemann docs if you're not sure how it works.
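
To make that concrete: if twelve page-view events arrive during one 5-second window, (rate 5 ,,,) emits a single event with a :metric of 2.4 (twelve events divided by five seconds), which graph forwards to Graphite. The (fill-in-last ,,,) wrapper makes sure a quiet window shows up as 0.0 rather than as a gap in the data.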

histogramify #(adjust [:service str " value"]
                      (percentiles 5 [0.5 0.75 0.95 0.99 1.0]
                                   graph))

Finally we come to histograms. Histograms track events that actually have values, like "how many milliseconds it takes to render a widget".

The percentiles function is the workhorse here. It computes the 50th, 75th, 95th, 99th, and 100th percentiles for those values during 5-second periods (i.e.: "99% of the values are less than or equal to 245 during the 5-second period of 17:44:45 to 17:44:50") and sends those along to Graphite.
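
Unless I'm misreading the docs, percentiles emits new events with the quantile appended to the service name, so a histogram called widget-render-time produces Graphite series along these lines:

widget-render-time value 0.5
widget-render-time value 0.75
widget-render-time value 0.95
widget-render-time value 0.99
widget-render-time value 1.0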

Check out the Riemann docs to learn more.

The Python Wrapper

The final piece of the puzzle is actually sending data to Riemann from our application.

Amara is a Python/Django app, so the first step was to find a Python client. We ended up going with Bernhard. After I helped fix a few bugs in it, it turned out to be simple and solid.

I decided to write a little wrapper to make sending metrics to Riemann even easier. The wrapper is short, so let's go through it too.

import socket
import time as _time
from contextlib import contextmanager
from functools import wraps

from django.conf import settings

try:
    from bernhard import Client, UDPTransport
except ImportError:
    # Just use a dummy client if we don't have a Riemann client installed.
    class Client(object):
        def __init__(self, *args, **kwargs):
            pass

        def send(self, *args, **kwargs):
            pass

    UDPTransport = None

# Ugly hack to check if we're running the test suite.  If so we shouldn't report
# metrics.
from django.core import mail

if hasattr(mail, 'outbox'):
    RUNNING_TESTS = True
else:
    RUNNING_TESTS = False

HOST = socket.gethostname()
ENABLED = (not RUNNING_TESTS) and getattr(settings, 'ENABLE_METRICS', False)
RIEMANN_HOST = getattr(settings, 'RIEMANN_HOST', '')

c = Client(RIEMANN_HOST, transport=UDPTransport)

First we have a whole bunch of imports and initialization crap. If you have any questions let me know, but it should be pretty straightforward.

def find_environment_tag():
    env = getattr(settings, 'INSTALLATION', None)

    if env == getattr(settings, 'DEV', -1):
        return 'dev'
    if env == getattr(settings, 'STAGING', -1):
        return 'staging'
    if env == getattr(settings, 'PRODUCTION', -1):
        return 'production'
    return 'unknown'

ENV_TAG = find_environment_tag()

Next we do a bit of ugly Django settings stuff to find out what environment we're in. Remember, we need to tag events with the environment so they get organized correctly in Graphite.

def send(service, tag, metric=None):
    data = {'host': HOST, 'service': service, 'tags': [tag, ENV_TAG]}

    if metric is not None:
        data['metric'] = metric

    if ENABLED:
        try:
            c.send(data)
        except Exception:
            pass
Here's a little helper function that actually sends events to Riemann using Bernhard.

Note the try: ... except: pass that helps fulfill the prime directive of this metrics system: "First: do no harm". If a metric couldn't be sent for some reason that's no reason to show the user a 500. Just shut up and carry on with life.

class Metric(object):
    def __init__(self, name):
        self.name = name

As much as I like functional programming, Python is primarily an object oriented language, and when in Rome...

class Occurrence(Metric):
    def mark(self):
        send(self.name, 'occurrence')

class Meter(Metric):
    def inc(self, n=1):
        send(self.name, 'meter', n)

class Histogram(Metric):
    def record(self, value):
        send(self.name, 'histogram', value)

class Gauge(Metric):
    def __init__(self, name):
        super(Gauge, self).__init__('gauges.' + name)

    def report(self, value):
        send(self.name, 'gauge', value)

Each type of metric has its own class. Each one derives from Metric and each has a reporting function. Using them looks something like this (the metric names here are just examples):
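
Occurrence('deploy').mark()
Meter('user-signed-up').inc()
Histogram('search-results-returned').record(25)
Gauge('users-in-database').report(123456)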

Finally, Timers are defined as a context manager:

@contextmanager
def Timer(name):
    start = _time.time()

    # We fire a Meter for the metric here, because otherwise the "events/sec"
    # are recorded when they *end* instead of when they begin.  For longer
    # events this is a bad thing.
    Meter(name).inc()

    try:
        yield
    finally:
        ms = (_time.time() - start) * 1000
        send(name, 'timer', ms)

This lets you record the time taken to perform an action like this:

with Timer('search-time'):
    resp = do_the_search()
    results = parse_the_results()

There's some more code at the end of that file. If you're hungry for more feel free to read through it and figure out what it does.

You might also want to think through the comment in the Timer context manager and understand why it's there. I'll give you a hint: this is the artifact that resulted in the merge of Histograms and Timers I mentioned in the Riemann section.

So now we've got a nice little metrics wrapper. How do we use it? Here's a hypothetical example of instrumenting a signup view (the module path and names are made up, but the pattern is the same):
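
# A hypothetical instrumented view -- module path and names are made up.
from utils.metrics import Meter, Timer

signup_meter = Meter('user-signed-up')

def signup_view(request):
    # Time the whole view, and count successful signups.
    with Timer('signup-view-time'):
        response = do_the_actual_signup(request)  # stand-in for real logic

    if response.status_code == 302:
        signup_meter.inc()

    return response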

In a future post I'll talk about how we instrumented a few key places in the Django stack itself.

The important thing to notice is that sending a new metric is as easy as adding a single line of code in the application.

Most importantly: you don't need to touch the metrics server to start recording a new piece of information! By simply sending a tagged event, the new data will be sorted, sliced, and stored correctly, and will be immediately available in Graphite.

It's hard to overstate how handy this is when things are on fire and you're trying to understand what's happening.

The Results

It works. And it's really, really nice.

Instead of just talking about it, I'll show you some graphs of stats we've collected.

(I apologize for not labeling the X and Y axes, but I don't want to give away too many numbers in a public post. I promise I'll pick graphs that are clear even without the labels.)

[Graphite graph: requests per second, all servers combined]

Here we have a graph of requests per second being served, all servers combined.

Hi, Mozilla! Those huge spikes are when Firefox 13 was released. The "What's New" video that appeared when you opened the updated browser used our widget to serve subtitles.

That's interesting, but let's drill down into individual servers (and zoom in a bit):

[Graphite graph: requests per second, broken down by server]

You'll want to view the full-size version of this graph; it's pretty hard to read otherwise.

There's a lot of interesting stuff in this (still very basic) graph. First, you can see that we had four servers handling the main spike of users around noon: the green, red, dark blue, and pink lines. You'll have to look very closely to see them unless you view the full-size graph.

We quickly realized this wasn't going to cut it, so we added another two: the yellow and brown lines. You can see them ramp up and start handling requests, and you can see the pink and blue lines drop down as they shoulder some of the burden.

However, the green and red lines aren't relieved. All of our application servers are the same, so what gives?

The answer is that we use Amazon's Elastic Load Balancing to spread the load across our app servers. The original four servers (green, red, blue, and pink) are in four different availability zones. ELB splits incoming requests evenly across the zones.

When we added the yellow and brown servers, we put one in the same AZ as the pink server and the other in the blue's AZ. This meant that half of the requests the ELB was sending to the first AZ went to yellow and half to pink. Same for the blue/brown AZ.

However, the number of requests being sent to the other AZs was still the same, and they still each only had one server, so the servers still had the same amount of work.

We continued like this for about a day, until the morning of 6/6. At that point the red and green servers were starting to get seriously stressed, so we spun up two more: the blue and white lines, one in each remaining AZ. They immediately started handling some of the load, and all of the servers were now roughly serving the same number of requests.

There are lots of other examples of unintuitive stuff we've discovered by looking at our shiny new metrics. In some future blog posts I may show more examples, because they're fun to reason about and read through.

Anyway, I hope you found the first post on our new blog interesting. If you have any questions you can let me know on Twitter (I'm @stevelosh).