Testing in Production with GoReplay

This isn’t a full write-up yet; it’s more of a list of things to think about. Next time I do this, I’ll revisit it and write it up more fully.

There are lots of reasons you might want to use traffic replication, but in my case it’s because I was doing a framework upgrade to a large application with poor test coverage. The workflow here is to set up a pre-production instance of the app with a copy of the live database and several modifications to the server.

Here are some considerations/things you might have to do:

  1. Periodically sync the database: the theory is that if the traffic is replicated then both databases will be in the same state. In the real world this won’t be the case, and since having the same data in both places is key to making this work, you should plan for having to sync the database.

  2. Prevent emails being sent: if your application sends emails, the instance that traffic is being replicated to should be configured to not send emails, or else the host should be configured to delete them/send them to a hard-coded recipient.

  3. Prevent outbound requests: if your application makes outbound requests to other apps (such as APIs), and you don’t want these requests being made twice (once from the production instance and once from the replicated instance), you should prevent the replicated instance from making them.

    If your app has a way of turning them off that you trust, great. If your app can’t do that, or if you don’t trust it, outbound firewalls are a good solution. (In a pinch you could even add dummy entries to /etc/hosts.)

  4. Error collection: one of the purposes of this is to understand how your application behaves, so telemetry and error collection are vital. Make sure you have a plan for this.

  5. Credentials: your new instance will probably be configured differently to a typical pre-production/staging instance. Specifically, it’ll need production credentials. In my case this meant Rails’ secret_key_base was the same, so the replicated instance could decrypt secure session cookies.

  6. TLS: GoReplay (the tool that does the actual replication) needs to see the requests in plaintext. This means that somewhere on your production server the traffic will have to be in plaintext.

    If your app is listening on a local port (and, e.g., nginx terminates TLS) then you have no extra work to do. My app was listening on a unix socket, which GoReplay is not compatible with; I had to proxy it twice within nginx to get around this. (You could possibly use socat or something clever to allow GoReplay to interact with the socket instead.)
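
As a rough illustration of the /etc/hosts fallback from item 3, something like the following could generate the dummy entries. It's a sketch under assumptions: the hostnames are made up, and 192.0.2.1 is an address from the reserved documentation range (RFC 5737) that is assumed not to be routed anywhere on your network, so connections should fail or time out rather than reach the real service.

```shell
# Sketch only: print /etc/hosts lines that point each given hostname at a
# sink address, so outbound requests from the replica never reach the real
# service. SINK_ADDR and the hostnames below are illustrative assumptions.
SINK_ADDR="${SINK_ADDR:-192.0.2.1}"

blackhole_entries() {
  for host in "$@"; do
    printf '%s\t%s\n' "$SINK_ADDR" "$host"
  done
}

# Review the output first, then append it on the replica (not production!):
#   blackhole_entries api.example.com hooks.example.com | sudo tee -a /etc/hosts
```

Bear in mind that requests to a sink address may hang until they time out, which can slow the replica down; an outbound firewall rule that rejects the connection outright avoids that.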

The Setup

We use GoReplay in one of its most basic modes: listen to traffic arriving on a given port[1] and send a copy of each request somewhere else. The original traffic is unimpeded: the request, and the response from the “real” destination, are handled as normal.

It looks something like this:

     +----------------------------------+    +------------------+
     |                                  |    |                  |
     | +-------+        +-----+         |    | +-----+          |
------>+ nginx +---+--->+ app |         |    | | app |          |
     | +-------+   |    +-----+         |    | +--+--+          |
     |             |                    |    |    ^             |
     |             |    +----------+    |    | +--+----+        |
     |             +--->+ goreplay +---------->+ nginx |        |
     |                  +----------+    |    | +-------+        |
     | production server                |    | other server     |
     +----------------------------------+    +------------------+

The copied version of the request is sent to the app on the second server, and the response to that request is ignored.

I did this by running GoReplay on the production server as follows:

gor -input-raw ':80' -output-http 'https://the-other-server'

Because this captures traffic from the network interface, the command needs to be run as root. It’ll also need to keep running in the background for as long as you want to replicate traffic. I used screen because I’m lazy, but writing a quick systemd service would’ve worked well too.
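
For reference, the systemd option might look roughly like this. It's a sketch, not what was actually run here (that was screen): the unit name gor-replay.service and the binary path /usr/local/bin/gor are assumptions, and the gor arguments are just the ones from the command above.

```shell
# Sketch only: print a minimal systemd unit that keeps gor running.
# The default binary path and the unit name used below are assumptions.
print_gor_unit() {
  gor_bin="${1:-/usr/local/bin/gor}"
  cat <<EOF
[Unit]
Description=GoReplay traffic replication
After=network-online.target
Wants=network-online.target

[Service]
# Capturing from the network interface needs root, as noted above.
User=root
ExecStart=${gor_bin} -input-raw :80 -output-http https://the-other-server
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# To install (as a user with sudo rights):
#   print_gor_unit | sudo tee /etc/systemd/system/gor-replay.service
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now gor-replay.service
```

A unit like this has the advantage over screen that replication restarts after a crash or reboot, which may or may not be what you want for a short-lived experiment.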

GoReplay has a whole series of advanced options which you might want to take advantage of, including limiting which requests are sent, storing traffic for replaying later[2], scripting, and sending only a sample of requests. Check out the docs for more examples and full usage.
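
To give a flavour of those options, here are a few variations on the basic command, wrapped in shell functions so that nothing runs until you call one. Treat them as sketches: the flags are as I understand them from the GoReplay docs and may differ between versions, and the /api filter, the 10% sample, the 200% replay speed and the requests.gor file name are all illustrative choices.

```shell
# Same placeholder destination as in the command above.
TARGET='https://the-other-server'

# Only forward requests whose URL matches a pattern (here: anything under /api).
replicate_api_only() {
  gor -input-raw ':80' -http-allow-url '/api' -output-http "$TARGET"
}

# Forward only a sample of requests (roughly 10%).
replicate_sample() {
  gor -input-raw ':80' -output-http "$TARGET|10%"
}

# Record traffic to a file for later. Depending on the version, gor may split
# the recording into indexed chunk files, so check the actual file names.
record_traffic() {
  gor -input-raw ':80' -output-file 'requests.gor'
}

# Replay a recording at twice the speed it was captured at.
replay_fast() {
  gor -input-file 'requests.gor|200%' -output-http "$TARGET"
}
```

The capturing variants need root for the same reason as the basic command; replaying from a file does not capture anything, so it should not.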

As I said at the beginning, next time I use GoReplay I’ll come back and update this with any further thoughts and, hopefully, some more detail. And there definitely will be a next time; this was a great technique that gave us much more confidence in the upgrade, even though the app is large and has poor test coverage.

  1. I believe it just uses libpcap internally.

  2. Useful for load-testing, for instance: capture a day’s worth of traffic and replay it as quickly as possible.