mostly typed

What's taking so long?!

A recent Stack Overflow question and a related MemCachier support ticket led me to some useful analysis of why memcache requests take so long for large values.

As background, we'd been hearing from a few MemCachier customers that cache requests were often taking as long as 200ms to complete. We expect requests to be on the order of 1-5ms (dominated by EC2 network latency), so 200ms is two orders of magnitude slower than expected. The one thing we had been able to determine was that this only happens for large values.

Mostly not the network

We know this is only happening with larger values, so it's tempting to assume requests are taking so long because network bandwidth starts to dominate latency. While this is true to some degree, even at the 1MB size limit, 200ms would imply we're only getting about 40Mbps between MemCachier (which sits on EC2 servers in their US-East region) and Heroku (which also sits on EC2 servers in US-East). We've observed pretty slow network throughput at times, but nothing consistently that bad. Let's look at a quick benchmark:

val = "a" * (1024 * 1000)
# the value needs to be slightly smaller than 1MB
# to leave room for the key and metadata
start = Time.now
1000.times { c.set("h", val, 0,  :raw => true) }
puts Time.now - start

gives us ~27ms per set request on a 1MB value. That's about 37 MB (megabytes) per second, or 296 Mb (megabits) per second. Not great, but network bandwidth clearly doesn't explain the 200ms average request times.

Client-side operations

It's important to note that customers have been getting their figures from services like New Relic, which measure time spent in the memcache library rather than just time spent on the network. Therefore, we should also consider what operations the memcache client performs before and after sending the request.

The Dalli library for Ruby -- the one most commonly used by MemCachier customers -- marshals the value to a string using Ruby's built-in Marshal library, and optionally compresses values if they are larger than 1KB. Both of these operations can be pretty expensive. In the particular case of the Stack Overflow question, we learned out of band that the value being saved was an Array of user ids, so I decided to try to identify the cost of serializing the memcache request on the client side, before hitting the network.
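
To make that concrete, here's a simplified sketch of that client-side work (the real logic lives inside Dalli's serialization and compression code; prepare_for_cache and the 1KB threshold constant here are just for illustration):

require 'zlib'

# Simplified sketch of what happens to a value on set before it
# touches the network; prepare_for_cache is a made-up helper name.
def prepare_for_cache(value, compression_threshold = 1024)
  serialized = Marshal.dump(value)  # cost grows with value complexity
  if serialized.bytesize > compression_threshold
    serialized = Zlib::Deflate.deflate(serialized)  # cost grows with value size
  end
  serialized
end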

I used a large but reasonably common value size as a target to measure: 200KB. All benchmarks were run on a standard Heroku dyno, which runs on an EC2 machine with 34GB of memory and four 2.4GHz cores (although all the benchmarks are single-threaded), so probably an m2.2xlarge EC2 instance.

To look at marshaling a list, we'll use the list of integers between 1 and 50,000, which marshals (Marshal.dump((1..50000).to_a)) into a 200KB string. How long does marshaling take?

require 'benchmark'

val = (1..50000).to_a
Benchmark.measure { 1000.times { Marshal.dump(val) } }

We get about 10ms per marshal:

9.850000   0.000000   9.850000 ( 10.035253)

If we're storing a list that's much longer, or a list of Strings instead of Integers, the marshaled string is going to be too large to fit in memcache, so Dalli will have to compress the value before setting it. This is where most of the overhead seems to come from. This time we'll use a list of integers between 1 and 140,000. This marshals to a string of size 634KB, which then compresses to about 200KB.

val = (1..140000).to_a
Benchmark.measure { 1000.times { Marshal.dump(val) } }

This time marshaling takes about 32ms (because the list is longer):

29.790000   0.030000  29.820000 ( 32.597199)

But compressing...

require 'zlib'

str = Marshal.dump(val)
Benchmark.measure do
  1000.times { Zlib::Deflate.deflate(str) }
end
>> 87.410000   0.220000  87.630000 ( 99.602291)

takes about 100ms, or together...

Benchmark.measure do
  1000.times do
    Zlib::Deflate.deflate(Marshal.dump(val), 6)
  end
end
>> 117.40000   0.20000  117.60000 ( 148.56202)

about 149ms.

There are of course some caveats here. Marshaling speed will depend on how complex the value is -- a Hash will be slower to marshal than a String. Compression time will depend on the size of the uncompressed input and how redundant it is -- the same ~200KB compressed string might come from a 200KB random string that compresses quickly (about 10ms in my benchmarks) or from a 200MB string of all "a"s that takes seconds to compress. So your mileage may vary.
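
A quick sanity check of that last point (sizes chosen to mirror the claim above; this is a sketch, not a rigorous benchmark) -- both inputs deflate to roughly 200KB, but one has a thousand times more data to chew through:

require 'benchmark'
require 'securerandom'
require 'zlib'

random_str    = SecureRandom.random_bytes(200 * 1024)  # ~200KB, barely compressible
redundant_str = "a" * (200 * 1024 * 1000)              # ~200MB, compresses ~1000x

puts Benchmark.measure { Zlib::Deflate.deflate(random_str) }
puts Benchmark.measure { Zlib::Deflate.deflate(redundant_str) }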

So what do we do?

There are several things you might be able to do to get some speedup:

1. Faster Compression

The most expensive operation was compression, so we're going to see the biggest gains there. If you can, avoid compression entirely. If your values are larger than 1MB, though, you may have no choice. Even then, we can tune the compression level. Zlib has compression levels 1-9, with 1 being the fastest and 9 producing the smallest result; if unspecified, it defaults to level 6. In our (somewhat contrived) example, we can actually get nearly the same compression using level 1 as with the default level 6, but with about a 2x speedup:

val = (1..140000).to_a
# default compression
Benchmark.measure do
  1000.times do
    Zlib::Deflate.deflate(Marshal.dump(val), 6)
  end
end
>> 117.40000   0.20000  117.60000 ( 148.56202)
# level 1 compression
Benchmark.measure do
  1000.times do
    Zlib::Deflate.deflate(Marshal.dump(val), 1)
  end
end
>> 47.20000   0.000000   47.20000 (  62.02902)

62ms instead of 149ms. Already a lot better, and both compressed strings are about 200KB (within 300 bytes of each other).
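
If your version of Dalli supports the :compressor option (an object responding to compress and decompress), you can get this speedup without bypassing Dalli -- a sketch, assuming that option is available:

require 'dalli'
require 'zlib'

# Trades a little compression ratio for speed by using zlib level 1.
class FastCompressor
  def self.compress(data)
    Zlib::Deflate.deflate(data, Zlib::BEST_SPEED)
  end

  def self.decompress(data)
    Zlib::Inflate.inflate(data)
  end
end

# Server address is a placeholder.
c = Dalli::Client.new('localhost:11211', :compressor => FastCompressor)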

2. Faster marshaling

Ruby's built-in marshaling is designed to be general purpose, and is therefore complex. Complexity usually hurts performance (there's simply more to do). It turns out we can get a pretty big speedup in our example using a simple, custom marshaling scheme. In particular, if our list contains only 32-bit integers, we can serialize it to a string where every four bytes represent the next element. Ruby's Array#pack lets us do exactly that:

val = (1..140000).to_a
Benchmark.measure { 1000.times { Marshal.dump(val) } }
>> 29.400000   0.080000  29.480000 ( 39.970102)
Benchmark.measure { 1000.times { val.pack("L*") } }
>> 6.230000   0.000000   6.230000 (  8.382206)

We went from 39ms to 8ms, saving another 30ms. Putting this together with faster compression:

Benchmark.measure do
  1000.times do
     Zlib::Deflate.deflate(val.pack("L*"), 1)
  end
end
>> 22.850000   0.490000  23.340000 ( 26.967620)

We're now down to 27ms, which is a hell of an improvement over our original 149ms.
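
Putting it to work: a pair of hypothetical helpers for a list of 32-bit integer ids, used with :raw => true so Dalli skips its own marshal/compress pass (the "user_ids" key is made up; note that pack("L*") uses native byte order, which is fine as long as the same architecture writes and reads the cache):

require 'zlib'

def encode_ids(ids)
  Zlib::Deflate.deflate(ids.pack("L*"), Zlib::BEST_SPEED)
end

def decode_ids(str)
  Zlib::Inflate.inflate(str).unpack("L*")
end

c.set("user_ids", encode_ids(val), 0, :raw => true)
ids = decode_ids(c.get("user_ids", :raw => true))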

3. Amortize the cost

Odds are you are re-using the cached data structure somewhere else in the same request.

Are you parsing it from a JSON document? Leave it as JSON instead of parsing and re-marshaling it.

Are you grabbing it from a database query? With recent versions of PostgreSQL, you can craft a query that directly returns a JSON string representation of the result. For example, to get a list of ids from a table:

select array_to_json(
  array(
    select id from mytable where foo = true
  ))::text;

You should be able to stick this string directly into the cache (perhaps compressing it first).
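
For example, with the pg gem (the connection parameters and cache key below are placeholders; the table and column names come from the query above):

require 'pg'

conn = PG.connect(:dbname => 'mydb')
json = conn.exec(
  "select array_to_json(array(select id from mytable where foo = true))::text"
).getvalue(0, 0)

# Store the JSON string as-is; no marshaling needed.
c.set("mytable:ids", json, 0, :raw => true)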
