Wednesday, June 2, 2010

How I sped up my server by a factor of 6

(with one linux command)

Subtitle: IO-bound and CPU-bound applications are common - here's maybe a "memory contention bound" app

I've written a lot of servers. If you've read the architecture of mailinator or benchmarking talkinator or "blocking faster than non-blocking" you probably already have that idea.

I was working on a new server infrastructure recently I needed for a new product. The server has a novel internal architecture to me that seeks to never voluntarily block threads or cause context switches - but at the same time is highly multi-threaded and creates new ones whenever it needs to.

The server's nature isn't totally important but you can think of it along the lines of a twitter server. It gets messages from one place, collates them, and sends them to (potentially many) other places. I suppose it could be used directly as a "twitter server" but of course, they already have those and my start up focus is rather orthogonal to that.

I usually write servers in Java on linux so keep that in mind for the ideas raised here. At Google, I worked on servers in C++ and Java. And especially after that experience, I'll stay with Java. To hopefully avoid language wars, the #1 reason I write in Java is that I'm way better at Java than I am at C++ or Ruby or Python or whatever. In other words, it was a practical (and definitely reasonable) choice in this case.

Also, the benchmarking here isn't contrived - I'll save the details, but I truly am asking the server to do precisely what it would do in the wild except at a higher rate.

Now, given the nature of the server, I needed 2 things to help benchmark. A "producer" (a system that produces messages) and a "consumer" (a system that consumes them). The producer produces a sequenced message set, and the consumer verifies it receives every message intact *and* in-order.

I chose message sizes at somewhat less than twitter size (msg size = 100 bytes) to induce a reasonable amount of activity in the server (server purpose is not necessarily large messages and large messages tend to just start saturating bandwidth without causing contention or server cpu activity). My custom protocols add a few bytes and TCP has a 40byte header - so overall I'm guessing that I'm using an average 200 byte messages counting everything.

Producer  ->  Server  ->  Consumer

The server system is pretty flexible and in effect, location transient. That is, a server process might live on one machine today and find itself on another machine tomorrow. Or a few new server processes may join the mesh. Needless to say, the idea is to create a scalable, flexible system.

One ramification of that is that it's possible and even very likely that some producers, servers, and consumers could be on the exact same machine. Socket communication over localhost changes a lot of assumptions. Firstly, TCP fusion can occur to reduce overhead significantly. Secondly, there is not an effective bandwidth limitation - there isn't a real network involved, it's a virtual (and fast) one.

Running all on the same server, I would expect this benchmark to be CPU-bound given that I/O is now virtual and effectively a CPU operation. Like I said, loopback is a real scenario I need but I initially think I benchmarked on a single machine mostly as a matter of convenience.

So, with all 3 processes running on the same machine, I ran the test. The CPU was a Intel Core I7 920. It has 4 hyperthreaded cores that act like 8. (note: I've tested all of this on a Core2Quad (non-I7) cpu and CoreDuos and results are effectively the same)

Here's the image of htop during the test with all 3 processes running. (if you run linux and use top, upgrading to htop is highly recommended).

Don't get hung up on reading the numbers or text. The graph in the upper left gives you a sense of how busy the CPUs are. In this picture, they're all at least a little busy. That's no surprise given we're running many tens of threads.

Notice that none of the cores are "maxed". This benchmarked showed the server to receive and re-send about 120,000 messages per second (that's 120k from producer to server and the same 120k from server to consumer - so 240k "messages transmissions" but only 120k messages - this would be analogous to queries-per-second for a webserver).

Why aren't the active cores maxed?

It occurred to me that I was running 3 CPU-bound processes on the same machine and that the processes might be stepping on each other's toes. It's possible that if the server is running on core 4 one second and the producer is running there the next, the level 1 cache of that core could be ruined for the server the next time it ambles over there.

The simple solution was to assign cores to the processes. In linux, you do this with command: taskset

Taskset is followed by a bitmask value to assign CPUs - so I ran (in separate xterms):

taskset 0x3F java manyfeedServer
taskset 0x40 java theProducer
taskset 0x80 java theConsumer

and off they went. The server is highly multithreaded so I gave it 6 cores. The producer and consumer each got one.

The result? 270,000 messages/second! Wow. If my cache assumption was right (and I'm not claiming to know if it was) - it REALLY worked. One way or another though - something worked - we got better than a 2x speedup.

So you might be thinking the moral of this story is:

1) If you're in linux, install HTOP
2) If you're sharing a computer amongst CPU-bound processes, isolating the processes might (and very well "might not") be beneficial.

And ok, those are fine ideas. But what bugged me was that htop showed me that no CPUs were maxed yet. Again, what was slowing my application down ahead of CPU power?

I then tried limiting the server to 2 (hyperthreaded) cores. (I also tried keeping the producer and consumer on the same hyperthreaded "core" and given that I had cores to spare, also tried separating them, but the result was the same).

taskset 0x03 java manyfeedServer

Now, we get 530,000 messages/second. Nice. Reducing the cores from 6 to 2 nearly doubles our msgs/sec again. Here's the htop now:

You can see that cores 1 & 2 are plenty busy. Cores 3,4,5,6 are idle as expected. Core 7 (the producer) is pretty busy and so is Core 8 (the consumer).

530k msgs/second is nothing to sneeze at but.. um.. again, no core is maxed. Why - not? What's the bottleneck?

Obviously.. the last test is to throw the whole server on ONE cpu. Apart from the fact that I very purposely and meticulously coded this server be highly multi-threaded, fewer CPUs seem to make it happier. I am a conservative thread-safety-inducer. That is, I'm only really dangerous with firearms and synchronized blocks. But I'm by no means afraid to use the latter when needed.

taskset 0x01 java manyfeedServer

And finally, we hit 100% utilization on CPU 1 at 790,000 messages per second.

Here's the TL;DR of this blog post:

Some multi-threaded java applications apparently run faster on 1 core than on multiple cores.

Note the very non-committal phrasing in an attempt to make this a rather defensible statement. A graph showing the ManyFeed server's performance per number of cores:

So.. if you are a CPU or JVM guru and want to tell me your thoughts on what's going on, I'd love to hear it.

A long time ago, I benchmarked a bunch of different CPU types on doing contended memory stores. To my surprise, a little single-core Dothan processor did amazing in a few tests and I had no idea why. The discussion above is really the same idea. (I've retested that benchmark by the way on the Core I7 - and it has very different characteristics - probably a testament to the I7's new memory architecture).

Note that if you have a pre-I7 Intel multi-core cpu, you can reproduce my "1 core runs faster" results here using the code in that blog post (note: this code *is* a micro-benchmark, built to create a situation that causes extreme and unrealistic threading competition - for more details read that blog post). This gives a clearer picture of what's going on. My theory is that with 2 cores and 2 threads - each gets a core and every store operation competes for the memory bus in order to do its operation. Threads spend lots of time "competing" and less time doing real work. On one core (and 2 threads) - only one thread runs at a time - so every time it tries to store in memory, it can.

The Core-I7 is different (and maybe AMD cpus too) - it doesn't suffer from this memory contention problem at all. (Although - given that my server still runs (way way) better on 1 cpu even on I7, then maybe the contention is elsewhere in my server).

Here's that benchmark running for Static Volatile Store on a Core Duo CPU with 2 cores and again with 1 core. No performance loss on just one core, otherwise same exact run.

(numbers along the bottom are the number of threads in that run - it's not until we have 2 threads that contention creeps in and hurts us).

Oh.. and how does my little nifty manyfeed server do on a real network? Testing on 1 CPU on my local 1Gbps LAN I get about 300,000 messages per second.

With some quick (and hopefully not silly) math, this looks about right.

1Gbps = 1,000,000,000 bits/sec = 125,000,000 bytes/sec

125,000,000 bytes/sec divided by 200 bytes/message = 625,000 messages.

half for producer send and half for the server send (to consumer) = 312,500 each

On a real LAN, the server saturates bandwidth (i.e. becomes IO-bound) and that's no surprise (i.e. until I have access to 10Gbps networks, I don't need to speed up my server any more).

Interestingly I tested this idea of "1-core'ing-it" using apache tomcat and apache bench and saw no improvement. I also saw far fewer qps (loading a single, tiny, web-page over and over) and 100% core utilization even on multi-core. It'd be my guess that tomcat isn't contention-bound but truly cpu-bound.

In other words, don't follow this path without testing this yourself. The good news is that its extremely easy to try and you don't even have to change a line of code. Just try "1-core'ing" it.

Monday, March 1, 2010

Mailinator and (not) Death-by-Popularity

As far as I know, Mailinator was the first website of its kind. A website to accept any email at all, no sign-up, no registration. Obviously, email websites surely existed before but not with this "no one owns any account" twist. And what a twist it turned out to be. Check out the original web page in 2003 at the Wayback Machine (including the "its like flicking a booger at spam!" original tagline!)

Needless to say, copycat websites showed up fast. And that's a decent indicator that a new idea is a good one. Or at least an interesting one. As I've (and many others have) said before ideas by themselves are basically worth nothing. And any good idea is destined be copied, stolen, and taken credit for. Execution is the key.

If you don't believe me - here is part of Mailinator's terms of service present on every inbox page:

"You agree to hold ManyBrain, Inc./Mailinator harmless from any damages caused by loss of emails, content within emails, damage to your computer or innocence from viewing emails, direct or indirect use of this system, or anything else you can think of. Use at your own risk."

I wrote those words a bunch of years ago. I'm no lawyer so I tried to come up some simple words to get the message across.

Now. Go to google and search on (quoted) "computer or innocence". Here's a screen shot of what I got.

Almost all entries on this page are Mailinator copycat services that not only "borrowed" the idea of Mailinator - a few even "borrowed" its terms of service !! Talk about uncreative copying. :)

For the public record - again, I'm no lawyer so if you unauthorizedly copied my terms of service and get in trouble - you can't blame me.

Now keep in mind here - theft of ideas is all part of the game. It is, for better or worse, part of human nature. And as I outlined here - the execution and evolution of Mailinator was mine, but the idea of no-sign-in email wasn't. It was my old roommate's (who at times, has helped out on Mailinator copy and such).

Google didn't invent the idea of internet search. Facebook didn't start the social web. They just made it better or more usable - or something. But they won. Sometimes taking a great idea and twisting it just a little, turns it into something great.

Mailinator however is an interesting beast. It is a hard business to monetize. And that's just fine, so long of course, it doesn't cost a ton to run.

But therein lies the rub.

There's a strong sentiment in the web industry that performance, at many levels, doesn't matter. That you simply can "buy another server" and solve many performance problems. And that's true. And generally that's a good idea. I mean, wasting a lot of developer time (who could instead be creating new features) on performance optimization is dubious. Especially if you can throw down 2 or 3 thousand dollars and simply buy another machine to solve the problem.

I wrote here that the initial incarnation of Mailinator started to die at around 800,000 emails per day. If you want to make a Mailinator copy - its really not very hard. You can do it with almost all off-the-shelf (and free) software. Sendmail to receive and some sort of webmail front-end to view.

That's it. And that's largely what Mailinator was when it started and it took about a weekend to setup.

But then... I ran into that 800,000 a day problem. This was a performance problem of epic proportions. That is, the site was perpetually crashing. I needed a solution.

One solution, as I pointed out was to "buy another server" but that would have cost the 2 grand plus the ongoing monthly cost to pay for it. And Mailinator was not making enough to cover that at all. I would have had to start paying for Mailinator out-of-pocket to keep it going.

And maybe worse - that would have just solved the problem temporarily until I needed yet another server.

So instead, I did the wrong thing. I rewrote the system from scratch. Threw out the off-the-shelf stuff and built a software system that was customized for Mailinator. Again, financially this was a poor decision (i.e. cost of my development time) but luckily Mailinator is a hobby too so I wrote it off as just that.

As soon as I brought up the new software email jumped to 3,000,000 a day. The old system was not only choking at 800,000 it was refusing connections. I then upgraded the network at the time from 10mbs to 100mbs and the email again leaped to 6,000,000 a day. That is, the new software was fast enough to expose that we were now saturating bandwidth.

(Mailinator has now seen >25,000,000 emails per day)

So that's all fun stuff - but remember those copycat services? Over the years, I've seen 3 of them rise in popularity - and then die (as in literally, site gone, blog explaining it became too expensive to run).

That's what I meant when I said Mailinator is an interesting beast. Its a paradigm that's extremely easy to reproduce - so long as you don't actually get popular (or you find a great way to monetize).

Seems like the web has a lot of opportunity for services like this. Cool ideas that in many senses are just too costly to keep running. Then again, the price of tech comes down - and performance increases.

A Sendmail/Webmail Mailinator today on a modest machine could probably handle a fair bit more than 800,000 a day.

Every few years I try to re-evaluate every old idea I dropped because I felt it wasn't technically/financially feasible. There's a perpetual convergence of faster tech and cheaper prices.

New things become feasible every day that weren't feasible yesterday.

Got any ideas?