dormando
Referring to Jeremy Cole's post on swapstorming under NUMA hardware, I'll note something potentially new.

While I've seen this "brick wall swapstorming" a few times before and since that post, I just saw some new OS installs not do this by default, and using numactl to change the defaults is actually harmful to system interactivity.

In the brick-wall cases, two NUMA zones of ~30G each, plus a mysqld (or memcached) running with 45G of ram, would equal 30G in memory, and 15G in swap. Ugly.

In this case, I'm getting a little bit in swap, but a relatively even node distribution.

Here's a box with no numactl tuning:
N0        :      7068733 ( 26.97 GB)
N1        :      7120258 ( 27.16 GB)
active    :     13355529 ( 50.95 GB)
anon      :     14187441 ( 54.12 GB)
dirty     :     14185099 ( 54.11 GB)
mapmax    :          265 (  0.00 GB)
mapped    :         1580 (  0.01 GB)
swapcache :         2350 (  0.01 GB)


Similar hardware, same OS/kernel, running under numactl --interleave=all:
N0        :      6778742 ( 25.86 GB)
N1        :      6313382 ( 24.08 GB)
active    :     12395957 ( 47.29 GB)
anon      :     13090566 ( 49.94 GB)
dirty     :     13090566 ( 49.94 GB)
mapmax    :          255 (  0.00 GB)
mapped    :         1588 (  0.01 GB)

... just a touch in swap on the first guy. Though I'm going to wait a few days to declare victory or defeat, since I did see the first guy dump nearly a whole gig of swap once, but wasn't able to confirm if the swapped memory was mysql yet.
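
(For reference: per-node totals like the ones above can be rolled up from /proc/<pid>/numa_maps. A minimal sketch of that kind of roll-up - it just sums every numeric field per mapping and assumes 4KB pages, so treat it as illustrative rather than the exact tool used here:)

#!/usr/bin/perl
# Minimal sketch: sum per-node page counts from /proc/<pid>/numa_maps.
# Assumes 4KB pages; pass the mysqld (or memcached) pid as the argument.
use strict;
use warnings;

my $pid = shift or die "usage: $0 <pid>\n";
open my $fh, '<', "/proc/$pid/numa_maps" or die "can't read numa_maps: $!";

my %pages;
while (my $line = <$fh>) {
    # each mapping line carries fields like "N0=1234 N1=5678 anon=... dirty=..."
    while ($line =~ /(\S+)=(\d+)/g) {
        $pages{$1} += $2;
    }
}
close $fh;

for my $field (sort keys %pages) {
    printf "%-10s: %12d (%6.2f GB)\n",
        $field, $pages{$field}, $pages{$field} * 4096 / (1024 ** 3);
}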

The side note here is that my numactl-modified node is exhibiting some extreme latency on interactivity. Appears to be related to anything that needs to fork having a half-second delay. MySQL seems to be running fine though.

I haven't investigated at all how NUMA distribution has changed in recent kernels (though I know it's been steadily improving over the years). Unfortunately, every other box I've used which *has* the problem runs on a redhat/centos5 kernel, which is ancient to an extreme.

In this case it's debian squeeze with its default 2.6.32 kernel. Anyone try a recent ubuntu or redhat6 yet and see if the NUMA/swap issues are better on there?

Hello! First read this if you haven't yet.

I will now continue the back-and-forth obnoxiousness that benchmarking seems to be!

In my tests, I've taken the exact testing method antirez has used here, the same test software, the same versions of daemon software, and tweaked it a bit. Below are graphs of the results, and below that is the discussion of what I did.

[Graphs of the results went here.]

Wow! That's pretty different from the first two benchmarks.

First, here's a tarball of the work I did. A small bash script, a small perl script to interpret the results (takes some hand fiddling to get it into gnuplot format), and the raw logs from my runs pre-rollup.

What I did


The "toilet" bench and antirez's benches both share a common issue: they're busy-looping a single client process against a single daemon server. The antirez benchmark is written much better than the original one; it tries to be asynchronous and is much more efficient.

However, it's still one client. memcached is multi-threaded and has a very, very high performance ceiling. redis is single-threaded and is very performant on its own.

There is a trivial patch I did to the benchmarks to make them just run the GET|SET tests. It is included in the tarball.

What I did was take the same tests, but I ran several of them in parallel. This required a slight change in pulling the performance figures and running the test. The tests were changed to run indefinitely, either doing sets, or sets then indefinite gets (I wanted to run some sets before the get tests so they weren't just hammering air).

The benchmarks were then fired off in parallel via the bash script, with the daemon freshly started before each run. After a rampup time (to allow the sets to happen, as well as let the daemons settle a bit), a script was used to talk to the daemons and sample the request rate. Since the benchmark is running several times in parallel, it's now most accurate to directly ask the daemon how many requests it's doing. I did some quick hand verification and the sampling code lines up with the output of a non-parallel benchmark. So far so good.
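
(The sampling itself is nothing special - just diff the daemon's request counters over an interval. A rough sketch of that idea against memcached's ASCII stats output; host, port, and interval here are arbitrary, and this is not the exact script from the tarball:)

#!/usr/bin/perl
# Rough sketch: sample memcached's get+set rate by diffing `stats` counters.
use strict;
use warnings;
use IO::Socket::INET;

my ($host, $port, $interval) = ('127.0.0.1', 11211, 5);

sub request_count {
    my $sock = IO::Socket::INET->new(PeerAddr => $host, PeerPort => $port,
                                     Proto => 'tcp') or die "connect: $!";
    print $sock "stats\r\n";
    my %stat;
    while (my $line = <$sock>) {
        last if $line =~ /^END/;
        $stat{$1} = $2 if $line =~ /^STAT (\S+) (\S+)/;
    }
    close $sock;
    return $stat{cmd_get} + $stat{cmd_set};
}

my $before = request_count();
sleep $interval;
my $after = request_count();
printf "%.0f requests/sec\n", ($after - $before) / $interval;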

I checked in with antirez to ensure I was running the tests correctly, and re-ran them as close to the original as I could get. Same number of clients *per daemon*, but there were 4 daemons in this case, so the total number of clients is 4x what's listed in the graphs.

The tests ran on localhost using a dual cpu quadcore xeon machine, clocked at 2.27ghz (with turbo boost enabled, I'm pretty sure). The OS is centos5 but with a custom 2.6.27 kernel. I verified the single-process benchmark results on my ubuntu laptop running 2.6.35 and a 2.50ghz core2duo and got similarish-but-slightly-lower numbers. I also tried the tests on several slightly differing machines after getting some odd initial results. Memcached was using the default of 4 threads. Performance might suffer in this particular test with more threads, as you'd land with more lock contention.

So these numbers look correct, for what I was trying to do here.

Nothing else was changed. I used the same tools.

Why I did it


Both tests are busy loops. All three of these benchmarks are wrong, but this one can be slightly closer to reality. In most setups, you have many independent processes contacting your cache servers. In some cases, tens of thousands of apache/perl/ruby/python processes across hundreds of machines, all poking and prodding your cluster in parallel.

I don't have the room here to explain the difference between running two processes and one process against the same daemon - so I'll hand-wave with "context switches n' stuff". There're plenty of good unix textbooks on this topic :)

So in this case, four very high speed benchmark programs soaked up CPU and hammered a single instance of redis and a single instance of memcached, which displays the strong point for the scalability of a single instance in each case.

Why the bench is still wrong


These are contrived benchmarks. They don't test memcached incr/decr or append/prepend (mc /does/ have a few more features than pure get/set).

Real world benchmarks will require a mix of sets, gets, incrs, decrs. They also require testing each in isolation; some users might use their key/value store as a counter and hammer incr/decr hard. Others might hammer set hard, others might be near-purely gets.

All of these need to be tested. All features should be benchmarked and load tested in isolation, and also when mixed. All features need to be tested under abuse as well.

The test also doesn't try very hard to ensure the 'get' requests actually match anything. A better benchmark would preload some data across 100,000 keys and then randomly fetch them. I might try this next, but for the sake of argument I'm matching the same testing situation as the original blog post.
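
If I do get around to it, the preload half would look something like this - a quick sketch with Cache::Memcached, arbitrary key counts and value sizes, not something I've actually benchmarked:

#!/usr/bin/perl
# Sketch of a less air-hammering get test: preload 100,000 keys, then
# fetch random ones and count hits.
use strict;
use warnings;
use Cache::Memcached;

my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });
my $keys = 100_000;

$memd->set("bench:$_", 'x' x 100) for (1 .. $keys);   # preload

my $hits = 0;
for (1 .. 1_000_000) {
    $hits++ if defined $memd->get('bench:' . (int(rand($keys)) + 1));
}
print "hits: $hits\n";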

The interpretation for memcached


Memcached sticks to a constrained featureset and multithreads itself for a highly consistent rate of scale and performance. When pushed to the extreme, it needs to keep up. We also need to stay highly memory efficient. For a bulk of our users, the more keys they can stuff in, the more bang for the buck. Scalable performance is almost secondary to this. This is why we have features like -C, which disables the 8-byte CAS per object.

In a single-threaded benchmark against a multi-threaded memcached instance, memcached will lose out a bit due to the extra accounting overhead it must perform. However, when used in a realistic scale, it really shines.

There are some trivial ways we are able to greatly increase this ceiling. It's not hard to get memcached to run above 500,000 gets per second via some tweaks on some of its central locks. Sets have a lot of room for improvement due to this as well. We plan to accomplish this. Our timing has been bad for quite a while though :)

In almost all cases, the network hardware for a memcached server will give out before the daemon itself starts to limit your performance. This is a lot of why we haven't rushed to improve the lock scale.

Computers are absolutely trending toward more cores and not toward higher clocks. Threading is how we will scale single instances.

I really hate drawing conclusions from these sort of things. The entire point of this post is more or less me posturing about how shitty benchmarks tend to be. They are created in myopia and usually lauded with fear or pride.

You can't benchmark the fact that Redis has atomic list operations against memcached. They do different things and exist in different spaces, and the real differences are philosophical and perhaps barely technical. I'm merely illustrating the proper scalable performance of issuing bland SETs and GETs against a single instance of both pieces of software.

Understand what your app needs feature-wise, scale-wise, and performance-wise, then use the right tool for the damn job. Don't just read benchmarks and flip around ignorantly, please :)

Finally, here's one more graph... I noticed that redis seemed to do slightly better in the non-parallel benchmark, so I ran the numbers again with a single parallel benchmark in case anyone wants to look into it. Yes, the memcached numbers were lower for the single benchmark test, but I don't really care since it's higher when you actually give it multiple clients :)

Believe it or not, I haven't put myself into this position before. The day before yesterday I started a LOAD DATA INFILE for 760 million rows. Didn't really think too hard about it, figured I'd let it run until it finished.

Unfortunately MySQL did that thing where the row insertion rate slowed to molasses over time, and the box was hosed. I killed the query. So it started rolling back the transaction. Which was 431 million rows in.

Using `SHOW ENGINE INNODB STATUS` you can look at the number of undo entries a transaction has left to go:

---TRANSACTION 0 1161525892, ACTIVE 18598 sec, process no 20139, OS thread id 1131772224
ROLLING BACK , undo log entries 431301691

Something like that. I left it to roll back overnight. The next day, it had only moved through a few million entries. It's going to take a week or more! I could just drop the table, but it's locked by the transaction (is this always true?).
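
(Side note: if you'd rather not eyeball the full status output while waiting, the undo count is easy enough to scrape. A quick DBI sketch, with made-up credentials:)

#!/usr/bin/perl
# Quick sketch: poll SHOW ENGINE INNODB STATUS and print rollback progress.
# DSN and credentials are placeholders.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:host=localhost', 'root', 'password',
                       { RaiseError => 1 });
while (1) {
    my @row    = $dbh->selectrow_array('SHOW ENGINE INNODB STATUS');
    my $status = $row[-1];   # the Status column is the last field
    for my $line (split /\n/, $status) {
        if ($line =~ /ROLLING BACK.*undo log entries (\d+)/) {
            print scalar(localtime), ": $1 undo entries left\n";
        }
    }
    sleep 30;
}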

So, I made sure replication was stopped, flushed logs, and ensured nothing was talking to the DB.
Then, tried to shut down mysql. It hung, waiting for the transaction. Waited 15 minutes.
Then, I kill -9'ed mysqld.
Then, I waited a few hours for InnoDB crash recovery to run (large 512M redo logs + no fast recovery patch).
Then, I see this:

InnoDB: Apply batch completed
[etc]
InnoDB: Starting in background the rollback of uncommitted transactions
100113 20:53:31  InnoDB: Rolling back trx with id 0 1161525892, 431119521 rows to undo

Still rolling back, but now in the background!

Finally, I dropped the table housing the offending transaction.

Bam, now the undo rows are being chewed through at the rate of 1,000,000 every couple seconds. It'll be undone in a few minutes and I can go finish the maintenance work on the DB without having to reslave it. (It's a few TB in size, reslaving is a little painful).

Is there a better way to do this? I had no idea if this process would work or not, and there is the undesirable step of kill -9'ing mysqld.
Was staring at a wall earlier trying to think of things to optimize in an experimental HAProxy setup. I have the proxy configured to do very little processing (even using splice(2) when useful).

I'm asking HAProxy to do just a couple things in L7:
- Add an X-Forwarded-For header
- Hash the URI onto a list of backend servers
- Shuffle data between sockets

... pretty much nothing else. blind copy/forward otherwise.

But, hey, I have a box here with an Intel Nehalem based xeon core. Nehalem has a new hardware instruction for calculating a CRC32 hash. I wonder if it's any faster? How does the numeric distribution compare?

Well getting it working wasn't too hard at all. I'll post a patch once I clean it up.

I threw about 9,000 requests with different URIs at it to see how well it balanced:
intel:
   1740 001
   1749 002
   1693 003
   1835 004
   1807 005

haproxy:
   1741 001
   1716 002
   1732 003
   1772 004
   1793 005

... a handful of requests were lost in the transfer, so these numbers are a little suspect. I'll try again to see if I can get a more accurate reading. You can already see that the built-in algorithm for haproxy is notably more even than the intel CRC32, *but* the CRC32 isn't terribly far off. If you double or triple up your server list, it could end up being more even. I'll try this later.
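
The balance check itself is nothing fancy; you can fake the same sort of thing offline by hashing a pile of made-up URIs and counting buckets. A sketch using plain crc32 from String::CRC32 - note this is the standard CRC32, not the Castagnoli CRC32-C the Nehalem instruction computes, so the exact distribution will differ:

#!/usr/bin/perl
# Offline sketch of the balance check: hash ~9,000 made-up URIs onto 5
# buckets and count. Standard crc32, not the SSE4.2 CRC32-C, so this is
# illustrative only.
use strict;
use warnings;
use String::CRC32;

my $servers = 5;
my %count;
for my $n (1 .. 9000) {
    my $uri = "/page/$n?user=" . int(rand(100_000));
    $count{ crc32($uri) % $servers }++;
}
printf "%7d %03d\n", $count{$_} || 0, $_ + 1 for (0 .. $servers - 1);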

Now, the speed test. I wrote a separate bench tool that takes some input and runs the hash algorithms N times. I ran each test 3 times in a row per algorithm to ensure the times were close. The bench is for 10 million loops on the hash algo.

The results are "uri length: N", then the middle time, for each algorithm.

uri length: 6

haproxy:
0m7.459s
intel:
0m6.528s

uri length: 13

haproxy:
0m14.139s
intel:
0m6.138s

uri length: 39

haproxy:
0m40.366s
intel:
0m8.483s

... wow, quite a bit of a speedup! BUT, the hash algos do a little extra processing, looking for '?' or '/' characters to short circuit the hash if necessary. I took these out to compare them more on the raw hash speed. Also, technically the haproxy hash was counting '/' characters where the intel one wasn't.

uri length: 39
haproxy:
0m33.486s
intel:
0m7.459s

... still retains most of the speedup. Neat!

Unfortunately this speedup is likely marginal at best. In the final case with most of the tests in there, haproxy is running at 246,998 hashes per second. Intel is running at 1,340,662 per second. 246,998 per second already isn't slow. I'll have to run the algorithm in a full scale test to see if the difference is even measurable, or if too much CPU time is soaked up everywhere else in haproxy.

Also, you can tell HAProxy to only consider the first N characters of a URI for hashing. Given the pretty linear decrease in speed with the length of the URI, capping the hash length at 10-15 gives you a speedup comparable to switching to the intel algorithm. If you want to hash arbitrary string lengths, the nehalem CRC32 algo is pretty darn impressive, with a very minor increase in time over a wide increase in string length :)

So I'll clean up the patch, test it more at full scale, and post again later.
http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html

This has been making the rounds today... In my usual fashion I'm going to write an overly complicated post in response.

The basic claims of the memcached "multiget hole" are thus:

- If you are primarily using multigets to batch requests.
- Memcached is out of CPU.
- Adding a second memcached instance will split the batch request across both hosts.
- This will make things slower, since your multiget request gets split in half and hits both servers, instead of just one.

Let's break down this last claim into some detail, then discuss potential workarounds! Everybody put on your bath robe and thinking cap.

Claim: A multiget, split into two, will be slower than a single multiget against a single server.

What's really happening: A multiget, as referenced here, is when you combine a fetch for several keys into a single request. Let's say in this exercise you are trying to fetch keys 'foo1' through 'foo100' in one single request. The process for a typical memcached client and server instance is:

- Take the full list of keys requested.
- Hash each key individually against the list of memcached servers. If you have one server, they all go to the same place; if you have two, they are split.
- For each server that will get keys, issue a special multiget request against that server. For the ASCII protocol this looks like: `get foo1 foo2 foo3` to the first instance, and `get foo4 foo5 foo6` to the second instance. A single write will get multiple responses back. This is faster than doing them one at a time, since you would be waiting for a response between each get: "get foo1" (wait for response), "get foo2" (wait for response), etc.
- Wait for each server to respond, collect keys, return to caller.
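
Sketched in code, the client-side splitting is roughly this - crc32-mod-N is just a stand-in for whatever key distribution your client actually uses, and the server list is made up:

#!/usr/bin/perl
# Sketch of client-side multiget splitting: bucket keys by server, then
# build one ASCII "get k1 k2 ..." command per server.
use strict;
use warnings;
use String::CRC32;

my @servers = ('10.0.0.1:11211', '10.0.0.2:11211');
my @keys    = map { "foo$_" } (1 .. 100);

my %per_server;
for my $key (@keys) {
    my $idx = crc32($key) % @servers;
    push @{ $per_server{ $servers[$idx] } }, $key;
}

# one write per server that owns any of the keys
for my $server (sort keys %per_server) {
    my $cmd = 'get ' . join(' ', @{ $per_server{$server} }) . "\r\n";
    print "would send to $server: $cmd";
}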

Let's break down the steps even more!

- For each multiget request issued, a *client* may either use a *blocking* or *non blocking* mode.
- In an optimized case, the client will issue a multiget against *both* servers *in parallel* and then call poll(2) (or similar) and wait for the responses.
- In a non-optimized case, the client will issue multigets to each server in turn and wait for each response. libmemcached did this until recently, so you might be surprised if you look!

On the server end:

- Read all keys requested.
- For each key, hash the key and look it up against the internal hash table.
- Load any valid items for return and...
- ... write them to the socket.
The binary protocol more closely combines all these steps, but the idea is the same.

What the hell are you getting at?

Well, my point is that slicing a multiget actually *shifts a tradeoff* as much as it makes things more or less efficient. There is a certain amount of overhead for a *server* to read from a client and respond, but there is also a particular amount of effort for that server to look up each key in its hash table and build a response. It is a fact that issuing a smaller multiget against a particular server will take *less* CPU time than a larger one. Adding servers does reduce CPU time on the server.

However, when the client has to issue separate writes to more servers, it is doing more complex work and will thus take longer and use more CPU time than if all of the requests were in a single write.

Hence, adding servers to a cluster *will* reduce the CPU usage on the cluster. The reduction is non-linear, but it will not make things *worse*. It could, however, negatively affect clients, and a bad client can be especially affected if it has to wait for responses in serial.

Part the next: The subtle issue

Depending on how large your multigets are, it may take less time to split them and issue them against multiple servers. This is *entirely up to you* to decide how to handle; it requires testing, and can be affected by kernel tunables.

If you are issuing a multiget with many keys, or with very large responses, you will be more likely to run up against the (I hope I'm quoting this right) TCP window scaling. After so many bytes, TCP needs to roundtrip an ACK packet to confirm that the remote end has received the preceding data. This window will open at a certain size, and then expand or contract depending on how you're using the connection.

This is why some downloads or uploads will start out a little slow, then rapidly speed up. It's also why connections over a laggy or long link might not go above, say, 40k per second, but you can open multiple connections and run them all at 40k/sec to the same server (see also: download accelerators).

This last example should illustrate what I mean here. Stuffing too much down the pipe at once will cause more roundtrips to the remote server for the TCP acks. If you split a large list and run the data nonblocking, in parallel, to multiple servers, it might take less time to issue the request, but will use more CPU on the client.

Part the next to last: The workarounds

With the above in mind, the typical workaround has been in use for ages. A long time ago, in a galaxy far far away, brad fitz (or someone over there, I'm not sure who) realized that fetching all cache keys for a livejournal profile is a trivial multiget of 10+ keys. It was also stupid to issue this (relatively small) multiget across all of the memcached hosts.

So he added a set and fetch by master key mode to the perl client (Cache::Memcached at the time). When you issue a set or a get request in "by key" mode, you give any given key a second key. That is, the key your client uses to hash your data out to the list of memcached servers is different from the key you hand to memcached for storage. So:

- You assign a master key, "dormando", to keys "dormando-birthday", "dormando-website", etc. This is bad key naming, but bear with me.
- Your client, instead of using "dormando-birthday" to decide where to store the keys, uses the key "dormando".
- Your client then sends *just* "dormando-birthday" and "dormando-website" to whatever server "dormando" hashed to.
- Memcached happily stores those keys, without any idea of what the master key was (you can't get it back).

Then you issue a multiget back, with the master key of "dormando". Both keys resolve to the *same* server, and the multiget hits a single host. With a single write, and ideally a single roundtrip.

If I have a metric pantload of keys to fetch and don't want to issue them all to the same server (noting the above subtle issue), I can semi-intelligently split the master keys into "dormando-chunk1", "dormando-chunk2" - it depends on what your app can handle.

This is a simple and elegant way of avoiding having multigets spread thinly across your memcached cluster.
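
Sketched out, the trick is just hashing on the master key instead of the real keys. Again, crc32-mod-N stands in for your client's actual distribution, and the exact "by key" API varies per client library:

#!/usr/bin/perl
# Sketch of master-key clustering: pick the server from the master key,
# then send all of the real keys there in one multiget.
use strict;
use warnings;
use String::CRC32;

my @servers = ('10.0.0.1:11211', '10.0.0.2:11211', '10.0.0.3:11211');

sub server_for {
    my ($master_key) = @_;
    return $servers[ crc32($master_key) % @servers ];
}

my $master = 'dormando';
my @keys   = ('dormando-birthday', 'dormando-website');

# every key filed under this master key lands on the same server,
# so one multiget hits one host
my $server = server_for($master);
print "send 'get @keys' to $server\n";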

You could use UDP!

Yeah I guess you could. What about keys with larger responses? This does have a lot of the same issues, but in a different flavor. Could be faster or slower depending on what you're doing.

You could REPLICATE!

annnnnnnnnd do something really complicated where you have to store all of your keys in two places (halving the effective size of your cache!) and have your client randomly pick where each key goes to or comes from each time it's fetched? When issuing against a cluster of more than *two* machines, this isn't going to help nearly as much as cutting 50 separate fetches down into a single request, deterministically, by clustering the keys intelligently.

Note that replication adds a lot more failure scenarios. Network blips can lead to inconsistent cache data, among other things.

Sounds simpler to use a feature that already exists (look for "mget_by_key")?

But you could make either work.

Fortunately, there's also a really short answer to all of this.

Everyone's favorite MySQL load relief system, memcached, has just hit the next major stable release: 1.4.0

This release sports a new binary protocol, major performance improvements, and many new statistics. Major kudos to the work of other people (Trond, Dustin, Toru) who put most of the effort into this new release.

Check out the release notes and give it a shot on your site. Please let us know if you've deployed it and any feedback you might have :)
original post

... and this is my usual plea to those mysql/web/industrial folks to try out the latest code. Help us on our quest to scale the crap out of all of your stuff. :)

Find us on the mailing list, on #memcached on freenode, or on twitter as dormando, dlsspy, tmaesaka, and trondn. :) All others are fakers.

---

Two new memcached releases are available today.

Stable 1.2.7

The new stable release is a maintenance release of the 1.2 series, containing several bugfixes and a few features.

This version is recommended for any production memcached instances.

Release Notes:
http://code.google.com/p/memcached/wiki/ReleaseNotes127

Download:
http://memcached.googlecode.com/files/memcached-1.2.7.tar.gz

Beta 1.3.3

The new 1.3 beta brings lots of new features, performance improvements, protocol support, and more to memcached.

Everyone is encouraged to get this into their labs and abuse it as
much as possible. This will be the stable tree. We've been testing
it quite thoroughly in the memcached community already and find it to
be quite stable, but we're always looking for more complaints.

Release notes:
http://code.google.com/p/memcached/wiki/ReleaseNotes133

Download:
http://memcached.googlecode.com/files/memcached-1.3.3.tar.gz

Yo,

I didn't write most of this code, but most of the new changes are pretty awesome. Go read about it, grab it, and try it.

We're very careful about getting as much testing as possible before declaring a new release as stable. Please try it out in your development environments, beat up on it, maybe try it out in staging. Perhaps even be naughty and swap out one production machine with it some late night.

A lot of the changes in this release are good for those high end mysql/etc backed sites where you might be worried about (or are) hitting performance issues with memcached itself. Other changes, like expanded statistics and optional memory optimisations, should be useful for most folks. Please check the release announcements for dustin's more thorough notes :)

A 1.2.7 stable release should follow on its heels shortly, but don't expect anything but bugfixes and a few minor feature enhancements over there.

I said it once in 2002, and now again in 2008. See ya, LJ :)

Mucho thanks to dwell and tupshin for their mentions of the work burr86 and I have put in to help them take over LJ. Especially burr86 - I did some hard stuff, but he did most of the work from 6A's side. Kudos to LJ's ops/eng teams for diving into one of the more complicated web architectures and actually getting it running.

We're there for ya (on whatever personal commitment we're comfortable with :P), but LJ's in good hands. Enjoy.

I did some pretty cool hacks for MogileFS to help facilitate their move, which I should be posting to the mailing list as soon as I clean up the commits. What's good for us is good for everyone :) Please remember LJ's open source history as you march forward.

Should you use memcached? Should you just shard mysql more?


Memcached's popularity is expanding its use into some odd places. It's becoming an authoritative datastore for some large sites, and almost more importantly it's sneaking into the lowly web startup. This is causing some discussion.

Most of it seems to be missing the point. In this post I attempt to explain my point of view on how memcached should really influence your bouncing baby startups, and even give some pointers to the big guys who might have trouble seeing the forest for the trees.

Using memcached does not scale your website! Entertain me, I'm playing semantics here: this thing is not for scaleout. Mostly. What memcached really is, is a giant floating magnifying glass. It takes what you have already built and makes it stretch ten times further. I insist on not confusing caching with scaleout, because when your little stretch-armstrong of a website hits that tenfold limit, you're still screwed. There's no magic switch or configuration option in memcached that will save you from dealing with proper optimization and sharding.

You sure can get away with a hell of a lot though!

Keep it in the front of your mind: no, it will not help you batch your writes, or make them smaller, or really help you deal with them in any useful way. If you want to write data you will need back later, you must shard. If it's data you don't care about, maybe write it to memcached and make a note of it in your business plan.

Also strongly keep in mind; memcached won't help your cache misses suck less. If you're writing awful data warehouse quality queries which you expect to run live on the site, go bust out the failboat and get-a-rowin'. You're screwed. As your dataset grows you will find new slices of hell in which your queries behave in all new ways. What once scanned "a few extra rows" now might hit tens of thousands. Cache misses will suck. You will have to deal with this. That's not something this solves.

Sometimes memcached does let you achieve the impossible, or scale the unlikely. Take slightly complex queries, or even template operations, which under the best of conditions might take 15-20 milliseconds each. An obnoxious join, a weird subquery, a tree walk, or fancy HTML templating. Being able to do this live could mean the difference between your website standing apart or having to settle for an awful workaround. In these cases, with a high enough hit rate, you can soak those cache misses and make the feature work.

My example isn't translating a 5 second query into 0.5ms with memcached, it's a 15-20ms query. If you had a dozen of these in a page load, a bad load might take an extra quarter second to render, but it wouldn't ruin the user experience. The issue memcached solves here is subtle. Tacking on 0.25 seconds per page render might not make the site completely unusable, but realize these queries are using solid resources on your expensive hardware for that extra quarter second: a dozen ~20ms queries is roughly a quarter second of database CPU per page, so with a quadcore database it's possible under the best conditions you would only be able to render 14-16 pages per second off of that machine. Throw in all the other things you have to do on a page load - writes, internal database whoosits, uneven CPU usage - and you'd be lucky to get 5 pages per second.

In this case, it's still walking the line of scalability, but it turns something mildly impossible into something highly probable. On the cheap.

The cost equation


Now the most important factor here has reared its ugly head: Cost.

Cost. Ugly for startups. Ugly for established companies. Nightmares for venture capital. What is your cost? Why am I talking cash about companies who have millions of dollars in VC or sales? Just buy more servers! Whatever, right?

Well no. The largest cost is time. All others pale in comparison. The best physical goods investments your company can make are more related to your people than your hardware. Hardware has horrific depreciation. Most of the value is lost immediately, the rest over the first year of operation.

In comparison, buying your employees really fucking nice chairs, desks, and monitors in a swanky comfortable office is a much more solid investment for your company. Aeron chairs have great resale value for that inevitable going-bumpkus dot bomb sale. Also, anything you do to make your workers happier and more productive will pay out more than any hardware investment. Your product ships on time, you react to the market faster.

To sidestep into hardware a little... Always max out the RAM in your databases. Everyone should. I didn't realize people don't actually do this until I read some of these arguments against memcached. Whenever I add memcached to a website, the RAM memcached gets is RAM that didn't fit into the databases, but easily fits into empty memory slots in webservers or cheaper hardware. A good solid database might cost $5,000, but a beefy memcached box will cost less than half that - way less if you just add memory to existing hardware. So "adding that extra RAM to your databases" isn't a very fair apples-to-apples comparison unless you're already doing something wrong.

So it should be obvious just what the hell I'm getting at now, and what seems to be bothering everyone else about this whole stupid memcached fad.

You're all wasting your goddamn time! Yeesh!

How can a small site or startup benefit from memcached?


Simple: The idea.

Caching really wedges your whole RDBMS worldview. You don't just CRUD anymore. Your data is a process - a flow between points instead of just store and display. At any point in this flow an idea may be injected. Maybe it's serializing a generated object and caching it, maybe it's utilizing gearman to shift off some asynchronous work. There is just more to it now.

But that's all messy complicated. What can you do? What should you do?

Design for having cache, design for change.
... but don't write all the code yet.
... but certainly design for change.

Think good object design. A "user" is a class. That user has base properties which you might find in the `user` table. A "user" object might have a profile, which is really another object with another class representing a `profile` table.

my $user; is an invaluable abstraction.

That user object must load and store data. When you build this at first it's all standard CRUD. Straight to a database.

Where would you think to add caching to this system? I hope I've made it too obvious.

At the query layer! Use a database abstraction class and have it memcache resultset objects and... No no no, that's a lie. I'm lying. Don't do that.

Do it inside that $user object. At the highest level possible. Take the whole object state and shovel it somewhere. That object is its own biggest authority. It knows when it's been updated, when it needs to load data, and when to write to the database. It might've had to read from several tables or load dependent objects based on what you ask it to do.

Instead of wrangling your best and brightest into figuring out a cache invalidation algorithm which might work "okay" against your schemas, do what's simple for the object. If adding caching to the $user object means the load() function tries memcached first, and all write operations hit memcached with a delete operation, so be it. You just added basic caching to one of the hottest objects in your website in, oh, half an hour. Maybe a few days if you're really scraping the bottom of the talent barrel.
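
In rough perl, that's all it really is - a sketch, with the DB bits stubbed out and the key naming, serialization, and schema all hand-waved:

package User;
# Sketch of object-level caching: load() tries memcached first, writes
# invalidate the cached copy.
use strict;
use warnings;
use Storable qw(freeze thaw);

our $memd;   # a Cache::Memcached instance, wired up by the caller

sub load {
    my ($class, $userid) = @_;
    if (my $frozen = $memd->get("user:$userid")) {
        return thaw($frozen);                    # cache hit: whole object state
    }
    my $self = $class->_load_from_db($userid);   # the normal CRUD path
    $memd->set("user:$userid", freeze($self));
    return $self;
}

sub update_website {
    my ($self, $url) = @_;
    $self->{website} = $url;
    $self->_write_to_db;                         # write to the database...
    $memd->delete("user:$self->{userid}");       # ...and drop the cached copy
}

sub _load_from_db {
    my ($class, $userid) = @_;
    # real code would SELECT from `user` (and maybe `profile`) here
    return bless { userid => $userid, website => undef }, $class;
}

sub _write_to_db { }   # real code would UPDATE here

package main;
use Cache::Memcached;
$User::memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

my $user = User->load(42);
$user->update_website('http://example.com/');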

Now we're back where we started. Reap the time benefits! Abstract your data access methods properly, plan for caching. Actually go write caching into a few objects. Maybe turn it off when you're done. You don't need it yet. Write your objects to talk directly to your database and save time.

Same idea for sharding. Either focus on that now, or realize you can take a $user object and extend its load() magic to find and write to users based on a sharding scheme. You probably don't have to rewrite all of the code to make this happen. Refactor to win.

So now you're ready. You're building your site fast and abstracting where you can. Brace for change. Be ready to shard, be ready to cache. React and change to what you push out which is actually popular, vs overplanning and wasting valuable time. Keeping it simple is gold here.

You're building something new and you're going to fail at it. Your design will be wrong, you will anticipate the wrong feature to be popular. Dealing with this quickly can set you apart. Being able to slap memcached into a bunch of objects in a few days (or even hours) can mean the difference between riding a load spike or riding the walrus.

Bullet points for fun! How can your small site benefit from memcached:

- Design for change! Holy crap I can't say this enough.
- Don't cache in ways that piss off your users.
- Not keeping it simple is fail.
- Cache and shard at the highest level possible relative to your data.
- Read High Performance MySQL 2nd ed. Memcached won't fix your lack of database knowledge.
- The same ideas which help you prepare for cache, helps you prepare for sharding.
- Don't waste all your time getting it right now. Get it close, get an idea, try it out, and prepare to be wrong.

Finally:

- Keep an open mind. Sites like grazr and fotolog do things differently. Doesn't mean they're right, doesn't mean they're wrong. Be inventive where it makes sense for your business.

There. Sorry this came out so long :)
