I was having a great conversation before with some technical folks about some very very hard scalability and throughput (not necessarily response time) issues they were facing.
I racked my brain to think of what I had done in the past and realized it came down to a few different classes of solutions. First and foremost though the key to finding the answer is instrumenting your code and/or environment to find out where the bottleneck(s) are.
1) Simplest: Do less work
Kind of obvious but if it’s taking you 10 hours to read through some log files and update the database perhaps the easiest thing is to do LESS work. e.g. read fewer log files or do fewer database updates.
You could use techniques like reservoir sampling. But maybe you have to calculate a hard number – the total cost of a set of Stock market trades for example – estimates don’t work. Then again perhaps your log files don’t need to be so big? Every byte you write has to be FTP’d (and could get corrupted) and that byte has to read later (even if it’s not needed).
I find a lot of people forget another alternative here that involves the theme of “Do less work”. Basically if you have a good (enough) model of your input data stream then you can get a “fast but slightly inaccurate” estimate soon and then get “eventual consistency” later. It’s kind of like that old adage – “You can have it fast, correct and cheap. Pick two!” or like the CAP theorem – something’s gotta give. Every dev team should have a Math nerd on it – because Mathematicians have been solving problems like this for decades.
2) Simple-ish: Tune what you already have
Maybe you’ve got a MySQL DB – does it have enough memory? Perhaps Network I/O is a bottleneck – dual NICs then? Check your NIC settings too (I’ve hit that once – 100 Mbps settings on GBps network). Perhaps you need to lower priority on other jobs on the system. Is your network dedicated? What’s the latency from server to DB (and elsewhere).
Maybe when you FTP data files you should gzip them first (CPU is cheap and “plentiful” relative to Memory and I/O – network and disk). If the write is not the problem, perhaps you can you tune your disk read I/O? Are you using Java NIO? Have you considered striping your disks? Not suprisingly for Hadoop speedup many of the tuning recommendations are I/O related.
Perhaps you have a multi-threaded system – can you throw more threads at it? More database connections?
For the database: Reuse database connections? Do you really need all those indexes? I’ve seen it be faster to drop indexes, do batch uploads and reapply indexes than to leave the indexes in place. Are you seeing database table contention – locking etc?
3) Moderate: Throw hardware at it
Seems like a cop-out for a developer to say throw hardware at it but if you look at the cost of (say) $20k in better hardware (more memory, faster memory, faster disk I/O etc.) vs. spending 4 developers for a month (costing in the US anyways $40k+) it’s clear where the upside is at. Developers are probably the most scarce/precious resource you have (in small organizations anyway) so spend their time wisely. They’ll appreciate you for it too!
3) Harder: Fix or redesign the code you have
This is what coders usually do but it’s expensive (considering how much devs cost these days).
Are there more efficient algorithms? How about batching inserts or updates?
Do you have a hotspot – e.g. disk I/O due to 10 parallel processes reading from disk?
Is the database a bottleneck – perhaps too MANY updates to the same row, page or table?
If throughput (and not response time) is your issue then perhaps making things quite a bit more asynchronous, decoupled and multi-threaded will improve your overall throughput.
Maybe instead of a process whereby you: Read tonnes of data from a file, update some counters, flush to DB all in the same thread
You decouple the two “blocking” pieces (reading from disk, writing to DB) and that way you can split the problem a bit better – perhaps splitting the file and having more threads read smaller files? Drop all intermediate data into some shared queue in memory (or memcached etc.) and then have another pool of threads read from that shared queue. Instead of one big problem you have two smaller problems each whose solution can be optimized independently of the other.
Kind of a mix of “Fix the code” and #1 “Do less work” is when you realize you are redoing the same calculations over and over again. For example taking an average from the last 30 days requires you
do get todays new data but also re-retrieve 29 prior days worth of data. Make sure you precalculate and cache everything you can. If you are summing the past 30 days of data for example (D1 . . . D30), tomorrow you will need (D2 . . D31) – you can precalculate (D2 . . D30) today for tomorrow. Not that math is hard for CPUs but you get the idea . . . . spend CPU today to save I/O tomorrow!
An example of being smart about what you calculate is here in this MIT paper “Fast Averaging“. If your data is “well behaved” you can get an average with a lot less work.
Decoupling with Queues is my favorite technique here but you have to be smart about what you decouple.
4) Hardest: Rearchitect what you have
Developers love to do this – it’s like a greenfield but with cool new technology – but it should be the last on your list. Sometimes however it’s just necessary. Amazon and eBay have done it countless times. I am sure Google and Facebook have too. I mean they INVENTED whole new systems and architectures (NoSQL, BigTable, DynamoDB etc.) to handle these issues. Then again Facebook still uses MySQL :-)
Again all of these approaches, if they are to be successful and a good use of time rely on knowing where your bottleneck is in the first place – identifying it and beating on that problem until it cries “Momma!” :-) But lets never forget that the classes of solutions are pretty constant – and the choice basically come down to how much time and money can you afford to fix it.
Ok over to you dear reader – what did I miss, what did I forget? Is there another class of solution?