At work I deal with a lot of data (we deal with as much data in a day as the Library of Congress collects in a month). Part of my job is making that data presentable to publishers. I don’t deal with the actual storing of the data (for that we have some brilliant engineers), but I think in order to do my job correctly, I need to understand the full stack that we have. The first Big Data DC meetup was an opportunity for me to learn more about the bottom half of the stack.
Matt Abrams kicked things off with an overview of how Clearspring deals with big data. Big data might be a bit of an understatement. We are talking about 4-5TB of data a day from 2.5 billion view events that needs to be processed. How much data is this? Well, if it took one millisecond to process each event, it would take 29 days to process each full day of data. To accomplish this, Matt and the team have four main design philosophies:
- Speed over Safety
- Simplicity over Complexity
- At scale, small performance deltas matter
- Close is good enough in many cases
Take a look at Matt’s slides as he goes through these philosophies in detail and the stack that Clearspring uses to accomplish this big data task:
(Matt is also doing a series of Blog Posts about this topic. Check out the first one.)
Next up was Dave (whose last name I didn’t catch) from Foundation DB. Foundation DB is creating a distributed key-value store with transactions. Dave presented his philosophy that “The easiest way to build a scalable high performance fault tolerant application is on top of a scalable high performance fault tolerant foundation”. To do this, they have created Flow. Flow adds Futures, Promises, and actors to C++. Foundation DB is entering beta soon. I look forward to seeing where they go with it.
I’m looking forward to future Big Data DC events. It’s only been a few days since the meetup, and I’m already eager to learn how more companies are dealing with Big Data.