DBMS Musings: Big Data

Showing posts with label Big Data. Show all posts

Wednesday, May 16, 2012

If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.

(This post is coauthored by Alexander Thomson and Daniel Abadi)
In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential---some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let’s ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the TPC-C Website with the top 10 results in raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because that they do not care enough to enter, but rather because they would not win if they did.

To understand why this is the case, one must understand that scalability does not come for free. Something must be sacrificed to achieve high scalability. Today, there are three major categories of tradeoff that can be exploited to make a system scale. The new technologies basically fall into two of these categories; Oracle and DB2 fall into a third. And the later parts of this blog post describes research from our group at Yale that introduces a fourth category of tradeoff that provides a roadmap to end the dominance of Oracle and DB2.

These categories are:

(1) Sacrifice ACID for scalability. Our previous post on this topic discussed this in detail. Basically we argue that a major class of new scalable technologies fall under the category of “NoSQL” which achieves scalability by dropping ACID guarantees, thereby allowing them to eschew two phase locking, two phase commit, and other impediments to concurrency and processor independence that hurt scalability. All of these systems that relax ACID are immediately ineligible to enter the TPC-C competition since ACID guarantees are one of TPC-C’s requirements. That’s why you don’t see NoSQL databases in the TPC-C top 10---they are immediately disqualified.

(2) Reduce transaction flexibility for scalability. There are many so-called “NewSQL” databases that claim to be both ACID-compliant and scalable. And these claims are true---to a degree. However, the fine print is that they are only linearly scalable when transactions can be completely isolated to a single “partition” or “shard” of data. While these NewSQL databases often hide the complexity of sharding from the application developer, they still rely on the shards to be fairly independent. As soon as a transaction needs to span multiple shards (e.g., update two different user records on two different shards in the same atomic transaction), then these NewSQL systems all run into problems. Some simply reject such transactions. Others allow them, but need to perform two phase commit or other agreement protocols in order to ensure ACID compliance (since each shard may fail independently). Unfortunately, agreement protocols such as two phase commit come at a great scalability cost (see our 2010 paper that explains why). Therefore, NewSQL databases only scale well if multi-shard transactions (also called “distributed transactions” or “multi-partition transactions”) are very rare. Unfortunately for these databases, TPC-C models a fairly reasonable retail application where customers buy products and the inventory needs to be updated in the same atomic transaction. 10% of TPC-C New Order transactions involve customers buying products from a “remote” warehouse, which is generally stored in a separate shard. Therefore, even for basic applications like TPC-C, NewSQL databases lose their scalability advantages. That’s why the NewSQL databases do not enter TPC-C results --- even just 10% of multi-shard transactions causes their performance to degrade rapidly.

(3) Trade cost for scalability. If you use high end hardware, it is possible to get stunningly high transactional throughput using old database technologies that don’t have shared-nothing horizontally scalability. Oracle tops TPC-C with an incredibly high throughput of 500,000 transactions per second. There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story). Therefore, Oracle basically has all the scalability that is needed for human scale applications. The only downside is cost---the Oracle system that is able to achieve 500,000 transactions per second costs a prohibitive $30,000,000!

Since the first two types of tradeoffs are immediate disqualifiers for TPC-C, the only remaining thing to give up is cost-for-scale, and that’s why the old database technologies are still dominating TPC-C. None of these new technologies can handle both ACID and 10% remote transactions.

A fourth approach...

TPC-C is a very reasonable application. New technologies should be able to handle it. Therefore, at Yale we set out to find a new dimension in this tradeoff space that could allow a system to handle TPC-C at scale without costing $30,000,000. Indeed, we are presenting a paper next week at SIGMOD (see the full paper) that describes a system that can achieve 500,000 ACID-compliant TPC-C New Order transactions per second using commodity hardware in the cloud. The cost to us to run these experiments was less than $300 (of course, this is renting hardware rather than buying, so it’s hard to compare prices --- but still --- a factor of 100,000 less than $30,000,000 is quite large).

Calvin, our prototype system designed and built by a large team of researchers at Yale that include Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal (in addition to the authors of this blog post), explores a tradeoff very different from the three described above. Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture. In other words, Calvin is designed to handle high-volume OLTP throughput on sharded databases on cheap, commodity hardware stored locally or in the cloud. Calvin significantly improves the scalability over our previous approach to achieving determinism in database systems.

Scaling ACID

The key to Calvin’s strong performance is that it reorganizes the transaction execution pipeline normally used in DBMSs according to the principle: do all the "hard" work before acquiring locks and beginning execution. In particular, Calvin moves the following stages to the front of the pipeline:

Replication. In traditional systems, replicas agree on each modification to database state only after some transaction has made the change at some "master" replica. In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute.
Agreement between participants in distributed transactions. Database systems traditionally use two-phase commit (2PC) to handle distributed transactions. In Calvin, every node sees the same global sequence of transaction requests, and is able to use this already-agreed-upon information in place of a commit protocol.
Disk accesses. In our VLDB 2010 paper, we observed that deterministic systems performed terribly in disk-based environments due to holding locks for the 10ms+ duration of reading the needed data from disk, since they cannot reorder conflicting transactions on the fly. Calvin gets around this setback by prefetching into memory all records that a transaction will need during the replication phase---before locks are even acquired.

As a result, each transaction’s user-specified logic can be executed at each shard with an absolute minimum of runtime synchronization between shards or replicas to slow it down, even if the transaction’s logic requires it to access records at multiple shards. By minimizing the time that locks are held, concurrency can be greatly increased, thereby leading to near-linear scalability on a commodity cluster of machines.

Strongly consistent global replication

Calvin’s deterministic execution semantics provide an additional benefit: replicating transactional input is sufficient to achieve strongly consistent replication. Since replicating batches of transaction requests is extremely inexpensive and happens before the transactions acquire locks and begin executing, Calvin’s transactional throughput capacity does not depend at all on its replication configuration.

In other words, not only can Calvin can run 500,000 transactions per second on 100 EC2 instances in Amazon’s US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon’s Europe (Ireland) and US West (California) data centers---at no cost to throughput.

Calvin accomplishes this by having replicas perform the actual processing of transactions completely independently of one another, maintaining strong consistency without having to constantly synchronize transaction results between replicas. (Calvin’s end-to-end transaction latency does depend on message delays between replicas, of course---there is no getting around the speed of light.)

Flexible data model

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access). The experiments presented in the paper use a custom key-value store. More recently, we’ve hooked Calvin up to Google’s LevelDB and added support for SQL-based data access within transactions, building relational tables on top of LevelDB’s efficient sorted-string storage.

From an application developer’s point of view, Calvin’s primary limitation compared to other systems is that transactions must be executed entirely server-side. Calvin has to know in advance what code will be executed for a given transaction. Users may pre-define transactions directly in C++, or submit arbitrary Python code snippets on-the-fly to be parsed and executed as transactions.

For some applications, this requirement of completely server-side transactions might be a difficult limitation. However, many applications prefer to execute transaction code on the database server anyway (in the form of stored procedures), in order to avoid multiple round trip messages between the database server and application server in the middle of a transaction.

If this limitation is acceptable, Calvin presents a nice alternative in the tradeoff space to achieving high scalability without sacrificing ACID or multi-shard transactions. Hence, we believe that our SIGMOD paper may present a roadmap for overcoming the scalability dominance of the decades-old database solutions on traditional OLTP workloads. We look forward to debating the merits of this approach in the weeks ahead (and Alex will be presenting the paper at SIGMOD next week).

Tuesday, July 19, 2011

Hadoop's tremendous inefficiency on graph data management (and how to avoid it)

Hadoop is great. It seems clear that it will serve as the basis of the vast majority of analytical data management within five years. Already today it is extremely popular for unstructured and polystructured data analysis and processing, since it is hard to find other options that are superior from a price/performance perspective. The reader should not take the following as me blasting Hadoop. I believe that Hadoop (with its ecosystem) is going to take over the world.

The problem with Hadoop is that its strength is also its weakness. Hadoop gives the user tremendous flexibility and power to scale all kinds of different data management problems. This is obviously great. But it is this same flexibility that allows the user to perform incredibly inefficient things and not care because (a) they can simply add more machines and use Hadoop's scalability to hide inefficiency in user code (b) they can convince themselves that since everyone talks about Hadoop as being designed for "batch data processing" anyways, they can let their process run in the background and not care about how long it will take for it to return.

Although not the subject of this post, an example of this inefficiency can be found in a SIGMOD paper that a bunch of us from Yale and the University of Wisconsin published 5 weeks ago. The paper shows that using Hadoop on structured (relational) data is at least a factor of 50 less efficient than it needs to be (an incredibly large number given how hard data center administrators work to yield less than a factor of two improvement in efficiency). As many readers of this blog already know, this factor of 50 improvement is the reason why Hadapt was founded. But this post is not about Hadapt or relational data. In this post, the focus is on graph data, and how if one is not careful, using Hadoop can be well over a factor of 1000 less efficient than it needs to be.

Before we get into how to improve Hadoop's efficiency on graph data by a factor of 1000, let's pause for a second to comprehend how dangerous it is to let inefficiencies in Hadoop become widespread. Imagine a world where the vast majority of data processing runs on Hadoop (a not entirely implausible scenario). If people allow these factors of 50 or 1000 to exist in their Hadoop utilization, these inefficiency factors translate directly to factors of 50 or 1000 more power utilization, more carbon emissions, more data center space, and more silicon waste. The disastrous environmental consequences in a world where everyone standardizes on incredibly inefficient technology is downright terrifying. And this is ignoring the impact on businesses in terms of server and energy costs, and lower performance. It seems clear that developing a series of "best practices" around using Hadoop efficiently is going to be extremely important moving forward.

Let's delve into the subject of graph data in more detail. Recently there was a paper by Rohloff et. al. that showed how to store graph data (represented in vertex-edge-vertex "triple" format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph of data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. This paper and resulting system (called SHARD) has received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a feature on Cloudera's Website. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically be difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.

The only problem is that, as shown by an upcoming VLDB paper that we're releasing today, this technique is an astonishing factor of 1340 times less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper. Our paper, led by my student, Jiewen Huang, achieves these enormous speedups in the following ways:

Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close to each other in the graph can end up very far away from each other in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine). When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.
Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. Hence, allowing different data to be replicated at different factors can further improve system efficiency.
Hadoop, by default, stores data on a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores are optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for the graph data is another source of tremendous inefficiency. By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.

To a first degree of approximation, each of the above three improvements yield an entire order of magnitude speedup (a factor of 10). By combining them, we therefore saw the factor of 1340 improvement in performance on the identical benchmark that was run in the SHARD paper. (For more details on the system architecture, partitioning and data placement algorithms, query processing, and experimental results please see our paper).

It is important to note that since we wanted to run the same benchmark as the SHARD paper, we used the famous Lehigh University Benchmark (LUBM) for Semantic Web graph data and queries. Semantic Web sub-graph pattern matching queries tend to contain quite a lot of constants (especially on edge labels) relative to other types of graph queries. The next step for this project is to extend and benchmark the system on other graph applications (the types of graphs that people tend to use systems based on Google's Pregel project today).

In conclusion, it is perfectly acceptable to give up a little bit of efficiency for improved scalability when using Hadoop. However, once this decrease in efficiency starts to reach a factor of two, it is likely a good idea to think about what is causing this inefficiency, and attempt to find ways to avoid it (while keeping the same scalability properties). Certainly once the factor extends beyond the factor of two (such as the enormous 1340 factor we discovered in our VLDB paper), the sheer waste in power and hardware cannot be ignored. This does not mean that Hadoop should be thrown away; however it will become necessary to package Hadoop with "best practice" solutions to avoid such unnecessarily high levels of waste.

Monday, March 28, 2011

Why I'm doing a start-up pre-tenure

Thanks to the tireless work of the entire Hadapt team, we had a very successful launch at GigaOM's Structure Big Data conference last week. In coming out of stealth, we told the world what we're doing (in short, we're building the only Big Data analytical platform architected from scratch to be (1) optimized for cloud deployments and (2) closely integrated with Hadoop so you don't need those annoying connectors to non-Hadoop-based data management systems anymore; i.e. we're bringing high performance SQL to Hadoop). Although a lot of people knew I was involved in a start-up, several people were surprised to find out at the launch how centrally involved I am in Hadapt, and I have received a lot of questions along the lines of what Maryland professor Jimmy Lin (@lintool) tweeted last week:

.@daniel_abadi wondering how the tenure track thing fits in with @Hadapt (r u on leave?) - but congrats on coming out of the Ivory tower! :)

Although Jimmy did not question my sanity in his tweet, others have, so I think it is time for me to explain my (hopefully rational) decision-making process that lead me to start a company while still on the tenure-track at Yale.

A few facts to get out the way: although I am currently on teaching leave from Yale, I am not taking a complete leave of absence, which means my tenure clock is still ticking while I'm putting all this effort into Hadapt. The time I'm spending on Hadapt necessarily subtracts from the time I have available to spend on more traditional research activities of junior faculty (publishing papers, serving on program committees and editorial boards of publication venues, and attending conferences), which means that there is a huge risk that when I come up for tenure, if I am evaluated using traditional evaluation metrics, I will not have optimized my performance in these areas, and thereby will reduce the probability of receiving tenure. When I was considering starting Hadapt, I sent e-mails to several senior faculty members in my field and asked them if they could think of an example of a database systems professor doing a start-up while still a junior faculty member, and going on to eventually receive tenure (I desperately wanted a precedent that I could use to justify my decision). Not a single one of the people I e-mailed were able to think of such a case (in fact, one of them called the chair of my department to yell at him for even thinking of letting me start a company while still pre-tenure). Starting Hadapt is a gamble --- there's no doubt about it.

So why am I doing it? I want my research to make impact, which to me means that my research ideas should make it into real systems that are used by real people. Unfortunately for me, the research I enjoy the most is research that envisions complete system designs (rather than research on individual techniques that can be applied directly to today's systems). It's hard enough to publish these system design papers; but it's almost impossible to get other people to actually adopt your design in real-world deployments unless an extensive and complete prototype is available, or your design is already proven in real-world applications. For example, there have been many papers published by academics that fall in the same general space as the Google Bigtable paper. Yet the Bigtable paper has had a tremendous amount of impact, while the other papers languish in obscurity. Why? Because when Powerset and Zvents needed to implement a scalable real-time database, they felt safer using the design suggested in the Google paper (in their respective HBase and Hypertable projects) than the design from some other academic paper that has not been proven in the real world (even if the other design is more elegant and a better fit for the problem at hand).

Therefore, if you want to publish system design papers that make impact on the real world, you seemingly only have three choices:

(1) You can use the resources in your lab to build a complete prototype of your idea. That way, when people are considering using your idea, their risk is significantly reduced by trying out your system on their application without significant upfront development cost. Unfortunately, building a complete prototype is a much harder task than building enough of a prototype to get a paper published. It involves a ton of work to deal with all of the corner cases, and to make it work well out of the box --- this amount of work is far too much for a small handful of students to do (especially if they want to graduate before they retire). Therefore additional engineers must be hired to complete the prototype. In the DARPA glory days, this was possible --- I've heard stories of database projects burning over a million dollars per year to complete the engineering of an academic prototype. Unfortunately, those days are long gone. My attempts to get just one tiny programmer to build out the HadoopDB prototype were rebuffed by the National Science Foundation.

(2) You leave academia and work for Google, Yahoo, Facebook, IBM, etc. Matt Welsh has discussed in significant detail his decision to leave Harvard and do exactly that. This is a great solution in many ways --- it increases the probability of your research making impact by orders of magnitude, and has the added bonus of eliminating a lot of the administrative time sinks inherent in academic jobs. If I didn't love other aspects of being part of an academic community so much, this is certainly what I would do.

(3) You do a start-up. This is basically the same as choice (1), except you raise the money to build out the prototype from angel investors and venture capitalists instead of from the government (which typically funds the academic lab). The main downside is that starting a company is highly non-trivial, and you end up having to spend a lot of time in all kinds of non-technical tasks --- meeting with investors, meeting with potential customers, interviewing potential employees, investing the time to understand the market, coming up with a go-to-market strategy, attending board meetings, dealing with patents, participating in boring trade-shows, etc., etc., etc. It adds up to an extraordinary amount of time. It's also more competitive than academia --- there are far more people who want to see you fail in the start-up world than in academia, and some of these people go to great lengths to increase the probability of your failure. There are all kinds of hurdles that come up, and you need to have a strong will to overcome them. If it wasn't for the most determined person I have ever met, Justin Borgman, the CEO of Hadapt, we would never have made it to where we are today. It's hard to start a company, but in my mind, it was the only viable option if I wanted my three years of research on HadoopDB to make impact (Hadapt is a commercialization of the HadoopDB research project).

If it wasn't for the fact that I spent the majority of the last decade soaking up the wisdom of Mike Stonebraker, I might not have chosen option (3). But I watched as my PhD thesis on C-Store was commercialized by Vertica (which was sold last month to HP), and another one of my research projects (H-Store) was commercialized by VoltDB. Thanks to Stonebraker and the first-class engineers at Vertica, I can claim that my PhD research is in use today by Groupon, Verizon, Twitter, Zynga, and hundreds of other businesses. When I come up for tenure, I want to be able to make similar claims about my research at Yale on HadoopDB. So I'm taking the biggest gamble of my career to see that happen. I just hope that the people writing letters for me at tenure time take my contributions to Hadapt into consideration when they are evaluating the impact I have made on the database systems field. I know that this will require a departure from the traditional way junior faculty are evaluated, but it's time to increase the value we place on building real, usable systems. Otherwise, there'll be no place left in academia for systems researchers.

[Note: Hadapt has successfully raised a round of financing and is hiring. If you have experience building real systems, especially database systems --- or even if you have built reasonably complex academic prototypes --- please send an e-mail to hackers@hadapt.com. I personally read every e-mail that goes to that address.]

Thursday, December 30, 2010

Machine vs. human generated data

Curt Monash has recently been discussing the differences between machine-generated data and human-generated data, and trying to define these terms on his blog. I think this is a good subject to dive into, since I frequently use the existence of machine-generated data to justify to myself why 90% of my research cycles are spent on scalability problems in database systems. Rather than try to fit a response as a comment on his post, I thought I would devote a post to this subject here.

In short, the following are the main reasons why machine-generated data is important:

Machines are capable of producing data at very high rates. In the time it took you to read this sentence, my three-year old laptop could have produced the entire works of Shakespeare.
The human population is not growing anywhere near as fast as Moore’s law. In the last decade, the world’s population has increased by about 20%. Meanwhile transistor counts (and also hard-disk capacity since it increases by roughly the same rate) has increased by over 2000%, If all data was closely tied to human actions, then the “Big Data” research area would be a dying field, as technological advancements would eventually render today’s “Big Data” miniscule, and there would be no new “Big Data” to take its place. (All this assumes that women don’t start to routinely give birth to 15 children, and nobody figures out how to perform human cloning in a scalable fashion). No researcher dreams of writing papers that makes only a temporary impact. With machine-generated data, we have the potential for data generation to increase at the same rate as machines are getting faster, which means that “Big Data” today will still be “Big Data” tomorrow (even though the definition of “Big” will be adjusted).
The predicted demise of the magnetic hard disk for solid state alternatives will not come as fast as some people think. As long as hard disk capacity maintains pace with the rate of machine-generated data generation, it will remain the most cost-efficient option for machine-generated “Big Data” (at least until race-track memory becomes a viable candidate). Yes, I/O bandwidth does not increase at the same rate as capacity, but if the machine-generated data is to be kept around, the biggest of “Big Data” databases will need the high capacity of hard disks, at least at a low tier of storage. Which means that we must remain conscious of disk-speed limitations when it comes to complete data scans.

Curt attempts to define “machine-generated data” in his post as the following:

Machine-generated is data that was produced entirely by machines OR data that is more about observing humans than recording their choices.

He then goes on to include Web log data (including user clickstream logs), and social media and gaming records data as examples of machine-generated data.

If you agree with the three reasons listed above on why machine-generated data is important, then there is a problem with both the above definition of machine-generated data and the examples. Clickstream data and social media/gaming data are fundamentally different from environmental sensor data that has no human involvement whatsoever. Certainly the scale of clickstream and gaming datasets is much larger than the scale of other human-generated datasets such as point of sale data (humans can make clicks on the Internet or in a computer game at a much faster rate than they can buy things, or write things down). And certainly, for every human click, there might be 5X more network log data (as Monash writes about in his post). But ultimately, without humans making clicks, there would be no data, and as long as the additional machine-generated data is linearly related to each human action (e.g. this 5X number remains relatively constant over time) then these datasets are not always going to be “Big Data”, for the reasons described in point (2) above.

The basic source of confusion here is that click-stream datasets and social gaming data sets are some of the biggest datasets known to exist (eBay, Facebook, and Yahoo’s multi-petabyte clickstream data warehouses are known to be amongst the largest data warehouses in the world). Since machines are well-known to have the ability to produce data at a faster rate than humans, it is easy to fall into the trap of thinking that these huge datasets are machine generated.

However, these datasets are not increasing at the same rate that machines are getting faster. It might seem that way since the companies that broadcast the size of their datasets are getting larger and gaining users a rapid pace, and these companies are deciding to throw away less data, but over the long term the rate of increase of these datasets must slow down due to the human limitation. This makes them less interesting for the future of “Big Data” research.

I don’t necessarily have a better way to define machine-generated data, but I’ll end this blog post with my best attempt:

Machine-generated data is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action.

Machine generated “Big Data” is machine-generated data whose rate of generation increases with the speed of the underlying hardware of the machines that generate it.

Under this definition, stock trade data (independent computation agents), environmental sensor data, RFID data, and satellite data all fall under the category of machine-generated data. An interesting debate could form over whether genomic sequencing data is machine-generated or not. To the extent that DNA and mRNA are being produced outside of humans, I think it is fair to put genomic sequencing data under the machine-generated category as well.

DBMS Musings