Thursday, December 30, 2010

Machine vs. human generated data

Curt Monash has recently been discussing the differences between machine-generated data and human-generated data, and trying to define these terms on his blog. I think this is a good subject to dive into, since I frequently use the existence of machine-generated data to justify to myself why 90% of my research cycles are spent on scalability problems in database systems. Rather than try to fit a response as a comment on his post, I thought I would devote a post to this subject here.

In short, the following are the main reasons why machine-generated data is important:

  1. Machines are capable of producing data at very high rates. In the time it took you to read this sentence, my three-year-old laptop could have produced the entire works of Shakespeare.

  2. The human population is not growing anywhere near as fast as Moore’s law. In the last decade, the world’s population has increased by about 20%. Meanwhile, transistor counts (and also hard-disk capacity, since it increases at roughly the same rate) have increased by over 2000%. If all data were closely tied to human actions, then the “Big Data” research area would be a dying field, as technological advancements would eventually render today’s “Big Data” minuscule, and there would be no new “Big Data” to take its place. (All this assumes that women don’t start to routinely give birth to 15 children, and nobody figures out how to perform human cloning in a scalable fashion.) No researcher dreams of writing papers that make only a temporary impact. With machine-generated data, we have the potential for data generation to increase at the same rate as machines are getting faster, which means that “Big Data” today will still be “Big Data” tomorrow (even though the definition of “Big” will be adjusted).

  3. The predicted demise of the magnetic hard disk in favor of solid-state alternatives will not come as fast as some people think. As long as hard disk capacity keeps pace with the rate at which machine-generated data is produced, it will remain the most cost-efficient option for machine-generated “Big Data” (at least until race-track memory becomes a viable candidate). Yes, I/O bandwidth does not increase at the same rate as capacity, but if the machine-generated data is to be kept around, the biggest of “Big Data” databases will need the high capacity of hard disks, at least at a low tier of storage. This means that we must remain conscious of disk-speed limitations when it comes to complete data scans.
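To make the arithmetic in point (2) concrete, here is a minimal sketch in Python. It uses the two rough per-decade figures quoted above and assumes, purely for illustration, that human-tied data grows only as fast as the population while hardware capacity keeps compounding at its current rate. The function names are my own and not from any library.

```python
# Rough per-decade growth figures quoted in point (2); both are estimates.
POPULATION_GROWTH_PER_DECADE = 0.20  # about a 20% increase
HARDWARE_GROWTH_PER_DECADE = 20.0    # about a 2000% increase


def growth_factor(rate_per_decade: float, decades: int) -> float:
    """Compound growth factor after the given number of decades."""
    return (1.0 + rate_per_decade) ** decades


def human_data_relative_size(decades: int) -> float:
    """Human-tied data size relative to hardware capacity (1.0 today).

    Illustrative assumption: human-tied data grows only with population.
    """
    return growth_factor(POPULATION_GROWTH_PER_DECADE, decades) / growth_factor(
        HARDWARE_GROWTH_PER_DECADE, decades
    )


for d in range(4):
    print(f"after {d} decade(s): relative size = {human_data_relative_size(d):.6f}")
```

Under these assumptions, a dataset that fills today's hardware would occupy roughly 6% of the hardware available a decade from now, and the share keeps shrinking by the same factor every decade after that.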

Curt attempts to define “machine-generated data” in his post as the following:

Machine-generated is data that was produced entirely by machines OR data that is more about observing humans than recording their choices.

He then goes on to include Web log data (including user clickstream logs), and social media and gaming records data as examples of machine-generated data.

If you agree with the three reasons listed above on why machine-generated data is important, then there is a problem with both the above definition of machine-generated data and the examples. Clickstream data and social media/gaming data are fundamentally different from environmental sensor data that has no human involvement whatsoever. Certainly the scale of clickstream and gaming datasets is much larger than the scale of other human-generated datasets such as point of sale data (humans can make clicks on the Internet or in a computer game at a much faster rate than they can buy things, or write things down). And certainly, for every human click, there might be 5X more network log data (as Monash writes about in his post). But ultimately, without humans making clicks, there would be no data, and as long as the additional machine-generated data is linearly related to each human action (e.g. this 5X number remains relatively constant over time) then these datasets are not always going to be “Big Data”, for the reasons described in point (2) above.
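The point about the constant multiplier can be checked with a small sketch. It assumes, hypothetically, that each human click produces a fixed number of additional machine log records (5 here, loosely following the 5X figure), and shows that the total then grows at exactly the rate of the clicks themselves. The function names and the click counts are illustrative only.

```python
def total_records(human_clicks: float, log_records_per_click: float = 5.0) -> float:
    """Total records when every click triggers a fixed number of log records."""
    return human_clicks * (1.0 + log_records_per_click)


def growth_rate(before: float, after: float) -> float:
    """Fractional growth between two measurements."""
    return after / before - 1.0


# Hypothetical click volumes one decade apart (20% more clicks).
clicks_then, clicks_now = 100.0, 120.0

click_growth = growth_rate(clicks_then, clicks_now)
data_growth = growth_rate(total_records(clicks_then), total_records(clicks_now))

print(f"click growth: {click_growth:.2%}")
print(f"total data growth: {data_growth:.2%}")
```

The multiplier makes the dataset larger at every point in time, but it cancels out of the growth rate, so the dataset still grows no faster than the human activity behind it.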

The basic source of confusion here is that clickstream datasets and social gaming datasets are some of the biggest datasets known to exist (eBay, Facebook, and Yahoo’s multi-petabyte clickstream data warehouses are known to be amongst the largest data warehouses in the world). Since machines are well known to have the ability to produce data at a faster rate than humans, it is easy to fall into the trap of thinking that these huge datasets are machine-generated.

However, these datasets are not increasing at the same rate that machines are getting faster. It might seem that way, since the companies that broadcast the size of their datasets are getting larger and gaining users at a rapid pace, and these companies are deciding to throw away less data. But over the long term, the rate of increase of these datasets must slow down due to the human limitation. This makes them less interesting for the future of “Big Data” research.

I don’t necessarily have a better way to define machine-generated data, but I’ll end this blog post with my best attempt:

Machine-generated data is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action.

Machine-generated “Big Data” is machine-generated data whose rate of generation increases with the speed of the underlying hardware of the machines that generate it.

Under this definition, stock trade data (independent computational agents), environmental sensor data, RFID data, and satellite data all fall under the category of machine-generated data. An interesting debate could form over whether genomic sequencing data is machine-generated or not. To the extent that DNA and mRNA are being produced outside of humans, I think it is fair to put genomic sequencing data under the machine-generated category as well.