Comments on DBMS Musings: Machine vs. human generated data
Blog author: Daniel Abadi. 11 comments; feed last updated 2024-03-18.

[2011-01-02 22:05:36 -08:00]
Hi Richard,

I hope all is going well at Streambase.

If the resolution of click-stream data is indeed increasing with Moore's law, then I would agree that it should be classified as machine-generated data. Where we disagree is the precise rate of increase. I agree that it is increasing rapidly, but not at the same rate that computers are getting faster.

I also agree that human-tracking data will be more valuable. I just don't agree that it will be bigger.

Posted by Daniel Abadi

[2011-01-02 20:42:48 -08:00]
Apache logs are the simplest case of web logs. While humans' data consumption and decision-making rates aren't tracking Moore's law (or maybe they are, but with a really small constant), the fidelity of computers recording them is. For example, clickstream logs now contain information about what the user was served. They can also contain information about the underlying services used to serve the request and the performance of each of those services. Rich client-side UIs can report information on what parts of the page the human interacted with, and the number of "page loads" an Ajax UI makes can be very large. Much like the resolution of surveillance video is increasing with Moore's law, the resolution of clickstream data and other observations of humans is increasing.

I like Curt's definition. Data about observing humans with increasing fidelity is going to be bigger and more valuable than data about machines doing things for their own sake.
At least until machines start voting and buying Beanie Babies.

Posted by Richard Tibbetts

[2011-01-01 19:09:47 -08:00]
Jeff, Hegemonkey, Wayne, thanks for sharing your opinions. I'm not sure that I agree that Apache log entries that are generated as a direct result of a human action should be classified as machine-generated, but differing opinions are welcome in this forum.

Posted by Daniel Abadi

[2011-01-01 10:54:10 -08:00]
Hi Daniel,

In my personal opinion, machine-generated data is always under the influence of humans. Satellite telemetry data was the byproduct of human coding for events relevant to humans. What differentiates "machine-generated" from "non-machine-generated" is the intervention/requirement for a human to supply/update data to complete the process.

For example, adding info to a Twitter account or Facebook account cannot be considered machine-generated. However, the Apache log lines created for that event can be considered machine-generated since their values are collected as a byproduct of the occurring event.

Long story short, machine-generated data is the end result of code creating information as a response to an event without requiring human oversight and intervention. By oversight and intervention, I mean the human doesn't easily or readily modify the created information once the information has been submitted.

Best Regards,

Jeff Kibler

Posted by jeff

[2010-12-30 22:35:53 -08:00]
I think that trying to split data into "machine-generated" versus "human-generated" is pointless.

It's such a fuzzy distinction. Is stock trade data human generated? It is if a person enters the order, right? But the majority of NYSE orders are from algorithmic trading. Same for the trades that appear in the books -- human or high-frequency black-box?

I agree that big data is a tough problem. I talked with Jim Gray about it many years ago when I looked at it. I was worried about the problems with 100s of GBs. A trifling amount today (buy one more 2010-era disk!)
but the principles are the same.

Adding a time dimension does help capture the flavor of "machine generated". But that opens more confusion. Where's the line for "faster than a human"? Anything less than once a minute? Once a second? Or is that dependent on the task? Secondly, does that measure one person's input or a crowd's?

Maybe the solution is to use some comparative measurements. N * Library of Congresses.

There's also a dimension of "worth" to the big data. Is each piece of data important? Can it be deleted without consequences? It seems that human-generated data is more likely to be worth more than machine-generated. But this distinction is very loosey-goosey too.

Posted by Hegemonkey

[2010-12-30 17:28:20 -08:00]
Nice post. I'd have to agree. The distinction is whether you count "machine generated" or "machine collected" data.
Wayne Eckerson

Posted by Unknown

[2010-12-30 15:50:47 -08:00]
But I agree that if we were to further sub-categorize machine-generated data, we would probably deal with surveillance data differently.

Posted by Daniel Abadi

[2010-12-30 15:49:05 -08:00]
I think that the definition of machine-generated data above would include surveillance video (especially since video quality and size tend to increase with Moore's law).

Posted by Daniel Abadi

[2010-12-30 15:41:48 -08:00]
What you've omitted are data sets that are considered too large to be treated as data just yet. For example, the set of all surveillance video. Most of this is stored on a tape loop which is automatically aged away.

Posted by Edward

[2010-12-30 14:40:58 -08:00]
Your comment seems to imply that while the human conscious mind is limited, the subconscious is infinite. This could take us down an interesting philosophical road :)

Seriously though, human biological systems would only seem to fall under the machine-generated category if we can measure them with an increased precision at the rate of Moore's law.
Otherwise, we still have the long-term human limitation.

Posted by Daniel Abadi

[2010-12-30 14:32:56 -08:00]
I wonder if another way to look at it would be the distance from a "conscious" decision to the generation of the data. This would handle the case of data about biological systems more easily, since they would be generated quite far from a conscious decision, while click data or game data is generated much closer to conscious decisions. Stock data would be somewhere in there, too. I suppose the real question is how well this distance can be quantified and how well it can then classify the nature of the datasets involved.

Posted by The UNIX Man