SIGMOD 2009 (a top-tier DBMS research conference) is being held in Providence, RI, at the end of this month (the week of June 29th). Column-stores seem to be taking off as a mainstream research topic: I've counted at least 5 talks at SIGMOD on column-stores (including an entire session!). Make sure you get to SIGMOD by (at least) Tuesday morning, though, as "Query Processing Techniques for Solid State Drives", by Dimitris Tsirogiannis, Stavros Harizopoulos, Mehul Shah, Janet Wiener, and Goetz Graefe, is being presented in the first research session slot on Tuesday (research session 2). It doesn't look like it from the title, but trust me, it's a column-store paper. The controversial Vertica / parallel DBMS vs Hadoop benchmark paper is being presented in research session 5 (Sam Madden, an excellent speaker, is presenting the paper), and the aforementioned column-store session (with 3 column-store papers) is in research session 8 at the end of the day Tuesday.
But if you get there early, don't leave early! I will be giving a 30-minute talk on column-stores in the awards session at the very end of the conference, which runs until 5:30PM on Thursday, July 2nd (I don't believe the talks in this session have been officially announced yet, but I'm indeed giving one of them).
Re “controversial Vertica / parallel DBMS vs Hadoop benchmark paper”
Daniel, I was disappointed with this paper. I think it should have been published as a Vertica-branded whitepaper on their web site. I have heard from a few people who are “dubious” about the results due to the personal interests of those involved. Michael and Sam should have been listed as Vertica, not MIT. I think once there is a personal investment in the company at the focus of the research, you need to be very clear about that. I certainly do not question anyone’s ethics; it just comes across badly.
I think that's a fair criticism.
A few potential explanations about this:
(1) There's generally no external reviewing process for whitepapers. In contrast, this paper went through the same SIGMOD reviewing process as all other research papers. SIGMOD uses double-blind reviewing, which means the authors are kept anonymous from the reviewers (it would actually violate the submission rules to include a disclosure at submission time saying that some of the authors are from Vertica). So the paper was independently accepted based on its research merits.
(2) Once accepted, the paper was assigned a shepherd, whose job it is to read the final version (with the author names now on it) and make sure it addresses the reviewers' comments. The shepherd (a well-known person not at all connected to Vertica) could have requested some sort of Vertica disclosure if any of the reviewers, the PC chair, or the shepherd him/herself deemed it necessary (the shepherd definitely knew that some of the authors are involved in Vertica).
(3) I can guarantee that none of the faculty (including Mike and Sam) ran any of the experiments. This was done by students whose bias is somewhat reduced (of course, they probably want to impress the senior faculty, so there is still some inherited bias).
(4) I would argue that this paper is about parallel databases (in general) vs. MapReduce rather than Vertica vs MapReduce.
(5) Even though many papers include authors who have some sort of financial interest in the products used in the experiments, it is very rare to see disclosures on research papers. I don't really know why that is; maybe it's because the research community is kind of small and most people just know these connections already (e.g., that Vertica is Stonebraker's baby).
Anyway, the bottom line is that there probably *should* be disclosures on research papers where the authors have a financial interest in products used in the experiments. I'd argue that this is a problem with the DBMS research community in general, in addition to this one particular case. Coincidentally, I recently received a grant from the NSF, and Yale's COI office would not release the money to me until I agreed to put such disclosures on all my future papers that this grant funds. But this is probably a special case (Yale has a particularly over-eager legal department). I don't think relying on universities to force their faculty to put disclosures on all their papers is a realistic solution to this problem in general.
Anyway, for this particular paper, the results seem to make sense. Each performance difference is explained in a believable way that can probably be confirmed using an analytical model. Is there a particular result that seems strange to you?