Wednesday, December 30, 2009

2009's top blog posts

Below are my top six blog postings of 2009, in order of the number of page views as calculated by Google Analytics:

  1. Announcing release of HadoopDB (longer version), and (shorter version). Combined visits: 31,228 (26,650 and 4,638 for the longer and shorter versions respectively).

    The post gave an overview of the HadoopDB project that was released by my research group at Yale over the summer. The basic idea is to combine the scalability and ease of use of Hadoop with the performance on structured data of relational database systems.

  2. A tour through hybrid column/row-oriented DBMS schemes. 2,922 visits.

    Of the posts on this list, this might be the one I'm most proud of. It went through different ways one can combine row-store and column-store technology in a single DBMS. I believe that hybrid database systems along the lines described in the post are going to take off in 2010, and the predictions made in the post will soon come to fruition.

  3. ParAccel and their puzzling TPC-H results. 2,499 visits.

    Of the posts on this list, this is the one I like the least, and the one that most needs to be rewritten. I don't recommend reading it, but if you do, make sure you read the corrections in the comment thread in addition to the main article. In any case, the post is very stale at this point, since it discusses some TPC-H results from ParAccel that were challenged by a competitor and that ParAccel withdrew in September. ParAccel has not yet rerun the 30TB TPC-H benchmark.

  4. Watch out for VectorWise. 1,595 visits.

    This post discussed the technology behind Ingres' new column-store storage engine designed for analytical workloads, based on some research out of CWI. I am very high on this technology and my research group has had a chance to play around with it a little this winter.

  5. Analysis of the "MapReduce Online" paper. 1,108 visits.

    I'm surprised this post got so many visits, since it was really geared only toward readers in the research community. This post reviewed some research performed at the University of California, Berkeley, which explored how to pipeline results from different phases in MapReduce jobs to improve performance and enable early estimations of results. Rumor has it that this paper was accepted to NSDI 2010. I think releasing papers as technical reports and having them independently reviewed and recommended in public venues like blogs is an interesting model to consider for the next decade, as an alternative to the private 'accept' or 'reject' reviewing process used today.

  6. Kickfire’s approach to parallelism. 1,042 visits.

    This post takes a deeper dive into Kickfire's technology than you'll find in most other places. I find the way that they use FPGA technology to maximize the parallelism that can be achieved on analytical queries in a single-box machine to be quite impressive.

The post from 2009 that I feel is the most underrated (meaning that the number of visits did not match up with what I felt was the quality of the post) was:
  • What is the right way to measure scale?

    I really feel that people thought about scale in the wrong way in 2009. People assume that if a database can fit a lot of data inside of it (e.g. many petabytes), it must be really scalable. But if the data is not very accessible (e.g. it is stored on slow media, or it takes forever to scan through it all because there are not enough disk spindles or CPUs to process it), then the system is not nearly as scalable as it would seem.
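
    To make the point concrete, here is a rough back-of-the-envelope sketch. The function name, the disk counts, and the 100 MB/s per-disk sequential read rate are all illustrative assumptions of mine, not measurements from any particular system, and the model ignores CPU, network, and compression effects entirely:

```python
def full_scan_hours(data_bytes: float, num_disks: int, mb_per_sec_per_disk: float) -> float:
    """Hours needed to read every byte once.

    Assumes (hypothetically) that data is spread evenly across the disks,
    that all disks scan in parallel, and that each sustains a fixed
    sequential read rate. Real systems will deviate from this.
    """
    aggregate_bytes_per_sec = num_disks * mb_per_sec_per_disk * 1e6
    return data_bytes / aggregate_bytes_per_sec / 3600


PETABYTE = 1e15  # decimal petabyte, in bytes

# Same one petabyte of stored data, very different accessibility.
few_spindles = full_scan_hours(PETABYTE, num_disks=100, mb_per_sec_per_disk=100.0)
many_spindles = full_scan_hours(PETABYTE, num_disks=10_000, mb_per_sec_per_disk=100.0)

print(f"100 disks:    {few_spindles:.1f} hours per full scan")
print(f"10,000 disks: {many_spindles * 60:.1f} minutes per full scan")
```

    Under these assumptions both systems "hold a petabyte", but one needs more than a day to look at all of it while the other needs well under an hour, which is the gap that a capacity-only notion of scale hides.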

Bottom line, if you only have time to read three of my postings from 2009, I would like them to be:

  1. The longer HadoopDB post (this is the only post about my own research)
  2. The hybrid column/row-store post
  3. The post on measuring scale

Overall, I'm pleased with the impact my blog seems to have had, and I intend to continue writing posts for it in 2010.