I checked Curt Monash’s blog today (as I do on a somewhat daily basis) and saw a new post announcing Greenplum’s new column-oriented storage option. In my opinion, this is pretty big news. I was amused to see that Curt, in his post, correctly predicted pretty much everything I was going to say, but nonetheless, I feel obligated to post my reactions to this news:
Quick hit reactions:
(1) Congrats to Greenplum and their strong technical team for adding column-oriented storage. They have essentially created a hybrid storage layer --- now you can store data in rows (write or read optimized) or in columns. I have previously written a long (and surprisingly popular --- 1,638 unique visits) blog post on hybrid column/row-oriented storage schemes. I would put the Greenplum scheme under the “Approach 3: fine-grained hybrids” classification. This puts them in the same category as the Vertica FlexStore scheme (more on this in a second).
(2) I strongly agree with the Greenplum philosophy that religious wars of columns vs rows will not get you anywhere. The fact of the matter is that for some workloads, row-oriented storage can get you an order of magnitude performance improvement relative to column-stores, and for other workloads, column-oriented storage can get you an order of magnitude performance improvement relative to row-stores. Hybrid schemes can theoretically get you the best of both worlds, and that can be a big win.
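To make the trade-off concrete, here is a minimal Python sketch, not taken from any vendor's implementation. The table, its column names, and the "values touched" and "separate reads" counters are all hypothetical stand-ins for real I/O costs. It compares a single-attribute aggregate with a whole-row lookup under each layout.

```python
from typing import Dict, List, Tuple

# Hypothetical toy table: three attributes, four rows.
COLUMNS = ("id", "region", "sales")
ROWS: List[Tuple[int, str, int]] = [
    (1, "east", 10),
    (2, "west", 20),
    (3, "east", 30),
    (4, "west", 40),
]

# Column layout: one contiguous list per attribute.
COLUMN_STORE: Dict[str, list] = {
    name: [row[i] for row in ROWS] for i, name in enumerate(COLUMNS)
}


def sum_sales_row_layout() -> Tuple[int, int]:
    """Aggregate one attribute over rows; return (total, values_touched)."""
    total = 0
    touched = 0
    for row in ROWS:
        touched += len(row)  # the whole row is read even though one field is used
        total += row[2]
    return total, touched


def sum_sales_column_layout() -> Tuple[int, int]:
    """Aggregate one attribute over a single column; return (total, values_touched)."""
    column = COLUMN_STORE["sales"]
    return sum(column), len(column)


def fetch_row_row_layout(pos: int) -> Tuple[tuple, int]:
    """Fetch one full row; return (row, separate_reads)."""
    return ROWS[pos], 1  # the row is stored contiguously


def fetch_row_column_layout(pos: int) -> Tuple[tuple, int]:
    """Fetch one full row from columns; return (row, separate_reads)."""
    row = tuple(COLUMN_STORE[name][pos] for name in COLUMNS)
    return row, len(COLUMNS)  # one read per column to stitch the row together


if __name__ == "__main__":
    print(sum_sales_row_layout())
    print(sum_sales_column_layout())
    print(fetch_row_row_layout(2))
    print(fetch_row_column_layout(2))
```

In this toy model the aggregate touches three times fewer values in the column layout, while the whole-row lookup needs three times more reads. The gap widens with wider tables, which is the intuition behind letting each table pick its own layout.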
(3) For a few years now, I've lumped Greenplum and Aster Data together in my data warehouse market world view, even though Greenplum is slightly older and has more customers. Both are software-only solutions, and are big proponents of commodity hardware. Both partner with hardware vendors and will sell you bundled hardware with their software in a pre-optimized appliance if you prefer that to the software-only solution. Both started with vanilla PostgreSQL and turned it into a shared-nothing, MPP analytical DBMS. Both are players in the "big data" game, and heavily market their scalability. Both are west coast companies. Both support in-database MapReduce capabilities (they even announced this on the same day!). Both had research papers in the major DBMS research conferences this year. They both have embraced "the cloud". They even share some of the same customers (e.g. Fox Interactive Media).
The word on the street is that the main difference between the two companies is that Greenplum has made significant modifications to the PostgreSQL code-base and the integration of the DBMS with the distribution (parallelization) layer is more "monolithic", whereas Aster Data has kept more of the original PostgreSQL code, and treats the database more like a black box. This announcement by Greenplum is further evidence that they are more than willing to edit the PostgreSQL code, as significant modifications to the storage layer were required.
(4) Kudos to Greenplum for being open and honest about the limitations of their column-store option. It is a column-store at the storage layer only. This certainly is still a big win for queries that access few attributes from wide tables. However, before I had my own blog, I wrote a guest article for Vertica’s blog explaining that column-oriented storage only gets you part of the way towards column-store performance --- writing a query executor optimized for column-oriented storage can get you an additional order of magnitude performance improvement. I explain this at length in two academic papers (see here and here). I think these papers coined the terms "early materialization" and "late materialization" for explaining whether columns are put back together into tuples (rows) at the beginning or at the end of the execution of a query plan. In retrospect, I regret using this term ("tuple construction" is a little more descriptive and easier to understand than "tuple materialization"). The fact that Greenplum uses the term “early materialization” to describe their column-oriented scheme is evidence that they’ve read these papers (more kudos!) and will start to implement the low-hanging fruit in the query executor to get increased performance improvement from their column-oriented storage. Hence, I expect that their column-oriented feature (while probably already useful now) will continue to get better in future releases. In the short-term, one should not expect performance approaching regular column-stores.
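For readers unfamiliar with the two terms, the following is a toy Python sketch of the difference, under the assumption of a simple filter-and-project query. It is not how Greenplum, Vertica, or any other product implements its executor; the table, column names, and function names are hypothetical. Both functions answer "return (id, sales) for rows in a given region" over the same column-oriented storage, and each reports how many tuples it had to construct.

```python
from typing import Dict, List, Tuple

# Hypothetical column-oriented table: one list per attribute.
COLUMN_STORE: Dict[str, list] = {
    "id": [1, 2, 3, 4],
    "region": ["east", "west", "east", "west"],
    "sales": [10, 20, 30, 40],
}


def early_materialization(region: str) -> Tuple[List[Tuple[int, int]], int]:
    """Construct every tuple up front, then run row-style operators on them."""
    tuples = list(
        zip(COLUMN_STORE["id"], COLUMN_STORE["region"], COLUMN_STORE["sales"])
    )
    constructed = len(tuples)  # every row is stitched together before filtering
    result = [(i, s) for (i, r, s) in tuples if r == region]
    return result, constructed


def late_materialization(region: str) -> Tuple[List[Tuple[int, int]], int]:
    """Work on columns and positions; construct tuples only for the output."""
    # Step 1: apply the predicate to one column, producing a position list.
    positions = [p for p, r in enumerate(COLUMN_STORE["region"]) if r == region]
    # Step 2: fetch only the qualifying positions from the other columns.
    ids = [COLUMN_STORE["id"][p] for p in positions]
    sales = [COLUMN_STORE["sales"][p] for p in positions]
    # Step 3: construct tuples at the very end.
    result = list(zip(ids, sales))
    return result, len(result)


if __name__ == "__main__":
    print(early_materialization("east"))
    print(late_materialization("east"))
```

Both calls return the same rows, but the early variant constructs four tuples and the late variant only two. With a selective predicate over a large table, that difference in tuple construction work is one source of the extra executor-level gains described above.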
(5) In my previous blog post on hybrid column/row-oriented storage schemes, I mentioned that it is far easier for a column-store to implement the “Approach 3: fine-grained hybrids” scheme than a row-store, since the row-store would have to make modifications at the storage, query executor, and query optimizer layers, while the column-store (that implements both early and late materialization) would only have to make modifications at the storage layer (since the query executor and optimizer are already capable of working with rows). This would lead me to believe that the Vertica FlexStore scheme is more immediately useful than the Greenplum hybrid storage scheme. But, as I have written before, my history with Vertica gives me some bias, so be careful what you do with that statement.
(6) This has been a pretty big month for column-oriented storage layers. First Oracle announced their new column-oriented compression scheme (classified in the “Approach 1: PAX” category in my previous blog post) and now Greenplum adds column-oriented storage. This should put more pressure on the other vendors in the data warehousing space (especially Microsoft and IBM) to come up with some sort of column-oriented storage option in the near future.
Thanks Daniel -- great writeup.
Hey Daniel,
I think it's very nice of GP to jump on the bandwagon as well but let's be serious here though: established and authentic columnar MPP players include Vertica and ParAccel. Period. The only other columnar player is SybaseIQ but they're SMP unless I'm mistaken. For all these folks now (including ORCL!) to suddenly claim a stake to columnar architecture/philosophy via the magic of "hybridism" (my term) is a little much I think.
Furthermore as you point out, the layout part of "columnar" is 1/2 the effort. The "query executor" is the other half as you point out. This means designing (from scratch) an MPP column-aware optimizer (no legacy PG, just fresh, clean new MPP/col-aware rules and cost based algorithms). To the best of my knowledge, ParAccel are the only people who have pulled this off so far.
So of course it's easy for GP to say "religious wars of columns vs rows will not get you anywhere" but then why make it a religious issue indeed (and preach it like it was the 2nd coming of Christ) when in fact you're late to the party.
Thanks :)
J.
Greenplum in its current state is nowhere near a production-quality database engine. While a combined row-store and column-store database engine seems like a good idea, I am not very optimistic about Greenplum or any of the other new vendors getting it to work reliably in the foreseeable future. Especially GP, as they could not even get their row-store to work reliably.
Speaking of "established" column-store vendors, I have no idea how well ParAccel works, but Vertica still has a long way to go. The stability of Vertica as a platform as of version 3.0.X has been abysmal at best.
I am wondering why Infobright or Kickfire never come up as column-oriented databases?
I had the doubtful pleasure of working with the Greenplum product. This is one of the worst enterprise products I have ever worked with. The product is immature and doesn't come near the capabilities of Oracle Exadata. Greenplum is where Oracle was 15 years ago. EMC will try to sell you an illusion that it is the same and better but they are selling a bad product.
The product has a lot of limitations (for example: no ability to connect to an external database, no replication capabilities, dependency on the physical infrastructure when performing backup and restore, and development tools that are neither good nor comfortable). If you read the documentation you will find about 5 pages of limitations. Two of them are that you can use the column-oriented storage option and compression only on append-only tables – that is right, you read well – you can use the strongest features of GP only on tables on which you don't perform updates or deletes.
Since the product is not very popular (at least yet) there will be times that you will experience issues like bugs and behavior not known to anyone else including GreenPlum support and development teams.
The knowledge base doesn't exist on the web and you will be fully dependent on the EMC professional team.
If you don't believe me – read Greenplum documentation and at least don't pay before you check the product from the inside out.
Some of EMC products are good but this is not one of them.