Wednesday, July 29, 2009

Watch out for VectorWise

In the last few years, there have been so many new analytical DBMS startups de-cloaking that it’s difficult to keep track of them all. Just off the top of my head, I would put Aster Data, DATAllegro, Dataupia, Exasol, Greenplum, Infobright, Kickfire, ParAccel, Vertica, and XtremeData in that category. Once you add the new analytical DBMS products from established vendors (Oracle Exadata and HP Neoview) we’re at a dozen new analytical DBMS options to go along with older analytical DBMS products like Teradata (industry leader), Netezza, and Sybase IQ. Finally, with the free and open source HadoopDB, we now have (at least) sixteen analytical DBMS solutions to choose from.

Given the current overwhelming number of analytical DBMS solutions, I suspect that VectorWise’s sneak preview happening this week is not going to get the attention that it deserves. VectorWise isn’t making a splash with a flashy customer win (like Aster Data did with MySpace), or a TPC-H benchmark win (like Kickfire and ParAccel did), or an endorsement from a DBMS legend (like Vertica did). They’re not even going to market with their own solution; rather, they’re teaming up with Ingres for a combined solution (though the entire Ingres storage layer and execution engine have been ripped out and replaced with VectorWise). But I’m telling you: VectorWise is a company to watch.

Here are the reasons why I like them:


  1. They are a column-store. I strongly believe that column-stores are the right solution for the analytical DBMS market space. They can get great compression ratios with lightweight compression algorithms, and are highly I/O efficient. In my opinion, the only reason why there are companies on the above list that are not column-stores is that they wanted to accelerate time to market by extending previously existing DBMS code, and the most readily available DBMS code at the time was a row-store. Any DBMS built from scratch for the (relational, structured data) analytical DBMS market should be a column-store.

  2. Column-stores are so I/O efficient that CPU and/or memory usually become bottlenecks very quickly. Most column-stores do very careful optimizations to eliminate these bottlenecks. But to me, VectorWise has gone the extra mile. The query operators are run via a set of query execution primitives written in low-level code that allows compilers to produce extremely efficient processing instructions. Vectors of 100-1000 values within a column get pipelined through a set of query operations on that column, with many values typically being processed in parallel by the SIMD (single instruction, multiple data) capabilities of modern CPU chips. Most database systems are unable to take advantage of SIMD CPU capabilities: the tuple-at-a-time (iterator) processing model of most database systems is just too hard for compilers to translate to SIMD instructions. VectorWise has gone to great lengths to make sure their code results in vectorized CPU processing. Their execution primitives are also written to allow CPUs to do efficient out-of-order instruction execution via loop-pipelining (although compilers are supposed to discover opportunities for loop-pipelining on their own, without carefully written code, this doesn’t happen in practice as often as it should). So with highly optimized CPU-efficient code, along with (1) operator pipelining to keep the active dataset in the cache and (2) column-oriented execution reducing the amount of data that must be shipped from memory to the CPU, VectorWise reduces the CPU and memory bottlenecks in a major way. The bottom line is that VectorWise is disk efficient AND memory efficient AND CPU efficient. This gets you the total performance package.
  3. Their founders include Peter Boncz and Marcin Zukowski from CWI. I generally think highly of the DBMS research group at CWI, except for one of their papers which ... actually ... maybe it’s better if I don’t finish this sentence. I have spoken highly about them in previous posts on my blog (see here and here).
  4. It looks likely that their solution will be released open source. I was unable to get a definite commitment from Boncz or Zukowski one way or another, but the general sense I got was that an open source release was likely. But please don’t quote me on that.
  5. If the VectorWise/Ingres solution does get released open source, I believe it will be an excellent column-store storage engine for HadoopDB. I have already requested an academic preview edition of their software to play with.
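
To make the vector-at-a-time idea from point 2 concrete, here is a minimal sketch in Python. It is not VectorWise code: the primitive names (select_less_than, multiply_selected) and the vector size of 1024 are my own assumptions for illustration, and Python will not emit SIMD instructions. The point is only the shape of the control flow: each primitive is one tight loop over a small vector, which is the kind of loop a C compiler has a fair chance of turning into SIMD code, whereas the tuple-at-a-time version mixes all the per-row logic together.

```python
from array import array

VECTOR_SIZE = 1024  # assumed value; the post cites vectors of 100-1000 values


def tuple_at_a_time(prices, quantities, max_qty):
    """Iterator-style baseline: each row passes through all the logic at once."""
    total = 0.0
    for price, qty in zip(prices, quantities):
        if qty < max_qty:
            total += price * qty
    return total


# Hypothetical execution primitives: each one is a single tight loop
# over one small vector of column values.
def select_less_than(values, bound):
    """Return the positions within the vector whose value is below bound."""
    return [i for i, v in enumerate(values) if v < bound]


def multiply_selected(left, right, positions):
    """Multiply two column vectors, but only at the selected positions."""
    return [left[i] * right[i] for i in positions]


def vector_at_a_time(prices, quantities, max_qty, vector_size=VECTOR_SIZE):
    """Push one small vector at a time through the pipeline of primitives."""
    total = 0.0
    for start in range(0, len(prices), vector_size):
        price_vec = prices[start:start + vector_size]
        qty_vec = quantities[start:start + vector_size]
        positions = select_less_than(qty_vec, max_qty)
        total += sum(multiply_selected(price_vec, qty_vec, positions))
    return total


# Two columns stored separately, as a column-store would keep them.
prices = array("d", [float(i % 7 + 1) for i in range(5000)])
quantities = array("q", [i % 50 for i in range(5000)])
```

Both functions compute the same filtered sum of price * quantity; only the order in which the work is organized differs.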

In the interest of full disclosure, here are a few limitations of VectorWise:


  • It is not a shared-nothing, MPP DBMS. It runs on a single machine. This limits its scalability to low numbers of terabytes. However, VectorWise is targeting the same “mass market” that Kickfire is, where the vast majority of data warehouses are less than 10TB. Furthermore, as mentioned above, it is a great candidate to be turned into a shared-nothing, parallel DBMS via the HadoopDB technology, and I look forward to investigating this opportunity further.
  • In my experience, large amounts of low-level, CPU-optimized code are hard to maintain over time, and this might limit how nimble VectorWise can be in taking advantage of new opportunities. Portability might also become a concern (in the sense that not all optimizations will work equally well on all CPUs). However, I would not put anything past such a high quality technical team.
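
For readers unfamiliar with what "shared-nothing" would mean for a single-machine engine, here is a rough sketch of the general pattern, assuming a simple sum-per-key aggregate. This is not HadoopDB's API or architecture in any detail; every function name is hypothetical, and the round-robin partitioning is chosen only to keep the example short. Each "node" stands in for an independent single-machine engine that sees only its own partition, and a coordinator merges the partial results.

```python
from collections import defaultdict


def partition_rows(rows, num_nodes):
    """Spread rows across nodes round-robin (an assumed, simplistic policy)."""
    partitions = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        partitions[i % num_nodes].append(row)
    return partitions


def local_sum_by_key(rows):
    """What one single-node engine computes over its own partition."""
    sums = defaultdict(float)
    for key, value in rows:
        sums[key] += value
    return dict(sums)


def merge_partials(partials):
    """Coordinator step: combine the per-node partial aggregates."""
    merged = defaultdict(float)
    for partial in partials:
        for key, subtotal in partial.items():
            merged[key] += subtotal
    return dict(merged)


def parallel_sum_by_key(rows, num_nodes):
    """Partition, aggregate locally on each node, then merge."""
    partitions = partition_rows(rows, num_nodes)
    return merge_partials(local_sum_by_key(p) for p in partitions)


sales = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 1.5), ("east", 4.0)]
```

Calling parallel_sum_by_key(sales, 3) gives the same totals as running local_sum_by_key(sales) on one machine, which is the property that lets a fast single-node engine serve as the per-node building block.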

Two final notes:

  • I like their go-to-market strategy. Like Infobright and Kickfire they are after the low-priced, high volume analytical DBMS mass market. But the problem with the mass market is that you need a large global sales and support team to handle the many opportunities and customers. Startups that target the high-end have it much easier in that they can get through the early stages of the company with a few high-priced customers and don’t need to invest as much in sales and support. By getting into bed with Ingres, VectorWise gets to immediately take advantage of the global reach of Ingres, a key asset if they want to target the lower end of the market.
  • CWI is also the creator of the open source MonetDB column-store. VectorWise is a completely separate codeline, and makes several philosophical departures from MonetDB. According to VectorWise, MonetDB’s materialization of large amounts of intermediate data (e.g., from running operators to completion) makes it less scalable (more suited for in-memory data sets) than VectorWise. VectorWise has superior pipelined parallelism, and vectorized execution. I have not checked with the MonetDB group to see if they dispute these claims, but my knowledge from reading the MonetDB research papers is generally in line with these statements, and my understanding is that the MonetDB and VectorWise teams remain friendly.
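
The materialization point can be illustrated with a small sketch. This is not MonetDB or VectorWise code, just two hypothetical functions of my own that run the same filter-then-double-then-sum query: one runs each operator to completion and keeps whole intermediate columns around, the other pipelines fixed-size vectors (1024 values, an assumed size) so intermediates never grow beyond one vector. Each function also reports the largest intermediate it had to hold.

```python
def operator_at_a_time(column, bound):
    """Run each operator to completion, materializing full intermediates."""
    selected = [v for v in column if v < bound]   # full intermediate column
    doubled = [v * 2 for v in selected]           # another full intermediate
    peak = max(len(selected), len(doubled))
    return sum(doubled), peak


def pipelined_vectors(column, bound, vector_size=1024):
    """Pipeline small vectors through all operators before fetching the next."""
    total = 0
    peak = 0
    for start in range(0, len(column), vector_size):
        vec = column[start:start + vector_size]
        selected = [v for v in vec if v < bound]  # at most vector_size values
        doubled = [v * 2 for v in selected]
        peak = max(peak, len(selected), len(doubled))
        total += sum(doubled)
    return total, peak


column = list(range(100_000))
materialized_total, materialized_peak = operator_at_a_time(column, 50_000)
pipelined_total, pipelined_peak = pipelined_vectors(column, 50_000)
```

The two totals are equal, but the materializing version holds an intermediate whose size grows with the data, while the pipelined version's intermediates are capped at the vector size and so can stay in the CPU cache.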

12 comments:

  1. Daniel, you might have missed XSPRADA off the top of your head :) By my last count there are about 25 of us (not including Groovy, XtremeData, Sensage, tenbase etc..)

    "Like Infobright and Kickfire they are after the low-priced, high volume analytical DBMS mass market." - I'm curious where you infer this from?
    Thanks.

  2. Sorry Jerome!

    I was bound to miss a few given how quickly I put the list together.

    As far as their target market, I think it is possible to infer from the fact that they don't scale to large numbers of terabytes (they only run on a single machine). However, I didn't have to infer, since Peter Boncz (one of their founders) told me that directly :)

    IMHO, Infobright is the best comparable to Vectorwise, given that it is also "software-only", open-source friendly, and not shared-nothing. But the underlying technology is still very different. I expect Ingres/VectorWise, Infobright, and Kickfire to all be competing for the same customers.

  3. Correct me if I'm wrong, but the inference comes from the fact that they are limited to the sub 10TB market due to resource limitations in a single vertical solution.

    I'm not sure that Infobright is really targeting the same market as Kickfire. While both are MySQL based, and both are column stores, the similarity ends there. Infobright is targeting large warehouses which are very denormalized, and have short dimensions. It is fairly limited in these regards, and joining against large dimensions is very expensive.

    Kickfire prefers larger dimensions over short ones and requires far less denormalization. We support foreign keys and indexes along with a good deal more MySQL syntax than Infobright.

    Infobright can't run TPC-H, and while the benchmark itself has pros and cons, I think we can agree that the complexity of SQL executed by the TPC-H benchmark is not likely to be more complex than the most complex queries generated by analytical applications.

    The schema is weird, granted, but the subqueries, joins and other SQL features used by TPC-H are likely to be used by most apps, so I think using them in production would be a challenge, particularly in ad hoc query environments.

    So, I guess I wouldn't really bucket Kickfire and Infobright together, but that is just my opinion.

  4. Hi Justin,

    I definitely agree that Kickfire, Infobright, and VectorWise all have VERY different technologies under the covers and will have different performance characteristics on different workloads. But at 10000 feet, if I'm a customer with a small budget, and have a data set that is on the order of hundreds of gigabytes and not growing rapidly, I'm going to probably consider all three, and then decide from there which one is best suited for my workload. All the other solutions I mentioned might also work, but are generally designed for orders of magnitude more data and the higher end of the market.

  5. Hello Daniel,

    As for comparison between MonetDB and VectorWise, I would look at MonetDB/X100 paper at CIDR 2007.

    Also, I fully agree with your response to Justin's comment. Some of Infobright's issues that have been mentioned are now solved in 3.2 RC1. After all, however, the key point is to adjust optimal solution to the given workload, not forgetting about the data size and the budget.

    Best greetings from Infobright,

    Dominik

  6. After looking at the web site and reading Ingres/VectorWise Sneak Preview WP there are many more questions than answers:
    1) why only Q1 performance was disclosed (MonetDB/X100 ran some other queries also)
    2) why it was only at 6M rows scale (toy size as modern datamarts go)? Server has enough RAM (48GB upgradable to 72GB) for 600M rows (SF100) for sure - and most likely 1800M (SF300) is also possible due to column storage + compression. Granted, 30M rows/sec is very impressive - especially for a single core - but will it hold for larger sizes? And again, MonetDB/X100 did run Q1 (and some other queries) at SF100 (600M rows) on weaker hardware 2 years ago.
    3) can it use multiple cores?
    4) is it anywhere near beta or RC?
    Overall certainly very interesting.

  7. This comment has been removed by the author.

  8. (re-entered to correct a typo)

    Hi Igor,

    Thanks for your interest and comments. We are very excited and energized by all the buzz around VectorWise on the Internet!

    The introduction and conclusion of our Whitepaper try to state that at this point we just want to be very meticulous with what benchmark results we communicate, specifically regarding benchmarks of the TPC: we do realize that different standards apply to results in an academic paper as to those mentioned by a commercial vendor for a product (even an open source one).

    Indeed, as you allude to, even the first MonetDB/X100 publication (http://old-www.cwi.nl/htbin/ins1/publications?request=intabstract&key=BoZuNe:CIDR:05) contained results for *all* TPC-H queries on the SF=100 datasize. The system has only gotten faster and more scalable since.

    The reason to disclose this particular data point in the motivation section is that we want to show there how dire the situation is: how shockingly computationally inefficient traditionally architected database systems are. This particular query performs a series of calculations and aggregations, which are intense enough to typically be balanced with I/O resources (i.e., the query is CPU bound). Also, the query contains no joins and the aggregation result is small, hence it scales linearly with data size, and also with the number of cores used (it parallelizes trivially). Therefore, it does not really matter on which size the result is reported.

    The question on multiple cores is a product feature question. The only thing we wanted to do yesterday was to announce the existence of the Ingres VectorWise project, which aims to bring what we think of as the fastest query engine around to a mass market under an open source license (eventually). Product announcements will follow in due time.

    I have to give the same response to the question on availability. However, I can say that we would not be announcing this exciting project yet if we thought that this will be released in, say, 2011.

    Let me close with the last sentence of the Whitepaper: stay tuned for more..

    PD: thanks Dan, your post is deep, factually correct from my point of view, and your opinions are appreciated!

  9. So is it going to be an open-source product? Then it's even more interesting. I still think that it would be easy to publish SF100 results - and hope it'll happen soon as well as - and more important - some kind of early version availability.

  10. Igor, yes, we hope this will become open-source eventually. As for the TPC-H results, I am sure you will hear about it if we publish anything.

  11. This is an approach that I've looked at before, and I think it has some potential. Most compression people never look beyond serial bitstream processing, and there's a vast unexplored area of compressed objects that support vectorized operations such as joins.

  12. Vectorwise TPC-H results are out.
    Single node Vectorwise is over 3.5 times faster than MS SQL 2008 R2 on similar hardware and faster than heavily clustered MPP systems from Oracle, IBM and ExaSQL.
    Check the link: http://deepcloud.co/web1/?p=108

    We are using it in the world's first Vectorwise based MPP solution with OpenMPI, OpenFabrics and LibDrizzle
