Monday, March 29, 2010

Distinguishing Two Major Types of Column-Stores

I have noticed that Bigtable, HBase, Hypertable, and Cassandra are being called column-stores with increasing frequency (e.g. here, here, and here), due to their ability to store and access column families separately. This makes them appear to be in the same category as column-stores such as Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright, which also are able to access columns separately. I believe that calling both groups of systems column-stores has lead to a great deal of confusion and misplaced expectations. This blog post attempts to clear up some of this confusion, highlighting the high level differences between these groups of systems. At the end, I will propose some potential ways to rename these groups to avoid confusion in the future.

For this blog post, I will refer to the following two groups as Group A and Group B:
  • Group A: Bigtable, HBase, Hypertable, and Cassandra. These four systems are not intended to be a complete list of systems in Group A --- these are simply the four systems I understand the best in this category and feel the most confident discussing.

  • Group B: Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel, and Infobright. Again, this is not a complete list, but these are the systems from this group I know best. (Row/column hybrid systems such as Oracle or Greenplum are ignored from this discussion to avoid confusion, but the column-store aspects of these systems are closer to Group B than Group A.)
Differences between Group A and Group B
  1. Data Model. Group A uses a multi-dimensional map (something along the lines of a sparse, distributed, persistent multi-dimensional sorted map). Typically a row-name, column-name, and timestamp are sufficient to uniquely map to a value in the database. Group B uses a traditional relational data model. This distinction has caused great confusion. People more familiar with Group A are very much aware that Group A does not use a relational data model and assume that since Group B are also called column-stores, then Group B also does not use a relational data model. This has resulted in many intelligent people saying “column-stores are not relational”, which is completely incorrect.

  2. Independence of Columns. Group A stores parts of a data entity or “row” in separate column-families, and has the ability to access these column-families separately. This means that not all parts of a row are picked up in a single I/O operation from storage, which is considered a good thing if only a subset of a row is relevant for a particular query. However, column-families may consist of many columns, and these columns within column-families are not independently accessible.

    Group B stores columns from a traditional relational database table separately so that they can be accessed independently. Like Group A, this is useful for queries that only access a subset of table attributes in any particular query. However, the main difference is that every column is stored separately, instead of families of columns as in Group A (this statement ignores fine-grained hybrid options within Group B).

  3. Interface. Group A is distinguished by being part of the NoSQL movement and does not typically have a traditional SQL interface. Group B supports standard SQL interfaces.

  4. Optimized workload. Group B is optimized for read-mostly analytical workloads. These systems support reasonably fast load times, but high update rates tend to be problematic. Hence, data warehouses are an ideal market for Group B, since they are typically bulk-loaded, require many complex read queries, and are updated rarely. In contrast, Group A can handle a more diverse set of application requirements (Cassandra, in particular, can handle a much higher rate of updates). Group B systems tend to struggle on workloads that “get” or “put” individual rows in the data set, but thrive on big aggregations and summarizations that require scanning many rows as part of a single query. In contrast, Group A generally does better for individual row queries, and does not perform well on aggregation-heavy workloads. Much of the reason for this difference can be explained in the “pure column” vs “column-family” difference between the systems. Group A systems can put attributes that tend to be co-accessed in the same column-family; this saves the seek cost that results from column-stores needing to find different attributes from the same row in many different places. Another reason for the difference is the storage layer implementation, explained below.

  5. Storage layer. Assume the following customer table



    Although there is some variation within the systems in Group B, to the first order of approximation, this group will store the table in the following way:

    (ID) 1, 2, 3, 4, 5, 6

    (First Name) Joe, Jack, Jill, James, Jamie, Justin

    (Last Name) Smith, Williams, Davis, Miller, Wilson, Taylor

    (Phone) 555-1234, 555-5668, 555-5432, NULL, 555-6527, 555-8247

    (Email) jsmith@gmail.com, jwilliams@gmail.com, NULL, jmiller@yahoo.com, NULL, jtaylor@aol.com

    Note that each value is stored by itself, without information about what row or column it came from. We can figure out what column it came from since all values from the same column are stored consecutively. We can figure out what row it came from by counting how many values came before it in the same column. The fourth value in the id column matches up to the same row as the fourth value in the last name column and the fourth value in the phone column, etc. Note that this means that columns that are undefined for a particular row must be explicitly stored as NULL in the column list; otherwise we can no longer match up values based on their position in their corresponding lists.

    Meanwhile, systems in Group A will either explicitly store the row-name, the column-name or both with each value. E.g.: row2, lastname: Williams; row5, phone: 555-6527, etc. The reason is that Group A uses a sparse data model (different rows can have a very different set of columns defined). Storing NULLs for every undefined column could soon lead to the majority of the database being filled with NULLs. Hence these systems will explicitly have column-name/value pairs for each element in a row within a column-family, or row-name/value pairs for each element within a single column column-family. (Group A will also typically store a timestamp per value, but explaining this will only complicate this discussion).

    This results in Group B typically taking much less space on storage than Group A (at least for structured data that easily fits into a relational model). Furthermore, by storing just column-values without column-names or row-names, Group B optimizes performance for column-operations where each element within a column is read and an operation (like a predicate evaluation or an aggregation) is applied. Hence, the data model combined with the storage layer implementation results in wildly different target applications for Group A and Group B.


Renaming the Groups

Clearly along each of these five dimensions, Group A and Group B are extremely different. Consequently, even though calling them both column-stores has some advantages (it makes it seem like the “column-store movement” is large and a really hot area), I believe that lumping Group A and Group B together does more damage than good, and that we need to make a greater effort to avoid confusing these two groups in the future. Here are some suggestions for names for Group A and Group B towards this goal:

  • Group A: “column-family store”
    Group B: “column-store”

    (The problem here is that Group B doesn’t get a new name, which means that “column-store” could either refer to Group B or both Group A/B)

  • Group A: “non-relational column-store”
  • Group B: “relational column-store”

  • Group A: “sparse column-store”
  • Group B: “dense column-store”

Of these, the relational/non-relational distinction is probably the most important, and would be my vote for the new names. If you have a different idea for names, or want to vote on one of the above schemes, please leave a comment below.


Addendum:

Mike Stonebraker e-mailed me his vote which I reproduce here with his permission:
Group A are really row stores. I.e. they store a column family in a row-by-row fashion. Effectively, they are using materialized views for column families and storing materialized views row-by-row. Systems in Group B have a sophisticated column-oriented optimizer -- no such thing exists for Group A.
He then went on to call Group A "MV row-stores" and sub-categorize Group B depending on how materialized views are stored, but his overarching point was that referring to anything in Group A as a "column-store" is really misleading.


Addendum 2:

This post seems to have attracted some high quality comments. I recommend reading the comment thread.

22 comments:

  1. Thanks! this is great article - very insightful, and much needed.

    kind regards,

    Roland

    ReplyDelete
  2. You might also think of group A as following the OLAP paradigm forward.

    ReplyDelete
  3. Hey Daniel,

    Your blog is always a good read--keep it up.

    As founder for the LucidDB project, I appreciate your effort here, since I often see people ignorantly lump LucidDB into the NoSQL category for this very reason, even though it is firmly in Group B.

    However, since as part of my work at Facebook I'm currently responsible for integrating HBase with Hive, I don't think the distinction you're making is going to hold up as a bright line forever. Hive now provides both a relational mapping (distinction 1) and a SQL interface (distinction 3) on top of HBase. For distinction 5, generic compression goes some ways towards reducing the sparse-column representation overhead, although of course not as far as the typical Group B storage implementation. But even within Group B, there is quite a bit of variation in terms of the aggressiveness of the encoding.

    I agree with Stonebraker that the optimizer's awareness of the underlying storage is critical, but I expect to see Group A continuing to evolve in this respect as SQL access becomes mainstream.

    So, perhaps a good way to summarize the situation is that Group B focuses on a subset of the space addressible by Group A, and does it much more efficiently at the moment, but there's no fundamental architectural barrier preventing specializations of Group A becoming indistinguishable from Group B.

    Note that there's interest from the Cassandra folks in getting Hive integration going there too.

    JVS

    ReplyDelete
  4. It is meaningless to compare the two group, they target to different application. I think the post just make more confusion. And what Mr. Stonebraker said is also not right.

    I think the only thing which make these confusions is the term "column" in Group A. In fact, it is not traditional "column" of RDBMS area. And your example of a traditional spreadsheet table is also not the real target of Group A. The "column" name in Group A is in fact data (not schema).

    If have got to change the term, I think we can change the term "column" in Group A to "end-key".

    In short:
    (1) Group A's "column" is data, not schema.
    (2) Group B's "column" is schema.
    They are different in conception and application target.

    ReplyDelete
  5. Based on their basic column storage structure with Group A hash-map based and Group B array based, I would call

    Group A - Weakly-schemaed Column Datastore
    Group B - Strongly-schemaed Column Datastore

    ReplyDelete
  6. In fact, in bigtable, we use "qualifier" instead of "column".

    ReplyDelete
  7. Part of the confusion seems to come from the use of the word "store".

    A: Datastore that is accessed by columns
    B: Data is stored per column.

    Perhaps both should just be called database, and be prefixed with something prescriptive:

    A: Column-access database
    B: Column-store database

    ReplyDelete
  8. I think #2 distinction is not that important, as in Group A you can setup one column per column family and effectively get column storage.

    I think the main difference is having firm schema and relational model vs not having them. Also Group A basically supports only one read operation - get document by key, (and map-reduce on top of that), so I'd call them Document Store vs Relational Column Store for Group B. Or does "document store" imply something very different?

    ReplyDelete
  9. There's a lot of confusion in the database industry between the backend storage model, and the frontend API model.

    The backend model tends to be designed for performance, given an expected usage pattern; and the frontend designed for programmer convenience, given an expected usage pattern. So they both derive from the same usage pattern - but from different bits of it. The designer of the backend model might be interested in how common different operations are. The designer of the frontend model might not be so interested...

    We're feeling this keenly in the "SQL vs. NoSQL" debate. SQL vs. NoSQL is more an issue of API than of storage model, although there is some variation in the "logical" storage model (SQL makes dense tables, where all the columns are declared up front and generally widely used, easier; NoSQL systems often make it easier to be ad-hoc about what fields are in which records). However, there are widespread cries of "NoSQL is faster! SQL can't scale!", which is far from true, and confusing interface with implementation.

    I've been recorded talking briefly about these ideas at http://www.cloudbook.net/alaric-snell-pym

    My company (GenieDB) has written a replicated fault-tolerant key-value store, with various nifty properties - but, recognising that many people like to use SELECT statements, we've written a MySQL storage engine that backs onto our store. So the same data can be accessed via "NoSQL" or SQL interfaces, depending on needs. This makes it easier to migrate existing applications, and makes it easier for existing developers to use our stuff without having to learn everything from scratch.

    ReplyDelete
  10. It makes you wonder why are column stores part of NoSQL.

    The way I see it things like document stores, graph databases, etc that are horizontally scalable and have feature of high availability are NoSQL systems. Column stores look like reinventing the (relational) wheel and should be disregarded as such. This doesn't mean the technology isn't useful, and the advances should be disregarded. Just that it shouldn't be branded as NoSQL.

    What I think is interesting is the idea of storing our business objects in a natural form and be able to access them using sophisticated information retrieval techniques. And databases like CouchDB are much more exciting if you are thinking like this.

    ReplyDelete
  11. Wow --- what a great comment thread! I definitely recommend anybody who reads the post also reads the comment thread. I will add another addendum to the blog post to this effect.

    ReplyDelete
  12. The outstanding definition of Group A systems is given by the post. Systems like Cassandra and Bigtable are "sparse, distributed, persistent multi-dimensional sorted maps". Therefore, the following names suits best:

    Group A: persistent (distributed) map
    Group B: column store

    This name change is a big win for Group A, as it avoids the analogies with relational systems. In addition, modern programming languages have a map/dictionary data structure built in, and developers are used to it, so it will easy the path of adoption of Group A systems.

    Early adopters have a hard time understanding the concepts of NoSQL systems because of the use of words like table, column, and row that are not well suited for its particular context. Therefore, these names need to change too. I propose the following name change for systems in Group A:

    row -> key entry
    table -> keyspace
    column -> Attribute
    column family -> Attribute Set

    Well, Cassandra seems to be half-way through this name change, but I think other systems in Group A should follow its example.

    ReplyDelete
    Replies
    1. I totally agree with you.

      I think group A databases should avoid the use of columns, in fact they are to blame for all the confusion. They could have used any other word, like key-value, dictionary, or hash-map.

      The fact that the keys are part of the data and not part of the schema is the main difference, that's where the sparseness comes from.

      I think having a standard naming convention for group A is the most important thing right now.

      Delete
  13. Thanks! this is very clear explanation.
    I want to add SADAS (sadasdb.com) in Column Store DBMS family.

    ReplyDelete
  14. This comment has been removed by the author.

    ReplyDelete
  15. Thanks , Good Explanation. I have a little confusion regarding storage difference between the group A and group B.
    In group A , column family , store values row-wise but the column family is column wise, eg. in above case group A will store,
    row 1, first name : joe, last name :smith
    row 2, fist name : jack last name williams
    ...
    and for Group B, you already mentioned. So
    just want to confirm, is this the case, that I mentioned ?

    ReplyDelete
  16. I would at this point characterize Group A as a row-oriented tuple store since each row consists of a set of name/value tuples with the constraint that the names are unique. Column families do throw a bit of a wrench into this definition, but since the lowest-level pattern of data is simply a set of tuples, I think it still makes sense.

    ReplyDelete
  17. This post is really good .. its very much useful ... thanks for ur info

    ReplyDelete
  18. A bunch of thoughts:
    - Groups A and B are really storage mechanisms that could be independent and used in parallel (e.g., B storage is wasteful if most values are null, so adopt sparse techniques ala A)
    - Group A is problematic because of the composite values. Is it a value store or an index? Search on trailing column values requires scans.
    - I don't see A as MVs, because there is no base table. This is a standard horizontal partition of an entity.
    - Group A exists because the datastore designers opted for a lot of (storage expensive) strategies to keep the shards independent and reduce interaction. A lot of concurrency work can be done blind to other parts. Storing metadata with data distributes dictionary, allowing schema evolution without updates (which is possible in Group B, but would require versioning both in metadata and data.)
    - Both groups rely on ordering for non-scan searches. Indexing is an interesting topic around both groups
    - Group A closely matches what all of us thought DBMS would evolve to internally over time. A conceptual schema is defined, and automatically digested by an automated process analyzing use characteristics and creating the physical schema. It is just sad that trained professionals are used as compilers in this context.

    ReplyDelete
  19. I'm not sure that this is relevant, but...
    Oracle attacks this problem in a different way.
    The Oracle optimizer will chose to scan an index (never reading the subject table) if that index contains all the columns required by the query. This makes the index a kind of row-by-row (Group A) extract of the table, arranged as a B-tree for fast searching. An obvious disadvantage is that the data is duplicated, but an upside is that one is not restricted to a given partitioning, but can have as many indexes as are useful. I should add that duplicate values in the index (particularly common in reference/dimension-type values) can be compressed, and nulls are, of course, absent.

    ReplyDelete
  20. Great article! Differences among the "column stores" can be hard to put your finger on at first. The taxonomy you proposed makes perfect sense.

    ReplyDelete