Comments on DBMS Musings: Distinguishing Two Major Types of Column-Stores

2021-11-29T01:27:52.185-08:00

I've noticed that the term "modern analytical systems" is now popularly used to include Group B?

2021-11-29T01:23:36.005-08:00

Cannot agree more.

2019-07-15T17:54:07.122-07:00

Perhaps another difference is that Group B uses B-tree-variant storage while Group A uses LSM-tree/SSTable-variant storage?

2019-01-04T03:00:01.900-08:00

Still a great explanation in 2019! Very clear, and maybe the only post that deals correctly with these two very different types of database: relational column-oriented DBs and NoSQL column-family DBs. May the hype stop borrowing an existing term to name a completely different new thing...

2017-04-08T07:10:31.575-07:00

Hats off. I was tearing my hair out and then found this. Thank you.

2017-01-04T10:17:16.232-08:00

Excellent article, and good explanations in the comments too. +1 to @edward for the comment.

2016-09-02T06:43:36.704-07:00

Very helpful comparison, thanks.

2016-07-16T03:42:51.215-07:00

I agree with Edward, but a small wrinkle
Group A is a "Distributed map of maps" or "Distributed Map of sorted maps".
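The "map of sorted maps" description can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `SortedMapOfMaps` class and its method names are hypothetical, a single in-process dictionary stands in for the distributed part, and the sample rows are made up.

```python
import bisect


class SortedMapOfMaps:
    """Toy, single-node stand-in for a Group A store (hypothetical API)."""

    def __init__(self):
        # Outer map: row key -> list of (column name, value) pairs,
        # kept sorted by column name.
        self._rows = {}

    def put(self, row_key, column, value):
        columns = self._rows.setdefault(row_key, [])
        names = [name for name, _ in columns]
        i = bisect.bisect_left(names, column)
        if i < len(columns) and columns[i][0] == column:
            columns[i] = (column, value)        # overwrite an existing column
        else:
            columns.insert(i, (column, value))  # insert, preserving sort order

    def get_row(self, row_key):
        # Return a copy so callers cannot disturb the sorted order.
        return list(self._rows.get(row_key, []))


store = SortedMapOfMaps()
store.put("row1", "last_name", "smith")
store.put("row1", "first_name", "joe")
store.put("row2", "first_name", "jack")  # row2 has no last_name: rows are sparse

print(store.get_row("row1"))
print(store.get_row("row2"))
```

Note that the column names live inside each row rather than in a shared schema, which is why two rows can carry different sets of columns.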

+1

2014-08-04T16:36:28.155-07:00

2014-04-18T12:35:33.844-07:00

Great article! Differences among the "column stores" can be hard to put your finger on at first. The taxonomy you proposed makes perfect sense.

2013-09-24T11:23:32.785-07:00

I totally agree with you.

I think Group A databases should avoid using the word "column"; in fact, that word is to blame for all the confusion. They could have used any other word, like key-value, dictionary, or hash-map.

The fact that the keys are part of the data and not part of the schema is the main difference, that's where the sparseness comes from.

I think having a standard naming convention for group A is the most important thing right now.

2013-08-02T01:34:21.668-07:00

I'm not sure that this is relevant, but...
Oracle attacks this problem in a different way.
The Oracle optimizer will choose to scan an index (never reading the subject table) if that index contains all the columns required by the query. This makes the index a kind of row-by-row (Group A) extract of the table, arranged as a B-tree for fast searching. An obvious disadvantage is that the data is duplicated, but an upside is that one is not restricted to a given partitioning, but can have as many indexes as are useful. I should add that duplicate values in the index (particularly common in reference/dimension-type values) can be compressed, and nulls are, of course, absent.
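The index-only access described above can be tried out with a small sketch. Note the assumption: this uses SQLite (via Python's standard `sqlite3` module) rather than Oracle, simply because it is easy to run; the `people` table, the `idx_names` index, and the sample rows are made up for illustration. SQLite reports such a plan as a "covering index".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, city TEXT)"
)
# The index holds last_name and first_name, but not city.
conn.execute("CREATE INDEX idx_names ON people (last_name, first_name)")
conn.executemany(
    "INSERT INTO people (first_name, last_name, city) VALUES (?, ?, ?)",
    [("joe", "smith", "boston"), ("jack", "williams", "austin")],
)


def plan(sql):
    """Return the optimizer's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last field of each plan row.
    return " | ".join(row[-1] for row in rows)


# Every column the query needs is in the index, so the table is never read.
covered = plan("SELECT first_name FROM people WHERE last_name = 'smith'")
# city is not in the index, so the table itself must be consulted.
not_covered = plan("SELECT city FROM people WHERE last_name = 'smith'")

print(covered)
print(not_covered)
```

The exact plan text varies between SQLite versions, but the first plan should mention a covering index and the second should not.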

2013-07-30T12:42:06.255-07:00

Fantastic disposition!!!!!

2013-03-20T07:23:48.503-07:00

A bunch of thoughts:
- Groups A and B are really storage mechanisms that could be independent and used in parallel (e.g., B storage is wasteful if most values are null, so adopt sparse techniques à la A)
- Group A is problematic because of the composite values. Is it a value store or an index? Search on trailing column values requires scans.
- I don't see A as MVs, because there is no base table. This is a standard horizontal partition of an entity.
- Group A exists because the datastore designers opted for a lot of (storage-expensive) strategies to keep the shards independent and reduce interaction. A lot of concurrency work can be done blind to other parts. Storing metadata with the data distributes the dictionary, allowing schema evolution without updates (which is possible in Group B, but would require versioning both metadata and data).
- Both groups rely on ordering for non-scan searches. Indexing is an interesting topic around both groups
- Group A closely matches what all of us thought DBMS would evolve to internally over time. A conceptual schema is defined, and automatically digested by an automated process analyzing use characteristics and creating the physical schema. It is just sad that trained professionals are used as compilers in this context.
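The point above about trailing column values needing scans can be illustrated with a small sketch. The assumptions here are mine: the data is modelled as a list sorted on a composite (row key, column name) key, and `by_row` and `by_column` are hypothetical helper names, not any real system's API.

```python
import bisect

# Composite keys (row key, column name), kept in sorted order as in a sorted map.
entries = sorted([
    (("row1", "first_name"), "joe"),
    (("row1", "last_name"), "smith"),
    (("row2", "first_name"), "jack"),
    (("row2", "last_name"), "williams"),
])
keys = [key for key, _ in entries]


def by_row(row_key):
    """Leading-component search: binary search to the start of the range."""
    start = bisect.bisect_left(keys, (row_key,))
    result = []
    for key, value in entries[start:]:
        if key[0] != row_key:
            break  # sorted order means the matching range has ended
        result.append((key[1], value))
    return result


def by_column(column):
    """Trailing-component search: the ordering does not help, so scan everything."""
    return [(key[0], value) for key, value in entries if key[1] == column]


print(by_row("row2"))
print(by_column("last_name"))
```

The first lookup touches only the matching range, while the second has to visit every entry, which is the cost the comment is pointing at.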

2012-08-06T11:12:09.118-07:00

This post is really good and very useful. Thanks for the info.

2011-07-19T15:25:27.118-07:00

I would at this point characterize Group A as a row-oriented tuple store since each row consists of a set of name/value tuples with the constraint that the names are unique. Column families do throw a bit of a wrench into this definition, but since the lowest-level pattern of data is simply a set of tuples, I think it still makes sense.

2011-04-24T09:25:43.035-07:00

Thanks, good explanation. I am a little confused about the storage difference between Group A and Group B.
In Group A, a column family stores its values row-wise, but the column family itself is column-wise. E.g., in the above case Group A will store:
row 1, first name: joe, last name: smith
row 2, first name: jack, last name: williams
...
and for Group B, you already described it. So I
just want to confirm: is this the case, as I described it?
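Assuming the reading in this comment is right, the two layouts could be sketched as below. This is only an illustration: the function names are hypothetical, a single column family is assumed for Group A, and real systems add encoding, compression, and on-disk structure that this ignores.

```python
rows = [
    {"first_name": "joe", "last_name": "smith"},
    {"first_name": "jack", "last_name": "williams"},
]


def layout_group_a(rows):
    """Row-by-row within one column family; every value carries its column name."""
    out = []
    for i, row in enumerate(rows, start=1):
        for name, value in row.items():
            out.append((f"row {i}", name, value))
    return out


def layout_group_b(rows, columns):
    """One contiguous list per column; the position in the list identifies the row."""
    return {column: [row.get(column) for row in rows] for column in columns}


for entry in layout_group_a(rows):
    print(entry)
print(layout_group_b(rows, ["first_name", "last_name"]))
```

In the first layout a missing column simply has no entry, whereas in the second it would need an explicit placeholder (`None` here) to keep positions aligned.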

2010-04-07T02:17:31.642-07:00

Thanks! This is a very clear explanation.
I want to add SADAS (sadasdb.com) to the column-store DBMS family.

2010-04-01T08:44:12.869-07:00

The outstanding definition of Group A systems is given by the post. Systems like Cassandra and Bigtable are "sparse, distributed, persistent multi-dimensional sorted maps". Therefore, the following names suit them best:

Group A: persistent (distributed) map
Group B: column store

This name change is a big win for Group A, as it avoids the analogies with relational systems. In addition, modern programming languages have a map/dictionary data structure built in, and developers are used to it, so it will ease the path to adoption of Group A systems.

Early adopters have a hard time understanding the concepts of NoSQL systems because of the use of words like table, column, and row that are not well suited for its particular context. Therefore, these names need to change too. I propose the following name change for systems in Group A:

row -> key entry
table -> keyspace
column -> attribute
column family -> attribute set

Well, Cassandra seems to be half-way through this name change, but I think other systems in Group A should follow its example.

2010-03-31T20:26:51.068-07:00

Wow --- what a great comment thread! I definitely recommend anybody who reads the post also reads the comment thread. I will add another addendum to the blog post to this effect.

2010-03-30T15:33:27.361-07:00

It makes you wonder why column stores are part of NoSQL.

The way I see it, things like document stores, graph databases, etc. that are horizontally scalable and offer high availability are NoSQL systems. Column stores look like reinventing the (relational) wheel and should be disregarded as such. This doesn't mean the technology isn't useful or that the advances should be disregarded, just that it shouldn't be branded as NoSQL.

What I think is interesting is the idea of storing our business objects in a natural form and being able to access them using sophisticated information retrieval techniques. Databases like CouchDB are much more exciting if you think this way.

2010-03-30T04:23:07.138-07:00

There's a lot of confusion in the database industry between the backend storage model, and the frontend API model.

The backend model tends to be designed for performance, given an expected usage pattern; and the frontend designed for programmer convenience, given an expected usage pattern. So they both derive from the same usage pattern - but from different bits of it. The designer of the backend model might be interested in how common different operations are. The designer of the frontend model might not be so interested...

We're feeling this keenly in the "SQL vs. NoSQL" debate. SQL vs. NoSQL is more an issue of API than of storage model, although there is some variation in the "logical" storage model (SQL makes dense tables, where all the columns are declared up front and generally widely used, easier; NoSQL systems often make it easier to be ad-hoc about what fields are in which records). However, there are widespread cries of "NoSQL is faster! SQL can't scale!", a claim that is far from true and that confuses interface with implementation.

I've been recorded talking briefly about these ideas at http://www.cloudbook.net/alaric-snell-pym

My company (GenieDB) has written a replicated fault-tolerant key-value store, with various nifty properties - but, recognising that many people like to use SELECT statements, we've written a MySQL storage engine that backs onto our store. So the same data can be accessed via "NoSQL" or SQL interfaces, depending on needs. This makes it easier to migrate existing applications, and makes it easier for existing developers to use our stuff without having to learn everything from scratch.

2010-03-30T02:33:47.279-07:00

I think distinction #2 is not that important, as in Group A you can set up one column per column family and effectively get column storage.

I think the main difference is having a firm schema and a relational model vs. not having them. Also, Group A basically supports only one read operation, get document by key (with map-reduce on top of that), so I'd call Group A a Document Store and Group B a Relational Column Store. Or does "document store" imply something very different?

2010-03-30T02:03:27.920-07:00

Part of the confusion seems to come from the use of the word "store".

A: Datastore that is accessed by columns
B: Data is stored per column.

Perhaps both should just be called database, and be prefixed with something prescriptive:

A: Column-access database
B: Column-store database