Comments on DBMS Musings: "Exadata's columnar compression"

Anonymous (2012-08-02):

If I read the Oracle white paper correctly, it states clearly on page 5:

".. Note that data remains compressed not only on disk, but also remains compressed in the Exadata Smart Flash Cache, on InfiniBand, and in the database server buffer cache, as well as when doing backups or log shipping to Data Guard."

rebellio (2012-05-24):

I had the doubtful pleasure of working with the Greenplum product. It is one of the worst enterprise products I have ever worked with. The product is immature and doesn't come near the capabilities of Oracle Exadata. Greenplum is where Oracle was 15 years ago. EMC will try to sell you the illusion that it is the same or better, but they are selling a bad product.

The product has a lot of limitations (for example: no ability to connect to external databases, no replication capabilities, dependency on the physical infrastructure when performing backup and restore, and development tools that are not good and not comfortable to use). If you read the documentation you will find about five pages of limitations, two of which are that you can use the column-oriented storage option and compression only on append-only tables. That's right, you read that correctly: you can use the strongest features of Greenplum only on tables on which you never perform updates or deletes.

Since the product is not very popular (at least yet), there will be times when you experience issues such as bugs and behavior not known to anyone else, including the Greenplum support and development teams. The knowledge base simply doesn't exist on the web, and you will be fully dependent on EMC's professional team.

If you don't believe me, read the Greenplum documentation, and at least don't pay before you check the product from the inside out.

Some of EMC's products are good, but this is not one of them.

Sam X (2012-01-27):

Now that you are the CTO of Hadapt, maybe you will show how Hadapt outperforms, say, Greenplum at a Hadoop meetup?

Sam X (2012-01-27):

Daniel, where would one find direct comparisons of Exadata vs. Vertica vs. Greenplum? Or do vendors block the publication of such benchmarks by their clients?

Sam
jiri (2010-03-22):

Very interesting article!

A few thoughts ...
1. I also think that Oracle decompresses the data at the OS level before sending it to the Oracle database. The paper somewhat says it, but mostly the fact that this feature is available only on Exadata tells me that this probably happens at the storage/OS level.

2. I believe Teradata adds an interesting twist to how they compress data. In general their compression is not as broad or as good as Oracle's, but I like their dictionary approach; it can be very handy in some specific scenarios (a generic sketch of the idea follows this comment).

3. The Oracle whitepaper is (and I'm sorry to say it) a joke. It's more a sales pitch than a serious whitepaper about compression. There are not that many different Exadata hardware appliance configurations, so it should be very easy to run and publish solid, accurate stats (not only I/O and space, but also CPU).

4. I would like to see (in the Oracle whitepaper) where this compression does not work, or better, where it cannot be implemented. I'm sure there are a lot of exceptions.
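As a generic illustration of the dictionary approach jiri mentions in point 2, here is a minimal sketch of value-dictionary encoding for a single column. This is plain Python written for this thread, not Teradata's (or Oracle's) actual implementation, and the function names are invented. The idea is simply that each distinct value is stored once and the column itself is reduced to small integer codes, which pays off on low-cardinality columns.

# Minimal sketch of value-dictionary compression for one column.
# Generic illustration only; not how Teradata or Oracle actually implement it.

def dictionary_encode(column_values):
    """Return (dictionary, codes): each distinct value stored once,
    the column itself reduced to small integer codes."""
    dictionary = []   # code -> value
    code_of = {}      # value -> code
    codes = []
    for value in column_values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Reverse the encoding."""
    return [dictionary[code] for code in codes]

if __name__ == "__main__":
    # Low-cardinality column: a handful of distinct strings repeated many times.
    country = ["US", "US", "DE", "US", "FR", "DE", "US", "FR"] * 1000
    dictionary, codes = dictionary_encode(country)
    assert dictionary_decode(dictionary, codes) == country
    print(f"{len(country)} values, {len(dictionary)} dictionary entries")

Real systems layer more on top of this (per-block dictionaries, bit-packing of the codes, operating directly on codes during scans), but the basic space saving comes from this substitution.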
Daniel Abadi (2010-01-20):

Hi Mark,

I haven't read the Vertica whitepaper (and wouldn't want to speak for them anyway), but for the C-Store work it was definitely the case that if you had too many updates (or even inserts) then the write-optimized buffer would get too large and slow down read queries (because read queries have to go to both the read-optimized store and the write-optimized buffer unless the user is willing to receive results from a historical snapshot). So if you want optimal performance on read-only queries, you don't want to intermingle too many write queries. The best way to figure out how much write load you can take before things go bad is not so much the percentage of queries that are update queries (as in Vertica's "rule of thumb"), but rather the rate at which data is being added to the write buffer.

The size of the write buffer is really the only thing that can cause major problems in performance. The things you mention, while accurate, cause at worst a factor-of-2 slowdown in write performance, which one gladly pays in exchange for the factor of 10+ gain in read performance. Hence, inserts and updates are usually not distinguished when talking about write queries, since they cause the write buffer to fill up equally fast (even though, as you say, updates are more costly than inserts for other reasons).

The main point here is that there is an enormous difference in write performance between column-stores with write buffers and column-stores without write buffers: without a write buffer, a column-store has to turn each individual tuple insert into n random writes, where n is the number of columns in the tuple. This causes the order-of-magnitude slowdown that negates much of the benefit of column-stores. Since most column-stores have write buffers, I believe people should assume the existence of the write buffer when discussing column-store write performance.

Mark Callaghan (2010-01-20):

Can you explain why the Vertica white paper states "a good rule of thumb is that fewer than 1% of the total SQL statements should be DELETEs or UPDATEs"?

Does the C-Store family solve the I/O bottleneck for inserts and updates, or only for inserts?

After re-reading the C-Store overview and the Vertica white paper, my vague understanding is:

1) Since updates are done as a delete plus an insert, they may require twice the work when merging into the ROS (once to remove, once to add).

2) Updates and deletes require random reads from the ROS to determine that the rows exist; inserts do not.

What is done for primary key and unique constraints? In that case random reads must also be done during inserts.

Can we expect a paper on optimizing for I/O during the merge process? The LSM paper has many details on that.
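To make the write-buffer discussion in Daniel's reply above (and the ROS/WOS questions in Mark's comment) concrete, here is a toy sketch of a column-store with a row-oriented write-optimized buffer. It is plain Python with invented class and method names, not C-Store or Vertica code, and it ignores sorting, compression, and concurrency: inserts go to the buffer, an update is modeled as a delete plus a re-insert, reads consult both the per-column storage and the buffer, and a merge step flushes buffered rows with one sequential append per column rather than one random write per column per row.

# Toy column-store with a write-optimized buffer (WOS) and per-column
# read-optimized storage (ROS). Illustration only; invented names, not
# C-Store/Vertica internals.

class ToyColumnStore:
    def __init__(self, columns):
        self.columns = columns
        self.ros = {c: [] for c in columns}   # per-column storage (the ROS)
        self.ros_deleted = set()              # positions logically deleted from the ROS
        self.wos = []                         # row-oriented write buffer: list of dicts

    def insert(self, row):
        # One cheap append into the row-oriented buffer; without a WOS this
        # would instead be len(self.columns) random writes, one per column file.
        self.wos.append(dict(row))

    def delete(self, key_col, key_val):
        # Logical delete: mark matching ROS positions, drop matching WOS rows.
        for pos, val in enumerate(self.ros[key_col]):
            if val == key_val and pos not in self.ros_deleted:
                self.ros_deleted.add(pos)
        self.wos = [r for r in self.wos if r[key_col] != key_val]

    def update(self, key_col, key_val, changes):
        # As in the discussion above: an update is a delete followed by an insert.
        old_rows = [r for r in self.scan() if r[key_col] == key_val]
        self.delete(key_col, key_val)
        for r in old_rows:
            r.update(changes)
            self.insert(r)

    def scan(self):
        # Reads must consult both the ROS and the WOS for up-to-date answers.
        n = len(next(iter(self.ros.values()))) if self.ros else 0
        for pos in range(n):
            if pos not in self.ros_deleted:
                yield {c: self.ros[c][pos] for c in self.columns}
        for row in self.wos:
            yield dict(row)

    def merge(self):
        # Flush the WOS into the ROS: one sequential append per column,
        # amortized over all buffered rows.
        live = {c: [v for p, v in enumerate(col) if p not in self.ros_deleted]
                for c, col in self.ros.items()}
        for row in self.wos:
            for c in self.columns:
                live[c].append(row[c])
        self.ros, self.ros_deleted, self.wos = live, set(), []


if __name__ == "__main__":
    store = ToyColumnStore(["id", "city", "amount"])
    store.insert({"id": 1, "city": "NYC", "amount": 10})
    store.insert({"id": 2, "city": "SFO", "amount": 20})
    store.merge()
    store.update("id", 2, {"amount": 25})      # delete + re-insert into the WOS
    print(sorted((r["id"], r["amount"]) for r in store.scan()))  # [(1, 10), (2, 25)]

The merge step is roughly where an LSM-style treatment of I/O (as in the paper Mark mentions) would apply, since that is the only point at which the column files are rewritten.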