Comments on DBMS Musings: Hadoop's tremendous inefficiency on graph data management (and how to avoid it)

where do we get a code for the triple extractor fr...

2014-02-02T06:06:53.275-08:00

Where can we get the code for the triple extractor that works from an OWL file (similar to tabular form)?

Totally agree. This is where things are made so co...

2013-03-28T02:49:07.440-07:00

Totally agree. This is where things are made so confusing. Why do we need Hadoop for graph processing?

Just shows that no matter what you use, schema des...

2011-08-28T16:18:36.280-07:00

Just shows that no matter what you use, schema design and choice still "matters" - even if you think you don't have a schema :)

I am really interested in your latest publication ...

2011-08-10T07:37:17.709-07:00

I am really interested in your latest publication as referenced in this blog. However, the URL http://cs-www.cs.yale.edu/homes/dna/papers/sw-graph-scale.pdf seems to be broken.
Could you please fix that?

Very nice post, I really appreciate people sharing...

2011-08-01T10:42:50.591-07:00

Very nice post, I really appreciate people sharing knowledge.

I have been using Hadoop for a while, and recently I have had to manage graph datasets. I agree with previous comments that Hadoop is amazing for its purpose, but it is not the solution for every problem. It reminds me of a book (http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf) by Jimmy Lin explaining how to adapt MapReduce to graph problems. Why use Hadoop instead of purpose-built solutions such as Neo4j?

However, your improvement seems really impressive, and thank you again for writing these posts ;)

Daniel, I did not read your paper yet but can ans...

2011-07-28T11:49:42.418-07:00

Daniel,

I have not read your paper yet but can answer all three questions:

1. The partitioner is pluggable in Hadoop. You can create your own.
2. Replica placement is pluggable as well (not sure it's in 0.21(2), but it's definitely in trunk).
3. HDFS provides a low-level generic storage abstraction, and you can build any graph-optimized solution on top of the HDFS API.
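
To illustrate point 1, here is a minimal Python sketch of the idea behind a custom partitioner. It is not Hadoop code: the function names, the `ASSIGNMENT` map, and the toy edge list are all hypothetical, and the map is assumed to come from some offline graph-partitioning step. It only shows how routing keys through a precomputed vertex-to-partition map, instead of a plain hash, can keep connected vertices together.

```python
import zlib
from typing import Callable, Dict, List, Tuple


def hash_partition(vertex_id: str, num_partitions: int) -> int:
    """Hash-style placement: the partition depends only on the key itself."""
    # crc32 is used so the result is stable across runs (Python's hash() is salted).
    return zlib.crc32(vertex_id.encode("utf-8")) % num_partitions


def graph_partition(vertex_id: str, assignment: Dict[str, int], num_partitions: int) -> int:
    """Graph-aware placement: use a precomputed map, falling back to hashing."""
    part = assignment.get(vertex_id)
    if part is None:
        return hash_partition(vertex_id, num_partitions)
    return part % num_partitions


def cut_edges(edges: List[Tuple[str, str]], place: Callable[[str], int]) -> int:
    """Count edges whose endpoints land in different partitions."""
    return sum(1 for u, v in edges if place(u) != place(v))


# Toy graph (hypothetical): two triangles joined by the single edge c-d.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

# Assumed output of an offline partitioning step: one triangle per partition.
ASSIGNMENT = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

graph_cut = cut_edges(EDGES, lambda v: graph_partition(v, ASSIGNMENT, 2))
hash_cut = cut_edges(EDGES, lambda v: hash_partition(v, 2))
print("edges cut with graph-aware placement:", graph_cut)
print("edges cut with hash placement:", hash_cut)
```

With the assumed assignment, only the bridging edge c-d crosses partitions; the hash-based count depends on how the vertex names happen to hash.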

Thanks for the post! Educational. I certainly agre...

2011-07-23T12:08:38.502-07:00

Thanks for the post! Educational. I certainly agree that you need to pay close attention if your primary workload is running very inefficiently. Of course, if a workload uses a small fraction of your resources, runs infrequently, or is experimental, then worrying about a factor of 2 or 10 might be premature optimization. I've found that the key is exposing your users to the economics of the resources they are using in multi-tenant environments.

It will be interesting to see what can be done using the next version of Hadoop, 0.23, which will support custom execution frameworks within Hadoop clusters. If your graph fits in RAM, this should allow very efficient solutions. If not, it seems like it could still take advantage of the sorts of tricks you describe and those that HBase region servers use to do very well.

Graph partition is key to design graph algorithms ...

2011-07-22T10:38:45.764-07:00

Graph partitioning is key to designing graph algorithms on Hadoop.

Hi Daniel, we came to the same conclusion in our w...

2011-07-21T14:01:44.607-07:00

Hi Daniel, we came to the same conclusion in our work on parallel abstractions for machine learning. As a consequence, we developed an alternative computational abstraction, called GraphLab, to represent algorithms on graph-structured data. By changing both the implementation and the computational model, we were able to improve both the running time and even the theoretical performance of our learning algorithms. You can find out more about GraphLab at http://graphlab.org.

"Golden Orb" (http://www.goldenorbos.org...

2011-07-20T18:11:10.695-07:00

"Golden Orb" (http://www.goldenorbos.org/) is another open-source implementation of Pregel. You may want to benchmark against that.

Hi Daniel, Very nice work! and good point about th...

2011-07-20T15:17:19.442-07:00

Hi Daniel, very nice work, and good point about the characteristics of the benchmark data and queries. Not sure if you have seen this SIGMOD'11 paper on RDF benchmarks; they have some interesting observations there: http://dx.doi.org/10.1145/1989323.1989340

Mapper/reducer is meant for key/value-like process...

2011-07-20T14:15:04.090-07:00

Mapper/reducer is meant for key/value-like processing, not for graph and network data processing. If you are using Hadoop for network data processing, you are barking up the wrong tree. For graph processing, try Neo4j, AllegroGraph, or Google's Pregel.

Hope it helps.
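
One way to see the mismatch described above is to write breadth-first search as repeated map and reduce rounds. The following is a rough in-memory Python sketch, not Hadoop code; the function names and the toy path graph are hypothetical. Each round advances the search frontier by one hop, so a graph with a long shortest path needs many rounds, and on a real cluster each round would be a separate job that also re-emits and shuffles the adjacency lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

INF = float("inf")


def map_phase(graph: Dict[str, List[str]],
              dist: Dict[str, float]) -> Iterator[Tuple[str, float]]:
    """Emit (vertex, candidate distance) pairs, as a mapper would."""
    for vertex, neighbors in graph.items():
        yield vertex, dist[vertex]            # pass the current distance through
        if dist[vertex] != INF:
            for neighbor in neighbors:
                yield neighbor, dist[vertex] + 1


def reduce_phase(pairs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group candidates by vertex and keep the minimum, as a reducer would."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for vertex, candidate in pairs:
        grouped[vertex].append(candidate)
    return {vertex: min(cands) for vertex, cands in grouped.items()}


def bfs_rounds(graph: Dict[str, List[str]],
               source: str) -> Tuple[Dict[str, float], int]:
    """Run map/reduce rounds until no distance changes; return distances and round count."""
    dist: Dict[str, float] = {vertex: INF for vertex in graph}
    dist[source] = 0
    rounds = 0
    while True:
        new_dist = reduce_phase(map_phase(graph, dist))
        rounds += 1
        if new_dist == dist:                  # converged: this round changed nothing
            return dist, rounds
        dist = new_dist


# Toy undirected path graph (hypothetical): a - b - c - d
PATH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

distances, rounds = bfs_rounds(PATH, "a")
print(distances)
print("rounds:", rounds)
```

On this four-vertex path, three rounds are needed to reach the far end and a fourth to detect that nothing changed, which is the per-hop overhead that vertex-centric systems such as Pregel are designed to avoid.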

I'm not familiar with any, but hopefully someo...

2011-07-20T10:16:15.568-07:00

I'm not familiar with any, but hopefully someone else can respond on this comment thread ...

Hi Daniel, you mentioned in the post that neither ...

2011-07-20T10:13:15.959-07:00

Hi Daniel, you mentioned in the post that neither the HDFS nor the HBase data store is optimized for graph data. I'm looking at the problem of using HBase for graph mining. Are there any papers that measure HBase performance on graph data? Thanks.