Join-Processing on RDF in a MapReduce Environment
16:15, 7th June 2012 (week 7, Trinity Term 2012), Lecture Theatre B
Today's information systems have to cope with an exploding number and size of data sets. Distributed computing platforms like MapReduce have proven to be well suited for large-scale data management. In this talk I will concentrate on query processing on RDF data; RDF is a standard developed by the World Wide Web Consortium for representing Semantic Web data. The topic of the first part of the talk is PigSPARQL, a system we have developed to translate general SPARQL queries into Pig Latin, a relational programming layer on top of Hadoop, a widely used open-source environment for MapReduce. As a distinctive feature, PigSPARQL does not require any changes to Hadoop itself and can therefore be used in cloud computing environments as well. However, Pig Latin's reduce-side implementation of the relational join may incur efficiency problems for large data sets. In the second part of the talk I will present a map-side join implementation that takes advantage of the scalable storage capabilities of HBase, Hadoop's distributed NoSQL datastore. Finally, I will present evaluation results demonstrating the feasibility of our approach.
Joint work with Martin Przyjaciel-Zablocki and Alexander Schätzle
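
The contrast between the two join strategies mentioned in the abstract can be illustrated with a minimal, single-process sketch. It is not the speakers' implementation and uses neither Hadoop, Pig nor the HBase API: the toy triples, the function names and the in-memory dictionary standing in for an indexed store are all assumptions made for illustration. Both functions evaluate the pattern "?x knows ?y . ?y worksAt ?z".

```python
from collections import defaultdict

# Toy RDF triples (subject, predicate, object); invented for illustration.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "worksAt", "acme"),
    ("carol", "worksAt", "initech"),
]


def reduce_side_join(triples):
    """Join by grouping both inputs on the join variable ?y (simulated shuffle)."""
    # "Map" phase: tag each matching triple with its pattern and emit it
    # under the join key, as a shuffle would group it.
    groups = defaultdict(lambda: {"knows": [], "worksAt": []})
    for s, p, o in triples:
        if p == "knows":
            groups[o]["knows"].append(s)    # ?y is the object here
        elif p == "worksAt":
            groups[s]["worksAt"].append(o)  # ?y is the subject here
    # "Reduce" phase: combine both sides for every join key.
    results = []
    for y, sides in groups.items():
        for x in sides["knows"]:
            for z in sides["worksAt"]:
                results.append((x, y, z))
    return sorted(results)


def map_side_join(triples):
    """Same join, but the second pattern is resolved by index lookups."""
    # Hypothetical stand-in for an indexed store keyed by subject;
    # a real system would query an external table instead of a dict.
    index = defaultdict(list)
    for s, p, o in triples:
        if p == "worksAt":
            index[s].append(o)
    # Each "mapper" processes one triple and looks up its join partners
    # directly, so no grouping of the whole input is needed.
    results = []
    for s, p, o in triples:
        if p == "knows":
            for z in index.get(o, []):
                results.append((s, o, z))
    return sorted(results)


if __name__ == "__main__":
    print(reduce_side_join(TRIPLES))
    print(map_side_join(TRIPLES))
```

In the first function every matching triple is regrouped by join key before any result is produced, which corresponds to the data movement that a reduce-side join pays for. In the second, only lookups against the index are needed, which is the general idea behind a map-side join.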