Georg Gottlob: Recent Research Highlights

Find below a few highlights of the current and recent research of Georg Gottlob. A more comprehensive description of Gottlob's research activity can be found in Section 4 of his long CV

Work on the XPath Query Language for XML

XPath is an important query language for XML documents, allowing one to specify a set of nodes (i.e., objects) in a document. XPath is a sublanguage of the query and transformation language XSLT and has been implemented in several query processors, and in browsers such as Internet Explorer. While all these systems require exponential time in the worst case for answering XPath queries, a polynomial XPath evaluation algorithm was developed by Gottlob, Koch, and Pichler (VLDB'02) and published in full in 2005 in ACM TODS: Efficient Algorithms for Processing XPath Queries . The new XPath evaluation algorithms also gave rise to a US patent Nr 7,162,485 . An main-memory implementation of our fast XPath algorithm as well as a tutorial and work on space efficient fragments of XPath can be found here. On the more theoretic side, we determined the precise complexity of XPath and of various XPath fragments in the paper The Complexity of XPath Query Evaluation and XML Typing (JACM 52:2, 2005), and we have studied the complexity and expressive power of Conjunctive Queries over Trees , i.e., possibly cyclic XML queries that consist of conjunctions of atoms that relate variables via XPath axes (JACM 53:2, 2006).

Theory and Practice of Data Extraction

Web data extraction consists of automatically identifying structured data on web pages and loading the data into a database or other EDP application. A comprehensive logical theory of Web data extraction and data extraction from semi-structured documents was developed in the paper Monadic Datalog and the Expressive Power of Languages for Web Information Extraction (PODS'02, full version: JACM 51:1, 2004) with Christoph Koch. A first basic insight of this study was that data extraction from tree-structered documents is a task which can be essentially described by monadic logic, and that, in particular, monadic second order logic (MSO) is a well-suited formalism for describing most relevant data extraction tasks. Since MSO is not a practical language and has a high expression complexity, we looked for a simpler extraction language having the same expressive power. We identified monadic datalog as a good candidate and proved that, over tree structures, monadic datalog has exactly the same expressive power as MSO, but much lower complexity (size of program times size of input document).

As a practical tool, with postdocs Robert Baumgartner, Sergio Flesca, and Marcus Herzog, we designed and implemented the Lixto system for visual data extraction, whose original version is described in the VLDB'01 paper Visual Web Information Extraction with Lixto . This system allows a designer to develop a wrapper, that is, an information extraction program, by mainly visual and interactive operations performed on a sample document. These wrappers are formulated in an extension of monadic datalog. This means that monadic datalog programs over HTML-trees, and thus MSO mappings, can be essentially defined in a visual and interactive manner. For further publications on Lixto, click here . The Lixto sytem has been further developed by Vienna based company Lixto Software , and is now used by a large number of corporate customers. Our theoretical results on data extraction got the best paper award at ACM PODS 2002. The Lixto software was a finalist in the World Technology Award competition 2003 in San Francisco.

Hypertree Decompositions

The concepts of tree decomposition and the associated concept of treewidth (a cyclicity measure for graphs) introduced by the well-known graph theorists Robertson and Seymour in Robertson, (J. Algorithms 7, 1986, pp.309-322.} was of great profit to Computer Science. Many NP hard problems turned out to be tractable on instances whose underying graphs have bounded treewidth.

However, the structure of many problems is better described by hypergraphs than by graphs. We were thus looking for suitable hypergraph decomposition methods with similar properties as treewidth, and forged the new decomposition concept of hypertree decompositions (HD) and the associated notion of hypertree width in the paper Hypertree Decompositions and Tractable Queries (PODS'99,JCSS n.64, 2002). This decomposition has several favorable properties: For every constant k , it can be determined in polynomial time whether a given hypergraph has HW k , and in the positive case, a HD of width k can be computed, moreover, many problems such as conjunctive query evaluation, and constraint satisfaction problems (CSPs), which are NP complete in general, can be solved in polynomial time for instances of bounded HW.

By solving an open problem, we showed that for a different decomposition method proposed by Chekuri and Rajaraman (C.Chekuri and A.Rajaraman, Conjunctive Query Containment Revisited , Proc. 6th . Intl. Conf on Database Theory, Springer LNCS 1186, 1987, pp. 56-70.), it is NP-hard to determine whether a hypergraph has bounded width. A natural game-theoretic characterization of hypertree-width was given in the paper Robbers, Marshals, and guards: Game theoretic and logical characterizations of hypertree width (PODS'02, full version in JCSS 66, 2003), where it was shown that a hypergraph has HW k iff k ``marshals'' (each of which can control a hyperedge) can capture a robber who can move freely along connected hyperedges not occuppied by marshals. In the paper A Comparison of Structural CSP Decomposition Methods , the concept of hypertree width was compared to various other hypergraph decomposition methods and it was shown that hypertree-width dominates these methods in generality. Further investigations can be founfr in the more recent paper Generalized Hypertree Decompositions: NP-Hardness and Tractable Variants , which got the Best Paper Award at ACM PODS'07.

There are two Web pages which both contain papers on hypertree decomposition and downloadable code of decomposition algorithms: The Italian Hypertree Decomposition Homepage , maintained by Francesco Scarcello, and the Vienna Hypertree Project Page , maintained by Nysret Musliu.

We are currently continuing our research on hypertree decompositions in various directions: Finding new and faster decomposition algorithms, generalizing the concept, extending the range of applications.

Algorithms for Data Exchange

Data Exchange is the problem of inserting data structured under a source schema into a target schema of different structure (possibly with integrity constraints), while reflecting the source data as accurately as possible. In their recent paper ``Data Exchange: Getting to the Core'' (ACM PODS'03, full version: ACM Trans. Database Syst. 30(1): 174-210, 2005), Fagin, Kolaitis, and Popa have proposed an elegant solution to the data exchange problem that is based on the notion of core . A core is at the same time a most general and a most compact solution to a data exchange problem. It was, however, left open whether cores can be computed efficiently in the settings considered by Fagin, Kolaitis, and Popa. This question was answered positively for one setting in the PODS'05 paper Data Exchange: Computing Cores in Polynomial Time and, for the most general setting, in the PODS'06 paper Data Exchange: Computing Cores in Polynomial Time. Gottlob has recently obtained substantial EPSRC-funding for further investigating algorithmic ssues on data exchange. This project, entitled Schema Mappings and Automated Services for Data Integration and Exchange has started in July 2007.

Back to GG's personal page