Web data extraction consists of automatically identifying structured data on web pages and loading the data into a database or other EDP application. A comprehensive logical theory of Web data extraction and data extraction from semi-structured documents was developed in the paper Monadic Datalog and the Expressive Power of Languages for Web Information Extraction (PODS'02, full version: JACM 51:1, 2004) with Christoph Koch. A first basic insight of this study was that data extraction from tree-structered documents is a task which can be essentially described by monadic logic, and that, in particular, monadic second order logic (MSO) is a well-suited formalism for describing most relevant data extraction tasks. Since MSO is not a practical language and has a high expression complexity, we looked for a simpler extraction language having the same expressive power. We identified monadic datalog as a good candidate and proved that, over tree structures, monadic datalog has exactly the same expressive power as MSO, but much lower complexity (size of program times size of input document).
As a practical tool, with postdocs Robert Baumgartner, Sergio Flesca, and Marcus Herzog, we designed and implemented the Lixto system for visual data extraction, whose original version is described in the VLDB'01 paper Visual Web Information Extraction with Lixto . This system allows a designer to develop a wrapper, that is, an information extraction program, by mainly visual and interactive operations performed on a sample document. These wrappers are formulated in an extension of monadic datalog. This means that monadic datalog programs over HTML-trees, and thus MSO mappings, can be essentially defined in a visual and interactive manner. For further publications on Lixto, click here . The Lixto sytem has been further developed by Vienna based company Lixto Software , and is now used by a large number of corporate customers. Our theoretical results on data extraction got the best paper award at ACM PODS 2002. The Lixto software was a finalist in the World Technology Award competition 2003 in San Francisco.
The concepts of tree decomposition and the associated concept of treewidth (a cyclicity measure for graphs) introduced by the well-known graph theorists Robertson and Seymour in Robertson, (J. Algorithms 7, 1986, pp.309-322.} was of great profit to Computer Science. Many NP hard problems turned out to be tractable on instances whose underying graphs have bounded treewidth.
However, the structure of many problems is better described by hypergraphs than by graphs. We were thus looking for suitable hypergraph decomposition methods with similar properties as treewidth, and forged the new decomposition concept of hypertree decompositions (HD) and the associated notion of hypertree width in the paper Hypertree Decompositions and Tractable Queries (PODS'99,JCSS n.64, 2002). This decomposition has several favorable properties: For every constant k , it can be determined in polynomial time whether a given hypergraph has HW k , and in the positive case, a HD of width k can be computed, moreover, many problems such as conjunctive query evaluation, and constraint satisfaction problems (CSPs), which are NP complete in general, can be solved in polynomial time for instances of bounded HW.
By solving an open problem, we showed that for a different decomposition method proposed by Chekuri and Rajaraman (C.Chekuri and A.Rajaraman, Conjunctive Query Containment Revisited , Proc. 6th . Intl. Conf on Database Theory, Springer LNCS 1186, 1987, pp. 56-70.), it is NP-hard to determine whether a hypergraph has bounded width. A natural game-theoretic characterization of hypertree-width was given in the paper Robbers, Marshals, and guards: Game theoretic and logical characterizations of hypertree width (PODS'02, full version in JCSS 66, 2003), where it was shown that a hypergraph has HW k iff k ``marshals'' (each of which can control a hyperedge) can capture a robber who can move freely along connected hyperedges not occuppied by marshals. In the paper A Comparison of Structural CSP Decomposition Methods , the concept of hypertree width was compared to various other hypergraph decomposition methods and it was shown that hypertree-width dominates these methods in generality. Further investigations can be founfr in the more recent paper Generalized Hypertree Decompositions: NP-Hardness and Tractable Variants , which got the Best Paper Award at ACM PODS'07.
There are two Web pages which both contain papers on hypertree decomposition and downloadable code of decomposition algorithms: The Italian Hypertree Decomposition Homepage , maintained by Francesco Scarcello, and the Vienna Hypertree Project Page , maintained by Nysret Musliu.
We are currently continuing our research on hypertree decompositions in various directions: Finding new and faster decomposition algorithms, generalizing the concept, extending the range of applications.