DIADEM: Domain-centric Intelligent Automated Data Extraction Methodology
Do you need to rent a new apartment? Or would you just like to find a restaurant in your area that serves "pasta al pesto" as today’s special? In either case, you would most likely start with a web search. But keyword search is not really appropriate in such cases, because you risk being swamped with irrelevant information rather than finding what you want. If all the information were available in structured form, you could find what you are looking for much faster. Search engine providers such as Google, Yahoo! and Microsoft are aware of this, and are keenly looking for new methods that automatically recognize and extract data from domain-specific websites with semi-structured content. To date, this problem has not been satisfactorily solved; its solution seems to require a major research breakthrough.
In this project we will tackle precisely this challenge. Our goal is very ambitious. We want to develop domain-specific data extraction systems that take as input the URL of a website in a particular application domain, automatically explore the website, and deliver as output a structured data set containing all the relevant information present on that site. We will provide the logical, algorithmic, and methodological foundations for the knowledge-based extraction of structured data from websites belonging to specific domains, and we will develop two extraction systems for two different domains. To achieve our goal, we will design new methods and algorithms that combine database techniques with methods of knowledge representation and reasoning and with web data extraction techniques. The breakthrough in automatic data extraction that we are striving for would enable a leap forward for two interrelated technologies that are the hottest emerging topics in web search: vertical search, that is, web search in specialized domains, and object search, that is, the search for web data objects rather than web pages.
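As a rough illustration of the input/output contract described above (a page from one domain goes in, structured records come out), the following toy sketch may help. It is not DIADEM's method: the page markup, the class names, the `SCHEMA` mapping, and the `extract` function are all hypothetical, and a hand-written mapping stands in for the domain knowledge that the project aims to apply automatically.

```python
from html.parser import HTMLParser

# Hypothetical listing page; markup and class names are invented for illustration.
PAGE = """
<div class="listing"><span class="addr">12 High St</span><span class="rent">950</span><span class="beds">2</span></div>
<div class="listing"><span class="addr">3 Mill Lane</span><span class="rent">1400</span><span class="beds">3</span></div>
"""

# Toy "domain knowledge" (an assumption): which markup class carries which
# attribute of a real-estate listing, and how to convert its text.
SCHEMA = {
    "addr": ("address", str),
    "rent": ("rent_gbp", int),
    "beds": ("bedrooms", int),
}


class ListingExtractor(HTMLParser):
    """Collects one record (a dict) per <div class="listing"> element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # schema key of the <span> currently being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "listing":
            self.records.append({})  # start a new record
        elif tag == "span" and cls in SCHEMA:
            self._field = cls

    def handle_data(self, data):
        # Only text directly following a recognised <span> is stored.
        if self._field and self.records:
            name, cast = SCHEMA[self._field]
            self.records[-1][name] = cast(data.strip())
            self._field = None


def extract(html):
    """Return the structured records found in one page of listing markup."""
    parser = ListingExtractor()
    parser.feed(html)
    return parser.records


records = extract(PAGE)

# Object search: query the data objects by attribute instead of by keyword.
cheap_two_beds = [r for r in records if r["bedrooms"] >= 2 and r["rent_gbp"] <= 1000]
print(cheap_two_beds)
```

Once the listings are available as records, a question such as "at least two bedrooms for at most 1000 a month" becomes a filter over attributes, which is the kind of query that keyword search over web pages cannot answer directly.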
For more details and results see the DIADEM homepage.
Links
DIADEM Homepage
Web Data Extraction for Online Market Intelligence
Info
Duration | 1st April 2010 to 31st March 2015
Principal Investigator |
People |
(James Martin Fellow (http://www.oxfordmartin.ox.ac.uk/people/236), Fellow of the Oxford Man Institute)
(Oxford Martin Fellow)
(Oxford Martin Fellow)
Themes |