Realistic data models for large-scale probabilistic knowledge bases
Systems that crawl the web, encountering new sources and adding facts to their databases, have many potential uses. However, a lack of common-sense knowledge about the data they store is currently limiting their potential in practice. Oxford researchers are working to overcome these constraints, as Professor Thomas Lukasiewicz and co-investigator Ismail Ilkan Ceylan explain.
Palaeontology, geology, medical genetics and human movement are domains for which large-scale probabilistic knowledge bases have already been built. There is an endless list of other potential applications for these systems that continuously crawl the web and extract structured information, and thus populate their databases with millions of entities and billions of tuples (structured sets of data). For example, such systems may also be used to enable digital assistants to answer natural language questions in healthcare, such as: 'What are the symptoms of appendicitis in adults?'
In recent years, there has been a strong interest in academia and industry in building these large-scale probabilistic knowledge bases from data in an automated way. This has resulted in a number of systems, such as Microsoft's Probase, Google's Knowledge Vault, and DeepDive (commercialised as Lattice Data and then bought by Apple).
Artificial intelligence research has also joined the quest to build large-scale knowledge bases. Fields such as information extraction, natural language processing (for example, question answering), relational and deep learning, knowledge representation and reasoning, and databases are all moving towards a common goal: the ability to query large-scale probabilistic knowledge bases.
However, these search and extraction systems are still not able to convey some of the valuable knowledge hidden in their data to the end user, which seriously limits their potential applications in practice. These problems are rooted in the semantics of (tuple-independent) probabilistic databases, which are used to encode most probabilistic knowledge bases. To achieve computational efficiency, probabilistic databases are typically based on strong, unrealistic completeness assumptions.
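To give a flavour of that semantics, here is a minimal sketch of query evaluation over a tuple-independent probabilistic database; the relation, names, and confidence values are invented for illustration and are not taken from any of the systems mentioned above. Each stored tuple is treated as an independent random event, and query probabilities are obtained by combining tuple probabilities:

```python
# Sketch of query evaluation over a tuple-independent probabilistic database.
# Each stored tuple is an independent random event whose probability is,
# for example, the extractor's confidence in that fact.

# Hypothetical extracted tuples for a relation born_in(person, city).
born_in = {
    ("ada", "london"):  0.9,
    ("alan", "london"): 0.6,
    ("kurt", "vienna"): 0.8,
}

def prob_someone_born_in(city):
    """P(at least one stored person was born in `city`).

    Because tuples are assumed independent, this is
    1 - prod(1 - p) over all matching tuples (a 'noisy-or'),
    computable in a single pass over the data.
    """
    p_none = 1.0
    for (_person, c), p in born_in.items():
        if c == city:
            p_none *= 1.0 - p
    return 1.0 - p_none

print(prob_someone_born_in("london"))  # 1 - 0.1 * 0.4 = 0.96
```

It is exactly this independence that keeps query answering tractable over billions of tuples, and it is also one of the assumptions that turns out to be unrealistic, as discussed next.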
These assumptions not only lead to unwanted consequences, but also put probabilistic databases on a weak footing in terms of knowledge base learning, completion and querying. More specifically, each of the above systems encodes only a portion of the real world, and this description is necessarily incomplete.
However, when it comes to querying, most of these systems employ the closed-world assumption, meaning that any fact that is not present in the database is assigned probability 0 and is thus assumed to be impossible. A closely related problem is the closed-domain assumption, under which all individuals are assumed to be known and no new individuals can exist. Furthermore, it is common practice to view every extracted fact as an independent Bernoulli variable, which means that any two facts are treated as probabilistically independent.
For example, the fact that a person starred in a movie is often assumed to be independent of the fact that this person is an actor, which conflicts with the fundamental nature of the knowledge domain: anyone who has starred in a movie is an actor. Furthermore, current probabilistic databases lack common-sense knowledge; such knowledge is useful in many reasoning tasks to deduce implicit consequences from data, and is often essential for querying large-scale probabilistic databases in an uncontrolled environment such as the web.
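The sketch below illustrates both problems and how a single piece of common-sense knowledge could help. The facts, probabilities, and the rule at the end are all invented for this example and are not part of any of the systems mentioned:

```python
# Sketch of the unwanted consequences of the closed-world and independence
# assumptions, using invented facts and extraction confidences.

facts = {
    ("starred_in", "alice", "film_x"): 0.8,
    ("actor", "alice"):                0.9,
    ("starred_in", "bob", "film_y"):   0.7,
    # Note: no ("actor", "bob") tuple was ever extracted.
}

def prob(*fact):
    # Closed-world assumption: anything not stored has probability 0.
    return facts.get(fact, 0.0)

# Independence assumption: the conjunction is just a product, even though
# starring in a film all but guarantees being an actor.
print(prob("starred_in", "alice", "film_x") * prob("actor", "alice"))  # 0.72

# Closed-world assumption: the unseen but very plausible fact is ruled out.
print(prob("actor", "bob"))  # 0.0

# A common-sense rule such as  starred_in(x, m) -> actor(x)  would let the
# system deduce actor(x) with at least the probability of a supporting fact.
def prob_actor_with_rule(person):
    derived = max(
        (p for fact, p in facts.items()
         if fact[0] == "starred_in" and fact[1] == person),
        default=0.0,
    )
    return max(prob("actor", person), derived)

print(prob_actor_with_rule("bob"))  # 0.7 rather than 0.0
```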
The main goal of our current EPSRC-funded research, which began in December 2017, is to enhance large-scale probabilistic databases (and so to unlock their full data modelling potential) by including more realistic data models, while preserving their computational properties. The team from the Department of Computer Science (Thomas as principal investigator with Ismail and Professors Georg Gottlob and Dan Olteanu as co-investigators) is planning to develop different semantics for the resulting probabilistic databases and analyse their computational properties and sources of intractability.
Over the three-and-a-half years of the project, we are also planning to design practical, scalable query-answering algorithms for these databases, especially algorithms based on knowledge compilation techniques. We will extend existing knowledge compilation approaches and develop new ones, based on tensor factorisation and neural-symbolic knowledge compilation. Once the algorithms are designed, the team plans to produce prototype implementations and experimentally evaluate the proposed algorithms. These prototypes should help demonstrate the full potential of large-scale probabilistic knowledge bases as data models.
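To give a flavour of the general idea behind knowledge compilation, here is a minimal sketch under standard simplifying assumptions; it is an illustration of the technique in general, not of the algorithms the project will actually develop. The 'lineage' of a query answer is a Boolean formula over independent tuple variables, and once that formula has been compiled into a circuit whose AND nodes combine disjoint sets of tuples and whose OR nodes combine mutually exclusive cases, its probability can be read off in a single bottom-up pass:

```python
# Sketch of probability computation over a compiled query lineage.
# AND nodes are assumed decomposable (children share no tuple variables) and
# OR nodes deterministic (children are mutually exclusive), so the probability
# is computed in time linear in the size of the circuit.

from dataclasses import dataclass

@dataclass
class Var:            # a tuple variable with its marginal probability
    p: float

@dataclass
class Not:
    child: object

@dataclass
class And:            # decomposable: children range over disjoint tuples
    children: list

@dataclass
class Or:             # deterministic: children are mutually exclusive
    children: list

def probability(node):
    if isinstance(node, Var):
        return node.p
    if isinstance(node, Not):
        return 1.0 - probability(node.child)
    if isinstance(node, And):
        result = 1.0
        for child in node.children:
            result *= probability(child)
        return result
    if isinstance(node, Or):
        return sum(probability(child) for child in node.children)
    raise TypeError(node)

# Lineage (x1 AND x2) OR (NOT x1 AND x3) for hypothetical tuples x1, x2, x3:
x1, x2, x3 = Var(0.8), Var(0.5), Var(0.4)
circuit = Or([And([x1, x2]), And([Not(x1), x3])])
print(probability(circuit))  # 0.8*0.5 + 0.2*0.4 = 0.48
```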
This article first appeared in the summer 2018 issue of Inspired Research.