Analysing Document Similarity Measures
Edward Grefenstette
Abstract
Supervised by Professor Stephen Pulman. Awarded a distinction for the MSc.
Document similarity measures are systems that perform the same abstract task while drawing upon very different aspects of documents, depending on the goal. This observation raises a number of questions about their nature. What is the common thread in document similarity measure design? Is it a software engineering problem, or are there general principles guiding their construction? Are metrics designed for one purpose suitable for another, and how would we determine whether they are? How do they deal with different kinds of input (words, sentences, sets of paragraphs)? On what grounds can we compare metrics? How do we choose a 'better' metric relative to a task?
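To make the point concrete, consider a minimal sketch, purely illustrative and not drawn from the dissertation itself: two toy metrics that share the same interface (a pair of documents in, a score in [0, 1] out) yet attend to different aspects of the text. The first looks only at which words occur; the second is also sensitive to how often they occur. The function names are our own, chosen for illustration.

```python
from collections import Counter
from math import sqrt

def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Overlap of word *sets*: ignores word frequency entirely."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine over term-frequency vectors: sensitive to frequency."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Two documents built from the same vocabulary at different frequencies, such as "a a b" and "a b b", score 1.0 under the first measure but only 0.8 under the second: measures sharing one interface can disagree because they draw on different aspects of the documents.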
This jumble of questions justifies further work, but leaves us with little clue how to begin. Attempts have been made in the computational linguistics literature to answer some of these questions for small groups of metrics, particularly in the context of comparing two specific types of metric; however, we found no attempt at a general theory of metric design and analysis in the literature, and have resolved to approach the problem ourselves.
The common thread running through the above questions can thus be synthesised into the following three questions, which form the basis of our investigation. First, how are common document similarity measures designed and implemented? In answering this question, we wish to learn more about the kinds of metrics commonly used in text processing, and the sorts of difficulties that arise when putting them to use in practice. Second, how can we analyse document similarity? In answering this question, which forms the bulk of our project, we discuss how metrics can be compared and ranked relative to different types of document similarity, giving us some insight into their performance across a variety of text processing tasks. Third and finally, how can the results of such analysis be leveraged to improve existing metrics or to design better ones?
In this dissertation, we describe both an experiment and the construction of an extensible metric analysis framework in an attempt to answer these key questions.