On the Estimation and Use of Statistical Modelling in Information Retrieval

Casper Petersen

Abstract

Automatic text processing often relies on assumptions about the distribution of some property (such as term frequency) in the data being processed. In information retrieval (IR), such assumptions may be attributed to (i) the absence of principled approaches for determining the correct statistical distribution and (ii) the fact that making such assumptions does not seem to impact IR effectiveness. However, if such assumptions are not validated, any subsequent calculations, deductions or modelling become less accurate for the task at hand. To remove the need for such assumptions, this thesis first introduces a statistically principled method for selecting the best-fitting distribution. The thesis then demonstrates that integrating knowledge about the best-fitting distribution into IR leads to superior results compared to existing strong baselines on multiple datasets. Overall, this thesis concludes that assumptions regarding the distribution of dataset properties can be replaced with an effective, efficient and principled method for determining the best-fitting distribution, and that using this distribution can lead to improved retrieval performance.
Original language: English
Publisher: Department of Computer Science, Faculty of Science, University of Copenhagen
Publication status: Published - 2016
