TY - BOOK
T1 - On the Estimation and Use of Statistical Modelling in Information Retrieval
AU - Petersen, Casper
PY - 2016
Y1 - 2016
N2 - Automatic text processing often relies on assumptions about the distribution of some property (such as term frequency) in the data being processed. In information retrieval (IR), such assumptions may be attributed to (i) the absence of principled approaches for determining the correct statistical distribution and (ii) the fact that making such assumptions does not seem to impact IR effectiveness. However, if such assumptions are not validated, any subsequent calculations, deductions or modelling become less accurate for the task at hand. To remove the need for such assumptions, this thesis first introduces a statistically principled method for selecting the best-fitting distribution. The thesis then demonstrates that integrating knowledge about the best-fitting distribution into IR leads to superior results compared to existing strong baselines on multiple datasets. Overall, this thesis concludes that assumptions regarding the distribution of dataset properties can be replaced with an effective, efficient and principled method for determining the best-fitting distribution, and that using this distribution can lead to improved retrieval performance.
AB - Automatic text processing often relies on assumptions about the distribution of some property (such as term frequency) in the data being processed. In information retrieval (IR), such assumptions may be attributed to (i) the absence of principled approaches for determining the correct statistical distribution and (ii) the fact that making such assumptions does not seem to impact IR effectiveness. However, if such assumptions are not validated, any subsequent calculations, deductions or modelling become less accurate for the task at hand. To remove the need for such assumptions, this thesis first introduces a statistically principled method for selecting the best-fitting distribution. The thesis then demonstrates that integrating knowledge about the best-fitting distribution into IR leads to superior results compared to existing strong baselines on multiple datasets. Overall, this thesis concludes that assumptions regarding the distribution of dataset properties can be replaced with an effective, efficient and principled method for determining the best-fitting distribution, and that using this distribution can lead to improved retrieval performance.
UR - https://rex.kb.dk/primo-explore/fulldisplay?docid=KGL01010158094&context=L&vid=NUI&search_scope=KGL&tab=default_tab&lang=da_DK
M3 - Ph.D. thesis
BT - On the Estimation and Use of Statistical Modelling in Information Retrieval
PB - Department of Computer Science, Faculty of Science, University of Copenhagen
ER -