On the Estimation and Use of Statistical Modelling in Information Retrieval

Casper Petersen

Abstract

Automatic text processing often relies on assumptions about the distribution of some property (such as term frequency) in the data being processed. In information retrieval (IR), such assumptions may be attributed to (i) the absence of principled approaches for determining the correct statistical distribution and (ii) the fact that making such assumptions does not seem to impact IR effectiveness. However, if such assumptions are not validated, any subsequent calculations, deductions or modelling become less accurate for the task at hand. To remove the need for such assumptions, this thesis first introduces a statistically principled method for selecting the best-fitting distribution. The thesis then demonstrates that integrating knowledge about the best-fitting distribution into IR leads to superior results compared to existing strong baselines on multiple datasets. Overall, this thesis concludes that assumptions regarding the distribution of dataset properties can be replaced with an effective, efficient and principled method for determining the best-fitting distribution, and that using this distribution can lead to improved retrieval performance.

Original language	English

Publisher	Department of Computer Science, Faculty of Science, University of Copenhagen
Publication status	Published - 2016

Access to Document

PHD-Casper Petersen (Final published version, 4.71 MB)

https://rex.kb.dk/primo-explore/fulldisplay?docid=KGL01010158094&context=L&vid=NUI&search_scope=KGL&tab=default_tab&lang=da_DK

Cite this

@phdthesis{137900ea11cb4c0ea15e7205f66e9876,

title = "On the Estimation and Use of Statistical Modelling in Information Retrieval",

abstract = "Automatic text processing often relies on assumptions about the distribution of some property (such as term frequency) in the data being processed. In information retrieval (IR), such assumptions may be attributed to (i) the absence of principled approaches for determining the correct statistical distribution and (ii) the fact that making such assumptions does not seem to impact IR effectiveness. However, if such assumptions are not validated, any subsequent calculations, deductions or modelling become less accurate for the task at hand. To remove the need for such assumptions, this thesis first introduces a statistically principled method for selecting the best-fitting distribution. The thesis then demonstrates that integrating knowledge about the best-fitting distribution into IR leads to superior results compared to existing strong baselines on multiple datasets. Overall, this thesis concludes that assumptions regarding the distribution of dataset properties can be replaced with an effective, efficient and principled method for determining the best-fitting distribution, and that using this distribution can lead to improved retrieval performance.",

author = "Casper Petersen",

year = "2016",

language = "English",

publisher = "Department of Computer Science, Faculty of Science, University of Copenhagen",

}

TY - BOOK

T1 - On the Estimation and Use of Statistical Modelling in Information Retrieval

AU - Petersen, Casper

PY - 2016

Y1 - 2016

N2 - Automatic text processing often relies on assumptions about the distribution of some property (such as term frequency) in the data being processed. In information retrieval (IR), such assumptions may be attributed to (i) the absence of principled approaches for determining the correct statistical distribution and (ii) the fact that making such assumptions does not seem to impact IR effectiveness. However, if such assumptions are not validated, any subsequent calculations, deductions or modelling become less accurate for the task at hand. To remove the need for such assumptions, this thesis first introduces a statistically principled method for selecting the best-fitting distribution. The thesis then demonstrates that integrating knowledge about the best-fitting distribution into IR leads to superior results compared to existing strong baselines on multiple datasets. Overall, this thesis concludes that assumptions regarding the distribution of dataset properties can be replaced with an effective, efficient and principled method for determining the best-fitting distribution, and that using this distribution can lead to improved retrieval performance.

UR - https://rex.kb.dk/primo-explore/fulldisplay?docid=KGL01010158094&context=L&vid=NUI&search_scope=KGL&tab=default_tab&lang=da_DK

M3 - Ph.D. thesis

BT - On the Estimation and Use of Statistical Modelling in Information Retrieval

PB - Department of Computer Science, Faculty of Science, University of Copenhagen

ER -
