University of Glasgow at WebCLEF 2005: Experiments in per-field normalisation and language specific stemming

C. Macdonald; V. Plachouras; B. He; Christina Lioma; I. Ounis

Department of Computer Science

15 citations (Scopus)

Abstract

We participated in the WebCLEF 2005 monolingual task. In this task, a search system aims to retrieve relevant documents from a multilingual corpus of Web documents drawn from the Web sites of European governments. Both the documents and the queries are written in a wide range of European languages. A challenge in this setting is to detect the language of documents and topics, and to process them appropriately. We develop a language-specific technique for applying the correct stemming approach, as well as for removing the correct stopwords from the queries. We represent documents using three fields, namely content, title, and anchor text of incoming hyperlinks. We use a technique called per-field normalisation, which extends the Divergence From Randomness (DFR) framework, to normalise the term frequencies and to combine them across the three fields. We also employ the length of the URL path of Web documents. The ranking is based on combinations of the language-specific stemming, if applied, and the per-field normalisation. We use our Terrier platform for all our experiments. Our techniques perform strongly, achieving the overall top four performing runs, as well as the top-performing run without metadata, in the monolingual task. The best run uses only per-field normalisation, without applying stemming.
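As an illustration of the per-field normalisation mentioned in the abstract, the minimal Python sketch below applies the DFR framework's Normalisation 2 separately to each field and combines the results as a weighted sum, which is one common formulation of the idea. The abstract gives neither the formula nor any parameter values, so the function name, the field weights, and the per-field `c` hyperparameters are assumptions chosen for illustration, not the settings used in the submitted runs.

```python
import math


def per_field_tfn(tf, length, avg_length, c, weight):
    """Combine per-field normalised term frequencies (illustrative sketch).

    For each field f, Normalisation 2 of the DFR framework is applied:
        tfn_f = tf_f * log2(1 + c_f * avg_length_f / length_f)
    and the fields are combined as a weighted sum: sum_f weight_f * tfn_f.

    All arguments are dicts keyed by field name. The weights and c values
    are hypothetical hyperparameters, not values from the paper.
    """
    total = 0.0
    for field, tf_f in tf.items():
        if length[field] == 0:
            # An empty field (e.g. no incoming anchor text) contributes nothing.
            continue
        normalised = tf_f * math.log2(1.0 + c[field] * avg_length[field] / length[field])
        total += weight[field] * normalised
    return total


# Hypothetical statistics for one query term in one document.
example_tf = {"content": 3, "title": 1, "anchor": 0}
example_length = {"content": 100, "title": 5, "anchor": 20}
example_avg_length = {"content": 100.0, "title": 15.0, "anchor": 20.0}
example_c = {"content": 1.0, "title": 1.0, "anchor": 1.0}
example_weight = {"content": 1.0, "title": 2.0, "anchor": 1.0}

example_tfn = per_field_tfn(
    example_tf, example_length, example_avg_length, example_c, example_weight
)
print(example_tfn)
```

In a DFR weighting model, the combined value would take the place of the single normalised term frequency. Normalising each field against its own average length keeps a short field such as the title from being penalised by the statistics of the much longer content field.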

Original language	English
Title	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number of pages	10
Volume	4022 LNCS
Publication date	1 Jan 2006
Pages	898-907
ISBN (Print)	9783540456971
Status	Published - 1 Jan 2006

Citation formats

University of Glasgow at WebCLEF 2005: Experiments in per-field normalisation and language specific stemming. / Macdonald, C.; Plachouras, V.; He, B. et al.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol. 4022 LNCS 2006. p. 898-907.

Publication: Contribution to book/anthology/report › Contribution to book/anthology › Research › peer review

@inbook{d78fd43fdd4941b194d8549ad108d4da,

title = "University of Glasgow at WebCLEF 2005: Experiments in per-field normalisation and language specific stemming",

author = "C. Macdonald and V. Plachouras and B. He and Christina Lioma and I. Ounis",

year = "2006",

month = jan,

day = "1",

language = "English",

isbn = "9783540456971",

volume = "4022 LNCS",

pages = "898--907",

booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - CHAP

T1 - University of Glasgow at WebCLEF 2005

T2 - Experiments in per-field normalisation and language specific stemming

AU - Macdonald, C.

AU - Plachouras, V.

AU - He, B.

AU - Lioma, Christina

AU - Ounis, I.

PY - 2006/1/1

Y1 - 2006/1/1

UR - http://www.scopus.com/inward/record.url?scp=33749608768&partnerID=8YFLogxK

M3 - Book chapter

AN - SCOPUS:33749608768

SN - 9783540456971

VL - 4022 LNCS

SP - 898

EP - 907

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -
