Abstract
A method for deriving an approximately labeled dependency treebank from the Thai Categorial Grammar Treebank has been implemented. The method involves a lexical dictionary for assigning dependency directions to the CG types associated with the grammatical entities in the CG bank, falling back on a generic mapping of CG types in case of unknown words. Currently, all but a handful of the trees in the Thai CG bank can unambiguously be transformed into directed dependency trees. Dependency labels can optionally be assigned with a learned classifier, which in a preliminary evaluation with a very small training set achieves 76.5% label accuracy. In the process, a number of annotation errors in the CG bank were identified and corrected. Although rather limited in its coverage, excluding e.g. long-distance dependencies, topicalisations and longer sentences, the resulting treebank is believed to be sound in terms of structural annotational consistency and a valuable complement to the scarce Thai language resources in existence.
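The lookup-with-fallback step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the slash notation for CG types, the `LEXICON` contents, the placeholder word `word_a`, and the generic rule that a functor whose result equals its argument (a modifier type such as `np/np`) is treated as the dependent, while any other functor is treated as the head.

```python
# Hypothetical lexical dictionary: (word, CG type) -> role of the functor word.
# The entry below is a placeholder, not data from the Thai CG bank.
LEXICON = {
    ("word_a", "np/np"): "head",
}


def split_type(cg_type):
    """Split a functor type at its outermost slash.

    Returns (result, slash, argument), or None for an atomic type.
    Assumes left-associative notation, so the rightmost slash outside
    any parentheses is the outermost one.
    """
    depth = 0
    for i in range(len(cg_type) - 1, -1, -1):
        ch = cg_type[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
        elif ch in "/\\" and depth == 0:
            return cg_type[:i], ch, cg_type[i + 1:]
    return None


def generic_role(cg_type):
    """Generic fallback mapping (an assumed heuristic).

    Modifier types (result equals argument) make the functor the dependent;
    other functor types make it the head. Atomic types yield None.
    """
    parts = split_type(cg_type)
    if parts is None:
        return None
    result, _slash, argument = parts
    return "dependent" if result == argument else "head"


def functor_role(word, cg_type):
    """Look the word up in the lexicon first, then fall back to the generic mapping."""
    role = LEXICON.get((word, cg_type))
    return role if role is not None else generic_role(cg_type)


def argument_side(cg_type):
    """Return 'right' for X/Y, 'left' for X\\Y, and None for atomic types."""
    parts = split_type(cg_type)
    if parts is None:
        return None
    return "right" if parts[1] == "/" else "left"
```

In this sketch, a known word takes its role from the dictionary, while an unknown word with type `np/np` would be classified by the generic rule alone. The real mapping in the treebank conversion may use different type notation and different rules.
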
Original language | English |
---|---|
Title | Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP) |
Publisher | Association for Computational Linguistics |
Publication date | Nov. 2011 |
Status | Published - Nov. 2011 |