Weigh Your Words—Memory-Based Lemmatization For Middle Dutch
By: Kestemont, Mike; Daelemans, Walter; De Pauw, Guy
Type: Article from Journal - e-Journal
In collection: Literary and Linguistic Computing vol. 25 no. 3 (Sep. 2010), pp. 287-301.
Fulltext: Vol 25, 3, p 287-301.pdf (467.28 KB)
Abstract
This article deals with the lemmatization of Middle Dutch literature. This text collection, like any other medieval corpus, is characterized by enormous spelling variation, which makes computational analysis of this kind of data difficult. Lemmatization is therefore an essential preprocessing step in many applications, since it allows abstraction from superficial textual variation, for instance in spelling. The data we work with is the Corpus-Gysseling, containing all surviving Middle Dutch literary manuscripts dated before 1300 AD. In this article we present a language-independent system that can 'learn' intra-lemma spelling variation. We describe a series of experiments with this system, using Memory-Based Machine Learning, and propose two solutions for the lemmatization of our data: the first procedure attempts to generate new spelling variants; the second implements a novel string distance metric to better detect spelling variants. The latter system reranks candidates suggested by a classic Levenshtein distance, leading to a substantial gain in lemmatization accuracy. This result is encouraging and represents a substantial step forward in the computational study of Middle Dutch literature. Because of their language-independent nature, our techniques may also be of interest to other research domains.
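
As a rough illustration of the classic Levenshtein baseline that the article's reranking approach improves upon (this is not the authors' memory-based system), the Python sketch below retrieves the nearest attested wordform for an unseen spelling and returns its lemma. The lexicon entries and function names are hypothetical toy examples.

# Minimal sketch: assign a lemma to an unseen spelling by retrieving the
# nearest attested wordform under plain Levenshtein distance. This is only
# the classic baseline described in the abstract, not the authors' system;
# the toy lexicon below is hypothetical.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical training lexicon: attested spellings mapped to their lemma.
LEXICON = {
    "coninc": "coninc",
    "koninc": "coninc",
    "ridder": "riddere",
    "riddre": "riddere",
}

def lemmatize(spelling: str, k: int = 3):
    """Rank attested wordforms by edit distance; return the k best (form, lemma, distance) candidates."""
    ranked = sorted(LEXICON, key=lambda form: levenshtein(spelling, form))
    return [(form, LEXICON[form], levenshtein(spelling, form)) for form in ranked[:k]]

if __name__ == "__main__":
    # An unseen spelling variant is matched to its closest attested form.
    print(lemmatize("konync"))

In the article's setup, a learned, weighted string distance would then rerank the candidates that plain Levenshtein distance ranks here.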