Processing Internet-derived Text—Creating a Corpus of Usenet Messages
By:
Hoffmann, Sebastian
Type:
Article from Journal - e-Journal
In collection:
Literary and Linguistic Computing vol. 22 no. 2 (Jun. 2007), pages 151-165.
Abstract
In recent years, linguists have become increasingly interested in the language of the Internet, both as an object of investigation and as a source of authentic data to complement traditional electronic corpora. However, Internet-derived data is typically very messy, and a conversion process is often required to enable researchers to carry out a reliable quantitative investigation of the observed patterns with the help of standard corpus tools. In this article, I discuss the technical and methodological aspects involved in creating a large corpus of asynchronous computer-mediated communication by downloading and postprocessing hundreds of thousands of messages posted in twelve Usenet newsgroups. After describing how messages can be arranged into hierarchically structured discussion threads, I focus at some length on the strategies that are required to correctly assign authorship to the different textual elements in individual messages. My algorithms have a success rate of well over 90% for most newsgroups, and the resulting corpus can thus serve as a suitable basis for an investigation into the interactive strategies employed in this particular type of written communication.
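To make the two processing steps named in the abstract more concrete, the following is a minimal Python sketch, not the article's own algorithm. It assumes each message has already been reduced to a dictionary with a hypothetical "id" key (the Message-ID) and a "references" key (ancestor Message-IDs, oldest first, as in the standard Usenet References header), and it assumes quoted material is marked with leading ">" characters. The function names build_threads, walk, and quote_depth are illustrative only.

```python
from collections import defaultdict


def build_threads(messages):
    """Arrange messages into reply trees.

    Assumption: each message is a dict with "id" and, optionally,
    "references" (ancestor ids, oldest first).
    Returns (roots, children), where children maps a parent id to its reply ids.
    """
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        # Attach to the nearest ancestor that is actually in the collection;
        # messages whose ancestors were never downloaded become thread roots.
        parent = next(
            (r for r in reversed(m.get("references", []))
             if r in known and r != m["id"]),
            None,
        )
        if parent is None:
            roots.append(m["id"])
        else:
            children[parent].append(m["id"])
    return roots, dict(children)


def walk(roots, children, depth=0):
    """Yield (depth, message_id) pairs in depth-first thread order."""
    for mid in roots:
        yield depth, mid
        yield from walk(children.get(mid, []), children, depth + 1)


def quote_depth(line):
    """Count leading '>' markers, ignoring spaces and tabs between them.

    Depth 0 is treated as the poster's own text; depth 1 and above as
    material quoted from earlier messages. This is a simplification.
    """
    depth = 0
    for ch in line:
        if ch == ">":
            depth += 1
        elif ch not in " \t":
            break
    return depth
```

For example, given four messages where "b" replies to "a", "c" replies to "b", and "d" refers only to a message missing from the collection, build_threads returns "a" and "d" as roots. Real newsgroup data breaks both assumptions regularly (missing or malformed headers, non-standard quote markers, unmarked quotations), which is presumably why the article treats authorship assignment at length and reports a success rate rather than a guarantee.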