Unika Atma Jaya Library

The AMARA Corpus: Building Resources for Translating the Web’s Educational Content

By:

Type: Article from Proceedings
In collection: Proceedings of the 10th International Workshop on Spoken Language Translation (IWSLT 2013), Heidelberg, Germany: Dec. 5-6, 2013
Fulltext: The AMARA Corpus.pdf (11.78MB)

Article content: In this paper, we introduce a new parallel corpus of subtitles of educational videos: the AMARA corpus for online educational content. We crawl a multilingual collection of community-generated subtitles and present the results of processing the Arabic–English portion of the data, which yields a parallel corpus of about 2.6M Arabic and 3.9M English words. We explore different approaches to aligning the segments, and extrinsically evaluate the resulting parallel corpus on the standard TED-talks tst-2010. We observe that the data can be used successfully for this task, and that it gives an absolute improvement of 1.6 BLEU when used in combination with TED data. Finally, we analyze some of the specific challenges of translating educational content.