Alignment and browsing of the English-Norwegian Parallel Corpus

Knut Hofland
The Norwegian Computing Centre for the Humanities
Knut.Hofland@hd.uib.no

Jarle Ebeling
Department of British and American Studies, University of Oslo
Jarle.Ebeling@iba.uio.no

ALLC/ACH 1996, University of Bergen, Bergen, Norway, 1996
Editors: Anne Lindebjerg, Espen S. Ore, Øystein Reigem
Encoder: Sara A. Schmidt
Keywords: alignment, browsing, parallel corpus

Project Background

The English-Norwegian Parallel Corpus (ENPC) project was started in the
beginning of 1994 and it is expected to end in 1996. The parallel corpus is
meant to be a research tool for linguists and students interested in
contrastive linguistics. The corpus will contain English and Norwegian
originals and their translations (including both English-to-Norwegian and
Norwegian-to-English translations).

The corpus will contain only written material, but both fiction and
non-fiction texts will be included. To make it possible to include as many
different writers and translators as possible, text extracts of 10,000 to
15,000 words will be selected rather than complete books. Each text extract will
start at the beginning of the book, and if possible, end at a chapter
boundary. The finished corpus will consist of 100 pairs of texts with a
total of about 2.5 million words. The texts are marked up according to TEI
P3. As regards the structure of the corpus and the projected uses of the
material, see (Johansson and Hofland 1994) and (Johansson, Ebeling, and
Hofland 1996).

Alignment program

The alignment program has been written by Knut Hofland. The program makes use
of a simple bilingual lexicon (anchor words), but in addition uses
information such as proper nouns, special characters and tags, cognates, and
sentence length in characters. Statistics based on half the corpus give an
error rate of approximately 2 per cent. The program has also been used to align
texts from other language pairs, such as French-Norwegian,
English-French/German/Polish/Swedish/Finnish, and Swedish-Estonian. The program
gives output in several formats, among them a TEI-recommended format and one
that is also suitable for use with ParaConc and WordCruncher for Windows.

Browsing tool

The browsing tool has been written by Jarle Ebeling. The aligned and
proofread texts are indexed, making it possible to search the text database
for words in one of the languages and retrieve the sentences together with
their translations in the other language. Words in the two languages can be
combined with the AND or the NOT operator so that only pairs of sentences with
a specific word in the first language together with (or not together with)
another word in the second language are found. Words can also be truncated
and a distance (in number of words) between particular words can be
given.

References

Web page with more information and articles:

Johansson, S. and Hofland, K. (1994). Towards an English-Norwegian parallel corpus. In U. Fries, G. Tottie, and P. Schneider (eds.), Creating and Using English Language Corpora. Amsterdam, 25-37.

Johansson, S., Ebeling, J., and Hofland, K. (1996). Coding and Aligning the English-Norwegian Parallel Corpus. In K. Aijmer, B. Altenberg, and M. Johansson (eds.), Papers from Symposium on Text-based Cross-linguistic Studies. Lund, 87-112.
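
Illustrative sketch of the search options

To make the search options described under "Browsing tool" more concrete, the following is a minimal sketch in Python. It is not the ENPC browsing tool and does not reflect its implementation: it scans an in-memory list of sentence pairs rather than an index, and the function names, the trailing `*` as truncation syntax, and the reading of "distance" as "the second word follows the first within a given number of words" are all assumptions made for illustration. The three sentence pairs are invented examples, not corpus data.

```python
import re

# Invented English-Norwegian sentence pairs (not taken from the ENPC).
pairs = [
    ("She thought about the house.", "Hun tenkte på huset."),
    ("I think he is right.", "Jeg synes han har rett."),
    ("They were thinking of leaving.", "De tenkte på å dra."),
]


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def matches(token, pattern):
    """Compare a token with a pattern; a trailing '*' marks truncation
    (assumed syntax), so 'think*' matches 'think' and 'thinking'."""
    pattern = pattern.lower()
    if pattern.endswith("*"):
        return token.startswith(pattern[:-1])
    return token == pattern


def contains(tokens, pattern):
    """True if any token in the sentence matches the pattern."""
    return any(matches(t, pattern) for t in tokens)


def within_distance(tokens, first, second, max_distance):
    """True if `second` occurs 1..max_distance words after `first`
    (assumed interpretation of the distance option)."""
    first_pos = [i for i, t in enumerate(tokens) if matches(t, first)]
    second_pos = [i for i, t in enumerate(tokens) if matches(t, second)]
    return any(0 < j - i <= max_distance for i in first_pos for j in second_pos)


def search(pairs, source_word, target_word=None, negate=False):
    """Return the pairs whose source sentence contains `source_word`.

    If `target_word` is given, the translation must contain it (AND)
    or, with negate=True, must not contain it (NOT).
    """
    hits = []
    for source, target in pairs:
        if not contains(tokenize(source), source_word):
            continue
        if target_word is not None:
            found = contains(tokenize(target), target_word)
            if found == negate:
                continue
        hits.append((source, target))
    return hits
```

With this sketch, `search(pairs, "think*", "synes")` keeps only pairs where a form of "think" is rendered with "synes", while `search(pairs, "think*", "synes", negate=True)` keeps the pairs where it is rendered some other way. An indexed implementation would avoid rescanning every sentence for each query; the linear scan is used here only to keep the example short.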