Customized Europarl Corpus

The customized Europarl corpus was extracted from the Europarl corpus to support corpus-based research in translation studies. It contains 3,152,650 sentences in 21 European languages from 7 language (sub-)families and is encoded in TEI P5. For more details on this corpus, please refer to: [1]

Islam, Md. Zahurul
Multilingual Text Classification using Information-Theoretic Features
PhD Thesis; Goethe University Frankfurt; 2014

If you use this corpus, please cite the work above together with the following paper:

Philipp Koehn
Europarl: A Parallel Corpus for Statistical Machine Translation
MT Summit 2005

Copyright: We are not aware of any copyright restrictions on this resource. If you notice any problems, please let us know.

Acknowledgements: The work is supported by the LOEWE Digital-Humanities Project at the Goethe University Frankfurt.

Download as 7z file (92.4 MB)
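
Since the corpus is distributed in TEI P5, a standard XML parser is enough to inspect a file after the archive has been extracted. The following is a minimal sketch only: it assumes that sentences are marked up as TEI `<s>` elements, which this page does not confirm, and the function name `count_sentences` and the inline sample document are hypothetical illustrations, not material from the corpus. The namespace URI is the standard TEI P5 namespace.

```python
import xml.etree.ElementTree as ET

# Standard XML namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_sentences(tei_xml: str) -> int:
    """Count <s> (sentence-like unit) elements in a TEI P5 document string."""
    root = ET.fromstring(tei_xml)
    # Assumption: each sentence is wrapped in a TEI <s> element.
    return sum(1 for _ in root.iter(f"{{{TEI_NS}}}s"))


# Tiny hand-written sample for illustration; not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader/>
  <text><body><p>
    <s>Resumption of the session.</s>
    <s>The sitting is opened.</s>
  </p></body></text>
</TEI>"""

print(count_sentences(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` can replace the `ET.fromstring` call. If the corpus files use a different element for sentences, only the tag name in `root.iter(...)` would need to change.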

[1] M. Z. Islam and A. Mehler, “Customization of the Europarl Corpus for Translation Studies,” in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), 2012.

Abstract: Currently, the area of translation studies lacks corpora by which translation scholars can validate their theoretical claims, for example, regarding the scope of the characteristics of the translation relation. In this paper, we describe a customized resource in the area of translation studies that mainly addresses research on the properties of the translation relation. Our experimental results show that the Type-Token-Ratio (TTR) is not a universally valid indicator of the simplification of translation.