Corpora

FIGURE — Frankfurt Image Gestures
Primary source Annotations of videos of hand and arm gestures
Size 260 gestures
Published 2016
URL FIGURE corpus
License CC BY-SA 4.0
Reference
  • [PDF] A. Lücking, A. Mehler, D. Walther, M. Mauri, and D. Kurfürst, “Finding Recurrent Features of Image Schema Gestures: the FIGURE corpus,” in Proceedings of the 10th International Conference on Language Resources and Evaluation, 2016.
    [Bibtex]
    @InProceedings{Luecking:Mehler:Walther:Mauri:Kurfuerst:2016,
      author =     {L\"{u}cking, Andy and Mehler, Alexander and Walther,
                      D\'{e}sir\'{e}e and Mauri, Marcel and Kurf\"{u}rst,
                      Dennis},
      title =     {Finding Recurrent Features of Image Schema Gestures:
                      the {FIGURE} corpus},
      booktitle =     {Proceedings of the 10th International Conference on
                      Language Resources and Evaluation},
      year =     2016,
      series =     {LREC 2016},
      pdf =     {http://hucompute.org/wp-content/uploads/2016/04/lrec2016-gesture-study-final-version-short.pdf},
      location =     {Portoro\v{z} (Slovenia)}
    }
Language levels Image schemas
Purpose Depiction strategies

TGermaCorp
Primary source Lemmatized and tagged documents.
Size
  • v0.1 (2016): 242 documents, 7,336 sentences, 122,913 tokens
  • v0.2 (2017): 244 documents, 8,941 sentences, 157,210 tokens
Published 2016, 2017
URL
License CC BY-NC-ND 3.0 DE, or restricted
Reference Alexander Mehler
Language levels Lemma, POS
Purpose

Bible Corpus Tokenization Extension
Primary source http://christos-c.com/bible/
Size 4 files of tokenized Bible data
Published 2015
URL BibleCorpusTokenizationTTLAB extension to http://christos-c.com/bible/
License CC BY-NC-SA 3.0
Reference Armin Hoenen
Language levels Words
Purpose Community extension

Syntactic Language Networks (SLN)
Primary source SLNs are induced from dependency treebanks
Size Networks for 13 languages, 6 freely available
Published 2007-2009
URL Linguistic Networks
License CC BY-NC-ND 3.0 DE, or restricted
Reference Olga Pustylnikov, Alexander Mehler
Language levels Syntax, dependencies
Purpose Language typology

Morphological Derivation Networks
Primary source Networks are induced using a morphological derivation game that implements a word decomposition algorithm
Size Networks for 5 languages
Published 2009
URL Linguistic Networks
License CC BY-SA 4.0 DE
Reference Olga Pustylnikov
Language levels Morphology
Purpose Language typology, Productivity

Wikipedia Networks (WN)
Primary source WNs are induced from language-specific releases of Wikipedia
Size Networks of different sizes for 264 languages
Published 2008-2009
URL Linguistic Networks
License CC BY-SA 4.0 DE
Reference Alexander Mehler
Language levels Articles, ontologies, categories
Purpose Social ontologies

Lexical Co-occurrence Networks
Primary source Co-occurrence network based on the Patrologia Latina
Size
Published 2007-2010
URL Linguistic Networks
License Restricted, due to primary source
Reference Alexander Mehler
Language levels Words, co-occurrences
Purpose Language change, historical semantics

Frankfurter OCR-Korpus
Primary source Comparison of a scanned part of the Patrologia Latina (Flodoardi Canonici Remensis Historiae Remensis Ecclesiae Libri Quatuor, ed. Jean-Jacques Migne (Patrologia Latina T. 135), Paris 1853, col. 27A-328B) with the original
Size 5,213 pairs of words (wrongly scanned, correction)
Published 2014
URL Download as ZIP file (33.7 KB)
License CC BY-SA 4.0 DE
Reference Steffen Eger
Language levels Words
Purpose Spelling error correction

Tascfe
Primary sources Paper sheets of handwritten copies
Size 54 documents, ca. 6,500 digitized tokens
Published in preparation
URL Download as ZIP file (84.0 KB)
License CC BY-SA 4.0 DE
Reference Armin Hoenen, Goethe University Frankfurt
Language levels Persian Shahname excerpt
Purpose Copy errors, influence of oral versions in stemmatology, artificial non-Latin script corpus, copy from print, copy from handwriting

Person Database Deutsche Nationalbibliothek filtered
Primary sources DNB XML dump, filtered TSV file
Size ca. 250 MB; 1,912,675 person entries with variant names, occupations and kinship relations
Published in preparation
URL link will follow soon
License CC0 1.0 Universal (CC0 1.0)
Reference Goethe University Frankfurt
Language levels Named entities, persons
Purpose Large database of variant names

Avesta Yasna Ceremony
Primary sources Excel lexicon file with concordance, cf. Geldner (1896) – Titus, fully annotated; TEI file in TTLab format for text
Size 7,744 lexical entries, ca. 30,000 text tokens
Published Jügel forthcoming
URL Available via the Avestan language from Linguistic Networks
License CC BY-SA 4.0 DE
Reference Thomas Jügel, Goethe University Frankfurt
Language levels Ceremonial text
Purpose General analyses (syntactic, semantic, co-occurrence, comparison of Young Avestan and Old Avestan)

Bangla Textbook Corpus
Primary sources Bangla textbooks that have been used in public schools in Bangladesh
Size 661 documents, 105,897 sentences, 1,029,354 tokens
Published 2014
URL Bangla Textbook Corpus
License CC BY-SA 4.0 DE
Reference Zahurul Islam, Goethe University Frankfurt
Language levels Textbooks
Purpose Text readability analysis

English Textbook Corpus
Primary sources English versions of textbooks that have been used in public schools in Bangladesh
Size 519 documents, 95,470 sentences, 1,184,124 tokens
Published 2014
URL English Textbook Corpus
License CC BY-SA 4.0 DE
Reference Zahurul Islam, Goethe University Frankfurt
Language levels Textbooks
Purpose Text readability analysis

English Wikipedia Corpus
Primary sources English Wikipedia articles
Size 641 documents, 277,691 sentences, 5,949,254 tokens
Published 2014
URL English Wikipedia Corpus
License CC BY-SA 4.0 DE
Reference Zahurul Islam, Goethe University Frankfurt
Language levels Articles
Purpose Text readability analysis

Customized Europarl Corpus
Primary sources Europarl corpus
Size 3,152,650 sentences from 21 European languages
Published 2012-2014
URL Customized Europarl Corpus
License CC BY-SA 4.0 DE
Reference Zahurul Islam, Goethe University Frankfurt
Language levels Sentences
Purpose Translation studies