TEILex

Although automatic linguistic preprocessing of natural language text can achieve high accuracy, its overall performance does not meet the standards and expectations of the digital humanities. Since perfect preprocessing cannot be achieved, the best compromise is to allow users to correct lemmatisation and tagging errors in texts and to add missing entries to an underlying lexicon that is tightly connected to the corpus.

To meet these requirements, the Texttechnology Lab has developed TEILex.
TEILex implements an integrated representation of lexica and corpora. Additions, corrections, and revisions to lexica can be applied automatically using a linked TEI document as a source, without requiring changes to annotations or a reindexing of the affected corpora. In this way, annotators can correct errors and fill gaps in a lemmatisation, and the corrections appear immediately when browsing or searching the corpus. The TEILex module was developed to supply this functionality (see Figure 5). TEILex integrates the data models for TEI P5-conforming documents and for lexica in the same graph database. The Lexical Markup Framework (LMF) serves as an alternative format for TEILex. The innovation of TEILex, however, lies in the integration of documents and lexica, which are processed together. Every occurrence of a word in a text is logically linked to the corresponding syntactic word in the lexicon. When the lexicon changes, no subsequent data synchronisation is needed. This makes it much easier to make corrections and additions directly to the lexica using the linked document. Documents annotated in this way can be downloaded at any time as TEI P5 documents. The functionality of TEILex is also available within the eLexicon Browser.
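
The linking principle described above can be illustrated with a minimal Python sketch. It is not TEILex code and does not use a graph database; the class and field names (LexiconEntry, Token, LinkedCorpus, entry_id) are hypothetical and chosen only for illustration. The sketch assumes that each word occurrence stores nothing but a link to its lexicon entry, so that lemma and part of speech are resolved through that link at query time.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    """A syntactic word in the lexicon (hypothetical structure)."""
    entry_id: str
    lemma: str
    pos: str


@dataclass
class Token:
    """A word occurrence in a text; it stores only a link to the lexicon."""
    surface: str
    entry_id: str


@dataclass
class LinkedCorpus:
    """Toy stand-in for a store that holds documents and lexicon together."""
    lexicon: dict[str, LexiconEntry] = field(default_factory=dict)
    tokens: list[Token] = field(default_factory=list)

    def add_entry(self, entry: LexiconEntry) -> None:
        self.lexicon[entry.entry_id] = entry

    def annotate(self) -> list[tuple[str, str, str]]:
        # Lemma and POS are looked up through the link, never copied.
        return [
            (t.surface, self.lexicon[t.entry_id].lemma, self.lexicon[t.entry_id].pos)
            for t in self.tokens
        ]

    def search_lemma(self, lemma: str) -> list[int]:
        # Return the positions of all tokens whose linked entry has this lemma.
        return [
            i for i, t in enumerate(self.tokens)
            if self.lexicon[t.entry_id].lemma == lemma
        ]


corpus = LinkedCorpus()
corpus.add_entry(LexiconEntry("e1", lemma="goe", pos="VERB"))  # faulty lemma
corpus.add_entry(LexiconEntry("e2", lemma="she", pos="PRON"))
corpus.tokens = [Token("She", "e2"), Token("goes", "e1"), Token("went", "e1")]

before = corpus.search_lemma("go")  # the faulty lemma hides both verb forms

# One correction in the lexicon; no token is touched and nothing is reindexed.
corpus.lexicon["e1"].lemma = "go"

after = corpus.search_lemma("go")  # both linked occurrences are now found
```

Because the tokens hold references rather than copies of the lemma, a single change to the lexicon entry is visible for every linked occurrence the next time the corpus is browsed or searched, which is the behaviour the paragraph above attributes to the integrated model.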

Figure 5: The workflow of document annotation without (above) and with (below) TEILex.