Armin Hoenen

Staff member

 

 

Education


  • Highest degree: M.A. 09/03-09/10
  • University: Magister Artium (Master's equivalent)
  • Major: Comparative Linguistics
  • Minor I: Computer Science
  • Minor II: Biology (Zoology)
  • Postgraduate Certificate: Language Teaching for Adults

Work Experience

  • 06/04 – 12/2010
    • Folk high school, companies (Löwen Entertainment, Schott, Academy Bonn): Teacher of Japanese, English, Business English, Dutch, Hindi
  • 09/09-12/2010
    • Kern AG: translations
  • 03/09-10/09
    • Rüdesheim Tourist AG: Tourism Marketing Internship, Homepage Administration, Translations for www.ruedesheim.de (French, English, Dutch)
  • 3/09-8/09
    • IBM Böblingen Lab: Data Mining Internship
  • 03/07-03/09
    • Prime Research: Media Analysis, GM News Service, Translations
  • 03/2005 – 09/2005
    • JAC Japan: Host in German Pavilion, Nagoya, Aichi, Japan
  • 2000 – 2003
    • DRK (German Red Cross): Rescue Medic (2005 Firefighter Course Aichi Prefecture, 2006 Rescue Swimmer Bronze)
  • many others…

Publications

Total: 24

2017 (3)

  • [http://aclweb.org/anthology/W17-3402] A. Hoenen, S. Eger, and R. Gehrke, “How Many Stemmata with Root Degree k?,” in Proceedings of the 15th Meeting on the Mathematics of Language, 2017, pp. 11-21.
    [BibTeX]

    @inproceedings{hoenen2017c,
     author = {Hoenen, Armin and Eger, Steffen and Gehrke, Ralf},
     title = {{How Many Stemmata with Root Degree k?}},
     booktitle = {Proceedings of the 15th Meeting on the Mathematics of Language},
     year = {2017},
     publisher = {Association for Computational Linguistics},
     pages = {11--21},
     location = {London, UK},
     url = {http://aclweb.org/anthology/W17-3402}
    }
  • [https://link.springer.com/chapter/10.1007/978-3-319-59569-6_33] A. Hoenen, “Using Word Embeddings for Computing Distances Between Texts and for Authorship Attribution,” in International Conference on Applications of Natural Language to Information Systems, 2017, pp. 274-277.
    [BibTeX]

    @inproceedings{hoenen2017b,
     title={{Using Word Embeddings for Computing Distances Between Texts and for Authorship Attribution}},
     author={Hoenen, Armin},
     booktitle={International Conference on Applications of Natural Language to Information Systems},
     pages={274--277},
     year={2017},
     organization={Springer},
    url={https://link.springer.com/chapter/10.1007/978-3-319-59569-6_33}
    }
  • [http://aiucd2017.aiucd.it/wp-content/uploads/2017/01/book-of-abstract-AIUCD-2017.pdf] A. Hoenen, “Beyond the tree – a theoretical model of contamination and a software to generate multilingual stemmata,” in Book of Abstracts of the annual conference of the AIUCD 2017, Sapienza, Rome, AIUCD, 2017.
    [BibTeX]

    @INCOLLECTION{Hoenen:2017aiucd,
        author={Hoenen, Armin},
        title={{Beyond the tree – a theoretical model of contamination and a software to generate multilingual stemmata}},
        booktitle={{Book of Abstracts of the annual conference of the AIUCD 2017, Sapienza, Rome}},
        year={2017},
        publisher={AIUCD},
    url={http://aiucd2017.aiucd.it/wp-content/uploads/2017/01/book-of-abstract-AIUCD-2017.pdf}}

2016 (9)

  • Corpora and Resources for (Historical) Low Resource Languages, A. Hoenen, A. Mehler, and J. Gippert, Eds., JLCL, 2016, vol. 31, iss. 2.
    [BibTeX]

    @misc{GSCL:JLCL:2016:2,
      editor={Armin Hoenen and Alexander Mehler and Jost Gippert},
      title={{Corpora and Resources for (Historical) Low Resource Languages}},
      publisher={JLCL},
      volume={31},
      number={2},
      year={2016},
      issn={2190-6858},
      bibsource={GSCL, http://www.gscl.info/}
    }
  • A. Hoenen, A. Mehler, and J. Gippert, “Editorial,” JLCL, vol. 31, iss. 2, pp. iii-iv, 2016.
    [BibTeX]

    @ARTICLE{HoenenMehlerGippert2016,
      AUTHOR = {Armin Hoenen and Alexander Mehler and Jost Gippert},
      TITLE = {{Editorial}},
      JOURNAL = {JLCL},
      YEAR = {2016},
      VOLUME = {31},
      NUMBER = {2},
      PAGES = {iii--iv}
    }
  • A. Hoenen and L. Samushia, “Gepi: An Epigraphic Corpus for Old Georgian and a Tool Sketch for Aiding Reconstruction,” JLCL, vol. 31, iss. 2, pp. 25-38, 2016.
    [BibTeX]

    @ARTICLE{HoenenSamushia:2016,
      AUTHOR = {Armin Hoenen and Lela Samushia},
      TITLE = {{Gepi: An Epigraphic Corpus for Old Georgian and a Tool Sketch for Aiding Reconstruction}},
      JOURNAL = {JLCL},
      YEAR = {2016},
      VOLUME = {31},
      NUMBER = {2},
      PAGES = {25--38}
    }
  • [PDF] S. Eger, A. Hoenen, and A. Mehler, “Language classification from bilingual word embedding graphs,” in Proceedings of COLING 2016, 2016.
    [BibTeX]

    @InProceedings{Eger:Hoenen:Mehler:2016,
        author =     {Steffen Eger and Armin Hoenen and Alexander Mehler},
        title =     {Language classification from bilingual word embedding graphs},
        booktitle =     {Proceedings of COLING 2016},
        year =     2016,
        location =     {Osaka},
        pdf = {https://hucompute.org/wp-content/uploads/2016/10/eger_hoenen_mehler_COLING2016.pdf},
        publisher = {ACL},
    }
  • [http://dh2016.adho.org/abstracts/311] A. Hoenen, “Silva Portentosissima – Computer-Assisted Reflections on Bifurcativity in Stemmas,” in Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, 2016, pp. 557-560.
    [Abstract] [BibTeX]

    In 1928, the philologist Joseph Bédier examined contemporary stemmas and found that they contained a suspiciously large number of bifurcations. This paper uses a computer simulation to assess the argument that, when a large number of manuscripts are lost, the number of bifurcations in the true stemmas would naturally be high because the probability that siblings survive becomes very low.
    @InProceedings{Hoenen:2016DH,
      Title =     {{Silva Portentosissima – Computer-Assisted Reflections on Bifurcativity in Stemmas}},
      Author =     {Hoenen, Armin},
      Booktitle =     {Digital Humanities 2016: Conference Abstracts. Jagiellonian University \& Pedagogical University},
      Year =     2016,
      location =     {Kraków},
      pages =     {557-560},
      url = {http://dh2016.adho.org/abstracts/311},
      series =     {DH 2016},
      abstract = {In 1928, the philologist Joseph Bédier examined contemporary stemmas and found that they contained a suspiciously large number of bifurcations. This paper uses a computer simulation to assess the argument that, when a large number of manuscripts are lost, the number of bifurcations in the true stemmas would naturally be high because the probability that siblings survive becomes very low.}
    }
  • [PDF] A. Lücking, A. Hoenen, and A. Mehler, “TGermaCorp — A (Digital) Humanities Resource for (Computational) Linguistics,” in Proceedings of the 10th International Conference on Language Resources and Evaluation, 2016.
    [BibTeX]

    @InProceedings{Luecking:Hoenen:Mehler:2016,
      author =     {L\"{u}cking, Andy and Hoenen, Armin and Mehler,
                      Alexander},
      title =     {{TGermaCorp} -- A (Digital) Humanities Resource for
                      (Computational) Linguistics},
      booktitle =     {Proceedings of the 10th International Conference on
                      Language Resources and Evaluation},
      year =     2016,
      series =     {LREC 2016},
      pdf =     {http://hucompute.org/wp-content/uploads/2016/04/lrec2016-ttgermacorp-final.pdf},
    islrn={536-382-801-278-5},
      location =     {Portoro\v{z} (Slovenia)}
    }
  • A. Hoenen, “Repetition Analyses Function,” in Proceedings of the 2015 Herrenhäuser Symposium Visual Linguistics, 2016.
    [Abstract] [BibTeX]

    The ReAF is a dynamic heat map developed to represent exact and bag-of-words based repetitions in digitisations of verse-bound text. Verse itself is a repetition in linguistic patterning; text itself is a visualisation of speech. In this sense, line breaks are a visualisation technique based on the repetition of linguistic patterning, which the ReAF maintains. Verse-bound text existed prior to the invention of script; the first written literary output of a culture is usually in verse. In their seminal work, Lord (1960) and Parry (1971) attempted to explain the peculiarities of one such text, the Odyssey, by investigating a living oral tradition in Yugoslavia. They developed the Oral Formulaic Theory and showed how bardic composition in performance works. No single author exists, but formulae and story lines are passed on from generation to generation; each performance is a unique text, and no two performances of the same epic are the same. Their conclusion is that one original text of the Odyssey does not exist, has never existed and cannot even exist. Lord and Parry developed tests for the orality of a given text, in which they underlined repeated passages or formulae. Compiling this visualisation in the print age required a lot of manual labour, so they largely limited themselves to shorter passages such as the beginning of the Odyssey. This limitation was later criticised, for instance by Finnegan (1992), who misses a complete statistical analysis. The ReAF is a holistic extension of that late print-age visualisation of repetition in verse-bound text. It uses HTML and JavaScript to generate an interactive visualisation that needs only very simple preprocessing and is platform and browser independent, in which users can navigate the text to verify or falsify their assumptions on text genesis and text category. References: Lord, A. B. (1960). The Singer of Tales. Harvard University Press. Parry, M. (1971). The Making of Homeric Verse: The Collected Papers of Milman Parry. Clarendon Press. Finnegan, R. (1992). Oral Poetry. Indiana University Press.
    @INPROCEEDINGS{Hoenen:2016forth,
        author={Hoenen, Armin},
        title={Repetition Analyses Function},
        booktitle={Proceedings of the 2015 Herrenh{\"a}user Symposium Visual Linguistics},
        year={2016},
        publisher={IDS Mannheim},
        abstract = {The ReAF is a dynamic heat map developed to represent exact and bag-of-words based repetitions in digitisations of verse-bound text. Verse itself is a repetition in linguistic patterning; text itself is a visualisation of speech. In this sense, line breaks are a visualisation technique based on the repetition of linguistic patterning, which the ReAF maintains. Verse-bound text existed prior to the invention of script; the first written literary output of a culture is usually in verse. In their seminal work, Lord (1960) and Parry (1971) attempted to explain the peculiarities of one such text, the Odyssey, by investigating a living oral tradition in Yugoslavia. They developed the Oral Formulaic Theory and showed how bardic composition in performance works. No single author exists, but formulae and story lines are passed on from generation to generation; each performance is a unique text, and no two performances of the same epic are the same. Their conclusion is that one original text of the Odyssey does not exist, has never existed and cannot even exist. Lord and Parry developed tests for the orality of a given text, in which they underlined repeated passages or formulae. Compiling this visualisation in the print age required a lot of manual labour, so they largely limited themselves to shorter passages such as the beginning of the Odyssey. This limitation was later criticised, for instance by Finnegan (1992), who misses a complete statistical analysis. The ReAF is a holistic extension of that late print-age visualisation of repetition in verse-bound text. It uses HTML and JavaScript to generate an interactive visualisation that needs only very simple preprocessing and is platform and browser independent, in which users can navigate the text to verify or falsify their assumptions on text genesis and text category. References: Lord, A. B. (1960). The Singer of Tales. Harvard University Press. Parry, M. (1971). The Making of Homeric Verse: The Collected Papers of Milman Parry. Clarendon Press. Finnegan, R. (1992). Oral Poetry. Indiana University Press.}}
  • [PDF] A. Hoenen, “Wikipedia Titles As Noun Tag Predictors,” in Proceedings of the 10th International Conference on Language Resources and Evaluation, 2016.
    [BibTeX]

    @InProceedings{Hoenen:2016x,
        author =     {Hoenen, Armin},
        title =     {{Wikipedia Titles As Noun Tag Predictors}},
        booktitle =     {Proceedings of the 10th International Conference on Language Resources and Evaluation},
        year =     2016,
        series =     {LREC 2016},
        pdf =     {http://www.lrec-conf.org/proceedings/lrec2016/pdf/18_Paper.pdf},
        location =     {Portoro\v{z} (Slovenia)}
      }
  • [http://www.dhd2016.de/abstracts/posters-060.html] A. Hoenen, “Das erste dynamische Stemma, Pionier des digitalen Zeitalters?,” in Accepted in the Proceedings of the Jahrestagung der Digital Humanities im deutschsprachigen Raum, 2016.
    [BibTeX]

    @INPROCEEDINGS{Hoenen:2016y,
        booktitle={Accepted in the Proceedings of the Jahrestagung der Digital Humanities im deutschsprachigen Raum},
        author={Hoenen, Armin},
        year={2016},
        title={Das erste dynamische Stemma, Pionier des digitalen Zeitalters?},
        url = {http://www.dhd2016.de/abstracts/posters-060.html}
      }

2015 (5)

  • N. Dundua, A. Hoenen, and L. Samushia, “A Parallel Corpus of the Old Georgian Gospel Manuscripts and their Stemmatology,” The Georgian Journal for Language Logic Computation, vol. IV, pp. 176-185, 2015.
    [BibTeX]

    @ARTICLE{Dundua:Hoenen:Samushia:2015,
        author={Dundua, Natia and Hoenen, Armin and Samushia, Lela},
        title={{A Parallel Corpus of the Old Georgian Gospel Manuscripts and their Stemmatology}},
        journal={The Georgian Journal for Language Logic Computation},
        year={2015},
        volume={IV},
        pages={176-185},
        publisher={CLLS, Tbilisi State University and Kurt G{\"o}del Society}}
  • [PDF] A. Hoenen, “Das artifizielle Manuskriptkorpus TASCFE,” in Accepted in the Proceedings of the Jahrestagung der Digital Humanities im deutschsprachigen Raum, 2015.
    [BibTeX]

    @INPROCEEDINGS{Hoenen:2015,
        booktitle={Accepted in the Proceedings of the Jahrestagung der Digital Humanities im deutschsprachigen Raum},
        author={Hoenen, Armin},
        year={2015},
        title={Das artifizielle Manuskriptkorpus TASCFE},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/Hoenen_tascfeDH2015.pdf}}
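  • A. Hoenen and F. Mader, “A New LMF Schema Application by Example of an Austrian Lexicon Applied to the Historical Corpus of the Writer Hugo von Hofmannsthal,” in Historical Corpora, Frankfurt am Main, Germany, 2015.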
    [Abstract] [BibTeX]

    This paper accompanies the release of an Austrian lemma list for NLP applications and presents the creation and representation of a digital dialect lemma list from existing internet sources and books. The creation procedure can serve as a role model for similar projects on other dialects and points to a new cost-saving way to produce NLP resources by using the internet in a way similar to human-based computation. Dialect lexica can facilitate NLP and improve POS tagging for German language resources in general. The representation standard used is LMF. It will be demonstrated how this lemma list can be used as a tool in literary studies, linguistics and computational linguistics. In particular, the critical edition of Hugo von Hofmannsthal is a well-suited corpus for the aforementioned research fields and the inspiration to build this tool.
    @INPROCEEDINGS{Hoenen:Mader:2015,
        website={http://www.narr-shop.de/historical-corpora.html},
        booktitle={Historical Corpora},
        author={Hoenen, Armin and Mader, Franziska},
        year={2015},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/HoenenMader2013-a-new-lmf-schema-application.pdf},
        title={A New LMF Schema Application by Example of an Austrian Lexicon Applied to the Historical Corpus of the Writer Hugo von Hofmannsthal},
        address={Frankfurt am Main, Germany},
        abstract={This paper accompanies the release of an Austrian lemma list for NLP applications and presents the creation and representation of a digital dialect lemma list from existing internet sources and books. The creation procedure can serve as a role model for similar projects on other dialects and points to a new cost-saving way to produce NLP resources by using the internet in a way similar to human-based computation. Dialect lexica can facilitate NLP and improve POS tagging for German language resources in general. The representation standard used is LMF. It will be demonstrated how this lemma list can be used as a tool in literary studies, linguistics and computational linguistics. In particular, the critical edition of Hugo von Hofmannsthal is a well-suited corpus for the aforementioned research fields and the inspiration to build this tool.}}
  • A. Hoenen, “Lachmannian Archetype Reconstruction for Ancient Manuscript Corpora,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2015. Citation: Trovato was published in 2014, not in 2009.
    [Abstract] [BibTeX]

    Two goals are targeted by computer philology for ancient manuscript corpora: firstly, making an edition, that is roughly speaking one text version representing the whole corpus, which contains variety induced through copy errors and other processes and secondly, producing a stemma. A stemma is a graph-based visualization of the copy history with manuscripts as nodes and copy events as edges. Its root, the so-called archetype is the supposed original text or urtext from which all subsequent copies are made. Our main contribution is to present one of the first computational approaches to automatic archetype reconstruction and to introduce the first text-based evaluation for automatically produced archetypes. We compare a philologically generated archetype with one generated by bio-informatic software.
    @INPROCEEDINGS{Hoenen:2015a,
        booktitle={Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT)},
        author={Hoenen, Armin},
        year={2015},
        note={Citation: Trovato was published in 2014, not in 2009.},
        website={http://www.aclweb.org/anthology/N15-1127},
        title={Lachmannian Archetype Reconstruction for Ancient Manuscript Corpora},
        abstract={Two goals are targeted by computer philology for ancient manuscript corpora: firstly, making an edition, that is roughly speaking one text version representing the whole corpus, which contains variety induced through copy errors and other processes and secondly, producing a stemma. A stemma is a graph-based visualization of the copy history with manuscripts as nodes and copy events as edges. Its root, the so-called archetype is the supposed original text or urtext from which all subsequent copies are made. Our main contribution is to present one of the first computational approaches to automatic archetype reconstruction and to introduce the first text-based evaluation for automatically produced archetypes. We compare a philologically generated archetype with one generated by bio-informatic software.}}
  • A. Hoenen, “Simulating Misreading,” in Proceedings of the 20th International Conference on Applications of Natural Language to Information Systems (NLDB), 2015.
    [Abstract] [BibTeX]

    Physical misreading (as opposed to interpretational misreading) is an unnoticed substitution in silent reading. Especially for legally important documents or instruction manuals, this can lead to serious consequences. We present a prototype of an automatic highlighter targeting words which can most easily be misread in a given text using a dynamic orthographic neighbour concept. We propose measures of fit of a misread token based on Natural Language Processing and detect a list of short most easily misread tokens in the English language. We design a highlighting scheme for avoidance of misreading.
    @INPROCEEDINGS{Hoenen:2015b,
        booktitle={Proceedings of the 20th International Conference on Applications of Natural Language to Information Systems (NLDB)},
        author={Hoenen, Armin},
        website={http://link.springer.com/chapter/10.1007/978-3-319-19581-0_34},
        year={2015},
        title={Simulating Misreading},
        abstract={Physical misreading (as opposed to interpretational misreading) is an unnoticed substitution in silent reading. Especially for legally important documents or instruction manuals, this can lead to serious consequences. We present a prototype of an automatic highlighter targeting words which can most easily be misread in a given text using a dynamic orthographic neighbour concept. We propose measures of fit of a misread token based on Natural Language Processing and detect a list of short most easily misread tokens in the English language. We design a highlighting scheme for avoidance of misreading.}}

2014 (2)

  • [http://dhd-wp.hab.de/files/book_of_abstracts.pdf] A. Hoenen, “Stemmatology, an interdisciplinary endeavour,” in Book of Abstracts zum DHd Workshop Informatik und die Digital Humanities, DHd, 2014.
    [BibTeX]

    @INCOLLECTION{Hoenen:2014plz,
        author={Hoenen, Armin},
        title={{Stemmatology, an interdisciplinary endeavour}},
        booktitle={{Book of Abstracts zum DHd Workshop Informatik und die Digital Humanities}},
        year={2014},
        publisher={DHd},
        url={http://dhd-wp.hab.de/files/book_of_abstracts.pdf}}
  • A. Hoenen, “Simulation of Scribal Letter Substitution,” in Analysis of Ancient and Medieval Texts and Manuscripts: Digital Approaches, 2014.
    [BibTeX]

    @INPROCEEDINGS{Hoenen:2014,
        owner={hoenen},
        booktitle={Analysis of Ancient and Medieval Texts and Manuscripts: Digital Approaches},
        author={Hoenen, Armin},
        editor={T. L. Andrews and C. Macé},
        year={2014},
        title={Simulation of Scribal Letter Substitution},
        website={http://www.brepols.net/Pages/ShowProduct.aspx?prod_id=IS-9782503552682-1}}

2013 (1)

  • [PDF] M. Z. Islam and A. Hoenen, “Source and Translation Classification using Most Frequent Words,” in Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP), 2013.
    [Abstract] [BibTeX]

    Recently, translation scholars have made some general claims about translation properties. Some of these are source language independent while others are not. Koppel and Ordan (2011) performed empirical studies to validate both types of properties using English source texts and other texts translated into English. Obviously, corpora of this sort, which focus on a single language, are not adequate for claiming universality of translation properties. In this paper, we are validating both types of translation properties using original and translated texts from six European languages.
    @INPROCEEDINGS{Islam:Hoenen:2013,
        booktitle={Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP)},
        author={Islam, Md. Zahurul and Hoenen, Armin},
        year={2013},
        title={Source and Translation Classification using Most Frequent Words},
        pdf={http://www.aclweb.org/anthology/I/I13/I13-1185.pdf},
        website={http://aclanthology.info/papers/source-and-translation-classification-using-most-frequent-words},
        abstract={Recently, translation scholars have made some general claims about translation properties. Some of these are source language independent while others are not. Koppel and Ordan (2011) performed empirical studies to validate both types of properties using English source texts and other texts translated into English. Obviously, corpora of this sort, which focus on a single language, are not adequate for claiming universality of translation properties. In this paper, we are validating both types of translation properties using original and translated texts from six European languages.}}

2012 (3)

  • A. Hoenen, “Measuring Repetitiveness in Texts, a Preliminary Investigation,” Sprache und Datenverarbeitung. International Journal for Language Data Processing, vol. 36, iss. 2, pp. 93-104, 2012.
    [Abstract] [BibTeX]

    In this paper, a model is presented for the automatic measurement of repetitiveness that can systematically describe the usage and function of repetition in written text. The motivating hypothesis for this study is that the more repetitive a text is, the easier it is to memorize. Therefore, an automated measurement index can provide feedback to writers and to those who design texts that are often memorized, including songs, holy texts, theatrical plays, and advertising slogans. The potential benefits of this kind of systematic feedback are numerous, the main one being that content creators would be able to employ a standard threshold of memorizability. This study explores multiple ways of implementing and calculating repetitiveness across levels of analysis (such as paragraph level or sub-word level), genres (such as songs, holy texts, and other genres) and languages, integrating these into a model for the automatic measurement of repetitiveness. The Avestan language and some of its idiosyncratic features are explored in order to illuminate how the proposed index is applied in ranking texts according to their repetitiveness.
    @ARTICLE{Hoenen:2012:a,
        journal={Sprache und Datenverarbeitung. International Journal for Language Data Processing},
        pages={93-104},
        number={2},
        author={Hoenen, Armin},
        volume={36},
        year={2012},
        title={Measuring Repetitiveness in Texts, a Preliminary Investigation},
        abstract={In this paper, a model is presented for the automatic measurement of repetitiveness that can systematically describe the usage and function of repetition in written text. The motivating hypothesis for this study is that the more repetitive a text is, the easier it is to memorize. Therefore, an automated measurement index can provide feedback to writers and to those who design texts that are often memorized, including songs, holy texts, theatrical plays, and advertising slogans. The potential benefits of this kind of systematic feedback are numerous, the main one being that content creators would be able to employ a standard threshold of memorizability. This study explores multiple ways of implementing and calculating repetitiveness across levels of analysis (such as paragraph level or sub-word level), genres (such as songs, holy texts, and other genres) and languages, integrating these into a model for the automatic measurement of repetitiveness. The Avestan language and some of its idiosyncratic features are explored in order to illuminate how the proposed index is applied in ranking texts according to their repetitiveness.},
        website={http://www.linse.uni-due.de/jahrgang-36-2012/articles/measuring-repetitiveness-in-texts-a-preliminary-investigation.html}}
  • [PDF] A. Hoenen and T. Jügel, Altüberlieferte Sprachen als Gegenstand der Texttechnologie — Ancient Languages as the Object of Text Technology, A. Hoenen and T. Jügel, Eds., JLCL, 2012, vol. 27.
    [Abstract] [BibTeX]

    ‘Avestan’ is the name of the ritual language of Zoroastrianism, which was the state religion of the Iranian empire in Achaemenid, Arsacid and Sasanid times, covering a time span of more than 1200 years. It is named after the ‘Avesta’, i.e., the collection of holy scriptures that form the basis of the religion which was allegedly founded by Zarathushtra, also known as Zoroaster, by about the beginning of the first millennium B.C. Together with Vedic Sanskrit, Avestan represents one of the most archaic witnesses of the Indo-Iranian branch of the Indo-European languages, which makes it especially interesting for historical-comparative linguistics. This is why the texts of the Avesta were among the first objects of electronic corpus building that were undertaken in the framework of Indo-European studies, leading to the establishment of the TITUS database (‘Thesaurus indogermanischer Text- und Sprachmaterialien’). Today, the complete Avestan corpus is available, together with elaborate search functions and an extended version of the subcorpus of the so-called ‘Yasna’, which covers a great deal of the attestation of variant readings. Right from the beginning of their computational work concerning the Avesta, the compilers had to cope with the fact that the texts contained in it have been transmitted in a special script written from right to left, which was also used for printing them in the scholarly editions used until today. It goes without saying that there was no way in the middle of the 1980s to encode the Avestan scriptures exactly as they are found in the manuscripts. Instead, we had to rely upon transcriptional devices that were dictated by the restrictions of character encoding as provided by the computer systems used. As the problems we had to face in this respect and the solutions we could apply are typical for the development of computational work on ancient languages, it seems worthwhile to sketch them out here.
    @BOOK{Hoenen:Jügel:2012,
        publisher={JLCL},
        author={Hoenen, Armin and Jügel, Thomas},
        number={2},
        volume={27},
        editor={Armin Hoenen and Thomas Jügel},
        pdf={http://www.jlcl.org/2012_Heft2/H2012-2.pdf},
        year={2012},
        image={https://hucompute.org/wp-content/uploads/2015/09/AltueberlieferteSprachen-300-20.png},
        title={Altüberlieferte Sprachen als Gegenstand der Texttechnologie -- Ancient Languages as the Object of Text Technology},
        abstract={‘Avestan’ is the name of the ritual language of Zoroastrianism, which was the state religion of the Iranian empire in Achaemenid, Arsacid and Sasanid times, covering a time span of more than 1200 years. It is named after the ‘Avesta’, i.e., the collection of holy scriptures that form the basis of the religion which was allegedly founded by Zarathushtra, also known as Zoroaster, by about the beginning of the first millennium B.C. Together with Vedic Sanskrit, Avestan represents one of the most archaic witnesses of the Indo-Iranian branch of the Indo-European languages, which makes it especially interesting for historical-comparative linguistics. This is why the texts of the Avesta were among the first objects of electronic corpus building that were undertaken in the framework of Indo-European studies, leading to the establishment of the TITUS database (‘Thesaurus indogermanischer Text- und Sprachmaterialien’). Today, the complete Avestan corpus is available, together with elaborate search functions and an extended version of the subcorpus of the so-called ‘Yasna’, which covers a great deal of the attestation of variant readings. Right from the beginning of their computational work concerning the Avesta, the compilers had to cope with the fact that the texts contained in it have been transmitted in a special script written from right to left, which was also used for printing them in the scholarly editions used until today. It goes without saying that there was no way in the middle of the 1980s to encode the Avestan scriptures exactly as they are found in the manuscripts. Instead, we had to rely upon transcriptional devices that were dictated by the restrictions of character encoding as provided by the computer systems used. As the problems we had to face in this respect and the solutions we could apply are typical for the development of computational work on ancient languages, it seems worthwhile to sketch them out here.},
        issn={2190-6858}}
  • [PDF] M. Sukhareva, M. Z. Islam, A. Hoenen, and A. Mehler, “A Three-step Model of Language Detection in Multilingual Ancient Texts,” in Proceedings of Workshop on Annotation of Corpora for Research in the Humanities, Heidelberg, Germany, 2012.
    [Abstract] [BibTeX]

    Ancient corpora contain various multilingual patterns. This imposes numerous problems on their manual annotation and automatic processing. We introduce a lexicon building system, called Lexicon Expander, that has an integrated language detection module, Language Detection (LD) Toolkit. The Lexicon Expander post-processes the output of the LD Toolkit which leads to the improvement of f-score and accuracy values. Furthermore, the functionality of the Lexicon Expander also includes manual editing of lexical entries and automatic morphological expansion by means of a morphological grammar.
    @INPROCEEDINGS{Sukhareva:Islam:Hoenen:Mehler:2012,
        booktitle={Proceedings of Workshop on Annotation of Corpora for Research in the Humanities},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/sukhareva_islam_hoenen_mehler_2011.pdf},
        author={Sukhareva, Maria and Islam, Md. Zahurul and Hoenen, Armin and Mehler, Alexander},
        year={2012},
        title={A Three-step Model of Language Detection in Multilingual Ancient Texts},
        address={Heidelberg, Germany},
        abstract={Ancient corpora contain various multilingual patterns. This imposes numerous problems on their manual annotation and automatic processing. We introduce a lexicon building system, called Lexicon Expander, that has an integrated language detection module, Language Detection (LD) Toolkit. The Lexicon Expander post-processes the output of the LD Toolkit which leads to the improvement of f-score and accuracy values. Furthermore, the functionality of the Lexicon Expander also includes manual editing of lexical entries and automatic morphological expansion by means of a morphological grammar.},
        website={https://www.academia.edu/2236625/A_Three-step_Model_of_Language_Detection_in_Multilingual_Ancient_Texts}}

2011 (1)

  • [PDF] R. Gleim, A. Hoenen, N. Diewald, A. Mehler, and A. Ernst, “Modeling, Building and Maintaining Lexica for Corpus Linguistic Studies by Example of Late Latin,” in Corpus Linguistics 2011, 20-22 July, Birmingham, 2011.
    [BibTeX]

    @INPROCEEDINGS{Gleim:Hoenen:Diewald:Mehler:Ernst:2011,
        booktitle={Corpus Linguistics 2011, 20-22 July, Birmingham},
        author={Gleim, Rüdiger and Hoenen, Armin and Diewald, Nils and Mehler, Alexander and Ernst, Alexandra},
        year={2011},
        title={Modeling, Building and Maintaining Lexica for Corpus Linguistic Studies by Example of Late Latin},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/Paper-48.pdf}}