Rüdiger Gleim

Scientific Assistant

Robert Mayer Str. 10
Tel: +49 69-798-28926
Room 402
Office Hours: Thursday, 3-4 PM



Linguistic Databases

Almost any study in corpus linguistics boils down to constructing, annotating, representing and analyzing linguistic data. The requirements on a proper database are often contradictory:

  • It should scale well with ever-growing corpora such as Wikipedia, while still being flexible for annotation and editing.
  • It should serve a broad spectrum of analyses by minimizing the need to transform data for a specific kind of analysis, while still being space-efficient.
  • The data model should be able to mediate between standard formats without becoming over-generic and difficult to handle.
  • ….

Designing and developing linguistic databases has become a major topic for me. Realizing that there is no such thing as the ultimate solution, I am interested in all kinds of database management systems and paradigms, including relational, graph, distributed and NoSQL databases, as well as APIs for persistent storage.
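
To make these trade-offs concrete, here is a minimal sketch of the kind of generic, layer-open annotation schema such considerations lead to. It is purely illustrative: all table and column names are hypothetical and do not describe any particular system mentioned on this page.

    # Illustrative sketch only: a small relational schema that keeps
    # token-level annotation generic. All names are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE token (
        id       INTEGER PRIMARY KEY,
        doc_id   INTEGER REFERENCES document(id),
        position INTEGER,  -- token offset within the document
        surface  TEXT      -- word form as it occurs in the text
    );
    -- Annotations live in their own table, so new layers (lemma, POS,
    -- named entities, ...) can be added without altering the token table.
    CREATE TABLE annotation (
        token_id INTEGER REFERENCES token(id),
        layer    TEXT,     -- e.g. 'lemma' or 'pos'
        value    TEXT
    );
    """)

    doc = conn.execute("INSERT INTO document (title) VALUES ('demo')").lastrowid
    tok = conn.execute(
        "INSERT INTO token (doc_id, position, surface) VALUES (?, 0, 'Lexika')",
        (doc,),
    ).lastrowid
    conn.execute(
        "INSERT INTO annotation (token_id, layer, value) VALUES (?, 'lemma', 'Lexikon')",
        (tok,),
    )

    # The same store serves different analyses without transformation,
    # e.g. a join over tokens and their lemma layer:
    for row in conn.execute(
        "SELECT t.surface, a.value FROM token t "
        "JOIN annotation a ON a.token_id = t.id WHERE a.layer = 'lemma'"
    ):
        print(row)  # ('Lexika', 'Lexikon')

A graph or NoSQL backend could host the same three-way split of documents, tokens and annotation layers; the point is the layer-open model, not the storage engine.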

Publications

Total: 37

2017 (1)

  • A. Mehler, R. Gleim, W. Hemati, and T. Uslu, “Skalenfreie online-soziale Lexika am Beispiel von Wiktionary,” in Proceedings of 53rd Annual Conference of the Institut für Deutsche Sprache (IDS), March 14-16, Mannheim, Germany, Berlin, 2017. In German. Title translates as: Scale-Free Online Social Lexica by the Example of Wiktionary
    [Abstract] [BibTeX]

    In English: The paper deals with characteristics of the structural, thematic and participatory dynamics of collaboratively generated lexical networks. This is done by example of Wiktionary. Starting from a network-theoretical model in terms of so-called multi-layer networks, we describe Wiktionary as a scale-free lexicon. Systems of this sort are characterized by the fact that their content-related dynamics is determined by the underlying dynamics of collaborating authors. This happens in a way that social structure imprints on content structure. According to this conception, the unequal distribution of the activities of authors results in a correspondingly unequal distribution of the information units documented within the lexicon. The paper focuses on foundations for describing such systems starting from a parameter space which requires dealing with Wiktionary as an issue in big data analysis. In German: Der Beitrag thematisiert Eigenschaften der strukturellen, thematischen und partizipativen Dynamik kollaborativ erzeugter lexikalischer Netzwerke am Beispiel von Wiktionary. Ausgehend von einem netzwerktheoretischen Modell in Form so genannter Mehrebenennetzwerke wird Wiktionary als ein skalenfreies Lexikon beschrieben. Systeme dieser Art zeichnen sich dadurch aus, dass ihre inhaltliche Dynamik durch die zugrundeliegende Kollaborationsdynamik bestimmt wird, und zwar so, dass sich die soziale Struktur der entsprechenden inhaltlichen Struktur aufprägt. Dieser Auffassung gemäß führt die Ungleichverteilung der Aktivitäten von Lexikonproduzenten zu einer analogen Ungleichverteilung der im Lexikon dokumentierten Informationseinheiten. Der Beitrag thematisiert Grundlagen zur Beschreibung solcher Systeme ausgehend von einem Parameterraum, welcher die netzwerkanalytische Betrachtung von Wiktionary als Big-Data-Problem darstellt.
    @InProceedings{Mehler:Gleim:Hemati:Uslu:2017,
        Title                    = {{Skalenfreie online-soziale Lexika am Beispiel von Wiktionary}},
        Author                   = {Alexander Mehler and Rüdiger Gleim and Wahed Hemati and Tolga Uslu},
        Booktitle                = {Proceedings of 53rd Annual Conference of the Institut für Deutsche Sprache (IDS), March 14-16, Mannheim, Germany},
        Year                     = {2017},
        Address                  = {Berlin},
        Abstract                 = {In English: The paper deals with characteristics of the structural, thematic and participatory dynamics of collaboratively generated lexical networks. This is done by example of Wiktionary. Starting from a network-theoretical model in terms of so-called multi-layer networks, we describe Wiktionary as a scale-free lexicon. Systems of this sort are characterized by the fact that their content-related dynamics is determined by the underlying dynamics of collaborating authors. This happens in a way that social structure imprints on content structure. According to this conception, the unequal distribution of the activities of authors results in a correspondingly unequal distribution of the information units documented within the lexicon. The paper focuses on foundations for describing such systems starting from a parameter space which requires to deal with Wiktionary as an issue in big data analysis. In German: Der Beitrag thematisiert Eigenschaften der strukturellen, thematischen und partizipativen Dynamik kollaborativ erzeugter lexikalischer Netzwerke am Beispiel von Wiktionary. Ausgehend von einem netzwerktheoretischen Modell in Form so genannter Mehrebenennetzwerke wird Wiktionary als ein skalenfreies Lexikon beschrieben. Systeme dieser Art zeichnen sich dadurch aus, dass ihre inhaltliche Dynamik durch die zugrundeliegende Kollaborationsdynamik bestimmt wird, und zwar so, dass sich die soziale Struktur der entsprechenden inhaltlichen Struktur aufprägt. Dieser Auffassung gemäß führt die Ungleichverteilung der Aktivitäten von Lexikonproduzenten zu einer analogen Ungleichverteilung der im Lexikon dokumentierten Informationseinheiten. Der Beitrag thematisiert Grundlagen zur Beschreibung solcher Systeme ausgehend von einem Parameterraum, welcher die netzwerkanalytische Betrachtung von Wiktionary als Big-Data-Problem darstellt.},
        Editor                   = {Stefan Engelberg and Henning Lobin and Kathrin Steyer and Sascha Wolfer},
        Note                     = {In German. Title translates as: Scale-Free Online Social Lexica by the Example of Wiktionary},
        Publisher                = {De Gruyter}
    }
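
As a loose illustration of the scale-free reading above: for a power law p(k) ~ k^(-gamma), frequency falls linearly against degree on log-log scales. The toy sketch below estimates gamma for an invented author-activity sample by a simple log-log regression; the data and the fitting method are placeholders, not the estimation procedure used in the paper.

    # Toy sketch (invented data): estimate a power-law exponent gamma
    # from a degree distribution via least squares on log-log scales.
    import math
    from collections import Counter

    degrees = [1, 1, 1, 1, 2, 2, 3, 5, 8, 13, 40, 120]  # hypothetical edits per author

    counts = Counter(degrees)
    xs = [math.log(k) for k in counts]           # log degree
    ys = [math.log(v) for v in counts.values()]  # log frequency

    # Slope of the regression line; for p(k) ~ k^-gamma it approximates -gamma.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    print(f"estimated gamma: {-slope:.2f}")

A log-log least-squares fit is a crude estimator; it only serves to make the "unequal distribution" claim tangible.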

2016 (3)

  • [http://dh2016.adho.org/abstracts/250] A. Mehler, B. Wagner, and R. Gleim, “Wikidition: Towards A Multi-layer Network Model of Intertextuality,” in Proceedings of DH 2016, 12-16 July, 2016.
    [Abstract] [BibTeX]

    The paper presents Wikidition, a novel text mining tool for generating online editions of text corpora. It explores lexical, sentential and textual relations to span multi-layer networks (linkification) that allow for browsing syntagmatic and paradigmatic relations among the constituents of its input texts. In this way, relations of text reuse can be explored together with lexical relations within the same literary memory information system. Beyond that, Wikidition contains a module for automatic lexiconisation to extract author-specific vocabularies. Based on linkification and lexiconisation, Wikidition not only allows for traversing input corpora on different (lexical, sentential and textual) levels; its readers can also study the vocabulary of authors on several levels of resolution including superlemmas, lemmas, syntactic words and wordforms. We exemplify Wikidition by a range of literary texts and evaluate it by means of the apparatus of quantitative network analysis.
    @InProceedings{Mehler:Wagner:Gleim:2016,
      Title =     {Wikidition: Towards A Multi-layer Network Model of Intertextuality},
      Author =     {Mehler, Alexander and Wagner, Benno and Gleim, R\"{u}diger},
      Booktitle =     {Proceedings of DH 2016, 12-16 July},
      Year =     2016,
      location =     {Kraków},
      series =     {DH 2016},
      url = {http://dh2016.adho.org/abstracts/250},
      abstract = {The paper presents Wikidition, a novel text mining tool for generating online editions of text corpora. It explores lexical, sentential and textual relations to span multi-layer networks (linkification) that allow for browsing syntagmatic and paradigmatic relations among the constituents of its input texts. In this way, relations of text reuse can be explored together with lexical relations within the same literary memory information system. Beyond that, Wikidition contains a module for automatic lexiconisation to extract author specific vocabularies. Based on linkification and lexiconisation, Wikidition does not only allow for traversing input corpora on different (lexical, sentential and textual) levels. Rather, its readers can also study the vocabulary of authors on several levels of resolution including superlemmas, lemmas, syntactic words and wordforms. We exemplify Wikidition by a range of literary texts and evaluate it by means of the apparatus of quantitative network analysis.}
    }
  • [PDF] S. Eger, R. Gleim, and A. Mehler, “Lemmatization and Morphological Tagging in German and Latin: A comparison and a survey of the state-of-the-art,” in Proceedings of the 10th International Conference on Language Resources and Evaluation, 2016.
    [BibTeX]

    @InProceedings{Eger:Mehler:Gleim:2016,
      Title =     {Lemmatization and Morphological Tagging in {German}
                      and {Latin}: A comparison and a survey of the
                      state-of-the-art},
      Author =     {Eger, Steffen and Gleim, R\"{u}diger and Mehler, Alexander},
      Booktitle =     {Proceedings of the 10th International Conference on
                      Language Resources and Evaluation},
      Year =     2016,
      location =     {Portoro\v{z} (Slovenia)},
      series =     {LREC 2016},
      pdf = {http://hucompute.org/wp-content/uploads/2016/04/lrec_eger_gleim_mehler.pdf}
    }
  • [DOI] A. Mehler, R. Gleim, T. vor der Brück, W. Hemati, T. Uslu, and S. Eger, “Wikidition: Automatic Lexiconization and Linkification of Text Corpora,” Information Technology, pp. 70-79, 2016.
    [Abstract] [BibTeX]

    We introduce a new text technology, called Wikidition, which automatically generates large scale editions of corpora of natural language texts. Wikidition combines a wide range of text mining tools for automatically linking lexical, sentential and textual units. This includes the extraction of corpus-specific lexica down to the level of syntactic words and their grammatical categories. To this end, we introduce a novel measure of text reuse and exemplify Wikidition by means of the capitularies, that is, a corpus of Medieval Latin texts.
    @Article{Mehler:et:al:2016,
      Title                    = {Wikidition: Automatic Lexiconization and Linkification of Text Corpora},
      Author                   = {Alexander Mehler and Rüdiger Gleim and Tim vor der Brück and Wahed Hemati and Tolga Uslu and Steffen Eger},
      Journal                  = {Information Technology},
      Year                     = {2016},
      pages                    = {70-79},
      doi                      = {10.1515/itit-2015-0035},
      abstract       = {We introduce a new text technology, called Wikidition, which automatically generates large
    scale editions of corpora of natural language texts. Wikidition combines a wide range of
    text mining tools for automatically linking lexical, sentential and textual units. This
    includes the extraction of corpus-specific lexica down to the level of syntactic words and
    their grammatical categories. To this end, we introduce a novel measure of text reuse and
    exemplify Wikidition by means of the capitularies, that is, a corpus of Medieval Latin
    texts.}
    }

2015 (3)

  • A. Mehler and R. Gleim, “Linguistic Networks — An Online Platform for Deriving Collocation Networks from Natural Language Texts,” in Towards a Theoretical Framework for Analyzing Complex Linguistic Networks, A. Mehler, A. Lücking, S. Banisch, P. Blanchard, and B. Frank-Job, Eds., Springer, 2015.
    [BibTeX]

    @INCOLLECTION{Mehler:Gleim:2015:a,
        publisher={Springer},
        editor={Mehler, Alexander and Lücking, Andy and Banisch, Sven and Blanchard, Philippe and Frank-Job, Barbara},
        year={2015},
        booktitle={Towards a Theoretical Framework for Analyzing Complex Linguistic Networks},
        title={Linguistic Networks -- An Online Platform for Deriving Collocation Networks from Natural Language Texts},
        series={Understanding Complex Systems},
        author={Mehler, Alexander and Gleim, Rüdiger}}
  • A. Mehler, T. vor der Brück, R. Gleim, and T. Geelhaar, “Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts using the TTLab Latin Tagger,” in Text Mining: From Ontology Learning to Automated text Processing Applications, C. Biemann and A. Mehler, Eds., Berlin/New York: Springer, 2015, pp. 87-112.
    [Abstract] [BibTeX]

    The analysis of longitudinal corpora of historical texts requires the integrated development of tools for automatically preprocessing these texts and for building representation models of their genre- and register-related dynamics. In this chapter we present such a joint endeavor that ranges from resource formation via preprocessing to network-based text representation and classification. We start with presenting the so-called TTLab Latin Tagger (TLT) that preprocesses texts of classical and medieval Latin. Its lexical resource in the form of the Frankfurt Latin Lexicon (FLL) is also briefly introduced. As a first test case for showing the expressiveness of these resources, we perform a tripartite classification task of authorship attribution, genre detection and a combination thereof. To this end, we introduce a novel text representation model that explores the core structure (the so-called coreness) of lexical network representations of texts. Our experiment shows the expressiveness of this representation format and mediately of our Latin preprocessor.
    @INCOLLECTION{Mehler:Brueck:Gleim:Geelhaar:2015,
        publisher={Springer},
        series={Theory and Applications of Natural Language Processing},
        booktitle={Text Mining: From Ontology Learning to Automated text Processing Applications},
        pages={87-112},
        editor={Chris Biemann and Alexander Mehler},
        author={Mehler, Alexander and vor der Brück, Tim and Gleim, Rüdiger and Geelhaar, Tim},
        address={Berlin/New York},
        year={2015},
        title={Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts using the TTLab Latin Tagger},
        abstract={The analysis of longitudinal corpora of historical texts requires the integrated development of tools for automatically preprocessing these texts and for building representation models of their genre- and register-related dynamics. In this chapter we present such a joint endeavor that ranges from resource formation via preprocessing to network-based text representation and classification. We start with presenting the so-called TTLab Latin Tagger (TLT) that preprocesses texts of classical and medieval Latin. Its lexical resource in the form of the Frankfurt Latin Lexicon (FLL) is also briefly introduced. As a first test case for showing the expressiveness of these resources, we perform a tripartite classification task of authorship attribution, genre detection and a combination thereof. To this end, we introduce a novel text representation model that explores the core structure (the so-called coreness) of lexical network representations of texts. Our experiment shows the expressiveness of this representation format and mediately of our Latin preprocessor.},
        website={http://link.springer.com/chapter/10.1007/978-3-319-12655-5_5}}
  • [PDF] R. Gleim and A. Mehler, “TTLab Preprocessor – Eine generische Web-Anwendung für die Vorverarbeitung von Texten und deren Evaluation,” in Accepted in the Proceedings of the Jahrestagung der Digital Humanities im deutschsprachigen Raum, 2015.
    [BibTeX]

    @INPROCEEDINGS{Gleim:Mehler:2015,
        booktitle={Accepted in the Proceedings of the Jahrestagung der Digital Humanities im deutschsprachigen Raum},
        author={Gleim, Rüdiger and Mehler, Alexander},
        year={2015},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/Gleim_Mehler_PrePro_DHGraz2015.pdf},
        title={TTLab Preprocessor – Eine generische Web-Anwendung für die Vorverarbeitung von Texten und deren Evaluation}}

2013 (1)

  • A. Mehler, C. Stegbauer, and R. Gleim, “Zur Struktur und Dynamik der kollaborativen Plagiatsdokumentation am Beispiel des GuttenPlag Wiki: eine Vorstudie,” in Die Dynamik sozialer und sprachlicher Netzwerke. Konzepte, Methoden und empirische Untersuchungen am Beispiel des WWW, B. Frank-Job, A. Mehler, and T. Sutter, Eds., Wiesbaden: VS Verlag, 2013.
    [BibTeX]

    @INCOLLECTION{Mehler:Stegbauer:Gleim:2013,
        publisher={VS Verlag},
        booktitle={Die Dynamik sozialer und sprachlicher Netzwerke. Konzepte, Methoden und empirische Untersuchungen am Beispiel des WWW},
        author={Mehler, Alexander and Stegbauer, Christian and Gleim, Rüdiger},
        editor={Frank-Job, Barbara and Mehler, Alexander and Sutter, Tilman},
        year={2013},
        title={Zur Struktur und Dynamik der kollaborativen Plagiatsdokumentation am Beispiel des GuttenPlag Wiki: eine Vorstudie},
        address={Wiesbaden}}

2012 (3)

  • [PDF] A. Mehler, C. Stegbauer, and R. Gleim, “Latent Barriers in Wiki-based Collaborative Writing,” in Proceedings of the Wikipedia Academy: Research and Free Knowledge. June 29 – July 1 2012, Berlin, 2012.
    [BibTeX]

    @INPROCEEDINGS{Mehler:Stegbauer:Gleim:2012:b,
        pdf={https://hucompute.org/wp-content/uploads/2015/08/12_Paper_Alexander_Mehler_Christian_Stegbauer_Ruediger_Gleim.pdf},
        booktitle={Proceedings of the Wikipedia Academy: Research and Free Knowledge. June 29 - July 1 2012},
        author={Mehler, Alexander and Stegbauer, Christian and Gleim, Rüdiger},
        month={July},
        year={2012},
        title={Latent Barriers in Wiki-based Collaborative Writing},
        address={Berlin}}
  • [PDF] R. Gleim, A. Mehler, and A. Ernst, “SOA implementation of the eHumanities Desktop,” in Proceedings of the Workshop on Service-oriented Architectures (SOAs) for the Humanities: Solutions and Impacts, Digital Humanities 2012, Hamburg, Germany, 2012.
    [Abstract] [BibTeX]

    The eHumanities Desktop is a system which allows users to upload, organize and share resources using a web interface. Furthermore resources can be processed, annotated and analyzed in various ways. Registered users can organize themselves in groups and collaboratively work on their data. The eHumanities Desktop is platform independent and runs in a web browser. This paper presents the system focusing on its service orientation and process management.
    @INPROCEEDINGS{Gleim:Mehler:Ernst:2012,
        booktitle={Proceedings of the Workshop on Service-oriented Architectures (SOAs) for the Humanities: Solutions and Impacts, Digital Humanities 2012, Hamburg, Germany},
        author={Gleim, Rüdiger and Mehler, Alexander and Ernst, Alexandra},
        year={2012},
        title={SOA implementation of the eHumanities Desktop},
        abstract={The eHumanities Desktop is a system which allows users to upload, organize and share resources using a web interface. Furthermore resources can be processed, annotated and analyzed in various ways. Registered users can organize themselves in groups and collaboratively work on their data. The eHumanities Desktop is platform independent and runs in a web browser. This paper presents the system focusing on its service orientation and process management.},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/dhc2012.pdf}}
  • A. Mehler, S. Schwandt, R. Gleim, and A. Ernst, “Inducing Linguistic Networks from Historical Corpora: Towards a New Method in Historical Semantics,” in Proceedings of the Conference on New Methods in Historical Corpora, P. Bennett, M. Durrell, S. Scheible, and R. J. Whitt, Eds., Tübingen: Narr, 2012, vol. 3, pp. 257-274.
    [BibTeX]

    @INCOLLECTION{Mehler:Schwandt:Gleim:Ernst:2012,
        publisher={Narr},
        booktitle={Proceedings of the Conference on New Methods in Historical Corpora},
        pages={257--274},
        author={Mehler, Alexander and Schwandt, Silke and Gleim, Rüdiger and Ernst, Alexandra},
        series={Corpus linguistics and Interdisciplinary perspectives on language (CLIP)},
        editor={Paul Bennett and Martin Durrell and Silke Scheible and Richard J. Whitt},
        year={2012},
        volume={3},
        title={Inducing Linguistic Networks from Historical Corpora: Towards a New Method in Historical Semantics},
        address={Tübingen}}

2011 (3)

  • [PDF] A. Mehler, S. Schwandt, R. Gleim, and B. Jussen, “Der eHumanities Desktop als Werkzeug in der historischen Semantik: Funktionsspektrum und Einsatzszenarien,” Journal for Language Technology and Computational Linguistics (JLCL), vol. 26, iss. 1, pp. 97-117, 2011.
    [Abstract] [BibTeX]

    The digital humanities and computational humanities are developing into disciplines in their own right at the interface between the humanities and computer science. This development increasingly affects teaching in the field of humanities computing. In this article we discuss the eHumanities Desktop as a tool for this area of teaching. More precisely, the aim is to bridge history and computer science: using historical semantics as an example, we present three teaching scenarios in which the eHumanities Desktop is used in the teaching of history. The article closes with an analysis of requirements for future developments in this field.
    @ARTICLE{Mehler:Schwandt:Gleim:Jussen:2011,
        journal={Journal for Language Technology and Computational Linguistics (JLCL)},
        pdf={http://media.dwds.de/jlcl/2011_Heft1/8.pdf },
        pages={97-117},
        number={1},
        author={Mehler, Alexander and Schwandt, Silke and Gleim, Rüdiger and Jussen, Bernhard},
        volume={26},
        year={2011},
        title={Der eHumanities Desktop als Werkzeug in der historischen Semantik: Funktionsspektrum und Einsatzszenarien},
        abstract={Die Digital Humanities bzw. die Computational Humanities entwickeln sich zu eigenst{\"a}ndigen Disziplinen an der Nahtstelle von Geisteswissenschaft und Informatik. Diese Entwicklung betrifft zunehmend auch die Lehre im Bereich der geisteswissenschaftlichen Fachinformatik. In diesem Beitrag thematisieren wir den eHumanities Desktop als ein Werkzeug für diesen Bereich der Lehre. Dabei geht es genauer um einen Brückenschlag zwischen Geschichtswissenschaft und Informatik: Am Beispiel der historischen Semantik stellen wir drei Lehrszenarien vor, in denen der eHumanities Desktop in der geschichtswissenschaftlichen Lehre zum Einsatz kommt. Der Beitrag schliesst mit einer Anforderungsanalyse an zukünftige Entwicklungen in diesem Bereich.}}
  • A. Mehler, N. Diewald, U. Waltinger, R. Gleim, D. Esch, B. Job, T. Küchelmann, O. Abramov, and P. Blanchard, “Evolution of Romance Language in Written Communication: Network Analysis of Late Latin and Early Romance Corpora,” Leonardo, vol. 44, iss. 3, 2011.
    [Abstract] [BibTeX]

    In this paper, the authors induce linguistic networks as a prerequisite for detecting language change by means of the Patrologia Latina, a corpus of Latin texts from the 4th to the 13th century.
    @ARTICLE{Mehler:Diewald:Waltinger:et:al:2010,
        publisher={MIT Press},
        journal={Leonardo},
        number={3},
        author={Mehler, Alexander and Diewald, Nils and Waltinger, Ulli and Gleim, Rüdiger and Esch, Dietmar and Job, Barbara and Küchelmann, Thomas and Abramov, Olga and Blanchard, Philippe},
        volume={44},
        year={2011},
        title={Evolution of Romance Language in Written Communication: Network Analysis of Late Latin and Early Romance Corpora},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/mehler_diewald_waltinger_gleim_esch_job_kuechelmann_pustylnikov_blanchard_2010.pdf},
        website={http://www.mitpressjournals.org/doi/abs/10.1162/LEON_a_00175#.VLzsoivF_Cc},
        abstract={In this paper, the authors induce linguistic networks as a prerequisite for detecting language change by means of the Patrologia Latina, a corpus of Latin texts from the 4th to the 13th century.}}
  • [PDF] R. Gleim, A. Hoenen, N. Diewald, A. Mehler, and A. Ernst, “Modeling, Building and Maintaining Lexica for Corpus Linguistic Studies by Example of Late Latin,” in Corpus Linguistics 2011, 20-22 July, Birmingham, 2011.
    [BibTeX]

    @INPROCEEDINGS{Gleim:Hoenen:Diewald:Mehler:Ernst:2011,
        booktitle={Corpus Linguistics 2011, 20-22 July, Birmingham},
        author={Gleim, Rüdiger and Hoenen, Armin and Diewald, Nils and Mehler, Alexander and Ernst, Alexandra},
        year={2011},
        title={Modeling, Building and Maintaining Lexica for Corpus Linguistic Studies by Example of Late Latin},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/Paper-48.pdf}}

2010 (3)

  • [PDF] A. Mehler, R. Gleim, U. Waltinger, and N. Diewald, “Time Series of Linguistic Networks by Example of the Patrologia Latina,” in Proceedings of INFORMATIK 2010: Service Science, September 27 – October 01, 2010, Leipzig, 2010, pp. 609-616.
    [BibTeX]

    @INPROCEEDINGS{Mehler:Gleim:Waltinger:Diewald:2010,
        publisher={GI},
        booktitle={Proceedings of INFORMATIK 2010: Service Science, September 27 - October 01, 2010, Leipzig},
        author={Mehler, Alexander and Gleim, Rüdiger and Waltinger, Ulli and Diewald, Nils},
        editor={F{\"a}hnrich, Klaus-Peter and Franczyk, Bogdan},
        year={2010},
        volume={2},
        pages={609-616},
        title={Time Series of Linguistic Networks by Example of the Patrologia Latina},
        series={Lecture Notes in Informatics},
        pdf={http://subs.emis.de/LNI/Proceedings/Proceedings176/586.pdf}}
  • [PDF] R. Gleim, P. Warner, and A. Mehler, “eHumanities Desktop – An Architecture for Flexible Annotation in Iconographic Research,” in Proceedings of the 6th International Conference on Web Information Systems and Technologies (WEBIST ’10), April 7-10, 2010, Valencia, 2010.
    [BibTeX]

    @INPROCEEDINGS{Gleim:Warner:Mehler:2010,
        pdf={https://hucompute.org/wp-content/uploads/2015/08/gleim_warner_mehler_2010.pdf},
        booktitle={Proceedings of the 6th International Conference on Web Information Systems and Technologies (WEBIST '10), April 7-10, 2010, Valencia},
        author={Gleim, Rüdiger and Warner, Paul and Mehler, Alexander},
        year={2010},
        title={eHumanities Desktop - An Architecture for Flexible Annotation in Iconographic Research},
        website={https://www.researchgate.net/publication/220724277_eHumanities_Desktop_-_An_Architecture_for_Flexible_Annotation_in_Iconographic_Research}}
  • [PDF] R. Gleim and A. Mehler, “Computational Linguistics for Mere Mortals – Powerful but Easy-to-use Linguistic Processing for Scientists in the Humanities,” in Proceedings of LREC 2010, Malta, 2010.
    [Abstract] [BibTeX]

    Delivering linguistic resources and easy-to-use methods to a broad public in the humanities is a challenging task. On the one hand users rightly demand easy to use interfaces but on the other hand want to have access to the full flexibility and power of the functions being offered. Even though a growing number of excellent systems exist which offer convenient means to use linguistic resources and methods, they usually focus on a specific domain, as for example corpus exploration or text categorization. Architectures which address a broad scope of applications are still rare. This article introduces the eHumanities Desktop, an online system for corpus management, processing and analysis which aims at bridging the gap between powerful command line tools and intuitive user interfaces. 
    @INPROCEEDINGS{Gleim:Mehler:2010:b,
        publisher={ELDA},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/gleim_mehler_2010.pdf},
        booktitle={Proceedings of LREC 2010},
        author={Gleim, Rüdiger and Mehler, Alexander},
        year={2010},
        title={Computational Linguistics for Mere Mortals – Powerful but Easy-to-use Linguistic Processing for Scientists in the Humanities},
        address={Malta},
        abstract={Delivering linguistic resources and easy-to-use methods to a broad public in the humanities is a challenging task. On the one hand users rightly demand easy to use interfaces but on the other hand want to have access to the full flexibility and power of the functions being offered. Even though a growing number of excellent systems exist which offer convenient means to use linguistic resources and methods, they usually focus on a specific domain, as for example corpus exploration or text categorization. Architectures which address a broad scope of applications are still rare. This article introduces the eHumanities Desktop, an online system for corpus management, processing and analysis which aims at bridging the gap between powerful command line tools and intuitive user interfaces. }
        }

2009 (4)

  • [PDF] R. Gleim, U. Waltinger, A. Ernst, A. Mehler, D. Esch, and T. Feith, “The eHumanities Desktop – An Online System for Corpus Management and Analysis in Support of Computing in the Humanities,” in Proceedings of the Demonstrations Session of the 12th Conference of the European Chapter of the Association for Computational Linguistics EACL 2009, 30 March – 3 April, Athens, 2009.
    [BibTeX]

    @INPROCEEDINGS{Gleim:Waltinger:Ernst:Mehler:Esch:Feith:2009,
        pdf={https://hucompute.org/wp-content/uploads/2015/08/gleim_waltinger_ernst_mehler_esch_feith_2009.pdf},
        booktitle={Proceedings of the Demonstrations Session of the 12th Conference of the European Chapter of the Association for Computational Linguistics EACL 2009, 30 March – 3 April, Athens},
        author={Gleim, Rüdiger and Waltinger, Ulli and Ernst, Alexandra and Mehler, Alexander and Esch, Dietmar and Feith, Tobias},
        year={2009},
        title={The eHumanities Desktop – An Online System for Corpus Management and Analysis in Support of Computing in the Humanities}}
  • [PDF] U. Waltinger, A. Mehler, and R. Gleim, “Social Semantics And Its Evaluation By Means of Closed Topic Models: An SVM-Classification Approach Using Semantic Feature Replacement By Topic Generalization,” in Proceedings of the Biennial GSCL Conference 2009, September 30 – October 2, Universität Potsdam, 2009.
    [BibTeX]

    @INPROCEEDINGS{Waltinger:Mehler:Gleim:2009:a,
        booktitle={Proceedings of the Biennial GSCL Conference 2009, September 30 – October 2, Universit{\"a}t Potsdam},
        author={Waltinger, Ulli and Mehler, Alexander and Gleim, Rüdiger},
        year={2009},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/GSCL_2009_WaltingerMehlerGleim_camera_ready.pdf},
        title={Social Semantics And Its Evaluation By Means of Closed Topic Models: An SVM-Classification Approach Using Semantic Feature Replacement By Topic Generalization}}
  • [PDF] A. Mehler, R. Gleim, U. Waltinger, A. Ernst, D. Esch, and T. Feith, “eHumanities Desktop – eine webbasierte Arbeitsumgebung für die geisteswissenschaftliche Fachinformatik,” in Proceedings of the Symposium "Sprachtechnologie und eHumanities", 26.–27. Februar, Duisburg-Essen University, 2009.
    [BibTeX]

    @INPROCEEDINGS{Mehler:Gleim:Waltinger:Ernst:Esch:Feith:2009,
        pdf={https://hucompute.org/wp-content/uploads/2015/08/mehler_gleim_waltinger_ernst_esch_feith_2009.pdf},
        booktitle={Proceedings of the Symposium "Sprachtechnologie und eHumanities", 26.–27. Februar, Duisburg-Essen University},
        author={Mehler, Alexander and Gleim, Rüdiger and Waltinger, Ulli and Ernst, Alexandra and Esch, Dietmar and Feith, Tobias},
        website={http://duepublico.uni-duisburg-essen.de/servlets/DocumentServlet?id=37041},
        year={2009},
        title={eHumanities Desktop – eine webbasierte Arbeitsumgebung für die geisteswissenschaftliche Fachinformatik}}
  • [PDF] R. Gleim, A. Mehler, U. Waltinger, and P. Menke, “eHumanities Desktop – An extensible Online System for Corpus Management and Analysis,” in 5th Corpus Linguistics Conference, University of Liverpool, 2009.
    [Abstract] [BibTeX]

    This paper presents the eHumanities Desktop - an online system for corpus management and analysis in support of computing in the humanities. Design issues and the overall architecture are described, as well as an outline of the applications offered by the system.
    @INPROCEEDINGS{Gleim:Mehler:Waltinger:Menke:2009,
        booktitle={5th Corpus Linguistics Conference, University of Liverpool},
        author={Gleim, Rüdiger and Mehler, Alexander and Waltinger, Ulli and Menke, Peter},
        year={2009},
        title={eHumanities Desktop – An extensible Online System for Corpus Management and Analysis},
        abstract={This paper presents the eHumanities Desktop - an online system for corpus management and analysis in support of computing in the humanities. Design issues and the overall architecture are described, as well as an outline of the applications offered by the system.},
        website={http://www.ulliwaltinger.de/ehumanities-desktop-an-extensible-online-system-for-corpus-management-and-analysis/},
        pdf={http://www.ulliwaltinger.de/pdf/eHumanitiesDesktop-AnExtensibleOnlineSystem-CL2009.pdf}}

2008 (3)

  • [PDF] A. Mehler, R. Gleim, A. Ernst, and U. Waltinger, “WikiDB: Building Interoperable Wiki-Based Knowledge Resources for Semantic Databases,” Sprache und Datenverarbeitung. International Journal for Language Data Processing, vol. 32, iss. 1, pp. 47-70, 2008.
    [Abstract] [BibTeX]

    This article describes an API for exploring the logical document and the logical network structure of wikis. It introduces an algorithm for the semantic preprocessing, filtering and typing of these building blocks. Further, this article models the process of wiki generation based on a unified format of syntactic, semantic and pragmatic representations. This three-level approach to make accessible syntactic, semantic and pragmatic aspects of wiki-based structure formation is complemented by a corresponding database model – called WikiDB – and an API operating thereon. Finally, the article provides an empirical study of using the three-fold representation format in conjunction with WikiDB.
    @ARTICLE{Mehler:Gleim:Ernst:Waltinger:2008,
        pdf={http://www.ulliwaltinger.de/pdf/Konvens_2008_WikiDB_Building_Semantic_Databases_MehlerGleimErnstWaltinger.pdf},
        journal={Sprache und Datenverarbeitung. International Journal for Language Data Processing},
        pages={47-70},
        number={1},
        author={Mehler, Alexander and Gleim, Rüdiger and Ernst, Alexandra and Waltinger, Ulli},
        volume={32},
        year={2008},
        title={WikiDB: Building Interoperable Wiki-Based Knowledge Resources for Semantic Databases},
        abstract={This article describes an API for exploring the logical document and the logical network structure of wikis. It introduces an algorithm for the semantic preprocessing, filtering and typing of these building blocks. Further, this article models the process of wiki generation based on a unified format of syntactic, semantic and pragmatic representations. This three-level approach to make accessible syntactic, semantic and pragmatic aspects of wiki-based structure formation is complemented by a corresponding database model – called WikiDB – and an API operating thereon. Finally, the article provides an empirical study of using the three-fold representation format in conjunction with WikiDB.}}
  • [PDF] G. Rehm, M. Santini, A. Mehler, P. Braslavski, R. Gleim, A. Stubbe, S. Symonenko, M. Tavosanis, and V. Vidulin, “Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems,” in Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco), 2008.
    [Abstract] [BibTeX]

    We present initial results from an international and multi-disciplinary research collaboration that aims at the construction of a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are rather difficult to capture and to describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the respective genre or genres a web document or a website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. Furthermore, we discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres. 
    @INPROCEEDINGS{Rehm:Santini:Mehler:Braslavski:Gleim:Stubbe:Symonenko:Tavosanis:Vidulin:2008,
        pdf={https://hucompute.org/wp-content/uploads/2015/08/rehm_santini_mehler_braslavski_gleim_stubbe_symonenko_tavosanis_vidulin_2008.pdf},
        booktitle={Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco)},
        author={Rehm, Georg and Santini, Marina and Mehler, Alexander and Braslavski, Pavel and Gleim, Rüdiger and Stubbe, Andrea and Symonenko, Svetlana and Tavosanis, Mirko and Vidulin, Vedrana},
        year={2008},
        title={Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems},
        website={http://www.lrec-conf.org/proceedings/lrec2008/summaries/94.html},
        abstract={We present initial results from an international and multi-disciplinary research collaboration that aims at the construction of a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are rather difficult to capture and to describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the respective genre or genres a web document or a website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. Furthermore, we discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres. }}
  • [PDF] O. Abramov, A. Mehler, and R. Gleim, “A Unified Database of Dependency Treebanks. Integrating, Quantifying and Evaluating Dependency Data,” in Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco), 2008.
    [Abstract] [BibTeX]

    This paper describes a database of 11 dependency treebanks which were unified by means of a two-dimensional graph format. The format was evaluated with respect to storage-complexity on the one hand, and efficiency of data access on the other hand. An example of how the treebanks can be integrated within a unique interface is given by means of the DTDB interface. 
    @INPROCEEDINGS{Pustylnikov:Mehler:Gleim:2008,
        booktitle={Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco)},
        author={Abramov, Olga and Mehler, Alexander and Gleim, Rüdiger},
        year={2008},
        title={A Unified Database of Dependency Treebanks. Integrating, Quantifying and Evaluating Dependency Data},
        pdf={http://wwwhomes.uni-bielefeld.de/opustylnikov/pustylnikov/pdfs/LREC08_full.pdf},
        abstract={This paper describes a database of 11 dependency treebanks which were unified by means of a two-dimensional graph format. The format was evaluated with respect to storage-complexity on the one hand, and efficiency of data access on the other hand. An example of how the treebanks can be integrated within a unique interface is given by means of the DTDB interface. }}
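
To give a feel for what unifying treebanks over one graph format can look like, here is a toy sketch; the Edge record and its labels are invented for illustration and are not the two-dimensional format or the DTDB interface described in the paper.

    # Toy sketch (invented structure): a dependency tree as a generic
    # list of labelled edges, so one storage layout and one set of
    # queries can serve several treebanks.
    from collections import namedtuple

    Edge = namedtuple("Edge", "head dependent relation")

    tokens = ["She", "saw", "him"]
    tree = [
        Edge(head=1, dependent=0, relation="nsubj"),
        Edge(head=None, dependent=1, relation="root"),  # None marks the root
        Edge(head=1, dependent=2, relation="obj"),
    ]

    # Treebank-agnostic queries:
    root = next(e.dependent for e in tree if e.head is None)
    children = [tokens[e.dependent] for e in tree if e.head == root]
    print(tokens[root], "->", children)  # saw -> ['She', 'him']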

2007 (5)

  • [PDF] R. Gleim, A. Mehler, H. Eikmeyer, and H. Rieser, “Ein Ansatz zur Repräsentation und Verarbeitung großer Korpora multimodaler Daten,” in Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007, 11.–13. April, Universität Tübingen, 2007, pp. 275-284.
    [BibTeX]

    @INPROCEEDINGS{Gleim:Mehler:Eikmeyer:Rieser:2007,
        publisher={Narr},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/gleim_mehler_eikmeyer_rieser_2007.pdf},
        booktitle={Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007, 11.–13. April, Universit{\"a}t Tübingen},
        pages={275-284},
        author={Gleim, Rüdiger and Mehler, Alexander and Eikmeyer, Hans-Jürgen and Rieser, Hannes},
        editor={Rehm, Georg and Witt, Andreas and Lemnitzer, Lothar},
        year={2007},
        title={Ein Ansatz zur Repr{\"a}sentation und Verarbeitung gro{\ss}er Korpora multimodaler Daten},
        address={Tübingen}}
  • [PDF] R. Gleim, A. Mehler, M. Dehmer, and O. Abramov, “Aisles through the Category Forest – Utilising the Wikipedia Category System for Corpus Building in Machine Learning,” in 3rd International Conference on Web Information Systems and Technologies (WEBIST ’07), March 3-6, 2007, Barcelona, 2007, pp. 142-149.
    [Abstract] [BibTeX]

    The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods be developed in order to tackle the problem of finding and organising relevant information. It has often been motivated that semantic classifications of input documents help solving this task. But while approaches of supervised text categorisation perform quite well on genres found in written text, newly evolved genres on the web are much more demanding. In order to successfully develop approaches to web mining, respective corpora are needed. However, the composition of genre- or domain-specific web corpora is still an unsolved problem. It is time-consuming to build large corpora of good quality because web pages typically lack reliable meta information. Wikipedia along with similar approaches of collaborative text production offers a way out of this dilemma. We examine how social tagging, as supported by the MediaWiki software, can be utilised as a source of corpus building. Further, we describe a representation format for social ontologies and present the Wikipedia Category Explorer, a tool which supports categorical views to browse through the Wikipedia and to construct domain specific corpora for machine learning.
    @INPROCEEDINGS{Gleim:Mehler:Dehmer:Abramov:2007,
        booktitle={3rd International Conference on Web Information Systems and Technologies (WEBIST '07), March 3-6, 2007, Barcelona},
        pages={142-149},
        author={Gleim, Rüdiger and Mehler, Alexander and Dehmer, Matthias and Abramov, Olga},
        editor={Filipe, Joaquim and Cordeiro, José and Encarnação, Bruno and Pedrosa, Vitor},
        year={2007},
        title={Aisles through the Category Forest – Utilising the Wikipedia Category System for Corpus Building in Machine Learning},
        address={Barcelona},
        abstract={The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods be developed in order to tackle the problem of finding and organising relevant information. It has often been motivated that semantic classifications of input documents help solving this task. But while approaches of supervised text categorisation perform quite well on genres found in written text, newly evolved genres on the web are much more demanding. In order to successfully develop approaches to web mining, respective corpora are needed. However, the composition of genre- or domain-specific web corpora is still an unsolved problem. It is time-consuming to build large corpora of good quality because web pages typically lack reliable meta information. Wikipedia along with similar approaches of collaborative text production offers a way out of this dilemma. We examine how social tagging, as supported by the MediaWiki software, can be utilised as a source of corpus building. Further, we describe a representation format for social ontologies and present the Wikipedia Category Explorer, a tool which supports categorical views to browse through the Wikipedia and to construct domain specific corpora for machine learning.},
        pdf={https://hucompute.org/wp-content/uploads/2016/10/webist_2007-gleim_mehler_dehmer_pustylnikov.pdf}}
  • [PDF] A. Mehler, R. Gleim, and A. Wegner, “Structural Uncertainty of Hypertext Types. An Empirical Study,” in Proceedings of the Workshop "Towards Genre-Enabled Search Engines: The Impact of NLP", September, 30, 2007, in conjunction with RANLP 2007, Borovets, Bulgaria, 2007, pp. 13-19.
    [BibTeX]

    @INPROCEEDINGS{Mehler:Gleim:Wegner:2007,
        booktitle={Proceedings of the Workshop "Towards Genre-Enabled Search Engines: The Impact of NLP", September, 30, 2007, in conjunction with RANLP 2007, Borovets, Bulgaria},
        pages={13-19},
        author={Mehler, Alexander and Gleim, Rüdiger and Wegner, Armin},
        editor={Rehm, Georg and Santini, Marina},
        year={2007},
        title={Structural Uncertainty of Hypertext Types. An Empirical Study},
        pdf={https://hucompute.org/wp-content/uploads/2015/08/RANLP.pdf}}
  • [PDF] A. Mehler, P. Geibel, R. Gleim, S. Herold, B. Jain, and O. Abramov, “Much Ado About Text Content. Learning Text Types Solely by Structural Differentiae,” in Proceedings of OTT ’06 – Ontologies in Text Technology: Approaches to Extract Semantic Knowledge from Structured Information, Osnabrück, 2007, pp. 63-71.
    [Abstract] [BibTeX]

    In this paper, we deal with classifying texts into classes which denote text types whose textual instances serve more or less homogeneous functions. Other than mainstream approaches to text classification, which rely on the vector space model [30] or some of its descendants [2] and, thus, on content-related lexical features, we solely refer to structural differentiae, that is, to patterns of text structure as determinants of class membership. Further, we suppose that text types span a type hierarchy based on the type-subtype relation [31]. Thus, although we admit that class membership is fuzzy so that overlapping classes are inevitable, we suppose a non-overlapping type system structured into a rooted tree – whether solely based on functional or additionally on, e.g., content- or media-based criteria [1]. As regards criteria of goodness of classification, we perform a classical supervised categorization experiment [30] based on cross-validation as a method of model selection [11]. That is, we perform a categorization experiment in which for all training and test cases class membership is known ex ante. In summary, we perform a supervised experiment of text classification in order to learn functionally grounded text types where membership to these types is solely based on structural criteria.
    @INPROCEEDINGS{Mehler:Geibel:Gleim:Herold:Jain:Pustylnikov:2007,
        pdf={http://ikw.uni-osnabrueck.de/~ott06/ott06-abstracts/Mehler_Geibel_abstract.pdf},
        booktitle={Proceedings of OTT '06 – Ontologies in Text Technology: Approaches to Extract Semantic Knowledge from Structured Information},
        pages={63-71},
        author={Mehler, Alexander and Geibel, Peter and Gleim, Rüdiger and Herold, Sebastian and Jain, Brijnesh-Johannes and Abramov, Olga},
        series={Publications of the Institute of Cognitive Science (PICS)},
        editor={Mönnich, Uwe and Kühnberger, Kai-Uwe},
        year={2007},
        title={Much Ado About Text Content. Learning Text Types Solely by Structural Differentiae},
        address={Osnabrück},
        abstract={In this paper, we deal with classifying texts into classes which denote text types whose textual instances serve more or less homogeneous functions. Other than mainstream approaches to text classification, which rely on the vector space model [30] or some of its descendants [2] and, thus, on content-related lexical features, we solely refer to structural differentiae, that is, to patterns of text structure as determinants of class membership. Further, we suppose that text types span a type hierarchy based on the type-subtype relation [31]. Thus, although we admit that class membership is fuzzy so that overlapping classes are inevitable, we suppose a non-overlapping type system structured into a rooted tree – whether solely based on functional or additional on, e.g., content- or mediabased criteria [1]. What regards criteria of goodness of classification, we perform a classical supervised categorization experiment [30] based on cross-validation as a method of model selection [11]. That is, we perform a categorization experiment in which for all training and test cases class membership is known ex ante. In summary, we perform a supervised experiment of text classification in order to learn functionally grounded text types where membership to these types is solely based on structural criteria.}}
  • [PDF] R. Gleim, A. Mehler, and H. Eikmeyer, “Representing and Maintaining Large Corpora,” in Proceedings of the Corpus Linguistics 2007 Conference, Birmingham (UK), 2007.
    [BibTeX]

    @INPROCEEDINGS{Gleim:Mehler:Eikmeyer:2007:a,
        pdf={https://hucompute.org/wp-content/uploads/2015/08/gleim_mehler_eikmeyer_2007_a.pdf},
        booktitle={Proceedings of the Corpus Linguistics 2007 Conference, Birmingham (UK)},
        author={Gleim, Rüdiger and Mehler, Alexander and Eikmeyer, Hans-Jürgen},
        year={2007},
        title={Representing and Maintaining Large Corpora}}

2006 (5)

  • [PDF] R. Gleim, “HyGraph – Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertextstrukturen,” in Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beiträge zur GLDV-Tagung 2005, Universität Bonn, Frankfurt a. M., 2006, pp. 42-53.
    [BibTeX]

    @INPROCEEDINGS{Gleim:2006,
        publisher={Lang},
        booktitle={Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beitr{\"a}ge zur GLDV-Tagung 2005, Universit{\"a}t Bonn},
        pages={42-53},
        author={Gleim, Rüdiger},
        editor={Fisseni, Bernhard and Schmitz, Hans-Christian and Schröder, Bernhard and Wagner, Petra},
        year={2006},
        title={HyGraph - Ein Framework zur Extraktion, Repr{\"a}sentation und Analyse webbasierter Hypertextstrukturen},
        website={https://www.researchgate.net/publication/268294000_HyGraph__Ein_Framework_zur_Extraktion_Reprsentation_und_Analyse_webbasierter_Hypertextstrukturen},
        pdf = {https://hucompute.org/wp-content/uploads/2016/10/GLDV2005-HyGraph-Framework.pdf},
        address={Frankfurt a. M.}}
  • A. Mehler, M. Dehmer, and R. Gleim, “Towards Logical Hypertext Structure – A Graph-Theoretic Perspective,” in Proceedings of the Fourth International Workshop on Innovative Internet Computing Systems (I2CS ’04), Berlin/New York, 2006, pp. 136-150.
    [Abstract] [BibTeX]

    Facing the retrieval problem according to the overwhelming set of documents online the adaptation of text categorization to web units has recently been pushed. The aim is to utilize categories of web sites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been utilized just as HTML tags and link structures. In spite of promising results this adaptation stays in the framework of IR specific models since it neglects the content-based structuring inherent to hypertext units. This paper approaches hypertext modelling from the perspective of graph-theory. It presents an XML-based format for representing websites as hypergraphs. These hypergraphs are used to shed light on the relation of hypertext structure types and their web-based instances. We place emphasis on two characteristics of this relation: In terms of realizational ambiguity we speak of functional equivalents to the manifestation of the same structure type. In terms of polymorphism we speak of a single web unit which manifests different structure types. It is shown that polymorphism is a prevalent characteristic of web-based units. This is done by means of a categorization experiment which analyses a corpus of hypergraphs representing the structure and content of pages of conference websites. On this background we plead for a revision of text representation models by means of hypergraphs which are sensitive to the manifold structuring of web documents.
    @INPROCEEDINGS{Mehler:Dehmer:Gleim:2006,
        publisher={Springer},
        booktitle={Proceedings of the Fourth International Workshop on Innovative Internet Computing Systems (I2CS '04)},
        website={http://rd.springer.com/chapter/10.1007/11553762_14},
        pages={136-150},
        author={Mehler, Alexander and Dehmer, Matthias and Gleim, Rüdiger},
        series={Lecture Notes in Computer Science 3473},
        editor={Böhme, Thomas and Heyer, Gerhard},
        year={2006},
        title={Towards Logical Hypertext Structure - A Graph-Theoretic Perspective},
        address={Berlin/New York},
        abstract={Facing the retrieval problem according to the overwhelming set of documents online the adaptation of text categorization to web units has recently been pushed. The aim is to utilize categories of web sites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been utilized just as HTML tags and link structures. In spite of promising results this adaptation stays in the framework of IR specific models since it neglects the content-based structuring inherent to hypertext units. This paper approaches hypertext modelling from the perspective of graph-theory. It presents an XML-based format for representing websites as hypergraphs. These hypergraphs are used to shed light on the relation of hypertext structure types and their web-based instances. We place emphasis on two characteristics of this relation: In terms of realizational ambiguity we speak of functional equivalents to the manifestation of the same structure type. In terms of polymorphism we speak of a single web unit which manifests different structure types. It is shown that polymorphism is a prevalent characteristic of web-based units. This is done by means of a categorization experiment which analyses a corpus of hypergraphs representing the structure and content of pages of conference websites. On this background we plead for a revision of text representation models by means of hypergraphs which are sensitive to the manifold structuring of web documents.}}
  • A. Mehler, R. Gleim, and M. Dehmer, “Towards Structure-Sensitive Hypertext Categorization,” in Proceedings of the 29th Annual Conference of the German Classification Society, March 9-11, 2005, Universität Magdeburg, Berlin/New York, 2006, pp. 406-413.
    [Abstract] [BibTeX]

    Hypertext categorization is the task of automatically assigning category labels to hypertext units. Comparable to text categorization it stays in the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort of applying the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. In order to outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.
    @INPROCEEDINGS{Mehler:Gleim:Dehmer:2006,
        publisher={Springer},
        booktitle={Proceedings of the 29th Annual Conference of the German Classification Society, March 9-11, 2005, Universit{\"a}t Magdeburg},
        website={http://www.springerlink.com/content/l7665tm3u241317l/},
        pages={406-413},
        author={Mehler, Alexander and Gleim, Rüdiger and Dehmer, Matthias},
        editor={Spiliopoulou, Myra and Kruse, Rudolf and Borgelt, Christian and Nürnberger, Andreas and Gaul, Wolfgang},
        year={2006},
        title={Towards Structure-Sensitive Hypertext Categorization},
        address={Berlin/New York},
        abstract={Hypertext categorization is the task of automatically assigning category labels to hypertext units. Comparable to text categorization it stays in the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes any effort of applying the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. In order to outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.}}
  • A. Mehler and R. Gleim, “The Net for the Graphs – Towards Webgenre Representation for Corpus Linguistic Studies,” in WaCky! Working Papers on the Web as Corpus, M. Baroni and S. Bernardini, Eds., Bologna: Gedit, 2006, pp. 191-224.
    [BibTeX]

    @INCOLLECTION{Mehler:Gleim:2006:b,
        publisher={Gedit},
        booktitle={WaCky! Working Papers on the Web as Corpus},
        pages={191-224},
        author={Mehler, Alexander and Gleim, Rüdiger},
        editor={Baroni, Marco and Bernardini, Silvia},
        year={2006},
        title={The Net for the Graphs – Towards Webgenre Representation for Corpus Linguistic Studies},
        website={http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.510.4125},
        address={Bologna}}
  • [PDF] R. Gleim, A. Mehler, and M. Dehmer, “Web Corpus Mining by Instance of Wikipedia,” in Proceedings of the EACL 2006 Workshop on Web as Corpus, April 3-7, 2006, Trento, Italy, 2006, pp. 67-74.
    [Abstract] [BibTeX]

    Workshop organizer: Adam Kilgarriff
    @INPROCEEDINGS{Gleim:Mehler:Dehmer:2006:a,
        booktitle={Proceedings of the EACL 2006 Workshop on Web as Corpus, April 3-7, 2006, Trento, Italy},
        pages={67-74},
        author={Gleim, Rüdiger and Mehler, Alexander and Dehmer, Matthias},
        editor={Kilgarriff, Adam and Baroni, Marco},
        year={2006},
        abstract={Workshop organizer: Adam Kilgarriff},
        title={Web Corpus Mining by Instance of Wikipedia},
        pdf={http://www.aclweb.org/anthology/W06-1710},
        website={http://pub.uni-bielefeld.de/publication/1773538}}

2005 (2)

  • [PDF] A. Mehler and R. Gleim, “Polymorphism in Generic Web Units. A corpus linguistic study,” in Proceedings of Corpus Linguistics ’05, July 14-17, 2005, University of Birmingham, Great Britain, 2005.
    [Abstract] [BibTeX]

    Corpus linguistics and related disciplines which focus on statistical analyses of textual units have a substantial need for large corpora. More specifically, genre- or register-specific corpora are needed which allow studying variations in language use. With the incredible growth of the internet, the web has become an important source of linguistic data. Of course, web corpora face the same problem of acquiring genre-specific corpora. Amongst other things, web mining is a framework of methods for automatically assigning category labels to web units and thus may be seen as a solution to this corpus acquisition problem as far as genre categories are applied. The paper argues that this approach faces the problem of a many-to-many relation between expression units on the one hand and content or function units on the other. A quantitative study is performed which supports the argument that functions of web-based communication are very often concentrated on single web pages and thus interfere with any effort to directly apply the classical apparatus of categorization at the web-page level. The paper outlines a two-level algorithm as an alternative approach to category assignment which is sensitive to genre-specific structures and thus may be used to tackle the problem of acquiring genre-specific corpora. (A toy sketch of such a two-level assignment follows this entry's BibTeX record.)
    @INPROCEEDINGS{Mehler:Gleim:2005:a,
        booktitle={Proceedings of Corpus Linguistics '05, July 14-17, 2005, University of Birmingham, Great Britain},
        author={Mehler, Alexander and Gleim, Rüdiger},
        volume={Corpus Linguistics Conference Series 1(1)},
        year={2005},
        title={Polymorphism in Generic Web Units. A corpus linguistic study},
        pdf={http://www.birmingham.ac.uk/Documents/college-artslaw/corpus/conference-archives/2005-journal/Thewebasacorpus/AlexanderMehlerandRuedigerGleimCorpusLinguistics2005.pdf},
        issn={1747-9398},
        abstract={Corpus linguistics and related disciplines which focus on statistical analyses of textual units have a substantial need for large corpora. More specifically, genre- or register-specific corpora are needed which allow studying variations in language use. With the incredible growth of the internet, the web has become an important source of linguistic data. Of course, web corpora face the same problem of acquiring genre-specific corpora. Amongst other things, web mining is a framework of methods for automatically assigning category labels to web units and thus may be seen as a solution to this corpus acquisition problem as far as genre categories are applied. The paper argues that this approach faces the problem of a many-to-many relation between expression units on the one hand and content or function units on the other. A quantitative study is performed which supports the argument that functions of web-based communication are very often concentrated on single web pages and thus interfere with any effort to directly apply the classical apparatus of categorization at the web-page level. The paper outlines a two-level algorithm as an alternative approach to category assignment which is sensitive to genre-specific structures and thus may be used to tackle the problem of acquiring genre-specific corpora.}}
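    The two-level algorithm is only outlined in the paper; as a loose, hypothetical reading of it (the rules and function names below are invented here), a Python sketch that first categorizes page segments and then lets the page inherit every category one of its segments manifests:

        def categorize_segment(text):
            # Level 1: a trivial keyword rule stands in for a real segment classifier.
            rules = {"register": "registration", "programme": "schedule"}
            return {cat for key, cat in rules.items() if key in text.lower()}

        def categorize_page(segments):
            # Level 2: page-level labels are aggregated from segment-level labels.
            labels = set()
            for segment in segments:
                labels |= categorize_segment(segment)
            return labels

        page = ["Register here before May 1", "Programme overview and talks"]
        print(categorize_page(page))  # {'registration', 'schedule'}

    The point of the aggregation step is that a single page may legitimately end up with several labels instead of being forced into exactly one category.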
  • A. Mehler, M. Dehmer, and R. Gleim, “Zur Automatischen Klassifikation von Webgenres,” in Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beiträge zur GLDV-Frühjahrstagung ’05, 30. März – 01. April 2005, Universität Bonn, Frankfurt a. M., 2005, pp. 158-174. In German. Title translates into: On the Automatic Classification of Web Genres
    [BibTeX]

    @INPROCEEDINGS{Mehler:Dehmer:Gleim:2005,
        publisher={Lang},
        booktitle={Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beitr{\"a}ge zur GLDV-Frühjahrstagung '05, 30. M{\"a}rz – 01. April 2005, Universit{\"a}t Bonn},
        pages={158-174},
        author={Mehler, Alexander and Dehmer, Matthias and Gleim, Rüdiger},
        editor={Fisseni, Bernhard and Schmitz, Hans-Christian and Schröder, Bernhard and Wagner, Petra},
        year={2005},
        title={Zur Automatischen Klassifikation von Webgenres},
        address={Frankfurt a. M.}}

2004 (1)

  • [PDF] M. Dehmer, A. Mehler, and R. Gleim, “Aspekte der Kategorisierung von Webseiten,” in INFORMATIK 2004 – Informatik verbindet, Band 2, Beiträge der 34. Jahrestagung der Gesellschaft für Informatik e.V. (GI). Workshop Multimedia-Informationssysteme, 2004, pp. 39-43. In German. Title translates into: Aspects of the Categorization of Web Pages
    [Abstract] [BibTeX]

    In English: In the course of web-based communication, the question arises to what extent web pages can be categorized for the purpose of content-oriented filtering. This study examines two phenomena which concern the precondition of such a categorization (see [6]): with the notion of functional equivalence we refer to the phenomenon that the same function or content category can be manifested by completely different building blocks of web-based documents; with the notion of polymorphism we refer to the phenomenon that the same document can manifest several function or content categories at once. The central hypothesis is that both phenomena are characteristic of web-based hypertext structures. If this is the case, the automatic categorization of hypertexts [2, 10] can no longer be understood as an unambiguous assignment of exactly one category to a document. In this sense, the paper addresses the question of how to adequately model multimedia documents. In German: Im Zuge der Web-basierten Kommunikation tritt die Frage auf, inwiefern Webpages zum Zwecke ihrer inhaltsorientierten Filterung kategorisiert werden können. Diese Studie untersucht zwei Phänomene, welche die Bedingung der Möglichkeit einer solchen Kategorisierung betreffen (siehe [6]): Mit dem Begriff der funktionalen Äquivalenz beziehen wir uns auf das Phänomen, dass dieselbe Funktions- oder Inhaltskategorie durch völlig verschiedene Bausteine Web-basierter Dokumente manifestiert werden kann. Mit dem Begriff der Polymorphie beziehen wir uns auf das Phänomen, dass dasselbe Dokument zugleich mehrere Funktions- oder Inhaltskategorien manifestieren kann. Die zentrale Hypothese lautet, dass beide Phänomene für Web-basierte Hypertextstrukturen charakteristisch sind. Ist dies der Fall, so kann die automatische Kategorisierung von Hypertexten [2, 10] nicht mehr als eindeutige Zuordnung verstanden werden, bei der einem Dokument genau eine Kategorie zugeordnet wird. In diesem Sinne thematisiert das Papier die Frage nach der adäquaten Modellierung multimedialer Dokumente. (A toy polymorphism check follows this entry's BibTeX record.)
    @INPROCEEDINGS{Dehmer:Mehler:Gleim:2004,
        publisher={GI},
        booktitle={INFORMATIK 2004 – Informatik verbindet, Band 2, Beitr{\"a}ge der 34. Jahrestagung der Gesellschaft für Informatik e.V. (GI). Workshop Multimedia-Informationssysteme},
        pages={39-43},
        author={Dehmer, Matthias and Mehler, Alexander and Gleim, Rüdiger},
        series={Lecture Notes in Informatics},
        volume={51},
        editor={Dadam, Peter and Reichert, Manfred},
        year={2004},
        title={Aspekte der Kategorisierung von Webseiten},
        pdf={http://subs.emis.de/LNI/Proceedings/Proceedings51/GI-Proceedings.51-11.pdf},
        website={https://www.researchgate.net/publication/221385316_Aspekte_der_Kategorisierung_von_Webseiten},
        abstract={Im Zuge der Web-basierten Kommunikation tritt die Frage auf, inwiefern Webpages zum Zwecke ihrer inhaltsorientierten Filterung kategorisiert werden können. Diese Studie untersucht zwei Ph{\"a}nomene, welche die Bedingung der Möglichkeit einer solchen Kategorisierung betreffen (siehe [6]): Mit dem Begriff der funktionalen {\"A}quivalenz beziehen wir uns auf das Ph{\"a}nomen, dass dieselbe Funktions- oder Inhaltskategorie durch völlig verschiedene Bausteine Web-basierter Dokumente manifestiert werden kann. Mit dem Begriff der Polymorphie beziehen wir uns auf das Ph{\"a}nomen, dass dasselbe Dokument zugleich mehrere Funktions- oder Inhaltskategorien manifestieren kann. Die zentrale Hypothese lautet, dass beide Ph{\"a}nomene für Web-basierte Hypertextstrukturen charakteristisch sind. Ist dies der Fall, so kann die automatische Kategorisierung von Hypertexten [2, 10] nicht mehr als eindeutige Zuordnung verstanden werden, bei der einem Dokument genau eine Kategorie zugeordnet wird. In diesem Sinne thematisiert das Papier die Frage nach der ad{\"a}quaten Modellierung multimedialer Dokumente.}}
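    The central hypothesis above, that polymorphic pages are the rule rather than the exception, can be phrased as a simple corpus statistic. A minimal sketch on invented toy data (the paths and category labels are made up for illustration):

        # Share of pages that manifest more than one function/content category.
        pages = {
            "/index.html": {"welcome", "call-for-papers"},
            "/venue.html": {"travel"},
            "/cfp.html":   {"call-for-papers", "dates"},
        }

        polymorphic = sum(1 for cats in pages.values() if len(cats) > 1)
        print(f"polymorphic pages: {polymorphic}/{len(pages)}")  # 2/3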