Bangla Textbook Corpus

The Bangla textbook corpus was extracted from textbooks used for teaching in public schools in Bangladesh. It was collected in 2012 to support research on multilingual text readability analysis. The corpus contains 661 documents, 105,897 sentences, and 1,029,354 tokens, and is distributed in TEI P5 format. For more details on this corpus, please refer to [1].
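Since the corpus is distributed as TEI P5, the following minimal Python sketch shows one way to count sentences and tokens in a TEI file. It assumes that sentences are marked with <s> and tokens with <w> in the standard TEI namespace; this is an assumption, and the corpus files may use different elements. The sample document and the function name count_units are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified TEI document (not schema-valid, not taken from the corpus).
# Assumption: sentences are <s> elements and tokens are <w> elements.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>
    <s><w>আমি</w> <w>বই</w> <w>পড়ি</w></s>
    <s><w>সে</w> <w>যায়</w></s>
  </p></body></text>
</TEI>"""


def count_units(xml_text):
    """Return (sentence_count, token_count) for one TEI document string."""
    root = ET.fromstring(xml_text)
    sentences = root.findall(".//tei:s", TEI_NS)
    tokens = root.findall(".//tei:w", TEI_NS)
    return len(sentences), len(tokens)


sentence_count, token_count = count_units(SAMPLE)
print(sentence_count, token_count)
```

To apply this to the downloaded archive, one would read each file's contents and pass them to count_units, after checking which elements the files actually use.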

Reference
Islam, Md. Zahurul
Multilingual Text Classification using Information-Theoretic Features
PhD Thesis; Goethe University Frankfurt; 2014
If you use this corpus, please cite the publications above.

Acknowledgements: The work is supported by the LOEWE Digital-Humanities Project at the Goethe University Frankfurt.

Download as ZIP archive (4.25 MB)


[1] Islam, M. Z., Mehler, A., & Rahman, R. (2012). Text Readability Classification of Textbooks of a Low-Resource Language. Paper presented at the 26th Pacific Asia Conference on Language, Information, and Computation (PACLIC 26).
[BibTeX]
@INPROCEEDINGS{Islam:Mehler:Rahman:2012,
    owner={zahurul},
    booktitle={26th Pacific Asia Conference on Language, Information, and Computation (PACLIC 26)},
    author={Islam, Md. Zahurul and Mehler, Alexander and Rahman, Rashedur},
    timestamp={2012.08.14},
    year={2012},
    title={Text Readability Classification of Textbooks of a Low-Resource Language},
    abstract={There are many languages considered to be low-density languages, either because the population speaking the language is not very large, or because insufficient digitized text material is available in the language even though millions of people speak the language. Bangla is one of the latter ones. Readability classification is an important Natural Language Processing (NLP) application that can be used to judge the quality of documents and assist writers to locate possible problems. This paper presents a readability classifier of Bangla textbook documents based on information-theoretic and lexical features. The features proposed in this paper result in an F-score that is 50% higher than that for traditional readability formulas.},
    pdf={http://www.aclweb.org/anthology/Y12-1059},
    website={http://www.researchgate.net/publication/256648250_Text_Readability_Classification_of_Textbooks_of_a_Low-Resource_Language}}