Colloquium talk by Fabian Flöck: Provenance and change tracking of all textual content in Wikipedia – A dataset and its applications for Computational Social Science, Humanities and Linguistics

Colloquium

11.05.2017, 14:00 – 16:00, Robert-Mayer-Straße 10, Room 401

Over the last decade, research on information retrieval, linguistic patterns, social collaboration dynamics and other subjects has advanced notably by drawing on the content and editing logs that large open Wiki projects provide, Wikipedia foremost among them.

Yet, a potential that remains largely untapped is the full information encoded in the revision history of Wiki documents. Specifically, most Wiki platforms, including Wikipedia, do not provide explicit markers in their internal data representation to identify the source revision of a particular piece of content or any changes to it – authorship and the evolution of single words or sentences throughout an article’s development process are therefore not readily traceable.

To retrieve this information, each article revision has to be compared after the fact against all previous ones. Certain idiosyncrasies of Wiki environments, such as frequent reverts, make this a non-trivial endeavor that goes beyond common text comparison approaches. We have developed and refined an algorithm for this specific task and have generated a high-accuracy dataset that explicitly indicates, for each of the over 13 billion tokens ever recorded in the English Wikipedia, the article revision in which it was originally written and any revisions in which it was later deleted, reinserted, deleted again, and so on.
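
To make the task concrete, here is a minimal sketch of token-level provenance tracking. It is not the algorithm presented in the talk: it assumes whitespace tokenization, uses Python's standard `difflib` for the comparison, and detects only exact reverts to a previously seen revision text. The names `Token` and `track_provenance` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Token:
    text: str
    origin_rev: int                                   # revision that first introduced the token
    removed_in: list = field(default_factory=list)    # revisions that deleted it
    restored_in: list = field(default_factory=list)   # revisions that reinserted it


def track_provenance(revisions):
    """revisions: iterable of (rev_id, text) in chronological order.

    Returns (all_tokens, current): every token ever created, and the
    tokens that make up the latest revision, in order.
    """
    all_tokens = []
    current = []
    snapshots = {}  # exact revision content -> its token objects (assumed revert check)

    for rev_id, text in revisions:
        words = text.split()  # simplifying assumption: whitespace tokenization
        key = tuple(words)

        if key in snapshots:
            # Exact revert: reuse the original token objects so that
            # their origin revision is preserved.
            new_current = snapshots[key]
        else:
            matcher = SequenceMatcher(a=[t.text for t in current], b=words,
                                      autojunk=False)
            new_current = []
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                if tag == "equal":
                    new_current.extend(current[i1:i2])   # tokens survive unchanged
                elif tag in ("insert", "replace"):
                    for word in words[j1:j2]:            # genuinely new tokens
                        token = Token(word, rev_id)
                        all_tokens.append(token)
                        new_current.append(token)
                # "delete" spans contribute nothing to the new revision
            snapshots[key] = new_current

        old_ids = {id(t) for t in current}
        new_ids = {id(t) for t in new_current}
        for t in current:
            if id(t) not in new_ids:
                t.removed_in.append(rev_id)
        for t in new_current:
            if id(t) not in old_ids and t.origin_rev != rev_id:
                t.restored_in.append(rev_id)

        current = new_current

    return all_tokens, current


if __name__ == "__main__":
    history = [(1, "the cat sat"), (2, "the dog sat"), (3, "the cat sat")]
    tokens, latest = track_provenance(history)
    for t in tokens:
        print(t.text, t.origin_rev, t.removed_in, t.restored_in)
```

In the three-revision history above, revision 3 reverts revision 2. A plain diff between revisions 2 and 3 would attribute "cat" to revision 3; the snapshot lookup is what keeps its origin at revision 1 and records the reinsertion instead. Partial reverts, moved text and vandalism are not handled by this sketch, which is part of why the real task is harder than ordinary text comparison.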

These data enable fascinating insights into how the productivity of editors develops over time, how survival rates of certain words and n-grams differ in changing contexts, which content types tend to spark controversies or even edit wars, and how collaboration structures manifest around parts of articles, to name just a few phenomena.

In this talk I would like to share some insights we have already gained from this dataset and discuss promising research avenues it opens up for Computational Social Science, Humanities and Linguistics.

Curriculum Vitae

Dr. Fabian Flöck is a research associate at the Computational Social Science Group at GESIS – Leibniz Institute for the Social Sciences in Cologne. After receiving his graduate degree in Media Studies / Empirical Sociology from the University of Cologne, he earned his doctorate in computer science from the Institute AIFB at Karlsruhe Institute of Technology in 2016. As an interlude to his academic career, he worked for two years as head product manager for a social networking platform.

His research focuses on how to mine the social dynamics of large-scale collaborative online platforms and make them transparent, be it via quantitative analysis or visualization tools, exploring how social mechanisms shape the performance of these systems, e.g., with regard to content production. This includes work on social tagging systems and social news platforms (such as reddit), with a focus on Wikipedia and similar collaborative writing platforms. He is also interested in using Wiki content as a resource for cultural research and has conducted research on Semantic Web technologies, crowdsourcing solutions and gamification.