Language: Spanish
Date: Uploaded: 2021-04-15T00:00:00+02:00
Duration: 22m 28s
Venue: Conference
Views: 370

Russian Corpus of Short Stories and Its Thematic Modeling

Tatiana Sherstinova and Tatiana Skrebtsova (St. Petersburg State University)


This paper examines the thematic content of Russian short stories written in the first three decades of the 20th century. It presents part of the ongoing project “The Russian language on the edge of radical historical changes: the study of language and style in pre-revolutionary, revolutionary and post-revolutionary artistic prose by the methods of mathematical and computer linguistics (a corpus-based research on Russian short stories)”. The project’s overall goal is to give a comprehensive account of early 20th-century Russian short stories from thematic, structural and linguistic perspectives.
To accomplish this, a text corpus was created containing several thousand short stories written in Russia and, later, the Soviet Union, and published between 1900 and 1930 in literary journals or story collections. This timespan is divided into three periods: 1900-1913, the time before the great cataclysms; 1914-1922, which embraces World War I, the February and October revolutions and the Civil War; and 1923-1930, the post-war socialist period. Each author may be represented by a single, randomly selected story per period. To ensure robust results, the corpus aims to cover as many professional writers as possible, both famous (e.g. Anton Chekhov, Leo Tolstoy, Ivan Bunin, Maxim Gorky) and lesser-known ones, metropolitan and provincial alike.
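The selection rule described above (three periods, at most one randomly chosen story per author per period) can be sketched as follows. This is only an illustration, not the project's actual tooling: the record keys 'author', 'year' and 'title', the function names and the fixed seed are all assumptions introduced here.

```python
import random
from collections import defaultdict

# Period boundaries as given in the abstract (inclusive years).
PERIODS = [(1900, 1913), (1914, 1922), (1923, 1930)]


def period_of(year):
    """Return the 1-based period index for a year, or None outside 1900-1930."""
    for index, (start, end) in enumerate(PERIODS, start=1):
        if start <= year <= end:
            return index
    return None


def sample_one_per_author_period(stories, seed=0):
    """Pick one random story for every (author, period) pair.

    `stories` is a list of dicts with the hypothetical keys
    'author', 'year' and 'title'. Stories dated outside the
    1900-1930 timespan are ignored.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for story in stories:
        period = period_of(story["year"])
        if period is not None:
            groups[(story["author"], period)].append(story)
    # Sort the keys so the random draws happen in a stable order.
    return {key: rng.choice(candidates) for key, candidates in sorted(groups.items())}
```

Because an author can appear once in each period, a sample drawn this way may contain more stories than authors.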
The first three decades of the 20th century were a difficult time in Russian history. Defeat in the Russo-Japanese War (1904-1905), the subsequent political and social unrest, World War I, the February and October revolutions of 1917, which radically transformed economic, political and social life, and finally the Civil War (1917-1922) with its aftermath could not fail to affect Russian literature.
From the overall corpus, a random sample was taken, containing 310 stories by 300 authors (the slight discrepancy in numbers arises because some writers feature in more than one period). This sample serves as an initial testbed for preliminary observations and hypotheses. In the paper, it is used to test different topic modeling techniques and to assess the adequacy of their results by mapping them onto the set of themes obtained by traditional interpretative methods.
Probabilistic topic models, including probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), are the most common in modern applications. However, we find it important not only to extract the themes of short stories automatically but also to trace how they changed over time. In particular, we aim to arrive at generalizations concerning similarities within each period and differences between them. The best-known technique capable of tracking the evolution of topics over time is the dynamic topic model (DTM). We also employ non-negative matrix factorization (NMF), an unsupervised machine learning algorithm that has proven effective in the automatic detection of topics in text corpora.
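To make the NMF idea concrete, here is a toy sketch that factorizes a small document-term count matrix V into non-negative factors W (documents by topics) and H (topics by terms), using the classic multiplicative update rules for the Frobenius objective. It is written in plain Python for readability and is not the authors' implementation; the function names, the iteration count and the seed are assumptions made for this illustration, and real experiments would normally rely on a numerical library.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate non-negative v (docs x terms) as w @ h with k topics."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start; multiplicative updates keep entries >= 0.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction residual v - w @ h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for row_v, row_a in zip(v, approx)
               for x, y in zip(row_v, row_a)) ** 0.5
```

Each row of H can then be read as a topic (a weighting over terms), and each row of W as the topic mixture of one story.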
At present, the sample has been manually tagged with a set of 89 topics, which cover different aspects of human life. Each of the 310 stories is thus assigned a set of topics directly bearing on its plot. The frequency of each topic in each period has also been calculated, which helps reveal the relevant tendencies. These data serve as a reference standard for assessing the pros and cons of different computational techniques.
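The per-period topic frequencies mentioned above can be computed with a few lines of code. The sketch below assumes each story is represented as a (period, topics) pair and reports, for every period, the share of its stories tagged with each topic; this data layout and the function name are assumptions, and any topic labels passed in are the caller's own, since the abstract does not list the 89 topics.

```python
from collections import Counter, defaultdict


def topic_frequencies(tagged_stories):
    """Return {period: {topic: share of that period's stories with the topic}}.

    `tagged_stories` is an iterable of (period, topics) pairs, where
    `topics` is the collection of manual tags assigned to one story.
    """
    counts = defaultdict(Counter)  # period -> topic -> number of stories
    totals = Counter()             # period -> number of stories
    for period, topics in tagged_stories:
        totals[period] += 1
        # set() ensures a topic is counted once per story even if repeated.
        counts[period].update(set(topics))
    return {
        period: {topic: n / totals[period] for topic, n in counts[period].items()}
        for period in totals
    }
```

Using shares rather than raw counts keeps the periods comparable when they contain different numbers of stories.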


CILC 2021 Conference



Series: CILC2021: Discurso, análisis literario y corpus / Discourse, literary analysis and corpora