Russian Corpus of Short Stories and Its Thematic Modeling - TV Universidad de Murcia

Language: Spanish

Date: 15 Apr 2021

Duration: 22m 28s

Venue: Conference

Views: 561

Russian Corpus of Short Stories and Its Thematic Modeling

Tatiana Sherstinova and Tatiana Skrebtsova (St. Petersburg State University)

Description

This paper examines the thematic content of Russian short stories written in the first three decades of the 20th century. It reports on part of the ongoing project “The Russian language on the edge of radical historical changes: the study of language and style in pre-revolutionary, revolutionary and post-revolutionary artistic prose by the methods of mathematical and computer linguistics (a corpus-based research on Russian short stories)”. The project’s overall goal is to give a comprehensive account of early 20th-century Russian short stories from thematic, structural and linguistic perspectives.
To accomplish this, a text corpus was created containing several thousand short stories written in Russia and, later, the Soviet Union, and published between 1900 and 1930 in literary journals or story collections. This timespan is divided into three periods: 1900-1913, 1914-1922 and 1923-1930. The first covers the time before the great cataclysms; the second embraces World War I, the February and October revolutions and the Civil War; the third covers the post-war socialist period. Each author may be represented by a single, randomly selected story per period. To ensure robust results, the corpus aims to include as many professional writers as possible, both famous (e.g. Anton Chekhov, Leo Tolstoy, Ivan Bunin, Maxim Gorky) and lesser-known, metropolitan and provincial alike.
The first three decades of the 20th century were a difficult time in Russian history. Defeat in the Russo-Japanese War (1904-1905), the subsequent political and social unrest, World War I, the February and October revolutions of 1917, which radically transformed economic, political and social life, and finally the Civil War (1917-1922) and its aftermath could not fail to affect Russian literature.
From the overall corpus, a random sample of 310 stories by 300 authors was taken (some writers appear in more than one period, which accounts for the slight discrepancy in numbers). This sample serves as an initial testbed for preliminary observations and hypotheses. In the paper, it is used to test different topic modeling techniques and to assess the adequacy of their results by mapping them onto the set of themes obtained by traditional interpretative methods.
Probabilistic topic models, including probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), are the most common in modern applications. However, we find it important not only to extract short story themes automatically but also to trace how they changed over time. In particular, we aim to arrive at generalizations concerning similarities within each period and differences between them. The best-known technique capable of tracking the evolution of topics over time is the dynamic topic model (DTM). We also employ non-negative matrix factorization (NMF), an unsupervised machine learning algorithm that has proven effective in the automatic detection of topics in text corpora.
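As an illustration of how NMF can surface topics, the following minimal sketch factorizes a toy story-by-term count matrix with the classic multiplicative update rules. It is not the authors' pipeline: the vocabulary, the counts, the number of topics and all function names are assumptions made for this example, and in practice a library implementation (for instance scikit-learn's NMF) applied to TF-IDF weights would normally be preferred.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Frobenius norm of the residual V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Approximate V (stories x terms) as W (stories x topics) times H (topics x terms).

    Uses multiplicative updates, which keep every entry non-negative.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

def top_terms(h, vocabulary, n):
    """List the n highest-weighted terms for each topic (row of H)."""
    return [[vocabulary[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:n]]
            for row in h]

# Hypothetical toy data: 4 stories, 6 terms (not taken from the corpus).
vocabulary = ["war", "soldier", "front", "village", "harvest", "peasant"]
counts = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 1],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 2],
]

w, h = nmf(counts, k=2)
for index, terms in enumerate(top_terms(h, vocabulary, 3)):
    print(f"topic {index}: {', '.join(terms)}")
```

Each row of H can be read as a topic described by its heaviest terms, and each row of W gives the weight of those topics in one story. Fitting such a model separately for each period, and comparing the resulting topics, is one simple way to approach the question of change over time.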
At present, the sample has been manually tagged with a set of 89 topics, among them themes concerning different aspects of human life. Each of the 310 stories is thus assigned a set of topics bearing directly on its plot. The frequency of each topic in each period has also been calculated, which helps reveal the relevant tendencies. These data serve as a reference standard for assessing the pros and cons of different computational techniques.
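The per-period topic frequencies described above can be sketched as follows. The period boundaries come from the abstract; the story years and topic labels are invented for illustration (the actual set of 89 topics is not listed here), and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Period boundaries as stated in the abstract.
PERIODS = [
    ("1900-1913", 1900, 1913),
    ("1914-1922", 1914, 1922),
    ("1923-1930", 1923, 1930),
]

def period_of(year):
    """Return the label of the period that contains the publication year."""
    for label, start, end in PERIODS:
        if start <= year <= end:
            return label
    raise ValueError(f"year {year} is outside the corpus timespan")

def topic_frequencies(stories):
    """Compute, per period, the share of stories tagged with each topic.

    `stories` is an iterable of (publication_year, list_of_topic_labels).
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for year, topics in stories:
        period = period_of(year)
        totals[period] += 1
        # A topic counts once per story, even if it is listed twice.
        counts[period].update(set(topics))
    return {
        period: {topic: n / totals[period] for topic, n in counts[period].items()}
        for period in totals
    }

# Hypothetical manually tagged stories (not taken from the corpus).
sample = [
    (1905, ["love", "death"]),
    (1910, ["love"]),
    (1916, ["war", "death"]),
    (1920, ["war"]),
    (1926, ["labour", "love"]),
]

frequencies = topic_frequencies(sample)
for period, shares in sorted(frequencies.items()):
    print(period, sorted(shares.items()))
```

Relative shares are used rather than raw counts so that periods with different numbers of stories remain comparable.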

Owners

Congreso Cilc 2021

Series: CILC2021: Discurso, análisis literario y corpus / Discourse, literary analysis and corpora

Typical Phraseological Units in Poetic Texts

Michael Pace-Sigge (University of Eastern Finland)

Coordinative Aspect of Pragmatic Markers Use in the Corpus of Russian Spoken Dialogical Discourse

Ekaterina Troshchenkova, Olga Blinova (Saint Petersburg State University -SPBU-)

Demystifying the ideological outcomes of the Egyptian uprising: a comparison between the Arab (...)

Safa Atia (Universidad Autónoma de Madrid)

Body Language through Clusters in a Corpus of Contemporary Male Irish Novelists

Cassandra Sian Tully (Universidad de Extremadura)

The Discourse of Human Rights in China’s News Media

Zihuan Zhong (Queen Mary University of London)

Establish a niche via negation: A corpus-based study of Move 2 in Ph.D. thesis introductions (...)

Shuyi Sun and Peter Crosthwaite (University of Queensland)

The linguistic representation of gender in Spanish based news on the web

Héctor Castro and Ignacio Rodríguez (Universidad Autónoma de Querétaro)

Multimodal Corpus Analysis of Online Tourism Narratives

Elena Mattei (University of Verona)

Quantifying discourse coherence with a complex network method

Jiang Niu and Yue Jiang (Xi'an Jiaotong University)

Evolución del comportamiento lingüístico de hombres y mujeres en la red social Facebook (...)

Isabel García Martínez (Universitat de Valencia)

Metaphor we anti-fraud by: a corpus-based study of metaphor in public legal education discourse

Mengna Liu (Guangdong University of Foreign Studies)

Interactive metadiscourse in Spanish academic writing: A comparative corpus-based analysis

Gang Yao (Universidad de Murcia) and María Luisa Carrió Pastor (Universidad Politécnica de Valencia)

‘Yeah, No’ in Irish English fiction: Pragmatic functions and indexicality of a ‘new’ pragmatic (...)

Ana Maria Terrazas-Calero (University of Limerick)

"En defensa de nuestra casa". La migración centroamericana desde los comentarios (...)

Ana Ruth Sánchez Barrera and Ignacio Rodríguez Sánchez (Universidad Autónoma de Querétaro)

A corpus-based critical discourse analysis of Chinese medicine advertising leaflets in the UK

Fang Wang (University of Surrey)

La construcción de la imagen país de China en la red social en tiempos de COVID-19 (...)

Cao Wei, Yuanyuan Zhao and Zhao Yuanyuan (Universidad de Huelva)

A Warring Style: A Corpus Stylistic Analysis of British Poetry of the First World War

Hakan Cangır (Ankara University) and Taner Can (TED University)

La cobertura del Brexit en la prensa española: El estudio del sesgo ideológico a través de la (...)

Álvaro Ramos (Universidad de Granada)