Language: Spanish
Date: Uploaded: 2021-04-13T00:00:00+02:00
Duration: 19m 01s
Location: Conference
Views: 932

Stylometric analysis of Avellaneda’s Don Quijote

Yoshifumi Kawasaki (University of Tokyo)

Description

This study aims to identify the author of the so-called apocryphal Don Quijote, published in 1614 under the pseudonym Alonso Fernández de Avellaneda. The work is a continuation of the first part of the original Don Quijote, published in 1604 by the Spanish novelist Miguel de Cervantes. A number of writers of the period have been proposed as its true author, including Miguel de Cervantes himself, Jerónimo de Pasamonte (Martín Jiménez: 2005, 2006, 2019; Pasamonte: 2017; Percas de Ponseti: 2002; Riquer: 2010), Cristóbal Suárez de Figueroa (Álvarez Díez: 1990; Espín Rodrigo: 1993; Suárez Figaredo: 2004, 2006, 2007, 2009), Alonso de Castillo Solórzano (Hornedo: 1952), Lope de Vega (Pérez López: 2002), and José de Villaviciosa (Rodríguez López-Vázquez: 2011), to name a few.
A drawback of existing studies of this long-standing literary enigma is that they capture little more than thematic similarity, since they rely on a fixed-size set of the most frequent words (Blasco 2016; Fradejas Rueda 2016; Rissler-Pipka 2016). Resemblance in content in no way guarantees shared authorship. To distinguish the stylistic fingerprints of different authors correctly, we use POS (part-of-speech) n-grams, which have been shown to be effective in Japanese stylometric studies (Jin 2013; Uesaka & Murakami 2015). We consider POS sequence patterns to be an ideal stylometric feature because they are frequent, content-independent, unconsciously repeated, and difficult to imitate.
We built a corpus of 33 prose works by 12 Golden Age authors, collected from their online versions. To equalize length, each work is divided into segments of 10,000 tokens. The POS n-grams are obtained with the POS tagger of spaCy 2.1.8, which distinguishes 16 parts of speech, including verb, noun, adjective, adverb, and preposition. We set the sequence size n to 1, 2, and 3 to prevent combinatorial explosion. Because the tagger is designed for Modern Spanish, it occasionally assigns erroneous tags, which we left uncorrected. We believe that this relatively small number of errors will not substantially affect the subsequent analysis. We applied multivariate analyses, including cluster analysis, PCA, and t-SNE, to visualize the distribution of authors.
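As a rough illustration of the segmentation and feature-extraction steps just described, the following minimal sketch assumes that each work has already been converted into a list of POS tags (for example by a tagger such as spaCy). The function names `split_into_chunks` and `pos_ngram_profile` are hypothetical, and two details are assumptions not stated in the abstract: a trailing remainder shorter than the segment size is dropped, and n-gram counts are normalized to relative frequencies.

```python
from collections import Counter
from typing import Dict, List, Tuple


def split_into_chunks(tags: List[str], size: int = 10_000) -> List[List[str]]:
    """Split a tag sequence into consecutive segments of exactly `size` tokens.

    Assumption: a trailing remainder shorter than `size` is discarded.
    """
    return [tags[i:i + size] for i in range(0, len(tags) - size + 1, size)]


def pos_ngram_profile(tags: List[str], max_n: int = 3) -> Dict[Tuple[str, ...], float]:
    """Return the relative frequency of every POS n-gram for n = 1..max_n.

    Assumption: frequencies are normalized separately for each n.
    """
    profile: Dict[Tuple[str, ...], float] = {}
    for n in range(1, max_n + 1):
        # Slide a window of length n over the tag sequence.
        grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
        total = len(grams)
        for gram, count in Counter(grams).items():
            profile[gram] = count / total
    return profile


# Toy example with a hand-written tag sequence (not real corpus data).
toy_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
toy_profile = pos_ngram_profile(toy_tags, max_n=2)
```

Each segment's profile can then be laid out as one row of a feature matrix and passed to a clustering or dimensionality-reduction routine of the kind mentioned above.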
Our analysis reveals that POS n-grams can distinguish among different authors almost perfectly. This result lends credence to the effectiveness of POS sequence patterns for authorship attribution in Spanish. However, we are unable to ascribe Avellaneda’s work to any of the candidate authors, because its fragments form a single cluster of their own and do not mix with the works of any candidate. If the true author were among them, Avellaneda’s Don Quijote would have clustered with that author's works. It is therefore possible that the genuine author is not on our candidate list, which we will need to expand.
In future work, we plan to treat narration and dialogue separately in order to carry out a more fine-grained analysis.

Owners

Congreso Cilc 2021

Series: CILC2021: Lingüística computacional basada en corpus / Corpus-based computational linguistics