Neural network identification of poets, using letter sequences

Johan F. Hoorn, Stefan L. Frank, Wojtek Kowalczyk, and Floor van der Ham

Abstract

Stylistic differences among poets are usually sought in sound and semantics. In human analysis, the criteria for recognising stylistic differences are manifold and intermingled. This study demonstrates that successful identification of poets based on their work is possible using one criterion: letter sequences.

Poets show preferences for certain letter combinations, and these preferences are characteristic of their writing style. Using this criterion in machine computation demonstrates that semantics are not needed to identify poets correctly: in the interest of parsimony, the single minimal criterion of characteristic letter sequences is enough to fingerprint an author.

A small sample of the work of three Dutch poets was used: Bloem (1887‑1966), Slauerhoff (1898‑1936), and Lucebert (1924‑1994). This sample formed the training set from which the neural network program learned the letter patterns characteristic of each poet. Next, the program was fed a set of new poems, for each of which the author was to be identified.
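
As an illustration of what letter sequences taken from raw text might look like as input, the following minimal Python sketch counts overlapping letter n-grams in a string and converts the counts to relative frequencies. The sequence length n, the lowercasing, the treatment of punctuation, and the function names letter_ngrams and relative_frequencies are assumptions made for this example; the abstract does not specify how the letter sequences were encoded or which network architecture was used.

```python
from collections import Counter


def letter_ngrams(text, n=3):
    """Count overlapping letter sequences of length n in raw text.

    Assumption for this sketch: text is lowercased and every run of
    non-letter characters is collapsed into a single space, so that
    word boundaries are kept as part of the sequences.
    """
    letters_only = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    cleaned = " ".join(letters_only.split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def relative_frequencies(counts):
    """Turn raw n-gram counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Placeholder text for illustration only, not taken from any of the poets.
sample = "De zee, de zee"
profile = relative_frequencies(letter_ngrams(sample, n=2))
```

A frequency profile of this kind, computed per poem, is one plausible form of input vector for a classifier; no human pre-coding of the text is involved.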

When choosing between two poets, the program correctly identified the poet for 80‑90% of the new poems. When the choice was among three poets, the score was about 70% correct. Since raw ASCII files are sufficient as input and human pre‑coding is unnecessary, neural network analysis of letter sequences may turn out to be a powerful tool for categorisation and identification problems, such as those involving genre, stylistics, and plagiarism.