
Texts like networks: How many words are sufficient to recognize the author?

April 11, 2019

We are more original than we think: this is the suggestion of a literary text analysis carried out with a new method of stylometry proposed by scientists from the Institute of Nuclear Physics of the Polish Academy of Sciences. An author's individuality can already be seen in the connections between no more than a dozen words of an English text. It turns out that in Polish, and probably in other Slavic languages, authorship identification requires even fewer words and is more certain.

Finding out the author of a text is usually not difficult: just read the signature. However, sometimes there is no signature, since it has not been preserved or has been deliberately omitted by the author. Often, instead of a first and last name, we see a pseudonym. So, how can we verify who penned a historical text known only from fragments? How can we establish the true creator of an Internet lampoon? How can we determine whether the text of a thesis or doctoral dissertation has been plagiarized? In many cases, traditional stylometric methods fail or do not lead to sufficiently reliable conclusions. In the journal Information Sciences, scientists from the Institute of Nuclear Physics of the Polish Academy of Sciences (IFJ PAN) in Cracow have presented their own statistical tool for stylometric analysis. Constructed with the use of graphs, it makes it possible to look at the structure of texts in a qualitatively new way.

"The conclusions of our research are, on the one hand, encouraging. They indicate that the individuality of any person manifests itself clearly in the way they use a surprisingly small number of words. But there is also another, darker side of the coin. Since it turns out we are so original, it will be easier to identify us by our statements," says Prof. Stanislaw Drozdz (IFJ PAN, Cracow University of Technology).

Stylometry - i.e. the science dealing with the determination of the statistical characteristics of the style of texts - is based on the observation that each of us uses the same language in a slightly different way. Some have a broader vocabulary, others a narrower one; some like to use certain phrases and make mistakes, others avoid repetition and are linguistic purists. And when we write, we also differ in the way we use punctuation. In the typical stylometric approach, the basic features of a text are usually examined, e.g. the frequency of occurrence of individual words, whilst punctuation is ignored. Analyses are carried out for the studied text and for texts written by known candidate authors. The creator is deemed to be the person whose works have parameters with the values closest to those obtained for the material being identified.
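The traditional approach described above can be illustrated with a minimal Python sketch. It is not the procedure used by the IFJ PAN group: the choice of relative word frequencies as the "parameters", the Euclidean distance, the function names and the toy texts are all assumptions made for illustration.

```python
from collections import Counter
import math


def frequency_profile(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[word] / total for word in vocabulary]


def attribute(unknown_text, known_texts, vocabulary):
    """Return the candidate whose profile is closest to the unknown text.

    Euclidean distance is an assumption; other measures are possible.
    """
    target = frequency_profile(unknown_text, vocabulary)

    def distance(author):
        profile = frequency_profile(known_texts[author], vocabulary)
        return math.dist(target, profile)

    return min(known_texts, key=distance)


# Toy data, invented for illustration only.
known = {
    "author_a": "the cat and the dog and the bird",
    "author_b": "a cat but a dog but a bird",
}
vocab = ["the", "and", "a", "but"]
guess = attribute("the fox and the hen", known, vocab)
print(guess)
```

With real material, each candidate would be represented by whole books rather than a single sentence, and the vocabulary would be chosen from the most frequent words of the corpus.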

"We suggested that the characteristic features of the style be sought in a network representation of the text, using graphs," explains Tomasz Stanisz, PhD student at the IFJ PAN and the first author of the publication, and he specifies: "The graph is a collection of points, or vertices of the graph, connected by lines, i.e. the edges of the graph. In the simplest case - in the so-called unweighted network - the vertices correspond to individual words and are connected by edges if and only if two given words have occurred adjacent to each other at least once in the text. For example, for the sentence 'Jane is hungry', the graph would have three vertices, one for each word, but there would only be two edges, one between 'Jane' and 'is', the other between 'is' and 'hungry'."
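The 'Jane is hungry' example can be reproduced in a few lines of Python. This is only a sketch of the unweighted network as described in the quotation; the function name and the decision to ignore a word followed by itself are assumptions.

```python
def build_unweighted_graph(words):
    """Map each word to the set of words it has appeared next to."""
    graph = {word: set() for word in words}
    for left, right in zip(words, words[1:]):
        if left != right:  # assumption: a repeated word does not link to itself
            graph[left].add(right)
            graph[right].add(left)
    return graph


graph = build_unweighted_graph("Jane is hungry".split())

# Collect each undirected edge once, regardless of direction.
edges = {frozenset((u, v)) for u, nbrs in graph.items() for v in nbrs}

print(len(graph), "vertices,", len(edges), "edges")
```

The sentence yields three vertices and two edges, in agreement with the description above.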

While constructing their stylometric tools, the IFJ PAN researchers tested different types of graphs. The best results were obtained for weighted graphs, that is, those in which each edge carries information about the number of occurrences of its corresponding connection between words. Two parameters turned out to be the most useful in such networks: the node degree and the clustering coefficient. The first describes the number of edges coming from a given node and is directly related to the number of occurrences of a given word in the text. In turn, the clustering coefficient describes the probability that two words, each connected by an edge to a given word, are also connected to each other.
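Both parameters can be computed by hand on a toy text. The Python sketch below is an illustration under stated assumptions, not the definitions used in the publication: the weighted degree is taken as the sum of edge weights at a node, and the clustering coefficient is the plain unweighted one (the fraction of pairs of neighbours that are themselves linked). The function names and the sample sentence are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_weighted_graph(tokens):
    """Count how often each unordered pair of distinct words occurs adjacently."""
    weights = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # assumption: no self-loops
            weights[frozenset((left, right))] += 1
    return weights


def neighbours(weights, word):
    """All words sharing at least one edge with the given word."""
    return {other for pair in weights if word in pair
            for other in pair if other != word}


def weighted_degree(weights, word):
    """Sum of the weights of all edges attached to the word."""
    return sum(w for pair, w in weights.items() if word in pair)


def clustering(weights, word):
    """Fraction of neighbour pairs that are linked to each other (unweighted)."""
    nbrs = neighbours(weights, word)
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(nbrs, 2))
    linked = sum(1 for a, b in pairs if frozenset((a, b)) in weights)
    return linked / len(pairs)


tokens = "the cat saw the dog and the cat saw a dog".split()
weights = build_weighted_graph(tokens)
print(weighted_degree(weights, "the"), clustering(weights, "the"))
```

In this toy text 'the' occurs three times and has a weighted degree of 5: each occurrence inside the text contributes two adjacencies and the opening occurrence contributes one, which shows how the degree tracks the number of occurrences of a word.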

Using statistical tools prepared in this way, the Cracow-based physicists looked at 96 books: six novels by each of eight well-known authors writing in English (Austen, Conrad, Defoe, Dickens, Doyle, Eliot, Orwell and Twain) and six by each of eight Polish authors (Korczak, Kraszewski, Lam, Orzeszkowa, Prus, Reymont, Sienkiewicz and Zeromski). The authors included two winners of the Nobel Prize for Literature (Wladyslaw Reymont and Henryk Sienkiewicz). All the texts were downloaded from the internet libraries Project Gutenberg, Wikisource and Wolne Lektury. The group from the IFJ PAN then checked the reliability with which the authorship of 12 randomly selected works in one language could be determined, treating the rest of the pool of works as comparative material.

"In the case of English texts, we identified the authors correctly in almost 90% of cases. In addition, in order to achieve success, it was necessary to trace the connections between only 10-12 words of the examined text. Contrary to naive intuition, a further increase in the number of words studied did not significantly increase the effectiveness of the method," says Tomasz Stanisz.

In Polish, the determination of authorship turned out to be even simpler: only 5-6 words needed to be traced. What is particularly interesting is that although the pool of significant words was half the size of that needed in English, the probability of correct identification rose to as much as 95%! Such high diagnostic accuracy, however, was only achieved when punctuation marks were also treated as separate words. In both languages, omitting punctuation resulted in a significant reduction in the number of correct guesses. The observed role of punctuation is another confirmation of the conclusions of the 2017 publication by Prof. Drozdz's group, which showed that punctuation plays as important a role in language as the words themselves.
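Treating punctuation marks as separate words amounts to a choice made when the text is split into tokens. The following Python sketch shows one possible way to do this with a regular expression; the pattern, the lower-casing and the function name are assumptions, not the preprocessing used in the study.

```python
import re


def tokenize(text, keep_punctuation=True):
    """Split text into lower-case tokens.

    With keep_punctuation=True every punctuation mark becomes its own token.
    Note that this simple pattern also splits on apostrophes inside words.
    """
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text.lower())


sample = "Jane is hungry, very hungry!"
with_marks = tokenize(sample)
without_marks = tokenize(sample, keep_punctuation=False)

print(with_marks)
print(without_marks)
```

In the first variant the comma and the exclamation mark become vertices of the word network in their own right, so their connections to neighbouring words can contribute to the author's profile.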

"In comparison with English, Polish seems to give greater possibilities of revealing the style of the author. We think that the other Slavic languages are characterised by similar features. English is a positional language, which means that the order of the words in a sentence is important. This sort of language leaves less room for an individual style of expression than the Slavic languages, in which inflection determines the role of a word or phrase in a sentence. This allows for greater freedom to organize the order of words in a sentence, whilst its meaning remains unchanged," sums up Prof. Drozdz.
-end-
The Henryk Niewodniczanski Institute of Nuclear Physics (IFJ PAN) is currently the largest research institute of the Polish Academy of Sciences. The broad range of studies and activities of IFJ PAN includes basic and applied research, ranging from particle physics and astrophysics, through hadron physics, high-, medium-, and low-energy nuclear physics, condensed matter physics (including materials engineering), to various applications of methods of nuclear physics in interdisciplinary research, covering medical physics, dosimetry, radiation and environmental biology, environmental protection, and other related disciplines. On average, IFJ PAN publishes more than 600 scientific papers a year in journals listed in the Journal Citation Reports published by Thomson Reuters. Part of the Institute is the Cyclotron Centre Bronowice (CCB), an infrastructure unique in Central Europe that serves as a clinical and research centre in the area of medical and nuclear physics. IFJ PAN is a member of the Marian Smoluchowski Kraków Research Consortium "Matter-Energy-Future", which held the status of a Leading National Research Centre (KNOW) in physics for the years 2012-2017. The Institute has Category A+ status (the leading level in Poland) in the field of sciences and engineering.

CONTACTS:

Prof. Stanislaw Drozdz
The Institute of Nuclear Physics Polish Academy of Sciences
tel: +48 12 6628220
email: stanislaw.drozdz@ifj.edu.pl

SCIENTIFIC PAPERS:

"Linguistic data mining with complex networks: A stylometric-oriented approach"
T. Stanisz, J. Kwapien, S. Drozdz
Information Sciences 482 (2019) 301-320
DOI: https://doi.org/10.1016/j.ins.2019.01.040

"In narrative texts punctuation marks obey the same statistics as words"
A. Kulig, J. Kwapien, T. Stanisz, S. Drozdz
Information Sciences 375 (2017) 98-113
DOI: http://dx.doi.org/10.1016/j.ins.2016.09.051

LINKS:

http://www.ifj.edu.pl/
The website of the Institute of Nuclear Physics Polish Academy of Sciences.

http://press.ifj.edu.pl/
Press releases of the Institute of Nuclear Physics Polish Academy of Sciences.

IMAGES:

IFJ190411b_fot01s.jpg
HR: http://press.ifj.edu.pl/news/2019/04/11/IFJ190411b_fot01.jpg
The author of an unsigned text can be identified by analysing the relationship between just a few words of the text, as shown by physicist-statisticians from the Institute of Nuclear Physics of the Polish Academy of Sciences in Cracow. (Source: IFJ PAN)

The Henryk Niewodniczanski Institute of Nuclear Physics Polish Academy of Sciences
