Tweet Language Similar to Normal Speech

March 27, 2014
Twitter Talk

What features do written forms of online communication have in common with spoken language? What significant similarities and differences are there between written texts and online discussions?

The way we tweet is very close to the way we talk. This is the main finding of an analysis carried out by NASA biophysicist Josiah P. Zayner, which looked at similarities and differences between various forms of communication. Technological advances have certainly led people to adapt to new forms of communication, but it is worth asking whether most people have really mastered email, instant messaging, SMS and the like, and whether they can switch easily from one mode of communication to another. Zayner looked at four types of communication in English: standard written texts from Google Books, Twitter communication, Internet Relay Chat (IRC), and a body of spoken English. For each, he studied how frequently words were used, how they were distributed and how they could be classified.

Tweets are closest to speech

Zayner studied the four bodies of text by, among other things, comparing the 200 most used words in each corpus. To investigate how word usage might characterise a given mode of communication, he took word frequency as the key criterion. His research showed that oral communication, tweets and IRC are all more varied than the language of books, where writers tend to use linking words such as articles, prepositions and conjunctions (a, the, of, to, and, etc.) more frequently. Comparing the most used words across the four bodies of language, he found that tweets and spoken English were most alike, with over 71% of the words in common. A confusion matrix analysis also indicated that spoken language and Twitter communication are very similar. The confusion matrix (or ‘error matrix’) approach was used to test how well the provenance of a word taken at random could be guessed from word frequency alone. Zayner’s computation showed that words taken from a book were very rarely mistaken for words from the other language modes, but there was quite often ‘confusion’ between Twitter and the spoken mode, i.e. a tweeted word often could not be distinguished from a word contained in the body of spoken language analysed.
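
The two comparisons described above, top-word overlap and the confusion matrix, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Zayner's actual code: the function names, the whitespace tokenisation, the rule of attributing each word to the corpus where its relative frequency is highest, and the toy corpora are all invented for the example.

```python
from collections import Counter


def top_words(text, n=200):
    """Return the set of the n most frequent words in text.

    Assumption: words are lowercased and split on whitespace only.
    """
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(n)}


def overlap_percent(words_a, words_b):
    """Percentage of the words in words_a that also appear in words_b."""
    if not words_a:
        return 0.0
    return 100.0 * len(words_a & words_b) / len(words_a)


def relative_frequencies(text):
    """Map each word to its share of all word occurrences in text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def confusion_matrix(corpora):
    """Guess the source of every word occurrence from frequency alone.

    Assumed rule (hypothetical, for illustration): a word is attributed
    to the corpus in which its relative frequency is highest.
    Returns matrix[true_source][guessed_source] -> number of occurrences.
    """
    freqs = {name: relative_frequencies(text) for name, text in corpora.items()}
    matrix = {name: Counter() for name in corpora}
    for true_source, text in corpora.items():
        for word in text.lower().split():
            guess = max(freqs, key=lambda name: freqs[name].get(word, 0.0))
            matrix[true_source][guess] += 1
    return matrix


# Toy corpora, invented for this example (not real data).
TOY_CORPORA = {
    "books": "the of the of",
    "tweets": "lol you you lol you",
    "speech": "you know you know",
}

if __name__ == "__main__":
    tweets_top = top_words(TOY_CORPORA["tweets"])
    speech_top = top_words(TOY_CORPORA["speech"])
    print("tweet words shared with speech (%):",
          overlap_percent(tweets_top, speech_top))
    for source, row in confusion_matrix(TOY_CORPORA).items():
        print(source, dict(row))
```

With these toy corpora, every word from the "books" text is attributed to books, while half of the "speech" word occurrences are attributed to tweets. That is the kind of confusion between Twitter and spoken English that the analysis reports, here reproduced on a deliberately tiny scale.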

Real life close to fiction?

The NASA scientist then took his analysis of the four bodies of language further, seeking to distinguish the language used in fiction from that used in non-fiction. Though one might have expected our spoken utterances to draw more on non-fictional situations, Zayner’s classification of words by type (verb, adjective, noun, etc.) revealed that the word usage common in fiction was the more prevalent in speech, on Twitter and in IRC. This study goes some way towards helping us to identify the similarities and differences between language modes on the basis of vocabulary use alone, without having to examine the structure of complex sentences. In the longer term the technique might help to categorise or automatically annotate books according to their genre, and perhaps also to verify how accurate non-fiction writing really is!

© L’Atelier BNP Paribas