WhatsApp statistics (2)
Date: 2024-10-26
There are a couple of things we can improve from the first part of the analysis.
Exporting words
One feature is still missing compared with our previous Bash script: counting the occurrences of each unique word written by each participant.
What defines a word?
First, we have to consider what we’re looking at, what makes a word and how words are separated. One thing to consider as well is that this script should be versatile and work the same way with English, French and German conversations.
It seems that all of these languages use whitespace to separate words, but punctuation also has to be taken into account. A choice has to be made about dashes, which in my opinion separate words in English and French and are rarely used in German, so I treat them as separators. I made the same decision for apostrophes.
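To make the separator choice concrete, here is a minimal sketch. The function name split_words and the sample sentences are made up for illustration, and the typographic apostrophe is an extra assumption on top of the characters listed in the script further down.

```python
import re

# Characters treated as word separators, in addition to whitespace.
# Assumption: dashes and apostrophes split words, as decided above;
# the typographic apostrophe (’) is added here for illustration only.
SEPARATORS = re.compile(r"[.,\-/'’]")

def split_words(text):
    """Replace separator characters with spaces, then split on whitespace."""
    return SEPARATORS.sub(" ", text).split()

print(split_words("a well-known trick"))  # ['a', 'well', 'known', 'trick']
print(split_words("l'arbre, c'est ça"))   # ['l', 'arbre', 'c', 'est', 'ça']
```

With this choice, French elisions such as "l'" and "c'" end up as one-letter words of their own.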
Then we need to strip words down to the bare minimum: convert everything to lowercase, get rid of numbers and symbols, and remove diacritics, which are rare in English but heavily used in French and German.
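As a quick illustration of that normalisation, the sketch below lowercases a string, drops diacritics through Unicode decomposition and keeps only the letters a to z and spaces. The function name normalize_text and the sample words are assumptions made for this example.

```python
import re
import unicodedata

def normalize_text(text):
    """Lowercase, drop diacritics, and keep only a-z and spaces."""
    text = text.lower()
    # NFD splits "é" into "e" plus a combining accent (category Mn), which is dropped.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return re.sub(r"[^a-z ]", "", text)

print(normalize_text("Électronique"))  # electronique
print(normalize_text("Müller42!"))     # muller
print(normalize_text("Straße"))        # strae
```

The last line shows a limitation for German: ß has no Unicode decomposition, so it is simply removed. Replacing it with "ss" before filtering would be one possible tweak.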
Once that’s done, we can count the occurrences of each word across the dataframe containing our messages.
Translation into code
We still have our big Pandas dataframe containing the content of every message. We can copy the message column to a new one called stripped, where each message gets stripped of anything that isn’t a word or part of a word.
Each message is converted to a string, made lowercase, and stripped of its diacritics using the unicodedata library.
The characters we consider separators are replaced by whitespace; then we remove all remaining non-alphabetic characters and write the string back to the dataframe.
Finally, we split those strings into words and count them in a new dataframe. We then repeat the count for each participant in the group and write the result to a file.
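The counting step can be tried in isolation on a toy dataframe. The column names sender and stripped follow the article; the senders and messages are invented for this sketch.

```python
import pandas as pd

# Toy data: invented senders and already-cleaned messages.
df = pd.DataFrame({
    "sender": ["alice", "bob", "alice"],
    "stripped": ["the cat sat", "the dog", "the cat"],
})

# One column per word position, stacked into a single Series, then counted.
counts = df["stripped"].str.split(expand=True).stack().value_counts().to_frame("total")

# Per-sender counts are aligned on the word index; a word a sender never used becomes NaN.
for sender in df["sender"].unique():
    mask = df["sender"] == sender
    counts[sender] = df.loc[mask, "stripped"].str.split(expand=True).stack().value_counts()

print(counts)
```

Because missing words are stored as NaN, the per-sender columns become floats, which is presumably why the exported counts look like 117.0 rather than 117.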
import re
import unicodedata

import pandas as pd

# FILE (path of the export) and dfsender (indexed by sender) are assumed to be
# the globals defined in the first part of the analysis.

def strip_accents(s):
    # decompose accented characters and drop the combining marks (category Mn)
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

def wordlist(df):
    # writes a list of unique words sorted by occurrence and use by participant
    # upper2low, remove links, images, strip non-alpha from df
    df['stripped'] = df['message']
    nonalpha = re.compile('[^a-z ]')  # compiled once, outside the loop
    for i in df.index:
        string = str(df.at[i, 'stripped']).lower()  # convert to string, lowercase
        string = string.replace("media_omitted", "")  # stripping images w/o stripping text
        string = strip_accents(string)
        string = re.sub(r'https?://\S+', '', string)  # stripping links w/o stripping the text after them
        string = re.sub(r"[.,\-/']", ' ', string)  # . , - / ' used as separators
        string = nonalpha.sub('', string)  # remove non-alpha characters
        df.at[i, 'stripped'] = string
    # count total words
    dfd = df['stripped'].str.split(expand=True).stack().value_counts().to_frame('total')
    # count words by sender
    if len(dfsender) <= 10:  # skip if too many participants in group
        for sender in dfsender.index:
            dfd[sender] = df.loc[df['sender'] == sender, 'stripped'].str.split(expand=True).stack().value_counts()
    # writing to file
    dfd.to_csv(FILE + "_words", sep='\t', encoding='utf-8')
Looking at the results
Looking for profanities is an entertaining first step:
grep -i 'fuck' grouptxt_words |awk '{print $1 " " $2}'
fuck 19
fucking 2
fucktopus 1
fucked 1
motherfucker 1
Nothing too bad here, we can keep on digging.
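The same kind of lookup can be done without leaving Python. This is a minimal sketch that assumes a word table shaped like dfd (words as index, a total column); the three rows are a small excerpt of totals from this analysis and the search string is arbitrary.

```python
import pandas as pd

# Small stand-in for the word table; the real one would come from the exported file.
dfd = pd.DataFrame({"total": [2065, 1206, 580]}, index=["the", "i", "that"])

# Keep the rows whose word contains the substring (case-insensitive, plain text match).
hits = dfd[dfd.index.str.contains("th", case=False, regex=False)]

print(hits["total"].to_dict())  # {'the': 2065, 'that': 580}
```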
Unique words
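The first thing to notice is that the list contains 6’275 unique words. Here are the ten most frequent, with the total count followed by the counts for individual participants: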
the | 2065 | 117.0 | 29.0 | 231.0 | 23.0 | 120.0 | 181.0 | 810.0 | 23.0 |
i | 1206 | 160.0 | 30.0 | 138.0 | 23.0 | 68.0 | 68.0 | 282.0 | 23.0 |
to | 1042 | 70.0 | 23.0 | 148.0 | 17.0 | 68.0 | 94.0 | 348.0 | 12.0 |
a | 919 | 27.0 | 13.0 | 137.0 | 12.0 | 76.0 | 73.0 | 327.0 | 7.0 |
is | 892 | 91.0 | 13.0 | 140.0 | 19.0 | 47.0 | 95.0 | 314.0 | 17.0 |
you | 848 | 10.0 | 11.0 | 92.0 | 10.0 | 62.0 | 99.0 | 320.0 | 15.0 |
it | 840 | 104.0 | 21.0 | 89.0 | 8.0 | 47.0 | 44.0 | 208.0 | 17.0 |
that | 580 | 44.0 | 6.0 | 55.0 | 6.0 | 58.0 | 49.0 | 189.0 | 4.0 |
in | 571 | 43.0 | 14.0 | 81.0 | 9.0 | 40.0 | 67.0 | 171.0 | 8.0 |
and | 545 | 43.0 | 14.0 | 77.0 | 13.0 | 29.0 | 33.0 | 189.0 | 7.0 |
As expected, the 10 most commonly used words are all function words: articles, pronouns, prepositions and conjunctions, plus the verb “is”.
The other end of the list contains 3’735 unique words that are used only once across the entire log.
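Both figures can be read straight from the word table. The sketch below uses a made-up four-row table with the same shape as dfd (words as index, a total column).

```python
import pandas as pd

# Hypothetical miniature of the word table.
dfd = pd.DataFrame({"total": [5, 3, 1, 1]}, index=["the", "cat", "zeptember", "vziiiii"])

unique_words = len(dfd)                      # number of distinct words
used_once = int((dfd["total"] == 1).sum())   # words that appear exactly once

print(unique_words, used_once)  # 4 2
```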
Browsing through the list reveals a few recurring categories:
- Uncommon words that rarely come up in this discussion: audiophiles, kidnapping, enamels
- Words in foreign languages: khoya, maikata, kraciv
- Proper nouns: edeka, mayya, hibiki
- Made-up words used for private jokes: shitalian, omelettedufromage, crapacitor
- Onomatopoeia: vziiiii, nooooooooooooooooooooooooooo, phahahahahaha
- Misspellings with swapped, wrong or extra letters: rxcessive, zeptember, escallated