WhatsApp statistics (2)

Date: 2024-10-26

Tags: python soft stats

There are a couple of things we can improve on from the first part of the analysis.

Exporting words

One feature missing from our previous Bash script is counting the occurrences of each unique word written by each participant.

What defines a word?

First, we have to consider what we're looking at: what makes a word, and how words are separated. Another thing to consider is that this script should be versatile and work the same way with English, French and German conversations.

It seems that all of these languages use whitespace to separate words, but there is also punctuation to take into account. A choice has to be made about dashes: in my opinion they act as word separators in English and French, and aren't used much in German. I made the same decision for apostrophes.

Then, we need to strip words down to the bare minimum: convert everything to lowercase, get rid of numbers and symbols, and remove diacritics, which are rarely used in English but heavily used in French and German.

Once we've done that, we can record occurrences of each word across the dataframe containing our messages.
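
To make this concrete, here is a minimal sketch of those rules applied to a sample French sentence. The helper name and the sample text are just for illustration; the actual code used by the script is shown in the next section.

import re
import unicodedata

def normalize(text):
    # lowercase, drop diacritics, treat . , - / ' as separators, keep only a-z and spaces
    text = text.lower()
    text = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')
    text = re.sub(r"[.,\-/']", ' ', text)
    return re.sub(r'[^a-z ]', '', text)

print(normalize("C'est déjà l'été, non ?").split())
# ['c', 'est', 'deja', 'l', 'ete', 'non']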

Translation into code

We still have our big Pandas dataframe containing the content of every message. We can copy the message column into a new one called stripped, where each message gets stripped of anything that isn't a word or part of a word.

Each message is converted to a string, made lowercase, and stripped of its diacritics using the unicodedata library.
The string is then cleaned of the characters we consider separators, which are replaced by whitespace; after that, we remove all remaining non-alphabetic characters and write the string back to the dataframe.

In the end, we split those strings into words and put them into a new dataframe. This way, we can repeat the count for each participant in the group and write everything to a file.

import re
import unicodedata

import pandas as pd

def strip_accents(s):
    # decompose characters (NFD) and drop the combining marks, e.g. 'é' -> 'e'
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def wordlist(df):
    # writes a list of unique words, sorted by occurrence, with usage by participant
    # lowercase, remove links and images, strip non-alphabetic characters from df
    df['stripped'] = df['message']
    for i in df.index:
        string = str(df.at[i, 'stripped']).lower() # lowercase
        string = string.replace("media_omitted", "") # strip images without stripping text
        string = strip_accents(string)
        string = re.sub(r'https?://\S+', '', string) # strip links without stripping text
        string = re.sub(r"[.,\-/']", ' ', string) # . , - / ' used as separators
        string = re.sub(r'[^a-z ]', '', string) # remove non-alphabetic characters
        df.at[i, 'stripped'] = string

    # count total words
    dfd = pd.DataFrame({'total': df['stripped'].str.split(expand=True).stack().value_counts()})

    # count words by sender (skipped if there are too many participants in the group)
    if len(dfsender) <= 10: # dfsender: per-sender dataframe built in the first part
        for sender in dfsender.index:
            dfd[sender] = df.loc[df['sender'] == sender, 'stripped'].str.split(expand=True).stack().value_counts()

    # write to file; FILE is the base name of the exported chat, defined earlier
    dfd.to_csv(FILE + "_words", sep='\t', encoding='utf-8')
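
As a rough usage sketch, assuming df, dfsender and FILE are already defined as in the first part (df holds one message per row with sender and message columns, dfsender is the per-sender summary, and FILE is the base name of the exported chat):

wordlist(df)

# read the export back to check it: one row per unique word, one column per participant
words = pd.read_csv(FILE + "_words", sep='\t', index_col=0)
print(words.sort_values('total', ascending=False).head(10))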

Looking at the results

Profanities

Profanities are the first entertaining step:

grep -i 'fuck' grouptxt_words |awk '{print $1 " " $2}'
fuck 19
fucking 2
fucktopus 1
fucked 1
motherfucker 1

Nothing too bad here; we can keep on digging.

Unique words

The first thing to notice is that the list contains 6’275 unique words:

total AAAA BBBB CCCC DDDD EEEE FFFF GGGG HHHH IIII
the 2065 117.0 29.0 231.0 23.0 120.0 181.0 810.0 23.0
i 1206 160.0 30.0 138.0 23.0 68.0 68.0 282.0 23.0
to 1042 70.0 23.0 148.0 17.0 68.0 94.0 348.0 12.0
a 919 27.0 13.0 137.0 12.0 76.0 73.0 327.0 7.0
is 892 91.0 13.0 140.0 19.0 47.0 95.0 314.0 17.0
you 848 10.0 11.0 92.0 10.0 62.0 99.0 320.0 15.0
it 840 104.0 21.0 89.0 8.0 47.0 44.0 208.0 17.0
that 580 44.0 6.0 55.0 6.0 58.0 49.0 189.0 4.0
in 571 43.0 14.0 81.0 9.0 40.0 67.0 171.0 8.0
and 545 43.0 14.0 77.0 13.0 29.0 33.0 189.0 7.0

As expected, the 10 most commonly used words are articles, pronouns and linking words.

The other end of the list contains 3’735 unique words that are used only once across the entire log.
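
That figure is easy to double-check from the exported file; a quick sketch with pandas, assuming the same grouptxt_words export as above:

import pandas as pd

words = pd.read_csv("grouptxt_words", sep='\t', index_col=0)
hapaxes = words[words['total'] == 1]
print(len(hapaxes))                     # words used exactly once across the whole log
print(hapaxes.sample(5).index.tolist()) # a random handful of them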

Browsing through the list turns up a few findings:

Electronics Électronique puissance semiconducteur semiconductors power Hardware CPE INSA Xavier Bourgeois

Xavier