WhatsApp statistics (2)

Date: 2024-10-26

Tags: python soft stats

There are a couple of things we can improve on from the first part of the analysis.

Exporting words

One feature missing from our previous Bash script is counting the occurrences of each unique word written by each participant.

What defines a word?

First, we have to consider what we're looking at: what makes a word, and how words are separated. Another thing to consider is that this script should be versatile and work the same way with English, French and German conversations.

It seems that all of these languages use whitespace to separate words, but there is also punctuation to take into account. A choice has to be made about dashes: in my opinion they act as word separators in English and French, and aren't used much in German. I made the same decision for apostrophes.

Then, we need to strip words down to the bare minimum: convert everything to lowercase, get rid of numbers and symbols, and remove diacritics, which are rarely used in English but heavily used in French and German.

Once we've done that, we can record occurrences of each word across the dataframe containing our messages.
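
To make this concrete, here is a minimal sketch of those rules applied to a sample French sentence. The helper name and the sample text are just for illustration; the actual code used by the script is shown in the next section.

import re
import unicodedata

def normalize(text):
    # lowercase, drop diacritics, treat . , - / ' as separators, keep only a-z and spaces
    text = text.lower()
    text = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')
    text = re.sub(r"[.,\-/']", ' ', text)
    return re.sub(r'[^a-z ]', '', text)

print(normalize("C'est déjà l'été, non ?").split())
# ['c', 'est', 'deja', 'l', 'ete', 'non']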

Translation into code

We still have our big Pandas dataframe containing the content of every message. We can copy the message column into a new one called stripped, where each message gets stripped of anything that isn't a word or part of a word.

Each message is converted to a string, made lowercase, and stripped of its diacritics using the unicodedata library.
The string is then cleaned of the characters we consider separators, which are replaced by whitespace; after that, we remove all remaining non-alphabetic characters and write the string back to the dataframe.

In the end, we split those strings into words and put them into a new dataframe. This way, we can repeat the count for each participant in the group and write everything to a file.

import re
import unicodedata

import pandas as pd

def strip_accents(s):
    # decompose characters (NFD) and drop the combining marks, e.g. 'é' -> 'e'
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def wordlist(df):
    # writes a list of unique words, sorted by occurrence, with usage by participant
    # lowercase, remove links and images, strip non-alphabetic characters from df
    df['stripped'] = df['message']
    for i in df.index:
        string = str(df.at[i, 'stripped']).lower() # lowercase
        string = string.replace("media_omitted", "") # strip images without stripping text
        string = strip_accents(string)
        string = re.sub(r'https?://\S+', '', string) # strip links without stripping text
        string = re.sub(r"[.,\-/']", ' ', string) # . , - / ' used as separators
        string = re.sub(r'[^a-z ]', '', string) # remove non-alphabetic characters
        df.at[i, 'stripped'] = string

    # count total words
    dfd = pd.DataFrame({'total': df['stripped'].str.split(expand=True).stack().value_counts()})

    # count words by sender (skipped if there are too many participants in the group)
    if len(dfsender) <= 10: # dfsender: per-sender dataframe built in the first part
        for sender in dfsender.index:
            dfd[sender] = df.loc[df['sender'] == sender, 'stripped'].str.split(expand=True).stack().value_counts()

    # write to file; FILE is the base name of the exported chat, defined earlier
    dfd.to_csv(FILE + "_words", sep='\t', encoding='utf-8')
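
As a rough usage sketch, assuming df, dfsender and FILE are already defined as in the first part (df holds one message per row with sender and message columns, dfsender is the per-sender summary, and FILE is the base name of the exported chat):

wordlist(df)

# read the export back to check it: one row per unique word, one column per participant
words = pd.read_csv(FILE + "_words", sep='\t', index_col=0)
print(words.sort_values('total', ascending=False).head(10))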

Looking at the results

Profanities

Profanities are the first entertaining step:

grep -i 'fuck' grouptxt_words |awk '{print $1 " " $2}'
fuck 19
fucking 2
fucktopus 1
fucked 1
motherfucker 1

Nothing too bad here; we can keep on digging.

Unique words

The first thing to notice is that the list contains 6’275 unique words:

total AAAA BBBB CCCC DDDD EEEE FFFF GGGG HHHH IIII
the 2065 117.0 29.0 231.0 23.0 120.0 181.0 810.0 23.0
i 1206 160.0 30.0 138.0 23.0 68.0 68.0 282.0 23.0
to 1042 70.0 23.0 148.0 17.0 68.0 94.0 348.0 12.0
a 919 27.0 13.0 137.0 12.0 76.0 73.0 327.0 7.0
is 892 91.0 13.0 140.0 19.0 47.0 95.0 314.0 17.0
you 848 10.0 11.0 92.0 10.0 62.0 99.0 320.0 15.0
it 840 104.0 21.0 89.0 8.0 47.0 44.0 208.0 17.0
that 580 44.0 6.0 55.0 6.0 58.0 49.0 189.0 4.0
in 571 43.0 14.0 81.0 9.0 40.0 67.0 171.0 8.0
and 545 43.0 14.0 77.0 13.0 29.0 33.0 189.0 7.0

As expected, the 10 most commonly used words are articles, pronouns and linking words.

The other end of the list contains 3’735 unique words that are used only once across the entire log.
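
That figure is easy to double-check from the exported file; a quick sketch with pandas, assuming the same grouptxt_words export as above:

import pandas as pd

words = pd.read_csv("grouptxt_words", sep='\t', index_col=0)
hapaxes = words[words['total'] == 1]
print(len(hapaxes))                     # words used exactly once across the whole log
print(hapaxes.sample(5).index.tolist()) # a random handful of them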

Browsing through the list turns up a few findings:

Electronics Électronique puissance semiconducteur semiconductors power Hardware CPE INSA Xavier Bourgeois

Xavier