Content Analysis of a Facebook Group Using R


Author: Ilan Dan-Gur
Date of publication: December 28th, 2020
Date of statistical data: December 16th-22nd, 2020

I used data mining [1] to analyze the content of posts in a Facebook group in the Bow Valley, Alberta, over a one-week period (in December 2020). The analysis was done using R (a language for statistical computing and graphics).

I will start by displaying the results, followed by a detailed description of how the analysis was done including the data file and the R code used to produce the results.

Results

Type of posts

  1.  34.8% Looking for help (e.g. asking a question, looking for a recommendation)
  2.  15.2% Supporting a cause (e.g. buy local, holiday celebrations, donations for the less fortunate)
  3.  15.2% Providing info
  4.  9.8% Promoting a specific business
  5.  7.1% Covid related
  6.  6.2% Lost and found
  7.  6.2% Complaint about an issue
  8.  4.5% Thank you note
  9.  0.9% Reporting vandalism

List 1: Proportions of posting types

Comparison between groups

A bar chart comparison between two groups of the three most common types of postings.

Bar chart

Graph 1: Comparison of proportions of posting types between two Facebook groups

Most frequent words used

  1. Canmore
  2. Christmas
  3. Know
  4. Please
  5. Valley
  6. Banff
  7. Bow
  8. Thanks
  9. Covid
  10. Looking
  11. Need
  12. Parks
  13. Today
  14. Free
  15. Get
  16. Town
  17. Year
  18. Back
  19. Family
  20. Time

List 2: Most frequent words used

One week in one image

The software automatically created an image of a "word cloud" [2] that included many words used in the posts. The most-frequent words are the largest and appear at the center of the image; less-frequent words are smaller and appear toward the edge. The least-frequent words were not included due to space limitations.

Many words appearing in different sizes. The larger words are at the centre of the image, while smaller words appear at the edge.

Image 1: A word cloud of the most-frequent words used

How the analysis was done

1. Data collection and preprocessing

I collected posts from a Facebook group in the Bow Valley, Alberta, over a one-week period (in December 2020). The posts were saved anonymously: only the textual content was kept, without names.
The data was organized in a CSV file containing two features: "type" and "text" (the data file can be downloaded here [3]). The Facebook posts were included in the "text" feature (see data file); a hypothetical example of the file layout is shown after the list of types below.

Each post was identified manually as one of the following types:

  1. Business (promoting a specific business)
  2. Cause (supporting a cause, e.g. community)
  3. Complaint (complaining about an issue)
  4. Covid (covid related)
  5. Help (looking for help, asking a question, looking for a recommendation)
  6. Info (providing info)
  7. Lost (lost and found)
  8. Thanks (thank you note)
  9. Vandalism (reporting vandalism)
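
As an illustration of the file layout only, here is a hypothetical CSV fragment. These two rows are invented for this write-up and are not taken from the actual data file:

type,text
"Help","Does anyone know a good plumber in Canmore?"
"Info","The library will be closed today for maintenance."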

2. Analysis: Proportions of posting types

(The code is in the R programming language.)

I began by importing the CSV data from the file (facebook.csv) and saving it to a data frame:

fb_raw <- read.csv("facebook.csv", stringsAsFactors = FALSE)

The "type" feature is currently a character vector. Since it's a categorical variable, I converted it to a factor:

fb_raw$type <- factor(fb_raw$type)

I then verified there were no identical (i.e. duplicate) posts in the dataset:

anyDuplicated(fb_raw$text)

The anyDuplicated() function returns 0 if no duplicates exist; otherwise it returns the index of the first duplicated post. To remove a duplicate post (say, the 4th row) you can use:

fb_raw <- fb_raw[-4,]
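
As an alternative (a general one-liner, not part of the original workflow), every duplicated post could be dropped in a single step using base R's duplicated() function:

# Keep only the first occurrence of each post text
fb_raw <- fb_raw[!duplicated(fb_raw$text), ]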

Next, I calculated the proportions of posting types, and sorted the results in decreasing order:

sort(round(100*prop.table(table(fb_raw$type)), digits = 1),decreasing = TRUE)

This produced the following percentage output (also displayed in list 1 above):

 
Help 34.8
Cause 15.2
Info 15.2
Business 9.8
Covid 7.1
Lost 6.2
Complaint 6.2
Thanks 4.5
Vandalism 0.9
 

3. Visualization: Bar chart

I used the "ggplot2" package/library to create a bar chart comparison of the three most common types of postings between that facebook group (group A) and a fictitious facebook group (group B):

Group A

Group B

Here is the R code for all that:


install.packages("ggplot2")
library(ggplot2)

Group_A <- tribble(
  ~Type, ~Percent, ~Source,
  "Help", 34.8, "Group_A",
  "Cause", 15.2, "Group_A",
  "Info", 15.2, "Group_A"  
)

Group_B <- tribble(
  ~Type, ~Percent, ~Source,
  "Help", 28.6, "Group_B",
  "Info", 26.4, "Group_B",
  "Cause", 10.3, "Group_B"
)

typeAll <- rbind(Group_A, Group_B)

p <- ggplot(data = typeAll) +
  geom_bar(mapping = aes(x = Type, y = Percent, fill = Source), stat = "identity", position = "dodge")
  
p + ggtitle("Type of Postings") + theme(plot.title = element_text(hjust = 0.5, colour = "black", face="bold"), axis.title.x=element_blank(), axis.title.y = element_text(colour = "black", face="bold"), axis.text.x = element_text(colour = "black", face="bold"), axis.text.y = element_text(colour = "black", face="bold")) + ylab("Percentage(%)") + scale_fill_hue(l = 30)

The bar chart is displayed above as graph 1.
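
To export the chart to an image file (not shown in the original write-up; the file name here is only an example), ggplot2's ggsave() could be called right after the plot is drawn:

# ggsave() writes the most recently displayed ggplot to disk
ggsave("type_of_postings.png", width = 7, height = 5)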

4. Cleaning the text

I installed the "tm" (Text Mining) package/library, and used it to:

Here is the R code for all that:


install.packages("tm")
library(tm)
fb_corpus <- VCorpus(VectorSource(fb_raw$text))
fb_corpus_clean <- tm_map(fb_corpus,content_transformer(tolower))
fb_corpus_clean <- tm_map(fb_corpus_clean, removeNumbers)
exFillerWords <- c("will","anyone","can","like","just","new","also","everyone")
fb_corpus_clean <- tm_map(fb_corpus_clean,removeWords,c(stopwords("english"),exFillerWords))
replacePunctuation <- function(x){gsub("[[:punct:]]+"," ",x)}
fb_corpus_clean <- tm_map(fb_corpus_clean,content_transformer(replacePunctuation))
fb_corpus_clean <- tm_map(fb_corpus_clean, stripWhitespace)
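
To spot-check the cleaning, a post can be compared before and after the transformations (a quick sketch; index 1 is chosen arbitrarily):

# Original text of the first post
fb_raw$text[1]
# The same post after cleaning
as.character(fb_corpus_clean[[1]])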

5. Analysis: Most frequent words used

I used the tm package (see step 4 above) to split the Facebook postings into individual words. I then summed up how many times each unique word was used across all postings. Finally, I found the most frequent words and sorted them in decreasing order.

The 20 most frequent words are displayed in list 2 above.

Here is the R code for all that:


fb_dtm <- DocumentTermMatrix(fb_corpus_clean)
freq <- colSums(as.matrix(fb_dtm))
sort(freq[findFreqTerms(fb_dtm,5)],decreasing = TRUE)
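
An equivalent way (not used above) to list exactly the 20 most frequent terms, regardless of any frequency threshold, is to sort the full frequency vector and take its head:

# Top 20 terms by total count across all postings
head(sort(freq, decreasing = TRUE), 20)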

6. Visualization: Word cloud

I installed the "wordcloud" package/library, and used it to create an image of a "word cloud" that included many words used in the posts.

The image is displayed above as image 1.

Here is the R code:


install.packages("wordcloud")
library(wordcloud)
wordcloud(fb_corpus_clean, min.freq = 5, random.order = FALSE)
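
As a possible variation (not the settings used for image 1), the cloud can be capped at a maximum number of words and coloured with an RColorBrewer palette:

install.packages("RColorBrewer")
library(RColorBrewer)
# Limit the cloud to 100 words and colour it with the "Dark2" palette
wordcloud(fb_corpus_clean, min.freq = 5, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))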