Content Analysis of a Facebook Group Using R


Author: Ilan Dan-Gur
Date of publication: December 28th, 2020
Date of statistical data: December 16th-22nd, 2020

I used data mining [1] to analyze the content of posts in a Facebook group in the Bow Valley, Alberta, over a one-week period (in December 2020). The analysis was done using R (a language for statistical computing and graphics).

I will start by displaying the results, followed by a detailed description of how the analysis was done including the data file and the R code used to produce the results.

Results

Type of posts

  1.  34.8% Looking for help (e.g. asking a question, looking for a recommendation)
  2.  15.2% Supporting a cause (e.g. buy local, holiday celebrations, donations for the less fortunate)
  3.  15.2% Providing info
  4.  9.8% Promoting a specific business
  5.  7.1% Covid related
  6.  6.2% Lost and found
  7.  6.2% Complaint about an issue
  8.  4.5% Thank you note
  9.  0.9% Reporting vandalism

List 1: Proportions of posting types

Comparison between groups

A bar chart comparison between two groups of the three most common types of postings.

Bar chart

Graph 1: Comparison of proportions of posting types between two Facebook groups

Most frequent words used

  1. Canmore
  2. Christmas
  3. Know
  4. Please
  5. Valley
  6. Banff
  7. Bow
  8. Thanks
  9. Covid
  10. Looking
  11. Need
  12. Parks
  13. Today
  14. Free
  15. Get
  16. Town
  17. Year
  18. Back
  19. Family
  20. Time

List 2: Most frequent words used

One week in one image

The software automatically created an image of a "word cloud" [2] that included many words used in the posts. The most-frequent words are the largest and appear at the center of the image; less-frequent words are smaller and appear toward the edge. The least-frequent words were not included due to space limitations.

Many words appearing in different sizes. The larger words are at the centre of the image, while smaller words appear at the edge.

Image 1: A word cloud of the most-frequent words used

How the analysis was done

1. Data collection and preprocessing

I collected posts from a Facebook group in the Bow Valley, Alberta, over a one-week period (in December 2020). The posts were saved anonymously: only the textual content was kept, without names.
The data was organized in a CSV file containing two features: "type" and "text" (the data file can be downloaded here [3]). The Facebook posts were included in the "text" feature (see data file); a hypothetical example of the file layout is shown after the list of types below.

Each post was identified manually as one of the following types:

  1. Business (promoting a specific business)
  2. Cause (supporting a cause, e.g. community)
  3. Complaint (complaining about an issue)
  4. Covid (covid related)
  5. Help (looking for help, asking a question, looking for a recommendation)
  6. Info (providing info)
  7. Lost (lost and found)
  8. Thanks (thank you note)
  9. Vandalism (reporting vandalism)
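
As an illustration of the file layout only, here is a hypothetical CSV fragment. These two rows are invented for this write-up and are not taken from the actual data file:

type,text
"Help","Does anyone know a good plumber in Canmore?"
"Info","The library will be closed today for maintenance."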

2. Analysis: Proportions of posting types

(The code is in the R programming language.)

I began by importing the CSV data from the file (facebook.csv) and saving it to a data frame:

fb_raw <- read.csv("facebook.csv", stringsAsFactors = FALSE)

The "type" feature is currently a character vector. Since it's a categorical variable, I converted it to a factor:

fb_raw$type <- factor(fb_raw$type)

I then verified there were no identical (i.e. duplicate) posts in the dataset:

anyDuplicated(fb_raw$text)

The anyDuplicated() function returns 0 if no duplicates exist; otherwise it returns the index of the first duplicated post. To remove a duplicate post (say, the 4th row) you can use:

fb_raw <- fb_raw[-4,]
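
As an alternative (a general one-liner, not part of the original workflow), every duplicated post could be dropped in a single step using base R's duplicated() function:

# Keep only the first occurrence of each post text
fb_raw <- fb_raw[!duplicated(fb_raw$text), ]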

Next, I calculated the proportions of posting types, and sorted the results in decreasing order:

sort(round(100*prop.table(table(fb_raw$type)), digits = 1),decreasing = TRUE)

This produced the following percentage output (also displayed in list 1 above):

 
Help 34.8
Cause 15.2
Info 15.2
Business 9.8
Covid 7.1
Lost 6.2
Complaint 6.2
Thanks 4.5
Vandalism 0.9
 

3. Visualization: Bar chart

I used the "ggplot2" package/library to create a bar chart comparison of the three most common types of postings between that facebook group (group A) and a fictitious facebook group (group B):

Group A

Group B

Here is the R code for all that:


install.packages("ggplot2")
library(ggplot2)

Group_A <- tribble(
  ~Type, ~Percent, ~Source,
  "Help", 34.8, "Group_A",
  "Cause", 15.2, "Group_A",
  "Info", 15.2, "Group_A"  
)

Group_B <- tribble(
  ~Type, ~Percent, ~Source,
  "Help", 28.6, "Group_B",
  "Info", 26.4, "Group_B",
  "Cause", 10.3, "Group_B"
)

typeAll <- rbind(Group_A, Group_B)

p <- ggplot(data = typeAll) +
  geom_bar(mapping = aes(x = Type, y = Percent, fill = Source), stat = "identity", position = "dodge")
  
p + ggtitle("Type of Postings") + theme(plot.title = element_text(hjust = 0.5, colour = "black", face="bold"), axis.title.x=element_blank(), axis.title.y = element_text(colour = "black", face="bold"), axis.text.x = element_text(colour = "black", face="bold"), axis.text.y = element_text(colour = "black", face="bold")) + ylab("Percentage(%)") + scale_fill_hue(l = 30)

The bar chart is displayed above as graph 1.
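
To export the chart to an image file (not shown in the original write-up; the file name here is only an example), ggplot2's ggsave() could be called right after the plot is drawn:

# ggsave() writes the most recently displayed ggplot to disk
ggsave("type_of_postings.png", width = 7, height = 5)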

4. Cleaning the text

I installed the "tm" (Text Mining) package/library, and used it to:

Here is the R code for all that:


install.packages("tm")
library(tm)
fb_corpus <- VCorpus(VectorSource(fb_raw$text))
fb_corpus_clean <- tm_map(fb_corpus,content_transformer(tolower))
fb_corpus_clean <- tm_map(fb_corpus_clean, removeNumbers)
exFillerWords <- c("will","anyone","can","like","just","new","also","everyone")
fb_corpus_clean <- tm_map(fb_corpus_clean,removeWords,c(stopwords("english"),exFillerWords))
replacePunctuation <- function(x){gsub("[[:punct:]]+"," ",x)}
fb_corpus_clean <- tm_map(fb_corpus_clean,content_transformer(replacePunctuation))
fb_corpus_clean <- tm_map(fb_corpus_clean, stripWhitespace)
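
To spot-check the cleaning, a post can be compared before and after the transformations (a quick sketch; index 1 is chosen arbitrarily):

# Original text of the first post
fb_raw$text[1]
# The same post after cleaning
as.character(fb_corpus_clean[[1]])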

5. Analysis: Most frequent words used

I used the tm package (see step 4 above) to split the Facebook postings into individual words. I then summed up how many times each unique word was used across all postings. Finally, I found the most frequent words and sorted them in decreasing order.

The 20 most frequent words are displayed in list 2 above.

Here is the R code for all that:


fb_dtm <- DocumentTermMatrix(fb_corpus_clean)
freq <- colSums(as.matrix(fb_dtm))
sort(freq[findFreqTerms(fb_dtm,5)],decreasing = TRUE)
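
An equivalent way (not used above) to list exactly the 20 most frequent terms, regardless of any frequency threshold, is to sort the full frequency vector and take its head:

# Top 20 terms by total count across all postings
head(sort(freq, decreasing = TRUE), 20)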

6. Visualization: Word cloud

I installed the "wordcloud" package/library, and used it to create an image of a "word cloud" that included many words used in the posts.

The image is displayed above as image 1.

Here is the R code:


install.packages("wordcloud")
library(wordcloud)
wordcloud(fb_corpus_clean, min.freq = 5, random.order = FALSE)
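
As a possible variation (not the settings used for image 1), the cloud can be capped at a maximum number of words and coloured with an RColorBrewer palette:

install.packages("RColorBrewer")
library(RColorBrewer)
# Limit the cloud to 100 words and colour it with the "Dark2" palette
wordcloud(fb_corpus_clean, min.freq = 5, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))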