Take Home Exercise-6

Reveal the patterns of community interactions of the city of Engagement, Ohio USA

Shachi Anirudha Raodeo , true
2022-06-05

The Task

Consider the social activities in the community. What patterns do you see in the social networks in the town? Describe up to ten significant patterns you observe, with evidence and rationale and model, analyse and visualise network data using R.

Links to the dataset:

SocialNetwork.csv Participants.csv

Step-by-step Data Visualisation

Installing and launching R packages

Packages, namely igraph, tidygraph, visNetwork, graphlayouts are required for this exercise. This code chunk installs the required packages and loads them onto RStudio environment.

packages = c('tidyverse','readr','knitr','sf','tmap','sf','clock','sftime','rmarkdown','distill','lubridate','igraph', 'tidygraph','ggraph', 'visNetwork', 'graphlayouts','dplyr')

for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Data Preparation

Data Source

The datasets used in this exercise, published by the IEEE for [VAST challenge 2022] (https://vast-challenge.github.io/2022/)

The SocialNetwork.csvcontains networking data of a participant (the participants who met in a day) Contains information about participants’ evolving social relationships.

Participants.csv Contains information about the residents of Engagement, OH that have agreed to participate in this study.

Importing the dataset

The code chunk below imports SocialNetwork.csv and Participants.csv from the data folder into R by using read_csv() function of readr and saves it as Tibble data frame called social and participants

We will examine the structure of the data frame using glimpse() of dplyr.

social <- read_csv("data/SocialNetwork.csv")
glimpse(social)
participants <- read_csv("data/Participants.csv")
glimpse(participants)

We see above that participantId is treated as a numerical value in both the dataframes above. It needs to be converted into a character.

Data Wrangling

Extract information from timestamp

We use wday(), day(), month() functions to extract the day of the week, date, moth and year from timestamp to perform time series visualizations.

wday() is functions of lubridate package. wday() returns the day of the week as a decimal number or an ordered factor if label is TRUE. The argument abbr is FALSE keep the daya spells in full, i.e. Monday. The function will create a new column in the data.frame i.e. Weekday and the output of wday() will save in this newly created field. the values in the Weekday field are in ordinal scale.

floor_date() is use to extract date from timestamp.

Increment each participantId from participants and participantIdTo, participantIdFrom from social by 1 as i graph cannot understand 0 as number

data_social= social%>%
  mutate(weekday = wday(timestamp,label = TRUE, abbr = FALSE),
         day = day(timestamp),
         month=as.character(timestamp,"%b %y"),
         date = floor_date(timestamp),
         participantIdFrom= participantIdFrom + 1,
         participantIdTo= participantIdTo + 1)
data_participants= participants%>%
  mutate(participantId=participantId + 1)

Change datatypes

Convert participantId from participants and participantIdFrom, participantIdTo from social into character.

data_social= data_social%>%
  mutate(participantIdFrom=as.character(participantIdFrom),
         participantIdTo=as.character(participantIdTo))
glimpse(data_social)
data_participants= data_participants%>%
  mutate(participantId=as.character(participantId))
glimpse(data_participants)

Split age into categories

Since the number of age groups are too many to derive a generalized conclusion we group the ages into five categories - less than or equal to 20, in 20’s, 30’s, 40’s and the ones greater than 50 . age groups are split using the cut() function in R.

data_participants$agegroup <- cut(data_participants$age, breaks = c(17,25,35,45,55,60), 
                             labels = c("18-25","26-35","36-45","45- 55","55-60"))
data_participants= data_participants%>%
  mutate(agegroup=as.character(agegroup))

Aggregate records with weight > 64

data_social dataframe reveals that it consists of individual networking/ meeting records. This is not very useful for visualization.

Hence, we will aggregate the participantId by date, participantIdFrom, participantIdTo and weekday.

Functions from dplyr package are used. They are: group(), summarise(), and ungroup().

To be able to get strong connections between participants we filter the data based on weight greater then 64 Social data consists of dates between 01-03-2022 to 25-05-2023

We filter out only those pair of participant ID’s who have met at least once a week between the two dates(the participants who have met at least 64 times during the timeline of data available) as The number of weeks between these two dates is 64.

social_aggregated <- data_social %>%
  group_by(participantIdFrom, participantIdTo, weekday) %>%
    summarise(Weight = n()) %>%
  filter(participantIdFrom!=participantIdTo) %>%
  filter(Weight > 64) %>%
  ungroup()

Keep only aggregated nodes with weight > 64

save the participantIdFrom in a new dataframe nodes_list and remove the duplicates to get the dataframe of nodes with weight > 64

nodes_list <- social_aggregated[,c("participantIdFrom")]
nodes_list <- distinct(nodes_list)
participants_nodes <-merge(x=nodes_list, y=data_participants, by.x = 'participantIdFrom', by.y =  'participantId')

Save files as RDS

save file into rds using saveRDS() function.

saveRDS (social_aggregated, 'data/social_aggregated_64.rds')
saveRDS (participants_nodes, 'data/participants_nodes_64.rds')
social_aggregated <- readRDS ( 'data/social_aggregated_64.rds')
glimpse (social_aggregated)
Rows: 864
Columns: 4
$ participantIdFrom <chr> "100", "1002", "1002", "101", "101", "102"…
$ participantIdTo   <chr> "36", "69", "69", "282", "282", "489", "48…
$ weekday           <ord> Thursday, Wednesday, Thursday, Wednesday, …
$ Weight            <int> 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65…
participants_nodes <- readRDS ( 'data/participants_nodes_64.rds')
glimpse (participants_nodes)
Rows: 227
Columns: 8
$ participantIdFrom <chr> "100", "1002", "101", "102", "110", "114",…
$ householdSize     <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, …
$ haveKids          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ age               <dbl> 31, 49, 29, 38, 52, 51, 32, 45, 26, 34, 27…
$ educationLevel    <chr> "Bachelors", "Graduate", "Low", "HighSchoo…
$ interestGroup     <chr> "H", "C", "F", "A", "C", "G", "B", "F", "A…
$ joviality         <dbl> 0.39686665, 0.04343346, 0.14268616, 0.0205…
$ agegroup          <chr> "26-35", "45- 55", "26-35", "36-45", "45- …

Aggregate records to filter records with weight> 32

We filter out only those pair of participant ID’s who have met at least once in two weeks between the two dates (the participants who have met at least 32 times during the timeline of data available)

social_aggregated_32 <- data_social %>%
  group_by(participantIdFrom, participantIdTo, weekday) %>%
    summarise(Weight = n()) %>%
  filter(participantIdFrom!=participantIdTo) %>%
  filter(Weight > 32) %>%
  ungroup()

Keep only aggregated nodes with weight > 32

save the participantIdFrom in a new dataframe nodes_list and remove the duplicates to get the dataframe of nodes with weight > 32

nodes_list_32 <- social_aggregated_32[,c("participantIdFrom")]
nodes_list_32 <- distinct(nodes_list_32)
participants_nodes_32 <-merge(x=nodes_list_32, y=data_participants, by.x = 'participantIdFrom', by.y =  'participantId')

Save files as RDS

save file into rds using saveRDS() function.

saveRDS (social_aggregated_32, 'data/social_aggregated_32.rds')
saveRDS (participants_nodes_32, 'data/participants_nodes_32.rds')
social_aggregated_32 <- readRDS ( 'data/social_aggregated_32.rds')
participants_nodes_32 <- readRDS ( 'data/participants_nodes_32.rds')
glimpse (social_aggregated)
Rows: 864
Columns: 4
$ participantIdFrom <chr> "100", "1002", "1002", "101", "101", "102"…
$ participantIdTo   <chr> "36", "69", "69", "282", "282", "489", "48…
$ weekday           <ord> Thursday, Wednesday, Thursday, Wednesday, …
$ Weight            <int> 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65…

Data Visualization

Creating the graph data.frame weight>64

tbl_graph() of tinygraph package to build an tidygraph’s network graph data.frame

social_graph <- tbl_graph(nodes = participants_nodes,
                           edges = social_aggregated, 
                           directed = TRUE)
social_graph
# A tbl_graph: 227 nodes and 864 edges
#
# A directed multigraph with 86 components
#
# Node Data: 227 × 8 (active)
  participantIdFr… householdSize haveKids   age educationLevel
  <chr>                    <dbl> <lgl>    <dbl> <chr>         
1 100                          2 FALSE       31 Bachelors     
2 1002                         1 FALSE       49 Graduate      
3 101                          2 FALSE       29 Low           
4 102                          2 FALSE       38 HighSchoolOrC…
5 110                          2 FALSE       52 HighSchoolOrC…
6 114                          2 FALSE       51 Graduate      
# … with 221 more rows, and 3 more variables: interestGroup <chr>,
#   joviality <dbl>, agegroup <chr>
#
# Edge Data: 864 × 4
   from    to weekday   Weight
  <int> <int> <ord>      <int>
1     1    90 Thursday      65
2     2   151 Wednesday     65
3     2   151 Thursday      65
# … with 861 more rows

The output above reveals that social_graph is a tbl_graph object with 227 nodes and 864 edges. The command also prints the first six rows of “Node Data” and the first three of “Edge Data”. It states that the Node Data is active. The notion of an active tibble within a tbl_graph object makes it possible to manipulate the data in one tibble at a time.

Arranging the graph data.frame

Rearrange the rows in the edges tibble to list those with the highest “weight” using activate()

social_graph %>%
  activate(edges) %>%
  arrange(desc(Weight))
# A tbl_graph: 227 nodes and 864 edges
#
# A directed multigraph with 86 components
#
# Edge Data: 864 × 4 (active)
   from    to weekday   Weight
  <int> <int> <ord>      <int>
1     1    90 Thursday      65
2     2   151 Wednesday     65
3     2   151 Thursday      65
4     3    77 Wednesday     65
5     3    77 Thursday      65
6     4   120 Wednesday     65
# … with 858 more rows
#
# Node Data: 227 × 8
  participantIdFr… householdSize haveKids   age educationLevel
  <chr>                    <dbl> <lgl>    <dbl> <chr>         
1 100                          2 FALSE       31 Bachelors     
2 1002                         1 FALSE       49 Graduate      
3 101                          2 FALSE       29 Low           
# … with 224 more rows, and 3 more variables: interestGroup <chr>,
#   joviality <dbl>, agegroup <chr>

Creating the graph data.frame weight > 32 and arranging data

social_graph_32 <- tbl_graph(nodes = participants_nodes_32,
                           edges = social_aggregated_32, 
                           directed = TRUE)
social_graph_32
# A tbl_graph: 737 nodes and 23904 edges
#
# A directed multigraph with 27 components
#
# Node Data: 737 × 8 (active)
  participantIdFr… householdSize haveKids   age educationLevel
  <chr>                    <dbl> <lgl>    <dbl> <chr>         
1 10                           3 TRUE        35 Bachelors     
2 100                          2 FALSE       31 Bachelors     
3 1000                         1 FALSE       48 Bachelors     
4 1001                         1 FALSE       56 Graduate      
5 1002                         1 FALSE       49 Graduate      
6 1003                         1 FALSE       51 HighSchoolOrC…
# … with 731 more rows, and 3 more variables: interestGroup <chr>,
#   joviality <dbl>, agegroup <chr>
#
# Edge Data: 23,904 × 4
   from    to weekday Weight
  <int> <int> <ord>    <int>
1     1   651 Sunday      42
2     1   651 Monday      39
3     1   651 Tuesday     40
# … with 23,901 more rows
social_graph_32 %>%
  activate(edges) %>%
  arrange(desc(Weight))
# A tbl_graph: 737 nodes and 23904 edges
#
# A directed multigraph with 27 components
#
# Edge Data: 23,904 × 4 (active)
   from    to weekday   Weight
  <int> <int> <ord>      <int>
1     2   252 Thursday      65
2     5   508 Wednesday     65
3     5   508 Thursday      65
4    13   185 Wednesday     65
5    13   185 Thursday      65
6    15   365 Wednesday     65
# … with 23,898 more rows
#
# Node Data: 737 × 8
  participantIdFr… householdSize haveKids   age educationLevel
  <chr>                    <dbl> <lgl>    <dbl> <chr>         
1 10                           3 TRUE        35 Bachelors     
2 100                          2 FALSE       31 Bachelors     
3 1000                         1 FALSE       48 Bachelors     
# … with 734 more rows, and 3 more variables: interestGroup <chr>,
#   joviality <dbl>, agegroup <chr>

Age groups in Social Networks weight > 64

geom_edge_link draws edges in the simplest way - as straight lines between the start and end nodes.Argument width is used to map the width of the line in proportional to the Weight attribute and argument alpha is used to introduce opacity on the line.

g <- ggraph(social_graph) +
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 0.4)) +
  geom_node_point(aes(colour = agegroup), 
                  size = 2)
g + theme_graph()

Observations:

Age groups in Social Networks weight > 32

g <- ggraph(social_graph_32) +
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 0.2)) +
  geom_node_point(aes(colour = agegroup), 
                  size = 1)
g + theme_graph()

If we reduce the weight from 64 to 32 we get a large group of network/ participants who meet but we cannot tell if all of them meet up frequently.

Social Networks over different weekdays weight > 64

Use the facet_edges() function to plot the social network plots over different days of the week

set_graph_style() 
g <- ggraph(social_graph, layout = "fr") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 1)) +
  geom_node_point(aes(colour = agegroup), 
                  size = 2)
g + facet_edges(~weekday) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

Observations:

Social Networks over different weekdays weight > 32

set_graph_style() 
g <- ggraph(social_graph_32, layout = "fr") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 0.2)) +
  geom_node_point(aes(colour = agegroup), 
                  size = 1)
g + facet_edges(~weekday) +
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

when we make the weight less we observe that the participants network on all days of the week

Social Networks for different education levels weight > 64

Use the facet_nodes() function to plot the social network plots for participants with different education level.

set_graph_style()
g <- ggraph(social_graph) + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 0.8)) +
  geom_node_point(aes(colour = agegroup), 
                  size = 2)
g + facet_nodes(~educationLevel)+
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

Observations:

Social Networks for different education levels weight > 64 - shade by having Kids

Use the facet_nodes() function to plot the social network plots for participants with different education level.

set_graph_style()
g <- ggraph(social_graph) + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 0.8)) +
  geom_node_point(aes(colour = haveKids), 
                  size = 2)
g + facet_nodes(~educationLevel)+
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

Observations:

Social Networks for differenthousehold size weight > 64 - shade by age

Use the facet_nodes() function to plot the social network plots for participants with different education level.

set_graph_style()
g <- ggraph(social_graph) + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 0.8)) +
  geom_node_point(aes(colour = agegroup), 
                  size = 2)
g + facet_nodes(~householdSize)+
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  theme(legend.position = 'bottom')

Observation:

Betweenness Centrality

g <- social_graph %>%
  filter(householdSize == 1) %>%
  mutate(betweenness_centrality = centrality_betweenness()) %>%
  ggraph(layout = "fr") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 1)) +
  geom_node_point(aes(colour = agegroup,
            size=betweenness_centrality))
g + theme_graph()

Btweeness centrality without layout

g <- social_graph %>%
  filter(householdSize == 1) %>%
  mutate(betweenness_centrality = centrality_betweenness()) %>%
  ggraph() + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 1)) +
  geom_node_point(aes(colour = agegroup,
            size=betweenness_centrality))
g + theme_graph()

Observations:

Edge Btweeness centrality with education Level

g <- social_graph %>%
  mutate(community = as.factor(group_edge_betweenness(weights = Weight, directed = TRUE))) %>%
  ggraph() + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 1)) +
  geom_node_point(aes(colour = educationLevel))  
g + theme_graph()

Edge Btweeness centrality with Interest Groups

g <- social_graph %>%
  mutate(community = as.factor(group_edge_betweenness(weights = Weight, directed = TRUE))) %>%
  ggraph() + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.2) +
  scale_edge_width(range = c(0.1, 1)) +
  geom_node_point(aes(colour = interestGroup))  
g + theme_graph()

Conclusion