Reveal the patterns of community interactions of the city of Engagement, Ohio USA
Consider the social activities in the community. What patterns do you see in the social networks in the town? Describe up to ten significant patterns you observe, with evidence and rationale and model, analyse and visualise network data using R.
Links to the dataset:
SocialNetwork.csv
Participants.csv
Packages, namely igraph, tidygraph,
visNetwork, graphlayouts are required for this
exercise. This code chunk installs the required packages and loads them
onto RStudio environment.
packages = c('tidyverse','readr','knitr','sf','tmap','sf','clock','sftime','rmarkdown','distill','lubridate','igraph', 'tidygraph','ggraph', 'visNetwork', 'graphlayouts','dplyr')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The datasets used in this exercise, published by the IEEE for [VAST challenge 2022] (https://vast-challenge.github.io/2022/)
The SocialNetwork.csvcontains networking data of a
participant (the participants who met in a day) Contains information
about participants’ evolving social relationships.
Participants.csv Contains information about the
residents of Engagement, OH that have agreed to participate in this
study.
The code chunk below imports SocialNetwork.csv and
Participants.csv from the data folder into R by using read_csv()
function of readr and saves it as Tibble data frame
called social and participants
We will examine the structure of the data frame using glimpse() of dplyr.
social <- read_csv("data/SocialNetwork.csv")
glimpse(social)
participants <- read_csv("data/Participants.csv")
glimpse(participants)
We see above that participantId is treated as a numerical value in both the dataframes above. It needs to be converted into a character.
We use wday(), day(), month() functions to extract the day of the week, date, moth and year from timestamp to perform time series visualizations.
wday() is functions of lubridate package. wday() returns the day of the week as a decimal number or an ordered factor if label is TRUE. The argument abbr is FALSE keep the daya spells in full, i.e. Monday. The function will create a new column in the data.frame i.e. Weekday and the output of wday() will save in this newly created field. the values in the Weekday field are in ordinal scale.
floor_date() is use to extract date from timestamp.
Increment each participantId from participants and
participantIdTo, participantIdFrom from social by 1 as i graph cannot
understand 0 as number
data_social= social%>%
mutate(weekday = wday(timestamp,label = TRUE, abbr = FALSE),
day = day(timestamp),
month=as.character(timestamp,"%b %y"),
date = floor_date(timestamp),
participantIdFrom= participantIdFrom + 1,
participantIdTo= participantIdTo + 1)
data_participants= participants%>%
mutate(participantId=participantId + 1)
Convert participantId from participants and
participantIdFrom, participantIdTo from
social into character.
data_social= data_social%>%
mutate(participantIdFrom=as.character(participantIdFrom),
participantIdTo=as.character(participantIdTo))
glimpse(data_social)
data_participants= data_participants%>%
mutate(participantId=as.character(participantId))
glimpse(data_participants)
Since the number of age groups are too many to derive a generalized conclusion we group the ages into five categories - less than or equal to 20, in 20’s, 30’s, 40’s and the ones greater than 50 . age groups are split using the cut() function in R.
data_participants= data_participants%>%
mutate(agegroup=as.character(agegroup))
data_social dataframe reveals that it consists of individual networking/ meeting records. This is not very useful for visualization.
Hence, we will aggregate the participantId by date, participantIdFrom, participantIdTo and weekday.
Functions from dplyr package are used. They are: group(), summarise(), and ungroup().
To be able to get strong connections between participants we filter the data based on weight greater then 64 Social data consists of dates between 01-03-2022 to 25-05-2023
We filter out only those pair of participant ID’s who have met at least once a week between the two dates(the participants who have met at least 64 times during the timeline of data available) as The number of weeks between these two dates is 64.
save the participantIdFrom in a new dataframe nodes_list and remove the duplicates to get the dataframe of nodes with weight > 64
nodes_list <- social_aggregated[,c("participantIdFrom")]
nodes_list <- distinct(nodes_list)
participants_nodes <-merge(x=nodes_list, y=data_participants, by.x = 'participantIdFrom', by.y = 'participantId')
save file into rds using saveRDS() function.
social_aggregated <- readRDS ( 'data/social_aggregated_64.rds')
glimpse (social_aggregated)
Rows: 864
Columns: 4
$ participantIdFrom <chr> "100", "1002", "1002", "101", "101", "102"…
$ participantIdTo <chr> "36", "69", "69", "282", "282", "489", "48…
$ weekday <ord> Thursday, Wednesday, Thursday, Wednesday, …
$ Weight <int> 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65…
participants_nodes <- readRDS ( 'data/participants_nodes_64.rds')
glimpse (participants_nodes)
Rows: 227
Columns: 8
$ participantIdFrom <chr> "100", "1002", "101", "102", "110", "114",…
$ householdSize <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, …
$ haveKids <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
$ age <dbl> 31, 49, 29, 38, 52, 51, 32, 45, 26, 34, 27…
$ educationLevel <chr> "Bachelors", "Graduate", "Low", "HighSchoo…
$ interestGroup <chr> "H", "C", "F", "A", "C", "G", "B", "F", "A…
$ joviality <dbl> 0.39686665, 0.04343346, 0.14268616, 0.0205…
$ agegroup <chr> "26-35", "45- 55", "26-35", "36-45", "45- …
We filter out only those pair of participant ID’s who have met at least once in two weeks between the two dates (the participants who have met at least 32 times during the timeline of data available)
save the participantIdFrom in a new dataframe nodes_list and remove the duplicates to get the dataframe of nodes with weight > 32
nodes_list_32 <- social_aggregated_32[,c("participantIdFrom")]
nodes_list_32 <- distinct(nodes_list_32)
participants_nodes_32 <-merge(x=nodes_list_32, y=data_participants, by.x = 'participantIdFrom', by.y = 'participantId')
save file into rds using saveRDS() function.
social_aggregated_32 <- readRDS ( 'data/social_aggregated_32.rds')
participants_nodes_32 <- readRDS ( 'data/participants_nodes_32.rds')
glimpse (social_aggregated)
Rows: 864
Columns: 4
$ participantIdFrom <chr> "100", "1002", "1002", "101", "101", "102"…
$ participantIdTo <chr> "36", "69", "69", "282", "282", "489", "48…
$ weekday <ord> Thursday, Wednesday, Thursday, Wednesday, …
$ Weight <int> 65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 65…
tbl_graph() of tinygraph package to build an tidygraph’s network graph data.frame
social_graph <- tbl_graph(nodes = participants_nodes,
edges = social_aggregated,
directed = TRUE)
social_graph
# A tbl_graph: 227 nodes and 864 edges
#
# A directed multigraph with 86 components
#
# Node Data: 227 × 8 (active)
participantIdFr… householdSize haveKids age educationLevel
<chr> <dbl> <lgl> <dbl> <chr>
1 100 2 FALSE 31 Bachelors
2 1002 1 FALSE 49 Graduate
3 101 2 FALSE 29 Low
4 102 2 FALSE 38 HighSchoolOrC…
5 110 2 FALSE 52 HighSchoolOrC…
6 114 2 FALSE 51 Graduate
# … with 221 more rows, and 3 more variables: interestGroup <chr>,
# joviality <dbl>, agegroup <chr>
#
# Edge Data: 864 × 4
from to weekday Weight
<int> <int> <ord> <int>
1 1 90 Thursday 65
2 2 151 Wednesday 65
3 2 151 Thursday 65
# … with 861 more rows
The output above reveals that social_graph is a tbl_graph object with 227 nodes and 864 edges. The command also prints the first six rows of “Node Data” and the first three of “Edge Data”. It states that the Node Data is active. The notion of an active tibble within a tbl_graph object makes it possible to manipulate the data in one tibble at a time.
Rearrange the rows in the edges tibble to list those with the highest “weight” using activate()
social_graph %>%
activate(edges) %>%
arrange(desc(Weight))
# A tbl_graph: 227 nodes and 864 edges
#
# A directed multigraph with 86 components
#
# Edge Data: 864 × 4 (active)
from to weekday Weight
<int> <int> <ord> <int>
1 1 90 Thursday 65
2 2 151 Wednesday 65
3 2 151 Thursday 65
4 3 77 Wednesday 65
5 3 77 Thursday 65
6 4 120 Wednesday 65
# … with 858 more rows
#
# Node Data: 227 × 8
participantIdFr… householdSize haveKids age educationLevel
<chr> <dbl> <lgl> <dbl> <chr>
1 100 2 FALSE 31 Bachelors
2 1002 1 FALSE 49 Graduate
3 101 2 FALSE 29 Low
# … with 224 more rows, and 3 more variables: interestGroup <chr>,
# joviality <dbl>, agegroup <chr>
social_graph_32 <- tbl_graph(nodes = participants_nodes_32,
edges = social_aggregated_32,
directed = TRUE)
social_graph_32
# A tbl_graph: 737 nodes and 23904 edges
#
# A directed multigraph with 27 components
#
# Node Data: 737 × 8 (active)
participantIdFr… householdSize haveKids age educationLevel
<chr> <dbl> <lgl> <dbl> <chr>
1 10 3 TRUE 35 Bachelors
2 100 2 FALSE 31 Bachelors
3 1000 1 FALSE 48 Bachelors
4 1001 1 FALSE 56 Graduate
5 1002 1 FALSE 49 Graduate
6 1003 1 FALSE 51 HighSchoolOrC…
# … with 731 more rows, and 3 more variables: interestGroup <chr>,
# joviality <dbl>, agegroup <chr>
#
# Edge Data: 23,904 × 4
from to weekday Weight
<int> <int> <ord> <int>
1 1 651 Sunday 42
2 1 651 Monday 39
3 1 651 Tuesday 40
# … with 23,901 more rows
social_graph_32 %>%
activate(edges) %>%
arrange(desc(Weight))
# A tbl_graph: 737 nodes and 23904 edges
#
# A directed multigraph with 27 components
#
# Edge Data: 23,904 × 4 (active)
from to weekday Weight
<int> <int> <ord> <int>
1 2 252 Thursday 65
2 5 508 Wednesday 65
3 5 508 Thursday 65
4 13 185 Wednesday 65
5 13 185 Thursday 65
6 15 365 Wednesday 65
# … with 23,898 more rows
#
# Node Data: 737 × 8
participantIdFr… householdSize haveKids age educationLevel
<chr> <dbl> <lgl> <dbl> <chr>
1 10 3 TRUE 35 Bachelors
2 100 2 FALSE 31 Bachelors
3 1000 1 FALSE 48 Bachelors
# … with 734 more rows, and 3 more variables: interestGroup <chr>,
# joviality <dbl>, agegroup <chr>
geom_edge_link draws edges in the simplest way - as straight lines between the start and end nodes.Argument width is used to map the width of the line in proportional to the Weight attribute and argument alpha is used to introduce opacity on the line.
g <- ggraph(social_graph) +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 0.4)) +
geom_node_point(aes(colour = agegroup),
size = 2)
g + theme_graph()
Observations:
connections between only two nodes show that only the two nodes are independently friends with each other with no other common friends
networks within groups of nodes without a group show friends and mutual friends
There are atleast 21 friend circles in the city of engagement who have met more than 64 times in the period of observation
With the plot above we can see that a majority of the groups with size more than three are formed either between the same age group or between adjacent age groups such that the age difference is not very large
The above is not applicable for groups with less than 3 participants
g <- ggraph(social_graph_32) +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 0.2)) +
geom_node_point(aes(colour = agegroup),
size = 1)
g + theme_graph()
If we reduce the weight from 64 to 32 we get a large group of network/ participants who meet but we cannot tell if all of them meet up frequently.
Use the facet_edges() function to plot the social network plots over different days of the week
set_graph_style()
g <- ggraph(social_graph, layout = "fr") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 1)) +
geom_node_point(aes(colour = agegroup),
size = 2)
g + facet_edges(~weekday) +
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
Observations:
The plot of networks over different days of the week shows that the distribution of groups is same over the three days except for very minor differences
Here we have only three days of the week Tuesday, Wednesday, Thursday left after filtering out weight> 64 hence we can say that people meet their closest friends only on Tuesdays, Wednesdays and Thursdays in the city of Engagement.
set_graph_style()
g <- ggraph(social_graph_32, layout = "fr") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 0.2)) +
geom_node_point(aes(colour = agegroup),
size = 1)
g + facet_edges(~weekday) +
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
when we make the weight less we observe that the participants network on all days of the week
Use the facet_nodes() function to plot the social network plots for participants with different education level.
set_graph_style()
g <- ggraph(social_graph) +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 0.8)) +
geom_node_point(aes(colour = agegroup),
size = 2)
g + facet_nodes(~educationLevel)+
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
Observations:
Within the same education level there are a few groups of friends who network in participants and who have completed bachelors or are from high school or college.
There are only two pairs of frequently networking friends among those who are graduates. The number of time friends meet decreases as the education level goes up
No participant from the low income age group network or maybe the people from low education groups come from poor financial backgrounds hence they do not find time nor have the money to network.
Use the facet_nodes() function to plot the social network plots for participants with different education level.
set_graph_style()
g <- ggraph(social_graph) +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 0.8)) +
geom_node_point(aes(colour = haveKids),
size = 2)
g + facet_nodes(~educationLevel)+
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
Observations:
All of the participants who network among the ones who have completed bachelors and graduate are the ones who have Kids
Only two friend networks in who have completed high school or college consists of a member without kids and two members having kids
Use the facet_nodes() function to plot the social network plots for participants with different education level.
set_graph_style()
g <- ggraph(social_graph) +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 0.8)) +
geom_node_point(aes(colour = agegroup),
size = 2)
g + facet_nodes(~householdSize)+
th_foreground(foreground = "grey80",
border = TRUE) +
theme(legend.position = 'bottom')
Observation:
g <- social_graph %>%
filter(householdSize == 1) %>%
mutate(betweenness_centrality = centrality_betweenness()) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 1)) +
geom_node_point(aes(colour = agegroup,
size=betweenness_centrality))
g + theme_graph()
g <- social_graph %>%
filter(householdSize == 1) %>%
mutate(betweenness_centrality = centrality_betweenness()) %>%
ggraph() +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 1)) +
geom_node_point(aes(colour = agegroup,
size=betweenness_centrality))
g + theme_graph()
Observations:
There is one interconnecting group shown on the top having two participnats in the age group of 18-25 and 26-35 who have high betweeness centrality
There are 11 groups of networks all of whom mutually network with each other having a size of 4 participants all having the same betweenness centrality
There are 6 groups of triads who network together having the same betweenness centrality
There are two participants with high betweeness centrality (one from the age group of 36-45 and one in the age group of 45-55 who are mutual friends of two other group)
There are many pairs of participants who meet frequently and some participants who do not at all
g <- social_graph %>%
mutate(community = as.factor(group_edge_betweenness(weights = Weight, directed = TRUE))) %>%
ggraph() +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 1)) +
geom_node_point(aes(colour = educationLevel))
g + theme_graph()
g <- social_graph %>%
mutate(community = as.factor(group_edge_betweenness(weights = Weight, directed = TRUE))) %>%
ggraph() +
geom_edge_link(aes(width=Weight),
alpha=0.2) +
scale_edge_width(range = c(0.1, 1)) +
geom_node_point(aes(colour = interestGroup))
g + theme_graph()
In the city of engagement Ohio very few groups of close frinds are observed who network frequently and they maninly comprise of the educated participants.
Large number of participants who network frequently consists of the ones who live alone
Participants network with their closest friends (whom they meet more frequently) on Tuesdays, wednesdays and thursdays and their acquaintances (whom they meet less frequently) on the rest of the days of week