Visual insights reveal the demographic of the city of Engagement, Ohio USA for vast 2022 challenge.
This take-home exercise aims to reveal the demographic of the city of Engagement, Ohio USA by using visualization techniques in R.
The provided data can be used to get more insights into the mindset of the people in Ohio, USA. The report aims to select one of the take home exercises prepared by another student and critic the submission in terms of clarity and aesthetics, and remake the original design by using the data visualisation principles and best practice you had learned in Lesson 1 and 2.
Links to the dataset:
The assignment chosen to critic in take home exercise 2 is of student
DING YANMU
Packages, namely tidyverse and ggplot2 are
required for this makeover exercise. This code chunk installs the
required packages and loads them onto RStudio environment.
packages = c('tidyverse','ggplot2','ggdist', 'ggridges',
'patchwork', 'ggthemes', 'hrbrthemes',
'ggrepel','ggforce',"HH","vcd",'scales','grid','gridExtra',
'formattable','readr')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The dataset used in this exercise is Participants.csv, published by the IEEE for [VAST challenge 2022] (https://vast-challenge.github.io/2022/)
The code chunk below imports Participants.scv from the data
folder into R by using read_csv()
function of readr and saves it as Tibble data frame
called participants
The student assignment chosen to critic has only considered to use participants data to generate insights about the demographics of the city of Ohio.
participants <- read_csv("data/Participants.csv")
summary(participants)
participantId householdSize haveKids age
Min. : 0.0 Min. :1.000 Mode :logical Min. :18.00
1st Qu.: 252.5 1st Qu.:1.000 FALSE:710 1st Qu.:29.00
Median : 505.0 Median :2.000 TRUE :301 Median :39.00
Mean : 505.0 Mean :1.964 Mean :39.07
3rd Qu.: 757.5 3rd Qu.:3.000 3rd Qu.:50.00
Max. :1010.0 Max. :3.000 Max. :60.00
educationLevel interestGroup joviality
Length:1011 Length:1011 Min. :0.000204
Class :character Class :character 1st Qu.:0.240074
Mode :character Mode :character Median :0.477539
Mean :0.493794
3rd Qu.:0.746819
Max. :0.999234
data <- read_csv("data/Participants.csv")
Before we proceed with data visualization we check if the datasets chosen have any null values using the code below.
participantId householdSize haveKids age
FALSE FALSE FALSE FALSE
educationLevel interestGroup joviality
FALSE FALSE FALSE
Since there are no null values observed we proceed with our next step.
Display the various age groups present in the dataset using the code below to get an overview of the target age groups in the dataset.
[1] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[14] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43"
[27] "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
[40] "57" "58" "59" "60"
Since the number of age groups are too many to derive a generalized conclusion we group the ages into five categories - less than or equal to 20, in 20’s, 30’s, 40’s and the ones greater than 50 .
The method of segregating the age group into categories in the chosen student’s assignment is quite complicated. The student creates a new data frame and recalculates each variable - age group, the count of the number of participants in each age group, the average happiness level in each age group, the number of participants that have kids and so on. This method is very time consuming and complicated and makes the code bulky.
An easier way to split the age groups would be usint the cut() function in R.
participants <- read_csv("data/Participants.csv")
participants$agegroup <- cut(participants$age, breaks = c(17,20,25,30,35,40,45,50,55,60),
labels = c("18-20","20-25","26-30","31-35","36-40","41-45","45-50","50-55","55-60"))
Is a similar simple graph created by the student recreating using the above data.
ggplot(data = participants,
aes(x = agegroup)) +
geom_bar() +
xlab("Age Group (years)") +
ylab("Count") +
geom_text(stat="count",
aes(label = paste0(..count..), vjust=-1)) +
scale_y_continuous(limits = c(0, 200))+
ggtitle("Distribution of Age among participants")
above plot is modified considering the principles of data visualization. The above plot has simple X and Y axis, and is not so visually appealing. A line has been plotted along with the bar chart to make the differences in count wasily visible to the reader.
p1<- ggplot(data = participants,
aes(x = agegroup)) +
geom_bar(alpha=0.5, fill = 4) +
xlab("Age (years)") +
ylab("Count of participants") +
geom_line(aes(group=1),stat="count",colour="black")+
geom_point(aes(),stat="count",colour="red")+
geom_text(stat="count",
aes(label=paste0(..count.., " (", round(..count../sum(..count..)*100,1), "%)"),hjust = 0.5,vjust=-2, size= "0.3")) +
theme_classic() +
theme(panel.grid.major.y = element_line(color = "light grey")) +
scale_y_continuous(breaks = seq(0, 180, by = 30), limits = c(0, 180))+
labs(
y= 'Count of \nParticipants',
title = "Distribution of Age among Participants",
caption = "demographic information, Ohio USA"
) +
theme(
axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0),
legend.position = "none"
)
p1
As shown above, the values of the tibble data frame is sorted according to the age. to plot a pareto chart first the count of age groups based on categories is calculated and sorted in decreasing order.
participants_n <- participants %>%
group_by(`agegroup`) %>%
summarise('participants' = n()) %>%
ungroup()%>%
arrange(desc(participants))
Calculate cumulative frequency and find the cumulative percentage
The distribution shown by the pareto chart is quite linear as the average number of participants in each age group is in the similar range except for the age group of 18-20 yrs which has very less number of participants.
participants_freq <- participants_n %>%
mutate(freq = cumsum(participants)/sum(participants)*100)
### graph 2 pareto chart
coeff <- 0.7
p2<- ggplot(data=participants_freq,
aes(x = reorder(`agegroup`, -participants), y = participants)) +
geom_col(fill = "4",alpha= 0.5) +
labs(x = "Age Group", title = "Distribution of age among participants") +
geom_point(aes(y = `freq`/coeff), colour = 'red', size = 2) +
geom_line(aes(y = `freq`/coeff), colour = 'red', group = 1) +
scale_y_continuous(breaks = seq(0, 1000, 50),
sec.axis = sec_axis(~.*coeff, name = "Percentage(%)")) +
theme_bw()+
labs(
y= 'Count of \nParticipants',
title = "Distribution of Age among Participants",
caption = "demographic information, Ohio USA"
) +
theme(
axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0),
legend.position = "none",
axis.text.x = element_text(vjust = 0.5)
)
p2
The method used by the student to plot is quite similar to the one used to plot the age distribution plot. The method used to recreate the student’s plot uses group by function to group the students with and without kids.
library(dplyr)
df_noKids = participants %>% group_by(agegroup,haveKids) %>%
filter(`haveKids` == FALSE)%>%
summarise(joviality = mean(joviality),
.groups = 'drop')
df_Kids = participants %>% group_by(agegroup,haveKids) %>%
filter(`haveKids` == TRUE)%>%
summarise(joviality = mean(joviality),
.groups = 'drop')
The below graph is similar to the student’s plot but an exact plot cannot be made because of the differences in calculation of the count used by the student and the method used in this take home exercise. Also the student whose assignment has been taken as refrence has plot participants who have kids with respect to the total participants plot,
The blow plot is the plot of participants who have kids wrt. to the ones who donot live with kids at home.
participants_group_kids = participants %>% group_by(agegroup,haveKids) %>%
summarise(joviality = mean(joviality),
.groups = 'drop')
ggplot(data= participants_group_kids,
aes(x= agegroup, y= joviality,
group= haveKids,
color= factor(haveKids))) +
geom_line(size=2) +
geom_point(aes(x=agegroup, y=joviality),
colour=alpha('red', 1), size=2) +
scale_color_discrete(name= 'Have Kids') +
scale_y_continuous(breaks = seq(0, 0.8, by = 0.1),
limits = c(0, 0.8))+
geom_text(aes(y = joviality, label = paste(round(joviality,2)),
hjust = 0.5,vjust=-1,size= "0.3"))+
theme(legend.position = "none",
)
Since the above plot does not comply with the principles of best visual plotting practices we use the lab() and the theme() functions to make the chart easy to read. We also plot percentage of joviality by using percent_format() function to convert the y axis labels to percent. We remove the text as it coincides with the plot.
p3<- ggplot(data= participants_group_kids,
aes(x= agegroup, y= joviality,
group= haveKids,
color= factor(haveKids))) +
geom_line(size=2) +
scale_color_discrete(name= 'Have Kids') +
scale_y_continuous(labels = percent_format(),breaks = seq(0, 0.8, by = 0.2),
limits = c(0, 0.8))+
theme_bw()+
labs(
y= 'Joviality',
x= 'Age (Years)',
title = "Joviality among participants who have kids v/s who do not",
caption = "demographic information, Ohio USA"
) +
theme(
axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0),
legend.position = "none",
axis.text.x = element_text(vjust = 0.5)
)
p3
The below plot shows us a better understanding of joviality with age group among people who have kinds in their household to those who do not have kids in their household. People in age group 26-30 who do not have kids in their household seem to be the least happy. Generally people falling in the age group of 55-60 seem to be the happiest on an average and their happiness is not affected by kids in their household.
participants_haveKids <- participants %>%
filter(`haveKids` == TRUE) %>%
mutate (joviality = -joviality)
participants_noKids <-participants %>%
filter(`haveKids` == FALSE)
participants_Kids <- rbind(participants_haveKids, participants_noKids)
p4<- ggplot(participants_Kids, aes (x = agegroup, y = joviality , fill = haveKids)) +
geom_bar(stat = "identity", alpha= 0.7) +
coord_flip()+
scale_y_continuous(labels = percent_format(),breaks = seq(-0.8, 0.8, by = 0.2),
limits = c(-0.8, 0.8))+
theme_bw()+
labs(
y= 'Joviality',
x= 'Age (Years)',
title = "Joviality among participants who have kids v/s who do not",
caption = "demographic information, Ohio USA"
) +
theme(
axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0),
legend.position = "none",
axis.text.x = element_text(vjust = 0.5)
)
p4
p6<- ggplot(data=participants,
aes(x= interestGroup,
y= joviality, fill=educationLevel)) +
geom_col(alpha= 0.8) +
theme_bw()+
facet_wrap(~agegroup)+
labs(
y= 'Joviality',
x= 'Age (Years)',
title = "Joviality among participants with varied interests",
caption = "demographic information, Ohio USA"
) +
theme(
axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0),
legend.position = "none",
axis.text.x = element_text(vjust = 0.5)
)
p6
The below graph needs to have joviality in percentage as well as needs to not have x value ticks and the text on the y axis with better readability.
from the plots we observe that on an average high school or college graduates and bachelors have a higher average joviality. It is lowest among the age group of 55- 60 who have low education and highest for the age group 36-40 with low education.
participants_group_edu = participants %>% group_by(agegroup,educationLevel) %>%
summarise(joviality = mean(joviality),
.groups = 'drop')
ggplot(data= participants_group_edu,
aes(x= agegroup, y= joviality,
group= educationLevel,
color=educationLevel)) +
geom_line(size =1.5) +
scale_color_discrete(name= 'Education Level') +
theme_bw()
participants_group_edu = participants %>% group_by(agegroup,educationLevel) %>%
summarise(joviality = mean(joviality),
.groups = 'drop')
p5<-ggplot(data= participants_group_edu,
aes(x= agegroup, y= joviality,
group= educationLevel,
color=educationLevel)) +
geom_line(size =1.5) +
scale_color_discrete(name= 'Education') +
scale_y_continuous(labels = percent_format())+
theme_bw()+
labs(
y= 'Joviality',
x= 'Age (Years)',
title = "Joviality among participants with different education",
caption = "demographic information, Ohio USA"
) +
theme(
axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0),
axis.text.x = element_text(vjust = 0.5)
)
p5
The below graph shows us the range of joviality observed among participants who have different interest groups and educational levels. The participants from high school or college have a very narrow range of joviality which falls between 0.3 to 0.7 and the participants with low education levelhave a wide range of joviality right from the lowest to the highest over all its interest groups. Some interest groups have a normal distribution whereas the others have more than one peak value.
participants_group_edu = participants %>% group_by(agegroup,educationLevel,interestGroup) %>%
summarise(joviality = mean(joviality),
.groups = 'drop')
p7<- ggplot(data= participants_group_edu,
aes(x = joviality, y = interestGroup, fill = interestGroup)) +
geom_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = TRUE,
quantiles = 4,
quantile_lines = TRUE,
alpha = .4) +
theme_ridges() +
scale_fill_viridis_d(name = "Quartiles")+
theme_bw()+
labs(
y= 'Joviality',
x= 'Age (Years)',
title = "Joviality among participants with different education",
caption = "demographic information, Ohio USA"
) +
theme(
axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0),
axis.text.x = element_text(vjust = 0.5)
)+
facet_wrap(~ educationLevel)
p7
There are many best practices that can be followed to make our graphs easily readable as well as look pleasant to the eye of the reader. Some of the best practices learnt while doing the take home exercise 2 are
Always ensure that your visual graphs provide a clear picture of the business question asked
Some numerical data such as age group in this exercise needs to be segregated into categories to get a better understanding overall
Choose the right visualization techniques to show the correct variables
ensure that the quantitative scale of your chart begins at 0
Avoid using colours which are very persistent to your eye such as the use of bright blue colour in the first bar chart to represent values
Use soft, natural colours to display most information and bright and/or dark colours to highlight information that requires greater attention.
the joviality value changes across different age groups over different interests.
To guarantee that most people who are colour blind can distinguish groups of data that are colour coded, avoid using a combination of red and green in the same display
make your graphs more wider than tall
Tick marks are superfluous on categorical scale