Visual insights reveal the demographic of the city of Engagement, Ohio USA for vast 2022 challenge.
This take-home exercise aims to reveal the demographic of the city of Engagement, Ohio USA by using visualization techniques in R.
The provided data can be used to get more insights into the mindset of the people in Ohio, USA. The report aims to answering questions such as the reason for unhappiness in the city of Ohio, the average household size in Ohio, the relation between age, education level and joviality and the impact of children living in household at a person’s happiness.
Links to the dataset:
Packages, namely tidyverse and ggplot2 are
required for this makeover exercise. This code chunk installs the
required packages and loads them onto RStudio environment.
packages = c('tidyverse','ggplot2')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The dataset used in this exercise is Participants.csv, published by the IEEE for [VAST challenge 2022] (https://vast-challenge.github.io/2022/)
The code chunk below imports Participants.scv from the data
folder into R by using read_csv()
function of readr and saves it as Tibble data frame
called participants_data
participants_data <- read_csv("data/Participants.csv")
str(participants_data)
spec_tbl_df [1,011 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
$ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
$ haveKids : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
$ age : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
$ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
$ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
$ joviality : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
- attr(*, "spec")=
.. cols(
.. participantId = col_double(),
.. householdSize = col_double(),
.. haveKids = col_logical(),
.. age = col_double(),
.. educationLevel = col_character(),
.. interestGroup = col_character(),
.. joviality = col_double()
.. )
- attr(*, "problems")=<externalptr>
Contains information about the residents Ohio, USA that have agreed to participate in this study.
Before we proceed with data visualization we check if the dataset has any null values using the code below.
participantId householdSize haveKids age
FALSE FALSE FALSE FALSE
educationLevel interestGroup joviality
FALSE FALSE FALSE
Since there are no null values observed we proceed with our next step.
Display the various age groups present in the dataset using the code below to get an overview of the target age groups in the dataset
[1] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[14] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43"
[27] "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
[40] "57" "58" "59" "60"
Since the number of age groups are too many to derive a generalized conclusion we group the ages into five categories - less than or equal to 20, in 20’s, 30’s, 40’s and the ones greater than 50 .
one <- c("18", "19", "20")
two <- c("21", "22", "23", "24", "25", "26", "27", "28", "29", "30" )
three <- c("31", "32", "33", "34", "35", "36", "37", "38", "39", "40")
four <- c( "41", "42", "43", "44", "45", "46", "47", "48", "49", "50")
five <- c( "51", "52", "53", "54", "55", "56", "57", "58", "59", "60")
participants_data_mod <- participants_data %>%
mutate(age_modified = case_when(
age %in% one ~ "<=20",
age %in% two ~ "20's",
age %in% three ~ "30's",
age %in% four ~ "40's",
age %in% five ~ "50+")) %>%
select(-age)
head(participants_data_mod)
# A tibble: 6 × 7
participantId householdSize haveKids educationLevel interestGroup
<dbl> <dbl> <lgl> <chr> <chr>
1 0 3 TRUE HighSchoolOrColl… H
2 1 3 TRUE HighSchoolOrColl… B
3 2 3 TRUE HighSchoolOrColl… A
4 3 3 TRUE HighSchoolOrColl… I
5 4 3 TRUE Bachelors H
6 5 3 TRUE HighSchoolOrColl… D
# … with 2 more variables: joviality <dbl>, age_modified <chr>
summary(participants_data_mod)
participantId householdSize haveKids educationLevel
Min. : 0.0 Min. :1.000 Mode :logical Length:1011
1st Qu.: 252.5 1st Qu.:1.000 FALSE:710 Class :character
Median : 505.0 Median :2.000 TRUE :301 Mode :character
Mean : 505.0 Mean :1.964
3rd Qu.: 757.5 3rd Qu.:3.000
Max. :1010.0 Max. :3.000
interestGroup joviality age_modified
Length:1011 Min. :0.000204 Length:1011
Class :character 1st Qu.:0.240074 Class :character
Mode :character Median :0.477539 Mode :character
Mean :0.493794
3rd Qu.:0.746819
Max. :0.999234
participants_data_new <- summarise_at(group_by(participants_data_mod,age_modified,educationLevel),vars(joviality),funs(mean(.,na.rm=TRUE)))
head(participants_data_new)
# A tibble: 6 × 3
# Groups: age_modified [2]
age_modified educationLevel joviality
<chr> <chr> <dbl>
1 <=20 Bachelors 0.444
2 <=20 Graduate 0.583
3 <=20 HighSchoolOrCollege 0.483
4 <=20 Low 0.466
5 20's Bachelors 0.536
6 20's Graduate 0.565
As shown above, the values of the tibble data frame is sorted according to the age.
participants_data_mod[order(participants_data_mod$age_modified), ]
# A tibble: 1,011 × 7
participantId householdSize haveKids educationLevel interestGroup
<dbl> <dbl> <lgl> <chr> <chr>
1 8 3 TRUE Bachelors G
2 13 3 TRUE Bachelors J
3 18 3 TRUE Graduate I
4 29 3 TRUE Low C
5 35 3 TRUE Low J
6 53 3 TRUE Low H
7 67 3 TRUE HighSchoolOrCol… C
8 77 2 FALSE HighSchoolOrCol… C
9 88 2 FALSE HighSchoolOrCol… A
10 90 2 FALSE HighSchoolOrCol… E
# … with 1,001 more rows, and 2 more variables: joviality <dbl>,
# age_modified <chr>
A histogram is used to check the shape of the data distribution. The below graph reveals that the range of distribution of Joviality is from 0 to 1. We also observe that in our test data of participants there are very few participants having value of joviality closer to 1. Here the function geom_histogram is used to plot the distribution of joviality since joviality is a continuous variable.
ggplot(data=participants_data,
aes(x = joviality)) +
geom_histogram(bins=20,
boundary = 50,
color="black",
fill="light blue") +
coord_cartesian(xlim=c(0.01,1)) +
labs(
title = "Distribution of Joviality",
caption = "demographic information, Ohio USA"
) +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0)
)
ggplot(data=participants_data_mod,
aes(x=age_modified, fill = educationLevel)) +
geom_bar()+
labs(
title = "Distribution of Age for different household types",
caption = "demographic information, Ohio USA"
) +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0)
)
ggplot(data=participants_data_mod,
aes(x= joviality,
y= educationLevel, fill = haveKids)) +
geom_col() +
theme_classic()+
labs(
title = "Joviality Measure",
caption = "demographic information, Ohio USA"
) +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0)
)+
facet_wrap(~age_modified)
ggplot(data=participants_data_mod,
aes(x= interestGroup,
y= joviality)) +
geom_col() +
theme_classic()+
labs(
title = "Joviality Measure",
caption = "demographic information, Ohio USA"
) +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 1),
plot.caption = element_text(hjust = 0)
)+
facet_wrap(~age_modified)
We can observe that on an average the people who are in high school or college and graduates are the ones who are the happiest of all in all age groups.
the joviality value changes across different age groups over different interests.