Take Home Exercise-1

Visual insights reveal the demographic of the city of Engagement, Ohio USA for vast 2022 challenge.

Shachi Anirudha Raodeo https://github.com/ShachiR/ISSS608 (School of Computing and Information Systems)
2022-05-25

The Task

This take-home exercise aims to reveal the demographic of the city of Engagement, Ohio USA by using visualization techniques in R.

The provided data can be used to get more insights into the mindset of the people in Ohio, USA. The report aims to answering questions such as the reason for unhappiness in the city of Ohio, the average household size in Ohio, the relation between age, education level and joviality and the impact of children living in household at a person’s happiness.

Links to the dataset:

Download Superstore-2021.xls

Step-by-step Data Visualisation

Installing and launching R packages

Packages, namely tidyverse and ggplot2 are required for this makeover exercise. This code chunk installs the required packages and loads them onto RStudio environment.

packages = c('tidyverse','ggplot2')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Data Preparation

Data Source

The dataset used in this exercise is Participants.csv, published by the IEEE for [VAST challenge 2022] (https://vast-challenge.github.io/2022/)

Importing the dataset

The code chunk below imports Participants.scv from the data folder into R by using read_csv() function of readr and saves it as Tibble data frame called participants_data

participants_data <- read_csv("data/Participants.csv")
str(participants_data)
spec_tbl_df [1,011 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
 $ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
 $ haveKids      : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ age           : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
 $ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
 $ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
 $ joviality     : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
 - attr(*, "spec")=
  .. cols(
  ..   participantId = col_double(),
  ..   householdSize = col_double(),
  ..   haveKids = col_logical(),
  ..   age = col_double(),
  ..   educationLevel = col_character(),
  ..   interestGroup = col_character(),
  ..   joviality = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Data Description

Contains information about the residents Ohio, USA that have agreed to participate in this study.

Missing value check

Before we proceed with data visualization we check if the dataset has any null values using the code below.

apply(participants_data, 2, function(x) any(is.na(x)))
 participantId  householdSize       haveKids            age 
         FALSE          FALSE          FALSE          FALSE 
educationLevel  interestGroup      joviality 
         FALSE          FALSE          FALSE 

Since there are no null values observed we proceed with our next step.

Data Wrangling

Display the various age groups present in the dataset using the code below to get an overview of the target age groups in the dataset

participants_data$age <- as.factor(participants_data$age)
levels(participants_data$age)
 [1] "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[14] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43"
[27] "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
[40] "57" "58" "59" "60"

Grouping age into categories

Since the number of age groups are too many to derive a generalized conclusion we group the ages into five categories - less than or equal to 20, in 20’s, 30’s, 40’s and the ones greater than 50 .

one <- c("18", "19", "20")
two <- c("21", "22", "23", "24", "25", "26", "27", "28", "29", "30" )
three <- c("31", "32", "33", "34", "35", "36", "37", "38", "39", "40")
four <- c( "41", "42", "43", "44", "45", "46", "47", "48", "49", "50")
five <-  c( "51", "52", "53", "54", "55", "56", "57", "58", "59", "60")


participants_data_mod <- participants_data %>%
  mutate(age_modified = case_when(
    age %in% one ~ "<=20",
    age %in% two ~ "20's",
    age %in% three ~ "30's",
    age %in% four ~ "40's",
    age %in% five ~ "50+")) %>%
  select(-age)

head(participants_data_mod)
# A tibble: 6 × 7
  participantId householdSize haveKids educationLevel    interestGroup
          <dbl>         <dbl> <lgl>    <chr>             <chr>        
1             0             3 TRUE     HighSchoolOrColl… H            
2             1             3 TRUE     HighSchoolOrColl… B            
3             2             3 TRUE     HighSchoolOrColl… A            
4             3             3 TRUE     HighSchoolOrColl… I            
5             4             3 TRUE     Bachelors         H            
6             5             3 TRUE     HighSchoolOrColl… D            
# … with 2 more variables: joviality <dbl>, age_modified <chr>

Data Summary

summary(participants_data_mod)
 participantId    householdSize    haveKids       educationLevel    
 Min.   :   0.0   Min.   :1.000   Mode :logical   Length:1011       
 1st Qu.: 252.5   1st Qu.:1.000   FALSE:710       Class :character  
 Median : 505.0   Median :2.000   TRUE :301       Mode  :character  
 Mean   : 505.0   Mean   :1.964                                     
 3rd Qu.: 757.5   3rd Qu.:3.000                                     
 Max.   :1010.0   Max.   :3.000                                     
 interestGroup        joviality        age_modified      
 Length:1011        Min.   :0.000204   Length:1011       
 Class :character   1st Qu.:0.240074   Class :character  
 Mode  :character   Median :0.477539   Mode  :character  
                    Mean   :0.493794                     
                    3rd Qu.:0.746819                     
                    Max.   :0.999234                     
participants_data_new <- summarise_at(group_by(participants_data_mod,age_modified,educationLevel),vars(joviality),funs(mean(.,na.rm=TRUE)))

head(participants_data_new)
# A tibble: 6 × 3
# Groups:   age_modified [2]
  age_modified educationLevel      joviality
  <chr>        <chr>                   <dbl>
1 <=20         Bachelors               0.444
2 <=20         Graduate                0.583
3 <=20         HighSchoolOrCollege     0.483
4 <=20         Low                     0.466
5 20's         Bachelors               0.536
6 20's         Graduate                0.565

Sorting Data

As shown above, the values of the tibble data frame is sorted according to the age.

participants_data_mod[order(participants_data_mod$age_modified), ]
# A tibble: 1,011 × 7
   participantId householdSize haveKids educationLevel   interestGroup
           <dbl>         <dbl> <lgl>    <chr>            <chr>        
 1             8             3 TRUE     Bachelors        G            
 2            13             3 TRUE     Bachelors        J            
 3            18             3 TRUE     Graduate         I            
 4            29             3 TRUE     Low              C            
 5            35             3 TRUE     Low              J            
 6            53             3 TRUE     Low              H            
 7            67             3 TRUE     HighSchoolOrCol… C            
 8            77             2 FALSE    HighSchoolOrCol… C            
 9            88             2 FALSE    HighSchoolOrCol… A            
10            90             2 FALSE    HighSchoolOrCol… E            
# … with 1,001 more rows, and 2 more variables: joviality <dbl>,
#   age_modified <chr>

Data Visualization and Insights

A histogram is used to check the shape of the data distribution. The below graph reveals that the range of distribution of Joviality is from 0 to 1. We also observe that in our test data of participants there are very few participants having value of joviality closer to 1. Here the function geom_histogram is used to plot the distribution of joviality since joviality is a continuous variable.

ggplot(data=participants_data,
aes(x = joviality)) +
geom_histogram(bins=20,
boundary = 50,
color="black",
fill="light blue") +
coord_cartesian(xlim=c(0.01,1)) +
labs(
    title = "Distribution of Joviality",
    caption = "demographic information, Ohio USA"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 1),
    plot.caption = element_text(hjust = 0)
  )

ggplot(data=participants_data_mod, 
       aes(x=age_modified, fill = educationLevel)) +
geom_bar()+
labs(
    title = "Distribution of Age for different household types",
    caption = "demographic information, Ohio USA"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 1),
    plot.caption = element_text(hjust = 0)
  )

ggplot(data=participants_data_mod, 
       aes(x= joviality, 
           y= educationLevel, fill = haveKids)) +
     geom_col() +
  theme_classic()+
 labs(
    title = "Joviality Measure",
    caption = "demographic information, Ohio USA"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 1),
    plot.caption = element_text(hjust = 0)
  )+
  facet_wrap(~age_modified)

ggplot(data=participants_data_mod, 
       aes(x= interestGroup, 
           y= joviality)) +
     geom_col() +
  theme_classic()+
 labs(
    title = "Joviality Measure",
    caption = "demographic information, Ohio USA"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 1),
    plot.caption = element_text(hjust = 0)
  )+
  facet_wrap(~age_modified)

Conclusion