DSI-Live Coding

Example of lesson script with live coding exercises
Note

Note, blocks or [annotations] are not for students, they left to communicate the structure and design details of the lesson

Data

Let’s spend 10 minutes on practicing DPlyr verbs:

  • https://tidydatatutor.com/ + practice

What are we working with, doc? [Context]

We will work with the Campus Recruitment (Academic and Employability Factors influencing placement) dataset, which contains (presumably simulated) data on student background and factors influencing placement.

Task

Traditional tasks for analysts would be to find interesting patterns or disparities in employment data.

Tips for reasoning, decision-making, and problem-solving [Supportive information]

At this stage we will focus on how simple and powerful data aggregation techniques can help us:

  • understand hidden data patterns
  • formulate hypotheses for future analysis
  • bring some ideas back to senior analysts and stakeholders.

Remember:

  • Analytics is about comparisons
  • We might be interested in disparities, e.g. based on gender or work experience
  • We are more likely to be interested in general patterns than precise numbers, at least until we can evaluate uncertainty
  • Our ultimate goal is to support decision making, balancing precision and details

Let’s load data and look at variable names

library(readr)
library(dplyr)
placement <- read_csv("./data/Placement_Data_Full_Class.csv")
#View(placement)

names(placement)
 [1] "sl_no"          "gender"         "ssc_p"          "ssc_b"         
 [5] "hsc_p"          "hsc_b"          "hsc_s"          "degree_p"      
 [9] "degree_t"       "workex"         "etest_p"        "specialisation"
[13] "mba_p"          "status"         "salary"        

Filtering data

Let’s leave only gender, degree_t, degree_p, workex, specialisation, mba_p, status, salary in our data

Variables
Variable Notes type
gender
degree_t major categorical
degree_p ranking (degree percentage) numeric
workex work experience categorical
specialisation MBA specialization categorical
mba_p MBA degree percentage numeric
status placed or not
salary
placement |> 
  select(gender, degree_t, degree_p, workex, specialisation, mba_p, status, salary) -> placement_clean
... -> placement_clean

Let’s let R know that some of the columns are not just text, but categorical variables

placement_clean |> 
  mutate(gender = factor(gender),
         degree_t = factor(degree_t),
         workex = factor(workex),
         specialisation = factor(specialisation),
         degree_t = factor(degree_t),
         status = factor(status)) -> placement_clean

Starting comparisons

What is the median degree percentage of placed vs non-placed people?

placement_clean |> 
  group_by(status) |> 
  summarise(median_p = median(degree_p))
# A tibble: 2 × 2
  status     median_p
  <fct>         <dbl>
1 Not Placed       61
2 Placed           68

… how about mba?

placement_clean |> 
  group_by(status) |> 
  summarise(median_mba_p = median(mba_p))
# A tibble: 2 × 2
  status     median_mba_p
  <fct>             <dbl>
1 Not Placed         60.7
2 Placed             62.2

Going deeper: apparent gender disparities

What is the median degree percentage of placed vs non-placed male and female graduates?

placement_clean |> 
  group_by(status, gender) |> 
  summarise(median_p = median(degree_p)) |> 
  arrange(gender, status)
# A tibble: 4 × 3
# Groups:   status [2]
  status     gender median_p
  <fct>      <fct>     <dbl>
1 Not Placed F          64.1
2 Placed     F          69.2
3 Not Placed M          59  
4 Placed     M          66.9

Let’s make it easier to read putting values to columns?

placement_clean |> 
  group_by(status, gender) |> 
  summarise(median_p = median(degree_p)) |> 
  arrange(gender, status) |> 
  tidyr::pivot_wider(id_cols=status, names_from=gender, values_from=median_p)
# A tibble: 2 × 3
# Groups:   status [2]
  status         F     M
  <fct>      <dbl> <dbl>
1 Not Placed  64.1  59  
2 Placed      69.2  66.9

Going deeper: beyond the average

Let’s try to see how degree percentages for placed vs non-placed people are distributed

placement_clean |> 
  group_by(status) |> 
  summarise(q = quantile(degree_p, c(0.1, 0.5, 0.9)), p = c(0.1, 0.5, 0.9)) 
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
# A tibble: 6 × 3
# Groups:   status [2]
  status         q     p
  <fct>      <dbl> <dbl>
1 Not Placed  53     0.1
2 Not Placed  61     0.5
3 Not Placed  68.0   0.9
4 Placed      60.6   0.1
5 Placed      68     0.5
6 Placed      77.6   0.9

make humanly readable?

placement_clean |> 
  group_by(status) |> 
  summarise(q = quantile(degree_p, c(0.1, 0.5, 0.9)), p = c(0.1, 0.5, 0.9)) |> 
  tidyr::pivot_wider(id_cols = status, names_from = p, values_from=q, names_prefix="p")
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
# A tibble: 2 × 4
# Groups:   status [2]
  status      p0.1  p0.5  p0.9
  <fct>      <dbl> <dbl> <dbl>
1 Not Placed  53      61  68.0
2 Placed      60.6    68  77.6

Let’s try to see what how degree percentages for placed vs non-placed people are distributed, taking gender into account

placement_clean |> 
  group_by(status, gender) |> 
  summarise(q = quantile(degree_p, c(0.1, 0.5, 0.9)), p = c(0.1, 0.5, 0.9)) |> 
  tidyr::pivot_wider(id_cols = c(status, gender), names_from = p, values_from=q, names_prefix="p")
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
# A tibble: 4 × 5
# Groups:   status, gender [4]
  status     gender  p0.1  p0.5  p0.9
  <fct>      <fct>  <dbl> <dbl> <dbl>
1 Not Placed F       54.4  64.1  73.1
2 Not Placed M       52.8  59    65  
3 Placed     F       61.1  69.2  81  
4 Placed     M       60.8  66.9  77.0

Going deeper: majors

What are top majors for placed and not?

placement_clean |> 
  group_by(status, degree_t) |> 
  summarise(major_n = n()) |> 
  arrange(status, desc(major_n))
# A tibble: 6 × 3
# Groups:   status [2]
  status     degree_t  major_n
  <fct>      <fct>       <int>
1 Not Placed Comm&Mgmt      43
2 Not Placed Sci&Tech       18
3 Not Placed Others          6
4 Placed     Comm&Mgmt     102
5 Placed     Sci&Tech       41
6 Placed     Others          5

What share each major occupies?

placement_clean |> 
  group_by(status, degree_t) |> 
  summarise(major_n = n()) |> 
  mutate(share = major_n/sum(major_n)) |> 
  arrange(status, desc(share))
# A tibble: 6 × 4
# Groups:   status [2]
  status     degree_t  major_n  share
  <fct>      <fct>       <int>  <dbl>
1 Not Placed Comm&Mgmt      43 0.642 
2 Not Placed Sci&Tech       18 0.269 
3 Not Placed Others          6 0.0896
4 Placed     Comm&Mgmt     102 0.689 
5 Placed     Sci&Tech       41 0.277 
6 Placed     Others          5 0.0338

What are some other tasks you could use for extra practice?

Tasks and classes
simple aggregation for comparison aggregation with multiple groups aggregation with complex/custom function for advanced comparison
What is the median degree percentage of placed vs non-placed people? What are median degree and mba percentages of placed vs non-placed people? What are 0.1, 0.5, 0.9 quantiles for the degree percentage of placed vs non-placed people?
What are top 3 undergrad majors for placed vs non-placed? What are ranks of top 3 undergrad majors for non-placed among placed?
What are the median salaries for people with and without work experience? What are the median salaries for female and male candidates with and without work experience? What are 0.2, 0.5, 0.8 quantile salaries for female and male candidates with and without work experience?