library(readr)
library(dplyr)
<- read_csv("./data/Placement_Data_Full_Class.csv") placement
DSI-Live Coding
Note, blocks or [annotations] are not for students, they left to communicate the structure and design details of the lesson
Useful links
- TidyDataTutor
- another useful package tidylog
- R4DS
Data
Let’s spend 10 minutes on practicing DPlyr verbs:
- https://tidydatatutor.com/ + practice
What are we working with, doc? [Context]
We will work with the Campus Recruitment (Academic and Employability Factors influencing placement) dataset, which contains (presumably simulated) data on student background and factors influencing placement.
Task
Traditional tasks for analysts would be to find interesting patterns or disparities in employment data.
Tips for reasoning, decision-making, and problem-solving [Supportive information]
At this stage we will focus on how simple and powerful data aggregation techniques can help us:
- understand hidden data patterns
- formulate hypotheses for future analysis
- bring some ideas back to senior analysts and stakeholders.
Remember:
- Analytics is about comparisons
- We might be interested in disparities, e.g. based on gender or work experience
- We are more likely to be interested in general patterns than precise numbers, at least until we can evaluate uncertainty
- Our ultimate goal is to support decision making, balancing precision and details
Let’s load data and look at variable names
#View(placement)
names(placement)
[1] "sl_no" "gender" "ssc_p" "ssc_b"
[5] "hsc_p" "hsc_b" "hsc_s" "degree_p"
[9] "degree_t" "workex" "etest_p" "specialisation"
[13] "mba_p" "status" "salary"
Filtering data
Let’s leave only gender, degree_t, degree_p, workex, specialisation, mba_p, status, salary
in our data
Variable | Notes | type |
---|---|---|
gender | ||
degree_t | major | categorical |
degree_p | ranking (degree percentage) | numeric |
workex | work experience | categorical |
specialisation | MBA specialization | categorical |
mba_p | MBA degree percentage | numeric |
status | placed or not | |
salary |
|>
placement select(gender, degree_t, degree_p, workex, specialisation, mba_p, status, salary) -> placement_clean
-> placement_clean ...
Let’s let R know that some of the columns are not just text, but categorical variables
|>
placement_clean mutate(gender = factor(gender),
degree_t = factor(degree_t),
workex = factor(workex),
specialisation = factor(specialisation),
degree_t = factor(degree_t),
status = factor(status)) -> placement_clean
Starting comparisons
What is the median degree percentage of placed vs non-placed people?
|>
placement_clean group_by(status) |>
summarise(median_p = median(degree_p))
# A tibble: 2 × 2
status median_p
<fct> <dbl>
1 Not Placed 61
2 Placed 68
… how about mba?
|>
placement_clean group_by(status) |>
summarise(median_mba_p = median(mba_p))
# A tibble: 2 × 2
status median_mba_p
<fct> <dbl>
1 Not Placed 60.7
2 Placed 62.2
Going deeper: apparent gender disparities
What is the median degree percentage of placed vs non-placed male and female graduates?
|>
placement_clean group_by(status, gender) |>
summarise(median_p = median(degree_p)) |>
arrange(gender, status)
# A tibble: 4 × 3
# Groups: status [2]
status gender median_p
<fct> <fct> <dbl>
1 Not Placed F 64.1
2 Placed F 69.2
3 Not Placed M 59
4 Placed M 66.9
Let’s make it easier to read putting values to columns?
|>
placement_clean group_by(status, gender) |>
summarise(median_p = median(degree_p)) |>
arrange(gender, status) |>
::pivot_wider(id_cols=status, names_from=gender, values_from=median_p) tidyr
# A tibble: 2 × 3
# Groups: status [2]
status F M
<fct> <dbl> <dbl>
1 Not Placed 64.1 59
2 Placed 69.2 66.9
Going deeper: beyond the average
Let’s try to see how degree percentages for placed vs non-placed people are distributed
|>
placement_clean group_by(status) |>
summarise(q = quantile(degree_p, c(0.1, 0.5, 0.9)), p = c(0.1, 0.5, 0.9))
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
# A tibble: 6 × 3
# Groups: status [2]
status q p
<fct> <dbl> <dbl>
1 Not Placed 53 0.1
2 Not Placed 61 0.5
3 Not Placed 68.0 0.9
4 Placed 60.6 0.1
5 Placed 68 0.5
6 Placed 77.6 0.9
make humanly readable?
|>
placement_clean group_by(status) |>
summarise(q = quantile(degree_p, c(0.1, 0.5, 0.9)), p = c(0.1, 0.5, 0.9)) |>
::pivot_wider(id_cols = status, names_from = p, values_from=q, names_prefix="p") tidyr
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
# A tibble: 2 × 4
# Groups: status [2]
status p0.1 p0.5 p0.9
<fct> <dbl> <dbl> <dbl>
1 Not Placed 53 61 68.0
2 Placed 60.6 68 77.6
Let’s try to see what how degree percentages for placed vs non-placed people are distributed, taking gender into account
|>
placement_clean group_by(status, gender) |>
summarise(q = quantile(degree_p, c(0.1, 0.5, 0.9)), p = c(0.1, 0.5, 0.9)) |>
::pivot_wider(id_cols = c(status, gender), names_from = p, values_from=q, names_prefix="p") tidyr
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
# A tibble: 4 × 5
# Groups: status, gender [4]
status gender p0.1 p0.5 p0.9
<fct> <fct> <dbl> <dbl> <dbl>
1 Not Placed F 54.4 64.1 73.1
2 Not Placed M 52.8 59 65
3 Placed F 61.1 69.2 81
4 Placed M 60.8 66.9 77.0
Going deeper: majors
What are top majors for placed and not?
|>
placement_clean group_by(status, degree_t) |>
summarise(major_n = n()) |>
arrange(status, desc(major_n))
# A tibble: 6 × 3
# Groups: status [2]
status degree_t major_n
<fct> <fct> <int>
1 Not Placed Comm&Mgmt 43
2 Not Placed Sci&Tech 18
3 Not Placed Others 6
4 Placed Comm&Mgmt 102
5 Placed Sci&Tech 41
6 Placed Others 5
What share each major occupies?
|>
placement_clean group_by(status, degree_t) |>
summarise(major_n = n()) |>
mutate(share = major_n/sum(major_n)) |>
arrange(status, desc(share))
# A tibble: 6 × 4
# Groups: status [2]
status degree_t major_n share
<fct> <fct> <int> <dbl>
1 Not Placed Comm&Mgmt 43 0.642
2 Not Placed Sci&Tech 18 0.269
3 Not Placed Others 6 0.0896
4 Placed Comm&Mgmt 102 0.689
5 Placed Sci&Tech 41 0.277
6 Placed Others 5 0.0338
What are some other tasks you could use for extra practice?
simple aggregation for comparison | aggregation with multiple groups | aggregation with complex/custom function for advanced comparison |
What is the median degree percentage of placed vs non-placed people? | What are median degree and mba percentages of placed vs non-placed people? | What are 0.1, 0.5, 0.9 quantiles for the degree percentage of placed vs non-placed people? |
What are top 3 undergrad majors for placed vs non-placed? | What are ranks of top 3 undergrad majors for non-placed among placed? | |
What are the median salaries for people with and without work experience? | What are the median salaries for female and male candidates with and without work experience? | What are 0.2, 0.5, 0.8 quantile salaries for female and male candidates with and without work experience? |