MongoDB for Data Analytics: A User-Centred Approach

Ilya Musabirov

2024-02-23

Instructional Notes and Place in the Course

Place in the Course

  • Prerequisites: SQL, data manipulation in R or Python
  • I assume students have already had a 30-minute introduction to NoSQL/MongoDB
  • I show more tasks than we can fit into 30 minutes

Warm-Up Activity

Go to your worksheet and try answering the Warm-Up questions!

nosoc.io/ubc-worksheet

Instructor’s view

Dashboard

What Have We Learned So Far

  • Previous courses covered SQL databases (RDBMS) and data manipulation frameworks in R and Python.

  • Last time, we:

    • Introduced the NoSQL ecosystem.
    • Introduced MongoDB, a document-oriented NoSQL database.
    • Explored simple queries in MongoDB.
    • Utilized PyMongo for interacting with MongoDB.

Our Goals

By the End of This Module, We Will Be Able To:

  • Formulate data analysis tasks for MongoDB based on an understanding of user needs.
  • Implement flexible data analysis pipelines using the MongoDB Aggregation Framework.
  • Analyze tool suitability for a particular task by synthesizing data aggregation knowledge across MongoDB, SQL, and dplyr/pandas.

By the End of This Course, We Will Be Able To:

  • Prototype an analytical app for a music streaming service using MongoDB to provide insights to the users based on aggregated user preferences and listening history data.

Case Study: Prototyping an Analytical App for a Music Streaming Service

Prototype an analytical app for a music streaming service using MongoDB, including aggregating user preferences and listening history data to provide insights to users.

Our demo database has three collections: users, tracks, and listeningHistory (see the schema diagrams).

User-Centered Perspective

  • It is difficult to think like a user…
  • Simulating realistic interactions is one way to get closer.
  • Prototyping a user-facing app with realistic data might help!

Activity 1

  • Open the DemoApp: nosoc.io/ubc-app
  • Enter your username.
  • “Star” artists you like.
  • Press “Play” for some of the tracks (this just adds a record to our DB; we are not the real Spotify).
  • Open the worksheet and answer the question.

nosoc.io/ubc-worksheet

How to Connect to Our MongoDB Instance

Connection details

!pip install pymongo

import pprint
from pymongo import MongoClient, DESCENDING, ASCENDING

# Connect to the shared teaching cluster and select the musicdemo database
client = MongoClient("mongodb+srv://mongo7user-ubc:kNA39pxNpATC@teachingdemo.rz9rzak.mongodb.net/")
db = client.musicdemo
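To verify the connection works, here is a quick sanity check (a minimal sketch; the collection names are the ones we query below):

print(db.list_collection_names())  # should include 'users', 'tracks', 'listeningHistory'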

See Records We Just Created in Our App!

New users

# The two most recently created users (sorted by _id, newest first)
result = db.users.find().sort('_id', DESCENDING)[0:2]
for doc in result:
    pprint.pprint(doc)
{'_id': ObjectId('65d65b5420ed55b382164cd4'),
 'age': 30,
 'country': 'CA',
 'email': 'Fred@example.com',
 'gender': 'Male',
 'is_premium': False,
 'password': 'hashed_password',
 'preferences': [],
 'username': 'Fred'}
{'_id': 'u015',
 'age': 19,
 'country': 'Canada',
 'email': 'emma@example.com',
 'gender': 'Female',
 'is_premium': False,
 'password': 'hashed_password',
 'preferences': ['Hip Hop', 'Rap'],
 'username': 'Emma'}

New activity

# The two most recent listening-history records
result = db.listeningHistory.find().sort('_id', DESCENDING)[0:2]
for doc in result:
    pprint.pprint(doc)
{'_id': 'lH99',
 'action': 'full',
 'device': 'windows laptop',
 'timestamp': '2023-11-07T14:05:00Z',
 'track_id': 'M3N4',
 'user_id': 'u007'}
{'_id': 'lH98',
 'action': 'full',
 'device': 'mac',
 'timestamp': '2023-11-06T18:30:00Z',
 'track_id': 'K1L2',
 'user_id': 'u007'}

Activity 2: Why MongoDB?

Look at the data and think about the case in general.

In your worksheet, write down some reasons why MongoDB can be relevant (or not!) for our case.

Case: Prototype an analytical app for a music streaming service using MongoDB, including aggregating user preferences and listening history data to generate meaningful statistics for two separate stakeholders.

nosoc.io/ubc-worksheet

So, Why Might This Problem Require MongoDB?

  • Nested flexible structures.
  • Frequent schema changes.
  • Flexible analytics pipelines.
  • We inherited it.

Some considerations to keep in mind:

  • No silver bullet.
  • Complex choice between SQL-based, Hadoop-based, and other analytical solutions.
  • Extra work to connect BI tools (e.g., Tableau) to MongoDB.

Our Main Tool: MongoDB Aggregation Framework Pipeline

Approaches to querying MongoDB:

  • Basic queries (.find()), which we covered last time.
  • MongoDB Aggregation Framework.
  • Geospatial queries.
  • Full text search.
  • Vector engine.
  • Map-Reduce.

Simple Pipeline

$match Stage

  • Filters documents to pass only those that match the specified condition.
  • Examples:
    • {"$match": { "country": "UK" }}
    • {"$match": { "age": {"$lt": 25}}}

$project Stage

  • Reshapes documents by including, excluding, or renaming fields.
  • Examples:
    • {"$project": {"genre": 1, "_id": 0}}

$sort Stage

  • Sorts documents based on specified fields.
  • Example: {"$sort": { "age": -1 }}

$limit Stage

  • Limits the number of documents passed to the next stage.
  • Example: {"$limit": 5}

Composing A Simple Aggregation Pipeline

MongoDB Pipeline Checklist

Tip

  • Define stages and expected inputs and outputs.
  • What order do we want them to be in, and why?
  • Do we need to use some stages more than once?
  • In your code, first define every stage and assign it to a separate variable.
  • Define your pipeline(s) step-by-step.

Example Pipeline

Let’s explore how to build a simple pipeline that shows the artist, genre, and year of the tracks in our database released after 2015, limited to the 3 most recent tracks.

tracks = db.tracks

limit3 = {"$limit": 3}
match1 = {"$match": {"year": {"$gt": 2015}}}
project1 = {"$project": {"artist": 1, "_id": 0, "genre": 1, "year": 1}}
sort1 = {"$sort": {"year": -1}}

def run_print_short_pipeline(database, collection, pipeline):
    result = database[collection].aggregate(pipeline)
    pprint.pprint(list(result))  # NB! do not use list() with large result sets

pipeline1 = [match1, project1, limit3, sort1]
pipeline2 = [match1, project1, {"$limit": 10}, sort1]
pipeline3 = [ {"$limit": 10}, match1, project1, sort1]

run_print_short_pipeline(db, 'tracks', pipeline1)
run_print_short_pipeline(db, 'tracks', pipeline2)
run_print_short_pipeline(db, 'tracks', pipeline3)
  
[{'artist': 'James Arthur', 'genre': 'Pop', 'year': 2017},
 {'artist': 'Justin Timberlake', 'genre': 'Pop', 'year': 2016},
 {'artist': 'James Arthur', 'genre': 'Pop', 'year': 2016}]
[{'artist': 'Billie Eilish', 'genre': 'Pop', 'year': 2019},
 {'artist': 'Travis Scott', 'genre': 'Hip Hop', 'year': 2018},
 {'artist': 'Drake', 'genre': 'Hip Hop', 'year': 2018},
 {'artist': 'Drake', 'genre': 'Hip Hop', 'year': 2018},
 {'artist': 'James Arthur', 'genre': 'Pop', 'year': 2017},
 {'artist': 'Ed Sheeran', 'genre': 'Pop', 'year': 2017},
 {'artist': 'Justin Timberlake', 'genre': 'Pop', 'year': 2016},
 {'artist': 'James Arthur', 'genre': 'Pop', 'year': 2016},
 {'artist': 'Billie Eilish', 'genre': 'Pop', 'year': 2016},
 {'artist': 'Travis Scott', 'genre': 'Hip Hop', 'year': 2016}]
[]

Note that pipeline3 returns an empty result: $limit runs first, so only the first 10 documents (in natural order) reach $match, and apparently none of them has a year after 2015. Stage order matters!

Activity 3

In your worksheet, find the correct order for the pipeline stages.

nosoc.io/ubc-worksheet

Aggregation Pipeline

$group Stage

  • Groups documents by a specified key and applies accumulator expressions, which you will get to know in more detail during the lab and homework.
  • Examples (see the runnable sketch below):
    • {"$group": { "_id": "$genre", "total": { "$sum": 1 } }} # note the $ prefix: "$genre" references a field
    • {"$group": { "_id": "$genre", "avgYear": {"$avg": "$year"}}}
  • SQL equivalent: SELECT genre, COUNT(*) AS total FROM tracks GROUP BY genre
  • dplyr equivalent: tracks |> group_by(genre) |> summarise(total = n())
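A quick runnable sketch of the first example (collection and field names as in our tracks data):

# Count tracks per genre, most common genre first
group_by_genre = {"$group": {"_id": "$genre", "total": {"$sum": 1}}}
sort_by_total = {"$sort": {"total": -1}}

for doc in db.tracks.aggregate([group_by_genre, sort_by_total]):
    print(f"{doc['_id']}: {doc['total']}")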

Activity 4: Playlist for Millennials

Select genres whose tracks released after 1980 have been played more than 100 times in total.

Looking at the data structure and pipeline design checklist, write down your notes on how you can approach this:

  • Define stages and expected inputs and outputs.
  • What order do we want them to be in, and why?
  • Do we need to use some stages more than once?

nosoc.io/ubc-worksheet

One possible solution:

match_after1980 = {"$match": {"year": {"$gt": 1980}}}
group_genres_popular = {"$group": {"_id": "$genre",
                                   "totalPlayCount": {"$sum": "$play_count"}}}
match_popular100 = {"$match": {"totalPlayCount": {"$gt": 100}}}

result = db.tracks.aggregate([
    match_after1980, group_genres_popular, match_popular100
])

for doc in result:
    print(f"{doc['_id']} - {doc['totalPlayCount']}")
Hip Hop - 4808
Alternative Rock - 907
Pop - 13412
Rock - 1810

$lookup

  • Combines documents from one collection with documents from another based on shared field(s)
  • Example:
{
   "$lookup":
     {
       "from": "users",
       "localField": "user_id",
       "foreignField": "_id",
       "as": "user_details"
     }
}
  • SQL/Tidyverse equivalent: a left join
  • An inner join can be emulated by post-filtering the joined documents (see the sketch below)
  • If there is more than one matching document in the lookup collection, the matches are placed in an array
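A minimal sketch of that inner-join emulation, reusing the users lookup above: documents without a match get an empty user_details array, so filtering those out keeps only joined documents.

lookup_users = {
    "$lookup": {
        "from": "users",
        "localField": "user_id",
        "foreignField": "_id",
        "as": "user_details"
    }
}
# Drop documents whose lookup found no matching user (inner-join behavior)
match_has_user = {"$match": {"user_details": {"$ne": []}}}

for doc in db.listeningHistory.aggregate([lookup_users, match_has_user, {"$limit": 2}]):
    pprint.pprint(doc)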

Example: Favourite Artists of Gen X

One way to build this pipeline:

limit3 = {"$limit": 3}

# Join each listening event with its track (listeningHistory stores track_id,
# which corresponds to _id in the tracks collection)
lookup_tracks = {
    '$lookup': {
        'from': 'tracks',
        'localField': 'track_id',
        'foreignField': '_id',
        'as': 'track_details'
    }
}

# Keep only plays of tracks released before 1980
match_before1980 = {
    '$match': {
        'track_details.year': {"$lt": 1980}
    }
}

# Count plays per artist; note that track_details is an array (with unique _id
# values it holds at most one element)
group_playcount = {"$group": {"_id": "$track_details.artist", "playcount": {"$sum": 1}}}

sort_playcount = {"$sort": {"playcount": -1}}

pipeline = [lookup_tracks, match_before1980, group_playcount, sort_playcount, limit3]

# Execute the aggregation
results = db.listeningHistory.aggregate(pipeline)
for doc in results:
    print(doc)

Activity 5. Recap Key Ideas

Let’s think about the key concepts we learned.

  • Think for 2 minutes and write them down in your worksheet.
  • Discuss ideas in groups of 2 or 3 with your neighbors for 3 minutes.
  • Update your notes.

nosoc.io/ubc-worksheet

Recap

Pipeline Stages

Pipeline Design Checklist

  • Define stages and expected inputs and outputs
  • What order do we want them to be in, and why?
  • Do we need to use some stages more than once?
  • In your code, first define every stage and assign it to a separate variable
  • Define your pipeline(s) step-by-step

Activity 6. Insights for Our Users

Finally, let’s generate some ideas of useful features we can prototype for our users!

  • Think for 2 minutes about what needs of users we can support with our data, relying on the tools we learned.
  • Discuss ideas in groups of 2 or 3 with your neighbors.
  • Write down the 2 ideas you like most in your worksheet.

nosoc.io/ubc-worksheet

That is all for today!

Instructional Notes and Reflection

What Would’ve Been Covered in the Rest of an Hour-Long Lecture and During the Lab?

  • Aggregation operators.
  • Extra stages (e.g., $sample, $set, $unset).
  • Operations on nested structures (e.g., $unwind, $group + $push, $addToSet; see the sketch below).
  • Optimization and indexing (the explain command).
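As a teaser for the lab, a hedged sketch of one nested-structure operation (assuming the preferences array in our users documents):

# For each country, collect the distinct genres its users list in preferences
pipeline = [
    {"$unwind": "$preferences"},                            # one doc per (user, genre)
    {"$group": {"_id": "$country",
                "genres": {"$addToSet": "$preferences"}}},  # de-duplicated genres
]
for doc in db.users.aggregate(pipeline):
    pprint.pprint(doc)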

Evidence-Informed Learning Design Approaches

  • Warm-up task helps to retrieve schemata and provide spaced practice.
  • Live coding + task and process worked examples.
  • High-level conceptual tasks (Parsons problems, mix-and-match).
  • Think-pair-share activities + Active Learning Worksheet.

Educational Technology and Scalability

Automating Active Learning Worksheets

  • Partial automatic grading for immediate feedback.
  • Backchannel with TAs/instructor during the class.
  • (Bi-gram) word clouds.

Other Tools for Scalability and Practice

  • Dropdowns with contextual actions, hints, reflection prompts.
  • Mindful learnersourcing of examples and motivational prompts.

Authentic Tasks and Cross-course Connections

  • This module can be a part of Databases, ML Systems, or Web and Cloud Computing courses.

  • Designing/re-designing the course around cases/authentic tasks (cf. the 4C/ID model) can help form complex skills relying on indirect import from other courses, e.g.:

    • Data Science Workflows.
    • Data Visualization.
    • Collaborative Software Development principles.

Example Layout of ML Systems Course

Educational Leadership Vision