MongoDB for Data Analytics: A User-Centred Approach

Ilya Musabirov

2024-02-23

Instructional Notes and Place in the Course

Place in the Course

  • Prerequisites: SQL, data manipulation in R or Python
  • I assume students have already had a 30-minute introduction to NoSQL/MongoDB
  • I show more tasks than we can fit into 30 minutes

Warm-Up Activity

Go to your worksheet and try answering the Warm-Up questions!

nosoc.io/ubc-worksheet

Instructor’s view

Dashboard

What Have We Learned So Far

  • Previous courses covered SQL databases (RDBMS) and data manipulation frameworks in R and Python.

  • Last time, we:

    • Introduced the NoSQL ecosystem.
    • Introduced MongoDB, a document-oriented NoSQL database.
    • Explored simple queries in MongoDB.
    • Utilized PyMongo for interacting with MongoDB.

Our Goals

By the End of This Module, We Will Be Able To:

  • Formulate data analysis tasks for MongoDB based on an understanding of user needs.
  • Implement flexible data analysis pipelines using the MongoDB Aggregation Framework.
  • Analyze tool suitability for a particular task by synthesizing data aggregation knowledge across MongoDB, SQL, and dplyr/pandas.

By the End of This Course, We Will Be Able To:

  • Prototype an analytical app for a music streaming service using MongoDB to provide insights to the users based on aggregated user preferences and listening history data.

Case Study: Prototyping an Analytical App for a Music Streaming Service

Prototype an analytical app for a music streaming service using MongoDB, including aggregating user preferences and listening history data to provide insights to users.

Our demo database has three collections: users, tracks, and listeningHistory (see the schema diagrams).

User-Centered Perspective

  • It is difficult to think like a user…
  • Simulating realistic interactions is one way to get closer.
  • Prototyping a user-facing app with realistic data might help!

Activity 1

  • Open the DemoApp: nosoc.io/ubc-app
  • Enter your username.
  • “Star” artists you like.
  • Press “Play” for some of the tracks (this just adds a record to our DB; we are not the real Spotify).
  • Open the worksheet and answer the question.

nosoc.io/ubc-worksheet

How to Connect to Our MongoDB Instance

Connection details

!pip install pymongo

import pprint
from pymongo import MongoClient, DESCENDING, ASCENDING

# Connect to the shared teaching cluster and select the musicdemo database
client = MongoClient("mongodb+srv://mongo7user-ubc:kNA39pxNpATC@teachingdemo.rz9rzak.mongodb.net/")
db = client.musicdemo
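To verify the connection works, here is a quick sanity check (a minimal sketch; the collection names are the ones we query below):

print(db.list_collection_names())  # should include 'users', 'tracks', 'listeningHistory'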

See Records We Just Created in Our App!

New users

# The two most recently created users (sorted by _id, newest first)
result = db.users.find().sort('_id', DESCENDING)[0:2]
for doc in result:
    pprint.pprint(doc)
{'_id': ObjectId('65d65b5420ed55b382164cd4'),
 'age': 30,
 'country': 'CA',
 'email': 'Fred@example.com',
 'gender': 'Male',
 'is_premium': False,
 'password': 'hashed_password',
 'preferences': [],
 'username': 'Fred'}
{'_id': 'u015',
 'age': 19,
 'country': 'Canada',
 'email': 'emma@example.com',
 'gender': 'Female',
 'is_premium': False,
 'password': 'hashed_password',
 'preferences': ['Hip Hop', 'Rap'],
 'username': 'Emma'}

New activity

# The two most recent listening-history records
result = db.listeningHistory.find().sort('_id', DESCENDING)[0:2]
for doc in result:
    pprint.pprint(doc)
{'_id': 'lH99',
 'action': 'full',
 'device': 'windows laptop',
 'timestamp': '2023-11-07T14:05:00Z',
 'track_id': 'M3N4',
 'user_id': 'u007'}
{'_id': 'lH98',
 'action': 'full',
 'device': 'mac',
 'timestamp': '2023-11-06T18:30:00Z',
 'track_id': 'K1L2',
 'user_id': 'u007'}

Activity 2: Why MongoDB?

Look at the data and think about the case in general.

In your worksheet, write down some reasons why MongoDB can be relevant (or not!) for our case.

Case: Prototype an analytical app for a music streaming service using MongoDB, including aggregating user preferences and listening history data to generate meaningful statistics for two separate stakeholders.

nosoc.io/ubc-worksheet

So, Why Might This Problem Require MongoDB?

  • Nested flexible structures.
  • Frequent schema changes.
  • Flexible analytics pipelines.
  • We inherited it.

Some considerations to keep in mind:

  • No silver bullet.
  • Complex choice between SQL-based, Hadoop-based, and other analytical solutions.
  • Extra work to connect BI tools (e.g., Tableau) to MongoDB.

Our Main Tool: MongoDB Aggregation Framework Pipeline

Approaches to querying MongoDB:

  • Basic queries (.find()), which we covered last time.
  • MongoDB Aggregation Framework.
  • Geospatial queries.
  • Full text search.
  • Vector engine.
  • Map-Reduce.

Simple Pipeline

$match Stage

  • Filters documents to pass only those that match the specified condition.
  • Examples:
    • {"$match": { "country": "UK" }}
    • {"$match": { "age": {"$lt": 25}}}

$project Stage

  • Reshapes documents by including, excluding, or renaming fields.
  • Examples:
    • {"$project": {"genre": 1, "_id": 0}}

$sort Stage

  • Sorts documents based on specified fields.
  • Example: {"$sort": { "age": -1 }}

$limit Stage

  • Limits the number of documents passed to the next stage.
  • Example: {"$limit": 5}

Composing A Simple Aggregation Pipeline

MongoDB Pipeline Checklist

Tip

  • Define stages and expected inputs and outputs.
  • What order do we want them to be in, and why?
  • Do we need to use some stages more than once?
  • In your code, first define every stage and assign it to a separate variable.
  • Define your pipeline(s) step-by-step.

Example Pipeline

Let’s explore how to build a simple pipeline that shows the artist, genre, and year of the tracks in our database released after 2015, limited to the 3 most recent tracks.

tracks = db.tracks

limit3 = {"$limit": 3}
match1 = {"$match": {"year": {"$gt": 2015}}}
project1 = {"$project": {"artist": 1, "_id": 0, "genre": 1, "year": 1}}
sort1 = {"$sort": {"year": -1}}

def run_print_short_pipeline(database, collection, pipeline):
    result = database[collection].aggregate(pipeline)
    pprint.pprint(list(result))  # NB! do not use list() with large result sets

pipeline1 = [match1, project1, limit3, sort1]
pipeline2 = [match1, project1, {"$limit": 10}, sort1]
pipeline3 = [ {"$limit": 10}, match1, project1, sort1]

run_print_short_pipeline(db, 'tracks', pipeline1)
run_print_short_pipeline(db, 'tracks', pipeline2)
run_print_short_pipeline(db, 'tracks', pipeline3)
  
[{'artist': 'James Arthur', 'genre': 'Pop', 'year': 2017},
 {'artist': 'Justin Timberlake', 'genre': 'Pop', 'year': 2016},
 {'artist': 'James Arthur', 'genre': 'Pop', 'year': 2016}]
[{'artist': 'Billie Eilish', 'genre': 'Pop', 'year': 2019},
 {'artist': 'Travis Scott', 'genre': 'Hip Hop', 'year': 2018},
 {'artist': 'Drake', 'genre': 'Hip Hop', 'year': 2018},
 {'artist': 'Drake', 'genre': 'Hip Hop', 'year': 2018},
 {'artist': 'James Arthur', 'genre': 'Pop', 'year': 2017},
 {'artist': 'Ed Sheeran', 'genre': 'Pop', 'year': 2017},
 {'artist': 'Justin Timberlake', 'genre': 'Pop', 'year': 2016},
 {'artist': 'James Arthur', 'genre': 'Pop', 'year': 2016},
 {'artist': 'Billie Eilish', 'genre': 'Pop', 'year': 2016},
 {'artist': 'Travis Scott', 'genre': 'Hip Hop', 'year': 2016}]
[]

Note that pipeline3 returns an empty result: $limit runs first, so only the first 10 documents (in natural order) reach $match, and apparently none of them has a year after 2015. Stage order matters!

Activity 3

In your worksheet, find the correct order for the pipeline stages.

nosoc.io/ubc-worksheet

Aggregation Pipeline

$group Stage

  • Groups documents by a specified key and applies accumulator expressions, which you will get to know in more detail during the lab and homework.
  • Examples (see the runnable sketch below):
    • {"$group": { "_id": "$genre", "total": { "$sum": 1 } }} # note the $ prefix: "$genre" references a field
    • {"$group": { "_id": "$genre", "avgYear": {"$avg": "$year"}}}
  • SQL equivalent: SELECT genre, COUNT(*) AS total FROM tracks GROUP BY genre
  • dplyr equivalent: tracks |> group_by(genre) |> summarise(total = n())
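A quick runnable sketch of the first example (collection and field names as in our tracks data):

# Count tracks per genre, most common genre first
group_by_genre = {"$group": {"_id": "$genre", "total": {"$sum": 1}}}
sort_by_total = {"$sort": {"total": -1}}

for doc in db.tracks.aggregate([group_by_genre, sort_by_total]):
    print(f"{doc['_id']}: {doc['total']}")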

Activity 4: Playlist for Millennials

Select genres whose tracks released after 1980 have been played more than 100 times in total.

Looking at the data structure and pipeline design checklist, write down your notes on how you can approach this:

  • Define stages and expected inputs and outputs.
  • What order do we want them to be in, and why?
  • Do we need to use some stages more than once?

nosoc.io/ubc-worksheet

One possible solution:

match_after1980 = {"$match": {"year": {"$gt": 1980}}}
group_genres_popular = {"$group": {"_id": "$genre",
                                   "totalPlayCount": {"$sum": "$play_count"}}}
match_popular100 = {"$match": {"totalPlayCount": {"$gt": 100}}}

result = db.tracks.aggregate([
    match_after1980, group_genres_popular, match_popular100
])

for doc in result:
    print(f"{doc['_id']} - {doc['totalPlayCount']}")
Hip Hop - 4808
Alternative Rock - 907
Pop - 13412
Rock - 1810

$lookup

  • Combines documents from one collection with documents from another based on shared field(s)
  • Example:
{
   "$lookup":
     {
       "from": "users",
       "localField": "user_id",
       "foreignField": "_id",
       "as": "user_details"
     }
}
  • SQL/Tidyverse equivalent: a left join
  • An inner join can be emulated by post-filtering the joined documents (see the sketch below)
  • If there is more than one matching document in the lookup collection, the matches are placed in an array
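A minimal sketch of that inner-join emulation, reusing the users lookup above: documents without a match get an empty user_details array, so filtering those out keeps only joined documents.

lookup_users = {
    "$lookup": {
        "from": "users",
        "localField": "user_id",
        "foreignField": "_id",
        "as": "user_details"
    }
}
# Drop documents whose lookup found no matching user (inner-join behavior)
match_has_user = {"$match": {"user_details": {"$ne": []}}}

for doc in db.listeningHistory.aggregate([lookup_users, match_has_user, {"$limit": 2}]):
    pprint.pprint(doc)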

Example: Favourite Artists of Gen X

One way to build this pipeline:

limit3 = {"$limit": 3}

# Join each listening event with its track (listeningHistory stores track_id,
# which corresponds to _id in the tracks collection)
lookup_tracks = {
    '$lookup': {
        'from': 'tracks',
        'localField': 'track_id',
        'foreignField': '_id',
        'as': 'track_details'
    }
}

# Keep only plays of tracks released before 1980
match_before1980 = {
    '$match': {
        'track_details.year': {"$lt": 1980}
    }
}

# Count plays per artist; note that track_details is an array (with unique _id
# values it holds at most one element)
group_playcount = {"$group": {"_id": "$track_details.artist", "playcount": {"$sum": 1}}}

sort_playcount = {"$sort": {"playcount": -1}}

pipeline = [lookup_tracks, match_before1980, group_playcount, sort_playcount, limit3]

# Execute the aggregation
results = db.listeningHistory.aggregate(pipeline)
for doc in results:
    print(doc)

Activity 5. Recap Key Ideas

Let’s think about the key concepts we learned.

  • Think for 2 minutes and write them down in your worksheet.
  • Discuss ideas in groups of 2 or 3 with your neighbors for 3 minutes.
  • Update your notes.

nosoc.io/ubc-worksheet

Recap

Pipeline Stages

Pipeline Design Checklist

  • Define stages and expected inputs and outputs
  • What order do we want them to be in, and why?
  • Do we need to use some stages more than once?
  • In your code, first define every stage and assign it to a separate variable
  • Define your pipeline(s) step-by-step

Activity 6. Insights for Our Users

Finally, let’s generate some ideas of useful features we can prototype for our users!

  • Think for 2 minutes about what needs of users we can support with our data, relying on the tools we learned.
  • Discuss ideas in groups of 2 or 3 with your neighbors.
  • Write down the 2 ideas you like most in your worksheet.

nosoc.io/ubc-worksheet

That is all for today!

Instructional Notes and Reflection

What Would’ve Been Covered in the Rest of an Hour-Long Lecture and During the Lab?

  • Aggregation operators.
  • Extra stages (e.g., $sample, $set, $unset).
  • Operations on nested structures (e.g., $unwind, $group + $push, $addToSet; see the sketch below).
  • Optimization and indexing (the explain command).
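As a teaser for the lab, a hedged sketch of one nested-structure operation (assuming the preferences array in our users documents):

# For each country, collect the distinct genres its users list in preferences
pipeline = [
    {"$unwind": "$preferences"},                            # one doc per (user, genre)
    {"$group": {"_id": "$country",
                "genres": {"$addToSet": "$preferences"}}},  # de-duplicated genres
]
for doc in db.users.aggregate(pipeline):
    pprint.pprint(doc)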

Evidence-Informed Learning Design Approaches

  • Warm-up task helps to retrieve schemata and provide spaced practice.
  • Live coding + task and process worked examples.
  • High-level conceptual tasks (Parsons problems, mix-and-match).
  • Think-pair-share activities + Active Learning Worksheet.

Educational Technology and Scalability

Automating Active Learning Worksheets

  • Partial automatic grading for immediate feedback.
  • Backchannel with TAs/instructor during the class.
  • (Bi-gram) word clouds.

Other Tools for Scalability and Practice

  • Dropdowns with contextual actions, hints, reflection prompts.
  • Mindful learnersourcing of examples and motivational prompts.

Authentic Tasks and Cross-course Connections

  • This module can be a part of Databases, ML Systems, or Web and Cloud Computing courses.

  • Designing/re-designing the course around cases/authentic tasks (cf. the 4C/ID model) can help form complex skills relying on indirect import from other courses, e.g.:

    • Data Science Workflows.
    • Data Visualization.
    • Collaborative Software Development principles.

Example Layout of ML Systems Course

Educational Leadership Vision