The group stage in the MongoDB aggregation pipeline helps us group data using any field in the MongoDB document. It is one of the most important and commonly used stages in the MongoDB aggregation pipeline.
In this article, we'll look at the $group stage and use various features that it provides. We'll be working with a sample collection of movies. There's going to be a playground link for each query so you can practice and learn by doing.
There's also an exercise at the end of this article for you to try out. It will help you solidify your understanding once you've finished this article.
Here are the things we will cover in this article:
- Find distinct using group by ♟
- Group by using multiple fields 🎳
- Using accumulator functions 🧩
- Using with $project ⚙️
- Sorting the results 📈
- $group vs $project stage 🪞
- Thing to note 📝
- Conclusion 🎬
- Exercise 🎐
Establishing the data
Before we jump into the aggregation pipeline and the group stage, we need some data to work with. I'm taking an example Movies
collection for understanding the concept here. Again, there'll be links to the playground for each query throughout the article.
Here's the Movies
collection with only 5 documents containing random data:
{
"name": "Spidey One way home",
"release_year": "2021",
"rating": 9,
"starring": [
"Tom Hanks",
"Tom Holland",
"Mark Zucks",
"Samy"
],
"runtime": 120,
"totalReviews": 2000,
"director": "Jon What"
},
{
"name": "The Arrival of a Train",
"release_year": "1896",
"rating": 6,
"starring": [
"Shawn Ching",
"Looker Blindspot",
"Tom Hanks"
],
"runtime": 115,
"totalReviews": 720,
"director": "Ricky"
},
{
"name": "Lost persuit of adventure",
"release_year": "2005",
"rating": 7.1,
"starring": [
"Jimmy simmon",
"Catarina"
],
"runtime": 150,
"totalReviews": 823,
"director": "Ricky"
},
{
"name": "Jungle Warrior",
"release_year": "2016",
"rating": 5.9,
"starring": [
"Stormer",
"Carmony",
"Tom Hanks"
],
"runtime": 150,
"totalReviews": 1368,
"director": "Whim Wailer"
},
{
"name": "The Last of the us all",
"release_year": "2005",
"rating": 8.5,
"starring": [
"Samy",
"George wise",
"Pennywise"
],
"runtime": 120,
"totalReviews": 1800,
"director": "Jon What"
}
Now that we have our sample collection, it's time to explore the $group stage ⚡
Find distinct using group by
To find the distinct items in a collection we can use the group stage on any field that we want to group by. This field will be unique in the output. Let's group the movies by their release year:
{
$group: {
_id: "$release_year"
}
}
Here's the output of the above query. Note that we only got unique release year values in the output.
[
{ "_id": "1896" },
{ "_id": "2016" },
{ "_id": "2021" },
{ "_id": "2005" }
]
Group by using multiple fields
Similar to grouping by a single field, we might want to group the data with more than one field as per our use case. MongoDB aggregation pipeline allows us to group by as many fields as we want.
Whatever we put inside the _id
field is used to group the documents i.e., it returns all the fields present inside the _id
field and groups by all of them.
Let's group the movies by their release year and their runtime:
{
$group: {
_id: {
"release_year": "$release_year",
"runtime": "$runtime"
}
}
}
Grouping by release year and their runtime gives us this output:
[
{
"_id": {
"release_year": "2005",
"runtime": 150
}
},
{
"_id": {
"release_year": "2021",
"runtime": 120
}
},
{
"_id": {
"release_year": "2016",
"runtime": 150
}
},
{
"_id": {
"release_year": "2005",
"runtime": 120
}
},
{
"_id": {
"release_year": "1896",
"runtime": 115
}
}
]
Instead of using a single field to group by, we are using multiple fields in above scenario. The combination of release year and runtime acts as the unique identifier for each document.
Using accumulator functions
There are a lot of accumulator functions available in the group stage which can be used to aggregate the data. They help us carry out some of most common operations on the grouped data. Let's take a look at some of them:
$count accumulator
$count accumulator is used to count the number of documents in the group. This can be combined with our group by query to get the total number of documents in the group.
Let's apply this to our movies collection:
{
$group: {
_id: "$release_year",
totalMovies: { $count: {} }
}
}
We'll get the total movies released in each year:
[
{
"_id": "2016",
"totalMovies": 1
},
{
"_id": "1896",
"totalMovies": 1
},
{
"_id": "2021",
"totalMovies": 1
},
{
"_id": "2005",
"totalMovies": 2
}
]
$sum accumulator
We can use the $sum accumulator to add up all the values in a field. Let's group the movies by their rating and sum up the reviews to understand if there's a correlation between movie rating and the number of reviews.
{
$group: {
_id: "$rating",
totalMovies: {
$sum: "$totalReviews"
}
}
}
And here we can see that there is a slight correlation between the number of reviews and the movie rating:
[
{
"_id": 9,
"totalMovies": 2000
},
{
"_id": 6,
"totalMovies": 720
},
{
"_id": 5.9,
"totalMovies": 1368
},
{
"_id": 8.5,
"totalMovies": 1800
},
{
"_id": 7.1,
"totalMovies": 823
}
]
$avg accumulator
We might want to examine which year has the highest average movies rating for analytical purposes. Let's see how we can get those stats from our data:
{
$group: {
_id: {
year: "$release_year",
},
avgRating: {
$avg: "$rating"
}
}
}
We are first grouping the movies by the release year and then calculating the average rating for each release year. Here's the output of the above query:
[
{
"_id": { "year": "2016" },
"avgRating": 5.9
},
{
"_id": { "year": "1896" },
"avgRating": 6
},
{
"_id": { "year": "2021" },
"avgRating": 9
},
{
"_id": { "year": "2005" },
"avgRating": 7.8
}
]
$push accumulator
We want to look at all the ratings movies received for every release year. Let's use the $push accumulator to get all the movie names for each year:
{
$group: {
_id: {
year: "$release_year",
},
ratings: {
$push: "$rating"
}
}
}
All the movie ratings for each release year are pushed into an array:
[
{
"_id": { "year": "1896" },
"ratings": [ 6 ]
},
{
"_id": { "year": "2016" },
"ratings": [ 5.9 ]
},
{
"_id": { "year": "2021" },
"ratings": [ 9 ]
},
{
"_id": { "year": "2005" },
"ratings": [ 7.1, 8.5 ]
}
]
$addToSet accumulator
You can consider this to be like the $push accumulator. $addToSet only adds the value to the array if it doesn't exist already. This is the only difference between $addToSet and $push.
Let's group by rating and see which (unique) release years produced each:
{
$group: {
_id: {
rating: "$rating"
},
releasedIn: {
"$addToSet": "$release_year"
}
}
}
We get ratings along with their unique release years:
[
{
"_id": { "rating": 8.5 },
"releasedIn": [ "2005" ]
},
{
"_id": { "rating": 7.1 },
"releasedIn": [ "2005" ]
},
{
"_id": { "rating": 9 },
"releasedIn": [ "2021" ]
},
{
"_id": { "rating": 6 },
"releasedIn": [ "1896" ]
},
{
"_id": { "rating": 5.9 },
"releasedIn": [ "2016" ]
}
]
$min accumulator
Let's say we want to find out successful release years for the movies. A year is considered successful if the all the movies released during that year have rating greater than 7. Let's use the $min accumulator to get the successful years:
{
$group: {
_id: {
year: "$release_year"
},
minRating: { $min: "$rating" }
}
},
{
"$match": {
minRating: { $gt: 7 }
}
}
- We have grouped the movies collection using the
release_year
field. - In addition to that, we have added
minRating
field which maintains the minimum rating for each release year. - We have also applied a
$match
stage to filter out the years which don't have a minimum rating greater than 7.
[
{
"_id": {
"year": "2021"
},
"minRating": 9
},
{
"_id": {
"year": "2005"
},
"minRating": 7.1
}
]
$first accumulator
This accumulator is different from the $first array operator which gives first element in an array. For each grouped documents, $first accumulator gives us the first one.
Let's fetch the highest rated movie for every release year. Since we want to get the highest rated document from the each group, we need to sort the documents before passing them to the group stage.
{
"$sort": {
"release_year": 1,
"rating": -1
}
},
{
$group: {
_id: "$release_year",
highestRating: {
$first: "$rating"
}
}
}
We are sorting using two fields here, release_year
and rating
. Let's understand the output of sort stage first:
[
{
"rating": 6,
"release_year": "1896"
},
{
"rating": 8.5,
"release_year": "2005"
},
{
"rating": 7.1,
"release_year": "2005"
},
{
"rating": 5.9,
"release_year": "2016"
},
{
"rating": 9,
"release_year": "2021"
}
]
The output is first sorted on the basis of ascending release year and then for each year, the movies are sorted in descending order of rating.
This sorted output is then passed to the group stage which groups the documents by their release year. For example, group stage is working with two documents for release year 2005:
{
"rating": 8.5,
"release_year": "2005"
},
{
"rating": 7.1,
"release_year": "2005"
}
Let's call these "shortlisted documents" for release year 2005. This happens for all (unique) release years. Group stage picks the first element from these shortlisted documents (which has the highest rating because ratings are sorted in descending order).
Combining the sort and group stages, here's the final output of the query:
[
{
"_id": "2016",
"highestRating": 5.9
},
{
"_id": "1896",
"highestRating": 6
},
{
"_id": "2021",
"highestRating": 9
},
{
"_id": "2005",
"highestRating": 8.5
}
]
NOTE: Passing sorted documents to $group stage does not guarantee that the order will be preserved.
Using with $project
The movie rating is a floating point number. We'll round that off to the nearest integer to get the movie rating as a whole number. Let's also group movies by their modified ratings:
{
"$project": {
rating: {
"$round": "$rating"
}
}
},
{
$group: {
_id: "$rating",
movies: {
$sum: 1
}
}
}
- We used $project stage to round off the rating to the nearest integer.
- We used $group stage to group the movies by their modified rating.
Here's the output of the above query:
[
{
"_id": 7,
"movies": 1
},
{
"_id": 8,
"movies": 1
},
{
"_id": 9,
"movies": 1
},
{
"_id": 6,
"movies": 2
}
]
The possibilities are endless. You can combine many other stages, perform some filters, put conditions or even $$REMOVE
the documents.
Sorting the results
The year with the highest movie minutes might give us some insights on movies production and its correlation with audience attention spans over the years. So let's understand how to achieve that:
{
$group: {
_id: "$release_year",
totalRuntime: {
"$sum": "$runtime"
}
}
},
{
"$sort": {
"totalRuntime": -1
}
}
We are fetching the total runtime of all the movies released in a particular year and then sorting them in descending order with the help of $sort stage:
[
{
"_id": "2005",
"totalRuntime": 270
},
{
"_id": "2016",
"totalRuntime": 150
},
{
"_id": "2021",
"totalRuntime": 120
},
{
"_id": "1896",
"totalRuntime": 115
}
]
It is evident from this query that the attention spans of the target audience have been decreasing in non-uniform way over the years.
$group vs $project stage
We have an n:1 relationship between input and output documents in the group stage. But, we have a 1:1 relationship in the $project stage.
In group stage we usually get a count, sum, average of documents based on the grouping key (or _id), or even build an array. All of these operations take n number of documents and the output of group is a single document with the aggregated values.
On the other hand, we include/exclude fields, perform field transformations within a single document in case of project stage in aggregation pipeline,
Thing to note
$group stage has a limit of 100 megabytes of RAM
If you're working with a massive dataset and you receive an error during group stage execution, you might be hitting the memory limit. If you want to increase it, use allowDiskUse
option to enable the $group stage to write to temporary files on disk.
The reason for this issue is very well stated in the mongoDB docs:
NOTE: Pipeline stages operate on streams of documents with each pipeline stage taking in documents, processing them, and then outputting the resulting documents. Some stages can't output any documents until they have processed all incoming documents. These pipeline stages must keep their stage output in RAM until all incoming documents are processed. As a result, these pipeline stages may require more space than the 100 MB limit.
Conclusion
And that was how the group stage works in mongoDB aggregation pipeline. We looked at how we can group data using single fields (distinct count), multiple fields, sort them, how we can carry out complex computations by adding conditions to group stage and the subtle difference between group and project stage. I hope you find this useful and interesting.
Let me know your thoughts and feedback on Twitter.
Exercise
To make sure you understand the concepts, I have curated a couple of questions related to what we've learned in this article. You can download exercise PDF below. It also contains working mongoDB playground links containing the answers for all the questions. Be honest, don't cheat 🙂.
Top comments (0)