Using new aggregation pipeline capabilities in Amazon DocumentDB to build powerful aggregation queries

Amazon DocumentDB (with MongoDB compatibility) is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. You can use the same MongoDB application code, drivers, and tools as you do today to run, manage, and scale workloads on Amazon DocumentDB. This way, you can enjoy improved performance, scalability, and availability, without having to worry about managing the underlying infrastructure.

Building on the features described in our February blog post, Amazon DocumentDB has added support for additional aggregation pipeline capabilities. The new capabilities include:

  • Seven aggregation string operators ($indexOfCP, $indexOfBytes, $strLenCP, $strLenBytes, $toLower, $toUpper, $split)
  • Nine date time operators ($dayOfYear, $dayOfMonth, $dayOfWeek, $year, $month, $hour, $minute, $second, and $millisecond)
  • The $sample aggregation pipeline stage

In this blog post, I introduce you to these new capabilities by showing how to use these operators for common use cases.

Getting started with Amazon DocumentDB

If you’re new to Amazon DocumentDB, you can get started with one of the following:

 New capabilities

After you create and connect to your Amazon DocumentDB cluster, you can better understand and extend the following examples.

1. Aggregation String Operators

Strings that are stored in documents can be manipulated to extract pieces that you want for many use cases, including further processing and comparisons, or for cleaning up the data. Instead of processing the strings in application code, you can use the aggregation string operators to push the data processing down to the database. Amazon DocumentDB now supports all of the MongoDB 3.6 string operators.

To understand how these new operators work, imagine that you have built a directory application that your employees use to find out a person’s role, manager, and desk location. Now imagine that you have a collection in Amazon DocumentDB containing documents that describe people in the company who are located in Germany. The taxonomy for denoting the desk location of “city-building-seat” is slightly different for each employee.

Using the new string operators, I show how we extract information from these strings to make the data more usable in our application.

Input

Each document in this example represents a single person. It contains their name, manager’s name, role, and the company-specific identifier for their desk location.

db.people.insertMany([
{ "_id":1, "name":"John Doe", "Manager":"Jane Doe", "Role":"Developer", "Desk": " Düsseldorf-BVV-021"},
{ "_id":2, "name":"John Stiles", "Manager":"Jane Doe", "Role":"Manager", "Desk": " Munich-HGG-32a"},
{ "_id":3, "name":"Richard Roe", "Manager":"Jorge Souza", "Role":"Product", "Desk": " Cologne-ayu-892.50"},
{ "_id":4, "name":"Mary Major", "Manager":"Jane Doe", "Role":"Solution Architect", "Desk": " Dortmund-Hop-78"}])

$indexOfCP

In order to parse the desk field string, it is useful to locate the index, in code points (CP), for the first occurrence of a particular character in the string. To do this, you can use the $indexOfCP operator. In the query below, I search for the first occurrence of the hyphen character.

Query:

db.people.aggregate(
[
 { $project: { stateLocation: { $indexOfCP: [ "$Desk", "-"] } } }
])

Result:

{ "_id" : 1, "stateLocation" : 11 }
{ "_id" : 2, "stateLocation" : 7 }
{ "_id" : 3, "stateLocation" : 8 }
{ "_id" : 4, "stateLocation" : 9 }

The result shows that we have enumerated the index for the first location of the hyphen (“-“). We can use that information with operators like $substrCP and $substrBytes to further parse and extract content from the string.

$indexOfBytes

$indexOfBytes is similar to $indexofCP, except that it returns the index in the number of bytes. Note the difference between the two measures in the following examples.

Query:

db.people.aggregate(
[
 { $project: { stateLocation: { $indexOfBytes: [ "$Desk", "-"] } } }
]) 

Result:

{ "_id" : 1, "stateLocation" : 12 }
{ "_id" : 2, "stateLocation" : 7 }
{ "_id" : 3, "stateLocation" : 8 }
{ "_id" : 4, "stateLocation" : 9 }

As you can see, the location (in bytes) for each of these strings is a little different. Because the character “ü” in “Düsseldorf” is represented as two bytes, the index using $indexOfBytes is 12, and the index using $indexOfCP is 11.

$strLenCP

To determine the number of code points in the string, you can use the $strLenCP operator.

Query:

db.people.aggregate(
[{
  $project: {"Desk": 1, "length": { $strLenCP: "$Desk" }}}
])

Result:

{ "_id" : 1, "Desk" : " Düsseldorf-BVV-021", "length" : 19 }
{ "_id" : 2, "Desk" : " Munich-HGG-32a", "length" : 15 }
{ "_id" : 3, "Desk" : " Cologne-ayu-892.50", "length" : 19 }
{ "_id" : 4, "Desk" : " Dortmund-Hop-78", "length" : 16 }

$strLenBytes

Similarly, the $strLenBytes operator can be used for determining the length of a string in bytes.

Query:

db.people.aggregate(
[{
  $project: {"Desk": 1, "length": { $strLenBytes: "$Desk" }}}
])

Result:

{ "_id" : 1, "Desk" : "Düsseldorf-BVV-021", "length" : 20 }
{ "_id" : 2, "Desk" : "Munich-HGG-32a", "length" : 15 }
{ "_id" : 3, "Desk" : "Cologne-ayu-892.50", "length" : 19 }
{ "_id" : 4, "Desk" : "Dortmund-Hop-78", "length" : 16 }

Again, notice the difference in the measure of the length between code points and bytes using the “Düsseldorf-BVV-021” string.

$toLower

When cleaning up or comparing the equivalency of two strings, it’s helpful to make all of the characters in the string either lower case or upper case. In the query below, I use the $toLower operator to convert all of the characters in the desk string to lower case.

Query:

db.people.aggregate([
 {$project:{item: { $toLower: "$Desk" }}}
])

Result:

{ "_id" : 1, "item" : " düsseldorf-bvv-021" }
{ "_id" : 2, "item" : " munich-hgg-32a" }
{ "_id" : 3, "item" : " cologne-ayu-892.50" }
{ "_id" : 4, "item" : " dortmund-hop-78" }

$toUpper

Similar to $toLower, $toUpper converts all characters in a string to upper case characters.

Query:

db.people.aggregate([
 {$project:{item: { $toUpper: "$Desk" }}}
])

Result:

{ "_id" : 1, "item" : " DüSSELDORF-BVV-021" }
{ "_id" : 2, "item" : " MUNICH-HGG-32A" }
{ "_id" : 3, "item" : " COLOGNE-AYU-892.50" }
{ "_id" : 4, "item" : " DORTMUND-HOP-78" }

$split

While locating specific indexes with a string is useful for targeted operators, it can also be helpful to split a string into an array, based on a delimiter. You can write complex queries to locate each delimiter. Or you can use $split for this task, using a single command. In the query below, I separate each component of our desk location into an array. This makes it easier to process the data in the string.

Query:

db.people.aggregate([
 {$project:{parts: { $split: ["$Desk","-"]}}}
])

Result:

{ "_id" : 1, "parts" : [ " Düsseldorf", "BVV", "021" ] }
{ "_id" : 2, "parts" : [ " Munich", "HGG", "32a" ] }
{ "_id" : 3, "parts" : [ " Cologne", "ayu", "892.50" ] }
{ "_id" : 4, "parts" : [ " Dortmund", "Hop", "78" ] }

The output of $split creates an array that we can then use in our application to display the information for our employees.

2. Date Aggregation Operators

The date aggregation operators enable you to extract different parts of a given date or a timestamp. These operators are useful for extracting exactly what you need from a date in order to process the data in your application. For example, you might want to extract a birthday month, or the day of the week on which an order was placed. You also might want to display parts of a date in the way that your application needs. We added nine new date time operators: $dayOfYear, $dayOfMonth, $dayOfWeek, $year, $month, $hour, $minute, $second, and $millisecond.

Consider an example of storing measurements from a weather station. The following input shows the insertion of a single measurement.

Input:

db.weather.insert({
  "temperature" : 97.5,
  "humidity": 0.60,
  "date" : new Date() 
})

With the newly added date aggregation operators, I can extract multiple different components of the timestamp.

Query:

db.weather.aggregate([
{$project:{
   year: { $year: "$date" },
   month: { $month: "$date" },
   day: { $dayOfMonth: "$date" },
   hour: { $hour: "$date" },
   minutes: { $minute: "$date" },
   seconds: { $second: "$date" },
   milliseconds: { $millisecond: "$date" },
   dayOfYear: { $dayOfYear: "$date" },
   dayOfWeek: { $dayOfWeek: "$date" }}}
]).pretty()

Result:

{
	"_id" : ObjectId("5c85a290f9d5147c8653f952"),
	"year" : 2019,
	"month" : 1,
	"day" : 9,
	"hour" : 9,
	"minutes" : 45,
	"seconds" : 22,
	"milliseconds" : 981,
	"dayOfYear" : 9,
	"dayOfWeek" : 4
}

As you can see in the results, I can extract the year, month, day, hour, minutes, seconds, milliseconds, day of the year, day of the week, and week number from my timestamp. This gives me a lot of flexibility in how I can use this data.

3. Sample

To continue with the weather example, consider a use case where you want to sample data rather than analyze every data point. This is especially helpful when dealing with measurements or telemetry. To sample your data in Amazon DocumentDB, you can use the $sample aggregation pipeline stage.

Input:

db.temp.insertMany([
{ "_id": 1, "temperature" : 97.5, "humidity": 0.61, "timestamp" : new Date() }, 
{ "_id": 2, "temperature" : 97.2, "humidity": 0.60, "timestamp" : new Date() },
{ "_id": 3, "temperature" : 97.4, "humidity": 0.61, "timestamp" : new Date() },
{ "_id": 4, "temperature" : 97.9, "humidity": 0.61, "timestamp" : new Date() },
{ "_id": 5, "temperature" : 97.6, "humidity": 0.61, "timestamp" : new Date() },
{ "_id": 6, "temperature" : 97.5, "humidity": 0.62, "timestamp" : new Date() },
{ "_id": 7, "temperature" : 97.2, "humidity": 0.62, "timestamp" : new Date() },
{ "_id": 8, "temperature" : 97.1, "humidity": 0.63, "timestamp" : new Date() },
{ "_id": 9, "temperature" : 96.9, "humidity": 0.62, "timestamp" : new Date() },
{ "_id": 10, "temperature" : 97.4, "humidity": 0.63, "timestamp" : new Date()}
])

The following query randomly selects two documents from the temp collection. I can change the number of documents that I want randomly returned to me by increasing or decreasing the size.

Query:

db.temp.aggregate(
   [ { $sample: { size: 2 } } ]
)

Result:

{ "_id" : 4, "temperature" : 97.9, "humidity" : 0.61, "timestamp" : ISODate("2019-03-21T21:17:22.425Z") }
{ "_id" : 9, "temperature" : 96.9, "humidity" : 0.62, "timestamp" : ISODate("2019-03-21T21:17:22.425Z") }

As the results show, 2 of the 10 documents were randomly sampled. I can now go on to use these documents to determine an average or to do a min/max calculation.

Summary

With Amazon DocumentDB, we continue to work backward from our customers, adding new capabilities that enable you to build and manage applications at scale. For more information about Amazon DocumentDB, see the Amazon DocumentDB home page and getting started guide.

If you have questions or comments about this blog post and the new aggregation operators that we added, please use the comments section on this page. For questions about Amazon DocumentDB, see our FAQ or our developer forum.


About the Author

Joseph Idziorek is a Principal Product Manager at Amazon Web Services.