Mixmax has a very extensive list of APIs that give our users direct access to all of their data. Some APIs, such as Contacts can return millions of results. We obviously can't return all of them at once, so we need to return a subset - or a page - at a time. This technique is called paging and is common to most APIs. Paging can be implemented in many different ways, some better than others. In this post we discuss various implementation approaches and how we ultimately implemented our own approach to serve data stored in MongoDB. We also introduce a new npm module to easily allow you to do the same.
Common paging implementation approach #1: Offset-based paging (“numbered pages”)
This is one of the more common pagination approaches. The caller of the API passes a page
(aka offset
or skip
) and limit
(aka count
or per_page
) parameters. The limit is the number of results it wants and the offset is how many items to “skip” over before starting to return results. For example, Github offers an offset-based API for fetching Github Repos that looks like this:
curl https://api.github.com/user/repos?page=2&per_page=100
This pattern has a big flaw: if the results list has changed between calls to the API, the indices would shift and cause an item to be either returned twice or skipped and never returned. So in the above example, if a Github repo is deleted after you query the first page of results, then the previous 101st entry (that wasn’t in your original first page of 100) will now be the 100th entry, so it won’t be included in your second page of results (that starts at 101).
Common paging implementation approach #2: Time-based paging
A second approach is time-based paging. So if your API (in this case, /items) returns results sorted by newest-first, you could pass the created
date of the last item in the list to get the next page of items created before it:
Example API
curl https://api.mywebsite.com/items?created_before_timestamp=1505086265160&limit=50
This solves the problem in Common Approach #1 above because results are no longer skipped. If you query the first page, and then a new item is deleted, it won’t shift the results in your second page and all is fine. However, this approach has a major flaw: what if there is more than one item that was created at the same time? So if there were 51 items with the created
timestamp of exactly 1505086265160
, then when we query the next page, we’ll miss the 51st entry because we’re querying items created before 1505086265160. You’ll miss results.
The right approach: Cursor-based paging
The best way to build API pagination in a safe way is for the API to return a “cursor”, or opaque string, with the list of results. This token is usually called next
or next_cursor
or starting_after
This token can be passed with the next API request to request the next page.
Here’s an example from the Twitter API:
curl https://api.twitter.com/1.1/followers/ids.json?screen_name=theSeanCook&cursor=1374004777531007833
By returning a “cursor”, the API guarantees that it will return the exactly the next entry in the list, regardless of what changes happen to the collection between API calls. Think of the cursor as permanent marker in the list that says “we left off here”.
Implementing cursor-based pagination (in MongoDB)
So we’ve described cursors as the right approach for API pagination, but how do we actually implement them? If you happen to be using Node and MongoDB, you’re in luck, because we recently published a library that will do it for you. It’s called mongo-cursor-pagination. I’ll briefly describe how it’s implemented and our thinking behind it.
Since a cursor is just a marker to say “we left off here”, we can simply make it the value of a field in the collection. However, that field must satisfy the following properties:
-
Unique. This value must be unique for all documents in the collection. Otherwise we’ll run into the problem in Common Approach #2: if multiple documents have the same value, then we can’t “pick up exactly where we left off” because not all documents can be distinguished from one another. This also means that all documents must have this field set to something, otherwise MongoDB considers it null.
-
Orderable. Every value of this field must be able to be compared to every other value. This is because we need to sort by this field in order to get a list of results to then return in pages. If the field is a number
int
,string
, or any other sortable datatype in MongoDB, then it satisfies this. -
Immutable. The value in this field cannot change, otherwise the pages wouldn’t return the right information. For example, if you select the field “name” as the cursor field, and a name is changed while someone is fetching results, then that entry might be skipped.
So let’s say that we’re choosing the standard MongoDB field _id
for cursor pagination. This is a reasonable choice as it satisfies the above criteria and always exists on all documents in MongoDB. In this example, we’ll implement an API that returns 2 items. It will return a “cursor” (really just the string value of the last _id
) that the caller can pass to get the next page:
curl https://api.mixmax.com/items?limit=2
const items = db.items.find({}).sort({
_id: -1
}).limit(2);
const next = items[items.length - 1]._id
res.json({ items, next })
Then, when the user wants to get the second page, they pass the cursor (as next
) on the URL:
curl https://api.mixmax.com/items?limit=2&next=590e9abd4abbf1165862d342
const items = db.items.find({
_id: { $lt: req.query.next }
}).sort({
_id: -1
}).limit(2);
const next = items[items.length - 1]._id
res.json({ items, next })
This solution works great because we’re returning results sorted by _id
, which in MongoDB happens to the creation time rounded to the nearest second plus some other entropy. But let’s say we want to return results in a different order, such as the date the item was introduced into an online store, launchDate
. To make it clear in the example, we’ll add sort=launchDate
to the querystring:
curl https://api.mixmax.com/items?limit=2&sort=launchDate
const items = db.items.find({}).sort({
launchDate: -1
}).limit(2);
const next = items[items.length - 1].launchDate;
res.json({ items, next })
Then, to fetch the second page, they’d pass the next
cursor (which happens to be launch date encoded as a string):
curl https://api.mixmax.com/items?limit=2&sort=launchDate&next=2017-09-11T00%3A44%3A54.036Z
const items = db.items.find({
launchDate: { $lt: req.query.next }
}).sort({
_id: -1
}).limit(2);
const next = items[items.length - 1].launchDate;
res.json({ items, next });
However, what if we launched a bunch of items on the same day and time? Now our launchDate
field is no longer unique and doesn’t satisfy the above requirements. We can’t use it as a cursor field. But wait, there is a way: we could use two fields to generate the cursor! Since we know that the _id
field in MongoDB always satisfies the above three criteria, we know that if we use it alongside our launchDate
field, the combination of the two fields would satisfy the requirements and could be together used as a cursor field. This would also allow the user to continue to get the response sorted by launchDate
:
Then to fetch the subsequent page we’d call:
|
Now we have exactly what we want: an API that allows the user to get paginated results sorted by launch date. As new items are launched and old items are deleted, the paged results won’t be disrupted and an item will never be duplicated or missed in a response.
The above code examples explain our thinking behind building the mongo-cursor-pagination library. If you use the library, you won’t need to write the above code - the module will do it for you. We’re already using mongo-cursor-pagination at Mixmax to serve millions of documents via our developer API.
A note about MongoDB indexes
When using the module that runs queries similar the example code above, it’s important that you create the proper MongoDB indexes. This is especially important for large collections (>1k records) as you might accidentally slow your database. So for the above query, always be sure to create a index (as explained here) that covers all the properties used in the query along with the cursor field (called the paginatedField in the module) and the _id
field.
Conclusion
Using our cursor-based paging technique and MongoDB, we're able to serve millions of API requests a day. Our APIs respond in the same amount of time regardless of how many resources they're serving, or which page is be queried.
Do you want to create cool open source modules with us? Drop us a line.