What’s new in Apache CouchDB 0.11 — Part Two: Views; JOINs Redux, Raw Collation for Speed

Hi again! It’s Jan again. Thanks for coming back. If you missed Part One, here’s your chance to catch up.

CouchDB JOINs Redux

When I started out talking about CouchDB (back in 2006), people were rarely aware of any databases that didn’t use SQL for querying. A frequent question was “How do I do JOINs?” The short answer is “You don’t”. People worried about retrieving “related data” from a non-relational database.

Turns out “related data” and “relation” have very little in common (the first is groups of data, the second is a mathematical term that refers to a multivalued mapping, more commonly called a “table”). Long story short, there was (and still is) confusion.

Of course, CouchDB lets you retrieve related data in any shape or form you like. Christopher Lenz did a great write-up on “CouchDB ‘JOINs’” as far back as 2007, and it is still very applicable.

Since then, though, CouchDB has gained a few new features to tackle the same problem: fetching related data. These aren’t new in 0.11, but they did get refined, so it makes sense to revisit them here. Since 0.10, you have been able to query a view with the query parameter include_docs=true. When it is specified, CouchDB fetches, for each row in the view result, the corresponding document from the database. This lets users trade smaller view indexes (and hence shorter view indexing times) for slower view queries (for each row, CouchDB makes a single request to the database).
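To make the query parameter concrete, here is a minimal sketch of building such a view URL. The host, the database name `staff`, the design document `app` and the view name `members` are all made up for illustration; only the include_docs=true parameter comes from the text above.

```javascript
// Hypothetical names: adjust host, database, design doc and view to your setup.
const viewUrl = new URL("http://localhost:5984/staff/_design/app/_view/members");

// Ask CouchDB to attach the full document to every row of the view result.
viewUrl.searchParams.set("include_docs", "true");

const viewUrlString = viewUrl.toString();
console.log(viewUrlString);
```

Leaving the parameter out gives you the plain key/value rows, which is the cheaper query.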

With 0.11, you can include an _id member in the value of the view result and, when querying with include_docs=true, have CouchDB fetch a document with a different id from the one that produced the view row.

As an example, consider these four documents:

{
  "_id": "Claire",
  "title": "VP of Official Attitude"
}

{
  "_id": "Mikeal",
  "title": "VP of Pastries and Automating Stuff"
}

{
  "_id": "Jason",
  "title": "VP of Hosting and Lightning"
}

{
  "_id": "team",
  "members": ["Claire", "Mikeal", "Jason"]
}

And this map function:

function(doc) {
  if(doc.members) {
    doc.members.forEach(function(member) {
      emit(member, {_id: member});
    });
  }
}

The regular result looks like this:

{
  "total_rows": 3,
  "offset": 0,
  "rows": [
    {"key":"Claire", "value":{"_id":"Claire"}},
    {"key":"Jason", "value":{"_id":"Jason"}},
    {"key":"Mikeal", "value":{"_id":"Mikeal"}}
  ]
}

If you query the view with include_docs=true, the result looks like this:

{
  "total_rows": 3,
  "offset": 0,
  "rows": [
    {
      "key":"Claire",
      "value":{"_id":"Claire"},
      "doc": {"_id":"Claire","title":"VP of Official Attitude"}
    },
    {
      "key":"Jason",
      "value":{"_id":"Jason"},
      "doc": {"_id":"Jason","title":"VP of Hosting and Lightning"}
    },
    {
      "key":"Mikeal",
      "value":{"_id":"Mikeal"},
      "doc": {"_id":"Mikeal","title":"VP of Pastries and Automating Stuff"}
    }
  ]
}

Pretty slick, don’t you think?
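If you want to see the mechanics without a running CouchDB, the lookup can be mimicked in a few lines of plain JavaScript. This is only a sketch of the behaviour described above, not how CouchDB implements it: the `docs` object stands in for the database, `queryView` is a made-up helper, and `emit` is passed to the map function as an argument instead of being a global.

```javascript
// In-memory stand-in for the database, keyed by document id.
const docs = {
  Claire: { _id: "Claire", title: "VP of Official Attitude" },
  Mikeal: { _id: "Mikeal", title: "VP of Pastries and Automating Stuff" },
  Jason: { _id: "Jason", title: "VP of Hosting and Lightning" },
  team: { _id: "team", members: ["Claire", "Mikeal", "Jason"] }
};

// Same logic as the map function above; emit is a parameter here (assumption).
function map(doc, emit) {
  if (doc.members) {
    doc.members.forEach(function (member) {
      emit(member, { _id: member });
    });
  }
}

// Hypothetical helper: run the map over all docs, sort by key, attach docs.
function queryView(db, mapFn, includeDocs) {
  const rows = [];
  for (const doc of Object.values(db)) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  // Plain string comparison is enough for these ASCII keys.
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

  if (includeDocs) {
    for (const row of rows) {
      // Prefer the _id named in the value; fall back to the emitting document.
      const hasLink = row.value !== null && typeof row.value === "object" && row.value._id;
      const targetId = hasLink ? row.value._id : row.id;
      row.doc = db[targetId] || null;
    }
  }
  return { total_rows: rows.length, offset: 0, rows };
}

const result = queryView(docs, map, true);
console.log(JSON.stringify(result, null, 2));
```

Every row is emitted by the team document, yet each `doc` is the member’s own document, because the value carries the member’s `_id`.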

Raw Collation

This one is a quickie for speed freaks.

By default, all views are sorted in a locale-dependent Unicode collation order. This ensures that text sorts the way speakers of a language expect, rather than in an artificial byte order.

This is great, but sometimes you don’t need Unicode-aware sorting. CouchDB 0.11 lets you set an option in the view definition that enables raw collation for that view.

{
  "_id": "_design/app",
  "views": {
    "faster": {
      "map": "function(doc) {emit(doc.field, 1);}",
      "options": {
        "collation": "raw"
      }
    }
  }
}

Views that are built with this option avoid calling out to the ICU (International Components for Unicode) driver to sort all rows. Hence the speed-up. How much faster depends on your data and hardware, but the difference can be significant.
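The catch is that the row order changes. As a rough illustration, here is the difference between the two kinds of ordering in plain JavaScript. This is an approximation under stated assumptions: `Intl.Collator` is the JavaScript runtime’s own collator, not CouchDB’s ICU driver, and comparing JavaScript strings code unit by code unit only stands in for CouchDB’s raw comparison.

```javascript
const keys = ["b", "A", "a", "B"];

// Raw-style ordering: compare code units, no language rules involved.
function rawCompare(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Unicode-aware ordering using the runtime's built-in collator (assumption:
// the runtime ships with Intl support and English locale data).
const unicodeCompare = new Intl.Collator("en").compare;

const rawSorted = [...keys].sort(rawCompare);
const unicodeSorted = [...keys].sort(unicodeCompare);

console.log("raw:    ", rawSorted);     // uppercase letters sort before lowercase
console.log("unicode:", unicodeSorted); // letters are grouped regardless of case
```

So before you switch a view to raw collation, check that nothing querying it relies on the Unicode-aware order, for example startkey/endkey ranges over mixed-case keys.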

If you feel like it, create a small benchmark, publish the numbers on your blog and let us know! We’ll post a follow-up and compare everybody’s results.

Next up in our series are the new features of the CouchDB Replicator. Stay tuned!