Guest blog post from Mahesh Paolini-Subramanya, CTO of Aptela
We posted a case study today on Aptela, the leading provider of business phone services for small business and mobile workers nationwide, and how they use CouchDB to scale their application. Their CTO, Mahesh Paolini-Subramanya, worked with us on a guest post that explains their setup in more detail. We really appreciate Mahesh taking the time to share Aptela’s experience with the community.
Now for Mahesh’s blog post…
Aptela Achieves Replication and Scaling with CouchDB
We just launched our next-generation calling platform, Aptela v5.0, and I had to make sure that when we rolled it out, we had a solution that would make the new platform massively, yet affordably, scalable, and help us keep delivering reliable, crystal-clear phone service, which is a mission-critical requirement for all our customers. They rely heavily on our service to run their businesses, and we cannot go down.
As a business-class phone service provider with a customer base of more than 17,000 users across 3,000 small businesses nationwide, Aptela handles over 100 million minutes of calls per year. That’s a lot of calling! We needed a way to effectively manage the millions of Call Detail Records (CDRs) generated by those calls on a daily basis, so that we could provide those CDRs to our customers (and internal Aptela folks) instantly. We also needed the data to synchronize across all of our servers, all of the time.
I know, it’s all supposed to be so easy - you put all of your information in a database somewhere and magically, the problem is solved. Come to think of it, that is exactly what we did in the previous iteration of our architecture - a nice Postgres database happily serving up data kept all of our systems in sync. This, of course, worked perfectly until the day the database server crashed (Really? The backup generator doesn’t work for more than 10 minutes? Quelle surprise!), and our customers were offline until our (previous!) hosting facility figured out which circuit-breaker to un-trip.
This, naturally, led us to replication, master-master configurations, master-slave setups, new hosting facilities, load-balancers, MySQL, and the next thing you know, it was Yet Another 3 AM crisis with me frantically Googling “repair corrupt MySQL database unknown error 3l33t”.
Our entire server infrastructure is (and needs to be!) cloud-based, i.e., highly distributed, reliable, scalable, location-independent, fine-grained and with built-in coffee service. Take incoming calls, for example. In our environment, they go to a randomly chosen server, which figures out what to do with the call based on the called number. Then the server waits until the Official Data Store (MySQL, Postgres, or Oracle) figures out what to do with the call. The problem? We were spending all of our time figuring out how we could improve our databases to support our application, and not actually spending any time on improving our application! This was clearly not the answer. Side note - There should be some kind of law about this, e.g. Any software operation will eventually stagnate when the Development budget equals the Maintenance budget.
Enter CouchDB, which has worked like a charm for us. If anything, we have only begun to tap into all of the cool things that it does for us.
We are now able to handle our massive amounts of data by dumping it into local instances of CouchDB on each of our telephony nodes. At this point, a couple of really neat things happen (ok, neat for me, probably not for you):
- Billing information gets extracted from these CDRs and replicated over into the billing system.
- Metadata associated with voicemails and recordings gets replicated across to the other telephony nodes.
- Metadata associated with the calls gets replicated over to the application nodes.
- The CDRs themselves all end up getting replicated to the reporting servers, where all sorts of goofy reports can now get generated off of them.
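To make the fan-out above a bit more concrete, here is a minimal sketch of the JSON bodies one could POST to CouchDB's `/_replicate` endpoint. It is an illustration, not our production setup: the host names, database names, and the `cdr/billable` filter are all hypothetical, and using a replication filter for the billing feed is an assumption about one way the extraction could be done.

```python
import json

# Hypothetical URLs and database names, chosen only for illustration.
LOCAL_CDRS = "http://localhost:5984/cdrs"
REPORTING = "http://reporting.example.internal:5984/cdrs"
BILLING = "http://billing.example.internal:5984/billing_cdrs"


def replication_request(source, target, continuous=True, filter_name=None):
    """Build the JSON body for a POST to CouchDB's /_replicate endpoint."""
    body = {"source": source, "target": target, "continuous": continuous}
    if filter_name is not None:
        # Filters are referenced as "designdoc/filtername" (hypothetical name here).
        body["filter"] = filter_name
    return body


# One replication per destination: everything goes to reporting,
# and (by assumption) only billable documents go to billing.
requests_to_send = [
    replication_request(LOCAL_CDRS, REPORTING),
    replication_request(LOCAL_CDRS, BILLING, filter_name="cdr/billable"),
]

for body in requests_to_send:
    print(json.dumps(body, sort_keys=True))
```

Each telephony node would issue requests like these against its own local CouchDB instance, so a node keeps accepting writes even when a destination is temporarily unreachable.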
The free-form nature of CouchDB is tailor-made for reporting and that alone makes it worth the price of admission. Come to think of it, that was pretty much what made us look at CouchDB in the first place! That said, once we started working with it, it became immediately clear that this was the solution to all our data management and maintenance issues.
CouchDB is written in Erlang, and (to paraphrase) we love Erlang so much we wrote our entire application in it, which makes it trivially easy to integrate CouchDB into our application. It also has an extremely easy-to-use REST (Representational State Transfer) API, which makes integrating it into our back-office systems just about as trivial.
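As a rough illustration of how little ceremony the REST API needs, the sketch below prepares an HTTP PUT that would store one CDR as a CouchDB document. The database name, document ID, and CDR fields are made up for the example (our real CDRs look different), and the request is only built here, not sent.

```python
import json
import urllib.request


def build_put_cdr(base_url, db, doc_id, cdr):
    """Prepare (but do not send) a PUT that stores one CDR as a CouchDB document."""
    url = f"{base_url}/{db}/{doc_id}"
    data = json.dumps(cdr).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical CDR; field names are assumptions for illustration only.
cdr = {
    "type": "cdr",
    "caller": "+15550100",
    "callee": "+15550101",
    "duration_seconds": 42,
}

req = build_put_cdr("http://localhost:5984", "cdrs", "call-0001", cdr)
print(req.get_method(), req.full_url)

# To actually store the document against a running CouchDB instance:
# urllib.request.urlopen(req)
```

Anything that can speak HTTP and JSON can do the same, which is why the back-office side needs no special driver.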
Now, you might be nitpicking that we didn’t really solve the problem as I originally described it (consistency across all the nodes). This is quite ok, since we are actually fairly devout believers in Eventual Consistency, i.e., giving up strict, immediate consistency in exchange for high availability.
For example, when a voicemail gets received at one node, we copy the audio and metadata over to the other nodes asynchronously. If, however, the client calls up at one of the other nodes before the audio/metadata gets there, and wants to listen to that voicemail, we tell the caller that “The Audio is still being Processed”, and to “Call Back In Just A Wee Bit”. You could consider this a bit of a cop-out, but it works just fine for everyone involved: the view of voicemails is Eventually Consistent, and we don’t lock anyone out of the system while updates are occurring. For extra credit, we just move the call over to the node where the voicemail was left, so that it is immediately accessible.
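The decision a node makes when a caller asks for a voicemail can be sketched roughly as follows. This is a simplified model under stated assumptions, not our actual Erlang code: the `Voicemail` type, the `handle_playback` function, and the `can_transfer` flag are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PLAY = "play"                        # audio is available on this node
    TRANSFER = "transfer"                # move the call to the origin node
    CALL_BACK_LATER = "call_back_later"  # "The Audio is still being Processed"


@dataclass
class Voicemail:
    voicemail_id: str
    origin_node: str  # node where the voicemail was originally left


def handle_playback(vm, this_node, replicated_here, can_transfer=True):
    """Decide what to do when a caller on this_node asks to hear vm.

    replicated_here is the set of voicemail IDs whose audio and metadata
    have already arrived on this node via asynchronous replication.
    """
    if vm.origin_node == this_node or vm.voicemail_id in replicated_here:
        return Action.PLAY, this_node
    if can_transfer:
        # The "extra credit" path: hand the call to the node that has the audio.
        return Action.TRANSFER, vm.origin_node
    # Never block the caller waiting on replication; ask them to retry.
    return Action.CALL_BACK_LATER, this_node


vm = Voicemail("vm-1", origin_node="node-a")
print(handle_playback(vm, "node-b", replicated_here=set()))
```

The point of the sketch is that no branch ever waits on replication: the caller either gets the audio, gets moved to where the audio is, or is asked to try again shortly.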
Finding the right tool for the job was the goal, and CouchDB is the perfect match! I know that we have barely begun to fully utilize everything available with CouchDB. Going forward, we plan to use it to help us continue to improve the way we manage our data and I feel confident that it will be able to evolve right along with us. Bravo Damien!