Alex Bosworth's Weblog
Thursday - Jan 10, 2008
Developing an application with SimpleDb

In my last post I talked about how much Amazon’s new SimpleDb differs from a conventional database.

Well beyond just reading about it, and making a stub API, I wrote a fully working test application without a conventional database: my goal was to write an application using only SimpleDb as a store. Writing an application has given me some appreciation for SimpleDb but I’ve found some new annoyances as well.

My application seemed at first glance to me to be fairly simple: a message board. The message board should have threads and replies, and when someone posts a new reply to a thread, it should bump the thread to the top of the list.

Well this is easier said than done. The drawbacks of SimpleDb really hit you in the face when you try and actually use it:

  • No sorting

If the message board has hundreds of messages, it’s impossible to get back only the newest ones! Even if it just has tens of messages, there’s also no way to ‘page’ them unless you get all of them and do the paging yourself.

Solution: archive old threads and don’t show them on the board. This was the hardest part for me, in the end I just said you can’t page past a certain page.

  • Values can only be 1mb in size

As you might imagine, cutting corners time comes very soon to my application. Your message may only be 1024 characters in length. If I really cared, I might hack something where I chopped a message into multiple parts, and then joined the parts back together. This isn’t that trivial though because multiple values are not ordered when you fetch them back, so your message could get scrambled.

  • No auto incrementing ids

Since there is no way to implement this in simpledb, I just create random guids. One trick though is that I put the message timestamp in the item name. Since simpledb queries dont’ return rows, only row names, it’s a good idea to stick important (readonly) data in the name. SimpleDb doesn’t let you sort by row values, but I can sort items by their update timestamp if that’s in their name, and then only fetch the latest rows.

  • No consistency

This one really hurts too. Someone can submit a message, when they refresh the page the message is gone! Of course it is gone, it’s still making its way into the Amazon data centers which can take up to 20 secs, a time impossible to predict exactly. To solve this, messages are inline inserted using javascript to avoid prompting a page reload.

This also presents a problem for caching. I can’t clear the message ids cache when a new message is added, or there will be a race condition between the next ‘get messages’ query and the time the new message is fully propagated into SimpleDb. So to try and prevent this condition from occurring, I memcache the ‘last updated’ time. This lets me know if the query that comes back is stale or not.

I use this same solution for editing messages. When I ‘get’ a message, it might be stale, so I cache the last time I’ve updated the message and compare it against the message’s last update timestamp that’s in memory.

  • No Types

SimpleDb has a loose schema, but ‘null’ is not a type it deals with the way one might expect. For example, if you add a column like ‘archived’ to your data, and then add [‘archived’ != ‘1’] to your queries, rows that don’t have the archived column won’t match: they must have ‘archived’ set to something, it can’t be null.

  • No way to get multiple rows of data back

My message board sorts by ‘bump’ time. The last message posted in a thread is what determines its sort order. This means I need to not only fetch all threads, I need their messages too (and I can’t sort them), just in order to be able to show you page 1 of messages. This means I need to do tons and tons of queries to do a cold refresh of the message board. Caching SimpleDb rows is a necessity, particularly if you are not running your app on EC2.

Writing an app on top of SimpleDb is considerably more challenging than on MySQL. However now that I finished my message board, I think that it’s a viable platform if you just code around its flaws and have a local cache to deal with query latency.

There is one feature that I really really need and that’s sorting. I have tried a lot of tricks and racked my brain to figure out a way to make viewing pages of archived posts work, but nothing will really work very well. SimpleDb is only good if you can keep your working result sets down below 1000 or so.

Another thing I found late in the game, something that’s completely inexcusable and I don’t know why this wasn’t mentioned by bloggers on SimpleDb: THE REST API DOES NOT SUPPORT UNICODE! I don’t know what century the Amazon developers are living in, but ASCII went out of fashion with Pong and floppy disks.

SimpleDb is not a magic bullet unfortunately, but it’s a step in the right direction. And for me going forward with development, with my terrible experiences on swik with scaling MySQL, I’m willing to give SimpleDb a shot.

(This post has been read 3902 times.)

Comments (3) Add Comment
Posted By: Pratham Feb 05, 2008 20:50:52
Good post.
The UTF-8 thing hurts bad, I use a conversion to #HEX; HTML encoding to store UTF-8 values.
The inability to get multiple row attributes is also very painful, but the SDB people perhaps intended to retrieve the attributes of each row in parallel.

Also the value sizes are 1Kb (not 1Mb).
Posted By: Brian Aker Feb 28, 2008 01:57:01
Hi!

"Writing an app on top of SimpleDb is considerably more challenging than on MySQL. However now that I finished my message board, I think that it’s a viable platform if you just code around its flaws and have a local cache to deal with query latency."

Use memcached, hell... think of all layers as cache. You probably shouldn't be hitting the database. Don't let anyone tell you that read replication is the right answer, it is a slippery road that leads to a lot of servers to manage.

"There is one feature that I really really need and that’s sorting. I have tried a lot of tricks and racked my brain to figure out a way to make viewing pages of archived posts work, but nothing will really work very well. SimpleDb is only good if you can keep your working result sets down below 1000 or so."

Slashdot stories have more then 1K of comments all the time. How does it handle this?

All comments are sent to the application server and are sorted and filtered based on the user's preference. It scales with a lot of CPU. Too many user preferences is the issue.

How has this evolved solved? AJAX now requests comments asynchronously.

Trade off? Casual reader has pages that render quickly, but with less comments. Heavy reader means more web traffic (and a lot more hits to the servers). Tuning the servers with lingerd solves some of the problems, but not all.

Cheers,
-Brian
Well what do you think? Post a comment.

Title:

Name:

URL:

Comment: