Google App Engine is amazing, we got our prototype up and running in no time, and for no expense. We then decided it was the right platform to run a big part of our platform. We particularly appreciated the following features:
But the devil is as always in the detail…
As we have grown, so too has the impact of the limitations of app engine. Like many others we were surprised at the new pricing, but the analysis showed it would not be much more than a home grown solution, and we could still keep all the benefits. Here are some of the problems we have faced.
Domain and SSL.
Until TLS is supported more widely and app engine adopts that you cant have a custom HTTPS domain, you will have to do a bit of sys admin, and run a reverse proxy. Now you have just introduced a bottleneck, and depending on how you implement that possibly a single point of failure. So is you have two small VPS’s running behind some custom load balancer or DNS balancer to introduce some resilience – you have added an admin overhead and possibly another £30 per month.
We have grown, from nothing to a small startup, we have 20,000 registered users and 2,000,000 transaction records – not even approaching ’big data’
This costs us $5 per day or $150 a month
Detail of our daily cost
Map Reduce costs More
As we add more processing the number of reads will go up and sowill the price, in fact in some test Map reduce jobs show that if we ran a nightly analysis job on our transaction data set the price hikes by 50%. I’m only glad we dont have 14 million rows to process, cos we certainly dont have $6,500
Extracting data costs More and is SLOOOOWWW
We recently extracted the xact data set to look at processing them with other tools. The first effect of this was a further 25% hike in the daily price because of the increase in datastore reads. Worse than that though was that it took TWO DAYS to extract 2 Million records.
request hike over the two day extraction period
This alone makes it next to impossible to use App Engine as the main data store and extract data for processing. Work around – we ended up setting up a schedule that monitors any changes and pushes the changed data out to the processing servers. The price impact is minimal on GAE – the development time was not insignificant, and we now have another set of servers to manage and pay for.
Front End Instance Hours, Latency and Concurrency.
Prior to App Engine Team releasing support for python 2.7 last February there would be a linear relationship between the number of front end instance hours you needed and the the sum of your number of concurrent users * you average request latency / the latency you want your customers to experience.
So if you had 10 concurrent users, and you wanted their requests to return within a second, and your average request takes 500ms to do its thing you would need 5 front end instances: 10 *0.5 / 1 or if you did not mind them waiting 2 seconds you should aim for 3 instances.
Now the calculation is less clear, once you have enabled, and tested for or possibly redesigned for concurrent requests because the memory handling of each instance , and how its performance characteristics affects request latency whilst handling concurrent requests needs to be considered.
Even though you can control some aspects of instances, there still feels like there are some strange behaviours, why are our reserved instances not serving any requests but some extra (aditional cost) instances busy…
Now we are about to test on Python 2.7 I’ll report back on any cost savings.
General Performance and Design Compromise.
There are known considerations with respect to GAE’s big table-esq de-normalised design, and they really turn out to be not a major issue, storage is still cheap on GAE, though you will want to look at how the indexes are being constructed, thought the dev environment takes care of the default situation, in that if you ran a query on the development server during testing, that hist a field that field will be marked for index. In production you can’t limit by any field that is not in an index.
My biggest concern is simply how long it takes to return even slightly large result sets, we have seen 2 to 10 seconds just to retrieve 1000 rows!
1000 rows used to be the limit until cursors were introduced GAE always advised paging through result sets, but beyond 1000 rows the limit and offset approach becomes slow as well. With the introduction of cursors the 1000 row limit is gone, but using them could involve (as in our case) considerable re-factoring
As our need for more processing increases and we spent some time investigating what we could get done, so had a look at setting up a small MongoDB cluster that would be an architecture that would be robust enough to support our live app.
Instead of Amazons EC3 we plumped for Rackspace Cloud to run 6 small instances and a cloud load balancer…
The price – £14.60*2 + 29.2*2 + £7 for the load balancer – £153.00
We are actually using the two app servers for the main nginx proxies for Money Toolkit and 8 other processes, so the extra £123 for the mongo install is comparable with the $150 for GAE.
The 4 data servers were set up as a pair of replicated shard servers, and the app servers host the mongos routers and our own app server code, each with nginx proxies to better control routing to the app servers.
We got munin running across the cluster to see how things were going…
Spot the app server memory leak..
Munin illustrates our memory leak nicely
We wanted our own server to be as slim and performant as possible so that we could basically exclude that from any argument about bottle-neck so hacked together a json based restful web server in C and C++ using Mongoose. We chose Mongoose because it is pretty stable very flexible and can handle multiple concurrent requests only capped by how much memory you have as it runs a thread per concurrent connection. We configured each app server to have 20 threads – so we should easily be able to cope with 40 concurrent requests.
We coded against the C++ drivers for MongoDB on the backend.
With Mongo stubbed out (and with no in app memcache or other nginx caching) we could happily run a simple Apache Benchmark (ab) of 40 concurrent users returning over 5000 requests per second.
Running ab with a request that returns the full set of transactions for an individual account we see…
112 requests per second
There are a few anomalies - we got a few SL handshake errors (I think because of a limit we have in nginx), but the result is that each individual request takes 0.3s average and we can server 40 of those a second.
Each 40k request returned an array of just under 1000 json objects.
So with this simple setup we can easily server 112 queries a second before we even start thinking about memcache.
No More Paging
The results for larger documents – 30,000 rows showed we could return those within about 0.5 of a second.
Free Map Reduce.
We ran a simple Map reduce job to find the most popular vendors and average spend (MongodB now has aggregate queries built in) it took 2 minutes, and cost us nothing more, plus our request performance only dropped slightly – 100 rps.
Free and Fast data Import and Extract
Importing our 2 Million records (from our two day long CSV extract) – via one process took under 10 minutes (3,400 records per second), and a data dump – to CSV took around 5 minutes – so roughly 500x faster than GAE. both cost noting extra.
This was not meant as a fair like for like comparison, and the home-brew mongo install is kind of the exact oposite of what google app engine is. It was simply to highlight some fundamental limitations in GAE and how that necessarily affects design decisions, and illustrate one narrow alternative possibility for a similar monthly cost.
If you have a relatively simple app, small to medium sized app, with a reasonably static data model (deleting lots of records is also expensive) and say – less than 10 million records that do not require regular data processing and analysis then Google App Engine remains an incredible choice.
If your data is starting to scale and you really want to extract full value from your data by running analysis jobs and you need access to large-ish record sets during each request (like our case of synching full record sets to mobile devices) then Google App Engine is the wrong hammer for your nut.