Thursday, July 18, 2013

Acme Air Goes To The (Streaming) Movies - The AcmeAir / NetflixOSS Port

I had the opportunity to present some interesting work at the latest Netflix OSS meetup last night.

I presented the following slides:



If you walk through the slides, they cover the following work.

1.  We started from the Acme Air OSS project, which is currently monolithic in design.  Specifically, you will note that the user authentication service, which is called on every request, is basically a local library call.  Slide 1.

2.  We then, to take advantage of a micro-services architecture, split this authentication service off into its own separate application/service.  To get load balancing across auth service instances, we could have naively bounced back out through the front end nginx load balancing tier.  Splitting the application this way allows for better scalability of the subsystems and better fault tolerance of the main web application with regard to its dependent services.  Slide 2.

3.  Next, we started to re-implement both the web application and the authentication service using runtime technologies from Netflix OSS, specifically Karyon, Eureka, Hystrix, and Ribbon.  By using these technologies we added more elastic scaling, better HA, and increased performance and operational visibility (rough sketches of what this looks like in code follow the list).  You can check out both open source projects (the original and the NetflixOSS-enabled version) and do a diff to see the changes required in the application.  Slide 3.

4.  Finally, we deployed the web app, the auth service, and our WebSphere eXtreme Scale data tier through Asgard into auto scaling groups / clusters.  This gives us easier scaling control, integration with application concepts like security groups and load balancers, and the ability to roll code changes with zero downtime.  Slide 4.
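To give a flavor of what steps 2 and 3 look like in code, here is a minimal sketch of the kind of Hystrix command that replaces the old in-process library call with a call out to the separate auth service.  The class name, URL, and port below are illustrative rather than the actual Acme Air code, and in the real port the target instance is resolved through Eureka and Ribbon instead of a hard-coded host.

import java.net.HttpURLConnection;
import java.net.URL;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Illustrative only: wraps the remote session validation in a Hystrix command
// so timeouts, failures, and circuit breaking are handled per dependency.
public class ValidateSessionCommand extends HystrixCommand<Boolean> {

    private final String sessionId;

    public ValidateSessionCommand(String sessionId) {
        super(HystrixCommandGroupKey.Factory.asKey("AuthService"));
        this.sessionId = sessionId;
    }

    @Override
    protected Boolean run() throws Exception {
        // Hypothetical host and path; the real port resolves the auth service
        // instance via Eureka/Ribbon rather than a fixed URL.
        URL url = new URL("http://auth-service:8080/rest/api/login/validate/" + sessionId);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(500);
        conn.setReadTimeout(1000);
        return conn.getResponseCode() == 200;
    }

    @Override
    protected Boolean getFallback() {
        // Fail closed if the auth service is unavailable or the circuit is open.
        return Boolean.FALSE;
    }
}

Where the monolith made a direct method call on every request, the web app now runs something like new ValidateSessionCommand(sessionId).execute(), and gets the timeout, fallback, and circuit breaker behavior for free.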
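Similarly for the Eureka piece: Karyon handles the bootstrap and registration for us in the port, but underneath, registering a service instance with Eureka amounts to roughly the following with the plain eureka-client (again a sketch, not the exact code in the project).

import com.netflix.appinfo.ApplicationInfoManager;
import com.netflix.appinfo.InstanceInfo;
import com.netflix.appinfo.MyDataCenterInstanceConfig;
import com.netflix.discovery.DefaultEurekaClientConfig;
import com.netflix.discovery.DiscoveryManager;

public class EurekaRegistration {

    // Called once at application startup to register this instance with Eureka.
    public static void register() {
        // Reads eureka-client.properties for the service name, port, and Eureka URLs.
        DiscoveryManager.getInstance().initComponent(
                new MyDataCenterInstanceConfig(),
                new DefaultEurekaClientConfig());

        // Mark the instance as UP so Ribbon clients will start routing to it.
        ApplicationInfoManager.getInstance().setInstanceStatus(InstanceInfo.InstanceStatus.UP);
    }
}

Once the auth service registers itself this way, the web app can look it up by its registered name instead of bouncing back out through the nginx tier.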

We then ran this benchmarking framework against a small (1X) configuration.  This run included a single JMeter driver, a single instance of the web application, a single instance of the auth service, and a single data grid member.  You can see the performance results here.  Throughput was pretty solid with zero errors across the entire run.

We then scaled up the workload.  Scaling it in Asgard was pretty simple: just adjust the minimum size of the cluster.  I didn't record these results, as we wanted to move the workload to a larger instance type.
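Asgard drives the standard EC2 Auto Scaling APIs under the covers, so "adjusting the minimum size of the cluster" boils down to an UpdateAutoScalingGroup call.  A rough equivalent with the AWS SDK for Java (the group name and sizes here are just placeholders, not our actual configuration):

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.UpdateAutoScalingGroupRequest;

public class ScaleWebTier {

    public static void main(String[] args) {
        // Placeholder ASG name; in our runs Asgard created and named the groups.
        String groupName = "acmeair-webapp-v001";

        AmazonAutoScalingClient autoScaling =
                new AmazonAutoScalingClient(new DefaultAWSCredentialsProviderChain());
        autoScaling.setEndpoint("autoscaling.us-east-1.amazonaws.com");

        // Raising the minimum (and matching max/desired) makes the group scale out.
        autoScaling.updateAutoScalingGroup(new UpdateAutoScalingGroupRequest()
                .withAutoScalingGroupName(groupName)
                .withMinSize(18)
                .withMaxSize(18)
                .withDesiredCapacity(18));
    }
}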

After some experimentation, we decided that we could get around 1 billion operations / day (which is about 12K requests per second) with 18X m1.large instances for the web application tier.  We put together a run with that instance type and 20 JMeter instances, 18 web app instances, 20 auth service instances, and 16 data service instances.  You can see the performance results here.
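(As a quick sanity check on that math: 1,000,000,000 requests spread over the 86,400 seconds in a day works out to roughly 11,600 requests per second, which is where the ~12K per second target comes from.)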

The results show two things.  First, the workload peaks out at the expected overall throughput (around 13K requests per second, which works out to about 1.1 billion requests per day).  Second, there is a fall-off after the peak is achieved.  I am currently working to see if this is a result of poor tuning or something throttling the environment.

Here is a screenshot that shows the various consoles during the run.  At the top left you'll see the Eureka server.  At the bottom you'll see the Asgard server.  At the right you'll see a Hystrix console monitoring a single web app instance.
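For anyone wondering how the Hystrix console attaches to an instance: the dashboard reads a metrics event stream that each application exposes over HTTP.  With Hystrix's servlet module that is the HystrixMetricsStreamServlet, conventionally mapped at /hystrix.stream.  Here is a sketch of registering it programmatically in a Servlet 3.0 container (it can equally be declared in web.xml):

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

import com.netflix.hystrix.contrib.metrics.eventstream.HystrixMetricsStreamServlet;

// Exposes the Hystrix metrics stream so a Hystrix dashboard/console can monitor
// this instance's commands, thread pools, and circuit breakers.
@WebListener
public class HystrixStreamInitializer implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent event) {
        event.getServletContext()
             .addServlet("HystrixMetricsStreamServlet", new HystrixMetricsStreamServlet())
             .addMapping("/hystrix.stream");
    }

    @Override
    public void contextDestroyed(ServletContextEvent event) {
        // Nothing to clean up.
    }
}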


Thanks to Ruslan, Adrian, and the team from Netflix for putting this event together.  It was very valuable to me personally.  I can't wait to continue this work and for the next meetup.