Friday, June 27, 2014

The Testing Plan Going Forward

I was able to automate testing of the ball/plate system, and within a few hours I should have all of the cron kinks worked out, so testing will begin in earnest soon. It's time to write out what I plan to do.
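
For anyone replicating this: the automation is just cron invoking a driver script on a schedule. A minimal sketch of the idea, with hypothetical script and path names (this is not the actual setup):

# run_tests.py -- hypothetical driver kicked off hourly by a crontab entry like:
#   0 * * * * /usr/bin/python /home/tester/run_tests.py >> /home/tester/tests.log 2>&1
import subprocess
import time

if __name__ == "__main__":
    print("%s: starting test batch" % time.strftime("%Y-%m-%d %H:%M:%S"))
    # ga_client.py is a stand-in name for the GA client that hits the cloud app
    subprocess.call(["python", "ga_client.py"])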

We're testing on two cloud platforms and three client platforms, but we've already determined that there's a limit to how many simultaneous requests we can make to the cloud platforms before we start getting forced timeouts. That means I'll be testing the client platforms one at a time over the next few weeks; if it's feasible, I hope to spend a week on each.

By running the 10 concurrent tests once an hour, we stay within the free time allotted by Google App Engine: each run of 10 concurrent tests uses about 4% of our daily allotment, and 4% * 24 runs = 96%. We'll cut it close, though, so if I had to guess I'd say we'll see some timeouts at the 11 PM run each day. Heroku, on the other hand, will probably have no problem with this load, so we should only see the occasional timeout.
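
Spelled out, the arithmetic behind that estimate (the 4%-per-run figure is our own measurement, not a published quota number):

# Projected daily usage of the GAE free allotment.
pct_per_run = 4.0    # one run of 10 concurrent tests uses ~4% of the daily quota
runs_per_day = 24    # cron fires once an hour
print("%.0f%% of the daily free quota" % (pct_per_run * runs_per_day))  # 96%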

Wednesday, June 18, 2014

The Infamous Log Error Code

I've been monitoring a run of 50 populations concurrently, looking at the logs periodically, and this is what I'm seeing:


2014-06-18T16:53:01.837453+00:00 heroku[router]: at=error code=H13 desc="Connection closed without response" method=GET path="/u?data=20%20.02%2010.305194%2011.645005%200.176168%20-7.471089%20-8.742053%20-10.637163" host=infinite-harbor-9903.herokuapp.com request_id=ee1e7c15-6aa3-4087-923d-ad9d5b96f93b fwd="130.127.48.28" dyno=web.1 connect=1ms service=13943ms status=503 bytes=0

However, at the end of the previous run, I was seeing:

2014-06-17T19:10:52.632445+00:00 heroku[router]: at=info method=GET path="/u?data=20%20.02%20-5.780477%20-7.264415%2011.783255%2015.508600%20-19.862916%2028.477992" host=infinite-harbor-9903.herokuapp.com request_id=b8587f93-92dc-424b-a2c0-2c4f6e0a2a41 fwd="130.127.48.28" dyno=web.1 connect=2ms service=16824ms status=200 bytes=172

This could mean that at higher concurrent run counts the GA does not initially receive results (perhaps the app is closing connections early), but that near the end, in the last few generations, it starts to receive results again. This could explain the strange results we have been seeing.
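
One way to check this hypothesis from the client side is to treat anything that isn't a clean 200 with a body as "no result." A sketch, assuming the GA issues plain GETs like the ones in the logs; the function name and sentinel are hypothetical, and I'm using the requests library here for brevity:

import requests

APP_URL = "http://infinite-harbor-9903.herokuapp.com/u"  # host from the router logs
NO_RESULT = None  # hypothetical sentinel meaning "no result received"

def evaluate(data_string, timeout=30):
    # data_string is the space-separated parameter list seen in the logged paths
    try:
        resp = requests.get(APP_URL, params={"data": data_string}, timeout=timeout)
    except requests.RequestException:
        return NO_RESULT              # connection error or client-side timeout
    if resp.status_code != 200 or not resp.content:
        return NO_RESULT              # the H13 case: status 503, bytes=0
    return resp.content               # a real reply, like the status=200 lines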

Monday, June 16, 2014

Results of Concurrent Tests On Heroku and Preliminary Analysis

Over the past few weeks, I've been conducting tests to determine Heroku's suitability as a platform for handling large amounts of complex computation. The results of some of these tests are posted below. The textual results capture various aspects of the program's running time, and the videos aggregate the best-fit member of each generation and the members of the last generation of each population.

The time results tell me that the tests are running very quickly, fantastically quickly. The test of 125 populations ran in about an hour and twelve minutes--that's 500,000 calculations done in just over an hour. However, looking at the contents of the results tells me that the reason it runs so quickly is that individual tests are timing out. When that happens, they fall back to a default value that was set up to indicate a timeout. The results should be in the hundreds; instead they are consistently under 5, which means there is a problem.
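
A quick filter over the output makes the failure rate obvious. This is only a sketch; it assumes the results can be loaded as a flat list of numbers, and the threshold just means "far below the hundreds we expect":

def count_timeout_defaults(results, threshold=5.0):
    # values far below the expected range (hundreds) are probably the default
    suspects = [r for r in results if r < threshold]
    return len(suspects), len(results)

bad, total = count_timeout_defaults([312.4, 1.0, 2.5, 287.9, 0.0])
print("%d of %d results look like timeout defaults" % (bad, total))  # 3 of 5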

There are a few working hypotheses as to why these results are occurring. After I looked at them with Dr. Remy, he confirmed that there is definitely an issue, but that the results don't look like the expected output for any of his built-in networking error values. This could indicate that the GA is receiving random values from the app. If so, there could be a problem with the GA or the app; Heroku may have been experiencing some kind of technical difficulty at the time; we may have found a bound on the capabilities of one dyno; the network may have had issues during the run--the list goes on. In any case, we're investigating the results.


Populations: 50
Individuals/Pop: 50
Generations: 80


Start: 1402348187.4
End: 1402351825.8
Test Duration: 3638.39999986s = 1.01066666663hr
Average Run: 3556.59600002s = 0.987943333338hr
Longest Run: 3601.20000005s = 1.00033333335hr
Shortest Run: 3466.79999995s = 0.962999999987hr


Populations: 75
Individuals/Pop: 50
Generations: 80


Start: 1402363852.5
End: 1402367892.0
Test Duration: 4039.5s = 1.12208333333hr
Average Run: 3846.72399996s = 1.06853444443hr
Longest Run: 3969.5s = 1.10263888889hr
Shortest Run: 3676.4000001s = 1.02122222225hr


Populations: 100
Individuals/Pop: 50
Generations: 80


Start: 1402379304.6
End: 1402383480.4
Test Duration: 4175.80000019s = 1.1599444445hr
Average Run: 3928.14699999s = 1.09115194444hr
Longest Run: 4114.9000001s = 1.1430277778hr
Shortest Run: 3733.4000001s = 1.03705555558hr


Populations: 125
Individuals/Pop: 50
Generations: 80


Start: 1402408981.8
End: 1402413277.3
Test Duration: 4295.5s = 1.19319444444hr
Average Run: 4106.75600002s = 1.14076555556hr
Longest Run: 4198.0999999s = 1.16613888886hr
Shortest Run: 3962.89999986s = 1.10080555552hr
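
For reference, the Start and End values above are Unix epoch timestamps, so each Test Duration is just their difference. Using the 125-population run:

start, end = 1402408981.8, 1402413277.3     # Start/End of the 125-population test
duration = end - start                      # 4295.5 seconds
print("%ss = %shr" % (duration, duration / 3600.0))
# under Python 2 this prints: 4295.5s = 1.19319444444hr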

Wednesday, June 11, 2014

Platform as a Service

Over the past three weeks of silence, I've been delving into the Heroku app that I set up, which spawned this post. Heroku is a Platform as a Service (PaaS) cloud computing infrastructure. It provides an environment for running web applications in a variety of programming languages (Ruby, Python, Java, Scala, and Clojure, to name a few), and within those languages, on a variety of frameworks, most notably Ruby on Rails. My app, as the post describing how to set up an app on Heroku implies, is a Python app running on web.py. It runs the calculations for a ball/plate system, and I access it using a genetic algorithm (GA) from a remote client.
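
For a sense of its shape: a web.py app like this is tiny. Here is a minimal sketch (the handler name, parameter parsing, and response body are illustrative assumptions, not the actual code):

import web

urls = ("/u", "Update")                # the GA hits /u?data=... with GET requests
app = web.application(urls, globals())

class Update:
    def GET(self):
        raw = web.input(data="").data               # space-separated parameters
        params = [float(x) for x in raw.split()]
        # ... run the ball/plate simulation with params here ...
        return "simulation results"                 # placeholder response body

if __name__ == "__main__":
    app.run()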

The motivation for this work is to determine the feasibility, effectiveness, and efficiency of using cloud platforms to perform work on large amounts of data. The GA is typically run 50-125 times with a population of 50 over 80 generations of that population (200,000 - 500,000 calculations). Thus far, on Heroku, the run of 125 GAs takes around 2.5 hours, and I'm still analyzing the results. Compare that to ~4 hours for 1 run of the GA on my MacBook Air.
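
Those calculation counts fall straight out of the run parameters:

for pops in (50, 125):
    evals = pops * 50 * 80              # populations * individuals * generations
    print("%d populations -> %d calculations" % (pops, evals))
# 50 populations -> 200000 calculations
# 125 populations -> 500000 calculations

Hence the 200,000 - 500,000 range quoted above.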
I'm documenting this process so that others can find it and use what I've learned to inform decisions about how to process massive amounts of data. Heroku seems able to handle the workload, though it does occasionally cut me off when running the GA by sending "Request not processed" replies. Even so, the results look promising for using Heroku for future big data operations.


Dr. Remy also set up a version of the GA on Google App Engine (GAE). I haven't interacted with it enough to speak to how it holds up, but the business constraints on GAE's end seem to make Heroku the more viable solution for projects that need a lot of uptime. GAE allots 28 hours of instance uptime per day for free, then cuts you off unless you pay for more; those 28 hours include all processes allocated to your app. Dr. Remy reported being cut off after one 125-population run of the GA. Heroku does not do this. Instead it grants the use of a "dyno," which appears to be one running instance of an app and encompasses all processes associated with that app. Heroku charges by the "dyno-hour" and offers 750 free hours a month. Any uptime for a dyno counts toward the dyno-hours used that month, but one dyno cannot use more than 24 hours a day (as far as I can tell). That is enough to run one dyno for free all month, which offers more flexibility than GAE. We don't yet have performance comparisons between Heroku and GAE, but the blog will be updated when those are available.
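
The dyno-hour budget, for concreteness (assuming the 750 free hours and a 31-day month):

needed = 24 * 31      # one always-on dyno for the longest month: 744 dyno-hours
free = 750            # Heroku's free dyno-hours per month
print("%d needed vs %d free" % (needed, free))   # 744 vs 750 -> one dyno runs free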