Friday, August 8, 2014

The Third PaaS

We're now adding a third PaaS to the tests: PythonAnywhere. Strictly speaking it isn't a pure PaaS, since it also incorporates ideas from IaaS, but the app-hosting part of it is PaaS-like. We had some difficulty getting the app up and running, though the fault was not with the platform. Rather, we forgot to include an __init__.py in our application's directory. Remember: if you need to import things from a directory, ALWAYS include an __init__.py.

It looks like this:

# For each module in the directory, pull its public names into the package namespace
from Mod1 import *
from Mod2 import *
# ...
# etc.

# Then set __all__ so that "from <package> import *" also picks up the modules
__all__ = ["Mod1", "Mod2", ...]
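
With that __init__.py in place, the rest of the project can treat the directory as a package. A quick illustration (the directory name "app" is just for the example, not our actual layout):

# Assuming a layout of app/__init__.py, app/Mod1.py, and app/Mod2.py:
from app import *     # pulls in everything Mod1 and Mod2 export
import app.Mod1       # or import a single module explicitly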

Monday, July 21, 2014

Potentially Invalid Data From All Trials

Thanks to a fundamental misunderstanding of how "elites" (the select few members of a generation that get passed on to the next) work in our GA, it's possible that all of our data is invalid. My understanding, from what I was told about the code and from my own examination of it, was that an elite was kept for one generation and discarded if it then failed. That is apparently not the case: elites are kept until the end of the GA regardless of failures along the way.
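
To make the difference concrete, here is a minimal sketch of the two policies. This is not the project's GA code; the function names, the elite count, and the error sentinel are all illustrative:

ERROR_SENTINEL = -1   # hypothetical score marking a failed cloud evaluation
ELITE_COUNT = 30      # roughly the number of elites kept per population here

def elites_reevaluated(candidates, evaluate):
    # What I thought was happening: elites are re-scored every generation,
    # and any elite whose evaluation fails is dropped.
    scored = [(evaluate(c), c) for c in candidates]
    survivors = [c for score, c in scored if score != ERROR_SENTINEL]
    return survivors[:ELITE_COUNT]

def elites_kept_unconditionally(previous_elites):
    # What the code actually does: once an individual becomes an elite it is
    # carried to the end of the run on its old score, so a failed evaluation
    # underneath it never shows up in the final generation's data.
    return previous_elites[:ELITE_COUNT]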

Looking at the data before, I thought I was seeing 30-40 great results and ~10 errors in the last generation of each population. Now I look at the same data knowing that the 30 elites could be hiding 30 errors underneath them. If that is the case, then virtually all of our data is invalid, because we cannot be certain that the PaaS-es were returning results throughout a run of the GA.

What this means for the project is that a month's worth of data collection and analysis may have to be thrown out the window (worst-case scenario). I'm conferring with Dr. Remy to see how we should move forward.

Friday, July 18, 2014

Where We Are Right Now

Some issues were found with the PaaS2 results, specifically with the errors we received. When planning this experiment, we failed to account for the fact that a client machine's speed affects how much uptime a cloud app accumulates. The implication is that, on services that cap an app's uptime, the same sequence of requests consumes a different amount of the app's allowance depending on how fast the client is.

Say you have two clients talking to two instances of the same app. Each instance has 4 hours of allotted uptime per day, and once that uptime is exceeded any client sending requests to these apps will receive only 500 errors in response. One of your clients causes 20 minutes of uptime per run of a program, and the other causes 30 minutes per run of an identical program; this is due to hardware differences between the two clients. If you are required to run this program 9 times per day per app/client pair, and each client is constrained to one app, the app running 20 minutes per client program will not exceed its allotted time, but the app running 30 minutes will.
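
Spelled out as a quick calculation (the numbers are the ones from the example above, not our measured values):

ALLOTTED_UPTIME_MIN = 4 * 60   # 4 hours of allotted app uptime per day
RUNS_PER_DAY = 9

for minutes_per_run in (20, 30):
    used = minutes_per_run * RUNS_PER_DAY
    print("%d min/run -> %d min/day, limit exceeded: %s"
          % (minutes_per_run, used, used > ALLOTTED_UPTIME_MIN))

# 20 min/run -> 180 min/day, limit exceeded: False
# 30 min/run -> 270 min/day, limit exceeded: True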

This is an oversimplification of our problem. First, we don't know how much time one request to an app instance eats up on the server end. Conditions on the server may change, and the vendor's method for measuring uptime is unknown and seems to involve factors beyond straight app uptime. In addition, we have to account for travel time to the server, which is affected by network conditions in ways we cannot predict. That travel time adds to our client uptime, which in turn affects our app uptime.

There are a few morals here: "careful planning can still surprise you with bumps," "networking is complex," etc. The biggest, though, is that we have to restart data collection on PaaS2 for 2 of our 3 clients.

I restarted yesterday, and the results are good, but PaaS2's dashboard did show error replies in some of the runs. Apparently we haven't yet hit the sweet spot where errors occur rarely enough that we don't have to worry about them.

Thursday, July 17, 2014

Some More Results

After a week of tests, we have some results in. I'm hesitant to post the boxplots of our timing measurements because an issue we only caught after the fact affects exactly one PaaS client running on one of our machines. Those tests are being re-run until we have a matching number of results.

Our control client, a computer designed to handle complex operations like the multiprocess GA, performed essentially as expected. Runtimes for individual populations are tight around the median time, and even the outliers are not egregiously different. One PaaS clearly has more consistent timing than the other, but the less consistent PaaS has faster completion times.

The client representing the Average Joe computer also performed about as expected. It is clearly slower than the control client, but can still run a whole population in a reasonable amount of time. There are many more outliers on both PaaS plots for this client, but the same trend holds: one PaaS is more consistent in its timing and the other is somewhat faster. One of the PaaS-es also returned the only (legitimate) error codes seen on the plots for this client.

I would talk about the third client, but it's undergoing re-testing on one PaaS, so that discussion will wait until all results are in.

Thursday, July 3, 2014

Preliminary Results

Testing has been going steadily on Newton (hence the image of Sir Isaac Newton in the last post) for a week now. The results of the computations on the cloud end are promising: barring bad requests, the best individuals come out of the calculations with fitness scores in the 200-300 range (read: good). On Heroku, the runtime for a set of 10 concurrent GAs is ~1430s; on GAE, that time is closer to 900s.

We had to back off from testing on Walter, the lab's Raspberry Pi, because he, sadly, cannot create enough forked processes to handle the GA. Attempts on the mystery third client, an IaaS instance, were also largely squashed by resource constraints, so we have moved to a different IaaS, and testing has commenced there.

Friday, June 27, 2014

The Testing Plan Going Forward

I was able to automate testing of the ball/plate system, and within a few hours I should have all of the cron kinks worked out, so testing will begin in earnest soon. It's time, then, to write out what I plan to do.

We're testing on two cloud platforms and three client platforms, but it has already been determined that there is a limit to how many simultaneous requests we can make to the cloud platforms before we start getting forced timeouts. This means that I will be testing the client platforms one-at-a-time over the course of the next few weeks. If it's feasible, I hope to spend a week on each.

By running the 10 concurrent tests once an hour, we do not exceed the free time allotted by Google App Engine (each run of 10 concurrent GAs uses about 4% of our daily allowance, and 4% * 24 = 96%). We'll cut it close, though, so if I had to guess I'd say we'll see some timeouts at the 11PM run each day. Heroku, on the other hand, will probably have no problem with this load, so we should only see the occasional timeout.
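
The same quota check, as a snippet I can tweak when the schedule changes (the 4% figure is our rough estimate, not a number from Google):

QUOTA_PCT_PER_RUN = 4.0   # estimated share of the GAE daily free allowance per 10-GA run
RUNS_PER_DAY = 24         # one run of 10 concurrent GAs every hour

used = QUOTA_PCT_PER_RUN * RUNS_PER_DAY
print("%.0f%% of the daily GAE allowance used" % used)   # 96% -> just under the cap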

Wednesday, June 18, 2014

The Infamous Log Error Code

I've been monitoring a run of 50 populations concurrently, looking at the logs periodically, and this is what I'm seeing:


2014-06-18T16:53:01.837453+00:00 heroku[router]: at=error code=H13 desc="Connection closed without response" method=GET path="/u?data=20%20.02%2010.305194%2011.645005%200.176168%20-7.471089%20-8.742053%20-10.637163" host=infinite-harbor-9903.herokuapp.com request_id=ee1e7c15-6aa3-4087-923d-ad9d5b96f93b fwd="130.127.48.28" dyno=web.1 connect=1ms service=13943ms status=503 bytes=0

However, at the end of the previous run, I was seeing

2014-06-17T19:10:52.632445+00:00 heroku[router]: at=info method=GET path="/u?data=20%20.02%20-5.780477%20-7.264415%2011.783255%2015.508600%20-19.862916%2028.477992" host=infinite-harbor-9903.herokuapp.com request_id=b8587f93-92dc-424b-a2c0-2c4f6e0a2a41 fwd="130.127.48.28" dyno=web.1 connect=2ms service=16824ms status=200 bytes=172

This could mean that at higher concurrent run counts the GA does not initially receive results (perhaps the app is closing connections early), but that near the end, in the last few generations, it starts to receive results again. That could explain the strange results we have been seeing.
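
To put numbers on how often this happens, I can count router statuses in a saved copy of the logs. A minimal sketch (it assumes the router lines have been dumped to a local file, which isn't part of the setup described above):

import re
from collections import Counter

counts = Counter()
with open("heroku_router.log") as log:    # hypothetical dump of `heroku logs`
    for line in log:
        match = re.search(r"status=(\d+)", line)
        if match:
            counts[match.group(1)] += 1

print(counts)   # e.g. how many 200s vs. 503s the router reported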

Monday, June 16, 2014

Results of Concurrent Tests On Heroku and Preliminary Analysis

Over the past few weeks, I've been conducting tests to determine Heroku's suitability as a platform for dealing with large amounts of complex computations. The results of some of these tests are posted below. The textual results represent various aspects of the running time of the program, and the videos are aggregations of the best-fit member of each generation and the members of the last generation for each population.

The time results tell me that the tests are running very quickly, fantastically quickly. The test of 125 populations ran in around an hour and seven minutes--that's 500,000 calculations done in just over an hour. However, looking at the results of the tests tells me that the reason it runs so quickly is that the tests are timing out. When a request times out, the GA substitutes a default value that was set up to indicate a timeout. The results should be in the hundreds, and instead they are consistently under 5, which means there is a problem.
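
The pattern at work on the client side looks roughly like the sketch below (the names, URL handling, and the default value are illustrative, not the project's actual code): a request either returns a real fitness or silently collapses to the timeout default.

import urllib2

TIMEOUT_FITNESS = 0.0   # hypothetical low default used when a request fails

def fetch_fitness(url, timeout=30):
    # Ask the cloud app to simulate the ball/plate system for one individual.
    try:
        reply = urllib2.urlopen(url, timeout=timeout).read()
        return float(reply)
    except Exception:
        # Timeouts, 503s, and dropped connections all end up here, so a run
        # full of errors still finishes quickly and scores far below normal.
        return TIMEOUT_FITNESS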

There are a few working hypotheses as to why these results are occurring. After looking at them with Dr. Remy, he confirmed that there is definitely an issue, but that the results don't match the expected output for any of his built-in networking error values. This could indicate that the GA is receiving random values from the app. If so, there could be a problem with the GA or the app, Heroku may have been experiencing some kind of technical difficulty at the time, we may have found a bound on the capabilities of one dyno, the network may have had issues during the run--the list goes on. In any case, we're investigating the results.

Populations: 50
Individuals/Pop: 50
Generations: 80


Start: 1402348187.4
End: 1402351825.8
Test Duration: 3638.39999986s = 1.01066666663hr
Average Run: 3556.59600002s = 0.987943333338hr
Longest Run: 3601.20000005s = 1.00033333335hr
Shortest Run: 3466.79999995s = 0.962999999987hr

Populations: 75
Individuals/Pop: 50
Generations: 80


Start: 1402363852.5
End: 1402367892.0
Test Duration: 4039.5s = 1.12208333333hr
Average Run: 3846.72399996s = 1.06853444443hr
Longest Run: 3969.5s = 1.10263888889hr
Shortest Run: 3676.4000001s = 1.02122222225hr

Populations: 100
Individuals/Pop: 50
Generations: 80


Start: 1402379304.6
End: 1402383480.4
Test Duration: 4175.80000019s = 1.1599444445hr
Average Run: 3928.14699999s = 1.09115194444hr
Longest Run: 4114.9000001s = 1.1430277778hr
Shortest Run: 3733.4000001s = 1.03705555558hr

Populations: 125
Individuals/Pop: 50
Generations: 80


Start: 1402408981.8
End: 1402413277.3
Test Duration: 4295.5s = 1.19319444444hr
Average Run: 4106.75600002s = 1.14076555556hr
Longest Run: 4198.0999999s = 1.16613888886hr
Shortest Run: 3962.89999986s = 1.10080555552hr

Wednesday, June 11, 2014

Platform as a Service

Over the past three weeks of silence, I've been delving into the Heroku app that I set up, which spawned this post. Heroku is a Platform as a Service (PaaS) cloud computing infrastructure. It provides an environment for running web applications in a variety of programming languages (Ruby, Python, Java, Scala, and Clojure, to name a few) and supports a variety of frameworks within those languages, most notably Ruby on Rails. My app, as the post describing how to set up an app on Heroku implies, is a Python app running on web.py. It runs the calculations for a ball/plate system, and I access it using a genetic algorithm (GA) from a remote client.

The motivation for this work is to determine the feasibility, effectiveness, and efficiency of using cloud platforms to perform work on large amounts of data. The GA is typically run 50-125 times with a population of 50 over 80 generations of that population (200,000 - 500,000 calculations). Thus far, on Heroku, a run of 125 GAs takes around 2.5 hours, and I'm still analyzing the results. This compares to ~4 hours for 1 run of the GA on my MacBook Air.

I'm documenting this process so that others may find it and use what I've learned to inform decisions about how they will process massive amounts of data. Heroku seems to be able to handle the workload, though it does occasionally cut me off when running the GA (it sends "Request not processed" replies in this case). Still, the results generated look promising for using Heroku in future big data operations.


Dr. Remy also set up a version of the GA on Google App Engine (GAE). I have not interacted with it enough to say how it holds up, but the business constraints on GAE's end seem to point to Heroku as the more viable solution for projects that need a lot of uptime. GAE grants 28 hours of instance uptime a day for free, then cuts you off unless you pay for more; the 28 hours includes all processes allocated to your app. Dr. Remy reported being cut off after one 125-population run of the GA. Heroku does not do this, instead granting the use of a "dyno," which appears to be one instance of an app running and encompasses all processes associated with that app. Heroku charges by the "dyno-hour" and offers 750 free hours a month. Any uptime for a dyno counts toward the dyno-hours used that month, but one dyno cannot use more than 24 hours a day (as far as I can tell). Since a full month of one always-on dyno is at most 31 * 24 = 744 dyno-hours, that is enough to run one dyno for free each month, which offers more flexibility than GAE. We don't yet have performance comparisons between Heroku and GAE, but the blog will be updated when those are available.

Wednesday, May 21, 2014

Setting Up a Web.py App for Heroku

Heroku offers explicit support for numerous web app development frameworks, but web.py is not one of them. However, using the Heroku documentation and a bit of advice from others who have tried to set up non-standard frameworks on Heroku, I've been able to get a web.py app running.
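
For reference, the kind of app this guide applies to can be as small as the following (an illustrative example, not the ball/plate app itself):

# app.py -- a minimal web.py application
import web

urls = ('/', 'Index')

class Index:
    def GET(self):
        return "hello from web.py"

if __name__ == "__main__":
    app = web.application(urls, globals())
    app.run()   # web.py takes the port to bind from sys.argv[1], defaulting to 8080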

This guide assumes you already have your application built and working with the correct dependencies. You will also need pip and virtualenv: Heroku uses pip to manage dependencies and virtualenv to sandbox your application.

Pip can be installed from the instructions here. If virtualenv is not installed, run

$ pip install virtualenv

from the command line. Then, navigate to the root directory of your app

$ cd ~/path/to/your/app
$ virtualenv venv --distribute
$ . venv/bin/activate


This creates a virtual environment for your app and activates it so that any Python execution runs from that environment. While the environment is active, install all of your app's dependencies with pip

$ pip install <module name and version>

(If you already have a virtualenv for your app, just activate it instead.) Now run

$ pip freeze > requirements.txt

which writes the names and versions of all of your app's dependencies to requirements.txt in a format Heroku understands. Heroku will use this file to ensure that the correct modules are installed in your app's environment.
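
For a bare web.py app, the file might contain little more than the framework itself (the version number below is just an example, not a requirement):

web.py==0.37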

You'll also need a .gitignore file in the directory, so create this and put

venv
*.pyc


in it.

Now create a file called Procfile and put

web: python <file that starts the app>.py $PATH

in the file. This tells Heroku how to start your app, and it is the general means of setting up an app to run on Heroku, but web.py has an extra requirement. Heroku assigns the port your app must listen on, while web.py expects the port number to be passed in as a command-line argument when the app is started. There is no built-in means of changing the port from within web.py, so we have a conundrum. One way around this is to write a small launcher script that reads the assigned port and then starts the server. It looks like this:

import os
import subprocess
import sys

# Heroku supplies the port to bind to in the PORT environment variable;
# fall back to 8080 for local runs.
port = str(os.environ.get('PORT', 8080))

# Start the web.py app named on the command line, handing it the port as
# its first argument (web.py reads the port from sys.argv[1]).
subprocess.call(["python", sys.argv[1], port])

Rewrite the Procfile so that it runs this launcher instead:

web: python <file that the above code is in>.py <file that starts the app>.py $PATH

Now deploy the app to Heroku with
$ heroku login
$ git init
$ git add .
$ git commit -m "Initial commit"
$ heroku create
$ git push heroku master
$ heroku ps:scale web=1
$ heroku open


If the heroku command does not work, install the Heroku Toolbelt.

And your app should be running. The last command should open it in a browser, which will let you know if there are any problems with the app.