Skip navigation

Category Archives: Python

So a few days ago I showed off some results of using a line profiler package to get detailed feedback on python programs.  Using the package, I found out that my cost calculation algorithm for a Tabu Search implementation for the Traveling Salesman Problem (TSP) was the major bottleneck in my code.  After playing around with the function and reading some forums, I figured out a way to significantly improve the cost function.  I was originally computing the cost of the entire TSP path for every 2-Opt iteration.  It turns out that this was a fairly excessive calculation, as I was only swapping the positions of two nodes at a time, the resulting change to the total cost was simply the distances between the two nodes and their respective neighbors.  So I modified the function to accept the position of the two nodes, and then calculated the distance of those partial routes to identify swaps that increased the solution’s optimality.

The original code is posted below.  In this function, I pass in a list of index values that correspond to the order in which the nodes are visited and then the function computes the total distance of the path, including the return trip to the starting point.

def calc_cost(perm):
	distance = length(points[perm[-1]], points[perm[0]])
	for i in range(0, nodeCount-1):
	    distance += length(points[perm[i]], points[perm[i+1]])
	return distance

In the modified function, shown below, I pass in the same list as before, however I added another argument to the function that indicates the position of the path that I want to evaluate. The function then calculates the distance between the provided position and the node directly before and after it in the path. I call this function four times when evaluating a 2-Opt iteration, twice to evaluate the change in the two points that were swapped, and twice again to evaluate the original distance in the path.  If the total segmented distance of the two swapped points is lower than the total segmented distance of the original two points, I keep the modified solution as it will have a lower total distance.

def calc_seg_cost(perm, c):
        if c != len(perm) - 1:
            seg = [perm[c-1], perm, perm]
        else:
            seg = [perm[c-1], perm, perm[0]]
        distance = 0
        for i in range(0, len(seg)-1):
            distance += length(points[seg[i]], points[seg[i+1]])
        return distance

I ran the two different versions of the code on a number of TSP problems and recorded the time of each run using the unix time utility.  I then plotted the results using matplotlib.

time_plot

As you can see from the plot, the function modification appears to be saving a substantial amount of time.  The primary advantage of this modification is that I can run my algorithm for a higher number of iterations and get better optimatily in the 5 hour time frame that we have.  This should help me improve my scores, but I may still need to keep tweaking the algorithm if I want to get higher scores on the larger problems, as it looks like the cost function modification by itself will not be enough to achieve that goal.

So as I mentioned in my last post, I am currently taking a discrete optimization class from Coursera.  It is a pretty challenging class, especially because I have never taken an algorithms class before (I majored in Conservation and Resource Studies in college).  I have been fairly successful in implementing some tried and true algorithms to solve some of the problem sets, but I have been having a lot of difficulty getting an understanding of the performance details of the various algorithms.  My typical routine has been to write up a few solutions to each problem, and use the run-time, solution score, and my (limited) intuition to try and improve the implementations.

In yesterday’s post I discussed a little plotting function that I wrote to help visualize my implementations to the Traveling Salesman Problem (TSP).  Today, I am going to talk about using a line profiler to get detailed feedback on the performance of your code.  I found this blog post to be very helpful in navigating the various options for performance analysis.

For the (TSP) problem, I wrote up a Tabu Search algorithm, which is a type of Local Search algorithm, that uses ‘short term memory’ to ensure that old solutions are not revisited.  While I have successfully implemented a version of the algorithm in python, I am fairly certain that I am glossing over a few key implementation details as my algorithm is not performing nearly as well as it should.  A good way to get detailed feedback on the algorithm’s performance is to use the line profiler package.  It is extremely easy to use and it provides valuable feedback on your code’s performance.  You simply add the @profile decorator to any function in your program that you want feedback on, then run a python executable at the command line (see the source blog post for details).  I threw in the profile decorator on my main Tabu function and got the following output:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    85                                               @profile
    86                                               def search(cities, candidate_list_size, max_iter):
    87         1        13160  13160.0      0.1  	current_vector = random_permutation()
    88         1          839    839.0      0.0  	current_cost = calc_cost(current_vector)
    89         1            3      3.0      0.0  	tabu_list = []
    90         1            5      5.0      0.0  	path = [current_vector[:]]
    91     10001        30512      3.1      0.3  	for i in range(0, max_iter):
    92     10000      9332657    933.3     99.1  	    new_perm, new_cost, tabu_list = generate_candidates(current_vector, tabu_list)
    93     10000        37371      3.7      0.4  	    if new_cost < current_cost:
    94                                           		#path.append(new_perm[:])
    95        11           53      4.8      0.0  		current_vector = new_perm[:]
    96        11           34      3.1      0.0  		current_cost = new_cost
    97         1            4      4.0      0.0  	return current_vector, current_cost, path

The output of the profiler displays Hits (which is the number of times a line of code was run), Time Per Hit, and % Time. Looking at the output from the profiler, it looks like the line of code that uses the most resources is the generate_candidates function call in the main for loop (which makes sense, as the search function is basically a for loop over the generate_candidates function . So in order to take this analysis further, I added a profile decorator to the generate_candidates function and displayed the results below.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    74                                               @profile
    75                                               def generate_candidates(best_perm, tabu_list):
    76     10000        37295      3.7      0.4  	perm = best_perm[:]
    77     10326        29927      2.9      0.3  	while True:
    78     10326       426132     41.3      4.7  	    perm = two_opt(best_perm)
    79     10326        73413      7.1      0.8  	    if perm not in tabu_list:
    80     10000        95461      9.5      1.1  		tabu_list = tabu(perm, tabu_list)
    81     10000        28268      2.8      0.3  		break
    82     10000      8269102    826.9     92.0  	perm_cost = calc_cost(perm)
    83     10000        29902      3.0      0.3  	return perm, perm_cost, tabu_list

As you can see from the output of the profiler on the generate_candidates function, the bottleneck appears to be the calc_cost function. This was not surprising as cost functions, which are used to measure the effectiveness of your solution, can be expensive to compute but are an important tool in providing your algorithm feedback in order to direct it to an optimal solution.  In this case, the cost function is a simple for loop that calculates the distance traveled in a given solution, so if I want to improve upon it, I am going to need to figure out a better way to estimate or approximate the cost without having to do the full calculation.

So know that I know where my bottleneck is, I am going to get to work on trying to find a way to improve the algorithm so that I can get some good solutions for the problem sets.  I will post an update, with some comparative metrics on performance if/when I find a way to reduce the bottleneck.

So I am taking a discrete optimization class through Coursera and so far it has been pretty intense. The class uses python for it’s homework submission, so while you are free to use any language to solve the homeworks, it was easy to get up and running because python was well supported.  As I quickly found out however, the python language has a lot of overhead that slows it down considerably compared to other languages.  So I have been playing with Cython (a python package that tries to optimally compile python code to C) and have had some success with that, but will probably have to keep exploring other options as my code has been running horribly slow.

One of the recommendations from the professor was to use visualizations to help debug and improve algorithms.  I had been looking for an excuse to dive into the python plotting library matplotlib, and this seemed like a good excuse to do so. I was having trouble getting a feel for the performance of a Tabu Search implementation that I was working on for the Traveling Salesman Problems (TSP), so I decided to code something up using matplotlib to help me get a better idea of how the algorithm was working.

The TSP is a very difficult problem to optimize.  When the number of nodes gets very large, the amount of time it takes to find an optimal solution increases exponentially.  As such, it is often better to attempt to achieve sub-optimal solutions using Local Search techniques.  The downside to these algorithms, is that they rely on randomly searching for small improvements in the solution, and due to the randomness involved, it is difficult to get an idea of how your algorithm is working.  So this problem makes a fairly good candidate for developing supporting visualizations.

I took a stab at writing a simple plotting function that would plot out the nodes of the TSP problem as well as the solution path.  But I also wanted to compare the current solution with previous iterations to see how the algorithm was performing.  So I created a function that allows you to pass in as many iterations of the solutions as you like, and plots out the current solution, and overlays that on top of the older solutions in a different color.  I have used plotting libraries in R and SAS before so I figured this wouldn’t be too difficult, but as I quickly found out, and as anyone who has ever used a plotting library knows, it is pretty easy to get bogged down in documentation as each function/object has dozens of optional arguments (which is pretty standard for plotting libraries, as they need to provide the flexibility for users to create unique visualizations) and plenty of quirks.  One of the more frustrating issues that I ran into was I wanted to have arrows indicating the direction of travel to the individual nodes.  I spent a while trying to look if there was an optional line type I could use, but as I soon found out, I needed to call a special arrow object and individually draw each arrow between each point.  I was able to accomplish this with a couple of for loops, so it wasn’t a huge deal, but it definitely added to the verbosity of the code.  Anyway, the code is up on gist, here, and I have attached a few examples of the resulting output below.

There are a few things that I want to try and improve on when I get the time.  I could not find a way to lower the transparency of the arrows to represent older iterations of the solutions.  Instead, I decided to reduce the width of the red lines, however, it doesn’t seem to have a noticeable effect.  Also I wanted older iterations of the solution to have a red dashed line, but some of them appear to be solid lines.  From what I can tell, this looks like it is just an issue with the rendering.

So that’s it for today.  My next task is to play around with a line profiler to get a detailed breakdown about the performance of each line in my code.  Hopefully this will help me increase the performance of my algorithms enough to start actually getting some passing grades on the homeworks!

 

Well, been plugging away a at a variety of projects, but for the most part I don’t have much to show… yet!  I ran through an in-depth Flask turorial (a light weight web framework for python) and am starting to put together a front end for one of the web scrappers I built.  Hopefully I will have that online soon.

In other news, this blog is no longer hosted on the Raspberry Pi.  I moved it over to a hosted solution.  Getting the pi up and running as a web server was a fun little introduction to the device, but ultimately I have a number of other projects I want to try out, and it made more sense to free up the Pi for those projects.  I have ordered two more Pis (for a grand total of four) and hope to play around with mesh networking and parallel computing, however in the mean time, I have set up the two Pis as Litecoin miners (more on that later).

In a few hours, I will be heading out to Costa Rica for a short surf trip, but will write up the Litecoin post while I am on the plane.

So I sort of got the DNS issues figured out.  Still working out a few kinks, and playing with wordpress themes as well.

I setup another Raspberry Pi server here.  Currently it is running a python/flask microblog site called Flaskr that is used as one of the example projects for learning flask.  Code pulled from here.  I will be playing around with some python/flask code over the next few days to setup a project site, so these links may not be working from time to time.  I will post an update once it is ready.