Python (3.2) - multiprocessing

I'm hoping there are some Python gurus on here somewhere, or at least someone who's attempted something similar in the past ...

The problem:
This is a biosciences application, but that doesn't impact the thrust of it. I have a list of ~500,000 nucleotide sequences (strings mostly of length 25 but variable up to ~200) that I'd like to compare to every other sequence in the list.

For that I'm using a local install of NCBI BLAST+ (link). This is a command-line app that I call from Python, getting the results back via subprocess's communicate(). The app takes an input sequence and compares it to a pre-compiled library of other sequences (my ~500,000 list).
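
Roughly what the wrapper looks like, simplified (just a sketch: it assumes blastn is on the PATH, that the database was built with makeblastdb as 'seqs_db', and relies on blastn reading the FASTA query from stdin by default - names are illustrative):

Code:
import subprocess

def blast_sequence(seq, db="seqs_db"):
    # Launch one blastn search; -outfmt 6 asks for tab-separated hits.
    proc = subprocess.Popen(
        ["blastn", "-db", db, "-outfmt", "6"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        universal_newlines=True)
    # Feed the query in as FASTA on stdin and collect the hit table.
    out, _ = proc.communicate(">query\n" + seq + "\n")
    return out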

All well and good, but: 1) each operation is going to take some time, and 2) this is obviously a highly parallel task.

So:
I'm newish to Python, but I've got a fair amount of programming experience. I haven't played with multiprocessing to date and was wondering if it's going to be worth my while to attempt it for this application? I've heard very mixed things about multiprocessing in Python, but I also know it's had a fairly major overhaul for the current v3 release. I'll happily dive in, but it looks complicated, so I'm just after some opinions / tips / ideas before I start.
 
Hi,

Comparing in what sense? Are you looking for duplicate sequences or something like that? What kind of result do you get from NCBI?

Have you done any profiling of how long a typical call to the NCBI app takes? Or whether you can make multiple simultaneous calls into it? I guess what I'm asking is whether that's where the bottleneck is going to be, or whether it's going to be in the post-processing of what NCBI returns.

If, for example, the NCBI app returns a string for each sequence, I'd use Python threads to build up a dictionary keyed by sequence, then compare as required - something like the sketch below.
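
A rough sketch of what I mean (run_ncbi and sequences are placeholders for your wrapper and input list):

Code:
import threading

results = {}
lock = threading.Lock()

def worker(seq):
    hit = run_ncbi(seq)  # placeholder: your subprocess call to the NCBI app
    with lock:           # guard dict writes so threads don't clash
        results[seq] = hit

threads = [threading.Thread(target=worker, args=(s,)) for s in sequences]
for t in threads:
    t.start()
for t in threads:
    t.join()

In practice you'd cap how many threads are live at once rather than spawning one per sequence, but the GIL isn't a problem here since the threads spend their time blocked on the external process.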

Is there a reason why you chose Python for this? Not saying it's a bad choice, just interested in your reasoning. I'm currently using it quite a bit at work for a project and I'm having fun learning it (coming from a C background).

More information needed I think :)

Cheers.


Edit: You might want to have a look at this too. Might be along the lines of what you need and might make things easier if you're from a C/C++ background.
 
Well, I should say that in the meantime I've found a much better way of going about this, as I couldn't see the above ever being feasible (see below). However, I'd still like to know people's thoughts on multiprocessing in 3.2 :) If there's any benefit to be gained, it would hopefully allow me to speed up other BLASTs.

The original idea was indeed to determine repetitive sequences. I haven't accurately profiled it, as it would take a while in itself just to build the BLAST database for that many sequences. I do, however, have a guesstimate of how long it would take to compare the ~500k sequences to a BLAST database of ~9,300 entries ... ~36 hours. It doesn't take much brainpower to work out that the original idea wasn't going to happen this side of the end of the world (given my resources, anyway).

The bottleneck would definitely be in the execution of the BLAST search rather than in the Python wrapper, which is why I was hoping multiprocessing would give me some benefit.

As you probably saw, there are faster implementations of BLAST, but they're all commercial. There's also mpiBLAST, but in the absence of a farm (I normally only have 2 or 4 cores to play with), it's unlikely to give me any amazing advantage once the additional overheads are factored in. There's also a CUDA implementation, but annoyingly that's only for protein sequences, and I don't have access to a large number of fancy graphics cards either.

I'm using Python as 1) I wanted to, and 2) there are lots of bioscience libraries out there (numpy, scipy, biopython, matplotlib etc etc - not all necessarily py3k yet, but they'll get there :p). Mainly for my own amusement, I'm also in the process of producing a flow cytometry (link) data analysis package. I learnt to program originally in Pascal (and Delphi), but all knowledge is transferable.

TL;DR
I'm going about it in a different way now, but I'm still interested in people's thoughts on multiprocessing in 3.2.
 
Well I just went ahead and had a go with multiprocessing anyway - actually very simple to implement in this instance :p

The comparison (quite a reserved one):
20 repetitions ... 93 base pairs BLASTed against a database of 9300

1 process: 2.530 secs
2 processes: 1.568 secs

So that's a ~38% reduction in wall time (a 1.61x speedup) from using multiprocessing.Pool(). That was benched on a dual-core machine, so going to 4 processes made no difference (1.572 secs), but I'd envisage similar scaling for however many cores are available. Interestingly, the BLAST app itself has a flag for setting the number of threads per process, but setting that to 2 is actually about a second slower than having Python open two separate instances. The pool setup is sketched below.
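
For reference, the pool code is about as simple as it gets (blast_one stands in for my subprocess wrapper; note that Pool only became usable as a context manager in 3.3, hence the explicit close/join):

Code:
import multiprocessing

def blast_one(seq):
    return blast_sequence(seq)  # placeholder: the subprocess wrapper from earlier

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)   # one worker per core
    results = pool.map(blast_one, sequences)   # sequences: list of query strings
    pool.close()
    pool.join()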
 
With the multiprocessing module, it looks like you give it a target function to start in a separate process. I guess you're then invoking your external program (another process) from those spawned processes, via subprocess? [I may be missing something]

This seems OK, although a little heavyweight - ideally you want some sort of "multiple wait" on a set of subprocess.Popen instances, which returns when one/some of those write to stdout. Similar to communicate(), but for a set of process handles.

As far as I can tell there's no built-in way to do this in Python, although you could simulate it via threads and condition variables, along the lines of the sketch below.
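
For example (a sketch; procs is assumed to be a list of already-started subprocess.Popen handles):

Code:
import threading

def wait_and_collect(proc, results, cond):
    out, _ = proc.communicate()  # blocks until this process exits
    with cond:
        results.append(out)
        cond.notify()            # wake the waiter on each completion

results, cond = [], threading.Condition()
for p in procs:
    threading.Thread(target=wait_and_collect,
                     args=(p, results, cond)).start()

with cond:
    while len(results) < len(procs):
        cond.wait()  # returns control as each subprocess finishes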
 