Perl script to generate DNA sequence identity matrix

sedm1000 · 27 Jan 2009 at 00:29

I'd much appreciate it if any budding scripters could set me on the way to this script.

I have lines of sequence, aligned by the identity of the letters within them. I'd like to sum the identity of each letter at each position (to get at the % identity), generating a matrix for the entire alignment. For example, given:

fred ATGTTGTAT
fred1 ATCT-ATAT
fred2 ATCTTATAT

Output:
A 3 0 0 0 0 2 0 3 0
T 0 3 0 3 2 0 3 0 3
G 0 0 1 0 0 1 0 0 0
C 0 0 2 0 0 0 0 0 0
- 0 0 0 0 1 0 0 0 0

This will be for 100,000 sequences, hence the need for a script. I figure that the best way is to count the incidence of each letter at each position within each line and sum the counts in the matrix, but I have yet to work out how to do this. Any thoughts would be much appreciated. Thanks.

ChrisB · 27 Jan 2009 at 08:57

First you need somewhere to store the results (this can be generated on the fly).
Then you need two loops:

Code:

for each fred:
 current_dna_position=0
 for each dna letter in this fred:
  #check what the letter is and what its position is and then update the results
  ##Hash tables are great things...
  current_dna_position++

I'm not doing it for you, but it's quite simple.

Edit: BioPerl also has modules that do exactly this and will save you a lot of time.
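
For anyone following along, the two loops above could be sketched in plain Perl roughly as follows. This is only a minimal sketch under some assumptions: the sub name count_columns is made up for illustration, each input line is assumed to be a name and a sequence separated by whitespace, all sequences are assumed to have the same length, and the // operator needs Perl 5.10 or later. The results are stored in a hash of arrays, one row per symbol.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how often each symbol occurs in each column.
# Takes a filehandle of "name sequence" lines and returns a reference to a
# hash of arrays ($counts->{symbol}[column]) plus the alignment width.
sub count_columns {
    my ($fh) = @_;
    my (%counts, $width);
    while (my $line = <$fh>) {                  # outer loop: each sequence
        my ($name, $seq) = split ' ', $line;
        next unless defined $seq;               # skip blank or malformed lines
        $width = length $seq unless defined $width;
        die "Sequence $name has length " . length($seq) . ", expected $width\n"
            if length($seq) != $width;
        my $pos = 0;
        for my $symbol (split //, uc $seq) {    # inner loop: each letter
            $counts{$symbol}[$pos]++;           # row is created on first use
            $pos++;
        }
    }
    return (\%counts, $width // 0);
}

# Demo on the three sequences from the question, read from an in-memory file.
my $example = "fred ATGTTGTAT\nfred1 ATCT-ATAT\nfred2 ATCTTATAT\n";
open my $fh, '<', \$example or die "Cannot open in-memory file: $!";
my ($counts, $width) = count_columns($fh);
close $fh;

# Print one row per symbol, filling cells that were never seen with 0.
for my $symbol (sort keys %$counts) {
    my @row = map { $counts->{$symbol}[$_] // 0 } 0 .. $width - 1;
    print join(' ', $symbol, @row), "\n";
}
```

The rows come out in sorted symbol order (-, A, C, G, T) rather than the order used in the question, but the counts are the same. Because the file is read one line at a time, only the matrix is held in memory, which should be fine for 100,000 sequences. Dividing each count by the number of sequences would turn the counts into percentages.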

sedm1000 · 27 Jan 2009 at 16:33

Thanks - simple is still complex when you are setting out, I guess...

I'm sure that there are modules, which I'll use once I find the right one. Thought this was a good problem to cut my teeth on though.

Cheers.
