Perl script to generate DNA sequence identity matrix

Any budding scripters who could set me on the way to this script would be much appreciated.

I have lines of sequence, aligned by the identity of letters within them. I'd like to tally the identity of each letter at each position in the sequence, generating a matrix for the entire alignment, e.g.

fred ATGTTGTAT
fred1 ATCT-ATAT
fred2 ATCTTATAT

Output:
A 3 0 0 0 0 2 0 3 0
T 0 3 0 3 2 0 3 0 3
G 0 0 1 0 0 1 0 0 0
C 0 0 2 0 0 0 0 0 0
- 0 0 0 0 1 0 0 0 0

This will be for 100,000 sequences, hence the script requirement. I figure that the best way is to count the incidence of each letter at each position within each line and sum them in the matrix, but I've yet to work out how to do this. Any thoughts would be much appreciated... Thanks.
 
First you need somewhere to store the results (this can be built on the fly).
Then you need two loops:
Code:
for each fred:
 current_dna_position = 0
 for each dna letter in this fred:
  # look at the letter and its position, then update the results,
  # e.g. results[letter][current_dna_position] += 1
  # (hash tables are great for this)
  current_dna_position++
Not doing it for you but it's quite simple :)

Edit: BioPerl also has modules to do exactly this, and they'll save you a lot of time.
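
For anyone finding this thread later, here is a minimal sketch of those two loops in Perl. It assumes every line is a name, some whitespace, then the aligned sequence (as in the example in the first post), and it needs Perl 5.10 or later for the // operator. The sub names count_columns and format_matrix are made up for illustration and don't come from any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally how often each symbol occurs in each alignment column.
# Takes a list of "name SEQUENCE" lines (assumed input format).
# Returns (\%counts, $width), where $counts{$symbol}[$column] is a tally.
sub count_columns {
    my @lines = @_;
    my %counts;
    my $width = 0;

    for my $line (@lines) {
        my ($name, $seq) = split ' ', $line;
        next unless defined $seq;              # skip blank or malformed lines
        $seq = uc $seq;                        # treat 'a' and 'A' alike
        $width = length($seq) if length($seq) > $width;

        my $pos = 0;
        for my $letter (split //, $seq) {
            $counts{$letter}[$pos]++;          # hash entry and array slot spring into life
            $pos++;
        }
    }
    return (\%counts, $width);
}

# Build one text row per symbol, filling never-seen positions with 0.
sub format_matrix {
    my ($counts, $width) = @_;

    # A, T, G, C and the gap first, then anything else (e.g. N) sorted.
    my @order = grep { exists $counts->{$_} } qw(A T G C -);
    my %seen  = map { $_ => 1 } @order;
    push @order, grep { !$seen{$_} } sort keys %$counts;

    my @rows;
    for my $sym (@order) {
        my @row = map { $counts->{$sym}[$_] // 0 } 0 .. $width - 1;
        push @rows, join ' ', $sym, @row;
    }
    return @rows;
}

my @alignment = (
    'fred ATGTTGTAT',
    'fred1 ATCT-ATAT',
    'fred2 ATCTTATAT',
);
my ($counts, $width) = count_columns(@alignment);
print "$_\n" for format_matrix($counts, $width);
```

Run on the three fred lines, this should print the same matrix as in the first post. For the real 100,000-sequence file you would probably read the lines from a filehandle instead of a hard-coded list. To get % identity, divide each count by the number of sequences.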
 
Thanks - simple is still complex when you are setting out I guess..:o

I'm sure that there are modules, which I'll use once I find the right one. Thought this was a good problem to cut my teeth on though.

Cheers.
 