noob looking for perl/python help

Disco Boy · 30 Jan 2012 at 16:36

Hi all,

I'm a complete noob when it comes to scripting. I have had a play with Perl and Python to do a few simple things to make my life easier, but I am really struggling with this task in either language.

I have three files which look like this:

# few lines of header
@more lines of header
0.000 0.040524
0.002 0.495572
0.004 0.486072
0.006 0.586495
0.008 0.720278
... etc....
50

The left hand column is the time (so is the same in each file) and the right hand value is the interesting part which varies.

I want to take these three files and average the values in the right hand column, ending up with a file that looks like this:
#header (doesn't matter which file it's from)
@header (doesn't matter which file it's from)
0.000 *mean 1-3*
0.002 *mean 1-3*
0.004 *mean 1-3*
0.006 *mean 1-3*
0.006 *mean 1-3*
...
50.000 *mean 1-3*

My problem is that I am struggling to address the three files at the same time. Previously I have done stuff with one of these files using things like this in perl:

Code:

while ($line = <FILE>)
{
    if ($line =~ m/@/)
    {
        print $line;
    }
    elsif ($line =~ m/#/)
    {
        print $line;
    }

    else 
    {
        for ($line) {
        s/^\s+//;
        s/\s+$//;
        }

        chomp $line;

        @words = split(' ', $line);   # ' ' splits on any run of whitespace
        $col1 = $words[0] / $x;
        $col2 = $words[1] * $y;

    print "$col1   $col2\n";
    }


}

or something similar in python:

Code:

for line in lines:

    if re.search( r"#", line ):
        print line,
    elif re.search( r"@", line ):
        print line,
    elif line== '\n':
        print line,
    else:
        words = line.split()
        time = words[0]
        rmsd1 = words[1]

        print "%s %s" % (time, rmsd1)

but I am struggling to do this task with a simple "line in lines" and a split because I am taking things from three files.

I think the solution should be to take the three files, pick out the interesting sections and put them in arrays, then add the three arrays and divide by three. I just can't work out how to do this!

Alphane · 30 Jan 2012 at 18:04

Code:

File_1 = open ('?')
File_2 = open ('?')
File_3 = open ('?')

def Create_Dictionary_1():

    File_1.seek(0)

    Complete_File_1 = File_1.readlines()

    Numbers_In_File = len(Complete_File_1)

    Numbers_Transfered = 0

    Dictionary_1 = {}

    while Numbers_Transfered < Numbers_In_File :

        Number_To_Be_Transfered = Complete_File_1[Numbers_Transfered]

        Final_Numbers = Number_To_Be_Transfered.split()

        Numbers_Transfered = Numbers_Transfered + 1

        Dictionary_1 [Final_Numbers[0]] = Final_Numbers[1]

    return Dictionary_1

def Create_Dictionary_2():

    File_2.seek(0)

    Complete_File_2 = File_2.readlines()

    Numbers_In_File = len(Complete_File_2)

    Numbers_Transfered = 0

    Dictionary_2 = {}

    while Numbers_Transfered < Numbers_In_File :

        Number_To_Be_Transfered = Complete_File_2[Numbers_Transfered]

        Final_Numbers = Number_To_Be_Transfered.split()

        Numbers_Transfered = Numbers_Transfered + 1

        Dictionary_2 [Final_Numbers[0]] = Final_Numbers[1]

    return Dictionary_2

def Create_Dictionary_3():

    File_3.seek(0)

    Complete_File_3 = File_3.readlines()

    Numbers_In_File = len(Complete_File_3)

    Numbers_Transfered = 0

    Dictionary_3 = {}

    while Numbers_Transfered < Numbers_In_File :

        Number_To_Be_Transfered = Complete_File_3[Numbers_Transfered]

        Final_Numbers = Number_To_Be_Transfered.split()

        Numbers_Transfered = Numbers_Transfered + 1

        Dictionary_3 [Final_Numbers[0]] = Final_Numbers[1]

    return Dictionary_3

  

Dictionary_1 = Create_Dictionary_1()

Dictionary_2 = Create_Dictionary_2()

Dictionary_3 = Create_Dictionary_3()

Numbers_To_Average_Names = Dictionary_1.keys()

Numbers_To_Average_Names.sort()

New_File = ''

for Keys in Numbers_To_Average_Names :

    Number_1 = Dictionary_1[Keys]

    Number_2 = Dictionary_2[Keys]

    Number_3 = Dictionary_3[Keys]

    Average = (float(Number_1)+float(Number_2)+float(Number_3))/3

    New_File = New_File + Keys + str(Average) + ' /n '


print New_File

The header lines need to be skipped in the Create_Dictionary definitions and added to New_File. Not sure if it's the BEST coding but it works.

/edit Just noticed some of your left hand numbers aren't unique. If that's the case in the files then this won't work properly, sorry, as the entry would get overwritten in the dictionary.
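For what it's worth, here is a minimal sketch of the header handling mentioned above, assuming the three files hold the same number of data rows in the same order. It pairs rows up by position instead of going through a dictionary, so repeated times don't overwrite each other. The names read_series and average_files are made up for illustration, and it is written for Python 3.

```python
def read_series(path):
    """Return (header_lines, [(time_text, value), ...]) for one data file."""
    headers, rows = [], []
    with open(path) as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped:
                continue                      # skip blank lines
            if stripped[0] in "#@":
                headers.append(stripped)      # keep header lines for the output
                continue
            fields = stripped.split()
            if len(fields) < 2:
                continue                      # skip incomplete rows
            rows.append((fields[0], float(fields[1])))
    return headers, rows


def average_files(paths):
    """Average the second column across files, matching rows by position."""
    parsed = [read_series(p) for p in paths]
    headers = parsed[0][0]                    # header taken from the first file
    averaged = []
    # zip stops at the shortest file, so extra rows in longer files are ignored
    for row_group in zip(*(rows for _, rows in parsed)):
        times = {t for t, _ in row_group}
        if len(times) != 1:
            raise ValueError("time columns disagree: %s" % sorted(times))
        mean = sum(v for _, v in row_group) / len(row_group)
        averaged.append((row_group[0][0], mean))
    return headers, averaged
```

Calling average_files with a list of three file names returns the header lines from the first file plus a list of (time, mean) pairs, which can then be printed one per line. The times are compared as text, so this sketch raises an error if one file writes 0.000 and another writes 0.0000000.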

Disco Boy · 30 Jan 2012 at 19:52

Thank you, I'll try that tomorrow.

The left hand column should be unique. I must have messed something up copy-pasting somehow!

Alphane · 30 Jan 2012 at 20:09

No problem.

Just out of interest, what does this script actually do, i.e. what's the data it's working on?

Disco Boy · 30 Jan 2012 at 20:19

The data is some analysis of a molecular dynamics simulation of a G protein-coupled receptor.

Specifically, the left hand column is the time (in ns) and the right hand column is the RMSD of the alpha carbons (in nm) at that point in time, with the first frame as a reference. The RMSD is a good measure of how far the structure at time x is from the initial structure.

The three different files are the data from three different simulations.

Thanks for your help, and just ask if you want to know more

Alphane · 30 Jan 2012 at 20:37

Disco Boy said:
The data is some analysis of a molecular dynamics simulation of a G protein-coupled receptor.

Specifically, the left hand column is the time (in ns) and the right hand column is the RMSD of the alpha carbons (in nm) at that point in time, with the first frame as a reference. The RMSD is a good measure of how far the structure at time x is from the initial structure.

The three different files are the data from three different simulations.

Thanks for your help, and just ask if you want to know more

Yes, that's what I thought it probably was :rolleyes:

/edit

Sorry for the sarcasm, I was a bit knackered after a long weekend yesterday. I've read the links now and from what I can gather it's some sort of biochemical simulation. It would be interesting to know what the reactants are and what the purpose of the reaction is, though.

Also, I've always struggled with biology as the long words tend to confuse me (English ain't my strong suit), but I found it fascinating that while we are both obviously fairly intelligent, we struggled to understand each other's 'subject'.

Let me know if you have any problems with the script. It shouldn't be too hard to adapt from its basic form if there's something I've failed to take into account.

Disco Boy · 31 Jan 2012 at 14:02

Hey,

That doesn't seem to work. I tried it on two versions of the files (with different units). The type I posted earlier:
0.000 *mean 1-3*
0.002 *mean 1-3*
0.004 *mean 1-3*
0.006 *mean 1-3*
0.006 *mean 1-3*
...
50.000 *mean 1-3*

returned:

Code:

Traceback (most recent call last):
  File "oc.py", line 104, in <module>
    Number_3 = Dictionary_3[Keys]
KeyError: '0.000'

I then tried the different version of the input file which looks like this:
0.0000000 0.0040524
2.0000000 0.0495572
4.0000000 0.0486072
6.0000000 0.0586495
...
50000.0000000 0.2347846

which gave an output, but it's junk:

Code:

0.00000000.0040524 /n 10.00000000.0890422333333 /n 100.00000000.132484733333 /n 1000.00000000.150533633333 /n 10000.00000000.182452666667 /n 10002.00000000.173097 /n 10004.00000000.174290566667 /n 10006.00000000.1680934 /n 10008.00000000.1684483 /n 10010.00000000.175850433333 /n 10012.00000000.1720737 /n 10014.00000000.1775554 /n 10016.00000000.1666258 /n 10018.00000000.172433 /n 1002.00000000.154299233333 /n 10020.00000000.1710826 /n 10022.00000000.1697698 /n 10024.

etc.

Notably the output is all one line. But the values also seem to be junk.

The simulation isn't of a reaction. It's just a GPCR embedded in a bilayer as it would be in the body. We have built a model of the GPCR and are using the MD to optimise the structure to use it for drug design.

Alphane · 31 Jan 2012 at 16:09

Sorry, change

Code:

New_File = New_File + Keys + str(Average) + ' /n '

to

New_File = New_File + Keys + '  ' + str(Average) + ' \n'

That should add some spaces and make the thing readable.

As for it all being on one line, sorry, I confused / with \

Also, it's designed to work on the second type of file (i.e. only numbers), as that was the one you mentioned first in your OP, and it needs 3 files to work properly.

Let me know if that sorts it out.
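One more thing that might explain the "junk": the order in the output above (0, 10, 100, 1000, 10000, 10002, ...) is what you get when the keys are sorted as text rather than as numbers, so the averages may be fine and just out of order. A minimal sketch of sorting by numeric value and putting a separator between the columns, assuming the averages are held in a dictionary keyed by the time text (format_rows is a made-up name, Python 3):

```python
def format_rows(averages):
    """Build output text from {time_text: mean}, ordered by numeric time."""
    lines = []
    # key=float compares 2.0 < 10.0; a plain sort would compare the strings
    for key in sorted(averages, key=float):
        lines.append("%s %.7f" % (key, averages[key]))
    return "\n".join(lines) + "\n"


keys = ["0.0000000", "2.0000000", "10.0000000", "100.0000000", "1002.0000000"]
text_order = sorted(keys)               # string comparison, character by character
numeric_order = sorted(keys, key=float)  # comparison by numeric value
```

Here text_order puts "10.0000000", "100.0000000" and "1002.0000000" ahead of "2.0000000", while numeric_order keeps the times in the order they appear in the file.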

A.N.Other · 31 Jan 2012 at 16:58

Python 3, but *should* work OK with 2 without much alteration (just strip out the encoding info)?!?

Code:

from collections import defaultdict
import glob, csv

vals = defaultdict(list)
for gl_obj in glob.iglob('C:/Users/xxx/Desktop/OcUK/*.txt'):
    with open(gl_obj, 'r', encoding = 'utf-8') as f_obj:
        for line in f_obj:
            fields = line.split()
            # skip blank lines and the '#' / '@' header lines
            if len(fields) < 2 or fields[0][0] in '#@':
                continue
            vals[fields[0]].append(float(fields[1]))
with open('C:/Users/xxx/Desktop/results.txt', 'w', encoding = 'utf-8', newline = '') as f_obj:
    writer = csv.writer(f_obj, dialect = 'excel')
    # the keys are strings, so sort on their numeric value to keep the times in order
    for t, vlist in sorted(vals.items(), key = lambda item: float(item[0])):
        writer.writerow([t, sum(vlist) / len(vlist)])

You just need to change the input directory and the output file location. It automatically reads every .txt file in the input directory, so it would work with more than 3 files as long as they're the only things in there. Outputs CSV format.
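One thing to watch with grouping by time like this: if a time point is missing from one file, or is written differently (0.000 versus 0.0000000), it gets averaged over fewer values without any warning. A rough sketch of a check for that, assuming whitespace-separated columns and '#' / '@' header lines as in the OP (collect and incomplete_times are made-up names):

```python
from collections import defaultdict


def collect(paths):
    """Group the second-column values from every file by their time text."""
    vals = defaultdict(list)
    for path in paths:
        with open(path) as f_obj:
            for line in f_obj:
                fields = line.split()
                # skip blank lines, short lines and '#' / '@' header lines
                if len(fields) < 2 or fields[0][0] in "#@":
                    continue
                vals[fields[0]].append(float(fields[1]))
    return vals


def incomplete_times(vals, n_files):
    """Return the times that do not have exactly one value per file."""
    return sorted((t for t, v in vals.items() if len(v) != n_files), key=float)
```

If incomplete_times(vals, 3) comes back empty, every time point was found once in each of the three files and the averages are over the full set.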

Looks like an interesting topic. Retrovirology is my thing (and the associated bioinformatics), so it's all French to me.