Archive for the 'Perl' Category

BIND and SOAP

I’m in the process of developing an application that utilizes information from protein-protein interaction databases. One specific database I am working with is BIND, the Biomolecular Interaction Network Database. As my application will be looking at a large number of genes I had to figure out how to write an application that interacted with the database.

The solution I found was BIND SOAP, an API designed to help developers interface with BIND using either C, Perl, Java, or VB .NET. SOAP or the Simple Object Access Protocol provides a basic messaging framework which allows for communication between applications across the interenet.

After some research I decided that Perl was the best way to go as there was already a Perl module available for use with SOAP, conveninetly named SOAP::Lite. If you have the CPAN module installed on your system the best way to get the SOAP::Lite module on Linux systems is to start the CPAN shell:

[root]# perl -MCPAN -e shell

Once the shell is started, run:

cpan> install SOAP::Lite

If you are having problems installing the module or installing from another system you can go here [soaplite.com] for additional instructions.

Continue reading ‘BIND and SOAP’

Getting the transpose of a CSV

Today my Boss/P.I. approached me with an application problem he was having. He had several large comma separated value files that needed to be transposed (i.e. switching data that are in a row to a column) to work with an application known as Jqtl. Now typically this would be no problem for him as he would simply have to just pop the file into Excel or Datadesk but he was dealing with files that had about 45,000 rows and 30 columns. Now if any of you have worked with Excel and large datasets you would know that Excel used to have a row limit of 256 columns (until Excel 12 according to this blog) so using that as a method was definitely not a solution.

So I simply wrote a quick Perl script for this as I didn’t see any available in my 10 minute search online. I’m sure there is probably a module for it, but I thought it would be easy enough.

It took around three seconds to transpose the 45,000 by 30 dataset without any fancy code optimization. Here’s the script.

If you’re running in a Unix/Linux environment make sure you chmod to make the file executable. To run the script on lets say a file called foo simply run the following form a terminal

$ ./transpose_csv.pl foo

You’ll end up with a file with “tr_” appended to the original file name such as tr_foo.

Perl and different text file formats

I recently ran into a text file format problem while writing a Perl script in OS X. I had been testing the script and it worked fine with test text files but did not work with the text file I was given. For instance, I was scanning the text file for a particular Affymetrix gene ID and would never come up with a match using Perl’s “eq” string comparison. I believed it was not a text file issue as I usually see carrige returns or “^M” at the end of lines when inspecting data in Vi.

What I discovered was what anyone who has ever worked with data from multiple OSs might know; carrige returns are not the only thing that might be carried over from an application exporting text files on another platform. What should of hinted me to this was the little “[dos]” message at the bottom of the screen when I opened the file in Vi. This is why I couldn’t see the extra characters carried over from the Windows export. To work around this you can simply open a file using the -b option with Vi to open the file in binary mode.

So in my case I saw all the addtional null characters (^@) after every character in the file I was using. The file was actually encoded in UTF-16-LE format which includes a null high-order byte, after each ASCII byte (Allan from the Richmond Perl Mongers group explained this to me). This explained why the “eq” comparison was not working in my Perl Script. To solve this I tried three different approaches:

  1. Go back to the original application and ensure that data is exported in UTF-8 format which will look like plain ASCII. While this may work its rather inconvenient, especially if you’re working on data from a client.
  2. Use a regular expression in Vi to replace the null characters with nothing.
    In Vi’s navigation mode you would type “:%s/\//g”.

    While this is a great solution it can be rather slow depending on the size of the file you are working with.

  3. Use Perl’s nifty encoding capability in their open function.

    open (INPUT_FILE, "<:encoding(UTF-16)", "$input_path") or die;

    While good this assumes your Perl script is only going to only work with that specific file encoding.

All three solutions worked out perfectly fine for me and its just preference with regards to which you prefer to do.

As a side note since I always forget this myself. If you are in Linux/Unix and working with OS X text files you’ll discover that ^M are the end of line characters from OS X. On first instinct you might want to use “\n” for your newline character in your Vi regular expression “:%s/\/\n/g” but this won’t work, the actual line feed to use with this method is “\r”. So your regular expression would look like “:%s/\/\r/g”

Average CEL Perl Script

Today I wrote a Perl script to average out Affymetrix .cel files. In the middle of my Perl hacking I ran into some issues concerning Affymetrix’s file format.

In a .cel file, under a section marked [INTENSITY] Affymetrix stores a probe intensity on each row. In the row you have your x-coordinate, y-coordinate, mean (aka intensity), standard deviation, and number of pixels.

Originally, I tried using the split function to split the data on a line in their intensity section. What I tried was as follows:

($dummy, $dummy, $dummy, $total_intensity[$i], , ) =
split(/\s+/, $line);

Notice there are three dummy variables that lead prior to the actual intensity value capture. I did this because I wanted to account for the space prior to the coordinate values. So this split function actually splits the line into six variables instead of the five mentioned above to account for the space. This actually caused me a problem because once you get to coordinates three digits long (i.e. y=100) that leading space is no longer there. What I ended up doing was creating a regular expression instead as follows:

if ( $line =~
m/\s*(\d+)\s+(\d+)\s+(\d+.\d+)/g){$total_intensity[$i]= $3;}

I know this is a “hack” but it works. I’m sure there is a way to get rid of the leading spaces but none came to mind. I tried the chomp function but all this does is get rid of trailing new lines. Does anyone have an idea how to get rid of these leading spaces?