
Beginner’s Guide to Bioinformatics

As a computer scientist coming into Bioinformatics I was faced with the heavy task of catching up on my Biology and Chemistry (I was a Physics minor in undergrad but that wasn’t applicable to my Bioinformatics catch-up). This meant two semesters of General Chemistry, a semester of Organic Chemistry and a semester of Cell Biology. Though all this coursework was very educational and useful for my degree, I don’t think it’s all that necessary for someone who may be interested in fooling around with Bioinformatics problems on the side.

Here is a very general overview of cell biology for Non-Biologists wanting to get involved in Bioinformatics:

  1. Proteins are the essential part of all living organisms. Proteins have a variety of functions and are involved in every process within our cells. [Wikipedia]
  2. DNA is the blueprint for proteins. Segments of DNA (genes) are transcribed into RNA, which is then translated into proteins. For more detail look into the transcription and translation of DNA into proteins.
  3. Cell function is determined by which proteins are expressed and in what quantity. This means that some kind of gene regulation must take place. One can also argue that if you know which genes are expressed in a cell, and how strongly, you can possibly infer that cell’s function.

For a more specific overview, the following are some of the key points for biology and bioinformatics:

  1. Genome - all the DNA in a cell.
  2. DNA - a string of nucleotides (e.g. GATCACTT…ATCG).
  3. Gene - a substring of DNA that encodes proteins.
  4. Proteins - a string of amino acids (e.g. ACDEF…RSTY).
  5. Gene expression is regulated by the product of other genes. It is a network of interactions.
  6. Post-translational modifications are an important mechanism for regulating what a protein does after it has been made.

You may notice that the above deals quite a bit with string manipulation, hence the strong emphasis on Perl experience in Bioinformatics job postings. You will find, though, that string manipulation is not the only driving force for computer science in Bioinformatics. I will try to explain other topics in subsequent posts.
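To make the “DNA is a string” idea concrete, here is a minimal sketch of a few everyday string operations on a sequence. It is written in Python rather than Perl purely for illustration, and the sequence and function names are made up for this example.

```python
# Hypothetical example sequence, not taken from any real genome.
dna = "GATCACTTATGGCCTAAATCG"

# Base-pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_all(seq, motif):
    """Return every start index where motif occurs in seq (overlaps allowed)."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


print(reverse_complement(dna))
print(gc_content(dna))
print(find_all(dna, "ATG"))
```

None of this needs any biology-specific library; it is ordinary substring searching and character mapping, which is why scripting languages with good string handling are so popular in the field.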

As for Biologists wanting to do Bioinformatics, I cannot provide the best advice since I didn’t come into Bioinformatics from that direction, but I would imagine that you may want to look into the following:

  1. Learn how to program. You want to know how to use a scripting language (preferably Perl) for smaller everyday tasks and a compiled language such as C, C++, or Java for larger projects.
  2. Learn how to use databases. Bioinformatics deals with very large datasets. At some point you are going to have to deal with either retrieving information from databases or building your very own database, so you might as well begin playing with them now.
  3. Install and run a Unix/Linux OS (Optional). This might be my personal bias but I believe if you are going to be working in Bioinformatics and its large data sets eventually you will find yourself either maintaining a server or SSHing into one so you might as well become familiar with that type of environment. At the very least XP users should install Cygwin.

Useful Links:

  • Bioinformatics intro offered at my university.
  • Graduate level of the Bioinformatics intro course.
  • Library of videos that cover a wide range of biological topics (theoretical and practical).
  • RT-PCR a common molecular biology method practiced in the lab.
  • Virtual lab which lets non-biologists actually work through basic molecular biology techniques.

Finally, I must say that I am far from an expert, so any constructive suggestions to help clarify or expand the above are welcome and appreciated.

MIA…

Sorry I’ve been missing in action. I’m completing my last semester of classes.

Since I’m on Spring Break (AKA catch up with all my work break) there will be quite a few posts popping up on information I’ve gained throughout the semester but have not had the time to post about.

Getting the transpose of a CSV

Today my Boss/P.I. approached me with an application problem he was having. He had several large comma separated value files that needed to be transposed (i.e. switching data that are in rows into columns) to work with an application known as Jqtl. Typically this would be no problem for him, as he would simply pop the file into Excel or Datadesk, but he was dealing with files that had about 45,000 rows and 30 columns. If any of you have worked with Excel and large datasets you would know that Excel used to have a limit of 256 columns (until Excel 12 according to this blog), and the transposed file would need about 45,000 of them, so that was definitely not a solution.

So I simply wrote a quick Perl script for this as I didn’t see any available in my 10-minute search online. I’m sure there is probably a module for it, but I thought it would be easy enough.

It took around three seconds to transpose the 45,000 by 30 dataset without any fancy code optimization. Here’s the script.

If you’re running in a Unix/Linux environment make sure you chmod the file to make it executable. To run the script on, let’s say, a file called foo simply run the following from a terminal:

$ ./transpose_csv.pl foo

You’ll end up with a file with “tr_” prepended to the original file name, such as tr_foo.
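For anyone who just wants the idea, here is a minimal sketch of the same transposition in Python. To be clear, this is not the Perl script linked above; the function name `transpose_csv` is made up for this example, and it assumes the whole file fits in memory and that every row has the same number of fields.

```python
import csv
import os


def transpose_csv(path):
    """Write a transposed copy of the CSV at path and return the new file's path."""
    with open(path, newline="") as src:
        rows = list(csv.reader(src))

    # zip(*rows) makes the i-th field of every input row into the i-th output row.
    # Assumption: all rows have equal length; zip would silently truncate
    # ragged rows to the length of the shortest one.
    columns = zip(*rows)

    # Mirror the naming used above: foo becomes tr_foo in the same directory.
    directory, name = os.path.split(path)
    out_path = os.path.join(directory, "tr_" + name)
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(columns)
    return out_path
```

Calling `transpose_csv("foo")` would write `tr_foo` next to the original file.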

Perl and different text file formats

I recently ran into a text file format problem while writing a Perl script in OS X. I had been testing the script and it worked fine with test text files but did not work with the text file I was given. For instance, I was scanning the text file for a particular Affymetrix gene ID and would never come up with a match using Perl’s “eq” string comparison. I believed it was not a text file issue, as I usually see carriage returns or “^M” at the end of lines when inspecting such data in Vi.

What I discovered was what anyone who has ever worked with data from multiple OSs might know: carriage returns are not the only thing that might be carried over from an application exporting text files on another platform. What should have tipped me off was the little “[dos]” message at the bottom of the screen when I opened the file in Vi. This is why I couldn’t see the extra characters carried over from the Windows export. To work around this you can simply open the file using the -b option with Vi, which opens it in binary mode.

So in my case I saw all the additional null characters (^@) after every character in the file I was using. The file was actually encoded in UTF-16-LE format, which includes a null high-order byte after each ASCII byte (Allan from the Richmond Perl Mongers group explained this to me). This explained why the “eq” comparison was not working in my Perl script. To solve this I tried three different approaches:

  1. Go back to the original application and ensure that data is exported in UTF-8 format, which will look like plain ASCII. While this may work it’s rather inconvenient, especially if you’re working on data from a client.
  2. Use a regular expression in Vi to replace the null characters with nothing.
    In Vi’s navigation mode you would type “:%s/^@//g”, entering the ^@ by pressing Ctrl-V followed by Ctrl-@.

    While this is a great solution it can be rather slow depending on the size of the file you are working with.

  3. Use Perl’s nifty encoding capability in their open function.

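    # Perl's "UTF-16" layer expects a byte order mark; use "UTF-16LE" if the file has none.
    open(my $input_fh, "<:encoding(UTF-16)", $input_path) or die "Cannot open $input_path: $!";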

    While good, this assumes your Perl script is only ever going to work with that specific file encoding.

All three solutions worked out perfectly fine for me, and which one you use is just a matter of preference.
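If you want to see the problem for yourself, the following is a small sketch in Python (not the Perl from my script) that reproduces the failed comparison and the fix. The probe ID and the helper name `read_lines` are made up for illustration.

```python
# Hypothetical probe ID, used only for illustration.
probe_id = "1007_s_at"

# The bytes an application would write when exporting this ID as UTF-16-LE.
raw = probe_id.encode("utf-16-le")

# Reading those bytes with a single-byte encoding leaves a NUL character
# after every ASCII character, so a plain string comparison fails.
misread = raw.decode("latin-1")
print(repr(misread))
print(misread == probe_id)  # False

# Decoding with the matching encoding restores the expected string.
decoded = raw.decode("utf-16-le")
print(decoded == probe_id)  # True


def read_lines(path, encoding="utf-16"):
    """Read a text file with an explicit encoding and return its lines.

    Python's "utf-16" codec uses the byte order mark to pick the byte order;
    pass "utf-16-le" for files known to be little-endian without one.
    """
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]
```

This is the same idea as option 3: tell the language what encoding the file is in when you open it, rather than cleaning up the bytes afterwards.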

As a side note, since I always forget this myself: if you are in Linux/Unix and working with OS X text files, you may discover that ^M (carriage return) is the end-of-line character in files from OS X. On first instinct you might want to use “\n” for your newline character in your Vi regular expression, “:%s/^M/\n/g”, but this won’t work; the actual line break to use with this method is “\r”. So your regular expression would look like “:%s/^M/\r/g” (enter the ^M by pressing Ctrl-V followed by Ctrl-M).
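If you would rather do the same clean-up in a script than in Vi, here is a minimal Python sketch. The function name `normalize_newlines` is made up for this example, and it works on raw bytes so that no encoding guesswork is involved.

```python
def normalize_newlines(data):
    """Convert CR LF and bare CR line endings in a bytes object to LF."""
    # Replace the two-byte Windows ending first so that its CR is not
    # turned into a second, spurious line break by the next replacement.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Note that this is only safe for single-byte or UTF-8 encoded text; a UTF-16 file should be decoded first.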

A Little Busy…

Well I’ve been quite busy with the Holidays and work, so I haven’t been able to post in a while. As a brief update, here are a couple of screen captures to show what I’ve been doing:


Fedora Core 4 via Parallels on OS X



GEOSS via Fedora Core 4 via Parallels on OS X

GEOSS is an open source storage solution for Affymetrix microarrays. I’ll be posting soon to describe what GEOSS does and how it is useful to my research but if you can’t wait check out their home page.

Virtual Reality Used to Cure Phantom Limb Pain

Today, slashdot.org had a post concerning the use of Virtual Reality (VR) technology to cure phantom limb pain (PLP). PLP is the sensation one may feel after the loss of a limb via amputation. For example, let’s say a person’s left leg is amputated due to diabetes but they still feel pain in their left foot even though they no longer have that limb; this is PLP. The Virtual Reality comes into play by tricking the amputee’s brain into believing they still have the limb, thereby reducing the pain felt in PLP.

Some may differ in this opinion but I consider this to be bioinformatics related since it’s the use of computer science techniques to address a biological or clinical problem. What does everyone else think?

Ruby Programming Language

I decided to attend a Ruby user group today and figured I needed to get up to date with Ruby as fast as possible. The following are the links I found most useful:

Does anyone have additional suggestions on how to quickly get started with Ruby?

Artificial Intelligence

I realized that yesterday I kept on mentioning artificial intelligence without even explaining the general concept of it. Kind of defeats the whole theme (“simplifying bioinformatics”) of my site, huh? Well here is a quick briefing on the concept of artificial intelligence.


According to netdictionary, artificial intelligence is “a branch of computer science that studies how to endow computers with capabilities of human intelligence.” Now let’s not get crazy and think of a highly intelligent system such as Data from Star Trek. These capabilities of human intelligence can be basic concepts such as the ability to classify objects. For instance, how we classify a shape to be a circle versus a square.

The idea is to provide the computer with enough information so that it can use an algorithm to make these classifications. For instance in the case of shapes one may provide the computer with the number of lines. Well of course the square has four lines while the circle has none or one depending on how you define a line. Remember this is an oversimplified case.
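As a toy illustration of that shapes example, here is a sketch in Python of a hand-written rule that classifies a shape from a single feature, the number of straight lines. The function name and the “unknown” fallback are my own additions for this example. A real classifier would learn its rules from example data rather than have them typed in.

```python
def classify_shape(num_lines):
    """Classify a shape using only its number of straight lines."""
    if num_lines == 4:
        return "square"
    # A circle has no lines, or one, depending on how you define a line.
    if num_lines in (0, 1):
        return "circle"
    # Anything else is outside what this toy rule knows about.
    return "unknown"


print(classify_shape(4))
print(classify_shape(0))
```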

These classifications can get quite advanced, such as the typical logistic credit application problem. In most beginning AI courses students are presented with credit application data and are charged with the task of programming the computer to decide whether an application should be approved or not. Another example is Dr. Brooks’s classification algorithm to distinguish between different structures of sarcoma tumor beds, covered yesterday.

Well that’s as simplified as I can put it. What does everyone else think?

Artificial Intelligence and Cancer Research

I attended a talk today concerning the use of artificial intelligence (AI) in cancer research. Dr. Paul Brooks discussed his research using AI to study a particular treatment of sarcoma, a cancer in the supportive tissues of the body (e.g. blood vessels, bones, muscles, etc).

He described the need to use computer algorithms to improve brachytherapy, a method that can be applied after the removal of a tumor. What brachytherapy does is insert sealed sources of radiotherapy around the area from which the tumor was removed to treat the leftover diseased tissue. In the abstract of his publication Dr. Brooks describes the current practice of determining the area for brachytherapy treatment to be a “tedious manual process”.

What his research entailed was using AI to automatically classify the contours of how the radiotherapy sources were to be placed. The AI would be able to distinguish between various structures of sarcoma tumor beds and suggest the best placement for the radiotherapy treatment.

I’m not sure about any other bioinformaticians out there but it’s pretty exciting to see computer science and biology merged together for actual clinical application.

Dr. Brooks’s abstract can be found here via PubMed.

Hooray for Bioinformatics!

The BBC news had a report last week on a UK scientist winning a large international prize for his work on E. coli. What was supposedly special about his work is that he used computers to model a part of the biological system of E. coli. It’s nice to see Bioinformatics (or Computational Biology) recognized internationally. If you want to know more you can read the article here.