Crazybiocomputing.GitHub.io

Tools about Bioinformatics

For biology students, I think learning a programming language is not a waste of time. However, there are many, many different languages [Wiki], and you have to choose one with a good balance between criteria such as ease of learning, wealth of tutorials and examples, a large user community, etc. Nowadays, IMHO, two programming languages fulfill these criteria:
- Python
- JavaScript and all the web technologies (HTML5 and CSS3)

Note: If you need more sophisticated statistical functions, it is worth looking at the “R” language. It is more complex but more powerful in this field.

1- Python programming language

1-1- The Basics: Variables, conditionals, and loops

1-1-1 Variables

Variables are “boxes” containing one value. This value may be a:
- Number (integer, floating-point number, etc.)
- Boolean (True or False)
- String (aka Text)
- List
- Dictionary
- Object

Variable names are case sensitive, cannot begin with a digit, and cannot be a keyword reserved by Python (like for).


myVar = 3
myVar = myVar + 1 # ← 4

0test = 5 # ← ERROR
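To make the kinds of values listed above more concrete, here is a small sketch with one variable per kind. All the names and values are made up for illustration.

```python
# One variable per kind of value (names and values are illustrative only).
codon_count = 5                          # Number (integer)
gc_ratio = 0.6                           # Number (floating-point)
is_coding = True                         # Boolean
sequence = 'actgctgtcgaaccg'             # String
codons = ['act', 'gct', 'gtc']           # List
amino_acids = {'atg': 'M', 'tgg': 'W'}   # Dictionary: codon -> amino acid

# type() reports what kind of value a variable currently holds.
for value in (codon_count, gc_ratio, is_coding, sequence, codons, amino_acids):
    print(type(value).__name__)          # int, float, bool, str, list, dict

# Names are case sensitive: myVar and myvar are two different variables.
myVar = 3
myvar = 10
print(myVar + myvar)                     # 13
```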

1-1-2 Conditionals


if condition:
  instruction_1
  instruction_2
  instruction_3
elif other_condition:
  instruction_a
  instruction_b
else:
  instruction_01
  instruction_02
  instruction_03
  instruction_04
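As a concrete instance of the template above, here is a minimal sketch that labels a sequence according to its GC content. The 0.55 and 0.45 thresholds are arbitrary values chosen for the example, not biological standards.

```python
seq = 'actgctgtcgaaccg'

# Count the g and c letters, then compute their proportion in the sequence.
gc_count = seq.count('g') + seq.count('c')
gc_ratio = gc_count / float(len(seq))

# Only one of the three blocks is executed.
# The thresholds are arbitrary (assumption made for this example).
if gc_ratio > 0.55:
    label = 'GC-rich'
elif gc_ratio < 0.45:
    label = 'AT-rich'
else:
    label = 'balanced'

print(label)  # GC-rich
```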

1-1-3 Loops

1-1-3-1 While Loop

while exit_condition:
  instruction_1
  instruction_2
  instruction_3

For example, to display each character of the string 'python':

lang = 'python'
i = 0
while i < len(lang):
  print(lang[i])
  i = i + 1

And the result is…

p
y
t
h
o
n

1-1-3-2 for in Loop

This loop requires a list (or any other sequence, such as a string) and applies the block of instructions to each element, with the following syntax:

for obj in a_list:
  instruction_1
  instruction_2
  instruction_3

Here is a basic example…

lang = ['p','y','t','h','o','n']

for char in lang:
  print(char)

This type of loop is very often used with the range(…) function. This very convenient function, range(start, end, step), generates a series of numbers going from start up to, but not including, end.

mySeries = range(0,10,2) #  ← [0,2,4,6,8]

for i in mySeries:
  print(i)
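One detail worth knowing: in Python 3, range(…) does not build the whole list in memory; it produces the numbers one by one when the loop asks for them. A small sketch, wrapping the series in list(…) just to display it:

```python
mySeries = range(0, 10, 2)

# list(...) forces the series to be written out so we can look at it.
print(list(mySeries))          # [0, 2, 4, 6, 8]

# With a single argument, the series starts at 0 and the step is 1.
print(list(range(5)))          # [0, 1, 2, 3, 4]

# A negative step counts down; the end value is still excluded.
print(list(range(10, 0, -3)))  # [10, 7, 4, 1]
```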

movie = 'starwarsreturnofthejedi.'
for i in range(1, 11):
  square = i ** 2
  print(square) # 1, 4, 9, 16, 25, ..., 100

1-2- Advanced: Functions

def function_name(arg1,arg2,..., argn):
  instruction_1
  instruction_2
  instruction_3
  return result

For example, to split a nucleic acid sequence into codons:

def getCodons(seq):
  """ Extract codons from a nucleic sequence """
  codons_list = []
  for i in range(0,len(seq),3):
    codon = seq[i:i+3]
    codons_list.append(codon)
  return codons_list

# Test
mySeq='actgctgtcgaaccg'
myCodons = getCodons(mySeq) # ← ['act', 'gct', 'gtc', 'gaa', 'ccg']
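Note what happens when the sequence length is not a multiple of three: the last element is an incomplete codon. Here is a sketch of one way to handle it; getCompleteCodons is a hypothetical helper added for this example.

```python
def getCodons(seq):
    """ Extract codons from a nucleic sequence (same logic as above) """
    codons_list = []
    for i in range(0, len(seq), 3):
        codons_list.append(seq[i:i+3])
    return codons_list

def getCompleteCodons(seq):
    """ Hypothetical variant: keep only the full three-letter codons """
    complete = []
    for codon in getCodons(seq):
        if len(codon) == 3:
            complete.append(codon)
    return complete

# Test with 8 nucleotides: two full codons and two leftover letters
print(getCodons('actgctgt'))          # ['act', 'gct', 'gt']
print(getCompleteCodons('actgctgt'))  # ['act', 'gct']
```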

1-3- Learning Python with a web-based tool

Python Fiddle is a good starting environment to write your first scripts.

1-4- Learning Python with IDLE

2- Python helper Code

I use Python Fiddle extensively and, AFAIK, it is not possible to import BioPython there. Moreover, BioPython is a little bit too complex for biology students with no programming skills.

The Seq class below stores the information (title and sequence data) from a FASTA sequence.


class Seq:
    def __init__(self, seqdata):
        _tmp = seqdata.split('\n')
        self.description = _tmp[0][1:] if _tmp[0][0] == '>' else _tmp[0]
        self.data = ''.join(_tmp[1:]).strip()
        
        # Default values for the fields read from the title
        self.author1 = 'None'
        self.author2 = 'None'
        self.copy    = 0
        self.db      = 'None'
        self.id      = 'None'
        self.db2     = 'None'
        self.acc     = 'None'
        self.title   = 'None'

        # Try to read information within the description
        sep = '|'
        _tmp = self.description.split(sep)
        self.db = _tmp[0]
        if self.db == 'xzy':
            #CrazyBio header: xzy|first author|copynumber|second author
            self.author1 = _tmp[1]
            self.author2 = _tmp[3]
            self.copy    = int( _tmp[2])
            self.length  = len(self.data)
        elif self.db == 'gi':
            # gi|gi number|gb|accession number|locus
            self.id   = _tmp[1]
            self.db2  = _tmp[2]
            self.acc  = _tmp[3]
            self.title = _tmp[4]
        elif self.db == 'sp':
            # sp|accession number|name
            self.db   = _tmp[0]
            self.acc  = _tmp[1]
            self.title = _tmp[2]

    def show(self):
        attrs = vars(self)
        return ', '.join("%s: %s" % item for item in attrs.items())
        
    def fasta(self):
        return '>{:s}\n{:s}'.format(self.description,self.data)

Usage


fasta = """>sp|P68871|HBB_HUMAN Hemoglobin subunit beta OS=Homo sapiens GN=HBB PE=1 SV=2
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH"""

seq = Seq(fasta)
print(seq.acc) # ← P68871
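To see how the title line is taken apart, the same splitting steps can be tried outside the class. This is only a sketch: split_fasta is a hypothetical helper written for this example, not part of the Seq class.

```python
def split_fasta(text):
    """ Hypothetical helper: return (description, data) of one FASTA record """
    lines = text.strip().split('\n')
    header = lines[0]
    # Drop the leading '>' of the title line if it is present
    description = header[1:] if header.startswith('>') else header
    # Join the remaining lines into one continuous sequence
    data = ''.join(line.strip() for line in lines[1:])
    return description, data

fasta = """>sp|P68871|HBB_HUMAN Hemoglobin subunit beta OS=Homo sapiens GN=HBB PE=1 SV=2
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH"""

description, data = split_fasta(fasta)
fields = description.split('|')  # the fields are separated by '|'

print(fields[0])   # sp
print(fields[1])   # P68871
print(len(data))   # 147
```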

3- List of EMBOSS Tools

EMBOSS (European Molecular Biology Open Software Suite) is a European package containing various bioinformatics programs, available as web-based or local tools.
The list of all the tools is available here, grouped by category there; I also created my own version combining both there.