For biology students, I think learning a programming language is not a waste of time. However, there are many different languages [Wiki], and you have to choose one with a good balance between criteria such as ease of learning, wealth of tutorials and examples, a large user community, etc. Nowadays, IMHO, two programming languages fulfill these criteria:
- Python
- JavaScript and all the web technologies (HTML5 and CSS3)
Note*: If you need more sophisticated statistical functions, have a look at the “R” language. It is more complex but more powerful in this field.
1- Python programming language
1-1- The Basics: Variables, conditionals, and loops
1-1-1 Variables
Variables are “boxes” containing one value. This value may be a:
- Number (integer, floating-point number, etc.)
- Boolean (True or False)
- String (aka Text)
- List
- Dictionary
- Object
Variable names are case sensitive, cannot begin with a digit, and cannot be a keyword reserved by Python (like for).
myVar = 3
myVar = myVar + 1 # ← 4
0test = 5 # ← ERROR
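To make the list of value types above more concrete, here is a small sketch that stores one value of each basic kind and asks Python for its type. All variable names and values are made up for illustration, and the snippet assumes Python 3.

```python
import keyword

# One variable per kind of value (names and values are illustrative only)
gene_count = 42                          # integer number
gc_ratio = 0.58                          # floating-point number
is_coding = True                         # boolean
gene_name = 'HBB'                        # string
codons = ['atg', 'gtg', 'cac']           # list
codon_table = {'atg': 'M', 'tgg': 'W'}   # dictionary

for value in (gene_count, gc_ratio, is_coding, gene_name, codons, codon_table):
    print(type(value).__name__)

# Names are case sensitive: these are two different variables
myVar = 3
myvar = 10

# Reserved keywords such as 'for' cannot be used as variable names
print(keyword.iskeyword('for'))     # True
# A name cannot begin with a digit
print('0test'.isidentifier())       # False
```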
1-1-2 Conditionals
if condition:
    instruction_1
    instruction_2
    instruction_3
elif other_condition:
    instruction_a
    instruction_b
else:
    instruction_01
    instruction_02
    instruction_03
    instruction_04
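As a concrete instance of this syntax, the sketch below classifies a nucleotide. The variable names and the input value are hypothetical, chosen only for illustration.

```python
# Classify a nucleotide with if / elif / else
base = 'g'  # hypothetical input value

if base in ('a', 'g'):
    family = 'purine'
elif base in ('c', 't', 'u'):
    family = 'pyrimidine'
else:
    family = 'unknown'

print(base, 'is a', family)
```

Only one of the three blocks is executed: Python tests the conditions from top to bottom and stops at the first one that is True.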
1-1-3 Loops
1-1-3-1 While Loop
while condition: # the block is repeated as long as condition is True
    instruction_1
    instruction_2
    instruction_3
For example, to display each character of the string 'python':
lang = 'python'
i = 0
while i < len(lang):
    print(lang[i])
    i = i + 1
And the result is…
p
y
t
h
o
n
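A while loop is most useful when you do not know in advance how many iterations are needed. As a rough sketch, the following snippet scans a made-up sequence until it meets the first 'atg' start codon. The sequence and variable names are assumptions for illustration.

```python
seq = 'ccgatgaaatga'  # made-up sequence for illustration
i = 0

# Move forward while a full codon remains and it is not 'atg'
while i + 3 <= len(seq) and seq[i:i+3] != 'atg':
    i = i + 1

if i + 3 <= len(seq):
    print('start codon found at index', i)
else:
    print('no start codon')
```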
1-1-3-2 for in Loop
This loop requires a list (or any other sequence) and applies the block of instructions to each element, with the following syntax:
for obj in a_list:
    instruction_1
    instruction_2
    instruction_3
Here is a basic example…
lang = ['p','y','t','h','o','n']
for char in lang:
    print(char)
This type of loop is very often used with the range(…) function. This very convenient function range(start, end, step) creates a series of numbers from start up to, but not including, end (a list in Python 2, a lazy range object in Python 3).
mySeries = range(0,10,2) # ← [0,2,4,6,8]
for i in mySeries:
    print(i)
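The three arguments of range are easier to grasp with a few variations. This sketch assumes Python 3, where range has to be wrapped in list(...) to display the numbers.

```python
# list(...) makes the numbers visible in Python 3, where range is lazy
print(list(range(5)))          # [0, 1, 2, 3, 4]
print(list(range(2, 8)))       # [2, 3, 4, 5, 6, 7]
print(list(range(0, 10, 2)))   # [0, 2, 4, 6, 8]
print(list(range(10, 0, -3)))  # [10, 7, 4, 1]
```

With a single argument, the series starts at 0. A negative step counts down.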
movie = 'starwarsreturnofthejedi.'
for i in range(1, 11):
    poweroftwo = i ** 2
    print(poweroftwo) # 1, 4, 9, 16, 25, ..., 100
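A for loop can also walk through a string character by character, which is handy for sequences. As a small sketch (made-up sequence, Python 3 assumed for the division), this counts each nucleotide and derives the GC content.

```python
seq = 'actgctgtcgaaccg'  # made-up sequence for illustration
counts = {'a': 0, 'c': 0, 'g': 0, 't': 0}

# Visit each base once and increment its counter
for base in seq:
    counts[base] = counts[base] + 1

# In Python 3, / is a true (floating-point) division
gc_content = (counts['g'] + counts['c']) / len(seq)

print(counts)
print(gc_content)
```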
1-2- Advanced: Functions
def function_name(arg1, arg2, ..., argn):
    instruction_1
    instruction_2
    instruction_3
    return result
For example, to extract the codons of a nucleic acid sequence:
def getCodons(seq):
    """ Extract codons from a nucleic sequence """
    codons_list = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i+3]
        codons_list.append(codon)
    return codons_list
# Test
mySeq='actgctgtcgaaccg'
myCodons = getCodons(mySeq) # ← ['act', 'gct', 'gtc', 'gaa', 'ccg']
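Following the same pattern (arguments in, result out), here is a second sketch of a function: it computes the reverse complement of a DNA sequence. The function name reverse_complement and the lowercase-only dictionary are assumptions for illustration, not part of the original lesson.

```python
def reverse_complement(seq):
    """ Return the reverse complement of a lowercase DNA sequence """
    pairs = {'a': 't', 't': 'a', 'c': 'g', 'g': 'c'}
    result = ''
    # Read the sequence backwards and append the complementary base
    for base in reversed(seq):
        result = result + pairs[base]
    return result

# Test
print(reverse_complement('atgc'))  # gcat
```

Any character other than a, c, g or t raises a KeyError, which is a reasonable way to notice an invalid sequence early.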
1-3- Learning Python with a web-based tool
Python Fiddle is a good starting environment to write your first scripts.
1-4- Learning Python with IDLE
2- Python helper Code
We use Python Fiddle extensively and, AFAIK, it is not possible to import BioPython there. Moreover, BioPython is a little too complex for biology students with no programming skills.
The Seq class below stores the information (title and sequence data) of a FASTA sequence.
class Seq:
    def __init__(self, seqdata):
        _tmp = seqdata.split('\n')
        # First line is the description, with or without the leading '>'
        self.description = _tmp[0][1:] if _tmp[0][0] == '>' else _tmp[0]
        self.data = ''.join(_tmp[1:]).strip()
        self.length = len(self.data)
        # Default values for the header fields
        self.author1 = 'None'
        self.author2 = 'None'
        self.copy = 0
        self.db = 'None'
        self.id = 'None'
        self.db2 = 'None'
        self.acc = 'None'
        self.title = 'None'
        # Try to read information within the description
        sep = '|'
        _tmp = self.description.split(sep)
        self.db = _tmp[0]
        if self.db == 'xzy':
            # CrazyBio header: xzy|first author|copy number|second author
            self.author1 = _tmp[1]
            self.author2 = _tmp[3]
            self.copy = int(_tmp[2])
        elif self.db == 'gi':
            # gi|gi number|gb|accession number|locus
            self.id = _tmp[1]
            self.db2 = _tmp[2]
            self.acc = _tmp[3]
            self.title = _tmp[4]
        elif self.db == 'sp':
            # sp|accession number|name
            self.acc = _tmp[1]
            self.title = _tmp[2]

    def show(self):
        # List every attribute as "name: value"
        attrs = vars(self)
        return ', '.join("%s: %s" % item for item in attrs.items())

    def fasta(self):
        # Rebuild a FASTA record (the sequence is written on a single line)
        return '>{:s}\n{:s}'.format(self.description, self.data)
Usage
fasta = """>sp|P68871|HBB_HUMAN Hemoglobin subunit beta OS=Homo sapiens GN=HBB PE=1 SV=2
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH"""
seq = Seq(fasta)
print(seq.acc) # ← P68871
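The Seq class handles a single record. If your text contains several FASTA records, one possible approach is to cut it into single-record strings first and pass each of them to Seq. The helper split_fasta below is a hypothetical sketch, not part of the original helper code, and the author names in the sample data are invented.

```python
def split_fasta(text):
    """ Split multi-FASTA text into a list of single-record strings """
    records = []
    current = []
    for line in text.strip().split('\n'):
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith('>') and current:
            # A new header closes the previous record
            records.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        records.append('\n'.join(current))
    return records

# Made-up data using the xzy header layout described above
multi = """>xzy|Smith|2|Jones
acgtacgt
ttga
>xzy|Doe|1|Roe
ggccaa"""

for record in split_fasta(multi):
    print(record.split('\n')[0])  # header line of each record
```

Each string returned by split_fasta has the shape expected by the Seq constructor (a header line followed by the sequence lines).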
3- List of EMBOSS Tools
EMBOSS (European Molecular Biology Open Software Suite) is a European package containing various bioinformatics programs, available as web-based or local tools.
The list of all the tools is available here, the same tools grouped by categories are there, and I created my own version combining both there.