Clustering Methods and Algorithms in Genomics - Python Practicals

Overview

These practicals cover part of the Durbin section on HMM modelling. Through these practicals we will show you how to generate your own reference data, measure its properties, estimate a model with known labels, decode data with a provided HMM, and train your own HMM using Viterbi. The code is written in plain Python, with no special complications.

General Note on the following exercises.

All these exercises work on the same principle. A problem file is provided as x.y.fooo<_pb>.py. The aim of each exercise is to modify this initial file following the instructions of the exercise. There may be many ways to implement a solution. To help you, I have implemented my own solution and I am providing its output on the reference datasets. This sample output comes as a file named foo_.output, and the command used to generate it is given on the lines starting with ##.

In order to make sure you have everything under control, you can regenerate different output files with the solution script.

P1 - Modelling Data

1.1 - Estimate the parameters of this occasionally dishonest casino (ODC) series: 1.1.odc2stat.pb.py

## python 1.1.odc2stat.sol.pyc odc.run > 1.1.odc2stat.sol.output
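
As a hedged illustration of estimation with known labels, the sketch below counts transitions and emissions along a labelled series. It assumes the series is available as two equal-length strings, one of rolls (digits 1 to 6) and one of states (F for fair, L for loaded). The real layout of odc.run may differ, and the name estimate_odc is hypothetical.

```python
from collections import Counter


def estimate_odc(rolls, states):
    """Estimate ODC parameters by counting along a labelled series.

    Assumption: `rolls` is a string of digits 1-6 and `states` is an
    equal-length string of 'F' (fair) / 'L' (loaded). The actual format
    of odc.run may be different.
    """
    if len(rolls) != len(states):
        raise ValueError("rolls and states must have the same length")

    trans = Counter(zip(states, states[1:]))  # (from, to) -> count
    emit = Counter(zip(states, rolls))        # (state, face) -> count
    n_state = Counter(states)                 # number of rolls per state
    n_from = Counter(states[:-1])             # outgoing transitions per state

    def ratio(num, den):
        # Undefined (nan) when a state was never visited.
        return num / den if den else float("nan")

    return {
        "pFL": ratio(trans[("F", "L")], n_from["F"]),
        "pLF": ratio(trans[("L", "F")], n_from["L"]),
        "p6_loaded": ratio(emit[("L", "6")], n_state["L"]),
        "p6_fair": ratio(emit[("F", "6")], n_state["F"]),
    }
```

For example, estimate_odc("123666", "FFFLLL") counts one F-to-L switch out of three transitions leaving F, so pFL comes out as 1/3.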

1.2 - Create a generator that allows you to regenerate a series having the same properties (i.e. similar statistics) as those measured on odc.run. Your generator will take as input four numbers: 1.1.odc2stat.pb.py

    pFL, the fair-to-loaded transition probability
    pLF, the loaded-to-fair transition probability
    p6, the probability of emitting a 6 with the loaded die
    N, the run size

Use 1.1.odc2stat.pb.py to check that your generator is correct.

## python 1.2.odc.sol.pyc 0.1 0.2 0.5 1000 > 1.2.odc.sol.output
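
One possible shape for such a generator is sketched below. It is not the reference solution. It assumes the run starts in the fair state, that the fair die is uniform, and that the loaded die spreads the remaining probability evenly over faces 1 to 5. The function name generate_odc and the optional seed argument are hypothetical.

```python
import random


def generate_odc(p_fl, p_lf, p6, n, seed=None):
    """Generate n rolls from a two-state (fair/loaded) casino model.

    Assumptions (not stated in the exercise): the run starts in the
    fair state, the fair die is uniform, and the loaded die shares
    (1 - p6) equally between faces 1 to 5.
    Returns (rolls, states) as two strings of length n.
    """
    rng = random.Random(seed)
    faces = "123456"
    loaded_weights = [(1.0 - p6) / 5.0] * 5 + [p6]

    state = "F"
    rolls, states = [], []
    for _ in range(n):
        # Emit a face from the current state.
        if state == "F":
            roll = rng.choice(faces)
        else:
            roll = rng.choices(faces, weights=loaded_weights)[0]
        rolls.append(roll)
        states.append(state)
        # Decide whether to switch state before the next roll.
        p_switch = p_fl if state == "F" else p_lf
        if rng.random() < p_switch:
            state = "L" if state == "F" else "F"
    return "".join(rolls), "".join(states)
```

Calling generate_odc(0.1, 0.2, 0.5, 1000) mirrors the arguments of the command line above. Keeping the state string alongside the rolls makes it possible to measure decoding accuracy later.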

P2 - Viterbi Decoding of Existing Data with Known Model

2.1 - Implement a Viterbi decoder that lets you decode the ODC series you generated in the last practical. Use the Durbin formulation (Durbin p56): 2.1.viterbi.pb.py

## python 2.1.viterbi.sol.pyc data.txt model.txt > 2.1.viterbi.sol.output
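
The sketch below shows a standard log-space Viterbi recursion with traceback. It is one way to approach the exercise rather than the reference solution. The model is passed as plain dictionaries, which is an assumption: the format of model.txt is not described here, and the name viterbi and its arguments are hypothetical.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its log probability).

    Assumptions: `obs` is a string of symbols, `states` a string of
    one-letter state names, `start_p[k]`, `trans_p[k][l]` and
    `emit_p[k][x]` are plain probabilities held in dictionaries.
    """
    if not obs:
        return "", 0.0

    def log(p):
        # Work in log space to avoid underflow on long runs.
        return math.log(p) if p > 0 else float("-inf")

    # v[k]: log probability of the best path ending in state k.
    v = {k: log(start_p[k]) + log(emit_p[k][obs[0]]) for k in states}
    back = []  # one dictionary of back-pointers per position after the first
    for x in obs[1:]:
        ptr, new_v = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] + log(trans_p[k][l]))
            ptr[l] = best
            new_v[l] = log(emit_p[l][x]) + v[best] + log(trans_p[best][l])
        back.append(ptr)
        v = new_v

    # Traceback from the best final state.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), v[last]
```

Summing logarithms instead of multiplying probabilities keeps the scores finite even on runs of several thousand rolls.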

2.2 - Measure the accuracy of the decoding using the sensitivity, the specificity and the Sen2 as defined in burset1996.pdf. The main difficulty will be to define the true and false positives and negatives (tp, fp, tn, fn): 2.2.viterbi.pb.py

## python 2.2.viterbi.sol.pyc data.txt model.txt > 2.2.viterbi.sol.output
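
A minimal sketch of the counting step follows, treating the loaded state as the positive class. The formulas used are the ones common in the gene-prediction literature, sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP). This is an assumption that should be checked against burset1996.pdf, and Sen2 is deliberately left out. All function names are hypothetical.

```python
def confusion_counts(true_states, pred_states, positive="L"):
    """Count (tp, fp, fn, tn) position by position.

    Assumption: both arguments are equal-length strings of state
    letters, and `positive` marks the state being predicted.
    """
    if len(true_states) != len(pred_states):
        raise ValueError("both series must have the same length")
    tp = fp = fn = tn = 0
    for t, p in zip(true_states, pred_states):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of truly loaded positions that were predicted loaded."""
    return tp / (tp + fn) if tp + fn else float("nan")


def specificity(tp, fp):
    """Fraction of predicted loaded positions that are truly loaded.

    Assumed definition (TP / (TP + FP)); verify against burset1996.pdf.
    """
    return tp / (tp + fp) if tp + fp else float("nan")
```

With true states "FFLLLF" and predicted states "FLLLFF", the counts are tp=2, fp=1, fn=1, tn=2, so both measures equal 2/3.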

2.3 - Generate series in which the bias towards 6 varies from 1.01 up to 5, and measure the accuracy of the decoding on these series. What is the individual effect of each parameter on the accuracy with which you predict the loaded state? Can this effect be mitigated (i.e. increased or decreased) by increasing the length of the run? What do you conclude about the suitability of HMM decoding when dealing with biological signals?

P3 - Using Viterbi to Train a Model on Available Data

3.1 - Adapt the Viterbi algorithm into a training algorithm. Follow the Durbin formulation (Durbin p65): 3.1.viterbi.pb.py

## python 3.1.viterbi.sol.pyc data.txt model.txt > 3.1.viterbi.sol.output
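
The general idea of Viterbi training can be sketched as follows: decode the data with the current model, re-estimate the parameters by counting along the decoded path, and repeat until the path stops changing. The code below is an illustration under stated assumptions, not the reference solution: the start probabilities are kept fixed, a pseudocount of 1 is added to every count, and the names viterbi_path and viterbi_train are hypothetical.

```python
import math


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi_path(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for a non-empty obs (log-space Viterbi)."""
    v = {k: _log(start_p[k]) + _log(emit_p[k][obs[0]]) for k in states}
    back = []
    for x in obs[1:]:
        ptr, new_v = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] + _log(trans_p[k][l]))
            ptr[l] = best
            new_v[l] = _log(emit_p[l][x]) + v[best] + _log(trans_p[best][l])
        back.append(ptr)
        v = new_v
    path = [max(states, key=lambda k: v[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


def viterbi_train(obs, states, symbols, start_p, trans_p, emit_p,
                  max_iter=50, pseudo=1.0):
    """Alternate decoding and re-estimation until the path is stable.

    Assumptions: start_p is not re-estimated, and `pseudo` is added to
    every count so that no probability drops to zero.
    Returns (trans_p, emit_p, path).
    """
    path = None
    for _ in range(max_iter):
        new_path = viterbi_path(obs, states, start_p, trans_p, emit_p)
        if new_path == path:
            break  # converged: the decoded path no longer changes
        path = new_path
        # Count transitions and emissions along the decoded path.
        a = {k: {l: pseudo for l in states} for k in states}
        e = {k: {s: pseudo for s in symbols} for k in states}
        for k, l in zip(path, path[1:]):
            a[k][l] += 1
        for k, x in zip(path, obs):
            e[k][x] += 1
        # Normalise the counts into probabilities.
        trans_p = {k: {l: a[k][l] / sum(a[k].values()) for l in states}
                   for k in states}
        emit_p = {k: {s: e[k][s] / sum(e[k].values()) for s in symbols}
                  for k in states}
    return trans_p, emit_p, path
```

Because each iteration assigns a single discrete path, the loop can stop as soon as two successive decodings agree. The result depends on the initial model, so it is worth trying several starting points.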

P4 - Using Nextflow to Quantify Expression Data

Follow the Tutorial