# first we define relevant directories
import sys
# in case PySeqLab package is not installed,
# we can download the package repository from https://bitbucket.org/A_2/pyseqlab
# and then we add the location of the repository to the python system path
# location of the PySeqLab repository on disk -- INSERT location or discard if PySeqLab package is already installed
pyseqlab_package_dir = ""
sys.path.insert(0, pyseqlab_package_dir)
import os
# splice-junction directory
project_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
# src directory under splice-junction folder -- check the tree path (below)
src_dir = os.path.join(project_dir, 'src')
sys.path.insert(0, src_dir)
# get the tutorials dir
tutorials_dir = os.path.join(project_dir, 'tutorials')
# to use for customizing the display/format of the cells
from IPython.core.display import HTML
with open(os.path.join(tutorials_dir, 'pseqlab_base.css')) as f:
css = "".join(f.readlines())
HTML(css)
In this tutorial, we will learn about:
- the splice-junction prediction task and how we modeled it
- defining an attribute extractor and feature templates for the task
- training CRF models using stratified 5-fold cross validation
- evaluating the trained models and reviving them to decode new/unseen sequences
Determining the intron/exon boundaries in a given DNA nucleotide sequence is important for understanding RNA synthesis and protein translation. Introns are spliced out of the precursor mRNA, and the exons are concatenated to produce the mature mRNA that is later translated into protein. Hence, given a nucleotide sequence, the goal is to predict/recognize whether it represents (1) an intron/exon (acceptor) boundary (IE), (2) an exon/intron (donor) boundary (EI) or (3) neither of the two (N).
In order to do so, we used a publicly available dataset that was constructed for testing machine learning approaches on this task. The dataset includes a set of DNA sequences (60 nucleotides long) with labeled donor (EI) and acceptor (IE) sites. There are 751 instances labeled IE and 745 labeled EI. Negative samples were constructed from similarly sized windows that did not cross an intron/exon boundary, sampled at random from the latter sequences (Towell et al. 1992). The total number of sequences is 3190; however, after inspecting and processing the sequences we identified 12 duplicates and thus worked with 3178 sequences.
Our approach: We modeled the task as a "switch" problem. That is, whenever there was a switch (i.e. intron/exon or exon/intron), we assigned a label to the position where the change occurred. Because of the structure of the sequences in the dataset, the switch occurs at the 31st position in every sequence. We assigned label "1" for the switch from intron to exon, label "2" for the switch from exon to intron, and label "0" when no switch occurred. Additionally, we augmented the nucleotide letters (i.e. A, C, T, G, etc.) with their position in the sequence, counted from left to right, from 1 to 60. The combined letter and position number formed the main track that we used for building our model. We will refer to this track throughout this tutorial as num_obs.
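To make the num_obs track concrete, here is a minimal sketch (not the repository's process_dataset.py code; the exact string format used there may differ) of how the position-numbered observations can be built from a nucleotide string:

# illustrative sketch only -- the exact formatting used in process_dataset.py may differ
def build_num_obs_track(nucleotides):
    """Append the 1-based position to each nucleotide letter (e.g. 'C' at position 1 -> 'C1')."""
    return ["{}{}".format(letter, pos) for pos, letter in enumerate(nucleotides, start=1)]

print(build_num_obs_track("CAGGT"))
# ['C1', 'A2', 'G3', 'G4', 'T5']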
Reminder: To work with this tutorial interactively, we first need to clone the splice-junction repository from bitbucket locally to our disk. Then, navigate to [cloned_package_dir]/tutorials where [cloned_package_dir] is the path to the cloned package folder. The relevant parts of the repository structure are shown in the directory trees below.
NB: PySeqLab should already be installed or included in the Python system path before we proceed.
The src directory in the cloned repository includes three main modules:
- process_dataset.py
- splice_junction_attribute_extractor.py
- train_splicejunction_predictor_workflow.py
As a prerequisite, refer to these tutorials, which describe in detail the model building and training process using the PySeqLab package.
We performed stratified 5-fold cross validation where the dataset was divided into five training files with their corresponding five test files, using the functions implemented in the process_dataset.py module. By simply running the prepare_dataset() function, the training and test files are generated automatically. For reproducibility purposes, we will use the 5-fold division (found under the 5-fold folder in the dataset directory) that we already used for the trained models dumped in this repository.
In our current setting, our dataset is composed of five folds, each with a training and a test file found in the dataset folder:
├── dataset
│   ├── 5-fold
│   │   ├── test_f_0.txt
│   │   ├── test_f_1.txt
│   │   ├── test_f_2.txt
│   │   ├── test_f_3.txt
│   │   ├── test_f_4.txt
│   │   ├── train_f_0.txt
│   │   ├── train_f_1.txt
│   │   ├── train_f_2.txt
│   │   ├── train_f_3.txt
│   │   ├── train_f_4.txt
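As an aside, the splitting strategy itself (not the repository's prepare_dataset() implementation) can be sketched with scikit-learn's StratifiedKFold, which preserves the label distribution in each fold; the helper function and toy data below are hypothetical and only illustrate the idea:

# illustration of a stratified 5-fold split; prepare_dataset() in process_dataset.py
# generates the actual train/test files used in this tutorial
from sklearn.model_selection import StratifiedKFold

def stratified_five_fold(sequence_ids, labels):
    """Yield (train_ids, test_ids) pairs with the label distribution preserved in each fold."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(sequence_ids, labels):
        yield ([sequence_ids[i] for i in train_idx],
               [sequence_ids[i] for i in test_idx])

# toy example: sequence ids with labels '0' (no switch), '1' (IE) and '2' (EI)
toy_ids = ["seq_{}".format(i) for i in range(15)]
toy_labels = ["0", "1", "2"] * 5
for fold, (train_ids, test_ids) in enumerate(stratified_five_fold(toy_ids, toy_labels)):
    print("fold {}: {} training / {} test sequences".format(fold, len(train_ids), len(test_ids)))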
We start by defining the attribute extractor that we will use to generate attributes from the parsed sequences. Our attribute extractor, SpliceJunctionAttributeExtractor, is a subclass of the GenericAttributeExtractor class and is implemented in the splice_junction_attribute_extractor.py module. It defines attributes at each position in the sequence based on the num_obs track we defined earlier (see above). Below is an example of the attributes extracted by our SpliceJunctionAttributeExtractor class from a sequence in our training file.
After defining our attribute extractor, we define the feature templates that are used by the feature extractors to generate features. Feature templates and feature extraction are described in detail in this tutorial. In the train_splicejunction_predictor_workflow.py module, we define our feature templates using the template_config() function.
def template_config():
    template_generator = TemplateGenerator()
    templateXY = {}
    template_generator.generate_template_XY('num_obs', ('1-gram:2-gram:3-gram:4-gram:5-gram:6-gram', range(-15, 16)), '1-state', templateXY)
    templateY = template_generator.generate_template_Y('2-states')
    return (templateXY, templateY)
The defined templates include:
- only one track (i.e. num_obs) representing the numbered-position nucleotide letter
- n-grams of the observations (i.e. 1-gram:2-gram:3-gram:4-gram:5-gram:6-gram) in the specified window (i.e. range(-15, 16)), joined with the current label (i.e. 1-state of the Y labels)
- for the Y labels only: templates covering 2-states (i.e. pairs of consecutive labels)
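To make the window/n-gram specification concrete, the snippet below (an illustration only, not PySeqLab's internal template representation) enumerates the offset windows that 1-gram up to 6-gram templates cover within range(-15, 16):

# illustration of the offsets implied by ('1-gram:...:6-gram', range(-15, 16));
# TemplateGenerator builds its own internal representation of these templates
window = list(range(-15, 16))   # 31 offsets centered on the current position
for n in range(1, 7):           # 1-gram up to 6-gram
    offset_tuples = [tuple(window[i:i + n]) for i in range(len(window) - n + 1)]
    print("{}-gram: {} windows, e.g. {}".format(n, len(offset_tuples), offset_tuples[0]))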
In the train_splicejunction_predictor_workflow.py module, we implement the training workflow. In this section, we describe the training setup and the chosen options for performing the training.
We used the following classes:
- the feature extractor (HOFeatureExtractor),
- the higher-order CRF model (HOCRFAD), and
- its model representation (HOCRFADModelRepresentation)

For the training method (i.e. optimization options), we used the following options:
- stochastic gradient ascent (method = SGA),
- l2 regularization (regularization_type = l2) with a regularization value equal to 1 (regularization_value = 1), and
- 5 epochs (num_epochs = 5)

To run the training process, we use the run_training(optimization_options, template_config) function. We pass the optimization options and the function generating the defined feature templates (see this code snippet). The training process will perform the following:
- read the training file of each fold (e.g. train_f_0.txt) and parse it into sequences
- read the corresponding test file (e.g. test_f_0.txt) and parse it into sequences
- build and train a model for each fold

The return value of the training function (i.e. res -- see the code snippet below) is a list of tuples, one per fold; the first element of each tuple is the path to the trained model directory on disk (see models_dir in the code below).
The performance evaluation while using the trained model is reported based on single-nucleotide decoding (as in the code snippet below). Moreover, we track the estimated average log-likelihood for each model by plotting the generated avg_loglikelihood_training files.
from splice_junction_attribute_extractor import *
seq = example()
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,6)
from pyseqlab.utilities import ReaderWriter
# import the module containing training workflow
# we use only 10 sequences for demonstration
# to go through the whole file simply omit passing the num_seqs keyword argument
num_seqs = 10
from train_splicejunction_predictor_workflow import *
optimization_options = {"method" : "SGA",
"regularization_type": "l2",
"regularization_value":1,
"num_epochs":5,
"tolerance":1e-6
}
# demonstrate training using only 10 sequences from both the training and the test file
# NB: the performance evaluation reported during training is based on the single nucleotide accuracy/decoding error
res = run_training(optimization_options, template_config, num_seqs=10)
models_dir = [elem[0] for elem in res]
# using all sequences from both the training and the test file
# NB: the performance evaluation reported during training is based on the single nucleotide accuracy/decoding error
# res = run_training(optimization_options, template_config)
# evaluating the performance of the models
# models_dir = [elem[0] for elem in res]
# error_score = eval_models(models_dir)
def plot_avgloglikelihood(models_dir):
# plot the estimated average loglikelihood
for fold in range(len(models_dir)):
model_dir=models_dir[fold]
avg_ll = ReaderWriter.read_data(os.path.join(model_dir, 'avg_loglikelihood_training'))
plt.plot(avg_ll[1:], label="fold:{}, method:{}, {}:{}".format(fold,
optimization_options['method'],
optimization_options['regularization_type'],
optimization_options['regularization_value']))
plt.legend(loc='lower right')
plt.xlabel('number of epochs')
plt.ylabel('estimated average loglikelihood')
# optimization options used to train the models
optimization_options={'method':'SGA', 'regularization_type':'l2', 'regularization_value':1}
plot_avgloglikelihood(models_dir)
NB: Before proceeding, we have to unzip the trained_models directory so that we can explore and assess the trained models.
To evaluate the performance of the trained models, we will use the eval_models(args) function in the train_splicejunction_predictor_workflow.py module. It takes a list of the trained models' paths on disk. An example of a trained model directory has the following structure:
├── avg_loglikelihood_training
├── crf_training_log.txt
├── decoding_seqs
│   ├── test_fold_0.txt
│   ├── train_fold_0.txt
├── model_parts
│   ├── class_desc.txt
│   ├── FE_templateX
│   ├── FE_templateY
│   ├── MR_L
│   ├── MR_modelfeatures
│   ├── MR_modelfeaturescodebook
│   ├── MR_Ycodebook
│   ├── weights
Each model has a model_parts folder containing the saved model components. The decoded sequences are found under the decoding_seqs folder, where we have the decoding of the training and test files. We evaluate the performance of the trained models by computing the decoding error using those files.
The splice-junction prediction task requires assigning one label to the whole sequence (i.e. 60 nucleotides) as opposed to one label per nucleotide. Hence, the eval_models(args) function is implemented to evaluate the performance of assigning one label to the whole sequence rather than to single nucleotides (as in the code snippet below).
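As a rough illustration of the difference (this is not the repository's eval_models() implementation), sequence-level scoring can be thought of as collapsing each 60-nucleotide sequence to a single label, i.e. the switch label if a switch occurs and "0" otherwise, and then comparing reference against prediction:

# rough sketch of sequence-level scoring; eval_models() in
# train_splicejunction_predictor_workflow.py is the actual implementation used here
def sequence_label(per_position_labels):
    """Collapse per-nucleotide labels into one sequence label:
    the switch label ('1' or '2') if a switch occurs, '0' otherwise."""
    for label in per_position_labels:
        if label != "0":
            return label
    return "0"

def sequence_level_error(ref_label_seqs, pred_label_seqs):
    """Fraction of sequences whose collapsed reference and predicted labels disagree."""
    mismatches = sum(sequence_label(ref) != sequence_label(pred)
                     for ref, pred in zip(ref_label_seqs, pred_label_seqs))
    return mismatches / len(ref_label_seqs)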
import numpy
# to evaluate the models' performance on the sequence level, using already trained models
# eval_models takes a list of trained models path on disk
trainedmodels_rootdir = os.path.join(project_dir, 'trained_models')
models_folders = ("2017_5_5-10_38_38_3987",
"2017_5_5-11_59_56_390744",
"2017_5_5-13_8_2_735679",
"2017_5_5-14_4_13_716529",
"2017_5_5-15_0_45_409499")
models_dir = [os.path.join(trainedmodels_rootdir, folder) for folder in models_folders]
error_score = eval_models(models_dir)
print("decoding error score for the trained models:")
print(error_score)
print()
print("average deocding error across the trained models: ", numpy.mean(error_score))
In this section, we demonstrate how to revive a trained model for splice junction prediction and use it to decode a file comprising new/unseen sequences (i.e. test sequences).
As a reminder, the trained models (including their components) are found under the trained_models folder in the cloned repository. We have to unzip them first before proceeding. To use/revive a trained model dumped on disk, we use the revive_learnedmodel(args) function. It takes the path to the trained model's directory.
# we get the trained model parts directory -- check the tree path in the cell above
trained_model_dir = os.path.join(project_dir, 'trained_models')
# loading the trained model
crf_m = revive_learnedmodel(os.path.join(trained_model_dir, '2017_5_5-13_8_2_735679'))
After we have revived our model, we will use one of the test datasets found in the 5-fold directory. Just as a reminder, the tree path is:
├── dataset
│   ├── 5-fold
│   │   ├── test_f_0.txt
│   │   ├── test_f_1.txt
│   │   ├── test_f_2.txt
│   │   ├── test_f_3.txt
│   │   ├── test_f_4.txt
│   │   ├── train_f_0.txt
│   │   ├── train_f_1.txt
│   │   ├── train_f_2.txt
│   │   ├── train_f_3.txt
│   │   ├── train_f_4.txt
The test dataset is composed of multiple sequences, one observation per line, where consecutive sequences are separated by a newline. A shortened excerpt of the sequences in the train_f_0.txt file is provided below (each sequence spans 60 rows):
id obs label
AGMKPNRSB-NEG-1 C 0
AGMKPNRSB-NEG-1 A 0
AGMKPNRSB-NEG-1 A 0
AGMKPNRSB-NEG-1 A 0
AGMKPNRSB-NEG-1 A 0
AGMKPNRSB-NEG-1 G 0
AGMKPNRSB-NEG-1 A 0
AGMKPNRSB-NEG-1 A 0
...
ATRINS-ACCEPTOR-1678 C 0
ATRINS-ACCEPTOR-1678 A 0
ATRINS-ACCEPTOR-1678 G 0
ATRINS-ACCEPTOR-1678 T 1
ATRINS-ACCEPTOR-1678 G 0
ATRINS-ACCEPTOR-1678 G 0
...
ATRINS-DONOR-521 A 0
ATRINS-DONOR-521 G 0
ATRINS-DONOR-521 G 2
ATRINS-DONOR-521 T 0
ATRINS-DONOR-521 C 0
...
To read the file, we will use the DataFileParser class in the utilities module.
from pyseqlab.utilities import DataFileParser
# initialize a data file parser
dparser = DataFileParser()
# provide the options to the parser such as the header info, the separator between columns and whether the y label already exists
# 'main' means the header is found in the first line of the file
header = "main"
# y_ref is a boolean indicating if the label to predict is already found in the file
y_ref = True
# separator between the observations
column_sep = "\t"
seqs = []
for seq in dparser.read_file(os.path.join(project_dir, 'dataset', '5-fold','test_f_2.txt'), header, y_ref=y_ref, column_sep = column_sep):
seqs.append(seq)
# printing one sequence for display
print(seqs[0])
print("number of parsed sequences is: ", len(seqs))
Then, we specify the decoding options for our model to use. The main method for decoding is decode_seqs(decoding_method, out_dir, **kwargs), which takes two positional arguments and multiple keyword arguments.
The obligatory arguments are:
- decoding_method: a string representing the decoding method, such as 'viterbi'
- out_dir: a string representing the output directory where the decoded sequences will be written
For the keyword arguments, the main ones to specify are:
- seqs: the list of sequences we already parsed/read from the text file we need to label
- file_name: the name of the file to which the decoded sequences will be written (optional)
- sep: the separator between the columns/observations when writing the decoded sequences to the file specified by the file_name keyword argument
decoding_method = 'viterbi'
output_dir = os.path.join(project_dir, 'tutorials')
sep = "\t"
# decode sequences
seqs_decoded = crf_m.decode_seqs(decoding_method, output_dir, seqs= seqs, file_name = 'tutorial_seqs_decoding.txt', sep=sep)
The decoded sequences will be found in the decoding_seqs directory under the tutorials directory (i.e. the current directory).

|---tutorials
|   |---decoding_seqs
|   |   |---tutorial_seqs_decoding.txt

The tutorial_seqs_decoding.txt file follows the same template/format as the test_f_2.txt file we parsed earlier, but with an additional column containing our model's predictions.
We evaluate the decoding performance, this time using the eval_decoded_file(args) function in the train_splicejunction_predictor_workflow.py module. It takes the path to the newly decoded file as an argument and returns the decoding error.
new_decseqs_file = os.path.join(tutorials_dir, 'decoding_seqs','tutorial_seqs_decoding.txt')
score = eval_decoded_file(new_decseqs_file)
print("error score: ", score)
Our model achieves a decoding error score equal to 3.1%.
Exploring and experimenting with different feature templates and attributes for training a model are left as an exercise for the readers ... :)
M. O. Noordewier, G. G. Towell and J. W. Shavlik (1991). "Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences." In Advances in Neural Information Processing Systems, volume 3. Morgan Kaufmann.
G. G. Towell and J. W. Shavlik (1992). "Interpretation of Artificial Neural Networks: Mapping Knowledge-Based Neural Networks into Rules." In Advances in Neural Information Processing Systems, volume 4. Morgan Kaufmann.