In [2]:
# first we define the relevant directories
import os
import sys
# in case the PySeqLab package is not installed,
# we can download the package repository from https://bitbucket.org/A_2/pyseqlab
# and then add the location of the repository to the python system path
# location of the PySeqLab repository on disk -- INSERT location or leave empty if PySeqLab is already installed
pyseqlab_package_dir = ""
if pyseqlab_package_dir:
    sys.path.insert(0, pyseqlab_package_dir)
# splice-junction directory
project_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
# src directory under splice-junction folder -- check the tree path (below)
src_dir = os.path.join(project_dir, 'src')
sys.path.insert(0, src_dir)
# get the tutorials dir
tutorials_dir = os.path.join(project_dir, 'tutorials')
# to use for customizing the display/format of the cells
from IPython.display import HTML
with open(os.path.join(tutorials_dir, 'pseqlab_base.css')) as f:
    css = f.read()
HTML(css)

1. Objectives and goals

In this tutorial, we will learn about:

  • the process/workflow for building a eukaryotic splice-junction sequence predictor based on the conditional random fields (CRFs) model formalism
  • training the built model and evaluating its performance
  • reviving the trained model and decoding new sequences (i.e. test sequences that were not used for training)

1.1 Problem statement

Determining the intron/exon boundaries in a given DNA nucleotide sequence is important for understanding RNA synthesis and protein translation. Introns are spliced out in the precursor mRNA phase, and the exons are concatenated to produce the mature mRNA that is translated to protein in a later phase. Hence, given a nucleotide sequence, the goal is to predict whether it represents (1) an intron/exon boundary (IE, acceptor site), (2) an exon/intron boundary (EI, donor site) or (3) neither of the two (N).

To do so, we used a publicly available dataset that was constructed for testing machine learning approaches on this task. The dataset includes a set of DNA sequences (60 nucleotides long) with labeled donor (EI) and acceptor (IE) sites. There are 751 instances with the IE label and 745 with the EI label. Negative samples were constructed using similarly sized windows that did not cross an intron/exon boundary, sampled at random from the later sequences (Towell et al. 1992). The total number of sequences is 3190. However, after inspecting and processing the sequences, we identified 12 duplicates and thus worked with 3178 sequences.

Our approach: We modeled the task as a "switch" problem. That is, whenever there was a switch (i.e. intron/exon or exon/intron), we assigned a label to the position where the change occurred. Because of the structure of the sequences in the dataset, the switch occurs at the 31st position in every sequence. We assigned label "1" for the switch from intron to exon, label "2" for the switch from exon to intron and "0" when no switch occurred. Additionally, we augmented the nucleotide letters (i.e. A, C, T, G, etc.) with their position in the sequence, counted from left to right from 1 to 60. The combined letter and position number formed the main track that we used for building our model. We refer to this track throughout this tutorial as num_obs.
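To make the encoding concrete, here is a minimal sketch of how the num_obs track and the labels could be built for one sequence. The helper names (build_num_obs, build_labels) and the example sequence are hypothetical and are not taken from the repository; the sketch also assumes that every position other than the switch position carries the label "0".

```python
def build_num_obs(sequence):
    """Pair each nucleotide letter with its 1-based position, e.g. 'A1', 'C2'."""
    return ["{}{}".format(letter, pos) for pos, letter in enumerate(sequence, start=1)]


def build_labels(seq_len, seq_class, switch_pos=31):
    """Label every position '0' except the switch position (assumption).

    seq_class is 'IE' (label '1'), 'EI' (label '2') or 'N' (no switch, label '0').
    """
    switch_label = {"IE": "1", "EI": "2", "N": "0"}[seq_class]
    labels = ["0"] * seq_len
    labels[switch_pos - 1] = switch_label
    return labels


# made-up 60-nucleotide sequence, for illustration only
seq = "ACGT" * 15
num_obs = build_num_obs(seq)
labels = build_labels(len(seq), "IE")
print(num_obs[:3], num_obs[-1])  # ['A1', 'C2', 'G3'] T60
print(labels[29:32])             # ['0', '1', '0'] (positions 30, 31, 32)
```

The position suffix makes the same letter at different positions a different categorical value, which is what lets the model tie its features to the fixed switch location.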

Reminder: To work with this tutorial interactively, we first need to clone the splice-junction repository from Bitbucket to our local disk. Then, we navigate to [cloned_package_dir]/tutorials, where [cloned_package_dir] is the path to the cloned package folder. The structure of the repository will be:

NB: PySeqLab should be already installed or included in the python system path before we proceed.

2. Training splice junction classifier

The src directory in the cloned repository includes three main modules:

  • process_dataset.py
  • splice_junction_attribute_extractor.py
  • train_splicejunction_predictor_workflow.py

As a prerequisite, refer to these tutorials, which describe in detail the model building and training process using the PySeqLab package.

2.1 Dataset

We performed a stratified 5-fold cross-validation, where the dataset was divided into five training files with their corresponding five testing files. We used the functions implemented in the process_dataset.py module. Running the prepare_dataset() function generates the training and testing files automatically. For reproducibility, we will use the 5-fold division (found under the 5-fold folder in the dataset directory) that we already used for the trained models dumped in this repository.
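For readers unfamiliar with stratified splitting, the following is a minimal, self-contained sketch of the idea. It is not the implementation of prepare_dataset(); the function name stratified_kfold and the toy class labels are hypothetical. Each class is shuffled separately and dealt round-robin into the folds, so every fold keeps roughly the same class proportions.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for k folds with balanced classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # deal the shuffled indices of this class round-robin into the folds
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(idx for g in range(k) if g != f for idx in folds[g])
        yield train, test


# toy class labels (made up): 10 'EI', 10 'IE' and 20 'N' sequences
toy_labels = ["EI"] * 10 + ["IE"] * 10 + ["N"] * 20
for fold, (train, test) in enumerate(stratified_kfold(toy_labels)):
    print(fold, len(train), len(test))
```

With the toy labels, every test fold holds 2 EI, 2 IE and 4 N sequences, and each sequence appears in exactly one test fold.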

In our current setting, our dataset is composed of five folds, each has a training and test file found in the dataset folder:

├── dataset
│   ├── 5-fold
│   │   ├── test_f_0.txt
│   │   ├── test_f_1.txt
│   │   ├── test_f_2.txt
│   │   ├── test_f_3.txt
│   │   ├── test_f_4.txt
│   │   ├── train_f_0.txt
│   │   ├── train_f_1.txt
│   │   ├── train_f_2.txt
│   │   ├── train_f_3.txt
│   │   ├── train_f_4.txt
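The naming scheme in the tree above makes it easy to pair each training file with its test file. The sketch below only builds the paths; the helper name fold_files and the relative "dataset" path are assumptions for illustration and are not part of the repository.

```python
import os


def fold_files(dataset_dir, num_folds=5):
    """Return [(train_path, test_path), ...] following the layout shown above."""
    fold_dir = os.path.join(dataset_dir, "5-fold")
    return [
        (
            os.path.join(fold_dir, "train_f_{}.txt".format(fold)),
            os.path.join(fold_dir, "test_f_{}.txt".format(fold)),
        )
        for fold in range(num_folds)
    ]


# "dataset" is a placeholder for the path to the dataset directory
for train_path, test_path in fold_files("dataset"):
    print(train_path, "<->", test_path)
```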

2.2 Attributes and features extraction

We start by defining the attribute extractor that we will use to generate attributes from the parsed sequences. Our attribute extractor, SpliceJunctionAttributeExtractor, is a subclass of the GenericAttributeExtractor class and is implemented in the splice_junction_attribute_extractor.py module. It defines attributes based on the num_obs track we defined earlier (see section 1.1) at each position in the sequence. Below is an example of the attributes extracted by our SpliceJunctionAttributeExtractor class from a sequence in our training file.

After defining our attribute extractor, we define the feature templates that are used by the feature extractors to generate features. Feature templates and feature extraction are described in detail in this tutorial.

In the train_splicejunction_predictor_workflow.py module, we define our feature templates using the template_config() function.

def template_config():
    template_generator = TemplateGenerator()
    templateXY = {}
    template_generator.generate_template_XY('num_obs', 
                                            ('1-gram:2-gram:3-gram:4-gram:5-gram:6-gram', range(-15, 16)),
                                            '1-state', 
                                            templateXY)
    templateY = template_generator.generate_template_Y('2-states')
    return templateXY, templateY

The defined templates include only one track (i.e. num_obs), which represents each nucleotide letter combined with its position number.

  • We define a window of size 31 centered at each position in the sequence. That is, we pass through the sequence from left to right, where at each position, we construct a window of size 31 (a window that includes attributes from 15 previous positions, current position, and 15 forward/future positions)
  • We extract 1 to 6 grams (i.e. 1-gram:2-gram:3-gram:4-gram:5-gram:6-gram) in the specified window
  • We join these attributes with the current state (i.e. Y labels)

For the Y labels only:

  • We use two states (i.e. label transitions). That is, we pass through the sequence from left to right, where at each position, we extract the previous and current labels.
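To illustrate what a window of offsets combined with 1 to 6 grams amounts to, the sketch below enumerates the offset tuples explicitly. This is a plain-Python illustration and not PySeqLab's TemplateGenerator code: the function names are hypothetical, and it assumes that each n-gram is built from n consecutive offsets inside the window and that n-grams reaching outside the sequence are skipped.

```python
def ngram_offsets(window, max_n):
    """Enumerate tuples of consecutive offsets (n-grams) inside a window."""
    offsets = list(window)
    templates = []
    for n in range(1, max_n + 1):
        for start in range(len(offsets) - n + 1):
            templates.append(tuple(offsets[start:start + n]))
    return templates


def ngram_at(num_obs, position, offsets):
    """Join the attributes at position + offset (0-based position).

    Returns None when any offset falls outside the sequence (assumption).
    """
    idx = [position + o for o in offsets]
    if idx[0] < 0 or idx[-1] >= len(num_obs):
        return None
    return "|".join(num_obs[i] for i in idx)


templates = ngram_offsets(range(-15, 16), max_n=6)
print(len(templates))  # 31 + 30 + 29 + 28 + 27 + 26 = 171 offset tuples

toy_track = ["A1", "C2", "G3", "T4"]  # made-up short num_obs track
print(ngram_at(toy_track, 1, (-1, 0)))  # A1|C2
print(ngram_at(toy_track, 0, (-1, 0)))  # None
```

Each such n-gram value is then paired with the current label to form an observation feature, while the label-transition features use only the previous and current labels.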

2.3 Models and optimization options

In the train_splicejunction_predictor_workflow.py module, we implement the training workflow. In this section, we describe the training setup and the chosen options for performing the training.

We used the following classes:

  • feature extractor (HOFeatureExtractor),
  • CRFs model (HOCRFAD) and
  • CRFs model representation (HOCRFADModelRepresentation)

For the training method (i.e. optimization options), we used the following options:

  • stochastic gradient ascent (method = SGA),
  • with l2 regularization (regularization_type = l2) and regularization value equal to 1 (regularization_value = 1),
  • and 5 passes through the training data (num_epochs = 5)

To run the training process, we use the run_training(optimization_options, template_config) function. We pass the optimization options and the function that generates the defined feature templates (see section 2.2). The training process will perform the following:

  • Loop through each fold (i.e. from 0 to 4)
    1. read the training file corresponding to the current fold (i.e. train_f_0.txt) and parse it into sequences
    2. process and dump the parsed sequences on disk in a relevant format for the learning framework
    3. build a model based on the processed training sequences
    4. train the model weights (i.e. estimate the feature weights) using the specified optimization method
    5. use the trained model to decode the training sequences and write the result to a file
    6. read the test file corresponding to the current fold (i.e. test_f_0.txt) and parse it into sequences
    7. use the trained model to decode the testing sequences and write the result to a file
    8. return the path to the trained model directory

The return value of the training function (i.e. res -- see code snippet below) is a list of tuples, where each tuple has the following structure:

  1. first entry will be the path to the trained model on disk
  2. second entry will be the performance of the trained model using the training file
  3. third entry will be the performance of the trained model using the test file
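Given that structure, the per-fold results can be averaged in a few lines. The sketch below assumes that the second and third entries are plain numeric scores; the helper name summarize_results, the model paths and the score values are placeholders for illustration and are not results from this tutorial.

```python
def summarize_results(res):
    """Average the train/test scores over folds.

    Assumes each entry is (model_dir, train_score, test_score) with numeric scores.
    """
    if not res:
        raise ValueError("res is empty")
    n = len(res)
    mean_train = sum(train for _, train, _ in res) / n
    mean_test = sum(test for _, _, test in res) / n
    return mean_train, mean_test


# placeholder values, NOT results produced by run_training
demo_res = [("models/fold_0", 1.0, 0.5), ("models/fold_1", 0.5, 0.0)]
mean_train, mean_test = summarize_results(demo_res)
print(mean_train, mean_test)  # 0.75 0.25
```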

The performance of the trained model is evaluated at the single-nucleotide level, i.e. on the label decoded at each position of a sequence (as in the code snippet below). Moreover, we track the estimated average log-likelihood for each model by plotting the generated avg_loglikelihood_training files.

In [12]:
from splice_junction_attribute_extractor import *
seq = example()
attr_desc {'num_obs': {'description': 'numbered nucleotides observation  (i.e. nucleotide letter concatenated with the position)', 'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <splice_junction_attribute_extractor.SpliceJunctionAttributeExtractor object at 0x7f1100d7a9e8>>, 'encoding': 'categorical'}, 'obs': {'description': 'nucleotides observation', 'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <splice_junction_attribute_extractor.SpliceJunctionAttributeExtractor object at 0x7f1100d7a9e8>>, 'encoding': 'categorical'}}
boundary (16, 16)
attributes {'num_obs': 'T16', 'obs': 'T'}
boundary (43, 43)
attributes {'num_obs': 'A43', 'obs': 'A'}
boundary (29, 29)
attributes {'num_obs': 'A29', 'obs': 'A'}
boundary (6, 6)
attributes {'num_obs': 'A6', 'obs': 'A'}
boundary (33, 33)
attributes {'num_obs': 'A33', 'obs': 'A'}
boundary (31, 31)
attributes {'num_obs': 'C31', 'obs': 'C'}
boundary (20, 20)
attributes {'num_obs': 'G20', 'obs': 'G'}
boundary (11, 11)
attributes {'num_obs': 'C11', 'obs': 'C'}
boundary (7, 7)
attributes {'num_obs': 'T7', 'obs': 'T'}
boundary (55, 55)
attributes {'num_obs': 'G55', 'obs': 'G'}
boundary (32, 32)
attributes {'num_obs': 'A32', 'obs': 'A'}
boundary (54, 54)
attributes {'num_obs': 'A54', 'obs': 'A'}
boundary (53, 53)
attributes {'num_obs': 'G53', 'obs': 'G'}
boundary (51, 51)
attributes {'num_obs': 'G51', 'obs': 'G'}
boundary (13, 13)
attributes {'num_obs': 'G13', 'obs': 'G'}
boundary (57, 57)
attributes {'num_obs': 'G57', 'obs': 'G'}
boundary (59, 59)
attributes {'num_obs': 'A59', 'obs': 'A'}
boundary (60, 60)
attributes {'num_obs': 'G60', 'obs': 'G'}
boundary (4, 4)
attributes {'num_obs': 'T4', 'obs': 'T'}
boundary (8, 8)
attributes {'num_obs': 'G8', 'obs': 'G'}
boundary (50, 50)
attributes {'num_obs': 'A50', 'obs': 'A'}
boundary (18, 18)
attributes {'num_obs': 'T18', 'obs': 'T'}
boundary (3, 3)
attributes {'num_obs': 'T3', 'obs': 'T'}
boundary (28, 28)
attributes {'num_obs': 'C28', 'obs': 'C'}
boundary (42, 42)
attributes {'num_obs': 'G42', 'obs': 'G'}
boundary (19, 19)
attributes {'num_obs': 'G19', 'obs': 'G'}
boundary (15, 15)
attributes {'num_obs': 'C15', 'obs': 'C'}
boundary (47, 47)
attributes {'num_obs': 'G47', 'obs': 'G'}
boundary (12, 12)
attributes {'num_obs': 'T12', 'obs': 'T'}
boundary (14, 14)
attributes {'num_obs': 'C14', 'obs': 'C'}
boundary (26, 26)
attributes {'num_obs': 'C26', 'obs': 'C'}
boundary (38, 38)
attributes {'num_obs': 'T38', 'obs': 'T'}
boundary (2, 2)
attributes {'num_obs': 'T2', 'obs': 'T'}
boundary (44, 44)
attributes {'num_obs': 'T44', 'obs': 'T'}
boundary (39, 39)
attributes {'num_obs': 'T39', 'obs': 'T'}
boundary (30, 30)
attributes {'num_obs': 'G30', 'obs': 'G'}
boundary (21, 21)
attributes {'num_obs': 'C21', 'obs': 'C'}
boundary (25, 25)
attributes {'num_obs': 'C25', 'obs': 'C'}
boundary (24, 24)
attributes {'num_obs': 'C24', 'obs': 'C'}
boundary (34, 34)
attributes {'num_obs': 'G34', 'obs': 'G'}
boundary (37, 37)
attributes {'num_obs': 'C37', 'obs': 'C'}
boundary (48, 48)
attributes {'num_obs': 'G48', 'obs': 'G'}
boundary (5, 5)
attributes {'num_obs': 'C5', 'obs': 'C'}
boundary (23, 23)
attributes {'num_obs': 'C23', 'obs': 'C'}
boundary (27, 27)
attributes {'num_obs': 'A27', 'obs': 'A'}
boundary (22, 22)
attributes {'num_obs': 'T22', 'obs': 'T'}
boundary (36, 36)
attributes {'num_obs': 'C36', 'obs': 'C'}
boundary (45, 45)
attributes {'num_obs': 'C45', 'obs': 'C'}
boundary (41, 41)
attributes {'num_obs': 'T41', 'obs': 'T'}
boundary (1, 1)
attributes {'num_obs': 'A1', 'obs': 'A'}
boundary (56, 56)
attributes {'num_obs': 'A56', 'obs': 'A'}
boundary (9, 9)
attributes {'num_obs': 'A9', 'obs': 'A'}
boundary (46, 46)
attributes {'num_obs': 'C46', 'obs': 'C'}
boundary (58, 58)
attributes {'num_obs': 'A58', 'obs': 'A'}
boundary (49, 49)
attributes {'num_obs': 'G49', 'obs': 'G'}
boundary (40, 40)
attributes {'num_obs': 'G40', 'obs': 'G'}
boundary (52, 52)
attributes {'num_obs': 'G52', 'obs': 'G'}
boundary (17, 17)
attributes {'num_obs': 'G17', 'obs': 'G'}
boundary (10, 10)
attributes {'num_obs': 'G10', 'obs': 'G'}
boundary (35, 35)
attributes {'num_obs': 'C35', 'obs': 'C'}
seg_attr {(16, 16): {'num_obs': 'T16', 'obs': 'T'}, (43, 43): {'num_obs': 'A43', 'obs': 'A'}, (29, 29): {'num_obs': 'A29', 'obs': 'A'}, (6, 6): {'num_obs': 'A6', 'obs': 'A'}, (33, 33): {'num_obs': 'A33', 'obs': 'A'}, (31, 31): {'num_obs': 'C31', 'obs': 'C'}, (20, 20): {'num_obs': 'G20', 'obs': 'G'}, (11, 11): {'num_obs': 'C11', 'obs': 'C'}, (7, 7): {'num_obs': 'T7', 'obs': 'T'}, (55, 55): {'num_obs': 'G55', 'obs': 'G'}, (32, 32): {'num_obs': 'A32', 'obs': 'A'}, (54, 54): {'num_obs': 'A54', 'obs': 'A'}, (53, 53): {'num_obs': 'G53', 'obs': 'G'}, (51, 51): {'num_obs': 'G51', 'obs': 'G'}, (13, 13): {'num_obs': 'G13', 'obs': 'G'}, (57, 57): {'num_obs': 'G57', 'obs': 'G'}, (59, 59): {'num_obs': 'A59', 'obs': 'A'}, (60, 60): {'num_obs': 'G60', 'obs': 'G'}, (4, 4): {'num_obs': 'T4', 'obs': 'T'}, (8, 8): {'num_obs': 'G8', 'obs': 'G'}, (50, 50): {'num_obs': 'A50', 'obs': 'A'}, (18, 18): {'num_obs': 'T18', 'obs': 'T'}, (3, 3): {'num_obs': 'T3', 'obs': 'T'}, (28, 28): {'num_obs': 'C28', 'obs': 'C'}, (42, 42): {'num_obs': 'G42', 'obs': 'G'}, (19, 19): {'num_obs': 'G19', 'obs': 'G'}, (15, 15): {'num_obs': 'C15', 'obs': 'C'}, (47, 47): {'num_obs': 'G47', 'obs': 'G'}, (12, 12): {'num_obs': 'T12', 'obs': 'T'}, (14, 14): {'num_obs': 'C14', 'obs': 'C'}, (26, 26): {'num_obs': 'C26', 'obs': 'C'}, (38, 38): {'num_obs': 'T38', 'obs': 'T'}, (2, 2): {'num_obs': 'T2', 'obs': 'T'}, (44, 44): {'num_obs': 'T44', 'obs': 'T'}, (39, 39): {'num_obs': 'T39', 'obs': 'T'}, (30, 30): {'num_obs': 'G30', 'obs': 'G'}, (21, 21): {'num_obs': 'C21', 'obs': 'C'}, (25, 25): {'num_obs': 'C25', 'obs': 'C'}, (24, 24): {'num_obs': 'C24', 'obs': 'C'}, (34, 34): {'num_obs': 'G34', 'obs': 'G'}, (37, 37): {'num_obs': 'C37', 'obs': 'C'}, (48, 48): {'num_obs': 'G48', 'obs': 'G'}, (5, 5): {'num_obs': 'C5', 'obs': 'C'}, (23, 23): {'num_obs': 'C23', 'obs': 'C'}, (27, 27): {'num_obs': 'A27', 'obs': 'A'}, (22, 22): {'num_obs': 'T22', 'obs': 'T'}, (36, 36): {'num_obs': 'C36', 'obs': 'C'}, (45, 45): {'num_obs': 'C45', 'obs': 
'C'}, (41, 41): {'num_obs': 'T41', 'obs': 'T'}, (1, 1): {'num_obs': 'A1', 'obs': 'A'}, (56, 56): {'num_obs': 'A56', 'obs': 'A'}, (9, 9): {'num_obs': 'A9', 'obs': 'A'}, (46, 46): {'num_obs': 'C46', 'obs': 'C'}, (58, 58): {'num_obs': 'A58', 'obs': 'A'}, (49, 49): {'num_obs': 'G49', 'obs': 'G'}, (40, 40): {'num_obs': 'G40', 'obs': 'G'}, (52, 52): {'num_obs': 'G52', 'obs': 'G'}, (17, 17): {'num_obs': 'G17', 'obs': 'G'}, (10, 10): {'num_obs': 'G10', 'obs': 'G'}, (35, 35): {'num_obs': 'C35', 'obs': 'C'}}

In [11]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,6)
from pyseqlab.utilities import ReaderWriter
# we use only 10 sequences for demonstration
# to go through the whole file simply omit passing the num_seqs keyword argument
num_seqs = 10
# import the module containing the training workflow
from train_splicejunction_predictor_workflow import *
optimization_options = {"method" : "SGA",
                        "regularization_type": "l2",
                        "regularization_value":1,
                        "num_epochs":5,
                        "tolerance":1e-6
                        }

# demonstrate training using only 10 sequences from both the training and test files
# NB: the performance evaluation reported during training is based on the single nucleotide accuracy/decoding error
res = run_training(optimization_options, template_config, num_seqs=num_seqs)
models_dir = [elem[0] for elem in res]

# using all sequences from both the training and test files
# NB: the performance evaluation reported during training is based on the single nucleotide accuracy/decoding error
# res = run_training(optimization_options, template_config)
# evaluating the performance of the models
# models_dir = [elem[0] for elem in res]
# error_score = eval_models(models_dir)

def plot_avgloglikelihood(models_dir):
    # plot the estimated average loglikelihood
    for fold in range(len(models_dir)):
        model_dir=models_dir[fold]
        avg_ll = ReaderWriter.read_data(os.path.join(model_dir, 'avg_loglikelihood_training'))
        plt.plot(avg_ll[1:], label="fold:{}, method:{}, {}:{}".format(fold,
                                                                      optimization_options['method'], 
                                                                      optimization_options['regularization_type'],
                                                                      optimization_options['regularization_value']))
    plt.legend(loc='lower right')
    plt.xlabel('number of epochs')
    plt.ylabel('estimated average loglikelihood')
# the legend reuses the optimization_options dictionary defined above for training
plot_avgloglikelihood(models_dir)
1 sequences have been processed
2 sequences have been processed
3 sequences have been processed
4 sequences have been processed
5 sequences have been processed
6 sequences have been processed
7 sequences have been processed
8 sequences have been processed
9 sequences have been processed
10 sequences have been processed
dumping globalfeatures -- processed seqs:  1
dumping globalfeatures -- processed seqs:  2
dumping globalfeatures -- processed seqs:  3
dumping globalfeatures -- processed seqs:  4
dumping globalfeatures -- processed seqs:  5
dumping globalfeatures -- processed seqs:  6
dumping globalfeatures -- processed seqs:  7
dumping globalfeatures -- processed seqs:  8
dumping globalfeatures -- processed seqs:  9
dumping globalfeatures -- processed seqs:  10
constructing model -- processed seqs:  1
constructing model -- processed seqs:  2
constructing model -- processed seqs:  3
constructing model -- processed seqs:  4
constructing model -- processed seqs:  5
constructing model -- processed seqs:  6
constructing model -- processed seqs:  7
constructing model -- processed seqs:  8
constructing model -- processed seqs:  9
constructing model -- processed seqs:  10
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 1.0
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.5405528666526126
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.03227694489623862
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.023159514302146963
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.0179617258098229
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 1.0
f1 performance on train_f_0
1.0
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 0.9932885906040269
f1 performance on test_f_0
0.993288590604
--------------------------------------------------
1 sequences have been processed
2 sequences have been processed
3 sequences have been processed
4 sequences have been processed
5 sequences have been processed
6 sequences have been processed
7 sequences have been processed
8 sequences have been processed
9 sequences have been processed
10 sequences have been processed
dumping globalfeatures -- processed seqs:  1
dumping globalfeatures -- processed seqs:  2
dumping globalfeatures -- processed seqs:  3
dumping globalfeatures -- processed seqs:  4
dumping globalfeatures -- processed seqs:  5
dumping globalfeatures -- processed seqs:  6
dumping globalfeatures -- processed seqs:  7
dumping globalfeatures -- processed seqs:  8
dumping globalfeatures -- processed seqs:  9
dumping globalfeatures -- processed seqs:  10
constructing model -- processed seqs:  1
constructing model -- processed seqs:  2
constructing model -- processed seqs:  3
constructing model -- processed seqs:  4
constructing model -- processed seqs:  5
constructing model -- processed seqs:  6
constructing model -- processed seqs:  7
constructing model -- processed seqs:  8
constructing model -- processed seqs:  9
constructing model -- processed seqs:  10
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 1.0
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.6101692186658397
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.03689609147279584
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.024937968595631667
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.0193954578897288
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 1.0
f1 performance on train_f_1
1.0
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 0.9833333333333333
f1 performance on test_f_1
0.983333333333
--------------------------------------------------
1 sequences have been processed
2 sequences have been processed
3 sequences have been processed
4 sequences have been processed
5 sequences have been processed
6 sequences have been processed
7 sequences have been processed
8 sequences have been processed
9 sequences have been processed
10 sequences have been processed
dumping globalfeatures -- processed seqs:  1
dumping globalfeatures -- processed seqs:  2
dumping globalfeatures -- processed seqs:  3
dumping globalfeatures -- processed seqs:  4
dumping globalfeatures -- processed seqs:  5
dumping globalfeatures -- processed seqs:  6
dumping globalfeatures -- processed seqs:  7
dumping globalfeatures -- processed seqs:  8
dumping globalfeatures -- processed seqs:  9
dumping globalfeatures -- processed seqs:  10
constructing model -- processed seqs:  1
constructing model -- processed seqs:  2
constructing model -- processed seqs:  3
constructing model -- processed seqs:  4
constructing model -- processed seqs:  5
constructing model -- processed seqs:  6
constructing model -- processed seqs:  7
constructing model -- processed seqs:  8
constructing model -- processed seqs:  9
constructing model -- processed seqs:  10
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 1.0
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.6251058759836686
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.04262663259998486
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.024506828918072408
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.019138736449197107
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 1.0
f1 performance on train_f_2
1.0
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 0.9833333333333333
f1 performance on test_f_2
0.983333333333
--------------------------------------------------
1 sequences have been processed
2 sequences have been processed
3 sequences have been processed
4 sequences have been processed
5 sequences have been processed
6 sequences have been processed
7 sequences have been processed
8 sequences have been processed
9 sequences have been processed
10 sequences have been processed
dumping globalfeatures -- processed seqs:  1
dumping globalfeatures -- processed seqs:  2
dumping globalfeatures -- processed seqs:  3
dumping globalfeatures -- processed seqs:  4
dumping globalfeatures -- processed seqs:  5
dumping globalfeatures -- processed seqs:  6
dumping globalfeatures -- processed seqs:  7
dumping globalfeatures -- processed seqs:  8
dumping globalfeatures -- processed seqs:  9
dumping globalfeatures -- processed seqs:  10
constructing model -- processed seqs:  1
constructing model -- processed seqs:  2
constructing model -- processed seqs:  3
constructing model -- processed seqs:  4
constructing model -- processed seqs:  5
constructing model -- processed seqs:  6
constructing model -- processed seqs:  7
constructing model -- processed seqs:  8
constructing model -- processed seqs:  9
constructing model -- processed seqs:  10
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 1.0
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.6400146358892683
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.033725048103773206
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.024554946159004398
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.019221103057130914
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 1.0
f1 performance on train_f_3
1.0
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 0.9966666666666667
f1 performance on test_f_3
0.996666666667
--------------------------------------------------
1 sequences have been processed
2 sequences have been processed
3 sequences have been processed
4 sequences have been processed
5 sequences have been processed
6 sequences have been processed
7 sequences have been processed
8 sequences have been processed
9 sequences have been processed
10 sequences have been processed
dumping globalfeatures -- processed seqs:  1
dumping globalfeatures -- processed seqs:  2
dumping globalfeatures -- processed seqs:  3
dumping globalfeatures -- processed seqs:  4
dumping globalfeatures -- processed seqs:  5
dumping globalfeatures -- processed seqs:  6
dumping globalfeatures -- processed seqs:  7
dumping globalfeatures -- processed seqs:  8
dumping globalfeatures -- processed seqs:  9
dumping globalfeatures -- processed seqs:  10
constructing model -- processed seqs:  1
constructing model -- processed seqs:  2
constructing model -- processed seqs:  3
constructing model -- processed seqs:  4
constructing model -- processed seqs:  5
constructing model -- processed seqs:  6
constructing model -- processed seqs:  7
constructing model -- processed seqs:  8
constructing model -- processed seqs:  9
constructing model -- processed seqs:  10
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 1.0
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.6223636857440957
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.04007853535237954
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.024671722987946183
num seqs left: 9
num seqs left: 8
num seqs left: 7
num seqs left: 6
num seqs left: 5
num seqs left: 4
num seqs left: 3
num seqs left: 2
num seqs left: 1
num seqs left: 0
reldiff = 0.01925305775303414
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 1.0
f1 performance on train_f_4
1.0
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
identifying model active features -- processed seqs:  4
identifying model active features -- processed seqs:  5
identifying model active features -- processed seqs:  6
identifying model active features -- processed seqs:  7
identifying model active features -- processed seqs:  8
identifying model active features -- processed seqs:  9
identifying model active features -- processed seqs:  10
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left
f1 0.9833333333333333
f1 performance on test_f_4
0.983333333333
--------------------------------------------------

3. Trained models evaluation

NB: Before proceeding, we have to unzip the trained_models directory so that we can explore and assess the trained models.

To evaluate the performance of the trained models, we will use the eval_models(args) function in the train_splicejunction_predictor_workflow.py module. It takes a list of the paths of the trained models on disk. A trained model directory has the following structure:

├── avg_loglikelihood_training
├── crf_training_log.txt
├── decoding_seqs
│   ├── test_fold_0.txt
│   ├── train_fold_0.txt
├── model_parts
│   ├── class_desc.txt
│   ├── FE_templateX
│   ├── FE_templateY
│   ├── MR_L
│   ├── MR_modelfeatures
│   ├── MR_modelfeaturescodebook
│   ├── MR_Ycodebook
│   ├── weights

Each model has a model_parts folder. The decoded sequences are found under the decoding_seqs folder, which holds the decoding of the training and test files. We evaluate the trained models by computing the decoding error on those files. The splice-junction prediction task requires assigning one label to the whole sequence (i.e. 60 nucleotides) as opposed to one label for each nucleotide. Hence, the eval_models(args) function is implemented to evaluate the performance of assigning one label to the whole sequence rather than to each single nucleotide (as in the code snippet below).

In [9]:
import numpy
# to evaluate the models' performance on the sequence level, using already trained models
# eval_models takes a list of the trained models' paths on disk
trainedmodels_rootdir = os.path.join(project_dir, 'trained_models')
models_folders = ("2017_5_5-10_38_38_3987",  
                  "2017_5_5-11_59_56_390744",
                  "2017_5_5-13_8_2_735679",
                  "2017_5_5-14_4_13_716529",
                  "2017_5_5-15_0_45_409499")
models_dir = [os.path.join(trainedmodels_rootdir, folder) for folder in models_folders]
error_score = eval_models(models_dir)
print("decoding error score for the trained models:")
print(error_score)
print()
print("average decoding error across the trained models: ", numpy.mean(error_score))
train_fold_0.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       1.00      1.00      1.00      1323
000000000000000000000000000000100000000000000000000000000000       1.00      1.00      1.00       610
000000000000000000000000000000200000000000000000000000000000       1.00      1.00      1.00       608

                                                 avg / total       1.00      1.00      1.00      2541

weighted f1:
0.999606367068
micro f1:
0.999606454152
test_fold_0.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       0.96      0.97      0.96       328
000000000000000000000000000000100000000000000000000000000000       0.96      0.94      0.95       157
000000000000000000000000000000200000000000000000000000000000       0.95      0.95      0.95       152

                                                 avg / total       0.96      0.96      0.96       637

weighted f1:
0.956009841065
micro f1:
0.956043956044
train_fold_1.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       1.00      1.00      1.00      1324
000000000000000000000000000000100000000000000000000000000000       1.00      1.00      1.00       609
000000000000000000000000000000200000000000000000000000000000       1.00      1.00      1.00       609

                                                 avg / total       1.00      1.00      1.00      2542

weighted f1:
1.0
micro f1:
1.0
test_fold_1.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       0.98      0.98      0.98       330
000000000000000000000000000000100000000000000000000000000000       0.95      0.93      0.94       156
000000000000000000000000000000200000000000000000000000000000       0.95      0.96      0.95       150

                                                 avg / total       0.96      0.96      0.96       636

weighted f1:
0.962209860143
micro f1:
0.962264150943
train_fold_2.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       1.00      1.00      1.00      1323
000000000000000000000000000000100000000000000000000000000000       1.00      1.00      1.00       611
000000000000000000000000000000200000000000000000000000000000       1.00      1.00      1.00       609

                                                 avg / total       1.00      1.00      1.00      2543

weighted f1:
0.999606676914
micro f1:
0.999606763665
test_fold_2.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       0.97      0.97      0.97       331
000000000000000000000000000000100000000000000000000000000000       0.95      0.96      0.96       151
000000000000000000000000000000200000000000000000000000000000       0.97      0.97      0.97       153

                                                 avg / total       0.97      0.97      0.97       635

weighted f1:
0.968514485113
micro f1:
0.968503937008
train_fold_3.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       1.00      1.00      1.00      1325
000000000000000000000000000000100000000000000000000000000000       1.00      1.00      1.00       609
000000000000000000000000000000200000000000000000000000000000       1.00      1.00      1.00       609

                                                 avg / total       1.00      1.00      1.00      2543

weighted f1:
0.999606850736
micro f1:
0.999606763665
test_fold_3.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       0.98      0.96      0.97       337
000000000000000000000000000000100000000000000000000000000000       0.90      0.96      0.93       142
000000000000000000000000000000200000000000000000000000000000       0.96      0.94      0.95       156

                                                 avg / total       0.96      0.95      0.95       635

weighted f1:
0.954547145756
micro f1:
0.954330708661
train_fold_4.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       1.00      1.00      1.00      1325
000000000000000000000000000000100000000000000000000000000000       1.00      1.00      1.00       609
000000000000000000000000000000200000000000000000000000000000       1.00      1.00      1.00       609

                                                 avg / total       1.00      1.00      1.00      2543

weighted f1:
0.999606850736
micro f1:
0.999606763665
test_fold_4.txt
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       0.97      0.99      0.98       324
000000000000000000000000000000100000000000000000000000000000       0.97      0.92      0.95       159
000000000000000000000000000000200000000000000000000000000000       0.94      0.94      0.94       152

                                                 avg / total       0.96      0.96      0.96       635

weighted f1:
0.960454857234
micro f1:
0.96062992126
decoding error score for the trained models:
[0.043956043956043911, 0.037735849056603765, 0.031496062992126039, 0.045669291338582663, 0.039370078740157521]

average decoding error across the trained models:  0.0396454652167
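
As a rough illustration of the sequence-level scoring used above, the sketch below collapses the per-nucleotide labels of each sequence into a single string (the same kind of 60-character strings that appear as class names in the reports) and counts the fraction of sequences whose predicted string differs from the reference. The function name sequence_level_error and the toy data are hypothetical. This is not the code of eval_models, only a sketch that appears consistent with the reported numbers, where each error score equals 1 minus the micro f1 of the corresponding test fold (for instance 1 - 0.956043956044 ≈ 0.043956 for fold 0).

```python
def sequence_level_error(ref_seqs, pred_seqs):
    """Fraction of sequences whose collapsed label string is mispredicted.

    ref_seqs / pred_seqs: lists of per-nucleotide label lists,
    e.g. ['0', '0', '1', ...] (hypothetical input format).
    """
    if len(ref_seqs) != len(pred_seqs):
        raise ValueError("reference and prediction lists differ in length")
    if not ref_seqs:
        raise ValueError("no sequences to evaluate")
    # collapse each label list into one string, so that a sequence counts
    # as correct only if every position matches the reference
    ref_strings = ["".join(labels) for labels in ref_seqs]
    pred_strings = ["".join(labels) for labels in pred_seqs]
    mismatches = sum(r != p for r, p in zip(ref_strings, pred_strings))
    return mismatches / len(ref_strings)


# toy example with 5-nucleotide sequences (made-up data, not from the dataset)
ref = [list("00000"), list("00100"), list("00200"), list("00000")]
pred = [list("00000"), list("00100"), list("00000"), list("00000")]
toy_error = sequence_level_error(ref, pred)
print(toy_error)
```

In this toy case one of the four sequences (the donor-like one) is decoded as a negative, so a single wrong position is enough to count the whole sequence as an error.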

4. Using a splice junction classifier

In this section, we demonstrate how to revive a trained model for splice junction prediction and use it to decode a file comprising new/unseen sequences (i.e. test sequences).

As a reminder, the trained models (including their components) are found under the trained_models folder in the cloned repository. We have to unzip them before proceeding.

To revive a trained model dumped on disk, we use the revive_learnedmodel(args) function. It takes the path to the trained model's directory.

In [14]:
# we get the trained model parts directory -- check the tree path in the cell above
trained_model_dir = os.path.join(project_dir, 'trained_models')
# loading the trained model
crf_m = revive_learnedmodel(os.path.join(trained_model_dir, '2017_5_5-13_8_2_735679'))

After we have revived our model, we will use a test dataset from the ones in the 5-fold directory. Just as a reminder, the tree path is:

├── dataset
│   ├── 5-fold
│   │   ├── test_f_0.txt
│   │   ├── test_f_1.txt
│   │   ├── test_f_2.txt
│   │   ├── test_f_3.txt
│   │   ├── test_f_4.txt
│   │   ├── train_f_0.txt
│   │   ├── train_f_1.txt
│   │   ├── train_f_2.txt
│   │   ├── train_f_3.txt
│   │   ├── train_f_4.txt

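Each dataset file is composed of multiple sequences separated by an empty line, with one nucleotide per row and three columns (id, obs, label). An excerpt of the sequences in the train_f_0.txt file is provided below (the test files follow the same format):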

id    obs    label
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    G    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    T    0
AGMKPNRSB-NEG-1    A    0
AGMKPNRSB-NEG-1    C    0
AGMKPNRSB-NEG-1    A    0

AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    C    0
AGMORS12A-NEG-181    A    0
AGMORS12A-NEG-181    T    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0
AGMORS12A-NEG-181    G    0

AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    G    0
AGMORS9A-NEG-481    G    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    G    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    G    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    G    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    A    0
AGMORS9A-NEG-481    G    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    G    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    C    0
AGMORS9A-NEG-481    T    0
AGMORS9A-NEG-481    G    0

AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    C    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    C    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    C    0
AGMRSKPNI-NEG-1141    C    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    C    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    C    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    C    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    C    0
AGMRSKPNI-NEG-1141    T    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    G    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0
AGMRSKPNI-NEG-1141    A    0

ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    A    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    A    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    T    1
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    A    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    A    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    G    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    C    0
ATRINS-ACCEPTOR-1678    T    0
ATRINS-ACCEPTOR-1678    A    0

ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    A    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    A    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    A    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    G    1
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    A    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    A    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    C    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    T    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    G    0
ATRINS-ACCEPTOR-701    A    0

ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    G    2
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    A    0
ATRINS-DONOR-521    G    0
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    C    0
ATRINS-DONOR-521    T    0
ATRINS-DONOR-521    G    0
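
To make the file layout concrete, here is a minimal hand-rolled reader for this kind of file: a header line, one row per nucleotide, and an empty line between sequences. It is only a sketch for illustration and not PySeqLab's parser; the function name read_sequences, the sample ids and the assumption that columns are tab-separated (as in the parsing options used in this tutorial) are ours.

```python
def read_sequences(text, column_sep="\t"):
    """Parse a header line plus blank-line-separated blocks of rows.

    Returns a list of sequences, each a list of dicts keyed by the header
    columns (hypothetical structure, unlike PySeqLab's own sequence objects).
    """
    lines = text.splitlines()
    header = lines[0].split(column_sep)
    sequences, current = [], []
    for line in lines[1:]:
        if not line.strip():
            # an empty line closes the current sequence
            if current:
                sequences.append(current)
                current = []
            continue
        current.append(dict(zip(header, line.split(column_sep))))
    if current:
        # the last sequence may not be followed by an empty line
        sequences.append(current)
    return sequences


# two short made-up sequences in the same layout as the excerpt above
sample = (
    "id\tobs\tlabel\n"
    "SEQ-A\tC\t0\n"
    "SEQ-A\tA\t1\n"
    "\n"
    "SEQ-B\tG\t0\n"
    "SEQ-B\tT\t0\n"
)
parsed = read_sequences(sample)
print(len(parsed))
print(parsed[0][1])
```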

To read the file, we will use the DataFileParser class in the pyseqlab.utilities module.

In [15]:
from pyseqlab.utilities import DataFileParser
# initialize a data file parser
dparser = DataFileParser()
# provide the parser options: the header info, the separator between columns
# and whether the y label is already present in the file
# "main" means the header is found in the first line of the file
header = "main"
# y_ref is a boolean indicating whether the label to predict is already found in the file
y_ref = True
# separator between the observations
column_sep = "\t"
seqs = []
for seq in dparser.read_file(os.path.join(project_dir, 'dataset', '5-fold','test_f_2.txt'), header, y_ref=y_ref, column_sep = column_sep):
    seqs.append(seq)
    
# printing one sequence for display
print(seqs[0])
print("number of parsed sequences is: ", len(seqs))
Y sequence:
 ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0']
X sequence:
 {1: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 2: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 3: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 4: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 5: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 6: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 7: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 8: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 9: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 10: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 11: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 12: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 13: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 14: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 15: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 16: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 17: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 18: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 19: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 20: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 21: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 22: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 23: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 24: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 25: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 26: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 27: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 28: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 29: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 30: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 31: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 32: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 33: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 34: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 35: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 36: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 37: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 38: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 39: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 40: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 41: 
{'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 42: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 43: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 44: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 45: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 46: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 47: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 48: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 49: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 50: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 51: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 52: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 53: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'A'}, 54: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 55: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 56: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 57: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'C'}, 58: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'T'}, 59: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}, 60: {'id': 'HUMGASTA-ACCEPTOR-6600', 'obs': 'G'}}
----------------------------------------
number of parsed sequences is:  635

4.1 Decoding method

Next, we specify the decoding options for the model. The main decoding method is decode_seqs(decoding_method, out_dir, **kwargs), which takes two positional arguments and a set of keyword arguments.

The required arguments are:

  1. decoding_method: string representing the decoding method such as 'viterbi'
  2. out_dir: string, the path of the working directory where the sequence parsing takes place (passed as the output_dir variable in the cell below)
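
To give a feel for what a 'viterbi' decoding method computes, here is a minimal, self-contained sketch of Viterbi decoding over log-space scores. It is not PySeqLab's implementation: the function viterbi and the toy score tables (start_score, trans_score, emit_score) are assumptions made up for illustration, whereas a trained CRF would derive such scores from its learned feature weights.

```python
def viterbi(obs, states, start_score, trans_score, emit_score):
    """Return the highest-scoring label path for obs (scores are additive, log-space)."""
    # best[t][s]: best score of any path that ends in state s at position t
    best = [{s: start_score[s] + emit_score[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + trans_score[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + emit_score[s][obs[t]]
            back[t][s] = prev
    # trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy scores (hypothetical, not taken from the trained model)
states = ["N", "EI"]
start_score = {"N": 0.0, "EI": 0.0}
trans_score = {"N": {"N": 0.0, "EI": -2.0}, "EI": {"N": -2.0, "EI": 0.0}}
emit_score = {
    "N": {base: -1.0 for base in "ACGT"},
    "EI": {"G": -0.5, "T": -0.5, "A": -2.0, "C": -2.0},
}

print(viterbi("GGTG", states, start_score, trans_score, emit_score))  # ['EI', 'EI', 'EI', 'EI']
print(viterbi("ACAC", states, start_score, trans_score, emit_score))  # ['N', 'N', 'N', 'N']
```

The switching penalty in trans_score keeps the label constant along a sequence, which mirrors how each 60-nucleotide window in this dataset carries a single class (EI, IE or N).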

For the keyword arguments, the main ones to specify are:

  • seqs: the list of sequences to label, already parsed from the text file
  • file_name: (optional) the name of the file to which the decoded sequences are written
  • sep: the separator placed between the columns/observations when the decoded sequences are written to the file given by file_name
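
As a rough illustration of how sep shapes a column-formatted output file, the sketch below writes one position per line, with the observation and its predicted label joined by the separator. The helper write_decoded and the layout it produces are hypothetical; the exact format written by decode_seqs is not shown here and may differ.

```python
import io


def write_decoded(decoded, out, sep="\t"):
    """Write each position as 'observation<sep>label', one sequence per block."""
    for obs, labels in decoded.values():
        for o, y in zip(obs, labels):
            out.write(sep.join([o, y]) + "\n")
        out.write("\n")  # blank line between sequences


# Hypothetical decoded output: sequence id -> (observations, predicted labels)
decoded = {"seq_1": ("CCA", ["IE", "IE", "IE"])}

buf = io.StringIO()  # stands in for a file opened for writing
write_decoded(decoded, buf, sep="\t")
text = buf.getvalue()
print(text)
```
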
In [16]:
decoding_method = 'viterbi'
output_dir = os.path.join(project_dir, 'tutorials')
sep = "\t"
# decode sequences
seqs_decoded = crf_m.decode_seqs(decoding_method, output_dir, seqs=seqs,
                                 file_name='tutorial_seqs_decoding.txt', sep=sep)
identifying model active features -- processed seqs:  1
identifying model active features -- processed seqs:  2
identifying model active features -- processed seqs:  3
...
identifying model active features -- processed seqs:  552
identifying model active features -- processed seqs:  553
identifying model active features -- processed seqs:  554
identifying model active features -- processed seqs:  555
identifying model active features -- processed seqs:  556
identifying model active features -- processed seqs:  557
identifying model active features -- processed seqs:  558
identifying model active features -- processed seqs:  559
identifying model active features -- processed seqs:  560
identifying model active features -- processed seqs:  561
identifying model active features -- processed seqs:  562
identifying model active features -- processed seqs:  563
identifying model active features -- processed seqs:  564
identifying model active features -- processed seqs:  565
identifying model active features -- processed seqs:  566
identifying model active features -- processed seqs:  567
identifying model active features -- processed seqs:  568
identifying model active features -- processed seqs:  569
identifying model active features -- processed seqs:  570
identifying model active features -- processed seqs:  571
identifying model active features -- processed seqs:  572
identifying model active features -- processed seqs:  573
identifying model active features -- processed seqs:  574
identifying model active features -- processed seqs:  575
identifying model active features -- processed seqs:  576
identifying model active features -- processed seqs:  577
identifying model active features -- processed seqs:  578
identifying model active features -- processed seqs:  579
identifying model active features -- processed seqs:  580
identifying model active features -- processed seqs:  581
identifying model active features -- processed seqs:  582
identifying model active features -- processed seqs:  583
identifying model active features -- processed seqs:  584
identifying model active features -- processed seqs:  585
identifying model active features -- processed seqs:  586
identifying model active features -- processed seqs:  587
identifying model active features -- processed seqs:  588
identifying model active features -- processed seqs:  589
identifying model active features -- processed seqs:  590
identifying model active features -- processed seqs:  591
identifying model active features -- processed seqs:  592
identifying model active features -- processed seqs:  593
identifying model active features -- processed seqs:  594
identifying model active features -- processed seqs:  595
identifying model active features -- processed seqs:  596
identifying model active features -- processed seqs:  597
identifying model active features -- processed seqs:  598
identifying model active features -- processed seqs:  599
identifying model active features -- processed seqs:  600
identifying model active features -- processed seqs:  601
identifying model active features -- processed seqs:  602
identifying model active features -- processed seqs:  603
identifying model active features -- processed seqs:  604
identifying model active features -- processed seqs:  605
identifying model active features -- processed seqs:  606
identifying model active features -- processed seqs:  607
identifying model active features -- processed seqs:  608
identifying model active features -- processed seqs:  609
identifying model active features -- processed seqs:  610
identifying model active features -- processed seqs:  611
identifying model active features -- processed seqs:  612
identifying model active features -- processed seqs:  613
identifying model active features -- processed seqs:  614
identifying model active features -- processed seqs:  615
identifying model active features -- processed seqs:  616
identifying model active features -- processed seqs:  617
identifying model active features -- processed seqs:  618
identifying model active features -- processed seqs:  619
identifying model active features -- processed seqs:  620
identifying model active features -- processed seqs:  621
identifying model active features -- processed seqs:  622
identifying model active features -- processed seqs:  623
identifying model active features -- processed seqs:  624
identifying model active features -- processed seqs:  625
identifying model active features -- processed seqs:  626
identifying model active features -- processed seqs:  627
identifying model active features -- processed seqs:  628
identifying model active features -- processed seqs:  629
identifying model active features -- processed seqs:  630
identifying model active features -- processed seqs:  631
identifying model active features -- processed seqs:  632
identifying model active features -- processed seqs:  633
identifying model active features -- processed seqs:  634
identifying model active features -- processed seqs:  635
sequence decoded -- 634 sequences are left
sequence decoded -- 633 sequences are left
sequence decoded -- 632 sequences are left
sequence decoded -- 631 sequences are left
sequence decoded -- 630 sequences are left
sequence decoded -- 629 sequences are left
sequence decoded -- 628 sequences are left
sequence decoded -- 627 sequences are left
sequence decoded -- 626 sequences are left
sequence decoded -- 625 sequences are left
sequence decoded -- 624 sequences are left
sequence decoded -- 623 sequences are left
sequence decoded -- 622 sequences are left
sequence decoded -- 621 sequences are left
sequence decoded -- 620 sequences are left
sequence decoded -- 619 sequences are left
sequence decoded -- 618 sequences are left
sequence decoded -- 617 sequences are left
sequence decoded -- 616 sequences are left
sequence decoded -- 615 sequences are left
sequence decoded -- 614 sequences are left
sequence decoded -- 613 sequences are left
sequence decoded -- 612 sequences are left
sequence decoded -- 611 sequences are left
sequence decoded -- 610 sequences are left
sequence decoded -- 609 sequences are left
sequence decoded -- 608 sequences are left
sequence decoded -- 607 sequences are left
sequence decoded -- 606 sequences are left
sequence decoded -- 605 sequences are left
sequence decoded -- 604 sequences are left
sequence decoded -- 603 sequences are left
sequence decoded -- 602 sequences are left
sequence decoded -- 601 sequences are left
sequence decoded -- 600 sequences are left
sequence decoded -- 599 sequences are left
sequence decoded -- 598 sequences are left
sequence decoded -- 597 sequences are left
sequence decoded -- 596 sequences are left
sequence decoded -- 595 sequences are left
sequence decoded -- 594 sequences are left
sequence decoded -- 593 sequences are left
sequence decoded -- 592 sequences are left
sequence decoded -- 591 sequences are left
sequence decoded -- 590 sequences are left
sequence decoded -- 589 sequences are left
sequence decoded -- 588 sequences are left
sequence decoded -- 587 sequences are left
sequence decoded -- 586 sequences are left
sequence decoded -- 585 sequences are left
sequence decoded -- 584 sequences are left
sequence decoded -- 583 sequences are left
sequence decoded -- 582 sequences are left
sequence decoded -- 581 sequences are left
sequence decoded -- 580 sequences are left
sequence decoded -- 579 sequences are left
sequence decoded -- 578 sequences are left
sequence decoded -- 577 sequences are left
sequence decoded -- 576 sequences are left
sequence decoded -- 575 sequences are left
sequence decoded -- 574 sequences are left
sequence decoded -- 573 sequences are left
sequence decoded -- 572 sequences are left
sequence decoded -- 571 sequences are left
sequence decoded -- 570 sequences are left
sequence decoded -- 569 sequences are left
sequence decoded -- 568 sequences are left
sequence decoded -- 567 sequences are left
sequence decoded -- 566 sequences are left
sequence decoded -- 565 sequences are left
sequence decoded -- 564 sequences are left
sequence decoded -- 563 sequences are left
sequence decoded -- 562 sequences are left
sequence decoded -- 561 sequences are left
sequence decoded -- 560 sequences are left
sequence decoded -- 559 sequences are left
sequence decoded -- 558 sequences are left
sequence decoded -- 557 sequences are left
sequence decoded -- 556 sequences are left
sequence decoded -- 555 sequences are left
sequence decoded -- 554 sequences are left
sequence decoded -- 553 sequences are left
sequence decoded -- 552 sequences are left
sequence decoded -- 551 sequences are left
sequence decoded -- 550 sequences are left
sequence decoded -- 549 sequences are left
sequence decoded -- 548 sequences are left
sequence decoded -- 547 sequences are left
sequence decoded -- 546 sequences are left
sequence decoded -- 545 sequences are left
sequence decoded -- 544 sequences are left
sequence decoded -- 543 sequences are left
sequence decoded -- 542 sequences are left
sequence decoded -- 541 sequences are left
sequence decoded -- 540 sequences are left
sequence decoded -- 539 sequences are left
sequence decoded -- 538 sequences are left
sequence decoded -- 537 sequences are left
sequence decoded -- 536 sequences are left
sequence decoded -- 535 sequences are left
sequence decoded -- 534 sequences are left
sequence decoded -- 533 sequences are left
sequence decoded -- 532 sequences are left
sequence decoded -- 531 sequences are left
sequence decoded -- 530 sequences are left
sequence decoded -- 529 sequences are left
sequence decoded -- 528 sequences are left
sequence decoded -- 527 sequences are left
sequence decoded -- 526 sequences are left
sequence decoded -- 525 sequences are left
sequence decoded -- 524 sequences are left
sequence decoded -- 523 sequences are left
sequence decoded -- 522 sequences are left
sequence decoded -- 521 sequences are left
sequence decoded -- 520 sequences are left
sequence decoded -- 519 sequences are left
sequence decoded -- 518 sequences are left
sequence decoded -- 517 sequences are left
sequence decoded -- 516 sequences are left
sequence decoded -- 515 sequences are left
sequence decoded -- 514 sequences are left
sequence decoded -- 513 sequences are left
sequence decoded -- 512 sequences are left
sequence decoded -- 511 sequences are left
sequence decoded -- 510 sequences are left
sequence decoded -- 509 sequences are left
sequence decoded -- 508 sequences are left
sequence decoded -- 507 sequences are left
sequence decoded -- 506 sequences are left
sequence decoded -- 505 sequences are left
sequence decoded -- 504 sequences are left
sequence decoded -- 503 sequences are left
sequence decoded -- 502 sequences are left
sequence decoded -- 501 sequences are left
sequence decoded -- 500 sequences are left
sequence decoded -- 499 sequences are left
sequence decoded -- 498 sequences are left
sequence decoded -- 497 sequences are left
sequence decoded -- 496 sequences are left
sequence decoded -- 495 sequences are left
sequence decoded -- 494 sequences are left
sequence decoded -- 493 sequences are left
sequence decoded -- 492 sequences are left
sequence decoded -- 491 sequences are left
sequence decoded -- 490 sequences are left
sequence decoded -- 489 sequences are left
sequence decoded -- 488 sequences are left
sequence decoded -- 487 sequences are left
sequence decoded -- 486 sequences are left
sequence decoded -- 485 sequences are left
sequence decoded -- 484 sequences are left
sequence decoded -- 483 sequences are left
sequence decoded -- 482 sequences are left
sequence decoded -- 481 sequences are left
sequence decoded -- 480 sequences are left
sequence decoded -- 479 sequences are left
sequence decoded -- 478 sequences are left
sequence decoded -- 477 sequences are left
sequence decoded -- 476 sequences are left
sequence decoded -- 475 sequences are left
sequence decoded -- 474 sequences are left
sequence decoded -- 473 sequences are left
sequence decoded -- 472 sequences are left
sequence decoded -- 471 sequences are left
sequence decoded -- 470 sequences are left
sequence decoded -- 469 sequences are left
sequence decoded -- 468 sequences are left
sequence decoded -- 467 sequences are left
sequence decoded -- 466 sequences are left
sequence decoded -- 465 sequences are left
sequence decoded -- 464 sequences are left
sequence decoded -- 463 sequences are left
sequence decoded -- 462 sequences are left
sequence decoded -- 461 sequences are left
sequence decoded -- 460 sequences are left
sequence decoded -- 459 sequences are left
sequence decoded -- 458 sequences are left
sequence decoded -- 457 sequences are left
sequence decoded -- 456 sequences are left
sequence decoded -- 455 sequences are left
sequence decoded -- 454 sequences are left
sequence decoded -- 453 sequences are left
sequence decoded -- 452 sequences are left
sequence decoded -- 451 sequences are left
sequence decoded -- 450 sequences are left
sequence decoded -- 449 sequences are left
sequence decoded -- 448 sequences are left
sequence decoded -- 447 sequences are left
sequence decoded -- 446 sequences are left
sequence decoded -- 445 sequences are left
sequence decoded -- 444 sequences are left
sequence decoded -- 443 sequences are left
sequence decoded -- 442 sequences are left
sequence decoded -- 441 sequences are left
sequence decoded -- 440 sequences are left
sequence decoded -- 439 sequences are left
sequence decoded -- 438 sequences are left
sequence decoded -- 437 sequences are left
sequence decoded -- 436 sequences are left
sequence decoded -- 435 sequences are left
sequence decoded -- 434 sequences are left
sequence decoded -- 433 sequences are left
sequence decoded -- 432 sequences are left
sequence decoded -- 431 sequences are left
sequence decoded -- 430 sequences are left
sequence decoded -- 429 sequences are left
sequence decoded -- 428 sequences are left
sequence decoded -- 427 sequences are left
sequence decoded -- 426 sequences are left
sequence decoded -- 425 sequences are left
sequence decoded -- 424 sequences are left
sequence decoded -- 423 sequences are left
sequence decoded -- 422 sequences are left
sequence decoded -- 421 sequences are left
sequence decoded -- 420 sequences are left
sequence decoded -- 419 sequences are left
sequence decoded -- 418 sequences are left
sequence decoded -- 417 sequences are left
sequence decoded -- 416 sequences are left
sequence decoded -- 415 sequences are left
sequence decoded -- 414 sequences are left
sequence decoded -- 413 sequences are left
sequence decoded -- 412 sequences are left
sequence decoded -- 411 sequences are left
sequence decoded -- 410 sequences are left
sequence decoded -- 409 sequences are left
sequence decoded -- 408 sequences are left
sequence decoded -- 407 sequences are left
sequence decoded -- 406 sequences are left
sequence decoded -- 405 sequences are left
sequence decoded -- 404 sequences are left
sequence decoded -- 403 sequences are left
sequence decoded -- 402 sequences are left
sequence decoded -- 401 sequences are left
sequence decoded -- 400 sequences are left
sequence decoded -- 399 sequences are left
sequence decoded -- 398 sequences are left
sequence decoded -- 397 sequences are left
sequence decoded -- 396 sequences are left
sequence decoded -- 395 sequences are left
sequence decoded -- 394 sequences are left
sequence decoded -- 393 sequences are left
sequence decoded -- 392 sequences are left
sequence decoded -- 391 sequences are left
sequence decoded -- 390 sequences are left
sequence decoded -- 389 sequences are left
sequence decoded -- 388 sequences are left
sequence decoded -- 387 sequences are left
sequence decoded -- 386 sequences are left
sequence decoded -- 385 sequences are left
sequence decoded -- 384 sequences are left
sequence decoded -- 383 sequences are left
sequence decoded -- 382 sequences are left
sequence decoded -- 381 sequences are left
sequence decoded -- 380 sequences are left
sequence decoded -- 379 sequences are left
sequence decoded -- 378 sequences are left
sequence decoded -- 377 sequences are left
sequence decoded -- 376 sequences are left
sequence decoded -- 375 sequences are left
sequence decoded -- 374 sequences are left
sequence decoded -- 373 sequences are left
sequence decoded -- 372 sequences are left
sequence decoded -- 371 sequences are left
sequence decoded -- 370 sequences are left
sequence decoded -- 369 sequences are left
sequence decoded -- 368 sequences are left
sequence decoded -- 367 sequences are left
sequence decoded -- 366 sequences are left
sequence decoded -- 365 sequences are left
sequence decoded -- 364 sequences are left
sequence decoded -- 363 sequences are left
sequence decoded -- 362 sequences are left
sequence decoded -- 361 sequences are left
sequence decoded -- 360 sequences are left
sequence decoded -- 359 sequences are left
sequence decoded -- 358 sequences are left
sequence decoded -- 357 sequences are left
sequence decoded -- 356 sequences are left
sequence decoded -- 355 sequences are left
sequence decoded -- 354 sequences are left
sequence decoded -- 353 sequences are left
sequence decoded -- 352 sequences are left
sequence decoded -- 351 sequences are left
sequence decoded -- 350 sequences are left
sequence decoded -- 349 sequences are left
sequence decoded -- 348 sequences are left
sequence decoded -- 347 sequences are left
sequence decoded -- 346 sequences are left
sequence decoded -- 345 sequences are left
sequence decoded -- 344 sequences are left
sequence decoded -- 343 sequences are left
sequence decoded -- 342 sequences are left
sequence decoded -- 341 sequences are left
sequence decoded -- 340 sequences are left
sequence decoded -- 339 sequences are left
sequence decoded -- 338 sequences are left
sequence decoded -- 337 sequences are left
sequence decoded -- 336 sequences are left
sequence decoded -- 335 sequences are left
sequence decoded -- 334 sequences are left
sequence decoded -- 333 sequences are left
sequence decoded -- 332 sequences are left
sequence decoded -- 331 sequences are left
sequence decoded -- 330 sequences are left
sequence decoded -- 329 sequences are left
sequence decoded -- 328 sequences are left
sequence decoded -- 327 sequences are left
sequence decoded -- 326 sequences are left
sequence decoded -- 325 sequences are left
sequence decoded -- 324 sequences are left
sequence decoded -- 323 sequences are left
sequence decoded -- 322 sequences are left
sequence decoded -- 321 sequences are left
sequence decoded -- 320 sequences are left
sequence decoded -- 319 sequences are left
sequence decoded -- 318 sequences are left
sequence decoded -- 317 sequences are left
sequence decoded -- 316 sequences are left
sequence decoded -- 315 sequences are left
sequence decoded -- 314 sequences are left
sequence decoded -- 313 sequences are left
sequence decoded -- 312 sequences are left
sequence decoded -- 311 sequences are left
sequence decoded -- 310 sequences are left
sequence decoded -- 309 sequences are left
sequence decoded -- 308 sequences are left
sequence decoded -- 307 sequences are left
sequence decoded -- 306 sequences are left
sequence decoded -- 305 sequences are left
sequence decoded -- 304 sequences are left
sequence decoded -- 303 sequences are left
sequence decoded -- 302 sequences are left
sequence decoded -- 301 sequences are left
sequence decoded -- 300 sequences are left
sequence decoded -- 299 sequences are left
sequence decoded -- 298 sequences are left
sequence decoded -- 297 sequences are left
sequence decoded -- 296 sequences are left
sequence decoded -- 295 sequences are left
sequence decoded -- 294 sequences are left
sequence decoded -- 293 sequences are left
sequence decoded -- 292 sequences are left
sequence decoded -- 291 sequences are left
sequence decoded -- 290 sequences are left
sequence decoded -- 289 sequences are left
sequence decoded -- 288 sequences are left
sequence decoded -- 287 sequences are left
sequence decoded -- 286 sequences are left
sequence decoded -- 285 sequences are left
sequence decoded -- 284 sequences are left
sequence decoded -- 283 sequences are left
sequence decoded -- 282 sequences are left
sequence decoded -- 281 sequences are left
sequence decoded -- 280 sequences are left
sequence decoded -- 279 sequences are left
sequence decoded -- 278 sequences are left
sequence decoded -- 277 sequences are left
sequence decoded -- 276 sequences are left
sequence decoded -- 275 sequences are left
sequence decoded -- 274 sequences are left
sequence decoded -- 273 sequences are left
sequence decoded -- 272 sequences are left
sequence decoded -- 271 sequences are left
sequence decoded -- 270 sequences are left
sequence decoded -- 269 sequences are left
sequence decoded -- 268 sequences are left
sequence decoded -- 267 sequences are left
sequence decoded -- 266 sequences are left
sequence decoded -- 265 sequences are left
sequence decoded -- 264 sequences are left
sequence decoded -- 263 sequences are left
sequence decoded -- 262 sequences are left
sequence decoded -- 261 sequences are left
sequence decoded -- 260 sequences are left
sequence decoded -- 259 sequences are left
sequence decoded -- 258 sequences are left
sequence decoded -- 257 sequences are left
sequence decoded -- 256 sequences are left
sequence decoded -- 255 sequences are left
sequence decoded -- 254 sequences are left
sequence decoded -- 253 sequences are left
sequence decoded -- 252 sequences are left
sequence decoded -- 251 sequences are left
sequence decoded -- 250 sequences are left
sequence decoded -- 249 sequences are left
sequence decoded -- 248 sequences are left
sequence decoded -- 247 sequences are left
sequence decoded -- 246 sequences are left
sequence decoded -- 245 sequences are left
sequence decoded -- 244 sequences are left
sequence decoded -- 243 sequences are left
sequence decoded -- 242 sequences are left
sequence decoded -- 241 sequences are left
sequence decoded -- 240 sequences are left
sequence decoded -- 239 sequences are left
sequence decoded -- 238 sequences are left
sequence decoded -- 237 sequences are left
sequence decoded -- 236 sequences are left
sequence decoded -- 235 sequences are left
sequence decoded -- 234 sequences are left
sequence decoded -- 233 sequences are left
sequence decoded -- 232 sequences are left
sequence decoded -- 231 sequences are left
sequence decoded -- 230 sequences are left
sequence decoded -- 229 sequences are left
sequence decoded -- 228 sequences are left
sequence decoded -- 227 sequences are left
sequence decoded -- 226 sequences are left
sequence decoded -- 225 sequences are left
sequence decoded -- 224 sequences are left
sequence decoded -- 223 sequences are left
sequence decoded -- 222 sequences are left
sequence decoded -- 221 sequences are left
sequence decoded -- 220 sequences are left
sequence decoded -- 219 sequences are left
sequence decoded -- 218 sequences are left
sequence decoded -- 217 sequences are left
sequence decoded -- 216 sequences are left
sequence decoded -- 215 sequences are left
sequence decoded -- 214 sequences are left
sequence decoded -- 213 sequences are left
sequence decoded -- 212 sequences are left
sequence decoded -- 211 sequences are left
sequence decoded -- 210 sequences are left
sequence decoded -- 209 sequences are left
sequence decoded -- 208 sequences are left
sequence decoded -- 207 sequences are left
sequence decoded -- 206 sequences are left
sequence decoded -- 205 sequences are left
sequence decoded -- 204 sequences are left
sequence decoded -- 203 sequences are left
sequence decoded -- 202 sequences are left
sequence decoded -- 201 sequences are left
sequence decoded -- 200 sequences are left
sequence decoded -- 199 sequences are left
sequence decoded -- 198 sequences are left
sequence decoded -- 197 sequences are left
sequence decoded -- 196 sequences are left
sequence decoded -- 195 sequences are left
sequence decoded -- 194 sequences are left
sequence decoded -- 193 sequences are left
sequence decoded -- 192 sequences are left
sequence decoded -- 191 sequences are left
sequence decoded -- 190 sequences are left
sequence decoded -- 189 sequences are left
sequence decoded -- 188 sequences are left
sequence decoded -- 187 sequences are left
sequence decoded -- 186 sequences are left
sequence decoded -- 185 sequences are left
sequence decoded -- 184 sequences are left
sequence decoded -- 183 sequences are left
sequence decoded -- 182 sequences are left
sequence decoded -- 181 sequences are left
sequence decoded -- 180 sequences are left
sequence decoded -- 179 sequences are left
sequence decoded -- 178 sequences are left
sequence decoded -- 177 sequences are left
sequence decoded -- 176 sequences are left
sequence decoded -- 175 sequences are left
sequence decoded -- 174 sequences are left
sequence decoded -- 173 sequences are left
sequence decoded -- 172 sequences are left
sequence decoded -- 171 sequences are left
sequence decoded -- 170 sequences are left
sequence decoded -- 169 sequences are left
sequence decoded -- 168 sequences are left
sequence decoded -- 167 sequences are left
sequence decoded -- 166 sequences are left
sequence decoded -- 165 sequences are left
sequence decoded -- 164 sequences are left
sequence decoded -- 163 sequences are left
sequence decoded -- 162 sequences are left
sequence decoded -- 161 sequences are left
sequence decoded -- 160 sequences are left
sequence decoded -- 159 sequences are left
sequence decoded -- 158 sequences are left
sequence decoded -- 157 sequences are left
sequence decoded -- 156 sequences are left
sequence decoded -- 155 sequences are left
sequence decoded -- 154 sequences are left
sequence decoded -- 153 sequences are left
sequence decoded -- 152 sequences are left
sequence decoded -- 151 sequences are left
sequence decoded -- 150 sequences are left
sequence decoded -- 149 sequences are left
sequence decoded -- 148 sequences are left
sequence decoded -- 147 sequences are left
sequence decoded -- 146 sequences are left
sequence decoded -- 145 sequences are left
sequence decoded -- 144 sequences are left
sequence decoded -- 143 sequences are left
sequence decoded -- 142 sequences are left
sequence decoded -- 141 sequences are left
sequence decoded -- 140 sequences are left
sequence decoded -- 139 sequences are left
sequence decoded -- 138 sequences are left
sequence decoded -- 137 sequences are left
sequence decoded -- 136 sequences are left
sequence decoded -- 135 sequences are left
sequence decoded -- 134 sequences are left
sequence decoded -- 133 sequences are left
sequence decoded -- 132 sequences are left
sequence decoded -- 131 sequences are left
sequence decoded -- 130 sequences are left
sequence decoded -- 129 sequences are left
sequence decoded -- 128 sequences are left
sequence decoded -- 127 sequences are left
sequence decoded -- 126 sequences are left
sequence decoded -- 125 sequences are left
sequence decoded -- 124 sequences are left
sequence decoded -- 123 sequences are left
sequence decoded -- 122 sequences are left
sequence decoded -- 121 sequences are left
sequence decoded -- 120 sequences are left
sequence decoded -- 119 sequences are left
sequence decoded -- 118 sequences are left
sequence decoded -- 117 sequences are left
sequence decoded -- 116 sequences are left
sequence decoded -- 115 sequences are left
sequence decoded -- 114 sequences are left
sequence decoded -- 113 sequences are left
sequence decoded -- 112 sequences are left
sequence decoded -- 111 sequences are left
sequence decoded -- 110 sequences are left
sequence decoded -- 109 sequences are left
sequence decoded -- 108 sequences are left
sequence decoded -- 107 sequences are left
sequence decoded -- 106 sequences are left
sequence decoded -- 105 sequences are left
sequence decoded -- 104 sequences are left
sequence decoded -- 103 sequences are left
sequence decoded -- 102 sequences are left
sequence decoded -- 101 sequences are left
sequence decoded -- 100 sequences are left
sequence decoded -- 99 sequences are left
sequence decoded -- 98 sequences are left
sequence decoded -- 97 sequences are left
sequence decoded -- 96 sequences are left
sequence decoded -- 95 sequences are left
sequence decoded -- 94 sequences are left
sequence decoded -- 93 sequences are left
sequence decoded -- 92 sequences are left
sequence decoded -- 91 sequences are left
sequence decoded -- 90 sequences are left
sequence decoded -- 89 sequences are left
sequence decoded -- 88 sequences are left
sequence decoded -- 87 sequences are left
sequence decoded -- 86 sequences are left
sequence decoded -- 85 sequences are left
sequence decoded -- 84 sequences are left
sequence decoded -- 83 sequences are left
sequence decoded -- 82 sequences are left
sequence decoded -- 81 sequences are left
sequence decoded -- 80 sequences are left
sequence decoded -- 79 sequences are left
sequence decoded -- 78 sequences are left
sequence decoded -- 77 sequences are left
sequence decoded -- 76 sequences are left
sequence decoded -- 75 sequences are left
sequence decoded -- 74 sequences are left
sequence decoded -- 73 sequences are left
sequence decoded -- 72 sequences are left
sequence decoded -- 71 sequences are left
sequence decoded -- 70 sequences are left
sequence decoded -- 69 sequences are left
sequence decoded -- 68 sequences are left
sequence decoded -- 67 sequences are left
sequence decoded -- 66 sequences are left
sequence decoded -- 65 sequences are left
sequence decoded -- 64 sequences are left
sequence decoded -- 63 sequences are left
sequence decoded -- 62 sequences are left
sequence decoded -- 61 sequences are left
sequence decoded -- 60 sequences are left
sequence decoded -- 59 sequences are left
sequence decoded -- 58 sequences are left
sequence decoded -- 57 sequences are left
sequence decoded -- 56 sequences are left
sequence decoded -- 55 sequences are left
sequence decoded -- 54 sequences are left
sequence decoded -- 53 sequences are left
sequence decoded -- 52 sequences are left
sequence decoded -- 51 sequences are left
sequence decoded -- 50 sequences are left
sequence decoded -- 49 sequences are left
sequence decoded -- 48 sequences are left
sequence decoded -- 47 sequences are left
sequence decoded -- 46 sequences are left
sequence decoded -- 45 sequences are left
sequence decoded -- 44 sequences are left
sequence decoded -- 43 sequences are left
sequence decoded -- 42 sequences are left
sequence decoded -- 41 sequences are left
sequence decoded -- 40 sequences are left
sequence decoded -- 39 sequences are left
sequence decoded -- 38 sequences are left
sequence decoded -- 37 sequences are left
sequence decoded -- 36 sequences are left
sequence decoded -- 35 sequences are left
sequence decoded -- 34 sequences are left
sequence decoded -- 33 sequences are left
sequence decoded -- 32 sequences are left
sequence decoded -- 31 sequences are left
sequence decoded -- 30 sequences are left
sequence decoded -- 29 sequences are left
sequence decoded -- 28 sequences are left
sequence decoded -- 27 sequences are left
sequence decoded -- 26 sequences are left
sequence decoded -- 25 sequences are left
sequence decoded -- 24 sequences are left
sequence decoded -- 23 sequences are left
sequence decoded -- 22 sequences are left
sequence decoded -- 21 sequences are left
sequence decoded -- 20 sequences are left
sequence decoded -- 19 sequences are left
sequence decoded -- 18 sequences are left
sequence decoded -- 17 sequences are left
sequence decoded -- 16 sequences are left
sequence decoded -- 15 sequences are left
sequence decoded -- 14 sequences are left
sequence decoded -- 13 sequences are left
sequence decoded -- 12 sequences are left
sequence decoded -- 11 sequences are left
sequence decoded -- 10 sequences are left
sequence decoded -- 9 sequences are left
sequence decoded -- 8 sequences are left
sequence decoded -- 7 sequences are left
sequence decoded -- 6 sequences are left
sequence decoded -- 5 sequences are left
sequence decoded -- 4 sequences are left
sequence decoded -- 3 sequences are left
sequence decoded -- 2 sequences are left
sequence decoded -- 1 sequences are left
sequence decoded -- 0 sequences are left

4.2 Decoding performance

The decoded sequences are written to the decoding_seqs directory under the tutorials directory (i.e. the current directory):

|---tutorials
|      |---decoding_seqs
|      |             |---tutorial_seqs_decoding.txt
The tutorial_seqs_decoding.txt file follows the same template/format as the test_f_2.txt file we parsed earlier, with an additional column containing our model's predictions. This time we evaluate the decoding performance using the eval_decoded_file(args) function in the train_splicejunction_predictor_workflow.py module. It takes the path to the newly decoded file as an argument, prints a per-label report (precision, recall, F1 and support) and returns the decoding error.

In [17]:
new_decseqs_file = os.path.join(tutorials_dir, 'decoding_seqs','tutorial_seqs_decoding.txt')
score = eval_decoded_file(new_decseqs_file)

print("error score: ", score)
                                                              precision    recall  f1-score   support

000000000000000000000000000000000000000000000000000000000000       0.97      0.97      0.97       331
000000000000000000000000000000100000000000000000000000000000       0.95      0.96      0.96       151
000000000000000000000000000000200000000000000000000000000000       0.97      0.97      0.97       153

                                                 avg / total       0.97      0.97      0.97       635

weighted f1:
0.968514485113
micro f1:
0.968503937008
error score:  0.0314960629921

Our model achieves a decoding error score of about 3.1% (0.0315) on the 635 decoded sequences, which equals 1 minus the micro F1 (0.9685) reported above.
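To make the relationship between the error score and the micro F1 concrete, here is a minimal sketch. It is not the implementation of eval_decoded_file; the helper names micro_f1 and error_score and the toy label strings are assumptions introduced for illustration. It treats each sequence's 60-character label string as a single class (as in the report above), in which case the micro-averaged F1 reduces to plain accuracy.

```python
def micro_f1(reference, predicted):
    """Micro-averaged F1 for single-label data, which equals plain accuracy."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("at least one sequence is required")
    # a sequence counts as correct only if its whole label string matches
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference)


def error_score(reference, predicted):
    """Decoding error, assumed here to be 1 - micro F1."""
    return 1.0 - micro_f1(reference, predicted)


# hypothetical toy label strings, shaped like the ones in the report
label_none = "0" * 60                   # no boundary
label_one = "0" * 30 + "1" + "0" * 29   # boundary marked with 1
label_two = "0" * 30 + "2" + "0" * 29   # boundary marked with 2

reference = [label_none, label_none, label_one, label_two]
predicted = [label_none, label_one, label_one, label_two]

toy_error = error_score(reference, predicted)
print("toy error score:", toy_error)

# the reported error (0.0314960629921) is consistent with
# 20 mismatched sequences out of the 635 in the support column
print("20 / 635 =", 20 / 635)
```

In the toy example one of the four sequences is decoded incorrectly, so the error is one quarter. The last line shows that the reported error is consistent with 20 mismatched sequences out of 635.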
Exploring and experimenting with different feature templates and attributes for training a model is left as an exercise for the reader :)

5. Literature & references

M. O. Noordewier, G. G. Towell and J. W. Shavlik, 1991; "Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences". In Advances in Neural Information Processing Systems, volume 3, Morgan Kaufmann.

G. G. Towell and J. W. Shavlik, 1992; "Interpretation of Artificial Neural Networks: Mapping Knowledge-Based Neural Networks into Rules". In Advances in Neural Information Processing Systems, volume 4, Morgan Kaufmann.
