# importing and defining relevant directories
import sys
import os
# pyseqlab root directory
pyseqlab_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
# print("pyseqlab cloned dir:", pyseqlab_dir)
# inserting the pyseqlab directory to python's system path
# if pyseqlab is already installed this could be commented out
sys.path.insert(0, pyseqlab_dir)
# current directory (tutorials)
tutorials_dir = os.path.join(pyseqlab_dir, 'tutorials')
# print("tutorials_dir:", tutorials_dir)
dataset_dir = os.path.join(tutorials_dir, 'datasets', 'conll2000')
# to use for customizing the display/format of the cells
from IPython.core.display import HTML
with open(os.path.join(tutorials_dir, 'pseqlab_base.css')) as f:
css = "".join(f.readlines())
HTML(css)
In this tutorial, we will learn about:
Reminder: To work with this tutorial interactively, we first need to clone the PySeqLab package locally and then navigate to [cloned_package_dir]/tutorials, where [cloned_package_dir] is the path to the cloned package folder (see the tree path displayed below).
├── pyseqlab
├── tutorials
│   ├── datasets
│   │   ├── conll2000
│   │   ├── segments
We suggest going through the earlier tutorials before continuing with this notebook. We will use part of the CoNLL 2000 training dataset throughout this tutorial.
As a reminder, the CoNLL 2000 task states:
Given a set of sentences (our sequences) where each sentence is composed of words and their corresponding part-of-speech, the goal is to predict the chunk/shallow parse label of every word in the sentence.
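To make the input format concrete, here is a small hand-written excerpt in the CoNLL 2000 three-column layout (word, part-of-speech, chunk label), parsed with plain Python. This is illustrative only; the tutorial reads the real file with `DataFileParser` below.

```python
# a hand-written excerpt in the CoNLL 2000 column format:
# each line holds "word POS chunk-label" separated by single spaces
sample = """Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP"""

# split each line into its (word, pos, chunk) columns
tokens = [tuple(line.split(" ")) for line in sample.split("\n")]
for word, pos, chunk in tokens:
    print(word, pos, chunk)
```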
We start by parsing/reading the sentences in the training dataset into sequences.
# print("dataset_dir:", dataset_dir)
from pyseqlab.utilities import DataFileParser
# initialize a data file parser
dparser = DataFileParser()
# provide options to the parser, such as the header info, the separator between words,
# and whether the y label is already present in the file
# main means the header is found in the first line of the file
header = "main"
# y_ref is a boolean indicating if the label to predict is already found in the file
y_ref = True
# separator between the words/observations
column_sep = " "
seqs = []
for seq in dparser.read_file(os.path.join(dataset_dir, 'train.txt'), header, y_ref=y_ref, column_sep=column_sep):
seqs.append(seq)
# printing one sequence for display
print(seqs[0])
print("type(seq):", type(seqs[0]))
print("number of parsed sequences is: ", len(seqs))
Before going into the model building and training workflow, it is important to discuss and highlight the CRFs models supported in PySeqLab. Currently, PySeqLab supports linear-chain (1) conditional random fields (CRFs) and (2) semi-Markov conditional random fields (semi-CRFs). Moreover, it supports first order models (i.e. modeling label patterns using at most two states/labels, $\le$ 2) and higher order models (i.e. modeling label patterns using more than two states, > 2). The table below provides an overview of the implemented CRFs classes.
| Models | CRFs, first order ($\le$ 2) | CRFs, higher order (> 2) | semi-CRFs, first order ($\le$ 2) | semi-CRFs, higher order (> 2) |
|---|---|---|---|---|
| `FirstOrderCRF` | ✓ | | | |
| `HOCRF` | ✓ | ✓ | | |
| `HOCRFAD` | ✓ | ✓ | | |
| `HOSemiCRF` | ✓ | ✓ | ✓ | ✓ |
| `HOSemiCRFAD` | ✓ | ✓ | ✓ | ✓ |
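To make the notion of model order concrete, the following sketch (illustrative only, not a PySeqLab API) enumerates the label patterns a model of a given order can represent: a first order model uses patterns of at most two states, while a higher order model also uses longer ones.

```python
# Illustrative sketch: enumerate label n-grams up to a maximum order.
# A first order model uses patterns of length <= 2; higher order
# models additionally use patterns of length > 2.
def label_patterns(labels, max_order):
    """Return all label n-grams of length 1..max_order."""
    patterns = []
    for n in range(1, max_order + 1):
        for i in range(len(labels) - n + 1):
            patterns.append(tuple(labels[i:i + n]))
    return patterns

labels = ['B-NP', 'I-NP', 'O']
print(label_patterns(labels, 2))  # first order: 1- and 2-state patterns
print(label_patterns(labels, 3))  # higher order: also a 3-state pattern
```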
The displayed models are all based on the CRFs formalism (undirected discriminative graphical models); the differences among them arise from: (1) the model order they support and/or (2) the algorithms used to estimate the probability of the sequences, which in turn affect the gradient and log-likelihood computation. The semi-CRFs are considered a generalization of CRFs, as they tackle sequence segmentation problems (i.e. using segments in which labels/tags extend across several consecutive observations of the input sequence). Hence, CRFs can be seen as a special case of semi-CRFs in which the segment length is 1 (i.e. each label is assigned to one observation; see (Sarawagi et al., 2005) for further discussion). Below is another table with pointers to the papers and literature on which the implemented CRFs and semi-CRFs models are based:
| Models | Short description | Reference |
|---|---|---|
| `FirstOrderCRF` | Original formalism of first order CRFs | (Lafferty et al., 2001) |
| `HOCRF` | Higher order formulation of CRFs | (Ye et al., 2009) |
| `HOSemiCRF` | Higher order formulation of semi-CRFs | (Sarawagi et al., 2005; Cuong et al., 2014) |
| `HOCRFAD` | Higher order formulation of CRFs with a better/optimized forward-backward algorithm | (Ye et al., 2009; Vieira et al., 2016) |
| `HOSemiCRFAD` | Higher order formulation of semi-CRFs with a better/optimized forward-backward algorithm | (Sarawagi et al., 2005; Cuong et al., 2014; Vieira et al., 2016) |
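To illustrate the sequence-vs-segment distinction, here is a small sketch (an illustrative helper, not part of PySeqLab) that collapses per-observation chunk labels into the (label, start, end) segment triples that semi-CRFs reason over; a CRF corresponds to the special case where every segment has length 1.

```python
# Illustrative: collapse BIO-style per-observation labels into segments.
# Each segment is (label, start_index, end_index), inclusive on both ends.
def bio_to_segments(labels):
    segments = []
    for i, label in enumerate(labels):
        tag = label.split('-', 1)[-1] if label != 'O' else 'O'
        if label.startswith('B-') or label == 'O' or not segments:
            segments.append([tag, i, i])          # open a new segment
        elif label.startswith('I-') and segments[-1][0] == tag:
            segments[-1][2] = i                   # extend the current segment
        else:
            segments.append([tag, i, i])          # stray I- starts a segment
    return [tuple(s) for s in segments]

print(bio_to_segments(['B-NP', 'I-NP', 'B-PP', 'O']))
# [('NP', 0, 1), ('PP', 2, 2), ('O', 3, 3)]
```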
Question: So, how do we choose which model to use?
Answer: It depends on the training data we have and the order of patterns we want to model.
Examples:
- `FirstOrderCRF` could be used if our training data consists of sequences and we aim to model features that include at most one or two state/label transitions. Moreover, `FirstOrderCRF` naturally supports the inclusion of the `__START__` state for building models with initial-label and transition-label features at the starting position of the sequences. For more info about extracting features that include the `__START__` state, please see the templates_and_feature_extraction tutorial.
- `HOCRF` and `HOCRFAD` could be used if our training data consists of sequences and we aim to model features with first order and/or higher order label patterns (i.e. label pattern transitions with more than two states). There are no restrictions on the label patterns when using these models. Additionally, both models are equivalent to `FirstOrderCRF` when we model features with first order label patterns without including the `__START__` state. However, there is a subtle difference between the `HOCRF` and `HOCRFAD` models: `HOCRFAD` provides a more efficient implementation of the forward-backward algorithm. In addition, `HOCRFAD` supports gradient-based training methods while `HOCRF` supports perceptron/search-based training methods (see the training section for more info). Hence, we suggest using `HOCRFAD` as the primary choice in similar contexts/scenarios.
- `HOSemiCRF` and `HOSemiCRFAD` could be used if our training data consists of segments or sequences and we aim to model features with first order and/or higher order label patterns (i.e. label pattern transitions with more than two states). There are no restrictions on the label patterns when using these models. However, there is a subtle difference between the `HOSemiCRF` and `HOSemiCRFAD` models: `HOSemiCRFAD` provides a more efficient implementation of the forward-backward algorithm. Both models support gradient-based and perceptron/search-based training methods. It is also worth mentioning that these models are of particular use when we are dealing with segments, even though they can incorporate sequences without any problems. Hence, we suggest using `HOSemiCRFAD` as the primary choice when we have segments and/or have no restriction on the order of label pattern transitions.

Now that we know the different implementations of CRFs models, it is time to talk about another set of closely related classes suffixed by ModelRepresentation (such as the `FirstOrderCRFModelRepresentation` class). These classes primarily hold the relevant data structures used by the CRFs models. Therefore, for every class representing a CRF model, we have a corresponding model representation class (e.g. `FirstOrderCRF` --- `FirstOrderCRFModelRepresentation`). Below is a table mapping between both sets of classes:
| CRFs model | CRFs model representation | Module name |
|---|---|---|
| `FirstOrderCRF` | `FirstOrderCRFModelRepresentation` | `pyseqlab.fo_crf` |
| `HOCRF` | `HOCRFModelRepresentation` | `pyseqlab.ho_crf` |
| `HOSemiCRF` | `HOSemiCRFModelRepresentation` | `pyseqlab.hosemi_crf` |
| `HOCRFAD` | `HOCRFADModelRepresentation` | `pyseqlab.ho_crf_ad` |
| `HOSemiCRFAD` | `HOSemiCRFADModelRepresentation` | `pyseqlab.hosemi_crf_ad` |
These two sets of classes are tied together: whenever we want to use a CRF model class, the corresponding model representation class should be used with it. Both sets of classes are very important and will be exercised when we examine the model building workflow in the following section.
A typical process for building CRF models starting from the file comprising the training data (i.e. sequences or segments) is described as follows:
The listed points represent a common and recurrent workflow in sequence labeling and segmentation problems. For this reason, PySeqLab provides a `GenericTrainingWorkflow` class that takes care of tasks 1-8, which we explore further in this section.
But first, we describe the arguments passed to the `GenericTrainingWorkflow` constructor:
- `aextractor_obj`: an initialized instance of the `GenericAttributeExtractor` class/subclass (see attributes extraction in the templates_and_feature_extraction tutorial)
- `fextractor_obj`: an initialized instance of the `FeatureExtractor` class/subclass (see the `FOFeatureExtractor` or `HOFeatureExtractor` classes in the templates_and_feature_extraction tutorial)
- `feature_filter_obj`: `None` or an initialized instance of the `FeatureFilter` class (see the feature filter section in the templates_and_feature_extraction tutorial)
- `model_repr_class`: the CRFs model representation class (see the 2nd column in the table above)
- `model_class`: the CRFs model class (see the 1st column in the table above)
- `root_dir`: string representing the directory/path where the working directory will be created

After initializing a `GenericTrainingWorkflow` instance, we will use the `seq_parsing_workflow(args)` method, which handles points (1 to 4) in the above process. This instance method has one argument and multiple keyword arguments.
Args:
- `split_options`: dictionary representing the options of the data split strategy. The main key is `'method'`, which specifies the method for data splitting. It can assume one of the following values: `{'none', 'random', 'cross_validation', 'wsample'}`. Depending on the chosen method, additional keys and values are specified. For example:

# cross validation: specify 'cross_validation' for the method
# and the number of folds for 'k_fold'
options = {'method':'cross_validation', 'k_fold':number}

# random splitting: specify 'random' for the method, the number of splits
# for 'num_splits', and the training set size (percentage) for 'trainset_size'
options = {'method':'random', 'num_splits':number, 'trainset_size':percentage}

# no splitting (use all data as training): specify 'none' for the method
options = {'method':'none'}

# weighted sampling (based on sequences' length): specify 'wsample' for the method
options = {'method':'wsample'}
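As an illustration of what the `'random'` option does, here is a minimal sketch (a hypothetical helper, not part of PySeqLab) that produces output shaped like the `data_split` dictionaries shown later in this tutorial:

```python
# Hypothetical helper sketching the 'random' split method: num_splits
# independent shuffles, each divided into a train portion (trainset_size
# percent of the sequences) and a test portion with the rest.
import random

def random_splits(seq_ids, num_splits, trainset_size, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    n_train = int(len(seq_ids) * trainset_size / 100)
    splits = {}
    for i in range(num_splits):
        shuffled = list(seq_ids)
        rng.shuffle(shuffled)
        splits[i] = {'train': shuffled[:n_train], 'test': shuffled[n_train:]}
    return splits

splits = random_splits(range(1, 26), num_splits=3, trainset_size=80)
print(len(splits[0]['train']), len(splits[0]['test']))  # 20 5
```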
For the keyword arguments, we have two main ones depending on the input (file or sequences) we want to pass.
Keyword args:
seq_file
: string representing the path to the file comprising the sequences/segments. For this option, we have to specify additionally the following keyword arguments: data_parser_options
: a dictionary defining the arguments of the read_file(args)
method in the DataFileParser
class (see sequence_and_input_structure tutorial for more info about this class).num_seqs
: the maximum number of sequences to read/parse from a file. By default it is equal to -1
, which will read the whole fileparser_options = {'header': header argument inread_file(args)
'y_ref': y_ref argument inread_file(args)
'column_sep: column_sep argument inread_file(args)
'seg_other_symbol': the seg_other_symbol argument inread_file(args)
} workflow.seq_parsing_workflow(split_options, seq_file=path, data_parser_options=parser_options, num_seqs=-1)
seqs
: list of sequences that are instances of SequenceStruct
classWe start by initializing an instance of the GenericTrainingWorkflow
class that uses the following setup:
GenericAttributeExtractor
with attribute description dictionary equal to
{'w':{'description': 'word observation track', 'encoding':'categorical'}, 'pos':{'description':'part of speech track', 'encoding':'categorical'} }
w
and pos
we generate the same feature template (template_XY
) that guides/instructs feature extraction by:
template_Y
), we request to use the current label, previous and current label at each position in the sequenceFOFeatureExtractor
)FirstOrderCRFModelRepresentation
)FirstOrderCRF
Then, we show the two approaches for using the `seq_parsing_workflow(args)` method. The first approach uses the `seqs` keyword argument, while the second uses the `seq_file` keyword argument.
An example of the return value data_split
for the different options we specified:
'method':'none' -- all training data is used

data_split: {0: {'train': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]}}

'method':'cross_validation' -- 5 folds

data_split: {0: {'train': [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], 'test': [1, 2, 3, 4, 5]}, 1: {'train': [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], 'test': [6, 7, 8, 9, 10]}, 2: {'train': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], 'test': [11, 12, 13, 14, 15]}, 3: {'train': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25], 'test': [16, 17, 18, 19, 20]}, 4: {'train': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 'test': [21, 22, 23, 24, 25]} }

'method':'random' -- 3 splits with train set size 80%

data_split: 'random' {0: {'train': [22, 3, 6, 24, 9, 13, 1, 8, 25, 16, 15, 14, 20, 2, 4, 10, 7, 19, 17, 11], 'test': [18, 21, 12, 5, 23]}, 1: {'train': [23, 12, 11, 17, 19, 9, 8, 24, 20, 10, 22, 18, 21, 25, 1, 4, 3, 7, 14, 6], 'test': [16, 2, 13, 5, 15]}, 2: {'train': [16, 9, 21, 17, 5, 18, 25, 20, 23, 13, 12, 24, 3, 14, 8, 19, 10, 22, 11, 7], 'test': [1, 2, 4, 6, 15]} }

'method':'wsample' -- with train set size 80%

data_split: 'wsample' {0: {'train': [7, 11, 17, 16, 14, 5, 2, 3, 4, 22, 1, 15, 12, 13, 20, 25, 24], 'test': [19, 10, 18, 6, 23, 8, 21, 9]} }
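The cross-validation split shown above could be reproduced conceptually with a sketch like this (a hypothetical helper, not PySeqLab internals):

```python
# Hypothetical helper sketching a k-fold cross-validation split over
# sequence ids, mirroring the data_split layout: each fold holds a
# contiguous block of ids as 'test' and the remaining ids as 'train'.
def kfold_splits(seq_ids, k):
    fold_size = len(seq_ids) // k
    splits = {}
    for i in range(k):
        test = seq_ids[i * fold_size:(i + 1) * fold_size]
        train = [s for s in seq_ids if s not in test]
        splits[i] = {'train': train, 'test': test}
    return splits

splits = kfold_splits(list(range(1, 26)), 5)
print(splits[0]['test'])  # [1, 2, 3, 4, 5]
```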
At this point, the sequences have been processed, dumped to disk, and assigned unique ids. The parsed sequences are organized in a folder/directory structure: a `reference_corpus` folder suffixed with a date/time string is created under the `root` directory that we specified in the `GenericTrainingWorkflow` constructor. Under this folder, the process generates a folder for every sequence inside the `global_features` folder. Additionally, a log file named `log.txt` is added. It contains the logs generated during the parsing process and info about the total number of features created once we build a CRFs model.
working_dir
│   ├── reference_corpus_2017_5_17-8_56_49_631884
│   │   ├── data_split
│   │   ├── global_features
│   │   │   ├── log.txt
│   │   │   ├── seq_1
│   │   │   ├── seq_10
│   │   │   ├── seq_11
│   │   │   ├── seq_12
│   │   │   ├── seq_13
│   │   │   ├── seq_14
│   │   │   ├── seq_15
│   │   │   ├── seq_16
│   │   │   ├── seq_17
│   │   │   ├── seq_18
│   │   │   ├── seq_19
│   │   │   ├── seq_2
│   │   │   ├── seq_20
│   │   │   ├── seq_21
│   │   │   ├── seq_22
│   │   │   ├── seq_23
│   │   │   ├── seq_24
│   │   │   ├── seq_25
│   │   │   ├── seq_3
│   │   │   ├── seq_4
│   │   │   ├── seq_5
│   │   │   ├── seq_6
│   │   │   ├── seq_7
│   │   │   ├── seq_8
│   │   │   ├── seq_9
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,6)
from pyseqlab.workflow import GenericTrainingWorkflow
from pyseqlab.features_extraction import FOFeatureExtractor
from pyseqlab.attributes_extraction import GenericAttributeExtractor
from pyseqlab.utilities import TemplateGenerator
from pyseqlab.fo_crf import FirstOrderCRF, FirstOrderCRFModelRepresentation
from pyseqlab.utilities import ReaderWriter
# attribute description for both w and pos attribute tracks
attr_desc = {'w':{'description': 'word observation track',
'encoding':'categorical'
},
'pos':{'description':'part of speech track',
'encoding':'categorical'}
}
# initialize the attribute extractor instance
generic_attr_extractor = GenericAttributeExtractor(attr_desc)
print("attr_desc {}".format(generic_attr_extractor.attr_desc))
print()
# use same template for both tracks w and pos
template_XY = {}
template_gen = TemplateGenerator()
for track_name in ('w', 'pos'):
template_gen.generate_template_XY(track_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
template_Y = template_gen.generate_template_Y('1-state:2-states')
print("template_XY: ")
print(template_XY)
print()
print("template_Y:")
print(template_Y)
print()
# initialize first order feature extractor instance
fo_fe = FOFeatureExtractor(template_XY, template_Y, attr_desc, start_state=False)
# no feature filter
fe_filter = None
working_dir = tutorials_dir
workflow = GenericTrainingWorkflow(generic_attr_extractor, fo_fe, fe_filter,
FirstOrderCRFModelRepresentation, FirstOrderCRF,
working_dir)
print("Using seqs keyword argument option")
print()
data_split_options = [{'method':'none'}, # no splitting -- use all data
{'method':'cross_validation', 'k_fold':5}, # cross_validation 5-fold,
{'method':'random', 'num_splits':3, 'trainset_size':80}, # 3 random splits with train set 80%
{'method':'wsample', 'trainset_size':80}# weighted sample by sequence length
]
for split_option in data_split_options:
data_split = workflow.seq_parsing_workflow(split_option,
seqs=seqs[:25],
full_parsing = True)
print("data_split: ", split_option['method'])
print(data_split)
print()
print("-"*50)
print("Using seq_file keyword argument option")
print()
# get the path to CoNLL 2000 training file
train_file = os.path.join(dataset_dir, 'train.txt')
# parser options to read train_file
parser_options = {'header': 'main', # 'main' means the header is found in the first line of the file
                  'y_ref': True, # whether the label to predict is already present in the file
                  'column_sep': " ", # separator between the words/observations
                  'seg_other_symbol': None # None means sequences are not segmented
                  }
# parse only 25 sequences for comparison to our previous approach
num_seqs = 25
# use all passed data as training data
data_split_options = [{'method':'none'}, # no splitting -- use all data
{'method':'cross_validation', 'k_fold':5}, # cross_validation 5-fold,
{'method':'random', 'num_splits':3, 'trainset_size':80}, # 3 random splits with train set 80%
{'method':'wsample', 'trainset_size':80}# weighted sample by sequence length
]
for split_option in data_split_options:
data_split = workflow.seq_parsing_workflow(split_option,
seq_file=train_file,
data_parser_options=parser_options,
num_seqs=num_seqs,
full_parsing = True)
print("data_split: ", split_option['method'])
print(data_split)
print()
print("-"*50)
print()
After we parsed and generated a data_split
(i.e. decided the sequences to be used for training/building a CRFs model), we can proceed to build our first CRFs model. Using build_crf_model(args)
method, we can generate our CRFs model that will be an instance of FirstOrderCRF
because it was the class we passed to the GenericTrainingWorkflow
constructor. The build_crf_model(args)
takes the following main arguments:
- `seqs_id`: list of sequence ids referring to the sequences to be used for building the model. These ids are generated when the sequences are parsed and prepared; they are found in the `data_split` variable.
- `folder_name`: string representing the folder name that will be created under the `reference_corpus_date/time` directory. It will hold additional processed sequence info generated by the model building process.

The method returns a CRFs model (see code) whose parameters (i.e. the weights of the features in the model) we will later train. Moreover, if we inspect the `reference_corpus_date/time` directory, we will notice a new folder with a suffix equal to the `folder_name` we specified. More importantly, if we check `log.txt` in the `global_features` folder, we will see the generated log reporting the number of features created.
Tree path/directory of reference corpus:
working_dir
│   ├── reference_corpus_2017_5_17-8_56_49_631884
│   │   ├── data_split
│   │   ├── global_features
│   │   │   ├── log.txt
│   │   │   ├── seq_1
│   │   │   ├── seq_10
│   │   │   ├── seq_11
│   │   │   ├── seq_12
│   │   │   ├── seq_13
│   │   │   ├── seq_14
│   │   │   ├── seq_15
│   │   │   ├── seq_16
│   │   │   ├── seq_17
│   │   │   ├── seq_18
│   │   │   ├── seq_19
│   │   │   ├── seq_2
│   │   │   ├── seq_20
│   │   │   ├── seq_21
│   │   │   ├── seq_22
│   │   │   ├── seq_23
│   │   │   ├── seq_24
│   │   │   ├── seq_25
│   │   │   ├── seq_3
│   │   │   ├── seq_4
│   │   │   ├── seq_5
│   │   │   ├── seq_6
│   │   │   ├── seq_7
│   │   │   ├── seq_8
│   │   │   ├── seq_9
│   │   ├── model_activefeatures_f_0
Inspecting `log.txt` under the `global_features` folder, we get the following excerpt:
---Preparing/parsing sequences---
starting time: 2017-05-17 08:56:49.635522
Number of sequences prepared/parsed: 25
---Preparing/parsing sequences---
end time: 2017-05-17 08:56:49.691309
---Generating Global Features F_j(X,Y)---
starting time: 2017-05-17 08:56:49.703424
Number of instances/training data processed: 25
---Generating Global Features F_j(X,Y)---
end time: 2017-05-17 08:56:49.901177
---Constructing model---
starting time: 2017-05-17 08:56:49.912083
Number of instances/training data processed: 25
Number of features: 3318
Number of labels: 13
---Constructing model---
end time: 2017-05-17 08:57:05.827288
To verify and see these features, we can inspect them by checking the output of this code snippet below.
# use all passed data as training data -- no splitting
data_split_options = {'method':'none'}
data_split = workflow.seq_parsing_workflow(data_split_options,
seqs=seqs[:25],
full_parsing=True)
print()
print("data_split: 'none' option ")
print(data_split)
print()
# build and return a CRFs model
# folder name will be f_0 as fold 0
crf_m = workflow.build_crf_model(data_split[0]['train'], "f_0")
print()
print("type of built model:")
print(type(crf_m))
print()
print("number of generated features:")
print(len(crf_m.model.modelfeatures_codebook))
print("features:")
print(crf_m.model.modelfeatures)
After building our CRFs model, we move to the parameter training/estimation phase. In this stage, we learn the parameters of the CRFs model; by the end of training, the model will be able to decode sequences (i.e. assign labels/tags using the observation/attribute tracks) with good performance.
To do so, we use the `train_model(args)` method in the `GenericTrainingWorkflow` class. The method takes the following arguments:

- `seqs_id`: list of sequence ids referring to the sequences to be used for training the model. These ids refer to the same sequences we used for building the model; they are found in the `data_split` variable.
- `crf_model`: instance of a CRFs model, such as the one returned by the `build_crf_model(args)` method we used earlier (see this section).
- `optimization_options`: dictionary specifying the training method and its corresponding options, discussed in detail below.

PySeqLab supports multiple optimization options that are grouped into two categories:
Generally, perceptron training methods are faster than gradient-based ones. However, gradient-based methods yield solutions (parameter estimates) that are closer to the optimal solution. This does not necessarily mean that such solutions result in better performing models, but on average, gradient-based methods generate better estimates and thus better performing models. NB: exceptions do arise, so the choice of category is problem specific/dependent.
The available training options are described in the following table:
| Training option | Perceptron/search based | Gradient based | Reference |
|---|---|---|---|
| `COLLINS-PERCEPTRON` | ✓ | | (Collins, 2002) |
| `SAPO` | ✓ | | (Sun, 2015) |
| `SGA` | | ✓ | (Bottou et al., 2004) |
| `SGA-ADADELTA` | | ✓ | (Zeiler, 2012) |
| `SVRG` | | ✓ | (Johnson et al., 2013) |
| `L-BFGS-B` | | ✓ | (Byrd et al., 1995) |
For each method we can further specify corresponding options to be used while training.
| Training method | Options |
|---|---|
| `COLLINS-PERCEPTRON` | 'method': 'COLLINS-PERCEPTRON', 'num_epochs': integer, 'update_type': {'early', 'max-fast', 'max-exhaustive'}, 'shuffle_seq': boolean, 'beam_size': integer, 'avg_scheme': {'avg_error', 'avg_uniform'}, 'tolerance': float |
| `SAPO` | 'method': 'SAPO', 'num_epochs': integer, 'update_type': 'early', 'shuffle_seq': boolean, 'beam_size': integer, 'topK': integer, 'tolerance': float |
| `SGA` | 'method': 'SGA', 'regularization_type': {'l1', 'l2'}, 'regularization_value': float, 'num_epochs': integer, 'tolerance': float, 'learning_rate_schedule': {"bottu", "exponential_decay", "t_inverse", "constant"}, 't0': float, 'alpha': float, 'eta0': float |
| `SGA-ADADELTA` | 'method': 'SGA-ADADELTA', 'regularization_type': {'l1', 'l2'}, 'regularization_value': float, 'num_epochs': integer, 'tolerance': float, 'rho': float, 'epsilon': float |
| `SVRG` | 'method': 'SVRG', 'regularization_type': {'l1', 'l2'}, 'regularization_value': float, 'num_epochs': integer, 'tolerance': float, 'learning_rate_schedule': {"bottu", "exponential_decay", "t_inverse", "constant"}, 't0': float, 'alpha': float, 'eta0': float |
| `L-BFGS-B` | 'method': {'L-BFGS-B', 'BFGS'}, 'regularization_type': 'l2', 'regularization_value': float, and (equivalent to the ones provided by the scipy.optimize package) 'disp': boolean, 'maxcor': integer, 'ftol': float, 'gtol': float, 'eps': float, 'maxls': integer, 'maxiter': integer, 'maxfun': integer |
General comments on the options provided to perceptron/search-based methods:

- `num_epochs` refers to the number of epochs/passes to go through the sequences/segments.
- `shuffle_seq` decides whether to shuffle the sequences in every epoch.
- `tolerance` represents the threshold used as a stopping criterion. If the relative difference between the average decoding error across all sequences in the current and previous epoch falls below this threshold, training stops.
- `update_type` and `beam_size` are related options. PySeqLab supports inexact search using beam search (i.e. pruning states falling off a specified beam size), allowing faster inference and training. This is implemented within the violation-fixing framework (see (Huang et al., 2012) for more details). Hence, if we want to prune states (i.e. use beam search) while learning the parameters, we specify the `update_type` we want based on the violation-fixing framework. If we use a full beam (i.e. no pruning), the `update_type` is irrelevant as we will be doing a full update.
- `topK` is an option available in the `SAPO` method. It specifies how many decoded sequences to use while learning the parameters of the model. For more info on the procedure, see (Sun, 2015).

Lastly, the learned parameters are the average values across the training updates in all epochs. This provides a form of regularization and as a result reduces overfitting. As a reminder, many of these options are set by default; we only need to specify the `method` and the rest is figured out automatically.
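The parameter-averaging idea can be illustrated with a toy sketch (illustrative only, not PySeqLab's actual structured-perceptron implementation; the data and features are hypothetical): the returned weights are the average over all update steps, which acts as a form of regularization.

```python
# Toy averaged perceptron over (feature set, label) examples.
def averaged_perceptron(data, num_epochs=10):
    """data: list of (feature set, gold label) pairs."""
    labels = sorted({y for _, y in data})
    weights = {}   # (feature, label) -> current weight
    totals = {}    # (feature, label) -> running sum for averaging
    n_steps = 0
    def score(feats, y):
        return sum(weights.get((f, y), 0.0) for f in feats)
    for _ in range(num_epochs):
        for feats, gold in data:
            n_steps += 1
            pred = max(labels, key=lambda y: score(feats, y))
            if pred != gold:
                # standard perceptron update: promote gold, demote prediction
                for f in feats:
                    weights[(f, gold)] = weights.get((f, gold), 0.0) + 1.0
                    weights[(f, pred)] = weights.get((f, pred), 0.0) - 1.0
            # accumulate current weights for the final average
            for k, v in weights.items():
                totals[k] = totals.get(k, 0.0) + v
    return {k: v / n_steps for k, v in totals.items()}

# hypothetical toy data: word-identity features with POS-like labels
data = [({'w=the'}, 'DT'), ({'w=dog'}, 'NN'), ({'w=the'}, 'DT')]
avg_w = averaged_perceptron(data)
```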
General comments on the options provided to gradient-based methods:

- `num_epochs` refers to the number of epochs/passes to go through the sequences/segments.
- `shuffle_seq` decides whether to shuffle the sequences in every epoch.
- `tolerance` represents the threshold used as a stopping criterion. If the relative difference between the estimated average log-likelihood across all sequences in the current and previous epoch falls below this threshold, training stops.
- `regularization_type` and `regularization_value` are related options. PySeqLab supports maximum likelihood (MLE) and maximum a posteriori (MAP) estimation by offering two regularization schemes: (1) L2 regularization (i.e. assuming a Gaussian prior on the model weights) and (2) L1 regularization, using the approach in (Tsuruoka et al., 2009). Hence, once we specify the type of regularization (`regularization_type`), we specify the regularization value (`regularization_value`).
- The `learning_rate_schedule` option for the `SGA` and `SVRG` methods specifies how the step size multiplying the gradient is computed. `constant` uses a fixed step size, while `bottu`, `exponential_decay` and `t_inverse` update the step size iteratively, each using a different formula. All approaches require specifying the initial step size `t0`. Additionally, the `exponential_decay` option requires specifying `alpha` and `eta0` (see (Tsuruoka et al., 2009) for more info about the `exponential_decay` approach).
- The `rho` and `epsilon` options belong to the `SGA-ADADELTA` method. In this method, the step size is an adaptive vector where each component represents a step size for the corresponding parameter in the CRFs model. For more info regarding both options (`rho` and `epsilon`) and their role in the computation of the adaptive step size, consult ADADELTA (Zeiler, 2012).

Again as a reminder, many of these options are set by default; we only need to specify the `method` and the rest is figured out automatically.
Now that we have learned about the available training methods, we proceed to train our constructed CRFs model. We will first use the gradient-based training methods {`SGA`, `SGA-ADADELTA`, `SVRG`}. We start with the parameter weights initialized to 0; once we run the `train_model(args)` method in the `GenericTrainingWorkflow` class, the optimization and parameter estimation process starts. At the end of the process, we get new estimates/values for the weights and the trained model is dumped to disk. Traversing the working directory (i.e. `working_dir`) we specified earlier, we will find a newly created folder called `models`. In this folder, each time we train a model we get a separate folder named with a generated date/time string. This folder comprises a `model_parts` folder and a log file named `crf_training_log.txt`. The `model_parts` folder holds the saved parts of the trained model, whereas the log file contains a log of the training process (such as the time of finishing an epoch, the estimated average log-likelihood, etc.). Additionally, if the training method is one of {`SGA`, `SGA-ADADELTA`, `SVRG`}, we get an additional file named `avg_loglikelihood_training`. This file is a `numpy` array containing the estimated average log-likelihood for each epoch. We can plot this array for diagnostics (generally, it should be a growing curve, as we are performing gradient ascent). In case we use the `L-BFGS-B` method, we have gradient descent instead, as we are optimizing the negative log-likelihood (see the scipy.optimize package for more info). The tree path visualization of the working directory is provided below:
working_dir
│   ├── reference_corpus_2017_5_17-8_56_49_631884
│   │   ├── data_split
│   │   ├── global_features
│   │   │   ├── log.txt
│   │   │   ├── seq_1
│   │   │   ├── seq_10
│   │   │   ├── seq_11
│   │   │   ├── seq_12
│   │   │   ├── seq_13
│   │   │   ├── seq_14
│   │   │   ├── seq_15
│   │   │   ├── seq_16
│   │   │   ├── seq_17
│   │   │   ├── seq_18
│   │   │   ├── seq_19
│   │   │   ├── seq_2
│   │   │   ├── seq_20
│   │   │   ├── seq_21
│   │   │   ├── seq_22
│   │   │   ├── seq_23
│   │   │   ├── seq_24
│   │   │   ├── seq_25
│   │   │   ├── seq_3
│   │   │   ├── seq_4
│   │   │   ├── seq_5
│   │   │   ├── seq_6
│   │   │   ├── seq_7
│   │   │   ├── seq_8
│   │   │   ├── seq_9
│   │   ├── model_activefeatures_f_0
│   ├── models
│   │   ├── 2017_5_17-14_11_15_259071
│   │   │   ├── model_parts
│   │   │   ├── avg_loglikelihood_training
│   │   │   ├── crf_training_log.txt
Below is an example of training a CRFs model using the {`SGA`, `SGA-ADADELTA`, `SVRG`} methods. The training will yield three models, each dumped to disk. For every model, we get a pointer to its `model_parts` folder that can be used for reviving and using the model later. We also plot the estimated average log-likelihood of each training method. Another training example, which uses the `L-BFGS-B` method, is provided below. The training method displayed a success message (i.e. an optimal solution was found). As a result, we have trained four models in total using gradient-based training methods.
optimization_options = {"method": "",
                        "regularization_type": "l2",
                        "regularization_value": 0,
                        }
# train models using SGA, SGA-ADADELTA, and SVRG methods
trained_models_dir = {}
for fold in data_split:
    train_seqs_id = data_split[fold]['train']
    for method in ("SGA-ADADELTA", "SGA", "SVRG"):
        optimization_options['method'] = method
        if(method in {'SGA', 'SGA-ADADELTA'}):
            num_epochs = 10
        else:
            num_epochs = 2
        optimization_options['num_epochs'] = num_epochs
        print("training using optimization options:")
        print(optimization_options)
        # make sure we are initializing the weights to be 0
        crf_m.weights.fill(0)
        model_dir = workflow.train_model(train_seqs_id, crf_m, optimization_options)
        print("*"*50)
        avg_ll = ReaderWriter.read_data(os.path.join(model_dir, 'avg_loglikelihood_training'))
        plt.plot(avg_ll[1:], label="method:{}, {}:{}".format(optimization_options['method'],
                                                             optimization_options['regularization_type'],
                                                             optimization_options['regularization_value']))
        trained_models_dir[method] = model_dir
plt.legend(loc='lower right')
plt.xlabel('number of epochs')
plt.ylabel('estimated average loglikelihood')
eval_options = {'model_eval': True,
                'metric': 'f1',
                'seqs_info': workflow.seqs_info}
# use L-BFGS-B method for training
optimization_options = {"method": "L-BFGS-B",
                        "regularization_type": "l2",
                        "regularization_value": 0,
                        }
# start with 0 weights
crf_m.weights.fill(0)
model_dir = workflow.train_model(train_seqs_id, crf_m, optimization_options)
trained_models_dir['L-BFGS-B'] = model_dir
print("*"*50)
For the perceptron-based training methods, we use (COLLINS-PERCEPTRON, SAPO) for training CRFs models (see code). We specify the training to run for 10 epochs with full beam size (i.e. no pruning), and topK=5 (the number of decoded sequences to use) for the SAPO method. We also plot the estimated average decoding error for every epoch. The decoding error decreases, indicating that learning is proceeding well. The estimated average decoding error is dumped on disk in a file named avg_decodingerror_training, the equivalent of the avg_loglikelihood_training file for the gradient-based methods. Another example using perceptron-based training is provided in this section. It uses beam search (i.e. pruning) with beam_size=5 (keeping at most 5 labels at every position) and update_type=early while decoding the given sequences. Again, consult the training options table to explore further training options.
# full beam -- using all states/labels detected in the training data
optimization_options = {"method": "",
                        "num_epochs": 10,
                        }
topK = 1
# train models using COLLINS-PERCEPTRON, SAPO
for fold in data_split:
    train_seqs_id = data_split[fold]['train']
    for method in ("COLLINS-PERCEPTRON", "SAPO"):
        optimization_options['method'] = method
        if(method == 'SAPO'):
            topK = 5
        optimization_options['topK'] = topK
        print("training using optimization options:")
        print(optimization_options)
        # make sure we are initializing the weights to be 0
        crf_m.weights.fill(0)
        model_dir = workflow.train_model(train_seqs_id, crf_m, optimization_options)
        print("*"*50)
        avg_ll = ReaderWriter.read_data(os.path.join(model_dir, 'avg_decodingerror_training'))
        plt.plot(avg_ll[1:], label="method:{}, topK:{}".format(optimization_options['method'], topK))
        trained_models_dir[method] = model_dir
plt.legend(loc='upper right')
plt.xlabel('number of epochs')
plt.ylabel('estimated decoding error')
# using beam search with 5 states and early update
optimization_options = {"method": "",
                        "num_epochs": 10,
                        "beam_size": 5,
                        "update_type": 'early'
                        }
topK = 1
# train models using COLLINS-PERCEPTRON, SAPO
for fold in data_split:
    train_seqs_id = data_split[fold]['train']
    for method in ("COLLINS-PERCEPTRON", "SAPO"):
        optimization_options['method'] = method
        if(method == 'SAPO'):
            topK = 5
        optimization_options['topK'] = topK
        print("training using optimization options:")
        print(optimization_options)
        # make sure we are initializing the weights to be 0
        crf_m.weights.fill(0)
        model_dir = workflow.train_model(train_seqs_id, crf_m, optimization_options)
        print("*"*50)
        avg_ll = ReaderWriter.read_data(os.path.join(model_dir, 'avg_decodingerror_training'))
        plt.plot(avg_ll[1:], label="method:{}, topK:{}, beam_size:5".format(optimization_options['method'], topK))
        trained_models_dir[method] = model_dir
plt.legend(loc='upper right')
plt.xlabel('number of epochs')
plt.ylabel('estimated decoding error')
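The beam pruning used in the run above can be illustrated with a small standalone sketch. This is not PySeqLab's internal implementation, and the labels and scores below are made up; it only shows the core idea that, at every position, only the beam_size highest-scoring labels survive.

```python
import numpy as np

# Illustrative sketch (not PySeqLab internals) of beam pruning during
# decoding: at each position, only the `beam_size` highest-scoring
# labels are kept; the rest are discarded before moving on.
def prune_to_beam(scores, labels, beam_size):
    """Return the labels kept at one position under the given beam size."""
    order = np.argsort(scores)[::-1]          # indices sorted best-first
    return [labels[i] for i in order[:beam_size]]

labels = ['B-NP', 'I-NP', 'B-VP', 'I-VP', 'B-PP', 'I-PP', 'O']
scores = np.array([2.1, 0.3, 1.7, -0.5, 0.9, -1.2, 1.1])

# full beam: all 7 labels survive (no pruning)
print(prune_to_beam(scores, labels, beam_size=len(labels)))
# beam_size=5: the two lowest-scoring labels are discarded
print(prune_to_beam(scores, labels, beam_size=5))
# → ['B-NP', 'B-VP', 'O', 'B-PP', 'I-NP']
```

With beam_size=5, 'I-VP' and 'I-PP' (the two lowest scores) never reach the next position, which is exactly the pruning behavior the beam_size training option controls.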
The final part of the workflow is to evaluate the trained models on test/validation sequences (i.e. sequences that we did not use for building and training the CRFs model). In our previous setup, we used all the sequences for training. We can still test the performance of the trained models on those sequences; although we are overfitting here, it is still worth comparing the different training methods using a performance measure.
The method use_model(args) in the GenericTrainingWorkflow class is used for:
A) reviving a trained model to decode sequences,
B) writing the decoded sequences to a file, and/or
C) evaluating the decoding performance based on a specified metric.
The arguments for the use_model(args)
method are:
Args:
    savedmodel_dir: the path to the trained model dumped on disk. In our setting/example, trained_models_dir holds the directories of the models we trained.
    options: dictionary that specifies the options needed to perform tasks (A, B and C). Generally, we have two scenarios:
        (1) using the seqs_info key, passing the seqs_info instance attribute of our workflow trainer (see code example):
            options = {'seqs_info': dictionary comprising the processed info of the sequences saved on disk,
                       'seqbatch_size': integer, number of sequences in a batch to process,
                       'model_eval': boolean, deciding whether to evaluate performance,
                       'metric': one of {'f1', 'precision', 'recall', 'accuracy'},
                       'exclude_states': list of labels to exclude from the decoding evaluation,
                       'file_name': string, name of the file the decoded sequences will be written to,
                       'sep': string, separator between the columns (i.e. the tracks, reference label and predicted label) when written to file,
                       'beam_size': integer, the number of states to keep while decoding with beam search
                      }
        (2) using the seq_file key, passing the path to the file containing the sequences to decode (see code example):
            options = {'seq_file': path to the sequences file,
                       'data_parser_options': dictionary stating the options to be used by the parser for reading the seq_file (see here for more info),
                       'num_seqs': integer, specifying the maximum number of sequences to read from the given file,
                       'seqbatch_size': integer, number of sequences in a batch to process,
                       'model_eval': boolean, deciding whether to evaluate performance,
                       'metric': one of {'f1', 'precision', 'recall', 'accuracy'},
                       'exclude_states': list of labels to exclude from the decoding evaluation,
                       'file_name': string, name of the file the decoded sequences will be written to,
                       'sep': string, separator between the columns (i.e. the tracks, reference label and predicted label) when written to file,
                       'beam_size': integer, the number of states to keep while decoding with beam search
                      }
# case of using seqs_info
# use_options used:
#   seqs_info -- since we already parsed and processed the sequences
#   model_eval = True -- we need to evaluate model performance
#   metric = f1 -- F1 score
# the other keys are not specified; hence, by default, the decoded sequences will not be written to a file
# using full beam
use_options = {'seqs_info': workflow.seqs_info,
               'model_eval': True,
               'metric': 'f1'
               }
print("Using seqs_info option")
for method, model_dir in trained_models_dir.items():
    perf = workflow.use_model(model_dir, use_options)
    print("model trained using {} method".format(method))
    metric_chosen = use_options['metric']
    print("metric:{}, value:{}".format(metric_chosen, perf[metric_chosen]))
    print()
train_file = os.path.join(dataset_dir, 'train.txt')
# case of using seq_file
# use_options used:
#   seq_file -- path to the CoNLL train.txt file
#   data_parser_options -- dictionary of options to parse the train.txt file
#   model_eval = True -- we need to evaluate model performance
#   metric = f1 -- F1 score
# the other keys are not specified; hence, by default, the decoded sequences will not be written to a file
# using full beam
parser_options = {'header': 'main',  # main means the header is found in the first line of the file
                  'y_ref': True,  # y_ref is a boolean indicating if the label to predict is already found in the file
                  'column_sep': " ",  # separator between the columns/observations
                  'seg_other_symbol': None
                  }
use_options = {'seq_file': train_file,
               'data_parser_options': parser_options,
               'num_seqs': 25,
               'model_eval': True,
               'metric': 'f1'
               }
print("Using seq_file option")
for method, model_dir in trained_models_dir.items():
    perf = workflow.use_model(model_dir, use_options)
    print("model trained using {} method".format(method))
    metric_chosen = use_options['metric']
    print("metric:{}, value:{}".format(metric_chosen, perf[metric_chosen]))
    print()
A direct approach for using a trained CRFs model, without the GenericTrainingWorkflow class/subclass, is through the generate_trained_model(args) function found in the utilities module. The function takes the following arguments:
Args:
    model_parts_dir: directory of the model_parts folder of a trained CRFs model (see here for a refresher)
    aextractor_obj: initialized instance of a class/subclass of GenericAttributeExtractor
Below is an example showing how to revive a trained model (the model was trained using the COLLINS-PERCEPTRON method). Once we have the revived model, it is ready for decoding sequences. For that purpose, we use the main decoding method decode_seqs(decoding_method, out_dir, **kwargs) (see code for a demo).
Args:
    decoding_method: string defining the decoding method. It is one of {'viterbi', 'per_state_decoding'}. As not all decoders support 'per_state_decoding', we primarily use 'viterbi' for now.
    out_dir: string representing the path to the output directory where sequence parsing will take place
Keyword arguments:
    The main ones to specify are {'seqs', 'seqs_dict', 'seqs_info'}:
    seqs: list of sequences that are instances of the SequenceStruct class
    seqs_dict: a dictionary mapping sequence ids (keys) to sequences that are instances of the SequenceStruct class (values)
    seqs_info: dictionary containing the info about already parsed and processed sequences
    file_name: string representing the name of the file to which the CRFs model writes the decoded sequences (optional)
    sep: string representing the separator between the columns/observations when writing the decoded sequences to the file specified by the file_name keyword argument
Now that we know how to revive a CRFs model and use it for decoding sequences, we need to evaluate how good our decoding is. This is tackled using the SeqDecodingEvaluator class. Its constructor takes an instance of a CRFs model representation class (see this table for a refresher). Below is an example demonstrating the use of the SeqDecodingEvaluator class for evaluating the decoded sequences. In addition, it shows how to obtain/show the confusion matrix for every decoded label/state.
#print(trained_models_dir)
from pyseqlab.utilities import generate_trained_model
# revive a model that was trained using COLLINS-PERCEPTRON method
model_parts_dir = os.path.join(trained_models_dir['COLLINS-PERCEPTRON'], 'model_parts')
crf_percep = generate_trained_model(model_parts_dir, generic_attr_extractor)
print(crf_percep)
print()
# use viterbi for decoding
decoding_method = 'viterbi'
# use the same directory of the model
output_dir = trained_models_dir['COLLINS-PERCEPTRON']
# use tab as separator
sep = "\t"
decoded_sequences = crf_percep.decode_seqs(decoding_method, output_dir, seqs=seqs[:25],
                                           file_name='tutorial_seqs_decoding.txt', sep=sep)
print()
# display the decoded sequences
for seq_id in decoded_sequences:
    print("seq_id ", seq_id)
    print("predicted labels:")
    print(decoded_sequences[seq_id]['Y_pred'])
    print("reference labels:")
    print(decoded_sequences[seq_id]['seq'].flat_y)
    print("-"*50)
from pyseqlab.crf_learning import SeqDecodingEvaluator
# initialize an evaluator
evaluator = SeqDecodingEvaluator(crf_percep.model)
# evaluate performance
Y_seqs_dict = GenericTrainingWorkflow.map_pred_to_ref_seqs(decoded_sequences)
taglevel_perf = evaluator.compute_states_confmatrix(Y_seqs_dict)
# compute performance under each supported metric
for metric in ("f1", "precision", "recall", "accuracy"):
    perf = evaluator.get_performance_metric(taglevel_perf, metric, exclude_states=[])
    print("{}: {}".format(metric, perf))
print()
# demonstrate confusion matrix per label/state
for state, code in crf_percep.model.Y_codebook.items():
    print("confusion_matrix for state={}: ".format(state))
    print(taglevel_perf[code])
    print("-"*40)
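To make the connection between a per-state confusion matrix and the metrics above concrete, here is a small illustrative sketch. The counts and the helper function are hypothetical (they are not taken from the model above and not part of PySeqLab); they only show how precision, recall, and F1 follow from one-vs-rest true-positive (tp), false-positive (fp), and false-negative (fn) counts.

```python
# Illustrative only: derive precision/recall/F1 from the one-vs-rest
# confusion counts of a single state (hypothetical numbers below).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# hypothetical counts for one state, e.g. a 'B-NP'-like label
tp, fp, fn = 80, 20, 10
p, r, f1 = precision_recall_f1(tp, fp, fn)
print("precision={:.3f}, recall={:.3f}, f1={:.3f}".format(p, r, f1))
# → precision=0.800, recall=0.889, f1=0.842
```

Averaging such per-state values over all states (optionally skipping entries in exclude_states) yields aggregate figures comparable to those reported by get_performance_metric.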
There are still many aspects to explore and experiment with. We showcase the potential of the package by developing applications in three different areas:
Each application has its own repository, including notebook tutorials showing the:
Bottou, L., & Le Cun, Y. (2004). Large Scale Online Learning. Advances in Neural Information Processing Systems, 16, 217–225.
Collins, M. (2002). Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing - EMNLP ’02 (pp. 1–8). doi:10.3115/1118693.1118694
Cuong, V. N., Ye, N., Lee, W. S., & Chieu, H. L. (2014). Conditional Random Field with High-order Dependencies for Sequence Labeling and Segmentation. Journal of Machine Learning Research, 15, 981–1009.
Huang, L., Fayong, S., & Guo, Y. (2012). Structured perceptron with inexact search. 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 142–151. Retrieved from http://dl.acm.org/citation.cfm?id=2382029.2382049
Johnson, R., & Zhang, T. (2013). Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. Advances in Neural Information Processing Systems 26, 1(3), 315–323.
Kim, J.-D., Ohta, T., Tsuruoka, Y., Tateisi, Y., & Collier, N. (2004). Introduction to the Bio-entity Recognition Task at JNLPBA. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, 70–75. doi:10.3115/1567594.1567610
Lafferty, J., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML ’01 Proceedings of the Eighteenth International Conference on Machine Learning, 8(June), 282–289. doi:10.1038/nprot.2006.61
Sarawagi, S., & Cohen, W. W. (2005). Semi-Markov Conditional Random Fields for Information Extraction. Advances in Neural Information Processing Systems, 1185–1192. doi:10.1.1.128.3524
Soong, F. K., & Huang, E.-F. (1990). A tree-trellis based fast search for finding the N Best sentence hypotheses in continuous speech recognition. Proceedings of the Workshop on Speech and Natural Language - HLT ’90, 12–19. doi:10.3115/116580.116591
Sun, X. (2015). Towards Shockingly Easy Structured Classification: A Search-based Probabilistic Online Learning Framework. Retrieved from http://arxiv.org/abs/1503.08381
Tsuruoka, Y., Tsujii, J., & Ananiadou, S. (2009). Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 1, 477. doi:10.3115/1687878.1687946
Vieira, T., Cotterell, R., & Eisner, J. (2016). Speed-Accuracy Tradeoffs in Tagging with Variable-Order CRFs and Structured Sparsity. In EMNLP.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260–269. doi:10.1109/TIT.1967.1054010
Ye, N., Lee, W. S., Chieu, H. L., & Wu, D. (2009). Conditional Random Fields with High-Order Features for Sequence Labeling. Neural Information Processing Systems, 2, 2. Retrieved from http://www.comp.nus.edu.sg/~leews/publications/nips09_paper.pdf
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. Retrieved from http://arxiv.org/abs/1212.5701