pyseqlab package

Submodules

pyseqlab.attributes_extraction module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.attributes_extraction.AttributeScaler(scaling_info, method)[source]

Bases: object

attribute scaler class to scale/standardize continuous attributes/features

Parameters:
  • scaling_info – dictionary comprising the relevant info for performing scaling/standardization
  • method – string defining the method of scaling {rescaling, standardization}
scaling_info

dictionary comprising the relevant info for performing scaling/standardization

method

string defining the method of scaling {rescaling, standardization}

Example:

in case of *standardization*:
    - scaling_info has the form: scaling_info[attr_name] = {'mean':value,'sd':value}
in case of *rescaling*
    - scaling_info has the form: scaling_info[attr_name] = {'max':value, 'min':value}
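
For instance, a minimal usage sketch (the attribute name 'seg_numchars' and the values are illustrative; seq is a previously prepared SequenceStruct instance):

from pyseqlab.attributes_extraction import AttributeScaler

scaling_info = {'seg_numchars': {'max': 30, 'min': 1}}
scaler = AttributeScaler(scaling_info, 'rescaling')
scaler.scale_continuous_attributes(seq, [(1,1), (2,2)])
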
save(folder_dir)[source]

save relevant info about the scaler on disk

Parameters:folder_dir – string representing directory where files are pickled/dumped
scale_continuous_attributes(seq, boundaries)[source]

scale continuous attributes of a sequence for a list of boundaries

Parameters:
  • seq – a sequence instance of SequenceStruct
  • boundaries – list of boundaries [(1,1), (2,2), ...]
transform_scale(x, xref_min, xref_max)[source]

transforms a feature value to the [-1,1] scale
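
The transform plausibly follows the standard min-max mapping to the [-1,1] interval; a sketch of the assumed formula (not necessarily the exact implementation):

scaled_x = 2*(x - xref_min)/(xref_max - xref_min) - 1
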

class pyseqlab.attributes_extraction.GenericAttributeExtractor(attr_desc)[source]

Bases: object

Generic attribute extractor class implementing observation functions that generate attributes from tokens/observations

Parameters:attr_desc – dictionary defining the atomic observation/attribute names including the encoding of such attribute (i.e. {continuous, categorical})
attr_desc

dictionary defining the atomic observation/attribute names including the encoding of such attribute (i.e. {continuous, categorical})

seg_attr

dictionary comprising the extracted attributes per each boundary of a sequence

determine_attr_encoding(attr_desc)[source]
generate_attributes(seq, boundaries)[source]
group_attributes()[source]

function to group attributes based on the encoding type (i.e. continuous vs. categorical)

class pyseqlab.attributes_extraction.NERSegmentAttributeExtractor[source]

Bases: pyseqlab.attributes_extraction.GenericAttributeExtractor

class implementing observation functions that generate attributes from word tokens/observations

Parameters:attr_desc – dictionary defining the atomic observation/attribute names including the encoding of such attribute (i.e. {continuous, categorical})
attr_desc

dictionary defining the atomic observation/attribute names including the encoding of such attribute (i.e. {continuous, categorical})

seg_attr

dictionary comprising the extracted attributes per each boundary of a sequence

generate_attributes(seq, boundaries)[source]

generate attributes of the sequence observations in a specified list of boundaries

Parameters:
  • seq – a sequence instance of SequenceStruct
  • boundaries – list of boundaries [(1,1), (2,2),...,]

Note

the generated attributes are saved first in seg_attr and then passed to `seq.seg_attr`. In other words, at the end seg_attr is always cleared

generate_attributes_desc()[source]

define attributes by including description and encoding of each observation or observation feature

get_degenerateshape(boundary)[source]

get degenerate shape of segment

Parameters:boundary – tuple (u,v) that marks beginning and end of a segment
get_num_chars(boundary, filter_out=' ')[source]

get the number of characters of a segment

Parameters:
  • boundary – tuple (u,v) that marks beginning and end of a segment
  • filter_out – string (default is a space) specifying characters to exclude from the count
get_seg_bagofattributes(boundary, attr_names, sep=' ')[source]

implements the bag-of-attributes concept within a segment

Parameters:
  • boundary – tuple (u,v) representing current boundary
  • attr_names – list of names of the atomic observations/attributes
  • sep – separator (by default is the space)

Note

it can be used only with attributes that have binary_encoding type set to True

get_seg_length(boundary)[source]

get the length of a segment

Parameters:boundary – tuple (u,v) that marks beginning and end of a segment
get_shape(boundary)[source]

get shape of segment

Parameters:boundary – tuple (u,v) that marks beginning and end of a segment
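
Example (a hypothetical character-class mapping; the exact classes used by the implementation may differ): if uppercase letters map to 'A', lowercase letters to 'a', and digits to 'D', then get_shape would render the segment 'Yale' as 'Aaaa', while get_degenerateshape would collapse consecutive repeats of the same class symbol, yielding 'Aa'.
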

pyseqlab.crf_learning module

@author: Ahmed Allam <ahmed.allam@yale.edu>

class pyseqlab.crf_learning.Evaluator(model_repr)[source]

Bases: object

Evaluator class to evaluate performance of the models

Parameters:model_repr – the CRF model representation that has a suffix of ModelRepresentation such as HOCRFADModelRepresentation
model_repr

the CRF model representation that has a suffix of ModelRepresentation such as HOCRFADModelRepresentation

Note

this class is EXPERIMENTAL/work in progress and does not support evaluation of segment learning. Use instead SeqDecodingEvaluator for evaluating models learned using sequence learning.

compute_model_performance(Y_seqs_dict, metric, output_file, states_notation)[source]

compute the performance of the model

Parameters:
  • Y_seqs_dict – dictionary where each sequence has the reference label sequence and its corresponding predicted sequence. It has the following form {seq_id:{'Y_ref':[reference_ylabels], 'Y_pred':[predicted_ylabels]}}
  • metric – evaluation metric that could take one of {‘f1’, ‘precision’, ‘recall’, ‘accuracy’}
  • output_file – file where to output the evaluation result
  • states_notation – notation used to code the state (i.e. BIO)
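
For instance, Y_seqs_dict could look as follows (ids and labels are illustrative):

Y_seqs_dict = {1: {'Y_ref': ['B-PER', 'I-PER', 'O'], 'Y_pred': ['B-PER', 'O', 'O']}}
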
compute_tags_confusionmatrix(Y_ref, Y_pred, transformed_codebook_rev, M)[source]

compute confusion matrix on the level of the tag/state

Parameters:
  • Y_ref – list of reference label sequence (represented by the states code)
  • Y_pred – list of predicted label sequence (represented by the states code)
  • transformed_codebook_rev – the reversed transformed codebook of the newly identified states
  • M – number of states
map_states_to_num(Y, state_mapper, transformed_codebook, M)[source]

map states to their code/number using the Y_codebook

Parameters:
  • Y – list representing label sequence
  • state_mapper – mapper between the old states and the new states generated from the transform_codebook() method
  • transformed_codebook – the transformed codebook of the newly identified states
  • M – number of states

Note

tags that did not occur in the training data are assigned one unique index, namely len(Y_codebook)

transform_codebook(Y_codebook, prefixes)[source]

map states coded in BIO notation to their original states value

Parameters:
  • Y_codebook – dictionary of states each assigned a unique integer
  • prefixes – tuple of prefix notation used such as (“B-”,”I-”) for BIO
class pyseqlab.crf_learning.Learner(crf_model)[source]

Bases: object

learner used for training CRF models supporting search- and gradient-based learning methods

Parameters:

crf_model – an instance of CRF models such as HOCRFAD

crf_model

an instance of CRF models such as HOCRFAD

training_description

dictionary that will include the training specification of the model
cleanup()[source]

End of training – cleanup

train_model(w0, seqs_id, optimization_options, working_dir, save_model=True)[source]

the MAIN method for training models using the various options available

Parameters:
  • w0 – numpy vector representing initial weights for the parameters
  • seqs_id – list of integers representing the sequence ids
  • optimization_options – dictionary specifying the training method
  • working_dir – string representing the directory where the model data and generated files will be saved
Keyword Arguments:
 

save_model – boolean specifying whether to save the final model

Example

The available options for training are:
  • SGA for stochastic gradient ascent
  • SGA-ADADELTA for stochastic gradient ascent using ADADELTA approach
  • BFGS or L-BFGS-B for optimization using second order information (hessian matrix)
  • SVRG for stochastic variance reduced gradient method
  • COLLINS-PERCEPTRON for structured perceptron
  • SAPO for the Search-based Probabilistic Online learning algorithm (an adapted version)

For example, possible specifications of the optimization options are:

 1) {'method': 'SGA-ADADELTA',
    'regularization_type': {'l1', 'l2'},
    'regularization_value': float,
    'num_epochs': integer,
    'tolerance': float,
    'rho': float,
    'epsilon': float
   }


2) {'method': 'SGA' or 'SVRG',
    'regularization_type': {'l1', 'l2'},
    'regularization_value': float,
    'num_epochs': integer,
    'tolerance': float,
    'learning_rate_schedule': one of ("bottu", "exponential_decay", "t_inverse", "constant"),
    't0': float,
    'alpha': float,
    'eta0': float
   }


3) {'method': 'L-BFGS-B' or 'BFGS',
    'regularization_type': 'l2',
    'regularization_value': float,
    'disp': False,
    'maxls': 20,
    'iprint': -1,
    'gtol': 1e-05,
    'eps': 1e-08,
    'maxiter': 15000,
    'ftol': 2.220446049250313e-09,
    'maxcor': 10,
    'maxfun': 15000
   }


4) {'method': 'COLLINS-PERCEPTRON',
    'regularization_type': {'l1', 'l2'},
    'regularization_value': float,
    'num_epochs': integer,
    'update_type': {'early', 'max-fast', 'max-exhaustive', 'latest'},
    'shuffle_seq': boolean,
    'beam_size': integer,
    'avg_scheme': {'avg_error', 'avg_uniform'},
    'tolerance': float
   }


5) {'method': 'SAPO',
    'regularization_type': {'l2'},
    'regularization_value': float,
    'num_epochs': integer,
    'update_type': 'early',
    'shuffle_seq': boolean,
    'beam_size': integer,
    'topK': integer,
    'tolerance': float
   }
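
For instance, a minimal training sketch (crf_model, w0, train_seqs_id, and working_dir are assumed to be prepared beforehand):

from pyseqlab.crf_learning import Learner

optimization_options = {'method': 'SGA-ADADELTA',
                        'regularization_type': 'l2',
                        'regularization_value': 0.01,
                        'num_epochs': 5,
                        'tolerance': 1e-06,
                        'rho': 0.9,
                        'epsilon': 1e-06}
learner = Learner(crf_model)  # crf_model: e.g. a HOCRFAD instance
learner.train_model(w0, train_seqs_id, optimization_options, working_dir, save_model=True)
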
class pyseqlab.crf_learning.SeqDecodingEvaluator(model_repr)[source]

Bases: object

Evaluator class to evaluate performance of the models

Parameters:model_repr – the CRF model representation that has a suffix of ModelRepresentation such as HOCRFADModelRepresentation
model_repr

the CRF model representation that has a suffix of ModelRepresentation such as HOCRFADModelRepresentation

Note

this class does not support evaluation of segment learning (i.e. notations that include IOB2/BIO notation)

compute_states_confmatrix(Y_seqs_dict)[source]

compute/generate the confusion matrix for each state

Parameters:Y_seqs_dict – dictionary where each sequence has the reference label sequence and its corresponding predicted sequence. It has the following form {seq_id:{'Y_ref':[reference_ylabels], 'Y_pred':[predicted_ylabels]}}
get_performance_metric(taglevel_performance, metric, exclude_states=[])[source]

compute the performance of the model using a requested metric

Parameters:
  • taglevel_performance – numpy array with Mx2x2 dimension. For every state code a 2x2 confusion matrix is included. It is computed using compute_states_confmatrix()
  • metric – evaluation metric that could take one of {'f1', 'precision', 'recall', 'accuracy'}
Keyword Arguments:
 

exclude_states – list (default empty list) of states to exclude from the computation. Usually, in NER applications the non-entity symbol such as ‘O’ is excluded from the computation. Example: If exclude_states = ['O'], this will replicate the behavior of conlleval script
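
A usage sketch (crf_model is a trained CRF instance; labels are illustrative):

evaluator = SeqDecodingEvaluator(crf_model.model)
Y_seqs_dict = {1: {'Y_ref': ['P', 'O', 'O'], 'Y_pred': ['P', 'P', 'O']}}
taglevel_performance = evaluator.compute_states_confmatrix(Y_seqs_dict)
f1 = evaluator.get_performance_metric(taglevel_performance, 'f1', exclude_states=['O'])
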

map_states_to_num(Y, Y_codebook, M)[source]

map states to their code/number using the Y_codebook

Parameters:
  • Y – list representing label sequence
  • Y_codebook – dictionary containing the states as keys and the assigned unique code as values
  • M – number of states

Note

tags that did not occur in the training data are assigned one unique index, namely len(Y_codebook)

pyseqlab.features_extraction module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.features_extraction.FOFeatureExtractor(templateX, templateY, attr_desc, start_state=True)[source]

Bases: pyseqlab.features_extraction.FeatureExtractor

Feature extractor class for first order CRF models

it supports the addition of start_state and potentially stop_state in a future release

Parameters:
  • templateX – dictionary specifying template to follow for observation features extraction. It has the form: {attr_name: {x_offset:tuple(y_offsets)}} e.g. {'w': {(0,):((0,), (-1,0), (-2,-1,0))}}
  • templateY – dictionary specifying template to follow for y pattern features extraction. It has the form: {Y: tuple(y_offsets)} e.g. {'Y': ((0,), (-1,0), (-2,-1,0))}
  • attr_desc – dictionary containing description and the encoding of the attributes/observations e.g. attr_desc['w'] = {'description':'the word/token','encoding':'categorical'}. For more details/info check the attr_desc of the NERSegmentAttributeExtractor
  • start_state – boolean indicating if __START__ state is required in the model
templateX

dictionary specifying template to follow for observation features extraction. It has the form: {attr_name: {x_offset:tuple(y_offsets)}} e.g. {'w': {(0,):((0,), (-1,0), (-2,-1,0))}}

templateY

dictionary specifying template to follow for y pattern features extraction. It has the form: {Y: tuple(y_offsets)} e.g. {'Y': ((0,), (-1,0), (-2,-1,0))}

attr_desc

dictionary containing description and the encoding of the attributes/observations e.g. attr_desc['w'] = {'description':'the word/token','encoding':'categorical'}. For more details/info check the attr_desc of the NERSegmentAttributeExtractor

start_state

boolean indicating if __START__ state is required in the model

Note

The addition of this class is to add support for __START__ and potentially __STOP__ states
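
For instance, templates could be constructed along these lines (the attribute name 'w' follows the attr_desc example above):

from pyseqlab.features_extraction import FOFeatureExtractor

templateX = {'w': {(0,): ((0,), (-1,0))}}
templateY = {'Y': ((0,), (-1,0))}
attr_desc = {'w': {'description': 'the word/token', 'encoding': 'categorical'}}
fextractor = FOFeatureExtractor(templateX, templateY, attr_desc, start_state=True)
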

extract_features_Y(seq, boundary, templateY)[source]

extract y pattern features for a given sequence and template Y

Parameters:
  • seq – a sequence instance of SequenceStruct
  • boundary – tuple (u,v) representing current boundary
  • templateY – dictionary specifying template to follow for extraction. It has the form: {Y: tuple(y_offsets)} e.g. {'Y': ((0,), (-1,0))}
class pyseqlab.features_extraction.FeatureExtractor(templateX, templateY, attr_desc)[source]

Bases: object

Generic feature extractor class that contains feature functions/templates

Parameters:
  • templateX – dictionary specifying template to follow for observation features extraction. It has the form: {attr_name: {x_offset:tuple(y_offsets)}}. e.g. {'w': {(0,):((0,), (-1,0), (-2,-1,0))}}
  • templateY – dictionary specifying template to follow for y pattern features extraction. It has the form: {Y: tuple(y_offsets)}. e.g. {'Y': ((0,), (-1,0), (-2,-1,0))}
  • attr_desc – dictionary containing description and the encoding of the attributes/observations e.g. attr_desc['w'] = {'description':'the word/token','encoding':'categorical'}. For more details/info check the attr_desc of the NERSegmentAttributeExtractor
template_X

dictionary specifying template to follow for observation features extraction. It has the form: {attr_name: {x_offset:tuple(y_offsets)}} e.g. {'w': {(0,):((0,), (-1,0), (-2,-1,0))}}

template_Y

dictionary specifying template to follow for y pattern features extraction. It has the form: {Y: tuple(y_offsets)} e.g. {'Y': ((0,), (-1,0), (-2,-1,0))}

attr_desc

dictionary containing description and the encoding of the attributes/observations. e.g. attr_desc['w'] = {'description':'the word/token','encoding':'categorical'}. For more details/info check the attr_desc of the NERSegmentAttributeExtractor

aggregate_seq_features(features, boundaries)[source]

aggregate features across all boundaries

it is usually used to aggregate features in the dictionary obtained from extract_seq_features_perboundary() function

Parameters:
  • features – dictionary of sequence features per boundary
  • boundaries – list of boundaries where detected features are aggregated
attr_represent_funcmapper()[source]

assign a representation function based on the encoding (i.e. categorical or continuous) of each attribute name

extract_features_X(seq, boundary)[source]

extract observation features for a given sequence at a specified boundary

Parameters:
  • seq – a sequence instance of SequenceStruct
  • boundary – tuple (u,v) representing current boundary
extract_features_XY(seq, boundary, seg_features=None)[source]

extract/join observation features with y pattern features as specified in template_X

Parameters:
  • seq – a sequence instance of SequenceStruct
  • boundary – tuple (u,v) representing current boundary
Keyword Arguments:
seg_features: optional dictionary of observation features

Example:

template_X = {'w': {(0,):((0,), (-1,0), (-2,-1,0))}}
Using template_X, the function will extract all unigram features of the observation 'w' (x_offset (0,))
and join them with:
    - zero-order y pattern features (0,)
    - first-order y pattern features (-1, 0)
    - second-order y pattern features (-2, -1, 0)
template_Y = {'Y': ((0,), (-1,0), (-2,-1,0))}
extract_features_Y(seq, boundary, templateY)[source]

extract y pattern features for a given sequence and template Y

Parameters:
  • seq – a sequence instance of SequenceStruct
  • boundary – tuple (u,v) representing current boundary
  • templateY – dictionary specifying template to follow for extraction. It has the form: {Y: tuple(y_offsets)} e.g. {'Y': ((0,), (-1,0), (-2,-1,0))}
extract_seq_features_perboundary(seq, seg_features=None)[source]

extract features (observation and y pattern features) per boundary

Parameters:seq – a sequence instance of SequenceStruct
Keyword Arguments:
seg_features: optional dictionary of observation features
flatten_segfeatures(seg_features)[source]

flatten observation features dictionary

Parameters:seg_features – dictionary of observation features
lookup_features_X(seq, boundary)[source]

lookup observation features for a given sequence using varying boundaries (i.e. g(X, u, v))

Parameters:
  • seq – a sequence instance of SequenceStruct
  • boundary – tuple (u,v) representing current boundary
lookup_seq_modelactivefeatures(seq, model, learning=False)[source]

lookup/search model active features for a given sequence using varying boundaries (i.e. g(X, u, v))

Parameters:
  • seq – a sequence instance of SequenceStruct
  • model – a model representation instance of the CRF class (i.e. the class having ModelRepresentation suffix)
Keyword Arguments:
 

learning – optional boolean indicating if this function is used while learning model parameters

save(folder_dir)[source]

store the templates used – templateX and templateY

template_X
template_Y
class pyseqlab.features_extraction.FeatureFilter(filter_info)[source]

Bases: object

class for applying filters by y pattern or feature counts

Parameters:filter_info – dictionary that contains type of filter to be applied
filter_info

dictionary that contains type of filter to be applied

rel_func

dictionary of function map

Example:

filter_info dictionary has three keys:
    - `filter_type` to define the type of filter either {count or pattern}
    - `filter_val` to define either the y pattern or threshold count
    - `filter_relation` to define how the filter should be applied


*count filter*:
    - ``filter_info = {'filter_type': 'count', 'filter_val':5, 'filter_relation':'<'}``
      this filter would delete all features that have count less than five

*pattern filter*:
    - ``filter_info = {'filter_type': 'pattern', 'filter_val': {"O|L", "L|L"}, 'filter_relation':'in'}``
      this filter would delete all features that have associated y pattern ["O|L", "L|L"]
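
A usage sketch applying the count filter (featuresum_dict is a model features dictionary such as the modelfeatures attribute of a model representation instance; see apply_filter() below):

filter_info = {'filter_type': 'count', 'filter_val': 5, 'filter_relation': '<'}
ffilter = FeatureFilter(filter_info)
filtered_features = ffilter.apply_filter(featuresum_dict)
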
apply_filter(featuresum_dict)[source]

apply the defined filter on the model features dictionary

Parameters:featuresum_dict – dictionary that represents model features, similar to the modelfeatures attribute in one of the model representation instances
class pyseqlab.features_extraction.HOFeatureExtractor(templateX, templateY, attr_desc)[source]

Bases: pyseqlab.features_extraction.FeatureExtractor

Feature extractor class for higher order CRF models

class pyseqlab.features_extraction.SeqsRepresenter(attr_extractor, fextractor)[source]

Bases: object

Sequence representer class that prepares, pre-processes, and transforms sequences for learning/decoding tasks

Parameters:
  • attr_extractor – instance of an attribute extractor class such as NERSegmentAttributeExtractor; it is used to apply the defined observation functions that generate features for the observations
  • fextractor – instance of a feature extractor class such as FeatureExtractor; it is used to extract features from the observations and from the observation features generated by the observation functions
attr_extractor

instance of attribute extractor class such as NERSegmentAttributeExtractor

fextractor

instance of feature extractor class such as FeatureExtractor

attr_scaler

instance of the scaler class AttributeScaler; it is used for scaling continuous (i.e. non-categorical) features using standardization or rescaling

aggregate_gfeatures(gfeatures, boundaries)[source]

aggregate global features using specified list of boundaries

Parameters:
  • gfeatures – dictionary representing the extracted sequence features (i.e. F(X, Y))
  • boundaries – list of boundaries to use for aggregating global features
create_model(seqs_id, seqs_info, model_repr_class, filter_obj=None)[source]

aggregate all identified features in the training sequences to build one model

Main task:
  • use the sequences assigned in the training set to build the model
  • takes the union of the detected global feature functions \(F_j(X,Y)\) for each chosen parsed sequence from the training set to form the set of model features
  • construct the tag set Y_set (i.e. possible tags assumed by y_t) using the chosen parsed sequences from the training data set
  • determine the longest segment length (if applicable)
  • apply feature filter (if applicable)
Parameters:
  • seqs_id – list of sequence ids to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences
  • model_repr_class – class name of model representation (i.e. class that has suffix ModelRepresentation such as HOCRFModelRepresentation)
Keyword Arguments:
 

filter_obj – optional instance of FeatureFilter class to apply filter

Note

it requires that the sequences have been already parsed and global features were generated using extract_seqs_globalfeatures()

extract_seqs_globalfeatures(seqs_id, seqs_info, dump_gfeat_perboundary=False)[source]

extract globalfeatures (i.e. F(X,Y)) from every sequence

Main task:
  • parses each sequence and generates the global features \(F_j(X,Y) = \sum_{t=1}^{T+1} f_j(X,Y,t)\)
  • for each sequence we obtain a set of generated global feature functions, where each \(F_j(X,Y)\) represents the sum of the values of its corresponding low-level/local feature function \(f_j(X,Y,t)\) across all positions
  • saves all the results on disk
Parameters:
  • seqs_id – list of sequence ids to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences

Note

it requires that the sequences have been already parsed and preprocessed (if applicable)

extract_seqs_modelactivefeatures(seqs_id, seqs_info, model, output_foldername, learning=False)[source]

identify for every sequence model active states and features

Main task:
  • generate attributes for all segments of length 1 up to the maximum length defined in the model; this is an optional step, applied only in the case of segmentation problems
  • generate segment features, the potentially activated states, and a representation of segment features to be used while learning
  • dump all info on disk
Parameters:
  • seqs_id – list of sequence ids to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences
  • model – an instance of model representation class (i.e. class that has suffix ModelRepresentation such as HOCRFModelRepresentation)
  • output_foldername – string representing the name of the root folder to be created for containing all saved info
Keyword Arguments:
 

learning – boolean indicating if this function used for the purpose of learning (model weights optimization)

Note

seqs_info dictionary will be updated by including the directory of the saved generated info

feature_extractor
get_imposterseq_globalfeatures(seq_id, seqs_info, y_imposter, seg_other_symbol=None)[source]

get the global features of an imposter (decoded) sequence

Main task:
  • process a sequence and generate its global features, returning them without storing/saving intermediary results on disk
Parameters:
  • seq_id – id of the sequence to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences
  • y_imposter – list of labels (y tags) decoded using a decoder
Keyword Arguments:
 

seg_other_symbol – in case of segmentation, this represents the non-entity symbol label used. Otherwise, it is None (default), which translates to a sequence labeling problem.

get_seq_activatedstates(seq_id, seqs_info)[source]

retrieve identified activated states that were saved on disk using seqs_info

Parameters:
  • seq_id – id of the sequence to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences

Note

this data was generated using extract_seqs_modelactivefeatures()

get_seq_activefeatures(seq_id, seqs_info)[source]

retrieve sequence model active features that are identified

Parameters:
  • seq_id – id of the sequence to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences
get_seq_globalfeatures(seq_id, seqs_info, per_boundary=True)[source]

retrieves the global features available for a given sequence (i.e. \(F_j(X,Y)\) for all \(j \in [1...J]\))

Parameters:
  • seq_id – id of the sequence to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences
Keyword Arguments:
 

per_boundary – boolean specifying if the global features representation should be per boundary or aggregated across the whole sequence

get_seq_lsegfeatures(seq_id, seqs_info)[source]

retrieve segment features that were extracted with a modified representation for the purpose of parameter learning

Parameters:
  • seq_id – id of the sequence to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences

Note

this data was generated using extract_seqs_modelactivefeatures()

get_seq_segfeatures(seq_id, seqs_info)[source]

retrieve segment features that were extracted and saved on disk using seqs_info

Parameters:
  • seq_id – id of the sequence to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences

Note

this data was generated using extract_seqs_modelactivefeatures()

static load_seq(seq_id, seqs_info)[source]

load dumped sequences on disk

Parameters:
  • seq_id – id of the sequence to be loaded
  • seqs_info – dictionary comprising the info about the prepared sequences
prepare_seqs(seqs_dict, corpus_name, working_dir, unique_id=True, log_progress=True)[source]

prepare sequences to be used in the CRF models

Main task:
  • generate attributes (i.e. apply observation functions) on the sequence
  • create a directory for every sequence where we save the relevant data
  • create and return seqs_info dictionary comprising info about the prepared sequences
Parameters:
  • seqs_dict – dictionary containing sequences and corresponding ids where each sequence is an instance of the SequenceStruct class
  • corpus_name – string specifying the name of the corpus that will be used as corpus folder name
  • working_dir – string representing the directory where the parsing and saving info on disk will occur
  • unique_id – boolean indicating if the generated corpus folder will include a generated id
Returns:

dictionary comprising the info about the prepared sequences

Return type:

seqs_info (dictionary)

Example:

seqs_info = {seq_id: {'globalfeatures_dir': directory,
                      'T': length of the sequence,
                      'L': length of the longest segment
                     },
             ...
            }
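
Putting it together, a plausible preparation sketch (templateX, templateY, seqs_dict, and working_dir are assumed to be defined beforehand):

from pyseqlab.attributes_extraction import NERSegmentAttributeExtractor
from pyseqlab.features_extraction import HOFeatureExtractor, SeqsRepresenter

attr_extractor = NERSegmentAttributeExtractor()
fextractor = HOFeatureExtractor(templateX, templateY, attr_extractor.attr_desc)
seqs_representer = SeqsRepresenter(attr_extractor, fextractor)
seqs_info = seqs_representer.prepare_seqs(seqs_dict, 'ner_corpus', working_dir, unique_id=True)
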
preprocess_attributes(seqs_id, seqs_info, method='rescaling')[source]

preprocess sequences by generating attributes for segments with L > 1

Main task:
  • generate attributes (i.e. apply observation functions) on segments (i.e. L>1)
  • scale continuous attributes and build the relevant scaling info needed
  • create a directory for every sequence where we save the relevant data
Parameters:
  • seqs_id – list of sequence ids to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences
Keyword Arguments:
 

method – string determining the type of scaling (if applicable); it supports {standardization, rescaling}

represent_gfeatures(gfeatures, model, boundaries=None)[source]

represent extracted sequence global features

two representations could be applied:
    1. features identified by boundary (i.e. f(X,Y))
    2. features identified and aggregated across all positions in the sequence (i.e. F(X, Y))
Parameters:
  • gfeatures – dictionary representing the extracted sequence features (i.e. F(X, Y))
  • model – an instance of model representation class (i.e. class that has suffix ModelRepresentation such as HOCRFModelRepresentation)
Keyword Arguments:
 

boundaries – if specified (i.e. a list of boundaries), the required representation is global features per boundary (i.e. option (1)); otherwise (i.e. None or an empty list), the required representation is the aggregated global features (i.e. option (2))

save(folder_dir)[source]

save essential info about feature extractor

Parameters:folder_dir – string representing directory where files are pickled/dumped
scale_attributes(seqs_id, seqs_info)[source]

scale continuous attributes

Parameters:
  • seqs_id – list of sequence ids to be processed
  • seqs_info – dictionary comprising the info about the prepared sequences
pyseqlab.features_extraction.main()[source]

pyseqlab.fo_crf module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.fo_crf.FirstOrderCRF(model, seqs_representer, seqs_info, load_info_fromdisk=5)[source]

Bases: pyseqlab.linear_chain_crf.LCRF

first-order CRF model

Parameters:
  • model – an instance of FirstOrderCRFModelRepresentation class
  • seqs_representer – an instance of SeqsRepresenter class
  • seqs_info – dictionary holding sequences info
Keyword Arguments:
 

load_info_fromdisk – integer from 0 to 5 specifying number of cached data to be kept in memory. 0 means keep everything while 5 means load everything from disk

model

an instance of FirstOrderCRFModelRepresentation class

weights

a numpy vector representing feature weights

seqs_representer

an instance of pyseqlab.feature_extraction.SeqsRepresenter class

seqs_info

dictionary holding sequences info

beam_size

determines the size of the beam for state pruning

fun_dict

a function map

def_cached_entities

a list of the names of cached entities sorted (descending) based on estimated space required in memory

cached_entitites(load_info_fromdisk)[source]

construct list of names of cached entities in memory

compute_backward_vec(w, seq_id)[source]

compute the backward matrix (beta matrix)

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence

Note

potential matrix per boundary dictionary should be available in seqs_info

compute_feature_expectation(seq_id, P_marginals, grad)[source]

compute the features expectations (i.e. expected count of the feature based on learned model)

Parameters:
  • seq_id – integer representing unique id assigned to the sequence
  • P_marginals – probability matrix for y patterns at each position in time
  • grad – numpy vector with dimension equal to the weight vector. It represents the gradient that will be computed using the feature expectation and the global features of the sequence

Note

  • activefeatures (per boundary) dictionary should be available in seqs_info
  • P_marginal (marginal probability matrix) should be available in seqs_info
compute_forward_vec(w, seq_id)[source]

compute the forward matrix (alpha matrix)

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence

Note

activefeatures need to be loaded first in seqs_info

compute_marginals(seq_id)[source]

compute the marginals (i.e. the probability of each y pattern at each position)

Parameters:seq_id – integer representing unique id assigned to the sequence

Note

  • potential matrix per boundary dictionary should be available in seqs_info
  • alpha matrix should be available in seqs_info
  • beta matrix should be available in seqs_info
  • Z (i.e. P(x)) should be available in seqs_info
compute_potential(w, active_features)[source]

compute the potential matrix of active features in a specified boundary

Parameters:
  • w – weight vector (numpy vector)
  • active_features – dictionary of activated features in a specified boundary
perstate_posterior_decoding(w, seq_id)[source]

decode sequences using posterior probability (per state) decoder

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
prune_states(j, score_mat, beam_size)[source]

prune states that fall off the specified beam size

Parameters:
  • j – current position (integer) in the sequence
  • score_mat – score matrix
  • beam_size – specified size of the beam (integer)
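
A minimal sketch of the pruning logic (the layout of score_mat and the return value are assumptions, not the library's exact implementation):

import numpy as np

def prune_states_sketch(j, score_mat, beam_size):
    # order state codes at position j from highest to lowest score
    order = np.argsort(score_mat[j, :])[::-1]
    # states beyond the top `beam_size` fall off the beam and are returned
    return set(order[beam_size:])
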
validate_forward_backward_pass(w, seq_id)[source]

check the validity of the forward backward pass

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
viterbi(w, seq_id, beam_size, stop_off_beam=False, y_ref=[], K=1)[source]

decode sequences using viterbi decoder

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
  • beam_size – integer representing the size of the beam
Keyword Arguments:
 
  • stop_off_beam – boolean indicating whether to stop when the reference state falls off the beam (used in perceptron/search-based learning)
  • y_ref – reference sequence list of labels (used while learning)
  • K – integer indicating the number of decoded sequences required (i.e. top-K list)
class pyseqlab.fo_crf.FirstOrderCRFModelRepresentation[source]

Bases: pyseqlab.linear_chain_crf.LCRFModelRepresentation

Model representation that will hold data structures to be used in FirstOrderCRF class

it includes all attributes in the LCRFModelRepresentation parent class

Y_codebook_rev

reversed codebook (dictionary) of Y_codebook

startstate_flag

boolean indicating whether to use an edge/boundary state (i.e. __START__ state)

generate_instance_properties()[source]

generate instance properties that will be later used by FirstOrderCRF class

get_Y_codebook_reversed()[source]

generate reversed codebook of Y_codebook

get_modelstates_codebook(states)[source]

create states codebook by mapping each state to a unique code/number

Parameters:states – set of tags identified in training sequences

Example:

states = {'B-PP', 'B-NP', ...}
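
For these states, a resulting codebook could look like (codes are illustrative): Y_codebook = {'B-PP':0, 'B-NP':1, ...}
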
setup_model(modelfeatures, states, L)[source]

setup and create the model representation

Creates all maps and codebooks needed by the FirstOrderCRF class

Parameters:
  • modelfeatures – set of features defining the model
  • states – set of states (i.e. tags)
  • L – length of longest segment

pyseqlab.ho_crf module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.ho_crf.HOCRF(model, seqs_representer, seqs_info, load_info_fromdisk=5)[source]

Bases: pyseqlab.ho_crf_ad.HOCRFAD

higher-order CRF model

compute_backward_vec(w, seq_id)[source]
compute_bpotential(w, active_features)[source]
compute_seq_gradient(w, seq_id, grad)[source]

sequence gradient computation

Warning

the HOCRF class currently does not support gradient-based training. Use search-based training methods such as COLLINS-PERCEPTRON or SAPO

this class demonstrates the computation of the backward matrix using the suffix relation as outlined in: https://papers.nips.cc/paper/3815-conditional-random-fields-with-high-order-features-for-sequence-labeling.pdf

validate_forward_backward_pass(w, seq_id)[source]
class pyseqlab.ho_crf.HOCRFModelRepresentation[source]

Bases: pyseqlab.ho_crf_ad.HOCRFADModelRepresentation

Model representation that will hold data structures to be used in HOCRF class

generate_instance_properties()[source]

generate instance properties that will be later used by the HOCRF class

get_S_info()[source]
get_backward_states()[source]
get_backward_transitions()[source]
get_si_ysk_map()[source]
get_ysk_codebook()[source]
map_z_ysk()[source]
setup_model(modelfeatures, states, L)[source]

setup and create the model representation

Creates all maps and codebooks needed by the HOCRF class

Parameters:
  • modelfeatures – set of features defining the model
  • states – set of states (i.e. tags)
  • L – length of longest segment

pyseqlab.ho_crf_ad module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.ho_crf_ad.HOCRFAD(model, seqs_representer, seqs_info, load_info_fromdisk=5)[source]

Bases: pyseqlab.hosemi_crf_ad.HOSemiCRFAD

higher-order CRF model that uses algorithmic differentiation in gradient computation

Parameters:
  • model – an instance of HOCRFADModelRepresentation class
  • seqs_representer – an instance of SeqsRepresenter class
  • seqs_info – dictionary holding sequences info
Keyword Arguments:
 

load_info_fromdisk – integer from 0 to 5 specifying number of cached data to be kept in memory. 0 means keep everything while 5 means load everything from disk

model

an instance of HOCRFADModelRepresentation class

weights

a numpy vector representing feature weights

seqs_representer

an instance of pyseqlab.feature_extraction.SeqsRepresenter class

seqs_info

dictionary holding sequences info

beam_size

determines the size of the beam for state pruning

fun_dict

a function map

def_cached_entities

a list of the names of cached entities sorted (descending) based on estimated space required in memory

compute_backward_vec(w, seq_id)[source]

compute the backward matrix (beta matrix)

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence

Note

fpotential per boundary dictionary should be available in seqs_info

compute_feature_expectation(seq_id, P_marginals, grad)[source]

compute the features expectations (i.e. expected count of the feature based on learned model)

Parameters:
  • seq_id – integer representing unique id assigned to the sequence
  • P_marginals – probability matrix for y patterns at each position in time
  • grad – numpy vector with dimension equal to the weight vector. It represents the gradient that will be computed using the feature expectation and the global features of the sequence

Note

  • activefeatures (per boundary) dictionary should be available in seqs_info
  • P_marginal (marginal probability matrix) should be available in seqs_info
compute_forward_vec(w, seq_id)[source]

compute the forward matrix (alpha matrix)

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence

Note

activefeatures need to be loaded first in seqs_info

compute_fpotential(w, active_features)[source]

compute the potential of active features in a specified boundary

Parameters:
  • w – weight vector (numpy vector)
  • active_features – dictionary of activated features in a specified boundary
compute_marginals(seq_id)[source]

compute the marginals (i.e. the probability of each y pattern at each position)

Parameters:seq_id – integer representing unique id assigned to the sequence

Note

  • fpotential per boundary dictionary should be available in seqs_info
  • alpha matrix should be available in seqs_info
  • beta matrix should be available in seqs_info
  • Z (i.e. P(x)) should be available in seqs_info
prune_states(j, delta, beam_size)[source]

prune states that fall off the specified beam size

Parameters:
  • j – current position (integer) in the sequence
  • delta – score matrix
  • beam_size – specified size of the beam (integer)
viterbi(w, seq_id, beam_size, stop_off_beam=False, y_ref=[], K=1)[source]

decode sequences using viterbi decoder

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
  • beam_size – integer representing the size of the beam
Keyword Arguments:
 
  • stop_off_beam – boolean indicating whether to stop when the reference state falls off the beam (used in perceptron/search-based learning)
  • y_ref – reference sequence list of labels (used while learning)
  • K – integer indicating the number of decoded sequences required (i.e. top-K list)
class pyseqlab.ho_crf_ad.HOCRFADModelRepresentation[source]

Bases: pyseqlab.hosemi_crf_ad.HOSemiCRFADModelRepresentation

Model representation that will hold data structures to be used in HOCRFAD class

it includes all attributes in the HOSemiCRFADModelRepresentation parent class

filter_activated_states(activated_states, accum_active_states, boundary)[source]

filter/prune states and y features

Parameters:
  • activated_states – dictionary containing the possible active states/y features; it has the form {patt_len:{patt_1, patt_2, ...}}
  • accum_active_states – dictionary of only the possible active states by position; it has the form {pos_1:{state_1, state_2, ...}}
  • boundary – tuple (u,v) representing the current boundary in the sequence

pyseqlab.hosemi_crf module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.hosemi_crf.HOSemiCRF(model, seqs_representer, seqs_info, load_info_fromdisk=5)[source]

Bases: pyseqlab.hosemi_crf_ad.HOSemiCRFAD

higher-order semi-CRF model

it implements the model discussed in: http://www.jmlr.org/papers/volume15/cuong14a/cuong14a.pdf

compute_backward_vec(w, seq_id)[source]
compute_bpotential(w, active_features)[source]
compute_marginals(seq_id)[source]
class pyseqlab.hosemi_crf.HOSemiCRFModelRepresentation[source]

Bases: pyseqlab.hosemi_crf_ad.HOSemiCRFADModelRepresentation

Model representation that will hold data structures to be used in HOSemiCRF class

generate_instance_properties()[source]
get_S_info()[source]
get_backward_transitions()[source]
get_si_siy_codebook()[source]
get_siy_info()[source]
map_pky_z()[source]

generate a map between elements of the Z set and PY set

map_siy_z()[source]
setup_model(modelfeatures, states, L)[source]

pyseqlab.hosemi_crf_ad module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.hosemi_crf_ad.HOSemiCRFAD(model, seqs_representer, seqs_info, load_info_fromdisk=5)[source]

Bases: pyseqlab.linear_chain_crf.LCRF

higher-order semi-CRF model that uses algorithmic differentiation in gradient computation

Parameters:
  • model – an instance of HOSemiCRFADModelRepresentation class
  • seqs_representer – an instance of SeqsRepresenter class
  • seqs_info – dictionary holding sequences info
Keyword Arguments:
 

load_info_fromdisk – integer from 0 to 5 specifying number of cached data to be kept in memory. 0 means keep everything while 5 means load everything from disk

model

an instance of HOSemiCRFADModelRepresentation class

weights

a numpy vector representing feature weights

seqs_representer

an instance of pyseqlab.feature_extraction.SeqsRepresenter class

seqs_info

dictionary holding sequences info

beam_size

determines the size of the beam for state pruning

fun_dict

a function map

def_cached_entities

a list of the names of cached entities sorted (descending) based on estimated space required in memory

cached_entitites(load_info_fromdisk)[source]

construct list of names of cached entities in memory

compute_backward_vec(w, seq_id)[source]

compute the backward matrix (beta matrix)

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence

Note

fpotential per boundary dictionary should be available in seqs_info

compute_feature_expectation(seq_id, P_marginals, grad)[source]

compute the features expectations (i.e. expected count of the feature based on learned model)

Parameters:
  • seq_id – integer representing unique id assigned to the sequence
  • P_marginals – probability matrix for y patterns at each position in time
  • grad – numpy vector with dimension equal to the weight vector. It represents the gradient that will be computed using the feature expectation and the global features of the sequence

Note

  • activefeatures (per boundary) dictionary should be available in seqs_info
  • P_marginal (marginal probability matrix) should be available in seqs_info
compute_forward_vec(w, seq_id)[source]

compute the forward matrix (alpha matrix)

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence

Note

activefeatures need to be loaded first in seqs_info

compute_fpotential(w, active_features)[source]

compute the potential of active features in a specified boundary

Parameters:
  • w – weight vector (numpy vector)
  • active_features – dictionary of activated features in a specified boundary
compute_marginals(seq_id)[source]

compute the marginals (i.e. the probability of each y pattern at each position)

Parameters:seq_id – integer representing unique id assigned to the sequence

Note

  • fpotential per boundary dictionary should be available in seqs_info
  • alpha matrix should be available in seqs_info
  • beta matrix should be available in seqs_info
  • Z (i.e. P(x)) should be available in seqs_info
prune_states(score_vec, beam_size)[source]

prune states that fall off the specified beam size

Parameters:
  • score_vec – score matrix
  • beam_size – specified size of the beam (integer)
viterbi(w, seq_id, beam_size, stop_off_beam=False, y_ref=[], K=1)[source]

decode sequences using viterbi decoder

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
  • beam_size – integer representing the size of the beam
Keyword Arguments:
 
  • stop_off_beam – boolean indicating whether to stop when the reference state falls off the beam (used in perceptron/search-based learning)
  • y_ref – reference sequence list of labels (used while learning)
  • K – integer indicating the number of decoded sequences required (i.e. top-K list); an A* searcher with viterbi will be used to generate the K-decoded list
class pyseqlab.hosemi_crf_ad.HOSemiCRFADModelRepresentation[source]

Bases: pyseqlab.linear_chain_crf.LCRFModelRepresentation

Model representation that will hold data structures to be used in HOSemiCRFAD class

P_codebook

set of proper prefixes of the elements in the set of patterns Z_codebook e.g. {'':0, 'P':1, 'L':2, 'O':3, 'L|O':4, ...}

P_codebook_rev

reversed codebook of P_codebook e.g. {0:'', 1:'P', 2:'L', 3:'O', 4:'L|O', ...}

P_len

dictionary comprising the length (i.e. number of elements) of elements in P_codebook e.g. {'':0, 'P':1, 'L':1, 'O':1, 'L|O':2, ...}

P_elems

dictionary comprising the composing elements of every prefix in P_codebook e.g. {'':('',), 'P':('P',), 'L':('L',), 'O':('O',), 'L|O':('L','O'), ...}

P_numchar

dictionary comprising the number of characters for every prefix in P_codebook e.g. {'':0, 'P':1, 'L':1, 'O':1, 'L|O':3, ...}
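
As an illustration, the P set can be derived from the patterns in Z_codebook along these lines (a sketch, not the library's implementation):

def proper_prefixes(z_patterns, sep='|'):
    # collect the proper prefixes of every y pattern, plus the empty prefix
    P = {''}
    for patt in z_patterns:
        parts = patt.split(sep)
        for i in range(1, len(parts)):
            P.add(sep.join(parts[:i]))
    return P

# e.g. proper_prefixes({'O', 'L|O', 'P|L|O'}) returns {'', 'L', 'P', 'P|L'}
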

f_transition

a dictionary representing forward transition data structure having the form: {pi:{pky, (pk, y)}} where pi represents the longest prefix element in P_codebook for pky (representing the concatenation of elements in P_codebook and Y_codebook)

pky_codebook

a codebook for the elements of the set PY (the product of the sets P and Y)

pi_pky_map

a map between P elements and PY elements

z_pky_map

a map between elements of the Z set and the PY set; it has the form/template {ypattern:[pky_elements]}

z_pi_piy_map

a map between elements of the Z set and the PY set; it has the form/template {ypattern:(pk, pky, pi)}

filter_activated_states(activated_states, accum_active_states, curr_boundary)[source]

filter/prune states and y features

Parameters:
  • activated_states – dictionary containing the possible active states/y features; it has the form {patt_len:{patt_1, patt_2, ...}}
  • accum_active_states – dictionary of only the possible active states by position; it has the form {pos_1:{state_1, state_2, ...}}
  • curr_boundary – tuple (u,v) representing the current boundary in the sequence
generate_instance_properties()[source]

generate instance properties that will be later used by HOSemiCRFAD class

get_P_codebook_rev()[source]

generate reversed codebook of P_codebook

get_P_info()[source]

get the properties of P set (proper prefixes)

get_forward_states()[source]

create set of forward states (referred to set P) and map each element to unique code

P is the set of proper prefixes of the elements in the Z_codebook set

get_forward_transition()[source]

generate forward transition data structure

Main tasks:
  • create a set PY from the product of P and Y sets
  • for each element in PY, determine the longest suffix existing in set P
  • include all this info in f_transition dictionary
get_pi_pky_map()[source]

generate map between P elements and PY elements

Main tasks:
  • for every element in PY, determine the longest suffix in P
  • determine the two components in PY (i.e. p and y element)
  • represent this info in a dictionary that will be used for forward/alpha matrix
get_pky_codebook()[source]

generate a codebook for the elements of the set PY (the product of set P and Y)

map_pky_z()[source]

generate a map between elements of the Z set and PY set

setup_model(modelfeatures, states, L)[source]

setup and create the model representation

Creates all maps and codebooks needed by the HOSemiCRFAD class

Parameters:
  • modelfeatures – set of features defining the model
  • states – set of states (i.e. tags)
  • L – length of longest segment

pyseqlab.linear_chain_crf module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.linear_chain_crf.LCRF(model, seqs_representer, seqs_info, load_info_fromdisk=5)[source]

Bases: object

linear chain CRF model

Parameters:
  • model – an instance of LCRFModelRepresentation class
  • seqs_representer – an instance of SeqsRepresenter class
  • seqs_info – dictionary holding sequences info
Keyword Arguments:
 

load_info_fromdisk – integer from 0 to 5 specifying number of cached data to be kept in memory. 0 means keep everything while 5 means load everything from disk

model

an instance of LCRFModelRepresentation class

weights

a numpy vector representing feature weights

seqs_representer

an instance of SeqsRepresenter class

seqs_info

dictionary holding sequences info

beam_size

determines the size of the beam for state pruning

fun_dict

a function map

def_cached_entities

a list of the names of cached entities sorted (descending) based on estimated space required in memory

cached_entitites(load_info_fromdisk)[source]

construct list of names of cached entities in memory

check_cached_info(seq_id, entity_names)[source]

check and load required data elements/entities for every computation step

Parameters:
  • seq_id – integer representing unique id assigned to the sequence
  • entity_names – list of names of the data elements that need to be loaded in the seqs_info dictionary while performing computation

Note

order of elements in the entity_names list is important

check_gradient(w, seq_id)[source]

implementation of finite difference method similar to scipy.optimize.check_grad()

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
clear_cached_info(seqs_id, cached_entities=[])[source]

clear/clean loaded data elements/entities in the seqs_info dictionary

Parameters:seqs_id – list of integers representing the unique ids of the training sequences
Keyword Arguments:
 cached_entities – list of data entities to be cleared from the seqs_info dictionary

Note

order of elements in the cached_entities list is important

compute_backward_vec(w, seq_id)[source]

compute the backward matrix (beta matrix)

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence

Warning

implementation of this method is in the child class

compute_feature_expectation(seq_id, P_marginals)[source]

compute the features expectations (i.e. expected count of the feature based on learned model)

Parameters:
  • seq_id – integer representing unique id assigned to the sequence
  • P_marginals – probability matrix for y patterns at each position in time

Warning

implementation of this method is in the child class

compute_forward_vec(w, seq_id)[source]

compute the forward matrix (alpha matrix)

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence

Warning

implementation of this method is in the child class

compute_marginals(seq_id)[source]

compute the marginals (i.e. the probability of each y pattern at each position)

Parameters:seq_id – integer representing unique id assigned to the sequence

Warning

implementation of this method is in the child class

compute_seq_gradient(w, seq_id, grad)[source]

compute the gradient of the conditional log-likelihood with respect to the parameter vector w (\(\frac{\partial \log p(Y|X;w)}{\partial w}\))

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
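
For reference, this gradient takes the standard CRF form per feature \(j\): \(\frac{\partial \log p(Y|X;w)}{\partial w_j} = F_j(X,Y) - \sum_{Y'} p(Y'|X;w) F_j(X,Y')\), i.e. the observed global feature value minus its expectation under the current model.
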
compute_seq_loglikelihood(w, seq_id)[source]

computes the conditional log-likelihood of a sequence (i.e. \(\log p(Y|X;w)\))

it is used as a cost function for the single sequence when trying to estimate parameters w

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
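
Under the standard linear-chain CRF formulation, this quantity is \(\log p(Y|X;w) = \sum_{j} w_j F_j(X,Y) - \log Z(X)\), where \(Z(X)\) is the partition function obtained from the forward matrix.
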
compute_seqs_gradient(w, seqs_id)[source]

compute the gradient of conditional log-likelihood with respect to the parameters vector w

Parameters:
  • w – weight vector (numpy vector)
  • seqs_id – list of integers representing unique ids of sequences used for training
compute_seqs_loglikelihood(w, seqs_id)[source]

computes the conditional log-likelihood of training sequences

it is used as a cost/objective function for the whole training sequences when trying to estimate parameters w

Parameters:
  • w – weight vector (numpy vector)
  • seqs_id – list of integers representing unique ids of sequences used for training
decode_seqs(decoding_method, out_dir, **kwargs)[source]

decode sequences (i.e. infer labels of sequence of observations)

Parameters:
  • decoding_method – a string referring to type of decoding {viterbi, per_state_decoding}
  • out_dir – string representing the working directory (path) where sequence processing will take place
Keyword Arguments:
 
  • file_name – the name of the file in case decoded sequences are required to be written
  • sep – separator (default '\t') between the columns when writing decoded sequences to file
  • procseqs_foldername – string representing the folder name where intermediary data and parsing would take place
  • beam_size – integer determining the size of the beam while decoding
  • seqs – a list of sequences (instances of the SequenceStruct class) to be decoded (used for decoding test data or any new/unseen data)
  • seqs_info – dictionary containing the info about the sequences to decode (used for decoding training sequences)
  • seqs_dict – a dictionary with sequence ids as keys and the corresponding sequences (instances of the SequenceStruct class) to be decoded as values

Note

for keyword arguments, only one of the {seqs, seqs_info, seqs_dict} options needs to be specified
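
A hypothetical decoding call (crf_model is a trained CRF instance, working_dir a path string, and test_seqs a list of SequenceStruct instances; the beam size value is illustrative):

crf_model.decode_seqs('viterbi', working_dir,
                      seqs=test_seqs,
                      beam_size=5,
                      file_name='decoded_seqs.txt',
                      sep='\t')
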

generate_activefeatures(seq_id)[source]

construct a dictionary of model active features identified given a sequence

Main task:
  • generate active features for every boundary of the sequence
Parameters:seq_id – integer representing unique id assigned to the sequence
identify_activefeatures(seq_id, boundary, accum_activestates, apply_filter=True)[source]

determine model active features for a given sequence at defined boundary

Main task:
  • determine model active features in a given boundary
  • update the accum_activestates dictionary
Parameters:
  • seq_id – integer representing unique id assigned to the sequence
  • boundary – tuple (u,v) defining the boundary under consideration
  • accum_activestates – dictionary of the form {(u,v):{state_1, state_2, ...}}; it keeps track of the active states in each boundary
load_activatedstates(seq_id)[source]

load sequence activated states in seqs_info

Parameters:seq_id – integer representing unique id assigned to the sequence
load_activefeatures(seq_id)[source]

load sequence model identified active features in seqs_info

Parameters:seq_id – integer representing unique id assigned to the sequence
load_globalfeatures(seq_id, per_boundary=True)[source]

load sequence global features in seqs_info

Parameters:seq_id – integer representing unique id assigned to the sequence
Keyword Arguments:
 per_boundary – boolean representing if the required global features dictionary is represented by boundary (i.e. True) or aggregated (i.e. False)
load_imposter_globalfeatures(seq_id, y_imposter, seg_other_symbol)[source]

load imposter sequence global features in seqs_info

Parameters:
  • seq_id – integer representing unique id assigned to the sequence
  • y_imposter – the imposter sequence generated using viterbi decoder
  • seg_other_symbol – if specified, the task is a segmentation problem (in this case the non-entity/other element must be specified); if None (default), it is considered a sequence labeling problem
load_segfeatures(seq_id)[source]

load sequence observation features in seqs_info

Parameters:seq_id – integer representing unique id assigned to the sequence
prune_states(j, delta, beam_size)[source]

prune states that fall off the specified beam size

Parameters:
  • j – current position (integer) in the sequence
  • delta – score matrix
  • beam_size – specified size of the beam (integer)

Warning

implementation of this method is in the child class
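
Conceptually, pruning keeps only the beam_size highest-scoring states at position j and discards the rest. A sketch of the idea (not the library's actual child-class implementation):

import numpy as np

def prune_states_sketch(j, delta, beam_size):
    # rank states at position j by their score in the delta matrix
    ranked = np.argsort(delta[j, :])[::-1]
    # states outside the top beam_size fall off the beam
    return set(ranked[beam_size:].tolist())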

represent_globalfeature(gfeatures, boundaries)[source]

represent extracted sequence global features

two representations can be applied:
    1. features identified by boundary (i.e. f(X,Y))
    2. features identified and aggregated across all positions in the sequence (i.e. F(X, Y))
Parameters:
  • gfeatures – dictionary representing the extracted sequence features (i.e F(X, Y))
  • boundaries – if specified (i.e. a list of boundaries), the required representation is global features per boundary (i.e. option (1)); else (i.e. None or empty list), the required representation is the aggregated global features (option (2))
save_model(folder_dir)[source]

save model data structures

Parameters:folder_dir – string representing directory where files are pickled/dumped
seqs_info
validate_expected_featuresum(w, seqs_id)[source]

validate expected feature computation

Parameters:
  • w – weight vector (numpy vector)
  • seqs_id – list of integers representing unique id assigned to the sequences
validate_forward_backward_pass(w, seq_id)[source]

check the validity of the forward backward pass

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
validate_gradient(w, seq_id)[source]
viterbi(w, seq_id, beam_size, stop_off_beam=False, y_ref=[], K=1)[source]

decode sequences using viterbi decoder

Parameters:
  • w – weight vector (numpy vector)
  • seq_id – integer representing unique id assigned to the sequence
  • beam_size – integer representing the size of the beam
Keyword Arguments:
 
  • stop_off_beam – boolean indicating if to stop when the reference state falls off the beam (used in perceptron/search based learning)
  • y_ref – reference sequence list of labels (used while learning)
  • K – integer indicating number of decoded sequences required (i.e. top-k list)

Warning

implementation of this method is in the child class

write_decoded_seqs(ref_seqs, Y_pred_seqs, out_file, sep='\t')[source]

write inferred sequences on file

Parameters:
  • ref_seqs – list of sequences that are instances of SequenceStruct
  • Y_pred_seqs – list of lists of tags decoded for every reference sequence
  • out_file – string representing out file where data is written
  • sep – separator used while writing on out file
class pyseqlab.linear_chain_crf.LCRFModelRepresentation[source]

Bases: object

Model representation that will hold data structures to be used in LCRF class

modelfeatures

set of features defining the model

modelfeatures_codebook

dictionary mapping each feature in modelfeatures to a unique code

Y_codebook

dictionary mapping the set of states (i.e. tags) to a unique code each

L

length of longest segment

Z_codebook

dictionary for the set Z, mapping each element to unique number/code

Z_len

dictionary comprising the length of each element in Z_codebook

Z_elems

dictionary comprising the composing elements of each member in the Z set (Z_codebook)

Z_numchar

dictionary comprising the number of characters of each member in the Z set (Z_codebook)

patts_len

set of lengths extracted from Z_len (i.e. set(Z_len.values()))

max_patts_len

maximum pattern length used in the model

modelfeatures_inverted

inverted model features (i.e. inverting the modelfeatures dictionary)

ypatt_features

state features (i.e. y pattern features) that depend only on the states

ypatt_activestates

possible/potential activated y patterns/features using the observation features

num_features

total number of features in the model

num_states

total number of states in the model

accumulate_activefeatures(activefeatures, accumfeatures)[source]
check_prefix(token, ref_str)[source]
check_suffix(token, ref_str)[source]
filter_activated_states(activated_states, accum_active_states, boundary)[source]

filter/prune states and y features

Parameters:
  • activated_states – dictionary containing possible active states/y features; it has the form {patt_len:{patt_1, patt_2, ...}}
  • accum_active_states – dictionary of possible active states by position; it has the form {pos_1:{state_1, state_2, ...}}
  • boundary – tuple (u,v) representing the current boundary in the sequence
find_activated_states(seg_features, allowed_z_len)[source]

identify possible activated y patterns/features using the observation features

Parameters:
  • seg_features – dictionary of the observation features. It has the form {featureA_name:value, featureB_name:value, ...}
  • allowed_z_len – set of permissible orders/lengths of y features (e.g. {1,2,3} means up to third-order y features are allowed)
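
For illustration, a hypothetical input/output pair consistent with the modelfeatures_inverted example shown under get_inverted_modelfeatures() below (model is an assumed LCRFModelRepresentation instance):

seg_features = {'w[0]=the': 1}
activated = model.find_activated_states(seg_features, {1, 2})
# plausible result form:
# {1: {'B-NP'}, 2: {'B-PP|B-NP', 'I-VP|B-NP'}}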
find_seg_activefeatures(seg_features, allowed_z_len)[source]

finds active features based on the observation/segment features

Parameters:
  • seg_features
  • allowed_z_len
find_ypatt_activefeatures(allowed_z_len)[source]

finds the label and state transition features (if applicable – in case it is modeled)

Parameters:allowed_z_len
generate_instance_properties()[source]

generate instance properties that will be later used by LCRF class

get_Z_info()[source]

get the properties of Z set

get_Z_pattern()[source]

create a codebook from set Z by mapping each element to unique number/code

Z is set of y patterns used in the model features

Example:

Z = {'O|B-VP|B-NP', 'O|B-VP', 'O', 'B-VP', 'B-NP', ...}
Z_codebook = {'O|B-VP|B-NP':1, 'O|B-VP':2, 'O':3, 'B-VP':5, 'B-NP':4, ...}
get_inverted_modelfeatures()[source]

invert modelfeatures instance variable

Example:

modelfeatures_inverted =
{'w[0]=take': {1: {'I-VP'}, 2: {'I-VP|I-VP'}, 3: {'I-VP|I-VP|I-VP'}},
 'w[0]=the': {1: {'B-NP'},
              2: {'B-PP|B-NP', 'I-VP|B-NP'},
              3: {'B-NP|B-PP|B-NP', 'B-VP|I-VP|B-NP', ...}
             },
             ...
}

ypatt_features = {'B-NP', 'B-PP|B-NP', ...}
get_modelfeatures_codebook()[source]

setup model features codebook

it flattens modelfeatures and maps each element to a unique code. modelfeatures are represented in a dictionary of this form:

{y_patt_1:{featureA:value, featureB:value, ...}
 y_patt_2:{featureA:value, featureC:value, ...}}

Example:

modelfeatures:
     {'B-PP': Counter({'w[0]=at': 1,
                       'w[0]=by': 1,
                       'w[0]=for': 4,
                       ...
                      }),
     'B-PP|B-NP': Counter({'w[0]=16': 1,
                           'w[0]=July': 1,
                           'w[0]=Nomura': 1,
                           ...
                           }),
                  ...
     }
modelfeatures_codebook:
    {('B-PP','w[0]=at'): 1,
     ('B-PP','w[0]=by'): 2,
     ('B-PP','w[0]=for'): 3,
     ...
    }
get_modelstates_codebook(states)[source]

create states codebook by mapping each state to a unique code/number

Parameters:states – set of tags identified in training sequences

Example:

states = {'B-PP', 'B-NP', ...}
states_codebook = {'B-PP':1, 'B-NP':2 ...}
get_num_features()[source]

return total number of features in the model

get_num_states()[source]

return total number of states identified by the model in the training set

join_segfeatures_filteredstates(seg_features, filtered_states)[source]

represent detected active features while parsing sequences

Parameters:
  • seg_features – dictionary of the observation features. It has the form {featureA_name:value, featureB_name:value, ...}
  • filtered_states – dictionary of the form {patt_len:{patt_1, patt_2, ...}}
keep_longest_elems(s)[source]

used to figure out longest suffix and prefix on sets

represent_activefeatures(activefeatures)[source]
represent_globalfeatures(seq_featuresum)[source]

represent features extracted from sequences using modelfeatures_codebook

Parameters:seq_featuresum – dictionary of sequence global features representing F(X,Y)
represent_ypatt_filteredstates(filtered_states)[source]

represent detected active features while parsing sequences

Parameters:filtered_states – dictionary of the form {patt_len:{patt_1, patt_2, ...}}
save(folder_dir)[source]

save main model data structures

setup_model(modelfeatures, states, L)[source]

setup and create the model representation

Creates all maps and codebooks needed by the LCRF class

Parameters:
  • modelfeatures – set of features defining the model
  • states – set of states (i.e. tags)
  • L – length of longest segment

pyseqlab.utilities module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.utilities.AStarAgenda[source]

Bases: object

class containing a heap where instances of AStarNode class will be pushed

the push operation will use the score matrix (built using viterbi algorithm) representing the unnormalized probability of the sequences ending at every position with the different available prefixes/states

qagenda

queue where instances of AStarNode are pushed

entry_count

counter that keeps track of the entries and associates each entry (node) with a unique number. It is useful for resolving nodes with equal costs

pop()[source]

pop nodes with highest score from the heap

push(astar_node, cost)[source]

push instance of AStarNode with its associated cost to the heap

Parameters:
  • astar_node – instance of AStarNode class
  • cost – float representing the score/unnormalized probability of a sequence up to given position
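
A typical way to realize such an agenda is a heapq-based min-heap keyed on the negated cost, with entry_count breaking ties between nodes of equal cost. A sketch of the idea (not necessarily the exact internals):

import heapq

class AgendaSketch:
    def __init__(self):
        self.qagenda = []
        self.entry_count = 0

    def push(self, astar_node, cost):
        # negate the cost so the min-heap pops the highest score first;
        # entry_count resolves ties between nodes with equal cost
        heapq.heappush(self.qagenda, (-cost, self.entry_count, astar_node))
        self.entry_count += 1

    def pop(self):
        # retrieve the node with the highest score
        _, _, node = heapq.heappop(self.qagenda)
        return node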
class pyseqlab.utilities.AStarNode(cost, position, pi_c, label, frwdlink)[source]

Bases: object

class representing A* node to be used with A* searcher and viterbi for generating k-decoded list

Parameters:
  • cost – float representing the score/unnormalized probability of a sequence up to given position
  • position – integer representing the current position in the sequence
  • pi_c – prefix or state code of the label
  • label – label of the current position in a sequence
  • frwdlink – a link to AStarNode node
cost

float representing the score/unnormalized probability of a sequence up to given position

position

integer representing the current position in the sequence

pi_c

prefix or state code of the label

label

label of the current position in a sequence

a link to AStarNode node

print_node()[source]

print the info about a node

class pyseqlab.utilities.BoundNode(parent, boundary)[source]

Bases: object

boundary entity class used when generating all possible partitions within specified constraint

Parameters:
  • parent – instance of BoundNode
  • boundary – tuple (u,v) representing the current boundary
add_child(child)[source]

add link to the child nodes

get_child()[source]

retrieve child nodes

get_signature()[source]

retrieve the id of the node

class pyseqlab.utilities.DataFileParser[source]

Bases: object

class to parse a data file comprising the training/testing data

seqs

list of sequences that are instances of the SequenceStruct class

header

list of attribute names read from the file

parse_header(x_arg)[source]

parse header

Parameters:x_arg – tuple of attribute/observation names
parse_line(x_arg)[source]

parse the read line

Parameters:x_arg – tuple of observation columns
read_file(file_path, header, y_ref=True, seg_other_symbol=None, column_sep=' ')[source]

read and parse a file that contains the sequences following a predefined format

the file should contain label and observation tracks each separated in a column

Note

label column is the LAST column in the file (i.e. X_a X_b Y)

Parameters:
  • file_path – string representing the file path to the data file
  • header – specifies how the header is reported in the file containing the sequences. Options include: 'main' -> one header at the beginning of the file; 'per_sequence' -> a header for every sequence; or a list of keywords as header (i.e. ['w', 'part_of_speech'])
Keyword Arguments:
 
  • y_ref – boolean specifying if the reference label column is included in the data file
  • seg_other_symbol – string or None (default); if specified, the task is a segmentation problem where seg_other_symbol represents the non-entity symbol and semi-CRF models are used; else (i.e. seg_other_symbol is None), it is considered a sequence labeling problem
  • column_sep – string, separator used between the columns in the file
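
A minimal usage sketch, assuming a whitespace-separated file with one header line at the top and the label in the last column ('train.txt' is a hypothetical path; read_file() is assumed to yield SequenceStruct instances):

from pyseqlab.utilities import DataFileParser

parser = DataFileParser()
seqs = []
for seq in parser.read_file('train.txt', header='main',
                            y_ref=True, column_sep=' '):
    seqs.append(seq)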
update_X(X, Y)[source]

update sequence observations

update_XY(X, Y)[source]

update sequence observations and corresponding labels

class pyseqlab.utilities.FO_AStarSearcher(Y_codebook_rev)[source]

Bases: object

A* searcher associated with a first-order CRF model such as FirstOrderCRF

Parameters:Y_codebook_rev – a reversed version of dictionary comprising the set of states each assigned a unique code
Y_codebook_rev

a reversed version of dictionary comprising the set of states each assigned a unique code

infer_labels(top_node, back_track)[source]

decode sequence by inferring labels

Parameters:
  • top_node – instance of AStarNode class
  • back_track – dictionary containing back pointers built using dynamic programming algorithm
search(alpha, back_track, T, K)[source]

the A* searcher uses the score matrix (built using the viterbi algorithm) to decode the top-K list of sequences

Parameters:
  • alpha – score matrix built using the viterbi algorithm
  • back_track – back_pointers dictionary tracking the best paths to every state
  • T – last decoded position of a sequence (in this context, it is the alpha.shape[0])
  • K – number of top decoded sequences to be returned
Returns:

top-K list of decoded sequences

Return type:

topk_list

class pyseqlab.utilities.HOSemi_AStarSearcher(P_codebook_rev, pi_elems)[source]

Bases: object

A* searcher associated with a higher-order semi-CRF model such as HOSemiCRFAD

Parameters:
  • P_codebook_rev – reversed codebook of set of proper prefixes in the P set e.g. {0:'', 1:'P', 2:'L', 3:'O', 4:'L|O', ...}
  • P_elems – dictionary comprising the composing elements of every prefix in the P set e.g. {'':('',), 'P':('P',), 'L':('L',), 'O':('O',), 'L|O':('L','O'), ...}
P_codebook_rev

reversed codebook of set of proper prefixes in the P set e.g. {0:'', 1:'P', 2:'L', 3:'O', 4:'L|O', ...}

P_elems

dictionary comprising the composing elements of every prefix in the P set e.g. {'':('',), 'P':('P',), 'L':('L',), 'O':('O',), 'L|O':('L','O'), ...}

get_node_label(pi_code)[source]

get the label/state given a prefix code

Parameters:pi_code – prefix code which is an element of P_codebook_rev
infer_labels(top_node, back_track)[source]

decode sequence by inferring labels

Parameters:
  • top_node – instance of AStarNode class
  • back_track – dictionary containing back pointers tracking the best paths to every state
search(alpha, back_track, T, K)[source]

the A* searcher uses the score matrix (built using the viterbi algorithm) to decode the top-K list of sequences

Parameters:
  • alpha – score matrix built using the viterbi algorithm
  • back_track – back_pointers dictionary tracking the best paths to every state
  • T – last decoded position of a sequence (in this context, it is the alpha.shape[0])
  • K – number of top decoded sequences to be returned
Returns:

top-K list of decoded sequences

Return type:

topk_list

class pyseqlab.utilities.HO_AStarSearcher(P_codebook_rev, P_elems)[source]

Bases: object

A* searcher associated with a higher-order CRF model such as HOCRFAD

Parameters:
  • P_codebook_rev – reversed codebook of set of proper prefixes in the P set e.g. {0:'', 1:'P', 2:'L', 3:'O', 4:'L|O', ...}
  • P_elems – dictionary comprising the composing elements of every prefix in the P set e.g. {'':('',), 'P':('P',), 'L':('L',), 'O':('O',), 'L|O':('L','O'), ...}
P_codebook_rev

reversed codebook of set of proper prefixes in the P set e.g. {0:'', 1:'P', 2:'L', 3:'O', 4:'L|O', ...}

P_elems

dictionary comprising the composing elements of every prefix in the P set e.g. {'':('',), 'P':('P',), 'L':('L',), 'O':('O',), 'L|O':('L','O'), ...}

get_node_label(pi_code)[source]

get the label/state given a prefix code

Parameters:pi_code – prefix code which is an element of P_codebook_rev
infer_labels(top_node, back_track)[source]

decode sequence by inferring labels

Parameters:
  • top_node – instance of AStarNode class
  • back_track – dictionary containing back pointers tracking the best paths to every state
search(alpha, back_track, T, K)[source]

the A* searcher uses the score matrix (built using the viterbi algorithm) to decode the top-K list of sequences

Parameters:
  • alpha – score matrix built using the viterbi algorithm
  • back_track – back_pointers dictionary tracking the best paths to every state
  • T – last decoded position of a sequence (in this context, it is the alpha.shape[0])
  • K – number of top decoded sequences to be returned
Returns:

top-K list of decoded sequences

Return type:

topk_list

class pyseqlab.utilities.ReaderWriter[source]

Bases: object

class for dumping, reading and logging data

static dump_data(data, file_name, mode='wb')[source]

dump data by pickling

Parameters:
  • data – data to be pickled
  • file_name – file path where data will be dumped
  • mode – specify writing options i.e. binary or unicode
static log_progress(line, outfile, mode='a')[source]

write data to a file

Parameters:
  • line – string representing data to be written out
  • outfile – file path where data will be written/logged
  • mode – specify writing options i.e. append, write
static read_data(file_name, mode='rb')[source]

read dumped/pickled data

Parameters:
  • file_name – file path from where data will be read
  • mode – specify reading options i.e. binary or unicode
class pyseqlab.utilities.SequenceStruct(X, Y, seg_other_symbol=None)[source]

Bases: object

class for representing each sequence/segment

Parameters:
  • Y – list containing the sequence of states/labels (i.e. [‘P’,’O’,’O’,’L’,’L’])
  • X – list containing dictionary elements of observation sequences and/or features of the input
  • seg_other_symbol – string or None (default); if specified, the task is a segmentation problem where it represents the non-entity symbol; else (None), it is considered a sequence labeling problem
Y

list containing the sequence of states/labels (i.e. [‘P’,’O’,’O’,’L’,’L’])

X

list containing dictionary elements of observation sequences and/or features of the input

seg_other_symbol

string or None (default); if specified, the task is a segmentation problem where it represents the non-entity symbol; else (None), it is considered a sequence labeling problem

T

int, length of a sequence (i.e. len(X))

seg_attr

dictionary comprising the extracted attributes per each boundary of a sequence

L

int, longest length of an identified segment in the sequence

flat_y

list of labels/tags

y_sboundaries

sorted list of boundaries of the Y of the sequence

y_range

range of the sequence

X
Y
flatten_y(Y)[source]

flatten the Y attribute

Parameters:Y – dictionary of this form {(1, 1): 'P', (2,2): 'O', (3, 3): 'O', (4, 5): 'L'}

Example

flattened y becomes ['P','O','O','L','L']

get_x_boundaries()[source]

return the boundaries of the observation sequence

get_y_boundaries()[source]

return the sorted boundaries of the labels of the sequence
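
For example, a sequence whose only observation attribute is the word w could be constructed as follows (an illustrative sketch; the attribute name and labels are made up):

from pyseqlab.utilities import SequenceStruct

X = [{'w': 'Peter'}, {'w': 'visited'}, {'w': 'New'}, {'w': 'Haven'}]
Y = ['P', 'O', 'L', 'L']
seq = SequenceStruct(X, Y)
# passing seg_other_symbol='O' instead would treat this as a segmentation task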

class pyseqlab.utilities.TemplateGenerator[source]

Bases: object

template generator class for feature/function template generation

static generate_combinations(n)[source]

generates all possible combinations based on the maximum number of ngrams n

Parameters:n – integer specifying the maximum/greatest ngram option
static generate_ngram(l, n)[source]

n-gram generator based on the length of the window and the ngram option

Parameters:
  • l – list of positions of the range representing the window size (i.e. list(wsize))
  • n – integer representing the n-gram option (i.e. 1 for unigram, 2 for bigram, etc..)
generate_template_XY(attr_name, x_spec, y_spec, template)[source]

generate template XY for the feature extraction

Parameters:
  • attr_name – string representing the attribute name of the atomic observations/tokens
  • x_spec – tuple of the form (n-gram, range) that is we can specify the n-gram features required in a specific range/window for an observation token attr_name
  • y_spec

    string specifying how to join/combine the features on the X observation level with labels on the Y level.

    Example of passed options would be:
    • one state (i.e. current state) by passing 1-state or
    • two states (i.e. current and previous state) by passing 2-states or
    • one and two states (i.e. mix/combine observation features with one state model and two states models) by passing 1-state:2-states. Higher order models support models with states > 2 such as 3-states and above.
  • template – dictionary that accumulates the generated feature template for all attributes

Example

suppose we have the word attribute referenced by ‘w’ and we need to use the current word with the current label (i.e. unigram of words with the current label) in a range of (0,1)

templateXY = {}
generate_template_XY('w', ('1-gram', range(0, 1)), '1-state', templateXY)

we can also specify a two states/labels features at the Y level

generate_template_XY('w', ('1-gram', range(0, 1)), '1-state:2-states', templateXY)

Note

this can be applied for every attribute name and accumulated in the template dictionary

generate_template_Y(ngram_options)[source]

generate template on the Y labels level

Parameters:ngram_options – string specifying the number of states to be used (i.e. 1-state). It also supports multiple specifications such as 1-state:2-states, where the options are separated by a colon
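
A combined usage sketch of both template generators (assuming generate_template_Y() returns the generated Y template):

from pyseqlab.utilities import TemplateGenerator

tg = TemplateGenerator()
templateXY = {}
tg.generate_template_XY('w', ('1-gram', range(0, 1)), '1-state:2-states', templateXY)
templateY = tg.generate_template_Y('1-state:2-states')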
pyseqlab.utilities.aggregate_weightedsample(w_sample)[source]

represent the randomly picked sample for training/testing

Parameters:w_sample – dictionary representing a random split of the sequences grouped by their length; it is obtained using the weighted_sample() function
pyseqlab.utilities.create_directory(folder_name, directory='current')[source]

create directory/folder (if it does not exist) and returns the path of the directory

Parameters:folder_name – string representing the name of the folder to be created
Keyword Arguments:
 directory – string representing the directory where the folder is to be created; if 'current' (default), the folder will be created in the current working directory
pyseqlab.utilities.delete_directory(directory)[source]
pyseqlab.utilities.delete_file(filepath)[source]
pyseqlab.utilities.generate_datetime_str()[source]

generate string composed of the date and time

pyseqlab.utilities.generate_partition_boundaries(depth_node_map)[source]

generate partitions of the boundaries generated in generate_partitions() function

Parameters:depth_node_map – dictionary that arranges the generated nodes by their depth in the tree; it is constructed using the generate_partitions() function
pyseqlab.utilities.generate_partitions(boundary, L, patt_len, bound_node_map, depth_node_map, parent_node, depth=1)[source]

generate all possible partitions within the range of segment length and model order

it transforms the partitions into a tree of nodes, starting from the root node which is constructed from the boundary argument

Parameters:
  • boundary – tuple (u,v) representing the current boundary in a sequence
  • L – integer representing the maximum length a segment could be constructed
  • patt_len – integer representing the maximum model order
  • bound_node_map – dictionary that keeps track of all possible partitions represented as instances of BoundNode
  • depth_node_map – dictionary that arranges the generated nodes by their depth in the tree
  • parent_node – instance of BoundNode or None in case of the root node
  • depth – integer representing the maximum depth of the tree to be reached before stopping
pyseqlab.utilities.generate_trained_model(modelparts_dir, aextractor_obj)[source]

regenerate trained CRF models using the saved trained model parts/components

Parameters:
  • modelparts_dir – string representing the directory where model parts are saved
  • aextractor_obj – initialized instance of the attribute extractor class such as NERSegmentAttributeExtractor
pyseqlab.utilities.generate_updated_model(modelparts_dir, modelrepr_class, model_class, aextractor_obj, fextractor_class, seqrepresenter_class, ascaler_class=None)[source]

update/regenerate CRF models using the saved parts/components

Parameters:
  • modelparts_dir – string representing the directory where model parts are saved
  • modelrepr_class – name of the model representation class to be used which has suffix ModelRepresentation such as HOCRFADModelRepresentation
  • model_class – name of the CRF model class such as HOCRFAD
  • aextractor_obj – initialized instance of the attribute extractor class such as NERSegmentAttributeExtractor
  • fextractor_class – name of the feature extractor class used such as HOFeatureExtractor
  • seqrepresenter_class – name of the sequence representer class such as SeqsRepresenter
  • ascaler_class – name of the attribute scaler class such as AttributeScaler

Note

This function is equivalent to the generate_trained_model() function; however, it requires explicit specification of the arguments (i.e. specifying explicitly the classes to be used)

pyseqlab.utilities.get_conll00()[source]
pyseqlab.utilities.group_seqs_by_length(seqs_info)[source]

group sequences by their length

Parameters:seqs_info – dictionary comprising info about the sequences; it has this form {seq_id:{T:length of sequence}}

Note

sequences with a unique length are grouped together as singletons

pyseqlab.utilities.nested_cv(seqs_id, outer_kfold, inner_kfold)[source]

generate nested cross-validation division of sequence ids

pyseqlab.utilities.split_data(seqs_id, options)[source]

utility function for splitting dataset (i.e. training/testing and cross validation)

Parameters:
  • seqs_id – list of processed sequence ids
  • options – dictionary comprising of the options on how to split data

Example

To perform cross validation, we need to specify
  • cross-validation for the method
  • the number of folds for the k_fold
options = {'method':'cross_validation',
           'k_fold':number
          }
To perform random splitting, we need to specify
  • random for the method
  • number of splits for the num_splits
  • size of the training set in percentage for the trainset_size
options = {'method':'random',
           'num_splits':number,
           'trainset_size':percentage
          }
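
For instance (seqs_id is a hypothetical list of processed sequence ids):

# 5-fold cross-validation
data_split = split_data(seqs_id, {'method': 'cross_validation', 'k_fold': 5})

# 3 random splits with 80% of the sequences used for training
data_split = split_data(seqs_id, {'method': 'random',
                                  'num_splits': 3,
                                  'trainset_size': 80})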
pyseqlab.utilities.vectorized_logsumexp(vec)[source]

vectorized version of log sum exponential operation

Parameters:vec – numpy vector where entries are in the log domain
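
Such a function typically relies on the max-shift trick, computing \(\max(v) + \log\sum_i e^{v_i - \max(v)}\) so that the exponentials cannot overflow. A reference sketch of the computation (not necessarily the library's vectorized implementation):

import numpy as np

def logsumexp_sketch(vec):
    # shift by the maximum so np.exp never overflows
    m = np.max(vec)
    return m + np.log(np.sum(np.exp(vec - m)))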
pyseqlab.utilities.weighted_sample(grouped_seqs, trainset_size)[source]

get a random split of the grouped sequences

Parameters:
  • grouped_seqs – dictionary of the sequences grouped by their length; it is obtained using the group_seqs_by_length() function
  • trainset_size – integer representing the size of the training set in percentage

pyseqlab.workflow module

@author: ahmed allam <ahmed.allam@yale.edu>

class pyseqlab.workflow.GenericTrainingWorkflow(aextractor_obj, fextractor_obj, feature_filter_obj, model_repr_class, model_class, root_dir)[source]

Bases: object

generic training workflow for building and training CRF models

Parameters:
  • aextractor_obj – initialized instance of GenericAttributeExtractor class/subclass
  • fextractor_obj – initialized instance of FeatureExtractor class/subclass
  • feature_filter_obj – None or an initialized instance of FeatureFilter class
  • model_repr_class – a CRFs model representation class such as HOCRFADModelRepresentation
  • model_class – a CRFs model class such as HOCRFAD
  • root_dir – string representing the directory/path where working directory will be created
aextractor_obj

initialized instance of GenericAttributeExtractor class/subclass

fextractor_obj

initialized instance of FeatureExtractor class/subclass

feature_filter_obj

None or an initialized instance of FeatureFilter class

model_repr_class

a CRFs model representation class such as HOCRFADModelRepresentation

model_class

a CRFs model class such as HOCRFAD

root_dir

string representing the directory/path where working directory will be created
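
Putting the pieces together, a hedged construction sketch (the module paths, the HOFeatureExtractor signature, and the templateXY/templateY variables produced by TemplateGenerator are assumptions based on the rest of the library):

from pyseqlab.attributes_extraction import NERSegmentAttributeExtractor
from pyseqlab.features_extraction import HOFeatureExtractor
from pyseqlab.ho_crf_ad import HOCRFAD, HOCRFADModelRepresentation
from pyseqlab.workflow import GenericTrainingWorkflow

attr_extractor = NERSegmentAttributeExtractor()
f_extractor = HOFeatureExtractor(templateXY, templateY, attr_extractor.attr_desc)
workflow = GenericTrainingWorkflow(attr_extractor, f_extractor, None,
                                   HOCRFADModelRepresentation, HOCRFAD,
                                   'working_dir')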

build_crf_model(seqs_id, folder_name, load_info_fromdisk=10, full_parsing=True)[source]
build_seqsinfo_from_seqfile(seq_file, data_parser_options, num_seqs=inf)[source]

prepares and processes sequences to disk and returns an info dictionary about the parsed sequences

Parameters:
  • seq_file – string representing the path to the sequence file
  • data_parser_options – dictionary containing options to be passed to read_file() method of DataFileParser class
  • num_seqs – integer, maximum number of sequences to read from the file (default numpy.inf, meaning read the whole file)
build_seqsinfo_from_seqs(seqs)[source]

prepares and processes sequences to disk and returns an info dictionary about the parsed sequences

Parameters:seqs – list of sequences that are instances of SequenceStruct class
get_learned_crf(savedmodel_dir)[source]

revive learned/trained model

static get_seqs_from_file(seq_file, data_parser, data_parser_options)[source]

read sequences from a file

Parameters:
  • seq_file – string representing the path to the sequence file
  • data_parser – instance of DataFileParser class
  • data_parser_options – dictionary containing options to be passed to read_file() method of DataFileParser class
static map_pred_to_ref_seqs(seqs_pred)[source]
seq_parsing_workflow(split_options, **kwargs)[source]

preparing and parsing sequences to be later used in the learning framework

static split_dataset(seqs_info, split_options)[source]

splits dataset for learning and testing

train_model(trainseqs_id, crf_model, optimization_options)[source]

train a model and return the directory of the trained model

traineval_folds(data_split, **kwargs)[source]

train and evaluate model on different dataset splits

use_model(savedmodel_dir, options)[source]

use trained model for decoding and performance measure evaluation

verify_template()[source]

verifying template – sanity check

class pyseqlab.workflow.TrainingWorkflow(template_y, template_xy, model_repr_class, model_class, fextractor_class, aextractor_class, scaling_method, optimization_options, root_dir, filter_obj=None)[source]

Bases: object

general training workflow

Note

It is highly recommended to use the GenericTrainingWorkflow class instead

Warning

This class will be deprecated ...

eval_model(savedmodel_info, eval_seqs, eval_filename, dec_seqs_filename, sep=' ')[source]
map_pred_to_ref_seqs(seqs_pred)[source]
seq_parsing_workflow(seqs, split_options)[source]

preparing sequences to be used in the learning framework

split_dataset(seqs_info, split_options)[source]
train_model(trainseqs_id, crf_model)[source]

train a model and return the directory of the trained model

traineval_folds(data_split, meval=True, sep=' ')[source]

train and evaluate model on different dataset splits

verify_template()[source]

verifying template – sanity check

class pyseqlab.workflow.TrainingWorkflowIterative(template_y, template_xy, model_repr_class, model_class, fextractor_class, aextractor_class, scaling_method, ascaler_class, optimization_options, root_dir, data_parser_options, filter_obj=None)[source]

Bases: object

general training workflow that support reading/preparing large training sets

Note

It is highly recommended to use the GenericTrainingWorkflow class instead

Warning

This class will be deprecated ...

build_seqsinfo(seq_file)[source]
eval_model(savedmodel_dir, options)[source]
get_learned_crf(savedmodel_dir)[source]
get_seqs_from_file(seq_file)[source]
map_pred_to_ref_seqs(seqs_pred)[source]
seq_parsing_workflow(seq_file, split_options)[source]

preparing sequences to be used in the learning framework

split_dataset(seqs_info, split_options)[source]
train_model(trainseqs_id, crf_model)[source]

train a model and return the directory of the trained model

traineval_folds(data_split, **kwargs)[source]

train and evaluate model on different dataset splits

verify_template()[source]

verifying template – sanity check

Module contents