# importing and defining relevant directories
import sys
import os
# pyseqlab root directory
pyseqlab_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
# print("pyseqlab cloned dir:", pyseqlab_dir)
# inserting the pyseqlab directory to pythons system path
# if pyseqlab is already installed this could be commented out
sys.path.insert(0, pyseqlab_dir)
# current directory (tutorials)
tutorials_dir = os.path.join(pyseqlab_dir, 'tutorials')
# print("tutorials_dir:", tutorials_dir)
dataset_dir = os.path.join(tutorials_dir, 'datasets', 'conll2000')
# print("dataset_dir:", dataset_dir)
# to use for customizing the display/format of the cells
from IPython.core.display import HTML
with open(os.path.join(tutorials_dir, 'pseqlab_base.css')) as f:
    css = "".join(f.readlines())
HTML(css)
In this tutorial, we will learn about:
Reminder: To work with this tutorial interactively, we first need to clone the PySeqLab package to our disk locally and then navigate to [cloned_package_dir]/tutorials, where [cloned_package_dir] is the path to the cloned package folder (see the tree below).
pyseqlab
├── tutorials
│   ├── datasets
│   │   ├── conll2000
│   │   ├── segments
We suggest going through the sequence_and_input_structure tutorial before continuing through this notebook. We will use constructed sequences (short sentences) while explaining the features/attributes extraction process.
A reminder regarding the terminology we are using and planning to expand:
- sequence: to refer to a list of elements that follow an order
- observation: to refer to an element in the sequence
- track: to refer to different types of observations. In the chunking example, we have a track for the words and another for the part-of-speech
- label/tag: to refer to the outcome/class we want to predict
We start by constructing a sequence $s$ from the sentence "The dog barks.".
# import SequenceStruct class
from pyseqlab.utilities import SequenceStruct
# define the X attribute
X= [{'w':'The'}, {'w':'dog'}, {'w':'barks'}, {'w':'.'}]
# label/tag sequence
Y= ['DT', 'N', 'V', '.']
seq_1 = SequenceStruct(X, Y)
print("seq_1:")
print("X:", seq_1.X)
print("Y:", seq_1.Y)
print("flat_y:", seq_1.flat_y)
An attribute will be defined as a measured property/characteristic of an observation. An example of attributes is the value of each observation for any given track at each position in the sequence. For our sequence $s$, we specified 'w'
to be the name of the words track and hence we will obtain the following attributes at each position:
attribute   w[0]=The   w[0]=dog   w[0]=barks   w[0]=.
position    1          2          3            4
To facilitate the attributes extraction process, PySeqLab provides a GenericAttributeExtractor class in the attributes_extraction module.
Using the GenericAttributeExtractor class, we can extract attributes from the given observations tracks. The constructor takes a dictionary comprising the description of the attributes we aim to extract. For each track, the description dictionary includes:
- the track name as key (i.e. 'w' or 'word', referring to the name used for the words track in sequence $s$)
- description (optional): a description of the track
- encoding (obligatory): specifies whether the attribute is categorical (discrete/nominal) or continuous ($\in \mathbb{R}$)
For example, the attribute description dictionary for extracting the value of the observations in the words track at each position is defined by:
attr_desc = {'w':{'description': 'word observation track', 'encoding':'categorical' } }
The instance of the GenericAttributeExtractor class will extract the target attributes and store them in the seg_attr instance attribute of the sequence. The cell below demonstrates this process through an example.
from pyseqlab.attributes_extraction import GenericAttributeExtractor
# print the doc string of the class
print(GenericAttributeExtractor.__doc__)
X= [{'w':'The'}, {'w':'dog'}, {'w':'barks'}, {'w':'.'}]
# label/tag sequence
Y= ['DT', 'N', 'V', '.']
# create a sequence
seq_1 = SequenceStruct(X, Y)
# attribute description dictionary -- using only the word observation track
attr_desc = {'w':{'description': 'word observation track',
'encoding':'categorical'
}
}
# initialize the attribute extractor instance
generic_attr_extractor = GenericAttributeExtractor(attr_desc)
print("attr_desc {}".format(generic_attr_extractor.attr_desc))
print("-"*40)
# extract attributes
generic_attr_extractor.generate_attributes(seq_1, seq_1.get_y_boundaries())
print("seq_1:")
print(seq_1)
print("extracted attributes saved in seq_1.seg_attr:")
print(seq_1.seg_attr)
# for boundary, seg_attr in seq_1.seg_attr.items():
# print("boundary {}".format(boundary))
# print("attributes {}".format(seg_attr))
Going beyond the given observations tracks (i.e. the ones provided in a file or part of the constructed sequence), we can compute/derive new attributes from those given tracks (such as computing new attributes based on the words track). We can compute the number of characters in a word, the shape and degenerate shape of the word, indicators specifying if the word is capitalized, and many others. The different types of computed/derived attributes are equivalent to forming new tracks. To derive new attributes, we need to:
- subclass the GenericAttributeExtractor class,
- override the generate_attributes(args) method, and
- define the description dictionary of the new attributes in our subclass of the GenericAttributeExtractor class.
# my GenericAttributeExtractor subclass
class MySeqAttributeExtractor(GenericAttributeExtractor):
    """class implementing observation functions that generates attributes from word tokens/observations

       Args:
           attr_desc: dictionary defining the atomic observation/attribute names including
                      the encoding of such attribute (i.e. {continuous, categorical})

       Attributes:
           attr_desc: dictionary defining the atomic observation/attribute names including
                      the encoding of such attribute (i.e. {continuous, categorical})
           seg_attr: dictionary comprising the extracted attributes per each boundary of a sequence
    """
    def __init__(self):
        attr_desc = self.generate_attributes_desc()
        super().__init__(attr_desc)

    def generate_attributes_desc(self):
        """define attributes by including description and encoding of each extracted/computed observation attribute
        """
        attr_desc = {}
        attr_desc['w'] = {'description':'the word/token',
                          'encoding':'categorical'
                         }
        attr_desc['shape'] = {'description':'the shape of the word',
                              'encoding':'categorical'
                             }
        attr_desc['shaped'] = {'description':'the compressed/degenerated form/shape of the word',
                               'encoding':'categorical'
                              }
        attr_desc['numchars'] = {'description':'number of characters in a word',
                                 'encoding':'continuous'
                                }
        return(attr_desc)

    def generate_attributes(self, seq, boundaries):
        """generate attributes of the sequence observations in a specified list of boundaries

           Args:
               seq: a sequence instance of :class:`SequenceStruct`
               boundaries: list of boundaries [(1,1), (2,2),...,]

           .. note::
              the generated attributes are saved first in :attr:`seg_attr` and then passed to
              the **`seq.seg_attr`**. In other words, at the end :attr:`seg_attr` is always cleared
        """
        X = seq.X
        observed_attrnames = list(X[1].keys() & self.attr_desc.keys())
        # segment attributes dictionary
        self.seg_attr = {}
        new_boundaries = []
        # create segments from observations using the provided boundaries
        for boundary in boundaries:
            if(boundary not in seq.seg_attr):
                self._create_segment(X, boundary, observed_attrnames)
                new_boundaries.append(boundary)
        # print("seg_attr {}".format(self.seg_attr))
        # print("new_boundaries {}".format(new_boundaries))
        if(self.seg_attr):
            for boundary in new_boundaries:
                self.get_shape(boundary)
                self.get_degenerateshape(boundary)
                self.get_num_chars(boundary)
            # save generated attributes in seq
            seq.seg_attr.update(self.seg_attr)
            # print('saved attribute {}'.format(seq.seg_attr))
            # clear the instance variable seg_attr
            self.seg_attr = {}
        return(new_boundaries)

    def get_shape(self, boundary):
        """get shape of a word

           Args:
               boundary: tuple (u,v) that marks beginning and end of a word
        """
        segment = self.seg_attr[boundary]['w']
        res = ''
        for char in segment:
            if char.isupper():
                res += 'A'
            elif char.islower():
                res += 'a'
            elif char.isdigit():
                res += 'D'
            else:
                res += '_'
        self.seg_attr[boundary]['shape'] = res

    def get_degenerateshape(self, boundary):
        """get degenerate shape of a word

           Args:
               boundary: tuple (u,v) that marks beginning and end of a word
        """
        segment = self.seg_attr[boundary]['shape']
        track = ''
        for char in segment:
            if not track or track[-1] != char:
                track += char
        self.seg_attr[boundary]['shaped'] = track

    def get_num_chars(self, boundary, filter_out = " "):
        """get the number of characters in a word

           Args:
               boundary: tuple (u,v) that marks beginning and end of a word
               filter_out: string, the default separator between attributes

           .. note::
              the feature value of a continuous attribute is of type float
        """
        segment = self.seg_attr[boundary]['w']
        filtered_segment = segment.split(sep = filter_out)
        num_chars = 0.0
        for entry in filtered_segment:
            num_chars += len(entry)
        self.seg_attr[boundary]['numchars'] = num_chars
# initialize my new attribute extractor
my_attr_extractor = MySeqAttributeExtractor()
print("attr_desc of MySeqAttributeExtractor instance")
print(my_attr_extractor.attr_desc)
# we use our created sequence (seq_1).
# But first we need to clear the seg_attr that was filled by our GenericAttributeExtractor (see the previous cell)
seq_1.seg_attr.clear()
my_attr_extractor.generate_attributes(seq_1, seq_1.get_y_boundaries())
print("-"*40)
print("seq_1")
print(seq_1)
print("extracted attributes saved in seq_1.seg_attr:")
print(seq_1.seg_attr)
As shown above, the computed attributes (number of characters, shape of the word) are stored in the sequence's seg_attr instance attribute. Another important point to keep in mind is that continuous attributes (i.e. numchars) are assigned feature values of type float, whereas categorical attributes (i.e. shape, shaped, w) are assigned feature values of type int. This distinction matters only when we plan to use feature filters (more on this in the feature filter section); otherwise, the type of the feature value is of no importance.
Now that we have defined attributes, we can move to features. A feature is defined (in our context and to avoid confusion) as a measured property/characteristic of observations and labels that will be used to build a conditional random fields (CRFs) model with an associated parameter to estimate. Well, this is not clear enough and sounds a lot like an attribute, isn't it? I agree that features and attributes are very similar, but in our terminology attributes refer only to the measured properties/characteristics of the observations tracks, while features are generally constructed using the following options:
1. attributes combined with the Y labels and/or the transitions among them
2. the Y labels and/or the transitions among them only
A combination of the above two options will lead to generating different CRFs models whose features' parameters we can train/estimate during the learning phase (consult the crf_model_building tutorial for more info).
Question: What could be used as features?
Answer: Everything we can think of could be used as features (a good analogy would be kitchen sink).
The first features we can think of are the values of the attributes at each position in the sequence combined with the corresponding label/tag. Those types of features are called node features. For our sequence $s$, we specified '$w$' to be the name of the words track, and hence the features resulting from combining those attributes with labels at each position will be:
feature    w[0]=The, DT   w[0]=dog, N   w[0]=barks, V   w[0]=., .
position   1              2             3               4
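As a plain-Python illustration (not PySeqLab code — the formatting string is our own), the node features in the table above can be enumerated by pairing each position's word attribute with its label:

```python
# Sketch: enumerate node features by pairing the word attribute at each
# position with the corresponding label.
X = [{'w': 'The'}, {'w': 'dog'}, {'w': 'barks'}, {'w': '.'}]
Y = ['DT', 'N', 'V', '.']

node_features = ["w[0]={}, {}".format(x['w'], y) for x, y in zip(X, Y)]
print(node_features)
```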
Similarly, we can add features that use only the Y labels at each position:
feature    w[0]=The, DT   w[0]=dog, N   w[0]=barks, V   w[0]=., .
           DT             N             V               .
position   1              2             3               4
Likewise, we can add the label transitions too (-- means not applicable):
feature    w[0]=The, DT   w[0]=dog, N   w[0]=barks, V   w[0]=., .
           DT             N             V               .
           --             DT|N          N|V             V|.
           --             --            DT|N|V          N|V|.
           --             --            --              DT|N|V|.
position   1              2             3               4
Following the same logic, we can apply this process to the extracted/computed attributes (such as the shape of the word, number of characters and all the other attributes we computed earlier):
feature    w[0]=The, DT       w[0]=dog, N       w[0]=barks, V      w[0]=., .
           numchars[0]=3, DT  numchars[0]=3, N  numchars[0]=5, V   numchars[0]=1, .
           shape[0]=Aaa, DT   shape[0]=aaa, N   shape[0]=aaaaa, V  shape[0]=_, .
           shaped[0]=Aa, DT   shaped[0]=a, N    shaped[0]=a, V     shaped[0]=_, .
           DT                 N                 V                  .
           --                 DT|N              N|V                V|.
           --                 --                DT|N|V             N|V|.
           --                 --                --                 DT|N|V|.
position   1                  2                 3                  4
As can be seen, we have endless options to generate features. Also note that the different types of attributes (i.e. tracks) always have a position indicator appended to them (such as w[0]), which leads us to an obvious question:
Question: Could we use attributes from different positions to construct features?
Answer: Yes.
To explain this idea further, we will use a set of examples/scenarios.
Suppose that at each position, we need to consider the current word attribute (i.e. the attribute of the word track at the current position) and the one at the next position (i.e. forward position) while joining them with the current Y label (i.e. Y[0]).
Question: Could we extract these features based on the latter specification?
Answer: Yes
feature    w[0]=The, DT           w[0]=dog, N             w[0]=barks, V         w[0]=., .
           w[0]=The,w[1]=dog, DT  w[0]=dog,w[1]=barks, N  w[0]=barks,w[1]=., V  --
           numchars[0]=3, DT      numchars[0]=3, N        numchars[0]=5, V      numchars[0]=1, .
           shape[0]=Aaa, DT       shape[0]=aaa, N         shape[0]=aaaaa, V     shape[0]=_, .
           shaped[0]=Aa, DT       shaped[0]=a, N          shaped[0]=a, V        shaped[0]=_, .
           DT                     N                       V                     .
           --                     DT|N                    N|V                   V|.
           --                     --                      DT|N|V                N|V|.
           --                     --                      --                    DT|N|V|.
position   1                      2                       3                     4
Question: So, how about considering the attributes before each position (i.e. use history), could we do that?
Answer: Yes
feature    w[0]=The, DT           w[0]=dog, N              w[0]=barks, V            w[0]=., .
           w[0]=The,w[1]=dog, DT  w[0]=dog,w[1]=barks, N   w[0]=barks,w[1]=., V     --
           --                     w[-1]=The,w[0]=dog, N    w[-1]=dog,w[0]=barks, V  w[-1]=barks,w[0]=., .
           numchars[0]=3, DT      numchars[0]=3, N         numchars[0]=5, V         numchars[0]=1, .
           shape[0]=Aaa, DT       shape[0]=aaa, N          shape[0]=aaaaa, V        shape[0]=_, .
           shaped[0]=Aa, DT       shaped[0]=a, N           shaped[0]=a, V           shaped[0]=_, .
           DT                     N                        V                        .
           --                     DT|N                     N|V                      V|.
           --                     --                       DT|N|V                   N|V|.
           --                     --                       --                       DT|N|V|.
position   1                      2                        3                        4
Question: How about considering both attributes, the before and after (i.e. context window)?
Answer: Yes
feature    w[0]=The, DT           w[0]=dog, N                        w[0]=barks, V                   w[0]=., .
           w[0]=The,w[1]=dog, DT  w[0]=dog,w[1]=barks, N             w[0]=barks,w[1]=., V            --
           --                     w[-1]=The,w[0]=dog, N              w[-1]=dog,w[0]=barks, V         w[-1]=barks,w[0]=., .
           --                     w[-1]=The,w[0]=dog,w[1]=barks, N   w[-1]=dog,w[0]=barks,w[1]=., V  --
           numchars[0]=3, DT      numchars[0]=3, N                   numchars[0]=5, V                numchars[0]=1, .
           shape[0]=Aaa, DT       shape[0]=aaa, N                    shape[0]=aaaaa, V               shape[0]=_, .
           shaped[0]=Aa, DT       shaped[0]=a, N                     shaped[0]=a, V                  shaped[0]=_, .
           DT                     N                                  V                               .
           --                     DT|N                               N|V                             V|.
           --                     --                                 DT|N|V                          N|V|.
           --                     --                                 --                              DT|N|V|.
position   1                      2                                  3                               4
The choice for combining attributes is not limited to the ones we have presented so far. We can choose to combine attributes at different/arbitrary positions without any restrictions. For example, we could use w[0]=The, w[3]=., DT as a feature while position 1 is our current position. Similarly, we can use w[-3]=The,w[-1]=barks, . as a feature while position 4 is our current position.
Another aspect to consider is related to the choice of associating attributes with the labels. So far we were combining the attributes at each position with their current labels (i.e. corresponding labels at each position).
Question: Now what if we want to combine those attributes with multiple labels, can we do that?
Answer: Yes (with constraints). That is, we can combine attributes with label transitions, such as using the current and the previous labels. Moreover, PySeqLab supports modeling label transitions of higher order (i.e. $\ge$ 2, such as DT|N|V, N|V|., DT|N|V|.). The features that combine the attributes with label transitions are generally called edge features. Below is an example of using the word attributes with first-order label transitions (i.e. using the current and previous label).
feature    w[0]=The, DT           w[0]=dog, N                           w[0]=barks, V                      w[0]=., .
           w[0]=The,w[1]=dog, DT  w[0]=dog,w[1]=barks, N                w[0]=barks,w[1]=., V               --
           --                     w[-1]=The,w[0]=dog, N                 w[-1]=dog,w[0]=barks, V            w[-1]=barks,w[0]=., .
           --                     w[-1]=The,w[0]=dog,w[1]=barks, N      w[-1]=dog,w[0]=barks,w[1]=., V     --
           --                     w[0]=dog, DT|N                        w[0]=barks, N|V                    w[0]=., V|.
           --                     w[0]=dog,w[1]=barks, DT|N             w[0]=barks,w[1]=., N|V             --
           --                     w[-1]=The,w[0]=dog, DT|N              w[-1]=dog,w[0]=barks, N|V          w[-1]=barks,w[0]=., V|.
           --                     w[-1]=The,w[0]=dog,w[1]=barks, DT|N   w[-1]=dog,w[0]=barks,w[1]=., N|V   --
           numchars[0]=3, DT      numchars[0]=3, N                      numchars[0]=5, V                   numchars[0]=1, .
           shape[0]=Aaa, DT       shape[0]=aaa, N                       shape[0]=aaaaa, V                  shape[0]=_, .
           shaped[0]=Aa, DT       shaped[0]=a, N                        shaped[0]=a, V                     shaped[0]=_, .
           DT                     N                                     V                                  .
           --                     DT|N                                  N|V                                V|.
           --                     --                                    DT|N|V                             N|V|.
           --                     --                                    --                                 DT|N|V|.
position   1                      2                                     3                                  4
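As a plain-Python sketch (not the PySeqLab implementation — the formatting string is our own), first-order edge features like those in the table above can be formed by pairing the current word attribute with the previous and current labels:

```python
# Sketch: form edge features by combining the current word attribute
# with the first-order label transition Y[-1]|Y[0].
X = [{'w': 'The'}, {'w': 'dog'}, {'w': 'barks'}, {'w': '.'}]
Y = ['DT', 'N', 'V', '.']

edge_features = ['--']  # no previous label at the first position
for t in range(1, len(X)):
    edge_features.append("w[0]={}, {}|{}".format(X[t]['w'], Y[t-1], Y[t]))
print(edge_features)
```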
Moreover, edge features (i.e. combining attributes with label transitions) are supported across all attribute types (i.e. using computed attributes from the word track such as the number of characters 'numchars', and the shape of the word attributes 'shape' and 'shaped').
NB: The different attribute types (i.e. categorical vs. continuous) combine differently depending on their specified type. In other words, if we are considering categorical attributes at previous, forward or any arbitrary positions, the generated features using those attributes will be based on the pattern they form (i.e. at position 1 we have w[0]=The,w[1]=dog, DT). However, if we are dealing with continuous attributes, the generated features will use the sum of those attributes (i.e. at position 1 we have numchars[0,1]=6, DT). Below is an example of using a continuous attribute -- numchars, which is the length/number of characters in a word.
feature    w[0]=The, DT           w[0]=dog, N                           w[0]=barks, V                      w[0]=., .
           w[0]=The,w[1]=dog, DT  w[0]=dog,w[1]=barks, N                w[0]=barks,w[1]=., V               --
           --                     w[-1]=The,w[0]=dog, N                 w[-1]=dog,w[0]=barks, V            w[-1]=barks,w[0]=., .
           --                     w[-1]=The,w[0]=dog,w[1]=barks, N      w[-1]=dog,w[0]=barks,w[1]=., V     --
           --                     w[0]=dog, DT|N                        w[0]=barks, N|V                    w[0]=., V|.
           --                     w[0]=dog,w[1]=barks, DT|N             w[0]=barks,w[1]=., N|V             --
           --                     w[-1]=The,w[0]=dog, DT|N              w[-1]=dog,w[0]=barks, N|V          w[-1]=barks,w[0]=., V|.
           --                     w[-1]=The,w[0]=dog,w[1]=barks, DT|N   w[-1]=dog,w[0]=barks,w[1]=., N|V   --
           numchars[0]=3, DT      numchars[0]=3, N                      numchars[0]=5, V                   numchars[0]=1, .
           numchars[0,1]=6, DT    numchars[0,1]=8, N                    numchars[0,1]=6, V                 --
           --                     numchars[-1,0]=6, N                   numchars[-1,0]=8, V                numchars[-1,0]=6, .
           --                     numchars[-1,0,1]=11, N                numchars[-1,0,1]=9, V              --
           shape[0]=Aaa, DT       shape[0]=aaa, N                       shape[0]=aaaaa, V                  shape[0]=_, .
           shaped[0]=Aa, DT       shaped[0]=a, N                        shaped[0]=a, V                     shaped[0]=_, .
           DT                     N                                     V                                  .
           --                     DT|N                                  N|V                                V|.
           --                     --                                    DT|N|V                             N|V|.
           --                     --                                    --                                 DT|N|V|.
position   1                      2                                     3                                  4
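To illustrate the NB above outside of PySeqLab, here is a minimal sketch (the helper name combine_window and the joining format are our own assumptions, not PySeqLab API) showing how categorical attributes in a window form a pattern while continuous attributes are summed:

```python
# Sketch: combine attribute values over a window of offsets at a position.
# Categorical attributes form a joined pattern; continuous ones are summed.
def combine_window(values, offsets, pos, encoding, name):
    idxs = [pos + o for o in offsets]
    if any(i < 0 or i >= len(values) for i in idxs):
        return None  # window falls outside the sequence boundaries
    if encoding == 'continuous':
        # continuous attributes: sum the values over the window
        return "{}[{}]={}".format(name, ','.join(map(str, offsets)),
                                  sum(values[i] for i in idxs))
    # categorical attributes: pattern of the joined values
    return ','.join("{}[{}]={}".format(name, o, values[i])
                    for o, i in zip(offsets, idxs))

words = ['The', 'dog', 'barks', '.']
numchars = [3.0, 3.0, 5.0, 1.0]
# categorical pattern of the current and next word at position 1
print(combine_window(words, [0, 1], 0, 'categorical', 'w'))       # w[0]=The,w[1]=dog
# sum of numchars over the same window
print(combine_window(numchars, [0, 1], 0, 'continuous', 'numchars'))  # numchars[0,1]=6.0
```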
After we have expanded our terminology, it is time to show how we can:
PySeqLab provides two essential classes:
- the TemplateGenerator class that builds templates for feature generation
- the FeatureExtractor class that extracts features from sequences/segments using the generated templates
Starting with the TemplateGenerator class, two main methods are used:
- generate_template_XY(args) defines a template for joining/combining the attributes (i.e. using current and any arbitrary position attributes) and the order of the labels pattern to associate with (i.e. using current labels and label transitions with varying order) (see option 1)
- generate_template_Y(args) defines a template for the order of the labels pattern only (i.e. using current labels and/or label transitions only) without involving the attributes from the tracks (see option 2)
Using our sequence $s$ with the attributes we extracted, we will demonstrate how to define feature templates and consequently the features extracted using those templates.
# import template generator
from pyseqlab.utilities import TemplateGenerator
def experiment_templates_XY(track_attr_name, template_XY):
    # current attribute at each position combined with the current label
    template_gen.generate_template_XY(track_attr_name, ('1-gram', range(0,1)), '1-state', template_XY)
    print("template_XY: current observation, current label = w[0], Y[0]")
    print(template_XY)
    template_XY.clear()
    # current and next/forward attribute at each position combined with the current label
    template_gen.generate_template_XY(track_attr_name, ('2-gram', range(0,2)), '1-state', template_XY)
    print("template_XY: current observation, next observation, label = w[0],w[1], Y[0]")
    print(template_XY)
    template_XY.clear()
    # previous and current attribute at each position combined with the current label
    template_gen.generate_template_XY(track_attr_name, ('2-gram', range(-1,1)), '1-state', template_XY)
    print("template_XY: previous observation, current observation, label = w[-1],w[0], Y[0]")
    print(template_XY)
    template_XY.clear()
    # previous, current and next attribute at each position combined with the current label
    template_gen.generate_template_XY(track_attr_name, ('3-gram', range(-1,2)), '1-state', template_XY)
    print("template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]")
    print(template_XY)
    template_XY.clear()
    # previous, current and next attribute at each position combined with the current label, and with the previous and current label
    template_gen.generate_template_XY(track_attr_name, ('3-gram', range(-1,2)), '1-state:2-states', template_XY)
    print("template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]\ntemplate_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[-1]Y[0]")
    print(template_XY)
    template_XY.clear()
    # get unigrams and bigrams in a centered window of size 3 at every position in the sequence and combine them with the current label
    template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
    print("template_XY: unigrams and bigrams in centered window of size 3, label")
    print(template_XY)
    template_XY.clear()
    # combine all the previous specifications/templates
    template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
    template_gen.generate_template_XY(track_attr_name, ('3-grams', range(-1,2)), '1-state:2-states', template_XY)
    print("template_XY: all previous templates combined")
    print(template_XY)
def experiment_templates_Y():
    # generating templates based on the current labels
    template_Y = template_gen.generate_template_Y('1-state')
    print("template_Y: current label")
    print(template_Y)
    template_Y.clear()
    # generating templates based on the current label and first-order label transitions
    template_Y = template_gen.generate_template_Y('1-state:2-states')
    print("template_Y: current label and first order label transitions")
    print(template_Y)
    # empty template -- to investigate further
    template_Y = template_gen.generate_template_Y('0-state')
    print("template_Y: empty")
    print(template_Y)
# create a template generator
template_gen = TemplateGenerator()
# create a dictionary to define template_XY
template_XY = {}
# generating template for word attributes (i.e. word track)
track_attr_name = 'w'
# run experiment for template_XY
experiment_templates_XY(track_attr_name, template_XY)
# run experiment for template_Y
experiment_templates_Y()
We can notice from the above cell the iterative process of adding new instructions/templates on how to combine the attributes and the labels using the generate_template_XY method. These instructions are saved in template_XY.
The method signature is:
generate_template_XY(track_name, (ngram, window), states, template_XY)
To understand how the method operates and what arguments it takes, we need to understand the ngram and window concepts.
Our definition:
- A window is a specified range (u, v) that is constructed and applied at each position in the sequence. By specifying the boundaries of the window (i.e. u and v), we define how the window is constructed.
- An ngram is a chunk/segment of n consecutive elements in a given sequence.
For example, in our sentence $s$ = "The dog barks.", we can define the following ngrams:
- unigram (1-gram): The, dog, barks, .
- bigram (2-grams): The dog, dog barks, barks .
- trigram (3-grams): The dog barks, dog barks .
- four-gram (4-grams): The dog barks .
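To make the window/ngram interplay concrete, here is a small illustrative helper (window_ngrams is our own sketch, not part of the PySeqLab API) that extracts the ngrams falling inside a window at each position of the sequence:

```python
# Sketch: for each position, slide a window of offsets over the sequence
# and collect the n-grams whose elements all fall inside the sequence.
def window_ngrams(seq, n, window):
    out = []
    for pos in range(len(seq)):
        grams = []
        offsets = [pos + o for o in window]
        # slide an n-sized chunk over the window, skipping out-of-bounds chunks
        for start in range(len(offsets) - n + 1):
            chunk = offsets[start:start + n]
            if chunk[0] >= 0 and chunk[-1] < len(seq):
                grams.append(' '.join(seq[i] for i in chunk))
        out.append(grams)
    return out

words = ['The', 'dog', 'barks', '.']
# bigrams (2-grams) in a centered window of size 3 (offsets -1, 0, 1)
print(window_ngrams(words, 2, range(-1, 2)))
```

At position 1, for instance, the window covers offsets -1, 0 and 1 but the -1 offset falls outside the sequence, so only the bigram "The dog" is produced there.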
Those two concepts are the building blocks for specifying a feature extraction template. By specifying a window (range) and the ngrams requested for every track (i.e. word track w), we can extract the specified ngrams within the defined window at each position in the sequence. Below, we show multiple examples for generating templates (denoted by template_XY).
Specification 1: For the words track, get the current attribute and combine it with the current label. This translates to:
- window of size 1 -> range(0,1)
- unigrams -> '1-gram'
- current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 1 (a window that includes only the attribute itself). Then we extract unigrams in the specified window.
[Code]
template_gen.generate_template_XY('w', ('1-gram', range(0,1)), '1-state', template_XY)
[Output]
template_XY: current observation, current label = w[0], Y[0]
{'w': {(0,): ((0,),)}}
Specification 2: For the words track, get the current and next/forward attribute and combine it with the current label. This translates to:
- window of size 2 -> range(0,2)
- bigrams -> '2-grams'
- current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 2 (a window that includes the current attribute and the next/future one). Then we extract bigrams (2-grams) in the specified window. This ensures that the attributes at both positions are used at once (i.e. w[0] and w[1]).
[Code]
template_gen.generate_template_XY('w', ('2-grams', range(0,2)), '1-state', template_XY)
[Output]
template_XY: current observation, next observation, label = w[0],w[1], Y[0]
{'w': {(0, 1): ((0,),)}}
Specification 3: For the words track, get the previous and current attribute and combine it with the current label. This translates to:
- window of size 2 -> range(-1,1)
- bigrams -> '2-grams'
- current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 2 (a window that includes the previous attribute and the current one). Then we extract bigrams (2-grams) in the specified window. This ensures that the attributes at both positions are used at once (i.e. w[-1] and w[0]).
[Code]
template_gen.generate_template_XY('w', ('2-grams', range(-1,1)), '1-state', template_XY)
[Output]
template_XY: previous observation, current observation, label = w[-1],w[0], Y[0]
{'w': {(-1, 0): ((0,),)}}
Specification 4: For the words track, get the previous, current and next attribute and combine it with the current label. This translates to:
- window of size 3 -> range(-1,2)
- trigrams -> '3-grams'
- current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 3 (a window that includes the previous, current and next attribute). Then we extract trigrams (3-grams) in the specified window. This ensures that the three attributes are used at once (i.e. w[-1], w[0], w[1]).
[Code]
template_gen.generate_template_XY('w', ('3-grams', range(-1,2)), '1-state', template_XY)
[Output]
template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]
{'w': {(-1, 0, 1): ((0,),)}}
Specification 5: For the words track, get the previous, current and next attribute and combine it with the current label and the current and previous label. This translates to:
- window of size 3 -> range(-1,2)
- trigrams -> '3-grams'
- current label and first-order label transition -> '1-state:2-states'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 3 (a window that includes the previous, current and next attribute). Then, we extract trigrams (3-grams) in the specified window. This ensures that the three attributes are used at once (i.e. w[-1], w[0], w[1]). The combined attributes are joined with the current label (one state) and with the previous and current label (two states).
[Code]
template_gen.generate_template_XY('w', ('3-grams', range(-1,2)), '1-state:2-states', template_XY)
[Output]
template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]
template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[-1]Y[0]
{'w': {(-1, 0, 1): ((0,), (-1, 0))}}
Specification 6: For the words track, get the unigram and bigram attributes in a centered window of size 3 (i.e. centered at each position in the sequence) and combine them with the current label. This translates to:
- window of size 3 -> range(-1,2)
- unigrams and bigrams -> '1-gram:2-grams'
- current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 3 that is centered at the current position. Then we extract unigrams and bigrams within the constructed window.
[Code]
template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
[Output]
template_XY: unigrams and bigrams in centered window of size 3, label
{'w': {(0, 1): ((0,),), (0,): ((0,),), (-1,): ((0,),), (1,): ((0,),), (-1, 0): ((0,),)}}
We can also combine all the previous specifications to get:
[Code]
template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
template_gen.generate_template_XY(track_attr_name, ('3-grams', range(-1,2)), '1-state:2-states', template_XY)
[Output]
template_XY: all previous templates combined
{'w': {(0, 1): ((0,),), (-1, 0, 1): ((0,), (-1, 0)), (0,): ((0,),), (-1, 0): ((0,),), (1,): ((0,),), (-1,): ((0,),)}}
There are many possibilities for creating templates for feature extraction. Although we targeted the attributes in the words track w, the same applies with no restriction to the other computed tracks such as the numchars, shape and shaped tracks.
Similarly, we can specify templates for generating features based on the Y labels only using the generate_template_Y(args) method. The method signature is:
generate_template_Y(states)
Below is another set of examples demonstrating how to construct templates for generating features based on the labels only.
Specification 7: Generate features based on the current labels only. This should model the bias (i.e. prevalence of the labels in our dataset). This translates to:
- current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we extract the current label.
[Code]
template_Y = template_gen.generate_template_Y('1-state')
[Output]
template_Y: current label
{'Y': [(0,)]}
Specification 8: Generate features based on the current labels and the first order label transition only. This should model the bias (i.e. prevalence of the labels in our dataset) and the label/state transitions. This translates to:
- current label and first-order label transitions -> '1-state:2-states'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we extract the current label as well as the previous and current labels.
[Code]
template_Y = template_gen.generate_template_Y('1-state:2-states')
[Output]
template_Y: current label and first order label transitions
{'Y': [(0,), (-1, 0)]}
After specifying both templates (i.e. template_XY and template_Y), it is time to pass them to the FeatureExtractor class, which in turn extracts features from sequences/segments using these generated templates. The FeatureExtractor constructor takes the following arguments:
- template_XY: the generated template for combining attributes in the different tracks with the labels. This will be saved as an instance attribute named template_X.
- template_Y: the generated template for extracting features based on the labels only. This will be saved as an instance attribute named template_Y.
- attr_desc: the attribute description dictionary specifying the name of each track, a description of the track and its type (i.e. categorical or continuous). We already defined the attribute description dictionary when we explained attributes (see the attributes section).
# import template generator
from pyseqlab.utilities import TemplateGenerator
from pyseqlab.features_extraction import FeatureExtractor
# create a template generator
template_gen = TemplateGenerator()
# create a dictionary to define template_XY
template_XY = {}
# generating template for word attributes (i.e. word track)
track_attr_name = 'w'
# current attribute at each position combined with the current label
template_gen.generate_template_XY(track_attr_name, ('1-gram', range(0,1)), '1-state', template_XY)
print("template_XY: current observation, current label = w[0], Y[0]")
print(template_XY)
# generating templates based on the current labels
template_Y = template_gen.generate_template_Y('1-state')
print("template_Y: current label")
print(template_Y)
print("our defined attr_desc that uses the word track only")
print(attr_desc)
print("sequence")
print(seq_1)
# initialize feature extractor
fe = FeatureExtractor(template_XY, template_Y, attr_desc)
extracted_features = fe.extract_seq_features_perboundary(seq_1)
print("extracted features")
print(extracted_features)
template_Y.clear()
template_XY.clear()
print("*"*50)
print()
# use specification 1-6 to generate template_XY and specification 7 to generate template_Y
template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
template_gen.generate_template_XY(track_attr_name, ('3-grams', range(-1,2)), '1-state:2-states', template_XY)
print("template_XY: all previous templates (specification 1-6) combined")
print(template_XY)
# generating templates based on the current labels
template_Y = template_gen.generate_template_Y('1-state')
print("template_Y: current label")
print(template_Y)
print("our defined attr_desc")
print(attr_desc)
print("sequence")
print(seq_1)
# initialize feature extractor
fe = FeatureExtractor(template_XY, template_Y, attr_desc)
extracted_features = fe.extract_seq_features_perboundary(seq_1)
print("extracted features")
print(extracted_features)
template_Y.clear()
template_XY.clear()
Using Specification 1 for template_XY (i.e. get the current attribute and combine it with the current label at each position in the sequence) and Specification 7 for template_Y (i.e. get the current label at each position in the sequence), our feature extractor generates features based on the specified templates using the extract_seq_features_perboundary(seq) method. The method takes as an argument a sequence (i.e. an instance of the SequenceStruct class). See the sequence_and_input_structure tutorial for more info about building sequences.
The generated features are:
extracted features {(2, 2): {'N': {'N': 1, 'w[0]=dog': 1}}, (4, 4): {'.': {'.': 1, 'w[0]=.': 1}}, (3, 3): {'V': {'V': 1, 'w[0]=barks': 1}}, (1, 1): {'DT': {'w[0]=The': 1, 'DT': 1}}}
For each position, we see a dictionary keyed by the current Y label at that position (i.e. Y[0]), whose value is a dictionary of the attributes. Following our earlier notation/representation for the features, the extracted features dictionary is equivalent to:
feature  -->  w[0]=The, DT    w[0]=dog, N    w[0]=barks, V    w[0]=., .
         -->  DT              N              V                .
position      1               2              3                4
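Following our notation, the mapping from template offsets to feature names can be sketched without the library (join_feature is a hypothetical helper of ours, not a PySeqLab function): it joins the attribute values found at the template's relative offsets into names such as w[0]=The or w[-1]|w[0]=The|dog, which are then paired with the current label.

```python
# Illustrative sketch (not PySeqLab code): build 'w[offset]=value' feature
# names for one template offset tuple at a given 1-based position.
X = [{'w': 'The'}, {'w': 'dog'}, {'w': 'barks'}, {'w': '.'}]
Y = ['DT', 'N', 'V', '.']

def join_feature(track, offsets, pos):
    """Join attribute values at pos+offset into one feature name,
    e.g. offsets (0,) at pos 1 -> 'w[0]=The'; returns None when any
    offset falls outside the sequence."""
    parts, values = [], []
    for off in offsets:
        i = pos - 1 + off          # convert 1-based position to 0-based index
        if i < 0 or i >= len(X):
            return None
        parts.append(f'{track}[{off}]')
        values.append(X[i][track])
    return '|'.join(parts) + '=' + '|'.join(values)

# the current-word template (0,) paired with the current label at each position
for pos in range(1, len(X) + 1):
    print(join_feature('w', (0,), pos), Y[pos - 1])
```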
Similarly, if we use the combined Specifications 1-6 to generate template_XY while again using Specification 7 to generate template_Y, we get:
extracted features {(2, 2): {'N': {'w[0]|w[1]=dog|barks': 1, 'w[1]=barks': 1, 'N': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'w[-1]=The': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]=dog': 1}, 'DT|N': {'w[-1]|w[0]|w[1]=The|dog|barks': 1}}, (4, 4): {'.': {'w[-1]|w[0]=barks|.': 1, 'w[-1]=barks': 1, 'w[0]=.': 1, '.': 1}}, (3, 3): {'V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1, 'V': 1, 'w[1]=.': 1, 'w[-1]=dog': 1, 'w[-1]|w[0]=dog|barks': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1}}, (1, 1): {'DT': {'w[1]=dog': 1, 'w[0]|w[1]=The|dog': 1, 'w[0]=The': 1, 'DT': 1}}}
As we can see, the extracted features would translate to the following:
feature  -->  w[0]=The, DT               w[0]=dog, N                           w[0]=barks, V                      w[0]=., .
         -->  w[0]|w[1]=The|dog, DT      w[0]|w[1]=dog|barks, N                w[0]|w[1]=barks|., V               --
         -->  --                         w[-1]|w[0]=The|dog, N                 w[-1]|w[0]=dog|barks, V            w[-1]|w[0]=barks|., .
         -->  --                         w[-1]|w[0]|w[1]=The|dog|barks, N      w[-1]|w[0]|w[1]=dog|barks|., V     --
         -->  w[1]=dog, DT               w[1]=barks, N                         w[1]=., V                          --
         -->  --                         w[-1]=The, N                          w[-1]=dog, V                       w[-1]=barks, .
         -->  --                         w[-1]|w[0]|w[1]=The|dog|barks, DT|N   w[-1]|w[0]|w[1]=dog|barks|., N|V   --
         -->  DT                         N                                     V                                  .
position      1                          2                                     3                                  4
Below is another example demonstrating how to use all tracks (i.e. w, numchars, shape and shaped), where the goal is to replicate the features requested in this section. See this code section for generating the templates and the corresponding features.
extracted features {(2, 2): {'N': {'numchars[-1]|numchars[0]|numchars[1]': 11, 'numchars[-1]|numchars[0]': 6, 'w[0]|w[1]=dog|barks': 1, 'shaped[0]=a': 1, 'shape[0]=aaa': 1, 'numchars[0]|numchars[1]': 8, 'numchars[0]': 3, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]=dog': 1}, 'DT|N': {'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'w[0]|w[1]=dog|barks': 1, 'w[-1]|w[0]=The|dog': 1, 'DT|N': 1, 'w[0]=dog': 1}}, (4, 4): {'.': {'w[-1]|w[0]=barks|.': 1, 'shape[0]=_': 1, 'shaped[0]=_': 1, '.': 1, 'numchars[-1]|numchars[0]': 6, 'numchars[0]': 1, 'w[0]=.': 1}, 'V|.': {'w[-1]|w[0]=barks|.': 1, 'w[0]=.': 1, 'V|.': 1}, 'DT|N|V|.': {'DT|N|V|.': 1}, 'N|V|.': {'N|V|.': 1}}, (3, 3): {'V': {'numchars[-1]|numchars[0]|numchars[1]': 9, 'numchars[-1]|numchars[0]': 8, 'shape[0]=aaaaa': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'V': 1, 'shaped[0]=a': 1, 'w[0]|w[1]=barks|.': 1, 'numchars[0]|numchars[1]': 6, 'numchars[0]': 5, 'w[-1]|w[0]=dog|barks': 1, 'w[0]=barks': 1}, 'N|V': {'N|V': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}, 'DT|N|V': {'DT|N|V': 1}}, (1, 1): {'DT': {'shape[0]=Aaa': 1, 'w[0]|w[1]=The|dog': 1, 'DT': 1, 'numchars[0]': 3, 'numchars[0]|numchars[1]': 6, 'shaped[0]=Aa': 1, 'w[0]=The': 1}}}
The extracted features are equivalent to the ones we were targeting (see below as a reminder):
feature  -->  w[0]=The, DT            w[0]=dog, N                           w[0]=barks, V                      w[0]=., .
         -->  w[0]|w[1]=The|dog, DT   w[0]|w[1]=dog|barks, N                w[0]|w[1]=barks|., V               --
         -->  --                      w[-1]|w[0]=The|dog, N                 w[-1]|w[0]=dog|barks, V            w[-1]|w[0]=barks|., .
         -->  --                      w[-1]|w[0]|w[1]=The|dog|barks, N      w[-1]|w[0]|w[1]=dog|barks|., V     --
         -->  --                      w[0]=dog, DT|N                        w[0]=barks, N|V                    w[0]=., V|.
         -->  --                      w[0]|w[1]=dog|barks, DT|N             w[0]|w[1]=barks|., N|V             --
         -->  --                      w[-1]|w[0]=The|dog, DT|N              w[-1]|w[0]=dog|barks, N|V          w[-1]|w[0]=barks|., V|.
         -->  --                      w[-1]|w[0]|w[1]=The|dog|barks, DT|N   w[-1]|w[0]|w[1]=dog|barks|., N|V   --
         -->  numchars[0]=3, DT       numchars[0]=3, N                      numchars[0]=5, V                   numchars[0]=1, .
         -->  numchars[0,1]=6, DT     numchars[0,1]=8, N                    numchars[0,1]=6, V                 --
         -->  --                      numchars[-1,0]=6, N                   numchars[-1,0]=8, V                numchars[-1,0]=6, .
         -->  --                      numchars[-1,0,1]=11, N                numchars[-1,0,1]=9, V              --
         -->  shape[0]=Aaa, DT        shape[0]=aaa, N                       shape[0]=aaaaa, V                  shape[0]=_, .
         -->  shaped[0]=Aa, DT        shaped[0]=a, N                        shaped[0]=a, V                     shaped[0]=_, .
         -->  DT                      N                                     V                                  .
         -->  --                      DT|N                                  N|V                                V|.
         -->  --                      --                                    DT|N|V                             N|V|.
         -->  --                      --                                    --                                 DT|N|V|.
position      1                       2                                     3                                  4
def generate_template_alltracks():
template_XY = {}
template_Y = {}
# generating template for words track (w)
template_gen.generate_template_XY('w', ('1-gram', range(0,1)), '1-state:2-states', template_XY)
template_gen.generate_template_XY('w', ('2-grams:3-grams', range(-1,2)), '1-state:2-states', template_XY)
# generating template for numchars track
template_gen.generate_template_XY('numchars', ('1-gram', range(0,1)), '1-state', template_XY)
template_gen.generate_template_XY('numchars', ('2-grams:3-grams', range(-1,2)), '1-state', template_XY)
# generating template for shape track
template_gen.generate_template_XY('shape', ('1-gram', range(0,1)), '1-state', template_XY)
# generating template for shaped track
template_gen.generate_template_XY('shaped', ('1-gram', range(0,1)), '1-state', template_XY)
print("template_XY: using all tracks")
print(template_XY)
# generating templates based on label transitions
template_Y = template_gen.generate_template_Y('1-state:2-states:3-states:4-states')
print("template_Y: up to third order label transitions")
print(template_Y)
print("-"*40)
return(template_XY, template_Y)
def generate_attributes_alltracks():
# use MyAttributeExtractor instance
print("my_attr_extractor.attr_desc")
print(my_attr_extractor.attr_desc)
print("-"*40)
print("sequence")
print(seq_1)
# we clear the seg_attr to generate new attributes based on the attr_desc in MyAttributeExtractor class
seq_1.seg_attr.clear()
# generate attributes using subclassed attribute extractor
my_attr_extractor.generate_attributes(seq_1, seq_1.get_y_boundaries())
print("extracted attributes saved in seq_1.seg_attr")
print(seq_1.seg_attr)
print("-"*40)
# get feature templates for all tracks
template_XY, template_Y = generate_template_alltracks()
# generate attributes for all tracks
generate_attributes_alltracks()
# initialize feature extractor
fe = FeatureExtractor(template_XY, template_Y, my_attr_extractor.attr_desc)
extracted_features = fe.extract_seq_features_perboundary(seq_1)
print("extracted features")
print(extracted_features)
print("-"*40)
seq_1.seg_attr.clear()
The FeatureExtractor class we introduced in a previous section has two subclasses:
- HOFeatureExtractor: a higher order feature extractor
- FOFeatureExtractor: a first order feature extractor
As their names suggest, the main distinction between the two subclasses is the type of CRF model we aim to build. If we want CRF models that include features with at most first order label patterns (i.e. DT; N; V; .; DT|N; N|V; V|.), then FOFeatureExtractor is the class to use. If, in addition, we want to model higher order features (i.e. features with label patterns and transitions that involve three or more states/labels, such as DT|N|V; N|V|.; DT|N|V|. taken from our last example), then HOFeatureExtractor is the one to use. In other words, HOFeatureExtractor is equivalent to the FeatureExtractor class.
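The distinction can be pictured with a short sketch (an assumption about the behaviour, not the subclasses' actual implementation): starting from a higher order extracted-features dictionary, a first order view keeps only the entries whose label pattern spans at most two states.

```python
# Hedged sketch (not the subclasses' actual code): restrict a higher order
# extracted-features dict to first order label patterns only.
def first_order_only(extracted):
    """extracted maps boundary -> {label_pattern: {feature_name: value}}."""
    kept = {}
    for boundary, by_pattern in extracted.items():
        # 'DT|N' has one separator; 'N|V|.' and longer patterns are dropped
        kept[boundary] = {p: feats for p, feats in by_pattern.items()
                          if p.count('|') <= 1}
    return kept

# simplified excerpt of the last boundary from our higher order example
ho = {(4, 4): {'.': {'w[0]=.': 1, '.': 1},
               'V|.': {'w[0]=.': 1, 'V|.': 1},
               'N|V|.': {'N|V|.': 1},
               'DT|N|V|.': {'DT|N|V|.': 1}}}
print(first_order_only(ho))
```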
However, there is a subtle difference between the two subclasses (HOFeatureExtractor and FOFeatureExtractor), which involves modeling the starting labels/states. The FOFeatureExtractor class supports the inclusion of a START state, which allows building models with features for the initial labels and label transitions at the starting position of the sequence.
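How such START features could look can be sketched as follows (a hypothetical illustration of the idea, not PySeqLab internals): at the first boundary, each unigram label pattern additionally yields a __START__|&lt;label&gt; transition pattern.

```python
# Hypothetical illustration (not PySeqLab internals): with a START state
# enabled, the first boundary gains '__START__|<label>' transition patterns
# alongside the plain label patterns.
def add_start_features(extracted, first_boundary=(1, 1)):
    by_pattern = extracted[first_boundary]
    for pattern in list(by_pattern):       # snapshot: we add keys below
        if '|' in pattern:
            continue                       # extend only unigram label patterns
        feats = dict(by_pattern[pattern])
        feats.pop(pattern, None)           # the plain label bias stays behind
        feats['__START__|' + pattern] = 1  # transition bias from START
        by_pattern['__START__|' + pattern] = feats
    return extracted

feats = {(1, 1): {'DT': {'w[0]=The': 1, 'DT': 1}}}
print(add_start_features(feats))
```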
We demonstrate the use of HOFeatureExtractor in this code snippet and contrast it with the use of FOFeatureExtractor in the following code snippet. As we described earlier, the features extracted by FOFeatureExtractor involve only first order label patterns.
extracted features using FOFeatureExtractor with start state disabled {(2, 2): {'DT|N': {'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]|w[1]=dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[-1]|numchars[0]': 6.0, 'w[0]|w[1]=dog|barks': 1, 'numchars[0]|numchars[1]': 8.0, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'shape[0]=aaa': 1, 'w[-1]|w[0]=The|dog': 1, 'numchars[0]': 3.0, 'shaped[0]=a': 1}}, (4, 4): {'.': {'.': 1, 'numchars[-1]|numchars[0]': 6.0, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}, 'V|.': {'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}}, (3, 3): {'V': {'V': 1, 'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'numchars[0]': 5.0, 'w[0]=barks': 1, 'shaped[0]=a': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'numchars[0]|numchars[1]': 6.0, 'shaped[0]=Aa': 1, 'numchars[0]': 3.0, 'w[0]=The': 1}}}
Additionally, by enabling the start_state flag in the constructor, we get features that involve START state/label pattern transitions.
extracted features using FOFeatureExtractor with start state enabled {(2, 2): {'DT|N': {'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]|w[1]=dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[-1]|numchars[0]': 6.0, 'w[0]|w[1]=dog|barks': 1, 'numchars[0]|numchars[1]': 8.0, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'shape[0]=aaa': 1, 'w[-1]|w[0]=The|dog': 1, 'numchars[0]': 3.0, 'shaped[0]=a': 1}}, (4, 4): {'.': {'.': 1, 'numchars[-1]|numchars[0]': 6.0, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}, 'V|.': {'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}}, (3, 3): {'V': {'V': 1, 'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'numchars[0]': 5.0, 'w[0]=barks': 1, 'shaped[0]=a': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'numchars[0]|numchars[1]': 6.0, 'shaped[0]=Aa': 1, 'numchars[0]': 3.0, 'w[0]=The': 1}, '__START__|DT': {'w[0]|w[1]=The|dog': 1, '__START__|DT': 1, 'w[0]=The': 1}}}
Bottom line, the choice between the two classes depends on what type of CRF model we want to build (consult the crf_model_building tutorial for further info).
from pyseqlab.features_extraction import HOFeatureExtractor, FOFeatureExtractor
# get feature templates for all tracks
template_XY, template_Y = generate_template_alltracks()
print()
# generate attributes for all tracks
generate_attributes_alltracks()
print()
# initialize HOFeatureExtractor
ho_fe = HOFeatureExtractor(template_XY, template_Y, my_attr_extractor.attr_desc)
extracted_features = ho_fe.extract_seq_features_perboundary(seq_1)
print("extracted features using HOFeatureExtractor ")
print(extracted_features)
print("-"*40)
# get feature templates for all tracks
template_XY, template_Y = generate_template_alltracks()
print()
# initialize FOFeatureExtractor with __START__ state
fo_fe = FOFeatureExtractor(template_XY, template_Y, my_attr_extractor.attr_desc, start_state=True)
extracted_features = fo_fe.extract_seq_features_perboundary(seq_1)
print("extracted features using FOFeatureExtractor with start_state enabled")
print(extracted_features)
print("-"*40)
print()
print()
# using FOFeatureExtractor without the __START__ state
fo_fe.start_state = False
extracted_features = fo_fe.extract_seq_features_perboundary(seq_1)
print("extracted features using FOFeatureExtractor with start_state disabled")
print(extracted_features)
print("-"*40)
Now that we know about generating features, we can move to the FeatureFilter class. FeatureFilter is a way to filter unwanted features post extraction. Generally speaking, when we build CRF models using PySeqLab, the features extracted at every position from the sequences in our training data are aggregated/collapsed to build our model. Therefore, if we want to remove features that occur fewer times than a threshold or that include specific label patterns, we can use the FeatureFilter class for this purpose.
The FeatureFilter class constructor takes filter_info as an argument, which is a dictionary specifying how to apply the filtering. The filter_info dictionary has three keys:
- filter_type: defines the type of filter; it is either 'count' or 'pattern'
- filter_val: defines either the threshold value or the y patterns to filter
- filter_relation: defines how the filter should be applied
The two filter types that can be specified in filter_info are:
filter_info = {'filter_type': 'count', 'filter_val': 5, 'filter_relation': '<'}
This filter would delete all features that have a count less than five. In 'filter_val' we decide the threshold and in 'filter_relation' we decide the comparison operator, which can assume one of these values {<, $\le$, >, $\ge$, =}. If the threshold is of type int (i.e. integer), the filter is applied only to categorical features. Else, if the threshold is of type float, the filter is applied to all features (categorical and continuous). The assumption is that categorical features are assigned integer feature values while continuous features are assigned float values. Hence, it is very important when computing/deriving attributes to assign feature values of the correct type (i.e. float in the case of continuous attributes).
filter_info = {'filter_type': 'pattern', 'filter_val': {"O|L", "L|L"}, 'filter_relation': 'in'}
This filter will delete all features whose associated y pattern is in {"O|L", "L|L"}. Hence, 'filter_val' takes a set of label patterns while 'filter_relation' assumes either 'in' or 'not in' as a value. In case 'not in' is specified, the above filter will delete all features except the ones associated with the {"O|L", "L|L"} label patterns.
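The two filter behaviours described above can be sketched as follows (an illustrative re-implementation under our reading of the semantics, not FeatureFilter's actual code): a count filter deletes features whose count satisfies the comparison against the threshold, restricting itself to categorical (integer-valued) features when the threshold is an int, while a pattern filter deletes features by their label pattern.

```python
import operator

# comparison operators a count filter may use
OPS = {'<': operator.lt, '<=': operator.le, '>': operator.gt,
       '>=': operator.ge, '=': operator.eq}

def apply_filter(gfeatures, filter_info):
    """Return a copy of gfeatures with the matching features deleted.

    gfeatures maps label_pattern -> {feature_name: value}."""
    ftype, fval = filter_info['filter_type'], filter_info['filter_val']
    frel = filter_info['filter_relation']
    filtered = {}
    if ftype == 'count':
        cmp = OPS[frel]
        cat_only = isinstance(fval, int)   # int threshold -> categorical only
        for pattern, feats in gfeatures.items():
            kept = {}
            for name, value in feats.items():
                targeted = isinstance(value, int) or not cat_only
                if targeted and cmp(value, fval):
                    continue               # feature matches the filter -> drop
                kept[name] = value
            if kept:
                filtered[pattern] = kept
    else:                                  # 'pattern' filter
        for pattern, feats in gfeatures.items():
            if (pattern in fval) == (frel == 'in'):
                continue                   # label pattern filtered out
            filtered[pattern] = feats
    return filtered

gfeatures = {'DT': {'w[0]=The': 1, 'DT': 1},
             'N|V': {'N|V': 1},
             'V': {'numchars[0]': 5.0, 'V': 1}}
# delete categorical features seen fewer than 2 times
print(apply_filter(gfeatures, {'filter_type': 'count',
                               'filter_val': 2, 'filter_relation': '<'}))
# delete features whose label pattern is in {'N|V', 'DT'}
print(apply_filter(gfeatures, {'filter_type': 'pattern',
                               'filter_val': {'N|V', 'DT'},
                               'filter_relation': 'in'}))
```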
To see FeatureFilter in action, we apply the two types of filters to the features extracted in our last example.
from pyseqlab.features_extraction import FeatureFilter
# aggregate/collapse features across all positions in the sequence
gfeatures = fe.aggregate_seq_features(extracted_features, seq_1.get_y_boundaries())
print("originally extracted features by position:")
print(extracted_features)
print("-"*50)
print("aggregated/collapsed features across all positions in the sequence")
print(gfeatures)
print("-"*50)
print()
print("Count based filters:")
print()
# count based filter targeting only categorical features (see the type of the threshold)
filter_info = {'filter_type': 'count',
'filter_val':1,
'filter_relation': '>='}
ffilter = FeatureFilter(filter_info)
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering only categorical features with value >= 1")
print(filtered_features)
print("*"*50)
filter_info = {'filter_type': 'count',
'filter_val':3.0,
'filter_relation': '>='}
ffilter.filter_info = filter_info
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering only continuous features with values >= 3")
print(filtered_features)
print("*"*50)
filter_info = {'filter_type': 'count',
'filter_val':1.0,
'filter_relation': '>='}
ffilter.filter_info = filter_info
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering both categorical and continuous features with values >=1")
print(filtered_features)
print("*"*50)
print("Pattern based filters:")
print()
filter_info = {'filter_type': 'pattern',
'filter_val':{'N|V','DT'},
'filter_relation': 'not in'}
ffilter.filter_info = filter_info
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering label patterns not in {'N|V','DT'}")
print(filtered_features)
print("*"*50)
filter_info = {'filter_type': 'pattern',
'filter_val':{'N|V','DT'},
'filter_relation': 'in'}
ffilter.filter_info = filter_info
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering label patterns in {'N|V','DT'}")
print(filtered_features)
print("*"*50)
The ideas and concepts explained in this tutorial are very important for the next stages especially when it comes to building/training CRFs models. Understanding the concepts such as attributes, features, feature template, feature extraction, and feature filter will help in deciding what CRFs model we want to build and how the feature parameters are going to be trained during the learning phase. We tackle the model training in the crf_model_building tutorial.