In [2]:
# importing and defining relevant directories
import sys
import os
# pyseqlab root directory
pyseqlab_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
# print("pyseqlab cloned dir:", pyseqlab_dir)
# inserting the pyseqlab directory to pythons system path
# if pyseqlab is already installed this could be commented out
sys.path.insert(0, pyseqlab_dir)
# current directory (tutorials)
tutorials_dir = os.path.join(pyseqlab_dir, 'tutorials')
# print("tutorials_dir:", tutorials_dir)
dataset_dir = os.path.join(tutorials_dir, 'datasets', 'conll2000')
# print("dataset_dir:", dataset_dir)
# to use for customizing the display/format of the cells
from IPython.core.display import HTML
with open(os.path.join(tutorials_dir, 'pseqlab_base.css')) as f:
    css = "".join(f.readlines())
HTML(css)
Out[2]:

1. Objectives and goals

In this tutorial, we will learn about:

  • generating attributes from observation tracks in sequences/segments
  • generating features by combining the generated attributes with the labels
  • defining templates for automatic feature extraction
  • creating and applying feature filters on our extracted features

Reminder: To work with this tutorial interactively, we first need to clone the PySeqLab package locally and then navigate to [cloned_package_dir]/tutorials, where [cloned_package_dir] is the path to the cloned package folder (see the directory tree below).

├── pyseqlab
    ├── tutorials
    │   ├── datasets
    │   │   ├── conll2000
    │   │   ├── segments

We suggest going through the sequence_and_input_structure tutorial before continuing with this notebook. We will use constructed sequences (short sentences) while explaining the attribute/feature extraction process.

A reminder regarding the terminology we are using and planning to expand:
  • sequence: to refer to a list of elements that follow an order
  • observation: to refer to an element in the sequence
  • track: to refer to different types of observations. In the chunking example, we have a track for the words and another for the part-of-speech tags
  • label/tag: to refer to the outcome/class we want to predict

We start by constructing a sequence $s$ from the sentence "The dog barks.".

In [3]:
# import SequenceStruct class
from pyseqlab.utilities import SequenceStruct
# define the X attribute
X= [{'w':'The'}, {'w':'dog'}, {'w':'barks'}, {'w':'.'}]
# label/tag sequence
Y= ['DT', 'N', 'V', '.']
seq_1 = SequenceStruct(X, Y)
print("seq_1:")
print("X:", seq_1.X)
print("Y:", seq_1.Y)
print("flat_y:", seq_1.flat_y)
seq_1:
X: {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
Y: {(3, 3): 'V', (4, 4): '.', (1, 1): 'DT', (2, 2): 'N'}
flat_y: ['DT', 'N', 'V', '.']

2. Attributes

An attribute is defined as a measured property/characteristic of an observation. One example of an attribute is the value of the observation in a given track at each position in the sequence. For our sequence $s$, we specified 'w' to be the name of the words track and hence we obtain the following attributes at each position:

 attribute      w[0]=The       w[0]=dog      w[0]=barks      w[0]=.
 position           1              2             3               4

To facilitate the attributes extraction process, PySeqLab provides a GenericAttributeExtractor class in the attributes_extraction module.

Using the GenericAttributeExtractor class, we can extract attributes from the given observation tracks. The constructor takes a dictionary describing the attributes we aim to extract. The description dictionary has:

  • the name of the attribute (i.e. a unique name such as 'w' or 'word', referring to the name used for the words track in sequence $s$)
    • description (optional) about the track
    • encoding (obligatory) that specifies if the attribute is categorical (discrete/nominal) or continuous ($\in \mathbb{R}$)

For example, the attribute description dictionary for extracting the value of the observations in the words track at each position is defined by:

attr_desc = {'w':{'description': 'word observation track', 
                  'encoding':'categorical'
                  }
            }

The instance of the GenericAttributeExtractor class will extract the target attributes and store them in the seg_attr instance attribute of the sequence. The cell below demonstrates this process through an example.

In [4]:
from pyseqlab.attributes_extraction import GenericAttributeExtractor
# print the doc string of the class
print(GenericAttributeExtractor.__doc__)
X= [{'w':'The'}, {'w':'dog'}, {'w':'barks'}, {'w':'.'}]
# label/tag sequence
Y= ['DT', 'N', 'V', '.']
# create a sequence
seq_1 = SequenceStruct(X, Y)
# attribute description dictionary -- using only the word observation track
attr_desc = {'w':{'description': 'word observation track', 
                  'encoding':'categorical'
                 }
            }
# initialize the attribute extractor instance
generic_attr_extractor = GenericAttributeExtractor(attr_desc)
print("attr_desc {}".format(generic_attr_extractor.attr_desc))
print("-"*40)
# extract attributes 
generic_attr_extractor.generate_attributes(seq_1, seq_1.get_y_boundaries())
print("seq_1:")
print(seq_1)
print("extracted attributes saved in seq_1.seg_attr:")
print(seq_1.seg_attr)
# for boundary, seg_attr in seq_1.seg_attr.items():
#     print("boundary {}".format(boundary))
#     print("attributes {}".format(seg_attr)) 
Generic attribute extractor class implementing observation functions that generates attributes from tokens/observations
       
       Args:
           attr_desc: dictionary defining the atomic observation/attribute names including
                      the encoding of such attribute (i.e. {continuous, categorical}}    
       Attributes:
           attr_desc: dictionary defining the atomic observation/attribute names including
                      the encoding of such attribute (i.e. {continuous, categorical}}
           seg_attr:  dictionary comprising the extracted attributes per each boundary of a sequence

    
attr_desc {'w': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <pyseqlab.attributes_extraction.GenericAttributeExtractor object at 0x103f9f3c8>>, 'description': 'word observation track', 'encoding': 'categorical'}}
----------------------------------------
seq_1:
Y sequence:
 ['DT', 'N', 'V', '.']
X sequence:
 {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
----------------------------------------
extracted attributes saved in seq_1.seg_attr:
{(2, 2): {'w': 'dog'}, (4, 4): {'w': '.'}, (3, 3): {'w': 'barks'}, (1, 1): {'w': 'The'}}

Going beyond the given observation tracks (i.e. the ones provided in a file or as part of the constructed sequence), we can compute/derive new attributes from those tracks. Based on the words track, for example, we can compute the number of characters in a word, the shape and degenerate shape of the word, an indicator of whether the word is capitalized, and many others. Each type of computed/derived attribute is equivalent to forming a new track. To derive new attributes, we need to:

  1. subclass the GenericAttributeExtractor class,
  2. override the generate_attributes(args) method, and
  3. define our attribute extraction functions.

Below, we provide an example of a subclass of the GenericAttributeExtractor class.

In [5]:
# my GenericAttributeExtractor subclass
class MySeqAttributeExtractor(GenericAttributeExtractor):
    """class implementing observation functions that generates attributes from word tokens/observations
       
       Args:
           attr_desc: dictionary defining the atomic observation/attribute names including
                      the encoding of such attribute (i.e. {continuous, categorical}}    
       Attributes:
           attr_desc: dictionary defining the atomic observation/attribute names including
                      the encoding of such attribute (i.e. {continuous, categorical}}
           seg_attr:  dictionary comprising the extracted attributes per each boundary of a sequence

    """
    def __init__(self):
        attr_desc = self.generate_attributes_desc()
        super().__init__(attr_desc)
    
    def generate_attributes_desc(self):
        """define attributes by including description and encoding of each extracted/computed observation attribute 
        """
        attr_desc = {}
        attr_desc['w'] = {'description':'the word/token',
                          'encoding':'categorical'
                         }
        attr_desc['shape'] = {'description':'the shape of the word',
                              'encoding':'categorical'
                             }
        attr_desc['shaped'] = {'description':'the compressed/degenerated form/shape of the word',
                               'encoding':'categorical'
                              }
        attr_desc['numchars'] = {'description':'number of characters in a word',
                                 'encoding':'continuous'
                                }
        return(attr_desc)
 
    def generate_attributes(self, seq, boundaries):
        """generate attributes of the sequence observations in a specified list of boundaries
        
           Args:
               seq: a sequence instance of :class:`SequenceStruct`
               boundaries: list of boundaries [(1,1), (2,2),...,]
               
           .. note::
           
              the generated attributes are saved first in :attr:`seg_attr` and then passed to 
              the **`seq.seg_attr`**. In other words, at the end :attr:`seg_attr` is always cleared
              
        
        """
        X = seq.X
        observed_attrnames = list(X[1].keys() & self.attr_desc.keys())
        # segment attributes dictionary
        self.seg_attr = {}
        new_boundaries = []
        # create segments from observations using the provided boundaries
        for boundary in boundaries:
            if(boundary not in seq.seg_attr):
                self._create_segment(X, boundary, observed_attrnames)
                new_boundaries.append(boundary)
#         print("seg_attr {}".format(self.seg_attr))
#         print("new_boundaries {}".format(new_boundaries))
        if(self.seg_attr):
            for boundary in new_boundaries:
                self.get_shape(boundary)
                self.get_degenerateshape(boundary)
                self.get_num_chars(boundary)
            # save generated attributes in seq
            seq.seg_attr.update(self.seg_attr)
#             print('saved attribute {}'.format(seq.seg_attr))
            # clear the instance variable seg_attr
            self.seg_attr = {}
        return(new_boundaries)
            
    def get_shape(self, boundary):
        """get shape of a word
        
           Args:
               boundary: tuple (u,v) that marks beginning and end of a word
        """
        segment = self.seg_attr[boundary]['w']
        res = ''
        for char in segment:
            if char.isupper():
                res += 'A'
            elif char.islower():
                res += 'a'
            elif char.isdigit():
                res += 'D'
            else:
                res += '_'

        self.seg_attr[boundary]['shape'] = res
            
    def get_degenerateshape(self, boundary):
        """get degenerate shape of a word
        
           Args:
               boundary: tuple (u,v) that marks beginning and end of a word
        """
        segment = self.seg_attr[boundary]['shape']
        track = ''
        for char in segment:
            if not track or track[-1] != char:
                track += char
        self.seg_attr[boundary]['shaped'] = track
            
    def get_num_chars(self, boundary, filter_out = " "):
        """get the number of characters in a word
        
           Args:
               boundary: tuple (u,v) that marks beginning and end of a word
               filter_out: string the default separator between attributes
               
           .. note:
              the feature value of continuous attribute is of type float
        """
        segment = self.seg_attr[boundary]['w']
        filtered_segment = segment.split(sep = filter_out)
        num_chars = 0.0
        for entry in filtered_segment:
            num_chars += len(entry)
        self.seg_attr[boundary]['numchars'] = num_chars

# initialize my new attribute extractor
my_attr_extractor = MySeqAttributeExtractor()
print("attr_desc of MySeqAttributeExtractor instance")
print(my_attr_extractor.attr_desc)
# we use our created sequence (seq_1). 
# But first we need to clear the seg_attr that was filled by our GenericAttributeExtractor (see the previous cell)
seq_1.seg_attr.clear()
my_attr_extractor.generate_attributes(seq_1, seq_1.get_y_boundaries())
print("-"*40)
print("seq_1")
print(seq_1)
print("extracted attributes saved in seq_1.seg_attr:")
print(seq_1.seg_attr)
attr_desc of MySeqAttributeExtractor instance
{'shaped': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'the compressed/degenerated form/shape of the word', 'encoding': 'categorical'}, 'numchars': {'repr_func': <bound method GenericAttributeExtractor._represent_continuous_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'number of characters in a word', 'encoding': 'continuous'}, 'w': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'the word/token', 'encoding': 'categorical'}, 'shape': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'the shape of the word', 'encoding': 'categorical'}}
----------------------------------------
seq_1
Y sequence:
 ['DT', 'N', 'V', '.']
X sequence:
 {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
----------------------------------------
extracted attributes saved in seq_1.seg_attr:
{(3, 3): {'shaped': 'a', 'numchars': 5.0, 'w': 'barks', 'shape': 'aaaaa'}, (2, 2): {'shaped': 'a', 'numchars': 3.0, 'w': 'dog', 'shape': 'aaa'}, (4, 4): {'shaped': '_', 'numchars': 1.0, 'w': '.', 'shape': '_'}, (1, 1): {'shaped': 'Aa', 'numchars': 3.0, 'w': 'The', 'shape': 'Aaa'}}

As shown above, the computed attributes (number of characters, shape of the word) are stored in the sequence's seg_attr instance attribute. Another important point to keep in mind is that continuous attributes (i.e. numchars) are assigned feature values of type float, whereas categorical attributes (i.e. shape, shaped, w) are assigned feature values of type int. This distinction only matters when we plan to use feature filters (more on this in the feature filters section); otherwise, the type of the feature value is of no importance.
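
We can quickly verify how the attribute values themselves are stored. Below is a minimal check (a plain snippet, not one of the numbered cells), reusing the seq_1 populated by MySeqAttributeExtractor in the cell above:

 # inspect the types of the attribute values stored in seg_attr:
 # numchars (continuous) is held as a float, while w/shape/shaped (categorical) keep their raw string values
 for boundary in sorted(seq_1.seg_attr):
     attrs = seq_1.seg_attr[boundary]
     print(boundary, type(attrs['w']).__name__, type(attrs['numchars']).__name__)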

3. Features

Now that we have defined attributes, we can move on to features. A feature is defined (in our context and to avoid confusion) as a measured property/characteristic of observations and labels that will be used to build a conditional random fields (CRFs) model, with an associated parameter to estimate. This sounds a lot like an attribute, doesn't it? Features and attributes are indeed very similar, but in our terminology attributes are concerned only with the measured properties/characteristics of the observation tracks, while features are generally constructed using the following options:

  1. combining the computed/extracted attributes with the Y labels
  2. using only the Y labels and/or the transitions among them

A combination of the above two options leads to different CRF models whose feature parameters we can train/estimate during the learning phase (consult the crf_model_building tutorial for more info).

Question: What could be used as features?
Answer: Everything we can think of could be used as a feature (the 'kitchen sink' analogy applies).

The first features we can think of are the values of the attributes at each position in the sequence combined with the corresponding label/tag. These types of features are called node features. For our sequence $s$, we specified 'w' to be the name of the words track, and hence the features resulting from combining those attributes with the labels at each position will be:

 feature    w[0]=The, DT          w[0]=dog, N        w[0]=barks, V        w[0]=., .
 position   1                     2                  3                    4

Similarly, we can add features that use only the Y labels at each position:

 feature    w[0]=The, DT          w[0]=dog, N        w[0]=barks, V        w[0]=., .
            DT                    N                  V                    .
 position   1                     2                  3                    4

Likewise, we can add the label transition too (-- means not applicable):

 feature    w[0]=The, DT          w[0]=dog, N        w[0]=barks, V        w[0]=., .
            DT                    N                  V                    .
            --                    DT|N               N|V                  V|.
            --                    --                 DT|N|V               N|V|.
            --                    --                 --                   DT|N|V|.

 position   1                     2                  3                    4

Following the same logic, we can apply this process to the extracted/computed attributes (such as the shape of the word, number of characters and all the other attributes we computed earlier):

 feature    w[0]=The, DT             w[0]=dog, N              w[0]=barks, V          w[0]=., .
            numchars[0]=3, DT        numchars[0]=3, N         numchars[0]=5, V       numchars[0]=1, .
            shape[0]=Aaa, DT         shape[0]=aaa, N          shape[0]=aaaaa, V      shape[0]=_, .
            shaped[0]=Aa, DT         shaped[0]=a, N           shaped[0]=a, V         shaped[0]=_, .
            DT                       N                        V                      .
            --                       DT|N                     N|V                    V|.
            --                       --                       DT|N|V                 N|V|.
            --                       --                       --                     DT|N|V|.

 position   1                        2                        3                      4

As can be seen, we have endless options for generating features. Note also that the different attribute types (i.e. tracks) always have a position indicator appended to them (such as w[0]), which leads us to an obvious question:

Question: Could we use attributes from different positions to construct features?
Answer: Yes.

To explain this idea further, we will use a set of examples/scenarios.

Suppose that at each position we need to consider the current word attribute (i.e. the attribute of the word track at the current position) and the one at the next position (i.e. the forward position) while joining them with the current Y label (i.e. Y[0]).

Question: Could we extract these features based on the latter specification?
Answer: Yes

 feature    w[0]=The, DT             w[0]=dog, N              w[0]=barks, V          w[0]=., .
            w[0]=The,w[1]=dog, DT    w[0]=dog,w[1]=barks, N   w[0]=barks,w[1]=., V   --
            numchars[0]=3, DT        numchars[0]=3, N         numchars[0]=5, V       numchars[0]=1, .
            shape[0]=Aaa, DT         shape[0]=aaa, N          shape[0]=aaaaa, V      shape[0]=_, .
            shaped[0]=Aa, DT         shaped[0]=a, N           shaped[0]=a, V         shaped[0]=_, .
            DT                       N                        V                      .
            --                       DT|N                     N|V                    V|.
            --                       --                       DT|N|V                 N|V|.
            --                       --                       --                     DT|N|V|.

 position   1                        2                        3                      4

Question: So, how about considering the attributes before each position (i.e. use history), could we do that?
Answer: Yes

 feature
 -->         w[0]=The, DT             w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         w[0]=The,                w[0]=dog,               w[0]=barks,             --
             w[1]=dog, DT             w[1]=barks, N           w[1]=., V

 -->         --                       w[-1]=The,              w[-1]=dog,              w[-1]=barks,
                                      w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         numchars[0]=3, DT        numchars[0]=3, N        numchars[0]=5, V       numchars[0]=1, .
 -->         shape[0]=Aaa, DT         shape[0]=aaa, N         shape[0]=aaaaa, V      shape[0]=_, .
 -->         shaped[0]=Aa, DT         shaped[0]=a, N          shaped[0]=a, V         shaped[0]=_, .

 -->         DT                       N                       V                      .
 -->         --                       DT|N                    N|V                    V|.
 -->         --                       --                      DT|N|V                 N|V|.
 -->         --                       --                      --                     DT|N|V|.

 position    1                        2                       3                      4

Question: How about considering both attributes, the before and after (i.e. context window)?
Answer: Yes

 feature
 -->         w[0]=The, DT             w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         w[0]=The,                w[0]=dog,               w[0]=barks,             --
             w[1]=dog, DT             w[1]=barks, N           w[1]=., V

 -->         --                       w[-1]=The,              w[-1]=dog,              w[-1]=barks,
                                      w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         --                       w[-1]=The,              w[-1]=dog,              --
                                      w[0]=dog,               w[0]=barks,
                                      w[1]=barks, N           w[1]=., V

 -->         numchars[0]=3, DT        numchars[0]=3, N        numchars[0]=5, V       numchars[0]=1, .
 -->         shape[0]=Aaa, DT         shape[0]=aaa, N         shape[0]=aaaaa, V      shape[0]=_, .
 -->         shaped[0]=Aa, DT         shaped[0]=a, N          shaped[0]=a, V         shaped[0]=_, .

 -->         DT                       N                       V                      .
 -->         --                       DT|N                    N|V                    V|.
 -->         --                       --                      DT|N|V                 N|V|.
 -->         --                       --                      --                     DT|N|V|.

 position    1                        2                       3                      4

The choice for combining attributes is not limited to the ones we have presented so far. We can choose to combine attributes at different/arbitrary positions without any restrictions. For example, we could use w[0]=The, w[3]=., DT as a feature while position 1 is our current position. Similarly, we can use w[-3]=The,w[-1]=barks, . as a feature while position 4 is our current position.

Another aspect to consider is the choice of how to associate attributes with the labels. So far, we have been combining the attributes at each position with the current label (i.e. the corresponding label at each position).

Question: Now what if we want to combine those attributes with multiple labels, can we do that?
Answer: Yes (with constraints). That is, attributes can be combined with label transitions only, such as the current and previous labels. Moreover, PySeqLab supports modeling label transitions of higher order (i.e. order $\ge$ 2, such as DT|N|V, N|V|., DT|N|V|.). Features that combine attributes with label transitions are generally called edge features. Below is an example of using the word attributes with first-order label transitions (i.e. using the current and previous label).

 feature
 -->         w[0]=The, DT             w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         w[0]=The,                w[0]=dog,               w[0]=barks,             --
             w[1]=dog, DT             w[1]=barks, N           w[1]=., V

 -->         --                       w[-1]=The,              w[-1]=dog,              w[-1]=barks,
                                      w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         --                       w[-1]=The,              w[-1]=dog,              --
                                      w[0]=dog,               w[0]=barks,
                                      w[1]=barks, N           w[1]=., V

 -->         --                       w[0]=dog, DT|N          w[0]=barks, N|V        w[0]=., V|.

 -->         --                       w[0]=dog,               w[0]=barks,             --
                                      w[1]=barks, DT|N        w[1]=., N|V

 -->         --                       w[-1]=The,              w[-1]=dog,              w[-1]=barks,
                                      w[0]=dog, DT|N          w[0]=barks, N|V         w[0]=., V|.


 -->         --                       w[-1]=The,              w[-1]=dog,              --
                                      w[0]=dog,               w[0]=barks,
                                      w[1]=barks, DT|N        w[1]=., N|V

 -->         numchars[0]=3, DT        numchars[0]=3, N        numchars[0]=5, V       numchars[0]=1, .
 -->         shape[0]=Aaa, DT         shape[0]=aaa, N         shape[0]=aaaaa, V      shape[0]=_, .
 -->         shaped[0]=Aa, DT         shaped[0]=a, N          shaped[0]=a, V         shaped[0]=_, .

 -->         DT                       N                       V                      .
 -->         --                       DT|N                    N|V                    V|.
 -->         --                       --                      DT|N|V                 N|V|.
 -->         --                       --                      --                     DT|N|V|.

 position    1                        2                       3                      4

Moreover, edge features (i.e. combining attributes with label transitions) are supported across all attribute types (i.e. using computed attributes from the word track such as the number of characters 'numchars' and the word shape attributes 'shape' and 'shaped').

NB: Categorical and continuous attributes combine differently. If we consider categorical attributes at previous, forward, or any other arbitrary positions, the generated features are based on the pattern they form (i.e. at position=1 we have w[0]=The,w[1]=dog, DT). However, if we are dealing with continuous attributes, the generated features use the sum of those attributes (i.e. at position=1 we have numchars[0,1]=6, DT). Below is an example using the continuous attribute numchars (the number of characters in a word); a small arithmetic sketch after the table reproduces the summed values.

 feature
 -->         w[0]=The, DT             w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         w[0]=The,                w[0]=dog,               w[0]=barks,             --
             w[1]=dog, DT             w[1]=barks, N           w[1]=., V

 -->         --                       w[-1]=The,              w[-1]=dog,              w[-1]=barks,
                                      w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         --                       w[-1]=The,              w[-1]=dog,              --
                                      w[0]=dog,               w[0]=barks,
                                      w[1]=barks, N           w[1]=., V

 -->         --                       w[0]=dog, DT|N          w[0]=barks, N|V        w[0]=., V|.

 -->         --                       w[0]=dog,               w[0]=barks,             --
                                      w[1]=barks, DT|N        w[1]=., N|V

 -->         --                       w[-1]=The,              w[-1]=dog,              w[-1]=barks,
                                      w[0]=dog, DT|N          w[0]=barks, N|V         w[0]=., V|.


 -->         --                       w[-1]=The,              w[-1]=dog,              --
                                      w[0]=dog,               w[0]=barks,
                                      w[1]=barks, DT|N        w[1]=., N|V

 -->         numchars[0]=3, DT        numchars[0]=3, N        numchars[0]=5, V       numchars[0]=1, .

 -->         numchars[0,1]=6, DT      numchars[0,1]=8, N      numchars[0,1]=6, V     --

 -->         --                       numchars[-1,0]=6, N     numchars[-1,0]=8, V    numchars[-1,0]=6, .

 -->         --                       numchars[-1,0,1]=11, N  numchars[-1,0,1]=9, V  --

 -->         shape[0]=Aaa, DT         shape[0]=aaa, N         shape[0]=aaaaa, V      shape[0]=_, .
 -->         shaped[0]=Aa, DT         shaped[0]=a, N          shaped[0]=a, V         shaped[0]=_, .

 -->         DT                       N                       V                      .
 -->         --                       DT|N                    N|V                    V|.
 -->         --                       --                      DT|N|V                 N|V|.
 -->         --                       --                      --                     DT|N|V|.

 position    1                        2                       3                      4
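
To make the summed values in the rows above concrete, here is a small plain-Python illustration (independent of PySeqLab's API) that reproduces the numchars[0,1] and numchars[-1,0,1] entries:

 # numchars at positions 1..4 for "The", "dog", "barks", "."
 num_chars = [3, 3, 5, 1]
 # numchars[0,1]: sum of the current and next position (defined wherever a next position exists)
 print([num_chars[i] + num_chars[i + 1] for i in range(len(num_chars) - 1)])   # [6, 8, 6]
 # numchars[-1,0,1]: sum over the previous, current and next position
 print([sum(num_chars[i - 1:i + 2]) for i in range(1, len(num_chars) - 1)])    # [11, 9]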

4. Feature templates

After we have expanded our terminology, it is time to show how we can:

  1. extract features based on the extracted/computed attributes and
  2. define a template that specifies the process/instructions for generating those features

PySeqLab provides two essential classes:

  • TemplateGenerator class that builds templates for feature generation
  • FeatureExtractor class that extracts features from sequences/segments using the generated templates

Starting with the TemplateGenerator class, two main methods are used:

  • generate_template_XY(args) defines a template for joining/combining the attributes (i.e. using attributes at the current and at arbitrary positions) with label patterns of a chosen order (i.e. current labels and label transitions of varying order) (see option 1)
  • generate_template_Y(args) defines a template based on the label patterns only (i.e. current labels and/or label transitions) without involving the attributes from the tracks (see option 2)

Using our sequence $s$ with the attributes we extracted, we will demonstrate how to define feature templates and consequently the features extracted using those templates.

In [6]:
# import template generator
from pyseqlab.utilities import TemplateGenerator

def experiment_templates_XY(track_attr_name, template_XY):
    # current attribute at each position combined with the current label
    template_gen.generate_template_XY(track_attr_name, ('1-gram', range(0,1)), '1-state', template_XY)
    print("template_XY: current observation, current label = w[0], Y[0]")
    print(template_XY)
    template_XY.clear()
    # current and next/forward attribute at each position combined with the current label
    template_gen.generate_template_XY(track_attr_name, ('2-gram', range(0,2)), '1-state', template_XY)
    print("template_XY: current observation, next observation, label = w[0],w[1], Y[0]")
    print(template_XY)
    template_XY.clear()
    # previous and current attribute at each position combined with the current label
    template_gen.generate_template_XY(track_attr_name, ('2-gram', range(-1,1)), '1-state', template_XY)
    print("template_XY: previous observation, current observation, label = w[-1],w[0], Y[0]")
    print(template_XY)
    template_XY.clear()
    # previous, current and next attribute at each position combined with the current label
    template_gen.generate_template_XY(track_attr_name, ('3-gram', range(-1,2)), '1-state', template_XY)
    print("template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]")
    print(template_XY)
    template_XY.clear()
    # previous, current and next attribute at each position combined with the current label, and current and previous label
    template_gen.generate_template_XY(track_attr_name, ('3-gram', range(-1,2)), '1-state:2-states', template_XY)
    print("template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]\ntemplate_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[-1]Y[0]")
    print(template_XY)
    template_XY.clear()
    # get unigrams and bigrams in a centered window of size 3 at every position in the sequence and combine them with the current label
    template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
    print("template_XY: unigrams and bigrams in centered window of size 3, label")
    print(template_XY)
    template_XY.clear()
    # combine all the previous specifications/templates
    template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
    template_gen.generate_template_XY(track_attr_name, ('3-grams', range(-1,2)), '1-state:2-states', template_XY)
    print("template_XY: all previous templates combined")
    print(template_XY)

def experiment_templates_Y():
    # generating templates based on the current labels
    template_Y = template_gen.generate_template_Y('1-state')
    print("template_Y: current label")
    print(template_Y)
    template_Y.clear()
    # generating templates based on the current label and first order label transitions
    template_Y = template_gen.generate_template_Y('1-state:2-states')
    print("template_Y: current label and first order label transitions")
    print(template_Y)
    # empty template -- to further investigate
    template_Y = template_gen.generate_template_Y('0-state')
    print("template_Y: empty")
    print(template_Y)

# create a template generator
template_gen = TemplateGenerator()
# create a dictionary to define template_XY
template_XY = {}
# generating template for word attributes (i.e. word track)
track_attr_name = 'w'
# run experiment for template_XY
experiment_templates_XY(track_attr_name, template_XY)
# run experiment for template_Y
experiment_templates_Y()
template_XY: current observation, current label = w[0], Y[0]
{'w': {(0,): ((0,),)}}
template_XY: current observation, next observation, label = w[0],w[1], Y[0]
{'w': {(0, 1): ((0,),)}}
template_XY: previous observation, current observation, label = w[-1],w[0], Y[0]
{'w': {(-1, 0): ((0,),)}}
template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]
{'w': {(-1, 0, 1): ((0,),)}}
template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]
template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[-1]Y[0]
{'w': {(-1, 0, 1): ((0,), (-1, 0))}}
template_XY: unigrams and bigrams in centered window of size 3, label
{'w': {(0, 1): ((0,),), (0,): ((0,),), (-1,): ((0,),), (1,): ((0,),), (-1, 0): ((0,),)}}
template_XY: all previous templates combined
{'w': {(0, 1): ((0,),), (-1, 0, 1): ((0,), (-1, 0)), (0,): ((0,),), (-1, 0): ((0,),), (1,): ((0,),), (-1,): ((0,),)}}
template_Y: current label
{'Y': [(0,)]}
template_Y: current label and first order label transitions
{'Y': [(0,), (-1, 0)]}
template_Y: empty
{'Y': [()]}

From the cell above, we can see the iterative process of adding new instructions/templates for combining the attributes and the labels using the generate_template_XY method. These instructions are accumulated in template_XY.

    generate_template_XY(track_name, (ngram, window), states, template_XY)

To understand how the method operates and what arguments it takes, we need to understand the ngram and window concepts.

Our definition:

A window is a specified range (u, v) that is constructed and applied at each position in the sequence. By specifying the boundaries of the window (i.e. u and v) we define how the window is constructed.
An ngram is a chunk/segment of n consecutive elements in a given sequence.
For example, in our sentence $s$ = "The dog barks.", we can define the following ngrams:
  • unigram (1-gram): The, dog, barks, .
  • bigram (2-grams): The dog, dog barks, barks .
  • trigram (3-grams): The dog barks, dog barks .
  • four-gram (4-grams): The dog barks .
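
As a quick illustration (plain Python, independent of PySeqLab), these ngrams can be enumerated directly from the token list:

 tokens = ['The', 'dog', 'barks', '.']
 # return all chunks of n consecutive tokens
 def ngrams(tokens, n):
     return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
 for n in range(1, 5):
     print(n, ngrams(tokens, n))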

Those two concepts are the building blocks for specifying a feature extraction template. By specifying a window (range) and the ngrams requested for every track (i.e. the word track w), we can extract the specified ngrams within the defined window at each position in the sequence. Below, we show multiple examples for generating templates (denoted by template_XY).

Specification 1: For the words track, get the current attribute and combine it with the current label. This translates to:

    window of size 1 -> range(0,1)
    unigrams ->'1-gram'
    current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 1 (a window that includes only the attribute itself). Then we extract unigrams in the specified window.
 
 [Code]
 template_gen.generate_template_XY('w', ('1-gram', range(0,1)), '1-state', template_XY)

 [Output]
 template_XY: current observation, current label = w[0], Y[0]
 {'w': {(0,): ((0,),)}}

Specification 2: For the words track, get the current and next/forward attribute and combine it with the current label. This translates to:

 
    window of size 2 -> range(0,2)
    bigrams ->'2-grams'
    current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 2 (a window that includes the current attribute and the next/future one). Then we extract bigrams (2-grams) in the specified window. This ensures that the attributes at both positions are used together (i.e. w[0] and w[1]).
 
 [Code]
 template_gen.generate_template_XY('w', ('2-grams', range(0,2)), '1-state', template_XY)

 [Output]
 template_XY: current observation, next observation, label = w[0],w[1], Y[0]
 {'w': {(0, 1): ((0,),)}}

Specification 3: For the words track, get the previous and current attribute and combine it with the current label. This translates to:

 
    window of size 2 -> range(-1,1)
    bigrams ->'2-grams'
    current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 2 (a window that includes the previous attribute and the current one). Then we extract bigrams (2-grams) in the specified window. This ensures that the attributes at both positions are used together (i.e. w[-1] and w[0]).
 
 [Code]
 template_gen.generate_template_XY('w', ('2-grams', range(-1,1)), '1-state', template_XY)

 [Output]
 template_XY: previous observation, current observation, label = w[-1],w[0], Y[0]
 {'w': {(-1, 0): ((0,),)}}

Specification 4: For the words track, get the previous, current and next attribute and combine it with the current label. This translates to:

 
    window of size 3 -> range(-1,2)
    trigrams ->'3-grams'
    current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 3 (a window that includes the previous, current and next attribute). Then we extract trigrams (3-grams) in the specified window. This ensures that the three attributes are used together (i.e. w[-1], w[0], w[1]).
 
 [Code]
 template_gen.generate_template_XY('w', ('3-grams', range(-1,2)), '1-state', template_XY)

 [Output]
 template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]
 {'w': {(-1, 0, 1): ((0,),)}}

Specification 5: For the words track, get the previous, current and next attribute and combine it with the current label and the current and previous label. This translates to:

 
    window of size 3 -> range(-1,2)
    trigrams ->'3-grams'
    current label -> '1-state:2-states'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 3 (a window that includes the previous, current and next attribute). Then we extract trigrams (3-grams) in the specified window. This ensures that the three attributes are used together (i.e. w[-1], w[0], w[1]). The combined attributes are joined with the current label (one state) and with the previous and current label (two states).
 
 [Code]
 template_gen.generate_template_XY('w', ('3-grams', range(-1,2)), '1-state:2-states', template_XY)

 [Output]
 template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[0]
 template_XY: previous observation, current observation, next observation, label = w[-1],w[0],w[1], Y[-1]Y[0]
 {'w': {(-1, 0, 1): ((0,), (-1, 0))}}

Specification 6: For the words track, get the unigram and bigram attributes in a centered window of size 3 (i.e. centered at each position in the sequence) and combine it with the current label. This translates to:

 
    window of size 3 -> range(-1,2)
    unigrams and bigrams->'1-gram:2-grams'
    current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we construct a window of size 3 that is centered at the current position. Then we extract unigrams and bigrams within the constructed window.
 
 [Code]
 template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)

 [Output]
 template_XY: unigrams and bigrams in centered window of size 3, label
 {'w': {(0, 1): ((0,),), (0,): ((0,),), (-1,): ((0,),), (1,): ((0,),), (-1, 0): ((0,),)}}

We can also combine all the previous specifications to get:

 
 [Code]
 template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
 template_gen.generate_template_XY(track_attr_name, ('3-grams', range(-1,2)), '1-state:2-states', template_XY) 
 [Output]
 template_XY: all previous templates combined
 {'w': {(0, 1): ((0,),), (-1, 0, 1): ((0,), (-1, 0)), (0,): ((0,),), (-1, 0): ((0,),), (1,): ((0,),), (-1,): ((0,),)}}

There are many possibilities for creating templates for feature extraction. Although we targeted the attributes in the words track w, the same applies with no restriction to the other computed tracks such as the numchars, shape and shaped tracks.
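
For instance, a template targeting the shape track is generated with the same call (the same call appears again in the all-tracks example at the end of this tutorial); the expected template is given by analogy with the w track above:

 [Code]
 template_XY = {}
 template_gen.generate_template_XY('shape', ('1-gram', range(0,1)), '1-state', template_XY)
 print(template_XY)

 [Expected output]
 {'shape': {(0,): ((0,),)}}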

Similarly, we can specify templates for generating features based on the Y labels only using the generate_template_Y(args) method. The method specification is:

 
    generate_template_Y(states)

Below is another set of examples demonstrating how to construct templates for generating features based on the labels only.

Specification 7: Generate features based on the current labels only. This should model the bias (i.e. prevalence of the labels in our dataset). This translates to:

 
    current label -> '1-state'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we extract the current label.
 
 [Code]
 template_Y = template_gen.generate_template_Y('1-state')

 [Output]
 template_Y: current label
 {'Y': [(0,)]}

Specification 8: Generate features based on the current labels and the first order label transition only. This should model the bias (i.e. prevalence of the labels in our dataset) and the label/state transitions. This translates to:

 
    current label and first order label transitions -> '1-state:2-states'
Using our sequence $s$, we pass through the sequence from left to right, where at each position, we extract the current label and the previous and current labels.
 
 [Code]
 template_Y = template_gen.generate_template_Y('1-state:2-states')

 [Output]
 template_Y: current label and first order label transitions
 {'Y': [(0,), (-1, 0)]}

5. Feature extractor

After specifying both templates (i.e. template_XY and template_Y), it is time to pass them to the FeatureExtractor class, which in turn extracts features from sequences/segments using these generated templates. The FeatureExtractor constructor takes the following arguments:

  • template_XY: the generated template for combining attributes in the different tracks with the labels. This will be saved as an instance attribute named template_X.
  • template_Y: the generated template for extracting features based on the labels only. This will be saved as an instance attribute named template_Y.
  • attr_desc: the attribute description dictionary specifying the name of each track, a description of the track, and its type (i.e. categorical or continuous). We already defined the attribute description dictionary when we explained attributes (see the attributes section).
In [7]:
# import template generator
from pyseqlab.utilities import TemplateGenerator
from pyseqlab.features_extraction import FeatureExtractor
# create a template generator
template_gen = TemplateGenerator()
# create a dictionary to define template_XY
template_XY = {}
# generating template for word attributes (i.e. word track)
track_attr_name = 'w'
# current attribute at each position combined with the current label
template_gen.generate_template_XY(track_attr_name, ('1-gram', range(0,1)), '1-state', template_XY)
print("template_XY: current observation, current label = w[0], Y[0]")
print(template_XY)
# generating templates based on the current labels
template_Y = template_gen.generate_template_Y('1-state')
print("template_Y: current label")
print(template_Y)
print("our defined attr_desc that uses the word track only")
print(attr_desc)
print("sequence")
print(seq_1)
# initialize feature extractor
fe = FeatureExtractor(template_XY, template_Y, attr_desc)
extracted_features = fe.extract_seq_features_perboundary(seq_1)
print("extracted features")
print(extracted_features)
template_Y.clear()
template_XY.clear()
print("*"*50)
print()
# use specification 1-6 to generate template_XY and specification 7 to generate template_Y
template_gen.generate_template_XY(track_attr_name, ('1-gram:2-grams', range(-1,2)), '1-state', template_XY)
template_gen.generate_template_XY(track_attr_name, ('3-grams', range(-1,2)), '1-state:2-states', template_XY)
print("template_XY: all previous templates (specification 1-6) combined")
print(template_XY)
# generating templates based on the current labels
template_Y = template_gen.generate_template_Y('1-state')
print("template_Y: current label")
print(template_Y)
print("our defined attr_desc")
print(attr_desc)
print("sequence")
print(seq_1)
# initialize feature extractor
fe = FeatureExtractor(template_XY, template_Y, attr_desc)
extracted_features = fe.extract_seq_features_perboundary(seq_1)
print("extracted features")
print(extracted_features)
template_Y.clear()
template_XY.clear()
template_XY: current observation, current label = w[0], Y[0]
{'w': {(0,): ((0,),)}}
template_Y: current label
{'Y': [(0,)]}
our defined attr_desc that uses the word track only
{'w': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <pyseqlab.attributes_extraction.GenericAttributeExtractor object at 0x103f9f3c8>>, 'description': 'word observation track', 'encoding': 'categorical'}}
sequence
Y sequence:
 ['DT', 'N', 'V', '.']
X sequence:
 {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
----------------------------------------
extracted features
{(2, 2): {'N': {'N': 1, 'w[0]=dog': 1}}, (4, 4): {'.': {'.': 1, 'w[0]=.': 1}}, (3, 3): {'V': {'V': 1, 'w[0]=barks': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]=The': 1}}}
**************************************************

template_XY: all previous templates (specification 1-6) combined
{'w': {(0, 1): ((0,),), (-1, 0, 1): ((0,), (-1, 0)), (0,): ((0,),), (-1, 0): ((0,),), (1,): ((0,),), (-1,): ((0,),)}}
template_Y: current label
{'Y': [(0,)]}
our defined attr_desc
{'w': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <pyseqlab.attributes_extraction.GenericAttributeExtractor object at 0x103f9f3c8>>, 'description': 'word observation track', 'encoding': 'categorical'}}
sequence
Y sequence:
 ['DT', 'N', 'V', '.']
X sequence:
 {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
----------------------------------------
extracted features
{(2, 2): {'DT|N': {'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'w[0]|w[1]=dog|barks': 1, 'N': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]=The|dog': 1, 'w[-1]=The': 1, 'w[1]=barks': 1}}, (4, 4): {'.': {'w[-1]=barks': 1, '.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}}, (3, 3): {'V': {'w[-1]=dog': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[1]=.': 1, 'V': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[0]=barks': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'w[0]=The': 1, 'w[1]=dog': 1}}}

Using Specification 1 for template_XY (i.e. get the current attribute and combine it with the current label at each position in the sequence) and Specification 7 for template_Y (i.e. get the current label at each position in the sequence), our feature extractor generates features based on the specified templates. It uses the extract_seq_features_perboundary(seq) method, which takes a sequence (i.e. an instance of the SequenceStruct class) as its argument. See the sequence_and_input_structure tutorial for more info about building sequences.

The generated features are:

 
extracted features
{(2, 2): {'N': {'N': 1, 'w[0]=dog': 1}}, (4, 4): {'.': {'.': 1, 'w[0]=.': 1}}, (3, 3): {'V': {'V': 1, 'w[0]=barks': 1}}, (1, 1): {'DT': {'w[0]=The': 1, 'DT': 1}}}

For each position, we see a dictionary that has the current Y label at that position (i.e. Y[0]) as a key and, as a value, a dictionary of the generated features. Following our earlier notation/representation for the features, the extracted features dictionary is equivalent to the table below (a short traversal sketch follows it):

 feature
 -->        w[0]=The, DT             w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->        DT                       N                       V                       .

 position   1                        2                       3                       4
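
Since extracted_features is simply a nested dictionary (boundary -> label pattern -> feature counts), it can be traversed directly; a minimal sketch reusing the extracted_features variable from the cell above:

 # walk the nested dictionary: boundary -> label pattern -> {feature name: value}
 for boundary, label_features in extracted_features.items():
     for label_pattern, features in label_features.items():
         print(boundary, label_pattern, features)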

Similarly, if we use the combined Specifications 1-6 to generate template_XY, while again using Specification 7 to generate template_Y, we get:

 
extracted features
{(2, 2): {'N': {'w[0]|w[1]=dog|barks': 1, 'w[1]=barks': 1, 'N': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'w[-1]=The': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]=dog': 1}, 'DT|N': {'w[-1]|w[0]|w[1]=The|dog|barks': 1}}, (4, 4): {'.': {'w[-1]|w[0]=barks|.': 1, 'w[-1]=barks': 1, 'w[0]=.': 1, '.': 1}}, (3, 3): {'V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1, 'V': 1, 'w[1]=.': 1, 'w[-1]=dog': 1, 'w[-1]|w[0]=dog|barks': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1}}, (1, 1): {'DT': {'w[1]=dog': 1, 'w[0]|w[1]=The|dog': 1, 'w[0]=The': 1, 'DT': 1}}}

As we can see, the extracted features would translate to the following:

 feature
 -->         w[0]=The, DT             w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         w[0]=The,                w[0]=dog,               w[0]=barks,             --
             w[1]=dog, DT             w[1]=barks, N           w[1]=., V

 -->         --                       w[-1]=The,              w[-1]=dog,              w[-1]=barks,
                                      w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         --                       w[-1]=The,              w[-1]=dog,              --
                                      w[0]=dog,               w[0]=barks,
                                      w[1]=barks, N           w[1]=., V

 -->        w[1]=dog, DT              w[1]=barks, N           w[1]=., V               --

 -->        --                        w[-1]=The, N            w[-1]=dog, V            w[-1]=barks, .

 -->        --                        w[-1]=The,              w[-1]=dog,              --
                                      w[0]=dog,               w[0]=barks,
                                      w[1]=barks, DT|N        w[1]=., N|V

 -->        DT                        N                       V                       .

 position   1                         2                       3                       4

Below is another example demonstrating how to use all tracks (i.e. w, numchars, shape and shaped), where the goal is to replicate the features requested earlier in this section. See the code cell below (In [26]) for generating the templates and the corresponding features.

extracted features
{(2, 2): {'N': {'numchars[-1]|numchars[0]|numchars[1]': 11, 'numchars[-1]|numchars[0]': 6, 'w[0]|w[1]=dog|barks': 1, 'shaped[0]=a': 1, 'shape[0]=aaa': 1, 'numchars[0]|numchars[1]': 8, 'numchars[0]': 3, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]=dog': 1}, 'DT|N': {'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'w[0]|w[1]=dog|barks': 1, 'w[-1]|w[0]=The|dog': 1, 'DT|N': 1, 'w[0]=dog': 1}}, (4, 4): {'.': {'w[-1]|w[0]=barks|.': 1, 'shape[0]=_': 1, 'shaped[0]=_': 1, '.': 1, 'numchars[-1]|numchars[0]': 6, 'numchars[0]': 1, 'w[0]=.': 1}, 'V|.': {'w[-1]|w[0]=barks|.': 1, 'w[0]=.': 1, 'V|.': 1}, 'DT|N|V|.': {'DT|N|V|.': 1}, 'N|V|.': {'N|V|.': 1}}, (3, 3): {'V': {'numchars[-1]|numchars[0]|numchars[1]': 9, 'numchars[-1]|numchars[0]': 8, 'shape[0]=aaaaa': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'V': 1, 'shaped[0]=a': 1, 'w[0]|w[1]=barks|.': 1, 'numchars[0]|numchars[1]': 6, 'numchars[0]': 5, 'w[-1]|w[0]=dog|barks': 1, 'w[0]=barks': 1}, 'N|V': {'N|V': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}, 'DT|N|V': {'DT|N|V': 1}}, (1, 1): {'DT': {'shape[0]=Aaa': 1, 'w[0]|w[1]=The|dog': 1, 'DT': 1, 'numchars[0]': 3, 'numchars[0]|numchars[1]': 6, 'shaped[0]=Aa': 1, 'w[0]=The': 1}}}

The extracted features are equivalent to the ones we were targeting (see below as a reminder):

 feature
 -->         w[0]=The, DT             w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         w[0]=The,                w[0]=dog,               w[0]=barks,             --
             w[1]=dog, DT             w[1]=barks, N           w[1]=., V

 -->         --                       w[-1]=The,              w[-1]=dog,              w[-1]=barks,
                                      w[0]=dog, N             w[0]=barks, V           w[0]=., .

 -->         --                       w[-1]=The,              w[-1]=dog,              --
                                      w[0]=dog,               w[0]=barks,
                                      w[1]=barks, N           w[1]=., V

 -->         --                       w[0]=dog, DT|N          w[0]=barks, N|V        w[0]=., V|.

 -->         --                       w[0]=dog,               w[0]=barks,             --
                                      w[1]=barks, DT|N        w[1]=., N|V

 -->         --                       w[-1]=The,              w[-1]=dog,              w[-1]=barks,
                                      w[0]=dog, DT|N          w[0]=barks, N|V         w[0]=., V|.


 -->         --                       w[-1]=The,              w[-1]=dog,              --
                                      w[0]=dog,               w[0]=barks,
                                      w[1]=barks, DT|N        w[1]=., N|V

 -->         numchars[0]=3, DT        numchars[0]=3, N        numchars[0]=5, V       numchars[0]=1, .

 -->         numchars[0,1]=6, DT      numchars[0,1]=8, N      numchars[0,1]=6, V     --

 -->         --                       numchars[-1,0]=6, N     numchars[-1,0]=8, V    numchars[-1,0]=6, .

 -->         --                       numchars[-1,0,1]=11, N  numchars[-1,0,1]=9, V  --

 -->         shape[0]=Aaa, DT         shape[0]=aaa, N         shape[0]=aaaaa, V      shape[0]=_, .
 -->         shaped[0]=Aa, DT         shaped[0]=a, N          shaped[0]=a, V         shaped[0]=_, .

 -->         DT                       N                       V                      .
 -->         --                       DT|N                    N|V                    V|.
 -->         --                       --                      DT|N|V                 N|V|.
 -->         --                       --                      --                     DT|N|V|.

 position    1                        2                       3                      4

In [26]:
def generate_template_alltracks():
    template_XY = {}
    template_Y = {}
    # generating template for words track (w)
    template_gen.generate_template_XY('w', ('1-gram', range(0,1)), '1-state:2-states', template_XY)
    template_gen.generate_template_XY('w', ('2-grams:3-grams', range(-1,2)), '1-state:2-states', template_XY)
    # generating template for numchars track 
    template_gen.generate_template_XY('numchars', ('1-gram', range(0,1)), '1-state', template_XY)
    template_gen.generate_template_XY('numchars', ('2-grams:3-grams', range(-1,2)), '1-state', template_XY)
    # generating template for shape track 
    template_gen.generate_template_XY('shape', ('1-gram', range(0,1)), '1-state', template_XY)
    # generating template for shaped track 
    template_gen.generate_template_XY('shaped', ('1-gram', range(0,1)), '1-state', template_XY)
    print("template_XY: using all tracks")
    print(template_XY)
    # generating templates based on label transitions
    template_Y = template_gen.generate_template_Y('1-state:2-states:3-states:4-states')
    print("temlate_Y: up to third order label transitions")
    print(template_Y)
    print("-"*40)
    return(template_XY, template_Y)

def generate_attributes_alltracks():
    # use the my_attr_extractor instance (a subclass of GenericAttributeExtractor) defined earlier
    print("my_attr_extractor.attr_desc")
    print(my_attr_extractor.attr_desc)
    print("-"*40)
    print("sequence")
    print(seq_1)
    # clear seg_attr so that new attributes are generated based on the extractor's attr_desc
    seq_1.seg_attr.clear()
    # generate attributes using subclassed attribute extractor
    my_attr_extractor.generate_attributes(seq_1, seq_1.get_y_boundaries())
    print("extracted attributes saved in seq_1.seg_attr")
    print(seq_1.seg_attr)
    print("-"*40)

# get feature templates for all tracks
template_XY, template_Y = generate_template_alltracks()
# generate attributes for all tracks
generate_attributes_alltracks()
# initialize feature extractor
fe = FeatureExtractor(template_XY, template_Y, my_attr_extractor.attr_desc)
extracted_features = fe.extract_seq_features_perboundary(seq_1)
print("extracted features")
print(extracted_features)
print("-"*40)
seq_1.seg_attr.clear()
template_XY: using all tracks
{'numchars': {(0, 1): ((0,),), (-1, 0, 1): ((0,),), (0,): ((0,),), (-1, 0): ((0,),)}, 'shaped': {(0,): ((0,),)}, 'w': {(0, 1): ((0,), (-1, 0)), (-1, 0, 1): ((0,), (-1, 0)), (0,): ((0,), (-1, 0)), (-1, 0): ((0,), (-1, 0))}, 'shape': {(0,): ((0,),)}}
template_Y: up to third order label transitions
{'Y': [(0,), (-1, 0), (-2, -1, 0), (-3, -2, -1, 0)]}
----------------------------------------
my_attr_extractor.attr_desc
{'shaped': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'the compressed/degenerated form/shape of the word', 'encoding': 'categorical'}, 'numchars': {'repr_func': <bound method GenericAttributeExtractor._represent_continuous_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'number of characters in a word', 'encoding': 'continuous'}, 'w': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'the word/token', 'encoding': 'categorical'}, 'shape': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'the shape of the word', 'encoding': 'categorical'}}
----------------------------------------
sequence
Y sequence:
 ['DT', 'N', 'V', '.']
X sequence:
 {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
----------------------------------------
extracted attributes saved in seq_1.seg_attr
{(3, 3): {'shaped': 'a', 'numchars': 5.0, 'w': 'barks', 'shape': 'aaaaa'}, (2, 2): {'shaped': 'a', 'numchars': 3.0, 'w': 'dog', 'shape': 'aaa'}, (4, 4): {'shaped': '_', 'numchars': 1.0, 'w': '.', 'shape': '_'}, (1, 1): {'shaped': 'Aa', 'numchars': 3.0, 'w': 'The', 'shape': 'Aaa'}}
----------------------------------------
extracted features
{(2, 2): {'DT|N': {'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]|w[1]=dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[-1]|numchars[0]': 6.0, 'w[0]|w[1]=dog|barks': 1, 'numchars[0]|numchars[1]': 8.0, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'shape[0]=aaa': 1, 'w[-1]|w[0]=The|dog': 1, 'numchars[0]': 3.0, 'shaped[0]=a': 1}}, (4, 4): {'.': {'.': 1, 'numchars[-1]|numchars[0]': 6.0, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}, 'N|V|.': {'N|V|.': 1}, 'V|.': {'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}, 'DT|N|V|.': {'DT|N|V|.': 1}}, (3, 3): {'V': {'V': 1, 'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'numchars[0]': 5.0, 'w[0]=barks': 1, 'shaped[0]=a': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}, 'DT|N|V': {'DT|N|V': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'numchars[0]|numchars[1]': 6.0, 'shaped[0]=Aa': 1, 'numchars[0]': 3.0, 'w[0]=The': 1}}}
----------------------------------------
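
The per-boundary output above is dense. As a quick sanity check, the small snippet below (an addition for illustration, not one of the tutorial's original cells) simply lists which label patterns fire at each boundary; it assumes the extracted_features variable from the cell above is still in scope.

In [ ]:
# illustrative inspection snippet: list the label patterns firing at each boundary
for boundary in sorted(extracted_features):
    print(boundary, "->", sorted(extracted_features[boundary].keys()))
# based on the dump above, this should print:
# (1, 1) -> ['DT']
# (2, 2) -> ['DT|N', 'N']
# (3, 3) -> ['DT|N|V', 'N|V', 'V']
# (4, 4) -> ['.', 'DT|N|V|.', 'N|V|.', 'V|.']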

6. Feature extractor subclasses

The FeatureExtractor class we introduced in a previous section has two further subclasses:

  • HOFeatureExtractor class representing a higher order feature extractor
  • FOFeatureExtractor class representing a first order feature extractor

As their names suggest, the main distinction between the two subclasses lies in the type of CRF model we aim to build. If we want CRF models that include features with at most first order label patterns (i.e. DT; N; V; .; DT|N; N|V; V|.), then FOFeatureExtractor is the class to use. If, in addition, we want to model higher order features (i.e. features whose label patterns/transitions involve more than two states/labels, such as DT|N|V; N|V|.; DT|N|V|. taken from our last example), then HOFeatureExtractor is the one to use. In other words, HOFeatureExtractor is equivalent to the FeatureExtractor class.
However, there is a subtle difference between the two subclasses (HOFeatureExtractor and FOFeatureExtractor) that concerns modeling the starting labels/states: the FOFeatureExtractor class supports the inclusion of a START state, which allows building models that support initial labels and label transitions at the starting position of the sequence.

We demonstrate the use of HOFeatureExtractor in the first code snippet below and contrast it with FOFeatureExtractor in the snippet that follows it. As described earlier, the features extracted by FOFeatureExtractor involve only first order label patterns:

extracted features using FOFeatureExtractor with start state disabled
{(2, 2): {'DT|N': {'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]|w[1]=dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[-1]|numchars[0]': 6.0, 'w[0]|w[1]=dog|barks': 1, 'numchars[0]|numchars[1]': 8.0, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'shape[0]=aaa': 1, 'w[-1]|w[0]=The|dog': 1, 'numchars[0]': 3.0, 'shaped[0]=a': 1}}, (4, 4): {'.': {'.': 1, 'numchars[-1]|numchars[0]': 6.0, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}, 'V|.': {'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}}, (3, 3): {'V': {'V': 1, 'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'numchars[0]': 5.0, 'w[0]=barks': 1, 'shaped[0]=a': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'numchars[0]|numchars[1]': 6.0, 'shaped[0]=Aa': 1, 'numchars[0]': 3.0, 'w[0]=The': 1}}}

Additionally, by enabling the start_state flag in the constructor, we obtain features that involve the __START__ state in their label pattern transitions:

extracted features using FOFeatureExtractor with start state enabled
{(2, 2): {'DT|N': {'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]|w[1]=dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[-1]|numchars[0]': 6.0, 'w[0]|w[1]=dog|barks': 1, 'numchars[0]|numchars[1]': 8.0, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'shape[0]=aaa': 1, 'w[-1]|w[0]=The|dog': 1, 'numchars[0]': 3.0, 'shaped[0]=a': 1}}, (4, 4): {'.': {'.': 1, 'numchars[-1]|numchars[0]': 6.0, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}, 'V|.': {'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}}, (3, 3): {'V': {'V': 1, 'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'numchars[0]': 5.0, 'w[0]=barks': 1, 'shaped[0]=a': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'numchars[0]|numchars[1]': 6.0, 'shaped[0]=Aa': 1, 'numchars[0]': 3.0, 'w[0]=The': 1}, '__START__|DT': {'w[0]|w[1]=The|dog': 1, '__START__|DT': 1, 'w[0]=The': 1}}}
Bottom line, the choice between the two classes depends on the type of CRF model we want to build (consult the crf_model_building tutorial for further info).

In [33]:
from pyseqlab.features_extraction import HOFeatureExtractor, FOFeatureExtractor

# get feature templates for all tracks
template_XY, template_Y = generate_template_alltracks()
print()
# generate attributes for all tracks
generate_attributes_alltracks()
print()
# initialize HOFeatureExtractor
ho_fe = HOFeatureExtractor(template_XY, template_Y, my_attr_extractor.attr_desc)
extracted_features = ho_fe.extract_seq_features_perboundary(seq_1)
print("extracted features using HOFeatureExtractor ")
print(extracted_features)
print("-"*40)
template_XY: using all tracks
{'numchars': {(0, 1): ((0,),), (-1, 0, 1): ((0,),), (0,): ((0,),), (-1, 0): ((0,),)}, 'shaped': {(0,): ((0,),)}, 'w': {(0, 1): ((0,), (-1, 0)), (-1, 0, 1): ((0,), (-1, 0)), (0,): ((0,), (-1, 0)), (-1, 0): ((0,), (-1, 0))}, 'shape': {(0,): ((0,),)}}
template_Y: up to third order label transitions
{'Y': [(0,), (-1, 0), (-2, -1, 0), (-3, -2, -1, 0)]}
----------------------------------------

my_attr_extractor.attr_desc
{'shaped': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'the compressed/degenerated form/shape of the word', 'encoding': 'categorical'}, 'numchars': {'repr_func': <bound method GenericAttributeExtractor._represent_continuous_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'number of characters in a word', 'encoding': 'continuous'}, 'w': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'the word/token', 'encoding': 'categorical'}, 'shape': {'repr_func': <bound method GenericAttributeExtractor._represent_categorical_attr of <__main__.MySeqAttributeExtractor object at 0x10b5432b0>>, 'description': 'the shape of the word', 'encoding': 'categorical'}}
----------------------------------------
sequence
Y sequence:
 ['DT', 'N', 'V', '.']
X sequence:
 {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
----------------------------------------
extracted attributes saved in seq_1.seg_attr
{(3, 3): {'shaped': 'a', 'numchars': 5.0, 'w': 'barks', 'shape': 'aaaaa'}, (2, 2): {'shaped': 'a', 'numchars': 3.0, 'w': 'dog', 'shape': 'aaa'}, (4, 4): {'shaped': '_', 'numchars': 1.0, 'w': '.', 'shape': '_'}, (1, 1): {'shaped': 'Aa', 'numchars': 3.0, 'w': 'The', 'shape': 'Aaa'}}
----------------------------------------

extracted features using HOFeatureExtractor 
{(2, 2): {'DT|N': {'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]|w[1]=dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[-1]|numchars[0]': 6.0, 'w[0]|w[1]=dog|barks': 1, 'numchars[0]|numchars[1]': 8.0, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'shape[0]=aaa': 1, 'w[-1]|w[0]=The|dog': 1, 'numchars[0]': 3.0, 'shaped[0]=a': 1}}, (4, 4): {'.': {'.': 1, 'numchars[-1]|numchars[0]': 6.0, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}, 'N|V|.': {'N|V|.': 1}, 'V|.': {'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}, 'DT|N|V|.': {'DT|N|V|.': 1}}, (3, 3): {'V': {'V': 1, 'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'numchars[0]': 5.0, 'w[0]=barks': 1, 'shaped[0]=a': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}, 'DT|N|V': {'DT|N|V': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'numchars[0]|numchars[1]': 6.0, 'shaped[0]=Aa': 1, 'numchars[0]': 3.0, 'w[0]=The': 1}}}
----------------------------------------

In [34]:
# get feature templates for all tracks
template_XY, template_Y = generate_template_alltracks()
print()
# initialize FOFeatureExtractor with __START__ state
fo_fe = FOFeatureExtractor(template_XY, template_Y, my_attr_extractor.attr_desc, start_state=True)
extracted_features = fo_fe.extract_seq_features_perboundary(seq_1)
print("extracted features using FOFeatureExtractor with start_state enabled")
print(extracted_features)
print("-"*40)
print()
print()
# using FOFeatureExtractor without the __START__ state
fo_fe.start_state = False
extracted_features = fo_fe.extract_seq_features_perboundary(seq_1)
print("extracted features using FOFeatureExtractor with start_state disabled")
print(extracted_features)
print("-"*40)
template_XY: using all tracks
{'numchars': {(0, 1): ((0,),), (-1, 0, 1): ((0,),), (0,): ((0,),), (-1, 0): ((0,),)}, 'shaped': {(0,): ((0,),)}, 'w': {(0, 1): ((0,), (-1, 0)), (-1, 0, 1): ((0,), (-1, 0)), (0,): ((0,), (-1, 0)), (-1, 0): ((0,), (-1, 0))}, 'shape': {(0,): ((0,),)}}
template_Y: up to third order label transitions
{'Y': [(0,), (-1, 0), (-2, -1, 0), (-3, -2, -1, 0)]}
----------------------------------------

extracted features using FOFeatureExtractor with start_state enabled
{(2, 2): {'DT|N': {'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]|w[1]=dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[-1]|numchars[0]': 6.0, 'w[0]|w[1]=dog|barks': 1, 'numchars[0]|numchars[1]': 8.0, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'shape[0]=aaa': 1, 'w[-1]|w[0]=The|dog': 1, 'numchars[0]': 3.0, 'shaped[0]=a': 1}}, (4, 4): {'.': {'.': 1, 'numchars[-1]|numchars[0]': 6.0, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}, 'V|.': {'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}}, (3, 3): {'V': {'V': 1, 'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'numchars[0]': 5.0, 'w[0]=barks': 1, 'shaped[0]=a': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'numchars[0]|numchars[1]': 6.0, 'shaped[0]=Aa': 1, 'numchars[0]': 3.0, 'w[0]=The': 1}, '__START__|DT': {'w[0]|w[1]=The|dog': 1, '__START__|DT': 1, 'w[0]=The': 1}}}
----------------------------------------


extracted features using FOFeatureExtractor with start_state disabled
{(2, 2): {'DT|N': {'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]|w[1]=dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[-1]|numchars[0]': 6.0, 'w[0]|w[1]=dog|barks': 1, 'numchars[0]|numchars[1]': 8.0, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'shape[0]=aaa': 1, 'w[-1]|w[0]=The|dog': 1, 'numchars[0]': 3.0, 'shaped[0]=a': 1}}, (4, 4): {'.': {'.': 1, 'numchars[-1]|numchars[0]': 6.0, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}, 'V|.': {'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}}, (3, 3): {'V': {'V': 1, 'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'numchars[0]': 5.0, 'w[0]=barks': 1, 'shaped[0]=a': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'numchars[0]|numchars[1]': 6.0, 'shaped[0]=Aa': 1, 'numchars[0]': 3.0, 'w[0]=The': 1}}}
----------------------------------------
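
To make the contrast concrete, the following small sketch (an addition for illustration, not one of the tutorial's original cells) collects the label patterns generated by each extractor and prints the ones produced only by the higher order extractor. It assumes ho_fe, fo_fe (with start_state disabled) and seq_1 from the cells above are still in scope, and that seq_1 still holds the attributes generated earlier in seg_attr.

In [ ]:
# illustrative sketch: compare the label patterns emitted by the HO and FO feature extractors
def collect_label_patterns(features_per_boundary):
    patterns = set()
    for per_pattern in features_per_boundary.values():
        patterns.update(per_pattern.keys())
    return patterns

ho_patterns = collect_label_patterns(ho_fe.extract_seq_features_perboundary(seq_1))
fo_patterns = collect_label_patterns(fo_fe.extract_seq_features_perboundary(seq_1))
print("label patterns produced only by HOFeatureExtractor:", ho_patterns - fo_patterns)
# based on the outputs above, this should be the set {'DT|N|V', 'N|V|.', 'DT|N|V|.'} (in some order)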

7. Feature filter

Now that we know how to generate features, we can move to the FeatureFilter class. FeatureFilter provides a way to filter out unwanted features after extraction. Generally speaking, when we build CRF models using PySeqLab, the features extracted at every position of the sequences in our training data are aggregated/collapsed to build the model. Therefore, if we want to remove features that occur fewer times than a given threshold or that involve specific label patterns, we can use the FeatureFilter class for this purpose.
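
To make the aggregation step concrete, here is a rough conceptual sketch of what collapsing the per-boundary features amounts to. It is only an approximation for illustration; the actual aggregation is performed by the extractor's aggregate_seq_features method, used in the cell further below.

In [ ]:
# conceptual sketch of aggregating/collapsing per-boundary features:
# sum the feature values of each label pattern across all boundaries
from collections import Counter

def collapse_features(features_per_boundary):
    collapsed = {}
    for per_pattern in features_per_boundary.values():
        for y_pattern, feats in per_pattern.items():
            # accumulate the feature values of this label pattern
            collapsed.setdefault(y_pattern, Counter()).update(feats)
    return collapsed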

The FeatureFilter class constructor takes filter_info as an argument, a dictionary specifying how the filtering should be applied.

filter_info dictionary has three keys:

  • filter_type to define the type of filter. It is either 'count' or 'pattern'
  • filter_val to define either the threshold value or y pattern to filter
  • filter_relation to define how the filter should be applied

The two filter types that could be specified in filter_info are:

  • count-based filter is applied to features based on how their count/value compares to a specified threshold.
        filter_info = {'filter_type': 'count', 
                       'filter_val':5,
                       'filter_relation': '<'}
    
    This filter would delete all features that have a count less than five. In 'filter_val' we set the threshold and in 'filter_relation' we set the comparison operator, which can be one of {<, $\le$, >, $\ge$, =}.

    NB: If the specified threshold is of type int (i.e. integer), the filter is applied only to categorical features. If the threshold is of type float, the filter is applied to all features (categorical and continuous). The assumption is that categorical features are assigned integer feature values while continuous features are assigned float values. Hence, when computing/deriving attributes, it is very important to assign feature values of the correct type (i.e. float in the case of continuous attributes); a small sketch after this list illustrates this convention.

  • pattern-based filter is applied to remove features associated with a specified label pattern.
        filter_info = {'filter_type': 'pattern',
                       'filter_val': {"O|L", "L|L"},
                       'filter_relation':'in'}
    
    This filter will delete all features whose associated y pattern is in {"O|L", "L|L"}. Hence, 'filter_val' takes a set of label patterns while 'filter_relation' takes either 'in' or 'not in' as a value. If 'not in' is specified, the above filter would instead delete all features except the ones associated with the {"O|L", "L|L"} label patterns.
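
As a quick illustration of the threshold-type convention described above, here is a minimal sketch (the toy gfeatures dictionary and its values are made up for illustration; note that apply_filter returns a filtered copy, as in the cell below where gfeatures is reused across filters).

In [ ]:
# minimal sketch of the int vs. float threshold behaviour (toy example)
from collections import Counter
from pyseqlab.features_extraction import FeatureFilter

# categorical features carry int values, continuous features carry float values
toy_gfeatures = {'N': Counter({'w[0]=dog': 1, 'numchars[0]': 3.0})}

# integer threshold: only categorical features are considered by the filter
int_filter = FeatureFilter({'filter_type': 'count',
                            'filter_val': 1,
                            'filter_relation': '>='})
print(int_filter.apply_filter(toy_gfeatures))
# expected: 'w[0]=dog' is removed, 'numchars[0]' is kept

# float threshold: both categorical and continuous features are considered
float_filter = FeatureFilter({'filter_type': 'count',
                              'filter_val': 1.0,
                              'filter_relation': '>='})
print(float_filter.apply_filter(toy_gfeatures))
# expected: both features are removed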

To see FeatureFilter in action, we apply the two types of filters to our last extracted features example.

In [19]:
from pyseqlab.features_extraction import FeatureFilter
# aggregate/collapse features across all positions in the sequence
gfeatures = fe.aggregate_seq_features(extracted_features, seq_1.get_y_boundaries())
print("originally extracted features by position:")
print(extracted_features)
print("-"*50)
print("aggregated/collapsed features across all positions in the sequence")
print(gfeatures)
print("-"*50)
print()
print("Count based filters:")
print()

# count-based filter targeting only categorical features (note the type of the threshold)
filter_info = {'filter_type': 'count', 
               'filter_val':1,
               'filter_relation': '>='}
ffilter = FeatureFilter(filter_info)
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering only categorical features with value >= 1")
print(filtered_features)
print("*"*50)

filter_info = {'filter_type': 'count', 
               'filter_val':3.0,
               'filter_relation': '>='}
ffilter.filter_info = filter_info
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering only continuous features with values >= 3")
print(filtered_features)
print("*"*50)

filter_info = {'filter_type': 'count', 
               'filter_val':1.0,
               'filter_relation': '>='}
ffilter.filter_info = filter_info
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering both categorical and continuous features with values >=1")
print(filtered_features)
print("*"*50)

print("Pattern based filters:")
print()
filter_info = {'filter_type': 'pattern', 
               'filter_val':{'N|V','DT'},
               'filter_relation': 'not in'}
ffilter.filter_info = filter_info
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering label patterns not in {'N|V','DT'}")
print(filtered_features)
print("*"*50)
filter_info = {'filter_type': 'pattern', 
               'filter_val':{'N|V','DT'},
               'filter_relation': 'in'}
ffilter.filter_info = filter_info
filtered_features = ffilter.apply_filter(gfeatures)
print("applied filter:")
print(filter_info)
print()
print("filtered extracted features: filtering label patterns in {'N|V','DT'}")
print(filtered_features)
print("*"*50)
originally extracted features by position:
{(2, 2): {'DT|N': {'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]|w[1]=dog|barks': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1}, 'N': {'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[-1]|numchars[0]': 6.0, 'w[0]|w[1]=dog|barks': 1, 'numchars[0]|numchars[1]': 8.0, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'N': 1, 'shape[0]=aaa': 1, 'w[-1]|w[0]=The|dog': 1, 'numchars[0]': 3.0, 'shaped[0]=a': 1}}, (4, 4): {'.': {'.': 1, 'numchars[-1]|numchars[0]': 6.0, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}, 'N|V|.': {'N|V|.': 1}, 'V|.': {'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1}, 'DT|N|V|.': {'DT|N|V|.': 1}}, (3, 3): {'V': {'V': 1, 'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'numchars[0]': 5.0, 'w[0]=barks': 1, 'shaped[0]=a': 1}, 'N|V': {'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}, 'DT|N|V': {'DT|N|V': 1}}, (1, 1): {'DT': {'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'numchars[0]|numchars[1]': 6.0, 'shaped[0]=Aa': 1, 'numchars[0]': 3.0, 'w[0]=The': 1}}}
--------------------------------------------------
aggregated/collapsed features across all positions in the sequence
{'DT': Counter({'numchars[0]|numchars[1]': 6.0, 'numchars[0]': 3.0, 'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'shaped[0]=Aa': 1, 'w[0]=The': 1}), 'V': Counter({'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'numchars[0]': 5.0, 'w[-1]|w[0]=dog|barks': 1, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[0]=barks': 1, 'V': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'shaped[0]=a': 1}), 'DT|N|V|.': Counter({'DT|N|V|.': 1}), 'N': Counter({'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[0]|numchars[1]': 8.0, 'numchars[-1]|numchars[0]': 6.0, 'numchars[0]': 3.0, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'shape[0]=aaa': 1, 'w[0]|w[1]=dog|barks': 1, 'N': 1, 'w[0]=dog': 1, 'w[-1]|w[0]=The|dog': 1, 'shaped[0]=a': 1}), 'DT|N|V': Counter({'DT|N|V': 1}), 'N|V|.': Counter({'N|V|.': 1}), 'DT|N': Counter({'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'w[0]=dog': 1, 'w[0]|w[1]=dog|barks': 1}), 'N|V': Counter({'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}), '.': Counter({'numchars[-1]|numchars[0]': 6.0, '.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}), 'V|.': Counter({'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1})}
--------------------------------------------------

Count based filters:

applied filter:
{'filter_type': 'count', 'filter_relation': '>=', 'filter_val': 1}

filtered extracted features: filtering only categorical features with value >= 1
{'DT': Counter({'numchars[0]|numchars[1]': 6.0, 'numchars[0]': 3.0}), 'V': Counter({'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'numchars[0]': 5.0}), '.': Counter({'numchars[-1]|numchars[0]': 6.0, 'numchars[0]': 1.0}), 'N': Counter({'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[0]|numchars[1]': 8.0, 'numchars[-1]|numchars[0]': 6.0, 'numchars[0]': 3.0}), 'DT|N|V': Counter(), 'DT|N': Counter(), 'N|V|.': Counter(), 'N|V': Counter(), 'DT|N|V|.': Counter(), 'V|.': Counter()}
**************************************************
applied filter:
{'filter_type': 'count', 'filter_relation': '>=', 'filter_val': 3.0}

filtered extracted features: filtering only continuous features with values >= 3
{'DT': Counter({'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'shaped[0]=Aa': 1, 'w[0]=The': 1}), 'V': Counter({'w[-1]|w[0]=dog|barks': 1, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[0]=barks': 1, 'V': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'shaped[0]=a': 1}), '.': Counter({'.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}), 'N': Counter({'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'shape[0]=aaa': 1, 'w[0]|w[1]=dog|barks': 1, 'N': 1, 'w[0]=dog': 1, 'w[-1]|w[0]=The|dog': 1, 'shaped[0]=a': 1}), 'DT|N|V': Counter({'DT|N|V': 1}), 'DT|N': Counter({'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'w[0]|w[1]=dog|barks': 1}), 'N|V|.': Counter({'N|V|.': 1}), 'N|V': Counter({'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1}), 'DT|N|V|.': Counter({'DT|N|V|.': 1}), 'V|.': Counter({'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1})}
**************************************************
applied filter:
{'filter_type': 'count', 'filter_relation': '>=', 'filter_val': 1.0}

filtered extracted features: filtering both categorical and continuous features with values >=1
{'DT': Counter(), 'V': Counter(), '.': Counter(), 'N': Counter(), 'DT|N|V': Counter(), 'DT|N': Counter(), 'N|V|.': Counter(), 'N|V': Counter(), 'DT|N|V|.': Counter(), 'V|.': Counter()}
**************************************************
Pattern based filters:

applied filter:
{'filter_type': 'pattern', 'filter_relation': 'not in', 'filter_val': {'DT', 'N|V'}}

filtered extracted features: filtering label patterns not in {'N|V','DT'}
{'DT': Counter({'numchars[0]|numchars[1]': 6.0, 'numchars[0]': 3.0, 'DT': 1, 'w[0]|w[1]=The|dog': 1, 'shape[0]=Aaa': 1, 'shaped[0]=Aa': 1, 'w[0]=The': 1}), 'N|V': Counter({'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'w[-1]|w[0]=dog|barks': 1, 'N|V': 1, 'w[0]=barks': 1, 'w[0]|w[1]=barks|.': 1})}
**************************************************
applied filter:
{'filter_type': 'pattern', 'filter_relation': 'in', 'filter_val': {'DT', 'N|V'}}

filtered extracted features: filtering label patterns in {'N|V','DT'}
{'V': Counter({'numchars[-1]|numchars[0]|numchars[1]': 9.0, 'numchars[-1]|numchars[0]': 8.0, 'numchars[0]|numchars[1]': 6.0, 'numchars[0]': 5.0, 'w[-1]|w[0]=dog|barks': 1, 'shape[0]=aaaaa': 1, 'w[0]|w[1]=barks|.': 1, 'w[0]=barks': 1, 'V': 1, 'w[-1]|w[0]|w[1]=dog|barks|.': 1, 'shaped[0]=a': 1}), '.': Counter({'numchars[-1]|numchars[0]': 6.0, '.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1, 'numchars[0]': 1.0, 'shaped[0]=_': 1, 'shape[0]=_': 1}), 'N': Counter({'numchars[-1]|numchars[0]|numchars[1]': 11.0, 'numchars[0]|numchars[1]': 8.0, 'numchars[-1]|numchars[0]': 6.0, 'numchars[0]': 3.0, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'shape[0]=aaa': 1, 'w[0]|w[1]=dog|barks': 1, 'N': 1, 'w[0]=dog': 1, 'w[-1]|w[0]=The|dog': 1, 'shaped[0]=a': 1}), 'DT|N|V': Counter({'DT|N|V': 1}), 'DT|N': Counter({'DT|N': 1, 'w[-1]|w[0]=The|dog': 1, 'w[0]=dog': 1, 'w[-1]|w[0]|w[1]=The|dog|barks': 1, 'w[0]|w[1]=dog|barks': 1}), 'N|V|.': Counter({'N|V|.': 1}), 'DT|N|V|.': Counter({'DT|N|V|.': 1}), 'V|.': Counter({'V|.': 1, 'w[0]=.': 1, 'w[-1]|w[0]=barks|.': 1})}
**************************************************

8. Concluding remarks

The ideas and concepts explained in this tutorial are very important for the next stages, especially when it comes to building/training CRF models. Understanding concepts such as attributes, features, feature templates, feature extraction, and feature filters will help in deciding what CRF model we want to build and how the feature parameters are trained during the learning phase. We tackle model training in the crf_model_building tutorial.

In [ ]: