In [3]:
# importing and defining relevant directories
import sys
import os
# pyseqlab root directory
pyseqlab_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
# print("pyseqlab cloned dir:", pyseqlab_dir)
# inserting the pyseqlab directory into Python's system path
# if pyseqlab is already installed this could be commented out
sys.path.insert(0, pyseqlab_dir)
# current directory (tutorials)
tutorials_dir = os.path.join(pyseqlab_dir, 'tutorials')
# print("tutorials_dir:", tutorials_dir)
dataset_dir = os.path.join(tutorials_dir, 'datasets', 'conll2000')
# print("dataset_dir:", dataset_dir)
# to use for customizing the display/format of the cells
from IPython.core.display import HTML
with open(os.path.join(tutorials_dir, 'pseqlab_base.css')) as f:
    css = "".join(f.readlines())
HTML(css)
Out[3]:

1. Objectives and goals

In this tutorial, we will learn how to:

  • read sequences/segments from a file (additionally we will explain the file/input format)
  • construct and represent sequences/segments programmatically to use later in the PySeqLab package for building/training models

Reminder: To work with this tutorial interactively, we first need to clone the PySeqLab package locally and then navigate to [cloned_package_dir]/tutorials, where [cloned_package_dir] is the path to the cloned package folder (see the directory tree below).

├── pyseqlab
    ├── tutorials
    │   ├── datasets
    │   │   ├── conll2000
    │   │   ├── segments

We start by discussing the concept of sequences and the file format used to store them.

2. Representing sequences

Generally speaking, a sequence is a list of elements that follow an order (see sequences). The order could be due to an inherent structure such as sentences (sequence of words) or temporal such as readings/measurements from a sensor.

Sequence labeling is an important task in multiple domains: given a sequence of observations, the goal is to label/tag each observation using a set of permissible tags that capture higher-order structure.

For example, given a sentence (a sequence of words), the goal is to tag/label each word with its part-of-speech. Another example is chunking/shallow parsing using the CoNLL 2000 dataset: given a set of sentences (our sequences), where each sentence is composed of words and their corresponding part-of-speech tags, the goal is to predict the chunk/shallow parse label of every word in the sentence.

With this preliminary definition in mind, we can start investigating how to represent/parse sequences. In this tutorial we will use CoNLL 2000 sentences as an example of sequences.

3. Input file format

The input file comprising the sequences follows a column-format template: the observations/elements of each sequence are laid out one per line, sequences are separated by an empty line, and the last column is reserved for the tag/label that we aim to predict.
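As a rough illustration of the template (this is just a sketch, not PySeqLab's parser), the format can be split on empty lines into sequences, and each line can be split on the column separator into observation tracks plus the trailing label:

```python
# a miniature two-sequence file in the column format described above
raw = """Confidence NN B-NP
in IN B-PP

Chancellor NNP O
of IN B-PP"""

sequences = []
for block in raw.strip().split("\n\n"):           # empty line separates sequences
    rows = [line.split(" ") for line in block.splitlines()]
    X = [{'w': w, 'pos': p} for w, p, _ in rows]  # all but the last column: tracks
    Y = [label for *_, label in rows]             # last column: the tag/label
    sequences.append((X, Y))

print(len(sequences))  # 2
```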

The dataset files (training and test) of the CoNLL 2000 task follow this column-format template. An excerpt of the training file:

w pos chunk
Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
is VBZ B-VP
widely RB I-VP
expected VBN I-VP
to TO I-VP
take VB I-VP
another DT B-NP
sharp JJ I-NP
dive NN I-NP
if IN B-SBAR
trade NN B-NP
figures NNS I-NP
for IN B-PP
September NNP B-NP
, , O
due JJ B-ADJP
for IN B-PP
release NN B-NP
tomorrow NN B-NP
, , O
fail VB B-VP
to TO I-VP
show VB I-VP
a DT B-NP
substantial JJ I-NP
improvement NN I-NP
from IN B-PP
July NNP B-NP
and CC I-NP
August NNP I-NP
's POS B-NP
near-record JJ I-NP
deficits NNS I-NP
. . O

Chancellor NNP O
of IN B-PP
the DT B-NP
Exchequer NNP I-NP
Nigel NNP B-NP
Lawson NNP I-NP
's POS B-NP
restated VBN I-NP
commitment NN I-NP
to TO B-PP
a DT B-NP
firm NN I-NP
monetary JJ I-NP
policy NN I-NP
has VBZ B-VP
helped VBN I-VP
to TO I-VP
prevent VB I-VP
a DT B-NP
freefall NN I-NP
in IN B-PP
sterling NN B-NP
over IN B-PP
the DT B-NP
past JJ I-NP
week NN I-NP
. . O

Looking at the two sentences, we can identify two tracks of observations: (1) words and (2) part-of-speech tags. The two tracks are laid out as space-separated columns, with the last column holding the label/tag, and sentences are separated by an empty line. To keep the terminology consistent, we will use the following terms/definitions:

  • sequence: to refer to a list of elements that follow an order
  • observation: to refer to an element in the sequence
  • track: to refer to different types of observations. In the chunking example, we have a track for the words and another for the part-of-speech
  • label/tag: to refer to the outcome/class we want to predict

This file format can support as many tracks as we want: new tracks are added as separate columns, while the last column remains the tag/label.
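The chunk labels in the excerpt above follow the BIO convention: a 'B-' prefix begins a chunk, 'I-' continues the current chunk, and 'O' marks words outside any chunk. As a rough sketch (this helper is an illustration for strict BIO input, not part of PySeqLab), the tags can be decoded into labeled spans:

```python
def bio_to_spans(tags):
    """Decode strict BIO tags into 1-indexed (start, end, label) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags, start=1):
        if label is not None and not tag.startswith('I-'):
            # the current chunk ends just before this position
            spans.append((start, i - 1, label))
            start, label = None, None
        if tag.startswith('B-'):
            start, label = i, tag[2:]
    if label is not None:
        # close a chunk that runs to the end of the sequence
        spans.append((start, len(tags), label))
    return spans

print(bio_to_spans(['B-NP', 'B-PP', 'B-NP', 'I-NP', 'B-VP']))
# [(1, 1, 'NP'), (2, 2, 'PP'), (3, 4, 'NP'), (5, 5, 'VP')]
```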

To read this file format, PySeqLab provides a DataFileParser in the utilities module.

As a reminder, a visual tree directory for the dataset folder under the current directory (tutorials) is provided below:

├── tutorials
    ├── datasets
    │   ├── conll2000
    │   │   ├── test.txt
    │   │   ├── train.txt
    │   ├── segments
    │   │   ├── segments.txt

4. Parsing sequences

The DataFileParser class has a read_file(args) method that accepts the following:

Arguments:

  • file_path: (string), directory/path to the file to be read
  • header: (string or list)
    • 'main': in case there is only one header at the top of the file
    • 'per_sequence': in case there is a header line before every sequence
    • list of keywords such as ['w', 'part_of_speech'] in case no header is provided in the file

Keyword arguments:

  • y_ref: (boolean), specifying if the last column is the tag/label column
  • seg_other_symbol: (string or None as default), specifies whether we want to parse sequences or segments. Consult the segments section in this notebook for further explanation
  • column_sep: (string), specifying the separator between the tracks (columns of observations) to be read

For the CoNLL task, we will set both the arguments and keyword arguments in the following cells to read the training file.

In [4]:
from pyseqlab.utilities import DataFileParser
# initialize a data file parser
dparser = DataFileParser()
# provide options to the parser such as the header info, the column separator
# and whether the y labels are already present in the file
# 'main' means the header is found in the first line of the file
header = "main"
# y_ref is a boolean indicating if the label to predict is already in the file
y_ref = True
# separator between the columns of observations
column_sep = " "
seqs = []
for seq in dparser.read_file(os.path.join(dataset_dir, 'train.txt'), header,
                             y_ref=y_ref, column_sep=column_sep):
    seqs.append(seq)
    
# printing one sequence for display
print(seqs[0])
print("type(seq):", type(seqs[0]))
print("number of parsed sequences is: ", len(seqs))
Y sequence:
 ['B-NP', 'B-PP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'I-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-SBAR', 'B-NP', 'I-NP', 'B-PP', 'B-NP', 'O', 'B-ADJP', 'B-PP', 'B-NP', 'B-NP', 'O', 'B-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-PP', 'B-NP', 'I-NP', 'I-NP', 'B-NP', 'I-NP', 'I-NP', 'O']
X sequence:
 {1: {'pos': 'NN', 'w': 'Confidence'}, 2: {'pos': 'IN', 'w': 'in'}, 3: {'pos': 'DT', 'w': 'the'}, 4: {'pos': 'NN', 'w': 'pound'}, 5: {'pos': 'VBZ', 'w': 'is'}, 6: {'pos': 'RB', 'w': 'widely'}, 7: {'pos': 'VBN', 'w': 'expected'}, 8: {'pos': 'TO', 'w': 'to'}, 9: {'pos': 'VB', 'w': 'take'}, 10: {'pos': 'DT', 'w': 'another'}, 11: {'pos': 'JJ', 'w': 'sharp'}, 12: {'pos': 'NN', 'w': 'dive'}, 13: {'pos': 'IN', 'w': 'if'}, 14: {'pos': 'NN', 'w': 'trade'}, 15: {'pos': 'NNS', 'w': 'figures'}, 16: {'pos': 'IN', 'w': 'for'}, 17: {'pos': 'NNP', 'w': 'September'}, 18: {'pos': ',', 'w': ','}, 19: {'pos': 'JJ', 'w': 'due'}, 20: {'pos': 'IN', 'w': 'for'}, 21: {'pos': 'NN', 'w': 'release'}, 22: {'pos': 'NN', 'w': 'tomorrow'}, 23: {'pos': ',', 'w': ','}, 24: {'pos': 'VB', 'w': 'fail'}, 25: {'pos': 'TO', 'w': 'to'}, 26: {'pos': 'VB', 'w': 'show'}, 27: {'pos': 'DT', 'w': 'a'}, 28: {'pos': 'JJ', 'w': 'substantial'}, 29: {'pos': 'NN', 'w': 'improvement'}, 30: {'pos': 'IN', 'w': 'from'}, 31: {'pos': 'NNP', 'w': 'July'}, 32: {'pos': 'CC', 'w': 'and'}, 33: {'pos': 'NNP', 'w': 'August'}, 34: {'pos': 'POS', 'w': "'s"}, 35: {'pos': 'JJ', 'w': 'near-record'}, 36: {'pos': 'NNS', 'w': 'deficits'}, 37: {'pos': '.', 'w': '.'}}
----------------------------------------
type(seq): <class 'pyseqlab.utilities.SequenceStruct'>
number of parsed sequences is:  8936
In [6]:
seq = seqs[0]
print("X:")
print(seq.X)
print("-"*40)
print("Y:")
print(seq.Y)
print("-"*40)
print("flat_y:")
print(seq.flat_y)
print("-"*40)
X:
{1: {'pos': 'NN', 'w': 'Confidence'}, 2: {'pos': 'IN', 'w': 'in'}, 3: {'pos': 'DT', 'w': 'the'}, 4: {'pos': 'NN', 'w': 'pound'}, 5: {'pos': 'VBZ', 'w': 'is'}, 6: {'pos': 'RB', 'w': 'widely'}, 7: {'pos': 'VBN', 'w': 'expected'}, 8: {'pos': 'TO', 'w': 'to'}, 9: {'pos': 'VB', 'w': 'take'}, 10: {'pos': 'DT', 'w': 'another'}, 11: {'pos': 'JJ', 'w': 'sharp'}, 12: {'pos': 'NN', 'w': 'dive'}, 13: {'pos': 'IN', 'w': 'if'}, 14: {'pos': 'NN', 'w': 'trade'}, 15: {'pos': 'NNS', 'w': 'figures'}, 16: {'pos': 'IN', 'w': 'for'}, 17: {'pos': 'NNP', 'w': 'September'}, 18: {'pos': ',', 'w': ','}, 19: {'pos': 'JJ', 'w': 'due'}, 20: {'pos': 'IN', 'w': 'for'}, 21: {'pos': 'NN', 'w': 'release'}, 22: {'pos': 'NN', 'w': 'tomorrow'}, 23: {'pos': ',', 'w': ','}, 24: {'pos': 'VB', 'w': 'fail'}, 25: {'pos': 'TO', 'w': 'to'}, 26: {'pos': 'VB', 'w': 'show'}, 27: {'pos': 'DT', 'w': 'a'}, 28: {'pos': 'JJ', 'w': 'substantial'}, 29: {'pos': 'NN', 'w': 'improvement'}, 30: {'pos': 'IN', 'w': 'from'}, 31: {'pos': 'NNP', 'w': 'July'}, 32: {'pos': 'CC', 'w': 'and'}, 33: {'pos': 'NNP', 'w': 'August'}, 34: {'pos': 'POS', 'w': "'s"}, 35: {'pos': 'JJ', 'w': 'near-record'}, 36: {'pos': 'NNS', 'w': 'deficits'}, 37: {'pos': '.', 'w': '.'}}
----------------------------------------
Y:
{(35, 35): 'I-NP', (20, 20): 'B-PP', (13, 13): 'B-SBAR', (29, 29): 'I-NP', (6, 6): 'I-VP', (2, 2): 'B-PP', (31, 31): 'B-NP', (12, 12): 'I-NP', (11, 11): 'I-NP', (7, 7): 'I-VP', (23, 23): 'O', (27, 27): 'B-NP', (25, 25): 'I-VP', (16, 16): 'B-PP', (22, 22): 'B-NP', (34, 34): 'B-NP', (37, 37): 'O', (33, 33): 'I-NP', (21, 21): 'B-NP', (26, 26): 'I-VP', (5, 5): 'B-VP', (10, 10): 'B-NP', (36, 36): 'I-NP', (4, 4): 'I-NP', (9, 9): 'I-VP', (17, 17): 'B-NP', (30, 30): 'B-PP', (24, 24): 'B-VP', (8, 8): 'I-VP', (32, 32): 'I-NP', (14, 14): 'B-NP', (18, 18): 'O', (3, 3): 'B-NP', (28, 28): 'I-NP', (19, 19): 'B-ADJP', (15, 15): 'I-NP', (1, 1): 'B-NP'}
----------------------------------------
flat_y:
['B-NP', 'B-PP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'I-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-SBAR', 'B-NP', 'I-NP', 'B-PP', 'B-NP', 'O', 'B-ADJP', 'B-PP', 'B-NP', 'B-NP', 'O', 'B-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-PP', 'B-NP', 'I-NP', 'I-NP', 'B-NP', 'I-NP', 'I-NP', 'O']
----------------------------------------

The parser read 8936 sequences from the training file. Each sequence is an instance of the SequenceStruct class, which is also found in the utilities module of the PySeqLab package. The three main attributes of a sequence are as follows:

  • X: dictionary of dictionaries. For each position in the sequence, we have a dictionary that includes the track name and the corresponding observation at that position as key:value pairs. The track names are the ones extracted from the header variable specified while parsing the file. Example:

    X:
    {1: {'pos': 'NN', 'w': 'Confidence'}, 2: {'pos': 'IN', 'w': 'in'}, 3: {'pos': 'DT', 'w': 'the'}, 4: {'pos': 'NN', 'w': 'pound'}, 5: {'pos': 'VBZ', 'w': 'is'}, 6: {'pos': 'RB', 'w': 'widely'}, 7: {'pos': 'VBN', 'w': 'expected'}, 8: {'pos': 'TO', 'w': 'to'}, 9: {'pos': 'VB', 'w': 'take'}, 10: {'pos': 'DT', 'w': 'another'}, 11: {'pos': 'JJ', 'w': 'sharp'}, 12: {'pos': 'NN', 'w': 'dive'}, 13: {'pos': 'IN', 'w': 'if'}, 14: {'pos': 'NN', 'w': 'trade'}, 15: {'pos': 'NNS', 'w': 'figures'}, 16: {'pos': 'IN', 'w': 'for'}, 17: {'pos': 'NNP', 'w': 'September'}, 18: {'pos': ',', 'w': ','}, 19: {'pos': 'JJ', 'w': 'due'}, 20: {'pos': 'IN', 'w': 'for'}, 21: {'pos': 'NN', 'w': 'release'}, 22: {'pos': 'NN', 'w': 'tomorrow'}, 23: {'pos': ',', 'w': ','}, 24: {'pos': 'VB', 'w': 'fail'}, 25: {'pos': 'TO', 'w': 'to'}, 26: {'pos': 'VB', 'w': 'show'}, 27: {'pos': 'DT', 'w': 'a'}, 28: {'pos': 'JJ', 'w': 'substantial'}, 29: {'pos': 'NN', 'w': 'improvement'}, 30: {'pos': 'IN', 'w': 'from'}, 31: {'pos': 'NNP', 'w': 'July'}, 32: {'pos': 'CC', 'w': 'and'}, 33: {'pos': 'NNP', 'w': 'August'}, 34: {'pos': 'POS', 'w': "'s"}, 35: {'pos': 'JJ', 'w': 'near-record'}, 36: {'pos': 'NNS', 'w': 'deficits'}, 37: {'pos': '.', 'w': '.'}}
    
    The keys of the dictionary are the numbered positions. In {1: {'pos': 'NN', 'w': 'Confidence'}}, 1 is the position under inspection in the sequence and {'pos': 'NN', 'w': 'Confidence'} holds the observations at that position: 'Confidence' belongs to the word track (key 'w') and 'NN' to the part-of-speech track (key 'pos').
  • Y: dictionary specifying the labels/tags and their corresponding boundaries across the whole sequence. Example:
    Y:
    {(35, 35): 'I-NP', (20, 20): 'B-PP', (13, 13): 'B-SBAR', (29, 29): 'I-NP', (6, 6): 'I-VP', (2, 2): 'B-PP', (31, 31): 'B-NP', (12, 12): 'I-NP', (11, 11): 'I-NP', (7, 7): 'I-VP', (23, 23): 'O', (27, 27): 'B-NP', (25, 25): 'I-VP', (16, 16): 'B-PP', (22, 22): 'B-NP', (34, 34): 'B-NP', (37, 37): 'O', (33, 33): 'I-NP', (21, 21): 'B-NP', (26, 26): 'I-VP', (5, 5): 'B-VP', (10, 10): 'B-NP', (36, 36): 'I-NP', (4, 4): 'I-NP', (9, 9): 'I-VP', (17, 17): 'B-NP', (30, 30): 'B-PP', (24, 24): 'B-VP', (8, 8): 'I-VP', (32, 32): 'I-NP', (14, 14): 'B-NP', (18, 18): 'O', (3, 3): 'B-NP', (28, 28): 'I-NP', (19, 19): 'B-ADJP', (15, 15): 'I-NP', (1, 1): 'B-NP'}
    
    The keys of the dictionary are the boundaries (positions) the label/tag spans. When parsing/modeling sequences, a label/tag spans exactly one observation, so each boundary is a tuple repeating the same position. When modeling/parsing segments, the boundaries vary, as labels are allowed to span multiple observations. Check the segments section for more info.
  • flat_y: list of labels at every position in the sequence. Example:
    flat_y:
    ['B-NP', 'B-PP', 'B-NP', 'I-NP', 'B-VP', 'I-VP', 'I-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-SBAR', 'B-NP', 'I-NP', 'B-PP', 'B-NP', 'O', 'B-ADJP', 'B-PP', 'B-NP', 'B-NP', 'O', 'B-VP', 'I-VP', 'I-VP', 'B-NP', 'I-NP', 'I-NP', 'B-PP', 'B-NP', 'I-NP', 'I-NP', 'B-NP', 'I-NP', 'I-NP', 'O']
    

    There are other attributes in a SequenceStruct instance that can be explored further by consulting the API docs of the PySeqLab package.
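Under the hood, these attributes are ordinary Python structures, so they can be inspected directly. A small sketch (using literal dicts standing in for a parsed SequenceStruct, so no PySeqLab import is needed) that aligns the 1-indexed positions of X with the 0-indexed flat_y list:

```python
# miniature versions of the attributes shown above
X = {1: {'pos': 'NN', 'w': 'Confidence'},
     2: {'pos': 'IN', 'w': 'in'},
     3: {'pos': 'DT', 'w': 'the'}}
Y = {(1, 1): 'B-NP', (2, 2): 'B-PP', (3, 3): 'B-NP'}
flat_y = ['B-NP', 'B-PP', 'B-NP']

# positions in X (and the boundaries in Y) are 1-indexed,
# while flat_y is an ordinary 0-indexed list
rows = [(t, X[t]['w'], X[t]['pos'], flat_y[t - 1]) for t in sorted(X)]
for row in rows:
    print(*row)
```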

5. Constructing sequences programmatically

We have seen so far how to parse sequences from a text file following the column-format template. Now, what if we want to construct the sequences from code (i.e. on the fly)? The answer is a definite yes. To demonstrate this, suppose we have the sentence $s$ = "The dog barks." and we want to represent it as an instance of our SequenceStruct class.

First, we determine the different components of the sequence. Following our earlier terminology, the sentence $s$ is a sequence with four observations, each belonging to one track (here, the words). So we use w as the name of the track and proceed to build the X instance attribute of the sequence. For the labels, we have two options:

  1. in case no labels are defined, we get an empty Y instance attribute, or
  2. in case labels are defined, the Y instance attribute holds them

The cell below demonstrates the two options.

In [7]:
# import SequenceStruct class
from pyseqlab.utilities import SequenceStruct
# define the X attribute
X = [{'w':'The'}, {'w':'dog'}, {'w':'barks'}, {'w':'.'}]
# empty label/tag sequence
Y = []
seq_1 = SequenceStruct(X, Y)
print("labels are not defined")
print("seq_1:")
print("X:", seq_1.X)
print("Y:", seq_1.Y)
print("flat_y:", seq_1.flat_y)

print("-"*40)
print("labels are defined")
# defined label/tag sequence
Y = ['DT', 'N', 'V', '.']
seq_2 = SequenceStruct(X, Y)
print("X:", seq_2.X)
print("Y:", seq_2.Y)
print("flat_y:", seq_2.flat_y)
labels are not defined
seq_1:
X: {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
Y: {}
flat_y: []
----------------------------------------
labels are defined
X: {1: {'w': 'The'}, 2: {'w': 'dog'}, 3: {'w': 'barks'}, 4: {'w': '.'}}
Y: {(3, 3): 'V', (4, 4): '.', (1, 1): 'DT', (2, 2): 'N'}
flat_y: ['DT', 'N', 'V', '.']
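As noted earlier, the file format and SequenceStruct both allow an arbitrary number of tracks per observation. A sketch of the multi-track input form (the 'shape' track is a made-up extra feature, and the dict comprehensions merely mimic the positional X/Y attributes shown above — they are not PySeqLab code):

```python
# input form: one dict of track:observation pairs per position
X = [{'w': 'The', 'pos': 'DT', 'shape': 'Xxx'},
     {'w': 'dog', 'pos': 'NN', 'shape': 'xxx'},
     {'w': 'barks', 'pos': 'VBZ', 'shape': 'xxx'}]
Y = ['DT', 'N', 'V']

# positional form, mirroring the X/Y attributes of a constructed sequence (1-indexed)
X_pos = {t: obs for t, obs in enumerate(X, start=1)}
Y_pos = {(t, t): label for t, label in enumerate(Y, start=1)}
print(X_pos[2]['shape'], Y_pos[(3, 3)])
```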

6. Constructing segments programmatically

The discussion so far was focused on representing sequences. However, PySeqLab provides another option that allows representing segments.

By definition, a segment representation allows a label to span more than one observation. For example, the sentence $s$ = "Yale is found in New Haven." is a sequence of observations (words). The corresponding labels belong to one of three types: {'University', 'Location', 'Other'}. These labels represent named entities that give the words in $s$ higher-level semantics. It is evident that the 'Location' label spans two observations (i.e. "New Haven"), and as a result we can parse this sentence as segments as opposed to a sequence. The cell below demonstrates the two possible representations.

In [8]:
# define the X attribute
X = [{'w':'Yale'}, {'w':'is'}, {'w':'found'}, {'w':'in'}, {'w':'New'}, {'w':'Haven'}]
Y = ["University", "Other", "Other", "Other", "Location", "Location"]
# model as a sequence
seq_1 = SequenceStruct(X, Y)
print("Modeled as a sequence")
print("X:", seq_1.X)
print("Y:", seq_1.Y)
print("flat_y:", seq_1.flat_y)

print("-"*40)
print("Modeled as a segment")
seq_2 = SequenceStruct(X, Y, seg_other_symbol="Other")
print("X:", seq_2.X)
print("Y:", seq_2.Y)
print("flat_y:", seq_2.flat_y)
Modeled as a sequence
X: {1: {'w': 'Yale'}, 2: {'w': 'is'}, 3: {'w': 'found'}, 4: {'w': 'in'}, 5: {'w': 'New'}, 6: {'w': 'Haven'}}
Y: {(6, 6): 'Location', (5, 5): 'Location', (2, 2): 'Other', (4, 4): 'Other', (1, 1): 'University', (3, 3): 'Other'}
flat_y: ['University', 'Other', 'Other', 'Other', 'Location', 'Location']
----------------------------------------
Modeled as a segment
X: {1: {'w': 'Yale'}, 2: {'w': 'is'}, 3: {'w': 'found'}, 4: {'w': 'in'}, 5: {'w': 'New'}, 6: {'w': 'Haven'}}
Y: {(5, 6): 'Location', (3, 3): 'Other', (4, 4): 'Other', (1, 1): 'University', (2, 2): 'Other'}
flat_y: ['University', 'Other', 'Other', 'Other', 'Location', 'Location']

As can be seen, the difference lies in how the Y instance attribute is modeled for segments versus sequences. In a sequence, labels span exactly one observation, while in segments they can span multiple observations (as with the 'Location' label). This distinction is made possible by the keyword argument seg_other_symbol: by setting the non-entity symbol to 'Other', we obtain a segment representation.
NB: the non-entity symbol can assume any value, not only 'Other'. That is, whatever symbol we decide represents non-entities can be passed to the seg_other_symbol keyword argument. If it is left unspecified (default None), we obtain a sequence.

The seg_other_symbol keyword argument plays the same role in the read_file(args) method of the DataFileParser class. That is, when reading segments from a file, we specify the non-entity symbol by passing it to the seg_other_symbol keyword argument of read_file(args). You can try this as an exercise by reading the segments.txt file located in the segments folder under the datasets directory (see the tree below). Hint: the non-entity symbol is 'O' ....

├── tutorials
    ├── datasets
    │   ├── conll2000
    │   │   ├── test.txt
    │   │   ├── train.txt
    │   ├── segments
    │   │   ├── segments.txt

.... The solution is in the next cell :)

In [5]:
segments_dir= os.path.join(tutorials_dir, 'datasets', 'segments')
# initialize a data file parser
dparser = DataFileParser()
# provide options to the parser such as the header info, the column separator
# and whether the y labels are already present in the file
# 'main' means the header is found in the first line of the file
header = "main"
# y_ref is a boolean indicating if the label to predict is already in the file
y_ref = True
# separator between the columns of observations
column_sep = ","
seg_other_symbol = 'O'
seqs = []
for seq in dparser.read_file(os.path.join(segments_dir, 'segments.txt'),
                             header, y_ref=y_ref, column_sep=column_sep,
                             seg_other_symbol=seg_other_symbol):
    seqs.append(seq)
    print("X:", seq.X)
    print("Y:", seq.Y)
    print("flat_y:", seq.flat_y)
    print("-"*40)
print("number of parsed segments is: ", len(seqs))
X: {1: {'w': 'New'}, 2: {'w': 'Haven'}, 3: {'w': 'is'}, 4: {'w': 'beautiful'}, 5: {'w': '.'}}
Y: {(1, 2): 'L', (4, 4): 'O', (5, 5): '.', (3, 3): 'O'}
flat_y: ['L', 'L', 'O', 'O', '.']
----------------------------------------
X: {1: {'w': 'England'}, 2: {'w': 'is'}, 3: {'w': 'part'}, 4: {'w': 'of'}, 5: {'w': 'United'}, 6: {'w': 'Kingdom'}, 7: {'w': '.'}}
Y: {(3, 3): 'O', (2, 2): 'O', (5, 6): 'L', (4, 4): 'O', (7, 7): '.', (1, 1): 'L'}
flat_y: ['L', 'O', 'O', 'O', 'L', 'L', '.']
----------------------------------------
X: {1: {'w': 'Peter'}, 2: {'w': 'visited'}, 3: {'w': 'New'}, 4: {'w': 'York'}, 5: {'w': '.'}}
Y: {(2, 2): 'O', (3, 4): 'L', (5, 5): '.', (1, 1): 'P'}
flat_y: ['P', 'O', 'L', 'L', '.']
----------------------------------------
number of parsed segments is:  3
In [ ]: