Adding sentence boundaries to a compiled corpus

This document explains how structures such as documents, paragraphs, and sentences are stored in a compiled corpus and how they can be modified. We illustrate this with a practical example.

After compiling the LEXMCI corpus, we found that some of the included documents did not have sentence boundaries marked. We could have simply added the missing <s></s> structures to the corpus vertical file and recompiled the corpus. That would, however, have required a significant amount of CPU time, as the corpus is very large (almost 2 billion tokens). We therefore opted for a simple hack, which we present here.

In a compiled corpus, structures are stored in .rng files, which are simple sequences of pairs of 4-byte integers. Each pair gives the corpus positions of the opening and closing tags of one structure.
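As a minimal sketch of this layout, the following Python function reads such a file into a list of (start, end) pairs. The name read_rng is made up for illustration and is not part of Manatee; the code assumes the integers are stored in the native byte order of the machine that compiled the corpus.

```python
import struct

def read_rng(path):
    """Return the ranges stored in an .rng file as (start, end) pairs.

    Assumption: each value is a 4-byte integer in native byte order.
    """
    with open(path, 'rb') as fp:
        data = fp.read()
    # iter_unpack yields one (start, end) tuple per 8 bytes of input
    return list(struct.iter_unpack('=ii', data))
```

For a corpus of this size it would be preferable to read the file in blocks rather than all at once; the sketch keeps everything in memory for brevity.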

Example 1

Consider the following vertical file, in which the second column indicates the position of each token:

<s>
This	0
is	1
a	2
sentence	3
.	4
</s>
<s>
This	5
is	6
another	7
sentence	8
.	9
</s>

If we compiled this vertical file, the s.rng file would contain the following integers:

0 5 5 10

Note that the position of an opening tag equals the position of the next token, whereas the position of a closing tag equals the position of the previous token plus one.
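The rule can be checked with a short Python sketch that derives the ranges from the vertical file of Example 1 (shown here without the position column). The function sentence_ranges is a hypothetical helper; it assumes that every line not starting with "<" is exactly one token and that the tags are written as plain <s> and </s> without attributes.

```python
def sentence_ranges(vertical_lines):
    """Compute (start, end) token positions of <s> structures."""
    ranges = []
    pos = 0       # position of the next token
    start = None  # position of the most recent opening tag
    for line in vertical_lines:
        line = line.strip()
        if line == '<s>':
            start = pos              # opening tag: position of the next token
        elif line == '</s>':
            ranges.append((start, pos))  # closing tag: previous token plus one
        elif line and not line.startswith('<'):
            pos += 1                 # ordinary token line
    return ranges

vertical = """<s>
This
is
a
sentence
.
</s>
<s>
This
is
another
sentence
.
</s>""".splitlines()

print(sentence_ranges(vertical))  # [(0, 5), (5, 10)]
```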

Example 2

This is a sample of the s.rng file from the LEXMCI corpus:

$ od -t d4 /manatee/lexmci/s.rng | head
0000000           0           5           5          28
0000020          28          43          43          48
0000040          48          57          57          64
0000060          64          72          72          78

Hacking LEXMCI

We knew that all the LEXMCI documents which were missing the sentence boundaries had doc.langvariety set to ie. We also knew that all these documents are adjacent in the corpus. It was therefore possible to:

1. extract these documents from the corpus vertical file,
2. add the <s></s> tags to the extracted part,
3. compile it separately,
4. offset the positions in the resulting s.rng by the position of the first document,
5. add the positions to the original s.rng file.

We extracted the documents using the following simple Perl script:

$/ = "</doc>\n";
while (my $doc = <>)
    { print $doc if ($doc =~ /<doc .*langvariety="ie"/); }
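The offset applied later is the corpus position of the first extracted document, that is, the number of tokens that precede it in the vertical file. The text does not say how this number was obtained; one possible way, sketched below in Python, is to count token lines up to the first matching <doc> tag. The function first_doc_position is a hypothetical helper, and it assumes that every non-empty line not starting with "<" is exactly one token.

```python
import re

# same pattern as in the Perl script above
DOC_IE = re.compile(r'<doc .*langvariety="ie"')

def first_doc_position(vertical_lines, pattern=DOC_IE):
    """Return the token position of the first <doc> line matching `pattern`,
    or None if no document matches.

    Assumption: every non-empty line not starting with "<" is one token.
    """
    pos = 0
    for line in vertical_lines:
        if line.startswith('<'):
            if pattern.match(line):
                return pos
        elif line.strip():
            pos += 1
    return None
```

The function accepts any iterable of lines, so an open file object of the corpus vertical can be passed directly.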

After adding the <s></s> tags and compiling the vertical file with encodevert, we used the following Python script to perform steps 4 and 5:

import array
import struct

# position of the first document with langvariety="ie" (desired offset)
ie_first_doc_pos = 1626370101

# original s.rng file
orig_s_rng = '/manatee/lexmci/s.rng'

# s.rng file of the compiled documents with langvariety="ie"
ie_s_rng = '/manatee/lexmci_ie/s.rng'

# new s.rng file (to be created)
new_s_rng = '/manatee/lexmci/s.rng.new'

# one 4-byte unsigned integer in native byte order; the plain "L" format
# would be 8 bytes on most 64-bit systems
uint32 = struct.Struct('=I')

orig_fp = open(orig_s_rng, 'rb')
ie_fp = open(ie_s_rng, 'rb')
new_fp = open(new_s_rng, 'wb')

# array of 4-byte unsigned integers ('I' is 4 bytes on common platforms)
# to store the contents of the s.rng.new file
new_array = array.array('I')

inserted = False
while True:
    # read the original s.rng file one integer at a time
    chunk = orig_fp.read(4)
    if not chunk: break
    int_chunk, = uint32.unpack(chunk)

    # insert the contents of `ie_s_rng` before the first original
    # position greater than `ie_first_doc_pos`
    if int_chunk > ie_first_doc_pos and not inserted:
        inserted = True
        while True:
            chunk2 = ie_fp.read(4)
            if not chunk2: break
            int_chunk2, = uint32.unpack(chunk2)
            # offset
            new_array.append(int_chunk2 + ie_first_doc_pos)

    # copy the original integer to the output
    new_array.append(int_chunk)

# dump the integer array to the file
new_array.tofile(new_fp)

new_fp.close()
ie_fp.close()
orig_fp.close()
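Before replacing the original s.rng with s.rng.new, it may be worth checking that the merged file is still well-formed. The sketch below is an illustration only: check_ranges is a hypothetical helper, and it assumes (this is not stated above) that ranges must be stored in ascending order and must not overlap.

```python
import struct

def check_ranges(path):
    """Return the number of ranges in an .rng file.

    Raises ValueError if a range ends before it starts or starts before
    the previous range ends. Assumes 4-byte unsigned integers in native
    byte order.
    """
    with open(path, 'rb') as fp:
        data = fp.read()
    prev_end = 0
    count = 0
    for start, end in struct.iter_unpack('=II', data):
        if start < prev_end or end < start:
            raise ValueError('bad range #%d: %d %d' % (count, start, end))
        prev_end = end
        count += 1
    return count
```

A call such as check_ranges('/manatee/lexmci/s.rng.new') would then either return the total number of sentences or point at the first offending pair.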
