This document explains how structures, such as documents, paragraphs, and sentences, are stored in a compiled corpus and how they can be modified. We illustrate this with a practical example.
After compiling the LEXMCI corpus, we found that some of the included documents did not have sentence boundaries marked. We could have simply added the missing <s></s> structures to the corpus vertical file and recompiled the corpus. That would, however, have required a significant amount of CPU time, as the corpus is very large (almost 2 billion tokens). We therefore decided on a simple hack, which we present here.
In a compiled corpus, structures are stored in .rng files, which are simple sequences of pairs of 4-byte integers. Each pair of integers gives the corpus positions of the opening and closing tags of one structure.
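The layout described above can be decoded with a few lines of Python. This is only a minimal sketch: the function name `read_rng` is hypothetical, and little-endian byte order and signed integers are assumptions (the format is described here only as pairs of 4-byte integers).

```python
import struct

def read_rng(data):
    """Decode raw .rng bytes into a list of (begin, end) position pairs."""
    # "<i" = little-endian signed 4-byte integer (byte order is an assumption)
    values = [v for (v,) in struct.iter_unpack("<i", data)]
    # consecutive integers form (begin, end) pairs
    return list(zip(values[0::2], values[1::2]))

# four integers packed the same way, standing in for a tiny .rng file
sample = struct.pack("<4i", 0, 3, 3, 7)
print(read_rng(sample))  # [(0, 3), (3, 7)]
```

For a real file, the bytes could be obtained with `open(path, 'rb').read()`, although a file as large as the LEXMCI s.rng is better read in chunks.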
Example 1
Consider the following vertical file, in which the second column indicates the position of the token:
    <s>
    This 0
    is 1
    a 2
    sentence 3
    . 4
    </s>
    <s>
    This 5
    is 6
    another 7
    sentence 8
    . 9
    </s>
If we compiled this vertical file, the s.rng file would contain the following integers:
0 5 5 10
Note that the position of the opening tag equals the position of the next token, whereas the position of the closing tag equals the position of the previous token plus one.
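The rule can be checked with a small Python sketch that derives the integer pairs from a vertical file. The function `structure_ranges` is hypothetical, and it makes a simplifying assumption: every non-empty line that does not start with `<` is a token, and every other line is a structure tag.

```python
def structure_ranges(vertical_lines, name="s"):
    """Return (begin, end) pairs for <name> structures in a vertical file."""
    ranges = []
    begin = None
    position = 0  # position of the next token
    for line in vertical_lines:
        line = line.strip()
        if line == "<%s>" % name or line.startswith("<%s " % name):
            begin = position  # opening tag: position of the next token
        elif line == "</%s>" % name:
            ranges.append((begin, position))  # closing tag: previous token + 1
            begin = None
        elif line and not line.startswith("<"):
            position += 1  # an ordinary token line
    return ranges

vertical = """<s>
This
is
a
sentence
.
</s>
<s>
This
is
another
sentence
.
</s>""".splitlines()
print(structure_ranges(vertical))  # [(0, 5), (5, 10)]
```

The result matches the integers 0 5 5 10 from Example 1.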
Example 2
This is a sample of the s.rng file from the LEXMCI corpus:
    $ od -t d4 /manatee/lexmci/s.rng | head
    0000000 0 5 5 28
    0000020 28 43 43 48
    0000040 48 57 57 64
    0000060 64 72 72 78
Hacking LEXMCI
We knew that all the LEXMCI documents that were missing sentence boundaries had doc.langvariety set to ie. We also knew that all these documents were adjacent in the corpus. It was therefore possible to:
- extract these documents from the corpus vertical file,
- add the <s></s> tags to the extracted part,
- compile it separately,
- offset the positions in the resulting s.rng by the position of the first document,
- add the positions to the original s.rng file.
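The offset used in the fourth step is the corpus position of the first affected document. The text does not say how that position was obtained. One possible way, sketched here under assumptions, is to count the token lines that precede the first matching `<doc>` tag in the vertical file. The function `first_doc_position` is hypothetical, and it again assumes that every non-empty line not starting with `<` is a token.

```python
import re

def first_doc_position(vertical_lines, pattern=r'<doc .*langvariety="ie"'):
    """Return the token position of the first <doc> tag matching `pattern`."""
    regex = re.compile(pattern)
    position = 0  # number of tokens seen so far
    for line in vertical_lines:
        if regex.match(line):
            return position
        if line.strip() and not line.startswith("<"):
            position += 1
    return None  # no matching document found

sample = [
    '<doc langvariety="uk">', 'Hello', 'world', '</doc>',
    '<doc id="2" langvariety="ie">', 'Dia', 'duit', '</doc>',
]
print(first_doc_position(sample))  # 2
```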
We extracted the documents using the following simple Perl script:
    $/ = "</doc>\n";
    while (my $doc = <>) {
        print $doc if ($doc =~ /<doc .*langvariety="ie"/);
    }
After adding the <s></s> tags and compiling the vertical file with encodevert, we used the following Python script to perform the last two steps:
    import array
    import struct

    # position of the first document with langvariety="ie" (desired offset)
    ie_first_doc_pos = 1626370101
    # original s.rng file
    orig_s_rng = '/manatee/lexmci/s.rng'
    # s.rng file of the compiled documents with langvariety="ie"
    ie_s_rng = '/manatee/lexmci_ie/s.rng'
    # new s.rng file (to be created)
    new_s_rng = '/manatee/lexmci/s.rng.new'

    # array of 4-byte unsigned integers to store the contents of the
    # s.rng.new file; 'I' and '=I' are used instead of 'L' because the
    # native 'L' is 8 bytes on many 64-bit platforms
    new_array = array.array('I')
    assert new_array.itemsize == 4
    inserted = False

    with open(orig_s_rng, 'rb') as orig_fp, open(ie_s_rng, 'rb') as ie_fp:
        while True:
            # read the original s.rng file one integer at a time
            chunk = orig_fp.read(4)
            if not chunk:
                break
            int_chunk, = struct.unpack('=I', chunk)
            # insert the contents of `ie_s_rng` before the first
            # position greater than `ie_first_doc_pos`
            if int_chunk > ie_first_doc_pos and not inserted:
                inserted = True
                while True:
                    chunk2 = ie_fp.read(4)
                    if not chunk2:
                        break
                    int_chunk2, = struct.unpack('=I', chunk2)
                    # offset
                    new_array.append(int_chunk2 + ie_first_doc_pos)
            # copy the integer from the original s.rng
            new_array.append(int_chunk)

    # dump the integer array to the file
    with open(new_s_rng, 'wb') as new_fp:
        new_array.tofile(new_fp)
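Before replacing the original s.rng with the new file, it may be worth checking that the merged positions are still consistent. The following is a minimal sketch of such a check, not part of the original procedure. The function `check_ranges` is hypothetical, and it assumes that sentence structures never overlap or nest, so every range must begin at or after the end of the previous one.

```python
def check_ranges(values):
    """Return True if a flat list of .rng integers forms valid, ordered ranges."""
    if len(values) % 2 != 0:
        return False  # integers must come in (begin, end) pairs
    previous_end = 0
    for begin, end in zip(values[0::2], values[1::2]):
        if begin < previous_end or end < begin:
            return False
        previous_end = end
    return True

print(check_ranges([0, 5, 5, 28, 28, 43]))  # True
print(check_ranges([0, 5, 28, 43, 5, 28]))  # False
```

The second call fails because the range (5, 28) appears after (28, 43), which is the kind of error that inserting the new positions at the wrong place would produce.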