This page describes how to prepare a text corpus for indexing by Manatee, the corpus management system that serves as the database backend of Sketch Engine.
Text corpus from a technical point of view
The informal definition of a text corpus usually boils down to something close to “any collection of texts in electronic form”. More formally, a corpus source text consists of:
- positions, i.e. individual occurrences of tokens in the texts, where each position has some associated attributes like word, lemma or tag
- structures, i.e. corpus segments (ranges) spanning a part of a corpus and being defined by their beginning and ending position, usually denoting documents, paragraphs or sentences.
- structure attributes, i.e. attributes of individual structures containing metadata of these structures like date of creation, author etc.
Structures and structure attributes are sometimes referred to as headers or corpus metadata.
The example below illustrates the notions defined above on a sample vertical text:
| DESCRIPTION | CORPUS VERTICAL TEXT |
|---|---|
| Beginning of structure "doc" with 2 structure attributes, "author" and "year" | `<doc author="Shakespeare" year="1603">` |
| Beginning of structure "p" for a paragraph | `<p>` |
| Beginning of structure "s" for a sentence | `<s>` |
| Position #0 (all positions have 3 attributes separated by a tab character) | `To	to	PREPOSITION` |
| Position #1 | `be	be	VERB` |
| Empty structure "g" denoting a "glue" (no space separation) between tokens | `<g/>` |
| Position #2 | `,	,	PUNCTUATION` |
| Position #3 | `or	or	CONJUNCTION` |
| Position #4 | `not	not	PARTICLE` |
| Position #5 | `to	to	PREPOSITION` |
| Position #6 | `be	be	VERB` |
| Empty structure "g" | `<g/>` |
| Position #7 | `,	,	PUNCTUATION` |
| Position #8 | `that	that	PRONOUN` |
| Position #9 | `is	be	VERB` |
| Position #10 | `the	the	DETERMINER` |
| Position #11 | `question	question	NOUN` |
| Empty structure "g" | `<g/>` |
| Position #12 | `.	.	PUNCTUATION` |
| End of the last structure "s" | `</s>` |
| End of the last structure "p" | `</p>` |
| End of the last structure "doc" | `</doc>` |
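To make the mapping from vertical text to positions, structures and structure attributes concrete, the following Python sketch reads the sample above into plain data objects. It is only an illustration of the data model, not how Manatee itself indexes a corpus: the `Structure` class, the `parse_vertical` function and the attribute names `word`, `lemma` and `tag` are hypothetical names chosen for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical helper type for illustration; not part of Manatee.
@dataclass
class Structure:
    name: str    # e.g. "doc", "p", "s", "g"
    attrs: dict  # structure attributes, e.g. {"author": "Shakespeare"}
    begin: int   # number of the first position inside the structure
    end: int     # one past the last position (begin == end if empty)

# A structure line: <name attr="value" ...>, </name> or <name/>.
TAG_RE = re.compile(r'^<(/?)([A-Za-z_][\w-]*)((?:\s+[\w-]+="[^"]*")*)\s*(/?)>$')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')

def parse_vertical(text, attr_names=("word", "lemma", "tag")):
    """Split vertical text into a list of positions and a list of structures."""
    positions, structures, open_stack = [], [], []
    for line in text.splitlines():
        if not line:
            continue
        m = TAG_RE.match(line)
        if m is None:
            # A token line: one position, attributes separated by tabs.
            positions.append(dict(zip(attr_names, line.split("\t"))))
            continue
        closing, name, raw_attrs, empty = m.groups()
        here = len(positions)  # number of the next position to be read
        if closing:
            if not open_stack or open_stack[-1].name != name:
                raise ValueError(f"unexpected </{name}>")
            opened = open_stack.pop()
            opened.end = here
            structures.append(opened)
        else:
            s = Structure(name, dict(ATTR_RE.findall(raw_attrs)), here, here)
            # Empty structures such as <g/> are complete immediately.
            (structures if empty else open_stack).append(s)
    if open_stack:
        raise ValueError(f"<{open_stack[-1].name}> is never closed")
    return positions, structures

SAMPLE = "\n".join([
    '<doc author="Shakespeare" year="1603">', "<p>", "<s>",
    "To\tto\tPREPOSITION", "be\tbe\tVERB", "<g/>", ",\t,\tPUNCTUATION",
    "or\tor\tCONJUNCTION", "not\tnot\tPARTICLE", "to\tto\tPREPOSITION",
    "be\tbe\tVERB", "<g/>", ",\t,\tPUNCTUATION", "that\tthat\tPRONOUN",
    "is\tbe\tVERB", "the\tthe\tDETERMINER", "question\tquestion\tNOUN",
    "<g/>", ".\t.\tPUNCTUATION", "</s>", "</p>", "</doc>",
])

positions, structures = parse_vertical(SAMPLE)
```

With the sample input, `positions` holds one dictionary per token in corpus order, and each entry of `structures` records the range of positions it spans together with its structure attributes. Storing structures as ranges over position numbers mirrors the definition given earlier, where a structure is defined by its beginning and ending position.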
Steps to prepare a text corpus for Sketch Engine
- Prepare the source data, including both
- Prepare the corpus configuration file
- (optionally) Prepare the subcorpus configuration file
  This step is needed if you wish to compile subcorpora which can be shared by multiple users.
- (optionally) Prepare or reuse a word sketch definition file
  This step is needed if you require word sketches or a thesaurus (the thesaurus takes the word sketch database as input).
- Compile (index) the corpus
- Verify corpus consistency, integrity and completeness
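As one possible sanity check on the source data before compilation, the Python sketch below scans a vertical file for two common problems: token lines with an unexpected number of tab-separated attributes, and structure tags that are not properly opened and closed. This is an assumption about which checks are useful, not a description of any verification that Manatee or Sketch Engine performs; the function name `check_vertical` and the parameter `n_attrs` are hypothetical.

```python
import re

# A structure line: <name ...>, </name> or <name/>.
TAG_RE = re.compile(r"^<(/?)([A-Za-z_][\w-]*)(?:\s[^<>]*?)?(/?)>$")

def check_vertical(lines, n_attrs=3):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    stack = []  # (structure name, line number) of currently open structures
    for no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue
        m = TAG_RE.match(line)
        if m is None:
            # Token line: every position should carry n_attrs attributes.
            found = len(line.split("\t"))
            if found != n_attrs:
                problems.append(f"line {no}: expected {n_attrs} attributes, found {found}")
            continue
        closing, name, empty = m.groups()
        if empty:
            continue  # empty structure such as <g/>, nothing to balance
        if not closing:
            stack.append((name, no))
        elif not stack:
            problems.append(f"line {no}: </{name}> has no opening tag")
        else:
            opened, opened_at = stack.pop()
            if opened != name:
                problems.append(
                    f"line {no}: </{name}> closes <{opened}> opened on line {opened_at}"
                )
    for name, opened_at in stack:
        problems.append(f"line {opened_at}: <{name}> is never closed")
    return problems

good = ["<s>", "To\tto\tPREPOSITION", "<g/>", ",\t,\tPUNCTUATION", "</s>"]
bad = ["<doc>", "<s>", "be\tbe", "</s>"]
good_report = check_vertical(good)
bad_report = check_vertical(bad)
```

For a file on disk, the same function can be called on an open file object, for example `check_vertical(open(path, encoding="utf-8"))`, where `path` is the location of your vertical file.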