For the software to use a corpus, there are a number of things it needs to know. They are specified in the corpus configuration file, a file located in the registry directory whose filename is the corpus identifier on the system (in simple cases this is the corpus name, e.g. ‘bnc’). The filename is used in the corpus query language, so it must consist only of letters, digits and underscores, and must not start with a digit. In other words, the filename must match the following regular expression:

('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9'|'@')*
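
The rule can be checked before a file is placed in the registry directory. The following is a minimal sketch in Python, not part of the system itself: the names CORPUS_ID_RE and is_valid_corpus_id are hypothetical, and the pattern is a direct transcription of the expression as written above, which also admits '@' after the first character.

```python
import re

# Transcription of the expression above: the first character is an ASCII
# letter or underscore; later characters may also be digits or '@'.
CORPUS_ID_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_@]*")


def is_valid_corpus_id(filename: str) -> bool:
    """Return True if `filename` matches the corpus identifier pattern."""
    return CORPUS_ID_RE.fullmatch(filename) is not None


if __name__ == "__main__":
    for candidate in ["bnc", "test1", "1test", "my-corpus"]:
        print(candidate, is_valid_corpus_id(candidate))
```

Using fullmatch rather than match ensures the whole filename is tested, so a name such as ‘my-corpus’ is rejected instead of being accepted on its leading ‘my’.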

A simple corpus config file is:

PATH  /corpora/test1
ATTRIBUTE  word
ATTRIBUTE  tag
ATTRIBUTE  lemma

STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
    ATTRIBUTE "AttributeWithUpperChars"
    LABEL "Corpus Document"
}
STRUCTURE p
STRUCTURE s

This shows the structure of a corpus config file. It contains a set of feature-value pairs where the feature, on the left, must be one of a fixed set of words that the system recognises and knows how to interpret. All features are explained in this set of documentation pages.

Note: The configuration file uses the general syntax ATTRIBUTE value. If the value contains anything other than lowercase letters, you have to enclose it in quotes or apostrophes, like this: ATTRIBUTE "Complex_value".
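
When ATTRIBUTE lines are generated by a script, the quoting rule in the note can be applied automatically. The helper below is only a sketch in Python under stated assumptions: attribute_line is a hypothetical name, double quotes are always chosen over apostrophes, and names that themselves contain a double quote are rejected because no escaping rule is given here.

```python
import re


def attribute_line(name: str) -> str:
    """Format an ATTRIBUTE line, quoting the value only when required."""
    # Values made up solely of lowercase letters can be written bare.
    if re.fullmatch(r"[a-z]+", name):
        return f"ATTRIBUTE {name}"
    if '"' in name:
        # Escaping is not described in this page, so refuse rather than guess.
        raise ValueError("attribute name must not contain a double quote")
    return f'ATTRIBUTE "{name}"'


if __name__ == "__main__":
    print(attribute_line("title"))
    print(attribute_line("Complex_value"))
```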

The example states that

  • The location of the indexed corpus data on the system is /corpora/test1
  • The vertical file contains three columns, whose contents will be called ‘word’, ‘tag’ and ‘lemma’
  • The text is divided into structural units of type ‘doc’, ‘p’ and ‘s’. Units of type ‘doc’ have the associated attributes ‘title’, ‘region’ and ‘AttributeWithUpperChars’, and the label ‘Corpus Document’.

As the example illustrates,

  • each nonempty line begins with a feature name, followed by its value
  • a value can be simple, or it can be further specified in a block enclosed in { … }.
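
To make the line-and-block layout concrete, here is a minimal parsing sketch in Python. It is not the system's own parser: parse_config is a hypothetical name, and the sketch assumes that an opening brace always ends the line that starts a block, that a closing brace stands alone on its line, and that values are quoted in a way shlex can tokenise. Features or syntax not shown in the example above are not handled.

```python
import shlex

SAMPLE = '''\
PATH  /corpora/test1
ATTRIBUTE  word
ATTRIBUTE  tag
ATTRIBUTE  lemma

STRUCTURE doc {
    ATTRIBUTE title
    ATTRIBUTE region
    ATTRIBUTE "AttributeWithUpperChars"
    LABEL "Corpus Document"
}
STRUCTURE p
STRUCTURE s
'''


def parse_config(text):
    """Return a list of (feature, value, block) tuples.

    `block` is a nested list of the same shape when the value is followed
    by a { ... } block, and None otherwise.
    """
    root = []
    stack = [root]  # innermost open block is stack[-1]
    for raw in text.splitlines():
        tokens = shlex.split(raw)  # strips quotes, keeps quoted spaces
        if not tokens:
            continue  # skip empty lines
        if tokens == ["}"]:
            stack.pop()  # close the current block
            continue
        opens_block = tokens[-1] == "{"
        if opens_block:
            tokens = tokens[:-1]
        feature = tokens[0]
        value = tokens[1] if len(tokens) > 1 else None
        block = [] if opens_block else None
        stack[-1].append((feature, value, block))
        if opens_block:
            stack.append(block)
    return root


if __name__ == "__main__":
    for feature, value, block in parse_config(SAMPLE):
        print(feature, value, block)
```

A stack of open blocks keeps the sketch short and would also cope with nested blocks, although the example file only nests one level deep.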

See All features of Corpus Configuration File.