If your vertical text contains only words and no annotation, the configuration can be very simple:
Example 1
PATH /corpora/test1
ATTRIBUTE word
If you omit VERTICAL, you have to specify a source file for the encodevert command:
% encodevert -c test1 /corpora/src/test1.vertical
Adding VERTICAL to the configuration simplifies the encodevert command:
% encodevert -c test2
Select an appropriate ENCODING so that characters are displayed properly in Sketch Engine. For each attribute you can specify a LOCALE for proper sorting and for the handling of character classes in regular expressions. The default "C" locale corresponds to English. The following example uses the ISO Latin 2 encoding and a Czech locale.
Example 2
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
    LOCALE "cs_CZ.ISO8859-2"
}
If your vertical text contains a POS tag for each token (word), also specify a second attribute.
Example 3
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
    LOCALE "cs_CZ.ISO8859-2"
}
ATTRIBUTE pos
If your vertical text contains sentence boundaries annotated with <s> and </s> and document boundaries annotated with <doc> and </doc>, add structure definitions.
Example 4
PATH /corpora/test2
VERTICAL "/corpora/src/test2.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word
STRUCTURE doc
STRUCTURE s
If your <doc> annotation contains document meta-information about the author and the date of publication in the form <doc author="Lewis Carroll" date="1876">, add structure attribute definitions.
Example 5
PATH /corpora/test3
VERTICAL "/corpora/src/test3.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word
STRUCTURE doc {
    ATTRIBUTE author
    ATTRIBUTE date
}
STRUCTURE s
If your POS attribute contains ambiguous tags like NN1-VVB in BNC, and you would like to find this tag for [pos="NN1"] queries, add a multivalue configuration.
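To make the link between the <doc ...> line in the vertical text and the ATTRIBUTE entries in Example 5 concrete, here is a minimal Python sketch. It is only an illustration and not how the corpus tools read the vertical file; the function name parse_structure_tag and the regular expressions are assumptions introduced for this example.

```python
import re

# Hypothetical helper: matches name="value" pairs inside an opening structure tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_structure_tag(line: str) -> tuple[str, dict[str, str]]:
    """Return (structure name, structure attributes) for an opening tag line."""
    match = re.match(r"<(\w+)", line)
    if match is None:
        raise ValueError(f"not an opening structure tag: {line!r}")
    return match.group(1), dict(ATTR_RE.findall(line))


if __name__ == "__main__":
    name, attrs = parse_structure_tag('<doc author="Lewis Carroll" date="1876">')
    # name corresponds to STRUCTURE doc; the keys of attrs correspond to
    # ATTRIBUTE author and ATTRIBUTE date inside the braces.
    print(name, attrs)
```

Each key found in the tag needs its own ATTRIBUTE line inside the STRUCTURE block, as in Example 5.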
Example 6
PATH /corpora/test4
ENCODING "iso8859-2"
ATTRIBUTE word
ATTRIBUTE pos {
    MULTIVALUE yes
    MULTISEP "-"
}
If you would like to add a dynamic attribute, add a new attribute definition. In the following example the vertical text contains words only (one column), but the corpus has an additional attribute lc generated from the word attribute. The values of lc consist of the respective words converted to lowercase. The transformation function is an internal function named "lowercase" (its definition is in the file stddynfun.c). It accepts two arguments: the first is a word and the second is a locale (in this corpus "cs_CZ"). DEFAULTATTR ensures that lc is used when evaluating queries that do not name an attribute. TRANSQUERY ensures that the transformation function is applied to the query string before the query is evaluated.
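The intended effect of MULTIVALUE and MULTISEP in Example 6 can be pictured with a short Python sketch. It only models the lookup of individual component tags described above and is not the indexing code of the corpus manager; the function names split_multivalue and matches are hypothetical.

```python
def split_multivalue(value: str, multisep: str = "-") -> list[str]:
    """Split a stored attribute value into its component values."""
    return value.split(multisep)


def matches(value: str, wanted: str, multisep: str = "-") -> bool:
    """Return True if `wanted` is one of the components of `value`."""
    return wanted in split_multivalue(value, multisep)


if __name__ == "__main__":
    # The ambiguous BNC tag is found by a query for either component.
    print(split_multivalue("NN1-VVB"))
    print(matches("NN1-VVB", "NN1"))
    print(matches("NN1-VVB", "NN2"))
```

Without the multivalue configuration, a query for NN1 would have to match the whole string NN1-VVB to find the token.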
Example 7
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
DEFAULTATTR lc
ATTRIBUTE word {
    LOCALE "cs_CZ"
}
ATTRIBUTE lc {
    LOCALE "cs_CZ"
    DYNAMIC lowercase
    DYNLIB internal
    FUNTYPE s
    FROMATTR word
    ARG1 "cs_CZ"
    TRANSQUERY yes
}
A transformation function of a dynamic attribute can also be an external function. DYNLIB then gives the full path to a dynamic library. The following example lists two dynamic attributes which add a lemma and a morphological annotation to a corpus. Both transformation functions (tags and lemmata) return ambiguous values separated by a comma.
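How a dynamic attribute and TRANSQUERY interact in Example 7 can be sketched in Python as follows. This is a simplified model under stated assumptions, not the internal implementation: the stand-in lowercase function uses Python's Unicode lowercasing and ignores its locale argument, and the helper find is hypothetical.

```python
def lowercase(word: str, locale_name: str = "cs_CZ") -> str:
    """Stand-in for the transformation function; locale_name is accepted but unused."""
    return word.lower()


def find(tokens: list[str], query: str, transquery: bool = True) -> list[int]:
    """Return the positions whose derived lc value equals the query string."""
    # Dynamic attribute lc: every value is computed from the word attribute.
    lc_values = [lowercase(token) for token in tokens]
    # TRANSQUERY yes: the same function is applied to the query string first.
    needle = lowercase(query) if transquery else query
    return [i for i, value in enumerate(lc_values) if value == needle]


if __name__ == "__main__":
    tokens = ["Praha", "je", "PRAHA", "praha"]
    print(find(tokens, "Praha"))
    print(find(tokens, "Praha", transquery=False))
```

With the query transformed, a capitalised query string finds every case variant of the word; without it, the capitalised query can never equal an all-lowercase lc value.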
Example 8
PATH /corpora/test1
VERTICAL "/corpora/src/test1.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
    LOCALE "cs_CZ"
}
ATTRIBUTE lemma {
    LOCALE "cs_CZ"
    DYNAMIC lemmata
    DYNLIB /corpora/bin/alibfun.so
    ARG1 0
    FUNTYPE i
    FROMATTR word
    MULTIVALUE yes
    MULTISEP ","
}
ATTRIBUTE tag {
    DYNAMIC tags
    DYNLIB /corpora/bin/alibfun.so
    FUNTYPE 0
    FROMATTR word
    MULTIVALUE yes
    MULTISEP ","
}
Parallel corpora are handled as two separate corpora. ALIGNED gives the name of the other, parallel corpus. Both corpora should have a structure named "align" with a one-to-one correspondence between the respective token sequences. The following example shows two configuration files, one for each corpus.
Example 9a (paren)
PATH /corpora/par-en
VERTICAL "/corpora/src/par-en.vertical"
ENCODING "iso8859-1"
ATTRIBUTE word
STRUCTURE doc {
    ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED parcs
Example 9b (parcs)
PATH /corpora/par-cs
VERTICAL "/corpora/src/par-cs.vertical"
ENCODING "iso8859-2"
ATTRIBUTE word {
    LOCALE "cs_CZ"
}
STRUCTURE doc {
    ATTRIBUTE id
}
STRUCTURE s
STRUCTURE align
ALIGNED paren
The final example is part of a BNC configuration. It shows the usage of INFO and FULLREF.
Example 10
PATH /corpora/bnc
INFO "British National Corpus"
VERTICAL /corpora/src/bnc.vert
ENCODING "iso8859-1"
DEFAULTATTR lc
FULLREF "bncdoc.id,bncdoc.author,bncdoc.title,bncdoc.date,bncdoc.info"
ATTRIBUTE word
ATTRIBUTE tag {
    MULTIVALUE y
    MULTISEP "-"
}
ATTRIBUTE lc {
    DYNAMIC lowercase
    DYNLIB internal
    FUNTYPE s
    ARG1 "C"
    FROMATTR word
    TRANSQUERY yes
}
STRUCTURE bncdoc {
    ATTRIBUTE id
    ATTRIBUTE date
    ATTRIBUTE year {
        DYNAMIC firstn
        DYNLIB internal
        FUNTYPE i
        ARG1 4
        FROMATTR date
    }
    ATTRIBUTE author {
        MULTIVALUE y
        MULTISEP ";"
    }
    ATTRIBUTE title
    ATTRIBUTE info
    ATTRIBUTE allava
    ATTRIBUTE alltim
    ATTRIBUTE alltyp
    ATTRIBUTE wriaag
    ATTRIBUTE wriad
    ATTRIBUTE wriase
}
STRUCTURE stext {
    ATTRIBUTE org
}
STRUCTURE text {
    ATTRIBUTE org
}
STRUCTURE s {
    ATTRIBUTE n
}
STRUCTURE p {
    ATTRIBUTE rend
}
STRUCTURE body
Naming structures and attributes
Names of structures and attributes may contain only the characters a-z, A-Z, 0-9 and underscore. Names that do not begin with a-z must be enclosed in double quotes. The positional attributes word, tag, lempos and lemma should not be renamed. Correct examples:
ATTRIBUTE word
STRUCTURE doc {
    ATTRIBUTE title1
    ATTRIBUTE "Title2"
}
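The naming rules above can be checked mechanically before a configuration is written. The following Python sketch is one way to do that; the helper names is_valid_name and format_name are hypothetical and are not part of any Sketch Engine tool.

```python
import re

# Allowed characters according to the rule above: a-z, A-Z, 0-9 and underscore.
NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return NAME_RE.fullmatch(name) is not None


def format_name(name: str) -> str:
    """Return the name as it should appear in a configuration file."""
    if not is_valid_name(name):
        raise ValueError(f"invalid structure or attribute name: {name!r}")
    # Names that do not begin with a-z must be double quoted.
    if re.match(r"[a-z]", name):
        return name
    return f'"{name}"'


if __name__ == "__main__":
    for candidate in ["title1", "Title2", "_x"]:
        print("ATTRIBUTE", format_name(candidate))
```

Names such as my-attr are rejected because the hyphen is not among the allowed characters.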