You need to prepare a vertical file and a registry (corpus configuration) file before compiling your corpus (see the documentation on Preparing a Corpus).
When you have created your vertical file and the corpus configuration file you are ready to compile the corpus in Sketch Engine. This can either be done from the Corpus Architect interface, or from the command line:
- A) From the Corpus Architect interface, select the corpus on the home page and then select compile corpus.
- B) There are two main ways to compile the corpus from the command line, described in sections 1 and 2 below. The first uses a program called compilecorp, which calls all the other necessary programs. Because compilecorp is simpler to use, we recommend it when compiling from the command line. The commands for compiling in stages are described in section 2.
1. Compile a Corpus with compilecorp
Note that compilecorp reads the paths to the sketch grammar file and to the word sketch highlights definitions (formerly histogram definitions) from the corpus configuration file (the WSDEF and WSHIST attributes).
% compilecorp [OPTIONS] CORPNAME [FILENAME]
This program creates a new corpus (CORPNAME) from a vertical text in file FILENAME or stdin. If possible, it also creates word sketches, thesaurus, word sketch highlights (histograms) and subcorpora. Existing components are never overwritten unless recompiling is explicitly requested. OPTIONS are:
--recompile-corpus      recompile the corpus and all its components (vertical file must be available)
--recompile-sketches    recompile word sketches, thesaurus and word sketch highlights (implies --recompile-thesaurus --recompile-histograms)
--recompile-thesaurus
--recompile-histograms
--recompile-subcorpora
--no-sketches           do not compile word sketches (implies --no-thesaurus --no-histograms)
--no-thesaurus          do not compile thesaurus
--no-histograms         do not compile histograms
--no-subcorpora         do not compile subcorpora
-h, --help              print this info
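As a minimal sketch of how these options combine, the script below wraps two typical invocations. The `run` helper, the `DRY_RUN` variable, the function names, and the corpus name and path are hypothetical and only serve as illustration; the only compilecorp options used are ones listed above. By default the script prints the commands instead of running them. It also assumes that --recompile-sketches can be used without FILENAME, which the usage line suggests but does not state outright.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# run it for real otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# First compilation from a vertical file, skipping word sketches
# (--no-sketches implies --no-thesaurus and --no-histograms).
compile_without_sketches() {
    run compilecorp --no-sketches "$1" "$2"
}

# Rebuild word sketches, thesaurus and histograms of an existing corpus.
recompile_sketches() {
    run compilecorp --recompile-sketches "$1"
}

# Illustrative corpus name and path (assumptions, not a real setup).
compile_without_sketches mycorpus /corpora/vert/mycorpus.vert
recompile_sketches mycorpus
```

Set DRY_RUN=0 in the environment to execute the commands rather than print them.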
2. Compile a Corpus in Various Stages
We recommend this method only if you need it, since method 1 described above is simpler. It is useful when you want to perform a single operation that is not possible with compilecorp, for example compiling the dynamic attributes without recompiling anything else.
The various steps are:
Compile corpus
(obligatory step)
% encodevert -c <corpus_name> <full_path_to_vertical>
Or you can pass the vertical to the encodevert standard input:
<script_generating_vertical> | encodevert -c <corpus_name>
If you have provided the path to the vertical file in VERTICAL in your registry file, you can use the shorter form:
% encodevert -c <corpus_name>
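The three forms above differ only in where the vertical comes from. The sketch below prints the encodevert command it would choose for a given input rather than running it. The function name `encodevert_cmd`, the example corpus name and paths, and the use of `gzip -dc` as the program generating the vertical are assumptions made for illustration.

```shell
#!/bin/sh
# Print the encodevert invocation that matches the arguments.
#   $1 = corpus name
#   $2 = optional path to the vertical file
encodevert_cmd() {
    corpus=${1:-}
    vertical=${2:-}
    case "$vertical" in
        "")
            # No path given: rely on VERTICAL in the registry file.
            printf '%s\n' "encodevert -c $corpus"
            ;;
        *.gz)
            # Compressed vertical: decompress and pass it on standard input.
            printf '%s\n' "gzip -dc $vertical | encodevert -c $corpus"
            ;;
        *)
            # Plain vertical file: pass the full path as an argument.
            printf '%s\n' "encodevert -c $corpus $vertical"
            ;;
    esac
}

encodevert_cmd mycorpus /corpora/vert/mycorpus.vert
encodevert_cmd mycorpus /corpora/vert/mycorpus.vert.gz
encodevert_cmd mycorpus
```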
Compile WordSketches
(optional step)
If you have defined a word sketch configuration file, you can compile the word sketches now:
genws <corpus_name> <attribute> <path_where_to_compile> <path_to_wsdef_file>
Example:
genws estenten_freeling lemma /corpora/manatee/estenten_freeling/ /corpora/wsdef/spanish-freeling-1.0.wsdef.txt
Note: genws automatically expands M4 macros if the extension of the wsdef file is “.m4”.
The output of genws is binary data for mkwmap, so you will typically pipe it into:
mkwmap <path_where_to_compile>
Example:
mkwmap /corpora/manatee/estenten_freeling/
Afterwards, just run mkwmrank:
mkwmrank <path_where_to_compile>
Example:
mkwmrank /corpora/manatee/estenten_freeling/
These three steps used to be wrapped in the script genws.sh, which takes the corpus name and the wsdef file as arguments:
#!/bin/sh
if [ $# != 2 ]; then
    echo "usage: genws.sh CORPUS WSDEF_FILE"
    exit 1
fi
CORPUS=$1
WSDEF_FILE=$2
WS=`corpinfo -g WSBASE $CORPUS`
LEMMA=`corpinfo -g WSATTR $CORPUS`
set -x
genws $CORPUS $LEMMA $WS "$WSDEF_FILE" | mkwmap $WS
mkwmrank $WS
set +x
Compile Thesaurus
(optional step)
After creating word sketches, you can create the data for the thesaurus:
wm2thes [-c MIN_COUNT] <path_where_to_compile> <tmp_dir>
Example:
wm2thes /corpora/manatee/estenten_freeling/ /tmp/estenten_thes/
… and compile the thesaurus:
mkthes [-n] [-k K] [-m MAXMEM MB] [-f MAXFILES] <tmp_dir> <path_where_to_compile_thes>
Example:
mkthes /tmp/estenten_thes/ /corpora/manatee/estenten_freeling/lemma-thes
These two steps used to be simplified by the “mkthes.sh <corpus_name>” script.
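The source of mkthes.sh is not shown here, so the following is only a sketch of how the two thesaurus steps could be chained, not a reproduction of that script. The `run` helper, the `DRY_RUN` variable and the function name `build_thesaurus` are hypothetical, and creating the temporary directory beforehand is an assumption; the commands and paths follow the examples above. By default the commands are printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# run it for real otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Chain the two thesaurus steps, stopping at the first failure.
#   $1 = path with the compiled word sketches
#   $2 = temporary directory for the intermediate data
#   $3 = output path for the thesaurus
build_thesaurus() {
    ws_path=$1
    tmp_dir=$2
    thes_path=$3
    # Assumption: the temporary directory should exist before wm2thes runs.
    run mkdir -p "$tmp_dir" &&
    run wm2thes "$ws_path" "$tmp_dir" &&
    run mkthes "$tmp_dir" "$thes_path"
}

build_thesaurus /corpora/manatee/estenten_freeling/ \
    /tmp/estenten_thes/ \
    /corpora/manatee/estenten_freeling/lemma-thes
```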
Compile Dynamic attributes
(optional step)
If you have defined some dynamic attributes, you can compile them using:
mkdynattr <corpus_name> <dynattr>
Example:
mkdynattr gkwac0.5 doc.time
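If a corpus defines several dynamic attributes, mkdynattr can be called once per attribute. The loop below is a minimal sketch of that; the `run` helper, the `DRY_RUN` variable and the function name `compile_dynattrs` are hypothetical, and the commands are printed rather than executed by default.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# run it for real otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Compile each listed dynamic attribute, stopping at the first failure.
#   $1 = corpus name, remaining arguments = dynamic attributes
compile_dynattrs() {
    corpus=$1
    shift
    for attr in "$@"; do
        run mkdynattr "$corpus" "$attr" || return 1
    done
}

compile_dynattrs gkwac0.5 doc.time
```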
Adding header fields to structures in already compiled corpora
(optional step)
If you have a mapping from an existing attribute of a structure to new values, you can add them as new header fields:
add_fields.sh <corpname> <struct> <src_attr> <new_attrname> <map_file> [<tmpdir>]
Example:
add_fields.sh biwec doc id region id2region.txt
Troubleshooting
If you encounter a problem during compilation, remove all compiled data (usually in manatee/<corpus_name>) before trying again, since inconsistent data left over from previous steps can cause further problems later.
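Removing the compiled data can be wrapped in a small guarded function so that a mistyped path does not delete more than intended. This is a sketch under assumptions: the function name `clean_compiled_data` is hypothetical, and the directory has to be passed explicitly because its location depends on your installation.

```shell
#!/bin/sh
# Remove the compiled data of one corpus so compilation can start from scratch.
#   $1 = directory holding the compiled data
clean_compiled_data() {
    dir=${1:-}
    # Refuse empty or obviously dangerous targets.
    case "$dir" in
        ""|"/"|"."|"..")
            echo "refusing to remove '$dir'" >&2
            return 1
            ;;
    esac
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    rm -rf -- "$dir"
}

# Usage (illustrative path, adjust to your installation):
#   clean_compiled_data /corpora/manatee/mycorpus
```

After cleaning, compile the corpus again from the vertical file as described in section 1 or 2.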