Users with a local installation of Sketch Engine can run the following commands on Linux.
Overview of all command line tools
addsatfiles | dumpstructrng | lscbgr | mkhatlex | ocd-mkgdex |
addwcattr | dumpthes | lsclex | mkhatsort | ocd-mkhwds-plain |
biterms | dumpwmap | lscngr | mkisrt | ocd-mkhwds-terms |
calctrends | dumpwmrev | lsfsa | mklcm | ocd-mkthes |
compilecorp | dumpws | lsfsa_intersect | mklex | ocd-mkwsi |
concinfo | encodevert | lsfsa_left_intersect | mknormattr | par2tokens |
corpconfcheck | extrms | lskw | mknorms | parencodevert |
corpdatacheck | filterquery2attr | lslex | mkregexattr | parmkdynattr |
corpcheck | filterwm | lslexarf | mksizes | parse2wmap |
corpinfo | freqs | lsslex | mkstats | parws |
corpquery | genbgr | lswl | mksubc | sconll2sketch |
corpus4fsa | genfreq | manateesrv | mkthes | sconll2wmap |
decodevert | genhist | maplexrev | mktrends | setupbonito |
devirt | genngr | mkalign | mkvirt | ske |
dumpalign | genterms | mkbgr | mkwc | sortuniq |
dumpattrrev | genws | mkbidict | mkwmap | sortws |
dumpattrtext | hashws | mkdrev | mkwmrank | terms2fsa |
dumpbits | lex2fsa | mkdtext | ngr2fsa | tokens2dict |
dumpdrev | lexonomyCreateEntries | mkdynattr | ngrsave | vertfork |
dumpdtext | lexonomyMakeDict | mkfsa | ocd-mkcoll | virtws |
dumpfsa | lsalsize | mkfsalex | ocd-mkdefs | wm2terms |
dumplevel | lsbgr | mkhatfsa | ocd-mkdict | wm2thes |
ws2fsa |
Command line tools for n-grams
A number of utilities in Finlib/Manatee make it easy to generate and store n-grams from corpora efficiently. The utilities fall into three groups according to their features:
Generating bigrams from a compiled corpus (<tt>genbgr, mkbgr, lsbgr, lscbgr</tt>)
Features:
- bigram generation, storing and viewing from a compiled corpus
- no corpus size limit
Usage:
The <tt>genbgr</tt> and <tt>mkbgr</tt> tools are used for generating and storing bigrams, respectively:
genbgr CORPUS ATTR MINFREQ | mkbgr BGRFILE
where <tt>CORPUS</tt> is the registry name/path of the corpus, <tt>ATTR</tt> is the attribute from which the bigrams are generated, <tt>MINFREQ</tt> is the minimum frequency of a bigram, and <tt>BGRFILE</tt> is the prefix for the bigram files, usually <tt>ATTR.bgr</tt>.
For viewing stored bigrams, use the <tt>lsbgr</tt> tool:
lsbgr BGRFILE [FIRST_ID]
where <tt>BGRFILE</tt> is the same path as given above and the optional <tt>FIRST_ID</tt> argument selects the first bigram ID to be shown (otherwise all bigrams are listed).
Example:
>genbgr susanne word 1 | mkbgr word.bgr
mkbgr word.bgr[1]: stream sorted, #parts: 1
mkbgr word.bgr[2]: temporary files renamed
>ls | grep word.bgr
word.bgr.cnt
word.bgr.idx
>lsbgr word.bgr | head -10
0 1 1
0 14 1
0 16 2
0 23 3
0 25 6
0 33 2
0 40 2
0 49 1
0 52 1
0 66 3
The three columns are the attribute IDs of the two tokens forming the bigram and the frequency of the bigram. To convert an attribute ID into the corresponding string, use the <tt>lsclex</tt> tool:
>echo -e '14\n1' | lsclex -n susanne word
14 election
1 Fulton
The <tt>lscbgr</tt> tool prints bigram strings directly and offers more options:
lscbgr
Lists corpus bigrams
usage: lscbgr [OPTIONS] CORPUS_NAME [FIRST_ID]
-p ATTR_NAME      corpus positional attribute [default word]
-n BGR_FILE_PATH  path to data files [default CORPPATH/ATTR_NAME.bgr]
-f                lists frequencies of both tokens
-s t|mi|mi3|ll|ms|d  compute statistics:
   t    T score
   mi   MI score
   mi3  MI^3 score
   ll   log likelihood
   ms   minimum sensitivity
   d    logDice
Example:
>lscbgr -f -n word.bgr susanne | head
The Fulton 1074 14 1
The election 1074 36 1
The " 1074 2311 2
The place 1074 73 3
The jury 1074 27 6
The City 1074 29 2
The charge 1074 17 2
The September 1074 4 1
The charged 1074 18 1
The Mayor 1074 19 3
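The statistics offered by the <tt>-s</tt> option are standard lexical association measures. The following is an illustrative sketch using the textbook definitions (the exact formulas implemented in <tt>lscbgr</tt> may differ); it computes the scores from the joint and marginal frequencies that <tt>-f</tt> prints, given the corpus size N:

```python
import math

def association_scores(f_xy, f_x, f_y, n):
    """Common association measures for a bigram (x, y).

    f_xy     -- frequency of the bigram
    f_x, f_y -- frequencies of the two tokens
    n        -- corpus size in tokens

    Textbook definitions only; not necessarily what lscbgr computes.
    """
    expected = f_x * f_y / n
    return {
        "t": (f_xy - expected) / math.sqrt(f_xy),           # T score
        "mi": math.log2(f_xy * n / (f_x * f_y)),            # MI score
        "mi3": math.log2(f_xy ** 3 * n / (f_x * f_y)),      # MI^3 score
        "ms": min(f_xy / f_x, f_xy / f_y),                  # minimum sensitivity
        "logdice": 14 + math.log2(2 * f_xy / (f_x + f_y)),  # logDice
    }

# e.g. "The jury" from the susanne output above: f_xy=6, f_x=1074, f_y=27
# (the corpus size 150000 is an illustrative assumption)
scores = association_scores(6, 1074, 27, 150000)
```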
Generating n-grams from a compiled corpus (<tt>genngr, lscngr</tt>)
Features:
- concurrent n-gram generation (for any n), storing and viewing from a compiled corpus
- corpus size up to 2 billion tokens (larger corpora may be processed, but only the first 2 billion tokens will be used)
Usage:
The <tt>genngr</tt> tool is used for generating and storing n-grams, the <tt>lscngr</tt> tool for viewing them:
genngr CORPUS ATTR MINFREQ NGRFILE
The parameters of <tt>genngr</tt> have the same semantics as those of <tt>genbgr/mkbgr</tt> above; the prefix path is usually <tt>ATTR.ngr</tt>.
lscngr [OPTIONS] CORPUS_NAME
Options can be set as follows:
-p ATTR_NAME      corpus positional attribute (default: word)
-n NGR_FILE_PATH  n-grams data file path
-f                lists frequencies
-d STRUCT.ATTR    print STRUCT duplicates according to ATTR
-m MIN_NGRAM      minimum n-gram size (default: 3)
Example:
>genngr susanne word 1 word.ngr
Preparing text
Creating suffix array
Creating LCP array
Saving LDIs
>ls | grep word.ngr
word.ngr.freq
word.ngr.lex
word.ngr.lex.idx
word.ngr.mm
word.ngr.rev
word.ngr.rev.cnt
word.ngr.rev.cnt64
word.ngr.rev.idx
>lscngr -f -n word.ngr susanne | head -10
2 3,4 The jury said | it 2 3 7
2 2,3 The grand | jury 2 6 9
2 3,3 The other , 8 7 195
3 3,3 The fact that 5 27 53
2 3,3 The fact is 5 2 53
2 2,3 The purpose | of 2 7 18
2 3,3 The man was 5 6 169
2 4,4 The Charles Men , 5 2 5
5 2,3 The Charles | Men 5 5 25
2 3,3 The New York 3 24 69
The semantics of the columns in the output listed above are as follows:
- n-gram frequency
- minimum, maximum length of the n-gram
- the first 20 tokens of the n-gram; a vertical bar (“|”) is printed after the minimum-th word of the n-gram
The following columns are listed only with the <tt>-f</tt> option. Writing the n-gram as a concatenation of strings xyz (x the first token, z the last token, y the rest):
- frequency of the xy (n-1)-gram
- frequency of the yz (n-1)-gram
- frequency of the y (n-2)-gram
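The column layout described above can be pulled apart programmatically. A minimal Python sketch (the field names are illustrative, not part of any Manatee API):

```python
def parse_lscngr_line(line):
    """Split one line of `lscngr -f` output into its components.

    Layout as described above: frequency, "min,max" lengths,
    the n-gram tokens (with '|' after the minimum-th word),
    and the three trailing sub-n-gram frequencies printed by -f.
    """
    fields = line.split()
    freq = int(fields[0])
    min_len, max_len = (int(x) for x in fields[1].split(","))
    tokens = fields[2:-3]  # n-gram tokens, including the '|' marker
    f_xy, f_yz, f_y = (int(x) for x in fields[-3:])
    return freq, (min_len, max_len), tokens, (f_xy, f_yz, f_y)

# e.g. the first line of the susanne output above
parsed = parse_lscngr_line("2 3,4 The jury said | it 2 3 7")
```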
If the optional <tt>-d STRUCT.ATTR</tt> option is given, a list of these structure attributes is printed in addition to the output above, saying which structures share a common n-gram (n being 40 by default, but it may be set to a larger value using <tt>-m</tt>).
E.g.
lscngr -m 100 -f -d bncdoc.id bnc2
prints
>646#624>HHM HHK
at the end, saying that documents 646 and 624 (with IDs “HHM” and “HHK”) share a common 100-gram.
Generating n-grams from a vertical file (<tt>ngrsave</tt>)
Features:
- concurrent n-gram generation (for any n up to the given maximum) from a vertical file
- direct storing in a text file
- no corpus size limit
Usage:
The <tt>ngrsave</tt> utility generates n-grams from a vertical file and stores them in a single text file:
usage: ngrsave VERT_FILE SAVE_FILE STOPLIST_FILE [DOC_SEPARATOR NGRAM_SIZE IGNORE_PUNC]
   or: ngrsave -c CORPUS ATTR SAVE_FILE STOPLIST_FILE [DOC_STRUCTURE NGRAM_SIZE IGNORE_PUNC]
Prints all n-grams that occurred at least twice in the input
VERT_FILE      input vertical file to be processed, use - for standard input
STOPLIST_FILE  textfile with one stopword per line, n-grams will not contain
               any stopwords (use - as STOPLIST_FILE for omitting it)
CORPUS         corpus registry filename
ATTR           attribute name
SAVE_FILE      textfile where the output will be written
DOC_SEPARATOR  line prefix, e.g. '<doc', that will be used for separating documents.
               If given, each n-gram is followed by its frequency together with
               the IDs of the documents where it occurred
DOC_STRUCTURE  same as above, but the name of the structure, e.g. 'doc'
NGRAM_SIZE     maximum size of the n-gram (the n), defaults to 10
IGNORE_PUNC    disables ignoring punctuation by providing a 0 value
               (any positive number means enable, the default)
Example:
>cut -f1 susanne.vert | ngrsave - susanne.ngrsave - "<doc"
Round: 0
Preparing text
Creating suffix array
Saving n-grams
>head susanne.ngrsave.out
that there be a line through P which meets g 2 130 130
the case in which g is a curve on a 2 130 130
was stored at ° in a tube equipped with a 2 123 123
be a line through P which meets g in points 2 130 130
at ° in a tube equipped with a break seal 2 123 123
there be a line through P which meets g in 2 130 130
He handed the bayonet to Dean and kept the pistol 2 136 136
were allowed to stand at room temperature for 1 hr 2 126 126
case in which g is a curve on a quadric 2 130 130
requires that there be a line through P which meets 2 130 130
The output contains all n-grams that occurred at least twice; since a document separator was given, each n-gram is followed by its frequency and the IDs of the documents where it occurred.
Selected command line tools in more detail:
corpinfo
Prints basic information about a given corpus.
Usage: corpinfo [OPTIONS] CORPNAME
-d dump whole configuration
-p print corpus directory path
-s print corpus size
-w print corpus wordform counts
-g OPT print configuration value of option OPT
corpquery
Prints concordance of a given query
Usage: corpquery CORPUSNAME QUERY [ OPTIONS ]
Options:
-r ATTR reference attribute
(default: None)
-c LEFT,RIGHT | BOTH left and right or both context length
(default: 15)
-h LIMIT maximum number of results
(default: -1)
-a ATTR1,ATTR2,... comma separated list of attributes to be shown
(default: word,lemma,tag)
-s STR1,STR2... comma separated list of structures to be shown
(use struct.attr or struct.* to show structure attributes; default: s,p,doc)
-g GDEX_CONF use GDEX with a given GDEX_CONF configuration file
(default: None; use - for default configuration) use -h to set the result size (default: 100)
-m GDEX_MODULE_DIR GDEX module path (directory with gdex.py or gdex_old.py)
lsclex
Lists lexicon of given corpus attribute
usage: lsclex [-snf] CORPUS ATTR
-s str2id -- strings from stdin translate to IDs
-n id2str -- IDs from stdin translate to strings
-f print frequencies
lsslex
Lists number of tokens for all structure attribute values
usage: lsslex CORPNAME STRUCTNAME STRUCTATTR
example: lsslex bnc bncdoc alltyp
freqs
Prints frequencies of words in a given context of a given query
usage: freqs CORPUSNAME 'QUERY' 'CONTEXT' LIMIT
default CONTEXT is 'word -1'; default LIMIT is 1
examples: freqs susanne '[lemma="house"]' 'word -1'
freqs susanne '[lemma="run"]' 'word/i 0 tag 0 lemma 1' 2
freqs susanne '[lemma="test"] []? [tag="NN.*"]' 'word/i -1>0' 0
corpcheck
Checks the validity of various corpus attributes and the correctness of the compiled corpus data. Any issues found are reported in a clear, human-readable format on standard error.
Usage: corpcheck CORPNAME