Users with a local installation of Sketch Engine can run the following commands on Linux.
Overview of all command line tools
addsatfiles | dumpstructrng | lscbgr | mkhatlex | ocd-mkhwds-plain |
| dumpthes | lsclex | mkhatsort | ocd-mkhwds-terms |
biterms | dumpwmap | lscngr | mkisrt | ocd-mkthes |
calctrends | dumpwmrev | lsfsa | mklcm | ocd-mkwsi |
compilecorp | dumpws | lsfsa_intersect | mklex | par2tokens |
concinfo | encodevert | lsfsa_left_intersect | mknormattr | parencodevert |
corpconfcheck | extrms | lskw | mknorms | parmkdynattr |
corpdatacheck | filterquery2attr | lslex | mkregexattr | parse2wmap |
corpcheck | filterwm | lslexarf | mksizes | parws |
corpinfo | freqs | lsslex | mkstats | registry_edit |
corpquery | genbgr | lswl | mksubc | sconll2sketch |
corpus4fsa | genfreq | manateesrv | mkthes | sconll2wmap |
decodevert | genhist | maplexrev | mktrends | setupbonito |
devirt | genngr | mkalign | mkvirt | ske |
dumpalign | genterms | mkbgr | mkwc | sortuniq |
dumpattrrev | genws | mkbidict | mkwmap | sortws |
dumpattrtext | hashws | mkdrev | mkwmrank | terms2fsa |
dumpbits | lex2fsa | mkdtext | ngr2fsa | tokens2dict |
dumpdrev | lexonomyCreateEntries | mkdynattr | ngrsave | vertfork |
dumpdtext | lexonomyMakeDict | mkfsa | ocd-mkcoll | virtws |
dumpfsa | lsalsize | mkfsalex | ocd-mkdefs | wm2terms |
dumplevel | lsbgr | mkhatfsa | ocd-mkdict | wm2thes |
| | | ocd-mkgdex | ws2fsa |
Command line tools for n-grams
A number of utilities in Finlib/Manatee make it easy to generate and store n-grams from corpora efficiently. They fall into 3 groups according to their features:
Generating bigrams from a compiled corpus (genbgr, mkbgr, lsbgr, lscbgr)
Features:
- bigram generation, storing and viewing from a compiled corpus
- no corpus size limit
Usage:
The genbgr and mkbgr tools are used for generating and storing bigrams, respectively:
genbgr CORPUS ATTR MINFREQ | mkbgr BGRFILE
where CORPUS is the registry name/path of the corpus, ATTR is the attribute that should be used for generating the bigrams, MINFREQ is the minimum frequency of the bigram and BGRFILE is the prefix for the bigram files, usually ATTR.bgr.
To view stored bigrams, use the lsbgr tool:
lsbgr BGRFILE [FIRST_ID]
where BGRFILE is the same path as given above and the optional FIRST_ID argument selects the first bigram ID that will be shown (otherwise all bigrams are listed).
Example:
>genbgr susanne word 1 | mkbgr word.bgr
mkbgr word.bgr[1]: stream sorted, #parts: 1
mkbgr word.bgr[2]: temporary files renamed
>ls | grep word.bgr
word.bgr.cnt
word.bgr.idx
>lsbgr word.bgr | head -10
0 1 1
0 14 1
0 16 2
0 23 3
0 25 6
0 33 2
0 40 2
0 49 1
0 52 1
0 66 3
The 3 columns are the attribute IDs of the two tokens representing the bigram and the frequency of this bigram. To convert an attribute ID into the corresponding string, use the lsclex tool:
>echo -e '14\n1' | lsclex -n susanne word
14 election
1 Fulton
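The three-column lsbgr output is easy to post-process in a script. The following is a minimal sketch, assuming the output is whitespace-separated exactly as in the example above; the names Bigram and parse_lsbgr are hypothetical helpers and not part of Manatee.

```python
from typing import Iterable, Iterator, NamedTuple


class Bigram(NamedTuple):
    first_id: int   # attribute ID of the first token
    second_id: int  # attribute ID of the second token
    freq: int       # frequency of the bigram


def parse_lsbgr(lines: Iterable[str]) -> Iterator[Bigram]:
    """Yield one Bigram per non-empty line of lsbgr output."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got: {line!r}")
        yield Bigram(*(int(f) for f in fields))


# Sample lines copied from the lsbgr example above.
sample = """0 1 1
0 14 1
0 16 2
0 23 3
0 25 6"""

bigrams = list(parse_lsbgr(sample.splitlines()))
most_frequent = max(bigrams, key=lambda b: b.freq)
print(most_frequent)
```

The IDs in the result can then be passed to lsclex -n to recover the token strings.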
The lscbgr tool prints bigram strings directly and has more options:
lscbgr
Lists corpus bigrams
usage: lscbgr [OPTIONS] CORPUS_NAME [FIRST_ID]
-p ATTR_NAME corpus positional attribute [default word]
-n BGR_FILE_PATH path to data files [default CORPPATH/ATTR_NAME.bgr]
-f lists frequencies of both tokens
-s t|mi|mi3|ll|ms|s compute statistics:
t T score
mi MI score
mi3 MI^3 score
ll log likelihood
ms minimum sensitivity
d logDice
Example:
>lscbgr -f -n word.bgr susanne | head
The Fulton 1074 14 1
The election 1074 36 1
The " 1074 2311 2
The place 1074 73 3
The jury 1074 27 6
The City 1074 29 2
The charge 1074 17 2
The September 1074 4 1
The charged 1074 18 1
The Mayor 1074 19 3
Generating n-grams from a compiled corpus (genngr, lscngr)
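The frequencies printed with -f are enough to compute an association score by hand. The sketch below uses the commonly cited logDice definition, 14 + log2(2 * f_xy / (f_x + f_y)); whether lscbgr -s computes exactly this is an assumption. The column order (first token, second token, frequency of the first token, frequency of the second token, bigram frequency) is inferred from the example, and parse_lscbgr_f and log_dice are hypothetical helper names.

```python
import math
from typing import Tuple


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Commonly cited logDice: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def parse_lscbgr_f(line: str) -> Tuple[str, str, int, int, int]:
    """Split one `lscbgr -f` line into (first, second, f_first, f_second, f_bigram).

    Assumes whitespace-separated columns in the order seen in the example.
    """
    head, f_x, f_y, f_xy = line.rsplit(None, 3)
    first, second = head.split()
    return first, second, int(f_x), int(f_y), int(f_xy)


# Two lines copied from the lscbgr example above.
for line in ["The jury 1074 27 6", "The Fulton 1074 14 1"]:
    first, second, f_x, f_y, f_xy = parse_lscbgr_f(line)
    print(first, second, round(log_dice(f_xy, f_x, f_y), 2))
```

Under these assumptions "The jury" scores higher than "The Fulton", because the bigram accounts for a larger share of the occurrences of its two tokens.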
Features:
- concurrent n-gram generation (for any n), storing and viewing from a compiled corpus
- corpus size up to 2 billion tokens (larger corpora may be processed, but only the first 2 billion tokens will be used)
Usage:
The genngr tool is used for generating and storing n-grams, and lscngr for viewing them:
genngr CORPUS ATTR MINFREQ NGRFILE
The parameters for genngr have the same semantics as for genbgr/mkbgr above; the prefix path is usually ATTR.ngr.
lscngr [OPTIONS] CORPUS_NAME
Options can be set as follows:
-p ATTR_NAME corpus positional attribute (default: word)
-n NGR_FILE_PATH n-grams data file path
-f lists frequencies
-d STRUCT.ATTR print STRUCT duplicates according to ATTR
-m MIN_NGRAM minimum n-gram size (default: 3)
Example:
>genngr susanne word 1 word.ngr
Preparing text
Creating suffix array
Creating LCP array
Saving LDIs
>ls | grep word.ngr
word.ngr.freq
word.ngr.lex
word.ngr.lex.idx
word.ngr.mm
word.ngr.rev
word.ngr.rev.cnt
word.ngr.rev.cnt64
word.ngr.rev.idx
>lscngr -f -n word.ngr susanne | head -10
2 3,4 The jury said | it 2 3 7
2 2,3 The grand | jury 2 6 9
2 3,3 The other , 8 7 195
3 3,3 The fact that 5 27 53
2 3,3 The fact is 5 2 53
2 2,3 The purpose | of 2 7 18
2 3,3 The man was 5 6 169
2 4,4 The Charles Men , 5 2 5
5 2,3 The Charles | Men 5 5 25
2 3,3 The New York 3 24 69
The meaning of the columns in the output listed above is as follows:
- n-gram frequency
- minimum, maximum length of the n-gram
- first 20 tokens of the n-gram, there is a vertical bar (“|”) after the minimum-th word of the n-gram
The following is listed only with the -f option. Given an n-gram as a concatenation of strings xyiz:
- frequency of the xyi (n-1)-gram
- frequency of the yiz (n-1)-gram
- frequency of the yi (n-2)-gram
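The column layout described above can be read back into a structured record. This is a minimal sketch, assuming whitespace-separated fields as in the example output; NgramLine and parse_lscngr are hypothetical names. Note that it drops every "|" field, so a corpus token that is literally a vertical bar would be lost.

```python
from typing import List, NamedTuple, Optional, Tuple


class NgramLine(NamedTuple):
    freq: int                                   # n-gram frequency
    min_len: int                                # minimum n-gram length
    max_len: int                                # maximum n-gram length
    tokens: List[str]                           # n-gram tokens without the "|" marker
    sub_freqs: Optional[Tuple[int, int, int]]   # three extra columns printed with -f


def parse_lscngr(line: str, with_freqs: bool = False) -> NgramLine:
    """Parse one line of lscngr output (pass with_freqs=True for `lscngr -f`)."""
    fields = line.split()
    freq = int(fields[0])
    min_len, max_len = (int(n) for n in fields[1].split(","))
    rest = fields[2:]
    sub_freqs = None
    if with_freqs:
        a, b, c = (int(n) for n in rest[-3:])
        sub_freqs = (a, b, c)
        rest = rest[:-3]
    tokens = [t for t in rest if t != "|"]  # remove the minimum-length marker
    return NgramLine(freq, min_len, max_len, tokens, sub_freqs)


# First line of the lscngr -f example above.
parsed = parse_lscngr("2 3,4 The jury said | it 2 3 7", with_freqs=True)
print(parsed)
```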
If the optional -d STRUCT.ATTR option is given, a list of these structure attributes is printed in addition to the above output, showing which structures share a common n-gram (n being 40 by default, but it can be set to a larger value using -m).
E.g.
lscngr -m 100 -f -d bncdoc.id bnc2
prints
>646#624>HHM HHK
at the end, saying that documents 646 and 624 (with IDs “HHM” and “HHK”) share a common 100-gram.
Generating n-grams from a vertical file (ngrsave)
Features:
- concurrent n-gram generation (for any n up to the given maximum) from a vertical file
- direct storing in a text file
- no corpus size limit
Usage:
The ngrsave utility generates the n-grams from a vertical file and stores them in a single text file:
usage: ngrsave VERT_FILE SAVE_FILE STOPLIST_FILE [DOC_SEPARATOR NGRAM_SIZE IGNORE_PUNC]
or ngrsave -c CORPUS ATTR SAVE_FILE STOPLIST_FILE [DOC_STRUCTURE NGRAM_SIZE IGNORE_PUNC]
Prints all n-grams that occurred at least twice in the input VERT_FILE
STOPLIST_FILE textfile with one stopword per line, n-grams will not contain any stopwords (use - as STOPLIST_FILE for omitting it)
VERT_FILE input vertical file to be processed, use - for standard input
CORPUS corpus registry filename
ATTR attribute name
SAVE_FILE textfile where the output will be written
DOC_SEPARATOR line prefix, e.g. '
Example:
>cut -f1 susanne.vert | ngrsave - susanne.ngrsave - "head susanne.ngrsave.out
that there be a line through P which meets g 2 130 130
the case in which g is a curve on a 2 130 130
was stored at ° in a tube equipped with a 2 123 123
be a line through P which meets g in points 2 130 130
at ° in a tube equipped with a break seal 2 123 123
there be a line through P which meets g in 2 130 130
He handed the bayonet to Dean and kept the pistol 2 136 136
were allowed to stand at room temperature for 1 hr 2 126 126
case in which g is a curve on a quadric 2 130 130
requires that there be a line through P which meets 2 130 130
The output contains all n-grams that occurred at least twice.
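To make the "at least twice, no stopwords" behaviour concrete, here is a toy illustration in Python. It is not how ngrsave is implemented: it holds everything in memory, ignores document separators and punctuation handling, and the function name repeated_ngrams and the min_n parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Sequence, Tuple


def repeated_ngrams(
    tokens: Sequence[str],
    max_n: int,
    stopwords: FrozenSet[str] = frozenset(),
    min_n: int = 2,
) -> Dict[Tuple[str, ...], int]:
    """Count all n-grams (min_n <= n <= max_n) and keep those seen at least twice.

    N-grams containing any stopword are skipped entirely.
    """
    counts: Counter = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if stopwords.isdisjoint(gram):
                counts[gram] += 1
    return {gram: c for gram, c in counts.items() if c >= 2}


tokens = "a line through P meets g and a line through Q".split()
for gram, count in repeated_ngrams(tokens, max_n=3).items():
    print(" ".join(gram), count)
```

With a stoplist containing "a", only "line through" would remain in this toy input, since every other repeated n-gram contains the stopword.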
Selected command line tools in more detail:
corpinfo
Prints basic information about a given corpus.
Usage: corpinfo [OPTIONS] CORPNAME
-d dump whole configuration
-p print corpus directory path
-s print corpus size
-w print corpus wordform counts
-g OPT print configuration value of option OPT
corpquery
Prints concordance of a given query
Usage: corpquery CORPUSNAME QUERY [ OPTIONS ]
Options:
-r ATTR reference attribute
(default: None)
-c LEFT,RIGHT | BOTH left and right or both context length
(default: 15)
-h LIMIT maximum number of results
(default: -1)
-a ATTR1,ATTR2,... comma separated list of attributes to be shown
(default: word,lemma,tag)
-s STR1,STR2... comma separated list of structures to be shown
(use struct.attr or struct.* to show structure attributes; default: s,p,doc)
-g GDEX_CONF use GDEX with a given GDEX_CONF configuration file
(default: None; use - for default configuration) use -h to set the result size (default: 100)
-m GDEX_MODULE_DIR GDEX module path (directory with gdex.py or gdex_old.py)
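When calling corpquery from a script, it can help to assemble the argument list in one place. The following is a small sketch that uses only the options listed above (-c, -h, -a); the helper name corpquery_args is hypothetical, and actually running the command requires a local Sketch Engine installation.

```python
from typing import List, Optional, Sequence


def corpquery_args(
    corpus: str,
    query: str,
    context: Optional[int] = None,
    limit: Optional[int] = None,
    attrs: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build an argument list following `corpquery CORPUSNAME QUERY [ OPTIONS ]`."""
    args = ["corpquery", corpus, query]
    if context is not None:
        args += ["-c", str(context)]    # same context length on both sides
    if limit is not None:
        args += ["-h", str(limit)]      # maximum number of results
    if attrs:
        args += ["-a", ",".join(attrs)]  # attributes to be shown
    return args


args = corpquery_args("susanne", '[lemma="house"]', context=5, limit=10,
                      attrs=["word", "lemma"])
print(args)

# On a machine with the tools installed, the list could be passed to
# subprocess.run(args, capture_output=True, text=True, check=True).
```

Passing the arguments as a list avoids shell quoting problems with the square brackets and quotes in the query.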
lsclex
Lists lexicon of given corpus attribute
usage: lsclex [-snf] CORPUS ATTR
-s str2id -- strings from stdin translate to IDs
-n id2str -- IDs from stdin translate to strings
-f print frequencies
lsslex
Lists number of tokens for all structure attribute values
usage: lsslex CORPNAME STRUCTNAME STRUCTATTR
example: lsslex bnc bncdoc alltyp
freqs
Prints frequencies of words in a given context of a given query
usage: freqs CORPUSNAME 'QUERY' 'CONTEXT' LIMIT
default CONTEXT is 'word -1' default LIMIT is 1
examples: freqs susanne '[lemma="house"]' 'word -1'
freqs susanne '[lemma="run"]' 'word/i 0 tag 0 lemma 1' 2
freqs susanne '[lemma="test"] []? [tag="NN.*"]' 'word/i -1>0' 0
corpcheck
Checks the validity of various corpus attributes and the correctness of compiled corpus data. Any issues found with the corpus are presented in a clear, human-readable format in standard error output.
Usage: corpcheck CORPNAME