Users with a local installation of Sketch Engine can run the following commands on Linux.
Overview of all command line tools
addsatfiles | dumpstructrng | lscbgr | mkhatlex | ocd-mkgdex |
addwcattr | dumpthes | lsclex | mkhatsort | ocd-mkhwds-plain |
biterms | dumpwmap | lscngr | mkisrt | ocd-mkhwds-terms |
calctrends | dumpwmrev | lsfsa | mklcm | ocd-mkthes |
compilecorp | dumpws | lsfsa_intersect | mklex | ocd-mkwsi |
concinfo | encodevert | lsfsa_left_intersect | mknormattr | par2tokens |
corpconfcheck | extrms | lskw | mknorms | parencodevert |
corpdatacheck | filterquery2attr | lslex | mkregexattr | parmkdynattr |
corpcheck | filterwm | lslexarf | mksizes | parse2wmap |
corpinfo | freqs | lsslex | mkstats | parws |
corpquery | genbgr | lswl | mksubc | sconll2sketch |
corpus4fsa | genfreq | manateesrv | mkthes | sconll2wmap |
decodevert | genhist | maplexrev | mktrends | setupbonito |
devirt | genngr | mkalign | mkvirt | ske |
dumpalign | genterms | mkbgr | mkwc | sortuniq |
dumpattrrev | genws | mkbidict | mkwmap | sortws |
dumpattrtext | hashws | mkdrev | mkwmrank | terms2fsa |
dumpbits | lex2fsa | mkdtext | ngr2fsa | tokens2dict |
dumpdrev | lexonomyCreateEntries | mkdynattr | ngrsave | vertfork |
dumpdtext | lexonomyMakeDict | mkfsa | ocd-mkcoll | virtws |
dumpfsa | lsalsize | mkfsalex | ocd-mkdefs | wm2terms |
dumplevel | lsbgr | mkhatfsa | ocd-mkdict | wm2thes |
ws2fsa |
Command line tools for n-grams
A number of utilities in Finlib/Manatee make it easy to generate and store n-grams from corpora efficiently. The utilities fall into three groups according to their features:
Generating bigrams from a compiled corpus (<tt>genbgr, mkbgr, lsbgr, lscbgr</tt>)
Features:
- bigram generation, storing and viewing from a compiled corpus
- no corpus size limit
Usage:
The <tt>genbgr</tt> and <tt>mkbgr</tt> tools are used for generating and storing bigrams, respectively:
genbgr CORPUS ATTR MINFREQ | mkbgr BGRFILE
where <tt>CORPUS</tt> is the registry name/path of the corpus, <tt>ATTR</tt> is the attribute from which the bigrams are generated, <tt>MINFREQ</tt> is the minimum frequency of a bigram, and <tt>BGRFILE</tt> is the prefix for the bigram files, usually <tt>ATTR.bgr</tt>.
For viewing stored bigrams, use the <tt>lsbgr</tt> tool:
lsbgr BGRFILE [FIRST_ID]
where <tt>BGRFILE</tt> is the same path as given above and the optional <tt>FIRST_ID</tt> argument selects the first bigram ID to be shown (otherwise all bigrams are listed).
Example:
>genbgr susanne word 1 | mkbgr word.bgr
mkbgr word.bgr[1]: stream sorted, #parts: 1
mkbgr word.bgr[2]: temporary files renamed
>ls | grep word.bgr
word.bgr.cnt
word.bgr.idx
>lsbgr word.bgr | head -10
0 1 1
0 14 1
0 16 2
0 23 3
0 25 6
0 33 2
0 40 2
0 49 1
0 52 1
0 66 3
The three columns are the attribute IDs of the two tokens forming the bigram and the frequency of the bigram. To convert an attribute ID into the corresponding string, use the <tt>lsclex</tt> tool:
>echo -e '14\n1' | lsclex -n susanne word
14 election
1 Fulton
The <tt>lscbgr</tt> tool prints bigram strings directly and offers more options:
lscbgr
Lists corpus bigrams
usage: lscbgr [OPTIONS] CORPUS_NAME [FIRST_ID]
-p ATTR_NAME      corpus positional attribute [default word]
-n BGR_FILE_PATH  path to data files [default CORPPATH/ATTR_NAME.bgr]
-f                lists frequencies of both tokens
-s t|mi|mi3|ll|ms|d  compute statistics:
   t    T score
   mi   MI score
   mi3  MI^3 score
   ll   log likelihood
   ms   minimum sensitivity
   d    logDice
Example:
>lscbgr -f -n word.bgr susanne | head
The Fulton 1074 14 1
The election 1074 36 1
The " 1074 2311 2
The place 1074 73 3
The jury 1074 27 6
The City 1074 29 2
The charge 1074 17 2
The September 1074 4 1
The charged 1074 18 1
The Mayor 1074 19 3
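The statistics offered by the <tt>-s</tt> option are standard lexical association measures. The following is an illustrative sketch using the textbook definitions (the exact formulas implemented in <tt>lscbgr</tt> may differ); it computes the scores from the joint and marginal frequencies that <tt>-f</tt> prints, given the corpus size N:

```python
import math

def association_scores(f_xy, f_x, f_y, n):
    """Common association measures for a bigram (x, y).

    f_xy     -- frequency of the bigram
    f_x, f_y -- frequencies of the two tokens
    n        -- corpus size in tokens

    Textbook definitions only; not necessarily what lscbgr computes.
    """
    expected = f_x * f_y / n
    return {
        "t": (f_xy - expected) / math.sqrt(f_xy),           # T score
        "mi": math.log2(f_xy * n / (f_x * f_y)),            # MI score
        "mi3": math.log2(f_xy ** 3 * n / (f_x * f_y)),      # MI^3 score
        "ms": min(f_xy / f_x, f_xy / f_y),                  # minimum sensitivity
        "logdice": 14 + math.log2(2 * f_xy / (f_x + f_y)),  # logDice
    }

# e.g. "The jury" from the susanne output above: f_xy=6, f_x=1074, f_y=27
# (the corpus size 150000 is an illustrative assumption)
scores = association_scores(6, 1074, 27, 150000)
```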
Generating n-grams from a compiled corpus (<tt>genngr, lscngr</tt>)
Features:
- concurrent n-gram generation (for any n), storing and viewing from a compiled corpus
- corpus size up to 2 billion tokens (larger corpora may be processed, but only the first 2 billion tokens will be used)
Usage:
The <tt>genngr</tt> tool is used for generating and storing n-grams, the <tt>lscngr</tt> tool for viewing them:
genngr CORPUS ATTR MINFREQ NGRFILE
The parameters of <tt>genngr</tt> have the same semantics as those of <tt>genbgr/mkbgr</tt> above; the prefix path is usually <tt>ATTR.ngr</tt>.
lscngr [OPTIONS] CORPUS_NAME
Options can be set as follows:
-p ATTR_NAME      corpus positional attribute (default: word)
-n NGR_FILE_PATH  n-grams data file path
-f                lists frequencies
-d STRUCT.ATTR    print STRUCT duplicates according to ATTR
-m MIN_NGRAM      minimum n-gram size (default: 3)
Example:
>genngr susanne word 1 word.ngr
Preparing text
Creating suffix array
Creating LCP array
Saving LDIs
>ls | grep word.ngr
word.ngr.freq
word.ngr.lex
word.ngr.lex.idx
word.ngr.mm
word.ngr.rev
word.ngr.rev.cnt
word.ngr.rev.cnt64
word.ngr.rev.idx
>lscngr -f -n word.ngr susanne | head -10
2 3,4 The jury said | it 2 3 7
2 2,3 The grand | jury 2 6 9
2 3,3 The other , 8 7 195
3 3,3 The fact that 5 27 53
2 3,3 The fact is 5 2 53
2 2,3 The purpose | of 2 7 18
2 3,3 The man was 5 6 169
2 4,4 The Charles Men , 5 2 5
5 2,3 The Charles | Men 5 5 25
2 3,3 The New York 3 24 69
The semantics of the columns in the output listed above are as follows:
- n-gram frequency
- minimum, maximum length of the n-gram
- the first 20 tokens of the n-gram; a vertical bar (“|”) is printed after the minimum-th word of the n-gram
The following columns are listed only with the <tt>-f</tt> option. Writing the n-gram as a concatenation of strings xyz (x the first token, z the last token, y the rest):
- frequency of the xy (n-1)-gram
- frequency of the yz (n-1)-gram
- frequency of the y (n-2)-gram
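The column layout described above can be pulled apart programmatically. A minimal Python sketch (the field names are illustrative, not part of any Manatee API):

```python
def parse_lscngr_line(line):
    """Split one line of `lscngr -f` output into its components.

    Layout as described above: frequency, "min,max" lengths,
    the n-gram tokens (with '|' after the minimum-th word),
    and the three trailing sub-n-gram frequencies printed by -f.
    """
    fields = line.split()
    freq = int(fields[0])
    min_len, max_len = (int(x) for x in fields[1].split(","))
    tokens = fields[2:-3]  # n-gram tokens, including the '|' marker
    f_xy, f_yz, f_y = (int(x) for x in fields[-3:])
    return freq, (min_len, max_len), tokens, (f_xy, f_yz, f_y)

# e.g. the first line of the susanne output above
parsed = parse_lscngr_line("2 3,4 The jury said | it 2 3 7")
```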
If the optional <tt>-d STRUCT.ATTR</tt> option is given, a list of these structure attributes is printed in addition to the output above, saying which structures share a common n-gram (n being 40 by default, but it may be set to a larger value using <tt>-m</tt>).
E.g.
lscngr -m 100 -f -d bncdoc.id bnc2
prints
>646#624>HHM HHK
at the end, saying that documents 646 and 624 (with IDs “HHM” and “HHK”) share a common 100-gram.
Generating n-grams from a vertical file (<tt>ngrsave</tt>)
Features:
- concurrent n-gram generation (for any n up to the given maximum) from a vertical file
- direct storing in a text file
- no corpus size limit
Usage:
The <tt>ngrsave</tt> utility generates n-grams from a vertical file and stores them in a single text file:
usage: ngrsave VERT_FILE SAVE_FILE STOPLIST_FILE [DOC_SEPARATOR NGRAM_SIZE IGNORE_PUNC]
   or: ngrsave -c CORPUS ATTR SAVE_FILE STOPLIST_FILE [DOC_STRUCTURE NGRAM_SIZE IGNORE_PUNC]
Prints all n-grams that occurred at least twice in the input
VERT_FILE      input vertical file to be processed, use - for standard input
STOPLIST_FILE  textfile with one stopword per line, n-grams will not contain
               any stopwords (use - as STOPLIST_FILE for omitting it)
CORPUS         corpus registry filename
ATTR           attribute name
SAVE_FILE      textfile where the output will be written
DOC_SEPARATOR  line prefix, e.g. '<doc', that will be used for separating documents.
               If given, each n-gram is followed by its frequency together with
               the IDs of the documents where it occurred
DOC_STRUCTURE  same as above, but the name of the structure, e.g. 'doc'
NGRAM_SIZE     maximum size of the n-gram (the n), defaults to 10
IGNORE_PUNC    disables ignoring punctuation by providing a 0 value
               (any positive number means enable, the default)
Example:
>cut -f1 susanne.vert | ngrsave - susanne.ngrsave - "<doc"
Round: 0
Preparing text
Creating suffix array
Saving n-grams
>head susanne.ngrsave.out
that there be a line through P which meets g 2 130 130
the case in which g is a curve on a 2 130 130
was stored at ° in a tube equipped with a 2 123 123
be a line through P which meets g in points 2 130 130
at ° in a tube equipped with a break seal 2 123 123
there be a line through P which meets g in 2 130 130
He handed the bayonet to Dean and kept the pistol 2 136 136
were allowed to stand at room temperature for 1 hr 2 126 126
case in which g is a curve on a quadric 2 130 130
requires that there be a line through P which meets 2 130 130
The output contains all n-grams that occurred at least twice; since a document separator was given, each n-gram is followed by its frequency and the IDs of the documents where it occurred.
Selected command line tools in more detail:
corpinfo
Prints basic information about a given corpus.
Usage: corpinfo [OPTIONS] CORPNAME
-d dump whole configuration
-p print corpus directory path
-s print corpus size
-w print corpus wordform counts
-g OPT print configuration value of option OPT
corpquery
Prints concordance of a given query
Usage: corpquery CORPUSNAME QUERY [ OPTIONS ]
Options:
-r ATTR reference attribute
(default: None)
-c LEFT,RIGHT | BOTH left and right or both context length
(default: 15)
-h LIMIT maximum number of results
(default: -1)
-a ATTR1,ATTR2,... comma separated list of attributes to be shown
(default: word,lemma,tag)
-s STR1,STR2... comma separated list of structures to be shown
(use struct.attr or struct.* to show structure attributes; default: s,p,doc)
-g GDEX_CONF use GDEX with a given GDEX_CONF configuration file
(default: None; use - for default configuration) use -h to set the result size (default: 100)
-m GDEX_MODULE_DIR GDEX module path (directory with gdex.py or gdex_old.py)
lsclex
Lists lexicon of given corpus attribute
usage: lsclex [-snf] CORPUS ATTR
-s str2id -- strings from stdin translate to IDs
-n id2str -- IDs from stdin translate to strings
-f print frequencies
lsslex
Lists number of tokens for all structure attribute values
usage: lsslex CORPNAME STRUCTNAME STRUCTATTR
example: lsslex bnc bncdoc alltyp
freqs
Prints frequencies of words in a given context of a given query
usage: freqs CORPUSNAME 'QUERY' 'CONTEXT' LIMIT
default CONTEXT is 'word -1'; default LIMIT is 1
examples: freqs susanne '[lemma="house"]' 'word -1'
freqs susanne '[lemma="run"]' 'word/i 0 tag 0 lemma 1' 2
freqs susanne '[lemma="test"] []? [tag="NN.*"]' 'word/i -1>0' 0
corpcheck
Checks the validity of various corpus attributes and the correctness of the compiled corpus data. Any issues found are reported in a clear, human-readable format on standard error.
Usage: corpcheck CORPNAME