Sketch Engine is corpus management and text analysis software developed by Lexical Computing since 2003. It consists of three main components, which enable searching and building text corpora.

Bonito – a graphical user interface to the maintained corpora; see the changelog of Bonito
Manatee – a corpus management tool including corpus building and indexing, fast querying and providing basic statistical measures
FinLib – fast indexing library, see the changelog of FinLib

A brief overview of the main changes in Manatee is listed here.

Current stable version: 2.233 (as of November 2024)

2.156.6

  • FinLib incorporated into Manatee

2.152.1

  • do not parallelize corpus operations by default

2.152

  • implement parallel corpus indexing
  • improve parallel word sketch handling

2.151.5

  • fix Concordance::delete_subparts()
  • virtual corpora fixes

2.151.4

  • update mklcm

2.151.3

  • ensure that corpus PATH is nonempty
  • decodevert: structure attribute values escaping
  • regexopt: fix support for bracket literals
  • compilecorp: use one processor by default

2.151.2

  • fix queries containing ‘containing’

2.151.1

  • cql: support {,N} and {N,} quantifiers
  • remove skip_dupctx parameter for KWICLines

2.151

  • implement skip_dupctx parameter for KWICLines

2.150.4

  • remove C++11 features

2.150.3

  • fix a few memory leaks

2.150.2

  • quality improvements

2.150.1

  • do not virtualize sketches when some segments are not complete corpora

2.150

  • genngr: skip over default and empty attribute values
  • mksubc: urlencode names of subcorpora
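
As a rough illustration of what URL-encoding a subcorpus name involves, the following Python sketch percent-encodes every character that is unsafe in a file name or URL. The helper name encode_subcorpus_name and the choice of encoding all reserved characters are assumptions made for this example; the exact scheme used by mksubc may differ.

```python
from urllib.parse import quote, unquote


def encode_subcorpus_name(name):
    """Percent-encode a subcorpus name (hypothetical helper, not part of Manatee).

    safe="" means that even "/" is encoded, so the result can be used
    as a single path component.
    """
    return quote(name, safe="")


if __name__ == "__main__":
    encoded = encode_subcorpus_name("news 2015/Q1")
    print(encoded)           # news%202015%2FQ1
    print(unquote(encoded))  # news 2015/Q1
```

Decoding with unquote() restores the original name, so the encoding is lossless.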

2.149.3

  • quality improvements

2.149.2

  • quality improvements

2.149.1

  • cql: do not generate errors that are not valid utf-8

2.149

  • corpconf: remove support for escape sequences

2.148

  • corpconf: restrict support for escape sequences
  • cql: allow @ in attribute names

2.147

  • corpconf: only support escapes in double-quoted strings

2.146.6

  • corpconf: implement escapes in string literals
  • cql: fix sketch queries
  • regex optimization: fix the behavior of ‘+’

2.146.5

  • cql: enable NoSketchEngine support

2.146.4

  • fix FilteredWMap::poss() skipping duplicate positions

2.146.3

  • fix for large concordances and WMaps

2.146.2

  • cql: support large parameters to ws() and thes()

2.146.1

  • various regex optimization fixes

2.146

  • support zero-element word sketch files

2.145

  • cql: report error position
  • genws: support MULTIVALUE for collocations
  • fix ENCODING for structure attributes

2.144.1

  • cql: fix ONEPOS queries

2.144

  • update regex optimization rules
  • speed up corpquery -n

2.143.4

  • 2016/12/12
  • cql: support for multilevel wmap seek

2.142

  • 2016/11/23
  • cql: parse ‘seek’ in ‘ws|term(level, seek)’ as a number
  • add NEWS for manatee to shut up autotools
  • manatee: implemented query evaluation in yacc

2.141

  • 2016/11/03
  • corpquery accepts subcorpus via -u
  • added default locale li_NL for Limburgish
  • FinLib 2.36.2

2.140.2

  • 2016/10/21
  • extrms simple math N parameter can be float
  • finlib 2.36.1

2.140.1

  • 2016/10/19
  • decodevert: print end structures in reverse order

2.140

  • 2016/10/13
  • encodevert: check minimum bucket size for attribute memory
  • FinLib 2.36

2.139.3

  • 2016/08/27
  • wm2thes: accept CORPNAME argument also without -m
  • compilecorp: use virtws for virtual corpora sketches

2.138.4

  • 2016/08/13
  • compilecorp: use mklcm-go
  • biterms: made ca 4x faster

2.138.3

  • 2016/08/11
  • biterms: use new WMap interface

2.137.3

  • 2016/07/14
  • added multiword thesaurus computation
  • reformat wm2thes.cc
  • implemented virtual sketches, updated interface to WMap
  • added virtws for compilation of sketches on virtual corpora
  • added WMap::seppage() to export SEPARATEPAGE number
  • mkalign: print line number on aligndef file format error

2.137

  • 2016/05/20
  • mktrends: allow the SUBCORP argument to be empty
  • compilecorp: ALIGNDEF supports pipes like VERTICAL does
  • faster mktrends
  • manatee: mklcm in go
  • compilecorp: support for WSOLDSCORES

2.136

  • 2016/03/31
  • encodevert: call mknormattr according to MAPTO directive
  • added support for normalization attribute
  • ANTLR CQL grammar supports description definition

2.135.5

  • 2016/02/28
  • tstquery: added queries on parallel corpora
  • tstquery: print executed queries
  • do not label aligned corpus query in WITHIN!/!WITHIN queries

2.135.4

  • 2016/02/21
  • compilecorp: always move logfile into corpus path directory
  • compilecorp: improved error reporting to indicate actual line numbers

2.135

  • 2016/01/30
  • encodevert: better manipulation with lexicon added items cache

2.134

  • 2016/01/20
  • encodevert: dynamic lexicons cache sizes
  • reformat mkwmrank.cc
  • added bgr_abs_freq_coll association score, which returns the frequency of the first word of the collocation pair

2.133.4

  • 2015/12/12
  • mktrends: finalize output files properly

2.133.3

  • 2015/12/10
  • corpcheck: tolerate local path in INFOHREF

2.133.3

  • 2015/12/10
  • mktrends: finalize output files properly

2.133.2

  • 2015/12/07
  • fix handling of aligned corpora labels in Concordance

2.133.1

  • 2015/12/03
  • KWICLines skip aligned corpora collocations

2.133

  • 2015/12/02
  • CQL: added support to term queries using term() operator
  • compilecorp: added --no-ske option, which is the default for NoSkE

2.132.1

  • 2015/11/30
  • tstregexopt: takes attribute as another optional argument

2.132

  • 2015/11/24
  • speed up RQinNode and RQcontainNode

2.131.3

  • 2015/11/24
  • mknorms: speed up computation for subcorpora

2.131

  • 2015/11/12
  • removed findPosAttr() functions
  • reformat corpinfo.cc

2.130.6

  • 2015/11/12
  • fix !WITHIN

2.130.5

  • 2015/11/08
  • compilecorp: call mktrends with EPOCH_LIMIT being 1
  • fix MAXKWIC being 0 not meaning unlimited MAXKWIC

2.130.3

  • 2015/11/04
  • mktrends, save subcorp data properly

2.130.2

  • 2015/10/31
  • added NonEmptyRS for filtering empty RangeStream ranges

2.130

  • 2015/10/25
  • KWICLines has new method is_defined() and short-circuits processing of undefined lines
  • added Concordance::filter_aligned() for filtering by aligned corpus

2.129

  • 2015/09/21
  • mktrends: speed up ca 15x by more usage of numpy

2.128.4

  • 2015/09/10
  • updated CQL testsuite with current WS results on susanne

2.127

  • 2015/08/04
  • compilecorp: added support for longest commonest match

2.126

  • 2015/07/28
  • compilecorp: added support for trends computations
  • added mktrends script prepared by Ondřej Herman

2.125.2

  • 2015/07/20
  • mkwmrank: computing scores for each gramrel is independent of other gramrels

2.124

  • 2015/05/02
  • concordance automatically detects all collocations

2.122

  • 2015/04/19
  • CQL supports general NOT (!) in sequences as complement operator
  • Bugfixes:
  • fix CQL inequality comparisons on dynamic attributes

2.121.2

  • 2015/04/08
  • disable MULTIVALUE freqdist for positional attributes

2.121

  • 2015/04/03
  • mkdynattr: no need to manually delete lexicon with new write_lexicon
  • added new DYNTYPE “freq” for dynamic attributes
  • compilecorp and parws pass WSMINHITS to mkwmap
  • mkwmap: added all options to usage
  • mkwmap: added -f option allowing filtering for minimum frequency
  • write_lexicon allows overwriting datafiles
  • compilecorp: hashws terms automatically
  • compilecorp: write manatee version to log
  • Bugfixes:
  • fix empty KWICLines structure context for empty KWIC

2.120.1

  • 2015/03/29
  • Bugfixes:
  • genws: fix SEPARATEPAGE index for grammars using DUAL

2.120

  • 2015/03/28
  • freqs: allow filtering by subcorpus
  • new freq_dist() attribute modifier “/n” for getting IDs instead of strings
  • Bugfixes:
  • fix regexp2ids/regexp2poss for patterns with escaped metacharacters
  • compilecorp: ‘skipping biterms’ message fixed

2.119

  • 2015/03/23
  • genngr: allow setting min and max n-gram length from cmdline
  • genngr: limit maximum n-gram length to 30 by default

2.118

  • 2015/03/22
  • Bugfixes:
  • fix build with gcc 4.4 (RHEL/CentOS 6)
  • fix ConcStream::find_beg()/find_end()

2.117

  • 2015/02/24
  • create_subcorpus() takes an optional Structure argument

2.116

  • 2015/02/23
  • dumpalign supports 1:1

2.115.3

  • 2015/02/23
  • mkwmrank: fix segfault when datafiles cannot be opened
  • updated package specfiles to contain lsalsize

2.115.2

  • 2015/02/10
  • updated tstquery gold results after word sketch format change
  • compilecorp: compute sizes after alignment
  • added lsalsize binary for listing alignment size of two corpora
  • mksizes: use lsalsize to compute alignment size
  • Bugfixes:
  • fix showing GDEX scores when references are up
  • Fix GDEX score display in concordance view
  • manatee: fix installing binaries on DEB
  • corpquery: fix parallel queries garbled by fake collocates

2.115.1

  • 2015/02/10
  • manatee: script for bilingual term extraction

2.115

  • 2015/01/21
  • CorpInfo may be modified and is exported into SWIG API
  • added dumpalign script for dumping parallel corpora

2.114

  • 2015/01/18
  • CQL supports regular expressions in word sketch gramrels
  • added regexp2ids() for word sketch gramrels
  • added mklex for creating lexicons

2.113

  • 2015/01/14
  • mkwmrank: added parameter for commonest match input
  • WSATTR defaults to lempos_lc -> lempos -> lemma_lc -> lemma -> DEFAULTATTR
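
The WSATTR fallback chain above can be read as "use the first attribute in the list that the corpus has, otherwise DEFAULTATTR". A minimal Python sketch of that reading follows; the function name pick_wsattr and the idea of testing membership in a set of attribute names are assumptions for illustration, not Manatee's actual implementation.

```python
# Order taken from the changelog entry; DEFAULTATTR is the final fallback.
WSATTR_FALLBACKS = ["lempos_lc", "lempos", "lemma_lc", "lemma"]


def pick_wsattr(available_attrs, default_attr):
    """Return the first fallback attribute present in the corpus.

    available_attrs: names of the attributes the corpus defines (assumed input).
    default_attr: the corpus DEFAULTATTR, used when no fallback is present.
    """
    for name in WSATTR_FALLBACKS:
        if name in available_attrs:
            return name
    return default_attr


if __name__ == "__main__":
    print(pick_wsattr({"word", "lemma", "lempos"}, "word"))  # lempos
    print(pick_wsattr({"word", "tag"}, "word"))              # word
```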

2.111.8

  • 2014/11/23
  • updated tstquery gold results after word sketch format change
  • Bugfixes:
  • genws: fix handling invalid STRUCTLIMIT

2.111.6

  • 2014/11/17
  • mkwmap works with empty input
  • Bugfixes:
  • skell: fixed typo in jQuery

2.111.3

  • 2014/10/21
  • 2x faster commonest_match.py

2.110

  • 2014/09/21
  • added defaults for SIMPLEQUERY corpus directive; it is [A="%s" | B="%s"]
  • CQL supports different attributes in global conditions
  • CQL supports !within and !containing operators
  • genws: STRUCTLIMIT may be arbitrary CQL query
  • added mkregexattr for compiling regex dynamic attribute
  • new version of word sketch data files

2.110

  • 2014/08/25
  • added jQuery UI javascript, css and images
  • added create_subcorpus() for arbitrary CQL query
  • create_subcorpus() takes directly RangeStream instead of query
  • mksubc supports creating subcorpora from CQL query
  • Bugfixes:
  • fix parws lexicon verification for new style TRINARY templates

2.109.8

  • 2014/08/13
  • Bugfixes:
  • fix build with gcc 4.4

2.109.7

  • 2014/07/28
  • parws: use single batch for TRINARY and COLLOC gramrels
  • compilecorp honours TMPDIR environment variable
  • Bugfixes:
  • mkvirt: fix freqs computation overflowing at int size
  • genngr: fix maximum allowed corpus size to 2^31-2

2.109.6

  • 2014/07/01
  • genws: set COLLOC lexicon hash size to 500k items
  • printer icon shall be part of NoSkE
  • Bugfixes:
  • corpquery: fix marking KWIC in output

2.109.2

  • 2014/06/18
  • compilecorp does not assume “word” attribute existence
  • corpquery does not assume “word” attribute

2.109

  • 2014/06/16
  • MAXKWIC restriction placed into Concordance
  • Bugfixes:
  • fixed a bug in selecting gramrels

2.108

  • 2014/06/13
  • added new dynamic function ascii for transliteration
  • mkwmap reserves file descriptors for joined set of files
  • Corpcheck checks if file “sizes” exists in PATH
  • changed support mail

2.107

  • 2014/04/16
  • compilecorp support for bilingual dictionaries
  • added MAXKWIC size for KWICLines, defaults to 100

2.106

  • 2014/02/27
  • added corpcheck utility for checking corpora sanity
  • added wsdump script for dumping of word sketches

2.103

  • 2014/02/09
  • added sconll2sketch and sconll2wmap
  • compilecorp support for sketches from (S)CONLL

2.97

  • 2013/12/28
  • mkdynattr: fix dynamic structure attributes of virtual corpora
  • mkstats support for n-grams on subcorpora

2.96

  • 2013/11/10
  • added dumpthes — simple dumping of thesaurus
  • CQL support for similarity search in thesaurus

2.95

  • 2013/11/03
  • added new dynamic function utf8capital
  • added new dynamic function utf8uppercase

2.94

  • 2013/11/01
  • added new dynamic function getnbysep
  • fix mkvirt failing if virtdef contains single corpus

2.92

  • 2013/10/23
  • encodevert compiles dynamic structure attributes
  • support for complement subcorpora

2.87

  • 2013/09/29
  • faster implementation of frq and docf computation
  • choose first non-dynamic attribute as default DEFAULTATTR
  • mkvirt accepts attribute list via -a option
  • added devirt script for corpus devirtualization
  • added parencodevert script for parallel corpus encoding
  • redesign of mksubc and (sub)corpora statistics creation
  • corpus configuration file may not end with a new line
  • faster computation of ARF + ALDF

2.86

  • 2013/08/14
  • full support for attributes of structures in virtual corpora
  • genws reports progress with -p option

2.85

  • 2013/08/07
  • fix segfault when opening a virtual corpus with unavailable virtdef
  • mkvirt automatically creates dynamic attributes
  • virtdef file may contain ‘$’ for segment end being corpus end position
  • fix corpinfo so that it dumps valid configuration file format
  • added mksizes script for compiling sizes
  • compilecorp support for creating word sketch hashes

2.84

  • 2013/06/06
  • compilecorp accepts --parallel=N option (number of parallel jobs)
  • compilecorp support for virtual corpora
  • mksubc writes detailed progress only with --debug
  • added CQL for range of positions, e.g. #20-50
  • CQL frequency function accepts values over 2^31
  • implemented CQL for word sketch seeks
  • added CQL support for querying word sketches by triples
  • CQL supports new positional functions “swap” and “ccoll”

2.83.3

  • 2013/06/05
  • FIX: fix missing throw statements for create_subcorpus() in SWIG API
  • FIX: fix evaluating empty concordance collocation

2.83.2

  • 2013/05/26
  • FIX: fix SEPARATEPAGE name being trimmed on first white space
  • FIX: Fix mksubc compiling only the 1st subc in subcdef

2.83.1

  • 2013/05/10
  • FIX: collocation computation for window crossing beg/end of corpus

2.83

  • 2013/05/10
  • enable multiple subsequent shuffling

2.82

  • 2013/04/20
  • mksubc support for n-grams, may take .subc file, may take attribute list

2.81

  • 2013/04/12
  • added url2domain dynamic attribute

2.80.1

  • 2013/04/03
  • FIX: utf8_tolower failing for empty strings and unallocated buffer

2.80

  • 2013/04/02
  • faster sample generation
  • ngrsave supports encoded corpus as input

2.79

  • 2013/03/21
  • added utf8getlastn() dynamic attribute function
  • FIX: SEPARATEPAGE with DUAL TRINARY

2.78

  • 2013/03/07
  • Concordance exports corpus object into SWIG API

2.77

  • 2013/03/06
  • lscbr and ngrsave are more user friendly

2.76.1

  • 2013/02/27
  • FIX: building with gcc >= 4.7

2.76

  • 2013/02/26
  • added support for structures in virtual corpora

2.75

  • 2013/02/24
  • Frequency distribution does not need Concordance to be computed

2.74

  • 2013/02/18
  • support DUAL TRINARY word sketch grammatical relations
  • added getfirstbysep internal function for dynamic attributes
  • added Setswana locale settings
  • added dumpwmrev for dumping ws delta rev files

2.73

  • 2013/02/04
  • requires finlib 2.21
  • implemented exact KWIC matching in filtering

2.72

  • 2013/01/29
  • support for aligned segment contexts

2.71

  • 2013/01/11
  • genhist enhancements

2.70

  • 2013/01/08
  • compilecorp compiles subcorpora right after the main corpus

2.69

  • 2012/12/10
  • export Corpus::get_confpath() into SWIG API

2.68

  • 2012/11/29
  • parallel corpora API modifications
  • FIX: a number of fixes for processing parallel corpora

2.67.2

  • 2012/11/26
  • FIX: a number of fixes for processing parallel corpora

2.67.1

  • 2012/11/26
  • FIX: set default ALIGNSTRUCT to “align”

2.67

  • 2012/11/17
  • compilecorp compiles alignment for parallel corpora
  • added a number of helper scripts for processing alignment
  • FIX: a number of fixes for processing parallel corpora

2.66

  • 2012/11/15
  • updated licensing information
  • FIX: a number of fixes for processing parallel corpora

2.65

  • 2012/11/09
  • enhanced support for processing of parallel corpora
  • FIX: sync() concordances if necessary before next operations

2.64

  • 2012/11/08
  • NGram API changes
  • FIX: genngr failing to process corpora over 2G

2.63

  • 2012/08/31
  • FIX: estimating word sketch multiword collocations positions

2.62.1

  • 2012/08/30
  • FIX: allow LEXICONSIZE to increase memory usage

2.62

  • 2012/08/27
  • encodevert accepts -d to prevent compiling dynamic attributes
  • FIX: filling default value for attributes of TYPE “UNIQUE”
  • FIX: mkdynattr takes LEXICONSIZE from corpus configuration

2.61

  • 2012/08/17
  • support for asynchronous multi-threaded concordance computations
  • FIX: setting default attribute when querying parallel corpora

2.60.1

  • 2012/07/18
  • FIX: fix race conditions in parallel computation of sketches with *TRINARY gramrels involved
  • parws can check gramrel lexicon consistency

2.60

  • 2012/07/10
  • support labels in the second argument (right-hand side) of within/containing, e.g. (containing 1:[] 2:[]) & 1.tag=2.tag
  • FIX: build with ruby 1.9

2.59.1

  • 2012/10/24
  • bugfix release for the stable branch
  • FIX: build with ruby 1.9
  • parws can check gramrel lexicon consistency
  • FIX: fix race conditions in parallel computation of sketches with *TRINARY gramrels involved
  • FIX: fix filling default value in unique attribute
  • parws supports Python >= 2.4
  • documentation included in the distribution tarball
  • FIX: CQL: fix default attr setting for parallel corpus
  • FIX: fix static build with finlib
  • FIX: fix overflow on appending to a .text file larger than 4 GB (2^32 bytes)
  • FIX: finlib: fix build with gcc down to 4.1.2 at least

2.59

  • 2012/06/29
  • new internal function for dynamic attributes “getlastn” for extracting last n characters
  • WMap support for access to the dictionary created by *COLLOC directives

2.58

  • 2012/06/25
  • compatibility with ANTLR 3.4 C runtime
  • hashws support for subcorpora
  • more verbose output of encodevert by default
  • FIX: closing structures at the end of compilation

2.57

  • 2012/06/08
  • WMAP support for collocation index operations incl. COLLOC directives

2.56

  • 2012/06/06
  • added fixcorp script for fixing corrupted indices
  • support for extracting terms lexicon of word sketches

2.55

  • 2012/05/29
  • support filtering multiword sketches by gramrels

2.54.1

  • 2012/04/30
  • FIX: minor fixes for nested structures

2.54

  • 2012/04/20
  • faster evaluation of non-regex matching using == and !== operators
  • FIX: utf8 lowercasing may have failed under specific circumstances
  • FIX: dynamic attributes are cleared before recompilation

2.53

  • 2012/04/16
  • enhanced frequency distribution of nested structures

2.52

  • 2012/04/05
  • maximum allowed nested structures set to 100

2.51

  • 2012/03/14
  • requires finlib >= 2.17
  • support for handling of unique attributes

2.50

  • 2012/03/05
  • requires finlib >= 2.16
  • first support for multiword sketches

2.49

  • 2012/02/29
  • FIX: fix mishandling default encoding value in wmap API
  • support extracting terms from word sketches in API

2.48

  • 2012/02/22
  • requires finlib >= 2.15
  • support for attribute values occurring more than 4G (2^32) times
  • support for extracting terms from word sketches

2.47.1

  • 2012/02/18
  • FIX: fix encodevert segfaulting when run with -x

2.47

  • 2012/02/08
  • requires finlib >= 2.14
  • support for lexicon size up to 4G (2^32 bytes)
  • FIX: concordance first-letter pagination in case of multibyte characters
  • FIX: mksubc does not fail on invalid attributes and empty subcorpora

2.46.1

  • 2012/02/01
  • FIX: case-insensitive frequency distribution of utf8 corpora
  • FIX: do yet more tolerant Unicode conversion failure handling

2.46

  • 2012/01/25
  • added indices of lexicon by sorted frequency
  • FIX: encodevert handles absent structure attributes properly
  • FIX: subcorpora contained first document range duplicated under specific circumstances

2.45.2

  • 2011/12/08
  • FIX: parallelization of sketches with m4 definitions or dual gramrels
  • FIX: mkwmap correctly handles empty streams when joining, does not write zero counts

2.45.1

  • 2011/10/20
  • FIX: do more tolerant Unicode conversion failure handling

2.45

  • 2011/10/07
  • requires finlib >= 2.13
  • more descriptive CQL error messages
  • support for Unicode input/output using manatee.setEncoding()
  • automatic memory handling of Python objects
  • encodevert, genws and mkwmap logs timestamp with each message
  • prevent writing structures overflowing 32bit integer
  • 32to64.py correctly handles multiple overflows and overflows between begin and end
  • parallel computation of word sketches

2.44.1

  • 2011/09/17
  • FIX mkwmap: fixed join phase if partial join is bigger than 4GB

2.44

  • 2011/09/13
  • MAXDETAIL defaults to MAXCONTEXT if not set in the configuration file

2.43

  • 2011/09/09
  • MAXCONTEXT set to 100 by default

2.42.1

  • 2011/09/07
  • FIX: CQL evaluation in case concatenation subquery is empty

2.42

  • 2011/08/31
  • mksubc prints progress on standard output
  • mksubc does not fail if DOCSTRUCTURE does not exist

2.41

  • 2011/08/05
  • compilecorp automatically runs mknorms to perform proper normalization per structure attribute
  • mknorms support corpora over 2G

2.40.2

  • 2011/08/04
  • requires finlib >= 2.12.4
  • fix ordering of nested structures in concordance

2.40.1

  • 2011/07/29
  • FIX: extending concordance KWIC fixed for |kwic|>1 or KWIC interleaved with colloc

2.40

  • 2011/07/28
  • intelligent autodetection of attribute locale

2.39

  • 2011/06/28
  • support for excluding KWIC from collocations
  • FIX: CQL evaluation: [attr="non-existing"]? [attr="existing"] returned empty result instead of “existing” occurrences
  • FIX: mksubc command failed to compute document frequencies on new subcorpus

2.38.2

  • 2011/06/10
  • FIX: encodevert support for memory-only corpora over 2GB

2.38.1

  • 2011/06/02
  • FIX: frequency distribution failing if case-insensitive/retrograde

2.38

  • 2011/05/12
  • CQL allows ‘’ and ‘’ for matching N-th struct
  • corpquery can sort results using GDEX and set default attribute
  • improved display of concordance reference
  • support for storing corpora over 2GB in memory only
  • FIX: UTF-8 character counting and lower-casing

2.37.1

  • 2011/05/05
  • FIX: count collocations only once per context

2.37

  • 2011/04/30
  • maximum nesting of structures limited to 10 by default

2.36.1

  • 2011/04/21
  • FIX: fix encodevert warning on nested structures printing corpus position instead of file line

2.36

  • 2011/04/06
  • added parse2wmap for creating sketches from dependency input
  • fixed dirty cache after rebuilding sketches
  • fixed multiple memory leaks in SWIG API
  • fixed mkvirt failing if corpus directory is missing
  • changed default MANATEE_REGISTRY to /corpora/registry
  • mksubc needs much less memory

2.35

  • 2011/03/15
  • fix locating of nested structures
  • support attribute-based pagination of concordances
  • prevent collisions of wmap and manatee in SWIG API
  • faster docf computation implemented in c++
  • support for virtual corpora

2.34.1

  • 2011/03/13
  • faster docf computation (ca. 20 x)
  • show Manatee exception messages in Python

2.34

  • 2011/03/05
  • requires finlib >= 2.12
  • compilecorp support for creating subcorpora
  • encodevert automatically closes too many nested structures
  • mksubc computes frequency in documents into .docf files
  • changed format of word sketch .rev file — added support for collocations
  • export exceptions into SWIG API
  • regexp2ids takes an optional filter pattern argument

2.33.2

  • 2011/02/28
  • FIX: compilecorp computes sizes for corpora without structures
  • FIX: encodevert creates data dir with mode 755 instead of 751

2.33.1

  • 2011/01/20
  • FIX: ngrsave: added NGRAM_SIZE and IGNORE_PUNC parameters

2.33

  • 2011/01/11
  • compilecorp precomputes file with token, word, doc, paragraph and sentence counts

2.32.2

  • 2010/11/24
  • FIX: encodevert looping on input containing NULL byte

2.32.1

  • 2010/10/31
  • FIX: “STRUCTLIMIT s” generates instead of deprecated

2.32

    • 2010/10/27
    • requires finlib >= 2.11
    • New Features:
    • enhanced corpquery script which makes it possible to specify (via command-line options) the reference attribute, context, a limit on the number of results, and the structures and attributes to be printed
    • new parse2wmap tool for generating sketches (data for wmap) from a positional attribute
    • ngrsave can now print document IDs of duplicate n-grams instead of n-grams and number of documents
    • after the compilation, compilecorp checks for temporary files that indicate an error
    • enhancements to the CQL:
    • new “==” and “!==” operators that perform a match against a fixed string (i.e. not a regular expression)
    • Note that, with the two exceptions of "\"" and "\\", no expansions are performed on the string.
    • Examples:
    • ".", "$", "~" match a single dot, dollar sign and tilde, respectively,
    • "\n" matches a backslash followed by the character n,
    • "\"\\" matches a double-quote character followed by a single backslash
    • a meet/union query can occur at any position in the query and they are not introduced by the “MU” keyword, which is deprecated and raises an error
    • the old within syntax had already been deprecated (in favor of the consistent within syntax) and now raises an error as well
    • support for inequality matching using new operators: “<=”, “!<=”, “>=”, “!>=”. The comparison on a string is performed in a way that compares numeric parts numerically and alphabetical parts alphabetically. Examples:

  • [word>="cake"] matches “cake” as well as “came”,
  • matches e.g. 145UA01, 143UA01, 145TA00 etc.
  • meet/union queries can use numeric labels and be subject to global conditions as any other query parts — e.g. (meet 1:[] 2:[]) & 1.tag = 2.tag;
  • a frequency function (denoted simply as f) can be used as part of the query together with numeric labels — e.g. 1:[] & f(1.word) >= 1000;
  • Bugfixes:
  • encodevert -v works again
  • encodevert can again read piped input data (“| ” in VERTICAL in corpus configuration file)
  • CQL queries using parallel corpora notation work again
  • UTF-8 support in regular expressions
  • encodevert doesn’t crash if no attributes are given in the configuration file or on the command line
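
The inequality comparison described in the 2.32 notes above (numeric parts compared numerically, alphabetical parts alphabetically) can be sketched in Python as follows. This is only an illustration of the idea: the helper names split_key and greater_or_equal are hypothetical, and details such as how Manatee orders a numeric part against an alphabetical part are assumptions.

```python
import re


def split_key(value):
    """Split a string into runs of ASCII digits and runs of other characters.

    Digit runs become (0, int) and other runs become (1, str), so that
    an integer is never compared directly with a string. Sorting numbers
    before text is an assumption of this sketch.
    """
    chunks = re.findall(r"[0-9]+|[^0-9]+", value)
    return [(0, int(c)) if "0" <= c[0] <= "9" else (1, c) for c in chunks]


def greater_or_equal(value, bound):
    """Emulate a ">=" test using the mixed numeric/alphabetical ordering."""
    return split_key(value) >= split_key(bound)


if __name__ == "__main__":
    print(greater_or_equal("came", "cake"))     # True
    print(greater_or_equal("cab", "cake"))      # False
    print(greater_or_equal("item10", "item9"))  # True
    print("item10" >= "item9")                  # False
```

The last two lines show why the numeric parts matter: as plain strings "item10" sorts before "item9", while the chunked comparison treats 10 as greater than 9.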

2.31.3

  • 2010/10/27
  • FIX: Computing frequency distribution of multivalue attributes
  • FIX: encodevert warns if there are open structures at the end of the compilation — this always indicates an error and in case of nested structures leads to significant performance loss.

2.31.2

  • 2010/08/04
  • FIX: compilecorp fails because of genhist.py which should be genhist
  • FIX: strip spaces in all attribute values
  • FIX: make dist* targets work again

2.31.1

  • 2010/04/26
  • FIX: crash when MANATEE_REGISTRY="" or config path is a directory

2.31

  • 2010/04/23
  • requires finlib >= 2.10
  • New Features:
  • support for nested structures
  • Bugfixes:
  • fixed displaying of empty collocations

2.30

  • 2010/04/15
  • New Features:
  • “===NONE===” used as attribute default DEFAULTVALUE
  • Bugfixes:
  • fixed displaying concordance with empty nodes

2.29.1

  • 2010/04/10
  • FIX: typo in CQL parser causing the build to fail with C locale

2.29

  • 2010/04/07
  • New Features:
  • compilecorp script for complex handling of corpus and sketch compilation
  • Bugfixes:
  • unfinished corpus data reports size 0, does not crash

2.28.1

  • 2010/03/11
  • FIX: encodevert limits its memory usage to available physical memory

2.28

  • 2010/01/19
  • requires ANTLR3.2 or higher
  • New Features:
  • allow ${attribute} substitution in DISPLAYBEGIN/DISPLAYEND
  • CQL enhancements:
  • support for “ within ”
  • “containing” as dual option to “within”
  • enable meet/union query after within/containing
  • support for “within NUMBER”
  • Bugfixes:
  • fixed mkwmrank on empty wmaps

2.27

  • 2010/01/11
  • New Features:
  • gcc 4.3 and 4.4 compatibility
  • ANTLR 2.7.2 compatibility
  • Python API scripts now part of the distribution

2.14

  • corpus size more than 2 billion tokens

1.99

  • bug fixes in query evaluation, build

1.94

  • first public version
