Sketch Engine is a corpus manager and analysis software developed by Lexical Computing since 2003. This software consists of three main components, which enable searching and building text corpora.
Bonito – a graphical user interface to the maintained corpora; see the changelog of Bonito
Manatee – a corpus management tool that handles corpus building and indexing, fast querying, and basic statistical measures
FinLib – a fast indexing library; see the changelog of FinLib
A brief overview of the main changes in Manatee is listed here.
Current stable version: 2.233 (as of November 2024)
2.156.6
- FinLib incorporated into Manatee
2.152.1
- do not parallelize corpus operations by default
2.152
- implement parallel corpus indexing
- improve parallel word sketch handling
2.151.5
- fix Concordance::delete_subparts()
- virtual corpora fixes
2.151.4
- update mklcm
2.151.3
- ensure that corpus PATH is nonempty
- decodevert: structure attribute values escaping
- regexopt: fix support for bracket literals
- compilecorp: use one processor by default
2.151.2
- fix queries containing ‘containing’
2.151.1
- cql: support {,N} and {N,} quantifiers
- remove skip_dupctx parameter for KWICLines
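The open-ended quantifiers added in 2.151.1 presumably carry the usual regular-expression meaning: {,N} for at most N repetitions and {N,} for at least N. The changelog does not spell out the CQL semantics, so the sketch below only illustrates that conventional meaning with Python's re module, which accepts the same two forms. It is an analogy on characters, not CQL evaluated over tokens, and the helper name is hypothetical.

```python
import re

# {,2}: zero to two repetitions of the group
at_most_two = re.compile(r"(?:ab){,2}")
# {2,}: two or more repetitions of the group
at_least_two = re.compile(r"(?:ab){2,}")


def matches(pattern, text):
    """Return True when the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None
```

Under this reading, a query part such as []{,3} would stand for zero to three arbitrary tokens.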
2.151
- implement skip_dupctx parameter for KWICLines
2.150.4
- remove C++11 features
2.150.3
- fix a few memory leaks
2.150.2
- quality improvements
2.150.1
- do not virtualize sketches when some segments are not complete corpora
2.150
- genngr: skip over default and empty attribute values
- mksubc: urlencode names of subcorpora
2.149.3
- quality improvements
2.149.2
- quality improvements
2.149.1
- cql: do not generate errors that are not valid utf-8
2.149
- corpconf: remove support for escape sequences
2.148
- corpconf: restrict support for escape sequences
- cql: allow @ in attribute names
2.147
- corpconf: only support escapes in double-quoted strings
2.146.6
- corpconf: implement escapes in string literals
- cql: fix sketch queries
- regex optimization: fix the behavior of ‘+’
2.146.5
- cql: enable NoSketchEngine support
2.146.4
- fix FilteredWMap::poss() skipping duplicate positions
2.146.3
- fix for large concordances and WMaps
2.146.2
- cql: support large parameters to ws() and thes()
2.146.1
- various regex optimization fixes
2.146
- support zero-element word sketch files
2.145
- cql: report error position
- genws: support MULTIVALUE for collocations
- fix ENCODING for structure attributes
2.144.1
- cql: fix ONEPOS queries
2.144
- update regex optimization rules
- speed up corpquery -n
2.143.4
- 2016/12/12
- cql: support for multilevel wmap seek
2.142
- 2016/11/23
- cql: parse ‘seek’ in ‘ws|term(level, seek)’ as a number
- add NEWS for manatee to shut up autotools
- manatee: implemented query evaluation in yacc
2.141
- 2016/11/03
- corpquery accepts subcorpus via -u
- added default locale li_NL for Limburgish
- FinLib 2.36.2
2.140.2
- 2016/10/21
- extrms simple math N parameter can be float
- finlib 2.36.1
2.140.1
- 2016/10/19
- decodevert: print end structures in reverse order
2.140
- 2016/10/13
- encodevert: check minimum bucket size for attribute memory
- FinLib 2.36
2.139.3
- 2016/08/27
- wm2thes: accept CORPNAME argument also without -m
- compilecorp: use virtws for virtual corpora sketches
2.138.4
- 2016/08/13
- compilecorp: use mklcm-go
- biterms: made ca 4x faster
2.138.3
- 2016/08/11
- biterms: use new WMap interface
2.137.3
- 2016/07/14
- added multiword thesaurus computation
- reformat wm2thes.cc
- implemented virtual sketches, updated interface to WMap
- added virtws for compilation of sketches on virtual corpora
- added WMap::seppage() to export SEPARATEPAGE number
- mkalign: print line number on aligndef file format error
2.137
- 2016/05/20
- mktrends: allow the SUBCORP argument to be empty
- compilecorp: ALIGNDEF supports pipes like VERTICAL does
- faster mktrends
- manatee: mklcm in go
- compilecorp: support for WSOLDSCORES
2.136
- 2016/03/31
- encodevert: call mknormattr according to MAPTO directive
- added support for normalization attribute
- ANTLR CQL grammar supports description definition
2.135.5
- 2016/02/28
- tstquery: added queries on parallel corpora
- tstquery: print executed queries
- do not label aligned corpus query in WITHIN!/!WITHIN queries
2.135.4
- 2016/02/21
- compilecorp: always move logfile into corpus path directory
- compilecorp: improved error reporting to indicate actual line numbers
2.135
- 2016/01/30
- encodevert: better manipulation with lexicon added items cache
2.134
- 2016/01/20
- encodevert: dynamic lexicons cache sizes
- reformat mkwmrank.cc
- added bgr_abs_freq_coll association score
- returns frequency of the first word of the collocation pair
2.133.4
- 2015/12/12
- mktrends: finalize output files properly
2.133.3
- 2015/12/10
- corpcheck: tolerate local path in INFOHREF
2.133.3
- 2015/12/10
- mktrends: finalize output files properly
2.133.2
- 2015/12/07
- fix handling of aligned corpora labels in Concordance
2.133.1
- 2015/12/03
- KWICLines skip aligned corpora collocations
2.133
- 2015/12/02
- CQL: added support to term queries using term() operator
- compilecorp: added --no-ske option, which is the default for NoSkE
2.132.1
- 2015/11/30
- tstregexopt: takes attribute as another optional argument
2.132
- 2015/11/24
- speed up RQinNode and RQcontainNode
2.131.3
- 2015/11/24
- mknorms: speed up computation for subcorpora
2.131
- 2015/11/12
- removed findPosAttr() functions
- reformat corpinfo.cc
2.130.6
- 2015/11/12
- fix !WITHIN
2.130.5
- 2015/11/08
- compilecorp: call mktrends with EPOCH_LIMIT being 1
- fix MAXKWIC being 0 not meaning unlimited MAXKWIC
2.130.3
- 2015/11/04
- mktrends, save subcorp data properly
2.130.2
- 2015/10/31
- added NonEmptyRS for filtering empty RangeStream ranges
2.130
- 2015/10/25
- KWICLines has new method is_defined() and short-circuits processing of undefined lines
- added Concordance::filter_aligned() for filtering by aligned corpus
2.129
- 2015/09/21
- mktrends: speed up ca 15x by more usage of numpy
2.128.4
- 2015/09/10
- updated CQL testsuite with current WS results on susanne
2.127
- 2015/08/04
- compilecorp: added support for longest commonest match
2.126
- 2015/07/28
- compilecorp: added support for trends computations
- added mktrends script prepared by Ondřej Herman
2.125.2
- 2015/07/20
- mkwmrank: computing scores for each gramrel is independent of other gramrels
2.124
- 2015/05/02
- concordance automatically detects all collocations
2.122
- 2015/04/19
- CQL supports general NOT (!) in sequences as complement operator
- Bugfixes:
- fix CQL inequality comparisons on dynamic attributes
2.121.2
- 2015/04/08
- disable MULTIVALUE freqdist for positional attributes
2.121
- 2015/04/03
- mkdynattr: no need to manually delete lexicon with new write_lexicon
- added new DYNTYPE “freq” for dynamic attributes
- compilecorp and parws pass WSMINHITS to mkwmap
- mkwmap: added all options to usage
- mkwmap: added -f option allowing filtering for minimum frequency
- write_lexicon allows overwriting datafiles
- compilecorp: hashws terms automatically
- compilecorp: write manatee version to log
- Bugfixes:
- fix empty KWICLines structure context for empty KWIC
2.120.1
- 2015/03/29
- Bugfixes:
- genws: fix SEPARATEPAGE index for grammars using DUAL
2.120
- 2015/03/28
- freqs: allow filtering by subcorpus
- new freq_dist() attribute modifier “/n” for getting IDs instead of strings
- Bugfixes:
- fix regexp2ids/regexp2poss for patterns with escaped metacharacters
- compilecorp: ‘skipping biterms’ message fixed
2.119
- 2015/03/23
- genngr: allow setting min and max n-gram length from cmdline
- genngr: limit maximum n-gram length to 30 by default
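As a rough illustration of what minimum and maximum n-gram length bounds mean, the sketch below enumerates every n-gram of a token list within the bounds. It is not genngr's implementation: the function name is hypothetical, and the default minimum of 2 is an assumption (only the default maximum of 30 comes from the entry above).

```python
def ngrams(tokens, min_n=2, max_n=30):
    """Yield every n-gram of tokens with min_n <= n <= max_n.

    min_n=2 is an assumed default; max_n=30 mirrors the changelog entry.
    """
    for start in range(len(tokens)):
        for n in range(min_n, max_n + 1):
            if start + n > len(tokens):
                break  # longer n-grams would run past the end
            yield tuple(tokens[start:start + n])
```

Capping the maximum length keeps the number of candidates per starting position bounded, which is presumably why a default limit was introduced.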
2.118
- 2015/03/22
- Bugfixes:
- fix build with gcc 4.4 (RHEL/CentOS 6)
- fix ConcStream::find_beg()/find_end()
2.117
- 2015/02/24
- create_subcorpus() takes an optional Structure argument
2.116
- 2015/02/23
- dumpalign supports 1:1
2.115.3
- 2015/02/23
- mkwmrank: fix segfault when datafiles cannot be open
- updated package specfiles to contain lsalsize
2.115.2
- 2015/02/10
- updated tstquery gold results after word sketch format change
- compilecorp: compute sizes after alignment
- added lsalsize binary for listing alignment size of two corpora
- mksizes: use lsalsize to compute alignment size
- Bugfixes:
- fix showing GDEX scores when references are up
- Fix GDEX score display in concordance view
- manatee: fix installing binaries on DEB
- corpquery: fix parallel queries garbled by fake collocates
2.115.1
- 2015/02/10
- manatee: script for bilingual term extraction
2.115
- 2015/01/21
- CorpInfo may be modified and is exported into SWIG API
- added dumpalign script for dumping parallel corpora
2.114
- 2015/01/18
- CQL supports regular expressions in word sketch gramrels
- added regexp2ids() for word sketch gramrels
- added mklex for creating lexicons
2.113
- 2015/01/14
- mkwmrank: added parameter for commonest match input
- WSATTR defaults to lempos_lc -> lempos -> lemma_lc -> lemma -> DEFAULTATTR
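The WSATTR fallback chain above can be read as "take the first listed attribute that the corpus actually has, otherwise DEFAULTATTR". A minimal sketch of that reading follows; the function and variable names are hypothetical and this is not Manatee's code.

```python
# Preference order taken from the 2.113 changelog entry.
WSATTR_CANDIDATES = ["lempos_lc", "lempos", "lemma_lc", "lemma"]


def pick_wsattr(available_attrs, default_attr):
    """Return the first candidate present in the corpus, else DEFAULTATTR."""
    for name in WSATTR_CANDIDATES:
        if name in available_attrs:
            return name
    return default_attr
```

For a corpus with only word and lemma attributes, this sketch would select lemma.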
2.111.8
- 2014/11/23
- updated tstquery gold results after word sketch format change
- Bugfixes:
- genws: fix handling invalid STRUCTLIMIT
2.111.6
- 2014/11/17
- mkwmap works with empty input
- Bugfixes:
- skell: fixed typo in jQuery
2.111.3
- 2014/10/21
- 2x faster commonest_match.py
2.110
- 2014/09/21
- added defaults for SIMPLEQUERY corpus directive; it is [A="%s" | B="%s"]
- CQL supports different attributes in global conditions
- CQL supports !within and !containing operators
- genws: STRUCTLIMIT may be arbitrary CQL query
- added mkregexattr for compiling regex dynamic attribute
- new version of word sketch data files
2.110
- 2014/08/25
- added jQuery UI javascript, css and images
- added create_subcorpus() for arbitrary CQL query
- create_subcorpus() takes directly RangeStream instead of query
- mksubc supports creating subcorpora from CQL query
- Bugfixes:
- fix parws lexicon verification for new style TRINARY templates
2.109.8
- 2014/08/13
- Bugfixes:
- fix build with gcc 4.4
2.109.7
- 2014/07/28
- parws: use single batch for TRINARY and COLLOC gramrels
- compilecorp honours TMPDIR environment variable
- Bugfixes:
- mkvirt: fix freqs computation overflowing at int size
- genngr: fix maximum allowed corpus size to 2^31-2
2.109.6
- 2014/07/01
- genws: set COLLOC lexicon hash size to 500k items
- printer icon shall be part of NoSkE
- Bugfixes:
- corpquery: fix marking KWIC in output
2.109.2
- 2014/06/18
- compilecorp does not assume “word” attribute existence
- corpquery does not assume “word” attribute
2.109
- 2014/06/16
- MAXKWIC restriction placed into Concordance
- Bugfixes:
- fixed a bug in selecting gramrels
2.108
- 2014/06/13
- added new dynamic function ascii for transliteration
- mkwmap reserves file descriptors for joined set of files
- Corpcheck checks if file “sizes” exists in PATH
- changed support mail
2.107
- 2014/04/16
- compilecorp support for bilingual dictionaries
- added MAXKWIC size for KWICLines, defaults to 100
2.106
- 2014/02/27
- added corpcheck utility for checking corpora sanity
- added wsdump script for dumping of word sketches
2.103
- 2014/02/09
- added sconll2sketch and sconll2wmap
- compilecorp support for sketches from (S)CONLL
2.97
- 2013/12/28
- mkdynattr: fix dynamic structure attributes of virtual corpora
- mkstats support for n-grams on subcorpora
2.96
- 2013/11/10
- added dumpthes — simple dumping of thesaurus
- CQL support for similarity search in thesaurus
2.95
- 2013/11/03
- added new dynamic function utf8capital
- added new dynamic function utf8uppercase
2.94
- 2013/11/01
- added new dynamic function getnbysep
- fix mkvirt failing if virtdef contains single corpus
2.92
- 2013/10/23
- encodevert compiles dynamic structure attributes
- support for complement subcorpora
2.87
- 2013/09/29
- faster implementation of frq and docf computation
- choose first non-dynamic attribute as default DEFAULTATTR
- mkvirt accepts attribute list via -a option
- added devirt script for corpus devirtualization
- added parencodevert script for parallel corpus encoding
- redesign of mksubc and (sub)corpora statistics creation
- corpus configuration file may not end with a new line
- faster computation of ARF + ALDF
2.86
- 2013/08/14
- full support for attributes of structures in virtual corpora
- genws reports progress with -p option
2.85
- 2013/08/07
- fix segfault when opening a virtual corpus with unavailable virtdef
- mkvirt automatically creates dynamic attributes
- virtdef file may contain ‘$’ for segment end being corpus end position
- fix corpinfo so that it dumps valid configuration file format
- added mksizes script for compiling sizes
- compilecorp support for creating word sketch hashes
2.84
- 2013/06/06
- compilecorp accepts --parallel=N option (number of parallel jobs)
- compilecorp support for virtual corpora
- mksubc writes detailed progress only with --debug
- added CQL for range of positions, e.g. #20-50
- CQL frequency function accepts values over 2^31
- implemented CQL for word sketch seeks
- added CQL support for querying word sketches by triples
- CQL supports new positional functions “swap” and “ccoll”
2.83.3
- 2013/06/05
- FIX: fix missing throw statements for create_subcorpus() in SWIG API
- FIX: fix evaluating empty concordance collocation
2.83.2
- 2013/05/26
- FIX: fix SEPARATEPAGE name being trimmed on first white space
- FIX: Fix mksubc compiling only the 1st subc in subcdef
2.83.1
- 2013/05/10
- FIX: collocation computation for window crossing beg/end of corpus
2.83
- 2013/05/10
- enable multiple subsequent shuffling
2.82
- 2013/04/20
- mksubc support for n-grams, may take .subc file, may take attribute list
2.81
- 2013/04/12
- added url2domain dynamic attribute
2.80.1
- 2013/04/03
- FIX: utf8_tolower failing for empty strings and unallocated buffer
2.80
- 2013/04/02
- faster sample generation
- ngrsave supports encoded corpus as input
2.79
- 2013/03/21
- added utf8getlastn() dynamic attribute function
- FIX: SEPARATEPAGE with DUAL TRINARY
2.78
- 2013/03/07
- Concordance exports corpus object into SWIG API
2.77
- 2013/03/06
- lscbr and ngrsave are more user friendly
2.76.1
- 2013/02/27
- FIX: building with gcc >= 4.7
2.76
- 2013/02/26
- added support for structures in virtual corpora
2.75
- 2013/02/24
- Frequency distribution does not need Concordance to be computed
2.74
- 2013/02/18
- support DUAL TRINARY word sketch grammatical relations
- added getfirstbysep internal function for dynamic attributes
- added Setswana locale settings
- added dumpwmrev for dumping ws delta rev files
2.73
- 2013/02/04
- requires finlib 2.21
- implemented exact KWIC matching in filtering
2.72
- 2013/01/29
- support for aligned segment contexts
2.71
- 2013/01/11
- genhist enhancements
2.70
- 2013/01/08
- compilecorp compiles subcorpora right after the main corpus
2.69
- 2012/12/10
- export Corpus::get_confpath() into SWIG API
2.68
- 2012/11/29
- parallel corpora API modifications
- FIX: a number of fixes for processing parallel corpora
2.67.2
- 2012/11/26
- FIX: a number of fixes for processing parallel corpora
2.67.1
- 2012/11/26
- FIX: set default ALIGNSTRUCT to “align”
2.67
- 2012/11/17
- compilecorp compiles alignment for parallel corpora
- added a number of helper scripts for processing alignment
- FIX: a number of fixes for processing parallel corpora
2.66
- 2012/11/15
- updated licensing information
- FIX: a number of fixes for processing parallel corpora
2.65
- 2012/11/09
- enhanced support for processing of parallel corpora
- FIX: sync() concordances if necessary before next operations
2.64
- 2012/11/08
- NGram API changes
- FIX: genngr failing to process corpora over 2G
2.63
- 2012/08/31
- FIX: estimating word sketch multiword collocations positions
2.62.1
- 2012/08/30
- FIX: allow LEXICONSIZE to increase memory usage
2.62
- 2012/08/27
- encodevert accepts -d to prevent compiling dynamic attributes
- FIX: filling default value for attributes of TYPE “UNIQUE”
- FIX: mkdynattr takes LEXICONSIZE from corpus configuration
2.61
- 2012/08/17
- support for asynchronous multi-threaded concordance computations
- FIX: setting default attribute when querying parallel corpora
2.60.1
- 2012/07/18
- FIX: fix race conditions in parallel computation of sketches with *TRINARY gramrels involved
- parws can check gramrel lexicon consistency
2.60
- 2012/07/10
- support labels in the second argument (right-hand side) of within/containing, e.g. (containing 1:[] 2:[]) & 1.tag=2.tag
- FIX: build with ruby 1.9
2.59.1
- 2012/10/24
- bugfix release for the stable branch
- FIX: build with ruby 1.9
- parws can check gramrel lexicon consistency
- FIX: fix race conditions in parallel computation of sketches with *TRINARY gramrels involved
- FIX: fix filling default value in unique attribute
- parws supports Python >= 2.4
- documentation included in the distribution tarball
- FIX: CQL: fix default attr setting for parallel corpus
- FIX: fix static build with finlib
- FIX: fix overflow on appending to a .text file larger than 4 GB (2^32 bytes)
- FIX: finlib: fix build with gcc down to 4.1.2 at least
2.59
- 2012/06/29
- new internal function for dynamic attributes “getlastn” for extracting last n characters
- WMap support for access to the dictionary created by *COLLOC directives
2.58
- 2012/06/25
- compatibility with ANTLR 3.4 C runtime
- hashws support for subcorpora
- more verbose output of encodevert by default
- FIX: closing structures at the end of compilation
2.57
- 2012/06/08
- WMAP support for collocation index operations incl. COLLOC directives
2.56
- 2012/06/06
- added fixcorp script for fixing corrupted indices
- support for extracting terms lexicon of word sketches
2.55
- 2012/05/29
- support filtering multiword sketches by gramrels
2.54.1
- 2012/04/30
- FIX: minor fixes for nested structures
2.54
- 2012/04/20
- faster evaluation of non-regex matching using == and !== operators
- FIX: utf8 lowercasing may have failed under specific circumstances
- FIX: dynamic attributes are cleared before recompilation
2.53
- 2012/04/16
- enhanced frequency distribution of nested structures
2.52
- 2012/04/05
- maximum allowed nested structures set to 100
2.51
- 2012/03/14
- requires finlib >= 2.17
- support for handling of unique attributes
2.50
- 2012/03/05
- requires finlib >= 2.16
- first support for multiword sketches
2.49
- 2012/02/29
- FIX: fix mishandling default encoding value in wmap API
- support extracting terms from word sketches in API
2.48
- 2012/02/22
- requires finlib >= 2.15
- support for attribute values occurring more than 4G (2^32) times
- support for extracting terms from word sketches
2.47.1
- 2012/02/18
- FIX: fix encodevert segfaulting when run with -x
2.47
- 2012/02/08
- requires finlib >= 2.14
- support for lexicon size up to 4G (2^32 bytes)
- FIX: concordance first-letter pagination in case of multibyte characters
- FIX: mksubc does not fail on invalid attributes and empty subcorpora
2.46.1
- 2012/02/01
- FIX: case-insensitive frequency distribution of utf8 corpora
- FIX: do yet more tolerant Unicode conversion failure handling
2.46
- 2012/01/25
- added indices of lexicon by sorted frequency
- FIX: encodevert handles absent structure attributes properly
- FIX: subcorpora contained first document range duplicated under specific circumstances
2.45.2
- 2011/12/08
- FIX: parallelization of sketches with m4 definitions or dual gramrels
- FIX: mkwmap correctly handles empty streams when joining, does not write zero counts
2.45.1
- 2011/10/20
- FIX: do more tolerant Unicode conversion failure handling
2.45
- 2011/10/07
- requires finlib >= 2.13
- more descriptive CQL error messages
- support for Unicode input/output using manatee.setEncoding()
- automatic memory handling of Python objects
- encodevert, genws and mkwmap logs timestamp with each message
- prevent writing structures overflowing 32bit integer
- 32to64.py correctly handles multiple overflows and overflows between begin and end
- parallel computation of word sketches
2.44.1
- 2011/09/17
- FIX mkwmap: fixed join phase if partial join is bigger than 4GB
2.44
- 2011/09/13
- MAXDETAIL defaults to MAXCONTEXT if not set in the configuration file
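Taken together with the 2.43 entry (MAXCONTEXT set to 100 by default), the fallback can be sketched as below. The dictionary-based configuration and the function name are assumptions made for illustration; Manatee reads these directives from the corpus configuration file, and later versions may use different defaults.

```python
def resolve_context_limits(config):
    """Resolve MAXCONTEXT and MAXDETAIL from a dict of corpus directives.

    MAXCONTEXT falls back to 100 (per the 2.43 entry); MAXDETAIL falls
    back to whatever MAXCONTEXT resolved to (per the 2.44 entry).
    """
    max_context = int(config.get("MAXCONTEXT", 100))
    max_detail = int(config.get("MAXDETAIL", max_context))
    return max_context, max_detail
```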
2.43
- 2011/09/09
- MAXCONTEXT set to 100 by default
2.42.1
- 2011/09/07
- FIX: CQL evaluation in case concatenation subquery is empty
2.42
- 2011/08/31
- mksubc prints progress on standard output
- mksubc does not fail if DOCSTRUCTURE does not exist
2.41
- 2011/08/05
- compilecorp automatically runs mknorms to perform proper normalization per structure attribute
- mknorms support corpora over 2G
2.40.2
- 2011/08/04
- requires finlib >= 2.12.4
- fix ordering of nested structures in concordance
2.40.1
- 2011/07/29
- FIX: extending concordance KWIC fixed for |kwic|>1 or KWIC interleaved with colloc
2.40
- 2011/07/28
- intelligent autodetection of attribute locale
2.39
- 2011/06/28
- support for excluding KWIC from collocations
- FIX: CQL evaluation: [attr="non-existing"]? [attr="existing"] returned empty result instead of "existing" occurrences
- FIX: mksubc command failed to compute document frequencies on new subcorpus
2.38.2
- 2011/06/10
- FIX: encodevert support for memory-only corpora over 2GB
2.38.1
- 2011/06/02
- FIX: frequency distribution failing if case-insensitive/retrograde
2.38
- 2011/05/12
- CQL allows ‘’ and ‘’ for matching N-th struct
- corpquery can sort results using GDEX and set default attribute
- improved display of concordance reference
- support for storing corpora over 2GB in memory only
- FIX: UTF-8 character counting and lower-casing
2.37.1
- 2011/05/05
- FIX: count collocations only once per context
2.37
- 2011/04/30
- maximum nesting of structures limited to 10 by default
2.36.1
- 2011/04/21
- FIX: fix encodevert warning on nested structures printing corpus position instead of file line
2.36
- 2011/04/06
- added parse2wmap for creating sketches from dependency input
- fixed dirty cache after rebuilding sketches
- fixed multiple memory leaks in SWIG API
- fixed mkvirt failing if corpus directory is missing
- changed default MANATEE_REGISTRY to /corpora/registry
- mksubc needs much less memory
2.35
- 2011/03/15
- fix locating of nested structures
- support attribute-based pagination of concordances
- prevent collisions of wmap and manatee in SWIG API
- faster docf computation implemented in c++
- support for virtual corpora
2.34.1
- 2011/03/13
- faster docf computation (ca. 20 x)
- show Manatee exception messages in Python
2.34
- 2011/03/05
- requires finlib >= 2.12
- compilecorp support for creating subcorpora
- encodevert automatically closes too many nested structures
- mksubc computes frequency in documents into .docf files
- changed format of word sketch .rev file — added support for collocations
- export exceptions into SWIG API
- regexp2ids takes optional filter pattern argument
2.33.2
- 2011/02/28
- FIX: compilecorp computes sizes for corpora without structures
- FIX: encodevert creates data dir with mode 755 instead of 751
2.33.1
- 2011/01/20
- FIX: ngrsave: added NGRAM_SIZE and IGNORE_PUNC parameters
2.33
- 2011/01/11
- compilecorp precomputes file with token, word, doc, paragraph and sentence counts
2.32.2
- 2010/11/24
- FIX: encodevert looping on input containing NULL byte
2.32.1
- 2010/10/31
- FIX: “STRUCTLIMIT s” generates instead of deprecated
2.32
- 2010/10/27
- requires finlib >= 2.11
- New Features:
- enhanced corpquery script, which makes it possible to specify (via command-line options) the reference attribute, context, limit on the number of results, and the structures and attributes to be printed
- new parse2wmap tool for generating sketches (data for wmap) from a positional attribute
- ngrsave can now print document IDs of duplicate n-grams instead of n-grams and number of documents
- after the compilation, compilecorp checks for temporary files that indicate an error
- enhancements to the CQL:
- new "==" and "!==" operators that perform a match against a fixed string (i.e. not a regular expression)
- Note that, with the two exceptions of "\"" and "\\", no expansions are performed on the string.
- Examples:
- ".", "$", "~" match a single dot, dollar sign and tilde, respectively,
- "\n" matches a backslash followed by the character n,
- "\"\\" matches a double-quote character followed by a single backslash
- a meet/union query can occur at any position in the query and they are not introduced by the “MU” keyword, which is deprecated and raises an error
- the old within syntax, already deprecated in favor of the consistent within form, now raises an error as well
- support for inequality matching using new operators: "<=", "!<=", ">=", "!>=". The comparison on a string is performed in a way that compares numeric parts numerically and alphabetical parts alphabetically. Examples:
- [word>="cake"] matches "cake" as well as "came",
- matches e.g. 145UA01, 143UA01, 145TA00 etc.
- meet/union queries can use numeric labels and be subject to global conditions as any other query parts — e.g. (meet 1:[] 2:[]) & 1.tag = 2.tag;
- a frequency function (denoted simply as f) can be used as part of the query together with numeric labels — e.g. 1:[] & f(1.word) >= 1000;
- Bugfixes:
- encodevert -v works again
- encodevert can again read piped input data (“| ” in VERTICAL in corpus configuration file)
- CQL queries using parallel corpora notation work again
- UTF-8 support in regular expressions
- encodevert doesn’t crash if no attributes are given in the configuration file nor on the command line
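The inequality operators introduced in 2.32 are described as comparing numeric parts numerically and alphabetical parts alphabetically. One way to picture such a comparison is a "natural order" sort key, sketched below. This is an illustration only, not Manatee's implementation: the function names are hypothetical, and ordering a numeric run before a text run when the two meet at the same position is an assumption.

```python
import re

# A string is split into runs of ASCII digits and runs of everything else.
_RUN = re.compile(r"([0-9]+)|([^0-9]+)")


def natural_key(value):
    """Build a sort key in which digit runs compare as integers."""
    key = []
    for match in _RUN.finditer(value):
        digits, text = match.groups()
        if digits is not None:
            key.append((0, int(digits), ""))  # numeric run
        else:
            key.append((1, 0, text))          # alphabetical run
    return key


def natural_ge(left, right):
    """Return True when left >= right under the natural ordering."""
    return natural_key(left) >= natural_key(right)
```

With this key, "came" compares as greater than or equal to "cake", the identifiers 143UA01 and 145TA00 both compare as less than or equal to 145UA01, and "10" sorts after "9", which plain string comparison would get wrong.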
2.31.3
- 2010/10/27
- FIX: Computing frequency distribution of multivalue attributes
- FIX: encodevert warns if there are open structures at the end of the compilation; this always indicates an error and, in the case of nested structures, leads to significant performance loss.
2.31.2
- 2010/08/04
- FIX: compilecorp fails because of genhist.py which should be genhist
- FIX: strip spaces in all attribute values
- FIX: make dist* targets work again
2.31.1
- 2010/04/26
- FIX: crash when MANATEE_REGISTRY=”” or config path is a directory
2.31
- 2010/04/23
- requires finlib >= 2.10
- New Features:
- support for nested structures
- Bugfixes:
- fixed displaying of empty collocations
2.30
- 2010/04/15
- New Features:
- “===NONE===” used as attribute default DEFAULTVALUE
- Bugfixes:
- fixed displaying concordance with empty nodes
2.29.1
- 2010/04/10
- FIX: typo in CQL parser causing the build to fail with C locale
2.29
- 2010/04/07
- New Features:
- compilecorp script for complex handling of corpus and sketch compilation
- Bugfixes:
- unfinished corpus data reports size 0, does not crash
2.28.1
- 2010/03/11
- FIX: encodevert limits its memory usage to available physical memory
2.28
- 2010/01/19
- requires ANTLR3.2 or higher
- New Features:
- allow ${attribute} substitution in DISPLAYBEGIN/DISPLAYEND
- CQL enhancements:
- support for “ within ”
- “containing” as dual option to “within”
- enable meet/union query after within/containing
- support for “within NUMBER”
- Bugfixes:
- fixed mkwmrank on empty wmaps
2.27
- 2010/01/11
- New Features:
- gcc 4.3 and 4.4 compatibility
- ANTLR 2.7.2 compatibility
- Python API scripts now part of the distribution
2.14
- corpus size more than 2 billion tokens
1.99
- bug fixes in query evaluation, build
1.94
- first public version