• T-score [ statistics ]

    T-score expresses the certainty with which we can argue that there is an association between the words, i.e. their co-occurrence is not random. The value is affected by the frequency of the whole collocation, which is why very frequent word combinations tend to reach a high T-score despite not being significant collocations. (more…)
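    The effect of frequency on the score can be seen in a small calculation. The sketch below assumes the commonly cited formula (f_AB - f_A * f_B / N) / sqrt(f_AB), where f_AB is the frequency of the collocation, f_A and f_B are the frequencies of the two words and N is the corpus size. The function name and all counts are hypothetical.

```python
import math


def t_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """T-score under the assumed formula (f_ab - f_a * f_b / n) / sqrt(f_ab)."""
    if f_ab <= 0:
        raise ValueError("the collocation must occur at least once")
    expected = f_a * f_b / n  # co-occurrences expected if the words were independent
    return (f_ab - expected) / math.sqrt(f_ab)


# Hypothetical counts in a corpus of 100 million tokens.
frequent_pair = t_score(f_ab=10_000, f_a=500_000, f_b=200_000, n=100_000_000)
rare_pair = t_score(f_ab=25, f_a=40, f_b=60, n=100_000_000)
```

    With these made-up counts, the very frequent pair scores far higher than the rare pair, even though the rare pair accounts for most occurrences of both of its words.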
  • tag [ attribute ]

    (also called part-of-speech tag, POS tag or morphological tag) is a label assigned to each token in an annotated corpus to indicate the part of speech and grammatical category. The tool used to annotate a corpus is called a tagger. A collection of tags used in a corpus is called a tagset. The most frequently used tags in a corpus are listed on the corpus information page with a link to the complete tagset. Our blog post on POS tags explains how they work.
  • tagset

    (also called tag set) is a list of part-of-speech tags used in one corpus. In Sketch Engine, corpora in the same language tend to use the same tagset, but exceptions exist. To check the tagset used, access Corpus statistics and details. See our blog about POS tags.
  • TBL

    application in Sketch Engine for collecting usage-example sentences to build dictionaries. Find more on the Tick Box Lexicography page.
  • term

    Term is a concept used in connection with the Keywords & Terms tool. A term is a multi-word expression (consisting of several tokens) which appears more frequently in one corpus (focus corpus) than in another corpus (reference corpus) and which, at the same time, has the format of a term in the language. (more…)
  • term base

    In connection with CAT tools, a term base is a database of subject-specific terminology and other lexical items which need to be translated consistently. The CAT tool uses the term base to check the consistency of translation, to look for untranslated segments, and to suggest (or automatically supply) translations of the terms from the database.
  • term extraction

    the process of identifying subject-specific vocabulary in a subject-specific text, usually using specialized software. The identification of one-word and multi-word terms in Sketch Engine is based on comparing the frequency of such words and phrases in the focus corpus with their frequency in the reference corpus. Compare: keywords. Related topics: term extraction explained (blog), term grammar, reference corpus, focus corpus.
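    A frequency comparison of this kind can be sketched as follows. The smoothed ratio of normalized frequencies used here is an assumption for illustration and is not confirmed by this glossary; the function names, the smoothing value and all counts are hypothetical.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(focus_count: int, focus_size: int,
            ref_count: int, ref_size: int,
            smoothing: float = 1.0) -> float:
    """Assumed score: ratio of smoothed per-million frequencies (focus / reference)."""
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    return (fpm_focus + smoothing) / (fpm_ref + smoothing)


# Hypothetical counts: a 1-million-token focus corpus, a 100-million-token reference corpus.
specialized = keyness(focus_count=50, focus_size=1_000_000,
                      ref_count=100, ref_size=100_000_000)
function_word = keyness(focus_count=60_000, focus_size=1_000_000,
                        ref_count=6_000_000, ref_size=100_000_000)
```

    An item that is equally common in both corpora scores about 1, while an item that is concentrated in the focus corpus scores well above 1 and is a candidate term.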
  • term grammar

    A term grammar is a set of rules written in CQL which define the lexical structures, typically noun phrases, which should be included in term extraction. The lexical structures are defined using POS tags and CQL. The use of a term grammar ensures a clean term extraction result which requires very little post-editing. (more…)
  • text analysis [ text-analysis ]

    Text analysis (also content analysis or text analytics) is a method for analyzing (usually unstructured) text in order to extract information. The result of the text analysis is structured data. In addition to the traditional tools, Sketch Engine also offers some unique features. The traditional tools consist of various frequency-based statistics:
    • word or lemma frequency, part-of-speech frequency via the wordlist tool
    • bigram, trigram, n-gram frequencies via the n-gram tool
    • absolute frequencies, relative frequencies, document frequencies, average reduced frequency (ARF)
    • phrase and multiword frequency via the concordance
    The tools and statistics can be combined depending on the task involved. See also other text analysis tools.
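    Some of the frequency-based statistics listed in this entry can be sketched in a few lines. This is a minimal illustration on a made-up two-document corpus that is already split into tokens; it does not reproduce how Sketch Engine computes these values, and average reduced frequency is left out.

```python
from collections import Counter

# Hypothetical corpus: two documents, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
]
tokens = [token for doc in docs for token in doc]

# Absolute frequency: raw number of occurrences of each word form.
absolute = Counter(tokens)

# Relative frequency: occurrences per million tokens.
relative_per_million = {
    word: count * 1_000_000 / len(tokens) for word, count in absolute.items()
}

# Document frequency: number of documents in which a word occurs at least once.
document_freq = Counter(word for doc in docs for word in set(doc))

# Bigram frequency, counted inside each document so pairs never span two documents.
bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
```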
  • text mining [ text-analysis ]

    Text mining is an automatic process of extracting information from text, such as keywords of a text or its source(s). The corresponding tools in Sketch Engine are WebBootCaT, for creating corpora from the web, and keyword and term extraction, which finds terminology in your texts. Read about other text analysis tools.
  • text type

    [We follow Biber (1989) in using text type as a generic term for the many ways in which a text might be classified.] A text type refers to values assigned to structures (e.g. documents, paragraphs, sentences or others) inside a corpus. Text types can refer to the source (newspaper, book, etc.), medium (spoken, written), time (year, century), or any other type of information about the text. (more…)
  • text type selector

    Any search in Sketch Engine can be limited to certain text types only. The results will be taken from documents annotated with the specific text type(s). Users can include metadata in their corpora. If the metadata are in the required format, they will be converted to text types and will appear in the text type selector. The text type selector can be found either in the BASIC tab (concordance) or in the ADVANCED tab (wordlist, thesaurus, word sketches, ...). Read more about text types.
  • timeline

    The timeline function displays the changing frequency of a word or phrase over time. Timelines are not a standalone tool; they are included in the Concordance and Wordlist tools. Timelines are computed in the same way as the graphs in Trends (a diachronic analysis of word usage). However, they can be generated for any word or even a multi-word phrase, and the graph displays more details. See also: Timeline - language use over time; Trends.
  • TMX – Translation Memory eXchange format

    Translation Memory eXchange (TMX) is a specific XML format used for creating parallel corpora in Sketch Engine. It is the standard format for translation memories (TM). See more about Setting up parallel corpora in Sketch Engine. (more…)
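    To give an idea of what such a file looks like, the sketch below parses a tiny, made-up TMX document with Python's standard library and collects the aligned segments per language. The sample content and the function name read_tmx are hypothetical, and real TMX files may contain inline markup and other elements that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical minimal TMX document with one translation unit.
TMX_SAMPLE = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none"
          creationtool="example" creationtoolversion="1.0"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="cs"><seg>Dobré ráno.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(text: str) -> list[dict[str, str]]:
    """Return one dict per translation unit, mapping language code to segment text."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also picks up text nested inside inline elements.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext()) if seg is not None else ""
        units.append(unit)
    return units


pairs = read_tmx(TMX_SAMPLE)
```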
  • token

    A token is the smallest unit that a corpus consists of. A token normally refers to:
    • a word form: going, trees, Mary, twenty-five
    • punctuation: comma, dot, question mark, quotes…
    • number: 50,000…
    • abbreviations, product names: 3M, i600, XP, FB…
    • anything else between spaces
    (more…)
  • tokenization

    Tokenization is the automatic process of separating text into tokens. This process is performed by tools called tokenizers.
  • tokenizer

    A tokenizer is a tool (software) used for dividing text into tokens. A tokenizer is language specific and takes into account the peculiarities of the language, e.g. don't in English is tokenized as two tokens. (more…)
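    The idea can be illustrated with a deliberately simplified, English-only tokenizer built on a regular expression. It is not the tokenizer used in Sketch Engine; the pattern and the function name are assumptions chosen to reproduce the don't example and a few of the token kinds named in the token entry.

```python
import re

TOKEN_RE = re.compile(
    r"""
      \w+(?=n't\b)       # word stem in front of the clitic n't
    | n't\b              # the clitic n't as a token of its own
    | \d+(?:,\d+)+       # numbers with thousands separators
    | \w+(?:-\w+)*       # words, hyphenated words, names such as 3M
    | [^\w\s]            # any single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Split text into tokens with a simplified English-only rule set."""
    return TOKEN_RE.findall(text)


tokens = tokenize("I don't know, it costs 50,000.")
```

    Here don't comes out as the two tokens do and n't, while 50,000 stays a single token. A tokenizer for another language would need different rules.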
  • translation memory

    A translation memory is a database inside a CAT tool which holds segments of text translated in the past. The CAT tool can suggest (or automatically supply) translations based on matching text from the translation memory.
  • trends

    Trends is a feature used for diachronic analysis, i.e. for identifying how the frequency of a word (or another attribute) changes over time. (more…)
  • Type/token ratio (TTR)

    The type/token ratio, often shortened to TTR, is a simple measure of lexical diversity. It can only be interpreted by comparing it to the TTR of a different text (corpus). The corpus with the higher TTR contains a greater variety of words than the other corpus. In other words, its authors use a wider range of words, or a richer vocabulary, than the authors of the texts in the other corpus. (more…)
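    A minimal sketch of the calculation, assuming that a type is a distinct word form and that the text is already tokenized. The two sample texts are made up and deliberately have the same length, because TTR tends to fall as a text gets longer, so comparing texts of very different sizes can mislead.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct word forms (types) divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty text")
    return len(set(tokens)) / len(tokens)


# Two hypothetical texts of 10 tokens each.
text_a = "the cat sat on the mat and the cat slept".split()
text_b = "a quick brown fox jumps over one lazy sleeping dog".split()

ttr_a = type_token_ratio(text_a)  # 7 types / 10 tokens
ttr_b = type_token_ratio(text_b)  # 10 types / 10 tokens
```

    The second text has the higher TTR, so it uses the more varied vocabulary of the two.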