This page describes an advanced way of building subcorpora in Sketch Engine using a definition file. To learn about basic subcorpus building, please read Create a subcorpus.

The definition file is a text file which describes how one or more subcorpora should be created. Only one definition file is allowed per corpus, but the file can contain definitions of many subcorpora.

Basic syntax

The definition must start with ‘=’ followed by the name of the subcorpus. Nothing else should appear on that line.

The subcorpus definition follows on the next line. A subcorpus can be defined in two ways:

  • with text types
  • with CQL

Definition with text types

The second line should contain the name of a structure. The third line should contain the text types (metadata) attached to this structure. The definition shown below creates two subcorpora:

  • example1 is a subcorpus made up of documents whose publication year is 2012
  • example2 is a subcorpus of documents created from uploaded files whose filename starts with a capital K (the value is a regular expression)

The structures and text types attached to them can be checked on the corpus info page.

=example1
    doc
    pub_year="2012"

=example2
    doc
    filename="K.*"

Definition with CQL

A definition with CQL must have ‘-CQL-’ on line 2. Line 3 should contain a CQL query. The subcorpus will contain what would appear as KWIC if the query were used in the concordance. No other context will be included unless the CQL query explicitly includes it, so the containing operator is often needed. The definition shown below creates four subcorpora:

  • example4 – all sequences of 2 nouns will be included, the subcorpus will only contain nouns, nothing else. This is probably not very useful.
  • example5 – all sentences containing a sequence of 2 nouns will be included in the subcorpus.
  • example6 – all documents containing a sequence of 2 nouns will be included.
  • example7 – all documents published in 2012 which contain a sequence of 2 nouns will be included in the subcorpus.

=example4
-CQL-
[tag="N.*"] [tag="N.*"]

=example5
-CQL-
<s/> containing [tag="N.*"] [tag="N.*"] 

=example6
-CQL-
<doc/> containing [tag="N.*"] [tag="N.*"] 

=example7
-CQL-
<doc pub_date="2012"/> containing [tag="N.*"] [tag="N.*"]
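The CQL definitions above all follow the same three-line shape, so they can be generated rather than typed. The following is a minimal Python sketch, not part of Sketch Engine: the helper name `cql_definition` and its `scope` parameter are hypothetical and only illustrate how the containing operator widens the subcorpus from the matched tokens to a whole structure.

```python
def cql_definition(name, query, scope=None):
    """Build one CQL-based subcorpus definition as text.

    name  -- subcorpus name, written after '='
    query -- CQL query whose KWIC becomes the subcorpus
    scope -- optional structure name (e.g. "s" or "doc"); when given, the
             whole structure containing a match is included
    """
    if scope is not None:
        query = f"<{scope}/> containing {query}"
    return f"={name}\n-CQL-\n{query}\n"


two_nouns = '[tag="N.*"] [tag="N.*"]'

# Only the two matched nouns (as in example4).
print(cql_definition("example4", two_nouns))
# Whole sentences containing the match (as in example5).
print(cql_definition("example5", two_nouns, scope="s"))
```

The generated text can be pasted into the definition file unchanged.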

###############################################################################
# Subcorpus definition file
###############################################################################
#
# Subcorpora created using a definition file are available to all users 
# with access to the corpus. Subcorpora created using other ways are only
# available to the corpus owner.
#
# Subcorpus definition format
# ----------------------------
# *FREQLISTATTRS attr1 attr2
#
# =subcorpus_id
#   structure
#   sub-query
#
# =subcorpus_id
#   -CQL-
#   full-cql-query
#
# FREQLISTATTRS specifies a list of attributes for which frequency
# lists should be precomputed.
#
# Sub-query is a part of a corpus query which can be used in
# "within " clause.  It can consist of and/or combination
# of attribute-value pairs.
#
# Full-cql-query is any CQL query whose result (KWIC) is taken as subcorpus
# definition.
#
# All strings starting with # are comments and are ignored to the end of line.
#
###############################################################################

*FREQLISTATTRS word lemma lempos

=spoken
  bncdoc
  alltyp="Spoken context-governed" | alltyp="Spoken demographic"


=book60
  bncdoc
  alltim="1960-1974" & wrimed="Book"


=first1000
  -CQL-
  [#0-1000]


=same_as_book60
  -CQL-
  <bncdoc alltim="1960-1974" & wrimed="Book" />
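To check a definition file before applying it, the format described in the sample above can be read with a few lines of Python. This is a rough sketch and not the parser Sketch Engine uses. The function name `parse_definitions` is hypothetical, and the sketch assumes that only lines whose first non-blank character is # are comments, because a query such as [#0-1000] contains # in the middle of a line.

```python
def parse_definitions(text):
    """Parse definition-file text into (freq_attrs, entries).

    Each entry is a dict with the keys name, kind ("text_types" or "cql"),
    structure (None for CQL entries) and query.
    """
    freq_attrs, raw_entries, current = [], [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # blank line or comment (assumption)
            continue
        if line.startswith("*FREQLISTATTRS"):
            freq_attrs = line.split()[1:]
        elif line.startswith("="):
            current = {"name": line[1:].strip(), "lines": []}
            raw_entries.append(current)
        elif current is not None:
            current["lines"].append(line)
        else:
            raise ValueError(f"content before first '=' line: {line!r}")

    entries = []
    for entry in raw_entries:
        if len(entry["lines"]) != 2:
            raise ValueError(f"{entry['name']}: expected exactly 2 lines after the name")
        first, second = entry["lines"]
        is_cql = first == "-CQL-"
        entries.append({
            "name": entry["name"],
            "kind": "cql" if is_cql else "text_types",
            "structure": None if is_cql else first,
            "query": second,
        })
    return freq_attrs, entries


SAMPLE = """\
# comment line
*FREQLISTATTRS word lemma lempos

=spoken
  bncdoc
  alltyp="Spoken context-governed" | alltyp="Spoken demographic"

=first1000
  -CQL-
  [#0-1000]
"""

freq_attrs, entries = parse_definitions(SAMPLE)
for entry in entries:
    print(entry["name"], entry["kind"], entry["query"])
```

An entry with a missing query line raises a ValueError, which makes an incomplete definition easy to spot before compiling.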

Automatic subcorpus creation

You can automatically create subcorpora based on specific attributes such as topics and genres within a corpus. This allows you to focus on and analyze these specific subsets in detail.

General syntax

=subcorpus template name containing %s
    structure
    *attributeName [MINFREQ][%]

Syntax description

  1. Subcorpus name – must contain the placeholder %s, for which the attribute value will be substituted.
  2. Structure – a structure that will be used for creating subcorpora, most often it is the doc structure (document)
  3. Attribute name – If an asterisk * is used before the attribute name, multiple subcorpora will be created – one for each attribute value.
    For instance, if a corpus includes attributes like topic and genre, and the topic contains multiple topics (values) such as technology, business, culture etc., a user might want to create a separate subcorpus for each topic. In this case, placing an asterisk before the attribute name is necessary to generate individual subcorpora for each distinct topic.
  4. Attribute frequency – When an attribute name is followed by an integer, this number specifies the minimum frequency of occurrences that an attribute value must have for a subcorpus to be generated.
    Adding a percentage sign (%) after the number (e.g., *attributeName 10%) means the attribute must represent at least this percentage of all data within the scope of the main structure to create a subcorpus.
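The effect of the minimum frequency can be illustrated with a short Python sketch. It is an approximation under a stated assumption: frequencies are counted here as the number of structures (documents) carrying each value, whereas Sketch Engine may measure the data differently, for example in tokens. The function names and the sample data are hypothetical.

```python
from collections import Counter


def values_passing_threshold(values, minfreq, percent=False):
    """Return attribute values frequent enough to get their own subcorpus.

    values  -- one attribute value per structure (assumption: per document)
    minfreq -- minimum frequency, absolute or in percent
    percent -- True when the definition uses the % form, e.g. '*topic 10%'
    """
    counts = Counter(values)
    total = sum(counts.values())
    threshold = total * minfreq / 100 if percent else minfreq
    return sorted(value for value, count in counts.items() if count >= threshold)


def expand_template(template, values):
    """Substitute each value for the %s placeholder in the subcorpus name."""
    return [template.replace("%s", value) for value in values]


# Hypothetical corpus of 100 documents with a 'topic' attribute.
doc_topics = ["technology"] * 60 + ["business"] * 39 + ["culture"] * 1

kept = values_passing_threshold(doc_topics, 10, percent=True)  # '*topic 10%'
names = expand_template("Topic %s", kept)
print(names)
```

With this sample data, the culture value covers only 1% of the documents, so a 10% threshold produces no subcorpus for it.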

Usage example

To apply the definitions in this example, follow the steps in the Applying the definition section.

=Hungarian web domain .hu
    doc
    tld="hu"

=Topic %s
    doc
    *topic 1%

=Genre %s
    doc
    *genre 1%

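The above example will produce one subcorpus of documents from the .hu domain, plus a separate subcorpus for each value of the topic and genre attributes that accounts for at least 1% of the data. After the corpus has been recompiled, the subcorpora can be viewed in Corpus info as shown in the following screenshot: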

Subcorpora in Corpus info

Applying the definition

There are two ways of applying the subcorpus definition to the corpus:

  • via the web interface – recommended for most users
  • with a script – only for system admins

Via the interface

Start on the Dashboard and follow these steps:

  • MANAGE CORPUS
  • Configure
  • Expert Settings
  • Subcorpus definition
  • Type your definitions within the Subcorpus definition section
  • Save and Compile.

When the compilation is complete, the subcorpora will be available in the subcorpus selector, which can be found within the input settings across various Sketch Engine functions. They will also be accessible under ‘Manage Corpus – Subcorpora’ and on the corpus information page. Anyone you share the corpus with will have access to these subcorpora.

With the mksubc.py script

(for system admins only)

Usage: mksubc.py CORPNAME SUBCORP_DIR SUBCORP_DEF_FILE

SUBCORP_DIR is the directory where the subcorpora will be created; its location depends on the Sketch Engine installation. The global subcorpora (accessible by all users) should be stored in the directory set in the SUBCBASE attribute of the corpus config file, which is by default PATH/subcorp/.

Note that mksubc.py is run by compilecorp (see Compiling Corpus).
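For admins who script corpus maintenance, the call can be wrapped in Python. This is only a sketch built on the usage line above. The wrapper name `run_mksubc`, the corpus name, the directory and the definition file name are placeholders, and the sketch assumes mksubc.py is on the PATH. By default it only builds the command and does not run it.

```python
import shlex
import subprocess


def run_mksubc(corpname, subcorp_dir, def_file, dry_run=True):
    """Build (and optionally run) 'mksubc.py CORPNAME SUBCORP_DIR SUBCORP_DEF_FILE'."""
    cmd = ["mksubc.py", corpname, subcorp_dir, def_file]
    if not dry_run:
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(cmd, check=True)
    return cmd


# Placeholder values; replace them with those of your installation.
cmd = run_mksubc("mycorpus", "/path/to/subcorp", "mycorpus.subcdef")
print(shlex.join(cmd))
```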

When is this useful?

Building a subcorpus using a definition file is useful in these situations:

subcorpus sharing
Subcorpora are normally only available to the owner of the corpus. Subcorpora built via the definition file will be available to all the users that you share the corpus with.

lots of subcorpora (with minimal variation)
Although you can build any number of subcorpora from the concordance or text types, the process of clicking in the interface can be tedious, especially if there are minimal differences between the subcorpora.

mobility
If you have two or more corpora and you want their subcorpora to be built using exactly the same criteria, the subcorpus definition file makes it possible to simply copy the definition from one corpus to another, ensuring the subcorpora are based on identical criteria.

subcorpus tuning
You want to be able to improve an existing subcorpus by repeatedly adapting its definition.
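The case of many subcorpora with minimal variation lends itself to generating the definition text with a script. The sketch below is one possible approach in Python, not a Sketch Engine feature. The helper names are hypothetical, and it assumes the corpus has a doc structure with a pub_year attribute, as in example1.

```python
def text_type_definition(name, structure, condition):
    """Return one text-type definition: '=name', structure line, condition line."""
    return f"={name}\n    {structure}\n    {condition}\n"


def definitions_by_year(years, structure="doc", attribute="pub_year"):
    """Return definitions for one subcorpus per year, separated by blank lines."""
    return "\n".join(
        text_type_definition(f"year{year}", structure, f'{attribute}="{year}"')
        for year in years
    )


definition_text = definitions_by_year([2011, 2012])
print(definition_text)
```

Because the output is plain text, the same script can be reused for another corpus to keep the criteria identical.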

Other ways of building subcorpora