What is a subcorpus?
Each corpus can be divided into smaller parts called subcorpora. Subcorpora can be used to divide the corpus by the type of text(fiction, newspaper), media (spoken, written), time (e.g. publication year) or by any other criteria. Subcorpora can be overlapping, the same segment can appear in all subcorpora whose criteria it meets.
All tools, except the thesaurus, can be set to analyse a subcorpus instead of the whole corpus.
Lock a subcorpus
Normally, a subcorpus has to be selected each time the user needs to search it. When working with the same subcorpus for an extended period of time, the corpus can be locked. When locked, all tools will have this subcorpus preselected. The user does not need to select it each time. It will stay locked until the user unlocks it gain.
How to create a subcorpus?
A corpus can be divided into subcorpora using text types or from a concordance. The third option using a configuration file is for advanced users and allows for detailed specifications. This page explains the first two options. Subcorpora are only available to the user who created them. Expert users can use the configuration file to share subcorpora with all users.
OPTION 1 – subcorpus from text types
This procedure will create a subcorpus from text types. This option can only be used if the corpus is annotated for text types.
detailed instructions
The subcorpus building screen can be reached in two ways:
From the dashboard
On the corpus dashboard, click MANAGE CORPUS, then SUBCORPORA, then CREATE SUBCORPUS
From the advanced tab of any tool
On the advanced tab of any tool (with the exception of the thesaurus), click the plus sign add next to the subcorpus selector.
- type a name for your new subcorpus
- use the text type selectors to choose the required text types
- click CREATE SUBCORPUS
Creating a subcorpus may take a few seconds while statistics for the subcorpus are calculated. Watch for a notification. When finished, it will appear in the subcorpus selector on advanced tabs of all tools that support subcorpora.
Tips for using the text type selectors
- you can select as many text types from as many selectors as you wish
- selecting all values in a selector is the same as selecting none
- type a few letters to search the text types
- the selector can be expanded full-screen for practicality
OPTION 2 – subcorpus from a concordance
A subcorpus can be created from concordance lines. The user generates a concordance and decides how much context surrounding the KWIC should be included in the subcorpus. This context is defined by documents, the paragraphs or only the sentences containing the KWIC. Other structures can be selected too if the corpus contains them.
detailed instructions
- Open a corpus and generate a concordance.
Click the plus icon add in the menu above the concordance. - Type a name for your new subcorpus.
- Indicate how much context should be included in the subcorpus by selecting the structure surrounding the KWIC. The available structures can differ betewen corpora. These are the most common ones:
doc – the whole document (produces big subcorpora)
p – the whole paragraph
s – the whole sentence (produces small subcorpora)
For a detailed description of the structures used in the corpus, see corpus information. - Click CREATE SUBCORPUS
It may take a few seconds for the corpus to be built while the statistics are calculated. Watch for a notification. When finished, the subcorpus is available in the subcorpus selector on the advanced tabs of tools which support subcorpora.
Subcorpora for advanced users
There are other and more advanced ways to create subcorpora, especially if you want these subcorpora to be shared with other users. Then please follow the steps in the article Creating subcorpora for sharing with all users.