Topics and genres in corpora

Topics and genres are text types (metadata) that enrich the corpus with information about the subject of the texts or the writing styles.

Sketch Engine uses topics and genres to restrict a search or analysis to a part of the corpus. Every tool in Sketch Engine includes the text type selector, which can be used to select the required topics, genres, and other text types. Topics and genres, together with other metadata, can also be used in searches, including CQL searches. Statistics on genres and topics can be generated using the frequency tool in the concordance.
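As a rough illustration of using a topic or genre in a CQL search, the sketch below builds a query string restricted with a "within" clause. The helper name restrict_to_text_type is hypothetical, and the structure and attribute names ("doc", "topic", "genre") are assumptions: the actual names and values depend on the corpus, so check them in the corpus information before using a query like this.

```python
def restrict_to_text_type(query, attribute, value, structure="doc"):
    """Return a CQL query limited to structures with the given attribute value.

    Hypothetical helper: the structure name ("doc") and the attribute names
    ("topic", "genre") are assumptions and may differ between corpora.
    """
    # Escape backslashes and double quotes so the value stays inside its quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{query} within <{structure} {attribute}="{escaped}" />'


# Occurrences of the lemma "goal" in texts assumed to be labelled with the topic "sports".
sports_query = restrict_to_text_type('[lemma="goal"]', "topic", "sports")
print(sports_query)

# The same lemma in texts assumed to be labelled with the genre "News".
news_query = restrict_to_text_type('[lemma="goal"]', "genre", "News")
print(news_query)
```

The resulting string can be pasted into the CQL search box. Attribute values in CQL are matched as regular expressions, so values containing regex metacharacters may need extra escaping.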

In addition, in the Word Sketch, topics and genres are displayed next to collocations to indicate that the collocation is typical of a topic or genre. The function can be activated in View options.

Visualization generated by the Text Type Analysis tool available from the Dashboard of each corpus.

Classification process

Our web-based corpora are created using a sophisticated procedure designed to eliminate web content that has little linguistic value or is unsuitable for linguistic analysis. The procedure is partly automatic and partly manual. The cleaned corpus is then processed (tagged and lemmatized) and classified into genres and topics.

Genres and topics are assigned at the web domain level, i.e. all texts collected from the same website are assigned the same genre and topic. This is a manual task, and the enormous size of the latest multi-billion-word corpora makes it impossible to classify the whole corpus manually. To classify the largest possible portion of the corpus while keeping the task manageable, the websites that contributed the largest volume of text are classified first, and the smaller websites remain unclassified. Typically, only a smaller part of the corpus is classified. However, thanks to the large corpus size, even this smaller part amounts to billions of words of classified text, which makes it a valuable and reliable resource.
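The domain-level approach can be pictured with a small sketch. It is not the actual pipeline: the toy data, the example domain names, and the function names are all made up for illustration. It ranks domains by the volume of text they contributed, applies manually chosen labels to every document from a labelled domain, and reports what share of the words ends up classified.

```python
from collections import Counter

# Hypothetical toy corpus: (web domain, word count) for each document.
documents = [
    ("news.example", 900),
    ("blog.example", 300),
    ("news.example", 700),
    ("tiny.example", 50),
    ("blog.example", 200),
]


def largest_domains(docs, limit):
    """Return the `limit` domains that contributed the most words."""
    volume = Counter()
    for domain, words in docs:
        volume[domain] += words
    return [domain for domain, _ in volume.most_common(limit)]


def label_documents(docs, domain_labels):
    """Give every document the label of its domain, or None if unclassified."""
    return [(domain, words, domain_labels.get(domain)) for domain, words in docs]


def classified_share(labelled):
    """Return the fraction of all words that belong to classified documents."""
    total = sum(words for _, words, _ in labelled)
    classified = sum(words for _, words, label in labelled if label is not None)
    return classified / total if total else 0.0


# Domains a human annotator would review first.
to_review = largest_domains(documents, limit=2)

# Labels chosen manually for the reviewed domains: (genre, topic).
manual_labels = {
    "news.example": ("News", "sports"),
    "blog.example": ("Blog", "hobbies"),
}

labelled = label_documents(documents, manual_labels)
print(to_review)
print(f"{classified_share(labelled):.1%} of words classified")
```

In this toy example, labelling two of the three domains covers most of the words, which mirrors why classifying only the largest websites still yields a large classified subcorpus.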

Related paper

Vít Suchomel. Genre Annotation of Web Corpora: Scheme and Issues. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1: 738-754, 2021.

Corpus topics

A topic refers to the subject of the text such as culture, technology, business, and so on.

The list of topics is inspired by the DMOZ list (dmoz.org, now available on curlie.org). Over time, certain adjustments were made: some topics were removed, merged, or added. A topic may be missing from a corpus if it was not found among the largest domains. Nevertheless, in most cases, the majority of topics are the same across corpora.

The current topic classification contains these 20 categories:

Topic                         Included subtopics
arts                          art, exhibitions, museums
beauty & fashion              fashion, make-up, jewellery, beauty
cars & bikes                  news, blogs and discussion forums related to cars or bikes
culture & entertainment       movies, TV series, music, books, theatres
economy, finance & business   job search engine, business / finance news, banking, e-commerce, real estate, crypto, company websites
education                     university websites, websites related to learning
games                         computer/video games reviews/news, chess, board games, gambling & casinos
health                        health / nutrition tips, medical news, pharmaceutical news
history                       history, historical events, discussions
hobbies                       anything one might do in their free time that is not included in the other topics
home, family & children       cooking (food & drinks), parenting, kids, home decoration, family
nature & environment          environment problems, climate change, agriculture
pets & animals                usually blogs or discussion forums on animals or pets
politics & government         websites on politics, parties
religion                      anything related to religion – Christianity, Buddhism etc.
science                       scientific abstracts, articles (unless they are related to one specific topic, for example medicine, then such domain would be classified as health)
sex                           sex short stories
sports                        any sports (including news)
technology & IT               computer news, android / iOS or other operating systems, cybersecurity, other technology
travel & tourism              travelling tips, stories, hotel booking

Corpus genres

A genre is defined by the style of writing and the purpose for which it is written, e.g. blog, discussion, news, etc.

The genres are also identified manually by checking samples of texts from each domain. As with topics, the classification is limited to the domains that contributed the largest volume of text to the corpus.

Genre                       Description
Blog                        usually personal websites about a specific topic
Discussion                  websites where users discuss various topics
Legal                       texts related to law, legal system etc.
News                        informative articles about recent events
Fiction                     often short stories or books
Reference / Encyclopedia    Wikipedia or other wiki-like domains (unless they are too specific and should be classified with a different topic), libraries, encyclopedias