Topics and genres in corpora
Topics and genres are text types (metadata) that enrich the corpus with information about the subject of the texts or the writing styles.
Sketch Engine uses topics and genres to focus the search or analysis on only a part of the corpus. Every tool in Sketch Engine contains the text type selector, which is used to select the required topics, genres, or other text types. Topics and genres, together with other metadata, can be used in searches, including CQL searches. Statistics on genres and topics can be generated with the frequency tool in the concordance.
In addition, in the Word Sketch, topics and genres are displayed next to collocations to indicate that the collocation is typical of a topic or genre. The function can be activated in View options.
Visualization generated by the Text Type Analysis tool available from the Dashboard of each corpus.
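As an illustration of using a text type in a CQL search, the following Python sketch builds a query string restricted to one genre or topic. It relies on the CQL `within <structure attribute="value" />` form. The structure name `doc` and the attribute names `genre` and `topic` are assumptions made for this example; the actual names depend on the corpus and can be checked in the corpus information. The helper `restrict_to_text_type` is hypothetical and not part of Sketch Engine.

```python
def restrict_to_text_type(query: str, attribute: str, value: str,
                          structure: str = "doc") -> str:
    """Append a `within` restriction on one structure attribute to a CQL query.

    `structure` and `attribute` are assumed names (e.g. doc / genre / topic);
    check the corpus information page for the names your corpus really uses.
    """
    # Escape backslashes and double quotes so the value stays one CQL string.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{query} within <{structure} {attribute}="{escaped}" />'


# Example: occurrences of the lemma "bank" in texts labelled with the genre News.
news_query = restrict_to_text_type('[lemma="bank"]', "genre", "News")
print(news_query)
```

Attribute values in CQL are treated as regular expressions, so a value containing regex metacharacters would need additional escaping that this sketch does not perform.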
Classification process
Our web-based corpora are created using a sophisticated procedure designed to eliminate web content that is of little linguistic value or unsuitable for linguistic analysis. The procedure is partly automatic and partly manual. The cleaned corpus is then processed (tagged and lemmatized) and classified into genres and topics.
Genres and topics are assigned at the web domain level, i.e. all texts collected from the same website are assigned the same genre and topic. This is a manual task, and because the latest corpora contain billions of words, it is not possible to classify the whole corpus manually. To classify the largest possible portion of the corpus while keeping the task manageable, the websites that contributed the largest volume of text are classified first and the smaller websites remain unclassified. Typically, only a smaller part of the corpus is classified. However, thanks to the large corpus size, even this part amounts to billions of words of classified text, which makes it a valuable and reliable resource.
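The domain-level approach described above can be sketched in Python as follows. This is a minimal illustration, not the actual Sketch Engine pipeline: the record fields (`domain`, `words`), the function names, and the sample data are all assumptions made for the example. Domains are ranked by the volume of text they contributed, the largest ones receive a manual label, and every document inherits the label of its domain.

```python
from collections import Counter


def rank_domains_by_volume(docs):
    """Return domains ordered from the largest to the smallest word volume."""
    volume = Counter()
    for doc in docs:
        volume[doc["domain"]] += doc["words"]
    return [domain for domain, _ in volume.most_common()]


def propagate_labels(docs, domain_labels):
    """Give each document the topic and genre of its domain (None if unlabelled)."""
    labelled = []
    for doc in docs:
        label = domain_labels.get(doc["domain"], {})
        labelled.append({**doc,
                         "topic": label.get("topic"),
                         "genre": label.get("genre")})
    return labelled


def classified_share(labelled_docs):
    """Fraction of all words that belong to documents with an assigned topic."""
    total = sum(doc["words"] for doc in labelled_docs)
    done = sum(doc["words"] for doc in labelled_docs if doc["topic"] is not None)
    return done / total if total else 0.0


# Hypothetical sample data: three domains of very different sizes.
docs = [
    {"domain": "news.example", "words": 5000},
    {"domain": "blog.example", "words": 1200},
    {"domain": "news.example", "words": 3000},
    {"domain": "tiny.example", "words": 300},
]

ranking = rank_domains_by_volume(docs)

# Only the largest domain is labelled by hand; the rest stay unclassified.
manual_labels = {ranking[0]: {"topic": "sports", "genre": "News"}}

labelled = propagate_labels(docs, manual_labels)
share = classified_share(labelled)
print(ranking, round(share, 3))
```

In this toy example, labelling a single domain already covers most of the words, which mirrors why classifying the largest websites first gives the best coverage for a fixed amount of manual work.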
Related paper
Vít Suchomel. Genre Annotation of Web Corpora: Scheme and Issues. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1: 738-754, 2021.
Corpus topics
A topic refers to the subject of the text such as culture, technology, business, and so on.
The list of topics is inspired by the DMOZ list (dmoz.org, now available on curlie.org). Over time, certain adjustments were made, involving the removal, merging, or addition of topics. A topic may be absent from a corpus if it was not found in the largest domains of that corpus. Nevertheless, in most cases, the set of topics is largely the same across corpora.
The current topic classification contains these 20 categories:
Topic | Included subtopics |
---|---|
arts | art, exhibitions, museums |
beauty & fashion | fashion, make-up, jewellery, beauty |
cars & bikes | news, blogs and discussion forums related to cars or bikes |
culture & entertainment | movies, TV series, music, books, theatres |
economy, finance & business | job search engines, business / finance news, banking, e-commerce, real estate, crypto, company websites |
education | university websites, websites related to learning |
games | computer/video games reviews/news, chess, board games, gambling & casinos |
health | health / nutrition tips, medical news, pharmaceutical news |
history | history, historical events, discussions |
hobbies | anything one might do in one's free time that is not included in the other topics |
home, family & children | cooking (food & drinks), parenting, kids, home decoration, family |
nature & environment | environment problems, climate change, agriculture |
pets & animals | usually blogs or discussion forums on animals or pets |
politics & government | websites on politics, parties |
religion | anything related to religion – Christianity, Buddhism etc. |
science | scientific abstracts, articles (unless they are related to one specific topic; for example, a domain devoted to medicine would be classified as health) |
sex | sex short stories |
sports | any sports (including news) |
technology & IT | computer news, android / iOS or other operating systems, cybersecurity, other technology |
travel & tourism | travelling tips, stories, hotel booking |
Corpus genres
A genre is defined by the style of writing and the purpose for which it is written, e.g. blog, discussion, news, etc.
The genres are also identified manually by checking samples of texts from each domain. As with topics, the classification is limited to the domains that contributed the largest volume of text to the corpus.
Genre | Description |
---|---|
Blog | usually personal websites about a specific topic |
Discussion | websites where users discuss various topics |
Legal | texts related to law, legal system etc. |
News | informative articles about recent events |
Fiction | often short stories or books |
Reference / Encyclopedia | Wikipedia or other wiki-like domains (unless they are too specific and should be classified with a different topic), libraries, encyclopedias |