What is the COVID-19 corpus?

The Covid-19 corpus is an English corpus that consists of articles that were released as part of the COVID-19 Open Research Dataset (CORD-19). The final version of the CORD-19 dataset was released on June 2, 2022. The COVID-19 corpus consists of more than 370,000 papers about coronavirus and related topics.

The data has been gained from the Github repository of CORD-19 https://github.com/allenai/cord19/blob/master/README.md

The corpus has rich metadata containing information about articles such as author, DOI, journal, publish time, title, etc. A full list of available metadata can be found in the Github repository. Since the original data contain many duplications, the corpus was deduplicated on the doc.doi attribute.

How to access the corpus?

The corpus is in the “open” category: no account is required to get access, just visit

http://ske.li/covid_19

Please note: some functionalities (e.g. building user subcorpora or extracting terms and keywords against a large reference corpus) require having an account. Please create a trial account and email your username to inquiries@sketchengine.eu with the subject “Covid-19 corpus” and we will give you a free 1-year account to access this corpus.

Tools to work with COVID-19 corpus

A complete set of tools is available to work with the COVID-19 to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus

If you are interested in the full corpus data, please contact us at support@sketchengine.eu. The Covid-19 corpus is tokenized, part-of-speech tagged and lemmatized.

Please do note that the PoS tagging and lemmatization have been done using TreeTagger and thus the annotation is available only for non-commercial purposes. The original license applies to source texts only.

See original data changelog. Information below relates to the Sketch Engine processing.

version 2022-07-23 (published August, 2022 in Sketch Engine)

  • data update

version 2020-03-27 (published March 30th, 2020 in Sketch Engine)

  • data update

version 2020-03-27 (published March 30th, 2020 in Sketch Engine)

  • data update + all metadata now preserved for articles

version 2020-03-20 (published March 26th, 2020 in Sketch Engine)

  • initial version, mark-up of abstracts, documents, back matter and citations

How to reference Sketch Engine

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Michael Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, et al.. 2020. CORD-19: The COVID-19 Open Research Dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online. Association for Computational Linguistics.

COVID-19 Open Research Dataset (CORD-19). 2020. Version 2022-06-02. Retrieved from https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html. Accessed 2022-06-02.

Search the COVID-19 corpus from Open Research Dataset

Sketch Engine offers a range of tools to work with the COVID-19 corpus.

Covid-19 corpus - concordance

Other English corpora

Explore our largest Timestamped English corpus with 80+ billion words.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract. Use our Quick Start Guide to learn it in minutes.