Create a new corpus from files

Sketch Engine also serves as corpus building software: a corpus can be built by downloading content from the web or by uploading files. The latter is covered on this page.

A corpus can be built by combining both methods. Data can be added to the corpus at any later point to make it larger.

Who can access my data?

Sketch Engine is not a public cloud. Texts you upload will be stored in your personal space in your account. Other users cannot access your texts.

You can, however, choose to grant access to individually selected users by sharing the corpus. If you are a member of a site licence (multi-user account), you can grant access to all other members of the same site licence. Sharing never happens automatically; you have to take an explicit action.

How to create a corpus by uploading files

There are 3 ways to reach the corpus building tool:

  • on the corpus dashboard, click NEW CORPUS
  • on the select corpus advanced screen, click NEW CORPUS
  • open the corpus selector at the top of each screen and click CREATE CORPUS

Sketch Engine also supports building parallel corpora from aligned texts, which follows a separate set of steps.

In the corpus building interface

  • type a name for your new corpus, select the language, optionally provide a description and click NEXT
  • select the I have my own texts option
  • drag and drop the files or select them from your hard drive
  • multiple files can be uploaded as one zip archive
  • click on the help icons to learn about the options and settings

This process can be repeated to make the corpus larger or can be combined with building from the web.

How to optimize your corpus

Find out how to optimize your corpus by adding a corpus description, labels, changing text types, etc.

The complete list of supported file formats:

  • .doc, .docx, .htm, .html, .tei, .tmx, .txt, .vert, .xml
  • .pdf (scanned images must be OCRed before uploading)
  • .xls, .xlsx, .tmx, .xlf/.xliff, .ods (for parallel corpora only)
  • .zip, .tar.gz (to upload a large number of files at once)
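
When a corpus consists of many files, it can be convenient to bundle them into one zip archive before uploading. The following is a minimal Python sketch of that preparation step, not part of Sketch Engine itself. The function name `bundle_for_upload` and the choice to skip unsupported extensions are assumptions made for illustration; the extension set is copied from the monolingual formats listed above.

```python
from pathlib import Path
import zipfile

# Monolingual formats from the list above; adjust as needed.
SUPPORTED = {".doc", ".docx", ".htm", ".html", ".tei", ".tmx",
             ".txt", ".vert", ".xml", ".pdf"}

def bundle_for_upload(source_dir, archive_path):
    """Zip every supported file under source_dir (hypothetical helper).

    Returns the archive member names that were added, in sorted order.
    """
    source = Path(source_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            # Skip directories and files whose extension is not in the list.
            if path.is_file() and path.suffix.lower() in SUPPORTED:
                name = path.relative_to(source).as_posix()
                zf.write(path, arcname=name)
                added.append(name)
    return added
```

Calling `bundle_for_upload("my_texts", "my_texts.zip")` would produce an archive that can then be dragged into the upload area like any single file.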

An XML file can also be used if you upload it as plain text, but it should only contain text with structural mark-up (such as document or paragraph boundaries, document metadata, etc.). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly:
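
As a rough illustration of text that carries only structural mark-up, the Python sketch below builds a small document with one document element and paragraph elements inside it. The element names `doc` and `p`, the `title` attribute and the helper `build_document` are assumptions chosen for the example, not a confirmed specification of what Sketch Engine expects.

```python
import xml.etree.ElementTree as ET

def build_document(title, paragraphs):
    """Return a flat XML string: one <doc> wrapping plain-text <p> elements.

    Tag and attribute names are illustrative assumptions.
    """
    doc = ET.Element("doc", {"title": title})  # document boundary + metadata
    for text in paragraphs:
        p = ET.SubElement(doc, "p")            # paragraph boundary
        p.text = text                          # special characters are escaped on output
    return ET.tostring(doc, encoding="unicode")

xml_text = build_document("Sample", ["First paragraph.", "Second paragraph."])
print(xml_text)
# <doc title="Sample"><p>First paragraph.</p><p>Second paragraph.</p></doc>
```

The point of the sketch is the shape of the output: document metadata lives in attributes, and the only nesting is paragraphs inside a document.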

With regard to PDF files, please bear in mind that they are first converted into plain text in order to create a corpus. This conversion is still an unsolved problem in computer science (across various fields). PDF files that contain multiple columns, headers/footers or words split at the end of lines are especially likely to be processed incorrectly.
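
If you convert a PDF to plain text yourself and upload the .txt file instead, words split at the end of lines can be partly repaired beforehand. The Python sketch below shows one simple heuristic; the function name `rejoin_hyphenated` is hypothetical, and this is not a description of how Sketch Engine converts PDF files.

```python
import re

# A word character, a hyphen, a line break, then another word character.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")

def rejoin_hyphenated(text):
    """Join words that were hyphenated across a line break (heuristic)."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)

print(rejoin_hyphenated("The conver-\nsion of PDF files"))
# The conversion of PDF files
```

The heuristic cannot tell a line-break hyphen from a genuine one, so a compound such as "well-known" that happens to be split across lines would lose its hyphen. Checking the result against the original is advisable.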

Configuration template (for advanced users): Instead of the default template for the selected language, you may select a custom configuration template. Be advised that only vertical files are supported when using custom templates. Note also that this option is not shown by default. To enable it, you must first create a user template in Configuration templates, as described on the Corpus Configuration File: Overview and Corpus Configuration File: All Features pages.