Create a new corpus from files
Sketch Engine also serves as corpus building software: a corpus can be built by downloading content from the web or by uploading files. The latter is covered on this page.
A corpus can be built by combining both methods. Data can be added at any later point to make the corpus larger.
How to create a corpus by uploading files
There are 3 ways to reach the corpus building tool:
- on the corpus dashboard, click NEW CORPUS
- on the select corpus advanced screen, click NEW CORPUS
- open the corpus selector at the top of each screen and click CREATE CORPUS
Sketch Engine supports building parallel corpora from aligned texts. Follow these steps.
In the corpus building interface
- type a name for your new corpus, select the language, optionally provide a description and click NEXT
- select the I have my own texts option
- drag and drop the files or select them from your hard drive
- multiple files can be uploaded as one zip archive
- click on the help icons to learn about the options and settings
This process can be repeated to make the corpus larger or can be combined with building from the web.
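Since multiple files can be uploaded as one zip archive, it can be convenient to prepare that archive with a short script before opening the corpus building interface. The following is a minimal Python sketch, not part of Sketch Engine; the function name `bundle_for_upload` and the folder and archive names in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def bundle_for_upload(source_dir: str, archive_path: str) -> list[str]:
    """Pack every regular file under source_dir into one zip archive.

    Returns the archive member names in sorted order.
    (Hypothetical helper, shown for illustration only.)
    """
    root = Path(source_dir)
    archive = Path(archive_path).resolve()
    members = []
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            # Skip directories, and the archive itself if it sits inside source_dir.
            if not path.is_file() or path.resolve() == archive:
                continue
            name = path.relative_to(root).as_posix()
            zf.write(path, arcname=name)
            members.append(name)
    return members
```

For example, `bundle_for_upload("my_texts", "my_texts.zip")` would collect everything in a local folder called `my_texts` into a single archive that can then be dragged into the upload area.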
How to optimize your corpus
Find out how to optimize your corpus by adding a corpus description and labels, changing text types, etc.
Supported formats
The complete list of supported file formats includes:
.doc, .docx, .htm, .html, .tei, .tmx, .txt, .vert, .xml
.pdf (scanned images must be OCRed before uploading)
.xls, .xlsx, .tmx, .xlf/.xliff, .ods (for parallel corpora only)
.zip, .tar.gz (to upload a large number of files at once)
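Before uploading a large batch, it may help to check file names against the list above. The sketch below is a local convenience only, written in Python; the extensions are copied from the list, while the category labels and the function name `classify` are assumptions made for this example.

```python
from pathlib import Path

# Extensions copied from the list above; the group names are our own labels.
GENERAL = {".doc", ".docx", ".htm", ".html", ".tei", ".tmx",
           ".txt", ".vert", ".xml", ".pdf"}
PARALLEL_ONLY = {".xls", ".xlsx", ".tmx", ".xlf", ".xliff", ".ods"}
ARCHIVES = {".zip", ".tar.gz"}


def classify(filename: str) -> str:
    """Return 'archive', 'general', 'parallel only' or 'unsupported'."""
    name = filename.lower()
    # ".tar.gz" is a double suffix, so test it before looking at the last suffix.
    if name.endswith(".tar.gz"):
        return "archive"
    suffix = Path(name).suffix
    if suffix in ARCHIVES:
        return "archive"
    if suffix in GENERAL:
        # .tmx appears in both lists; the general list is checked first.
        return "general"
    if suffix in PARALLEL_ONLY:
        return "parallel only"
    return "unsupported"
```

The check looks at file names only; it cannot tell whether, for instance, a PDF contains scanned images that still need OCR.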
An XML file can also be uploaded as plain text, but it should contain only text with structural mark-up (such as document or paragraph boundaries, or document metadata). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly:
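As an illustration only, the short Python sketch below holds a small XML text limited to simple structural mark-up and lists any element that goes beyond it. The element names `doc` and `p`, the attributes `title` and `year`, and the helper `unexpected_tags` are assumptions made for the example, not names required by Sketch Engine.

```python
import xml.etree.ElementTree as ET

# Illustrative text: one document with metadata attributes and two paragraphs.
SAMPLE = """<doc title="Annual report" year="2020">
<p>First paragraph of the document.</p>
<p>Second paragraph of the document.</p>
</doc>
"""

# Hypothetical set of structural elements we expect to see.
ALLOWED = {"doc", "p"}


def unexpected_tags(xml_text: str, allowed=ALLOWED) -> list[str]:
    """Return the sorted names of elements that are not in the allowed set."""
    root = ET.fromstring(xml_text)
    return sorted({el.tag for el in root.iter() if el.tag not in allowed})
```

An empty result from `unexpected_tags(SAMPLE)` means the text uses only the expected structures; a text with inline formatting such as `<b>` would be reported so that it can be simplified before upload.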
With regard to PDF files, please bear in mind that they are first converted into plain text before the corpus is created. This conversion remains an unsolved problem in computer science, so some content may not be processed correctly, especially in PDF files with multiple columns, headers/footers, or words split at the end of lines.
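To make the problem of words split at line ends concrete, here is a minimal Python sketch of one common clean-up heuristic applied to text that has already been extracted from a PDF by some other tool. It is not how Sketch Engine converts PDFs; the function name `join_hyphenated` is hypothetical.

```python
import re

# A word character, a hyphen, a line break, then another word character.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split with a hyphen at the end of a line."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)
```

The heuristic is deliberately naive: it also removes the hyphen from genuine compounds that happen to break across lines (for example "well-" followed by "known"), which is one reason the conversion cannot be made fully reliable.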
Configuration template - advanced users
Instead of the default template for the selected language, you may select a custom configuration template. Note that only vertical files are supported with custom templates and that this option is not shown by default. To enable it, you must first create a user template in Configuration templates, as described on the Corpus Configuration File: Overview and Corpus Configuration File: All Features pages.
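Because custom templates accept only vertical files, texts have to be converted before upload. The Python sketch below assumes the basic vertical layout of one token per line with structure tags on their own lines. The tokenizer is a naive regular expression, the structure names `doc` and `p` and the function `to_vertical` are hypothetical, and a real vertical file may need additional tab-separated attributes (such as lemma or tag), depending on the template.

```python
import re

# Naive tokenizer: runs of word characters, or single punctuation marks.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def to_vertical(paragraphs: list[str]) -> str:
    """Render paragraphs as one token per line inside <doc> and <p> tags."""
    lines = ["<doc>"]
    for para in paragraphs:
        lines.append("<p>")
        lines.extend(_TOKEN.findall(para))
        lines.append("</p>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"
```

Calling `to_vertical(["Hello, world!"])` puts `Hello`, `,`, `world` and `!` on separate lines between the opening and closing tags.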