Patakis corpus | Sketch Engine

Patakis is a 100-million-word collection of POS-tagged texts, mostly downloaded from the Internet. It was prepared by Milos Husak of Masaryk University, Brno, for Lexical Computing Ltd., in collaboration with the Greek publishers Patakis and the Greek software company Neurolingo.

POS-tagging

The tokenization and part-of-speech tagging are based on the NeuroLingo Collection Analyzer, which provides the following information for each token:

word
lemma
tag
morph
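
As a rough illustration of how these four attributes might be read, the sketch below parses a single token line. It assumes the corpus is stored one token per line with tab-separated columns in the order word, lemma, tag, morph; this page does not state that layout, so treat it as an assumption. The names `Token` and `parse_token_line` are hypothetical, and the tag and morph values in the sample are placeholders, not real NeuroLingo tags.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One corpus position with the four attributes listed above."""
    word: str
    lemma: str
    tag: str
    morph: str


def parse_token_line(line: str) -> Token:
    """Split one tab-separated line into word, lemma, tag and morph.

    Assumption: columns are tab-separated and appear in this order.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    return Token(*fields)


# The tag and morph strings below are invented placeholders.
sample = "σπίτια\tσπίτι\tTAG_PLACEHOLDER\tMORPH_PLACEHOLDER"
token = parse_token_line(sample)
print(token.word, token.lemma)
```

A named tuple keeps the attributes addressable by name (`token.lemma`) while still unpacking like a plain tuple.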

Tagset summary

NeuroLingo Collection Analyzer
http://www.neurolingo.gr/

Sketch Grammar

The sketch grammar, used for the generation of Greek word sketches and distributional thesaurus, was developed by Mavina Pantazara and Christos Tsalidis of Neurolingo.

Structure

The corpus is divided into documents (<doc></doc>) identified by their id. When available, each document also carries the following metadata: url, genre, subgenre, language, title, author, year and decade of publication, publisher, medium, and time (epoch of publication). Each document is further structured using the following tags:

paragraphs        <p></p>
sentences         <s></s>
headers           <h></h>
lists             <ul></ul>
list lines        <li></li>
non-Greek words   <non-greek></non-greek>
glue              <g/>
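
The structure tags above can be used to rebuild readable sentences from the tokenized text. The following is a minimal sketch, assuming each tag and each token sits on its own line, that the word is the first tab-separated column, and that <g/> marks two neighbouring tokens that should be joined without a space. The function name `sentences_from_vertical` and the placeholder tag values are hypothetical.

```python
def sentences_from_vertical(lines):
    """Yield the plain text of every <s>...</s> block.

    Simplification: any line starting with "<" is treated as a structure
    tag, so a token that is literally "<" would be skipped.
    """
    parts = []      # pieces of the sentence being built
    glue = False    # True when the next token attaches without a space
    inside = False  # True while between <s> and </s>
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "<s>" or line.startswith("<s "):
            parts, glue, inside = [], False, True
        elif line == "</s>":
            if inside:
                yield "".join(parts)
            inside = False
        elif line == "<g/>":
            glue = True
        elif line.startswith("<"):
            continue  # other structures: doc, p, h, ul, li, non-greek
        elif inside and line:
            word = line.split("\t", 1)[0]
            if parts and not glue:
                parts.append(" ")
            parts.append(word)
            glue = False


# Placeholder annotations ("X") stand in for real lemma, tag and morph values.
sample = [
    '<doc id="1">', "<p>", "<s>",
    "Γεια\tX\tX\tX",
    "<g/>",
    ",\tX\tX\tX",
    "κόσμε\tX\tX\tX",
    "<g/>",
    ".\tX\tX\tX",
    "</s>", "</p>", "</doc>",
]
print(list(sentences_from_vertical(sample)))  # ['Γεια, κόσμε.']
```

Writing this as a generator means a large corpus file can be streamed line by line without loading it into memory.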

Text gathering

The majority of texts were downloaded using BootCaT from a URL list generated from a list of Greek words provided by Patakis. Some Patakis-owned documents are included under licence and are not available in the default version of the corpus. The corpus containing solely the texts downloaded from the Internet is called GkWaC.

public documents     :           96861
non-public documents :             240
max doc per server   :             250

date                 :    October 2007
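
The "max doc per server" figure suggests that the number of documents taken from any one host was capped. The sketch below shows one way such a cap could be applied to a URL list; it is an illustration under that assumption, not a description of how BootCaT enforces the limit. The function `cap_per_server` and the example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def cap_per_server(urls, max_per_server=250):
    """Keep at most max_per_server URLs per host, preserving input order."""
    seen = Counter()  # number of URLs kept so far for each host
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < max_per_server:
            seen[host] += 1
            kept.append(url)
    return kept


# Hypothetical URLs; a small cap is used so the effect is visible.
urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
]
print(cap_per_server(urls, max_per_server=2))
# ['http://a.example/1', 'http://a.example/2', 'http://b.example/1']
```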

Paper related to the BootCaT and WebBootCaT tools

Marco Baroni, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý (2006). WebBootCaT: instant domain-specific corpora to support human translators. In Proceedings of EAMT, 11th Annual Conference of the European Association for Machine Translation. Oslo, Norway, pp. 247–252.
