Introduction

Sketch Engine usually generates word sketches using a sketch grammar. Formally, the result of processing sketch grammar queries is simply a (potentially ambiguous and incomplete) dependency parse of a sentence, so data that has been syntactically parsed by third-party tools can be used to generate the word sketches instead.

How to set up

To instruct compilecorp to build sketches from parsed data, set the WSDEF directive to any file path ending with .conll (preferably in the corpus data directory). The file is created automatically and contains the grammatical relations that have been indexed. Internally, compilecorp uses the sconll2sketch script to build the sketches.

Data formats

There are two supported formats:

1. CoNLL

We support a superset of CoNLL, as described here.

A sample in CoNLL format for two Turkish sentences is shown below:

id      word    lempos  tag     fineTag head    deprel
=========================================================

<s name="2">
1       Eğer    eğer-c  Conj    Conj    13      S.MODIFIER
2       ki      ki-c    Conj    Conj    1       INTENSIFIER
3       ülkelere        ülke-n  Noun    Noun    4       OBJECT
4       ve      ve-c    Conj    Conj    12      COORDINATION
5       onların o-p     Pron    Pron    8       SUBJECT
6       özelliklerine   özellik-n       Noun    Noun    8       DATIVE.ADJUNCT
7       ilginiz ilgi-n  Noun    Noun    8       SUBJECT
8       varsa   var-v   Verb    Verb    12      MODIFIER
9       bu      bu-d    Det     Det     10      DETERMINER
10      bölüm   bölüm-n Noun    Noun    12      SUBJECT
11      ilginizi        ilgi-n  Noun    Noun    12      OBJECT
12      çekebilir       çek-v   Verb    Verb    13      SENTENCE
13      .       .-x     Punc    Punc    0       ROOT
</s>
<s name="3">
1       Burada  bura-n  Noun    Noun    11      LOCATIVE.ADJUNCT
2       genel   genel-j Adj     Adj     3       MODIFIER
3       anlamda anlam-n Noun    Noun    11      MODIFIER
4       ülkelere        ülke-n  Noun    Noun    5       OBJECT
5       ait     ait-o   Postp   Postp   6       MODIFIER
6       bilgiler        bilgi-n Noun    Noun    11      OBJECT
7       ,       ,-x     Punc    Punc    8       notconnected
8       tanımlar        tanım-n Noun    Noun    11      SUBJECT
9       ,       ,-x     Punc    Punc    10      notconnected
10      uyarılar        uyarı-n Noun    Noun    11      OBJECT
11      bulabileceksiniz        bul-v   Verb    Verb    12      SENTENCE
12      .       .-x     Punc    Punc    0       ROOT
</s>

There are 7 columns in the vertical format above, but not all of them are mandatory. You can add as many additional columns as you like, as well as any other structural annotation such as sentences, paragraphs or documents.

The only columns required for generating word sketches are:

  • id: this represents the id/position of the current word. The name of this column can be overridden by setting the IDATTR configuration directive.
  • One positional attribute (probably word, lemma or lempos) used for generating the sketches. The name of this attribute is set by the WSATTR configuration directive.
  • head: this represents the parent node id of the current word. The name of this column can be overridden by setting the HEADATTR configuration directive.
  • deprel: this represents the relation by which the current node and parent node are connected. The name of this column can be overridden by setting the DEPRELATTR configuration directive.

In the example above, in the sentence <s name="2">, onların (id=5) is the subject of varsa (id=8).
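
To make the role of the id, head and deprel columns concrete, the following Python sketch reads a vertical file like the sample and lists (relation, head, dependent) triples. It is only an illustration and not how compilecorp or sconll2sketch work internally. The names Token, read_sentences and dependency_triples are hypothetical, and the code assumes the seven-column layout of the sample, whitespace-separated columns, and that every line starting with "<" is structural markup.

```python
from collections import namedtuple

# Column layout of the sample above (an assumption; real corpora may differ).
Token = namedtuple("Token", "id word lempos tag fine_tag head deprel")

SAMPLE = """<s name="2">
1   Eğer           eğer-c     Conj  Conj  13  S.MODIFIER
2   ki             ki-c       Conj  Conj  1   INTENSIFIER
3   ülkelere       ülke-n     Noun  Noun  4   OBJECT
4   ve             ve-c       Conj  Conj  12  COORDINATION
5   onların        o-p        Pron  Pron  8   SUBJECT
6   özelliklerine  özellik-n  Noun  Noun  8   DATIVE.ADJUNCT
7   ilginiz        ilgi-n     Noun  Noun  8   SUBJECT
8   varsa          var-v      Verb  Verb  12  MODIFIER
9   bu             bu-d       Det   Det   10  DETERMINER
10  bölüm          bölüm-n    Noun  Noun  12  SUBJECT
11  ilginizi       ilgi-n     Noun  Noun  12  OBJECT
12  çekebilir      çek-v      Verb  Verb  13  SENTENCE
13  .              .-x        Punc  Punc  0   ROOT
</s>
"""


def read_sentences(text):
    """Yield one list of Token per <s>...</s> block; other markup is skipped."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("<"):
            # Structural markup: close the current sentence on </s>.
            if line.startswith("</s>") and sentence:
                yield sentence
                sentence = []
            continue
        f = line.split()
        sentence.append(Token(int(f[0]), f[1], f[2], f[3], f[4], int(f[5]), f[6]))
    if sentence:
        yield sentence


def dependency_triples(sentence, attr="lempos"):
    """Return (deprel, head value, dependent value) for each non-root token."""
    by_id = {tok.id: tok for tok in sentence}
    triples = []
    for tok in sentence:
        head = by_id.get(tok.head)
        if head is None:  # head 0 (the root) has no parent token
            continue
        triples.append((tok.deprel, getattr(head, attr), getattr(tok, attr)))
    return triples


for sent in read_sentences(SAMPLE):
    for triple in dependency_triples(sent):
        print(triple)
```

The attr parameter plays the role that WSATTR plays in the corpus configuration: it selects which positional attribute the relation is reported for.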

2. SCoNLL

If you want to build sketches from ambiguous (i.e. multihead or multidependency) output, you can provide the data in SCoNLL (aka Sketch CoNLL) format. In this format you remove the head attribute entirely (from both the corpus configuration and the source vertical) and encode the head information in the deprel attribute. Its value is a semicolon-separated list of head-relation pairs, with head and relation separated by a comma, e.g. head1,relation1;head2,relation2;head3,relation3.
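
The encoding can be illustrated with a short Python sketch. It assumes that pairs are joined by semicolons and that head and relation within a pair are joined by a comma, as in the example value; the function names encode_heads and decode_heads are hypothetical and not part of Sketch Engine.

```python
def encode_heads(pairs):
    """Join (head id, relation) pairs into a single deprel value."""
    return ";".join(f"{head},{relation}" for head, relation in pairs)


def decode_heads(field):
    """Split a deprel value back into a list of (head id, relation) pairs."""
    if not field:
        return []
    pairs = []
    for item in field.split(";"):
        # Split on the first comma only: the head id comes first.
        head, relation = item.split(",", 1)
        pairs.append((int(head), relation))
    return pairs


# A token attached to two candidate heads (ids 8 and 12).
value = encode_heads([(8, "SUBJECT"), (12, "MODIFIER")])
print(value)
print(decode_heads(value))
```

A plain CoNLL token with a single head is the special case of a one-element list, so encode_heads([(8, "SUBJECT")]) gives the value for a token whose head column was 8 and whose deprel column was SUBJECT.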

Creating a user corpus from the CoNLL format in the interface

  1. Create a corpus by uploading files and selecting the custom configuration option that fits the attributes in your CoNLL file.
  2. Add the line DEFAULTATTR "word" to the configuration file. Otherwise the corpus stays empty: the first column, which holds the IDs, is taken as “word” by default, so its values are treated as non-words.
  3. To activate word sketches, once your corpus is compiled, go to Manage Corpus → Compile → Expert settings and click the Save & compile button.

If you have any problems or questions about creating user corpora from the CoNLL format, please contact us.


Reference

Buchholz, Sabine, and Erwin Marsi. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2006.