Introduction
Sketch Engine usually generates word sketches using a sketch grammar. Formally, the result of processing sketch grammar queries is nothing more than a (potentially ambiguous and incomplete) dependency parse of a sentence, so data that has been syntactically parsed by third-party tools can be used to generate the word sketches instead.
How to set up
To instruct compilecorp to build sketches from parsed data, set the WSDEF directive to any file path ending with .conll (preferably in the corpus data directory). The file is created automatically and contains the grammatical relations that have been indexed. Internally, compilecorp uses the sconll2sketch script to build the sketches.
Data formats
There are two supported formats:
1. CoNLL
We support a superset of CoNLL, as described here.
A sample in CoNLL format for two sentences in Turkish is shown below:
id word lempos tag fineTag head deprel
=========================================================
<s name="2">
1 Eğer eğer-c Conj Conj 13 S.MODIFIER
2 ki ki-c Conj Conj 1 INTENSIFIER
3 ülkelere ülke-n Noun Noun 4 OBJECT
4 ve ve-c Conj Conj 12 COORDINATION
5 onların o-p Pron Pron 8 SUBJECT
6 özelliklerine özellik-n Noun Noun 8 DATIVE.ADJUNCT
7 ilginiz ilgi-n Noun Noun 8 SUBJECT
8 varsa var-v Verb Verb 12 MODIFIER
9 bu bu-d Det Det 10 DETERMINER
10 bölüm bölüm-n Noun Noun 12 SUBJECT
11 ilginizi ilgi-n Noun Noun 12 OBJECT
12 çekebilir çek-v Verb Verb 13 SENTENCE
13 . .-x Punc Punc 0 ROOT
</s>
<s name="3">
1 Burada bura-n Noun Noun 11 LOCATIVE.ADJUNCT
2 genel genel-j Adj Adj 3 MODIFIER
3 anlamda anlam-n Noun Noun 11 MODIFIER
4 ülkelere ülke-n Noun Noun 5 OBJECT
5 ait ait-o Postp Postp 6 MODIFIER
6 bilgiler bilgi-n Noun Noun 11 OBJECT
7 , ,-x Punc Punc 8 notconnected
8 tanımlar tanım-n Noun Noun 11 SUBJECT
9 , ,-x Punc Punc 10 notconnected
10 uyarılar uyarı-n Noun Noun 11 OBJECT
11 bulabileceksiniz bul-v Verb Verb 12 SENTENCE
12 . .-x Punc Punc 0 ROOT
</s>
The vertical format above has 7 columns, not all of which are mandatory. You can add as many further columns as you like, as well as any other structural annotation such as sentences, paragraphs or documents.
The only columns that are mandatory for generating word sketches are:
- id: this represents the id/position of the current word. The name of this column can be overridden by setting the IDATTR configuration directive.
- One positional attribute (typically word, lemma or lempos) used for generating the sketches. The name of this attribute is set by the WSATTR configuration directive.
- head: this represents the parent node id of the current word. The name of this column can be overridden by setting the HEADATTR configuration directive.
- deprel: this represents the relation by which the current node and parent node are connected. The name of this column can be overridden by setting the DEPRELATTR configuration directive.
In the example above, in the sentence <s name="2">, onların (id=5) is the subject of varsa (id=8).
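The way the id, head and deprel columns combine into a grammatical relation can be illustrated with a short Python sketch. It is only an illustration, not the code compilecorp or sconll2sketch runs: the function names `read_sentences` and `dependencies` are hypothetical, the columns are assumed to be tab-separated as in a vertical file, and the column order is taken from the sample above.

```python
from typing import Iterable, Iterator, NamedTuple

# Column order of the sample above; adjust it to match your own vertical file.
COLUMNS = ["id", "word", "lempos", "tag", "fineTag", "head", "deprel"]


class Dependency(NamedTuple):
    relation: str   # value of the deprel column
    head: str       # sketch attribute (e.g. lempos) of the parent token
    dependent: str  # sketch attribute of the current token


def read_sentences(lines: Iterable[str]) -> Iterator[list]:
    """Yield one list of token dicts per <s>...</s> structure."""
    tokens = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("<") and line.endswith(">"):
            # Structure tag: a closing </s> ends the current sentence.
            if line.startswith("</s") and tokens:
                yield tokens
                tokens = []
            continue
        tokens.append(dict(zip(COLUMNS, line.split("\t"))))
    if tokens:
        yield tokens


def dependencies(sentence: list, wsattr: str = "lempos") -> Iterator[Dependency]:
    """Resolve each token's head id to the parent token within the sentence."""
    by_id = {tok["id"]: tok for tok in sentence}
    for tok in sentence:
        parent = by_id.get(tok["head"])
        if parent is None:
            # Head 0 (the root) or a head outside this excerpt: nothing to link.
            continue
        yield Dependency(tok["deprel"], parent[wsattr], tok[wsattr])


# Excerpt of sentence 2 from the sample (tokens 5 to 8 only).
SAMPLE_ROWS = [
    "5 onların o-p Pron Pron 8 SUBJECT",
    "6 özelliklerine özellik-n Noun Noun 8 DATIVE.ADJUNCT",
    "7 ilginiz ilgi-n Noun Noun 8 SUBJECT",
    "8 varsa var-v Verb Verb 12 MODIFIER",
]
SAMPLE = ['<s name="2">'] + [row.replace(" ", "\t") for row in SAMPLE_ROWS] + ["</s>"]

if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        for dep in dependencies(sentence):
            print(dep.relation, dep.head, dep.dependent)
```

With `wsattr="lempos"`, token 5 yields the triple (SUBJECT, var-v, o-p), which corresponds to the statement that onların is the subject of varsa. Token 8 is skipped in this excerpt only because its head (id=12) is not included.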
2. SCoNLL
If you want to build sketches from ambiguous (i.e. multi-head or multi-dependency) output, you can provide data in SCoNLL (aka Sketch CoNLL) format. In this format, you remove the head attribute entirely (from both the corpus configuration and the source vertical) and encode the head information in the deprel attribute, which consists of a semicolon-separated list of head-relation pairs, each pair joined by a comma, e.g. head1,relation1;head2,relation2;head3,relation3.
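As a rough illustration of this encoding, the following Python sketch folds the head column of a CoNLL row into the deprel column and decodes a multi-head deprel value again. The helper names are hypothetical, and the separators are an assumption read off the example in the text (a comma inside each head-relation pair, a semicolon between pairs); check them against your data before relying on them.

```python
from typing import List, Tuple

PAIR_SEP = ";"   # assumed separator between head-relation pairs
FIELD_SEP = ","  # assumed separator between head and relation


def encode_deprel(analyses: List[Tuple[str, str]]) -> str:
    """Join (head, relation) pairs into a single SCoNLL deprel value."""
    return PAIR_SEP.join(f"{head}{FIELD_SEP}{rel}" for head, rel in analyses)


def decode_deprel(value: str) -> List[Tuple[str, str]]:
    """Split a SCoNLL deprel value back into (head, relation) pairs."""
    pairs = []
    for item in value.split(PAIR_SEP):
        head, _, relation = item.partition(FIELD_SEP)
        pairs.append((head, relation))
    return pairs


def conll_row_to_sconll(fields: List[str], head_idx: int, deprel_idx: int) -> List[str]:
    """Drop the head column and store 'head,relation' in the deprel column."""
    merged = encode_deprel([(fields[head_idx], fields[deprel_idx])])
    out = [f for i, f in enumerate(fields) if i != head_idx]
    # Removing the head column shifts deprel left if head came before it.
    new_idx = deprel_idx - 1 if head_idx < deprel_idx else deprel_idx
    out[new_idx] = merged
    return out


if __name__ == "__main__":
    row = ["5", "onların", "o-p", "Pron", "Pron", "8", "SUBJECT"]
    print("\t".join(conll_row_to_sconll(row, head_idx=5, deprel_idx=6)))
    # A hypothetical second analysis (12,OBJECT) added only to show ambiguity.
    print(encode_deprel([("8", "SUBJECT"), ("12", "OBJECT")]))
```

The second analysis in the usage example is invented for illustration; it does not come from the Turkish sample.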
Creating a user corpus from the CoNLL format in the interface
- Create a corpus by uploading files and selecting the custom configuration option that fits the attributes in your CoNLL file.
- Add the line
DEFAULTATTR "word"
into the configuration file (otherwise, the corpus stays empty because the first column with IDs is taken as “word” by default, and thus they are considered to be non-words).
- To activate word sketches, once your corpus is compiled, go to Manage Corpus → Compile → Expert settings and click the Save & compile button.
If you have any problems or questions about creating user corpora from the CoNLL format, please contact us.
Reference
Buchholz, Sabine, and Erwin Marsi. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2006.