A GDEX configuration file controls the Good Dictionary Examples (GDEX) system, which evaluates sentences for their suitability as dictionary or teaching examples. In the configuration, you define what good sentences should look like (e.g. a specific sentence length, a preference for frequent words) and filter out inappropriate sentences (e.g. sentences that are too long or that contain vulgarisms).

GDEX configuration introduction

GDEX configuration files are written in YAML, a human-readable (and human-editable) format that is also suitable for efficient machine processing. The actual formula for calculating sentence scores is an expression in the Python programming language. Its syntax is limited to basic mathematical and logical operations, as well as function calls to pre-defined GDEX classifiers. Several variables, such as the values of positional attributes, are available. Named variables defined in the configuration file, typically regular expressions, can also be referenced in the formula.

The file contains two top-level keys – the mandatory formula and optional variables. Unless the formula fits on a single line, it must be preceded by the > YAML symbol for multi-line values. YAML does not allow the tab character, so you must use spaces for indentation. This is what a simple configuration file may look like:

formula: >
    (50 * is_whole_sentence() * blacklist(words, illegal_chars) * blacklist(lemmas, parsnips)
    + 50 * optimal_interval(length, 10, 14)
    * greylist(words, rare_chars, 0.1)
    * greylist(tags, pronouns, 0.1)
    ) / 100
variables:
    illegal_chars: ([<|\]\[>/\^@])
    rare_chars: ([A-Z0-9'.,!?)(;:-])
    pronouns: PRON.*
    parsnips: ^(tory|whisky|cowgirl|meth|commie|bacon)$

The formula is expected to evaluate to a number between 0 (worst) and 1 (best). Values outside this range are changed to the nearest limit, so a result below 0 becomes 0 and a result above 1 becomes 1.
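The limiting step can be pictured with a minimal Python sketch. The function name clamp_score is hypothetical and only illustrates the behaviour described above; it is not part of GDEX.

```python
def clamp_score(raw):
    """Limit a raw formula result to the range 0..1 (hypothetical helper)."""
    return max(0.0, min(1.0, raw))


# A formula that overshoots or undershoots is pulled back to the nearest limit.
too_high = clamp_score(1.3)
too_low = clamp_score(-0.2)
in_range = clamp_score(0.42)
```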

Apart from the variables (actually constants) defined in the configuration, these important built-in ones are available:

  • length – sentence length (number of tokens including punctuation)
  • kw_start and kw_end – the position of the keyword (range: 0–length)
  • words, tags, lemmas, lemposs, lemma_lcs – a list of values for every positional attribute (attribute name + “s”)

The variables defined in the sample configuration above have these meanings:

  • illegal_chars – sentences containing one or more of these characters are penalized
  • rare_chars – sentences containing these characters are penalized, but less than for the characters in the illegal_chars list
  • pronouns – a regular expression which, in this case, matches all tokens whose PoS tag begins with PRON
  • parsnips – a list of words from taboo topics; sentences with these words are penalized

This is a general example of how variables in the GDEX configuration can work and how they can be named. You can name them differently or add more variables that check further aspects of sentence structure.

The attribute lists can be used as a whole (for example as a parameter to a classifier) or you can even access individual tokens using standard Python syntax. For example, <span style="font-family: monospace;">words[0]</span> is the first word in the sentence and <span style="font-family: monospace;">tags[-1]</span> is the tag of the last token.
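As a small illustration of the indexing described above, the following sketch uses ordinary Python lists in place of the attribute lists. The sentence and the tag names are made up for the example; real tags depend on the corpus tagset.

```python
# Hypothetical attribute lists for the sentence "The cat sat ."
words = ["The", "cat", "sat", "."]
tags = ["DET", "NOUN", "VERB", "PUNCT"]  # assumed tag names

first_word = words[0]   # first token of the sentence
last_tag = tags[-1]     # tag of the last token
length = len(words)     # token count, including punctuation
```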

Classifiers

blacklist

<span style="font-family: monospace;">blacklist(tokens, pattern)</span> returns 1 if none of the tokens (e.g. words, lemmas) matches pattern (a regular expression), and 0 otherwise.
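The following is a minimal sketch of the behaviour described above, not the actual GDEX implementation. It assumes that the pattern may match anywhere inside a token (re.search); the real classifier may anchor the match differently.

```python
import re


def blacklist(tokens, pattern):
    """Return 1 if no token matches pattern, otherwise 0 (illustrative sketch)."""
    regex = re.compile(pattern)
    # Assumption: a match anywhere inside the token counts as a hit.
    return 0 if any(regex.search(token) for token in tokens) else 1


clean = blacklist(["a", "plain", "sentence"], r"[@<>]")
dirty = blacklist(["mail", "me@example", "now"], r"[@<>]")
```

Because the result is 0 or 1, multiplying a score by blacklist either keeps the score unchanged or reduces it to zero.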

greylist

<span style="font-family: monospace;">greylist(tokens, pattern, penalty)</span> is similar to blacklist, but you specify a penalty that is subtracted from 1 for each token matching pattern, down to a minimum of 0. With a penalty of 1, it behaves like blacklist.
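A rough sketch of this scoring rule, under the same assumption that the pattern may match anywhere inside a token. It is meant to illustrate the arithmetic only and is not the actual GDEX code.

```python
import re


def greylist(tokens, pattern, penalty):
    """Subtract penalty from 1 for every matching token, never going below 0."""
    regex = re.compile(pattern)
    # Assumption: a match anywhere inside the token counts as a hit.
    hits = sum(1 for token in tokens if regex.search(token))
    return max(0.0, 1.0 - penalty * hits)


two_hits = greylist(["a", "b", "c"], r"[ab]", 0.25)    # 1 - 2 * 0.25
floored = greylist(["a", "a", "a"], r"a", 0.5)         # 1 - 3 * 0.5, limited to 0
like_blacklist = greylist(["a", "x"], r"a", 1)         # one hit with penalty 1
```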

optimal_interval

<span style="font-family: monospace;">optimal_interval(value, low, high)</span> returns 1 if <span style="font-family: monospace;">value</span> is between <span style="font-family: monospace;">low</span> and <span style="font-family: monospace;">high</span>. Outside this range, the score linearly rises from 0 at <span style="font-family: monospace;">low</span>/2 to 1 at <span style="font-family: monospace;">low</span> and falls from 1 at <span style="font-family: monospace;">high</span> to 0 at 2*<span style="font-family: monospace;">high</span>. For <span style="font-family: monospace;">value</span> lower than <span style="font-family: monospace;">low</span>/2 or higher than 2*<span style="font-family: monospace;">high</span>, the score is 0. Usually used with <span style="font-family: monospace;">length</span>.
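The piecewise-linear shape described above can be sketched as follows. This is an illustration written from the description, assuming low is greater than 0; it is not taken from the GDEX source.

```python
def optimal_interval(value, low, high):
    """Score 1 inside [low, high], falling linearly to 0 at low/2 and at 2*high."""
    if low <= value <= high:
        return 1.0
    if value < low:
        # Rises from 0 at low/2 to 1 at low (assumes low > 0).
        return max(0.0, (value - low / 2) / (low / 2))
    # Falls from 1 at high to 0 at 2*high.
    return max(0.0, (2 * high - value) / high)


inside = optimal_interval(12, 10, 14)
half_way_up = optimal_interval(7.5, 10, 14)
half_way_down = optimal_interval(21, 10, 14)
far_too_long = optimal_interval(40, 10, 14)
```

With the sample configuration's optimal_interval(length, 10, 14), a 21-token sentence would therefore receive half of the length score, and sentences of 5 tokens or fewer, or of 28 tokens or more, would receive none.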

is_whole_sentence

<span style="font-family: monospace;">is_whole_sentence()</span> (mind the parentheses) returns 1 if the sentence starts with a capitalized word and ends with a full stop, question mark or exclamation mark. Otherwise, it returns 0.
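A minimal sketch of this check is shown below. Unlike the real classifier, which takes no arguments, the sketch receives the list of words explicitly so that it can run on its own; how GDEX treats edge cases such as closing quotation marks is not covered here.

```python
def is_whole_sentence(words):
    """Return 1 if the first word is capitalized and the last token is . ? or !"""
    if not words:
        return 0
    starts_capitalized = words[0][:1].isupper()
    ends_with_terminator = words[-1] in {".", "?", "!"}
    return 1 if starts_capitalized and ends_with_terminator else 0


complete = is_whole_sentence(["The", "cat", "sat", "."])
fragment = is_whole_sentence(["and", "then", "it", "left"])
```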

word_frequency

<span style="font-family: monospace;">word_frequency(word)</span> returns the absolute frequency of the given word in the corpus. <span style="font-family: monospace;">word_frequency(word, normalize)</span> returns the relative frequency per <span style="font-family: monospace;">normalize</span> tokens. For example: <span style="font-family: monospace;">word_frequency(words[1], 1000)</span>
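To make the difference between the two call forms concrete, here is a sketch over a toy corpus. It assumes that the relative frequency is computed as the absolute frequency multiplied by normalize and divided by the corpus size in tokens; the factory function make_word_frequency is hypothetical and exists only to supply the corpus counts.

```python
from collections import Counter


def make_word_frequency(corpus_tokens):
    """Build a word_frequency function for a toy corpus (hypothetical helper)."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)

    def word_frequency(word, normalize=None):
        absolute = counts[word]
        if normalize is None:
            return absolute
        # Assumption: relative frequency = occurrences per `normalize` tokens.
        return absolute * normalize / total

    return word_frequency


word_frequency = make_word_frequency(["the", "cat", "the", "dog"])
absolute = word_frequency("the")
per_thousand = word_frequency("the", 1000)
```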

keyword_position

<span style="font-family: monospace;">keyword_position()</span> returns a number between 0 and 1, starting at zero for a keyword at the beginning of the sentence, and rising in equal increments (depending on the length of the sentence) to 1 for a keyword at the end of the sentence.
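One way to read this description is sketched below. It assumes the score is kw_start divided by the index of the last token; the exact formula used by GDEX is not stated here, and the sketch takes kw_start and length as explicit arguments although the real classifier takes none.

```python
def keyword_position(kw_start, length):
    """Map the keyword's start index to 0..1 in equal steps (assumed formula)."""
    if length <= 1:
        return 0.0
    return kw_start / (length - 1)


at_start = keyword_position(0, 5)
in_middle = keyword_position(2, 5)
at_end = keyword_position(4, 5)
```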

keyword_repetition

<span style="font-family: monospace;">keyword_repetition()</span> returns the number of occurrences of the keyword in the sentence.
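As a simple illustration, the count can be sketched like this. The sketch takes the token list and the keyword as explicit arguments and compares tokens exactly; which attribute the real classifier compares (word form or lemma) is not stated here, so this is an assumption.

```python
def keyword_repetition(tokens, keyword):
    """Count how many tokens equal the keyword (illustrative sketch)."""
    return sum(1 for token in tokens if token == keyword)


lemmas = ["the", "dog", "chase", "another", "dog", "."]
repeats = keyword_repetition(lemmas, "dog")
```

Since the result is a count rather than a value between 0 and 1, a formula would typically compare it with a threshold, for example to penalize sentences where the count is greater than 1.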