Please first read what dynamic attributes are and how they are set up in the corpus configuration file documentation.
Internal dynamic functions
The following list gives an overview of the existing built-in dynamic functions, followed by examples of usage:
- striplastn (str, n) - returns str stripped of its last n characters
- lowercase (str, locale) - returns str in lowercase (for any single-byte encoding and the corresponding locale)
- utf8lowercase (str) - returns str in lowercase (for any utf-8 encoded string str)
- utf8uppercase (str) - returns str in uppercase (for any utf-8 encoded string str)
- utf8capital (str) - returns str with first character capitalized (for any utf-8 encoded string str)
- getfirstn (str, n) - returns first n characters of str
- getlastn (str, n) - returns last n characters of str (for any single-byte encoding)
- utf8getlastn (str, n) - returns last n characters of str (for any utf-8 encoded string)
- getfirstbysep (str, c) - returns prefix of str up to the character c (excluding)
- getnbysep (str, c, n) - returns n-th component of str according to the delimiter c (excluding)
- getnchar (str, n) - returns n-th character of str
- getnextchars (str, c, n) - returns n characters after character c
- getnextchar (str, c) - returns the character after character c
- url2domain (str, n) - returns n-th component of the URL (0 = web domain, 1 = top level domain, 2 = second level domain)
- ascii (str, enc, locale) - returns ASCII transliteration of the string according to the given encoding and locale
ATTRIBUTE lemma {
    DYNAMIC  striplastn
    DYNLIB   internal
    ARG1     "2"
    FUNTYPE  i
    FROMATTR lempos
    DYNTYPE  index
}
ATTRIBUTE "lemma2" {
    ARG1     "-"
    ARG2     "1"
    DYNAMIC  "getnbysep"
    DYNLIB   "internal"
    DYNTYPE  "index"
    FROMATTR "lempos2"
    FUNTYPE  "ci"
}
ATTRIBUTE lc {
    DYNAMIC    lowercase
    DYNLIB     internal
    ARG1       "C"
    FUNTYPE    s
    FROMATTR   word
    DYNTYPE    index
    TRANSQUERY yes
}
ATTRIBUTE tag {
    DYNAMIC  getfirstn
    DYNLIB   internal
    ARG1     "3"
    FUNTYPE  i
    FROMATTR ambtag
    DYNTYPE  index
}
ATTRIBUTE k {
    DYNAMIC  getnchar
    DYNLIB   internal
    ARG1     1
    FUNTYPE  i
    FROMATTR tag
    DYNTYPE  index
}
ATTRIBUTE g {
    DYNAMIC  getnextchar
    DYNLIB   internal
    ARG1     "g"
    FUNTYPE  c
    FROMATTR tag
    DYNTYPE  index
}
ATTRIBUTE g3 {
    DYNAMIC  getnextchars
    DYNLIB   internal
    ARG1     "g"
    ARG2     3
    FUNTYPE  ci
    FROMATTR tag
    DYNTYPE  index
}
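To make the descriptions above more concrete, the following is a rough Python emulation of a few of the listed functions. It is only a sketch of the described behaviour, not the actual built-in implementation. In particular, it assumes that getnbysep and getnchar count from 1 (as the ARG values in the examples suggest) and that an empty string is returned when nothing matches; the sample values "house-n" and "k1gMnSc1" are hypothetical.

```python
def striplastn(s, n):
    """Return s without its last n characters."""
    n = int(n)  # arguments arrive as strings in the configuration file
    return s[:-n] if n > 0 else s


def getfirstn(s, n):
    """Return the first n characters of s."""
    return s[:int(n)]


def getnbysep(s, sep, n):
    """Return the n-th component of s split on sep (assumed 1-based)."""
    parts = s.split(sep)
    n = int(n)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


def getnchar(s, n):
    """Return the n-th character of s (assumed 1-based)."""
    n = int(n)
    return s[n - 1] if 1 <= n <= len(s) else ""


def getnextchar(s, c):
    """Return the character following the first occurrence of c in s."""
    pos = s.find(c)
    if pos == -1 or pos + 1 >= len(s):
        return ""  # assumption: empty result when c is missing or last
    return s[pos + 1]


# Hypothetical values: a lempos "house-n" and a positional tag "k1gMnSc1".
print(striplastn("house-n", "2"))
print(getnbysep("house-n", "-", "1"))
print(getfirstn("k1gMnSc1", "3"))
print(getnextchar("k1gMnSc1", "g"))
```

Under these assumptions, the first two calls both recover the lemma from the lempos value, which mirrors what the lemma and lemma2 attributes above are configured to do.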
Dynamic functions from a shared library
A shared library function must return const char*.
The following example function takes the publication year of a document and returns the epoch the document belongs to.
- the source code (epoch.c):
#include <stdio.h>

/* Map a publication year (given as a string) to an epoch label. */
const char * epoch (char* year) {
    int y = 0;  /* stays 0 if the year cannot be parsed */
    sscanf(year, "%d", &y);
    if (y < 1990) return "before 1990";
    if (y < 2001) return "1990-2000";
    if (y < 2005) return "2001-2004";
    if (y < 2009) return "2005-2008";
    return "2009 and later";
}
- to compile the library use:
gcc -Wall -fPIC -DPIC -shared -o epoch.so epoch.c
- the relevant part of the corpus configuration file:
STRUCTURE doc {
    ATTRIBUTE year
    ATTRIBUTE time {
        DYNAMIC    epoch
        DYNLIB     "/corpora/vert/greek/epoch.so"
        FUNTYPE    0
        FROMATTR   year
        DYNTYPE    index
        TRANSQUERY yes
    }
}
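Before referencing a freshly compiled library in the configuration file, it may be worth checking that the function can be loaded and returns what is expected. The following is a minimal sketch using Python's standard ctypes module. The helper names load_string_function and call_string_function are hypothetical, and the sketch assumes the function takes a single char* argument and returns const char*, as epoch does.

```python
import ctypes


def load_string_function(library_path, name):
    """Load `name` from a shared library as a char* -> const char* function."""
    lib = ctypes.CDLL(library_path)
    func = getattr(lib, name)
    func.argtypes = [ctypes.c_char_p]
    func.restype = ctypes.c_char_p
    return func


def call_string_function(func, value, encoding="utf-8"):
    """Call func with a Python string and decode the returned C string."""
    result = func(value.encode(encoding))
    # A NULL pointer comes back from ctypes as None.
    return None if result is None else result.decode(encoding)


# Example, assuming epoch.so was built in the current directory:
#   epoch = load_string_function("./epoch.so", "epoch")
#   print(call_string_function(epoch, "1995"))
```

This only checks the library in isolation; it says nothing about how the corpus tools load or call it.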
Dynamic functions from a shell script
In this case the dynamic function is implemented as a shell pipe, e.g.:
ATTRIBUTE "case" {
    DYNAMIC  "/somewhere/somescript.py"
    DYNLIB   "pipe"
    DYNTYPE  "freq"
    FROMATTR "tag"
    LABEL    "grammatical case"
}
Here somescript.py reads one line at a time from standard input, applies the transformation and writes the result to standard output, e.g.:
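#!/usr/bin/python3
import sys

def do_something(tag):
    # placeholder: replace with the actual transformation
    return tag

for tag in sys.stdin:
    new_tag = do_something(tag.strip())
    # flush so that each result is written immediately
    print(new_tag, flush=True)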
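As an illustration of what such a transformation might look like for the "case" attribute, the sketch below extracts the character following a case marker from a positional tag. The tag layout (a lowercase "c" followed by the case value, as in the hypothetical tag "k1gMnSc1") and the function names are assumptions made for this example. It also assumes that exactly one output line is expected per input line, so an empty line is written when a tag carries no case.

```python
import sys

CASE_MARKER = "c"  # assumption: the case value follows this character


def tag_to_case(tag):
    """Return the character after CASE_MARKER, or "" if there is none."""
    pos = tag.find(CASE_MARKER)
    if pos == -1 or pos + 1 >= len(tag):
        return ""
    return tag[pos + 1]


def transform_lines(lines):
    """Yield one transformed value per input line, preserving order."""
    for line in lines:
        yield tag_to_case(line.strip())


def run(inp, out):
    """Read tags from inp and write one result line per tag to out."""
    for value in transform_lines(inp):
        out.write(value + "\n")
        out.flush()  # do not hold results back in a buffer
```

To use this as the pipe script, end the file with a call to run(sys.stdin, sys.stdout).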