To cite Sketch Engine in academic publications, use the following papers. If you refer to Sketch Engine in general, choose from the papers under General Reference.
General Reference
@article{kilgarriff2014sketch,
title={The Sketch Engine: ten years on},
author={Kilgarriff, Adam and Baisa, Vít and Bušta, Jan and Jakubíček, Miloš and Kovář, Vojtěch and Michelfeit, Jan and Rychlý, Pavel and Suchomel, Vít},
journal={Lexicography},
year={2014},
volume={1},
pages={7--36},
publisher={Springer}
}
@article{kilgarriff2004sketch,
title={The Sketch Engine},
author={Kilgarriff, Adam and Rychlý, Pavel and Smrž, Pavel and Tugwell, David},
journal={Proceedings of the 11th EURALEX International Congress},
year={2004},
volume={},
pages={105--116},
publisher={Université de Bretagne-Sud, Faculté des lettres et des sciences humaines}
}
Please also mention the following web address: http://www.sketchengine.eu
logDice statistic
A statistical measure used to compute word sketches since 2008.
@article{rychlý2008lexicographer,
title={A Lexicographer-Friendly Association Score},
author={Rychlý, Pavel},
journal={Proc. 2nd Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN},
year={2008},
volume={2},
pages={6--9},
publisher={Masaryk University}
}
Evaluation of Word Sketches
@article{kilgarriff2010quantitative,
title={A quantitative evaluation of word sketches},
author={Kilgarriff, Adam and Kovář, Vojtěch and Krek, Simon and Srdanović, Irena and Tiberius, Carole},
journal={Proceedings of the 14th EURALEX International Congress},
year={2010},
volume={},
pages={372--79},
publisher={Fryske Akademy-Afûk}
}
All statistics used in Sketch Engine
For a paper detailing the statistics used in Sketch Engine, visit the relevant page in the Research section.
Corpus query language (CQL)
@article{jakubíček2010fast,
title={Fast Syntactic Searching in Very Large Corpora for Many Languages},
author={Jakubíček, Miloš and Kilgarriff, Adam and McCarthy, Diana and Rychlý, Pavel},
journal={PACLIC},
year={2010},
volume={},
pages={741--47},
publisher={Tohoku University}
}
Full bibliography of Sketch Engine
2024
- M. Medveď, R. Sabol, A. Horák. SlamaTrain – Representative Training Dataset for Slavonic Large Language Models. Recent Advances in Slavonic Natural Language Processing, RASLAN 2024: 25-33, 2024.
@article{medveď2024slamatrain, title={SlamaTrain – Representative Training Dataset for Slavonic Large Language Models}, author={Medveď, M. and Sabol, R. and Horák, A.}, journal={Recent Advances in Slavonic Natural Language Processing, RASLAN 2024}, year={2024}, pages={25--33}, publisher={Tribun EU} }
- V. Ohlídalová, M. Jakubíček. A New Czech Pipeline in Sketch Engine. Recent Advances in Slavonic Natural Language Processing, RASLAN 2024: 101-107, 2024.
@article{ohlídalová2024new, title={A New Czech Pipeline in Sketch Engine}, author={Ohlídalová, V. and Jakubíček, M.}, journal={Recent Advances in Slavonic Natural Language Processing, RASLAN 2024}, year={2024}, pages={101--107}, publisher={Tribun EU} }
- M. Jakubíček, M. Rundell, C. Breathnach, P. Ó Mianáin. DANTE Resurrected: Large Open Source Lexical Data Database for English. Lexicography and Semantics (Book of Abstract of the XXI EURALEX International Congress): 102-104, 2024.
@article{jakubíček2024dante, title={DANTE Resurrected: Large Open Source Lexical Data Database for English}, author={Jakubíček, M. and Rundell, M. and Breathnach, C. and Ó Mianáin, P.}, journal={Lexicography and Semantics (Book of Abstract of the XXI EURALEX International Congress)}, year={2024}, pages={102--104}, publisher={Institut za hrvatski jezik, Željko Jozić} }
- F. Kovařík, V. Kovář, M. Blahuš. On Rapid Annotation of Czech Headwords – Analysing the First Tasks of Czech Dictionary Express. Lexicography and Semantics (Proceedings of the XXI EURALEX International Congress): 336-344, 2024.
@article{kovařík2024rapid, title={On Rapid Annotation of Czech Headwords – Analysing the First Tasks of Czech Dictionary Express}, author={Kovařík, F. and Kovář, V. and Blahuš, M.}, journal={Lexicography and Semantics (Proceedings of the XXI EURALEX International Congress)}, year={2024}, pages={336--344}, publisher={Institut za hrvatski jezik, Željko Jozić} }
- M. Blahuš, M. Cukr, M. Jakubíček, V. Kovář, F. Kovařík. Dictionary Express: First Phases Rapid dictionary-making method for European, Asian and other languages. Asian Lexicography Merging cutting-edge and established approaches (AsiaLex 2024): 84-89, 2024.
@article{blahuš2024dictionary, title={Dictionary Express: First Phases Rapid dictionary-making method for European, Asian and other languages}, author={Blahuš, M. and Cukr, M. and Jakubíček, M. and Kovář, V. and Kovařík, F.}, journal={Asian Lexicography Merging cutting-edge and established approaches (AsiaLex 2024)}, year={2024}, pages={84--89}, publisher={} }
- O. Mikušek. One Year of Continuous and Automatic Data Gathering from Parliaments of European Union Member States. Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024: 149-153, 2024.
@article{mikušek2024one, title={One Year of Continuous and Automatic Data Gathering from Parliaments of European Union Member States}, author={Mikušek, O.}, journal={Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024}, year={2024}, pages={149--153}, publisher={ELRA Language Resource Association: CC BY-NC 4.0} }
- O. Herman, M. Jakubíček. ShadowSense: a Multi-annotated Dataset for Evaluating Word Sense Induction. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): 14763-14769, 2024.
@article{herman2024shadowsense, title={ShadowSense: a Multi-annotated Dataset for Evaluating Word Sense Induction}, author={Herman, O. and Jakubíček, M.}, journal={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)}, year={2024}, pages={14763--14769}, publisher={ELRA Language Resource Association: CC BY-NC 4.0} }
2023
- F. Kovařík. Semi-automatic Dictionary Creation for Czech Using Automatisation to Create a Rapid Czech Dictionary. Recent Advances in Slavonic Natural Language Processing, RASLAN 2023: 93-100, 2023.
@article{kovařík2023semi, title={Semi-automatic Dictionary Creation for Czech Using Automatisation to Create a Rapid Czech Dictionary}, author={Kovařík, F.}, journal={Recent Advances in Slavonic Natural Language Processing, RASLAN 2023}, year={2023}, pages={93--100}, publisher={Tribun EU} }
- M. Medveď, M. Jakubíček, V. Kovář, T. Svoboda. Development of the NVH Schema Format for Lexicographic Purposes. Recent Advances in Slavonic Natural Language Processing, RASLAN 2023: 101-106, 2023.
@article{medveď2023development, title={Development of the NVH Schema Format for Lexicographic Purposes}, author={Medveď, M. and Jakubíček, M. and Kovář, V. and Svoboda, T.}, journal={Recent Advances in Slavonic Natural Language Processing, RASLAN 2023}, year={2023}, pages={101--106}, publisher={Tribun EU} }
- O. Mikušek. Data Gathered with Automatic Tools from European Parliamentary Chambers. Recent Advances in Slavonic Natural Language Processing, RASLAN 2023: 107-112, 2023.
@article{mikušek2023data, title={Data Gathered with Automatic Tools from European Parliamentary Chambers}, author={Mikušek, O.}, journal={Recent Advances in Slavonic Natural Language Processing, RASLAN 2023}, year={2023}, pages={107--112}, publisher={Tribun EU} }
- M. Jakubíček, M. Rundell. The end of lexicography? Can ChatGPT outperform current tools for post-editing lexicography?. Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference: 518-533, 2023.
@article{jakubíček2023end, title={The end of lexicography? Can ChatGPT outperform current tools for post-editing lexicography?}, author={Jakubíček, M. and Rundell, M.}, journal={Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference}, year={2023}, pages={518--533}, publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.} }
- M. Blahuš, M. Cukr, O. Herman, M. Jakubíček, V. Kovář, J. Kraus, M. Medveď, V. Ohlídalová. Rapid Ukrainian-English Dictionary Creation Using Post-Edited Corpus Data. Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference: 613-637, 2023.
@article{blahuš2023rapid, title={Rapid Ukrainian-English Dictionary Creation Using Post-Edited Corpus Data}, author={Blahuš, M. and Cukr, M. and Herman, O. and Jakubíček, M. and Kovář, V. and Kraus, J. and Medveď, M. and Ohlídalová, V.}, journal={Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference}, year={2023}, pages={613--637}, publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.} }
- M. Blahuš, M. Jakubíček, M. Cukr, V. Kovář, V. Suchomel. Development of Evidence-Based Grammars for Terminology Extraction in OneClick Terms. Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference: 650-662, 2023.
@article{blahuš2023development, title={Development of Evidence-Based Grammars for Terminology Extraction in OneClick Terms}, author={Blahuš, M. and Jakubíček, M. and Cukr, M. and Kovář, V. and Suchomel, V.}, journal={Electronic lexicography in the 21st century. Proceedings of the eLex 2023 conference}, year={2023}, pages={650--662}, publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.} }
- V. Suchomel, M. Jakubíček, O. Matuška. Web corpora for under-resourced languages. Corpus Linguistics (CL2023), 2023.
@article{suchomel2023web, title={Web corpora for under-resourced languages}, author={Suchomel, V. and Jakubíček, M. and Matuška, O.}, journal={Corpus Linguistics (CL2023)}, year={2023}, pages={}, publisher={Lancaster, United Kingdom: Lancaster University} }
- M. Jakubíček, O. Matuška, M. Blahuš. Corpus-based Bilingual Terminology Extraction using One-Click Terms. Corpus Linguistics (CL2023), 2023.
@article{jakubíček2023corpus, title={Corpus-based Bilingual Terminology Extraction using One-Click Terms}, author={Jakubíček, M. and Matuška, O. and Blahuš, M.}, journal={Corpus Linguistics (CL2023)}, year={2023}, pages={}, publisher={Lancaster, United Kingdom: Lancaster University} }
- Antonio San Martín, Catherine Trekker, Juan Carlos Díaz-Bautista. Extracting the Agent-Patient Relation from Corpus With Word Sketches. Proceedings of the 4th Conference on Language, Data and Knowledge: 666-675, 2023.
@article{sanmartín2023extracting, title={Extracting the Agent-Patient Relation from Corpus With Word Sketches}, author={San Martín, Antonio and Trekker, Catherine and Díaz-Bautista, Juan Carlos}, journal={Proceedings of the 4th Conference on Language, Data and Knowledge}, year={2023}, pages={666--675}, publisher={} }
2022
- Miloš Jakubíček, Vojtěch Kovář, Michal Měchura, Adam Rambousek. Using NVH as a Backbone Format in the Lexonomy Dictionary Editor. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022: 55-61, 2022.
@article{jakubíček2022using, title={Using NVH as a Backbone Format in the Lexonomy Dictionary Editor}, author={Jakubíček, Miloš and Kovář, Vojtěch and Měchura, Michal and Rambousek, Adam}, journal={Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022}, year={2022}, pages={55--61}, publisher={Tribun EU} }
- Vladimír Benko. Aranea Go Middle East: Persicum. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022: 113-121, 2022.
@article{benko2022aranea, title={Aranea Go Middle East: Persicum}, author={Benko, Vladimír}, journal={Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022}, year={2022}, pages={113--121}, publisher={Tribun EU} }
- Matúš Kostka. Pipeline Effectiveness in the Sketch Engine. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022: 123-130, 2022.
@article{kostka2022pipeline, title={Pipeline Effectiveness in the Sketch Engine}, author={Kostka, Matúš}, journal={Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022}, year={2022}, pages={123--130}, publisher={Tribun EU} }
- Vít Suchomel, Jan Kraus. Semi-Manual Annotation of Topics and Genres in Web Corpora, The Cheap and Fast Way. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022: 141-148, 2022.
@article{suchomel2022semi, title={Semi-Manual Annotation of Topics and Genres in Web Corpora, The Cheap and Fast Way}, author={Suchomel, Vít and Kraus, Jan}, journal={Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022}, year={2022}, pages={141--148}, publisher={Tribun EU} }
- Antonio San Martín, Catherine Trekker, Pilar León-Araúz. Repérage automatisé de l’hyponymie dans des corpus spécialisés en français à l’aide de Sketch Engine. Terminology, 2022.
@article{sanmartín2022repérage, title={Repérage automatisé de l’hyponymie dans des corpus spécialisés en français à l’aide de Sketch Engine}, author={San Martín, Antonio and Trekker, Catherine and León-Araúz, Pilar}, journal={Terminology}, year={2022}, publisher={John Benjamins} }
- Virginijus Dadurkevicius, Andrius Utka. Estimating the Amount of Lithuanian Text Indexed by Global Search Engines. Baltic Journal of Modern Computing (BJMC): 326-336, 2022.
@article{dadurkevicius2022estimating, title={Estimating the Amount of Lithuanian Text Indexed by Global Search Engines}, author={Dadurkevicius, Virginijus and Utka, Andrius}, journal={Baltic Journal of Modern Computing (BJMC)}, year={2022}, pages={326--336}, publisher={University of Latvia} }
2021
- Vít Suchomel. Genre Annotation of Web Corpora: Scheme and Issues. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1: 738-754, 2021.
@article{suchomel2021genre, title={Genre Annotation of Web Corpora: Scheme and Issues}, author={Suchomel, Vít}, journal={Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1}, year={2021}, pages={738--754}, publisher={Springer International Publishing} }
- Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý. Million-Click Dictionary: Tools and Methods for Automatic Dictionary Drafting and Post-Editing. Book of Abstracts of the 19th EURALEX International Congress: 65-67, 2021.
@article{jakubíček2021million, title={Million-Click Dictionary: Tools and Methods for Automatic Dictionary Drafting and Post-Editing}, author={Jakubíček, Miloš and Kovář, Vojtěch and Rychlý, Pavel}, journal={Book of Abstracts of the 19th EURALEX International Congress}, year={2021}, pages={65--67}, publisher={SynMorPhoSe Lab, Democritus University of Thrace} }
- Vít Suchomel, Jan Kraus. Website Properties in Relation to the Quality of Text Extracted for Web Corpora. Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021: 167-175, 2021.
@article{suchomel2021website, title={Website Properties in Relation to the Quality of Text Extracted for Web Corpora}, author={Suchomel, Vít and Kraus, Jan}, journal={Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021}, year={2021}, pages={167--175}, publisher={Tribun EU} }
- Ondřej Herman. Precomputed Word Embeddings for 15+ Languages. Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021: 41-46, 2021.
@article{herman2021precomputed, title={Precomputed Word Embeddings for 15+ Languages}, author={Herman, Ondřej}, journal={Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021}, year={2021}, pages={41--46}, publisher={Tribun EU} }
- A. Rambousek, M. Jakubíček, Iztok Kosem. New developments in Lexonomy. Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference: 455-462, 2021.
@article{rambousek2021new, title={New developments in Lexonomy}, author={Rambousek, A. and Jakubíček, M. and Kosem, Iztok}, journal={Electronic lexicography in the 21st century. Proceedings of the eLex 2021 conference}, year={2021}, pages={455--462}, publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.} }
- Miloš Jakubíček, Emma Romani, Pavel Rychlý, Ondřej Herman. Development of HAMOD: a High Agreement Multi-lingual Outlier Detection dataset. Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021: 177-183, 2021.
@article{jakubíček2021development, title={Development of HAMOD: a High Agreement Multi-lingual Outlier Detection dataset}, author={Jakubíček, Miloš and Romani, Emma and Rychlý, Pavel and Herman, Ondřej}, journal={Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2021}, year={2021}, pages={177--183}, publisher={Tribun EU} }
- Antonio San Martín, Catherine Trekker. Adapting Word Sketches for Specialized Knowledge Extraction. Proceedings of the 14th International Conference of the Asian Association for Lexicography (ASIALEX 2021): 64-87, 2021.
@article{sanmartín2021adapting, title={Adapting Word Sketches for Specialized Knowledge Extraction}, author={San Martín, Antonio and Trekker, Catherine}, journal={Proceedings of the 14th International Conference of the Asian Association for Lexicography (ASIALEX 2021)}, year={2021}, pages={64--87}, publisher={Jakarta, Indonesia: Asialex} }
2020
- Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel. Current Challenges in Web Corpus Building. Proceedings of the 12th Web as Corpus Workshop: 1-4, 2020.
@article{jakubíček2020current, title={Current Challenges in Web Corpus Building}, author={Jakubíček, Miloš and Kovář, Vojtěch and Rychlý, Pavel and Suchomel, Vít}, journal={Proceedings of the 12th Web as Corpus Workshop}, year={2020}, pages={1--4}, publisher={Marseille, France: European Language Resources Association} }
- Antonio San Martín, Catherine Trekker, Pilar León-Araúz. Extraction of Hyponymic Relations in French with Knowledge-Pattern-Based Word Sketches. Proceedings of The 12th Language Resources and Evaluation Conference: 5955-5963, 2020.
@article{sanmartín2020extraction, title={Extraction of Hyponymic Relations in French with Knowledge-Pattern-Based Word Sketches}, author={San Martín, Antonio and Trekker, Catherine and León-Araúz, Pilar}, journal={Proceedings of The 12th Language Resources and Evaluation Conference}, year={2020}, pages={5955--5963} }
2019
- V. Baisa, M. Blahuš, M. Cukr, O. Herman, M. Jakubíček, V. Kovář, M. Medveď, M. Měchura, P. Rychlý, V. Suchomel. Automating dictionary production: a Tagalog-English-Korean dictionary from scratch. Proceedings of the 6th Biennial Conference on Electronic Lexicography, 2019.
@article{baisa2019automating, title={Automating dictionary production: a Tagalog-English-Korean dictionary from scratch}, author={Baisa, V. and Blahuš, M. and Cukr, M. and Herman, O. and Jakubíček, M. and Kovář, V. and Medveď, M. and Měchura, M. and Rychlý, P. and Suchomel, V.}, journal={Proceedings of the 6th Biennial Conference on Electronic Lexicography}, year={2019}, pages={}, publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.} }
- Kristina Koppel, Jelena Kallas, Maria Khokhlová, Vít Suchomel, Vít Baisa, Jan Michelfeit. SkELL Corpora as a Part of the Language Portal Sonaveeb: Problems and Perspectives. Proceedings of the 6th Biennial Conference on Electronic Lexicography, 2019.
@article{koppel2019skell, title={SkELL Corpora as a Part of the Language Portal Sonaveeb: Problems and Perspectives}, author={Koppel, Kristina and Kallas, Jelena and Khokhlová, Maria and Suchomel, Vít and Baisa, Vít and Michelfeit, Jan}, journal={Proceedings of the 6th Biennial Conference on Electronic Lexicography}, year={2019}, pages={}, publisher={Brno, Czech Republic: Lexical Computing CZ s.r.o.} }
- Ondřej Herman, Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý. Word Sense Induction Using Word Sketches. Proceedings of the 7th International Conference on Statistical Language and Speech Processing: 83-91, 2019.
@article{herman2019word, title={Word Sense Induction Using Word Sketches}, author={Herman, Ondřej and Jakubíček, Miloš and Kovář, Vojtěch and Rychlý, Pavel}, journal={Proceedings of the 7th International Conference on Statistical Language and Speech Processing}, year={2019}, pages={83--91}, publisher={Cham, Switzerland: Springer} }
- Miloš Jakubíček, Pavel Rychlý. A Distributional Multi-word Thesaurus in Sketch Engine. Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2019: 143-147, 2019.
@article{jakubíček2019distributional, title={A Distributional Multi-word Thesaurus in Sketch Engine}, author={Jakubíček, Miloš and Rychlý, Pavel}, journal={Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2019}, year={2019}, pages={143--147}, publisher={Tribun EU} }
2018
- M. Jakubíček, M. Měchura, V. Kovář, P. Rychlý. Practical Post-Editing Lexicography with Lexonomy and Sketch Engine. XVIII EURALEX International Congress: Lexicography in Global Contexts, 2018. (CC BY SA 4.0 The XVIII EURALEX International Congress: Lexicography in Global Contexts Book of Abstracts)
@article{jakubíček2018practical, title={Practical Post-Editing Lexicography with Lexonomy and Sketch Engine}, author={Jakubíček, M. and Měchura, M. and Kovář, V. and Rychlý, P.}, journal={XVIII EURALEX International Congress: Lexicography in Global Contexts}, year={2018}, pages={}, publisher={Ljubljana University Press, Faculty of Arts} }
- Vít Suchomel. csTenTen17, a Recent Czech Web Corpus. Twelfth Workshop on Recent Advances in Slavonic Natural Language Processing: 111-123, 2018.
@article{suchomel2018cstenten17, title={csTenTen17, a Recent Czech Web Corpus}, author={Suchomel, Vít}, journal={Twelfth Workshop on Recent Advances in Slavonic Natural Language Processing}, year={2018}, pages={111--123}, publisher={Tribun EU} }
- Pavel Rychlý, Radoslav Rábara, Ondřej Herman. Distributed Corpus Search. 6th Workshop on the Challenges in the Management of Large Corpora (LREC 2018 Workshop): 10-13, 2018.
@article{rychlý2018distributed, title={Distributed Corpus Search}, author={Rychlý, Pavel and Rábara, Radoslav and Herman, Ondřej}, journal={6th Workshop on the Challenges in the Management of Large Corpora (LREC 2018 Workshop)}, year={2018}, pages={10--13}, publisher={Japan: European Language Resource Association} }
- Gezahegn Tsegaye Lemma, Pavel Rychlý. An Update of the Manually Annotated Amharic Corpus. Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2018: 124-128, 2018.
@article{lemma2018update, title={An Update of the Manually Annotated Amharic Corpus}, author={Lemma, Gezahegn Tsegaye and Rychlý, Pavel}, journal={Proceedings of the Twelfth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2018}, year={2018}, pages={124--128}, publisher={Brno: Tribun EU} }
- Vít Suchomel. Building Large Corpora From The Web (presentation slides). Presented at Web Corpora as a Language Training Tool organised by Faculty of Arts of Comenius University in Bratislava, Linguistic Institute of the Slovak Academy of Sciences on November 23, 2018.
- Pilar León-Araúz, Antonio San Martín, Arianne Reimerink. The EcoLexicon English Corpus as an Open Corpus in Sketch Engine. XVIII EURALEX International Congress: Lexicography in Global Contexts: 893-901, 2018.
@article{león-araúz2018ecolexicona, title={The EcoLexicon English Corpus as an Open Corpus in Sketch Engine}, author={León-Araúz, Pilar and San Martín, Antonio and Reimerink, Arianne}, journal={XVIII EURALEX International Congress: Lexicography in Global Contexts}, year={2018}, pages={893--901}, publisher={Ljubljana University Press, Faculty of Arts} }
- Pilar León-Araúz, Antonio San Martín. The EcoLexicon Semantic Sketch Grammar: from Knowledge Patterns to Word Sketches. Proceedings of the LREC 2018 Workshop “Globalex 2018 – Lexicography & WordNets”: 94-99, 2018.
@article{león-araúz2018ecolexiconb, title={The EcoLexicon Semantic Sketch Grammar: from Knowledge Patterns to Word Sketches}, author={León-Araúz, Pilar and San Martín, Antonio}, journal={Proceedings of the LREC 2018 Workshop “Globalex 2018 – Lexicography \& WordNets”}, year={2018}, pages={94--99}, publisher={Miyazaki: Globalex} }
2017
- Miloš Jakubíček. The advent of post-editing lexicography. Kernerman Dictionary News, 25: 14-15, 2017.
@article{jakubíček2017advent, title={The advent of post-editing lexicography}, author={Jakubíček, Miloš}, journal={Kernerman Dictionary News}, year={2017}, volume={25}, pages={14--15}, publisher={Kernerman Dictionary} }
- Jelena Kallas, Vít Suchomel, Maria Khokhlova. Automated Identification of Domain Preferences of Collocations. Electronic Lexicography in the 21st Century. Proceedings of Elex 2017 Conference, 5: 30-320, 2017.
@article{kallas2017automated, title={Automated Identification of Domain Preferences of Collocations}, author={Kallas, Jelena and Suchomel, Vít and Khokhlova, Maria}, journal={Electronic Lexicography in the 21st Century. Proceedings of Elex 2017 Conference}, year={2017}, volume={5}, pages={30--320}, publisher={Lexical Computing CZ s.r.o.} }
- R. Rábara, P. Rychlý, O. Herman, M. Jakubíček. Accelerating Corpus Search Using Multiple Cores. Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+ BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI), 30: 30-34, 2017.
@article{rábara2017accelerating, title={Accelerating Corpus Search Using Multiple Cores}, author={Rábara, R. and Rychlý, P. and Herman, O. and Jakubíček, M.}, journal={Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+ BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI), 30}, year={2017}, volume={}, pages={30--34}, publisher={Institut für Deutsche Sprache} }
- J. Bušta, O. Herman, M. Jakubíček, S. Krek, B. Novak. JSI Newsfeed Corpus. The 9th International Corpus Linguistics Conference, 2017.
@article{bušta2017jsi, title={JSI Newsfeed Corpus}, author={Bušta, J. and Herman, O. and Jakubíček, M. and Krek, S. and Novak, B.}, journal={The 9th International Corpus Linguistics Conference}, year={2017}, publisher={University of Birmingham} }
- V. Baisa, J. Michelfeit, O. Matuška. Simplifying terminology extraction: OneClick Terms. The 9th International Corpus Linguistics Conference, 2017.
@article{baisa2017simplifying, title={Simplifying terminology extraction: OneClick Terms}, author={Baisa, V. and Michelfeit, J. and Matuška, O.}, journal={The 9th International Corpus Linguistics Conference}, year={2017}, publisher={University of Birmingham} }
- A. O. Anić, S. K. Žuvela. The conceptualization of music in semantic frames based on word sketches. The 9th International Corpus Linguistics Conference, 2017.
@article{anić2017conceptualization, title={The conceptualization of music in semantic frames based on word sketches}, author={Anić, A. O. and Žuvela, S. K.}, journal={The 9th International Corpus Linguistics Conference}, year={2017}, publisher={University of Birmingham} }
- M. Kunilovskaya, M. Koviazina. Sketch Engine: A Toolbox for Linguistic Discovery. Journal of Linguistics/Jazykovedný časopis, 68(3): 503-507, 2017. (CC BY-NC-ND 4.0)
@article{kunilovskaya2017sketch, title={Sketch Engine: A Toolbox for Linguistic Discovery}, author={Kunilovskaya, M. and Koviazina, M.}, journal={Journal of Linguistics/Jazykovedný časopis}, year={2017}, volume={68}, number={3}, pages={503--507} }
- Walking the tightrope between linguistics and language engineering
- Miloš Jakubíček, Vít Baisa, Jan Bušta, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý and Vít Suchomel
2016
- R. Evans, A. Gelbukh, G. Grefenstette, P. Hanks, M. Jakubíček, D. McCarthy, M. Palmer, T. Pedersen, M. Rundell, P. Rychlý, D. Tugwell, S. Sharoff. Adam Kilgarriff’s Legacy to Computational Linguistics and Beyond. International Conference on Intelligent Text Processing and Computational Linguistics (April 2016): 3-25, 2016.
@article{evans2016adam, title={Adam Kilgarriff’s Legacy to Computational Linguistics and Beyond}, author={Evans, R. and Gelbukh, A. and Grefenstette, G. and Hanks, P. and Jakubíček, M. and McCarthy, D. and Palmer, M. and Pedersen, T. and Rundell, M. and Rychlý, P. and Tugwell, D. and Sharoff, S.}, journal={International Conference on Intelligent Text Processing and Computational Linguistics (April 2016)}, year={2016}, volume={}, pages={3--25}, publisher={Springer} }
- An Exploratory Analysis of ScienceBlog
- Caterina Allais
- In L’Analisi Linguistica e Letteraria, Facoltà di Scienze Linguistiche e Letterature straniere Università Cattolica del Sacro Cuore, Milano, December 2016, pp. 161–170
- Analyse de trois systèmes de gestion de corpus pour l’enseignement-apprentissage des langues étrangères (Analysis of three corpus management systems in French)
- Eva Schaeffer-Lacroix
- In Alsic [online], Vol. 18, No. 1 | 2015, published online 20 December 2015, accessed 18 January 2016.
- Annotated Amharic Corpora
- Pavel Rychlý, Vít Suchomel
- In Petr Sojka, Aleš Horák, Ivan Kopeček, Karel Pala. Text, Speech, and Dialogue 19th International Conference, TSD 2016 Brno, Czech Republic, September 12–16, 2016 Proceedings, pp. 295-302, DOI 10.1007/978-3-319-45510-5_34
- Between Comparable and Parallel: English-Czech Corpus from Wikipedia
- Adéla Štromajerová, Vít Baisa, Marek Blahuš
- In Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016. pp. 3–8
- European Union Language Resources in Sketch Engine
- Vít Baisa, Jan Michelfeit, Marek Medveď and Miloš Jakubíček
- In the Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 2799–2803, Slovenia, May 2016.
- English-French document alignment based on keywords and statistical translation
- M. Medveď, M. Jakubíček, V. Kovář
- In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers (Vol. 2, pp. 728-732).
- Evaluation of the Sketch Engine Thesaurus on Analogy Queries
- Pavel Rychlý
- In Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016. pp. 147–152
- Finding Definitions in Large Corpora with Sketch Engine
- Vojtěch Kovář, Monika Močiariková, Pavel Rychlý
- In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA), 2016. pp. 391–394
- Tanja Samardzic, Yves Scherrer, Elvira Glaser. Archimob-a corpus of spoken Swiss German. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 10: 4061-4066, 2016.
@article{samardzic2016archimob, title={Archimob-a corpus of spoken Swiss German}, author={Samardzic, Tanja and Scherrer, Yves and Glaser, Elvira}, journal={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)}, year={2016}, volume={10}, pages={4061--4066}, publisher={European Language Resources Association (ELRA)} }
- It’s All about Data Students of Science Meet Language as Data and Gain a Skill for Life (source: Wayback Machine)
- James Thomas
- In Humanising Language Teaching magazine in the section IDEAS FROM THE CORPORA. Year 18, Issue 2, April 2016, ISSN 1755 9715.
- Large Scale Keyword Extraction using a Finite State Backend
- Miloš Jakubíček, Pavel Šmerk
- In Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016. pp. 143–146
- Options for Automatic Creation of Dictionary Definitions from Corpora
- Marie Stará, Vojtěch Kovář
- In Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016. pp. 111–124
- RuSkELL: Online Language Learning Tool for Russian Language
- Valentina Apresjan, Vít Baisa, Olga Buivulova, Olga Kultepina
- In Tinatin Margalitadze, George Meladze. Proceedings of the XVII EURALEX International congress. Tbilisi: Ivane Javakhishvili Tbilisi State University, 2016. pp. 292–299
- Sketch Engine for Bilingual Lexicography
- Vojtěch Kovář, Vít Baisa and Miloš Jakubíček
- In International Journal of Lexicography, July 2016, doi: 10.1093/ijl/ecw029
- Terminology Extraction for Academic Slovene Using Sketch Engine
- Darja Fišer, Vít Suchomel, Miloš Jakubíček
- In Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016. pp. 135–141
2015
- Corpus Based Extraction of Hypernyms in Terminological Thesaurus for Land Surveying Domain (Corpus Based Extraction of Hypernyms presentation)
- Vít Baisa and Vít Suchomel
- In Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, Czech Republic, December 2015, pp. 69–74
- Concurrent Processing of Text Corpus Queries (presentation)
- Radoslav Rábara and Pavel Rychlý
- In Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, Czech Republic, December 2015, pp. 49–58
- Towards Automatic Finding of Word Sense Changes in Time
- Vít Baisa, Ondřej Herman, and Miloš Jakubíček
- In Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, Czech Republic, December 2015, pp. 33–41
- Bilingual Terminology Extraction in Sketch Engine (presentation)
- Vít Baisa, Barbora Ulipová, and Michal Cukr
- In Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, Czech Republic, December 2015, pp. 61–67
- Software and Data for Corpus Pattern Analysis (presentation)
- Vít Baisa, Ismaïl El Maarouf, Pavel Rychlý, and Adam Rambousek
- In Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, Czech Republic, December 2015, pp. 75–86
- Turkic Language Support in Sketch Engine
- Vít Baisa and Vít Suchomel
- In Proceedings of the international conference “Turkic Languages processing: TurkLang 2015”, Russia, September 2015, pp. 214–223
- Corpus of Authentic Clinical Diagnoses within the Software Sketch Engine (in Czech)
- Kateřina Pořízková and Marek Blahuš
- In Latinitas Medica. Brno: Masaryk University, 2015. pp. 31–42
- The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries
- Edward Garrett, Nathan W. Hill, Adam Kilgarriff, Ravikiran Vadlapudi and Abel Zadoks
- In Revue d’Etudes Tibétaines, 2015, pp. 51–86.
- Automatic generation of the Estonian Collocations Dictionary database (presentation)
- Jelena Kallas, Adam Kilgarriff, Kristina Koppel, Elgar Kudritski, Margit Langemets, Jan Michelfeit, Maria Tuulik, Ülle Viks
- In Kosem, I., Jakubíček, M., Kallas, J., Krek, S. (eds.) Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, August 2015, Herstmonceux Castle, UK., pp. 1–20.
- Construction and use of thematic corpora by academic English learners
- Simon Smith
- In Proceedings Task Design and CALL, July 2015, pp. 437–445
- Uncovering the collocation errors of Asian learners with the help of automatic corpora comparison
- Howard Hao-Jan Chen
- In Proceedings Task Design and CALL, July 2015, pp. 152–161
- Stealing a march on collocation: Deriving extended collocations from full text for student analysis and synthesis
- James Thomas
- In Agnieszka Leńko-Szymańska and Alex Boulton (eds.). Multiple Affordances of Language Corpora for Data-driven Learning, 2015, pp. 85–108
- Corpora and Language Learning with the Sketch Engine and SKELL
- Adam Kilgarriff, Fredrik Marcowitz, Simon Smith and James Thomas
- In Revue française de linguistique appliquée, 2015/1 (Vol. XX), pp. 61–80
- Interactive visualization methods for Sketch Engine
- Lucia Kocincová, Miloš Jakubíček, Vojtěch Kovář and Vít Baisa
- In Gintaré Grigonyté, Simon Clematide, Andrius Utka, Martin Volk (eds.). Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015. Vilnius, Lithuania: Linköping University Electronic Press, Linköpings universitet, 2015, pp. 17–22
- Learning Chinese with the Sketch Engine
- Adam Kilgarriff, Nicole Keng, Simon Smith and Wei Bo
- In Zou, B., Hoey, M. & Smith, S. (eds.). Corpus Linguistics in Chinese Contexts. Basingstoke: Palgrave, 2015
- Using lecture slides to create an academic corpus
- Simon Smith (2015)
- In IATEFL 2014 Harrogate Conference Selections, Pattison Tania (ed.), Faversham, Kent: IATEFL, pp. 149–151
- Semantic Word Sketches (presentation)
- Diana McCarthy, Adam Kilgarriff, Miloš Jakubíček and Siva Reddy (2015)
- Corpus Linguistics (CL2015), the United Kingdom, July 2015
- DIACRAN: a framework for diachronic analysis (presentation)
- Adam Kilgarriff, Ondřej Herman, Jan Bušta, Pavel Rychlý and Miloš Jakubíček (2015)
- Corpus Linguistics (CL2015), the United Kingdom, July 2015
- Longest-commonest match (presentation)
- Vít Baisa, Adam Kilgarriff, Pavel Rychlý and Miloš Jakubíček (2015)
- Corpus Linguistics (CL2015), the United Kingdom, July 2015
- Sketch Engine for English Language Learning (presentation)
- Vít Baisa, Vít Suchomel, Adam Kilgarriff and Miloš Jakubíček (2015)
- Corpus Linguistics (CL2015), the United Kingdom, July 2015
- Lexical selection and the evolution of language units
- Glenn Hadikin
- In Open Linguistics. Volume 1, Issue 1, ISSN (Online) 2300-9969, DOI: 10.1515/opli-2015-0013, June 2015, pp. 458–466
2014
- Effective Corpus Virtualization
- Miloš Jakubíček, Adam Kilgarriff and Pavel Rychlý (2014)
- In Challenges in the Management of Large Corpora (CMLC-2), May 2014
- The Sketch Engine: ten years on
- Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý and Vít Suchomel (2014)
- In Lexicography: Journal of ASIALEX, volume 1, issue 1, pp. 7–36
- Finding Terms in Corpora for Many Languages with the Sketch Engine
- Adam Kilgarriff, Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý and Vít Suchomel (2014)
- In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, Sweden, April 2014, pp. 53–56
- Optimization of Regular Expression Evaluation within the Manatee Corpus Management System
- Miloš Jakubíček and Pavel Rychlý (2014)
- In Proceedings of the Eighth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2014, Czech Republic, December 2014, pp. 37–48
- SkELL – Web Interface for English Language Learning
- Vít Baisa and Vít Suchomel
- In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing, Brno, Tribun EU (pp. 63-70).
- arTenTen: Arabic Corpus and Word Sketches
- Tressy Arts, Yonatan Belinkov, Nizar Habash, Adam Kilgarriff and Vít Suchomel (2014)
- In Journal of King Saud University – Computer and Information Sciences, volume 26, issue 4, December 2014, pp. 381–395
- Hindi Word Sketches
- Anil Krishna Eragani, Varun Kuchibhotla, Dipti Sharma, Siva Reddy and Adam Kilgarriff (2014)
- In Proceedings of the Conference on Natural Language Processing (ICON-11), Goa, India, December 2014, pp. 11818–125
- Web As Corpus: Theory and Practice
- Maristella Gatto
- A&C Black, October 2014
- PtTenTen: A corpus for Portuguese lexicography
- Adam Kilgarriff, Miloš Jakubíček, Jan Pomikálek, Tony Berber Sardinha and Pete Whitelock (2014)
- In Working with Portuguese Corpora, Bloomsbury Publishing, London, pp. 111–128
- Text Tokenisation Using unitok
- Vít Suchomel, Jan Michelfeit and Jan Pomikálek (2014)
- In Proceedings of the Eighth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2014, Czech Republic, December 2014, pp. 71–75
- Extrinsic Corpus Evaluation with a Collocation Dictionary Task (datasets described in this paper)
- Adam Kilgarriff, Pavel Rychlý, Miloš Jakubíček, Vojtěch Kovář, Vít Baisa and Lucia Kocincová (2014)
- In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Iceland, May 2014, pp. 545–552
- Sketching the Dependency Relations of Words in Chinese
- Meng-Hsien Shih and Shu-Kai Hsieh
- In Proceedings of the 26th Conference on Computational Linguistics and Speech Processing (ROCLING 2014), Taiwan, September 2014, pp. 139–152
- Bilingual Word Sketches: the translate Button
- Vít Baisa, Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář and Pavel Rychlý
- In Proceedings of the 16th EURALEX International Congress. 15–19 July 2014, Bolzano, Italy, pp. 505–513
- Compatible Sketch Grammars for Comparable Corpora
- Vladimír Benko
- In Proceedings of the 16th EURALEX International Congress. Bolzano, Italy, July 2014, pp. 417–430
- Metadiscurso y persuasión: estudio de editoriales de periódicos españoles sobre la muerte de Osama Bin Laden (in Spanish)
- Ricardo-María Jiménez Yáñez
- In Discurso & Sociedad, Vol. 8(4), 2014, pp. 589–622
- Araneum Nederlandicum Maius, A New Family Member
- Vladimír Benko (2014)
- CLIN24 Workshop in Leiden, Sketch examples
2013
- Sketch Engine as a Platform for Providing Corpora
- Miloš Jakubíček (2013)
- Presentation at META-FORUM Germany, September 2013 (link to YouTube)
- The TenTen Corpus Family
- Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý and Vít Suchomel (2013)
- In Proceedings of the 7th International Corpus Linguistics Conference CL 2013, the United Kingdom, July 2013, pp. 125–127
- Web Spam
- Adam Kilgarriff and Vít Suchomel (2013)
- In Proceedings of the 8th Web as Corpus Workshop (WAC-8), the United Kingdom, July 2013, pp. 46–52
- arTenTen: a new, vast corpus for Arabic
- Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth and Vít Suchomel (2013)
- In Proceedings of WACL’2 Second Workshop on Arabic Corpus Linguistics, the United Kingdom, July 2013, pp. 20
- Intrinsic Methods for Comparison of Corpora
- Vít Baisa and Vít Suchomel (2013)
- In Proceedings of the Seventh Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2013, Czech Republic, December 2013, pp. 51–58
- Compatible Sketch Grammar Experiment
- Vladimír Benko (2013)
- Proceedings of the International Conference Corpus Linguistics – 2013, June 25–27, 2013, St. Petersburg. pp. 21–29.
- User-friendly interface of error/correction-annotated corpus for both teachers and researchers (The Sketch Engine interface for a learner corpus annotated with errors and corrections)
- Iztok Kosem, Vít Baisa, Vojtěch Kovář and Adam Kilgarriff (2012)
- In Book of Abstracts LCR 2013, Norway, September 2013, pp. 82–84
- Corpus-based vocabulary lists for language learners for nine languages
- Adam Kilgarriff, F. Charalabopoulou, M. Gavrilidou, J.B. Johannessen, S. Khalil, S.J. Kokkinakis, R. Lew, S. Sharoff, R. Vadlapudi and E. Volodina (2013)
- In Language Resources and Evaluation, volume 48, issue 1, March 2014, pp. 121–163
- esTenTen, a vast web corpus of Peninsular and American Spanish
- Adam Kilgarriff and Irene Renau (2013)
- In Procedia – Social and Behavioral Sciences, volume 55, Elsevier, October 2013, pp. 12–19
- Using corpora as data sources for dictionaries
- Adam Kilgarriff (2013)
- In The Bloomsbury Companion to Lexicography, Howard Jackson (ed.), Bloomsbury, London. Chapter 4.1, pp. 77–96
- Terminology finding, parallel corpora and bilingual word sketches in the Sketch Engine
- Adam Kilgarriff (2013)
- In Proceedings ASLIB 35th Translating and the Computer Conference, London, May 2013
- 百億語のコーパスを用いた日本語の語彙・文法情報のプロファイリング
- (Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen)
- Irena Srdanović, Vít Suchomel, Toshinobu Ogiso and Adam Kilgarriff (2013)
- 『「第3回コーパス日本語学ワークショップ」予稿集』国立国語研究所 言語資源研究系・コーパス開発センター (In Proceeding of the 3rd Japanese corpus linguistics workshop, Department of Corpus Studies, Center for Corpus Development, NINJAL), pp. 229–238
- Quantifying Lexical Usage: Vocabulary pertaining to Ecosystems and the Environment
- Kate Wild, Andrew Church, Diana McCarthy and Jacqueline Burgess
- In Corpora, volume 8, May 2013, pp. 53–79
2012
- New Learner Corpus Functionality in the Sketch Engine
- Vojtěch Kovář and Diana McCarthy (2012)
- In Proceedings of the 2012 Asia Pacific Corpus Linguistics Conference (APCLC), Auckland, February 2012
- Performance stylistics: Deleuze and Guattari, poetry and (corpus) linguistics
- Kieran O’Halloran (2012)
- International Journal of English Studies, Vol 12, No 2.
- Genres across the Disciplines: Student Writing in Higher Education
- Hilary Nesi and Sheena Gardner (2012)
- Cambridge University Press
- Word Sense Induction for Novel Sense Detection
- Jey Han Lau, Paul Cook, Diana McCarthy, David Newman and Timothy Baldwin (2012)
- In 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), France, April 2012, pp. 591–601
- Efficient Web Crawling for Large Text Corpora
- Vít Baisa, Vít Suchomel and Jan Pomikálek (2012)
- In Proceedings of the seventh Web as Corpus Workshop (WAC7), France, April 2012, pp. 39–43
- Getting to know your corpus
- Adam Kilgarriff (2012)
- In Proceedings of The 15th International Conference on Text, Speech and Dialogue (TSD), Petr Sojka, Aleš Horák, Ivan Kopeček and Karel Pala (eds.), Czech Republic, September 2012, pp. 3–15
- Detecting Spam in Web Corpora
- Vít Baisa and Vít Suchomel (2012)
- In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 69–76
- Sketching Muslims: A corpus driven analysis of representations around the word ‘Muslim’ in the British press 1998–2009
- Paul Baker, Costas Gabrielatos, Tony McEnery (2012)
- In Applied Linguistics, Volume 34, Issue 3, 1 July 2013, pp. 255–278
- Towards 100M Morphologically Annotated Corpus of Tajik
- Gulshan Dovudov, Vít Suchomel, Pavel Šmerk (2012)
- In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 91–94
- Recent Czech Web Corpora
- Vít Suchomel (2012)
- In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 77–83
- Finding Multiwords of More Than Two Words
- Adam Kilgarriff, Pavel Rychlý, Vojtěch Kovář and Vít Baisa (2012)
- In Proceedings of the 15th EURALEX International Congress, Norway, August 2012, pp. 693–700
- Large Corpora for Turkic Languages and Unsupervised Morphological Analysis
- Vít Baisa and Vít Suchomel (2012)
- In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 28–32
- Word Sketches for Turkish
- Bharat Ram Ambati, Siva Reddy and Adam Kilgarriff (2012)
- In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 2945–2950
- Managing Ambiguity in Reference Generation: The Role of Surface Structure
- Imtiaz H. Khan, Kees van Deemter and Graeme Ritchie
- In Topics in Cognitive Science, volume 4, issue 2, 2012, pp. 211–231
- Learner corpora and second language acquisition
- Meng Huat Chau (2012)
- In Corpus Applications in Applied Linguistics, K. Hyland, M. H. Chau & M. Handford (eds.), London: Continuum, 2012, pp. 191–207
- Setting up for corpus lexicography
- Adam Kilgarriff, Jan Pomikálek, Miloš Jakubíček and Pete Whitelock (2012)
- In Proceedings of the 15th EURALEX International Congress, Norway, August 2012, pp. 31–55
- Corpus Tools for Lexicographers
- Adam Kilgarriff and Iztok Kosem (2012)
- In Electronic Lexicography, Sylviane Granger and Magali Paquot (eds.), Oxford University Press, October 2012, pp. 31–55
- The Sketch Engine as infrastructure for historical corpora
- Adam Kilgarriff, Miloš Husák and Robyn Woodrow (2012)
- In Jeremy Jancsary (ed.). Empirical Methods in Natural Language Processing; Proceedings of the Conference on Natural Language Processing 2012
- Tools for historical corpus research, and a corpus of Latin (presentation)
- Barbara McGillivray and Adam Kilgarriff (2012)
- In New Methods in Historical Corpus Linguistics 3, Germany, 2013, pp. 247–255
- Vietnamese Word Sketches
- Adam Kilgarriff and Phuong Le-Hong (2012)
- In Workshop on Vietnamese Language and Speech Processing (IEEE-RIVF 9), Vietnam, February 2012, pp. 1–4
- Building A Thesaurus Using LDA-Frames
- Jiří Materna (2012)
- In Proceedings of the Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012, Czech Republic, December 2012, pp. 97–103
2011
- Corpora and Language Education
- Lynne Flowerdew (2011)
- Palgrave Macmillan, December 2011
- Comparable Corpora BootCaT
- Adam Kilgarriff, Avinesh PVS and Jan Pomikálek (2011)
- In Proceedings of eLEX 2011, Slovenia, November 2011, pp. 122–128
- GDEX for Slovene
- Iztok Kosem, Miloš Husák and Diana McCarthy (2011)
- In Proceedings of eLEX 2011, Slovenia, November 2011, pp. 151–159
- Dynamic and Static Prototype Vectors for Semantic Composition (CC BY-NC-SA 3.0)
- Siva Reddy, Ioannis P. Klapaftis, Diana McCarthy and Suresh Manandhar (2011)
- Best paper award at the 5th International Joint Conference on Natural Language Processing.
- Learner construction of corpora for general English in Taiwan
- Simon Smith (2011)
- In Computer Assisted Language Learning, volume 24, Issue 4, pp. 291–316
- Corpus-based Disambiguation for Machine Translation
- Vít Baisa (2011)
- In Proceedings of the Fifth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2011, Czech Republic, December 2011, pp. 81–87
- Corpus-based tasks for learning Chinese: a data-driven approach
- Simon Smith and Xuanying Shen (2011)
- In The Asian Conference on Technology in the Classroom 2011 Japan, August 2011, pp. 48–59
- Large Web Corpora for Indian Languages
- Adam Kilgarriff and Girish Duvuru (2011)
- In Proceedings of International Conference on Information Systems for Indian Languages (ICISIL), India, 2011 pp. 312–313
- Polish Word Sketches
- Adam Radziszewski, Adam Kilgarriff and Robert Lew (2011)
- In Proceedings of the 5th Language & Technology Conference (LTC), Poland, November 2011, pp. 237–242
- Japanese Word Sketches: Advances and Problems (CC BY-SA 4.0)
- Irena Srdanović, Naomi Ida, Chikako Shigemori Bučar, Adam Kilgarriff and Vojtěch Kovář (2011)
- In Acta Linguistica Asiatica, University of Ljubljana, Slovenia 2011, pp. 63–82
- The Pearson International Corpus of Academic English (PICAE) (PICAE poster)
- Kirsten Ackermann, John H.A.L. de Jong, Adam Kilgarriff and David Tugwell (2010)
- In International Corpus Linguistics Conference (ICLIC), Birmingham, UK, 2011
2010
- Database of ANalysed Texts of English (DANTE): the NEID database project
- Sue Atkins, Adam Kilgarriff, Michael Rundell (2010)
- In Proceedings of the 14th EURALEX International Congress. The Netherlands, July 2010, pp. 549–556
- Usi e pratiche della comprensione attraverso la lente dei verba recipiendi (in Italian)
- Isabella Chiari (2010)
- Bollettino di Italianistica, I, pp. 30–70
- Helping Our Own
- Robert Dale and Adam Kilgarriff (2010)
- In International Natural Language Generation Conference, Dublin, Ireland
- Fast syntactic searching in very large corpora for many languages
- Miloš Jakubíček, Adam Kilgarriff, Diana McCarthy and Pavel Rychlý (2010)
- In Proceedings of Workshop on Advanced Corpus Solutions, PACLIC 24, Japan, November 2010
- Towards disambiguation of word sketches
- Vít Baisa (2010)
- In Text, Speech and Dialogue. Germany, Berlin: Springer-Verlag, 2010, pp. 37–42.
- Studying Word Sketches for Russian
- Maria Khokhlova and Victor Zakharov (2010)
- In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Malta, May 2010, pp. 3491–3494
- Building Russian Word Sketches as Models of Phrases (CC BY-NC-SA 3.0)
- Maria Khokhlova (2010)
- In Proceedings of the 14th EURALEX International Congress. The Netherlands, July 2010, pp. 364–371
- A Case Study in Word Sketches – Czech Verb vidět “see”
- Karel Pala and Pavel Rychlý (2010)
- In A Way with Words: Recent Advances in Lexical Theory and Analysis. A Festschrift for Patrick Hanks. Ed. by Gilles-Maurice de Schryver, Menha Publishers, 2010, pp. 187–198
- Google The Verb
- Adam Kilgarriff (2010)
- In Language Resources and Evaluation Journal, 44 (3), pp. 281–290
- Tickbox Lexicography
- Adam Kilgarriff, Vojtěch Kovář and Pavel Rychlý (2010)
- In eLexicography in the 21st century: New challenges, new applications, Presses universitaires de Louvain, Brussels, 2010, pp. 411–418
- Semi-automatic dictionary drafting
- Adam Kilgarriff and Pavel Rychlý (2010)
- In A Way with Words: Recent Advances in Lexical Theory and Analysis, Uganda: Menha Publishers Ltd., 2010, pp. 299–312
- Corpora by Web Services
- Adam Kilgarriff (2010)
- In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Malta, May 2010
- A corpus factory for many languages
- Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010)
- LREC workshop on Web Services and Processing Pipelines, Malta, May 2010
- The RoWaC Corpus and Romanian Word Sketches
- Monica Macoveiciuc and Adam Kilgarriff (2010)
- In Multilinguality and Interoperability in Language Processing with Emphasis on Romanian Edited by Dan Tufis and Corina Forascu. Romanian Academy, pp. 151–168.
- DANTE: A Detailed, Accurate, Extensive, Available English Lexical Database
- Adam Kilgarriff (2010)
- In Proceedings of the NAACL HLT 2010 Demonstration Session, Los Angeles, June 2010, pp. 21–24
- DANTE: a New Resource for Research at the Syntax-Semantics Interface
- Diana McCarthy (2010)
- In Proceedings of Interdisciplinary Workshop on Verbs, Italy, 2010, pp. 1–8
- A Quantitative Evaluation of Word Sketches
- Adam Kilgarriff, Vojtěch Kovář, Simon Krek, Irena Srdanovic and Carole Tiberius (2010)
- In Proceedings of the 14th EURALEX International Congress. The Netherlands, July 2010, pp. 372–379
- Sample headwords for Word Sketch Evaluation
- Comparable Corpora Within and Across Languages, Word Frequency Lists and the Kelly Project
- Adam Kilgarriff (2010)
- Invited talk at the LREC workshop on Building and Using Comparable Corpora, Malta, May 2010
- SemEval-2010 Task 7: Argument Selection and Coercion (CC BY-NC-SA 3.0)
- Pustejovsky, J., A. Rumshisky, A. Plotnick, E. Jezek, O. Batiukova, and V. Quochi (2010)
- Association for Computational Linguistics, pp. 27–31.
2009
- Scaling to Billion-plus Word Corpora
- Jan Pomikálek, Pavel Rychlý and Adam Kilgarriff
- In Advances in Computational Linguistics, Instituto Politécnico Nacional, volume 41, Mexico, 2009, pp. 3–13
- Simple maths for keywords
- Adam Kilgarriff
- In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.
- The metalanguage of impoliteness: Using Sketch Engine to explore the Oxford English Corpus
- Jonathan Culpeper (2009)
- In Contemporary Corpus Linguistics, P. Baker (ed.), London: Continuum, pp. 64–86
- Putting the corpus into the dictionary (first published in 2005 as Linking Dictionary and Corpus)
- Adam Kilgarriff (2009)
- In V.B.Y. Ooi, A. Pakir, I.S. Talib and P.K. Tan (eds.). Perspectives in Lexicography: Asia and Beyond, Israel, K Dictionaries 2009, pp. 239–247
- Extracting distant collocations of adverbs and modality forms using web corpus and query system
- Irena Srdanovic, Bor Hodošček, Andrej Bekeš and Kikuko Nishina (2009)
- 「ウェブコーパスと検索システムを利用した推量副詞とモダリティ形式の遠隔共起抽出と日本語教育への応用」『自然言語処理』(Extracting distant collocations of adverbs and modality forms using web corpus and query system, Journal of Natural Language Processing), 16/4, pp. 29–46
- Towards creation of lexical syllabus based on corpora – on suppositional adverbs and clause-final modality collocations
- Irena Srdanovic, Andrej Bekeš and Kikuko Nishina (2009)
- 「コーパスに基づいた語彙シラバス作成に向けて―推量的副詞と文末モダリティの共起を中心にして―」『日本語教育』142号 (Towards creation of lexical syllabus based on corpora – on suppositional adverbs and clause-final modality collocations, Journal of Japanese Language Education, 142, pp. 69–79)
- Chapter III – WebCorp: the Web as Corpus
- Maristella Gatto (2009) (email m.gatto@lingue.uniba.it)
- In From Body to Web. An Introduction to the Web as Corpus, Laterza University Press Online (Bari-Roma) 2009, pp. 77–100
- Czech Word Sketch Relations with Full Syntax Parser
- Aleš Horák, Pavel Rychlý and Adam Kilgarriff
- In After Half a Century of Slavonic Natural Language Processing. Czech Republic, Brno: Masaryk University, 2009, pp. 101–112. ISBN 978-80-7399-815-8.
- The Sketch Engine for Dutch with the ANW corpus (source: archive)
- Carole Tiberius and Adam Kilgarriff (2009)
- In Fons Verborum, Festschrift for Fons Moerdijk. Instituut voor Nederlandse Lexicologie, the Netherlands, pp. 273–255
- Associating Collocations with Dictionary Senses
- Abhilash Inumella, Adam Kilgarriff and Vojtěch Kovář (2009)
- In Proceedings of 6th Biennial Conference of the Asian Association for Lexicography, Thailand 2009
- Classifying corpora based on adverbs distribution
- Irena Srdanović, Bor Hodošček, Andrej Bekeš and Kikuko Nishina (2009)
- In International Quantitative Linguistics Conference (Qualico), Austria, September 2009
2008
- A Lexicographer-Friendly Association Score
- Pavel Rychlý (2008)
- In Second Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2008. Brno, Masaryk University, 2008, pp. 6–9. ISBN 978-80-210-4741-9.
- Investigating the collocational behaviour of man and woman in the BNC using Sketch Engine
- Michael Pearce (2008)
- Corpora. Volume 3, DOI 10.3366/E174950320800004X, ISSN 1749-5032, pp. 1–29
- GDEX: Automatically finding good dictionary examples in a corpus
- Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell and Pavel Rychlý (2008)
- In Proceedings of the 13th EURALEX International Congress. Spain, July 2008, pp. 425–432
- Finding the words which are most X
- Adam Kilgarriff and Pavel Rychlý (2008)
- In Proceedings of the 13th EURALEX International Congress. Spain, July 2008, pp. 433–436
- Comparing Lexical Relationships Observed within Japanese Collocation Data and Japanese Word Association Norms
- Terry Joyce and Irena Srdanović (2008)
- Coling 2008, 22nd International Conference on Computational Linguistics. In Proceedings of the Workshop on Cognitive Aspects of the Lexicon. Manchester, UK, 2008, pp. 1–8
- Evaluating a German Sketch Grammar: A Case Study on Noun Phrase Case
- Kremena Ivanova, Ulrich Heid, Sabine Schulte im Walde, Adam Kilgarriff and Jan Pomikalek
- In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco, May 2008, pp. 2101–2107
- Cleaneval: a Competition for Cleaning Web Pages
- Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff
- In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco, May 2008, pp. 638–643
- A web corpus and word sketches for Japanese
- Irena Srdanović, Tomaž Erjavec and Adam Kilgarriff (2008)
- A web corpus and word-sketches for Japanese『自然言語処理』(Journal of Natural Language Processing) 15/2, 137–159. (reprinted in Information and Media Technologies 3/3, 2008, pp. 529–551)
- The Sketch Engine corpus query tool for Japanese and its possible applications (in Japanese)
- Irena Srdanović and Kikuko Nishina (2008)
- 「コーパス検索ツールSketch Engineの日本語版とその利用方法」『日本語科学』(The Sketch Engine corpus query tool for Japanese and its possible applications, Japanese Linguistics) 23, pp. 59–80
- Chinese Word Sketch and Mapping Principles: A Corpus-Based Study of Conceptual Metaphors Using the BUILDING Source Domain
- Shu-Ping Gong, Kathleen Ahrens, and Chu-Ren Huang (2008)
- In International Journal of Computer Processing of Oriental Languages. 21(2): pp. 3–17, doi: 10.1142/S1793840608001755
2007
- Manatee/bonito – a modular corpus manager
- Pavel Rychlý (2007)
- In First Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2007. Brno: Masaryk University, 2007, pp. 65–70. ISBN 978-80-210-4471-5
- Corpus Query System Bonito – Recent Development
- Vojtěch Kovář (2007)
- In First Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2007. Brno: Masaryk University, 2007, pp. 71–76
- An efficient algorithm for building a distributional thesaurus (and other Sketch Engine developments)
- Pavel Rychlý and Adam Kilgarriff (2007)
- In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Czech Republic, June 2007, pp. 41–44
- Displaying Bidirectional Text Concordances in KWIC
- Pavel Rychlý and Vojtěch Kovář (2007)
- Googleology is bad science
- Adam Kilgarriff (2007)
- In Computational linguistics 33.1, 2007, pp. 147–151
2006
- Slovene Word Sketches
- Simon Krek and Adam Kilgarriff (2006)
- In Proceedings 5th Slovenian/First International Languages Technology Conference, Slovenia, October 2006
- Using Chinese Gigaword Corpus and Chinese Word Sketch in linguistic research
- Jia-Fei Hong and Chu-Ren Huang (2006)
- In The 20th Pacific Asia Conference on Language, Information and Computation (PACLIC-20), November 2006, pp. 183–190
- Learning noun-modifier semantic relations with corpus-based and WordNet-based features
- Vivi Nastase, Jelber Sayyad-Shiarabad, Marina Sokolova and Stan Szpakowicz (2006)
- In Proceedings of the National Conference on Artificial Intelligence. Massachusetts, July 2006. Published by The AAAI Press, Menlo Park, California, pp. 781–786
- WebBootCaT: instant domain-specific corpora to support human translators (earlier also WebBootCaT: a web tool for instant corpora)
- Marco Baroni, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý (2006)
- In Proceedings of EAMT. 11th Annual Conference of the European Association for Machine Translation. Oslo, Norway, pp. 247–252
- WebBootCaT: a web tool for instant corpora
- Marco Baroni, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý (2006)
- In Proceedings of the EURALEX Conference, 2006, pp. 123–132
- Large linguistically-processed Web corpora for multiple languages
- Marco Baroni and Adam Kilgarriff (2006)
- In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations. Association for Computational Linguistics, Trento, Italy, pp. 87–90
- Efficient corpus development for lexicography: building the New Corpus for Ireland
- Adam Kilgarriff, Michael Rundell and Elaine Uí Dhonnchadha (2006)
- In Language Resources and Evaluation, May 2006, Volume 40, Issue 2, pp. 127–152
- Sketch Engine: a sense discrimination engine for English, Chinese and other languages
- Adam Kilgarriff, Pavel Rychlý, Simon Smith, Chu-Ren Huang, Yiching Wu and Cecilia Lin (2006)
- In Proceedings of the 2006 Conference and Workshop on TEFL and Applied Linguistics, Taoyuan, Taiwan, pp. 173–181
- LEMPAS: A Make-Do Lemmatizer for the Swedish PAROLE-Corpus
- Silvie Cinková and Jan Pomikálek (2006)
- In Prague Bulletin of Mathematical Linguistics, Czech Republic, volume 2006, issue 86, pp. 47–53
2005 and earlier
- Peta poletna delavnica za leksikografijo in leksikalno računalništvo (Fifth Summer Workshop on Lexicography and Lexical Computing; in Slovene)
- Irena Srdanović (2005)
- Masaryk University, Brno, Czech Republic, 10–14 June 2005. Slavistic Journal, 53/4 (Oct.–Dec. 2005), pp. 607–609
- Disambiguating Coordinations Using Word Distribution Information
- Francis Chantree, Adam Kilgarriff, Anne de Roeck and Alistair Willis
- In Proceedings of Recent Advances in Natural Language Processing (RANLP), Bulgaria, September 2005
- Chinese Sketch Engine and the Extraction of Grammatical Collocations
- Chu-Ren Huang, Adam Kilgarriff, Yiching Wu, Chih-Ming Chiu, Simon Smith, Pavel Rychlý, Ming-Hong Bai and Keh-Jiann Chen (2005)
- In Fourth SIGHAN Workshop on Chinese Language Processing, Korea, October 2005, pp. 48–55
- Chinese Word Sketches
- Adam Kilgarriff, Chu-Ren Huang, Pavel Rychlý, Simon Smith and David Tugwell (2005)
- In Proc. Asialex, Singapore, June 2005
- Manatee, Bonito and Word Sketches for Czech (abstract in Russian)
- Pavel Rychlý and Pavel Smrž (2004)
- In Proceedings of the Second International Conference on Corpus Linguistics. Saint-Petersburg: Saint-Petersburg State University Press, 2004, pp. 124–132.
- The Sketch Engine
- Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell (2004)
- In Proceedings of the 11th EURALEX International Congress. France, July 2004, pp. 105–116 (reprinted in Lexicology: Critical concepts in Linguistics P. W. Hanks (ed.) Routledge, 2007)
- Linguistic Search Engine
- Adam Kilgarriff (2003)
- In Proceedings of Workshop on Shallow Processing of Large Corpora, SProLaC03, the United Kingdom, pp. 53–58.
- Introduction to the special issue on the web as corpus
- Adam Kilgarriff and Gregory Grefenstette (2003)
- In Computational Linguistics 29.3, 2003, pp. 333–347
- Thesauruses for Natural Language Processing
- Adam Kilgarriff (2003)
- In Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on. IEEE, October 2003. pp. 5–13.
If you have a Sketch Engine-related paper, please send us the details and, if possible, a link to the document (email: support@sketchengine.eu).
Theses related to Sketch Engine
@mastersthesis{emma2022building,
title={Building A Multilingual Outlier Detection Dataset For The Evaluation Of Distributional Thesauri And Word Embeddings},
author={Romani, Emma},
school={The University of Pavia},
year={2022}
}
@phdthesis{vít2021better,
title={Better Web Corpora For Corpus Linguistics And NLP},
author={Suchomel, Vít},
school={Masaryk University, Faculty of Informatics},
year={2021}
}
Abstract: The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods. This thesis presents a web crawler designed to obtain texts from the internet, making it possible to build large text corpora for NLP and linguistic applications. An asynchronous communication design (rather than the usual synchronous multi-threaded design) was implemented for the crawler to provide an easy-to-maintain alternative to other web spider software. Cleaning techniques were devised to handle the messy nature of data coming from the uncontrolled environment of the internet. However, the usability of recently built web corpora is hindered by several factors. The results derived from statistical processing of corpus data are significantly affected by the presence of non-text (web spam, computer-generated text and machine translation) in text corpora. It is important to study this issue in order to avoid non-text altogether, or at least to reduce its amount in web corpora. Another observed factor is web pages, or parts of them, written in multiple languages. Multilingual pages should be recognised, their languages identified and the text parts separated into the respective monolingual corpora. This thesis proposes additional cleaning stages in the process of building text corpora which help to deal with these issues. Unlike traditional corpora made from printed media in past decades, the sources of web corpora are not well categorised and described, which makes it difficult to control the content of the corpus. Rich annotation of corpus content is dealt with in the last part of the thesis. An inter-annotator agreement driven English genre annotation and two experiments with supervised classification of text types in English and Estonian web corpora are presented.
@mastersthesis{jiří2017wikipedia,
title={Wikipedia Learner's Corpus},
author={Kletečka, Jiří},
school={Masaryk University, Faculty of Informatics},
type={Bachelor's thesis},
year={2017}
}
Abstract: This bachelor’s thesis deals with the automated creation of an error-annotated corpus from the editing history of Wikipedia articles. Such a corpus contains the newest versions of articles, with errors marked on the basis of their editing history. A new tool was designed and implemented for this purpose. It was then used to create a corpus from a Czech Wikipedia database dump, and this corpus was uploaded to the faculty server for public use through the Sketch Engine interface.
@mastersthesis{michal2017czech,
title={Czech corpus of example sentences},
author={Cukr, Michal},
school={Masaryk University, Faculty of Arts},
year={2017}
}
Abstract: The purpose of this work was to create a Czech text corpus of example sentences for SkELL, a special language-learning interface. As source texts, we downloaded websites chosen for selective harvests by the Czech Webarchiv, and Czech Wikipedia including discussions. The third source is a part of the JSI Newsfeed Corpus. The crawled texts were prepared with corpus processing tools and the final text collection was deduplicated. Afterwards, we performed several rounds of cleaning. The thesis includes some examples from the created corpus. This corpus of Czech example sentences is hosted in the university installation of Sketch Engine (https://ske.fi.muni.cz/). Public access to the corpus is provided via the SkELL interface at https://skell.sketchengine.eu/#home?lang=cs.
@mastersthesis{radoslav2016parallelization,
title={Parallelization of the corpus manager's time-consuming operations},
author={Rábara, Radoslav},
school={Masaryk University, Faculty of Informatics},
year={2016}
}
Abstract: The Manatee corpus manager can process large corpora containing billions of words. Some operations on search results from such large corpora can be time-consuming. This thesis describes a system that enables the selected operations to be computed in parallel. The system is evaluated on a single computer and on a cluster of computers. The evaluation covers scalability and comparisons with the Manatee system and with a MapReduce system that provides a platform for distributed computing.
Lucia Kocincová (2015). Interactive visualization methods for Sketch Engine. Master thesis. Masaryk University, Faculty of Informatics.
Abstract: Visualization is undoubtedly one of the most desired methods for displaying data, especially when dealing with so-called big data. Visualization can uncover unnoticed and hidden relationships within the data and, in addition, it enables users to understand and interpret the data with less effort. This thesis focuses on interactive visualizations generated from corpus data. First, it introduces the state-of-the-art tools for corpus visualization and a corpus management system named Sketch Engine, for which numerous design concepts were created. Then four of them (corpora overview, thesaurus, word sketch and word sketch difference) were implemented as an online application, mainly using the Data-Driven Documents library. Last, these visualizations were evaluated through user testing, which revealed that the implemented concepts were not only graphically very appealing but also helpful. Therefore, the interactive visualizations will be incorporated into the Sketch Engine online interface in the near future.
Matouš Ejem (2015). English learner corpora [in Czech]. Bachelor thesis. Masaryk University, Faculty of Arts.
Abstract: Learner corpora conjoin second language acquisition research, foreign language teaching and corpus linguistics. In this work I present available English learner corpora.
Lucie Kaplanová (2015). Collection of linguistically motivated examples of CQL [in Czech]. Bachelor thesis. Masaryk University, Faculty of Arts.
Abstract: This bachelor thesis deals with CQL (Corpus Query Language), a query language for corpora. It explains the use of individual operators, attributes, and structures that can be used in CQL searches. The thesis also includes a set of linguistically oriented CQL queries for Czech and English.
Monika Močiariková (2015). Methods for Automatic Acquisition of Dictionary Definitions [in Slovak]. Bachelor thesis. Masaryk University, Faculty of Arts.
Abstract: The thesis explains the term "definition" and why it is difficult to say whether a given sentence is a definition or not. It also describes the Sketch Engine system and the CQL language. The practical part is dedicated to the design, implementation and evaluation of queries for automatic definition search.
Dominika Talianová (2014). Corpus Data Visualization. Bachelor thesis. Masaryk University, Faculty of Informatics.
Abstract: This thesis focuses on corpus data represented in graphical form. More specifically, it consists of a survey of visualization tools and a website created to hold visualizations based on two features of Sketch Engine, namely Word Sketch and Sketch-diff. These visualizations represent collocations and their salience in connection with different lemmas. The data essential for these visualizations are provided in JSON format by the Natural Language Processing Centre at Masaryk University, Faculty of Informatics in Brno, and are processed using JavaScript and its D3 library.
Abstract: The aim of this thesis is to study approaches used in concurrent processing and to apply them to the evaluation of queries in the Manatee system. The work includes not only a detailed evaluation of query processing speed with various numbers of cores available during the evaluation, but also a comparison of code length between the old and the new implementation.
Abstract: From a natural language corpus, word usage data over time can be extracted. To detect and quantify change in this data, automatic procedures can be employed. In this work, the theory of ordinary and robust regression methods is discussed and applied to real-world data with great success. A Python implementation is included. Smoothing of time series and detection of seasonality are examined, but ultimately this path does not seem to give satisfactory results for the data explored.
Abstract: This thesis proposes and implements an algorithm for evaluating sentences with respect to their understandability and informativeness. It can be embedded into a variety of applications, such as corpus querying tools or automated dictionaries. The proposed algorithm is highly customizable, since it employs a variety of criteria approximating the similarity of sentences to good dictionary examples. It was optimized using machine learning algorithms on a set of manually labelled concordances. The algorithm is usable in practical applications; however, it is still being developed.