History of Cologne Digital Lexicons


Digital Lexicons: 1988-1994, 1994-2005, pre-2014, 2014-2019

Slide 1. History of Cologne Digital Lexicons
Mārcis Gasūns,
October 2019

Slide 3. Digital Lexicons

Slide 4. Austin 1988
“Many Sanskritists are highly computer literate”
“Bright hopes” by D.

Undoing sandhi, conjunct characters
Sanskrit text archive, a remake of Thesaurus Linguae Graecae, est. 1972
Full textual reference (Panini)


Slide 5. Post-Austin 1988 (Kharagpur 2019)
Undoing sandhi: solved, open source
1992-2000, Peter Scharf (Pascal)
Jim Funderburk (Perl, Java)
2015, Jim Funderburk (Python 2.7)
Conjunct characters are not an issue in Unicode. But Unicode is not widely used in India, and that does become an issue (e.g., Pune intranet). It was solved in 2016 for OCR.
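
Why conjuncts stop being a separate encoding problem in Unicode can be shown in a few lines: a Devanagari conjunct is stored as consonant + virama + consonant, and drawing the ligature is left to the font. This is only an illustrative sketch using Python's standard `unicodedata` module; the sample conjunct kṣa and the helper name `codepoints` are my own choices, not taken from the talk.

```python
import unicodedata

# The conjunct "kṣa" as it is stored in Unicode: KA + VIRAMA + SSA.
# (Example chosen for illustration; it does not come from the slides.)
conjunct = "\u0915\u094D\u0937"


def codepoints(text):
    """Return (code point, official Unicode name) for every character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code, name in codepoints(conjunct):
    print(code, name)
```

One visible glyph is three code points, so no dedicated conjunct code is needed in the text itself.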


Slide 6. Post-Austin 1988 (Kharagpur 2019)
Sanskrit text archive (GRETIL), 2001
"simply rapid access", no “grammatical and lexical systems”
Digital Corpus of Sanskrit (DCS), 560 000 lemmatized sentences (linguistic database, Sanskrit expert system)
Parallel Sanskrit-Russian Corpora, 2013
Rigveda, Atharvaveda, Mahabharata, Ramayana


Slide 7. Post-Austin 1988 (Kharagpur 2019)
Full contextual reference (Panini)
GRA links to RV, not yet to Panini (2018, Jim Funderburk)


Slide 8. Cologne 1997 Edition
Coding yet to be done:
transliteration of Greek
botanical terms
verbal forms
literary sources

Slide 9. MW 2019: Supplement
MW supplement (additions and corrections)
fully integrated, as far as I know, in 2018? (Jim Funderburk)


Slide 10. MW 2019: Transliterate Greek
transliteration of Greek (16 out of 34

2007, 2010: Beta Code to Unicode (Jim Funderburk, Peter Scharf)
2010?: Interlinking with Perseus (Jim Funderburk)
2015-2019: Proofreading Ancient Greek (Jim Funderburk, Jonathan Migliori)
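
A rough idea of what a Beta Code to Unicode step involves: map each ASCII letter to its Greek counterpart, turn the diacritic signs into combining marks, and let NFC normalization compose them. The sketch below is not the converter used at Cologne; the function name `beta_to_unicode` is hypothetical, and capitals (marked with `*` in Beta Code) are deliberately left unhandled.

```python
import re
import unicodedata

# Basic Beta Code letters (lowercase only in this sketch).
LETTERS = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ",
    "h": "η", "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "c": "ξ", "o": "ο", "p": "π", "r": "ρ", "s": "σ",
    "t": "τ", "u": "υ", "f": "φ", "x": "χ", "y": "ψ", "w": "ω",
}

# Beta Code diacritics mapped to Unicode combining marks.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode(beta):
    """Convert lowercase Beta Code to precomposed Greek (hypothetical helper)."""
    mapped = "".join(LETTERS.get(ch) or DIACRITICS.get(ch) or ch
                     for ch in beta.lower())
    composed = unicodedata.normalize("NFC", mapped)
    # A sigma at the end of a word takes its final form.
    return re.sub(r"σ\b", "ς", composed)


print(beta_to_unicode("lo/gos"))
print(beta_to_unicode("a)/nqrwpos"))
```

Anything the tables do not know, including the `*` capital marker, passes through unchanged, which makes leftovers easy to spot during proofreading.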


Slide 11. MW 2017: Botanical Terms
To recognise and to renew plant names: Linnaean taxonomy changed over time (15826 cases in 8408 entries in



Slide 12. MW 2017: Botanical Terms
Mis-markup (surnames coded as plants)
Roxb., Hex., Gaertn., Nees., Schott., Bl., Wall., Benth., Spreng., Willd.
Erycibe_Paniculata_Roxb. ---> Erycibe_PaniculataRoxb.
“L.” after botanical nomenclature is not L[exicographer], but Carl Linnaeus.
Corrections can generate false positives; work with allbot1a.txt had just begun, but stopped rapidly.
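
One possible way to catch surnames coded as part of a plant name is to check the last token of a marked-up name against a list of author abbreviations. The sketch below uses only the abbreviations named on this slide plus "L."; the function `split_authority` and the underscore-separated input form are assumptions based on the single example shown, not the actual format of allbot1a.txt.

```python
# Author abbreviations mentioned on the slide, plus "L." for Linnaeus.
AUTHOR_ABBREVIATIONS = {
    "Roxb.", "Hex.", "Gaertn.", "Nees.", "Schott.", "Bl.",
    "Wall.", "Benth.", "Spreng.", "Willd.", "L.",
}


def split_authority(name):
    """Split 'Genus_species_Auth.' into (binomial, authority or None).

    Hypothetical helper: underscores are treated as word separators,
    as in the slide's 'Erycibe_Paniculata_Roxb.' example.
    """
    parts = name.replace("_", " ").split()
    if len(parts) >= 3 and parts[-1] in AUTHOR_ABBREVIATIONS:
        return " ".join(parts[:-1]), parts[-1]
    return " ".join(parts), None


print(split_authority("Erycibe_Paniculata_Roxb."))
```

A fixed list like this is exactly where false positives come from, so any hit would still need a manual check.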


Slide 13. MW 2017: Verbal Forms
Compare verbal forms databases
Gérard Huet (gitlab INRIA)
Kulkarni (Uni of Hyderabad)
Dhaval Patel (SanskritVerb)
Jim Funderburk
? Oliver Hellwig


Slide 14. MW 2019: Literary Sources
Interlinking with Pāṇini was the initial aim
Cologne interlinking so far exists only for GRA to RV
It turned out we still do not know how to resolve all abbreviations of literary sources
Punctuation between references: unsolved
Review of abbreviations (mwabbreviations)


Slide 15. Cologne 2019: Useful Byproducts
List of all Sanskrit headwords from dictionaries: sanhw1.txt & sanhw2.txt
MW normalized grammatical information
Spellchecking & hyphenation (possible patterns)


Slide 16. MW 2017: Misc User Interface
Replica of Printed Fonts for Web



Slide 17. PW 2017: Code Reorganization Sample
meta-line format;
addition of div markup (breaking huge blobs of text into much more manageable pieces);
addition of abbreviation markup;
conversion to modern IAST;
improvements to spelling of the list of works and authors;
XML markup in place of most esoteric markup using special symbols.


Slide 18. Simple Search

Slide 19. Cologne 2020: Simple Search
How `simple` at Cologne works (#3)
Searching for khan: kāma kaṇa khan kam kāṇa khāna kan khana kaṇ khaṇa kām kham kāna kana (14 results).
“Sanskrit made easy” in Prof. Huet's wording (#2)
Implemented at SpokenSanskrit.org (#1)
To do in 2020:
Cut off verbal endings (enter an inflected form and get the underlying MW dictionary words)
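
The khan example above suggests that the simple search expands a diacritic-free query into spelling variants and keeps those that exist as headwords. The sketch below reproduces only that example: the equivalence groups (k/kh, a/ā, n/ṇ/m, optional final a) are inferred from the 14 listed results and are not taken from the Cologne implementation, and all function names are hypothetical.

```python
import unicodedata
from itertools import product

# Equivalence groups inferred from the "khan" example only (an assumption).
GROUPS = [("k", "kh"), ("a", "\u0101"), ("n", "\u1e47", "m")]
ALTERNATIVES = {member: group for group in GROUPS for member in group}


def tokenize(query):
    """Split a query into units, treating known digraphs such as 'kh' as one."""
    tokens, i = [], 0
    while i < len(query):
        pair = query[i:i + 2]
        if len(pair) == 2 and pair in ALTERNATIVES:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(query[i])
            i += 1
    return tokens


def candidates(query):
    """Generate every spelling variant, with and without a final 'a'."""
    query = unicodedata.normalize("NFC", query)
    choices = [ALTERNATIVES.get(tok, (tok,)) for tok in tokenize(query)]
    stems = {"".join(combo) for combo in product(*choices)}
    return stems | {stem + "a" for stem in stems}


def simple_search(query, headwords):
    """Keep only the variants that actually occur in the headword list."""
    wanted = candidates(query)
    return sorted(h for h in headwords
                  if unicodedata.normalize("NFC", h) in wanted)


print(len(candidates("khan")))
```

For khan this yields 24 candidate spellings; the 14 results on the slide would be the ones that survive the lookup against the headword list.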


Slide 20. Sanskrit Dataset Crowdsourcing
Carthago delenda est
When we say DCS is the source, we are not actually giving a real source. It is itself based on GRETIL (108 MB of HTML files, 1600 texts), which is nothing but an aggregator.



Slide 21. Sanskrit Dataset Crowdsourcing
Carthago delenda est
At the level of Cologne I’ve seen what 2.5 people can do in 5 years. What if we can unite 25 Sanskrit enthusiasts, manually checking the suspicious words flagged by a fuzzy (Levenshtein) algorithm?
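
How such fuzzy flagging could work, as a minimal sketch: compute the Levenshtein edit distance between each corpus word and the known headwords, and hand over to human checkers every unknown word that sits within one edit of a headword. The function names, the threshold of 1, and the toy word lists are assumptions for illustration; this is not the pipeline used at Cologne.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suspicious_words(corpus_words, headwords, max_distance=1):
    """Map each unknown word to the headwords it nearly matches."""
    known = set(headwords)
    flagged = {}
    for word in corpus_words:
        if word in known:
            continue
        near = sorted(h for h in known if levenshtein(word, h) <= max_distance)
        if near:
            flagged[word] = near
    return flagged


# Toy data, made up for the example.
print(suspicious_words(["dharma", "dharna", "yoga"], ["dharma", "karma"]))
```

The comparison is quadratic in the size of the word lists, so a real headword list would need some form of candidate filtering first; the manual check by volunteers remains the decisive step.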



Slide 22. I give you my thanks! gasyoun@gmail.com
PhD Mārcis Gasūns

October 2019
Krasnodar, Russia

