History of Cologne Digital Lexicons

Slide 1: History of Cologne Digital Lexicons
Mārcis Gasūns, October 2019
@gasyoun


Slide 3: Digital Lexicons
1988-1994
1994-2005
pre-2014
2014-2019


Slide 4: Austin 1988
“Many Sanskritists are highly computer literate”
“Bright hopes” by D. Wujastyk
Undoing sandhi, conjunct characters
Sanskrit text archive, a remake of Thesaurus Linguae Graecae, est. 1972
Full textual reference (Panini)

https://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html


Slide 5: Post-Austin 1988 (Kharagpur 2019)
Undoing sandhi: solved, open source
1992-2000, Peter Scharf (Pascal)
2009, Jim Funderburk (Perl, Java)
2015, Jim Funderburk (Python 2.7)
Conjunct characters are not an issue in Unicode. It is not widely used in India, and that does become an issue (e.g., Pune intranet). For OCR it was solved in 2016.

https://github.com/funderburkjim/ScharfSandhi


Slide 6: Post-Austin 1988 (Kharagpur 2019)
Sanskrit text archive (GRETIL), 2001
“simply rapid access library”
no “grammatical and lexical systems”
Digital Corpus of Sanskrit (DCS), 2010
560 000 lemmatized sentences (linguistic database, Sanskrit expert system)
Parallel Sanskrit-Russian Corpora, 2013
Rigveda, Atharvaveda, Mahabharata, Ramayana

https://github.com/funderburkjim/ScharfSandhi


Slide 7: Post-Austin 1988 (Kharagpur 2019)
Full contextual reference (Panini)
GRA links to RV, not yet Panini
2018, Jim Funderburk

https://github.com/funderburkjim/ScharfSandhi


Slide 8: Cologne 1997 Edition
Coding yet to be done
supplement
transliteration of Greek
botanical terms
verbal forms
literary sources
https://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html


Slide 9: MW 2019: Supplement
MW supplement (additions and corrections)
fully integrated AFAIK (2018?), Jim Funderburk


https://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html


Slide 10: MW 2019: Transliterate Greek
transliteration of Greek (16 out of 34 dictionaries)
2007, 2010: Beta Code to Unicode, Jim Funderburk, Peter Scharf
2010?: Interlinking with Perseus, Jim Funderburk
2015-2019: Proofreading Old Greek, Jim Funderburk, Jonathan Migliori


https://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html
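
The slide names a Beta Code to Unicode conversion but does not show it. As a rough illustration only, here is a minimal sketch of such a conversion for lowercase Greek. The function name `beta_to_unicode` and both tables are assumptions made for this example, not the Cologne code; capitals (the `*` prefix) and elision marks are deliberately left out.

```python
import re
import unicodedata

# Standard Beta Code letter assignments (lowercase only in this sketch).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritics follow their letter, like Unicode combining marks.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode(text):
    """Convert lowercase Beta Code to precomposed Unicode Greek (hypothetical helper)."""
    chars = [LETTERS.get(ch) or MARKS.get(ch) or ch for ch in text.lower()]
    greek = "".join(chars)
    # A sigma that is not followed by another letter is word-final.
    greek = re.sub(r"σ(?!\w)", "ς", greek)
    # Compose base letters with their combining marks where Unicode allows it.
    return unicodedata.normalize("NFC", greek)


print(beta_to_unicode("lo/gos"))
```

Normalizing to NFC at the end keeps the output comparable with text typed directly in Unicode, which matters once the Greek is interlinked with another resource.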


Slide 11: MW 2017: Botanical Terms
Recognise and update plant names: Linnaean taxonomy has changed over time (15826 cases in 8408 entries in MW)
Hedysarum_Gangeticum
sesamum_grain
the_flower_of_HibHibiscus_MutMutabilis

https://github.com/sanskrit-lexicon/MWS/issues/51


Slide 12: MW 2017: Botanical Terms
Mis-markup (surnames coded as plants)
Roxb., Hex., Gaertn., Nees., Schott., Bl., Wall., Benth., Spreng., Willd., Schott.
Erycibe_Paniculata_Roxb. ---> Erycibe_PaniculataRoxb.
L. after botanical nomenclature is not L[exicographer], but Carl Linnaeus.
Corrections can generate false positives; work on allbot1a.txt had only just begun when it stopped.



https://github.com/sanskrit-lexicon/MWS/issues/51
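
One way to find author abbreviations that were marked up as part of a plant name is to check the last underscore-separated part of each token against a list of known abbreviations. The sketch below assumes exactly that; the list holds only the abbreviations named on the slide, and `split_author` is a hypothetical helper, not code from the MWS repository.

```python
# Author abbreviations taken from the slide; a real list would be much longer.
AUTHOR_ABBREVIATIONS = {
    "Roxb.", "Hex.", "Gaertn.", "Nees.", "Schott.", "Bl.",
    "Wall.", "Benth.", "Spreng.", "Willd.", "L.",
}


def split_author(token):
    """Split an underscore-joined plant token into (name_parts, author).

    author is None when the last part is not a known abbreviation.
    """
    parts = token.split("_")
    if len(parts) > 1 and parts[-1] in AUTHOR_ABBREVIATIONS:
        return parts[:-1], parts[-1]
    return parts, None


print(split_author("Erycibe_Paniculata_Roxb."))
print(split_author("Hedysarum_Gangeticum"))
```

An exact match on a closed list keeps false positives down, but as the slide warns, any automatic correction of this kind still needs manual review.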


Slide 13: MW 2017: Verbal Forms
Compare verbal forms databases
Gérard Huet (gitlab INRIA)
Amba Kulkarni (Uni of Hyderabad)
Dhaval Patel (SanskritVerb)
Jim Funderburk
? Oliver Hellwig

https://github.com/sanskrit-lexicon/MWS/issues/51


Slide 14: MW 2019: Literary Sources
Interlinking with Pāṇini was the initial intent
Cologne interlinking exists only for GRA to RV
It turned out we still do not know how to resolve all abbreviations of literary sources
Punctuation between references: unsolved
Review of abbreviations (mwabbreviations)




https://github.com/sanskrit-lexicon/hwnorm1/blob/master/ejf/hwnorm1c/hwnorm1c.txt


Slide 15: Cologne 2019: Useful Byproducts
List of all Sanskrit headwords from dictionaries: sanhw1.txt & sanhw2.txt
dīpita:dīpita:AP,AP90,MW,MW72,SHS,STC,WIL,YAT
dīpitar:dīpitar:PW,PWG
dīpitā:dīpitā:SKD
dīpitṛ:dīpitṛ:AP,BUR,MW,MW72,SHS,WIL,YAT
dīptaka:dīptaka:MW,MW72,PW,PWG,SHS,WIL,YAT;dīptakaṃ:SKD;dīptakaḥ:AP,AP90
MW normalized grammatical information
Spellchecking & hyphenation (possible patterns)



https://raw.githubusercontent.com/sanskrit-lexicon/CORRECTIONS/master/sanhw2/sanhw2.txt
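
The sample lines suggest a simple layout: a normalized key, then one or more `spelling:DICT,DICT` groups separated by semicolons. The sketch below parses one line under that assumption. The layout is inferred from the five lines on the slide only, and `parse_headword_line` is a hypothetical name.

```python
def parse_headword_line(line):
    """Parse 'key:spelling:D1,D2;spelling2:D3' into (key, {spelling: [dictionary codes]})."""
    key, _, rest = line.strip().partition(":")
    spellings = {}
    for group in rest.split(";"):
        spelling, _, codes = group.partition(":")
        spellings[spelling] = codes.split(",")
    return key, spellings


key, spellings = parse_headword_line(
    "dīptaka:dīptaka:MW,MW72,PW,PWG,SHS,WIL,YAT;dīptakaṃ:SKD;dīptakaḥ:AP,AP90"
)
for spelling, codes in spellings.items():
    print(key, spelling, codes)
```

Grouping by key in this way shows at a glance which dictionaries list the same headword under a different citation form, for example `dīptakaṃ` in SKD against `dīptaka` in MW.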


Slide 16: MW 2017: Misc User Interface
Replica of Printed Fonts for Web Display




https://github.com/sanskrit-lexicon/MWS/issues/51


Slide 17: PW 2017: Code Reorganization Sample
meta-line format;
addition of div markup (breaking huge blobs of text into much more manageable pieces);
addition of abbreviation markup;
conversion to modern IAST;
improvements to spelling of the list of works and authors;
XML markup in place of most esoteric markup using special symbols.



https://github.com/sanskrit-lexicon/Cologne/issues/183#issuecomment-336759401


Slide 18: Simple Search

Slide 19: Cologne 2020: Simple Search
How `simple` at Cologne works (#3)
Searching for khan: kāma kaṇa khan kam kāṇa khāna kan khana kaṇ khaṇa kām kham kāna kana (14 results).
“Sanskrit made easy” in Prof. Huet's wording (#2)
Implemented at SpokenSanskrit.org (#1)
To do in 2020
Cut off verbal endings (enter an inflected form and get the underlying MW dictionary words)



https://github.com/sanskrit-lexicon/Cologne/issues/183#issuecomment-336759401
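
The 14 results for `khan` differ only in vowel length, aspiration, the nasal (n, ṇ, m) and a final short a. One way to get that behaviour is to fold every headword and the query to a common search key. The folding rules below are inferred from that single example and are an assumption, not the actual Cologne implementation; `fold` is a hypothetical name.

```python
import unicodedata

STOPS = "kgcjtdpb"  # consonants after which 'h' marks aspiration in IAST


def fold(word):
    """Reduce an IAST word to a loose search key (rules guessed from the slide)."""
    # Decompose, then drop diacritics: ā -> a, ṇ -> n, ṃ -> m.
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop aspiration: kh -> k, th -> t, and so on.
    letters = []
    for i, ch in enumerate(base):
        if ch == "h" and i > 0 and base[i - 1] in STOPS:
            continue
        letters.append(ch)
    # Treat m and n alike.
    key = "".join(letters).replace("m", "n")
    # Ignore a final short a.
    if len(key) > 1 and key.endswith("a"):
        key = key[:-1]
    return key


results = "kāma kaṇa khan kam kāṇa khāna kan khana kaṇ khaṇa kām kham kāna kana".split()
print({fold(word) for word in results})
```

With an index from folded key to headwords, a lookup for `khan` becomes a single dictionary access. The price is over-matching, which is presumably acceptable for a search mode aimed at users who cannot type diacritics.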


Slide 20: Sanskrit Dataset Crowdsourcing
Carthago delenda est
When we say DCS is the source, we are not actually giving a real source. It is itself based on GRETIL (108 MB of HTML files, 1600 texts), which is nothing but an aggregator.

https://github.com/sanskrit-lexicon/


https://github.com/sanskrit-lexicon/Cologne/issues/183#issuecomment-336759401


Slide 21: Sanskrit Dataset Crowdsourcing
Carthago delenda est
At the level of Cologne I’ve seen what 2.5 people can do in 5 years. What if we could unite 25 Sanskrit enthusiasts to manually check the suspicious words flagged by a fuzzy (Levenshtein) algorithm?

https://github.com/sanskrit-lexicon/


https://github.com/sanskrit-lexicon/Cologne/issues/183#issuecomment-336759401
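
The fuzzy check mentioned on the slide can be pictured as follows: a word is suspicious when it is missing from a trusted headword list but lies within a small edit distance of a word that is on it. This is a minimal sketch under that assumption; the threshold, the function names and the sample words are illustrative and not taken from the Cologne repositories.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def suspicious(words, known, max_distance=1):
    """Map each unknown word to the known words within max_distance edits."""
    flagged = {}
    for word in words:
        if word in known:
            continue
        near = sorted(k for k in known if levenshtein(word, k) <= max_distance)
        if near:
            flagged[word] = near
    return flagged


known = {"dipita", "dipitr", "diptaka"}
print(suspicious(["dipita", "dipitx", "diptaka", "xyz"], known))
```

Comparing every word against every headword is quadratic, so a real run over all dictionaries would need an index or a length filter first. The output is a worklist for human checkers, not an automatic correction.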


Slide 22: I give you my thanks!
gasyoun@gmail.com
PhD Mārcis Gasūns
github.com/gasyoun
October 2019
Krasnodar, Russia

