Слайд 1Search Algorithms and Data Structures
Mikhail Khludnev Khabarovsk'19
Слайд 3RDBMS is …
https://www.geeksforgeeks.org/database-file-indexing-b-tree-introduction/
Слайд 4Composite Index
CREATE INDEX class_pos_index ON users (class, position);
Слайд 7Lucene, Solr and Elasticsearch
http://www.supercoloring.com/coloring-pages/romulus-and-remus-with-the-she-wolf.
Слайд 8The First Indices
Общественное достояние, https://commons.wikimedia.org/w/index.php?curid=433142
https://rbscp.lib.rochester.edu/489
Слайд 9Term Dictionary
rubens
rub*
[rome TO rustic]
*uber
*man*
By Claudio Rocchini - Own work, CC
BY 2.5, https://commons.wikimedia.org/w/index.php?curid=2118795
Слайд 10Offsets to Postings List File
10: 8, 9, 10, 14, 18,
23, 24, 26, 31, 35; 8: 8, 11, 14, 18,
21, 23, 25, 27; 8: 4, 5, 6, 9, 13, 14, 18, 22; 8: 3, 4, 7, 9, 12, 13, 17, 20; 7: 5, 9, 12, 14, 19, 23, 28; 9: 0, 2, 5, 6, 11, 13, 17, 22, 27
Слайд 11Postings Codecs
delta 8, 9, 10, 14, 18, 23
=> 1,1,1,4,4,5
vint
5 => 000001012 129 => 100000012 000000012
PFOR 0012 0012 0012 1002 1002 1012
Слайд 13Stored Field Retrieval
{"name": "first doc", "count": 43} {"name": "another doc",
"count":5} {"name": "the last doc", "count":3435} {"name": "the book", "count":-1334}
Слайд 15Index Document
curl -X PUT "localhost:9200/twitter/tweet/1" -H 'Content-Type: application/json' -d'
{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" :
"trying out Elasticsearch"
}
'
Слайд 16Mapping
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'{
"mappings": {
"properties": {
"title": { "type":
"text" },
"name": { "type": "text" },
"age": { "type": "integer" },
"created": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis" } } }}'
Слайд 17Analysis
curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{ "settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": { "type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase"
]
} } } }}
'
Слайд 18In-memory Buffer
{
"user" : "kimchy",
"message" :
"trying out
Elasticsearch"
}
user
message
Слайд 20_refresh
Elasticsearch: The Denitive Guide
Copyright © 2015 Elasticsearch. All rights
reserved.
Слайд 21_refresh vs _flush
Elasticsearch: The Denitive Guide
Copyright © 2015 Elasticsearch.
All rights reserved.
Слайд 22Indexing Performance
_bulk
Threads
ETL is hard
Слайд 24Result Page
{
"timed_out": false,
"took": 62,
"_shards":{
"total" : 10,
"successful"
: 1,
"skipped" : 0,
"failed" : 0
},
"hits":{
"total" : {
"value": 23539,
"relation": "eq"
},
"max_score": 1.3862944,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "0",
"_score": 1.3862944,
"_source" : {
"user" : "kimchy",
"date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch",
"likes": 0
}
}, {
"_index" : "twitter",
"_type" : "_doc",
"_id" : "242",
"_score": 0.72944,
"_source" : {
"user" : "foo bar",
"date" : "2009-11-15T14:12:12",
"message" : "searching Elasticsearch",
"likes": 0
}
}
] }}
Слайд 25Result Page Cropping
10: 8, 9, 10, 14, 18, 23, 24,
26, 31, 35; 8: 8, 11, 14, 18, 21, 23,
25, 27; 8: 4, 5, 6, 9, 13, 14, 18, 22; 8: 3, 4, 7, 9, 12, 13, 17, 20;
⋃
O(n log (p))
http://www.cse.hut.fi/en/research/SVG/TRAKLA2/tutorials/heap_tutorial/taulukkona.html
Слайд 27https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
Слайд 29Elastic Logstash Kibana (ELK) Stack
https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html
Слайд 30Scaling
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
Слайд 31Summary
Why search?
Inverted Index
Indexing
Searching
Scaling
ELK
Слайд 32200 threads
https://blog.newrelic.com/engineering/expected-distributions-website-response-times/
Слайд 332,2,1
Index Hierarchy
2,2,1
2,2,1
Segment - inverted index
Lucene Index - chain
Shard
Elasticsearch Index
Index
pattern: access-log-*
access-log-2019-08-06
access-log-2019-08-07
access-log-2019-08-08
….
Слайд 34An Index is …
an derived data structure
Слайд 36Term Frequency
red
red bar
red red red bar red
Слайд 37Inverse Document Frequency
red bull
Chicago Bulls
Red Socks
Computing is too important to
be left to men
Karen Spärck-Jones