Elasticsearch Python Slides

The document provides an overview of various systems and operations related to indexing and managing data in Elasticsearch, including creating indices, inserting and deleting documents, and utilizing the search API. It explains data types, mapping, and the bulk API for efficient data handling. Additionally, it covers query DSL for constructing complex queries to retrieve specific data from indices.

2

Indices

- The quick brown fox jumped over the lazy dog (text)
- 3.14 (number)
- 15/09/2024 (date)

[Figure: a text embedding model turns a document ("The quick brown fox jumped over the lazy dog") into a dense vector such as [-0.1, 2.5, …, -1.67].]

3
Search system

[Figure: a search system built on Elasticsearch indices and documents. Source: https://www.travelmediagroup.com/the-power-of-facebook-as-a-search-engine-2/]

4
Recommendation system

[Figure: a recommendation system built on indices and documents. Source: https://www.shopagain.com/blog/product-recommendation-engines-what-is-it-how-it-work/]

5
RAG system

[Figure: a RAG (retrieval-augmented generation) system built on indices and documents. Source: https://www.superannotate.com/blog/rag-explained]

6
RAG system

[Figure: a RAG system built on indices and documents (continued). Source: https://www.superannotate.com/blog/rag-explained]

7
3 Create an index

8
What is an index?

[Figure: an index (here, a Product index) is a collection of JSON documents (Product 1, Product 2, …, Product n), each with fields such as Name, Price, Description, etc.]

9
Shards & replicas

Number of shards = 2

[Figure: sharding splits the Product index into 2 shards.]

10
Shards & replicas

Number of shards = 2
Number of replicas = 1

[Figure: each of the 2 shards of the Product index is duplicated once, giving 1 replica per shard.]
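A minimal sketch of creating an index with these settings using the official Python client (the index name "products" and the local URL are assumptions, not from the slides):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.indices.create(
    index="products",
    settings={
        "number_of_shards": 2,    # split the index into 2 primary shards
        "number_of_replicas": 1,  # keep 1 copy of each shard
    },
)
print(response["acknowledged"])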
11
4 Inserting documents

12
Document

A document is a JSON object of field/value pairs:

"Name": "value"
"Price": "value"
"Description": "value"
...

13
Document

[Figure: inserting JSON documents (×100) into my_index produces the following mapping.]

Field        Type
created_on   date
text         text
title        text

🛈 This process is called mapping and can be done automatically or manually. By default, Elasticsearch does it automatically.
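A minimal sketch of inserting (indexing) a document; the field values are illustrative:

response = es.index(
    index="my_index",
    id="1",  # optional; Elasticsearch generates an id if omitted
    document={
        "title": "A product title",
        "text": "A product description",
        "created_on": "2024-09-15",
    },
)
print(response["result"])  # "created" or "updated"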

14
5 Field data types

15
Field data types
insertion mapping
Field Type
JSON

created_on date
X100 my_index
text text

title text

16
Field data types

1) Common types

Binary (Read more)

Accepts a binary value as a Base64 encoded string.
Is not searchable and is not stored.
Use _source (i.e., the document) to get the data back.

[Figure: an image is encoded into its Base64 representation, e.g. "IVBORw0KGgoAAAANSUhEUgAAA…".]

17
Field data types

1) Common types (Read more)

Binary
Boolean (true / false)
Numbers (long, integer, short, byte, etc.)
Dates
Keyword (IDs, email addresses, status codes, zip codes, etc.)

18
Field data types

2) Object types (JSON)

Object (Read more)

JSON document:

{
    "region": "US",
    "manager": {
        "age": 30,
        "name": {
            "first": "John",
            "last": "Smith"
        }
    }
}

Indexed (flattened field names):

{
    "region": "US",
    "manager.age": 30,
    "manager.name.first": "John",
    "manager.name.last": "Smith"
}

19
Field data types

2) Object types (JSON)

Object

Flattened (Read more)
Efficient for deeply nested JSON objects.
Hierarchical structure is not preserved.

Nested
Use it when you have an array of objects.
Maintains the relationship between the object's fields.

20
Field data types

2) Object types (JSON)

Flattened / Nested object example (Read more)

JSON document:

{
    "group": "fans",
    "user": [
        {
            "first": "John",
            "last": "Smith"
        },
        {
            "first": "Alice",
            "last": "White"
        }
    ]
}

Indexed (flattened; the relationship between first and last is lost):

{
    "group": "fans",
    "user.first": ["alice", "john"],
    "user.last": ["smith", "white"]
}

21
Field data types

3) Text search types

Text (Read more)

Used for full-text content.
Examples: the body of an email or the description of a product.

[Figure: an analyzer turns the unstructured text into a structured format that is optimized for search.]
22
Field data types

3) Text search types

Text (Read more)
Used for full-text content.
Examples: the body of an email or the description of a product.

Completion
Search-as-you-type.

Annotated text

23
Field data types

4) Spatial data types (Read more)

Geo point
Geo shape
Point (Cartesian point)
Shape (Cartesian geometry)

24
Field data types

Read more

25
6 Delete documents

26
Delete documents (Read more)

[Figure: my_index contains documents with id=1, id=2, and id=3. DELETE takes an <index> and an <id>.]

27
Delete documents

[Figure: DELETE my_index/1 succeeds; DELETE my_index/4 fails (❌) because no document with id=4 exists.]

28
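A minimal sketch of deleting a document by id; the missing-id case raises NotFoundError (the ❌ above):

from elasticsearch import NotFoundError

response = es.delete(index="my_index", id="1")
print(response["result"])  # "deleted"

try:
    es.delete(index="my_index", id="4")
except NotFoundError:
    print("document 4 does not exist")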
7 Get document

29
Get documents (Read more)

[Figure: my_index contains documents with id=1, id=2, and id=3. GET takes an <index> and an <id>.]

30
Get documents

[Figure: GET my_index/1 succeeds; GET my_index/4 fails (❌) because no document with id=4 exists.]

31
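A minimal sketch of retrieving a document by id:

doc = es.get(index="my_index", id="1")
print(doc["_source"])  # the original JSON document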
8 Count documents

32
Count documents (Read more)

[Figure: COUNT can take just an <index>, or an <index> and a query <q>.]

🛈 The query parameter is used to match certain criteria.
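A minimal sketch of counting documents, with and without a query:

total = es.count(index="my_index")["count"]
matching = es.count(
    index="my_index",
    query={"match": {"title": "elasticsearch"}},
)["count"]
print(total, matching)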

33
9 The exists API

34
The exists API (Read more)

[Figure: client.indices.exists takes an <index>; client.exists takes an <index> and an <id>.]

🛈 client.indices.exists checks if an index exists in Elasticsearch.
🛈 client.exists checks if a document exists in an index.

35
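A minimal sketch of both existence checks:

index_exists = es.indices.exists(index="my_index")
doc_exists = es.exists(index="my_index", id="1")
print(bool(index_exists), bool(doc_exists))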
10 Update document

36
Update documents
1) The document exists in the index (Read more)

[Figure: my_index contains documents with id=1 and id=2. UPDATE takes an <index> and an <id>.]

🛈 The update operation follows these steps:
1. Get the document.
2. Update it (e.g., add a new field, remove a field, or update a field).
3. Re-index the result.

37
Update documents
But how do you update the document? (Read more)

[Figure: the document {"book_id": 1, "book_name": "A book"} is inserted into my_index with id=1. UPDATE gets the document, applies a script such as

script = { "book_id" = 2 }

and re-indexes the result, changing book_id from 1 to 2.]
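A minimal sketch of the two common ways to apply the update (a Painless script referencing ctx._source, or a partial doc):

es.update(
    index="my_index",
    id="1",
    script={
        "source": "ctx._source.book_id = params.new_id",
        "params": {"new_id": 2},
    },
)

# or simply merge a partial document:
es.update(index="my_index", id="1", doc={"book_id": 2})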

38
Update documents
2) The document doesn't exist (Read more)

[Figure: my_index contains documents with id=1 and id=2; the update targets id=4.]

🛈 The update operation can create the document if it doesn't exist:
1. Add the values you want to insert.
2. Set doc_as_upsert to true.
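A minimal sketch of an upsert: if id=4 does not exist, the document is created from the provided values (field values are illustrative):

es.update(
    index="my_index",
    id="4",
    doc={"book_id": 4, "book_name": "Another book"},
    doc_as_upsert=True,
)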

39
11 Bulk API

40
The bulk API (Read more)

[Figure: a stream of operations (index, index, update, delete, …) applied to my_index.]

Each operation (index, update, delete) normally makes a separate API call.
The bulk API performs multiple operations in one API call. This increases indexing speed.

41
The bulk API - Syntax (Read more)

action and metadata\n
optional source\n
action and metadata\n
optional source\n
action and metadata\n
optional source\n

🛈 The action can be one of the following: index, create, update, delete.
🛈 The source is required for: index, create, update.

42
The bulk API - Example (Read more)

response = es.bulk(
    operations=[
        # Action: index a document with id=1 into the "test" index
        {"index": {"_index": "test", "_id": "1"}},
        # Source for the index action
        {"field1": "value1"},
        # Action: delete the document with id=2 (no source needed)
        {"delete": {"_index": "test", "_id": "2"}},
        # Action: create a document with id=3
        {"create": {"_index": "test", "_id": "3"}},
        # Source for the create action
        {"field1": "value3"},
        # Action: update the document with id=1
        {"update": {"_id": "1", "_index": "test"}},
        # Source for the update action
        {"doc": {"field2": "value2"}},
    ],
)

🛈 The source is required for: index, create, update.

43
12 The search API – Part 1

44
The search API (Read more)

[Figure: my_index contains documents with id=1 and id=2.]

You use the search API to build:
● Search engines
● Recommendation systems
● Real-time dashboards
● Log data analysis
● ...

45
The search API (Read more)

<index> can be:
● my_index
● index_1,my_index
● index*
● _all

46
The search API (Read more)

<index>

q
● Use it for simple searches.
● Uses the Lucene syntax.
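A minimal sketch of a simple Lucene-syntax search using the q parameter:

response = es.search(index="my_index", q="title:elasticsearch")
print(response["hits"]["total"])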

47
The search API (Read more)

<index>

q

query
● Use it for complex, structured queries.
● Uses the Query DSL language.
● Default value is match_all.

48
The search API (Read more)

<index>, q, query, timeout, size, from

Timeout:
The maximum time to wait for a search request to complete.
Accepts time units (seconds, milliseconds, days, etc.).

Size:
Defines the number of hits (documents) to return.
Default value is 10. Max value is 10,000.
49
The search API (Read more)

<index>, q, query, timeout, size, from

from:
Starting point from which to return search results (pagination).
Useful if you want to implement skip functionality.
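A minimal sketch combining these parameters (the from keyword is passed as from_ in the Python client):

response = es.search(
    index="my_index",
    query={"match_all": {}},
    timeout="2s",  # abort the search after 2 seconds
    size=20,       # return up to 20 hits
    from_=40,      # skip the first 40 hits (page 3 when size=20)
)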

50
13 The search API – Part 2

51
The search API – Query DSL (Read more)

Query DSL is used to create complex, structured queries.

Query DSL consists of two types of clauses:
Leaf clauses (match, term, or range)
Compound clauses (bool)

🛈 Query DSL means Query Domain-Specific Language.


52
The search API – Query DSL

Leaf clauses (Read more)

1. match:
Used to perform full-text search.
Returns documents that match a provided text, number, date, or boolean value.
The field should be mapped as a text data type.

2. term:
Returns documents that contain an exact term in a provided field.
The field must be mapped as a keyword or a numeric/date type.
Example usage: product ID, book ID, username, etc.

3. range:
Returns documents that contain terms within a provided range.
53
The search API – Query DSL


Example for match query:
response = es.search(
index="my_index",
Read mor e
body={
"query": {
"match": {
"description": "A description."
}
}
}
)

54
The search API – Query DSL


Example for term query:
response = es.search(
index="my_index",
Read mor e
body={
"query": {
"term": {
"product_id": "PRODUCT_12345"
}
}
}
)

55
The search API – Query DSL

Example for a range query: (Read more)

response = es.search(
    index="my_index",
    body={
        "query": {
            "range": {
                "publication_date": {
                    "gte": "2023-01-01",
                    "lte": "2023-12-31"
                }
            }
        }
    }
)
56
The search API – Query DSL

Compound clauses (Read more)

bool:
Combines multiple queries using boolean logic:
must, filter, should, must_not.

57
The search API – Query DSL

Example for a bool query: (Read more)

response = es.search(index="my_index", body={
    "query": {
        "bool": {
            "must": [
                # keep documents whose title matches "Elasticsearch"
                {"match": {"title": "Elasticsearch"}}
            ],
            "filter": [
                # keep documents with a status of "published"
                {"term": {"status": "published"}}
            ],
            "should": [
                # this match is optional
                {"match": {"tags": "search"}}
            ],
            "must_not": [
                # exclude any document where the deleted field is set to true
                {"term": {"deleted": True}}
            ]
        }
    }
})

58–62
14 The search API – Part 3

63
The search API (Read more)

Timeout
Sets the maximum duration for a query to execute.
If the search takes longer than the specified time, Elasticsearch will abort the search.

Size
Controls how many search results are returned.
Max value is 10,000.

From
Used for pagination.
It tells Elasticsearch how many documents to skip before starting to return results.

64–66
The search API (Read more)

Aggregations
Performs calculations on the data:
average, max, min, count, etc.
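A minimal sketch of metric aggregations (the field name "price" is illustrative):

response = es.search(
    index="my_index",
    size=0,  # return only the aggregations, not the hits
    aggs={
        "avg_price": {"avg": {"field": "price"}},
        "max_price": {"max": {"field": "price"}},
    },
)
print(response["aggregations"]["avg_price"]["value"])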

67
15 Dense vectors

68
Dense vector field type (Read more)

Stores dense vectors of numeric values.
Use it if you have few or no zero elements.
Does not support aggregations or sorting.
It is not possible to store multiple values in one dense vector field.
Use kNN search to retrieve the nearest vectors.
Max size of a dense vector is 4096.

[2, 0, 1, −5, …, 20]        ✓ a single dense vector
[[…], […], […], …]          ✗ multiple vectors in one field

69
Medium article by Sachinsoni
Dense vector field type

Examples (Read more)

# A single vector per field: valid.
response = client.index(
    index="my-index",
    id="1",
    document={
        "my_text": "example text",
        "my_vector": [0.5, 10, 6]
    }
)

# Multiple vectors in one field: rejected (see the previous slide).
response = client.index(
    index="my-index",
    id="2",
    document={
        "my_text": "example text 2",
        "my_vector": [[0.5, 10, 6], [1, 0, -2]]
    }
)

70
Dense vector field type

Important (Read more)
You have to do the mapping manually.
Elasticsearch does not automatically infer the mapping for dense vectors.
It needs to know the exact number of dimensions.

response = es.indices.create(
    index="my_index",
    mappings={
        "properties": {
            "sides_length": {
                "type": "dense_vector",
                "dims": 4
            },
            "shape": {
                "type": "keyword"
            }
        }
    },
)

71
16 Embedding documents

72
Embedding documents (Read more)

Embedding transforms text into numerical vectors.
Deep learning models are used to embed documents.
These models preserve the meaning of the text.

Use cases:
Recommendation systems
Retrieval-Augmented Generation (RAG)

[Figure: documents are embedded into vectors such as [2, 0, 1, −5, …, 20], [0.54, −4.2, 0, −0.6, …, 1], [3, 0.33, −0.98, −1, …, −1.1].]
73
Embedding documents

Deep learning models (Read more)

Closed models:
● Paid (pay for what you use).
● No hardware required (cloud).

Open-source models:
● Free to use.
● Hardware is required (preferably a GPU).

74
Embedding documents

Embedding size (Read more)

The size of the dense vector.
Larger sizes generally yield better embeddings.
Common sizes include 384, 768, and 1024.

75
Embedding documents

Input size (Read more)

The size of the input text that the model can process.
Text will be truncated if it exceeds the model's capacity.
Common values include 256 and 512 tokens.

76
Embedding documents

Text language (Read more)

Some models support only specific languages.
Others support multiple languages and are multilingual.
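A minimal sketch of embedding text with an open-source model (assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, which produces 384-dimensional vectors):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["The quick brown fox", "jumped over the lazy dog"])
print(vectors.shape)  # (2, 384)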

77
Embedding documents

78
Massive Text Embedding Benchmark (MTEB)
17 KNN search

79
k-nearest neighbor (kNN) search (Read more)

How do we search for embedded documents?
We use kNN search for fields mapped as dense vectors.
Important: you can't use the query parameter in this case.

[Figure: documents are embedded into dense vectors; kNN search operates on those vectors.]

80
k-nearest neighbor (kNN) search (Read more)

The kNN algorithm is used for classification and regression tasks.
It classifies new data points by comparing them with the k nearest points from the training data.
Common distance metrics: Euclidean, Manhattan, or Minkowski.
This algorithm is simple and effective.

Medium article by Sachinsoni · Medium article by Dancker

81
k-nearest neighbor (kNN) search


82
k-nearest neighbor (kNN) search

Example (Read more)

results = es.search(
    knn={
        'field': 'embedding',                             # 🛈 should be a dense_vector field
        'query_vector': es.get_embedding(parsed_query),   # 🛈 the query itself must be a vector
        'num_candidates': 50,
        'k': 10,
    }
)

🛈 num_candidates: retrieves 50 potentially relevant documents before applying distance calculations to select the k best documents.
🛈 k: returns up to 10 documents that match your query.

83–86
86
18 Deep pagination

87
Deep pagination (Read more)

Indexing / fetching all documents at once is inefficient and slow.

Pagination:
Retrieves data in small chunks from large indexes.
Improves performance and efficiency.
Fast search experience.
Cost effective.

[Figure: indexing / fetching 100K documents at once vs. in small chunks. Source: Medium article by Dayanand]

88
Deep pagination

Pagination methods (Read more)
from/size: commonly used for paginating smaller datasets.
search_after: offers more efficient deep pagination for large datasets.

89
Deep pagination

Pagination methods (Read more)
from/size: commonly used for paginating smaller datasets.
search_after: offers more efficient deep pagination for large datasets.

🛈 Note (from/size):
Limited to 10k results.
Requires a lot of memory for deep pages.
Not suitable for larger indexes.

[Figure: two requests against the same index, one with from = 0, size = 8 and one with from = 5, size = 8.]
90
Deep pagination

Pagination methods (Read more)
from/size: commonly used for paginating smaller datasets.
search_after: offers more efficient deep pagination for large datasets.

[Figure: search_after pages through the index using sortable fields (e.g. timestamp, id); each request of size = 8 starts after the old sort values of the previous page and returns new sort values.]
91
Deep pagination

Pagination methods (Read more)
from/size: commonly used for paginating smaller datasets.
search_after: offers more efficient deep pagination for large datasets.

🛈 Note (search_after):
Not constrained by the 10k result limit.
Does not use an offset (i.e., the from parameter).
Results must be sorted by fields such as an ID or a timestamp.
Uses a pointer derived from the sort values of the last document on the previous page.
This approach prevents the skipping of documents.
It is particularly beneficial for handling larger indexes.
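A minimal sketch of both pagination styles (the sort field "timestamp" is illustrative):

# from/size: fine for shallow pages, capped at 10,000 results.
page = es.search(index="my_index", query={"match_all": {}}, size=8, from_=16)

# search_after: sort on a stable field and pass the last hit's sort
# values into the next request.
first = es.search(
    index="my_index",
    query={"match_all": {}},
    size=8,
    sort=[{"timestamp": "asc"}],
)
last_sort = first["hits"]["hits"][-1]["sort"]
next_page = es.search(
    index="my_index",
    query={"match_all": {}},
    size=8,
    sort=[{"timestamp": "asc"}],
    search_after=last_sort,
)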

92
Deep pagination

Comparison (Read more)
Below are the results obtained after attempting to retrieve 10,000 documents using the following parameters (size = 200, iterations = 50):

                          from/size    search_after
Average time (ms)         6.417        3.072
Maximum time (ms)         15.812       4.896
Minimum time (ms)         2.757        1.772
Performance degradation   2.9x         0.58x

🛈 Performance degradation is calculated as (last page time / first page time).

94
19 Ingest pipelines

95
Ingest pipelines (Read more)

You can perform transformations on data before indexing.
Common transformations: remove fields, lowercase text, remove HTML tags, and more.

[Figure: Step 1: Documents → Step 2: Ingest pipeline → Step 3: Target index]

96
Ingest pipelines

We use the ingest API to:
Create or update a pipeline. (Read more)

response = client.ingest.put_pipeline(
    id="my-pipeline",
    description="A description",
    processors=[
        {
            "set": {
                "description": "A description",
                "field": "field",
                "value": "value"
            }
        },
        {
            "lowercase": {
                "field": "field"
            }
        }
    ],
)

97
Ingest pipelines

We use the ingest API to:
Simulate a pipeline. (Read more)

response = client.ingest.simulate(
    id="my-pipeline",
    docs=[
        {
            "_index": "index",
            "_id": "id",
            "_source": {"foo": "bar"}
        },
        {
            "_index": "index",
            "_id": "id",
            "_source": {"foo": "rab"}
        }
    ],
)

OR, with an inline pipeline definition:

response = client.ingest.simulate(
    pipeline={
        "processors": [
            {"lowercase": {"field": "field"}}
        ]
    },
    ...
)

98
Ingest pipelines

We use the ingest API to:
Delete a pipeline. (Read more)

response = client.ingest.delete_pipeline(
    id="my-pipeline",
)

99
Ingest pipelines

We use the ingest API to:
Get a pipeline. (Read more)

response = client.ingest.get_pipeline(
    id="my-pipeline",
)

100
Ingest pipelines

Pipelines can fail. You can either ignore the failure or handle it.
If you ignore the failure, the pipeline will skip over the failed steps. (Read more)

response = client.ingest.put_pipeline(
    id="my-pipeline",
    processors=[
        {
            "rename": {
                "description": "A description",
                "field": "field",
                "ignore_failure": True
            }
        }
    ],
)

[Figure: in the ingest pipeline, a failed Step 2 is skipped and Step 3 still runs.]

101
Ingest pipelines

Pipelines can fail. You can either ignore the failure or handle it.
Specify custom error-handling steps with on_failure (retry, log the error, etc.). (Read more)

response = client.ingest.put_pipeline(
    id="my-pipeline",
    processors=[
        {
            "rename": {
                "description": "A description",
                "field": "field",
                "on_failure": [...]
            }
        }
    ],
)

[Figure: when Step 2 fails, the on_failure steps run.]

102
20 Ingest processors

103
Ingest processors (Read more)

Common transformations: remove fields, lowercase text, remove HTML tags, and more.

[Figure: Step 1: Documents → Step 2: Ingest pipeline → Step 3: Target index]

104
Ingest processors

Ingest processors by category (Read more)

Ingest processors are organized into 5 categories:

Data enrichment: Append, Inference, Attachment, ...
Array/JSON handling: For each, JSON, Sort
Data transformation: Convert, Rename, Set, HTML strip, Lowercase / Uppercase, Trim, Split, ...
Data filtering: Drop, Remove
Pipeline handling: Fail, Pipeline

105
21 Filters in depth

106
Filters in depth (Read more)

When searching in Elasticsearch, you can use either query context or filter context.

Query → Query context → Score ("How well does this document match this query clause?")
Query → Filter context → Yes / No ("Does this document match this query clause?")

107
Filters in depth

Why use the filter context? (Read more)

Binary matching.
No score is needed.
Filters execute faster than queries (no score is computed).
Filters consume fewer CPU resources.

Query → Filter context → Yes / No ("Does this document match this query clause?")

108
Filters in depth

Use cases (Read more)
Filters are effective for querying structured data.

Structured data: numeric fields, dates, boolean values, keyword fields, ...

109
Filters in depth

Example 1 (Read more)

response = client.search(
    index="phones",
    query={
        "bool": {
            "filter": [
                {"term": {"color": "black"}},
                {"term": {"brand": "samsung"}}
            ]
        }
    },
)

[Figure: Color AND Brand → Yes/No]

110
Filters in depth

Example 2 (Read more)

response = client.search(
    query={
        "bool": {
            "filter": [
                {"term": {"status": "published"}},
                {
                    "range": {
                        "publish_date": {
                            "gte": "2015-01-01",
                            "lte": "2015-02-01"
                        }
                    }
                }
            ]
        }
    },
)

[Figure: Status AND Publish date → Yes/No]

111
Filters in depth

Post filters (Read more)
Applies filters after aggregations are calculated.
Does not affect aggregations.
Only filters the search results.
Lets you narrow down what users see without limiting what they can choose from.

112
Filters in depth

Example (Read more)

response = client.search(
    index="shirts",
    query={
        "bool": {
            "filter": {"term": {"brand": "gucci"}}
        }
    },
    aggs={
        "colors": {
            "terms": {"field": "color"}
        },
        "color_red": {
            "filter": {"term": {"color": "red"}},
            "aggs": {
                "models": {
                    "terms": {"field": "model"}
                }
            }
        }
    },
    post_filter={
        "term": {"color": "red"}
    },
)

113
22 SQL search API

114
SQL Search API (Read more)

We used Query DSL to search for documents.
An alternative method for searching documents is the SQL Search API.

[Figure: a Query goes through Query DSL to my_index and returns Results.]

115
SQL Search API

The SQL search API supports numerous parameters: (Read more)
format, cursor, fetch_size, delimiter, filter, page_timeout, request_timeout, ...

116
SQL Search API

Example (Read more)

response = client.sql.query(
    format="txt",
    query="SELECT * FROM library ORDER BY page_count DESC LIMIT 5",
)

In the SQL statement: the selected columns are fields, "library" is the index, and LIMIT controls the size.

117
SQL Search API

The available response formats are: (Read more)
CSV
JSON
TSV
TXT
YAML
CBOR (binary format)
SMILE (binary format)

response = client.sql.query(
    format="txt",
    query="SELECT * FROM library ORDER BY page_count DESC LIMIT 5",
)

118
SQL Search API

Pagination (Read more)

response = client.sql.query(
    format="txt",
    cursor="sDHOSBDISBXMLK…",
)

[Figure: the original query returns results plus a cursor; passing the cursor back to the SQL Search API returns the next page.]

119
SQL Search API

Filtering (Read more)

response = client.sql.query(
    format="txt",
    query="SELECT * FROM library ORDER BY page_count DESC",
    filter={
        "range": {
            "page_count": {
                "gte": 100,
                "lte": 200
            }
        }
    },
    fetch_size=5,
)

120
SQL Search API

SQL Translate API (Read more)

response = client.sql.translate(
    query="SELECT * FROM library ORDER BY page_count DESC",
    fetch_size=10,
)

The SQL Translate API returns the equivalent Query DSL:

{
    "size": 10,
    "_source": false,
    "fields": [{"field": "author"}, ...],
    "sort": [
        {
            "page_count": {
                "order": "desc"
            }
        }
    ],
    "track_total_hits": -1
}

121
SQL Search API

SQL Limitations

Read more

122
23 Time Series Data Stream

123
Time Series Data Stream


Time series data refers to data points ordered by time.

Data is collected at regular intervals. R ead mor e

Example: CPU usage over time.

124
Time Series Data Stream


Managing time series data is challenging. (Read more)
The data can grow rapidly (high-frequency measurements).
How do you store this large volume efficiently?
Deciding which old data to keep and when to delete it.

125
Time Series Data Stream

Why use Elasticsearch for time series data? (Read more)

Elasticsearch can handle massive volumes of data.
Supports real-time data ingestion and querying.
Can analyze time series.

(Original posts)

126
Time Series Data Stream

Time series data structure



Each data point is a document R ead mor e

Each document contains the timestamp field and the data.

28

26
data

24

22

20

06:00:00 06:00:01 06:00:02 06:00:03 06:00:04


@timestamp 127
Time Series Data Stream

Index Lifecycle Management (ILM) (Read more)

ILM automates the rollover and management of indices.
Benefits: storage optimization, automated data retention, efficient management of index size.

Phases of ILM: Hot phase → Warm phase → Cold phase → Delete phase

128
Time Series Data Stream

ILM visualized (Read more)

ILM policy: Rollover when age = 30 days or size = 50GB; Delete after 90 days.

[Figure, across slides 129–132: my_index_0001 is created; at age = 30 days it rolls over and my_index_0002 is created; 30 days later my_index_0003 is created; once my_index_0001 reaches age = 90 days it is deleted, and so on.]

129–132
Time Series Data Stream

Querying time series data (Read more)

Range query on the timestamp:

{
    "query": {
        "range": {
            "@timestamp": {
                "gte": "2024-11-01T00:00:00",
                "lte": "2024-11-07T23:59:59"
            }
        }
    }
}

Aggregation on the metric:

{
    "aggs": {
        "avg_cpu_usage": {
            "avg": {
                "field": "cpu_usage"
            }
        }
    }
}

133
24 Analyzers

134
Analyzers (Read more)

Analyzers process text during indexing and searching.
They transform text into tokens.
They make the search process efficient and accurate.

Inverted index:
Term     Document
hello    Document 1
world    Document 1
imad     Document 2
saddik   Document 2

(Image origin)

135
Analyzers

Analyzer components (Read more)

An analyzer is a combination of 3 components:
Character filters (min 0)
Tokenizer (exactly 1)
Token filters (min 0)

136
Analyzers

Built-in analyzers (Read more)
Provide ready-made options for processing text in various ways.
Each built-in analyzer is designed for specific types of data.

Common analyzers:

                    Standard Analyzer          Simple Analyzer       Whitespace Analyzer    Stop Analyzer
Character filters   None                       None                  None                   None
Tokenizer           Standard Tokenizer         Lowercase Tokenizer   Whitespace Tokenizer   Lowercase Tokenizer
Token filters       Lowercase & Stop filter    None                  None                   Stop filter

137
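A minimal sketch of defining a custom analyzer at index-creation time (the analyzer and index names are illustrative):

es.indices.create(
    index="articles",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    },
    mappings={
        "properties": {
            "text": {"type": "text", "analyzer": "my_analyzer"}
        }
    },
)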


Analyzers

Phases of analysis (Read more)

Index time analysis:
[Figure: documents pass through the tokenizer (producing tokens such as hello, world, imad, saddik) and then through token filters; the filtered tokens populate the inverted index (Term → Document).]

138
Analyzers

Phases of analysis (Read more)
Index time analysis
Search time analysis

Search time analysis:
[Figure: the query goes through the same tokenizer and token filters, and the filtered tokens are looked up in the inverted index to find matching documents.]

139


Analyzers

Read more

140
Analyzers

141
25 Synonyms

142
Synonyms (Read more)

Synonyms help enhance search accuracy.
Useful for matching variations or related terms.
Synonyms are defined using the Solr format.

143
Synonyms

Solr format (Read more)
This is a flexible syntax for defining synonyms.
It uses two different definitions:
Equivalent synonyms: "term 1, term 2, term 3"
Explicit synonyms: "term 1, term 2 => term 3"

[Figure: equivalent terms (car, automobile, race car, voiture, ...) on the left; explicit mappings (Personal computer => PC; i-pod, i pod => ipod) on the right.]

144
Synonyms

Synonyms are used within analyzers. (Read more)
You can use synonyms at index and at search time.
Synonyms are applied through a token filter in a custom analyzer.

[Figure: Custom Analyzer = no character filter + Standard Tokenizer + Synonyms filter]

145
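A minimal sketch of a synonym token filter inside a custom analyzer, using the Solr format (names and synonym lists are illustrative):

es.indices.create(
    index="products",
    settings={
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "car, automobile, voiture",
                        "i-pod, i pod => ipod",
                    ],
                }
            },
            "analyzer": {
                "synonym_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"],
                }
            },
        }
    },
)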
26 Common options

146
Common options (Read more)

Simplify Elasticsearch management.
Provide features like human-readable output, date math, and filtering.
All Elasticsearch REST APIs support these common options.

147
Common options

Human-readable output (Read more)
Returns statistics in a format that humans can understand.
It applies to disk space, memory, time, and other metrics.

Example

response = es.cluster.stats(human=True)
pprint(response["nodes"]["jvm"])

[Figure: the same JVM statistics before and after human=True.]

148
Common options

Date math (Read more)
Perform math operations on dates.
Operations include: add, subtract, round down to the nearest day.
Supported time units: y (years), M (months), etc.
The expression starts with an anchor date ("now" or a date string ending with ||).

Examples

now := 2024-11-16 11:55:00
now+1h := 2024-11-16 12:55:00
now-1h := 2024-11-16 10:55:00
now-1h/d := 2024-11-16 00:00:00
2024.11.16||+1M/d := 2024-12-16 00:00:00

149
Common options

Response filtering (Read more)
Inclusive filtering: specify fields to include.
Exclusive filtering: remove unnecessary fields.
Combined filtering.

Example

response = es.search(
    index=index_name,
    body={
        "query": {
            "match_all": {}
        }
    },
    filter_path="hits.hits._id,hits.hits._source"
)
pprint(response.body)

[Figure: the response before and after applying filter_path.]

150
27 Change heap size

151
Change heap size (Read more)

By default, Elasticsearch uses 50% of the available RAM.
This can slow down your PC.
You only need 1 or 2GB when dealing with small indices.

152
Change heap size

Steps to change the heap size (Read more)

1. Start the container.

sudo docker start elasticsearch

153

2. Go inside the container.

sudo docker exec -u 0 -it elasticsearch bash

154

3. Create the heap.options file inside the jvm.options.d folder and add the heap settings.

echo "-Xms2g" > /usr/share/elasticsearch/config/jvm.options.d/heap.options
echo "-Xmx2g" >> /usr/share/elasticsearch/config/jvm.options.d/heap.options
cat /usr/share/elasticsearch/config/jvm.options.d/heap.options

155
Change heap size

Read more

156
28 Final project – part 0

157
Final project – part 0


No more videos on Elasticsearch concepts and APIs.

I'll be focusing on the final project from now on.

The final project will cover most of the topics we've learned in previous videos.

We will be building a website.

Elasticsearch will provide the search functionality.

158
Final project – part 0

Source code

159
29 Final project – part 1

160
Final project – part 1


Create an index and index documents.

Use size / from when searching.

Perform multi-match queries.

The theme of the final project is Astronomy.

161
Image credit: Kent E. Biggs
Final project – part 1


Frontend is done for you.

Install dependencies.

162
Final project – part 1


Install dependencies.

Setup the backend server.

Configure Elasticsearch.

163
30 Final project – part 2

164
Final project – part 2


Add the pagination controls.

Filter by year.

Use aggregations.

165
31 Final project – part 3

166
Final project – part 3


Implement a search-as-you-type feature.
Utilize the N-gram tokenizer.

[Figure: the partial query "Andr" processed by the standard tokenizer vs. the N-gram tokenizer.]

167
Final project – part 3

Why use the N-gram tokenizer?

Standard tokenizer:  Andromeda → [andromeda]
N-gram tokenizer (N = 9):  Andromeda → [a, an, and, andr, andro, androm, androme, andromed, andromeda]

168
32 Final project – part 4

169
Final project – part 4


Implement semantic search.

Use an embedding model from HuggingFace.

Use kNN search to find documents.

Medium article by Sachinsoni

170
33 Final project – part 5

171
Final project – part 5


Add the raw APOD data.
It contains HTML tags.

{
    "date": "2024-11-30",
    "title": "<a href=\"ap241130.html\">Winter and Summer on a Little Planet</a>",
    "explanation": "<p>\n<b> Explanation: </b> \n\nWinter and summer appear to come on a single night to this\n<a href=\"https://www.instagram.com/camille.niel_photography/p/C270AVzrKcp/?img_index=1\">stunning little planet</a>.\n\nIt's planet Earth of course.\n\nThe\n<a href=\"http://srcematematike.si/2014/03/09/math-behind-tiny-planets/\">digitally mapped</a>,\nnadir centered panorama covers 360x180\ndegrees and is\ncomposed of frames recorded during January and July from the\n<a href=\"https://en.wikipedia.org/wiki/Col_du_Galibier\">Col du Galibier</a> ...
}

172
Final project – part 5

Add the raw APOD data.
It contains HTML tags.
Use pipelines (the HTML strip processor) to remove the HTML tags.

Before (raw):
{
    "date": "2024-11-30",
    "title": "<a href=\"ap241130.html\">Winter and Summer on a Little Planet</a>",
    "explanation": "<p>\n<b> Explanation: </b> \n\nWinter and summer appear to come on a single night to this\n<a href=\"https://www.instagram.com/camille.niel_photography/p/C270AVzrKcp/?img_index=1\">stunning little planet</a>.\n\nIt's planet Earth of course.\n\nThe\n<a href=\"http://srcematematike.si/2014/03/09/math-behind-tiny-planets/\">digitally mapped...
}

After (HTML strip via the ingest pipeline):
{
    "date": "2024-11-30",
    "title": "Winter and Summer on a Little Planet",
    "explanation": "\n Explanation: \n\nWinter and summer appear to come on a single night to this\nstunning little planet.\n\nIt's planet Earth of course.\n\nThe\ndigitally mapped...
}

173
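A minimal sketch of an ingest pipeline that strips the HTML tags (the pipeline name, index name, and field list are illustrative, not from the slides):

client.ingest.put_pipeline(
    id="apod-html-strip",
    processors=[
        {"html_strip": {"field": "title"}},
        {"html_strip": {"field": "explanation"}},
    ],
)

# Index raw APOD documents through the pipeline:
client.index(index="apod", document=raw_doc, pipeline="apod-html-strip")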
