Elasticsearch Python Slides
Indices
3
Search system
Indices
https://www.travelmediagroup.com/the-power-of-facebook-as-a-search-engine-2/
Documents
4
Recommendation system
Indices
https://www.shopagain.com/blog/product-recommendation-engines-what-is-it-how-it-work/
Documents
5
RAG system
Indices
https://www.superannotate.com/blog/rag-explained
Documents
3 Create an index
8
What is an index?
● An index is a collection of JSON documents.
● Example: a Product Index holds Product 1, Product 2, ..., Product n.
● Each document has fields such as Name, Price, Description, ...
Shards & replicas
● Sharding: with number of shards = 2, the Product Index is split into two shards.
● Replication: with number of replicas = 1, each shard is duplicated once.
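A minimal sketch of the shard and replica settings above, written as plain Python so it runs without a cluster. The index name "products" and both helper functions are illustrative assumptions, not part of the slides.

```python
def build_index_settings(number_of_shards: int = 2, number_of_replicas: int = 1) -> dict:
    """Return a settings body describing the shard layout of an index."""
    if number_of_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if number_of_replicas < 0:
        raise ValueError("the replica count cannot be negative")
    return {
        "index": {
            "number_of_shards": number_of_shards,
            "number_of_replicas": number_of_replicas,
        }
    }


def total_shard_copies(settings: dict) -> int:
    """Primary shards plus one extra copy of each primary per replica."""
    index = settings["index"]
    return index["number_of_shards"] * (1 + index["number_of_replicas"])


settings = build_index_settings(number_of_shards=2, number_of_replicas=1)
print(total_shard_copies(settings))

# With a running cluster and the official Python client, the settings could be
# passed like this (hypothetical index name):
#   es.indices.create(index="products", settings=settings)
```

With 2 shards and 1 replica the cluster holds four shard copies in total: two primaries and two duplicates.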
4 Inserting documents
12
Document
●
“Name”: “value”
JSON
●
“Price”: “value”
●
“Description”: “value”
●
...
13
Document
● Inserting JSON documents (x100) into my_index produces a mapping:
Field       Type
created_on  date
text        text
title       text
5 Field data types
15
Field data types
1) Common types
● Binary
  ● Accepts a binary value as a Base64-encoded string.
  ● Is not searchable and is not stored by default.
  ● Use _source (i.e., the document) to get the data back.
  ● Example of a Base64 representation after encoding: IVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAIAAADiVBORw0KGgoAAAANSUhEUgAAAOEAAADhCAIA….
Field data types
1) Common types
● Binary
● Boolean (true / false)
● Numbers (long, integer, byte, short, etc.)
● Dates
● Keyword (IDs, email addresses, status codes, zip codes, etc.)
Field data types
● Object
JSON document:
    {
        "region": "US",
        "manager": {
            "age": 30,
            "name": {
                "first": "John",
                "last": "Smith"
            }
        }
    }
After indexation, the document is stored as flat key-value pairs:
    {
        "region": "US",
        "manager.age": 30,
        "manager.name.first": "John",
        "manager.name.last": "Smith"
    }
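The flattening of object fields shown above can be imitated in a few lines of Python. This is only an illustration of the dotted-path idea, not Elasticsearch's internal code; the function name `flatten` is an assumption.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Turn nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into inner objects, carrying the path so far.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


document = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
flat_document = flatten(document)
print(flat_document)
```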
Field data types
●
Object
●
Flattened Read more
●
Efficient for deeply nested JSON objects.
●
Hierarchical structure is not preserved.
●
Nested
●
Cases where you have array of objects.
●
Maintains relationship between the object’s fields.
20
Field data types
● Flattened / Nested object example
JSON document:
    {
        "group": "fans",
        "user": [
            {
                "first": "John",
                "last": "Smith"
            },
            {
                "first": "Alice",
                "last": "White"
            }
        ]
    }
After indexation:
    {
        "group" : "fans",
        "user.first" : [ "alice", "john" ],
        "user.last" : [ "smith", "white" ]
    }
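Why the nested type matters can be seen with a small simulation. This is a sketch in plain Python, not Elasticsearch code, and the helper names are assumptions. Once an array of objects is flattened, a search for first = "alice" and last = "smith" matches even though no such user exists.

```python
def flatten_object_array(field: str, objects: list) -> dict:
    """Collect the values of each sub-field into one list per dotted path."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value.lower())
    # Sort only to mirror the order shown on the slide.
    return {path: sorted(values) for path, values in flat.items()}


def flat_match(flat: dict, first: str, last: str) -> bool:
    """Match against the flattened form: the pairing of fields is lost."""
    return first in flat["user.first"] and last in flat["user.last"]


def nested_match(objects: list, first: str, last: str) -> bool:
    """Match per object: the pairing of fields is preserved."""
    return any(
        o["first"].lower() == first and o["last"].lower() == last for o in objects
    )


users = [{"first": "John", "last": "Smith"}, {"first": "Alice", "last": "White"}]
flat_users = flatten_object_array("user", users)

print(flat_match(flat_users, "alice", "smith"))   # false positive
print(nested_match(users, "alice", "smith"))      # correctly rejected
```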
Field data types
●
Text
●
Used for full-text content. Read more
●
Examples: Body of an email or the description of a product.
● An analyzer converts the unstructured format into a structured format that’s optimized for search.
Field data types
●
Text
●
Used for full-text content. Read more
●
Examples: Body of an email or the description of a product.
●
Completion
●
Search as you type
●
Annotated text
23
Field data types
●
Geo point
●
Geo shape Read more
●
Point (Cartesian point)
●
Shape (Cartesian geometry)
24
Field data types
Read more
25
6 Delete documents
26
Delete documents
● my_index contains three JSON documents: id=1, id=2, and id=3.
● Syntax: DELETE <index> <id>
● DELETE my_index 1 succeeds.
● DELETE my_index 4 fails (❌): no document with id=4 exists in my_index.
7 Get document
29
Get documents
● my_index contains three JSON documents: id=1, id=2, and id=3.
● Syntax: GET <index> <id>
● GET my_index 1 succeeds.
● GET my_index 4 fails (❌): no document with id=4 exists in my_index.
8 Count documents
32
Count documents
JSON
Read mor e
my_index
JSON
JSON
COUNT COUNT
33
9 The exists API
34
The exists API
● client.indices.exists: checks whether an index exists in Elasticsearch.
● client.exists: checks whether a document exists in an index.
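A minimal sketch of how the two exists calls might guard a get. The helper `get_source_if_exists` is hypothetical, and `FakeClient` is an in-memory stand-in (an assumption made so the sketch runs without a cluster) that mimics only the three client calls used here.

```python
class FakeIndices:
    """Stand-in for client.indices; only exists() is imitated."""

    def __init__(self, store: dict):
        self._store = store

    def exists(self, index: str) -> bool:
        return index in self._store


class FakeClient:
    """Stand-in for the Elasticsearch client, backed by a dict of indices."""

    def __init__(self, store: dict):
        self._store = store
        self.indices = FakeIndices(store)

    def exists(self, index: str, id: str) -> bool:
        return id in self._store.get(index, {})

    def get(self, index: str, id: str) -> dict:
        return {"_source": self._store[index][id]}


def get_source_if_exists(client, index: str, doc_id: str):
    """Return the document's _source, or None if the index or document is missing."""
    if not client.indices.exists(index=index):
        return None
    if not client.exists(index=index, id=doc_id):
        return None
    return client.get(index=index, id=doc_id)["_source"]


client = FakeClient({"my_index": {"1": {"book_name": "A book"}}})
print(get_source_if_exists(client, "my_index", "1"))
print(get_source_if_exists(client, "my_index", "4"))
```

The same helper should work unchanged with a real client object, since it only uses keyword arguments that the official client accepts.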
10 Update document
36
Update documents
1) Documents exists in the index
37
Update documents
But how do you update the document?
insertion
●
“book_id”: 1 Read mor e
JSON
●
“book_name”: “A book”
id=1 my_index
38
Update documents
2) Document doesn’t exist
UPDATE
39
11 Bulk API
40
The bulk API
JSON Index
Read mor e
my_index JSON Index
Update
Delete
●
Each operation (index, update, delete) makes a separate
... API call.
●
The bulk API performs multiple operations in one API call.
This increases indexing speed.
41
The bulk API - Syntax
action and metadata\n
optional source\n
action and metadata\n
optional source\n
…
action and metadata\n
optional source\n
🛈 The source is required for:
● index
● create
● update
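The action/source pairing can be sketched as a small builder. The input format (dicts with "op", "index", "id", "source") and both function names are assumptions made for this illustration; only the resulting list follows the bulk syntax described above.

```python
import json


def to_bulk_operations(actions: list) -> list:
    """Expand simple action descriptions into bulk action and source lines."""
    operations = []
    for action in actions:
        op = action["op"]
        metadata = {"_index": action["index"], "_id": action["id"]}
        operations.append({op: metadata})
        if op in ("index", "create"):
            # index and create need the document itself as the source line.
            operations.append(action["source"])
        elif op == "update":
            # update wraps the partial document in "doc".
            operations.append({"doc": action["source"]})
        elif op != "delete":
            raise ValueError(f"unknown bulk operation: {op}")
    return operations


def to_ndjson(operations: list) -> str:
    """Serialize the operations as newline-delimited JSON, one line each."""
    return "".join(json.dumps(line) + "\n" for line in operations)


actions = [
    {"op": "index", "index": "test", "id": "1", "source": {"field1": "value1"}},
    {"op": "delete", "index": "test", "id": "2"},
    {"op": "create", "index": "test", "id": "3", "source": {"field1": "value3"}},
    {"op": "update", "index": "test", "id": "1", "source": {"field2": "value2"}},
]
operations = to_bulk_operations(actions)
print(to_ndjson(operations))
```

The delete action contributes a single line, while the other three contribute two each.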
The bulk API - Example
    response = es.bulk(
        operations=[
            {
                "index": {  # action
                    "_index": "test",
                    "_id": "1"
                }
            },
            {
                "field1": "value1"  # source
            },
            {
                "delete": {  # action (no source)
                    "_index": "test",
                    "_id": "2"
                }
            },
            {
                "create": {  # action
                    "_index": "test",
                    "_id": "3"
                }
            },
            {
                "field1": "value3"  # source
            },
            {
                "update": {  # action
                    "_id": "1",
                    "_index": "test"
                }
            },
            {
                "doc": {  # source
                    "field2": "value2"
                }
            }
        ],
    )
12 The search API – Part 1
44
The search API
JSON id=1
Read mor e
45
The search API
<index> ●
my_index
●
index_1,my_index
SEARCH mor e
Read
●
index*
●
_all
46
The search API
<index>
q ●
Use it for simple searches.
SEARCH mor e
Read
●
Uses the Lucene syntax.
47
The search API
<index>
q
SEARCH mor e
Read
query ●
Use it for complex, structured
queries.
●
Uses the Query DSL language.
●
Default value is match all.
48
The search API
<index>
q
SEARCH mor e
Read
query
●
Timeout:
●
The maximum time to wait for a search request to complete.
●
Time units (seconds, milliseconds, days, etc) .
●
Size:
●
Defines the number of hits (documents) to return.
●
Default value is 10. Max value is 10000.
49
The search API
<index>
q
SEARCH mor e
Read
query
●
from:
●
Starting point from which to return search results (pagination) .
●
Useful if you want to implement skip functionality.
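A minimal sketch of turning a page number into from and size. The helper name and the 1-based page convention are assumptions; the 10,000 limit is the maximum mentioned above.

```python
def page_to_from_size(page: int, size: int = 10, max_window: int = 10_000) -> dict:
    """Translate a 1-based page number into from/size search parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    start = (page - 1) * size
    if start + size > max_window:
        raise ValueError("from + size exceeds the result window; consider search_after")
    return {"from": start, "size": size}


params = page_to_from_size(page=3, size=10)
print(params)

# With the official Python client the offset is passed as `from_`, because
# `from` is a reserved word in Python:
#   es.search(index="my_index", from_=params["from"], size=params["size"])
```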
50
13 The search API – Part 2
51
The search API – Query DSL
<index>
q
SEARCH mor e
Read
query
Leaf clauses
1. match:
● Is used to perform full-text search.
● Returns documents that match a provided text, number, date, or bool.
● Typically used on fields mapped to the text data type.
2. term:
● Returns documents that contain an exact term in a provided field.
● Use it on fields mapped to a keyword, numeric, or date type; avoid it on text fields, whose values are analyzed.
● Example usage: product ID, book ID, username, etc.
3. range:
● Returns documents that contain terms within a provided range.
The search API – Query DSL
●
Example for match query:
    response = es.search(
        index="my_index",
        body={
            "query": {
                "match": {
                    "description": "A description."
                }
            }
        }
    )
The search API – Query DSL
●
Example for term query:
    response = es.search(
        index="my_index",
        body={
            "query": {
                "term": {
                    "product_id": "PRODUCT_12345"
                }
            }
        }
    )
The search API – Query DSL
●
Example for range query:
    response = es.search(
        index="my_index",
        body={
            "query": {
                "range": {
                    "publication_date": {
                        "gte": "2023-01-01",
                        "lte": "2023-12-31"
                    }
                }
            }
        }
    )
The search API – Query DSL
Compound clauses
● bool:
  ● Combines multiple queries using boolean logic: must, filter, should, must_not.
  ● must and should contribute to the relevance score; filter and must_not do not.
The search API – Query DSL
●
Example for bool query:
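A minimal sketch of a bool query, assuming hypothetical fields (description, brand, price, title, status) that are not part of the original slides:

```python
bool_query = {
    "bool": {
        # must: the clause has to match and contributes to the score.
        "must": [{"match": {"description": "wireless headphones"}}],
        # filter: the clauses have to match but do not affect the score.
        "filter": [
            {"term": {"brand": "acme"}},
            {"range": {"price": {"lte": 200}}},
        ],
        # should: optional clause that raises the score when it matches.
        "should": [{"match": {"title": "noise cancelling"}}],
        # must_not: matching documents are excluded.
        "must_not": [{"term": {"status": "discontinued"}}],
    }
}

# With a running cluster, the query could be sent like this:
#   response = es.search(index="my_index", query=bool_query)
print(sorted(bool_query["bool"]))
```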
The search API
Timeout
● Sets the maximum duration for a query to execute.
● If the search takes longer than the specified time, Elasticsearch will abort the search.
Size
● Controls how many search results are returned.
● Max value is 10000.
From
● Used for pagination.
● Tells Elasticsearch how many documents to skip before starting to return results.
The search API
Aggregations
●
Performs calculation on the data
●
Average, max, min, count.
Read mor e
67
15 Dense vectors
68
Dense vector field type
●
Stores dense vectors of numeric values.
●
Use it if you have few or no zero elements.
●
Does not support aggregations or sorting.
Read mor e
●
It is not possible to store multiple values in one dense vector field.
●
Use kNN search to retrieve the nearest vectors.
●
Max size of a dense vector is 4096.
[[...],[...],[...],...]
69
Medium article by Sachinsoni
Dense vector field type
Examples
    # A single vector per field: accepted.
    response = client.index(
        index="my-index",
        id="1",
        document={
            "my_text": "example text",
            "my_vector": [0.5, 10, 6]
        }
    )
    # Multiple vectors in one dense vector field: not possible.
    response = client.index(
        index="my-index",
        id="2",
        document={
            "my_text": "example text 2",
            "my_vector": [[0.5, 10, 6], [1, 0, -2]]
        }
    )
Dense vector field type
Important
●
You have to manually do the mapping.
●
Elasticsearch does not automatically infer the mapping for dense
Read mor e
vectors.
●
It needs to know the exact number of dimensions.
    response = es.indices.create(
        index="my_index",
        mappings={
            "properties": {
                "sides_length": {
                    "type": "dense_vector",
                    "dims": 4
                },
                "shape": {
                    "type": "keyword"
                }
            }
        },
    )
16 Embedding documents
72
Embedding documents
●
Embedding transforms text into numerical vectors.
●
Deep learning models are used to embed documents.
●
These models preserve the meaning of the text.
Read mor e
●
Use cases:
●
Recommendation systems
●
Retrieval-Augmented Generation (RAG)
Closed models
● Paid (pay for what you use).
● No hardware required (cloud).
Open-source models
● Free to use.
● Hardware is required (preferably a GPU).
Embedding documents
Embedding size
● Size of the dense vector.
● Larger sizes generally yield better embeddings, at the cost of more storage and compute.
● Common sizes include 384, 768, and 1024.
Input size
● Size of the input text that the model can process.
● Text will be truncated if it exceeds the model’s capacity.
● Common values include 256 and 512 tokens.
Text language
● Some models support only specific languages.
● Others support multiple languages and are multilingual.
Embedding documents
78
Massive Text Embedding Benchmark (MTEB)
17 KNN search
79
k-nearest neighbor (kNN) search
●
How do we search for embedded documents?
●
We use the kNN search for fields mapped as dense vectors.
●
Important, you can’t use the query parameter in this case.
● The kNN algorithm is used for classification and regression tasks.
● It classifies new data points by comparing them with the k nearest points from the training data.
● Common distance metrics: Euclidean, Manhattan, or Minkowski.
● This algorithm is simple and effective.
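The idea of finding the k nearest vectors can be sketched with an exact, brute-force search using the Euclidean distance. This only illustrates the concept; Elasticsearch's kNN search is approximate by default and does not work like this loop. The document IDs and vectors are made up.

```python
import math


def euclidean(a: list, b: list) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query: list, vectors: dict, k: int) -> list:
    """Return the IDs of the k vectors closest to the query, nearest first."""
    ranked = sorted(vectors.items(), key=lambda item: euclidean(query, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]


vectors = {
    "doc1": [0.0, 0.0],
    "doc2": [1.0, 1.0],
    "doc3": [5.0, 5.0],
}
nearest = knn([0.9, 0.8], vectors, k=2)
print(nearest)
```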
k-nearest neighbor (kNN) search
Example
    results = es.search(
        knn={
            'field': 'embedding',
            # get_embedding is not part of the official client; it is assumed
            # to be a helper that returns the embedding vector of the query.
            'query_vector': es.get_embedding(parsed_query),
            'num_candidates': 50,
            'k': 10,
        }
    )
🛈 The query should be a vector.
🛈 num_candidates: retrieves 50 potentially relevant documents before applying distance calculations to select the k best documents.
18 Deep pagination
87
Deep pagination
●
Indexing / fetching all documents at once is inefficient and slow.
●
Pagination:
●
Retrieves data in small chunks from large indexes.
Read mor e
●
Improves performance and efficiency.
●
Fast search experience.
●
Cost effective.
x100K
index /
fetch
Medium article by Dayanand
88
Deep pagination
Pagination methods
● from/size: commonly used for paginating smaller datasets.
● search_after: offers more efficient deep pagination for large datasets.
🛈 Notes on from/size:
● from/size is limited to 10k results.
● Requires a lot of memory for deep pages.
● Not suitable for larger indexes.
● Diagram: pages of size = 8 read from the index, starting at from = 0 and from = 5.
🛈 Notes on search_after:
● Sortable fields, such as timestamp or id, provide the sort values.
● Diagram: each page of size = 8 starts after the old sort values of the previous page.
● The search_after method is not constrained by the 10k result limit.
● It does not use an offset (i.e., the from parameter).
● Results must be sorted by fields such as ID or timestamp.
● Uses a pointer derived from the sort values of the last document on the previous page.
● This approach prevents the skipping of documents.
● It is particularly beneficial for handling larger indexes.
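The search_after loop can be sketched as follows. `fake_search` is an in-memory stand-in (an assumption so the example runs without a cluster) that returns hits shaped like Elasticsearch hits, each carrying its "sort" values.

```python
# Twenty fake hits, already sorted by a single numeric sort value.
DOCS = [{"_id": str(i), "sort": [i]} for i in range(1, 21)]


def fake_search(size: int, search_after=None) -> list:
    """Return the next `size` hits whose sort values come after `search_after`."""
    remaining = [d for d in DOCS if search_after is None or d["sort"] > search_after]
    return remaining[:size]


def scan_with_search_after(search, page_size: int):
    """Yield every hit, page by page, using the last hit's sort values as pointer."""
    search_after = None
    while True:
        hits = search(size=page_size, search_after=search_after)
        if not hits:
            break
        yield from hits
        # The pointer for the next page is the sort value of the last hit.
        search_after = hits[-1]["sort"]


ids = [hit["_id"] for hit in scan_with_search_after(fake_search, page_size=8)]
print(len(ids))

# Against a real cluster, each page would come from something like:
#   es.search(index="my_index", size=8, sort=[{"timestamp": "asc"}],
#             search_after=search_after)["hits"]["hits"]
# where "timestamp" is a hypothetical sortable field.
```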
Deep pagination
Comparison
Below are the results obtained after attempting to retrieve 10,000
documents using the following parameters (size = 200, iterations = 50)
Read mor e
93
Deep pagination
Comparison
from/size search_after
🛈 Performance degradation is calculated like this (last page time / first page time)
94
19 Ingest pipelines
95
Ingest pipelines
●
You can perform transformations on data before indexing.
●
Common transformations: remove fields, lowercase text, remove HTML
tags, and more.
Read mor e
96
Ingest pipelines
●
We use the ingest API to:
●
Create or update a pipeline.
Read mor e
    response = client.ingest.put_pipeline(
        id="my-pipeline",
        description="A description",
        processors=[
            {
                "set": {
                    "description": "A description",
                    "field": "field",
                    "value": "value"
                }
            },
            {
                "lowercase": {
                    "field": "field"
                }
            }
        ],
    )
Ingest pipelines
●
We use the ingest API to:
●
Simulate a pipeline.
Read mor e
    response = client.ingest.simulate(
        id="my-pipeline",
        docs=[
            {"_index": "index", "_id": "id", "_source": {"foo": "bar"}},
            {"_index": "index", "_id": "id", "_source": {"foo": "rab"}},
        ],
    )
OR
    response = client.ingest.simulate(
        pipeline={
            "processors": [
                {"lowercase": {"field": "field"}}
            ]
        },
        ...
    )
Ingest pipelines
●
We use the ingest API to:
●
Delete a pipeline.
Read mor e
response = client.ingest.delete_pipeline(
id="my-pipeline",
)
99
Ingest pipelines
●
We use the ingest API to:
●
Get a pipeline.
Read mor e
response = client.ingest.get_pipeline(
id="my-pipeline",
)
100
Ingest pipelines
●
Pipelines can fail. You can either ignore the failure or handle it.
●
If you ignore the failure, the pipeline will skip over the failed steps.
Read mor e
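    response = client.ingest.put_pipeline(
        id="my-pipeline",
        processors=[
            {
                "rename": {
                    "description": "A description",
                    "field": "field",
                    # rename also requires a target_field; the name is a placeholder.
                    "target_field": "new_field",
                    "ignore_failure": True
                }
            }
        ],
    )
Diagram: an ingest pipeline with Step 1, Step 2, and Step 3; a failed step is skipped.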
Ingest pipelines
●
Pipelines can fail. You can either ignore the failure or handle it.
●
Specify custom error-handling steps with on_failure.
Read mor e
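    response = client.ingest.put_pipeline(
        id="my-pipeline",
        processors=[
            {
                "rename": {
                    "description": "A description",
                    "field": "field",
                    # rename also requires a target_field; the name is a placeholder.
                    "target_field": "new_field",
                    "on_failure": [...]  # retry, log error, etc.
                }
            }
        ],
    )
Diagram: an ingest pipeline with Step 1, Step 2, and Step 3; a failed step triggers the on_failure steps.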
20 Ingest processors
103
Ingest processors
●
Common transformations: remove fields, lowercase text, remove HTML
tags, and more.
Read mor e
104
Ingest processors
● Append, For each, Inference, JSON, Attachment, Sort, ...
● Convert, Rename, Set, HTML strip, Lowercase / Uppercase, Trim, Split, ...
● Data filtering: Drop, Remove
● Pipeline handling: Fail, Pipeline
21 Filters in depth
106
Filters in depth
●
When searching in Elasticsearch, you can use either query context or
filter context. R ead mor e
Use cases
Filters are effective for querying structured data. R ead mor e
Structured data
109
Filters in depth
Example 1
R ead mor e
    response = client.search(
        index="phones",
        query={
            "bool": {
                "filter": [
                    {
                        "term": {
                            "color": "black"
                        }
                    },
                    {
                        "term": {
                            "brand": "samsung"
                        }
                    }
                ]
            }
        },
    )
Diagram: Color AND Brand; each document either passes both filters or it does not (Yes/No).
Filters in depth
Example 2
R ead mor e
    response = client.search(
        query={
            "bool": {
                "filter": [
                    {
                        "term": {
                            "status": "published"
                        }
                    },
                    {
                        "range": {
                            "publish_date": {
                                "gte": "2015-01-01",
                                "lte": "2015-02-01"
                            }
                        }
                    }
                ]
            }
        },
    )
Diagram: Status AND Publish date; each document either passes both filters or it does not (Yes/No).
Filters in depth
Post filters
●
Applies filters after aggregations are calculated. R ead mor e
●
Does not affect aggregations.
●
Only filters the search results.
●
Let you narrow down what users see without limiting
what they can choose from.
112
Filters in depth
Example
Three fragments of a search request:
1) The query filters by brand.
    response = client.search(
        index="shirts",
        query={
            "bool": {
                "filter": {
                    "term": {"brand": "gucci"}
                }
            }
        },
        ...
    )
2) The aggregations are computed on the query results.
    response = client.search(
        aggs={
            "colors": {
                "terms": {"field": "color"}
            },
            "color_red": {
                "filter": {
                    "term": {"color": "red"}
                },
                "aggs": {
                    "models": {
                        "terms": {"field": "model"}
                    }
                }
            }
        },
        ...
    )
3) The post filter narrows down the returned hits only.
    response = client.search(
        post_filter={
            "term": {"color": "red"}
        },
    )
22 SQL search API
114
SQL Search API
●
We used Query DSL to search for documents.
●
An alternative method for searching documents is the SQL Search API. R ead mor e
Query DSL
115
SQL Search API
R ead mor e
Parameters: format, fetch_size, cursor
SQL Search API
Example
    response = client.sql.query(
        format="txt",
        query="SELECT * FROM library ORDER BY page_count DESC LIMIT 5",
    )
● library: the index.
● page_count: the field used for sorting.
● LIMIT 5: the size (number of rows to return).
SQL Search API
118
SQL Search API
Pagination
R ead mor e
response = client.sql.query(
format="txt",
cursor="sDHOSBDISBXMLK…", ?
)
SQL Search
Query API Results
Original example
119
SQL Search API
Filtering
R ead mor e
response = client.sql.query(
format="txt",
query="SELECT * FROM library ORDER BY page_count DESC",
filter={
"range": {
"page_count": {
"gte": 100,
"lte": 200
}
}
},
fetch_size=5,
)
120
SQL Search API
R ead mor e
    response = client.sql.translate(
        query="SELECT * FROM library ORDER BY page_count DESC",
        fetch_size=10,
    )
The SQL Translate API returns the equivalent Query DSL request:
    {
        "size": 10,
        "_source": false,
        "fields": [{"field": "author"}, ...],
        "sort": [
            {
                "page_count": {
                    "order": "desc"
                }
            }
        ],
        "track_total_hits": -1
    }
SQL Search API
SQL Limitations
R ead mor e
122
23 Time Series Data Stream
123
Time Series Data Stream
●
Time series data refers to data points ordered by time.
●
Data is collected at regular intervals. R ead mor e
●
Example: CPU usage over time.
124
Time Series Data Stream
●
Managing time series data is challenging.
●
The data can grow rapidly (high frequency measurements) R ead mor e
●
How to store this large volume efficiently?
●
Deciding which old data to keep and when to delete it.
VS
125
Time Series Data Stream
126
Time Series Data Stream
Figure: data over time, moving through the Hot phase, Warm phase, Cold phase, and Delete phase.
Time Series Data Stream
ILM visualized
ILM policy:
● Rollover: age = 30 days & size = 50GB
● Delete: 90 days
Timeline:
1. ILM starts with my_index_0001.
2. my_index_0001 reaches age = 30: rollover creates my_index_0002.
3. my_index_0001 age = 60, my_index_0002 age = 30: rollover creates my_index_0003.
4. my_index_0001 age = 90 (the delete threshold), my_index_0002 age = 60, my_index_0003 age = 30, and so on.
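A sketch of how the policy above might be written as an ILM policy body. The policy name is hypothetical, and the exact client call in the comment is an assumption to check against the client documentation. Note that in Elasticsearch a rollover is triggered when either condition is met, not only when both are.

```python
ilm_policy = {
    "phases": {
        "hot": {
            "actions": {
                # Roll over to a new index at 30 days or 50 GB.
                "rollover": {"max_age": "30d", "max_size": "50gb"}
            }
        },
        "delete": {
            # Delete an index once it is 90 days old.
            "min_age": "90d",
            "actions": {"delete": {}},
        },
    }
}

# With a running cluster, the policy could be registered like this
# (hypothetical policy name):
#   client.ilm.put_lifecycle(name="my_policy", policy=ilm_policy)
print(list(ilm_policy["phases"]))
```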
Time Series Data Stream
Query a time range:
    {
        "query": {
            "range": {
                "@timestamp": {
                    "gte": "2024-11-01T00:00:00",
                    "lte": "2024-11-07T23:59:59"
                }
            }
        }
    }
Aggregate the data:
    {
        "aggs": {
            "avg_cpu_usage": {
                "avg": {
                    "field": "cpu_usage"
                }
            }
        }
    }
24 Analyzers
134
Analyzers
●
Analyzers process text during indexing and searching.
●
They transform text into tokens. R ead mor e
●
They make the search process efficient and accurate.
Term Document
hello Document 1
world Document 1
imad Document 2
saddik Document 2
Documents
Image origin
135
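The term-to-document table above is an inverted index. A minimal sketch in plain Python, assuming a toy analyzer that only lowercases and splits on whitespace (real analyzers do more):

```python
from collections import defaultdict


def analyze(text: str) -> list:
    """Toy analyzer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_inverted_index(documents: dict) -> dict:
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)


documents = {"Document 1": "Hello World", "Document 2": "Imad Saddik"}
inverted_index = build_inverted_index(documents)
print(inverted_index)
```

A search for "hello" then becomes a dictionary lookup instead of a scan over every document.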
Analyzers
Analyzer components
● An analyzer is a combination of 3 components:
  ● Character filters (min 0)
  ● Tokenizer (exactly 1)
  ● Token filters (min 0)
Analyzers
Built-in analyzers
●
Provide ready-made options for processing text in various ways. R ead mor e
●
Each built-in analyzer is designed for specific types of data.
●
Common analyzers:
137
Phases of analysis
● Index time analysis: the document text goes through the tokenizer and the token filters; the resulting terms (hello, world, imad, saddik) are stored in the inverted index.
● Search time analysis: the query text goes through the tokenizer and the token filters; the resulting terms are looked up in the inverted index.
Term    Document
hello   Document 1
world   Document 1
imad    Document 2
saddik  Document 2
Analyzers
141
25 Synonyms
142
Synonyms
●
Synonyms help enhance search accuracy.
●
Useful for matching variations or related terms. R ead mor e
●
Synonyms are defined using the Solr format.
143
Synonyms
Solr format
● This is a flexible syntax for defining synonyms.
● It uses two different definitions:
  ● Equivalent synonyms: “term 1, term 2, term 3”
    Example: car, automobile, race car, voiture, ...
  ● Explicit synonyms: “term 1, term 2 => term 3”
    Examples: personal computer => PC; i-pod, i pod => ipod
Synonyms
● Synonyms are used within analyzers.
● You can use synonyms at index time and at search time.
● Synonyms are applied by a custom token filter.
● Example custom analyzer: no character filters, the standard tokenizer, and a synonyms filter.
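A sketch of index settings for such a custom analyzer. The filter and analyzer names (my_synonyms, my_synonym_analyzer), the index name, and the two synonym rules are assumptions chosen for illustration.

```python
synonym_settings = {
    "analysis": {
        "filter": {
            "my_synonyms": {
                "type": "synonym",
                "synonyms": [
                    "car, automobile, voiture",   # equivalent synonyms
                    "personal computer => pc",    # explicit synonym
                ],
            }
        },
        "analyzer": {
            "my_synonym_analyzer": {
                "type": "custom",
                # No character filters, the standard tokenizer, then synonyms.
                "tokenizer": "standard",
                "filter": ["my_synonyms"],
            }
        },
    }
}

# With a running cluster, the settings could be applied at index creation
# (hypothetical index name):
#   es.indices.create(index="my_index", settings=synonym_settings)
print(synonym_settings["analysis"]["analyzer"]["my_synonym_analyzer"])
```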
26 Common options
146
Common options
●
Simplify Elasticsearch management.
●
Provide features like human-readable output, date math, and filtering. R ead mor e
●
All Elasticsearch REST APIs support these common options.
147
Common options
Human-readable output
●
Make statistics in a format that humans can understand. R ead mor e
●
It applies to disk space, memory, time, and other metrics.
Example
response = es.cluster.stats(human=True)
pprint(response["nodes"]["jvm"])
Before After
148
Common options
Date math
●
Perform math operations on dates. R ead mor e
●
Operations include: Add, Subtract, Round down to nearest day.
●
Supported time units: y (years), M (months), etc.
●
The expression starts with an anchor date (“now” or a string ending with ||).
Examples
●
now := 2024-11-16 11:55:00
●
now+1h := 2024-11-16 12:55:00
●
now-1h := 2024-11-16 10:55:00
●
now-1h/d := 2024-11-16 00:00:00
●
2024.11.16||+1M/d := 2024-12-16 00:00:00
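The examples above can be reproduced with Python's datetime module. This is only a sketch of the arithmetic, not of Elasticsearch's date math parser; `add_months` is a simplified helper that assumes the day of the month exists in the target month.

```python
from datetime import datetime, timedelta


def round_down_to_day(moment: datetime) -> datetime:
    """Equivalent of the /d suffix: drop the time of day."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def add_months(moment: datetime, months: int) -> datetime:
    """Simplified +nM: valid only if the day exists in the target month."""
    month_index = moment.month - 1 + months
    return moment.replace(
        year=moment.year + month_index // 12,
        month=month_index % 12 + 1,
    )


now = datetime(2024, 11, 16, 11, 55, 0)

plus_one_hour = now + timedelta(hours=1)                        # now+1h
minus_one_hour = now - timedelta(hours=1)                       # now-1h
minus_one_hour_day = round_down_to_day(minus_one_hour)          # now-1h/d
anchor = datetime(2024, 11, 16)                                 # 2024.11.16||
anchor_plus_month = round_down_to_day(add_months(anchor, 1))    # +1M/d

print(plus_one_hour, minus_one_hour, minus_one_hour_day, anchor_plus_month)
```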
149
Common options
Response filtering
●
Inclusive filtering: Specify fields to include. R ead mor e
●
Exclusive filtering: Remove unnecessary fields.
●
Combined filtering
Example
    response = es.search(
        index=index_name,
        body={
            "query": {
                "match_all": {}
            }
        },
        filter_path="hits.hits._id,hits.hits._source"
    )
    pprint(response.body)
27 Change heap size
151
Change heap size
●
By default Elasticsearch uses 50% of the available RAM.
●
This can slow down your PC.
●
You only need 1 or 2GB when dealing with small indices.
152
Change heap size
153
Change heap size
154
Change heap size
cat /usr/share/elasticsearch/config/jvm.options.d/heap.options
155
Change heap size
R ead mor e
156
28 Final project – part 0
157
Final project – part 0
●
No more videos on Elasticsearch concepts and APIs.
●
I'll be focusing on the final project from now on.
●
The final project will cover most of the topics we've learned in previous videos.
●
We will be building a website.
●
Elasticsearch will provide the search functionality.
158
Final project – part 0
Source code
159
29 Final project – part 1
160
Final project – part 1
●
Create an index and index documents.
●
Use size / from when searching.
●
Perform multi-match queries.
●
The theme of the final project is Astronomy.
161
Image credit: Kent E. Biggs
Final project – part 1
●
Frontend is done for you.
●
Install dependencies.
162
Final project – part 1
●
Install dependencies.
●
Setup the backend server.
●
Configure Elasticsearch.
163
30 Final project – part 2
164
Final project – part 2
●
Add the pagination controls.
●
Filter by year.
●
Use aggregations.
165
31 Final project – part 3
166
Final project – part 3
●
Implement a search as you type feature.
●
Utilize the N-gram tokenizer.
Standard
Andr
tokenizer
N-gram
Andr
tokenizer
167
Final project – part 3
●
Why use the N-gram tokenizer?
Standard
Andromeda [andromeda]
tokenizer
N=9
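Why N-grams help with search as you type can be sketched in plain Python. These helpers only imitate the idea; Elasticsearch's ngram and edge_ngram tokenizers emit tokens in a different order and have more options. The function names are assumptions.

```python
def ngrams(text: str, min_gram: int, max_gram: int) -> list:
    """All substrings of the lowercased text with length min_gram..max_gram."""
    text = text.lower()
    return [
        text[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(text) - size + 1)
    ]


def edge_ngrams(text: str, min_gram: int, max_gram: int) -> list:
    """Prefixes of the lowercased text with length min_gram..max_gram."""
    text = text.lower()
    return [text[:size] for size in range(min_gram, min(max_gram, len(text)) + 1)]


# The standard tokenizer keeps "andromeda" as one token, so the partial
# input "andr" has no exact term to match. With N-grams it does.
four_grams = ngrams("Andromeda", 4, 4)
prefixes = edge_ngrams("Andromeda", 1, 4)
print("andr" in four_grams, prefixes)
```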
168
32 Final project – part 4
169
Final project – part 4
●
Implement semantic search.
●
Use an embedding model from HuggingFace.
●
Use kNN search to find documents.
170
33 Final project – part 5
171
Final project – part 5
●
Add the raw APOD data.
●
Contains HTML tags.
{
"date": "2024-11-30",
"title": "<a href=\"ap241130.html\">Winter and Summer on a Little Planet</a>",
"explanation": "<p>\n<b> Explanation: </b> \n\nWinter and summer appear to come on a single night to this\n<a
href=\"https://www.instagram.com/camille.niel_photography/p/C270AVzrKcp/?img_index=1\">stunning little planet</a>.\n\nIt's
planet Earth of course.\n\nThe\n<a href=\"http://srcematematike.si/2014/03/09/math-behind-tiny-planets/\">digitally mapped</a>,\
nnadir centered panorama covers 360x180\ndegrees and is\ncomposed of frames recorded during January and July from the\n<a
href=\"https://en.wikipedia.org/wiki/Col_du_Galibier\">Col du Galibier</a> ...
}
172
Final project – part 5
●
Add the raw APOD data.
●
Contains HTML tags.
●
Use pipelines to remove the HTML tags.
Before (raw APOD data):
{
    "date": "2024-11-30",
    "title": "<a href=\"ap241130.html\">Winter and Summer on a Little Planet</a>",
    "explanation": "<p>\n<b> Explanation: </b> \n\nWinter and summer appear to come on a single night to this\n<a href=\"https://www.instagram.com/camille.niel_photography/p/C270AVzrKcp/?img_index=1\">stunning little planet</a>.\n\nIt's planet Earth of course.\n\nThe\n<a href=\"http://srcematematike.si/2014/03/09/math-behind-tiny-planets/\">digitally mapped...
After the ingest pipeline (HTML strip):
{
    "date": "2024-11-30",
    "title": "Winter and Summer on a Little Planet",
    "explanation": "\n Explanation: \n\nWinter and summer appear to come on a single night to this\nstunning little planet.\n\nIt's planet Earth of course.\n\nThe\ndigitally mapped...
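A sketch of a pipeline definition that strips HTML from the two APOD fields using the html_strip processor. The pipeline ID and the helper name are assumptions; the project's actual pipeline may differ.

```python
def build_html_strip_processors(fields: list) -> list:
    """One html_strip processor per field that may contain HTML tags."""
    return [{"html_strip": {"field": field}} for field in fields]


processors = build_html_strip_processors(["title", "explanation"])
print(processors)

# With a running cluster, the pipeline could be created and used like this
# (hypothetical pipeline ID and index name):
#   client.ingest.put_pipeline(
#       id="apod_pipeline",
#       description="Remove HTML tags from APOD documents",
#       processors=processors,
#   )
#   client.index(index="apod", document=raw_document, pipeline="apod_pipeline")
```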
173