1-Getting Started With ELK

Elasticsearch is a distributed search and analytics engine. It allows storing and searching of documents in near real-time. The key concepts discussed are nodes, clusters, indexes, documents, shards, and replicas. Documents are stored in indexes which can span multiple shards for performance and availability. The document shows how search works through parsing, tokenization, normalization, and inverted indexes. Stemming and lemmatization are discussed as techniques to bring related terms together.

Uploaded by Mariem El Mechry

Noureddine Kerzazi | nkerzazi@gmail.com
Goals: an overview of the basic concepts of Elasticsearch and common vocabulary

• Elasticsearch as a search engine as well as a data store
• The ecosystem
• Node, Index, Shard, Mapping/Type, Document and Field
Agenda

• What is Elasticsearch?
• Basic Concepts (Vocabulary)
• How it works
• Installing and configuring Elasticsearch
• Installing and configuring Kibana to connect to Elasticsearch
• Enabling monitoring using X-Pack for Elasticsearch and Kibana
• A quick example of loading data
History

• 1945: Vannevar Bush talks about the need to index records
• 1970: The ARPANET network, the foundation of the modern internet
• 1991: TCP and DNS
• 1993: Primitive search engine of URLs
• 1994: AltaVista
• 1994: Yahoo created a directory of useful webpages
• 1997: Ask.com
• 1998: Google ranks pages based on how many other pages link to them
What is Elasticsearch?

• Distributed search and analytics engine
• A NoSQL distributed text database
• Based on the Lucene engine (an Apache project built in Java)
• Open source (except for a few modules)
• Accessible through an extensive RESTful API
Some search engines

• Google
• Bing (Microsoft)
• Baidu (China)
• Naver (Korea)
• Yahoo
• An Arabic search engine?
Key Concepts of Elasticsearch
• Node
• Cluster
• Index
• Document
• Shard
• Replicas
• DataTypes
• Mapping
• Schemas
Key Concepts

• Node: a single running instance of Elasticsearch. A single physical or virtual server can accommodate multiple nodes, depending on the capacity of its physical resources such as RAM, storage, and processing power.

• Cluster: a collection of one or more nodes. A cluster provides collective indexing and search capabilities across all of its nodes for the entire data set.
Key Concepts

• Index: a collection of documents of different types and their properties. An index also uses the concept of shards to improve performance. For example, an index might hold the set of documents containing the data of a social networking application.

• Document: a collection of fields, defined in JSON format. Every document belongs to a type and resides inside an index. Every document is associated with a unique identifier, called the UID.
Key Concepts

• Shard: indexes are horizontally subdivided into shards. Each shard contains all the properties of a document but fewer JSON objects than the index as a whole. The horizontal separation makes a shard an independent unit that can be stored on any node. A primary shard is the original horizontal part of an index; primary shards are then replicated into replica shards.

• Replicas: Elasticsearch allows a user to create replicas of their indexes and shards. Replication not only increases the availability of data in case of failure, but also improves search performance by carrying out parallel search operations across the replicas.
Comparison between Elasticsearch and RDBMS

• A relational DB has a schema
• Elasticsearch has a mapping

Elasticsearch    RDBMS
Cluster          Database
Index            Table
Shard            Shard
Field            Column
Document         Row
DataTypes, Mapping and Schemas

CREATE DATABASE `My_Blog` DEFAULT CHARACTER SET latin1 COLLATE …;

CREATE TABLE IF NOT EXISTS `Post` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int(10) NOT NULL,
  `post_text` varchar(255) DEFAULT NULL,
  `post_date` datetime NOT NULL,
  PRIMARY KEY (`id`)
);
DataTypes, Mapping and Schemas

(PUT) http://localhost:9200/my_blog
{
  "mappings": {
    "Post": {
      "properties": {
        "user_id":   { "type": "integer" },
        "post_text": { "type": "string" },
        "post_date": { "type": "date" }
      }
    }
  }
}
How Does Search Work?

• Search is about finding the most relevant source of information:
  – Know of the document's existence (web crawling)
  – Index the document for lookup (inverted index)
  – Know how relevant the document is (scoring)
  – Retrieve documents ranked by relevance (search algorithm)
The Inverted Index

• Parsing a document
• Tokenization
  – Tokenize text into words
• Stemming
• Lemmatization
• Dictionary
• Search
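These steps can be sketched in a few lines of Python. This is a toy illustration of an inverted index, not how Lucene actually stores one:

```python
import re
from collections import defaultdict

def tokenize(text):
    # parsing + tokenization: split on non-letters, lowercase everything
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    # dictionary mapping each term to the set of documents containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, term):
    # lookup is a single dictionary access, independent of corpus size
    return index.get(term.lower(), set())

docs = {1: "Friends, Romans and Countrymen",
        2: "Romans built roads"}
index = build_inverted_index(docs)
print(search(index, "Romans"))  # -> {1, 2}
```

Searching the index never scans document text; it only looks terms up in the dictionary, which is what makes lookup fast.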
Tokenization
▪ Input: “Friends, Romans and Countrymen”
▪ Output: Tokens
▪ Friends
▪ Romans
▪ Countrymen
▪ A token is an instance of a sequence of characters
▪ Each such token is now a candidate for an index entry, after
further processing
▪ Described below
▪ But what are valid tokens to emit?
Stop words

• With a stop list, you exclude from the dictionary entirely the
commonest words. Intuition:
– They have little semantic content: the, a, and, to, be
– There are a lot of them: ~30% of postings for top 30 words

• But the trend is away from doing this:
  – Good compression techniques (IIR 5) mean the space for including stop words in a system is very small
  – Good query optimization techniques (IIR 7) mean you pay little at query time for including stop words
  – You need them for:
    • Phrase queries: “King of Denmark”
    • Various song titles, etc.: “Let it be”, “To be or not to be”
    • “Relational” queries: “flights to London”
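A stop list is just a set-membership filter. The sketch below (plain Python, not Elasticsearch's stop token filter) also shows why phrase queries suffer when stop words are dropped:

```python
# a tiny illustrative stop list; real lists have tens to hundreds of entries
STOP_WORDS = {"the", "a", "and", "to", "be", "or", "not", "of", "it"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["flights", "to", "London"]))  # -> ['flights', 'London']
# every token of "to be or not to be" is a stop word, so the phrase vanishes:
print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))  # -> []
```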
Normalization to terms
• We may need to “normalize” words in indexed text as well as
query words into the same form
– We want to match U.S.A. and USA
• Result is terms: a term is a (normalized) word type, which is an
entry in our IR system dictionary
• We most commonly implicitly define equivalence classes of
terms by, e.g.,
  – deleting periods to form a term
    • U.S.A., USA → USA
  – deleting hyphens to form a term
    • anti-discriminatory, antidiscriminatory → antidiscriminatory
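The period- and hyphen-deletion rules above amount to a one-line normalizer (a simplification; real normalizers are more careful about, e.g., numbers and abbreviations):

```python
def normalize(term):
    # map variants into one equivalence class by deleting periods and hyphens
    return term.replace(".", "").replace("-", "")

print(normalize("U.S.A."))               # -> USA
print(normalize("anti-discriminatory"))  # -> antidiscriminatory
```

Because both spellings normalize to the same term, a query for either variant hits documents containing the other.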
Normalization: other languages

• Accents: e.g., French résumé vs. resume.


• Umlauts: e.g., German: Tuebingen vs. Tübingen
– Should be equivalent
• Most important criterion:
  – How are your users likely to write their queries for these words?
• Even in languages that standardly have accents, users often may not type them
  – Often best to normalize to a de-accented term
    • Tuebingen, Tübingen, Tubingen → Tubingen
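De-accenting can be done with Unicode decomposition, as in this Python sketch (note that mapping the German spelling Tuebingen to Tubingen would need an extra language-specific ue→u rule on top of this):

```python
import unicodedata

def deaccent(term):
    # NFD splits 'ü' into 'u' + a combining diaeresis; drop the combining marks
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(deaccent("Tübingen"))  # -> Tubingen
print(deaccent("résumé"))    # -> resume
```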
Normalization to terms

• An alternative to equivalence classing is to do asymmetric expansion
• An example of where this may be useful
– Enter: window Search: window, windows
– Enter: windows Search: Windows, windows, window
– Enter: Windows Search: Windows

• Potentially more powerful, but less efficient
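The window/Windows table above can be read directly as an expansion map; a minimal sketch:

```python
# asymmetric expansion: what is searched depends on exactly what was typed
EXPANSIONS = {
    "window":  {"window", "windows"},
    "windows": {"Windows", "windows", "window"},
    "Windows": {"Windows"},
}

def expand_query(term):
    return EXPANSIONS.get(term, {term})

print(expand_query("window"))   # -> {'window', 'windows'} (set order may vary)
print(expand_query("Windows"))  # -> {'Windows'}
```

Unlike equivalence classing, the relation is not symmetric: entering the more specific form (Windows) expands to fewer terms than the generic one.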


Thesauri and soundex

• Do we handle synonyms and homonyms?
  – E.g., by hand-constructed equivalence classes
    • car = automobile, color = colour
  – We can rewrite to form equivalence-class terms
    • When the document contains automobile, index it under car-automobile (and vice-versa)
  – Or we can expand a query
    • When the query contains automobile, look under car as well
• What about spelling mistakes?
  – One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics
• More in IIR 3 and IIR 9
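Query-side expansion with hand-built synonym classes can be sketched like this (Soundex itself, which hashes words to phonetic codes, is a separate and more involved algorithm):

```python
# hand-constructed equivalence classes: each term maps to its class members
SYNONYMS = {
    "car": {"car", "automobile"},
    "automobile": {"car", "automobile"},
    "color": {"color", "colour"},
    "colour": {"color", "colour"},
}

def expand(query_terms):
    # when the query contains automobile, look under car as well
    expanded = set()
    for t in query_terms:
        expanded |= SYNONYMS.get(t, {t})
    return expanded

print(expand(["automobile", "red"]))  # searches car, automobile and red
```

The same table could instead be applied at index time, trading larger postings lists for cheaper queries.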
Sec. 2.2.4

Lemmatization

• Reduce inflectional/variant forms to base form


• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization implies doing “proper” reduction to dictionary
headword form
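A real lemmatizer needs a dictionary and part-of-speech information; the table-lookup sketch below is only enough to reproduce the example above:

```python
# a tiny hand-built lemma table standing in for a real dictionary
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "boy's": "boy", "colors": "color"}

def lemmatize(tokens):
    # map each token to its dictionary headword, leaving unknown tokens alone
    return [LEMMAS.get(t, t) for t in tokens]

print(lemmatize("the boy's cars are different colors".split()))
# -> ['the', 'boy', 'car', 'be', 'different', 'color']
```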

Stemming

• Reduce terms to their “roots” before indexing


• “Stemming” suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.

Example: “for example compressed and compression are both accepted as equivalent to compress” stems to “for exampl compress and compress ar both accept as equival to compress”.

Porter’s algorithm

• Commonest algorithm for stemming English


– Results suggest it’s at least as good as other stemming options
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: Of the rules in a compound command, select the
one that applies to the longest suffix.

Typical rules in Porter

• sses → ss
• ies → i
• ational → ate
• tional → tion

• Weight-of-word sensitive rules
  – (m>1) EMENT →
    • replacement → replac
    • cement → cement
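The sample rules can be applied as suffix rewrites, longest matching suffix first. This sketch covers only these four rules, not the full five-phase algorithm or its measure conditions like (m>1):

```python
# the slide's sample rules, ordered so the longest matching suffix wins
RULES = [("ational", "ate"), ("tional", "tion"), ("sses", "ss"), ("ies", "i")]

def apply_rules(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(apply_rules("relational"))  # -> relate
print(apply_rules("caresses"))    # -> caress
print(apply_rules("ponies"))      # -> poni
```

Note that a stem like "poni" need not be a real word; stems only have to bring related forms together.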

Other stemmers

• Other stemmers exist:
  – Lovins stemmer
    • http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
    • Single-pass, longest suffix removal (about 250 rules)
  – Paice/Husk stemmer
  – Snowball
• Full morphological analysis (lemmatization)
  – At most modest benefits for retrieval

Language-specificity

• The above methods embody transformations that are


– Language-specific, and often
– Application-specific
• These are “plug-in” addenda to the indexing process
• Both open source and commercial plug-ins are available for
handling these

Does stemming help?

• English: very mixed results. Helps recall for some queries but
harms precision on others
– E.g., operative (dentistry) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …
– 30% performance gains for Finnish!
Analyzers

• Standard Tokenizer
• Letter Tokenizer
• Lowercase Tokenizer
• Whitespace Tokenizer
• UAX URL Email Tokenizer
• Classic Tokenizer
• Thai Tokenizer
• Custom Tokenizer
Standard Tokenizer

• The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
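Any tokenizer can be tried out through Elasticsearch's `_analyze` API against a running node; the sample text below is our own:

```json
POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes!"
}
```

The response should list the emitted tokens (here The, 2, QUICK, Brown, Foxes). Note that a bare tokenizer does not lowercase anything; that is the job of token filters in a full analyzer.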
Letter Tokenizer

• The letter tokenizer breaks text into terms whenever it encounters a character which is not a letter. It does a reasonable job for most European languages, but a terrible job for some Asian languages, where words are not separated by spaces.
Lowercase Tokenizer

• The lowercase tokenizer, like the letter tokenizer, breaks text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms. It is functionally equivalent to the letter tokenizer combined with the lowercase token filter, but is more efficient as it performs both steps in a single pass.
Whitespace Tokenizer

• The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.
• Example: “The 2 QUICK Brown-Foxes” → [The, 2, QUICK, Brown-Foxes]
UAX URL email Tokenizer

• The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
• The standard tokenizer, by contrast, splits a URL or email address into several tokens.
Classic Tokenizer

• The classic tokenizer is a grammar based tokenizer that is good for English language
documents. This tokenizer has heuristics for special treatment of acronyms, company
names, email addresses, and internet host names. However, these rules don’t always
work, and the tokenizer doesn’t work well for most languages other than English:
• It splits words at most punctuation characters, removing punctuation. However, a dot
that’s not followed by whitespace is considered part of a token.
• It splits words at hyphens, unless there’s a number in the token, in which case the
whole token is interpreted as a product number and is not split.
• It recognizes email addresses and internet hostnames as one token.
Custom Analyzer

• In the settings section we specify our custom analyzer
• In the mappings section we use our analyzer
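As a sketch of what such a request might look like (the index name, analyzer name, filters, and field here are made up for illustration, and the mapping is written without a type name as in recent Elasticsearch versions):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
```

The analyzer is defined once under settings.analysis and then referenced by name wherever a field mapping needs it.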
Normalizers

• Normalizers are similar to analyzers except that they may only emit a single token (they do not have a tokenizer).
• Examples: arabic_normalization, asciifolding, bengali_normalization, german_normalization, …, lowercase, …, uppercase.
Ranked Retrieval

• Order documents by how likely they are to be relevant to the information requested:
  – Estimate relevance(q, di)
  – Sort documents by relevance
  – Display sorted results
• How do we estimate relevance?
  – Assume a document is relevant if it has a lot of query terms
  – Replace relevance with sim(q, di)
  – Compute similarity of vector representations
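A minimal sketch of this pipeline, scoring documents by cosine similarity of raw term-frequency vectors (no tf-idf weighting, unlike a real engine):

```python
import math
from collections import Counter

def cosine_sim(q_tokens, d_tokens):
    # similarity of the term-frequency vectors of query and document
    q, d = Counter(q_tokens), Counter(d_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    # estimate relevance, sort by it, return document ids best-first
    q = query.lower().split()
    scored = sorted(((cosine_sim(q, text.lower().split()), doc_id)
                     for doc_id, text in docs.items()), reverse=True)
    return [doc_id for _, doc_id in scored]

docs = {1: "Elasticsearch is a search engine",
        2: "cooking with gas"}
print(rank("search engine", docs))  # -> [1, 2]
```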
Questions
