1-Getting Started With ELK

Elasticsearch is a distributed search and analytics engine. It allows storing and searching of documents in near real-time. The key concepts discussed are nodes, clusters, indexes, documents, shards, and replicas. Documents are stored in indexes which can span multiple shards for performance and availability. The document shows how search works through parsing, tokenization, normalization, and inverted indexes. Stemming and lemmatization are discussed as techniques to bring related terms together.

Uploaded by Mariem El Mechry

Noureddine Kerzazi | nkerzazi@gmail.com
Goals: an overview of the basic concepts of Elasticsearch and common vocabulary

• Elasticsearch as a search engine as well as a data store
• The ecosystem
• Node, Index, Shard, Mapping/Type, Document and Field
Agenda

• What is Elasticsearch?
• Basic Concepts (Vocabulary)
• How it works
• Installing and configuring Elasticsearch
• Installing and configuring Kibana to connect to Elasticsearch
• Enabling monitoring using X-Pack for Elasticsearch and Kibana
• A quick example of loading data
History

• 1945: Vannevar Bush talks about the need to index records
• 1970: The ARPANET network, the foundation of the modern internet
• 1991: TCP and DNS
• 1993: Primitive search engine of URLs
• 1994: AltaVista
• 1994: Yahoo created a directory of useful webpages
• 1997: Ask.com
• 1998: Google ranks pages based on how many other pages link to them
What is Elasticsearch?

• Distributed search and analytics engine
• A NoSQL distributed text database
• Based on the Lucene engine (an Apache project built in Java)
• Open source (except for a few modules)
• Accessible through an extensive RESTful API
Some search engines

• Google
• Bing (Microsoft)
• Baidu (China)
• Naver (Korea)
• Yahoo
• An Arabic search engine?
Key Concepts of Elasticsearch
• Node
• Cluster
• Index
• Document
• Shard
• Replicas
• DataTypes
• Mapping
• Schemas
Key Concepts

• Node: a single running instance of Elasticsearch. A single physical or virtual server can accommodate multiple nodes, depending on the capacity of its physical resources such as RAM, storage, and processing power.

• Cluster: a collection of one or more nodes. A cluster provides collective indexing and search capabilities across all of its nodes for the entire data set.
Key Concepts

• Index: a collection of documents of different types and their properties. An index also uses the concept of shards to improve performance. For example, an index might hold the set of documents containing the data of a social networking application.

• Document: a collection of fields, defined in JSON format. Every document belongs to a type and resides inside an index. Every document is associated with a unique identifier, called the UID.
Key Concepts

• Shard: indexes are horizontally subdivided into shards. Each shard contains all the properties of a document but fewer JSON objects than the index as a whole. The horizontal separation makes a shard an independent unit that can be stored on any node. A primary shard is the original horizontal part of an index; primary shards are then replicated into replica shards.

• Replicas: Elasticsearch allows a user to create replicas of their indexes and shards. Replication not only increases the availability of data in case of failure, but also improves search performance by carrying out parallel search operations across the replicas.
Comparison between Elasticsearch and RDBMS

• A relational DB has a schema
• Elasticsearch has a mapping

Elasticsearch    RDBMS
Cluster          Database
Index            Table
Shard            Shard
Field            Column
Document         Row
DataTypes, Mapping and Schemas

CREATE DATABASE `My_Blog` DEFAULT CHARACTER SET latin1 COLLATE …;

CREATE TABLE IF NOT EXISTS `Post` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int(10) NOT NULL,
  `post_text` varchar(255) DEFAULT NULL,
  `post_date` datetime NOT NULL,
  PRIMARY KEY (`id`)
);
DataTypes, Mapping and Schemas

(PUT) http://localhost:9200/my_blog
{
  "mappings": {
    "Post": {
      "properties": {
        "user_id":   { "type": "integer" },
        "post_text": { "type": "string" },
        "post_date": { "type": "date" }
      }
    }
  }
}
How Does Search Work?

• Search is about finding the most relevant source of information:
  – Know of the document's existence (web crawling)
  – Index the document for lookup (inverted index)
  – Know how relevant the document is (scoring)
  – Retrieve documents ranked by relevance (search algorithm)
The Inverted Index

• Parsing a document
• Tokenization
  – Tokenize text into words
• Stemming
• Lemmatization
• Dictionary
• Search
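These steps can be sketched in a few lines of Python. This is a toy illustration of an inverted index, not how Lucene actually stores one:

```python
import re
from collections import defaultdict

def tokenize(text):
    # parsing + tokenization: split on non-letters, lowercase everything
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    # dictionary mapping each term to the set of documents containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, term):
    # lookup is a single dictionary access, independent of corpus size
    return index.get(term.lower(), set())

docs = {1: "Friends, Romans and Countrymen",
        2: "Romans built roads"}
index = build_inverted_index(docs)
print(search(index, "Romans"))  # -> {1, 2}
```

Searching the index never scans document text; it only looks terms up in the dictionary, which is what makes lookup fast.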
Tokenization
▪ Input: “Friends, Romans and Countrymen”
▪ Output: Tokens
▪ Friends
▪ Romans
▪ Countrymen
▪ A token is an instance of a sequence of characters
▪ Each such token is now a candidate for an index entry, after
further processing
▪ Described below
▪ But what are valid tokens to emit?
Stop words

• With a stop list, you exclude from the dictionary entirely the
commonest words. Intuition:
– They have little semantic content: the, a, and, to, be
– There are a lot of them: ~30% of postings for top 30 words

• But the trend is away from doing this:
  – Good compression techniques (IIR 5) mean the space for including stop words in a system is very small
  – Good query optimization techniques (IIR 7) mean you pay little at query time for including stop words
  – You need them for:
    • Phrase queries: “King of Denmark”
    • Various song titles, etc.: “Let it be”, “To be or not to be”
    • “Relational” queries: “flights to London”
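A stop list is just a set-membership filter. The sketch below (plain Python, not Elasticsearch's stop token filter) also shows why phrase queries suffer when stop words are dropped:

```python
# a tiny illustrative stop list; real lists have tens to hundreds of entries
STOP_WORDS = {"the", "a", "and", "to", "be", "or", "not", "of", "it"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["flights", "to", "London"]))  # -> ['flights', 'London']
# every token of "to be or not to be" is a stop word, so the phrase vanishes:
print(remove_stop_words(["to", "be", "or", "not", "to", "be"]))  # -> []
```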
Normalization to terms
• We may need to “normalize” words in indexed text as well as
query words into the same form
– We want to match U.S.A. and USA
• Result is terms: a term is a (normalized) word type, which is an
entry in our IR system dictionary
• We most commonly implicitly define equivalence classes of
terms by, e.g.,
  – deleting periods to form a term
    • U.S.A., USA → USA
  – deleting hyphens to form a term
    • anti-discriminatory, antidiscriminatory → antidiscriminatory
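The period- and hyphen-deletion rules above amount to a one-line normalizer (a simplification; real normalizers are more careful about, e.g., numbers and abbreviations):

```python
def normalize(term):
    # map variants into one equivalence class by deleting periods and hyphens
    return term.replace(".", "").replace("-", "")

print(normalize("U.S.A."))               # -> USA
print(normalize("anti-discriminatory"))  # -> antidiscriminatory
```

Because both spellings normalize to the same term, a query for either variant hits documents containing the other.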
Normalization: other languages

• Accents: e.g., French résumé vs. resume.


• Umlauts: e.g., German: Tuebingen vs. Tübingen
– Should be equivalent
• Most important criterion:
  – How are your users likely to write their queries for these words?
• Even in languages that standardly have accents, users often may not type them
  – Often best to normalize to a de-accented term
    • Tuebingen, Tübingen, Tubingen → Tubingen
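De-accenting can be done with Unicode decomposition, as in this Python sketch (note that mapping the German spelling Tuebingen to Tubingen would need an extra language-specific ue→u rule on top of this):

```python
import unicodedata

def deaccent(term):
    # NFD splits 'ü' into 'u' + a combining diaeresis; drop the combining marks
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(deaccent("Tübingen"))  # -> Tubingen
print(deaccent("résumé"))    # -> resume
```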
Normalization to terms

• An alternative to equivalence classing is to do asymmetric expansion
• An example of where this may be useful
– Enter: window Search: window, windows
– Enter: windows Search: Windows, windows, window
– Enter: Windows Search: Windows

• Potentially more powerful, but less efficient
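The window/Windows table above can be read directly as an expansion map; a minimal sketch:

```python
# asymmetric expansion: what is searched depends on exactly what was typed
EXPANSIONS = {
    "window":  {"window", "windows"},
    "windows": {"Windows", "windows", "window"},
    "Windows": {"Windows"},
}

def expand_query(term):
    return EXPANSIONS.get(term, {term})

print(expand_query("window"))   # -> {'window', 'windows'} (set order may vary)
print(expand_query("Windows"))  # -> {'Windows'}
```

Unlike equivalence classing, the relation is not symmetric: entering the more specific form (Windows) expands to fewer terms than the generic one.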


Thesauri and soundex

• Do we handle synonyms and homonyms?
  – E.g., by hand-constructed equivalence classes
    • car = automobile, color = colour
  – We can rewrite to form equivalence-class terms
    • When the document contains automobile, index it under car-automobile (and vice-versa)
  – Or we can expand a query
    • When the query contains automobile, look under car as well
• What about spelling mistakes?
  – One approach is Soundex, which forms equivalence classes of words based on phonetic heuristics
• More in IIR 3 and IIR 9
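Query-side expansion with hand-built synonym classes can be sketched like this (Soundex itself, which hashes words to phonetic codes, is a separate and more involved algorithm):

```python
# hand-constructed equivalence classes: each term maps to its class members
SYNONYMS = {
    "car": {"car", "automobile"},
    "automobile": {"car", "automobile"},
    "color": {"color", "colour"},
    "colour": {"color", "colour"},
}

def expand(query_terms):
    # when the query contains automobile, look under car as well
    expanded = set()
    for t in query_terms:
        expanded |= SYNONYMS.get(t, {t})
    return expanded

print(expand(["automobile", "red"]))  # searches car, automobile and red
```

The same table could instead be applied at index time, trading larger postings lists for cheaper queries.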
Sec. 2.2.4

Lemmatization

• Reduce inflectional/variant forms to base form


• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• Lemmatization implies doing “proper” reduction to dictionary
headword form
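A real lemmatizer needs a dictionary and part-of-speech information; the table-lookup sketch below is only enough to reproduce the example above:

```python
# a tiny hand-built lemma table standing in for a real dictionary
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "boy's": "boy", "colors": "color"}

def lemmatize(tokens):
    # map each token to its dictionary headword, leaving unknown tokens alone
    return [LEMMAS.get(t, t) for t in tokens]

print(lemmatize("the boy's cars are different colors".split()))
# -> ['the', 'boy', 'car', 'be', 'different', 'color']
```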

Stemming

• Reduce terms to their “roots” before indexing


• “Stemming” suggests crude affix chopping
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.

Example: “for example compressed and compression are both accepted as equivalent to compress” stems to “for exampl compress and compress ar both accept as equival to compress”.

Porter’s algorithm

• Commonest algorithm for stemming English


– Results suggest it’s at least as good as other stemming options
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: Of the rules in a compound command, select the
one that applies to the longest suffix.

Typical rules in Porter

• sses → ss
• ies → i
• ational → ate
• tional → tion

• Weight-of-word sensitive rules
  – (m>1) EMENT →
    • replacement → replac
    • cement → cement
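The sample rules can be applied as suffix rewrites, longest matching suffix first. This sketch covers only these four rules, not the full five-phase algorithm or its measure conditions like (m>1):

```python
# the slide's sample rules, ordered so the longest matching suffix wins
RULES = [("ational", "ate"), ("tional", "tion"), ("sses", "ss"), ("ies", "i")]

def apply_rules(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(apply_rules("relational"))  # -> relate
print(apply_rules("caresses"))    # -> caress
print(apply_rules("ponies"))      # -> poni
```

Note that a stem like "poni" need not be a real word; stems only have to bring related forms together.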

Other stemmers

• Other stemmers exist:
  – Lovins stemmer
    • http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
    • Single-pass, longest suffix removal (about 250 rules)
  – Paice/Husk stemmer
  – Snowball
• Full morphological analysis (lemmatization)
  – At most modest benefits for retrieval

Language-specificity

• The above methods embody transformations that are


– Language-specific, and often
– Application-specific
• These are “plug-in” addenda to the indexing process
• Both open source and commercial plug-ins are available for
handling these

Does stemming help?

• English: very mixed results. Helps recall for some queries but
harms precision on others
– E.g., operative (dentistry) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …
– 30% performance gains for Finnish!
Analyzers

• Standard Tokenizer
• Letter Tokenizer
• Lowercase Tokenizer
• Whitespace Tokenizer
• UAX URL Email Tokenizer
• Classic Tokenizer
• Thai Tokenizer
• Custom Tokenizer
Standard Tokenizer

• The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
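Any tokenizer can be tried out through Elasticsearch's `_analyze` API against a running node; the sample text below is our own:

```json
POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes!"
}
```

The response should list the emitted tokens (here The, 2, QUICK, Brown, Foxes). Note that a bare tokenizer does not lowercase anything; that is the job of token filters in a full analyzer.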
Letter Tokenizer

• The letter tokenizer breaks text into terms whenever it encounters a character which is not a letter. It does a reasonable job for most European languages, but a terrible job for some Asian languages, where words are not separated by spaces.
Lowercase Tokenizer

• The lowercase tokenizer, like the letter tokenizer, breaks text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms. It is functionally equivalent to the letter tokenizer combined with the lowercase token filter, but is more efficient as it performs both steps in a single pass.
Whitespace Tokenizer

• The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.
• Example: “The 2 QUICK Brown-Foxes” → [The, 2, QUICK, Brown-Foxes]
UAX URL email Tokenizer

• The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
• The standard tokenizer, by contrast, splits a URL or email address into several tokens.
Classic Tokenizer

• The classic tokenizer is a grammar based tokenizer that is good for English language
documents. This tokenizer has heuristics for special treatment of acronyms, company
names, email addresses, and internet host names. However, these rules don’t always
work, and the tokenizer doesn’t work well for most languages other than English:
• It splits words at most punctuation characters, removing punctuation. However, a dot
that’s not followed by whitespace is considered part of a token.
• It splits words at hyphens, unless there’s a number in the token, in which case the
whole token is interpreted as a product number and is not split.
• It recognizes email addresses and internet hostnames as one token.
Custom Analyzer

• In the settings section we specify our custom analyzer
• In the mappings section we use our analyzer
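As a sketch of what such a request might look like (the index name, analyzer name, filters, and field here are made up for illustration, and the mapping is written without a type name as in recent Elasticsearch versions):

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}
```

The analyzer is defined once under settings.analysis and then referenced by name wherever a field mapping needs it.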
Normalizers

• Normalizers are similar to analyzers except that they may only emit a single token (they do not have a tokenizer).
• Examples: arabic_normalization, asciifolding, bengali_normalization, german_normalization, …, lowercase, …, uppercase.
Ranked Retrieval

• Order documents by how likely they are to be relevant to the information requested:
  – Estimate relevance(q, di)
  – Sort documents by relevance
  – Display sorted results
• How do we estimate relevance?
  – Assume a document is relevant if it has a lot of query terms
  – Replace relevance with sim(q, di)
  – Compute similarity of vector representations
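A minimal sketch of this pipeline, scoring documents by cosine similarity of raw term-frequency vectors (no tf-idf weighting, unlike a real engine):

```python
import math
from collections import Counter

def cosine_sim(q_tokens, d_tokens):
    # similarity of the term-frequency vectors of query and document
    q, d = Counter(q_tokens), Counter(d_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    # estimate relevance, sort by it, return document ids best-first
    q = query.lower().split()
    scored = sorted(((cosine_sim(q, text.lower().split()), doc_id)
                     for doc_id, text in docs.items()), reverse=True)
    return [doc_id for _, doc_id in scored]

docs = {1: "Elasticsearch is a search engine",
        2: "cooking with gas"}
print(rank("search engine", docs))  # -> [1, 2]
```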
Questions
