1-Getting Started With ELK
nkerzazi@gmail.com
Goals: An overview of the basic concepts of Elasticsearch and common vocabulary
• What is Elasticsearch?
• Basic Concepts (Vocabulary)
• How it works
• Installing and configuring Elasticsearch
• Installing and configuring Kibana to connect to Elasticsearch
• Enabling monitoring using X-Pack for Elasticsearch and Kibana
• A quick example of loading data
History
• 1970: The ARPANET, the network that laid the foundation of the modern internet
• 1994: Yahoo created a directory of useful webpages
• 1998: Google began ranking pages based on how many other pages link to them
What is Elasticsearch?
• Google
• Bing (Microsoft)
• Baidu (China)
• Naver (Korea)
• Yahoo
• An Arabic search engine?
Key Concepts of Elasticsearch
• Node
• Cluster
• Index
• Document
• Shard
• Replicas
• Data types
• Mapping
• Schemas
Key Concepts: Example Mapping
"mappings": {
  "post": {
    "properties": {
      "user_id": { "type": "integer" },
      "post_text": { "type": "text" },
      "post_date": { "type": "date" }
    }
  }
}
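To tie these concepts together, here is a minimal sketch of creating an index with this mapping and indexing one document into it. It assumes Elasticsearch 7.x or later (where mapping types such as post are removed, so properties sits directly under mappings), a hypothetical index name posts, and the Kibana Dev Tools console; adapt to curl against localhost:9200 if preferred. Note that the old string field type was replaced by text in Elasticsearch 5.x.

# Create the index with an explicit mapping (7.x+ typeless syntax)
PUT posts
{
  "mappings": {
    "properties": {
      "user_id": { "type": "integer" },
      "post_text": { "type": "text" },
      "post_date": { "type": "date" }
    }
  }
}

# Index a single document; Elasticsearch stores it in one of the index's shards
PUT posts/_doc/1
{
  "user_id": 42,
  "post_text": "Getting started with the ELK stack",
  "post_date": "2020-01-15"
}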
How Does Search Work?
• Parsing a document
• Tokenization
– Tokenize text into words
• Stemming
• Lemmatization
• Dictionary
• Search
Tokenization
▪ Input: “Friends, Romans and Countrymen”
▪ Output: Tokens
▪ Friends
▪ Romans
▪ Countrymen
▪ A token is an instance of a sequence of characters
▪ Each such token is now a candidate for an index entry, after further processing (described below)
▪ But what are valid tokens to emit?
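Elasticsearch exposes this step directly through the _analyze API, which is convenient for experimenting. A minimal sketch reproducing the example above with the built-in standard tokenizer:

POST _analyze
{
  "tokenizer": "standard",
  "text": "Friends, Romans and Countrymen"
}

The response lists the tokens Friends, Romans, and, Countrymen with their positions and character offsets. Note that "and" survives tokenization; removing it is the job of a stop-word filter, discussed next.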
Stop words
• With a stop list, you exclude the most common words from the dictionary entirely. Intuition:
– They have little semantic content: the, a, and, to, be
– There are a lot of them: ~30% of postings for the top 30 words
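In Elasticsearch, stop words are removed by a token filter applied after tokenization. A minimal sketch using the built-in stop filter, which defaults to the English stop list:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "To be or not to be"
}

Every word in this text is on the default English stop list, so the response should contain no tokens at all.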
Lemmatization
• Reduce inflectional and variant forms of a word to its base (dictionary) form, the lemma
– E.g., am, are, is → be; car, cars, car's, cars' → car
Stemming
• Reduce terms to their "roots" before indexing by crudely chopping off affixes
– E.g., automate(s), automatic, automation all reduce to automat
Porter’s algorithm
• sses → ss
• ies → i
• ational → ate
• tional → tion
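Porter's algorithm is built into Elasticsearch as the porter_stem token filter (it is also what the stemmer filter defaults to for English). A minimal sketch applying the rules above:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "porter_stem"],
  "text": "ponies caresses operational"
}

Following the rules above, this yields roughly poni, caress, and oper.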
Other stemmers
• E.g., the Lovins and Paice/Husk stemmers
• Snowball: a framework for writing stemmers, with implementations for many languages
Language-specificity
• English: very mixed results. Helps recall for some queries but
harms precision on others
– E.g., operative (dentistry) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …
– 30% performance gains for Finnish!
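Language-specific stemming is configured in Elasticsearch through the stemmer token filter's language parameter. A minimal sketch for Finnish, where the text is three inflected forms of talo ("house") that should reduce toward a common stem; any supported language name can be substituted:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stemmer", "language": "finnish" }
  ],
  "text": "talot talojen taloissa"
}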
Tokenizers
• Standard Tokenizer
• Letter Tokenizer
• Lowercase Tokenizer
• Whitespace Tokenizer
• UAX URL Email Tokenizer
• Classic Tokenizer
• Thai Tokenizer
• Custom Tokenizer
Standard Tokenizer
• The standard tokenizer provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.
Lowercase Tokenizer
• The lowercase tokenizer, like the letter tokenizer, breaks text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms. It is functionally equivalent to the letter tokenizer combined with the lowercase token filter, but is more efficient as it performs both steps in a single pass.
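A minimal sketch of the lowercase tokenizer via the _analyze API:

POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes"
}

The digit is dropped (it is not a letter) and every term is lowercased: the, quick, brown, foxes.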
Whitespace Tokenizer
• The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.
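Running the same input as above makes the contrast clear, since the whitespace tokenizer neither lowercases nor splits on punctuation:

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes"
}

Output: The, 2, QUICK, Brown-Foxes.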
UAX URL Email Tokenizer
• The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
Classic Tokenizer
• The classic tokenizer is a grammar based tokenizer that is good for English language
documents. This tokenizer has heuristics for special treatment of acronyms, company
names, email addresses, and internet host names. However, these rules don’t always
work, and the tokenizer doesn’t work well for most languages other than English:
• It splits words at most punctuation characters, removing punctuation. However, a dot
that’s not followed by whitespace is considered part of a token.
• It splits words at hyphens, unless there’s a number in the token, in which case the
whole token is interpreted as a product number and is not split.
• It recognizes email addresses and internet hostnames as one token.
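A minimal sketch of both tokenizers via the _analyze API (the sample addresses are made up):

# classic: the email address and the product number stay single tokens
POST _analyze
{
  "tokenizer": "classic",
  "text": "Email john.doe@example.com about the XL-2000"
}

# uax_url_email: standard tokenization, but URLs and emails stay single tokens
POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Contact admin@example.com or visit https://example.com/docs"
}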
Custom Analyzer
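A custom analyzer combines one tokenizer with optional character filters and token filters. A minimal sketch wiring together pieces from the previous slides (my_index and my_analyzer are hypothetical names):

# Define the analyzer in the index settings
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "porter_stem"]
        }
      }
    }
  }
}

# Test it against the running example
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Friends, Romans and Countrymen"
}

This should return roughly friend, roman, countrymen: tokenized, lowercased, the stop word "and" removed, and the remaining terms stemmed.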