
Module 10

Module 10 – Solr

After completing this module, the student should be able to describe and use Solr, including:

• How to install Solr


• What a Core is and how to load text into the Core
• How to initiate a text-based Search
• How to customize a Search
• What relevance ranking is
• How to use the Solr web-based interfaces

Solr Page 1
Page 2 Solr
Table Of Contents
What Solr is/is not ........................................................................................................................... 4
Lab01: Install/Start Solr .................................................................................................................. 6
Overview of Solr Admin UI............................................................................................................ 8
Lab02: http://<ipaddr>:8983/solr .................................................................................................. 10
Solr architecture ............................................................................................................................ 12
Problem: Search text using SQL ................................................................................................... 14
Solution: Searching text using Lucene.......................................................................................... 16
Lucene’s Inverted Index - Details ................................................................................................. 18
Solr Terminology (1 of 2) ............................................................................................................. 20
Solr Terminology (2 of 2) ............................................................................................................. 22
Solr’s transformation..................................................................................................................... 24
Lab03: Testing a Transformation .................................................................................................. 26
What is a Document? .................................................................................................................... 28
What is a Document? (con’t) ........................................................................................................ 30
In-line lab: Configuring Solr ........................................................................................................ 32
Solrconfig.xml............................................................................................................................... 34
HTTP Get request ......................................................................................................................... 36
Browse request handler for Solritas .............................................................................................. 38
Solr Indexing process .................................................................................................................... 40
Solr Indexing process (con’t) ........................................................................................................ 42
Schema.xml file............................................................................................................................. 44
Schema.xml file – Custom fields .................................................................................................. 46
Importing Documents.................................................................................................................... 48
Lab04: Importing documents ........................................................................................................ 50
Solr UI parameters – Querying data.............................................................................................. 52
Our first search .............................................................................................................................. 54
In-line lab: Searching for iPod ...................................................................................................... 56
Ranked retrieval - Score ................................................................................................................ 58
Lab05: Fuzzy logic – Wildcard and Range ................................................................................... 60
In-line lab: Fuzzy logic – Edit-Distance searches ......................................................................... 62
In-line lab: Fuzzy logic – Proximity Search (1 of 2) .................................................................... 64
Proximity Search – How it’s done (2 of 2) ................................................................................... 66
Lab06: Expanded Search using Solritas - Spellcheck ................................................................... 68
In-line lab: Facets .......................................................................................................................... 70
In Review - Solr ............................................................................................................................ 72

Solr Page 3
What Solr is/is not
Solr is a standalone enterprise search server with a REST-like API. You put documents
in it (called "indexing") via JSON, XML, CSV or binary over HTTP. You query it via
HTTP GET and receive JSON, XML, CSV or binary results.

Powered by Lucene™, Solr enables advanced full-text search capabilities including phrases, wildcards, joins, grouping and much more across any data type.

Page 4 Solr
What Solr Is/Is Not

• Solr (pronounced "solar") is:


• An open source enterprise search platform from the Apache Lucene
project. Its major features include a document-storage mechanism that
supports full-text search, relevancy ranking, hit highlighting, faceted
search, real-time indexing, dynamic clustering, database integration,
NoSQL features and rich document (Word, PDF) handling
• Providing distributed search and index replication, Solr is highly scalable
and fault tolerant. Solr is the most popular enterprise search engine
• Solr is Not:
• Relational in any way. It is not well-suited to doing Joins
• Well-suited to updating a single field in a document (this functionality is limited)
• A web search engine like Google or Bing
• Designed to do search engine optimization (SEO) for a website
Solr has support for writing and reading its index and transaction log files to HDFS. This
does not use Hadoop MapReduce to process Solr data; rather, it only uses the HDFS filesystem
for index and transaction log file storage. (Solr can also be configured to build its indexes with
Hadoop MapReduce, but that is separate from HDFS storage.) To use HDFS rather than a local
filesystem, you must be using Hadoop 2.x and you will need to instruct Solr to use the
HdfsDirectoryFactory.
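A minimal sketch of what that looks like in solrconfig.xml (the NameNode address and HDFS path below are placeholders, not values from this course's cluster):

   <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
     <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
     <bool name="solr.hdfs.blockcache.enabled">true</bool>
   </directoryFactory>
   <!-- in <indexConfig>, also set: <lockType>hdfs</lockType> -->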

Solr Page 5
Lab01: Install/Start Solr

First, let's install and start Solr.

Page 6 Solr
Lab01: Install/Start Solr

Enough of concepts, let's get Solr up and running

• From Hadoop PuTTY prompt, remove, then install Solr. Do the following:
cd /opt/solr
rm -rf solr-4.7.2
tar zxf solr-4.7.2.tgz
• To start Solr, do the following

cd /opt/solr/solr-4.7.2/example
java -jar start.jar
• It's started when you see something like:

• If you ever need to stop Solr, you can issue CTRL+C twice

Solr Page 7
Overview of Solr Admin UI

Accessing the URL http://hostname:8983/solr/ will show the main dashboard, which is
divided into two parts.

The left side of the screen is a menu under the Solr logo that provides navigation
through the screens of the UI. The first set of links is for system-level information and
configuration and provide access to Logging, Core Admin and Java Properties, among
other things. At the end of this information is a list of Solr cores configured for this
instance. Clicking on a core name shows a secondary menu of information and
configuration options for the core specifically. Items in this list include the Schema,
Config, Plugins, and an ability to perform Queries on indexed data.

The center of the screen shows the detail of the option selected. This may include a
sub-navigation for the option or text or graphical representation of the requested data.
See the sections in this guide for each screen for more details.

Page 8 Solr
Overview of Solr Admin UI

Solr features a Web interface that makes it easy for Solr administrators and
programmers to view Solr configuration details, run queries and
analyze document fields in order to fine-tune a Solr configuration
Accessing the URL http://hostname:8983/solr/ will show the main dashboard,
which is divided into two parts:
1. The left side of the screen is a menu under the Solr logo that provides
navigation through the screens of the UI. The first set of links is
for system-level information and configuration and provides access to
Logging, Core Admin and Java Properties, among other things. At the
end of this information is a list of Solr cores configured for this
instance. Clicking on a core name shows a secondary menu of
information and configuration options for the core specifically. Items
in this list include the Schema, Config, Plugins, and an ability to
perform Queries on indexed data
2. The center of the screen shows the detail of the option selected. This
may include a sub-navigation for the option or text or graphical
representation of the requested data

Solr Page 9
Lab02: http://<ipaddr>:8983/solr

Page 10 Solr
Lab02: http://<ipaddr>:8983/solr

Now that Solr is started, go to the Home page and open up your Core selector.
By default, it is named 'collection1'. Select it, then click 'Query'.

Solr Page 11
Solr architecture
As you can see, there are lots of pieces to the Solr architecture. We'll explain the key
components and terminology in a few more slides.
Page 12 Solr
Solr architecture

Here's the Big Picture. It's a little overwhelming, but you'll see some of these terms
coming up in future slides.

Solr Page 13
Problem: Search text using SQL

Page 14 Solr
Problem: Searching text using SQL

Here's a list of books. A SQL query for 'Buying a Car' would result in some
relevant matches as well as some irrelevant matches.

SQL query:

SELECT * FROM books
 WHERE name LIKE '%buying%'
    OR name LIKE '%a%'
    OR name LIKE '%car%';

Input - Books:
The Beginner's Guide to Buying a Automobile
How to Buy your First Automobile
Purchasing the Automobile you've always wanted
Becoming that New Car Owner
New Car purchase: What you should know
Pimping out your Car
A Guide to Camping
How to Train your Dog
Buying a New House

The above demonstrates several challenges with Search conditions including:


• Doesn't understand variations like 'buying' and 'buy'
• Doesn't understand synonyms like 'buying' and 'purchasing'
• No concept of relevancy
• And what about spellchecking, lower case, and stemming?

Solr Page 15
Solution: Searching text using Lucene

Page 16 Solr
Solution: Searching text using Lucene

Solr uses Lucene's Inverted Index for its Search capabilities

Documents:
Doc#  Name
1     Beginner's Guide to Buying a Automobile
2     How to Buy your First Automobile
3     Purchasing the Automobile you've always wanted
4     Becoming that New Car Owner
5     New Car purchase: What you should know
6     Pimping out your new Car
7     A Guide to Camping
8     How to Train your Dog
9     Buying a New House

Inverted Index:
Term         Doc#
a            1,7,9
automobile   1,2,3
becoming     4
beginner's   1
buy          2
buying       1,9
camping      7
car          4,5,6
...          ...

• All terms in the Index map to one or more documents, and terms in the Index
are sorted in ascending order
• From here, we can use Solr to do complex queries: fuzzy matching,
relevancy ranking, term frequency (more on that later)

Solr Page 17
Lucene’s Inverted Index - Details

The fundamental concepts in Lucene are index, document, field and term.

An index contains a sequence of documents.

• A document is a sequence of fields.


• A field is a named sequence of terms.
• A term is a string.

The same string in two different fields is considered a different term. Thus terms are
represented as a pair of strings, the first naming the field, and the second naming text
within the field.

Inverted Indexing

The index stores statistics about terms in order to make term-based search more
efficient. Lucene's index falls into the family of indexes known as an inverted index. This
is because it can list, for a term, the documents that contain it. This is the inverse of the
natural relationship, in which documents list terms.

Types of Fields

In Lucene, fields may be stored, in which case their text is stored in the index literally, in
a non-inverted manner. Fields that are inverted are called indexed. A field may be both
stored and indexed.

The text of a field may be tokenized into terms to be indexed, or the text of a field may
be used literally as a term to be indexed. Most fields are tokenized, but sometimes it is
useful for certain identifier fields to be indexed literally.

Document Numbers

Internally, Lucene refers to documents by an integer document number. The first


document added to an index is numbered zero, and each subsequent document added
gets a number one greater than the previous.

See: https://lucene.apache.org/core/3_0_3/fileformats.html for more information.

Page 18 Solr
Lucene's Inverted Index - Details

3 Step Indexing process:


1. Convert DOC to XML/JSON
2. Add Doc to Solr using HTTP POST
3. Configure Solr to apply
transformation to text in Doc
(see next slide)

Solr Page 19
Solr Terminology (1 of 2)

One of the most confusing aspects of Solr terminology is in the difference between
collections, shards, replicas, cores, and config sets.

• Collection: A complete logical index in a SolrCloud cluster. It is associated with


a config set and is made up of one or more shards. If the number of shards is
more than one, it is a distributed index, but SolrCloud lets you refer to it by the
collection name and not worry about the shards parameter that is normally
required for DistributedSearch.
• Config Set: A set of config files necessary for a core to function properly. Each
config set has a name. At minimum this will consist of solrconfig.xml
(SolrConfigXml) and schema.xml (SchemaXml), but depending on the contents
of those two files, may include other files. This is stored in Zookeeper. Config
sets can be uploaded or updated using the upconfig command in the command-
line utility or the bootstrap_confdir Solr startup parameter.
• Core: This is discussed in the General list (below) as Solr Core. One difference
with SolrCloud is that the config it uses is in Zookeeper. With traditional Solr, the
core's config will be in the conf directory on the disk.
• Leader: The shard replica that has won the leader election. Elections can
happen at any time, but normally they are only triggered by events like a Solr
instance going down. When documents are indexed, SolrCloud will forward them
to the leader of the shard, and the leader will distribute them to all the shard
replicas.
• Replica: One copy of a shard. Each replica exists within Solr as a core. A
collection named "test" created with numShards=1 and replicationFactor set to
two will have exactly two replicas, so there will be two cores, each on a different
machine (or Solr instance). One will be named test_shard1_replica1 and the
other will be named test_shard1_replica2. One of them will be elected to be the
leader.
• Shard: A logical piece (or slice) of a collection. Each shard is made up of one or
more replicas. An election is held to determine which replica is the leader. This
term is also in the General list below, but there it refers to Solr cores.
The SolrCloud concept of a shard is a logical division.
• Zookeeper: This is a program that helps other programs keep a functional
cluster running. SolrCloud requires Zookeeper. It handles leader elections.
Although Solr can be run with an embedded Zookeeper, it is recommended that it
be standalone, installed separately from Solr. It is also recommended that it be a
redundant ensemble, requiring at least three hosts. Zookeeper can run on the
same hardware as Solr, and many users do run it on the same hardware.

Page 20 Solr
Solr Terminology (1 of 2)

1. Collection: A complete logical index in a cluster. It is associated with a


config set and is made up of one or more shards. If the number of shards is
more than one, it is a distributed index, but Solr lets you refer to it by the
collection name
2. Config Set: A set of config files necessary for a core to function properly.
Each config set has a name. At minimum this will consist of solrconfig.xml
and schema.xml, but depending on the contents of those two files, may
include other files. This is stored in Zookeeper
3. Core: This is a running instance of a Lucene index along with all the Solr
configuration (SolrConfig.xml, Schema.xml, etc...) required to use it
4. Leader: The shard replica that has won the leader election. Elections can
happen at any time, but normally they are only triggered by events like a
Solr instance going down. When documents are indexed, Solr will forward
them to the leader of the shard, and the leader will distribute them to all the
shard replicas

Solr Page 21
Solr Terminology (2 of 2)

Page 22 Solr
Solr Terminology (2 of 2)

5. Replica: One copy of a shard. Each replica exists within Solr as a core. A
collection named "test" created with numShards=1 and replicationFactor set
to two will have exactly two replicas, so there will be two cores, each on a
different machine (or Solr instance). One will be named test_shard1_replica1
and the other will be named test_shard1_replica2. One of them will be
elected to be the leader
6. Shard: A logical piece (or slice) of a collection. Each shard is made up of one
or more replicas. An election is held to determine which replica is the leader.
This term is also in the General list below, but there it refers to Solr cores.
7. Zookeeper: This is a program that helps other programs keep a functional
cluster running. Solr requires Zookeeper. It handles leader elections

Solr Page 23
Solr’s transformation

Page 24 Solr
Solr's transformation
#Yummmm:) Drinking a latte at Cafe' Grecco in SF's historic North Beach…
Learning text analytics with #SolrinAction by @ManningBooks on my i-pad.

Solr text analysis

#yumm drink latte cafe' grecco sf san francisco historic north beach

learn text analysis #solrinaction @manningbooks my ipad i pad

Raw text                    Transformation            What Solr did
All terms                   Lower cased (SF -> sf)    LowerCaseFilterFactory
a, at, in, with, by, on     Removed from text         StopFilterFactory
Drinking, learning          drink, learn              KStemFilterFactory
SF's                        sf, san francisco         WordDelimiterFilterFactory
Cafe'                       cafe                      ASCIIFoldingFilterFactory
i-Pad                       ipad, i pad               Hyphenated words
#Yummmm                     #yumm                     Collapse repeating letters down to a max of 2
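These per-field transformations are configured in schema.xml as an analyzer chain on a field type. A rough sketch of such a chain (the field type name and stopwords file are illustrative, and the exact filter order in the lab's config may differ):

   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.ASCIIFoldingFilterFactory"/>
       <filter class="solr.KStemFilterFactory"/>
     </analyzer>
   </fieldType>

Each filter corresponds to one of the 'What Solr did' entries in the table above; the Analysis screen in the Admin UI shows the output of each stage.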

Solr Page 25
Lab03: Testing a Transformation

Page 26 Solr
Lab03: Testing a Transformation
#Yummmm:) Drinking a latte at Caffe' Grecco in SF's historic North Beach…
Learning text analytics with #SolrinAction by @ManningBooks on my i-pad.

Solr Page 27
What is a Document?

Solr's basic unit of information is a document, which is a set of data that describes
something. A recipe document would contain the ingredients, the instructions, the
preparation time, the cooking time, the tools needed, and so on. A document about a
person, for example, might contain the person's name, biography, favorite color, and
shoe size. A document about a book could contain the title, author, year of publication,
number of pages, and so on.

In the Solr universe, documents are composed of fields, which are more specific pieces
of information. Shoe size could be a field. First name and last name could be fields.

Fields can contain different kinds of data. A name field, for example, is text (character
data). A shoe size field might be a floating point number so that it could contain values
like 6 and 9.5. Obviously, the definition of fields is flexible (you could define a shoe size
field as a text field rather than a floating point number, for example), but if you define
your fields correctly, Solr will be able to interpret them correctly and your users will get
better results when they perform a query.

You can tell Solr about the kind of data a field contains by specifying its field type. The
field type tells Solr how to interpret the field and how it can be queried.

When you add a document, Solr takes the information in the document's fields and adds
that information to an index. When you perform a query, Solr can quickly consult the
index and return the matching documents.

Page 28 Solr
What is a Document?

It is important to understand what kind of information can be put into Solr to be
searched upon (a document) and how that information is structured
• Solr is a document storage and retrieval engine. Every piece of data
submitted for processing is a document. Documents could be a newspaper
article, a resume or social profile, or in an extreme case, a book
• Each document contains one or more fields, each of which has a particular
data type (string, date, integer, boolean, lat/long, etc.)

<doc>
<field name="id">F8V7067-APL-KIT</field>
<field name="name">Belkin Mobile Power Cord for iPod</field>
<field name="manu">Belkin</field>
<field name="manu_id_s">belkin</field>
<field name="cat">electronics</field>
<field name="cat">connector</field>
<field name="features">car power adapter, white</field>
<field name="weight">4</field>
<field name="price">19.95</field>
</doc>

Solr Page 29
What is a Document? (con’t)

Field analysis tells Solr what to do with incoming data when building an index. A more
accurate name for this process would be processing or even digestion, but the official
name is analysis.

Consider, for example, a biography field in a person document. Every word of the
biography must be indexed so that you can quickly find people whose lives have had
anything to do with ketchup, or dragonflies, or cryptography.

However, a biography will likely contain lots of words you don't care about and don't
want clogging up your index—words like "the", "a", "to", and so forth. Furthermore,
suppose the biography contains the word "Ketchup", capitalized at the beginning of a
sentence. If a user makes a query for "ketchup", you want Solr to tell you about the
person even though the biography contains the capitalized word.

The solution to both these problems is field analysis. For the biography field, you can
tell Solr how to break apart the biography into words. You can tell Solr that you want to
make all the words lower case, and you can tell Solr to remove accent marks.

Page 30 Solr
What is a Document? (con't)

• When running a query, we can search on one or more fields and Solr will
return documents that contain content in those fields matching the query
• Solr has a flexible schema for each document. It has the ability to
automatically guess a field type for previously unseen field names

• To recap:
• A document is a collection of fields that map to particular field types
defined in a schema. Each field in a document has its content analyzed
according to its field type and the results are saved into a Search Index
in order to later retrieve the document via a query

Solr Page 31
In-line lab: Configuring Solr

There are many options to configure and customize Solr. The main way to do this is
through 3 XML files as documented in the slide.

Page 32 Solr
In-line lab: Configuring Solr

Solr focuses around 3 main configuration files

1. solr.xml – Defines properties related to admin, logging, sharding and SolrCloud

2. solrconfig.xml – Defines main settings for a specific Solr core

3. schema.xml – Defines structure of your Index, including fields and field types

From Web GUI, ensure Core Selector = collection1 is displayed, then click on
'Files'. This will show you all the configuration files for the collection1 core

Click on solrconfig.xml to display active configuration settings

Solr Page 33
Solrconfig.xml

The solrconfig.xml file is the configuration file with the most parameters affecting Solr
itself. While configuring Solr, you'll work with solrconfig.xml often. The file comprises a
series of XML statements that set configuration values for an individual collection.

In solrconfig.xml, you configure important features such as:

• request handlers, which process the requests to Solr, such as requests to add
documents to the index or requests to return results for a query

• listeners, processes that "listen" for particular query-related events; listeners can
be used to trigger the execution of special code, such as invoking some common
queries to warm-up caches

• the Request Dispatcher for managing HTTP communications

• the Admin Web interface

• parameters related to replication and duplication (these parameters are covered


in detail in Legacy Scaling and Distribution)

The solrconfig.xml file is located in the conf/ directory for each collection. Several well-
commented example files can be found in the server/solr/configsets/ directories
demonstrating best practices for many different types of installations.
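A sketch of a typical request handler entry (the handler name and defaults below are illustrative, not copied from the lab's file):

   <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="df">text</str>
     </lst>
   </requestHandler>

Every HTTP request to /solr/<core>/select is routed to this handler, and the defaults are merged with the query parameters the client sends.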

Page 34 Solr
2 Solrconfig.xml

Solr Page 35
HTTP Get request

Page 36 Solr
2 HTTP Get request

HTTP Get request

Search Handler in solrconfig.xml
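For example, a GET such as http://<ipaddr>:8983/solr/collection1/select?q=iPod&wt=json is routed by its /select path to the SearchHandler defined in solrconfig.xml, which then parses the q and wt parameters.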

Solr Page 37
Browse request handler for Solritas

There are many options to configure and customize Solr. The main way to do this is
through 3 XML files as documented in the slide.

Page 38 Solr
2 Browse request handler for Solritas

solrconfig.xml

Solr Page 39
Solr Indexing process

A Solr index can accept data from many different sources, including XML files, comma-
separated value (CSV) files, data extracted from tables in a database, and files in
common file formats such as Microsoft Word or PDF.

Here are the three most common ways of loading data into a Solr index:

• Using the Solr Cell framework built on Apache Tika for ingesting binary files or
structured files such as Office, Word, PDF, and other proprietary formats.

• Uploading XML files by sending HTTP requests to the Solr server from any
environment where such requests can be generated.

• Writing a custom Java application to ingest data through Solr's Java Client API
(which is described in more detail in Client APIs). Using the Java API may be the
best choice if you're working with an application, such as a Content Management
System (CMS), that offers a Java API.

Regardless of the method used to ingest data, there is a common basic data structure
for data being fed into a Solr index: a document containing multiple fields, each with
a name and containing content, which may be empty. One of the fields is usually
designated as a unique ID field (analogous to a primary key in a database), although
the use of a unique ID field is not strictly required by Solr.

If the field name is defined in the schema.xml file that is associated with the index, then
the analysis steps associated with that field will be applied to its content when the
content is tokenized. Fields that are not explicitly defined in the schema will either be
ignored or mapped to a dynamic field definition (see Documents, Fields, and Schema
Design), if one matching the field name exists.

Page 40 Solr
Solr Indexing process

1. Convert a document from its native format into a format supported by Solr,
such as XML or JSON
2. Add the document to Solr using one of several well-defined interfaces,
typically HTTP POST
3. Configure Solr to apply transformations to the text in the document during
Indexing
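As a concrete illustration of step 2 (the id and name values below are made up; the host, port and core name follow the earlier labs):

   curl 'http://<ipaddr>:8983/solr/collection1/update?commit=true' \
        -H 'Content-Type: text/xml' \
        --data-binary '<add><doc><field name="id">TEST-001</field><field name="name">Sample document</field></doc></add>'

The same update endpoint accepts JSON and CSV documents when the matching Content-Type is sent.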

Solr Page 41
Solr Indexing process (con’t)

Page 42 Solr
Solr Indexing process (con't)

Solr Page 43
Schema.xml file
The schema.xml file contains all of the details about which fields your documents can
contain, and how those fields should be dealt with when adding documents to the index,
or when querying those fields.

It describes one of the most important parts of the implementation – the structure of the
data index. The information contained in this file allows you to control how Solr behaves
when indexing data or when making queries. Schema.xml is not only the structure of the
index; it also contains detailed information about the data types, which have a large
influence on Solr's behavior and are often treated with neglect. This section tries to
bring some insight into schema.xml.

Schema.xml file consists of several parts:

• version,
• type definitions,
• field definitions,
• copyField section,
• additional definitions.

Page 44 Solr
3 Schema.xml file

Located under /opt/solr/solr-4.7.2/example/solr/collection1/conf, the schema.xml


file represents all the possible fields and data types necessary to map
documents into a Lucene index. Here's the default:
For now, you’ll notice the three main sections of the schema.xml document:
1. The <fields> element, containing <field> and <dynamicField> elements
used to define the basic structure of your documents
2. Miscellaneous elements, such as <uniqueKey> and <copyField>, which are
listed after the <fields> element
3. Field types under the <types> element that determine how dates, numbers,
and text fields are handled in Solr
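A heavily abbreviated sketch of that structure (field and type names here follow the example documents; everything else is trimmed):

   <schema name="example" version="1.5">
     <fields>
       <field name="id" type="string" indexed="true" stored="true" required="true"/>
       <field name="name" type="text_general" indexed="true" stored="true"/>
       <field name="price" type="float" indexed="true" stored="true"/>
       <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
     </fields>
     <uniqueKey>id</uniqueKey>
     <copyField source="name" dest="text"/>
     <types>
       <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
       <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"/>
       <fieldType name="text_general" class="solr.TextField"> <!-- analyzer chain goes here --> </fieldType>
     </types>
   </schema>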

Solr Page 45
Schema.xml file – Custom fields

Page 46 Solr
Schema.xml file – Custom fields

If you wish to add new kinds of documents, you typically need to define new custom
fields under the <fields> element. For example, to add some blog
documents, you would copy something like this into the schema.xml file:
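An illustrative sketch of what such blog fields could look like (these field names are hypothetical, not taken from the original slide):

   <field name="blog_title"     type="text_general" indexed="true" stored="true"/>
   <field name="blog_author"    type="string"       indexed="true" stored="true"/>
   <field name="blog_posted_dt" type="date"         indexed="true" stored="true"/>
   <field name="blog_body"      type="text_general" indexed="true" stored="true"/>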

Solr Page 47
Importing Documents

Page 48 Solr
Importing Documents

• There are many different ways to import your data into Solr... one can:
1. Import records from a database using the Data Import Handler (DIH)
2. Load a CSV file (comma separated values), including those exported by
Excel or MySQL
3. POST JSON documents
4. Index binary documents such as Word and PDF with Solr Cell
(ExtractingRequestHandler)
5. Use SolrJ for Java or other Solr clients to programmatically create
documents to send to Solr
• What about Updates?
• When the same document is POSTed to the server twice, you still only
get 1 result when searching. Whenever you POST commands to Solr to
add a document with the same value for the uniqueKey as an existing
document, it automatically replaces it for you
• What about Deletes?
• You can delete data by POSTing a delete command
java -Ddata=args -Dcommit=false -jar post.jar "<delete><id>SP2514N</id></delete>"

Solr Page 49
Lab04: Importing documents

Page 50 Solr
Lab04: Importing documents
(Do NOT close Solr; open a new prompt.)

Let's index some XML files. Execute the below code to add them to Solr
Open a new Hadoop PuTTY prompt, then execute:
cd /opt/solr/solr-4.7.2/example/exampledocs
java -jar post.jar *.xml

It is typical to have
many documents in
a file as shown here

Solr Page 51
Solr UI parameters – Querying data

Page 52 Solr
Solr UI parameters – Querying data

Searches are done via HTTP GET with the query string in the q parameter

q: Main query. Search for docs matching this keyword; *:* = all docs
fq: Filter query. Restricts the result set to docs that match this filter
sort: Sorts results either ascending or descending
start, rows: Start at this offset into the results and return this many of them
fl: Specifies which fields to return for each doc in the results
df: Default search field
wt: Governs the format of the response (e.g. json or xml)
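Putting a few of these together, a request against the lab's core might look like this (the field values come from the example documents):

   http://<ipaddr>:8983/solr/collection1/select?q=iPod&fq=manu:Belkin&sort=price+asc&fl=name,price,score&wt=json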

Advanced Features

Solr Page 53
Our first search
From the pull-down, ensure you are at COLLECTION1, then highlight the QUERY
parameter.

Notice the q caption (which stands for query). The default value is *:*, which signifies
it will search for all documents.

Page 54 Solr
In-line lab: Our first search

Time to do a Search. q = *:* means find all documents. We'll stick with that.
Click the 'Execute Query' button.

You get back an answer set that says there are 32 documents in the Core
(keep looking, it's there).

Solr Page 55
In-line lab: Searching for iPod

Page 56 Solr
In-line lab: Searching for iPod

Do it again only this time search for 'iPod'. This time 3 docs get returned
Then, SORT = price asc
Then, FL = name,price,features,score
Then, FQ = manu: Belkin
In the response:

• The response header contains status info and echoes back the query parameters
• The main response element includes the # of hits and the Score of the 'best' document
• The docs matching the query (the 'hits') follow
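A sketch of that JSON response's shape (values abridged and illustrative; the exact numbers depend on the example data):

   {
     "responseHeader": { "status": 0, "QTime": 2,
       "params": { "q": "iPod", "fq": "manu:Belkin", "sort": "price asc",
                   "fl": "name,price,features,score", "wt": "json" } },
     "response": { "numFound": 3, "start": 0, "maxScore": 1.4,
       "docs": [ { "name": "Belkin Mobile Power Cord for iPod",
                   "price": 19.95, "score": 0.7 }, ... ] }
   }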

Solr Page 57
Ranked retrieval - Score
Documents can be ranked using tf-idf weight (Term Frequency-Inverse Document
Frequency)

• Free text queries: Rather than a query language of operators and expressions,
the user’s query is just one or more words in a human language
• Rather than a set of documents satisfying a query expression, in ranked retrieval,
the system returns an ordering over the (top) documents in the collection for a
query
• Large result sets are not an issue: just show the top k (=~10)
• Premise: the ranking algorithm works
• Score is the key component of ranked retrieval models

• We would like to compute a score between a query term t and a document d.
• The simplest way is to say score(q, d) = tf(t,d)
• The term frequency tf(t,d) of term t in document d is the number of times that t occurs in d
• Relevance does not increase proportionally with term frequency
• Certain terms have little or no discriminating power in determining relevance
• Need a mechanism for attenuating the effects of frequent terms
• Less informative than rare terms

• The tf-idf weight of a term is the product of its tf weight and its idf weight
• Best known weighting scheme in information retrieval
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection

• At this point, we may view each document as a vector


• with one component corresponding to each term in the dictionary, together
with a tf-idf weight for each component.
• This is a |V|-dimensional vector, where V is the set of dictionary terms
• For dictionary terms that do not occur in a document, this weight is zero
• In practice we consider d as a |q|-dimensional vector
• |q| is the number of distinct terms in the query q
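A quick worked sketch (the numbers are invented purely for illustration): suppose a collection has N = 1,000 documents, the term 'ipod' appears in df = 10 of them, and occurs tf = 3 times in document d. Then idf = log10(N/df) = log10(100) = 2, and the tf-idf weight of 'ipod' in d is tf × idf = 3 × 2 = 6. Lucene's practical scoring function layers length normalization and boosts on top of this basic idea.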

Page 58 Solr
Ranked retrieval - Score

One of the key differences between Solr and other databases is ranked retrieval:
the process of scoring docs by their relevance to a query

Every doc that matches a query is assigned a relevance Score, and results are returned
in descending order based on that Score.

In the iPod search, the 1st doc listed has 'iPod' twice, so its score is higher than the
2nd doc, which has 'iPod' once.

Solr Page 59
Lab05: Fuzzy logic – Wildcard and Range
Wildcard Searches

Lucene supports single and multiple character wildcard searches within single terms
(not within phrase queries).

To perform a single character wildcard search use the "?" symbol.

To perform a multiple character wildcard search use the "*" symbol.

The single character wildcard search looks for terms that match that with the single
character replaced. For example, to search for "text" or "test" you can use the search:

te?t

Multiple character wildcard searches look for 0 or more characters. For example, to
search for test, tests or tester, you can use the search:

test*

You can also use the wildcard searches in the middle of a term.

te*t

Note: You cannot use a * or ? symbol as the first character of a search.

Range Searches

Range Queries allow one to match documents whose field(s) values are between the
lower and upper bound specified by the Range Query. Range Queries can be inclusive
or exclusive of the upper and lower bounds. Sorting is done lexicographically.

mod_date:[20020101 TO 20030101]

This will find documents whose mod_date fields have values between 20020101 and
20030101, inclusive. Note that Range Queries are not reserved for date fields. You
could also use range queries with non-date fields:

title:{Aida TO Carmen}

This will find all documents whose titles are between Aida and Carmen, but not
including Aida and Carmen.

Inclusive range queries are denoted by square brackets. Exclusive range queries are
denoted by curly brackets.

Page 60 Solr
Lab05: Fuzzy logic – Wildcard and Range

• Wildcard Searching
* – allows you to search for 0 or more characters
? – Matches only a single character

1. From Solr UI, under 'q' caption, type: vid* and then click 'Execute Query'
2. How many docs were returned?
3. How many keywords did you find?
4. Change to vi*
5. How many docs were returned and which keywords were found this time?

• Range Searching
• price:[0 TO 50.00]

• cat:[a TO s]

• manufacturedate_dt: [2006-01-01T15:26:37Z TO 2006-02-13T15:26:37Z]
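For example, price:[0 TO 50.00] can be typed directly into the 'q' box to find everything under $50, or combined with a keyword search by putting the range into 'fq' instead (e.g. q = ipod, fq = price:[0 TO 50.00]).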

Solr Page 61
In-line lab: Fuzzy logic – Edit-Distance searches
It's important to allow for flexibility in handling spelling errors or slight variations in
correct spellings. Solr uses the Damerau-Levenshtein distance algorithm, which accounts
for greater than 80% of human misspellings.

Page 62 Solr
In-line lab: Fuzzy logic – Edit-Distance
searches
• Edit-Distance Searching
• It's important to allow for flexibility in handling spelling errors or slight
variations in correct spellings. Solr uses the Damerau-Levenshtein distance
algorithm, which accounts for greater than 80% of human misspellings
• Edit-Distance searches using the tilde '~' character as follows:
• Administrator~ matches adminstrator, administrater, administraitor,
etc. By default this syntax matches any other term within 2 edit
distances of the original term. An edit distance is defined as an
insertion, deletion, substitution or transposition of characters
• Can modify the strictness of edit-distance searches to any distance
• Administrator~1 (1 distance)
• Administrator~2 (2 distance - default )
• Administrator~3 (3 distance)

• Find any docs that have 'veiwsonic' within 2 edit distances:
in the 'q' caption, search for veiwsonic~ (2 distances is the default)

Solr Page 63
In-line lab: Fuzzy logic – Proximity Search (1 of 2)
Proximity Searches

Lucene supports finding words that are within a specific distance of each other. To do a
proximity search, use the tilde, "~", symbol at the end of a Phrase. For example, to search
for "apache" and "jakarta" within 10 words of each other in a document, use the search:

"jakarta apache"~10

Page 64 Solr
In-line lab:Fuzzy logic-Proximity Search (1 of 2)

• Proximity Searching
• Edit-Distance is for searching for single terms that are close to the original.
You can extend this principle and apply between terms (phrases)
• Suppose you want to find the words 'chief' and 'officer'. If you used an
AND between the two words you might get a lot of poor hits ('a chief
concern for most citizens is if the police officer was properly trained')
• Instead you decide how many words come between chief and officer using
the below syntax:
• "chief officer"~1 (chief and officer max of 1 position away)
• "chief officer"~2 (chief and officer max of 2 position away)
• "chief officer"~3 (chief and officer max of 3 position away)

• Find any docs that have 'hard drive' and 'GB' within 2 words of each other

In 'q' caption, type: "hard drive GB"~2

Solr Page 65
Proximity Search – How it’s done (2 of 2)
In the previous slide we showed how you could do a proximity search. But how is that
search accomplished?

Lucene's inverted index keeps a 'term position' ledger, so Solr just has to look up the
two words in question and see how 'far' apart they are in each document.

Page 66 Solr
Proximity Searches – How it's done (2 of 2)

• Besides individual words, it's possible to query Solr for phrases. This is
done using the Lucene Index's term position feature. Suppose we want to
search for 'new home' phrase
• Term positions allow Solr to reconstruct the original positions of Indexed
terms within the document making it possible to search for phrases
Inverted Index with term positions

Documents:
Doc#  Name
1     A Fun Guide to Cooking
2     Decorating your Home
3     How to Raise a Child
4     Buying a New Car
5     Buying a New Home
6     Beginner's Guide to Buying a House
7     Purchasing a Home
8     Becoming a New Home Owner
9     How to Buy your First House

Index (Term -> Doc#, Position):
a           -> (1, 1), (3, 4), ...
beginner's  -> (6, 1)
cooking     -> (1, 5)
decorating  -> (2, 1)
your        -> (2, 2), (9, 4)
home        -> (2, 3), (5, 4), (7, 3), (8, 4)
new         -> (4, 3), (5, 3), (8, 3)
car         -> (4, 4)
...         -> ...

Solr Page 67
Lab06: Expanded Search using Solritas - Spellcheck
The SpellCheck component is designed to provide inline query suggestions based on
other, similar, terms. The basis for these suggestions can be terms in a field in Solr,
externally created text files, or fields in other Lucene indexes.

Page 68 Solr
Lab06: Expanded Search using Solritas-
Spellcheck
Solritas UI allows more advanced searching. Go to:
hdp22:8983/solr/collection1/browse then click on the GROUP BY tab.
Mis-type 'Videoh' on purpose. It returns no hits but asks whether you
meant 'video'. Spell-check is baked into Solr.
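Under the hood this is Solr's SpellCheck component. The same suggestion can be requested with plain query parameters, along these lines (the parameter names are the standard spellcheck ones; whether the component is attached to /select depends on the core's solrconfig.xml):

   http://<ipaddr>:8983/solr/collection1/select?q=videoh&wt=json&spellcheck=true&spellcheck.collate=true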

Solr Page 69
In-line lab: Facets
Faceting is the arrangement of search results into categories based on indexed terms.
Searchers are presented with the indexed terms, along with numerical counts of how
many matching documents were found were each term. Faceting makes it easy for
users to explore search results, narrowing in on exactly the results they are looking for.

Faceted search is the dynamic clustering of items or search results into categories that
let users drill into search results (or even skip searching entirely) by any value in any
field. Each facet displayed also shows the number of hits within the search that match
that category. Users can then "drill down" by applying specific constraints to the search
results. Faceted search is also called faceted browsing, faceted navigation, guided
navigation and sometimes parametric search.

Faceted search provides an effective way to allow users to refine search results,
continually drilling down until the desired items are found. The benefits include

• Superior feedback – users can see at a glance a summary of the search results
and how those results break down by different criteria.
• No surprises or dead ends – users know how many results match before they
click. Values with zero counts are normally removed to reduce visual noise and
eliminate the possibility of a user accidentally selecting a constraint that would
lead to no results.
• No selection hierarchy is imposed – users are generally free to add or remove
constraints in any order.
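In raw query terms, faceting is requested with the facet parameters; a sketch against the lab's core (the field names come from the example documents) might be:

   http://<ipaddr>:8983/solr/collection1/select?q=video&wt=json&facet=true&facet.field=cat&facet.field=manu&facet.mincount=1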

Page 70 Solr
In-line lab: Facets (Simple, Spatial, Group By)

Faceted search takes the documents matched by a query and generates counts
for various properties or categories. Links are usually provided that allow
users to 'drill down' or refine their search results based on the returned
categories

From the Simple tab, type in 'Video' and click on 'Submit Query'.

Notice the Field Facet categories. Go ahead and select any one of the
categories to drill down.

Solr Page 71
In Review - Solr

Page 72 Solr
In Review - Solr

After completing this module, the student should be able to describe and use Solr, including:

• How to install Solr


• What a Core is and how to load text into the Core
• How to initiate a text-based Search
• How to customize a Search
• What relevance ranking is
• How to use the Solr web-based interfaces

Lucene is a Java library that allows you to create and search indexes of textual data.
It's not easy to use by itself (and it's not a service), which is why people have created
wrappers such as Solr around it.

Solr Page 73
