Search engines

A search engine is a program that searches documents for specified keywords and returns a list of the
documents in which the keywords were found. The term really names a general class of programs;
however, it is often used specifically to describe systems like Google, Bing and Yahoo! Search
that enable users to search for documents on the World Wide Web.
Web Search Engines
Typically, Web search engines work by sending out a spider to fetch as many documents as possible.
Another program, called an indexer, then reads these documents and creates an index based on the
words contained in each document. Each search engine uses a proprietary algorithm to create its
indices such that, ideally, only meaningful results are returned for each query.
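The indexer step described above can be sketched as an inverted index, a mapping from each word to the documents that contain it. This is only a minimal illustration, not the proprietary algorithm of any real search engine; the page names, the sample texts, and the helper functions tokenize, build_index and search are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every query keyword (Boolean AND)."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)

# Hypothetical pages that a spider might have fetched.
documents = {
    "page1": "Search engines index documents from the Web.",
    "page2": "A spider fetches documents for the indexer.",
    "page3": "The indexer builds an index of words.",
}
index = build_index(documents)
print(search(index, "documents"))          # ['page1', 'page2']
print(search(index, "indexer documents"))  # ['page2']
```

Because the index is built once, each query only has to look up its keywords instead of rereading every document.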
A concept search (or conceptual search) is an automated information retrieval method that is used to
search electronically stored unstructured text (for example, digital archives, email, scientific
literature, etc.) for information that is conceptually similar to the information provided in a search
query. In other words, the ideas expressed in the information retrieved in response to a concept search
query are relevant to the ideas contained in the text of the query.
Why Concept Search?
Concept search techniques were developed because of limitations imposed by classical
Boolean keyword search technologies when dealing with large, unstructured digital collections of
text. Keyword searches often return results that include many non-relevant items (false positives) or
that exclude too many relevant items (false negatives) because of the effects
of synonymy and polysemy. Synonymy means that two or more words in the same language
have the same meaning, and polysemy means that many individual words have more than one
meaning.
Polysemy is a major obstacle for all computer systems that attempt to deal with human
language. In English, most frequently used terms have several common meanings. For example, the
word fire can mean: a combustion activity; to terminate employment; to launch, or to excite (as in fire
up). For the 200 most-polysemous terms in English, the typical verb has more than twelve common
meanings, or senses. The typical noun from this set has more than eight common senses. For the 2000
most-polysemous terms in English, the typical verb has more than eight common senses and the
typical noun has more than five.
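The effect of synonymy and polysemy on a keyword search can be shown with a small sketch built around the word fire. The three sample sentences and their relevance labels are made up for illustration, and keyword_match is a hypothetical helper that stands in for a plain keyword search.

```python
import re

def keyword_match(text, keyword):
    """Return True if the exact keyword occurs as a whole word in the text."""
    return keyword.lower() in re.findall(r"[a-z]+", text.lower())

# Hypothetical documents. The flag records whether each one is really about
# fire in the sense of combustion, which is what the searcher wants.
documents = [
    ("Fire crews contained the fire overnight.", True),
    ("The company decided to fire two managers.", False),  # polysemy: another sense
    ("The blaze destroyed the warehouse.", True),          # synonymy: no shared keyword
]

query = "fire"
hits = [text for text, _ in documents if keyword_match(text, query)]
false_positives = [text for text, relevant in documents
                   if keyword_match(text, query) and not relevant]
false_negatives = [text for text, relevant in documents
                   if not keyword_match(text, query) and relevant]

print("hits:", hits)
print("false positives:", false_positives)
print("false negatives:", false_negatives)
```

The sentence about dismissing managers is returned although it is not relevant, and the sentence about the blaze is missed although it is relevant.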
In addition to the problems of polysemy and synonymy, keyword searches can exclude
inadvertently misspelled words as well as variations on the stems (or roots) of words (for
example, strike vs. striking). Keyword searches are also susceptible to errors introduced by optical
character recognition (OCR) scanning processes, which can introduce random errors into the text of
documents (often referred to as noisy text).
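Two common remedies for these problems are stemming and approximate string matching. The sketch below is a rough illustration only: naive_stem is a deliberately crude, hypothetical suffix stripper rather than a real stemming algorithm, the vocabulary is made up, and the 0.7 similarity cutoff is an arbitrary assumption. The fuzzy lookup relies on difflib.get_close_matches from the Python standard library.

```python
import difflib

def naive_stem(word):
    """Strip one common English suffix. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def fuzzy_lookup(word, vocabulary, cutoff=0.7):
    """Return the most similar vocabulary term, or None if none is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Exact matching treats the two forms as different words; stemming joins them.
print("striking" == "strike")                          # False
print(naive_stem("striking") == naive_stem("strike"))  # True

# Hypothetical index vocabulary, queried with a typo and an OCR-style error.
vocabulary = ["search", "engine", "strike", "document"]
print(fuzzy_lookup("serach", vocabulary))     # search
print(fuzzy_lookup("docurnent", vocabulary))  # document
```

In the last line the letters "rn" stand in for an "m", a typical OCR confusion.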
A concept search can overcome these challenges by employing word sense disambiguation (WSD)
and other techniques to help it derive the actual meanings of the words, and their underlying
concepts, rather than simply matching character strings as keyword search technologies do.
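One classic way to approach word sense disambiguation is to compare the words around an ambiguous term with a short definition (gloss) of each candidate sense, in the spirit of the simplified Lesk algorithm. The sketch below assumes three hand-written glosses for the word fire and a small stopword list, both invented for this example; a real system would draw its glosses from a lexical resource and would usually combine several techniques.

```python
import re

# Small, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"a", "an", "the", "to", "of", "in", "or", "and", "from", "by", "his"}

# Hypothetical glosses for three senses of "fire".
SENSES = {
    "combustion": "flames heat smoke burning produced by combustion of fuel",
    "dismiss": "terminate the employment of a worker dismiss from a job",
    "launch": "shoot or launch a weapon bullet missile or rocket",
}

def content_words(text):
    """Return the set of lowercase words in the text that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(context, senses):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    overlaps = {name: len(context_words & content_words(gloss))
                for name, gloss in senses.items()}
    return max(overlaps, key=overlaps.get)

print(simplified_lesk("The manager decided to fire the worker from his job.", SENSES))
print(simplified_lesk("Smoke and flames rose from the fire.", SENSES))
print(simplified_lesk("The soldiers fire a missile.", SENSES))
```

The three calls select the dismiss, combustion and launch senses respectively, because each context shares words only with the matching gloss.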

Uses of Concept Search

• eDiscovery - Concept-based search technologies are increasingly being used for Electronic
Document Discovery (EDD or eDiscovery) to help enterprises prepare for litigation. In
eDiscovery, the ability to cluster, categorize, and search large collections of unstructured text on a
conceptual basis is much more efficient than traditional linear review techniques. Concept-based
searching is becoming accepted as a reliable and efficient search method that is more likely to
produce relevant results than keyword or Boolean searches.
• Enterprise Search and Enterprise Content Management (ECM) - Concept search
technologies are being widely used in enterprise search. As the volume of information within the
enterprise grows, the ability to cluster, categorize, and search large collections of unstructured
text on a conceptual basis has become essential.
• Content-Based Image Retrieval (CBIR) - Content-based approaches are being used for the
semantic retrieval of digitized images and video from large visual corpora. One of the earliest
content-based image retrieval systems to address the semantic problem was the ImageScape
search engine. In this system, the user could make direct queries for multiple visual objects such
as sky, trees, water, etc. using spatially positioned icons in a WWW index containing more than
ten million images and videos using key frames. The system used information theory to
determine the best features for minimizing uncertainty in the classification. The semantic gap is
often mentioned in regard to CBIR. The semantic gap refers to the gap between the information
that can be extracted from visual data and the interpretation that the same data have for a user in a
given situation. The ACM SIGMM Workshop on Multimedia Information Retrieval is dedicated
to studies of CBIR.
• Multimedia and Publishing - Concept search is used by the multimedia and publishing
industries to provide users with access to news, technical information, and subject matter
expertise coming from a variety of unstructured sources. Content-based methods for multimedia
information retrieval (MIR) have become especially important when text annotations are missing
or incomplete.
• Digital Libraries and Archives - Images, videos, music, and text items in digital libraries and
digital archives are being made accessible to large groups of users (especially on the Web)
through the use of concept search techniques. For example, the Executive Daily Brief (EDB), a
business information monitoring and alerting product developed by EBSCO Publishing, uses
concept search technology to provide corporate end users with access to a digital library
containing a wide array of business content. In a similar manner, the Music Genome
Project spawned Pandora, which employs concept searching to spontaneously create individual
music libraries or virtual radio stations.
• Genomic Information Retrieval (GIR) - GIR uses concept
search techniques applied to genomic literature databases to overcome the ambiguities of
scientific literature.
• Human Resources Staffing and Recruiting - Many human resources staffing and recruiting
organizations have adopted concept search technologies to produce highly relevant resume search
results that provide more accurate and relevant candidate resumes than loosely related keyword
results.
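The CBIR item above mentions using information theory to choose the features that best reduce uncertainty in a classification. One standard measure of that kind is information gain, the drop in entropy of the class labels once a feature's value is known. The sketch below is a generic illustration and is not claimed to be the method of the system described above; the region labels and the two features (in_upper_half, is_blue) are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [label for f, label in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical image regions: two show sky, two show water.
labels = ["sky", "sky", "water", "water"]
in_upper_half = [True, True, False, False]  # separates the two classes perfectly
is_blue = [True, True, True, True]          # identical for every region

print(information_gain(in_upper_half, labels))  # 1.0
print(information_gain(is_blue, labels))        # 0.0
```

A feature with high gain removes more uncertainty about the class, so it is the better choice for classification.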

Effective Concept Searching

The effectiveness of a concept search can depend on a variety of elements including the dataset being
searched and the search engine that is used to process queries and display results. However, most
concept search engines work best for certain kinds of queries:
• Effective queries are composed of enough text to adequately convey the intended concepts.
Effective queries may include full sentences, paragraphs, or even entire documents. Queries
composed of just a few words are not as likely to return the most relevant results.
• Effective queries do not include concepts that are not the object of the search.
Including too many unrelated concepts in a query can negatively affect the relevancy of the result
items.
• Effective queries are expressed in a full-text, natural-language style similar to that of the
documents being searched. For example, queries composed of excerpts from an
introductory science textbook would not be as effective for concept searching if the dataset being
searched is made up of advanced, college-level science texts. Substantial queries that better
represent the overall concepts, styles, and language of the items being sought
are generally more effective.
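The points above treat a query as a piece of text that is compared with the documents as a whole, so a query can be a sentence, a paragraph, or an entire document. A minimal way to sketch this is to turn both the query and the documents into word-count vectors and rank the documents by cosine similarity. Note the assumption: this compares surface words only, whereas a concept search engine would use a richer representation of meaning. The two sample documents and the function names are made up for the example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank(query, documents):
    """Return document ids ordered from most to least similar to the query."""
    query_vector = vectorize(query)
    scored = [(cosine(query_vector, vectorize(text)), doc_id)
              for doc_id, text in documents.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

# Hypothetical two-document collection.
documents = {
    "geology": "volcanic rock forms when magma cools and the rock hardens",
    "music": "the rock band played loud guitar music at the concert",
}

# A sentence-length query supplies enough context to prefer the right document.
print(rank("loud guitar music played at a rock concert", documents))  # ['music', 'geology']

# An entire document can be used as the query to find similar items.
print(rank(documents["geology"], documents))  # ['geology', 'music']
```

The longer the query, the more terms it shares with the documents that are truly about the same subject, which is why very short queries tend to rank less reliably.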
Guidelines for Evaluating a Concept Search Engine

1. Result items should be relevant to the information need expressed by the concepts contained
in the query statements, even if the terminology used by the result items is different from the
terminology used in the query.
2. Result items should be sorted and ranked by relevance.
3. Relevant result items should be quickly located and displayed. Even complex queries should
return relevant results fairly quickly.
4. Query length should be non-fixed, i.e., a query can be as long as deemed necessary. A
sentence, a paragraph, or even an entire document can be submitted as a query.
5. A concept query should not require any special or complex syntax. The concepts contained in
the query can be clearly and prominently expressed without using any special rules.
6. Combined queries using concepts, keywords, and metadata should be allowed.
7. Relevant portions of result items should be usable as query text simply by selecting the item
and telling the search engine to find similar items.
8. Query-ready indexes should be created relatively quickly.
9. The search engine should be capable of performing federated searches. Federated searching
enables a concept query to search multiple data sources simultaneously; the results are
then merged, sorted, and displayed together.
10. A concept search should not be affected by misspelled words, typographical errors, or OCR
scanning errors in either the query text or in the text of the dataset being searched.
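The federated search in guideline 9 can be sketched as follows: the same query is sent to several sources at once, and the ranked result lists are merged into one list ordered by score. The two source functions are hypothetical stubs that return canned results, and the sketch assumes that scores from different sources are directly comparable, which in practice usually requires some normalization.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def federated_search(query, sources, limit=5):
    """Query every source concurrently and merge the results by descending score."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda source: source(query), sources))
    # Each list is already sorted by descending score, so a k-way merge suffices.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    return list(merged)[:limit]

def search_mail_archive(query):
    # Hypothetical source; returns canned (score, item) pairs, best first.
    return [(0.9, "mail: contract renewal"), (0.4, "mail: lunch plans")]

def search_file_share(query):
    # Hypothetical source; returns canned (score, item) pairs, best first.
    return [(0.7, "file: contract_draft.docx"), (0.2, "file: notes.txt")]

results = federated_search("contract", [search_mail_archive, search_file_share], limit=3)
for score, item in results:
    print(score, item)
```

The merged list interleaves the two sources: the mail item with score 0.9 comes first, then the file with 0.7, then the mail item with 0.4.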
