
Summary of a Search Engine

Abstract
Today the web has grown enormously, and it keeps growing: hundreds of new websites are added to the Internet every day. These new sites carry new information and new elements that an individual may need, but it is not feasible to find any particular piece of information by hand when each site may average 100 pages. This is where search comes in. Given this growth, search must be efficient, and the memory used to store the supporting data must also be kept small. Here comes the choice of a good search engine, so we will deal with how and what makes one search engine different from another.

Keywords:

1. Page Ranking
2. Probability
3. Anchor Text
4. Vector Space Model
5. Doc. Indexing
6. Metadata
7. Lexicon
8. Hit List
9. Barrels
10. Parsing

1) Page Ranking: As a search engine grows, the same information may be found on several web pages, so which page should come first, and which next? It is done in the following fashion. Let T1, ..., Tn be the pages that link to a page A, and let C(T) be the number of links going out of page T. We also have a damping factor d (0 < d < 1), typically d = 0.85. Then the PageRank of A is

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
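A minimal sketch of how this recurrence can be computed by iterating until the values settle; the toy link graph and the iteration count below are assumptions for illustration, not part of the original system:

```python
# Minimal PageRank iteration sketch (illustrative toy graph).
d = 0.85  # damping factor, as above

# links[page] = list of pages that `page` links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

pr = {page: 1.0 for page in links}  # initial ranks

for _ in range(50):  # iterate until the values settle
    new_pr = {}
    for page in links:
        # sum PR(T)/C(T) over every page T that links to `page`
        incoming = sum(pr[t] / len(links[t])
                       for t in links if page in links[t])
        new_pr[page] = (1 - d) + d * incoming
    pr = new_pr

print(pr)  # C accumulates the most rank: it is linked from both A and B
```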

2) Probability: A user may get bored and pick a random page instead of following links, and this behaviour can be handled with probability. Suppose a page has n links available for traversal and the user has already followed 10 of them in order; then the probability of selecting one particular link from the remaining ones is 1/(n - 10). (Mine)
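This "bored user" is the random-surfer intuition behind the damping factor d: with probability d the surfer follows a link on the current page, and with probability 1 - d jumps to a random page. A small simulation sketch, with the toy graph and step count as assumptions:

```python
import random

d = 0.85  # probability of following a link rather than jumping at random
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # toy graph (assumption)
pages = list(links)

visits = {p: 0 for p in pages}
current = "A"
for _ in range(100_000):
    visits[current] += 1
    if random.random() < d and links[current]:
        current = random.choice(links[current])  # follow a link
    else:
        current = random.choice(pages)           # bored: jump anywhere

# Visit frequencies approximate the relative PageRank of each page.
print({p: visits[p] / 100_000 for p in pages})
```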

3) Anchor Text: The text of a link is treated in a special way: it is associated with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.
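A sketch of the idea: when a page is parsed, each anchor's text is indexed under the *target* URL, not the page it appears on. The function name and data layout here are assumptions for illustration:

```python
from collections import defaultdict

# anchor_index[target_url] = list of anchor texts pointing at that URL
anchor_index = defaultdict(list)

def record_anchor(source_url, target_url, anchor_text):
    # The anchor text describes the *target*, so index it there,
    # even if target_url has never been crawled.
    anchor_index[target_url].append(anchor_text)

record_anchor("http://a.example", "http://b.example/logo.png", "company logo")
# An image can now match the query "company logo" although its bytes
# contain no indexable text.
print(anchor_index["http://b.example/logo.png"])
```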

4) Vector Space Model: It is a procedure divided into 3 stages:

1. Document indexing
2. Weighting of the indexed terms
3. Ranking of documents with respect to the query

(A sketch covering all three stages follows item 5 below.)

5) Doc. Indexing: Remove non-significant words, i.e. stop words such as "the" and "an".
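A minimal sketch of the three vector-space stages, assuming plain TF-IDF weighting and cosine ranking (the standard choices, though the text above does not name them); the tiny corpus and stop-word list are illustrative assumptions:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "an", "a", "of", "for", "is"}  # assumed stop-word list

docs = {  # toy corpus (assumption)
    1: "the anatomy of a search engine",
    2: "an engine for crawling the web",
}

# Stage 1: document indexing -- tokenize and drop non-significant words
index = {d: [w for w in text.split() if w not in STOP_WORDS]
         for d, text in docs.items()}

# Stage 2: weight the indexed terms with TF-IDF
def tfidf(words):
    tf = Counter(words)
    def df(w):  # number of documents containing w (guarded against zero)
        return max(1, sum(w in ws for ws in index.values()))
    return {w: tf[w] * math.log(len(docs) / df(w)) for w in tf}

weights = {d: tfidf(ws) for d, ws in index.items()}

# Stage 3: rank documents against a query by cosine similarity
def rank(query):
    q = tfidf([w for w in query.split() if w not in STOP_WORDS])
    def cosine(dv):
        dot = sum(q.get(w, 0) * x for w, x in dv.items())
        norm = math.sqrt(sum(x * x for x in q.values())) * \
               math.sqrt(sum(x * x for x in dv.values()))
        return dot / norm if norm else 0.0
    return sorted(docs, key=lambda d: cosine(weights[d]), reverse=True)

print(rank("search engine"))  # doc 1 should rank first
```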

6) Metadata: Data about data.

<META HTTP-EQUIV = string CONTENT = string (image, media, etc.)>

The CONTENT attribute carries the keywords. Metadata sometimes gives fake information about a site or page.

7) Lexicon: The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.
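A sketch of that two-part layout, built in Python purely for illustration (the real system works at the byte level in C):

```python
# Two-part lexicon: one null-separated string of all words,
# plus a hash table mapping each word to its offset in that string.
words = ["anatomy", "engine", "search"]

word_list = bytearray()
offsets = {}  # hash table of pointers (here: byte offsets)
for w in words:
    offsets[w] = len(word_list)
    word_list += w.encode("ascii") + b"\0"

def lookup(word):
    off = offsets[word]                # O(1) hash probe
    end = word_list.index(b"\0", off)  # scan to the null terminator
    return off, word_list[off:end].decode()

print(lookup("engine"))  # (8, 'engine')
```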

8) Hit List: A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding.
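A sketch of what a hand-optimized compact encoding can look like. The field widths below (1 capitalization bit, 3 bits of font size, 12 bits of position, 16 bits per hit) follow the plain-hit layout described in the Brin and Page paper, but treat them as an illustration rather than a spec:

```python
# Pack one hit into 16 bits: [cap:1][font:3][position:12]
def encode_hit(capitalized, font_size, position):
    assert font_size < 8 and position < 4096
    return (capitalized << 15) | (font_size << 12) | position

def decode_hit(h):
    return (h >> 15) & 0x1, (h >> 12) & 0x7, h & 0xFFF

h = encode_hit(capitalized=1, font_size=3, position=1042)
print(decode_hit(h))  # (1, 3, 1042)
```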

9) Barrels: Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index.

10) Parsing: Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very robust involved a fair amount of work.
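The key design point is tokenizing rather than parsing: a lexer never rejects its input, it just keeps emitting tokens. A rough Python analogue of that attitude (the original uses flex and C; this regex-based tokenizer is only a hedged illustration of "never fail on malformed HTML"):

```python
import re

# Anything that looks like a tag is a TAG token; everything else is TEXT.
# Malformed markup never raises -- it just falls through as TEXT.
TOKEN = re.compile(rb"<[^<>\x00]{0,512}>|[^<\x00]+|.", re.DOTALL)

def tokens(raw_bytes):
    for m in TOKEN.finditer(raw_bytes):
        tok = m.group()
        yield ("TAG" if tok.startswith(b"<") and tok.endswith(b">") else "TEXT",
               tok)

# Kilobytes of zeros, unclosed tags, non-ASCII bytes: nothing breaks.
for kind, tok in tokens(b"<b>bold\x00\x00<broken <i>caf\xe9"):
    print(kind, tok)
```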

Note: I am more and more interested in the lexical analyzer (which I learned about in a course named "Compilers"). I will give a detailed description and analysis, as far as I know the subject, in another post.

11) Forward Index: The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordIDs. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordIDs with hit lists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs, but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter.
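A sketch of what one barrel record can look like on disk under this scheme; the struct layout (32-bit docID, 16-bit counts, 16-bit encoded hits, little-endian) is an assumption for illustration, not the paper's actual byte format:

```python
import struct

def write_record(buf, doc_id, word_hits):
    # One forward-index record: docID, then (wordID, nhits, hits...) per word.
    buf += struct.pack("<I", doc_id)
    buf += struct.pack("<H", len(word_hits))
    for word_id, hits in word_hits:
        buf += struct.pack("<IH", word_id, len(hits))
        buf += struct.pack(f"<{len(hits)}H", *hits)

barrel_0 = bytearray()
# doc 7 contains wordIDs 42 and 51 (both in barrel 0's range), with encoded hits
write_record(barrel_0, 7, [(42, [0x8412, 0x8413]), (51, [0x0001])])
print(len(barrel_0), "bytes")  # 4 + 2 + (6 + 4) + (6 + 2) = 24 bytes
```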

12) Backward Index: The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docIDs together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.
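A sketch of what the sorter does: re-sort each barrel's (docID, wordID, hit) records by wordID and group them into doclists. The in-memory representation here is an assumption; the real sorter works over on-disk barrels:

```python
from collections import defaultdict

# Forward-index records from one barrel: (doc_id, word_id, encoded_hit)
forward = [(7, 42, 0x8412), (7, 51, 0x0001), (9, 42, 0x0003)]

# Sort by wordID (then docID) and group into doclists.
inverted = defaultdict(lambda: defaultdict(list))
for doc_id, word_id, hit in sorted(forward, key=lambda r: (r[1], r[0])):
    inverted[word_id][doc_id].append(hit)

# The lexicon entry for wordID 42 would point at this doclist:
print(dict(inverted[42]))  # every document containing word 42, with its hits
```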

13) Crawler: Running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

Note: I will give a more detailed description later.

Total Steps of Running a Search Engine (answering a query):

1. Parse the query.

2. Convert words into wordIDs.

3. Seek to the start of the doclist in the short barrel for every word.

4. Scan through the doclists until there is a document that matches all the
search terms.

5. Compute the rank of that document for the query.

6. If we are in the short barrels and at the end of any doclist, seek to the
start of the doclist in the full barrel for every word and go to step 4.

7. If we are not at the end of any doclist, go to step 4.

8. Sort the documents that have matched by rank and return the top k.
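A sketch of steps 1 through 8 as code, reusing the inverted-index shape from the earlier sketches. The short/full barrel distinction is simplified to two dictionaries, and the rank function is a stand-in, so treat this as an outline of the control flow rather than the real implementation:

```python
def search(query, lexicon, short_barrels, full_barrels, k=10):
    # 1-2. Parse the query and convert words into wordIDs.
    word_ids = [lexicon[w] for w in query.lower().split() if w in lexicon]
    if not word_ids:
        return []

    results = []
    for barrels in (short_barrels, full_barrels):   # 6. fall back to full barrels
        # 3. Seek to the start of the doclist for every word.
        doclists = [barrels.get(wid, {}) for wid in word_ids]
        # 4. Find documents that match all the search terms.
        matching = set.intersection(*(set(dl) for dl in doclists))
        # 5. Compute the rank of each matching document for the query.
        results = [(rank(doc, doclists), doc) for doc in matching]
        if results:        # 7. enough matches in the short barrels: stop
            break
    # 8. Sort the matched documents by rank and return the top k.
    return [doc for _, doc in sorted(results, reverse=True)][:k]

def rank(doc, doclists):
    # Stand-in ranking: total number of hits for the query words (assumption).
    return sum(len(dl.get(doc, [])) for dl in doclists)

lexicon = {"search": 42, "engine": 51}
short_barrels = {42: {7: [0x8412]}, 51: {7: [0x0001], 9: [0x0003]}}
print(search("Search Engine", lexicon, short_barrels, full_barrels={}))
# -> [7]: the only document containing both words
```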
