Summary of A Search Engine
Abstract
The web grows larger every day: hundreds of new websites are added to the Internet daily, and each of these sites, often averaging around 100 pages, carries new information and new elements that an individual may need. It is not practical to find any particular piece of information by browsing these sites directly, and this is where search becomes necessary. From this perspective, search must be efficient, and the memory required to store the supporting information must also be kept within reasonable bounds. This is where the choice of a good search engine matters. In this summary we deal with how and why one search engine differs from another.
Keywords:
1. Page Ranking
2. Probability
3. Anchor Text
5. Doc. Indexing
6. Metadata
7. Lexicon
8. Hitlist
9. Barrels
10. Parsing
7. Lexicon: The lexicon has several different forms. One important change
from earlier systems is that the lexicon can fit in memory for a reasonable
price. In the current implementation we can keep the lexicon in memory on a
machine with 256 MB of main memory. The current lexicon contains 14
million words (though some rare words were not added to the lexicon). It is
implemented in two parts -- a list of the words (concatenated together but
separated by nulls) and a hash table of pointers. For various functions, the
list of words has some auxiliary information which is beyond the scope of this
paper to explain fully.
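The two-part layout described above can be sketched roughly as follows. This is only an illustrative sketch in Python; the real system is not written this way, and the class and field names are assumptions.

# A minimal sketch of the two-part lexicon: one long buffer of
# null-separated words plus a hash table of pointers (offsets).
# Names and structure are illustrative assumptions, not the paper's code.
class Lexicon:
    def __init__(self, words):
        self.buffer = bytearray()   # concatenated words, separated by nulls
        self.offsets = {}           # word -> byte offset into the buffer
        for word in words:
            self.offsets[word] = len(self.buffer)
            self.buffer += word.encode("utf-8") + b"\x00"

    def lookup(self, word):
        """Return the offset of `word`, or None if it is not in the lexicon."""
        return self.offsets.get(word)

# Example: a tiny lexicon of three words.
lex = Lexicon(["search", "engine", "index"])
print(lex.lookup("engine"))   # offset of "engine" in the shared buffer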
10. Parsing: Any parser which is designed to run on the entire Web must
handle a huge array of possible errors. These range from typos in HTML tags
to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags
nested hundreds deep, and a great variety of other errors that challenge
anyone's imagination to come up with equally creative ones. For maximum
speed, instead of using YACC to generate a CFG parser, we use flex to
generate a lexical analyzer which we outfit with its own stack. Developing this
parser which runs at a reasonable speed and is very robust involved a fair
amount of work.
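The robustness requirement can be illustrated with a small sketch. The real parser is a flex-generated lexical analyzer in C with its own stack; the Python tokenizer below is only an assumed stand-in showing the same idea: scan tags with a tolerant pattern and bound an explicit stack so malformed or deeply nested markup degrades gracefully instead of aborting.

# A hedged sketch (in Python, not flex/C) of a tolerant tag scanner
# with an explicit, depth-limited stack.
import re

TAG = re.compile(r"<\s*(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>", re.DOTALL)

def tokenize(html, max_depth=100):
    stack, tokens, pos = [], [], 0
    for m in TAG.finditer(html):
        text = html[pos:m.start()].strip()
        if text:
            tokens.append(("text", text))
        closing, name = m.group(1), m.group(2).lower()
        if closing:
            if name in stack:                # pop back to the matching open tag
                while stack and stack.pop() != name:
                    pass
        elif len(stack) < max_depth:         # cap nesting depth defensively
            stack.append(name)
        tokens.append(("endtag" if closing else "starttag", name))
        pos = m.end()
    tail = html[pos:].strip()
    if tail:
        tokens.append(("text", tail))
    return tokens

# Example: survives an unclosed tag and stray null bytes without raising.
print(tokenize("<html><p>hello <b>world</html>\x00\x00"))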
11. Forward Index: The forward index is actually already partially sorted. It
is stored in a number of barrels (we used 64). Each barrel holds a range of
wordID’s. If a document contains words that fall into a particular barrel, the
docID is recorded into the barrel, followed by a list of wordID’s with hitlists
which correspond to those words. This scheme requires slightly more storage
because of duplicated docIDs but the difference is very small for a
reasonable number of buckets and saves considerable time and coding
complexity in the final indexing phase done by the sorter.
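A rough sketch of this barrel layout is given below. The number of barrels matches the description above, but the wordID space and the in-memory representation are assumptions made only for illustration.

# A hedged sketch of the forward index: 64 barrels partitioned by wordID
# range; each barrel records, per document, the docID followed by the
# wordIDs (with their hit lists) that fall into that barrel's range.
NUM_BARRELS = 64
MAX_WORDID = 1 << 24                      # assumed size of the wordID space

def barrel_for(wordid):
    return wordid * NUM_BARRELS // MAX_WORDID

def build_forward_index(documents):
    """documents: dict of docID -> dict of wordID -> hitlist (list of hits)."""
    barrels = [[] for _ in range(NUM_BARRELS)]
    for docid, words in documents.items():
        per_barrel = {}
        for wordid, hits in words.items():
            per_barrel.setdefault(barrel_for(wordid), []).append((wordid, hits))
        # the docID is duplicated once per barrel it appears in, which is
        # the small storage overhead mentioned above
        for b, entries in per_barrel.items():
            barrels[b].append((docid, entries))
    return barrels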
12. Backward Index: The inverted index consists of the same barrels as the
forward index, except that they have been processed by the sorter. For
every valid wordID, the lexicon contains a pointer into the barrel that wordID
falls into. It points to a doclist of docID’s together with their corresponding hit
lists. This doclist represents all the occurrences of that word in all
documents.
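The sorter's job and the lexicon pointer can be sketched as follows; the function names and the dictionary-based representation are assumptions for illustration, not the system's actual on-disk format.

# A hedged sketch of the inverted (backward) index: the sorter turns each
# forward barrel into a mapping from wordID to a doclist of (docID, hitlist)
# pairs, and the lexicon keeps a pointer from each wordID into its barrel.
def invert_barrel(forward_barrel):
    """forward_barrel: list of (docID, [(wordID, hitlist), ...]) entries."""
    doclists = {}
    for docid, entries in forward_barrel:
        for wordid, hits in entries:
            doclists.setdefault(wordid, []).append((docid, hits))
    return doclists                 # wordID -> doclist of (docID, hitlist)

def doclist_for(word, lexicon, inverted_barrels):
    # the lexicon maps a word to its wordID and barrel (assumed structure)
    wordid, barrel = lexicon[word]
    return inverted_barrels[barrel].get(wordid, [])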
Searching: A query is evaluated through the following steps.
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the
search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the
start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
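The core of steps 3-8 can be sketched as a doclist intersection. This sketch assumes a single barrel tier (the short/full barrel distinction is omitted) and uses a placeholder rank function, so it is only a simplified illustration of the loop, not the engine's actual code.

# A hedged sketch of query evaluation: look up each word's doclist,
# intersect the doclists, rank the matching documents, return the top k.
def search(query_words, lexicon, inverted_index, k=10,
           rank=lambda docid, hitlists: sum(len(h) for h in hitlists)):
    wordids = [lexicon[w] for w in query_words if w in lexicon]
    if len(wordids) < len(query_words):
        return []                                   # some query word is unknown
    # each doclist maps docID -> hitlist for one query word
    doclists = [dict(inverted_index.get(w, [])) for w in wordids]
    if not doclists:
        return []
    # a document matches only if it appears in every doclist (step 4)
    matching = set(doclists[0]).intersection(*doclists[1:])
    # rank each matching document from its hitlists (step 5, placeholder)
    scored = [(rank(d, [dl[d] for dl in doclists]), d) for d in matching]
    scored.sort(reverse=True)                       # step 8: sort by rank
    return [d for _, d in scored[:k]]               # return the top k docIDs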