Unit 1: Search Engine Optimisation
Structure
1.1 Introduction to Search Engines
Summary
Key Words
Self-Assessment Questions
Answers to Check your Progress
Suggested Reading
Objectives
After going through this unit, you will be able to:
1.1 Introduction to Search Engines
A search engine is an information-retrieval software program that discovers, crawls, transforms, and stores information for retrieval and presentation in response to queries given by the user.
In general, a search engine consists of four parts: a search interface, a crawler, an indexer, and a database. The crawler traverses a collection of documents, deconstructs the text of each document, and assigns surrogates for storage in the search engine index. Online search engines also store images, and they link data and metadata for each document.
A web search engine is a kind of website that enables a user to find information on the Internet. It accomplishes this by looking through other web pages for the text the user wants to find. Rather than the user having to visit each web page in turn, the web browser together with a search engine accomplishes this task.
To use a search engine, the user must enter at least one keyword in the search box. In general, an on-screen button is then clicked to submit the search query. The search engine then looks for matches between the entered keyword(s) and its database of websites and words.
As soon as a search is submitted, the results appear on the screen. The web page that shows the results is called a search engine results page (SERP). The SERP is a list of web pages that match the keywords that were searched.
The SERP typically displays the name of each matching web page, a short description, and a hyperlink to it. By clicking one of the links, the user can navigate to the corresponding website.
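The shape of a single SERP entry can be pictured as a small data structure, sketched below in Python; the field names are illustrative assumptions, not any particular engine's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SerpEntry:
    """One result on a search engine results page (SERP)."""
    title: str        # name of the matching web page
    description: str  # short snippet summarising the page
    url: str          # hyperlink the user clicks to visit the page

# A SERP is simply an ordered list of such entries,
# ranked from most relevant to least relevant.
results = [
    SerpEntry("Example Domain",
              "This domain is for use in illustrative examples.",
              "https://example.com"),
]
```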
Search engines can be considered, to some extent, the most advanced websites on the web. They use specialised computer code to rank web pages on SERPs; in general, the most popular or highest-quality web pages appear near the top of the list.
When a user types words into a search engine, it looks for web pages containing those words. There may be thousands, or even millions, of such pages. Search engines therefore assist users by ranking the pages, placing the ones they judge the user most likely to want first.
Before September 1993, the World Wide Web (WWW) was indexed entirely by hand. Tim Berners-Lee edited a list of web servers, hosted on the CERN web server.
The Archie program downloaded the directory listings of all files located on public anonymous FTP (File Transfer Protocol) sites, creating a database searchable by file name. Archie did not index the contents of these sites, since the amount of data was so limited that it could be readily searched manually.
Gopher, created in 1991 by Mark McCahill at the University of Minnesota, led to two new popular search programs: Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) and Jughead (Jonzy's Universal Gopher Hierarchy Excavation and Display). Both Veronica and Jughead searched the file names and titles stored in Gopher index systems. Veronica provided a keyword search of most Gopher menu titles across the entire Gopher listings, while Jughead was a tool for obtaining menu information from specific Gopher servers.
In 1993, there was still no search engine for the web, although several specialised catalogues were maintained by hand. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that periodically mirrored these pages and rewrote them into a standard format. This formed the basis for W3Catalog, the web's first primitive search engine, released on September 2, 1993.
In June 1993, Matthew Gray, then at MIT, produced the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called 'Wandex'. The purpose of the Wanderer was to measure the size of the World Wide Web. In November 1993, Aliweb, the web's second search engine, appeared. Aliweb did not use a web robot; instead, site administrators were asked to maintain, at each site, an index file in a particular format.
NCSA's Mosaic™ was not the first web browser, but it was the first to make a major splash. In November 1993, Mosaic version 1.0 introduced a variety of features such as bookmarks, icons, pictures, and a more eye-catching interface, all of which made the software easy and attractive to use.
JumpStation, created in December 1993 by Jonathon Fletcher, used a web robot to locate web pages and build its index, and used a web form as the interface to its query program. It was thus the first resource-discovery tool of the World Wide Web to combine the three essential features of a web search engine: crawling, indexing, and searching.
WebCrawler, which appeared in 1994, was one of the first crawler-based "all text" search engines. Unlike its predecessors, it let users search for any word on any web page, which has been the standard for all major search engines ever since. It was also the first search engine widely known to the public. Also in 1994, Lycos, which began at Carnegie Mellon University, was launched and became a major commercial venture. Many search engines appeared soon afterwards and gained popularity, among them Excite, Magellan, Infoseek, and Yahoo!. Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory rather than on full-text copies of web pages; instead of performing keyword-based searches, information seekers browsed the directory.
In 1996, Netscape was looking to give a single search engine an exclusive deal as the featured search engine in the Netscape web browser. There was so much interest that Netscape instead struck deals with five of the major search engines: for $5 million a year, each would appear in rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.
Google adopted the idea of selling search terms in 1998 from goto.com, a small search engine company. This move had a momentous effect on the search engine business, which became one of the most profitable businesses on the Internet.
A number of companies entered the market spectacularly, recording record gains during their initial public offerings. Some took down their public search engines and marketed enterprise-only editions, such as Northern Light. Many search engine companies were caught up in the dot-com bubble, a speculation-driven market boom that peaked in 1999 and ended in 2001.
Around 2000, Google's search engine rose to prominence. The company achieved better results for many searches with an innovation called PageRank, an iterative algorithm that ranks web pages based on the number and PageRank of the websites and pages that link to them, on the premise that good or desirable pages are linked to more than others. Google also maintained a minimalist interface to its search engine, whereas many of its competitors embedded their search engines in web portals. As a result, the Google search engine became enormously popular.
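To make the PageRank idea concrete, here is a minimal sketch of such an iterative rank computation in Python; the damping factor of 0.85 and the toy three-page graph are illustrative assumptions, not Google's actual parameters.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a link graph.

    links maps each page to the list of pages it links to.
    A page's rank is shared equally among the pages it links to,
    so pages that attract many links from high-rank pages rank higher.
    """
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}          # start from a uniform rank
    for _ in range(iterations):
        new_ranks = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling page: spread rank evenly
                for p in pages:
                    new_ranks[p] += damping * ranks[page] / n
            else:
                share = damping * ranks[page] / len(outlinks)
                for target in outlinks:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Toy graph: A and C both link to B, so B ends up with the highest rank.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["B"]}))
```

Repeating the update until the scores settle is what makes the algorithm iterative: each page's rank depends on the ranks of its linkers, which in turn depend on their own linkers.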
By 2000, Yahoo! was providing search services based on Inktomi's search engine. Yahoo! later switched to Google's search engine and used it until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.
Microsoft first launched MSN Search in the fall of 1998, using search results from Inktomi. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler, called msnbot.
Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalised a deal under which Yahoo! Search would be powered by Microsoft Bing technology.
Crawler-based search engines use a crawler (also called a bot or spider) to crawl and index new content into the search database. There are four basic steps that every crawler-based search engine follows before showing any site in the search results: crawling, indexing, calculating relevancy, and retrieving the results.
a) Crawling
Search engines crawl the whole web to fetch the available web pages. A piece of software called a crawler (or bot, or spider) performs the crawling. The crawling frequency depends on the search engine, and a few days may pass between crawls. This is why search results sometimes show old or deleted page content; the results will display the latest content once the search engine crawls the site again.
b) Indexing
Indexing is the process of storing the crawled content in the search engine's database, its index. Essentially, the engine identifies the words and expressions that best describe each page and associates the page with those keywords, so that the page can be retrieved quickly when a matching query arrives.
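As a minimal sketch of such an index, the snippet below builds an inverted index mapping each word to the set of pages that contain it; the naive whitespace tokenisation and the sample pages are illustrative assumptions.

```python
from collections import defaultdict

def build_index(pages):
    """Build an inverted index: each word maps to the set of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "a.html": "search engines crawl and index pages",
    "b.html": "an index maps words to pages",
}
index = build_index(pages)
print(index["index"])   # {'a.html', 'b.html'}
```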
c) Calculating Relevancy
The search engine compares the search string in the search request with the pages indexed in its database. Since more than one page is likely to contain the search string, the search engine computes the relevancy of each page in its index to the search string. Numerous algorithms are available for determining relevancy, and each assigns different relative weights to general factors such as keyword density, links, or meta tags. This is why different search engines return different results pages for the identical search string. It is also well known that every major search engine periodically changes its algorithms.
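A toy illustration of such a relevancy computation is given below, using keyword density alone; the scoring rule is an illustrative assumption, since real engines combine many weighted factors and keep the details private.

```python
def relevancy(page_text, query):
    """Score a page against a query using simple keyword density.

    Keyword density is just one of the general factors mentioned above;
    a real engine would also weight links, meta tags, and other signals.
    """
    words = page_text.lower().split()
    if not words:
        return 0.0
    query_terms = query.lower().split()
    hits = sum(1 for w in words if w in query_terms)
    return hits / len(words)   # fraction of page words matching the query

pages = {
    "a.html": "search engines rank pages by relevancy",
    "b.html": "cooking recipes for pasta and pizza",
}
ranked = sorted(pages, key=lambda u: relevancy(pages[u], "search relevancy"),
                reverse=True)
print(ranked)   # a.html scores higher than b.html for this query
```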
d) Retrieving Results
The last step is retrieving the results, which is essentially displaying them in the browser, sorted from the most relevant to the least relevant.

Human-powered directories, also called open directory systems, depend on human activity for their listings. Indexing in human-powered directories works as follows:
The owner of a site submits a short description of the site to the directory, along with the category under which it is to be listed.
The submitted site is manually evaluated and then either added to the appropriate category or rejected for listing.
Keywords entered in the search box are matched against the descriptions of the sites. Changes made to the content of a web page are therefore not taken into account; it is only the description that matters.
In general, a site with good-quality content is more likely to be reviewed for free than a site with poor content. Examples of human-powered directories include DMOZ and the Yahoo! Directory.
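Since only the submitted description is ever matched, directory search can be pictured as a simple lookup over descriptions, as in the sketch below; the sample entries are invented for illustration.

```python
def directory_search(directory, query):
    """Match query keywords against site descriptions only.

    Changes to a site's actual pages are invisible here; only the
    short description submitted by the owner is ever searched.
    """
    terms = query.lower().split()
    return [url for url, description in directory.items()
            if any(t in description.lower() for t in terms)]

directory = {
    "example.com": "Free online photo editing tools",
    "recipes.org": "Home cooking recipes and tips",
}
print(directory_search(directory, "photo editing"))   # ['example.com']
```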
Hybrid search engines use both crawler-based and manual indexing for listing web sites in search results. Most crawler-based search engines, like Google, use crawlers as the primary mechanism and human-powered directories as a secondary mechanism. As human-powered directories are becoming extinct, hybrid search engines are becoming more and more purely crawler-based. However, manual filtering of search results still takes place to remove copied and spam websites. When a website is identified as spam, it is the duty of the website's owner to take the necessary corrective action and resubmit the website to the search engines. Specialists then manually evaluate the resubmitted website before including it in the search results again. In this way, even though crawlers manage the processes, the control remains manual, to monitor and present the search results as expected.
Search engines may also be classified into a variety of other types depending on their use. Some search engines maintain different types of bots exclusively to show videos, images, news, products, and local listings. One example is the Google News page, which can be used to search only for news from different newspapers.
Search engines like Dogpile collect meta-information about web pages from other search engines and directories to display in their search results; they are therefore known as metasearch engines.
Swoogle is a semantic search engine, which tries to present precise search results in a particular area by taking into account the contextual relevance of search queries.
Check your Progress 1
Fill in the blanks.
1. The ______________ are also called open directory systems, which depend on human activity for their listings.
2. The ______________ use both crawler-based and manual indexing for listing web sites in search results.
Search engines specifically designed to search web pages, images, and documents were developed to facilitate searching through a large, nebulous collection of unstructured resources. They are engineered to follow a multi-stage process: crawling the endless accumulation of pages and documents to extract the significant words and phrases from their contents; indexing those words and phrases in a semi-structured form; and finally resolving user entries or queries to return the most appropriate results, with links to the scanned pages or documents in the inventory.
b) Crawl
In fully textual search, the first step in classifying web pages is to find an 'index item' that relates explicitly to the 'search term'. In the past, search engines started with a small list of Uniform Resource Locators (URLs), called a seed list, fetched the content of those pages, and parsed the links on those pages for relevant information, which in turn supplied new links. The process was highly cyclical and continued until enough pages had been found for the searcher's use. Nowadays, a continuous crawl method is employed, in contrast to incidental discovery based on a seed list. The crawl method is an extension of the aforesaid discovery method, except that it has no seed list, because the system never stops crawling.
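A minimal sketch of seed-list crawling is shown below, assuming Python's standard-library urllib and a naive regular-expression link extractor; a production crawler would add politeness delays, robots.txt handling, and large-scale deduplication.

```python
import re
from collections import deque
from urllib.request import urlopen

def crawl(seed_urls, max_pages=10):
    """Breadth-first crawl starting from a seed list of URLs.

    Fetch a page, parse out its links, and queue the new links;
    the process repeats until enough pages have been collected.
    """
    frontier = deque(seed_urls)      # URLs waiting to be fetched
    seen = set(seed_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                 # skip unreachable pages
        pages[url] = html
        # Naive link extraction; a real crawler uses a proper HTML parser.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

pages = crawl(["https://example.com"])
print(len(pages), "pages fetched")
```

The continuous crawl method described above amounts to never letting the frontier drain: newly discovered links keep feeding the queue instead of the process stopping at a fixed page count.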
c) Link map
The pages discovered by web crawls are regularly distributed and fed into another computer that creates a complete map of the resources uncovered. The resulting cluster looks like a graph, in which the different pages are represented as small nodes connected by the links between the pages. The mass of data is stored in numerous data structures that allow quick access by certain algorithms, which compute the popularity score of a page based on how many links point to it. Search engines often distinguish between internal links and external links when doing so. Link map data structures usually also store the anchor text embedded in the links, since anchor text can often provide a very good summary of the content of the page it links to.
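One way to picture such a link map is the small structure below, which records each page's inbound links together with their anchor text; the layout and the reputation measure (a plain inbound-link count) are illustrative assumptions.

```python
from collections import defaultdict

class LinkMap:
    """Record which pages link to which, with the anchor text of each link."""

    def __init__(self):
        self.inbound = defaultdict(list)   # target -> [(source, anchor_text)]

    def add_link(self, source, target, anchor_text):
        self.inbound[target].append((source, anchor_text))

    def reputation(self, page):
        """Crude popularity score: the number of links pointing at the page."""
        return len(self.inbound[page])

    def anchors(self, page):
        """Anchor texts often summarise the target page's content well."""
        return [text for _, text in self.inbound[page]]

links = LinkMap()
links.add_link("a.html", "b.html", "search engine basics")
links.add_link("c.html", "b.html", "how engines rank pages")
print(links.reputation("b.html"))   # 2
print(links.anchors("b.html"))
```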
At times, a data search includes both database content and web pages or documents, and search engine technology has developed to respond to both sets of requirements. Most of the mixed search engines are large web search engines, like Google, that search through both structured and unstructured data sources. Documents are crawled and indexed in a separate index, and databases are likewise indexed from a variety of sources. Search results are then generated for users by querying these multiple indices in parallel and combining the results according to "rules."
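The parallel lookup and rule-based merge can be sketched as follows; the two stand-in index functions and the interleaving merge rule are illustrative assumptions rather than any engine's actual behaviour.

```python
from concurrent.futures import ThreadPoolExecutor

def federated_search(query, indices):
    """Query several indices in parallel and merge their results.

    indices is a list of functions, each taking a query string and
    returning a ranked list of result identifiers.
    """
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda idx: idx(query), indices))
    # Merge rule (an assumption): interleave the ranked lists, skipping duplicates.
    merged, seen = [], set()
    for rank in range(max(len(r) for r in result_lists)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

web_index = lambda q: ["w1", "w2"]        # stand-in for the crawled-document index
db_index = lambda q: ["d1", "w2", "d2"]   # stand-in for the structured-database index
print(federated_search("example", [web_index, db_index]))
# ['w1', 'd1', 'w2', 'd2']
```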
Check your Progress 2
State whether the following statement is true or false.
1. The process of crawling is not necessary for a database, because the data is already structured.
Summary
Swoogle is a semantic search engine, which tries to present precise search results in a particular area by taking into account the contextual relevance of search queries.
A data search may include both database content and web pages or documents.
Key Words
URL (Uniform Resource Locator): In the past, search engines started with a small list of URLs called a seed list. The seed list was used to fetch page content and to parse the links on those pages for relevant information, which in turn supplied new links.
Self-Assessment Questions
1. Explain the categories of Search Engines.
Answers to Check your Progress
Check your Progress 1
Fill in the blanks.
1. Human-powered directories are also called open directory systems, which depend on human activity for their listings.
2. Hybrid search engines use both crawler-based and manual indexing for listing web sites in search results.
Check your Progress 2
1. True
Suggested Reading
1. Peter Kent, SEO for Dummies, 6th Edition, John Wiley & Sons.
2. Jason McDonald, SEO Toolbook: 2018 Directory of Free Search Engine Optimization
Tools, Kindle Edition.