ISR Lab Manual
LABORATORY MANUAL
2019 Course
Course Outcomes:
At the end of the course, a student will be able to:
CO1: Understand the concept of information retrieval and apply clustering in information retrieval.
CO2: Use an appropriate indexing approach for retrieval of text and multimedia data, and evaluate the performance of information retrieval systems.
CO3: Apply appropriate tools in analyzing web information.
CO4: Map the concepts of the subject onto recent developments in the information retrieval field.
Sr. No. | Title of Assignment | PO Mapping | PSO Mapping

GROUP A
1 | Implement the Conflation algorithm to generate the document representative of a text file. | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2
2 | Implement the Single-pass Algorithm for clustering of files (consider 4 to 5 files). | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2
3 | Implement a program for retrieval of documents using inverted files. | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

GROUP B
1 | Implement a program to calculate precision and recall for sample input (Answer set A, Query q1, Relevant documents to query q1: Rq1). | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2

GROUP C
1 | Build a web crawler to pull product information and links from an e-commerce website (Python). | PO1, PO2, PO3, PO4, PO5, PO6, PO9, PO10, PO12 | PSO1, PSO2
Problem Statement:
Implement the Conflation algorithm to generate the document representative of a text file.
Objective:
To study:
1. The various concepts and components of information retrieval.
2. The Conflation algorithm.
3. The role of clustering in information retrieval.
4. Indexing structures for information retrieval.
Outcomes:
At the end of the assignment the students will be able to:
1. Understand the concept of information retrieval and apply clustering in information retrieval.
Scope:
Removal of stop words
Suffix stripping (any five grammar rules)
Frequency of occurrence of keywords (weight calculation)
Theory
Functions:
The major functions that constitute an information retrieval system comprise: acquisition, analysis, representation of information, organisation of the indexes, matching, retrieving, readjustment and feedback.
Document Representative:
Documents in a collection are frequently represented through a set of index terms or keywords. Such keywords might be extracted directly from the text of the document or might be specified by a human subject. Modern computers are making it possible to represent a document by its full set of words. With very large collections, however, even modern computers might have to reduce the set of representative keywords. This can be accomplished through the elimination of stopwords (such as articles and connectives), the use of stemming (which reduces distinct words to their common grammatical root), and the identification of noun groups (which eliminates adjectives, adverbs, and verbs). Further, compression might be employed. These operations are called text operations (or transformations).
The full text is clearly the most complete logical view of a document, but its usage usually implies higher computational costs. A small set of categories (generated by a human specialist) provides the most concise logical view of a document, but its usage might lead to retrieval of poor quality. Several intermediate logical views of a document might be adopted by an information retrieval system, as illustrated in the figure.
Besides adopting any of the intermediate representations, the retrieval system might also recognize the internal structure normally present in a document. This information on the structure of the document might be quite useful and is required by structured text retrieval models. As illustrated in the figure, we view the issue of logically representing a document as a continuum in which the logical view of a document might shift (smoothly) from a full-text representation to a higher-level representation specified by a human subject.
The document representative is one consisting simply of a list of class names (the index terms or keywords).
Conflation Algorithm:
Luhn's ideas
In one of Luhn's early papers he states: "It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance." Words whose frequency of occurrence lies between an upper and a lower cut-off can therefore be used to represent a document.
The removal of high-frequency words, 'stop' words or 'fluff' words, is one way of implementing Luhn's upper cut-off. This is normally done by comparing the input text with a 'stop list' of words which are to be removed. The advantages of the process are not only that non-significant words are removed and will therefore not interfere during retrieval, but also that the size of the total document file can be reduced by between 30 and 50 per cent.
Fig.: A plot of the hyperbolic curve relating word frequency 'f' to words by rank order 'r'.
Output:
A document representative containing the frequently appearing keywords, after stop-word removal and suffix stripping (stemming).
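The scope items above can be combined into a small program. The following Python sketch is only an illustration, not the prescribed solution; the stop list, the five suffix rules, and the file name input.txt are assumptions:

import re
from collections import Counter

# Small placeholder stop list and suffix rules; a real implementation would
# use a fuller stop list and all the required grammar rules.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "it"}
SUFFIX_RULES = ["ing", "ed", "es", "ly", "s"]   # five sample suffix-stripping rules

def stem(word):
    # Strip the first matching suffix (a very rough stand-in for a real stemmer)
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def document_representative(path):
    text = open(path, encoding="utf-8").read().lower()
    words = re.findall(r"[a-z]+", text)                  # tokenize
    terms = [stem(w) for w in words if w not in STOP_WORDS]
    return Counter(terms)                                # term -> frequency (weight)

if __name__ == "__main__":
    for term, freq in document_representative("input.txt").most_common(10):
        print(term, freq)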
Problem Statement:
Implement the Single-pass Algorithm for clustering of files (consider 4 to 5 files).
Objectives:
To study:
1. What is clustering?
2. The single-pass algorithm for clustering.
3. Measures of association.
4. The graphical representation of clustering.
Theory:
Clustering
Clustering can be considered the most important unsupervised learning problem; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A definition of clustering could be "the process of organizing objects into groups whose members are similar in some way". A cluster is therefore a collection of objects which are "similar" between them and are "dissimilar" to the objects belonging to other clusters.
Clustering is the process of grouping the documents which are relevant. It can be shown by a graph with nodes connected if they are relevant to the same request. A basic assumption is that documents relevant to a request are separated from those which are not relevant, i.e. the relevant documents are more like one another than they are like non-relevant ones.
To identify the 4 clusters into which the data can be divided, the similarity criterion is distance: two or more objects belong to the same cluster if they are "close" according to a given distance (in this case geometrical distance). This is called distance-based clustering.
The Goals of Clustering
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how to decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs.
For instance, the user could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).
Clustering Requirements
The main requirements that a clustering algorithm should satisfy are:
1. Scalability;
2. Dealing with different types of attributes;
The algorithm is as follows:
(1) Let h be a threshold value.
(2) Let S be an empty set and d1 be the first document. We generate a new cluster C1 consisting of d1.
(3) When a new document di (i > 1) comes in, calculate the similarity values to all the clusters C.
(4) Let simmax be the highest value and Cdi the most similar cluster. If simmax > h, add di to Cdi and adjust the center of Cdi. Otherwise, we generate a new cluster Cdi that contains only di.
(5) Repeat the process above until no more data comes. In (4) we define simmax = MAX(sim(di, C)). We also define the similarity of a document d and a cluster C whose center is VC as below (called cosine similarity):

sim(d, C) = (d · VC) / (|d| |VC|)
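A minimal Python sketch of this single-pass procedure, assuming each file has already been reduced to a term-frequency dictionary (for example by the conflation program of Assignment 1); the threshold 0.3 and the sample vectors are arbitrary illustrations:

import math

def cosine(d, c):
    # Cosine similarity between two term -> weight dictionaries
    common = set(d) & set(c)
    num = sum(d[t] * c[t] for t in common)
    den = math.sqrt(sum(v * v for v in d.values())) * math.sqrt(sum(v * v for v in c.values()))
    return num / den if den else 0.0

def single_pass(doc_vectors, h=0.3):
    clusters = []                       # each cluster: {"center": dict, "docs": [ids]}
    for doc_id, vec in doc_vectors.items():
        if not clusters:
            clusters.append({"center": dict(vec), "docs": [doc_id]})
            continue
        sims = [cosine(vec, c["center"]) for c in clusters]
        best = max(range(len(sims)), key=lambda i: sims[i])
        if sims[best] > h:
            c = clusters[best]
            c["docs"].append(doc_id)
            # Adjust the centre by adding the document's weights to the centre
            # vector (cosine similarity is scale-invariant, so the sum of member
            # vectors behaves like their mean here).
            for t, w in vec.items():
                c["center"][t] = c["center"].get(t, 0) + w
        else:
            clusters.append({"center": dict(vec), "docs": [doc_id]})
    return clusters

if __name__ == "__main__":
    docs = {"d1": {"cat": 2, "dog": 1}, "d2": {"cat": 1, "dog": 2},
            "d3": {"stock": 3, "market": 1}}
    for i, c in enumerate(single_pass(docs), 1):
        print("Cluster", i, ":", c["docs"])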
Measures of association
1. Simple matching coefficient: |X ∩ Y|
Classification methods
1. Multistate attribute (e.g. colour)
2. Binary state (e.g. keyword)
3. Numerical (e.g. hardness scale or weighted keyword)
4. Probability distribution
Cluster hypothesis
The hypothesis can be simply stated as: closely associated documents tend to be relevant to the same request. This hypothesis is referred to as the cluster hypothesis.
The basic assumption in a retrieval system is that documents relevant to a request are separated from those which are not relevant. Compute the association between all pairs of documents:
a. both of which are relevant to a request, and
b. one of which is relevant and the other is not.
From the graph below, it can be easily understood that all documents are associated. But documents like 2 and 5 are not directly associated, and the same is the case for documents 4 and 5. In this way clusters can be depicted.
Example:
Objects: {1, 2, 3, 4, 5, 6}
Clusters are:
Input: Document representatives (minimum 5 files)
Program Implementation: Code written in C/C++ to implement the single-pass algorithm for clustering, with proper output.
Output: Clusters of documents
Conclusion: Thus, we have implemented the single-pass algorithm for clustering.
Problem Statement:
Implement a program for retrieval of documents using inverted files.
Objectives:
To study indexing, inverted files, and searching for information with the help of an inverted file.
Outcomes:
At the end of the assignment the students should have:
1. Understood the use of indexing in fast retrieval
2. Understood the working of an inverted index
Infrastructure: Desktop/laptop system with Linux or its derivatives.
Software used: Linux/Windows OS/Virtual Machine/iOS; C/C++/Java/Python
Theory:
Indexing
A basic option when searching a text for a query is to scan the text sequentially. Sequential or online text searching involves finding the occurrences of a pattern in a text. Online searching is appropriate when the text is small, and it is the only choice if the text collection is very volatile or the index space overhead cannot be afforded. A second option is to build data structures over the text to speed up the search. It is worthwhile building and maintaining an index when the text collection is large and semi-static.
Inverted Files:
An inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the matching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word, a list of all the text positions where the word appears is stored. The set of all those lists is called the occurrences.
Example
Text:
"This is a text. A text has many words. Words are made from letters."
(character positions of successive words: 1, 6, 9, 11, 17, 19, ...)

Inverted Index:
Vocabulary    Occurrences
letters       60 ...
made          50 ...
many          28 ...
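As an illustrative sketch (not the prescribed program), a simple inverted index recording, for each word, the documents and character positions where it occurs could be built in Python as follows; the file names are placeholders:

import re
from collections import defaultdict

def build_inverted_index(paths):
    # vocabulary term -> list of (document, character position) occurrences
    index = defaultdict(list)
    for path in paths:
        text = open(path, encoding="utf-8").read().lower()
        for match in re.finditer(r"[a-z]+", text):
            index[match.group()].append((path, match.start() + 1))
    return index

def search(index, term):
    # return the occurrence list for a single query term
    return index.get(term.lower(), [])

if __name__ == "__main__":
    idx = build_inverted_index(["doc1.txt", "doc2.txt"])   # hypothetical files
    print(search(idx, "text"))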
Algorithm
B. Viva Questions:
1. What are the vocabulary and the occurrences?
2. How is search carried out on an inverted index?
3. How can a multimedia object be indexed?
4. What are the limitations of an inverted index?
5. What is the concept of signature files?
Problem Statement:
Implement a program to calculate precision and recall for sample input. (Answer set A, Query q1, Relevant documents to query q1: Rq1)
Objectives:
1. To understand precision and recall in information retrieval
2. To study indexing structures for information retrieval.
Outcomes:
At the end of the assignment the students should have:
1. Understood precision and recall in information retrieval.
2. Understood the use of indexing in fast retrieval.
Theory:
Precision and Recall in Information Retrieval
Information Systems can be measured with two metrics: precision and recall. When a
user decides to search for information on a topic, the total database and the results to be
obtained can be divided into 4 categories:
1. Relevant and Retrieved
2. Relevant and Not Retrieved
3. Non-Relevant and Retrieved
4. Non-Relevant and Not Retrieved
Relevant items are those documents that help the user in answering his question. Non-relevant items are items that don't actually provide useful information. For each item there are two possibilities: it can be retrieved or not retrieved by the user's query.
Precision is defined as the ratio of the number of relevant and retrieved documents (the number of items retrieved that are actually useful to the user and match his search need) to the total number of documents retrieved by the query. Recall is defined as the ratio of the number of relevant and retrieved documents to the total number of relevant documents in the collection.
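As a worked example with assumed numbers: suppose the set Rq1 of documents relevant to query q1 contains |Rq1| = 10 documents, the answer set A returned by the system contains |A| = 8 documents, and Ra = A ∩ Rq1 contains |Ra| = 4 documents. Then Precision = |Ra| / |A| = 4 / 8 = 50% and Recall = |Ra| / |Rq1| = 4 / 10 = 40%.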
Precision/recall trade-off
You can increase recall by returning more docs. Recall is a non-decreasing function of
the number of docs retrieved. A system that returns all docs has 100% recall! The
converse is also true (usually): It’s easy to get high precision for very low recall.
Thus, precision and recall have been extensively used to evaluate the retrieval performance of IR systems or algorithms. However, a more careful reflection reveals problems with these two measures. First, the proper estimation of maximum recall for a query requires detailed knowledge of all the documents in the collection. Second, in many situations the use of a single measure could be more appropriate. Third, recall and precision measure the effectiveness over a set of queries processed in batch mode. Fourth, for systems which require a weak ordering, recall and precision might be inadequate.
B. Viva Questions:
Title: Implement a program to calculate precision and recall for sample input. (Answer set A, Query q1, Relevant documents to query q1: Rq1)
Program:
#include <iostream>
#include <string>
#include <sstream>
#include <iomanip>
#include <fstream>
using namespace std;

string left(const string s, const int w)
{ // Left-aligns the input string in a table column of width w
    stringstream ss, spaces;
    int padding = w - s.size(); // count excess room to pad
    for (int i = 0; i < padding; ++i)
        spaces << " ";
    ss << s << spaces.str() << '|'; // format with padding
    return ss.str();
}

string center(const string s, const int w)
{ // Center-aligns the input string in a table column of width w
    stringstream ss, spaces;
    int padding = w - s.size(); // count excess room to pad
    for (int i = 0; i < padding / 2; ++i)
        spaces << " ";
    ss << spaces.str() << s << spaces.str(); // format with padding
    if (padding > 0 && padding % 2 != 0) // if odd padding, add one extra space
        ss << " ";
    ss << '|';
    return ss.str();
}
Output:
| Documents | |Ra| | |A| | Precision (%) | Recall (%) | E-Value |
Problem Statement:
Write a program to calculate the harmonic mean (F-measure) and E-measure for the above example.
Objectives:
Outcomes:
At the end of the assignment the students should have:
1. Understood how to calculate the harmonic mean (F-measure) and E-measure in information retrieval.
2. Understood a method to evaluate the retrieval performance of IR systems.
THEORY:
F-Score (F-Measure)
The F1 Score considers both precision and recall.
It is the harmonic mean (average) of precision and recall.
The F1 Score is highest when there is a balance between precision (p) and recall (r) in the system; conversely, it is not high if one measure is improved at the expense of the other.
For example, if P is 1 and R is 0, the F1 Score is 0.
F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
Information systems can be measured with two metrics: precision and recall, and these have been extensively used to evaluate the retrieval performance of IR systems or algorithms. However, a more careful reflection reveals problems with these two measures, which motivates combining them into a single measure, the harmonic mean:

F(j) = 2 / ( 1/r(j) + 1/P(j) )

where
r(j) is the recall at the j-th position in the ranking,
P(j) is the precision at the j-th position in the ranking, and
F(j) is the harmonic mean at the j-th position in the ranking.
Determining the maximum value of F can be interpreted as an attempt to find the best possible compromise between recall and precision. The function F assumes values in the interval [0, 1]. It is 0 when no relevant documents have been retrieved and is 1 when all ranked documents are relevant. Further, the harmonic mean F assumes a high value only when both recall and precision are high; to maximize F requires finding the best possible compromise between recall and precision.
A related measure is the E-measure, which lets the user specify the relative importance of recall and precision:

E(j) = 1 - (1 + b^2) / ( b^2 / r(j) + 1 / P(j) )

where
r(j) is the recall at the j-th position in the ranking,
P(j) is the precision at the j-th position in the ranking, and
b ≥ 0 is a user-specified parameter reflecting the relative importance of recall and precision.
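A minimal Python sketch of these two formulas (the input precision/recall values and the choice b = 1 are assumptions for illustration):

def f_measure(precision, recall):
    # Harmonic mean of precision and recall; 0 if either is 0
    if precision == 0 or recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def e_measure(precision, recall, b=1.0):
    # E-measure; b weights the relative importance of recall and precision
    # (b = 1 treats them equally, and then E = 1 - F)
    if precision == 0 or recall == 0:
        return 1.0
    return 1 - (1 + b * b) / ((b * b) / recall + 1 / precision)

if __name__ == "__main__":
    p, r = 0.5, 0.4                      # e.g. from the earlier worked example
    print("F =", f_measure(p, r))        # about 0.444
    print("E =", e_measure(p, r, b=1))   # about 0.556, i.e. 1 - F for b = 1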
B. Viva Questions:
Problem Statement:
Implement a program for feature extraction in 2D colour images (any features, such as colour or texture), to extract features from an input image and plot a histogram for the features.
Objective:
1. To study a program for feature extraction in 2D colour images, for features such as colour and texture, and to plot a histogram.
2. The given feature-extraction source code is implemented using the Java or Python language.
3. The input to the program is an image file that is to be modified by the program by changing colour.
Outcomes:
At the end of the assignment the students should have:
1. Understood the feature extraction process and its applications.
2. Applied appropriate tools in analyzing web information.
A. Importing an Image:
Importing an image in Python is easy. The following code will help you import an image in Python (see the sketch after this section).
3. This is done by gray-scaling. Here is how you convert an RGB image to grayscale (also shown in the sketch below).
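A minimal sketch using OpenCV and Matplotlib; the library choice and the file name sample.jpg are assumptions, not part of the prescribed solution:

import cv2
from matplotlib import pyplot as plt

# 1. Import (read) the image; OpenCV loads it as a BGR colour array.
image = cv2.imread("sample.jpg")            # hypothetical input file

# 2. Extract a colour feature: per-channel intensity histograms.
colours = ("b", "g", "r")
for i, col in enumerate(colours):
    hist = cv2.calcHist([image], [i], None, [256], [0, 256])
    plt.plot(hist, color=col, label=col)

# 3. Convert the BGR image to grayscale and add its histogram.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
plt.plot(cv2.calcHist([gray], [0], None, [256], [0, 256]),
         color="k", label="gray")

plt.title("Colour and grayscale histograms")
plt.xlabel("Intensity value")
plt.ylabel("Pixel count")
plt.legend()
plt.show()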
Output:
A. Write short answers to the following questions:
Viva Questions:
Problem Statement:
Build the web crawler to pull product information and links from an e-commerce website. (Python)
Objective: -
Outcomes:
Theory:
Search Engines
A program that searches documents for specified keywords and returns a list of the
documents where the keywords were found is a search engine. Although search engine is
really a general class of programs, the term is often used to specifically describe systems
like Google, Alta Vista and Excite that enable users to search for documents on the World
Wide Web and USENET newsgroups.
Typically, a search engine works by sending out a spider to fetch as many documents as
possible. Another program, called an indexer, then reads these documents and creates an
index based on the words contained in each document. Each search engine uses a proprietary
algorithm to create its indices such that, ideally, only meaningful results are returned for
each query. Search engines are special sites on the Web that are designed to help people find
information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks: they search the Web based on important words; they keep an index of the words they find and where they find them; and they allow users to look for words or combinations of words found in that index.
Fig.1 shows general search engine architecture. Every engine relies on a crawler
module to provide the grist for its operation. Crawlers are small programs that
browse the Web on the search engine's behalf, similar to how a human user would
follow links to reach different pages. The programs are given a starting set of URLs,
whose pages they retrieve from the Web. The crawlers extract URLs appearing in the
retrieved pages, and give this information to the crawler control module. This
module determines what links to visit next, and feeds the links to visit back to the
crawlers. (Some of the functionality of the crawler control module may be
implemented by the crawlers themselves.) The crawlers also pass the retrieved pages
into a page repository. Crawlers continue visiting the Web, until local resources,
such as storage, are exhausted.
The end result of crawling is, in practice, a collection of Web pages (HTML or plain text) at a central location.
In a more traditional IR system, the documents to be indexed are available locally in
a database or file system. Web crawler's first information retrieval system was based
on Salton's vector-space retrieval model. The first system used a simple vector-space retrieval model. In the vector-space model, the queries and documents are represented as vectors in a high-dimensional word space. The component of the vector in a
particular dimension is the significance of the word to the document. For example, if
a particular word is very significant to a document, the component of the vector
along that word's axis would be strong. In this vector space, then, the task of
querying becomes that of determining what document vectors are most similar to the
query vector. Practically speaking, this task amounts to comparing the query vector,
component by component, to all the document vectors that have a word in common
with the query vector. WebCrawler determined a similarity number for each of these
comparisons that formed the basis of the relevance score returned to the user.
Web crawler's first IR system had three pieces: a query processing module, an
inverted full-text index, and a metadata store. The query processing module parses
the searcher's query, looks up the words in the inverted index, forms the result list,
looks up the metadata for each result, and builds the HTML for the result page. The
query processing module used a series of data structures and algorithms to generate
results for a given query. First, this module put the query in a canonical form, and
parsed each space-separated word in the query. If necessary, each word was
converted to its singular form using a modified Porter stemming algorithm and all
words were filtered through the stop list to obtain the final list of words. Finally, the
query processor looked up each word in the dictionary, and ordered the list of words
for optimal query execution. Web crawler's key contribution to distributed systems is
to show that a reliable, scalable, and responsive system can be built using simple components.
The standard is different from, but can be used in conjunction with, Sitemaps, a robot inclusion standard for websites. A robots.txt file on a website will function as a
request that specified robots ignore specified files or directories in their search. This
might be, for example, out of preference for privacy from search engine results, or
the belief that the content of the selected directories might be misleading or irrelevant
to the categorization of the site as a whole, or out of desire that an application only
operates on certain data. A person may not want certain pages indexed. Crawlers
should obey the Robot Exclusion Protocol.
The Robots Exclusion Protocol (REP) is a very simple but powerful mechanism
available to webmasters and SEOs alike. Perhaps it is the simplicity of the file that
means it is often overlooked and is often the cause of one or more critical SEO issues.
To this end, we have attempted to pull together tricks, tips and examples to assist
with the implementation and management of your robots.txt file. As many of the
non-standard REP declarations supported by Google, Yahoo and Bing may change,
we will be providing updates to this in the future. The robots.txt file defines the
Robots Exclusion Protocol (REP) for a website. The file defines directives that
exclude Web robots from directories or files per website host. The robots.txt file
defines crawling directives, not indexing directives. Good Web robots adhere to
directives in your robots.txt file. Bad Web robots may not. Do not rely on the
robots.txt file to protect private or sensitive data from search engines. The robots.txt file itself is publicly readable from the root directory of the website host.
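For illustration, a small robots.txt using commonly supported directives might look like the following; the paths and sitemap URL are placeholders:

User-agent: *
Disallow: /checkout/
Disallow: /private/
Allow: /

User-agent: BadBot
Disallow: /

Sitemap: https://www.example.com/sitemap.xml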
Algorithm
1. Make the user interface.
2. Input the URL of any website.
3. Establish an HTTP connection.
4. Read the HTML page source code.
5. Extract the hyperlinks of the HTML page.
6. Display the list of hyperlinks on the same page.
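Since the assignment asks for Python, the following minimal sketch uses the requests and BeautifulSoup libraries; the start URL and the CSS class names for product data are placeholders that would have to match the target site, and the site's robots.txt should be respected:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl_products(start_url, max_pages=5):
    to_visit, seen = [start_url], set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        page = requests.get(url, timeout=10)
        soup = BeautifulSoup(page.text, "html.parser")
        # Extract product information; the class names below are placeholders.
        for product in soup.find_all(class_="product-item"):
            name = product.find(class_="product-name")
            price = product.find(class_="product-price")
            print(name.get_text(strip=True) if name else "?",
                  price.get_text(strip=True) if price else "?")
        # Collect and queue the links found on the page.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            print("link:", link)
            if link.startswith(start_url):
                to_visit.append(link)

if __name__ == "__main__":
    crawl_products("https://www.example.com/shop")   # placeholder URL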
B. Viva Questions:
Title: Build a web crawler to pull product information and links from an e-commerce website (Python).
Program:
// Note: this sample crawler is written in Java. It relies on a helper class Web
// (defined elsewhere in the lab sources) whose getWeb(String url) method is
// assumed to return the HTML of the given page as a String.
import java.net.*;
import java.io.*;

public class Crawler {
    public static void main(String[] args) throws Exception {
        String urls[] = new String[1000];   // URLs discovered so far
        String url = "https://www.cricbuzz.com/live-cricket-scores/20307/aus-vs-ind-3rd-odi-india-tour-of-australia-2018-19";
        int i = 0, j = 0, tmp = 0, total = 0, MAX = 1000;
        int start = 0, end = 0;
        String webpage = Web.getWeb(url);   // fetch the seed page
        end = webpage.indexOf("<body");     // only scan links inside the <body>
        for (i = total; i < MAX; i++, total++) {
            start = webpage.indexOf("http://", end);
            if (start == -1) {              // no more links on this page:
                start = 0;                  // fetch the next queued URL
                end = 0;
                try {
                    webpage = Web.getWeb(urls[j++]);
                } catch (Exception e) {
                    System.out.println("******************");
                    System.out.println(urls[j - 1]);
                    System.out.println("Exception caught \n" + e);
                }
                /* logic to fetch URLs out of the body of the webpage only */
                end = webpage.indexOf("<body");
                if (end == -1)
Problem Statement:
Write a program to find the live weather report (temperature, wind speed, description, and
weather) of a given city. (Python).
Objective: -
1. To Get Weather Information using Python
2. To evaluate the performance of the IR system and understand user interfaces for
searching.
3. To understand information sharing on the web
Outcomes:
1. Understood and implemented the program to find the live weather report using
python.
Theory:
The OpenWeatherMap (OWM) is a helpful and free way to gather and display weather
information. Because it’s an open-source project, it’s also free to use and modify in
any way. OWM offers a variety of features and is very flexible. Because of these
qualities, it offers a lot of benefits to developers. One of the major benefits of OWM is
that it’s ready to go. Unlike other weather applications, OWM has a web API that’s
ready to use. You don’t have
to install any software or set up a database to get it up and running. This is a great option
for developers who want to get weather reading on a website quickly and efficiently.
It has an API that supports HTML, XML, and JSON endpoints. Current weather information, extended forecasts, and graphical maps can be requested by users. These maps show cloud cover, wind speed, pressure, and precipitation.
Conclusion: Thus, we have successfully implemented a program to find the live weather report
(temperature, wind speed, description, and weather) of a given city using Python.
Code Snippets
Sample Solution:
Python Code:
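A minimal sketch using the OpenWeatherMap current-weather endpoint; the API key shown is a placeholder and must be replaced with a real key:

import requests

API_KEY = "YOUR_API_KEY"          # placeholder; obtain a free key from openweathermap.org
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"

def live_weather(city):
    params = {"q": city, "appid": API_KEY, "units": "metric"}
    data = requests.get(BASE_URL, params=params, timeout=10).json()
    return {
        "city": city,
        "temperature (°C)": data["main"]["temp"],
        "wind speed (m/s)": data["wind"]["speed"],
        "description": data["weather"][0]["description"],
        "weather": data["weather"][0]["main"],
    }

if __name__ == "__main__":
    city = input("Enter city name: ")
    for key, value in live_weather(city).items():
        print(key, ":", value)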
Sample Output:
Problem Statement:
Case study on recommender system for a product / Doctor / Product price / Music
Objective: -
1. To study recommender system
Outcomes:
Theory:
The e-commerce domain has seen enormous growth in online platforms in recent years. Product recommendation is extremely complex: the large number of possible user-product combinations can be overwhelming and makes recommendations extremely difficult to calculate.
The paradigms of machine learning and natural language processing come into the picture in achieving this goal of product recommendation. Through the implementation of these approaches, products can be effectively reviewed and assessed for their potential for recommendation to a particular user, whether the item is a product, a doctor, a product price, or music.
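As an illustrative sketch of one common approach (user-based collaborative filtering with cosine similarity), using made-up rating data:

import math

# Hypothetical user -> {item: rating} data for illustration only.
ratings = {
    "user1": {"itemA": 5, "itemB": 3, "itemC": 4},
    "user2": {"itemA": 4, "itemB": 4},
    "user3": {"itemB": 2, "itemC": 5},
}

def cosine(u, v):
    # Cosine similarity between two item -> rating dictionaries
    common = set(u) & set(v)
    num = sum(u[k] * v[k] for k in common)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(target_user, k=2):
    # Score items the target user has not rated, weighted by neighbour similarity.
    target = ratings[target_user]
    neighbours = sorted(
        ((cosine(target, r), u) for u, r in ratings.items() if u != target_user),
        reverse=True)[:k]
    scores = {}
    for sim, user in neighbours:
        for item, rating in ratings[user].items():
            if item not in target:
                scores[item] = scores.get(item, 0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    print(recommend("user2"))   # e.g. suggests itemC based on similar users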
B. Viva Questions: