Samaksh Gupta Programming Ass. IR

This document outlines a programming assignment for an Information Retrieval course, detailing methods for representing text documents using the Vector Space Model, preprocessing techniques like stop word removal and stemming, and classification and clustering of documents using algorithms like Naive Bayes and K-Means. It also describes a web crawling task to collect news stories based on user input with domain filtering options. The assignment includes algorithmic steps and evaluation metrics for each task.

Uploaded by

Nandini Shukla
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views13 pages

Samaksh Gupta Programming Ass. IR

This document outlines a programming assignment for an Information Retrieval course, detailing methods for representing text documents using the Vector Space Model, preprocessing techniques like stop word removal and stemming, and classification and clustering of documents using algorithms like Naive Bayes and K-Means. It also describes a web crawling task to collect news stories based on user input with domain filtering options. The assignment includes algorithmic steps and evaluation metrics for each task.

Uploaded by

Nandini Shukla
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Lab/Programming Assignment for ‘Information Retrieval’ Course

SUBMITTED TO:
DR. SATYENDR SINGH

SUBMITTED BY:
SAMAKSH GUPTA -210C2030126

SCHOOL OF ENGINEERING AND TECHNOLOGY


BML MUNJAL UNIVERSITY, GURGAON
DECEMBER 2024
1. Representation of a Text Document in Vector Space Model and Computing Similarity
between two documents.
Sol.
The Vector Space Model (VSM) is a widely used approach for representing text documents
as vectors of numerical values. It facilitates the comparison of textual data by converting
documents into a structured form that supports mathematical operations. Here's an outline of
how to represent a document in VSM and compute similarity between two documents:
Representation of a Text Document in the Vector Space Model:
A text document is represented as a vector in a high-dimensional space where:
• Each dimension corresponds to a unique term (word) in the corpus.
• The value of each dimension represents the weight of the term in the document.
Steps:
A. Preprocessing:
• Convert text to lowercase.
• Remove stopwords (e.g., "the," "and," "is").
• Perform stemming/lemmatization to reduce words to their base forms.
• Tokenize text into individual terms.
B. Build a Vocabulary:
• Combine all documents to create a list of unique terms.
C. Weighting Terms:
Assign weights to each term using methods such as:
• Binary Weighting: 1 if the term is present, 0 otherwise.
• Term Frequency (TF): The raw count of a term in the document.
• TF-IDF (Term Frequency-Inverse Document Frequency):

TF-IDF(t, d) = TF(t, d) × log(N / DF(t))

where:
• TF(t, d): Term frequency of term t in document d.
• N: Total number of documents.
• DF(t): Number of documents containing term t.
Computing Similarity Between Two Documents:
The similarity between two document vectors d1 and d2 is most often computed using Cosine Similarity, defined as

cos(d1, d2) = (d1 · d2) / (‖d1‖ × ‖d2‖)

which, for non-negative term weights, ranges from 0 (no shared terms) to 1 (identical direction).
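The steps above can be sketched in plain Python. This is a minimal illustration with invented function names; a real solution would typically use scikit-learn's TfidfVectorizer and cosine_similarity instead:

```python
import math
from collections import Counter

def tokenize(text):
    # lowercase and split on non-alphanumeric characters
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return cleaned.split()

def tfidf_vectors(docs):
    """Represent each document as a TF-IDF vector over the shared vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # DF(t): number of documents containing term t
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # TF-IDF(t, d) = TF(t, d) * log(N / DF(t))
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine_similarity(v1, v2):
    # cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```

With this weighting, two documents that share no vocabulary terms get similarity 0, and a document compared with itself gets 1.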
2. Pre-processing of a Text Document: stop word removal and stemming.
Sol.
Preprocessing is an essential step in text analysis and natural language processing (NLP). This
involves cleaning and normalizing the text data to make it suitable for analysis or machine
learning.
A. Stop Word Removal
Stop words are common words (e.g., "is," "the," "and") that usually do not carry significant
meaning in text analysis. Removing these words helps to reduce noise and focus on meaningful
terms.
Example of Stop Words:
• English: "is," "the," "and," "of," "in"
• Custom stop words can also be added depending on the domain.
B. Stemming
Stemming is the process of reducing words to their root form or base stem, often by removing
suffixes. For example:
• "running" → "run"
• "studies" → "studi"
It is a heuristic method and may sometimes produce non-dictionary words (e.g., "studies" →
"studi").
Common Stemming Algorithms:
• Porter Stemmer (widely used in NLP tasks).
• Lancaster Stemmer (more aggressive than Porter).
• Snowball Stemmer (an improved version of Porter).
Approach:
1. Tokenization: Split the text into individual words (tokens).
2. Stop Word Removal: Filter out common words like "is", "and", "the", etc.
3. Stemming: Use a stemming algorithm (e.g., Porter Stemmer) to reduce words to their root
form (e.g., "running" → "run").
Algorithmic Steps:
1. Read the input text file.
2. Tokenize the text into words.
3. Remove stop words using a predefined list (e.g., from the NLTK library).
4. Apply stemming using a stemming algorithm.
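A minimal sketch of these steps, assuming a small illustrative stop-word list and a simplified suffix-stripping stemmer; the assignment would typically use NLTK's stopwords corpus and PorterStemmer instead:

```python
# illustrative subset of an English stop-word list
STOP_WORDS = {"is", "the", "and", "of", "in", "a", "an", "to", "on"}

def simple_stem(word):
    """Crude suffix stripping standing in for a real Porter stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"          # "studies" -> "studi", as Porter does
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]        # "running" -> "run"
            return stem
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # "cats" -> "cat"
    return word

def preprocess(text):
    # 1. tokenize, 2. remove stop words, 3. stem
    tokens = [t for t in text.lower().split() if t.isalpha()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [simple_stem(t) for t in tokens]
```

The same pipeline with NLTK would swap STOP_WORDS for `stopwords.words('english')` and simple_stem for `PorterStemmer().stem`.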

Result:
3. Classification of Text Documents into predefined categories using a supervised
learning algorithm.
Sol.
The task involves classifying a set of text documents into predefined categories using a supervised
learning algorithm. For example, documents might be classified into categories like Technology,
Health, and Sports. Standard classification algorithms such as Naive Bayes, Support Vector
Machines (SVM), or Logistic Regression can be used.
Approach:
1. Dataset: Use a standard text dataset like the 20 Newsgroups dataset from sklearn.datasets.
2. Preprocessing:
• Tokenization
• Stop word removal
• TF-IDF vectorization
3. Algorithm: Train a classification model (e.g., Naive Bayes) on the labeled training data.
4. Evaluation:
• Split the data into training and testing sets.
• Evaluate the classifier using metrics like accuracy, precision, recall, and F1-score.
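The classification step can be illustrated with a from-scratch multinomial Naive Bayes. This sketch avoids sklearn so it stays self-contained; the tiny training set and class names below are invented examples, not the 20 Newsgroups data:

```python
import math
from collections import Counter

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.totals = {c: 0 for c in self.classes}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.totals[label] += len(tokens)
            self.vocab.update(tokens)
        # prior P(c) = fraction of training documents with label c
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}

    def predict(self, doc):
        tokens = doc.lower().split()
        v = len(self.vocab)
        best, best_score = None, float("-inf")
        for c in self.classes:
            # log P(c) + sum over tokens of log P(token | c), Laplace-smoothed
            score = math.log(self.priors[c])
            for t in tokens:
                score += math.log((self.word_counts[c][t] + 1) / (self.totals[c] + v))
            if score > best_score:
                best, best_score = c, score
        return best
```

With sklearn the equivalent pipeline is TfidfVectorizer followed by MultinomialNB, evaluated via `classification_report` on a held-out test split.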

Code:
Result :

4. Text Document Clustering Using K-Means.


Sol.
Problem Description:
The task is to cluster a set of text documents into k clusters using the K-Means algorithm.
Clustering is an unsupervised learning technique that groups similar documents based on their
feature representation.
We also evaluate the performance of clustering using the following metrics:
1. Purity: Measures the extent to which clusters contain a single class.
2. Precision, Recall, and F1-Measure: Standard evaluation metrics for clustering quality.
Approach:
1. Dataset: Use the 20 Newsgroups dataset
2. Preprocessing:
• Tokenize the text.
• Remove stop words.
• Convert documents into TF-IDF vectors.
3. Clustering: Use the K-Means algorithm to group documents into k clusters.
4. Evaluation:
• Purity is calculated by assigning each cluster to the most frequent true category in that
cluster.
• Compute Precision, Recall, and F1-Measure using clustering assignments.

Algorithmic Steps:
1. Load and preprocess the dataset (use TF-IDF vectorization).
2. Perform K-Means clustering to group the documents into k clusters.
3. Assign each cluster a label based on the majority class in that cluster.
4. Compute the evaluation metrics (Purity, Precision, Recall, and F1-Measure).
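These steps can be sketched with a plain-Python Lloyd's K-Means and a purity function. In practice one would use sklearn's KMeans on TF-IDF vectors; the toy 2-D vectors in the usage below stand in for those:

```python
import random
from collections import Counter

def kmeans(vectors, k, iters=100, seed=0):
    """Standard Lloyd's algorithm on dense vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        new_assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_assign == assign:
            break
        assign = new_assign
        # update step: recompute each centroid as the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def purity(assignments, labels):
    """Each cluster counts toward purity via its majority true class."""
    clusters = {}
    for a, label in zip(assignments, labels):
        clusters.setdefault(a, []).append(label)
    correct = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return correct / len(labels)
```

Precision, Recall, and F1 can then be computed by relabeling each cluster with its majority class and comparing against the true labels.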
Code:
Result:

5. Crawling/ Searching the Web to collect news stories on a specific topic (based on user
input). The program should have an option to limit the crawling to certain selected websites
only.
Sol.
The task is to create a program that crawls or searches the web to collect news stories on a
specific topic based on user input. The program should have the following features:
1. Input Topic: Accept user input for the topic they want to search for.
2. Domain Filtering: Allow the user to specify a list of websites to limit the search.
3. Data Collection: Extract news articles or stories related to the input topic from the specified
websites.
Algorithmic Approach
Step 1: Define Input Parameters
1. Accept the search topic from the user.
2. Allow the user to specify the list of websites to limit the crawl.
Step 2: Search and Crawl
1. Use an HTTP client with an HTML parsing library (e.g., requests with BeautifulSoup) or a search
API (e.g., Google Custom Search API).
2. Restrict crawling to the provided websites.
Step 3: Parse and Filter
1. Extract headlines, URLs, and brief descriptions of news articles.
2. Filter results to ensure relevance to the topic using keyword matching.

Step 4: Store and Display


1. Save results in a structured format (e.g., JSON or CSV).
2. Display the results to the user in a readable format.
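The domain-filtering and relevance-filtering logic of Steps 2 and 3 can be sketched as follows. Network fetching is deliberately omitted (it would use requests/urllib plus an HTML parser), so `collect_stories` operates on already-fetched pages; all helper names are illustrative:

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """True if the URL's host is one of the user-selected domains (or a subdomain)."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

def is_relevant(text, topic):
    """Crude keyword matching: every word of the topic must occur in the text."""
    text = text.lower()
    return all(word in text for word in topic.lower().split())

def collect_stories(pages, topic, allowed_domains):
    """Filter (url, headline, summary) tuples by domain and topic relevance."""
    results = []
    for url, headline, summary in pages:
        if is_allowed(url, allowed_domains) and is_relevant(headline + " " + summary, topic):
            results.append({"url": url, "headline": headline, "summary": summary})
    return results
```

The resulting list of dicts can be written out with `json.dump` or `csv.DictWriter` for Step 4.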

Code:
Output:
