Samaksh Gupta Programming Ass. IR

This document outlines a programming assignment for an Information Retrieval course, detailing methods for representing text documents using the Vector Space Model, preprocessing techniques like stop word removal and stemming, and classification and clustering of documents using algorithms like Naive Bayes and K-Means. It also describes a web crawling task to collect news stories based on user input with domain filtering options. The assignment includes algorithmic steps and evaluation metrics for each task.

Uploaded by

Nandini Shukla
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views13 pages

Samaksh Gupta Programming Ass. IR

This document outlines a programming assignment for an Information Retrieval course, detailing methods for representing text documents using the Vector Space Model, preprocessing techniques like stop word removal and stemming, and classification and clustering of documents using algorithms like Naive Bayes and K-Means. It also describes a web crawling task to collect news stories based on user input with domain filtering options. The assignment includes algorithmic steps and evaluation metrics for each task.

Uploaded by

Nandini Shukla
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Lab/Programming Assignment for ‘Information Retrieval’ Course

SUBMITTED TO:
DR. SATYENDR SINGH

SUBMITTED BY:
SAMAKSH GUPTA -210C2030126

SCHOOL OF ENGINEERING AND TECHNOLOGY


BML MUNJAL UNIVERSITY, GURGAON
DECEMBER 2024
1. Representation of a Text Document in Vector Space Model and Computing Similarity
between two documents.
Sol.
The Vector Space Model (VSM) is a widely used approach for representing text documents
as vectors of numerical values. It facilitates the comparison of textual data by converting
documents into a structured form that supports mathematical operations. Here's an outline of
how to represent a document in VSM and compute similarity between two documents:
Representation of a Text Document in the Vector Space Model:
A text document is represented as a vector in a high-dimensional space where:
• Each dimension corresponds to a unique term (word) in the corpus.
• The value of each dimension represents the weight of the term in the document.
Steps:
A. Preprocessing:
• Convert text to lowercase.
• Remove stopwords (e.g., "the," "and," "is").
• Perform stemming/lemmatization to reduce words to their base forms.
• Tokenize text into individual terms.
B. Build a Vocabulary:
• Combine all documents to create a list of unique terms.
C. Weighting Terms:
Assign weights to each term using methods such as:
• Binary Weighting: 1 if the term is present, 0 otherwise.
• Term Frequency (TF): The raw count of a term in the document.
• TF-IDF (Term Frequency-Inverse Document Frequency):

TF-IDF(t, d) = TF(t, d) × log(N / DF(t))

where:
• TF(t, d): Term frequency of term t in document d.
• N: Total number of documents.
• DF(t): Number of documents containing term t.
Computing Similarity Between Two Documents:
The similarity between two document vectors d1 and d2 is most often computed using Cosine Similarity, defined as

cos(d1, d2) = (d1 · d2) / (‖d1‖ × ‖d2‖)

which, for non-negative term weights, ranges from 0 (no shared terms) to 1 (identical direction).
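The steps above can be sketched in plain Python. This is a minimal illustration with invented function names; a real solution would typically use scikit-learn's TfidfVectorizer and cosine_similarity instead:

```python
import math
from collections import Counter

def tokenize(text):
    # lowercase and split on non-alphanumeric characters
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return cleaned.split()

def tfidf_vectors(docs):
    """Represent each document as a TF-IDF vector over the shared vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # DF(t): number of documents containing term t
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # TF-IDF(t, d) = TF(t, d) * log(N / DF(t))
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine_similarity(v1, v2):
    # cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```

With this weighting, two documents that share no vocabulary terms get similarity 0, and a document compared with itself gets 1.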
2. Pre-processing of a Text Document: stop word removal and stemming.
Sol.
Preprocessing is an essential step in text analysis and natural language processing (NLP). This
involves cleaning and normalizing the text data to make it suitable for analysis or machine
learning.
A. Stop Word Removal
Stop words are common words (e.g., "is," "the," "and") that usually do not carry significant
meaning in text analysis. Removing these words helps to reduce noise and focus on meaningful
terms.
Example of Stop Words:
• English: "is," "the," "and," "of," "in"
• Custom stop words can also be added depending on the domain.
B. Stemming
Stemming is the process of reducing words to their root form or base stem, often by removing
suffixes. For example:
• "running" → "run"
• "studies" → "studi"
It is a heuristic method and may sometimes produce non-dictionary words (e.g., "studies" →
"studi").
Common Stemming Algorithms:
• Porter Stemmer (widely used in NLP tasks).
• Lancaster Stemmer (more aggressive than Porter).
• Snowball Stemmer (an improved version of Porter).
Approach:
1. Tokenization: Split the text into individual words (tokens).
2. Stop Word Removal: Filter out common words like "is", "and", "the", etc.
3. Stemming: Use a stemming algorithm (e.g., Porter Stemmer) to reduce words to their root
form (e.g., "running" → "run").
Algorithmic Steps:
1. Read the input text file.
2. Tokenize the text into words.
3. Remove stop words using a predefined list (e.g., from the NLTK library).
4. Apply stemming using a stemming algorithm.
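A minimal sketch of these steps, assuming a small illustrative stop-word list and a simplified suffix-stripping stemmer; the assignment would typically use NLTK's stopwords corpus and PorterStemmer instead:

```python
# illustrative subset of an English stop-word list
STOP_WORDS = {"is", "the", "and", "of", "in", "a", "an", "to", "on"}

def simple_stem(word):
    """Crude suffix stripping standing in for a real Porter stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"          # "studies" -> "studi", as Porter does
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) > 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]        # "running" -> "run"
            return stem
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                # "cats" -> "cat"
    return word

def preprocess(text):
    # 1. tokenize, 2. remove stop words, 3. stem
    tokens = [t for t in text.lower().split() if t.isalpha()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [simple_stem(t) for t in tokens]
```

The same pipeline with NLTK would swap STOP_WORDS for `stopwords.words('english')` and simple_stem for `PorterStemmer().stem`.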

Result:
3. Classification of Text Documents into predefined categories using a supervised
learning algorithm.
Sol.
The task involves classifying a set of text documents into predefined categories using a supervised
learning algorithm. For example, documents might be classified into categories like Technology,
Health, and Sports. Standard classification algorithms such as Naive Bayes, Support Vector
Machines (SVM), or Logistic Regression can be used.
Approach:
1. Dataset: Use a standard text dataset like the 20 Newsgroups dataset from sklearn.datasets.
2. Preprocessing:
• Tokenization
• Stop word removal
• TF-IDF vectorization
3. Algorithm: Train a classification model (e.g., Naive Bayes) on the labeled training data.
4. Evaluation:
• Split the data into training and testing sets.
• Evaluate the classifier using metrics like accuracy, precision, recall, and F1-score.
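The classification step can be illustrated with a from-scratch multinomial Naive Bayes. This sketch avoids sklearn so it stays self-contained; the tiny training set and class names below are invented examples, not the 20 Newsgroups data:

```python
import math
from collections import Counter

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.totals = {c: 0 for c in self.classes}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.totals[label] += len(tokens)
            self.vocab.update(tokens)
        # prior P(c) = fraction of training documents with label c
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}

    def predict(self, doc):
        tokens = doc.lower().split()
        v = len(self.vocab)
        best, best_score = None, float("-inf")
        for c in self.classes:
            # log P(c) + sum over tokens of log P(token | c), Laplace-smoothed
            score = math.log(self.priors[c])
            for t in tokens:
                score += math.log((self.word_counts[c][t] + 1) / (self.totals[c] + v))
            if score > best_score:
                best, best_score = c, score
        return best
```

With sklearn the equivalent pipeline is TfidfVectorizer followed by MultinomialNB, evaluated via `classification_report` on a held-out test split.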

Code:
Result :

4. Text Document Clustering Using K-Means.


Sol.
Problem Description:
The task is to cluster a set of text documents into k clusters using the K-Means algorithm.
Clustering is an unsupervised learning technique that groups similar documents based on their
feature representation.
We also evaluate the performance of clustering using the following metrics:
1. Purity: Measures the extent to which clusters contain a single class.
2. Precision, Recall, and F1-Measure: Standard evaluation metrics for clustering quality.
Approach:
1. Dataset: Use the 20 Newsgroups dataset
2. Preprocessing:
• Tokenize the text.
• Remove stop words.
• Convert documents into TF-IDF vectors.
3. Clustering: Use the K-Means algorithm to group documents into k clusters.
4. Evaluation:
• Purity is calculated by assigning each cluster to the most frequent true category in that
cluster.
• Compute Precision, Recall, and F1-Measure using clustering assignments.

Algorithmic Steps:
1. Load and preprocess the dataset (use TF-IDF vectorization).
2. Perform K-Means clustering to group the documents into k clusters.
3. Assign each cluster a label based on the majority class in that cluster.
4. Compute the evaluation metrics (Purity, Precision, Recall, and F1-Measure).
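These steps can be sketched with a plain-Python Lloyd's K-Means and a purity function. In practice one would use sklearn's KMeans on TF-IDF vectors; the toy 2-D vectors in the usage below stand in for those:

```python
import random
from collections import Counter

def kmeans(vectors, k, iters=100, seed=0):
    """Standard Lloyd's algorithm on dense vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        new_assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_assign == assign:
            break
        assign = new_assign
        # update step: recompute each centroid as the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def purity(assignments, labels):
    """Each cluster counts toward purity via its majority true class."""
    clusters = {}
    for a, label in zip(assignments, labels):
        clusters.setdefault(a, []).append(label)
    correct = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return correct / len(labels)
```

Precision, Recall, and F1 can then be computed by relabeling each cluster with its majority class and comparing against the true labels.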
Code:
Result:

5. Crawling/ Searching the Web to collect news stories on a specific topic (based on user
input). The program should have an option to limit the crawling to certain selected websites
only.
Sol.
The task is to create a program that crawls or searches the web to collect news stories on a
specific topic based on user input. The program should have the following features:
1. Input Topic: Accept user input for the topic they want to search for.
2. Domain Filtering: Allow the user to specify a list of websites to limit the search.
3. Data Collection: Extract news articles or stories related to the input topic from the specified
websites.
Algorithmic Approach
Step 1: Define Input Parameters
1. Accept the search topic from the user.
2. Allow the user to specify the list of websites to limit the crawl.
Step 2: Search and Crawl
1. Use an HTTP client with an HTML parsing library (e.g., requests with BeautifulSoup) or a search
API (e.g., Google Custom Search API).
2. Restrict crawling to the provided websites.
Step 3: Parse and Filter
1. Extract headlines, URLs, and brief descriptions of news articles.
2. Filter results to ensure relevance to the topic using keyword matching.

Step 4: Store and Display


1. Save results in a structured format (e.g., JSON or CSV).
2. Display the results to the user in a readable format.
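The domain-filtering and relevance-filtering logic of Steps 2 and 3 can be sketched as follows. Network fetching is deliberately omitted (it would use requests/urllib plus an HTML parser), so `collect_stories` operates on already-fetched pages; all helper names are illustrative:

```python
from urllib.parse import urlparse

def is_allowed(url, allowed_domains):
    """True if the URL's host is one of the user-selected domains (or a subdomain)."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)

def is_relevant(text, topic):
    """Crude keyword matching: every word of the topic must occur in the text."""
    text = text.lower()
    return all(word in text for word in topic.lower().split())

def collect_stories(pages, topic, allowed_domains):
    """Filter (url, headline, summary) tuples by domain and topic relevance."""
    results = []
    for url, headline, summary in pages:
        if is_allowed(url, allowed_domains) and is_relevant(headline + " " + summary, topic):
            results.append({"url": url, "headline": headline, "summary": summary})
    return results
```

The resulting list of dicts can be written out with `json.dump` or `csv.DictWriter` for Step 4.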

Code:
Output:
