Samaksh Gupta Programming Ass. IR
‘Information Retrieval’ Course
SUBMITTED TO:
DR. SATYENDR SINGH
SUBMITTED BY:
SAMAKSH GUPTA -210C2030126
TF-IDF(t, d) = TF(t, d) × log(N / DF(t))
where:
• TF (t, d): Term frequency of term t in document d.
• N: Total number of documents.
• DF(t): Number of documents containing term t.
Computing Similarity Between Two Documents:
The similarity between two document vectors d1 and d2 is often computed using Cosine
Similarity, defined as

cos(d1, d2) = (d1 · d2) / (|d1| × |d2|)

where d1 · d2 is the dot product of the two vectors and |d| is the Euclidean norm of a vector.
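As a concrete illustration, the TF-IDF weighting and cosine similarity defined above can be sketched in plain Python. The three-document corpus is invented for the example:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Compute a TF-IDF vector for each tokenized document.

    TF(t, d) is the raw count of term t in document d;
    IDF(t) is log(N / DF(t)), with N the number of documents.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stock markets fell sharply today".split(),
]
v = tf_idf_vectors(docs)
print(cosine_similarity(v[0], v[1]))  # positive: the documents share terms
print(cosine_similarity(v[0], v[2]))  # 0.0: no overlapping terms
```

Documents with overlapping vocabulary get a positive score, disjoint documents score 0, and a document is maximally similar (1.0) to itself.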
2. Pre-processing of a Text Document: stop word removal and stemming.
Sol.
Preprocessing is an essential step in text analysis and natural language processing (NLP). This
involves cleaning and normalizing the text data to make it suitable for analysis or machine
learning.
A. Stop Word Removal
Stop words are common words (e.g., "is," "the," "and") that usually do not carry significant
meaning in text analysis. Removing these words helps to reduce noise and focus on meaningful
terms.
Example of Stop Words:
• English: "is," "the," "and," "of," "in"
• Custom stop words can also be added depending on the domain.
B. Stemming
Stemming is the process of reducing words to their root form or base stem, often by removing
suffixes. For example:
• "running" → "run"
• "studies" → "studi"
It is a heuristic method and may sometimes produce non-dictionary words (e.g., "studies" →
"studi").
Common Stemming Algorithms:
• Porter Stemmer (widely used in NLP tasks).
• Lancaster Stemmer (more aggressive than Porter).
• Snowball Stemmer (an improved version of Porter).
Approach:
1. Tokenization: Split the text into individual words (tokens).
2. Stop Word Removal: Filter out common words like "is", "and", "the", etc.
3. Stemming: Use a stemming algorithm (e.g., Porter Stemmer) to reduce words to their root
form (e.g., "running" → "run").
Algorithmic Steps:
1. Read the input text file.
2. Tokenize the text into words.
3. Remove stop words using a predefined list (e.g., from the NLTK library).
4. Apply stemming using a stemming algorithm.
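The steps above can be sketched as follows. To keep the sketch self-contained, a simple regex tokenizer and a small inline stop list stand in for NLTK's word_tokenize and stopwords corpus (so no extra downloads are needed); only NLTK's Porter Stemmer is assumed:

```python
import re
from nltk.stem import PorterStemmer

# Small inline stop list; in practice you would load
# nltk.corpus.stopwords.words("english") instead.
STOP_WORDS = {"is", "are", "the", "and", "of", "in", "a", "an", "to", "on"}

def preprocess(text):
    """Tokenize, remove stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]              # stemming

# Stop words are dropped; remaining words are reduced to their stems.
print(preprocess("The students are running and studying in the library"))
```

Note how the stemmer produces non-dictionary stems such as "studi", exactly as described above.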
Result:
3. Classification of a set of Text Documents into known categories using a supervised
learning algorithm.
Sol.
The task involves classifying a set of text documents into predefined categories using a supervised
learning algorithm. For example, documents might be classified into categories like Technology,
Health, and Sports. Standard classification algorithms such as Naive Bayes, Support Vector
Machines (SVM), or Logistic Regression can be used.
Approach:
1. Dataset: Use a standard text dataset like the 20 Newsgroups dataset from sklearn.datasets.
2. Preprocessing:
• Tokenization
• Stop word removal
• TF-IDF vectorization
3. Algorithm: Train a classification model (e.g., Naive Bayes) on the labeled training data.
4. Evaluation:
• Split the data into training and testing sets.
• Evaluate the classifier using metrics like accuracy, precision, recall, and F1-score.
Code:
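A minimal sketch of this pipeline is shown below. To keep it self-contained and offline, a tiny hand-written corpus stands in for the 20 Newsgroups dataset mentioned above; scikit-learn's TfidfVectorizer and MultinomialNB are assumed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus standing in for 20 Newsgroups, so the sketch runs offline.
train_texts = [
    "the new gpu accelerates deep learning workloads",
    "software update patches the operating system kernel",
    "doctors recommend a balanced diet and regular exercise",
    "the new vaccine reduces infection rates in trials",
    "the striker scored twice in the championship final",
    "the team won the league after a penalty shootout",
]
train_labels = ["tech", "tech", "health", "health", "sports", "sports"]

test_texts = ["a faster gpu for machine learning", "the team scored in the final"]
test_labels = ["tech", "sports"]

# TF-IDF vectorization (with stop word removal) + Naive Bayes in one pipeline.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(predictions, accuracy_score(test_labels, predictions))
```

For a real experiment you would instead load the 20 Newsgroups split via sklearn.datasets.fetch_20newsgroups and report precision, recall, and F1-score alongside accuracy.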
Result:
4. Clustering of a set of Text Documents into k clusters using the K-Means algorithm.
Sol.
Algorithmic Steps:
1. Load and preprocess the dataset (use TF-IDF vectorization).
2. Perform K-Means clustering to group the documents into k clusters.
3. Assign each cluster a label based on the majority class in that cluster.
4. Compute evaluation metrics (e.g., clustering purity).
Code:
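The four steps can be sketched as below. The corpus and class labels are invented for the example, and purity is used as the evaluation metric (one common choice; the steps above do not name a specific metric):

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny invented corpus with known classes, used only to illustrate the steps.
docs = [
    "the gpu accelerates deep learning workloads",
    "a software update patches the kernel",
    "doctors recommend diet and regular exercise",
    "the vaccine reduces infection rates",
    "the striker scored in the championship final",
    "the team won the league title",
]
true_labels = ["tech", "tech", "health", "health", "sports", "sports"]

# Step 1: TF-IDF vectorization.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Step 2: K-Means clustering into k clusters.
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)

# Step 3: assign each cluster the majority class among its members.
cluster_labels = {}
for c in range(k):
    members = [true_labels[i] for i, cl in enumerate(km.labels_) if cl == c]
    cluster_labels[c] = Counter(members).most_common(1)[0][0]

# Step 4: purity = fraction of documents whose true class matches
# their cluster's majority label.
predicted = [cluster_labels[c] for c in km.labels_]
purity = sum(p == t for p, t in zip(predicted, true_labels)) / len(docs)
print(f"purity = {purity:.2f}")
```

On such a small corpus the clusters are sensitive to initialization, which is why random_state is fixed; on a real dataset you would also report metrics like adjusted Rand index.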
Result:
5. Crawling/ Searching the Web to collect news stories on a specific topic (based on user
input). The program should have an option to limit the crawling to certain selected websites
only.
Sol.
The task is to create a program that crawls or searches the web to collect news stories on a
specific topic based on user input. The program should have the following features:
1. Input Topic: Accept user input for the topic they want to search for.
2. Domain Filtering: Allow the user to specify a list of websites to limit the search.
3. Data Collection: Extract news articles or stories related to the input topic from the specified
websites.
Algorithmic Approach
Step 1: Define Input Parameters
1. Accept the search topic from the user.
2. Allow the user to specify the list of websites to limit the crawl.
Step 2: Search and Crawl
1. Use an HTTP client with an HTML parser (e.g., requests with BeautifulSoup) or a search API
(e.g., the Google Custom Search API).
2. Restrict crawling to the provided websites.
Step 3: Parse and Filter
1. Extract headlines, URLs, and brief descriptions of news articles.
2. Filter results to ensure relevance to the topic using keyword matching.
Code:
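A sketch of such a crawler is below, assuming the requests and BeautifulSoup libraries. Relevance is decided by simple keyword matching as described in Step 3; real news sites will need site-specific parsing, polite rate limiting, and respect for robots.txt:

```python
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def is_relevant(title, topic):
    """Keyword matching: keep a headline only if every topic word appears in it."""
    return all(w in title.lower() for w in topic.lower().split())

def allowed(url, allowed_sites):
    """Restrict crawling to user-selected domains (empty list = allow all)."""
    if not allowed_sites:
        return True
    host = urlparse(url).netloc
    return any(host.endswith(site) for site in allowed_sites)

def collect_news(topic, start_urls, allowed_sites=()):
    """Fetch each allowed page, extract link texts as headlines, keep relevant ones."""
    stories = []
    for url in start_urls:
        if not allowed(url, allowed_sites):
            continue
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a", href=True):
            title = link.get_text(strip=True)
            if title and is_relevant(title, topic):
                stories.append({"title": title, "url": link["href"]})
    return stories

# Example usage (performs live network requests):
# stories = collect_news("climate change",
#                        ["https://www.bbc.com/news"],
#                        allowed_sites=["bbc.com"])
```

Treating every anchor text as a candidate headline is a simplification; per-site CSS selectors would give cleaner titles and descriptions.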
Output: