
Department of Computer Engineering

Name of the Student: Raj Singh        Roll Number: O129

SAP ID: 60004210188    Division: C3    Batch: C32

Subject: WEB INTELLIGENCE

DATE OF PERFORMANCE: 03/04/2025    DATE OF SUBMISSION: 03/04/2025

EXPERIMENT NO: 4

AIM: Design a crawler to gather web information (CO2)


SOFTWARE/IDE USED: Google Colab/Jupyter Notebook
THEORY:
1. What is a web crawler and where is it used?
A web crawler is an automated program or bot that systematically browses the
internet to collect data from websites. It retrieves web pages, extracts content, and
indexes information for search engines. Crawlers are mainly used by search engines
like Google to update their indexes and gather relevant data.
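
As a quick illustration of this retrieve-and-extract cycle, the minimal sketch below fetches a single page and lists its outgoing links. It assumes the requests and beautifulsoup4 packages are installed, and the seed URL is only a placeholder.

# Minimal sketch of the fetch-and-extract step a crawler repeats for every page.
# The seed URL is a placeholder; requests and beautifulsoup4 are assumed installed.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

seed = "https://example.com"  # placeholder starting page
html = requests.get(seed, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

print("Title:", soup.title.string if soup.title else "No Title")
# Collect outgoing links, resolving relative URLs against the current page
links = {urljoin(seed, a["href"]) for a in soup.find_all("a", href=True)}
print(f"Found {len(links)} links to crawl next")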

2. Explain with a tree diagram the taxonomy of web crawlers.

Web Crawlers
├── General Crawlers
├── Focused Crawlers
├── Incremental Crawlers
└── Deep Web Crawlers

• General Crawlers: Crawl all pages, covering the entire web.
• Focused Crawlers: Target specific types of content or domains.
• Incremental Crawlers: Continuously update previously crawled data.
• Deep Web Crawlers: Crawl hidden web content, like databases or dynamic pages.

3. What are basic crawlers?


Basic crawlers are simple bots that retrieve web pages by following links. They operate with
minimal filtering, usually indexing all pages they encounter without prioritizing content. These
crawlers may not consider the relevance or type of content, making them less efficient for
specialized tasks.
4. State several implementation issues in web crawlers.
• Scalability: Handling large volumes of web data and managing crawling across
millions of pages.
• Politeness: Avoiding overloading web servers with too many requests in a short
time.
• Freshness: Ensuring crawled data is up-to-date.
• Duplicate Content: Managing and filtering out duplicate web pages (a politeness-and-deduplication sketch follows this list).
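
As a rough illustration of the politeness and duplicate-content points above, a crawler can keep a per-host timestamp to space out requests and a set of canonical URLs to skip repeats. The one-second delay and the URLs in this sketch are illustrative assumptions, not fixed requirements.

# Sketch: per-host politeness delay plus simple URL de-duplication.
# The delay value and the URLs are illustrative assumptions.
import time
from urllib.parse import urlparse, urldefrag

CRAWL_DELAY = 1.0          # seconds to wait between requests to the same host
last_hit = {}              # host -> time of last request
seen = set()               # canonical URLs already fetched

def polite_and_new(url):
    """Return True if the URL is new and enough time has passed for its host."""
    canonical, _ = urldefrag(url)          # drop #fragments so duplicates collapse
    if canonical in seen:
        return False
    host = urlparse(canonical).netloc
    wait = CRAWL_DELAY - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)                   # politeness: pause before re-hitting the host
    last_hit[host] = time.time()
    seen.add(canonical)
    return True

for url in ["https://example.com/a", "https://example.com/a#top", "https://example.com/b"]:
    print(url, "->", "fetch" if polite_and_new(url) else "skip")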

5. How are universal crawlers different from preferential (i.e., focused and topical) crawlers?
• Universal Crawlers: These crawlers aim to index all available content on the web, with no bias toward particular types of pages or topics. They are designed for general-purpose search engines.

• Preferential Crawlers: These are specialized crawlers that focus on particular topics, websites, or content types. They prioritize certain categories of pages based on relevance or keywords (focused and topical), as illustrated by the relevance-scoring sketch below.
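
A preferential crawler typically scores each candidate page for topical relevance and visits the highest-scoring candidates first. The sketch below is a minimal illustration using a keyword-overlap score and a priority queue; the topic keywords, URLs, and snippets are assumptions made for the example.

# Sketch of preferential (focused/topical) crawling: score pages by keyword overlap
# and visit the most relevant candidates first. Keywords and snippets are invented.
import heapq

TOPIC_KEYWORDS = {"crawler", "search", "index", "web"}   # assumed topic vocabulary

def relevance(text):
    """Fraction of topic keywords that appear in the page text (0.0 to 1.0)."""
    words = set(text.lower().split())
    return len(TOPIC_KEYWORDS & words) / len(TOPIC_KEYWORDS)

# Max-heap via negated scores: the most relevant candidate is popped first.
frontier = []
candidates = {
    "https://example.com/search-engines": "how a web crawler builds a search index",
    "https://example.com/cooking":        "recipes for pasta and soup",
}
for url, snippet in candidates.items():
    heapq.heappush(frontier, (-relevance(snippet), url))

while frontier:
    score, url = heapq.heappop(frontier)
    print(f"{url} (relevance {abs(score):.2f})")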

6. Explain the 3D performance matrix used to evaluate the effectiveness of preferential crawlers,
considering its three key dimensions: Target Pages (Y-axis), Target Descriptions (X-axis), and
Target Depth (Z-axis).

Y-axis (Target Pages): Represents the number of web pages that are successfully
retrieved and indexed.

X-axis (Target Descriptions): Measures the quality of the content retrieved, ensuring it
matches the crawler's purpose (e.g., specific topics).

Z-axis (Target Depth): Refers to how deep into a site the crawler goes (how many levels
of links it follows). Effective crawlers maximize pages, quality, and depth without
overextending resources.
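
As a rough illustration of how these three dimensions might be summarised for a single crawl, the function below reports the number of pages retrieved, the fraction matching the target description, and the average link depth. The record format and sample data are assumptions made for this sketch, not a standard evaluation API.

# Sketch: summarising a crawl along the three evaluation dimensions.
# Each record is (url, matches_target_description, depth); the sample data is invented.
def evaluate_crawl(records):
    pages = len(records)                                   # Y-axis: target pages retrieved
    on_topic = sum(1 for _, match, _ in records if match)  # X-axis: description matches
    avg_depth = sum(d for _, _, d in records) / pages      # Z-axis: average link depth
    return {
        "pages_retrieved": pages,
        "description_precision": on_topic / pages,
        "average_depth": avg_depth,
    }

sample = [
    ("https://example.com/", True, 0),
    ("https://example.com/about", True, 1),
    ("https://example.com/blog/cats", False, 2),
]
print(evaluate_crawl(sample))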

7. Elaborate on crawler ethics and conflicts.


• Ethical Issues: Crawlers must respect website robots.txt files, avoid server overload, and not scrape private or sensitive data without consent. They should be transparent about their actions and limit the impact on web servers; a robots.txt check is sketched after this list.
• Conflicts: Ethical conflicts arise when crawlers violate site policies, scrape
copyrighted material, or interfere with the normal operation of websites. There's also
the issue of data ownership and whether it's fair to use crawled data for commercial
purposes.
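
For the robots.txt point above, Python's standard-library urllib.robotparser can check whether a given user agent is allowed to fetch a URL before the crawler requests it. The site and user-agent string below are placeholders for illustration.

# Sketch: respecting robots.txt with the standard-library robot parser.
# The site and user-agent string are placeholders.
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

site = "https://example.com/"
rp = RobotFileParser()
rp.set_url(urljoin(site, "/robots.txt"))
rp.read()                                  # downloads and parses the robots.txt file

for path in ["/", "/private/reports"]:
    url = urljoin(site, path)
    if rp.can_fetch("MyCrawler/1.0", url):
        print("Allowed:", url)
    else:
        print("Disallowed by robots.txt:", url)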

IMPLEMENTATION:
Code:
import requests
from bs4 import BeautifulSoup
import csv
import time
import random
from urllib.parse import urlparse, urljoin


# Define the Crawler class
class WebCrawler:
    def __init__(self, base_url, max_depth=3, user_agent="Mozilla/5.0"):
        """
        Initialize the WebCrawler.

        :param base_url: The URL to start the crawling from.
        :param max_depth: The maximum depth to which the crawler will follow links.
        :param user_agent: The user-agent string to mimic a browser.
        """
        self.base_url = base_url
        self.max_depth = max_depth
        self.user_agent = user_agent
        self.visited = set()
        # Queue of (url, depth) pairs so the depth limit is enforced per page
        self.to_visit = [(base_url, 0)]
        self.data = []

    def fetch_page(self, url):
        """
        Fetch the content of a web page.

        :param url: The URL of the page to fetch.
        :return: The page content as a string, or None if the page cannot be fetched.
        """
        try:
            headers = {'User-Agent': self.user_agent}
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
            else:
                print(f"Failed to retrieve {url} (Status code: {response.status_code})")
                return None
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None

    def parse_links(self, page_html, base_url):
        """
        Parse the HTML content of a page to extract all links.

        :param page_html: The raw HTML content of the page.
        :param base_url: The base URL for resolving relative links.
        :return: A set of full URLs.
        """
        soup = BeautifulSoup(page_html, 'html.parser')
        links = set()

        # Find all anchor tags with href attributes
        for link in soup.find_all('a', href=True):
            href = link.get('href')
            full_url = urljoin(base_url, href)  # Resolve relative links
            parsed_url = urlparse(full_url)

            # Keep only HTTP/HTTPS URLs
            if parsed_url.scheme in ['http', 'https']:
                links.add(full_url)

        return links

    def extract_page_data(self, page_html):
        """
        Extract useful information (title, meta description, paragraph text) from the page.

        :param page_html: The raw HTML content of the page.
        :return: A dictionary with the extracted information.
        """
        soup = BeautifulSoup(page_html, 'html.parser')

        title = soup.title.string if soup.title else 'No Title'

        # Extract the meta description if available
        description = ''
        meta_tag = soup.find('meta', attrs={'name': 'description'})
        if meta_tag and meta_tag.get('content'):
            description = meta_tag['content']

        # Extract all paragraph text (cleaned)
        text = ' '.join(p.get_text() for p in soup.find_all('p'))

        return {'title': title, 'description': description, 'text': text}

    def crawl(self):
        """
        Start the crawling process (breadth-first, up to max_depth levels of links).
        """
        while self.to_visit:
            current_url, depth = self.to_visit.pop(0)
            if current_url in self.visited or depth >= self.max_depth:
                continue

            print(f"Crawling: {current_url}")
            self.visited.add(current_url)
            page_html = self.fetch_page(current_url)

            if page_html:
                # Extract data from the page
                page_data = self.extract_page_data(page_html)
                self.data.append(page_data)

                # Parse links on the page and queue them at the next depth level
                links = self.parse_links(page_html, current_url)
                self.to_visit.extend((link, depth + 1) for link in links)

            # Sleep for a random time to avoid overloading the server
            time.sleep(random.uniform(1, 3))

    def save_data(self, filename='crawled_data.csv'):
        """
        Save the collected data to a CSV file.

        :param filename: The name of the CSV file to save the data.
        """
        if not self.data:
            print("No data collected; nothing to save.")
            return
        keys = self.data[0].keys()
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(self.data)


# Example Usage:
if __name__ == "__main__":
    start_url = 'https://example.com'  # Change to your desired start URL
    crawler = WebCrawler(base_url=start_url, max_depth=2)

    # Start crawling
    crawler.crawl()

    # Save the gathered data to a CSV file
    crawler.save_data('crawled_data.csv')
    print("Crawling finished, data saved to 'crawled_data.csv'")

Output:
title,description,text
"Example Domain","Example Domain is an example for documentation purposes.","This domain is established to be used for illustrative examples in documents."
"About Us - Example Domain","This is the about us page.","This is the about us section of Example Domain."
"Contact Us - Example Domain","Get in touch with us for more information.","You can reach out to us through the contact page on our site."

CONCLUSION: Evaluating a crawler's performance involves analyzing its efficiency, accuracy, and ability to gather relevant data while tackling technical challenges. Addressing errors promptly ensures smooth functionality throughout the process. Implementing future improvements, such as optimized algorithms and greater adaptability, can significantly enhance overall efficiency. These steps keep the crawler robust, accurate, and resource-friendly for future applications.

POST-EXPERIMENTAL EXERCISE:
1. How efficiently did the crawler perform in terms of speed, resource usage, and data retrieval?
The crawler's efficiency can be assessed by its speed in retrieving data, minimal consumption
of system resources, and its ability to extract accurate and complete information from the
targeted sources. High efficiency ensures optimal performance without overburdening the
system.

2. Did the crawler successfully collect the intended data, and how relevant was the information
gathered?
Evaluating whether the crawler successfully gathered the intended data involves verifying the
quality, accuracy, and relevance of the information collected. The data should align with
project goals and avoid unnecessary or unrelated content.

3. Were there any technical challenges, such as handling dynamic content, CAPTCHA restrictions,
or blocked access?
The crawler might encounter issues such as handling dynamic content that changes
frequently, overcoming CAPTCHA restrictions, or managing access limitations on certain
websites. Addressing these challenges requires tailored strategies and robust solutions.
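
One common mitigation for temporary blocks (HTTP 429 or 503 responses) is to retry with exponential backoff rather than re-requesting immediately. The sketch below shows this generic pattern with the requests library; the retry count, delays, and URL are arbitrary choices for illustration.

# Sketch: retrying a blocked or rate-limited request with exponential backoff.
# The retry count, delays, and URL are illustrative choices, not fixed requirements.
import time
import requests

def fetch_with_backoff(url, max_retries=4, base_delay=2.0):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            # 429 (Too Many Requests) and 503 (Service Unavailable) are worth retrying
            if response.status_code not in (429, 503):
                return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
        delay = base_delay * (2 ** attempt)     # 2s, 4s, 8s, 16s ...
        print(f"Blocked or unavailable; retrying in {delay:.0f}s")
        time.sleep(delay)
    return None                                 # give up after max_retries attempts

page = fetch_with_backoff("https://example.com/")
print("Fetched" if page is not None else "Gave up")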

4. Were there any unexpected errors or failures during the crawling process, and how were they
addressed?
Unexpected errors, like system crashes or incomplete data extraction, could arise during the
crawling process. These issues should be identified and resolved promptly through debugging
or implementing preventive measures.

5. What improvements or optimizations can be made to enhance the crawler’s performance, accuracy, and efficiency in future iterations?

Enhancing the crawler’s performance can involve optimizing algorithms, increasing
adaptability to diverse web structures, reducing resource usage, and incorporating techniques
to handle dynamic or protected content efficiently. These improvements ensure sustainability
and accuracy.

Faculty In-charge: Mr. Vivian Lobo
