
Department of Computer Engineering

Name of the Student: Raj Singh        Roll Number: O129

SAP ID: 60004210188    Division: C3    Batch: C32

Subject: WEB INTELLIGENCE

DATE OF PERFORMANCE: 03/04/2025    DATE OF SUBMISSION: 03/04/2025

EXPERIMENT NO: 4

AIM: Design a crawler to gather web information (CO2)


SOFTWARE/IDE USED: Google Colab/Jupyter Notebook
THEORY:
1. What is a web crawler and where is it used?
A web crawler is an automated program or bot that systematically browses the
internet to collect data from websites. It retrieves web pages, extracts content, and
indexes information for search engines. Crawlers are mainly used by search engines
like Google to update their indexes and gather relevant data.
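
As a quick illustration of this retrieve-and-extract cycle, the minimal sketch below fetches a single page and lists its outgoing links. It assumes the requests and beautifulsoup4 packages are installed, and the seed URL is only a placeholder.

# Minimal sketch of the fetch-and-extract step a crawler repeats for every page.
# The seed URL is a placeholder; requests and beautifulsoup4 are assumed installed.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

seed = "https://example.com"  # placeholder starting page
html = requests.get(seed, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

print("Title:", soup.title.string if soup.title else "No Title")
# Collect outgoing links, resolving relative URLs against the current page
links = {urljoin(seed, a["href"]) for a in soup.find_all("a", href=True)}
print(f"Found {len(links)} links to crawl next")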

2. Explain with a tree diagram the taxonomy of web crawlers.

Web Crawlers
├── General Crawlers
├── Focused Crawlers
├── Incremental Crawlers
└── Deep Web Crawlers

• General Crawlers: Crawl all pages, covering the entire web.
• Focused Crawlers: Target specific types of content or domains.
• Incremental Crawlers: Continuously update previously crawled data.
• Deep Web Crawlers: Crawl hidden web content, like databases or dynamic pages.

3. What are basic crawlers?


Basic crawlers are simple bots that retrieve web pages by following links. They operate with
minimal filtering, usually indexing all pages they encounter without prioritizing content. These
crawlers may not consider the relevance or type of content, making them less efficient for
specialized tasks.
4. State several implementation issues in web crawlers.
• Scalability: Handling large volumes of web data and managing crawling across
millions of pages.
• Politeness: Avoiding overloading web servers with too many requests in a short
time.
• Freshness: Ensuring crawled data is up-to-date.
• Duplicate Content: Managing and filtering out duplicate web pages (a politeness-and-deduplication sketch follows this list).
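
As a rough illustration of the politeness and duplicate-content points above, a crawler can keep a per-host timestamp to space out requests and a set of canonical URLs to skip repeats. The one-second delay and the URLs in this sketch are illustrative assumptions, not fixed requirements.

# Sketch: per-host politeness delay plus simple URL de-duplication.
# The delay value and the URLs are illustrative assumptions.
import time
from urllib.parse import urlparse, urldefrag

CRAWL_DELAY = 1.0          # seconds to wait between requests to the same host
last_hit = {}              # host -> time of last request
seen = set()               # canonical URLs already fetched

def polite_and_new(url):
    """Return True if the URL is new and enough time has passed for its host."""
    canonical, _ = urldefrag(url)          # drop #fragments so duplicates collapse
    if canonical in seen:
        return False
    host = urlparse(canonical).netloc
    wait = CRAWL_DELAY - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)                   # politeness: pause before re-hitting the host
    last_hit[host] = time.time()
    seen.add(canonical)
    return True

for url in ["https://example.com/a", "https://example.com/a#top", "https://example.com/b"]:
    print(url, "->", "fetch" if polite_and_new(url) else "skip")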

5. How are universal crawlers different from preferential (i.e., focused and topical) crawlers?
• Universal Crawlers: These crawlers aim to index all available content on the web, with no bias toward particular types of pages or topics. They are designed for general-purpose search engines.

• Preferential Crawlers: These are specialized crawlers that focus on particular topics, websites, or content types. They prioritize certain categories of pages based on relevance or keywords (focused and topical), as illustrated by the relevance-scoring sketch below.
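
A preferential crawler typically scores each candidate page for topical relevance and visits the highest-scoring candidates first. The sketch below is a minimal illustration using a keyword-overlap score and a priority queue; the topic keywords, URLs, and snippets are assumptions made for the example.

# Sketch of preferential (focused/topical) crawling: score pages by keyword overlap
# and visit the most relevant candidates first. Keywords and snippets are invented.
import heapq

TOPIC_KEYWORDS = {"crawler", "search", "index", "web"}   # assumed topic vocabulary

def relevance(text):
    """Fraction of topic keywords that appear in the page text (0.0 to 1.0)."""
    words = set(text.lower().split())
    return len(TOPIC_KEYWORDS & words) / len(TOPIC_KEYWORDS)

# Max-heap via negated scores: the most relevant candidate is popped first.
frontier = []
candidates = {
    "https://example.com/search-engines": "how a web crawler builds a search index",
    "https://example.com/cooking":        "recipes for pasta and soup",
}
for url, snippet in candidates.items():
    heapq.heappush(frontier, (-relevance(snippet), url))

while frontier:
    score, url = heapq.heappop(frontier)
    print(f"{url} (relevance {abs(score):.2f})")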

6. Explain the 3D performance matrix used to evaluate the effectiveness of preferential crawlers,
considering its three key dimensions: Target Pages (Y-axis), Target Descriptions (X-axis), and
Target Depth (Z-axis).

Y-axis (Target Pages): Represents the number of web pages that are successfully
retrieved and indexed.

X-axis (Target Descriptions): Measures the quality of the content retrieved, ensuring it
matches the crawler's purpose (e.g., specific topics).

Z-axis (Target Depth): Refers to how deep into a site the crawler goes (how many levels
of links it follows). Effective crawlers maximize pages, quality, and depth without
overextending resources.
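
As a rough illustration of how these three dimensions might be summarised for a single crawl, the function below reports the number of pages retrieved, the fraction matching the target description, and the average link depth. The record format and sample data are assumptions made for this sketch, not a standard evaluation API.

# Sketch: summarising a crawl along the three evaluation dimensions.
# Each record is (url, matches_target_description, depth); the sample data is invented.
def evaluate_crawl(records):
    pages = len(records)                                   # Y-axis: target pages retrieved
    on_topic = sum(1 for _, match, _ in records if match)  # X-axis: description matches
    avg_depth = sum(d for _, _, d in records) / pages      # Z-axis: average link depth
    return {
        "pages_retrieved": pages,
        "description_precision": on_topic / pages,
        "average_depth": avg_depth,
    }

sample = [
    ("https://example.com/", True, 0),
    ("https://example.com/about", True, 1),
    ("https://example.com/blog/cats", False, 2),
]
print(evaluate_crawl(sample))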

7. Elaborate on crawler ethics and conflicts.


• Ethical Issues: Crawlers must respect website robots.txt files, avoid server overload, and not scrape private or sensitive data without consent. They should be transparent about their actions and limit the impact on web servers; a robots.txt check is sketched after this list.
• Conflicts: Ethical conflicts arise when crawlers violate site policies, scrape
copyrighted material, or interfere with the normal operation of websites. There's also
the issue of data ownership and whether it's fair to use crawled data for commercial
purposes.
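
For the robots.txt point above, Python's standard-library urllib.robotparser can check whether a given user agent is allowed to fetch a URL before the crawler requests it. The site and user-agent string below are placeholders for illustration.

# Sketch: respecting robots.txt with the standard-library robot parser.
# The site and user-agent string are placeholders.
from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

site = "https://example.com/"
rp = RobotFileParser()
rp.set_url(urljoin(site, "/robots.txt"))
rp.read()                                  # downloads and parses the robots.txt file

for path in ["/", "/private/reports"]:
    url = urljoin(site, path)
    if rp.can_fetch("MyCrawler/1.0", url):
        print("Allowed:", url)
    else:
        print("Disallowed by robots.txt:", url)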

IMPLEMENTATION:
Code:
import requests
from bs4 import BeautifulSoup
import csv
import time
import random
from urllib.parse import urlparse, urljoin


# Define the Crawler class
class WebCrawler:
    def __init__(self, base_url, max_depth=3, user_agent="Mozilla/5.0"):
        """
        Initialize the WebCrawler.

        :param base_url: The URL to start the crawling from.
        :param max_depth: The maximum depth to which the crawler will follow links.
        :param user_agent: The user-agent string to mimic a browser.
        """
        self.base_url = base_url
        self.max_depth = max_depth
        self.user_agent = user_agent
        self.visited = set()
        # Queue of (url, depth) pairs so the depth limit is enforced per page
        self.to_visit = [(base_url, 0)]
        self.data = []

    def fetch_page(self, url):
        """
        Fetch the content of a web page.

        :param url: The URL of the page to fetch.
        :return: The page content as a string, or None if the page cannot be fetched.
        """
        try:
            headers = {'User-Agent': self.user_agent}
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response.text
            else:
                print(f"Failed to retrieve {url} (Status code: {response.status_code})")
                return None
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None

    def parse_links(self, page_html, base_url):
        """
        Parse the HTML content of a page to extract all links.

        :param page_html: The raw HTML content of the page.
        :param base_url: The base URL for resolving relative links.
        :return: A set of full URLs.
        """
        soup = BeautifulSoup(page_html, 'html.parser')
        links = set()

        # Find all anchor tags with href attributes
        for link in soup.find_all('a', href=True):
            href = link.get('href')
            full_url = urljoin(base_url, href)  # Resolve relative links
            parsed_url = urlparse(full_url)

            # Keep only HTTP/HTTPS URLs
            if parsed_url.scheme in ['http', 'https']:
                links.add(full_url)

        return links

    def extract_page_data(self, page_html):
        """
        Extract useful information (title, meta description, paragraph text) from the page.

        :param page_html: The raw HTML content of the page.
        :return: A dictionary with the extracted information.
        """
        soup = BeautifulSoup(page_html, 'html.parser')

        title = soup.title.string if soup.title else 'No Title'

        # Extract the meta description if available
        description = ''
        meta_tag = soup.find('meta', attrs={'name': 'description'})
        if meta_tag and meta_tag.get('content'):
            description = meta_tag['content']

        # Extract all paragraph text (cleaned)
        text = ' '.join(p.get_text() for p in soup.find_all('p'))

        return {'title': title, 'description': description, 'text': text}

    def crawl(self):
        """
        Start the crawling process (breadth-first, up to max_depth levels of links).
        """
        while self.to_visit:
            current_url, depth = self.to_visit.pop(0)
            if current_url in self.visited or depth >= self.max_depth:
                continue

            print(f"Crawling: {current_url}")
            self.visited.add(current_url)
            page_html = self.fetch_page(current_url)

            if page_html:
                # Extract data from the page
                page_data = self.extract_page_data(page_html)
                self.data.append(page_data)

                # Parse links on the page and queue them at the next depth level
                links = self.parse_links(page_html, current_url)
                self.to_visit.extend((link, depth + 1) for link in links)

            # Sleep for a random time to avoid overloading the server
            time.sleep(random.uniform(1, 3))

    def save_data(self, filename='crawled_data.csv'):
        """
        Save the collected data to a CSV file.

        :param filename: The name of the CSV file to save the data.
        """
        if not self.data:
            print("No data collected; nothing to save.")
            return
        keys = self.data[0].keys()
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(self.data)


# Example Usage:
if __name__ == "__main__":
    start_url = 'https://example.com'  # Change to your desired start URL
    crawler = WebCrawler(base_url=start_url, max_depth=2)

    # Start crawling
    crawler.crawl()

    # Save the gathered data to a CSV file
    crawler.save_data('crawled_data.csv')
    print("Crawling finished, data saved to 'crawled_data.csv'")

Output:
title,description,text
"Example Domain","Example Domain is an example for documentation purposes.","This domain is established to be used for illustrative examples in documents."
"About Us - Example Domain","This is the about us page.","This is the about us section of Example Domain."
"Contact Us - Example Domain","Get in touch with us for more information.","You can reach out to us through the contact page on our site."

CONCLUSION: Evaluating a crawler's performance involves analyzing its efficiency, accuracy, and ability to gather relevant data while tackling technical challenges. Addressing errors promptly ensures smooth functionality throughout the process. Implementing future improvements, such as optimized algorithms and greater adaptability, can significantly enhance overall efficiency. These steps keep the crawler robust, accurate, and resource-friendly for future applications.

POST-EXPERIMENTAL EXERCISE:
1. How efficiently did the crawler perform in terms of speed, resource usage, and data retrieval?
The crawler's efficiency can be assessed by its speed in retrieving data, minimal consumption
of system resources, and its ability to extract accurate and complete information from the
targeted sources. High efficiency ensures optimal performance without overburdening the
system.

2. Did the crawler successfully collect the intended data, and how relevant was the information
gathered?
Evaluating whether the crawler successfully gathered the intended data involves verifying the
quality, accuracy, and relevance of the information collected. The data should align with
project goals and avoid unnecessary or unrelated content.

3. Were there any technical challenges, such as handling dynamic content, CAPTCHA restrictions,
or blocked access?
The crawler might encounter issues such as handling dynamic content that changes
frequently, overcoming CAPTCHA restrictions, or managing access limitations on certain
websites. Addressing these challenges requires tailored strategies and robust solutions.
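
One common mitigation for temporary blocks (HTTP 429 or 503 responses) is to retry with exponential backoff rather than re-requesting immediately. The sketch below shows this generic pattern with the requests library; the retry count, delays, and URL are arbitrary choices for illustration.

# Sketch: retrying a blocked or rate-limited request with exponential backoff.
# The retry count, delays, and URL are illustrative choices, not fixed requirements.
import time
import requests

def fetch_with_backoff(url, max_retries=4, base_delay=2.0):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            # 429 (Too Many Requests) and 503 (Service Unavailable) are worth retrying
            if response.status_code not in (429, 503):
                return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
        delay = base_delay * (2 ** attempt)     # 2s, 4s, 8s, 16s ...
        print(f"Blocked or unavailable; retrying in {delay:.0f}s")
        time.sleep(delay)
    return None                                 # give up after max_retries attempts

page = fetch_with_backoff("https://example.com/")
print("Fetched" if page is not None else "Gave up")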

4. Were there any unexpected errors or failures during the crawling process, and how were they
addressed?
Unexpected errors, like system crashes or incomplete data extraction, could arise during the
crawling process. These issues should be identified and resolved promptly through debugging
or implementing preventive measures.

5. What improvements or optimizations can be made to enhance the crawler’s performance, accuracy, and efficiency in future iterations?

Enhancing the crawler’s performance can involve optimizing algorithms, increasing
adaptability to diverse web structures, reducing resource usage, and incorporating techniques
to handle dynamic or protected content efficiently. These improvements ensure sustainability
and accuracy.

Faculty In-charge: Mr. Vivian Lobo
