RajSingh WIexp4
Subject: WEB INTELLIGENCE
EXPERIMENT NO: 4
5. How are universal crawlers different from preferential (i.e., focused and topical) crawlers?
• Universal Crawlers: These crawlers aim to index all available content on the web, with no bias toward particular types of pages or topics. They are designed for general-purpose search engines.
• Preferential (Focused/Topical) Crawlers: These crawlers concentrate on pages relevant to a specific topic or set of target descriptions. Instead of visiting links in the order they are discovered, they estimate each candidate URL's relevance and fetch the most promising links first, as the sketch below illustrates.
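The difference is easiest to see in how the crawl frontier is ordered. The following is a minimal illustrative sketch (not part of the experiment code): a universal crawler uses a plain FIFO queue, while a preferential crawler ranks candidate links by an estimated relevance score. The score_relevance() helper and the topic keywords are hypothetical placeholders.

from collections import deque
import heapq

TOPIC_KEYWORDS = {"machine", "learning", "ai"}  # hypothetical topic, for illustration only

def score_relevance(url, anchor_text):
    # Toy relevance estimate: count topic keywords appearing in the URL or anchor text.
    text = (url + " " + anchor_text).lower()
    return sum(1 for word in TOPIC_KEYWORDS if word in text)

# Universal crawler: FIFO frontier, every discovered link is treated equally.
universal_frontier = deque()
universal_frontier.append("https://example.com/page1")
next_url = universal_frontier.popleft()

# Preferential crawler: priority queue, the most relevant link is fetched first.
preferential_frontier = []
heapq.heappush(preferential_frontier,
               (-score_relevance("https://example.com/ml", "machine learning"), "https://example.com/ml"))
heapq.heappush(preferential_frontier,
               (-score_relevance("https://example.com/contact", "contact us"), "https://example.com/contact"))
_, best_url = heapq.heappop(preferential_frontier)  # the machine-learning page comes out first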
6. Explain the 3D performance matrix used to evaluate the effectiveness of preferential crawlers,
considering its three key dimensions: Target Pages (Y-axis), Target Descriptions (X-axis), and
Target Depth (Z-axis).
• Y-axis (Target Pages): Represents the number of relevant web pages that are successfully retrieved and indexed.
• X-axis (Target Descriptions): Measures how well the retrieved content matches the crawler's purpose (e.g., the specific topic descriptions it was given).
• Z-axis (Target Depth): Refers to how deep into a site the crawler goes (how many levels of links it follows).
An effective preferential crawler maximizes target pages, description match, and depth without overextending resources; the sketch after this list shows how the three measures can be summarized for a finished crawl.
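To make the three dimensions measurable, the crawler's output can be summarized into one number per axis. The sketch below is only illustrative and assumes each crawled record carries its URL, extracted text, and crawl depth (the implementation in the next section would need extract_page_data extended to store these fields); target_urls and description_keywords are hypothetical inputs.

def evaluate_crawl(crawled, target_urls, description_keywords):
    # Summarize a finished crawl along the three axes of the performance matrix.
    # crawled: list of dicts like {'url': ..., 'text': ..., 'depth': ...}
    pages_hit = sum(1 for page in crawled if page['url'] in target_urls)  # Y-axis: target pages
    def match(text):
        words = set(text.lower().split())
        return sum(1 for kw in description_keywords if kw.lower() in words) / max(len(description_keywords), 1)
    avg_match = sum(match(page['text']) for page in crawled) / max(len(crawled), 1)  # X-axis: description match
    deepest = max((page['depth'] for page in crawled), default=0)  # Z-axis: depth reached
    return {'target_pages': pages_hit, 'target_description': avg_match, 'target_depth': deepest}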
IMPLEMENTATION:
Code:
import requests
from bs4 import BeautifulSoup
import csv
import time
import random
from urllib.parse import urlparse, urljoin

class WebCrawler:
    def __init__(self, base_url, max_depth=2):
        self.base_url = base_url
        self.max_depth = max_depth
        self.visited = set()  # URLs already crawled
        self.data = []        # extracted page records

    def extract_links(self, html, current_url):
        # Collect absolute links that stay on the start domain.
        soup = BeautifulSoup(html, 'html.parser')
        links = [urljoin(current_url, a['href']) for a in soup.find_all('a', href=True)]
        links = [link for link in links if urlparse(link).netloc == urlparse(self.base_url).netloc]
        return links

    def extract_page_data(self, page_html):
        # Record the page title; extend with any other fields of interest.
        soup = BeautifulSoup(page_html, 'html.parser')
        return {'title': soup.title.string.strip() if soup.title and soup.title.string else ''}

    def crawl(self):
        """
        Start the crawling process.
        """
        queue = [(self.base_url, 0)]  # breadth-first frontier of (url, depth) pairs
        while queue:
            url, depth = queue.pop(0)
            if url in self.visited or depth > self.max_depth:
                continue
            self.visited.add(url)
            try:
                page_html = requests.get(url, timeout=10).text
            except requests.RequestException:
                page_html = None
            if page_html:
                # Extract data from the page
                page_data = self.extract_page_data(page_html)
                self.data.append(page_data)
                queue.extend((link, depth + 1) for link in self.extract_links(page_html, url))
            time.sleep(random.uniform(1, 3))  # polite random delay between requests

    def save_to_csv(self, filename='crawl_results.csv'):
        """
        Save the collected data to a CSV file.
        :param filename: The name of the CSV file to save the data.
        """
        keys = self.data[0].keys()
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(self.data)

# Example Usage:
if __name__ == "__main__":
    start_url = 'https://example.com'  # Change to your desired start URL
    crawler = WebCrawler(base_url=start_url, max_depth=2)
    # Start crawling
    crawler.crawl()
    crawler.save_to_csv()  # Save the collected data
POST-EXPERIMENTAL EXERCISE:
1. How efficiently did the crawler perform in terms of speed, resource usage, and data retrieval?
The crawler's efficiency can be assessed by how quickly it retrieves pages (e.g., pages crawled per second), how much CPU and memory it consumes while running, and whether it extracts accurate and complete information from the targeted sources. An efficient crawler achieves good coverage within its depth limit without overburdening either the local system or the target site; a simple way to gather these figures is sketched below.
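These figures can be collected while the crawler runs. The sketch below is a rough, illustrative measurement harness around the WebCrawler class from the implementation section; it uses only the standard library and reports elapsed time, throughput, and peak Python memory allocations.

import time
import tracemalloc

def measure_crawl(crawler):
    # Time the crawl and record peak memory to obtain rough speed/resource figures.
    tracemalloc.start()
    start = time.perf_counter()
    crawler.crawl()
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    pages = len(crawler.data)
    print(f"Crawled {pages} pages in {elapsed:.1f}s "
          f"({pages / elapsed:.2f} pages/s), peak memory {peak_bytes / 1e6:.1f} MB")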
2. Did the crawler successfully collect the intended data, and how relevant was the information
gathered?
Evaluating whether the crawler successfully gathered the intended data involves verifying the
quality, accuracy, and relevance of the information collected. The data should align with
project goals and avoid unnecessary or unrelated content.
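A quick automated spot-check can complement manual review; for example, the fraction of collected records whose title mentions at least one topic keyword. The keywords below are hypothetical and would be replaced by the terms that define the project's goal.

def relevance_ratio(records, keywords):
    # Fraction of collected records whose title mentions at least one topic keyword.
    keywords = [kw.lower() for kw in keywords]
    relevant = sum(1 for record in records
                   if any(kw in record.get('title', '').lower() for kw in keywords))
    return relevant / len(records) if records else 0.0

# Example: relevance_ratio(crawler.data, ['web', 'intelligence'])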
3. Were there any technical challenges, such as handling dynamic content, CAPTCHA restrictions,
or blocked access?
The crawler might encounter issues such as dynamic content rendered by JavaScript (which requests and BeautifulSoup cannot execute), CAPTCHA challenges, or blocked access caused by rate limiting and robots.txt restrictions. Addressing these challenges requires tailored strategies, such as using a headless browser for dynamic pages, respecting robots.txt, and retrying politely when a site throttles requests; a retry sketch follows below.
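Blocked or rate-limited access, in particular, can be softened with polite retries. The sketch below is one illustrative approach (not a complete solution, and it does not handle JavaScript-rendered pages or CAPTCHAs): it backs off exponentially when the server answers with HTTP 429 or 503, using the requests library already imported in the implementation.

import time
import requests

def fetch_with_retries(url, max_retries=3):
    # Retry with exponential backoff when the server rate-limits or errors out.
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code in (429, 503):
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s ...
                continue
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(2 ** attempt)
    return None  # give up after max_retries attempts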
4. Were there any unexpected errors or failures during the crawling process, and how were they
addressed?
Unexpected errors, like system crashes or incomplete data extraction, could arise during the
crawling process. These issues should be identified and resolved promptly through debugging
or implementing preventive measures.
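Such failures are easiest to diagnose when every error is caught and logged instead of silently stopping the crawl. A minimal sketch using the standard logging module, wrapped around the WebCrawler class from the implementation section:

import logging

logging.basicConfig(filename='crawl_errors.log', level=logging.WARNING)

def safe_crawl(crawler):
    # Run the crawl, log any unexpected failure with its traceback, and keep partial results.
    try:
        crawler.crawl()
    except Exception:
        logging.exception("Crawl aborted by an unexpected error")
    finally:
        if crawler.data:
            crawler.save_to_csv()  # save whatever was collected before the failure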