DCIM 216 Summer 2023 Lab 9: Web Scrapers and Spiders
Searching the Web: Ethically?
William R. Schroeder Jr.
CCBC
Summer 2023
School of Business, Technology and Law
Computer Science / Information Technology
CSIT 216 Python Programming W01 (53218)
Web scraping: legal, a security risk, or an ethical challenge? In this paper, we will explore the differences between web crawlers and web scrapers and describe the legal issues, ethical issues, and security risks involved. The Python language lends itself well to web scraping: it comes with many tools that simplify the process through packages like Beautiful Soup. There are many free tutorials on the web that will walk you through it step by step, such as "Beautiful Soup: Build a Web Scraper with Python" by Martin Breuss on the Real Python website, and there are videos on YouTube that can give you a visual grasp of the possibilities of Python programming. Python is the language of choice for many modern programmers due to its flexibility and ease of access.
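To make that concrete, here is a minimal sketch of the kind of script those tutorials build toward. It assumes the third-party requests and beautifulsoup4 packages are installed, and the URL is only a placeholder:

    # Fetch a page and pull out its title and links with Beautiful Soup.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com", timeout=10)
    response.raise_for_status()  # stop early on a 4xx/5xx answer

    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.title.string)            # the page title
    for anchor in soup.find_all("a", href=True):
        print(anchor["href"])           # every hyperlink on the page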
A few primary issues define the difference between a web crawler and a web scraper; basic usage, and the entities who work with these programs, may have very different objectives. Web crawlers are typically considered more mainstream and commonplace in data searches; they power services like Google, Bing, and DuckDuckGo. Web scrapers, on the other hand, are considered more invasive and are capable of digging much deeper into the data of an HTML site. They are frequently used by "black hat" operatives and others who are more adept with data technology.
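The distinction shows up clearly in code. Below is a rough sketch, not production software: the crawler walks outward from a starting URL collecting page addresses, while the scraper digs specific fields out of a single page. It again assumes requests and beautifulsoup4, and the URLs passed in are placeholders.

    from collections import deque
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_pages=10):
        """Crawler: discover pages by following links outward."""
        seen, queue = set(), deque([start_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            html = requests.get(url, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a", href=True):
                queue.append(urljoin(url, a["href"]))
        return seen

    def scrape(url):
        """Scraper: extract specific fields (here, subheadings) from one page."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return [h.get_text(strip=True) for h in soup.find_all("h2")]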
Legality of Web Scraping
This is important: do we have to worry about the government locking us up in our search for information? Debatable. Some activity could be considered borderline, but here is the word: web scraping is legal. From an article in TechRadar dated April 19, 2022:
“Scraping public data is legal, the U.S. Ninth Circuit of Appeals has ruled in a potentially
landmark decision.
The decision follows a ruling by a federal court of appeals that reaffirmed its earlier decision,
notably that web scraping (data harvesting, en masse) of data that’s made available to the general
public, does not violate the Computer Fraud and Abuse Act (CFAA). The CFAA is used to
determine what can be described as “hacking” under US law.”
So, web scraping was determined to be legal in a case between LinkedIn and hiQ Labs, a talent-management analytics firm. The case called into question what web scraping is; the court found it did not fit the description of "hacking" under the CFAA.
However, the McCarthy Law Group (whose owner states he is both a Python programmer and a lawyer) asserts in a newer article that the legal landscape is in a constant state of change. They state,
“There are a few websites online that purport to answer the question of “whether web scraping
is legal.” And way too many of those websites, with unwavering confidence and a complete
absence of caution, provide clear and concise answers to that question that are laughably and
dangerously false.”
“Most casual observers of this topic feel inclined to describe web scraping as a “gray
area of the law.” But that’s not really correct, either. Legal interpretations of the CFAA are a
mess, admittedly. But that’s not the most heavily litigated issue with web scraping anymore.
Today, most sophisticated companies looking to enforce web-scraping claims do so under
breach of contract, misappropriation, and intellectual property and quasi-intellectual property
theories of law. CFAA claims are only the primary focus in web-scraping litigation in unique
circumstances, such as when jurisdictional issues prevent the plaintiff from pursuing easier
claims.” ‘Nuff said. People should be aware that their data is available with or without express
permission. On the other hand, this ruling is good news for archivists, academics, researchers,
and journalists… and “hackers”!
Ethics of Web Scraping
James Densmore, in an article titled "Ethics in Web Scraping," gives a list of rules for staying within the bounds of ethical behavior. I believe that, for the most part, following it will spare ethical web scrapers (those who could also be considered "ethical hackers") legal issues as well; two of the rules are sketched in code after the list.
If you have a public API that provides the data I’m looking for, I’ll use it and avoid
scraping altogether.
I will always provide a User Agent string that makes my intentions clear and provides
a way for you to contact me with questions or concerns.
I will request data at a reasonable rate. I will strive to never be confused for a DDoS
attack.
I will only save the data I absolutely need from your page. If all I need is OpenGraph
meta-data, that’s all I’ll keep.
I will respect any content I do keep. I’ll never pass it off as my own.
I will look for ways to return value to you. Maybe I can drive some (real) traffic to
your site or credit you in an article or post.
I will respond in a timely fashion to your outreach and work with you towards a
resolution.
I will scrape for the purpose of creating new value from the data, not to duplicate it.
I will allow ethical scrapers to access my site as long as they are not a burden on my
site’s performance.
I will respect transparent User Agent strings rather than blocking them and
encouraging use of scrapers masked as human visitors.
I will reach out to the owner of the scraper (thanks to their ethical User Agent string)
before blocking permanently. A temporary block is acceptable in the case of site
performance or ethical concerns.
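As a rough illustration of two of these rules (the transparent User-Agent string and the reasonable request rate), here is a minimal sketch; the bot name, contact address, and URLs are placeholders, not real endpoints:

    import time
    import requests

    HEADERS = {
        # Identify yourself and give the site owner a way to reach you.
        "User-Agent": "my-research-bot/1.0 (contact: researcher@example.com)"
    }

    for url in ["https://example.com/page1", "https://example.com/page2"]:
        response = requests.get(url, headers=HEADERS, timeout=10)
        # ... keep only the data you actually need from response.text ...
        time.sleep(2)  # a reasonable rate; never be confused for a DDoS attack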
Scraping is still a serious threat, open to all kinds of abuse. So there must be a way to stop it, right? Well, sort of. An experienced hacker will get into your system unless you harden it against intrusion. Your webmaster has to be skilled at protecting your data while letting the information you want to distribute flow unimpeded. There are several methods to harden your system against attackers.
There are steps and measures you can put in place to make life more difficult for data scrapers. Among the notable methods: Terms of Use and Conditions, disabling hotlinking, CSRF tokens, rate-limiting page requests, dedicated anti-scraping software, requiring human interaction (CAPTCHA or other challenge-response tests), tight-lipped APIs, and decoy links are all employed as basic security measures.
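As one example from that list, here is a sketch of rate-limiting page requests per client IP, written with the Flask package; the 30-requests-per-minute threshold is an arbitrary assumption:

    import time
    from collections import defaultdict, deque
    from flask import Flask, abort, request

    app = Flask(__name__)
    WINDOW_SECONDS = 60
    MAX_REQUESTS = 30
    hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    @app.before_request
    def rate_limit():
        now = time.time()
        window = hits[request.remote_addr]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()   # forget requests older than the window
        window.append(now)
        if len(window) > MAX_REQUESTS:
            abort(429)         # Too Many Requests

    @app.route("/")
    def index():
        return "Hello, polite visitor."

A sliding window like this forgets old requests, so a patient, well-behaved scraper is never punished.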
So, after all this, what can you do as an "ethical web scraper" to stay legal and not get blacklisted from the web? Consider the website owner and respect their wishes. There are many tips in "How to Scrape Websites Without Getting Blocked" on the ScrapeHero website:
“Websites can use different mechanisms to detect a scraper/spider from a normal user. Some of
these methods are enumerated below:
1. Unusual traffic/high download rate especially from a single client/or IP address within
a short time span.
2. Repetitive tasks performed on the website in the same browsing pattern – based on
an assumption that a human user won’t perform the same repetitive tasks all the time.
Lab 9: Web Scrapers and 6
Spiders
3. Checking if you are a real browser – A simple check is to try and execute JavaScript.
Smarter tools can go a lot further and check your graphics cards and CPUs 😉 to make sure
you are coming from a real browser.
4. Detection through honeypots – these honeypots are usually links which aren’t visible to
a normal user but only to a spider. When a scraper/spider tries to access the link, the
alarms are tripped.”
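A honeypot of that kind takes only a few lines to set up. The sketch below, again using Flask, hides a decoy link from human visitors and bans any client that follows it; the route name and the in-memory ban list are illustrative only:

    from flask import Flask, abort, request

    app = Flask(__name__)
    banned_ips = set()  # in-memory ban list, for illustration only

    @app.route("/")
    def index():
        if request.remote_addr in banned_ips:
            abort(403)
        # The decoy link is in the HTML but invisible to human visitors.
        return ('<a href="/trap" style="display:none">do not follow</a>'
                'Welcome to the site.')

    @app.route("/trap")
    def trap():
        banned_ips.add(request.remote_addr)  # the alarm is tripped
        abort(403)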
Basically, we need to accept that we are guests of the website and use web-scraping best practices to scrape without getting blocked; the first two tips below are sketched in code after the list. It comes down to proper net etiquette: if you went to a friend's house, you wouldn't ransack their library without asking first.
Respect Robots.txt
Make the crawling slower, do not slam the server, treat websites nicely.
Do not follow the same crawling pattern.
Make requests through Proxies and rotate them as needed.
Rotate User Agents and corresponding HTTP Request Headers between requests.
Use a headless browser like Puppeteer, Selenium or Playwright
Beware of Honey Pot Traps
Check if Website is Changing Layouts
Avoid scraping data behind a login.
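Here is a sketch of the first two tips in practice: it checks robots.txt through the standard library's urllib.robotparser before each fetch, and paces requests with a randomized delay. The bot name and URLs are placeholders:

    import random
    import time
    import urllib.robotparser
    import requests

    AGENT = "my-polite-bot/1.0"
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()

    def polite_get(url):
        if not robots.can_fetch(AGENT, url):
            return None  # the site has asked us not to fetch this path
        time.sleep(random.uniform(2, 5))  # slow, irregular pacing
        return requests.get(url, headers={"User-Agent": AGENT}, timeout=10)

    page = polite_get("https://example.com/articles")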
References
Breuss, Martin. "Beautiful Soup: Build a Web Scraper with Python." Real Python. https://realpython.com/beautiful-soup-web-scraper-python/
Densmore, James. "Ethics in Web Scraping."
Ryte Wiki. "Web Crawler: What is a web crawler and how does it work?" (ryte.com)
Fadilpasic, Sead. "US court says web scraping is officially legal." TechRadar, April 19, 2022. https://www.techradar.com/news/us-court-says-web-scraping-is-officially-legal
McCarthy Law Group. "Legal Rules vs. Legal Norms for Web Scraping." October 17, 2022. https://mccarthylg.com/legal-rules-vs-legal-norms-for-web-scraping/
McKay, Dave. "What is Data Scraping, And Why Is It a Threat?" July 13, 2021. https://www.howtogeek.com/devops/what-is-data-scraping-and-why-is-it-a-threat/
ScrapeHero. "How to Scrape Websites Without Getting Blocked." https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/