
DCIM 216 Summer 2023 #Lab 9 Web Scrapers and Spiders

This document discusses web scraping and spiders, exploring the legal, ethical, and security issues involved. It defines the differences between web crawlers and scrapers, noting that while crawlers like Google search innocuously, scrapers can deeply mine data and are more open to abuse. The document reviews debates around the legality of scraping public data and discusses principles of ethical scraping and owning sites. It also outlines security measures sites can take to limit unauthorized scraping while allowing proper access.

Lab 9: Web Scrapers and Spiders
Searching the web-Ethically?
William R Schroeder Jr.


Community College of Baltimore County

CCBC
Summer 2023
School of Business Technology and Law
Computer Science Information Technology
CSIT 216 Python Programming W01(53218)

Searching the web-Ethically?

Web scraping: legal issue, security risk, or ethical challenge? In this paper, we will
explore the differences between web crawlers and web scrapers, and describe the legal
issues, ethical issues, and security risks involved. The Python language lends itself well
to web scraping, and third-party packages like Beautiful Soup simplify the process. There
are many free tutorials on the web that walk you through it step by step, like “Beautiful
Soup: Build a Web Scraper With Python” by Martin Breuss on the Real Python website. There
are also videos on YouTube that give a visual grasp of the possibilities of Python
programming. Python is the language of choice for many modern programmers due to its
flexibility and ease of access.
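To give a feel for what a scraper does with a page, here is a minimal sketch that pulls every link out of a snippet of HTML. It uses only the standard library's html.parser module as a stand-in for Beautiful Soup, which offers the same idea with a friendlier interface. The sample page, the LinkCollector class, and the extract_links function are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from the <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) pairs found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content; a real scraper would download this first.
SAMPLE_PAGE = """
<html><body>
  <h1>Job Board</h1>
  <a href="/jobs/1">Python Developer</a>
  <a href="/jobs/2">Data <b>Analyst</b></a>
</body></html>
"""

print(extract_links(SAMPLE_PAGE))
```

The parser never touches the network, so it can be tried safely on saved pages before pointing it at a live site.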

There are some primary issues defining the difference between a web crawler and a
web scraper. The programs, and the entities who work with them, may have very
different objectives. Web crawlers are typically considered more mainstream and
commonplace in data searches. They power services like Google, Bing, and DuckDuckGo.
Web scrapers, on the other hand, are considered more invasive, and are capable of digging
much deeper into the data of an HTML site. They are frequently used by “Black Hat”
operatives, and others who are more adept with data technology.

A crawler is a much more innocuous program. It is much simpler in that it merely
searches the web for the purpose of collecting and storing data. A website can tell a
crawler not to access any part of a page, the complete page, or any given type of data
by means of a file called robots.txt. Crawlers that honor the file will then stay out of
those areas during general browsing. This will not prevent indexing of the website, though.
If you don’t want your website indexed, you must use a more complicated mechanism, such as
the “noindex” directive or the canonical tag. A crawler mostly deals with metadata that is
not visible to the user at first glance. It basically maps the layout of a webpage and is
able to collect individualized data for the client.
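As a rough illustration of how a crawler consults robots.txt before fetching a page, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the bot name "ExampleBot", and the example.com URLs are assumptions made up for this example, and the file is parsed from a string so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a site owner might publish.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(SAMPLE_ROBOTS_TXT)

# A polite crawler asks before every request.
print(rules.can_fetch("ExampleBot", "https://example.com/index.html"))
print(rules.can_fetch("ExampleBot", "https://example.com/private/report.html"))

# Seconds the site asks crawlers to wait between requests (None if not stated).
print(rules.crawl_delay("ExampleBot"))
```

Against a live site, the same parser can load the file itself with set_url() followed by read(). Note that the file only states the owner's wishes; it is up to the crawler to obey them.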

A web scraper is a heavy-duty version of a web crawler. Scraping is considered a black
hat hacker tool. Its capabilities allow it to copy data from websites and deliver the
results as content which can be used directly on the client’s own website. There are even
web scrapers which can capture a complete website and store it in its entirety.

Legality of Webscraping

This is important: do we have to worry about the government locking us up in our search for
information? Debatable. Some activity could be considered “borderline,” but...

This is the “word”: web scraping is legal. From the article in TechRadar dated April
19, 2022:

“Scraping public data is legal, the U.S. Ninth Circuit of Appeals has ruled in a potentially
landmark decision.

The decision follows a ruling by a federal court of appeals that reaffirmed its earlier decision,
notably that web scraping (data harvesting, en masse) of data that’s made available to the general
public, does not violate the Computer Fraud and Abuse Act (CFAA). The CFAA is used to
determine what can be described as “hacking” under US law.”

So, it has been determined to be legal in the case of hiQ Labs v. LinkedIn, where hiQ Labs
is a company that builds a talent management algorithm. The case calls into question what
web scraping is; the court found it did not fit the description of “hacking” under the CFAA.

However, the McCarthy Law Group (whose owner states he is both a Python
programmer and a lawyer) asserts in a newer article that the legal landscape is in a
constant state of change. They state,

“There are a few websites online that purport to answer the question of “whether web scraping
is legal.” And way too many of those websites, with unwavering confidence and a complete
absence of caution, provide clear and concise answers to that question that are laughably and
dangerously false.”

So, who do we believe? What is or is not legal is a constantly changing question
as technology advances and laws change. The bottom line is if you think you are doing
something illegal, you probably are. “Intent” is a keystone in the interpretation of any
crime. Again, McCarthy adds,

“Most casual observers of this topic feel inclined to describe web scraping as a “gray
area of the law.” But that’s not really correct, either. Legal interpretations of the CFAA are a
mess, admittedly. But that’s not the most heavily litigated issue with web scraping anymore.
Today, most sophisticated companies looking to enforce web-scraping claims do so under
breach of contract, misappropriation, and intellectual property and quasi-intellectual property
theories of law. CFAA claims are only the primary focus in web-scraping litigation unique
circumstances, such as when jurisdictional issues prevent the plaintiff from pursuing easier
claims.”

’Nuff said. People should be aware that their data is available with or without their
express permission. On the other hand, this ruling is good news for archivists,
academics, researchers, and journalists…and “hackers”!

Ethics of webscraping

James Densmore, in an article titled “Ethics in Web Scraping,” has a list of rules for
staying within the bounds of ethical behavior. I believe that, for the most part, it will
also spare ethical web scrapers (those who could also be considered “Ethical Hackers”)
legal issues.

The Ethical Scraper

I, the web scraper will live by the following principles:

If you have a public API that provides the data I’m looking for, I’ll use it and avoid
scraping all together.

I will always provide a User Agent string that makes my intentions clear and provides
a way for you to contact me with questions or concerns.

I will request data at a reasonable rate. I will strive to never be confused for a DDoS
attack.

I will only save the data I absolutely need from your page. If all I need is OpenGraph
meta-data, that’s all I’ll keep.

I will respect any content I do keep. I’ll never pass it off as my own.

I will look for ways to return value to you. Maybe I can drive some (real) traffic to
your site or credit you in an article or post.

I will respond in a timely fashion to your outreach and work with you towards a
resolution.

I will scrape for the purpose of creating new value from the data, not to duplicate it.
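Two of the principles above, an honest User Agent string and a reasonable request rate, translate directly into code. The following is a minimal sketch using only the standard library; the bot name, the contact details in USER_AGENT, the two-second interval, and the PoliteThrottle class are all hypothetical choices for illustration, not values taken from the article.

```python
import time
import urllib.request

# Hypothetical identity: says who is scraping and how to reach them.
USER_AGENT = (
    "ExampleResearchBot/0.1 "
    "(+https://example.com/bot-info; contact: bot-owner@example.com)"
)


def build_request(url):
    """Create a request that carries the transparent User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


class PoliteThrottle:
    """Make sure consecutive requests are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = PoliteThrottle(min_interval=2.0)
request = build_request("https://example.com/articles")

# In a real scraper, each download would be preceded by throttle.wait():
#     throttle.wait()
#     with urllib.request.urlopen(request, timeout=10) as response:
#         page = response.read()
```

The clock and sleep functions are passed in as parameters only so the throttle can be tested without actually waiting.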

The Ethical Site Owner

I, the site owner will live by the following principles:

I will allow ethical scrapers to access my site as long as they are not a burden on my
site’s performance.

I will respect transparent User Agent strings rather than blocking them and
encouraging use of scrapers masked as human visitors.

I will reach out to the owner of the scraper (thanks to their ethical User Agent string)
before blocking permanently. A temporary block is acceptable in the case of site
performance or ethical concerns.

I understand that scrapers are a reality of the open web.

I will consider public APIs to provide data as an alternative to scrapers.

Cybersecurity aspects of Webscraping


Can’t we all just get along? Ethical IT would save a lot of trouble, but some just can’t
contain themselves, which is why we set up security systems to keep the “Bad Actors” out.
As stated in the article “What is Data Scraping, And Why Is It a Threat?” on the How-To Geek
website,

“Scraping can be performed by cybercriminals who want to collect login credentials,
payment details, or personally identifiable information. It can also be used for
legitimate reasons such as aggregating news stories, monitoring your resellers to see
that they don't break pricing agreements, or for market analysis. It's also used for
collecting business intelligence, locating sales leads, and underpinning marketing and
advertising.”

That is a serious threat, and open to all kinds of abuse. So, there must be a way to stop
it, right? Well, sort of. An experienced hacker will get into your system unless you harden it
against intrusion. Your webmaster has to be skilled in protecting your data while at the same
time letting the information you want to distribute flow unimpeded. There are several methods
to harden your system against attackers.

There are steps and measures that you can put in place to make life more difficult for the
data scrapers. Among the well-known methods: Terms of Use and Conditions, Disable
Hotlinking, Use CSRF Tokens, Rate Limit Page Requests, Use Dedicated Anti-Scraping
Software, Require Human Interaction (CAPTCHA or other challenge-response tests), Make
Your APIs Tight-Lipped, and Decoy Links. All are employed as basic security measures.
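Of the measures listed, rate limiting page requests is the easiest to picture in code. Below is a minimal sketch of a sliding-window limiter a site might run per client; the SlidingWindowRateLimiter class, the limit of three requests per ten seconds, and the sample IP address are assumptions chosen for illustration, and production sites would normally rely on their web server or a dedicated service for this.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow each client at most max_requests within any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> times of recent requests

    def allow(self, client_id, now):
        """Return True if the request at time `now` (in seconds) may proceed."""
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests: reject or delay this one
        hits.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_requests=3, window_seconds=10.0)

# Five requests from one (hypothetical) address at these times in seconds.
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0, 12.5)]
print(decisions)  # [True, True, True, False, True]
```

The fourth request is refused because three have already arrived inside the window; by 12.5 seconds those have expired, so the client is let back in.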

So, after all this, what can you do as an “Ethical Web Scraper” to stay legal, and not
get blacklisted from the Web? Consider the website owner and respect their wishes. There are
many tips in “How to Scrape Websites Without Getting Blocked” on the Scrape Hero
website,

“Websites can use different mechanisms to detect a scraper/spider from a normal user. Some of
these methods are enumerated below:

1. Unusual traffic/high download rate especially from a single client/or IP address within
a short time span.
2. Repetitive tasks performed on the website in the same browsing pattern – based on
an assumption that a human user won’t perform the same repetitive tasks all the time.

3. Checking if you are a real browser – A simple check is to try and execute JavaScript.

Smarter tools can go a lot more and check your Graphic cards and CPUs 😉 to make sure
you are coming from real browser.
4. Detection through honeypots – these honeypots are usually links which aren’t visible to
a normal user but only to a spider. When a scraper/spider tries to access the link, the
alarms are tripped.”

Basically, we need to accept that we are guests of the website and use web
scraping best practices to scrape without getting blocked. It comes down to proper net
etiquette. If you went to a friend’s house, you wouldn’t ransack their library without
asking first.

• Respect Robots.txt
• Make the crawling slower, do not slam the server, treat websites nicely.
• Do not follow the same crawling pattern.
• Make requests through Proxies and rotate them as needed.
• Rotate User Agents and corresponding HTTP Request Headers between requests.
• Use a headless browser like Puppeteer, Selenium or Playwright
• Beware of Honey Pot Traps
• Check if Website is Changing Layouts
• Avoid scraping data behind a login.
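The honeypot tip can be sketched in a few lines: before following links, a scraper can set aside those a human visitor would never see. This is a simplified sketch using the standard library's html.parser; it only looks at the hidden attribute and inline styles, so links hidden through external CSS would slip past it. The sample page, the looks_hidden helper, and the VisibleLinkCollector class are made up for illustration.

```python
from html.parser import HTMLParser

# Inline-style fragments that hide an element from human visitors.
HIDDEN_STYLE_MARKERS = ("display:none", "visibility:hidden")


def looks_hidden(attrs):
    """Guess whether a tag is invisible, based only on its own attributes."""
    attributes = dict(attrs)
    if "hidden" in attributes:
        return True
    style = (attributes.get("style") or "").replace(" ", "").lower()
    return any(marker in style for marker in HIDDEN_STYLE_MARKERS)


class VisibleLinkCollector(HTMLParser):
    """Sort link targets into those worth following and likely traps."""

    def __init__(self):
        super().__init__()
        self.visible = []  # links a human could see and click
        self.skipped = []  # links hidden from humans: possible honeypots

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        if looks_hidden(attrs):
            self.skipped.append(href)
        else:
            self.visible.append(href)


# Hypothetical page with two decoy links mixed in.
SAMPLE_PAGE = """
<a href="/products">Products</a>
<a href="/trap-1" style="display: none">secret</a>
<a href="/trap-2" hidden>secret</a>
<a href="/about" style="color: blue">About</a>
"""

collector = VisibleLinkCollector()
collector.feed(SAMPLE_PAGE)
print(collector.visible)
print(collector.skipped)
```

Skipping hidden links also fits the guest mindset: if the owner did not show a link to visitors, a scraper has little reason to open it.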

Webscraping is legal and essential


In conclusion, there are many ethical and legal reasons to take care when you are web
scraping. An ethical web scraper should respect the laws of their local and federal
jurisdictions and obey basic rules of net etiquette. Python is an excellent tool when wielded
properly. Knowledge of the language will help you deter Black Hat hackers and facilitate your
web scraping endeavors. If you have a website, take the time to secure your data against
unwanted intrusion. Harden your system. It is your responsibility to make it difficult, if not
impossible, to access privileged data. If you are using web scraping for valid purposes, take
care not to get locked out for abusing the available data. If they want you to have it, it will
be easily available.

Webscraping articles

References

Beautiful Soup: Build a Web Scraper With Python by Martin Breuss

https://realpython.com/beautiful-soup-web-scraper-python/

KeyCDN blog. Web Crawlers - Top 10 Most Popular

Ryte Wiki. “Web crawler: What is a web crawler and how does it work?” (ryte.com)

Sead Fadilpasic. US court says web scraping is officially legal. published April 19, 2022

https://www.techradar.com/news/us-court-says-web-scraping-is-officially-legal

McCarthy Law Group. July 20, 2023.

https://mccarthylg.com/a-comprehensive-legal-guide-to-web-scraping-in-the-us/

McCarthy Law Group. October 17, 2022. Legal Rules vs. Legal Norms for Web Scraping

https://mccarthylg.com/legal-rules-vs-legal-norms-for-web-scraping/

Towards Data Science. James Densmore. Jul 23, 2017


https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01

What is Data Scraping, And Why Is It a Threat? By Dave McKay Published Jul 13, 2021

https://www.howtogeek.com/devops/what-is-data-scraping-and-why-is-it-a-threat/

Scrape Hero website article, “How to Scrape Websites Without Getting Blocked”

https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
