DCIM 216 Summer 2023 Lab 9: Web Scrapers and Spiders
Searching the Web: Ethically?
William R. Schroeder Jr.
CCBC
Summer 2023
School of Business, Technology and Law
Computer Science / Information Technology
CSIT 216 Python Programming W01 (53218)
Web scraping: legal, a security risk, or an ethical challenge? In this paper, we will explore the differences between web crawlers and web scrapers and describe the legal issues, ethical issues, and security risks involved. The Python language lends itself well to web scraping: it comes with many tools that simplify the process through packages like Beautiful Soup. There are many free tutorials on the web that will walk you through it step by step, such as "Beautiful Soup: Build a Web Scraper with Python" by Martin Breuss on the Real Python website, and there are videos on YouTube that can give you a visual grasp of the possibilities of Python programming. Python is the language of choice for many modern programmers due to its flexibility and ease of access.
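To make that concrete, here is a minimal sketch of the kind of script those tutorials build toward. It assumes the third-party requests and beautifulsoup4 packages are installed, and the URL is only a placeholder:

    # Fetch a page and pull out its title and links with Beautiful Soup.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com", timeout=10)
    response.raise_for_status()  # stop early on a 4xx/5xx answer

    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.title.string)            # the page title
    for anchor in soup.find_all("a", href=True):
        print(anchor["href"])           # every hyperlink on the page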
A few primary issues define the difference between a web crawler and a web scraper; basic usage, and the entities who work with these programs, may have very different objectives. Web crawlers are typically considered more mainstream and commonplace in data searches; they power services like Google, Bing, and DuckDuckGo. Web scrapers, on the other hand, are considered more invasive and are capable of digging much deeper into the data of an HTML site. They are frequently used by "black hat" operatives and others who are more adept with data technology.
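The distinction shows up clearly in code. Below is a rough sketch, not production software: the crawler walks outward from a starting URL collecting page addresses, while the scraper digs specific fields out of a single page. It again assumes requests and beautifulsoup4, and the URLs passed in are placeholders.

    from collections import deque
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_pages=10):
        """Crawler: discover pages by following links outward."""
        seen, queue = set(), deque([start_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            html = requests.get(url, timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a", href=True):
                queue.append(urljoin(url, a["href"]))
        return seen

    def scrape(url):
        """Scraper: extract specific fields (here, subheadings) from one page."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return [h.get_text(strip=True) for h in soup.find_all("h2")]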
Legality of Web Scraping
This is important: do we have to worry about the government locking us up in our search for information? Debatable. Some activity could be considered borderline, but here is the word: web scraping is legal. From an article in TechRadar dated April 19, 2022:
“Scraping public data is legal, the U.S. Ninth Circuit of Appeals has ruled in a potentially
landmark decision.
The decision follows a ruling by a federal court of appeals that reaffirmed its earlier decision,
notably that web scraping (data harvesting, en masse) of data that’s made available to the general
public, does not violate the Computer Fraud and Abuse Act (CFAA). The CFAA is used to
determine what can be described as “hacking” under US law.”
So, web scraping was determined to be legal in a case between LinkedIn and hiQ Labs, a talent-management analytics firm. The case called into question what web scraping is; the court found it did not fit the description of "hacking" under the CFAA.
However, the McCarthy Law Group (whose owner states he is both a Python programmer and a lawyer) asserts in a newer article that the legal landscape is in a constant state of change. They state,
“There are a few websites online that purport to answer the question of “whether web scraping
is legal.” And way too many of those websites, with unwavering confidence and a complete
absence of caution, provide clear and concise answers to that question that are laughably and
dangerously false.”
“Most casual observers of this topic feel inclined to describe web scraping as a “gray
area of the law.” But that’s not really correct, either. Legal interpretations of the CFAA are a
mess, admittedly. But that’s not the most heavily litigated issue with web scraping anymore.
Today, most sophisticated companies looking to enforce web-scraping claims do so under
breach of contract, misappropriation, and intellectual property and quasi-intellectual property
theories of law. CFAA claims are only the primary focus in web-scraping litigation in unique
circumstances, such as when jurisdictional issues prevent the plaintiff from pursuing easier
claims.” ‘Nuff said. People should be aware that their data is available with or without express
permission. On the other hand, this ruling is good news for archivists, academics, researchers,
and journalists… and “hackers”!
Ethics of Web Scraping
James Densmore, in an article titled "Ethics in Web Scraping," gives a list of rules for staying within the bounds of ethical behavior. I believe that, for the most part, following it will spare ethical web scrapers (those who could also be considered "ethical hackers") legal issues as well; two of the rules are sketched in code after the list.
If you have a public API that provides the data I’m looking for, I’ll use it and avoid
scraping altogether.
I will always provide a User Agent string that makes my intentions clear and provides
a way for you to contact me with questions or concerns.
I will request data at a reasonable rate. I will strive to never be confused for a DDoS
attack.
I will only save the data I absolutely need from your page. If all I need is OpenGraph
meta-data, that’s all I’ll keep.
I will respect any content I do keep. I’ll never pass it off as my own.
I will look for ways to return value to you. Maybe I can drive some (real) traffic to
your site or credit you in an article or post.
I will respond in a timely fashion to your outreach and work with you towards a
resolution.
I will scrape for the purpose of creating new value from the data, not to duplicate it.
I will allow ethical scrapers to access my site as long as they are not a burden on my
site’s performance.
I will respect transparent User Agent strings rather than blocking them and
encouraging use of scrapers masked as human visitors.
I will reach out to the owner of the scraper (thanks to their ethical User Agent string)
before blocking permanently. A temporary block is acceptable in the case of site
performance or ethical concerns.
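As a rough illustration of two of these rules (the transparent User-Agent string and the reasonable request rate), here is a minimal sketch; the bot name, contact address, and URLs are placeholders, not real endpoints:

    import time
    import requests

    HEADERS = {
        # Identify yourself and give the site owner a way to reach you.
        "User-Agent": "my-research-bot/1.0 (contact: researcher@example.com)"
    }

    for url in ["https://example.com/page1", "https://example.com/page2"]:
        response = requests.get(url, headers=HEADERS, timeout=10)
        # ... keep only the data you actually need from response.text ...
        time.sleep(2)  # a reasonable rate; never be confused for a DDoS attack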
Scraping is still a serious threat, open to all kinds of abuse. So there must be a way to stop it, right? Well, sort of. An experienced hacker will get into your system unless you harden it against intrusion. Your webmaster has to be skilled at protecting your data while letting the information you want to distribute flow unimpeded. There are several methods to harden your system against attackers.
There are steps and measures you can put in place to make life more difficult for data scrapers. Among the notable methods: Terms of Use and Conditions, disabling hotlinking, CSRF tokens, rate-limiting page requests, dedicated anti-scraping software, requiring human interaction (CAPTCHA or other challenge-response tests), tight-lipped APIs, and decoy links are all employed as basic security measures.
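As one example from that list, here is a sketch of rate-limiting page requests per client IP, written with the Flask package; the 30-requests-per-minute threshold is an arbitrary assumption:

    import time
    from collections import defaultdict, deque
    from flask import Flask, abort, request

    app = Flask(__name__)
    WINDOW_SECONDS = 60
    MAX_REQUESTS = 30
    hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    @app.before_request
    def rate_limit():
        now = time.time()
        window = hits[request.remote_addr]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()   # forget requests older than the window
        window.append(now)
        if len(window) > MAX_REQUESTS:
            abort(429)         # Too Many Requests

    @app.route("/")
    def index():
        return "Hello, polite visitor."

A sliding window like this forgets old requests, so a patient, well-behaved scraper is never punished.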
So, after all this, what can you do as an "ethical web scraper" to stay legal and not get blacklisted from the web? Consider the website owner and respect their wishes. There are many tips in "How to Scrape Websites Without Getting Blocked" on the ScrapeHero website:
“Websites can use different mechanisms to detect a scraper/spider from a normal user. Some of
these methods are enumerated below:
1. Unusual traffic/high download rate especially from a single client/or IP address within
a short time span.
2. Repetitive tasks performed on the website in the same browsing pattern – based on
an assumption that a human user won’t perform the same repetitive tasks all the time.
Lab 9: Web Scrapers and 6
Spiders
3. Checking if you are a real browser – A simple check is to try and execute JavaScript.
Smarter tools can go a lot further and check your graphics cards and CPUs 😉 to make sure
you are coming from a real browser.
4. Detection through honeypots – these honeypots are usually links which aren’t visible to
a normal user but only to a spider. When a scraper/spider tries to access the link, the
alarms are tripped.”
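A honeypot of that kind takes only a few lines to set up. The sketch below, again using Flask, hides a decoy link from human visitors and bans any client that follows it; the route name and the in-memory ban list are illustrative only:

    from flask import Flask, abort, request

    app = Flask(__name__)
    banned_ips = set()  # in-memory ban list, for illustration only

    @app.route("/")
    def index():
        if request.remote_addr in banned_ips:
            abort(403)
        # The decoy link is in the HTML but invisible to human visitors.
        return ('<a href="/trap" style="display:none">do not follow</a>'
                'Welcome to the site.')

    @app.route("/trap")
    def trap():
        banned_ips.add(request.remote_addr)  # the alarm is tripped
        abort(403)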
Basically, we need to accept that we are guests of the website and use web-scraping best practices to scrape without getting blocked; the first two tips below are sketched in code after the list. It comes down to proper net etiquette: if you went to a friend's house, you wouldn't ransack their library without asking first.
Respect Robots.txt
Make the crawling slower, do not slam the server, treat websites nicely.
Do not follow the same crawling pattern.
Make requests through Proxies and rotate them as needed.
Rotate User Agents and corresponding HTTP Request Headers between requests.
Use a headless browser like Puppeteer, Selenium or Playwright
Beware of Honey Pot Traps
Check if Website is Changing Layouts
Avoid scraping data behind a login.
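Here is a sketch of the first two tips in practice: it checks robots.txt through the standard library's urllib.robotparser before each fetch, and paces requests with a randomized delay. The bot name and URLs are placeholders:

    import random
    import time
    import urllib.robotparser
    import requests

    AGENT = "my-polite-bot/1.0"
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()

    def polite_get(url):
        if not robots.can_fetch(AGENT, url):
            return None  # the site has asked us not to fetch this path
        time.sleep(random.uniform(2, 5))  # slow, irregular pacing
        return requests.get(url, headers={"User-Agent": AGENT}, timeout=10)

    page = polite_get("https://example.com/articles")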
References
Breuss, Martin. "Beautiful Soup: Build a Web Scraper with Python." Real Python. https://realpython.com/beautiful-soup-web-scraper-python/
Densmore, James. "Ethics in Web Scraping."
Ryte Wiki. "Web Crawler: What is a web crawler and how does it work?" (ryte.com)
Fadilpasic, Sead. "US court says web scraping is officially legal." TechRadar, April 19, 2022. https://www.techradar.com/news/us-court-says-web-scraping-is-officially-legal
McCarthy Law Group. "Legal Rules vs. Legal Norms for Web Scraping." October 17, 2022. https://mccarthylg.com/legal-rules-vs-legal-norms-for-web-scraping/
McKay, Dave. "What is Data Scraping, And Why Is It a Threat?" July 13, 2021. https://www.howtogeek.com/devops/what-is-data-scraping-and-why-is-it-a-threat/
ScrapeHero. "How to Scrape Websites Without Getting Blocked." https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/