Module 4
Python Web Scraping
Why do we scrape the web, and how do we get the data we need? The answer
to the first question is 'data': data is indispensable for any programmer, and
the basic requirement of every programming project is a large amount of useful data.
The answer to the second question is a bit tricky, because there are many ways to
get data. In general, we may get data from a database, a data file or other such sources.
But what if we need a large amount of data that is only available online? One way to get
such data is to manually search for it (clicking away in a web browser) and save
the required data (copy-pasting into a spreadsheet or file). This method is quite
tedious and time consuming. Another way to get such data is web scraping.
Web scraping, also called web data mining or web harvesting, is the process of
constructing an agent which can extract, parse, download and organize useful
information from the web automatically. In other words, instead of us
manually saving data from websites, web scraping software will
automatically load and extract data from multiple websites as per our requirements.
Web Crawling vs. Web Scraping
The terms web crawling and web scraping are often used interchangeably,
since the basic purpose of both is to extract data. However, they are
different from each other. We can understand the basic difference from
their definitions.
Web crawling is basically used to index the information on a page
using bots, also known as crawlers. It is also called indexing. Web
scraping, on the other hand, is an automated way of extracting information using bots,
also known as scrapers. It is also called data extraction.
Scraper bots are tools or pieces of code used to extract data from web
pages. These bots are like tiny spiders that run through the different web
pages of a website to extract the specific data they were created to get.
The process of extracting data with a scraper bot is called web scraping.
In short, crawling downloads and indexes pages on a large scale, typically for
search engines, while scraping extracts specific data elements from particular pages.
Uses of Web Scraping
The uses of and reasons for web scraping are as endless as the uses of the World
Wide Web. Web scrapers can do anything a human can do, such as ordering food online,
scanning a shopping website for you, or buying tickets for a match the moment they
become available. Some of the important uses of web scraping are discussed
here −
E-commerce Websites − Web scrapers can collect data related to the price
of a specific product from various e-commerce websites for comparison.
Content Aggregators − Web scraping is widely used by content aggregators, such as news
aggregators and job aggregators, to provide updated data to their users.
Marketing and Sales Campaigns − Web scrapers can be used to gather data like email
addresses, phone numbers etc. for sales and marketing campaigns.
Search Engine Optimization (SEO) − Web scraping is widely used by SEO tools like
SEMRush, Majestic etc. to tell businesses how they rank for the search keywords that matter to
them.
Data for Machine Learning Projects − The retrieval of data for machine learning projects
often depends upon web scraping.
Data for Research − Researchers can collect useful data for their research
work, saving time through this automated process.
Components of a Web Scraper
A web scraper consists of the following components −
Web Crawler Module
An essential component of a web scraper, the web crawler module is used to navigate the
target website by making HTTP or HTTPS requests to its URLs. The crawler downloads the
unstructured data (HTML content) and passes it to the extractor, the next module.
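As a minimal sketch of such a module (assuming the requests library, which is introduced later in this module; the URL is a placeholder), a crawler can be as simple as:

import requests

def crawl(urls):
    # fetch the raw, unstructured HTML of each URL and
    # hand it over to the next module (the extractor)
    for url in urls:
        r = requests.get(url)        # HTTP GET request to the target URL
        r.raise_for_status()         # stop early on HTTP errors (404, 500, ...)
        yield url, r.text            # the downloaded HTML content

for url, html in crawl(["https://example.com"]):
    print(url, len(html), "characters of HTML")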
Extractor
The extractor processes the fetched HTML content and extracts the data into a semi-structured
format. It is also called a parser module and uses different parsing techniques such as
regular expressions, HTML parsing, DOM parsing or artificial intelligence.
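To illustrate the HTML-parsing approach, the sketch below uses Python's built-in html.parser module to pull every hyperlink out of a page; the HTML string fed to it here is a made-up example:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # collect the href attribute of every <a> tag
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

extractor = LinkExtractor()
extractor.feed('<a href="/page1">One</a> <a href="/page2">Two</a>')
print(extractor.links)   # ['/page1', '/page2']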
Data Transformation and Cleaning Module
The data extracted above is not suitable for immediate use. It must pass through a cleaning
module before we can use it. Methods like string manipulation or regular expressions can
be used for this purpose. Note that extraction and transformation can also be performed in a
single step.
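For example, here is a small sketch of a regular-expression cleaning step that turns a scraped price string into a number (the input format shown is an assumption):

import re

def clean_price(raw):
    # keep only digits and the decimal point, dropping
    # currency symbols, thousands separators and whitespace
    digits = re.sub(r"[^\d.]", "", raw)
    return float(digits)

print(clean_price("  $1,299.99 "))   # 1299.99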
Storage Module
After extracting the data, we need to store it as per our requirements. The storage module
outputs the data in a standard format, such as a database, JSON or CSV.
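A minimal sketch of such a storage module, writing the same extracted records to both CSV and JSON with Python's standard library (the field names are illustrative):

import csv
import json

records = [
    {"product": "widget", "price": 9.99},
    {"product": "gadget", "price": 19.95},
]

# store the records as CSV
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "price"])
    writer.writeheader()
    writer.writerows(records)

# store the same records as JSON
with open("products.json", "w") as f:
    json.dump(records, f, indent=2)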
Working of a Web Scraper
A web scraper may be defined as software or a script used to download
the contents of multiple web pages and extract data from them.
Step 1: Downloading Contents from Web Pages
In this step, a web scraper will download the requested contents from multiple web
pages.
Step 2: Extracting Data
The data on websites is HTML and mostly unstructured. Hence, in this step, the web scraper
will parse and extract structured data from the downloaded contents.
Step 3: Storing the Data
Here, a web scraper will store and save the extracted data in a format such as CSV or
JSON, or in a database.
Step 4: Analyzing the Data
After all these steps are successfully done, the web scraper will analyze the data thus
obtained.
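Putting the first three steps together, here is a compact sketch of the whole flow; the target URL, the regular expression and the output file name are all placeholders:

import re
import json
import requests

# Step 1: download the contents of a web page
html = requests.get("https://example.com").text

# Step 2: extract structured data (here, just the page title) from the HTML
titles = re.findall(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)

# Step 3: store the extracted data in JSON format
with open("scraped.json", "w") as f:
    json.dump({"titles": titles}, f, indent=2)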
Why Python for Web Scraping?
Python is a popular tool for implementing web scraping. The Python programming language is also
used for other useful projects related to cyber security, penetration testing and digital forensics.
Web scraping can be performed with base Python alone, without
any third-party tool.
The Python programming language enjoys huge popularity, and the reasons that make it a good
fit for web scraping projects are listed below −
Syntax Simplicity
Python has one of the simplest syntaxes among programming languages. This feature of
Python makes testing easier, and a developer can focus more on the programming itself.
Inbuilt Modules
Another reason for using Python for web scraping is the useful libraries it possesses, both
inbuilt and external. We can perform many implementations related to web scraping by using
Python as the base for programming.
Open Source Programming Language
Python has huge support from the community because it is an open source programming language.
Wide range of Applications
Python can be used for various programming tasks ranging from small shell scripts to enterprise
web applications.
Downloading Files from the Web Using Python
Requests is a versatile HTTP library in Python with various applications.
One of its applications is downloading a file from the web using the file's URL.
Installation: First of all, you need to install the requests
library. You can install it directly using pip by typing the following
command:
pip install requests
# import the requests library
import requests

# URL of the image to be downloaded
image_url = "https://www.python.org/static/community_logos/python-logo-master-v3-TM.png"

# send an HTTP GET request to the server and save
# the HTTP response in a response object called r
r = requests.get(image_url)

# write the contents of the response (r.content)
# to a new file opened in binary mode
with open("python_logo.png", 'wb') as f:
    f.write(r.content)
This small piece of code will download the Python logo image from the web.
Now check your local directory (the folder where this script resides), and you will find
the saved image file python_logo.png.
The with statement in Python is used in exception handling to make the code cleaner and much
more readable. It simplifies the management of common resources like file streams.
# without using with statement
file = open('file_path', 'w')
file.write('hello world')
file.close()

# without using with statement, guarded by try/finally
file = open('file_path', 'w')
try:
    file.write('hello world')
finally:
    file.close()
# using with statement
with open('file_path', 'w') as file:
    file.write('hello world !')
Notice that, unlike the first two implementations, there is no need to call file.close() when using the
with statement. The with statement itself ensures proper acquisition and release of resources. An
exception during the file.write() call in the first implementation can prevent the file from closing
properly, which may introduce several bugs, because many changes to a file do not take
effect until the file is properly closed.
Python provides different modules, like urllib and requests, to download files
from the web.
1. Import module
import requests
2. Get the link or URL
url = 'https://www.facebook.com/favicon.ico'
r = requests.get(url, allow_redirects=True)
3. Save the content to a file.
open('facebook.ico', 'wb').write(r.content)
This saves the file as facebook.ico.
Example
import requests
url = 'https://www.facebook.com/favicon.ico'
r = requests.get(url, allow_redirects=True)
open('facebook.ico', 'wb').write(r.content)
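Since urllib is mentioned above alongside requests, here is the equivalent download using only the standard library's urllib module:

import urllib.request

url = 'https://www.facebook.com/favicon.ico'

# download the file at the URL and save it locally as facebook.ico
urllib.request.urlretrieve(url, 'facebook.ico')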