Unit2
DATA ANALYTICS
Unit: 2
Data Handling
Ravi Pandey
B-Tech(VIIth sem) Assistant Professor
AIML
12/12/2023 2
Evaluation scheme
5 Lab – I 0 0 2 25 25 50 1
6 Internship Assessment 0 0 2 50 50 1
Course objective:
The objective of this course is to understand the fundamental concepts of Data Science
and to learn about various types of data formats and their manipulation. It helps students
learn exploratory data analysis and visualization techniques, in addition to the R
programming language.
CO 1 Understand the fundamental concepts of data analytics in the areas that play a major role K1
within the realm of data science.
CO 2 Explain and exemplify the most common forms of data and their representations. K2
CO 5 Illustrate various visualization methods for different types of data sets and application K3
scenarios.
Text books:
1) Glenn J. Myatt, Making sense of Data: A practical Guide to Exploratory Data Analysis
and Data Mining, John Wiley Publishers, 2007.
2) Data Analysis and Data Mining, 2nd Edition, John Wiley & Sons Publication, 2014.
Reference Books:
• Security.
• Transportation.
• Risk detection.
• Risk Management.
• Delivery.
• Fast internet allocation.
• Reasonable Expenditure.
• Interaction with customers.
• Planning of cities.
Course Outcomes
Course outcome: After completion of this course students will be able to:
CO5 Understand and analyze the I/O management and File systems K2, K4
Program Outcomes
1. Engineering knowledge
2. Problem analysis
3. Design/development of solutions
4. Conduct investigations of complex problems
5. Modern tool usage
6. The engineer and society
7. Environment and sustainability
8. Ethics
9. Individual and team work
10. Communication
11. Project management and finance
12. Life-long learning
Course Outcome  PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
1               3    2    2    -    -    -    -    -    -    -     -     1
2               3    3    3    -    -    -    -    -    -    -     -     1
3               3    3    3    -    -    -    -    -    -    -     -     1
4               3    2    1    -    -    -    -    -    -    -     -     1
5               3    2    2    -    -    -    -    -    -    -     -     1
Average         3    2.4  2.2  -    -    -    -    -    -    -     -     1
1        3  -  -
2        3  2  -
3        3  2  -
4        3  2  2
5        3  2  -
Average  3  2  2
• Solve real-time complex problems and adapt to technological changes with the ability of
lifelong learning.
• Work as data scientists, entrepreneurs, and bureaucrats for the goodwill of the society
and pursue higher education.
• Exhibit professional ethics and moral values with good leadership qualities and effective
interpersonal skills.
• NA
Data Handling:
1. Types of Data: structured, semi-structured, unstructured data
2. Numeric, Categorical, Graphical, High Dimensional Data
3. Transactional Data, Spatial Data, Social Network Data
4. Standard datasets, Data Classification, Sources of Data
5. Data manipulation in various formats, for example, CSV file,
pdf file, XML file, HTML file, text file, JSON, image files etc.
6. Import and export data in R/Python.
Prerequisites:
• Linux/ Windows operating system.
• MS Office 2019.
Recap:
Objective:
In this topic we learn about structured and unstructured data.
Structured data is clearly defined and searchable, while
unstructured data is usually stored in its native format. Structured
data is quantitative, while unstructured data is qualitative.
Structured data is often stored in data warehouses, while
unstructured data is stored in data lakes.
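A minimal Python sketch of this contrast, using made-up records. The names `structured_rows` and `unstructured_note`, and all values in them, are illustrative assumptions and not part of the course material. Structured data has named fields that can be queried directly, while unstructured text has to be searched or parsed first.

```python
# Structured data: every record has the same, clearly defined fields.
# (Hypothetical order records, invented for this example.)
structured_rows = [
    {"order_id": 1, "customer": "Customer A", "amount": 250.0},
    {"order_id": 2, "customer": "Customer B", "amount": 120.5},
]

# Unstructured data: free text kept in its native format.
unstructured_note = "Customer A called about order 1 and said the delivery was late."

# Structured data can be filtered by field name.
large_orders = [row["order_id"] for row in structured_rows if row["amount"] > 200]

# Unstructured data can only be searched as text.
mentions_delivery = "delivery" in unstructured_note.lower()

print(large_orders)
print(mentions_delivery)
```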
Recap:
Objective:
In this topic we learn that, in the machine learning world, data is
nearly always split into two groups: numerical and categorical.
Numerical data means anything represented by numbers
(floating point or integer). Categorical data generally means
everything else; in particular, discrete labeled groups are often
called out.
Recap:
• There are two types of variables you’ll find in your data – numerical and
categorical. Numerical data can be divided into continuous or discrete
values. And categorical data can be broken down into nominal and ordinal
values.
Numerical
Categorical
• Categorical data is any data that isn’t a number, such as a string of
text or a date. These variables can be broken down into nominal
and ordinal values, though you won’t often see this done.
• Ordinal values are values that have a set order to them. Examples of
ordinal values include having a priority on a bug such as “Critical” or
“Low” or the ranking of a race as “First” or “Third”. Nominal values are
the opposite of ordinal values, and they represent values with no set
order to them. Nominal value examples include variables such as
“Country” or “Marital Status”.
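The distinctions above can be sketched in Python as follows. The bug records, the extra "Medium" priority level, and the chosen ordering are assumptions made up for illustration. The sketch shows that arithmetic suits numerical values, ranking suits ordinal values, and counting suits nominal values.

```python
from statistics import mean

# Hypothetical bug-tracker records used only for illustration.
bugs = [
    {"id": 101, "hours_open": 12.5, "priority": "Low", "country": "India"},
    {"id": 102, "hours_open": 3.0, "priority": "Critical", "country": "Japan"},
    {"id": 103, "hours_open": 7.25, "priority": "Medium", "country": "India"},
]

# Numerical (continuous): arithmetic such as a mean is meaningful.
average_hours = mean(bug["hours_open"] for bug in bugs)

# Ordinal: categories have a set order, so they can be ranked.
# The "Medium" level and this ordering are assumptions for the example.
priority_order = ["Low", "Medium", "Critical"]
by_priority = sorted(
    bugs, key=lambda bug: priority_order.index(bug["priority"]), reverse=True
)

# Nominal: no order, so counting per category is the natural summary.
country_counts = {}
for bug in bugs:
    country_counts[bug["country"]] = country_counts.get(bug["country"], 0) + 1

print(average_hours)
print([bug["id"] for bug in by_priority])
print(country_counts)
```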
Sanchi Kaushik UNIT 02 Data Analytics 12/12/2023
Numeric, Categorical, Graphical, High Dimensional Data
Categorical
Types of Variables
There are two basic types of variables: numerical and categorical
variables.
Ratio Scale
Categorical Variables
NOMINAL SCALE
b. Appearance of plasma:
1. Clear
2. Turbid
9. Not done
ORDINAL SCALE
81. Urine protein (dipstick reading):
1. Negative
2. Trace
3. 30 mg% or +
• Graph algorithms help make sense of the global structure of a graph, and the results
can be used for standalone analysis or as features in a machine learning model.
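As one possible illustration of a graph algorithm, the sketch below finds connected components with a breadth-first search and then derives per-node features from the result. The edge list and the feature names `degree` and `component_size` are assumptions chosen for the example, not a method prescribed by the course.

```python
from collections import defaultdict, deque

# Hypothetical undirected edges, for illustration only.
edges = [("A", "B"), ("B", "C"), ("D", "E")]

# Build an adjacency list from the edge list.
graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)


def connected_components(adjacency):
    """Return a list of components, each a sorted list of node names."""
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


components = connected_components(graph)

# Per-node features that a model could consume: degree and component size.
features = {
    node: {"degree": len(graph[node]), "component_size": len(comp)}
    for comp in components
    for node in comp
}

print(components)
print(features["B"])
```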
What is Dimensionality?
Objective:
In this topic we learn how transactional data describe an
internal or external event or transaction that takes place as an
organization conducts its business. Examples include sales orders,
invoices, purchase orders, shipping documents, passport
applications, credit card payments, and insurance claims.
Recap:
Transactional data
Transactional data is information that is captured from
transactions. It records the time of the transaction, the place
where it occurred, the price points of the items bought, the
payment method employed, discounts if any, and other
quantities and qualities associated with the transaction.
Transactional data is usually captured at the point of sale.
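A transaction record of this kind could be modelled in Python roughly as below. The class name `Transaction`, its field names, and the sample values are all assumptions made for illustration. The fields mirror the attributes listed above (time, place, price, payment method, discount).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Transaction:
    """One point-of-sale record; field names are illustrative assumptions."""

    timestamp: datetime      # time of the transaction
    place: str               # where it occurred
    unit_price: float        # price point of the item bought
    quantity: int            # number of items bought
    payment_method: str      # payment method employed
    discount: float = 0.0    # absolute discount, if any

    def total(self) -> float:
        """Amount paid: price times quantity, minus the discount."""
        return self.unit_price * self.quantity - self.discount


# Hypothetical sale, invented for this example.
sale = Transaction(
    timestamp=datetime(2023, 12, 12, 10, 30),
    place="Store 1",
    unit_price=40.0,
    quantity=3,
    payment_method="card",
    discount=20.0,
)

print(sale.total())
```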
Objective:
In this topic we learn what a standard dataset is. Standard datasets
are easy to compare and navigate, which makes them well suited to
practicing a specific data preparation technique or modeling method.
Recap:
• Types of Data
1. Primary data
2. Secondary data
• Primary Data
• This kind of data has not been used for any statistical analysis
before.
• Secondary Data
• The first step of data collection is to figure out what kind of data is required.
The analysis then starts with collecting a sample from a certain part of the
population through a specific sampling method.
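The text does not name a particular sampling method. As one common choice, the sketch below draws a simple random sample without replacement, in which every member has the same chance of selection. The population of 100 respondent IDs, the sample size, and the seed are assumptions for illustration.

```python
import random

# Hypothetical population of 100 respondent IDs.
population = list(range(1, 101))

# A fixed seed makes the sketch reproducible; drop it for real sampling.
rng = random.Random(42)

# Simple random sampling without replacement.
sample = rng.sample(population, k=10)

print(sorted(sample))
```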
Objective:
In this topic we learn about the four basic types of data manipulation
carried out in data science, including how to move data
around unchanged.
Recap:
• When you want to get started with data manipulation, here are the steps
you should take into consideration:
• Sort and filter: users can save a lot of time when analyzing
data by using the sorting and filtering options in Excel.
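The same sort and filter steps can be sketched in Python instead of Excel. The sales rows and column names below are made up for illustration.

```python
# Hypothetical sales rows, for illustration only.
rows = [
    {"product": "pen", "units": 30, "region": "North"},
    {"product": "notebook", "units": 12, "region": "South"},
    {"product": "eraser", "units": 45, "region": "North"},
]

# Filter: keep only the rows for one region.
north_rows = [row for row in rows if row["region"] == "North"]

# Sort: order the filtered rows by units sold, largest first.
north_sorted = sorted(north_rows, key=lambda row: row["units"], reverse=True)

print([row["product"] for row in north_sorted])
```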
• Usually, the files you will come across will depend on the
application you are building. For example, in an image processing
system, you need image files as input and output. So you will
mostly see files in jpeg, gif or png format.
• Choosing the optimal file format for storing data can improve
the performance of your models in data processing.
Comma-separated values
rows and columns. A column in the spreadsheet file can have different
types. For example, a column can be of string type, a date type or an integer
type. Some of the most popular spreadsheet file formats are Comma
• Sometimes you may come across files where fields are separated by
tabs rather than commas. This file format is known as TSV
(Tab Separated Values).
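A short sketch of reading both formats with Python's standard `csv` module. The in-memory strings stand in for files on disk, and their contents are made up for the example. The only difference between the two reads is the `delimiter` argument.

```python
import csv
import io

# In-memory stand-ins for files on disk; the contents are made up.
csv_text = "name,city\nAsha,Delhi\n"
tsv_text = "name\tcity\nAsha\tDelhi\n"

# csv.reader splits on commas by default.
csv_rows = list(csv.reader(io.StringIO(csv_text)))

# The same reader handles TSV when the delimiter is set to a tab.
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(csv_rows)
print(tsv_rows == csv_rows)
```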
XLSX files
• The following example shows text file data that contain text:
HTML files
Objective:
In this topic we learn about the import and export of data, which is the
automated or semi-automated input and output of data sets
between different software applications. ... Import and export of
data shares a semantic analogy with copying and pasting, in that sets
of data are copied from one application and pasted into another.
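A minimal sketch of an export and import round trip in Python, using only the standard library. The records are hypothetical, and strings are used in place of real files so the example is self-contained. One "application" exports JSON, another imports it and re-exports it as CSV.

```python
import csv
import io
import json

# Hypothetical data set to move between applications.
records = [
    {"id": 1, "name": "pen"},
    {"id": 2, "name": "notebook"},
]

# Export: serialise the records to a JSON string (a file would work the same way).
exported = json.dumps(records)

# Import: a second program parses the JSON back into Python objects.
imported = json.loads(exported)

# Re-export the imported data as CSV, for example for a spreadsheet tool.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name"], lineterminator="\n")
writer.writeheader()
writer.writerows(imported)
csv_text = buffer.getvalue()

print(imported == records)
print(csv_text)
```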
Recap:
On the other hand, reading a PDF file through a program is a complex task.
However, there are libraries that do a good job of parsing PDF files; one of them is
PDFMiner. To read a PDF file through PDFMiner, you have to:
Download and install PDFMiner from the website
https://euske.github.io/pdfminer/
Extract the text from the PDF file with the following command
pdf2txt.py <pdf_file>.pdf
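If the extraction needs to be triggered from a Python program, the command above could be wrapped roughly as follows. This is a sketch under assumptions: PDFMiner is installed, `pdf2txt.py` is on the PATH and directly executable, and it writes the extracted text to standard output. The helper names and the file name `report.pdf` are hypothetical.

```python
import subprocess


def build_pdf2txt_command(pdf_path):
    """Return the pdf2txt.py command line as an argument list."""
    return ["pdf2txt.py", pdf_path]


def extract_text(pdf_path):
    """Run pdf2txt.py and return the extracted text.

    Assumes PDFMiner is installed and pdf2txt.py is on the PATH.
    """
    result = subprocess.run(
        build_pdf2txt_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# "report.pdf" is a hypothetical file name; only the command is built here.
print(build_pdf2txt_command("report.pdf"))
```

Calling `extract_text("report.pdf")` would run the tool itself, so it only works where PDFMiner is actually installed.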
https://www.youtube.com/watch?v=uufDGjTuq34
https://www.youtube.com/watch?v=XVv6mJpFOb0
https://www.youtube.com/watch?v=guPOL9UplNs
https://www.youtube.com/watch?v=ve_0h4Y8nuI&list=PLhTjy8cBISEqkN-5Ku_kXG4QW33sxQo0t
https://www.youtube.com/watch?v=pLoRrHEsHb0&list=PLmcBskOCOOFUmbUv0CIMuATDVKVrOhBMV
A. urllib
B. bs4
C. HTTP
D. GET
A. socket
B. port
C. http
D. protocol
MCQ unit wise/weekly
3: What is a Python library that can be used to send and receive data over
HTTP?
A. http
B. urllib
C. port
D. header
4: What is the process by which search engines retrieve webpages and build a
search index called?
A. scrape
B. parse
C. BeautifulSoup
D. spider
mysock.connect(('data.pr4e.org', 80))
mysock.send(cmd)
7: Given the HTML below, how would this tag type be described
in web scraping code?
A. h1
B. h1, class='sports'
C. h1, class_='sports'
D. 'h1', class_='sports'
9: Which of the following gets the value for the id in the first p
tag?
A. soup.p.get('id')
B. soup.p.get('id', None)
C. soup.p[id]
D. soup.p['id']
10: Which of the following gets the first link tag and returns a
dictionary of all attributes and values for that link tag?
A. soup.a.attributes
B. soup.link.attrs
C. soup.a.attrs
D. soup.link.attributes
Assignment 1
What are the different types of data?
What are different types of data sources?
Explain direct personal investigation method of collecting primary data.
Discuss its merits and demerits.
What is secondary data? Discuss the various sources of secondary data.
What precautions shall we take while using secondary data?
What are the methods of primary data collection?
What is Categorical Data?