Unit2

The document outlines a course on Data Analytics offered at the Noida Institute of Engineering and Technology, detailing the course objectives, outcomes, and evaluation schemes. It covers fundamental concepts of data handling, types of data, and data manipulation techniques, emphasizing the use of programming languages like R and Python. Additionally, it includes information on textbooks, prerequisites, and the overall educational objectives for students pursuing a B.Tech in Data Science.

Noida Institute of Engineering and Technology, Greater Noida

DATA ANALYTICS

Unit: 2

Data Handling

Ravi Pandey, Assistant Professor, AIML
B.Tech (VIIth sem)

Sanchi Kaushik UNIT 02 Data Analytics 1


12/12/2023
Faculty Introduction

Mr. Ravi has been a faculty member in the Discipline of Electronics and Communication Engineering at Noida Institute of Engineering and Technology, Greater Noida, Gautam Budh Nagar, Uttar Pradesh, India, since February 2016. He received the B.Tech degree in electronics and telecommunication engineering and the M.Tech degree in instrumentation and signal processing. His research interests include biomedical signal processing, pattern recognition, machine learning and deep neural networks.

Evaluation Scheme

(Periods: L, T, P. Evaluation Schemes: CT, TA, Total, PS. End Semester: TE, PE.)

Sl.   Subject  Subject                   L  T  P   CT  TA  Total  PS   TE   PE  Total  Credit
No.   Codes
1     -        Departmental Core - I     3  0  0   30  20  50     -    100  -   150    3
2     -        Departmental Elective V   3  0  0   30  20  50     -    100  -   150    3
3     -        Open Elective II          3  0  0   30  20  50     -    100  -   150    3
4     -        Open Elective III         3  0  0   30  20  50     -    100  -   150    3
5     -        Lab - I                   0  0  2   -   -   -      25   -    25  50     1
6     -        Internship Assessment     0  0  2   -   -   -      50   -    -   50     1
      -        MOOCs (Essential for      0  0  2
               Hons. Degree)
               Total                                                            700    14

Course applicable for: B.Tech. Data Science/AI-ML/AI/IOT/CSBS

Course objective
B. TECH. (Data Science)

Course code L T P Credits


3 0 0 3

Course title Data Analytics

Course objective:

The objective of this course is to understand the fundamental concepts of Data Science
and to learn about various types of data formats and their manipulation. It also helps
students learn exploratory data analysis and visualization techniques, in addition to
the R programming language.

15/06/2022 Nisha UNIT 01 4


Course Outcomes

Course outcomes: After completion of this course students will be able to

CO 1 Understand the fundamental concepts of data analytics in the areas that play a major role within the realm of data science. K1

CO 2 Explain and exemplify the most common forms of data and their representations. K2

CO 3 Understand and apply data pre-processing techniques. K3

CO 4 Analyse data using exploratory data analysis. K4

CO 5 Illustrate various visualization methods for different types of data sets and application scenarios. K3



Course Contents / Syllabus



Text Books

Text books:

1) Glenn J. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining, John Wiley Publishers, 2007.
2) Data Analysis and Data Mining, 2nd Edition, John Wiley & Sons Publication, 2014.

Reference Books:

1) Open Data for Sustainable Community: Glocalized Sustainable Development Goals, Neha Sharma, Santanu Ghosh, Monodeep Saha, Springer, 2021.
2) The Data Science Handbook, Field Cady, John Wiley & Sons, Inc, 2017.
3) Data Mining Concepts and Techniques, Third Edition, Jiawei Han, Micheline Kamber, Jian Pei, Morgan Kaufmann, 2012.



Branch wise Applications

• Security.
• Transportation.
• Risk detection.
• Risk Management.
• Delivery.
• Fast internet allocation.
• Reasonable Expenditure.
• Interaction with customers.
• Planning of cities.

12 December 2023 Neeti Taneja ACSE0403A OS Unit-3 9


Course Objectives

• The objective of this course is to understand the fundamental concepts of data analytics and learn about various types of data formats and their manipulation.

• It helps students learn exploratory data analysis and visualization techniques, in addition to R, Python, and Tableau.

Course Outcomes

Course outcome: After completion of this course students will be able to:

CO1 Understand the fundamentals of operating systems, their structure and functions. K1, K2

CO2 Implement the concepts of process management policies, CPU scheduling and thread management. K5

CO3 Understand and implement the requirements of process synchronization and apply deadlock handling algorithms. K2, K5

CO4 Evaluate memory management and its allocation policies. K5

CO5 Understand and analyze I/O management and file systems. K2, K4

Program Outcomes

1. Engineering knowledge
2. Problem analysis
3. Design/development of solutions
4. Conduct investigations of complex problems
5. Modern tool usage
6. The engineer and society
7. Environment and sustainability
8. Ethics
9. Individual and team work
10. Communication
11. Project management and finance
12. Life-long learning



COs and POs Mapping

Course Outcome  PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
1               3    2    2    -    -    -    -    -    -    -     -     1
2               3    3    3    -    -    -    -    -    -    -     -     1
3               3    3    3    -    -    -    -    -    -    -     -     1
4               3    2    1    -    -    -    -    -    -    -     -     1
5               3    2    2    -    -    -    -    -    -    -     -     1
Average         3    2.4  2.2  -    -    -    -    -    -    -     -     1



Program Specific Outcomes (PSOs)

On successful completion of the B. Tech. (DS) Program, the Data Science graduates will be able to:

• PSO1: Analyse, design and develop solutions by applying fundamental concepts of Data Science.
• PSO2: Apply technical knowledge while using modern tools and technologies for solving complex problems.
• PSO3: Collaborate across different fields of science and technology with the right attitude, work as an individual or as a team, and demonstrate professional ethics for the well-being of society.



COs and PSOs Mapping

Course Outcome  PSO1  PSO2  PSO3
1               3     -     -
2               3     2     -
3               3     2     -
4               3     2     2
5               3     2     -
Average         3     2     2



Program Educational Objectives (PEOs)

• Solve real-time complex problems and adapt to technological changes with the ability of lifelong learning.

• Work as data scientists, entrepreneurs, and bureaucrats for the goodwill of the society and pursue higher education.

• Exhibit professional ethics and moral values with good leadership qualities and effective interpersonal skills.



Faculty wise Result Analysis

• NA



End Semester Question Paper Templates (Offline Pattern/Online Pattern)

Brief Introduction about the subject with video

Data analytics (DA) is the area of examining data sets in order to find trends and draw conclusions about the information they contain. Increasingly, data analytics is done with the aid of specialized systems and software.

YouTube/other Video Links

https://www.youtube.com/playlist?list=PLmXKhU9FNesSFvj6gASuWmQd23Ul5omtD

CONTENT

Data Handling:
1. Types of Data: structured, semi-structured, unstructured data
2. Numeric, Categorical, Graphical, High Dimensional Data
3. Transactional Data, Spatial Data, Social Network Data
4. Standard datasets, Data Classification, Sources of Data
5. Data manipulation in various formats, for example, CSV file,
pdf file, XML file, HTML file, text file, JSON, image files etc.
6. Import and export data in R/Python.

Prerequisite and Recap

Prerequisites:
• Linux/ Windows operating system.

• Java & R Studio.

• MS Office 2019.

• Programming Languages (Python or Java)

Recap:

• Discussion about Data Science Environments.

Unit Objective

The objectives of Unit 2 are:

1. To provide an overview of the exciting, growing field of data science.

2. To inculcate preliminary knowledge of handling data and discuss various types of data.

3. To introduce the types of file formats used while handling data.

4. To cover data wrangling and the preprocessing of messy data.



Types of Data: structured, semi-structured, unstructured data

Objective:
• In this topic we learn that structured data is clearly defined and easily searchable, while unstructured data is usually stored in its native format. Semi-structured data, such as JSON or XML, sits in between: it carries tags or keys but no fixed table schema. Structured data is typically quantitative, while unstructured data is typically qualitative. Structured data is often stored in data warehouses, while unstructured data is often stored in data lakes.

Recap:

• Revision of database systems.

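As a rough illustration of the three categories described above, the sketch below holds similar facts in a structured form (CSV), a semi-structured form (JSON), and an unstructured form (free text). The names, cities, and field names are made up for this example and do not come from the course material.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (hypothetical data).
csv_text = "id,name,city\n1,Asha,Noida\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys describe the data, but records may differ in shape.
json_text = '[{"id": 1, "name": "Asha", "tags": ["new"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(json_text)

# Unstructured: free text with no predefined fields; it has to be searched or parsed.
note = "Asha from Noida called about a delayed order."

print(rows[0]["city"])             # a column lookup works directly
print(records[1].get("tags", []))  # a key can be missing, so supply a default
print("Noida" in note)             # only text search is available
```

The point of the contrast is the access pattern: the structured rows can be queried by column, the semi-structured records need a check for missing keys, and the free text offers nothing but search.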


[Figure: Types of Data Used within a Spreadsheet]

Numeric, Categorical, Graphical, High Dimensional Data

Objective:
• In the machine learning world, data is nearly always split into two groups: numerical and categorical. Numerical data means anything represented by numbers (floating point or integer). Categorical data generally means everything else; in particular, discrete labeled groups are often called out.

Recap:

• Revision of Data Science Introduction.

• There are two types of variables you’ll find in your data: numerical and categorical. Numerical data can be divided into continuous or discrete values, and categorical data can be broken down into nominal and ordinal values.

Numerical

• Numerical data is information that is measurable and is represented as numbers rather than words or text.

• Continuous values can take any value within a range, including fractions. Examples include variables that represent money or height.

• Discrete values, by contrast, take only distinct, countable values. Examples include variables for days in the month or number of bugs logged.

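The discrete/continuous distinction can be sketched with a small check. The helper name `looks_discrete` and the sample values are hypothetical, and the test is only a heuristic: a continuous variable can happen to contain whole numbers, so the variable's meaning still decides its type.

```python
# Hypothetical values for illustration only.
bugs_logged = [3, 0, 7, 2]            # discrete: counts, always whole numbers
heights_cm = [172.4, 165.0, 180.25]   # continuous: measurements on a scale


def looks_discrete(values):
    """Heuristic: treat a column as discrete if every value is a whole number."""
    return all(float(v).is_integer() for v in values)


print(looks_discrete(bugs_logged))
print(looks_discrete(heights_cm))
```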
Categorical

• Categorical data is any data that isn’t a number, such as a string of text or a date. These variables can be broken down into nominal and ordinal values, though you won’t often see this done.

• Ordinal values have a set order. Examples include the priority of a bug, such as “Critical” or “Low”, or the ranking in a race, such as “First” or “Third”.

• Nominal values are the opposite of ordinal values: they have no set order. Examples include variables such as “Country” or “Marital Status”.

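A minimal sketch of how the two kinds of categorical values are usually handled: ordinal labels get an explicit rank so they can be sorted, while nominal labels are only counted. The rank table is an assumption made for this example (the slide names only "Critical" and "Low"), and the bug titles and countries are invented.

```python
from collections import Counter

# Ordinal: the labels have an order, so attach an explicit rank (assumed scale).
PRIORITY_RANK = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}

bugs = [("login fails", "Critical"), ("typo", "Low"), ("slow page", "High")]
by_priority = sorted(bugs, key=lambda bug: PRIORITY_RANK[bug[1]], reverse=True)

# Nominal: no order exists, so counting per category is the meaningful summary.
countries = ["India", "Japan", "India", "Brazil"]
country_counts = Counter(countries)

print([priority for _, priority in by_priority])
print(country_counts.most_common(1))
```

Sorting the nominal values alphabetically would run without error, but the resulting order would carry no meaning, which is the practical difference between the two types.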
• In addition to ordinal and nominal values, there is a special type of categorical data called binary. Binary data types only have two values: yes or no. This can be represented in different ways, such as “True” and “False” or 1 and 0. Binary data is used heavily for classification machine learning models. Examples of binary variables include whether a person has stopped their subscription service or not, or whether a person bought a car or not.

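Because binary values arrive in several spellings, a common first step is to map them all onto 1 and 0. The sketch below assumes a small set of accepted spellings; the function name `to_binary` and the `churned` sample are hypothetical.

```python
# Accepted spellings are an assumption for this sketch; extend as needed.
TRUTHY = {"yes", "true", "1"}
FALSY = {"no", "false", "0"}


def to_binary(value):
    """Map a yes/no style value onto 1 or 0, rejecting anything unrecognised."""
    text = str(value).strip().lower()
    if text in TRUTHY:
        return 1
    if text in FALSY:
        return 0
    raise ValueError(f"not a binary value: {value!r}")


# Hypothetical column: did the customer stop their subscription?
churned = ["Yes", "no", True, 0, "TRUE"]
encoded = [to_binary(v) for v in churned]
print(encoded)
```

Raising an error on unknown spellings, rather than guessing, keeps a third value such as "maybe" from silently becoming a 0.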
Recap: Types of Data

This module describes the types of data typically encountered in public health applications. Recognizing and understanding the different data types is an important component of proper data use and interpretation.

Reviewed 15 April 2005 / MODULE 2

Data and Variables

Data are often discussed in terms of variables, where a variable is:

Any characteristic that varies from one member of a population to another.

A simple example is height in centimeters, which varies from person to person.

Types of Variables

There are two basic types of variables: numerical and categorical variables.

Numerical Variables: variables to which a number is assigned as a quantitative value.

Categorical Variables: variables defined by the classes or categories into which an individual member falls.

Types of Numerical variables

• Discrete: Reflects a number obtained by counting; no decimal.
• Continuous: Reflects a measurement; the number of decimal places depends on the precision of the measuring device.
• Ratio scale: Order and distance implied. Differences can be compared; has a true zero. Ratios can be compared.
Examples: Height, weight, blood pressure.
• Interval scale: Order and distance implied. Differences can be compared; no true zero. Ratios cannot be compared.
Example: Temperature in Celsius.

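The difference between the interval and ratio scales can be checked with a short calculation. The sketch below uses two arbitrary Celsius readings: their difference is meaningful, but their ratio is not, because 0 °C is not a true zero. Converting to Kelvin, which does have a true zero, gives a ratio that can be compared.

```python
def celsius_to_kelvin(celsius):
    """Kelvin has a true zero, so it is a ratio scale."""
    return celsius + 273.15


t1_c, t2_c = 10.0, 20.0  # arbitrary example readings

diff_c = t2_c - t1_c      # differences are meaningful on an interval scale
naive_ratio = t2_c / t1_c  # 2.0, but 20 °C is not "twice as hot" as 10 °C
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(diff_c, naive_ratio, round(kelvin_ratio, 3))
```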
[Figure: Ratio Scale]

Categorical Variables

Defined by the classes or categories into which an individual member falls.

• Nominal Scale: Name only. Examples: gender, hair color, ethnicity.

• Ordinal Scale: Nominal categories with an implied order. Examples: low, medium, high.

NOMINAL SCALE

b. Appearance of plasma:
   1. Clear
   2. Turbid
   9. Not done

ORDINAL SCALE

81. Urine protein (dipstick reading):
   1. Negative
   2. Trace
   3. 30 mg% or +
   4. 100 mg% or ++
   5. 300 mg% or +++
   6. 1000 mg% or ++++

If urine protein is 3+ or above, be sure subject gets a 24 hour urine collection container and instruction.

What is Graph Data Science?

• Graph Data Science is a science-driven approach to gaining knowledge from the relationships and structures in data, typically to power predictions. It describes a toolbox of techniques that help data scientists answer questions and explain outcomes using graph data.

Applications of Graph Data Science

• Graph Data Science techniques can be used as part of a variety of different applications and use cases.

• Graph queries support domain experts by answering common questions.

• Graph algorithms help make sense of the global structure of a graph, and the results can be used for standalone analysis or as features in a machine learning model.

• Graph embeddings are a core component of similarity graphs that power recommendation systems.

• Natural Language Processing techniques support content-based filtering recommendations and knowledge graph completion.

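To make "graph algorithms" concrete, here is a minimal sketch of one such algorithm, breadth-first search, used to find the shortest path length between two nodes, alongside a simple per-node degree count that could serve as a machine learning feature. The five-node graph and the function name are invented for illustration; real projects would typically use a graph library or graph database instead.

```python
from collections import deque

# Hypothetical undirected graph stored as an adjacency mapping.
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}


def shortest_path_length(graph, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Degree (number of neighbours) is a simple structural feature per node.
degree = {node: len(neighbours) for node, neighbours in graph.items()}

print(shortest_path_length(graph, "A", "E"))
print(max(degree, key=degree.get))
```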
What is Dimensionality?

• Dimensionality in statistics refers to how many attributes a dataset has. For example, healthcare data is notorious for having vast numbers of variables (e.g. blood pressure, weight, cholesterol level). In an ideal world, this data could be represented in a spreadsheet, with one column representing each dimension. In practice, this is difficult to do, in part because many variables are inter-related (like weight and blood pressure).

• Note: Dimensionality means something slightly different in other areas of mathematics and science. For example, in physics, dimensionality can usually be expressed in terms of fundamental dimensions like mass, time, or length. In matrix algebra, two units of measure have the same dimensionality if both statements are true:

(1) A function exists that maps one variable onto another variable.

(2) The inverse of the function in (1) does the reverse.

High Dimensional Data

High dimensional means that the number of dimensions is staggeringly high, so high that calculations become extremely difficult. With high dimensional data, the number of features can exceed the number of observations. For example, microarrays, which measure gene expression, can contain tens to hundreds of samples, while each sample can contain tens of thousands of genes.

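The "more features than observations" condition can be checked directly from the shape of a dataset. The sketch below builds a synthetic table loosely shaped like the microarray example; the sizes (20 samples, 5000 genes) and the random values are assumptions chosen only to show the check.

```python
import random

random.seed(0)  # fixed seed so the synthetic table is reproducible

# Synthetic stand-in for a microarray: rows are samples, columns are genes.
n_samples, n_genes = 20, 5000
data = [[random.random() for _ in range(n_genes)] for _ in range(n_samples)]

n_observations = len(data)
n_features = len(data[0])

# High dimensional in the sense used above: features outnumber observations.
is_high_dimensional = n_features > n_observations
print(n_observations, n_features, is_high_dimensional)
```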
Transactional Data, Spatial Data, Social Network Data

Objective:
• In this topic we learn how transactional data describes an internal or external event or transaction that takes place as an organization conducts its business. Examples include sales orders, invoices, purchase orders, shipping documents, passport applications, credit card payments, and insurance claims.

Recap:

• Revision of Data Science Introduction.

Transactional data

Transactional data is information that is captured from transactions. It records the time of the transaction, the place where it occurred, the price points of the items bought, the payment method employed, discounts if any, and other quantities and qualities associated with the transaction. Transactional data is usually captured at the point of sale.

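The fields listed above map naturally onto a fixed record layout, which is why transactional data is usually structured. The sketch below is one possible layout; the class name, field names, and sample values are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    """One point-of-sale record (hypothetical field layout)."""
    timestamp: datetime      # time of the transaction
    store: str               # place where it occurred
    item: str
    unit_price: float        # price point of the item bought
    quantity: int
    payment_method: str
    discount: float = 0.0    # discount, if any

    def total(self) -> float:
        return round(self.unit_price * self.quantity - self.discount, 2)


sale = Transaction(
    timestamp=datetime(2023, 12, 12, 10, 30),
    store="Greater Noida",
    item="notebook",
    unit_price=45.0,
    quantity=3,
    payment_method="card",
    discount=5.0,
)
print(sale.total())
```

The record is frozen because a captured transaction is normally treated as an immutable fact; corrections are recorded as new transactions rather than edits.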


In other words, transactional data is data generated by various applications while running or supporting everyday business processes of buying and selling. A large and intricate web of point-of-sale servers, security software, ATMs, and payment gateway data exists, originating from every possible device used to complete a financial transaction.

Examples of Transactional Data

• Transactional data typically falls under the category of structured data. Some examples include:
• Financial transactional data: insurance costs and claims data, or a purchase or sale; deposits or withdrawals in the case of banks.
• Logistical transactional data: shipping status, shipping partner data.
• Work-related transactional data: employee hours tracking.

Spatial Data

• Spatial data science (SDS) is a subset of Data Science that focuses on the unique characteristics of spatial data, moving beyond simply looking at where things happen to understand why they happen there.

• SDS treats location, distance & spatial interactions as core aspects of the data, using specialized methods & software to analyze, visualize & apply learnings to spatial use cases.

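Distance is one of the core quantities mentioned above. As a minimal sketch, the haversine formula below estimates the great-circle distance between two latitude/longitude points, assuming a spherical Earth with a mean radius of 6371 km. The function name and test points are chosen for this example; dedicated spatial libraries handle projections and ellipsoids more precisely.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 2))
```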
• Why is Spatial Data Science important in business?

• Private & public sector organizations will be increasing their investment in SDS in the next 2 years (according to The State of SDS in Enterprise).

• From Retail & Real Estate to Telecoms & Utilities, Data Science & Analytics leaders are looking to attract expertise in spatial analysis and to equip those experts with new technology & data streams that enable key use cases and bring more spatial insights into their decision making.

Social Network Data

• Social network data are important for discovering knowledge about a community, which is critical in criminology, terrorism, public health, and many other applications. ... Without sharing social network data, each organization may only have part of a large global social network.

• Social network analysis (SNA) is the process of investigating social structures through the use of networks and graph theory. It characterizes networked structures in terms of nodes (individual actors, people, or things within the network) and the ties, edges, or links (relationships or interactions) that connect them.

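The node-and-tie description above can be sketched with a small friendship network. The code builds an adjacency map from a list of ties, then computes two standard SNA measures: degree centrality (a node's ties divided by the maximum possible, n - 1) and network density (actual ties divided by possible ties). The people and ties are invented for illustration.

```python
from collections import defaultdict

# Hypothetical undirected ties (edges) between people (nodes).
ties = [("Asha", "Ravi"), ("Asha", "Meena"), ("Ravi", "Meena"), ("Meena", "Karan")]

neighbours = defaultdict(set)
for a, b in ties:
    neighbours[a].add(b)
    neighbours[b].add(a)

n = len(neighbours)

# Degree centrality: share of the other n - 1 nodes each node is tied to.
degree_centrality = {person: len(nb) / (n - 1) for person, nb in neighbours.items()}

# Density: actual ties divided by the n * (n - 1) / 2 possible undirected ties.
density = len(ties) / (n * (n - 1) / 2)

print(max(degree_centrality, key=degree_centrality.get))
print(round(density, 3))
```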
• Examples of social structures commonly visualized through social network analysis include social media networks, meme spread, information circulation, friendship and acquaintance networks, business networks, knowledge networks, difficult working relationships, collaboration graphs, kinship, disease transmission, and relationships.

• These networks are often visualized through sociograms in which nodes are represented as points and ties are represented as lines. These visualizations provide a means of qualitatively assessing networks by varying the visual representation of their nodes and edges to reflect attributes of interest.

Standard datasets, Data Classification, Sources of Data

Objective:
• In this topic we learn what standard datasets are and how they make it easy to compare results and to practice a specific data preparation technique or modeling method.

Recap:

• Revision of Data Science Introduction.

• Each dataset is small enough to fit into memory and to review in a spreadsheet. All the datasets consist of tabular data and have no (explicitly) missing values. Examples:

• Swedish Auto Insurance Dataset.
• Wine Quality Dataset.
• Pima Indians Diabetes Dataset.
• Sonar Dataset.
• Banknote Dataset.
• Iris Flowers Dataset.
• Abalone Dataset.
• Ionosphere Dataset.
• Wheat Seeds Dataset.
• Boston House Price Dataset.

Sources of Data

• Sources of data can be categorized on two bases: the purpose of data collection and the type of data source.

[Figure: illustration of the sources of data]

Types of Data

• Data can be classified into two types:

1. Primary data

2. Secondary data

Primary Data

• Data that is first-hand information collected by a surveyor, investigator, etc. is defined as primary data. The source from which such data is collected is termed the primary source of data collection for the concerned information.

• Moreover, data is regarded as primary only if it has never undergone any prior statistical treatment. Such data is usually published, and more data is derived from the published source for other purposes. For example, counting a country’s population is an application of primary data collection.

Features of Primary Data

• Primary data has the following characteristics:

• Such data is being collected for the first time.

• Primary data is original and thereby more reliable than other types of data.

• This kind of data has not been used for any statistical analysis before.



Secondary Data

• Data which has already been collected, analyzed, and published, and which has undergone statistical treatment, can be defined as secondary data. This type of data is derived from primary data sources.

• However, this kind of data can also be collected by surveyors, investigators, etc. to conduct statistical analysis in order to derive newer information.

• For example, the address you enter in a food delivery app is a common use of secondary data. Your address is not new information unless you have just purchased a property.

• In that case, the address of your new property would be considered primary data. This example illustrates the difference between primary and secondary sources of data.

Features of Secondary Data

• Secondary data has the following features:
• Secondary data is considered ‘second-hand information’.
• Secondary data is not original.
• This kind of data has gone through statistical analysis at least once.
• Secondary data is generally less reliable than primary data.
• Another simple example of secondary data is information found on unapproved websites such as Wikipedia, where any user can edit the data on any page at any time.

Methods of Collecting Data in Statistics

• Data collection is a key procedure carried out by most analysts during research. As an analyst, if you are unable to collect the necessary data for your research, your whole venture will lose its credibility.

• Data collection is therefore an essential element of statistical analysis; it is a challenging duty which requires dedication, determination, proper planning and the capability to finish the assignment.

• The first step of data collection is figuring out what kind of data is required. The analysis then starts by collecting a sample from a certain part of the population through a specific sampling method.

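The text above does not name a particular sampling method, so the sketch below assumes the simplest one, simple random sampling without replacement, in which every member of the population has the same chance of being picked. The population of 100 numbered respondents and the seed are arbitrary choices for this example.

```python
import random

# Hypothetical population: respondent IDs 1 to 100.
population = list(range(1, 101))

# A seeded generator keeps the sketch reproducible; the seed itself is arbitrary.
rng = random.Random(42)

# Simple random sampling without replacement: no respondent is chosen twice.
sample = rng.sample(population, k=10)
sample_mean = sum(sample) / len(sample)

print(sorted(sample))
print(sample_mean)
```

Other methods, such as stratified or systematic sampling, change how the members are picked but follow the same pattern of defining the population first and drawing from it second.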
Standard THE CONCEPT
datasets, LEARNING
Data Classification, TASKof Data
Sources

• There are various methods of data collection which can be classified as


per the type of data involved, which are –

• A. Collection of Primary Data

• Collection of Primary Data can be done through various methods, which


are –

• Direct Personal Investigation

• In this method, surveyors or investigators collect the data themselves.


This method is suitable for small projects where the required data needs
to be reliable and excessive effort is not mandatory.

Sanchi Kaushik UNIT 02 Data


12/12/2023 72
Analytics
Standard THE CONCEPT
datasets, LEARNING
Data Classification, TASKof Data
Sources

Collection of Primary Data

• Collection with the Help of Investigators. In this method, one correspondent or a group of correspondents collects the data on the surveyor's behalf. These correspondents are trained investigators employed for this purpose. This method is useful for a large population.

• Assisted by Questionnaires. When a large amount of data must be collected, questionnaires make the process easier. A questionnaire is a set of questions whose answers provide the required data. Surveyors can also mail questionnaires to respondents for added convenience.


• B. Collection of Secondary Data

• Collecting secondary data is much easier than collecting primary data. Secondary data is available from various sources, both published and unpublished.
• However, an investigator using secondary data must check that it is reliable, that it is suitable for the analysis, and whether any bias was introduced when it was sampled.

Data manipulation in various formats, for example, CSV file, PDF file, XML file, HTML file, text file, JSON, image files, etc.

Objective:
• In this topic we learn about four basic types of data manipulation carried out in data science: moving data around unchanged; carrying out machine learning operations on data; testing data; and carrying out analysis operations on data.

Recap:

• Revision of Data Science Introduction.


WHAT IS DATA MANIPULATION?

Data manipulation is the process of changing or reorganizing information to make it more organized and readable. We use a DML to accomplish this. DML stands for Data Manipulation Language: a language, such as the data-modifying statements of SQL, that can add, remove, and alter the data in a database. With DML we can clean and map the data so that it becomes easy to read and interpret.
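To make the idea of a DML concrete, here is a minimal sketch using SQL statements through Python's built-in sqlite3 module. The table name, column names, and sample values are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)"
)

# INSERT adds rows.
conn.executemany(
    "INSERT INTO students (name, marks) VALUES (?, ?)",
    [("Asha", 72), ("Ravi", 58), ("Meera", 91)],
)

# UPDATE alters existing rows.
conn.execute("UPDATE students SET marks = marks + 5 WHERE name = ?", ("Ravi",))

# DELETE removes rows.
conn.execute("DELETE FROM students WHERE marks < ?", (65,))
conn.commit()

# Read the remaining data back in a sorted, readable form.
rows = conn.execute("SELECT name, marks FROM students ORDER BY name").fetchall()
print(rows)
conn.close()
```

The three statements in the middle correspond to the adding, altering, and removing operations mentioned above.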
DATA MANIPULATION EXAMPLES

• Data manipulation is the modification of information to make it easier to read or more structured. For example, a log of data may be sorted in alphabetical order, making individual entries easier to find. Data manipulation is also applied to web server logs so that website owners can monitor their most popular pages and their sources of traffic.

• People in accounting and related fields manipulate information to assess product costs, pricing patterns, or future tax obligations. Stock market analysts also use data manipulation to forecast developments in the market and how stocks might perform in the near future.
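The log examples above can be sketched in a few lines of Python. The log entries below are hypothetical and heavily simplified (a real web server log has many more fields); the sketch only shows sorting entries and counting pages and traffic sources.

```python
from collections import Counter

# Hypothetical, simplified web-server log entries: (requested page, traffic source).
log = [
    ("/home", "google.com"),
    ("/pricing", "twitter.com"),
    ("/home", "google.com"),
    ("/blog", "google.com"),
    ("/home", "newsletter"),
]

# Sorting alphabetically makes individual entries easier to find.
sorted_log = sorted(log)

# Counting reveals the most visited pages and the main sources of traffic.
page_counts = Counter(page for page, _ in log)
source_counts = Counter(source for _, source in log)

print(sorted_log[0])
print(page_counts.most_common(1))
print(source_counts.most_common(1))
```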


PURPOSE OF DATA MANIPULATION

Data manipulation is a key capability for business operations and optimization. To use data properly and turn it into valuable information, for example when analyzing financial data or consumer behavior or doing trend analysis, you have to be able to work with the data in the form you need. Data manipulation therefore gives an organization many advantages, including:


• Consistent data: data in a consistent format can be structured, read, and understood more easily. Data taken from various sources may not give you a unified view, but with data manipulation commands you can make sure it is structured and stored consistently.

• Project data: it is paramount for organizations to be able to use historical data to project the future and to provide more in-depth analysis, especially when it comes to finances. Data manipulation makes this possible.


• Create more value from the data: overall, being able to convert, update, delete, and incorporate data into a database means you can do more with the data. Data that remains static is of little use, but when you know how to use data to your advantage, you gain clear insights for making better business decisions.

• Delete or neglect redundant data: unusable data is always present and can interfere with what matters.

STEPS INVOLVED IN DATA MANIPULATION

• When you want to get started with data manipulation, here are the steps you should take into consideration:

• Data manipulation is feasible only if you have data to work with, so you first need a database, which is built from your data sources.

• This data requires reorganization and restructuring. Data manipulation helps you cleanse it.

• Import the data and create a database to work on.

• Through data manipulation you can combine, delete, or merge information.

• Once the data has been manipulated, data analysis becomes simple.
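The steps above can be sketched end to end with Python's standard library. The two small CSV sources, their column names, and the cleaning rules below are assumptions made for illustration, not a prescribed workflow.

```python
import csv
import io

# Hypothetical raw data from two sources, embedded as CSV text for this sketch.
sales_csv = "id,region,amount\n1,North,100\n2,south ,250\n2,south ,250\n3,North,\n"
regions_csv = "region,manager\nNorth,Anil\nSouth,Bina\n"

# Import the data into working structures.
sales = list(csv.DictReader(io.StringIO(sales_csv)))
managers = {
    row["region"]: row["manager"]
    for row in csv.DictReader(io.StringIO(regions_csv))
}

# Cleanse: normalize text, skip rows with a missing amount, skip duplicates.
cleaned = []
seen_ids = set()
for row in sales:
    region = row["region"].strip().title()
    if not row["amount"] or row["id"] in seen_ids:
        continue
    seen_ids.add(row["id"])
    # Merge in information from the second source.
    cleaned.append({
        "id": int(row["id"]),
        "region": region,
        "amount": float(row["amount"]),
        "manager": managers[region],
    })

# Analysis on the cleaned, merged data is now straightforward.
total_by_region = {}
for row in cleaned:
    total_by_region[row["region"]] = (
        total_by_region.get(row["region"], 0) + row["amount"]
    )
print(total_by_region)
```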


IN EXCEL, HOW DO YOU MANIPULATE DATA?

• Manipulating data in Python and in R are critical skills. Before moving on to the deeper principles of data manipulation in Python and R, let us first look at how to manipulate data in a tool you most likely already know: MS Excel. Here are some tips for manipulating data in Excel.
• Formulas and functions: addition, subtraction, multiplication, and division are some of the basic math operations in Excel. You need to know how to use these essential Excel features.

• Autofill in Excel: this feature is useful when you want to use the same formula across several cells. One way to do that is to retype the formula in each cell. Another is to drag the fill handle at the cell's lower-right corner downwards, which applies the same formula to several rows at once.

• Sort and Filter: the sorting and filtering options in Excel can save a lot of time when analyzing data.


• Removing duplicates: data is often duplicated in the process of collecting and assimilating it. In Excel, the Remove Duplicates feature can help remove duplicate spreadsheet entries.

• Column splitting and merging: columns or rows in Excel may often be added or removed. Organizing data often requires integrating, splitting, or combining multiple datasheets.
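For comparison, the same Excel operations (a formula applied to every row, removing duplicates, sorting, and filtering) can be sketched in plain Python. The rows and the threshold below are hypothetical.

```python
# Hypothetical rows, as they might appear in a worksheet.
rows = [
    {"item": "pen", "qty": 10, "price": 5.0},
    {"item": "book", "qty": 2, "price": 120.0},
    {"item": "pen", "qty": 10, "price": 5.0},  # duplicate entry
    {"item": "bag", "qty": 1, "price": 450.0},
]

# Remove duplicates, keeping the first occurrence of each row.
unique_rows = []
seen = set()
for row in rows:
    key = tuple(row.items())
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# "Formula plus autofill": apply the same formula (qty * price) to every row.
for row in unique_rows:
    row["total"] = row["qty"] * row["price"]

# Sort by total (largest first), then filter on a threshold.
by_total = sorted(unique_rows, key=lambda r: r["total"], reverse=True)
expensive = [r["item"] for r in by_total if r["total"] > 100]
print(expensive)
```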


What is a file format?


• A file format is a standard way in which information is encoded for storage in a file. First, the file format specifies whether the file is a binary file or a text (for example, ASCII) file. Second, it specifies how the information is organized. For example, the comma-separated values (CSV) file format stores tabular data in plain text.
• To identify a file format, you can usually look at the file extension. For example, a file named “Data” saved in CSV format appears as “Data.csv”. The “.csv” extension tells us that it is a CSV file and that the data is stored in a tabular format.
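A minimal sketch of identifying a format from the file extension is shown below. The lookup table and the function name are hypothetical, and an extension is only a hint: a file can be renamed without its contents changing.

```python
from pathlib import Path

# Hypothetical lookup table from extension to a short description.
KNOWN_FORMATS = {
    ".csv": "comma-separated values (plain text, tabular)",
    ".json": "JavaScript Object Notation (plain text)",
    ".xlsx": "Excel Open XML spreadsheet (binary container)",
    ".png": "image (binary)",
}


def describe_format(filename: str) -> str:
    """Guess the file format from the extension alone."""
    suffix = Path(filename).suffix.lower()
    return KNOWN_FORMATS.get(suffix, "unknown format")


print(describe_format("Data.csv"))
print(describe_format("notes"))
```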


Why should a data scientist understand different file formats?

• The files you come across will usually depend on the application you are building. For example, an image processing system needs image files as input and output, so you will mostly see files in JPEG, GIF, or PNG format.

• As a data scientist, you need to understand the underlying structure of the various file formats, along with their advantages and disadvantages. Unless you understand the underlying structure of the data, you will not be able to explore it. At times you will also need to make decisions about how to store data.


• Choosing the right file format for storing data can improve performance when processing data.

• We will now look at the following file formats and how to read them in Python:
• Comma-separated values
• XLSX
• ZIP
• Plain Text (txt)
• JSON
• XML
• HTML
• Images
• Hierarchical Data Format
• PDF
• DOCX
• MP3
• MP4


Comma-separated values

• The comma-separated values file format falls under the spreadsheet file formats.

What is Spreadsheet File Format?

• In a spreadsheet file format, data is stored in cells, which are organized in rows and columns. Different columns can hold different types: for example, a column can be of string type, date type, or integer type. Some of the most popular spreadsheet file formats are Comma Separated Values (CSV), Microsoft Excel Spreadsheet (xls), and Microsoft Excel Open XML Spreadsheet (xlsx).


• Each line in a CSV file represents an observation, commonly called a record. Each record contains one or more fields, separated by commas.

• Sometimes you may come across files in which the fields are separated by tabs instead of commas. This file format is known as TSV (Tab Separated Values).

[Figure: a CSV file opened in Notepad.]
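The record and field structure of CSV and TSV files can be read with Python's standard csv module. The sketch below uses a small in-memory sample (hypothetical names and values) instead of a file on disk; the helper function name is also an assumption.

```python
import csv
import io

# Hypothetical sample data: the same records as CSV and as TSV.
csv_text = "name,city,age\nAsha,Noida,21\nRavi,Delhi,23\n"
tsv_text = csv_text.replace(",", "\t")


def read_records(text, delimiter=","):
    """Parse delimited text into a list of records keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_records = read_records(csv_text)
tsv_records = read_records(tsv_text, delimiter="\t")

print(csv_records[0])
print(csv_records == tsv_records)
```

Only the delimiter differs between the two calls; the parsed records are the same. Note that every field is read as a string, so numeric columns need converting before analysis.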


XLSX files

• XLSX is the Microsoft Excel Open XML file format. It also falls under the spreadsheet file formats. It is an XML-based file format used by Microsoft Excel and was introduced with Microsoft Office 2007.

• In XLSX, data is organized in cells, arranged in rows and columns, within a sheet. Each XLSX file (a workbook) may contain one or more sheets.


[Figure: an XLSX file opened in Microsoft Excel.]


Plain Text (txt) file format

• In the plain text file format, everything is written as plain text. The text is usually unstructured and has no metadata associated with it. Any program can read a txt file easily, but because the content has no defined structure, it is difficult for a program to interpret.
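Reading a text file is simple; giving its content meaning is left to the program. The sketch below writes a small hypothetical file to a temporary directory, reads it back, and imposes a very basic structure (lines and words) on it.

```python
import tempfile
from pathlib import Path

# Hypothetical file content.
text = "Data analytics turns raw data into insight.\nPlain text has no schema.\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    path.write_text(text, encoding="utf-8")
    # Reading is easy: the file is just a sequence of characters.
    lines = path.read_text(encoding="utf-8").splitlines()

# Interpretation is up to us: here we split into lowercase words.
words = [w.strip(".,").lower() for line in lines for w in line.split()]
print(len(lines), len(words), words.count("data"))
```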


[Figure: a simple example of the contents of a text file.]


JSON file format

• JavaScript Object Notation (JSON) is a text-based open standard designed for exchanging structured data over the web. Because JSON is a language-independent data format, it can be read easily in any programming language.
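As an illustration, Python's built-in json module converts JSON text to native objects and back. The record below is hypothetical.

```python
import json

# Hypothetical JSON document, as it might arrive from a web service.
json_text = '{"name": "Asha", "courses": ["Data Analytics", "ML"], "enrolled": true}'

# Parse: JSON objects become dicts, arrays become lists, true becomes True.
record = json.loads(json_text)
record["courses"].append("Statistics")

# Serialize back to JSON text for storage or transmission.
out = json.dumps(record, indent=2)
print(out)
```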


HTML files

• HTML stands for HyperText Markup Language. It is the standard markup language for creating web pages, and it describes the structure of a page using markup. HTML tags look like XML tags, but they are predefined. You can identify the parts of an HTML document from its tags: for example, <head> holds the document's metadata and <p> marks a paragraph. HTML tag names are not case sensitive.

Image files
Image files are probably the most fascinating file format used in data science. Any computer vision application is based on image processing.
• An image consists of pixels and the metadata associated with them. Each image consists of one or more frames of pixels, and each frame is a two-dimensional array of pixel values. Pixel values can be of any intensity. The metadata associated with an image can include the image type (for example, .png) or the pixel dimensions.
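The idea of a frame as a two-dimensional array of pixel values can be sketched without any image library. The tiny grayscale frame and the metadata dictionary below are hypothetical; real image files are normally read with a dedicated library.

```python
# A hypothetical 3x4 grayscale frame: each value is a pixel intensity (0-255).
frame = [
    [0, 64, 128, 255],
    [10, 20, 30, 40],
    [255, 255, 0, 0],
]

# Metadata that would accompany the pixels in a real image file.
metadata = {"type": "png", "height": len(frame), "width": len(frame[0])}

# A simple pixel-level operation: invert every intensity.
inverted = [[255 - pixel for pixel in row] for row in frame]
brightest = max(max(row) for row in frame)

print(metadata)
print(inverted[0])
```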

Import and export data

Objective:
• In this topic we learn about the import and export of data, which is the automated or semi-automated input and output of data sets between different software applications. ... Import and export of data share a semantic analogy with copying and pasting, in that sets of data are copied from one application and pasted into another.

Recap:

• Revision of Data Science Introduction.


Reading the HTML file


To read an HTML file, you can use the BeautifulSoup library. For guidance on parsing HTML documents, refer to the tutorial “Beginner’s Guide to Web Scraping in Python (using BeautifulSoup)”.
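BeautifulSoup is a third-party package. As a dependency-free sketch of the same idea, the example below uses Python's standard html.parser module instead; this is a stand-in for illustration, not the BeautifulSoup API. The HTML snippet, the URL, and the class name are hypothetical.

```python
from html.parser import HTMLParser


class HeadingAndLinkParser(HTMLParser):
    """Collect the text of <h1> headings and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <H1> and <h1> are treated alike.
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


# Hypothetical page content.
html = """
<html><head><title>Demo</title></head>
<body>
  <H1 class='sports'>Sports News</H1>
  <p>See the <a href='https://example.com/scores'>scores</a>.</p>
</body></html>
"""

parser = HeadingAndLinkParser()
parser.feed(html)
print(parser.headings, parser.links)
```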


PDF file format


PDF (Portable Document Format) is a very useful format for presenting and displaying text documents together with embedded graphics. A special feature of a PDF file is that it can be secured with a password.

Reading a PDF file

Reading a PDF file programmatically, on the other hand, is a complex task. There are libraries that do a good job of parsing PDF files; one of them is PDFMiner. To read a PDF file with PDFMiner, you have to:
• Download and install PDFMiner from https://euske.github.io/pdfminer/
• Extract the text of the PDF file with the following command:
pdf2txt.py <pdf_file>.pdf

Faculty Video Links, YouTube & NPTEL Video Links and Online Courses Details

YouTube videos

https://www.youtube.com/watch?v=uufDGjTuq34

https://www.youtube.com/watch?v=XVv6mJpFOb0

https://www.youtube.com/watch?v=guPOL9UplNs

https://www.youtube.com/watch?v=ve_0h4Y8nuI&list=PLhTjy8cBISEqkN-5Ku_kXG4QW33sxQo0t

https://www.youtube.com/watch?v=pLoRrHEsHb0&list=PLmcBskOCOOFUmbUv0CIMuATDVKVrOhBMV

MCQ unit wise/weekly

1: What protocol can be used to retrieve web pages using Python?
A. urllib
B. bs4
C. HTTP
D. GET

2: What provides two-way communication between two different programs in a network?
A. socket
B. port
C. http
D. protocol

3: What is a Python library that can be used to send and receive data over HTTP?
A. http
B. urllib
C. port
D. header

4: What is the process by which search engines retrieve webpages and build a search index called?
A. scrape
B. parse
C. BeautifulSoup
D. spider

5: What does the following block of code do?

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

A. It sends a request to extract 'romeo.txt' from 'data.pr4e.org'
B. It sends the 'romeo.txt' file to 'data.pr4e.org'
C. It creates a file named 'romeo.txt'
D. It throws an error because a socket cannot use HTTP



6: What does the following block of code do?

# Assumes urllib.request and BeautifulSoup are imported and ctx is an SSL context.
url = "https://www.nytimes.com"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

A. retrieves and displays the webpage
B. parses the HTML content of the "https://www.nytimes.com" webpage
C. downloads the webpage
D. It throws an error because a socket cannot use HTTP

7: Given the HTML below, how would this tag type be described in web scraping code?

<h1 class='sports'>Sports News</h1>

A. h1
B. h1, class='sports'
C. h1, class_='sports'
D. 'h1', class_='sports'


8: How does one parse the HTML into a BeautifulSoup object given a response object?
A. soup = BeautifulSoup(response.text, 'html.parser')
B. soup = BeautifulSoup(response.content, 'html.parser')
C. soup = BeautifulSoup(response.string, 'html.parser')
D. soup = BeautifulSoup(fp, "html.parser")


9: Which of the following gets the value for the id in the first p tag?
A. soup.p.get('id')
B. soup.p.get('id', None)
C. soup.p[id]
D. soup.p['id']


10: Which of the following gets the first link tag and returns a dictionary of all attributes and values for that link tag?
A. soup.a.attributes
B. soup.link.attrs
C. soup.a.attrs
D. soup.link.attributes

Weekly/Monthly/Unit-wise Assignment

Assignment 1
• What are the different types of data?
• What are the different types of data sources?
• Explain the direct personal investigation method of collecting primary data. Discuss its merits and demerits.
• What is secondary data? Discuss the various sources of secondary data.
• What precautions should we take while using secondary data?
• What are the methods of primary data collection?
• What is categorical data?

Glossary Questions
1. a) Data Cleaning b) Data Transformation c) Data Reduction d) Data Integration
i. __________ involves combining data residing in different sources and providing users with a unified view of them.
ii. __________ is the process of converting data from one format to another.
iii. __________ is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.
iv. __________ is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

2. a) Relational database b) Transactional database c) Data Warehouse d) Spatial database
i. __________ is a database management system (DBMS) that can reverse or roll back a database transaction or activity if it isn't performed correctly.
ii. __________ is a database optimized for storing and querying data that represents objects defined in a geometric space.
iii. __________ is a type of database that stores and provides access to data points that are related to one another.
iv. __________ is a system used for reporting and data analysis and is considered a core component of business intelligence.

3. a) Statistics b) Visualization c) Data Mining d) Clustering
i. __________ is a process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning and statistics.
ii. __________ is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
iii. __________ is the graphical representation of information and data.
iv. __________ is the science concerned with developing and studying methods for collecting, analysing, interpreting, and presenting empirical data.
Expected Questions for University Exam

1. What is data manipulation?
2. Define outliers. How are they identified?
3. Name some methods of missing-value imputation.
4. Explain the standardization scaling method used to normalize data.
5. Name the top two techniques for handling missing data.
6. Define standardization.
7. What precautions should we take while using secondary data?
8. What are the methods of primary data collection?
9. What is categorical data?

Summary

• This unit introduces the fundamentals of the Data Analytics domain and its latest trends in industry.
• The unit also covers the different types of data and, importantly, the implementation of machine learning and the concept of model building as used in industry.
• The unit imparts knowledge of business analytics and the tools used in data analytics.