
INF30036 DataTypes Lecture2-1

Types of data include qualitative (categorical) and quantitative (numerical) data. Qualitative data can be nominal, with no implied order, or ordinal, with an implied order. Quantitative data can be discrete, taking on countable values, or continuous, taking on any value within a range. R has various data types like character, numeric, factor, and logical, as well as data structures like vectors, lists, matrices, and data frames. The dplyr package contains functions for selecting, filtering, arranging, mutating, and summarizing data. Data exploration involves activities to increase understanding of data quality issues like incomplete or inaccurate values.

Lecture 2

Data types, and data in R


Agenda

• Types of data
• Working with data in RStudio
• Intro to data exploration

Part 1
Types of data
Types of Data for Analytics

Data
• Categorical (qualitative)
> Nominal: named categories
> Ordinal: categories with an implied order
• Numerical (quantitative)
> Discrete: only particular numbers
> Continuous: any numeric value
Qualitative

Qualitative data (also called categorical or attribute data)

Can be separated into different categories that are distinguished by some non-numeric characteristic.

Example: genders (male/female) of professional athletes, states of a country, etc.
Quantitative

Quantitative data

Numbers representing counts or measurements.

Example: profitability of a company, weather measurements such as temperature, time, etc.
Exercise 1

Qualitative or Quantitative?
• Colors of automobiles in a dealer’s showroom.
• Number of seats in movie theaters.
• Classification of patients based on nursing care needed (complete, partial, or self-care).
• Lengths of newborn cats of a certain species.
• Number of complaint letters received by an airline per month.
Quantitative data

Working with Quantitative data

Quantitative data can be further divided into discrete and continuous types.
Discrete data

Discrete

Data result when the number of possible values is either finite or ‘countable’ (0, 1, 2, 3, . . .).

Example: the number of students in the class, the number of possible outcomes when rolling 2 dice.
Continuous data

Continuous

Numerical data that can take infinitely many possible values on a continuous scale, covering a range without gaps.

Example: height, weight, time, etc.
Exercise 2

Discrete or continuous?

• Number of cartons of milk manufactured each day.
• Temperatures of airplane interiors at a given airport.
• Incomes of college students on work study programs.
• Number of cars parked in a parking lot.
• Weights of newborn calves.
• Number of tomatoes on each plant in a field.
Qualitative data

Working with Qualitative data

Qualitative data can be divided into nominal and ordinal types.
Nominal data

Nominal data

Characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme (such as low to high); each label/category is simply different.

Ex: Country/State/City, Male/Female, Yes/No, etc.

Can you convert quantitative data to qualitative data?
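One way to approach the question of converting quantitative data to qualitative data is binning. The sketch below assumes a hypothetical vector of ages and made-up group boundaries and labels; it uses base R's cut(), which turns a numeric vector into a factor of labelled intervals.

```r
# Hypothetical ages; any numeric vector would do.
ages <- c(14, 24, 17, 19, 42)

# cut() bins a numeric vector into labelled intervals and returns a factor.
# With right = FALSE the intervals are [0, 18), [18, 40) and [40, Inf).
# The break points and labels are illustrative assumptions.
age_group <- cut(ages,
                 breaks = c(0, 18, 40, Inf),
                 labels = c("minor", "young adult", "older adult"),
                 right = FALSE)

print(age_group)         # one category per original value
print(table(age_group))  # counts per category
```

The conversion loses information: once ages are grouped, the exact values cannot be recovered from the categories.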
Ordinal data

Ordinal data

Involves data that can be arranged in some order, but differences between data values either cannot be determined or are meaningless.

Ex: course grades, medals (Gold/Silver/Bronze)
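In R, ordinal data can be represented as an ordered factor. The following is a minimal sketch using the medal example; the vector contents are made up for illustration.

```r
# Hypothetical medal results. The levels argument fixes the order,
# from lowest (Bronze) to highest (Gold).
medals <- factor(c("Silver", "Gold", "Bronze", "Gold"),
                 levels = c("Bronze", "Silver", "Gold"),
                 ordered = TRUE)

print(medals)

# Comparisons are meaningful for ordered factors.
print(medals[1] < medals[2])  # TRUE: Silver ranks below Gold

# Sorting follows the level order, not the alphabet.
print(sort(medals))
```

Note that the order is defined, but the distance between Bronze and Silver is not, which is exactly what distinguishes ordinal from numerical data.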
Exercise 3

Nominal or Ordinal?

• Horsepower of motorcycle engines.
• Ratings of newscasts in Houston (poor, fair, good, excellent)
• Temperature of automatic popcorn poppers
• Time required for drivers to complete a course
• Marital status of respondents to a survey of savings accounts.
• Organizational hierarchy – Analyst, Manager, Director, CEO
Part 2
Working with data in RStudio
Hello World
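
A minimal sketch of what a first "Hello World" in the RStudio console might look like. The variable name greeting is an assumption chosen for illustration.

```r
# Assign a character string to a variable with the <- operator.
greeting <- "Hello World"

# print() writes the value to the console.
print(greeting)
```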

R Data types

• Character
> Strings
> Ex: “Survived” or “3.14”
• Numeric
> Integer/double
> Ex: 3L (integer), 3.14 (double); complex values such as 3+14i are a separate type
• Factor
> Factor is a class for categorical variables
> Factors have different levels (categories)
> Ex: Survived has two levels – “Survived” and “Not Survived”
> Factor levels can also be numeric labels – Ex: Survived – “0” for Not Survived and “1” for Survived
• Logical
> TRUE/FALSE
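The data types above can be inspected with class() and typeof(). This is a small sketch; all variable names and values are chosen for illustration only.

```r
status   <- "Survived"   # character
price    <- 3.14         # numeric, stored as a double
count    <- 3L           # integer (the L suffix requests an integer)
z        <- 3 + 14i      # complex, a separate type from numeric
survived <- factor(c("Survived", "Not Survived", "Survived"))
flag     <- TRUE         # logical

print(class(status))     # "character"
print(class(price))      # "numeric"
print(typeof(price))     # "double"
print(class(count))      # "integer"
print(class(z))          # "complex"
print(class(survived))   # "factor"
print(levels(survived))  # levels are sorted alphabetically by default
print(class(flag))       # "logical"
```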
Data Structures

• Vector
• List
• Factor
• Matrix
• Data frame

R is Vectorized
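
"Vectorized" means that most operations apply to every element of a vector at once, without an explicit loop. A short sketch, with hypothetical prices and quantities:

```r
prices     <- c(10, 20, 30)
quantities <- c(1, 2, 3)

# Element-wise multiplication of two vectors of the same length.
revenue <- prices * quantities
print(revenue)      # 10 40 90

# A single number is recycled across the whole vector.
discounted <- prices * 0.9
print(discounted)

# Comparisons are vectorized too and return a logical vector.
over_15 <- prices > 15
print(over_15)      # FALSE TRUE TRUE
```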

Vector

• The most basic R object is a vector
• A vector can only contain objects of the same data type
• Empty vectors can be created with the vector() function
Vectors

The c() function can be used to create vectors of objects
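A brief sketch of vector() and c() in use. The variable names and values are illustrative.

```r
# vector() creates an "empty" vector of a given type and length;
# numeric vectors are filled with zeros.
empty <- vector("numeric", length = 3)
print(empty)        # 0 0 0

# c() combines values of the same type into a vector.
nums  <- c(1.5, 2, 3.25)
words <- c("a", "b")
flags <- c(TRUE, FALSE)

print(length(nums)) # 3
print(nums[2])      # indexing starts at 1, so this is 2
print(class(words)) # "character"
```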
Exercise 4

Spend 5 minutes creating vectors with the following information:

• Bob’s age – 14
• Smith’s age – 24
• Matt’s age – 17
• Liam’s age – 19
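One possible answer, shown as a sketch: two vectors, one for names and one for ages, with the names attached to the ages. The variable names are assumptions, and other solutions are equally valid.

```r
# One vector for the names, one for the ages.
people_names <- c("Bob", "Smith", "Matt", "Liam")
ages         <- c(14, 24, 17, 19)

# Attach the names so that each age can be looked up by person.
names(ages) <- people_names

print(ages)
print(ages[["Matt"]])  # 17
print(mean(ages))      # 18.5
```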
List

• A list is a special type of vector
> Can contain elements of different classes (either basic or compound classes)
> Each element of a list can have a name
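A small sketch of a list holding elements of different classes. The element names and values are hypothetical.

```r
# Each element has a name and its own class.
person <- list(name = "Bob",
               age = 14,
               scores = c(80, 92.5),
               enrolled = TRUE)

print(person$name)          # access by name with $
print(person[["scores"]])   # or with [[ ]]
print(sapply(person, class))
```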
Factor
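
A minimal sketch of creating a factor, reusing the Survived example from the data types slide. The raw 0/1 values below are made up for illustration.

```r
# Hypothetical raw codes: 0 = Not Survived, 1 = Survived.
survived_raw <- c(1, 0, 0, 1, 1)

# levels lists the codes that occur; labels gives them readable names.
survived <- factor(survived_raw,
                   levels = c(0, 1),
                   labels = c("Not Survived", "Survived"))

print(levels(survived))
print(table(survived))       # counts per level
print(as.integer(survived))  # internal codes: 2 1 1 2 2
```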

Matrix
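
A short sketch of building and indexing a matrix with base R's matrix(). The values are arbitrary.

```r
# matrix() fills column by column by default.
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)
print(dim(m))    # 2 3

# Index with [row, column].
print(m[2, 3])   # 6
print(m[1, 2])   # 3

# byrow = TRUE fills row by row instead.
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(by_row[1, 2])  # 2

# t() transposes rows and columns.
print(dim(t(m)))     # 3 2
```

Like a vector, a matrix can only hold one data type.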
Data frames
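
A data frame is a table in which each column is a vector and different columns may have different types. The sketch below reuses the ages from Exercise 4; the column names are assumptions.

```r
people <- data.frame(
  name = c("Bob", "Smith", "Matt", "Liam"),
  age  = c(14, 24, 17, 19),
  stringsAsFactors = FALSE   # keep names as character, not factor
)

str(people)            # structure: column names and types
print(nrow(people))    # 4 rows (samples)
print(ncol(people))    # 2 columns (variables)

# Select a column with $, and rows with a logical condition.
print(people$age)
adults <- people[people$age >= 18, ]
print(adults)
```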

Missing values
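
R represents missing values as NA. A minimal sketch with a hypothetical salary vector:

```r
salary <- c(52000, NA, 61000, NA, 48000)

# is.na() flags missing entries.
print(is.na(salary))
print(sum(is.na(salary)))   # 2 missing values

# Most summaries return NA if any value is missing...
print(mean(salary))         # NA

# ...unless missing values are explicitly removed.
print(mean(salary, na.rm = TRUE))
```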

Coercion

Coercion – explicit coercion
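
A sketch contrasting implicit coercion (R converts types on its own when values are mixed in one vector) with explicit coercion (the as.*() functions). The values are illustrative.

```r
# Implicit: mixing types in c() converts everything to the most general type.
mixed <- c(1, "a", TRUE)
print(class(mixed))     # "character"

nums <- c(1, TRUE, FALSE)
print(nums)             # 1 1 0 (logical values become numbers)

# Explicit: request the conversion with as.*() functions.
print(as.numeric("3.14"))   # 3.14
print(as.integer(3.9))      # 3 (truncates, does not round)
print(as.character(42))     # "42"

# A conversion that makes no sense yields NA (with a warning).
bad <- suppressWarnings(as.numeric("abc"))
print(bad)                  # NA
```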

Reading/Writing Data

• Many file formats can be imported into R.
• In this course we will primarily use .csv, and occasionally Excel files.
• To read data, first set the working directory to the folder where the data sits.
• For csv
> {variable name} <- read.csv("filename.csv")
• For .xls files
> install.packages("xlsx")
> library(xlsx)
> {variable name} <- read.xlsx("filename.xls", sheetIndex = 1)
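A self-contained sketch of the csv round trip using only base R. The file name people.csv and the use of a temporary folder are assumptions made so the example runs anywhere; in practice the file would sit in the working directory set with setwd().

```r
# Build a path in a temporary folder (hypothetical file name).
csv_path <- file.path(tempdir(), "people.csv")

# Write a small data frame to csv, without row numbers.
people <- data.frame(name = c("Bob", "Smith"), age = c(14, 24))
write.csv(people, csv_path, row.names = FALSE)

# Read it back into a new variable.
loaded <- read.csv(csv_path)
str(loaded)
print(loaded)
```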
Data wrangling with dplyr

• dplyr - The dplyr package contains five key data manipulation functions, also called verbs:
> select() - returns a subset of the columns,
> filter() - returns a subset of the rows,
> arrange() - reorders the rows according to one or more variables,
> mutate() - adds new columns computed from existing data,
> summarise() - reduces each group to a single row by calculating aggregate measures.
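The five verbs can be sketched on a small, made-up data frame. The example assumes the dplyr package is installed; the data and column names (name, age, city) are hypothetical.

```r
library(dplyr)

people <- data.frame(
  name = c("Bob", "Smith", "Matt", "Liam"),
  age  = c(14, 24, 17, 19),
  city = c("Perth", "Sydney", "Perth", "Sydney")
)

# filter() keeps rows, select() keeps columns, arrange() sorts rows.
adults <- people %>%
  filter(age >= 18) %>%
  select(name, age) %>%
  arrange(desc(age))
print(adults)

# mutate() adds a column computed from existing ones.
with_flag <- people %>% mutate(is_adult = age >= 18)
print(with_flag)

# summarise() reduces each group (here: each city) to one row.
by_city <- people %>%
  group_by(city) %>%
  summarise(mean_age = mean(age), n = n())
print(by_city)
```

The %>% pipe passes the result of one step as the first argument of the next, which keeps a sequence of verbs readable from top to bottom.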
R for Business analytics

• Advantages
> Designed for Statistical Analysis
– Many built-in functions
> Large number of libraries
> Mature open source project
• Disadvantages
> Overhead (does not scale well to very large data)
• Use R as a “sandbox” to play with a sample
Part 3
Data exploration (Intro)
Data exploration?

• Data exploration involves activities that increase understanding of the data
• No quality data, no quality predictive results!
> Quality decisions must be based on quality data
– e.g., duplicate or missing data may cause incorrect or even misleading statistics (Garbage In, Garbage Out, or GIGO)
> A data warehouse needs consistent integration of quality data
Data reduction

[Figure: a sample dataset (csv or Excel file) in which the columns are variables (the headers in the Excel file) and the rows are samples]
The fundamental data problem

• Incomplete data
• Inaccurate data
• Inconsistent data
• Unobtainable data

[Figure: programs, interfaces, databases and temporary databases exchanging data]
Data exploration

• Data in the real world might have issues such as:
> Missing or incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– e.g., occupation=“ ”
> Noisy: containing errors or outliers
– e.g., Salary=“-10”
> Inconsistent: containing discrepancies in codes or names
– e.g., Age=“42” Birthday=“03/07/1997”
– e.g., Was rating “1,2,3”, now rating “A, B, C”
– e.g., Duplicate records
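The issues listed above can be checked with a few lines of base R. This is a sketch on a made-up data frame; the column names and the rule "salary must not be negative" are assumptions for illustration.

```r
# Hypothetical records containing a blank, an NA, a negative salary
# and one fully duplicated row.
records <- data.frame(
  id         = c(1, 2, 3, 3, 4),
  occupation = c("nurse", "", "teacher", "teacher", NA),
  salary     = c(52000, 61000, -10, -10, 48000),
  stringsAsFactors = FALSE
)

# Missing or incomplete: NA or blank strings.
missing_occ <- is.na(records$occupation) | trimws(records$occupation) == ""
print(sum(missing_occ))   # 2

# Noisy: values that cannot be right, such as a negative salary.
noisy_salary <- records$salary < 0
print(sum(noisy_salary))  # 2

# Duplicate records: duplicated() flags repeats of an earlier row.
dupes <- duplicated(records)
print(sum(dupes))         # 1

# Drop the duplicates and keep everything else for further inspection.
deduped <- records[!dupes, ]
print(nrow(deduped))      # 4
```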
Common exploration tools

• Drawing plots
• Using visualization tools (e.g., Tableau, Cognos)
• Programming in RStudio / Python
• Rattle package in RStudio
Thank You for your attention
