AF Notes W2

Week 2 of the course focuses on exploratory data analysis. There are 4 main steps in data analysis: 1) recognize a problem, 2) gather data, 3) analyze the data, and 4) act on the analysis. Exploratory data analysis involves visually presenting data through graphs and tables to detect issues, check assumptions, and assess relationships between variables. Key goals are to make important patterns stand out and present the data in a way that is understandable to laypeople. Descriptive measures like counts, means, and variability are used to summarize both categorical and numerical variables. Outliers and missing values are also addressed.

Uploaded by

Vivian Lau
Week 2: Exploratory Data Analysis

Introduction to Data Analysis

[Figure: data analysis process diagram with the stages Problem Definition, Data Exploration, Data Preparation, Model Development, Model Validation, Model Implementation, and Results Tracking]

4 main steps in data analysis


1. Recognise a problem that needs to be solved
2. Gather data to help understand and then solve the problem
3. Analyse the data
4. Act on the analysis

Introduction to Exploratory Data Analysis


- The goal is to present data in a form that makes sense to laypeople
> Graphs: bar charts, pie charts, histograms, scatterplots, time series, other visual analytics
> Numerical summary measures: e.g. counts, percentages, averages, and measures of variability
> Tables of summary measures: totals, averages, counts, grouped by category
- Key is to make the important data stand out
- Purposes
> Detect issues/mistakes
> Check assumptions
> Preliminary selection of models
> Determination of relationships between explanatory variables
> Assessment of direction and magnitude of relationships between explanatory and outcome (dependent) variables
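The "tables of summary measures grouped by category" mentioned above can be sketched in a few lines of Python. This is only an illustration: the `sales` records and the category labels are made-up values, not data from the course.

```python
from collections import defaultdict

# Hypothetical (category, amount) records, invented for illustration.
sales = [("A", 10.0), ("B", 20.0), ("A", 30.0), ("B", 40.0), ("A", 50.0)]

# Collect the amounts belonging to each category.
groups = defaultdict(list)
for category, amount in sales:
    groups[category].append(amount)

# Build one summary row per category: count, total, and average.
summary = {
    category: {
        "count": len(amounts),
        "total": sum(amounts),
        "average": sum(amounts) / len(amounts),
    }
    for category, amounts in sorted(groups.items())
}

for category, row in summary.items():
    print(category, row["count"], row["total"], row["average"])
```

A table like this makes differences between categories easy to see at a glance, which is the point of presenting data for laypeople.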
Key Terminology
- A population includes all of the entities of interest in a study (people, households, machines, etc.)
- A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole.
- A data set is usually a rectangular array of data, with variables in columns and observations in rows.
- A variable (or field or attribute) is a characteristic of members of a population, such as height, gender, or salary.
- An observation (or case or record) is a list of all variable values for a single member of a population.
- A variable is numerical if meaningful arithmetic can be performed on it; otherwise, the variable is categorical. There is also a third data type, the date variable: dates are stored as numbers in SAS but are treated differently.
- A categorical variable is ordinal if there is a natural ordering of its possible values; if there is no natural ordering, it is nominal.
- Categorical variables can be coded numerically or left uncoded.
- A dummy variable is a 0-1 coded variable for a specific category: it is coded as 1 for all observations in that category and 0 for all observations not in that category.
- Categorizing a numerical variable by putting the data into discrete categories (called bins) is called binning or discretizing.
- A variable that has been categorized in this way is called a binned or discretized variable.
- A numerical variable is discrete if it results from a count, such as the number of children.
- A continuous variable is the result of an essentially continuous measurement, such as weight or height.
- Cross-sectional data are data on a cross section of a population at a distinct point in time; longitudinal data (time series data) are data collected over time.
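Two of the terms above, dummy variables and binning, are easy to illustrate in code. The sketch below uses plain Python; the sample values, the bin edges (30 and 50), and the bin labels are assumptions chosen for the example, and the helper names `dummy` and `bin_value` are hypothetical.

```python
# Hypothetical sample: one categorical and one numerical variable.
genders = ["F", "M", "F", "M", "F"]
ages = [23, 37, 45, 61, 18]

def dummy(values, category):
    """Return a 0-1 coded variable: 1 if the value equals `category`, else 0."""
    return [1 if v == category else 0 for v in values]

def bin_value(x, upper_edges, labels):
    """Assign x to the first bin whose upper edge it does not exceed.

    `labels` has one more entry than `upper_edges`; the last label
    catches everything above the final edge.
    """
    for upper, label in zip(upper_edges, labels):
        if x <= upper:
            return label
    return labels[-1]

is_female = dummy(genders, "F")
age_bins = [bin_value(a, [30, 50], ["young", "middle", "older"]) for a in ages]

print(is_female)  # [1, 0, 1, 0, 1]
print(age_bins)   # ['young', 'middle', 'middle', 'older', 'young']
```

Note that the binned variable is ordinal: the labels have a natural order even though arithmetic on them is no longer meaningful.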
Descriptive Measures for Categorical Variables

- Count
- Naming
- Count within categories
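As a rough sketch of counting within categories, the standard-library `Counter` can tally each category and the tallies can be converted to percentages. The `regions` values are invented for illustration.

```python
from collections import Counter

# Hypothetical categorical variable with six observations.
regions = ["North", "South", "North", "East", "North", "South"]

# Count of observations within each category.
counts = Counter(regions)

# Express each count as a percentage of all observations.
total = sum(counts.values())
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(counts.most_common())  # [('North', 3), ('South', 2), ('East', 1)]
print(percentages)           # {'North': 50.0, 'South': 33.3, 'East': 16.7}
```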
Descriptive Measures for Numerical Variables
- There are many ways to summarize numerical variables, both with numerical summary measures (e.g. mean, variability) and with charts
- Charts that can be used for numerical variables include histograms, boxplots, time series graphs, etc.
> Histogram: most common type of chart used to show the distribution of a numerical variable
>> based on binning of the variable (division of the variable into discrete categories based on range)
>> good for showing the shape of a distribution and identifying medians and skew
> Boxplot: Alternative type of chart to show the distribution of a variable
> Time series graph: Usually a line graph that graphs the values of one or more time series variables with time on the horizontal axis; always start a time series analysis with a time series graph
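The numerical summary measures and the binning behind a histogram can be sketched with the standard `statistics` module. The `salaries` data and the bin width of 20 are assumptions for the example, and `histogram_counts` is a hypothetical helper, not a library function.

```python
import statistics

# Hypothetical salaries in thousands; the last value is deliberately large.
salaries = [42, 45, 47, 50, 52, 55, 58, 60, 65, 120]

mean = statistics.mean(salaries)      # measure of centre
median = statistics.median(salaries)  # centre that is robust to extreme values
stdev = statistics.stdev(salaries)    # sample standard deviation (variability)

def histogram_counts(values, bin_width):
    """Count values per equal-width bin, keyed by each bin's lower edge."""
    lowest = min(values)
    counts = {}
    for v in values:
        start = lowest + ((v - lowest) // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

hist = histogram_counts(salaries, 20)

print(mean, median)  # 59.4 53.5
print(hist)          # {42: 8, 62: 1, 102: 1}
```

Here the mean sits above the median and the counts trail off to the right, which is the pattern a right-skewed histogram would show.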
Outliers
- An outlier is a value or an entire observation (case or row) that lies far outside the norm
- Rule of thumb: any value more than 3 standard deviations away from the mean
- Run 2 analyses: one with the outliers and one without.
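A minimal sketch of the 3-standard-deviation rule of thumb and the "with and without" comparison, using made-up data. The function name `flag_outliers` and the threshold parameter `k` are hypothetical; the sample standard deviation is assumed.

```python
import statistics

def flag_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical data: 20 ordinary values plus one extreme value.
values = [48, 49, 50, 51, 52] * 4 + [200]

outliers = flag_outliers(values)
cleaned = [v for v in values if v not in outliers]

# Run the same summary twice: once with the outlier and once without.
mean_with = statistics.mean(values)
mean_without = statistics.mean(cleaned)

print(outliers)      # [200]
print(mean_without)  # 50
```

Comparing `mean_with` and `mean_without` shows how much a single extreme value can move a summary measure. In very small samples the rule can never trigger, because the outlier itself inflates the standard deviation.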
Missing Values
- Most real data sets have gaps in the data
- Need to detect and decide how to deal with these gaps
- Can ignore them (but need to know what the software does with them)
- Fill in the missing value with the average of the non-missing values
- Examine the non-missing values in the same row and predict the missing value based on associations gathered from complete rows.
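Detecting gaps and filling them with the average of the non-missing values might look like the sketch below. The `heights` data are invented, and representing a missing value as `None` is an assumption of this example; real software may use a different marker.

```python
import statistics

# Hypothetical heights in cm; None marks a missing value.
heights = [170.0, None, 165.0, 180.0, None, 175.0]

# Detect: count the gaps before deciding how to handle them.
n_missing = heights.count(None)

# Fill each gap with the mean of the non-missing values.
observed = [h for h in heights if h is not None]
fill = statistics.mean(observed)
imputed = [fill if h is None else h for h in heights]

print(n_missing)  # 2
print(imputed)    # [170.0, 172.5, 165.0, 180.0, 172.5, 175.0]
```

Mean imputation keeps the average unchanged but understates variability, which is one reason to compare it against simply ignoring the incomplete rows.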
