
Sociology: Intermediate Quantitative Research Method


1

Data and Measurement


Moving from questions to variables

Aida Parnia
A.parnia@utoronto.ca

U of T Sociology

September 10, 2024


2

Week 2: Measurements
Today’s schedule

1. Turning data into variables


i. Toronto Open Data
ii. Exploring the data
iii. Identifying questions
iv. Constructing variables
2. Some useful definitions
3. Validity and Reliability
4. Summary of key points
3

Toronto Open Data Portal

https://open.toronto.ca/dataset/bike-share-toronto-ridership-data/
4

Bike ridership data from the first 6 months of 2024


Rows: 2,407,928
Columns: 11
$ trip_id <dbl> 26916635, 26916636, 26916637, 26916638, 26916639, 2…
$ trip_duration <dbl> 897, 267, 158, 357, 195, 1814, 1736, 1348, 262, 564…
$ start_station_id <dbl> 7458, 7285, 7531, 7027, 7469, 7033, 7033, 7164, 741…
$ start_time <chr> "02/01/2024 00:00", "02/01/2024 00:02", "02/01/2024…
$ start_station_name <chr> "Church St / Lombard St", "Spadina Ave / Harbord St…
$ end_station_id <dbl> 7256, 7023, 7058, 7206, 7417, 7115, 7115, 7118, 722…
$ end_time <chr> "02/01/2024 00:15", "02/01/2024 00:06", "02/01/2024…
$ end_station_name <chr> "Vanauley St / Queen St W - SMART", "College St / B…
$ bike_id <dbl> 2325, 623, 7284, 6595, 357, 5750, 5363, 4652, 4031,…
$ user_type <chr> "Casual Member", "Annual Member", "Annual Member", …
$ model <chr> "ICONIC", "ICONIC", "ICONIC", "ICONIC", "ICONIC", "…
5

Exploring the data


# First we check the dimensions of our data
dim(bikedata)

[1] 2407928 11

Then we take a quick look at the first few rows of the data


trip_id   trip_duration  start_station_id  start_time        start_station_name                end_station_id
26916635  897            7458              02/01/2024 00:00  Church St / Lombard St            7256
26916636  267            7285              02/01/2024 00:02  Spadina Ave / Harbord St - SMART  7023
26916637  158            7531              02/01/2024 00:02  541 Huron St - SMART              7058
26916638  357            7027              02/01/2024 00:02  Beverley St / Dundas St W         7206
6

The first step is getting to know the data

Once we have a sense of how the variables are defined, we ask the following questions to understand the data better:

What are the most used stations?
→ For the start of the trip and the end?
What are the most frequent trips?
How do trips differ by user type?
→ What is the difference in start time of trips between user types?
→ Do these differences vary by month of travel?

These are some of the questions we can ask of the data to construct the
variables of interest and get to know the data. But they are not necessarily
good research questions.
7

Top five most used stations?

Fig 1. Most used station for start of the trip
Fig 2. Most used station for end of the trip
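The counts behind figures like these can be sketched as follows. This is a minimal illustration, not the code used for the slides: `toy_trips` is a made-up stand-in for `bikedata`, and only the column name `start_station_name` comes from the data shown earlier.

```r
library(dplyr)

# Toy stand-in for bikedata (hypothetical rows; real station names)
toy_trips <- data.frame(
  start_station_name = c("Union Station", "Union Station", "The Well",
                         "Union Station", "The Well", "Grand Avenue Park")
)

# Count trips per start station, most used first, and keep the top five
top_start <- toy_trips %>%
  count(start_station_name, sort = TRUE) %>%
  slice_head(n = 5)

top_start
```

Swapping in `end_station_name` would give the corresponding ranking for the end of the trip.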
8

What are the most frequent trips?


route                                                                                   n
Tommy Thompson Park (Leslie Street Spit) TO Tommy Thompson Park (Leslie Street Spit)    3089
Bay St / Queens Quay W (Ferry Terminal) TO Bay St / Queens Quay W (Ferry Terminal)      1265
Waterfront Trail (Rouge Hill) TO Waterfront Trail (Rouge Hill)                          1086
Humber Bay Shores Park / Marine Parade Dr TO Humber Bay Shores Park / Marine Parade Dr  1083

Caution

The most frequent routes do not go to another place: they start and end at the same station. We need to clean the data before
proceeding.
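The slides do not show how the `route` variable was built. One plausible construction, given the "A TO B" labels in the table, is to paste the start and end station names together. The sketch below assumes that approach and uses a made-up `toy_trips` data frame; the `same_station` flag is a hypothetical helper for checking how common round trips are.

```r
library(dplyr)

# Toy stand-in for bikedata (hypothetical rows)
toy_trips <- data.frame(
  start_station_name = c("The Well", "Tommy Thompson Park", "The Well"),
  end_station_name   = c("Union Station", "Tommy Thompson Park", "Union Station")
)

toy_trips <- toy_trips %>%
  mutate(
    # Assumed construction of the route label: "start TO end"
    route = paste(start_station_name, "TO", end_station_name),
    # Hypothetical flag for trips that return to where they started
    same_station = start_station_name == end_station_name
  )

# Most frequent routes, and the share of trips that are round trips
route_counts <- toy_trips %>% count(route, sort = TRUE)
share_same <- mean(toy_trips$same_station)

route_counts
share_same
```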
9

Cleaning data
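library(dplyr)

# Relabel trips that start and end at the same station
bikedata <- bikedata %>%
  mutate(route = if_else(
    end_station_name == start_station_name, "Same station", route
  ))

# Recount the routes, most frequent first
bikedata %>% count(route, sort = TRUE)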
# A tibble: 263,273 × 2
route n
<chr> <int>
1 Same station 76660
2 Front St W / Blue Jays Way TO Union Station 965
3 King St W / Portland St TO King St W / Bay St (West Side) 735
4 York St / Queens Quay W TO Bathurst St/Queens Quay(Billy Bishop Airpor… 609
5 College St / Huron St TO Bay St / College St (East Side) 573
6 Fort York Blvd / Capreol Ct TO Union Station 540
7 Bathurst St/Queens Quay(Billy Bishop Airport) TO York St / Queens Quay… 532
8 Grand Avenue Park TO Windsor St / Newcastle St 444
9 Bay St / College St (East Side) TO College St / Huron St 431
10 The Well TO Union Station 419
# ℹ 263,263 more rows
10

How long do most frequent trips take?

Fig 3. Duration of trips for the top 5 most travelled routes (under 30 mins)
11

How long do most frequent trips take?

Fig 3. Duration of trips for the top 5 most travelled routes


12

Describing a distribution
Measures of central tendency and variation
route                                                                     mean  median  standard_deviation  q1  q3  min  max
College St / Huron St TO Bay St / College St (East Side)                  7     6       18                  5   7   0    444
Fort York Blvd / Capreol Ct TO Union Station                              9     8       16                  7   9   0    378
Front St W / Blue Jays Way TO Union Station                               6     4       28                  4   5   2    799
King St W / Portland St TO King St W / Bay St (West Side)                 8     7       2                   7   8   4    28
York St / Queens Quay W TO Bathurst St/Queens Quay(Billy Bishop Airport)  11    8       9                   7   10  5    107
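A table like this can be produced with `group_by()` and `summarise()`. The sketch below is an illustration under assumptions: `toy_trips`, its route labels, and the `duration_min` column are made up, and the output column names simply mirror the table header.

```r
library(dplyr)

# Toy durations for two hypothetical routes; one long trip on the first route
toy_trips <- data.frame(
  route = rep(c("A TO B", "C TO D"), each = 4),
  duration_min = c(4, 5, 6, 30, 7, 8, 8, 9)
)

# One row of summary statistics per route
route_summary <- toy_trips %>%
  group_by(route) %>%
  summarise(
    mean = mean(duration_min),
    median = median(duration_min),
    standard_deviation = sd(duration_min),
    q1 = quantile(duration_min, 0.25, names = FALSE),
    q3 = quantile(duration_min, 0.75, names = FALSE),
    min = min(duration_min),
    max = max(duration_min)
  )

route_summary
```

In the toy data, the single long trip pulls the mean of the first route well above its median, which is the same pattern the table shows for routes with very large maximum values.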
13

Differences by types of membership


How do trips differ by user type (annual membership or casual membership)?
→ What is the difference in start time of trips between user types?
→ Do these differences vary by month of travel?

Breaking down the task for analysis:


1. Creating start times for the trips
2. Finding months of travel -> creating a variable for month
3. Stratifying data by groups of user types and month of travel
4. Exploring distribution
5. Calculating differences
14

When do the trips start during the day?


library(dplyr)
library(lubridate)
library(ggplot2)
library(ggpubr)  # provides theme_pubr()

# Parse the start time, then derive month, day, and hour variables
bikedata <- bikedata %>%
  mutate(start_time = mdy_hm(start_time),
         # the slide cuts off a further argument after label = TRUE
         start_month = month(start_time, label = TRUE),
         start_day = day(start_time),
         start_hour = hour(start_time))

# Count trips by starting hour within each month, one panel per month
bikedata %>% group_by(start_month) %>%
  count(start_hour) %>%
  ggplot(aes(x = start_hour, y = n, fill = start_month)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set2") +
  theme_pubr(base_size = 20) +
  labs(fill = "Month", x = "Hour of the day",
       y = "Total number of trips") +
  facet_grid(. ~ start_month)

Fig 4. Total number of trips during the day by months of travel
15

Differences by types of user - totals

Fig 5. Total number of trips during the day by months of travel and user type
16

Differences by types of user - percentages

Fig 6. Proportion of trips and time of the day by months of travel and user type
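Moving from totals to proportions means choosing a denominator. The sketch below assumes the proportions are taken within each user type and month, so that each group's hourly shares sum to 1; the slides do not state the denominator, and `toy_trips` is made up.

```r
library(dplyr)

# Toy stand-in for bikedata after the month and hour variables are created
toy_trips <- data.frame(
  user_type   = c(rep("Annual Member", 4), rep("Casual Member", 2)),
  start_month = rep("Feb", 6),
  start_hour  = c(8, 8, 8, 17, 14, 15)
)

# Share of each group's trips that start in each hour
hourly_share <- toy_trips %>%
  count(user_type, start_month, start_hour) %>%
  group_by(user_type, start_month) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

hourly_share
```

Proportions make the two user types comparable even when one group takes far more trips than the other.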
17

Differences by types of user - percentages

Fig 7. Proportion of trips and time of the day (AM vs PM) by months of travel and user type
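Collapsing the hour of the day into AM and PM is an example of constructing a coarser variable from a finer one. A minimal sketch, assuming that hours before 12 count as AM (the cut-off used for the figure is not shown) and using a made-up `toy_trips`:

```r
library(dplyr)

# Toy stand-in for bikedata with the start_hour variable already created
toy_trips <- data.frame(
  user_type  = c("Annual Member", "Annual Member", "Casual Member", "Casual Member"),
  start_hour = c(8, 17, 13, 15)
)

am_pm_share <- toy_trips %>%
  # Assumed rule: hours 0-11 are AM, hours 12-23 are PM
  mutate(time_of_day = if_else(start_hour < 12, "AM", "PM")) %>%
  count(user_type, time_of_day) %>%
  group_by(user_type) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

am_pm_share
```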
18

Differences by types of user - other possibilities


If the month or hour of travel doesn’t seem to matter, then maybe the difference is in trip duration?
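One way to start on that question is to compare the distribution of `trip_duration` across user types. The sketch below uses made-up rows, and the unit of `trip_duration` is whatever the source data uses (it is not stated on the slides).

```r
library(dplyr)

# Toy stand-in for bikedata (hypothetical rows)
toy_trips <- data.frame(
  user_type     = c("Annual Member", "Annual Member", "Casual Member", "Casual Member"),
  trip_duration = c(267, 158, 897, 1814)
)

# Number of trips and typical duration for each user type
duration_by_type <- toy_trips %>%
  group_by(user_type) %>%
  summarise(
    trips = n(),
    mean_duration = mean(trip_duration),
    median_duration = median(trip_duration)
  )

duration_by_type
```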
19

Exploratory data analysis (EDA)


So when do we stop?

Some questions to ask yourself:

Do you have a research question in mind?
Do the variables make sense, and are they clean?
Do you have a sense of the distribution of the variables?
Have you considered the important relationships between the variables?

From R for Data Science, Wickham et al. 2016

Tip
EDA never stops!
20

Some definitions - Types of Variables


Nominal Variables (qualitative, discrete)

These are categorical variables without any order or ranking.


Examples include station names, the type of bike user, gender, and race.
Ordinal Variables (qualitative or quantitative, discrete)

These are categorical variables with a clear ordering or ranking.


However, the intervals between the ranks are not necessarily equal.
Examples include education level (high school, bachelor’s, master’s,
etc.) or months of the year.
Interval-ratio Variables (quantitative, continuous)

These are numerical variables with equal intervals between values.


Examples include seconds, hours, temperature in Celsius, age.
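In R, these three types map roughly onto unordered factors, ordered factors, and numeric vectors. The sketch below is an illustration only; the values are taken from the examples above and from the first rows of the bike data.

```r
# Nominal: categories with no order
user_type <- factor(c("Casual Member", "Annual Member", "Annual Member"))

# Ordinal: categories with an order, stated explicitly through the levels
education <- factor(
  c("bachelor's", "high school", "master's"),
  levels = c("high school", "bachelor's", "master's"),
  ordered = TRUE
)

# Interval-ratio: numeric values with equal intervals
trip_duration <- c(897, 267, 158)

is.ordered(user_type)        # nominal factors carry no ordering
education[1] > education[2]  # comparisons are defined for ordered factors
mean(trip_duration)          # arithmetic is meaningful for numeric values
```

Declaring the type matters because it determines which summaries are meaningful: a mean makes sense for `trip_duration` but not for `user_type`.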
21

Some definitions - Measures of Central Tendency


Mean

The average of a set of numbers, calculated by adding all the numbers together and dividing by the count of numbers.

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Median

The middle value in a set of numbers when they are arranged in order.
If there is an even number of observations, the median is the average
of the two middle numbers.
Mode

The value that appears most frequently in a set of numbers.
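As a small worked example, the three measures can be computed on the first few values shown in the bike data. Base R has `mean()` and `median()` but no built-in function for the statistical mode, so the sketch finds it from a frequency table; `mode_type` is a hypothetical name.

```r
# First five trip durations from the data preview
trip_duration <- c(897, 267, 158, 357, 195)

# Mean: sum of the values divided by how many there are
mean_duration <- sum(trip_duration) / length(trip_duration)

# Median: the middle value once the values are sorted
median_duration <- median(trip_duration)

# Mode: the most frequent value, shown here for a nominal variable
user_type <- c("Casual Member", "Annual Member", "Annual Member")
type_counts <- table(user_type)
mode_type <- names(type_counts)[type_counts == max(type_counts)]

mean_duration
median_duration
mode_type
```

The mode is the only one of the three that applies to a nominal variable such as `user_type`.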


22

Some definitions - Measures of Variation


Range

The difference between the highest and lowest values in a set of


numbers.
Variance

A measure of how much the values in a set differ from the mean. It is
calculated by taking the average of the squared differences from the
mean.

$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Standard Deviation

The square root of the variance, providing a measure of the typical distance of the values from the mean.

$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
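The formulas above divide by n. R's built-in `var()` and `sd()` divide by n - 1 instead (the sample versions), so the results differ slightly on small data sets. A short sketch on the first five trip durations from the data preview:

```r
# First five trip durations from the data preview
x <- c(897, 267, 158, 357, 195)
n <- length(x)

# Variance and standard deviation as defined above (divide by n)
pop_var <- sum((x - mean(x))^2) / n
pop_sd <- sqrt(pop_var)

# R's built-in versions divide by n - 1
sample_var <- var(x)
sample_sd <- sd(x)

c(pop_var = pop_var, sample_var = sample_var)
c(pop_sd = pop_sd, sample_sd = sample_sd)
```

With 2.4 million rows the two versions are practically identical; the distinction matters mainly for small samples.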
23

Some definitions - Quantiles


Quantiles

Cut points that divide the ordered data into groups containing equal shares of the observations. They help in
understanding the distribution of the data. Common quantiles include
quartiles, percentiles, and deciles.

Quartiles: Divide data into four equal parts.
Percentiles: Divide data into 100 equal parts.
Deciles: Divide data into 10 equal parts.
Interquartile Range (IQR)

The range of the middle 50% of the values, calculated as the third
quartile (75th percentile) minus the first quartile (25th percentile).
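In R, `quantile()` returns the cut points and `IQR()` returns the interquartile range. The sketch below uses the first five trip durations from the data preview and R's default quantile method; other methods can give slightly different cut points on small samples.

```r
# First five trip durations from the data preview
x <- c(897, 267, 158, 357, 195)

# Quartiles: the 25th, 50th, and 75th percentiles
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Deciles: the 10th, 20th, ..., 90th percentiles
deciles <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))

# IQR by hand: third quartile minus first quartile
iqr_manual <- quartiles[["75%"]] - quartiles[["25%"]]

quartiles
iqr_manual
IQR(x)
```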
24

Validity & Reliability


Validity

A measure is valid to the degree that it represents what you are trying
to measure.

Internal validity: How well the representation stands for the concept.
→ e.g. is annual income a valid measure of one’s material resources?
External validity: How well the representation works in different settings.
→ e.g. is annual income a valid measure of material resources for
those who are under 18 years old?
25

Key points of this week


Asking questions about the data and operationalizing concepts
Using measurements to calculate summary statistics and create
visualizations
Not all visualization methods or summary statistics fit every
measurement
Goal of EDA: finding the best ways to communicate ideas
Assessing threats to internal and external validity after defining
concepts and research questions
26

Next week: Probability


Install R and RStudio on your personal computer for the tutorial
You will receive an email about using remote PC to access the computers in
the lab.
The syllabus is updated with chapters from the textbook (Regression and
Other Stories)
