DATA ANALYTICS
WORKSHEET 1
NAME: SUDEEP G
SRN: PES1UG22CS841
2. Compact Summary
3. Summary Statistics
4. Scatter Plots and Line Plots
5. Sorting a data frame
6. Column Transformation
7. Data Pre-processing
-> The "director" column can be treated as a categorical variable; the mode of that column would give the director
with the most movies in the current dataset.
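A minimal sketch of how that mode could be computed in R (the movies data frame and its values here are toy placeholders, not the actual dataset):

# Toy stand-in for the movies data frame (placeholder values)
movies <- data.frame(director = c("Nolan", "Nolan", "Spielberg", "Scorsese"))

# R has no built-in mode for categorical data, so take the most frequent level
director_counts <- table(movies$director)
names(director_counts)[which.max(director_counts)]   # "Nolan" in this toy example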
Classification of data:
Ordinal - release_date.
Interval - popularity (no true '0').
Nominal - director, title, original_language, genres, ID.
Ratio - budget, revenue, runtime, vote_average, vote_count.
Problem 2
2. Investigate the data set for missing values. Also classify the missingness as MCAR, MAR or MNAR. Recommend ways to
replace missing values in the dataset and apply them for revenue, budget and runtime columns.
Hint: Make sure to capture data from both, missing values in numeric fields and empty strings in descriptive fields. Convert
all missing placeholders to type NA. Look at the distribution of the dataset to classify the type of missing values.
There are technically no NA values in the dataset. The ROI column has 599 Inf/NaN values (from dividing by 0 or 0/0).
Some movies have their budget entered as 0.
This means that most of this missingness can be classified as MAR (Missing At Random): some movie entries simply
did not have the revenue and budget fields filled in.
CONCLUSION AND ANALYSIS:
A runtime of zero effectively means a missing value, so I have counted those as well.
Most of these rows overlap with each other; this can be tested by finding the intersection between the wrongly entered
revenue and budget values, which supports classifying the missingness as MAR (Missing At Random), since some values
were simply never filled in by whoever supplied the data.
To deal with these missing values, which are mostly MAR, we can use:
* Regression imputation, * LOCF/NOCB (carrying values forward or backward), * Multiple-imputation packages such as mice and Amelia.
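A rough sketch of how the placeholders could be converted to NA and then imputed with mice (the toy data frame below is a placeholder for the real movies dataset, and method = "pmm" is just one reasonable choice):

library(mice)   # multiple imputation

# Toy stand-in for the movies data frame
movies <- data.frame(
  title   = c("A", "B", "C", "D", "", "F", "G", "H"),
  budget  = c(100, 0, 250, 0, 80, 150, 0, 60),
  revenue = c(300, 50, 0, 120, 200, 0, 90, 75),
  runtime = c(120, 0, 95, 110, 0, 100, 130, 88)
)

# Treat empty strings and zero budget/revenue/runtime as missing (NA)
movies$title[movies$title == ""]    <- NA
movies$budget[movies$budget == 0]   <- NA
movies$revenue[movies$revenue == 0] <- NA
movies$runtime[movies$runtime == 0] <- NA

colSums(is.na(movies))   # missing-value count per column

# Multiple imputation on the numeric columns (predictive mean matching)
imp <- mice(movies[, c("budget", "revenue", "runtime")],
            m = 5, method = "pmm", seed = 1, printFlag = FALSE)
movies[, c("budget", "revenue", "runtime")] <- complete(imp)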
Problem 3
3. Analyze the spread of the data set along years. How has the number of movie releases changed over
the years?
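One possible way to look at the spread along years, assuming the release_date column is in "YYYY-MM-DD" form (the toy dates below are placeholders):

# Toy stand-in: a few release dates
movies <- data.frame(release_date = c("2015-06-12", "2015-11-20", "2016-03-04",
                                      "2017-07-21", "2017-12-15", "2017-01-27"))

movies$year <- as.numeric(format(as.Date(movies$release_date), "%Y"))
releases_per_year <- table(movies$year)       # number of releases in each year

plot(as.numeric(names(releases_per_year)), as.vector(releases_per_year),
     type = "b", xlab = "Year", ylab = "Number of releases",
     main = "Movie releases per year")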
Problem 4
4.Create a horizontal box plot using the column “runtime”. What inferences can you make from this box and
whisker plot? Comment on the skew of the runtime field (visual inspection is enough).
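A minimal sketch of the plot, using base R's boxplot with horizontal = TRUE (the runtime values below are placeholders):

# Toy runtimes; the real plot uses the runtime column of the full dataset
runtime <- c(85, 90, 95, 100, 102, 105, 110, 120, 150, 180, 240, 300)

boxplot(runtime, horizontal = TRUE,
        xlab = "Runtime (minutes)",
        main = "Distribution of movie runtimes")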
CONCLUSION:
From this box plot we can easily tell that there are many outliers.
The data is right-skewed (positively skewed).
The mean is greater than the median.
The bulk of the data is concentrated towards the left, with a long tail of high-runtime outliers stretching to the right.
Problem 5
5. Analyze the top 20 titles with highest budget, revenue and ROI. Plot a horizontal bar graph for all
three metrics in each case. What analysis can you make by looking at these graphs? What kind of
movies attracts the highest investments and do they promise a better ROI?
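A sketch of one of the three bar graphs on toy data; ROI is assumed here to be (revenue - budget) / budget, and the same code can be repeated for budget and ROI:

set.seed(1)
# Toy stand-in; the real data frame has many more titles
movies <- data.frame(
  title   = paste("Movie", 1:30),
  budget  = runif(30, 1e6, 3e8),
  revenue = runif(30, 1e6, 1e9)
)
movies$ROI <- (movies$revenue - movies$budget) / movies$budget

# Top 20 titles by revenue
top20 <- head(movies[order(-movies$revenue), ], 20)

par(mar = c(5, 8, 2, 1))                       # widen the left margin for titles
barplot(rev(top20$revenue), names.arg = rev(top20$title),
        horiz = TRUE, las = 1,
        xlab = "Revenue", main = "Top 20 titles by revenue")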
Problem 6
6. Put yourself in the shoes of a production house. You want to produce the next big blockbuster. Plot
the ROI, revenue and budget across genres to finalize the genre of your upcoming movie as you did in
the previous problem. Elaborate your answers with proper explanation. Since one movie can fall in
multiple genre categories, you are free to choose a combination. You can also understand how the
popularity of different genres has changed along the years. Do provide a nice name for your movie
and your dream cast ;)
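A sketch of how average ROI per genre could be computed; it assumes the genres column stores multiple genres in one string separated by "|" (the toy data frame is a placeholder):

# Toy stand-in; the "|" separator in the genres column is an assumption
movies <- data.frame(
  title  = c("A", "B", "C", "D"),
  genres = c("Horror|Thriller", "Comedy", "Horror", "Action|Comedy"),
  ROI    = c(30, 2, 20, 1.5)
)

# One row per (movie, genre) pair, then the average ROI per genre
genre_rows <- do.call(rbind, lapply(seq_len(nrow(movies)), function(i) {
  data.frame(genre = strsplit(movies$genres[i], "\\|")[[1]],
             ROI   = movies$ROI[i])
}))
avg_roi <- aggregate(ROI ~ genre, data = genre_rows, FUN = mean)
avg_roi[order(-avg_roi$ROI), ]   # genre with the highest average ROI on top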
CONCLUSION:
The top genre turns out to be Horror, with an average return (ROI) of 24.8604712.
The name of my new movie is: The Nun - The Origins, an addition to the Nun-verse, which seemed terrifying.
Cast:
Demián Bichir as Father Burke
Taissa Farmiga as Sister Irene (cutie :))
Jonas Bloquet as Frenchie
Bonnie Aarons as the Nun, etc
DATA ANALYTICS
WORKSHEET 2
Problem 1
1. Find the total number of accidents in each state for the year 2016 and display your results. Make
sure to display all rows while printing the dataframe. Print only the necessary columns. (Hint: use the
grep command to help filter out column names).
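A minimal sketch of the grep-based filtering (the accidents data frame, its column names and the numbers below are all placeholders/assumptions):

# Toy stand-in for the accidents data frame
accidents <- data.frame(
  STATE.UT                         = c("State A", "State B", "State C"),
  Total.Number.of.Accidents...2016 = c(40000, 35000, 4000),
  Total.Number.of.Accidents...2014 = c(39000, 33000, 4200)
)

# grep keeps only the columns whose names mention 2016
cols_2016 <- grep("2016", names(accidents), value = TRUE)
result <- accidents[, c("STATE.UT", cols_2016)]

print(result, row.names = FALSE)   # all rows, only the necessary columns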
Problem 2
Find the fatality rate (= total number of deaths / total number of accidents) in each state. Find out if
there is a significant linear correlation at a significance level of α = 0.05 between the fatality rate of a state
and the mist/foggy rate (fraction of total accidents that happen in mist/foggy conditions). Correlation
between two continuous RVs: Pearson's correlation coefficient. Pearson's correlation coefficient
between two RVs x and y is given by:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² )
(Hint: use the ggscatter library to plot a scatterplot with the confidence interval of the correlation
coefficient). Plot the fatality rate and mist/foggy rate (see this and this for R plot customization).
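A sketch of the test and plot, assuming the per-state rates have already been computed into a data frame (the values below are placeholders; ggscatter comes from the ggpubr package):

library(ggpubr)   # provides ggscatter

# Toy per-state rates
rates <- data.frame(
  fatality_rate = c(0.25, 0.30, 0.18, 0.40, 0.22, 0.35),
  foggy_rate    = c(0.02, 0.05, 0.01, 0.08, 0.03, 0.06)
)

# Test for a significant linear correlation at alpha = 0.05
cor.test(rates$fatality_rate, rates$foggy_rate, method = "pearson")

# Scatterplot with regression line, confidence band and correlation coefficient
ggscatter(rates, x = "foggy_rate", y = "fatality_rate",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Mist/foggy accident rate", ylab = "Fatality rate")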
CONCLUSION:
(Hint: it is equivalent to calculating the Pearson correlation between a continuous and a dichotomous
variable. You could also use the ltm package’s biserial.cor function).
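A small sketch of that equivalence (toy vectors; the level = 2 argument is my assumption so that biserial.cor's sign matches cor()):

library(ltm)   # provides biserial.cor

# Toy data: a continuous rate and a dichotomous (0/1) indicator
rate      <- c(0.25, 0.30, 0.18, 0.40, 0.22, 0.35)
indicator <- c(0, 1, 0, 1, 0, 1)

cor(rate, indicator)                      # Pearson with a 0/1 variable
biserial.cor(rate, indicator, level = 2)  # point-biserial; should agree with cor()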
Problem 5
5. Similar to in Problem 4, create a binary column to represent whether a dust storm accident has
occurred in a state (1 = occurred, 0 = not occurred). Convert the two columns into a contingency table.
Calculate the phi coefficient of the two tables. (Hint: use the psych package).
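A minimal sketch with the psych package (the two binary columns and their names are placeholders):

library(psych)   # provides phi()

# Toy binary indicators per state (1 = occurred, 0 = not occurred)
storm_a    <- c(1, 0, 1, 1, 0, 0, 1, 0)   # placeholder for the Problem 4 column
dust_storm <- c(1, 0, 0, 1, 0, 0, 1, 1)

ct <- table(storm_a, dust_storm)   # 2 x 2 contingency table
ct
phi(ct)                            # phi coefficient of association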
Problem 6
6. Read about correlation on this website and analyze the effect of sample size on correlation
coefficients and spurious correlation. Are correlation coefficients affected by outliers?
SOL:
As the sample size increases, correlation coefficients become more stable and spurious
correlation can be avoided.
Thus larger sample sizes are more reliable and give more stable estimates than smaller sample
sizes. The calculated sample correlation coefficient becomes more trustworthy and closer to
the population correlation (if the two variables are actually correlated, the sample keeps
reflecting that correlation, and so on).
Smaller samples, as in the ball-drawing experiment at the north and south poles on that website,
tend to be highly variable and can give artificially low or high correlation coefficients.
Spurious correlation refers to a situation where two variables appear to be correlated, but in
reality they are not directly related. This can occur due to the presence of a confounding
(third) variable or purely by chance (as seen with the ball-drawing experiment at the north and
south poles).
Smaller sample sizes can contribute to spurious correlation, so larger sample sizes are
preferred in order to know the actual relation between two parameters.
Thus correlation does not imply causation.
Outliers can distort the calculated correlation coefficient. An outlier that lies far from the rest
of the data points can pull the regression line towards itself, leading to an overestimated or
underestimated correlation.
Pearson's correlation coefficient is sensitive to outliers because it measures the degree of
linear relationship between the variables.
Other correlation measures, such as Spearman's rank correlation, are much less affected by
outliers because they consider the rank order of the data rather than the specific values.
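A quick illustration of that sensitivity on simulated data (not from the worksheet's dataset):

set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.5)      # moderately correlated data

cor(x, y, method = "pearson")     # baseline Pearson correlation
cor(x, y, method = "spearman")    # baseline Spearman correlation

# Add a single extreme outlier
x2 <- c(x, 10); y2 <- c(y, -10)

cor(x2, y2, method = "pearson")   # drops sharply: pulled by the outlier
cor(x2, y2, method = "spearman")  # changes much less: ranks dampen the outlier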
Problem 7
7. Look at these plots and answer: What problems do they have? How do they affect correlation
analysis?
SOL:
i) The direct causation could be that people die by carelessly checking their phone when they shouldn't,
but such deaths seem far too driven by chance to explain the pattern. The nearly perfect
correlation suggests that this is spurious and there is some third variable which both these
parameters depend on, such as older people owning iPhones.
ii) Similarly, this also seems to be a spurious correlation. Spending on admission to spectator sports and
being really health-conscious both seem like things the upper middle class and above would bother
about, so the income/poshness of an area might just be the confounding variable.
iii) This straight up seems to be a spurious correlation by chance. Both of these might also be affected
by the overall level of economic activity in a region.
E.g.: if the economy is doing well, people are more likely to buy cars and take vacations. This could lead
to an increase in both automobile sales and trips to Universal Orlando.
SOL:
2. Although a few articles online state that these two parameters measure two totally
different things and must not be correlated, I believe that they are related.
eBay total gross merchandise volume (GMV) is the total value of all items sold on eBay's
platforms in a given period of time, and many studies have shown that Black Friday and "Cyber
Monday" specifically show up to a 300% increase in sales. Thus they are positively
correlated.
SOL:
3. These are skewed scales with totally different ranges of company revenue and thus cannot be
correlated, as they manipulate the ranges to make the data appear aligned.
DATA ANALYTICS
WORKSHEET 3
Problem 1
1. Read the data set and display the box plot for each of the fitness plans A, B, C, D. Analyze the box
plot for outliers
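A minimal sketch of the plot (the fitness data frame below is a toy stand-in; the column names plan and marks are assumptions):

# Toy stand-in: marks for students under four fitness plans (A-D)
set.seed(1)
fitness <- data.frame(
  plan  = rep(c("A", "B", "C", "D"), each = 20),
  marks = c(rnorm(20, 70, 5), rnorm(20, 75, 5),
            rnorm(20, 65, 8), rnorm(20, 72, 5))
)

# One box per plan; points beyond the whiskers are flagged as outliers
boxplot(marks ~ plan, data = fitness,
        xlab = "Fitness plan", ylab = "Marks",
        main = "Marks by fitness plan")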
Problem 2
2. Is the data symmetrical or skewed for each group? Verify the normality assumption for ANOVA.
(Hint: Find the Pearson’s moment coefficient of skewness and justify it with probability distribution
function plot or you can also plot the Q-Q plot)
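A sketch of the skewness and Q-Q checks, reusing the toy fitness data frame from the previous sketch:

# Pearson's moment coefficient of skewness (third standardized moment)
skew <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
tapply(fitness$marks, fitness$plan, skew)   # values near 0 suggest symmetry

# Q-Q plot for one group (repeat for each plan)
a_marks <- fitness$marks[fitness$plan == "A"]
qqnorm(a_marks); qqline(a_marks)
shapiro.test(a_marks)   # optional formal normality check alongside the plot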
Problem 3
3. Is there any evidence to suggest a difference in the average marks obtained by students under
different fitness plans? Explain what test you are using and why. Define the hypothesis and the steps
of testing. What does the output of this test signify? (Note: Assume the significance level to be 0.05)
SOL:
To determine whether there is evidence to suggest a difference in the average marks obtained by
students under different fitness plans, you can perform a one-way ANOVA (Analysis of Variance) test.
ANOVA is suitable when you have more than two groups and want to test whether there are any
statistically significant differences among the group means.
Hypotheses:
Null hypothesis (H0): the mean marks are the same under all fitness plans.
Alternative hypothesis (H1): at least one fitness plan has a different mean mark.
In plain terms, the null hypothesis suggests that there's no real difference in the average marks
obtained by students under different fitness plans. It's like saying "The different fitness plans don't
really make a difference in how well students perform."
Perform ANOVA:
Fit a one-way ANOVA model to our data and obtain the F-statistic and its associated p-value.
Analyze p-value:
If the p-value is less than α (0.05), you reject the null hypothesis and conclude that there is evidence
of a significant difference in the average marks among the fitness plans.
If the p-value is greater than or equal to α (0.05), you fail to reject the null hypothesis, and you do not
have sufficient evidence to claim a significant difference.
Output Significance:
A significant result (small p-value) suggests that there is evidence to support the claim that the fitness
plans have a significant effect on the average marks obtained by students.
A non-significant result (large p-value) suggests that you do not have enough evidence to conclude
that there are significant differences among the group means.
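A minimal sketch of the test in R, reusing the toy fitness data frame from Worksheet 3, Problem 1:

# One-way ANOVA: marks explained by fitness plan
model <- aov(marks ~ plan, data = fitness)
summary(model)   # reject H0 if the Pr(>F) value for plan is below 0.05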
Problem 4
4. Which specific task exhibits the lowest average training time? Does the combination of different
treats and tasks significantly influence the training time for pets?
The p-value for the interaction between tasks and treats is greater than the significance level of 0.05, hence the
combination does not significantly influence the training time of the pets.
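A sketch of how this could be checked with a two-way ANOVA (the pets data frame and its column names treat, task and time are assumptions):

# Toy stand-in for the pets training data
set.seed(1)
pets <- expand.grid(treat = c("A", "B", "C"), task = c("I", "II", "III"), rep = 1:10)
pets$time <- rnorm(nrow(pets), mean = 20, sd = 3)

# Task with the lowest average training time
sort(tapply(pets$time, pets$task, mean))[1]

# Two-way ANOVA with interaction: does treat x task affect training time?
model2 <- aov(time ~ treat * task, data = pets)
summary(model2)   # the treat:task row tests the interaction at alpha = 0.05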
Problem 5
5. Does the choice of treats significantly impact the training time for different tasks? Which specific
combinations of treats and tasks lead to the most significant differences in training time? (Note:
Assume the significance level to be 0.05 )
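One way to find the specific combinations is Tukey's HSD on the interaction term, reusing the toy two-way ANOVA model from Problem 4 (the real worksheet would use the actual pets data):

# Pairwise comparisons for the treat:task interaction
tukey <- TukeyHSD(model2, which = "treat:task")
comparisons <- tukey$`treat:task`
head(comparisons[order(comparisons[, "p adj"]), ])   # smallest adjusted p-values first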
CONCLUSION:
The combinations III:A-I:B and III:C-I:B show the most significant differences in training time,
based on their p-values.