
DATA ANALYTICS

WORKSHEET 1

NAME: SUDEEP G

SRN: PES1UG22CS841

Preliminary Guided Exercises Output:


1. Data Import
2. Compact Summary
3. Summary Statistics
4. Scatter Plots and Line Plots
5. Sorting a data frame
6. Column Transformation
7. Data Pre-processing
8. Using the ggplot2 Library


Problems
Problem 1
1. Get the summary statistics (mean, median, min, max, 1st quartile, 3rd quartile and standard deviation). Calculate these only for the numerical columns. What can you determine from the summary statistics? What summary statistics can be useful for categorical columns? Classify all the variables/columns into their types of data attributes (nominal, ordinal, interval, ratio).
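A minimal R sketch of how these statistics could be obtained, assuming the dataset has been read into a data frame named movies (an illustrative name; original_language is one of the dataset's categorical columns):

    # Summary statistics for the numeric columns only
    num_cols <- sapply(movies, is.numeric)
    summary(movies[, num_cols])                    # min, quartiles, median, mean, max
    sapply(movies[, num_cols], sd, na.rm = TRUE)   # standard deviation per column

    # For a categorical column, the most useful statistic is the mode (most frequent level)
    sort(table(movies$original_language), decreasing = TRUE)[1]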

CONCLUSION AND ANALYSIS:

 The minimum budget of some movies is zero; this is really missing data, since it produces a maximum ROI of Inf, which is an invalid value. These zeroes should be imputed with either the mean or kNN. The first quartile (25th percentile) of ROI is 0.8279, which indicates that 25% of the movies have a relatively low ROI.

The 'mode' of the categorical columns:
-> genres: Action, with 1029 action movies
-> original_language: en, with 3821 English-language movies
-> The director column can also be used as a categorical attribute; the mode computed on it would represent the director with the most movies in the current dataset.

Classification of data:
 Ordinal - release_date
 Interval - popularity (no true '0')
 Nominal - director, title, original_language, genres, ID
 Ratio - budget, revenue, runtime, vote_average, vote_count

Problem 2
2. Investigate the data set for missing values. Also classify the missingness as MCAR, MAR or MNAR. Recommend ways to
replace missing values in the dataset and apply them for revenue, budget and runtime columns.

Hint: Make sure to capture data from both missing values in numeric fields and empty strings in descriptive fields. Convert all missing placeholders to type NA. Look at the distribution of the dataset to classify the type of missing values.
 There are technically no NA values in the dataset. The ROI column has 599 Inf/NaN values, produced by dividing by a zero budget.
 Some movies have their budget entered as 0.
 This means that most of this data can be classified as MAR (Missing At Random): some movie entries simply did not have the revenue and budget fields filled in.
CONCLUSION AND ANALYSIS:

 A runtime of zero effectively means a missing value, so I have counted those as well.
 Most of these rows will overlap with each other; this can be checked by finding the intersection between the wrongly entered revenue and budget values. This supports classifying the missingness as MAR (Missing At Random), since some values were simply not filled in by whoever supplied the data.
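A sketch of how the zero and empty-string placeholders could be converted to NA and counted, assuming the movies data frame from Problem 1. Median imputation is shown only as a simple placeholder for the last step; the approaches recommended below (regression imputation, mice/Amelia) would replace it.

    # Treat zero budget/revenue/runtime as missing
    for (col in c("budget", "revenue", "runtime")) {
      movies[[col]][movies[[col]] == 0] <- NA
    }

    # Empty strings in descriptive (character) fields also become NA
    char_cols <- sapply(movies, is.character)
    movies[char_cols] <- lapply(movies[char_cols], function(x) ifelse(x == "", NA, x))

    colSums(is.na(movies))   # count of missing values per column

    # Simple baseline: median imputation for the three numeric columns
    for (col in c("budget", "revenue", "runtime")) {
      movies[[col]][is.na(movies[[col]])] <- median(movies[[col]], na.rm = TRUE)
    }

    # Alternatively, multiple imputation with the mice package
    # library(mice)
    # imp <- mice(movies[, c("budget", "revenue", "runtime")], m = 5)
    # movies[, c("budget", "revenue", "runtime")] <- complete(imp)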

To deal with these missing values, which are mostly MAR, we can use: regression imputation; LOCF/NOCB (carrying the last or next observation forward/backward); and multiple-imputation packages such as mice and Amelia.
Problem 3
3. Analyze the spread of the data set along years. How has the number of movie releases changed over the years?
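A short R sketch of how the releases per year could be counted and plotted, assuming release_date is stored in a format that as.Date() can parse (e.g. "YYYY-MM-DD"):

    # Extract the release year and count releases per year
    movies$year <- as.numeric(format(as.Date(movies$release_date), "%Y"))
    releases_per_year <- table(movies$year)

    plot(as.numeric(names(releases_per_year)), as.vector(releases_per_year),
         type = "l", xlab = "Year", ylab = "Number of releases",
         main = "Movie releases per year")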
Problem 4

4. Create a horizontal box plot using the column "runtime". What inferences can you make from this box and whisker plot? Comment on the skew of the runtime field (visual inspection is enough).
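A sketch of the plot and a quick numeric check, assuming the movies data frame from the earlier problems:

    # Horizontal box plot of runtime
    boxplot(movies$runtime, horizontal = TRUE,
            xlab = "Runtime (minutes)", main = "Distribution of movie runtimes")

    # Quick numeric confirmation of the visual impression
    mean(movies$runtime, na.rm = TRUE)     # expected to exceed the median for right skew
    median(movies$runtime, na.rm = TRUE)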

CONCLUSION:

 From this box plot we can easily see that there are many outliers.
 The data is right-skewed (positively skewed).
 The mean is greater than the median.
 The bulk of the data is concentrated at lower runtimes, with a long tail of high values to the right.
Problem 5

5. Analyze the top 20 titles with the highest budget, revenue and ROI. Plot a horizontal bar graph for all three metrics in each case. What analysis can you make by looking at these graphs? What kind of movies attract the highest investments, and do they promise a better ROI?
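A sketch for the budget case, assuming the movies data frame and a title column; the revenue and ROI plots follow the same pattern:

    # Top 20 titles by budget
    top_budget <- head(movies[order(-movies$budget), c("title", "budget")], 20)

    library(ggplot2)
    ggplot(top_budget, aes(x = reorder(title, budget), y = budget)) +
      geom_col() +
      coord_flip() +
      labs(x = "Title", y = "Budget", title = "Top 20 movies by budget")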
Problem 6
6. Put yourself in the shoes of a production house. You want to produce the next big blockbuster. Plot the ROI, revenue and budget across genres to finalize the genre of your upcoming movie, as you did in the previous problem. Elaborate your answers with proper explanation. Since one movie can fall in multiple genre categories, you are free to choose a combination. You can also look at how the popularity of different genres has changed over the years. Do provide a nice name for your movie and your dream cast ;)
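One way the per-genre comparison could be sketched in R, assuming the movies data frame has a genres column (one primary genre per row) and an ROI column with the Inf/NaN entries already cleaned as in Problem 2:

    # Average ROI per genre; repeat with revenue and budget in place of ROI
    roi_by_genre <- aggregate(ROI ~ genres, data = movies, FUN = mean)
    roi_by_genre[order(-roi_by_genre$ROI), ]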

CONCLUSION:

 The top genre turns out to be Horror, with an average return (ROI) of 24.8604712.
 The name of my new movie is The Nun: The Origins, an addition to the Nun universe, which seemed suitably terrifying.

Cast:
 Demián Bichir as Father Burke
 Taissa Farmiga as Sister Irene (cutie :))
 Jonas Bloquet as Frenchie
 Bonnie Aarons as the Nun, etc.
DATA ANALYTICS
WORKSHEET 2

Problem 1

1. Find the total number of accidents in each state for the year 2016 and display your results. Make sure to display all rows while printing the data frame. Print only the necessary columns. (Hint: use the grep command to help filter out column names.)
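A possible sketch using grep, assuming the accident data is in a data frame named data and that the state column and the 2016 accident columns can be identified from the column names (the patterns and names below are illustrative):

    # Locate the columns whose names mention 2016
    cols_2016 <- grep("2016", names(data), value = TRUE)

    # Keep only the state column and the 2016 columns, and print every row
    result <- data[, c(grep("State", names(data), value = TRUE), cols_2016)]
    print(result, row.names = FALSE)   # raise options(max.print = ...) if rows are truncated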
Problem 2

Find the fatality rate (= total number of deaths / total number of accidents) in each state. Find out if there is a significant linear correlation at a significance level of α = 0.05 between the fatality rate of a state and the mist/foggy rate (fraction of total accidents that happen in mist/foggy conditions). Correlation between two continuous RVs is measured with Pearson's correlation coefficient. Pearson's correlation coefficient between two RVs x and y is given by:

ρ = Covariance(x, y) / (σx · σy)

where ρ represents Pearson's correlation coefficient, Covariance(x, y) is the covariance between x and y, σx is the standard deviation of x, and σy is the standard deviation of y.

Plot the fatality rate against the mist/foggy rate.

(Hint: use the ggscatter function from the ggpubr package to plot a scatterplot with the confidence interval of the correlation coefficient.) Plot the fatality rate and mist/foggy rate (see this and this for R plot customization).
CONCLUSION:

1. It initializes empty lists and iterates through the column names in the dataset 'data' to identify columns containing the word "Killed", storing them in 'mylist'.
2. It calculates the total number of deaths ('x') per state, using the columns from 'mylist', and prints the state-wise death counts.
3. It calculates the fatality rate (deaths per accident) for each of the 36 states and prints them.
4. It collects data from the 'Mist..Foggy...Total.Accidents' column into
'foggy'.
5. It calculates the foggy weather-related accident rate (foggy accidents
per total accidents) for each state and prints them.
6. It converts 'fat' and 'foggyr' lists to numeric vectors.
7. It calculates the correlation coefficient between fatality rate and foggy
weather-related accident rate.
8. It computes the confidence interval and p-value for the correlation.
9. It prints the correlation coefficient, confidence interval, and whether
there's a significant linear correlation based on the p-value.
10. It provides a conclusion based on whether there is a significant linear
correlation or not, using a significance level of 0.05.
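A condensed R sketch of steps 6 to 9 above, assuming the numeric vectors fat (fatality rate per state) and foggyr (mist/foggy rate per state) have already been built; ggscatter() is provided by the ggpubr package:

    # Pearson correlation with confidence interval and p-value
    ct <- cor.test(fat, foggyr, method = "pearson")
    ct$estimate    # correlation coefficient
    ct$conf.int    # 95% confidence interval
    ct$p.value     # compare against alpha = 0.05

    # Scatter plot with regression line, confidence band and correlation annotation
    library(ggpubr)
    ggscatter(data.frame(fat, foggyr), x = "foggyr", y = "fat",
              add = "reg.line", conf.int = TRUE,
              cor.coef = TRUE, cor.method = "pearson",
              xlab = "Mist/foggy rate", ylab = "Fatality rate")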
Problem 4
4. Convert the column Hail.Sleet...Total.Accidents to a binary column as follows: if a hail/sleet accident has occurred in a state, give that state a value of 1; otherwise, give it a value of 0. Once converted, find out if there is a significant correlation between the hail_accident_occur binary column created and the number of rainy total accidents for every state. Calculate the point-biserial correlation coefficient between the two columns.

(Hint: it is equivalent to calculating the Pearson correlation between a continuous and a dichotomous variable. You could also use the ltm package's biserial.cor function.)
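A sketch of the calculation, assuming a 0/1 vector hail_occur and a numeric vector rainy_acc (rainy-condition accidents per state) have been prepared; both names are illustrative:

    # Point-biserial correlation between the binary and the continuous column
    library(ltm)
    biserial.cor(rainy_acc, hail_occur, level = 2)   # see ?biserial.cor for the sign convention

    # Sanity check: Pearson correlation on the 0/1 column gives the same value up to sign
    cor(rainy_acc, hail_occur)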
Problem 5
5. Similar to Problem 4, create a binary column to represent whether a dust storm accident has occurred in a state (1 = occurred, 0 = not occurred). Convert the two columns into a contingency table. Calculate the phi coefficient from this table. (Hint: use the psych package.)
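A sketch using the psych package, assuming 0/1 vectors hail_occur and dust_occur as in Problem 4 (names illustrative):

    # Contingency table of the two binary columns and its phi coefficient
    library(psych)
    tab <- table(hail_occur, dust_occur)
    tab
    phi(tab)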
Problem 6
6. Read about correlation on this website and analyze the effect of sample size on correlation coefficients and spurious correlation. Are correlation coefficients affected by outliers?

SOL:

 As sample size increases, correlation coefficients become more stable and spurious correlation can be avoided.
 Larger sample sizes are therefore more reliable and give more stable estimates than smaller ones. The calculated sample correlation coefficient becomes more trustworthy and closer to the population correlation (if the two variables really are correlated, it consistently shows that correlation, and so on).
 Smaller samples, as tested with the ball-drawing experiment at the north and south poles, tend to be highly variable and can produce artificially low or high correlation coefficients.
 Spurious correlation refers to a situation where two variables appear to be correlated but in reality are not directly related. This can occur due to the presence of a confounding (third) variable or purely by chance (as seen with the ball-drawing experiment at the north and south poles).
 Smaller sample sizes can contribute to spurious correlation, so larger sample sizes are preferred in order to learn the actual relation between two parameters.
 Thus correlation is not equal to causation.
 Outliers can distort the calculated correlation coefficient. An outlier that lies far from the rest of the data points can pull the regression line towards itself, leading to an overestimated or underestimated correlation.
 Pearson's correlation coefficient is sensitive to outliers because it measures the degree of linear relationship between the raw values.
 Other correlation measures, such as Spearman's rank correlation, are much less affected by outliers because they consider the rank order of the data rather than the specific values (illustrated with a small simulation below).
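A small made-up simulation (not part of the worksheet data) illustrating the last two points:

    set.seed(1)
    x <- rnorm(30)
    y <- x + rnorm(30, sd = 0.5)
    cor(x, y, method = "pearson")        # strong positive correlation
    cor(x, y, method = "spearman")

    # Add a single extreme outlier
    x2 <- c(x, 10); y2 <- c(y, -10)
    cor(x2, y2, method = "pearson")      # Pearson drops sharply
    cor(x2, y2, method = "spearman")     # Spearman changes far less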

Problem 7

7. Look at these plots and answer: what problems do they have? How do they affect correlation analysis?
SOL:
i) The direct causation could be that people die by carelessly checking their phones when they shouldn't, but such deaths seem far too chance-driven to explain the pattern. The nearly perfect correlation suggests that it is spurious and that there is some third variable on which both parameters depend, such as older people owning iPhones.

ii) Similarly, this also seems to be a spurious correlation. Spending on admission to spectator sports and being very health-conscious both look like things the upper-middle class and above would bother about, so the income level of an area might be the confounding variable.

iii) This one looks like a spurious correlation purely by chance. Both variables might also be affected by the overall level of economic activity in a region. For example, if the economy is doing well, people are more likely to buy cars and take vacations, which could lead to an increase in both automobile sales and trips to Universal Orlando.
SOL:
2. Although a few articles online state that these two parameters measure totally different things and should not be correlated, I believe that they are related. eBay total gross merchandise volume (GMV) is the total value of all items sold on eBay's platforms in a given period of time, and many studies have shown that Black Friday and "Cyber Monday" specifically produce up to a 300% increase in sales. Thus they are positively correlated.

SOL:
3. These are skewed scales with totally different ranges of company revenue, so the two series cannot be meaningfully correlated: the axes manipulate the ranges to make the data appear aligned.
DATA ANALYTICS
WORKSHEET 3

Problem 1
1. Read the data set and display the box plot for each of the fitness plans A, B, C, D. Analyze the box
plot for outliers
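A sketch of the plot, assuming the data has been read into a data frame named fitness with one column of marks per plan (A, B, C, D):

    # Box plots for the four fitness plans
    boxplot(fitness[, c("A", "B", "C", "D")],
            xlab = "Fitness plan", ylab = "Marks",
            main = "Marks by fitness plan")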
Problem 2
2. Is the data symmetrical or skewed for each group? Verify the normality assumption for ANOVA.
(Hint: Find the Pearson’s moment coefficient of skewness and justify it with probability distribution
function plot or you can also plot the Q-Q plot)
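A sketch for one group, assuming the fitness data frame from Problem 1 and the moments package for the skewness coefficient (e1071 would work equally well); repeat for B, C and D:

    library(moments)
    skewness(fitness$A, na.rm = TRUE)    # moment coefficient of skewness

    # Q-Q plot against the normal distribution
    qqnorm(fitness$A)
    qqline(fitness$A)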
Problem 3
3. Is there any evidence to suggest a difference in the average marks obtained by students under different fitness plans? Explain what test you are using and why. Define the hypothesis and the steps of testing. What does the output of this test signify? (Note: Assume the significance level to be 0.05)

SOL:

To determine whether there is evidence to suggest a difference in the average marks obtained by
students under different fitness plans, you can perform a one-way ANOVA (Analysis of Variance) test.
ANOVA is suitable when you have more than two groups and want to test whether there are any
statistically significant differences among the group means.

Hypotheses:

Null Hypothesis (H₀): all group means are equal (μ_A = μ_B = μ_C = μ_D).

In plain terms, the null hypothesis suggests that there is no real difference in the average marks obtained by students under the different fitness plans. It is like saying, "The different fitness plans don't really make a difference in how well students perform."

Alternative Hypothesis (H₁): at least one group mean differs from the others.

In plain terms, the alternative hypothesis suggests that there is a meaningful difference in the average marks obtained by students under the different fitness plans. It is like saying, "Some of the fitness plans might actually affect how well students perform."

Perform ANOVA:

Fit a one-way ANOVA model to our data and obtain the F-statistic and its associated p-value.
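A sketch of this step, assuming the fitness data frame with columns A, B, C, D is first reshaped into long form (the column names marks and plan are illustrative):

    # Reshape to long form: one row per student, with the plan as a factor
    long <- stack(fitness[, c("A", "B", "C", "D")])
    names(long) <- c("marks", "plan")

    # One-way ANOVA
    model <- aov(marks ~ plan, data = long)
    summary(model)    # F-statistic and p-value, to be compared with alpha = 0.05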

Analyze p-value:

If the p-value is less than α (0.05), you reject the null hypothesis and conclude that there is evidence
of a significant difference in the average marks among the fitness plans.

If the p-value is greater than or equal to α (0.05), you fail to reject the null hypothesis, and you do not
have sufficient evidence to claim a significant difference.

Output Significance:

A significant result (small p-value) suggests that there is evidence to support the claim that the fitness
plans have a significant effect on the average marks obtained by students.

A non-significant result (large p-value) suggests that you do not have enough evidence to conclude
that there are significant differences among the group means.
Problem 4
4. Which specific task exhibits the lowest average training time? Does the combination of different treats and tasks significantly influence the training time for pets?
 The p-value for the interaction between tasks and treats is greater than the significance level of 0.05, hence the combination does not significantly influence the training time of the pets.
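A sketch of the two-way ANOVA behind this conclusion, assuming a data frame pets with columns time (training time), treat and task (all names illustrative):

    # Lowest average training time per task
    aggregate(time ~ task, data = pets, FUN = mean)

    # Two-way ANOVA with the treat x task interaction
    model2 <- aov(time ~ treat * task, data = pets)
    summary(model2)   # the treat:task row gives the p-value for the interaction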

Problem 5

5. Does the choice of treats significantly impact the training time for different tasks? Which specific
combinations of treats and tasks lead to the most significant differences in training time? (Note:
Assume the significance level to be 0.05 )
CONCLUSION:

 The combinations III:A-I:B and III:C-I:B show the most significant differences in training time, judging by their adjusted p-values.
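The specific combinations could be identified with a post-hoc test on the two-way model sketched in Problem 4 (again assuming the pets data frame and model2 from that sketch):

    # Tukey's HSD gives an adjusted p-value for every pairwise combination of treat and task
    TukeyHSD(model2)
    # Pairs with the smallest adjusted p-values (such as those named above) are the most significant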
