DATA ANALYTICS
WORKSHEET 1
NAME: SUDEEP G
SRN: PES1UG22CS841
2. Compact Summary
3. Summary Statistics
4. Scatter Plots and Line Plots
5. Sorting a data frame
6. Column Transformation
7. Data Pre-processing
-> The "director" column can be treated as a categorical variable; the mode of that column would give the director
with the most movies in the current dataset.
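A minimal sketch of how that mode could be computed in R (the movies data frame and its values here are toy placeholders, not the actual dataset):

# Toy stand-in for the movies data frame (placeholder values)
movies <- data.frame(director = c("Nolan", "Nolan", "Spielberg", "Scorsese"))

# R has no built-in mode for categorical data, so take the most frequent level
director_counts <- table(movies$director)
names(director_counts)[which.max(director_counts)]   # "Nolan" in this toy example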
Classification of data:
Ordinal - release_date.
Interval - popularity (no true '0').
Nominal - director, title, original_language, genres, ID.
Ratio - budget, revenue, runtime, vote_average, vote_count.
Problem 2
2. Investigate the data set for missing values. Also classify the missingness as MCAR, MAR or MNAR. Recommend ways to
replace missing values in the dataset and apply them for revenue, budget and runtime columns.
Hint: Make sure to capture data from both, missing values in numeric fields and empty strings in descriptive fields. Convert
all missing placeholders to type NA. Look at the distribution of the dataset to classify the type of missing values.
There are technically no NA values in the dataset. The ROI column has 599 Inf/NaN values (from dividing by 0 or 0/0).
Some movies have their budget entered as 0.
This means that most of this missingness can be classified as MAR (Missing At Random): some movie entries simply
did not have the revenue and budget fields filled in.
CONCLUSION AND ANALYSIS:
A runtime of zero effectively means a missing value, so I have counted those as well.
Most of these rows overlap with each other; this can be tested by finding the intersection between the wrongly entered
revenue and budget values, which supports classifying the missingness as MAR (Missing At Random), since some values
were simply never filled in by whoever supplied the data.
To deal with these missing values, which are mostly MAR, we can use:
* Regression imputation, * LOCF/NOCB (carrying values forward or backward), * Multiple-imputation packages such as mice and Amelia.
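A rough sketch of how the placeholders could be converted to NA and then imputed with mice (the toy data frame below is a placeholder for the real movies dataset, and method = "pmm" is just one reasonable choice):

library(mice)   # multiple imputation

# Toy stand-in for the movies data frame
movies <- data.frame(
  title   = c("A", "B", "C", "D", "", "F", "G", "H"),
  budget  = c(100, 0, 250, 0, 80, 150, 0, 60),
  revenue = c(300, 50, 0, 120, 200, 0, 90, 75),
  runtime = c(120, 0, 95, 110, 0, 100, 130, 88)
)

# Treat empty strings and zero budget/revenue/runtime as missing (NA)
movies$title[movies$title == ""]    <- NA
movies$budget[movies$budget == 0]   <- NA
movies$revenue[movies$revenue == 0] <- NA
movies$runtime[movies$runtime == 0] <- NA

colSums(is.na(movies))   # missing-value count per column

# Multiple imputation on the numeric columns (predictive mean matching)
imp <- mice(movies[, c("budget", "revenue", "runtime")],
            m = 5, method = "pmm", seed = 1, printFlag = FALSE)
movies[, c("budget", "revenue", "runtime")] <- complete(imp)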
Problem 3
3. Analyze the spread of the data set along years. How has the number of movie releases changed over
the years?
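One possible way to look at the spread along years, assuming the release_date column is in "YYYY-MM-DD" form (the toy dates below are placeholders):

# Toy stand-in: a few release dates
movies <- data.frame(release_date = c("2015-06-12", "2015-11-20", "2016-03-04",
                                      "2017-07-21", "2017-12-15", "2017-01-27"))

movies$year <- as.numeric(format(as.Date(movies$release_date), "%Y"))
releases_per_year <- table(movies$year)       # number of releases in each year

plot(as.numeric(names(releases_per_year)), as.vector(releases_per_year),
     type = "b", xlab = "Year", ylab = "Number of releases",
     main = "Movie releases per year")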
Problem 4
4.Create a horizontal box plot using the column “runtime”. What inferences can you make from this box and
whisker plot? Comment on the skew of the runtime field (visual inspection is enough).
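A minimal sketch of the plot, using base R's boxplot with horizontal = TRUE (the runtime values below are placeholders):

# Toy runtimes; the real plot uses the runtime column of the full dataset
runtime <- c(85, 90, 95, 100, 102, 105, 110, 120, 150, 180, 240, 300)

boxplot(runtime, horizontal = TRUE,
        xlab = "Runtime (minutes)",
        main = "Distribution of movie runtimes")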
CONCLUSION:
From this box plot we can easily tell that there are many outliers.
The data is right-skewed (positively skewed).
The mean is greater than the median.
The bulk of the data is concentrated towards the left, with a long tail of high-runtime outliers stretching to the right.
Problem 5
5. Analyze the top 20 titles with highest budget, revenue and ROI. Plot a horizontal bar graph for all
three metrics in each case. What analysis can you make by looking at these graphs? What kind of
movies attracts the highest investments and do they promise a better ROI?
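A sketch of one of the three bar graphs on toy data; ROI is assumed here to be (revenue - budget) / budget, and the same code can be repeated for budget and ROI:

set.seed(1)
# Toy stand-in; the real data frame has many more titles
movies <- data.frame(
  title   = paste("Movie", 1:30),
  budget  = runif(30, 1e6, 3e8),
  revenue = runif(30, 1e6, 1e9)
)
movies$ROI <- (movies$revenue - movies$budget) / movies$budget

# Top 20 titles by revenue
top20 <- head(movies[order(-movies$revenue), ], 20)

par(mar = c(5, 8, 2, 1))                       # widen the left margin for titles
barplot(rev(top20$revenue), names.arg = rev(top20$title),
        horiz = TRUE, las = 1,
        xlab = "Revenue", main = "Top 20 titles by revenue")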
Problem 6
6. Put yourself in the shoes of a production house. You want to produce the next big blockbuster. Plot
the ROI, revenue and budget across genres to finalize the genre of your upcoming movie as you did in
the previous problem. Elaborate your answers with proper explanation. Since one movie can fall in
multiple genre categories, you are free to choose a combination. You can also understand how the
popularity of different genres has changed along the years. Do provide a nice name for your movie
and your dream cast ;)
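A sketch of how average ROI per genre could be computed; it assumes the genres column stores multiple genres in one string separated by "|" (the toy data frame is a placeholder):

# Toy stand-in; the "|" separator in the genres column is an assumption
movies <- data.frame(
  title  = c("A", "B", "C", "D"),
  genres = c("Horror|Thriller", "Comedy", "Horror", "Action|Comedy"),
  ROI    = c(30, 2, 20, 1.5)
)

# One row per (movie, genre) pair, then the average ROI per genre
genre_rows <- do.call(rbind, lapply(seq_len(nrow(movies)), function(i) {
  data.frame(genre = strsplit(movies$genres[i], "\\|")[[1]],
             ROI   = movies$ROI[i])
}))
avg_roi <- aggregate(ROI ~ genre, data = genre_rows, FUN = mean)
avg_roi[order(-avg_roi$ROI), ]   # genre with the highest average ROI on top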
CONCLUSION:
The top genre turns out to be Horror, with an average return (ROI) of 24.8604712.
The name of my new movie is: The Nun - The Origins, an addition to the Nun-verse, which seemed terrifying.
Cast:
Demián Bichir as Father Burke
Taissa Farmiga as Sister Irene (cutie :))
Jonas Bloquet as Frenchie
Bonnie Aarons as the Nun, etc
DATA ANALYTICS
WORKSHEET 2
Problem 1
1. Find the total number of accidents in each state for the year 2016 and display your results. Make
sure to display all rows while printing the dataframe. Print only the necessary columns. (Hint: use the
grep command to help filter out column names).
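A minimal sketch of the grep-based filtering (the accidents data frame, its column names and the numbers below are all placeholders/assumptions):

# Toy stand-in for the accidents data frame
accidents <- data.frame(
  STATE.UT                         = c("State A", "State B", "State C"),
  Total.Number.of.Accidents...2016 = c(40000, 35000, 4000),
  Total.Number.of.Accidents...2014 = c(39000, 33000, 4200)
)

# grep keeps only the columns whose names mention 2016
cols_2016 <- grep("2016", names(accidents), value = TRUE)
result <- accidents[, c("STATE.UT", cols_2016)]

print(result, row.names = FALSE)   # all rows, only the necessary columns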
Problem 2
Find the fatality rate (= total number of deaths / total number of accidents) in each state. Find out if
there is a significant linear correlation at a significance level of α = 0.05 between the fatality rate of a state
and the mist/foggy rate (fraction of total accidents that happen in mist/foggy conditions). Correlation
between two continuous RVs: Pearson's correlation coefficient. Pearson's correlation coefficient
between two RVs x and y is given by:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² )
(Hint: use the ggscatter library to plot a scatterplot with the confidence interval of the correlation
coefficient). Plot the fatality rate and mist/foggy rate (see this and this for R plot customization).
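A sketch of the test and plot, assuming the per-state rates have already been computed into a data frame (the values below are placeholders; ggscatter comes from the ggpubr package):

library(ggpubr)   # provides ggscatter

# Toy per-state rates
rates <- data.frame(
  fatality_rate = c(0.25, 0.30, 0.18, 0.40, 0.22, 0.35),
  foggy_rate    = c(0.02, 0.05, 0.01, 0.08, 0.03, 0.06)
)

# Test for a significant linear correlation at alpha = 0.05
cor.test(rates$fatality_rate, rates$foggy_rate, method = "pearson")

# Scatterplot with regression line, confidence band and correlation coefficient
ggscatter(rates, x = "foggy_rate", y = "fatality_rate",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Mist/foggy accident rate", ylab = "Fatality rate")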
CONCLUSION:
(Hint: it is equivalent to calculating the Pearson correlation between a continuous and a dichotomous
variable. You could also use the ltm package’s biserial.cor function).
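A small sketch of that equivalence (toy vectors; the level = 2 argument is my assumption so that biserial.cor's sign matches cor()):

library(ltm)   # provides biserial.cor

# Toy data: a continuous rate and a dichotomous (0/1) indicator
rate      <- c(0.25, 0.30, 0.18, 0.40, 0.22, 0.35)
indicator <- c(0, 1, 0, 1, 0, 1)

cor(rate, indicator)                      # Pearson with a 0/1 variable
biserial.cor(rate, indicator, level = 2)  # point-biserial; should agree with cor()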
Problem 5
5. Similar to in Problem 4, create a binary column to represent whether a dust storm accident has
occurred in a state (1 = occurred, 0 = not occurred). Convert the two columns into a contingency table.
Calculate the phi coefficient of the two tables. (Hint: use the psych package).
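A minimal sketch with the psych package (the two binary columns and their names are placeholders):

library(psych)   # provides phi()

# Toy binary indicators per state (1 = occurred, 0 = not occurred)
storm_a    <- c(1, 0, 1, 1, 0, 0, 1, 0)   # placeholder for the Problem 4 column
dust_storm <- c(1, 0, 0, 1, 0, 0, 1, 1)

ct <- table(storm_a, dust_storm)   # 2 x 2 contingency table
ct
phi(ct)                            # phi coefficient of association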
Problem 6
6. Read about correlation on this website and analyze the effect of sample size on correlation
coefficients and spurious correlation. Are correlation coefficients affected by outliers?
SOL:
As the sample size increases, correlation coefficients become more stable and spurious
correlation can be avoided.
Thus larger sample sizes are more reliable and give more stable estimates than smaller sample
sizes. The calculated sample correlation coefficient becomes more trustworthy and closer to
the population correlation (if the two variables are actually correlated, the sample keeps
reflecting that correlation, and so on).
Smaller samples, as in the ball-drawing experiment at the north and south poles on that website,
tend to be highly variable and can give artificially low or high correlation coefficients.
Spurious correlation refers to a situation where two variables appear to be correlated, but in
reality they are not directly related. This can occur due to the presence of a confounding
(third) variable or purely by chance (as seen with the ball-drawing experiment at the north and
south poles).
Smaller sample sizes can contribute to spurious correlation, so larger sample sizes are
preferred in order to know the actual relation between two parameters.
Thus correlation does not imply causation.
Outliers can distort the calculated correlation coefficient. An outlier that lies far from the rest
of the data points can pull the regression line towards itself, leading to an overestimated or
underestimated correlation.
Pearson's correlation coefficient is sensitive to outliers because it measures the degree of
linear relationship between the variables.
Other correlation measures, such as Spearman's rank correlation, are much less affected by
outliers because they consider the rank order of the data rather than the specific values.
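A quick illustration of that sensitivity on simulated data (not from the worksheet's dataset):

set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.5)      # moderately correlated data

cor(x, y, method = "pearson")     # baseline Pearson correlation
cor(x, y, method = "spearman")    # baseline Spearman correlation

# Add a single extreme outlier
x2 <- c(x, 10); y2 <- c(y, -10)

cor(x2, y2, method = "pearson")   # drops sharply: pulled by the outlier
cor(x2, y2, method = "spearman")  # changes much less: ranks dampen the outlier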
Problem 7
7. Look at these plots and answer: What problems do they have? How do they affect correlation
analysis?
SOL:
i) The direct causation could be that people die by carelessly checking their phone when they shouldn't,
but such deaths seem far too driven by chance to explain the pattern. The nearly perfect
correlation suggests that this is spurious and there is some third variable which both these
parameters depend on, such as older people owning iPhones.
ii) Similarly, this also seems to be a spurious correlation. Spending on admission to spectator sports and
being really health-conscious both seem like things the upper middle class and above would bother
about, so the income/poshness of an area might just be the confounding variable.
iii) This straight up seems to be a spurious correlation by chance. Both of these might also be affected
by the overall level of economic activity in a region.
E.g.: if the economy is doing well, people are more likely to buy cars and take vacations. This could lead
to an increase in both automobile sales and trips to Universal Orlando.
SOL:
2. Although a few articles online state that these two parameters measure two totally
different things and must not be correlated, I believe that they are related.
eBay total gross merchandise volume (GMV) is the total value of all items sold on eBay's
platforms in a given period of time, and many studies have shown that Black Friday and "Cyber
Monday" specifically show up to a 300% increase in sales. Thus they are positively
correlated.
SOL:
3. These are skewed scales with totally different ranges of company revenue and thus cannot be
correlated, as they manipulate the ranges to make the data appear aligned.
DATA ANALYTICS
WORKSHEET 3
Problem 1
1. Read the data set and display the box plot for each of the fitness plans A, B, C, D. Analyze the box
plot for outliers
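A minimal sketch of the plot (the fitness data frame below is a toy stand-in; the column names plan and marks are assumptions):

# Toy stand-in: marks for students under four fitness plans (A-D)
set.seed(1)
fitness <- data.frame(
  plan  = rep(c("A", "B", "C", "D"), each = 20),
  marks = c(rnorm(20, 70, 5), rnorm(20, 75, 5),
            rnorm(20, 65, 8), rnorm(20, 72, 5))
)

# One box per plan; points beyond the whiskers are flagged as outliers
boxplot(marks ~ plan, data = fitness,
        xlab = "Fitness plan", ylab = "Marks",
        main = "Marks by fitness plan")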
Problem 2
2. Is the data symmetrical or skewed for each group? Verify the normality assumption for ANOVA.
(Hint: Find the Pearson’s moment coefficient of skewness and justify it with probability distribution
function plot or you can also plot the Q-Q plot)
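A sketch of the skewness and Q-Q checks, reusing the toy fitness data frame from the previous sketch:

# Pearson's moment coefficient of skewness (third standardized moment)
skew <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5
tapply(fitness$marks, fitness$plan, skew)   # values near 0 suggest symmetry

# Q-Q plot for one group (repeat for each plan)
a_marks <- fitness$marks[fitness$plan == "A"]
qqnorm(a_marks); qqline(a_marks)
shapiro.test(a_marks)   # optional formal normality check alongside the plot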
Problem 3
3. Is there any evidence to suggest a difference in the average marks obtained by students under
different fitness plans? Explain what test you are using and why. Define the hypothesis and the steps
of testing. What does the output of this test signify? (Note: Assume the significance level to be 0.05)
SOL:
To determine whether there is evidence to suggest a difference in the average marks obtained by
students under different fitness plans, you can perform a one-way ANOVA (Analysis of Variance) test.
ANOVA is suitable when you have more than two groups and want to test whether there are any
statistically significant differences among the group means.
Hypotheses:
Null hypothesis (H0): the mean marks are the same under all fitness plans.
Alternative hypothesis (H1): at least one fitness plan has a different mean mark.
In plain terms, the null hypothesis suggests that there's no real difference in the average marks
obtained by students under different fitness plans. It's like saying "The different fitness plans don't
really make a difference in how well students perform."
Perform ANOVA:
Fit a one-way ANOVA model to our data and obtain the F-statistic and its associated p-value.
Analyze p-value:
If the p-value is less than α (0.05), you reject the null hypothesis and conclude that there is evidence
of a significant difference in the average marks among the fitness plans.
If the p-value is greater than or equal to α (0.05), you fail to reject the null hypothesis, and you do not
have sufficient evidence to claim a significant difference.
Output Significance:
A significant result (small p-value) suggests that there is evidence to support the claim that the fitness
plans have a significant effect on the average marks obtained by students.
A non-significant result (large p-value) suggests that you do not have enough evidence to conclude
that there are significant differences among the group means.
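A minimal sketch of the test in R, reusing the toy fitness data frame from Worksheet 3, Problem 1:

# One-way ANOVA: marks explained by fitness plan
model <- aov(marks ~ plan, data = fitness)
summary(model)   # reject H0 if the Pr(>F) value for plan is below 0.05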
Problem 4
4. Which specific task exhibits the lowest average training time? Does the combination of different
treats and tasks significantly influence the training time for pets?
The p-value for the interaction between tasks and treats is greater than the significance level of 0.05, hence the
combination does not significantly influence the training time of the pets.
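A sketch of how this could be checked with a two-way ANOVA (the pets data frame and its column names treat, task and time are assumptions):

# Toy stand-in for the pets training data
set.seed(1)
pets <- expand.grid(treat = c("A", "B", "C"), task = c("I", "II", "III"), rep = 1:10)
pets$time <- rnorm(nrow(pets), mean = 20, sd = 3)

# Task with the lowest average training time
sort(tapply(pets$time, pets$task, mean))[1]

# Two-way ANOVA with interaction: does treat x task affect training time?
model2 <- aov(time ~ treat * task, data = pets)
summary(model2)   # the treat:task row tests the interaction at alpha = 0.05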
Problem 5
5. Does the choice of treats significantly impact the training time for different tasks? Which specific
combinations of treats and tasks lead to the most significant differences in training time? (Note:
Assume the significance level to be 0.05 )
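One way to find the specific combinations is Tukey's HSD on the interaction term, reusing the toy two-way ANOVA model from Problem 4 (the real worksheet would use the actual pets data):

# Pairwise comparisons for the treat:task interaction
tukey <- TukeyHSD(model2, which = "treat:task")
comparisons <- tukey$`treat:task`
head(comparisons[order(comparisons[, "p adj"]), ])   # smallest adjusted p-values first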
CONCLUSION:
The combinations III:A-I:B and III:C-I:B show the most significant differences in training time,
based on their p-values.