Big Data Assignment 1 Solutions

This document contains solutions to assignments on analyzing datasets related to Bollywood movies and heart disease. It includes using pandas to clean and explore the datasets, calculating metrics like return on investment, and visualizing relationships between variables through plots.

Uploaded by

Ashutosh Uke
Copyright: © All Rights Reserved

Assignment 1 Solutions

1. How many records are present in the dataset? Print the metadata information of the
dataset.
Ans. bollywood.info()

2. How many movies got released in each genre? Which genre had highest number of
releases? Sort number of releases in each genre in descending order.

Ans. bollywood["Genre"].value_counts()

3. How many movies in each genre got released in different release times like long
weekend, festive season, etc.? (Note: Do a cross tabulation between Genre and
ReleaseTime.)

Ans. pd.crosstab(bollywood["Genre"],bollywood["ReleaseTime"])
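To see what the cross tabulation returns, here is a minimal sketch on a few made-up rows. The toy genres and release-time codes are assumptions for illustration, not values taken from the assignment dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real bollywood dataset is not reproduced here.
toy = pd.DataFrame({
    "Genre": ["Comedy", "Comedy", "Drama", "Drama", "Drama", "Action"],
    "ReleaseTime": ["N", "FS", "N", "N", "LW", "FS"],
})

# Rows are genres, columns are release times, cells are movie counts.
# margins=True adds an "All" row and column holding the totals.
table = pd.crosstab(toy["Genre"], toy["ReleaseTime"], margins=True)
print(table)
```

The "All" column gives the same per-genre totals as value_counts() in question 2, so the two answers can be checked against each other.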
4. In which month of the year is the maximum number of movie releases seen? (Note: Extract a
new column called month from the ReleaseDate column.)
Ans. bollywood["Month"]=pd.DatetimeIndex(bollywood["Release Date"]).month
print(bollywood[["MovieName","Month"]])
bollywood["Month"].value_counts()
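The same month extraction can be sketched with pd.to_datetime and the .dt accessor, which also makes it easy to pick the busiest month directly. The toy rows and the date format string below are assumptions; adjust the format to whatever the real column contains.

```python
import pandas as pd

# Hypothetical toy rows; the date format "%d-%b-%y" is an assumption.
toy = pd.DataFrame({
    "MovieName": ["A", "B", "C", "D"],
    "Release Date": ["18-Apr-14", "4-Jan-13", "25-Apr-14", "1-May-15"],
})

# Parse the strings into datetimes, then pull out the month number (1-12).
dates = pd.to_datetime(toy["Release Date"], format="%d-%b-%y")
toy["Month"] = dates.dt.month

# value_counts() sorts by frequency; idxmax() returns the most frequent month.
month_counts = toy["Month"].value_counts()
busiest_month = month_counts.idxmax()
print(month_counts)
print("Busiest month:", busiest_month)
```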

5. Which month of the year typically sees most releases of high budgeted movies, that is,
movies with budget of 25 crore or more?

Ans. bollywood[bollywood["Budget"]>=25]["Month"].value_counts()
6. Which are the top 10 movies with maximum return on investment (ROI)? Calculate
return on investment (ROI) as (BoxOfficeCollection-Budget) / Budget.

Ans. bollywood["ROI"]=(bollywood["BoxOfficeCollection"]-bollywood["Budget"])/bollywood["Budget"]
bollywood[["MovieName","ROI"]].sort_values("ROI",ascending=False)[0:10]
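As a worked check of the ROI formula, the sketch below applies it to three made-up movies. The figures are hypothetical, and nlargest is used here as an alternative to sorting and slicing.

```python
import pandas as pd

# Hypothetical budgets and collections, chosen only to make the arithmetic easy.
toy = pd.DataFrame({
    "MovieName": ["A", "B", "C"],
    "Budget": [10.0, 50.0, 20.0],
    "BoxOfficeCollection": [30.0, 40.0, 25.0],
})

# ROI = (BoxOfficeCollection - Budget) / Budget, computed column-wise.
toy["ROI"] = (toy["BoxOfficeCollection"] - toy["Budget"]) / toy["Budget"]

# nlargest(n, column) returns the n rows with the highest values in that column.
top = toy.nlargest(2, "ROI")[["MovieName", "ROI"]]
print(top)
```

Movie A earns (30 - 10) / 10 = 2.0, movie C earns (25 - 20) / 20 = 0.25, and movie B has a negative ROI of -0.2 because it collected less than its budget.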

7. Do movies have a higher ROI if they are released during a festive season or a long
weekend? Calculate the average ROI for different release times.
Ans. bollywood.groupby("ReleaseTime")["ROI"].mean()
8. Draw a histogram and a distribution plot to find out the distribution of movie budgets.
Interpret the plot to conclude whether most movies are high- or low-budget movies.
Ans. import matplotlib.pyplot as plt
import seaborn as sn
plt.hist(bollywood["Budget"],bins=5)
plt.show()
# distplot is deprecated in recent seaborn releases; histplot with kde=True draws the histogram plus a density curve
sn.histplot(bollywood["Budget"],kde=True)

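9. Compare the distribution of ROIs between movies in the comedy and drama genres.
Which genre typically sees higher ROIs?
Ans. # Assumes the Genre labels are exactly "Comedy" and "Drama"; strip stray whitespace first if needed.
comedy_drama=bollywood[bollywood["Genre"].isin(["Comedy","Drama"])]
sn.boxplot(x="Genre",y="ROI",data=comedy_drama)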
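When asking which genre "typically" sees higher ROIs, a per-genre sum can mislead because it grows with the number of movies and with single outliers. The sketch below, on made-up ROI values, puts the median, mean and sum side by side.

```python
import pandas as pd

# Hypothetical ROI values; one comedy is a deliberate outlier.
toy = pd.DataFrame({
    "Genre": ["Comedy", "Comedy", "Comedy", "Drama", "Drama", "Drama"],
    "ROI": [0.5, 1.0, 6.0, 0.2, 0.4, 0.6],
})

# The median describes a typical movie; the mean and sum are pulled up by the outlier.
summary = toy.groupby("Genre")["ROI"].agg(["median", "mean", "sum"])
print(summary)
```

Here the comedy median is 1.0 while its mean is 2.5, which shows how one hit can dominate the mean and the sum.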
10. Is there a correlation between Box office collection and YouTube likes? Is the
correlation positive or negative?
Ans. corr_bolly=bollywood[["BoxOfficeCollection","YoutubeLikes"]].corr()
sn.heatmap(corr_bolly,annot=True)
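The sign of the correlation can also be read from a single number rather than a heatmap. A minimal sketch on made-up values, using Series.corr, which computes the Pearson coefficient by default:

```python
import pandas as pd

# Hypothetical values in which likes rise together with collections.
toy = pd.DataFrame({
    "BoxOfficeCollection": [10, 20, 30, 40],
    "YoutubeLikes": [100, 180, 310, 390],
})

# Pearson correlation lies in [-1, 1]; a positive value means the two rise together.
r = toy["BoxOfficeCollection"].corr(toy["YoutubeLikes"])
direction = "positive" if r > 0 else "negative"
print(round(r, 3), direction)
```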

11. Which genre of movies typically sees more YouTube likes? Draw a boxplot for each
genre of movies to compare.
Ans. sn.boxplot(x="Genre",y = "YoutubeLikes", data=bollywood)
12. Which of the variables among Budget, BoxOfficeCollection, YoutubeViews,
YoutubeLikes, YoutubeDislikes are highly correlated? Note: Draw pair plot or
heatmap.
Ans. features=["Budget","BoxOfficeCollection","YoutubeViews","YoutubeLikes","YoutubeDislikes"]
sn.pairplot(bollywood[features],height=2)

SAHeart Dataset

13. How many records are present in the dataset? Print the metadata information of the
dataset.
Ans. SAheart.info()

14. Draw a bar plot showing the number of persons with and without CHD, split by
whether or not they have a family history of the disease.
Ans. # Encode chd as 1 ("Si", i.e. yes) or 0 in one vectorized step instead of a row-by-row loop with chained assignment.
SAheart["chd"]=(SAheart["chd"]=="Si").astype(int)

pd.crosstab(SAheart["famhist"],SAheart["chd"]).plot.bar()
15. Does age have any correlation with sbp? Choose an appropriate plot to show the
relationship.
Ans. corr_heart=SAheart[["sbp","age"]].corr()
sn.heatmap(corr_heart,annot=True)

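16. Compare the distribution of tobacco consumption for persons with and without
CHD. Can you interpret the effect of tobacco on having coronary heart
disease?
Ans. dummy1 = SAheart[SAheart["chd"]==1]
dummy2 = SAheart[SAheart["chd"]==0]
# histplot with kde=True replaces the deprecated distplot; labels tell the two groups apart
sn.histplot(dummy1["tobacco"],kde=True,color="red",label="CHD")
sn.histplot(dummy2["tobacco"],kde=True,color="blue",label="No CHD")
plt.legend()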
17. How are the parameters sbp, obesity, age and ldl correlated? Choose the right plot to
show the relationships.
Ans. corr_heart = SAheart[["sbp","age","obesity","ldl"]].corr()
sn.heatmap(corr_heart,annot=True)

18. Derive new column called agegroup from age column where persons falling in
different age ranges are categorized as below:
(0-15):Young
(15-35):adults
(35-55):mid
(55-):old
Ans. SAheart["agegroup"]=pd.cut(SAheart.age,bins=[0,14,34,54,99],labels=["Young","Adults","Mid","Old"])
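The bin edges 14, 34 and 54 work because pd.cut closes each interval on the right by default, so the bins are (0, 14], (14, 34], (34, 54] and (54, 99]. A small sketch on made-up boundary ages shows where each one lands:

```python
import pandas as pd

# Hypothetical ages sitting on either side of each boundary.
ages = pd.Series([14, 15, 34, 35, 54, 55])

# Default right=True: each bin includes its upper edge and excludes its lower edge.
groups = pd.cut(
    ages,
    bins=[0, 14, 34, 54, 99],
    labels=["Young", "Adults", "Mid", "Old"],
)
print(pd.DataFrame({"age": ages, "agegroup": groups}))
```

With these bins, any age above 99 (or equal to 0) falls outside every interval and becomes NaN, so the upper edge should be raised if the data can contain older persons.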
19. Find out number of chd cases in different age categories. Do a barplot and sort them
in the order of age groups.
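Ans. # chd is coded 1/0, so sum() counts CHD cases; count() would count every person in the group.
SAheart["chd"].astype(int).groupby(SAheart["agegroup"]).sum().plot.bar()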
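Counting rows and counting cases are different aggregations once chd is coded as 1/0. The sketch below, on made-up rows, contrasts the two:

```python
import pandas as pd

# Hypothetical rows: chd is 1 for a case and 0 otherwise.
toy = pd.DataFrame({
    "agegroup": ["Adults", "Adults", "Mid", "Mid", "Old"],
    "chd": [0, 1, 0, 0, 1],
})

# count() returns the number of non-missing rows per group (persons).
n_people = toy.groupby("agegroup")["chd"].count()

# sum() adds up the 1s per group, which is the number of CHD cases.
n_cases = toy.groupby("agegroup")["chd"].sum()
print(pd.DataFrame({"persons": n_people, "chd_cases": n_cases}))
```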

20. Draw a box plot to compare distributions of ldl for different age groups.
Ans. sn.boxplot(x="agegroup",y = "ldl", data=SAheart)
