Big Data Assignment 1 Solutions

This document contains solutions to assignments on analyzing datasets related to Bollywood movies and heart disease. It includes using pandas to clean and explore the datasets, calculating metrics like return on investment, and visualizing relationships between variables through plots.

Uploaded by

Ashutosh Uke


Assignment 1 Solutions

1. How many records are present in the dataset? Print the metadata information of the
dataset.
Ans. bollywood.info()

2. How many movies got released in each genre? Which genre had highest number of
releases? Sort number of releases in each genre in descending order.

Ans. bollywood["Genre"].value_counts()

3. How many movies in each genre got released in different release times like long
weekend, festive season, etc.? (Note: Do a cross tabulation between Genre and
ReleaseTime.)

Ans. pd.crosstab(bollywood["Genre"],bollywood["ReleaseTime"])
4. In which month of the year is the maximum number of movie releases seen? (Note: Extract a
new column called month from the ReleaseDate column.)
Ans. bollywood["Month"]=pd.DatetimeIndex(bollywood["Release Date"]).month
print(bollywood[["MovieName","Month"]])
bollywood["Month"].value_counts()
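The month extraction can be tried on a few rows before running it on the full dataset. This is a minimal sketch, assuming ISO-formatted date strings; the `sample` frame and its rows are made up for illustration and are not taken from the bollywood dataset.

```python
import pandas as pd

# Hypothetical sample rows (not from the real dataset)
sample = pd.DataFrame({
    "MovieName": ["A", "B", "C", "D"],
    "Release Date": ["2014-01-10", "2014-01-24", "2014-04-18", "2013-01-04"],
})

# Parse the strings into datetimes, then read the month number (1-12) via .dt
sample["Month"] = pd.to_datetime(sample["Release Date"]).dt.month

# value_counts() counts releases per month; idxmax() returns the busiest month
releases_per_month = sample["Month"].value_counts()
busiest_month = releases_per_month.idxmax()
print(releases_per_month)
print("Busiest month:", busiest_month)
```

`pd.to_datetime(...).dt.month` yields the same month numbers as `pd.DatetimeIndex(...).month`; it is just the Series-based spelling.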

5. Which month of the year typically sees most releases of high budgeted movies, that is,
movies with budget of 25 crore or more?

Ans. bollywood[bollywood["Budget"]>=25]["Month"].value_counts()
6. Which are the top 10 movies with maximum return on investment (ROI)? Calculate
return on investment (ROI) as (BoxOfficeCollection-Budget) / Budget.

Ans. bollywood["ROI"]= (bollywood["BoxOfficeCollection"]-bollywood["Budget"]) / bollywood["Budget"]
bollywood[["MovieName","ROI"]].sort_values("ROI",ascending=False)[0:10]
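As a quick check of the ROI formula, it can be worked through on three made-up movies. The `movies` frame below is hypothetical; `nlargest` is used here as an alternative to sorting and slicing, not as the method the assignment requires.

```python
import pandas as pd

# Hypothetical budgets and collections (in crore), for illustration only
movies = pd.DataFrame({
    "MovieName": ["A", "B", "C"],
    "Budget": [10.0, 50.0, 20.0],
    "BoxOfficeCollection": [40.0, 60.0, 10.0],
})

# ROI = (BoxOfficeCollection - Budget) / Budget
movies["ROI"] = (movies["BoxOfficeCollection"] - movies["Budget"]) / movies["Budget"]

# nlargest(n, column) returns the n rows with the highest values in that column
top = movies.nlargest(2, "ROI")[["MovieName", "ROI"]]
print(top)
```

Movie A earns 30 on a budget of 10, so its ROI is 3.0; movie C loses half its budget, so its ROI is -0.5.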

7. Do the movies have higher ROI if they get released on festive season or long
weekend? Calculate the average ROI for different release times.
Ans. bollywood.groupby("ReleaseTime")["ROI"].mean()
8. Draw a histogram and distribution plot to find out the distribution of movie budgets.
Interpret the plot to conclude if most movies are high or low budgeted movies.
Ans. import matplotlib.pyplot as plt
import seaborn as sn
plt.hist(bollywood["Budget"],bins=5)
plt.show()
# distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it
sn.histplot(bollywood["Budget"],kde=True)

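9. Compare the distribution of ROIs between movies with comedy genre and drama.
Which genre typically sees higher ROIs?
Ans. # Summing ROI per genre depends on how many movies each genre has, so compare distributions instead
# str.strip() guards against stray spaces in the genre labels
comedy_drama = bollywood[bollywood["Genre"].str.strip().isin(["Comedy","Drama"])]
sn.boxplot(x="Genre",y="ROI",data=comedy_drama)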
10. Is there a correlation between Box office collection and YouTube likes? Is the
correlation positive or negative?
Ans. corr_bolly=bollywood[["BoxOfficeCollection","YoutubeLikes"]].corr()
sn.heatmap(corr_bolly,annot=True)
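The sign of the correlation can also be read from a single number rather than a heatmap. A minimal sketch on made-up values (the `toy` frame is hypothetical), using `Series.corr`, which computes the Pearson coefficient by default:

```python
import pandas as pd

# Hypothetical values where likes rise together with collections
toy = pd.DataFrame({
    "BoxOfficeCollection": [10.0, 20.0, 30.0, 40.0],
    "YoutubeLikes": [100, 250, 300, 480],
})

# Pearson correlation coefficient between the two columns (range -1 to 1)
r = toy["BoxOfficeCollection"].corr(toy["YoutubeLikes"])

# The sign of r tells whether the relationship is positive or negative
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(round(r, 3), direction)
```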

11. Which genre of movies typically sees more YouTube likes? Draw boxplot for each
genre of movies to compare.
Ans. sn.boxplot(x="Genre",y = "YoutubeLikes", data=bollywood)
12. Which of the variables among Budget, BoxOfficeCollection, YoutubeViews,
YoutubeLikes, YoutubeDislikes are highly correlated? Note: Draw pair plot or
heatmap.
Ans. features=["Budget","BoxOfficeCollection","YoutubeViews","YoutubeLikes","YoutubeDislikes"]
sn.pairplot(bollywood[features],height=2)
SAHeart Dataset
13. How many records are present in the dataset? Print the metadata information of the
dataset.
Ans. SAheart.info()

14. Draw a bar plot to show the number of persons having CHD or not, compared against
whether or not they have a family history of the disease.
Ans. SAheart["chd"] = (SAheart["chd"]=="Si").astype(int)  # "Si" -> 1, anything else -> 0

pd.crosstab(SAheart["famhist"],SAheart["chd"]).plot.bar()
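Raw counts can be hard to compare when the two family-history groups differ in size. One option, sketched here on made-up rows (the `toy` frame is hypothetical and assumes chd is already coded 0/1), is to normalize each row of the cross tabulation so it shows proportions:

```python
import pandas as pd

# Hypothetical records: family history label and chd coded as 0/1
toy = pd.DataFrame({
    "famhist": ["Present"] * 4 + ["Absent"] * 4,
    "chd": [1, 1, 0, 0, 1, 0, 0, 0],
})

# normalize="index" divides each row by its total, giving shares per famhist group
shares = pd.crosstab(toy["famhist"], toy["chd"], normalize="index")
print(shares)
```

Calling `.plot.bar()` on `shares` would draw the same kind of grouped bar chart, with proportions on the y-axis instead of counts.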
15. Does age have any correlation with sbp? Choose an appropriate plot to show the
relationship.
Ans. corr_heart=SAheart[["sbp","age"]].corr()
sn.heatmap(corr_heart,annot=True)

16. Compare the distribution of tobacco consumption for persons having CHD and not
having CHD. Can you interpret the effect of tobacco on having coronary heart
disease?
Ans. dummy1 = SAheart[SAheart["chd"]==1]
dummy2 = SAheart[SAheart["chd"]==0]
# kdeplot replaces the deprecated distplot; labels tell the two curves apart
sn.kdeplot(dummy1["tobacco"],label="CHD")
sn.kdeplot(dummy2["tobacco"],label="No CHD")
plt.legend()
17. How are the parameters sbp, obesity, age and ldl correlated? Choose the right plot to
show the relationships.
Ans. corr_heart = SAheart[["sbp","age","obesity","ldl"]].corr()
sn.heatmap(corr_heart,annot=True)

18. Derive new column called agegroup from age column where persons falling in
different age ranges are categorized as below:
(0-15):Young
(15-35):adults
(35-55):mid
(55-):old
Ans.
SAheart["agegroup"]=pd.cut(SAheart.age,bins=[0,14,34,54,99],labels=["Young","Adults","Mid","Old"])
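The bin edges above work for whole-number ages because `pd.cut` closes each bin on the right by default. An alternative sketch, assuming the intended ranges are [0, 15), [15, 35), [35, 55) and 55 and over, states the boundaries from the question directly with `right=False`; the `ages` values are made up to probe the edges.

```python
import pandas as pd

# Hypothetical ages chosen to sit on either side of each boundary
ages = pd.Series([14, 15, 34, 35, 54, 55, 64])

# right=False makes bins closed on the left: [0, 15), [15, 35), [35, 55), [55, inf)
groups = pd.cut(
    ages,
    bins=[0, 15, 35, 55, float("inf")],
    right=False,
    labels=["Young", "Adults", "Mid", "Old"],
)
print(pd.DataFrame({"age": ages, "agegroup": groups}))
```

With this form an age of exactly 15 lands in Adults and 55 lands in Old, and there is no upper cap at 99.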
19. Find out the number of chd cases in different age categories. Do a barplot and sort them
in the order of age groups.
Ans. SAheart.groupby("agegroup")["chd"].sum().plot.bar()
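When chd is coded as 0/1, `count()` and `sum()` answer different questions: the first counts rows in each group, the second counts cases. A small sketch on made-up rows (the `toy` frame is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical records with chd coded as 0 (no) / 1 (yes)
toy = pd.DataFrame({
    "agegroup": ["Adults", "Adults", "Mid", "Mid", "Mid", "Old"],
    "chd": [0, 1, 1, 1, 0, 1],
})

# count() tallies every non-null row, regardless of the chd value
rows_per_group = toy.groupby("agegroup")["chd"].count()

# sum() adds up the 1s, i.e. the number of chd cases
cases_per_group = toy.groupby("agegroup")["chd"].sum()

print(rows_per_group)
print(cases_per_group)
```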

20. Draw a box plot to compare distributions of ldl for different age groups.
Ans. sn.boxplot(x="agegroup",y = "ldl", data=SAheart)
