SMDM-Project Report (Madhur Dhananiwala)

This problem analyzes a dataset containing annual spending amounts of various product categories for 440 customers of a wholesale distributor. The goals are to describe variation among customer types and provide insights to help structure delivery service. Key findings include: - Spending on "detergents and paper products" is strongly correlated with "grocery products" spending. - Hotel channel customers on average spend more on all product categories than retail channel customers. - Customers in the Lisbon region spend the most on average across all product categories compared to other regions. Descriptive statistics are calculated and exploratory data analysis is conducted including pairplots and correlation analysis to understand relationships between variables. Contingency tables are made

SMDM PROJECT

REPORT

MADHUR DHANANIWALA

PGP – DSBA ONLINE Dec_c 2021

Course: Statistical Methods for Decision Making

Date: 27/02/2022
Table Of Contents
1 – Wholesale Customer Data Analysis
- Problem 1 Summary
- Problem 1 Description of variables
- Problem 1 Sample Data
- Problem 1 EDA
- Problem 1 ( 1.1 )
- Problem 1 ( 1.2 )
- Problem 1 ( 1.3 )
- Problem 1 ( 1.4 )
- Problem 1 ( 1.5 )

2 – Clear Mountain State University (CMSU) Survey
- Problem 2 Summary
- Problem 2 Sample Data
- Problem 2 EDA
- Problem 2 ( 2.1 )
- Problem 2 ( 2.2 )
- Problem 2 ( 2.3 )
- Problem 2 ( 2.4 )
- Problem 2 ( 2.5 )
- Problem 2 ( 2.6 )
- Problem 2 ( 2.7 )
- Problem 2 ( 2.8 )

3 – Hypothesis Testing for Quality of Shingles
- Problem 3 Summary
- Problem 3 Sample Data
- Problem 3 EDA
- Problem 3 ( 3.1 )
- Problem 3 ( 3.2 )

LIST OF FIGURES
1 – Wholesale Customer Data Analysis
Fig 1 – Pairplot for Data interaction
Fig 2 – Pearson Correlation
Fig – 3 Item Fresh vs channel bar plot
Fig – 4 Item Milk vs channel bar plot
Fig – 5 Item Grocery vs channel bar plot
Fig – 6 Item Frozen vs channel bar plot
Fig – 7 Item Detergents vs channel bar plot
Fig – 8 Item Delicatessen vs channel bar plot
Fig – 9 Outliers in data Box plot
2 – Clear Mountain State University Survey
Fig – 10 histogram of GPA and count
Fig – 11 histogram of salary and count
Fig – 12 histogram of salary and spending
Fig – 13 histogram of salary and text message

List of Tables
1 – Wholesale Customer Data Analysis
Table 1: Wholesale Distributor Sample
Table 2 – Descriptive statistics
Table 3 - Descriptive statistics with channel and retail
Table 4 - Average product spending of Channel
Table 5 - Average product spending of Region
2 – Clear Mountain State University (CMSU) Survey
Table 6 – Sample of data
Table 7 – Contingency of gender and major
Table 8 – Contingency of Gender and Grad Intention
Table 9 – Contingency of Gender and Employment
Table 10 – Contingency of Gender and Computer
Table 11 –Contingency of Gender and major
Table 12 –Contingency of Gender and major
Table 13 –Contingency of Gender and grad Intention
Table 14 –Contingency of Gender and computer
Table 15 –Contingency of Gender and Employment
Table 16 –Contingency of Gender and major
Table 17 –Contingency of Gender and intent graduate
Table 18 –Contingency of Gender and GPA
Table 19 –Contingency of Gender and salary
3 – Hypothesis Testing for Quality of Shingles
Table 20 – Sample of data
Table 21 – Descriptive statistics of the data

PROBLEM 1
A wholesale distributor operating in different regions of Portugal has information on the annual spending on
several items in its stores across different regions and channels. The data consists of 440 large retailers’
annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across
2 sales channels (Hotel, Retail).

Problem Summary:
In this problem, we analyze a dataset of customers' annual spending on diverse product categories, looking for
internal structure. One goal of this report is to best describe the variation in the
different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor
with insight into how to best structure its delivery service to meet the needs of each customer.
Description of variables is as follows:
• FRESH: annual spending on fresh products (Continuous);
• MILK: annual spending on milk products (Continuous);
• GROCERY: annual spending on grocery products (Continuous);
• FROZEN: annual spending on frozen products (Continuous);
• DETERGENTS_PAPER: annual spending on detergents and paper products (Continuous);
• DELICATESSEN: annual spending on delicatessen products (Continuous);
• CHANNEL: customer's channel, either Hotel (Hotel/Restaurant/Cafe) or Retail (Nominal);
• REGION: customer's region, one of Lisbon, Oporto or Other (Nominal);
• BUYER/SPENDER: a running ID number (assumed to be an index).

Sample of the dataset:

Table 1: Wholesale Distributor Sample

• The dataset gives data about sales of 6 categories of products across 3 regions via 2 channels.
• Region frequency (440 rows in total): Lisbon 77 rows, Oporto 47 rows, Other 316 rows.
• Channel frequency (440 rows in total): Hotel 298 rows, Retail 142 rows.

Exploratory Data Analysis:
- Let’s check the types of variables in the data frame.

There are 440 rows and 9 columns in the dataset. Of the 9 columns, 2 are of object type and the remaining 7 are of
integer type.

- Check for missing values in the dataset:

From the above results, we can see that there are no missing values in the dataset.

- Let's use the Seaborn pairplot to have a first look at how our data is interacting.

Fig 1 – Pairplot for Data interaction


From the pairplot above, the correlation between "detergents and paper products" and "grocery products"
appears to be strong: customers who spend more on one of these categories tend to spend more on the other.

- Let’s check how the variables relate to each other using a correlation plot.

Fig 2 – Pearson Correlation

From the above plot, we can see that there is a strong correlation (0.92) between
"detergents and paper products" and "grocery products".

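As a hedged illustration of what each cell of the correlation plot reports, the sketch below computes the Pearson correlation coefficient in plain Python. The two short lists are toy values, not rows from the wholesale dataset, and `pearson_r` is a name chosen here for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations (numerator) and the two
    # sums of squared deviations (denominator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy spending values for illustration only (not from the dataset).
toy_grocery = [100, 200, 300, 400, 500]
toy_detergents_paper = [30, 55, 95, 120, 160]
r = pearson_r(toy_grocery, toy_detergents_paper)
print(f"r = {r:.2f}")
```

A value close to 1 means the two categories rise together, which is the pattern the report describes for Grocery and Detergents_Paper.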
1.1)
Use methods of descriptive statistics to summarize data. Which Region and which
Channel spent the most? Which Region and which Channel spent the least?

Answer)

- Using the methods of descriptive statistics to summarize the wholesale customer’s data.

Table 2 – Descriptive statistics

-Descriptive statistics of wholesale customer’s data including channel and retail:

Table 3 - Descriptive statistics with channel and retail

- From the above two describe outputs, we can infer the following:
• Channel has two unique values, with "Hotel" the most frequent: 298 of the 440 customers (67.7 percent) belong to the "Hotel" channel.
• Region has three unique values, with "Other" the most frequent: 316 of the 440 customers (71.8 percent) belong to the "Other" region.

For each item below, the IQR is the basis for the outlier limits (Q1 - 1.5 IQR and Q3 + 1.5 IQR).

• Fresh (440 records): mean 12000.3, standard deviation 12647.3, min 3, max 112151.
Q1 (25%) is 3127.75, Q2 (50%) is 8504 and Q3 (75%) is 16933.8.
Range = max - min = 112151 - 3 = 112148; IQR = Q3 - Q1 = 16933.8 - 3127.75 = 13806.05.

• Milk (440 records): mean 5796.27, standard deviation 7380.38, min 55, max 73498.
Q1 (25%) is 1533, Q2 (50%) is 3627 and Q3 (75%) is 7190.25.
Range = max - min = 73498 - 55 = 73443; IQR = Q3 - Q1 = 7190.25 - 1533 = 5657.25.

• Grocery (440 records): mean 7951.28, standard deviation 9503.16, min 3, max 92780.
Q1 (25%) is 2153, Q2 (50%) is 4755.5 and Q3 (75%) is 10655.8.
Range = max - min = 92780 - 3 = 92777; IQR = Q3 - Q1 = 10655.8 - 2153 = 8502.8.

• Frozen (440 records): mean 3071.93, standard deviation 4854.67, min 25, max 60869.
Q1 (25%) is 742.25, Q2 (50%) is 1526 and Q3 (75%) is 3554.25.
Range = max - min = 60869 - 25 = 60844; IQR = Q3 - Q1 = 3554.25 - 742.25 = 2812.

• Detergents_Paper (440 records): mean 2881.49, standard deviation 4767.85, min 3, max 40827.
Q1 (25%) is 256.75, Q2 (50%) is 816.5 and Q3 (75%) is 3922.
Range = max - min = 40827 - 3 = 40824; IQR = Q3 - Q1 = 3922 - 256.75 = 3665.25.

• Delicatessen (440 records): mean 1524.87, standard deviation 2820.11, min 3, max 47943.
Q1 (25%) is 408.25, Q2 (50%) is 965.5 and Q3 (75%) is 1820.25.
Range = max - min = 47943 - 3 = 47940; IQR = Q3 - Q1 = 1820.25 - 408.25 = 1412.
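
The 1.5 IQR limits mentioned above can be worked out directly from the reported quartiles. The sketch below is one way to do that in plain Python; the Q1 and Q3 values are the ones quoted in this section, while the names `quartiles` and `iqr_limits` are illustrative choices rather than anything from the original notebook.

```python
# Q1 and Q3 for each item, as reported in the descriptive statistics above.
quartiles = {
    "Fresh": (3127.75, 16933.8),
    "Milk": (1533, 7190.25),
    "Grocery": (2153, 10655.8),
    "Frozen": (742.25, 3554.25),
    "Detergents_Paper": (256.75, 3922),
    "Delicatessen": (408.25, 1820.25),
}

def iqr_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

limits = {item: iqr_limits(q1, q3) for item, (q1, q3) in quartiles.items()}
for item, (lower, upper) in limits.items():
    print(f"{item}: lower = {lower:.2f}, upper = {upper:.2f}")
```

Every lower limit comes out negative, so with non-negative spending only the upper limit can flag a value. Each item's reported maximum lies above its upper limit, so every item has at least one high outlier by this rule.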

- Which Region and which Channel spent the most?


- Which Region and which Channel spent the least?

From the above output we see that:

The highest spending by Region is in Other and the lowest spending by Region is in
Oporto.

The highest spending by Channel is in Hotel and the lowest spending by Channel is in
Retail.

Taking Region and Channel together, the highest-spending combination is
Other/Hotel and
the lowest-spending combination is Oporto/Hotel.
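
A minimal sketch of how such totals can be obtained is given below, assuming each record carries a region, a channel and a total spend. The records are toy values for illustration only (not the wholesale data), and `total_by` is a hypothetical helper. The toy numbers also show how the lowest Region/Channel combination can sit inside the channel with the highest overall total.

```python
from collections import defaultdict

# Toy records for illustration only: (region, channel, total annual spend).
records = [
    ("Other", "Hotel", 500),
    ("Other", "Retail", 300),
    ("Lisbon", "Hotel", 200),
    ("Lisbon", "Retail", 150),
    ("Oporto", "Hotel", 50),
    ("Oporto", "Retail", 120),
    ("Other", "Hotel", 400),
]

def total_by(records, key):
    """Sum the spend of all records that share the same key."""
    totals = defaultdict(int)
    for rec in records:
        totals[key(rec)] += rec[2]
    return dict(totals)

by_region = total_by(records, key=lambda r: r[0])
by_channel = total_by(records, key=lambda r: r[1])
by_both = total_by(records, key=lambda r: (r[0], r[1]))

top_region = max(by_region, key=by_region.get)
top_channel = max(by_channel, key=by_channel.get)
low_pair = min(by_both, key=by_both.get)
print(top_region, top_channel, low_pair)
```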

- Average product spending of Region/Channel :

CHANNEL :

Table 4 - Average product spending of Channel

As we can see:
In Channel "Hotel", the highest average spending is on Fresh items and the lowest is on
Detergents_Paper.

In Channel "Retail", the highest average spending is on Grocery items and the lowest is on Frozen
items.

Region :

Table 5 - Average product spending of Region

As we can see:
In Region "Lisbon", the highest average spending is on Fresh items and the lowest is on Delicatessen items.

In Region "Oporto", the highest average spending is on Fresh items and the lowest is on Delicatessen items.

In Region "Other", the highest average spending is on Fresh items and the lowest is on Delicatessen items.

1.2 )
There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a
detailed justification for your answer.

Answer)

I) Item – fresh:

Fig – 3 Item Fresh vs channel bar plot

• The plot shows that Fresh items sell more in the hotel channel than in the retail channel.
• Within the hotel channel, Fresh items sell more in the Other region than in the Lisbon and Oporto regions.
• In the retail channel Lisbon has the lowest Fresh sales, whereas in the hotel channel Lisbon has Fresh sales of around 13,000.

II) Item – Milk

Fig – 4 Item Milk vs channel bar plot

• The plot shows that Milk has more sales in the retail channel than in the hotel channel.
• The Other region has higher Milk sales than Lisbon and Oporto.
• However, the difference between Other and Lisbon in Milk sales is very small.

III) Item – Grocery


Fig – 5 Item Grocery vs channel bar plot

• The plot shows that Grocery has better sales in the retail channel than in the hotel channel.
• In the retail channel, the Lisbon region has the highest Grocery sales (17,500+) of the three regions.
• Other and Oporto have similar Grocery sales in the retail channel.
• In the hotel channel, Other, Lisbon and Oporto have similar Grocery sales, with Oporto slightly ahead of the other two regions.

IV) Item – Frozen

Fig – 6 Item Frozen vs channel bar plot

• The plot shows that the hotel channel has more sales of Frozen items than the retail channel.
• In the hotel channel, Oporto has the highest sales of Frozen items and Lisbon has the lowest.
• In the retail channel, Other and Oporto have similar Frozen sales, and Lisbon has more sales than either of them.

V) Item – Detergents_Paper

Fig – 7 Item Detergents vs channel bar plot

• The plot shows that Detergents_Paper has more sales in the retail channel than in the hotel channel.
• In the retail channel, the Oporto region has the highest Detergents_Paper sales and the Other region has the lowest. The difference between the Lisbon and Oporto regions is very small.
• In the hotel channel, Lisbon has the highest Detergents_Paper sales and Oporto has the lowest. The difference between Lisbon and the Other region is very small.

VI) Item - Delicatessen

Fig – 8 Item Delicatessen vs channel bar plot

• The plot shows that Delicatessen has more sales in the retail channel than in the hotel channel.
• In the retail channel, Lisbon has the highest Delicatessen sales and Oporto has the lowest. The difference between the Other and Lisbon regions is small.
• In the hotel channel, the Other region has the highest Delicatessen sales and the Oporto region has the lowest. The difference between the Lisbon and Oporto regions is small.

1.3)
On the basis of a descriptive measure of variability, which item
shows the most inconsistent behaviour? Which items show the
least inconsistent behaviour?

ANSWER)
Below we have calculated the standard deviation of each item as a measure of
variability.

From the above table we can see that:
• Fresh has the highest standard deviation (12647.33), so it is the most inconsistent by this measure.
• Delicatessen has the smallest standard deviation (2820.11), so it is the most consistent by this measure.

However, on the basis of the coefficient of variation (standard deviation divided by mean) we see that:
• "Fresh" has the lowest coefficient of variation, so it is the most consistent.
• "Delicatessen" has the highest coefficient of variation, so it is the most inconsistent.

Overall, considering all 6 varieties of items, the varieties do not show similar behaviour across regions and
channels.
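
The coefficient of variation ranking can be checked from the means and standard deviations reported in 1.1. The sketch below is one way to do it in plain Python; the dictionary name `item_stats` is an illustrative choice, and the numbers are the ones quoted earlier in this report.

```python
# (mean, standard deviation) per item, taken from the descriptive statistics in 1.1.
item_stats = {
    "Fresh": (12000.3, 12647.3),
    "Milk": (5796.27, 7380.38),
    "Grocery": (7951.28, 9503.16),
    "Frozen": (3071.93, 4854.67),
    "Detergents_Paper": (2881.49, 4767.85),
    "Delicatessen": (1524.87, 2820.11),
}

# Coefficient of variation = standard deviation / mean (unitless).
cv = {item: std / mean for item, (mean, std) in item_stats.items()}

most_consistent = min(cv, key=cv.get)
least_consistent = max(cv, key=cv.get)

for item, value in sorted(cv.items(), key=lambda kv: kv[1]):
    print(f"{item}: CV = {value:.2f}")
```

Because the coefficient of variation scales the spread by the mean, an item with a large standard deviation and a large mean (Fresh) can still rank as the most consistent.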

1.4 )
Are there any outliers in the data? Back up your answer with a
suitable plot/technique with the help of detailed comments.

Answer)

To determine whether outliers are present in the data, we draw a box plot of
each variable, as shown below:

Fig – 9 Outliers in data Box plot

The box plots show that there are outliers in the data. Outliers are present in all six
variables:
Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen.
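
Box plots conventionally mark points lying more than 1.5 IQR beyond the quartiles. As a sketch of that rule applied to raw values, the function below flags such points using only the standard library. The list `toy_spend` is made up for illustration and `iqr_outliers` is a hypothetical helper, not code from the original notebook.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Toy spending values for illustration only (not from the dataset).
toy_spend = [120, 150, 180, 200, 210, 250, 300, 5000]
flagged = iqr_outliers(toy_spend)
print(flagged)
```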

1.5)
On the basis of your analysis, what are your recommendations
for the business? How can your analysis help the business to
solve its problem? Answer from the business perspective

Answer)

On the basis of the analysis the following recommendations can be made:

• The analysis shows that the region Other and the channel Retail have higher spending than the other regions and channel. Hence, from the business perspective, if a new business is to be opened it should be opened in the Other region with the Retail channel: the Other region absorbs the largest share of sales, so this can boost revenue compared with opening a new business in Lisbon or Oporto with the Hotel channel.

• In all the regions, Fresh items have the highest spending, followed by Grocery and Milk. Hence these food products are strongly recommended to be available at all the businesses, with priority given to the availability of Fresh food products.

• Also, Delicatessen shows the least inconsistent behaviour by standard deviation across all regions and channels, so Delicatessen is also recommended to be available at all times in all the businesses.

PROBLEM 2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates (stored in
the Survey data set).

Summary
This business report provides a detailed explanation of the approach to each problem given in
the assignment and provides relevant information for solving the problem.

Sample of the dataset:

Table 6 – Sample of data

• The dataset contains information about the students in 14 different variables.

Exploratory Data Analysis:

- Let’s check the types of variables in the data frame.

There are 62 rows and 14 columns in the dataset. Of the 14 columns, 6 are of object type, 6 are of integer
type and the remaining 2 are of float type.

- Check for missing values in the dataset:

From the above results, we can see that there are no missing values in the dataset.

2.1)
For this data, construct the following contingency tables (Keep
Gender as row variable)
2.1.1) Gender and Major
Answer)

Table 7 – Contingency of gender and major

From the above contingency table, it is clear that the sample contains 29 male and
33 female students across the different majors.

2.1.2) Gender and Grad Intention


Answer)

Table 8 – Contingency of Gender and Grad Intention

From the above contingency table, it is clear that the majority of males have selected
"Yes" for grad intention, while among females the most common answer is undecided.
2.1.3) Gender and Employment
Answer)

Table 9 – Contingency of Gender and Employment


From the above contingency table, it is clear that most males have part-time
employment, and most females also have part-time employment.

2.1.4) Gender and Computer


Answer)

Table 10 – Contingency of Gender and Computer

From the above contingency table, it is clear that most males and females
have laptops and very few of them have desktops.
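
A contingency table simply counts how often each (row, column) pair occurs. The sketch below shows the idea in plain Python with Gender as the row variable. The six survey rows are toy values for illustration only, and `contingency` is a hypothetical helper.

```python
from collections import Counter

# Toy survey rows for illustration only: (gender, computer).
rows = [
    ("Female", "Laptop"), ("Male", "Laptop"), ("Male", "Desktop"),
    ("Female", "Laptop"), ("Female", "Tablet"), ("Male", "Laptop"),
]

def contingency(rows):
    """Count (row_label, column_label) pairs into a nested dict table."""
    counts = Counter(rows)
    row_labels = sorted({r for r, _ in rows})
    col_labels = sorted({c for _, c in rows})
    # Counter returns 0 for pairs that never occur, so empty cells are filled.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

table = contingency(rows)
for gender, cells in table.items():
    print(gender, cells, "Total:", sum(cells.values()))
```

With pandas, `pandas.crosstab` builds the same kind of table from two columns, and its `margins=True` option adds the row and column totals.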

2.2)

Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:

2.2.1) What is the probability that a randomly selected CMSU student will be
male?

Answer)

Below we see the total number of male and female students in the dataset:

From the above table, there are 33 female students and 29 male students in
the CMSU sample.

Total No of Students = 62
Total No of Male = 29
Probability a randomly selected student will be male = Total No of Male / Total No of Students = 29 / 62

Calculating this in Python, the probability that a randomly selected CMSU student is male
is 46.77%.

2.2.2) What is the probability that a randomly selected CMSU student will be
female?

Answer)
From the previous question 2.2.1, we have the total number of male and female students in CMSU.
Calculating in Python, the probability that a randomly selected CMSU student is female
is 33 / 62 = 53.23%.
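
The two probabilities follow directly from the counts reported above (29 male and 33 female students out of 62). A minimal sketch of the calculation, using exact fractions so the two results can be checked to sum to 1:

```python
from fractions import Fraction

n_male, n_female = 29, 33          # counts reported above
n_total = n_male + n_female        # 62 students in the sample

p_male = Fraction(n_male, n_total)
p_female = Fraction(n_female, n_total)

print(f"P(male)   = {float(p_male):.2%}")
print(f"P(female) = {float(p_female):.2%}")
```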

2.3)
Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:
2.3.1) Find the conditional probability of different majors among the male
students in CMSU.

Answer)

The contingency table between Gender and Major gives the required information. The
total number of male students is 29.

Table 11 –Contingency of Gender and major

From the contingency table of Gender and Major we get the total number of males and the number
of males opting for each major.
Below is the output of Python:

From the above output, most male students prefer Management as a major,
with a probability of 44.23%, and CIS is the least preferred, with a probability of 7.37%.

2.3.2)
Find the conditional probability of different majors among the female
students of CMSU.

Answer)

The contingency table between Gender and Major gives the required information. The
total number of female students is 33.

Table 12 –Contingency of Gender and major

From the contingency table of Gender and Major we get the total number of females and the
number of females opting for each major.

Below is the output of Python:

From the above output, most female students prefer
Retailing/Marketing as a major, with a probability of 51.23%.

Moreover, female students choose Accounting, CIS and Other as majors with a probability of
17.07%.

2.4)
Assume that the sample is a representative of the population of
CMSU. Based on the data, answer the following question:

2.4.1) Find the probability That a randomly chosen student is a male and
intends to graduate.

Answer)
From the contingency table between Gender and Grad Intention below, we get the
number of male students who intend to graduate.

Table 13 –Contingency of Gender and grad Intention

As we can see from the above table, the number of male students who intend to graduate is 17.

Calculating in Python, the probability that a randomly chosen student is a male and
intends to graduate is 17 / 62 = 27.42%. (Among male students only, the proportion who intend to graduate is 17 / 29 = 58.62%.)

2.4.2)
Find the probability that a randomly selected student is a female and does
NOT have a laptop.

Answer)
From the contingency table between Gender and Computer below, we get the
number of female students who do not have a laptop.

Table 14 –Contingency of Gender and computer

As we can see from the above table, the number of female students who have a laptop is 29
and the number of female students who do not have a laptop is 4.

Calculating in Python, the probability that a randomly selected student is a female and does NOT have a laptop is
4 / 62 = 6.45%. (Among female students only, the proportion without a laptop is 4 / 33 = 12.12%.)
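
The questions in 2.4 ask for joint probabilities ("is a male and ...", "is a female and ..."), so the denominator is the whole sample of 62 rather than the size of one gender group. The sketch below contrasts the two denominators using the counts quoted in this section; the variable names are illustrative.

```python
from fractions import Fraction

n_total = 62                     # students in the sample
n_male, n_female = 29, 33        # row totals
male_grad_yes = 17               # males with Grad Intention = Yes
female_no_laptop = 4             # females without a laptop

# Joint probability: divide by the whole sample.
p_male_and_grad = Fraction(male_grad_yes, n_total)
p_female_and_no_laptop = Fraction(female_no_laptop, n_total)

# Conditional probability: divide by the size of the conditioning group.
p_grad_given_male = Fraction(male_grad_yes, n_male)
p_no_laptop_given_female = Fraction(female_no_laptop, n_female)

print(f"P(male and grad intention)      = {float(p_male_and_grad):.2%}")
print(f"P(grad intention | male)        = {float(p_grad_given_male):.2%}")
print(f"P(female and no laptop)         = {float(p_female_and_no_laptop):.2%}")
print(f"P(no laptop | female)           = {float(p_no_laptop_given_female):.2%}")
```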

2.5)
Assume that the sample is representative of the
population of CMSU. Based on the data, answer the
following question:

2.5.1) Find the probability that a randomly chosen student is a male or has
full-time employment?

Answer)
From the contingency table between Gender and Employment below, we get the
number of male students and the number of students in full-time employment.

Table 15 –Contingency of Gender and Employment

As we can see from the above table, the number of male students is 29, the number of students in full-time
employment is 10, and the number of male students in full-time employment is 7.

Calculating in Python, the probability that a randomly chosen student is a male or has full-time employment is
(29 + 10 - 7) / 62 = 51.61%.
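
This result follows from the addition rule P(A or B) = P(A) + P(B) - P(A and B), where the 7 male full-time students are subtracted once so they are not counted twice. A short sketch with the counts stated above:

```python
from fractions import Fraction

n_total = 62            # students in the sample
n_male = 29             # male students
n_full_time = 10        # students in full-time employment
n_male_full_time = 7    # male students in full-time employment

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_male_or_full_time = (
    Fraction(n_male, n_total)
    + Fraction(n_full_time, n_total)
    - Fraction(n_male_full_time, n_total)
)
print(f"P(male or full-time) = {float(p_male_or_full_time):.2%}")
```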

2.5.2) Find the conditional probability that given a female student is
randomly chosen, she is majoring in international business or management.

Answer)

From the contingency table between Gender and Major below, we get the
number of female students majoring in international business or management.

Table 16 –Contingency of Gender and major

As we can see from the above table, the number of female students is 33, of whom
4 are majoring in International Business and 4 in Management.

With these counts, the conditional probability that a randomly chosen female student is majoring in
international business or management is (4 + 4) / 33 = 24.24%.

2.6)
Construct a contingency table of Gender and Intent to Graduate
at 2 levels (Yes/No). The Undecided students are not considered
now and the table is a 2x2 table. Do you think graduate intention
and being female are independent events?

Answer)

Below is the 2x2 contingency table of gender and intent to graduate without considering the
Undecided column.

Table 17 –Contingency of Gender and intent graduate

Two events A and B are independent when they satisfy the condition:
P(A ∩ B) = P(A) * P(B)
So being female and having graduate intention are independent only if the following
condition holds:

P(F ∩ Yes) = P(F) * P(Yes)

where F = Female
Yes = Grad Intention being Yes

From the calculations done in Python, we find that:

P(F ∩ Yes) ≠ P(F) * P(Yes)
Hence, graduate intention and being female are not independent events.
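
The check above can be sketched in a few lines. The 2x2 counts below are hypothetical and chosen only to illustrate the mechanics; they are not the counts from Table 17, and `is_independent` is an illustrative helper name.

```python
from fractions import Fraction

# Hypothetical 2x2 counts for illustration only (not the survey's actual table).
counts = {
    ("Female", "Yes"): 30, ("Female", "No"): 20,
    ("Male", "Yes"): 40, ("Male", "No"): 10,
}

def is_independent(counts, row, col):
    """Compare P(row and col) with P(row) * P(col) using exact fractions."""
    total = sum(counts.values())
    p_row = Fraction(sum(v for (r, _), v in counts.items() if r == row), total)
    p_col = Fraction(sum(v for (_, c), v in counts.items() if c == col), total)
    p_joint = Fraction(counts[(row, col)], total)
    return p_joint == p_row * p_col, p_joint, p_row * p_col

independent, p_joint, p_product = is_independent(counts, "Female", "Yes")
print(independent, float(p_joint), float(p_product))
```

Exact equality is a strict criterion: sample counts seldom satisfy it exactly, so the size of the gap between the two sides is worth reporting alongside the yes/no answer.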

2.7)
Note that there are four numerical (continuous) variables
in the data set, GPA, Salary, Spending and Text Messages.
Answer the following questions based on the data

2.7.1) If a student is chosen randomly, what is the probability that
his/her GPA is less than 3?

Answer)

Using the contingency table of Gender and GPA below, we get the total number of students and
the number of students with a GPA of less than 3:

Table 18 –Contingency of Gender and GPA

As we can see from the above table, there is a total of 17 students with a GPA of less than 3.

Hence, after calculation in Python, we find that the probability that a randomly chosen
student has a GPA of less than 3 is 17 / 62 = 27.42%.

2.7.2)
Find the conditional probability that a randomly selected male earns 50
or more. Find the conditional probability that a randomly selected
female earns 50 or more.

Answer)

Below is the contingency table of Gender and Salary, showing the males and females who earn 50 or more:

Table 19 –Contingency of Gender and salary

From the above table, the number of males with a salary of 50 or more is 14 and
the number of females with a salary of 50 or more is 18.

Hence, after calculation in Python, we find that:
The conditional probability that a randomly selected male earns 50 or more is 14 / 29 = 48.28%.

The conditional probability that a randomly selected female earns 50 or more is 18 / 33 = 54.55%.

2.8)
Note that there are four numerical (continuous) variables in the
data set, GPA, Salary, Spending, and Text Messages. For each
of them comment whether they follow a normal
distribution. Write a note summarizing your conclusions.
Answer)
Below is the output from Python: histograms of the variables GPA,
Salary, Spending and Text Messages, used to judge whether each follows a normal distribution:

Fig – 10 histogram of GPA and count

Fig – 11 histogram of salary and count

Fig – 12 histogram of salary and spending

Fig – 13 histogram of salary and text message

From the above histograms for the continuous variables GPA, Salary, Spending and Text
Messages we can see that:
• GPA is almost normally distributed, with a slight skew towards the left.
• Salary is also almost normally distributed, with a slight skew towards the right.
• Spending is not normally distributed and is highly right-skewed.
• Text Messages is not normally distributed and is highly right-skewed.

The following Python output gives the skewness value of each variable:

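As a hedged sketch of what a skewness value measures, the function below computes the moment-based skewness m3 / m2^1.5 in plain Python. The two lists are toy values, not survey data, and `skewness` is an illustrative name. Library routines may apply a small-sample correction (pandas `Series.skew()` does), so their output can differ slightly from this simple version.

```python
def skewness(values):
    """Moment-based (population) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Toy values for illustration only (not from the survey).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(f"symmetric:    {skewness(symmetric):.2f}")
print(f"right-skewed: {skewness(right_skewed):.2f}")
```

A value near zero indicates a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.
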
PROBLEM 3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that they
have purchased a product lacking in quality if they find moisture and wet shingles inside the
packaging.   In some cases, excessive moisture can cause the granules attached to the shingles
for texture and colouring purposes to fall off the shingles resulting in appearance problems. To
monitor the amount of moisture present, the company conducts moisture tests. A shingle is
weighed and then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet is calculated. The
company would like to show that the mean moisture content is less than 0.35 pounds per 100
square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A
shingles and 31 for B shingles.

Summary
This business report provides a detailed explanation of the approach to each problem given in the
assignment, along with relevant information on solving it.

Sample of the dataset

Table 20 – Sample of data

• The dataset covers A and B shingles and has two columns, A and B, which contain the
moisture measurements.

Exploratory Data Analysis:


- Let’s check the information of the data frame:

• There are 36 rows and 2 columns in the dataset. Columns A and B are of float data type.
• The memory usage of the dataset is 704.0 bytes.

- Check for missing values in the dataset:

• As the above table shows, the dataset contains missing values in column B (31 measurements against 36 rows).
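
The two checks above can be sketched with pandas as follows. The data frame is a hypothetical stand-in for the CSV file: it has the same shape (36 rows, column B missing in 5 of them), but the values are randomly generated, and the mean and spread passed to the generator are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for "A & B shingles.csv": 36 rows, B missing in the last 5
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.normal(0.32, 0.13, size=36),
    "B": np.append(rng.normal(0.27, 0.13, size=31), [np.nan] * 5),
})

df.info()                    # row count, dtypes, non-null counts, memory usage
missing = df.isnull().sum()  # number of missing values per column
print(missing)
```

With the real file, the data frame would instead come from `pd.read_csv`, and the same two calls would apply unchanged.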

- Using the methods of descriptive statistics to summarize the dataset.

Table 21 – Descriptive statistics of the data

3.1 )
Do you think there is evidence that the mean moisture content in
both types of shingles is within the permissible limits? State
your conclusions clearly showing all steps.

Answer)

• For the A shingles, the null and alternative hypotheses to test whether the population
mean moisture content is less than 0.35 pounds per 100 square feet are given below.
The claim the company wants to show (mean less than 0.35) is placed in the alternative hypothesis:

So,
H0 : mean moisture content >= 0.35
HA : mean moisture content < 0.35
Level of significance: 0.05

• We have only sample data, and the population standard deviation is unknown.

• The sample is not large, so we use the t distribution and the tSTAT test statistic.
• Since we are testing a single sample (A), we use a one-sample t-test. By default,
ttest_1samp in Python returns a two-sided p-value, so it is divided by 2 to obtain the one-sided p-value.

Hence,
From the calculations done in Python, we conclude that:
One-sample t-test p-value = 0.07477633

Since the p-value is greater than the level of significance, we fail to reject the null hypothesis:
there is not enough evidence that the mean moisture content of the A shingles is less than 0.35.
• For the B shingles, the null and alternative hypotheses to test whether the population
mean moisture content is less than 0.35 pounds per 100 square feet are given below:

So,
H0 : mean moisture content >= 0.35
HA : mean moisture content < 0.35
Level of significance: 0.05

• We have only sample data, and the population standard deviation is unknown.

• The sample is not large, so we use the t distribution and the tSTAT test statistic.
• Since we are testing a single sample (B), we use a one-sample t-test. By default,
ttest_1samp in Python returns a two-sided p-value, so it is divided by 2 to obtain the one-sided p-value.

Hence,
From the calculations done in Python, we conclude that:
One-sample t-test p-value = 0.0020904774003191826

Since the p-value is less than the level of significance, we reject the null hypothesis:
there is evidence that the mean moisture content of the B shingles is less than 0.35.
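
The one-tailed procedure used for both shingle types can be sketched as below, assuming the alternative hypothesis is that the mean is below 0.35. The helper name `one_sided_p_less` and the demo values are hypothetical and are not the shingle measurements, so the printed numbers will not match the p-values reported above.

```python
import numpy as np
from scipy import stats

def one_sided_p_less(sample, mu0):
    """p-value for H_A: mean < mu0, derived from the two-sided ttest_1samp result."""
    sample = np.asarray(sample, dtype=float)
    sample = sample[~np.isnan(sample)]  # drop missing values (column B has some)
    t_stat, p_two_sided = stats.ttest_1samp(sample, mu0)
    # Halving is only valid when the t statistic points in the direction of H_A
    p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
    return float(t_stat), float(p_one_sided)

# Hypothetical measurements for illustration only
demo = [0.31, 0.29, 0.36, 0.27, 0.33, 0.30, 0.28, 0.34, float("nan")]
t_stat, p_value = one_sided_p_less(demo, 0.35)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f} -> {decision}")
```

The sign check matters: if the sample mean were above 0.35, simply halving the two-sided p-value would understate the one-sided p-value. Recent SciPy versions also accept `alternative="less"` in `ttest_1samp`, which removes the need for the manual halving.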

3.2 )
Do you think that the population means for shingles A and B are
equal? Form the hypothesis and conduct the test of the
hypothesis. What assumption do you need to check before the
test for equality of means is performed?
Answer)

Theoretical Assumptions for the Hypothesis Testing:

To test the equality of the population mean moisture content of the A shingles and B shingles,
the null and alternative hypotheses are given below:

• H0 : mean moisture content of A = mean moisture content of B
• HA : mean moisture content of A ≠ mean moisture content of B
• Level of significance: 0.05

• We have two samples, A and B, and the population standard deviations are unknown.
• The samples are not large, so we use the t distribution and the tSTAT test statistic.
• Since we are testing for equality between samples A and B, we use a two-sample t-test.

Hence,
From the calculations done in Python, we conclude that:
Two-sample t-test p-value = 0.2017496571835306

Since the p-value is greater than the level of significance, we do not have enough evidence
to reject the null hypothesis in favour of the alternative hypothesis.

Therefore, at the 5% level of significance, there is no significant evidence that the population means for shingles A and B differ.
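
A minimal sketch of the two-sample test is shown below, together with one assumption check. The report does not state which assumption was checked; equality of variances via Levene's test is used here as an assumption of this sketch, and it decides between the pooled t-test and Welch's t-test. The arrays are hypothetical values, not the shingle measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements standing in for columns A and B (B has missing values)
a = np.array([0.31, 0.29, 0.36, 0.27, 0.33, 0.30, 0.28, 0.34])
b = np.array([0.30, 0.32, 0.35, 0.26, 0.31, 0.29, np.nan, np.nan])
b = b[~np.isnan(b)]  # drop missing values before testing

# Assumption check: Levene's test, H0 = the two population variances are equal
levene_stat, levene_p = stats.levene(a, b)
equal_var = bool(levene_p > 0.05)

# Two-sided two-sample t-test; Welch's version is used if variances look unequal
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"Levene p = {levene_p:.3f}, t = {t_stat:.3f}, p = {p_value:.3f} -> {decision}")
```

Failing to reject H0 means only that the data do not show a difference at the chosen level of significance; it does not prove that the two means are equal.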

THANK YOU
