SMDM-Project Report (Madhur Dhananiwala)
REPORT
MADHUR DHANANIWALA
Date: 27/02/2022
Table of Contents
1 – Wholesale Customer Data Analysis
LIST OF FIGURES
1 – Wholesale Customer Data Analysis
Fig – 1 Pairplot for Data interaction
Fig – 2 Pearson Correlation
Fig – 3 Item Fresh vs channel bar plot
Fig – 4 Item Milk vs channel bar plot
Fig – 5 Item Grocery vs channel bar plot
Fig – 6 Item Frozen vs channel bar plot
Fig – 7 Item Detergents vs channel bar plot
Fig – 8 Item Delicatessen vs channel bar plot
Fig – 9 Outliers in data Box plot
2 – Clear Mountain State University Survey
Fig – 10 histogram of GPA and count
Fig – 11 histogram of salary and count
Fig – 12 histogram of salary and spending
Fig – 13 histogram of salary and text message
List of Tables
1 – Wholesale Customer Data Analysis
Table 1: Wholesale Distributor Sample
PROBLEM 1
A wholesale distributor operating in different regions of Portugal has information on the annual spending on
several items in its stores across different regions and channels. The data consists of 440 large retailers’
annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across
2 sales channels (Hotel, Retail).
Problem Summary:
In this problem, we will analyze a dataset containing data on various customers' annual spending amounts of
diverse product categories for internal structure. One goal of this report is to best describe the variation in the
different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor
with insight into how to best structure their delivery service to meet the needs of each customer.
Description of variables is as follows:
FRESH: annual spending on fresh products (Continuous);
Sample of the dataset:
The dataset gives data about sales of 6 categories of products across 3 regions via 2 channels.
Region frequency (total 440 rows): Lisbon 77, Oporto 47, Other 316.
Channel frequency (total 440 rows): Hotel 298, Retail 142.
Exploratory Data Analysis:
- Let’s check the types of variables in the data frame.
The dataset has 440 rows and 9 columns. Of the 9 columns, 2 are of object type and the remaining 7 are of
integer type.
From the above results, we can see that there are no missing values in the dataset.
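The checks described above can be sketched with pandas as follows. This is only an illustration: the three rows are made up, and "Buyer" is an assumed name for the identifier column; only the split into 2 text columns and 7 integer columns mirrors the report.

```python
import pandas as pd

# Made-up three-row sample; only the column layout mirrors the report.
# "Buyer" is an assumed name for the customer identifier column.
sample = pd.DataFrame({
    "Buyer": [1, 2, 3],
    "Channel": ["Retail", "Hotel", "Retail"],
    "Region": ["Other", "Lisbon", "Oporto"],
    "Fresh": [1200, 700, 630],
    "Milk": [960, 980, 880],
    "Grocery": [756, 956, 768],
    "Frozen": [21, 176, 240],
    "Detergents_Paper": [267, 329, 351],
    "Delicatessen": [133, 177, 784],
})

n_rows, n_cols = sample.shape
# Count numeric columns; everything else is treated as text (object) data.
n_numeric = sample.select_dtypes(include="number").shape[1]
n_text = n_cols - n_numeric
# Number of missing values in each column.
missing_per_column = sample.isnull().sum()

print(sample.dtypes)
print(f"{n_rows} rows, {n_cols} columns: {n_numeric} numeric, {n_text} text")
print("Total missing values:", int(missing_per_column.sum()))
```

On the real file, the same calls would be applied to the data frame loaded from the distributor's dataset.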
- Let's use the Seaborn pairplot to have a first look at how our data is interacting.
- Let’s check how the variables relate to each other using the correlation plot.
From the above plot, we can see that there is a strong positive correlation (0.92) between the
"detergents and paper products" and the "grocery products".
1.1)
Use methods of descriptive statistics to summarize data. Which Region and which
Channel spent the most? Which Region and which Channel spent the least?
Answer)
- Using the methods of descriptive statistics to summarize the wholesale customer’s data.
-From the above two describe functions, we can infer the following:
Channel has two unique values, with "Hotel" the most frequent at 298 of the 440 customers, i.e. 67.7
percent of the customers buy through the "Hotel" channel.
Region has three unique values, with "Other" the most frequent at 316 of the 440 customers, i.e. 71.8
percent of the customers are in the "Other" region.
Fresh has a mean of 12000.3 and a standard deviation of 12647.3, with a min value of 3 and a max value of 112151.
Fresh: Q1 (25%) = 3127.75, Q2 (50%) = 8504, Q3 (75%) = 16933.8; range = max - min = 112151 - 3 = 112148; IQR = Q3 - Q1 = 16933.8 - 3127.75 = 13806.05
Milk: Q1 (25%) = 1533, Q2 (50%) = 3627, Q3 (75%) = 7190.25; range = 73498 - 55 = 73443; IQR = 7190.25 - 1533 = 5657.25
Grocery: Q1 (25%) = 2153, Q2 (50%) = 4755.5, Q3 (75%) = 10655.8; range = 92780 - 3 = 92777; IQR = 10655.8 - 2153 = 8502.8
Frozen: Q1 (25%) = 742.25, Q2 (50%) = 1526, Q3 (75%) = 3554.25; range = 60869 - 25 = 60844; IQR = 3554.25 - 742.25 = 2812
Detergents_Paper: Q1 (25%) = 256.75, Q2 (50%) = 816.5, Q3 (75%) = 3922; range = 40827 - 3 = 40824; IQR = 3922 - 256.75 = 3665.25
Delicatessen: Q1 (25%) = 408.25, Q2 (50%) = 965.5, Q3 (75%) = 1820.25; range = 47943 - 3 = 47940; IQR = 1820.25 - 408.25 = 1412
The IQR is helpful in calculating the outlier limits: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.
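As a worked example of the 1.5 × IQR rule, the sketch below applies it to the first set of quartiles listed above (Q1 = 3127.75, Q3 = 16933.8, max = 112151). The helper name `iqr_limits` is ours and not part of the original analysis.

```python
def iqr_limits(q1, q3, k=1.5):
    """Return the (lower, upper) outlier limits for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and maximum as quoted in the report for the first item.
q1, q3, max_value = 3127.75, 16933.8, 112151

lower, upper = iqr_limits(q1, q3)
print(f"lower limit = {lower:.3f}, upper limit = {upper:.3f}")
# The maximum lies above the upper limit, so this item has at least one outlier.
print("max is an outlier:", max_value > upper)
```

Because spending cannot be negative, the negative lower limit means that only the upper limit can flag outliers for this item.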
The highest spend by Region is from Other and the lowest spend by Region is
from Oporto.
The highest spend by Channel is from Hotel and the lowest spend by Channel is
from Retail.
From the above observations, when Region and Channel are combined, the highest spending is
from Other/Hotel and
the lowest spending is from Oporto/Hotel.
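A minimal sketch of how such totals can be obtained with pandas `groupby` is given below. The five rows and the three item columns are made up purely for illustration (the real data has six item columns and 440 rows), so the printed totals are not the report's figures.

```python
import pandas as pd

# Made-up rows for illustration; not the distributor's data.
toy = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail", "Hotel"],
    "Region": ["Other", "Other", "Lisbon", "Oporto", "Oporto"],
    "Fresh": [500, 200, 300, 100, 60],
    "Milk": [100, 300, 50, 80, 40],
    "Grocery": [120, 400, 60, 70, 20],
})
item_cols = ["Fresh", "Milk", "Grocery"]

# Total spend per customer, then summed per group.
toy["Total"] = toy[item_cols].sum(axis=1)
by_region = toy.groupby("Region")["Total"].sum()
by_channel = toy.groupby("Channel")["Total"].sum()
by_both = toy.groupby(["Region", "Channel"])["Total"].sum()

print("Highest / lowest region:", by_region.idxmax(), "/", by_region.idxmin())
print("Highest / lowest channel:", by_channel.idxmax(), "/", by_channel.idxmin())
print("Highest / lowest combination:", by_both.idxmax(), "/", by_both.idxmin())
```

Grouping by Region and Channel together answers a different question from grouping by each one separately, which is why the lowest combination need not contain the lowest channel.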
CHANNEL :
Table 4 - Average product spending of Channel
As we can see:
In Channel "Hotel", average spending is highest on Fresh items and lowest on
Detergents_Paper.
In Channel "Retail", average spending is highest on Grocery items and lowest on Frozen
items.
Region :
As we can see:
In Region "Lisbon", average spending is highest on Fresh and lowest on Delicatessen items.
In Region "Oporto", average spending is highest on Fresh and lowest on Delicatessen items.
In Region "Other", average spending is highest on Fresh and lowest on Delicatessen items.
1.2 )
There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a
detailed justification for your answer.
Answer)
I) Item – fresh:
Fig – 4 Item Milk vs channel bar plot
As we can see from the above plot, milk has more sales in the retail channel
than in the hotel channel.
Moreover, we can also observe that the Other region has higher milk sales than Lisbon and
Oporto.
However, there is only a very small difference between Other and Lisbon in the sales of
milk items as per the dataset.
As we can see from the above plot, grocery has better sales in the retail channel than in the
hotel channel.
Moreover, we can also see that the Lisbon region has the highest (17,500+) sales of
groceries of the three regions in the retail channel,
whereas Other and Oporto have similar grocery sales in the retail channel.
In the hotel channel, Other, Lisbon and Oporto have similar sales of groceries,
with Oporto slightly ahead of the other two regions.
Fig – 6 Item Frozen vs channel bar plot
As we can see from the above plot, the hotel channel has more sales of frozen
items than the retail channel.
Moreover, in the hotel channel, Oporto has the highest sales of frozen items and
Lisbon has the lowest.
However, in the retail channel, Other and Oporto have similar sales of frozen items,
and Lisbon has more sales than either of them.
V) Item – Detergents_Paper
Fig – 7 Item Detergents vs channel bar plot
As we can see from the above plot, detergents_paper has more sales in the retail channel
than in the hotel channel.
Moreover, the Oporto region has the highest sales of detergents_paper in the retail
channel and the Other region has the lowest, although there is only a
very small difference between the Lisbon and Oporto regions.
However, in the hotel channel, Lisbon has the highest sales of detergents_paper and
Oporto has the lowest, while Lisbon and Other differ only
minimally in sales.
Fig – 8 Item Delicatessen vs channel bar plot
As we can see from the above plot, delicatessen has more sales in the retail
channel than in the hotel channel.
Moreover, we can also see that Lisbon has the highest sales of delicatessen in the
retail channel and Oporto has the lowest, while the Other and Lisbon regions
differ only minimally in delicatessen sales.
However, in the hotel channel, the Other region has the highest sales of delicatessen and the
Oporto region has the lowest, while the Lisbon and Oporto regions
differ only minimally in sales.
1.3)
On the basis of a descriptive measure of variability, which item
shows the most inconsistent behaviour? Which items show the
least inconsistent behaviour?
ANSWER)
Below we have calculated the standard deviation of each item as the measure of
variability.
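A minimal sketch of this comparison is shown below. The two columns hold made-up values, so only the method carries over to the real data. The coefficient of variation (standard deviation divided by mean) is added here as an assumption of ours, since it compares variability across items with very different average spending; the last line applies it to the mean of 12000.3 and standard deviation of 12647.3 quoted in section 1.1.

```python
import pandas as pd

# Made-up values for two items; not the distributor's data.
toy = pd.DataFrame({
    "Fresh": [100, 900, 400, 1200],
    "Delicatessen": [50, 60, 55, 65],
})

std_dev = toy.std()              # sample standard deviation (ddof=1)
coef_var = std_dev / toy.mean()  # coefficient of variation per item

most_variable = std_dev.idxmax()
least_variable = std_dev.idxmin()
print("Most inconsistent item:", most_variable)
print("Least inconsistent item:", least_variable)

# Coefficient of variation from the summary figures quoted in section 1.1.
reported_cv = 12647.3 / 12000.3
print(f"Reported item CV = {reported_cv:.3f}")
```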
1.4 )
Are there any outliers in the data? Back up your answer with a
suitable plot/technique with the help of detailed comments.
Answer)
To determine whether there are outliers in the data, a suitable method is to create a box plot of
all the variables, as shown below:
From the box plots of all the variables above, it can be concluded that yes, there are
outliers in the data. Outliers are present in the variables
Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen.
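The box-plot reading can be backed up numerically by counting the points that fall outside the whiskers. The sketch below uses the usual 1.5 × IQR convention on made-up values; `count_iqr_outliers` is a hypothetical helper name of ours.

```python
import pandas as pd

def count_iqr_outliers(frame, k=1.5):
    """Count, per column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare every column against its own limits.
    mask = frame.lt(lower, axis="columns") | frame.gt(upper, axis="columns")
    return mask.sum()

# Made-up values: one extreme point in "Fresh", none in "Milk".
toy = pd.DataFrame({
    "Fresh": [1, 2, 3, 4, 5, 100],
    "Milk": [10, 11, 12, 13, 14, 15],
})

counts = count_iqr_outliers(toy)
print(counts)
```

A non-zero count for a column corresponds to points drawn beyond the whiskers in that column's box plot.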
1.5)
On the basis of your analysis, what are your recommendations
for the business? How can your analysis help the business to
solve its problem? Answer from the business perspective
Answer)
On the basis of the analysis, the Other region has the highest spending of the three regions.
Although the Hotel channel has the highest total spending, the Retail channel has higher sales
of Milk, Grocery, Detergents_Paper and Delicatessen. Hence, from the business
perspective, if a new business is to be opened, it should be opened in the Other region
with the Retail channel, as the Other region absorbs the largest share of sales;
this can boost revenue compared to opening a new business in Lisbon or Oporto
with the Hotel channel.
In all the regions, Fresh has the highest spending, followed by Grocery
and Milk. Hence these products are strongly recommended to be stocked
together at all the businesses, with availability of Fresh products as the
priority.
Also, Delicatessen shows the least inconsistent behaviour across all
regions and channels, so Delicatessen is also recommended to be available at all times
in all the businesses.
PROBLEM 2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates (stored in
the Survey data set).
Summary
This business report provides a detailed explanation of the approach to each problem given in
the assignment and provides relevant information with regard to solving the problem.
The dataset gives information about the students in 14 different categories.
The dataset has 62 rows and 14 columns. Of the 14 columns, 6 are of object type, 6 are of integer
type and the remaining 2 are of float type.
From the above results, we can see that there are no missing values in the dataset.
2.1)
For this data, construct the following contingency tables (Keep
Gender as row variable)
2.1.1) Gender and Major
Answer)
As we can see from the above contingency table, the numbers of males and females
across the different majors are 29 and 33 respectively.
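A contingency table with Gender as the row variable can be built with `pandas.crosstab`. The sketch below uses five made-up students; the column names "Gender" and "Major" are assumed to match the Survey data set.

```python
import pandas as pd

# Five made-up students for illustration; not the CMSU survey data.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Major": ["Management", "CIS", "Management", "Other", "Management"],
})

# Rows are Gender, columns are Major; margins=True adds the "All" totals.
table = pd.crosstab(toy["Gender"], toy["Major"], margins=True)
print(table)
```

The "All" column gives the row totals used later as denominators for the conditional probabilities.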
As we can see from the above contingency table, the majority of males have selected
grad intention as Yes, whereas most females are Undecided.
2.1.3) Gender and Employment
Answer)
As we can see from the above contingency table, the majority of both males and females
have laptops and very few of them have desktops.
2.2)
Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:
2.2.1) What is the probability that a randomly selected CMSU student will be
male?
Answer)
Below we see the total no. of male and female students as per the dataset:
As we can see from the above table, there are 33 female students and 29 male students in
the sample.
Total No of Students = 62
Total No of Male = 29
Probability a randomly selected student will be male = Total No of Male / Total No of Students = 29 / 62
After calculation in Python, the probability that a randomly selected CMSU student
is male is 46.77%.
2.2.2) What is the probability that a randomly selected CMSU student will be
female?
Answer)
From the previous question 2.2.1, we have the total no. of male and female students in CMSU.
After calculation in Python (33 / 62), the probability that a randomly selected CMSU
student is female is 53.23%.
2.3)
Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:
2.3.1) Find the conditional probability of different majors among the male
students in CMSU.
Answer)
The contingency table between gender and major gives the required information.
The total no. of male students is 29.
Using the contingency table of Gender and Major, we obtained the total number of males and the number
of males opting for each major.
Below is the output of Python:
As we can see from the above output, most male students prefer Management as a major,
with a probability of 44.23%, and CIS is the least preferred, with a probability of 7.37%.
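The conditional probability of each major given gender is the count in that cell divided by the row total. One way to sketch this is `pandas.crosstab` with `normalize="index"`, shown here on five made-up students, so the printed values are not the survey's results.

```python
import pandas as pd

# Five made-up students for illustration; not the CMSU survey data.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Major": ["Management", "CIS", "Management", "Other", "Management"],
})

# normalize="index" divides each row by its total: P(Major | Gender).
cond = pd.crosstab(toy["Gender"], toy["Major"], normalize="index")
print((cond * 100).round(2))

p_management_given_male = cond.loc["Male", "Management"]
print(f"P(Management | Male) = {p_management_given_male:.2%}")
```

Each row of the normalised table sums to 1, which is a quick check that the probabilities were conditioned on the correct variable.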
2.3.2)
Find the conditional probability of different majors among the female
students of CMSU.
Answer)
The contingency table between gender and major gives the required information.
The total no. of female students is 33.
Using the contingency table of Gender and Major, we obtained the total number of females and the
number of females opting for each major.
As we can see from the above output, most female students prefer
Retailing/Marketing as a major, with a probability of 51.23%.
Moreover, the female students prefer Accounting, CIS and Other as majors with a probability of
17.07%.
2.4)
Assume that the sample is a representative of the population of
CMSU. Based on the data, answer the following question:
2.4.1) Find the probability That a randomly chosen student is a male and
intends to graduate.
Answer)
The contingency table below between gender and Grad Intention gives the required
information, namely the total number of male students who intend to graduate.
As we can see from the above table, the total no. of male students who intend to graduate is 17.
The probability that a randomly chosen student is a male and intends to graduate is therefore
17 / 62 = 27.42%. The 58.62% obtained from the Python calculation is 17 / 29, i.e. the conditional
probability that a male student intends to graduate.
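The difference between the joint probability asked for here and the conditional probability can be checked with the counts quoted above (17 males intending to graduate, 29 males, 62 students). The variable names are ours.

```python
n_students = 62         # total respondents
n_male = 29             # male respondents
n_male_and_grad = 17    # males who intend to graduate

# Joint probability: male AND intends to graduate, out of all students.
p_male_and_grad = n_male_and_grad / n_students
# Conditional probability: intends to graduate, GIVEN the student is male.
p_grad_given_male = n_male_and_grad / n_male

print(f"P(male and grad intention) = {p_male_and_grad:.2%}")
print(f"P(grad intention | male)   = {p_grad_given_male:.2%}")
```

The two values differ only in the denominator: all 62 students for the joint probability, the 29 males for the conditional one.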
2.4.2)
Find the probability that a randomly selected student is a female and does
NOT have a laptop.
Answer)
The contingency table below between gender and Computer gives the required
information, namely the total number of female students who do not have a laptop.
Table 14 –Contingency of Gender and computer
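As we can see from the above table, the total no. of female students who have a laptop is 29
and the number of female students who do not have a laptop is 4.
Hence, the probability that a randomly selected student is a female and does not have a laptop
is 4 / 62 = 6.45%.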
2.5)
Assume that the sample is representative of the
population of CMSU. Based on the data, answer the
following question:
2.5.1) Find the probability that a randomly chosen student is a male or has
full-time employment?
Answer)
The contingency table below between gender and Employment gives the required
information, namely the number of male students and the number of students in full-time employment.
As we can see from the above table, the number of male students is 29, the number of students in
full-time employment is 10, and the number of male students in full-time employment is 7.
Hence, P(male or full-time) = (29 + 10 - 7) / 62 = 32 / 62 = 51.61%.
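For an "or" question, the addition rule P(A or B) = P(A) + P(B) - P(A and B) avoids counting the male full-time students twice. A short check with the counts quoted above (variable names are ours):

```python
n_students = 62            # total respondents
n_male = 29                # male respondents
n_full_time = 10           # respondents in full-time employment
n_male_and_full_time = 7   # males in full-time employment

# Addition rule: subtract the overlap so it is not counted twice.
n_male_or_full_time = n_male + n_full_time - n_male_and_full_time
p_male_or_full_time = n_male_or_full_time / n_students

print(f"Students who are male or full-time: {n_male_or_full_time}")
print(f"P(male or full-time) = {p_male_or_full_time:.2%}")
```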
Answer)
The contingency table below between gender and Major gives the required
information, namely the number of female students majoring in international business or management.
2.6)
Construct a contingency table of Gender and Intent to Graduate
at 2 levels (Yes/No). The Undecided students are not considered
now and the table is a 2x2 table. Do you think graduate intention
and being female are independent events?
Answer)
Below is the 2x2 contingency table of gender and intent to graduate without considering the
Undecided column.
Two events A and B are independent when they satisfy the condition:
P(A ∩ B) = P(A) * P(B)
In this case, whether being female and graduate intention are independent can be checked using
the condition:
P(Female ∩ Yes) = P(Female) * P(Yes)
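The independence condition can be checked directly from the cell counts of a 2x2 table. In the sketch below, the helper name and both sets of counts are hypothetical and chosen only to show one independent and one dependent case; they are not the survey's counts.

```python
import math

def is_independent(n_a_and_b, n_a, n_b, n_total, rel_tol=1e-9):
    """Check whether P(A and B) equals P(A) * P(B) for the given counts."""
    p_joint = n_a_and_b / n_total
    p_product = (n_a / n_total) * (n_b / n_total)
    return math.isclose(p_joint, p_product, rel_tol=rel_tol), p_joint, p_product

# Hypothetical table A: 40 students, 20 female, 20 "Yes", 10 female and "Yes".
indep_a, joint_a, product_a = is_independent(10, 20, 20, 40)
# Hypothetical table B: same margins, but only 5 female and "Yes".
indep_b, joint_b, product_b = is_independent(5, 20, 20, 40)

print(f"Table A: joint={joint_a:.3f}, product={product_a:.3f}, independent={indep_a}")
print(f"Table B: joint={joint_b:.3f}, product={product_b:.3f}, independent={indep_b}")
```

With sample data the two sides rarely match exactly, so in practice the size of the gap, or a chi-square test of independence, is used to judge the result.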
2.7)
Note that there are four numerical (continuous) variables
in the data set, GPA, Salary, Spending and Text Messages.
Answer the following questions based on the data
Answer)
Below, using the contingency table of Gender and GPA, we obtained the total number of students and the
number of students with a GPA of less than 3:
Table 18 –Contingency of Gender and GPA
As we can see from the above table, there is a total of 17 students with a GPA of less than 3.
Hence,
After calculation in Python, we found that:
The probability that a randomly chosen student has a GPA of less
than 3 is 17 / 62 = 27.42%
2.7.2)
Find the conditional probability that a randomly selected male earns 50
or more. Find the conditional probability that a randomly selected
female earns 50 or more.
Answer)
Below is the contingency table of Gender and Salary for the male and female students who earn 50 or more:
From the above table, we can see that the total no. of males with a salary of 50 or more is 14 and
the total no. of females with a salary of 50 or more is 18.
Hence,
After calculation in Python, we found that:
The conditional probability that a randomly selected male earns 50 or more: 14 / 29 = 48.28%
The conditional probability that a randomly selected female earns 50 or more: 18 / 33 = 54.55%
2.8)
Note that there are four numerical (continuous) variables in the
data set, GPA, Salary, Spending, and Text Messages. For each
of them comment whether they follow a normal
distribution. Write a note summarizing your conclusions.
Answer)
Below is the output from Python, using histograms of the variables GPA,
Salary, Spending and Text Messages to find out whether they follow a normal distribution or not:
Fig – 12 histogram of salary and spending
Fig – 13 histogram of salary and text message
From the above histograms for the continuous variables GPA, Salary, Spending and Text
Messages, we can see that:
GPA is almost normally distributed, with a slight skew to the left.
Salary is also approximately normally distributed, with a slight skew to the right.
Spending is not normally distributed and is highly right-skewed.
Text Messages is not normally distributed and is highly right-skewed.
The following Python output gives the skewness value of each variable:
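Skewness values of this kind can be obtained with the pandas `skew` method. The sketch below uses three made-up series to show how the sign is read: positive for a right skew, negative for a left skew, and close to zero for a symmetric shape.

```python
import pandas as pd

# Made-up series for illustration; not the survey variables.
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])   # long tail to the right
left_skewed = pd.Series([10, 10, 9, 9, 8, 1])   # long tail to the left
symmetric = pd.Series([1, 2, 3, 4, 5])

skew_right = right_skewed.skew()
skew_left = left_skewed.skew()
skew_sym = symmetric.skew()

print(f"right-skewed: {skew_right:.3f}")
print(f"left-skewed:  {skew_left:.3f}")
print(f"symmetric:    {skew_sym:.3f}")
```

A skewness near zero is consistent with a normal distribution but does not prove it, so the histograms remain the main evidence.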
PROBLEM 3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that they
have purchased a product lacking in quality if they find moisture and wet shingles inside the
packaging. In some cases, excessive moisture can cause the granules attached to the shingles
for texture and colouring purposes to fall off the shingles resulting in appearance problems. To
monitor the amount of moisture present, the company conducts moisture tests. A shingle is
weighed and then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet is calculated. The
company would like to show that the mean moisture content is less than 0.35 pounds per 100
square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A
shingles and 31 for B shingles.
Summary
This business report provides a detailed explanation of the approach to each problem given in the
assignment and provides relevant information with regard to solving the problem.
The dataset gives data about A & B shingles and contains 2 columns, A and B,
which hold the moisture measurements.
The dataset has 36 rows and 2 columns. Columns A and B contain float
data types.
The memory usage of the dataset is 704.0 bytes.
As we can see from the above table, the dataset contains missing values in column B.
3.1 )
Do you think there is evidence that the mean moisture contents in
both types of shingles are within the permissible limits? State
your conclusions clearly showing all steps.
Answer)
For the A shingles, the null and alternative hypotheses to test whether the population
mean moisture content is less than 0.35 pounds per 100 square feet are:
H0 : mean moisture content >= 0.35
HA : mean moisture content < 0.35
Level of significance: 0.05
From the calculations done in Python, we conclude that:
One-sample t-test p-value = 0.07477633
We do not have enough evidence to reject the null hypothesis, since p-value > level of significance.
There is not enough evidence that the mean moisture content of A shingles is less than 0.35
pounds per 100 square feet.
For the B shingles, the null and alternative hypotheses to test whether the population
mean moisture content is less than 0.35 pounds per 100 square feet are:
H0 : mean moisture content >= 0.35
HA : mean moisture content < 0.35
Level of significance: 0.05
We have evidence to reject the null hypothesis, since p-value < level of significance.
There is evidence that the mean moisture content of B shingles is less than 0.35 pounds per
100 square feet.
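A one-sided, one-sample t-test of whether the mean lies below 0.35, which is what the company wants to show, can be sketched with SciPy as follows. The eight measurements are made up and the helper name is ours. The one-sided p-value is derived from the two-sided result of `scipy.stats.ttest_1samp`, so the sketch does not rely on the newer `alternative` argument.

```python
from scipy import stats

def p_value_mean_less_than(sample, popmean):
    """One-sided p-value for HA: population mean < popmean."""
    t_stat, p_two_sided = stats.ttest_1samp(sample, popmean)
    # Halve the two-sided p-value when the sample mean is below popmean.
    p_less = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
    return float(t_stat), float(p_less)

# Made-up moisture measurements (pounds per 100 square feet).
toy_sample = [0.30, 0.28, 0.31, 0.29, 0.27, 0.32, 0.30, 0.29]

t_stat, p_less = p_value_mean_less_than(toy_sample, popmean=0.35)
alpha = 0.05
print(f"t = {t_stat:.3f}, one-sided p-value = {p_less:.6f}")
print("Reject H0" if p_less < alpha else "Fail to reject H0")
```

On the real data, the columns A and B would each be passed in turn, after dropping the missing values in column B.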
3.2 )
Do you think that the population mean for shingles A and B are
equal? Form the hypothesis and conduct the test of the
hypothesis. What assumption do you need to check before the
test for equality of means is performed?
Answer)
We have two samples, A and B, and we do not know the population standard deviations.
Neither sample is large, so we use the t distribution and the tSTAT test
statistic.
Since we are testing for equality of means between samples A and B, we use a two-sample t-test:
H0 : mean of A = mean of B
HA : mean of A ≠ mean of B
Before the test is performed, we need to check that both samples are approximately normally
distributed and, for the pooled-variance form of the test, that the two populations have similar variances.
From the calculations done in Python, we conclude that:
Two-sample t-test p-value = 0.2017496571835306
We do not have enough evidence to reject the null hypothesis in favour of the alternative
hypothesis, since p-value > level of significance.
Therefore, we cannot conclude that the population means for shingles A and B are different.
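A two-sample t-test of this kind can be sketched with `scipy.stats.ttest_ind`. The measurements below are made up, with one missing value in B to mirror the shorter B column. Using Levene's test as the equal-variance check and the pooled-variance form (`equal_var=True`) are assumptions of this sketch, not necessarily the settings behind the p-value reported above.

```python
import numpy as np
from scipy import stats

# Made-up moisture measurements; B has a missing value, as in the real file.
a = np.array([0.30, 0.32, 0.31, 0.29, 0.33, 0.30])
b = np.array([0.31, 0.30, 0.32, 0.29, 0.31, np.nan])

# Drop missing values before testing.
b_clean = b[~np.isnan(b)]

# Assumption check: Levene's test for equality of variances.
levene_stat, levene_p = stats.levene(a, b_clean)

# Two-sided test of H0: mean(A) = mean(B), pooled-variance form.
t_stat, p_value = stats.ttest_ind(a, b_clean, equal_var=True)

alpha = 0.05
print(f"Levene p-value = {levene_p:.4f}")
print(f"t = {t_stat:.3f}, two-sided p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

If Levene's test suggested unequal variances, `equal_var=False` (Welch's t-test) would be the safer choice.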
THANK YOU