SMDM - Assignment 1
The following analysis uses data on 1047 homes from the same geographical area that were sold in the last 12 months. Data
were collected on the following variables: price, living area, number of bathrooms, number of bedrooms, lot size, age of the
house, and presence of a fireplace.
1. Summary statistics of the home values or prices:
The variance is the square of the standard deviation: (standard deviation)^2 = (67651.559)^2 = 4576733435.13, which can also
be written as 4.5767e+9. This high variance indicates wide dispersion of prices around the mean price, that is, high variability
in the prices.
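The arithmetic above can be checked with a short script. This is only a minimal sketch: the variable names are illustrative, and the input is the rounded standard deviation quoted above rather than the raw price data.

```python
# Standard deviation of price, as reported in the summary statistics above.
std_dev = 67651.559

# The variance is the square of the standard deviation.
variance = std_dev ** 2

# Print in plain and scientific notation for comparison with the text.
print(f"variance = {variance:.2f}")
print(f"variance = {variance:.4e}")
```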
With more homes priced closer to the minimum value (in the lower price range), the outliers can be expected to lie in the
higher price range (as also shown in the box plot). The interquartile range (IQR) is the difference between the 75th percentile
(205397) and the 25th percentile (111875): 205397 - 111875 = 93522. The upper outlier threshold lies 1.5 × IQR = 140283
above the 75th percentile. Hence, prices above 205397 + 140283 = 345680 units are treated as outliers.
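The outlier threshold above follows the usual 1.5 × IQR rule, which can be sketched as below. The quartile values are the ones reported in the text; the function name is_outlier and the lower threshold are illustrative additions, not part of the original analysis.

```python
# Quartiles of price, as reported above.
q1 = 111875  # 25th percentile
q3 = 205397  # 75th percentile

# Interquartile range and the 1.5 x IQR thresholds.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr  # illustrative; not discussed in the text


def is_outlier(price):
    """Return True if a price falls outside the 1.5 x IQR thresholds."""
    return price < lower_fence or price > upper_fence


print(iqr, upper_fence, lower_fence)
```

Under this rule the lower threshold works out to a negative number, so no price can fall below it. That is consistent with the outliers appearing only at the high end of the price range.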
Price depends on the following independent variables:
• Living Area:
• Number of Bathrooms:
• Number of Bedrooms:
• Lot Size:
• Age of the House:
2. The normal quantile plot of price values shows a slight downward bend from 0 to 80000 units and a slight upward bend from
80000 to 160000 units outside the confidence limits (dotted lines). While the outliers or tail end values (the lowest and highest
price values) are within the confidence limits, they show deviation from the diagonal line (solid line).
The bulk of the price values lies towards the left, which indicates a distribution skewed to the right (a longer right-hand tail).
Hence, positive skewness is expected, as confirmed by the calculated statistic of 0.87616. Additionally, a slightly higher peak
with thicker tails indicates a positive kurtosis value, as confirmed by the calculated statistic of 0.76.
Thus, the normal quantile plot of prices shows an overall upward bend, indicating some deviation from the normal model. This
can be attributed to the presence of outliers towards the higher end of the price range (resulting in thicker tails), skewness towards
the right, and heteroscedasticity (changes in the variability of price when examined against the other independent variables).
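For reference, skewness and excess kurtosis statistics of the kind quoted above can be computed from the central moments of a sample. The sketch below uses the simple moment-based formulas. Statistical packages often apply small-sample bias corrections, so their output can differ slightly. The sample values are hypothetical and are not taken from the housing data.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    # Second, third and fourth central moments.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurt


# Hypothetical right-skewed sample: most values are low, a few are high.
sample = [90, 100, 105, 110, 120, 125, 130, 150, 200, 320]
skew, kurt = skewness_and_excess_kurtosis(sample)
print(f"skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}")
```

A positive skewness corresponds to a longer right-hand tail, and a positive excess kurtosis corresponds to heavier tails than the normal model.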
According to the given data, P(Price < 255.5K) = 90% and P(Price < 257.6K) = 90.32%.
Hence, the calculated numbers are similar to the ones seen in the data.
(b) Percentage of houses with Price < 232K: [please find the graph below]
z = (x - µ)/S.D. = 1
This means that a price of 232K lies 1 S.D. above the mean. Under the normal model, the percentage of houses with
Price < 232K is 68% (values lying between -1 and +1 S.D. of the mean) + 32%/2 (values lying more than 1 S.D. below the mean) = 84%.
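The 84% figure above comes from the 68% empirical rule. As a rough check, it can be compared with the exact standard normal probability below z = 1. This is a minimal sketch using only the Python standard library; the helper name normal_cdf is illustrative.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Empirical-rule estimate used in the text: 68% within 1 S.D. plus half of the rest.
empirical = 0.68 + 0.32 / 2

# Exact normal probability below z = 1.
exact = normal_cdf(1.0)

print(f"empirical rule: {empirical:.2%}, normal CDF: {exact:.2%}")
```

The two values agree to within a fraction of a percentage point, so the empirical rule is adequate here.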
4. In this dataset, the living areas of the homes range from 672 units (minimum at 0% quantile) to 4534 units (maximum at
100% quantile). The mean living area is 1807.303 units (as opposed to the median living area value of 1672 units at 50%
quantile) with a standard deviation of 641.461 units (indicating higher variability).
The normal quantile plot of living area shows an upward bend from 800 to 1600 units outside the confidence limits (dotted
lines). While the outliers at the higher living area values are within the confidence limits, they show deviation from the diagonal
line (solid line).
The concentration of living area values towards the left, together with an interquartile range of 872, suggests that the
distribution is skewed to the right (longer right-hand tail). Hence, positive skewness is expected, as confirmed by the calculated
statistic of 0.8078. Also, a slightly higher peak with thicker tails indicates a positive kurtosis value, as confirmed by the calculated statistic of 0.4.
Thus, these values suggest that the histogram is NOT symmetrical. Additionally, the normal quantile plot of living areas shows
an overall upward bend, indicating some deviation from the normal model, which could be attributable to the presence of outliers
towards the higher end of the living area range (resulting in thicker tails) and skewness towards the right.
5. On taking the logarithm of the living area values, the range extends from 6.51 units (minimum at 0% quantile) to 8.42 units
(maximum at 100% quantile). The mean of the log living area is 7.44 units (very close to the median of 7.42 units at the 50%
quantile and the mode of 7.3 units) with a standard deviation of 0.35 units (indicating lower variability).
The normal quantile plot of the logarithm of living area shows near-perfect alignment of the data points with the diagonal line (solid line),
with a few outliers at the tail ends, indicating a better fit to the normal distribution. Skewness of 0.005 (close to 0) indicates a
symmetrical distribution. Also, a negative kurtosis value of -0.474 suggests a flatter peak with thinner tail ends.
Taking the logarithm of the values reduces the magnitude of the original variable. As a result, larger values or outliers are
compressed to a smaller value, bringing them closer to the bulk of the values. This results in a more symmetrical distribution.
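The compression described above can be illustrated with the minimum, median, and maximum living areas reported earlier. The sketch assumes the natural logarithm, which is consistent with the reported log values of 6.51 and 8.42. The helper tail_ratio is an illustrative measure, not a statistic from the original analysis.

```python
import math

# Minimum, median and maximum living area, as reported in the text.
minimum, median, maximum = 672, 1672, 4534


def tail_ratio(lo, mid, hi):
    """Length of the upper half of the range relative to the lower half.

    A value near 1 means the median sits close to the middle of the range.
    """
    return (hi - mid) / (mid - lo)


raw_ratio = tail_ratio(minimum, median, maximum)
log_ratio = tail_ratio(math.log(minimum), math.log(median), math.log(maximum))

print(f"raw scale: {raw_ratio:.2f}, log scale: {log_ratio:.2f}")
```

On the raw scale the distance from the median to the maximum is almost three times the distance from the minimum to the median. On the log scale the two distances are nearly equal, which is what a more symmetrical distribution looks like.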