FOUNDATION TO DATA SCIENCE
Business Analytics
Unit 1: BASIC STATISTICS REFRESHER AND HOW TO EXPLORE DATA - 2
Prof. Dr. George Mathew
B.Sc., B.Tech, PGDCA, PGDM, MBA, PhD

Measures of Location
1. Mean (Arithmetic Mean)
2. Median
3. Mode
4. Geometric Mean
5. Percentiles
6. Quartiles

1. Mean (Arithmetic Mean)
The most commonly used measure of location is the mean (arithmetic mean), or average value, for a variable. The mean provides a measure of central location for the data. If the data are for a sample (typically the case), the mean is denoted by x̄ (x-bar). The sample mean is a point estimate of the (typically unknown) population mean for the variable of interest. If the data for the entire population are available, the population mean is computed in the same manner, but denoted by the Greek letter μ.

[Table: Home Sales Data]
Median
The median, another measure of central location, is the value in the middle when the data are arranged in ascending order (smallest to largest value). With an odd number of observations, the median is the middle value. An even number of observations has no single middle value. In this case, we follow convention and define the median as the average of the values for the middle two observations.

Mode
A third measure of location, the mode, is the value that occurs most frequently in a data set. To illustrate the identification of the mode, consider the following sample of eight class sizes.
32 34 42 46 46 54 56 67
Here 46 occurs twice and every other value occurs only once. Hence the mode is 46.

Geometric Mean
The geometric mean is a measure of location that is calculated by finding the nth root of the product of n values. The general formula for the sample geometric mean, denoted x̄_g, follows.
\[ \bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n} \]
The geometric mean is often used in analyzing growth rates in financial data. In these types of situations, the arithmetic mean or average value will provide misleading results.

To illustrate the use of the geometric mean, consider Table 2.10, which shows the percentage annual returns, or growth rates, for a mutual fund over the past ten years. Suppose we want to compute how much $100 invested in the fund at the beginning of year 1 would be worth at the end of year 10. Multiplying the initial investment by the ten annual growth factors gives
$100[(0.779)(1.287)(1.109)(1.049)(1.158)(1.055)(0.630)(1.265)(1.151)(1.021)] = $100(1.3345) = $133.45
The geometric mean is the tenth root of the product of the growth factors: (1.3345)^(1/10) = 1.029.
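As an illustrative sketch (not part of the original slides), the geometric mean calculation can be reproduced in Python. The growth factors are the ten values quoted in the text; the function and variable names are hypothetical and chosen only for this example.

```python
import math

# Annual growth factors quoted in the text (1 + annual return for each year).
growth_factors = [0.779, 1.287, 1.109, 1.049, 1.158,
                  1.055, 0.630, 1.265, 1.151, 1.021]


def geometric_mean(values):
    """Return the nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))


product = math.prod(growth_factors)      # compound growth factor over ten years
g_mean = geometric_mean(growth_factors)  # average annual growth factor
final_value = 100 * product              # value of a $100 investment after ten years

print(f"product of growth factors: {product:.4f}")
print(f"geometric mean:            {g_mean:.3f}")
print(f"value of $100 after 10 yr: ${final_value:.2f}")
```

Python 3.8 and later also provide `statistics.geometric_mean`, which should give the same result for these values.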
The geometric mean tells us that annual returns grew at an average annual rate of (1.029 - 1)100, or 2.9 percent. In other words, with an average annual growth rate of 2.9 percent, a $100 investment in the fund at the beginning of year 1 would grow to $100(1.029)^10 = $133.09 at the end of ten years. The small difference from the exact product, $100(1.3345) = $133.45, comes from rounding the geometric mean to three decimal places.

We can use Excel to calculate the geometric mean for the data in Table 3 by using the function GEOMEAN. In Figure 10, the value for the geometric mean is found using the formula =GEOMEAN(C4:C13).

Percentiles
A percentile is the value of a variable at which a specified (approximate) percentage of observations are below that value. The pth percentile tells us the point in the data where approximately p percent of the observations have values less than the pth percentile; hence, approximately (100 - p) percent of the observations have values greater than the pth percentile.

For the 12 home sales values arranged in ascending order, the 85th percentile is located at position (85/100)(12 + 1) = 11.05, that is, 5 percent of the way from the 11th value (298,000) to the 12th value (456,250): 298,000 + 0.05(456,250 - 298,000) = 305,912.50. Therefore, $305,912.50 represents the 85th percentile of the home sales data.

The pth percentile can also be calculated in Excel using the function PERCENTILE.EXC. Figure 12 shows the Excel calculation for the 85th percentile of the home sales data. The value in cell E13 is calculated using the formula =PERCENTILE.EXC(B2:B13,0.85); B2:B13 defines the data set for which we are calculating a percentile, and 0.85 defines the percentile of interest.

[Figure: Calculating variability measures for the home sales data in Excel]

Quartiles
It is often desirable to divide data into four parts, with each part containing approximately one-fourth, or 25 percent, of the observations. These division points are referred to as the quartiles and are defined as:
Q1 = first quartile, or 25th percentile
Q2 = second quartile, or 50th percentile (also the median)
Q3 = third quartile, or 75th percentile

To demonstrate quartiles, the home sales data are again arranged in ascending order.
108,000 138,000 138,000 142,000 186,000 199,500 208,000 254,000 254,000 257,500 298,000 456,250

We already identified Q2, the second quartile (median), as 203,750. To find Q1 and Q3, we must find the 25th and 75th percentiles. The 25th percentile is at position 0.25(12 + 1) = 3.25, so Q1 = 138,000 + 0.25(142,000 - 138,000) = 139,000. The 75th percentile is at position 0.75(12 + 1) = 9.75, so Q3 = 254,000 + 0.75(257,500 - 254,000) = 256,625.

Interquartile Range
The difference between the third and first quartiles is often referred to as the interquartile range, or IQR. For the home sales data, IQR = Q3 - Q1 = 256,625 - 139,000 = 117,625. Because it excludes the smallest and largest 25 percent of values in the data, the IQR is a useful measure of variation for data that have extreme values or are badly skewed.

Quartiles Using Excel
A quartile can be computed in Excel using the function QUARTILE.EXC. Figure 12 shows the calculations for first, second, and third quartiles for the home sales data. The formula used in cell E15 is =QUARTILE.EXC(B2:B13,1). The range B2:B13 defines the data set, and 1 indicates that we want to compute the 1st quartile. Cells E16 and E17 use similar formulas to compute the second and third quartiles.

Compare the Spread of Two Data Sets

Range
The simplest measure of variability is the range. The range can be found by subtracting the smallest value from the largest value in a data set. Let us return to the home sales data set to demonstrate the calculation of range. Refer to the data from home sales prices in Table 2. The largest home sales price is $456,250, and the smallest is $108,000. The range is $456,250 - $108,000 = $348,250.

Variance
The variance is a measure of variability of the data. The variance is based on the deviation about the mean, which is the difference between the value of each observation (x_i) and the mean. For a sample, a deviation of an observation about the mean is written (x_i - x̄). In the computation of the variance, the deviations about the mean are squared. For a sample of n observations, the sample variance s^2 is
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

Standard Deviation
The standard deviation is defined to be the positive square root of the variance.
We use s to denote the sample standard deviation and σ to denote the population standard deviation. The sample standard deviation, s, is a point estimate of the population standard deviation, σ, and is derived from the sample variance in the following way:
\[ s = \sqrt{s^2} \]

Coefficient of Variation
In some situations we may be interested in a descriptive statistic that indicates how large the standard deviation is relative to the mean. This measure is called the coefficient of variation and is usually expressed as a percentage, that is, (s / x̄ × 100) percent.

Identifying Outliers
Sometimes a data set will have one or more observations with unusually large or unusually small values. These extreme values are called outliers. Outliers should be reviewed carefully and, where they are erroneous, corrected or removed so that they do not distort the results. Standardized values (z-scores) can be used to identify outliers. For data having a bell-shaped distribution, almost all the data values will be within three standard deviations of the mean. Hence, in using z-scores to identify outliers, we recommend treating any data value with a z-score less than -3 or greater than +3 as an outlier.

z-scores
A z-score allows us to measure the relative location of a value in the data set. More specifically, a z-score helps us determine how far a particular value is from the mean relative to the data set's standard deviation. Suppose we have a sample of n observations, with the values denoted by x_1, x_2, . . . , x_n. In addition, assume that the sample mean, x̄, and the sample standard deviation, s, are already computed. Associated with each value, x_i, is another value called its z-score:
\[ z_i = \frac{x_i - \bar{x}}{s} \]

[Figure: Normal Distribution]
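A minimal Python sketch of the z-score calculation and the three-standard-deviation outlier rule follows. It assumes the twelve home sales prices listed in this unit; the function and variable names are hypothetical and used only for illustration.

```python
import statistics

# Home sales prices listed in this unit (12 observations).
home_sales = [108000, 138000, 138000, 142000, 186000, 199500,
              208000, 254000, 254000, 257500, 298000, 456250]


def z_scores(values):
    """Return (x_i - mean) / s for each value, where s is the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(x - mean) / s for x in values]


scores = z_scores(home_sales)

# Rule of thumb from the text: a z-score below -3 or above +3 marks an outlier.
outliers = [x for x, z in zip(home_sales, scores) if abs(z) > 3]

for x, z in zip(home_sales, scores):
    print(f"{x:>9,}  z = {z:+.2f}")
print("outliers:", outliers)
```

For this data set even the largest sale, $456,250, has a z-score of about 2.49, so the rule flags no outliers.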
The z-score is often called the standardized value. The z-score, z_i, can be interpreted as the number of standard deviations x_i is from the mean. For example, z_1 = 1.2 indicates that x_1 is 1.2 standard deviations greater than the sample mean.

Box Plots
A box plot is a graphical summary of the distribution of data. A box plot is developed from the quartiles for a data set. Figure 14 is a box plot for the home sales data.

[Figure: Box Plots]

What can we learn from these box plots? The most expensive houses appear to be in Shadyside and the cheapest houses in Hamilton. The median home selling price in Groton is about the same as the median home selling price in Irving. However, home sales prices in Irving have much greater variability. Homes appear to be selling in Irving for many different prices, from very low to very high. Home selling prices have the least variation in Groton and Hamilton. Unusually expensive home sales (relative to the respective distribution of home sales values) have occurred in Fairview, Groton, and Irving, which appear as outliers. Groton is the only location with a low outlier, but note that most homes sell for very similar prices in Groton, so the selling price does not have to be too far from the median to be considered an outlier.

Statistical Inference
Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.

Example: To evaluate the advantages of a new filament, Norris Electronics tested a sample of 200 bulbs manufactured with the new filament. Data collected from this sample showed the number of hours each lightbulb operated before filament burnout. The data are given in Table 7. Figure 16 provides a graphical summary of the statistical inference process for Norris Electronics.

Inferential Statistics
Inferential statistics is used to make inferences or to project from a sample to an entire population.
For example, when a firm test-markets a new product in two cities of the United States, it is concerned not only with how customers in these two cities feel; it also wants to make an inference from these sample markets to predict what will happen throughout the United States. So, two applications of statistics exist: (1) descriptive statistics, which describe characteristics of the population or sample, and (2) inferential statistics, which are used to generalize from a sample to a population.

Population Parameters and Sample Statistics
Population parameters are measured characteristics of a specific population, in other words, information about the entire universe of interest. Sample statistics are used to make inferences (guesses) about population parameters based on sample data. In our notation, we will generally represent population parameters with Greek lowercase letters, for example μ (mu) or σ (sigma), and sample statistics with English letters, such as X or S.

Correlation Using Excel
To find the correlations between each pair of stocks, click Data Analysis in the Analysis group on the Data tab and then select Correlation. You must install the Analysis ToolPak (described in "Summarizing data by using histograms" and "Summarizing data by using descriptive statistics") before you can use this feature. Click OK and then fill in the Correlation dialog box as shown in the accompanying figure.

Measures of Variability
In descriptive statistics we mainly analyse data for its central tendency (average) and for its spread, also called variability or dispersion. Spread is typically measured as variation about the mean. We measure the spread of data using measures such as the range, percentiles, mean absolute deviation, variance, standard deviation, and quartile deviation.
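Several of these measures of spread can be computed for the home sales data with Python's standard `statistics` module. This is a sketch under stated assumptions rather than the course's own method: the variable names are illustrative, and `statistics.quantiles` with `method="exclusive"` is chosen here because, for this data set, it gives the same quartiles that the text reports from Excel's QUARTILE.EXC.

```python
import statistics

# Home sales prices listed in this unit (12 observations).
home_sales = [108000, 138000, 138000, 142000, 186000, 199500,
              208000, 254000, 254000, 257500, 298000, 456250]

# Range: largest value minus smallest value.
data_range = max(home_sales) - min(home_sales)

# Quartiles using (n + 1)p positioning; quantiles() sorts the data itself.
q1, q2, q3 = statistics.quantiles(home_sales, n=4, method="exclusive")
iqr = q3 - q1

# Sample mean, sample variance (n - 1 divisor), and sample standard deviation.
mean = statistics.mean(home_sales)
variance = statistics.variance(home_sales)
std_dev = statistics.stdev(home_sales)

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv = std_dev / mean * 100

print(f"range:              {data_range:,}")
print(f"Q1, Q2, Q3:         {q1:,.0f}, {q2:,.0f}, {q3:,.0f}")
print(f"IQR:                {iqr:,.0f}")
print(f"mean:               {mean:,.2f}")
print(f"sample variance:    {variance:,.2f}")
print(f"sample std dev:     {std_dev:,.2f}")
print(f"coeff. of variation: {cv:.1f}%")
```

The range (348,250), quartiles (139,000, 203,750, 256,625), and IQR (117,625) agree with the values worked out in the text.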