Datascience Session2

UNIT-2

Descriptive statistics, data preparation. Exploratory Data Analysis: data summarization, data distribution, measuring asymmetry. Sample and estimated mean, variance and standard score. Statistical Inference: frequency approach, variability of estimates, hypothesis testing using confidence intervals, using p-values.

Descriptive Statistics

Descriptive statistics helps to simplify large amounts of data in a sensible way. In contrast to inferential statistics, which will be introduced in a later chapter, in descriptive statistics we do not draw conclusions beyond the data we are analyzing; neither do we reach any conclusions regarding hypotheses we may make. We do not try to infer characteristics of the "population" (see below) of the data, but claim to present quantitative descriptions of it in a manageable form. It is simply a way to describe the data.

Statistics, and in particular descriptive statistics, is based on two main concepts:

• a population is a collection of objects, items ("units") about which information is sought;
• a sample is a part of the population that is observed.

Descriptive statistics applies the concepts, measures, and terms that are used to describe the basic features of the samples in a study. These procedures are essential to provide summaries about the samples as an approximation of the population. Together with simple graphics, they form the basis of every quantitative analysis of data. In order to describe the sample data and to be able to infer any conclusion, we should go through several steps:

1. Data preparation: Given a specific example, we need to prepare the data for generating statistically valid descriptions.
2. Descriptive statistics: This generates different statistics to describe and summarize the data concisely and evaluate different ways to visualize them.

Data Preparation

One of the first tasks when analyzing data is to collect and prepare the data in a format appropriate for analysis of the samples.
The most common steps for data preparation involve the following operations.

1. Obtaining the data: Data can be read directly from a file or they might be obtained by scraping the web.
2. Parsing the data: The right parsing procedure depends on what format the data are in: plain text, fixed columns, CSV, XML, HTML, etc.
3. Cleaning the data: Survey responses and other data files are almost always incomplete. Sometimes, there are multiple codes for things such as not asked, did not know, and declined to answer. And there are almost always errors. A simple strategy is to remove or ignore incomplete records.
4. Building data structures: Once you read the data, it is necessary to store them in a data structure that lends itself to the analysis we are interested in. If the data fit into memory, building a data structure is usually the way to go. If not, usually a database is built, which is an out-of-memory data structure. Most databases provide a mapping from keys to values, so they serve as dictionaries.

The Adult Example

Let us consider a public database called the "Adult" dataset, hosted on the UCI Machine Learning Repository. It contains approximately 32,000 observations concerning different financial parameters related to the US population: age, sex, marital (marital status of the individual), country, income (Boolean variable: whether the person makes more than $50,000 per annum), education (the highest level of education achieved by the individual), occupation, capital gain, etc.
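The parsing, cleaning, and structure-building steps listed above can be sketched on a tiny in-memory sample. This is only an illustration under stated assumptions: the `RAW` text, its three columns, and the helper `parse_and_clean` are hypothetical and are not part of the Adult dataset or of the code used later in this chapter.

```python
import csv
import io

# Hypothetical raw text standing in for a downloaded file (not the real Adult data).
RAW = """age,sex,income
39,Male,<=50K
50,Male,>50K
?,Female,<=50K
28,Female
37,Female,>50K
"""

def parse_and_clean(text):
    """Parse CSV text and drop records that are incomplete or have a non-numeric age."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    records = []
    for row in reader:
        # Cleaning: skip rows with missing fields or an unknown-age code such as "?".
        if len(row) != len(header) or not row[0].isdigit():
            continue
        # Building a data structure: one dictionary (key -> value) per record.
        record = dict(zip(header, row))
        record["age"] = int(record["age"])
        records.append(record)
    return records

records = parse_and_clean(RAW)
print(len(records))
print(records[0])
```

Dropping incomplete rows is the simple strategy mentioned in step 3; whether it is appropriate depends on how many records are lost and whether the missing values are spread evenly across groups.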
We will show that we can explore the data by asking questions like: "Are men more likely to become high-income professionals than women, i.e., to receive an income of over $50,000 per annum?"

First, let us read the data:

    # Assumed location of the downloaded Adult data file; adjust the path as needed.
    file = open('adult.data', 'r')

    def chr_int(a):
        # Convert a numeric string to int; non-numeric fields map to 0.
        if a.isdigit():
            return int(a)
        else:
            return 0

    data = []
    for line in file:
        data1 = line.split(', ')
        if len(data1) == 15:
            data.append([chr_int(data1[0]), data1[1],
                         chr_int(data1[2]), data1[3],
                         chr_int(data1[4]), data1[5],
                         data1[6], data1[7], data1[8],
                         data1[9], chr_int(data1[10]),
                         chr_int(data1[11]),
                         chr_int(data1[12]),
                         data1[13], data1[14]])

Checking the data, we obtain:

    print(data[1:2])

    [[50, 'Self-emp-not-inc', 83311, 'Bachelors', 13, 'Married-civ-spouse', 'Exec-managerial', 'Husband', 'White', 'Male', 0, 0, 13, 'United-States', '<=50K\n']]

One of the easiest ways to manage data in Python is by using the DataFrame structure, defined in the Pandas library, which is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes:

    import pandas as pd

    df = pd.DataFrame(data)
    # Column names follow the attribute order of the Adult data file.
    df.columns = ['age', 'type_employer', 'fnlwgt',
                  'education', 'education_num', 'marital',
                  'occupation', 'relationship', 'race',
                  'sex', 'capital_gain', 'capital_loss',
                  'hr_per_week', 'country', 'income']

The command shape gives exactly the number of data samples (in rows, in this case) and features (in columns):

    df.shape

    (32561, 15)

Thus, we can see that our dataset contains 32,561 data records with 15 features each. Let us count the number of items per country:

    counts = df.groupby('country').size()
    print(counts.head())

    country
    ?             583
    Cambodia       19
    ...
    Vietnam        67
    Yugoslavia     16

The first row shows the number of samples with unknown country, followed by the number of samples corresponding to the first countries in the dataset.

Let us split people according to their gender into two groups: men and women.

    ml = df[(df.sex == 'Male')]

If we focus on high-income professionals separated by sex, we can do:

    ml1 = df[(df.sex == 'Male') & (df.income == '>50K\n')]
    fm = df[(df.sex == 'Female')]
    fm1 = df[(df.sex == 'Female') & (df.income == '>50K\n')]

Exploratory Data Analysis

The data that come from performing a particular measurement on all the subjects in a sample represent our observations for a single characteristic like country, age, education, etc. These measurements and categories represent a sample distribution of the variable, which in turn approximately represents the population distribution of the variable. One of the main goals of exploratory data analysis is to visualize and summarize the sample distribution, thereby allowing us to make tentative assumptions about the population distribution.

Summarizing the Data

The data in general can be categorical or quantitative. For categorical data, a simple tabulation of the frequency of each category is the best non-graphical exploration for data analysis. For example, we can ask ourselves what is the proportion of high-income professionals in our database:

    df1 = df[(df.income == '>50K\n')]
    print(f'The rate of people with high income is: {int(len(df1) / len(df) * 100)}%.')
    print(f'The rate of men with high income is: {int(len(ml1) / len(ml) * 100)}%.')
    print(f'The rate of women with high income is: {int(len(fm1) / len(fm) * 100)}%.')

    The rate of people with high income is: ...
    The rate of men with high income is: 30%.
    The rate of women with high income is: 10%.

Given a quantitative variable, exploratory data analysis is a way to make preliminary assessments about the population distribution of the variable using the data of the observed samples. The characteristics of the population distribution of a quantitative variable are its mean, deviation, histograms, outliers, etc. Our observed data represent just a finite set of samples of an often infinite number of possible samples. The characteristics of our randomly observed samples are interesting only to the degree that they represent the population of the data they came from.

Mean

One of the first measurements we use to have a look at the data is to obtain sample statistics from the data, such as the sample mean [1].
Given a sample of $n$ values, $\{x_i\}$, $i = 1, \ldots, n$, the mean, $\mu$, is the sum of the values divided by the number of values, in other words:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{3.1}$$

The terms mean and average are often used interchangeably. In fact, the main distinction between them is that the mean of a sample is the summary statistic computed by Eq. (3.1), while an average is not strictly defined and could be one of many summary statistics that can be chosen to describe the central tendency of a sample.

In our case, we can consider what the average age of men and women samples in our dataset would be in terms of their mean:

    print('The average age of men is:', ml['age'].mean())
    print('The average age of women is:', fm['age'].mean())
    print('The average age of high-income men is:', ml1['age'].mean())
    print('The average age of high-income women is:', fm1['age'].mean())

    The average age of men is: 39.4335474989
    The average age of women is: 36.8582304336
    The average age of high-income men is: 44.6257880515
    The average age of high-income women is: 42.1258301103

This difference in the sample means can be considered initial evidence that there are differences between men and women with high income!

Comment: Later, we will work with both concepts: the population mean and the sample mean. We should not confuse them! The first is the mean of the whole population; the second, the mean of the samples taken from the population.

Sample Variance

The mean is not usually a sufficient descriptor of the data. We can go further by knowing two numbers: mean and variance. The variance $\sigma^2$ describes the spread of the data and it is defined as follows:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \tag{3.2}$$

The term $(x_i - \mu)$ is called the deviation from the mean, so the variance is the mean squared deviation. The square root of the variance, $\sigma$, is called the standard deviation. We consider the standard deviation because the variance is hard to interpret (e.g., if the units are grams, the variance is in grams squared).
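As a small worked example of the mean and variance defined above, the sketch below evaluates both by hand on a made-up list of ages (the values are hypothetical, chosen only so the arithmetic is easy to follow). It also computes the variant that divides by n - 1 instead of n. To the best of our knowledge, the pandas `var()` and `std()` methods use the n - 1 form by default (`ddof=1`), so their results can differ slightly from the 1/n definition.

```python
import math
import statistics

# Hypothetical sample of ages, chosen only for illustration.
ages = [22, 35, 35, 41, 47]

n = len(ages)
mu = sum(ages) / n                                    # mean: sum of values over n
squared_dev = [(x - mu) ** 2 for x in ages]           # squared deviations from the mean
var_n = sum(squared_dev) / n                          # variance with divisor n
var_n1 = sum(squared_dev) / (n - 1)                   # variance with divisor n - 1
sigma = math.sqrt(var_n)                              # standard deviation

print(mu, var_n, var_n1, sigma)

# The standard library offers both conventions:
# pvariance divides by n, variance divides by n - 1.
print(statistics.pvariance(ages), statistics.variance(ages))
```

For large samples such as the Adult dataset the two divisors give nearly identical values; the difference matters mostly for small samples.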
Let us compute the mean and the variance of the age of men and women in our dataset:

    ml_mu = ml['age'].mean()
    fm_mu = fm['age'].mean()
    ml_var = ml['age'].var()
    fm_var = fm['age'].var()
    ml_std = ml['age'].std()
    fm_std = fm['age'].std()
    print('Statistics of age for men: mu:', ml_mu, 'var:', ml_var, 'std:', ml_std)
    print('Statistics of age for women: mu:', fm_mu, 'var:', fm_var, 'std:', fm_std)

    Statistics of age for men: mu: 39.4335474989 var: 178.773751745 std: 13.3706301925
    Statistics of age for women: mu: 36.8582304336 var: 196.383706355 std: ...

We can see that the mean age of the women in the dataset is lower than that of the men, but with a higher variance and standard deviation.

Sample Median

The mean of the samples is a good descriptor, but it has an important drawback: what will happen if in the sample set there is an error with a value very different from the rest? For example, considering hours worked per week, it would normally be in a range between 20 and 80; but what would happen if by mistake there was a value of 1000? An item of data that is significantly different from the rest of the data is called an outlier. In this case, the mean, $\mu$, will be drastically changed towards the outlier. One solution to this drawback is offered by the statistical median, $\mu_{1/2}$, which is an order statistic giving the middle value of a sample. In this case, all the values are ordered by their magnitude and the median is defined as the value that is in the middle of the ordered list. Hence, it is a value that is much more robust in the face of outliers.

Let us see the median age of working men and women in our dataset and the median age of high-income men and women:

    ml_median = ml['age'].median()
    fm_median = fm['age'].median()
    print('Median age per men and women:', ml_median, fm_median)
    ml_median_age = ml1['age'].median()
    fm_median_age = fm1['age'].median()
    print('Median age per men and women with high income:', ml_median_age, fm_median_age)

    Median age per men and women: 38.0 35.0
    Median age per men and women with high income: 44.0 41.0

As expected, the median age of high-income people is higher than that of the whole set of working people, although the difference between men and women in both sets is the same.

Quantiles and Percentiles

Sometimes we are interested in observing how sample data are distributed in general. In this case, we can order the samples $\{x_i\}$, then find the $x_p$ so that it divides the data into two parts, where:

• a fraction $p$ of the data values is less than or equal to $x_p$ and
• the remaining fraction $(1 - p)$ is greater than $x_p$.

That value, $x_p$, is the $p$-th quantile, or the $100 \times p$-th percentile. For example, a 5-number summary is defined by the values $x_{\min}, Q_1, Q_2, Q_3, x_{\max}$, where $Q_1$ is the 25th percentile, $Q_2$ is the 50th percentile and $Q_3$ is the 75th percentile.

Data Distributions

Summarizing data by just looking at their mean, median, and variance can be dangerous: very different data can be described by the same statistics. The best thing to do is to validate the data by inspecting them. We can have a look at the data distribution, which describes how often each value appears (i.e., what is its frequency).

The most common representation of a distribution is a histogram, which is a graph that shows the frequency of each value. Let us show the age of working men and women separately.

    ml_age = ml['age']
    # density=False plots absolute counts (density supersedes the older normed argument)
    ml_age.hist(density=False, histtype='stepfilled', bins=20)
    fm_age = fm['age']
    fm_age.hist(density=False, histtype='stepfilled', bins=10)

Fig. 3.1 Histogram of the age of working men (left) and women (right)

The output can be seen in Fig. 3.1. If we want to compare the histograms, we can plot them overlapping in the same graphic as follows:
    import seaborn as sns

    fm_age.hist(density=False, histtype='stepfilled', alpha=.5, bins=20)
    ml_age.hist(density=False, histtype='stepfilled', alpha=.5,
                color=sns.desaturate("indianred", .75), bins=10)

Fig. 3.2 Histogram of the age of working men (in ochre) and women (in violet) (left). Histogram of the age of working men (in ochre), women (in blue), and their intersection (in violet) after sample normalization (right)

The output can be seen in Fig. 3.2 (left). Note that we are visualizing the absolute values of the number of people in our dataset according to their age (the abscissa of the histogram). As a side effect, we can see that there are many more men in these conditions than women.

We can normalize the frequencies of the histogram by dividing/normalizing by $n$, the number of samples. The normalized histogram is called the Probability Mass Function (PMF).

    # density=True rescales the bar heights so that the histogram area is 1
    fm_age.hist(density=True, histtype='stepfilled', alpha=.5, bins=20)
    ml_age.hist(density=True, histtype='stepfilled', alpha=.5, bins=10,
                color=sns.desaturate("indianred", .75))

This outputs Fig. 3.2 (right), where we can observe a comparable range of individuals (men and women).

The Cumulative Distribution Function (CDF), or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found to have a value less than or equal to x. Let us show the CDF of age distribution for both men and women.

    ml_age.hist(density=True, histtype='step', cumulative=True,
                linewidth=3.5, bins=20)
    fm_age.hist(density=True, histtype='step', cumulative=True,
                linewidth=3.5, bins=20,
                color=sns.desaturate("indianred", .75))

Fig. 3.3 The CDF of the age of working male (in blue) and female (in red) samples (horizontal axis: Age)

The output can be seen in Fig. 3.3, which illustrates the CDF of the age distributions for both men and women.

Outlier Treatment

As mentioned before, outliers are data samples with a value that is far from the central tendency. Different rules can be defined to detect outliers, as follows:

• Computing samples that are far from the median.
• Computing samples whose values exceed the mean by 2 or 3 standard deviations.

For example, in our case, we are interested in the age statistics of men versus women with high incomes and we can see that in our dataset, the minimum age is 17 years and the maximum is 90 years. We can consider that some of these samples are due to errors or are not representative. Applying the domain knowledge, we focus on the median age (37, in our case) up to 72 and down to 22 years old, and we consider the rest as outliers.
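The two detection rules above can be sketched with the standard library. This is a minimal illustration, not the chapter's own code: the `ages` list is made up (only its extremes, 17 and 90, and its median, 37, mirror the figures quoted above), and the helper names `within_range` and `within_k_std` are hypothetical. The second helper reads "exceed the mean by k standard deviations" as an absolute distance from the mean, which is one possible interpretation.

```python
import statistics

# Hypothetical ages; 17 and 90 mirror the minimum and maximum mentioned in the text.
ages = [17, 25, 31, 37, 37, 42, 48, 55, 90]

def within_range(values, low, high):
    """Keep values inside [low, high]; everything else is treated as an outlier."""
    return [v for v in values if low <= v <= high]

def within_k_std(values, k):
    """Keep values whose distance from the mean is at most k standard deviations."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # standard deviation with divisor n
    return [v for v in values if abs(v - mu) <= k * sigma]

# Rule based on domain knowledge around the median: keep ages from 22 to 72.
kept_domain = within_range(ages, 22, 72)

# Rule based on the mean: drop values more than 2 standard deviations away.
kept_2std = within_k_std(ages, 2)

print(statistics.median(ages))
print(kept_domain)
print(kept_2std)
```

On this toy sample the range rule removes both 17 and 90, while the 2-standard-deviation rule removes only 90, because the extreme value itself inflates the mean and the standard deviation. This is one reason rules anchored on the median are often preferred when outliers are suspected.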
