
Data Science Q&A - Latest Ed (2020) - 2 - 2

Linear regression is a statistical method used to model relationships between variables. It finds the best-fit straight line through data points to help understand the relationship between an independent and a dependent variable. The p-value indicates whether the relationship is statistically significant. A low p-value (often <0.05) means the observed relationship would be unlikely if the independent variable had no effect on the dependent variable. The coefficient represents the slope of the regression line, showing the expected change in the dependent variable for a one-unit change in the independent variable. The r-squared value measures how well the regression line approximates the real data points, as the proportion of the dependent variable's variance that the line explains, with values closer to 1 indicating the independent variable better explains the dependent variable's behavior.

Uploaded by

M K

Sampling can be particularly useful with data sets that are too large to efficiently analyze in full – for
example, in big data analytics applications or surveys. Identifying and analyzing a representative sample
is more efficient and cost-effective than surveying the entirety of the data or population.
An important consideration, though, is the size of the required data sample and the possibility of
introducing a sampling error. In some cases, a small sample can reveal the most important information
about a data set. In others, using a larger sample can increase the likelihood of accurately representing
the data as a whole, even though the increased size of the sample may impede ease of manipulation and
interpretation.
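
The trade-off between sample size and sampling error can be illustrated with a minimal sketch. The synthetic population (mean 50, standard deviation 10), the two sample sizes, and the helper name mean_abs_error are assumptions made for illustration only; they do not come from the original text.

```python
import random
import statistics

def mean_abs_error(population, n, trials=200, seed=0):
    """Average absolute gap between the sample mean and the population mean
    over repeated simple random samples of size n (hypothetical helper)."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, n)  # sampling without replacement
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)

# Synthetic population, assumed for illustration only.
pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]

small_sample_error = mean_abs_error(population, n=25)
large_sample_error = mean_abs_error(population, n=2500)

print(f"typical error with n=25:   {small_sample_error:.3f}")
print(f"typical error with n=2500: {large_sample_error:.3f}")
```

Under these assumptions the larger sample tracks the population mean more closely, at the cost of handling 100 times as many rows.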
There are many different methods for drawing samples from data; the ideal one depends on the data set
and situation. Sampling can be based on probability, an approach that uses random numbers that
correspond to points in the data set to ensure that there is no correlation between points chosen for the
sample. Further variations in probability sampling include:

• Simple random sampling: Software is used to randomly select subjects from the whole population.
• Stratified sampling: The data set or population is divided into subgroups (strata) based on a common
factor, and a sample is drawn at random from each stratum (using a random sampling method like
simple random sampling or systematic sampling).
o EX: In the image below, let's say you need a sample size of 6. Two members from each
group (yellow, red, and blue) are selected randomly. Make sure to sample proportionally:
in this simple example, 1/3 of each group (2/6 yellow, 2/6 red and 2/6 blue) has been
sampled. If one group is a different size, adjust the counts so that the proportion
stays the same. For example, if you had 9 yellow, 3 red and 3 blue, a 5-item sample would
consist of 3 of the 9 yellow, 1 of the 3 red and 1 of the 3 blue (one third of each group).
• Cluster sampling: The larger data set is divided into subsets (clusters) based on a defined factor,
then a random sampling of clusters is analyzed. The sampling unit is the whole cluster: instead of
sampling individuals from within each group, a researcher will study whole clusters.
o EX: In the image below, the clusters are natural groupings by head color (yellow, red, blue).
A sample size of 6 is needed, so two of the complete clusters are selected randomly (in this
example, groups 2 and 4 are chosen).

• Multistage sampling: A more complicated form of cluster sampling, this method also involves
dividing the larger population into a number of clusters. Second-stage clusters are then broken
out based on a secondary factor, and those clusters are then sampled and analyzed. This staging
could continue as multiple subsets are identified, clustered and analyzed.
• Systematic sampling: A sample is created by setting an interval at which to extract data from the
larger population – for example, selecting every 10th row in a spreadsheet of 200 items to create
a sample size of 20 rows to analyze.
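
The probability sampling methods above can be sketched with Python's standard library. This is a minimal illustration, not a reference implementation: the function names, the fixed seed, and the toy data (a 200-row list, 15 colored "heads", four clusters of three members) are assumptions chosen to mirror the examples in the list.

```python
import random
from collections import defaultdict

def simple_random_sample(items, k, rng):
    """Pick k items uniformly at random, without replacement."""
    return rng.sample(items, k)

def systematic_sample(items, step, start):
    """Take every step-th item, beginning at index start."""
    return items[start::step]

def stratified_sample(items, key, fraction, rng):
    """Draw the same fraction at random from each stratum defined by key."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every member of each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(7)  # fixed seed so the sketch is repeatable
rows = list(range(1, 201))  # stand-in for a 200-row spreadsheet

srs = simple_random_sample(rows, 6, rng)
every_tenth = systematic_sample(rows, step=10, start=9)  # rows 10, 20, ..., 200

# 9 yellow, 3 red, 3 blue: one third of each gives a 5-item sample.
heads = ([("yellow", i) for i in range(9)]
         + [("red", i) for i in range(3)]
         + [("blue", i) for i in range(3)])
stratified = stratified_sample(heads, key=lambda h: h[0], fraction=1 / 3, rng=rng)

# Four clusters of three members: two whole clusters give a sample of 6.
clusters = {n: [f"g{n}-m{m}" for m in range(3)] for n in (1, 2, 3, 4)}
clustered = cluster_sample(clusters, 2, rng)
```

Multistage sampling could be approximated by calling cluster_sample first and then sampling again within the chosen clusters. In practice, systematic sampling often uses a random start index rather than a fixed one.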

Steve Nouri
Sampling can also be non-probability based, an approach in which a data sample is determined and
extracted based on the judgment of the analyst. Because inclusion is determined by the analyst, it can be
more difficult to judge whether the sample accurately represents the larger population than when
probability sampling is used.

Non-probability data sampling methods include:

• Convenience sampling: Data is collected from an easily accessible and available group.
• Consecutive sampling: Data is collected from every subject that meets the criteria until the
predetermined sample size is met.
• Purposive or judgmental sampling: The researcher selects the data to sample based on predefined
criteria.
• Quota sampling: The researcher ensures equal representation within the sample for all subgroups
in the data set or population (random sampling is not used).
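
To make the contrast with random selection concrete, here is a minimal sketch of convenience and quota sampling. The record layout (name, segment), the function names, and the quota of 2 per subgroup are hypothetical and chosen only for illustration.

```python
def convenience_sample(records, k):
    """Take the first k records that happen to be available."""
    return records[:k]

def quota_sample(records, key, quota):
    """Accept records in arrival order until each subgroup reaches its quota.
    No random selection is involved."""
    counts = {}
    sample = []
    for record in records:
        group = key(record)
        if counts.get(group, 0) < quota:
            sample.append(record)
            counts[group] = counts.get(group, 0) + 1
    return sample

# Hypothetical customer records: (name, segment).
records = [
    ("a", "online"), ("b", "online"), ("c", "store"),
    ("d", "online"), ("e", "store"), ("f", "store"),
]

convenient = convenience_sample(records, 3)
balanced = quota_sample(records, key=lambda r: r[1], quota=2)

print(convenient)  # first three arrivals, segments left unbalanced
print(balanced)    # two records per segment, in arrival order
```

Both samples depend entirely on the order in which records arrive, which is why it is harder to argue that they represent the wider population.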

Once generated, a sample can be used for predictive analytics. For example, a retail business might use
data sampling to uncover patterns about customer behavior and predictive modeling to create more
effective sales strategies.

Q3. What is the difference between type I vs type II error?

https://www.datasciencecentral.com/profiles/blogs/understanding-type-i-and-type-ii-errors

Treat the alternative hypothesis Ha as the "positive" result. If H0 is true, Ha is negative, and
correctly not rejecting H0 is a true negative (TN). If H0 is false, Ha is positive, and correctly
rejecting H0 is a true positive (TP).
A type I error occurs when the null hypothesis is true but is rejected. A type II error occurs when the null
hypothesis is false but erroneously fails to be rejected.

               Do not reject H0      Reject H0
H0 is true     TN                    FP (type I error)
H0 is false    FN (type II error)    TP
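
The decision table can be sketched in code, together with a small simulation that estimates how often each error occurs. Everything specific here is an assumption for illustration: a two-sided z-test with known standard deviation 1, H0 stating that the mean is 0, a significance level of 0.05, samples of 30, and a true mean of 0.5 for the case where H0 is false.

```python
import random
from statistics import NormalDist, fmean

def classify(h0_is_true, rejected_h0):
    """Map a test decision onto the four cells of the table."""
    if h0_is_true:
        return "FP (type I error)" if rejected_h0 else "TN"
    return "TP" if rejected_h0 else "FN (type II error)"

def rejects_h0(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: mean == mu0, assuming sigma is known."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def error_rate(true_mean, trials=2000, n=30, seed=1):
    """Share of simulated tests that end in an error when H0 says mean == 0."""
    rng = random.Random(seed)
    h0_is_true = true_mean == 0.0
    errors = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        outcome = classify(h0_is_true, rejects_h0(sample, mu0=0.0, sigma=1.0))
        if "error" in outcome:
            errors += 1
    return errors / trials

type_i_rate = error_rate(true_mean=0.0)   # H0 true: errors are false positives
type_ii_rate = error_rate(true_mean=0.5)  # H0 false: errors are false negatives

print(f"estimated type I error rate:  {type_i_rate:.3f}")
print(f"estimated type II error rate: {type_ii_rate:.3f}")
```

Under these assumptions the type I rate should land close to the chosen significance level, while the type II rate depends on the effect size and the sample size.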

Q4. What is linear regression? What do the terms p-value, coefficient, and r-squared value mean? What is the significance of each of these components?
