Unit 3 - DSPU
What is Regression?
After plotting a scatter plot, Aarav observes the data and concludes
that it is linear. For his first scatter plot, he uses two variables:
‘Living area’ and ‘Price’.
As soon as he sees this pattern, he plans to fit a regression line on
the graph so that he can use the line to predict the ‘price of the
house’.
Using the training data, i.e. ‘Price’ and ‘Living area’, a regression
line is obtained that gives the minimum error. To do that, he needs to
find the line that is closest to as many points as possible. This
‘linear equation’ is then used on any new data so that he can predict
the required output.
Yᵢ = β₀ + β₁Xᵢ + εᵢ

Here, the βᵢ's are the parameters (also called weights): β₀ is the
y-intercept, β₁ is the slope, and εᵢ is the random error term that
captures the noise in the data. The above equation is the linear
equation that needs to be obtained with the minimum error.
The values of ‘β₁’ and ‘β₀’ must be chosen so that they minimize the
error, just as in the equation of a line. To measure the error, we
calculate the sum of squared errors and tune the parameters to reduce
it:
Error = Σ (actual output − predicted output)²
J(θ) = (1/2m) Σ (hθ(x(i)) − y(i))²

Key:
1. Y(predicted) is also called the hypothesis function.
2. J(θ) is the cost function which can also be called the error
function. Our main goal is to minimize the value of the cost.
3. y(i) is the actual output.
4. hθ(x(i)) is called the hypothesis function which is basically
the Y(predicted) value.
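To make the hypothesis and cost function above concrete, here is a minimal Python sketch; the variable names and toy data are illustrative, not from the notes:

import numpy as np

def hypothesis(theta0, theta1, x):
    # h_theta(x) = theta0 + theta1 * x, i.e. the Y(predicted) value
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    # J(theta) = (1 / 2m) * sum((h_theta(x_i) - y_i) ** 2)
    m = len(y)
    return np.sum((hypothesis(theta0, theta1, x) - y) ** 2) / (2 * m)

# Toy data: living area (x) vs. price (y)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(cost(0.0, 2.0, x, y))  # error for theta0 = 0, theta1 = 2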
Now the question arises: how do we reduce the error value? This can be
done by using Gradient Descent. The main goal of gradient descent is to
minimize the cost value, i.e. min J(θ₀, θ₁).
Choosing a good learning rate is a very important task, as it
determines how large a step we take downhill during each iteration. If
we take too large a step, we may step over the minimum. However, if we
take very small steps, it will require many iterations to arrive at the
minimum.
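As a sketch, batch gradient descent for the two parameters might look like this; the learning rate alpha, iteration count, and toy data are illustrative choices:

import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=1000):
    # Fit theta0 + theta1 * x by repeatedly stepping downhill on J
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y  # h_theta(x_i) - y_i
        # Partial derivatives of J(theta0, theta1)
        grad0 = np.sum(error) / m
        grad1 = np.sum(error * x) / m
        # alpha controls the step size taken downhill
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(gradient_descent(x, y))  # learned intercept and slope

Too large an alpha can overshoot the minimum and diverge, while a very small alpha needs many iterations, exactly as described above.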
In the last section, we saw that two variables in our data set were
correlated. But what happens if we know that our data is correlated,
yet the relationship doesn't look linear? Depending on what the data
looks like, we can perform a polynomial regression to fit a polynomial
equation to it.
If we try to use simple linear regression on data like this, the
regression line won't fit well; it is very difficult to fit a straight
line to such data with a low error. Instead, we can use polynomial
regression to fit a polynomial curve and achieve a minimum error, or
minimum cost function. The polynomial regression equation for such data
would be:

y = β₀ + β₁x + β₂x² + … + βₙxⁿ + ε
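As a sketch, a polynomial fit can be done with NumPy's polyfit; the toy data and the choice of degree 2 are illustrative:

import numpy as np

# Toy non-linear data: y roughly follows a quadratic curve
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 4.8, 9.7, 17.2, 26.1])

# Fit a degree-2 polynomial: y = b0 + b1*x + b2*x^2
coeffs = np.polyfit(x, y, deg=2)  # returns [b2, b1, b0]
poly = np.poly1d(coeffs)

print(poly(6.0))  # predict the output for a new input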
Multivariate Regression
In the above example, the expert decides to collect the mentioned data,
which serve as the independent variables. These variables will affect
the dependent variables, which are nothing but the conditions of the
crops. In such a case, using single-variable regression would be a bad
choice, and multivariate regression might just do the trick.
One of the minimization algorithms that can be used is the gradient descent
algorithm.
Step 5: Test the hypothesis
The formulated hypothesis is then tested with a test set to check its accuracy
and correctness.
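Putting the steps together, here is a minimal sketch using scikit-learn's LinearRegression; the crop-related feature names and values are made up for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical independent variables: rainfall, temperature, fertilizer
X = np.array([[100, 25, 5],
              [120, 27, 6],
              [ 90, 22, 4],
              [150, 30, 8],
              [110, 26, 5],
              [130, 28, 7]])
y = np.array([2.0, 2.4, 1.7, 3.1, 2.2, 2.7])  # dependent: crop yield

# Hold out part of the data to test the hypothesis (Step 5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.coef_, model.intercept_)  # one weight per feature
print(model.score(X_test, y_test))    # R^2 on the test set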
Advantages:
o The multivariate regression method helps you find a relationship
between multiple variables or features.
o It also defines the correlation between independent variables and
dependent variables.
Disadvantages:
o The multivariate regression technique requires high-level
mathematical calculations.
o It is complex.
o The output of the multivariate regression model is difficult to
analyse.
o Its loss and error values can be misleading.
o Multivariate regression yields better results when used with larger
datasets rather than small ones.
What is OLTP?
Online transaction processing, shortly known as OLTP, supports
transaction-oriented applications in a 3-tier architecture. OLTP
administers the day-to-day transactions of an organization.
The basic difference between OLTP and OLAP is that OLTP is an online
database modifying system, whereas, OLAP is an online database query
answering system.
Comparison Chart

BASIS FOR COMPARISON | OLTP | OLAP
Basic | An online transactional system that manages database modification. | An online data retrieval and data analysis system.
Focus | Insert, update and delete information in the database. | Extract data for analysis that helps in decision making.
Transactions | Short transactions. | Long transactions.
Processing time | Comparatively less in OLTP. | Comparatively more in OLAP.
Normalization | Tables are normalized (3NF). | Tables are not normalized.
Integrity | The database must maintain a data integrity constraint. | The database is not frequently modified, so data integrity is not affected.
Bias
Figure 2: Bias
High bias shows up as a high training error. There are multiple
ways to reduce the bias of a model, such as:
1. By adding more features from the data to make the model more
complex.
2. By increasing training iterations so that more complex models
and relevant data can be learned.
3. Replacing the current model with a more complex one can reduce
the bias (see the sketch after this list).
4. Using non-linear algorithms.
5. Using non-parametric algorithms.
6. By decreasing regularization on inputs at different levels, the
model can learn the training set more efficiently and prevent
underfitting.
7. By using a new model architecture. However, this should only be
used as a last resort if none of the methods above give satisfactory
results.
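To illustrate points 1 and 3 above, here is a sketch (scikit-learn, with illustrative toy data) of how adding polynomial features makes the model more complex and drives down the training error, i.e. the bias:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(0, 3, 30).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.1, 30)  # non-linear target

for degree in (1, 3, 5):
    # Higher degree = more features = a more complex model
    X_poly = PolynomialFeatures(degree).fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    print(degree, model.score(X_poly, y))  # training R^2 rises with degree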
Variance
High variance shows up as a high validation error. There are multiple
ways of reducing the variance of a model, such as:
1. Using more training data.
2. Reducing the number of input features.
3. Increasing regularization on the inputs.
4. Replacing the current model with a simpler one.
Bias-Variance tradeoff
The goal of any supervised machine learning algorithm is to achieve low
bias and low variance. In turn the algorithm should achieve good
prediction performance.
As seen above, if the algorithm is too simple, it will have a high bias and
a low variance. Similarly, if the algorithm is too complex, it will have a
high variance and a low bias. Therefore, it is clear that:
“Bias and variance are complements of each other.” The increase of one
results in the decrease of the other, and vice versa. Hence, finding
the right balance between the two is known as the Bias-Variance
Tradeoff.
Target Function
An ideal algorithm should neither underfit nor overfit the data. The end
goal of all Machine Learning Algorithms is to produce a function that has
both low-bias and low-variance.
Bias-Variance Graph
Hypothetically, the dotted line in the graph above is the required
optimal solution. In the real world, however, it is very difficult to
achieve because the true target function is unknown. The goal is to
find an iterative process through which we can keep improving our
Machine Learning Algorithm so that its predictions improve.
The k-nearest neighbors algorithm has low bias and high variance,
but the trade-off can be changed by increasing the value of k,
which increases the number of neighbors that contribute to the
prediction and in turn increases the bias of the model (a code
sketch follows below).
The support vector machine algorithm has low bias and high
variance, but the trade-off can be changed by increasing the C
parameter that influences the number of violations of the margin
allowed in the training data which increases the bias but decreases
the variance.
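As a sketch of the k-nearest neighbors example, varying k on toy data shows the trade-off; the specific k values and data are illustrative:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

for k in (1, 5, 25):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    # Small k: low bias, high variance; large k: higher bias, lower variance
    print(k, model.score(X_train, y_train), model.score(X_test, y_test))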
There is no escaping the relationship between bias and variance in
machine learning.
In reality, we cannot calculate the real bias and variance error terms
because we do not know the actual underlying target function.
Nevertheless, as a framework, bias and variance provide the tools to
understand the behavior of machine learning algorithms in the pursuit of
predictive performance.
K-Fold Cross-Validation
Holdout Method
The error measured in this process tells us how well our model will
perform on unseen data. Although this approach is simple to perform, it
still suffers from high variance and can sometimes produce misleading
results.
o Train/test split: The input data is divided into two parts, a
training set and a test set, in a ratio of 70:30, 80:20, etc. This
method suffers from high variance, which is one of its biggest
disadvantages.
o Training Data: The training data is used to train the model,
and the dependent variable is known.
o Test Data: The test data is used to make predictions from the
model that is already trained on the training data. It has the
same features as the training data but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage
of the train/test split by splitting the dataset into groups of
train/test splits and averaging the results. It can be used to
optimize a model trained on the training dataset for the best
performance. It is more efficient than a single train/test split
because every observation is used for both training and testing
(see the sketch below).
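A minimal sketch of k-fold cross-validation with scikit-learn; the linear model, the synthetic data, and the choice of 5 folds are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

# Each fold serves once as the test set and k-1 times as training data
kf = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores)         # one R^2 score per fold
print(scores.mean())  # averaged result across the folds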
Limitations of Cross-Validation
Applications of Cross-Validation