Datascience Notes
Classification
Classification deals with predicting categorical target variables, which
represent discrete classes or labels. For instance, classifying emails as spam or
not spam, or predicting whether a patient has a high risk of heart disease.
Classification algorithms learn to map the input features to one of the predefined
classes.
Here are some classification algorithms:
Logistic Regression
Support Vector Machine
Random Forest
Decision Tree
K-Nearest Neighbors (KNN)
Naive Bayes
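As a rough illustration of mapping input features to a predefined class, here is a minimal sketch of K-Nearest Neighbors written from scratch. The function name knn_predict, the two features, and the toy data are assumptions made for this example; they do not come from the notes.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and take a majority vote
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): two numeric features per email
train_points = [(0, 1), (1, 0), (1, 1), (7, 8), (8, 7), (9, 9)]
train_labels = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]

prediction = knn_predict(train_points, train_labels, query=(8, 8), k=3)
print(prediction)
```

In practice a library implementation would normally be preferred; the sketch only shows the idea of voting among the nearest labeled examples.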
Regression
Regression, on the other hand, deals with predicting continuous target
variables, which represent numerical values. For example, predicting the price of
a house based on its size, location, and amenities, or forecasting the sales of a
product. Regression algorithms learn to map the input features to a continuous
numerical value.
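The house-price example can be sketched as a straight-line fit. This is a minimal illustration that assumes a single feature (size) and made-up numbers, and uses NumPy's polyfit for ordinary least squares; a real model would use more features.

```python
import numpy as np

# Toy data (made up): house size in square metres, price in thousands
sizes = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
prices = np.array([150.0, 210.0, 270.0, 330.0, 390.0])

# Fit price = slope * size + intercept by ordinary least squares
slope, intercept = np.polyfit(sizes, prices, deg=1)

# Predict a continuous value for an unseen size
predicted_price = slope * 100.0 + intercept
print(slope, intercept, predicted_price)
```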
Clustering
Clustering is the process of grouping data points into clusters based on their
similarity. This technique is useful for identifying patterns and relationships in
data without the need for labeled examples.
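Here are some clustering algorithms:
K-Means Clustering algorithm
Mean-shift algorithm
DBSCAN Algorithm
Two related unsupervised techniques are used for dimensionality reduction and
feature extraction rather than for clustering:
Principal Component Analysis
Independent Component Analysis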
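To make the idea of grouping by similarity concrete, here is a minimal K-Means sketch in NumPy. The function name k_means, the toy points, and the explicit initial centroids are assumptions chosen to keep the example deterministic; library implementations usually pick starting centroids randomly.

```python
import numpy as np


def k_means(points, initial_centroids, n_iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    centroids = np.array(initial_centroids, dtype=float)
    for _ in range(n_iterations):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        # Assign each point to its nearest centroid
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy data (made up): two well-separated groups, no labels given
points = np.array(
    [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
)
labels, centroids = k_means(points, initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
print(labels)
print(centroids)
```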
Association
Association rule learning is a technique for discovering relationships between
items in a dataset. It identifies rules indicating that the presence of one item
implies the presence of another item with a specific probability.
Here are some association rule learning algorithms:
Apriori Algorithm
Eclat
FP-growth Algorithm
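The "specific probability" attached to a rule is usually measured by its support and confidence. The sketch below computes both for a single rule; the helper names and the shopping-basket data are made up for illustration, and it is not an implementation of Apriori, Eclat, or FP-growth.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= set(transaction))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Toy data (made up): each set is one shopping basket
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule: bread -> milk
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here bread and milk appear together in 2 of 4 baskets, and milk appears in 2 of the 3 baskets that contain bread.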
Advantages of Unsupervised Machine Learning
It helps to discover hidden patterns and relationships in the data.
It is used for tasks such as customer segmentation, anomaly
detection, and data exploration.
It does not require labeled data, which reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning
Without labels, it may be difficult to evaluate the quality of the model's
output.
The resulting clusters may be hard to interpret and may not correspond to
meaningful groups.
It has techniques such as autoencoders and dimensionality reduction that can
be used to extract meaningful features from raw data.
Exploratory data analysis (EDA): Explore data and gain insights before
defining specific tasks.
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between
supervised and unsupervised learning: it uses both labeled and unlabeled data.
It is particularly useful when obtaining labeled data is costly, time-consuming,
or resource-intensive, for example when labeling requires specialized skills and
resources.
We use these techniques when a small portion of the data is labeled and the
large remainder is unlabeled. We can use unsupervised techniques to predict
labels and then feed these labels to supervised techniques. This approach is
mostly applicable to image datasets, where usually not all images are labeled.
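The predict-then-retrain idea can be sketched with pseudo-labels. This is a minimal illustration and swaps in a simple nearest-centroid rule for both steps, which is an assumption of this example rather than a method named in the notes; the data and function names are also made up.

```python
import numpy as np


def nearest_centroid_fit(points, labels):
    """Return the class labels and one mean point (centroid) per class."""
    classes = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def nearest_centroid_predict(classes, centroids, points):
    """Label each point with the class of its closest centroid."""
    distances = np.linalg.norm(
        points[:, None, :] - centroids[None, :, :], axis=2
    )
    return classes[distances.argmin(axis=1)]


# Toy data (made up): two labeled points and four unlabeled ones
labeled_points = np.array([[1.0, 1.0], [9.0, 9.0]])
labeled_classes = np.array([0, 1])
unlabeled_points = np.array([[1.5, 0.5], [0.5, 2.0], [8.0, 9.5], [9.5, 8.0]])

# Step 1: predict pseudo-labels for the unlabeled points
classes, centroids = nearest_centroid_fit(labeled_points, labeled_classes)
pseudo_labels = nearest_centroid_predict(classes, centroids, unlabeled_points)

# Step 2: retrain on the labeled and pseudo-labeled data together
all_points = np.vstack([labeled_points, unlabeled_points])
all_classes = np.concatenate([labeled_classes, pseudo_labels])
classes, centroids = nearest_centroid_fit(all_points, all_classes)
print(pseudo_labels)
print(centroids)
```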
Example: Suppose you are training an AI agent to play a game such as chess.
The agent explores different moves and receives positive or negative feedback
based on the outcome. Reinforcement learning also finds applications in which
agents learn to perform tasks by interacting with their surroundings.
UNIT IV DATA VISUALIZATION
Importing Matplotlib
Import matplotlib.pyplot using the alias plt, which is the alias used by
convention for this submodule.
Matplotlib is a library that contains several submodules, such as pyplot.
After defining the two lists days and steps_walked, you use two of
the functions from matplotlib.
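A minimal sketch of this import-and-plot pattern follows. The values in days and steps_walked are hypothetical, and the choice of plt.plot and plt.show as the two functions is an assumption, since the notes do not name them.

```python
import matplotlib.pyplot as plt

# Hypothetical values; the original data for these lists is not shown
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
steps_walked = [8934, 14902, 3409, 25672, 12300]

# plt.plot draws a line on the current axes and returns the line objects
lines = plt.plot(days, steps_walked)

# plt.show()  # uncomment when running as a script rather than in a notebook
```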
Matplotlib is a popular data visualization library in Python. It is often
used for creating static, interactive, and animated visualizations.
Matplotlib allows you to generate plots, histograms, bar
charts, scatter plots, and more.
Simple line plots
A line chart, also referred to as a line graph or a line plot, connects a series of data
points using a line. This chart type presents sequential values to help you identify
trends. Most of the time, the x-axis (horizontal axis) represents a sequential
progression of values.
Perhaps the simplest of all plots is the visualization of a single function y = f(x).
Here we will take a first look at creating a simple plot of this type. As with all the
following sections, we'll start by setting up the notebook for plotting and importing
the packages we will use:
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')  # named 'seaborn-whitegrid' before Matplotlib 3.6
import numpy as np
For all Matplotlib plots, we start by creating a figure and an axes. In their simplest
form, a figure and axes can be created as follows:
In [2]:
fig = plt.figure()
ax = plt.axes()
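Once the figure and axes exist, a function can be drawn with ax.plot. The sketch below assumes f(x) = sin(x) on the interval [0, 10] purely as an example function.

```python
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = plt.axes()

# Sample f(x) = sin(x) at 1000 evenly spaced points on [0, 10]
x = np.linspace(0, 10, 1000)

# ax.plot connects the sampled points with a line and returns the line objects
(line,) = ax.plot(x, np.sin(x))
```

In a notebook with %matplotlib inline the figure is displayed automatically at the end of the cell.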
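import numpy as np
# xlist and ylist are not defined in the original; these values are assumed
xlist = np.linspace(-3.0, 3.0, 3)
ylist = np.linspace(-3.0, 3.0, 4)
# meshgrid expands the two 1-D arrays into 2-D coordinate grids
X, Y = np.meshgrid(xlist, ylist)
print(xlist)
print(ylist)
print(X)
print(Y)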
Histograms
A histogram is a graph used to represent the frequency distribution of
the data points of one variable.
Histograms often classify data into various “bins” or “range groups” and
count how many data points belong to each of those bins.
The histogram was invented by Karl Pearson, an English mathematician.
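The binning and counting described above can be sketched with Matplotlib's hist. The normally distributed sample and the choice of 10 bins are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up sample: 1000 draws from a standard normal distribution
rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=1000)

fig, ax = plt.subplots()
# hist splits the data range into 10 bins and counts the points in each one
counts, bin_edges, patches = ax.hist(data, bins=10)
ax.set_xlabel("value")
ax.set_ylabel("frequency")
```

The returned counts hold the number of points per bin, and bin_edges holds the 11 boundaries of the 10 bins.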
Legends
A legend is used to identify data in visualizations by its color, size, or other
distinguishing features. Simply connect one or more data visualizations to a legend
and they will automatically display a table of symbols and descriptions to help
users understand what is being displayed.
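In Matplotlib, a legend is typically built from the label given to each plotted series. The following is a minimal sketch with two example curves chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 10, 200)

# Each label becomes one entry in the legend
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")

# ax.legend collects the labeled lines and draws the key
legend = ax.legend(loc="upper right")
```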
Colours
Color selection in data visualization is not merely an aesthetic choice; it is a
crucial tool for conveying quantitative information. Properly selected colors convey
the underlying data accurately, in contrast to many color schemes commonly used in
visualization that distort relationships between data values.
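One way to let color carry quantitative information in Matplotlib is to map values through a colormap and show a colorbar. The sketch below assumes the perceptually uniform viridis colormap and random example data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 50 random points, colored by the sum of their coordinates
rng = np.random.default_rng(0)
x = rng.random(50)
y = rng.random(50)
values = x + y

fig, ax = plt.subplots()
# c supplies the numbers to encode; cmap chooses how numbers map to colors
scatter = ax.scatter(x, y, c=values, cmap="viridis")
# The colorbar shows the reader which color corresponds to which value
colorbar = fig.colorbar(scatter, ax=ax, label="x + y")
```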
Subplot
Subplots are like mini-plots positioned in a grid arrangement. They are
useful for displaying different plot types for the same data, or the same plot type
with subsets of your data.
For example, you might have 4 plots arranged in a 2 by 2 grid like this.
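A 2 by 2 grid of this kind can be sketched with plt.subplots. The four plots chosen here are arbitrary examples.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# axes is a 2 x 2 array; axes[row, column] selects one mini-plot
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("line: sin(x)")

axes[0, 1].plot(x, np.cos(x))
axes[0, 1].set_title("line: cos(x)")

axes[1, 0].hist(np.sin(x), bins=10)
axes[1, 0].set_title("histogram of sin(x)")

axes[1, 1].scatter(x, np.cos(x), s=5)
axes[1, 1].set_title("scatter: cos(x)")

# Adjust spacing so titles and tick labels do not overlap
fig.tight_layout()
```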
Text and annotation
Text visualization is one of the most important tools for text mining because it is
readable by both humans and machines. Text visualization is mainly achieved
through the use of graphs, charts, word clouds, maps, networks, timelines, and so on.
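Within Matplotlib itself, text is placed on a plot with ax.text, and ax.annotate adds a note with an arrow pointing at a data point. The curve, coordinates, and wording below are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 2 * np.pi, 200)
ax.plot(x, np.sin(x))

# Plain text at data coordinates (4.0, 0.5)
label = ax.text(4.0, 0.5, "y = sin(x)")

# Annotation: xy is the point being marked, xytext is where the note sits
note = ax.annotate(
    "maximum",
    xy=(np.pi / 2, 1.0),
    xytext=(3.0, 0.9),
    arrowprops=dict(arrowstyle="->"),
)
```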
Customization
Custom visuals in Power BI refer to visualizations that are not built into the
standard set of charts and graphs provided by Power BI Desktop. These custom
visuals are created by third-party developers or by users themselves using the Power
BI custom visuals SDK.
Three dimensional plotting
3D scatter plots display data points in a three-dimensional space, utilizing
three axes (X, Y, and Z) to represent three different variables.
3D data visualization is the process of creating three-dimensional
representations of data sets. This can include anything from simple graphs and charts
to complex models and simulations.
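A 3D scatter plot can be sketched in Matplotlib by creating axes with the 3d projection. The random points and the use of the z value for color are assumptions of this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 100 random points, one variable per axis
rng = np.random.default_rng(1)
x, y, z = rng.random((3, 100))

fig = plt.figure()
# projection="3d" creates axes with X, Y, and Z dimensions
ax = fig.add_subplot(projection="3d")
ax.scatter(x, y, z, c=z, cmap="viridis")

ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")
```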
Geographic data and basemap
Matplotlib's main tool for this type of visualization is the Basemap toolkit,
which is one of several Matplotlib toolkits that live under the mpl_toolkits
namespace.