Unsupervised ML Cases, by Daniel Ames Camayo
STANDARD LIBRARIES:
In [58]:
import pandas as pd
pd.Timestamp.today().strftime('%Y-%m-%d %H:%M:%S') # Capture the current date and time
Out[58]:
'2024-06-08 00:36:26'
In [59]:
import numpy as np
In [60]:
import matplotlib.pyplot as plt
from matplotlib import style
plt.style.use('ggplot')
#plt.style.use('seaborn-darkgrid')
#plt.style.use('fivethirtyeight')
In [61]:
import seaborn as sns
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
CUSTOMIZED LIBRARIES:
In [62]:
from sklearn.cluster import KMeans
KMeans
Out[62]:
sklearn.cluster._kmeans.KMeans
def __init__(n_clusters=8, *, init='k-means++', n_init='warn', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')
K-Means clustering.
Parameters
----------
In [63]:
from numpy import unique
unique
Out[63]:
<function unique at 0x79c6b6324ab0>
In [64]:
# Import the libraries we will use
from numpy import where
where
Out[64]:
<function where at 0x79c6b9527db0>
In [65]:
from sklearn.datasets import make_classification
make_classification
Out[65]:
sklearn.datasets._samples_generator.make_classification
def make_classification(n_samples=100, n_features=20, *, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
Generate a random n-class classification problem.
Read more in the :ref:`User Guide <sample_generators>`.
Parameters
----------
n_samples : int, default=100
    The number of samples.
n_features : int, default=20
    The total number of features. These comprise ``n_informative`` informative features, ``n_redundant`` redundant features, ``n_repeated`` duplicated features and ``n_features-n_informative-n_redundant-n_repeated`` useless features drawn at random.
n_informative : int, default=2
    The number of informative features. Each class is composed of a number of gaussian clusters each located around the vertices of a hypercube in a subspace of dimension ``n_informative``. For each cluster, informative features are drawn independently from N(0, 1) and then ...
DATA EXTRACTION:
In [67]:
# Create the synthetic dataset we will use in all the examples
# Parameters tuned to create 3 concentric clusters
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=37,
    class_sep=2,  # increases the separation between clusters
    # hypercube=False  # so the clusters are closer together and form circular blobs
)
In [68]:
# Add a little variability to the cluster distances to make them more concentric
rng = np.random.default_rng(42)
X += rng.normal(scale=0.1, size=X.shape)
In [69]:
## NumPy array
print(type( X ))
display(len( X )) # shape len
display( X[:3] ) # .head(3) .tail(2) [:3]
<class 'numpy.ndarray'>
1000
array([[-1.68839271, -1.97804483],
[-2.30557338, 1.69562228],
[-3.34048814, -2.5792299 ]])
In [70]:
# Multiply the second column by 10000
X[:, 1] = X[:, 1] * 10000
In [71]:
## NumPy array
print(type( X ))
display(len( X )) # shape len
display( X[:3] ) # .head(3) .tail(2) [:3]
<class 'numpy.ndarray'>
1000
array([[-1.68839271e+00, -1.97804483e+04],
[-2.30557338e+00, 1.69562228e+04],
[-3.34048814e+00, -2.57922990e+04]])
In [72]:
df_X = pd.DataFrame(X)
print(type( df_X ))
display(len( df_X )) # shape len
print( df_X.info() )
display( df_X.head() )
<class 'pandas.core.frame.DataFrame'>
1000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       1000 non-null   float64
 1   1       1000 non-null   float64
dtypes: float64(2)
memory usage: 15.8 KB
None
          0             1
0 -1.688393 -19780.448331
1 -2.305573  16956.222789
2 -3.340488 -25792.299004
3 -1.865527  23856.962445
4 -2.323384  16557.090853
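The cell that produces `X_standarized` does not appear in this export, although its output is shown below. A minimal sketch of the presumably omitted standardization step, assuming scikit-learn's StandardScaler (z-score per column):

from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean and unit variance so the second column
# (multiplied by 10000 above) no longer dominates the distance computations
scaler = StandardScaler()
X_standarized = scaler.fit_transform(X)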
In [115]:
## NumPy array
print(type( X_standarized ))
display(len( X_standarized )) # shape len
display( X_standarized[:3] ) # .head(3) .tail(2) [:3]
<class 'numpy.ndarray'>
1000
array([[-0.50490379, -0.67246203],
[-0.81575665, 1.23572057],
[-1.33700792, -0.98473062]])
In [76]:
df_X_standarized = pd.DataFrame(X_standarized)
display(df_X_standarized.describe())
display(df_X_standarized.sample(2))
display( df_X_standarized ) # .head(3) .tail(2)
0 1
0 -0.504904 -0.672462
1 -0.815757 1.235721
2 -1.337008 -0.984731
3 -0.594121 1.594160
4 -0.824727 1.214989
In [77]:
## NumPy array
print(type( y ))
display(len( y )) # shape len
display( y[:10] ) # .head(3) .tail(2) [:3]
<class 'numpy.ndarray'>
1000
array([2, 1, 2, 1, 1, 0, 1, 2, 2, 2])
In [78]:
# Create a scatter plot, iterating over each class
for class_value in range(3):
    # get the row indices for this class
    row_ix = where(y == class_value)
    # plot the points
    plt.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
plt.show()
K-Means
K-Means Basics
K-Means is one of the most popular and widely used clustering algorithms. Its main goal is to partition a dataset into K groups, where K is a number set in advance by the user. The algorithm works iteratively, assigning each data point to the nearest cluster according to the distances between the points and the cluster centroids.
The centroids are points that represent the center of gravity of each cluster; they are updated at every iteration to minimize the intra-cluster distance.
K-Means is a centroid-based algorithm because it generates mean points that are used to assign points to clusters.
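To make the assignment/update loop concrete, here is a minimal from-scratch sketch of a single Lloyd iteration (illustrative only; the notebook itself uses scikit-learn's KMeans, and the function name lloyd_iteration is hypothetical):

import numpy as np

def lloyd_iteration(X, centroids):
    # Assignment step: each point goes to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

Repeating this until the centroids stop moving (or max_iter is reached) is exactly what KMeans does internally.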
Practical Example
Create the model
In [79]:
# Select the number of clusters
modelo2_kmeans = KMeans(n_clusters=3)
In [80]:
%%time
# Fit the model on the standardized data
modelo2_kmeans.fit(X_standarized)
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
CPU times: user 389 ms, sys: 2.09 ms, total: 391 ms
Wall time: 426 ms
Out[80]:
▾ KMeans
KMeans(n_clusters=3)
In [81]:
# Assign a cluster to each example
yhat_norm = modelo2_kmeans.predict(X_standarized)
In [82]:
## NumPy array
print(type( yhat_norm ))
display(len( yhat_norm )) # shape len
display( yhat_norm[:10] ) # .head(3) .tail(2) [:3]
<class 'numpy.ndarray'>
1000
In [83]:
# extract the unique clusters
clusters = unique(yhat_norm)
clusters
Out[83]:
array([0, 1, 2], dtype=int32)
In [84]:
# create the scatter plot
for cluster in clusters:
    row_ix = where(yhat_norm == cluster)
    plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1])
# show the plot
plt.show()
EXPERIMENT, NON-STANDARDIZED VERSION:
In [85]:
# Select the number of clusters
modelo1_kmeans = KMeans(n_clusters=3)
modelo1_kmeans
Out[85]:
▾ KMeans
KMeans(n_clusters=3)
In [86]:
%%time
# Fit the model
modelo1_kmeans.fit(X)
modelo1_kmeans
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
Out[86]:
▾ KMeans
KMeans(n_clusters=3)
In [87]:
modelo1_kmeans.cluster_centers_
Out[87]:
array([[ 5.38517431e-01, -1.69191183e+04],
[-2.00613206e+00, 1.98913554e+04],
[-6.62387384e-01, -2.38390518e+04]])
In [88]:
modelo1_kmeans.labels_
Out[88]:
array([0, 1, 2, 1, 1, 0, 1, 2, 0, 0, 1, 2, 2, 0, 0, 1, 0, 1, 2, 1, 1, 1,
0, 1, 1, 0, 1, 0, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 1, 2, 2, 2, 0, 0,
0, 0, 2, 2, 1, 0, 1, 1, 2, 0, 2, 0, 1, 2, 0, 1, 2, 2, 0, 1, 2, 2,
2, 0, 2, 0, 2, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 2, 1, 2, 1, 1,
2, 2, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 2, 0, 2, 2, 1, 1, 1, 1, 2, 2,
2, 2, 1, 2, 2, 2, 2, 0, 1, 0, 2, 0, 2, 1, 0, 0, 1, 1, 0, 1, 2, 1,
1, 2, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 2, 1, 1, 0, 2, 2,
0, 0, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 0, 1, 2, 0, 2, 2, 2, 2,
0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 0, 0, 0, 2, 0,
0, 2, 0, 1, 1, 1, 2, 0, 2, 2, 2, 1, 0, 1, 1, 2, 2, 0, 0, 1, 1, 0,
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 0, 2, 0, 1, 0, 2, 0,
0, 2, 2, 0, 0, 1, 2, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 2, 1, 0,
2, 0, 0, 2, 0, 1, 1, 2, 2, 2, 1, 1, 0, 0, 2, 2, 1, 1, 0, 0, 2, 1,
1, 0, 1, 0, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 0, 0, 1, 1,
0, 1, 1, 0, 2, 2, 2, 0, 0, 2, 0, 2, 1, 1, 1, 0, 1, 0, 0, 2, 1, 2,
2, 2, 1, 2, 1, 0, 2, 0, 0, 2, 1, 2, 2, 1, 2, 0, 2, 0, 0, 1, 2, 0,
1, 2, 1, 1, 0, 0, 0, 1, 1, 2, 1, 0, 0, 2, 1, 0, 1, 1, 0, 0, 1, 2,
1, 0, 0, 0, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 0, 0, 0, 2, 0, 2, 1, 0,
2, 1, 0, 2, 2, 2, 2, 0, 1, 1, 0, 1, 2, 0, 2, 0, 1, 2, 0, 2, 2, 0,
0, 2, 1, 0, 1, 0, 1, 1, 0, 2, 2, 1, 2, 1, 0, 2, 2, 1, 1, 2, 2, 0,
1, 2, 0, 1, 2, 2, 0, 0, 0, 2, 1, 0, 0, 0, 2, 1, 0, 1, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 1, 2, 0, 1, 1, 2, 1, 0, 1, 2, 1, 1, 0, 0, 1, 0,
2, 0, 0, 2, 0, 1, 0, 1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 2, 0, 2, 2, 1,
0, 2, 1, 1, 0, 2, 1, 2, 2, 1, 0, 0, 1, 2, 0, 0, 2, 0, 0, 0, 1, 1,
0, 1, 0, 1, 0, 1, 0, 2, 2, 2, 1, 2, 1, 0, 2, 2, 2, 1, 2, 2, 0, 1,
0, 0, 1, 2, 0, 0, 1, 1, 0, 2, 2, 0, 2, 1, 1, 1, 0, 1, 2, 0, 2, 2,
0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 2, 0, 0, 1, 1, 2, 1,
2, 2, 1, 0, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 1, 0, 0, 0, 1, 0, 0, 2,
2, 2, 2, 0, 1, 0, 0, 2, 0, 2, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1, 0, 1,
0, 1, 1, 0, 0, 0, 2, 2, 0, 2, 2, 0, 0, 2, 0, 0, 1, 2, 1, 1, 2, 1,
1, 0, 0, 2, 1, 2, 2, 0, 1, 0, 2, 1, 1, 2, 2, 2, 0, 1, 0, 0, 0, 0,
0, 2, 1, 0, 2, 0, 2, 1, 1, 2, 0, 0, 1, 0, 0, 0, 1, 2, 1, 2, 2, 0,
2, 2, 1, 1, 1, 2, 0, 0, 1, 0, 0, 0, 2, 2, 2, 0, 1, 1, 2, 0, 1, 1,
2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 1, 1, 1, 2,
0, 1, 1, 0, 2, 2, 0, 1, 1, 0, 2, 1, 0, 2, 1, 2, 0, 0, 2, 2, 0, 2,
1, 2, 2, 2, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2, 1, 2, 2, 2,
1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 0, 1, 2, 0, 2, 2, 2, 1, 1, 1, 2, 1,
1, 1, 1, 1, 1, 2, 1, 0, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0, 1, 2, 2, 2,
1, 2, 2, 0, 1, 1, 0, 0, 2, 0, 1, 0, 1, 0, 1, 1, 2, 0, 0, 1, 2, 1,
1, 2, 2, 1, 2, 0, 0, 1, 1, 1, 0, 0, 2, 2, 0, 2, 0, 2, 1, 1, 0, 0,
1, 0, 2, 0, 2, 0, 2, 2, 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 2, 2, 0, 1,
0, 0, 1, 2, 2, 1, 0, 0, 0, 1, 0, 0, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2,
0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 1, 0, 1, 1, 0,
2, 2, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 0, 1, 0, 0, 1, 1, 0, 2, 2, 0,
2, 1, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 1, 0, 2, 0, 2, 1, 1,
0, 0, 1, 0, 2, 2, 2, 1, 2, 0], dtype=int32)
In [89]:
modelo1_kmeans.n_iter_
Out[89]:
7
In [90]:
# Assign a cluster to each example
yhat = modelo1_kmeans.predict(X)
In [91]:
yhat
Out[91]:
array([0, 1, 2, 1, 1, 0, 1, 2, 0, 0, 1, 2, 2, 0, 0, 1, 0, 1, 2, 1, 1, 1,
0, 1, 1, 0, 1, 0, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 1, 2, 2, 2, 0, 0,
0, 0, 2, 2, 1, 0, 1, 1, 2, 0, 2, 0, 1, 2, 0, 1, 2, 2, 0, 1, 2, 2,
2, 0, 2, 0, 2, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 2, 1, 2, 1, 1,
2, 2, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 2, 0, 2, 2, 1, 1, 1, 1, 2, 2,
2, 2, 1, 2, 2, 2, 2, 0, 1, 0, 2, 0, 2, 1, 0, 0, 1, 1, 0, 1, 2, 1,
1, 2, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 2, 1, 1, 0, 2, 2,
0, 0, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 0, 1, 2, 0, 2, 2, 2, 2,
0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 0, 0, 0, 2, 0,
0, 2, 0, 1, 1, 1, 2, 0, 2, 2, 2, 1, 0, 1, 1, 2, 2, 0, 0, 1, 1, 0,
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 0, 2, 0, 1, 0, 2, 0,
0, 2, 2, 0, 0, 1, 2, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 2, 1, 0,
2, 0, 0, 2, 0, 1, 1, 2, 2, 2, 1, 1, 0, 0, 2, 2, 1, 1, 0, 0, 2, 1,
1, 0, 1, 0, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 0, 0, 1, 1,
0, 1, 1, 0, 2, 2, 2, 0, 0, 2, 0, 2, 1, 1, 1, 0, 1, 0, 0, 2, 1, 2,
2, 2, 1, 2, 1, 0, 2, 0, 0, 2, 1, 2, 2, 1, 2, 0, 2, 0, 0, 1, 2, 0,
1, 2, 1, 1, 0, 0, 0, 1, 1, 2, 1, 0, 0, 2, 1, 0, 1, 1, 0, 0, 1, 2,
1, 0, 0, 0, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 0, 0, 0, 2, 0, 2, 1, 0,
2, 1, 0, 2, 2, 2, 2, 0, 1, 1, 0, 1, 2, 0, 2, 0, 1, 2, 0, 2, 2, 0,
0, 2, 1, 0, 1, 0, 1, 1, 0, 2, 2, 1, 2, 1, 0, 2, 2, 1, 1, 2, 2, 0,
1, 2, 0, 1, 2, 2, 0, 0, 0, 2, 1, 0, 0, 0, 2, 1, 0, 1, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 1, 2, 0, 1, 1, 2, 1, 0, 1, 2, 1, 1, 0, 0, 1, 0,
2, 0, 0, 2, 0, 1, 0, 1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 2, 0, 2, 2, 1,
0, 2, 1, 1, 0, 2, 1, 2, 2, 1, 0, 0, 1, 2, 0, 0, 2, 0, 0, 0, 1, 1,
0, 1, 0, 1, 0, 1, 0, 2, 2, 2, 1, 2, 1, 0, 2, 2, 2, 1, 2, 2, 0, 1,
0, 0, 1, 2, 0, 0, 1, 1, 0, 2, 2, 0, 2, 1, 1, 1, 0, 1, 2, 0, 2, 2,
0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 2, 0, 0, 1, 1, 2, 1,
2, 2, 1, 0, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 1, 0, 0, 0, 1, 0, 0, 2,
2, 2, 2, 0, 1, 0, 0, 2, 0, 2, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1, 0, 1,
0, 1, 1, 0, 0, 0, 2, 2, 0, 2, 2, 0, 0, 2, 0, 0, 1, 2, 1, 1, 2, 1,
1, 0, 0, 2, 1, 2, 2, 0, 1, 0, 2, 1, 1, 2, 2, 2, 0, 1, 0, 0, 0, 0,
0, 2, 1, 0, 2, 0, 2, 1, 1, 2, 0, 0, 1, 0, 0, 0, 1, 2, 1, 2, 2, 0,
2, 2, 1, 1, 1, 2, 0, 0, 1, 0, 0, 0, 2, 2, 2, 0, 1, 1, 2, 0, 1, 1,
2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 1, 1, 1, 2,
0, 1, 1, 0, 2, 2, 0, 1, 1, 0, 2, 1, 0, 2, 1, 2, 0, 0, 2, 2, 0, 2,
1, 2, 2, 2, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2, 1, 2, 2, 2,
1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 0, 1, 2, 0, 2, 2, 2, 1, 1, 1, 2, 1,
1, 1, 1, 1, 1, 2, 1, 0, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0, 1, 2, 2, 2,
1, 2, 2, 0, 1, 1, 0, 0, 2, 0, 1, 0, 1, 0, 1, 1, 2, 0, 0, 1, 2, 1,
1, 2, 2, 1, 2, 0, 0, 1, 1, 1, 0, 0, 2, 2, 0, 2, 0, 2, 1, 1, 0, 0,
1, 0, 2, 0, 2, 0, 2, 2, 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 2, 2, 0, 1,
0, 0, 1, 2, 2, 1, 0, 0, 0, 1, 0, 0, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2,
0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 1, 0, 1, 1, 0,
2, 2, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 0, 1, 0, 0, 1, 1, 0, 2, 2, 0,
2, 1, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 1, 0, 2, 0, 2, 1, 1,
0, 0, 1, 0, 2, 2, 2, 1, 2, 0], dtype=int32)
In [92]:
# extract the unique clusters
clusters = unique(yhat)
clusters
Out[92]:
array([0, 1, 2], dtype=int32)
In [93]:
# create the scatter plot
for cluster in clusters:
    row_ix = where(yhat == cluster)
    plt.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
plt.show()
MODEL 2:
In [94]:
X_standarized
Out[94]:
array([[-0.50490379, -0.67246203],
[-0.81575665, 1.23572057],
[-1.33700792, -0.98473062],
...,
[-0.73553231, 1.41379671],
[ 1.28632355, -0.7743968 ],
[-0.24777382, -0.4112469 ]])
In [95]:
from sklearn.cluster import MeanShift
MeanShift
Out[95]:
sklearn.cluster._mean_shift.MeanShift
def __init__(*, bandwidth=None, seeds=None, bin_seeding=False, min_bin_freq=1, cluster_all=True, n_jobs=None, max_iter=300)
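The cells that instantiate and fit the MeanShift model (presumably In [96] and In [97]) are missing from this export. A minimal sketch of what they likely contained; the variable name modelo3_meanshift is an assumption, and only yhat is referenced afterwards:

# Create and fit the model; MeanShift estimates the number of clusters itself
modelo3_meanshift = MeanShift()
yhat = modelo3_meanshift.fit_predict(X_standarized)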
In [98]:
clusters = unique(yhat)
In [99]:
for cluster in clusters:
    row_ix = np.where(yhat == cluster)
    if cluster == -1:
        # points labeled -1 (orphans/noise) are drawn in gray
        plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1], c='#666160', s=10)  # adjust the size (s) as needed
    else:
        plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1], s=10)  # adjust the size (s) as needed
plt.show()
MeanShift
MODEL 3:
In [100]:
from sklearn.cluster import DBSCAN
DBSCAN
Out[100]:
sklearn.cluster._dbscan.DBSCAN
def __init__(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
Perform DBSCAN clustering from vector array or distance matrix.
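The cells that create and fit the DBSCAN model (between In [100] and In [103]) are missing from this export. A minimal sketch of the likely content; the variable name and the eps/min_samples values are assumptions:

# Create and fit the model; points that fit no dense region are labeled -1 (noise)
modelo3_dbscan = DBSCAN(eps=0.5, min_samples=5)
yhat = modelo3_dbscan.fit_predict(X_standarized)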
In [103]:
clusters = unique(yhat)
Advantages and Disadvantages
Advantages:
It can find clusters of arbitrary shapes and different sizes.
It can identify noise points (outliers) in the dataset.
There is no need to specify the number of clusters in advance.
Disadvantages:
It is sensitive to the choice of the parameters ε and MinPts, which must be tuned carefully.
It can struggle with high-dimensional datasets.
MODEL 4:
In [104]:
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
In [105]:
# using distance_threshold=0 builds the full merge tree (every grouping is created)
model4 = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model4
Out[105]:
▾ AgglomerativeClustering
AgglomerativeClustering(distance_threshold=0, n_clusters=None)
In [106]:
def plot_dendrogram(model, **kwargs):
    # Count the samples under each node (leaves count 1, internal nodes recurse)
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        counts[i] = sum(1 if child < n_samples else counts[child - n_samples] for child in merge)
    # Build the linkage matrix and plot the dendrogram
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
    dendrogram(linkage_matrix, **kwargs)
In [107]:
model4 = model4.fit(X_standarized)
model4
Out[107]:
▾ AgglomerativeClustering
AgglomerativeClustering(distance_threshold=0, n_clusters=None)
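No surviving cell in this export actually calls plot_dendrogram on the fitted model; a minimal usage sketch (the truncate_mode and p values are illustrative, not from the original):

# Plot only the top 3 levels of the dendrogram for readability
plot_dendrogram(model4, truncate_mode='level', p=3)
plt.show()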
In [108]:
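The body of this cell is missing from the export, and the next cell plots a list sse that is never defined elsewhere. A minimal sketch of the usual elbow-method loop that would produce it, assuming KMeans inertia on the standardized data:

# Within-cluster sum of squares (inertia) for k = 1..10
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10)
    km.fit(X_standarized)
    sse.append(km.inertia_)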
In [109]:
plt.style.use("fivethirtyeight")
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Dispersion")
plt.show()
In [110]:
model5 = AgglomerativeClustering(n_clusters=3)
yhat = model5.fit_predict(X_standarized)
In [111]:
clusters = unique(yhat)
clusters
Out[111]:
array([0, 1, 2])
In [112]:
for cluster in clusters:
    row_ix = where(yhat == cluster)
    plt.scatter(X[row_ix, 0], X[row_ix, 1], s=10)
plt.show()
In [113]:
from sklearn.metrics import silhouette_score
Compares the similarity between the predicted cluster labels and the true labels, adjusting for the possibility of chance agreement. A value close to 1 indicates strong agreement between the predicted and true labels.
In [114]:
from sklearn.metrics import adjusted_rand_score
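The notebook ends with the metric imports; a minimal usage sketch for both metrics applied to the last clustering (illustrative; silhouette_score takes the data and predicted labels, adjusted_rand_score the true and predicted labels):

# Silhouette: cohesion vs. separation, in [-1, 1]; higher is better
print(silhouette_score(X_standarized, yhat))
# Adjusted Rand Index: chance-corrected agreement with the true labels y
print(adjusted_rand_score(y, yhat))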