Casos de ML Unsupervised Daniel Ames Camayo

LAB GUIDE

SURNAMES, Given names: Ames Camayo Daniel Vides

Date: June 7, 2024

STANDARD LIBRARIES:
In [58]:

import pandas as pd
pd.Timestamp.today().strftime('%Y-%m-%d %H:%M:%S') # Capture the current date and time

Out[58]:
'2024-06-08 00:36:26'

In [59]:

import matplotlib.pyplot as plt

import numpy as np

In [60]:
from matplotlib import style
plt.style.use('ggplot')
#plt.style.use('seaborn-darkgrid')
#plt.style.use('fivethirtyeight')

from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 10,5
plt.rcParams['axes.facecolor'] = 'white'

In [61]:
import seaborn as sns
import sys
import warnings
# Silence warnings unless the interpreter was started with explicit -W options
if not sys.warnoptions:
    warnings.simplefilter("ignore")

CUSTOMIZED LIBRARIES:
In [62]:
from sklearn.cluster import KMeans
KMeans
Out[62]:

sklearn.cluster._kmeans.KMeans
def __init__(n_clusters=8, *, init='k-means++', n_init='warn', max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')
K-Means clustering.

In [66]:
from scipy.stats import zscore
zscore
Out[66]:

scipy.stats._stats_py.zscore
def zscore(a, axis=0, ddof=0, nan_policy='propagate')
Compute the z score.

Compute the z score of each value in the sample, relative to the
sample mean and standard deviation.

Parameters
----------
a : array_like
    An array like object containing the sample data.
axis : int or None, optional
    Axis along which to operate. Default is 0. If None, compute over
    the whole array `a`.
ddof : int, optional
    Degrees of freedom correction in the calculation of the
    standard deviation. Default is 0.
nan_policy : {'propagate', 'raise', 'omit'}, optional
    Defines how to handle when input contains nan. 'propagate' returns nan,
    'raise' throws an error, 'omit' performs the calculations ignoring nan
    values. Default is 'propagate'. Note that when the value is 'omit',
    nans in the input also propagate to the output, but they do not affect
    the z-scores computed for the non-nan values.

DATA EXTRACTION:
In [67]:
from sklearn.datasets import make_classification

# Create a synthetic dataset that is used in all the examples
# Parameters modified to create 3 concentric clusters
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=37,
    class_sep=2,  # Increases the separation between clusters
    # hypercube=False  # So that the clusters look more like circular clouds
)

In [68]:
# Add a small amount of Gaussian noise to the points (intended to make the clusters look more concentric)
rng = np.random.default_rng(42)
X += rng.normal(scale=0.1, size=X.shape)

In [69]:

## NumPy array
print(type( X ))
display(len( X )) # shape len
display( X[:3] ) # .head(3) .tail(2) [:3]

1000

array([[-1.68839271, -1.97804483],
[-2.30557338, 1.69562228],
[-3.34048814, -2.5792299 ]])

In [70]:
# Multiply the second column by 10000
X[:, 1] = X[:, 1] * 10000
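
To see why this rescaling matters for a distance-based method such as K-Means, the sketch below measures how much each column contributes to the squared Euclidean distance between two samples. It reuses the first two rows of X printed above; the helper `column_share` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# First two rows of X as printed above (before the multiplication).
a = np.array([-1.68839271, -1.97804483])
b = np.array([-2.30557338, 1.69562228])
scale = np.array([1.0, 10000.0])  # same factor the cell applies to column 1


def column_share(p, q):
    """Fraction of the squared Euclidean distance contributed by each column."""
    sq = (p - q) ** 2
    return sq / sq.sum()


share_before = column_share(a, b)
share_after = column_share(a * scale, b * scale)
print("before scaling:", share_before)  # both columns contribute
print("after scaling: ", share_after)   # column 1 carries essentially all of the distance
```

After the multiplication, column 0 has almost no influence on which centroid is nearest, which is the effect the unstandardized experiment is meant to expose.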

In [71]:
## NumPy array
print(type( X ))
display(len( X )) # shape len
display( X[:3] ) # .head(3) .tail(2) [:3]

1000

array([[-1.68839271e+00, -1.97804483e+04],
[-2.30557338e+00, 1.69562228e+04],
[-3.34048814e+00, -2.57922990e+04]])

In [72]:
df_X = pd.DataFrame(X)

E.D.A - Exploratory Data Analysis: Descriptive Statistical Analysis
In [73]:
## Pandas DataFrame
print(type( df_X ))
display(len( df_X )) # shape len
display( df_X.info() )
display(df_X.describe())
display(df_X.sample(2))
display( df_X ) # .head(3) .tail(2)

1000

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       1000 non-null   float64
 1   1       1000 non-null   float64
dtypes: float64(2)
memory usage: 15.8 KB

None

                 0             1
count  1000.000000   1000.000000
mean     -0.685935  -6834.089666
std       1.986437  19261.811381
min      -4.397035 -34008.454061
25%      -2.139444 -21651.383469
50%      -1.671931 -17380.589114
75%       1.485112  18433.092203
max       3.728174  25751.020011

            0             1
840 -1.747337  19397.900488
851 -1.349984  20408.947698

            0             1
0   -1.688393 -19780.448331
1   -2.305573  16956.222789
2   -3.340488 -25792.299004
3   -1.865527  23856.962445
4   -2.323384  16557.090853
...       ...           ...
995 -4.339401 -32187.439490
996 -2.082259 -20542.146294
997 -2.146293  20384.576406
998  1.867987 -21742.914665
999 -1.177876 -14751.488279

1000 rows × 2 columns

TRANSFORMATION AND/OR CLEANING - PRE-PROCESSING: ETL
In [74]:
# Important! Rescale the data (for example with a z-score)
X_standarized = zscore( X, axis=0 )
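
As a rough illustration of what `zscore(X, axis=0)` computes, the sketch below standardizes each column by hand. The function name `zscore_manual` and the random `demo` array are assumptions made for this example; they are not part of the lab. It also shows why `describe()` reports a standard deviation of about 1.0005 rather than exactly 1 for the standardized data: `zscore` divides by the population standard deviation (ddof=0), while pandas reports the sample standard deviation (ddof=1).

```python
import numpy as np


def zscore_manual(X, ddof=0):
    """Column-wise z-score: subtract each column's mean, divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=ddof)


rng = np.random.default_rng(0)
# Illustrative data with two very different scales (not the lab's X).
demo = rng.normal(size=(1000, 2)) * np.array([1.0, 10000.0])
Z = zscore_manual(demo)

pop_std = Z.std(axis=0, ddof=0)     # what zscore normalizes to 1 by default
sample_std = Z.std(axis=0, ddof=1)  # what pandas .describe() reports
print(pop_std, sample_std)
```

For n = 1000 samples the two differ by a factor of sqrt(n / (n - 1)), which is where the 1.0005 comes from.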

In [115]:
## NumPy array
print(type( X_standarized ))
display(len( X_standarized )) # shape len
display( X_standarized[:3] ) # .head(3) .tail(2) [:3]

1000

array([[-0.50490379, -0.67246203],
[-0.81575665, 1.23572057],
[-1.33700792, -0.98473062]])

In [76]:
df_X_standarized = pd.DataFrame(X_standarized)
display(df_X_standarized.describe())
display(df_X_standarized.sample(2))
display( df_X_standarized ) # .head(3) .tail(2)

                  0             1
count  1.000000e+03  1.000000e+03
mean   3.197442e-17  4.209966e-16
std    1.000500e+00  1.000500e+00
min   -1.869155e+00 -1.411496e+00
25%   -7.320828e-01 -7.696425e-01
50%   -4.966127e-01 -5.478081e-01
75%    1.093482e+00  1.312432e+00
max    2.223236e+00  1.692541e+00

            0         1
803 -0.377862  1.474280
513  1.600691 -0.780938

            0         1
0   -0.504904 -0.672462
1   -0.815757  1.235721
2   -1.337008 -0.984731
3   -0.594121  1.594160
4   -0.824727  1.214989
...       ...       ...
995 -1.840126 -1.316908
996 -0.703281 -0.712026
997 -0.735532  1.413797
998  1.286324 -0.774397
999 -0.247774 -0.411247

1000 rows × 2 columns


Bivariate and Multivariate Analysis: Analysis of each variable "x(i)" versus the target variable "y"
In [77]:
## NumPy array
print(type( y ))
display(len( y )) # shape len
display( y[:10] ) # .head(3) .tail(2) [:3]

1000

array([2, 1, 2, 1, 1, 0, 1, 2, 2, 2])

In [78]:
# Create a scatter plot, iterating over each class
for class_value in range(3):
    # get the row indices of the samples in this class
    row_ix = np.where(y == class_value)
    # add this class's points to the plot
    plt.scatter(X[row_ix, 0], X[row_ix, 1])
# show the figure
plt.show()

K-Means
Basic Concepts of K-Means
K-Means is one of the most popular and widely used clustering algorithms. Its main goal is to partition a
dataset into K groups, where K is a number chosen in advance by the user.
The algorithm works iteratively, assigning each data point to the nearest cluster according to the
distance between the point and the cluster centroids.
Centroids are points that represent the center of mass of a cluster; they are updated on every
iteration to reduce the within-cluster distance.
K-Means is a centroid-based algorithm because it computes mean points that are then used to assign the
clusters.

How the Algorithm Works

1. Initialization: Select K initial centroids, either at random or with a specific method (for
example, K-Means++).
2. Assignment: Assign each data point to the cluster whose centroid is nearest.
3. Update: Recompute the centroid of each cluster as the mean of all points assigned to
that cluster.
4. Repeat steps 2 and 3 until the centroids converge or a maximum number of iterations is reached.
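
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: it uses plain random initialization instead of K-Means++, a single run instead of several restarts, and the names `kmeans_sketch` and `blobs` are made up for the example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=300, tol=1e-4, seed=0):
    """Plain Lloyd iterations with random initialization (not K-Means++)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of the points in each cluster
        #    (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


rng = np.random.default_rng(1)
# Three illustrative Gaussian blobs (not the lab's dataset).
blobs = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])
])
centroids, labels = kmeans_sketch(blobs, k=3)
print(centroids)
```

Because the starting centroids are random, different seeds can end in different partitions, which is the sensitivity to initialization that repeated restarts (`n_init` in scikit-learn) are meant to reduce.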
Advantages and Disadvantages
Advantages:
Easy to understand and implement.
Efficient in running time and scales well.
Can be generalized, with modifications to the basic algorithm, to cases where the clusters differ in size or shape.
Disadvantages:
Sensitive to the initial choice of centroids, which can lead to suboptimal solutions.
In its basic form, not well suited to data with irregularly shaped clusters or clusters of significantly
different sizes.
Sensitive to extreme values (outliers): extreme values can shift the position of the
centroids.

Practical Example

MODEL(S): Modeling: Create model, FIT (Adjust, Train), PREDICT (generate fitted values and/or predictions)

EXPERIMENT, STANDARDIZED VERSION:

Create model
In [79]:
# Choose the number of clusters
modelo2_kmeans = KMeans(n_clusters=3)

FIT (Adjust, Train)

In [80]:
%%time
# Fit the model
modelo2_kmeans.fit(X_standarized)
modelo2_kmeans

/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(

CPU times: user 389 ms, sys: 2.09 ms, total: 391 ms
Wall time: 426 ms
Out[80]:

▾ KMeans
KMeans(n_clusters=3)

PREDICT (generate fitted values and/or predictions)

In [81]:
# Assign a cluster to each sample
yhat_norm = modelo2_kmeans.predict(X_standarized)

In [82]:
## NumPy array
print(type( yhat_norm ))
display(len( yhat_norm )) # shape len
display( yhat_norm[:10] ) # .head(3) .tail(2) [:3]

1000

array([2, 1, 2, 1, 1, 0, 1, 2, 2, 2], dtype=int32)

In [83]:
# extract the unique cluster labels
clusters = np.unique(yhat_norm)
clusters
Out[83]:
array([0, 1, 2], dtype=int32)

In [84]:
# create the scatter plot, one color per cluster
for cluster in clusters:
    row_ix = np.where(yhat_norm == cluster)
    plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1])
# show the plot
plt.show()

EXPERIMENT, UNSTANDARDIZED VERSION:

In [85]:
# Choose the number of clusters
modelo1_kmeans = KMeans(n_clusters=3)
modelo1_kmeans
Out[85]:

▾ KMeans
KMeans(n_clusters=3)

In [86]:
%%time
# Fit the model
modelo1_kmeans.fit(X)
modelo1_kmeans

CPU times: user 97.7 ms, sys: 11 µs, total: 97.7 ms

Wall time: 98.9 ms

/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
Out[86]:

▾ KMeans
KMeans(n_clusters=3)

In [87]:
modelo1_kmeans.cluster_centers_
Out[87]:
array([[ 5.38517431e-01, -1.69191183e+04],
[-2.00613206e+00, 1.98913554e+04],
[-6.62387384e-01, -2.38390518e+04]])

In [88]:
modelo1_kmeans.labels_
Out[88]:
array([0, 1, 2, 1, 1, 0, 1, 2, 0, 0, 1, 2, 2, 0, 0, 1, 0, 1, 2, 1, 1, 1,
0, 1, 1, 0, 1, 0, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 1, 2, 2, 2, 0, 0,
0, 0, 2, 2, 1, 0, 1, 1, 2, 0, 2, 0, 1, 2, 0, 1, 2, 2, 0, 1, 2, 2,
2, 0, 2, 0, 2, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 2, 1, 2, 1, 1,
2, 2, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 2, 0, 2, 2, 1, 1, 1, 1, 2, 2,
2, 2, 1, 2, 2, 2, 2, 0, 1, 0, 2, 0, 2, 1, 0, 0, 1, 1, 0, 1, 2, 1,
1, 2, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 2, 1, 1, 0, 2, 2,
0, 0, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 0, 1, 2, 0, 2, 2, 2, 2,
0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 0, 0, 0, 2, 0,
0, 2, 0, 1, 1, 1, 2, 0, 2, 2, 2, 1, 0, 1, 1, 2, 2, 0, 0, 1, 1, 0,
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 0, 2, 0, 1, 0, 2, 0,
0, 2, 2, 0, 0, 1, 2, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 2, 1, 0,
2, 0, 0, 2, 0, 1, 1, 2, 2, 2, 1, 1, 0, 0, 2, 2, 1, 1, 0, 0, 2, 1,
1, 0, 1, 0, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 0, 0, 1, 1,
0, 1, 1, 0, 2, 2, 2, 0, 0, 2, 0, 2, 1, 1, 1, 0, 1, 0, 0, 2, 1, 2,
2, 2, 1, 2, 1, 0, 2, 0, 0, 2, 1, 2, 2, 1, 2, 0, 2, 0, 0, 1, 2, 0,
1, 2, 1, 1, 0, 0, 0, 1, 1, 2, 1, 0, 0, 2, 1, 0, 1, 1, 0, 0, 1, 2,
1, 0, 0, 0, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 0, 0, 0, 2, 0, 2, 1, 0,
2, 1, 0, 2, 2, 2, 2, 0, 1, 1, 0, 1, 2, 0, 2, 0, 1, 2, 0, 2, 2, 0,
0, 2, 1, 0, 1, 0, 1, 1, 0, 2, 2, 1, 2, 1, 0, 2, 2, 1, 1, 2, 2, 0,
1, 2, 0, 1, 2, 2, 0, 0, 0, 2, 1, 0, 0, 0, 2, 1, 0, 1, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 1, 2, 0, 1, 1, 2, 1, 0, 1, 2, 1, 1, 0, 0, 1, 0,
2, 0, 0, 2, 0, 1, 0, 1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 2, 0, 2, 2, 1,
0, 2, 1, 1, 0, 2, 1, 2, 2, 1, 0, 0, 1, 2, 0, 0, 2, 0, 0, 0, 1, 1,
0, 1, 0, 1, 0, 1, 0, 2, 2, 2, 1, 2, 1, 0, 2, 2, 2, 1, 2, 2, 0, 1,
0, 0, 1, 2, 0, 0, 1, 1, 0, 2, 2, 0, 2, 1, 1, 1, 0, 1, 2, 0, 2, 2,
0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 2, 0, 0, 1, 1, 2, 1,
2, 2, 1, 0, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 1, 0, 0, 0, 1, 0, 0, 2,
2, 2, 2, 0, 1, 0, 0, 2, 0, 2, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1, 0, 1,
0, 1, 1, 0, 0, 0, 2, 2, 0, 2, 2, 0, 0, 2, 0, 0, 1, 2, 1, 1, 2, 1,
1, 0, 0, 2, 1, 2, 2, 0, 1, 0, 2, 1, 1, 2, 2, 2, 0, 1, 0, 0, 0, 0,
0, 2, 1, 0, 2, 0, 2, 1, 1, 2, 0, 0, 1, 0, 0, 0, 1, 2, 1, 2, 2, 0,
2, 2, 1, 1, 1, 2, 0, 0, 1, 0, 0, 0, 2, 2, 2, 0, 1, 1, 2, 0, 1, 1,
2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 1, 1, 1, 2,
0, 1, 1, 0, 2, 2, 0, 1, 1, 0, 2, 1, 0, 2, 1, 2, 0, 0, 2, 2, 0, 2,
1, 2, 2, 2, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2, 1, 2, 2, 2,
1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 0, 1, 2, 0, 2, 2, 2, 1, 1, 1, 2, 1,
1, 1, 1, 1, 1, 2, 1, 0, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0, 1, 2, 2, 2,
1, 2, 2, 0, 1, 1, 0, 0, 2, 0, 1, 0, 1, 0, 1, 1, 2, 0, 0, 1, 2, 1,
1, 2, 2, 1, 2, 0, 0, 1, 1, 1, 0, 0, 2, 2, 0, 2, 0, 2, 1, 1, 0, 0,
1, 0, 2, 0, 2, 0, 2, 2, 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 2, 2, 0, 1,
0, 0, 1, 2, 2, 1, 0, 0, 0, 1, 0, 0, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2,
0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 1, 0, 1, 1, 0,
2, 2, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 0, 1, 0, 0, 1, 1, 0, 2, 2, 0,
2, 1, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 1, 0, 2, 0, 2, 1, 1,
0, 0, 1, 0, 2, 2, 2, 1, 2, 0], dtype=int32)

In [89]:
modelo1_kmeans.n_iter_
Out[89]:
7

In [90]:
# Assign a cluster to each sample
yhat = modelo1_kmeans.predict(X)

In [91]:
yhat
Out[91]:
array([0, 1, 2, 1, 1, 0, 1, 2, 0, 0, 1, 2, 2, 0, 0, 1, 0, 1, 2, 1, 1, 1,
0, 1, 1, 0, 1, 0, 0, 0, 1, 2, 0, 2, 0, 2, 0, 2, 1, 2, 2, 2, 0, 0,
0, 0, 2, 2, 1, 0, 1, 1, 2, 0, 2, 0, 1, 2, 0, 1, 2, 2, 0, 1, 2, 2,
2, 0, 2, 0, 2, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1, 1, 1, 2, 1, 2, 1, 1,
2, 2, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 2, 0, 2, 2, 1, 1, 1, 1, 2, 2,
2, 2, 1, 2, 2, 2, 2, 0, 1, 0, 2, 0, 2, 1, 0, 0, 1, 1, 0, 1, 2, 1,
1, 2, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 2, 1, 1, 0, 2, 2,
0, 0, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 0, 1, 2, 0, 2, 2, 2, 2,
0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 0, 0, 0, 2, 0,
0, 2, 0, 1, 1, 1, 2, 0, 2, 2, 2, 1, 0, 1, 1, 2, 2, 0, 0, 1, 1, 0,
2, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 0, 2, 0, 1, 0, 2, 0,
0, 2, 2, 0, 0, 1, 2, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 2, 1, 0,
2, 0, 0, 2, 0, 1, 1, 2, 2, 2, 1, 1, 0, 0, 2, 2, 1, 1, 0, 0, 2, 1,
1, 0, 1, 0, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 0, 0, 1, 1,
0, 1, 1, 0, 2, 2, 2, 0, 0, 2, 0, 2, 1, 1, 1, 0, 1, 0, 0, 2, 1, 2,
2, 2, 1, 2, 1, 0, 2, 0, 0, 2, 1, 2, 2, 1, 2, 0, 2, 0, 0, 1, 2, 0,
1, 2, 1, 1, 0, 0, 0, 1, 1, 2, 1, 0, 0, 2, 1, 0, 1, 1, 0, 0, 1, 2,
1, 0, 0, 0, 2, 0, 1, 1, 1, 2, 0, 1, 1, 1, 0, 0, 0, 2, 0, 2, 1, 0,
2, 1, 0, 2, 2, 2, 2, 0, 1, 1, 0, 1, 2, 0, 2, 0, 1, 2, 0, 2, 2, 0,
0, 2, 1, 0, 1, 0, 1, 1, 0, 2, 2, 1, 2, 1, 0, 2, 2, 1, 1, 2, 2, 0,
1, 2, 0, 1, 2, 2, 0, 0, 0, 2, 1, 0, 0, 0, 2, 1, 0, 1, 1, 2, 0, 0,
0, 0, 1, 0, 0, 0, 1, 2, 0, 1, 1, 2, 1, 0, 1, 2, 1, 1, 0, 0, 1, 0,
2, 0, 0, 2, 0, 1, 0, 1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 2, 0, 2, 2, 1,
0, 2, 1, 1, 0, 2, 1, 2, 2, 1, 0, 0, 1, 2, 0, 0, 2, 0, 0, 0, 1, 1,
0, 1, 0, 1, 0, 1, 0, 2, 2, 2, 1, 2, 1, 0, 2, 2, 2, 1, 2, 2, 0, 1,
0, 0, 1, 2, 0, 0, 1, 1, 0, 2, 2, 0, 2, 1, 1, 1, 0, 1, 2, 0, 2, 2,
0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 2, 0, 0, 1, 1, 2, 1,
2, 2, 1, 0, 0, 2, 1, 0, 0, 1, 0, 2, 2, 2, 1, 0, 0, 0, 1, 0, 0, 2,
2, 2, 2, 0, 1, 0, 0, 2, 0, 2, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1, 0, 1,
0, 1, 1, 0, 0, 0, 2, 2, 0, 2, 2, 0, 0, 2, 0, 0, 1, 2, 1, 1, 2, 1,
1, 0, 0, 2, 1, 2, 2, 0, 1, 0, 2, 1, 1, 2, 2, 2, 0, 1, 0, 0, 0, 0,
0, 2, 1, 0, 2, 0, 2, 1, 1, 2, 0, 0, 1, 0, 0, 0, 1, 2, 1, 2, 2, 0,
2, 2, 1, 1, 1, 2, 0, 0, 1, 0, 0, 0, 2, 2, 2, 0, 1, 1, 2, 0, 1, 1,
2, 2, 0, 0, 1, 1, 0, 1, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 1, 1, 1, 2,
0, 1, 1, 0, 2, 2, 0, 1, 1, 0, 2, 1, 0, 2, 1, 2, 0, 0, 2, 2, 0, 2,
1, 2, 2, 2, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2, 1, 2, 2, 2,
1, 0, 0, 2, 2, 1, 0, 0, 0, 2, 0, 1, 2, 0, 2, 2, 2, 1, 1, 1, 2, 1,
1, 1, 1, 1, 1, 2, 1, 0, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0, 1, 2, 2, 2,
1, 2, 2, 0, 1, 1, 0, 0, 2, 0, 1, 0, 1, 0, 1, 1, 2, 0, 0, 1, 2, 1,
1, 2, 2, 1, 2, 0, 0, 1, 1, 1, 0, 0, 2, 2, 0, 2, 0, 2, 1, 1, 0, 0,
1, 0, 2, 0, 2, 0, 2, 2, 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 2, 2, 0, 1,
0, 0, 1, 2, 2, 1, 0, 0, 0, 1, 0, 0, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2,
0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 1, 0, 1, 1, 0,
2, 2, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 0, 1, 0, 0, 1, 1, 0, 2, 2, 0,
2, 1, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 1, 0, 2, 0, 2, 1, 1,
0, 0, 1, 0, 2, 2, 2, 1, 2, 0], dtype=int32)

In [92]:
# extract the unique cluster labels
clusters = np.unique(yhat)
clusters
Out[92]:

array([0, 1, 2], dtype=int32)

In [93]:
# create the scatter plot, one color per cluster
for cluster in clusters:
    row_ix = np.where(yhat == cluster)
    plt.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
plt.show()

MODEL 2:
In [94]:

X_standarized
Out[94]:

array([[-0.50490379, -0.67246203],
[-0.81575665, 1.23572057],
[-1.33700792, -0.98473062],
...,
[-0.73553231, 1.41379671],
[ 1.28632355, -0.7743968 ],
[-0.24777382, -0.4112469 ]])

In [95]:

from sklearn.cluster import MeanShift

MeanShift

Out[95]:

sklearn.cluster._mean_shift.MeanShift
def __init__(*, bandwidth=None, seeds=None, bin_seeding=False, min_bin_freq=1, cluster_all=True, n_jobs=None, max_iter=300)

Mean shift clustering using a flat kernel.

Mean shift clustering aims to discover "blobs" in a smooth density of
samples. It is a centroid-based algorithm, which works by updating
candidates for centroids to be the mean of the points within a given
region. These candidates are then filtered in a post-processing stage to
eliminate near-duplicates to form the final set of centroids.

Seeding is performed using a binning technique for scalability.

Read more in the :ref:`User Guide <mean_shift>`.

Parameters
----------
bandwidth : float, default=None
    Bandwidth used in the flat kernel.
    If not given, the bandwidth is estimated using
    sklearn.cluster.estimate_bandwidth; see the documentation for that
    function for hints on scalability (see also the Notes, below).

seeds : array-like of shape (n_samples, n_features), default=None
    Seeds used to initialize kernels. If not set,
    the seeds are calculated by clustering.get_bin_seeds
    with bandwidth as the grid size and default values for
    other parameters.

bin_seeding : bool, default=False
    If true, initial kernel locations are not locations of all
    points, but rather the location of the discretized version of
    points, where points are binned onto a grid whose coarseness
    corresponds to the bandwidth. Setting this option to True will speed
    up the algorithm because fewer seeds will be initialized.
    The default value is False.
    Ignored if seeds argument is not None.

min_bin_freq : int, default=1
    To speed up the algorithm, accept only those bins with at least
    min_bin_freq points as seeds.

cluster_all : bool, default=True
    If true, then all points are clustered, even those orphans that are
    not within any kernel. Orphans are assigned to the nearest kernel.

In [96]:
# initialize with a bandwidth W = 0.7
model = MeanShift(bandwidth=0.7, cluster_all=False)
model
Out[96]:

▾ MeanShift
MeanShift(bandwidth=0.7, cluster_all=False)

In [97]:
yhat = model.fit_predict(X_standarized)
yhat
Out[97]:
array([ 2, 0, -1, 0, 0, 1, 0, 2, 2, 2, 0, 1, 2, 1, 1, 0, 1,
0, 1, 0, 0, 0, 2, 0, 0, 2, 0, 1, 1, 1, 0, 1, 2, 2,
2, 1, 1, 1, 0, 2, 1, 1, 2, 1, 1, 1, 2, 1, 0, 2, 0,
0, 1, 1, 2, 2, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 2, 2,
1, 1, 2, 0, 2, -1, 1, -1, 0, 1, -1, 0, 0, 0, 0, 1, 0,
1, 0, 0, 1, 1, 1, 2, 0, 1, 0, 1, 2, 0, 1, 2, 1, 1,
2, 2, 0, 0, 0, 0, 1, 2, 1, 2, 0, 2, 2, 2, 1, 2, 0,
2, 1, 2, 2, 0, 1, 2, 0, 0, -1, 0, 2, 0, 0, -1, 0, 0,
2, 0, 0, 1, 0, 0, 0, 2, 2, 2, 0, 1, 1, 0, 0, 2, 1,
2, 1, 1, 0, -1, 0, 1, 0, 0, 2, 2, 0, 1, 0, 0, 1, 0,
1, 2, 2, 2, 1, -1, 1, 1, 0, -1, 2, 0, 2, 0, 0, 0, 1,
1, 2, 2, 2, 0, 0, 1, 1, 2, 1, 1, 2, 2, -1, 0, 0, 0,
1, 1, 2, 1, 1, 0, 2, 0, 0, 1, 2, 1, 2, 0, 0, 2, -1,
-1, 0, -1, 0, 2, 1, 2, 1, 2, 0, 0, 2, -1, 2, 1, 1, 1,
0, 1, 1, 2, 1, 1, -1, 1, -1, 0, 2, -1, 0, 0, 1, 0, 2,
0, 1, 0, 1, 0, 2, 2, 0, 2, 2, -1, 1, -1, 1, 0, 0, 1,
2, 2, 0, 0, 1, 1, 2, 1, 0, 0, 1, 1, 1, 0, 0, 2, 0,
2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, -1, 0, 1, 1,
0, 0, 2, 0, 0, 1, 2, -1, 2, 1, 1, -1, 2, 1, 0, 0, 0,
1, 0, -1, -1, -1, 0, 1, 2, 2, 0, -1, 0, -1, 1, 2, -1, 1,
0, 2, 2, 0, 1, 1, 2, 2, 2, 0, 2, 1, 0, 2, 0, 0, 2,
-1, 2, 0, 0, -1, 0, 1, 2, 2, 0, 2, 0, 0, -1, 1, 0, 1,
0, 1, 2, 1, 1, 1, 0, 0, 0, 2, 1, 0, 0, 0, -1, 2, 1,
1, 2, -1, 0, 2, 1, 0, 2, -1, -1, 1, 1, 2, 0, 0, 1, 0,
-1, 2, 2, 1, 0, 1, 2, 1, 2, 1, -1, 2, 0, 2, 0, 2, 0,
0, 1, 1, 2, 0, 2, 0, 2, 1, -1, 0, 0, 1, 1, 1, 0, 2,
2, 0, 2, -1, 2, 2, 1, 1, 0, 2, 2, 1, 1, 0, 1, 0, 0,
1, 1, 2, 2, 2, 0, 2, 1, 1, 0, 1, 2, 0, 0, 1, 0, -1,
0, 2, 0, 0, -1, 1, 0, 1, 1, 2, 2, 1, 2, 0, 2, 0, 1,
1, 1, 1, 0, 1, 2, 2, 1, 2, -1, -1, 2, 0, 1, 2, 0, 0,
1, 1, 0, 1, 1, 0, 2, 2, 0, 2, 1, 1, 2, -1, 1, 2, 0,
0, 2, 0, -1, 0, -1, 0, 2, 1, 1, 1, 0, 1, 0, -1, 1, 1,
1, 0, 2, 1, 1, 0, 1, -1, 0, 2, 2, 2, 0, 0, 2, 2, 1,
1, -1, 0, 0, 0, 2, 0, 2, -1, 1, -1, 1, 1, 1, 2, -1, 0,
2, 2, 0, 0, 0, 0, -1, 0, 0, 2, 1, 1, 0, 0, 1, 0, 1,
2, 0, 2, 1, 2, 0, 2, 2, 0, 1, 1, 2, 1, 0, 2, 2, 2,
0, 1, -1, 1, 2, 1, 1, 1, 0, -1, 1, 2, 2, 1, 0, 0, 2,
2, 1, -1, 0, 0, 1, 0, 1, 0, 2, 0, 0, 2, 1, 2, 2, -1,
2, 2, 1, -1, 1, 2, -1, 1, 0, -1, 0, 0, 1, 0, 0, 1, 1,
2, 0, 2, -1, 1, 0, 2, 1, 0, 0, 1, 1, 1, 2, 0, 2, 1,
1, 1, 2, 1, 0, 1, -1, 2, 2, 0, 0, 2, 1, 2, 0, 1, 2,
1, 0, 2, 0, 2, 1, 1, 2, 1, 0, 0, 0, -1, -1, 2, 0, 2,
1, 1, 2, -1, 2, 1, 0, 0, 2, 2, 0, 0, 2, 1, 1, -1, 0,
0, 2, 0, 0, 0, 0, 2, 2, 0, 1, 1, 2, 1, 0, 0, 0, 1,
2, 0, 0, 1, 1, 2, 2, 0, 0, 2, -1, 0, 1, 2, 0, 2, 1,
-1, 2, 2, 2, 1, 0, 2, 1, 2, 2, 0, 1, 1, 1, 2, 1, 2,
-1, 0, 0, 2, 1, 1, 0, 2, 2, 2, 0, -1, 2, 1, 1, 0, 2,
1, -1, 2, 2, 0, 2, 1, 2, 1, 2, 0, 0, 0, -1, 0, 0, 0,
0, 0, 0, 1, 0, 2, 1, 2, 1, -1, 2, 2, 1, 1, 2, 1, 0,
-1, 2, 1, 0, 2, 1, 2, 0, 0, 2, 1, 2, 2, 0, -1, 0, 1,
0, 0, -1, 1, 2, 0, 1, 0, 0, 1, 2, 0, -1, 1, 2, 0, 0,
0, -1, 1, 1, -1, 2, -1, 1, 1, 0, 0, -1, 1, 0, 2, 1, 2,
1, 1, 1, 1, 0, 1, 2, 0, 1, 0, 2, -1, 0, 1, 2, 2, 2,
0, 1, 2, 0, 2, 1, 0, 1, 1, 1, 0, 2, 1, 0, 1, 2, 0,
0, 1, 1, 0, 1, 2, 1, 1, 0, 0, 0, -1, 0, 1, 2, 2, -1,
0, 2, 1, 2, 0, -1, 0, 1, 0, 0, 1, 2, 2, 1, 2, 0, 0,
0, 0, 1, 2, 2, -1, -1, 0, 2, 1, 0, 0, 1, 1, 2, 2, 1,
0, 2, 1, 1, 2, 2, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
-1, 1, 0, 0, 1, -1, 0, 1, 1, -1, 2, 0, 1, 2])

In [98]:

clusters = np.unique(yhat)

for cluster in clusters:
    row_ix = np.where(yhat == cluster)
    if cluster == -1:
        # label -1 marks points left unclustered (cluster_all=False); draw them in grey
        plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1], c='#666160')
    else:
        plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1])
plt.show()

In [99]:
for cluster in clusters:
    row_ix = np.where(yhat == cluster)
    if cluster == -1:
        plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1], c='#666160', s=10)  # Adjust the size (s) as needed
    else:
        plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1], s=10)  # Adjust the size (s) as needed

plt.show()

MeanShift

Basic Concepts of MeanShift

MeanShift is a clustering algorithm that finds the clusters in a dataset in a non-parametric way,
which means the number of clusters K does not have to be specified in advance.
Unlike K-Means, MeanShift does not assume a specific shape for the clusters and can find
clusters of arbitrary shape.
The algorithm is based on finding the "peaks" in the density of the data points and assigning each
point to one of those peaks, which acts as its cluster.

How the Algorithm Works

1. Initialization: Define the search radius W (the bandwidth).
2. Density computation: For each candidate point, take the data points that lie within radius W of it;
how many there are gives the local density.
3. Peak shift: Move the search window to the center of mass (the mean) of those points.
4. Repeat steps 2 and 3 until the candidate points converge to the peaks.
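
The steps above can be sketched for a single search window as follows. This is a simplified illustration with a flat kernel; the function name `shift_to_peak` and the two-blob data are assumptions for the example, and the sketch omits the seeding and the merging of near-duplicate peaks that a full implementation performs.

```python
import numpy as np


def shift_to_peak(start, X, bandwidth, max_iter=300, tol=1e-3):
    """Move one window of radius `bandwidth` until it stops moving (flat kernel)."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Step 2: data points that fall inside the window.
        inside = X[np.linalg.norm(X - center, axis=1) <= bandwidth]
        if len(inside) == 0:
            break
        # Step 3: move the window to their center of mass.
        new_center = inside.mean(axis=0)
        moved = np.linalg.norm(new_center - center)
        center = new_center
        # Step 4: stop when the window has converged.
        if moved < tol:
            break
    return center


rng = np.random.default_rng(0)
# Two illustrative blobs (not the lab's dataset).
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(100, 2))
blob_b = rng.normal(loc=[3.0, 3.0], scale=0.2, size=(100, 2))
points = np.vstack([blob_a, blob_b])

peak_a = shift_to_peak(points[0], points, bandwidth=0.7)
peak_b = shift_to_peak(points[-1], points, bandwidth=0.7)
print(peak_a, peak_b)
```

Windows started inside the same blob end up at nearly the same location, so points can be grouped by the peak their window reaches.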
Advantages and Disadvantages
Advantages:
The number of clusters (K) does not have to be specified in advance.
Can find clusters of arbitrary shape and of different sizes.
Works well on data with varying densities.
Disadvantages:
Can be computationally expensive on large datasets because of the need to
compute densities.
Sensitive to the size of the search radius W, which is not trivial to choose.
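
One common way to pick W is to let scikit-learn estimate it from nearest-neighbor distances with `sklearn.cluster.estimate_bandwidth`. The sketch below is only an illustration on made-up two-blob data; the `quantile` values are arbitrary choices, not recommendations for the lab's dataset.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
# Illustrative two-blob data (not the lab's X_standarized).
demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(150, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.3, size=(150, 2)),
])

# quantile sets how many neighbors are considered: a larger quantile gives a wider window.
bw_narrow = estimate_bandwidth(demo, quantile=0.1)
bw_wide = estimate_bandwidth(demo, quantile=0.3)
print(bw_narrow, bw_wide)

# Use the estimated bandwidth instead of a hand-picked value.
labels = MeanShift(bandwidth=bw_wide).fit_predict(demo)
print(np.unique(labels))
```

The estimate still depends on `quantile`, so it moves the tuning problem rather than removing it; plotting the result for a few values is a reasonable check.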

MODEL 3:
In [100]:

from sklearn.cluster import DBSCAN

DBSCAN
Out[100]:

sklearn.cluster._dbscan.DBSCAN
def __init__(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
Perform DBSCAN clustering from vector array or distance matrix.

DBSCAN - Density-Based Spatial Clustering of Applications with Noise.
Finds core samples of high density and expands clusters from them.
Good for data which contains clusters of similar density.

Parameters
----------
eps : float, default=0.5
    The maximum distance between two samples for one to be considered
    as in the neighborhood of the other. This is not a maximum bound
    on the distances of points within a cluster. This is the most
    important DBSCAN parameter to choose appropriately for your data set
    and distance function.

min_samples : int, default=5
    The number of samples (or total weight) in a neighborhood for a point
    to be considered as a core point. This includes the point itself.

metric : str, or callable, default='euclidean'
    The metric to use when calculating distance between instances in a
    feature array. If metric is a string or callable, it must be one of
    the options allowed by :func:`sklearn.metrics.pairwise_distances` for
    its metric parameter.
    If metric is "precomputed", X is assumed to be a distance matrix and
    must be square. X may be a :term:`sparse graph`, in which
    case only "nonzero" elements may be considered neighbors for DBSCAN.

In [101]:
model3 = DBSCAN( eps=0.25, min_samples=10)
model3
Out[101]:

▾ DBSCAN
DBSCAN(eps=0.25, min_samples=10)

In [102]:
# Fit DBSCAN (model3, not the MeanShift estimator `model`) and label each sample; -1 marks noise
yhat = model3.fit_predict( X_standarized )
yhat

In [103]:
clusters = np.unique(yhat)

for cluster in clusters:
    row_ix = np.where(yhat == cluster)
    if cluster == -1:
        # label -1 marks noise points; draw them in grey
        plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1], c='#666160', s=10)
    else:
        plt.scatter(X_standarized[row_ix, 0], X_standarized[row_ix, 1], s=10)
plt.show()
DBSCAN

Basic Concepts of DBSCAN

DBSCAN is a density-based clustering algorithm that identifies clusters of arbitrary shape by separating dense regions of points from sparse ones.
Like MeanShift, DBSCAN does not require the number of clusters (K) to be specified in advance, nor does it assume a particular cluster shape.
DBSCAN rests on the idea that a cluster is a dense region of points surrounded by a region of low density.

How the Algorithm Works

1. Core points: A point is a core point if at least a minimum number of points (MinPts) lie within a radius (epsilon, ε) around it.
2. Border points: A point is a border point if it lies within radius ε of a core point but does not itself meet the MinPts requirement.
3. Noise points: Points that are neither core nor border points are treated as noise.
4. Connecting points: Clusters are formed by linking core points that lie within ε of one another, together with the border points within ε of them, until every point has been visited.
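
The three point types described above can be read off a fitted scikit-learn `DBSCAN` model. The following is a minimal sketch, assuming a toy dataset from `make_moons` and illustrative values `eps=0.2` and `min_samples=5`; none of these come from the notebook, and the names `X_demo`, `is_core`, `is_border` and `is_noise` are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy data (an assumption for illustration, not the notebook's dataset)
X_demo, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

# eps is the radius ε and min_samples plays the role of MinPts
# (scikit-learn counts the point itself among its neighbours)
db = DBSCAN(eps=0.2, min_samples=5).fit(X_demo)
labels = db.labels_

# Core points are reported directly by the fitted model
is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db.core_sample_indices_] = True

# Noise points carry the label -1; border points belong to a cluster but are not core
is_noise = labels == -1
is_border = ~is_core & ~is_noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters: {n_clusters}")
print(f"core: {is_core.sum()}, border: {is_border.sum()}, noise: {is_noise.sum()}")
```

Every sample falls into exactly one of the three groups, so the three counts add up to the number of samples.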

Advantages and Disadvantages
Advantages:
Can find clusters of arbitrary shape and different sizes.
Can identify noise points (outliers) in the dataset.
Does not require the number of clusters to be specified in advance.
Disadvantages:
Sensitive to the choice of ε and MinPts, which must be tuned to the data.
Can struggle with high-dimensional datasets.
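
One common heuristic for tuning ε, which the notebook itself does not use, is to look at the sorted distance from each point to its MinPts-th nearest neighbour and pick ε near the point where that curve bends sharply upward. The sketch below assumes toy data from `make_blobs` and an illustrative `min_pts = 10`; the names `X_demo`, `min_pts` and `k_dist` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for illustration, not the notebook's dataset)
X_demo, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

min_pts = 10  # illustrative MinPts value

# Querying the training data returns each point as its own nearest neighbour,
# so the last column is the distance to the min_pts-th neighbour counting the point itself
nn = NearestNeighbors(n_neighbors=min_pts).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])

# Inspect a few percentiles; plotting k_dist shows the bend more clearly
for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")
```

Plotting `k_dist` with `plt.plot(k_dist)` gives the usual k-distance curve; the percentiles only hint at where the bend lies.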

MODEL 4:
In [104]:

from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

In [105]:
# setting distance_threshold=0 (with n_clusters=None) makes the model compute the full tree of merges
model4 = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model4
Out[105]:

AgglomerativeClustering(distance_threshold=0, n_clusters=None)

In [106]:
def plot_dendrogram(model, **kwargs):
    # Build the linkage matrix and plot the dendrogram

    # count the samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # plot the dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [107]:

model4 = model4.fit(X_standarized)
model4
Out[107]:

AgglomerativeClustering(distance_threshold=0, n_clusters=None)

In [108]:

plt.title("Hierarchical Clustering Dendrogram")

# plot the top p=3 levels of the dendrogram
plot_dendrogram(model4, truncate_mode="level", p=3)
# an index shown without parentheses means the node holds a single point
plt.xlabel("Number of points in node.")
plt.show()
Methods for Cluster Selection

Elbow Method

The elbow method is a technique for choosing the number of clusters in a clustering algorithm such as K-Means. The idea is to observe how the dispersion of the data points (the within-cluster sum of squared distances) changes as the number of clusters increases. We plot this quantity against the number of clusters and look for the point on the curve where the decrease becomes markedly less pronounced, forming an "elbow". That point is taken as a reasonable choice for the number of clusters.

In [109]:

# List holding the dispersion (inertia) for each k
sse = []

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init="random", n_init=10)
    kmeans.fit(X_standarized)
    sse.append(kmeans.inertia_)

plt.style.use("fivethirtyeight")
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Dispersion")
plt.show()
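
The "dispersion" collected in the loop above is the K-Means inertia: the sum of squared distances from each sample to the centroid of its assigned cluster. The sketch below recomputes it by hand to make that definition concrete, assuming toy data from `make_blobs` rather than the notebook's `X_standarized`; the names `X_demo`, `km` and `sse_manual` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration, not the notebook's dataset)
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Centroid assigned to each sample, then the sum of squared distances to it
assigned_centroids = km.cluster_centers_[km.labels_]
sse_manual = np.sum((X_demo - assigned_centroids) ** 2)

print(f"manual SSE:  {sse_manual:.4f}")
print(f"km.inertia_: {km.inertia_:.4f}")
```

The two values should agree up to floating-point error. Inertia always shrinks as k grows, which is why the elbow method looks at where the decrease flattens rather than at the minimum.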

MODEL 5: AGGLOMERATIVE CLUSTERING ERNESTO

In [110]:

model5 = AgglomerativeClustering(n_clusters=3)
yhat = model5.fit_predict(X_standarized)

In [111]:
clusters = unique(yhat)
clusters

Out[111]:
array([0, 1, 2])

In [112]:
for cluster in clusters:
    # indices of the samples assigned to this cluster
    row_ix = where(yhat == cluster)
    plt.scatter(X[row_ix, 0], X[row_ix, 1], s=10)
plt.show()

EVALUATING MODELS: Metrics

Silhouette score: measures how similar each point is to the other points in its own cluster compared with points in other clusters. It ranges from -1 to 1; values close to 1 indicate tight cohesion within clusters and good separation between them.

In [113]:
from sklearn.metrics import silhouette_score

# Compute the silhouette score.
# As the cells are ordered, `kmeans` is the last model fitted in the elbow loop above (k=10).
silhouette_avg = silhouette_score(df_X, kmeans.labels_)
print(f"Silhouette score: {silhouette_avg}")

Silhouette score: 0.2319562898356635
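
Because the silhouette score needs no true labels, it can also be used to compare candidate values of k. The following sketch does that on toy data from `make_blobs`; the dataset, the range of k and the names `X_demo`, `scores` and `best_k` are assumptions for illustration and are not taken from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data (an assumption for illustration, not the notebook's dataset)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Score the labels against the same features the model was fitted on
    scores[k] = silhouette_score(X_demo, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"highest silhouette at k={best_k}")
```

Scoring on the same feature matrix used for fitting keeps the distances in the metric consistent with the distances the model optimized.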

Adjusted Rand Index (ARI): compares the predicted cluster labels with the true labels, correcting for the agreement expected by chance. A value close to 1 indicates strong agreement between the predicted and true labels.

In [114]:
from sklearn.metrics import adjusted_rand_score

# Assuming the true labels `y` are available
ari = adjusted_rand_score(y, kmeans.labels_)
print(f"Adjusted Rand Index: {ari}")

Adjusted Rand Index: 0.5663573031091752
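
The Adjusted Rand Index compares partitions, not label names, so a clustering that recovers the true groups under different integer labels still scores 1. The sketch below illustrates this with small hand-written label lists; all of them are hypothetical and unrelated to the notebook's `y`.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for six samples
y_true = [0, 0, 0, 1, 1, 1]
same_partition_renamed = [1, 1, 1, 0, 0, 0]  # same groups, label names swapped
one_mistake = [0, 0, 1, 1, 1, 1]             # one sample placed in the wrong group

ari_renamed = adjusted_rand_score(y_true, same_partition_renamed)
ari_mistake = adjusted_rand_score(y_true, one_mistake)

print(f"renamed labels: {ari_renamed:.3f}")
print(f"one mistake:    {ari_mistake:.3f}")
```

This is why cluster ids from K-Means can be passed to `adjusted_rand_score` directly, without first matching them to the class ids.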

