
Commit e4b2d96

Merge pull request scikit-learn#3246 from ogrisel/rebased-pr-2657
[MRG+1] deprecate sequences of sequences multilabel support
2 parents 5c46c0c + c9048fb commit e4b2d96

19 files changed, +505 -233 lines changed

doc/developers/utilities.rst

Lines changed: 1 addition & 1 deletion
@@ -244,7 +244,7 @@ Multiclass and multilabel utility function
   a classification output is in label indicator matrix format.

 - :func:`multiclass.unique_labels`: Helper function to extract an ordered
-  array of unique labels from a list of labels.
+  array of unique labels from different formats of target.


 Helper Functions

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -1061,6 +1061,7 @@ Pairwise metrics
    preprocessing.KernelCenterer
    preprocessing.LabelBinarizer
    preprocessing.LabelEncoder
+   preprocessing.MultiLabelBinarizer
    preprocessing.MinMaxScaler
    preprocessing.Normalizer
    preprocessing.OneHotEncoder

doc/modules/model_evaluation.rst

Lines changed: 4 additions & 25 deletions
@@ -259,16 +259,11 @@ where :math:`1(x)` is the `indicator function
   >>> accuracy_score(y_true, y_pred, normalize=False)
   2

-In the multilabel case with binary indicator format:
+In the multilabel case with binary label indicators: ::

   >>> accuracy_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.ones((2, 2)))
   0.5

-and with a list of labels format:
-
-  >>> accuracy_score([(1,), (3,)], [(1, 2), tuple()])
-  0.0
-
 .. topic:: Example:

   * See :ref:`example_plot_permutation_test_for_classification.py`
@@ -377,16 +372,11 @@ where :math:`1(x)` is the `indicator function
   >>> hamming_loss(y_true, y_pred)
   0.25

-In the multilabel case with binary indicator format: ::
+In the multilabel case with binary label indicators: ::

   >>> hamming_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.zeros((2, 2)))
   0.75

-and with a list of labels format: ::
-
-  >>> hamming_loss([(1, 2), (3,)], [(1, 2), tuple()]) # doctest: +ELLIPSIS
-  0.166...
-
 .. note::

     In multiclass classification, the Hamming loss correspond to the Hamming
@@ -434,17 +424,11 @@ score is equal to the classification accuracy.
   >>> jaccard_similarity_score(y_true, y_pred, normalize=False)
   2

-In the multilabel case with binary indicator format:
+In the multilabel case with binary label indicators: ::

   >>> jaccard_similarity_score(np.array([[0.0, 1.0], [1.0, 1.0]]), np.ones((2, 2)))
   0.75

-and with a list of labels format:
-
-  >>> jaccard_similarity_score([(1,), (3,)], [(1, 2), tuple()])
-  0.25
-
-
 .. _precision_recall_f_measure_metrics:

 Precision, recall and F-measures
@@ -897,16 +881,11 @@ where :math:`1(x)` is the `indicator function
   >>> zero_one_loss(y_true, y_pred, normalize=False)
   1

-In the multilabel case with binary indicator format:
+In the multilabel case with binary label indicators: ::

   >>> zero_one_loss(np.array([[0.0, 1.0], [1.0, 1.0]]), np.ones((2, 2)))
   0.5

-and with a list of labels format:
-
-  >>> zero_one_loss([(1,), (3,)], [(1, 2), tuple()])
-  1.0
-

 .. topic:: Example:
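
The multilabel doctests kept in doc/modules/model_evaluation.rst all operate on rows of a binary indicator matrix. As a rough illustration of what those numbers mean, the pure-Python sketch below recomputes them by hand. The helper names (`subset_accuracy`, `hamming`, `jaccard`) are hypothetical and are not scikit-learn functions; the library's own implementations may treat edge cases differently.

```python
# Hand-rolled multilabel metrics, for illustration only.
# Rows are samples, columns are labels, entries are 0 or 1.


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label row matches exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / float(len(y_true))


def hamming(y_true, y_pred):
    """Fraction of individual label cells that differ."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    total = sum(len(t) for t in y_true)
    return wrong / float(total)


def jaccard(y_true, y_pred):
    """Mean over samples of |intersection| / |union| of the label sets."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a and b)
        union = sum(1 for a, b in zip(t, p) if a or b)
        # Assumption of this sketch: two empty label sets count as a match.
        scores.append(inter / float(union) if union else 1.0)
    return sum(scores) / len(scores)


y_true = [[0, 1], [1, 1]]
ones = [[1, 1], [1, 1]]
zeros = [[0, 0], [0, 0]]

print(subset_accuracy(y_true, ones))      # 0.5, matches the accuracy_score doctest
print(1 - subset_accuracy(y_true, ones))  # 0.5, matches the zero_one_loss doctest
print(hamming(y_true, zeros))             # 0.75, matches the hamming_loss doctest
print(jaccard(y_true, ones))              # 0.75, matches the jaccard doctest
```

Only the second sample matches row for row, which is why the subset accuracy and the zero-one loss are both 0.5 here.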

doc/modules/multiclass.rst

Lines changed: 17 additions & 29 deletions
@@ -77,43 +77,31 @@ tasks :ref:`Decision Trees <tree>`, :ref:`Random Forests <forest>`,
 Multilabel classification format
 ================================

-In multilabel learning, the joint set of binary classification tasks
-is expressed with either a sequence of sequences or a label binary indicator
-array.
-
-In the sequence of sequences format, each set of labels is represented as
-a sequence of integer, e.g. ``[0]``, ``[1, 2]``. An empty set of labels is
-then expressed as ``[]``, and a set of samples as ``[[0], [1, 2], []]``.
-In the label indicator format, each sample is one row of a 2d array of
-shape (n_samples, n_classes) with binary values: the one, i.e. the non zero
-elements, corresponds to the subset of labels. Our previous example is
-therefore expressed as ``np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0])``
-and an empty set of labels would be represented by a row of zero elements.
-
-
-In the preprocessing module, the transformer
-:class:`sklearn.preprocessing.label_binarize` and the function
-:func:`sklearn.preprocessing.LabelBinarizer`
-can help you to convert the sequence of sequences format to the label
-indicator format.
+In multilabel learning, the joint set of binary classification tasks is
+expressed with label binary indicator array: each sample is one row of a 2d
+array of shape (n_samples, n_classes) with binary values: the one, i.e. the non
+zero elements, corresponds to the subset of labels. An array such as
+``np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]])`` represents label 0 in the first
+sample, labels 1 and 2 in the second sample, and no labels in the third sample.
+
+Producing multilabel data as a list of sets of labels may be more intuitive.
+The transformer :class:`MultiLabelBinarizer <preprocessing.MultiLabelBinarizer>`
+will convert between a collection of collections of labels and the indicator
+format.

   >>> from sklearn.datasets import make_multilabel_classification
-  >>> from sklearn.preprocessing import LabelBinarizer
-  >>> X, Y = make_multilabel_classification(n_samples=5, random_state=0)
+  >>> from sklearn.preprocessing import MultiLabelBinarizer
+  >>> X, Y = make_multilabel_classification(n_samples=5, random_state=0,
+  ...                                       return_indicator=False)
   >>> Y
   ([0, 1, 2], [4, 1, 0, 2], [4, 0, 1], [1, 0], [3, 2])
-  >>> LabelBinarizer().fit_transform(Y)
+  >>> MultiLabelBinarizer().fit_transform(Y)
   array([[1, 1, 1, 0, 0],
          [1, 1, 1, 0, 1],
          [1, 1, 0, 0, 1],
          [1, 1, 0, 0, 0],
          [0, 0, 1, 1, 0]])

-.. warning::
-
-    - The sequence of sequences format will disappear in a near future.
-    - Most estimators and functions support both multilabel format.
-

 One-Vs-The-Rest
 ===============
@@ -151,8 +139,8 @@ Multilabel learning
 -------------------

 :class:`OneVsRestClassifier` also supports multilabel classification.
-To use this feature, feed the classifier a list of tuples containing
-target labels, like in the example below.
+To use this feature, feed the classifier an indicator matrix, in which cell
+[i, j] indicates the presence of label j in sample i.


 .. figure:: ../auto_examples/images/plot_multilabel_1.png
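
The relationship between the deprecated sequence of sequences format and the indicator matrix described in doc/modules/multiclass.rst can be sketched in a few lines of plain Python. This is only an illustration of the two representations, not the code behind `MultiLabelBinarizer`; the function names `to_indicator` and `from_indicator` are hypothetical.

```python
def to_indicator(label_sets, classes=None):
    """Turn a list of label collections into rows of 0/1 flags.

    Cell [i][j] is 1 when classes[j] is present in sample i.
    """
    if classes is None:
        # Assumption of this sketch: columns follow the sorted unique labels.
        classes = sorted(set(label for labels in label_sets for label in labels))
    column = dict((label, j) for j, label in enumerate(classes))
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return classes, rows


def from_indicator(classes, rows):
    """Inverse mapping: recover one tuple of labels per indicator row."""
    return [tuple(c for c, flag in zip(classes, row) if flag) for row in rows]


classes, Y = to_indicator([[0], [1, 2], []])
print(classes)                     # [0, 1, 2]
print(Y)                           # [[1, 0, 0], [0, 1, 1], [0, 0, 0]]
print(from_indicator(classes, Y))  # [(0,), (1, 2), ()]
```

The sample with no labels becomes a row of zeros, which is how the indicator format represents unlabeled samples.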

doc/modules/preprocessing.rst

Lines changed: 2 additions & 1 deletion
@@ -377,8 +377,9 @@ matrix from a list of multi-class labels::
     array([[1, 0, 0, 0],
            [0, 0, 0, 1]])

-:class:`LabelBinarizer` also supports multiple labels per instance::
+For multiple labels per instance, use :class:`MultiLabelBinarizer`::

+    >>> lb = preprocessing.MultiLabelBinarizer()
     >>> lb.fit_transform([(1, 2), (3,)])
     array([[1, 1, 0],
            [0, 0, 1]])
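
The what's new entry in this commit says `MultiLabelBinarizer` converts both to and from the indicator format. A short round trip might look like the following; it assumes a scikit-learn release that ships `MultiLabelBinarizer` (it is introduced by this change) and reuses the input from the doctest above.

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# Forward direction: collections of labels -> binary indicator matrix.
Y = mlb.fit_transform([(1, 2), (3,)])
print(mlb.classes_)  # the label that each column stands for
print(Y)

# Backward direction: indicator rows -> one tuple of labels per sample.
print(mlb.inverse_transform(Y))
```

The `classes_` attribute is what ties a column index back to a label, so it is worth keeping the fitted binarizer around if predictions need to be decoded later.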

doc/whats_new.rst

Lines changed: 6 additions & 0 deletions
@@ -235,6 +235,12 @@ API changes summary
    - :class:`cluster.WardClustering` is deprecated. Use
    - :class:`cluster.AgglomerativeClustering` instead.

+   - Direct support for the sequence of sequences (or list of lists) multilabel
+     format is deprecated. To convert to and from the supported binary
+     indicator matrix format, use
+     :class:`MultiLabelBinarizer <preprocessing.MultiLabelBinarizer>`.
+     By `Joel Nothman`_.
+
    - Add score method to :class:`PCA <decomposition.PCA>` following the model of
      probabilistic PCA and deprecate
      :class:`ProbabilisticPCA <decomposition.ProbabilisticPCA>` model whose

examples/plot_multilabel.py

Lines changed: 5 additions & 5 deletions
@@ -55,9 +55,7 @@ def plot_subfigure(X, Y, subplot, title, transform):
     if transform == "pca":
         X = PCA(n_components=2).fit_transform(X)
     elif transform == "cca":
-        # Convert list of tuples to a class indicator matrix first
-        Y_indicator = LabelBinarizer().fit(Y).transform(Y)
-        X = CCA(n_components=2).fit(X, Y_indicator).transform(X)
+        X = CCA(n_components=2).fit(X, Y).transform(X)
     else:
         raise ValueError

@@ -73,8 +71,8 @@ def plot_subfigure(X, Y, subplot, title, transform):
     pl.subplot(2, 2, subplot)
     pl.title(title)

-    zero_class = np.where([0 in y for y in Y])
-    one_class = np.where([1 in y for y in Y])
+    zero_class = np.where(Y[:, 0])
+    one_class = np.where(Y[:, 1])
     pl.scatter(X[:, 0], X[:, 1], s=40, c='gray')
     pl.scatter(X[zero_class, 0], X[zero_class, 1], s=160, edgecolors='b',
                facecolors='none', linewidths=2, label='Class 1')
@@ -100,13 +98,15 @@ def plot_subfigure(X, Y, subplot, title, transform):

 X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                       allow_unlabeled=True,
+                                      return_indicator=True,
                                       random_state=1)

 plot_subfigure(X, Y, 1, "With unlabeled samples + CCA", "cca")
 plot_subfigure(X, Y, 2, "With unlabeled samples + PCA", "pca")

 X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                       allow_unlabeled=False,
+                                      return_indicator=True,
                                       random_state=1)

 plot_subfigure(X, Y, 3, "Without unlabeled samples + CCA", "cca")
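
The change to `zero_class` and `one_class` in examples/plot_multilabel.py swaps a membership test over label tuples for a column lookup in the indicator matrix. The small check below, using made-up data rather than the example's generated dataset, suggests why the two forms select the same samples.

```python
import numpy as np

# Made-up labels for four samples, in both representations.
Y_seq = [(0, 1), (), (1,), (0,)]                    # sequence of sequences
Y_ind = np.array([[1, 1], [0, 0], [0, 1], [1, 0]])  # indicator matrix

# Old style: test membership of the label in each sample's tuple.
old_zero = np.where([0 in y for y in Y_seq])
old_one = np.where([1 in y for y in Y_seq])

# New style: a nonzero entry in column j means label j is present.
new_zero = np.where(Y_ind[:, 0])
new_one = np.where(Y_ind[:, 1])

print(old_zero[0], new_zero[0])  # [0 3] [0 3]
print(old_one[0], new_one[0])    # [0 2] [0 2]
```

In both cases `np.where` is called with a single argument, so it returns a tuple of index arrays that can be used directly to index `X`.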

sklearn/datasets/samples_generator.py

Lines changed: 10 additions & 2 deletions
@@ -7,10 +7,11 @@
 # License: BSD 3 clause

 import numbers
+import warnings
 import numpy as np
 from scipy import linalg

-from ..preprocessing import LabelBinarizer
+from ..preprocessing import MultiLabelBinarizer
 from ..utils import array2d, check_random_state
 from ..utils import shuffle as util_shuffle
 from ..utils.random import sample_without_replacement
@@ -336,8 +337,15 @@ def sample_example():
     X, Y = zip(*[sample_example() for i in range(n_samples)])

     if return_indicator:
-        lb = LabelBinarizer()
+        lb = MultiLabelBinarizer()
         Y = lb.fit([range(n_classes)]).transform(Y)
+    else:
+        warnings.warn('Support for the sequence of sequences multilabel '
+                      'representation is being deprecated and replaced with '
+                      'a sparse indicator matrix. '
+                      'return_indicator wil default to True from version '
+                      '0.17.',
+                      DeprecationWarning)

     return np.array(X, dtype=np.float64), Y

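
The pattern added to `make_multilabel_classification`, warning only when the legacy return format is requested, can be shown in isolation with the standard `warnings` module. The function `make_labels` and its data are hypothetical stand-ins, not part of scikit-learn.

```python
import warnings


def make_labels(return_indicator=False):
    """Hypothetical generator that mirrors the flag used in the commit."""
    label_sets = [(0, 1), (), (1,)]
    if return_indicator:
        # Preferred path: one 0/1 row per sample, no warning.
        return [[int(j in labels) for j in range(2)] for labels in label_sets]
    # Legacy path: still works, but tells the caller it is going away.
    warnings.warn("The sequence of sequences format is deprecated; "
                  "pass return_indicator=True.", DeprecationWarning)
    return label_sets


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    legacy = make_labels()

print(legacy)                                 # [(0, 1), (), (1,)]
print(caught[0].category.__name__)            # DeprecationWarning
print(make_labels(return_indicator=True))     # [[1, 1], [0, 0], [0, 1]]
```

Python hides `DeprecationWarning` in many contexts by default, which is why the sketch switches the filter to "always" before calling the legacy path.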

sklearn/datasets/tests/test_samples_generator.py

Lines changed: 5 additions & 4 deletions
@@ -13,6 +13,7 @@
 from sklearn.utils.testing import assert_true
 from sklearn.utils.testing import assert_less
 from sklearn.utils.testing import assert_raises
+from sklearn.utils.testing import assert_warns

 from sklearn.datasets import make_classification
 from sklearn.datasets import make_multilabel_classification
@@ -131,11 +132,11 @@ def test_make_classification_informative_features():
                   n_clusters_per_class=2)


-def test_make_multilabel_classification():
+def test_make_multilabel_classification_return_sequences():
     for allow_unlabeled, min_length in zip((True, False), (0, 1)):
-        X, Y = make_multilabel_classification(n_samples=100, n_features=20,
-                                              n_classes=3, random_state=0,
-                                              allow_unlabeled=allow_unlabeled)
+        X, Y = assert_warns(DeprecationWarning, make_multilabel_classification,
+                            n_samples=100, n_features=20, n_classes=3,
+                            random_state=0, allow_unlabeled=allow_unlabeled)
         assert_equal(X.shape, (100, 20), "X shape mismatch")
         if not allow_unlabeled:
             assert_equal(max([max(y) for y in Y]), 2)
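
The updated test wraps the call in `assert_warns` and still unpacks `X, Y` from its result, so the helper has to both check for the warning and pass the return value through. A minimal stand-in with that behaviour could look like this; `call_and_expect_warning` and `legacy_pair` are hypothetical names, and the sketch makes no claim about how `sklearn.utils.testing.assert_warns` is actually written.

```python
import warnings


def call_and_expect_warning(warning_class, func, *args, **kwargs):
    """Call func, fail unless it emits warning_class, return its result."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    if not any(issubclass(w.category, warning_class) for w in caught):
        raise AssertionError("%s not raised by %s"
                             % (warning_class.__name__, func.__name__))
    return result


def legacy_pair(n):
    """Hypothetical deprecated function returning two values."""
    warnings.warn("legacy_pair() is deprecated", DeprecationWarning)
    return list(range(n)), n


# The result is forwarded, so tuple unpacking works as in the test above.
X, n = call_and_expect_warning(DeprecationWarning, legacy_pair, 3)
print(X, n)  # [0, 1, 2] 3
```

Passing the callable and its arguments separately, rather than the result of the call, is what lets the helper install the warning filter before the function runs.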
