
Commit 642ae59

DOC make defaults more explicit in text feature extraction.
1 parent a870b90 commit 642ae59

File tree: 1 file changed, +35 -35 lines

sklearn/feature_extraction/text.py (35 additions, 35 deletions)
@@ -267,12 +267,12 @@ def _validate_vocabulary(self):
 
     def _check_vocabulary(self):
         """Check if vocabulary is empty or missing (not fit-ed)"""
-        msg="%(name)s - Vocabulary wasn't fitted."
+        msg = "%(name)s - Vocabulary wasn't fitted."
         check_is_fitted(self, 'vocabulary_', msg=msg),
-
+
         if len(self.vocabulary_) == 0:
             raise ValueError("Vocabulary is empty")
-
+
     @property
     @deprecated("The `fixed_vocabulary` attribute is deprecated and will be "
                 "removed in 0.18. Please use `fixed_vocabulary_` instead.")
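The `_check_vocabulary` helper in this hunk is what makes an unfitted vectorizer fail loudly. A minimal sketch of the behavior it guards, using the public API of a recent scikit-learn:

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
try:
    # transform() before fit(): the vocabulary check raises NotFittedError
    vectorizer.transform(["some unseen document"])
except NotFittedError as exc:
    print("not fitted:", exc)
```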
@@ -320,7 +320,7 @@ class HashingVectorizer(BaseEstimator, VectorizerMixin):
     Parameters
     ----------
 
-    input: string {'filename', 'file', 'content'}
+    input : string {'filename', 'file', 'content'}
         If 'filename', the sequence passed as an argument to fit is
         expected to be a list of filenames that need reading to fetch
         the raw content to analyze.
@@ -331,7 +331,7 @@ class HashingVectorizer(BaseEstimator, VectorizerMixin):
         Otherwise the input is expected to be the sequence strings or
         bytes items are expected to be analyzed directly.
 
-    encoding : string, 'utf-8' by default.
+    encoding : string, default='utf-8'
         If bytes or files are given to analyze, this encoding is used to
         decode.
 
@@ -341,66 +341,66 @@ class HashingVectorizer(BaseEstimator, VectorizerMixin):
         'strict', meaning that a UnicodeDecodeError will be raised. Other
         values are 'ignore' and 'replace'.
 
-    strip_accents: {'ascii', 'unicode', None}
+    strip_accents : {'ascii', 'unicode', None}
         Remove accents during the preprocessing step.
         'ascii' is a fast method that only works on characters that have
         an direct ASCII mapping.
         'unicode' is a slightly slower method that works on any characters.
         None (default) does nothing.
 
-    analyzer: string, {'word', 'char', 'char_wb'} or callable
+    analyzer : string, {'word', 'char', 'char_wb'} or callable
         Whether the feature should be made of word or character n-grams.
         Option 'char_wb' creates character n-grams only from text inside
         word boundaries.
 
         If a callable is passed it is used to extract the sequence of features
         out of the raw, unprocessed input.
 
-    preprocessor: callable or None (default)
+    preprocessor : callable or None (default)
         Override the preprocessing (string transformation) stage while
         preserving the tokenizing and n-grams generation steps.
 
-    tokenizer: callable or None (default)
+    tokenizer : callable or None (default)
         Override the string tokenization step while preserving the
         preprocessing and n-grams generation steps.
 
-    ngram_range: tuple (min_n, max_n)
+    ngram_range : tuple (min_n, max_n), default=(1, 1)
         The lower and upper boundary of the range of n-values for different
         n-grams to be extracted. All values of n such that min_n <= n <= max_n
         will be used.
 
-    stop_words: string {'english'}, list, or None (default)
+    stop_words : string {'english'}, list, or None (default)
         If 'english', a built-in stop word list for English is used.
 
         If a list, that list is assumed to contain stop words, all of which
         will be removed from the resulting tokens.
 
-    lowercase: boolean, default True
+    lowercase : boolean, default=True
         Convert all characters to lowercase before tokenizing.
 
-    token_pattern: string
+    token_pattern : string
         Regular expression denoting what constitutes a "token", only used
         if `analyzer == 'word'`. The default regexp selects tokens of 2
         or more alphanumeric characters (punctuation is completely ignored
         and always treated as a token separator).
 
-    n_features : integer, optional, (2 ** 20) by default
+    n_features : integer, default=(2 ** 20)
         The number of features (columns) in the output matrices. Small numbers
         of features are likely to cause hash collisions, but large numbers
         will cause larger coefficient dimensions in linear learners.
 
     norm : 'l1', 'l2' or None, optional
         Norm used to normalize term vectors. None for no normalization.
 
-    binary: boolean, False by default.
+    binary: boolean, default=False.
         If True, all non zero counts are set to 1. This is useful for discrete
         probabilistic models that model binary events rather than integer
         counts.
 
     dtype: type, optional
         Type of the matrix returned by fit_transform() or transform().
 
-    non_negative : boolean, optional
+    non_negative : boolean, default=False
         Whether output matrices should contain non-negative values only;
         effectively calls abs on the matrix prior to returning it.
         When True, output values can be interpreted as frequencies.
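The HashingVectorizer defaults this hunk spells out can be checked directly; a small sketch against a recent scikit-learn (note that `non_negative` was later removed from the estimator, so only the surviving parameters are exercised here):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Instantiating with no arguments uses the documented defaults.
hv = HashingVectorizer()
print(hv.n_features == 2 ** 20)   # default n_features
print(hv.norm)                    # 'l2'
print(hv.ngram_range)             # (1, 1)
print(hv.binary, hv.lowercase)    # False True

# Hashing is stateless: transform works without fit and yields a
# sparse matrix with exactly n_features columns.
X = hv.transform(["hashing trick maps tokens to columns"])
print(X.shape[1] == 2 ** 20)      # True
```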
@@ -573,23 +573,23 @@ class CountVectorizer(BaseEstimator, VectorizerMixin):
         or more alphanumeric characters (punctuation is completely ignored
         and always treated as a token separator).
 
-    max_df : float in range [0.0, 1.0] or int, optional, 1.0 by default
+    max_df : float in range [0.0, 1.0] or int, default=1.0
         When building the vocabulary ignore terms that have a document
         frequency strictly higher than the given threshold (corpus-specific
         stop words).
         If float, the parameter represents a proportion of documents, integer
         absolute counts.
         This parameter is ignored if vocabulary is not None.
 
-    min_df : float in range [0.0, 1.0] or int, optional, 1 by default
+    min_df : float in range [0.0, 1.0] or int, default=1
         When building the vocabulary ignore terms that have a document
         frequency strictly lower than the given threshold. This value is also
         called cut-off in the literature.
         If float, the parameter represents a proportion of documents, integer
         absolute counts.
         This parameter is ignored if vocabulary is not None.
 
-    max_features : optional, None by default
+    max_features : int or None, default=None
         If not None, build a vocabulary that only consider the top
         max_features ordered by term frequency across the corpus.
@@ -602,7 +602,7 @@ class CountVectorizer(BaseEstimator, VectorizerMixin):
         in the mapping should not be repeated and should not have any gap
         between 0 and the largest index.
 
-    binary : boolean, False by default.
+    binary : boolean, default=False
         If True, all non zero counts are set to 1. This is useful for discrete
         probabilistic models that model binary events rather than integer
         counts.
@@ -630,7 +630,7 @@ class CountVectorizer(BaseEstimator, VectorizerMixin):
 
     Notes
     -----
-    The ``stop_words_`` attribute can get large and increase the model size
+    The ``stop_words_`` attribute can get large and increase the model size
     when pickling. This attribute is provided only for introspection and can
     be safely removed using delattr or set to None before pickling.
     """
@@ -846,7 +846,7 @@ def transform(self, raw_documents):
         """
         if not hasattr(self, 'vocabulary_'):
             self._validate_vocabulary()
-
+
         self._check_vocabulary()
 
         # use the same matrix-building strategy as fit_transform
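With the documented defaults (min_df=1, max_df=1.0), CountVectorizer keeps every term out of the box. A quick sketch of how raising min_df prunes rare terms, on hypothetical toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "apple banana apple",
    "banana cherry",
    "banana durian",
]

# Defaults: min_df=1, max_df=1.0 -> every term enters the vocabulary.
cv = CountVectorizer()
cv.fit(docs)
print(sorted(cv.vocabulary_))   # ['apple', 'banana', 'cherry', 'durian']

# min_df=2 drops terms that appear in fewer than two documents.
cv2 = CountVectorizer(min_df=2)
cv2.fit(docs)
print(sorted(cv2.vocabulary_))  # ['banana']
```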
@@ -926,15 +926,15 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
     norm : 'l1', 'l2' or None, optional
         Norm used to normalize term vectors. None for no normalization.
 
-    use_idf : boolean, optional
+    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.
 
-    smooth_idf : boolean, optional
+    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.
 
-    sublinear_tf : boolean, optional
+    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
 
     References
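The smooth_idf default made explicit here corresponds, in a recent scikit-learn, to the smoothed formula idf = ln((1 + n_docs) / (1 + df)) + 1; a sketch verifying the fitted idf weights against it:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Two documents over three terms; term 0 appears in both documents,
# terms 1 and 2 in one document each.
counts = np.array([[2, 1, 0],
                   [1, 0, 1]])

# Defaults: use_idf=True, smooth_idf=True, sublinear_tf=False.
tfidf = TfidfTransformer().fit(counts)

# Smoothed idf: ln((1 + n_docs) / (1 + df)) + 1.
n_docs, df = 2, np.array([2, 1, 1])
expected_idf = np.log((1 + n_docs) / (1 + df)) + 1
print(np.allclose(tfidf.idf_, expected_idf))  # True
```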
@@ -1109,22 +1109,22 @@ class TfidfVectorizer(CountVectorizer):
         or more alphanumeric characters (punctuation is completely ignored
         and always treated as a token separator).
 
-    max_df : float in range [0.0, 1.0] or int, optional, 1.0 by default
+    max_df : float in range [0.0, 1.0] or int, default=1.0
         When building the vocabulary ignore terms that have a document frequency
         strictly higher than the given threshold (corpus specific stop words).
         If float, the parameter represents a proportion of documents, integer
         absolute counts.
         This parameter is ignored if vocabulary is not None.
 
-    min_df : float in range [0.0, 1.0] or int, optional, 1 by default
+    min_df : float in range [0.0, 1.0] or int, default=1
         When building the vocabulary ignore terms that have a document frequency
         strictly lower than the given threshold.
         This value is also called cut-off in the literature.
         If float, the parameter represents a proportion of documents, integer
         absolute counts.
         This parameter is ignored if vocabulary is not None.
 
-    max_features : optional, None by default
+    max_features : int or None, default=None
         If not None, build a vocabulary that only consider the top
         max_features ordered by term frequency across the corpus.
@@ -1135,7 +1135,7 @@ class TfidfVectorizer(CountVectorizer):
         indices in the feature matrix, or an iterable over terms. If not
         given, a vocabulary is determined from the input documents.
 
-    binary : boolean, False by default.
+    binary : boolean, default=False
         If True, all non-zero term counts are set to 1. This does not mean
         outputs will have only 0/1 values, only that the tf term in tf-idf
         is binary. (Set idf and normalization to False to get 0/1 outputs.)
@@ -1146,15 +1146,15 @@ class TfidfVectorizer(CountVectorizer):
     norm : 'l1', 'l2' or None, optional
         Norm used to normalize term vectors. None for no normalization.
 
-    use_idf : boolean, optional
+    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.
 
-    smooth_idf : boolean, optional
+    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.
 
-    sublinear_tf : boolean, optional
+    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
 
     Attributes
@@ -1171,7 +1171,7 @@ class TfidfVectorizer(CountVectorizer):
         - were cut off by feature selection (`max_features`).
 
         This is only available if no vocabulary was given.
-
+
     See also
     --------
     CountVectorizer
@@ -1181,10 +1181,10 @@ class TfidfVectorizer(CountVectorizer):
     TfidfTransformer
         Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.
-
+
     Notes
     -----
-    The ``stop_words_`` attribute can get large and increase the model size
+    The ``stop_words_`` attribute can get large and increase the model size
     when pickling. This attribute is provided only for introspection and can
     be safely removed using delattr or set to None before pickling.
     """
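As a sanity check on these defaults: TfidfVectorizer with no arguments behaves like CountVectorizer piped into TfidfTransformer, and the ``stop_words_`` attribute can indeed be dropped before pickling, as the Notes section advises. A sketch against a recent scikit-learn, not part of the diff:

```python
import pickle
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["the cat sat", "the dog sat", "the cat ran"]

# Default TfidfVectorizer == CountVectorizer followed by TfidfTransformer.
X_direct = TfidfVectorizer().fit_transform(docs)
counts = CountVectorizer().fit_transform(docs)
X_piped = TfidfTransformer().fit_transform(counts)
print(np.allclose(X_direct.toarray(), X_piped.toarray()))  # True

# stop_words_ is introspection-only; removing it before pickling does
# not change transform results.
vec = TfidfVectorizer(stop_words='english').fit(docs)
before = vec.transform(docs).toarray()
delattr(vec, 'stop_words_')
restored = pickle.loads(pickle.dumps(vec))
print(np.allclose(restored.transform(docs).toarray(), before))  # True
```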

0 commit comments
