Groupby RST
.. _groupby:
{{ header }}
*****************************
Group by: split-apply-combine
*****************************
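By "group by" we are referring to a process involving one or more of the following
steps:

* **Splitting** the data into groups based on some criteria.
* **Applying** a function to each group independently.
* **Combining** the results into a data structure.
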
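Out of these, the split step is the most straightforward. In the apply step, we
might wish to do one of the following:

* **Aggregation**: compute a summary statistic (or statistics) for each group.
* **Transformation**: perform some group-specific computations and return a
  like-indexed object.
* **Filtration**: discard some groups, according to a group-wise computation
  that evaluates to True or False.
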
Many of these operations are defined on GroupBy objects. These operations are
similar to those of the :ref:`aggregating API <basics.aggregate>`,
:ref:`window API <window.overview>`, and :ref:`resample API <timeseries.aggregate>`.
It is possible that a given operation does not fall into one of these categories or
is some combination of them. In such a case, it may be possible to compute the
operation using GroupBy's ``apply`` method. This method will examine the results of
the apply step and try to sensibly combine them into a single result if it doesn't
fit into any of the above three categories.
.. note::
An operation that is split into multiple steps using built-in GroupBy operations
will be more efficient than using the ``apply`` method with a user-defined Python
function.
The name GroupBy should be quite familiar to those who have used
a SQL-based tool (or ``itertools``), in which you can write code like:
.. code-block:: sql
We aim to make operations like this natural and easy to express using
pandas. We'll address each area of GroupBy functionality, then provide some
non-trivial examples / use cases.
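
As a rough sketch of how a SQL-style ``GROUP BY`` maps onto pandas, the snippet
below groups a small table by two columns and aggregates two others. The table,
its column names, and the output names ``mean_col3`` and ``sum_col4`` are
illustrative assumptions and are not used elsewhere in this guide.

```python
import pandas as pd

# Hypothetical table; the column names are placeholders for illustration.
some_table = pd.DataFrame(
    {
        "Column1": ["a", "a", "b", "b"],
        "Column2": ["x", "x", "y", "z"],
        "Column3": [1.0, 3.0, 5.0, 7.0],
        "Column4": [10, 20, 30, 40],
    }
)

# Group on two key columns, then take the mean of one column
# and the sum of another, one output row per key combination.
summary = some_table.groupby(["Column1", "Column2"]).agg(
    mean_col3=("Column3", "mean"),
    sum_col4=("Column4", "sum"),
)
print(summary)
```

The group keys end up in the index of ``summary``; the sections below describe
each of these pieces in more detail.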
.. _groupby.split:
.. ipython:: python
speeds = pd.DataFrame(
[
("bird", "Falconiformes", 389.0),
("bird", "Psittaciformes", 24.0),
("mammal", "Carnivora", 80.2),
("mammal", "Primates", np.nan),
("mammal", "Carnivora", 58),
],
index=["falcon", "parrot", "lion", "monkey", "leopard"],
columns=("class", "order", "max_speed"),
)
speeds
grouped = speeds.groupby("class")
grouped = speeds.groupby(["class", "order"])
.. note::
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df
.. ipython:: python
grouped = df.groupby("A")
grouped = df.groupby("B")
grouped = df.groupby(["A", "B"])
.. note::
If we also have a MultiIndex on columns ``A`` and ``B``, we can group by all
the columns except the one we specify:
.. ipython:: python
The above GroupBy will split the DataFrame on its index (rows). To split by
columns, first do a transpose:
.. ipython::
.. ipython:: python
index = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], index=index)
grouped = s.groupby(level=0)
grouped.first()
grouped.last()
grouped.sum()
Note that **no splitting occurs** until it's needed. Creating the GroupBy object
only verifies that you've passed a valid mapping.
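
A minimal sketch of this behavior, using a small made-up frame: creating the
GroupBy object validates the key right away, while the actual computation only
runs once an operation such as ``sum`` is requested. The column name
``not_a_column`` is a deliberately invalid placeholder.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})

# Creating the GroupBy object does not aggregate anything yet.
grouped = df.groupby("A")

# The grouping key is checked up front, so an unknown column name
# raises a KeyError when the GroupBy object is created.
try:
    df.groupby("not_a_column")
    raised = False
except KeyError:
    raised = True

# The groups are only materialized when an operation needs them.
totals = grouped["C"].sum()
print(raised)
print(totals)
```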
.. note::
.. _groupby.sorting:
GroupBy sorting
~~~~~~~~~~~~~~~~~~~~~~~~~
By default the group keys are sorted during the ``groupby`` operation. You may,
however, pass ``sort=False`` for potential speedups. With ``sort=False`` the order
among group keys follows the order of appearance of the keys in the original
DataFrame:
.. ipython:: python
Note that ``groupby`` will preserve the order in which *observations* are sorted
*within* each group.
For example, the groups created by ``groupby()`` below are in the order they
appeared in the original ``DataFrame``:
.. ipython:: python
df3.groupby(["X"]).get_group(("B",))
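
The two ordering rules can be seen side by side in a small, self-contained
sketch. The frame ``df2`` below is an assumed toy example: ``sort`` only affects
the order of the group keys, while rows inside a group keep their original
relative order.

```python
import pandas as pd

df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})

# Default: group keys are sorted.
sorted_keys = df2.groupby("X").sum()

# sort=False: group keys follow their order of first appearance.
appearance_keys = df2.groupby("X", sort=False).sum()

# Rows within a group keep the order they had in the original frame.
rows_in_b = df2.groupby("X").get_group("B")

print(sorted_keys.index.tolist())
print(appearance_keys.index.tolist())
print(rows_in_b)
```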
.. _groupby.dropna:
GroupBy dropna
^^^^^^^^^^^^^^
By default ``NA`` values are excluded from group keys during the ``groupby``
operation. To include ``NA`` values in group keys, pass ``dropna=False``.
.. ipython:: python
df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
df_dropna
.. ipython:: python
The default setting of the ``dropna`` argument is ``True``, which means ``NA``
values are not included in group keys.
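
A short sketch of the difference, using an assumed frame ``df_na`` that has one
missing value in the grouping column:

```python
import pandas as pd

df_na = pd.DataFrame(
    {"a": [1, 1, 2, 1], "b": [2.0, None, 1.0, 2.0], "c": [3, 4, 3, 2]}
)

# Default (dropna=True): the row whose key is NA is left out of the result.
dropped = df_na.groupby("b").sum()

# dropna=False: NA becomes a group key of its own.
kept = df_na.groupby("b", dropna=False).sum()

print(dropped)
print(kept)
```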
.. _groupby.attributes:
The ``groups`` attribute is a dictionary whose keys are the computed unique groups
and corresponding values are the axis labels belonging to each group. In the
above example we have:
.. ipython:: python
df.groupby("A").groups
df.T.groupby(get_letter_type).groups
Calling the standard Python ``len`` function on the GroupBy object returns
the number of groups, which is the same as the length of the ``groups`` dictionary:
.. ipython:: python
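
As a quick check of this relationship, the sketch below builds a small assumed
frame, groups it by two columns, and compares the three ways of counting groups.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1, 2, 3, 4],
    }
)
grouped = df.groupby(["A", "B"])

# Keys are the unique (A, B) combinations; values are the row labels in each group.
print(grouped.groups)

# len(), the length of the ``groups`` dict, and ``ngroups`` all agree.
print(len(grouped), len(grouped.groups), grouped.ngroups)
```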
.. _groupby.tabcompletion:
``GroupBy`` will tab complete column names, GroupBy operations, and other
attributes:
.. ipython:: python
n = 10
weight = np.random.normal(166, 20, size=n)
height = np.random.normal(60, 10, size=n)
time = pd.date_range("1/1/2000", periods=n)
gender = np.random.choice(["male", "female"], size=n)
df = pd.DataFrame(
{"height": height, "weight": weight, "gender": gender}, index=time
)
df
gb = df.groupby("gender")
.. ipython::
@verbatim
In [1]: gb.<TAB> # noqa: E225, E999
gb.agg gb.boxplot gb.cummin gb.describe gb.filter
gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot
gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups
gb.hist gb.max gb.min gb.nth gb.prod gb.resample
gb.sum gb.var
gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head
gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size
gb.tail gb.weight
.. _groupby.multiindex:
.. ipython:: python
arrays = [
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
s = pd.Series(np.random.randn(8), index=index)
s
.. ipython:: python
grouped = s.groupby(level=0)
grouped.sum()
If the MultiIndex has names specified, these can be passed instead of the level
number:
.. ipython:: python
s.groupby(level="second").sum()
.. ipython:: python
arrays = [
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["doo", "doo", "bee", "bee", "bop", "bop", "bop", "bop"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second", "third"])
s = pd.Series(np.random.randn(8), index=index)
s
s.groupby(level=["first", "second"]).sum()
.. ipython:: python
s.groupby(["first", "second"]).sum()
.. ipython:: python
arrays = [
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
df
Then we group ``df`` by the ``second`` index level and the ``A`` column.
.. ipython:: python
df.groupby([pd.Grouper(level=1), "A"]).sum()
.. ipython:: python
df.groupby([pd.Grouper(level="second"), "A"]).sum()
.. ipython:: python
df.groupby(["second", "A"]).sum()
.. ipython:: python
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df
grouped = df.groupby(["A"])
grouped_C = grouped["C"]
grouped_D = grouped["D"]
This is mainly syntactic sugar for the alternative, which is much more verbose:
.. ipython:: python
df["C"].groupby(df["A"])
You can also include the grouping columns if you want to operate on them.
.. ipython:: python
grouped[["A", "B"]].sum()
.. _groupby.iterating-label:
With the GroupBy object in hand, iterating through the grouped data is very
natural and functions similarly to :py:func:`itertools.groupby`:
.. ipython::
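
A minimal sketch of iterating over a GroupBy object created from a single key,
using a small assumed frame. Each iteration yields the group key and the
sub-DataFrame holding that group's rows.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})

seen = {}
for name, group in df.groupby("A"):
    # ``name`` is the group key; ``group`` is the sub-DataFrame for that key.
    seen[name] = group["C"].tolist()

print(seen)
```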
In the case of grouping by multiple keys, the group name will be a tuple:
.. ipython::
In [5]: for name, group in df.groupby(['A', 'B']):
...: print(name)
...: print(group)
...:
See :ref:`timeseries.iterating-label`.
Selecting a group
-----------------
.. ipython:: python
grouped.get_group("bar")
.. ipython:: python
.. _groupby.aggregate:
Aggregation
-----------
.. ipython:: python
animals = pd.DataFrame(
{
"kind": ["cat", "dog", "cat", "dog"],
"height": [9.1, 6.0, 9.5, 34.0],
"weight": [7.9, 7.5, 9.9, 198.0],
}
)
animals
animals.groupby("kind").sum()
In the result, the keys of the groups appear in the index by default. They can
instead be included in the columns by passing ``as_index=False``.
.. ipython:: python
animals.groupby("kind", as_index=False).sum()
.. _groupby.aggregate.builtin:
Many common aggregations are built-in to GroupBy objects as methods. Of the methods
listed below, those with a ``*`` do *not* have an efficient, GroupBy-specific,
implementation.
.. csv-table::
:header: "Method", "Description"
:widths: 20, 80
Some examples:
.. ipython:: python
df.groupby("A")[["C", "D"]].max()
df.groupby(["A", "B"]).mean()
.. ipython:: python
.. ipython:: python
grouped.describe()
.. ipython:: python
ll = [['foo', 1], ['foo', 2], ['foo', 2], ['bar', 1], ['bar', 1]]
df4 = pd.DataFrame(ll, columns=["A", "B"])
df4
df4.groupby("A")["B"].nunique()
.. note::
Aggregation functions **will not** return the groups that you are aggregating over
as named *columns* when ``as_index=True``, the default. The grouped columns will
be the **indices** of the returned object.
Passing ``as_index=False`` **will** return the groups that you are aggregating over
as named columns, regardless of whether they are named **indices** or *columns* in
the inputs.
.. _groupby.aggregate.agg:
.. note::
The :meth:`~.DataFrameGroupBy.aggregate` method can accept many different types of
inputs. This section details using string aliases for various GroupBy methods;
other inputs are detailed in the sections below.
.. ipython:: python
grouped = df.groupby("A")
grouped[["C", "D"]].aggregate("sum")
The result of the aggregation will have the group names as the
new index. In the case of multiple keys, the result is a
:ref:`MultiIndex <advanced.hierarchical>` by default. As mentioned above, this can
be changed by using the ``as_index`` option:
.. ipython:: python
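
For illustration, the sketch below aggregates an assumed small frame by two keys,
once with the default ``as_index=True`` and once with ``as_index=False``.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "one", "two"],
        "C": [1, 2, 3, 4],
    }
)

# Default: the two keys form a MultiIndex on the result.
indexed = df.groupby(["A", "B"]).agg("sum")

# as_index=False: the keys are returned as ordinary columns.
flat = df.groupby(["A", "B"], as_index=False).agg("sum")

print(indexed)
print(flat)
```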
.. ipython:: python
df.groupby(["A", "B"]).agg("sum").reset_index()
.. _groupby.aggregate.udf:
Users can also provide their own User-Defined Functions (UDFs) for custom
aggregations.
.. warning::
When aggregating with a UDF, the UDF should not mutate the
provided ``Series``. See :ref:`gotchas.udf-mutation` for more information.
.. note::
.. ipython:: python
animals
animals.groupby("kind")[["height"]].agg(lambda x: set(x))
The resulting dtype will reflect that of the aggregating function. If the results
from different groups have different dtypes, then a common dtype will be determined
in the same way as ``DataFrame`` construction.
.. ipython:: python
animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum())
.. _groupby.aggregate.multifunc:
.. ipython:: python
grouped = df.groupby("A")
grouped["C"].agg(["sum", "mean", "std"])
.. ipython:: python
The resulting aggregations are named after the functions themselves. If you
need to rename, then you can add in a chained operation for a ``Series`` like this:
.. ipython:: python
(
grouped["C"]
.agg(["sum", "mean", "std"])
.rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
)
.. ipython:: python
(
grouped[["C", "D"]].agg(["sum", "mean", "std"]).rename(
columns={"sum": "foo", "mean": "bar", "std": "baz"}
)
)
.. note::
In general, the output column names should be unique, but pandas will allow
you to apply the same function (or two functions with the same name) to the same
column.
.. ipython:: python
grouped["C"].agg(["sum", "sum"])
pandas also allows you to provide multiple lambdas. In this case, pandas
will mangle the name of the (nameless) lambda functions, appending ``_<i>``
to each subsequent lambda.
.. ipython:: python
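
A small sketch of the name mangling, using an assumed frame and two arbitrary
lambdas. The resulting column labels show the ``_<i>`` suffixes.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 6.0]})

# Two anonymous functions applied to the same column.
result = df.groupby("A")["C"].agg(
    [lambda x: x.max() - x.min(), lambda x: x.sum()]
)

print(result.columns.tolist())
print(result)
```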
Named aggregation
~~~~~~~~~~~~~~~~~
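To support column-specific aggregation *with control over the output column names*,
pandas accepts the special syntax in :meth:`.DataFrameGroupBy.agg`
and :meth:`.SeriesGroupBy.agg`, known as "named aggregation", where

* The keywords are the *output* column names.
* The values are tuples whose first element is the column to select
  and whose second element is the aggregation to apply to that column.
  pandas provides the :class:`NamedAgg` namedtuple with the fields
  ``['column', 'aggfunc']`` to make the arguments clearer.
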
.. ipython:: python
animals
animals.groupby("kind").agg(
min_height=pd.NamedAgg(column="height", aggfunc="min"),
max_height=pd.NamedAgg(column="height", aggfunc="max"),
average_weight=pd.NamedAgg(column="weight", aggfunc="mean"),
)
.. ipython:: python
animals.groupby("kind").agg(
min_height=("height", "min"),
max_height=("height", "max"),
average_weight=("weight", "mean"),
)
If the column names you want are not valid Python identifiers (and so cannot be
written as keyword arguments), construct a dictionary and unpack it as keyword
arguments:
.. ipython:: python
animals.groupby("kind").agg(
**{
"total weight": pd.NamedAgg(column="weight", aggfunc="sum")
}
)
When using named aggregation, additional keyword arguments are not passed through
to the aggregation functions; only pairs of ``(column, aggfunc)`` should be passed
as ``**kwargs``. If your aggregation functions require additional arguments, apply
them partially with :meth:`functools.partial`.
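
As a hedged sketch of the partial-application approach, the example below binds
the ``q`` argument of ``pd.Series.quantile`` ahead of time and passes the
resulting callable as the ``aggfunc``. The ``animals`` frame mirrors the one used
above; the output names ``median_height`` and ``max_weight`` are illustrative
choices.

```python
from functools import partial

import pandas as pd

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

# Bind the extra argument up front; ``agg`` then receives a callable
# that only needs the group's values.
median_of = partial(pd.Series.quantile, q=0.5)

result = animals.groupby("kind").agg(
    median_height=("height", median_of),
    max_weight=("weight", "max"),
)
print(result)
```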
Named aggregation is also valid for Series groupby aggregations. In this case
there's no column selection, so the values are just the functions.
.. ipython:: python
animals.groupby("kind").height.agg(
min_height="min",
max_height="max",
)
.. ipython:: python
The function names can also be strings. In order for a string to be valid it
must be implemented on GroupBy:
.. ipython:: python
.. _groupby.transform:
Transformation
--------------
.. ipython:: python
speeds
grouped = speeds.groupby("class")["max_speed"]
grouped.cumsum()
grouped.diff()
.. note::
Since transformations do not include the groupings that are used to split the
result, the arguments ``as_index`` and ``sort`` in :meth:`DataFrame.groupby` and
:meth:`Series.groupby` have no effect.
A common use of a transformation is to add the result back into the original
DataFrame.
.. ipython:: python
result = speeds.copy()
result["cumsum"] = grouped.cumsum()
result["diff"] = grouped.diff()
result
.. csv-table::
:header: "Method", "Description"
:widths: 20, 80
.. _groupby.transformation.transform:
.. ipython:: python
speeds
grouped = speeds.groupby("class")[["max_speed"]]
grouped.transform("cumsum")
grouped.transform("sum")
A function passed to ``transform`` must:

* Return a result that is either the same size as the group chunk or
  broadcastable to the size of the group chunk (e.g., a scalar,
  ``grouped.transform(lambda x: x.iloc[-1])``).
* Operate column-by-column on the group chunk. The transform is applied to
  the first group chunk using ``chunk.apply``.
* Not perform in-place operations on the group chunk. Group chunks should
  be treated as immutable, and changes to a group chunk may produce unexpected
  results. See :ref:`gotchas.udf-mutation` for more information.
* (Optionally) operate on all columns of the entire group chunk at once. If this
  is supported, a fast path is used starting from the *second* chunk.

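
The first requirement can be illustrated with a short sketch on an assumed toy
frame: one UDF returns a result of the same size as the group, the other returns
a scalar that is broadcast to every row of the group. Neither UDF modifies its
input.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, 3.0, 10.0, 30.0]})
grouped = df.groupby("key")["val"]

# Same-size result: each value minus the mean of its group.
demeaned = grouped.transform(lambda x: x - x.mean())

# Scalar result: the last value of each group, broadcast to all of its rows.
last_value = grouped.transform(lambda x: x.iloc[-1])

print(demeaned)
print(last_value)
```

In both cases the result is indexed like the original ``df``, so it can be
assigned back as a new column.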
.. note::
All of the examples in this section can be made more performant by calling
built-in methods instead of using UDFs.
See :ref:`below for examples <groupby_efficient_transforms>`.
.. versionchanged:: 2.0.0
.. ipython:: python
ts.head()
ts.tail()
.. ipython:: python
# Original Data
grouped = ts.groupby(lambda x: x.year)
grouped.mean()
grouped.std()
# Transformed Data
grouped_trans = transformed.groupby(lambda x: x.year)
grouped_trans.mean()
grouped_trans.std()
We can also visually compare the original and transformed data sets.
.. ipython:: python
@savefig groupby_transform_plot.png
compare.plot()
.. ipython:: python
Another common data transform is to replace missing data with the group mean.
.. ipython:: python
grouped = data_df.groupby(key)
We can verify that the group means have not changed in the transformed data,
and that the transformed data contains no NAs.
.. ipython:: python
grouped_trans = transformed.groupby(key)
.. _groupby_efficient_transforms:
As mentioned in the note above, each of the examples in this section can be
computed more efficiently using built-in methods. In the code below, the
inefficient way using a UDF is commented out and the faster alternative appears
below.
.. ipython:: python
# grouped = data_df.groupby(key)
# result = grouped.transform(lambda x: x.fillna(x.mean()))
grouped = data_df.groupby(key)
result = data_df.fillna(grouped.transform("mean"))
.. _groupby.transform.window_resample:
The example below will apply the ``rolling()`` method on the samples of
the column B, based on the groups of column A.
.. ipython:: python
df_re.groupby("A").rolling(4).B.mean()
.. ipython:: python
df_re.groupby("A").expanding().sum()
.. ipython:: python
df_re = pd.DataFrame(
{
"date": pd.date_range(start="2016-01-01", periods=4, freq="W"),
"group": [1, 1, 2, 2],
"val": [5, 6, 7, 8],
}
).set_index("date")
df_re
df_re.groupby("group").resample("1D", include_groups=False).ffill()
.. _groupby.filter:
Filtration
----------
.. ipython:: python
speeds
speeds.groupby("class").nth(1)
.. note::
Unlike aggregations, filtrations do not add the group keys to the index of the
result. Because of this, passing ``as_index=False`` or ``sort=True`` will not
affect these methods.
.. ipython:: python
speeds.groupby("class")[["order", "max_speed"]].nth(1)
Built-in filtrations
~~~~~~~~~~~~~~~~~~~~
The following methods on GroupBy act as filtrations. All these methods have an
efficient, GroupBy-specific, implementation.
.. csv-table::
:header: "Method", "Description"
:widths: 20, 80
Users can also use transformations along with Boolean indexing to construct complex
filtrations within groups. For example, suppose we are given groups of products and
their volumes, and we wish to subset the data to only the largest products
capturing no more than 90% of the total volume within each group.
.. ipython:: python
product_volumes = pd.DataFrame(
{
"group": list("xxxxyyy"),
"product": list("abcdefg"),
"volume": [10, 30, 20, 15, 40, 10, 20],
}
)
product_volumes
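
One way to express this filtration is sketched below, assuming "largest" means
ranking products by volume in descending order within each group. The names
``cumpct`` and ``significant_products`` are illustrative.

```python
import pandas as pd

product_volumes = pd.DataFrame(
    {
        "group": list("xxxxyyy"),
        "product": list("abcdefg"),
        "volume": [10, 30, 20, 15, 40, 10, 20],
    }
)

# Sort so that the largest products come first.
product_volumes = product_volumes.sort_values("volume", ascending=False)
grouped = product_volumes.groupby("group")["volume"]

# Cumulative share of each group's total volume, largest products first.
cumpct = grouped.cumsum() / grouped.transform("sum")

# Keep the products whose cumulative share stays at or below 90%.
significant_products = product_volumes[cumpct <= 0.9]
print(significant_products.sort_values(["group", "product"]))
```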
.. note::
The ``filter`` method takes a User-Defined Function (UDF) that, when applied to
an entire group, returns either ``True`` or ``False``. The result of the ``filter``
method is then the subset of groups for which the UDF returned ``True``.
Suppose we want to take only elements that belong to groups with a group sum
greater than 2.
.. ipython:: python
sf = pd.Series([1, 1, 2, 3, 3, 3])
sf.groupby(sf).filter(lambda x: x.sum() > 2)
.. ipython:: python
.. ipython:: python
For DataFrames with multiple columns, filters should explicitly specify a column as
the filter criterion.
.. ipython:: python
dff["C"] = np.arange(8)
dff.groupby("B").filter(lambda x: len(x["C"]) > 2)
.. _groupby.apply:
Flexible ``apply``
------------------
Some operations on the grouped data might not fit into the aggregation,
transformation, or filtration categories. For these, you can use the ``apply``
function.
.. warning::
``apply`` has to try to infer from the result whether it should act as a reducer,
transformer, *or* filter, depending on exactly what is passed to it. Thus the
grouped column(s) may be included in the output or not. While
it tries to intelligently guess how to behave, it can sometimes guess wrong.
.. note::
All of the examples in this section can be more reliably, and more efficiently,
computed using other pandas functionality.
.. ipython:: python
df
grouped = df.groupby("A")
.. ipython:: python
grouped = df.groupby('A')['C']
def f(group):
return pd.DataFrame({'original': group,
'demeaned': group - group.mean()})
grouped.apply(f)
``apply`` on a Series can operate on a returned value from the applied function
that is itself a Series, and possibly upcast the result to a DataFrame:
.. ipython:: python
def f(x):
return pd.Series([x, x ** 2], index=["x", "x^2"])
s = pd.Series(np.random.rand(5))
s
s.apply(f)
To control whether the grouped column(s) are included in the indices, you can use
the argument ``group_keys`` which defaults to ``True``. Compare
.. ipython:: python
with
.. ipython:: python
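
A minimal sketch of the effect of ``group_keys``, using an assumed toy frame and
a UDF that returns the largest value of each group:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1, 5, 3, 2]})

# group_keys=True: the group key is added as an extra index level.
with_keys = df.groupby("A", group_keys=True)["C"].apply(lambda s: s.nlargest(1))

# group_keys=False: only the original row labels remain in the index.
without_keys = df.groupby("A", group_keys=False)["C"].apply(lambda s: s.nlargest(1))

print(with_keys)
print(without_keys)
```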
.. versionadded:: 1.1
The function signature must start with ``values, index`` **exactly**, as the data
belonging to each group will be passed into ``values``, and the group index will be
passed into ``index``.
.. warning::
.. ipython:: python
df
.. ipython:: python
df.groupby("A").std(numeric_only=True)
.. ipython:: python
df_dec = pd.DataFrame(
{
"id": [1, 2, 1, 2],
"int_column": [1, 2, 3, 4],
"dec_column": [
Decimal("0.50"),
Decimal("0.15"),
Decimal("0.25"),
Decimal("0.40"),
],
}
)
df_dec.groupby(["id"])[["dec_column"]].sum()
.. _groupby.observed:
.. ipython:: python
pd.Series([1, 1, 1]).groupby(
pd.Categorical(["a", "a", "a"], categories=["a", "b"]), observed=False
).count()
.. ipython:: python
pd.Series([1, 1, 1]).groupby(
pd.Categorical(["a", "a", "a"], categories=["a", "b"]), observed=True
).count()
The dtype of the resulting group index will *always* include *all* of the
categories that were grouped, even when ``observed=True``.
.. ipython:: python
s = (
pd.Series([1, 1, 1])
.groupby(pd.Categorical(["a", "a", "a"], categories=["a", "b"]),
observed=True)
.count()
)
s.index.dtype
.. _groupby.missing:
NA group handling
~~~~~~~~~~~~~~~~~
.. ipython:: python
df.groupby("key", dropna=True).sum()
df.groupby("key", dropna=False).sum()
.. ipython:: python
days = pd.Categorical(
values=["Wed", "Mon", "Thu", "Mon", "Wed", "Sat"],
categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
)
data = pd.DataFrame(
{
"day": days,
"workers": [3, 4, 1, 4, 2, 2],
}
)
data
.. _groupby.specify:
You may need to specify a bit more data to properly group. You can
use ``pd.Grouper`` to provide this local control.
.. ipython:: python
import datetime
df = pd.DataFrame(
{
"Branch": "A A A A A A A B".split(),
"Buyer": "Carl Mark Carl Carl Joe Joe Joe Carl".split(),
"Quantity": [1, 3, 5, 1, 8, 1, 9, 3],
"Date": [
datetime.datetime(2013, 1, 1, 13, 0),
datetime.datetime(2013, 1, 1, 13, 5),
datetime.datetime(2013, 10, 1, 20, 0),
datetime.datetime(2013, 10, 2, 10, 0),
datetime.datetime(2013, 10, 1, 20, 0),
datetime.datetime(2013, 10, 2, 10, 0),
datetime.datetime(2013, 12, 2, 12, 0),
datetime.datetime(2013, 12, 2, 14, 0),
],
}
)
df
Group by a specific column with the desired frequency. This is like resampling.
.. ipython:: python
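
As a small, self-contained sketch of grouping a datetime column by frequency,
the example below uses an assumed ``sales`` frame and a daily frequency. The
frame and the choice of ``freq="D"`` are illustrative only.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2013-01-01 13:00", "2013-01-01 13:05", "2013-01-02 10:00"]
        ),
        "Quantity": [1, 3, 5],
    }
)

# ``key`` names the datetime column; ``freq`` sets the size of each time bin.
daily = sales.groupby(pd.Grouper(key="Date", freq="D"))["Quantity"].sum()
print(daily)
```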
.. ipython:: python
df = df.set_index("Date")
df["Date"] = df.index + pd.offsets.MonthEnd(2)
df.groupby([pd.Grouper(freq="6ME", key="Date"), "Buyer"])[["Quantity"]].sum()
Just like for a DataFrame or Series, you can call ``head`` and ``tail`` on a groupby:
.. ipython:: python
g = df.groupby("A")
g.head(1)
g.tail(1)
.. _groupby.nth:
.. ipython:: python
g.nth(0)
g.nth(-1)
g.nth(1)
If the nth element of a group does not exist, then no corresponding row is included
in the result. In particular, if the specified ``n`` is larger than any group, the
result will be an empty DataFrame.
.. ipython:: python
g.nth(5)
If you want to select the nth not-null item, use the ``dropna`` kwarg. For a
DataFrame this should be either ``'any'`` or ``'all'`` just like you would pass to
dropna:
.. ipython:: python
g.B.nth(0, dropna="all")
You can also select multiple rows from each group by specifying multiple nth values
as a list of ints.
.. ipython:: python
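
A brief sketch of passing a list of positions to ``nth``, here selecting the
first and last row of each group in an assumed toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2], "B": [10, 20, 30, 40, 50]})

# Positions 0 and -1 pick the first and last row of every group.
first_and_last = df.groupby("A").nth([0, -1])
print(first_and_last)
```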
.. ipython:: python
df.groupby([df.index.year, df.index.month]).nth[1:]
df.groupby([df.index.year, df.index.month]).nth[1:, :-1]
To see the order in which each row appears within its group, use the
``cumcount`` method:
.. ipython:: python
dfg.groupby("A").cumcount()
dfg.groupby("A").cumcount(ascending=False)
.. _groupby.ngroup:
Enumerate groups
~~~~~~~~~~~~~~~~
To see the ordering of the groups (as opposed to the order of rows
within a group given by ``cumcount``) you can use
:meth:`.DataFrameGroupBy.ngroup`.
Note that the numbers given to the groups match the order in which the
groups would be seen when iterating over the groupby object, not the
order they are first observed.
.. ipython:: python
dfg.groupby("A").ngroup()
dfg.groupby("A").ngroup(ascending=False)
Plotting
~~~~~~~~
Groupby also works with some plotting methods. In this case, suppose we
suspect that the values in column 1 are higher by 3 on average in group "B".
.. ipython:: python
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(50, 2))
df["g"] = np.random.choice(["A", "B"], size=50)
df.loc[df["g"] == "B", 1] += 3
.. ipython:: python
:okwarning:
@savefig groupby_boxplot.png
df.groupby("g").boxplot()
The result of calling ``boxplot`` is a dictionary whose keys are the values
of our grouping column ``g`` ("A" and "B"). The values of the resulting dictionary
can be controlled by the ``return_type`` keyword of ``boxplot``.
See the :ref:`visualization documentation<visualization.box>` for more.
.. warning::
.. _groupby.pipe:
Combining ``.groupby`` and ``.pipe`` is often useful when you need to reuse
GroupBy objects.
As an example, imagine having a DataFrame with columns for stores, products,
revenue and quantity sold. We'd like to do a groupwise calculation of *prices*
(i.e. revenue/quantity) per store and per product. We could do this in a
multi-step operation, but expressing it in terms of piping can make the
code more readable. First we set the data:
.. ipython:: python
n = 1000
df = pd.DataFrame(
{
"Store": np.random.choice(["Store_1", "Store_2"], n),
"Product": np.random.choice(["Product_1", "Product_2"], n),
"Revenue": (np.random.random(n) * 50 + 10).round(2),
"Quantity": np.random.randint(1, 10, size=n),
}
)
df.head(2)
.. ipython:: python
(
df.groupby(["Store", "Product"])
.pipe(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())
.unstack()
.round(2)
)
Piping can also be expressive when you want to deliver a grouped object to some
arbitrary function, for example:
.. ipython:: python
def mean(groupby):
return groupby.mean()
df.groupby(["Store", "Product"]).pipe(mean)
Here ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity
columns respectively for each Store-Product combination. The ``mean`` function can
be any function that takes in a GroupBy object; the ``.pipe`` will pass the GroupBy
object as a parameter into the function you specify.
Examples
--------
.. _groupby.multicolumn_factorization:
Multi-column factorization
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. ipython:: python
dfg
dfg.groupby(["A", "B"]).ngroup()
In order for resample to work on indices that are non-datetimelike, the following
procedure can be utilized.
In the following examples, ``df.index // 5`` returns an integer array which is used
to determine what gets selected for the groupby operation.
.. note::
The example below shows how we can downsample by consolidation of samples into
fewer ones. Here, by using ``df.index // 5``, we are aggregating the samples in
bins. By applying the ``std()`` function, we aggregate the information contained in
many samples into a small subset of values, which is their standard deviation,
thereby reducing the number of samples.
.. ipython:: python
df = pd.DataFrame(np.random.randn(10, 2))
df
df.index // 5
df.groupby(df.index // 5).std()
Group DataFrame columns, compute a set of metrics and return a named Series.
The Series name is used as the name for the column index. This is especially
useful in conjunction with reshaping operations such as stacking, in which the
column index name will be used as the name of the inserted column:
.. ipython:: python
df = pd.DataFrame(
{
"a": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
"b": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
"c": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
"d": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
}
)
def compute_metrics(x):
result = {"b_sum": x["b"].sum(), "c_mean": x["c"].mean()}
return pd.Series(result, name="metrics")
result = df.groupby("a").apply(compute_metrics, include_groups=False)
result
result.stack(future_stack=True)