目录
- namedagg
- apply操作
简介
pandas中的df数据类型可以像数据库表格一样进行groupby操作。通常来说groupby操作可以分为三部分:分割数据,应用变换和和合并数据。
本文将会详细讲解pandas中的groupby操作。
分割数据
分割数据的目的是将df分割成为一个个的group。为了进行groupby操作,在创建df的时候需要指定相应的label:
df = pd.dataframe(
...: {
...: "a": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
...: "b": ["one", "one", "two", "three", "two", "two", "one", "three"],
...: "c": np.random.randn(8),
...: "d": np.random.randn(8),
...: }
...: )
...:
df
out[61]:
a b c d
0 foo one -0.490565 -0.233106
1 bar one 0.430089 1.040789
2 foo two 0.653449 -1.155530
3 bar three -0.610380 -0.447735
4 foo two -0.934961 0.256358
5 bar two -0.256263 -0.661954
6 foo one -1.132186 -0.304330
7 foo three 2.129757 0.445744
默认情况下,groupby的轴是x轴。可以一列group,也可以多列group:
in [8]: grouped = df.groupby("a")
in [9]: grouped = df.groupby(["a", "b"])
多index
在0.24版本中,如果我们有多index,可以从中选择特定的index进行group:
in [10]: df2 = df.set_index(["a", "b"])
in [11]: grouped = df2.groupby(level=df2.index.names.difference(["b"]))
in [12]: grouped.sum()
out[12]:
c d
a
bar -1.591710 -1.739537
foo -0.752861 -1.402938
get_group
get_group 可以获取分组之后的数据:
in [24]: df3 = pd.dataframe({"x": ["a", "b", "a", "b"], "y": [1, 4, 3, 2]})
in [25]: df3.groupby(["x"]).get_group("a")
out[25]:
x y
0 a 1
2 a 3
in [26]: df3.groupby(["x"]).get_group("b")
out[26]:
x y
1 b 4
3 b 2
dropna
默认情况下,nan数据会被排除在groupby之外,通过设置 dropna=false 可以允许nan数据:
in [27]: df_list = [[1, 2, 3], [1, none, 4], [2, 1, 3], [1, 2, 2]] in [28]: df_dropna = pd.dataframe(df_list, columns=["a", "b", "c"]) in [29]: df_dropna out[29]: a b c 0 1 2.0 3 1 1 nan 4 2 2 1.0 3 3 1 2.0 2
# default ``dropna`` is set to true, which will exclude nans in keys
in [30]: df_dropna.groupby(by=["b"], dropna=true).sum()
out[30]:
a c
b
1.0 2 3
2.0 2 5
# in order to allow nan in keys, set ``dropna`` to false
in [31]: df_dropna.groupby(by=["b"], dropna=false).sum()
out[31]:
a c
b
1.0 2 3
2.0 2 5
nan 1 4
groups属性
groupby对象有个groups属性,它是一个key-value字典,key是用来分类的数据,value是分类对应的值。
in [34]: grouped = df.groupby(["a", "b"])
in [35]: grouped.groups
out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
in [36]: len(grouped)
out[36]: 6
index的层级
对于多级index对象,groupby可以指定group的index层级:
in [40]: arrays = [
....: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
....: ["one", "two", "one", "two", "one", "two", "one", "two"],
....: ]
....:
in [41]: index = pd.multiindex.from_arrays(arrays, names=["first", "second"])
in [42]: s = pd.series(np.random.randn(8), index=index)
in [43]: s
out[43]:
first second
bar one -0.919854
two -0.042379
baz one 1.247642
two -0.009920
foo one 0.290213
two 0.495767
qux one 0.362949
two 1.548106
dtype: float64
group第一级:
in [44]: grouped = s.groupby(level=0) in [45]: grouped.sum() out[45]: first bar -0.962232 baz 1.237723 foo 0.785980 qux 1.911055 dtype: float64
group第二级:
in [46]: s.groupby(level="second").sum() out[46]: second one 0.980950 two 1.991575 dtype: float64
group的遍历
得到group对象之后,我们可以通过for语句来遍历group:
in [62]: grouped = df.groupby('a')
in [63]: for name, group in grouped:
....: print(name)
....: print(group)
....:
bar
a b c d
1 bar one 0.254161 1.511763
3 bar three 0.215897 -0.990582
5 bar two -0.077118 1.211526
foo
a b c d
0 foo one -0.575247 1.346061
2 foo two -1.143704 1.627081
4 foo two 1.193555 -0.441652
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
如果是多字段group,group的名字是一个元组:
in [64]: for name, group in df.groupby(['a', 'b']):
....: print(name)
....: print(group)
....:
('bar', 'one')
a b c d
1 bar one 0.254161 1.511763
('bar', 'three')
a b c d
3 bar three 0.215897 -0.990582
('bar', 'two')
a b c d
5 bar two -0.077118 1.211526
('foo', 'one')
a b c d
0 foo one -0.575247 1.346061
6 foo one -0.408530 0.268520
('foo', 'three')
a b c d
7 foo three -0.862495 0.02458
('foo', 'two')
a b c d
2 foo two -1.143704 1.627081
4 foo two 1.193555 -0.441652
聚合操作
分组之后,就可以进行聚合操作:
in [67]: grouped = df.groupby("a")
in [68]: grouped.aggregate(np.sum)
out[68]:
c d
a
bar 0.392940 1.732707
foo -1.796421 2.824590
in [69]: grouped = df.groupby(["a", "b"])
in [70]: grouped.aggregate(np.sum)
out[70]:
c d
a b
bar one 0.254161 1.511763
three 0.215897 -0.990582
two -0.077118 1.211526
foo one -0.983776 1.614581
three -0.862495 0.024580
two 0.049851 1.185429
对于多index数据来说,默认返回值也是多index的。如果想使用新的index,可以添加 as_index = false:
in [71]: grouped = df.groupby(["a", "b"], as_index=false)
in [72]: grouped.aggregate(np.sum)
out[72]:
a b c d
0 bar one 0.254161 1.511763
1 bar three 0.215897 -0.990582
2 bar two -0.077118 1.211526
3 foo one -0.983776 1.614581
4 foo three -0.862495 0.024580
5 foo two 0.049851 1.185429
in [73]: df.groupby("a", as_index=false).sum()
out[73]:
a c d
0 bar 0.392940 1.732707
1 foo -1.796421 2.824590
上面的效果等同于reset_index
in [74]: df.groupby(["a", "b"]).sum().reset_index()
grouped.size() 计算group的大小:
in [75]: grouped.size()
out[75]:
a b size
0 bar one 1
1 bar three 1
2 bar two 1
3 foo one 2
4 foo three 1
5 foo two 2
grouped.describe() 描述group的信息:
in [76]: grouped.describe()
out[76]:
c ... d
count mean std min 25% 50% ... std min 25% 50% 75% max
0 1.0 0.254161 nan 0.254161 0.254161 0.254161 ... nan 1.511763 1.511763 1.511763 1.511763 1.511763
1 1.0 0.215897 nan 0.215897 0.215897 0.215897 ... nan -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2 1.0 -0.077118 nan -0.077118 -0.077118 -0.077118 ... nan 1.211526 1.211526 1.211526 1.211526 1.211526
3 2.0 -0.491888 0.117887 -0.575247 -0.533567 -0.491888 ... 0.761937 0.268520 0.537905 0.807291 1.076676 1.346061
4 1.0 -0.862495 nan -0.862495 -0.862495 -0.862495 ... nan 0.024580 0.024580 0.024580 0.024580 0.024580
5 2.0 0.024925 1.652692 -1.143704 -0.559389 0.024925 ... 1.462816 -0.441652 0.075531 0.592714 1.109898 1.627081
[6 rows x 16 columns]
通用聚合方法
下面是通用的聚合方法:
| 函数 | 描述 |
|---|---|
mean() |
平均值 |
sum() |
求和 |
size() |
计算size |
count() |
group的统计 |
std() |
标准差 |
var() |
方差 |
sem() |
均值的标准误 |
describe() |
统计信息描述 |
first() |
第一个group值 |
last() |
最后一个group值 |
nth() |
第n个group值 |
min() |
最小值 |
max() |
最大值 |
可以同时指定多个聚合方法:
in [81]: grouped = df.groupby("a")
in [82]: grouped["c"].agg([np.sum, np.mean, np.std])
out[82]:
sum mean std
a
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
可以重命名:
in [84]: (
....: grouped["c"]
....: .agg([np.sum, np.mean, np.std])
....: .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
....: )
....:
out[84]:
foo bar baz
a
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
namedagg
namedagg 可以对聚合进行更精准的定义,它包含 column 和aggfunc 两个定制化的字段。
in [88]: animals = pd.dataframe(
....: {
....: "kind": ["cat", "dog", "cat", "dog"],
....: "height": [9.1, 6.0, 9.5, 34.0],
....: "weight": [7.9, 7.5, 9.9, 198.0],
....: }
....: )
....:
in [89]: animals
out[89]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
in [90]: animals.groupby("kind").agg(
....: min_height=pd.namedagg(column="height", aggfunc="min"),
....: max_height=pd.namedagg(column="height", aggfunc="max"),
....: average_weight=pd.namedagg(column="weight", aggfunc=np.mean),
....: )
....:
out[90]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
或者直接使用一个元组:
in [91]: animals.groupby("kind").agg(
....: min_height=("height", "min"),
....: max_height=("height", "max"),
....: average_weight=("weight", np.mean),
....: )
....:
out[91]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
不同的列指定不同的聚合方法
通过给agg方法传入一个字典,可以指定不同的列使用不同的聚合:
in [95]: grouped.agg({"c": "sum", "d": "std"})
out[95]:
c d
a
bar 0.392940 1.366330
foo -1.796421 0.884785
转换操作
转换是将对象转换为同样大小对象的操作。在数据分析的过程中,经常需要进行数据的转换操作。
可以接lambda操作:
in [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
填充na值:
in [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
过滤操作
filter方法可以通过lambda表达式来过滤我们不需要的数据:
in [136]: sf = pd.series([1, 1, 2, 3, 3, 3]) in [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2) out[137]: 3 3 4 3 5 3 dtype: int64
apply操作
有些数据可能不适合进行聚合或者转换操作,pandas提供了一个 apply 方法,用来进行更加灵活的转换操作。
in [156]: df
out[156]:
a b c d
0 foo one -0.575247 1.346061
1 bar one 0.254161 1.511763
2 foo two -1.143704 1.627081
3 bar three 0.215897 -0.990582
4 foo two 1.193555 -0.441652
5 bar two -0.077118 1.211526
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
in [157]: grouped = df.groupby("a")
# could also just call .describe()
in [158]: grouped["c"].apply(lambda x: x.describe())
out[158]:
a
bar count 3.000000
mean 0.130980
std 0.181231
min -0.077118
25% 0.069390
...
foo min -1.143704
25% -0.862495
50% -0.575247
75% -0.408530
max 1.193555
name: c, length: 16, dtype: float64
可以外接函数:
in [159]: grouped = df.groupby('a')['c']
in [160]: def f(group):
.....: return pd.dataframe({'original': group,
.....: 'demeaned': group - group.mean()})
.....:
in [161]: grouped.apply(f)
out[161]:
original demeaned
0 -0.575247 -0.215962
1 0.254161 0.123181
2 -1.143704 -0.784420
3 0.215897 0.084917
4 1.193555 1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
本文已收录于 http://www.flydean.com/11-python-pandas-groupby/
最通俗的解读,最深刻的干货,最简洁的教程,众多你不知道的小技巧等你来发现!
到此这篇关于pandas高级教程之pandas中的groupby操作的文章就介绍到这了,更多相关pandas groupby用法内容请搜索www.887551.com以前的文章或继续浏览下面的相关文章希望大家以后多多支持www.887551.com!