Advanced pandas tutorial: GroupBy operations in pandas

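          Introduction

          A pandas DataFrame supports groupby operations much like a database table. A groupby operation generally has three parts: splitting the data, applying a function to each group, and combining the results.

          This article walks through groupby operations in pandas in detail.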
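
The three steps can be spelled out by hand to see what groupby does in a single call. This is a minimal sketch for illustration, not how pandas implements it; the column names key and value and the sample data are made up.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: collect the rows that belong to each distinct key.
pieces = {key: df[df["key"] == key] for key in df["key"].unique()}

# Apply: reduce each piece to a single number.
sums = {key: piece["value"].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name="value").sort_index()

# groupby performs the same three steps in one expression.
grouped = df.groupby("key")["value"].sum()

print(manual.to_dict())
print(grouped.to_dict())
```

Both dictionaries should hold the same per-key totals, which is the point of the comparison.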

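          Splitting the data

          Splitting breaks a DataFrame into groups. Grouping needs labels to group on, so the example DataFrame is created with named columns:

          import numpy as np
          import pandas as pd

          df = pd.DataFrame(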
             ...:     {
             ...:         "a": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
             ...:         "b": ["one", "one", "two", "three", "two", "two", "one", "three"],
             ...:         "c": np.random.randn(8),
             ...:         "d": np.random.randn(8),
             ...:     }
             ...: )
             ...:
          
          In [61]: df
          Out[61]: 
               a      b         c         d
          0  foo    one -0.490565 -0.233106
          1  bar    one  0.430089  1.040789
          2  foo    two  0.653449 -1.155530
          3  bar  three -0.610380 -0.447735
          4  foo    two -0.934961  0.256358
          5  bar    two -0.256263 -0.661954
          6  foo    one -1.132186 -0.304330
          7  foo  three  2.129757  0.445744

          By default, groupby splits along the rows (axis 0). You can group by a single column or by several columns:

          In [8]: grouped = df.groupby("a")
          
          In [9]: grouped = df.groupby(["a", "b"])

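          MultiIndex

          As of pandas 0.24, if the DataFrame has a MultiIndex you can group by selected index levels, for example every level except the ones you name:

          In [10]: df2 = df.set_index(["a", "b"])
          
          In [11]: grouped = df2.groupby(level=df2.index.names.difference(["b"]))
          
          In [12]: grouped.sum()
          Out[12]: 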
                      c         d
          a                      
          bar -1.591710 -1.739537
          foo -0.752861 -1.402938
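
Because the session above uses random numbers, the sums are hard to check by eye. The following sketch repeats the idea with small fixed values (made up for illustration) and compares grouping by "every level except b" with grouping by the level name directly.

```python
import pandas as pd

df2 = pd.DataFrame(
    {
        "a": ["foo", "bar", "foo", "bar"],
        "b": ["one", "one", "two", "two"],
        "c": [1.0, 2.0, 3.0, 4.0],
    }
).set_index(["a", "b"])

# All index level names except "b"; with two levels this leaves only "a".
levels = df2.index.names.difference(["b"])
by_difference = df2.groupby(level=levels).sum()

# Naming the remaining level directly produces the same sums.
by_name = df2.groupby(level="a").sum()

print(by_difference)
print(by_name)
```

The difference form is convenient when there are many levels and you only want to leave one or two out.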

          get_group

          get_group returns the rows that belong to a single group:

          In [24]: df3 = pd.DataFrame({"x": ["a", "b", "a", "b"], "y": [1, 4, 3, 2]})
          
          In [25]: df3.groupby("x").get_group("a")
          Out[25]: 
             x  y
          0  a  1
          2  a  3
          
          In [26]: df3.groupby("x").get_group("b")
          Out[26]: 
             x  y
          1  b  4
          3  b  2

          dropna

          By default, rows whose group key is NaN are left out of the groups. Pass dropna=False to keep them as a group of their own:

          In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
          
          In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
          
          In [29]: df_dropna
          Out[29]: 
             a    b  c
          0  1  2.0  3
          1  1  NaN  4
          2  2  1.0  3
          3  1  2.0  2

          # default ``dropna`` is set to True, which will exclude NaNs in keys
          In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
          Out[30]: 
               a  c
          b        
          1.0  2  3
          2.0  2  5
          
          # in order to allow NaN in keys, set ``dropna`` to False
          In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
          Out[31]: 
               a  c
          b        
          1.0  2  3
          2.0  2  5
          NaN  1  4
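
A quick way to see the effect of dropna is to compare the number of groups and the grand totals. The sketch below uses made-up data with one missing key; the names key and value are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": [1.0, np.nan, 1.0, 2.0], "value": [1, 2, 3, 4]})

# Default behaviour: the row with a NaN key belongs to no group.
dropped = df.groupby("key", dropna=True)["value"].sum()

# dropna=False: the NaN key forms its own group.
kept = df.groupby("key", dropna=False)["value"].sum()

# Pick out the total that was assigned to the NaN group.
nan_total = kept[kept.index.isna()].iloc[0]

print(len(dropped), dropped.sum())
print(len(kept), kept.sum(), nan_total)
```

With the default setting the row with the missing key silently disappears from the totals, which is worth checking whenever group sums do not add up to the column sum.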

          The groups attribute

          A GroupBy object has a groups attribute: a dict whose keys are the group keys and whose values are the index labels of the rows in each group. len(grouped) gives the number of groups.

          In [34]: grouped = df.groupby(["a", "b"])
          
          In [35]: grouped.groups
          Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
          
          In [36]: len(grouped)
          Out[36]: 6
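
The values in groups are index labels, not row positions. The sketch below makes that visible by giving the DataFrame a string index; the labels r0, r1, r2 and the data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": ["foo", "bar", "foo"], "c": [10, 20, 30]},
    index=["r0", "r1", "r2"],
)
grouped = df.groupby("a")

# Each value is an Index of row labels; convert to plain lists for display.
labels = {key: list(idx) for key, idx in grouped.groups.items()}

print(labels)
print(len(grouped), grouped.ngroups)
```

Here len(grouped) and grouped.ngroups report the same number of groups.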

          Index levels

          For an object with a MultiIndex, groupby can group by a specific index level:

          In [40]: arrays = [
             ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
             ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
             ....: ]
             ....: 
          
          In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
          
          In [42]: s = pd.Series(np.random.randn(8), index=index)
          
          In [43]: s
          Out[43]: 
          first  second
          bar    one      -0.919854
                 two      -0.042379
          baz    one       1.247642
                 two      -0.009920
          foo    one       0.290213
                 two       0.495767
          qux    one       0.362949
                 two       1.548106
          dtype: float64

          Group by the first level:

          In [44]: grouped = s.groupby(level=0)
          
          In [45]: grouped.sum()
          Out[45]: 
          first
          bar   -0.962232
          baz    1.237723
          foo    0.785980
          qux    1.911055
          dtype: float64

          Group by the second level:

          In [46]: s.groupby(level="second").sum()
          Out[46]: 
          second
          one    0.980950
          two    1.991575
          dtype: float64

          Iterating over groups

          Once you have a GroupBy object, you can iterate over its groups with a for loop:

          In [62]: grouped = df.groupby('a')
          
          In [63]: for name, group in grouped:
             ....:     print(name)
             ....:     print(group)
             ....: 
          bar
               a      b         c         d
          1  bar    one  0.254161  1.511763
          3  bar  three  0.215897 -0.990582
          5  bar    two -0.077118  1.211526
          foo
               a      b         c         d
          0  foo    one -0.575247  1.346061
          2  foo    two -1.143704  1.627081
          4  foo    two  1.193555 -0.441652
          6  foo    one -0.408530  0.268520
          7  foo  three -0.862495  0.024580

          When grouping by several keys, the group name is a tuple:

          In [64]: for name, group in df.groupby(['a', 'b']):
             ....:     print(name)
             ....:     print(group)
             ....: 
          ('bar', 'one')
               a    b         c         d
          1  bar  one  0.254161  1.511763
          ('bar', 'three')
               a      b         c         d
          3  bar  three  0.215897 -0.990582
          ('bar', 'two')
               a    b         c         d
          5  bar  two -0.077118  1.211526
          ('foo', 'one')
               a    b         c         d
          0  foo  one -0.575247  1.346061
          6  foo  one -0.408530  0.268520
          ('foo', 'three')
               a      b         c        d
          7  foo  three -0.862495  0.02458
          ('foo', 'two')
               a    b         c         d
          2  foo  two -1.143704  1.627081
          4  foo  two  1.193555 -0.441652
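
Iteration is handy when each group needs to be handled separately, for example to collect one summary per group. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["foo", "bar", "foo"],
        "b": ["one", "one", "two"],
        "c": [1, 2, 3],
    }
)

# Single key: name is the key value itself.
row_counts = {}
for name, group in df.groupby("a"):
    row_counts[name] = len(group)

# Several keys: name is a tuple that can be unpacked in the loop header.
totals = {}
for (a, b), group in df.groupby(["a", "b"]):
    totals[(a, b)] = int(group["c"].sum())

print(row_counts)
print(totals)
```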

          Aggregation

          Once the data is grouped, you can aggregate each group:

          In [67]: grouped = df.groupby("a")
          
          In [68]: grouped.aggregate(np.sum)
          Out[68]: 
                      c         d
          a                      
          bar  0.392940  1.732707
          foo -1.796421  2.824590
          
          In [69]: grouped = df.groupby(["a", "b"])
          
          In [70]: grouped.aggregate(np.sum)
          Out[70]: 
                            c         d
          a   b                        
          bar one    0.254161  1.511763
              three  0.215897 -0.990582
              two   -0.077118  1.211526
          foo one   -0.983776  1.614581
              three -0.862495  0.024580
              two    0.049851  1.185429

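          When you group by several keys, the result is indexed by a MultiIndex of those keys by default. To keep the keys as ordinary columns under a fresh integer index, pass as_index=False:

          In [71]: grouped = df.groupby(["a", "b"], as_index=False)
          
          In [72]: grouped.aggregate(np.sum)
          Out[72]: 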
               a      b         c         d
          0  bar    one  0.254161  1.511763
          1  bar  three  0.215897 -0.990582
          2  bar    two -0.077118  1.211526
          3  foo    one -0.983776  1.614581
          4  foo  three -0.862495  0.024580
          5  foo    two  0.049851  1.185429
          
          In [73]: df.groupby("a", as_index=False).sum()
          Out[73]: 
               a         c         d
          0  bar  0.392940  1.732707
          1  foo -1.796421  2.824590

          This gives the same result as calling reset_index on the default output:

          In [74]: df.groupby(["a", "b"]).sum().reset_index()
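
The equivalence can be checked on a small example. The data below is made up; the sketch compares the two results row by row.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["foo", "bar", "foo", "bar"],
        "b": ["one", "one", "two", "one"],
        "c": [1, 2, 3, 4],
    }
)

# Keys stay as columns, rows get a fresh integer index.
flat = df.groupby(["a", "b"], as_index=False).sum()

# Default MultiIndex result, flattened afterwards.
reset = df.groupby(["a", "b"]).sum().reset_index()

print(flat.to_dict("records"))
print(reset.to_dict("records"))
```

as_index=False avoids building the MultiIndex only to discard it, while reset_index is useful when the grouped result already exists.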

          grouped.size() returns the number of rows in each group:

          In [75]: grouped.size()
          Out[75]: 
               a      b  size
          0  bar    one     1
          1  bar  three     1
          2  bar    two     1
          3  foo    one     2
          4  foo  three     1
          5  foo    two     2

          grouped.describe() computes summary statistics for each group:

          In [76]: grouped.describe()
          Out[76]: 
                c                                                    ...         d                                                  
            count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
          0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
          1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
          2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
          3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
          4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
          5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081
          
          [6 rows x 16 columns]

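          Common aggregation methods

          The following aggregation methods are available on GroupBy objects:

          Function     Description
          mean()       Mean of each group
          sum()        Sum of each group
          size()       Number of rows in each group
          count()      Number of non-NA values in each group
          std()        Standard deviation
          var()        Variance
          sem()        Standard error of the mean
          describe()   Descriptive statistics
          first()      First value in each group
          last()       Last value in each group
          nth()        nth value in each group
          min()        Minimum
          max()        Maximum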
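
size() and count() are easy to confuse, so here is a short sketch of how they differ when a group contains missing values. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["foo", "foo", "bar"], "c": [1.0, np.nan, 3.0]})
grouped = df.groupby("a")

sizes = grouped.size()          # rows per group, missing values included
counts = grouped["c"].count()   # non-missing values of c per group
means = grouped["c"].mean()     # missing values are skipped when averaging
firsts = grouped["c"].first()   # first non-missing value per group

print(sizes.to_dict())
print(counts.to_dict())
print(means.to_dict())
print(firsts.to_dict())
```

The foo group has two rows but only one usable value, so its size and count disagree.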

          You can apply several aggregation functions at once:

          In [81]: grouped = df.groupby("a")
          
          In [82]: grouped["c"].agg([np.sum, np.mean, np.std])
          Out[82]: 
                    sum      mean       std
          a                                
          bar  0.392940  0.130980  0.181231
          foo -1.796421 -0.359284  0.912265

          The result columns can be renamed:

          In [84]: (
             ....:     grouped["c"]
             ....:     .agg([np.sum, np.mean, np.std])
             ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
             ....: )
             ....: 
          Out[84]: 
                    foo       bar       baz
          a                                
          bar  0.392940  0.130980  0.181231
          foo -1.796421 -0.359284  0.912265
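
The same pattern also works with string names such as "sum" instead of NumPy functions. As far as I know, recent pandas releases emit a warning when np.sum and similar callables are passed to agg, so the string form is the safer choice; treat that version note as an assumption and check it against your installed release. The data below is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": ["foo", "bar", "foo", "bar"], "c": [1.0, 2.0, 3.0, 6.0]}
)

# One output column per aggregation, named after the function.
stats = df.groupby("a")["c"].agg(["sum", "mean", "std"])

# Rename the output columns afterwards.
renamed = stats.rename(columns={"sum": "total", "mean": "average", "std": "spread"})

print(stats)
print(renamed)
```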

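          NamedAgg

          NamedAgg gives finer control over an aggregation. It is a namedtuple with two fields, column and aggfunc, and the keyword you assign it to becomes the name of the output column.

          In [88]: animals = pd.DataFrame(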
             ....:     {
             ....:         "kind": ["cat", "dog", "cat", "dog"],
             ....:         "height": [9.1, 6.0, 9.5, 34.0],
             ....:         "weight": [7.9, 7.5, 9.9, 198.0],
             ....:     }
             ....: )
             ....: 
          
          In [89]: animals
          Out[89]: 
            kind  height  weight
          0  cat     9.1     7.9
          1  dog     6.0     7.5
          2  cat     9.5     9.9
          3  dog    34.0   198.0
          
          In [90]: animals.groupby("kind").agg(
             ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
             ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
             ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
             ....: )
             ....: 
          Out[90]: 
                min_height  max_height  average_weight
          kind                                        
          cat          9.1         9.5            8.90
          dog          6.0        34.0          102.75

          Or pass plain (column, aggfunc) tuples:

          In [91]: animals.groupby("kind").agg(
             ....:     min_height=("height", "min"),
             ....:     max_height=("height", "max"),
             ....:     average_weight=("weight", np.mean),
             ....: )
             ....: 
          Out[91]: 
                min_height  max_height  average_weight
          kind                                        
          cat          9.1         9.5            8.90
          dog          6.0        34.0          102.75
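
The two spellings can be compared directly. This sketch reuses the animals data from the article and uses the string "mean" in place of np.mean, which is a substitution made here for illustration.

```python
import pandas as pd

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)

# Explicit NamedAgg objects.
with_namedagg = animals.groupby("kind").agg(
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    average_weight=pd.NamedAgg(column="weight", aggfunc="mean"),
)

# Plain (column, aggfunc) tuples.
with_tuples = animals.groupby("kind").agg(
    min_height=("height", "min"),
    average_weight=("weight", "mean"),
)

print(with_namedagg)
print(with_namedagg.equals(with_tuples))
```

NamedAgg reads more clearly in long aggregation lists, while tuples are shorter to type.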

          Different aggregations for different columns

          Passing a dict to agg applies a different aggregation to each column:

          In [95]: grouped.agg({"c": "sum", "d": "std"})
          Out[95]: 
                      c         d
          a                      
          bar  0.392940  1.366330
          foo -1.796421  0.884785
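
A small example with fixed values makes the per-column behaviour easy to verify. The data is made up, and max is used for the second column only to keep the expected numbers simple.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["foo", "bar", "foo", "bar"],
        "c": [1, 2, 3, 4],
        "d": [10, 20, 30, 40],
    }
)

# Column c is summed and column d is reduced to its maximum, per group.
result = df.groupby("a").agg({"c": "sum", "d": "max"})

print(result)
```

Only the columns named in the dict appear in the result.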

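          Transformation

          A transformation returns an object of the same size as the one being grouped. Transformations come up often in data analysis.

          transform accepts a lambda. Here ts is assumed to be a time series whose index labels have a year attribute:

          In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

          Filling NA values with the group mean:

          In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))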
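
The two one-liners above rely on ts and grouped objects that this article does not define. The following self-contained sketch supplies made-up stand-ins so the calls can be run; the dates, values, and column names g and v are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A short daily series that straddles a year boundary.
index = pd.date_range("2000-12-30", periods=4, freq="D")
ts = pd.Series([1.0, 4.0, 2.0, 7.0], index=index)

# Group by the year of each index label, then broadcast each year's range
# (max minus min) back to every row of that year.
spread = ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

# Replace missing values with the mean of their own group.
df = pd.DataFrame({"g": ["a", "a", "b", "b"], "v": [1.0, np.nan, np.nan, 5.0]})
filled = df.groupby("g")["v"].transform(lambda x: x.fillna(x.mean()))

print(spread.tolist())
print(filled.tolist())
```

In both cases the result has one value per input row, which is what distinguishes transform from aggregate.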

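          Filtering

          The filter method takes a function, often a lambda, that returns True or False for each group, and keeps only the groups for which it returns True:

          In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])
          
          In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
          Out[137]: 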
          3    3
          4    3
          5    3
          dtype: int64
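
filter works the same way on a DataFrame, where the function receives each group as a sub-DataFrame. A sketch with made-up data that keeps only groups with at least two rows:

```python
import pandas as pd

df = pd.DataFrame({"a": ["foo", "bar", "foo", "baz"], "c": [1, 2, 3, 4]})

# Whole groups are kept or dropped; rows keep their original index labels.
kept = df.groupby("a").filter(lambda g: len(g) >= 2)

print(kept)
```

Unlike a boolean mask on rows, the decision is made once per group, so every row of a group shares the same fate.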

          apply

          Some operations fit neither aggregation nor transformation. For those, pandas provides the apply method, which is more flexible than either.

          In [156]: df
          Out[156]: 
               a      b         c         d
          0  foo    one -0.575247  1.346061
          1  bar    one  0.254161  1.511763
          2  foo    two -1.143704  1.627081
          3  bar  three  0.215897 -0.990582
          4  foo    two  1.193555 -0.441652
          5  bar    two -0.077118  1.211526
          6  foo    one -0.408530  0.268520
          7  foo  three -0.862495  0.024580
          
          In [157]: grouped = df.groupby("a")
          
          # could also just call .describe()
          In [158]: grouped["c"].apply(lambda x: x.describe())
          Out[158]: 
          a         
          bar  count    3.000000
               mean     0.130980
               std      0.181231
               min     -0.077118
               25%      0.069390
                          ...   
          foo  min     -1.143704
               25%     -0.862495
               50%     -0.575247
               75%     -0.408530
               max      1.193555
          Name: c, Length: 16, dtype: float64

          apply also accepts a named function:

          In [159]: grouped = df.groupby('a')['c']
          
          In [160]: def f(group):
             .....:     return pd.DataFrame({'original': group,
             .....:                          'demeaned': group - group.mean()})
             .....: 
          
          In [161]: grouped.apply(f)
          Out[161]: 
             original  demeaned
          0 -0.575247 -0.215962
          1  0.254161  0.123181
          2 -1.143704 -0.784420
          3  0.215897  0.084917
          4  1.193555  1.552839
          5 -0.077118 -0.208098
          6 -0.408530 -0.049245
          7 -0.862495 -0.503211
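
The demeaning idea can be checked with fixed numbers. The sketch below uses made-up data and passes group_keys=False so that the result keeps the original row labels; that argument is a choice made here, since the shape of the index returned by apply has differed between pandas versions.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": ["foo", "bar", "foo", "bar"], "c": [1.0, 10.0, 3.0, 20.0]}
)


def demean(group):
    """Subtract the group's own mean from every value in the group."""
    return group - group.mean()


# group_keys=False keeps the original index instead of adding the group key.
# sort_index() puts the rows back in their original order explicitly.
demeaned = df.groupby("a", group_keys=False)["c"].apply(demean).sort_index()

print(demeaned.tolist())
```

For a function like this, which returns one value per input row, transform(demean) is an alternative worth considering.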

          This article is also published at http://www.flydean.com/11-python-pandas-groupby/
