Axis labels (the index) play an important role in pandas objects (Series and DataFrame), for example when locating data and when selecting a subset of a data set.

Object Type    Indexers
Series         s.loc[indexer]
DataFrame      df.loc[row_indexer, column_indexer]

import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])

df
                   A         B         C         D
2000-01-01 -0.479079  0.172883 -0.242021  0.628444
2000-01-02  0.159806 -1.505613 -0.134579  0.498075
2000-01-03  1.533752  1.201623  0.530075 -0.323315
2000-01-04  1.449016  0.762993  0.601700  1.255557
2000-01-05  0.239401  1.765966 -0.818390 -0.156134
2000-01-06  2.869231  0.956241 -0.178162 -2.097230
2000-01-07 -0.380161  0.602484  0.469034  1.277701
2000-01-08  0.640450  0.565034 -0.010303 -1.344165
s = df['A']  # s is a Series

s[dates[5]]

2.869230887708206

df[['B', 'A']] = df[['A', 'B']]  # swaps the values of columns A and B

# For selection, df[['A', 'B']] is equivalent to df.loc[:, ['A', 'B']]
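
The equivalence above holds for selection. Assignment through .loc behaves differently, because pandas aligns on column labels before writing. A minimal sketch, using a small hypothetical frame named `demo` rather than the `df` above:

```python
import pandas as pd

# Hypothetical two-column frame used only for this illustration.
demo = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# Selection: both spellings return the same columns.
same = demo[['A', 'B']].equals(demo.loc[:, ['A', 'B']])

# Assignment through .loc aligns on column labels, so nothing is swapped.
aligned = demo.copy()
aligned.loc[:, ['B', 'A']] = aligned[['A', 'B']]

# Passing raw values skips alignment, so the columns really are swapped.
swapped = demo.copy()
swapped.loc[:, ['B', 'A']] = swapped[['A', 'B']].to_numpy()

print(same)
print(aligned)
print(swapped)
```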

Columns can also be accessed as attributes with dot notation:

>>> df.A
>>> df.D  # equivalent to df['D'] or df.loc[:, 'D']; selects a column
2000-01-01    0.628444
2000-01-02    0.498075
2000-01-03   -0.323315
2000-01-04    1.255557
2000-01-05   -0.156134
2000-01-06   -2.097230
2000-01-07    1.277701
2000-01-08   -1.344165
Freq: D, Name: D, dtype: float64

>>> df[::-1]  # reverses the row order; plain slices such as df[:3] and df[::-1] act on rows
                   A         B         C         D
2000-01-08  0.640450  0.565034 -0.010303 -1.344165
2000-01-07 -0.380161  0.602484  0.469034  1.277701
2000-01-06  2.869231  0.956241 -0.178162 -2.097230
2000-01-05  0.239401  1.765966 -0.818390 -0.156134
2000-01-04  1.449016  0.762993  0.601700  1.255557
2000-01-03  1.533752  1.201623  0.530075 -0.323315
2000-01-02  0.159806 -1.505613 -0.134579  0.498075
2000-01-01 -0.479079  0.172883 -0.242021  0.628444

1. loc and iloc

>>> df1 = pd.DataFrame(np.random.randn(6, 4),
                          index=list('abcdef'),columns=list('ABCD'))
>>> df1
          A         B         C         D
a  0.899943  0.500422 -0.142480  0.714779
b -0.592714 -0.371228 -1.407495  0.748776
c -0.567269  0.225230 -0.215326  0.826066
d -0.882531 -1.744819  1.818175 -0.144823
e -1.174458  1.108387  1.127187 -0.110846
f  0.205241  0.035335  1.302494  0.813305

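df1.loc[['a', 'b', 'd'], :]      # rows selected by a list of labels
df1.loc['d':, 'A':'C']           # label slices; both endpoints are included
df1.loc[:, df1.loc['a'] > 0]     # columns in which row 'a' is positive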
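
The difference between loc and iloc can be sketched as follows. The frame `frame` is a hypothetical, deterministic stand-in for df1 (values 0 to 23 instead of random numbers), so the results are repeatable:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for df1 (values 0..23 instead of random numbers).
frame = pd.DataFrame(np.arange(24).reshape(6, 4),
                     index=list('abcdef'), columns=list('ABCD'))

# loc is label-based: slices include BOTH endpoints.
by_label = frame.loc['b':'d', 'A':'C']

# iloc is position-based: slices exclude the stop position.
by_position = frame.iloc[1:4, 0:3]

# A boolean Series built from row 'a' selects columns.
positive_cols = frame.loc[:, frame.loc['a'] > 0]

print(by_label.equals(by_position))
print(positive_cols.columns.tolist())
```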

Selection by callable: the indexer is a function, and its return value is used for indexing

>>>df1 = pd.DataFrame(np.random.randn(6, 4),
                          index=list('abcdef'),columns=list('ABCD'))
>>>df1
          A         B         C         D
a  1.254603  0.666147  0.960109  0.290801
b  1.024046 -1.046331 -0.904427 -0.205843
c -2.321422 -0.014234 -0.171935 -0.511684
d  0.978548  1.030372 -0.298060  1.856619
e  0.106820  0.101090 -0.152575 -0.395502
f -0.560688  0.692521 -0.920736 -0.948279

>>> df['A'] > 0
2000-01-01    False
2000-01-02     True
2000-01-03     True
2000-01-04     True
2000-01-05     True
2000-01-06     True
2000-01-07    False
2000-01-08     True

>>> df1.loc[lambda df: df['A'] > 0, :]  # the lambda returns the boolean Series df1['A'] > 0, which loc uses as the row mask
          A         B         C         D
a  1.254603  0.666147  0.960109  0.290801
b  1.024046 -1.046331 -0.904427 -0.205843
d  0.978548  1.030372 -0.298060  1.856619
e  0.106820  0.101090 -0.152575 -0.395502
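
A callable indexer is most useful in a method chain, where the intermediate result has no variable name. A minimal sketch with hypothetical data (the names `frame` and `result` are illustrative):

```python
import pandas as pd

# Hypothetical data; column names mirror df1 above.
frame = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [4.0, 5.0, -6.0]},
                     index=list('abc'))

# The callable receives the frame being indexed, so it can refer to
# column C even though C only exists inside the chain.
result = (frame
          .assign(C=lambda d: d['A'] + d['B'])
          .loc[lambda d: d['C'] > 0, ['A', 'C']])

print(result)
```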

df.sample(n=1, axis=1) randomly samples one column; with axis=0 (the default) it samples one row
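
The effect of the axis argument can be checked on a small frame. This is only a sketch: `frame` is hypothetical, and random_state is set purely to make the illustration repeatable.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))

# axis=0 (the default) samples rows; axis=1 samples columns.
one_row = frame.sample(n=1, random_state=0)
one_col = frame.sample(n=1, axis=1, random_state=0)

print(one_row.shape)  # one row, all four columns
print(one_col.shape)  # all three rows, one column
```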

2. at and iat: access a single scalar value; usage mirrors loc and iloc
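
A minimal sketch of scalar access, again on a hypothetical deterministic frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(24).reshape(6, 4),
                     index=list('abcdef'), columns=list('ABCD'))

by_label = frame.at['c', 'B']      # label-based, like loc
by_position = frame.iat[2, 1]      # position-based, like iloc

frame.at['c', 'B'] = 100           # at/iat can also set a single value

print(by_label, by_position, frame.loc['c', 'B'])
```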

3. Selecting from a Series or DataFrame with boolean vectors

The operators are | for or, & for and, and ~ for not.

>>>df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],'c': np.random.randn(7)})
>>>df2
       a  b         c
0    one  x -1.201883
1    one  y  0.323085
2    two  y -1.228992
3  three  x -0.691629
4    two  y  0.342987
5    one  x -1.405064
6    six  x -0.023214

>>>criterion = df2['a'].map(lambda x: x.startswith('t'))
# map() produces a boolean Series; startswith() returns a bool for each element
>>>criterion
0    False
1    False
2     True
3     True
4     True
5    False
6    False
Name: a, dtype: bool
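
A boolean Series like this can be combined with other conditions. The sketch below reuses the labels of df2 with fixed numbers in place of the random column; the variable names are illustrative:

```python
import pandas as pd

# Same labels as df2 above, with fixed numbers instead of random ones.
frame = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                      'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                      'c': [-1.2, 0.3, -1.2, -0.7, 0.3, -1.4, -0.02]})

# Vectorised form of the map() call above.
starts_with_t = frame['a'].str.startswith('t')

# Each comparison needs its own parentheses because & and | bind
# more tightly than == and >.
t_and_y = frame[starts_with_t & (frame['b'] == 'y')]
not_t_or_positive = frame[~starts_with_t | (frame['c'] > 0)]

print(t_and_y)
print(not_t_or_positive)
```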

4. Using isin

Series.isin(list())

>>> s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
>>> s
4    0
3    1
2    2
1    3
0    4
dtype: int64
>>> s[s.isin([2, 4, 6])]
2    2
0    4
dtype: int64

DataFrame.isin()

>>> df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],'ids2': ['a', 'n', 'c', 'n']})
>>> df
   vals ids ids2
0     1   a    a
1     2   b    n
2     3   f    c
3     4   n    n
>>> values = ['a', 'b', 1, 3]
>>> df.isin(values)
    vals    ids   ids2
0   True   True   True
1  False   True  False
2   True  False  False
3  False  False  False

DataFrame.isin() with a dict argument: each key names a column, and its values are checked against that column only

>>> values = {'ids': ['a', 'b'], 'vals': [1, 3]}
>>> df.isin(values)
    vals    ids   ids2
0   True   True  False
1  False   True  False
2   True  False  False
3  False  False  False
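
The two call styles can be compared side by side. A small sketch using the same data as above under the hypothetical name `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'vals': [1, 2, 3, 4],
                      'ids': ['a', 'b', 'f', 'n'],
                      'ids2': ['a', 'n', 'c', 'n']})

# A list is tested against every column.
in_list = frame.isin(['a', 'b', 1, 3])

# A dict is tested column by column; 'ids2' has no key, so it is all False.
in_dict = frame.isin({'ids': ['a', 'b'], 'vals': [1, 3]})

print(in_list['ids2'].tolist())
print(in_dict['ids2'].tolist())
```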
# Combining isin() with any() and all()
>>> values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

>>> row_mask = df.isin(values)
>>> df[row_mask]
   vals  ids ids2
0   1.0    a    a
1   NaN    b  NaN
2   3.0  NaN    c
3   NaN  NaN  NaN

>>> row_mask = df.isin(values).all(axis=1)
>>> df[row_mask]
   vals ids ids2
0     1   a    a

>>> row_mask = df.isin(values).any(axis=1)
>>> row_mask
0     True
1     True
2     True
3    False
dtype: bool
>>> df[row_mask]
   vals ids ids2
0     1   a    a
1     2   b    n
2     3   f    c
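
The row masks above can be summarised in one sketch. It passes axis by keyword, which works across pandas versions, and adds a negated mask as a third case; `frame` and the result names are illustrative:

```python
import pandas as pd

frame = pd.DataFrame({'vals': [1, 2, 3, 4],
                      'ids': ['a', 'b', 'f', 'n'],
                      'ids2': ['a', 'n', 'c', 'n']})
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
matches = frame.isin(values)

all_match = frame[matches.all(axis=1)]   # every column matched
any_match = frame[matches.any(axis=1)]   # at least one column matched
no_match = frame[~matches.any(axis=1)]   # no column matched

print(all_match.index.tolist(), any_match.index.tolist(), no_match.index.tolist())
```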

5. where() and mask()

See the linked post on the pandas.DataFrame.where() and mask() methods.
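
As a quick reminder of what the two methods do, here is a minimal sketch on a hypothetical Series (the same calls exist on DataFrame):

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])

kept = s.where(s > 0)        # keep values where the condition is True, NaN elsewhere
filled = s.where(s > 0, 0)   # replace the remaining values with 0 instead of NaN
masked = s.mask(s > 0, 0)    # mask() is the inverse: replace where the condition is True

print(filled.tolist())
print(masked.tolist())
```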

6. query()

See the linked post on the pandas.DataFrame.query() method.
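
For orientation, a minimal sketch of query() next to the equivalent boolean mask. The frame and the local variable `limit` are hypothetical; inside the query string, `@` refers to a Python variable rather than a column:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 5, 3], 'b': [2, 4, 6], 'c': [3, 6, 9]})
limit = 4  # hypothetical local variable, referenced with @ inside query()

by_query = frame.query('a < b and b < c and b > @limit')
by_mask = frame[(frame['a'] < frame['b'])
                & (frame['b'] < frame['c'])
                & (frame['b'] > limit)]

print(by_query.equals(by_mask))
```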

7. query() and isin()

For isin(), see the linked post on pandas.Series.isin() and pandas.DataFrame.isin().

>>> df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
...                'c': np.random.randint(5, size=12),
...                 'd': np.random.randint(9, size=12)})
... 
>>> df
    a  b  c  d
0   a  a  3  5
1   a  a  3  5
2   b  a  0  5
3   b  a  3  2
4   c  b  3  6
5   c  b  2  8
6   d  b  2  3
7   d  b  3  7
8   e  c  3  4
9   e  c  1  5
10  f  c  2  2
11  f  c  3  5
>>> df.query('a in b')
   a  b  c  d
0  a  a  3  5
1  a  a  3  5
2  b  a  0  5
3  b  a  3  2
4  c  b  3  6
5  c  b  2  8
>>> df[df['a'].isin(df['b'])]
   a  b  c  d
0  a  a  3  5
1  a  a  3  5
2  b  a  0  5
3  b  a  3  2
4  c  b  3  6
5  c  b  2  8
>>> df[~df['a'].isin(df['b'])]
    a  b  c  d
6   d  b  2  3
7   d  b  3  7
8   e  c  3  4
9   e  c  1  5
10  f  c  2  2
11  f  c  3  5
>>> df.query('b == ["a", "b", "c"]')
    a  b  c  d
0   a  a  3  5
1   a  a  3  5
2   b  a  0  5
3   b  a  3  2
4   c  b  3  6
5   c  b  2  8
6   d  b  2  3
7   d  b  3  7
8   e  c  3  4
9   e  c  1  5
10  f  c  2  2
11  f  c  3  5
>>> df[df['b'].isin(["a", "b", "c"])]
    a  b  c  d
0   a  a  3  5
1   a  a  3  5
2   b  a  0  5
3   b  a  3  2
4   c  b  3  6
5   c  b  2  8
6   d  b  2  3
7   d  b  3  7
8   e  c  3  4
9   e  c  1  5
10  f  c  2  2
11  f  c  3  5

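In terms of performance, query() can be faster, mainly on large frames: the pandas documentation notes that the benefit of the numexpr engine only shows up on the order of 100,000 rows or more. On small frames the difference is negligible.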
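
Whichever is faster on a given data set, the two spellings select the same rows. A minimal check, using only the string columns of the example under the hypothetical name `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'a': list('aabbccddeeff'),
                      'b': list('aaaabbbbcccc')})

in_query = frame.query('a in b')
in_isin = frame[frame['a'].isin(frame['b'])]

not_in_query = frame.query('a not in b')
not_in_isin = frame[~frame['a'].isin(frame['b'])]

print(in_query.equals(in_isin), not_in_query.equals(not_in_isin))
```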

8. Duplicate data

For handling duplicates, see the linked post on pandas DataFrame duplicate handling: duplicated() and drop_duplicates().
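
A minimal sketch of the two methods on a hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({'a': ['one', 'one', 'two', 'two'],
                      'b': ['x', 'x', 'y', 'z']})

dup_mask = frame.duplicated()        # True for a row that repeats an earlier row
first_only = frame.drop_duplicates() # keeps the first occurrence of each row

# Compare on column 'a' only and keep the last occurrence.
by_column = frame.drop_duplicates(subset='a', keep='last')

print(dup_mask.tolist())
print(first_only.index.tolist(), by_column.index.tolist())
```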
