Pandas 100-Exercise Challenge

Published: 2019-11-16

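Pandas is a data-processing tool built on NumPy, created to support data analysis tasks. It incorporates a number of libraries and standard data models, and provides the functions and methods needed to work efficiently with large datasets.

Pandas data structures: early versions of Pandas offered Series (one-dimensional), DataFrame (two-dimensional), Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (higher-dimensional) structures. The Panel family was later deprecated and removed, so Series and DataFrame are the two structures in practical use.

A Series is a one-dimensional labeled array that can hold any data type, including integers, strings, floats, and Python objects. Its elements can be located by label.
A DataFrame is a two-dimensional labeled data structure. Its data can also be located by label, which is something plain NumPy arrays do not offer.

The challenge is split into a basic part and an advanced part, each with 50 exercises. The basic exercises build familiarity with commonly used Pandas methods, while the advanced exercises focus on combining those methods.

If you are not yet comfortable with Pandas, we suggest taking the Pandas data processing fundamentals course first and then using this course for review and self-testing.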

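About this lab

This lab is a Notebook lab. Cells depend on the cells before them, so run them in order; skipping cells or re-running some of them may leave variables in an inconsistent state.

The lab also includes a reference answer for every exercise. If you want to work independently, delete the reference answer from the cell first, then use Ctrl/⌘ + Z to undo and restore it.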

Basics

1. Import Pandas

Before practicing, import the Pandas module under its conventional alias pd.

Example code:

In [1]:

import pandas as pd

2. Check the Pandas version

In [2]:

print(pd.__version__)

0.23.0

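Creating a Series

In Pandas, a Series can be thought of as a dataset made up of a single column.

The syntax is s = pd.Series(data, index=index). A Series can be created in several ways; three common ones are shown below.

3. Create a Series from a list

In [3]:

arr = [0, 1, 2, 3, 4]
s1 = pd.Series(arr)  # Without an explicit index, labels default to 0, 1, 2, ...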
s1

Out[3]:

0    0
1    1
2    2
3    3
4    4
dtype: int64

Note: the left-hand 0, 1, 2, 3, 4 are the index of the Series, and the right-hand 0, 1, 2, 3, 4 are its values.

4. Create a Series from an ndarray

In [4]:

import numpy as np
n = np.random.randn(5)  # Create a random ndarray

index = ['a', 'b', 'c', 'd', 'e']
s2 = pd.Series(n, index=index)
s2

Out[4]:

a    2.370228
b    1.118292
c    0.569311
d    0.223086
e   -0.968531
dtype: float64

5. Create a Series from a dictionary

In [5]:

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
s3 = pd.Series(d)
s3

Out[5]:

a    1
b    2
c    3
d    4
e    5
dtype: int64

Series basic operations

6. Change the index of a Series

In [6]:

print(s1)  # Use s1 as the example

s1.index = ['A', 'B', 'C', 'D', 'E']  # The new index
s1

0    0
1    1
2    2
3    3
4    4
dtype: int64

Out[6]:

A    0
B    1
C    2
D    3
E    4
dtype: int64

7. Concatenate Series vertically

In [7]:

s4 = pd.concat([s3, s1])  # Append s1 to s3; pd.concat also works on pandas 2.x, where Series.append was removed
s4

Out[7]:

a    1
b    2
c    3
d    4
e    5
A    0
B    1
C    2
D    3
E    4
dtype: int64
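
Concatenation keeps every label exactly as it was, so the result can contain duplicate labels when the inputs overlap. A minimal sketch, assuming two small stand-in Series named left and right (hypothetical names, not part of the lab), shows pd.concat with and without ignore_index:

```python
import pandas as pd

# Hypothetical stand-ins for two Series that share the label 'b'.
left = pd.Series([1, 2], index=['a', 'b'])
right = pd.Series([3, 4], index=['b', 'c'])

# Labels are kept as-is, so 'b' appears twice in the result.
kept = pd.concat([left, right])
print(kept.index.tolist())

# ignore_index=True discards the labels and renumbers from 0.
renumbered = pd.concat([left, right], ignore_index=True)
print(renumbered.index.tolist())
```

In the exercise above the two indexes (lowercase and uppercase letters) do not overlap, so no duplicates arise.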

8. Delete an element from a Series by label

In [8]:

print(s4)
s4 = s4.drop('e')  # Drop the value labeled e
s4

a    1
b    2
c    3
d    4
e    5
A    0
B    1
C    2
D    3
E    4
dtype: int64

Out[8]:

a    1
b    2
c    3
d    4
A    0
B    1
C    2
D    3
E    4
dtype: int64

9. Modify the element at a given label

In [9]:

s4['A'] = 6  # Set the value at label A to 6
s4

Out[9]:

a    1
b    2
c    3
d    4
A    6
B    1
C    2
D    3
E    4
dtype: int64

10. Look up an element of a Series by label

In [10]:

s4['B']

Out[10]:

1

11. Slice a Series

For example, access the first 3 elements of s4.

In [11]:

s4[:3]

Out[11]:

a    1
b    2
c    3
dtype: int64

Series arithmetic

12. Series addition

Series addition is aligned on the index. Labels that appear in only one of the two Series produce NaN (a missing value).

In [12]:

s4

Out[12]:

a    1
b    2
c    3
d    4
A    6
B    1
C    2
D    3
E    4
dtype: int64

In [13]:

s3

Out[13]:

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [14]:

s4.add(s3)

Out[14]:

A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    2.0
b    4.0
c    6.0
d    8.0
e    NaN
dtype: float64
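
If NaN for unmatched labels is not what you want, Series.add accepts a fill_value argument that substitutes a value for the missing side before adding. A small sketch that rebuilds s3 and s4 with the same values as above:

```python
import pandas as pd

s3 = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))
s4 = pd.Series([1, 2, 3, 4, 6, 1, 2, 3, 4], index=list('abcdABCDE'))

# A label missing from one side is treated as 0 instead of producing NaN.
total = s4.add(s3, fill_value=0)
print(total)
```

The same fill_value argument is available on sub, mul, and div.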

13. Series subtraction

Series subtraction is aligned on the index. Labels that appear in only one of the two Series produce NaN (a missing value).

In [15]:

s4.sub(s3)

Out[15]:

A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    0.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64

14. Series multiplication

Series multiplication is aligned on the index. Labels that appear in only one of the two Series produce NaN (a missing value).

In [16]:

s4.mul(s3)

Out[16]:

A     NaN
B     NaN
C     NaN
D     NaN
E     NaN
a     1.0
b     4.0
c     9.0
d    16.0
e     NaN
dtype: float64

15. Series division

Series division is aligned on the index. Labels that appear in only one of the two Series produce NaN (a missing value).

In [17]:

s4.div(s3)

Out[17]:

A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    1.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64

16. Median of a Series

In [18]:

s4.median()

Out[18]:

3.0

17. Sum of a Series

In [19]:

s4.sum()

Out[19]:

26

18. Maximum of a Series

In [20]:

s4.max()

Out[20]:

6

19. Minimum of a Series

In [21]:

s4.min()

Out[21]:

1

Creating a DataFrame

Unlike a Series, a DataFrame can hold multiple columns of data. In practice, DataFrame is also the more commonly used of the two.

20. Create a DataFrame from a NumPy array

In [22]:

dates = pd.date_range('today', periods=6)  # Date range used as the index
num_arr = np.random.randn(6, 4)  # Random NumPy array used as the data
columns = ['A', 'B', 'C', 'D']  # List used as the column names
df1 = pd.DataFrame(num_arr, index=dates, columns=columns)
df1

Out[22]:

	A	B	C	D
2019-09-18 09:03:54.011568	-1.039513	2.057041	0.763148	-0.531636
2019-09-19 09:03:54.011568	1.573003	0.533353	1.424031	-0.551308
2019-09-20 09:03:54.011568	0.371965	0.026624	1.537917	0.560349
2019-09-21 09:03:54.011568	-0.121507	-1.112931	1.220485	-0.157225
2019-09-22 09:03:54.011568	0.326423	-0.195934	-2.091389	0.936256
2019-09-23 09:03:54.011568	1.140672	-0.216641	0.177470	-0.063391

21. Create a DataFrame from a dictionary of lists

In [23]:

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df2 = pd.DataFrame(data, index=labels)
df2

Out[23]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

22. View the data types of a DataFrame

In [24]:

df2.dtypes

Out[24]:

animal       object
age         float64
visits        int64
priority     object
dtype: object

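DataFrame basic operations

23. Preview the first 5 rows of a DataFrame

This method is very useful for quickly getting a feel for the structure of an unfamiliar dataset.

In [25]:

df2.head()  # Shows 5 rows by default; pass a number to preview a different number of rows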

Out[25]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no

24. View the last 3 rows of a DataFrame

In [26]:

df2.tail(3)

Out[26]:

	animal	age	visits	priority
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

25. View the index of a DataFrame

In [27]:

df2.index

Out[27]:

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

26. View the column names of a DataFrame

In [28]:

df2.columns

Out[28]:

Index(['animal', 'age', 'visits', 'priority'], dtype='object')

27. View the values of a DataFrame

In [29]:

df2.values

Out[29]:

array([['cat', 2.5, 1, 'yes'],
       ['cat', 3.0, 3, 'yes'],
       ['snake', 0.5, 2, 'no'],
       ['dog', nan, 3, 'yes'],
       ['dog', 5.0, 2, 'no'],
       ['cat', 2.0, 3, 'no'],
       ['snake', 4.5, 1, 'no'],
       ['cat', nan, 1, 'yes'],
       ['dog', 7.0, 2, 'no'],
       ['dog', 3.0, 1, 'no']], dtype=object)

28. View summary statistics of a DataFrame

In [30]:

df2.describe()

Out[30]:

	age	visits
count	8.000000	10.000000
mean	3.437500	1.900000
std	2.007797	0.875595
min	0.500000	1.000000
25%	2.375000	1.000000
50%	3.000000	2.000000
75%	4.625000	2.750000
max	7.000000	3.000000

29. Transpose a DataFrame

In [31]:

df2.T

Out[31]:

	a	b	c	d	e	f	g	h	i	j
animal	cat	cat	snake	dog	dog	cat	snake	cat	dog	dog
age	2.5	3	0.5	NaN	5	2	4.5	NaN	7	3
visits	1	3	2	3	2	3	1	1	2	1
priority	yes	yes	no	yes	no	no	no	yes	no	no

30. Sort a DataFrame by a column

In [32]:

df2.sort_values(by='age')  # Sort by age in ascending order; NaN values go last

Out[32]:

	animal	age	visits	priority
c	snake	0.5	2	no
f	cat	2.0	3	no
a	cat	2.5	1	yes
b	cat	3.0	3	yes
j	dog	3.0	1	no
g	snake	4.5	1	no
e	dog	5.0	2	no
i	dog	7.0	2	no
d	dog	NaN	3	yes
h	cat	NaN	1	yes

31. Slice the rows of a DataFrame

In [33]:

df2[1:3]

Out[33]:

	animal	age	visits	priority
b	cat	3.0	3	yes
c	snake	0.5	2	no

32. Select from a DataFrame by label (single column)

In [34]:

df2['age']

Out[34]:

a    2.5
b    3.0
c    0.5
d    NaN
e    5.0
f    2.0
g    4.5
h    NaN
i    7.0
j    3.0
Name: age, dtype: float64

In [35]:

df2.age  # Equivalent to df2['age']

Out[35]:

a    2.5
b    3.0
c    0.5
d    NaN
e    5.0
f    2.0
g    4.5
h    NaN
i    7.0
j    3.0
Name: age, dtype: float64

33. Select from a DataFrame by label (multiple columns)

In [36]:

df2[['age', 'animal']]  # Pass a list of column names

Out[36]:

	age	animal
a	2.5	cat
b	3.0	cat
c	0.5	snake
d	NaN	dog
e	5.0	dog
f	2.0	cat
g	4.5	snake
h	NaN	cat
i	7.0	dog
j	3.0	dog

34. Select from a DataFrame by position

In [38]:

df2.iloc[1:3]  # Rows 2 and 3 (positions 1 and 2)

Out[38]:

	animal	age	visits	priority
b	cat	3.0	3	yes
c	snake	0.5	2	no
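
Position-based and label-based slicing treat the end point differently: iloc excludes the stop position, while loc includes the stop label. A minimal sketch on a small stand-in frame (its contents are illustrative, shortened from df2):

```python
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'cat', 'snake', 'dog'],
                   'visits': [1, 3, 2, 3]},
                  index=['a', 'b', 'c', 'd'])

by_position = df.iloc[1:3]   # positions 1 and 2; stop position 3 is excluded
by_label = df.loc['b':'c']   # labels 'b' through 'c'; stop label is included

print(by_position.equals(by_label))
```

Both expressions select the same two rows, b and c.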

35. Copy a DataFrame

In [39]:

# Make a copy of the DataFrame so the dataset can be used by several independent workflows
df3 = df2.copy()
df3

Out[39]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

36. Check whether DataFrame elements are missing

In [40]:

df3.isnull()  # Returns True where the value is missing

Out[40]:

	animal	age	visits	priority
a	False	False	False	False
b	False	False	False	False
c	False	False	False	False
d	False	True	False	False
e	False	False	False	False
f	False	False	False	False
g	False	False	False	False
h	False	True	False	False
i	False	False	False	False
j	False	False	False	False

37. Add a column

In [42]:

num = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df3.index)

df3['No.'] = num  # Add a new column named 'No.'
df3

Out[42]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	3.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	2.0	3	no	5
g	snake	4.5	1	no	6
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

38. Modify a DataFrame value by integer position

In [43]:

# Change the value in the 2nd row, 2nd column from 3.0 to 2.0
df3.iat[1, 1] = 2  # Positions are zero-based, so this is [1, 1]
df3

Out[43]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	2.0	3	no	5
g	snake	4.5	1	no	6
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

39. Modify a DataFrame value by label

In [44]:

df3.loc['f', 'age'] = 1.5
df3

Out[44]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	1.5	3	no	5
g	snake	4.5	1	no	6
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

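40. Mean of a DataFrame

In [45]:

df3.mean(numeric_only=True)  # Average the numeric columns only; newer pandas raises on string columns otherwise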

Out[45]:

age       3.25
visits    1.90
No.       4.50
dtype: float64

41. Sum any column of a DataFrame

In [46]:

df3['visits'].sum()

Out[46]:

19

String operations

42. Convert strings to lowercase

In [47]:

string = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
                    np.nan, 'CABA', 'dog', 'cat'])
print(string)
string.str.lower()

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object

Out[47]:

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

43. Convert strings to uppercase

In [48]:

string.str.upper()

Out[48]:

0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object

DataFrame missing-value operations

44. Fill missing values

In [49]:

df4 = df3.copy()
print(df4)
df4.fillna(value=3)

  animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  2.0       3      yes    1
c  snake  0.5       2       no    2
d    dog  NaN       3      yes    3
e    dog  5.0       2       no    4
f    cat  1.5       3       no    5
g  snake  4.5       1       no    6
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

Out[49]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	3.0	3	yes	3
e	dog	5.0	2	no	4
f	cat	1.5	3	no	5
g	snake	4.5	1	no	6
h	cat	3.0	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

45. Drop rows that contain missing values

In [50]:

df5 = df3.copy()
print(df5)
df5.dropna(how='any')  # Any row containing a NaN is dropped

  animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  2.0       3      yes    1
c  snake  0.5       2       no    2
d    dog  NaN       3      yes    3
e    dog  5.0       2       no    4
f    cat  1.5       3       no    5
g  snake  4.5       1       no    6
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

Out[50]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
c	snake	0.5	2	no	2
e	dog	5.0	2	no	4
f	cat	1.5	3	no	5
g	snake	4.5	1	no	6
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

46. Merge DataFrames on a given column

In [51]:

left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})

print(left)
print(right)

# Join on the key column; only foo2 appears on both sides, so the result has a single row
pd.merge(left, right, on='key')

    key  one
0  foo1    1
1  foo2    2
    key  two
0  foo2    4
1  foo3    5

Out[51]:

	key	one	two
0	foo2	2	4
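
pd.merge performs an inner join by default, which is why foo1 and foo3 disappear. As a sketch of the alternative, how='outer' keeps keys from both sides and indicator=True adds a _merge column recording where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})

# Outer join: every key from either side is kept; missing cells become NaN.
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer)
```

Rows that exist on only one side are marked left_only or right_only, and matched rows are marked both.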

DataFrame file operations

47. Write a CSV file

In [29]:

df3.to_csv('animal.csv')
print("Written successfully.")

Written successfully.

48. Read a CSV file

In [30]:

df_animal = pd.read_csv('animal.csv')
df_animal

Out[30]:

	Unnamed: 0	animal	age	visits	priority	No.
0	a	cat	2.5	1	yes	0
1	b	cat	2.0	3	yes	1
2	c	snake	0.5	2	no	2
3	d	dog	NaN	3	yes	3
4	e	dog	5.0	2	no	4
5	f	cat	1.5	3	no	5
6	g	snake	4.5	1	no	6
7	h	cat	NaN	1	yes	7
8	i	dog	7.0	2	no	8
9	j	dog	3.0	1	no	9
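
The Unnamed: 0 column appears because to_csv writes the index as an unlabeled first column, and read_csv treats it as ordinary data. Passing index_col=0 restores it as the index. A minimal sketch, using an in-memory buffer and a two-row stand-in frame (both are assumptions made to keep the example self-contained):

```python
import io

import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'dog'], 'age': [2.5, 5.0]},
                  index=['a', 'b'])

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# index_col=0 tells read_csv to use the first column as the index.
restored = pd.read_csv(buf, index_col=0)
print(restored)
```

With a real file the call would be pd.read_csv('animal.csv', index_col=0).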

49. Write an Excel file

In [31]:

df3.to_excel('animal.xlsx', sheet_name='Sheet1')
print("Written successfully.")

Written successfully.

50. Read an Excel file

In [32]:

pd.read_excel('animal.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

Out[32]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	1.5	3	no	5
g	snake	4.5	1	no	6
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

Advanced

Time-series indexing

51. Build a Series of random values indexed by every day of 2018

In [34]:

dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
dti

Out[34]:

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10',
               ...
               '2018-12-22', '2018-12-23', '2018-12-24', '2018-12-25',
               '2018-12-26', '2018-12-27', '2018-12-28', '2018-12-29',
               '2018-12-30', '2018-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')

In [35]:

s = pd.Series(np.random.rand(len(dti)), index=dti)
s

Out[35]:

2018-01-01    0.242385
2018-01-02    0.325990
2018-01-03    0.689102
2018-01-04    0.594617
2018-01-05    0.794879
2018-01-06    0.806709
2018-01-07    0.323261
2018-01-08    0.664934
2018-01-09    0.915538
2018-01-10    0.892148
2018-01-11    0.496773
2018-01-12    0.502913
2018-01-13    0.552878
2018-01-14    0.674221
2018-01-15    0.042659
2018-01-16    0.350266
2018-01-17    0.452122
2018-01-18    0.023482
2018-01-19    0.825472
2018-01-20    0.730584
2018-01-21    0.048707
2018-01-22    0.368339
2018-01-23    0.971550
2018-01-24    0.418001
2018-01-25    0.447213
2018-01-26    0.455530
2018-01-27    0.815156
2018-01-28    0.986660
2018-01-29    0.576433
2018-01-30    0.975408
                ...   
2018-12-02    0.134845
2018-12-03    0.136748
2018-12-04    0.300511
2018-12-05    0.955365
2018-12-06    0.220481
2018-12-07    0.185626
2018-12-08    0.002454
2018-12-09    0.028550
2018-12-10    0.968371
2018-12-11    0.488868
2018-12-12    0.854177
2018-12-13    0.184961
2018-12-14    0.215790
2018-12-15    0.393876
2018-12-16    0.472714
2018-12-17    0.790068
2018-12-18    0.075892
2018-12-19    0.494739
2018-12-20    0.245847
2018-12-21    0.686981
2018-12-22    0.891848
2018-12-23    0.599637
2018-12-24    0.180019
2018-12-25    0.726704
2018-12-26    0.020163
2018-12-27    0.171086
2018-12-28    0.296539
2018-12-29    0.267030
2018-12-30    0.076163
2018-12-31    0.523547
Freq: D, Length: 365, dtype: float64

52. Sum the values of `s` that fall on a Wednesday

In [36]:

# Weekday numbering starts at 0 for Monday, so Wednesday is 2
s[s.index.weekday == 2].sum()

Out[36]:

27.30757195752505
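
One way to convince yourself that the weekday mask picks the right rows is to replace the random values with ones, so the sum becomes a count. This is only a sanity-check sketch; the names ones and wednesdays are illustrative:

```python
import pandas as pd

dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
ones = pd.Series(1, index=dti)  # every day counts as 1

# Same mask as above: weekday == 2 selects Wednesdays.
wednesdays = ones[ones.index.weekday == 2]

print(wednesdays.sum())                      # number of Wednesdays in 2018
print(wednesdays.index.day_name().unique())  # day names present in the selection
```

The selection contains only Wednesdays, 52 of them in 2018.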

53. Compute the monthly mean of `s`

In [37]:

s.resample('M').mean()

Out[37]:

2018-01-31    0.550778
2018-02-28    0.461293
2018-03-31    0.445366
2018-04-30    0.454076
2018-05-31    0.521010
2018-06-30    0.540702
2018-07-31    0.567162
2018-08-31    0.482533
2018-09-30    0.507551
2018-10-31    0.524215
2018-11-30    0.582312
2018-12-31    0.375236
Freq: M, dtype: float64

54. Resample a time series from seconds to minutes

In [38]:

s = pd.date_range('today', periods=100, freq='S')
s
ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)
ts
ts.resample('Min').sum()

Out[38]:

2019-09-18 12:15:00     5376
2019-09-18 12:16:00    14086
2019-09-18 12:17:00     5024
Freq: T, dtype: int64
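
The three bins above are uneven because 'today' starts partway through a minute, so the first and last bins cover fewer than 60 samples. A reproducible sketch with a fixed start time (the timestamp is an arbitrary assumption) and constant values makes the bin sizes visible:

```python
import pandas as pd

# Fixed, hypothetical start time: 20 seconds into a minute.
idx = pd.date_range('2019-09-18 12:15:20', periods=100,
                    freq=pd.Timedelta(seconds=1))
ts = pd.Series(1, index=idx)  # each second contributes 1

# Each bin is labeled by the start of its minute.
per_minute = ts.resample('min').sum()
print(per_minute)
```

The 100 seconds split into 40 samples in the 12:15 bin and 60 in the 12:16 bin. The frequency is given as a Timedelta here only to sidestep the differing second aliases across pandas versions.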

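55. Localize to UTC (Coordinated Universal Time)

In [39]:

s = pd.date_range('today', periods=1, freq='D')  # Get the current time
ts = pd.Series(np.random.randn(len(s)), s)  # Random value
ts_utc = ts.tz_localize('UTC')  # Label the naive timestamps as UTC; the clock time is not shifted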
ts_utc

Out[39]:

2019-09-18 12:15:58.840478+00:00    0.311219
Freq: D, dtype: float64

56. Convert to the Shanghai time zone

In [40]:

ts_utc.tz_convert('Asia/Shanghai')

Out[40]:

2019-09-18 20:15:58.840478+08:00    0.311219
Freq: D, dtype: float64

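Compare this with your current local time: does it match? (It will if the notebook's system clock runs on UTC and you are in the UTC+8 zone.)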
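
The two calls do different jobs: tz_localize attaches a time zone to naive timestamps without changing the clock reading, whereas tz_convert re-expresses the same instant in another zone. A small sketch with a fixed timestamp (chosen to mirror the output above):

```python
import pandas as pd

naive = pd.Series([1.0], index=pd.DatetimeIndex(['2019-09-18 12:15:58']))

as_utc = naive.tz_localize('UTC')                 # same clock time, now labeled UTC
in_shanghai = as_utc.tz_convert('Asia/Shanghai')  # same instant, shown at UTC+8

print(as_utc.index[0])
print(in_shanghai.index[0])
```

The clock reading moves from 12:15 to 20:15, but the two timestamps compare equal because they denote the same instant.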

57. Convert between time representations

In [41]:

rng = pd.date_range('1/1/2018', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
ps = ts.to_period()
print(ps)
ps.to_timestamp()

2018-01-31    0.098535
2018-02-28    0.895295
2018-03-31   -0.542843
2018-04-30    0.751465
2018-05-31    1.265660
Freq: M, dtype: float64
2018-01    0.098535
2018-02    0.895295
2018-03   -0.542843
2018-04    0.751465
2018-05    1.265660
Freq: M, dtype: float64

Out[41]:

2018-01-01    0.098535
2018-02-01    0.895295
2018-03-01   -0.542843
2018-04-01    0.751465
2018-05-01    1.265660
Freq: MS, dtype: float64

Series MultiIndex

58. Create a Series with a MultiIndex

Build a Series of random values whose MultiIndex is the product of letters = ['A', 'B', 'C'] and numbers = list(range(10)).

In [42]:

letters = ['A', 'B', 'C']
numbers = list(range(10))

mi = pd.MultiIndex.from_product([letters, numbers])  # Build the MultiIndex
s = pd.Series(np.random.rand(30), index=mi)  # Random values
s

Out[42]:

A  0    0.273720
   1    0.284131
   2    0.128798
   3    0.448984
   4    0.097474
   5    0.640892
   6    0.669306
   7    0.160024
   8    0.489361
   9    0.601552
B  0    0.267547
   1    0.274272
   2    0.902249
   3    0.016477
   4    0.619946
   5    0.882701
   6    0.464922
   7    0.789508
   8    0.935174
   9    0.070130
C  0    0.650272
   1    0.859504
   2    0.045425
   3    0.586634
   4    0.391077
   5    0.618920
   6    0.465239
   7    0.679443
   8    0.369934
   9    0.474753
dtype: float64

59. Query a MultiIndex Series

In [43]:

# Select the inner-level labels 1, 3 and 6 under every outer label
s.loc[:, [1, 3, 6]]

Out[43]:

A  1    0.284131
   3    0.448984
   6    0.669306
B  1    0.274272
   3    0.016477
   6    0.464922
C  1    0.859504
   3    0.586634
   6    0.465239
dtype: float64

60. Slice a MultiIndex Series

In [44]:

s.loc[pd.IndexSlice[:'B', 5:]]

Out[44]:

A  5    0.640892
   6    0.669306
   7    0.160024
   8    0.489361
   9    0.601552
B  5    0.882701
   6    0.464922
   7    0.789508
   8    0.935174
   9    0.070130
dtype: float64

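DataFrame MultiIndex

61. Create a DataFrame with a MultiIndex

Create a DataFrame with a two-level index, where the outer level is the letters A and B, the inner level is the labels '1' to '3', and the values are the integers 0 to 11.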

In [45]:

frame = pd.DataFrame(np.arange(12).reshape(6, 2),
                     index=[list('AAABBB'), list('123123')],
                     columns=['hello', 'shiyanlou'])
frame

Out[45]:

		hello	shiyanlou
A	1	0	1
	2	2	3
	3	4	5
B	1	6	7
	2	8	9
	3	10	11

62. Name the levels of a MultiIndex

In [46]:

frame.index.names = ['first', 'second']
frame

Out[46]:

		hello	shiyanlou
first	second
A	1	0	1
	2	2	3
	3	4	5
B	1	6	7
	2	8	9
	3	10	11

63. Group a MultiIndex DataFrame by an index level and sum

In [47]:

frame.groupby('first').sum()

Out[47]:

	hello	shiyanlou
first
A	6	9
B	24	27

64. Move column labels into the row index (stack)

In [48]:

print(frame)
frame.stack()

              hello  shiyanlou
first second                  
A     1           0          1
      2           2          3
      3           4          5
B     1           6          7
      2           8          9
      3          10         11

Out[48]:

first  second           
A      1       hello         0
               shiyanlou     1
       2       hello         2
               shiyanlou     3
       3       hello         4
               shiyanlou     5
B      1       hello         6
               shiyanlou     7
       2       hello         8
               shiyanlou     9
       3       hello        10
               shiyanlou    11
dtype: int64

65. Move an index level into the columns (unstack)

In [49]:

print(frame)
frame.unstack()

              hello  shiyanlou
first second                  
A     1           0          1
      2           2          3
      3           4          5
B     1           6          7
      2           8          9
      3          10         11

Out[49]:

	hello	shiyanlou
second	1	2	3	1	2	3
first
A	0	2	4	1	3	5
B	6	8	10	7	9	11

66. Conditional selection on a DataFrame

In [51]:

# Sample data

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df

Out[51]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

Select all rows where age is greater than 3.

In [52]:

df[df['age'] > 3]

Out[52]:

	animal	age	visits	priority
e	dog	5.0	2	no
g	snake	4.5	1	no
i	dog	7.0	2	no

67. Slice by row and column position

In [57]:

df.iloc[2:4, 1:3]

Out[57]:

	age	visits
c	0.5	2
d	NaN	3

68. Select from a DataFrame with multiple conditions

Select all rows where the animal is a cat and age is less than 3.

In [54]:

df = pd.DataFrame(data, index=labels)

df[(df['animal'] == 'cat') & (df['age'] < 3)]

Out[54]:

	animal	age	visits	priority
a	cat	2.5	1	yes
f	cat	2.0	3	no

69. Select rows of a DataFrame by membership (isin)

In [55]:

df3[df3['animal'].isin(['cat', 'dog'])]

Out[55]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	1.5	3	no	5
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

70. Select from a DataFrame by row labels and column names

In [56]:

df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

Out[56]:

	animal	age
d	dog	NaN
e	dog	5.0
i	dog	7.0

71. Sort a DataFrame by multiple keys

Sort by age in descending order, then by visits in ascending order.

In [58]:

df.sort_values(by=['age', 'visits'], ascending=[False, True])

Out[58]:

	animal	age	visits	priority
i	dog	7.0	2	no
e	dog	5.0	2	no
g	snake	4.5	1	no
j	dog	3.0	1	no
b	cat	3.0	3	yes
a	cat	2.5	1	yes
f	cat	2.0	3	no
c	snake	0.5	2	no
h	cat	NaN	1	yes
d	dog	NaN	3	yes

72. Replace multiple values in a DataFrame

Replace yes with True and no with False in the priority column.

In [59]:

df['priority'].map({'yes': True, 'no': False})

Out[59]:

a     True
b     True
c    False
d     True
e    False
f    False
g    False
h     True
i    False
j    False
Name: priority, dtype: bool
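
Two details are worth keeping in mind with map: it returns a new Series rather than changing the column in place, and any value missing from the dictionary becomes NaN. A sketch with a deliberately unmapped value ('maybe' is a made-up entry for illustration):

```python
import pandas as pd

priority = pd.Series(['yes', 'no', 'maybe'], index=['a', 'b', 'c'])

# 'maybe' has no entry in the mapping, so it turns into NaN.
flags = priority.map({'yes': True, 'no': False})
print(flags)
```

To keep the result, assign it back, for example df['priority'] = df['priority'].map({'yes': True, 'no': False}).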

73. Group a DataFrame and sum

In [60]:

df4.groupby('animal').sum()

Out[60]:

	age	visits	No.
animal
cat	6.0	8	13
dog	15.0	8	24
snake	5.0	3	8

74. Concatenate several DataFrames from a list

In [61]:

temp_df1 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 1 of random numbers
temp_df2 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 2 of random numbers
temp_df3 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 3 of random numbers

print(temp_df1)
print(temp_df2)
print(temp_df3)

pieces = [temp_df1, temp_df2, temp_df3]
pd.concat(pieces)

          0         1         2         3
0  0.383991 -1.142700  0.714516 -0.065961
1  0.291247 -0.326451  0.194736  0.586472
2 -0.355746  0.158567 -0.552178  1.079521
3 -0.093985  0.112255 -0.460585 -0.250174
4 -0.935083 -1.128611 -0.995921  0.199508
          0         1         2         3
0  0.164079  0.271100  0.803741  1.481683
1 -1.936093 -0.320366 -1.457524 -1.141063
2  0.678864 -0.078156 -1.165706  0.657986
3  0.060676  1.132297  0.551034 -1.145125
4 -0.003243  1.063446  1.690816 -0.662396
          0         1         2         3
0 -0.682707  0.022459 -0.975442 -0.987889
1  1.587195 -1.449075 -0.772353 -0.565108
2 -1.504494 -0.659931  0.825370 -0.414365
3 -0.304541 -0.805884  1.602551 -0.308753
4  1.811441 -1.900020  0.503447  0.892578

Out[61]:

	0	1	2	3
0	0.383991	-1.142700	0.714516	-0.065961
1	0.291247	-0.326451	0.194736	0.586472
2	-0.355746	0.158567	-0.552178	1.079521
3	-0.093985	0.112255	-0.460585	-0.250174
4	-0.935083	-1.128611	-0.995921	0.199508
0	0.164079	0.271100	0.803741	1.481683
1	-1.936093	-0.320366	-1.457524	-1.141063
2	0.678864	-0.078156	-1.165706	0.657986
3	0.060676	1.132297	0.551034	-1.145125
4	-0.003243	1.063446	1.690816	-0.662396
0	-0.682707	0.022459	-0.975442	-0.987889
1	1.587195	-1.449075	-0.772353	-0.565108
2	-1.504494	-0.659931	0.825370	-0.414365
3	-0.304541	-0.805884	1.602551	-0.308753
4	1.811441	-1.900020	0.503447	0.892578

75. Find the column of a DataFrame with the smallest sum

In [62]:

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
print(df)
df.sum().idxmin()  # idxmax() and idxmin() are Series methods that return the label of the largest or smallest value

          a         b         c         d         e         f         g  \
0  0.658017  0.724698  0.688882  0.818700  0.270623  0.571789  0.933141   
1  0.079060  0.600451  0.629490  0.932317  0.482760  0.354452  0.207915   
2  0.518762  0.629966  0.205436  0.025052  0.097185  0.640541  0.269677   
3  0.423075  0.787404  0.606824  0.634464  0.084876  0.823212  0.962504   
4  0.232322  0.992694  0.056531  0.069404  0.630463  0.831217  0.067441   

          h         i         j  
0  0.981668  0.272827  0.312952  
1  0.171085  0.948558  0.245743  
2  0.925972  0.797848  0.192847  
3  0.352203  0.424905  0.473665  
4  0.568331  0.146546  0.901848

Out[62]:

'e'

76. Subtract each row's mean from every element in that row

In [63]:

df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
df.sub(df.mean(axis=1), axis=0)

          0         1         2
0  0.425960  0.892803  0.152310
1  0.016100  0.300189  0.510423
2  0.841156  0.836527  0.046076
3  0.861969  0.066514  0.395170
4  0.022844  0.443551  0.066895

Out[63]:

	0	1	2
0	-0.064398	0.402445	-0.338048
1	-0.259471	0.024618	0.234853
2	0.266570	0.261941	-0.528510
3	0.420751	-0.374704	-0.046047
4	-0.154919	0.265787	-0.110868

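77. Group a DataFrame and sum the three largest values in each group

In [64]:

df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                   'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
print(df)
# Group on the outer index level to total each group's top three (sum(level=0) was removed in pandas 2.0)
df.groupby('A')['B'].nlargest(3).groupby(level=0).sum()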

Out[64]:

A
a    409
b    156
c    345
Name: B, dtype: int64

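Pivot tables

When analyzing a large dataset, a pivot table (pivot_table) lets you explore the relationships between features without modifying the original data.

78. Create a pivot table

Build a new table aggregated with columns A and B as the index.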

In [65]:

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})

print(df)

pd.pivot_table(df, index=['A', 'B'])

        A  B    C         D         E
0     one  A  foo -1.196048 -0.325957
1     one  B  foo  0.375415  0.241904
2     two  C  foo  0.791868 -1.274019
3   three  A  bar  1.194800  0.543847
4     one  B  bar  0.060005 -0.041084
5     one  C  bar  1.489608  0.121246
6     two  A  foo  0.785924 -0.010453
7   three  B  foo -0.296418 -0.051107
8     one  C  foo -0.757490 -0.828000
9     one  A  bar -0.165180 -2.021311
10    two  B  bar  0.164839  0.016729
11  three  C  bar -0.341369  0.310826

Out[65]:

		D	E
A	B
one	A	-0.680614	-1.173634
	B	0.217710	0.100410
	C	0.366059	-0.353377
three	A	1.194800	0.543847
	B	-0.296418	-0.051107
	C	-0.341369	0.310826
two	A	0.785924	-0.010453
	B	0.164839	0.016729
	C	0.791868	-1.274019

79. Aggregate specific columns in a pivot table

Aggregate column D of the DataFrame with columns A and B as the index. The default aggregation is the mean.

In [66]:

pd.pivot_table(df, values=['D'], index=['A', 'B'])

Out[66]:

		D
A	B
one	A	-0.680614
	B	0.217710
	C	0.366059
three	A	1.194800
	B	-0.296418
	C	-0.341369
two	A	0.785924
	B	0.164839
	C	0.791868

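80. Choose the aggregation functions of a pivot table

The previous exercise aggregated column D with the default mean. Other aggregations can be requested through aggfunc.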

In [67]:

pd.pivot_table(df, values=['D'], index=['A', 'B'], aggfunc=[np.sum, len])

Out[67]:

		sum	len
		D	D
A	B
one	A	-1.361229	2.0
	B	0.435421	2.0
	C	0.732119	2.0
three	A	1.194800	1.0
	B	-0.296418	1.0
	C	-0.341369	1.0
two	A	0.785924	1.0
	B	0.164839	1.0
	C	0.791868	1.0
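
aggfunc also accepts function names as strings, which avoids passing NumPy callables. A sketch under the assumption that column D holds the integers 0 to 11 instead of random numbers, so the totals can be checked by hand:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'D': range(12)})  # deterministic stand-in for the random column

# 'sum' and 'count' are looked up by name; the result has one column per function.
table = pd.pivot_table(df, values=['D'], index=['A', 'B'],
                       aggfunc=['sum', 'count'])
print(table)
```

The (one, A) group contains rows 0 and 9, so its sum is 9 and its count is 2.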

81. Split a pivot table by an additional column

When aggregating column D by columns A and B, you can pass columns to see how column C affects D.

In [68]:

pd.pivot_table(df, values=['D'], index=['A', 'B'],
               columns=['C'], aggfunc=np.sum)

Out[68]:

		D
	C	bar	foo
A	B
one	A	-0.165180	-1.196048
	B	0.060005	0.375415
	C	1.489608	-0.757490
three	A	1.194800	NaN
	B	NaN	-0.296418
	C	-0.341369	NaN
two	A	NaN	0.785924
	B	0.164839	NaN
	C	NaN	0.791868

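82. Handle missing values in a pivot table

Combinations of index and column values that do not occur in the data show up as missing values in the pivot table. Pass fill_value to replace them.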

In [69]:

pd.pivot_table(df, values=['D'], index=['A', 'B'],
               columns=['C'], aggfunc=np.sum, fill_value=0)

Out[69]:

		D
	C	bar	foo
A	B
one	A	-0.165180	-1.196048
	B	0.060005	0.375415
	C	1.489608	-0.757490
three	A	1.194800	0.000000
	B	0.000000	-0.296418
	C	-0.341369	0.000000
two	A	0.000000	0.785924
	B	0.164839	0.000000
	C	0.000000	0.791868

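Categorical data

Data broadly comes in two forms: quantitative and qualitative. Quantitative data is numeric and its range of values can vary, whereas qualitative data takes its values from a fixed set. Categorical data (the category dtype) is one kind of qualitative data.

83. Define categorical data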

In [70]:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": [
                  'a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df

Out[70]:

	id	raw_grade	grade
0	1	a	a
1	2	b	b
2	3	b	b
3	4	a	a
4	5	a	a
5	6	e	e

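84. Rename the categories

In [71]:

# The categories are ordered a, b, e, so the new names are applied in that order.
# rename_categories is used because newer pandas no longer allows assigning to .cat.categories.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])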
df

Out[71]:

	id	raw_grade	grade
0	1	a	very good
1	2	b	good
2	3	b	good
3	4	a	very good
4	5	a	very good
5	6	e	very bad
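
Renaming by position depends on knowing the current category order. rename_categories also accepts a dictionary, which makes the old-to-new mapping explicit. A minimal sketch on a standalone Series with the same raw grades:

```python
import pandas as pd

grade = pd.Series(['a', 'b', 'b', 'a', 'a', 'e']).astype('category')

# Map each existing category to its new name explicitly.
renamed = grade.cat.rename_categories(
    {'a': 'very good', 'b': 'good', 'e': 'very bad'})
print(renamed.tolist())
```

The result matches the table above without relying on the order of the categories.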

85. Reorder the categories and add the missing ones

In [72]:

df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
df

Out[72]:

	id	raw_grade	grade
0	1	a	very good
1	2	b	good
2	3	b	good
3	4	a	very good
4	5	a	very good
5	6	e	very bad

86. Sort by a categorical column

In [73]:

df.sort_values(by="grade")

Out[73]:

	id	raw_grade	grade
5	6	e	very bad
1	2	b	good
2	3	b	good
0	1	a	very good
3	4	a	very good
4	5	a	very good

87. Group by a categorical column

In [74]:

df.groupby("grade").size()

Out[74]:

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

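Data cleaning

The data we receive often does not meet the requirements of the final analysis: it may contain many missing values and bad entries, so it has to be cleaned first.

88. Interpolate missing values

FlightNumber has missing values. The numbers increase in steps of 10; fill in the missing values so the column is complete, and convert it to int.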

In [75]:

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df

Out[75]:

	From_To	FlightNumber	RecentDelays	Airline
0	LoNDon_paris	10045	[23, 47]	KLM(!)
1	MAdrid_miLAN	10055	[]	<Air France> (12)
2	londON_StockhOlm	10065	[24, 43, 87]	(British Airways. )
3	Budapest_PaRis	10075	[13]	12. Air France
4	Brussels_londOn	10085	[67, 32]	"Swiss Air"

89. Split a column

From_To should really be two separate columns, From and To. Split From_To on the underscore into two columns and store them in a new table.

In [76]:

temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
temp

Out[76]:

	From	To
0	LoNDon	paris
1	MAdrid	miLAN
2	londON	StockhOlm
3	Budapest	PaRis
4	Brussels	londOn

90. Normalize capitalization

The place names are written inconsistently (for example, londON should be London), so the data needs to be normalized.

In [78]:

temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()

91. Drop the bad column and add the cleaned columns

Drop the original From_To column and add the cleaned From and To columns.

In [79]:

df = df.drop('From_To', axis=1)
df = df.join(temp)
print(df)

   FlightNumber  RecentDelays              Airline      From         To
0         10045      [23, 47]               KLM(!)    London      Paris
1         10055            []    <Air France> (12)    Madrid      Milan
2         10065  [24, 43, 87]  (British Airways. )    London  Stockholm
3         10075          [13]       12. Air France  Budapest      Paris
4         10085      [67, 32]          "Swiss Air"  Brussels     London

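92. Strip extraneous characters

Many values in the Airline column contain stray characters that would interfere with later analysis, so they need to be cleaned up.

In [80]:

df['Airline'] = df['Airline'].str.extract(
    r'([a-zA-Z\s]+)', expand=False).str.strip()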
df

Out[80]:

	FlightNumber	RecentDelays	Airline	From	To
0	10045	[23, 47]	KLM	London	Paris
1	10055	[]	Air France	Madrid	Milan
2	10065	[24, 43, 87]	British Airways	London	Stockholm
3	10075	[13]	Air France	Budapest	Paris
4	10085	[67, 32]	Swiss Air	Brussels	London

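93. Normalize the format

RecentDelays stores its records as lists of varying length, which makes later analysis awkward. Expand the lists so that elements at the same position form a column, and use NaN where a list has no element at that position.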

In [81]:

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n)
                  for n in range(1, len(delays.columns)+1)]

df = df.drop('RecentDelays', axis=1).join(delays)
df

Out[81]:

	FlightNumber	Airline	From	To	delay_1	delay_2	delay_3
0	10045	KLM	London	Paris	23.0	47.0	NaN
1	10055	Air France	Madrid	Milan	NaN	NaN	NaN
2	10065	British Airways	London	Stockholm	24.0	43.0	87.0
3	10075	Air France	Budapest	Paris	13.0	NaN	NaN
4	10085	Swiss Air	Brussels	London	67.0	32.0	NaN
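
As an alternative sketch to apply(pd.Series), the lists can be handed to the DataFrame constructor directly, which pads shorter rows with NaN in the same way. The Series name recent is a stand-in for df['RecentDelays']:

```python
import pandas as pd

recent = pd.Series([[23, 47], [], [24, 43, 87], [13], [67, 32]])

# Build one row per list; rows shorter than the longest list are padded with NaN.
delays = pd.DataFrame(recent.tolist(), index=recent.index)
delays.columns = ['delay_{}'.format(n)
                  for n in range(1, delays.shape[1] + 1)]
print(delays)
```

This avoids constructing one Series per row, which tends to matter on larger tables.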

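Data preprocessing

94. Bin values into intervals

The math grades of some students in a class are shown below:

df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'],'grades':[58,83,79,65,93,45,61,88]})

What we care about more is whether each student passed, so split the grades by whether they are greater than 60.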

In [82]:

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella',
                            'Frank', 'Grace', 'Jenny'], 'grades': [58, 83, 79, 65, 93, 45, 61, 88]})


def choice(x):
    if x > 60:
        return 1
    else:
        return 0


df['grades'] = df['grades'].apply(choice)
df

Out[82]:

	name	grades
0	Alice	0
1	Bob	1
2	Candy	1
3	Dany	1
4	Ella	1
5	Frank	0
6	Grace	1
7	Jenny	1
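
For a simple threshold like this, a vectorized comparison gives the same result without a helper function. A minimal sketch on the grades alone:

```python
import pandas as pd

grades = pd.Series([58, 83, 79, 65, 93, 45, 61, 88])

# The comparison yields booleans; astype(int) turns True/False into 1/0.
passed = (grades > 60).astype(int)
print(passed.tolist())
```

The output matches the table above, with 0 for Alice and Frank and 1 for everyone else.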

95. Remove duplicates

A DataFrame with a single column A is shown below:

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

Remove the consecutive repeated values in column A.

In [83]:

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.loc[df['A'].shift() != df['A']]

Out[83]:

	A
0	1
1	2
3	3
4	4
5	5
8	6
9	7
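
Comparing each value with the previous one (shift) removes only consecutive repeats, which is different from drop_duplicates. A sketch with a made-up column in which the value 1 returns after a gap:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 1, 1, 3]})

# Keep a row only when it differs from the row directly above it.
no_runs = df.loc[df['A'].shift() != df['A']]

# drop_duplicates keeps the first occurrence of each value anywhere in the column.
unique_only = df.drop_duplicates('A')

print(no_runs['A'].tolist())
print(unique_only['A'].tolist())
```

The shift approach keeps the second run of 1, while drop_duplicates discards it.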

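96. Normalize data

Sometimes the columns of a DataFrame differ greatly in scale and need to be normalized.

Max-Min normalization is a simple and common approach. Its formula is:

$$Y = \frac{X - X_{min}}{X_{max} - X_{min}}$$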

In [84]:

def normalization(df):
    numerator = df.sub(df.min())
    denominator = (df.max()).sub(df.min())
    Y = numerator.div(denominator)
    return Y


df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
normalization(df)

          0         1         2
0  0.652348  0.837462  0.659787
1  0.093846  0.072570  0.034394
2  0.705503  0.064403  0.573365
3  0.202570  0.802206  0.527452
4  0.175225  0.582971  0.307263

Out[84]:

	0	1	2
0	0.913096	1.000000	1.000000
1	0.000000	0.010565	0.000000
2	1.000000	0.000000	0.861811
3	0.177753	0.954394	0.788398
4	0.133046	0.670799	0.436317
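
A quick check of the formula on hand-picked numbers (the column names x and y are illustrative) shows that each column is scaled independently to the range 0 to 1:

```python
import pandas as pd


def normalization(df):
    # (X - X_min) / (X_max - X_min), computed per column
    return df.sub(df.min()).div(df.max().sub(df.min()))


df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, 20.0, 50.0]})
scaled = normalization(df)
print(scaled)
```

Note that a column whose values are all identical has a zero denominator and comes out as NaN, so such columns need separate handling.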

Plotting with Pandas

The most direct way to understand what a dataset contains is to plot it.

97. Plot a Series

In [85]:

%matplotlib inline
ts = pd.Series(np.random.randn(100), index=pd.date_range('today', periods=100))
ts = ts.cumsum()
ts.plot()

Out[85]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f1f77b34198>

98. DataFrame line plot

In [87]:

df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()

Out[87]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f1f73464390>

99. DataFrame scatter plot

In [88]:

df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
df = df.cumsum()
df.plot.scatter("xs", "ys", color='red', marker="*")

Out[88]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f1f733f79b0>

100. DataFrame bar chart

In [89]:

df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
                   "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
                   "month": range(12)
                   })

ax = df.plot.bar("month", "revenue", color="yellow")
df.plot("month", "advertising", secondary_y=True, ax=ax)

Out[89]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f1f732bc470>

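Summary

If you have worked through all 100 exercises above by hand, your fluency with Pandas should have improved noticeably. We recommend revisiting these exercises from time to time, since practice makes perfect. The main topics covered in this lab are:

Creating Series
Series basic operations
Creating DataFrames
DataFrame basic operations
DataFrame file operations
Series, DataFrame and MultiIndex
Pivot tables
Data cleaning
Data preprocessing
Visualization

Please credit the source when reposting: 星晴, Pandas 100-Exercise Challenge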
