Chapter 05: Getting Started with pandas

Introduction to pandas 🐼

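pandas is a powerful Python library for data analysis and manipulation.
It resembles a spreadsheet program, but is more powerful and more flexible.
It is built on top of NumPy, another core Python library for numerical computing.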

Note

Think of pandas as Excel on steroids! 💪

Why pandas? 🤔

Data structures: intuitive structures such as Series (one-dimensional) and DataFrame (two-dimensional) for handling data efficiently.

Why pandas? 🤔 (cont.)

Data cleaning: tools for handling missing data and for filtering and transforming data.

Why pandas? 🤔 (cont.)

Data analysis: powerful grouping, aggregation, and statistical analysis.

Why pandas? 🤔 (cont.)

Integration: works seamlessly with other Python libraries such as NumPy, SciPy, scikit-learn, and matplotlib.

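Core concept: data analysis 🔎

Data analysis is the process of inspecting, cleaning, transforming, and modeling data in order to discover useful information, draw conclusions, and support decision-making. It involves several key steps: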

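Core concept: data analysis 🔎 (cont.)

Data collection: gather data from various sources.
Data cleaning: handle missing values, correct errors, and ensure consistency.
Data transformation: convert data into a format suitable for analysis (e.g., scaling, normalization).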

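Core concept: data analysis 🔎 (cont.)

Data exploration: use descriptive statistics and visualization to understand the patterns and characteristics of the data.
Data modeling: apply statistical or machine learning techniques to extract insights or make predictions.
Interpretation and reporting: communicate the results clearly and concisely.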
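As a rough illustration of the cleaning, transformation, and exploration steps listed above, the following sketch works through a tiny made-up table. The column names and values are hypothetical, and filling with the column mean and min-max scaling are only one possible choice for each step.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing measurement
raw = pd.DataFrame({"city": ["A", "B", "C", "D"],
                    "temp": [21.0, np.nan, 25.0, 19.0]})

# Cleaning: fill the missing value with the column mean
clean = raw.copy()
clean["temp"] = clean["temp"].fillna(clean["temp"].mean())

# Transformation: min-max scale 'temp' into the range [0, 1]
t = clean["temp"]
clean["temp_scaled"] = (t - t.min()) / (t.max() - t.min())

# Exploration: descriptive statistics of the numeric columns
summary = clean.describe()
print(summary)
```

Each step returns or updates an ordinary DataFrame, so the steps can be reordered or swapped for other techniques without changing the overall shape of the workflow.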

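Core concept: machine learning 🤖

Machine learning is a subfield of artificial intelligence (AI) that focuses on enabling computers to learn from data without being explicitly programmed. Key concepts include: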

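Core concept: machine learning 🤖 (cont.)

Training data: the dataset used to train a machine learning model.
Features: individual measurable properties of the data (e.g., the columns of a table).
Model: a mathematical representation of the patterns learned from the training data.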

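Core concept: machine learning 🤖 (cont.)

Prediction: using a trained model to make predictions on new, unseen data.
Supervised learning: learning from labeled data (the correct output is known). Examples: classification, regression.
Unsupervised learning: learning from unlabeled data (the correct output is unknown). Examples: clustering, dimensionality reduction.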

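Core concept: Python 🐍

Python is a general-purpose, high-level programming language known for its readability and its extensive libraries. Key features for data analysis:

Easy to learn: a clear syntax makes it beginner-friendly.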

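Core concept: Python 🐍 (cont.)

Large community: extensive online resources and support.
Rich library ecosystem: NumPy (numerical computing), pandas (data manipulation), scikit-learn (machine learning), matplotlib/seaborn (visualization).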

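Core concept: Python 🐍 (cont.)

Interpreted language: code runs statement by statement without a separate compilation step, which makes testing and debugging easier.
Dynamically typed: variable types do not need to be declared explicitly.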

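Importing pandas

import pandas as pd  # import pandas under the alias 'pd'
import numpy as np   # import NumPy under the alias 'np'

We import pandas under the alias pd (the standard convention).
We also import NumPy under the alias np, since pandas is built on top of NumPy.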

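pandas data structures: Series

A Series is a one-dimensional labeled array, similar to a single column in a spreadsheet.
It can hold data of any type (integers, floats, strings, and so on).
It has an index that labels each element.

obj = pd.Series([4, 7, -5, 3])  # create a Series from a list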
obj

0    4
1    7
2   -5
3    3
dtype: int64

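Series: index and values

print(obj.array)  # access the underlying data (a NumpyExtensionArray; called PandasArray in older pandas versions)
print(obj.index)  # access the index (default: 0, 1, 2, ...)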

<NumpyExtensionArray>
[np.int64(4), np.int64(7), np.int64(-5), np.int64(3)]
Length: 4, dtype: int64
RangeIndex(start=0, stop=4, step=1)

Note

obj.array returns the data; obj.index returns the index.

A Series with a custom index

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])  # create a Series with a custom index
obj2

d    4
b    7
a   -5
c    3
dtype: int64

Series: custom index (cont.)

obj2.index  # access the custom index

Index(['d', 'b', 'a', 'c'], dtype='object')

Note

The index is now ['d', 'b', 'a', 'c']. We can use these labels to access the data.

Accessing Series elements

obj2['a']  # access the element with index label 'a'

np.int64(-5)

Accessing Series elements (cont.)

obj2['d'] = 6  # modify the element at index label 'd'

Accessing Series elements (cont.)

obj2[['c', 'a', 'd']]  # access several elements with a list of labels

c    3
a   -5
d    6
dtype: int64

Note

A list of index labels selects a subset.

Filtering a Series

obj2[obj2 > 0]  # select the elements greater than 0 (boolean indexing)

d    6
b    7
c    3
dtype: int64

Filtering a Series (cont.)

obj2 * 2  # multiply every element by 2

d    12
b    14
a   -10
c     6
dtype: int64

Filtering a Series (cont.)

np.exp(obj2)  # apply the exponential function (from NumPy) element-wise

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Note

The link between index labels and values is preserved during these operations.

Series as a fixed-length, ordered dictionary

'b' in obj2  # check whether 'b' is in the index (like checking a key in a dict)

True

Series as a fixed-length, ordered dictionary (cont.)

'e' in obj2  # check whether 'e' is in the index

False

Note

A Series resembles a dictionary: the keys are the index labels and the values are the data.
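The dictionary analogy extends beyond the `in` operator. The sketch below, using a hypothetical `prices` Series, shows a few dict-style calls that a Series also supports.

```python
import pandas as pd

prices = pd.Series({"apple": 3, "pear": 5, "plum": 2})  # hypothetical data

has_pear = "pear" in prices        # membership tests the index labels
missing = prices.get("kiwi", 0)    # like dict.get: a default instead of a KeyError
pairs = list(prices.items())       # (label, value) pairs in index order
labels = list(prices.keys())       # keys() returns the index

print(has_pear, missing, pairs, labels)
```

Unlike a dict, the Series keeps its values in a typed array, which is what makes the vectorized arithmetic shown earlier possible.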

Creating a Series from a dictionary

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)  # create a Series from a dictionary
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Creating a Series from a dictionary (cont.)

obj3.to_dict()  # convert back to a dictionary

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

Series: handling missing data

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)  # create a Series using the specified index
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Note

'California' is NaN (Not a Number): it is in the index but not in sdata. 'Utah' is excluded: it is in sdata but not in the index.

Detecting missing data

pd.isna(obj4)  # check for missing values (NaN); returns a boolean Series

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

Detecting missing data (cont.)

pd.notna(obj4)  # check for non-missing values; returns a boolean Series

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Note

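isna() and notna() detect missing and non-missing values, respectively. Both are also available as Series methods, e.g. obj4.isna().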

Series: automatic alignment

obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Series: automatic alignment (cont.)

obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Series: automatic alignment (cont.)

obj3 + obj4  # add two Series; values are aligned by index label

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Note

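Data alignment is automatic. NaN is introduced wherever a label is absent from one of the operands or its value is already missing.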
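When NaN for non-overlapping labels is not the desired result, the `add` method with `fill_value` is one alternative. A minimal sketch with two hypothetical Series:

```python
import pandas as pd

a = pd.Series({"Ohio": 1.0, "Texas": 2.0})
b = pd.Series({"Texas": 10.0, "Utah": 20.0})

plain = a + b                     # NaN for labels found in only one Series
filled = a.add(b, fill_value=0)   # treat a label missing from one side as 0

print(plain)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data; for counts or totals it usually is, for averages or prices it often is not.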

Series: the `name` attribute

obj4.name = 'population'  # set the name of the Series
obj4.index.name = 'state'  # set the name of the index
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

Note

Both a Series and its index can have a name.

Changing a Series index in place

obj

0    4
1    7
2   -5
3    3
dtype: int64

Changing a Series index in place (cont.)

obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']  # modify the index in place
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

Note

The index can be changed by assignment; the new index must have the same length as the Series.

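pandas data structures: DataFrame

A DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet.
It has rows and columns.
Each column can have a different type.
It has both a row index and a column index.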

Creating a DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)  # create a DataFrame from a dictionary of lists
frame

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

Note

A common way to create a DataFrame: a dictionary whose values are equal-length lists.

DataFrame: `head()` and `tail()`

frame.head()  # show the first 5 rows

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9

DataFrame: `head()` and `tail()` (cont.)

frame.tail()  # show the last 5 rows

	state	year	pop
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

DataFrame: specifying column order

pd.DataFrame(data, columns=['year', 'state', 'pop'])  # specify the column order

	year	state	pop
0	2000	Ohio	1.5
1	2001	Ohio	1.7
2	2002	Ohio	3.6
3	2001	Nevada	2.4
4	2002	Nevada	2.9
5	2003	Nevada	3.2

Note

The columns argument sets the column order.

DataFrame: missing data

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])  # 'debt' is a new column
frame2

	year	state	pop	debt
0	2000	Ohio	1.5	NaN
1	2001	Ohio	1.7	NaN
2	2002	Ohio	3.6	NaN
3	2001	Nevada	2.4	NaN
4	2002	Nevada	2.9	NaN
5	2003	Nevada	3.2	NaN

DataFrame: missing data (cont.)

frame2.columns  # show the column names

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Note

The 'debt' column holds missing values (NaN) because it is not in the original data.

Retrieving columns

frame2['state']  # retrieve the 'state' column (dict-like notation)

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

Retrieving columns (cont.)

frame2.year  # retrieve the 'year' column (attribute-style access)

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

Note

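Both forms return a Series. Attribute access only works when the column name is a valid Python identifier (no spaces, etc.) and does not clash with an existing DataFrame method or attribute.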

Retrieving rows

frame2.loc[1]  # access a row by label (the integer label 1)

year     2001
state    Ohio
pop       1.7
debt      NaN
Name: 1, dtype: object

Retrieving rows (cont.)

frame2.iloc[2]  # access a row by integer position (position 2)

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: 2, dtype: object

Note

loc is label-based; iloc is based on integer position. This distinction is crucial.
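With the default 0, 1, 2, ... index the two accessors happen to agree, which hides the difference. The sketch below uses a hypothetical frame whose integer labels are deliberately out of order so that the two diverge.

```python
import pandas as pd

df = pd.DataFrame({"value": ["x", "y", "z"]}, index=[2, 0, 1])

by_label = df.loc[0, "value"]   # the row whose label is 0
by_position = df.iloc[0, 0]     # the first row in storage order

print(by_label, by_position)    # prints: y x
```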

Modifying columns

frame2['debt'] = 16.5  # assign a scalar to the 'debt' column
frame2

	year	state	pop	debt
0	2000	Ohio	1.5	16.5
1	2001	Ohio	1.7	16.5
2	2002	Ohio	3.6	16.5
3	2001	Nevada	2.4	16.5
4	2002	Nevada	2.9	16.5
5	2003	Nevada	3.2	16.5

Modifying columns (cont.)

frame2['debt'] = np.arange(6.)  # assign a NumPy array to the 'debt' column
frame2

	year	state	pop	debt
0	2000	Ohio	1.5	0.0
1	2001	Ohio	1.7	1.0
2	2002	Ohio	3.6	2.0
3	2001	Nevada	2.4	3.0
4	2002	Nevada	2.9	4.0
5	2003	Nevada	3.2	5.0

Assigning a Series

val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])  # create a Series with a custom index
frame2['debt'] = val  # assign the Series to the 'debt' column
frame2

	year	state	pop	debt
0	2000	Ohio	1.5	NaN
1	2001	Ohio	1.7	NaN
2	2002	Ohio	3.6	-1.2
3	2001	Nevada	2.4	NaN
4	2002	Nevada	2.9	-1.5
5	2003	Nevada	3.2	-1.7

Note

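The labels are aligned! Each value in val is assigned to the matching index label in frame2, and rows without a match are filled with missing values (NaN).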
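The contrast with unlabeled values may help here: a Series is matched on labels, while a plain list or array must match the number of rows exactly. A small sketch with a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])

# A Series is aligned on labels, so order and missing labels do not matter
df["y"] = pd.Series([2.0, 1.0], index=["b", "a"])

# A list has no labels and must match the number of rows
try:
    df["z"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

print(df)
print(length_error)
```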

Creating a new column

frame2['eastern'] = frame2['state'] == 'Ohio'  # create a new boolean column 'eastern'
frame2

	year	state	pop	debt	eastern
0	2000	Ohio	1.5	NaN	True
1	2001	Ohio	1.7	NaN	True
2	2002	Ohio	3.6	-1.2	True
3	2001	Nevada	2.4	NaN	False
4	2002	Nevada	2.9	-1.5	False
5	2003	Nevada	3.2	-1.7	False

Note

Assigning to a column that does not exist creates a new column.

Deleting a column

del frame2['eastern']  # delete the 'eastern' column
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Note

The del keyword deletes a column in place.

Creating a DataFrame from a nested dictionary

populations = {'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
               'Nevada': {2001: 2.4, 2002: 2.9}}
frame3 = pd.DataFrame(populations)  # create a DataFrame from a nested dictionary
frame3

	Ohio	Nevada
2000	1.5	NaN
2001	1.7	2.4
2002	3.6	2.9

Note

The outer keys become the columns and the inner keys become the row index.

Transposing a DataFrame

frame3.T  # transpose the DataFrame (swap rows and columns)

	2000	2001	2002
Ohio	1.5	1.7	3.6
Nevada	NaN	2.4	2.9

Note

Transposing swaps the rows and columns.

DataFrame: `index.name` and `columns.name`

frame3.index.name = 'year'     # set the name of the row index
frame3.columns.name = 'state'  # set the name of the column index
frame3

state	Ohio	Nevada
year
2000	1.5	NaN
2001	1.7	2.4
2002	3.6	2.9

DataFrame: `to_numpy()`

frame3.to_numpy()  # convert the DataFrame to a NumPy array

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

Note

Returns the data as a two-dimensional NumPy array. If the columns have mixed types, the dtype is chosen to accommodate all of them, often object.

Index objects

obj = pd.Series(np.arange(3), index=['a', 'b', 'c'])
index = obj.index  # get the Index object
index

Index(['a', 'b', 'c'], dtype='object')

Index objects (cont.)

index[1:]  # slice the Index object (like a list)

Index(['b', 'c'], dtype='object')

Note

Index objects store the axis labels and metadata. They are immutable.

Index immutability

index[1] = 'd'  # raises TypeError: Index does not support mutable operations

Note

An Index object cannot be modified after it is created.
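Immutability applies to the Index object itself, not to the Series or DataFrame that carries it. The following sketch shows the failing item assignment next to two common ways of getting different labels; the Series `s` is a hypothetical example.

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=["a", "b", "c"])

try:
    s.index[1] = "d"          # Index objects do not support item assignment
    mutated = True
except TypeError:
    mutated = False

renamed = s.rename(index={"b": "d"})   # new Series carrying a new Index
s.index = ["x", "y", "z"]              # replace the whole Index in one step

print(mutated, list(renamed.index), list(s.index))
```

Because an Index cannot change underneath its owner, it can safely be shared between several objects.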

Index as a fixed-size set

frame3

state	Ohio	Nevada
year
2000	1.5	NaN
2001	1.7	2.4
2002	3.6	2.9

Index as a fixed-size set (cont.)

'Ohio' in frame3.columns  # check column membership

True

Index as a fixed-size set (cont.)

2003 in frame3.index  # check row-index membership

False

Reindexing

reindex creates a new object whose data is rearranged to conform to the new index.

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Reindexing (cont.)

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])  # reindex the Series
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

Note

reindex creates a new Series and introduces NaN for index labels that were not present before.

Reindexing: the `method` option for filling

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

Reindexing: the `method` option for filling (cont.)

obj3.reindex(np.arange(6), method='ffill')  # forward-fill the missing values

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

Note

ffill (forward fill) propagates the last valid value forward.
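For comparison, the sketch below reindexes the same kind of Series with both `ffill` and its counterpart `bfill` (backward fill). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

forward = colors.reindex(np.arange(6), method="ffill")   # carry the previous value forward
backward = colors.reindex(np.arange(6), method="bfill")  # pull the next value backward

print(forward)
print(backward)
```

With `bfill`, the last new label (5) has no later value to borrow from, so it stays NaN.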

Reindexing a DataFrame

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

	Ohio	Texas	California
a	0	1	2
c	3	4	5
d	6	7	8

Reindexing a DataFrame (cont.)

frame2 = frame.reindex(index=['a', 'b', 'c', 'd'])  # reindex the rows
frame2

	Ohio	Texas	California
a	0.0	1.0	2.0
b	NaN	NaN	NaN
c	3.0	4.0	5.0
d	6.0	7.0	8.0

Note

reindex on a DataFrame can change the row index, the columns, or both.
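A single call can reindex both axes at once. The sketch below also passes `fill_value=0`, an optional argument, so that new cells hold 0 rather than NaN and the integer dtype is preserved.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])

both = frame.reindex(index=["a", "b", "c", "d"],
                     columns=["Texas", "Utah", "California"],
                     fill_value=0)

print(both)
```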

Reindexing columns

states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)  # reindex the columns

	Texas	Utah	California
a	1	NaN	2
c	4	NaN	5
d	7	NaN	8

Dropping entries

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

Dropping entries (cont.)

new_obj = obj.drop('c')  # drop entry 'c' (creates a new Series)
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Dropping entries (cont.)

obj.drop(['d', 'c'])  # drop several entries

a    0.0
b    1.0
e    4.0
dtype: float64

Dropping from a DataFrame

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Dropping from a DataFrame (cont.)

data.drop(index=['Colorado', 'Ohio'])  # drop rows by index label

	one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15

Dropping from a DataFrame (cont.)

data.drop(columns=['two'])  # drop a column by name

	one	three	four
Ohio	0	2	3
Colorado	4	6	7
Utah	8	10	11
New York	12	14	15

Dropping from a DataFrame (cont.)

data.drop('two', axis=1)  # drop a column with axis=1 (same as above)

	one	three	four
Ohio	0	2	3
Colorado	4	6	7
Utah	8	10	11
New York	12	14	15

Indexing, selection, and filtering

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

Indexing, selection, and filtering (cont.)

obj['b']  # select by label

np.float64(1.0)

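Indexing, selection, and filtering (cont.)

obj.iloc[1]  # select by integer position (plain obj[1] is deprecated for this in recent pandas versions)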

np.float64(1.0)

Indexing, selection, and filtering (cont.)

obj[2:4]  # slice by position

c    2.0
d    3.0
dtype: float64

Indexing, selection, and filtering (cont.)

obj[['b', 'a', 'd']]  # select several labels

b    1.0
a    0.0
d    3.0
dtype: float64

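Indexing, selection, and filtering (cont.)

obj.iloc[[1, 3]]  # select several integer positions (plain obj[[1, 3]] is deprecated for this in recent pandas versions)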

b    1.0
d    3.0
dtype: float64

Indexing, selection, and filtering (cont.)

obj[obj < 2]  # boolean indexing

a    0.0
b    1.0
dtype: float64

Indexing with `loc` and `iloc`

loc: select by label.
iloc: select by integer position.

obj.loc[['b', 'a', 'd']]  # select by label with loc

b    1.0
a    0.0
d    3.0
dtype: float64

Indexing with `loc` and `iloc` (cont.)

obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

Indexing with `loc` and `iloc` (cont.)

obj1[[0, 1, 2]]  # with an integer index, [] selects by label

0    2
1    3
2    1
dtype: int64

Indexing with `loc` and `iloc` (cont.)

obj2.loc[[0, 1]]  # raises KeyError: the index holds strings, so 0 and 1 are not labels of obj2

Indexing with `loc` and `iloc` (cont.)

obj1.iloc[[0, 1, 2]]  # iloc always uses integer positions

2    1
0    2
1    3
dtype: int64

Indexing with `loc` and `iloc` (cont.)

obj2.iloc[[0, 1, 2]]  # iloc always uses integer positions

a    1
b    2
c    3
dtype: int64

Note

Using loc and iloc avoids ambiguity, especially with an integer index.

Slicing with labels (inclusive)

obj2.loc['b':'c']  # the endpoint ('c') is included!

b    2
c    3
dtype: int64

Slicing with labels (inclusive) (cont.)

obj2.loc['b':'c'] = 5  # assign through a label-based slice
obj2

a    1
b    5
c    5
dtype: int64

Note

Label slicing with loc includes the endpoint.
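Positional slicing with `iloc` follows ordinary Python rules and excludes the end position. The sketch below puts the two side by side on a small hypothetical Series.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = s.loc["b":"c"]    # both endpoints included: 'b' and 'c'
by_position = s.iloc[1:2]    # end excluded, like a Python list: 'b' only

print(list(by_label.index), list(by_position.index))
```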

Indexing into a DataFrame

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Indexing into a DataFrame (cont.)

data['two']  # select the column 'two'

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

Indexing into a DataFrame (cont.)

data[['three', 'one']]  # select several columns

	three	one
Ohio	2	0
Colorado	6	4
Utah	10	8
New York	14	12

DataFrame: special cases of indexing

data[:2]  # slice rows (selects rows 0 and 1)

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7

DataFrame: special cases of indexing (cont.)

data[data['three'] > 5]  # boolean indexing (select the rows where 'three' > 5)

	one	two	three	four
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

DataFrame: boolean indexing

data < 5  # element-wise comparison (returns a boolean DataFrame)

	one	two	three	four
Ohio	True	True	True	True
Colorado	True	False	False	False
Utah	False	False	False	False
New York	False	False	False	False

DataFrame: boolean indexing (cont.)

data[data < 5] = 0  # set the values less than 5 to 0
data

	one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Selecting on a DataFrame with `loc` and `iloc`

data.loc['Colorado']  # select one row by label

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

Selecting on a DataFrame with `loc` and `iloc` (cont.)

data.loc[['Colorado', 'New York']]  # select several rows by label

	one	two	three	four
Colorado	0	5	6	7
New York	12	13	14	15

Selecting on a DataFrame with `loc` and `iloc` (cont.)

data.loc['Colorado', ['two', 'three']]  # select a row and columns by label

two      5
three    6
Name: Colorado, dtype: int64

Selecting on a DataFrame with `loc` and `iloc` (cont.)

data.iloc[2]  # select one row by position

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

Selecting on a DataFrame with `loc` and `iloc` (cont.)

data.iloc[[2, 1]]  # select several rows by position

	one	two	three	four
Utah	8	9	10	11
Colorado	0	5	6	7

Selecting on a DataFrame with `loc` and `iloc` (cont.)

data.iloc[2, [3, 0, 1]]  # select a row and columns by position

four    11
one      8
two      9
Name: Utah, dtype: int64

Selecting on a DataFrame with `loc` and `iloc` (cont.)

data.iloc[[1, 2], [3, 0, 1]]  # select rows and columns by position

	four	one	two
Colorado	7	0	5
Utah	11	8	9

Arithmetic and data alignment

When performing arithmetic, data is aligned by index label.
NaN is used where the labels do not overlap.

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

Arithmetic and data alignment (cont.)

s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

Arithmetic and data alignment (cont.)

s1 + s2  # add the Series; aligned by index label

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

Arithmetic with DataFrames

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

Arithmetic with DataFrames (cont.)

df1

	b	c	d
Ohio	0.0	1.0	2.0
Texas	3.0	4.0	5.0
Colorado	6.0	7.0	8.0

Arithmetic with DataFrames (cont.)

df2

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

Arithmetic with DataFrames (cont.)

df1 + df2  # add the DataFrames; aligned on both row and column labels

	b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	3.0	NaN	6.0	NaN
Oregon	NaN	NaN	NaN	NaN
Texas	9.0	NaN	12.0	NaN
Utah	NaN	NaN	NaN	NaN

Arithmetic methods with fill values

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))
df2.loc[1, 'b'] = np.nan  # introduce a missing value

Arithmetic methods with fill values (cont.)

df1

	a	b	c	d
0	0.0	1.0	2.0	3.0
1	4.0	5.0	6.0	7.0
2	8.0	9.0	10.0	11.0

Arithmetic methods with fill values (cont.)

df2

	a	b	c	d	e
0	0.0	1.0	2.0	3.0	4.0
1	5.0	NaN	7.0	8.0	9.0
2	10.0	11.0	12.0	13.0	14.0
3	15.0	16.0	17.0	18.0	19.0

Arithmetic methods with fill values (cont.)

df1 + df2  # plain addition; NaN appears where labels do not overlap or a value is missing

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	NaN
1	9.0	NaN	13.0	15.0	NaN
2	18.0	20.0	22.0	24.0	NaN
3	NaN	NaN	NaN	NaN	NaN

Note

NaN denotes a missing value.

Using `add` with `fill_value`

df1.add(df2, fill_value=0)  # add, substituting 0 for a value missing from one operand *before* adding

	a	b	c	d	e
0	0.0	2.0	4.0	6.0	4.0
1	9.0	5.0	13.0	15.0	9.0
2	18.0	20.0	22.0	24.0	14.0
3	15.0	16.0	17.0	18.0	19.0

Note

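fill_value replaces a missing value before the operation; if the value is missing in both operands, the result is still NaN.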

Flexible arithmetic methods

Method	Description
`add, radd`	Addition (+)
`sub, rsub`	Subtraction (-)
`div, rdiv`	Division (/)
`mul, rmul`	Multiplication (*)

Note

The r methods reverse the arguments (e.g., 1 / df1 is equivalent to df1.rdiv(1)).
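The equivalence can be checked directly. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [4.0, 8.0]})

left = 1 / df
right = df.rdiv(1)        # reversed division: 1 / df
same = left.equals(right)

from_ten = df.rsub(10)    # reversed subtraction: 10 - df

print(same)
print(from_ten)
```

The method forms matter mostly when extra arguments such as `fill_value` or `axis` are needed, since the operators cannot take them.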

Operations between a DataFrame and a Series

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]  # take the first row

Operations between a DataFrame and a Series (cont.)

frame

	b	d	e
Utah	0.0	1.0	2.0
Ohio	3.0	4.0	5.0
Texas	6.0	7.0	8.0
Oregon	9.0	10.0	11.0

Operations between a DataFrame and a Series (cont.)

series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

Operations between a DataFrame and a Series (cont.)

frame - series  # subtract the Series from the DataFrame (broadcasting)

	b	d	e
Utah	0.0	0.0	0.0
Ohio	3.0	3.0	3.0
Texas	6.0	6.0	6.0
Oregon	9.0	9.0	9.0

Note

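By default, arithmetic between a DataFrame and a Series matches the Series index against the DataFrame's columns and broadcasts down the rows.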

Broadcasting over the columns

series3 = frame['d']  # take the 'd' column
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

Broadcasting over the columns (cont.)

frame.sub(series3, axis='index')  # match on the row index, broadcast across the columns

	b	d	e
Utah	-1.0	0.0	1.0
Ohio	-1.0	0.0	1.0
Texas	-1.0	0.0	1.0
Oregon	-1.0	0.0	1.0

Note

axis='index' (or axis=0) matches on the row index and broadcasts across the columns.
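A typical use of this pattern is centering each row on its own mean. The sketch below rebuilds a frame like the one above and is meant only as an illustration of `axis='index'` matching.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

row_means = frame.mean(axis="columns")           # one mean per row
centered = frame.sub(row_means, axis="index")    # match on the row labels

print(centered)
```

Without `axis="index"`, pandas would try to match the row labels of `row_means` against the column labels of `frame`, find no overlap, and return only NaN.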

Function application and mapping

frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

	b	d	e
Utah	0.343849	0.126635	-1.515288
Ohio	-0.374208	0.163970	0.764276
Texas	1.091432	0.770136	0.772678
Oregon	-0.327794	-0.451811	-0.821136

Function application and mapping (cont.)

np.abs(frame)  # apply NumPy's absolute-value function (a ufunc) element-wise

	b	d	e
Utah	0.343849	0.126635	1.515288
Ohio	0.374208	0.163970	0.764276
Texas	1.091432	0.770136	0.772678
Oregon	0.327794	0.451811	0.821136

Applying a function with `apply`

def f1(x):
    return x.max() - x.min()  # x is a Series; return its range

frame.apply(f1)  # apply the function to each column (the default axis is 'index')

b    1.465640
d    1.221947
e    2.287966
dtype: float64

Applying a function with `apply` (cont.)

frame.apply(f1, axis='columns')  # apply the function to each row (axis='columns')

Utah      1.859137
Ohio      1.138485
Texas     0.321296
Oregon    0.493343
dtype: float64

Applying a function that returns a Series

def f2(x):
    # return a Series holding the 'min' and 'max' of x
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f2)

	b	d	e
min	-0.374208	-0.451811	-1.515288
max	1.091432	0.770136	0.772678

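Element-wise formatting with `map` (formerly `applymap`)

def my_format(x):
    return f"{x:.2f}"  # format a number with two decimal places

frame.map(my_format)  # apply element-wise to the DataFrame (pandas >= 2.1; use frame.applymap on older versions)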

	b	d	e
Utah	0.34	0.13	-1.52
Ohio	-0.37	0.16	0.76
Texas	1.09	0.77	0.77
Oregon	-0.33	-0.45	-0.82

Note

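DataFrame.map (named applymap before pandas 2.1, where the old name is deprecated) applies a function element-wise to a DataFrame; Series.map does the same for a Series.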
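Since the DataFrame method was renamed between pandas versions, a version-tolerant call can be sketched as follows. The frame and the `elementwise` helper name are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.234, -0.5], "b": [10.0, 2.718]},
                     index=["r1", "r2"])

def my_format(x):
    return f"{x:.2f}"

# DataFrame.map exists from pandas 2.1 on; fall back to applymap before that
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
formatted = elementwise(my_format)

col_formatted = frame["a"].map(my_format)   # Series.map is element-wise too

print(formatted)
print(col_formatted)
```

Both results hold strings, not numbers, so this kind of formatting is best left to the final display step.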

Sorting

obj = pd.Series(np.arange(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()  # sort by index label

a    1
b    2
c    3
d    0
dtype: int64

Sorting (cont.)

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()  # sort by row index

	d	a	b	c
one	4	5	6	7
three	0	1	2	3

Sorting (cont.)

frame.sort_index(axis='columns')  # sort by column label

	a	b	c	d
three	1	2	3	0
one	5	6	7	4

Sorting (cont.)

frame.sort_index(axis='columns', ascending=False)  # sort in descending order

	d	c	b	a
three	0	3	2	1
one	4	7	6	5

Sorting by value

obj = pd.Series([4, 7, -3, 2])
obj.sort_values()  # sort by value

2   -3
3    2
0    4
1    7
dtype: int64

Sorting by value (cont.)

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()  # missing values (NaN) are sorted to the end by default

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

Sorting a DataFrame by column

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

	b	a
0	4	0
1	7	1
2	-3	0
3	2	1

Sorting a DataFrame by column (cont.)

frame.sort_values('b')  # sort by column 'b'

	b	a
2	-3	0
3	2	1
0	4	0
1	7	1

Sorting a DataFrame by column (cont.)

frame.sort_values(['a', 'b'])  # sort by several columns ('a' first, then 'b')

	b	a
2	-3	0
0	4	0
3	2	1
1	7	1

Ranking

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()  # assign ranks (ties receive the average rank)

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Note

Ranking assigns ranks from 1 up to the number of valid data points.

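Ranking methods

Method	Description
`average`	Default: ties receive the average rank of the group
`min`	Ties receive the lowest rank of the group
`max`	Ties receive the highest rank of the group
`first`	Ties are ranked in the order the values appear in the data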
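To see how the tie-breaking methods in the table differ, the sketch below ranks the same Series with each of them and collects the results side by side. The `ranks` frame is just a convenient container for the comparison.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

ranks = pd.DataFrame({m: obj.rank(method=m)
                      for m in ["average", "min", "max", "first"]})

print(ranks)
```

Only the tied values (the two 7s and the two 4s) receive different ranks across the columns; all other rows are identical.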

Axis indexes with duplicate labels

obj = pd.Series(np.arange(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

Axis indexes with duplicate labels (cont.)

obj.index.is_unique  # check whether the index labels are unique

False

Axis indexes with duplicate labels (cont.)

obj['a']  # returns a Series (because 'a' is duplicated)

a    0
a    1
dtype: int64

Axis indexes with duplicate labels (cont.)

obj['c']  # returns a scalar (because 'c' is unique)

np.int64(4)

Summarizing and computing descriptive statistics

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

	one	two
a	1.40	NaN
b	7.10	-4.5
c	NaN	NaN
d	0.75	-1.3

Summarizing and computing descriptive statistics (cont.)

df.sum()  # compute the column sums

one    9.25
two   -5.80
dtype: float64

Summarizing and computing descriptive statistics (cont.)

df.sum(axis='columns')  # compute the row sums

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

Handling missing values in reductions

df.sum(axis='index', skipna=False)  # do not skip NaN: any NaN in a column makes its sum NaN

one   NaN
two   NaN
dtype: float64

Handling missing values in reductions (cont.)

df.mean(axis='columns')  # compute the row means, excluding NaN (the default)

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

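Descriptive statistics: options for reduction methods

Option	Description
`axis`	Axis to reduce over ('index' reduces down the rows, giving one result per column; 'columns' reduces across the columns, giving one result per row)
`skipna`	Exclude missing values (True by default)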

Indirect statistics

df.idxmax()  # index label of the maximum value (for each column)

one    b
two    d
dtype: object

Indirect statistics (cont.)

df.cumsum()  # cumulative sum (for each column)

	one	two
a	1.40	NaN
b	8.50	-4.5
c	NaN	NaN
d	9.25	-5.8

The `describe()` method

df.describe()  # produce descriptive statistics

	one	two
count	3.000000	2.000000
mean	3.083333	-2.900000
std	3.493685	2.262742
min	0.750000	-4.500000
25%	1.075000	-3.700000
50%	1.400000	-2.900000
75%	4.250000	-2.100000
max	7.100000	-1.300000

The `describe()` method (cont.)

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()  # descriptive statistics for non-numeric data

count     16
unique     3
top        a
freq       8
dtype: object

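Descriptive and summary statistics

Method	Description
`count`	Number of non-NA values
`describe`	Compute a set of summary statistics
`min, max`	Compute the minimum and maximum values
`idxmin, idxmax`	Compute the index labels at which the minimum/maximum is attained
`quantile`	Compute a sample quantile (between 0 and 1)
`sum`	Sum of the values
`mean`	Mean of the values
`median`	Median (the 50% quantile)
…	…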

DataFrame 上的 `corr` 和 `cov`

returns.corr()  # 相关矩阵

	AAPL	GOOG	IBM	MSFT
AAPL	1.000000	0.407919	0.386817	0.389695
GOOG	0.407919	1.000000	0.405099	0.465919
IBM	0.386817	0.405099	1.000000	0.499764
MSFT	0.389695	0.465919	0.499764	1.000000

`corr` and `cov` on a DataFrame (cont.)

returns.cov()  # covariance matrix

	AAPL	GOOG	IBM	MSFT
AAPL	0.000277	0.000107	0.000078	0.000095
GOOG	0.000107	0.000251	0.000078	0.000108
IBM	0.000078	0.000078	0.000146	0.000089
MSFT	0.000095	0.000108	0.000089	0.000215

The `corrwith` method

returns.corrwith(returns['IBM'])  # pairwise correlation with the IBM returns

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

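The `corrwith` method (cont.)

returns.corrwith(volume)  # pairwise correlation with trading volume, matched by column name (`volume` is not constructed here)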

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64

Note

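corrwith computes pairwise correlations between the columns (or rows) of a DataFrame and another Series or DataFrame.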
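Because the stock data behind these slides is not constructed here, the following self-contained sketch uses synthetic data instead. The tickers `AAA`, `BBB`, and `CCC`, the random seed, and the scale factors are all made up for illustration and do not correspond to the figures above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.standard_normal(250)   # shared component, a stand-in for "the market"

# Synthetic daily returns for three hypothetical tickers
synthetic_returns = pd.DataFrame({
    "AAA": 0.010 * base + 0.005 * rng.standard_normal(250),
    "BBB": 0.008 * base + 0.010 * rng.standard_normal(250),
    "CCC": 0.020 * rng.standard_normal(250),
})

corr = synthetic_returns.corr()   # pairwise correlation matrix
cov = synthetic_returns.cov()     # pairwise covariance matrix
with_aaa = synthetic_returns.corrwith(synthetic_returns["AAA"])  # each column vs. one Series

print(corr)
print(with_aaa)
```

Passing a Series to `corrwith` reproduces one column of the full correlation matrix; passing a DataFrame correlates columns with matching names.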

Unique values, value counts, and membership

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()  # get the unique values
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

Unique values, value counts, and membership (cont.)

obj.value_counts()  # count the occurrences of each value

c    3
a    3
b    2
d    1
Name: count, dtype: int64

Unique values, value counts, and membership (cont.)

obj.value_counts(sort=False)  # do not sort by count (the top-level pd.value_counts is deprecated in recent pandas versions)

c    3
a    3
d    1
b    2
Name: count, dtype: int64

The `isin` method

obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

The `isin` method (cont.)

mask = obj.isin(['b', 'c'])  # test membership in ['b', 'c']
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

The `isin` method (cont.)

obj[mask]  # select the elements where mask is True

0    c
5    b
6    b
7    c
8    c
dtype: object

Note

isin tests each element for membership in the given values.

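Summary

We covered the basic pandas data structures: Series and DataFrame.
We practiced the key operations: indexing, selection, filtering, arithmetic, function application, sorting, ranking, and descriptive statistics.
We saw how to handle missing data.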

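Questions for discussion

Compared with other data analysis tools (e.g., Excel, R), what are the strengths and weaknesses of pandas?
Where is pandas applied in real-world work?
What are the limitations of pandas, and when should you choose another tool?
Why are data alignment and broadcasting so important in pandas?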

Sample rows of the `returns` DataFrame used in the `corr` and `cov` examples:

	AAPL	GOOG	IBM	MSFT
Date
2016-10-17	-0.000680	0.001837	0.002072	-0.003483
2016-10-18	-0.000681	0.019616	-0.026168	0.007690
2016-10-19	-0.002979	0.007846	0.003583	-0.002255
2016-10-20	-0.000512	-0.005652	0.001719	-0.004867
2016-10-21	-0.003930	0.003011	-0.012474	0.042096
