Pandas Basics (II)

By 1152-张同学

1. Pandas Indexing

#%%
import pandas as pd
import numpy as np
#%%
s = pd.Series(np.random.randn(5), index=list("abcde"))
s
#%%
s.index
#%%
s.index.name = "alpha"
s
#%%
df = pd.DataFrame(np.random.randn(4,3), columns=["one", "two", "three"])
df
#%%
df.index
#%%
df.columns
#%%
df.index.name = "row"
df.columns.name = "col"
df
#%%
# Create a Series with duplicate index labels
s = pd.Series(np.arange(6), index=list("abcbad"))
s
#%%
# Selecting a duplicated label returns all values for that label
s.a
#%%
# Check whether the index labels are unique
s.index.is_unique
#%%
# Drop duplicates, returning the unique index labels
s.index.unique()
#%%
s.groupby(s.index).mean()
#%%
# MultiIndex: representing higher-dimensional data in two dimensions
a = [['a', 'a', 'a', 'b', 'b', 'c', 'c'], [1, 2, 3, 1, 2, 2, 3]]
t = list(zip(*a))
t
#%%
index = pd.MultiIndex.from_tuples(t, names=["level1", "level2"])
index
#%%
s = pd.Series(np.random.randn(7), index=index)
s
#%%
s["b"]
#%%
s["b":"c"]
#%%
s[["a","c"]]
#%%
s[:,2]
#%%
df = pd.DataFrame(np.random.randint(1, 10, (4, 3)), 
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], 
                  columns=[['one', 'one', 'two'], ['blue', 'red', 'blue']])
df.index.names = ['row-1', 'row-2']
df.columns.names = ['col-1', 'col-2']
df
#%%
df.loc["a"]
#%%
type(df.loc["a"])
#%%
df.loc["a",1]
#%%
# Swap index levels and sort
df2 = df.swaplevel("row-1", "row-2")
df2
#%%
# Sum over the first (outer) index level
# (newer pandas removed sum(level=...); use df.groupby(level=0).sum())
df.sum(level=0)
#%%
# Sum over the second (inner) index level (in newer pandas: df.groupby(level=1).sum())
df.sum(level=1)
#%%
# Converting between index and columns
df = pd.DataFrame({
        'a': range(7),
        'b': range(7, 0, -1),
        'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
        'd': [0, 1, 2, 0, 1, 2, 3]
    })
df
#%%
# Set columns c and d as the index
df2 = df.set_index(["c","d"])
df2
#%%
df2.reset_index().sort_index(axis="columns")
#%%
df2.reset_index().sort_index(axis="columns") == df

2. Grouped Computation

Three steps: split -> apply -> combine

  • Split: what do we group the data by?
  • Apply: what computation runs on each group?
  • Combine: merge the per-group results into the final output.
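The three steps can be sketched with a tiny hypothetical dataset (the `sales` frame below is made up purely for illustration):

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame({
    "city": ["BJ", "SH", "BJ", "SH"],
    "amount": [10, 20, 30, 40],
})

# Split by city, apply sum() to each group, combine into one Series.
totals = sales.groupby("city")["amount"].sum()
print(totals)
```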

In [1]:

import pandas as pd
import numpy as np

In [2]:

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                  'key2': ['one', 'two', 'one', 'two', 'one'],
                  'data1': np.random.randint(1, 10, 5),
                  'data2': np.random.randint(1, 10, 5)})
df

Out[2]:

key1 key2 data1 data2
0 a one 2 8
1 a two 5 1
2 b one 1 2
3 b two 4 4
4 a one 3 3

In [3]:

df["data1"].groupby(df["key1"]).mean()

Out[3]:

key1
a    3.333333
b    2.500000
Name: data1, dtype: float64

In [5]:

key = [1,2,1,1,2]
df["data1"].groupby(key).mean()

Out[5]:

1    2.333333
2    4.000000
Name: data1, dtype: float64

In [7]:

df["data1"].groupby([df["key1"], df["key2"]]).sum()

Out[7]:

key1  key2
a     one     5
      two     5
b     one     1
      two     4
Name: data1, dtype: int32

In [10]:

mean = df.groupby(["key1", "key2"]).sum()["data1"]

In [11]:

mean.unstack()

Out[11]:

key2 one two
key1
a 5 5
b 1 4

In [14]:

for name, group in df.groupby("key1"):
    print(name)
    print(group)
a
  key1 key2  data1  data2
0    a  one      2      8
1    a  two      5      1
4    a  one      3      3
b
  key1 key2  data1  data2
2    b  one      1      2
3    b  two      4      4

In [15]:

dict(list(df.groupby("key1")))

Out[15]:

{'a':   key1 key2  data1  data2
 0    a  one      2      8
 1    a  two      5      1
 4    a  one      3      3, 'b':   key1 key2  data1  data2
 2    b  one      1      2
 3    b  two      4      4}

In [16]:

dict(list(df.groupby("key1")))["a"]

Out[16]:

key1 key2 data1 data2
0 a one 2 8
1 a two 5 1
4 a one 3 3

Grouping by columns

In [18]:

df.dtypes

Out[18]:

key1     object
key2     object
data1     int32
data2     int32
dtype: object

In [22]:

df.groupby(df.dtypes, axis=1).sum()

Out[22]:

int32 object
0 10 aone
1 6 atwo
2 3 bone
3 8 btwo
4 6 aone

Grouping with a dict

In [23]:

df = pd.DataFrame(np.random.randint(1, 10, (5, 5)), 
                  columns=['a', 'b', 'c', 'd', 'e'], 
                  index=['Alice', 'Bob', 'Candy', 'Dark', 'Emily'])
df

Out[23]:

a b c d e
Alice 9 7 8 7 3
Bob 3 3 2 1 5
Candy 3 5 4 1 8
Dark 8 7 8 7 9
Emily 1 6 5 8 7

In [25]:

df.iloc[1, 1:3] = np.NaN
df

Out[25]:

a b c d e
Alice 9 7.0 8.0 7 3
Bob 3 NaN NaN 1 5
Candy 3 5.0 4.0 1 8
Dark 8 7.0 8.0 7 9
Emily 1 6.0 5.0 8 7

In [26]:

mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'orange', 'e': 'blue'}
grouped = df.groupby(mapping, axis=1)

In [27]:

grouped.sum()

Out[27]:

blue orange red
Alice 11.0 7.0 16.0
Bob 5.0 1.0 3.0
Candy 12.0 1.0 8.0
Dark 17.0 7.0 15.0
Emily 12.0 8.0 7.0

In [28]:

grouped.size()

Out[28]:

blue      2
orange    1
red       2
dtype: int64

In [29]:

grouped.count()

Out[29]:

blue orange red
Alice 2 1 2
Bob 1 1 1
Candy 2 1 2
Dark 2 1 2
Emily 2 1 2

Grouping with a function

In [30]:

df = pd.DataFrame(np.random.randint(1, 10, (5, 5)), 
                  columns=['a', 'b', 'c', 'd', 'e'], 
                  index=['Alice', 'Bob', 'Candy', 'Dark', 'Emily'])
df

Out[30]:

a b c d e
Alice 3 4 2 9 5
Bob 7 5 4 9 3
Candy 5 5 4 5 2
Dark 7 6 6 7 2
Emily 6 3 6 9 6

In [32]:

# Group by the return value of a function applied to each index label
def _group_key(idx):
    print(idx)
    return idx

df.groupby(_group_key).size()
Alice
Bob
Candy
Dark
Emily

Out[32]:

Alice    1
Bob      1
Candy    1
Dark     1
Emily    1
dtype: int64

Grouping multi-indexed data by index level

In [33]:

columns = pd.MultiIndex.from_arrays([['China', 'USA', 'China', 'USA', 'China'],
                                     ['A', 'A', 'B', 'C', 'B']], names=['country', 'index'])
df = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
df

Out[33]:

country China USA China USA China
index A A B C B
0 7 2 9 3 2
1 1 9 4 6 3
2 3 3 7 1 5
3 8 4 3 1 9
4 7 4 5 3 3

In [35]:

df.groupby(level="country", axis=1).sum()

Out[35]:

country China USA
0 18 5
1 8 15
2 15 4
3 20 5
4 15 7

3. Aggregation

I. Data Aggregation

A grouped computation first splits the data according to some rule, then runs an aggregation over each group; the earlier `mean()` and `sum()` calls are examples of aggregation. During aggregation, each group's data is passed to the aggregation function in turn, and the per-group results are finally combined into the output.

Besides built-in aggregation functions such as `sum()`, `min()`, `max()`, and `mean()`, you can also define your own. Custom aggregation functions are applied with `agg()` (alias `aggregate()`).
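As a minimal sketch with made-up data, a custom function passed to `agg()` receives each group's column as a Series and should return one scalar per group:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 5.0, 3.0]})

def peak_range(s):
    # s is one group's "val" column; return a single aggregated value.
    return s.max() - s.min()

result = df.groupby("key")["val"].agg(peak_range)
```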

1. Built-in aggregation functions

In [2]:

import pandas as pd
import numpy as np

In [3]:

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                  'key2': ['one', 'two', 'one', 'two', 'one'],
                  'data1': np.random.randint(1, 10, 5),
                  'data2': np.random.randint(1, 10, 5)})
df

Out[3]:

key1 key2 data1 data2
0 a one 1 9
1 a two 1 4
2 b one 1 9
3 b two 7 7
4 a one 5 9

In [31]:

df.groupby("key1")["data1"].mean()

Out[31]:

key1
a    4.333333
b    5.500000
Name: data1, dtype: float64

2. Custom aggregation functions

In [13]:

grouped = df.groupby("key1")
grouped

Out[13]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002112D144160>

In [14]:

def peak_range(s):
    print(type(s))
    return s.max() - s.min()

grouped.agg(peak_range)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

Out[14]:

data1 data2
key1
a 4 5
b 6 2

3. Applying multiple aggregation functions

In [15]:

# In the tuple ("range", peak_range), "range" is the output column name and peak_range is the custom aggregation function defined above
grouped.agg(["std", "mean", "sum", ("range", peak_range)])
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

Out[15]:

data1 data2
std mean sum range std mean sum range
key1
a 2.309401 2.333333 7 4 2.886751 7.333333 22 5
b 4.242641 4.000000 8 6 1.414214 8.000000 16 2

4. Applying different aggregation functions to different columns

In [22]:

# Implemented by passing a dict mapping column -> function(s)
d = {"data1":["mean",("range", peak_range)],
    "data2":"sum"}
grouped.agg(d)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

Out[22]:

data1 data2
mean range sum
key1
a 2.333333 4 22
b 4.000000 6 16

5. Resetting the index

In [19]:

grouped.agg(d).reset_index()
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

Out[19]:

key1 data1 data2
mean range sum
0 a 2.333333 4 22
1 b 4.000000 6 16

In [20]:

# as_index=False keeps key1 as a regular column rather than the index
df.groupby("key1", as_index=False).agg(d)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

Out[20]:

key1 data1 data2
mean range sum
0 a 2.333333 4 22
1 b 4.000000 6 16

II. Grouped Computation and Transformation

Aggregation is just one kind of grouped computation; the general pattern is "split - apply - combine". This section introduces `transform()` and `apply()`, which implement more general grouped computations.
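A quick sketch of the difference, using made-up data: `transform()` returns a result aligned with the original index, while `apply()` collects one result per group:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 3.0, 10.0]})

# transform broadcasts each group's mean back to the original row positions.
group_mean = df.groupby("key")["val"].transform("mean")

# apply runs the function once per group and collects the results.
group_size = df.groupby("key")["val"].apply(len)
```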

In [36]:

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                  'key2': ['one', 'two', 'one', 'two', 'one'],
                  'data1': np.random.randint(1, 10, 5),
                  'data2': np.random.randint(1, 10, 5)})
df

Out[36]:

key1 key2 data1 data2
0 a one 6 4
1 a two 2 4
2 b one 3 8
3 b two 4 3
4 a one 5 2

In [40]:

# Goal: add two columns holding the per-key1 group means of data1 and data2
k1_mean = df.groupby("key1").mean().add_prefix("mean_")
k1_mean

Out[40]:

mean_data1 mean_data2
key1
a 4.333333 3.333333
b 3.500000 5.500000

In [41]:

pd.merge(df, k1_mean, left_on="key1", right_index=True)

Out[41]:

key1 key2 data1 data2 mean_data1 mean_data2
0 a one 6 4 4.333333 3.333333
1 a two 2 4 4.333333 3.333333
4 a one 5 2 4.333333 3.333333
2 b one 3 8 3.500000 5.500000
3 b two 4 3 3.500000 5.500000

1.transform

Implementing the same requirement with transform

In [44]:

k1_mean = df.groupby("key1").transform(np.mean).add_prefix("mean_")
k1_mean

Out[44]:

mean_data1 mean_data2
0 4.333333 3.333333
1 4.333333 3.333333
2 3.500000 5.500000
3 3.500000 5.500000
4 4.333333 3.333333

In [45]:

df[k1_mean.columns] = k1_mean
df

Out[45]:

key1 key2 data1 data2 mean_data1 mean_data2
0 a one 6 4 4.333333 3.333333
1 a two 2 4 4.333333 3.333333
2 b one 3 8 3.500000 5.500000
3 b two 4 3 3.500000 5.500000
4 a one 5 2 4.333333 3.333333

2. Demeaning

Deviation from the group mean
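With deterministic made-up values, demeaning via `transform` looks like this; afterwards each group's mean is exactly zero:

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 10.0, 20.0])
key = ["one", "one", "two", "two"]

# Subtract the group mean from every element; the index is preserved.
demeaned = s.groupby(key).transform(lambda g: g - g.mean())
```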

In [46]:

df = pd.DataFrame(np.random.randint(1, 10, (5, 5)), 
                  columns=['a', 'b', 'c', 'd', 'e'], 
                  index=['Alice', 'Bob', 'Candy', 'Dark', 'Emily'])
df

Out[46]:

a b c d e
Alice 4 6 1 6 7
Bob 8 6 4 5 6
Candy 8 2 7 6 2
Dark 1 9 4 6 5
Emily 1 3 9 2 3

In [55]:

key = ['one', 'one', 'two', 'one', 'two']
def demean(s):
    return s - s.mean()

demeaned = df.groupby(key).transform(demean)
demeaned

Out[55]:

a b c d e
Alice -0.333333 -1.0 -2.0 0.333333 1.0
Bob 3.666667 -1.0 1.0 -0.666667 0.0
Candy 3.500000 -0.5 -1.0 2.000000 -0.5
Dark -3.333333 2.0 1.0 0.333333 -1.0
Emily -3.500000 0.5 1.0 -2.000000 0.5

In [56]:

demeaned.groupby(key).mean()

Out[56]:

a b c d e
one 2.960595e-16 0.0 0.0 -2.960595e-16 0.0
two 0.000000e+00 0.0 0.0 0.000000e+00 0.0

III. The apply Function

We saw earlier that DataFrame's apply function processes data row by row or column by column. GroupBy's apply function runs a computation on each group.

In [62]:

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'a', 'a', 'b', 'b', 'a'],
                  'key2': ['one', 'two', 'one', 'two', 'one', 'one', 'two', 'one', 'two', 'one'],
                  'data1': np.random.randint(1, 10, 10),
                  'data2': np.random.randint(1, 10, 10)})
df

Out[62]:

key1 key2 data1 data2
0 a one 9 5
1 a two 4 6
2 b one 7 7
3 b two 7 8
4 a one 5 1
5 a one 3 1
6 a two 2 4
7 b one 9 1
8 b two 4 4
9 a one 1 4

In [64]:

# Sort by `column` and return the largest n rows
def top(g, n=2, column="data1"):
    return g.sort_values(by=column, ascending=False)[:n]

top(df, n=3)

Out[64]:

key1 key2 data1 data2
0 a one 9 5
7 b one 9 1
2 b one 7 7

In [65]:

df.groupby("key1").apply(top)

Out[65]:

key1 key2 data1 data2
key1
a 0 a one 9 5
4 a one 5 1
b 7 b one 9 1
2 b one 7 7

In [66]:

df.groupby("key1").apply(top, n=3, column="data2")

Out[66]:

key1 key2 data1 data2
key1
a 1 a two 4 6
0 a one 9 5
6 a two 2 4
b 3 b two 7 8
2 b one 7 7
8 b two 4 4

apply example: filling missing data with per-group means

In [69]:

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

Out[69]:

Ohio         -0.401553
New York      1.728464
Vermont            NaN
Florida       0.697343
Oregon       -0.122013
Nevada             NaN
California   -0.319732
Idaho              NaN
dtype: float64

In [70]:

data.groupby(group_key).mean()

Out[70]:

East    0.674751
West   -0.220873
dtype: float64

In [76]:

data.groupby(group_key).apply(lambda g: g.fillna(g.mean()))

Out[76]:

Ohio         -0.401553
New York      1.728464
Vermont       0.674751
Florida       0.697343
Oregon       -0.122013
Nevada       -0.220873
California   -0.319732
Idaho        -0.220873
dtype: float64

4. Data IO

Loading data into Pandas

  • Indexing: read one or more columns into the DataFrame, including deciding whether the index and the column names come from the file
  • Type inference and conversion: including user-defined conversions and missing-value markers
  • Date parsing
  • Iteration: chunked iteration over large files; this is the biggest difference between Pandas and Python's built-in csv module
  • Messy data: skipping rows, handling comments, and so on
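The file paths used below are local to the original author's machine; an `io.StringIO` object can stand in for a file on disk, which makes a self-contained sketch of `read_csv` possible:

```python
import pandas as pd
from io import StringIO

# In-memory "file" with a header row, standing in for an ex1.csv-style file.
csv_text = "a,b,message\n1,2,hello\n3,4,world\n"
df = pd.read_csv(StringIO(csv_text))
```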

In [1]:

import pandas as pd
import numpy as np

In [3]:

fpath = r"D:/WinterIsComing/python/New_Wave/Machine_Learning/Pandas_others/Pandas 教程_源码/pandas_tutor-master/data/ex1.csv"
df = pd.read_csv(fpath)
df

Out[3]:

a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

In [9]:

pd.read_table(fpath, sep=",")

Out[9]:

a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

In [17]:

# Specify the column names and the row index
fpath = r"D:/WinterIsComing/python/New_Wave/Machine_Learning/Pandas_others/Pandas 教程_源码/pandas_tutor-master/data/ex2.csv"
df = pd.read_csv(fpath, header=None, names=["a", "b", "c", "d", "message"], index_col=["message", "b"])
df

Out[17]:

a c d
message b
hello 2 1 3 4
world 6 5 7 8
foo 10 9 11 12

Handling irregular delimiters

In [18]:

# Use a regular expression to match the varying whitespace (one or more spaces) between columns
fpath = r"D:/WinterIsComing/python/New_Wave/Machine_Learning/Pandas_others/Pandas 教程_源码/pandas_tutor-master/data/ex3.csv"
pd.read_csv(fpath, sep=r"\s+")

Out[18]:

A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

Handling missing values

In [43]:

# Default handling of missing values on read
fpath = r"D:/WinterIsComing/python/New_Wave/Machine_Learning/Pandas_others/Pandas 教程_源码/pandas_tutor-master/data/ex5.csv"
pd.read_csv(fpath)

Out[43]:

something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 foo

In [44]:

# pandas automatically treats empty fields and NA/NaN markers as NaN
# na_values=[...] adds extra values to be treated as missing
df = pd.read_csv(fpath, na_values=["Na", "Null", "foo"])
df

Out[44]:

something a b c d message
0 one 1 2 3.0 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

In [45]:

# A dict specifies per-column missing-value markers
df = pd.read_csv(fpath, na_values={"message":["NA", "NUll", "foo"], "something":["two"]})
df

Out[45]:

something a b c d message
0 one 1 2 3.0 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11.0 12 NaN

Reading data in chunks

In [35]:

fpath = r"D:/WinterIsComing/python/New_Wave/Machine_Learning/Pandas_others/Pandas 教程_源码/pandas_tutor-master/data/ex6.csv"
# nrows=10 reads just the first 10 rows
df1 = pd.read_csv(fpath, nrows=10)
df1

Out[35]:

one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
5 1.817480 0.742273 0.419395 -2.251035 Q
6 -0.776764 0.935518 -0.332872 -1.875641 U
7 -0.913135 1.530624 -0.572657 0.477252 K
8 0.358480 -0.497572 -0.367016 0.507702 S
9 -1.740877 -1.160417 -1.637830 2.172201 G

In [36]:

# chunksize=1000 returns a TextFileReader (it supports the iterator protocol) that yields 1000-row chunks
tr = pd.read_csv(fpath, chunksize=1000)
tr

Out[36]:

<pandas.io.parsers.TextFileReader at 0x27df54dc940>

In [37]:

result = pd.Series([], dtype="float64")
for chunk in tr:
    result = result.add(chunk["key"].value_counts(), fill_value=0)
result

Out[37]:

0    151.0
1    146.0
2    152.0
3    162.0
4    171.0
5    157.0
6    166.0
7    164.0
8    162.0
9    150.0
A    320.0
B    302.0
C    286.0
D    320.0
E    368.0
F    335.0
G    308.0
H    330.0
I    327.0
J    337.0
K    334.0
L    346.0
M    338.0
N    306.0
O    343.0
P    324.0
Q    340.0
R    318.0
S    308.0
T    304.0
U    326.0
V    328.0
W    305.0
X    364.0
Y    314.0
Z    288.0
dtype: float64

In [38]:

result.sort_values(ascending=False).head(10)

Out[38]:

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

In [39]:

result = result.sort_values(ascending=False)
result[:10]

Out[39]:

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

Saving data to disk

In [40]:

result.to_csv(r"D:/WinterIsComing/python/New_Wave/Machine_Learning/Pandas_others/Pandas 教程_源码/pandas_tutor-master/data/ex6(排序).csv")

In [46]:

# index=False skips the index, header=None skips the column names, columns=["b", "c", "message"] selects the columns to save, sep="|" sets the delimiter
fpath = r"D:/WinterIsComing/python/New_Wave/Machine_Learning/Pandas_others/Pandas 教程_源码/pandas_tutor-master/data/ex5(排序).csv"
df.to_csv(fpath, index=False, header=None, columns=["b", "c", "message"], sep="|")

5. Dates and Times

  • Timestamp: a fixed instant -> pd.Timestamp
  • Period: a fixed span, such as March 2016 or 2015's annual sales -> pd.Period
  • Interval: defined by a start time and an end time; a period is a special case of an interval

What dates and times are used for in Pandas

  • Analyzing financial data, such as stock trade data
  • Analyzing server logs
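A minimal sketch of the timestamp/period distinction: a `Timestamp` is an instant, while a `Period` spans an interval and knows its own boundaries:

```python
import pandas as pd

ts = pd.Timestamp("2016-03-20 08:30")   # one instant
p = pd.Period("2016-03", freq="M")      # the whole month of March 2016

# A Period exposes its start and end, so containment checks are direct.
inside = p.start_time <= ts <= p.end_time
```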

In [2]:

import pandas as pd
import numpy as np

Python datetime

In [2]:

# datetime represents a point in time
from datetime import datetime
# timedelta represents the difference between two times
from datetime import timedelta

In [3]:

now = datetime.now()

In [4]:

now

Out[4]:

datetime.datetime(2020, 9, 3, 22, 47, 24, 500671)

In [5]:

now.year, now.month, now.day

Out[5]:

(2020, 9, 3)

Time differences

In [7]:

date1 = datetime(2016, 3, 20)
date2 = datetime(2016, 3, 16)
delta = date1 - date2

In [8]:

delta

Out[8]:

datetime.timedelta(days=4)

In [9]:

delta.days

Out[9]:

4

In [11]:

delta.total_seconds()

Out[11]:

345600.0

In [12]:

date2 + delta

Out[12]:

datetime.datetime(2016, 3, 20, 0, 0)

In [13]:

# timedelta(4.5) means 4 days 12 hours
date2 + timedelta(4.5)

Out[13]:

datetime.datetime(2016, 3, 20, 12, 0)

Converting between strings and datetime

In [14]:

date = datetime(2016, 3, 20, 8, 30)
date

Out[14]:

datetime.datetime(2016, 3, 20, 8, 30)

In [15]:

str(date)

Out[15]:

'2016-03-20 08:30:00'

In [16]:

date.strftime("%Y/%m/%d %H:%M:%S")

Out[16]:

'2016/03/20 08:30:00'

In [17]:

datetime.strptime('2016-03-20 09:30', '%Y-%m-%d %H:%M')

Out[17]:

datetime.datetime(2016, 3, 20, 9, 30)

Time series in Pandas

In [18]:

dates = [datetime(2016, 3, 1), datetime(2016, 3, 2), datetime(2016, 3, 3), datetime(2016, 3, 4)]

In [19]:

s = pd.Series(np.random.randn(4), index=dates)
s

Out[19]:

2016-03-01    0.377466
2016-03-02    0.237420
2016-03-03    0.072644
2016-03-04   -0.600154
dtype: float64

In [21]:

type(s.index)

Out[21]:

pandas.core.indexes.datetimes.DatetimeIndex

In [22]:

type(s.index[0])

Out[22]:

pandas._libs.tslibs.timestamps.Timestamp

Generating date ranges

In [23]:

pd.date_range('20160320', '20160331')

Out[23]:

DatetimeIndex(['2016-03-20', '2016-03-21', '2016-03-22', '2016-03-23',
               '2016-03-24', '2016-03-25', '2016-03-26', '2016-03-27',
               '2016-03-28', '2016-03-29', '2016-03-30', '2016-03-31'],
              dtype='datetime64[ns]', freq='D')

In [24]:

pd.date_range('20160320', periods=10)

Out[24]:

DatetimeIndex(['2016-03-20', '2016-03-21', '2016-03-22', '2016-03-23',
               '2016-03-24', '2016-03-25', '2016-03-26', '2016-03-27',
               '2016-03-28', '2016-03-29'],
              dtype='datetime64[ns]', freq='D')

In [25]:

# normalize=True normalizes the timestamps to midnight
pd.date_range(start='2016-03-20 16:23:32', periods=10, normalize=True)

Out[25]:

DatetimeIndex(['2016-03-20', '2016-03-21', '2016-03-22', '2016-03-23',
               '2016-03-24', '2016-03-25', '2016-03-26', '2016-03-27',
               '2016-03-28', '2016-03-29'],
              dtype='datetime64[ns]', freq='D')

In [28]:

# freq sets the frequency; D means daily
pd.date_range(start="20160320", periods=10, freq="D")

Out[28]:

DatetimeIndex(['2016-03-20', '2016-03-21', '2016-03-22', '2016-03-23',
               '2016-03-24', '2016-03-25', '2016-03-26', '2016-03-27',
               '2016-03-28', '2016-03-29'],
              dtype='datetime64[ns]', freq='D')

In [29]:

# W: weekly
pd.date_range(start="20200903", periods=10, freq="W")

Out[29]:

DatetimeIndex(['2020-09-06', '2020-09-13', '2020-09-20', '2020-09-27',
               '2020-10-04', '2020-10-11', '2020-10-18', '2020-10-25',
               '2020-11-01', '2020-11-08'],
              dtype='datetime64[ns]', freq='W-SUN')

In [30]:

# M: monthly (month end)
pd.date_range(start='20160320', periods=10, freq='M')

Out[30]:

DatetimeIndex(['2016-03-31', '2016-04-30', '2016-05-31', '2016-06-30',
               '2016-07-31', '2016-08-31', '2016-09-30', '2016-10-31',
               '2016-11-30', '2016-12-31'],
              dtype='datetime64[ns]', freq='M')

In [32]:

# BM: last business day of each month
pd.date_range(start='2020', periods=12, freq='BM')

Out[32]:

DatetimeIndex(['2020-01-31', '2020-02-28', '2020-03-31', '2020-04-30',
               '2020-05-29', '2020-06-30', '2020-07-31', '2020-08-31',
               '2020-09-30', '2020-10-30', '2020-11-30', '2020-12-31'],
              dtype='datetime64[ns]', freq='BM')

In [33]:

# 4H: every 4 hours
pd.date_range(start='20200903', periods=10, freq='4H')

Out[33]:

DatetimeIndex(['2020-09-03 00:00:00', '2020-09-03 04:00:00',
               '2020-09-03 08:00:00', '2020-09-03 12:00:00',
               '2020-09-03 16:00:00', '2020-09-03 20:00:00',
               '2020-09-04 00:00:00', '2020-09-04 04:00:00',
               '2020-09-04 08:00:00', '2020-09-04 12:00:00'],
              dtype='datetime64[ns]', freq='4H')

Periods and arithmetic

pd.Period represents a span of time, such as a number of days, a month, or several months. For example, monthly sales totals are naturally indexed by monthly periods.
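Period arithmetic works in whole units of the period's frequency; a small sketch with monthly periods:

```python
import pandas as pd

p = pd.Period("2020-09", freq="M")

# Adding an integer shifts by that many months.
later = p + 3

# Subtracting two same-frequency periods gives the offset between them.
gap = later - p
```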

In [37]:

p = pd.Period(2020, freq="M")
p

Out[37]:

Period('2020-01', 'M')

In [38]:

p+2

Out[38]:

Period('2020-03', 'M')

Period ranges

In [39]:

pd.period_range("2020-09", periods=10, freq="M")

Out[39]:

PeriodIndex(['2020-09', '2020-10', '2020-11', '2020-12', '2021-01', '2021-02',
             '2021-03', '2021-04', '2021-05', '2021-06'],
            dtype='period[M]', freq='M')

In [40]:

pd.period_range(start='2016-01', end='2016-10', freq='M')

Out[40]:

PeriodIndex(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06',
             '2016-07', '2016-08', '2016-09', '2016-10'],
            dtype='period[M]', freq='M')

Converting period frequencies

asfreq

  • A-DEC: annual periods with the year ending in December
  • A-NOV: annual periods with the year ending in November
  • Q-DEC: quarterly periods with the year ending in December
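A quick check of `asfreq` on an annual period: converting down to months picks the month at whichever end of the year you ask for:

```python
import pandas as pd

a = pd.Period(2020)  # annual period, year ending in December

# The end of the year maps to December, the start to January.
end_month = a.asfreq("M", how="end")
start_month = a.asfreq("M", how="start")
```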

In [41]:

a = pd.Period(2020)

In [42]:

# An annual period
a

Out[42]:

Period('2020', 'A-DEC')

In [43]:

# Convert to a monthly period (defaults to the end of the year)
a.asfreq("M")

Out[43]:

Period('2020-12', 'M')

In [44]:

# Convert using the start of 2020 instead
a.asfreq("M", how="start")

Out[44]:

Period('2020-01', 'M')

In [45]:

p = pd.Period("2020-09", freq="M")
p

Out[45]:

Period('2020-09', 'M')

In [46]:

# Convert to an annual period
p.asfreq("A-DEC")

Out[46]:

Period('2020', 'A-DEC')

In [47]:

# Annual period whose fiscal year ends in March
p.asfreq('A-MAR')

Out[47]:

Period('2021', 'A-MAR')

Quarterly frequencies

Pandas supports 12 quarterly frequencies, from Q-JAN through Q-DEC.

In [50]:

p = pd.Period('2020Q4', 'Q-JAN')
p

Out[50]:

Period('2020Q4', 'Q-JAN')

In [51]:

# In a fiscal year ending in January, 2020Q4 covers 2019-11-01 through 2020-01-31
p.asfreq('D', how='start'), p.asfreq('D', how='end')

Out[51]:

(Period('2019-11-01', 'D'), Period('2020-01-31', 'D'))

In [52]:

# Minute period for 4 pm on the second-to-last business day of the quarter
(p.asfreq("B", how="end") - 1).asfreq("T", how="start") + 16 * 60

Out[52]:

Period('2020-01-30 16:00', 'T')

Converting between Timestamp and Period

In [3]:

s = pd.Series(np.random.randn(5), index=pd.date_range("2020-09-04", periods=5, freq="M"))
s

Out[3]:

2020-09-30   -0.608738
2020-10-31   -0.452613
2020-11-30    0.004881
2020-12-31    0.777084
2021-01-31   -0.707157
Freq: M, dtype: float64

In [5]:

# Convert a timestamp series to a period series (note the parentheses:
# s.to_period without them just returns the bound method)
s.to_period()

Out[5]:

2020-09   -0.608738
2020-10   -0.452613
2020-11    0.004881
2020-12    0.777084
2021-01   -0.707157
Freq: M, dtype: float64

In [6]:

ts = pd.Series(np.random.randn(5), index = pd.date_range('2016-12-29', periods=5, freq='D'))
ts

Out[6]:

2016-12-29   -0.686783
2016-12-30   -1.456618
2016-12-31    0.201625
2017-01-01   -0.874102
2017-01-02    0.918938
Freq: D, dtype: float64

In [7]:

ts.to_period()

Out[7]:

2016-12-29   -0.686783
2016-12-30   -1.456618
2016-12-31    0.201625
2017-01-01   -0.874102
2017-01-02    0.918938
Freq: D, dtype: float64

In [9]:

# Convert to monthly periods instead
pts = ts.to_period(freq="M")
pts

Out[9]:

2016-12   -0.686783
2016-12   -1.456618
2016-12    0.201625
2017-01   -0.874102
2017-01    0.918938
Freq: M, dtype: float64

In [10]:

# Converting to months produces duplicate index labels
pts.index

Out[10]:

PeriodIndex(['2016-12', '2016-12', '2016-12', '2017-01', '2017-01'], dtype='period[M]', freq='M')

In [11]:

# level=0 means group by the index
pts.groupby(level=0).sum()

Out[11]:

2016-12   -1.941776
2017-01    0.044836
Freq: M, dtype: float64

In [12]:

# Converting back to timestamps loses the within-period detail (defaults to the period start)
pts.to_timestamp()

Out[12]:

2016-12-01   -0.686783
2016-12-01   -1.456618
2016-12-01    0.201625
2017-01-01   -0.874102
2017-01-01    0.918938
dtype: float64

In [14]:

# how="end" uses each period's end time instead of its start
pts.to_timestamp(how="end")

Out[14]:

2016-12-31 23:59:59.999999999   -0.686783
2016-12-31 23:59:59.999999999   -1.456618
2016-12-31 23:59:59.999999999    0.201625
2017-01-31 23:59:59.999999999   -0.874102
2017-01-31 23:59:59.999999999    0.918938
dtype: float64

Resampling

  • High frequency -> low frequency (downsampling): e.g. converting 5-minute stock trade data to daily data
  • Low frequency -> high frequency (upsampling)
  • Other resampling: e.g. every Wednesday (W-WED) to every Friday (W-FRI)
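Downsampling can be sketched deterministically: ten one-minute readings (made-up values 0..9) collapse into two 5-minute bins:

```python
import pandas as pd
import numpy as np

# One reading per minute starting 09:30, values 0..9 for easy checking.
idx = pd.date_range("2016-04-25 09:30", periods=10, freq="min")
ts = pd.Series(np.arange(10), index=idx)

# Downsample: sum each 5-minute bin.
five_min = ts.resample("5min").sum()
```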

In [16]:

ts = pd.Series(np.random.randint(0,50,60), index=pd.date_range("2016-04-25 09:30", periods=60, freq="T"))
ts

Out[16]:

2016-04-25 09:30:00    30
2016-04-25 09:31:00    19
2016-04-25 09:32:00    34
2016-04-25 09:33:00    41
2016-04-25 09:34:00    29
2016-04-25 09:35:00    27
2016-04-25 09:36:00    34
2016-04-25 09:37:00    44
2016-04-25 09:38:00    15
2016-04-25 09:39:00    19
2016-04-25 09:40:00    35
2016-04-25 09:41:00     5
2016-04-25 09:42:00    34
2016-04-25 09:43:00    30
2016-04-25 09:44:00    24
2016-04-25 09:45:00    18
2016-04-25 09:46:00    39
2016-04-25 09:47:00    40
2016-04-25 09:48:00    43
2016-04-25 09:49:00    35
2016-04-25 09:50:00    31
2016-04-25 09:51:00    27
2016-04-25 09:52:00    31
2016-04-25 09:53:00    32
2016-04-25 09:54:00    34
2016-04-25 09:55:00    38
2016-04-25 09:56:00    19
2016-04-25 09:57:00     2
2016-04-25 09:58:00    11
2016-04-25 09:59:00    39
2016-04-25 10:00:00     0
2016-04-25 10:01:00     1
2016-04-25 10:02:00     2
2016-04-25 10:03:00     1
2016-04-25 10:04:00    11
2016-04-25 10:05:00     0
2016-04-25 10:06:00    15
2016-04-25 10:07:00    18
2016-04-25 10:08:00     9
2016-04-25 10:09:00    19
2016-04-25 10:10:00    11
2016-04-25 10:11:00    22
2016-04-25 10:12:00     6
2016-04-25 10:13:00     5
2016-04-25 10:14:00    45
2016-04-25 10:15:00    44
2016-04-25 10:16:00    25
2016-04-25 10:17:00    24
2016-04-25 10:18:00    21
2016-04-25 10:19:00    37
2016-04-25 10:20:00    48
2016-04-25 10:21:00    23
2016-04-25 10:22:00    25
2016-04-25 10:23:00     1
2016-04-25 10:24:00    17
2016-04-25 10:25:00    14
2016-04-25 10:26:00    17
2016-04-25 10:27:00    27
2016-04-25 10:28:00     6
2016-04-25 10:29:00    39
Freq: T, dtype: int32

In [20]:

# Resample to 5-minute bins
ts.resample("5min").sum()

Out[20]:

2016-04-25 09:30:00    153
2016-04-25 09:35:00    139
2016-04-25 09:40:00    128
2016-04-25 09:45:00    175
2016-04-25 09:50:00    155
2016-04-25 09:55:00    109
2016-04-25 10:00:00     15
2016-04-25 10:05:00     61
2016-04-25 10:10:00     89
2016-04-25 10:15:00    151
2016-04-25 10:20:00    114
2016-04-25 10:25:00    103
Freq: 5T, dtype: int32

In [22]:

# label="right" labels each bin with its right edge instead of its left
ts.resample("5min", label="right").sum()

Out[22]:

2016-04-25 09:35:00    153
2016-04-25 09:40:00    139
2016-04-25 09:45:00    128
2016-04-25 09:50:00    175
2016-04-25 09:55:00    155
2016-04-25 10:00:00    109
2016-04-25 10:05:00     15
2016-04-25 10:10:00     61
2016-04-25 10:15:00     89
2016-04-25 10:20:00    151
2016-04-25 10:25:00    114
2016-04-25 10:30:00    103
Freq: 5T, dtype: int32

Resampling via groupby

In [28]:

ts = pd.Series(np.random.randint(0,50,100), index=pd.date_range("2020-09-04", periods=100, freq="D"))
ts

Out[28]:

2020-09-04     4
2020-09-05    39
2020-09-06    49
2020-09-07    47
2020-09-08    36
              ..
2020-12-08     2
2020-12-09    15
2020-12-10     0
2020-12-11    49
2020-12-12    32
Freq: D, Length: 100, dtype: int32

In [29]:

# x is each index label (a Timestamp) of the Series
ts.groupby(lambda x: x.month).sum()

Out[29]:

9     683
10    696
11    717
12    236
dtype: int32

In [30]:

ts.groupby(ts.index.to_period("M")).sum()

Out[30]:

2020-09    683
2020-10    696
2020-11    717
2020-12    236
Freq: M, dtype: int32

Upsampling and interpolation

In [31]:

# Weekly data, sampled on Fridays (W-FRI)
df = pd.DataFrame(np.random.randint(1, 50, 2), index=pd.date_range('2016-04-22', periods=2, freq='W-FRI'))
df

Out[31]:

0
2016-04-22 31
2016-04-29 4

In [40]:

df.resample('D').sum()

Out[40]:

0
2016-04-22 31
2016-04-23 0
2016-04-24 0
2016-04-25 0
2016-04-26 0
2016-04-27 0
2016-04-28 0
2016-04-29 4

In [45]:

df.resample("D").ffill()

Out[45]:

0
2016-04-22 31
2016-04-23 31
2016-04-24 31
2016-04-25 31
2016-04-26 31
2016-04-27 31
2016-04-28 31
2016-04-29 4

Resampling periods

In [46]:

df = pd.DataFrame(np.random.randint(2, 30, (24, 4)), 
                  index=pd.period_range('2015-01', '2016-12', freq='M'),
                  columns=list('ABCD'))
df

Out[46]:

A B C D
2015-01 10 18 28 6
2015-02 2 12 10 4
2015-03 2 19 16 23
2015-04 23 3 9 27
2015-05 12 5 29 8
2015-06 28 18 15 23
2015-07 25 6 2 18
2015-08 12 15 4 17
2015-09 4 12 18 28
2015-10 7 12 7 26
2015-11 3 14 7 24
2015-12 9 28 29 12
2016-01 22 26 23 10
2016-02 29 4 20 13
2016-03 29 27 15 8
2016-04 13 20 17 2
2016-05 23 5 23 12
2016-06 3 26 27 17
2016-07 7 25 25 4
2016-08 10 10 20 21
2016-09 2 18 24 26
2016-10 13 9 24 13
2016-11 4 16 11 24
2016-12 26 26 15 21

In [49]:

df.resample("A-MAR").sum()

Out[49]:

A B C D
2015 14 49 54 33
2016 203 170 178 214
2017 101 155 186 140

In [52]:

pdf = df.resample("A-DEC").mean()

In [54]:

pdf.resample("Q-DEC").ffill()

Out[54]:

A B C D
2015Q1 11.416667 13.500000 14.500000 18.00
2015Q2 11.416667 13.500000 14.500000 18.00
2015Q3 11.416667 13.500000 14.500000 18.00
2015Q4 11.416667 13.500000 14.500000 18.00
2016Q1 15.083333 17.666667 20.333333 14.25
2016Q2 15.083333 17.666667 20.333333 14.25
2016Q3 15.083333 17.666667 20.333333 14.25
2016Q4 15.083333 17.666667 20.333333 14.25

In [59]:

fpath = r"D:/WinterIsComing/python/New_Wave/Machine_Learning/Pandas_others/Pandas 教程_源码/pandas_tutor-master/data/002001.csv"
df = pd.read_csv(fpath, index_col="Date", parse_dates=True)
df.head(5)

Out[59]:

Open High Low Close Volume Adj Close
Date
2015-12-22 16.86 17.13 16.48 16.95 13519900 16.95
2015-12-21 16.31 17.00 16.20 16.85 14132200 16.85
2015-12-18 16.59 16.70 16.21 16.31 10524300 16.31
2015-12-17 16.28 16.75 16.16 16.60 12326500 16.60
2015-12-16 16.23 16.42 16.05 16.28 8026000 16.28

In [60]:

df.index

Out[60]:

DatetimeIndex(['2015-12-22', '2015-12-21', '2015-12-18', '2015-12-17',
               '2015-12-16', '2015-12-15', '2015-12-14', '2015-12-11',
               '2015-12-10', '2015-12-09', '2015-12-08', '2015-12-07',
               '2015-12-04', '2015-12-03', '2015-12-02', '2015-12-01',
               '2015-11-30', '2015-11-27', '2015-11-26', '2015-11-25',
               '2015-11-24', '2015-11-23', '2015-11-20', '2015-11-19',
               '2015-11-18', '2015-11-17', '2015-11-16', '2015-11-13',
               '2015-11-12', '2015-11-11', '2015-11-10', '2015-11-09',
               '2015-11-06', '2015-11-05', '2015-11-04', '2015-11-03',
               '2015-11-02', '2015-10-30', '2015-10-29', '2015-10-28',
               '2015-10-27', '2015-10-26', '2015-10-23', '2015-10-22',
               '2015-10-21', '2015-10-20', '2015-10-19', '2015-10-16',
               '2015-10-15', '2015-10-14', '2015-10-13', '2015-10-12',
               '2015-10-09', '2015-10-08', '2015-10-07', '2015-10-06',
               '2015-10-05', '2015-10-02', '2015-10-01'],
              dtype='datetime64[ns]', name='Date', freq=None)

In [63]:

df["Adj Close"].resample("W-FRI").agg(["max", "min", "mean"])

Out[63]:

max min mean
Date
2015-10-02 13.41 13.41 13.410
2015-10-09 14.75 13.41 13.920
2015-10-16 15.30 14.73 15.094
2015-10-23 15.22 14.26 14.888
2015-10-30 15.30 15.02 15.174
2015-11-06 15.86 14.62 15.214
2015-11-13 16.59 15.95 16.186
2015-11-20 16.22 15.75 16.056
2015-11-27 16.94 15.54 16.292
2015-12-04 16.62 15.70 16.052
2015-12-11 16.63 15.56 15.938
2015-12-18 16.60 16.06 16.286
2015-12-25 16.95 16.85 16.900

6. Data Visualization

In [1]:

import pandas as pd
import numpy as np
%matplotlib inline

Data Visualization

Pandas builds its data visualization on matplotlib as the underlying component; see the matplotlib documentation for the fundamentals. This section focuses on the plotting shortcuts Pandas provides, which are more convenient than raw matplotlib.

Line plots

Both Series and DataFrame provide a plot function that draws a line plot directly.

In [3]:

ts = pd.Series(np.random.randn(1000), index=pd.date_range("2000/1/1", periods=1000))
ts = ts.cumsum()
ts.describe()

Out[3]:

count    1000.000000
mean      -24.133763
std        15.978359
min       -58.584119
25%       -39.237189
50%       -19.251717
75%       -13.778533
max         4.484775
dtype: float64

In [10]:

ts.plot(title="cumsum", style="r--", figsize=(8, 6));

In [11]:

ts.index

Out[11]:

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10',
               ...
               '2002-09-17', '2002-09-18', '2002-09-19', '2002-09-20',
               '2002-09-21', '2002-09-22', '2002-09-23', '2002-09-24',
               '2002-09-25', '2002-09-26'],
              dtype='datetime64[ns]', length=1000, freq='D')

In [12]:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))
df = df.cumsum()
df.describe()

Out[12]:

A B C D
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 10.560458 -17.259742 8.040148 12.022911
std 12.159655 11.339032 7.890642 13.069479
min -13.402049 -35.716146 -7.152798 -19.014370
25% 1.916648 -26.181304 1.724714 2.460871
50% 7.089967 -19.212122 7.046527 11.961487
75% 20.370508 -9.730554 14.244365 21.882782
max 35.864621 8.078246 26.201914 40.772492

In [13]:

df.plot()

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x22be7cc84e0>

# subplots=True: draw each column in its own subplot
# sharey=True: share the y axis across subplots
df.plot(subplots=True, figsize=(6,12), sharey=True);

In [21]:

df["ID"] = np.arange(len(df))
df.plot(x="ID", y=["A", "C"], subplots=True)

Out[21]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x0000022BE81828D0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x0000022BE873A240>],
      dtype=object)

In [26]:

df = pd.DataFrame(np.random.randn(10,4), columns=list("ABCD"))
df

Out[26]:

A B C D
0 1.185675 0.123734 0.398607 0.199504
1 -1.123768 0.078920 -2.291356 -0.220206
2 0.193336 -1.840301 -0.740678 0.819865
3 -0.410740 0.880615 -0.535395 -1.450210
4 0.661644 -0.832695 -0.937107 -0.227033
5 2.974065 -0.016914 -0.069501 1.917553
6 1.323996 -0.103212 0.121825 0.724699
7 0.849647 -1.142902 0.652289 -0.045240
8 -0.631673 1.764061 0.379861 -0.904924
9 -1.986952 1.031205 -0.299877 -0.184474

In [27]:

df.loc[0,:].plot(kind="bar")

Out[27]:

<matplotlib.axes._subplots.AxesSubplot at 0x22be9878908>

In [32]:

df.plot.bar(figsize=(8,8))

Out[32]:

<matplotlib.axes._subplots.AxesSubplot at 0x22be83d0a90>

In [33]:

# stacked=True stacks the bars
df.plot.bar(stacked=True,figsize=(8,8))

Out[33]:

<matplotlib.axes._subplots.AxesSubplot at 0x22be82e9390>

In [34]:

# barh draws a horizontal bar chart
df.plot.barh(stacked=True,figsize=(8,8))

Out[34]:

<matplotlib.axes._subplots.AxesSubplot at 0x22be83622b0>

Histograms

A histogram is a bar chart that discretizes value frequencies: data points are assigned to evenly spaced, discrete bins, and the number of points falling in each bin is plotted.
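The binning a histogram performs can be reproduced by hand with `pd.cut` plus `value_counts`, shown here on made-up values:

```python
import pandas as pd

values = pd.Series([0.1, 0.4, 1.2, 1.8, 2.5])

# Assign each value to a bin, then count per bin: the heights a histogram draws.
binned = pd.cut(values, bins=[0, 1, 2, 3])
counts = binned.value_counts().sort_index()
```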

In [36]:

df = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
                   'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df

Out[36]:

a b c
0 0.918520 -0.321703 -3.012545
1 0.530932 -1.822854 -0.201204
2 1.875175 -1.071331 -0.248311
3 2.367223 -0.012561 -0.594927
4 -1.494179 0.732017 -0.534631
... ... ... ...
995 1.756615 0.514619 0.087975
996 -0.315277 -0.250429 0.134666
997 0.332219 0.002270 -2.168648
998 1.324343 0.142002 -0.081007
999 1.374050 -0.709491 -0.465666

1000 rows × 3 columns

In [40]:

# bins=20: split column a into 20 equal-width bins and count the points in each
df["a"].hist(bins=20)

Out[40]:

<matplotlib.axes._subplots.AxesSubplot at 0x22be98ee6d8>

df.plot.hist(subplots=True, sharex=True, sharey=True, bins=50);

Original post: 《Pandas基础(二)》 by 1152-张同学, published 2020-09-08 at 拜师资源博客.