pandas的简单操作(3):数据整形、数据透视、时间序列、数据可视化、数据载入与保存

1334-甘同学

发表文章数:18

热门标签

, ,
首页 » 数据科学库 » 正文

input:

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as plt

input:

tuples = list(zip(*[['bar','bar','baz','baz',
                    'foo','foo','qux','qux'],
                   ['one','two','one','two',
                   'one','two','one','two']]))
tuples

output:

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

input:

index = pd.MultiIndex.from_tuples(tuples,names=['first','second']) #双层索引
index

output:

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

input:

df = pd.DataFrame(np.random.randn(8,2),index = index ,columns = ['A','B'])
df

output:

A B
first second
bar one 0.537622 -0.577700
two 2.028141 0.281219
baz one -0.070568 1.000496
two 0.005478 1.907888
foo one -0.305029 -0.149257
two 0.104812 -0.166493
qux one -0.340378 0.849778
two 1.137278 -0.002255

input:

stacked = df.stack() #stacked函数:把行索引和列索引进行转换
stacked

output:

first  second   
bar    one     A    0.537622
               B   -0.577700
       two     A    2.028141
               B    0.281219
baz    one     A   -0.070568
               B    1.000496
       two     A    0.005478
               B    1.907888
foo    one     A   -0.305029
               B   -0.149257
       two     A    0.104812
               B   -0.166493
qux    one     A   -0.340378
               B    0.849778
       two     A    1.137278
               B   -0.002255
dtype: float64

input:

stacked = df.unstack() #转换回来

input:

df = pd.DataFrame({'A':['one','one','two','three']*3,
                   'B':['A','B','C']*4,
                   'C':['foo','foo','foo','bar','bar','bar']*2,
                   'D': np.random.randn(12),
                   'E':np.random.randn(12)})
df

output:

A B C D E
0 one A foo -0.764417 1.005010
1 one B foo 0.774953 -1.087644
2 two C foo 0.205976 -1.945244
3 three A bar -0.503414 -1.890938
4 one B bar 0.573373 0.380991
5 one C bar -0.973088 0.647617
6 two A foo 0.777202 -1.975924
7 three B foo -1.152722 0.639917
8 one C foo -1.137695 0.178866
9 one A bar 0.266740 -0.599844
10 two B bar 1.459513 -0.665667
11 three C bar -0.854546 0.136330

input:

df.pivot_table(values=['D'],index=['A','B'], columns=['C']) #以A B 为行索引,C为列索引,值为D 数据透视表

output:

D
C bar foo
A B
one A 0.266740 -0.764417
B 0.573373 0.774953
C -0.973088 -1.137695
three A -0.503414 NaN
B NaN -1.152722
C -0.854546 NaN
two A NaN 0.777202
B 1.459513 NaN
C NaN 0.205976

input:

df.pivot_table(values=['E'],index=['A'],columns=['C'])

output:

E
C bar foo
A
one 0.142921 0.032077
three -0.877304 0.639917
two -0.665667 -1.960584

input:

df[df.A =='one'].groupby('C').mean()

output:

D E
C
bar -0.044325 0.142921
foo -0.375720 0.032077

input:

rng = pd.date_range('20200301',periods=600, freq='s')
rng

output:

DatetimeIndex(['2020-03-01 00:00:00', '2020-03-01 00:00:01',
               '2020-03-01 00:00:02', '2020-03-01 00:00:03',
               '2020-03-01 00:00:04', '2020-03-01 00:00:05',
               '2020-03-01 00:00:06', '2020-03-01 00:00:07',
               '2020-03-01 00:00:08', '2020-03-01 00:00:09',
               ...
               '2020-03-01 00:09:50', '2020-03-01 00:09:51',
               '2020-03-01 00:09:52', '2020-03-01 00:09:53',
               '2020-03-01 00:09:54', '2020-03-01 00:09:55',
               '2020-03-01 00:09:56', '2020-03-01 00:09:57',
               '2020-03-01 00:09:58', '2020-03-01 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')

input:

s = pd.Series(np.random.randint(0,500,len(rng)),index=rng) #时间序列
s

output:

2020-03-01 00:00:00    434
2020-03-01 00:00:01    200
2020-03-01 00:00:02    429
2020-03-01 00:00:03    498
2020-03-01 00:00:04    431
2020-03-01 00:00:05     33
2020-03-01 00:00:06    431
2020-03-01 00:00:07    360
2020-03-01 00:00:08    101
2020-03-01 00:00:09     89
2020-03-01 00:00:10     27
2020-03-01 00:00:11    257
2020-03-01 00:00:12    493
2020-03-01 00:00:13    309
2020-03-01 00:00:14    431
2020-03-01 00:00:15    346
2020-03-01 00:00:16    378
2020-03-01 00:00:17    267
2020-03-01 00:00:18     25
2020-03-01 00:00:19    271
2020-03-01 00:00:20    425
2020-03-01 00:00:21    169
2020-03-01 00:00:22     92
2020-03-01 00:00:23    220
2020-03-01 00:00:24    391
2020-03-01 00:00:25     81
2020-03-01 00:00:26    283
2020-03-01 00:00:27    304
2020-03-01 00:00:28    344
2020-03-01 00:00:29    316
                      ... 
2020-03-01 00:09:30     84
2020-03-01 00:09:31     70
2020-03-01 00:09:32     74
2020-03-01 00:09:33     57
2020-03-01 00:09:34    110
2020-03-01 00:09:35    111
2020-03-01 00:09:36    173
2020-03-01 00:09:37    239
2020-03-01 00:09:38    273
2020-03-01 00:09:39    200
2020-03-01 00:09:40    246
2020-03-01 00:09:41     75
2020-03-01 00:09:42    392
2020-03-01 00:09:43    262
2020-03-01 00:09:44    480
2020-03-01 00:09:45    154
2020-03-01 00:09:46     66
2020-03-01 00:09:47     51
2020-03-01 00:09:48    178
2020-03-01 00:09:49    395
2020-03-01 00:09:50    375
2020-03-01 00:09:51    228
2020-03-01 00:09:52    485
2020-03-01 00:09:53    161
2020-03-01 00:09:54    483
2020-03-01 00:09:55     92
2020-03-01 00:09:56     45
2020-03-01 00:09:57    488
2020-03-01 00:09:58      3
2020-03-01 00:09:59    469
Freq: S, Length: 600, dtype: int64

input:

s.resample('2Min',how='mean') #因为上面数据太多不利于分析,所以进行重采样,这里每两分钟采样一次,可以取平均值采样的方法或者求和的方法
/Users/mac/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).mean()
  """Entry point for launching an IPython kernel.





2020-03-01 00:00:00    246.425000
2020-03-01 00:02:00    272.933333
2020-03-01 00:04:00    253.975000
2020-03-01 00:06:00    268.491667
2020-03-01 00:08:00    242.025000
Freq: 2T, dtype: float64

input:

rng = pd.period_range('2000Q1','2016Q1',freq='Q')
rng

output:

PeriodIndex(['2000Q1', '2000Q2', '2000Q3', '2000Q4', '2001Q1', '2001Q2',
             '2001Q3', '2001Q4', '2002Q1', '2002Q2', '2002Q3', '2002Q4',
             '2003Q1', '2003Q2', '2003Q3', '2003Q4', '2004Q1', '2004Q2',
             '2004Q3', '2004Q4', '2005Q1', '2005Q2', '2005Q3', '2005Q4',
             '2006Q1', '2006Q2', '2006Q3', '2006Q4', '2007Q1', '2007Q2',
             '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3', '2008Q4',
             '2009Q1', '2009Q2', '2009Q3', '2009Q4', '2010Q1', '2010Q2',
             '2010Q3', '2010Q4', '2011Q1', '2011Q2', '2011Q3', '2011Q4',
             '2012Q1', '2012Q2', '2012Q3', '2012Q4', '2013Q1', '2013Q2',
             '2013Q3', '2013Q4', '2014Q1', '2014Q2', '2014Q3', '2014Q4',
             '2015Q1', '2015Q2', '2015Q3', '2015Q4', '2016Q1'],
            dtype='period[Q-DEC]', freq='Q-DEC')

input:

rng.to_timestamp()

output:

DatetimeIndex(['2000-01-01', '2000-04-01', '2000-07-01', '2000-10-01',
               '2001-01-01', '2001-04-01', '2001-07-01', '2001-10-01',
               '2002-01-01', '2002-04-01', '2002-07-01', '2002-10-01',
               '2003-01-01', '2003-04-01', '2003-07-01', '2003-10-01',
               '2004-01-01', '2004-04-01', '2004-07-01', '2004-10-01',
               '2005-01-01', '2005-04-01', '2005-07-01', '2005-10-01',
               '2006-01-01', '2006-04-01', '2006-07-01', '2006-10-01',
               '2007-01-01', '2007-04-01', '2007-07-01', '2007-10-01',
               '2008-01-01', '2008-04-01', '2008-07-01', '2008-10-01',
               '2009-01-01', '2009-04-01', '2009-07-01', '2009-10-01',
               '2010-01-01', '2010-04-01', '2010-07-01', '2010-10-01',
               '2011-01-01', '2011-04-01', '2011-07-01', '2011-10-01',
               '2012-01-01', '2012-04-01', '2012-07-01', '2012-10-01',
               '2013-01-01', '2013-04-01', '2013-07-01', '2013-10-01',
               '2014-01-01', '2014-04-01', '2014-07-01', '2014-10-01',
               '2015-01-01', '2015-04-01', '2015-07-01', '2015-10-01',
               '2016-01-01'],
              dtype='datetime64[ns]', freq='QS-OCT')

input:

pd.Timestamp('20160301')-pd.Timestamp('20160201') #pandas里时间运算非常方便

output:

Timedelta('29 days 00:00:00')

input:

pd.Timestamp('20160301')+pd.Timedelta(days=5)

output:

Timestamp('2016-03-06 00:00:00')

input:

df = pd.DataFrame({'id':[1,2,3,4,5,6],'raw_grade':['a','b','b','a','a','d']})
df

output:

id raw_grade
0 1 a
1 2 b
2 3 b
3 4 a
4 5 a
5 6 d

input:

df['grade'] = df.raw_grade.astype('category') #增加grade 列
df

output:

id raw_grade grade
0 1 a a
1 2 b b
2 3 b b
3 4 a a
4 5 a a
5 6 d d

input:

df.grade

output:

0    a
1    b
2    b
3    a
4    a
5    d
Name: grade, dtype: category
Categories (3, object): [a, b, d]

input:

df.grade.cat.categories

output:

Index(['a', 'b', 'd'], dtype='object')

input:

df.grade.cat.categories =['very good','good','bad'] #可以替换成任意内容 这里是以id值排序
df

output:

id raw_grade grade
0 1 a very good
1 2 b good
2 3 b good
3 4 a very good
4 5 a very good
5 6 d bad

input:

df.sort_values(by='grade',ascending=False) #以grade排序

output:

id raw_grade grade
5 6 d bad
2 3 b good
1 2 b good
4 5 a very good
3 4 a very good
0 1 a very good

input:

s = pd.Series(np.random.randn(1000),index=pd.date_range('20000101',periods=1000))
s

output:

2000-01-01   -0.532396
2000-01-02   -0.438004
2000-01-03    1.211030
2000-01-04    0.736409
2000-01-05   -0.721090
2000-01-06   -0.978725
2000-01-07    1.189487
2000-01-08    0.017212
2000-01-09    0.014859
2000-01-10    0.807690
2000-01-11    1.201846
2000-01-12   -1.622267
2000-01-13   -0.670456
2000-01-14    1.860123
2000-01-15   -0.355513
2000-01-16   -1.697407
2000-01-17   -0.349072
2000-01-18    0.493644
2000-01-19   -0.202796
2000-01-20    0.443553
2000-01-21    0.579038
2000-01-22    0.015395
2000-01-23   -1.703529
2000-01-24    0.200692
2000-01-25   -0.649918
2000-01-26    0.103963
2000-01-27    0.044044
2000-01-28   -0.211696
2000-01-29    0.016201
2000-01-30    0.350856
                ...   
2002-08-28   -0.145103
2002-08-29   -1.800730
2002-08-30   -1.216044
2002-08-31   -0.052010
2002-09-01    0.932612
2002-09-02   -1.063313
2002-09-03   -1.739508
2002-09-04   -1.398266
2002-09-05   -0.209526
2002-09-06   -0.048596
2002-09-07   -1.707406
2002-09-08   -0.117668
2002-09-09    0.208738
2002-09-10   -1.751093
2002-09-11   -0.143911
2002-09-12   -1.176041
2002-09-13    1.148417
2002-09-14    0.785341
2002-09-15   -1.165482
2002-09-16    1.184074
2002-09-17   -1.252482
2002-09-18   -2.602539
2002-09-19   -0.335973
2002-09-20    1.390396
2002-09-21    0.289487
2002-09-22   -0.121280
2002-09-23   -0.229991
2002-09-24    0.684140
2002-09-25   -0.391517
2002-09-26    0.003391
Freq: D, Length: 1000, dtype: float64

input:

s = s.cumsum() #累加求和
s
2000-01-01    -0.532396
2000-01-02    -0.970400
2000-01-03     0.240631
2000-01-04     0.977040
2000-01-05     0.255949
2000-01-06    -0.722775
2000-01-07     0.466711
2000-01-08     0.483923
2000-01-09     0.498782
2000-01-10     1.306472
2000-01-11     2.508318
2000-01-12     0.886052
2000-01-13     0.215595
2000-01-14     2.075718
2000-01-15     1.720205
2000-01-16     0.022798
2000-01-17    -0.326274
2000-01-18     0.167370
2000-01-19    -0.035426
2000-01-20     0.408127
2000-01-21     0.987165
2000-01-22     1.002560
2000-01-23    -0.700968
2000-01-24    -0.500276
2000-01-25    -1.150195
2000-01-26    -1.046232
2000-01-27    -1.002187
2000-01-28    -1.213884
2000-01-29    -1.197683
2000-01-30    -0.846828
                ...    
2002-08-28   -13.662979
2002-08-29   -15.463709
2002-08-30   -16.679753
2002-08-31   -16.731763
2002-09-01   -15.799151
2002-09-02   -16.862464
2002-09-03   -18.601972
2002-09-04   -20.000239
2002-09-05   -20.209765
2002-09-06   -20.258361
2002-09-07   -21.965767
2002-09-08   -22.083435
2002-09-09   -21.874697
2002-09-10   -23.625790
2002-09-11   -23.769701
2002-09-12   -24.945741
2002-09-13   -23.797324
2002-09-14   -23.011983
2002-09-15   -24.177466
2002-09-16   -22.993391
2002-09-17   -24.245873
2002-09-18   -26.848412
2002-09-19   -27.184385
2002-09-20   -25.793989
2002-09-21   -25.504502
2002-09-22   -25.625782
2002-09-23   -25.855773
2002-09-24   -25.171632
2002-09-25   -25.563150
2002-09-26   -25.559759
Freq: D, Length: 1000, dtype: float64

input:

s.plot() #数据可视化

output:

<matplotlib.axes._subplots.AxesSubplot at 0x116707f60>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-jV2jN8Ku-1604852354405)(output_25_1.png)]

input:

df = pd.DataFrame(np.random.randn(100,4),columns=list('ABCD'))
df

output:

A B C D
0 1.564117 -0.463773 -0.091550 -1.857433
1 -1.184417 0.937198 -1.788886 -0.380206
2 -0.296087 0.412653 1.936886 -1.718887
3 1.671879 -0.203985 0.015520 -1.426769
4 0.258212 -0.599114 -1.571501 2.172205
5 -1.099829 2.178160 0.215177 -1.248541
6 0.860445 0.958647 -1.098336 -0.023771
7 1.956408 -1.914137 0.194243 0.855831
8 0.396876 -0.455098 1.124536 0.526166
9 0.136470 0.095513 1.297267 1.154325
10 -0.214534 -0.950800 1.607950 0.828318
11 -1.298054 0.100998 -0.515046 0.476188
12 1.163658 -1.994909 -0.758015 0.036630
13 0.905410 1.232997 0.633099 -0.450712
14 -0.488799 0.904530 -0.226913 -0.027718
15 -1.338279 0.607139 -0.714241 0.678542
16 1.572935 1.667777 0.597482 -1.846735
17 1.299655 0.755383 0.553495 -0.816247
18 -0.138044 0.213682 -0.685482 0.187415
19 0.849237 0.442252 -1.808233 -0.408269
20 -1.257406 -0.005541 -0.588033 -1.417979
21 -1.725082 -0.411008 -0.415789 1.219006
22 -1.093083 -0.017484 0.177462 0.444620
23 -0.674220 -0.209629 -0.987757 0.804393
24 0.858732 -0.961648 1.765393 -0.946591
25 0.835091 -0.119595 0.276223 -0.499499
26 1.293017 1.683998 -1.239953 0.578722
27 1.455033 0.492396 -1.397196 2.129232
28 0.035302 -0.813183 1.181686 0.199079
29 -0.246104 -1.071980 -0.092767 -0.871213
70 0.523143 -1.583870 0.037297 -0.359236
71 -1.186931 -0.921222 -1.634302 -0.231714
72 0.647745 -0.667466 -0.555059 -0.169008
73 0.703217 0.035210 -0.550578 0.064030
74 -1.239652 0.458127 -0.102111 -0.812303
75 0.730404 -0.109444 0.393924 0.902011
76 1.503345 -0.163920 -0.381882 -0.758856
77 0.758033 1.412092 0.893678 0.966148
78 -2.095461 -1.343793 1.009265 1.021844
79 -0.243499 0.812086 0.713547 -0.315148
80 -1.842734 -0.382523 0.842632 0.326414
81 0.293315 0.715708 -0.603879 -0.822051
82 0.320864 -0.750298 -0.937338 0.231808
83 -0.506367 -0.378503 0.799895 0.465828
84 -1.219393 0.295697 1.311684 2.146081
85 0.186327 -0.126560 1.477611 -1.200722
86 0.073617 1.448020 -0.528352 0.512307
87 0.724785 0.075949 -0.618813 -1.652172
88 1.238503 -0.319008 0.157979 -0.025204
89 0.248423 0.278020 -0.453552 0.588062
90 -0.629882 -2.125369 0.237280 0.030703
91 1.500885 -0.725088 1.012414 -1.572393
92 -0.751267 -0.235083 -1.052901 1.387676
93 0.151400 1.313931 -0.424040 -0.983791
94 0.667903 -0.105557 1.614367 0.426456
95 1.871545 0.072405 -0.303439 -0.347861
96 -0.051875 1.718629 0.198310 0.383673
97 1.390178 0.866700 -0.019794 -0.125795
98 -0.571301 -1.722597 -2.229214 0.400489
99 0.649530 -0.262061 -0.694856 -0.566315

100 rows × 4 columns

input:

df.to_csv('data.cav') #保存到磁盘

input:

%ls #查看内容

output:

[30m[43mAdlm[m[m/                       Untitled.ipynb
[34mApplications[m[m/               [34mVirtual Machines.localized[m[m/
[34mApplications (Parallels)[m[m/   [34manaconda3[m[m/
[34mCreative Cloud Files[m[m/       data.cav
[34mDesktop[m[m/                    matlab_crash_dump.1087-1
[34mDocuments[m[m/                  matlab_crash_dump.1095-1
[34mDownloads[m[m/                  matlab_crash_dump.661-1
[34mLibrary[m[m/                    [34mopt[m[m/
[34mMovies[m[m/                     pandas_tutorial(2).ipynb
[34mMusic[m[m/                      pandas_tutorials(1).ipynb
[34mPictures[m[m/                   pandas_tutorials(3).ipynb
[34mPublic[m[m/                     pandas_tutorials_1.ipynb
[34mQt5.14.1[m[m/

.14.1[m[m/

input:

%more data.csv   #查看内容

input:

pd.read_csv('data.csv') #读回来

未经允许不得转载:作者:1334-甘同学, 转载或复制请以 超链接形式 并注明出处 拜师资源博客
原文地址:《pandas的简单操作(3):数据整形、数据透视、时间序列、数据可视化、数据载入与保存》 发布于2020-11-09

分享到:
赞(0) 打赏

评论 抢沙发

评论前必须登录!

  注册



长按图片转发给朋友

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏

Vieu3.3主题
专业打造轻量级个人企业风格博客主题!专注于前端开发,全站响应式布局自适应模板。

登录

忘记密码 ?

您也可以使用第三方帐号快捷登录

Q Q 登 录
微 博 登 录