- 需求:
- 1.导入文件,查看原始数据
- 2.将人口数据和各州简称数据进行合并
- 3.将合并的数据中重复的abbreviation列进行删除
- 4.查看存在缺失数据的列
- 5.找到有哪些state/region使得state的值为NaN,进行去重操作
- 6.为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
- 7.给对应的 空值赋值
- 8.合并各州面积数据areas
- 9.我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
- 10.去除含有缺失数据的行
- 11.找出2010年的全民人口数据
- 12.计算各州的人口密度
- 13.排序,并找出人口密度最高的五个州 df.sort_values()
需求:
1.导入文件,查看原始数据
导入块
# -*-coding:utf-8-*-
import numpy as np
import pandas as pd
from pandas import DataFrame
导入文件,查看原始数据
#导入文件,查看原始数据
abb=pd.read_csv("./data/state-abbrevs.csv")
pop=pd.read_csv("./data/state-population.csv")
area=pd.read_csv("./data/state-areas.csv")
print("---------------------------------------------------")
print(abb.head(2))
print("---------------------------------------------------")
print(pop.head(2))
print("---------------------------------------------------")
print(area.head(2))
输出结果
E:/pythonProject5/venv/Scripts/python.exe E:/pythonProject5/人口分析案例.py
---------------------------------------------------
state abbreviation
0 Alabama AL
1 Alaska AK
---------------------------------------------------
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
---------------------------------------------------
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
Process finished with exit code 0
2.将人口数据和各州简称数据进行合并
#将人口数据和各州简称数据进行合并
abb_pop=pd.merge(abb,pop,how='outer',left_on="abbreviation",right_on="state/region" )
print(abb_pop.head())
输出结果
E:/pythonProject5/venv/Scripts/python.exe E:/pythonProject5/人口分析案例.py
state abbreviation state/region ages year population
0 Alabama AL AL under18 2012 1117489.0
1 Alabama AL AL total 2012 4817528.0
2 Alabama AL AL under18 2010 1130966.0
3 Alabama AL AL total 2010 4785570.0
4 Alabama AL AL under18 2011 1125763.0
Process finished with exit code 0
3.将合并的数据中重复的abbreviation列进行删除
#将合并的数据中重复的abbreviation列进行删除
abb_pop.drop(labels="abbreviation",axis=1,inplace=True)
print(abb_pop.head())
输出结果
state state/region ages year population
0 Alabama AL under18 2012 1117489.0
1 Alabama AL total 2012 4817528.0
2 Alabama AL under18 2010 1130966.0
3 Alabama AL total 2010 4785570.0
4 Alabama AL under18 2011 1125763.0
Process finished with exit code 0
4.查看存在缺失数据的列
#查看存在缺失数据的列
# abb_pop.loc[]
print(abb_pop.isnull().any(axis=0))
输出结果
state True
state/region False
ages False
year False
population True
dtype: bool
Process finished with exit code 0
5.找到有哪些state/region使得state的值为NaN,进行去重操作
#找到有哪些state/region使得state的值为NaN,进行去重操作
print(abb_pop['state'].isnull() ) # 判断是否为空
#定位到state为空的数据
df=abb_pop.loc[abb_pop['state'].isnull() ]
print(df)
#将结果中的state/region列取出,返回的是一个Series,并去重
Ser=df['state/region'].unique()
print(Ser)
输出结果
['PR' 'USA']
6.为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
#为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
abb_pop.loc[abb_pop['state/region']=='PR']
PR_index=abb_pop.loc[abb_pop['state/region']=='PR'].index
print(PR_index)
输出结果
Int64Index([2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458,
2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469,
2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480,
2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491,
2492, 2493, 2494, 2495],
dtype='int64')
Process finished with exit code 0
7.给对应的 空值赋值
#给对应的 空值赋值
abb_pop.loc[PR_index,'state']='PuertoRico'
USA_index=abb_pop.loc[abb_pop['state/region']=="USA"].index
abb_pop.loc[USA_index,'state']='UnitedStatesofAmerica'
8.合并各州面积数据areas
#合并各州面积数据areas
area.head()
abb_pop_area=pd.merge(abb_pop,area,how='outer')
#abb_pop_area.to_csv("./data/abb_pop_area.csv")
print(abb_pop_area.head())
9.我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
#我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
abb_pop_area['area (sq. mi)'].isnull()
sqmi_index=abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
print(sqmi_index)
10.去除含有缺失数据的行
#去除含有缺失数据的行
abb_pop_area.drop(labels=sqmi_index,axis=0,inplace=True)
11.找出2010年的全民人口数据
#找出2010年的全民人口数据
pl_sum=abb_pop_area.query('year == 2010 & ages == "total"')
print(pl_sum)
输出结果
state state/region ... population area (sq. mi)
2346 Washington WA ... 6742256.0 71303.0
2357 West Virginia WV ... 1854146.0 24231.0
2442 Wisconsin WI ... 5689060.0 65503.0
2453 Wyoming WY ... 564222.0 97818.0
[52 rows x 6 columns]
Process finished with exit code 0
12.计算各州的人口密度
#计算各州的人口密度
#总人数population/总面积area (sq. mi)
mi_du=abb_pop_area['population']/abb_pop_area['area (sq. mi)']
# 将数据添加到 总数据中
abb_pop_area['mi_du']=mi_du
print(abb_pop_area.head())
输出结果
state state/region ages year population area (sq. mi) mi_du
0 Alabama AL under18 2012.0 1117489.0 52423.0 21.316769
1 Alabama AL total 2012.0 4817528.0 52423.0 91.897221
2 Alabama AL under18 2010.0 1130966.0 52423.0 21.573851
3 Alabama AL total 2010.0 4785570.0 52423.0 91.287603
4 Alabama AL under18 2011.0 1125763.0 52423.0 21.474601
Process finished with exit code 0
13.排序,并找出人口密度最高的五个州 df.sort_values()
#排序,并找出人口密度最高的五个州 df.sort_values()
abb_sort=abb_pop_area.sort_values('mi_du',axis=0,ascending=False)
print(abb_sort.head())
输出结果
state state/region ... area (sq. mi) mi_du
439 District of Columbia DC ... 68.0 9506.602941
433 District of Columbia DC ... 68.0 9315.102941
435 District of Columbia DC ... 68.0 9112.117647
479 District of Columbia DC ... 68.0 8901.779412
437 District of Columbia DC ... 68.0 8898.897059
[5 rows x 7 columns]
Process finished with exit code 0
人口分析案例总结:
-
读取文件:pd.read_csv('文件路径')
-
查看每一列的详细信息:data.info()
-
df的条件查询:data.query('ages == "total" & year == 2010')
-
对某一列数据进行排序:data.sort_values(by='midu',ascending=False)
拜师教育学员文章:作者:2344-吴同学,
转载或复制请以 超链接形式 并注明出处 拜师资源博客。
原文地址:《数据分析项目-人口分析》 发布于2021-11-23
评论 抢沙发