数据分析项目-人口分析

2344-吴同学

发表文章数:57

热门标签

, ,
首页 » Python » 正文

需求:

1.导入文件,查看原始数据

导入块

# -*-coding:utf-8-*-
import  numpy as np
import  pandas as pd
from pandas import  DataFrame

 导入文件,查看原始数据

#导入文件,查看原始数据
abb=pd.read_csv("./data/state-abbrevs.csv")
pop=pd.read_csv("./data/state-population.csv")
area=pd.read_csv("./data/state-areas.csv")
print("---------------------------------------------------")
print(abb.head(2))
print("---------------------------------------------------")
print(pop.head(2))
print("---------------------------------------------------")
print(area.head(2))

输出结果

E:/pythonProject5/venv/Scripts/python.exe E:/pythonProject5/人口分析案例.py
---------------------------------------------------
     state abbreviation
0  Alabama           AL
1   Alaska           AK
---------------------------------------------------
  state/region     ages  year  population
0           AL  under18  2012   1117489.0
1           AL    total  2012   4817528.0
---------------------------------------------------
     state  area (sq. mi)
0  Alabama          52423
1   Alaska         656425

Process finished with exit code 0

2.将人口数据和各州简称数据进行合并 

#将人口数据和各州简称数据进行合并
abb_pop=pd.merge(abb,pop,how='outer',left_on="abbreviation",right_on="state/region" )
print(abb_pop.head())

输出结果

E:/pythonProject5/venv/Scripts/python.exe E:/pythonProject5/人口分析案例.py
     state abbreviation state/region     ages  year  population
0  Alabama           AL           AL  under18  2012   1117489.0
1  Alabama           AL           AL    total  2012   4817528.0
2  Alabama           AL           AL  under18  2010   1130966.0
3  Alabama           AL           AL    total  2010   4785570.0
4  Alabama           AL           AL  under18  2011   1125763.0

Process finished with exit code 0

3.将合并的数据中重复的abbreviation列进行删除

#将合并的数据中重复的abbreviation列进行删除
abb_pop.drop(labels="abbreviation",axis=1,inplace=True)
print(abb_pop.head())

输出结果

     state state/region     ages  year  population
0  Alabama           AL  under18  2012   1117489.0
1  Alabama           AL    total  2012   4817528.0
2  Alabama           AL  under18  2010   1130966.0
3  Alabama           AL    total  2010   4785570.0
4  Alabama           AL  under18  2011   1125763.0

Process finished with exit code 0

4.查看存在缺失数据的列

#查看存在缺失数据的列
# abb_pop.loc[]
print(abb_pop.isnull().any(axis=0))

输出结果

state            True
state/region    False
ages            False
year            False
population       True
dtype: bool

Process finished with exit code 0

5.找到有哪些state/region使得state的值为NaN,进行去重操作 

#找到有哪些state/region使得state的值为NaN,进行去重操作
print(abb_pop['state'].isnull() ) # 判断是否为空
#定位到state为空的数据
df=abb_pop.loc[abb_pop['state'].isnull() ]
print(df)
#将结果中的state/region列取出,返回的是一个Series,并去重
Ser=df['state/region'].unique()
print(Ser)

输出结果

['PR' 'USA']

6.为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN

#为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
abb_pop.loc[abb_pop['state/region']=='PR']
PR_index=abb_pop.loc[abb_pop['state/region']=='PR'].index
print(PR_index)

输出结果

​
Int64Index([2448, 2449, 2450, 2451, 2452, 2453, 2454, 2455, 2456, 2457, 2458,
            2459, 2460, 2461, 2462, 2463, 2464, 2465, 2466, 2467, 2468, 2469,
            2470, 2471, 2472, 2473, 2474, 2475, 2476, 2477, 2478, 2479, 2480,
            2481, 2482, 2483, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491,
            2492, 2493, 2494, 2495],
           dtype='int64')

Process finished with exit code 0

7.给对应的 空值赋值

#给对应的 空值赋值
abb_pop.loc[PR_index,'state']='PuertoRico'
USA_index=abb_pop.loc[abb_pop['state/region']=="USA"].index
abb_pop.loc[USA_index,'state']='UnitedStatesofAmerica'

8.合并各州面积数据areas 

#合并各州面积数据areas
area.head()
abb_pop_area=pd.merge(abb_pop,area,how='outer')
#abb_pop_area.to_csv("./data/abb_pop_area.csv")
print(abb_pop_area.head())

9.我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行

#我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
abb_pop_area['area (sq. mi)'].isnull()
sqmi_index=abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
print(sqmi_index)

10.去除含有缺失数据的行

#去除含有缺失数据的行
abb_pop_area.drop(labels=sqmi_index,axis=0,inplace=True)

11.找出2010年的全民人口数据

#找出2010年的全民人口数据
pl_sum=abb_pop_area.query('year == 2010 & ages == "total"')
print(pl_sum)

输出结果

                     state state/region  ...   population  area (sq. mi)
2346            Washington           WA  ...    6742256.0        71303.0
2357         West Virginia           WV  ...    1854146.0        24231.0
2442             Wisconsin           WI  ...    5689060.0        65503.0
2453               Wyoming           WY  ...     564222.0        97818.0

[52 rows x 6 columns]

Process finished with exit code 0

12.计算各州的人口密度

#计算各州的人口密度
#总人数population/总面积area (sq. mi)
mi_du=abb_pop_area['population']/abb_pop_area['area (sq. mi)']
# 将数据添加到 总数据中
abb_pop_area['mi_du']=mi_du
print(abb_pop_area.head())

 输出结果

     state state/region     ages    year  population  area (sq. mi)      mi_du
0  Alabama           AL  under18  2012.0   1117489.0        52423.0  21.316769
1  Alabama           AL    total  2012.0   4817528.0        52423.0  91.897221
2  Alabama           AL  under18  2010.0   1130966.0        52423.0  21.573851
3  Alabama           AL    total  2010.0   4785570.0        52423.0  91.287603
4  Alabama           AL  under18  2011.0   1125763.0        52423.0  21.474601

Process finished with exit code 0

13.排序,并找出人口密度最高的五个州 df.sort_values()

#排序,并找出人口密度最高的五个州 df.sort_values()
abb_sort=abb_pop_area.sort_values('mi_du',axis=0,ascending=False)
print(abb_sort.head())

输出结果

                    state state/region  ... area (sq. mi)        mi_du
439  District of Columbia           DC  ...          68.0  9506.602941
433  District of Columbia           DC  ...          68.0  9315.102941
435  District of Columbia           DC  ...          68.0  9112.117647
479  District of Columbia           DC  ...          68.0  8901.779412
437  District of Columbia           DC  ...          68.0  8898.897059

[5 rows x 7 columns]

Process finished with exit code 0

人口分析案例总结:

  1. 读取文件:pd.read_csv('文件路径')

  2. 查看每一列的详细信息:data.info()

  3. df的条件查询:data.query('ages == "total" & year == 2010')

  4. 对某一列数据进行排序:data.sort_values(by='midu',ascending=False)

拜师教育学员文章:作者:2344-吴同学, 转载或复制请以 超链接形式 并注明出处 拜师资源博客
原文地址:《数据分析项目-人口分析》 发布于2021-11-23

分享到:
赞(0) 打赏

评论 抢沙发

评论前必须登录!

  注册



长按图片转发给朋友

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏

Vieu3.3主题
专业打造轻量级个人企业风格博客主题!专注于前端开发,全站响应式布局自适应模板。

登录

忘记密码 ?

您也可以使用第三方帐号快捷登录

Q Q 登 录
微 博 登 录