
Data Practicing-EP7

Visualization in Python

Using pandas inside a notebook still feels like the more convenient way to visualize this dataset of a few million rows that has already been processed by Spark.

Let's start with the imports and some data cleaning.

# -*- coding: utf-8 -*-
# Python Version == 3.8.6
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# bump the resolution for both inline display and saved figures
plt.rcParams['figure.dpi'] = 150
plt.rcParams['savefig.dpi'] = 150
sns.set(rc={"figure.dpi": 150, 'savefig.dpi': 150})

# match the plot style to the jupyter notebook theme
from jupyterthemes import jtplot

jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)

fullData = pd.read_csv('~/devEnvs/chicagoCrimeData.csv', encoding='utf-8')

fullData.info()
/home/mijazz/devEnvs/pyvenv/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3062: DtypeWarning: Columns (21) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7212273 entries, 0 to 7212272
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(10)
memory usage: 1.1+ GB
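The DtypeWarning above is about column 21 (Location) holding mixed types. The warning itself already names the fix; a minimal sketch, assuming the same CSV path as above:

# read the whole file in one pass so pandas infers a single dtype per column
fullData = pd.read_csv('~/devEnvs/chicagoCrimeData.csv', encoding='utf-8', low_memory=False)

# or pin the offending column to string explicitly
# fullData = pd.read_csv('~/devEnvs/chicagoCrimeData.csv', encoding='utf-8', dtype={'Location': str})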
fullData.head(5)
   ID        Case Number  Date                    Block                 IUCR  Primary Type         Description                          Location Description  Arrest  Domestic  ...  Ward  Community Area  FBI Code  X Coordinate  Y Coordinate  Year  Updated On              Latitude  Longitude  Location
0  11034701  JA366925     01/01/2001 11:00:00 AM  016XX E 86TH PL       1153  DECEPTIVE PRACTICE   FINANCIAL IDENTITY THEFT OVER $ 300  RESIDENCE             False   False     ...  8.0   45.0            11        NaN           NaN           2001  08/05/2017 03:50:08 PM  NaN       NaN        NaN
1  11227287  JB147188     10/08/2017 03:00:00 AM  092XX S RACINE AVE    0281  CRIM SEXUAL ASSAULT  NON-AGGRAVATED                       RESIDENCE             False   False     ...  21.0  73.0            02        NaN           NaN           2017  02/11/2018 03:57:41 PM  NaN       NaN        NaN
2  11227583  JB147595     03/28/2017 02:00:00 PM  026XX W 79TH ST       0620  BURGLARY             UNLAWFUL ENTRY                       OTHER                 False   False     ...  18.0  70.0            05        NaN           NaN           2017  02/11/2018 03:57:41 PM  NaN       NaN        NaN
3  11227293  JB147230     09/09/2017 08:17:00 PM  060XX S EBERHART AVE  0810  THEFT                OVER $500                            RESIDENCE             False   False     ...  20.0  42.0            06        NaN           NaN           2017  02/11/2018 03:57:41 PM  NaN       NaN        NaN
4  11227634  JB147599     08/26/2017 10:00:00 AM  001XX W RANDOLPH ST   0281  CRIM SEXUAL ASSAULT  NON-AGGRAVATED                       HOTEL/MOTEL           False   False     ...  42.0  32.0            02        NaN           NaN           2017  02/11/2018 03:57:41 PM  NaN       NaN        NaN

5 rows × 22 columns

fullData.drop_duplicates(subset=['ID', 'Case Number'], inplace=True)
fullData.drop(['Case Number', 'IUCR', 'Updated On', 'Year', 'FBI Code', 'Beat', 'Ward', 'Community Area', 'Location'], axis=1, inplace=True)
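To sanity-check what survives the two drops, and how much memory they free up, something like this can be run right after; just standard pandas calls, nothing specific to this dataset:

# columns that remain after dropping the redundant ones
print(fullData.columns.tolist())

# 'deep' measures object columns for real instead of estimating
fullData.info(memory_usage='deep')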
fullData['Location Description'].describe()
count     7204883
unique        214
top        STREET
freq      1874164
Name: Location Description, dtype: object
fullData['Description'].describe()
count     7212273
unique        532
top        SIMPLE
freq       849119
Name: Description, dtype: object
fullData['Primary Type'].describe()
count     7212273
unique         36
top         THEFT
freq      1522618
Name: Primary Type, dtype: object

Two of these three columns, Location Description and Description, have a lot of unique values, so we only keep the most frequent ones: the 20 values with the highest counts become the major categories for the feature analysis, and everything else is lumped into a catch-all category.

locationDescription20Except = list(fullData['Location Description'].value_counts()[20:].index)
# use .loc to relabel everything outside the top 20 as OTHER
fullData.loc[fullData['Location Description'].isin(locationDescription20Except), fullData.columns=='Location Description'] = 'OTHER'
description20Except = list(fullData['Description'].value_counts()[20:].index)
# use .loc to relabel everything outside the top 20 as OTHER
fullData.loc[fullData['Description'].isin(description20Except), fullData.columns=='Description'] = 'OTHER'
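As a side note, the same top-20-plus-OTHER relabeling can be written more compactly with Series.where; a minimal sketch over the same fullData (the helper name keep_top_n is mine, not from the original code):

def keep_top_n(series, n=20, other='OTHER'):
    # keep the n most frequent labels, map everything else to `other`
    top = series.value_counts().index[:n]
    return series.where(series.isin(top), other)

fullData['Location Description'] = keep_top_n(fullData['Location Description'])
fullData['Description'] = keep_top_n(fullData['Description'])

Unlike the .loc version above, this also folds NaN into OTHER, which may or may not be what you want.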

We already saw in Spark that there are 36 crime types and that the total number of crimes has been decreasing every year since 2001. But that was only a per-year count, so let's try a rolling sum here: each sample point's x-coordinate is a date, and its y-coordinate is the sum of crimes from (that date - 364 days) up to and including that date.

First, convert Date to a datetime.

fullData.Date = pd.to_datetime(fullData.Date, format='%m/%d/%Y %I:%M:%S %p')

resample needs an index to work on; now that Date has been cast, it can simply be used as a DatetimeIndex.

fullData.index = pd.DatetimeIndex(fullData.Date)
fullData.resample('D').size().rolling(365).sum().plot()
plt.xlabel('Days')
plt.ylabel('Crimes Count')
plt.show()


[Figure: 365-day rolling sum of total daily crime counts]

We can see that the rolling sum has been steadily decreasing.

Now let's break the plot down by crime type (Primary Type).

eachCrime = fullData.pivot_table('ID', aggfunc=np.size, columns='Primary Type', index=fullData.index.date, fill_value=0)
eachCrime.index = pd.DatetimeIndex(eachCrime.index)
tmp = eachCrime.rolling(365).sum().plot(figsize=(12, 60), subplots=True, layout=(-1, 2), sharex=False, sharey=False)


[Figure: 365-day rolling sum per Primary Type, one subplot per crime type]
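For what it's worth, the same per-type daily count matrix that pivot_table builds above can also be produced with groupby plus unstack; a rough equivalent, assuming the DatetimeIndex set earlier is still in place (eachCrimeAlt is my name, used only to keep it apart from the original):

# count rows per (day, Primary Type), then spread the types into columns
eachCrimeAlt = (fullData.groupby([fullData.index.date, 'Primary Type'])
                        .size()
                        .unstack(fill_value=0))
eachCrimeAlt.index = pd.DatetimeIndex(eachCrimeAlt.index)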

Here we can see some data that isn't very useful: some crime types have occurred fewer than a thousand times in nearly 20 years. Drop the crime types that are not in the top 20 by count, keep only the top 20, and redo the rolling sum.

Also note that NON-CRIMINAL and NON - CRIMINAL are duplicates of the same category; drop them as well by relabeling both as OTHER.

crime20Except = list(fullData['Primary Type'].value_counts()[20:].index)
fullData.loc[fullData['Primary Type'].isin(crime20Except), fullData.columns=='Primary Type'] = 'OTHER'
fullData.loc[fullData['Primary Type'] == 'NON-CRIMINAL', fullData.columns=='Primary Type'] = 'OTHER'
fullData.loc[fullData['Primary Type'] == 'NON - CRIMINAL', fullData.columns=='Primary Type'] = 'OTHER'
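The two .loc assignments above can also be collapsed into a single replace call; a tiny sketch with the same two labels:

# map both spelling variants of the category onto OTHER in one pass
fullData['Primary Type'] = fullData['Primary Type'].replace(
    {'NON-CRIMINAL': 'OTHER', 'NON - CRIMINAL': 'OTHER'})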
eachCrime = fullData.pivot_table('ID', aggfunc=np.size, columns='Primary Type', index=fullData.index.date, fill_value=0)
eachCrime.index = pd.DatetimeIndex(eachCrime.index)
tmp = eachCrime.rolling(365).sum().plot(figsize=(12, 60), subplots=True, layout=(-1, 2), sharex=False, sharey=False)


[Figure: 365-day rolling sum per Primary Type after collapsing low-count types into OTHER]

With the data cleaned up, it is now obvious that the counts of most of the basic crime types are indeed falling, but two of them, WEAPONS VIOLATION and INTERFERENCE WITH PUBLIC OFFICER, are rising against the trend.
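To double-check that last observation, the two outliers can be plotted on their own; a quick sketch reusing the eachCrime table from above, assuming both labels survived the top-20 cut (the subplots above suggest they did):

# isolate the two categories that appear to be trending upward
rising = eachCrime[['WEAPONS VIOLATION', 'INTERFERENCE WITH PUBLIC OFFICER']]
rising.rolling(365).sum().plot(figsize=(12, 4))
plt.ylabel('Crimes Count (365-day rolling sum)')
plt.show()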

This post is licensed under CC BY 4.0 by the author.