Coronavirus (COVID-19) Visualization & Prediction: Notes

Project link: https://www.kaggle.com/therealcyberlord/coronavirus-covid-19-visualization-prediction

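Summary

A note up front: I expected this project to be complicated, with an elaborate data-processing pipeline, but it turned out to be fairly manageable. Because there are so many figures and tables, I am putting the summary first.

First, it does not use any particularly advanced algorithms, and it does not demand much Python scripting skill.

Second, the part of the analysis most valuable to me is the forecasting of future COVID-19 figures, which uses regression methods including SVM, Polynomial Regression, and Bayesian Ridge Regression, together with the related sklearn functions such as train_test_split and PolynomialFeatures.

Finally, the data visualization part largely matches what the news reports: the United States stands far above every other country.

There are a great many figures and tables; scroll down and browse them if you are interested. Some results are annotated.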

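Main text

Coronaviruses are a family of viruses named for the spiky crown on their surface. The novel coronavirus, SARS-CoV-2, is a contagious respiratory virus first reported in Wuhan, China. On February 11, 2020, the World Health Organization named the disease caused by this novel coronavirus COVID-19. This notebook aims to explore COVID-19 through data analysis and prediction.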

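COVID-19 case data is provided by Johns Hopkins University

COVID-19 mobility data is provided by Apple

Learn more from the World Health Organization

Learn more from the Centers for Disease Control and Prevention

Map visualizations are available on the JHU CSSE Dashboard

Source code: my GitHub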

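Last updated: October 20, 2020, 5:13 PM Eastern Time (data tables updated)

Latest update: daily report data updated for October 20
Time series data updated to 10/19; mobility data updated to 10/19
The prediction models are trained on data starting from March 13, 2020, so results for earlier dates may be inaccurate.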

Contents

  • Exploring Global Coronavirus Cases
  • Exploring Coronavirus Cases From Different Countries
  • Worldwide Confirmed Cases Prediction
  • Data Table
  • Pie Charts
  • Bar Charts
  • US Testing Data
  • Mobility Data for Hotspots
# Import modules; all of them are fairly common
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import pandas as pd
import random
import math
import time
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error
import datetime
import operator
plt.style.use('fivethirtyeight')
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Reading the data
# The files can also be downloaded first and then read locally, as done here
confirmed_df = pd.read_csv('./20201021_data/time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('./20201021_data/time_series_covid19_deaths_global.csv')
recoveries_df = pd.read_csv('./20201021_data/time_series_covid19_recovered_global.csv')
latest_data = pd.read_csv('./20201021_data/10-19-2020-latest.csv')
us_medical_data = pd.read_csv('./20201021_data/10-19-2020-us.csv')
apple_mobility = pd.read_csv('./20201021_data/applemobilitytrends-2020-10-19.csv')
# This file holds global confirmed cases over time
print(confirmed_df.columns)
confirmed_df.head()
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '10/11/20', '10/12/20', '10/13/20', '10/14/20', '10/15/20', '10/16/20',
       '10/17/20', '10/18/20', '10/19/20', '10/20/20'],
      dtype='object', length=277)
   Province/State  Country/Region        Lat       Long  1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  ...  10/11/20  10/12/20  10/13/20  10/14/20  10/15/20  10/16/20  10/17/20  10/18/20  10/19/20  10/20/20
0             NaN     Afghanistan   33.93911  67.709953        0        0        0        0        0        0  ...     39799     39870     39928     39994     40026     40073     40141     40200     40287     40357
1             NaN         Albania   41.15330  20.168300        0        0        0        0        0        0  ...     15399     15570     15752     15955     16212     16501     16774     17055     17350     17651
2             NaN         Algeria   28.03390   1.659600        0        0        0        0        0        0  ...     53072     53325     53399     53584     53777     53998     54203     54402     54616     54829
3             NaN         Andorra   42.50630   1.521800        0        0        0        0        0        0  ...      2696      2995      2995      3190      3190      3377      3377      3377      3623      3623
4             NaN          Angola  -11.20270  17.873900        0        0        0        0        0        0  ...      6366      6488      6680      6846      7096      7222      7462      7622      7829      8049

5 rows × 277 columns

# This file holds global deaths over time
print(deaths_df.columns)
deaths_df.head()
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '10/11/20', '10/12/20', '10/13/20', '10/14/20', '10/15/20', '10/16/20',
       '10/17/20', '10/18/20', '10/19/20', '10/20/20'],
      dtype='object', length=277)
   Province/State  Country/Region        Lat       Long  1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  ...  10/11/20  10/12/20  10/13/20  10/14/20  10/15/20  10/16/20  10/17/20  10/18/20  10/19/20  10/20/20
0             NaN     Afghanistan   33.93911  67.709953        0        0        0        0        0        0  ...      1477      1479      1480      1481      1481      1485      1488      1492      1497      1499
1             NaN         Albania   41.15330  20.168300        0        0        0        0        0        0  ...       420       424       429       434       439       443       448       451       454       458
2             NaN         Algeria   28.03390   1.659600        0        0        0        0        0        0  ...      1801      1809      1818      1827      1827      1841      1846      1856      1865      1873
3             NaN         Andorra   42.50630   1.521800        0        0        0        0        0        0  ...        55        57        57        59        59        59        59        59        62        62
4             NaN          Angola  -11.20270  17.873900        0        0        0        0        0        0  ...       218       219       222       227       228       234       241       247       248       251

5 rows × 277 columns

# This file holds global recoveries over time
print(recoveries_df.columns)
recoveries_df.head()
Index(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20',
       '1/24/20', '1/25/20', '1/26/20', '1/27/20',
       ...
       '10/11/20', '10/12/20', '10/13/20', '10/14/20', '10/15/20', '10/16/20',
       '10/17/20', '10/18/20', '10/19/20', '10/20/20'],
      dtype='object', length=277)
   Province/State  Country/Region        Lat       Long  1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  ...  10/11/20  10/12/20  10/13/20  10/14/20  10/15/20  10/16/20  10/17/20  10/18/20  10/19/20  10/20/20
0             NaN     Afghanistan   33.93911  67.709953        0        0        0        0        0        0  ...     33114     33118     33308     33354     33447     33516     33561     33614     33760     33790
1             NaN         Albania   41.15330  20.168300        0        0        0        0        0        0  ...      9500      9585      9675      9762      9864      9957     10001     10071     10167     10225
2             NaN         Algeria   28.03390   1.659600        0        0        0        0        0        0  ...     37170     37382     37492     37603     37603     37856     37971     38088     38215     38346
3             NaN         Andorra   42.50630   1.521800        0        0        0        0        0        0  ...      1814      1928      1928      2011      2011      2057      2057      2057      2273      2273
4             NaN          Angola  -11.20270  17.873900        0        0        0        0        0        0  ...      2743      2744      2761      2801      2928      3012      3022      3030      3031      3037

5 rows × 277 columns

# Latest COVID-19 statistics by country, as of 2020-10-21
print(latest_data.columns)
latest_data.head()
Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
       'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'Combined_Key', 'Incidence_Rate', 'Case-Fatality_Ratio'],
      dtype='object')
   FIPS  Admin2  Province_State  Country_Region          Last_Update        Lat      Long_  Confirmed  Deaths  Recovered   Active  Combined_Key  Incidence_Rate  Case-Fatality_Ratio
0   NaN     NaN             NaN     Afghanistan  2020-10-20 04:24:22   33.93911  67.709953      40287    1497      33760   5030.0   Afghanistan      103.490154             3.715839
1   NaN     NaN             NaN         Albania  2020-10-20 04:24:22   41.15330  20.168300      17350     454      10167   6729.0       Albania      602.891097             2.616715
2   NaN     NaN             NaN         Algeria  2020-10-20 04:24:22   28.03390   1.659600      54616    1865      38215  14536.0       Algeria      124.548919             3.414750
3   NaN     NaN             NaN         Andorra  2020-10-20 04:24:22   42.50630   1.521800       3623      62       2273   1288.0       Andorra     4689.057141             1.711289
4   NaN     NaN             NaN          Angola  2020-10-20 04:24:22  -11.20270  17.873900       7829     248       3031   4550.0        Angola       23.820776             3.167710
# Latest COVID-19 statistics for the US, as of 2020-10-21
print(us_medical_data.columns)
us_medical_data.head()
Index(['Province_State', 'Country_Region', 'Last_Update', 'Lat', 'Long_',
       'Confirmed', 'Deaths', 'Recovered', 'Active', 'FIPS', 'Incident_Rate',
       'People_Tested', 'People_Hospitalized', 'Mortality_Rate', 'UID', 'ISO3',
       'Testing_Rate', 'Hospitalization_Rate'],
      dtype='object')
   Province_State  Country_Region          Last_Update       Lat      Long_  Confirmed  Deaths  Recovered    Active  FIPS  Incident_Rate  People_Tested  People_Hospitalized  Mortality_Rate       UID  ISO3  Testing_Rate  Hospitalization_Rate
0         Alabama              US  2020-10-20 04:30:29   32.3182   -86.9023     173485    2789    74238.0   96458.0   1.0    3538.210367      1260100.0                  NaN        1.607632  84000001   USA  25699.621776                   NaN
1          Alaska              US  2020-10-20 04:30:29   61.3707  -152.4044      11182      67     6516.0    4599.0   2.0    1528.545749       536223.0                  NaN        0.599177  84000002   USA  73300.070399                   NaN
2  American Samoa              US  2020-10-20 04:30:29  -14.2710  -170.1320          0       0        NaN       0.0  60.0       0.000000         1616.0                  NaN             NaN        16   ASM   2904.333136                   NaN
3         Arizona              US  2020-10-20 04:30:29   33.7298  -111.4312     231897    5830    38553.0  187514.0   4.0    3185.959833      1639785.0                  NaN        2.514047  84000004   USA  22528.489568                   NaN
4        Arkansas              US  2020-10-20 04:30:29   34.9697   -92.3731      99597    1714    89217.0    8666.0   5.0    3300.313738      1223914.0                  NaN        1.720935  84000005   USA  40556.444355                   NaN
## Apple mobility data
print(apple_mobility.columns)
apple_mobility.head()
Index(['geo_type', 'region', 'transportation_type', 'alternative_name',
       'sub-region', 'country', '2020-01-13', '2020-01-14', '2020-01-15',
       '2020-01-16',
       ...
       '2020-10-10', '2020-10-11', '2020-10-12', '2020-10-13', '2020-10-14',
       '2020-10-15', '2020-10-16', '2020-10-17', '2020-10-18', '2020-10-19'],
      dtype='object', length=287)
         geo_type     region  transportation_type  alternative_name  sub-region  country  2020-01-13  2020-01-14  2020-01-15  2020-01-16  ...  2020-10-10  2020-10-11  2020-10-12  2020-10-13  2020-10-14  2020-10-15  2020-10-16  2020-10-17  2020-10-18  2020-10-19
0  country/region    Albania              driving               NaN         NaN      NaN       100.0       95.30      101.43       97.20  ...      144.47      148.87      123.94      111.80      113.31      111.52      117.39      128.99      137.88      114.93
1  country/region    Albania              walking               NaN         NaN      NaN       100.0      100.68       98.93       98.46  ...      167.16      142.52      150.36      141.02      155.39      134.41      142.26      142.22      125.67      149.77
2  country/region  Argentina              driving               NaN         NaN      NaN       100.0       97.07      102.45      111.21  ...       79.72       49.19       49.69       61.16       65.26       68.45       84.94       88.93       48.76       52.73
3  country/region  Argentina              walking               NaN         NaN      NaN       100.0       95.11      101.37      112.67  ...       62.15       36.57       43.66       52.51       56.79       55.10       69.59       62.42       34.40       42.97
4  country/region  Australia              driving                AU         NaN      NaN       100.0      102.98      104.21      108.63  ...       83.24       88.85       91.45       93.17       96.06      104.24       99.65       85.42       92.72       94.60

5 rows × 287 columns

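Getting the outbreak dates
# Extract the time-series columns from the confirmed, deaths, and recoveries data: the fifth column (index 4) through the last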
cols = confirmed_df.keys()
confirmed = confirmed_df.loc[:, cols[4]:cols[-1]]
deaths = deaths_df.loc[:, cols[4]:cols[-1]]
recoveries = recoveries_df.loc[:, cols[4]:cols[-1]]
# Compute cumulative worldwide confirmed cases, deaths, and recoveries for plotting
dates = confirmed.keys()
world_cases = []
total_deaths = []
mortality_rate = []
recovery_rate = []
total_recovered = []
total_active = []

for i in dates:
    confirmed_sum = confirmed[i].sum()
    death_sum = deaths[i].sum()
    recovered_sum = recoveries[i].sum()

    # confirmed, deaths, recovered, and active
    world_cases.append(confirmed_sum)
    total_deaths.append(death_sum)
    total_recovered.append(recovered_sum)
    total_active.append(confirmed_sum-death_sum-recovered_sum)

    # calculate rates
    mortality_rate.append(death_sum/confirmed_sum)
    recovery_rate.append(recovered_sum/confirmed_sum)
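The per-date loop above can also be expressed with column-wise sums. Here is a minimal sketch on a made-up three-date table. The toy frames and their values are assumptions for illustration only, not the notebook's data, and the sketch uses nothing beyond `DataFrame.sum(axis=0)` and elementwise arithmetic.

```python
import pandas as pd

# Toy stand-ins for the `confirmed`, `deaths`, and `recoveries` frames
# (hypothetical values; one row per region, one column per date).
confirmed = pd.DataFrame({'1/22/20': [10, 0], '1/23/20': [20, 5], '1/24/20': [40, 10]})
deaths = pd.DataFrame({'1/22/20': [1, 0], '1/23/20': [2, 0], '1/24/20': [4, 1]})
recoveries = pd.DataFrame({'1/22/20': [0, 0], '1/23/20': [3, 0], '1/24/20': [10, 5]})

# Summing over rows gives one worldwide total per date column.
world_cases = confirmed.sum(axis=0)
total_deaths = deaths.sum(axis=0)
total_recovered = recoveries.sum(axis=0)

# Active cases and the two rates follow elementwise.
total_active = world_cases - total_deaths - total_recovered
mortality_rate = total_deaths / world_cases
recovery_rate = total_recovered / world_cases

print(total_active.tolist())
print(mortality_rate.round(3).tolist())
```

The results are pandas Series indexed by date, so `.tolist()` or `.to_numpy()` recovers plain sequences like the ones the loop builds.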
# Compute daily increases and weekly moving averages for the series above (confirmed, deaths, recovered, and active)
def daily_increase(data):
    d = []
    for i in range(len(data)):
        if i == 0:
            d.append(data[0])
        else:
            d.append(data[i]-data[i-1])
    return d

def moving_average(data, window_size):
    moving_average = []
    for i in range(len(data)):
        if i + window_size < len(data):
            moving_average.append(np.mean(data[i:i+window_size]))
        else:
            moving_average.append(np.mean(data[i:len(data)]))
    return moving_average

# window size
window = 7

# confirmed cases
world_daily_increase = daily_increase(world_cases)
world_confirmed_avg= moving_average(world_cases, window)
world_daily_increase_avg = moving_average(world_daily_increase, window)

# deaths
world_daily_death = daily_increase(total_deaths)
world_death_avg = moving_average(total_deaths, window)
world_daily_death_avg = moving_average(world_daily_death, window)


# recoveries
world_daily_recovery = daily_increase(total_recovered)
world_recovery_avg = moving_average(total_recovered, window)
world_daily_recovery_avg = moving_average(world_daily_recovery, window)


# active
world_active_avg = moving_average(total_active, window)
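One detail worth noticing: `moving_average` above averages each value with the values that follow it (`data[i:i+window_size]`), so it is a forward-looking window rather than a trailing one. The sketch below reproduces both helpers with NumPy and pandas on a toy series. The function names `daily_increase_np` and `forward_moving_average` are my own, hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def daily_increase_np(data):
    """First value kept as-is, then day-over-day differences."""
    return np.diff(np.asarray(data, dtype=float), prepend=0.0)

def forward_moving_average(data, window_size):
    """Mean of data[i:i+window_size], truncated at the end of the series."""
    s = pd.Series(np.asarray(data, dtype=float))
    # pandas rolling windows look backward, so reverse, roll, and reverse again.
    return s[::-1].rolling(window_size, min_periods=1).mean()[::-1].to_numpy()

cumulative = [1, 3, 6, 10, 15]  # toy cumulative counts, not real data
print(daily_increase_np(cumulative))
print(forward_moving_average(cumulative, 2))
```

If a trailing average is what you actually want for the plots, dropping the two reversals gives the usual backward-looking window.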
# Count days starting from 1/22 and reshape the data into n×1 column arrays
days_since_1_22 = np.array([i for i in range(len(dates))]).reshape(-1, 1)
world_cases = np.array(world_cases).reshape(-1, 1)
total_deaths = np.array(total_deaths).reshape(-1, 1)
total_recovered = np.array(total_recovered).reshape(-1, 1)
print(days_since_1_22.shape)
(273, 1)
# Future forecasting: build day indices that extend 10 days past the data, still counting from 1/22
days_in_future = 10
future_forcast = np.array([i for i in range(len(dates)+days_in_future)]).reshape(-1, 1)
adjusted_dates = future_forcast[:-10]
# Convert the integer day indices from the previous step into date strings for visualization, using the datetime module
start = '1/22/2020'
start_date = datetime.datetime.strptime(start, '%m/%d/%Y')
future_forcast_dates = []
for i in range(len(future_forcast)):
    future_forcast_dates.append((start_date + datetime.timedelta(days=i)).strftime('%m/%d/%Y'))
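The same list of date strings can be produced in one call with `pandas.date_range`. This is a small sketch, assuming the 273 observed days and the 10-day horizon shown above.

```python
import pandas as pd

n_days = 273 + 10  # observed days plus the 10-day forecast horizon

# One calendar day per index, formatted the same way as in the loop above.
future_dates = (
    pd.date_range(start='2020-01-22', periods=n_days, freq='D')
    .strftime('%m/%d/%Y')
    .tolist()
)

print(future_dates[0], future_dates[-10], future_dates[-1])
```

The last ten entries are the forecast dates, which is what the prediction tables later slice out with `[-10:]`.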
# The actual prediction setup: train_test_split (sklearn.model_selection) on worldwide confirmed cases from day index 50 onward, with a test fraction of 0.15 and no shuffling
X_train_confirmed, X_test_confirmed, y_train_confirmed, y_test_confirmed = train_test_split(days_since_1_22[50:], world_cases[50:], test_size=0.15, shuffle=False)
print(X_train_confirmed.shape,X_test_confirmed.shape)
(189, 1) (34, 1)
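To see what `shuffle=False` does here, the sketch below repeats the split on a placeholder target (the squared day index, an assumption that stands in for the real case counts). With shuffling off, the test set is simply the final 15% of the days, which is what a time series needs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

days = np.arange(273).reshape(-1, 1)   # day indices 0..272 as a column
cases = (days ** 2).astype(float)      # placeholder target, not real data

X_train, X_test, y_train, y_test = train_test_split(
    days[50:], cases[50:], test_size=0.15, shuffle=False)

print(X_train.shape, X_test.shape)
# Every training day comes before every test day.
print(X_train[-1, 0], X_test[0, 0])
```

The 223 remaining days split into 189 training days and 34 test days, matching the shapes printed above.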
# Use support vector machine, Bayesian ridge, and linear regression models to predict confirmed cases
# SVM (sklearn.svm)
svm_confirmed = SVR(shrinking=True, kernel='poly',gamma=0.01, epsilon=1,degree=3, C=0.1) # define the model
svm_confirmed.fit(X_train_confirmed, y_train_confirmed) # train
svm_pred = svm_confirmed.predict(future_forcast) # predict
# Plot the result to take a look
# mean_absolute_error (sklearn.metrics) computes the mean absolute error (MAE); see https://blog.csdn.net/StupidAutofan/article/details/79556087
svm_test_pred = svm_confirmed.predict(X_test_confirmed)
plt.plot(y_test_confirmed)
plt.plot(svm_test_pred)
plt.legend(['Test Data', 'SVM Predictions'])
print('MAE:', mean_absolute_error(svm_test_pred, y_test_confirmed))
print('MSE:',mean_squared_error(svm_test_pred, y_test_confirmed))
MAE: 4400479.84939107
MSE: 21084654795537.24
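The MSE looks enormous next to the MAE only because it is in squared units; its square root here is roughly 4.6 million, the same scale as the MAE. A tiny worked example with made-up numbers (not taken from the notebook) shows how the two metrics are computed.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration only.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_pred - y_true
mae_manual = np.mean(np.abs(errors))  # (10 + 10 + 30) / 3
mse_manual = np.mean(errors ** 2)     # (100 + 100 + 900) / 3

# scikit-learn expects the true values first, then the predictions.
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

print(mae, mse, np.sqrt(mse))
```

Both metrics are symmetric in their two arguments, so the swapped argument order used in the notebook does not change the numbers.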
# transform our data for polynomial regression
# This step generates feature terms with different exponents
# It uses sklearn.preprocessing.PolynomialFeatures; with two features a and b, the degree-2 expansion is (1, a, b, a^2, ab, b^2)
poly = PolynomialFeatures(degree=4)
poly_X_train_confirmed = poly.fit_transform(X_train_confirmed) # powers 0 through 4 of the single feature x
# fit on the training data only, then reuse the fitted transformer
poly_X_test_confirmed = poly.transform(X_test_confirmed)
poly_future_forcast = poly.transform(future_forcast)

# Bayesian ridge
bayesian_poly = PolynomialFeatures(degree=5)
bayesian_poly_X_train_confirmed = bayesian_poly.fit_transform(X_train_confirmed)
bayesian_poly_X_test_confirmed = bayesian_poly.transform(X_test_confirmed)
bayesian_poly_future_forcast = bayesian_poly.transform(future_forcast)
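To make the feature expansion concrete, here is a short sketch that runs `PolynomialFeatures` on a single sample. The inputs 2 and 3 are arbitrary values chosen so the powers are easy to read off.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One feature x = 2, degree 4: columns are x^0, x^1, x^2, x^3, x^4.
one_feature = PolynomialFeatures(degree=4).fit_transform(np.array([[2.0]]))

# Two features (a, b) = (2, 3), degree 2: columns are 1, a, b, a^2, ab, b^2.
two_features = PolynomialFeatures(degree=2).fit_transform(np.array([[2.0, 3.0]]))

print(one_feature)
print(two_features)
```

In the notebook the single feature is the day index, so the linear model that follows is fitting a degree-4 (or degree-5) polynomial in time.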
# polynomial regression
# Fit the model and predict

linear_model = LinearRegression(fit_intercept=False)
linear_model.fit(poly_X_train_confirmed, y_train_confirmed)
test_linear_pred = linear_model.predict(poly_X_test_confirmed)
linear_pred = linear_model.predict(poly_future_forcast)
print('MAE:', mean_absolute_error(test_linear_pred, y_test_confirmed))
print('MSE:',mean_squared_error(test_linear_pred, y_test_confirmed))
MAE: 1125828.6734114974
MSE: 2107924255435.714
# Plot it; this looks somewhat better than the plain SVR above
plt.plot(y_test_confirmed)
plt.plot(test_linear_pred)
plt.legend(['Test Data', 'Polynomial Regression Predictions'])

# bayesian ridge polynomial regression
tol = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
alpha_1 = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
alpha_2 = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
lambda_1 = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
lambda_2 = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
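# normalize is accepted only by scikit-learn versions before 1.2; omit it from the grid on newer versions
normalize = [True, False]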

bayesian_grid = {'tol': tol, 'alpha_1': alpha_1, 'alpha_2' : alpha_2, 'lambda_1': lambda_1, 'lambda_2' : lambda_2,
                 'normalize' : normalize}

bayesian = BayesianRidge(fit_intercept=False)
bayesian_search = RandomizedSearchCV(bayesian, bayesian_grid, scoring='neg_mean_squared_error', cv=3, return_train_score=True, n_jobs=-1, n_iter=40, verbose=1)
bayesian_search.fit(bayesian_poly_X_train_confirmed, y_train_confirmed)
Fitting 3 folds for each of 40 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    1.2s finished





RandomizedSearchCV(cv=3, estimator=BayesianRidge(fit_intercept=False),
                   n_iter=40, n_jobs=-1,
                   param_distributions={'alpha_1': [1e-07, 1e-06, 1e-05, 0.0001,
                                                    0.001],
                                        'alpha_2': [1e-07, 1e-06, 1e-05, 0.0001,
                                                    0.001],
                                        'lambda_1': [1e-07, 1e-06, 1e-05,
                                                     0.0001, 0.001],
                                        'lambda_2': [1e-07, 1e-06, 1e-05,
                                                     0.0001, 0.001],
                                        'normalize': [True, False],
                                        'tol': [1e-06, 1e-05, 0.0001, 0.001,
                                                0.01]},
                   return_train_score=True, scoring='neg_mean_squared_error',
                   verbose=1)
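After the search finishes, the chosen hyperparameters and the cross-validated score can be read from `best_params_` and `best_score_`. The sketch below runs a much smaller search on synthetic data. The data, the reduced grid, and `random_state=0` are all assumptions made so the example is quick and repeatable.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# A reduced grid over a few of the same hyperparameters.
param_grid = {
    'tol': [1e-6, 1e-4, 1e-2],
    'alpha_1': [1e-7, 1e-5, 1e-3],
    'lambda_1': [1e-7, 1e-5, 1e-3],
}

search = RandomizedSearchCV(
    BayesianRidge(fit_intercept=False), param_grid,
    scoring='neg_mean_squared_error', cv=3, n_iter=5, random_state=0)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # the score is negated MSE, so flip the sign
```

For time-ordered data like the case counts, plain 3-fold cross-validation mixes earlier and later days across folds, so the score is best read as a rough guide.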
# Plot again in the same way; not bad
bayesian_confirmed = bayesian_search.best_estimator_
test_bayesian_pred = bayesian_confirmed.predict(bayesian_poly_X_test_confirmed)
bayesian_pred = bayesian_confirmed.predict(bayesian_poly_future_forcast)
print('MAE:', mean_absolute_error(test_bayesian_pred, y_test_confirmed))
print('MSE:',mean_squared_error(test_bayesian_pred, y_test_confirmed))

plt.plot(y_test_confirmed)
plt.plot(test_bayesian_pred)
plt.legend(['Test Data', 'Bayesian Ridge Polynomial Predictions'])
MAE: 607921.3030146289
MSE: 390856736680.26746






Graphing the number of confirmed cases, active cases, deaths, recoveries, mortality rate (CFR), and recovery rate

# helper method for flattening the data, so it can be displayed on a bar graph
# np.ndarray.reshape or np.ndarray.flatten could probably be used directly instead
def flatten(arr):
    a = []
    arr = arr.tolist()
    for i in arr:
        a.append(i[0])
    return a
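As the comment suggests, NumPy can do this directly. A quick sketch comparing the helper with `ndarray.ravel()` on a toy column array (the values are arbitrary):

```python
import numpy as np

def flatten(arr):
    # Same behavior as the helper above: take the first element of each row.
    return [row[0] for row in arr.tolist()]

column = np.array([[1], [2], [3]])  # an n×1 array like world_cases

flat_helper = flatten(column)
flat_numpy = column.ravel().tolist()

print(flat_helper, flat_numpy)
```

For an n×1 array the two agree; `ravel()` would differ only if the array had more than one column, because the helper keeps just the first.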
# Plotting; nothing special here, all basic matplotlib
adjusted_dates = adjusted_dates.reshape(1, -1)[0] # flatten to 1-D
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, world_cases)
plt.plot(adjusted_dates, world_confirmed_avg, linestyle='dashed', color='orange')
plt.title('# of Coronavirus Cases Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Worldwide Coronavirus Cases', 'Moving Average {} Days'.format(window)], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()

plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, total_deaths)
plt.plot(adjusted_dates, world_death_avg, linestyle='dashed', color='orange')
plt.title('# of Coronavirus Deaths Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Worldwide Coronavirus Deaths', 'Moving Average {} Days'.format(window)], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()

plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, total_recovered)
plt.plot(adjusted_dates, world_recovery_avg, linestyle='dashed', color='orange')
plt.title('# of Coronavirus Recoveries Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Worldwide Coronavirus Recoveries', 'Moving Average {} Days'.format(window)], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()

plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, total_active)
plt.plot(adjusted_dates, world_active_avg, linestyle='dashed', color='orange')
plt.title('# of Coronavirus Active Cases Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Active Cases', size=30)
plt.legend(['Worldwide Coronavirus Active Cases', 'Moving Average {} Days'.format(window)], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()


plt.figure(figsize=(16, 10))
plt.bar(adjusted_dates, world_daily_increase)
plt.plot(adjusted_dates, world_daily_increase_avg, color='orange', linestyle='dashed')
plt.title('World Daily Increases in Confirmed Cases', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Moving Average {} Days'.format(window), 'World Daily Increase in COVID-19 Cases'], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()

plt.figure(figsize=(16, 10))
plt.bar(adjusted_dates, world_daily_death)
plt.plot(adjusted_dates, world_daily_death_avg, color='orange', linestyle='dashed')
plt.title('World Daily Increases in Confirmed Deaths', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Moving Average {} Days'.format(window), 'World Daily Increase in COVID-19 Deaths'], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()

plt.figure(figsize=(16, 10))
plt.bar(adjusted_dates, world_daily_recovery)
plt.plot(adjusted_dates, world_daily_recovery_avg, color='orange', linestyle='dashed')
plt.title('World Daily Increases in Confirmed Recoveries', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.legend(['Moving Average {} Days'.format(window), 'World Daily Increase in COVID-19 Recoveries'], prop={'size': 20})
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()


plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, np.log10(world_cases))
plt.title('Log of # of Coronavirus Cases Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()

plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, np.log10(total_deaths))
plt.title('Log of # of Coronavirus Deaths Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()

plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, np.log10(total_recovered))
plt.title('Log of # of Coronavirus Recoveries Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()


# Plots for a single country
def country_plot(x, y1, y2, y3, y4, country):
    # window is set to 7 earlier in the notebook
    confirmed_avg = moving_average(y1, window)
    confirmed_increase_avg = moving_average(y2, window)
    death_increase_avg = moving_average(y3, window)
    recovery_increase_avg = moving_average(y4, window)

    plt.figure(figsize=(16, 10))
    plt.plot(x, y1)
    plt.plot(x, confirmed_avg, color='red', linestyle='dashed')
    plt.legend(['{} Confirmed Cases'.format(country), 'Moving Average {} Days'.format(window)], prop={'size': 20})
    plt.title('{} Confirmed Cases'.format(country), size=30)
    plt.xlabel('Days Since 1/22/2020', size=30)
    plt.ylabel('# of Cases', size=30)
    plt.xticks(size=20)
    plt.yticks(size=20)
    plt.show()

    plt.figure(figsize=(16, 10))
    plt.bar(x, y2)
    plt.plot(x, confirmed_increase_avg, color='red', linestyle='dashed')
    plt.legend(['Moving Average {} Days'.format(window), '{} Daily Increase in Confirmed Cases'.format(country)], prop={'size': 20})
    plt.title('{} Daily Increases in Confirmed Cases'.format(country), size=30)
    plt.xlabel('Days Since 1/22/2020', size=30)
    plt.ylabel('# of Cases', size=30)
    plt.xticks(size=20)
    plt.yticks(size=20)
    plt.show()

    plt.figure(figsize=(16, 10))
    plt.bar(x, y3)
    plt.plot(x, death_increase_avg, color='red', linestyle='dashed')
    plt.legend(['Moving Average {} Days'.format(window), '{} Daily Increase in Confirmed Deaths'.format(country)], prop={'size': 20})
    plt.title('{} Daily Increases in Deaths'.format(country), size=30)
    plt.xlabel('Days Since 1/22/2020', size=30)
    plt.ylabel('# of Cases', size=30)
    plt.xticks(size=20)
    plt.yticks(size=20)
    plt.show()

    plt.figure(figsize=(16, 10))
    plt.bar(x, y4)
    plt.plot(x, recovery_increase_avg, color='red', linestyle='dashed')
    plt.legend(['Moving Average {} Days'.format(window), '{} Daily Increase in Confirmed Recoveries'.format(country)], prop={'size': 20})
    plt.title('{} Daily Increases in Recoveries'.format(country), size=30)
    plt.xlabel('Days Since 1/22/2020', size=30)
    plt.ylabel('# of Cases', size=30)
    plt.xticks(size=20)
    plt.yticks(size=20)
    plt.show()

# helper function for getting country's cases, deaths, and recoveries
def get_country_info(country_name):
    country_cases = []
    country_deaths = []
    country_recoveries = []

    for i in dates:
        country_cases.append(confirmed_df[confirmed_df['Country/Region']==country_name][i].sum())
        country_deaths.append(deaths_df[deaths_df['Country/Region']==country_name][i].sum())
        country_recoveries.append(recoveries_df[recoveries_df['Country/Region']==country_name][i].sum())
    return (country_cases, country_deaths, country_recoveries)
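`get_country_info` filters the frame once per date, which gets slow with many countries. A `groupby` over `Country/Region` computes every country's series in one pass. This is a sketch on a toy frame with the same column layout; the rows and numbers are made up for illustration.

```python
import pandas as pd

# Toy frame shaped like confirmed_df (hypothetical rows and values).
confirmed_df = pd.DataFrame({
    'Province/State': ['Hubei', 'Beijing', None],
    'Country/Region': ['China', 'China', 'US'],
    'Lat': [30.97, 40.18, 40.0],
    'Long': [112.27, 116.41, -100.0],
    '1/22/20': [444, 14, 1],
    '1/23/20': [444, 22, 1],
})

# The date columns start at index 4, as in the notebook.
date_cols = list(confirmed_df.columns[4:])

# One row per country, summed over its provinces, for every date at once.
by_country = confirmed_df.groupby('Country/Region')[date_cols].sum()
china_cases = by_country.loc['China'].tolist()

print(china_cases)
```

The same call on `deaths_df` and `recoveries_df` would give the other two series the helper returns.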


def country_visualizations(country_name):
    country_info = get_country_info(country_name)
    country_cases = country_info[0]
    country_deaths = country_info[1]
    country_recoveries = country_info[2]

    country_daily_increase = daily_increase(country_cases)
    country_daily_death = daily_increase(country_deaths)
    country_daily_recovery = daily_increase(country_recoveries)

    country_plot(adjusted_dates, country_cases, country_daily_increase, country_daily_death, country_daily_recovery, country_name)
# country_cases, country_deaths, and country_recoveries for each country; the full list is too long, so only the US and China are shown here
# countries = ['US', 'Russia', 'India', 'Brazil', 'South Africa', 'China', 'Italy',
#              'Germany', 'Spain', 'France', 'United Kingdom', 'Peru', 'Mexico', 'Colombia', 'Saudi Arabia', 'Iran', 'Bangladesh',
#             'Pakistan', 'Turkey', 'Philippines', 'Iraq', 'Indonesia', 'Israel', 'Ukraine', 'Ecuador', 'Bolivia', 'Netherlands']
countries = ['US','China']
for country in countries:
    country_visualizations(country)


# Country Comparison
# removed redundant code
# Compare the following countries
compare_countries = ['US', 'Brazil', 'India', 'Russia', 'South Africa']
graph_name = ['Coronavirus Confirmed Cases', 'Coronavirus Confirmed Deaths', 'Coronavirus Confirmed Recoveries']

for num in range(3):
    plt.figure(figsize=(16, 10))
    for country in compare_countries:
        plt.plot(get_country_info(country)[num])
    plt.legend(compare_countries, prop={'size': 20})
    plt.xlabel('Days Since 1/22/2020', size=30)
    plt.ylabel('# of Cases', size=30)
    plt.title(graph_name[num], size=30)
    plt.xticks(size=20)
    plt.yticks(size=20)
    plt.show()


## Predictions for confirmed coronavirus cases worldwide
## Plotting the predictions
def plot_predictions(x, y, pred, algo_name, color):
    plt.figure(figsize=(16, 10))
    plt.plot(x, y)
    plt.plot(future_forcast, pred, linestyle='dashed', color=color)
    plt.title('Worldwide Coronavirus Cases Over Time', size=30)
    plt.xlabel('Days Since 1/22/2020', size=30)
    plt.ylabel('# of Cases', size=30)
    plt.legend(['Confirmed Cases', algo_name], prop={'size': 20})
    plt.xticks(size=20)
    plt.yticks(size=20)
    plt.show()

plot_predictions(adjusted_dates, world_cases, svm_pred, 'SVM Predictions', 'purple')



plot_predictions(adjusted_dates, world_cases, linear_pred, 'Polynomial Regression Predictions', 'orange')
plot_predictions(adjusted_dates, world_cases, bayesian_pred, 'Bayesian Ridge Regression Predictions', 'green')


# Future predictions using SVM
svm_df = pd.DataFrame({'Date': future_forcast_dates[-10:], 'SVM Predicted # of Confirmed Cases Worldwide': np.round(svm_pred[-10:])})
svm_df.style.background_gradient(cmap='Reds')
         Date  SVM Predicted # of Confirmed Cases Worldwide
0  10/21/2020                               47874251.000000
1  10/22/2020                               48390574.000000
2  10/23/2020                               48910680.000000
3  10/24/2020                               49434582.000000
4  10/25/2020                               49962294.000000
5  10/26/2020                               50493831.000000
6  10/27/2020                               51029205.000000
7  10/28/2020                               51568431.000000
8  10/29/2020                               52111522.000000
9  10/30/2020                               52658492.000000
# Future predictions using polynomial regression
linear_pred = linear_pred.reshape(1,-1)[0]
linear_df = pd.DataFrame({'Date': future_forcast_dates[-10:], 'Polynomial Predicted # of Confirmed Cases Worldwide': np.round(linear_pred[-10:])})
linear_df.style.background_gradient(cmap='Reds')
         Date  Polynomial Predicted # of Confirmed Cases Worldwide
0  10/21/2020                                      37750457.000000
1  10/22/2020                                      37917642.000000
2  10/23/2020                                      38080320.000000
3  10/24/2020                                      38238382.000000
4  10/25/2020                                      38391715.000000
5  10/26/2020                                      38540207.000000
6  10/27/2020                                      38683745.000000
7  10/28/2020                                      38822215.000000
8  10/29/2020                                      38955502.000000
9  10/30/2020                                      39083490.000000
# Future predictions using Bayesian Ridge
bayesian_df = pd.DataFrame({'Date': future_forcast_dates[-10:], 'Bayesian Ridge Predicted # of Confirmed Cases Worldwide': np.round(bayesian_pred[-10:])})
bayesian_df.style.background_gradient(cmap='Reds')
         Date  Bayesian Ridge Predicted # of Confirmed Cases Worldwide
0  10/21/2020                                          41655203.000000
1  10/22/2020                                          42000051.000000
2  10/23/2020                                          42345665.000000
3  10/24/2020                                          42692026.000000
4  10/25/2020                                          43039116.000000
5  10/26/2020                                          43386918.000000
6  10/27/2020                                          43735412.000000
7  10/28/2020                                          44084580.000000
8  10/29/2020                                          44434403.000000
9  10/30/2020                                          44784863.000000
# Mortality rate over time, with its mean drawn as a dashed line
mean_mortality_rate = np.mean(mortality_rate)
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, mortality_rate, color='orange')
plt.axhline(y = mean_mortality_rate,linestyle='--', color='black')
plt.title('Worldwide Mortality Rate of Coronavirus Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('Case Mortality Rate', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()



# Recovery rate over time
mean_recovery_rate = np.mean(recovery_rate)
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, recovery_rate, color='blue')
plt.title('Worldwide Recovery Rate of Coronavirus Over Time', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('Case Recovery Rate', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()



# Comparing the two: total deaths and total recoveries over time
plt.figure(figsize=(16, 10))
plt.plot(adjusted_dates, total_deaths, color='r')
plt.plot(adjusted_dates, total_recovered, color='green')
plt.legend(['death', 'recoveries'], loc='best', fontsize=25)
plt.title('Worldwide Coronavirus Cases', size=30)
plt.xlabel('Days Since 1/22/2020', size=30)
plt.ylabel('# of Cases', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()



# Comparing the two: deaths plotted against recoveries
plt.figure(figsize=(16, 10))
plt.plot(total_recovered, total_deaths)
plt.title('# of Coronavirus Deaths vs. # of Coronavirus Recoveries', size=30)
plt.xlabel('# of Coronavirus Recoveries', size=30)
plt.ylabel('# of Coronavirus Deaths', size=30)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()



# Collect confirmed-case information for each country/region
unique_countries =  list(latest_data['Country_Region'].unique())

country_confirmed_cases = []
country_death_cases = []
country_active_cases = []
country_recovery_cases = []
country_incidence_rate = []
country_mortality_rate = []

no_cases = []
for i in unique_countries:
    cases = latest_data[latest_data['Country_Region']==i]['Confirmed'].sum()
    if cases > 0:
        country_confirmed_cases.append(cases)
    else:
        no_cases.append(i)

for i in no_cases:
    unique_countries.remove(i)

# sort countries by the number of confirmed cases
unique_countries = [k for k, v in sorted(zip(unique_countries, country_confirmed_cases), key=operator.itemgetter(1), reverse=True)]
for i in range(len(unique_countries)):
    country_confirmed_cases[i] = latest_data[latest_data['Country_Region']==unique_countries[i]]['Confirmed'].sum()
    country_death_cases.append(latest_data[latest_data['Country_Region']==unique_countries[i]]['Deaths'].sum())
    country_recovery_cases.append(latest_data[latest_data['Country_Region']==unique_countries[i]]['Recovered'].sum())
    country_active_cases.append(latest_data[latest_data['Country_Region']==unique_countries[i]]['Active'].sum())
    country_incidence_rate.append(latest_data[latest_data['Country_Region']==unique_countries[i]]['Incidence_Rate'].sum())
    country_mortality_rate.append(country_death_cases[i]/country_confirmed_cases[i])
country_df = pd.DataFrame({'Country Name': unique_countries, 'Number of Confirmed Cases': country_confirmed_cases,
                          'Number of Deaths': country_death_cases, 'Number of Recoveries' : country_recovery_cases,
                          'Number of Active Cases' : country_active_cases, 'Incidence Rate' : country_incidence_rate,
                          'Mortality Rate': country_mortality_rate})
# number of cases per country/region

country_df.style.background_gradient(cmap='Oranges')
      Country Name  Number of Confirmed Cases  Number of Deaths  Number of Recoveries  Number of Active Cases  Incidence Rate  Mortality Rate
 0              US                    8212981            220119               3272603          4720260.000000  7590091.746066        0.026801
 1           India                    7597063            115197               6733328           748538.000000    26859.671287        0.015163
 2          Brazil                    5250727            154176               4526393           570158.000000    90720.527859        0.029363
 3          Russia                    1406667             24205               1070920           311542.000000    83205.304230        0.017207
 4       Argentina                    1002662             26716                803965           171981.000000     2218.486032        0.026645
 5           Spain                     974449             33992                150376           790081.000000    38939.187511        0.034883
 6        Colombia                     965883             29102                867961            68820.000000    55190.061088        0.030130
 7          France                     952600             33647                109611           809342.000000    13307.247668        0.035321
 8            Peru                     868675             33759                784056            50860.000000    65217.342575        0.038863
 9          Mexico                     854926             86338                727759            40829.000000    22262.251352        0.100989
10  United Kingdom                     744122             43816                  2613           697693.000000    10559.859426        0.058883
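The country table can also be built with a single `groupby` instead of repeated boolean filters. Below is a minimal sketch on a toy `latest_data` frame (the countries A, B, C and their numbers are made up). It leaves out `Incidence_Rate` on purpose: adding up per-region rates, as the loop above does, gives a sum of rates rather than a national rate, which appears to be why the US value in the table is so large.

```python
import pandas as pd

# Toy stand-in for latest_data (hypothetical rows; country A has two regions).
latest_data = pd.DataFrame({
    'Country_Region': ['A', 'A', 'B', 'C'],
    'Confirmed': [100, 50, 400, 0],
    'Deaths': [5, 1, 8, 0],
    'Recovered': [60, 20, 300, 0],
    'Active': [35, 29, 92, 0],
})

count_cols = ['Confirmed', 'Deaths', 'Recovered', 'Active']

country_df = (
    latest_data.groupby('Country_Region')[count_cols].sum()
    .query('Confirmed > 0')                       # drop countries with no cases
    .sort_values('Confirmed', ascending=False)    # largest outbreaks first
)
country_df['Mortality Rate'] = country_df['Deaths'] / country_df['Confirmed']

print(country_df)
```

Grouping by `Province_State` instead would give the province-level table in the same way.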
unique_provinces =  list(latest_data['Province_State'].unique())
province_confirmed_cases = []
province_country = []
province_death_cases = []
# province_recovery_cases = []
province_active = []
province_incidence_rate = []
province_mortality_rate = []

no_cases = []
for i in unique_provinces:
    cases = latest_data[latest_data['Province_State']==i]['Confirmed'].sum()
    if cases > 0:
        province_confirmed_cases.append(cases)
    else:
        no_cases.append(i)

# remove areas with no confirmed cases
for i in no_cases:
    unique_provinces.remove(i)

unique_provinces = [k for k, v in sorted(zip(unique_provinces, province_confirmed_cases), key=operator.itemgetter(1), reverse=True)]
for i in range(len(unique_provinces)):
    province_confirmed_cases[i] = latest_data[latest_data['Province_State']==unique_provinces[i]]['Confirmed'].sum()
    province_country.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Country_Region'].unique()[0])
    province_death_cases.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Deaths'].sum())
#     province_recovery_cases.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Recovered'].sum())
    province_active.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Active'].sum())
    province_incidence_rate.append(latest_data[latest_data['Province_State']==unique_provinces[i]]['Incidence_Rate'].sum())
    province_mortality_rate.append(province_death_cases[i]/province_confirmed_cases[i])
# keep only the top 100 provinces/states
province_limit = 100
province_df = pd.DataFrame({'Province/State Name': unique_provinces[:province_limit], 'Country': province_country[:province_limit], 'Number of Confirmed Cases': province_confirmed_cases[:province_limit],
                          'Number of Deaths': province_death_cases[:province_limit],'Number of Active Cases' : province_active[:province_limit],
                            'Incidence Rate' : province_incidence_rate[:province_limit], 'Mortality Rate': province_mortality_rate[:province_limit]})
# number of cases per province/state

province_df.style.background_gradient(cmap='Oranges')
| | Province/State Name | Country | Number of Confirmed Cases | Number of Deaths | Number of Active Cases | Incidence Rate | Mortality Rate |
|---|---|---|---|---|---|---|---|
| 0 | Maharashtra | India | 1601365 | 42240 | 174246.000000 | 1300.397989 | 0.026377 |
| 1 | Sao Paulo | Brazil | 1064039 | 38035 | 117805.000000 | 2317.206090 | 0.035746 |
| 2 | California | US | 879645 | 16982 | 862663.000000 | 101246.390300 | 0.019306 |
| 3 | Texas | US | 856948 | 17481 | 839467.000000 | 635233.906196 | 0.020399 |
| 4 | Andhra Pradesh | India | 786050 | 6453 | 35065.000000 | 1458.256997 | 0.008209 |
| 5 | Karnataka | India | 770604 | 10542 | 106233.000000 | 1140.576323 | 0.013680 |
| 6 | Florida | US | 756727 | 16021 | 740706.000000 | 265466.585181 | 0.021171 |
| 7 | Tamil Nadu | India | 690936 | 10691 | 38093.000000 | 887.621729 | 0.015473 |
| 8 | England | United Kingdom | 629211 | 38783 | 590428.000000 | 1124.048720 | 0.061638 |
| 9 | New York | US | 485279 | 33366 | 451913.000000 | 74530.743741 | 0.068756 |
| 10 | Uttar Pradesh | India | 456865 | 6685 | 31495.000000 | 192.054719 | 0.014632 |
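The per-province loops above filter `latest_data` once per province and per column. As a rough sketch of a more compact alternative (not the notebook's own code), the same kind of summary can be built with a single `groupby`. The toy frame below is invented for illustration; only the column names follow the daily report. Note that `groupby` drops rows whose province is missing by default.

```python
import pandas as pd

# Hypothetical toy rows shaped like the daily report (not real data).
toy = pd.DataFrame({
    'Province_State': ['P1', 'P1', 'P2', 'P3'],
    'Country_Region': ['X', 'X', 'X', 'Y'],
    'Confirmed': [10, 30, 100, 0],
    'Deaths': [1, 1, 5, 0],
    'Active': [9, 29, 95, 0],
})

province_summary = (
    # one row per (province, country) pair, with the counts summed
    toy.groupby(['Province_State', 'Country_Region'], as_index=False)[['Confirmed', 'Deaths', 'Active']]
       .sum()
       # drop areas with no confirmed cases
       .query('Confirmed > 0')
       # mortality rate = deaths / confirmed cases
       .assign(**{'Mortality Rate': lambda d: d['Deaths'] / d['Confirmed']})
       # largest outbreaks first
       .sort_values('Confirmed', ascending=False)
       .reset_index(drop=True)
)
print(province_summary)
```

Grouping on both `Province_State` and `Country_Region` also keeps identically named provinces in different countries apart.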
# return the data table with province/state info for a given country
def country_table(country_name):
    states = list(latest_data[latest_data['Country_Region']==country_name]['Province_State'].unique())
    state_confirmed_cases = []
    state_death_cases = []
    # state_recovery_cases = []
    state_active = []
    state_incidence_rate = []
    state_mortality_rate = []

    no_cases = []
    for i in states:
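        # note: this matches on the province name only, so a province with the same name in another country would be counted too
        cases = latest_data[latest_data['Province_State']==i]['Confirmed'].sum()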
        if cases > 0:
            state_confirmed_cases.append(cases)
        else:
            no_cases.append(i)

    # remove areas with no confirmed cases
    for i in no_cases:
        states.remove(i)

    states = [k for k, v in sorted(zip(states, state_confirmed_cases), key=operator.itemgetter(1), reverse=True)]
    for i in range(len(states)):
        state_confirmed_cases[i] = latest_data[latest_data['Province_State']==states[i]]['Confirmed'].sum()
        state_death_cases.append(latest_data[latest_data['Province_State']==states[i]]['Deaths'].sum())
    #     state_recovery_cases.append(latest_data[latest_data['Province_State']==states[i]]['Recovered'].sum())
        state_active.append(latest_data[latest_data['Province_State']==states[i]]['Active'].sum())
        state_incidence_rate.append(latest_data[latest_data['Province_State']==states[i]]['Incidence_Rate'].sum())
        state_mortality_rate.append(state_death_cases[i]/state_confirmed_cases[i])


    state_df = pd.DataFrame({'State Name': states, 'Number of Confirmed Cases': state_confirmed_cases,
                              'Number of Deaths': state_death_cases, 'Number of Active Cases' : state_active,
                             'Incidence Rate' : state_incidence_rate, 'Mortality Rate': state_mortality_rate})
    # number of cases per province/state of the given country
    return state_df
# data for the US
us_table = country_table('US')
us_table.style.background_gradient(cmap='Oranges')
| | State Name | Number of Confirmed Cases | Number of Deaths | Number of Active Cases | Incidence Rate | Mortality Rate |
|---|---|---|---|---|---|---|
| 0 | California | 879645 | 16982 | 862663.000000 | 101246.390300 | 0.019306 |
| 1 | Texas | 856948 | 17481 | 839467.000000 | 635233.906196 | 0.020399 |
| 2 | Florida | 756727 | 16021 | 740706.000000 | 265466.585181 | 0.021171 |
| 3 | New York | 485279 | 33366 | 451913.000000 | 74530.743741 | 0.068756 |
| 4 | Illinois | 350744 | 9496 | 341248.000000 | 222737.038481 | 0.027074 |
| 5 | Georgia | 341310 | 7657 | 333653.000000 | 545676.734025 | 0.022434 |
| 6 | North Carolina | 247172 | 3939 | 243233.000000 | 241343.144415 | 0.015936 |
| 7 | Tennessee | 232061 | 2922 | 229139.000000 | 349619.782941 | 0.012592 |
| 8 | Arizona | 231897 | 5830 | 226068.000000 | 49228.321751 | 0.025140 |
| 9 | New Jersey | 221205 | 16214 | 204991.000000 | 46343.379793 | 0.073299 |
| 10 | Pennsylvania | 188381 | 8475 | 179906.000000 | 66883.526616 | 0.044989 |
# data for China
china_table = country_table('China')
china_table.style.background_gradient(cmap='Oranges')
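| | State Name | Number of Confirmed Cases | Number of Deaths | Number of Active Cases | Incidence Rate | Mortality Rate |
|---|---|---|---|---|---|---|
| 0 | Hubei | 68139 | 4512 | 0.000000 | 115.158019 | 0.066218 |
| 1 | Hong Kong | 5256 | 105 | 169.000000 | 70.108155 | 0.019977 |
| 2 | Guangdong | 1889 | 8 | 40.000000 | 1.664904 | 0.004235 |
| 3 | Zhejiang | 1283 | 1 | 3.000000 | 2.236360 | 0.000779 |
| 4 | Henan | 1281 | 22 | 3.000000 | 1.333680 | 0.017174 |
| 5 | Shanghai | 1095 | 7 | 75.000000 | 4.517327 | 0.006393 |
| 6 | Hunan | 1019 | 4 | 0.000000 | 1.477026 | 0.003925 |
| 7 | Anhui | 991 | 6 | 0.000000 | 1.567046 | 0.006054 |
| 8 | Heilongjiang | 948 | 13 | 0.000000 | 2.512589 | 0.013713 |
| 9 | Beijing | 938 | 9 | 2.000000 | 4.354689 | 0.009595 |
| 10 | Jiangxi | 935 | 1 | 0.000000 | 2.011618 | 0.001070 |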
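One thing to watch in `country_table`: the list of states is taken from the chosen country, but the per-state sums select rows by `Province_State` alone. If two countries happened to share a province name, the rows from both would be added together. A minimal sketch of a stricter lookup follows; the helper name `confirmed_in` and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: two countries share the province name 'North'.
toy = pd.DataFrame({
    'Country_Region': ['X', 'Y', 'Y'],
    'Province_State': ['North', 'North', 'South'],
    'Confirmed': [10, 20, 5],
})

def confirmed_in(df, country, province):
    """Sum confirmed cases for one province of one country (hypothetical helper)."""
    mask = (df['Country_Region'] == country) & (df['Province_State'] == province)
    return df.loc[mask, 'Confirmed'].sum()

# Matching on the province name alone mixes both countries.
name_only = toy[toy['Province_State'] == 'North']['Confirmed'].sum()
# Matching on country and province keeps them apart.
strict = confirmed_in(toy, 'X', 'North')
print(name_only, strict)
```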
total_world_cases = np.sum(country_confirmed_cases)
us_confirmed = latest_data[latest_data['Country_Region']=='US']['Confirmed'].sum()
outside_us_confirmed = total_world_cases - us_confirmed

plt.figure(figsize=(16, 9))
plt.barh('United States', us_confirmed)
plt.barh('Outside United States', outside_us_confirmed)
plt.title('# of Total Coronavirus Confirmed Cases', size=20)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()


plt.figure(figsize=(16, 9))
plt.barh('United States', us_confirmed/total_world_cases)
plt.barh('Outside United States', outside_us_confirmed/total_world_cases)
plt.title('# of Coronavirus Confirmed Cases Expressed in Percentage', size=20)
plt.xticks(size=20)
plt.yticks(size=20)
plt.show()


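[Figure: horizontal bar chart, "# of Total Coronavirus Confirmed Cases" (United States vs. Outside United States)]

[Figure: horizontal bar chart, "# of Coronavirus Confirmed Cases Expressed in Percentage" (United States vs. Outside United States)]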
# Only show the 10 countries with the most confirmed cases; the rest are grouped into the 'Others' category
visual_unique_countries = []
visual_confirmed_cases = []
others = np.sum(country_confirmed_cases[10:])

for i in range(len(country_confirmed_cases[:10])):
    visual_unique_countries.append(unique_countries[i])
    visual_confirmed_cases.append(country_confirmed_cases[i])

visual_unique_countries.append('Others')
visual_confirmed_cases.append(others)
def plot_bar_graphs(x, y, title):
    plt.figure(figsize=(16, 12))
    plt.barh(x, y)
    plt.title(title, size=20)
    plt.xticks(size=20)
    plt.yticks(size=20)
    plt.show()

# taller figure, better suited to a large number of bars
def plot_bar_graphs_tall(x, y, title):
    plt.figure(figsize=(19, 18))
    plt.barh(x, y)
    plt.title(title, size=25)
    plt.xticks(size=25)
    plt.yticks(size=25)
    plt.show()

plot_bar_graphs(visual_unique_countries, visual_confirmed_cases, '# of Covid-19 Confirmed Cases in Countries/Regions')


[Figure: horizontal bar chart, "# of Covid-19 Confirmed Cases in Countries/Regions"]

log_country_confirmed_cases = [math.log10(i) for i in visual_confirmed_cases]
plot_bar_graphs(visual_unique_countries, log_country_confirmed_cases, 'Common Log # of Coronavirus Confirmed Cases in Countries/Regions')


[Figure: horizontal bar chart, "Common Log # of Coronavirus Confirmed Cases in Countries/Regions"]
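The "top 10 plus Others" grouping is written out twice in this notebook, once for countries and once for provinces. A small helper could cover both cases. The sketch below is an illustration only: the function name `top_n_with_others` is hypothetical, and unlike the notebook it sorts its input itself rather than assuming pre-sorted lists. `math.log10` raises an error for 0, so the log step assumes every value is positive.

```python
import math

def top_n_with_others(labels, values, n=10):
    """Keep the n largest entries and fold the remainder into 'Others' (hypothetical helper)."""
    pairs = sorted(zip(labels, values), key=lambda p: p[1], reverse=True)
    out_labels = [k for k, _ in pairs[:n]]
    out_values = [v for _, v in pairs[:n]]
    # only add an 'Others' bar when something was actually left over
    if len(pairs) > n:
        out_labels.append('Others')
        out_values.append(sum(v for _, v in pairs[n:]))
    return out_labels, out_values

labels, values = top_n_with_others(['a', 'b', 'c', 'd'], [1000, 10, 100, 1], n=2)
# common logarithm of each bar, as used for the log-scale chart
log_values = [math.log10(v) for v in values]
print(labels, values, log_values)
```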

# Only show 10 provinces with the most confirmed cases, the rest are grouped into the other category
visual_unique_provinces = []
visual_confirmed_cases2 = []
others = np.sum(province_confirmed_cases[10:])
for i in range(len(province_confirmed_cases[:10])):
    visual_unique_provinces.append(unique_provinces[i])
    visual_confirmed_cases2.append(province_confirmed_cases[i])

visual_unique_provinces.append('Others')
visual_confirmed_cases2.append(others)

plot_bar_graphs(visual_unique_provinces, visual_confirmed_cases2, '# of Coronavirus Confirmed Cases in Provinces/States')

log_province_confirmed_cases = [math.log10(i) for i in visual_confirmed_cases2]
plot_bar_graphs(visual_unique_provinces, log_province_confirmed_cases, 'Log of # of Coronavirus Confirmed Cases in Provinces/States')


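[Figure: horizontal bar chart, "# of Coronavirus Confirmed Cases in Provinces/States"]

[Figure: horizontal bar chart, "Log of # of Coronavirus Confirmed Cases in Provinces/States"]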
## Testing rate
us_medical_data.fillna(value=0, inplace=True)

def plot_us_medical_data():
    states = us_medical_data['Province_State'].unique()
    testing_number = []
    testing_rate = []

    for i in states:
        testing_number.append(us_medical_data[us_medical_data['Province_State']==i]['People_Tested'].sum())
        testing_rate.append(us_medical_data[us_medical_data['Province_State']==i]['Testing_Rate'].max())

    # rank the states by each metric (only the top 30 are plotted below)
    testing_states = [k for k, v in sorted(zip(states, testing_number), key=operator.itemgetter(1), reverse=True)]
    testing_rate_states = [k for k, v in sorted(zip(states, testing_rate), key=operator.itemgetter(1), reverse=True)]

    for i in range(len(states)):
        testing_number[i] = us_medical_data[us_medical_data['Province_State']==testing_states[i]]['People_Tested'].sum()
        testing_rate[i] = us_medical_data[us_medical_data['Province_State']==testing_rate_states[i]]['Testing_Rate'].max()

    top_limit = 30

    plot_bar_graphs_tall(testing_states[:top_limit], testing_number[:top_limit], 'Total Testing per State (Top 30)')
    plot_bar_graphs_tall(testing_rate_states[:top_limit], testing_rate[:top_limit], 'Testing Rate per 100,000 People (Top 30)')


plot_us_medical_data()


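[Figure: horizontal bar chart, "Total Testing per State (Top 30)"]

[Figure: horizontal bar chart, "Testing Rate per 100,000 People (Top 30)"]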
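The ranking in `plot_us_medical_data` can also be expressed with pandas directly. Below is a rough sketch on invented rows; only the column names `Province_State`, `People_Tested` and `Testing_Rate` come from the code above. It aggregates per state and then takes the largest entries with `nlargest`, which returns them already sorted.

```python
import pandas as pd

# Hypothetical toy rows shaped like the US testing data (not real data).
toy = pd.DataFrame({
    'Province_State': ['S1', 'S2', 'S3'],
    'People_Tested': [500, 1500, 1000],
    'Testing_Rate': [2500.0, 1200.0, 3000.0],  # tests per 100,000 people
})

# One row per state: total people tested and the highest reported rate.
per_state = toy.groupby('Province_State').agg(
    People_Tested=('People_Tested', 'sum'),
    Testing_Rate=('Testing_Rate', 'max'),
)

top_limit = 2
top_tested = per_state['People_Tested'].nlargest(top_limit)
top_rate = per_state['Testing_Rate'].nlargest(top_limit)
print(top_tested)
print(top_rate)
```

The two resulting Series carry the state names in their index, so `top_tested.index` and `top_tested.values` could be passed to a bar-chart helper such as `plot_bar_graphs_tall`.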
## Mobility data
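def get_mobility_by_state(transport_type, state, day):
    # combine both conditions into one boolean mask instead of chaining two filters
    mask = (apple_mobility['sub-region']==state) & (apple_mobility['transportation_type']==transport_type)
    return apple_mobility.loc[mask, day].sum()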

get_mobility_by_state('walking', 'Connecticut', '2020-07-30')
revised_dates = []
for i in range(len(dates)):
    revised_dates.append(datetime.datetime.strptime(dates[i], '%m/%d/%y').strftime('%Y-%m-%d'))

def weekday_or_weekend(date):
    # returns True for a weekend day, False for a weekday
    date_obj = datetime.datetime.strptime(date, '%Y-%m-%d')
    # weekday() gives Monday=0 ... Sunday=6, so 5 and 6 are Saturday and Sunday
    return date_obj.weekday() >= 5

revised_day_since_1_22 = [i for i in range(len(revised_dates))]

import matplotlib.dates as mdates
states = ['New York', 'Connecticut', 'Florida', 'California', 'Texas', 'Georgia', 'Arizona', 'Illinois', 'Louisiana', 'Ohio',
          'Tennessee', 'North Carolina', 'South Carolina', 'Alabama', 'Missouri', 'Kansas', 'Pennsylvania', 'Wisconsin', 'Virginia', 'Massachusetts', 'Utah', 'Minnesota',
         'Oklahoma', 'Iowa', 'Arkansas', 'Kentucky', 'Puerto Rico', 'Colorado', 'New Jersey', 'Idaho', 'Nevada', 'Maryland']
states.sort()

# making sure the dates are in sync
mobility_latest_date = apple_mobility.columns[-1]
mobility_latest_index = revised_dates.index(mobility_latest_date)

for state in states:
    # weekend and weekday mobility are separated
    weekday_mobility = []
    weekday_mobility_dates = []
    weekend_mobility = []
    weekend_mobility_dates = []

    for i in range(len(revised_dates)):
        if i <= mobility_latest_index:
            if weekday_or_weekend(revised_dates[i]):
                weekend_mobility.append(get_mobility_by_state('walking', state, revised_dates[i]))
                weekend_mobility_dates.append(i)
            else:
                weekday_mobility.append(get_mobility_by_state('walking', state, revised_dates[i]))
                weekday_mobility_dates.append(i)
        else:
            pass

    # remove null values (they are counted as 0)
    for i in range(len(weekend_mobility)):
        if weekend_mobility[i] == 0 and i != 0:
            weekend_mobility[i] = weekend_mobility[i-1]
        elif weekend_mobility[i] == 0 and i == 0:
            weekend_mobility[i] = weekend_mobility[i+1]
        else:
            pass

    for i in range(len(weekday_mobility)):
        if weekday_mobility[i] == 0 and i != 0:
            weekday_mobility[i] = weekday_mobility[i-1]
        elif weekday_mobility[i] == 0 and i == 0:
            weekday_mobility[i] = weekday_mobility[i+1]
        else:
            pass


    weekday_mobility_average = moving_average(weekday_mobility, 7)
    weekend_mobility_average = moving_average(weekend_mobility, 7)

    plt.figure(figsize=(16, 10))
    plt.bar(weekday_mobility_dates, weekday_mobility, color='cornflowerblue')
    plt.plot(weekday_mobility_dates, weekday_mobility_average, color='green')

    plt.bar(weekend_mobility_dates, weekend_mobility, color='salmon')
    plt.plot(weekend_mobility_dates, weekend_mobility_average, color='black')

    plt.legend(['Moving average (7 days) weekday mobility', 'Moving Average (7 days) weekend mobility', 'Weekday mobility', 'Weekend mobility'], prop={'size': 25})
    plt.title('{} Walking Mobility Data'.format(state), size=25)
    plt.xlabel('Days since 1/22', size=25)
    plt.ylabel('Mobility Value', size=25)
    plt.xticks(size=25)
    plt.yticks(size=25)
    plt.show()


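[Figures: one chart per state, "{state} Walking Mobility Data", showing weekday and weekend mobility bars with their 7-day moving averages against days since 1/22]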
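The mobility loop relies on two steps: replacing zeros (missing values) with a neighbouring value, and smoothing with `moving_average`, which is defined elsewhere in the notebook and not shown in this part. The sketch below is one possible way to do both with pandas. It is an assumption for illustration, not the notebook's `moving_average`, and the helper name `fill_zeros_and_smooth` is hypothetical. It uses a trailing window and keeps the output the same length as the input so it can be plotted against the same x values.

```python
import pandas as pd

def fill_zeros_and_smooth(values, window=7):
    """Treat zeros as missing, fill them from neighbours, then take a trailing mean (hypothetical helper)."""
    s = pd.Series(values, dtype='float64')
    # zeros become NaN, then are filled from the previous value (or the next one at the start)
    s = s.mask(s == 0).ffill().bfill()
    # min_periods=1 averages over whatever is available at the start of the series
    return s.rolling(window, min_periods=1).mean().tolist()

smoothed = fill_zeros_and_smooth([0, 2, 0, 4], window=2)
print(smoothed)
```

Since weekday and weekend values are smoothed as separate series, a window of 7 spans 7 weekdays (or 7 weekend days) rather than 7 calendar days.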