The COVID Tracking Project (API described here) provides data on many aspects of the pandemic. Each row of the JSON is one day's data for one state. As many people know, the pandemic hit different states differently: New York and its neighbors were hit hardest first, with other states hit later. Here is a subset of the data:
date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463
To get the entire data set I am doing this:
import pandas as pd

all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
I would like to summarize the data by adding up the values of one column, but only for certain states, and then adding up the same column for the states not included before. I was able to do this, for instance:
not_NY = all_states[all_states['state'] != 'NY'].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
This creates a new dataframe from all_states, grouped by date, and summing for all the states that are not “NY”. What I want to do, though, is exclude multiple states with something like a “not in” function (this doesn’t work):
not_tristate = all_states[all_states['state'] not in ['NY','NJ','CT']].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
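Python's `not in` operator can't be applied to a pandas Series like this, but `Series.isin` combined with the `~` negation operator expresses the same thing. A minimal sketch using a few of the sample rows above (one date only, so the frame is small):

```python
import pandas as pd

# A few sample rows from the question, standing in for the full API data
all_states = pd.DataFrame({
    "date": [20200505, 20200505, 20200505, 20200505],
    "state": ["AK", "CA", "NY", "WY"],
    "positive": [371, 56212, 321192, 596],
})

# ~ negates the boolean mask, i.e. "state not in the list"
not_tristate = (
    all_states[~all_states["state"].isin(["NY", "NJ", "CT"])]
    .groupby("date", as_index=False)["positive"]
    .sum()
)
print(not_tristate)  # 20200505 -> 371 + 56212 + 596 = 57179
```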
Is there a way to do that? An alternate approach I tried is to create a new dataframe as a pivot table, with one row per date, one column per state, like this:
pivot_states = all_states.pivot_table(index = 'gooddate', columns = 'state', values = 'hospitalizedCurrently', aggfunc='sum')
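With the pivot table in hand, the "sum only some columns" step can be done by selecting a subset of state columns and summing across `axis=1`. A sketch under the same assumptions (using `date` rather than the derived `gooddate` column, and `positive` as the value, since the sample data includes it):

```python
import pandas as pd

# Sample rows standing in for all_states (two dates, two states)
all_states = pd.DataFrame({
    "date": [20200504, 20200504, 20200505, 20200505],
    "state": ["NY", "AK", "NY", "AK"],
    "positive": [318953, 370, 321192, 371],
})

pivot_states = all_states.pivot_table(
    index="date", columns="state", values="positive", aggfunc="sum"
)

# Guard against states missing from the data before selecting columns
tristate_cols = [c for c in ["NY", "NJ", "CT"] if c in pivot_states.columns]
summary = pd.DataFrame({
    "tristate": pivot_states[tristate_cols].sum(axis=1),
    "not_tristate": pivot_states.drop(columns=tristate_cols).sum(axis=1),
})
print(summary)
```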
but this still leaves me with creating new columns from summing only some columns. In SQL, I would solve the problem like this:
SELECT all_states.Date AS [Date],
       Sum(IIf([all_states]![state] In ("NY","NJ","CT"), [all_states]![hospitalizedCurrently], 0)) AS tristate,
       Sum(IIf([all_states]![state] Not In ("NY","NJ","CT"), [all_states]![hospitalizedCurrently], 0)) AS not_tristate
FROM all_states
GROUP BY all_states.Date
ORDER BY all_states.Date;
The end result I am looking for is like this (using the sample data above and summing on the ‘positive’ column, with ‘NY’ standing in for ‘tristate’):
date,not_tristate,tristate,total
20200502,53128,312977,366105
20200503,54563,316415,370978
20200504,55893,318953,374846
20200505,57179,321192,378371
Any help would be welcome.
Answer
To get the expected output, you can groupby on date and on an np.where label for whether each state isin the states you want, then sum on positive, unstack, and assign to get the total column:
import numpy as np

df_f = (
    all_states
    .groupby(['date', np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                               'tristate', 'not_tristate')])['positive']
    .sum()
    .unstack()
    .assign(total=lambda x: x.sum(axis=1))
)
print(df_f)

          not_tristate  tristate   total
date
20200502         53128    312977  366105
20200503         54563    316415  370978
20200504         55893    318953  374846
20200505         57179    321192  378371
Or with pivot_table, you get a similar result with:
print(
    all_states
    .assign(state=np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                           'tristate', 'not_tristate'))
    .pivot_table(index='date', columns='state', values='positive',
                 aggfunc='sum', margins=True)
)

state     not_tristate  tristate      All
date
20200502         53128    312977   366105
20200503         54563    316415   370978
20200504         55893    318953   374846
20200505         57179    321192   378371
All             220763   1269537  1490300
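A third variant, not shown above, is pd.crosstab, which builds the same table directly from the date series and the np.where label; the margins_name parameter renames the margin row/column to "total". A sketch on a small stand-in for the sample data:

```python
import numpy as np
import pandas as pd

# Sample rows standing in for all_states
all_states = pd.DataFrame({
    "date": [20200504, 20200504, 20200505, 20200505],
    "state": ["NY", "AK", "NY", "AK"],
    "positive": [318953, 370, 321192, 371],
})

# Label each row tristate / not_tristate, then cross-tabulate with sums
group = np.where(all_states["state"].isin(["NY", "NJ", "CT"]),
                 "tristate", "not_tristate")
result = pd.crosstab(index=all_states["date"], columns=group,
                     values=all_states["positive"], aggfunc="sum",
                     margins=True, margins_name="total")
print(result)
```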