
Selective summation of columns in a pandas dataframe

The COVID-19 tracking project (API described here) provides data on many aspects of the pandemic. Each row of the JSON is one day’s data for one state. As many people know, the pandemic is hitting different states differently — New York and its neighbors hardest first, with other states being hit later. Here is a subset of the data:

date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463

To get the entire data set I am doing this:

import pandas as pd
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")

I would like to be able to summarize the data by adding up the values for one column, but only for certain states; and then adding up the same column, for the states not included before. I was able to do this, for instance:

not_NY = all_states[all_states['state'] != 'NY'].groupby(['date'], as_index = False).hospitalizedCurrently.sum()

This creates a new dataframe from all_states, grouped by date, and summing for all the states that are not “NY”. What I want to do, though, is exclude multiple states with something like a “not in” function (this doesn’t work):

not_tristate = all_states[all_states['state'] not in ['NY','NJ','CT']].groupby(['date'], as_index = False).hospitalizedCurrently.sum()

Is there a way to do that? An alternate approach I tried is to create a new dataframe as a pivot table, with one row per date, one column per state, like this:

pivot_states = all_states.pivot_table(index = 'date', columns = 'state', values = 'hospitalizedCurrently', aggfunc='sum')

but this still leaves me with the problem of creating new columns that sum only some of the state columns. In SQL, I would solve the problem like this:

SELECT all_states.Date AS [Date], Sum(IIf([all_states]![state] In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS tristate, Sum(IIf([all_states]![state] Not In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS not_tristate
FROM all_states
GROUP BY all_states.Date
ORDER BY all_states.Date;

The end result I am looking for is like this (using the sample data above and summing on the ‘positive’ column, with ‘NY’ standing in for ‘tristate’):

date,not_tristate,tristate,total
20200502,53128,312977,366105
20200503,54563,316415,370978
20200504,55893,318953,374846
20200505,57179,321192,378371

Any help would be welcome.


Answer

To get the expected output, you can group by the date together with an np.where label that marks whether each state isin the list you want, sum the positive column, then unstack and assign to add the total column:

import numpy as np

# label each row 'tristate' or 'not_tristate', group by date + label,
# spread the labels into columns, and add a row total
df_f = (all_states
        .groupby(['date',
                  np.where(all_states['state'].isin(["NY","NJ","CT"]),
                           'tristate', 'not_tristate')])
        ['positive'].sum()
        .unstack()
        .assign(total=lambda x: x.sum(axis=1)))

print (df_f)
          not_tristate  tristate   total
date                                    
20200502         53128    312977  366105
20200503         54563    316415  370978
20200504         55893    318953  374846
20200505         57179    321192  378371
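As an aside, the “not in” filter from the question maps directly to boolean indexing: Series.isin builds a membership mask and ~ negates it. A minimal, self-contained sketch using a few of the sample rows from the question:

```python
import io
import pandas as pd

# A subset of the sample data from the question
csv = """date,state,positive
20200504,AK,370
20200504,NY,318953
20200504,WY,586
20200505,AK,371
20200505,NY,321192
20200505,WY,596"""
all_states = pd.read_csv(io.StringIO(csv))

# ~isin() is the pandas spelling of SQL's "NOT IN"
mask = ~all_states['state'].isin(['NY', 'NJ', 'CT'])
not_tristate = (all_states[mask]
                .groupby('date', as_index=False)['positive'].sum())
print(not_tristate)
#        date  positive
# 0  20200504       956
# 1  20200505       967
```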

Or, with pivot_table, you get a similar result:

print(all_states
      .assign(state=np.where(all_states['state'].isin(["NY","NJ","CT"]),
                             'tristate', 'not_tristate'))
      .pivot_table(index='date', columns='state', values='positive',
                   aggfunc='sum', margins=True))
state     not_tristate  tristate      All
date                                     
20200502         53128    312977   366105
20200503         54563    316415   370978
20200504         55893    318953   374846
20200505         57179    321192   378371
All             220763   1269537  1490300
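A closer line-by-line translation of the SQL conditional sums is also possible: build the two IIf-style columns with assign and np.where, then group and sum. A sketch, assuming the same sample rows as above:

```python
import io
import numpy as np
import pandas as pd

# A subset of the sample data from the question
csv = """date,state,positive
20200504,AK,370
20200504,NY,318953
20200504,WY,586
20200505,AK,371
20200505,NY,321192
20200505,WY,596"""
all_states = pd.read_csv(io.StringIO(csv))

tri = ['NY', 'NJ', 'CT']
out = (all_states
       # IIf(state In (...), positive, 0) and its negation, as two columns
       .assign(tristate=np.where(all_states['state'].isin(tri),
                                 all_states['positive'], 0),
               not_tristate=np.where(~all_states['state'].isin(tri),
                                     all_states['positive'], 0))
       .groupby('date', as_index=False)[['not_tristate', 'tristate']].sum())
print(out)
#        date  not_tristate  tristate
# 0  20200504           956    318953
# 1  20200505           967    321192
```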
User contributions licensed under: CC BY-SA