The COVID Tracking Project (API described here) provides data on many aspects of the pandemic. Each row of the JSON is one day's data for one state. As many people know, the pandemic hit different states differently: New York and its neighbors were hit hardest first, with other states hit later. Here is a subset of the data:
date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463
To get the entire data set I am doing this:
import pandas as pd
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
I would like to summarize the data by adding up the values of one column for certain states only, and then adding up the same column for the states not included. I was able to do this, for instance:
not_NY = all_states[all_states['state'] != 'NY'].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
This creates a new dataframe from all_states, grouped by date, summing over all states other than "NY". What I want, though, is to exclude multiple states with something like a "not in" function (this doesn't work):
not_tristate = all_states[all_states['state'] not in ['NY','NJ','CT']].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
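(Aside: pandas spells "not in" as a negated isin mask. A minimal sketch on the sample rows above, assuming the same column names as the API data:)

```python
import pandas as pd

# A few sample rows from the question
sample = pd.DataFrame({
    'date': [20200505, 20200505, 20200505, 20200505],
    'state': ['AK', 'CA', 'NY', 'WY'],
    'positive': [371, 56212, 321192, 596],
})

# ~ negates the boolean mask from isin, giving a "not in" filter
not_tristate = (sample[~sample['state'].isin(['NY', 'NJ', 'CT'])]
                .groupby('date', as_index=False)['positive'].sum())
print(not_tristate)
```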
Is there a way to do that? An alternate approach I tried is to create a new dataframe as a pivot table, with one row per date, one column per state, like this:
pivot_states = all_states.pivot_table(index = 'date', columns = 'state', values = 'hospitalizedCurrently', aggfunc='sum')
but this still leaves me with creating new columns from summing only some columns. In SQL, I would solve the problem like this:
SELECT all_states.Date AS [Date], Sum(IIf([all_states]![state] In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS tristate, Sum(IIf([all_states]![state] Not In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS not_tristate
FROM all_states
GROUP BY all_states.Date
ORDER BY all_states.Date;
The end result I am looking for is like this (using the sample data above and summing on the 'positive' column, with 'NY' alone standing in for 'tristate'):
date,not_tristate,tristate,total
20200502,53128,312977,366105
20200503,54563,316415,370978
20200504,55893,318953,374846
20200505,57179,321192,378371
Any help would be welcome.
Answer
To get the expected output, you can groupby on date together with an np.where label that checks whether the state isin the states you want, then sum on positive, unstack, and assign to get the total column:
import numpy as np

df_f = (all_states.groupby(['date',
                            np.where(all_states['state'].isin(["NY","NJ","CT"]),
                                     'tristate', 'not_tristate')])
                  ['positive'].sum()
                  .unstack()
                  .assign(total=lambda x: x.sum(axis=1)))
print(df_f)
not_tristate tristate total
date
20200502 53128 312977 366105
20200503 54563 316415 370978
20200504 55893 318953 374846
20200505 57179 321192 378371
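Since the live API endpoint may no longer be available, here is a self-contained version of the same groupby, using the sample rows from the question (summing 'positive' in place of 'hospitalizedCurrently'):

```python
from io import StringIO

import numpy as np
import pandas as pd

# Two days of sample rows from the question
csv = """date,state,positive
20200505,AK,371
20200505,CA,56212
20200505,NY,321192
20200505,WY,596
20200504,AK,370
20200504,CA,54937
20200504,NY,318953
20200504,WY,586
"""
all_states = pd.read_csv(StringIO(csv))

# Group by date plus a tristate/not_tristate label derived from the state column
df_f = (all_states.groupby(['date',
                            np.where(all_states['state'].isin(['NY', 'NJ', 'CT']),
                                     'tristate', 'not_tristate')])
                  ['positive'].sum()
                  .unstack()
                  .assign(total=lambda x: x.sum(axis=1)))
print(df_f)
```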
Or, with pivot_table, you get a similar result with:
print(all_states.assign(state=np.where(all_states['state'].isin(["NY","NJ","CT"]),
                                       'tristate', 'not_tristate'))
                .pivot_table(index='date', columns='state', values='positive',
                             aggfunc='sum', margins=True))
state not_tristate tristate All
date
20200502 53128 312977 366105
20200503 54563 316415 370978
20200504 55893 318953 374846
20200505 57179 321192 378371
All 220763 1269537 1490300
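The same pivot_table call, made self-contained on one day of the sample rows (margins=True appends the "All" row and column of totals):

```python
from io import StringIO

import numpy as np
import pandas as pd

# One day of sample rows from the question
csv = """date,state,positive
20200505,AK,371
20200505,CA,56212
20200505,NY,321192
20200505,WY,596
"""
all_states = pd.read_csv(StringIO(csv))

# Relabel states before pivoting; margins=True adds row/column totals named "All"
pivoted = (all_states.assign(state=np.where(all_states['state'].isin(['NY', 'NJ', 'CT']),
                                            'tristate', 'not_tristate'))
                     .pivot_table(index='date', columns='state', values='positive',
                                  aggfunc='sum', margins=True))
print(pivoted)
```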