Skip to content
Advertisement

Selective summation of columns in a pandas dataframe

The COVID-19 tracking project (api described here) provides data on many aspects of the pandemic. Each row of the JSON is one day’s data for one state. As many people know, the pandemic is hitting different states differently — New York and its neighbors hardest first, with other states being hit later. Here is a subset of the data:

To get the entire data set I am doing this:

I would like to be able to summarize the data by adding up the values for one column, but only for certain states; and then adding up the same column, for the states not included before. I was able to do this, for instance:

This creates a new dataframe from all_states, grouped by date, and summing for all the states that are not “NY”. What I want to do, though, is exclude multiple states with something like a “not in” function (this doesn’t work):

Is there a way to do that? An alternate approach I tried is to create a new dataframe as a pivot table, with one row per date, one column per state, like this:

but this still leaves me with creating new columns from summing only some columns. In SQL, I would solve the problem like this:

The end result I am looking for is like this (using the sample data above and summing on the ‘positive’ column, with ‘NY’ standing in for ‘tristate’):

Any help would be welcome.

Advertisement

Answer

to get the expected output, you can use groupby on date and np.where the states are isin the states you want, sum on positive, unstack and assign to get the column total

or with pivot_table, you get similar result with:

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement