I rewrite some code from SAS to Python using Pandas library.
I’ve got such code, and I have no idea what should I do with it?
Can you help me, beacase its too complicated for me to do it correct. I’ve changed the name of columns (for encrypt sensitive data)
This is SAS code:
proc sql; create table &work_lib..opk_do_inf_4 as select distinct *, min(kat_opk) as opk_do_inf, count(nr_ks) as ilsc_opk_do_kosztu_infr from &work_lib..opk_do_inf_3 group by kod_ow, kod_sw, nr_ks, nr_ks_pr, nazwa_zabiegu_icd_9, nazwa_zabiegu having kat_opk = opk_do_inf ; quit;
This is my try in Pandas:
df = self.opk_do_inf_3() -> create DF using other function df['opk_do_inf'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['kat_opk'].min() df['ilsc_opk_do_kosztu_infr'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['nr_ks'].count() df_groupby = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu']).filter(lambda x: x['kat_opk']==x['opk_do_inf']) df = df_groupby.reset_index() df = df.drop_duplcates() return df
Advertisement
Answer
First, calling SELECT *
in an aggregate GROUP BY
query is not valid SQL. SAS may allow it but can yield unknown results. Usually SELECT
columns should be limited to columns in GROUP BY
clause.
With that said, aggregate SQL queries can generally be translated in Pandas with groupby.agg()
operations with WHERE
(filter before aggregation) or HAVING
(filter after aggregation) conditions handled using either .loc
or query
.
SQL
SELECT col1, col2, col3, MIN(col1) AS min_col1, AVG(col2) AS mean_col2, MAX(col3) AS max_col3, COUNT(*) AS count_obs FROM mydata GROUP BY col1, col2, col3 HAVING col1 = min(col1)
Pandas
General
agg_data = (mydata.groupby(["col1", "col2", "col3"], as_index=False) .agg(min_col1 = ("col1", "min"), mean_col2 = ("col2", "mean"), max_col3 = ("col3", "max"), count_obs = ("col1", "count")) .query("col1 == min_col1") )
Specific
opk_do_inf_4 = (mydata.groupby(["kat_opk", "kod_ow", "kod_sw", "nr_ks", "nr_ks_pr", "nazwa_zabiegu_icd_9", "nazwa_zabiegu"], as_index=False) .agg(opk_do_inf = ("kat_opk", "min"), ilsc_opk_do_kosztu_infr = ("nr_ks", "count")) .query("kat_opk == opk_do_inf") )