hello Thanks for taking the time to go through my question. I work in the budget space for a small city and during these precarious time, I am learning some python for maybe in the future helping me with some financial data modelling. We use SAP currently but i also wanted to learn a new language.
I need some pointers on where to look for certain answers. for ex, I made a database with a few million records, sorted by date and time. I was able to strip off the data I did not need and now have a clean database to work on
At a high level, I want to know if based on the first record in a day, is there another entry the same day that is double of the first record.
Date|time|dept|Value1 01/01/2019|11:00|BUD|51.00 01/01/2019|11:30|CSD|101.00 01/01/2019|11:50|BUD|102.00 01/02/2019|10:00|BUD|200.00 01/02/2019|10:31|BUD|201.00 01/02/2019|11:51|POL|400.00 01/03/2019|11:00|BUD|100.00 01/03/2019|11:30|PWD|101.00 01/03/2019|11:50|BUD|110.00
based on the data above and the requirement, I want to get an output of
Date|time|dept|Value| Start Value 01/01/2019|11:50|BUD|102.00|51.00 01/02/2019|11:51|POL|400.00|200.00 01/03/2019|NONE|NONE|NONE|100.00
On Day 3, There were no values that was at least double so, we have none or null.
What I have done so far
I have been able to connect to database [python] 2. I was able to strip off the unnecessary information and depts from the database [sqlite] 3. I have been able to create new tables for result [Python]
Questions / best Practices
- How to get the first line per day. Do I start off with a variable before the loop that is assigned to Jan 1, 2019 and then pick the row number and store it in another table or what other options do we have here.
- Once the first row per day is stored/captured in another table or a array, How do I get the first occurrence of a value at least twice of the first line.
ex? begin meta code***********
Start from Line 1 to end table2.date[] Should be equal to 01/01/2019 table2.value[] Should be equal to 51.00 look through each line if date = table2.date and value >= 2* (table2.value[]) *if successful, get record line number and department and value and store in new table else goto next line Then increase table2.date and table2.value by 1 and do the loop again.
end meta code*****************
is this the right approach, I feel like going through millions of records for each date change is not very optimized.
I can probably add a of condition to exit if date is not equal to table2.date[1] but am still not sure if this is the right way to approach this problem. This will be run only once or twice a year so system performance is not that important but still am thinking of approaching it the right way.
- Should I export the final data to excel for analysis or are thee good analysis modelling tools in Python. What would the professionals recommend?
Advertisement
Answer
You could use exists
to check if another record exists on the same day and with a value that is twice greater, and window functions to filter on the top record per day:
select * from ( select t.*, row_number() over(partition by date order by time) rn from mytable t where exists ( select 1 from mytable t1 where t1.date = t.date and t1.value = 2 * t.value ) ) t where rn = 1
In versions of SQLite where row_number()
is not available, another option is to filter with a correlated subquery:
select t.* from mytable t where exists(select 1 from mytable t1 where t1.date = t.date and t1.value = 2 * t.value) and t.time = (select min(t1.time) from mytable t1 where t1.date = t.date)