I’m trying to store daily snapshot of Employee data for FTE analysis project – analytics on how many FTE in various positions any given day.
I can call a REST API, which will give me data for all the active and terminated employees as of API call time. Is it prudent to call this API every single day and store daily snapshot or store only records which changed from previous version. What is the common design principle for this use case. Thanks!
Advertisement
Answer
I ended up using Delta tables on Databricks.
Sample code to create the table and load data
#read data into a dataframe df = spark.read.format("json").option("multiline", "true").load("/mnt/datalake/path/to/file/employee.json") #load data into a managed delta table df.write.format("Delta").mode("overwrite").option("mergeSchema", "true").saveAsTable("raw.DimEmployee")
Now you can query data of any given day using below query
select count(*) from raw.DimEmployee timestamp as of '2022-09-14'