Skip to content
Advertisement

Data Architecture Question – Storing Daily Snapshot of Employee Master Data

I’m trying to store daily snapshot of Employee data for FTE analysis project – analytics on how many FTE in various positions any given day.

I can call a REST API, which will give me data for all the active and terminated employees as of API call time. Is it prudent to call this API every single day and store daily snapshot or store only records which changed from previous version. What is the common design principle for this use case. Thanks!

Advertisement

Answer

I ended up using Delta tables on Databricks.

Sample code to create the table and load data

#read data into a dataframe
df = spark.read.format("json").option("multiline", "true").load("/mnt/datalake/path/to/file/employee.json")

#load data into a managed delta table
df.write.format("Delta").mode("overwrite").option("mergeSchema", "true").saveAsTable("raw.DimEmployee")

Now you can query data of any given day using below query

select count(*) from raw.DimEmployee timestamp as of '2022-09-14'

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement