Merge update records in a final table

Question

I have a user table in Hive of the form: User: Id String, Name String, Col1 String, UpdateTimestamp Timestamp. I'm inserting data into this table from a file which has the following format: I/U,…

Accepted Answer

Typically with a merge statement, your "file" would still be unique on ID, and the merge statement would determine whether to insert the row as a new record or update the existing record with its values.

However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the inserts and then the updates, as you suggested.

In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key and the transactional table property:

    create table test.orc_acid_example_users
    (
      id int
      ,name string
      ,col1 string
      ,updatetimestamp timestamp
    )
    clustered by (id) into 5 buckets
    stored as ORC
    tblproperties('transactional'='true');

After your insert statements, your Bob record would say "stuff" in col1.

As far as the updates go, you could tackle these with an update or merge statement. I think the key here is the null values. It's important to keep the original name, or col1, or whatever, if the staging table from the file has a null value. Here's a merge example which coalesces the staging table's fields. Basically, if there is a value in the staging table, take that, or else fall back to the original value.

    merge into test.orc_acid_example_users as t
      using test.orc_acid_example_staging as s
    on t.id = s.id
      and s.type = 'U'
    when matched
      then update set name = coalesce(s.name, t.name), col1 = coalesce(s.col1, t.col1);

Now Bob will show "updatedstuff".

Quick disclaimer: if you have more than one update for Bob in the staging table, things will get messy. You will need a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge.
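
One possible shape for that pre-processing step is sketched below. It is not part of the original answer and rests on assumptions: the staging table is assumed to carry an updatetimestamp column that orders the updates, and test.orc_acid_example_staging_latest is a hypothetical table name. It relies on Hive's last_value(expr, true), where the second argument skips nulls, to collapse all 'U' rows for an id into a single row.

```sql
-- Sketch only. Assumptions (not in the original answer): the staging table
-- has an updatetimestamp column, and the output table name is hypothetical.
create table test.orc_acid_example_staging_latest as
select
  r.id
  ,r.type
  ,r.name
  ,r.col1
  ,r.updatetimestamp
from (
  select
    u.id
    ,u.type
    -- last_value(expr, true) skips nulls: most recent non-null name per id
    ,last_value(u.name, true) over (
       partition by u.id order by u.updatetimestamp
       rows between unbounded preceding and unbounded following) as name
    -- same idea for col1
    ,last_value(u.col1, true) over (
       partition by u.id order by u.updatetimestamp
       rows between unbounded preceding and unbounded following) as col1
    -- timestamp of the newest update for this id
    ,max(u.updatetimestamp) over (partition by u.id) as updatetimestamp
    -- every row of an id now holds the same values; rn = 1 keeps one of them
    ,row_number() over (partition by u.id order by u.updatetimestamp desc) as rn
  from test.orc_acid_example_staging as u
  where u.type = 'U'
) as r
where r.rn = 1;
```

The merge could then read from this collapsed table instead of the raw staging table. If a column is null in every update for an id, it stays null here and the coalesce in the merge falls back to the existing value. Two updates for the same id with an identical timestamp would make the result order-dependent, so a tie-breaker column would be needed in that case.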
Hive isn't really a complete transactional DB; it would be preferable for the source to send full user records any time there's an update, instead of only the changed fields.
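
If the source did send one full record per id, the typical single-statement merge mentioned at the start of this answer might look roughly like the sketch below. It assumes a staging table that is unique on id and has the same four columns as the users table, including updatetimestamp; neither assumption is confirmed by the question.

```sql
-- Sketch only. Assumes the staging table holds exactly one full record per id
-- with columns id, name, col1, updatetimestamp (an assumption, not confirmed).
merge into test.orc_acid_example_users as t
  using test.orc_acid_example_staging as s
on t.id = s.id
-- id already exists: overwrite the non-key columns with the full record
when matched
  then update set name = s.name, col1 = s.col1, updatetimestamp = s.updatetimestamp
-- id not seen before: insert it as a new user
when not matched
  then insert values (s.id, s.name, s.col1, s.updatetimestamp);
```

With full records there are no partial rows, so the coalesce fallback and the separate I/U handling are no longer needed.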

Answer