I have a user table in Hive of the form:
User: Id String, Name String, Col1 String, UpdateTimestamp Timestamp
I’m inserting data in this table from a file which has the following format:
I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp
e.g. for inserting a user with Id 1:
I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456
and updating col1 for the same user with Id 1:
U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457
The columns which are not updated are returned as null.
Now simple insertion is easy in hive using load in path in a staging table and then ignoring the first two fields from the stage table.
However, how would I go about the update statements? So that my final row in hive looks like below:
1,Bob,updatedstuff,123457
I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?
Advertisement
Answer
Typically with a merge statement your “file” would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.
However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.
In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:
create table test.orc_acid_example_users ( id int ,name string ,col1 string ,updatetimestamp timestamp ) clustered by (id) into 5 buckets stored as ORC tblproperties('transactional'='true');
After your insert statements, your Bob record would say “stuff” in col1
:
As far as the updates – you could tackle these with an update or merge statement. I think the key here is the null
values. It’s important to keep the original name, or col1, or whatever, if the staging table from the file has a null
value. Here’s a merge example which coalesces the staging tables fields. Basically, if there is a value in the staging table, take that, or else fall back to the original value.
merge into test.orc_acid_example_users as t using test.orc_acid_example_staging as s on t.id = s.id and s.type = 'U' when matched then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1)
Now Bob will show “updatedstuff”
Quick disclaimer – if you have more than one update for Bob in the staging table, things will get messy. You will need to have a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn’t really a complete transactional DB – it would be preferred for the source to send full user records any time there’s an update, instead of just the changed fields only.