Skip to content
Advertisement

Merge update records in a final table

I have a user table in Hive of the form:

User: 
Id    String,
Name  String,
Col1  String,
UpdateTimestamp Timestamp

I’m inserting data in this table from a file which has the following format:

I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp

e.g. for inserting a user with Id 1:

I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456

and updating col1 for the same user with Id 1:

U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457

The columns which are not updated are returned as null.

Now simple insertion is easy in hive using load in path in a staging table and then ignoring the first two fields from the stage table.

However, how would I go about the update statements? So that my final row in hive looks like below:

1,Bob,updatedstuff,123457

I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?

Advertisement

Answer

Typically with a merge statement your “file” would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.

However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.

In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:

create table test.orc_acid_example_users
(
  id int
  ,name string
  ,col1 string
  ,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');

After your insert statements, your Bob record would say “stuff” in col1: enter image description here

As far as the updates – you could tackle these with an update or merge statement. I think the key here is the null values. It’s important to keep the original name, or col1, or whatever, if the staging table from the file has a null value. Here’s a merge example which coalesces the staging tables fields. Basically, if there is a value in the staging table, take that, or else fall back to the original value.

merge into test.orc_acid_example_users as t
  using test.orc_acid_example_staging as s
on t.id = s.id
  and s.type = 'U'
when matched
  then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1)

Now Bob will show “updatedstuff” enter image description here

Quick disclaimer – if you have more than one update for Bob in the staging table, things will get messy. You will need to have a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn’t really a complete transactional DB – it would be preferred for the source to send full user records any time there’s an update, instead of just the changed fields only.

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement