When we use sqoop import to transfer an RDBMS table to Hive, are the constraints of the table, such as the primary key, preserved? That is, will the column that is the primary key in the source table still be a primary key on the Hive side? Will this information be in the Hive metastore?
Thanks a lot.
Answer
As you can see in the official Hive QL documentation, PRIMARY KEY and FOREIGN KEY constraints have been available since Hive version 2.1.0 (as informational constraints only; Hive does not enforce them).
So, I initially assumed that PRIMARY and FOREIGN KEY constraints would remain when you import your tables to Hive using sqoop.
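For reference, here is a minimal sketch of how such a constraint is declared in Hive 2.1.0+ (this table is illustrative, not part of the test below). Note that Hive requires DISABLE NOVALIDATE, meaning the key is recorded as metadata but never enforced:

-- illustrative Hive 2.1.0+ DDL: an informational (non-enforced) primary key
CREATE TABLE employees_with_pk (
  emp_no     INT,
  first_name STRING,
  PRIMARY KEY (emp_no) DISABLE NOVALIDATE
);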
I tested a sqoop import of a MySQL database, and I can confirm that the PRIMARY KEY constraint is not maintained during the import.
MySQL table definition:
mysql> show create table employees;
Table: employees
Create Table:
CREATE TABLE `employees` (
  `emp_no` int(11) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  PRIMARY KEY (`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0,00 sec)
The data was imported from MySQL into Hive with the following command:
sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username root --password password \
  --table employees \
  --hive-import --create-hive-table --hive-table employees
When I describe the table in Hive, the PRIMARY KEY constraint is nowhere to be seen:
hive> show create table employees;
OK
CREATE TABLE `employees`(
  `emp_no` int,
  `birth_date` string,
  `first_name` string,
  `last_name` string,
  `gender` string,
  `hire_date` string)
COMMENT 'Imported by sqoop on 2019/03/18 00:24:11'
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='',
  'line.delim'='\n',
  'serialization.format'='')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:9000/user/hive/warehouse/employees'
TBLPROPERTIES (
  'transient_lastDdlTime'='1552865076')
Time taken: 1.304 seconds, Fetched: 22 row(s)
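To answer the metastore part of the question directly, you can also check the metastore database itself. A minimal sketch, assuming a MySQL-backed metastore with the Hive 2.1+ schema, where constraints live in the KEY_CONSTRAINTS table; after the sqoop import above, this query comes back empty, i.e. no primary key was recorded:

-- run against the metastore database (not inside Hive)
SELECT CONSTRAINT_NAME, CONSTRAINT_TYPE
FROM KEY_CONSTRAINTS;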
I then inserted a new row with the same employee number to check whether Hive enforces the PK constraint. The duplicate row was accepted, as you can see:
hive> insert into employees values (10001, "1986-04-17", "Hichem", "BOUSSETTA", "M", "2014-09-91");
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/employees/.hive-staging_hive_2019-03-18_00-32-16_851_8569619447966100947-1/-ext-10000
Loading data to table default.employees
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 5.79 sec   HDFS Read: 5080 HDFS Write: 120 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 790 msec
OK
Time taken: 42.422 seconds

hive> select * from employees;
OK
10001	1986-04-17	Hichem	BOUSSETTA	M	2014-09-91
10001	1953-09-02	Georgi	Facello	M	1986-06-26
10002	1964-06-02	Bezalel	Simmel	F	1985-11-21
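Since nothing prevents duplicates from landing in the table, deduplication has to happen at query time. A minimal sketch, assuming emp_no is meant to be unique; the ORDER BY inside the window is arbitrary here, so which duplicate survives is not deterministic:

-- keep one row per emp_no; Hive stages the output first,
-- so overwriting the table it reads from is safe
INSERT OVERWRITE TABLE employees
SELECT emp_no, birth_date, first_name, last_name, gender, hire_date
FROM (
  SELECT e.*, row_number() OVER (PARTITION BY emp_no ORDER BY hire_date) AS rn
  FROM employees e
) t
WHERE rn = 1;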
So, to conclude: Sqoop does not preserve the PRIMARY KEY constraint when importing RDBMS data into Hive.
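If you still want the key recorded in the metastore after the import, one option (a sketch, assuming Hive 2.1.0+; the constraint name employees_pk is arbitrary) is to re-declare it yourself. It will then show up as metadata but remains informational only, as the test above demonstrates:

-- re-declare the key after the sqoop import; recorded but not enforced
ALTER TABLE employees
  ADD CONSTRAINT employees_pk PRIMARY KEY (emp_no) DISABLE NOVALIDATE;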