I am running one SQL query in Hive and it gives different results with CBO enabled and disabled. The results are wrong when CBO is enabled (set hive.cbo.enable=true;).
Prerequisites: Apache Hadoop 2.10.1 + Apache Hive 2.3.6 installed. (I tried to reproduce the issue with Apache Hive 3+ version and Hadoop 3+ version and they work fine.)
Actions to reproduce:
1) Create the table in Hive
CREATE TABLE
sample
(fname
string,sname
string,sid
string) ROW FORMAT SERDE
‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’ STORED AS INPUTFORMAT ‘org.apache.hadoop.mapred.TextInputFormat’ OUTPUTFORMAT ‘org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat’ TBLPROPERTIES ( ‘transient_lastDdlTime’=’1616020251’);
2) Insert some records in the table the below values
insert into sample values (“PQ”, “F”, “33”);
insert into sample values (“RK”, “A”, “11”);
insert into sample values (“AZ”, “B”, “22”);
3) Check the table
select * from sample;
PQ F 33 RK A 11 AZ B 22
4) Run the below query. It should give 0 records, instead it gives all records
select fname from sample where sid in (’11’) and sid in (’22’);
PQ RK AZ
5) The results are good when we disable cbo in Hive
set hive.cbo.enable=false;
select fname from sample where sid in (’11’) and sid in (’22’);
OK Time taken: 0.131 seconds
Advertisement
Answer
The wrong results come from a wrong simplification in the HivePointLookupOptimizerRule [1]. This corresponds to the bug logged under HIVE-21685 [2] that was fixed in versions >= 3.2.0.
As a workaround you can disable the respective rule by setting the respective property to false.
set hive.optimize.point.lookup=false