How to count all rows in raw data file using Hive?

I am reading some raw input which looks something like this:

20   abc   def
21   ghi   jkl
     mno   pqr
23   stu

​x
 
20   abc   def21   ghi   jkl     mno   pqr23   stu ​

Note the first two rows are “good” rows and the last two rows are “bad” rows since they are missing some data.

Here is the snippet of my hive query which is reading this raw data into a readonly external table:

DROP TABLE IF EXISTS readonly_s3;
CREATE EXTERNAL TABLE readonly_s3 (id string, name string, data string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'

 
DROP TABLE IF EXISTS readonly_s3;CREATE EXTERNAL TABLE readonly_s3 (id string, name string, data string)ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'​

I need to get the count of ALL the rows, both “good” and “bad.” The problem is some of the data is missing, and if I do SELECT count(id) as total_rows for example, that doesn’t work since not all the rows have an id.

Any suggestions on how I can count ALL the rows in this raw data file?

Answer

Hmmm . . . You can use:

select sum(case when col1 is not null and col2 is not null and col3 is not null then 1 else 0 end) as num_good,
       sum(case when col1 is null or col2 is null or col3 is null then 1 else 0 end) as num_bad
from readonly_s3;

 
select sum(case when col1 is not null and col2 is not null and col3 is not null then 1 else 0 end) as num_good,       sum(case when col1 is null or col2 is null or col3 is null then 1 else 0 end) as num_badfrom readonly_s3;​

Advertisement

Answer