Skip to content
Advertisement

How to count all rows in raw data file using Hive?

I am reading some raw input which looks something like this:

20   abc   def
21   ghi   jkl
     mno   pqr
23   stu 

Note the first two rows are “good” rows and the last two rows are “bad” rows since they are missing some data.

Here is the snippet of my hive query which is reading this raw data into a readonly external table:

DROP TABLE IF EXISTS readonly_s3;
CREATE EXTERNAL TABLE readonly_s3 (id string, name string, data string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'

I need to get the count of ALL the rows, both “good” and “bad.” The problem is some of the data is missing, and if I do SELECT count(id) as total_rows for example, that doesn’t work since not all the rows have an id.

Any suggestions on how I can count ALL the rows in this raw data file?

Advertisement

Answer

Hmmm . . . You can use:

select sum(case when col1 is not null and col2 is not null and col3 is not null then 1 else 0 end) as num_good,
       sum(case when col1 is null or col2 is null or col3 is null then 1 else 0 end) as num_bad
from readonly_s3;
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement