Skip to content
Advertisement

Production Hadoop query that takes lot of time

Current Status

We have a query that runs for 2+ hours. On examining the progress, the query spends a lot of time during the join with table T5 and during the final stage of the query. Is there any way we can simplify of do something about this query? I was unable to use aggregate functions in place of rank() as the orderby used is bit complicated.

What we have already tried

We have already converted the sub-queries to case statements in the select clause and helped reduce the execution time but that was not significant. We have simplified co-related query for T3, T4 and T6.

SELECT * FROM 
        (SELECT T2.f1, T2.f2 .... T5.f19, T5.f20, 
                   case when T1.trxn_id is null then T2.crt_ts
                        when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
                        when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts
                    end as crt_ts , 
                    row_number() over ( partition by T2.w_trxn_id,
                                            if(T1.trxn_id is null, 'NULL', T1.trxn_id)
                                            order by T2.business_effective_ts desc,
                                            case when T1.trxn_id is null then T2.crt_ts
                                            when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
                                            when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts
                                            when T1.trxn_id is not null and T5.acct_trxn_id is null then T2.crt_ts end desc
                                        ) as rnk
                FROM(SELECT * FROM T3 WHERE title_name = 'CAPTURE' and tr_dt IN (SELECT tr_dt FROM DT_LKP))
                T2
                LEFT JOIN (SELECT * FROM T6 WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)) 
                T1 ON T2.w_trxn_id = T1.w_trxn_id AND T2.business_effective_ts = T1.business_effective_ts
                LEFT JOIN (SELECT f1, f3. ... f20 FROM T4 WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)) 
                T5 ON T1.trxn_id = T5.acct_trxn_id
                WHERE if(T1.trxn_id is null, 'NULL', T1.trxn_id) = if(T5.acct_trxn_id is null, 'NULL', T5.acct_trxn_id)
        ) FNL WHERE rnk = 1

Advertisement

Answer

Not sure if this will help you much. There is some rather strange WHERE clause:

WHERE if(T1.trxn_id is null, 'NULL', T1.trxn_id) = if(T5.acct_trxn_id is null, 'NULL', T5.acct_trxn_id)

This is probably for joining NULLs as well as normal values. Then it does not work because First of all the join condition is T5 ON T1.trxn_id = T5.acct_trxn_id this means NULLs are not joined, then WHERE works as a filter after join. IF T5 is not joined then T5.acct_trxn_id converted to ‘NULL’ string in the WHERE and compared with NOT NULL T1.trxn_id value and most probably filtered out, works like INNER JOIN in this case. If it happens T1.trxn_id is NULL (driving table), it converted to string ‘NULL’ and compared with always string ‘NULL’ (because not joined anyway according to ON clause) and such row is passed (I did not test it though). The logic looks strange and I think it does not work as intended or converted to INNER. If you want to join all including NULLs, move this WHERE to the JOIN ON clause.

If there are many rows with NULLs then the join on NULLs using substitution with string ‘NULL’ will multiply rows and will result in duplicates.

Actually when investigating JOIN poor performance, check two things:

  1. Join keys are not duplicating or duplication is expected
  2. Join keys (and also partition by columns in the row_number) are not skewed, see this: https://stackoverflow.com/a/53333652/2700344 and this: https://stackoverflow.com/a/51061613/2700344

If everything looks fine then tune proper reducer parallelism, reduce hive.exec.reducers.bytes.per.reducer to get more reducers running

Also reduce DT_LKP as much as possible even if you know it contains some dates which are definitely not/should not be in fact tables, use CTE for filtering it if possible.

Also simplify the logic a bit (this will not improve performance but will simplify the code). Case in the select:

when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts

<=>

else greatest(T2.trxn_id,T5.crt_ts)

If T5.crt_ts is null, your case statement will return null, greatest() will also return null

CASE statement in the row_number simplified:

case when case when (T1.trxn_id is null) or (T5.acct_trxn_id is null) then T2.crt_ts
     else greatest(T2.trxn_id,T5.crt_ts)
 end

Also this: if(T1.trxn_id is null, 'NULL', T1.trxn_id) <=> NVL(T1.trxn_id,'NULL')

Of course these are suggestions only, I did not test them

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement