Production Hadoop query that takes a lot of time

Question

Current status: We have a query that runs for 2+ hours. Examining its progress shows that the query spends a lot of time during the join with table T5 and during the final stage of the query. Is there any ...

Accepted Answer

Not sure if this will help you much, but there is a rather strange WHERE clause:

    WHERE if(T1.trxn_id is null, 'NULL', T1.trxn_id) = if(T5.acct_trxn_id is null, 'NULL', T5.acct_trxn_id)

This is probably meant to join NULLs as well as normal values. If so, it does not work, because the join condition is T5 ON T1.trxn_id = T5.acct_trxn_id, which means NULLs are not joined, and the WHERE then acts as a filter after the join.

If T5 is not joined, T5.acct_trxn_id is converted to the string 'NULL' in the WHERE and compared with a non-NULL T1.trxn_id value, so the row is most probably filtered out; in this case the query works like an INNER JOIN. If T1.trxn_id (the driving table) is NULL, it is converted to the string 'NULL' and compared with the T5 value, which is always the string 'NULL' (because the row is not joined anyway according to the ON clause), so such a row passes (I did not test it, though). The logic looks strange, and I think it either does not work as intended or turns the join into an INNER JOIN. If you want to join all rows including NULLs, move this WHERE condition into the JOIN ON clause.

If there are many rows with NULLs, joining NULLs through substitution with the string 'NULL' will multiply rows and produce duplicates.

When investigating poor JOIN performance, check two things:

- Join keys are not duplicated, or the duplication is expected.
- Join keys (and also the PARTITION BY columns in the row_number) are not skewed; see https://stackoverflow.com/a/53333652/2700344 and https://stackoverflow.com/a/51061613/2700344

If everything looks fine, tune reducer parallelism: reduce hive.exec.reducers.bytes.per.reducer to get more reducers running.

Also reduce DT_LKP as much as possible, even if you know it contains some dates that definitely are not (or should not be) in the fact tables; use a CTE to filter it if possible.
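The checks and the join change described above could be sketched roughly as follows. This is an untested outline, not the original query: the table names t1_source and t5_source are hypothetical placeholders for whatever sits behind the T1 and T5 aliases, the selected columns are only illustrative, and the reducer size is an arbitrary example value rather than a recommendation.

```sql
-- Hypothetical table names; only the aliases T1/T5 and the key columns
-- trxn_id / acct_trxn_id come from the question.

-- 1. Duplicate or skewed join keys on the T5 side: list the heaviest keys.
SELECT acct_trxn_id, COUNT(*) AS cnt
FROM t5_source
GROUP BY acct_trxn_id
HAVING COUNT(*) > 1
ORDER BY cnt DESC
LIMIT 20;

-- 2. NULL keys on the T1 side. Many NULLs on both sides means a
--    NULL-to-NULL join will multiply rows.
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN trxn_id IS NULL THEN 1 ELSE 0 END) AS null_keys
FROM t1_source;

-- 3. If NULLs really should be joined, put the substitution in the ON
--    clause instead of filtering in WHERE after the join.
SELECT T1.trxn_id, T5.crt_ts
FROM t1_source T1
LEFT JOIN t5_source T5
  ON NVL(T1.trxn_id, 'NULL') = NVL(T5.acct_trxn_id, 'NULL');

-- 4. If keys look fine, a smaller bytes-per-reducer value yields more reducers.
--    64 MB here is only an example value.
SET hive.exec.reducers.bytes.per.reducer=67108864;
```

Running the first two queries before changing the join helps decide whether step 3 is safe: if both sides hold many NULL keys, the join on the substituted 'NULL' string produces a row for every NULL pair.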
Also simplify the logic a bit (this will not improve performance, but it will simplify the code). This part of the CASE in the select:

    when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
    when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts

is equivalent to:

    else greatest(T2.crt_ts,T5.crt_ts)

If T5.crt_ts is null, your case statement will return null, and greatest() will also return null.

The CASE statement in the row_number, simplified:

    case when (T1.trxn_id is null) or (T5.acct_trxn_id is null) then T2.crt_ts else greatest(T2.crt_ts,T5.crt_ts) end

Also, this:

    if(T1.trxn_id is null, 'NULL', T1.trxn_id)

is equivalent to:

    NVL(T1.trxn_id,'NULL')

Of course, these are suggestions only; I did not test them.
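One way to sanity-check these rewrites before touching the production query is to compare the long and short forms side by side on a few literal rows. The sketch below is an assumption-laden illustration: the columns a, b and k are made-up stand-ins for T2.crt_ts, T5.crt_ts and T1.trxn_id, string literals stand in for timestamps, and it assumes a Hive version that allows SELECT without FROM.

```sql
-- Made-up literal rows: a ~ T2.crt_ts, b ~ T5.crt_ts, k ~ T1.trxn_id.
SELECT a, b, k,
       -- long form: two WHEN branches picking the later value
       CASE WHEN a >= b THEN a
            WHEN a <  b THEN b
       END                        AS via_case,
       -- short form suggested in the answer
       greatest(a, b)             AS via_greatest,
       -- NULL substitution, long and short forms
       if(k IS NULL, 'NULL', k)   AS via_if,
       NVL(k, 'NULL')             AS via_nvl
FROM (
  SELECT '2020-01-01 00:00:00' AS a, '2020-01-02 00:00:00' AS b, 'x1' AS k
  UNION ALL
  SELECT '2020-01-03 00:00:00' AS a, '2020-01-02 00:00:00' AS b, CAST(NULL AS STRING) AS k
  UNION ALL
  SELECT '2020-01-03 00:00:00' AS a, CAST(NULL AS STRING) AS b, 'x3' AS k
) t;
```

If the claims in the answer hold on your Hive version, via_case should match via_greatest and via_if should match via_nvl on every row, including the row where b is NULL.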

Answer