Skip to content
Advertisement

Production Hadoop query that takes lot of time

Current Status

We have a query that runs for 2+ hours. On examining the progress, the query spends a lot of time during the join with table T5 and during the final stage of the query. Is there any way we can simplify of do something about this query? I was unable to use aggregate functions in place of rank() as the orderby used is bit complicated.

What we have already tried

We have already converted the sub-queries to case statements in the select clause and helped reduce the execution time but that was not significant. We have simplified co-related query for T3, T4 and T6.

Advertisement

Answer

Not sure if this will help you much. There is some rather strange WHERE clause:

This is probably for joining NULLs as well as normal values. Then it does not work because First of all the join condition is T5 ON T1.trxn_id = T5.acct_trxn_id this means NULLs are not joined, then WHERE works as a filter after join. IF T5 is not joined then T5.acct_trxn_id converted to ‘NULL’ string in the WHERE and compared with NOT NULL T1.trxn_id value and most probably filtered out, works like INNER JOIN in this case. If it happens T1.trxn_id is NULL (driving table), it converted to string ‘NULL’ and compared with always string ‘NULL’ (because not joined anyway according to ON clause) and such row is passed (I did not test it though). The logic looks strange and I think it does not work as intended or converted to INNER. If you want to join all including NULLs, move this WHERE to the JOIN ON clause.

If there are many rows with NULLs then the join on NULLs using substitution with string ‘NULL’ will multiply rows and will result in duplicates.

Actually when investigating JOIN poor performance, check two things:

  1. Join keys are not duplicating or duplication is expected
  2. Join keys (and also partition by columns in the row_number) are not skewed, see this: https://stackoverflow.com/a/53333652/2700344 and this: https://stackoverflow.com/a/51061613/2700344

If everything looks fine then tune proper reducer parallelism, reduce hive.exec.reducers.bytes.per.reducer to get more reducers running

Also reduce DT_LKP as much as possible even if you know it contains some dates which are definitely not/should not be in fact tables, use CTE for filtering it if possible.

Also simplify the logic a bit (this will not improve performance but will simplify the code). Case in the select:

<=>

If T5.crt_ts is null, your case statement will return null, greatest() will also return null

CASE statement in the row_number simplified:

Also this: if(T1.trxn_id is null, 'NULL', T1.trxn_id) <=> NVL(T1.trxn_id,'NULL')

Of course these are suggestions only, I did not test them

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement