Skip to content
Advertisement

Optimize Hive Query. java.lang.OutOfMemoryError: Java heap space/GC overhead limit exceeded

How can I optimize a query of this form since I keep running into this OOM error? Or come up with a better execution plan? If I removed the substring clause, the query would work fine, suggesting that this takes a lot of memory.

When the job fails, the beeline output shows the OOM Java heap space. Readings online suggested that I increase export HADOOP_HEAPSIZE but this still results in the error. Another thing I tried was increasing the hive.tez.container.size and hive.tez.java.opts (tez heap), but still has this error. In the YARN logs, there would be GC overhead limit exceeded, suggesting a combination of not enough memory and/or the query plan is extremely inefficient since it can’t collect back enough memory.

I am using Azure HDInsight Interactive Query 4.0. 20 worker node, D13v2 8 core, and 56GB RAM.

Source table

Target Table

Query

Advertisement

Answer

If everything else is okay, try to add distribute by partiton key at the end of your query:

As a result each reducer will create only one partition file, consuming less memory

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement