How can I optimize a query of this form since I keep running into this OOM error? Or come up with a better execution plan? If I removed the substring clause, the query would work fine, suggesting that this takes a lot of memory.
When the job fails, the beeline output shows the OOM Java heap space. Readings online suggested that I increase export HADOOP_HEAPSIZE
but this still results in the error. Another thing I tried was increasing the hive.tez.container.size
and hive.tez.java.opts
(tez heap), but still has this error. In the YARN logs, there would be GC overhead limit exceeded, suggesting a combination of not enough memory and/or the query plan is extremely inefficient since it can’t collect back enough memory.
I am using Azure HDInsight Interactive Query 4.0. 20 worker node, D13v2 8 core, and 56GB RAM.
Source table
create external table database.sourcetable(
a,
b,
c,
(183 total columns)
)
PARTITIONED BY (
W string,
X int,
Y string,
Z int
)
Target Table
create external table database.NEWTABLE(
a,
b,
c,
(187 total columns)
W,
X,
Y,
Z
)
PARTITIONED BY (
aAAA,
bBBB
)
Query
insert overwrite table database.NEWTABLE partition(aAAA, bBBB, cCCC)
select
a,
b,
c,
(187 total columns)
W,
X,
Y,
Z,
cast(a as string) as aAAA,
from_unixtime(unix_timestamp(b,'yyMMdd'),'yyyyMMdd') as bBBB,
substring(upper(c),1,2) as cCCC
from database.sourcetable
Advertisement
Answer
If everything else is okay, try to add distribute by partiton key at the end of your query:
from database.sourcetable
distribute by aAAA, bBBB, cCCC
As a result each reducer will create only one partition file, consuming less memory