I am trying to add a couple of million rows to a table in exasol one column is supposed have increasing integer values (1 – xmio). I can’t get my python script to run in parallel and distribute the load to different nodes in the cluster. Because of that the process takes days to complete, which is not sufficient.
Since CSV-Import is parallel in exasol I created a CSV with 250mio rows locally, zipped it and uploaded it into the table, this was the fastest approach yet, taking 7 minutes (bottlenecked by my upload speed).
Other Approaches:
CREATE SCRIPT BIG1 (bound) AS i = 0 while i < bound do query([[insert into USER_XXXXX.BIG values (:n, null, null, null, null, null)]], {n=i}) i = i+1 end / EXECUTE SCRIPT BIG1(250000000);
Very naive approach because inserts are slow, takes years to complete
CREATE PYTHON SCALAR SCRIPT USER_XXXXX.BIG2 ("AMOUNT_TO_CREATE" DECIMAL(18,0)) EMITS (val DECIMAL(18,0), val BOOLEAN, val BOOLEAN, val BOOLEAN, val BOOLEAN, val BOOLEAN) AS def generate_sql_for_import_spec(import_spec): return "SELECT USER_XXXXX.BIG2("+import_spec.parameters["AMOUNT_TO_CREATE"]+")" def run(ctx): for i in range(1, ctx.AMOUNT_TO_CREATE + 1): ctx.emit(i, None, None, None, None, None) / IMPORT INTO USER_XXXXX.BIG FROM SCRIPT USER_XXXXX.BIG2 WITH AMOUNT_TO_CREATE='250000000';
Works better and runs fully on the server. Takes about 33 Minutes to execute
I understand, that the script can’t run in parallel as is. It would be easy to partition it in 50Mio Chunks and run it on 5 nodes. Unfortunately I have no Idea how to connect to other nodes or assign script execution to specific nodes in the cluster.
Advertisement
Answer
You are supposed to run multiple instances of the script on multiple nodes. The easiest way to achieve parallelism is to create a proxy table containing as many rows as number of parallel processes you want you want to run.
Let’s assume we want to run 5 parallel processes.
CREATE TABLE script_parallel ( proc_id DECIMAL(18,0) ); INSERT INTO script_parallel VALUES ((1),(2),(3),(4),(5));
You may use this table to run multiple instances of script now:
SELECT my_script(proc_id, 5, 250000000) FROM script_parallel GROUP BY proc_id
Inside each script instance EMIT every (Nth value + proc_id) up to max value of 250000000. Every script instance should generate smaller subset of final result. Exasol will merge results of all instances for you.
It’s a good idea to use LUA scripting language for scripts of this kind. Performance will be 100x better.