I have an AWS Glue JDBC connection to a SQL Server instance running on an EC2 server. After crawling the whole schema I created a job to query some tables, and I used the Activity Monitor to check what Glue is sending to the database; the queries are just a select * over the whole table…
The code that does that is below:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(database=glue_db, table_name=table_name)
My question is: how do I change this request to be something like 'select column1, column2 from table_name' instead of a select *? Imagine that this table has a blob column that I don't need for any transformation; why should I pull all that data over the network if it will never be used?
I tried to edit the schema of the table, but nothing changes… it still only sends select *…
Is there any way to force the initial query to be what I want?
Thanks a lot!
Answer
Pushing the query down to the database engine is a good idea, as it reduces both the load on your Glue job and the amount of data transferred over the network.
If you want to push the query down to the database, there is no native feature in Glue that supports this. However, you can achieve the same result by reading the data into a Spark DataFrame and then converting it to a DynamicFrame to leverage Glue transformations.
The following is an example of a MySQL read; you can refer to it and change the JDBC URL and driver for SQL Server, as sketched further below.
query= "(select ab.id,ab.name,ab.date1,bb.tStartDate from test.test12 ab join test.test34 bb on ab.id=bb.id where ab.date1>'" + args['start_date'] + "') as testresult" datasource0 = spark.read.format("jdbc").option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test").option("driver", "com.mysql.jdbc.Driver").option("dbtable", query).option("user", "test").option("password", "Password1234").load()