I want to apply custom logic over a dataset stored in Redshift. Example of input data: userid, event, fileid, timestamp, …
100000, start, 120, 2018-09-17 19:11:40
100000, done, 120, 2018-…
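A common starting point is to load the Redshift table into a Spark DataFrame over JDBC and apply the custom logic there. A minimal sketch, assuming a table named `events`, placeholder connection details, and the Redshift JDBC driver on the classpath:

```scala
// Sketch only: URL, table name, and credentials are placeholders.
val events = spark.read
  .format("jdbc")
  .option("url", "jdbc:redshift://example-cluster:5439/dev")
  .option("dbtable", "events")
  .option("user", "username")
  .option("password", "password")
  .load()

// Custom logic then runs on the DataFrame, e.g. counting events per user/file:
events.groupBy("userid", "fileid").count().show()
```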
spark [dataframe].write.option("mode","overwrite").saveAsTable("foo") fails with 'already exists' if foo exists
I think I am seeing a bug in Spark where mode 'overwrite' is not respected; rather, an exception is thrown on an attempt to saveAsTable into a table that already exists (using mode 'overwrite'). …
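The usual explanation is that `.option("mode", "overwrite")` sets a data source option, not the writer's save mode, so the writer stays at its default of erroring when the table exists. A minimal sketch of the distinction, assuming `df` is the DataFrame being written:

```scala
import org.apache.spark.sql.SaveMode

// option("mode", "overwrite") is passed through as a plain data source
// option and does not change the save mode, which defaults to
// SaveMode.ErrorIfExists — hence the "already exists" exception.
// Set the mode on the writer itself instead:
df.write
  .mode(SaveMode.Overwrite) // or the string form: .mode("overwrite")
  .saveAsTable("foo")
```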
In SQL, how do I group by each of a long list of columns and get counts, all assembled into one table?
I have performed stratified sampling on a multi-label dataset before training a classifier and want to check how balanced it is now. The columns in the dataset are: |_Body|label_0|label_1|label_10|…
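One way to get a count per value for every label column, assembled into a single table, is to union one groupBy-count per column. A sketch, assuming the label column names shown above (the full list is elided) and a DataFrame named `df`:

```scala
import org.apache.spark.sql.functions.{col, lit}

// Placeholder list; substitute the full set of label columns.
val labelCols = Seq("label_0", "label_1", "label_10")

// One (column, value, count) row set per label column, unioned together.
val balance = labelCols
  .map { c =>
    df.groupBy(col(c)).count()
      .select(lit(c).as("column"), col(c).cast("string").as("value"), col("count"))
  }
  .reduce(_ union _)

balance.show()
```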
How to get the COUNT of emails for each id in Scala
I use this query in SQL to return how many user_ids have more than one email. How would I write the same query against a users DataFrame in Scala? Also, how would I be able to return the exact …
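A sketch of the DataFrame equivalent, assuming a `users` DataFrame with `user_id` and `email` columns (the SQL in the question is elided, so the exact shape is a guess):

```scala
import org.apache.spark.sql.functions.{col, countDistinct}

// Count distinct emails per user, then keep users with more than one.
val multiEmail = users
  .groupBy("user_id")
  .agg(countDistinct("email").as("email_count"))
  .filter(col("email_count") > 1)

multiEmail.count() // how many user_ids have more than one email
multiEmail.show()  // the user_ids themselves, with their counts
```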
Aggregate data from multiple rows to one and then nest the data
I’m relatively new to Scala and Spark programming. I have a use case where I need to group data by certain columns and get a count of a certain column (using pivot), and then finally I need …
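A sketch of the groupBy-plus-pivot step, reusing the hypothetical userid/event columns from the first question on this page, with the pivoted counts then nested into a single struct column:

```scala
import org.apache.spark.sql.functions.{col, struct}

// Group by userid, pivot on event, and count occurrences per event type.
val pivoted = df
  .groupBy("userid")
  .pivot("event")
  .count()

// Nest all the pivoted count columns into one struct column.
val countCols = pivoted.columns.filter(_ != "userid").map(col)
val nested = pivoted.select(col("userid"), struct(countCols: _*).as("event_counts"))
```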
What’s the default window frame for window functions?
Running the following code: The result is: There is no window frame defined in the above code; it looks like the default window frame is rowsBetween(Window.unboundedPreceding, Window.currentRow). I’m not sure my understanding of the default window frame is correct.

Answer: From Spark Gotchas, the default frame specification depends on other aspects of a given window definition: if the ORDER BY clause is specified and
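To remove the ambiguity, the frame can always be spelled out explicitly rather than relying on the default. A sketch, assuming hypothetical userid/timestamp/amount columns:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Spelling out the frame makes the running-total behavior explicit
// instead of depending on whatever default the window definition implies.
val w = Window
  .partitionBy("userid")
  .orderBy("timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withRunningTotal = df.withColumn("running_total", sum("amount").over(w))
```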
Pyspark: cast array with nested struct to string
I have a PySpark DataFrame with a column named Filters of type "array<struct<…>>" (the inner struct fields are elided). I want to save my DataFrame to a CSV file, and for that I need to cast the array to string type. I tried DF.Filters.tostring() and DF.Filters.cast(StringType()), but both solutions generate an error message for each row in the Filters column: org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19. The code is as follows. Sample JSON data:
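The question is PySpark, but the usual fix in either API is to serialize the complex column with to_json rather than cast it. A Scala-API sketch (matching the other examples on this page; the PySpark call is the same function), with a placeholder output path:

```scala
import org.apache.spark.sql.functions.{col, to_json}

// CSV cannot hold complex types; to_json turns the array-of-structs column
// into a JSON string, avoiding the raw UnsafeArrayData toString output.
val out = df.withColumn("Filters", to_json(col("Filters")))
out.write.option("header", "true").csv("/tmp/filters_csv") // placeholder path
```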
Including null values in an Apache Spark Join
I would like to include null values in an Apache Spark join. Spark doesn’t include rows with null by default. Here is the default Spark behavior. val numbersDf = Seq( ("123"), ("456"), (null),…
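The standard answer is the null-safe equality operator `<=>` (eqNullSafe), which treats two nulls as equal, so null-keyed rows survive the join. A self-contained sketch, assuming an existing SparkSession named `spark` and a hypothetical second DataFrame:

```scala
import spark.implicits._

val numbersDf = Seq("123", "456", null).toDF("numbers")
val lettersDf = Seq("123", null).toDF("letters")

// === drops the null-keyed rows; <=> matches null against null and keeps them.
val joined = numbersDf.join(lettersDf, numbersDf("numbers") <=> lettersDf("letters"))
joined.show()
```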
Register temp table in dataframe not working
Below is my script to use SQL on a DataFrame with Python: df.show(5) shows the result below: then I register the DataFrame as a temp table: and tried to run a SQL query like the one below: It doesn’t produce the expected result; instead: I also tried: and it gives me: So it seems the registerTempTable method only creates the table schema, and the table
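For reference, the usual working pattern is below; registerTempTable was deprecated in favor of createOrReplaceTempView, and the view must be queried through the same SparkSession that created the DataFrame. A sketch in the Scala API (the question uses Python, where the calls are identical), with hypothetical table and column names:

```scala
// Register the DataFrame as a temp view and query it via the same session.
df.createOrReplaceTempView("people") // registerTempTable is the deprecated name

val result = spark.sql("SELECT * FROM people WHERE age > 21") // hypothetical query
result.show(5)
```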
What is the difference between cube, rollup and groupBy operators?
I can’t find any detailed documentation regarding the differences. I do notice a difference, because when interchanging cube and groupBy function calls, I get different results. I noticed that for the result using cube, I got a lot of null values on the expressions where I used to use groupBy.

Answer: These are not intended to work in the same
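A sketch of the three operators side by side, with hypothetical city/year columns; the extra null rows produced by cube and rollup are the subtotal levels, not missing data:

```scala
// groupBy: one count per (city, year) pair — no subtotals.
df.groupBy("city", "year").count()

// rollup: hierarchical subtotals — (city, year), (city, total), (grand total),
// where the "total" levels appear as nulls in the grouping columns.
df.rollup("city", "year").count()

// cube: subtotals for every combination of the grouping columns,
// including (null, year), which rollup does not produce.
df.cube("city", "year").count()
```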