I want to apply custom logic over a dataset stored in Redshift. Example of input data: userid, event, fileid, timestamp, …
100000, start, 120, 2018-09-17 19:11:40
100000, done, 120, 2018-…
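A common starting point is to load the Redshift table into a Spark DataFrame over JDBC and apply the custom logic there. A minimal sketch, assuming a table named `events`, placeholder connection details, and the Redshift JDBC driver on the classpath:

```scala
// Sketch only: URL, table name, and credentials are placeholders.
val events = spark.read
  .format("jdbc")
  .option("url", "jdbc:redshift://example-cluster:5439/dev")
  .option("dbtable", "events")
  .option("user", "username")
  .option("password", "password")
  .load()

// Custom logic then runs on the DataFrame, e.g. counting events per user/file:
events.groupBy("userid", "fileid").count().show()
```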
spark [dataframe].write.option("mode","overwrite").saveAsTable("foo") fails with 'already exists' if foo exists
I think I am seeing a bug in Spark where mode 'overwrite' is not respected; rather, an exception is thrown on an attempt to saveAsTable into a table that already exists (using mode 'overwrite'). …
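The usual explanation is that `.option("mode", "overwrite")` sets a data source option, not the writer's save mode, so the writer stays at its default of erroring when the table exists. A minimal sketch of the distinction, assuming `df` is the DataFrame being written:

```scala
import org.apache.spark.sql.SaveMode

// option("mode", "overwrite") is passed through as a plain data source
// option and does not change the save mode, which defaults to
// SaveMode.ErrorIfExists — hence the "already exists" exception.
// Set the mode on the writer itself instead:
df.write
  .mode(SaveMode.Overwrite) // or the string form: .mode("overwrite")
  .saveAsTable("foo")
```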
In SQL, how do I group by each of a long list of columns and get counts, all assembled into one table?
I have performed stratified sampling on a multi-label dataset before training a classifier and want to check how balanced it is now. The columns in the dataset are: |_Body|label_0|label_1|label_10|…
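One way to get a count per value for every label column, assembled into a single table, is to union one groupBy-count per column. A sketch, assuming the label column names shown above (the full list is elided) and a DataFrame named `df`:

```scala
import org.apache.spark.sql.functions.{col, lit}

// Placeholder list; substitute the full set of label columns.
val labelCols = Seq("label_0", "label_1", "label_10")

// One (column, value, count) row set per label column, unioned together.
val balance = labelCols
  .map { c =>
    df.groupBy(col(c)).count()
      .select(lit(c).as("column"), col(c).cast("string").as("value"), col("count"))
  }
  .reduce(_ union _)

balance.show()
```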
How to get the COUNT of emails for each id in Scala
I use this query in SQL to return how many user_ids have more than one email. How would I write the same query against a users DataFrame in Scala? Also, how would I be able to return the exact …
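A sketch of the DataFrame equivalent, assuming a `users` DataFrame with `user_id` and `email` columns (the SQL in the question is elided, so the exact shape is a guess):

```scala
import org.apache.spark.sql.functions.{col, countDistinct}

// Count distinct emails per user, then keep users with more than one.
val multiEmail = users
  .groupBy("user_id")
  .agg(countDistinct("email").as("email_count"))
  .filter(col("email_count") > 1)

multiEmail.count() // how many user_ids have more than one email
multiEmail.show()  // the user_ids themselves, with their counts
```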
Aggregate data from multiple rows to one and then nest the data
I’m relatively new to Scala and Spark programming. I have a use case where I need to group data by certain columns and get a count of a certain column (using pivot), and then finally I need …
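A sketch of the groupBy-plus-pivot step, reusing the hypothetical userid/event columns from the first question on this page, with the pivoted counts then nested into a single struct column:

```scala
import org.apache.spark.sql.functions.{col, struct}

// Group by userid, pivot on event, and count occurrences per event type.
val pivoted = df
  .groupBy("userid")
  .pivot("event")
  .count()

// Nest all the pivoted count columns into one struct column.
val countCols = pivoted.columns.filter(_ != "userid").map(col)
val nested = pivoted.select(col("userid"), struct(countCols: _*).as("event_counts"))
```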
What’s the default window frame for window functions?
Running the following code: The result is: There is no window frame defined in the above code; it looks like the default window frame is rowsBetween(Window.unboundedPreceding, Window.currentRow). I’m not sure my understanding of the default window frame is correct.

Answer: From Spark Gotchas, the default frame specification depends on other aspects of a given window definition: if the ORDER BY clause is specified and
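To remove the ambiguity, the frame can always be spelled out explicitly rather than relying on the default. A sketch, assuming hypothetical userid/timestamp/amount columns:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Spelling out the frame makes the running-total behavior explicit
// instead of depending on whatever default the window definition implies.
val w = Window
  .partitionBy("userid")
  .orderBy("timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withRunningTotal = df.withColumn("running_total", sum("amount").over(w))
```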
Pyspark: cast array with nested struct to string
I have a PySpark DataFrame with a column named Filters of type "array<struct<…>>" (the inner struct fields are elided). I want to save my DataFrame to a CSV file, and for that I need to cast the array to string type. I tried DF.Filters.tostring() and DF.Filters.cast(StringType()), but both solutions generate an error message for each row in the Filters column: org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19. The code is as follows. Sample JSON data:
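The question is PySpark, but the usual fix in either API is to serialize the complex column with to_json rather than cast it. A Scala-API sketch (matching the other examples on this page; the PySpark call is the same function), with a placeholder output path:

```scala
import org.apache.spark.sql.functions.{col, to_json}

// CSV cannot hold complex types; to_json turns the array-of-structs column
// into a JSON string, avoiding the raw UnsafeArrayData toString output.
val out = df.withColumn("Filters", to_json(col("Filters")))
out.write.option("header", "true").csv("/tmp/filters_csv") // placeholder path
```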
Including null values in an Apache Spark Join
I would like to include null values in an Apache Spark join. Spark doesn’t include rows with null by default. Here is the default Spark behavior. val numbersDf = Seq( ("123"), ("456"), (null),…
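The standard answer is the null-safe equality operator `<=>` (eqNullSafe), which treats two nulls as equal, so null-keyed rows survive the join. A self-contained sketch, assuming an existing SparkSession named `spark` and a hypothetical second DataFrame:

```scala
import spark.implicits._

val numbersDf = Seq("123", "456", null).toDF("numbers")
val lettersDf = Seq("123", null).toDF("letters")

// === drops the null-keyed rows; <=> matches null against null and keeps them.
val joined = numbersDf.join(lettersDf, numbersDf("numbers") <=> lettersDf("letters"))
joined.show()
```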
Register temp table in dataframe not working
Below is my script to use SQL on a DataFrame with Python: df.show(5) shows the result below: then I register the DataFrame as a temp table: and tried to run a SQL query like the one below: It doesn’t produce the expected result; instead: I also tried: and it gives me: So it seems the registerTempTable method only creates the table schema, and the table
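For reference, the usual working pattern is below; registerTempTable was deprecated in favor of createOrReplaceTempView, and the view must be queried through the same SparkSession that created the DataFrame. A sketch in the Scala API (the question uses Python, where the calls are identical), with hypothetical table and column names:

```scala
// Register the DataFrame as a temp view and query it via the same session.
df.createOrReplaceTempView("people") // registerTempTable is the deprecated name

val result = spark.sql("SELECT * FROM people WHERE age > 21") // hypothetical query
result.show(5)
```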
What is the difference between cube, rollup and groupBy operators?
I can’t find any detailed documentation regarding the differences. I do notice a difference, because when interchanging cube and groupBy function calls, I get different results. I noticed that for the result using cube, I got a lot of null values on the expressions where I used to use groupBy.

Answer: These are not intended to work in the same
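A sketch of the three operators side by side, with hypothetical city/year columns; the extra null rows produced by cube and rollup are the subtotal levels, not missing data:

```scala
// groupBy: one count per (city, year) pair — no subtotals.
df.groupBy("city", "year").count()

// rollup: hierarchical subtotals — (city, year), (city, total), (grand total),
// where the "total" levels appear as nulls in the grouping columns.
df.rollup("city", "year").count()

// cube: subtotals for every combination of the grouping columns,
// including (null, year), which rollup does not produce.
df.cube("city", "year").count()
```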