I have a pyspark data frame. How can I select the values of one column where another column contains a specific value? Suppose I have n columns; for two columns A and B the rows are (a, b), (a, c), (d, f), and I want the values of column B for the rows where A has a specific value. …
Tag: pyspark
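A minimal sketch of the usual pattern, assuming the sample rows are (a, b), (a, c), (d, f) and the goal is every value of B on rows where A equals a given value (all names here are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", "b"), ("a", "c"), ("d", "f")], ["A", "B"])

# Keep only the rows where A equals the value we care about, then project B.
result = df.filter(F.col("A") == "a").select("B")
result.show()
# +---+
# |  B|
# +---+
# |  b|
# |  c|
# +---+
```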
Converting query from SQL to pyspark
I am trying to convert the following SQL query into pyspark: The code I have in PySpark right now is this: However, this simply returns the number of rows in the “data” dataframe, and I know this isn’t correct. I am very new to PySpark; can anyone help me solve this? Answer You need to collect the result into
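The query and code are elided above, so this is only a hedged sketch of the pattern the truncated answer points at: an aggregate built with the DataFrame API comes back as a one-row DataFrame, not a number, and the scalar has to be collected out of it (the table and column names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([(1, 10.0), (2, 30.0)], ["id", "amount"])

# Equivalent of e.g. SELECT SUM(amount) FROM data -- agg() returns a
# one-row DataFrame, so collect the result into a local variable.
row = data.agg(F.sum("amount").alias("total")).collect()[0]
total = row["total"]
print(total)  # 40.0
```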
How do I specify a default value when the value is “null” in a spark dataframe?
I have a data frame like the picture below. Where the value of the “item_param” column is null, I want to replace it with the string ‘test’. How can I do it? df = sv_df….
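A minimal sketch, assuming item_param holds real nulls (not the literal string “null”); fillna with a per-column dict is the usual way to supply a default:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("p1", None), ("p2", "x")], ["item", "item_param"])

# fillna replaces nulls in the named column with the given default string.
df = df.fillna({"item_param": "test"})
df.show()
```

If the column instead contains the literal string "null", a when/otherwise expression comparing against that string would be needed rather than fillna.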
How to add a ranking to a pyspark dataframe
I have a pyspark dataframe with 2 columns – id and count. I want to add a ranking to this by reverse count, so the highest count has rank 1, the second highest rank 2, etc. testDF = spark.createDataFrame([("DJS232", 437232)], [“id”, “count”]) I first tried using monotonically_increasing_id() and this worked, ish. It had monotonically increasing id numbers, but the jump from the first
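A hedged sketch of the ranking the question asks for: a window ordered by count descending with row_number gives a gap-free 1, 2, 3, … ranking, which monotonically_increasing_id does not (the extra rows are invented for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
testDF = spark.createDataFrame(
    [("DJS232", 437232), ("ABC123", 98), ("XYZ999", 512000)], ["id", "count"]
)

# row_number over a descending sort by count yields consecutive ranks.
# Note: an un-partitioned window pulls all rows into one partition, which
# is fine for small data but worth knowing about on large frames.
w = Window.orderBy(F.col("count").desc())
ranked = testDF.withColumn("rank", F.row_number().over(w))
ranked.show()
```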
Groupby fill missing values in dataframe based on the average of the previous and next available values
I have a data frame which has some groups, and I want to fill the missing values in the score column with the average of the last available previous value and the next available value, i.e. (previous value + next value) / 2. I …
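A sketch of one way to do this with window functions, assuming each group has an ordering column (here a hypothetical seq) and that score is the column to fill:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("g1", 1, 10.0), ("g1", 2, None), ("g1", 3, 30.0)],
    ["group", "seq", "score"],
)

# Last non-null value looking backwards and first non-null value looking
# forwards, both within the group and ordered by the sequence column.
w_prev = (
    Window.partitionBy("group").orderBy("seq")
    .rowsBetween(Window.unboundedPreceding, -1)
)
w_next = (
    Window.partitionBy("group").orderBy("seq")
    .rowsBetween(1, Window.unboundedFollowing)
)

prev_val = F.last("score", ignorenulls=True).over(w_prev)
next_val = F.first("score", ignorenulls=True).over(w_next)

filled = df.withColumn(
    "score",
    F.when(F.col("score").isNull(), (prev_val + next_val) / 2)
     .otherwise(F.col("score")),
)
filled.show()  # the null row becomes (10.0 + 30.0) / 2 = 20.0
```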
Filtering rows in pyspark dataframe and creating a new column that contains the result
So I am trying to identify the crimes that happen within the SF downtown boundary on Sundays. My idea was to first write a UDF to label whether each crime is in the area I identify as the downtown area; if …
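A minimal sketch of the UDF-labelling idea, with a made-up bounding box and column names standing in for the real downtown boundary and crime schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
crimes = spark.createDataFrame(
    [(37.78, -122.41, "Sunday"), (37.70, -122.50, "Monday")],
    ["lat", "lon", "day_of_week"],
)

# Hypothetical rectangle standing in for the actual downtown polygon.
def in_downtown(lat, lon):
    return 37.76 <= lat <= 37.80 and -122.43 <= lon <= -122.39

in_downtown_udf = F.udf(in_downtown, BooleanType())

# Label each crime with the UDF, then filter on the new flag plus the day.
labeled = crimes.withColumn("downtown", in_downtown_udf("lat", "lon"))
sunday_downtown = labeled.filter(
    F.col("downtown") & (F.col("day_of_week") == "Sunday")
)
sunday_downtown.show()
```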
Is there a way to compare all rows in one column of a dataframe against all rows in another column of another dataframe (spark)?
I have two dataframes in Spark, both with an IP column. One has over 800,000 entries while the other has 4,000 entries. What I want to do is to see if the IPs in the smaller dataframe appear in the IP column of the large dataframe. At the moment all I can manage is to compare the first row
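A hedged sketch: a left semi join compares every row of the small IP column against every row of the large one in a single distributed operation, so no per-row loop is needed (column names assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
large = spark.createDataFrame([("10.0.0.1",), ("10.0.0.2",)], ["ip"])
small = spark.createDataFrame([("10.0.0.2",), ("10.0.0.9",)], ["ip"])

# left_semi keeps the rows of `small` whose ip appears anywhere in `large`.
# Spark will typically broadcast the 4,000-row side on its own.
matches = small.join(large, on="ip", how="left_semi")
matches.show()
```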
Is there any method to find the number of columns having data in a pyspark data frame?
I have a pyspark data frame that has 7 columns. I have to add a new column named “sum” that counts the number of columns that have data (not null). Example: a data frame in which the yellow highlighted part is the required answer. Answer This sum can be calculated like this: Hope this helps!
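A sketch of the counting trick the truncated answer likely refers to: turn each column into a 0/1 null indicator and add the indicators up (the three-column frame is illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, None, "x"), (None, None, "y")], ["c1", "c2", "c3"]
)

# For each column, emit 1 when it is non-null and 0 otherwise, then add
# the per-column indicators into a single "sum" column.
indicators = [
    F.when(F.col(c).isNotNull(), 1).otherwise(0) for c in df.columns
]
df = df.withColumn("sum", sum(indicators))  # Python sum over Column objects
df.show()
```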
In SQL, how do I group by each of a long list of columns, get counts, and assemble them all into one table?
I have performed a stratified sample on a multi-label dataset before training a classifier and want to check how balanced it is now. The columns in the dataset are: |_Body|label_0|label_1|label_10|…
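The pyspark equivalent is a hedged sketch like the following: count the values of each label_ column separately and union the per-column counts into one summary table (the tiny frame is invented):

```python
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("t1", 0, 1), ("t2", 1, 1), ("t3", 0, 0)],
    ["_Body", "label_0", "label_1"],
)

label_cols = [c for c in df.columns if c.startswith("label_")]

# One small count-by-value frame per label column, tagged with the column
# name, then unioned into a single summary table.
counts = [
    df.groupBy(F.col(c).alias("value"))
      .count()
      .withColumn("label", F.lit(c))
    for c in label_cols
]
summary = reduce(lambda a, b: a.unionByName(b), counts)
summary.select("label", "value", "count").show()
```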
Pyspark: cast array with nested struct to string
I have a pyspark dataframe with a column named Filters: “array&lt;struct&lt;…&gt;&gt;”. I want to save my dataframe to a CSV file, and for that I need to cast the array to string type. I tried DF.Filters.tostring() and DF.Filters.cast(StringType()), but both solutions generate an error message for each row in the Filters column: org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19 The code is as follows Sample JSON data:
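Since a plain cast to string is not supported for an array of structs, the usual fix is to_json, which serialises the column to a JSON string that CSV can hold; a minimal sketch with an assumed struct layout:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# The struct fields (op, field) are assumptions; the real schema is elided.
df = spark.createDataFrame(
    [([("eq", "x")],)],
    "Filters array<struct<op:string,field:string>>",
)

# to_json turns the whole array-of-structs into a JSON string, which is
# safe to write to CSV, unlike a cast(StringType()) on this column type.
df = df.withColumn("Filters", F.to_json(F.col("Filters")))
df.show(truncate=False)
```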