I have an RDD with string columns, but I want to know if a string column has numeric values. I'm looking for a very inexpensive way to do this, since I have many tables with millions of records. For example, I've tried casting the column to int, float, etc., but I get all null values, so the count is always zero.
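A minimal sketch of the usual cast-and-count check on toy data (the column name "value" and the sample rows are assumptions): cast() returns null for strings that are not parseable as numbers, so comparing non-null counts before and after the cast shows whether the column holds numeric values. On millions of records the same check can be run on df.sample(...) to keep it cheap.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# toy data: two numeric strings, one non-numeric string, one null
df = spark.createDataFrame([("123",), ("4.5",), ("abc",), (None,)], ["value"])

stats = df.select(
    F.count(F.col("value")).alias("non_null"),                # non-null strings
    F.count(F.col("value").cast("double")).alias("numeric"),  # strings that cast cleanly
).first()

print(stats.non_null, stats.numeric)  # 3 2 -> the column is only partly numeric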
Groupby fill missing values in dataframe based on average of previous and next available values
I have a data frame which has some groups, and I want to fill the missing values of the score column based on the average of the last previous available value and the next available value, i.e. (previous value + next value) / 2. I …
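A sketch with window functions, assuming made-up column names ("group", "ts", "score"): last(..., ignorenulls=True) looking backwards gives the previous available value, first(..., ignorenulls=True) looking forwards gives the next one, and their average fills the gap.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, None), ("a", 3, 30.0),
     ("b", 1, None), ("b", 2, 40.0)],
    ["group", "ts", "score"],
)

w = Window.partitionBy("group").orderBy("ts")
prev_val = F.last("score", ignorenulls=True).over(
    w.rowsBetween(Window.unboundedPreceding, Window.currentRow))
next_val = F.first("score", ignorenulls=True).over(
    w.rowsBetween(Window.currentRow, Window.unboundedFollowing))

filled = df.withColumn(
    "score_filled",
    F.when(F.col("score").isNotNull(), F.col("score"))
     # if only one neighbour exists, fall back to it
     .otherwise(F.coalesce((prev_val + next_val) / 2, prev_val, next_val)),
)
filled.show()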
Doing a sum of columns based on some complex logic in pyspark
Here is the question in the image attached. Table: the result column is calculated based on the rules below. If col3 > 0, then result = col1 + col2. If col3 = 0, then result = sum(col2) until col3 > 0, plus col1 (of the row where col3 > 0). For example, for row 3 the result = 60 + 70 + 80 + 30 (col1 from row 5, because col3 > 0 there) = 240; for row 4, the result = 70 + 80 + 30 (col1 from row 5 …
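One way to express that rule with window functions. This is a rough sketch: it assumes an explicit ordering column "id" (Spark rows have no inherent order), uses the column names col1/col2/col3 from the question, and reconstructs sample values from the worked example.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 10, 40, 1), (2, 20, 50, 2), (3, 25, 60, 0),
     (4, 28, 70, 0), (5, 30, 80, 3)],
    ["id", "col1", "col2", "col3"],
)

# Walk bottom-up counting rows with col3 > 0: every run of col3 == 0 rows
# lands in the same group as the first col3 > 0 row below it.
grp = F.sum(F.when(F.col("col3") > 0, 1).otherwise(0)).over(
    Window.orderBy(F.col("id").desc())
          .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df = df.withColumn("grp", grp)

# sum of col2 from the current row down to the closing col3 > 0 row of the group
col2_tail = F.sum("col2").over(
    Window.partitionBy("grp").orderBy("id")
          .rowsBetween(Window.currentRow, Window.unboundedFollowing))
# col1 of that closing row
closing_col1 = F.max(F.when(F.col("col3") > 0, F.col("col1"))).over(
    Window.partitionBy("grp"))

result = df.withColumn(
    "result",
    F.when(F.col("col3") > 0, F.col("col1") + F.col("col2"))
     .otherwise(col2_tail + closing_col1),
)
result.orderBy("id").show()  # row 3 -> 240, row 4 -> 180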
Filtering rows in pyspark dataframe and creating a new column that contains the result
So I am trying to identify the crimes that happen within the SF downtown boundary on Sundays. My idea was to first write a UDF to label whether each crime is in the area I identify as the downtown area, if …
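A hedged sketch of that idea: a UDF flags points inside an assumed rectangular "downtown" bounding box, and the result is filtered to Sundays. The file name, the column names (X, Y, DayOfWeek) and the coordinates are all placeholders, not taken from the question.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

LON_MIN, LON_MAX = -122.42, -122.39  # illustrative bounding box only
LAT_MIN, LAT_MAX = 37.78, 37.80

@F.udf(returnType=BooleanType())
def in_downtown(lon, lat):
    if lon is None or lat is None:
        return False
    return LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX

crimes = spark.read.csv("sf_crimes.csv", header=True, inferSchema=True)  # assumed input

sunday_downtown = (
    crimes.withColumn("is_downtown", in_downtown(F.col("X"), F.col("Y")))
          .filter((F.col("DayOfWeek") == "Sunday") & F.col("is_downtown"))
)
sunday_downtown.show()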
Is there a way to compare all rows in one column of a dataframe against all rows in another column of another dataframe (spark)?
I have two dataframes in Spark, both with an IP column. One has over 800,000 entries while the other has 4,000 entries. What I want to do is to see if the IPs in the smaller dataframe appear in the IP column of the large dataframe. At the moment all I can manage is to compare the first row
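A minimal sketch, assuming both columns are called "ip": a left-semi join keeps only the small-side rows whose IP also appears in the large dataframe, and a plain left join against a "found" flag keeps every small-side row with a boolean.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

small_df = spark.createDataFrame([("10.0.0.1",), ("10.0.0.2",)], ["ip"])
large_df = spark.createDataFrame([("10.0.0.1",), ("192.168.0.1",)], ["ip"])

# IPs from the small dataframe that also occur in the large one
matches = small_df.join(large_df.select("ip").distinct(), on="ip", how="left_semi")

# or keep every small-side row with a True/False flag
flagged = small_df.join(
    large_df.select("ip").distinct().withColumn("found", F.lit(True)),
    on="ip", how="left",
).fillna({"found": False})

matches.show()
flagged.show()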
Is there any method to find the number of columns having data in a pyspark data frame
I have a pyspark data frame that has 7 columns. I have to add a new column named "sum" and calculate in it the number of columns that have data (not null). Example: a data frame in which the yellow-highlighted part is the required answer. Answer: This sum can be calculated as in the sketch below. Hope this helps!
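A sketch of that counting trick on made-up columns: turn each column into a 0/1 "is not null" indicator and add the indicators up.

from functools import reduce
from operator import add

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, None, "a"), (None, None, "b"), (3, 4, None)],
    ["c1", "c2", "c3"],
)

# 1 where the column has data, 0 where it is null, summed across all columns
non_null_flags = [F.when(F.col(c).isNotNull(), 1).otherwise(0) for c in df.columns]
df.withColumn("sum", reduce(add, non_null_flags)).show()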
In SQL how do I group by every one of a long list of columns and get counts, assembled all into one table?
I have performed a stratified sample on a multi-label dataset before training a classifier and want to check how balanced it is now. The columns in the dataset are: |_Body|label_0|label_1|label_10|…
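A hedged sketch of one way to assemble those counts, assuming the label columns share the "label_" prefix: group and count each label column separately, tag each result with the column name, and union the pieces into one summary table. The toy data is a placeholder.

from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# toy stand-in for the sampled dataset
df = spark.createDataFrame(
    [("t1", 0, 1), ("t2", 1, 1), ("t3", 0, 0)],
    ["_Body", "label_0", "label_1"],
)

label_cols = [c for c in df.columns if c.startswith("label_")]

per_label_counts = [
    df.groupBy(F.col(c).alias("value"))
      .count()
      .withColumn("label", F.lit(c))
      .select("label", "value", "count")
    for c in label_cols
]

summary = reduce(lambda a, b: a.unionByName(b), per_label_counts)
summary.orderBy("label", "value").show()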
Pyspark: cast array with nested struct to string
I have a pyspark dataframe with a column named Filters of type "array<struct<…>>". I want to save my dataframe in a csv file, and for that I need to cast the array to string type. I tried DF.Filters.tostring() and DF.Filters.cast(StringType()), but both solutions generate an error message for each row in the Filters column: org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@56234c19. The code is as follows. Sample JSON data:
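A minimal sketch, assuming a JSON string is an acceptable representation for the CSV: to_json serialises the array-of-structs column to a string. The struct field "name", the toy data and the output path are placeholders.

from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, [Row(name="a"), Row(name="b")])],
    "id INT, Filters ARRAY<STRUCT<name: STRING>>",
)

# serialise the array column so the CSV writer can handle it
df_csv_ready = df.withColumn("Filters", F.to_json("Filters"))
df_csv_ready.write.mode("overwrite").csv("/tmp/filters_csv", header=True)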
Filtering a Pyspark DataFrame with SQL-like IN clause
I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')

where a is the tuple (1, 2, 3). …
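A minimal sketch of two equivalent ways, reusing the names from the question: the isin() column method, or splicing the tuple's literal values into the SQL string.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1,), (2,), (5,)], ["field1"])
df.createOrReplaceTempView("my_df")

a = (1, 2, 3)

# DataFrame API
df.filter(F.col("field1").isin(list(a))).show()

# SQL string with the values formatted in: ... WHERE field1 IN (1, 2, 3)
spark.sql("SELECT * FROM my_df WHERE field1 IN {0}".format(str(a))).show()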
Sparksql filtering (selecting with where clause) with multiple conditions
Hi, I have the following issue: all the values that I want to filter on are literal 'null' strings, not N/A or Null values. I tried these three options:

numeric_filtered = numeric.filter(numeric['LOW'] != 'null').filter(numeric['HIGH'] != 'null').filter(numeric['NORMAL'] != 'null')

numeric_filtered = numeric.filter(numeric['LOW'] != 'null' AND numeric['HIGH'] != 'null' AND numeric['NORMAL'] != 'null')

sqlContext.sql("SELECT * from numeric WHERE LOW != 'null' …
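A minimal sketch of the filter that usually resolves this: in the DataFrame API the conditions have to be combined with the & operator rather than the AND keyword, and each comparison needs its own parentheses. The column names and the literal 'null' strings follow the question; the data is made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

numeric = spark.createDataFrame(
    [("1.0", "2.0", "3.0"), ("null", "2.0", "3.0")],
    ["LOW", "HIGH", "NORMAL"],
)

numeric_filtered = numeric.filter(
    (numeric["LOW"] != "null")
    & (numeric["HIGH"] != "null")
    & (numeric["NORMAL"] != "null")
)
numeric_filtered.show()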