filter stop words from text column – spark SQL

Question

I&#8217;m using spark SQL and have a data frame with user IDs & reviews of products. I need to filter stop words from the reviews, and I have a text file with stop words to filter. I managed to split the reviews to lists of strings, but don&#8217;t know how to filter. this is what I tried to do: thanks!

Accepted Answer

You are a little vague in that you do not allude to the flatMap approach, which is more common.Here an alternative just examining the dataframe column.import pyspark.sql.functions as Ffrom pyspark.sql.functions import regexp_extract stopWordsIn = spark.read.text('/FileStore/tables/sw.txt').rdd.flatMap(lambda line: line.value.split(" "))stopWords = stopWordsIn.collect()print(stopWords) words = spark.read.text('/FileStore/tables/df.txt')words = words.withColumn('value_1', F.lower(F.regexp_replace('value', "[^0-9a-zA-Z^ ]+", "")))words = words.withColumn('value_2', F.regexp_replace('value_1', '\b(' + '|'.join(stopWords) + ')\b', ''))words.show()returns &#8211; and filter out the columns you do not want.['a', 'in', 'the']+--------------+-------------+-------------+|         value|      value_1|      value_2|+--------------+-------------+-------------+|      A quick2|     A quick2|       quick2||brown fox was#|brown fox was|brown fox was|| in the house.| in the house|        house|+--------------+-------------+-------------+You see the stop words and the fact that I converted all to lower case and stripped some stuff out.

Advertisement

Answer