I’m using Spark SQL and have a DataFrame with user IDs and product reviews. I need to filter stop words out of the reviews, and I have a text file of stop words to filter.
I managed to split the reviews into lists of strings, but I don’t know how to do the filtering.
This is what I tried:
from pyspark.sql.functions import col, split

stopWords = spark.read.text('/FileStore/tables/english.txt')
df.select(split(col("reviewText"), " "))
df.filter(col("reviewText") == stopWords)
thanks!
Answer
Your question is a little vague, and you do not mention the flatMap approach, which is the more common way to do this.
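For reference, a minimal sketch of that flatMap route, assuming the stop-word file holds one word per line and the review column is called reviewText as in your question (the path and variable names are assumptions):

stopWords = set(spark.read.text('/FileStore/tables/english.txt').rdd.map(lambda r: r.value.strip()).collect())

# tokenize each review into lower-cased words and drop any word in the stop-word set
tokens = df.select("reviewText").rdd.flatMap(lambda r: r.reviewText.lower().split(" "))
kept = tokens.filter(lambda w: w not in stopWords)
kept.take(20)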
Here is an alternative that works directly on the DataFrame column.
import pyspark.sql.functions as F

# read the stop words, one list entry per word
stopWordsIn = spark.read.text('/FileStore/tables/sw.txt').rdd.flatMap(lambda line: line.value.split(" "))
stopWords = stopWordsIn.collect()
print(stopWords)

words = spark.read.text('/FileStore/tables/df.txt')
# strip non-alphanumeric characters and lower-case everything
words = words.withColumn('value_1', F.lower(F.regexp_replace('value', "[^0-9a-zA-Z ]+", "")))
# blank out whole words that appear in the stop-word list
words = words.withColumn('value_2', F.regexp_replace('value_1', r'\b(' + '|'.join(stopWords) + r')\b', ''))
words.show()
This returns the output below; you can then drop the intermediate columns you do not want.
['a', 'in', 'the']
+--------------+-------------+-------------+
|         value|      value_1|      value_2|
+--------------+-------------+-------------+
|      A quick2|     A quick2|       quick2|
|brown fox was#|brown fox was|brown fox was|
| in the house.| in the house|        house|
+--------------+-------------+-------------+
You can see the stop-word list in the output, and that I converted everything to lower case and stripped out the non-alphanumeric characters.
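To apply the same idea to the DataFrame from your question, a sketch along these lines should work (df and reviewText are taken from your question; the stop-word file path is an assumption):

import pyspark.sql.functions as F

stopWords = [r.value.strip() for r in spark.read.text('/FileStore/tables/english.txt').collect()]

# clean the review text, then blank out whole words that appear in the stop-word list
df2 = df.withColumn('reviewClean', F.lower(F.regexp_replace('reviewText', "[^0-9a-zA-Z ]+", "")))
df2 = df2.withColumn('reviewClean', F.regexp_replace('reviewClean', r'\b(' + '|'.join(stopWords) + r')\b', ''))
df2.show(truncate=False)

If you would rather keep the reviews as arrays of tokens (your split approach), pyspark.ml.feature.StopWordsRemover does the same filtering on an array<string> column.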