
Obtain a count of non-null values by casting a string column to integer in PySpark SQL

I have an RDD with string columns, and I want to know whether a string column contains numeric values. I'm looking for a very inexpensive way to do this, since I have many tables with millions of records.

For example, I've tried casting the column to int, float, etc., but I get all null values, so the count is always zero:

spark.sql('''select count('Violations')
             from tmp
             where cast('Violations' as int) is not null''').show()

This returns the column's values unchanged. I know for a fact that this column contains the string '9' in at least one of its rows. I've tried variations of this with the count() function and with cast() before the FROM clause. Is this a pipe dream?

I saw the Stack Overflow post with the UDF using isdigit, but it looks awfully expensive.


Answer

If your code is exactly as posted, then you're passing the string literal 'Violations' rather than referring to a column named Violations. Casting the literal always yields null, so the WHERE clause filters out every row. Try removing the single quotes around Violations (Spark SQL uses backticks, not single quotes, to quote identifiers).
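
For reference, here is a minimal, self-contained sketch of the corrected query. The three-row sample data is hypothetical, standing in for the questioner's table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for the real table.
df = spark.createDataFrame([("9",), ("abc",), (None,)], ["Violations"])
df.createOrReplaceTempView("tmp")

# Without quotes, Violations is parsed as a column reference, so the
# cast is applied to each row's value rather than to a fixed literal.
# (On Spark versions with ANSI mode enabled, use try_cast instead of
# cast, since an invalid cast would raise an error instead of null.)
spark.sql('''select count(Violations)
             from tmp
             where cast(Violations as int) is not null''').show()
# Returns 1 here, since only '9' casts to a valid integer.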
