Skip to content
Advertisement

Select rows where array contains one of several values in bigquery (ideally with dbplyr)

I have a large set of tweets on bigquery and now want to filter those that contain at least one of a list of hashtags. The hashtags are saved in an array column (uploaded from a list column in R). How can I select rows that contain one of multiple values in any place in that array?

Below the code that I would use for the analysis in R. Unsurprisingly, dbplyr cannot translate the purrr part, and I am happy to learn to create the SQL myself, but haven’t yet found a good starting point. Thanks for any pointers.

PS: I have not yet uploaded the Tweets to bigquery, they currently live in 80 GB of RDS filed. If any simple data transformation would make this easier, I could still include that while uploading.

tweets_sample <- tibble::tribble(
  ~text, ~hashtags,
  "Hello", list("World", "You"),
  "Goodbye", list("Friend", "You"),
  "Not", list("interested")
)

hashtag_list <- c("World", "interested")

tweets_sample %>% filter(purrr::map_lgl(hashtags, ~ .x %in% hashtag_list %>%
                                   any()))

Advertisement

Answer

The difficult part here is that your hashtags column is of type list or array. As per this question dbplyr translation for more advanced data types like arrays does not appear to be well established.

Two alternative approaches:

  1. Convert your hashtags to a character string and use text search (grep).

  2. Write a bigquery query as a character string in R and attach it to an existing connection. Here is an example:

db_connection = DBI::dbConnect( ... ) # connect to database
remote_tbl = dplyr::tbl(db_connection, from = "remote_table_name")

# build SQL query
sql_query <- glue::glue("SELECT *n",
                        "FROM (n",
                        "{dbplyr::sql_render(remote_tbl)}n",
                        ") aliasn",
  
new_remote_table = dplyr::tbl(db_connection, dbplyr::sql(sql_query))
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement