Skip to content
Advertisement

Redshift – How to use column in one table as pattern in SIMILAR TO

I have a problem where I have two tables. One table constains urls and their information and another groups of urls that should be grouped by a pattern.

Urls table:
------------------------------------------------
| url                                  | files |
| https://myurl1/test/one/es/main.html | 530   |
| https://myurl1/test/one/en/main.html | 530   |
| https://myurl1/test/one/ar/main.html | 530   |
------------------------------------------------

Urls patterns table:
---------------------------------------------
| group  | url_pattern                      |
| group1 | https://myurl1/test/one/(es|en)/%|
| group2 | https://myurl1/test/one/(ar)/%   |
---------------------------------------------

I have tried something like this bearing in mind that url_patterns will only have one row per group.

SELECT * FROM urls_table
WHERE url SIMILAR TO (SELECT MAX (url_pattern) FROM url_patterns WHERE group='group1')
LIMIT 10

The main problem here is that it seems that applying SIMILAR TO with a column argument is not working.

enter image description here

Could anyone give me some advices? Thanks in advance.

Advertisement

Answer

You are running into the requirement that regexp patterns are compiled and that SIMILAR TO is a layer on regexp. So what you are trying to do won’t work. I believe there are a number of other ways to do this.

I) Change to LIKE pattern matching: LIKE patterns aren’t precompiled so can use dynamic patterns. The downside is that they are more limited but I think you can still do what you want. Just change your patterns to be set of pattern columns (if the number of patterns is limited) and test for all the patterns. Unneeded patterns can just be a value that can never match. Definitely a brute force hack.

II) Change to LIKE pattern matching w/ SQL to provide OR behavior: have multiple LIKE patterns in the url_pattern column separated by ‘|’ (for example). Then use split_part to match each sub-pattern – a bit complex and possible slow but works. Like this:

SELECT url
FROM urls_table
LEFT JOIN (SELECT split_part(pattern, '|', part_no::int) as pattern
              FROM url_patterns
              CROSS JOIN (SELECT row_number() over () as part_no FROM urls_table)
              WHERE "group" = 'group1'
             )
ON url LIKE pattern 
WHERE p.pattern IS NOT NULL;

You will also need to change your pattern strings to use the simpler LIKE format and use ‘|’ for multiple possibilities – Ex: Group1 pattern becomes ‘https://myurl1/test/one/es/%|https://myurl1/test/one/en/%’

III) Use some front-end query modification to find the pattern for the group and apply it to query BEFORE it is sent to the compiler. This could be an external tool or a stored procedure on Redshift. Get the pattern in one query and use it to issue the second query.

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement