Skip to content
Advertisement

Group similar URLs

I am looking to fetch all requests to xmlrpc.php and wp-login.php and have them wildcarded in the statement.

But this presents an issue, as it doesn’t output the data in only two rows for xmlrpc and wp-login, but also includes the URLs with a query attached. Hoping to have it include every URL requested, but combine them to show as just xmlrpc.php or wp-login.php

I am a mysql n00b and was playing around with a substr replace and group_concat but couldn’t get it to work.

WITH 
  subq AS (
    SELECT url, COUNT(url) AS count
    FROM `flywheel-production.fastly_logs.ingress_logs`
    WHERE timestamp > TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY) 
  AND (url LIKE "/wp-login.php%" OR  url LIKE "/xmlrpc.php%")
  AND site_hash = "btmpuroizf"
    GROUP BY url
  )

SELECT 
  url,
  count,
  ROUND(count / (SELECT SUM(count) FROM subq) * 100, 2) AS percent
FROM subq
ORDER BY count DESC

Any help would be much appreciated. Thanks!

Advertisement

Answer

For BigQuery Standard SQL

Below adjusted query should do the “trick”

#standardSQL
WITH subq AS (
  SELECT REGEXP_EXTRACT(url, r'(.*?)(?:?|$)') url, COUNT(url) AS COUNT
  FROM `flywheel-production.fastly_logs.ingress_logs`
  WHERE timestamp > TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY) 
  AND (url LIKE "/wp-login.php%" OR  url LIKE "/xmlrpc.php%")
  AND site_hash = "btmpuroizf"
  GROUP BY url
)
SELECT 
  url,
  COUNT,
  ROUND(COUNT / (SELECT SUM(COUNT) FROM subq) * 100, 2) AS percent
FROM subq
ORDER BY COUNT DESC  
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement