Get word frequency rankings in PostgreSQL and filter by label

I have a PostgreSQL database with the following table:

CREATE TABLE dummy (
created_at TIMESTAMPTZ,
tweet TEXT,
label INT);

The following data is inserted:

INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo foo squared', 1);
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo fox squared', 1);
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo bar', 3);
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'raven bar', 2);
INSERT INTO dummy VALUES ('2020-12-17 00:00:00+00', 'bar standard', 1);
INSERT INTO dummy VALUES ('2020-12-17 00:00:00+00', 'bar none', 2);
INSERT INTO dummy VALUES ('2020-12-17 00:00:00+00', 'bar', 1);

I have the following query that fetches the frequency of each word in the table and ranks them, grouping by day:

select *
from (
    select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
        rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
    from dummy d
    cross join lateral regexp_split_to_table(
        trim(regexp_replace(tweet, '\y(rt|co|https|amp|none)\y', '', 'g')),
        '\s+'
    ) w(word)
    group by created_day, word
) t
where (created_day > CURRENT_DATE - INTERVAL '10 days') 
    and word IS NOT NULL
order by created_day DESC, rn;

Result:

      created_day       |   word   | cnt | rn
------------------------+----------+-----+----
 2020-12-18 00:00:00+00 | foo      |   4 |  1
 2020-12-18 00:00:00+00 | bar      |   2 |  2
 2020-12-18 00:00:00+00 | squared  |   2 |  2
 2020-12-18 00:00:00+00 | raven    |   1 |  4
 2020-12-18 00:00:00+00 | fox      |   1 |  4
 2020-12-17 00:00:00+00 | bar      |   3 |  1
 2020-12-17 00:00:00+00 | standard |   1 |  2

I would like to group by the label so that the new query shows the word rankings filtered by label like so:

      created_day       |   word   | cnt | rn | label
------------------------+----------+-----+----+-------
 2020-12-18 00:00:00+00 | foo      |   3 |  1 |     1
 2020-12-18 00:00:00+00 | squared  |   2 |  2 |     1
 2020-12-18 00:00:00+00 | fox      |   1 |  3 |     1
 2020-12-17 00:00:00+00 | bar      |   2 |  1 |     1
 2020-12-17 00:00:00+00 | standard |   1 |  2 |     1

I’ve tried adding a condition on label to the outer WHERE clause:

where (created_day > CURRENT_DATE - INTERVAL '10 days')
    and word IS NOT NULL
    and label = 1
order by created_day DESC, rn;

But I get the error column "label" does not exist.

How would I go about fixing this?

Answer

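Your expected result shows that you want to count per day, word and label, and to rank within each day and label. The error occurs because the subquery does not select label, so the outer WHERE clause cannot see that column. You just have to add label to the SELECT list and to the GROUP BY and PARTITION BY clauses (and probably to ORDER BY, too).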

select *
from 
(
  select 
    date_trunc('day', created_at) as created_day, 
    word, 
    label, 
    count(*) as cnt,
    rank() over(partition by date_trunc('day', created_at), label order by count(*) desc) as rn
  from dummy d
  cross join lateral regexp_split_to_table
  (
    trim(regexp_replace(tweet, '\y(rt|co|https|amp|none)\y', '', 'g')),
    '\s+'
  ) w(word)
  group by created_day, word, label
) t
where (created_day > current_date - interval '10 days') and word is not null
order by created_day desc, label, rn;
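
To restrict the output to a single label, as in the question's expected result, one option is to add the predicate to the outer WHERE clause, which becomes possible once label is part of the subquery's select list. A minimal sketch, assuming the regular expressions are meant to be written with backslash escapes (\y for word boundaries, \s+ for whitespace) and that label 1 is the one of interest:

```sql
-- Sketch: the per-day, per-label ranking, restricted to one label.
-- Assumption: label = 1 is the label of interest; change it as needed.
select *
from
(
  select
    date_trunc('day', created_at) as created_day,
    word,
    label,
    count(*) as cnt,
    -- rank words by frequency within each (day, label) group
    rank() over(partition by date_trunc('day', created_at), label order by count(*) desc) as rn
  from dummy d
  -- strip the unwanted tokens, then split the rest into one row per word
  cross join lateral regexp_split_to_table
  (
    trim(regexp_replace(tweet, '\y(rt|co|https|amp|none)\y', '', 'g')),
    '\s+'
  ) w(word)
  group by created_day, word, label
) t
where created_day > current_date - interval '10 days'
  and word is not null
  and label = 1   -- label is visible here because the subquery selects it
order by created_day desc, rn;
```

With the sample data, and provided the ten-day window covers those dates, this should yield the rows from the expected result in the question. Filtering on label inside the subquery, before the GROUP BY, would be an alternative that avoids aggregating rows for the other labels; the ranks stay the same because they are already partitioned by label.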
User contributions licensed under: CC BY-SA