How to use timebucket_gapfill when rows can have null values?

I have a time series table where measurements are recorded into “wide” rows. Rows may contain all measurements or only some. The other columns are then set to NULL.

I would like to use timebucket_gapfill() to “clean” this table and make sure that every row in the output has data in all columns, even if the underlying dataset has some null values for some of the columns.

This is how I prepare the table with some data (schema from the getting started guide):

CREATE TABLE conditions (
  time        TIMESTAMPTZ       NOT NULL,
  location    TEXT              NOT NULL,
  temperature DOUBLE PRECISION  NULL,
  humidity    DOUBLE PRECISION  NULL
);
SELECT create_hypertable('conditions', 'time');
INSERT INTO conditions(time, location, temperature, humidity)
  VALUES ('2019-07-10 05:02:14-07', 'office', 70.0, 50.0);
INSERT INTO conditions(time, location, temperature, humidity)
  VALUES ('2019-07-10 05:02:15-07', 'office', 71.0, null);
INSERT INTO conditions(time, location, temperature, humidity)
  VALUES ('2019-07-10 05:02:16-07', 'office', 72.0, 48.0);
-- gap at 2019-07-10 05:02:17-07
INSERT INTO conditions(time, location, temperature, humidity)
  VALUES ('2019-07-10 05:02:18-07', 'office', 72.0, 48.0);
INSERT INTO conditions(time, location, temperature, humidity)
  VALUES ('2019-07-10 05:02:18.8-07', 'office', 72.1, NULL);
INSERT INTO conditions(time, location, temperature, humidity)
  VALUES ('2019-07-10 05:02:19.2-07', 'office', NULL, 46.0);
INSERT INTO conditions(time, location, temperature, humidity)
  VALUES ('2019-07-10 05:02:20-07', 'office', 73.0, 45.0);

​x
 
CREATE TABLE conditions (  time        TIMESTAMPTZ       NOT NULL,  location    TEXT              NOT NULL,  temperature DOUBLE PRECISION  NULL,  humidity    DOUBLE PRECISION  NULL);SELECT create_hypertable('conditions', 'time');INSERT INTO conditions(time, location, temperature, humidity)  VALUES ('2019-07-10 05:02:14-07', 'office', 70.0, 50.0);INSERT INTO conditions(time, location, temperature, humidity)  VALUES ('2019-07-10 05:02:15-07', 'office', 71.0, null);INSERT INTO conditions(time, location, temperature, humidity)  VALUES ('2019-07-10 05:02:16-07', 'office', 72.0, 48.0);-- gap at 2019-07-10 05:02:17-07INSERT INTO conditions(time, location, temperature, humidity)  VALUES ('2019-07-10 05:02:18-07', 'office', 72.0, 48.0);INSERT INTO conditions(time, location, temperature, humidity)  VALUES ('2019-07-10 05:02:18.8-07', 'office', 72.1, NULL);INSERT INTO conditions(time, location, temperature, humidity)  VALUES ('2019-07-10 05:02:19.2-07', 'office', NULL, 46.0);INSERT INTO conditions(time, location, temperature, humidity)  VALUES ('2019-07-10 05:02:20-07', 'office', 73.0, 45.0);​

And this is how I query it:

SELECT
    time_bucket_gapfill('1000ms', time,
      start => '2019-07-10 05:02:13',
      finish => '2019-07-10 05:02:21'
    ) as ival,
    count(*) as samplesUsed,
    interpolate(avg(temperature)) as lineartemperature,
    interpolate(avg(humidity)) as linearhumidity
 FROM conditions
 GROUP BY ival
 ORDER BY ival;

 
SELECT    time_bucket_gapfill('1000ms', time,      start => '2019-07-10 05:02:13',      finish => '2019-07-10 05:02:21'    ) as ival,    count(*) as samplesUsed,    interpolate(avg(temperature)) as lineartemperature,    interpolate(avg(humidity)) as linearhumidity FROM conditions GROUP BY ival ORDER BY ival;​

The output is:

          ival          | samplesused | lineartemperature | linearhumidity 
------------------------+-------------+-------------------+----------------
 2019-07-10 05:02:13-07 |             |                   |               
 2019-07-10 05:02:14-07 |           1 |                70 |             50
 2019-07-10 05:02:15-07 |           1 |                71 |               
 2019-07-10 05:02:16-07 |           1 |                72 |             48
 2019-07-10 05:02:17-07 |             |            72.025 |             48
 2019-07-10 05:02:18-07 |           2 |             72.05 |             48
 2019-07-10 05:02:19-07 |           1 |                   |             46
 2019-07-10 05:02:20-07 |           1 |                73 |             45

 
          ival          | samplesused | lineartemperature | linearhumidity ------------------------+-------------+-------------------+---------------- 2019-07-10 05:02:13-07 |             |                   |                2019-07-10 05:02:14-07 |           1 |                70 |             50 2019-07-10 05:02:15-07 |           1 |                71 |                2019-07-10 05:02:16-07 |           1 |                72 |             48 2019-07-10 05:02:17-07 |             |            72.025 |             48 2019-07-10 05:02:18-07 |           2 |             72.05 |             48 2019-07-10 05:02:19-07 |           1 |                   |             46 2019-07-10 05:02:20-07 |           1 |                73 |             45​

I understand why the first row is empty – no data in the dataset.
At 5:02:17, interpolation is working fine when there are no rows in the dataset.
However, at 5:02:15 and 5:02:19, where the underlying rows are “partial”, the database did not use values from the previous and next rows to interpolate a result for respectively humidity and temperature.

How do I write the query to return an interpolated value for all measurement columns?

Answer

Timescaledb does not consider NULL as missing values. I have to rewrite the query to avoid the rows with NULL values, that means doing multiple queries with timebucket_gapfill and joining the results together.

This works and does what I wanted:

SELECT
    condh.ival, humidity, temperature
from
(
    select
    time_bucket_gapfill('1000ms', time,
      start => '2019-07-10 05:02:13',
      finish => '2019-07-10 05:02:21'
    ) as ival,
    count(*) as samplesUsed,
    interpolate(avg(humidity)) as humidity
    FROM conditions
    WHERE humidity is not NULL
    GROUP BY ival
) condh 
INNER JOIN 
(
     SELECT
    time_bucket_gapfill('1000ms', time,
      start => '2019-07-10 05:02:13',
      finish => '2019-07-10 05:02:21'
    ) as ival,
    count(*) as samplesUsed,
    interpolate(avg(temperature)) as temperature
    FROM conditions
    WHERE temperature is not NULL
    GROUP BY ival
) condt
on (condt.ival = condh.ival)
ORDER BY ival;

 
SELECT    condh.ival, humidity, temperaturefrom(    select    time_bucket_gapfill('1000ms', time,      start => '2019-07-10 05:02:13',      finish => '2019-07-10 05:02:21'    ) as ival,    count(*) as samplesUsed,    interpolate(avg(humidity)) as humidity    FROM conditions    WHERE humidity is not NULL    GROUP BY ival) condh INNER JOIN (     SELECT    time_bucket_gapfill('1000ms', time,      start => '2019-07-10 05:02:13',      finish => '2019-07-10 05:02:21'    ) as ival,    count(*) as samplesUsed,    interpolate(avg(temperature)) as temperature    FROM conditions    WHERE temperature is not NULL    GROUP BY ival) condton (condt.ival = condh.ival)ORDER BY ival;​

Output:

          ival          | humidity | temperature 
------------------------+----------+-------------
 2019-07-10 05:02:13-07 |          |            
 2019-07-10 05:02:14-07 |       50 |          70
 2019-07-10 05:02:15-07 |       49 |          71
 2019-07-10 05:02:16-07 |       48 |          72
 2019-07-10 05:02:17-07 |       48 |      72.025
 2019-07-10 05:02:18-07 |       48 |       72.05
 2019-07-10 05:02:19-07 |       46 |      72.525
 2019-07-10 05:02:20-07 |       45 |          73
(8 rows)

 
          ival          | humidity | temperature ------------------------+----------+------------- 2019-07-10 05:02:13-07 |          |             2019-07-10 05:02:14-07 |       50 |          70 2019-07-10 05:02:15-07 |       49 |          71 2019-07-10 05:02:16-07 |       48 |          72 2019-07-10 05:02:17-07 |       48 |      72.025 2019-07-10 05:02:18-07 |       48 |       72.05 2019-07-10 05:02:19-07 |       46 |      72.525 2019-07-10 05:02:20-07 |       45 |          73(8 rows)​

Got some help on the timescaledb slack – thanks gayathri.

Advertisement

Answer