duplicate elimination with time

For my Smart home Stuff I create a Database, but I make a mistake during programming: the application posting stuff into the Database twice. I want to delete all rows, which contain duplicates. With duplicate I mean a tuples what is identically in the data to the last one from the same type. I mark the duplicates in this Example with “<<” please pay also attention to the last 3 rows. I want to keep the first new Data so I want to delete all the Duplicate after them. I still hope you can help me to solve my Problem.

SmartHome=# select * from sensordata order by time desc Limit 21;
 type | data |            time
------+------+----------------------------
    8 | 2459 | 2019-08-09 23:10:39.530087 << 
    8 | 2459 | 2019-08-09 23:10:39.356908
    8 | 2445 | 2019-08-09 23:05:39.933269 <<
    8 | 2445 | 2019-08-09 23:05:39.789173
   10 | 6105 | 2019-08-09 22:50:50.40792  <<
   10 | 6105 | 2019-08-09 22:50:50.096132
    8 | 2459 | 2019-08-09 22:50:41.429681 <<
    8 | 2459 | 2019-08-09 22:50:41.357483
    8 | 2474 | 2019-08-09 22:45:42.13396  <<
    8 | 2474 | 2019-08-09 22:45:41.813046
   10 | 6221 | 2019-08-09 22:40:51.107709 <<
   10 | 6221 | 2019-08-09 22:40:51.076903
   10 | 6105 | 2019-08-09 22:35:51.737255 <<
   10 | 6105 | 2019-08-09 22:35:51.544886
   10 | 6221 | 2019-08-09 22:30:52.493895 <<
   10 | 6221 | 2019-08-09 22:30:51.795203
    8 | 2459 | 2019-08-09 22:30:43.193447 <<
    8 | 2459 | 2019-08-09 22:30:43.045599
   10 | 6105 | 2019-08-09 22:25:52.571793 << Duplicate like them above
   10 | 6105 | 2019-08-09 22:25:52.442844 << Also a Duplicate with much more
   10 | 6105 | 2019-08-09 22:20:51.356846    time between the rows
(21 rows)

SmartHome=# d sensordata
                   Table "public.sensordata"
 Column |            Type             |       Modifiers
--------+-----------------------------+------------------------
 type   | integer                     | not null
 data   | character varying(20)       | not null
 time   | timestamp without time zone | not null default now()
Indexes:
    "smarthome_idx" UNIQUE, btree (type, "time")
Foreign-key constraints:
    "sensordata_type_fkey" FOREIGN KEY (type) REFERENCES sensortype(id)

​x
 
SmartHome=# select * from sensordata order by time desc Limit 21; type | data |            time------+------+----------------------------    8 | 2459 | 2019-08-09 23:10:39.530087 <<     8 | 2459 | 2019-08-09 23:10:39.356908    8 | 2445 | 2019-08-09 23:05:39.933269 <<    8 | 2445 | 2019-08-09 23:05:39.789173   10 | 6105 | 2019-08-09 22:50:50.40792  <<   10 | 6105 | 2019-08-09 22:50:50.096132    8 | 2459 | 2019-08-09 22:50:41.429681 <<    8 | 2459 | 2019-08-09 22:50:41.357483    8 | 2474 | 2019-08-09 22:45:42.13396  <<    8 | 2474 | 2019-08-09 22:45:41.813046   10 | 6221 | 2019-08-09 22:40:51.107709 <<   10 | 6221 | 2019-08-09 22:40:51.076903   10 | 6105 | 2019-08-09 22:35:51.737255 <<   10 | 6105 | 2019-08-09 22:35:51.544886   10 | 6221 | 2019-08-09 22:30:52.493895 <<   10 | 6221 | 2019-08-09 22:30:51.795203    8 | 2459 | 2019-08-09 22:30:43.193447 <<    8 | 2459 | 2019-08-09 22:30:43.045599   10 | 6105 | 2019-08-09 22:25:52.571793 << Duplicate like them above   10 | 6105 | 2019-08-09 22:25:52.442844 << Also a Duplicate with much more   10 | 6105 | 2019-08-09 22:20:51.356846    time between the rows(21 rows)​SmartHome=# d sensordata                   Table "public.sensordata" Column |            Type             |       Modifiers--------+-----------------------------+------------------------ type   | integer                     | not null data   | character varying(20)       | not null time   | timestamp without time zone | not null default now()Indexes:    "smarthome_idx" UNIQUE, btree (type, "time")Foreign-key constraints:    "sensordata_type_fkey" FOREIGN KEY (type) REFERENCES sensortype(id)​

If i run

with a as (Select *, row_number() over(partition by type,data order by time) from sensordata) select * from a where row_number=1 order by time desc;

 
with a as (Select *, row_number() over(partition by type,data order by time) from sensordata) select * from a where row_number=1 order by time desc;​

the output is:

 10 | 17316 | 2019-08-09 09:43:46.938507 |          1
   10 | 18276 | 2019-08-09 09:38:47.129788 |          1
   10 | 18176 | 2019-08-09 09:33:47.889064 |          1
   10 | 17107 | 2019-08-08 10:36:11.383106 |          1
   10 | 17921 | 2019-08-08 09:56:15.889191 |          1
   10 | 17533 | 2019-08-03 09:30:11.047639 |          1

 
 10 | 17316 | 2019-08-09 09:43:46.938507 |          1   10 | 18276 | 2019-08-09 09:38:47.129788 |          1   10 | 18176 | 2019-08-09 09:33:47.889064 |          1   10 | 17107 | 2019-08-08 10:36:11.383106 |          1   10 | 17921 | 2019-08-08 09:56:15.889191 |          1   10 | 17533 | 2019-08-03 09:30:11.047639 |          1​

thats not what i mean :/ (ßorry dont know how to mark the stuff as code block in the comment therfore this way

Answer

There are many possible ways to do this. The fastest is often a correlated subquery but I can never remember the syntax so I normally go for window functions, specifically row_number().

If you run

Select *, row_number() over(partition by type,data order by date) from sensor data

 
Select *, row_number() over(partition by type,data order by date) from sensor data​

That should give a version of your table where all the rows you want to keep have a number 1 and the duplicates are numbered 2,3,4… Use that same field in a delete query and you’ll be sorted.

EDIT: I understand now you only want to remove duplicates that occur sequentially within the same type. This can also be achieved using row_number and join.This query should give you only the data you want.

WITH s as (SELECT *,row_number() over(partition by type order by date) as rnum from sensordata)
SELECT a.* 
FROM s a 
JOIN s b 
ON a.rnum=b.rnum+1 AND a.type=b.type
WHERE NOT a.data=b.data

 
WITH s as (SELECT *,row_number() over(partition by type order by date) as rnum from sensordata)SELECT a.* FROM s a JOIN s b ON a.rnum=b.rnum+1 AND a.type=b.typeWHERE NOT a.data=b.data​

This might need a slight tweak to avoid missing the very first entry if that’s important.

Advertisement

Answer