
Duplicate elimination with time

For my smart home setup I created a database, but I made a mistake during programming: the application posts data into the database twice. I want to delete all rows that are duplicates. By duplicate I mean a tuple whose data is identical to the previous row of the same type. I mark the duplicates in this example with “<<”; please pay particular attention to the last three rows. I want to keep the first new value, so I want to delete all the duplicates that follow it. I hope you can help me solve my problem.

SmartHome=# select * from sensordata order by time desc Limit 21;
 type | data |            time
------+------+----------------------------
    8 | 2459 | 2019-08-09 23:10:39.530087 << 
    8 | 2459 | 2019-08-09 23:10:39.356908
    8 | 2445 | 2019-08-09 23:05:39.933269 <<
    8 | 2445 | 2019-08-09 23:05:39.789173
   10 | 6105 | 2019-08-09 22:50:50.40792  <<
   10 | 6105 | 2019-08-09 22:50:50.096132
    8 | 2459 | 2019-08-09 22:50:41.429681 <<
    8 | 2459 | 2019-08-09 22:50:41.357483
    8 | 2474 | 2019-08-09 22:45:42.13396  <<
    8 | 2474 | 2019-08-09 22:45:41.813046
   10 | 6221 | 2019-08-09 22:40:51.107709 <<
   10 | 6221 | 2019-08-09 22:40:51.076903
   10 | 6105 | 2019-08-09 22:35:51.737255 <<
   10 | 6105 | 2019-08-09 22:35:51.544886
   10 | 6221 | 2019-08-09 22:30:52.493895 <<
   10 | 6221 | 2019-08-09 22:30:51.795203
    8 | 2459 | 2019-08-09 22:30:43.193447 <<
    8 | 2459 | 2019-08-09 22:30:43.045599
   10 | 6105 | 2019-08-09 22:25:52.571793 << duplicate like those above
   10 | 6105 | 2019-08-09 22:25:52.442844 << also a duplicate, with much more
   10 | 6105 | 2019-08-09 22:20:51.356846    time between the rows
(21 rows)

SmartHome=# \d sensordata
                   Table "public.sensordata"
 Column |            Type             |       Modifiers
--------+-----------------------------+------------------------
 type   | integer                     | not null
 data   | character varying(20)       | not null
 time   | timestamp without time zone | not null default now()
Indexes:
    "smarthome_idx" UNIQUE, btree (type, "time")
Foreign-key constraints:
    "sensordata_type_fkey" FOREIGN KEY (type) REFERENCES sensortype(id)

If I run

with a as (Select *, row_number() over(partition by type,data order by time) from sensordata) select * from a where row_number=1 order by time desc;

the output is:

   10 | 17316 | 2019-08-09 09:43:46.938507 |          1
   10 | 18276 | 2019-08-09 09:38:47.129788 |          1
   10 | 18176 | 2019-08-09 09:33:47.889064 |          1
   10 | 17107 | 2019-08-08 10:36:11.383106 |          1
   10 | 17921 | 2019-08-08 09:56:15.889191 |          1
   10 | 17533 | 2019-08-03 09:30:11.047639 |          1

that’s not what I mean :/ (sorry, I don’t know how to mark this as a code block in a comment, therefore this way)


Answer

There are many possible ways to do this. The fastest is often a correlated subquery, but I can never remember the syntax, so I normally go for window functions, specifically row_number().
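(For reference, a correlated-subquery flavour of that first idea might look roughly like the sketch below. It is untested, and it keeps only the earliest row per (type, data) pair, which, as the edit further down explains, turns out to be slightly more aggressive than what you actually want.)

DELETE FROM sensordata d
WHERE EXISTS (
    SELECT 1
    FROM sensordata e
    WHERE e.type = d.type
      AND e.data = d.data
      AND e."time" < d."time"   -- an earlier row with the same type and data exists
);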

If you run

Select *, row_number() over(partition by type, data order by time) from sensordata

That should give you a version of your table where all the rows you want to keep are numbered 1 and the duplicates are numbered 2, 3, 4… Use that same field in a delete query and you’ll be sorted.
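In Postgres terms, that delete could look something like this sketch (untested, and assuming that the unique key (type, "time") from your table definition is enough to pin down a row):

WITH numbered AS (
    SELECT type, data, "time",
           row_number() OVER (PARTITION BY type, data ORDER BY "time") AS rn
    FROM sensordata
)
DELETE FROM sensordata d
USING numbered n
WHERE d.type = n.type
  AND d."time" = n."time"
  AND n.rn > 1;   -- keep the first row of each (type, data) group, drop the rest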

EDIT: I understand now that you only want to remove duplicates that occur sequentially within the same type. This can also be achieved using row_number() and a join. This query should give you only the data you want.

WITH s as (SELECT *, row_number() over(partition by type order by time) as rnum from sensordata)
SELECT a.* 
FROM s a 
JOIN s b 
ON a.rnum=b.rnum+1 AND a.type=b.type
WHERE NOT a.data=b.data

This might need a slight tweak to avoid missing the very first entry if that’s important.
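If you want to turn that into the actual delete, the same join can be flipped around: remove every row whose immediate predecessor within the same type carries the same data. A rough, untested sketch (again assuming (type, "time") identifies the row, per your table definition):

WITH s AS (
    SELECT type, data, "time",
           row_number() OVER (PARTITION BY type ORDER BY "time") AS rnum
    FROM sensordata
)
DELETE FROM sensordata d
USING s a
JOIN s b ON a.rnum = b.rnum + 1 AND a.type = b.type
WHERE d.type = a.type
  AND d."time" = a."time"
  AND a.data = b.data;   -- this row repeats the reading just before it, so it goes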
