For my smart home setup I created a database, but I made a mistake during programming: the application posts readings into the database twice. I want to delete all rows that are duplicates. By duplicate I mean a tuple whose data is identical to the previous row of the same type. I mark the duplicates in this example with "<<"; please also pay attention to the last 3 rows. I want to keep the first new reading, so I want to delete all the duplicates after it. I hope you can help me solve my problem.
```
SmartHome=# select * from sensordata order by time desc limit 21;
 type | data |            time
------+------+----------------------------
    8 | 2459 | 2019-08-09 23:10:39.530087   <<
    8 | 2459 | 2019-08-09 23:10:39.356908
    8 | 2445 | 2019-08-09 23:05:39.933269   <<
    8 | 2445 | 2019-08-09 23:05:39.789173
   10 | 6105 | 2019-08-09 22:50:50.40792    <<
   10 | 6105 | 2019-08-09 22:50:50.096132
    8 | 2459 | 2019-08-09 22:50:41.429681   <<
    8 | 2459 | 2019-08-09 22:50:41.357483
    8 | 2474 | 2019-08-09 22:45:42.13396    <<
    8 | 2474 | 2019-08-09 22:45:41.813046
   10 | 6221 | 2019-08-09 22:40:51.107709   <<
   10 | 6221 | 2019-08-09 22:40:51.076903
   10 | 6105 | 2019-08-09 22:35:51.737255   <<
   10 | 6105 | 2019-08-09 22:35:51.544886
   10 | 6221 | 2019-08-09 22:30:52.493895   <<
   10 | 6221 | 2019-08-09 22:30:51.795203
    8 | 2459 | 2019-08-09 22:30:43.193447   <<
    8 | 2459 | 2019-08-09 22:30:43.045599
   10 | 6105 | 2019-08-09 22:25:52.571793   << duplicate like the ones above
   10 | 6105 | 2019-08-09 22:25:52.442844   << also a duplicate, with much more time between the rows
   10 | 6105 | 2019-08-09 22:20:51.356846
(21 rows)

SmartHome=# \d sensordata
          Table "public.sensordata"
 Column |            Type             |       Modifiers
--------+-----------------------------+------------------------
 type   | integer                     | not null
 data   | character varying(20)       | not null
 time   | timestamp without time zone | not null default now()
Indexes:
    "smarthome_idx" UNIQUE, btree (type, "time")
Foreign-key constraints:
    "sensordata_type_fkey" FOREIGN KEY (type) REFERENCES sensortype(id)
```
If I run
```sql
WITH a AS (
    SELECT *, row_number() OVER (PARTITION BY type, data ORDER BY time)
    FROM sensordata
)
SELECT * FROM a WHERE row_number = 1 ORDER BY time DESC;
```
the output is:
```
 10 | 17316 | 2019-08-09 09:43:46.938507 | 1
 10 | 18276 | 2019-08-09 09:38:47.129788 | 1
 10 | 18176 | 2019-08-09 09:33:47.889064 | 1
 10 | 17107 | 2019-08-08 10:36:11.383106 | 1
 10 | 17921 | 2019-08-08 09:56:15.889191 | 1
 10 | 17533 | 2019-08-03 09:30:11.047639 | 1
```
That's not what I mean :/ (sorry, I don't know how to mark text as a code block in a comment, therefore this way.)
Answer
There are many possible ways to do this. The fastest is often a correlated subquery, but I can never remember the syntax, so I normally go for window functions, specifically row_number().
If you run
```sql
SELECT *, row_number() OVER (PARTITION BY type, data ORDER BY time)
FROM sensordata;
```
That should give a version of your table where all the rows you want to keep are numbered 1 and the duplicates are numbered 2, 3, 4, … Use that same field in a DELETE query and you'll be sorted.
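The delete step described above could be sketched like this, assuming the UNIQUE index on (type, "time") from the question, so that pair identifies each row (the `numbered` CTE name is just for illustration):

```sql
-- Number each row within its (type, data) group, oldest first,
-- then delete every row that is not the first of its group.
WITH numbered AS (
    SELECT type, "time",
           row_number() OVER (PARTITION BY type, data ORDER BY "time") AS rnum
    FROM sensordata
)
DELETE FROM sensordata s
USING numbered n
WHERE s.type = n.type
  AND s."time" = n."time"
  AND n.rnum > 1;
```

Note that this removes repeats of a (type, data) pair anywhere in the table, not only consecutive ones; see the EDIT below for the sequential case.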
EDIT: I understand now that you only want to remove duplicates that occur sequentially within the same type. This can also be achieved using row_number() and a self-join. This query should give you only the data you want to keep.
```sql
WITH s AS (
    SELECT *, row_number() OVER (PARTITION BY type ORDER BY time) AS rnum
    FROM sensordata
)
SELECT a.*
FROM s a
JOIN s b ON a.rnum = b.rnum + 1 AND a.type = b.type
WHERE NOT a.data = b.data;
This might need a slight tweak to avoid missing the very first entry if that’s important.
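One way to avoid that tweak, sketched here as an alternative (not part of the original answer), is the lag() window function: the first row per type has no previous row, so lag() returns NULL there, and `IS DISTINCT FROM` treats that as "different", keeping the row:

```sql
-- Keep each row whose data differs from the previous row of the same type;
-- lag() is NULL for the first row per type, so that row is always kept.
SELECT type, data, "time"
FROM (
    SELECT *,
           lag(data) OVER (PARTITION BY type ORDER BY "time") AS prev_data
    FROM sensordata
) t
WHERE prev_data IS DISTINCT FROM data;
```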