Skip to content
Advertisement

How to get the size of the duplicates in this query?

Let’s say I have the following data:

md5          file_size          
1234          127.4
1234          127.4
1234          127.4
909           18.2
909           18.2
104           207

I want to find out how much data I have in duplicates — that is — if we have one of each of the files (unique by md5) how much space do we save? The answer should be:

md5        size_saved
1234       254.8
909        18.2
104        0

Here is the base query I have thus far:

SELECT file_size, SUM(file_size) total_size FROM files GROUP BY md5

I think the simplest way to do this is to do ABS(file_size - total_size), but I’m not sure how to do this without using a sub-select or an outer-query that has access to both the aggregate size and the single file size. What would be the proper way to do this, or is the only way with a subselect?

Advertisement

Answer

You could use:

SELECT
    md5,
    SUM(file_size) - MIN(file_size) AS size_saved
FROM yourTable
GROUP BY
    md5;

screen capture from demo link below

Demo

Note that my answer assumes that all records for a given md5 would always have the same file_size values, in the event that there be more than one record. If not, then my answer would not work, but we would have to redefine the logic anyway in this case.

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement