Skip to content
Advertisement

Why does this suggested solution prefer a different way of calculating percentages based on aggregations?

I’m working through a problem set from CMU’s public db systems course. I have the following two tables:

Order

Id CustomerId EmployeeId OrderDate RequiredDate ShippedDate ShipVia Freight ShipName ShipAddress ShipCity ShipRegion ShipPostalCode ShipCountry
10248 VINET 5 2012-07-04 2012-08-01 2012-07-16 3 16.75 Vins et alcools Chevalier 59 rue de l’Abbaye Reims Western Europe 51100 France
10249 TOMSP 6 2012-07-05 2012-08-16 2012-07-10 1 22.25 Toms Spezialitäten Luisenstr. 48 Münster Western Europe 44087 Germany

Shipper

Id CompanyName Phone
1 Speedy Express (503) 555-9831
2 United Package (503) 555-3199

The question I am attempting to answer is the following:

For each Shipper, find the percentage of orders which are late. Details: An order is considered late if ShippedDate > RequiredDate. Print the following format, order by descending percentage, rounded to the nearest hundredths, like United Package|23.44.

This is the query I came up with, which produces the expected results:

SELECT 
    CompanyName, 
    ROUND(100*SUM(IIF(ShippedDate > RequiredDate, 1, 0))/Cast(Count(ShipName) as Float), 2) as percent 
FROM 'Order' AS O 
JOIN Shipper AS S 
ON O.ShipVia = S.Id 
GROUP BY CompanyName 
ORDER BY percent DESC;

The official solution is a more involved query as follows:

SELECT CompanyName, round(delayCnt * 100.0 / cnt, 2) AS pct
FROM (
      SELECT ShipVia, COUNT(*) AS cnt 
      FROM 'Order'
      GROUP BY ShipVia
     ) AS totalCnt
INNER JOIN (
            SELECT ShipVia, COUNT(*) AS delaycnt 
            FROM 'Order'
            WHERE ShippedDate > RequiredDate 
            GROUP BY ShipVia
           ) AS delayCnt
          ON totalCnt.ShipVia = delayCnt.ShipVia
INNER JOIN Shipper on totalCnt.ShipVia = Shipper.Id
ORDER BY pct DESC;

The outputs of the two queries match, each returning the following:

Federal Shipping|23.61
Speedy Express|23.46
United Package|23.44

What is the benefit of doing it as shown in the solution rather than using my approach?

Advertisement

Answer

The official solution is a very bad solution because it aggregates twice on the same table Order and then joins the results whilst the same can be achieved by a single aggregation query (like your first query).

I can’t believe that this would be proposed as a proper solution for this type of problems.
I hope it’s just for practice on joins for subqueries.

Also, I find it very disappointing to see the use of single quotes for table (or column) names/aliases.
This is a very bad practice an it can lead to unexpected results.
When needed use double quotes which is the SQL standard (square brackets or backticks also work in SQLite).

As I said your first query is correct and it could be further simplified with the use of the aggregate function AVG():

SELECT S.CompanyName, 
       ROUND(100*AVG(COALESCE(O.ShippedDate, '') > O.RequiredDate), 2) AS percent
FROM Shipper AS S JOIN "Order" AS O 
ON O.ShipVia = S.Id 
GROUP BY S.Id 
ORDER BY percent DESC;

Note the use of Id (it is the primary key of Shipper) in the GROUP BY clause instead of CompanyName, which provides better performance.

Also, I used COALESCE(), because I found (after downloading your database) that in the column ShippedDate there are nulls.

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement