Aim was to write a SQL query to return ride details from a public database.
While going through a data analytics course, I stumbled across this problem. the code used is correct as the instruction provided was the same however the result I obtained was different. The first row returned a blank row with just the number of trips mentioned.
This is the code I used
SELECT usertype, CONCAT (start_station_name," to ",end_station_name) AS Route, Count(*) AS num_trips, Round(Avg(CAST(tripduration as INT64)/60),2) AS duration, FROM `bigquery-public-data.new_york_citibike.citibike_trips` GROUP BY start_station_name, end_station_name, Usertype ORDER BY num_trips DESC Limit 10
Advertisement
Answer
That happens because you actually have 5.8M rows with empty usertype, start_station_name and end_station_name and as you are sorting in DESC order they show as first line in the output.
Assuming that you only need to count when there is at least a start_station_name then you can add a WHERE clause
SELECT usertype, CONCAT (start_station_name," to ",end_station_name) AS Route, Count(*) AS num_trips, Round(Avg(CAST(tripduration as INT64)/60),2) AS duration, FROM `bigquery-public-data.new_york_citibike.citibike_trips` WHERE start_station_name <> '' GROUP BY start_station_name, end_station_name, Usertype ORDER BY num_trips DESC Limit 10