Using PostgreSQL 12, I am attempting to import data from a CSV with the following format:
country,state,county,lat,lng,type,measure,beds,population,year,source,source_url US,AK,ketchikan gateway,63.588753,-154.493062,ICU,1000HAB,3.928701,13745,2018,arcgis,https://services1.arcgis.com/Hp6G80Pky0om7QvQ/arcgis/rest/services/Hospitals_1/FeatureServer/0 US,AK,kodiak island,63.588753,-154.493062,ACUTE,1000HAB,,n,2018,arcgis,https://services1.arcgis.com/Hp6G80Pky0om7QvQ/arcgis/rest/services/Hospitals_1/FeatureServer/0
Notice that the second row’s “population” field has an n
instead of empty. My goal is to import the CSV so that the “population” column is BIGINT and “n” is replaced with NULL. My current solution is:
CREATE TABLE temp_table ( country CHAR(2), state CHAR(2), county VARCHAR(255), lat DOUBLE PRECISION, lng DOUBLE PRECISION, type VARCHAR(11), measure VARCHAR(255), beds DOUBLE PRECISION, pop VARCHAR(255), year SMALLINT, source VARCHAR(255), source_url VARCHAR(255) ); COPY temp_table FROM 'C:\Users\mconr\Downloads\global-hospital-beds-capacity-for-covid19\hospital_beds_USA_v1.CSV' WITH (DELIMITER ',', FORMAT CSV, HEADER TRUE); SELECT country, state, county, lat, lng, type, measure, beds, CAST (NULLIF (pop, 'n') AS BIGINT) AS population, year, source, source_url INTO USA FROM temp_table; DROP TABLE temp_table;
My current solution is to create a temporary table where “population” is VARCHAR(255), import the data, create a new table from a SELECT statement that replaces “n” with NULL and casts the column to BIGINT, then delete the temp table. However, this seems to be a little inefficient, since I am creating and deleting an intermediate table. Does anyone know a better way of doing this?
Advertisement
Answer
You could set the NULL parameter to ‘n’, but then that would break other cases, where null fields are represented in the usual way (by the unquoted empty string), rather than by ‘n’. COPY does not let you set NULL per column.
Your current method seems fine to me, it would be my first choice (other than fixing the generator of this file which may be out of your control). You could also write something with Perl or awk or sed which would edit the file as a stream to turn the ‘,n,’ to ‘,,’ and hook it up to the COPY with the PROGRAM feature, but that seems fiddly and error prone and I doubt the efficiency gains would be all that large.
This strikes as one of those premature optimizations which is the root of most evil. Is this really a meaningful problem?