The title may be confusing so I need to clarify.
I have the following survey dataset that I need to transfer to a PostgreSQL database.
| id | title | income | age | height | weight | Education | | 1 | foo | 10000 | 45 | 170 | 50 | College | | 2 | bar | 15000 | -20000 | -30000 | 45 | High School | | 3 | hoge | -10000 | -10000 | 150 | 60 | -20000 | | 4 | fuga | 20000 | 20 | -10000 | 70 | College | ...
Where -10000, -20000, and -30000 all represent a null value, but due to different reasons.
(e.g., -10000 means that the participant was not shown the question about the column, -20000 means the participant skipped the question, and -30000 means they made an invalid answer.)
The question is, how to design a table and constraint to model this situation?
Obviously, if I use INTEGER for income and other columns and use VARCHAR for education the null values will not be identical. Also, if I treat income as a mere INTEGER, the “null” values will be just a negative number which will be problematic in the following data manipulation.
I think the best design here would be to just add status columns for each data column, assuming you want to keep track of the status for a given column. Something like this:
| id | title | income | age | age_code | height | height_code | weight | Education | | 1 | foo | 10000 | 45 | 1 | 170 | 1 | 50 | College | | 2 | bar | 15000 | NULL | 2 | NULL | 4 | 45 | High School | | 3 | hoge | NULL | NULL | 3 | 150 | 1 | 60 | NULL | | 4 | fuga | 20000 | 20 | 1 | NULL | 5 | 70 | College |
Here, I am using the code values 2,3 to represent some sort of error code in the age, and the values 4,5 represent other types of errors with the height.
Typically the values in a column should represent just one thing. In this case, the age and height columns should store data for those measurements, not the status of those measurements. I backfilled
NULL values for where the data be missing or otherwise totally invalid.