Suppose I have these records
ID 1: has attributes A,B,D ID 2: has attributes B,C ID 3: has attributes F ID 4: has attributes C,G .....(Attributes will not duplicate in the same record)
Total estimated number of records: ~180,000
Total number of attributes: 70, increasing
Example of queries I’m going to do:
SELECT * from table WHERE (has attribute B)
SELECT * from table WHERE (has attributes B & D)
SELECT * from table WHERE (has 2 attributes)
SELECT * from table WHERE (has >=3 attributes)
SELECT count(*) from table WHERE (has attribute B)
What is the best database architecture?
Design 1: Storing attributes as 1s & 0s
ID|A|B|C|D|E|F|... 1|1|1|0|1|0|0|... 2|0|1|1|0|0|0|... 3|0|0|0|0|0|1|...
Problems:
- New columns needed to be added periodically when new attribute appears
- Much redundant data (0s), as more than 80% of data has only 1 attribute, and less than 0.01% of records will have more than 8 attributes.
Design 2: Store attributes as a CSV string
ID|Attributes 1|A,B,D, 2|B,C, 3|F,
Problems:
- Slow query when I do
SELECT * from table WHERE attributes LIKE '%B,%' AND attributes LIKE '%D,%'
Design 3: Each attribute has its own table storing record IDs
Table Attribute A ID 1 4 5 ... Table Attribute B ID 1 7 10 ... Table Attribute C ID 2 8 9 ...
Problems
- Many tables
- New tables needed to be added periodically
- How to do
SELECT * from table WHERE id (appears in exactly 3 tables)
?
These are the designs I can think of, please propose any good architecture.
Advertisement
Answer
Actually, none of your designs are optimal (the third is the best), and I recommend a single junction table which relates ID valued to their attributes, e.g.
ID | attr 1 | A 1 | B 1 | D 2 | B 2 | C 3 | F 4 | C 4 | G
This is the most normalized approach. To see why this design is optimal, see how easy it is to find all IDs which have attribute B:
SELECT DISTINCT ID FROM yourTable WHERE attr = 'B';
It is also fairly straightforward to find all IDs having both attributes B and D:
SELECT ID FROM yourTable WHERE attr IN ('B', 'D') GROUP BY ID HAVING MIN(attr) <> MAX(attr);
Your first two suggestions would make it much harder to write these queries (give it a try), and in general it is bad practice to store CSV in database tables. Your third suggestion does store the relationships correctly, but it unnecessarily spreads out data across multiple tables.
A more general form of the above query which can easily be extended to any number of IDs is:
SELECT ID FROM yourTable WHERE attr IN ('B', 'D') GROUP BY ID HAVING COUNT(DISTINCT attr) = 2;