Suppose I have these records
ID 1: has attributes A,B,D
ID 2: has attributes B,C
ID 3: has attributes F
ID 4: has attributes C,G
Attributes will not duplicate in the same record) ..(
Total estimated number of records: ~180,000
Total number of attributes: 70, increasing
Example of queries I’m going to do:
SELECT * from table WHERE (has attribute B)
SELECT * from table WHERE (has attributes B & D)
SELECT * from table WHERE (has 2 attributes)
SELECT * from table WHERE (has >=3 attributes)
SELECT count(*) from table WHERE (has attribute B)
What is the best database architecture?
Design 1: Storing attributes as 1s & 0s
ID|A|B|C|D|E|F|
1|1|1|0|1|0|0|
2|0|1|1|0|0|0|
3|0|0|0|0|0|1|
Problems:
- New columns needed to be added periodically when new attribute appears
- Much redundant data (0s), as more than 80% of data has only 1 attribute, and less than 0.01% of records will have more than 8 attributes.
Design 2: Store attributes as a CSV string
ID|Attributes
1|A,B,D,
2|B,C,
3|F,
Problems:
- Slow query when I do
SELECT * from table WHERE attributes LIKE '%B,%' AND attributes LIKE '%D,%'
Design 3: Each attribute has its own table storing record IDs
Table Attribute A
ID
1
4
5
Table Attribute B
ID
1
7
10
Table Attribute C
ID
2
8
9
Problems
- Many tables
- New tables needed to be added periodically
- How to do
SELECT * from table WHERE id (appears in exactly 3 tables)
?
These are the designs I can think of, please propose any good architecture.
Advertisement
Answer
Actually, none of your designs are optimal (the third is the best), and I recommend a single junction table which relates ID valued to their attributes, e.g.
ID | attr
1 | A
1 | B
1 | D
2 | B
2 | C
3 | F
4 | C
4 | G
This is the most normalized approach. To see why this design is optimal, see how easy it is to find all IDs which have attribute B:
SELECT DISTINCT ID
FROM yourTable
WHERE attr = 'B';
It is also fairly straightforward to find all IDs having both attributes B and D:
SELECT ID
FROM yourTable
WHERE attr IN ('B', 'D')
GROUP BY ID
HAVING MIN(attr) <> MAX(attr);
Your first two suggestions would make it much harder to write these queries (give it a try), and in general it is bad practice to store CSV in database tables. Your third suggestion does store the relationships correctly, but it unnecessarily spreads out data across multiple tables.
A more general form of the above query which can easily be extended to any number of IDs is:
SELECT ID
FROM yourTable
WHERE attr IN ('B', 'D')
GROUP BY ID
HAVING COUNT(DISTINCT attr) = 2;