Skip to content
Advertisement

Database architecture: When records having different number of attributes (columns)

Suppose I have these records

    ID 1: has attributes A,B,D
    ID 2: has attributes B,C
    ID 3: has attributes F
    ID 4: has attributes C,G
    .....(Attributes will not duplicate in the same record)

Total estimated number of records: ~180,000

Total number of attributes: 70, increasing

Example of queries I’m going to do:

  1. SELECT * from table WHERE (has attribute B)
  2. SELECT * from table WHERE (has attributes B & D)
  3. SELECT * from table WHERE (has 2 attributes)
  4. SELECT * from table WHERE (has >=3 attributes)
  5. SELECT count(*) from table WHERE (has attribute B)

What is the best database architecture?


Design 1: Storing attributes as 1s & 0s

ID|A|B|C|D|E|F|...
 1|1|1|0|1|0|0|...
 2|0|1|1|0|0|0|...
 3|0|0|0|0|0|1|...

Problems:

  1. New columns needed to be added periodically when new attribute appears
  2. Much redundant data (0s), as more than 80% of data has only 1 attribute, and less than 0.01% of records will have more than 8 attributes.

Design 2: Store attributes as a CSV string

ID|Attributes
 1|A,B,D,
 2|B,C,
 3|F,

Problems:

  1. Slow query when I do
SELECT * from table WHERE attributes LIKE '%B,%' AND attributes LIKE '%D,%'

Design 3: Each attribute has its own table storing record IDs

Table Attribute A
ID
 1
 4
 5
...

Table Attribute B
ID
 1
 7
10
...

Table Attribute C
ID
 2
 8
 9
...

Problems

  1. Many tables
  2. New tables needed to be added periodically
  3. How to do SELECT * from table WHERE id (appears in exactly 3 tables)?

These are the designs I can think of, please propose any good architecture.

Advertisement

Answer

Actually, none of your designs are optimal (the third is the best), and I recommend a single junction table which relates ID valued to their attributes, e.g.

ID | attr
1  | A
1  | B
1  | D
2  | B
2  | C
3  | F
4  | C
4  | G

This is the most normalized approach. To see why this design is optimal, see how easy it is to find all IDs which have attribute B:

SELECT DISTINCT ID
FROM yourTable
WHERE attr = 'B';

It is also fairly straightforward to find all IDs having both attributes B and D:

SELECT ID
FROM yourTable
WHERE attr IN ('B', 'D')
GROUP BY ID
HAVING MIN(attr) <> MAX(attr);

Your first two suggestions would make it much harder to write these queries (give it a try), and in general it is bad practice to store CSV in database tables. Your third suggestion does store the relationships correctly, but it unnecessarily spreads out data across multiple tables.

A more general form of the above query which can easily be extended to any number of IDs is:

SELECT ID
FROM yourTable
WHERE attr IN ('B', 'D')
GROUP BY ID
HAVING COUNT(DISTINCT attr) = 2;
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement