I want generate random unique fake data (first name and address)
I used the following SQL request:
CREATE OR REPLACE VIEW bah AS SELECT DISTINCT A.VAL AS VORNAME, B.VAL AS ADRESSE FROM ANON.FIRST_NAME A, ANON.ADDRESS B GROUP BY A.val,b.val HAVING COUNT(*) = 1;
How do I make the first names AND addresses unique?
Please note this is in CSV format:
"Abbas","Dippelstr. 736, 23892 Hainichen", "Gilda","Noackallee 6/2, 24711 Malchin", "Guenter","Fredy-Junk-Gasse 3, 90438 Schmölln", "Hans-Ulrich","Karl-Peter-Kambs-Gasse 996, 15869 Sulzbach-Rosenberg"
Column1 (contains oids (which is irrelevant) and vals(relevant! hence listed below): "Abbas", "Ante","Anthony"... Column2: (see above - oids and vals) "Benthinstraße 31, 35994 Kleve", "Cordula-Bachmann-Ring 4/8, 06292 Neustrelitz", "Danny-Fischer-Weg 8/9, 28346 Rastatt", "Eckbauergasse 157, 10570 Rudolstadt"
CROSS JOIN is the wrong approach to begin with. Multiple
FROM items separated by comma (
,) are cross-joined. See:
10 names in
anon.first_name and 10 addresses in
anon.address form a Cartesian product of 100 rows, with each name and address multiplied by the cardinality of the cross-joined table. Exactly what you don’t want. All the confusion with duplicates was created by that. Applying
DISTINCT after the cross join is hugely inefficient.
To use each name and address only once, attach a random (or arbitrary) number to each side and join on that.
Assuming (for lack of declaration) that each source table has distinct entries. (Else you need subqueries with
DISTINCT to collapse dupes first – window functions like
row_number() are applied before
DISTINCT on the same query level.)
SELECT * FROM ( SELECT row_number() OVER () AS arbitrary_nr, val AS vorname FROM anon.first_name ) a JOIN ( SELECT row_number() OVER () AS arbitrary_nr, val AS adresse FROM anon.address ) b USING (arbitrary_nr)
Combinations are arbitrary, rather than random. See:
The result has the cardinality of the smaller table. Excess rows from the bigger table are skipped.
To allow some duplicates, you just introduce some duplicate numbers (
arbitrary_nr) on one (or both) side(s), either with the same row multiple times, or different rows with the same arbitrary number. For your example:
There could be like 2 persons living at the same address sometimes
SELECT * FROM ( SELECT row_number() OVER (ORDER BY val) AS arbitrary_nr, val AS vorname FROM anon.first_name UNION ALL SELECT row_number() OVER (ORDER BY val) - 1, val FROM anon.first_name WHERE random() > .9 ) a JOIN ( SELECT row_number() OVER () AS arbitrary_nr, val AS adresse FROM anon.address ) b USING (arbitrary_nr)
This adds ~ 10 % of all names a second time. (So some names get two addresses.) This time, names are ordered alphabetically. The random sample starts with 0 (
- 1) and numbers can can only increase slower, so the same name can never get the same
arbitrary_nr, and the same address is never combined with the same name twice.
The result is still arbitrary (or random) as long as at least one side gets arbitrary (or random) numbers.
There are many ways. Much depends on your exact input, and exact requirements for the result.