Comparing comma separated values from two columns of two different tables

Question

I want to compare the values of two columns (diff table) having comma separated values of two different Oracle tables. I want to find rows that match with all values (NAME1 all values should match &#8230;

Accepted Answer

You could get the table(s) into first normal form and then compare the compounds that are stored in each row.  A starting point could be:{1} Tokenize each row, and write the tokens into a new table.  Give each token its original ID plus a 3-letter prefix, indicating which table the token came from.{2} Group the rows of the new (&#8220;normalized&#8221;) table by ID, and perform a LISTAGG().  Perform a self join, and find matching &#8220;token groups&#8221;.{1} Tokenize, create table as select (CTAS)create table tokensas select  ltrim(        -- ltrim() and rtrim() remove leading/trailing spaces (blanks)    rtrim(       substr( N.wrapped      , instr( N.wrapped, ',', 1, T.pos ) + 1      , ( instr( N.wrapped, ',', 1, T.pos + 1 ) - instr( N.wrapped, ',', 1, T.pos ) ) - 1       )     )  ) token, N.idfrom (          select ',' || name1 || ',' as wrapped, 'T1_' || to_char( id_t1 ) as id from t1 -- names wrapped in commas, (table)_id  union all  select ',' || name2 || ',' , 'T2_' || to_char( id_t2 ) from t2  ) N join (    select level as pos   -- (max) possible position of char in an existing token  from dual   connect by level <= (    select greatest(    -- find the longest string ie max position (query T1 and T2)       ( select max( length( name1 ) ) from t1 )    , ( select max( length( name2 ) ) from t2 )    ) as pos    from dual  )  ) T  on T.pos <= ( length( N.wrapped ) - length( replace( N.wrapped, ',') ) ) - 1 ;The inspiration to tokenize without using CONNECT BY came from this SO answer.The contents of the TOKENS table will look something like this:SQL> select * from tokens ;TOKEN                           ID       ASCORBIC ACID                   T1_1     SODIUM HYDROGEN CARBONATE       T1_2     CAFFEINE                        T1_3     PSEUDOEPHEDRINE HYDROCHLORIDE   T1_4     PARACETAMOL                     T1_100   sodium hydroxide                T1_110   POTASSIUM HYDROGEN CARBONATE    T2_4     SODIUM HYDROGEN CARBONATE       T2_5     PARACETAMOL PH. EUR.            T2_6     CODEINE PHOSPHATE               T2_7     DEXCHLORPHENIRAMINE MALEATE     T2_8     DEXCHLORPHENIRAMINE MALEATE     T2_10    PARACETAMOL                     T2_200 ...{2} GROUP BY, LISTAGG, self joinselect  S1.id id1, S2.id id2, S1.tokengroup_T1, S2.tokengroup_T2from (  select substr( id, 4, length( id ) - 3 ) id  , listagg( token, ' + ' ) within group ( order by token ) tokengroup_T1  from tokens  group by id   having substr( id, 1, 3 ) = 'T1_') S1   join (  select substr( id, 4, length( id ) - 3 ) id  , listagg( token, ' + ' ) within group ( order by token ) tokengroup_T2  from tokens  group by id   having substr( id, 1, 3 ) = 'T2_') S2   on S1.tokengroup_T1 = S2.tokengroup_T2;-- resultID1   ID2   TOKENGROUP_T1                                                 TOKENGROUP_T2                                                 4     10    DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE   DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE   110   210   potassium carbonate + sodium hydroxide                        potassium carbonate + sodium hydroxide                        1     4     ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE    ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE    3     6     CAFFEINE + PARACETAMOL PH. EUR.                               CAFFEINE + PARACETAMOL PH. EUR. When doing things this way, you can get the substances into (alphabetical) order, and you can also pick a &#8220;delimiter&#8221; that you like (we have used &#8216;+&#8217;) here.ALTERNATIVEIf all that is no use to you, or you think this is too complicated, then you could try using TRANSLATE().  In this case, I&#8217;d recommend stripping all spaces/blanks from your dataset (in a query &#8211; not altering the original data!) like so:Queryselect   id1, id2, name1, name2from (  select     id_t1 id1  , id_t2 id2  , T1.name1 name1  , T2.name2 name2  from T1    join T2       on  translate( replace( T1.name1, ' ', '' ), replace( T2.name2, ' ', '' ), '!' )        = translate( replace( T2.name2, ' ', '' ), replace( T1.name1, ' ', '' ), '!' )) ;Result  ID1   ID2 NAME1                                                                NAME2                                                            2     5 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID   SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS            3     6 CAFFEINE, PARACETAMOL PH. EUR.                                       PARACETAMOL PH. EUR.,CAFFEINE                                  100    10 PARACETAMOL, DEXTROMETHORPHAN, PSEUDOEPHEDRINE, PYRILAMINE           DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE     110   210 sodium hydroxide, potassium carbonate                                sodium hydroxide, potassium carbonateNOTE:I&#8217;ve added the following rows to your sample data:-- T1110, 'sodium hydroxide, potassium carbonate'-- T2210, 'sodium hydroxide, potassium carbonate' 211, 'potassium hydroxide, sodium carbonate'I found that it is easy to use TRANSLATE() in a way that gives you &#8220;false positives&#8221; ie the substances with ids 110, 210, and 211 will appear to &#8220;match&#8221;. (In other words: I don&#8217;t think this is the right tool for this job.)DBFIDDLE here(follow the link to see the sample tables and queries).

Advertisement

Answer