Skip to content
Advertisement

SQL Server: How to flatten nested arrays by merging values using

I have 10000 jsons with different ids each has 10000 names. How to flatten nested arrays by merging values usin SQL Server? Jsons can be read in any language, I’m looking for any SQL dialect that can transform the data as I’m using spark connectors. I use many SQL dialects including not limiting Spark SQL, Postgresql,MySql, SQLite and SQL Server…

NOTE: I was asked by Martijn Pieters to create duplicates to be specific for each SQL dialect so this is for SQL Server.

Notes:

  • Input dataframe has more than 10000 columns name_1_a, name_1000_xx so column(array) names can not be hardcoded as it will requires to write 10000 names
  • id, date, val has always the same naming convention across all columns and all jsons
  • array size can vary but date, val are always there so they can be hardcoded
  • date can be different in each array, for example name_1_a starts with 2001, but name_10000_xvz for id == 1 starts with 2000 and finnish with 2004, however for id == 2 starts with 1990 and finish with 2004

Input df:

root
 |-- id: long (nullable = true)
 |-- name_10000_xvz: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: long (nullable = true)
 |    |    |-- val: long (nullable = true)
 |-- name_1_a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: long (nullable = true)
 |    |    |-- val: long (nullable = true)
 |-- name_1_b: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: long (nullable = true)
 |    |    |-- val: long (nullable = true)
 |-- name_2_a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- date: long (nullable = true)
 |    |    |-- val: long (nullable = true)

+---+------------------------------------------------------------------------+---------------------------------+---------------------------------+------------------------------------+
|id |name_10000_xvz                                                          |name_1_a                         |name_1_b                         |name_2_a                            |
+---+------------------------------------------------------------------------+---------------------------------+---------------------------------+------------------------------------+
|2  |[{1990, 39}, {2000, 30}, {2001, 31}, {2002, 32}, {2003, 33}, {2004, 34}]|[{2001, 1}, {2002, 2}, {2003, 3}]|[{2001, 4}, {2002, 5}, {2003, 6}]|[{2001, 21}, {2002, 22}, {2003, 23}]|
|1  |[{2000, 30}, {2001, 31}, {2002, 32}, {2003, 33}]                        |[{2001, 1}, {2002, 2}, {2003, 3}]|[{2001, 4}, {2002, 5}, {2003, 6}]|[{2001, 21}, {2002, 22}, {2003, 23}]|
+---+------------------------------------------------------------------------+---------------------------------+---------------------------------+------------------------------------+

Required output df:

+---+---------+----------+-----------+---------+----------------+
|id |   date  | name_1_a | name_1_b  |name_2_a | name_10000_xvz |
+---+---------+----------+-----------+---------+----------------+
|1  |   2000  |     0    |    0      |   0     |        30      |
|1  |   2001  |     1    |    4      |   21    |        31      |
|1  |   2002  |     2    |    5      |   22    |        32      |
|1  |   2003  |     3    |    6      |   23    |        33      |
|2  |   1990  |     0    |    0      |   0     |        39      |
|2  |   2000  |     0    |    0      |   0     |        30      |
|2  |   2001  |     1    |    4      |   21    |        31      |
|2  |   2002  |     2    |    5      |   22    |        32      |
|2  |   2003  |     3    |    6      |   23    |        33      |
|2  |   2004  |     0    |    0      |   0     |        34      |
+---+---------+----------+-----------+---------+----------------+

Below are jsons for input df:

1.json

{  "id": 1,  "name_1_a": [    {      "date": 2001,      "val": 1    },    {      "date": 2002,      "val": 2    },    {      "date": 2003,      "val": 3    }  ],  "name_1_b": [    {      "date": 2001,      "val": 4    },    {      "date": 2002,      "val": 5    },    {      "date": 2003,      "val": 6    }  ],  "name_2_a": [    {      "date": 2001,      "val": 21    },    {      "date": 2002,      "val": 22    },    {      "date": 2003,      "val": 23    }  ],   "name_10000_xvz": [    {        "date": 2000,        "val": 30    },    {      "date": 2001,      "val": 31    },    {      "date": 2002,      "val": 32    },    {      "date": 2003,      "val": 33    }  ]}

2.json

{  "id": 2,  "name_1_a": [    {      "date": 2001,      "val": 1    },    {      "date": 2002,      "val": 2    },    {      "date": 2003,      "val": 3    }  ],  "name_1_b": [    {      "date": 2001,      "val": 4    },    {      "date": 2002,      "val": 5    },    {      "date": 2003,      "val": 6    }  ],  "name_2_a": [    {      "date": 2001,      "val": 21    },    {      "date": 2002,      "val": 22    },    {      "date": 2003,      "val": 23    }  ],  "name_10000_xvz": [    {        "date": 1990,        "val": 39      },    {      "date": 2000,      "val": 30    },    {      "date": 2001,      "val": 31    },    {      "date": 2002,      "val": 32    },    {      "date": 2003,      "val": 33    },    {      "date": 2004,      "val": 34    }  ]}}

Advertisement

Answer

OK, so we have 2 “problems” we need to solve here. Firstly, the fact that you need a dynamic number of columns as you don’t know what names are you your data. This means you need dynamic SQL.

Next is the problem that not every name has a value for every year, so we need to also have a “year” table we can LEFT JOIN from so that we have a row for every name.

This, as a result, is going to be really messy, but it can be done. I’ve left comments where I can on this, but the best thing i can really suggest is taking the time to read the SQL, PRINTing/SELECTing the dynamic statement, and learning what it does.

First let’s build a static version, so you can see what it would look like. So here I use a CTE to get all the years, and then another to get the data in a normalised format from the JSON. Finally unpivot the data using condititional aggregation:

--Sample JSON
DECLARE @JSON nvarchar(MAX) = N'{  "id": 1,  "name_1_a": [    {      "date": 2001,      "val": 1    },    {      "date": 2002,      "val": 2    },    {      "date": 2003,      "val": 3    }  ],  "name_1_b": [    {      "date": 2001,      "val": 4    },    {      "date": 2002,      "val": 5    },    {      "date": 2003,      "val": 6    }  ],  "name_2_a": [    {      "date": 2001,      "val": 21    },    {      "date": 2002,      "val": 22    },    {      "date": 2003,      "val": 23    }  ],   "name_10000_xvz": [    {        "date": 2000,        "val": 30    },    {      "date": 2001,      "val": 31    },    {      "date": 2002,      "val": 32    },    {      "date": 2003,      "val": 33    }  ]}';

--Get distinct Years
WITH Years AS(
    SELECT DISTINCT V.date
    FROM OPENJSON(@JSON) J
         CROSS APPLY (SELECT *
                      FROM OPENJSON(J.[value]) 
                           WITH(date int) 
                      WHERE ISJSON(J.[value]) = 1) V),
--Get Data
Data AS(
    SELECT J.[key] AS [name],
           V.date,
           V.val 
    FROM OPENJSON(@JSON) J
         CROSS APPLY (SELECT *
                      FROM OPENJSON(J.[value]) 
                           WITH(date int,
                                val int) 
                      WHERE ISJSON(J.[value]) = 1) V)
--Final Select and Unpivot
SELECT JSON_VALUE(@JSON, '$.id') AS ID,
       Y.Date,
       ISNULL(MAX(CASE D.[name] WHEN 'name_1_a' THEN D.val END),0) AS name_1_a,
       ISNULL(MAX(CASE D.[name] WHEN 'name_1_b' THEN D.val END),0) AS name_1_b,
       ISNULL(MAX(CASE D.[name] WHEN 'name_2_a' THEN D.val END),0) AS name_2_a,
       ISNULL(MAX(CASE D.[name] WHEN 'name_10000_xvz' THEN D.val END),0) AS name_10000_xvz
FROM Years Y
     LEFT JOIN Data D ON Y.Date = D.Date
GROUP BY Y.Date;

As I mentioned, however, this isn’t dynamic. This is, therefore where it gets a little more messy. I, for the below, am assuming you’re using a recent version of SQL Server, and thus have access to STRING_AGG (if not, you’ll need to use the old FOR XML PATH and STUFF method):

--Sample JSON
DECLARE @JSON nvarchar(MAX) = N'{  "id": 1,  "name_1_a": [    {      "date": 2001,      "val": 1    },    {      "date": 2002,      "val": 2    },    {      "date": 2003,      "val": 3    }  ],  "name_1_b": [    {      "date": 2001,      "val": 4    },    {      "date": 2002,      "val": 5    },    {      "date": 2003,      "val": 6    }  ],  "name_2_a": [    {      "date": 2001,      "val": 21    },    {      "date": 2002,      "val": 22    },    {      "date": 2003,      "val": 23    }  ],   "name_10000_xvz": [    {        "date": 2000,        "val": 30    },    {      "date": 2001,      "val": 31    },    {      "date": 2002,      "val": 32    },    {      "date": 2003,      "val": 33    }  ]}';

--Variables for dynamic SQL
DECLARE @SQL nvarchar(MAX),
        @CRLF nchar(2) = NCHAR(13) + NCHAR(10);

DECLARE @Delimiter varchar(20) = N',' + @CRLF + N'       ';


--You'll note the start is all the same
SET @SQL = N'--Get disinct Years' + @CRLF + 
           N'WITH Years AS(' + @CRLF + 
           N'    SELECT DISTINCT V.date' + @CRLF + 
           N'    FROM OPENJSON(@JSON) J' + @CRLF + 
           N'         CROSS APPLY (SELECT *' + @CRLF + 
           N'                      FROM OPENJSON(J.[value]) ' + @CRLF + 
           N'                           WITH(date int) ' + @CRLF + 
           N'                      WHERE ISJSON(J.[value]) = 1) V),' + @CRLF + 
           N'--Get Data' + @CRLF + 
           N'Data AS(' + @CRLF + 
           N'    SELECT J.[key] AS [name],' + @CRLF + 
           N'           V.date,' + @CRLF + 
           N'           V.val ' + @CRLF + 
           N'    FROM OPENJSON(@JSON) J' + @CRLF + 
           N'         CROSS APPLY (SELECT *' + @CRLF + 
           N'                      FROM OPENJSON(J.[value]) ' + @CRLF + 
           N'                           WITH(date int,' + @CRLF + 
           N'                                val int) ' + @CRLF + 
           N'                      WHERE ISJSON(J.[value]) = 1) V)' + @CRLF + 
           N'--Final Select and Unpivot' + @CRLF + 
           N'SELECT JSON_VALUE(@JSON, ''$.id'') AS ID,' + @CRLF +
           N'       Y.Date,' + @CRLF +
           (SELECT STRING_AGG(N'ISNULL(MAX(CASE D.[name] WHEN ' + QUOTENAME(J.[key],'''') + N' THEN D.val END),0) AS ' + QUOTENAME(J.[key]),@Delimiter)
            FROM OPENJSON(@JSON) J) + @CRLF +
           N'FROM Years Y' + @CRLF +
           N'     LEFT JOIN Data D ON Y.Date = D.Date' + @CRLF +
           N'GROUP BY Y.Date;';

PRINT @SQL; --YOur best friend for debugging

EXEC sys.sp_executesql @SQL, N'@JSON nvarchar(MAX)', @JSON;

db<>fiddle

User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement