How to auto-generate data types for a SQL table

I have many denormalized tables with 200+ columns. These tables live in SQL Server and they often, if not always, define columns as varchar(100) or nvarchar(100) (string) data types, even though the data in most columns is really ints, floats, or other types. It is not feasible for me to go through every table and column to pick and choose data types by hand. For many reasons, including app compatibility, performance, and joins, I must convert these to proper (or at least close to proper) data types. Is there any tool that I can use? Has anyone created code to accomplish this? It doesn’t have to be perfect, but a close match would do.

What I have tried:

  1. I tried exporting these tables to Excel and moving them back to SQL. It worked, but it takes many more hours than doing it manually because Excel mangles your data and converts it to whatever it feels like (think scientific notation, dates made from numbers, etc… christ!). It was very time-consuming and ultimately failed. If you choose the “Text” option in Excel, it just converts everything back to varchar(x).
  2. I tried exporting to a flat file and importing it with the newer Visual Studio / SSMS wizard that suggests data types intelligently. This worked better than Excel, but unfortunately even a single conflicting row stops the whole process. The tool is clunky, gives unhelpful errors, and doesn’t tell you which row caused the problem. This method is also painful because these tables are massive, so it is very time-consuming, especially once you factor in troubleshooting the tool itself.

Thank you for helping. I would also appreciate it if you did not ask me to just drop the task by telling me my setup is bad, etc.

Answer

I’ll assume for now you only care about string columns that:

  1. should stay as strings but are perhaps defined wider than they need to be
  2. shouldn’t be strings because:
    • they only contain dates
    • they only contain numbers, in which case:
      • you will care about lengths (to determine potential tinyint/int/bigint)
      • you will care about whether they contain decimals

You’ve already seen an approach for determining whether columns defined as integers could be made smaller, and a similar approach can be used to find candidates in a table where the data type is currently a string but the data meets one of the criteria above.
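
For reference, that kind of check against an existing integer column can be as simple as the rough sketch below (dbo.SomeTable and SomeIntCol are stand-in names for illustration, not anything from your schema):

-- Rough sketch: does the data in an existing int column actually fit a smaller type?
SELECT MinValue     = MIN(SomeIntCol),
       MaxValue     = MAX(SomeIntCol),
       FitsTinyint  = CASE WHEN MIN(SomeIntCol) >= 0      AND MAX(SomeIntCol) <= 255   THEN 1 ELSE 0 END,
       FitsSmallint = CASE WHEN MIN(SomeIntCol) >= -32768 AND MAX(SomeIntCol) <= 32767 THEN 1 ELSE 0 END
FROM dbo.SomeTable;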

Let’s say you have a table like this:

CREATE TABLE dbo.foo
(
  a int PRIMARY KEY, 
  h varchar(100),
  i varchar(100),
  j varchar(100)
);

INSERT dbo.foo VALUES 
(1,'123','123','20200101 04:00:00'),
(2,'456','456','20200101'),
(3,'789','789','20200101'),
(4,'867','foo','20200101'),
(5,'876','876','20200101'),
(6,'6.54','654','20200101');

One approach would be to determine all the metadata for the columns as they are defined (which you can get easily from sys.dm_exec_describe_first_result_set), then from that build dynamic SQL to check each column for the longest value (which will determine the smallest string size), whether there is a single non-numeric (which means you can’t convert to a number), whether there is a single non-date (which means you can’t convert to a date), and whether there is a decimal point (which means you can’t convert to an int family, but you’ll need to also check precision/scale).

This is absolutely just a rough, dirty kick-start, but it should get you going.

DECLARE @table nvarchar(513) = N'dbo.foo';

-- @sql becomes the body of a CTE (one row per table row, with per-column metrics),
-- @un becomes a UNION ALL of per-column aggregates over that CTE, and
-- @un_sub is the aggregate template repeated for each column ($c$ = column name).
DECLARE @sql nvarchar(max) = N'SELECT ', @un nvarchar(max) = N'',
  @un_sub nvarchar(max) = N'
  SELECT ColumnName = MIN([col $c$]), 
  CurrentType = MIN([type $c$]), 
  LongestValue = MAX([len $c$]), 
  [AllNumerics?] = MIN([is_num $c$]), 
  [AllDates?] = MIN([is_date $c$]),
  [AnyContainDecimal] = MAX([has_dec $c$]) FROM x ';

-- For every (n)varchar column, add per-row expressions (length, is-numeric,
-- is-date, has-decimal) to @sql and a per-column aggregate query to @un.
SELECT @sql += N'[col ' + name + '] = ''' + name + ''',
   [type ' + name + '] = '''
  + system_type_name + ''',' + QUOTENAME(name)
  + ', [len ' + name + '] = LEN(' + QUOTENAME(name) + '),
  [is_num ' + name + '] = CONVERT(tinyint, ISNUMERIC(' + QUOTENAME(name) + ')),
  [is_date ' + name + '] = CONVERT(tinyint, ISDATE(' + QUOTENAME(name) + ')),
  [has_dec ' + name + '] = CASE WHEN ISNUMERIC(' + QUOTENAME(name) + ') = 1
    AND ' + QUOTENAME(name) + ' LIKE N''%.%'' THEN 1 ELSE 0 END,',
  @un += N'
UNION ALL ' + REPLACE(@un_sub, N'$c$', name)
FROM sys.dm_exec_describe_first_result_set('SELECT * FROM ' + @table, NULL, 1)
WHERE system_type_name LIKE '%char%';

-- Add a dummy column to absorb the trailing comma and finish the CTE body.
SELECT @sql += N'[$garbage$]='''' FROM ' + @table;

-- Wrap the CTE around @sql and append the union (STUFF strips the first UNION ALL).
SELECT @sql = N';WITH x AS (
' + @sql + N'
) ' + STUFF(@un, 1, 10, '');

EXEC sys.sp_executesql @sql;

It’s a lot to digest… dynamic SQL is powerful but it is really ugly and not exactly object-oriented.
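
To make it easier to see what the script actually produces, here is roughly what the generated statement would look like if dbo.foo had only the single string column h (the real script repeats the same pattern for every (n)varchar column):

;WITH x AS (
SELECT [col h] = 'h',
   [type h] = 'varchar(100)',[h], [len h] = LEN([h]),
  [is_num h] = CONVERT(tinyint, ISNUMERIC([h])),
  [is_date h] = CONVERT(tinyint, ISDATE([h])),
  [has_dec h] = CASE WHEN ISNUMERIC([h]) = 1
    AND [h] LIKE N'%.%' THEN 1 ELSE 0 END,
  [$garbage$]='' FROM dbo.foo
)
  SELECT ColumnName = MIN([col h]),
  CurrentType = MIN([type h]),
  LongestValue = MAX([len h]),
  [AllNumerics?] = MIN([is_num h]),
  [AllDates?] = MIN([is_date h]),
  [AnyContainDecimal] = MAX([has_dec h]) FROM x;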

Results (try it out in this fiddle):

[Screenshot of the query results: one row per string column, showing ColumnName, CurrentType, LongestValue, AllNumerics?, AllDates?, and AnyContainDecimal.]

You can see there that:

  1. h is all numerics, and the longest value is 4, but at least one value contains a decimal point, so the optimal type here is decimal(something, something).
  2. i contains at least one non-numeric and at least one non-date value, therefore it can only be a string; but since the longest value is only 3 characters, varchar(100) is far too wide. Whether you can go to varchar(3) or char(3), or whether you need to future-proof with a little padding, is really a question you can only answer given your data model, business requirements now and later, etc.
  3. j contains only values that parse as dates, but you can’t read much into the max length here, because you have no idea how the dates are actually stored; they are stored as strings, and strings of many, many forms can be interpreted as a valid date. So you probably know enough to say j should be a datetime of some flavor, but you’ll need to look closer at the values to get a feel for what’s actually there (a sketch of what the eventual fixes might look like follows this list).
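
Purely as an illustration for the sample table, the eventual fixes would just be ALTER COLUMN statements; the precision/scale, string length, and datetime flavor below are my guesses after eyeballing the data, not something the script derives for you:

-- Illustrative only; verify the data and pick sizes deliberately.
ALTER TABLE dbo.foo ALTER COLUMN h decimal(9,2);  -- all numeric, at least one decimal
ALTER TABLE dbo.foo ALTER COLUMN i varchar(3);    -- strings only, longest value is 3 characters
ALTER TABLE dbo.foo ALTER COLUMN j datetime2(0);  -- every value parses as a date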

You could change the output of this query (especially for tables with lots of columns) to only return values that are worth investigating; in this case I returned all rows to demonstrate (and every column has a potential fix in my example anyway). Just add another CTE around the union and filter based on those columns (or others you add), as in the sketch below.
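
For example, an untested sketch of that filtering (the conditions and the length threshold are arbitrary placeholders) would be to replace the final assembly step above with something like:

SELECT @sql = N';WITH x AS (
' + @sql + N'
), y AS (
' + STUFF(@un, 1, 10, '') + N'
)
SELECT * FROM y
WHERE [AllNumerics?] = 1    -- could probably become a numeric type
   OR [AllDates?] = 1       -- could probably become a date/time type
   OR LongestValue < 50;    -- declared much wider than the data needs';

EXEC sys.sp_executesql @sql;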

Of course, on big tables this will potentially scan the data once for every column, so don’t expect it to be fast, and expect it to devour a lot of memory if you’re short on that. Also, this is probably obvious, but it can’t protect you from picking a type that will hurt you later. Say the column has been collecting integers and has just reached 99, so you change the type to tinyint because there are no decimals and the longest length is 2. Then someone inserts 256 and boom.

You could also add other enhancements: get the min length as well (if they’re all strings, maybe you have varchar but it could be char), check whether any characters fall outside of ASCII (maybe you have nvarchar but it could be varchar), count the digits on either side of the decimal point (to be more precise about decimal types), or capture the max value (to be more accurate about which integer type fits). I’ll leave those as an exercise.
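
To give a flavor of what those extra checks might look like, here is an untested sketch of the per-row expressions for one column of the sample table; in the script above you would build these strings the same way as the existing len/is_num/is_date/has_dec expressions and aggregate them with MIN()/MAX() in @un_sub:

-- Untested sketch of extra per-row metrics for column h.
SELECT
  [shortlen h]  = LEN(h),                               -- MIN() of this: char vs. varchar candidate
  [non_ascii h] = CASE WHEN h LIKE N'%[^ -~]%' COLLATE Latin1_General_100_BIN2
                       THEN 1 ELSE 0 END,               -- MAX() of this: is nvarchar really needed?
  [max_num h]   = TRY_CONVERT(bigint, h)                -- MAX()/MIN() of this: integer sizing
                                                        -- (NULL for values that aren't whole numbers)
FROM dbo.foo;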
