How to auto-generate data types for a SQL table

I have many denormalized tables with 200+ columns. These tables live in SQL Server and they often, if not always, define columns as varchar(100) or nvarchar(100) (string) data types, even though the data in most columns is really ints, floats, or other types. It is not feasible for me to go through every table and column to pick and choose data types by hand. For many reasons, including app compatibility, performance, and joins, I must convert these to proper (or at least close to proper) data types. Is there any tool that I can use? Has anyone created code to accomplish this? It doesn’t have to be perfect, but a close match would do.

What I have tried:

  1. I tried exporting these tables to Excel and moving them back to SQL. It worked, but it takes many more hours than doing it manually because Excel mangles your data and converts it to whatever it feels like (think scientific notation, dates made from numbers, etc… christ!). It was very time-consuming and ultimately failed. If you choose the “Text” option in Excel, it just converts everything back to varchar(x).
  2. I tried exporting to a flat file and importing it with the newer Visual Studio / SSMS wizard that suggests data types intelligently. This worked better than Excel, but unfortunately even a single conflicting row stops the whole process. The tool is clunky, gives unhelpful errors, and doesn’t tell you which row caused the problem. This method is also painful because these tables are massive, so it is very time-consuming, especially once you factor in troubleshooting the tool itself.

Thank you for helping. I would also appreciate it if you did not ask me to just drop the task by telling me my setup is bad, etc.

Answer

I’ll assume for now you only care about string columns that:

  1. should stay as strings but are perhaps defined wider than they need to be
  2. shouldn’t be strings because:
    • they only contain dates
    • they only contain numbers, in which case:
      • you will care about lengths (to determine potential tinyint/int/bigint)
      • you will care about whether they contain decimals

You’ve already seen an approach for determining whether columns defined as integers could be made smaller, and a similar approach can be used to find candidates in a table where the data type is currently a string but the data meets one of the criteria above.
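
For reference, that kind of check against an existing integer column can be as simple as the rough sketch below (dbo.SomeTable and SomeIntCol are stand-in names for illustration, not anything from your schema):

-- Rough sketch: does the data in an existing int column actually fit a smaller type?
SELECT MinValue     = MIN(SomeIntCol),
       MaxValue     = MAX(SomeIntCol),
       FitsTinyint  = CASE WHEN MIN(SomeIntCol) >= 0      AND MAX(SomeIntCol) <= 255   THEN 1 ELSE 0 END,
       FitsSmallint = CASE WHEN MIN(SomeIntCol) >= -32768 AND MAX(SomeIntCol) <= 32767 THEN 1 ELSE 0 END
FROM dbo.SomeTable;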

Let’s say you have a table like this:

CREATE TABLE dbo.foo
(
  a int PRIMARY KEY, 
  h varchar(100),
  i varchar(100),
  j varchar(100)
);

INSERT dbo.foo VALUES 
(1,'123','123','20200101 04:00:00'),
(2,'456','456','20200101'),
(3,'789','789','20200101'),
(4,'867','foo','20200101'),
(5,'876','876','20200101'),
(6,'6.54','654','20200101');

One approach would be to determine all the metadata for the columns as they are defined (which you can get easily from sys.dm_exec_describe_first_result_set), then from that build dynamic SQL to check each column for the longest value (which will determine the smallest string size), whether there is a single non-numeric (which means you can’t convert to a number), whether there is a single non-date (which means you can’t convert to a date), and whether there is a decimal point (which means you can’t convert to an int family, but you’ll need to also check precision/scale).

This is absolutely just a rough, dirty kick-start, but it should get you going.

DECLARE @table nvarchar(513) = N'dbo.foo';

-- @sql becomes the body of a CTE (one row per table row, with per-column metrics),
-- @un becomes a UNION ALL of per-column aggregates over that CTE, and
-- @un_sub is the aggregate template repeated for each column ($c$ = column name).
DECLARE @sql nvarchar(max) = N'SELECT ', @un nvarchar(max) = N'',
  @un_sub nvarchar(max) = N'
  SELECT ColumnName = MIN([col $c$]), 
  CurrentType = MIN([type $c$]), 
  LongestValue = MAX([len $c$]), 
  [AllNumerics?] = MIN([is_num $c$]), 
  [AllDates?] = MIN([is_date $c$]),
  [AnyContainDecimal] = MAX([has_dec $c$]) FROM x ';

-- For every (n)varchar column, add per-row expressions (length, is-numeric,
-- is-date, has-decimal) to @sql and a per-column aggregate query to @un.
SELECT @sql += N'[col ' + name + '] = ''' + name + ''',
   [type ' + name + '] = '''
  + system_type_name + ''',' + QUOTENAME(name)
  + ', [len ' + name + '] = LEN(' + QUOTENAME(name) + '),
  [is_num ' + name + '] = CONVERT(tinyint, ISNUMERIC(' + QUOTENAME(name) + ')),
  [is_date ' + name + '] = CONVERT(tinyint, ISDATE(' + QUOTENAME(name) + ')),
  [has_dec ' + name + '] = CASE WHEN ISNUMERIC(' + QUOTENAME(name) + ') = 1
    AND ' + QUOTENAME(name) + ' LIKE N''%.%'' THEN 1 ELSE 0 END,',
  @un += N'
UNION ALL ' + REPLACE(@un_sub, N'$c$', name)
FROM sys.dm_exec_describe_first_result_set('SELECT * FROM ' + @table, NULL, 1)
WHERE system_type_name LIKE '%char%';

-- Add a dummy column to absorb the trailing comma and finish the CTE body.
SELECT @sql += N'[$garbage$]='''' FROM ' + @table;

-- Wrap the CTE around @sql and append the union (STUFF strips the first UNION ALL).
SELECT @sql = N';WITH x AS (
' + @sql + N'
) ' + STUFF(@un, 1, 10, '');

EXEC sys.sp_executesql @sql;

It’s a lot to digest… dynamic SQL is powerful but it is really ugly and not exactly object-oriented.
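
To make it easier to see what the script actually produces, here is roughly what the generated statement would look like if dbo.foo had only the single string column h (the real script repeats the same pattern for every (n)varchar column):

;WITH x AS (
SELECT [col h] = 'h',
   [type h] = 'varchar(100)',[h], [len h] = LEN([h]),
  [is_num h] = CONVERT(tinyint, ISNUMERIC([h])),
  [is_date h] = CONVERT(tinyint, ISDATE([h])),
  [has_dec h] = CASE WHEN ISNUMERIC([h]) = 1
    AND [h] LIKE N'%.%' THEN 1 ELSE 0 END,
  [$garbage$]='' FROM dbo.foo
)
  SELECT ColumnName = MIN([col h]),
  CurrentType = MIN([type h]),
  LongestValue = MAX([len h]),
  [AllNumerics?] = MIN([is_num h]),
  [AllDates?] = MIN([is_date h]),
  [AnyContainDecimal] = MAX([has_dec h]) FROM x;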

Results (try it out in this fiddle):

[Screenshot of the query results: one row per string column, showing ColumnName, CurrentType, LongestValue, AllNumerics?, AllDates?, and AnyContainDecimal.]

You can see there that:

  1. h is all numerics, and the longest value is 4, but at least one value contains a decimal point, so the optimal type here is decimal(something, something).
  2. i contains at least one non-numeric and at least one non-date value, therefore it can only be a string; but since the longest value is only 3 characters, varchar(100) is far too wide. Whether you can go to varchar(3) or char(3), or whether you need to future-proof with a little padding, is really a question you can only answer given your data model, business requirements now and later, etc.
  3. j contains only values that parse as dates, but you can’t read much into the max length here, because you have no idea how the dates are actually stored; they are stored as strings, and strings of many, many forms can be interpreted as a valid date. So you probably know enough to say j should be a datetime of some flavor, but you’ll need to look closer at the values to get a feel for what’s actually there (a sketch of what the eventual fixes might look like follows this list).
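
Purely as an illustration for the sample table, the eventual fixes would just be ALTER COLUMN statements; the precision/scale, string length, and datetime flavor below are my guesses after eyeballing the data, not something the script derives for you:

-- Illustrative only; verify the data and pick sizes deliberately.
ALTER TABLE dbo.foo ALTER COLUMN h decimal(9,2);  -- all numeric, at least one decimal
ALTER TABLE dbo.foo ALTER COLUMN i varchar(3);    -- strings only, longest value is 3 characters
ALTER TABLE dbo.foo ALTER COLUMN j datetime2(0);  -- every value parses as a date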

You could change the output of this query (especially for tables with lots of columns) to only return values that are worth investigating; in this case I returned all rows to demonstrate (and every column has a potential fix in my example anyway). Just add another CTE around the union and filter based on those columns (or others you add), as in the sketch below.
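
For example, an untested sketch of that filtering (the conditions and the length threshold are arbitrary placeholders) would be to replace the final assembly step above with something like:

SELECT @sql = N';WITH x AS (
' + @sql + N'
), y AS (
' + STUFF(@un, 1, 10, '') + N'
)
SELECT * FROM y
WHERE [AllNumerics?] = 1    -- could probably become a numeric type
   OR [AllDates?] = 1       -- could probably become a date/time type
   OR LongestValue < 50;    -- declared much wider than the data needs';

EXEC sys.sp_executesql @sql;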

Of course, on big tables this will potentially scan the data once for every column, so don’t expect it to be fast, and expect it to devour a lot of memory if you’re short on that. Also, this is probably obvious, but it can’t protect you from picking a type that will hurt you later. Say the column has been collecting integers and has just reached 99, so you change the type to tinyint because there are no decimals and the longest length is 2. Then someone inserts 256 and boom.

You could also add other enhancements: get the min length as well (if they’re all strings, maybe you have varchar but it could be char), check whether any characters fall outside of ASCII (maybe you have nvarchar but it could be varchar), count the digits on either side of the decimal point (to be more precise about decimal types), or capture the max value (to be more accurate about which integer type fits). I’ll leave those as an exercise.
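
To give a flavor of what those extra checks might look like, here is an untested sketch of the per-row expressions for one column of the sample table; in the script above you would build these strings the same way as the existing len/is_num/is_date/has_dec expressions and aggregate them with MIN()/MAX() in @un_sub:

-- Untested sketch of extra per-row metrics for column h.
SELECT
  [shortlen h]  = LEN(h),                               -- MIN() of this: char vs. varchar candidate
  [non_ascii h] = CASE WHEN h LIKE N'%[^ -~]%' COLLATE Latin1_General_100_BIN2
                       THEN 1 ELSE 0 END,               -- MAX() of this: is nvarchar really needed?
  [max_num h]   = TRY_CONVERT(bigint, h)                -- MAX()/MIN() of this: integer sizing
                                                        -- (NULL for values that aren't whole numbers)
FROM dbo.foo;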
