The PDI, the Bad and the SSIS

I was recently tasked with creating a SQL Server Integration Services (SSIS) package to load a flat file that contained records of varying lengths destined for separate tables. Having just come out of a 16-month project where I was using Pentaho's PDI toolset for ETL, I had happy thoughts thinking I would have this wrapped up in time for arvo beers.

I was sorely disappointed though, as what should have been a simple task turned out to be a laborious mess. In all fairness to SSIS there may be a better way to do this, and if there is please let me know, as I failed to find anything. So back to the statement of work: load a flat file containing records of various lengths into a staging table that has as many columns as the longest record.

Using Pentaho this is a simple, pleasant experience. It requires a transformation with a "CSV file input" step that loads the tab-delimited file, and a "Table output" step that takes the records and loads them into a table. There ya go, 2 minutes later and I'm done. Hell, it even scanned the file, suggested data types, and generated the SQL for the destination table… awesome.

PDI Transformation

Using SSIS 2008, it turns out it's not that nice. SSIS has a "Flat File Source" step that you use to pull in data from flat file types. The only caveat is it doesn't support files that have different record lengths. At first I thought I could get past this, as it was just scanning the first row to get the number of columns, but it turns out it just carries on eating chars, including the specified record delimiter char(s), until it has enough values for however many columns the first record had, and then stuffs the leftovers into the last column. What this means is that for records that are "short" it starts eating into the next record's values. (I was however pleased to see that Microsoft's Data Transformation Services (DTS), yes, the previous generation product from a few years back, handles this scenario perfectly.)

So how do you do this in SSIS? Well, you read the data into one large string column, then slap on a "Script Component" and define the output columns and their data types one by one (I had 220 output cols, yay). Then you write a script to split the rows and populate values for all your output columns, and finally add a "SQL Server Destination" step to load the data.
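For what it's worth, the splitting logic that Script Component has to perform boils down to something like the sketch below. (The real component would be VB.NET/C# inside SSIS; this is a Python illustration of the idea, and the 6-column count is a stand-in for the 220 real ones.)

```python
NUM_OUTPUT_COLUMNS = 6  # hypothetical stand-in for the 220 output columns


def split_row(raw_line, num_columns=NUM_OUTPUT_COLUMNS, delimiter="\t"):
    """Split one raw record on the delimiter and pad short records with
    None, so a short row doesn't bleed into the next row's values."""
    values = raw_line.rstrip("\r\n").split(delimiter)
    if len(values) > num_columns:
        raise ValueError(
            f"record has {len(values)} fields, expected at most {num_columns}"
        )
    return values + [None] * (num_columns - len(values))


# A full-length record and a "short" one, as they'd arrive in the
# single large string column:
for line in ["a\tb\tc\td\te\tf", "a\tb\tc"]:
    print(split_row(line))
```

Each output list then maps one-to-one onto the output columns defined in the Script Component, which is what keeps the short records from corrupting their neighbours.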

Needless to say, I'm doing my bit to get Pentaho PDI in here.