A system needed to be transferred in a hurry. The only way to extract data from that system was via csv files. The data was dirty meaning fields were not consistent across different export files (extra space and unprintable characters in id fields, additional information in text fields, date incongruities, etc.) .

The software used here is:

  • Excel
  • Visual Studio 2017 SSDT

The first step was to save the csv files as excel macro enabled worksheets so I could do some basic formatting within the excel files. Then I enabled the developer tab. I used the default macro to write some code to do the basic formatting. I chose to do this formatting within the excel file rather than in SSIS because it just seemed simpler. I explored ways to do this via SSIS but the benefit cost analysis deemed just getting it done more viable. As any programmer knows the smallest tasks can turn into some deep rabbit holes.

Here is an example of one macro for one excel file.

  • It sets the source range so it doesn’t search thousands of empty rows/columns.
  • Deletes empty rows and header rows
  • it merges rows with duplicate ids (there might be three rows for one id with separate text in say the C cell. This would combine them into the first row separated by a pipe character and delete the rows with duplicate ids).

Example Excel Macro

Sub testing()

    Application.EnableEvents = False
    
    On Error Resume Next
      'Activate PowerPoint Library (version 2.9)
      ThisWorkbook.VBProject.References.AddFromGuid _
        GUID:="{3F4DACA7-160D-11D2-A8E9-00104B365C9F}", Major:=2, Minor:=9
    'Reset Error Handler
    On Error GoTo 0
    
    Application.Calculation = xlCalculationManual
    Application.ScreenUpdating = False
    Application.EnableEvents = False
    
    Dim SourceRange As Range
    Dim EntireRow As Range
    Dim iCntr As Long
    Dim previousrow As Long
    Dim finalstring As String

    For Each Sh In ThisWorkbook.Worksheets
        
        Set SourceRange = Sh.Range("A1:M" + CStr(Sh.UsedRange.Rows.Count))
        
        For iCntr = SourceRange.Rows.Count To 1 Step -1
            Set EntireRow = SourceRange.Cells(iCntr, 1).EntireRow
            If WorksheetFunction.CountA(EntireRow) = 0 Then
                EntireRow.Delete
                If iCntr Mod 5000 = 0 Then Debug.Print Sh.Name + " | " + CStr(iCntr)
            End If
        Next
        
        For iCntr = SourceRange.Rows.Count To 1 Step -1
            If Mid(SourceRange.Cells(iCntr, 1).Value, 1, 9) = "something:" Then
                Set EntireRow = SourceRange.Cells(iCntr, 1).EntireRow
                EntireRow.Delete
            End If
            If Mid(SourceRange.Cells(iCntr, 1).Value, 1, 8) = "somethingelse" Then
                Set EntireRow = SourceRange.Cells(iCntr, 1).EntireRow
                EntireRow.Delete
            End If
        Next
        
        For iCntr = SourceRange.Rows.Count To 1 Step -1
            previousrow = iCntr - 1
            If iCntr > 1 Then
                If SourceRange.Cells(iCntr, 1).Value = "name1" Or SourceRange.Cells(iCntr, 1).Value = "cnelson" Or SourceRange.Cells(iCntr, 1).Value = "name2" Or SourceRange.Cells(iCntr, 1).Value = "name3" Or SourceRange.Cells(iCntr, 1).Value = "name4" Or SourceRange.Cells(iCntr, 1).Value = "name5" Or SourceRange.Cells(iCntr, 1).Value = "name6" Or SourceRange.Cells(iCntr, 1).Value = "name7" Or SourceRange.Cells(iCntr, 1).Value = "name8" Then
                    
                    If IsEmpty(SourceRange.Range("B" + CStr(iCntr)).Value) Then
                        finalstring = ""
                    Else
                        finalstring = CStr(SourceRange.Range("B" + CStr(iCntr)).Value)
                    End If
                    
                    If IsEmpty(SourceRange.Range("Q" + CStr(iCntr)).Value) Then
                        previousstring = ""
                    Else
                        previousstring = CStr(SourceRange.Range("Q" + CStr(iCntr)).Value)
                    End If
                    
                    SourceRange.Range("Q" + CStr(previousrow)).Value = previousstring + " | " + finalstring
                    finalstring = ""
                    previousstring = ""


                    If IsEmpty(SourceRange.Range("C" + CStr(iCntr)).Value) Then
                        finalstring = ""
                    Else
                        finalstring = CStr(SourceRange.Range("C" + CStr(iCntr)).Value)
                    End If
                    
                    If IsEmpty(SourceRange.Range("R" + CStr(iCntr)).Value) Then
                        previousstring = ""
                    Else
                        previousstring = CStr(SourceRange.Range("R" + CStr(iCntr)).Value)
                    End If
                    
                    SourceRange.Range("R" + CStr(previousrow)).Value = previousstring + " | " + finalstring
                    finalstring = ""
                    previousstring = ""

                    Set EntireRow = SourceRange.Cells(iCntr, 1).EntireRow
                    EntireRow.Delete
                    iCntr = iCntr - 1
                End If
            End If
        Next
    
    Next Sh

    Application.Calculation = xlCalculationAutomatic
    Application.ScreenUpdating = True
    Application.EnableEvents = True

End Sub

SSIS Control Flow

The control flow dictates eye from the sky steps in the migration process. Since I have several different excel workbooks importing into several different tables, you can see there are lots of sql tasks to truncate those tables before new insertions. Also there are separate tasks for a bit more housecleaning of the data and a few steps that can be run stand alone.

I included all the excel files and the error files in the miscellaneous directory to centralize operations. You can see under the connection managers the connections for each file.

Control Flow Image

SSIS Data Flow

This is the busiest data flow as it is the all in one import. Each task is made up of

  • a source call with an accompanying error file
  • a data conversion section since excel data types do not translate directly to sql data types
  • a database destination with accompanying error file

There are other data flows some of which are a single transaction so I can just do a single import and some of which are more cleaning routines.

TSQL Final Cleanup and Import

After all the data has been moved into the separate tables, I created a final tsql project to do some final formats to the data, standardize it, and put it into a temporary table that mimics the production table.

This import was a bit of a unique scenario as I was forced to move the data into a legacy application that used an access database for updates. I am in the process of moving that application to the web but have only gotten the front-end up for search, not the back-end side for updates. But that can be in a separate post!

So here is a sample of the final SQL. This could have been put into the SSIS package as well but since it was a run once and done type scenario I just ran it from SQL Server Management Studio.

--------------------------------------------------------

;WITH CTE7 (ALTID,LOCNUMBER,LOCSTREET) AS (
SELECT ALTID, CASE WHEN ISNUMERIC(SUBSTRING(ISNULL([Address],''),0,CHARINDEX(' ',ISNULL([Address],'')))) = 1 THEN SUBSTRING(ISNULL([Address],''),0,CHARINDEX(' ',ISNULL([Address],''))) ELSE '' END, CASE WHEN ISNUMERIC(SUBSTRING(ISNULL([Address],''),0,CHARINDEX(' ',ISNULL([Address],'')))) = 1 THEN SUBSTRING(ISNULL([Address],''),CHARINDEX(' ',ISNULL([Address],''))+1,LEN(ISNULL([Address],''))) ELSE ISNULL([Address],'') END
FROM [dbo].[VNewThings]
)

UPDATE [Tbl  test] 
SET [Location Number] = LOCNUMBER,
	[Location Street] = LOCSTREET
FROM CTE7 
WHERE CTE7.ALTID = [Tbl  test].[Thing Number]

--------------------------------------------------------
--GET ONLY DISTINCT THINGS PER RECORD
--------------------------------------------------------

INSERT INTO DBO.TEMPTHINGS 
SELECT ISNULL(ALTID,''),THINGS,'','','','THING2',CHARINDEX('|',SUBSTRING(ISNULL(THINGS,''),3,LEN(THINGS)-3)),0,0
FROM [dbo].[VNewCitations] VNC2
WHERE SUBSTRING(LTRIM(RTRIM(REPLACE(ISNULL(THINGS,''),'|',''))),3,LEN(THINGS)-3) <> ''
	AND ISNULL(ALTID,'') <> ''

UPDATE DBO.TEMPTHINGS 
SET SECONDSEPARATOR = CHARINDEX('|',SUBSTRING(ISNULL(DBO.TEMPTHINGS.originalThing,''),DBO.TEMPTHINGS.FIRSTSEPARATOR+3,LEN(DBO.TEMPTHINGS.originalThing)))
FROM DBO.TEMPTHINGS TV2
 INNER JOIN DBO.TEMPTHINGS ON TV2.ID = DBO.TEMPTHINGS.ID

UPDATE DBO.TEMPTHINGS 
SET Thing1 =  LTRIM(RTRIM(REPLACE(SUBSTRING(DBO.TEMPTHINGS.ORIGINALThing,3,COALESCE(NULLIF(DBO.TEMPTHINGS.FIRSTSEPARATOR,0),LEN(DBO.TEMPTHINGS.ORIGINALThing))),'|',''))) 
FROM DBO.TEMPTHINGS TV2
 INNER JOIN DBO.TEMPTHINGS ON TV2.ID = DBO.TEMPTHINGS.ID

UPDATE DBO.TEMPTHINGS 
SET Thing2 =  CASE WHEN DBO.TEMPTHINGS.FIRSTSEPARATOR <> 0 AND DBO.TEMPTHINGS.SECONDSEPARATOR <> 0 THEN LTRIM(RTRIM(REPLACE(SUBSTRING(DBO.TEMPTHINGS.ORIGINALThing,DBO.TEMPTHINGS.FIRSTSEPARATOR+1,DBO.TEMPTHINGS.SECONDSEPARATOR),'|',''))) 
					   WHEN DBO.TEMPTHINGS.FIRSTSEPARATOR <> 0 AND DBO.TEMPTHINGS.SECONDSEPARATOR = 0 THEN LTRIM(RTRIM(REPLACE(SUBSTRING(DBO.TEMPTHINGS.ORIGINALThing,DBO.TEMPTHINGS.FIRSTSEPARATOR+1,LEN(DBO.TEMPTHINGS.originalThing)),'|',''))) END
FROM DBO.TEMPTHINGS TV2
INNER JOIN DBO.TEMPTHINGS ON TV2.ID = DBO.TEMPTHINGS.ID

UPDATE DBO.TEMPTHINGS 
SET Thing3 =  CASE WHEN DBO.TEMPTHINGS.SECONDSEPARATOR <> 0 THEN LTRIM(RTRIM(REPLACE(SUBSTRING(DBO.TEMPTHINGS.ORIGINALThing,DBO.TEMPTHINGS.FIRSTSEPARATOR + DBO.TEMPTHINGS.SECONDSEPARATOR,LEN(DBO.TEMPTHINGS.originalThing)),'|',''))) END
FROM DBO.TEMPTHINGS TV2
INNER JOIN DBO.TEMPTHINGS ON TV2.ID = DBO.TEMPTHINGS.ID

UPDATE [Tbl  test] 
SET [O1] = Thing1
FROM TEMPTHINGS 
WHERE TEMPTHINGS.ALTID = [Tbl  test].[Thing Number]

UPDATE [Tbl  test] 
SET [O2] = Thing2
FROM TEMPTHINGS 
WHERE TEMPTHINGS.ALTID = [Tbl  test].[Thing Number]

UPDATE [Tbl  test] 
SET [O3] = Thing3
FROM TEMPTHINGS 
WHERE TEMPTHINGS.ALTID = [Tbl  test].[Thing Number]

----------------------------------------------------------

UPDATE [Tbl  test]
SET [void]= 'void'
FROM [dbo].[VNewVoids] WHERE  LTRIM(RTRIM([VNewVS].ALTID)) = LTRIM(RTRIM([Tbl  test].[Number]))

----------------------------------------------------------

UPDATE [Tbl  test]
SET [Warning]= 'warning'
FROM [dbo].[VNewWarnings] WHERE  LTRIM(RTRIM([VNewWs].ALTID)) = LTRIM(RTRIM([Tbl test].[Number]))

----------------------------------------------------------

UPDATE [Tbl  test]
SET [Write Off] = 'Write Off'
FROM [dbo].[VNewWriteOffs] WHERE  LTRIM(RTRIM([VNewWFs].ALTID)) = LTRIM(RTRIM([Tbl test].[Number]))

Loading