A system needed to be transferred in a hurry. The only way to extract data from that system was via csv files. The data was dirty meaning fields were not consistent across different export files (extra space and unprintable characters in id fields, additional information in text fields, date incongruities, etc.) .
The software used here is:
- Excel
- Visual Studio 2017 SSDT
The first step was to save the csv files as excel macro enabled worksheets so I could do some basic formatting within the excel files. Then I enabled the developer tab. I used the default macro to write some code to do the basic formatting. I chose to do this formatting within the excel file rather than in SSIS because it just seemed simpler. I explored ways to do this via SSIS but the benefit cost analysis deemed just getting it done more viable. As any programmer knows the smallest tasks can turn into some deep rabbit holes.
Here is an example of one macro for one excel file.
- It sets the source range so it doesn’t search thousands of empty rows/columns.
- Deletes empty rows and header rows
- it merges rows with duplicate ids (there might be three rows for one id with separate text in say the C cell. This would combine them into the first row separated by a pipe character and delete the rows with duplicate ids).
Example Excel Macro
Sub testing() Application.EnableEvents = False On Error Resume Next 'Activate PowerPoint Library (version 2.9) ThisWorkbook.VBProject.References.AddFromGuid _ GUID:="{3F4DACA7-160D-11D2-A8E9-00104B365C9F}", Major:=2, Minor:=9 'Reset Error Handler On Error GoTo 0 Application.Calculation = xlCalculationManual Application.ScreenUpdating = False Application.EnableEvents = False Dim SourceRange As Range Dim EntireRow As Range Dim iCntr As Long Dim previousrow As Long Dim finalstring As String For Each Sh In ThisWorkbook.Worksheets Set SourceRange = Sh.Range("A1:M" + CStr(Sh.UsedRange.Rows.Count)) For iCntr = SourceRange.Rows.Count To 1 Step -1 Set EntireRow = SourceRange.Cells(iCntr, 1).EntireRow If WorksheetFunction.CountA(EntireRow) = 0 Then EntireRow.Delete If iCntr Mod 5000 = 0 Then Debug.Print Sh.Name + " | " + CStr(iCntr) End If Next For iCntr = SourceRange.Rows.Count To 1 Step -1 If Mid(SourceRange.Cells(iCntr, 1).Value, 1, 9) = "something:" Then Set EntireRow = SourceRange.Cells(iCntr, 1).EntireRow EntireRow.Delete End If If Mid(SourceRange.Cells(iCntr, 1).Value, 1, 8) = "somethingelse" Then Set EntireRow = SourceRange.Cells(iCntr, 1).EntireRow EntireRow.Delete End If Next For iCntr = SourceRange.Rows.Count To 1 Step -1 previousrow = iCntr - 1 If iCntr > 1 Then If SourceRange.Cells(iCntr, 1).Value = "name1" Or SourceRange.Cells(iCntr, 1).Value = "cnelson" Or SourceRange.Cells(iCntr, 1).Value = "name2" Or SourceRange.Cells(iCntr, 1).Value = "name3" Or SourceRange.Cells(iCntr, 1).Value = "name4" Or SourceRange.Cells(iCntr, 1).Value = "name5" Or SourceRange.Cells(iCntr, 1).Value = "name6" Or SourceRange.Cells(iCntr, 1).Value = "name7" Or SourceRange.Cells(iCntr, 1).Value = "name8" Then If IsEmpty(SourceRange.Range("B" + CStr(iCntr)).Value) Then finalstring = "" Else finalstring = CStr(SourceRange.Range("B" + CStr(iCntr)).Value) End If If IsEmpty(SourceRange.Range("Q" + CStr(iCntr)).Value) Then previousstring = "" Else previousstring = CStr(SourceRange.Range("Q" + CStr(iCntr)).Value) End If SourceRange.Range("Q" + CStr(previousrow)).Value = previousstring + " | " + finalstring finalstring = "" previousstring = "" If IsEmpty(SourceRange.Range("C" + CStr(iCntr)).Value) Then finalstring = "" Else finalstring = CStr(SourceRange.Range("C" + CStr(iCntr)).Value) End If If IsEmpty(SourceRange.Range("R" + CStr(iCntr)).Value) Then previousstring = "" Else previousstring = CStr(SourceRange.Range("R" + CStr(iCntr)).Value) End If SourceRange.Range("R" + CStr(previousrow)).Value = previousstring + " | " + finalstring finalstring = "" previousstring = "" Set EntireRow = SourceRange.Cells(iCntr, 1).EntireRow EntireRow.Delete iCntr = iCntr - 1 End If End If Next Next Sh Application.Calculation = xlCalculationAutomatic Application.ScreenUpdating = True Application.EnableEvents = True End Sub
SSIS Control Flow
The control flow dictates eye from the sky steps in the migration process. Since I have several different excel workbooks importing into several different tables, you can see there are lots of sql tasks to truncate those tables before new insertions. Also there are separate tasks for a bit more housecleaning of the data and a few steps that can be run stand alone.
I included all the excel files and the error files in the miscellaneous directory to centralize operations. You can see under the connection managers the connections for each file.
SSIS Data Flow
This is the busiest data flow as it is the all in one import. Each task is made up of
- a source call with an accompanying error file
- a data conversion section since excel data types do not translate directly to sql data types
- a database destination with accompanying error file
There are other data flows some of which are a single transaction so I can just do a single import and some of which are more cleaning routines.
TSQL Final Cleanup and Import
After all the data has been moved into the separate tables, I created a final tsql project to do some final formats to the data, standardize it, and put it into a temporary table that mimics the production table.
This import was a bit of a unique scenario as I was forced to move the data into a legacy application that used an access database for updates. I am in the process of moving that application to the web but have only gotten the front-end up for search, not the back-end side for updates. But that can be in a separate post!
So here is a sample of the final SQL. This could have been put into the SSIS package as well but since it was a run once and done type scenario I just ran it from SQL Server Management Studio.
-------------------------------------------------------- ;WITH CTE7 (ALTID,LOCNUMBER,LOCSTREET) AS ( SELECT ALTID, CASE WHEN ISNUMERIC(SUBSTRING(ISNULL([Address],''),0,CHARINDEX(' ',ISNULL([Address],'')))) = 1 THEN SUBSTRING(ISNULL([Address],''),0,CHARINDEX(' ',ISNULL([Address],''))) ELSE '' END, CASE WHEN ISNUMERIC(SUBSTRING(ISNULL([Address],''),0,CHARINDEX(' ',ISNULL([Address],'')))) = 1 THEN SUBSTRING(ISNULL([Address],''),CHARINDEX(' ',ISNULL([Address],''))+1,LEN(ISNULL([Address],''))) ELSE ISNULL([Address],'') END FROM [dbo].[VNewThings] ) UPDATE [Tbl test] SET [Location Number] = LOCNUMBER, [Location Street] = LOCSTREET FROM CTE7 WHERE CTE7.ALTID = [Tbl test].[Thing Number] -------------------------------------------------------- --GET ONLY DISTINCT THINGS PER RECORD -------------------------------------------------------- INSERT INTO DBO.TEMPTHINGS SELECT ISNULL(ALTID,''),THINGS,'','','','THING2',CHARINDEX('|',SUBSTRING(ISNULL(THINGS,''),3,LEN(THINGS)-3)),0,0 FROM [dbo].[VNewCitations] VNC2 WHERE SUBSTRING(LTRIM(RTRIM(REPLACE(ISNULL(THINGS,''),'|',''))),3,LEN(THINGS)-3) <> '' AND ISNULL(ALTID,'') <> '' UPDATE DBO.TEMPTHINGS SET SECONDSEPARATOR = CHARINDEX('|',SUBSTRING(ISNULL(DBO.TEMPTHINGS.originalThing,''),DBO.TEMPTHINGS.FIRSTSEPARATOR+3,LEN(DBO.TEMPTHINGS.originalThing))) FROM DBO.TEMPTHINGS TV2 INNER JOIN DBO.TEMPTHINGS ON TV2.ID = DBO.TEMPTHINGS.ID UPDATE DBO.TEMPTHINGS SET Thing1 = LTRIM(RTRIM(REPLACE(SUBSTRING(DBO.TEMPTHINGS.ORIGINALThing,3,COALESCE(NULLIF(DBO.TEMPTHINGS.FIRSTSEPARATOR,0),LEN(DBO.TEMPTHINGS.ORIGINALThing))),'|',''))) FROM DBO.TEMPTHINGS TV2 INNER JOIN DBO.TEMPTHINGS ON TV2.ID = DBO.TEMPTHINGS.ID UPDATE DBO.TEMPTHINGS SET Thing2 = CASE WHEN DBO.TEMPTHINGS.FIRSTSEPARATOR <> 0 AND DBO.TEMPTHINGS.SECONDSEPARATOR <> 0 THEN LTRIM(RTRIM(REPLACE(SUBSTRING(DBO.TEMPTHINGS.ORIGINALThing,DBO.TEMPTHINGS.FIRSTSEPARATOR+1,DBO.TEMPTHINGS.SECONDSEPARATOR),'|',''))) WHEN DBO.TEMPTHINGS.FIRSTSEPARATOR <> 0 AND DBO.TEMPTHINGS.SECONDSEPARATOR = 0 THEN LTRIM(RTRIM(REPLACE(SUBSTRING(DBO.TEMPTHINGS.ORIGINALThing,DBO.TEMPTHINGS.FIRSTSEPARATOR+1,LEN(DBO.TEMPTHINGS.originalThing)),'|',''))) END FROM DBO.TEMPTHINGS TV2 INNER JOIN DBO.TEMPTHINGS ON TV2.ID = DBO.TEMPTHINGS.ID UPDATE DBO.TEMPTHINGS SET Thing3 = CASE WHEN DBO.TEMPTHINGS.SECONDSEPARATOR <> 0 THEN LTRIM(RTRIM(REPLACE(SUBSTRING(DBO.TEMPTHINGS.ORIGINALThing,DBO.TEMPTHINGS.FIRSTSEPARATOR + DBO.TEMPTHINGS.SECONDSEPARATOR,LEN(DBO.TEMPTHINGS.originalThing)),'|',''))) END FROM DBO.TEMPTHINGS TV2 INNER JOIN DBO.TEMPTHINGS ON TV2.ID = DBO.TEMPTHINGS.ID UPDATE [Tbl test] SET [O1] = Thing1 FROM TEMPTHINGS WHERE TEMPTHINGS.ALTID = [Tbl test].[Thing Number] UPDATE [Tbl test] SET [O2] = Thing2 FROM TEMPTHINGS WHERE TEMPTHINGS.ALTID = [Tbl test].[Thing Number] UPDATE [Tbl test] SET [O3] = Thing3 FROM TEMPTHINGS WHERE TEMPTHINGS.ALTID = [Tbl test].[Thing Number] ---------------------------------------------------------- UPDATE [Tbl test] SET [void]= 'void' FROM [dbo].[VNewVoids] WHERE LTRIM(RTRIM([VNewVS].ALTID)) = LTRIM(RTRIM([Tbl test].[Number])) ---------------------------------------------------------- UPDATE [Tbl test] SET [Warning]= 'warning' FROM [dbo].[VNewWarnings] WHERE LTRIM(RTRIM([VNewWs].ALTID)) = LTRIM(RTRIM([Tbl test].[Number])) ---------------------------------------------------------- UPDATE [Tbl test] SET [Write Off] = 'Write Off' FROM [dbo].[VNewWriteOffs] WHERE LTRIM(RTRIM([VNewWFs].ALTID)) = LTRIM(RTRIM([Tbl test].[Number]))
Leave a Reply