To help, this blog is separated into two sections: the first describes the three different methods and their pros and cons, and the second contains the details of each design pattern with step-by-step guidance on how to implement it.
Method 1 – Traditional Type 2 SCD
This method is the traditional Type 2 SCD approach that creates a new entry in the dimension table whenever a change has been identified. Valid from and to dates will be updated on the old and new record to reflect the record history.
Pros
- Most storage-efficient method
As this method only captures data when there is a change, it requires the least amount of storage.
- Greatest reporting flexibility
With the use of valid from\to dates and deleted dates, this method provides the most reporting flexibility; for example, it supports point-in-time queries like the sketch below.
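As a hedged illustration, a point-in-time query against the Solution 1 dimension might look like this; the underscored column names and the date literal are illustrative assumptions, not the exact identifiers from the walkthrough.

```sql
-- A minimal point-in-time sketch, assuming the SCD housekeeping columns from
-- Solution 1 ("Valid_From", "Valid_To", "Is_Deleted"); the date is illustrative.
SELECT *
FROM "TB_Dim_SCD"
WHERE '2021-06-30' BETWEEN "Valid_From" AND "Valid_To"
  AND "Is_Deleted" = 0;  -- drop records flagged deleted; refine with "Deleted_Date" if needed
```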
Cons
- Most development effort required
This method requires the most development effort and complexity, as the source data must be compared with the existing data in the dimension table to classify each record as either new, existing or deleted. Once this has been determined, existing records must be evaluated to establish whether a change has occurred. The entire dataset must then be recreated with the appropriate valid from\to dates and flags.
- Two Dataflows required (time-staggered)
As historical entries require updating, and dataflows currently only support append or truncate load modes, the entire dataset needs to be recreated using truncate. Because the main dimension table is truncated, the data must first be loaded into a temporary table, which means two dataflows are required. It is currently not possible to chain Dataflow tasks to run in sequence, so they need to be scheduled with a sufficient time delay between the jobs.
Method 2 – Daily Snapshot (Append)
This method captures a snapshot of the dimension on a daily basis by adding a Snapshot Date to the Dimension. The data is appended to the Dimension table, so if the load runs multiple times in a day, only the first run is captured.
Pros
- Minimal development effort required
This method only requires one extra column to be populated with the snapshot/load date.
- Only 1 Dataflow required
As only one table is required for this method, only one dataflow is needed.
Cons
- Data will be duplicated / Dataset size needs to be considered
Creating snapshots means duplicating data within the Dimension table, so the size of the dataset needs to be considered; for example, a one-million-row dimension snapshotted daily grows to roughly 365 million rows in a year.
- Snapshot date usage
You need to consider how to use the Snapshot Date within your data model to ensure you don’t get duplicated records when joining to the dimension; see the sketch below.
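As a hedged sketch of one way to handle this, join on both the business key and the snapshot date so the join cannot fan out; the fact table and all column names below are hypothetical.

```sql
-- Hypothetical fact table and columns, for illustration only: matching the
-- fact's date to the dimension's snapshot date returns exactly one version.
SELECT f."Order_ID",
       f."Order_Date",
       d."Customer_Name"
FROM "TB_Fact_Orders" AS f
JOIN "TB_Dim_SCD" AS d
  ON  d."Customer_ID"   = f."Customer_ID"
  AND d."Snapshot_Date" = f."Order_Date";
```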
Method 3 – Daily Snapshot (Truncate)
This method is the same as Method 2 in that it captures a daily snapshot of the dimension; the main difference is that the entire dimension is recreated, so if the load runs multiple times in a day, the most recent values are retained.
Cons
- Two Dataflows required (time-staggered)
As the dataset is truncated but also used as a source, this method requires the use of two dataflows that will need to be scheduled with a time delay between them.
Conclusion
Hopefully these three methods will help you if you have a similar modelling requirement in Data Warehouse Cloud. Each has its own advantages and disadvantages, so you will need to decide which is most appropriate for you.
Method 1 (Traditional Type 2 SCD) was the method chosen for the DWC project I was working on, due to the large volume of data and the low volume of data changes.
There are a number of upcoming items on the Data Warehouse Cloud Roadmap which may well help achieve the same objective. It is not yet clear when they will arrive, but please monitor them as they might be beneficial.
Expected – Q4 2021
Roadmap item – Dependent serial scheduling tasks
Anticipated Impact – Expected to provide the ability to chain tasks\dataflows together
URL – https://roadmaps.sap.com/board?PRODUCT=73555000100800002141&range=FIRST-LAST#;INNO=000D3A4855C81EDB99C49210E6AFB1BD
Expected – Q4 2021
Roadmap item – Time-dependency support added to dimensions
Anticipated Impact – Expected to provide support for a Type 2 SCD with Valid From\To dates
URL – https://roadmaps.sap.com/board?PRODUCT=73555000100800002141&range=FIRST-LAST#;INNO=000D3ABE772D1EDBB6DEA7546CD6AFB3
If you got this far, then thanks for reading my first blog; I hope you have found it useful. Please post any questions, comments and feedback that you may have below, and don’t forget to follow me!
If you are interested in implementing one of the methods, then please take a look at the Solution Walkthroughs below for a more detailed step-by-step guide; I will try to answer any questions or comments.
Solution Walkthroughs
This section details the individual steps needed to implement each of the solutions.
1 Solution 1 – Traditional Type 2 SCD
The first solution is a traditional Type 2 Slowly Changing Dimension, where any change in a record creates a new entry and the valid from\to dates are updated accordingly.
Below is a high-level overview of all the objects used in the solution with a short description of the object usage.
In the Data Builder there are five different object types available, and currently no way to distinguish whether a View is Graphical or SQL once created. Applying a naming convention such as the one below aids clarity.
1.1 Naming Convention
- Tables – Prefixed with “TB_”
- Views (Graphical) – Prefixed with “VG_”
- Views (SQL) – Prefixed with “VS_”
- Dataflows – Prefixed with “DF_”
- Entity Relationship Models – Prefixed with “ER_”
| Object Name | Object Type | Description |
| --- | --- | --- |
| “TB_Source_CSV” | Table | Source file |
| “TB_Dim_SCD_Temp” | Table | Temporarily holds Dimension data |
| “TB_Dim_SCD” | Table | Stores Dimension data |
| “VG_Dim_SCD_1” | View | Compares source and Dimension data to identify whether each record is new, updated, deleted or unchanged |
| “VG_Dim_SCD_2” | View | Recreates the Dimension with valid from\to dates and Is Current\Is Deleted flags |
| “DF_Dim_Temp_Load” | Dataflow | Loads data from “VG_Dim_SCD_2” into “TB_Dim_SCD_Temp” |
| “DF_Dim_Load” | Dataflow | Loads data from “TB_Dim_SCD_Temp” into “TB_Dim_SCD” |
1.2 Step 1 – Create Source
Create a Source Table\View to use as the source for the Dimension (“TB_Source_CSV”).
1.3 Step 2 – Create Dimension tables
Create two tables (“TB_Dim_SCD_Temp” and “TB_Dim_SCD”). These tables will be identical to each other and will match the source plus the five additional fields below (an illustrative DDL sketch follows the list):
- Valid From
- Valid To
- Is Current
- Is Deleted
- Deleted Date
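In Data Warehouse Cloud the tables are defined in the Data Builder UI, but as a minimal sketch the structure is equivalent to the DDL below; “Customer_ID” and “Customer_Name” are hypothetical stand-ins for the actual source columns.

```sql
-- Illustrative structure only: the first two columns stand in for the source
-- fields, the remaining five are the SCD housekeeping fields.
CREATE TABLE "TB_Dim_SCD" (
    "Customer_ID"   INTEGER,
    "Customer_Name" NVARCHAR(100),
    "Valid_From"    DATE,
    "Valid_To"      DATE,
    "Is_Current"    INTEGER,  -- 1 = current version of the record
    "Is_Deleted"    INTEGER,  -- 1 = record no longer present in the source
    "Deleted_Date"  DATE
);
```

“TB_Dim_SCD_Temp” is created with exactly the same structure.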
1.4 Step 3 – Create VG_Dim_SCD_1 – Identify Change Type
Create a new Graphical View to identify data changes.
Add the Source Table\View twice (“TB_Source_CSV”) and alias them as:
- Source Data
- Source Deleted
Now add the table “TB_Dim_SCD” three times and alias them as:
- Dim Current
- Dim Not Current
- Dim Deleted
Add a Filter to Dim Current that checks:
- Is_Current = 1 and Is_Deleted = 0
Add a Filter to Dim Deleted that checks:
- Is_Current = 1 and Is_Deleted = 0
Add a Filter to Dim Not Current that checks:
- Is_Current = 0 or Is_Deleted = 1
Left Join Source Data and Dim Current ensuring the Source Data is the left\driving table and the tables are joined on the primary key.
A Projection should be automatically added after the join, if it is not, add a projection.
Enable all fields in the Projection.
Rename all duplicated fields from Dim Current.
Add a calculated column.
Add a new field called SCD (String (10)).
Create a Case Statement (a SQL sketch follows this list):
- If the Dim key field is null then “New”
- If any non-key field value has changed then “Updated”
- Else “No Change”
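Expressed as a calculated-column formula, the logic amounts to something like the sketch below; “Customer_ID” and “Customer_Name” are hypothetical stand-ins for the key and non-key fields, and the “Dim_” columns are the renamed duplicates from the Dim Current side of the join.

```sql
-- A minimal sketch of the SCD classification formula (hypothetical columns).
CASE
    WHEN "Dim_Customer_ID" IS NULL              THEN 'New'        -- no match found in the dimension
    WHEN "Customer_Name" <> "Dim_Customer_Name" THEN 'Updated'    -- a non-key value has changed
    ELSE 'No Change'
END
```

If non-key fields can be NULL, the inequality check needs NULL-safe handling (for example wrapping both sides in IFNULL).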
Go back to Dim Not Current.
Add a Calculated column after the filter.
Add a new field called SCD (String (10)).
Hardcode the value “No Change”.
Left Join Source Deleted and Dim Deleted, ensuring Dim Deleted is the left\driving table and the tables are joined on the primary key.
A Projection should be automatically added after the join, if it is not, add a projection.
Disable all fields from Source Deleted except the Key field.
Rename the Key field from Source Deleted.
Add a filter that returns where the Source Deleted Key field is null.
Add a calculated column.
Add a new field called SCD (String (10)).
Hardcode the value “Deleted”.
The three datasets are now ready to be combined with a Union operator.
Start with Current and Not Current, then add Deleted.
Ensure all fields have been correctly mapped in the union.
The first view is now complete, and it identifies whether each Dimension record is new, updated, deleted or unchanged.
1.5 Step 4 – Create VG_Dim_SCD_2 – Recreate Dimension
Create a new Graphical View.
Add “VG_Dim_SCD_1” five times to the view and alias as follows:
- New
- Updated Historic
- Updated New
- Deleted
- No Change
Add a filter to each flow using SCD based on the alias name:
- New = New
- Updated Historic = Updated
- Updated New = Updated
- Deleted = Deleted
- No Change = No Change
Add a Calculated Column transform to each flow and derive the values to be used for:
- Valid From
- Valid To
- Is Current
- Is Deleted
- Deleted Date
Use the values below as the derived values for each alias (a SQL sketch of the two Updated branches follows the lists):
New
- Valid From = ‘1900-01-01’ – as history only exists from the point the loads start, use 1900 instead of the run date to cater for historic reporting
- Valid To = ‘9999-12-31’
- Is Current = 1
- Is Deleted = 0
- Deleted Date = NULL
No Change
- Valid From = Dim Valid From
- Valid To = Dim Valid To
- Is Current = Dim Is Current
- Is Deleted = Dim Is Deleted
- Deleted Date = Dim Deleted Date
Deleted
- Valid From = Dim Valid From
- Valid To = Dim Valid To
- Is Current = Dim Is Current
- Is Deleted = 1
- Deleted Date = Current\Load Date
Updated New
- Valid From = Current\Load Date
- Valid To = ‘9999-12-31’
- Is Current = 1
- Is Deleted = 0
- Deleted Date = NULL
Updated Historic
- Valid From = Dim Valid From
- Valid To = Current\Load Date
- Is Current = 0
- Is Deleted = 0
- Deleted Date = NULL
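To make the branch logic concrete, here is a minimal SQL sketch of the two Updated branches, assuming the hypothetical column names used earlier; in the graphical view each branch is its own filter plus calculated column, combined by the union in the following steps.

```sql
-- Illustrative only: one changed record yields two rows.
-- Updated New: the incoming values become the current record.
SELECT "Customer_ID", "Customer_Name",
       CURRENT_DATE AS "Valid_From",
       '9999-12-31' AS "Valid_To",
       1            AS "Is_Current",
       0            AS "Is_Deleted",
       NULL         AS "Deleted_Date"
FROM "VG_Dim_SCD_1"
WHERE "SCD" = 'Updated'
UNION ALL
-- Updated Historic: the stored version is closed off at the load date.
SELECT "Dim_Customer_ID", "Dim_Customer_Name",
       "Dim_Valid_From" AS "Valid_From",
       CURRENT_DATE     AS "Valid_To",
       0                AS "Is_Current",
       0                AS "Is_Deleted",
       NULL             AS "Deleted_Date"
FROM "VG_Dim_SCD_1"
WHERE "SCD" = 'Updated';
```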
Add a Projection to each flow.
At this point it is important to ensure the correct fields are being used for all attributes.
Exclude all columns.
Enable Valid From, Valid To, Is Current, Is Deleted and Deleted Date.
New – Enable all Fields from Source Data
No Change – Enable all Fields from Source Data\Dimension Data
Deleted – Enable all Fields from Dimension Data
Updated New – Enable all Fields from Source Data
Updated Historic – Enable all Fields from Dimension Data
Union all five datasets, ensuring all fields are correctly mapped.
1.6 Step 5 – Create DF_Dim_Temp_Load – Load into Temp Table
Two dataflows and a temporary table are needed because the target table is also used as a source and is truncated when loaded.
When a dataflow is executed, the first step is to truncate the target table; loading the dimension directly from itself would therefore result in all records falling into the SCD category “New” and all history being lost.
To avoid this issue, first load the data into a temporary table.
Use “VG_Dim_SCD_2” as the source and “TB_Dim_SCD_Temp” as the target, and set the table-loading mode to TRUNCATE.
1.7 Step 6 – Create DF_Dim_Load – Load from Temp Table to Dimension
Use “TB_Dim_SCD_Temp” as the source and “TB_Dim_SCD” as the target, and set the table-loading mode to TRUNCATE.
1.8 Step 7 – Schedule Dataflows
The Data Integration Monitor doesn’t currently have the ability to link/chain Dataflow executions; you can only schedule based on time\frequency.
Ensure that you give “DF_Dim_Temp_Load” enough time to complete its execution before the “DF_Dim_Load” schedule starts.
2 Solution 2 – Daily Snapshot (Append)
Instead of comparing the data, this solution takes a daily snapshot of the entire dataset and appends this to the Dimension table with a Snapshot Date.
2.1 Naming Convention
| Object Name | Object Type | Description |
| --- | --- | --- |
| “TB_Source_CSV” | Table | Source file |
| “TB_Dim_SCD” | Table | Stores Dimension data |
| “VG_Dim_SCD_1” | View | Compares the current Snapshot Date with the max Dimension Snapshot Date and filters the data |
| “DF_Dim_Load” | Dataflow | Appends data from “VG_Dim_SCD_1” to “TB_Dim_SCD” |
2.2 Step 1 – Create Source
Create a Source Table\View to use as the source for the Dimension (“TB_Source_CSV”).
2.3 Step 2 – Create Dimension table
Create the table “TB_Dim_SCD”; it will be identical to the source plus one additional field:
- Snapshot Date
2.4 Step 3 – Create VG_Dim_SCD_1 – Check Dimension Max Snapshot Date
Create a new Graphical View
Add “TB_Source_CSV” to the design pane and alias it as Source.
Add “TB_Dim_SCD” to the design pane and alias it as Dim.
The aggregation operator only aggregates Integer data types, so create an integer representation of the date.
Add a calculated column transform to both flows and add the following fields:
- Source
  - Snapshot Date – Date – CURRENT_DATE()
  - Snapshot Date INT – Integer – DAYS_BETWEEN(‘1900-01-01’, CURRENT_DATE())
- Dim
  - Snapshot Date INT – Integer – DAYS_BETWEEN(‘1900-01-01’, “Snapshot Date”)
Add a Projection to the Dim flow
Disable all fields except Snapshot Date INT.
Add an aggregation transform
Set Snapshot Date INT to MAX aggregation type
Join the two flows together
Set the Join Type to Cross Join and remove any mappings
In the projection, rename Snapshot Date INT (Dim) to Max Snapshot Date INT and ensure all fields are enabled.
Add a calculated column transform
Update the formula for Max Snapshot Date INT to replace NULL with 0 (e.g. using IFNULL) to cater for loading into an empty Dimension table.
Add a filter where Snapshot Date INT > Max Snapshot Date INT; the full logic is sketched below.
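Taken together, the view boils down to logic like the following HANA-style SQL sketch; the source columns are hypothetical.

```sql
-- Illustrative only: append today's snapshot only if it is newer than the
-- latest snapshot already held in the dimension.
SELECT s."Customer_ID",
       s."Customer_Name",
       CURRENT_DATE AS "Snapshot_Date"
FROM "TB_Source_CSV" AS s
CROSS JOIN (
    -- IFNULL(..., 0) caters for the very first load into an empty table
    SELECT IFNULL(MAX(DAYS_BETWEEN('1900-01-01', "Snapshot_Date")), 0)
           AS "Max_Snapshot_Date_INT"
    FROM "TB_Dim_SCD"
) AS d
WHERE DAYS_BETWEEN('1900-01-01', CURRENT_DATE) > d."Max_Snapshot_Date_INT";
```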
2.5 Step 4 – Create\Schedule Dataflow
Create a dataflow to load the data from your view to the Dimension, ensuring the table load type is set to APPEND, and schedule the Dataflow.
3 Solution 3 – Daily Snapshot (Truncate)
This solution is a replica of Solution 2; the difference is that rather than checking the Max Snapshot Date in the Dimension, all the data is extracted from the Dimension except where the Snapshot Date equals the Load Date.
This allows the load to run multiple times in a period and the most recent run in the period will be the version held in the Dimension.
As the target table is truncated, two Dataflows and a temporary table will be required in this version of the solution.
3.1 Naming Convention
| Object Name | Object Type | Description |
| --- | --- | --- |
| “TB_Source_CSV” | Table | Source file |
| “TB_Dim_SCD_Temp” | Table | Temporarily holds Dimension data |
| “TB_Dim_SCD” | Table | Stores Dimension data |
| “VG_Dim_SCD_1” | View | Retrieves all records from “TB_Dim_SCD” where Snapshot Date < run date and combines them with “TB_Source_CSV” |
| “DF_Dim_Temp_Load” | Dataflow | Truncates and loads data from “VG_Dim_SCD_1” into “TB_Dim_SCD_Temp” |
| “DF_Dim_Load” | Dataflow | Truncates and loads data from “TB_Dim_SCD_Temp” into “TB_Dim_SCD” |
3.2 Step 1 – Create Source
Create a Source Table\View to use as the source for the Dimension (“TB_Source_CSV”).
3.3 Step 2 – Create Dimension tables
Create the tables “TB_Dim_SCD” and “TB_Dim_SCD_Temp”; they will be identical to the source plus one additional field:
- Snapshot Date
3.4 Step 3 – Create VG_Dim_SCD_1 – Combine Historic and Current Dimension
Create a new Graphical View
Add “TB_Source_CSV” to the design pane and alias it as Source.
Add “TB_Dim_SCD” to the design pane and alias it as Dim.
Add a calculated column transform to the Source flow and add the following field:
- Snapshot Date – Date – CURRENT_DATE()
Add a Filter to the Dim flow where Snapshot Date < CURRENT_DATE().
Union the two flows together.
Ensure all fields are correctly mapped in the union; the resulting logic is sketched below.
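The resulting view is equivalent to a sketch like this, with hypothetical source columns:

```sql
-- Illustrative only: today's source snapshot replaces anything already loaded
-- today, while all older snapshots are carried forward unchanged.
SELECT "Customer_ID", "Customer_Name",
       CURRENT_DATE AS "Snapshot_Date"
FROM "TB_Source_CSV"
UNION ALL
SELECT "Customer_ID", "Customer_Name", "Snapshot_Date"
FROM "TB_Dim_SCD"
WHERE "Snapshot_Date" < CURRENT_DATE;
```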
3.5 Step 4 – Create DF_Dim_Temp_Load
Create a dataflow to load the data from your view to the temporary table, ensuring the table load type is set to TRUNCATE.
3.6 Step 5 – Create DF_Dim_Load
Create a dataflow to load the data from your temporary table to the Dimension table, ensuring the table load type is set to TRUNCATE.
3.7 Step 6 – Schedule Dataflows
Ensure that you give “DF_Dim_Temp_Load” enough time to complete its execution before the “DF_Dim_Load” schedule starts.