SAP Data Warehouse Cloud – How to Create a Slowly Changing Dimension

To help, this blog is separated into two sections: the first describes the three different methods and their pros and cons, and the second contains the details of each design pattern with step-by-step guidance on how to implement it.

Method 1 – Traditional Type 2 SCD

This method is the traditional Type 2 SCD approach: a new entry is created in the dimension table whenever a change is identified, and the valid from/to dates are updated on the old and new records to reflect the record history.
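As a simple illustration with hypothetical data, a customer record whose region changes on 2021-06-15 would end up as two rows in the dimension:

  Customer  Region  Valid From   Valid To     Is Current
  C100      North   1900-01-01   2021-06-15   0
  C100      South   2021-06-15   9999-12-31   1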

Pros

  • Most storage-efficient method

As this method only captures data when there is a change, it requires the least amount of storage.

  • Greater reporting flexibility

With the use of valid from/to dates and deleted dates, this method provides the most reporting flexibility.

Cons

  • Most development effort required

This method requires the most development effort and complexity, as the source data must be compared with the existing data in the dimension table to identify each record’s status as either new, existing or deleted. Once this has been determined, existing records must be evaluated to establish whether a change has occurred. The entire dataset must then be recreated with the appropriate valid from/to dates and flags.

  • Two dataflows required, time-staggered

As historical entries require updating, and a dataflow currently only supports append/truncate loading, the entire dataset must be recreated using truncate. Because the table is truncated, the data must first be loaded into a temporary table before the main dimension table is reloaded, so two dataflows are required. It is currently not possible to link dataflow tasks to run in sequence, so they need to be scheduled with a sufficient time delay between each job.

Method 2 – Daily Snapshot (Append)

This method captures a snapshot of the dimension on a daily basis by adding a snapshot date to the dimension. The data is appended to the dimension table, and if the load is run multiple times in a day, only the first run of that day is captured.

Pros

  • Minimal development effort required

This method only requires one extra column to be populated with the snapshot/load date.

  • Only 1 Dataflow required

As only one table is required for this method, only one dataflow is needed.

Cons

  • Data will be duplicated / Dataset size needs to be considered

Creating snapshots means duplicating data within the dimension table, so the size of the dataset needs to be considered: the table could become extremely large when appended to every day.

  • Snapshot date usage

You need to consider how to use the snapshot date within your data model to ensure you don’t get duplicated records when joining to the dimension, as the example below illustrates.
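As a minimal SQL sketch, assuming a hypothetical fact table “Fact_Sales” with a “Customer_ID” key and a hypothetical dimension attribute “Region”, one option is to restrict the dimension to its latest snapshot before joining:

  -- Restrict the dimension to its latest snapshot before joining;
  -- otherwise every daily snapshot row would match and duplicate the facts
  SELECT f.*, d."Region"
  FROM "Fact_Sales" f
  INNER JOIN "TB_Dim_SCD" d
     ON f."Customer_ID" = d."Customer_ID"
  WHERE d."Snapshot Date" = (SELECT MAX("Snapshot Date") FROM "TB_Dim_SCD")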

Method 3 – Daily Snapshot (Truncate)

This method is the same as Method 2 in that it captures a daily snapshot of the dimension; the main difference is that the entire dimension is recreated on each load, so the values from the most recent run are retained.

Cons

  • Two dataflows required, time-staggered

As the dataset is truncated but also used as a source, this method requires two dataflows that need to be scheduled with a time delay between them.

Conclusion

Hopefully these three methods will help you if you have a similar modelling requirement in Data Warehouse Cloud. Each has its own advantages and disadvantages, so you will need to decide which is most appropriate for you.

Method 1 (Traditional Type 2 SCD) was the method chosen for the DWC project I was working on, due to the large volume of data and the low volume of data changes.

There are a number of upcoming items on the Data Warehouse Cloud roadmap which may help achieve the same objective. It is not yet clear when they will arrive, but please monitor them as they might be beneficial.

Expected – Q4 2021

Roadmap item – Dependent serial scheduling tasks

Anticipated Impact – Expected to provide the ability to chain tasks/dataflows together

URL – https://roadmaps.sap.com/board?PRODUCT=73555000100800002141&range=FIRST-LAST#;INNO=000D3A4855C81EDB99C49210E6AFB1BD

Expected – Q4 2021

Roadmap item – Time-dependency support added to dimensions

Anticipated Impact – Expected to provide support for a Type 2 SCD with valid from/to dates

URL – https://roadmaps.sap.com/board?PRODUCT=73555000100800002141&range=FIRST-LAST#;INNO=000D3ABE772D1EDBB6DEA7546CD6AFB3

If you got this far, thanks for reading my first blog. I hope you have found it useful; please post any questions, comments and feedback below, and don’t forget to follow me!

If you are interested in implementing one of the methods, take a look at the solution walkthroughs below for a more detailed step-by-step guide, and I will try to answer any questions or comments.

Solution Walkthroughs

This section details the individual steps needed to implement each of the solutions.

1 Solution 1 – Traditional Type 2 SCD

The first solution is a traditional Type 2 Slowly Changing Dimension, where any change in a record creates a new entry and the valid from/to dates are updated accordingly.

Below is a high-level overview of all the objects used in the solution with a short description of the object usage.

In the Data Builder there are five different object types available, and there is currently no way to distinguish whether a view is graphical or SQL once it has been created. Applying a naming convention such as the one below helps aid clarity.

1.1 Naming Convention

  • Tables – Prefixed with “TB_”
  • Views (Graphical) – Prefixed with “VG_”
  • Views (SQL) – Prefixed with “VS_”
  • Dataflows – Prefixed with “DF_”
  • Entity Relationship Models – Prefixed with “ER_”

Object Name – Object Type – Description

  • “TB_Source_CSV” – Table – Source file
  • “TB_Dim_SCD_Temp” – Table – Temporarily holds the dimension data during the load
  • “TB_Dim_SCD” – Table – Stores the dimension data
  • “VG_Dim_SCD_1” – View – Compares the source and dimension data to identify whether each record is new, updated, deleted or unchanged
  • “VG_Dim_SCD_2” – View – Recreates the dimension with valid from/to dates and is-current/is-deleted flags
  • “DF_Dim_Temp_Load” – Dataflow – Loads data from “VG_Dim_SCD_2” to “TB_Dim_SCD_Temp”
  • “DF_Dim_Load” – Dataflow – Loads data from “TB_Dim_SCD_Temp” to “TB_Dim_SCD”

1.2 Step 1 – Create Source

Create a source table/view to use as the source for the dimension (“TB_Source_CSV”).

1.3 Step 2 – Create Dimension tables

Create two tables (“TB_Dim_SCD_Temp” and “TB_Dim_SCD”). These tables will be identical, matching the source plus five additional fields (a DDL sketch follows the list):

  • Valid From
  • Valid To
  • Is Current
  • Is Deleted
  • Deleted Date
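
As an illustrative sketch only (the tables are actually created in the Data Builder, and the two source columns here are hypothetical), the structure is equivalent to:

  -- The dimension table: the source structure plus five SCD housekeeping fields
  CREATE TABLE "TB_Dim_SCD" (
      "ID"           INTEGER,        -- hypothetical source key
      "Region"       NVARCHAR(40),   -- hypothetical source attribute
      "Valid_From"   DATE,
      "Valid_To"     DATE,
      "Is_Current"   INTEGER,        -- 1 = current version of the record
      "Is_Deleted"   INTEGER,        -- 1 = record no longer exists in the source
      "Deleted_Date" DATE
  );
  -- "TB_Dim_SCD_Temp" has exactly the same structure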

1.4 Step 3 – Create VG_Dim_SCD_1 – Identify Change Type

Create a new graphical view to identify data changes.

Add the source table/view (“TB_Source_CSV”) twice and alias the two instances as:

  • Source Data
  • Source Deleted

Now add the table “TB_Dim_SCD” three times and alias the instances as:

  • Dim Current
  • Dim Not Current
  • Dim Deleted

Add a Filter to Dim Current that checks:

  • Is_Current = 1 and Is_Deleted = 0

Add a Filter to Dim Deleted that checks:

  • Is_Current = 1 and Is_Deleted = 0

Add a Filter to Dim Not Current that checks:

  • Is_Current = 0 or Is_Deleted = 1

Left join Source Data and Dim Current, ensuring Source Data is the left/driving table and the tables are joined on the primary key.

A projection should be added automatically after the join; if it is not, add one.

Enable all fields in the Projection.

Rename all duplicated fields from Dim Current.

Add a calculated column.

Add a new field called SCD (String (10)).

Create a case statement:

  • If the Dim key field is null, then ‘New’
  • If any non-key field has a changed value, then ‘Updated’
  • Else ‘No Change’
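
In SQL terms the calculated column is roughly equivalent to the CASE expression below; this is a sketch that assumes a hypothetical key field “ID”, a single hypothetical non-key field “Region”, and “Dim_ID”/“Dim_Region” as the renamed duplicates from Dim Current:

  CASE
      WHEN "Dim_ID" IS NULL         THEN 'New'        -- no matching dimension record
      WHEN "Region" <> "Dim_Region" THEN 'Updated'    -- a non-key field has changed
      ELSE 'No Change'
  END AS "SCD"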

Go back to Dim Not Current.

Add a Calculated column after the filter.

Add a new field called SCD (String (10)).

Hardcode the value “No Change”.

Left join Source Deleted and Dim Deleted, ensuring Dim Deleted is the left/driving table and the tables are joined on the primary key.

A projection should be added automatically after the join; if it is not, add one.

Disable all fields from Source Deleted except the Key field.

Rename the Key field from Source Deleted.

Add a filter that returns where the Source Deleted Key field is null.

Add a calculated column.

Add a new field called SCD (String (10)).

Hardcode the value “Deleted”, as these are the dimension records that no longer exist in the source.

The three datasets are now ready to be combined with a Union operator.

Start with Current and Not Current, then add Deleted.

Ensure all fields have been correctly mapped in the union.

The first view is now complete; it identifies whether each dimension record is new, changed, deleted or unchanged.
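
For reference, the deleted-record branch and the pass-through branch are roughly equivalent to the SQL below (a sketch only, assuming the hypothetical key field “ID”; in the graphical view the projections align the columns before the union):

  -- Dim Deleted branch: current dimension records that no longer exist in the source
  SELECT d.*, 'Deleted' AS "SCD"
  FROM "TB_Dim_SCD" d
  LEFT JOIN "TB_Source_CSV" s ON d."ID" = s."ID"
  WHERE d."Is_Current" = 1 AND d."Is_Deleted" = 0 AND s."ID" IS NULL
  UNION ALL
  -- Dim Not Current branch: historic records pass straight through
  SELECT d.*, 'No Change' AS "SCD"
  FROM "TB_Dim_SCD" d
  WHERE d."Is_Current" = 0 OR d."Is_Deleted" = 1
  -- ...UNION ALL with the new/updated/no-change branch built above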

1.5 Step 4 – Create VG_Dim_SCD_2 – Recreate Dimension

Create a new graphical view.

Add “VG_Dim_SCD_1” five times to the view and alias the instances as follows:

  • New
  • Updated Historic
  • Updated New
  • Deleted
  • No Change

Add a filter to each flow using SCD based on the alias name:

  • New = New
  • Updated Historic = Updated
  • Updated New = Updated
  • Deleted = Deleted
  • No Change = No Change

Add a Calculated Column transform to each flow and derive the values to be used for:

  • Valid From
  • Valid To
  • Is Current
  • Is Deleted
  • Deleted Date

Use the values below as the derived values for each alias (a SQL sketch of the two “Updated” branches follows the lists).

New

  • Valid From = ‘1900-01-01’ – as the history only starts from the point the loads begin, 1900 is used instead of the run date to cater for historic reporting
  • Valid To = ‘9999-12-31’
  • Is Current = 1
  • Is Deleted = 0
  • Deleted Date = NULL

No Change

  • Valid From = Dim Valid From
  • Valid To = Dim Valid To
  • Is Current = Dim Is Current
  • Is Deleted = Dim Is Deleted
  • Deleted Date = Dim Deleted Date

Deleted

  • Valid From = Dim Valid From
  • Valid To = Dim Valid To
  • Is Current = Dim Is Current
  • Is Deleted = 1
  • Deleted Date = Current/Load Date

Updated New

  • Valid From = Current/Load Date
  • Valid To = ‘9999-12-31’
  • Is Current = 1
  • Is Deleted = 0
  • Deleted Date = NULL

Updated Historic

  • Valid From = Dim Valid From
  • Valid To = Current/Load Date
  • Is Current = 0
  • Is Deleted = 0
  • Deleted Date = NULL
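
As a sketch of the two “Updated” branches in SQL terms (an updated record produces both a new current row and a closing historic row; “Dim_Valid_From” stands for the renamed dimension field, and the business fields are omitted for brevity):

  -- Updated New: the fresh source values become the current record
  SELECT CURRENT_DATE          AS "Valid_From",
         TO_DATE('9999-12-31') AS "Valid_To",
         1                     AS "Is_Current",
         0                     AS "Is_Deleted",
         NULL                  AS "Deleted_Date"
  FROM "VG_Dim_SCD_1"
  WHERE "SCD" = 'Updated'
  -- Updated Historic: the old dimension values are closed off at the load date
  SELECT "Dim_Valid_From"      AS "Valid_From",
         CURRENT_DATE          AS "Valid_To",
         0                     AS "Is_Current",
         0                     AS "Is_Deleted",
         NULL                  AS "Deleted_Date"
  FROM "VG_Dim_SCD_1"
  WHERE "SCD" = 'Updated'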

Add a Projection to each flow.

At this point it is important to ensure the correct fields are being used for all attributes.

Exclude all columns.

Enable Valid From, Valid To, Is Current, Is Deleted and Deleted Date.

  • New – enable all fields from Source Data
  • No Change – enable all fields from the source/dimension data
  • Deleted – enable all fields from the dimension data
  • Updated New – enable all fields from Source Data
  • Updated Historic – enable all fields from the dimension data

Union all of the datasets.

1.6 Step 5 – Create DF_Dim_Temp_Load – Load into Temp Table

Dataflows and tables are used here because the target table is also used as a source, and the target is truncated when it is loaded.

When a dataflow is executed, the first step is to truncate the target table. This would result in all records falling into the SCD category “New” and all history being lost.

To avoid this issue, first load the data into a temporary table.

Use “VG_Dim_SCD_2” as the source and “TB_Dim_SCD_Temp” as the target, and set the table-loading mode to TRUNCATE.

1.7 Step 6 – Create DF_Dim_Load – Load from Temp Table to Dimension

Use “TB_Dim_SCD_Temp” as the source and “TB_Dim_SCD” as the target, and set the table-loading mode to TRUNCATE.

1.8 Step 7 – Schedule Dataflows

The Data Integration Monitor doesn’t currently have the ability to link or chain dataflow executions; you can only schedule based on time/frequency.

Ensure that you give “DF_Dim_Temp_Load” enough time to complete its execution before the “DF_Dim_Load” schedule starts.

2 Solution 2 – Daily Snapshot (Append)

Instead of comparing the data, this solution takes a daily snapshot of the entire dataset and appends it to the dimension table with a snapshot date.

2.1 Naming Convention

The naming convention is the same as in Solution 1.

Object Name – Object Type – Description

  • “TB_Source_CSV” – Table – Source file
  • “TB_Dim_SCD” – Table – Stores the dimension data
  • “VG_Dim_SCD_1” – View – Compares the current snapshot date with the maximum dimension snapshot date and filters the data
  • “DF_Dim_Load” – Dataflow – Appends data from “VG_Dim_SCD_1” to “TB_Dim_SCD”

2.2 Step 1 – Create Source

Create a source table/view that you would like to use as the source for the dimension (“TB_Source_CSV”).

2.3 Step 2 – Create Dimension table

Create the table “TB_Dim_SCD”; it will be identical to the source plus one additional field:

  • Snapshot Date

2.4 Step 3 – Create VG_Dim_SCD_1 – Check Dimension Max Snapshot Date

Create a new graphical view.

Add “TB_Source_CSV” to the design pane and alias it as Source.

Add “TB_Dim_SCD” to the design pane and alias it as Dim.

The aggregation function only aggregates Integer data types, so we will create an integer representation of the date.

Add a calculated column transform to both flows and add the following fields:

  • Source
    • Snapshot Date – Date – CURRENT_DATE()
    • Snapshot Date INT – Integer – DAYS_BETWEEN(‘1900-01-01’, CURRENT_DATE())
  • Dim
    • Snapshot Date INT – Integer – DAYS_BETWEEN(‘1900-01-01’, “Snapshot Date”)

Add a Projection to the Dim flow

Disable all fields except Snapshot Date INT.

Add an aggregation transform

Set Snapshot Date INT to MAX aggregation type

Join the two flows together

Set the Join Type to Cross Join and remove any mappings

In the projection, rename Snapshot Date INT (Dim) to Max Snapshot Date INT and ensure all fields are enabled.

Add a calculated column transform

Update the formula for Max Snapshot Date INT to replace null with 0, to cater for loading into an empty dimension table.

Add a filter where Snapshot Date INT > Max Snapshot Date INT
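
Putting the steps together, the view is roughly equivalent to this SQL sketch (IFNULL covers the first load into an empty dimension):

  SELECT s.*, CURRENT_DATE AS "Snapshot Date"
  FROM "TB_Source_CSV" s
  CROSS JOIN (
      -- Latest snapshot already in the dimension, as an integer day count (0 if empty)
      SELECT IFNULL(MAX(DAYS_BETWEEN('1900-01-01', "Snapshot Date")), 0)
             AS "Max Snapshot Date INT"
      FROM "TB_Dim_SCD"
  ) m
  WHERE DAYS_BETWEEN('1900-01-01', CURRENT_DATE) > m."Max Snapshot Date INT"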

2.5 Step 4 – Create/Schedule Dataflow

Create a dataflow to load the data from your view to the Dimension, ensure the table load type is set to APPEND and schedule the Dataflow.

3 Solution 3 – Daily Snapshot (Truncate)

This solution is a replica of Solution 2; the difference is that rather than checking the maximum snapshot date in the dimension, all data is extracted from the dimension except where the snapshot date equals the load date.

This allows the load to run multiple times in a period and the most recent run in the period will be the version held in the Dimension.

As the target table is truncated, two Dataflows and a temporary table will be required in this version of the solution.

3.1 Naming Convention

Object Name – Object Type – Description

  • “TB_Source_CSV” – Table – Source file
  • “TB_Dim_SCD_Temp” – Table – Temporarily stores the dimension data
  • “TB_Dim_SCD” – Table – Stores the dimension data
  • “VG_Dim_SCD_1” – View – Retrieves all records from “TB_Dim_SCD” where Snapshot Date < Run Date and combines them with “TB_Source_CSV”
  • “DF_Dim_Temp_Load” – Dataflow – Truncates and loads data from “VG_Dim_SCD_1” into “TB_Dim_SCD_Temp”
  • “DF_Dim_Load” – Dataflow – Truncates and loads data from “TB_Dim_SCD_Temp” to “TB_Dim_SCD”

3.2 Step 1 – Create Source

Create a source table/view that you would like to use as the source for the dimension (“TB_Source_CSV”).

3.3 Step 2 – Create Dimension tables

Create the tables “TB_Dim_SCD” and “TB_Dim_SCD_Temp”; they will be identical to the source plus one additional field:

  • Snapshot Date

3.4 Step 3 – Create VG_Dim_SCD_1 – Combine Historic and Current Dimension

Create a new graphical view.

Add “TB_Source_CSV” to the design pane and alias it as Source.

Add “TB_Dim_SCD” to the design pane and alias it as Dim.

Add a calculated column transform to the source flow and add the following field:

  • Source
    • Snapshot Date – Date – CURRENT_DATE()

Add a filter to the Dim flow where Snapshot Date < CURRENT_DATE().

Union the two flows together, ensuring all fields are correctly mapped.
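
The resulting view is roughly equivalent to this SQL sketch; because today’s snapshot is excluded from the dimension side, a rerun on the same day simply replaces it:

  -- Today's snapshot, taken fresh from the source
  SELECT s.*, CURRENT_DATE AS "Snapshot Date"
  FROM "TB_Source_CSV" s
  UNION ALL
  -- All earlier snapshots, excluding anything already loaded today
  SELECT d.*
  FROM "TB_Dim_SCD" d
  WHERE d."Snapshot Date" < CURRENT_DATE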

3.5 Step 4 – Create DF_Dim_Temp_Load

Create a dataflow to load the data from your view to the temporary table, ensure the table load type is set to TRUNCATE.

3.6 Step 5 – Create DF_Dim_Load

Create a dataflow to load the data from your temporary table to the Dimension table, ensure the table load type is set to TRUNCATE.

3.7 Step 6 – Schedule Dataflows

Ensure that you give “DF_Dim_Temp_Load” enough time to complete its execution before the “DF_Dim_Load” schedule starts.
