
Monday, 18 March 2019

What are Slowly Changing Dimensions and how do you implement them in ODI 12c?


What are Slowly Changing Dimensions?

Slowly Changing Dimensions (SCDs) are dimensions that change slowly over time, rather than on a regular, time-based schedule. In a data warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users to assign the proper dimension attribute value for a given date. Examples of such dimensions are customer, geography, and employee.


There are many approaches to dealing with SCDs. The most popular are:


· Type 0 - The passive method

· Type 1 - Overwriting the old value

· Type 2 - Creating a new additional record

· Type 3 - Adding a new column

· Type 4 - Using historical table

· Type 6 - Combined approach of types 1, 2 and 3 (1+2+3=6)



Type 0 - The passive method. In this method no special action is performed upon dimensional changes. Some dimension data may remain the same as when it was first inserted; other data may be overwritten.



Type 1 - Overwriting the old value. In this method no history of dimension changes is kept in the database. The old dimension value is simply overwritten by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g. removing special characters, correcting spelling errors).


Before the change:

Customer_ID | Customer_Name | Customer_Type
1           | Cust_1        | Corporate


After the change:

Customer_ID | Customer_Name | Customer_Type
1           | Cust_1        | Retail


Advantages:

- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages:

- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Cust_1 was a Corporate customer before.

Usage:

About 50% of the time.

When to use Type 1:

Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
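The overwrite itself is a single UPDATE. As a minimal sketch (Python with SQLite standing in for the warehouse database, using the hypothetical customer table from the example above):

```python
import sqlite3

# Stand-in dimension table matching the example above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id INTEGER, customer_name TEXT, customer_type TEXT)")
con.execute("INSERT INTO customer VALUES (1, 'Cust_1', 'Corporate')")

# Type 1: overwrite the old attribute value in place; no history is kept.
con.execute("UPDATE customer SET customer_type = 'Retail' WHERE customer_id = 1")

new_type = con.execute(
    "SELECT customer_type FROM customer WHERE customer_id = 1").fetchone()[0]
```

After the update only 'Retail' is visible; the fact that the customer was ever 'Corporate' is gone.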



Type 2 - Creating a new additional record. In this methodology the entire history of dimension changes is kept in the database. You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain, as attributes, the natural key (or another durable identifier). 'Effective date' and 'current indicator' columns are also used in this method. There can be only one record with the current indicator set to 'Y'. For the 'effective date' columns, i.e. start_date and end_date, the end_date for the current record is usually set to the value 9999-12-31. Introducing changes to the dimensional model in type 2 can be a very expensive database operation, so it is not recommended in dimensions where a new attribute could be added in the future.

Before the change:

Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date   | Current_Flag
1           | Cust_1        | Corporate     | 22-07-2010 | 31-12-9999 | Y


After the change:

Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date   | Current_Flag
1           | Cust_1        | Corporate     | 22-07-2010 | 17-05-2012 | N
2           | Cust_1        | Retail        | 18-05-2012 | 31-12-9999 | Y


Advantages:

- This allows us to accurately keep all historical information.

Disadvantages:

- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.

- This necessarily complicates the ETL process.

Usage:

About 50% of the time.

When to use Type 2:

Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
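The change above can be sketched as two statements: expire the current row, then insert a new row with a new surrogate key (Python with SQLite as a stand-in; table and column names follow the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customer (
    customer_id INTEGER, customer_name TEXT, customer_type TEXT,
    start_date TEXT, end_date TEXT, current_flag TEXT)""")
con.execute("INSERT INTO customer VALUES (1, 'Cust_1', 'Corporate', '2010-07-22', '9999-12-31', 'Y')")

# Step 1 (expire): close the current record and clear its flag.
con.execute("""UPDATE customer SET end_date = '2012-05-17', current_flag = 'N'
               WHERE customer_name = 'Cust_1' AND current_flag = 'Y'""")
# Step 2 (insert): add a new row with a new surrogate key and an open end date.
con.execute("INSERT INTO customer VALUES (2, 'Cust_1', 'Retail', '2012-05-18', '9999-12-31', 'Y')")

rows = con.execute("""SELECT customer_id, customer_type, current_flag
                      FROM customer ORDER BY customer_id""").fetchall()
```

Exactly one row remains flagged as current ('Y'); the expired row keeps the full history.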

Type 3 - Adding a new column. In this type usually only the current and previous values of the dimension are kept in the database. The new value is loaded into the 'current/new' column and the old one into the 'old/previous' column. Generally speaking, the history is limited to the number of columns created for storing historical data. This is the least commonly needed technique.



Before the change:

Customer_ID | Customer_Name | Current_Type | Previous_Type
1           | Cust_1        | Corporate    | Corporate


After the change:

Customer_ID | Customer_Name | Current_Type | Previous_Type
1           | Cust_1        | Retail       | Corporate


Advantages:

- This does not increase the size of the table, since the new information simply updates existing rows.

- This allows us to keep some part of history.

Disadvantages:

- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Cust_1's type changes again, the original 'Corporate' value will be lost.

Usage:
Type 3 is rarely used in actual practice.

When to use Type 3:

A Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
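A Type 3 change is again a single UPDATE: the old value is shifted into the 'previous' column before being overwritten. A Python/SQLite sketch using the example columns (in SQL, all right-hand sides of a SET list see the pre-update values, which is what makes the shift work):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customer (customer_id INTEGER, customer_name TEXT,
                                      current_type TEXT, previous_type TEXT)""")
con.execute("INSERT INTO customer VALUES (1, 'Cust_1', 'Corporate', 'Corporate')")

# Shift current -> previous and overwrite current, in one statement.
con.execute("""UPDATE customer
               SET previous_type = current_type, current_type = 'Retail'
               WHERE customer_id = 1""")

row = con.execute(
    "SELECT current_type, previous_type FROM customer WHERE customer_id = 1").fetchone()
```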


Type 4 - Using a historical table. In this method a separate historical table is used to track all historical changes to each dimension's attributes. The 'main' dimension table keeps only the current data, e.g. customer and customer_history tables.

Current table:

Customer_ID | Customer_Name | Customer_Type
1           | Cust_1        | Corporate


Historical table:

Customer_ID | Customer_Name | Customer_Type | Start_Date | End_Date
1           | Cust_1        | Retail        | 01-01-2010 | 21-07-2010
1           | Cust_1        | Other         | 22-07-2010 | 17-05-2012
1           | Cust_1        | Corporate     | 18-05-2012 | 31-12-9999


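In code, a Type 4 change touches both tables: close the open history record, append the new state to the history table, and overwrite the main table. A minimal Python/SQLite sketch (the 2014 change date is an invented example value):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id INTEGER, customer_name TEXT, customer_type TEXT)")
con.execute("""CREATE TABLE customer_history (customer_id INTEGER, customer_name TEXT,
               customer_type TEXT, start_date TEXT, end_date TEXT)""")
con.execute("INSERT INTO customer VALUES (1, 'Cust_1', 'Corporate')")
con.execute("INSERT INTO customer_history VALUES (1, 'Cust_1', 'Corporate', '2012-05-18', '9999-12-31')")

# Close the open history record, append the new state, overwrite the main table.
con.execute("""UPDATE customer_history SET end_date = '2014-01-31'
               WHERE customer_id = 1 AND end_date = '9999-12-31'""")
con.execute("INSERT INTO customer_history VALUES (1, 'Cust_1', 'Retail', '2014-02-01', '9999-12-31')")
con.execute("UPDATE customer SET customer_type = 'Retail' WHERE customer_id = 1")

current = con.execute("SELECT customer_type FROM customer WHERE customer_id = 1").fetchone()[0]
history = con.execute(
    "SELECT customer_type, end_date FROM customer_history ORDER BY start_date").fetchall()
```

Queries against current data never have to filter out history rows, which is the main selling point of type 4.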


Type 6 - Combined approach of types 1, 2 and 3 (1+2+3=6). In this type the dimension table has the following additional columns:

· current_type - keeps the current value of the attribute. All history records for a given item have the same current value.

· historical_type - keeps the historical value of the attribute. History records for a given item can have different values.

· start_date - keeps the start of the attribute's 'effective date' range.

· end_date - keeps the end of the attribute's 'effective date' range.

· current_flag - identifies the most recent record.

In this method, to capture an attribute change we add a new record, as in type 2. The current_type information is overwritten with the new value, as in type 1. We store the history in the historical_type column, as in type 3.

Customer_ID | Customer_Name | Current_Type | Historical_Type | Start_Date | End_Date   | Current_Flag
1           | Cust_1        | Corporate    | Retail          | 01-01-2010 | 21-07-2010 | N
2           | Cust_1        | Corporate    | Other           | 22-07-2010 | 17-05-2012 | N
3           | Cust_1        | Corporate    | Corporate       | 18-05-2012 | 31-12-9999 | Y


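Putting the three parts together, a Type 6 change can be sketched as follows (Python/SQLite with the example columns; the 2014 dates are invented): expire the current row as in type 2, insert the new row, then overwrite current_type across the whole history as in type 1.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customer (customer_id INTEGER, customer_name TEXT,
               current_type TEXT, historical_type TEXT,
               start_date TEXT, end_date TEXT, current_flag TEXT)""")
con.execute("""INSERT INTO customer VALUES
               (3, 'Cust_1', 'Corporate', 'Corporate', '2012-05-18', '9999-12-31', 'Y')""")

new_type = 'Retail'
# Type 2 part: expire the current row and add a new one.
con.execute("""UPDATE customer SET end_date = '2014-01-31', current_flag = 'N'
               WHERE customer_name = 'Cust_1' AND current_flag = 'Y'""")
con.execute("INSERT INTO customer VALUES (4, 'Cust_1', ?, ?, '2014-02-01', '9999-12-31', 'Y')",
            (new_type, new_type))
# Type 1 part: overwrite current_type on every history row for this customer.
con.execute("UPDATE customer SET current_type = ? WHERE customer_name = 'Cust_1'", (new_type,))

rows = con.execute("""SELECT customer_id, current_type, historical_type, current_flag
                      FROM customer ORDER BY customer_id""").fetchall()
```

Every row now reports the same current_type, while historical_type preserves what was true in each date range.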


ODI 12c SCD Type 2 Step by Step Implementation 







Implementing SCD Type 2 in ODI 12c is much easier than in ODI 11g.



Please find below the steps for the SCD Type 2 implementation.



I have created a new target table to support SCD behaviour.



CREATE TABLE DEV.SCD
(
EMPLOYEE_ID NUMBER(6,0),
FIRST_NAME VARCHAR2(20 BYTE),
LAST_NAME VARCHAR2(25 BYTE) NOT NULL ENABLE,
EMAIL VARCHAR2(25 BYTE) NOT NULL ENABLE,
PHONE_NUMBER VARCHAR2(20 BYTE),
HIRE_DATE DATE NOT NULL ENABLE,
JOB_ID VARCHAR2(10 BYTE) NOT NULL ENABLE,
SALARY NUMBER(8,2),
COMMISSION_PCT NUMBER(2,2),
MANAGER_ID NUMBER(6,0),
DEPARTMENT_ID NUMBER(4,0),
STATUS_FLAG VARCHAR2(1 BYTE),
STARTING_DATE DATE,
ENDING_DATE DATE
) TABLESPACE SYSTEM ;




Step 1:

------

Import the IKM Oracle Slowly Changing Dimension knowledge module.


Step 2:
--------
Open the target SCD table and change the SCD behavior of its columns.






I have selected the SCD behaviour for each column as shown below (ODI offers Surrogate Key, Natural Key, Overwrite on Change, Add Row on Change, Current Record Flag, Starting Timestamp and Ending Timestamp).




Step 3:
-------
Create a mapping to load data from the source table (hr.employees) to the target table (dev.scd).



The following three columns are not received from the source; they must be mapped directly on the target table.


I have mapped STATUS_FLAG = 1 (active = 1, inactive = 0), STARTING_DATE = SYSDATE and ENDING_DATE = SYSDATE (but the IKM SCD will use its default value, 01-01-2400, for the ending date).



Select the LKM SQL to SQL knowledge module in the Physical tab.






Select the IKM Oracle Slowly Changing Dimension knowledge module.







Select the CKM SQL knowledge module for data quality validation on the I$ table.








Run the mapping using the Run button in the top menu.







The target table is empty as of now; there are no records in the target SCD table.




The program executed successfully; we can see the status as green.




110 records were inserted with STATUS_FLAG=1, i.e. as new (active) records.









UPDATE hr.employees SET salary=77777 WHERE employee_id=100;

COMMIT;


I have updated the data in the source table, and I am running my mapping again.





Program finished successfully.



One record was inserted because the salary changed; this triggers the 'Add Row on Change' behaviour.





We can see the modified record inserted as a new record, while the old record was updated to STATUS_FLAG=0, the inactive record. STATUS_FLAG=1 marks the new, active record.



Monday, 18 February 2019

What are Static Control and Flow Control? What is the difference between them?





Answer:- 


What is a Data Integrity (Quality) Check in ODI:-


The data integrity check process is activated in the following cases:


When a Static Control is started (from Studio, or using a package) on a model, sub-model or datastore. The data in the datastores (tables) is checked against the constraints defined in the Oracle Data Integrator model.


When a mapping is executed and Flow Control is activated in the IKM. The flow data staged in the integration table (I$) is checked against the constraints of the target datastore, as defined in the model. Only the constraints selected in the mapping are checked.


In both cases, a CKM is in charge of checking data quality against a predefined set of constraints. The CKM can be used either to check existing data ('static control') or to check flow data ('flow control'). It is also in charge of removing the erroneous records from the checked table, if specified.








Static vs Flow control in Oracle Data Integrator


FLOW CONTROL :


Data quality validation is done before loading the data into target tables.


The Check Knowledge Module (CKM) will create an E$ table and the SNP_CHECK_TAB table for the data quality check.


It will validate data in the I$ table before inserting it into the target table. Erroneous rows are deleted from the I$ table and inserted into the E$ table, and a summary error message and the mapping name are written to SNP_CHECK_TAB.


STATIC CONTROL :


Data quality validation is done after loading the data into target tables.


The CKM will validate data in the target table and, if any error is detected, it will be inserted into the E$ table and SNP_CHECK_TAB. Remember that the incorrect entry will not be deleted from the target table, unlike in Flow Control.


What is Recycle Errors in Data Quality in ODI?


Recycle Errors is the process of reprocessing failed or rejected data from the E$ table back into the I$ table.


The E$ table contains all rows rejected by Flow Control or Static Control, e.g. duplicate or null rows (PK, UK or CHECK constraints).


First we rectify the error data in the E$ table, then enable the RECYCLE_ERRORS option at IKM level. The IKM selects the corrected data from the E$ table and inserts it into the I$ table, where it is validated again; valid rows are loaded into the target table, while invalid rows go back into the E$ table.
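The recycle step can be sketched as follows (Python/SQLite with simplified stand-ins for the I$ and E$ tables; the real IKM-generated SQL is more involved and also joins on the error type):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified stand-ins for ODI's flow (I$) and error (E$) tables.
con.execute("CREATE TABLE i_tab (customer_id INTEGER, customer_name TEXT)")
con.execute("CREATE TABLE e_tab (customer_id INTEGER, customer_name TEXT, err_mess TEXT)")
# A row that was rejected earlier and has since been corrected by hand in E$.
con.execute("INSERT INTO e_tab VALUES (7, 'Cust_7', 'customer_name is null')")

# Recycle: copy the corrected rows back into I$ for another validation pass.
con.execute("INSERT INTO i_tab SELECT customer_id, customer_name FROM e_tab")
con.execute("DELETE FROM e_tab")
```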


What actually happens during flow control:-




  • After loading the data into I$, a check table (SNP_CHECK_TAB) is created, and the previous error table and previous errors are deleted, as ODI generally does.

  • ODI then creates a new E$ error table and checks primary key and unique constraints, other constraints and conditions defined at database or model level (ODI conditions), and performs a Not Null check for each column marked as not null.

  • Records that violate these constraints and conditions are added to the E$ table, and an entry is added to SNP_CHECK_TAB with information about the schema, error message, count, etc.

  • Finally, the remaining records are inserted and updated as per the KM logic.
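The checking logic in the steps above can be sketched like this (Python/SQLite with simplified I$/E$ stand-ins; the actual CKM generates database-specific SQL and also maintains SNP_CHECK_TAB):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE i_tab (customer_id INTEGER, customer_name TEXT)")
con.execute("CREATE TABLE e_tab (customer_id INTEGER, customer_name TEXT, err_mess TEXT)")
con.executemany("INSERT INTO i_tab VALUES (?, ?)",
                [(1, 'Cust_1'), (1, 'Cust_1_dup'), (2, None), (3, 'Cust_3')])

# Not Null check: move rows with a NULL mandatory column into E$.
con.execute("""INSERT INTO e_tab
               SELECT customer_id, customer_name, 'customer_name is null'
               FROM i_tab WHERE customer_name IS NULL""")
con.execute("DELETE FROM i_tab WHERE customer_name IS NULL")

# PK check: move every row whose key occurs more than once into E$.
con.execute("""INSERT INTO e_tab
               SELECT customer_id, customer_name, 'PK violation'
               FROM i_tab WHERE customer_id IN (SELECT customer_id FROM i_tab
                                                GROUP BY customer_id HAVING COUNT(*) > 1)""")
con.execute("""DELETE FROM i_tab WHERE customer_id IN
               (SELECT customer_id FROM e_tab WHERE err_mess = 'PK violation')""")

# Only clean rows remain in I$ to be applied to the target.
clean = con.execute("SELECT customer_id, customer_name FROM i_tab").fetchall()
```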





Error Table Structure


The E$ error table has the list of columns described in the following table:-


Column       Description

----------------------------------------------------------------------------------

ERR_TYPE     Type of error: 'F' when the datastore is checked during flow control,
             'S' when the datastore is checked using static control

ERR_MESS     Error message related to the violated constraint

CHECK_DATE   Date and time when the datastore was checked

ORIGIN       Origin of the check operation. This column is set either to the datastore
             name or to a mapping name and ID, depending on how the check was performed.

CONS_NAME    Name of the violated constraint

CONS_TYPE    Type of the constraint: 'PK': Primary Key, 'AK': Alternate Key,
             'FK': Foreign Key, 'CK': Check condition, 'NN': Mandatory attribute

----------------------------------------------------------------------------------


Summary Table Structure:-


The SNP_CHECK_TAB table has the list of columns described below:


Column             Description

----------------------------------------------------------------------------------

ODI_CATALOG_NAME   Catalog name of the checked table, where applicable

ODI_SCHEMA_NAME    Schema name of the checked table, where applicable

ODI_RESOURCE_NAME  Resource name of the checked table

ODI_FULL_RES_NAME  Fully qualified name of the checked table. For example ..

ODI_ERR_TYPE       Type of error: 'F' when the datastore is checked during flow control,
                   'S' when the datastore is checked using static control

ODI_ERR_MESS       Error message

ODI_CHECK_DATE     Date and time when the datastore was checked

ODI_ORIGIN         Origin of the check operation. This column is set either to the
                   datastore name or to a mapping name and ID, depending on how the
                   check was performed.

ODI_CONS_NAME      Name of the violated constraint

ODI_CONS_TYPE      Type of constraint: 'PK': Primary Key, 'AK': Alternate Key,
                   'FK': Foreign Key, 'CK': Check condition, 'NN': Mandatory attribute (Not Null)

ODI_ERR_COUNT      Total number of records rejected by this constraint during the check process

ODI_SESS_NO        ODI session number

ODI_PK             Unique identifier for this table, where applicable

----------------------------------------------------------------------------------