We have moved to www.dataGenX.net. Keep learning with us.

Saturday, May 31, 2014

Surrogate Key Generator - Create/Update State File


In the last post ( Surrogate Key Generator - Create State File ) we saw how to generate the surrogate key state file. In this post we will discuss the other ways to generate it and how to update the state file.

Update Surrogate State File (Flat File) :-
To update the state file, add a Surrogate Key Generator stage to a job with a single input link from another stage.
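Conceptually, the stage scans the key values arriving on that input link and records the highest value seen, so the next run continues after it. The sketch below illustrates that idea in plain Python; the one-number file format and function name are assumptions for illustration, not DataStage's internal state-file format.

```python
# Illustrative sketch of updating a flat-file surrogate-key state file:
# keep the highest key value seen so far, so generation can resume after it.
# The single-number file format is an assumption, not DataStage's own format.

def update_state_file(path, incoming_keys):
    """Merge incoming key values into a simple one-number state file."""
    try:
        with open(path) as f:
            current_max = int(f.read().strip() or 0)
    except FileNotFoundError:
        current_max = 0                      # no state file yet: start from zero
    new_max = max([current_max] + list(incoming_keys))
    with open(path, "w") as f:
        f.write(str(new_max))                # persist the new high-water mark
    return new_max

print(update_state_file("skg_state.txt", [101, 250, 175]))
```

Running it twice shows the point: a second call with smaller keys leaves the stored maximum untouched.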

Friday, May 30, 2014

Surrogate Key Generator - Create State File


Surrogate keys are numeric keys that can replace business keys in the data warehouse.  A surrogate key has the following characteristics:

1) It is typically an integer.
2) Surrogate keys are unique.
3) Surrogate keys allow you to combine data from tables with incompatible keys.
4) It has no meaning. You will not be able to know the meaning of that row of data based on the surrogate key value.
5) It is not visible to end users. End users should not see a surrogate key in a report.

I will not go into the details of surrogate keys here (I will cover them in a later post), but we will see how to generate surrogate keys in DataStage.
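The characteristics above can be made concrete with a small sketch, independent of DataStage: each distinct business key is assigned the next meaningless sequential integer. The function and key names are made up for illustration.

```python
# Illustrative sketch (not DataStage itself) of the basic surrogate-key idea:
# each distinct business key gets a meaningless sequential integer.

from itertools import count

def assign_surrogates(business_keys, start=1):
    counter = count(start)
    mapping = {}                 # business key -> surrogate key
    out = []
    for bk in business_keys:
        if bk not in mapping:
            mapping[bk] = next(counter)   # new business key: take next integer
        out.append((mapping[bk], bk))
    return out

rows = assign_surrogates(["CUST-A", "CUST-B", "CUST-A"])
print(rows)  # → [(1, 'CUST-A'), (2, 'CUST-B'), (1, 'CUST-A')]
```

Note how the same business key always maps to the same surrogate, and the surrogate value itself says nothing about the row.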

Tuesday, May 27, 2014

Interview Questions : DataWareHouse - Part6



What is BUS Schema?
A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.

What are the methodologies of Data Warehousing?
Every company has a methodology of its own; however, to name a few, the SDLC and AIM methodologies are commonly used standards.

Monday, May 26, 2014

DataStage Scenario - Design8 - job1


DataStage Scenario Problem -->  DataStage Scenario - Problem8

Solution Design :

a) Job Design : 
Below is the design which can achieve the needed output. Here, we read two sequential files as input, then pass the data through Join and Filter stages to achieve the output.


Tuesday, May 20, 2014

Data Warehouse Testing Checklist


Unit testing checklist
A checklist helps database programmers systematically test their code before formal QA testing.

  • Check the mapping of fields that support data staging and in data marts.
  • Check for duplication of values generated using sequence generators.
  • Check the correctness of surrogate keys that uniquely identify rows of data.
  • Check for data-type constraints of the fields present in staging and core levels.
  • Check the data loading status and error messages after ETLs (extracts, transformations, loads).
  • Look for string columns that are incorrectly left or right trimmed.
  • Make sure all tables and specified fields were loaded from source to staging.
  • Verify that not-null fields were populated.
  • Verify that no data truncation occurred in each field.
  • Make sure data types and formats are as specified during database design.
  • Make sure there are no duplicate records in target tables.
  • Make sure data transformations are correctly based on business rules.
  • Verify that numeric fields are populated precisely.
  • Make sure every ETL session completed with only planned exceptions.
  • Verify all data cleansing, transformation, and error and exception handling.
  • Verify stored procedure calculations and data mappings.
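Two of the checklist items above (no duplicate records in target tables, not-null fields populated) can be automated even without a test tool. This is a minimal sketch against an in-memory extract; the table rows and column names are made up for illustration.

```python
# Two checklist items made concrete: find duplicate records in a target-table
# extract, and find NULLs in a field that should be not-null.
# Rows and column names are illustrative only.

from collections import Counter

rows = [
    {"cust_sk": 1, "cust_id": "A01", "name": "Alice"},
    {"cust_sk": 2, "cust_id": "B02", "name": None},     # violates not-null
    {"cust_sk": 1, "cust_id": "A01", "name": "Alice"},  # duplicate record
]

# Duplicate check: count identical rows (dicts made hashable via sorted items).
counts = Counter(tuple(sorted(r.items(), key=lambda kv: kv[0])) for r in rows)
dupes = [dict(k) for k, n in counts.items() if n > 1]

# Not-null check: rows where a mandatory field is missing.
nulls = [r for r in rows if r["name"] is None]

print(len(dupes), len(nulls))  # → 1 1
```

In practice the same checks are usually written as `GROUP BY ... HAVING COUNT(*) > 1` and `WHERE col IS NULL` queries against the target table.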

Sunday, May 18, 2014

DataStage Scenario - Design7- job1



DataStage Scenario Problem -->  DataStage Scenario - Problem7

Solution Design :

a) Job Design :  Below is the design which can achieve the needed output. Here, we read a sequential file as input, then pass the data through a Transformer and a Peek stage to achieve the output.






Friday, May 16, 2014

DataStage Scenario - Design6 - job1


DataStage Scenario Problem -->  DataStage Scenario - Problem6
  
Solution Design :

a) Job Design :

Below is the design which can achieve the needed output. Here, we read a sequential file as input, then pass the data through a Transformer stage to achieve the output.


Thursday, May 08, 2014

What ETL is not?


ETL should not be confused with a data-creation process; it never creates new data. If a list of one hundred employees is being loaded, one more employee cannot be added to make it one hundred and one. Nor, if a customer's last name is absent, can an arbitrary last name be substituted.

Data warehouses are not OLTP systems. Duplicating calculations in both the source system and the data warehouse should not be attempted, as the numbers will be very difficult to match during QA. Moreover, the process in the source system may change in the future, which would leave the two sets of data out of sync.

Wednesday, May 07, 2014

Framework ( usually followed ) in ETL Testing


To discuss how testing relates to the different phases of data mart design, this framework includes eight phases:


Requirement analysis:

Requirements are elicited from users and represented either informally, by means of proper glossaries, or formally.

Analysis and reconciliation:

Data sources are inspected, normalized, and integrated to obtain a reconciled schema.

Monday, May 05, 2014

ETL Testing : Trends & Challenges


ETL Testing : Trends


Count Validation
- Verify record counts with DWH back-end/reporting queries against source and target as an initial check.

Dimensional Validation
Check data integrity between the various source tables and relationships. We should also test slowly changing dimensions; the most common slowly changing dimension types are 1, 2, and 3.
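To make the Type 1 vs Type 2 distinction testable, here is a minimal in-memory sketch of the two behaviors (Type 1 overwrites with no history; Type 2 expires the current row and inserts a new current one). The field names (`current`, `start_date`, `end_date`) are illustrative assumptions, not a fixed standard.

```python
# Minimal sketch of SCD Type 1 vs Type 2 handling on in-memory "dimension
# tables". Field names are illustrative only.

def scd_type1(dim, key, new_attrs):
    """Type 1: overwrite attributes in place; no history is kept."""
    dim[key].update(new_attrs)

def scd_type2(dim_rows, key, new_attrs, load_date):
    """Type 2: expire the current row and insert a new current version."""
    for row in dim_rows:
        if row["key"] == key and row["current"]:
            row["current"] = False
            row["end_date"] = load_date      # close out the old version
    dim_rows.append({"key": key, **new_attrs,
                     "start_date": load_date, "end_date": None, "current": True})

# Type 1: the old city is simply lost.
dim = {"A01": {"name": "Alice", "city": "Delhi"}}
scd_type1(dim, "A01", {"city": "Pune"})

# Type 2: both versions remain, only one is current.
dim_rows = [{"key": "A01", "city": "Delhi",
             "start_date": "2014-01-01", "end_date": None, "current": True}]
scd_type2(dim_rows, "A01", {"city": "Pune"}, "2014-05-20")
print(dim["A01"]["city"], len(dim_rows))  # → Pune 2
```

A Type 2 test should assert exactly one current row per key and that expired rows carry the correct end date.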

Sunday, May 04, 2014

ETL Testing : Approach


Testing is undoubtedly an essential part of the DW life-cycle, but it has received little attention compared with the other design phases.

DW testing specification:
 – Software testing is predominantly focused on program code, while DW testing is directed at data and information.
 – DW testing focuses on the correctness and usefulness of the information delivered to users.
 – Differently from generic software systems, DW testing involves a huge data volume, which significantly impacts performance and productivity.
 – DW systems are aimed at supporting virtually any view of the data, so the number of possible use scenarios is practically infinite and only a few of them are known from the beginning.
 – It is almost impossible to predict all the possible types of errors that will be encountered in real operational data.


Thursday, May 01, 2014

DataStage Scenario - Problem19



Goal : Reverse duplicate Removal
  
Input :
source | destination | distance
city1|city2|500
city2|city1|500
city3|city4|500
city4|city3|500
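One way to think about this problem, sketched here in plain Python rather than as a DataStage job: build an order-independent (canonical) key from the two cities, then keep only the first row for each key. In a DataStage job the same idea can be implemented by deriving the sorted key in a Transformer and then removing duplicates on it; the function name below is made up for illustration.

```python
# Sketch of "reverse duplicate" removal: city1|city2|500 and city2|city1|500
# describe the same route, so only the first occurrence is kept.

def remove_reverse_duplicates(rows):
    seen = set()
    out = []
    for src, dst, dist in rows:
        key = (min(src, dst), max(src, dst))   # order-independent route key
        if key not in seen:
            seen.add(key)
            out.append((src, dst, dist))
    return out

data = [("city1", "city2", 500), ("city2", "city1", 500),
        ("city3", "city4", 500), ("city4", "city3", 500)]
print(remove_reverse_duplicates(data))
# → [('city1', 'city2', 500), ('city3', 'city4', 500)]
```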