We have moved to www.dataGenX.net, Keep Learning with us.

Sunday, February 23, 2014

DataWareHouse (ETL) Designing Steps


A) Planning and Designing Steps

i) Requirement and Realities Gathering

                       1. Business Needs
                       2. Data Profiling and other Data source realities
3. Compliance Requirement
4. Security Requirement
5. Data Integration
6. Data Latency
7. Archiving and Lineage
8. End user delivery interfaces
9. Available Development Skills
10. Available management Skills
11. Legacy licenses

ii) Architecture

1. Hand coded v/s ETL vender tools
2. Batch v/s Streaming data flow
3. Horizontal v/s Vertical task dependency
4. Scheduler automation
5. Exception Handling
6. Quality Handling
7. Recovery and Restart
8. Metadata
9. Security

iii) System Implementation

1. Hardware
2. Software
3. Coding Practices
4. Documentation Practices
5. Specific quality checks

iv) Test and Release

1. Development System
2. Test System
3. Production System
4. Handoff Procedure
5. Update propagation approach
6. System snapshot and rollback procedure
7. Performance tuning

B) The Data Flow design

i) Extract

1. Reading source data models
2. Connecting to and accessing data
3. Scheduling the source system intercepting notification and demons
4. Capturing changed data
5. Staging the extracted data to disk

ii) Clean

1. Enforcing column properties
2. Enforcing structure
3. Enforcing data and value rules
4. Enforcing complex business rules
5. Building a meta data foundation to describe data quality
6. Staging a cleaned data to disk

iii) Conform

1. Conforming business rules ( in dimension)
2. Conforming business metrics and performance indicator ( in fact )
3. De-duplicating
4. Householding
5. Internationalizing
6. Staging the conformed data to disk

iv) Deliver

1. Loading flat/snowflaked dimensions
2. Generating time dimensions
3. Loading degenerated dimensions
4. Loading sub-dimension
5. Loading type 1,2 and 3 SCD
6. Conforming dimensions and conforming flats
7. Handling late arriving dimensions and facts
8. Loading multivalued dimensions and facts
9. Loading ragged hierarchy dimensions
10. Loading text fact in dimensions
11. Running the surrogate key pipeline for fact tables
12. Loading three fundamental fact table grain
13. Loading and updating aggregation
14. Staging the delivered data to disk




No comments :

Post a Comment