We have moved to www.dataGenX.net, Keep Learning with us.

Wednesday, June 25, 2014

Fork n Join in DataStage

Algorithm :-
Fork/join parallelism is a style of parallel programming useful for exploiting the parallelism inherent in divide and conquer algorithms on shared memory multiprocessors. The idea is quite simple: a larger task can be divided into smaller tasks whose solutions can then be combined. As long as the smaller tasks are independent, they can be executed in parallel. One important concept to note in this framework is that ideally no worker thread is idle.

Monday, June 23, 2014

Purpose of Parameters in ETL

In an ETL process, a parameter represents an attribute that is not hard coded or sourced from the transactional system. Parameters provide flexibility so that you can adapt ETL processes to fit your business requirements.
or in Simple Words, A job parameter is a way to change a property within a job without having to alter and recompile it. These are input values to job/process which can be changed time to time as per need.

Saturday, June 21, 2014

List all RPMs in Linux

Use this command

$ rpm -qa --qf '%11{SIZE} %{NAME}\n' | sort -k1nr

Thursday, June 19, 2014

APT_CONFIG_FILE : Configuration File

APT_CONFIG_FILE is the file using which DataStage determines the configuration file (one can have many configuration files for a project) to be used. In fact, this is what is generally used in production. However, if this environment variable is not defined then how DataStage determines which file to use?

1)If the APT_CONFIG_FILE environment variable is not defined then DataStage look for default configuration file (config.apt) in following path:
1)Current working directory.
2)INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of DataStage installation.

What are the different options a logical node can have in the configuration file?

Tuesday, June 17, 2014

FastTrack Makes Your DataStage Development Faster

IBM introduced a tool called FastTrack that is a source to target mapping tool that is plugged straight into the Information Server and runs inside a browser. 
The tool was introduced with the Information Server and is available in the 8.1 version.
As the name suggests IBM are using it to help in the analysis and design stage of a data integration project to do the source to target mapping and the definition of the transform rules.  Since it is an Information Server product it runs against the Metadata Server and can share metadata with the other products and it can run inside a browser.
I have talked about it previously in New Product: IBM FastTrack for Source To Target Mapping and FastTrack Excel out of your DataStage project but now I have had the chance to see it in action on a Data Warehouse project.  We have been using the tool for a few weeks now and we are impressed.  It’s been easier to learn than other Information Server products and it manages to fit most of what you need inside frames on a single browse screen.  Very few bugs and it has been in the hands of someone who doesn’t know a lot about DataStage and they have been able to complete mappings and generate DataStage jobs.
I hope to get some screenshots up in the weeks to come but here are some observations in how we have saved time with FastTrack:

Sunday, June 15, 2014

Things need to consider while developing a Datastage job

Datasets are the best when storing the results intermediately. Datasets will keep the partitions and sort order if set. This will save re-partitioning, sorting and would make the job more robust.

Performance of the job can be improved if:
1) Unnecessary column are removed from the up and down stream links.

2) Removing these unnecessary columns will help reducing the memory consumption.
3) Always specify the list of columns in the select statement when reading from database. This will not bring unnecessary column data in the job which will save memory and network consumption.
4) Use RCP very carefully.
5) Understand the data-type before using them in the job. Do the data profiling before bringing data in the job.

Why Data Warehouses are Hard to Deploy ?

Inherent Complexity of Data Warehouse Databases 

Data Warehouse Databases are Large
Since data warehouse databases are constructed by combining selected data from several operational databases, data warehouse databases are inherently large. They are often the largest databases within an organization. The very size of these databases can make them very difficult and expensive to query. 

Tuesday, June 10, 2014

Interview Questions : Unix/Linux : Part-8

1. Display all the files in current directory sorted by size?
ls -l | grep '^-' | awk '{print $5,$9}' |sort -n|awk '{print $2}'

2. Write a command to search for the file 'map' in the current directory?
find -name map -type f

3. How to display the first 10 characters from each line of a file?
cut -c -10 filename

4. Write a command to remove the first number on all lines that start with "@"?
sed '\,^@, s/[0-9][0-9]*//' < filename

Wednesday, June 04, 2014

Surrogate Key Generator - Generate Surrogate Key for Data

In this post, We will see how to generate surrogate key for data, where we have to use surrogate key stage.

A) Design :  
Below design is a demo design of job. Here our data source is a row generator which is generating rows. In real time scenario, Source can be a flat file, DB stages, Passive Stage or can be a Active stage also.
In Row Generator Stage, we are generating a col "Name".

Tuesday, June 03, 2014

Surrogate Key Generator - Delete State File

Adding this post as well with Surrogate Key walk through, In this post we will see how to delete a Surrogate Key State file. Its quite simple as creation.

a) Design :
Again, for the deletion of state file, we need only 1 stage in design and that is Surrogate Key Stage.