Nuts & Bolts of DataStage: October 2012

Tuesday, October 30, 2012

ETL process and concepts

ETL is an abbreviation of Extract, Transform and Load. An ETL process extracts data, mostly from different types of systems, transforms it into a structure that is more appropriate for reporting and analysis, and finally loads it into the database and/or cube(s).

ETL stands for extraction, transformation and loading. It is a process that involves the following tasks:

Oracle DataBase books

1) OReilly.Oracle.PL.SQL.Programming.4th.Edition download
2) Oracle PLSQL Best Practices download
3) OReilly-Oracle_Language_Pocket_Reference download
4) Oracle_9i_SQL_Reference download
5) OCA Oracle Database 11g Administration ( 1Z0-051 and 1Z0-052 ) download

Hamster - have fun !!!

This lively pet hamster will keep you company throughout the day. Watch him run on his wheel, drink water, and eat the food you feed him by clicking your mouse. Click the center of the wheel to make him get back on it.

It's my favorite...

DB2 query to select first or last N rows

There may be instances when you wish to select first or last N rows.

You can use the following query to limit the number of rows retrieved using select command.

First N rows
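As a minimal sketch, DB2 limits a result set with the FETCH FIRST n ROWS ONLY clause. The shell function below only builds the statement text. The helper name first_n_rows, the table EMPLOYEE and the column EMPNO are hypothetical, and actually running the statement assumes a connected DB2 command line processor.

```shell
#!/bin/sh
# Build a DB2 query that returns only the first N rows of a table.
# first_n_rows, EMPLOYEE and EMPNO are illustrative names, not from the post.
first_n_rows() {
    table=$1
    order_col=$2
    n=$3
    printf 'SELECT * FROM %s ORDER BY %s FETCH FIRST %s ROWS ONLY\n' \
        "$table" "$order_col" "$n"
}

query=$(first_n_rows EMPLOYEE EMPNO 10)
echo "$query"

# With a DB2 connection available, the statement could be run as:
#   db2 "$query"
```

Sorting the same column in descending order (ORDER BY EMPNO DESC) before FETCH FIRST would return the last N rows instead.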

Special Nix Commands

The following is a set of special commands that the shell provides as standalone statements. Input and output redirection is permitted for all of these commands, unlike the complex commands. You cannot redirect the output from a while loop construct, only the simple or special commands used within the loop list.

The colon ( : ) does nothing! A zero exit code is returned. It can be used to stand in for a command, but I must admit I have not found a real use for it.
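For what it is worth, a few common idioms do rely on the colon. The following is a minimal sketch, assuming a POSIX-style shell. The variable names (colon_status, GREETING, count) are made up for the demonstration.

```shell
#!/bin/sh
# ':' is a shell builtin that does nothing and exits with status 0.
:
colon_status=$?
echo "exit status of ':' is $colon_status"

# Use 1: placeholder where the syntax requires a command.
if [ -d /tmp ]; then
    :   # nothing to do yet
else
    echo "/tmp is missing"
fi

# Use 2: assign a default as a side effect of expanding the argument.
# GREETING is a hypothetical variable used only for this demonstration.
unset GREETING
: "${GREETING:=hello}"
echo "GREETING=$GREETING"

# Use 3: an always-true loop condition.
count=0
while :; do
    count=$((count + 1))
    [ "$count" -ge 3 ] && break
done
echo "loop ran $count times"
```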

WebSphere Application Server (WAS) log files

IBM WebSphere Application Server creates the following log files: trace.log, SystemOut.log, SystemErr.log, activity.log, startServer.log, stopServer.log, native_stdout.log and native_stderr.log.

Let us look at these log files in detail.

Useful Perl Scripts : Part-3

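To access the last element of an array

#!/usr/bin/perl
use strict;
use warnings;

my @array = (1, 2, 3, 4);

# $#array is the index of the last element; $array[-1] is equivalent.
print "$array[$#array]\n";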

Newton’s Cradle - have fun !!!

For every action there is an equal and opposite reaction
try this......

DataStage Configuration file : Explained - 3

Below is the sample diagram for 1 node and 4 node resource allocation:

DataStage Configuration file : Explained - 2

1. When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs. The node from which you start a job is called the conductor node. For other nodes, you do not need to specify the physical node. Also, you need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is possible that the conductor node is not connected to the high-speed network switches, while the other nodes are connected to each other through very high-speed network switches. How do you configure your system so that you achieve optimized parallelism?

1. Make sure that none of the stages are specified to be run on the conductor node.

2. Use conductor node just to start the execution of parallel job.

3. Make sure that the conductor node is not part of the default pool.
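The three points above can be sketched as a configuration file in which the conductor node belongs only to a named pool, while the compute nodes belong to the default pool (the empty string ""). This is an illustration under assumptions, not a tested production file: the host names, node names, paths and the pool name "conductor" are all placeholders.

```shell
#!/bin/sh
# Write a hypothetical MPP configuration in which the conductor node is kept
# out of the default pool. Host names and paths are placeholders.
config_dir=$(mktemp -d)
config_file="$config_dir/mpp_example.apt"

cat > "$config_file" <<'EOF'
{
    node "conductor"
    {
        fastname "conductor_host"
        pools "conductor"
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
    node "node1"
    {
        fastname "compute_host1"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "compute_host2"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
}
EOF

# Count the node definitions that were written.
node_count=$(grep -c '^ *node "' "$config_file")
echo "wrote $node_count node definitions to $config_file"
```

Because stages run in the default pool unless constrained otherwise, only node1 and node2 would be candidates for processing in this sketch, and the conductor would just start the job.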

DataStage Configuration file : Explained - 1

The DataStage configuration file is a master control file for jobs (a text file that sits on the server side) which describes the parallel system resources and architecture. The configuration file provides the hardware configuration for supporting such architectures as SMP (a single machine with multiple CPUs, shared memory and disk), Grid, Cluster or MPP (multiple CPUs, multiple nodes and dedicated memory per node). DataStage understands the architecture of the system through this file.

This is one of the biggest strengths of DataStage. If you change your processing configuration, servers or platform, you do not have to worry about it affecting your jobs, since all jobs depend on this configuration file for execution. DataStage jobs determine which node to run a process on, where to store temporary data and where to store dataset data based on the entries provided in the configuration file. A default configuration file is available whenever the server is installed.

How to convert a (multiple) space separated text into Tab delimited (with WORD)

1) Paste the columnar text into an empty Word document

2) Start the search/replace function

3) In the search field, enter 2 (two) spaces

4) In the replace field, enter ^t (the character t preceded by ^ means a TAB)
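Outside Word, the same conversion can be sketched on the command line. This is a different technique from the steps above: it uses sed with an extended regular expression and collapses every run of two or more spaces into one tab in a single pass. The helper name spaces_to_tabs and the sample data are made up.

```shell
#!/bin/sh
# Replace every run of two or more spaces with a single tab.
# spaces_to_tabs is a hypothetical helper name.
tab=$(printf '\t')

spaces_to_tabs() {
    sed -E "s/ {2,}/${tab}/g"
}

converted=$(printf 'id   name    city\n1    Ann     Oslo\n' | spaces_to_tabs)
printf '%s\n' "$converted"
```

Single spaces inside a value are left alone, so a field such as "New York" would stay intact.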

Input/Output data buffering (on Link) in DataStage

To improve performance and resolve bottlenecks, you can specify how input and output data is buffered. Although the size and operation of the buffer are usually the same for all links on all stages, you can modify the settings for specific links.

By default, data is buffered so that no deadlocks occur. Be careful when changing data buffering settings because specifying inappropriate values might create a deadlock.

Any changes that you make to the properties on the Advanced tab are automatically reflected on the Advanced tab of the stage at the other end of the link.

Using Configuration Files in Data Stage Best Practices & Performance Tuning

The configuration file tells DataStage Enterprise Edition how to exploit underlying system resources (processing, temporary storage, and dataset storage). In more advanced environments, the configuration file can also define other resources such as databases and buffer storage. At runtime, EE first reads the configuration file to determine what system resources are allocated to it, and then distributes the job flow across these resources.

When you modify the system, by adding or removing nodes or disks, you must modify the DataStage EE configuration file accordingly. Since EE reads the configuration file every time it runs a job, it automatically scales the application to fit the system without having to alter the job design.

Download for PowerShell v2 for Windows 7? No need... It's already there!

A while back, Microsoft announced the release of PowerShell v2 for Windows XP, Windows Server 2003, Windows Vista, and Windows Server 2008 (see http://go.microsoft.com/fwlink/?LinkID=151321).
However, it is not clear to everyone that PowerShell v2 is already part of Windows 7 and Windows Server 2008 R2.

Environment Variable for Data Stage Best Practices and Performance Tuning

DataStage provides a number of environment variables to control how jobs operate on a UNIX system. In addition to providing required information, environment variables can be used to enable or disable various DataStage features, and to tune performance settings.
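As a rough illustration, two commonly used parallel engine variables are APT_CONFIG_FILE (the configuration file a job should use) and APT_DUMP_SCORE (print the job score to the log). The sketch below sets them in a shell session. The path is a placeholder, and in practice these values are often set through the Administrator client or as job parameters rather than by hand.

```shell
#!/bin/sh
# Point the parallel engine at a configuration file and ask it to print the
# job score. The path below is a placeholder, not a real installation path.
APT_CONFIG_FILE=/path/to/configurations/default.apt
APT_DUMP_SCORE=1
export APT_CONFIG_FILE APT_DUMP_SCORE

# List the APT_* variables now visible to child processes.
apt_vars=$(env | grep '^APT_' | sort)
printf '%s\n' "$apt_vars"
```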

Data Stage Environment Variable Settings for All Jobs

Interview Questions : DataStage - Part 1

How did you handle reject data?
Ans: Typically a reject link is defined and the rejected data is loaded back into the data warehouse. A reject link has to be defined for every output link from which you wish to collect rejected data. Rejected data is typically bad data, such as duplicate primary keys or null rows where data is expected.

If you have worked with DS 6.0 and later versions, what are Link Partitioner and Link Collector used for?
Ans: Link Partitioner - Used for partitioning the data.
Link Collector - Used for collecting the partitioned data.

To Release Job locks in Datastage

There are three methods to unlock DataStage jobs:

– Using DataStage Administrator Tool.

– Using UV Utility

– Using DataStage Director

Difference between scratch disk and resource scratch disk

The only difference is:

Scratch disk is for temporary storage (like RAM in a PC)

DataSet, FileSet and Seq File in DataStage

Seq File:

Extracting from or loading to a seq file is limited to 2 GB (this depends on the OS; most operating systems now support files larger than 2 GB)

When used as a source, the data is converted from ASCII into native format at compilation time

Does not support null values

A seq file can only be accessed on one node.

Useful Perl Scripts : Part-2

1. To Split the text and join

#!/usr/bin/perl


Posts in this archive are dated October 1, 2, 3, 6, 7, 10, 11, 13, 15, 16, 18, 19, 21, 22, 25, 26, 28, 29 and 30, 2012.