We have moved to www.dataGenX.net, Keep Learning with us.

Saturday, June 30, 2012

Major business and technical advantages and disadvantages of using DataStage ETL tool

Source :  Major business and technical advantages and disadvantages of using DataStage ETL tool


Business advantages of using DataStage as an ETL tool:

  Significant ROI (return on investment) over hand-coding
  Short learning curve - quick development and reduced maintenance with a GUI tool
  Development partnerships - easy integration with top market products interfaced with the data warehouse, such as SAP, Cognos, Oracle, Teradata, SAS
  Single-vendor solution for bulk data transfer and complex transformations (DataStage versus DataStage TX)
  Transparent and wide range of licensing options 

Technical advantages of using the DataStage tool to implement ETL processes

  Single interface to integrate heterogeneous applications
  Flexible development environment - it enables developers to work in their desired style, reduces training needs and enhances reuse. ETL developers can follow data integrations quickly through a graphical work-as-you-think solution which comes by default with a wide range of extensible objects and functions
  Team communication and documentation of the jobs are supported by a self-documenting engine that renders data flows and transformations in HTML format.
  Ability to join data both at the source, and at the integration server and to apply any business rule from within a single interface without having to write any procedural code.
  Common data infrastructure for data movement and data quality (metadata repository, parallel processing framework, development environment)
  With DataStage Enterprise Edition users can use the parallel processing engine, which provides near-unlimited performance and scalability and helps get the most out of hardware investments and resources.
  The DataStage server performs very well on both Windows and UNIX servers. 

Major Datastage weaknesses and disadvantages

  Big architectural differences between the Server and Enterprise editions, which means that migration from Server to Enterprise edition may require a vast investment of time and resources.
  There is no automated error handling and recovery mechanism - for example, no way to automatically time out zombie jobs or kill locking processes. However, these errors can usually be resolved easily at the operator level.
  No Unix DataStage client - the client software is available only for Windows, and there are different clients for different DataStage versions. The good thing is that they can still be installed on the same Windows PC and switched with the Multi-Client Manager program.
  Might be expensive as a solution for a small or mid-sized company.

njoy the simplicity.......
Atul Singh

DataStage FAQ1

1. What are the ways to execute DataStage jobs?
A job can be run using a few different methods:
* from DataStage Director (menu Job -> Run now...)
* from command line using a dsjob command
* a DataStage routine can run a job (DSRunJob function)
* by a job sequencer
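
As a sketch, the command-line method might look like this (the project name, job name, and parameter are placeholders; `-run`, `-jobstatus`, and `-jobinfo` are standard dsjob options):

```
# Run the job, passing a parameter, and wait for its completion status
dsjob -run -jobstatus -param SRC_DIR=/data/in MyProject MyJob

# Inspect the result afterwards
dsjob -jobinfo MyProject MyJob
```

With -jobstatus, dsjob's exit code reflects the job's finishing status, which makes it convenient to call from scheduler scripts.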

2. How to invoke a DataStage shell command?
DataStage shell commands can be invoked from:
* DataStage administrator (projects tab -> Command)
* Telnet client connected to the DataStage server
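
For example, from the Administrator's Command window you can issue engine-level commands (these come from the underlying UniVerse engine; availability may vary by version):

```
LIST.READU EVERY
```

This particular command lists record and file locks, which is handy when a job hangs because of a locked resource.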

Friday, June 29, 2012

10 Ways to Make DataStage Run Slower

Source : 10 Ways to Make DataStage Run Slower - Vincent McBurney

Everyone wants to tell you how to make your ETL jobs run faster - well, here is how to make them slower!
The Structured Data blog has posted a list, Top Ways How Not To Scale Your Data Warehouse, that is a great chat about bad ways to manage an Oracle Data Warehouse. It inspired me to find 10 ways to make DataStage jobs slower! How do you put the brakes on a DataStage job that is supposed to be running on a massively scalable parallel architecture?

1. Use the same configuration file for all your jobs.

You may have two nodes configured for each CPU on your DataStage server, and while this lets your high-volume jobs run quickly, it works great for slowing down your small-volume jobs. A parallel job with a lot of nodes to partition across is a bit like the solid wheel on a velodrome racing bike: it takes a lot of time to crank up to full speed, but once you are there it is lightning fast. If you are processing a handful of rows, the configuration file will instruct the job to partition those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a single node can run for 5-10 seconds across a lot of nodes, and a squadron of these jobs will slow down your entire DataStage batch run!
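
A minimal sketch of a separate one-node configuration file you could point small jobs at (the hostname and paths are placeholders; the file is selected via the APT_CONFIG_FILE job parameter):

```
{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
}
```

Keeping a small library of configuration files (1-node, 2-node, full) and choosing one per job avoids the one-size-fits-all trap described above.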

2. Use a sparse database lookup on high volumes.

This is a great way to slow down any ETL tool; it works on server jobs or parallel jobs. The main difference is that server jobs only do sparse database lookups - the only way to avoid a sparse lookup is to dump the table into a hash file. Parallel jobs by default do cached lookups, where the entire database table is moved into a lookup fileset, either in memory or, if it's too large, into scratch space on the disk. You can slow parallel jobs down by changing the lookup to a sparse lookup, and for every row processed it will send a lookup SQL statement to the database. So if you process 10 million rows you can send 10 million SQL statements to the database! That will put the brakes on!

3. Keep resorting your data.

Sorting is the Achilles heel of just about any ETL tool. The average ETL job is like a busy restaurant: it makes a profit by getting the diners in and out quickly and serving multiple seatings. If the restaurant fits 100 people, it can feed several hundred in a couple of hours by processing each diner quickly and getting them out the door. The sort stage is like having to wait until every person who is going to eat at that restaurant that night has arrived and has been put in order of height before anyone gets their food. You need to read every row before you can output your sort results. You can really slow your DataStage parallel jobs down by putting in more than one sort, or by giving a job data that is already sorted by the SQL select statement but sorting it again anyway!

4. Design single threaded bottlenecks

This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good bottlenecks: a large volume Server Job that hasn't been made parallel by multiple instance or interprocess functionality. A script FTP of a file that keeps an entire DataStage Parallel engine waiting. A bulk database load via a single update stream. Reading a large sequential file from a parallel job without using multiple readers per node.

5. Turn on debugging and forget that it's on

In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever! Just turn it on to debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the bottleneck!

6. Let the disks look after themselves

Never look at what is happening on your disk I/O - that's a Pandora's Box of better performance! You can get some beautiful drag and slow down by ignoring your disk I/O as parallel jobs write a lot of temporary data and datasets to the scratch space on each node and write out to large sequential files. Disk striping or partitioning or choosing the right disk type or changing the location of your scratch space are all things that stand between you and slower job run times.

7. Keep Writing that Data to Disk

Staging of data can be a very good idea. It can give you a rollback point for failed jobs, it can give you a transformed dataset that can be picked up and used by multiple jobs, and it can give you a modular job design. It can also slow down parallel jobs like no tomorrow - especially if you stage to sequential files! All that repartitioning to turn native parallel datasets into a dumb ASCII file, and then the import and repartition to pick it up and process it again. Sequential files are the Forrest Gump of file storage: simple and practical but dumb as all hell. It costs time to write to one and time to read and parse them, so designing an end-to-end process that writes data to sequential files repeatedly will give you massive slow-down times.

8. Validate every field

A lot of data comes from databases. Often DataStage pulls straight out of these databases or saves the data to an ASCII file before being processed by DataStage. One way to slow down your job and slow down your ETL development and testing is to validate and transform metadata even though you know there is nothing wrong with it. For example, validating that a field is VARCHAR(20) using DataStage functions even though the database defines the source field as VARCHAR(20). DataStage has implicit validation and conversion of all data imported that validates that it's the metadata you say it is. You can then do explicit metadata conversion and validation on top of that. Some fields need explicit metadata conversion - such as numbers in VARCHAR fields and dates in string fields and packed fields, but most don't. Adding a layer of validation you don't need should slow those jobs down.

9. Write extra steps in database code

The same phrase gets uttered on many an ETL project. "I can write that in SQL", or "I can write that in Java", or "I can do that in an Awk script". Yes, we know, we know that just about any programming language can do just about anything - but leaving a complex set of steps as a prequel or sequel to an ETL job is like leaving a turd on someone's doorstep. You'll be long gone when someone comes to clean it up. This is a surefire way to end up with a step in the end-to-end integration that is not scalable, is poorly documented, cannot be easily modified and slows everything down. If someone starts saying "I can write that in..." just say "okay, if you sign a binding contract to support it for every day that you have left on this earth".

10. Don't do Performance Testing

Do not take your highest volume jobs into performance testing, just keep the default settings, default partitioning and your first draft design and throw that into production and get the hell out of there.

njoy the simplicity.......
Atul Singh

File Stages : DataStage

Data set stage
The Data Set stage is a file stage. It allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode.
What is a data set? Parallel jobs use data sets to manage data within a job. You can think of each link in a
job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent
form, which can then be used by other InfoSphere DataStage jobs. Data sets are operating system files,
each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be
key to good performance in a set of linked jobs. You can also manage data sets independently of a job
using the Data Set Management utility, available from the InfoSphere DataStage Designer or Director.

Sequential file stage
The Sequential File stage is a file stage. It allows you to read data from or write data to one or more flat
files. The stage can have a single input link or a single output link, and a single rejects link.

File set stage
The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can
have a single input link, a single output link, and a single rejects link. It only executes in parallel mode.
What is a file set? InfoSphere DataStage can generate and name exported files, write them to their
destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data
files and the file that lists them are called a file set. This capability is useful because some operating
systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.

Lookup file set stage
The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a
lookup. The stage can have a single input link or a single output link. The output link must be a
reference link. The stage can be configured to execute in parallel or sequential mode when used with an
input link.
When creating Lookup file sets, one file will be created for each partition. The individual files are
referenced by a single descriptor file, which by convention has the suffix .fs.

External source stage
The External Source stage is a file stage. It allows you to read data that is output from one or more
source programs. The stage calls the program and passes appropriate arguments. The stage can have a
single output link, and a single rejects link. It can be configured to execute in parallel or sequential mode.

External Target stage
The External Target stage is a file stage. It allows you to write data to one or more external programs. The
stage can have a single input link and a single rejects link. It can be configured to execute in parallel or
sequential mode. There is also an External Source stage, which allows you to read from an external program.

Complex Flat File stage
The Complex Flat File (CFF) stage is a file stage. You can use the stage to read a file or write to a file, but
you cannot use the same stage to do both.
As a source, the CFF stage can have multiple output links and a single reject link. You can read data from
one or more complex flat files, including MVS™ data sets with QSAM and VSAM files. You can also read data from files that contain multiple record types. The source data can contain one or more of the
following clauses:
* GROUP
* REDEFINES
* OCCURS
* OCCURS DEPENDING ON
CFF source stages run in parallel mode when they are used to read multiple files, but you can configure
the stage to run sequentially if it is reading only one file with a single reader.
As a target, the CFF stage can have a single input link and a single reject link. You can write data to one
or more complex flat files. You cannot write to MVS data sets or to files that contain multiple record
types.

Thursday, June 28, 2012

Configure DB2 remote connectivity with WebSphere DataStage Enterprise Edition

Hi guys, this is a nice doc on the IBM developerWorks site for DataStage administration work.

njoy the simplicity.......
Atul Singh

Partitioning & Pipelining in DataStage

Introduction to Enterprise Edition

* Parallel processing = executing your application on multiple CPUs
* Scalable processing = adding more resources (CPUs and disks) to increase system performance

Example system containing 6 CPUs (or processing nodes) and disks:
* Run an application in parallel by executing it on 2 or more CPUs
* Scale up the system by adding more CPUs
* New CPUs can be added as individual nodes, or added to an SMP node

Traditional Batch Processing

Write to disk and read from disk before each processing operation:
* Sub-optimal utilization of resources
  * a 10 GB stream leads to 70 GB of I/O
  * processing resources can sit idle during I/O
* Very complex to manage (lots and lots of small jobs)
* Becomes impractical with big data volumes
  * disk I/O consumes the processing
  * terabytes of disk required for temporary staging

Pipeline Multiprocessing

Think of a conveyor belt moving rows from process to process!
* Transform, clean, and load processes execute simultaneously
* Rows move forward through the flow
* A downstream process starts while an upstream process is still running
* This eliminates intermediate storing to disk, which is critical for big data
* This also keeps the processors busy
* Still has limits on scalability

Partition Parallelism

* Divide large data into smaller subsets (partitions) across resources
* Goal is to evenly distribute data
* Some transforms require all data within the same group to be in the same partition
* Requires the same transform on all partitions
* BUT: each partition is independent of the others; there is no concept of "global" state
* Facilitates near-linear scalability (in correspondence to hardware resources)
  * 8X faster on 8 processors
  * 24X faster on 24 processors
Enterprise Edition Combines Partition and Pipeline Parallelisms

Within the Parallel Framework, pipelining and partitioning are always automatic. The job developer need only identify:
* Sequential vs. parallel operations (by stage)
* Method of data partitioning
* Configuration file (there are advanced topics here)
* Advanced per-stage options (buffer tuning, combination, etc.)

Job Design vs. Execution

User assembles the flow using DataStage Designer

 … at runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)

njoy the simplicity.......
Atul Singh

Tuesday, June 26, 2012

date commands

A useful command-line tool is date, which is typically used for displaying the current system date, or setting it. The date and time are displayed in the system default format, e.g. "Wed May  9 19:42:23 GMT 2012", but it is possible to apply your own formatting, and also to specify a different date to use, without adjusting the system clock.

The date command writes the current date and time to standard output if called with no flags or with a flag list that begins with a + (plus sign). Otherwise, it sets the current date. Only a root user can change the date and time.
Suppose, for example, you wanted to display only the current day of the week.
If you follow the date command with a + (plus sign) and a field descriptor, you can control the output of the command. You must precede each field descriptor with a % (percent sign). The system replaces the field descriptor with the specified value. Enter a literal % as %% (two percent signs). The date command copies any other characters to the output without change. The date command always ends the string with a new-line character.

date +%A

The + switch tells the date command to apply the following format to the current date. %A tells date that the format to use is the locale's full weekday name. A full list of the formatting modifiers is at the end of this article. It's Friday today, so entering the above command at the command prompt would display this:

$ date +%A
Friday

If you wanted to display the date in the format YYYY-MM-DD, with a 4 digit year and 2 digit months with leading zeros, you would do this:

$ date +%Y-%m-%d

Specifying different dates

That was pretty easy, but the above examples only show the current system date. What if you wanted to show yesterday's date? There's another switch for date which allows you to specify a date other than the current one, the -d switch. The great thing with -d is you can use words to specify previous or future dates, as per the examples below.
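
For instance, with GNU date's -d switch (the word forms below are a sketch; support varies between date implementations, so check yours first):

```shell
# Yesterday's date in ISO format
date -d "yesterday" +%Y-%m-%d

# A week from now
date -d "next week" +%Y-%m-%d

# Two days ago, as a full weekday name
date -d "2 days ago" +%A
```

None of these change the system clock; they only change which instant the formatting is applied to.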

Using date in other commands

Within the bash/ksh shell you can embed commands within other commands using backticks. As a very simple example, we'll use the echo command. The first example is without backticks, so it will just echo the word "date"; the second example uses backticks and does echo the date. You wouldn't normally do this, because date echoes the output anyway.

$ echo date

$ echo `date`
Wed May 9 19:42:23 GMT 2012

Date format specifiers

The following are the available date format specifiers (some are supported only by specific implementations of date; please check before using them):

%%     a literal %
%a     locale's abbreviated weekday name (e.g., Sun)
%A     locale's full weekday name (e.g., Sunday)
%b     locale's abbreviated month name (e.g., Jan)
%B     locale's full month name (e.g., January)
%c     locale's date and time (e.g., Thu Mar  3 23:05:25 2005)
%C     century; like %Y, except omit last two digits (e.g., 21)
%d     day of month (e.g., 01)
%D     date; same as %m/%d/%y
%e     day of month, space padded; same as %_d
%F     full date; same as %Y-%m-%d
%g     last two digits of year of ISO week number (see %G)
%G     year of ISO week number (see %V); normally useful only with %V
%h     same as %b
%H     hour (00..23)
%I     hour (01..12)
%j     day of year (001..366)
%k     hour ( 0..23)
%l     hour ( 1..12)
%m     month (01..12)
%M     minute (00..59)
%n     a newline
%N     nanoseconds (000000000..999999999)
%p     locale's equivalent of either AM or PM; blank if not known
%P     like %p, but lower case
%r     locale's 12-hour clock time (e.g., 11:11:04 PM)
%R     24-hour hour and minute; same as %H:%M
%s     seconds since 1970-01-01 00:00:00 UTC
%S     second (00..60)
%t     a tab
%T     time; same as %H:%M:%S
%u     day of week (1..7); 1 is Monday
%U     week number of year, with Sunday as first day of week (00..53)
%V     ISO week number, with Monday as first day of week (01..53)
%w     day of week (0..6); 0 is Sunday
%W     week number of year, with Monday as first day of week (00..53)
%x     locale's date representation (e.g., 12/31/99)
%X     locale's time representation (e.g., 23:13:48)
%y     last two digits of year (00..99)
%Y     year
%z     +hhmm numeric timezone (e.g., -0400)
%Z     alphabetic time zone abbreviation (e.g., EDT)
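
Specifiers can also be combined freely in one format string, for example:

```shell
# ISO date and time using the %F and %T shorthands
date +"%F %T"

# A more verbose form: full weekday, day, full month name, year
date +"%A, %d %B %Y"
```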

njoy the simplicity.......
Atul Singh

chsh Command

The chsh command changes a user's login shell attribute. The shell attribute defines the initial program that runs after a user logs in to the system. This attribute is specified in the /etc/passwd file. By default, the chsh command changes the login shell for the user who gives the command.

Shell used by me
[root@home ~]# echo $SHELL

To get the chsh help
[root@home ~]# chsh --help
Usage: chsh [ -s shell ] [ --list-shells ] [ --help ] [ --version ][ username ]

To list the available shells in system
[root@home ~]# chsh -l

Steps to change the Linux login shell.
The example below shows the use of the chsh command to change the shell for the current user.
[atul@home ~]$ chsh
Changing shell for atul.
New shell [/bin/bash]: /bin/csh
Shell changed.

[atul@home ~]$ chsh -s /bin/bash  ---> chsh [ -s shell ]
Changing shell for atul.
Shell changed.

The chsh command, issued with no option or with the -s option and the full pathname of the desired shell, is used to change the shell for the user. It will prompt the user for their password (note: the password prompt appears only for non-root users). In the examples above we change the shell to /bin/csh (C shell) and then to /bin/bash (Bourne-Again Shell). NOTE: the change takes effect after you log out and log in again.
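
To confirm the change took effect, you can read the shell field straight from the passwd database (a sketch; getent is available on most Linux systems):

```shell
# Field 7 of the passwd entry is the login shell
getent passwd "$(id -un)" | cut -d: -f7
```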

Steps to change the shell for another user.
[root@home ~]# chsh -s /bin/csh atul  -->  chsh [ -s shell ][ username ]
Changing shell for atul.
Shell changed.

njoy the simplicity.......
Atul Singh

ps command

The ps (process status) command is used to provide information about the currently running processes, including their process identification numbers (PIDs).
A process, also referred to as a task, is an executing (i.e., running) instance of a program. Every process is assigned a unique PID by the system.

The basic syntax of ps is
    ps [options]
-a       Displays all processes on a terminal, with the exception of group leaders.
-c       Displays scheduler data.
-d       Displays all processes with the exception of session leaders.
-e       Displays all processes.
-f       Displays a full listing.
-g list  Displays data for the list of group leader IDs.
-j       Displays the process group ID and session ID.
-l       Displays a long listing.
-p list  Displays data for the list of process IDs.
-s list  Displays data for the list of session leader IDs.
-t list  Displays data for the list of terminals.
-u list  Displays data for the list of usernames.

The Default Output
atul@atul:/home/atul $ ps
 3567696  pts/2  0:00 ps
 4391144  pts/2  0:00 -ksh

The Process list output for user smaadmp7
atul@atul:/home/atul $ ps -u smaadmp7
 1930  299258      -  0:01 dsapi_server
 1930  671930      -  0:01 dsapi_slave
 1930  696452      -  0:17 dsapi_slave
 1930  790608      -  0:00 dsapi_server
 1930 1200174      -  0:01 dsapi_server
 1930 1634556      -  0:15 dsapi_slave
 1930 1687662      -  0:00 dsapi_server
 1930 1757284      -  0:00 dsapi_slave
 1930 1790152      -  0:00 dsapi_server
 1930 2081022      -  0:00 dsapi_server
 1930 2158782      -  0:47 dsapi_slave
 1930 2289812      -  0:00 dsapi_server

The -a option tells ps to list the terminal-attached processes of all users on the system rather than just those of the current user
atul@atul:/home/atul $ ps -a
 1294566  pts/8  0:00 sleep
 1884294  pts/5  0:00 ksh
 2240748  pts/3  0:19 topas_nmon
 3813422         0:00 <defunct>
 4661352 pts/12  0:00 tail
 4710552  pts/8  0:00 ksh
 5263386  pts/3  0:00 ksh
 5558486 pts/11  0:01 topas_nmon
 6266958  pts/2  0:00 ps
 6885508 pts/13  0:00 ksh
 7254052  pts/4  0:00 ksh
 8024154 pts/11  0:00 ksh

Own output format
If you are bored by the regular output, you can simply change the format using the formatting characters supported by the ps command.
If you execute the ps command with the -o option you can tell the ps command which columns you want to see:

atul@atul:/home/atul $ ps -o "%u : %U : %p : %a"
   RUSER       USER       PID   COMMAND
atul : atul : 2256934 : ps -o %u : %U : %p : %a
atul : atul : 4391144 : -ksh

Long listing
The -l option generates a long listing.
The additional columns of most interest are NI and SZ. The former shows the nice value of the process, which determines the priority of the process. The higher the value, the lower the priority. The default nice value is 0 on *nix systems.
The latter displays the size of the process in memory. The value of the field is the number of pages the process is occupying. On Linux systems a page is 4,096 bytes.
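
As a sketch, converting an SZ value to bytes (the SZ figure used here is the ksh entry from the listing below; getconf reports your system's actual page size):

```shell
# SZ is reported in pages; multiply by the page size for bytes
pagesize=$(getconf PAGESIZE)   # 4096 on typical Linux systems
sz_pages=792                   # example SZ value from a ps -l listing
echo $(( sz_pages * pagesize ))
```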

atul@atul:/home/atul $ ps -l
       F S  UID     PID    PPID   C PRI NI ADDR        SZ WCHAN  TTY    TIME CMD
  200001 A 2291 3305488 4391144   3  61 20 432309400 1788        pts/2  0:00 ps
  240001 A 2291 4391144 5161086   0  60 20 3430d7400  792        pts/2  0:00 ksh

The -e option generates a list of information about every process currently running. The -f option generates a listing that contains fewer items of information for each process than the -l option.

atul@atul:/home/atul > ps -ef | more
     UID     PID    PPID   C    STIME    TTY  TIME CMD
    root       1       0   0   Apr 28      -  6:21 /etc/init
    root  102518       1   0   Apr 28      - 23:31 /usr/sbin/syncd 60
    root  106530       1   0   Apr 28      -  0:00 /usr/sbin/uprintfd
    root  127116       1   0   Apr 28      -  0:10 /usr/ccs/bin/shlap64
    root  135190  204998   0   Apr 28      -  4:39 /opt/IBM/ITM/aix526/ux/bin/stat_daemon 12
    root  147644  323618   0   Apr 28      -  0:02 /usr/sbin/portmap
    root  163990       1   0   Apr 28      -  0:00 /usr/lib/errdemon
    root  192710  323618   0   Apr 28      -  0:00 haemd HACMP 1 CCMP0 SECNOSUPPORT
    root  200780       1   0   Apr 28      -  0:35 /usr/tivoli/tsm/client/ba/bin/dsmc sched -servername=ibmtsmp1
    root  204998       1   0   Apr 28      - 2352:55 /opt/IBM/ITM/aix526/ux/bin/kuxagent
    root  209066  135190   0   Apr 28      -  0:06 /opt/IBM/ITM/aix526/ux/bin/ifstat 30 7
    root  217206  286754   0 08:18:21      -  0:00 sshd: mahendra [priv]
     mqm  225350       1   0   Apr 28      -  7:37 /opt/IBM/ITM_ibmccmpc/aix523/mq/bin/kmqagent /opt/IBM/ITM_ibm
    root  229520  323618   0   Apr 28      -  0:00 /usr/sbin/rsct/bin/IBM.ServiceRMd
    root  233604  323618   0   Apr 28      -  5:59 /usr/sbin/snmpd -c /etc/snmpd.conf
    root  241798  135190   0   Apr 28      -  0:59 /opt/IBM/ITM/aix526/ux/bin/nfs_stat AIX 30 9
    root  249928       1   0   Apr 28      -  0:11 /tivoli_ep_B/opt/Tivoli/lcf/bin/aix4-r1/mrt/lcfd
    root  253990       1   0   Apr 28      -  3:53 /opt/IBM/ITM/aix526/ul/bin/kulagent
    root  258098  135190   0   Apr 28      -  2:00 /opt/IBM/ITM/aix526/ux/bin/kux_vmstat 30 6

njoy the simplicity.......
Atul Singh