
Wednesday, April 30, 2014

Performance Tuning in DataStage


JOB LEVEL

  • Parameterize all inputs needed by a job; avoid hard-coding values such as user names, passwords and directory paths (these can then be supplied at run time, as in the dsjob sketch after this list).
  • By using the environment variable ‘APT_CONFIG_FILE’ as a job parameter, the user can dynamically change the number of nodes used to process a particular job (a sample configuration file is sketched after this list).
  • For reading/writing data from large tables/files, make use of the environment variable ‘APT_BUFFER_MAXIMUM_MEMORY’. It controls the memory buffer size used between stages (3 MB per buffer by default) and can be increased for large data volumes.
  • It is recommended to set the environment variable $APT_DUMP_SCORE to a value of 1. When this environment variable is set, an entry is placed in the WebSphere DataStage job log showing the actual runtime structure (processes, their associated internal operators, datasets, nodes, etc) used to execute the job flow.

  • Use the environment variable APT_RECORD_COUNTS to display the number of records per partition for each component.
  • Adopt a consistent naming convention for stage variables, routines, job parameters and other parameters used in DataStage jobs and UNIX shell scripts.
  • Use appropriate extensions for all different files. Use ‘.txt’ for sequential files, ‘.ds’ for datasets, ‘.err’ for error files, ‘.rej’ for reject files and ‘.fs’ for filesets.
  • Set the environment variable APT_PM_SHOW_PIDS = 1. When set, the process ID of each player process is written to the job log (DataStage phantom processes started by server shared containers are not shown).
  • Set the environment variable APT_PM_PLAYER_TIMING = 1. When set, it prints detailed information in the job log for each operator, including CPU utilization and elapsed processing time.
  • Set the environment variable APT_PM_PLAYER_MEMORY = 1. When set, it prints detailed information in the job log for each operator when additional memory is allocated.
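For reference, here is a minimal sketch of a two-node parallel configuration file of the kind pointed to by APT_CONFIG_FILE. The host name and the disk/scratch paths are illustrative only and must be replaced with values from your own environment:

    {
      node "node1"
      {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/node1" {pools ""}
        resource scratchdisk "/scratch/ds/node1" {pools ""}
      }
      node "node2"
      {
        fastname "etl_host"
        pools ""
        resource disk "/data/ds/node2" {pools ""}
        resource scratchdisk "/scratch/ds/node2" {pools ""}
      }
    }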

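Once the $APT_* variables have been added to the job as environment-variable job parameters, they (and any ordinary parameters) can be overridden per run from a UNIX shell with dsjob. This is only a sketch: the install path, project, job and parameter names below are illustrative, not taken from any particular environment:

    # Source the engine environment first (the install path may differ on your server).
    . /opt/IBM/InformationServer/Server/DSEngine/dsenv

    # Run the job with a 4-node configuration file and run-time diagnostics switched on.
    # pSrcDir and pDBUser are hypothetical ordinary job parameters; the $APT_* names must
    # already exist as environment-variable parameters in the job design.
    dsjob -run \
          -param '$APT_CONFIG_FILE=/etl/config/4node.apt' \
          -param '$APT_DUMP_SCORE=1' \
          -param '$APT_RECORD_COUNTS=1' \
          -param pSrcDir=/data/in \
          -param pDBUser=etl_user \
          MyProject MyParallelJob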


STAGE LEVEL

  • Minimize type conversions.
  • Minimize warnings in the job log.
  • Column and stage variable data types should be the same.
  • Whenever possible, use a SQL Filter (WHERE clause) to limit the number of rows sent to the DataStage job
  • Use a SQL join to combine data from tables with a small to medium number of rows, especially when the join columns are indexed.
  • Avoid SQL sorts – the parallel (EE) Sort stage is much faster and runs in parallel.
  • Use the EE Sort and Join stages to combine data from very large tables, or when the join condition is complex.
  • Always use a dataset, rather than a sequential file, to store intermediate data. A dataset stores data in a persistent, partitioned form native to DataStage, so it can be read and written in parallel with lower I/O, making full use of the parallel processing provided by DataStage.
  • If a dataset is created using a particular configuration file, the same configuration file must be used when reading that dataset elsewhere in another job (see the orchadmin sketch after this list).
  • Do not read from a Sequential File stage and follow it with ‘Same’ partitioning. Unless more than one source file is specified, ‘Same’ will keep the entire file in a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
  • When multiple files are read by a single Sequential File Stage (using multiple files or by using a File Pattern), each file’s data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.
  • Avoid using a lookup stage for reference tables with large volumes of data. Use a join stage instead.
  • Use a merge stage instead of a join stage if rejected records need to be captured.
  • Avoid Transformer stages wherever they can be eliminated; removing an unnecessary Transformer can considerably improve job performance.
  • Copy Stage should be used instead of a Transformer for simple operations including Job Design placeholder between stages, Renaming Columns, Dropping Columns and Default Type Conversions.
  • Filter and/or Switch Stages can be used to separate rows into multiple output links based on SQL-like link constraint expressions, instead of using a transformer stage for the same purpose.
  • Use ‘Round robin’ partition to minimize data skew.
  • Use ‘Hash’ partitioning for the following stages: Join, Aggregator, Sort, Merge and Remove Duplicates. This improves performance because all rows with the same key values are processed in the same partition.
  • For optimal use of a Lookup stage, the driver (input) link should be hash-partitioned and the reference link should use ‘Entire’ partitioning.
  • In a Remove Duplicates stage, partition the data on all the key columns, in the same order in which they are used to remove duplicates.
  • Use a Column Generator stage for generating unique numbers in parallel jobs: set the initial value to ‘part’, the increment to ‘partcount’ and the generator type to ‘cycle’, so that each partition produces a non-overlapping sequence.
  • In the Aggregator stage, if null fields are to be output by the stage, the property ‘Allow null output’ must be set explicitly.
  • It is recommended to specify multiple readers per node when a job reads large files through a Sequential File stage. This helps maximize the use of parallel processing.
  • If the key fields are of numeric type, then make use of Modulus partitioning instead of hash partitioning. This will improve performance.
  • Use lookup file sets for lookups against large reference tables. This will improve performance considerably.
  • The master file in a Merge stage must be sorted and duplicates removed before the actual merge operation takes place.
  • If data is going to be read back in parallel, it should never be written through a Sequential File stage; a Data Set or File Set stage is the better choice.
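
Because a dataset is tied to the configuration file it was written with, the orchadmin command-line utility is a convenient way to inspect or remove datasets from the engine host. The paths below are illustrative, and the exact options can vary between versions:

    # Point to the same configuration file that was used when the dataset was written.
    export APT_CONFIG_FILE=/etl/config/4node.apt

    # Show the schema and the segment files behind the dataset.
    orchadmin describe /data/ds/stage/customers.ds

    # Print a few records to verify the contents.
    orchadmin dump -n 10 /data/ds/stage/customers.ds

    # Delete the dataset properly (removes the data segments on every node),
    # rather than just deleting the .ds control file with rm.
    orchadmin rm /data/ds/stage/customers.ds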


    More tips are HERE



