
Wednesday, May 15, 2013

Diagnosing DataStage Job Monitor problems


DataStage job run statistics (for example, rows processed per second) do not update in the DataStage Designer or Director clients.

 

Check these details :

This section contains a quick series of diagnostic steps for those familiar with DataStage and the job monitor. If more detail is needed for any step, refer to the fuller instructions in the "Solution" section below. Work through the following checks in order:


  1. Check the DataStage Director job log for jobs which do not show job run statistics and confirm whether the following variable is defined:
    APT_NO_JOBMON
    If APT_NO_JOBMON is defined and set to 1 or true, it disables job monitoring and process metadata reporting for parallel jobs.
  2. Confirm the JobMonApp process is up and running:
    ps -ef | grep JobMonApp
  3. Confirm default ports 13400 and 13401 are listening:
    netstat -an | grep 134
  4. Check job monitor log file for errors:
    cat /ibm/InformationServer/Server/PXEngine/java/JobMonApp.log
  5. Confirm the job monitor is set up to use ports 13400 and 13401:
    cat /ibm/InformationServer/Server/PXEngine/etc/jobmon_ports
  6. If the job monitor log shows no errors but the job log reports "Failed to connect to JobMonApp on port 13401", update the jobmon_ports file to use 2 new ports which are not already in use. This requires a restart of JobMonApp.
  7. If the problem still occurs, confirm that the /etc/hosts file contains the following entry:
    127.0.0.1 localhost
    Without a localhost entry, Job Monitor will be unable to use the ports correctly.
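
The quick checks above can be strung together in a small shell script. The following is a minimal sketch, not an IBM-supplied tool: the check helper, the jobmon_quick_check function and the PX_DIR variable are hypothetical names introduced here, and the path and ports are simply the defaults quoted in the steps.

```shell
#!/bin/sh
# Sketch of the quick checklist. PX_DIR, check() and jobmon_quick_check()
# are illustrative names, not part of DataStage; adjust the path as needed.
PX_DIR="${PX_DIR:-/ibm/InformationServer/Server/PXEngine}"

# Run a command quietly and print OK or FAIL next to a description.
check() {
    desc=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK:   $desc"
    else
        echo "FAIL: $desc"
    fi
}

# Steps 2, 3, 5 and 7. Steps 1, 4 and 6 still need someone to read the logs.
jobmon_quick_check() {
    # The [J] pattern keeps grep from matching its own process entry.
    check "JobMonApp process is running" sh -c "ps -ef | grep '[J]obMonApp'"
    check "port 13400 is listening" sh -c "netstat -an | grep 13400 | grep -i listen"
    check "port 13401 is listening" sh -c "netstat -an | grep 13401 | grep -i listen"
    check "jobmon_ports is readable" test -r "$PX_DIR/etc/jobmon_ports"
    check "/etc/hosts maps 127.0.0.1 to localhost" \
        grep -E '^127\.0\.0\.1[[:space:]].*localhost' /etc/hosts
}
```

Running jobmon_quick_check on the engine host prints one OK or FAIL line per check, which gives a starting point for the detailed sections that follow.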


Solution :

DataStage jobs generate statistics (such as rows per second processed) which can be displayed on each link when a job is run via Designer. However, these statistics only update when the job monitor application, JobMonApp, is running.

JobMonApp is started by the jobmoninit script, located in directory:
    .../ibm/InformationServer/Server/PXEngine/java


Verify that JobMonApp is running

On Unix systems, you can enter the following command to confirm if JobMonApp process is running:
    ps -ef | grep JobMonApp
On Windows systems, you can first enter the "ksh" command to obtain a command shell prompt (if installed), from which you can enter the ps command. If the job monitor is running, you should see a process entry with a long description that includes the command string which started it and the 2 active ports requested. You will also see a second entry, which is from the "grep" command itself, so you should see at least 2 matches if the job monitor is running, for example:
    $ ps -ef | grep JobMonApp

      SYSTEM   4964   4044  0 09:22:27 con  0:00 C:\IBM\InformationServer\ASBNode\apps\jre\bin\java -Xrs -classpath C:/IBM/InformationServer/Server/PXEngine/java/JobMonApp.jar;C:/IBM/InformationServer/Server/PXEngine/java/xerces/xmlParserAPIs.jar;C:/IBM/InformationServer/Server/PXEngine/java/xerces/xercesImpl.jar JobMonApp 13400 13401 -debug
      dsadm    7788   4136  0 16:48:50 con  0:00 grep JobMonApp

However, on some platforms such as Solaris, the ps output may be truncated, and since JobMonApp appears at the end of the command string, the above command may not find a match even though the job monitor is running. In this situation, you can instead look in the .jobmonpid file, located in the same directory as jobmoninit. The .jobmonpid file contains the process id last used for JobMonApp. You can then query that process id to see if it is running. For example, if the .jobmonpid file contains 3174, enter the command:
    $ ps -ef | grep 3174

If JobMonApp is not running, then run the "jobmoninit" command script to restart it.
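
The .jobmonpid lookup can also be scripted. The sketch below is an illustration only: pid_from_file_alive is a hypothetical helper name, and it uses "ps -p" rather than grep so that a process id such as 3174 cannot accidentally match 31740 or a parent process id column.

```shell
#!/bin/sh
# Sketch: decide whether the PID recorded in a .jobmonpid file is still alive.
# pid_from_file_alive is a hypothetical helper name, not a DataStage command.
# Returns 0 if the process exists, 1 if it does not, 2 if the file is unusable.
pid_from_file_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 2          # no PID file to read
    pid=$(tr -d '[:space:]' < "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 2 ;;           # empty or not a number
    esac
    # ps -p selects exactly one process id, avoiding partial matches.
    ps -p "$pid" >/dev/null 2>&1
}
```

Pass it the path of the .jobmonpid file that sits next to jobmoninit. A return value of 1 suggests the recorded process has gone and jobmoninit needs to be run again.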


Check for errors in JobMonApp.log

If JobMonApp is running, but your jobs do not update statistics, then the next place to check for an error is the JobMonApp.log file written to the above directory. Historical logs are also saved in the same directory. During a normal startup, JobMonApp requires that its 2 defined ports be available and not used by other programs. One port is used to communicate with the job, while the other port is used to communicate with the DataStage engine. Both ports must be available, so if the startup messages show that one port is available and the other has a conflict, Job Monitor will not function correctly.

A normal startup log will appear as follows:
    WELCOME to the Job Mon Application.
    Tue Jun 30 09:22:28 PDT 2009
    Using ports: 13400 and 13401

A startup with port conflict will instead contain:
    Tue May 19 13:48:19 CDT 2009
    Using ports: 13400 and 13401
    Could not listen on port: 13400 Address already in use

Additionally, if the failing port is used to communicate with the running job, it may cause an additional error to appear in the job log:
    Failed to connect to JobMonApp on port 13401
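
As a rough illustration, the port numbers named in such startup errors can be pulled out of the log with sed. This sketch assumes only the "Could not listen on port: N" wording shown in the sample above; conflicting_ports is a hypothetical helper name, and other versions may word the message differently.

```shell
#!/bin/sh
# Sketch: list the ports that a JobMonApp.log reports as unavailable.
# conflicting_ports is a hypothetical helper; it relies only on the
# "Could not listen on port: N" wording shown in the sample log.
conflicting_ports() {
    sed -n 's/^.*Could not listen on port: *\([0-9][0-9]*\).*$/\1/p' "$1" | sort -u
}
```

Called with the path to JobMonApp.log, it prints each conflicting port once, or nothing if the startup was clean.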


Resolving port conflicts for JobMonApp

To resolve port conflict issues for JobMonApp, use the following command to determine current usage for each port used by job monitor, i.e.:
    netstat -a | grep 134
Confirm that the output shows both ports 13400 and 13401.

If the ports are found, they should have a status of LISTEN (shown as LISTENING on Windows). If the status is CLOSE_WAIT or something else, it could indicate that an older instance of DataStage or JobMonApp did not successfully release the port. While some operating systems have commands to force the release of the port or to kill an application holding the port, in some cases it may take a system restart to free the port.

If this port conflict continues even after a system restart, then multiple applications may have been set up to use the same port. If you are running multiple DataStage instances on one server, you should check the /etc/services file to confirm your ports have not been allocated to multiple applications. Then look at the following file:
    .../ibm/InformationServer/Server/PXEngine/etc/jobmon_ports
This file contains 2 variables that define the ports used by JobMonApp:
    APT_JOBMON_PORT1=13400
    APT_JOBMON_PORT2=13401
For systems with multiple instances of DataStage running, ensure that each instance is using a separate set of ports for the job monitor application.

If two DataStage instances are using the same job monitor ports, you will need to update this file for one instance. After changing the above port values, you will need to stop and restart JobMonApp for the change to take effect.
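
The two checks above (what jobmon_ports says, and what netstat shows) can be combined. The sketch below is one possible way to do it, assuming the file uses the plain NAME=value lines shown earlier. All three function names are hypothetical, and the file is parsed with sed rather than sourced so that nothing in it is executed.

```shell
#!/bin/sh
# Sketch: read the two port numbers from a jobmon_ports file and report the
# state netstat shows for each. Function names are illustrative only.

# Print the configured ports, one per line.
jobmon_ports() {
    sed -n 's/^[[:space:]]*APT_JOBMON_PORT[12]=\([0-9][0-9]*\).*$/\1/p' "$1"
}

# Read "netstat -an" output on stdin; succeed if PORT has a listening socket.
# Matches both the ":13400" (Linux) and ".13400" (Solaris) address styles.
port_is_listening() {
    grep -E "[.:]$1[[:space:]]" | grep -i 'LISTEN' >/dev/null
}

# Report each configured port as listening or not.
report_jobmon_ports() {
    for port in $(jobmon_ports "$1"); do
        if netstat -an | port_is_listening "$port"; then
            echo "$port listening"
        else
            echo "$port NOT listening"
        fi
    done
}
```

Running report_jobmon_ports against each instance's jobmon_ports file makes it easier to spot two instances that share a port, or a port stuck in a state such as CLOSE_WAIT.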

Also confirm that your /etc/hosts file on DataStage server machine contains the following entry:
    127.0.0.1 localhost
Without a localhost definition, the job monitor may not be able to communicate correctly on the above ports.

If no port conflict exists and the JobMonApp.log file contains no port errors, but the log does contain other errors whose cause is not clear from the message, it may be necessary to contact Information Server technical support.

An additional problem with localhost can occur if the /etc/nsswitch.conf file is not set up to check the hosts file before the DNS name server. The correct entry normally appears as:
    hosts: files dns
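
The lookup order can be verified with a short awk test. This is a minimal sketch under the assumption that the file follows the usual "hosts: source source ..." layout; hosts_files_first is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: confirm that "files" comes before "dns" on the hosts line of an
# nsswitch.conf-style file. hosts_files_first is a hypothetical helper name.
hosts_files_first() {
    awk '
        /^[[:space:]]*hosts:/ {
            files = 0; dns = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "files" && !files) files = i
                if ($i == "dns" && !dns) dns = i
            }
            # Good when files is listed and dns is absent or comes later.
            found = (files && (!dns || files < dns))
            seen = 1
        }
        END { if (seen && found) exit 0; exit 1 }
    ' "$1"
}
```

For example, "hosts_files_first /etc/nsswitch.conf || echo 'check hosts order'" prints a warning when the hosts line is missing or lists dns ahead of files.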


Running JobMonApp with debug output

If no errors appear in log file, or if more detailed error messages are needed, you can run JobMonApp in debug mode. To enable debug output you will modify jobmoninit, so first create a backup copy of jobmoninit. Next, edit jobmoninit and find the section of script for your current operating system, and then locate a line similar to:
    nohup $APT_ORCHHOME/java/jre/bin/java -classpath $CLASSPATH JobMonApp $jobmon_port1 $jobmon_port2 > $logfile 2>&1 &
Add the option "-debug" after second port, i.e.:
    nohup $APT_ORCHHOME/java/jre/bin/java -classpath $CLASSPATH JobMonApp $jobmon_port1 $jobmon_port2 -debug > $logfile 2>&1 &
Some platforms will have 2 commands listed, one with -Xrs option and one without. In that situation you can update both lines. After making this change, stop and restart the job monitor application. Additional debug output should now appear in the log file which may provide more insight into cause of JobMonApp problems.
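
Before editing jobmoninit by hand, the change can be previewed with sed. The sketch below assumes the launch line has the "JobMonApp $jobmon_port1 $jobmon_port2" shape quoted above, which may differ between versions and platforms. add_debug_flag is a hypothetical helper name, and it leaves lines that already carry -debug untouched.

```shell
#!/bin/sh
# Sketch: preview the "-debug" edit without touching jobmoninit itself.
# add_debug_flag is a hypothetical helper; it reads a script on stdin and
# writes the modified copy to stdout. Lines already containing -debug after
# the second port are skipped, so running it twice changes nothing further.
add_debug_flag() {
    sed '/JobMonApp \$jobmon_port1 \$jobmon_port2 -debug/!s/\(JobMonApp \$jobmon_port1 \$jobmon_port2\)/\1 -debug/'
}

# Possible usage, run from the directory holding jobmoninit:
#   cp jobmoninit jobmoninit.bak
#   add_debug_flag < jobmoninit.bak > jobmoninit.new
#   diff jobmoninit.bak jobmoninit.new
```

Reviewing the diff first confirms that only the intended launch lines (including the -Xrs variant, if present) were changed before the new copy replaces the original.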


Tracing job monitor calls originating in DataStage jobs

Enabling debug mode for JobMonApp / jobmoninit will only trace problems which occur within the JobMonApp process. For problems where DataStage jobs cannot connect to the job monitor or do not work correctly with it, an additional trace needs to be enabled for the failing job.

In the job properties dialog, parameters panel, use the Add Environment Variables button to add variable:
OSHMON_TRACE
and set it to value of 1 (or true if presented with selection dialog for value). Compile and re-run the failing job. When OSHMON_TRACE=1 is set, additional trace files will be written to the &PH& directory of the project which owns the failing job.

Review the new files written at time of job failure for additional errors. For example, in the case where connection to job monitor from job fails, the trace may show the following line:
"opensocket() returned 30"
which means that the host name is not recognized (which in this case means a problem with the localhost definition).
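
Searching the trace output for that message can be done with grep. This is a sketch only: find_opensocket_errors is a hypothetical helper name, the names of the trace files are not assumed (every file in the directory is searched), and the project directory must be supplied by the caller.

```shell
#!/bin/sh
# Sketch: list files in a project's &PH& directory that mention opensocket.
# find_opensocket_errors is a hypothetical helper; pass the project directory.
find_opensocket_errors() {
    # Quote the path: an unquoted & would be treated as a shell operator.
    grep -l 'opensocket() returned' "$1/&PH&"/* 2>/dev/null
}
```

The function prints the name of each matching file, which can then be opened to read the return code and the surrounding trace lines.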


Contacting technical support for job monitor problems

When contacting technical support with a job monitor issue, provide the following files and details:

Courtesy : IBM