Nuts & Bolts of DataStage: DataSet in DataStage

Monday, December 10, 2012

DataSet in DataStage

Inside an InfoSphere DataStage parallel job, data is moved around in data sets. These carry metadata with them: both column definitions and information about the configuration that was in effect when the data set was created. If, for example, you have a stage that limits execution to a subset of the available nodes, and the data set was created by a stage using all nodes, InfoSphere DataStage can detect that the data will need repartitioning.
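
The repartitioning check described above can be pictured as a comparison of node sets. What follows is only a conceptual sketch, not DataStage's internal logic: the names `DataSetMeta` and `needs_repartitioning`, the column types, and the node names are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSetMeta:
    """Hypothetical stand-in for the metadata a data set carries."""
    columns: tuple               # column definitions, e.g. (("id", "int32"), ...)
    created_on_nodes: frozenset  # nodes in effect when the data set was created


def needs_repartitioning(meta: DataSetMeta, stage_nodes: frozenset) -> bool:
    """Return True when the consuming stage runs on a different node set."""
    return meta.created_on_nodes != stage_nodes


# A data set written by a stage that used all four (hypothetical) nodes.
meta = DataSetMeta(
    columns=(("id", "int32"), ("name", "string")),
    created_on_nodes=frozenset({"node1", "node2", "node3", "node4"}),
)

# A stage constrained to a subset of the available nodes.
subset_stage = frozenset({"node1", "node2"})
print(needs_repartitioning(meta, subset_stage))           # True
print(needs_repartitioning(meta, meta.created_on_nodes))  # False
```

The point is simply that the decision can be made from the metadata alone, which is why the data set has to carry its creation-time configuration with it.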

If required, data sets can be landed as persistent data sets, represented by a Data Set stage. This is the most efficient way of moving data between linked jobs. Persistent data sets are stored in a series of files linked by a control file. Do not attempt to manipulate these files with UNIX tools such as rm or mv; always use the tools provided with InfoSphere DataStage.
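
As an illustration of using the provided tools instead of rm, here is a small sketch that only builds (and does not run) a delete command for the `orchadmin` utility that ships with the parallel engine. It assumes that the `orchadmin rm` subcommand is available in your installation and that APT_CONFIG_FILE is set in the environment; please verify both against your own version. The function name and the example path are hypothetical.

```python
import shlex


def orchadmin_rm_command(descriptor_path: str) -> list:
    """Build (but do not run) an orchadmin command that deletes a data set.

    Assumption: the `orchadmin rm` subcommand exists in your installation
    and APT_CONFIG_FILE points at a valid configuration file.
    """
    if not descriptor_path.endswith(".ds"):
        raise ValueError("expected a data set descriptor (*.ds) path")
    return ["orchadmin", "rm", descriptor_path]


cmd = orchadmin_rm_command("/data/landing/customers.ds")  # hypothetical path
print(shlex.join(cmd))
```

Going through the tool matters because the descriptor only points at the data files; removing the descriptor with rm would leave those data files orphaned on the nodes.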

There are two groups of Datasets: persistent and virtual.

Persistent Datasets are marked with the *.ds extension, while the *.v extension is reserved for virtual Datasets. (Note that no *.v files will be visible in the Unix file system, because virtual Datasets exist only in RAM. The *.v extension itself is specific to OSH, the Orchestrate scripting language.)

The further differences are more significant. Persistent Datasets are stored in Unix files in the internal DataStage EE format, while virtual Datasets are never stored on disk: they exist only within links, also in EE format, but in RAM. Finally, persistent Datasets can be read and rewritten with the Data Set stage, whereas virtual Datasets can only be passed between stages in memory.
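
The naming convention can be summed up in a few lines of code. This is just a sketch of the convention described above; the function name `dataset_kind` and the example names are made up for illustration.

```python
from pathlib import PurePosixPath


def dataset_kind(name: str) -> str:
    """Classify a Dataset name by its extension, per the convention above."""
    suffix = PurePosixPath(name).suffix
    if suffix == ".ds":
        return "persistent"  # landed on disk, readable with the Data Set stage
    if suffix == ".v":
        return "virtual"     # lives only in memory, on a link between stages
    return "unknown"


for name in ("customers.ds", "lnk_sorted.v", "customers.txt"):
    print(name, "->", dataset_kind(name))
```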

A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments.

Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single job. So a segment can contain files from many partitions, and a partition has files from many segments.
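
The partition and segment relationship is easier to see as a grid. The toy model below is a sketch under stated assumptions, not the real on-disk layout: it assumes one data file per (segment, partition) pair, and the class name `DataSetLayout` and the file names are hypothetical.

```python
class DataSetLayout:
    """Toy model: one data file per (segment, partition) pair."""

    def __init__(self, partitions: int):
        self.partitions = partitions
        self.segments = 0
        self.files = {}  # (segment, partition) -> hypothetical file name

    def append_job_run(self) -> int:
        """Each job that writes to the data set adds one new segment."""
        seg = self.segments
        for part in range(self.partitions):
            self.files[(seg, part)] = f"seg{seg}.part{part}"
        self.segments += 1
        return seg

    def files_in_segment(self, seg: int) -> list:
        """All files written by one job, across every partition."""
        return [f for (s, _), f in sorted(self.files.items()) if s == seg]

    def files_in_partition(self, part: int) -> list:
        """All files held on one node, across every segment."""
        return [f for (_, p), f in sorted(self.files.items()) if p == part]


layout = DataSetLayout(partitions=3)
layout.append_job_run()  # first job writes segment 0
layout.append_job_run()  # a later job appends segment 1
print(layout.files_in_segment(0))    # ['seg0.part0', 'seg0.part1', 'seg0.part2']
print(layout.files_in_partition(1))  # ['seg0.part1', 'seg1.part1']
```

Reading the grid by row gives a segment (files from many partitions), and reading it by column gives a partition (files from many segments).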

Firstly, because a single Dataset contains multiple records, all of them must undergo the same processing: every record goes through the same sequence of stages.
Secondly, different Datasets usually have different schemas, so they cannot be treated interchangeably.

Datasets are also known by these alias names:

1) Orchestrate File
2) Operating System file

A Dataset is not a single file; it is made up of multiple files:
a) Descriptor File
b) Data File
c) Control file
d) Header Files

The Descriptor File holds the schema details and the address of the data.
The Data File holds the data in native format.
The Control and Header files reside in the operating system.

Starting a Dataset Manager

Choose Tools ► Data Set Management. A Browse Files dialog box appears.

Navigate to the directory containing the data set you want to manage. By convention, data set files have the suffix .ds.
Select the data set you want to manage and click OK. The Data Set Viewer appears. From here you can copy or delete the chosen data set. You can also view its schema (column definitions) or the data it contains.

till then.....
njoy the simplicity.......