DataStage End-to-End Interview Questions & Answers
3) Node Configuration: **
o A node is a software entity created at the operating-system level.
o Node configuration is the technique of creating logical CPUs.
o A node is a logical CPU, i.e., an instance of a physical CPU.
o Hence, "the process of creating virtual CPUs is called Node Configuration."
o The node configuration concept is exclusive to DataStage; it is one of its best features compared with other ETL tools.
o With partition parallelism, the same job runs effectively across multiple CPUs.
o Splitting the source data into subsets is known as partitioning.
o Partitioning distributes the data across the nodes, based on the chosen partitioning technique.
o Each partition of the data is processed by the same node.
o Partition parallelism facilitates near-linear scalability.
ex: 8 times faster on 8 processors
24 times faster on 24 processors
Key-based partitioning techniques:
Hash - records with the same hash-key values are received and processed by the same node/partition.
Modulus - similar to Hash, but it works only on numeric key columns.
Range - processes similar key values together, but it has slightly more overhead because it has to work out the ranges of records.
DB2 - matches the partitioning of a DB2 table (used with the DB2 connector stage).
Key-less partitioning techniques:
Round Robin - records are distributed evenly across the nodes. Ex: with 4 nodes, the first record goes to the 1st node, the second record goes to the 2nd node, and so on.
Random - records are also distributed evenly across the nodes, similar to Round Robin, but the target node is chosen at random.
Entire - every node receives a complete copy of the input data.
Same - data stays on the node it is already on and is passed as-is to the next stage.
"All pipes carry the data in parallel and the processing is done simultaneously."
In a server environment, the execution process is called traditional batch processing.
8) what is the node APT configuration file / Node configuration file? ***********
Configuration file:
The configuration file contains information about the processors. Each processor is known as a Node.
A node is a logical representation of a CPU; it is an instance of a physical CPU.
The configuration file is created and managed from the DataStage Designer client: Tools -> Configurations.
The configuration file is saved with the extension ".apt" (Advanced Parallel Technology).
The configuration file is activated by the runtime environment variable "$APT_CONFIG_FILE".
We can determine the degree of parallelism from the configuration file:
No. of Nodes = No. of Partitions
No. of Nodes = Degree of Parallelism
Node Components:
There are 4 node components in a configuration file:
1) Node name
2) Fastname
3) Pools
4) Resource - Resource Disk
            - Resource Scratch Disk
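As an illustration only (the file path and node/server names below are assumptions, not taken from this document), the active configuration file can be inspected from the command line, and a one-node .apt file typically has this layout:
echo $APT_CONFIG_FILE
cat /opt/IBM/InformationServer/Server/Configurations/default.apt
# {
#   node "node1"
#   {
#     fastname "etlserver1"
#     pools ""
#     resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
#     resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
#   }
# }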
9) what is the difference between Sequential File stage and Dataset? *******
Sequential File stage:
The Sequential File stage is a file stage; it reads flat files with different extensions/formats (.csv, .txt, .psv).
To read a file in parallel, set "Number of Readers per Node" to a value greater than 1, or enable the "Read from multiple nodes" option.
DataSet:
The Dataset is a file stage which is used for staging data when we design dependent jobs. It allows you to read data from or write data to a dataset. The stage can have a single input link or a single output link, and it can be configured to execute in parallel or sequential mode.
Datasets are operating-system files; each is referred to by a descriptor (control) file and is stored with the .ds format.
We can manage datasets independently by using the Data Set Management utility in DataStage Designer: Tools --> Data Set Management.
It supports processing of more than 2 GB of data.
No conversion is required, because the data in a dataset is represented/resides in native format.
The dataset extension is .ds.
Q: How many files are created internally when we create a dataset?
A dataset is not a single file; multiple files are created internally when it is created:
o Descriptor file
o Data file
o Control file
o Header file
To delete a dataset from the command line: orchadmin rm dataset.ds
11) In how many ways can we view the data in a dataset?
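The document leaves this unanswered; based on the utilities mentioned above, two common ways (listed here as an assumption) are the Data Set Management utility in Designer (Tools --> Data Set Management, or View Data on the Data Set stage) and the orchadmin command-line utility, for example:
orchadmin describe mydataset.ds    # hypothetical dataset name; shows the dataset's schema and layout
orchadmin dump mydataset.ds        # prints the data records to standard output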
12) I have a source file that contains header and footer lines; I need to remove them before processing the data in the Sequential File stage.
Ex:
current date : 2022
empno,ename,job,sal
12,abc,abc,1000
13,aadaf,afa,2000
total count: 2
Output:
empno,ename,job,sal
12,abc,abc,1000
13,aadaf,afa,2000
In the Sequential File stage we have a "Filter" option; using a Unix command there we can remove the header and footer lines.
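A minimal sketch of such a filter, based on the sample layout above (in the stage's Filter property the command reads from standard input, so no file name is needed):
sed '1d;$d'
# '1d' deletes the first line (the date header) and '$d' deletes the last line (the trailer count)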
13) I have a file containing 100 records; in the target I need to generate 100 files. How can you achieve this?
14) How can you handle null values in the Sequential File stage?
Step 1 - In the Format tab, add a field-level default such as "Null field value" so that nulls are recognised and replaced with the chosen value when the file is read or written.
15) How can we read Excel data?
By using the Unstructured Data stage we can read Excel data. This stage is exclusively for reading Excel data.
16) which stages can generate mock data?
In the Development and Debug category, the Row Generator stage and the Column Generator stage can generate mock data.
17) what is the difference between the Row Generator and Column Generator stages?
Row Generator: if we don't have any mock data from the client, we can use the Row Generator stage directly. It supports only 1 output link.
Note: in the whole of DataStage, only one stage supports just a single output link; that stage is the Row Generator.
Column Generator stage: if we have some sample data for a few columns and we need to generate sample data for the other columns, we can use the Column Generator stage.
Peek stage: it is a Development and Debug stage; it is used to see the input data in the job log. It can also act as a Copy stage.
18) I have a data set with 15 records and I want to see only records 6 to 9. How can this be done in DataStage?
This can be achieved with the Head and Tail stages:
Step 1) read the source data.
Step 2) use a Head stage and set the option to keep the first 9 rows ("head 9").
Step 3) follow it with a Tail stage that keeps the last 4 rows, which leaves records 6 to 9.
19) What is the isolation level in the DB2 connector stage?
Isolation level
Specifies the degree to which the data that is being accessed by the DB2 connector stage is locked or isolated from other concurrently executing transactions, units of work, or processes.
Cursor stability
This is the default value. Takes exclusive locks on modified data and sharable locks on
all other data. Exclusive locks are held until a commit or rollback is executed.
Uncommitted changes are not readable by other transactions. Sharable locks are
released immediately after the data has been processed, allowing other transactions to
modify it.
Read uncommitted
Takes exclusive locks on modified data. Locks are held until a commit or rollback is
executed. No other locks are taken. However, other transactions can still read but not
modify the uncommitted changes.
Read stability
Takes exclusive locks on modified data and takes sharable locks on all other data. All
locks are held until a commit or rollback is executed, preventing other transactions
from modifying any data that has been referenced during the transaction.
Repeatable read
Takes exclusive locks on all data. All locks are held until a commit or rollback is
executed, preventing other transactions from modifying any data that has been
referenced during the transaction.
20) what is the difference between "Insert then update" & "Update then insert"?
Insert then update: the connector first tries to insert each record; if the insert fails because the row already exists (duplicate key), the existing row is updated instead. Update then insert: the connector first tries to update the row; if no row is updated (the key does not exist), the record is inserted. Choose the mode according to whether most incoming rows are new or already exist in the target.
Record count
Specify the number of records to process before the connector commits the
current transaction or unit of work. You must specify a value that is a multiple of
the value that you set for Array size. The default value is 2000. If you
set Record count to 0, all available records are included in the transaction.
Valid values are integers 0 - 999999999.
Array size
Specifies the number of records or rows to use in each read or write database
operation. The default value is 2000. Valid values are from 1 to a database-specific maximum.
Processing Stages
1) what is the Aggregator stage?
Using the Aggregator stage we can get aggregated results; it supports 1 input link and 1 output link.
The main aggregation properties are:
a) Calculation
b) Count Rows
c) Column for Calculation
3) Can we do a count along with min, max, avg and sum using a single Aggregator stage?
Yes: with Aggregation type = Calculation, outputs such as Sum, Maximum, Minimum, Mean and the non-missing values count can be produced for the same column in one Aggregator stage.
Note: define the output column as, for example, Decimal(8,2); if we do not set a type, the result is returned in Double data-type format by default.
Hash mode is for a relatively small number of groups: fewer than about 1000 groups per megabyte of memory.
Sort mode requires the input data set to have been partitioned and sorted with all of the grouping keys specified as hashing and sorting keys.
6) what is the Copy stage and what is the use of the Force option in the Copy stage?
The Copy stage is a processing stage; it supports 1 input link and 'n' output links, and it copies all the input data to multiple output links.
Force: True/False
Set Force to True to specify that DataStage should not try to optimize the job by removing the Copy operation.
7) what is the difference between the Filter stage and the Switch stage?
Filter:
1. It is a processing stage; it supports 1 input link, N output links and 1 reject link.
2. Using it we can filter the data with a where-clause option, and in addition unmatched records can be sent to the reject link.
Switch stage:
1) It is a processing stage; it supports 1 input link, up to 128 output links and 1 reject link.
2) Using it we can filter the data with 'C'-style case-statement conditions.
3) It is a native stage in DataStage, so whenever we want to filter only specific values we can use the Switch stage instead of the Filter stage.
4. The main difference is that the 'in' / 'or' operators cannot be used in the Switch stage, but they can be used in the Filter stage.
Note: all input datasets must have the same number of columns in every input dataset.
Join Stage:
a) The Join stage is a processing stage; it supports multiple input links and 1 output link. Using the Join stage we can join multiple inputs and send the data to one target.
c) Whenever we use a Full Outer Join, only 2 input links are supported. For the rest of the join types any number of input links is supported. The Join stage does not support a reject link.
d) Usually, whenever the reference tables contain a huge volume of data, the Join stage is appropriate, because it does not create any paging in the databases.
e) While doing the join, make sure the input data is sorted on the join keys to get correct joining results. By default the Join stage uses the Hash partitioning technique.
Lookup Stage:
A) The Lookup stage is a processing stage. It is most appropriate when the reference data for all lookups in a job is small enough to fit into the available physical memory.
B) Each lookup reference requires a contiguous block of shared memory. If the reference data sets are larger than the available memory resources, the Join or Merge stage should be used.
C) The lookup key columns do not have to have the same names in the primary and the reference links.
D) The Lookup stage supports 1 input link, 'N' reference links, 1 output link and 1 reject link. The optional reject link carries source records that do not have a corresponding entry in the reference (lookup) tables.
Note: when using a sparse lookup, only 1 reference link is supported.
Normal Lookup:
Normal lookup is available on any connector stage used as a reference. Normal lookup is appropriate when the reference data is small enough to fit into the available physical memory (RAM).
Sparse Lookup:
1. Use it when the size of the reference table is huge, i.e., millions of rows or more. If the reference table is small enough to fit into memory entirely, normal lookup is a better choice.
2. Use it when the number of input rows is less than about 1% of the size of the reference table. Otherwise, use a Join stage.
3. With sparse lookup, each source record fires a query directly against the reference database. In the reference connector we write the SQL using the ORCHESTRATE operator to bind the source key, for example a SELECT statement whose WHERE clause compares the key column with ORCHESTRATE.key_column.
By default the Lookup stage uses the Entire partitioning technique.
Merge:
A) The Merge stage is a processing stage. It can have 1 master input link, N update input links, 1 master output link and the same number of reject links as update links.
B) The Merge stage combines a master dataset with one or more update datasets based on the key columns. The output record contains all the columns from the master record plus any additional columns from each update record that are required.
C) The data sets input to the Merge stage must be key-partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.
10) What is a Parameter, what is a Parameter Set, and in how many ways can we create them?
Using a parameter, we can provide different values at run time; such a run-time placeholder is called a parameter. (A fuller breakdown of parameter types is given later in this document.)
11) what is the Modify stage and how is it used?
1) The Modify stage is a processing stage that alters the record schema of the input data.
2) The Modify stage can have a single input and a single output link.
3) The Modify stage can also be used to handle NULL values and to perform string, date, time and timestamp manipulation functions.
4) The Modify stage is a native parallel stage and has performance benefits over the Transformer stage.
Keeping and Dropping Fields
Invoke the modify operator to keep fields in, or drop fields from, the output.
12) What is the Pivot Enterprise stage?
The Pivot Enterprise stage is a processing stage that pivots data vertically or horizontally depending on the requirement. There are two pivot types:
1. Horizontal
2. Vertical
13) How Many Ways we can Remove the Duplicates in data stage? ***********
Input data:
empno,ename,job,sal
1200, abc123, xyz,2500
1200, abc123, xyz,2500
1200, abc123, xyz,2500
1202, lmn789, aaa,2600
1202, lmn789, aaa,2600
1204, abc, xyz,2700
1206, lmn, aaa,2800
1208, abc, xyz,2900
Output:
empno,ename,job,sal
1200, abc123, xyz,2500
1202, lmn789, aaa,2600
1204, abc, xyz,2700
1206, lmn, aaa,2800
1208, abc, xyz,2900
Case-1:
Using the Remove Duplicates stage, we can remove the duplicates.
Case-2: Using the Sort stage
Process 1: the Sort stage has an option called 'Allow Duplicates = true/false'; if we set it to false, the stage outputs the data without duplicates.
Process 2: in the Sort stage set 'Create Key Change Column = true'; the data is sorted on the sort key columns and an additional column called 'keyChange' is generated. Within each group the first row is flagged as 1 and the subsequent records are flagged as 0. After this, use a Filter stage with the condition keyChange = 1 to send only the first record of each group to the output.
Then in the Filter stage filter the data as described above.
Process 3: we can also remove duplicates with the link sort options on a stage's input link:
Partition > Hash > select the key column > Perform Sort > Unique
Way 3:
Using Transformer stage variables we can detect and remove the duplicates as follows:
- Extract the data.
- Sort the data based on the key columns.
- In the Transformer stage, take 2 stage variables and compare the current key with the previous key to derive a key-change flag (the original shows this logic in a screenshot).
- Use the stage variable in the constraint to pass only the rows where the flag is 1.
Way 4:
Using the Aggregator stage (grouping on the key columns).
The Surrogate Key Generator stage generates sequential, incremental and unique integers from a provided start point. It can have a single input and a single output link.
A surrogate key is a primary key for a dimension table (a surrogate key is an alternative to the natural primary key). The main benefit of using a surrogate key is that it is not affected by changes going on in the source database.
And with a surrogate key, duplicate business-key values are allowed in the table, which cannot happen with a primary key.
Note: the Transformer stage is a processing stage; it supports 1 input link and N output links. Using it we can perform all kinds of data validation and apply filter conditions.
18) how do we generate a counter or sequence number in the Transformer stage if a job is running on 2 or 4 nodes?
Use partition-aware system variables in the derivation, e.g. @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1 (the same formula shown in the surrogate-key tip later in this document), so that the numbers remain unique across nodes.
19) what are all the Null Handling functions in the Transformer stage?
IsNotNull
IsNull
NullToEmpty
NullToZero
NullToValue
SetNull
21) What is Resource Estimation?
Use the Resource Estimation window to estimate and predict the system resource utilization of parallel job runs.
22) How to Import/Export a DataStage Job Using the Command line/Unix? ****
Location of the command:
UNIX: /opt/IBM/InformationServer/Clients/istools/cli
Windows: \IBM\InformationServer\Clients\istools\cli
Job-wise Export:
cd /opt/IBM/InformationServer/Clients/istools/cli
./istool export -dom XYZ123:9080 -u dsadm -p dsadm -ar /tmp/Test1.isx -ds XYZ123/Test1/Jobs/TestJob.dsx
Syntax: ./istool export -dom [domain]:9080 -u [user] -p [password] -ar [Path/ExportFileName.isx] -ds [domain/ProjectName]
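The document shows only the export direction; a matching import (sketched here as an assumption, reusing the archive and project names from the example above) would be:
./istool import -dom XYZ123:9080 -u dsadm -p dsadm -ar /tmp/Test1.isx -ds XYZ123/Test1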
Check List Of Datastage Projects from Command Line:
DSHOME=`cat /.dshome`
echo $DSHOME
cd $DSHOME
. ./dsenv
cd ./bin
./dsjob -lprojects => { For Listing the Datastage Projects}
./dsjob -ljobs project => { For Listing the Datastage Jobs in given Project}
23) To get a list of DataStage jobs that are running, use a command similar to the following UNIX command:
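(The listing command itself is missing in the original; a commonly used form, stated here as an assumption, is:)
ps -ef | grep dsapi_slave    # each dsapi_slave process corresponds to a DataStage client/job connection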
3) Look for any processes named dsapi_slave and kill them using this command:
kill -9 pid (the process id is the first numeric column after your user id)
24) I have a field in the project repository; I need to identify which jobs that field flows through. How?
(The original shows a screenshot here.) The screenshot shows how far the empno column flows; this is done with the column-level impact analysis / "find where used" facility in the Designer repository.
Containers:
A container is a group of stages and links. Containers enable you to simplify and modularize your job designs by replacing complex areas of the diagram with a single container stage.
Local Containers:
Local containers are created within a job and are only accessible by that job. Their main use is to 'tidy up'/simplify a job design. A local container can also be de-constructed: right-click on the container and select De-construct.
Step 1: Select the stages and links on the job canvas.
Step 2: Go to Edit > Construct Container > Local.
These local containers can't be re-used; they just simplify the job design.
Shared Containers:
These are created separately and are stored in the Repository in the same way that jobs are. The logic in these containers can be re-used anywhere in the project.
If we want to de-construct the job, we first need to convert the shared container into a local container, and then de-construct the local container.
Edit --> Construct Container > Shared
This portion of the logic is saved in the repository and the same logic can be re-used anywhere at project level.
Ref: https://www.ibm.com/docs/en/iis/11.7?topic=reusable-shared-containers
26) how to run a DataStage job from the command-line interface/Unix?
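The answer is not shown here, but the dsjob example given later in this document covers it:
./dsjob -server $server_nm -user $user_nm -password $pwd -run $project_nm $job_nm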
Sequencer Questions:
1) A Job Activity stage can invoke a single parallel job at a time; if we have many parallel jobs, we need that many Job Activity stages.
2) I have 5 Job Activity stages in a sequence; the first job completed and the 2nd job aborted. Jobs 3, 4 and 5 should still run without stopping. What is the procedure?
In the Job Activity stage triggers, don't set any condition; just keep the trigger as 'Unconditional'.
3) I have 5 jobs in a sequence and the 3rd job failed; the sequence should restart from the point where it aborted. How do we achieve this?
In each Job Activity set Execution action = 'Reset if required, then run'.
In the sequence job properties enable the "Add checkpoints so sequence is restartable" option.
The sequence records the error; once the parallel job issue is fixed and the job is recompiled, re-run the sequence from DataStage Director and it continues from the failed activity.
4) how to execute Unix commands in a sequence?
Using the Execute Command activity stage.
5) how can we pass multiple commands in the Execute Command activity stage?
Each command can be separated by a semicolon (;).
6) what is meant by Any/All in the Sequencer?
Any: if any one of the linked job activities finishes, the next dependent jobs are triggered.
All: the sequencer waits until all linked job activities have completed.
Routines are of two kinds:
1. Before/After Subroutines.
2. Transformer Routines/Functions.
Before/After Subroutines:
Transformer Routines:
Transformer routines are custom-developed functions. As you all know, DataStage has some limitations in its built-in functions (TRIM, PadString, etc.); for example, in DataStage version 8.1 there was no function to return the ASCII value of a character, and from 8.5 the Seq() function was introduced for that scenario.
These custom routines are developed in C++. Writing a routine in C++ and linking it to our DataStage project is a simple task, as follows (the original continues with a screenshot):
3. Import a .dsx file from the command line
SOL: DSXImportService -ISFile dataconnection -DSProject dstage -DSXFile c:\export\oldproject.dsx
4. Generate a Surrogate Key without the Surrogate Key stage
SOL: @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1
Use the above formula in a Transformer stage to generate a surrogate key.
./dsjob -server $server_nm -user $user_nm -password $pwd -run
$project_nm $job_nm
REUSABILITY IN DATASTAGE
Below are some of the ways through which reusability can be achieved in
DataStage.
Multiple Instance Jobs.
Parallel Shared Container
After-job Routines.
After-job Routines:
After/Before-job subroutines are types of routines which run after/before the job to which the routine is attached. We might have a scenario where none of the input records should be rejected by any of the stages in the job. So we design a job which has reject links for the different stages, and then code a common after-job routine which counts the number of records on the reject links of the job and aborts the job when the count exceeds a pre-defined limit.
This routine can be parameterised for stage and link names and can then be re-used for different jobs.
Stable sort means "if you have two or more records that have the same
exact keys, keep them in the same order on output that they were on
input".
what is the main difference between the Key Change column and the Cluster Key Change column in the Sort stage?
Create Key Change Column is generated while sorting the data: it generates 1 for the first record and 0 for the rest of the records in each group.
Create Cluster Key Change Column generates the same flag on data that is already sorted, i.e., when the sort mode is "Don't sort (previously sorted)".
How can we improve job performance? For that:
a) Avoid using the Transformer stage where it is not necessary. For example, if you are using a Transformer stage only to change column names or to drop columns, use a Copy stage instead; it gives better performance.
b) Take care to choose the correct partitioning technique, according to the job and the requirement.
c) Use user-defined queries for extracting the data from databases.
d) If the data volume is small, use SQL join statements rather than a Lookup stage.
e) If you have a large number of stages in the job, divide the job into multiple jobs.
Data Profiling:-
Data profiling is performed in 5 steps. Data profiling analyses whether the source data is good (clean) or dirty.
The 5 steps are:
a) Column Analysis
b) Primary Key Analysis
c) Foreign Key Analysis
d) Cross-domain Analysis
e) Baseline Analysis
After completing the analysis, if the data is good there is no problem. If the data is dirty, it is sent for cleansing. This is done in the second phase.
Data Quality:-
Data Quality takes the dirty data and cleans it using 5 different steps.
They are:
a) Parsing
b) Correcting
c) Standardizing
d) Matching
e) Consolidating
Data Transformation:-
After completing the second phase, it gives the Golden Copy. The golden copy is nothing but a single version of the truth; that means the data is now good.
Error handling can be done by using the reject link: whatever errors come through the job need to be captured in a sequential file, and that file needs to be fed into a job which loads these exceptions or errors into the database.
Scenario 1 (a dependency exists between a script and a job): a job has to be executed first, then the script has to run, and only upon completion of the script should the second job be invoked. In this case, develop a sequence job where the first Job Activity invokes the first job, then an Execute Command activity calls the script you want to invoke by typing "sh <script name>" in the command property of the activity, and then another Job Activity calls the second job.
Scenario 2 (the script and the job are independent): in this case, in your parallel job (say job1), under Job Properties you can find "After-job subroutine"; select "ExecSH" and pass the script name you would like to execute. Once job1 completes, the script gets invoked. The job succeeding job1 (say job2) does not wait for the execution of the script.
A stage can also request that the next stage in the job preserves whatever partitioning it has implemented. It does this by setting the preserve-partitioning flag on its output link. Note, however, that the next stage might ignore this request.
In most cases you are best leaving the preserve-partitioning flag in its default state. The exception is where preserving the existing partitioning is important. The flag will not prevent repartitioning, but it will warn you that repartitioning has happened when you run the job. If the Preserve Partitioning flag is cleared, the current stage doesn't care what the next stage in the job does about partitioning. On some stages, the Preserve Partitioning flag can be set to Propagate. In this case the stage sets the flag on its output link according to what the previous stage in the job has set. If the previous stage is also set to Propagate, the setting from the stage before that is used, and so on, until a Set or Clear flag is encountered earlier in the job. If the stage has multiple inputs and has the flag set to Propagate, its Preserve Partitioning flag is set if it is set on any of the inputs, or cleared if all the inputs are clear.
projectinfo - returns the project information (host name and project name).
stageinfo - returns the stage name, stage type, input rows, etc.
report - displays a report containing the generated time, start time, elapsed time, status, etc.
I have a source file with 14 columns. How do I extract 10 fields out of it without changing jobs?
Answer: awk -F ',' '{ print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10 }' filename
(use the file's actual delimiter with -F)
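An equivalent command for the same task (the comma delimiter and file name are assumptions) is:
cut -d ',' -f 1-10 filename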
How to keep the most recent 5 days of log files and remove the rest of the log files from a log directory?
Answer: Unix command: find /home/input/files* -mtime +5 -exec rm {} \;
Interview Questions:
Question) My source path contains 10 files (csv files) with the same metadata but different data; day by day the number of files may increase. I have to write each file into a different target file in psv file format.
Question) I have a sequence which contains 5 Job Activity stages; if the 2nd job gets aborted, the 3rd job should still run without stopping the sequence.
Question) I have to pass 5 different commands using one Execute Command activity stage; how can we do that?
Question) I have a sequence job with 10 job activities and the 5th job got aborted; how do we restart the job from the point of failure?
Question) What is the Sequencer stage and what options does it contain? (difference between ANY/ALL)
Question) What are the Start Loop and End Loop activities, and what do they do?
1) What is a parameter and in how many ways can we create parameters?
Parameters: we can pass values at run time; for that we have to create a parameter so the value can be supplied dynamically.
There are 2 ways we can create parameters:
1) Local parameter / job-level parameter
2) Project-level parameter
a. Parameter set
b. Administrator-level parameters
Job-level parameter: it is specific to that particular job only; we cannot use it in other jobs in the project.
Project-level parameter: it can be re-used anywhere, in any job, at project level.
1) Parameter Set: it is a container which can hold all the related parameters, such as file path, database username, password, hostname, etc., and it is saved in the repository.
2) Administrator-level environment variables: in the Administrator client, under Environment Variables, we can create parameters and we can call them anywhere in the project.
The Funnel stage is a processing stage; it supports multiple input links and 1 output link.
It combines multiple input datasets into a single target output link.
It is similar to the SQL UNION ALL operation; make sure all the input datasets have the same number of columns, and that the column order is also the same.
There are three funnel types:
1) Continuous Funnel
2) Sequence Funnel
3) Sort Funnel
Using the Transformer stage-variable approach, we can generate a key-change style flag and then filter on it at constraint level; the design and solution follow below (shown as a screenshot in the original).
How to remove duplicates in SQL?
-- list the key values that have duplicates (only keys occurring more than once are returned):
Select deptno, count(*) from emp group by deptno having count(*) > 1;
-- delete the duplicate rows, keeping one row per key (Oracle needs the rowid here, because you cannot delete directly from an inline view that uses an analytic function):
Delete from emp where rowid in (select rid from (select rowid rid, row_number() over (partition by deptno order by deptno) rn from emp) where rn > 1);
-- view the duplicate rows that would be removed:
Select * from (select deptno, row_number() over (partition by deptno order by deptno) rn from emp) where rn > 1;
Truncate is a DDL command; it removes all the data from a table without logging each row, and we can't roll back the data.
Delete is a DML command; we can delete a portion of the data using a "where" clause, or we can delete the entire table's data. Here we can roll back the data until a commit is issued.
10) Table 1 has the data 1, 1, 1, 2, 5 and Table 2 has 1, 1, 2, null, null. What is the record count for all 4 types of join?
Source data:
Table-1 (left): 1, 1, 1, 2, 5
Table-2 (right): 1, 1, 2, null, null
Inner Join: only the matched records from both tables (every '1' in Table-1 matches both '1's in Table-2, '2' matches '2', and nulls never match).
Total count: 7
Table-1  Table-2
1        1
1        1
1        1
1        1
1        1
1        1
2        2
Left Join: matched records from both tables plus the unmatched records from the left table; the corresponding right-table columns are null.
Total count: 8
Table-1  Table-2
1        1
1        1
1        1
1        1
1        1
1        1
2        2
5        null   (left unmatched)
Right Join: matched records from both tables plus the unmatched records from the right table.
Total count: 9
Table-1  Table-2
1        1
1        1
1        1
1        1
1        1
1        1
2        2
null     null   (right unmatched)
null     null   (right unmatched)
Full Join: matched and unmatched records from both tables.
Total count: 10
Table-1  Table-2
1        1
1        1
1        1
1        1
1        1
1        1
2        2
null     null   (right unmatched)
null     null   (right unmatched)
5        null   (left unmatched)
Union - does not allow duplicates; it returns the distinct rows from both input datasets.
Union All - allows duplicates.
To print the nth line of a file (e.g., the 5th line):
sed -n 5p lines.txt
To remove DOS carriage-return characters (^M) from a file:
sed -e "s/^M//" filename > newfilename
16) How to find the 4th word (nth word) from a line/file:
echo "This is a temporary change to complete the export" | awk '{ print $4 }'
(Sample output for a question that is missing here; the numeric data is sorted in descending order on the second column:)
1657 19.6117
1410 18.8302
3078 18.6695
2434 14.0508
3129 13.5495
20) how to find the unique values in a given column of a file?
MYLK4 Mylk4 reg 1 Heart
ATP8A1 Atp8a1 reg 5 Heart
Here the organ name (Heart) can differ from row to row, and there are several organs in the data. How can I figure out the names of the unique elements of that column (column 5)? The data file is huge.
Ans)
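The answer is missing in the original; a standard approach (the file name is an assumption) is:
awk '{ print $5 }' datafile.txt | sort -u    # print the 5th column, then keep only the unique values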
To run a job when the content of a file is 0
Here is a simple scenario: run a DataStage job based on the record count from file "File1". When File1 contains no rows, trigger Job 1, else trigger Job 2.
Solution:
EReplace(Execute_Command_6.$CommandOutput, @FM, "")
Reason:
Whenever we read any content from a file into a user variable, the field marks are also written; in order to remove the field mark (@FM), we use the above expression.
Unix command to bring all records(within a column) in a single row with delimiters:
sed -e 's/$/<CRLF>/' $* | tr -d "\r\n" | sed 's/<CRLF>/,/g' | sed 's/.$//' | sed 's/,/'"','"'/g' | sed
's/$/'"'"'/g' | sed 's/^/'"'"'/'
Example:
Actual File:
[user@123]$ cat NewFile.txt
Rule1
Rule2
Rule3
Rule4
Rule5
[user@123]$ cat NewFile.txt | sed -e 's/$/<CRLF>/' $* | tr -d "\r\n" | sed 's/<CRLF>/,/g' | sed 's/.$//' |
sed 's/,/'"','"'/g' | sed 's/$/'"'"'/g' | sed 's/^/'"'"'/'
'Rule1','Rule2','Rule3','Rule4','Rule5'
This will run the dsenv file, which contains all the environment variables. Without doing this, your dsjob commands won't run at the command prompt.
To run a job:
Using the dsjob command you can start, stop, reset or run the job in validation mode.
Running with an invocation id would mean that the job is run with that specific invocation id.
If you have parameters or parameter-set values to set, these can also be set, as shown below.
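The original shows this as a screenshot; a typical form (the project, job, invocation id and parameter names here are placeholders) is:
dsjob -run -mode NORMAL -param myParam=value1 -param myParamSet.dbUser=etl_user project_name job_name.invocation_id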
To stop a job:
Stopping a job is fairly simple. You might not actually need it, but it is still worth a look. It acts the same way as stopping a running job from the DataStage Director.
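The command itself is omitted in the original; it has the form (project and job names are placeholders):
dsjob -stop project_name job_name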
To list projects, jobs, stages in jobs, links in jobs, parameters in jobs and invocations of jobs:
dsjob can very easily give you all of the above based on the different keywords. It is useful if you want a report of what's being used in which project, and things like that.
'dsjob -lprojects' will give you a list of all the projects on the server.
'dsjob -ljobs project_name' will give you a list of jobs in a particular project.
'dsjob -lstages project_name job_name' will give you a list of all the stages used in your job. Replacing -lstages with -llinks will give you a list of all the links in your job. Using -lparams will give you a list of all the parameters used in your job. Using -linvocations will give you a list of all the invocations of your multiple-instance job.
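The short-report command referred to in the next paragraph is not shown in the original; it is the jobinfo option (project and job names are placeholders):
dsjob -jobinfo project_name job_name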
Running this command will give you a short report of your job, which includes the current status of the job, the name of any controlling job, the date and time when the job started, the wave number of the last or current run (an internal InfoSphere DataStage reference number) and the user status.
You can get a more detailed report using the command below.
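The detailed-report command is also omitted in the original; it has the form below, where the report type can be BASIC, DETAIL or XML:
dsjob -report project_name job_name DETAIL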
To access logs:
You can use the command below to get the list of the latest 5 fatal errors from the log of the job that was just run.
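The command is missing in the original; a form that matches the description (names are placeholders) is:
dsjob -logsum -type FATAL -max 5 project_name job_name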
You can get different types of information based on the keyword you specify for -type. The full list of allowable types is available in the help guide for reference.
The following is the command to run a sequence while passing parameters.
Here,
"param" is the name of a parameter defined in the job properties of the sequence.
"value" is the value of the parameter that you want to pass to this sequence.
http://mydatastagesolutions.blogspot.com/2015/04/how-to-test-odbc-connection-from-putty.html
What is a routine in DataStage?
DataStage Manager defines a collection of functions within a routine. There are basically three
types of routines in DataStage, namely, job control routine, before/after subroutine, and
transform function.
(The question is missing here; the following appear to be listed as DataStage components:)
Client components
Servers
Stages
Table definitions
Containers
Projects
Jobs
Name the command line functions to import and export the DS jobs?
The dsimport.exe function is used to import the DS jobs, and to export the DS jobs,
dsexport.exe is used.
If we want to further remove the logs, then we need to go to the respective jobs and clean up the log files.
Routines are stored in the Routine branch of the DataStage repository. This is where we
can create, view, or edit all the Routines. The Routines in DataStage could be the
following: Job Control Routine, Before-after Subroutine, and Transform function.
Q #26) What is the difference between passive stage and active stage?
Answers: Passive stages are utilized for extraction and loading whereas active stages are
utilized for transformation.
Q #32) What is the Quality stage?
Answers: The Quality stage (also called the Integrity stage) is a stage that aids in combining data coming from different sources.
Ref: https://www.naukri.com/learning/articles/top-datastage-interview-questions-and-answers/