Talend Job Design Patterns
Agenda
• Introduction & Overview
• SDLC – Development Guidelines
• Job Design Patterns
• Best Practices 1-16
• DDLC – Database Modeling
• Best Practices 17-32
• Open Discussions & Q&A
Question
“What is the best way for me to write a Talend job?”
• It needs to be:
• EASY to READ
• EASY to WRITE
• EASY to MAINTAIN
Avoiding Complexity:
• NO Monolithic Jobs
• Minimize Depth Levels
• NO Scrunched Components
• NO Overlapping Links
Talend SDLC Best Practices Guide
… so take it Seriously
Continuous Integration/Deployment
Formulating the Basics
Foundational Precepts
✓ Readability: creating code that can be easily figured out and understood
✓ Writeability: creating straightforward, simple code in the least amount of time
✓ Maintainability: creating appropriate complexity with minimal impact from change
✓ Functionality: creating code that delivers on the requirements
✓ Reusability: creating sharable objects and atomic units of work
✓ Conformity: creating real discipline across teams, projects, repositories, and code
✓ Pliability: creating code that will bend but not break
✓ Scalability: creating elastic modules that adjust throughput on demand
✓ Consistency: creating commonality across everything
✓ Efficiency: creating optimized data flow and component utilization
✓ Compartmentalization: creating atomic, focused modules that serve a single purpose
✓ Optimization: creating the most functionality with the least amount of code
✓ Performance: creating effective modules that provide the fastest throughput
SDLC – Developer Guidelines
Guidelines NOT Standards – It’s about Discipline
• Standards are rigid, leaving no room for the unexpected
• Guidelines are pliable; they can bend and rarely break
Create a Development Guidelines Document
• Involvement and Adoption from all teams is essential
• Incorporates Corporate SDLC process
• Defines the foundation, structure, & context
Other Useful Documents
• Code Module Library
• Data Dictionary
• Data Access Layer
Just One More Thing
Instill Good Habits
• Start easy – something everyone can adopt
• Agree to label every component for code readability; a foundational Precept!
• Incrementally raise the bar – over time
• Organize Repository Folders
• Establish & utilize naming conventions; Conformity!
• Adopt logging, messaging, & error handlers
• Build out reusable/shared code modules; Several precepts used here!
Job Design Patterns & Best Practices
What Are Job Design Patterns?
Template or Skeleton Layouts
• Focus on essential and/or required elements
• Often organized around a use case
• Target Common and/or Reusable code modules
• When identified and implemented properly they:
• Strengthen overall code base
• Condense overall effort
• Reduce repetitive but similar code
Best Practices
Best Practice #1
Consider carefully how to layout your Job Canvas
• Don’t just splash objects on your canvas with the idea of cleaning them up later
• Use discipline to paint it right the 1st time
• Allow space for Readability & Maintainability
• Line up the Flows & Components
• Preferred layout is
‘Top-to-Bottom’ then ‘Left-to-Right’
• A ‘Zig-Zag’ or ‘Snake’ layout can be easy to write,
and maybe easy to follow along
• But inserting new functionality can lead to
re-factoring the whole job layout
Best Practice #2
Atomic Job Modules – Parent/Child Jobs
• Avoid Big, Monolithic Jobs
• They can be hard to read and maintain
• Plus they can perform poorly
Best Practice #3
Joblets versus tRunJob Component
• INCLUDED code vs CALLED code
• Joblets are common code you ‘Include’ in your job
• tRunJob ‘Calls’ a Child job from a Parent job
Best Practice #4
Job Entry & Exit Points
• Talend code needs to Start & Stop
somewhere
• tPreJob & tPostJob components are highly advised
• tPreJob executes 1st, then continues
• tPostJob wraps it up (like ‘finally’ for you OOP guys)
Best Practice #5
Error Handling & Logging
• One of the MOST IMPORTANT
things you can incorporate into
your Jobs
• Creating a common Error Handler is
highly advised
• Incorporate well-defined ‘Return Codes’ (a sketch follows below)
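A minimal sketch of what well-defined ‘Return Codes’ can look like as a shared repository code routine; the class name, code values, and helper method are illustrative choices, not a Talend-supplied API:

    // Hypothetical repository routine holding the team's agreed return codes.
    // Reference it from tDie/tWarn or when checking a child job's exit code.
    package routines;

    public class ReturnCodes {

        public static final int SUCCESS            = 0;   // normal completion
        public static final int WARNING_DATA_SKIP  = 10;  // rows rejected, job continued
        public static final int DB_CONNECT_FAILURE = 20;  // database unreachable
        public static final int FILE_NOT_FOUND     = 30;  // expected input file missing

        // Helper: decide whether a code should stop the whole job chain.
        public static boolean isFatal(int code) {
            return code >= 20;
        }
    }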
Best Practice #6
OnSubjobOk/Error vs OnComponentOk/Error
• Often Misunderstood
• These ‘trigger’ links do affect job design
flow and must be considered properly
• OK vs ERROR is obvious
• OnSubJob will pass control only after
the current Sub Job has executed fully
• OnComponent will pass control only
after the component has processed a
row or a data set (depending upon the
component)
Best Practice #7
What is a Job Loop?
• A Highly Significant Job Design
Consideration!
• These are decision points where the next step in the process is decided
• All Job designs should identify One (1)
‘Main Loop’ where exit control can be
established
• Again use established ‘Return Codes’
and exit strategies
• ‘Secondary Loops’ are OK, just ensure
the process flow makes sense
Best Practice #8
Software Development Life Cycle (SDLC)
• “People, Product, & Process” – Marcus Lemonis, “The Profit” (CNBC)
• These 3 keys can determine the Success or Failure of any Business
• The same is true for Software Development
• Talend’s SDLC Best Practice Guide provides a deep look into the concepts, principles, specifications, and details of the Continuous Integration/Deployment practices available to Talend developers
• Incorporation of any SDLC Best practice into a ‘Development
Guidelines’ document is highly advised
Best Practice #9
Managing Workspaces
• Talend Studio installations use a ‘Workspace’
• Typically created on your local disk drive (e.g., C:)
• As in many software installations, a ‘Default’ location is assigned
• Usually placed alongside the software executables
Best Practice #10
Reference Projects
• Do you know what they are?
• We all want re-usable, common, or generic code that can be shared across projects
• Avoid cut-and-paste and/or copying similar code; Use Reference Projects!
• Limit the number of Reference Projects as too many defeat their purpose
Best Practice #11
Object Naming Conventions
• “A rose by any other name is still a rose!” who said that anyway?
• The answer may not matter, but Naming Conventions do!
• All Talend Objects have unique internally used names
• Adopt Conventions of Object Naming in Talend
• Clearly define them in your ‘Development Guidelines’ document
• Have the entire team adopt these conventions
Objects To Consider
• Directories, Folders, & Workspaces
• Data File ‘root’ & I/O locations & names
• Jobs, Joblets, Code Routines
• Context Groups & Variables
• Database Connections
Best Practice #12
Project Repository
• Where all project objects reside
• Several Important Sections include:
• Job Designs - where your jobs are located
• Contexts - groups reusable variables
• Code - add Java code modules
• Metadata - variety of schema definitions
• Documentation - auto-generate project Wiki
Best Practice #13
Version Control
• Job Properties allow setting ‘M’ajor & ‘m’inor version numbering
• Allows a status of ‘development’, ‘test’, ‘production’, or ‘user defined’
• This is designed for Single User Environments ONLY!
• When used in conjunction with an SCCS, considerable workspace ‘bloat’ occurs
Best Practice #14
Memory Management
• So, you want to run your job?
• Have you considered its Memory
needs?
• Is the data flow processing Millions of Rows or
have lots of columns?
• How many tMap Lookups are employed?
• Do you know how much memory your Job Server
has?
• How many levels of Parent/Child job nesting are
there?
• Are Child jobs run in a separate JVM?
• Are you using ESB Jobs? How many Routes?
• Are you using Parallelization?
• Check the Run view > Advanced settings to make appropriate adjustments (a sketch follows below)
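The Advanced settings accept JVM arguments; a minimal sketch with purely illustrative values that must be sized to your actual Job Server, data volumes, and lookup counts:

    -Xms1024M
    -Xmx4096M

-Xms sets the initial JVM heap and -Xmx the maximum heap the job may use.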
Best Practice #15
Dynamic SQL Syntax
• Talend Database Input components support SQL syntax
• Developers can generate a query based upon the specified schema
• Developers can also hard-code the query as desired
• What about when the query is unknown until Run-Time?
• ‘Context Variables’ to the rescue
• A tJava component and context variables can be used to construct the SQL syntax
• Referenced in the Database Input component’s query, these variables supply the constructed SQL at run time (a tJava sketch follows the list below)
✓ sqlCOLUMNS ✓ sqlFROM
✓ sqlWHERE ✓ sqlGROUPBY
✓ sqlORDERBY ✓ sqlLIMITS
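A minimal sketch of the tJava step that fills those context variables and of the query box that consumes them, assuming the variables exist as Strings in the job’s context group; the table, columns, and the extra ‘targetRegion’ variable are hypothetical:

    // tJava: build the SQL fragments at run time (variable names match the list above).
    context.sqlCOLUMNS = "customer_id, customer_name, region";
    context.sqlFROM    = "FROM customers";
    context.sqlWHERE   = "WHERE region = '" + context.targetRegion + "'";
    context.sqlORDERBY = "ORDER BY customer_name";

    // Database Input component query field (a Java expression, not a statement):
    // "SELECT " + context.sqlCOLUMNS + " " + context.sqlFROM + " "
    //           + context.sqlWHERE + " " + context.sqlORDERBY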
Best Practice #16
Parallelization Options
• There are several mechanisms available to enable code parallelization
• Use them correctly, efficiently, and with serious consideration
• Used inappropriately, they may negatively impact CPU & RAM utilization
• Used properly, they enable highly performant Job Design Patterns
Common Sense Utilization
• Use parallelization sparingly
• Do not use parallelization for code segments that already perform well
• Do use parallelization for code segments that need high throughput or
where processing bottlenecks occur
Best Practice #16
Parallelization Option Stack
Execution Plan (TAC): Multiple jobs/tasks can be configured to run in parallel
Multiple Job Flows (Job): Within a single job, multiple starting points can be created which execute simultaneously yet share the same thread; the preference should be to create separate child jobs
Parent/Child Jobs: When calling a child job, the tRunJob component supports the ‘Use an independent process to run subjob’ check box, which when checked establishes a separate JVM heap/thread to run the child job
Components: The tParallelize component links multiple process flows for simultaneous execution; the tPartitioner, tDepartitioner, tCollector, and tRecollector components offer direct control over the number of parallel threads for a specific data flow
DB Components: Most of the database components offer an advanced setting to enable parallelization thread counts on specific SQL statements (like INSERT or UPDATE); these can be highly efficient, but setting the number too high may have the opposite effect; 2-5 threads is the recommended Best Practice
BREAK TIME
Best Practice #17
Code Routines
• You can add tJava components that
embed Java code as needed into a flow
OR
• You can add custom Java methods to
the project repository which can be
used in a variety of ways
• Many built-in functions, like:
• getCurrentDate()
• sequence(string seqName, int startVal, int step)
• ISNULL(object variable)
• Make sure to incorporate comments that provide the function ‘helper’ text (a sketch follows below)
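A minimal sketch of a custom repository routine, following the comment pattern Studio generates so the ‘helper’ text shows up in the editor; the routine name and logic are illustrative:

    package routines;

    public class StringHelpers {

        /**
         * trimToNull: returns null when the input is null, empty, or whitespace only.
         *
         * {talendTypes} String
         * {Category} User Defined
         * {param} string("  abc  ") value: the string to clean
         * {example} trimToNull("  abc  ") # returns "abc"
         */
        public static String trimToNull(String value) {
            if (value == null) {
                return null;
            }
            String trimmed = value.trim();
            return trimmed.isEmpty() ? null : trimmed;
        }
    }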
Best Practice #18
Repository Schemas
• Objects defined in the Project Repository Metadata provide significant opportunities to create reusable code
• Repository Schemas include:
• Files
• Delimited / Positional / Regex
• XML / JSON
• Excel
• Generic
• WSDL
• LDAP ‘md_{objectname}’
• UN/EDIFACT
Best Practice #19
Apache Log4J (Studio)
• All components are Log4J enabled (v6+)
• ‘Enable’ in the Studio Project Settings
• Customize Log4J scripting paradigm
• Works with the ELK stack:
• Elasticsearch
• Logstash (Log Server)
• Kibana UI
• Utilizes Talend Priorities:
• INFO
• WARN
• ERROR
• FATAL
Best Practice #19
Apache Log4J (TAC)
• Also ‘enable’ in the TAC for each Task
• Be sure to set it appropriately for each environment:
• DEV / TEST / UAT / PROD
• Use in conjunction with your error
handler
• Be sure to utilize the components:
• tDie
• tWarn
• tAssert
Best Practice #20
Activity Monitoring Console: AMC (Studio)
• ‘Enable’ database logging in the
Studio Project Settings
• Specify Database Connection
• ‘Enable’ which information to catch
• Java Runtime Errors
• Job Errors
• Job Warnings
• Select AMC tables to use:
• tStatCatcher
• tLogCatcher
• tFlowMeterCatcher
Best Practice #20
Activity Monitoring Console: AMC (TAC)
• Visualization available in
both Talend Studio &
the TAC
• Establish ‘Return Codes’
as discussed in #5 which
provide a mechanism to
query the tLogCatcher
table externally
Best Practice #21
Recovery Checkpoints (Studio)
• When long running jobs or jobs with
critical steps fail, starting over can be
problematic
• Restarting/recovering these jobs from a specified checkpoint saves time and effort
• With Talend you can set one or more
Checkpoints on ‘OnSubJobOk’ links
• ‘Enable’ Recovery Checkpoint
• Give it a name
• Document recovery information
Best Practice #21
Recovery Checkpoints (TAC)
• Tasks define how to recover
a Job automatically on a Job
Server
• Wait
• Reset Task
• Restart Task
• Recover Task
• Error Recovery Manager
provides the ability to
manually restart at a
selected checkpoint
Best Practice #22
Joblets
• We looked at Joblets in #3 & #5
• ‘Included’ in Jobs, not called
• Most have ‘Input’ & ‘Output’
components to pass data flow
through
• Reusable Code within single job
or across many jobs
• Not all components should be
used in a Joblet, like:
• DB Connections
• tJavaFlex, unless fully contained
in Joblet
Best Practice #23
Component Test Cases
• Available since v6.0.1
• Components allow creation of a
‘Test Case’ where data flow is
involved
• Test case is tied to component
• Right click on component under
test to generate a ‘Test Case’ job
• When the component schema changes, the test case changes automatically
• Generated but can be modified
Best Practice #23
Test Case Job
• Reads an ‘input data file’
• Processes the data through the
component under test
• Writes out a ‘result file’
• Compares to an expected result
or ‘reference file’ for a match
• PASS / FAIL
• A test case ‘instance’ can support
multiple ‘input’ and ‘reference’
files
• GOOD / BAD / UGLY
• SMALL / MEDIUM / LARGE
Best Practice #24
Data Flow Iterations
• The normal link between components is either a ‘trigger’ or a ‘row’
• Data flow generally processes a ‘pipeline’ of records or a list of files
• The tFlowToIterate and tIterateToFlow components switch between a row pipeline and an iteration (and back)
• Each pipeline between two components has a unique object name (e.g., row1, row2, etc.)
• Pipelines usually process all rows until done, but sometimes the logic requires direct row-by-row control, called ‘iterations’ (see the sketch below)
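A minimal sketch of reading the current iteration’s values: tFlowToIterate publishes each column of the incoming row into the globalMap under "<rowName>.<columnName>"; the link name ‘row1’ and column ‘filename’ here are assumptions:

    // tJava placed on the Iterate link after a tFlowToIterate.
    // Each column of the current row is available from the globalMap.
    String currentFile = (String) globalMap.get("row1.filename");
    System.out.println("Processing file: " + currentFile);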
Best Practice #25
tMap Lookups
• The highly essential tMap
component is used for processing a
data flow from a ’source’ to a
‘target’ where some remapping
and/or transformation takes place
• A compelling use for the tMap is for
data Lookups that ‘join’ the primary
data flow with one or more other
data flows
• These ‘lookups’ can originate from
many kinds of source data
• Notice the ‘Lookup Model’
Best Practice #25
tMap Lookup Considerations
• How you set up ‘joins’ for lookups impacts both performance and memory
• Choose the right ‘Lookup Model’
• Load Once
• Reload at each Row
• Reload at each Row (cache)
• An in-memory lookup (Load Once) will likely be much faster but may oversubscribe available RAM
• A row-by-row lookup (Reload at each Row) will use far less memory but will be slower
Best Practice #25
tMap Row-by-Row Lookups
• The ‘key’ required for row-by-row lookups is set in the tMap editor shown previously
• The ‘lookup’ data flow then uses that variable in its join/query logic via the method (datatype)globalMap.get(“key”) (see the sketch below)
• This method is limited to SQL
database lookups
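A minimal sketch of the lookup side for a ‘Reload at each Row’ model; it assumes the tMap sets a globalMap key named "orderId" from the main flow, and the table/column names are hypothetical:

    // Query box of the lookup database input component (a Java expression):
    "SELECT order_id, order_status FROM orders " +
    "WHERE order_id = " + (Integer) globalMap.get("orderId")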
Best Practice #26
Global Variables
• ‘Context Variables’ are used in jobs to programmatically control values at runtime; referenced as: context.variable
• ‘Built-In’ variables are available only within the job they’re created in
• ‘Project Repository’ variables are
available across all jobs in a project;
this is the recommended practice
• The tSetGlobalVar component defines global variables within a job at runtime
Best Practice #26
More on Global Variables
• The tGlobalVarLoad component is
used for the same purpose in Big
Data jobs
• Use the globalMap to access them
(datatype)globalMap.get(“gVar”)
• Using global variables with components provides better memory management; however, it requires more code (see the sketch below)
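A minimal sketch contrasting the two mechanisms; the variable names (‘gRunDate’, ‘environment’) are assumptions, and TalendDate.getCurrentDate() is one of the built-in routines:

    // tSetGlobalVar: Key "gRunDate", Value TalendDate.getCurrentDate()
    // Later, in a tJava or any component expression, read it back from the globalMap:
    java.util.Date runDate = (java.util.Date) globalMap.get("gRunDate");

    // A context variable, by contrast, is referenced directly as a typed field:
    String env = context.environment;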
Best Practice #26
System Global Variables
• Use them where needed:
ERROR_MESSAGE
DIE_MESSAGE
WARN_MESSAGE
CHILD_RETURN_CODE
DIE_CODE
WARN_CODE
NB_LINE
NB_LINE_OK
NB_LINE_REJECT
NB_LINE_INSERTED
NB_LINE_UPDATED
NB_LINE_DELETED
• Two more include:
global.projectName
global.jobName
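A minimal sketch of reading a few of these from the globalMap; the component names/ids (‘tFileInputDelimited_1’, ‘tDie_1’, ‘tRunJob_1’) are illustrative and must match the components actually present in your job:

    // tJava in an error-handling or tPostJob flow.
    Integer rowsRead = (Integer) globalMap.get("tFileInputDelimited_1_NB_LINE");
    String  dieMsg   = (String)  globalMap.get("tDie_1_DIE_MESSAGE");
    Integer childRC  = (Integer) globalMap.get("tRunJob_1_CHILD_RETURN_CODE");

    System.out.println("Rows read: " + rowsRead + ", child return code: " + childRC);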
Best Practice #27
Loading Contexts
• Context Group variables
can be loaded at runtime
using the tContextLoad
component
• Storing them externally in a
file can be highly effective
and even support some
security concerns
• A corresponding tContextDump component can write out the current context values to a file or database first (a sample file sketch follows)
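A minimal sketch of an external context file read by a delimited file input and fed into tContextLoad; the variable names and the key/value separator are assumptions that must match your schema settings and Development Guidelines:

    dbHost=dev-db.internal
    dbPort=5432
    dbUser=etl_dev
    filePath=/data/inbound/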
Best Practice #28
Using Dynamic Schemas
• Can a single job design cope with
dynamic schemas?
• 100 tables all need the same job
design; is it possible to build one job
for them all? NO!
• But you can do it in TWO jobs!
One to DUMP Schema
One to LOAD Schema
• Here we use the ‘Information Schema’ of the database to retrieve a list of Tables & Columns, processing each through the tSetDynamicSchema component (a query sketch follows below)
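A minimal sketch of the tJava step in the DUMP job that builds the metadata query from the database’s Information Schema; ‘sqlDumpQuery’ and ‘targetSchema’ are hypothetical context variables, and the exact view names vary by database vendor:

    // tJava: construct the query that lists every table and column to process.
    context.sqlDumpQuery =
        "SELECT table_name, column_name, data_type " +
        "FROM information_schema.columns " +
        "WHERE table_schema = '" + context.targetSchema + "' " +
        "ORDER BY table_name, ordinal_position";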
Best Practice #29
Dynamic SQL Components
• Instead of using the ‘Information
Schema’ to pull the list of Tables and
Columns, specialized components
for each DB are available:
t{db}TableList
t{db}ColumnList
• TWO jobs perform the same process
shown previously, yet differently
One to DUMP Schema
One to LOAD Schema
• These components can be used for
other job designs as well
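A minimal sketch of consuming the table list via an Iterate link; the assumption here is that the MySQL flavor of the component exposes the current table name in the globalMap as shown (check the component’s Outline view for the exact key in your version):

    // tJava on the Iterate link from a tMysqlTableList component.
    String currentTable = (String) globalMap.get("tMysqlTableList_1_CURRENT_TABLE");
    System.out.println("Dumping schema for table: " + currentTable);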
Best Practice #30
CDC – Change Data Capture
• How a Job Design handles CDC is very important when data
synchronization is required
• Talend job designs can use the ‘Publish/Subscribe’ mechanism
tied directly with the host database system involved, including:
✓ Oracle ✓ MySQL
✓ MS SQL Server ✓ PostgreSQL
✓ Sybase ✓ Informix
✓ Ingres ✓ DB2
✓ Teradata ✓ AS/400
• Each of these has a corresponding t{db}CDC component to use
Best Practice #30
How does CDC work in Talend?
• Three CDC modes are available:
➢Trigger (default) - Uses DB Host triggers that track Inserts, Updates, & Deletes
➢Redo/Archive Log - Used with Oracle 11g and earlier versions only
➢XStream - Used with Oracle 12 and OCI only
• Go to help.talend.com for
instructions on how to use this
feature
Some Parting Do’s & Don’ts
➢ Do Use Both The tPreJob & tPostJob Components
➢ Do Not Clutter Canvas With Tightly Grouped Components; Spread it out a bit
➢ Do Layout Your Code Nicely; Top-2-Bottom & Left-2-Right
➢ Do Not Expect To Get It Just Right The 1st Time You Code It
➢ Do Identify Your Main Job Loop & Control Your Exit
➢ Do Not Ignore Error Handling Techniques
➢ Do Use Context Groups Extensively (DEV/QA/UAT/PROD) & Wisely
➢ Do Not Create Massive Single Job Layouts
➢ Do Create Atomic Job Modules
➢ Do Not Force Complexity; Simplify
➢ Do Use Generic Schemas Everywhere (arguable exception is the single column schema)
➢ Do Not Forget To Name Your Objects
➢ Do Use Joblets Where Appropriate (there may only be a few)
➢ Do Not Overuse The tJavaFlex Component; tJava or tJavaRow is likely enough
➢ Do Generate/Publish The Project Documentation When Done
➢ Do Not Skip Setting The Runtime Memory Heap