Talend Job Design Patterns

This document provides an overview and best practices for designing Talend jobs. It discusses job design patterns that promote readability, writability and maintainability. Some key points: - Jobs should be broken into smaller, modular components to avoid monolithic designs. Parent/child job hierarchies and reusable joblets are recommended. - Layouts should follow a top-down, left-to-right structure. Loops and error handling should be clearly defined. - Common patterns include using tPreJob and tPostJob components for entry/exit points, and handling errors and logging consistently across jobs. - OnComponent and OnSubJob triggers affect control flow and should be used appropriately. Comments and naming conventions

Uploaded by

Aloui Hatem

Talend Tech Boot Camp

Job Design Patterns & Best Practices

1
Agenda
• Introduction & Overview
• SDLC – Development Guidelines
• Job Design Patterns
• Best Practices 1-16
• DDLC – Database Modeling
• Best Practices 17-32
• Open Discussions & QA

2
Question
“What is the best way for me to write a Talend job”?
• It needs to be:
• EASY to READ
• EASY to WRITE
• EASY to MAINTAIN

... but honestly, you can only effectively pick 2!


4
Answer
If the Job will change over time:
• Priority is to make it:
- EASY to MAINTAIN
- EASY to READ
- EASY to WRITE
If there are multiple developers:
• Priority is to make it:
- EASY to READ
- EASY to MAINTAIN
- EASY to WRITE
If the Job is unlikely to change:
• Priority is to make it:
- EASY to WRITE
- EASY to READ
- EASY to MAINTAIN
5
Paint your Talend Code

After several years of developing visual code, patterns started to emerge:
• Canvas Layout & Spacing
• Process & Data Flow
• Modularized Code
• Consistent Job Types
• Harness / Process Driven
• Stateless / Stateful
• Atomic (Parent/Child & Joblets)

But Basics Still Count:
• Functionality
• Error Handling
• Memory Management
• Performance
• Naming Conventions

Avoiding Complexity:
• NO Monolithic Jobs
• Minimize Depth Levels
• Scrunched Components
• Overlapping Links

6
Talend SDLC Best Practices Guide

Talend’s Software Architecture is:


• Comprehensive
• Multi-Faceted
• Robust
• Flexible
• Serious Stuff

…. so take it Seriously

7
Continuous Integration/Deployment

Talend Software Development Life Cycle Best Practices Guide


8
SDLC – Development Guidelines

9
Formulating the Basics
Foundational Precepts
✓ Readability: creating code that can be easily figured out and understood
✓ Writeability: creating straightforward, simple, code in the least amount of time
✓ Maintainability: creating appropriate complexity with minimal impact from change
✓ Functionality: creating code that delivers on the requirements
✓ Reusability: creating sharable objects and atomic units of work
✓ Conformity: creating real discipline across teams, projects, repositories, and code
✓ Pliability: creating code that will bend but not break
✓ Scalability: creating elastic modules that adjust throughput on demand
✓ Consistency: creating commonality across everything
✓ Efficiency: creating optimized data flow and component utilization
✓ Compartmentalization: creating atomic, focused modules that serve a single purpose
✓ Optimization: creating the most functionality with the least amount of code
✓ Performance: creating effective modules that provide the fastest throughput

10
SDLC – Developer Guidelines
Guidelines NOT Standards – It’s about Discipline
• Standards are rigid, leaving no room for the unexpected
• Guidelines are pliable; they can bend and rarely break
Create a Development Guidelines Document
• Involvement and Adoption from all teams is essential
• Incorporates Corporate SDLC process
• Defines the foundation, structure, & context
Other Useful Documents
• Code Module Library
• Data Dictionary
• Data Access Layer
11
Just One More Thing
Instill Good Habits
• Start easy – something everyone can adopt
• Agree to label every component for code readability; a foundational Precept!
• Incrementally raise the bar – over time
• Organize Repository Folders
• Establish & utilize naming conventions; Conformity!
• Adopt logging, messaging, & error handlers
• Build out reusable/shared code modules; Several precepts used here!

As Development Guideline Document Evolves


• Discipline improves
• Project code becomes:
• EASIER TO READ
• EASIER TO WRITE
• EASIER TO MAINTAIN

12
Job Design Patterns & Best Practices

13
What Are Job Design Patterns?
Template or Skeleton Layouts
• Focus on essential and/or required elements
• Often bound around use case
• Target Common and/or Reusable code modules
• When identified and implemented properly they:
• Strengthen overall code base
• Condense overall effort
• Reduce repetitive but similar code

Adopt a Repeatable Coding Style for Easy: R/W/M


• So every developer can view and understand any other’s code
• Jumpstarts development for any new project
• This is where Best Practices come in…

14
Best Practices
Best Practice #1
Consider carefully how to lay out your Job Canvas
• Don’t just splash objects on your canvas
with the idea to clean it up later
• Use discipline to paint it right the 1st time
• Allow space for Readability & Maintainability
• Line up the Flows & Components

• Preferred layout is
‘Top-to-Bottom’ then ‘Left-to-Right’
• A ‘Zig-Zag’ or ‘Snake’ layout can be easy to write,
and maybe easy to follow along
• But inserting new functionality can lead to
re-factoring the whole job layout

16
Best Practice #2
Atomic Job Modules – Parent/Child Jobs
• Avoid Big, Monolithic Jobs
• They can be hard to read and maintain
• Plus they can perform poorly

• Break Big Process flows into smaller Jobs


• Establish a Parent/Child hierarchy
• But keep the nesting levels to a minimum
• Recommended maximum nesting is 5
• Consider Job Memory Settings at each level
• Carefully set the checkboxes:
• ‘Use an independent process to run subjob’
• ‘Die on child error’
• ‘Transmit whole context’

17
Best Practice #3
Joblets versus tRunJob Component
• INCLUDED code vs CALLED code
• Joblets are common code you ‘Include’ in your job
• tRunJob ‘Calls’ a Child job from a Parent job

• Both promote code reusability


• Establish a Parent/Child hierarchy
• A highly effective strategy when used appropriately

18
Best Practice #4
Job Entry & Exit Points
• Talend code needs to Start & Stop
somewhere
• tPreJob & tPostJob components are highly advised
• tPreJob executes 1st, then continues
• tPostJob wraps it up (like ‘finally’ for you OOP guys)

• Use tWarn & tDie Components effectively


• They provide programmable control over where and
when a job should complete
• Note that the tDie component can control whether the JVM
exits immediately or not!

19
Best Practice #5
Error Handling & Logging
• One of the MOST IMPORTANT
things you can incorporate into
your Jobs
• Creating a common Error Handler is
highly advised
• Incorporate well defined ‘Return Codes’

• Use Project Settings>Log4J


• Configure and use the Logstash server
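
One way to keep ‘Return Codes’ well defined is to hold them in a single shared code routine that every job and the common error handler refer to. The sketch below is a hypothetical Java enum; the names and numeric values are illustrative assumptions, not codes defined by Talend.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical return-code catalog; names and values are illustrative only.
enum ReturnCode {
    SUCCESS(0, "Job completed normally"),
    WARNING(1, "Job completed with warnings"),
    DATA_ERROR(2, "Input data failed validation"),
    CONNECTION_ERROR(3, "Could not reach a source or target system"),
    FATAL(9, "Unrecoverable error; job aborted");

    private final int code;
    private final String description;

    ReturnCode(int code, String description) {
        this.code = code;
        this.description = description;
    }

    int code() {
        return code;
    }

    String description() {
        return description;
    }

    // Looks up a code received from a child job or read back from a log table.
    static Optional<ReturnCode> fromCode(int code) {
        return Arrays.stream(values())
                .filter(rc -> rc.code == code)
                .findFirst();
    }
}
```

An error handler could then pass `ReturnCode.DATA_ERROR.code()` as the tDie error code and log `description()` alongside it, so the same number always means the same thing.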

20
Best Practice #6
OnSubJobOk/ERROR vs OnComponentOK/ERROR
• Often Misunderstood
• These ‘trigger’ links do affect job design
flow and must be considered properly
• OK vs ERROR is obvious
• OnSubJob will pass control only after
the current Sub Job has executed fully
• OnComponent will pass control only
after the component has processed a
row or a data set (depending upon the
component)

• Also ‘Run If’ linkage


• Quite useful when continuation of the
process needs to be controlled programmatically
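
A ‘Run If’ condition is a Java boolean expression, commonly written against the job's globalMap, for example `((Integer)globalMap.get("tFileInputDelimited_1_NB_LINE")) > 0`. The sketch below wraps the same check in a null-safe helper. The class, the method, and the component name `tFileInputDelimited_1` are assumptions for illustration, and a plain `Map` stands in for the globalMap a real job provides.

```java
import java.util.Map;

final class RunIfSketch {
    // Returns true when the named component reported at least one processed row.
    // The key follows the "<componentName>_NB_LINE" pattern used for row counts.
    static boolean hasRows(Map<String, Object> globalMap, String componentName) {
        Object count = globalMap.get(componentName + "_NB_LINE");
        return count instanceof Integer && (Integer) count > 0;
    }
}
```

The null-safe form avoids a NullPointerException when the upstream component never ran, which the bare cast would throw.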

21
Best Practice #7
What is a Job Loop?
• A Highly Significant Job Design
Consideration!
• These are decision points where control
of the next step in the process is made
• All Job designs should identify One (1)
‘Main Loop’ where exit control can be
established
• Again use established ‘Return Codes’
and exit strategies
• ‘Secondary Loops’ are OK, just ensure
the process flow makes sense

22
Best Practice #8
Software Development Life Cycle (SDLC)
• “People, Product, & Process” – Marcus Lemonis, “The Profit” (CNBC)
• These 3 keys can determine the Success or Failure of any Business
• The same is true for Software Development
• Talend’s SDLC Best Practice Guide provides a deep look into the
concepts, principles, specifications, and details of the Continuous
Integration/Deployment practices available to Talend developers
• Incorporation of any SDLC Best practice into a ‘Development
Guidelines’ document is highly advised

23
Best Practice #9
Managing Workspaces
• Talend Studio installations use a ‘Workspace’
• Typically created on your local disk drive C:
• As in many software installations a ‘Default’ location is assigned
• Usually placed alongside the Software executables

We Recommend you Change your Workspace!


• The default location may not be the best place to store your code
• These directories are attached to a Source Code Control System (SVN or GIT)
• The TAC manages synchronization of these workspaces with the SCCS
• Backup/Restore & Import/Export operations are clunky when located with executables
• Might even be a good idea to place your workspace on a separate disk drive

24
Best Practice #10
Reference Projects
• Do you know what they are?
• We all want re-usable, common, or generic code that can be shared across projects
• Avoid cut-and-paste and/or copying similar code; Use Reference Projects!
• Limit the number of Reference Projects as too many defeat their purpose

25
Best Practice #11
Object Naming Conventions
• “A rose by any other name is still a rose!” who said that anyway?
• The answer may not matter, but Naming Conventions do!
• All Talend Objects have unique internally used names
• Adopt Conventions of Object Naming in Talend
• Clearly define them in your ‘Development Guidelines’ document
• Have the entire team adopt these conventions

Objects To Consider
• Directories, Folders, & Workspaces
• Data File ‘root’ & I/O locations & names
• Jobs, Joblets, Code Routines
• Context Groups & Variables
• Database Connections
26
Best Practice #12
Project Repository
• Where all project objects reside
• Several Important Sections include:
• Job Designs - where your jobs are located
• Contexts - groups reusable variables
• Code - add java code modules
• Metadata - variety of schema definitions
• Documentation - auto-generate project Wiki

27
Best Practice #13
Version Control
• Job Properties allow setting ‘M’ajor & ‘m’inor version numbering
• Allows a status of ‘development’, ‘test’, ‘production’, or ‘user defined’
• This is designed for Single User Environments ONLY!
• When used in conjunction with a SCCS, considerable workspace ‘bloat’ occurs

• Instead use Project Branching & Tagging with your SCCS


• Cooperative development and seamless source code control require a different method
• SVN and GIT both provide a strategy for Branching & Tagging code
• Talend v6.2.1 GIT supports a graphical Diff/Merge feature @ Job level
• Talend v6.3.1 GIT supports a graphical Diff/Merge feature @ Component level
• Clearly define your preferred method in your ‘Development Guidelines’ document
• Have the entire team adopt these conventions

28
Best Practice #14
Memory Management
• So, you want to run your job?
• Have you considered its Memory
needs?
• Is the data flow processing Millions of Rows or
have lots of columns?
• How many tMap Lookups are employed?
• Do you know how much memory your Job Server
has?
• How many levels of Parent/Child job nesting are
there?
• Are Child jobs run in separate JVM?
• Are you using ESB Jobs? How many Routes?
• Are you using Parallelization?
• Check Job Run>Advanced Settings to
make appropriate adjustments
29
Best Practice #15
Dynamic SQL Syntax
• Talend Database Input components support SQL syntax
• Developers can generate a query based upon the specified schema
• Developers can also hard-code the query as desired
• What about when the query is unknown until Run-Time?
• ‘Context Variables’ to the rescue
• Using a tJava component and context variables can construct the SQL syntax
• Specified in the Database Input component, these variables will execute the constructed SQL
✓ sqlCOLUMNS ✓ sqlFROM
✓ sqlWHERE ✓ sqlGROUPBY
✓ sqlORDERBY ✓ sqlLIMITS

"SELECT " + context.sqlCOLUMNS + context.sqlFROM + context.sqlWHERE
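
The concatenation above can be sketched in plain Java as follows. The `Context` class is a hypothetical stand-in for the job's context variables, and the assumption that each clause carries its own leading space and keyword (for example `" FROM customers"`) is ours, not something the slide specifies.

```java
final class DynamicSqlSketch {
    // Hypothetical stand-in for the context variables listed on the slide.
    static final class Context {
        String sqlCOLUMNS = "";
        String sqlFROM = "";
        String sqlWHERE = "";
        String sqlGROUPBY = "";
        String sqlORDERBY = "";
        String sqlLIMITS = "";
    }

    // Concatenates the clauses in SQL order; empty clauses simply drop out.
    static String buildQuery(Context context) {
        String query = "SELECT " + context.sqlCOLUMNS
                + context.sqlFROM
                + context.sqlWHERE
                + context.sqlGROUPBY
                + context.sqlORDERBY
                + context.sqlLIMITS;
        return query.trim();
    }
}
```

Because the query is assembled by string concatenation, the clause values should come from trusted configuration rather than from user input.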

30
Best Practice #16
Parallelization Options
• There are several mechanisms to enable code parallelization
• Use them correctly, efficiently, and with serious consideration
• Used inappropriately, they may have a negative impact on CPU & RAM utilization
• Used properly, highly performing Job Design Patterns can be created
Common Sense Utilization
• Use parallelization sparingly
• Do not use parallelization for code segments that already perform well
• Do use parallelization for code segments that need high throughput or
where processing bottlenecks occur

31
Best Practice #16
Parallelization Option Stack
• Execution Plan (TAC): Multiple jobs/tasks can be configured to run in parallel
• Multiple Job Flows (Job): Within a single job, multiple starting points can be created which will execute simultaneously yet share the same thread; preference should be to create separate child jobs
• Parent/Child Jobs: When calling a child job, the tRunJob component supports the ‘Use an independent process to run subjob’ check box which, when checked, will establish a separate JVM heap/thread to run the child job in
• Components: The tParallelize component links multiple process flows for simultaneous execution; the tPartitioner, tDepartitioner, tCollector, and tRecollector components offer direct control over the number of parallel threads for a specific data flow
• DB Components: Most of the database components offer an advanced setting to enable parallelization thread counts on specific SQL statements (like INSERT or UPDATE); these can be highly efficient but setting the number too high may have the opposite effect; 2-5 threads is a recommended Best Practice
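
The idea of capping the thread count can be illustrated outside Talend with a plain Java fixed-size thread pool. This is an analogy only, not the code Talend generates; the class and method names are hypothetical, and summing batches stands in for real work such as batched INSERTs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BoundedParallelSketch {
    // Processes each batch on a pool limited to 'threads' workers and sums the results.
    static int sumInParallel(List<List<Integer>> batches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<Integer> batch : batches) {
                Callable<Integer> task =
                        () -> batch.stream().mapToInt(Integer::intValue).sum();
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                total += future.get(); // waits for that batch to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for batches", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A batch failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With a small pool, extra batches queue up instead of spawning more threads, which is the behavior the 2-5 thread guidance aims for.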

32
BREAK TIME

33
Best Practice #17
Code Routines
• You can add tJava components that
embed java code as needed into a flow
OR
• You can add custom Java methods to
the project repository which can be
used in a variety of ways
• Many built-in functions, like:
• getCurrentDate()
• sequence(String seqName, int startValue, int step)
• ISNULL(Object variable)
• Make sure to incorporate comments
which provide function ‘helper’ text
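
A custom routine with ‘helper’ text might look like the sketch below. The routine itself (`nullToDefault`) is a hypothetical example. The comment tags follow the style of the template Studio generates for new routines, where the class would normally be public and live in the `routines` package; treat those details as assumptions to verify against your Studio version.

```java
final class StringRoutines {

    /**
     * nullToDefault: returns the fallback when the input is null or blank.
     *
     * {talendTypes} String
     *
     * {Category} User Defined
     *
     * {param} string("  ") input: the value to check.
     *
     * {param} string("N/A") fallback: the value returned when input is null or blank.
     *
     * {example} nullToDefault("  ", "N/A") # N/A
     */
    static String nullToDefault(String input, String fallback) {
        return (input == null || input.trim().isEmpty()) ? fallback : input;
    }
}
```

In a tMap expression this would be called as `StringRoutines.nullToDefault(row1.name, "N/A")`, where `row1.name` is a placeholder column.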
34
Best Practice #18
Repository Schemas
• Reusable Objects defined in the Project
Repository Metadata provide significant
opportunities to create reusable code
• Repository Schemas include:
• Files
• Delimited / Positional / Regex
• XML / JSON
• Excel
• Generic
• WSDL
• LDAP
• UN/EDIFACT

‘md_{objectname}’
35
Best Practice #19
Apache Log4J (Studio)
• All components are Log4J enabled (v6+)
• ‘Enable’ in the Studio Project Settings
• Customize Log4J scripting paradigm
• Works with ELK:
• Elasticsearch
• Logstash
• Kibana UI
• Utilizes Talend Priorities:
• INFO
• WARNING
• ERROR
• FATAL

36
Best Practice #19
Apache Log4J (TAC)
• Also ‘enable’ in the TAC for each Task
• Ensure to set appropriately for each
environment:
• DEV / TEST / UAT / PROD
• Use in conjunction with your error
handler
• Ensure to utilize the components:
• tDie
• tWarn
• tAssert

37
Best Practice #20
Activity Monitoring Console: AMC (Studio)
• ‘Enable’ database logging in the
Studio Project Settings
• Specify Database Connection
• ‘Enable’ which information to catch
• Java Runtime Errors
• Job Errors
• Job Warnings
• Select AMC tables to use:
• tStatCatcher
• tLogCatcher
• tFlowMeterCatcher

38
Best Practice #20
Activity Monitoring Console: AMC (TAC)
• Visualization available in
both Talend Studio &
the TAC
• Establish ‘Return Codes’
as discussed in #5 which
provide a mechanism to
query the tLogCatcher
table externally

39
Best Practice #21
Recovery Checkpoints (Studio)
• When long running jobs or jobs with
critical steps fail, starting over can be
problematic
• Restart/recover these jobs from
a specified checkpoint instead
• With Talend you can set one or more
Checkpoints on ‘OnSubJobOk’ links
• ‘Enable’ Recovery Checkpoint
• Give it a name
• Document recovery information

40
Best Practice #21
Recovery Checkpoints (TAC)
• Tasks define how to recover
a Job automatically on a Job
Server
• Wait
• Reset Task
• Restart Task
• Recover Task
• Error Recovery Manager
provides the ability to
manually restart at a
selected checkpoint

41
Best Practice #22
Joblets
• We looked at Joblets in #3 & #5
• ‘Included’ in Jobs, not called
• Most have ‘Input’ & ‘Output’
components to pass data flow
through
• Reusable Code within single job
or across many jobs
• Not all components should be
used in a Joblet, like:
• dB Connections
• tJavaFlex, unless fully contained
in Joblet
42
Best Practice #23
Component Test Cases
• Available since v6.0.1
• Components allow creation of a
‘Test Case’ where data flow is
involved
• Test case is tied to component
• Right click on component under
test to generate a ‘Test Case’ job
• When component schema
changes the test case changes
automatically
• Generated but can be modified
43
Best Practice #23
Test Case Job
• Reads an ‘input data file’
• Processes the data through the
component under test
• Writes out a ‘result file’
• Compares to an expected result
or ‘reference file’ for a match
• PASS / FAIL
• A test case ‘instance’ can support
multiple ‘input’ and ‘reference’
files
• GOOD / BAD / UGLY
• SMALL / MEDIUM / LARGE
44
Best Practice #24
Data Flow Iterations
• A normal link between components
is either a ‘trigger’ or a ‘row’
• A Data Flow generally processes a
‘pipeline’ of records or a list of files
• Each pipeline between two
components has a unique object
name
• e.g.: row1; row2; etc.
• Pipelines usually process all
rows until done, but sometimes
logic requires direct row-by-row
control, called ‘iterations’
• Components shown: tFlowToIterate, tIterateToFlow
45
Best Practice #25
tMap Lookups
• The highly essential tMap
component is used for processing a
data flow from a ’source’ to a
‘target’ where some remapping
and/or transformation takes place
• A compelling use for the tMap is for
data Lookups that ‘join’ the primary
data flow with one or more other
data flows
• These ‘lookups’ can originate from
many kinds of source data
• Notice the ‘Lookup Model’
46
Best Practice #25
tMap Lookup Considerations
• How you set up ‘joins’ for lookups
impacts both performance and
memory
• Choose the right ‘Lookup Model’
• Load Once
• Reload at each Row
• Reload at each Row (cache)
• Loading once into memory will likely be
much faster but may
oversubscribe available RAM
• Reloading row by row will use far less
memory but will be slower
47
Best Practice #25
tMap Row-by-Row Lookups
• The ‘key’ required for row-by-row
lookups is set in the tMap editor
shown previously
• The ‘lookup’ data flow then needs
to use the variable in the join
logic using the expression
(datatype)globalMap.get("key")
• This method is limited to SQL
database lookups
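
A minimal sketch of how the lookup query might use the key: the value is read from globalMap, cast to its type, and appended to the WHERE clause. The key name `customerId`, the `customers` table, and its columns are hypothetical, and a plain `Map` stands in for the job's globalMap.

```java
import java.util.Map;

final class LookupQuerySketch {
    // Builds the per-row lookup query from the key the tMap stored in globalMap.
    static String lookupQuery(Map<String, Object> globalMap) {
        // globalMap stores Object values, so the cast to the real type is required.
        Integer customerId = (Integer) globalMap.get("customerId");
        return "SELECT id, name FROM customers WHERE id = " + customerId;
    }
}
```

In a database input component the same idea is usually written inline, as a query string ending in `... WHERE id = " + (Integer)globalMap.get("customerId")`.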

48
Best Practice #26
Global Variables
• ‘Context Variables’ are used in jobs
to control values programmatically
at runtime; referenced as:
context.variable
• ‘Built In’ variables are available only
within the job they are created in
• ‘Project Repository’ variables are
available across all jobs in a project;
this is the recommended practice
• The tSetGlobalVar component
defines them within a job at runtime

50
Best Practice #26
More on Global Variables
• The tGlobalVarLoad component is
used for the same purpose in Big
Data jobs
• Use the globalMap to access them
(datatype)globalMap.get("gVar")
• Using global variables with
components provides better
memory management; however, it
requires more code

51
Best Practice #26
System Global Variables
• Use them where needed:
ERROR_MESSAGE
DIE_MESSAGE
WARN_MESSAGE
CHILD_RETURN_CODE
DIE_CODE
WARN_CODE
NB_LINE
NB_LINE_OK
NB_LINE_REJECT
NB_LINE_INSERTED
NB_LINE_UPDATED
NB_LINE_DELETED
• Two more include:
global.projectName
global.jobName

52
Best Practice #27
Loading Contexts
• Context Group variables
can be loaded at runtime
using the tContextLoad
component
• Storing them externally in a
file can be highly effective
and even support some
security concerns
• A corresponding
tContextDump component
can write out values from a
database first
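
To illustrate the idea of externally stored context values, the sketch below parses a `key=value` text into a lookup structure using the standard `java.util.Properties` class. It is a plain-Java analogy for what tContextLoad does with key/value rows, not Talend's implementation; the file layout and the variable names `dbHost` and `dbPort` are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ContextFileSketch {
    // Parses "key=value" lines (one per line) into a Properties object.
    static Properties load(String fileContents) {
        Properties values = new Properties();
        try {
            values.load(new StringReader(fileContents));
        } catch (IOException e) {
            throw new UncheckedIOException("Could not parse context values", e);
        }
        return values;
    }
}
```

Keeping one such file per environment (DEV / TEST / UAT / PROD) lets the same job run unchanged while only the loaded values differ.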
53
Best Practice #28
Using Dynamic Schemas
• Can a single job design cope with
dynamic schemas?
• 100 tables all need the same job
design; is it possible to build one job
for them all? NO!
• But you can do it in TWO jobs!
One to DUMP Schema
One to LOAD Schema
• Here we use the ‘Information
Schema’ of the database to retrieve
a list of Tables & Columns,
processing each through the
tSetDynamicSchema component
54
Best Practice #29
Dynamic SQL Components
• Instead of using the ‘Information
Schema’ to pull the list of Tables and
Columns, specialized components
for each DB are available:
t{db}TableList
t{db}ColumnList
• TWO jobs perform the same process
shown previously, yet differently
One to DUMP Schema
One to LOAD Schema
• These components can be used for
other job designs as well
55
Best Practice #30
CDC – Change Data Capture
• How a Job Design handles CDC is very important when data
synchronization is required
• Talend job designs can use the ‘Publish/Subscribe’ mechanism
tied directly with the host database system involved, including:
✓ Oracle ✓ MySQL
✓ MS SQL Server ✓ PostgreSQL
✓ Sybase ✓ Informix
✓ Ingres ✓ DB2
✓ Teradata ✓ AS/400
• Each of these has a corresponding t{db}CDC component to use

56
Best Practice #30
How does CDC work in Talend?
• Three CDC modes are available:
➢Trigger (default) - Uses DB Host triggers that track Inserts, Updates, & Deletes
➢Redo/Archive Log - Used with Oracle 11g and earlier versions only
➢XStream - Used with Oracle 12 and OCI only

Talend User Guide, Chapter 11


57
Best Practice #31
Custom Components
• With 1000+ components in
the Data Fabric Platform, is that
enough?
• Not if you want to incorporate
specialized business logic into a
repeatable object for jobs
• Many 3rd Party components are
available at exchange.talend.com
• Setup the Preferences Dialog box
• Install custom components from
the Exchange menu link
58
Best Practice #31
Custom Components – Build your Own
• Building your own custom
components is another choice
• Most custom components are
built on the ‘JavaJet’ framework
• Use help.talend.com to find
the tutorial on how to build
custom components
• The new Talend Component
Framework ‘TCOMP’, which is Java
based, is planned for release soon
59
Best Practice #32
JobScript API
• Normally we create Jobs in the
‘Designer’ tab of the Studio
• Can a Job be GENERATED?
• YES – with JobScripts!
• All jobs created in the ‘Designer’
have a corresponding ‘JobScript’

• Go to help.talend.com for
instructions on how to use this
feature

60
Some Parting Do’s & Don’ts
➢ Do Use Both The tPreJob & tPostJob Components
➢ Do Not Clutter Canvas With Tightly Grouped Components; Spread it out a bit
➢ Do Layout Your Code Nicely; Top-2-Bottom & Left-2-Right
➢ Do Not Expect To Get It Just Right The 1st Time You Code It
➢ Do Identify Your Main Job Loop & Control Your Exit
➢ Do Not Ignore Error Handling Techniques
➢ Do Use Context Groups Extensively (DEV/QA/UAT/PROD) & Wisely
➢ Do Not Create Massive Single Job Layouts
➢ Do Create Atomic Job Modules
➢ Do Not Force Complexity; Simplify
➢ Do Use Generic Schemas Everywhere (arguable exception is the single column schema)
➢ Do Not Forget To Name Your Objects
➢ Do Use Joblets Where Appropriate (there may only be a few)
➢ Do Not Overutilize The tJavaFlex Component; tJava or tJavaRow is likely enough
➢ Do Generate/Publish The Project Documentation When Done
➢ Do Not Skip Setting The Runtime Memory Heap
61