User Guide for PowerCenter: Informatica PowerExchange for Hadoop
Version 10.0
November 2015
This software and documentation contain proprietary information of Informatica LLC and are provided under a license agreement containing restrictions on use and
disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any
form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. This Software may be protected by U.S. and/or
international Patents and other Patents Pending.
Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as
provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14
(ALT III), as applicable.
The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us
in writing.
Informatica, Informatica Platform, Informatica Data Services, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange,
PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Transformation, Informatica B2B Data Exchange, Informatica
On Demand, Informatica Identity Resolution, Informatica Application Information Lifecycle Management, Informatica Complex Event Processing, Ultra Messaging and
Informatica Master Data Management are trademarks or registered trademarks of Informatica LLC in the United States and in jurisdictions throughout the world. All
other company and product names may be trade names or trademarks of their respective owners.
Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies. All rights
reserved. Copyright © Sun Microsystems. All rights reserved. Copyright © RSA Security Inc. All Rights Reserved. Copyright © Ordinal Technology Corp. All rights
reserved. Copyright © Aandacht c.v. All rights reserved. Copyright Genivia, Inc. All rights reserved. Copyright Isomorphic Software. All rights reserved. Copyright © Meta
Integration Technology, Inc. All rights reserved. Copyright © Intalio. All rights reserved. Copyright © Oracle. All rights reserved. Copyright © Adobe Systems
Incorporated. All rights reserved. Copyright © DataArt, Inc. All rights reserved. Copyright © ComponentSource. All rights reserved. Copyright © Microsoft Corporation. All
rights reserved. Copyright © Rogue Wave Software, Inc. All rights reserved. Copyright © Teradata Corporation. All rights reserved. Copyright © Yahoo! Inc. All rights
reserved. Copyright © Glyph & Cog, LLC. All rights reserved. Copyright © Thinkmap, Inc. All rights reserved. Copyright © Clearpace Software Limited. All rights
reserved. Copyright © Information Builders, Inc. All rights reserved. Copyright © OSS Nokalva, Inc. All rights reserved. Copyright Edifecs, Inc. All rights reserved.
Copyright Cleo Communications, Inc. All rights reserved. Copyright © International Organization for Standardization 1986. All rights reserved. Copyright © ej-
technologies GmbH. All rights reserved. Copyright © Jaspersoft Corporation. All rights reserved. Copyright © International Business Machines Corporation. All rights
reserved. Copyright © yWorks GmbH. All rights reserved. Copyright © Lucent Technologies. All rights reserved. Copyright (c) University of Toronto. All rights reserved.
Copyright © Daniel Veillard. All rights reserved. Copyright © Unicode, Inc. Copyright IBM Corp. All rights reserved. Copyright © MicroQuill Software Publishing, Inc. All
rights reserved. Copyright © PassMark Software Pty Ltd. All rights reserved. Copyright © LogiXML, Inc. All rights reserved. Copyright © 2003-2010 Lorenzi Davide, All
rights reserved. Copyright © Red Hat, Inc. All rights reserved. Copyright © The Board of Trustees of the Leland Stanford Junior University. All rights reserved. Copyright
© EMC Corporation. All rights reserved. Copyright © Flexera Software. All rights reserved. Copyright © Jinfonet Software. All rights reserved. Copyright © Apple Inc. All
rights reserved. Copyright © Telerik Inc. All rights reserved. Copyright © BEA Systems. All rights reserved. Copyright © PDFlib GmbH. All rights reserved. Copyright ©
Orientation in Objects GmbH. All rights reserved. Copyright © Tanuki Software, Ltd. All rights reserved. Copyright © Ricebridge. All rights reserved. Copyright © Sencha,
Inc. All rights reserved. Copyright © Scalable Systems, Inc. All rights reserved. Copyright © jQWidgets. All rights reserved. Copyright © Tableau Software, Inc. All rights
reserved. Copyright © MaxMind, Inc. All Rights Reserved. Copyright © TMate Software s.r.o. All rights reserved. Copyright © MapR Technologies Inc. All rights reserved.
Copyright © Amazon Corporate LLC. All rights reserved. Copyright © Highsoft. All rights reserved. Copyright © Python Software Foundation. All rights reserved.
Copyright © BeOpen.com. All rights reserved. Copyright © CNRI. All rights reserved.
This product includes software developed by the Apache Software Foundation (http://www.apache.org/), and/or other software which is licensed under various versions
of the Apache License (the "License"). You may obtain a copy of these Licenses at http://www.apache.org/licenses/. Unless required by applicable law or agreed to in
writing, software distributed under these Licenses is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the Licenses for the specific language governing permissions and limitations under the Licenses.
This product includes software which was developed by Mozilla (http://www.mozilla.org/), software copyright The JBoss Group, LLC, all rights reserved; software
copyright © 1999-2006 by Bruno Lowagie and Paulo Soares and other software which is licensed under various versions of the GNU Lesser General Public License
Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, "as-is", without warranty of any
kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.
The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California,
Irvine, and Vanderbilt University, Copyright (©) 1993-2006, all rights reserved.
This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (copyright The OpenSSL Project. All Rights Reserved) and
redistribution of this software is subject to terms available at http://www.openssl.org and http://www.openssl.org/source/license.html.
This product includes Curl software which is Copyright 1996-2013, Daniel Stenberg, <[email protected]>. All Rights Reserved. Permissions and limitations regarding this
software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or
without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.
The product includes software copyright 2001-2005 (©) MetaStuff, Ltd. All Rights Reserved. Permissions and limitations regarding this software are subject to terms
available at http://www.dom4j.org/license.html.
The product includes software copyright © 2004-2007, The Dojo Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to
terms available at http://dojotoolkit.org/license.
This product includes ICU software which is copyright International Business Machines Corporation and others. All rights reserved. Permissions and limitations
regarding this software are subject to terms available at http://source.icu-project.org/repos/icu/icu/trunk/license.html.
This product includes software copyright © 1996-2006 Per Bothner. All rights reserved. Your right to use such materials is set forth in the license which may be found at
http://www.gnu.org/software/kawa/Software-License.html.
This product includes OSSP UUID software which is Copyright © 2002 Ralf S. Engelschall, Copyright © 2002 The OSSP Project, Copyright © 2002 Cable & Wireless
Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php.
This product includes software developed by Boost (http://www.boost.org/) or under the Boost software license. Permissions and limitations regarding this software are
subject to terms available at http://www.boost.org/LICENSE_1_0.txt.
This product includes software copyright © 1997-2007 University of Cambridge. Permissions and limitations regarding this software are subject to terms available at
http://www.pcre.org/license.txt.
This product includes software copyright © 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms
available at http://www.eclipse.org/org/documents/epl-v10.php and at http://www.eclipse.org/org/documents/edl-v10.php.
This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?License,
http://www.stlport.org/doc/license.html, http://asm.ow2.org/license.html, http://www.cryptix.org/LICENSE.TXT, http://hsqldb.org/web/hsqlLicense.html,
http://httpunit.sourceforge.net/doc/license.html, http://jung.sourceforge.net/license.txt, http://www.gzip.org/zlib/zlib_license.html,
http://www.openldap.org/software/release/license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/OpenSourceLicense.html,
http://fusesource.com/downloads/license-agreements/fuse-message-broker-v-5-3-license-agreement; http://antlr.org/license.html; http://aopalliance.sourceforge.net/;
http://www.bouncycastle.org/licence.html; http://www.jgraph.com/jgraphdownload.html; http://www.jcraft.com/jsch/LICENSE.txt; http://jotm.objectweb.org/bsd_license.html;
http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231; http://www.slf4j.org/license.html; http://nanoxml.sourceforge.net/orig/copyright.html;
http://www.json.org/license.html; http://forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html,
http://www.tcl.tk/software/tcltk/license.html, http://www.jaxen.org/faq.html, http://www.jdom.org/docs/faq.html, http://www.slf4j.org/license.html;
http://www.iodbc.org/dataspace/iodbc/wiki/iODBC/License; http://www.keplerproject.org/md5/license.html; http://www.toedter.com/en/jcalendar/license.html;
http://www.edankert.com/bounce/index.html; http://www.net-snmp.org/about/license.html; http://www.openmdx.org/#FAQ; http://www.php.net/license/3_01.txt;
http://srp.stanford.edu/license.txt; http://www.schneier.com/blowfish.html; http://www.jmock.org/license.html; http://xsom.java.net; http://benalman.com/about/license/;
https://github.com/CreateJS/EaselJS/blob/master/src/easeljs/display/Bitmap.js; http://www.h2database.com/html/license.html#summary;
http://jsoncpp.sourceforge.net/LICENSE; http://jdbc.postgresql.org/license.html; http://protobuf.googlecode.com/svn/trunk/src/google/protobuf/descriptor.proto;
https://github.com/rantav/hector/blob/master/LICENSE; http://web.mit.edu/Kerberos/krb5-current/doc/mitK5license.html; http://jibx.sourceforge.net/jibx-license.html;
https://github.com/lyokato/libgeohash/blob/master/LICENSE; https://github.com/hjiang/jsonxx/blob/master/LICENSE; https://code.google.com/p/lz4/;
https://github.com/jedisct1/libsodium/blob/master/LICENSE; http://one-jar.sourceforge.net/index.php?page=documents&file=license;
https://github.com/EsotericSoftware/kryo/blob/master/license.txt; http://www.scala-lang.org/license.html; https://github.com/tinkerpop/blueprints/blob/master/LICENSE.txt;
http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html; https://aws.amazon.com/asl/; https://github.com/twbs/bootstrap/blob/master/LICENSE;
https://sourceforge.net/p/xmlunit/code/HEAD/tree/trunk/LICENSE.txt; https://github.com/documentcloud/underscore-contrib/blob/master/LICENSE, and
https://github.com/apache/hbase/blob/master/LICENSE.txt.
This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution
License (http://www.opensource.org/licenses/cddl1.php), the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License
Agreement Supplemental License Terms, the BSD License (http://www.opensource.org/licenses/bsd-license.php), the new BSD License
(http://opensource.org/licenses/BSD-3-Clause), the MIT License (http://www.opensource.org/licenses/mit-license.php), the Artistic License
(http://www.opensource.org/licenses/artistic-license-1.0) and the Initial Developer's Public License Version 1.0 (http://www.firebirdsql.org/en/initial-developer-s-public-license-version-1-0/).
This product includes software copyright © 2003-2006 Joe Walnes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this
software are subject to terms available at http://xstream.codehaus.org/license.html. This product includes software developed by the Indiana University Extreme! Lab.
For further information please visit http://www.extreme.indiana.edu/.
This product includes software Copyright (c) 2013 Frank Balluffi and Markus Moeller. All rights reserved. Permissions and limitations regarding this software are subject
to terms of the MIT license.
DISCLAIMER: Informatica LLC provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied
warranties of noninfringement, merchantability, or use for a particular purpose. Informatica LLC does not warrant that this software or documentation is error free. The
information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is
subject to change at any time without notice.
NOTICES
This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software
Corporation ("DataDirect") which are subject to the following terms and conditions:
1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT
INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT
LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.
Table of Contents

Sessions with Hadoop Sources
    Staging HDFS Source Data
    Hadoop Source Connections
    Session Properties for a Hadoop Source
Sessions with Hadoop Targets
    Hadoop Target Connections
    Hadoop Target Partitioning
    Session Properties for a Hadoop Target
Preface
The PowerExchange for Hadoop User Guide provides information about how to build mappings to extract
data from Hadoop and load data into Hadoop.
It is written for database administrators and developers. This guide assumes you have knowledge of
relational database concepts and database engines, flat files, PowerCenter, Hadoop, the Hadoop Distributed
File System (HDFS), and Apache Hive. You must also be familiar with the interface requirements for other
supporting applications.
Informatica Resources
Informatica Documentation
The Informatica Documentation team makes every effort to create accurate, usable documentation. If you
have questions, comments, or ideas about this documentation, contact the Informatica Documentation team
through email at [email protected]. We will use your feedback to improve our
documentation. Let us know if we can contact you regarding your comments.
The Documentation team updates documentation as needed. To get the latest documentation for your
product, navigate to Product Documentation from https://mysupport.informatica.com.
Informatica Product Availability Matrixes
Product Availability Matrixes (PAMs) indicate the versions of operating systems, databases, and other types
of data sources and targets that a product release supports. You can access the PAMs on the Informatica My
Support Portal at https://mysupport.informatica.com.
Informatica Marketplace
The Informatica Marketplace is a forum where developers and partners can share solutions that augment,
extend, or enhance data integration implementations. By leveraging any of the hundreds of solutions
available on the Marketplace, you can improve your productivity and speed up time to implementation on
your projects. You can access Informatica Marketplace at http://www.informaticamarketplace.com.
Informatica Velocity
You can access Informatica Velocity at https://mysupport.informatica.com. Developed from the real-world
experience of hundreds of data management projects, Informatica Velocity represents the collective
knowledge of our consultants who have worked with organizations from around the world to plan, develop,
deploy, and maintain successful data management solutions. If you have questions, comments, or ideas
about Informatica Velocity, contact Informatica Professional Services at [email protected].
Informatica Global Customer Support
You can contact a Customer Support Center by telephone or through the Online Support.
Online Support requires a user name and password. You can request a user name and password at
http://mysupport.informatica.com.
The telephone numbers for Informatica Global Customer Support are available from the Informatica web site
at http://www.informatica.com/us/services-and-training/support-services/global-support-centers/.
CHAPTER 1
Understanding PowerExchange for Hadoop
You can connect a flat file source to Hadoop to extract data from Hadoop Distributed File System (HDFS).
You can connect a flat file target to Hadoop to load data to HDFS. You can also load data to the Hive data
warehouse system.
Understanding Hadoop
Hadoop provides a framework for distributed processing of large data sets across multiple computers. It detects and handles failures at the application layer, rather
than relying on hardware, to deliver high availability.
Hadoop applications use HDFS as the primary storage system. HDFS replicates data blocks and distributes
them across nodes in a cluster.
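As an illustration of the replication model, the following minimal Java sketch uses the standard Hadoop client API (org.apache.hadoop.fs) to ask HDFS for a file's replication factor and block placement. The name node URI and file path are placeholders, not values from this guide:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockReport {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster through the name node. The URI is a placeholder.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/source.csv")); // hypothetical file
            System.out.println("Replication factor: " + status.getReplication());
            // Each block is replicated on several nodes; list where the copies live.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + block.getOffset()
                        + " on hosts: " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }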
Hive is a data warehouse system for Hadoop. You can use Hive to add structure to datasets stored in file
systems that are compatible with Hadoop.
PowerCenter and Hadoop Integration
PowerExchange for Hadoop accesses Hadoop to extract data from HDFS or load data to HDFS or Hive.
To extract data from HDFS, a PowerExchange for Hadoop mapping contains a flat file source. To load data to
HDFS or Hive, a PowerExchange for Hadoop mapping contains a flat file target.
In the Workflow Manager, you specify the HDFS flat file reader to extract data from HDFS. You specify the
HDFS flat file writer to load data to HDFS or Hive. You select a Hadoop HDFS connection object to access
the HDFS database tier of Hadoop.
The Integration Service communicates with Hadoop through the Java Native Interface (JNI). JNI is a programming framework that enables Java code running in a Java
Virtual Machine (JVM) to call, and be called by, native applications and libraries.
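For readers unfamiliar with JNI, the following minimal Java sketch shows the general calling pattern. It is purely illustrative: the class, method, and library names are hypothetical, not Informatica's actual bridge.

    public class HdfsBridge {
        static {
            // Load the hypothetical native library (for example, libhdfsbridge.so on Linux).
            System.loadLibrary("hdfsbridge");
        }

        // Declared in Java, implemented in native code. The JVM dispatches
        // the call across the JNI boundary at run time.
        private native long openFile(String hdfsPath);

        public static void main(String[] args) {
            long handle = new HdfsBridge().openFile("/data/source.csv");
            System.out.println("Native handle: " + handle);
        }
    }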
Environment Variables for MapR Distribution
When you use the MapR distribution to access Hadoop sources and targets, you must configure environment variables.
• Set the MAPR_HOME environment variable to the following path: <Informatica Installation Directory>/server/bin/javalib/hadoop/mapr<version>.
• On the Linux operating system, change the LD_LIBRARY_PATH environment variable to include the following path: <Informatica Installation Directory>/server/bin/javalib/hadoop/mapr<version>.
• Set the MapR Container Location Database name variable CLDB in the following file: <Informatica Installation Directory>/server/bin/javalib/hadoop/mapr<version>/conf/mapr-clusters.conf.
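A quick way to confirm that the variables are visible to a JVM process is a check such as the following Java sketch. It is illustrative only and not part of the product:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class MaprEnvCheck {
        public static void main(String[] args) {
            String maprHome = System.getenv("MAPR_HOME");
            if (maprHome == null || !Files.isDirectory(Paths.get(maprHome))) {
                throw new IllegalStateException("MAPR_HOME is unset or is not a directory");
            }
            String ldPath = System.getenv("LD_LIBRARY_PATH");
            if (ldPath == null || !ldPath.contains(maprHome)) {
                System.err.println("Warning: LD_LIBRARY_PATH does not include " + maprHome);
            }
            // The cluster definition file that the last bullet above refers to.
            System.out.println("Expecting cluster definitions in: "
                    + Paths.get(maprHome, "conf", "mapr-clusters.conf"));
        }
    }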
Complete the following steps to copy the MapR configuration files to the machine that runs the PowerCenter Integration Service:
1. Go to the following directory on any node in the cluster: <MapR installation directory>/conf
For example, go to the following directory: /opt/mapr/conf.
2. Find the following files:
• mapr-clusters.conf
• mapr.login.conf
3. Copy the files to the following directory on the machine on which the PowerCenter Integration Service
runs:
<Informatica installation directory>/server/bin/javalib/hadoop/mapr<version>/conf
4. Log in to the Administrator tool.
5. In the Domain Navigator, select the PowerCenter Integration Service.
6. Recycle the Service.
Click Actions > Recycle Service.
Register the plug-in if you are upgrading from PowerCenter 9.1.0, PowerCenter 9.5.1, or a PowerCenter
9.5.1 HotFix release.
The plug-in file for PowerExchange for Hadoop is pmhdfs.xml. When you install the Repository component,
the installer copies pmhdfs.xml to the following directory:
<Informatica Installation Directory>/server/bin/native
Note: If you do not have the correct privileges to register the plug-in, contact the user who manages the
PowerCenter Repository Service.
When you upgrade PowerExchange for Hadoop, you must recreate the HDFS connections to access Hadoop sources or targets. Use the Namenode URI property to
recreate the HDFS connections.
High Availability
A highly available Hadoop cluster can provide uninterrupted access to the NameNode in the Hadoop cluster. The NameNode tracks file data across the cluster.
You can configure PowerCenter to communicate with a highly available Hadoop cluster on the following
Hadoop distributions:
• Cloudera CDH
• Hortonworks HDP
• IBM BigInsights
• MapR
• Pivotal HD
CHAPTER 3
You can import a flat file definition or manually create one. To load data to a Hadoop target, the flat file
definition must be delimited.
CHAPTER 4
Before you create a session, configure a Hadoop HDFS application connection to connect to the HDFS host.
When the Integration Service extracts or loads Hadoop data, it connects to a Hadoop cluster through the
HDFS host that runs the name node service for a Hadoop cluster.
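In Hadoop client terms, connecting through the HDFS host amounts to opening a FileSystem against the NameNode URI, as in the following Java sketch. The host and port are placeholders, and the Integration Service's internal mechanics are not public:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class NameNodeConnect {
        public static void main(String[] args) throws Exception {
            // The NameNode URI plays the role of the HDFS host in the
            // Hadoop HDFS application connection.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"),
                    new Configuration());
            System.out.println("Connected to: " + fs.getUri());
            fs.close();
        }
    }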
If the mapping contains a flat file source, you can configure the session to extract data from HDFS. If the
mapping contains a flat file target, you can configure the session to load data to HDFS or a Hive table.
When the Integration Service loads data to a Hive table, it first loads data to HDFS. The Integration Service
then generates an SQL statement to create the Hive table and load the data from HDFS to the table.
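The SQL that the Integration Service generates is internal, but the effect resembles the following Java-over-JDBC sketch. The table definition, connection URL, and HDFS path are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveLoadSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default");
                 Statement stmt = conn.createStatement()) {
                // Step 1 has already happened: the flat file data was written to HDFS.
                // Step 2: create the Hive table and load the HDFS data into it.
                stmt.execute("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
                stmt.execute("LOAD DATA INPATH '/user/hadoop/sales_out' OVERWRITE INTO TABLE sales");
            }
        }
    }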
PowerExchange for Hadoop Connections
Use a Hadoop HDFS application connection object for each Hadoop source or target that you want to access.
You configure the following properties for a Hadoop HDFS application connection:

Name
    The connection name used by the Workflow Manager. The connection name cannot contain spaces or other special characters, except for the underscore character.
User Name
    The name of the user in the Hadoop group that is used to access the HDFS host.
Password
    The password to access the HDFS host. Reserved for future use.
Hive User Name
    The Hive user name. Reserved for future use.
Hive Password
    The password for the Hive user. Reserved for future use.
Sessions with Hadoop Sources
When you configure a session for a Hadoop source, you select the HDFS Flat File reader file type and a Hadoop HDFS application connection object. You can stage
the source data and configure partitioning.

Staging HDFS Source Data
Stage an HDFS source when you want the Integration Service to read the source files and then close the connection before continuing to process the data.
Configure staging for an HDFS source by setting HDFS flat file reader properties in a session. You can use the following types of staging:
Direct
Use direct staging when you want to read data from a single source file. For example, you stage a file named source.csv from the Hadoop source location to the
following local file:
c:\staged_files\source_stage.csv
Indirect
Use indirect staging when you want to read data from multiple source files. The Integration Service reads data from multiple files in the source and creates an
indirect file that contains the names of the source files. It then stages the indirect file and the files read from the source in the local staging directory before
passing the data to downstream transformations.
For example, you stage the files named source1.csv and source2.csv from the Hadoop source location with the following staged file name:
c:\staged_files\source_stage_list.txt
The Integration Service creates an indirect file named source_stage_list.txt that contains the
following entries:
source1.csv
source2.csv
The Integration Service stages the indirect file and the source files. In the c:\staged_files directory,
you would see the following files:
source_stage_list.txt
source1.csv
source2.csv
Then, the Integration Service loads data from the staged source files into the Hadoop target path as
specified by the output file path in the HDFS flat file writer properties.
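To make the mechanism concrete, the following Java sketch mimics indirect staging with the Hadoop client API, using the file names from the example above. It illustrates the idea only; it is not the Integration Service's implementation, and the HDFS source directory is hypothetical:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class IndirectStagingSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // default HDFS from the config
            List<String> sources = Arrays.asList("source1.csv", "source2.csv");

            // Stage each source file from HDFS into the local staging directory.
            // Forward slashes also work for Windows paths in Java.
            for (String name : sources) {
                fs.copyToLocalFile(new Path("/user/hadoop/" + name),
                        new Path("c:/staged_files/" + name));
            }
            // Write the indirect file that lists the staged source files.
            Files.write(Paths.get("c:/staged_files/source_stage_list.txt"), sources);
            fs.close();
        }
    }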
To stage an HDFS source, set the following session properties:

Is Staged
    Enabled.
File Path
    For direct staging, enter the name and path of the source file. For indirect staging, enter the directory path to the source files.
Staged File Name
    The name and path on the local machine used to stage the source data.
Session Properties for a Hadoop Source
In the sources node section of the Mapping tab in the session properties, select HDFS Flat File Reader as the reader type for the source. Then, select a Hadoop
HDFS application connection object for the source.
When you select the HDFS flat file reader type, you can select the code page of a delimited file from the code page drop-down list. You cannot set the code page
with a user-defined variable or a session parameter file.
You can configure the following properties for a Hadoop source:

Is Staged
    Before reading the source file, the Integration Service stages the remote file or files locally.
Staged File Name
    The local file name and directory where the Integration Service stages files. If you use direct staging, the Integration Service stages the source file using this
    file name. If you use indirect staging, the Integration Service uses the file name to create the indirect file.
Concurrent read partitioning
    The order in which multiple partitions read input rows from a source file. You can choose one of the following options:
    - Optimize throughput. The Integration Service does not preserve row order when multiple partitions read from a single file source. Use this option if the order
      in which multiple partitions read from a file source is not important.
    - Keep relative input row order. The Integration Service preserves the input row order for the rows read by each partition. Use this option to preserve the sort
      order of the input rows read by each partition.
    - Keep absolute input row order. The Integration Service preserves the input row order for all rows read by all partitions. Use this option to preserve the sort
      order of the input rows each time the session runs. In a pass-through mapping with passive transformations, the rows are written to the target in the same
      order as they are read from the source.
File Path
    The Hadoop directory path of the flat file source. The path can be relative or absolute. If relative, it is relative to the home directory of the Hadoop user.
    For example, you might specify the following value in the File Path property:
    /home/foo/
Sessions with Hadoop Targets
When you configure a session for a Hadoop target, you select the HDFS Flat File writer file type and a Hadoop HDFS application connection object.
When you configure the session to load data to HDFS, you can configure partitioning, file header, and output options. When you configure the session to load data
to a Hive table, you can configure partitioning and output options.
When the Integration Service loads data to a Hive table, it generates a relational table in the Hive database. You can overwrite the Hive table data when you run the
session again.
In the targets node section of the Mapping tab in the session properties, select HDFS Flat File Writer as the writer type for the target. Then, select a Hadoop HDFS
application connection object for the target.
When you select the HDFS flat file writer type, you can select the code page of a delimited file from the code page drop-down list. You cannot set the code page
with a user-defined variable or a session parameter file.
Hadoop Target Partitioning
You can select the following merge types for Hadoop target partitioning:
No Merge
The Integration Service generates one target file for each partition. If you stage the files, the Integration
Service transfers the target files to the remote location at the end of the session. If you do not stage the
files, the Integration Service generates the target files at the remote location.
Sequential Merge
The Integration Service creates one output file for each partition. At the end of the session, the
Integration Service merges the individual output files into a merge file, deletes the individual output files,
and transfers the merge file to the remote location.
If you set the merge type to sequential, you need to define the merge file path and the output file path in
the session properties. The merge file path determines the final Hadoop target location where the
Integration Service creates the merge file. The Integration Service creates the merge file from the
intermediate merge file output in the location defined for the output file path.
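Conceptually, a sequential merge resembles what Hadoop's own FileUtil.copyMerge utility (available in Hadoop 2.x) performs: concatenate the per-partition files in one directory into a single file at the final location. The following Java sketch uses hypothetical paths and is not the Integration Service's own merge logic:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class SequentialMergeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Concatenate every partition output file under the output file path
            // into one merge file at the merge file path, deleting the parts.
            FileUtil.copyMerge(fs, new Path("/user/hadoop/out_parts"),
                    fs, new Path("/user/hadoop/target/merged.out"),
                    true /* deleteSource */, conf, null /* no separator between files */);
        }
    }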
You can configure the following properties for a Hadoop target:

Merge Type
    The type of merge that the Integration Service performs on the data for partitioned targets. You can choose one of the following merge types:
    - No Merge
    - Sequential Merge
Header Options
    Creates a header row in the flat file when loading data to HDFS.
Merge File Path
    If you choose a sequential merge type, defines the final Hadoop target location where the Integration Service creates the merge file. The Integration Service
    creates the merge file from the intermediate merge file output in the location defined in the output file path.
Generate And Load Hive Table
    Generates a relational table in the Hive database. The Integration Service loads data into the Hive table from the HDFS flat file target.
Hive Table Name
    The Hive table name. Default is the target instance name.
Externally Managed Hive Table
    Loads Hive table data to the location defined in the output file path.
Output File Path
    Defines the absolute or relative directory path or file path on the HDFS host where the Integration Service writes the HDFS data. A relative path is relative to
    the home directory of the Hadoop user.
    If you choose a sequential merge type, defines where the Integration Service writes intermediate output before it writes to the final Hadoop target location as
    defined by the merge file path.
    If you choose to generate partition file names, this path can be a directory path.
Reject File Path
    The path to the reject file. By default, the Integration Service writes all reject files to the service process variable directory, $PMBadFileDir.