By Ashley Wolf, Open Source Program Manager, Verizon Media
In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Eric Badger, Software Development Engineer, about using and contributing to Hadoop at Verizon Media.
By Kishor Patil, Principal Software Systems Engineer at Verizon Media and PMC member of Apache Storm, and Bobby Evans, Apache Member and PMC member of Apache Hadoop, Spark, Storm, and Tez
We are excited to be part of the new release of Apache Storm 2.0.0. The open source community has been working on this major release, Storm 2.0, for quite some time. At Yahoo we have had a long-standing and strong commitment to using and contributing to Storm, a commitment we continue as part of Verizon Media. Together with the Apache community, we’ve added more than 1000 fixes and improvements to this new release. These improvements include sending real-time infrastructure alerts to the DevOps folks running Storm and the ability to augment ingested content with related content, giving users a deeper understanding of any one piece of content.
Performance
Performance and utilization are very important to us, so we developed a benchmark to evaluate various stream processing platforms; the initial results showed Storm to be among the best. We expect to release new numbers by the end of June 2019, but in the interim, we ran some smaller Storm-specific tests that we’d like to share.
Storm 2.0 has a built-in load generation tool under examples/storm-loadgen. It comes with the requisite word count test, which we used here, but it can also capture a statistical representation of the bolts and spouts in a running production topology and replay that load on another topology, or another version of Storm. For this test, we backported that code to Storm 1.2.2. We then ran the ThroughputVsLatency test on both code bases at various throughputs and different numbers of workers to see what impact Storm 2.0 would have. These runs used the default parameters with no tuning, except that max.spout.pending in the topologies was set to 1000 sentences; in the past this has proven to be a good balance between throughput and latency, and it provides flow control in 1.2.2, which lacks backpressure.
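For readers who want to set up something similar, here is a minimal word-count sketch in Java showing the one non-default setting mentioned above (topology.max.spout.pending = 1000). The component classes and names are illustrative stand-ins, not the actual storm-loadgen benchmark code.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountSketch {

    // Toy spout emitting random sentences; the real tests used the storm-loadgen spouts.
    public static class SentenceSpout extends BaseRichSpout {
        private static final String[] SENTENCES = {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
        };
        private SpoutOutputCollector collector;
        private Random random;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.random = new Random();
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values(SENTENCES[random.nextInt(SENTENCES.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Splits each sentence into words.
    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getString(0).split("\\s+")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Keeps a running count per word.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            collector.emit(new Values(word, counts.merge(word, 1L, Long::sum)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 4);
        builder.setBolt("split", new SplitBolt(), 8).shuffleGrouping("sentences");
        builder.setBolt("count", new CountBolt(), 8).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        // The only non-default setting used in the tests above: provides flow control
        // in 1.2.2, largely superseded by true backpressure in 2.0.0.
        conf.setMaxSpoutPending(1000);

        StormSubmitter.submitTopology("wordcount-sketch", conf, builder.createTopology());
    }
}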
In general, for a WordCount topology, we noticed 50% - 80% improvements in latency for processing a full sentence. Moreover, the 99th percentile latency is, in most cases, lower than the mean latency of the 1.2.2 version. We also saw the maximum throughput on the same hardware more than double.
Why did this happen? STORM-2306 redesigned the threading model in the workers, replaced the disruptor queues with JCTools queues, added a new true backpressure mechanism, and optimized a lot of code paths to reduce the overhead of the system. The impact on system resources is very promising. Memory usage was untouched, but CPU usage was a bit more nuanced.
At low throughput (< 8,000 sentences per second) the new system uses more CPU than before. This can be tuned, as the system does not yet auto-tune itself. At higher rates, the slope of the line is much lower, which means Storm has less overhead than before and can therefore process more data with the same hardware. It also means we were able to push each of these configurations beyond 100,000 sentences per second on 2.0.0, more than twice the maximum of 45,000 sentences per second that 1.2.2 could do with the same setup. Note that we did nothing to tune these topologies on either setup. Thanks to true backpressure, once we disabled the event tracking feature entirely, a WordCount topology could consistently and stably process over 230,000 sentences per second, which equates to over 2 million messages per second being processed on a single node.
Scalability
In 2.0, we have laid the groundwork to make Storm even more scalable. Workers and supervisors can now heartbeat directly into Nimbus instead of going through ZooKeeper, resulting in the ability to run much larger clusters out of the box.
Developer Friendly
Prior to 2.0, Storm was primarily written in Clojure. Clojure is a wonderful language with many advantages over pure Java, but its prevalence in Storm became a hindrance for many developers who weren’t very familiar with it and didn’t have the time to learn it. Due to this, the community decided to port all of the daemon processes over to pure Java. We still maintain a backward compatible storm-clojure package for those that want to continue using Clojure for topologies.
Split Classpath
In older versions, Storm shipped as a single jar that included code for the daemons as well as the user code. We have now split this up, and storm-client provides everything needed for your topology to run. Storm-core can still be used as a dependency for tests that want to run a local mode cluster, but it will pull in more dependencies than you might expect.
To upgrade your topology to 2.0, you’ll just need to switch your dependency from storm-core-1.2.2 to storm-client-2.0.0 and recompile.
Backward Compatible
Even though Storm 2.0 is API compatible with older versions, upgrading can still be difficult when running a hosted multi-tenant cluster, because coordinating the cluster upgrade with recompiling all of the topologies can be a massive task. Starting in 2.0.0, Storm has the option to run workers for topologies submitted with an older version against a classpath for a compatible older version of Storm. This important feature, which was developed by our team, allows you to upgrade your cluster to 2.0 while letting individual topologies be upgraded later, whenever they are recompiled to use newer dependencies.
Generic Resource Aware Scheduling
With the new generic resource aware scheduling strategy, it is now possible to specify, alongside CPU and memory, generic resources such as network bandwidth, GPUs, or any other cluster-level resource. This allows topologies to declare such generic resource requirements per component, resulting in better scheduling and stability.
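As a rough sketch of how a topology might declare these requirements, reusing the toy components from the word-count sketch above: setCPULoad and setMemoryLoad are the standard resource aware scheduling hooks, while the addResource call and the "gpu.count" resource name are illustrative of the generic resource API and should be checked against the scheduler documentation for your cluster.

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GenericResourceSketch {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Per-component CPU and memory requirements for the resource aware scheduler.
        builder.setSpout("sentences", new WordCountSketch.SentenceSpout(), 4)
               .setCPULoad(20.0)       // percent of a core per executor
               .setMemoryLoad(512.0);  // MB of on-heap memory per executor

        builder.setBolt("split", new WordCountSketch.SplitBolt(), 8)
               .fieldsGrouping("sentences", new Fields("sentence"))
               .setCPULoad(50.0)
               .setMemoryLoad(1024.0, 256.0)    // MB on-heap, MB off-heap
               .addResource("gpu.count", 1.0);  // generic, cluster-defined resource (illustrative name)

        builder.createTopology();
    }
}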
More To Come
Storm is a secure, enterprise-ready stream processing platform, but there is always room for improvement, which is why we’re adding support for running workers in isolated, locked-down containers so there is less chance of malicious code using a zero-day exploit in the OS to steal data.
We are working on redesigning metrics and heartbeats to scale even better and, more importantly, to automatically adjust your topology so it can run optimally on the available hardware. We are also exploring running Storm on other systems, to provide a clean base to run not just on Mesos but also on YARN and Kubernetes.
If you have any questions or suggestions, please feel free to reach out via email.
P.S. We’re hiring! Explore the Big Data Open Source Distributed System Developer opportunity here.
By Scott Bush, Director, Hadoop Software Engineering, Oath
On Tuesday, September 25, we hosted a special day-long Hadoop Contributors Meetup at our Sunnyvale, California campus. Much of the early Hadoop development work started at Yahoo, now part of Oath, and has continued over the past decade. Our campus was the perfect setting for this meetup, as we continue to make Hadoop a priority.
More than 80 Hadoop users, contributors, committers, and PMC members gathered to hear talks on key issues facing the Hadoop user community.
Speakers from Ampool, Cloudera, Hortonworks, Microsoft, Oath, and Twitter detailed some of the challenges and solutions pertinent to their parts of the Hadoop ecosystem. The talks were followed by a number of parallel birds-of-a-feather breakout sessions to discuss HDFS, Tez, containers, and low-latency processing. The day ended with a reception and a consensus that the event went well and should be repeated in the near future.
Presentation recordings (YouTube playlist) and slides (links included in the video description) are available here:
Kuhu Shukla (bottom center) and team at the 2017 DataWorks Summit
By Kuhu Shukla
This post first appeared here on the Apache Software Foundation blog as part of ASF’s “Success at Apache” monthly blog series.
As I sit at my desk on a rather frosty morning with my coffee, looking up new JIRAs from the previous day in the Apache Tez project, I feel rather pleased. The latest community release vote is complete, the bug fixes that we so badly needed are in, and the new release that we tested out internally on our many-thousand-node cluster is looking good. Today I am looking at a new stack trace from a different Apache project process, and it is hard to miss how much of the exceptional code I get to look at every day comes from people all around the globe. A contributor leaves a JIRA comment before he goes on to pick up his kid from soccer practice, while someone else wakes up to find that her effort on a bug fix for the past two months has finally come to fruition through a binding +1.
Yahoo – which joined AOL, HuffPost, Tumblr, Engadget, and many more brands to form the Verizon subsidiary Oath last year – has been at the frontier of open source adoption and contribution since before I was in high school. So while I have no historical trajectories to share, I do have a story about how I found myself on an epic journey of migrating all of Yahoo’s jobs from Apache MapReduce to Apache Tez, a then-new DAG-based execution engine.
Oath’s grid infrastructure is driven through and through by Apache technologies, be it storage with HDFS, resource management with YARN, job execution frameworks with Tez, or user-facing engines such as Hive, Hue, Pig, Sqoop, Spark, and Storm. Our grid solution is specifically tailored to Oath’s business-critical data pipeline needs using the polymorphic technologies hosted, developed, and maintained by the Apache community.
On the third day of my job at Yahoo in 2015, I received a YouTube link to An Introduction to Apache Tez. I watched it carefully, trying to keep up with all the questions I had, and recognized a few names from my academic readings of YARN ACM papers. I continued to ramp up on YARN and HDFS, the foundational Apache technologies Oath heavily contributes to even today. For the first few weeks I spent time picking out my favorite (necessary) mailing lists to subscribe to and getting started on setting up a pseudo-distributed Hadoop cluster. I continued to find my footing with newbie contributions, being ever more careful with whitespace in my patches. One thing was clear – Tez was the next big thing for us. By the time I could truly call myself a contributor in the Hadoop community, nearly 80-90% of the Yahoo jobs were running on Tez. But just like hiking up the Grand Canyon, the last 20% is where all the pain was. Being a part of the solution to this challenge was a happy prospect, and thankfully contributing to Tez became a goal in my next quarter.
The next sprint planning meeting ended with me getting my first major Tez assignment – progress reporting. The progress reporting in Tez was non-existent – “Just needs an API fix,” I thought. Like almost all bugs in this ecosystem, it was not easy. How do you define progress? How is it different for different kinds of outputs in a graph? The questions were many.
I, however, did not have to go far to get answers. The Tez community actively came to a newbie’s rescue, finding answers and posing important questions. I started attending the bi-weekly Tez community sync-up calls and asking existing contributors and committers for course correction. Suddenly the team was much bigger, the goals much more chiseled. This was new to anyone like me who came from the networking industry, where the most open part of the code is the RFCs and the implementation details are often hidden. These meetings served as a clean room for our coding ideas and experiments. Ideas were shared, down to which data structure we should pick and what a future user of Tez would make of it. In between the usual status updates, extensive knowledge transfers took place.
Oath uses Apache Pig and Apache Hive extensively and most of the urgent requirements and requests came from Pig and Hive developers and users. Each issue led to a community JIRA and as we started running Tez at Oath scale, new feature ideas and bugs around performance and resource utilization materialized. Every year most of the Hadoop team at Oath travels to the Hadoop Summit where we meet our cohorts from the Apache community and we stand for hours discussing the state of the art and what is next for the project. One such discussion set the course for the next year and a half for me.
We needed an innovative way to shuffle data. Frameworks like MapReduce and Tez have a shuffle phase in their processing lifecycle wherein the data from upstream producers is made available to downstream consumers. Even though Apache Tez was designed with a feature set corresponding to optimization requirements in Pig and Hive, the Shuffle Handler Service was retrofitted from MapReduce at the time of the project’s inception. With several thousand jobs on our clusters leveraging these features in Tez, the Shuffle Handler Service became a clear performance bottleneck. So as we stood talking about our experience with Tez with our friends from the community, we decided to implement a new Shuffle Handler for Tez. All the conversation points were now tracked through an umbrella JIRA, TEZ-3334, and the to-do list was long. I picked a few JIRAs, and as I started reading through them I realized this was all new code I would get to contribute to and review. There might be a better way to put this, but to be honest it was just a lot of fun! All the whiteboards were full, and the team took walks after lunch to discuss how to go about defining the API. Countless hours were spent debugging hangs while fetching data and poring over stack traces and Wireshark captures from our test runs. Six months in, we had the feature on our sandbox clusters. There were moments ranging from sheer frustration to absolute exhilaration, with high fives, as we continued to address review comments and fix big and small issues with this evolving feature.
As much as owning your code is valued everywhere in the software community, I would never go on to say “I did this!” In fact, “we did!” It is this strong sense of shared ownership and fluid team structure that makes the open source experience at Apache truly rewarding. This is just one example. A lot of the work that was done in Tez was leveraged by the Hive and Pig communities, and the interaction across Apache project communities made the work ever more interesting and challenging. Triaging and fixing issues with the Tez rollout led us to hit a 100% migration score last year, and we also rolled the Tez Shuffle Handler Service out to our research clusters. As of last year we have run around 100 million Tez DAGs with a total of 50 billion tasks over almost 38,000 nodes.
In 2018, as I move on to explore Hadoop 3.0 as our future release, I hope that if someone outside the Apache community is reading this, it will inspire and intrigue them to contribute to a project of their choice. As an astronomy aficionado, I found that going from a newbie Apache contributor to a newbie Apache committer was very much like looking through my telescope – it holds endless possibilities and challenges you to be your best.
About the Author:
Kuhu Shukla is a software engineer at Oath and earned her Master’s in Computer Science at North Carolina State University. She works on the Big Data Platforms team on Apache Tez, YARN, and HDFS with a lot of talented Apache PMC members and Committers in Champaign, Illinois. A recent Apache Tez Committer herself, she continues to contribute to YARN and HDFS and spoke at the 2017 DataWorks Hadoop Summit on “Tez Shuffle Handler: Shuffling at Scale With Apache Hadoop”. Prior to that she worked on Juniper Networks’ router and switch configuration APIs. She likes to participate in open source conferences and women-in-tech events. In her spare time she loves singing Indian classical and jazz, laughing, whale watching, hiking, and peering through her Dobsonian telescope.
By Sumeet Singh, Sr. Director, Product Management, Hadoop
Yahoo and Hortonworks are pleased to host the 7th Annual Hadoop Summit - the leading conference for the Apache Hadoop community - on June 3-5, 2014 in San Jose, California.
Yahoo is a major open source contributor to and one of the largest users of Apache Hadoop. The Hadoop project is at the heart of many of Yahoo’s important business processes and we continue to make the Hadoop ecosystem stronger by working closely with key collaborators in the community to drive more users and projects to the Hadoop ecosystem.
Join us at one of the following sessions or stop by Kiosk P9 at the Hadoop Summit to get an in-depth look at Yahoo’s Hadoop culture.
Keynote
Hadoop Intelligence – Scalable Machine Learning
Amotz Maimon (@AmotzM) – Chief Architect
“This talk will cover how Yahoo is leveraging Hadoop to solve complex computational problems with a large, cross-product feature set that needs to be computed in a fast manner. We will share challenges we face, the approaches that we’re taking to address them, and how Hadoop can be used to support these types of operations at massive scale.”
Track: Hadoop Driven Business
Day 1 (12:05 PM). Data Discovery on Hadoop – Realizing the Full Potential of Your Data
Thiruvel Thirumoolan (@thiruvel) – Principal Engineer
Sumeet Singh (@sumeetksingh) – Sr. Director of Product Management
“The talk describes an approach to manage data (location, schema knowledge and evolution, sharing and adhoc access with business rules based access control, and audit and compliance requirements) with an Apache Hive based solution (Hive, HCatalog, and HiveServer2).”
Day 1 (4:35 PM). Video Transcoding on Hadoop
Shital Mehta (@smcal75) – Architect, Video Platform
Kishore Angani (@kishore_angani) – Principal Engineer, Video Platform
“The talk describes the motivation, design and the challenges faced while building a cloud based transcoding service (that processes all the videos before they go online) and how a batch processing infrastructure has been used in innovative ways to build a transactional system requiring predictable response times.”
Track: Committer
Day 1 (2:35 PM). Multi-tenant Storm Service on Hadoop Grid
Bobby Evans – Principal Engineer, Apache Hadoop PMC, Storm PPMC, Spark Committer
Andy Feng (@afeng76) – Distinguished Architect, Apache Storm PPMC
“Multi-tenancy and security are foundational to building scalable hosted platforms, and we have done exactly that with Apache Storm. The talk describes our enhancements to Storm that have allowed us to build one of the largest installations of Storm in the world, offering low-latency big data platform services to all of Yahoo on common Storm clusters while sharing infrastructure components with our Hadoop platform.”
Day 2 (1:45 PM). Pig on Tez – Low Latency ETL with Big Data
Daniel Dai (@daijy) – Member of Technical Staff, Hortonworks, Apache Pig PMC
Rohini Palaniswamy (@rohini_aditya) – Principal Engineer, Apache Pig PMC and Oozie Committer
“Pig on Tez aims to make ETL faster by using Tez as the execution engine, as it is a more natural fit for the query plan produced by Pig. With optimized and shorter query plan graphs, Pig on Tez delivers huge performance improvements by executing the entire script within one YARN application as a single DAG and avoiding intermediate storage in HDFS. It also employs a lot of other optimizations made feasible by the Tez framework.”
Track: Deployment and Operations
Day 1 (3:25 PM). Collection of Small Tips on Further Stabilizing your Hadoop Cluster
Koji Noguchi (@kojinoguchi) – Apache Hadoop and Pig Committer
“For the first time, the maestro shares his pearls of wisdom in a public forum. Call Koji and he will tell you if you have a slow node, misconfigured node, CPU-eating jobs, or HDFS-wasting users even in the middle of the night when he pretends he is sleeping.”
Day 2 (12:05 PM). Hive on Apache Tez: Benchmarked at Yahoo! Scale
“At Yahoo, we’d like our low-latency use-cases to be handled within the same framework as our larger queries, if viable. We’ve spent several months benchmarking various versions of Hive (including 0.13 on Tez), file-formats, and compression and query techniques, at scale. Here, we present our tests, results and conclusions, alongside suggestions for real-world performance tuning.”
Track: Future of Hadoop
Day 1 (4:35 PM). Pig on Storm
Kapil Gupta – Principal Engineer, Cloud Platforms
Mridul Jain (@mridul_jain) – Senior Principal Engineer, Cloud Platforms
“In this talk, we propose PIG as the primary language for expressing real-time stream processing logic and provide a working prototype on Storm. We also illustrate how legacy code written for MR in PIG, can run with minimal to no changes, on Storm. We also propose a “Hybrid Mode” where a single PIG script can express logic for both real-time streaming and batch jobs.”
Day 2 (11:15 AM). Hadoop Rolling Upgrades - Taking Availability to the Next Level
Jason Lowe – Senior Principal Engineer, Apache Hadoop PMC
“No more maintenance downtimes, coordinating with users, catch-up processing etc. for Hadoop upgrades. The talk will describe the challenges with getting to transparent rolling upgrades, and discuss how these challenges are being addressed in both YARN and HDFS.”
Day 3 (11:50 AM). Spark-on-YARN - Empower Spark Applications on Hadoop Cluster
Thomas Graves – Principal Engineer, Apache Hadoop PMC and Apache Spark Committer
Andy Feng (@afeng76) – Distinguished Architect, Apache Storm PPMC
“In this talk, we will cover an effort to empower Spark applications via Spark-on-YARN. Spark-on-YARN enables Spark clusters and applications to be deployed onto your existing Hadoop hardware (without creating a separate cluster). Spark applications can then directly access Hadoop datasets on HDFS.”
Track: Data Science
Day 2 (11:15 AM) – Interactive Analytics in Human Time - Lightning Fast Analytics using a Combination of Hadoop and In-memory Computation Engines at Yahoo
Supreeth Rao (@supreeth_) – Technical Yahoo, Ads and Data Team
Sunil Gupta (@_skgupta) – Technical Yahoo, Ads and Data Team
“Providing interactive analytics over all of Yahoo’s advertising data, across the numerous dimensions and metrics that span advertising, has been a huge challenge. From getting results back in under a second in a concurrent system, to computing non-additive cardinality estimations, to audience segmentation analytics, the problem space is computationally expensive and has resulted in large systems in the past. We have attempted to solve this problem in many different ways, with systems built using everything from traditional RDBMS to NoSQL stores to commercially licensed distributed stores. With our current implementation, we look into how we have evolved a data tech stack that includes Hadoop and in-memory technologies.”
Track: Hadoop for Business Apps
Day 3 (11:00 AM) – Costing Your Big Data Operations
Sumeet Singh (@sumeetksingh) – Sr. Director of Product Management
Amrit Lal (@Amritasshwar) – Product Manager, Hadoop and Big Data
“As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations. Our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. We will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.”
For public inquiries or to learn more about the opportunities with the Hadoop team at Yahoo, reach out to us at bigdata AT yahoo-inc DOT com.
Hadoop User Group (HUG) Meetup at Yahoo, October 2013: GridGain – In-Memory Acceleration
This HUG session was recorded on Oct 16, 2013 at Yahoo.
The tech talks at Yahoo Hack USA will feature technologies that developers can use on the back end to create high-performance multi-device applications.
Here’s a preview of some of the topics that will be covered:
In his talk, Hacking with NodeJS, Yahoo’s NodeJS/Open Source architect, Dav Glass, will walk hackers through how to rapidly build, test, and deploy a hack with NodeJS, GitHub, Travis CI, and probably Heroku, Nodejitsu, etc.
Lead service engineer Jennifer Davis will explain some of the adventures involved with service engineering at Yahoo to build quality services in her talk, Dungeons and Data.
Joyent’s Max Bruning will give a talk about Debugging and Performance Analysis of NodeJS Applications Using DTrace/mdb. This talk will demonstrate the use of DTrace to analyze performance issues for a given NodeJS application.
Head of Products for Cloud Services and Hadoop, Sumeet Singh, will talk about Hadoop at Yahoo – Now and Beyond. The talk will highlight some of the notable use cases of Hadoop at Yahoo and the state-of-the-art technology stack we have today that puts us at the frontier of Hadoop scale. The talk will also cover where we are headed in the areas of core Hadoop, low latency processing, BI and adhoc queries, and developer tools, and how we see the technology stack evolving in order to advance the personally ordered Internet.
Using YQL, by Yahoo engineer Guilherme Chapiewski, will guide you through the basics of the Yahoo! Query Language (YQL), an expressive SQL-like language that lets you query, filter, and join data across web APIs.
Follow @ydn for more updates about the talks at #yahoohackusa. To hear all the talks in person, register for Yahoo Hack USA.