Join Us at the 10th Annual Hadoop Summit / DataWorks Summit, San Jose (Jun 13-15)

yahoohadoop:


We’re excited to co-host the 10th Annual Hadoop Summit, the leading conference for the Apache Hadoop community, taking place on June 13 – 15 at the San Jose Convention Center. In the last few years, the Hadoop Summit has expanded to cover all things data beyond just Apache Hadoop – such as data science, cloud and operations, IoT and applications – and has been aptly renamed the DataWorks Summit. The three-day program is bursting at the seams! Here are just a few of the reasons why you cannot miss this must-attend event:

  • Familiarize yourself with the cutting edge in Apache project developments from the committers
  • Learn from your peers and industry experts about innovative and real-world use cases, development and administration tips and tricks, success stories and best practices to leverage all your data – on-premises and in the cloud – to drive predictive analytics, distributed deep-learning and artificial intelligence initiatives
  • Attend one of our more than 170 technical deep dive breakout sessions from nearly 200 speakers across eight tracks
  • Check out our keynotes, meetups, trainings, technical crash courses, birds-of-a-feather sessions, Women in Big Data and more
  • Attend the community showcase where you can network with sponsors and industry experts, including a host of startups and large companies like Microsoft, IBM, Oracle, HP, Dell EMC and Teradata

As in previous years, we look forward to continuing Yahoo’s decade-long tradition of thought leadership at this year’s summit. Join us for an in-depth look at Yahoo’s Hadoop culture and for the latest in technologies such as Apache Tez, HBase, Hive, Data Highway Rainbow, Mail Data Warehouse and Distributed Deep Learning at the breakout sessions below. Or stop by the Yahoo kiosk (#700) at the community showcase.

Also, as a co-host of the event, Yahoo is pleased to offer a 20% discount for the summit with the code MSPO20. Register here for Hadoop Summit, San Jose, California!


DAY 1. TUESDAY June 13, 2017


12:20 - 1:00 P.M. TensorFlowOnSpark - Scalable TensorFlow Learning On Spark Clusters

Andy Feng - VP Architecture, Big Data and Machine Learning

Lee Yang - Sr. Principal Engineer

In this talk, we will introduce TensorFlowOnSpark, a new framework for scalable TensorFlow learning that was open sourced in Q1 2017. The framework enables easy experimentation with algorithm designs and supports scalable training and inferencing on Spark clusters. It supports all TensorFlow functionality, including synchronous and asynchronous learning, model and data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion into TensorFlow and in the network protocols used for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.


2:10 - 2:50 P.M. Handling Kernel Upgrades at Scale - The Dirty Cow Story

Samy Gawande - Sr. Operations Engineer

Savitha Ravikrishnan - Site Reliability Engineer

Apache Hadoop at Yahoo is a massive platform with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of hundreds of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. In this talk, we will share methods, tips and tricks for handling large-scale kernel upgrades on heterogeneous platforms within tight timeframes, with 100% uptime and no service or data loss, using the Dirty COW case study (a privilege escalation vulnerability found in the Linux kernel in late 2016).


5:00 – 5:40 P.M. Data Highway Rainbow - Petabyte Scale Event Collection, Transport, and Delivery at Yahoo

Nilam Sharma - Sr. Software Engineer

Huibing Yin - Sr. Software Engineer

This talk presents the architecture and features of Data Highway Rainbow, Yahoo’s hosted multi-tenant infrastructure which offers event collection, transport and aggregated delivery as a service. Data Highway supports collection from multiple data centers and aggregated delivery into the primary Yahoo data centers that host our big data computing clusters. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm, and Kafka, with the Storm and Kafka endpoints tailored towards latency-sensitive consumers.


DAY 2. WEDNESDAY June 14, 2017


9:05 - 9:15 A.M. Yahoo General Session - Shaping Data Platform for Lasting Value

Sumeet Singh  – Sr. Director, Products

With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.


12:20 - 1:00 P.M. CaffeOnSpark Update - Recent Enhancements and Use Cases

Mridul Jain - Sr. Principal Engineer

Jun Shi - Principal Engineer

By combining salient features from the deep learning framework Caffe and the big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016. In this talk, we will update the audience on recent CaffeOnSpark development. We will highlight new features and capabilities: a unified data layer that supports multi-label datasets, distributed LSTM training, interleaved testing with training, a monitoring/profiling framework, and Docker deployment.


12:20 - 1:00 P.M. Tez Shuffle Handler - Shuffling at Scale with Apache Hadoop

Jon Eagles - Principal Engineer  

Kuhu Shukla - Software Engineer

In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. The Apache Tez Shuffle Handler adds composite fetch, which supports multi-partition fetch to mitigate performance slowdowns, and provides deletion APIs to reduce disk usage for long-running Tez sessions. Because this is an emerging technology, we will outline the future roadmap for the Apache Tez Shuffle Handler and provide performance evaluation results from real-world jobs at scale.


2:10 - 2:50 P.M. Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes

Thiruvel Thirumoolan – Principal Engineer

Francis Liu – Sr. Principal Engineer

At Yahoo, HBase has been running as a hosted multi-tenant service since 2013. In a single HBase cluster we have around 30 tenants running various types of workloads (i.e., batch, near real-time, ad hoc, etc.). We will walk through the multi-tenancy features, explaining our motivation, how they work, and our experiences running these multi-tenant clusters. These features will be available in Apache HBase 2.0.


2:10 - 2:50 P.M. Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse

Nick Huang – Director, Data Engineering, Yahoo Mail  

Saurabh Dixit – Sr. Principal Engineer, Yahoo Mail

In 2014, the Yahoo Mail Data Engineering team took on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs, that have transformed the role data plays in Yahoo Mail. In this session we will share our experience from this three-year journey, covering the system architecture, the analytics systems we built, and the lessons learned from development and driving adoption.


DAY 3. THURSDAY June 15, 2017


2:10 – 2:50 P.M. OracleStore - A Highly Performant RawStore Implementation for Hive Metastore

Chris Drome - Sr. Principal Engineer  

Jin Sun - Principal Engineer

Today, Yahoo uses Hive in many different spaces, from ETL pipelines to ad hoc user queries. Increasingly, we are investigating the practicality of applying Hive to real-time queries, such as those generated by interactive BI reporting systems. In order for Hive to succeed in this space, it must be performant in all aspects of query execution, from query compilation to job execution. One such component is the interaction with the underlying database at the core of the Metastore. As an alternative to ObjectStore, we created OracleStore as a proof-of-concept. Freed of the restrictions imposed by DataNucleus, we were able to design a more performant database schema that better met our needs. We then implemented OracleStore with specific goals built in from the start, such as ensuring the deduplication of data. In this talk we will discuss the details behind OracleStore and the gains realized with this alternative implementation. These include a reduction of more than 97% in the storage footprint of multiple tables, as well as query performance that is 13x faster than ObjectStore with DirectSQL and 46x faster than ObjectStore without DirectSQL.


3:00 P.M. - 3:40 P.M. Bullet - A Real Time Data Query Engine

Akshai Sarma - Sr. Software Engineer

Michael Natkovich - Director, Engineering

Bullet is an open source, lightweight, pluggable querying system for streaming data, implemented on top of Storm without a persistence layer. It allows you to filter, project, and aggregate on data in transit, and it includes a UI and web service. Instead of running queries on a finite set of data that arrived and was persisted, or running a static query defined at the startup of the stream, Bullet queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system. Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source and can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average.


3:00 P.M. - 3:40 P.M. Yahoo - Moving Beyond Running 100% of Apache Pig Jobs on Apache Tez

Rohini Palaniswamy - Sr. Principal Engineer

Last year at Yahoo, we put great effort into scaling, stabilizing, and making Pig on Tez production-ready, and by the end of the year we had retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez. After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and to new optimization ideas to make it go even faster. We will go over the new features and work done in Tez to make that happen, such as the custom YARN Shuffle Handler, reworked DAG scheduling order, and serialization changes. We will also cover exciting new features added to Pig for performance, such as bloom join and bytecode generation.


4:10 P.M. - 4:50 P.M. Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning

Evans Ye - Software Engineer

Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platforms as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work well together. In this presentation, we’ll talk about how Bigtop delivers a containerized CI framework that can be directly replicated by Bigtop users. At the core are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. The talk covers the containerized CI framework, the technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.

Register here for Hadoop Summit, San Jose, California with the 20% discount code MSPO20.

Questions? Feel free to reach out to us at [email protected]. Hope to see you there!

(via yahoohadoop)

Open Sourcing Athenz: Fine-Grained, Role-Based Access Control

yahoocs:


By Lee Boynton, Henry Avetisyan, Ken Fox, Itsik Figenblat, Mujib Wahab, Gurpreet Kaur, Usha Parsa, and Preeti Somal


Today, we are pleased to offer Athenz, an open-source platform for fine-grained access control, to the community. Athenz is a role-based access control (RBAC) solution, providing trusted relationships between applications and services deployed within an organization that require authorized access.

If you need to grant access to a set of resources that your applications or services manage, Athenz provides both a centralized and a decentralized authorization model to do so. Whether you are using container or VM technology independently or on bare metal, you may need a dynamic and scalable authorization solution. Athenz supports moving workloads from one node to another and gives new compute resources authorization to connect to other services within minutes, as opposed to relying on IP and network ACL solutions that take time to propagate within a large system. Moreover, in very high-scale situations, you may run out of the limited number of network ACL rules that your hardware can support.

Prior to creating Athenz, we had multiple ways of managing permissions and access control across all services within Yahoo. To simplify, we built a fine-grained, role-based authorization solution that would satisfy the feature and performance requirements our products demand. Athenz was built with open source in mind so as to share it with the community and further its development.

At Yahoo, Athenz authorizes the dynamic creation of compute instances and containerized workloads, secures builds and deployment of their artifacts to our Docker registry, and among other uses, manages the data access from our centralized key management system to an authorized application or service.

Athenz provides a REST-based set of APIs modeled in Resource Description Language (RDL) to manage all aspects of the authorization system, and includes Java and Go client libraries to quickly and easily integrate your application with Athenz. It allows product administrators to manage what roles are allowed or denied to their applications or services in a centralized management system through a self-serve UI.

Access Control Models

Athenz provides two authorization access control models based on your applications’ or services’ performance needs. The more commonly used centralized access control model is ideal for provisioning and configuration needs. In instances where performance is absolutely critical for your applications or services, we offer a unique decentralized access control model that provides on-box enforcement of authorization.

Athenz’s authorization system utilizes two types of tokens: principal tokens (N-Tokens) and role tokens (Z-Tokens). The principal token is an identity token that identifies either a user or a service. A service generates its principal token using that service’s private key. Role tokens authorize a given principal to assume some number of roles in a domain for a limited period of time. Like principal tokens, they are signed to prevent tampering. The name “Athenz” is derived from “Auth” and the ‘N’ and ‘Z’ tokens.

Centralized Access Control: The centralized access control model requires any Athenz-enabled application to contact the Athenz Management Service directly to determine if a specific authenticated principal (user and/or service) has been authorized to carry out the given action on the requested resource. At Yahoo, our internal continuous delivery solution uses this model. A service receives a simple Boolean answer whether or not the request should be processed or rejected. In this model, the Athenz Management Service is the only component that needs to be deployed and managed within your environment. Therefore, it is suitable for provisioning and configuration use cases where the number of requests processed by the server is small and the latency for authorization checks is not important.
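
To make the centralized model concrete, the sketch below shows the general shape of an Athenz authorization rule. The domain, role, and resource names are invented for illustration, and the syntax shown is an approximation rather than the precise format used by the Athenz tooling.

      # Hypothetical domain "media.photos" (all names below are illustrative)
      # Role: a named collection of principals
      #   media.photos:role.editors  ->  user.alice, media.photos.backend
      # Policy assertion, following the general shape
      #   "grant <action> to <role> on <resource>":
      grant update to editors on media.photos:album.*

With a rule like this in place, an Athenz-enabled service asks the Athenz Management Service, in effect, “may this authenticated principal perform update on media.photos:album.summer2017?” and receives the simple allow-or-deny answer described above.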

The diagram below shows a typical control plane-provisioning request handled by an Athenz-protected service.

[Diagram: Athenz Centralized Access Control Model]

Decentralized Access Control: This approach is ideal when an application must handle a large number of requests per second and latency is a concern. It’s far more efficient to check authorization on the host itself and avoid the synchronous network call to a centralized Athenz Management Service. Athenz provides a way to do this with its decentralized service, using a policy engine library on the local box. At Yahoo, this is the approach we use for our centralized key management system. The authorization policies defining which roles have been authorized to carry out specific actions on resources are asynchronously updated on application hosts and used by the Athenz local policy engine to evaluate the authorization check. In this model, a principal needs to contact the Athenz Token Service first to retrieve an authorization role token for the request and submit that token as part of its request to the Athenz-protected service. The same role token can then be re-used for its lifetime.

The diagram below shows a typical decentralized authorization request handled by an Athenz-protected service.

[Diagram: Athenz Decentralized Access Control Model]

An RBAC system in which you can choose a deployment model according to your performance and latency needs, with the flexibility to use either or both models across a complex environment of hosting platforms and products, gives you the ability to run your business with agility and scale.

Looking to the Future

We are actively engaged in pushing the scale and reliability boundaries of Athenz. As we enhance Athenz, we look forward to working with the community on the following features:

  • Using local CA signed TLS certificates
  • Extending Athenz with a generalized model for service providers to launch instances with bootstrapped Athenz service identity TLS certificates
  • Integration with public cloud services like AWS. For example, launching an EC2 instance with a configured Athenz service identity or obtaining AWS temporary credentials based on authorization policies defined in ZMS.

Our goal is to integrate Athenz with other open source projects that require authorization support and we welcome contributions from the community to make that happen. It is available under Apache License Version 2.0. To evaluate Athenz, we provide both AWS AMI and Docker images so that you can quickly have a test development environment up and running with ZMS (Athenz Management Service), ZTS (Athenz Token Service), and UI services. Please join us on the path to making application authorization easy. Visit http://www.athenz.io to get started!

Operating OpenStack at Scale

By James Penick, Cloud Architect & Gurpreet Kaur, Product Manager

A version of this byline was originally written for and appears in CIO Review.


A successful private cloud presents a consistent and reliable facade over the complexities of hyperscale infrastructure. It must simultaneously handle constant organic traffic growth, unanticipated spikes, a multitude of hardware vendors, and discordant customer demands. The depth of this complexity only increases with the age of the business, leaving a private cloud operator saddled with legacy hardware, old network infrastructure, customers dependent on legacy operating systems, and the list goes on. These are the foundations of the horror stories told by grizzled operators around the campfire.

Providing a plethora of services globally for over a billion active users requires a hyperscale infrastructure. Yahoo’s on-premises infrastructure comprises datacenters housing hundreds of thousands of physical and virtual compute resources globally, connected via a multi-terabit network backbone. As one of the very first hyperscale internet companies in the world, Yahoo’s infrastructure had grown organically – things were built, and rebuilt, as the company learned and grew. The resulting web of modern and legacy infrastructure became progressively more difficult to manage. Initial attempts to manage this via IaaS (Infrastructure-as-a-Service) taught some hard lessons. However, those lessons served us well when OpenStack was selected to manage Yahoo’s datacenters; some of them are shared below.

Centralized team offering Infrastructure-as-a-Service

Chief amongst the lessons learned prior to OpenStack was that IaaS must be presented as a core service to the whole organization by a dedicated team. An à-la-carte IaaS, where each user is expected to manage their own control plane and inventory, just isn’t sustainable at scale. Multiple teams tackling the same challenges involved in the curation of software, deployment, upkeep, and security within an organization is not just a duplication of effort; it removes the opportunity for improved synergy with all levels of the business. The first OpenStack cluster, with a centralized, dedicated developer and service engineering team, went live in June 2012. This model has served us well and has been a crucial piece of making OpenStack succeed at Yahoo. One of the biggest advantages of a centralized, core team is the ability to collaborate with the foundational teams upon which any business is built: Supply Chain, Datacenter Site-Operations, Finance, and finally our customers, the engineering teams. Building a close relationship with these vital parts of the business provides the ability to streamline the process of scaling inventory and presenting on-demand infrastructure to the company.

Developers love instant access to compute resources

Our developer productivity clusters, named “OpenHouse,” were a huge hit. Ideation and experimentation are core to developers’ DNA at Yahoo, and OpenHouse empowers our engineers to innovate, prototype, develop, and quickly iterate on ideas. No longer is a developer reliant on a static and costly development machine under their desk. OpenHouse enables developer agility and cost savings by obviating the desktop.

Dynamic infrastructure empowers agile products

From a humble beginning of a single, small OpenStack cluster, Yahoo’s OpenStack footprint is growing beyond 100,000 VM instances globally, with our single largest virtual machine cluster running over a thousand compute nodes, without using Nova Cells.

Until this point, Yahoo’s production footprint was nearly 100% focused on baremetal – a part of the business that one cannot simply ignore. In 2013, Yahoo OpenStack Baremetal began to manage all new compute deployments. Interestingly, after moving to a common API to provision baremetal and virtual machines, there was a marked increase in demand for virtual machines.

Developers across all major business units, ranging from Yahoo Mail, Video, News, Finance, Sports and many more, were thrilled with getting instant access to compute resources to hit the ground running on their projects. Today, the OpenStack team is continuing to migrate the rest of the business onto OpenStack-managed infrastructure. Our baremetal footprint is well beyond that of our VMs, with over 100,000 baremetal instances provisioned by OpenStack Nova via Ironic.

How did Yahoo hit this scale?  

Scaling OpenStack begins with understanding how its various components work and how they communicate with one another. This topic can be very deep and for the sake of brevity, we’ll hit the high points.

1. Start at the bottom and think about the underlying hardware

Do not overlook the unique resource constraints of the services which power your cloud, nor the fashion in which those services are to be used. Leverage that understanding to drive hardware selection. For example, when one examines the role of the database server in an OpenStack cluster and considers the multitude of calls made to the database (compute node heartbeats, instance state changes, normal user operations, and so on), one concludes that this core component is extremely busy in even a modest-sized Nova cluster and needs adequate computational resources to perform. Yet many deployers skimp on the hardware, and the performance of the whole cluster ends up bottlenecked by database I/O. By thinking ahead you can save yourself a lot of heartburn later on.

2. Think about how things communicate

Our cluster databases are configured to be multi-master single-writer with automated failover. Control plane services have been modified to split DB reads directly to the read slaves and only write to the write-master. This distributes load across the database servers.
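
As a hedged illustration of this split (hostnames and credentials are placeholders, not Yahoo’s configuration), OpenStack services built on oslo.db can be pointed at a read replica through the slave_connection option while all writes continue to go to the write master:

      # nova.conf (illustrative values only)
      [database]
      # every write goes to the single write master
      connection = mysql+pymysql://nova:secret@db-master.example.com/nova
      # reads that can tolerate replication lag are routed to a read replica
      slave_connection = mysql+pymysql://nova:secret@db-replica.example.com/nova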

3. Scale wide

OpenStack has many small horizontally-scalable components which can peacefully cohabitate on the same machines: the Nova, Keystone, and Glance APIs, for example. Stripe these across several small or modest machines. Some services, such as the Nova scheduler, run the risk of race conditions when running multi-active. If the risk of race conditions is unacceptable, use ZooKeeper to manage leader election.
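
As a minimal sketch of that pattern (the ZooKeeper ensemble, path, and callback are hypothetical, and this is not the code Yahoo runs), the kazoo Python client provides an election recipe that ensures only one candidate process runs the single-active service at a time:

      # Leader election sketch using the kazoo ZooKeeper client (illustrative only)
      from kazoo.client import KazooClient

      def run_scheduler():
          # Placeholder for starting the single-active service, e.g. a scheduler
          print("Won the election; starting the scheduler")

      zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181")
      zk.start()

      # All candidates contend on the same path; only the winner runs the callback.
      election = zk.Election("/services/scheduler-leader", identifier="host-42")
      election.run(run_scheduler)  # blocks until leadership is won, then calls run_scheduler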

4. Remove dependencies

In a Yahoo datacenter, DHCP is only used to provision baremetal servers. By statically declaring IPs in our instances via cloud-init, our infrastructure is less prone to outage from a failure in the DHCP infrastructure.
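
As an illustration of the pattern (the interface name and addresses are placeholders), a static address can be handed to an instance through cloud-init’s network configuration rather than learned from DHCP:

      # cloud-init network configuration, version 1 schema (illustrative values)
      version: 1
      config:
        - type: physical
          name: eth0
          subnets:
            - type: static
              address: 203.0.113.10/24
              gateway: 203.0.113.1
              dns_nameservers:
                - 203.0.113.2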

5. Don’t be afraid to replace things

Neutron uses Dnsmasq to provide DHCP services; however, Dnsmasq was not designed to address the complexity or scale of a dynamic environment. For example, Dnsmasq must be restarted for any config change, such as when a new host is being provisioned. In the Yahoo OpenStack clusters it has been replaced by ISC-DHCPD, which scales far better than Dnsmasq and allows dynamic configuration updates via an API.

6. Or split them apart

Some of the core imaging services provided by Ironic, such as DHCP, TFTP, and HTTPS, communicate with a host during the provisioning process. These services are normally part of the Ironic Conductor (IC) service. In our environment we split these services into a new and physically-distinct service called the Ironic Transport Service (ITS). This brings value by:

  • Adding security: Splitting the ITS from the IC allows us to block all network traffic from production compute nodes to the IC and other parts of our control plane. If a malicious entity attacks a node serving production traffic, they cannot escalate from it to our control plane.
  • Scale: The ITS hosts allow us to horizontally scale the core provisioning services with which nodes communicate.
  • Flexibility: The ITS allows Yahoo to manage remote sites, such as peering points, without building a whole cluster in each one; resources in those sites can now be managed by the nearest Yahoo owned & operated (O&O) datacenter.

Be prepared for faulty hardware!

Running IaaS reliably at hyperscale is more than just scaling the control plane. One must take a holistic look at the system and consider everything. In fact, when examining provisioning failures, our engineers determined that the most common root cause was faulty hardware. For example, there are a number of machines from varying vendors whose IPMI firmware fails from time to time, leaving the host inaccessible to remote power management. Some fail within minutes of installation, others within weeks. These failures occur on many different models, across many generations, and across many hardware vendors. Exposing these failures to users would create a very negative experience, and the cloud must be built to tolerate this complexity.

Focus on the end state

Yahoo’s experience shows that one can run OpenStack at hyperscale, leveraging it to wrap infrastructure and remove perceived complexity. Correctly leveraged, OpenStack presents an easy, consistent, and error-free interface. Delivering this interface is core to our design philosophy as Yahoo continues to double down on our OpenStack investment. The Yahoo OpenStack team looks forward to continuing to collaborate with the OpenStack community to share feedback and code.

Introducing Tripod: Flickr’s Backend, Refactored

By Peter Welch, Senior Principal Engineer

Today, Yahoo Mail introduced a feature that allows you to automatically sync your mobile photos to Yahoo Mail so that they’re readily available when you’re composing an email from your computer. A key technology behind this feature is a new photo and video platform called “Tripod,” which was born out of the innovations and capabilities of Flickr.

For 13 years, Flickr has served as one of the world’s largest photo-sharing communities and as a platform for millions of people who have collectively uploaded more than 13 billion photos globally. Tripod provides a great opportunity to bring some of the most-loved and useful Flickr features to the Yahoo network of products, including Yahoo Mail, Yahoo Messenger, and Yahoo Answers Now.

Tripod and its Three Services

As the name suggests, Tripod offers three services:

  1. The Pixel Service: for uploading, storing, resizing, editing, and serving photos and videos.
  2. The Enrichment Service: for enriching media metadata using image recognition algorithms. For example, the algorithms might identify and tag scenes, actions, and objects.
  3. The Aggregation Service: for in-application and cross-application metadata aggregation, filtering, and search.

The combination of these three services makes Tripod an end-to-end platform for smart image services. There is also an administrative console for configuring the integration of an application with Tripod, and an identity service for authentication and authorization.


Figure 1: Tripod Architecture

The Pixel Service

Flickr has built a highly scalable photo upload and resizing pipeline. Particularly for large-scale ingestion of thousands of photos and videos, Flickr’s mobile and API teams tuned techniques like resumable upload and deduplication to create a high-quality photo-sync experience. On serving, Flickr tackled the challenge of optimizing storage without impacting photo quality, and added dynamic resizing to support more diverse client photo layouts.

Over many years at Flickr, we’ve demonstrated sustained uploads of more than 500 photos per second. The full pipeline includes the PHP Upload API endpoint, backend Java services (Image Daemon, Storage Master), hot-hot uploads across the US West and East Coasts, and five worldwide photo caches, plus a massive CDN.

In Tripod’s Pixel Service, we leverage all of this core technology infrastructure as-is, except for the API endpoint, which is now written in Java and implements a new bucket-based data model.

The Enrichment Service

In 2013, Flickr made an exciting leap. Yahoo acquired two Computer Vision technology companies, IQ Engines and LookFlow, and rolled these incredible teams into Flickr. Using their image recognition algorithms, we enhanced Flickr Search and introduced Magic View to the Flickr Camera Roll.

In Tripod, the Enrichment Service applies the image recognition technology to each photograph, resulting in rich metadata that can be used to enhance filtering, indexing, and searching. The Enrichment Service can identify places, themes, landmarks, objects, colors, text, media similarity, NSFW content, and best thumbnail. It also performs OCR text recognition and applies an aesthetic score to indicate the overall quality of the photograph.

The Aggregation Service

The Aggregation Service lets an application, such as Yahoo Mail, find media based on any criteria. For example, it can return all the photos belonging to a particular person within a particular application, all public photos, or all photos belonging to a particular person taken in a specific location during a specific time period (e.g. San Francisco between March 1, 2015 and May 31, 2015.)

Vespa, Yahoo’s internal search engine, indexes all metadata for each media item. If the Enrichment Service has been run on the media, the metadata is indexed in Vespa and is available to the Aggregation API. The result set from a call to the Aggregation Service depends on authentication and the read permissions defined by an API key.
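
As a purely illustrative example (the field names below are invented; Tripod’s actual schema is internal), a location-and-time query like the one above might be expressed against Vespa in YQL roughly as:

      select * from sources * where owner_id contains "user123"
          and place contains "San Francisco"
          and taken_at >= 1425168000 and taken_at <= 1433116799;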

APIs and SDKs

Each service is expressed as a set of APIs. We upgraded our API technology stack, switching from PHP to Spring MVC on a Java Jetty servlet container, and made use of the latest Spring features such as Spring Data, Spring Boot, and Spring Security with OAuth 2.0. Tripod’s API is defined and documented using Swagger. Each service is developed and deployed semi-autonomously from a separate Git repository with a separate build lifecycle to an independent micro-service container.


Figure 2: Tripod API

Swagger Editor makes it easy to auto-generate SDKs in many languages, depending on the needs of Yahoo product developers. The mobile SDKs for iOS and Android are most commonly used, as is the JS SDK for Yahoo’s web developers. The SDKs make integration with Tripod by a web or mobile application easy. For example, in the case of the Yahoo Mail photo upload feature, the Yahoo Mail mobile app includes the embedded Tripod SDK to manage the photo upload process.

Buckets and API Keys

The Tripod data model differs in some important ways from the Flickr data model. Tripod applications, buckets, and API keys introduce the notion of multi-tenancy, with a strong access control boundary. An application is simply the name of the application that is using Tripod (e.g. Yahoo Mail). Buckets are logical containers for the application’s media, and media in an application is further affected by bucket settings such as compression rate, capacity, media time-to-live, and the selection of enrichments to compute.


Figure 3: Creating a new Bucket

Beyond Tripod’s generic attributes, a bucket may also have custom organizing attributes that are defined by an application’s developers. API keys control read/write permissions on buckets and are used to generate OAuth tokens for anonymous or user-authenticated access to a bucket.


Figure 4: Creating a new API Key

App developers at Yahoo use the Tripod Console to:

  • Create the buckets and API keys that they will use with their application
  • Define the bucket settings and the access control rules for each API key

Another departure from the Flickr API is that Tripod can handle media that is not user-generated content (UGC). This is critical for storing curated content, as is required by many Yahoo applications.

Architecture and Implementation

Going from a monolithic architecture to a microservices architecture has had its challenges. In particular, we’ve had to find the right internal communication process between the services. At the core of this is our Pulsar Event Bus, over which we send Avro messages backed by a strong schema registry. This lets each Tripod team move fast, without introducing incompatible changes that would break another Tripod service.
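
For illustration, an event on the bus might be described by an Avro schema along these lines; the record and field names are hypothetical, not Tripod’s actual schema:

      {
        "type": "record",
        "name": "PhotoUploaded",
        "namespace": "tripod.events",
        "fields": [
          {"name": "photo_id",    "type": "string"},
          {"name": "bucket",      "type": "string"},
          {"name": "uploaded_at", "type": "long"}
        ]
      }

Because producers and consumers validate messages against the registered schema, a service can evolve its events (for example, adding an optional field with a default) without breaking the other services reading from the bus.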

For data persistence, we’ve moved most of our high-scale multi-colo data to Yahoo’s distributed NoSQL database. We’ve been experimenting with using Redis Cluster as our caching tier, and we use Vespa to drive the Aggregation Service. For Enrichment, we make extensive use of Storm and HBase for real-time processing of Tripod’s Computer Vision algorithms. Finally, we run large backfills using Pig, Oozie, and Hive on Yahoo’s massive grid infrastructure.

In 2017, we expect Tripod will be at 50% of Flickr’s scale, with Tripod supporting the photo and video needs across many Yahoo applications that serve Yahoo’s 1B users across mobile and desktop.

After reading about Tripod, you might have a few questions

Did Tripod replace Flickr?!

No! Flickr is still here, better than ever. In fact, Flickr celebrated its 13th birthday last week! Over the past several years, the Flickr team has implemented significant innovations on core photo management features (such as an optimized storage footprint, dynamic resizing, Camera Roll, Magic View, and Search). We wanted to make these technology advancements available to other teams at Yahoo!

But, what about the Flickr API? Why not just use that?

Flickr APIs are being used by hundreds of thousands of third-party developers around the world. Flickr’s API was designed for interacting with Flickr Accounts, Photos, and Groups, generally at a lower scale than the Flickr site itself; it was not designed for independent, highly configurable, multi-tenant core photo management at large scale.

How can I join the team?

We’re hiring and we’d love to talk to you about our open opportunities! Just email [email protected] to start the conversation.

Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters

yahoohadoop:

By Lee Yang, Jun Shi, Bobbie Chern, and Andy Feng (@afeng76), Yahoo Big ML team

Introduction

Today, we are pleased to offer TensorFlowOnSpark to the community, our latest open source framework for distributed deep learning on big-data clusters.

Deep learning (DL) has evolved significantly in recent years. At Yahoo, we’ve found that in order to gain insight from massive amounts of data, we need to deploy distributed deep learning. Existing DL frameworks often require us to set up separate clusters for deep learning, forcing us to create multiple programs for a machine learning pipeline (see Figure 1 below). Having separate clusters requires us to transfer large datasets between them, introducing unwanted system complexity and end-to-end learning latency.

[Figure 1]

Last year we addressed scaleout issues by developing and publishing CaffeOnSpark, our open source framework that allows distributed deep learning and big-data processing on identical Spark and Hadoop clusters. We use CaffeOnSpark at Yahoo to improve our NSFW image detection, to automatically identify eSports game highlights from live-streamed videos, and more. With the community’s valuable feedback and contributions, CaffeOnSpark has been upgraded with LSTM support, a new data layer, training and test interleaving, a Python API, and deployment on Docker containers. This has been great for our Caffe users, but what about those who use the deep learning framework TensorFlow? We’re taking a page from our own playbook and doing for TensorFlow what we did for Caffe.

After TensorFlow’s initial publication, Google released an enhanced TensorFlow with distributed deep learning capabilities in April 2016. In October 2016, TensorFlow introduced HDFS support. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. TensorFlow programs could not be deployed on existing big-data clusters, thus increasing the cost and latency for those who wanted to take advantage of this technology at scale.

To address this limitation, several community projects wired TensorFlow onto Spark clusters. SparkNet added the ability to launch TensorFlow networks in Spark executors. DataBricks proposed TensorFrame to manipulate Apache Spark’s DataFrames with TensorFlow programs. While these approaches are a step in the right direction, after examining their code, we learned we would be unable to get the TensorFlow processes to communicate with each other directly, we would not be able to implement asynchronous distributed learning, and we would have to expend significant effort to migrate existing TensorFlow programs.

TensorFlowOnSpark

[Figure 2]

Our new framework, TensorFlowOnSpark (TFoS), enables distributed TensorFlow execution on Spark and Hadoop clusters. As illustrated in Figure 2 above, TensorFlowOnSpark is designed to work along with SparkSQL, MLlib, and other Spark libraries in a single pipeline or program (e.g. Python notebook).

TensorFlowOnSpark supports all types of TensorFlow programs, enabling both asynchronous and synchronous training and inferencing. It supports model parallelism and data parallelism, as well as TensorFlow tools such as TensorBoard on Spark clusters.

Any TensorFlow program can be easily modified to work with TensorFlowOnSpark. Typically, changes to fewer than 10 lines of Python code are needed. Many developers at Yahoo who use TensorFlow have easily migrated their TensorFlow programs for execution with TensorFlowOnSpark.

TensorFlowOnSpark supports direct tensor communication among TensorFlow processes (workers and parameter servers). Process-to-process direct communication enables TensorFlowOnSpark programs to scale easily by adding machines. As illustrated in Figure 3, TensorFlowOnSpark doesn’t involve Spark drivers in tensor communication, and thus achieves similar scalability as stand-alone TensorFlow clusters.

[Figure 3]

TensorFlowOnSpark provides two different modes to ingest data for training and inference:

  1. TensorFlow QueueRunners: TensorFlowOnSpark leverages TensorFlow’s file readers and QueueRunners to read data directly from HDFS files. Spark is not involved in accessing data.
  2. Spark Feeding: Spark RDD data is fed to each Spark executor, which subsequently feeds the data into the TensorFlow graph via feed_dict.

Figure 4 illustrates how the synchronous distributed training of the Inception image classification network scales in TFoS using QueueRunners with a simple setting: 1 GPU, 1 reader, and a batch size of 32 for each worker. Four TFoS jobs were launched to train 100,000 steps. When these jobs completed after 2+ days, their top-5 accuracies were 0.730, 0.814, 0.854, and 0.879. Reaching a top-5 accuracy of 0.730 takes 46 hours for a 1-worker job, 22.5 hours for a 2-worker job, 13 hours for a 4-worker job, and 7.5 hours for an 8-worker job. TFoS thus achieves near-linear scalability for Inception model training. This is very encouraging, although TFoS scalability will vary for different models and hyperparameters.

[Figure 4]

RDMA for Distributed TensorFlow

In Yahoo’s Hadoop clusters, GPU nodes are connected by both Ethernet and Infiniband. Infiniband provides faster connectivity and supports direct access to other servers’ memories over RDMA. Current TensorFlow releases, however, only support distributed learning using gRPC over Ethernet. To speed up distributed learning, we have enhanced the TensorFlow C++ layer to enable RDMA over Infiniband.

In conjunction with our TFoS release, we are introducing a new protocol for TensorFlow servers in addition to the default “grpc” protocol. Any distributed TensorFlow program can leverage our enhancement by specifying protocol="grpc_rdma" in tf.train.ServerDef() or tf.train.Server().
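
For example, a worker in a distributed TensorFlow job could opt into the RDMA path when constructing its server; apart from the protocol argument called out above, the cluster definition below is purely illustrative:

      # Enabling the RDMA protocol on a distributed TensorFlow server
      # (host names are placeholders; "grpc" remains the default protocol)
      import tensorflow as tf

      cluster = tf.train.ClusterSpec({
          "ps":     ["ps0.example.com:2222"],
          "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
      })

      server = tf.train.Server(cluster,
                               job_name="worker",
                               task_index=0,
                               protocol="grpc_rdma")
      server.join()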

With this new protocol, an RDMA rendezvous manager is created to ensure tensors are written directly into the memory of remote servers. We minimize tensor buffer creation: tensor buffers are allocated once at the beginning, and then reused across all training steps of a TensorFlow job. From our early experimentation with large models like the VGG-19 network, our RDMA implementation has demonstrated a significant speedup in training time compared with the existing gRPC implementation.

Since RDMA support is a highly requested capability (see TensorFlow issue #2916), we decided to make our current implementation available as an alpha release to the TensorFlow community. In the coming weeks, we will polish our RDMA implementation further, and share detailed benchmark results.

Simple CLI and API

TFoS programs are launched by the standard Apache Spark command, spark-submit. As illustrated below, users can specify the number of Spark executors, the number of GPUs per executor, and the number of parameter servers in the CLI. A user can also state whether they want to use TensorBoard (--tensorboard) and/or RDMA (--rdma).

      spark-submit --master ${MASTER} \
      ${TFoS_HOME}/examples/slim/train_image_classifier.py \
      --model_name inception_v3 \
      --train_dir hdfs://default/slim_train \
      --dataset_dir hdfs://default/data/imagenet \
      --dataset_name imagenet \
      --dataset_split_name train \
      --cluster_size ${NUM_EXEC} \
      --num_gpus ${NUM_GPU} \
      --num_ps_tasks ${NUM_PS} \
      --sync_replicas \
      --replicas_to_aggregate ${NUM_WORKERS} \
      --tensorboard \
      --rdma

TFoS provides a high-level Python API (illustrated in our sample Python notebook):

  • TFCluster.reserve() … construct a TensorFlow cluster from Spark executors
  • TFCluster.start() … launch the TensorFlow program on the executors
  • TFCluster.train() or TFCluster.inference() … feed RDD data to the TensorFlow processes
  • TFCluster.shutdown() … shut down TensorFlow execution on the executors
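
Putting these calls together, a driver program looks roughly like the sketch below. The argument names, dataset path, and map function are approximations for illustration; consult the TensorFlowOnSpark examples for the exact signatures.

      # Rough sketch of a TensorFlowOnSpark driver (details approximate)
      import sys
      from pyspark.context import SparkContext
      from tensorflowonspark import TFCluster

      def main_fun(argv, ctx):
          # ctx describes this executor's role (worker or parameter server) and task index;
          # an existing TensorFlow training loop would run here.
          pass

      sc = SparkContext.getOrCreate()
      num_executors, num_ps = 4, 1

      cluster = TFCluster.reserve(sc, num_executors, num_ps, tensorboard=False,
                                  input_mode=TFCluster.InputMode.SPARK)
      cluster.start(main_fun, sys.argv)      # launch the TensorFlow program on the executors
      data_rdd = sc.textFile("hdfs:///data/mnist/csv/train")
      cluster.train(data_rdd, num_epochs=1)  # feed RDD partitions to the TensorFlow processes
      cluster.shutdown()                     # tear down the TensorFlow cluster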

Open Source

Yahoo is happy to release TensorFlowOnSpark at github.com/yahoo/TensorFlowOnSpark and an RDMA enhancement of TensorFlow at github.com/yahoo/tensorflow/tree/yahoo. Multiple example programs (including mnist, cifar10, inception, and VGG) are provided to illustrate the simple conversion process for TensorFlow programs to TensorFlowOnSpark, and to leverage RDMA. An Amazon Machine Image is also available for running TensorFlowOnSpark on AWS EC2.

Going forward, we will advance TensorFlowOnSpark as we continue to do with CaffeOnSpark. We welcome the community’s continued feedback and contributions to CaffeOnSpark, and are interested in thoughts on ways TensorFlowOnSpark can be enhanced.

(via yahoohadoop)

Open Sourcing Screwdriver, Yahoo’s Continuous Delivery Build System for Dynamic Infrastructure

By James Collins, Sr. Director, Developer Platforms and Services, and St. John Johnson, Principal Engineer

Continuous Delivery enables software development teams to move faster and adapt to users’ needs more quickly by reducing the inherent friction associated with releasing software changes. Yahoo engineering has modernized as it has embraced Continuous Delivery as a strategy for improving product quality and engineering agility. All our active products deliver from commit to production with full automation, which has greatly improved Yahoo’s ability to deliver products.

Part of what enabled Yahoo to make Continuous Delivery at scale a reality was our improved build and release tooling. Now, we are open sourcing an adaptation of our code as Screwdriver.cd, a new streamlined build system designed to enable Continuous Delivery to production at scale for dynamic infrastructure.

Some of the key design features of Screwdriver have helped Yahoo achieve Continuous Delivery at scale. At a high level these are:

  • Making deployment pipelines easy
  • Optimizing for trunk development
  • Making rolling back easy

Easy deployment pipelines: Deployment pipelines that continuously test, integrate, and deploy code to production greatly reduce the risk of errors and shorten the time to get feedback to developers. The challenge for many groups had been that pipelines were cumbersome to set up and maintain. We designed a solution that made pipelines easy to configure and completely self-service for any developer. By managing the pipeline configuration in the code repository, Screwdriver allows developers to configure pipelines in a manner familiar to them and, as a bonus, to easily code review pipeline changes too.
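
For illustration, a pipeline definition checked into the repository might look roughly like the screwdriver.yaml below; the image and commands are placeholders, and the exact schema is documented on Screwdriver.cd:

      # screwdriver.yaml kept alongside the code it builds (illustrative values)
      jobs:
        main:
          image: node:6
          steps:
            - install: npm install
            - test: npm test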

Trunk development: Internally, we encourage workflows where the trunk is always shippable. Our teams use a modified GitHub flow for their workflows. Pull Requests (PRs) are the entry point for running tests and ensuring code that entered the repository has been sufficiently tested. Insisting on formal PRs also improves the quality of our code reviews.

To ensure trunks are shippable, we enable functional testing of code in the PRs. Internally, this is a configuration baked into pipelines that dynamically allocates compute resources, deploys the code, and runs tests. These tests include web testing using tools like Selenium. These dynamically-allocated resources are also available for a period after the PR build, allowing engineers to interact with the system and review visual aspects of their changes.

Easy rollbacks: To allow for easy code rollbacks, we allow phases of the pipeline to be re-run at a previously-saved state. We leverage features in our PaaS to handle the deployment, but we store and pass metadata to enable us to re-run from a specific git SHA with the same deployment data. This allows us to roll back to a previous state in production. This design makes rolling back as easy as selecting a version from a dropdown menu and clicking “deploy.” Anyone with write access to the project can make this change. This helped us move teams to a DevOps model where developers were responsible for the production state.

The successful growth of Screwdriver over the past 5 years at Yahoo has today led to Screwdriver being synonymous with Continuous Delivery within the company. Screwdriver handles more than 25,000 builds per day and 12,000 daily git commits as a single shared entry point for Yahoo. It supports multiple languages and handles both virtual machine and container-based builds and deployment.


Screwdriver.cd’s architecture consists of four main components: a frontend for serving content to the user, a stateless API that orchestrates between user interactions and build operations, the execution engines (Docker Swarm, Kubernetes, etc.) that check out source code and execute builds in containers, and the launcher that executes and monitors commands inside the container.

The diagram below shows this architecture overlaid with a typical developer flow.

[Diagram: Screwdriver.cd architecture overlaid with a typical developer flow]

To give some context around our execution engines: internal Screwdriver started as an abstraction layer on top of Jenkins and used Docker to provide isolation, common build containers, and so on. We used features provided by Jenkins plugins to leverage existing work around coverage and test reports. However, as Screwdriver usage continued to climb, it outgrew a single Jenkins cluster. To keep up, we added capabilities to Screwdriver that allowed us to scale horizontally and to schedule pipelines across a number of Jenkins clusters. As we scaled Screwdriver, we used less from Jenkins and built more supporting services utilizing our cloud infrastructure. The open-source version is focused on Kubernetes and Docker Swarm as our primary supported execution engines.

In the coming months we will expand our offering to match many of the features we have internally, including:

  • Mechanism to store structured build data for later use (last deployed version, test coverage, etc.)
  • Built-in metric collecting
  • System-wide templates to enable getting started quickly
  • Log analysis to provide insights to developers

Please join us on the path to making Continuous Delivery easy. Visit http://screwdriver.cd to get started.

Presenting an Open Source Toolkit for Lightweight Multilingual Entity Linking

yahooresearch:

By Aasish Pappu, Roi Blanco, and Amanda Stent

What’s the first thing you want to know about any kind of text document (like a Yahoo News or Yahoo Sports article)? What it’s about, of course! That means you want to know something about the people, organizations, and locations that are mentioned in the document. Systems that automatically surface this information are called named entity recognition and linking systems. These are one of the most useful components in text analytics as they are required for a wide variety of applications including search, recommender systems, question answering, and sentiment analysis.

Named entity recognition and linking systems use statistical models trained over large amounts of labeled text data. A major challenge is to be able to accurately detect entities, in new languages, at scale, with limited labeled data available, and while consuming a limited amount of resources (memory and processing power).

After researching and implementing solutions to enhance our own personalization technology, we are pleased to offer the open source community Fast Entity Linker, our unsupervised, accurate, and extensible multilingual named entity recognition and linking system, along with datapacks for English, Spanish, and Chinese.

For broad usability, our system links text entity mentions to Wikipedia. For example, in the sentence “Yahoo is a company headquartered in Sunnyvale, CA with Marissa Mayer as CEO,” our system would identify and link the entities Yahoo, Sunnyvale, CA, and Marissa Mayer to their corresponding Wikipedia pages.

On the algorithmic side, we use entity embeddings, click-log data, and efficient clustering methods to achieve high precision. The system achieves a low memory footprint and fast execution times by using compressed data structures and aggressive hashing functions.

Entity embeddings are vector-based representations that capture how entities are referred to in context. We train entity embeddings using Wikipedia articles, and use hyperlinked terms in the articles to create canonical entities. The context of an entity and the context of a token are modeled using the neural network architecture in the figure below, where entity vectors are trained to predict not only their surrounding entities but also the global context of the word sequences contained within them. In this way, one layer models entity context, and the other layer models token context. We connect these two layers using the same technique that Le and Mikolov (2014) used to train paragraph vectors.

[Figure: two-layer architecture modeling entity context and token context]

Search click-log data gives very useful signals to disambiguate partial or ambiguous entity mentions. For example, if searchers for “Fox” tend to click on “Fox News” rather than “20th Century Fox,” we can use this data in order to identify “Fox” in a document. To disambiguate entity mentions and ensure a document has a consistent set of entities, our system supports three entity disambiguation algorithms.

*Currently, only the Forward Backward Algorithm is available in our open source release; the other two will be made available soon!

These algorithms are particularly helpful in accurately linking entities when a popular candidate is NOT the correct candidate for an entity mention. In an article about English football, for example, these algorithms leverage the surrounding context to accurately link Manchester City, Swansea City, Liverpool, Chelsea, and Arsenal to their respective football clubs.


At this time, Fast Entity Linker is one of only three freely available multilingual named entity recognition and linking systems (the others are DBpedia Spotlight and Babelfy). In addition to a stand-alone entity linker, the software includes tools for creating and compressing word/entity embeddings and datapacks for different languages from Wikipedia data. As an example, the datapack containing information from all of English Wikipedia is only ~2GB.


The technical contributions of this system are described in two scientific papers.

There are numerous possible applications of the open-source toolkit. One of them is attributing sentiment to entities detected in the text, as opposed to the entire text itself. For example, consider the following actual review of the movie “Inferno” from a user on MetaCritic (revised for clarity): “While the great performance of Tom Hanks (wiki_Tom_Hanks) and company make for a mysterious and vivid movie, the plot is difficult to comprehend. Although the movie was a clever and fun ride, I expected more from Columbia (wiki_Columbia_Pictures).”  Though the review on balance is neutral, it conveys a positive sentiment about wiki_Tom_Hanks and a negative sentiment about wiki_Columbia_Pictures.

Many existing sentiment analysis tools collate the sentiment value associated with the text as a whole, which makes it difficult to track sentiment around any individual entity. With our toolkit, one could automatically extract “positive” and “negative” aspects within a given text, giving a clearer understanding of the sentiment surrounding its individual components.

Feel free to use the code, contribute to it, and come up with additional applications; our system and models are available at https://github.com/yahoo/FEL.

Great work from our Yahoo Research team!

The Apache Traffic Server Project’s Next Chapter

By Bryan Call, Yahoo Distinguished Software Engineer, Apache Traffic Server PMC Chair 

This post also appears on the ATS blog, https://blogs.apache.org/trafficserver.


Last week, the ATS Community held a productive and informative Apache Traffic Server (ATS) Fall Summit, hosted by LinkedIn in Sunnyvale, CA. At a hackathon during the Summit, we fixed bugs and cleaned up code, users were able to spend time with ATS experts and have their questions answered, and the next release candidate for ATS 7.0.0 was made public. There were talks on operations, new and upcoming features, and supporting products. More than 80 people registered for the event and we had a packed room, with remote video conferencing as well.

I have been attending the ATS Summits since their inception in 2010 and have had the pleasure of being involved with the Apache Traffic Server Project for the last nine years. I was also part of the team at Yahoo that open sourced the code to Apache. Today, I am honored to serve as the new Chair and VP of the ATS Project, having been elected to the position by the ATS community a couple weeks ago.

Traffic Server was originally created by Inktomi and distributed as a commercial product. After Yahoo acquired Inktomi, Yahoo open sourced Traffic Server and submitted it to the Apache Incubator in July 2009.

Since Apache Traffic Server graduated as an Apache Top-Level Project in April 2010, many large and small companies have used it for caching and proxying HTTP requests. ATS supports HTTP/2, HTTP/1.1, TLS, and many other standards. The Apache Committers on the project are actively involved with the Internet Engineering Task Force (IETF), whose mission is to “make the Internet work better by producing high quality, relevant technical documents that influence the way people design, use, and manage the Internet,” to make sure ATS is able to support the latest standards going forward.

Many companies have greatly benefited from the open sourcing of ATS; numerous industry colleagues and invested individuals have improved the project by fixing bugs and adding features, tests, and documentation. An example is Yahoo, which uses ATS for nearly all of its incoming HTTP/2 and HTTP/1.1 traffic. It is a common layer that all users go through before making a request to the origin server. Having a common layer has made it easier for Yahoo to deploy security fixes and updates extremely quickly. ATS is used as a caching proxy in locations worldwide and is also used to proxy requests for dynamic content from remote locations through already-established persistent connections. This decreases latency for users: cacheable content can be served from a nearby cache, and connections are established with a server close to the user.

The ATS PMC and I will focus on continuing to increase the ATS user base and on having more developers contribute to the project. The ATS community welcomes other companies’ contributions and enhancements to the software through a well-established process with Apache. Unlike commercial products, ATS has no limits or restrictions on accepting open source contributions.

Moving forward, we would also like to focus on three specific areas of ATS as a means of increasing the user base, while maintaining the performance advantage of the server: ease of use, features, and stability.

I support further simplifying the configuration of ATS so that end users can quickly get a server up with little effort. Common requirements should be easy to configure, while continuing to allow users to write custom plugins for more advanced requirements.

Adding new features to ATS is important and there are a lot of new drafts and standards currently being worked on in IETF with HTTP, TLS, and QUIC that will improve user experience. ATS will need to continue to support the latest standards that allow deployments of ATS to decrease the latency for the users. Having our developers attend the IETF meetings and participate in the decision-making is key to our ability to keep on top of these latest developments.

Stability is a fundamental requirement for a proxy server. Since all the incoming HTTP/2 and HTTP/1.1 traffic is handled by the server, it must be stable and resilient. We are continually working on improving our continuous integration and testing, and we are making it easier for developers to write tests and run them before contributing code.

The ATS community is a welcoming group of people that encourages contributions and input from users, and I am excited to help lead the Project into its next chapter. Please feel free to join the mailing lists, attend one of our events such as the recent ATS Summit, or jump on IRC to talk to the users and developers of this project. We invite you to learn more about ATS at http://trafficserver.apache.org. 

Why Professional Open Source Management is Critical for your Business

By Gil Yehuda, Sr. Director of Open Source and Technology Strategy

This byline was originally written for and appears in CIO Review

In his Open Source Landscape keynote at LinuxCon Japan earlier this year, Jim Zemlin, Executive Director of the Linux Foundation said that the trend toward corporate-sponsored open source projects is one of the most important developments in the open source ecosystem. The jobs report released by the Linux Foundation earlier this year found that open source professionals are in high demand. The report was followed by the announcement that TODOGroup, a collaboration project for open source professionals who run corporate open source program offices, was joining the Linux Foundation. Open source is no longer exclusively a pursuit of the weekend hobbyist. Professional open source management is a growing field, and it’s critical to the success of your technology strategy.

Open Source Potential to Reality Gap

Open source has indeed proven itself to be a transformative and disruptive part of many companies’ technology strategies. But we know it’s hardly perfect and far from hassle-free. Many developers trust open source projects without carefully reviewing the code or understanding the license terms, thus inviting risk. Developers say they like to contribute to open source, but are not writing as much of it as they wish. By legal default, their code is not open source unless they make it so. Despite being touted as an engineering recruitment tool, developers don’t flock to companies who toss the words “open source” all over their tech blogs. They first check for serious corporate commitment to open source.

Open source offers potential to lower costs, keep up with standards, and make your developers more productive. Turning potential into practice requires open source professionals on your team to unlock the open source opportunity. They will steer you out of the legal and administrative challenges that open source brings, and help you create a voice in the open source communities that matter most to you. Real work goes into managing the strategy, policies, and processes that enable you to benefit from the open source promise. Hence the emerging trend of hiring professionals to run open source program offices at tech companies across the industry.

Getting the Program off the Ground

Program office sounds big. Actually, many companies staff these with as few as one or two people. Often the rest is a virtual team that includes someone from legal, PR, developer services, an architect, and a few others depending on your company. As a virtual team, each member helps address the areas they know best. Their shared mission is to provide an authoritative and supportive decision about all-things open source at your company. Ideally they are technical, respected, and lead with pragmatism – but what’s most important is that they all collaborate toward the same goal.

The primary goal of the open source program office is to steer the technology strategy toward success using the right open source projects and processes. But the day-to-day program role is to provide services to engineers. Engineers need to know when they can use other people’s code within the company’s codebase (to ‘inbound’ code), and when they can publish company code to other projects externally (to ‘outbound’ code). Practical answers require an understanding of engineering strategy, attention to legal issues surrounding licenses (copyright, patent, etc.), and familiarity with managing GitHub at scale.

New open source projects and foundations will attract (or distract) your attention. Engineers will ask about the projects they contribute to on their own time, but in areas your company does business. They seek to contribute to projects and publish new projects. Are there risks? Is it worth it? The questions and issues you deal with on a regular basis will help give you a greater appreciation for where open source truly works for you, and where process-neglect can get in the way of achieving your technology mission.

Will it Work?

I’ve been running the open source program office at Yahoo for over six years. We’ve been publishing and supporting industry-leading open source projects for AI, Big Data, Cloud, Datacenter, Edge, Front end, Mobile, all the way to Zookeeper. We’ve created foundational open source projects like Apache Hadoop and many of its related technologies. When we find promise in other projects, we support and help accelerate them too, like we did with OpenStack, Apache Storm and Spark. Our engineers support hundreds of our own projects, we contribute to thousands of outside projects, and developers around the world download and use our open source code millions of times per week! We are able to operate at scale and take advantage of the open source promise by providing our engineers with a lightweight process that enables them to succeed in open source.

You can do the same at your company. Open source professionals who run program offices at tech companies share openly – it comes with the territory. I publish answers about open source on Quora and I’m a member of TODOGroup, the collaboration project managed by the Linux Foundation for open source program directors. There, I share and learn from my peers who manage the open source programs at various tech companies.

Bottom line: If you want to take advantage of the value that open source offers, you’ll need someone on your team who understands open source pragmatics, who’s plugged into engineering initiatives, and who’s empowered to make decisions. The good news is you are not alone and there’s help out there in the open source community.

Refactoring Components for Redux Performance

By Shahzad Aziz

Front-end web development is evolving fast; a lot of new tools and libraries are published that challenge best practices every day. It’s exciting, but also overwhelming. One of those new tools that can make a developer’s life easier is Redux, a popular open source state container. This past year, our team at Yahoo Search has been using Redux to refresh a legacy tool used for data analytics. We paired Redux with another popular library for building front-end components called React. Due to the scale of Yahoo Search, we measure and store thousands of metrics on the data grid every second. Our data analytics tool allows internal users from our Search team to query stored data, analyze traffic, and compare A/B testing results. The biggest goal of creating the new tool was to deliver ease of use and speed. When we started, we knew our application would grow complex and we would have a lot of state living inside it. During the course of development we were hit by unforeseen performance bottlenecks, and it took some digging and refactoring to achieve the performance we expected from our technology stack. We want to share our experience and encourage developers who use Redux: by reasoning carefully about your React components and your state structure, you can keep your application performant and easy to scale.

Redux wants your application state changes to be more predictable and easier to debug. It achieves this by extending the ideas of the Flux pattern: your application’s state lives in a single store. You can use reducers to split the state logically and manage responsibility. Your React components subscribe to changes from the Redux store. For front-end developers this is very appealing, and you can imagine the cool debugging tools that can be paired with Redux (see Time Travel).
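A minimal sketch of that pattern looks like the following; the action types and state slices are simplified stand-ins, not our production reducers:

import { createStore, combineReducers } from 'redux';

// Each reducer owns one logical slice of state; together they form the single store.
const datesReducer = (state = { from: null, to: null }, action) =>
  action.type === 'SET_DATES' ? { ...state, ...action.payload } : state;

const metricsReducer = (state = {}, action) =>
  action.type === 'METRIC_LOADED'
    ? { ...state, [action.payload.id]: action.payload }
    : state;

const store = createStore(combineReducers({ dates: datesReducer, metrics: metricsReducer }));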

With React it is easier to reason about your components by splitting them into two categories: Container and Presentational. You can read more about it (here). The gist is that your Container components usually subscribe to state and manage flow, while Presentational components are concerned with rendering markup using the properties passed to them. Taking guidance from early Redux documentation, we started by adding container components at the top of our component tree. The most critical part of our application is the interactive ResultsTable component; for the sake of brevity, it will be the focus for the rest of the post.

image
React Component Tree
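As a rough illustration of the Container/Presentational split described above (component and prop names are illustrative, not our production code):

import React from 'react';
import { connect } from 'react-redux';

// Presentational: renders whatever rows it is given.
const ResultsTable = ({ rows }) => (
  <table>
    <tbody>
      {rows.map(row => (
        <tr key={row.id}><td>{row.name}</td><td>{row.value}</td></tr>
      ))}
    </tbody>
  </table>
);

// Container: subscribes to the store and passes data down as props.
const ResultsTableContainer = connect(state => ({ rows: state.results.rows }))(ResultsTable);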

To achieve optimal performance from our API, we make a lot of simple calls to the backend and combine and filter the data in the reducers. This means we dispatch a lot of async actions to fetch bits and pieces of the data, and we use redux-thunk to manage the control flow. However, any change in user selections invalidates most of what we have fetched into the state, and we then need to re-fetch. Overall this works great, but it also means we are mutating state many times as responses come in.
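A thunk for one of those fetches might look roughly like this; the endpoint and action types are assumptions for illustration:

// Illustrative redux-thunk action creator: fetch one metric and let the
// reducers merge the response into state when it arrives.
const fetchMetric = (metricId, dates) => dispatch => {
  dispatch({ type: 'METRIC_REQUESTED', payload: { metricId } });
  return fetch(`/api/metrics/${metricId}?from=${dates.from}&to=${dates.to}`)
    .then(response => response.json())
    .then(data => dispatch({ type: 'METRIC_LOADED', payload: { metricId, data } }));
};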

The Problem

While we were getting great performance from our API, browser flame graphs started revealing performance bottlenecks on the client. Our MainPage component, sitting high up in the component tree, triggered re-renders on every dispatch. While a component render should not be a costly operation, on a number of occasions we had a huge number of data rows to render, and in such cases the re-render took seconds.

So how could we make render more performant? A well-advertised method is the shouldComponentUpdate lifecycle method, and that is where we started. This is a common technique for increasing performance that should be implemented carefully across your component tree. We were able to filter out excess re-renders in cases where the props a component cares about had not changed.
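A simplified example of such a guard on a row component (not our exact check) looks like this:

import React from 'react';

class ResultRow extends React.Component {
  // Skip re-rendering unless the props this row actually uses have changed.
  shouldComponentUpdate(nextProps) {
    return (
      nextProps.metric !== this.props.metric ||
      nextProps.values !== this.props.values
    );
  }
  render() {
    const { metric, values } = this.props;
    return (
      <tr>
        <td>{metric}</td>
        {values.map((v, i) => <td key={i}>{v}</td>)}
      </tr>
    );
  }
}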

Despite the shouldComponentUpdate improvement, our whole UI still seemed clogged. User actions got delayed responses, our loading indicators would show up after a delay, autocomplete lists would take time to close, and small user interactions with the application were slow and heavy. At this point it was not about React render performance anymore.

In order to determine the bottlenecks, we used two tools: React Performance Tools and React Render Visualizer. The experiment was to study users performing different actions and then create a table of the number of renders and instances created for the key components.

Below is one such table that we created. We analyzed two frequently-used actions. For this table we are looking at how many renders were triggered for our main table components.

image
Change Dates: Users can change dates to fetch historical data. To fetch data, we make parallel API calls for all metrics and merge everything in the Reducers.
Open Chart: Users can select a metric to see a daily breakdown plotted on a line chart. This is opened in a modal dialog.

The experiments revealed that state changes had to travel through the tree and trigger a lot of repeated recomputation along the way before reaching the relevant component. This was costly and CPU intensive. While switching products from the UI, React spent 1080ms on table and row components. It was important to realize that this was largely a problem of shared state, which kept the main thread busy. While React’s virtual DOM is performant, you should still strive to minimize virtual DOM creation, which means minimizing renders.

Performance Refactor

The idea was to look at the container components and try to distribute the state changes more evenly across the tree. We also wanted to put more thought into the state itself: make it less shared and more derived. We wanted to store only the most essential items in the state, and compute the state each component needs from it as often as required.

We executed the refactor in two steps:

Step 1. We added more container components subscribing to the state across the component hierarchy. Components consuming exclusive state do not need a container component at the top threading props down to them; they can subscribe to the state and become container components themselves. This dramatically reduces the amount of work React has to do in response to Redux actions.

image

React Component Tree with potential container components

We identified how the state was being utilized across the tree. The important question was, “How could we make sure there are more container components with exclusive states?” For this we had to split a few key components and wrap them in containers or make them containers themselves.

image
React Component Tree after refactor

In the tree above, notice how the MainPage container is no longer responsible for any Table renders. We extracted another component, ControlsFooter, out of ResultsTable; it has its own exclusive state. Our focus was to reduce re-renders on all table-related components.
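Turning ControlsFooter into its own container is essentially a one-liner with react-redux; the state keys below are assumptions for illustration:

import { connect } from 'react-redux';
import ControlsFooter from './ControlsFooter';

// ControlsFooter now subscribes only to its exclusive slice of state, so
// unrelated dispatches no longer force the whole table subtree to re-render.
export default connect(state => ({
  page: state.ui.page,
  rowsPerPage: state.ui.rowsPerPage,
}))(ControlsFooter);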

Step 2. Derived state and shouldComponentUpdate.

It is critical to make sure your state is well defined and flat in nature. It helps to think of the Redux state as a database and start to normalize it. We have a lot of entities like products, metrics, results, chart data, user selections, UI state, etc. We store most of them as key/value pairs and avoid nesting. Each container component queries the state object and de-normalizes the data. For example, our navigation menu component needs the list of metrics for the current product, which we can easily extract from the metrics state by filtering it with the product ID.
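A trimmed-down sketch of that flat shape might look like this (entity names and fields are illustrative, not our real schema):

// Flat, normalized state: entities are stored once, keyed by id, and
// containers derive the views they need by filtering and joining.
const exampleState = {
  products: { p1: { id: 'p1', name: 'Example Product' } },
  metrics: {
    m1: { id: 'm1', productId: 'p1', name: 'Queries' },
    m2: { id: 'm2', productId: 'p1', name: 'Clicks' },
  },
  results: {},                                // keyed by metric id
  selections: { productId: 'p1', dates: {} }, // current user selections
};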

The whole process of deriving state and querying it repeatedly can be optimized. Enter Redux Reselect. Reselect allows you to define selectors to compute derived state. They are memoized for efficiency and can also be cascaded. So a function like the one below can be defined as a selector, and if there is no change in metrics or productId in state, it will return a memoized copy.

const getMetrics = (productId, metrics) =>
  // Return the metrics that belong to the selected product.
  metrics.filter(m => m.productId === productId);

We wanted to make sure all our async actions had resolved and our state had everything we needed before triggering a re-render. We created selectors to define such behavior for shouldComponentUpdate in our container components (e.g., our table will only render once metrics and results have all loaded). While you can still do all of this directly in shouldComponentUpdate, we felt that using selectors lets you think differently: your containers wait for the state they depend on before rendering, even though the state they need for rendering may only be a subset of it. Not to mention, it is more performant.
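Here is a sketch of both ideas with Reselect, assuming the normalized state shape sketched earlier; the selector names are illustrative:

import { createSelector } from 'reselect';

// Memoized, derived state: recomputed only when productId or metrics change.
const getMetricsForProduct = createSelector(
  state => state.selections.productId,
  state => state.metrics,
  (productId, metrics) =>
    Object.values(metrics).filter(m => m.productId === productId)
);

// "Ready" only once every metric for the product has results loaded; a
// container's shouldComponentUpdate can simply return this value.
const isReadyToRender = createSelector(
  getMetricsForProduct,
  state => state.results,
  (metrics, results) => metrics.length > 0 && metrics.every(m => results[m.id])
);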

The Results

Once we finished refactoring we ran our experiment again to compare the impact from all our tweaks. We were expecting some improvement and better flow of state mutations across the component tree.

image

A quick glance at the table above shows that the refactor has clearly paid off. In the new results, observe how changing dates distributes the render count between comparisonTable (7) and controlsHeader (2); it still adds up to 9. Such logical refactors have allowed us to speed up our application rendering by up to 4 times, and have made it much easier to cut down on the time React wastes on unnecessary renders. This has been a significant improvement for us and a step in the right direction.

What’s Next

Redux is a great pattern for front-end applications. It has allowed us to reason about our application state better, and as the app grows more complex, the single store makes it much easier to scale and debug.

Going forward, we want to explore immutability for the Redux state. We want to enforce discipline around Redux state mutations and also make the case for faster shallow comparison of props in shouldComponentUpdate.