Configuration management for distributed systems (using GitHub and cfg4j)

Norbert Potocki, Software Engineer @ Yahoo Inc.

Warm up: Why configuration management?

When working with large-scale software systems, configuration management becomes crucial: supporting non-uniform environments gets much simpler once you decouple code from configuration. While building complex products such as Flickr, we had to come up with a simple yet powerful way to manage configuration. Popular approaches to solving this problem include using configuration files or a dedicated configuration service. Our solution combines the extremely popular GitHub with the cfg4j library, giving you a very flexible approach that works with applications of any size.

Why should I decouple configuration from the code?

  • Faster configuration changes (e.g. flipping feature toggles): Configuration can simply be injected without requiring parts of your code to be reloaded and re-executed. Config-only updates tend to be faster than code deployment
  • Different configuration for different environments: Running your app on a laptop or in a test environment requires a different set of settings than production instance
  • Keeping credentials private: If you don’t have a dedicated credential store, it may be convenient to keep credentials as part of configuration. They usually aren’t supposed to be “public” but the code still may be. Be a good sport and don’t keep credentials in a public GitHub repo :)

Meet the Gang: Overview of configuration management players

Let’s see what configuration-specific components we’ll be working with today:

Figure 1 - Overview of configuration management components

  • Configuration repository and editor: Where your configuration lives. We’re using Git for storing configuration files and GitHub as an ad hoc editor.
  • Push cache: An intermediary store that we use to improve fetch speed and to ease the load on GitHub servers.
  • CD pipeline: A continuous deployment pipeline pushing changes from the repository to the push cache and validating config correctness.
  • Configuration library: Fetches configs from the push cache and exposes them to your business logic.
  • Bootstrap configuration: Initial configuration specifying where your push cache is located (so that the library knows where to get configuration from).

All these players work as a team to provide an end-to-end configuration management solution.

The Coach: Configuration repository and editor

The first thing you might expect from the configuration repository and editor is ease of use. Let’s enumerate what that means:

  • Configuration should be easy to read and write
  • It should be straightforward to add a new configuration set
  • You most certainly want to be able to review changes if your team is bigger than one person
  • It’s nice to see a history of changes, especially when you’re trying to fix a bug in the middle of the night
  • Support from popular IDEs - freedom of choice is priceless
  • Multi-tenancy support (optional) is often pragmatic

So what options are out there that may satisfy those requirements? The three very popular formats for storing configuration are YAML, Java Property files, and XML files. We use YAML because it is widely supported by multiple programming languages and IDEs and it’s very readable and easy to understand, even for the non-engineer.

We could use a dedicated configuration store; however, the great thing about files is that they can be easily versioned by version control tools like Git, which we decided to use as it’s widely known and proven.

Git provides us with a history of changes and an easy way to branch off configuration. It also has great support in the form of GitHub, which we use both as an editor (built-in support for YAML files) and as a collaboration tool (pull requests, forks, review tool). Both are nicely glued together by following the Git flow branching model. Here’s an example of a configuration file that we use:

Figure 2 - configuration file preview

One of the goals was to make managing multiple configuration sets (execution environments) a breeze. We needed the ability to add and remove environments quickly. If you look at the screenshot below, you’ll notice a “prod-us-east” directory in the path. For every environment, we stored a separate directory with config files in Git. All of them have the exact same structure and only differ in YAML file contents.

This solution makes working with environments simple and comes in very handy during local development or new production fleet rollout (see use cases at the end of this article). Here’s a sample config repo for a project that has only one “feature”:

Figure 3 - support for multiple environments

Some of the products that we work with at Yahoo have a very granular architecture - hundreds of micro-services working together. For scenarios like this, it’s convenient to store configurations for all services in a single repository, which greatly reduces the overhead of maintaining multiple repositories. We support this use case by having multiple top-level directories each holding configurations for one service only.

The Sprinter: Push cache

The main role of the push cache is to decrease the load on GitHub servers and to improve configuration fetch time. Since speed is the only concern here, we decided to keep the push cache simple - it’s just a key-value store. Consul was our choice; the nice thing is that it’s fully distributed.

You can install Consul clients on the edge nodes and they will stay synchronized across the fleet. This greatly improves both the reliability and performance of the system. If performance is not a concern, any key-value store will do. You can also skip the push cache altogether and connect directly to GitHub, which comes in handy during development (see the use cases below to learn more about this).

The Manager: CD Pipeline

When the configuration repository is updated, a CD pipeline kicks in. It fetches the configuration, converts it into a more optimized format, and pushes it to the cache. Additionally, the CD pipeline validates the configuration (once at the pull-request stage and again after the merge to master) and controls multi-phase deployment by rolling out a config change to only 20% of production hosts at a time.

The Mascot: Bootstrap configuration

Before we can connect to the push cache to fetch configuration, we need to know where it is. That’s where bootstrap configuration comes into play - it’s very simple. The config contains the hostname and port to connect to, and the name of the environment to use. You need to ship this config with your code or as part of the CD pipeline. A simple YAML file binding Spring profiles to different Consul hosts suffices for our needs:

Figure 4 - bootstrap configuration

The Cool Guy: Configuration library


The configuration library takes care of fetching the configuration from the push cache and exposing it to your business logic. We use a library called cfg4j (“configuration for java”). This library reloads configuration from the push cache every few seconds and injects it into configuration objects that our code uses. It also takes care of local caching, merging properties from different repositories, and falling back to user-provided defaults when necessary (read more at http://www.cfg4j.org/).

Briefly summarizing how we use cfg4j’s features:

  • Configuration auto-reloading: Each service reloads configuration every ~30 seconds and auto re-configures itself.
  • Multi-environment support: for our multiple environments (beta, performance, canary, production-us-west, production-us-east, etc.).
  • Local caching: Remedies service interruption when the push cache or configuration repository is down and also improves the performance for obtaining configs.
  • Fallback and merge strategies: Simplifies local development and provides support for multiple configuration repositories.
  • Integration with Dependency Injection containers: because we love DI :D.

If you want to play with this library yourself, there’s plenty of examples both in its documentation and cfg4j-sample-apps Github repository.

The Heavy Lifter: Configurable code

The most important piece is the business logic. To make the best use of a configuration service, the business logic has to be able to re-configure itself at runtime. Here are a few rules of thumb and code samples:

  • Use dependency injection for injecting configuration. This is how we do it using the Spring Framework (see the bootstrap configuration above for host/port values); a sketch follows right after this list.
  • Use configuration objects to inject configuration instead of providing configuration directly - here’s where the difference is:
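
First, the Spring wiring from the first bullet. The original sample is not reproduced here, so this is a minimal sketch that assumes cfg4j's ConsulConfigurationSourceBuilder and ConfigurationProviderBuilder (exact builder names can differ between cfg4j versions); the configCache.* property names are hypothetical placeholders for the host/port values carried by the bootstrap configuration:

import org.cfg4j.provider.ConfigurationProvider;
import org.cfg4j.provider.ConfigurationProviderBuilder;
import org.cfg4j.source.ConfigurationSource;
import org.cfg4j.source.consul.ConsulConfigurationSourceBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class Cfg4jConfiguration {

  // Host and port come from the bootstrap configuration (Figure 4);
  // the property names here are illustrative, not the actual keys used.
  @Value("${configCache.host}")
  private String consulHost;

  @Value("${configCache.port}")
  private int consulPort;

  @Bean
  public ConfigurationProvider configurationProvider() {
    // The push cache (Consul) is the configuration source.
    ConfigurationSource source = new ConsulConfigurationSourceBuilder()
        .withHost(consulHost)
        .withPort(consulPort)
        .build();

    // A periodic reload strategy can also be configured here; omitted for brevity.
    return new ConfigurationProviderBuilder()
        .withConfigurationSource(source)
        .build();
  }
}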

Direct configuration injection (won’t reload as config changes)
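
The original sample was an image; here is a minimal sketch of direct injection, using a hypothetical thumbnail.size property. The value is resolved once, when the bean is constructed, so it never picks up later changes:

import org.cfg4j.provider.ConfigurationProvider;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class ThumbnailService {

  private final int thumbnailSize;

  // The property is read once at construction time, so later changes
  // in the configuration store are never picked up by this bean.
  @Autowired
  public ThumbnailService(ConfigurationProvider provider) {
    this.thumbnailSize = provider.getProperty("thumbnail.size", Integer.class);
  }

  public int thumbnailSize() {
    return thumbnailSize;
  }
}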

Configuration injection via “interface binding” (will reload as config changes):
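
Again a minimal sketch with the same hypothetical property. cfg4j's bind() returns a proxy, so each call to size() reflects the most recently reloaded configuration:

import org.cfg4j.provider.ConfigurationProvider;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

// Each method maps to a property under the bound prefix
// (here, "thumbnail.size").
interface ThumbnailConfig {
  Integer size();
}

@Component
public class ReloadableThumbnailService {

  private final ThumbnailConfig config;

  @Autowired
  public ReloadableThumbnailService(ConfigurationProvider provider) {
    // bind() returns a live view; every size() call re-reads the
    // most recently fetched configuration.
    this.config = provider.bind("thumbnail", ThumbnailConfig.class);
  }

  public int thumbnailSize() {
    return config.size();
  }
}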

The exercise: Common use-cases (applying our simple solution)

Configuration during development (local overrides)

When you develop a feature, a main concern is the ability to evolve your code quickly. A full configuration management pipeline is not conducive to this ability. We use the following approaches when doing local development:

  • Add a temporary configuration file to the project and use cfg4j’s MergeConfigurationSource for reading config both from the configuration store and from your file. By making your local file the primary configuration source, you provide an override mechanism: if a property is found in your file, it will be used; if not, cfg4j falls back to the values from the configuration store. A sketch covering this and the next approach follows this list (see the examples above for complete code).
  • Fork the configuration repository, make changes to the fork, and use cfg4j’s GitConfigurationSource to access it directly (no push cache required).
  • Set up your private push cache, point your service to the cache and edit values in it directly.
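
Here is a minimal sketch of the first two approaches, assuming cfg4j's MergeConfigurationSource, FilesConfigurationSource, GitConfigurationSourceBuilder, and ConsulConfigurationSourceBuilder as described in the cfg4j documentation (exact constructors and merge precedence may differ between versions); hosts, ports, file names, and the repository URL are hypothetical:

import java.nio.file.Paths;
import java.util.Collections;

import org.cfg4j.provider.ConfigurationProvider;
import org.cfg4j.provider.ConfigurationProviderBuilder;
import org.cfg4j.source.ConfigurationSource;
import org.cfg4j.source.compose.MergeConfigurationSource;
import org.cfg4j.source.consul.ConsulConfigurationSourceBuilder;
import org.cfg4j.source.files.FilesConfigurationSource;
import org.cfg4j.source.git.GitConfigurationSourceBuilder;

public class LocalDevConfig {

  public static ConfigurationProvider localOverrideProvider() {
    // Remote source: the push cache used by every other environment.
    ConfigurationSource remote = new ConsulConfigurationSourceBuilder()
        .withHost("localhost")   // hypothetical dev values
        .withPort(8500)
        .build();

    // Local source: a temporary properties file kept in the project
    // (the lambda assumes ConfigFilesProvider is a single-method interface).
    ConfigurationSource local = new FilesConfigurationSource(
        () -> Collections.singletonList(Paths.get("local-overrides.properties")));

    // Merge the two; argument order decides which source wins on
    // conflicting keys (check the cfg4j docs for the exact semantics).
    ConfigurationSource merged = new MergeConfigurationSource(remote, local);

    return new ConfigurationProviderBuilder()
        .withConfigurationSource(merged)
        .build();
  }

  public static ConfigurationProvider gitBackedProvider() {
    // Alternative for local development: read a fork of the config repo
    // directly, with no push cache in between.
    ConfigurationSource git = new GitConfigurationSourceBuilder()
        .withRepositoryURI("https://github.com/your-fork/config-repo.git") // hypothetical
        .build();

    return new ConfigurationProviderBuilder()
        .withConfigurationSource(git)
        .build();
  }
}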

Configuration defaults

When you work with multiple environments, some of them may share a common configuration. That’s when using configuration defaults may be convenient. You can do this by creating a “default” environment and using cfg4j’s MergeConfigurationSource for reading config first from the original environment and then (as a fallback) from the “default” environment.

Dealing with outages

Configuration repository, push cache and configuration CD pipeline can experience outages. To minimize impact of such events, it’s good practice to cache the configuration locally (in-memory) after each fetch. cfg4j does that automatically.

Responding to incidents - ultra fast configuration updates (skipping configuration CD pipeline)

Tests can’t always detect all problems. Bugs leak to production environment and at times it’s important to make a config change as fast as possible to stop the fire. If you’re using push cache, the fastest way to modify config values is to make changes directly within the cache. Consul offers a rich REST API and web UI for updating configuration in the key-value store.

Keeping code and configuration in sync

Verifying that code and configuration are kept in sync happens at the configuration CD pipeline level. One part of the continuous deployment process deploys the code into a temporary execution environment and points it to the branch that contains the configuration changes. Once the service is up, we execute a batch of functional tests to verify configuration correctness.

The cool down: Summary

The presented solution is the result of the work we put into building huge-scale photo-serving services. We needed a simple yet flexible configuration management system, and combining Git, GitHub, Consul, and cfg4j provided a very satisfactory solution that we encourage you to try.


I want to thank the following people for reviewing this article: Bhautik Joshi, Elanna Belanger, Archie Russell.

Elements of API design: Delivering a flawless NFL experience

Sid Reddy, Director of Engineering, VDEO

On October 25, 2015, Yahoo live streamed the first-ever regular season NFL game over the Internet to a global audience. This was a phenomenal effort, as several teams at Yahoo came together to deliver a flawless experience to viewers across devices and geographies. To give you a sense of the scale involved, the NFL live stream on Yahoo attracted over 15 million unique viewers, over 33 million views, and had a peak of over 3 million concurrent viewers. 33% of these views came from international viewers from 185 countries, who watched over 460 million minutes of the game.

Our innovative API systems were one of the key drivers for the broadcast’s success. During the 4 hours of the NFL game, the API platform fielded over 215 million calls, with a median latency of 11ms, and a 95th percentile latency of 16ms. The APIs showed six 9s of availability during this time period, despite failure of dependencies during spikes in the game.

In this post, we will discuss the importance of APIs to the NFL video experience, examine the Key Performance Indicators (KPIs) to pay attention to, and highlight several essential elements that comprise a robust API platform.

Why are the APIs important?

The APIs are critical to powering the player and the video experiences. As an example, the APIs provide the player with several pieces of information about the video, including the title, description, as well as URLs to fetch “streams” of video (each stream represents a particular combination of resolution, bitrate, and encoding profile); the APIs also filter and provide only those streams to the player that are supported by the given device.

The Lifecycle of an API request

The lifecycle starts with a client doing a DNS lookup for the API, then issuing an HTTPS request that is terminated at the Edge; the Edge then forwards this request to the API origin which calls several dependency services, processes the data received, and builds the final response (see the figure below).


In the remainder of this post, we will discuss our KPIs, the Yahoo ecosystem that we leveraged, and several lessons from having examined our API stack end-to-end. The sections below are organized around components as they are encountered in the lifecycle of an API request.

Know your KPIs

Always start by defining your KPIs and measure your performance against them. This will help ensure that changes you make to the system lead to improved performance. For our system, we had 4 KPIs:

  • Latency: We needed to provide users with an instant-on video experience.
  • Availability: We only had one shot with live events, so we had better be highly available!
  • Error rate: We wanted our system to operate without errors, even in the face of dependency degradation, data store unreachability, etc.
  • Throughput: We wanted to maximize our hardware capabilities to ensure higher scalability for a given cost.

Yahoo ecosystem

Yahoo has a long history of operating services at the highest scale, and as such, several infrastructure components are readily available for use. For example, to avoid a single point of failure, API systems at Yahoo are present by default in several data centers around the world (serving the Americas, Europe, and APAC regions). Paradigms exist to replicate data across these data centers, while ensuring consistency. Also, code is deployed to these data centers with a CI/CD (Continuous Integration/Continuous Delivery) pipeline. Given this baseline Yahoo ecosystem, we will identify some of the most interesting enhancements below.

DNS

This is the first stop for any client talking to APIs. Should a data center become unavailable, clients should get routed appropriately for high availability. Yahoo operates such a DNS-based GSLB (Global Server Load Balancing) system. There are typically two components to such a system: periodic health checks for the different data centers, and setting the TTL (Time To Live) for DNS records. We decreased these times by 5x to improve availability, with the tradeoff being an increase in health check traffic to our origin and DNS servers.

Use the Edge

While our APIs are already present in several data centers, adding an edge layer (aka ADN: Application Delivery Network) helps in lowering latencies significantly, by up to 25% in our case, because the HTTPS termination happens closer to the user. SSL costs 2 additional round trips (1 additional round trip if session is resumed), and optimizing this latency helps significantly.

Abuse protection

We have a system in place to detect abuse, and we are able to dynamically accept, reject, or degrade requests. This system helps protect the APIs against potential DDoS attacks, preventing a meltdown of either the APIs or their dependencies. Our systems detect abuse from IP addresses as well as “users”, and enable whitelisting/blacklisting of specific entities. This system runs on the edge layer, and relays only “good” requests to the API system.

Caching

We have a caching layer on each of the API nodes. This technique not only improves latency (by up to 20%), but also enhances availability when dependency services go down. In addition, the load on dependency systems is reduced significantly (by up to 70% in our case). For maximum flexibility, we place the cache close to the dependency calls, and use Guava in our implementation.
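
The post says the cache sits right in front of the dependency calls and uses Guava; here is a minimal sketch of that pattern with Guava's LoadingCache. The cache size, TTL, and the metadata fetch below are illustrative, not the actual production values:

import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class DependencyClient {

  // The cache wraps the dependency call directly, so a hit avoids a
  // network round trip entirely.
  private final LoadingCache<String, String> metadataCache = CacheBuilder.newBuilder()
      .maximumSize(10_000)                       // hypothetical size
      .expireAfterWrite(30, TimeUnit.SECONDS)    // hypothetical TTL
      .build(new CacheLoader<String, String>() {
        @Override
        public String load(String videoId) {
          return fetchMetadataFromDependency(videoId);
        }
      });

  public String metadata(String videoId) {
    return metadataCache.getUnchecked(videoId);
  }

  private String fetchMetadataFromDependency(String videoId) {
    // Placeholder for the real dependency call (HTTP, Thrift, etc.).
    return "metadata-for-" + videoId;
  }
}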

Dependency protection

Occasionally, we notice patterns where a dependency service degrades, causing threads in the API system to stall. This dramatically reduces throughput and can bring the system down due to thread exhaustion. To avoid this, we implemented aggressive timeouts on dependencies, and an automated mechanism to eliminate calls to dependency services when they are down. This technique improves scalability significantly, as threads can now proceed instead of waiting for a timeout.
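
The post doesn't say which library implements these timeouts and the call-elimination, so here is a minimal hand-rolled sketch of the idea: an aggressive per-call timeout plus a crude "stop calling for a while after a failure" switch. A production system would typically use a full circuit breaker and a bounded worker pool:

import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class ProtectedDependency<T> {

  private final Duration timeout;        // aggressive per-call timeout
  private final Duration openInterval;   // how long to skip calls after a failure
  private final AtomicReference<Instant> openUntil = new AtomicReference<>(Instant.MIN);

  public ProtectedDependency(Duration timeout, Duration openInterval) {
    this.timeout = timeout;
    this.openInterval = openInterval;
  }

  public Optional<T> call(Supplier<T> dependencyCall) {
    // If the dependency recently failed, skip it instead of tying up a thread.
    if (Instant.now().isBefore(openUntil.get())) {
      return Optional.empty();
    }
    try {
      T result = CompletableFuture.supplyAsync(dependencyCall)
          .get(timeout.toMillis(), TimeUnit.MILLISECONDS);
      return Optional.ofNullable(result);
    } catch (Exception e) {
      // Timeout or failure: stop calling the dependency for a while.
      // (A real implementation would also cancel the in-flight task.)
      openUntil.set(Instant.now().plus(openInterval));
      return Optional.empty();
    }
  }
}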

Speculative retry

In our analysis, some dependencies tend to have a disproportionately large p99 (aka 99th-percentile) latency compared to, say, the p95 latency. More generally, we see patterns where there is a steep jump in latency. At the point where the latency jump occurs, we introduce a parallel retry and consume the first response received. The tradeoff is an increase in traffic to our dependency systems in exchange for lower latency and error rate. In our case, this approach reduced latency by up to 22%, and error rate by up to 81%.
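
Here is a minimal sketch of the speculative (hedged) retry idea using CompletableFuture. The hedge delay would be set near the observed latency knee (for example, around p95); a production version would also skip or cancel the backup call once the primary completes:

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class SpeculativeRetry {

  // Fire the request; if it has not completed by 'hedgeAfter', a second
  // identical request is fired, and whichever response arrives first wins.
  public static <T> CompletableFuture<T> withHedge(Supplier<T> request, Duration hedgeAfter) {
    CompletableFuture<T> primary = CompletableFuture.supplyAsync(request);

    Executor delayed = CompletableFuture.delayedExecutor(
        hedgeAfter.toMillis(), TimeUnit.MILLISECONDS);
    CompletableFuture<T> backup = CompletableFuture.supplyAsync(request, delayed);

    // applyToEither completes with the first of the two results.
    // (For brevity, the backup is not cancelled when the primary wins.)
    return primary.applyToEither(backup, result -> result);
  }
}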

Throughput benchmarking

We invested a significant amount of time in optimizing our API performance. We have high scalability targets, and squeezing performance from hardware is critical to keeping costs under control. We improved our performance 2.3X, with a further 3X improvement planned for the future (by optimizing our core Java platform). An important lesson from this exercise is to benchmark a dummy API as well as the APIs of interest, and measure where the throughput drops are happening; then systematically target the causes of the biggest drops. Another lesson is to pre-materialize the data in an off-stage process and keep the serving APIs simple; this technique, along with an in-memory data store, improves throughput significantly.

Monitoring/Alerting

When you measure, you can act. We developed extensive monitoring capabilities for the entire API system, including dependencies, on a variety of metrics to track the KPIs. We also have proactive alerting, whenever the SLAs are not met. This proved critical during the NFL live stream, as we detected some dependency services that were briefly unavailable, and were able to repair the services involved. Given the robust design and fallbacks, such intermittent issues did not significantly degrade the end-user experience.

Summary

Building a robust and highly scalable API platform was key to delivering a flawless experience for the first NFL game that was live streamed over the Internet. In this post, we examined several techniques and components that enabled such a robust API platform. Here’s a table summarizing the techniques, and the KPIs they promote.



Future work

We have several interesting projects on our roadmap to make our APIs more resilient, including incorporating a destructive testing framework into our API platform to ensure continued robustness, as well as an automated framework to take bad servers/data centers out of rotation.

If you are excited about this sort of work, reach out to sidreddy at yahoo-inc dot com.

Using NSURLProtocol for Testing

By Roberto Osorio-Goenaga, iOS Developer

Unit testing networking code can be problematic, due mostly to its asynchronous nature. Using a staging server introduces lag and external factors that can cause your tests to run slowly, or not at all. To address this, frameworks like OCMock let you specify how an object responds to a specific query, but a mock object must still be set up for each type of behavior being mocked.

Fry Tests a Server Connection

Using Apple’s NSURLProtocol, we can create a test suite that eschews these problems by mocking the response to our network calls centrally, essentially letting your tests focus only on business logic. This protocol can be used not only with the built-in NSURLSession class, but also to test classes and structs written with modern third-party networking libraries, such as the popular Alamofire. In this article, we will look at mocking network responses in Swift for requests made using Alamofire. The sample project can be found on GitHub.

NSURLProtocol’s main purpose is to extend URL loading to incorporate custom schemes or enhance existing ones. A secondary, yet extremely powerful, use of NSURLProtocol is to mock a server by sending canned responses back to callbacks and delegates. Say we have a very simple struct that uses Alamofire to make an HTTP GET request.

Fig 1 - A simple struct that serves as a REST client

The sample in Figure 1 creates a struct with an NSURL as an init parameter, and a sole method, getAvailableItems(), that takes a completion block as an argument, makes a REST call to the NSURL, and populates an array of MyItem in the block passed into it. From a testing perspective, we’d like to have a JSON response that matches the expected response, containing an object called items whose value is an array of strings. In order to make our tests as thorough and robust as possible, we’d also include at least two other mock responses: a JSON response that does not match this expectation, to test the else clause, and a garbage or erroneous response to check our error handling.

 
Fig 2 - A valid response

 
Fig 3 - A non-valid response

 
Fig 4 - A throw-away garbage response
 

Figures 2, 3 and 4 show a valid response for our purposes, a non-valid yet correct JSON response, and a throw-away string that isn’t even valid JSON, respectively. Without having to make a full-blown staging server, let’s see how we could go about testing these using NSURLProtocol.

To understand where NSURLProtocol fits into this problem, it’s important to look at a bit of the architecture Alamofire employs. Alamofire works as a singleton, as one can see from the example above: there is no instantiation required - just feed a URL in and make a request. Under the hood, the entity making the request is called the Manager. The Manager is the entity that actually stores the URL and parameters, and is responsible for firing off an NSURLSession request, abstracted away from the caller class.

The manager for Alamofire can be initialized with a custom configuration of type NSURLSessionConfiguration, which has a property called protocolClasses, an array of NSURLProtocol members. By creating a new protocol that defines what happens when NSURLSession tries to reach a certain type of endpoint, loading it into the protocol array of a new configuration at index 0 (the default configuration), and initializing a new Manager object with this configuration, we can inject Alamofire with a simple, local mock server that will return whatever we want, given any request. Let’s start setting up a test class for our REST client by extending NSURLProtocol to respond to GET requests, and creating an Alamofire.Manager object with a custom NSURLSessionConfiguration that employs our protocol.

Fig 5 - Setting up a testing class for our client

Great, now we have an NSURLProtocol class that takes a GET request, checks the URL, and returns either a valid JSON response, or a simple “GARBAGE” response. This should allow us to test how our client responds. We still haven’t written any cases. We have a MyRESTClient property, as well as a Manager property. We also have a setup initial method that instantiates and loads our custom protocol into the manager instance. We now need a way to inject this manager instance into our Alamofire singleton. Let’s extend our client to the following.

Fig 6 - The REST client with an injectable “manager” parameter

We’ve added an initializer to our struct that allows us to send either a custom manager or nil into Alamofire. When the parameter is nil, the manager will load with its standard configuration. We also edited the request execution to be called via the manager we selected instead of directly through Alamofire. We can now add the following test case to our test class.

Fig 7 - Our first test case

In this test case, we create a new client, and give it our custom manager through the new initializer. We set a testing expectation, since the result comes back on a closure, and, after loading our itemsArray inside, fulfill the expectation. We tell the test case to wait for said expectation to be fulfilled, and, once it is, we make sure the itemsArray contains three items. If so, our test is successful, and our business logic is tested for getAvailableItems. Notice that we have used a bogus URL of “http://notNil”, which we have defined in the protocol to be selected in the conditional for populating the response correctly. To test the “garbage” case, we could write a test like the following.

Fig 8 - A test case for verifying a garbage response

In this second test case, the mocked URL of “http://nil” is not recognized, and the protocol responds by returning “GARBAGE”, thus not populating the response array. If our method is written correctly, it will call the closure with a nil array.

Fig 9 - A test case for verifying an incorrect response

In the third and final test case, our protocol class will return a “concepts” array instead of an “items” array, so the end result should still be a nil array in the closure.

As you can see, using NSURLProtocol we have created what amounts to a tiny server that responds to our requests and replies as specified, perfect for testing our asynchronous net calls. Now, go forth and test!

“Turing Test” for OTT Video Streaming: Can a viewer distinguish between Streaming and Broadcast Video in 2016?

P.P.S. Narayan, VP of Engineering

Introduction

About sixty-five years ago, Alan Turing, considered to be the father of Computer Science and modern computing machines, put forth a deep and philosophical question: can machines think? Asked differently, “Can a machine exhibit (intelligent) behavior indistinguishable from a human [1]?” To explore the question, he devised a test, which he coined the imitation game. While there have been various interpretations of (and extensive debate about) the test, our objective is to look at its application to video streaming by understanding the premise and appreciating the concept.

Figure 1: Imitation Game

In my interpretation, Alan Turing proposed that there are 3 participants: A, B and C. The role of participant C is that of the interrogator, and C communicates with A and B only via written text. C cannot see or talk to the other participants and only knows them as labels [2]. By asking questions of A and B, C tries to determine which of the two is the human and which is the machine.

Turing says:

I believe that in about fifty years’ time it will be possible to program computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 percent chance of making the right identification after five minutes of questioning. … I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.

Over the years, the Turing test has become an important concept in artificial intelligence and the evolution of computing in modern times. In fact, some modifications to the Turing test have been proposed and have been adopted widely. CAPTCHA is a form of reverse Turing test, where we have participant C replaced by a machine, and the task of C, as a gatekeeper, is to distinguish the humans (in A or B or …) from machines [1].

TV on the internet

Television and video broadcasting has been around for decades. Starting with the first television broadcasts in the 1930s to modern broadcasts in 3D and 4K UHD, the evolution has been phenomenal and eye-popping. Over the years, TV broadcast has moved from terrestrial over-the-air radio waves to satellite or fiber delivery to the home. Analog signals have modernized to digital; black-and-white to color. Televisions have evolved from mechanical to electronic to digital, and traditional CRT displays have changed to ultra thin LED displays.

In the mid-1990s, with the explosion of the consumer Internet, a new wave of video consumption started on devices other than the traditional television sets. Video streaming was soon available on desktops and then laptops, and with the WiFi and mobile revolution, we began streaming video on our phones and tablets.

Nearly a century of evolution in standards, technology, infrastructure, and innovation helped the US TV industry grow to more than $100B [6]. We are now in a new generation where internet streaming is a revolution and growing rapidly. “Over-the-Top” video streaming, or OTT as it is commonly referred to, is delivered through the open, unmanaged internet, with the “last-mile” companies (e.g., Comcast) acting only as the internet service provider [3]. Netflix, an OTT video streaming service, already accounts for more than 35% of peak US internet traffic. In fact, all of OTT video streaming accounts for more than 50% of both internet and mobile traffic in the US [4].

This new form of video consumption is pushing the infrastructure and technology built for the Internet beyond its original design parameters and capabilities. While content and user interfaces have advanced quite drastically with OTT, the overall quality of experience is still years behind.

The “OTT Turing Test”

Several months ago, I attended the Streaming Media West conference. During one of the panels, a question was asked about the “success metrics” for OTT streaming. Rather than inventing our own definition of “success”, I believe we need to use existing TV quality as a baseline benchmark.

Say we modify the original Turing test to replace A and B with simple LED TV-like monitors. The video inputs for A and B carry similar video (say, an NFL game), one from an over-the-air HD signal and the other from an OTT stream. The participant C, the interrogator, has to interact with the two monitors to determine which of A or B is displaying the video that is OTT streamed. We coin this the “OTT Turing test”.

Figure 2: OTT Turing Test

What is the holy grail? Quoting Turing again, an average interrogator will not have more than 70 percent chance of making the right identification after five minutes of viewing.

Note that we modified the original Turing test in just a couple of ways. For one, C can only determine the outcome based on visual perception of the video being played on the two monitors. Second, we have restricted the interaction of the interrogator with the monitors to be very limited. The only instruction that C can deliver is to start or stop the video via a “remote” controller. While C can consume the video experience on the TV, they cannot use it like a DVR/VCR or a cable box. A few things the interrogator cannot do are “pause” and “play”. C cannot “change channels”. C cannot choose from a “channel guide”. And so on.

And, to make it even more challenging, C could be either a human or a machine. Therefore, any variance in human perception will be eliminated via more systematic and objective evaluation of the inputs.  

If we did this test today, would OTT pass the Turing test? My answer is no, because there are a number of challenges that have yet to be completely solved. Let’s examine some of these challenges, why they exist, and how far away the technology is from achieving this goal.

Startup Time

When we want to start watching television today, we switch “on” the TV set and/or our connected device (e.g., a cable box). Typically, these start up in at most 2-3 seconds, and you can see the video on your screen in that time. At times, the audio even starts sooner. In traditional television, the startup latency can be broken down into two parts: the time taken for the first few frames of the video to reach the device, which is determined by the speed of light, and the additional time taken for decoding the signal and constructing the frames to be rendered on the display. This holds true for over-the-air broadcasts as well as for cable or satellite transmissions.

Figure 3: Propagation Delay in television

For internet OTT streaming to be indistinguishable, we need startup times like those seen on traditional television today. Current streaming protocols have been designed and evolved to take into account the “best-effort” nature of the internet. The packetization of video in protocols like HLS and DASH enables playback to start faster, with the ability to deliver key frames in a few packets. Also, modern approaches like P2P and UDP-based network coding help increase the reliability of packet delivery on a best-effort network. In addition to the delivery of the video packets, other overheads like player downloads (e.g., Flash and JavaScript) impact the speed of video startup.


Video Quality

These days, most displays support HD resolutions of 720p or 1080i/p – table stakes in video broadcasting. In fact, most HDTV broadcasts are shot at 1080i/60 fields-per-second. Modern television sets have sophisticated built-in upscaling algorithms and technologies that can take lower-resolution incoming signals and upscale them to 1080p or even 4K. However, these algorithms vary from device to device and manufacturer to manufacturer, and upscaling can introduce artifacts on certain types of content (e.g., high-motion sports like the NFL, or video games).

Figure 4: Common Display resolutions [5]

So having OTT streaming start at lower resolutions and lower frame rates will not produce a viewing experience indistinguishable from traditional television broadcasts. It is imperative that the whole OTT video pipeline produces content at high enough resolutions with purity in signal. Starting with the cameras, the whole video pipeline, including video signal acquisition, video mixing, encoding, transcoding, and all the packetization, must do 1080p or higher resolutions natively.

The Spinner

Anyone who has watched video on the internet knows (or should know) what the “Spinner” is. It is the non-technical term for the visible manifestation of not being able to deliver content reliably over a network that is not dedicated.



Figure 5: Typical “Spinner” during Internet Streaming (Image Courtesy [8])

Buffering is a concept introduced in internet streaming that allows the video player to collect enough video frames (and keep collecting additional frames regularly) to maintain smooth playback on an unreliable, best-effort Internet. Traditional television broadcasts have no concept of buffering (unless you are watching video-on-demand content from your cable provider) and no visual manifestations (e.g., a spinner) are shown to viewers.

Buffering and re-buffering (after the video has started) have become a pervasive pain point for viewers of OTT streaming, and there are many different causes of rebuffering. The reasons range from end-viewer device capabilities (and software), to network congestion inside the viewer’s home or premises, to ISP or local-area bandwidth issues, to the core getting clogged, and so on.

Rebuffering ratio is a typical industry measure of perceived user quality. It is measured over a time window as the ratio of the total time spent rebuffering across all users to the total time spent viewing the stream. Rebuffering ratios of 2-4% are considered acceptable across the industry today. For example, if you are watching a video online for 2 minutes, it is normal for you (and all users) to see the spinner for 2-3 seconds. This typically gets worse during live streams, so if you watch an NFL game for more than 2 hours, you are more than likely to experience many interruptions in your viewing, totaling up to 2-3 minutes!! Imagine that experience compared to what you get on TV or cable today.

At Yahoo, our goal for re-buffering is ZERO. Plus, we’ve focused on some new metrics. For example, the percentage of rebuffering impacted views is more important and relevant. Having 100% of your viewers impacted for 1 minute is far worse than 1% of your viewers impacted for 100 minutes. Basically, this would mean that 99% of your viewers had a FLAWLESS experience. That number needs to grow to five 9s like traditional television.
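
Both of these are simple ratios over per-view data. Here is a minimal sketch; the View record and the sample numbers below are illustrative, not real measurements:

import java.util.List;

public class RebufferingMetrics {

  // Per-view sample: how long the viewer watched and how long they saw the spinner.
  public record View(double watchSeconds, double rebufferSeconds) {}

  // Rebuffering ratio: total rebuffering time / total watch time, over a window.
  public static double rebufferingRatio(List<View> views) {
    double rebuffer = views.stream().mapToDouble(View::rebufferSeconds).sum();
    double watch = views.stream().mapToDouble(View::watchSeconds).sum();
    return rebuffer / watch;
  }

  // Percentage of views that saw any rebuffering at all.
  public static double impactedViewsPercent(List<View> views) {
    long impacted = views.stream().filter(v -> v.rebufferSeconds() > 0).count();
    return 100.0 * impacted / views.size();
  }

  public static void main(String[] args) {
    // A 2-hour (7200 s) view at a 2% rebuffering ratio spends about 144 s buffering.
    List<View> views = List.of(new View(7200, 144), new View(7200, 0));
    System.out.printf("ratio=%.3f impacted=%.0f%%%n",
        rebufferingRatio(views), impactedViewsPercent(views));
  }
}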

Visual Artifacts

Video encoding is a complex process, from signal acquisition to mixing to adding audio tracks and then shipping the packets to the device. The capabilities of receiving devices can vary quite drastically, and this is handled when encoding video digitally for transmission by picking a few different resolutions. This allows the video to render well on various form factors (e.g., a 4K display vs. a VGA device), and also allows for optimizing the number of bits transmitted depending on the viewing device. The optimization comes from the fact that lower resolutions require fewer bits to encode.

Usual OTT streams are encoded in a few different resolutions or “bitrates”. The availability of various bitrates has become a mechanism for handling poor network conditions. Video players on devices “adapt” and deliver a continuous experience (i.e. avoid rebuffering). The tradeoff has become straightforward – reduce resolution rather than stop the video.


Figure 6: Blockiness of Video (Image courtesy [7])

What does this result in? OTT streaming regularly causes video to look “blocky” when lower bitrates are rendered at lower resolutions. While the artifacts are clearly perceptible to viewers of OTT, it is also easy for a machine to detect the noise introduced into the original video signal as a result of the encoding bitrates. Traditional television transmissions rarely, if ever, show any blockiness or pixelation to viewers. This effect is so prevalent in OTT streams of bitrate-optimized content that an interrogator may deliberately look for blockiness in scenes that are visually complex or contain fast motion, as a way of identifying which display is OTT.


Ad Transition

Advertising on television is native and seamless. What do I mean by native? A viewer sees very little or no difference between content and advertising. In other words, when the break between content and ads is under 500ms, it is imperceptible from the viewer’s perspective and considered “seamless”. Over time, the television ad ecosystem has evolved so that local stations have the ability to insert local video ads into global or national content.

Figure 7: Ad insertion technology

In OTT streaming, we need to move closer to a TV-like experience for ads. Today, OTT video ad insertion for mid-roll (during content or a broadcast) is typically done from the viewing device. Advertisers are excited by the many advantages of this technology: it allows ads to be more personalized and targeted than local ads, and the medium also allows advertisers to get more detailed engagement metrics than traditional television. However, this challenges the seamless experience, and viewers typically see “spinners” and blank screens during ad insertion.

Synchronization

Consider watching a live (American) football, cricket, or soccer game on television, maybe with a bunch of friends or strangers in a bar with multiple big-screen televisions. We quite frequently do this to enjoy a social viewing experience. With current television technology, typically all the television sets in the bar are within a few milliseconds of each other – not much lag or lack of synchronization.

However, with OTT, multiple viewers of the same “live” event cannot be guaranteed a synchronized viewing experience, even if they all started streaming at the exact same moment! This common problem is due to the inherent nature of various streaming protocols and buffer management. In OTT streaming, especially for live events, it is likely that a viewer will see a goal at a significantly different moment in time than their neighbors in the same apartment complex, which creates a bad user experience.

Summary

In summary, as an industry, we have a lot of challenges to overcome to deliver a TV-like experience for OTT streaming. Some of them are easy, while others are quite difficult.

It is likely that very soon, startup time, video quality, and ad transitions will improve significantly and become indistinguishable from the current television experience. The tougher technical challenges will be getting to ZERO rebuffering and enabling a synchronized watching experience for OTT. These are fundamental challenges in today’s technology that will require significant innovation and even some revolutionary new concepts, such as P2P or new protocols, to up the game for OTT.

However, I am very confident that we will overcome these challenges, and win the OTT Turing Test – much before the original Turing Test is solved!!

References

[1] https://en.wikipedia.org/wiki/Turing_test

[2] http://plato.stanford.edu/entries/turing-test/#Tur195ImiGam

[3] https://en.wikipedia.org/wiki/Internet_television

[4] http://www.statista.com/chart/1620/top-10-traffic-hogs/

[5] https://en.wikipedia.org/wiki/Display_resolution#/media/File:Vector_Video_Standards8.svg

[6] http://www.statista.com/topics/977/television/

[7] http://www.bestshareware.net/howto/img1/how-to-remove-pixellation-from-video-1.jpg

[8] http://blogs.tigarus.com/patcoola/wp-content/uploads/sites/patcoola/2014/buffering-2014.png

Yahoo Account Key Now Available With New Features On Even More Apps


By Lovlesh Chhabra, Product Manager

Passwords suck and we’re on a mission to kill them. That’s why we introduced Yahoo Account Key in October 2015, which is now available on over 50M devices. This product lets you access your Yahoo account with a simple tap on an Account Key push notification sent to your mobile device. It is a major step towards a password-free future, one where we can say “Goodbye, complicated passwords!”

Since then, we’ve made improvements to the product and wanted to share the steps we’ve taken to make signing-in easier:

  • Available Anywhere: Now, you can use Account Key in a majority of Yahoo apps, including Yahoo Finance, Fantasy, Mail, Messenger, and Sports for iOS and Android. Traveling abroad? As long as you have an internet connection,  you can access your account with Account Key.
  • Do Not Disturb: We’ve launched a dashboard where you can control the devices or apps that receive Account Key push notifications. For example, if you have an iPad on the kitchen table, you can turn off sign-in notifications on that device so your kids aren’t disturbed while doing their homework. Access this dashboard by tapping on the ‘key’ icon within the Yahoo app that you use or by going to login.yahoo.com/account/security.
  • More Convenient Access: From time to time, you may be asked for an Account Key code. If that happens, we will deliver the code to your Yahoo app via the same Account Key push notification.

The password is antiquated and needs to be put to rest. We’ve been investing in better and more simplified ways to sign in and can’t wait to show you what we have in store. Stay tuned for updates in our mission to kill passwords!

CaffeOnSpark Open Sourced for Distributed Deep Learning on Big Data Clusters


By Andy Feng(@afeng76), Jun Shi and Mridul Jain (@mridul_jain), Yahoo Big ML Team


Introduction

Deep learning (DL) is a critical capability required by Yahoo product teams (e.g., Flickr, Image Search) to gain intelligence from massive amounts of online data. Many existing DL frameworks require a separate cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline (see Figure 1). Separate clusters require large datasets to be transferred between them, which introduces unwanted system complexity and latency for end-to-end learning.


Figure 1: ML Pipeline with multiple programs on separated clusters


As discussed in our earlier Tumblr post, we believe that deep learning should be conducted in the same cluster along with existing data processing pipelines to support feature engineering and traditional (non-deep) machine learning. We created CaffeOnSpark to allow deep learning training and testing to be embedded into Spark applications (see Figure 2). 


Figure 2: ML Pipeline with single program on one cluster


CaffeOnSpark: API, Configuration, and CLI

CaffeOnSpark is designed to be a Spark deep learning package. Spark MLlib supports a variety of non-deep-learning algorithms for classification, regression, clustering, recommendation, and so on. Deep learning is a key capability that Spark MLlib currently lacks, and CaffeOnSpark is designed to fill that gap. The CaffeOnSpark API supports DataFrames, so that you can easily interface with a training dataset that was prepared using a Spark application, and extract the predictions from the model, or features from intermediate layers, for results and data analysis using MLlib or SQL.


Figure 3: CaffeOnSpark as a Spark Deep Learning package


1:  def main(args: Array[String]): Unit = {
2:    val ctx = new SparkContext(new SparkConf())
3:    val cos = new CaffeOnSpark(ctx)
4:    val conf = new Config(ctx, args).init()
5:    val dl_train_source = DataSource.getSource(conf, true)
6:    cos.train(dl_train_source)
7:    val lr_raw_source = DataSource.getSource(conf, false)
8:    val extracted_df = cos.features(lr_raw_source)
9:    val lr_input_df = extracted_df.withColumn("Label", cos.floatarray2doubleUDF(extracted_df(conf.label)))
10:     .withColumn("Feature", cos.floatarray2doublevectorUDF(extracted_df(conf.features(0))))
11:   val lr = new LogisticRegression().setLabelCol("Label").setFeaturesCol("Feature")
12:   val lr_model = lr.fit(lr_input_df)
13:   lr_model.write.overwrite().save(conf.outputPath)
14: }

Figure 4: Scala application using both CaffeOnSpark and MLlib


Scala program in Figure 4 illustrates how CaffeOnSpark and MLlib work together:

  • L1-L4 … You initialize a Spark context, and use it to create CaffeOnSpark and configuration objects.
  • L5-L6 … You use CaffeOnSpark to conduct DNN training with a training dataset on HDFS.
  • L7-L8 … The learned DL model is applied to extract features from a feature dataset on HDFS.
  • L9-L12 … MLlib uses the extracted features to perform non-deep learning (more specifically, logistic regression for classification).
  • L13 … You could save the classification model onto HDFS.

As illustrated in Figure 4, CaffeOnSpark enables deep learning steps to be seamlessly embedded in Spark applications. It eliminates unwanted data movement in traditional solutions (as illustrated in Figure 1), and enables deep learning to be conducted on big-data clusters directly. Direct access to big-data and massive computation power are critical for DL to find meaningful insights in a timely manner.

CaffeOnSpark uses the same configuration files for solvers and neural networks as standard Caffe does. As illustrated in our example, the neural network will have a MemoryData layer with 2 extra parameters:

  1. source_class specifying a data source class
  2. source specifying dataset location.

The initial CaffeOnSpark release has several built-in data source classes (including com.yahoo.ml.caffe.LMDB for LMDB databases and com.yahoo.ml.caffe.SeqImageDataSource for Hadoop sequence files). Users could easily introduce customized data source classes to interact with the existing data formats.

CaffeOnSpark applications are launched by standard Spark commands, such as spark-submit. Here are 2 examples of spark-submit commands. The first command uses CaffeOnSpark to train a DNN model and save it onto HDFS. The second command runs a custom Spark application that embeds CaffeOnSpark along with MLlib.

First command:

spark-submit \
   --files caffenet_train_solver.prototxt,caffenet_train_net.prototxt \
   --num-executors 2 \
   --class com.yahoo.ml.caffe.CaffeOnSpark \
      caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
      -train -persistent \
      -conf caffenet_train_solver.prototxt \
      -model hdfs:///sample_images.model \
      -devices 2


Second command:

spark-submit \
   --files caffenet_train_solver.prototxt,caffenet_train_net.prototxt \
   --num-executors 2 \
   --class com.yahoo.ml.caffe.examples.MyMLPipeline \
      caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
      -features fc8 \
      -label label \
      -conf caffenet_train_solver.prototxt \
      -model hdfs:///sample_images.model \
      -output hdfs:///image_classifier_model \
      -devices 2


System Architecture


Figure 5: System Architecture


Figure 5 describes the system architecture of CaffeOnSpark. We launch Caffe engines on GPU or CPU devices within the Spark executor by invoking a JNI layer with fine-grained memory management. Unlike traditional Spark applications, CaffeOnSpark executors communicate with each other via an MPI allreduce-style interface over TCP/Ethernet or RDMA/InfiniBand. This Spark+MPI architecture enables CaffeOnSpark to achieve performance similar to dedicated deep learning clusters.

Many deep learning jobs are long running, and it is important to handle potential system failures. CaffeOnSpark enables the training state to be snapshotted periodically, so we can resume from the previous state after a failure of a CaffeOnSpark job.


Open Source

In the last several quarters, Yahoo has applied CaffeOnSpark on several projects, and we have received much positive feedback from our internal users. Flickr teams, for example, made significant improvements on image recognition accuracy with CaffeOnSpark by training with millions of photos from the Yahoo Webscope Flickr Creative Commons 100M dataset on Hadoop clusters.

CaffeOnSpark is beneficial to both the deep learning and Spark communities. In order to advance the fields of deep learning and artificial intelligence, Yahoo is happy to release CaffeOnSpark at github.com/yahoo/CaffeOnSpark under the Apache 2.0 license.

CaffeOnSpark can be tested on an AWS EC2 cloud or on your own Spark clusters. Please find the detailed instructions at the Yahoo GitHub repository, and share your feedback at [email protected]. Our goal is to make CaffeOnSpark widely available to deep learning scientists and researchers, and we welcome contributions from the community to make that happen.


Yahoo Hosts The Streaming Video Alliance’s Quarterly Member Meeting


Yesterday, Yahoo hosted the Streaming Video Alliance’s quarterly member meeting, where over 70 executives from across the streaming video landscape convened to advance discussions on a broad range of streaming video topics and reach agreements on best practices, policy, and proposed standards. Ron Jacoby, VP of Engineering, Yahoo Video & TV Applications, and P.P.S. Narayan, VP of Engineering, Yahoo Video, were the morning’s featured keynotes.



Ron kicked off by discussing the challenges and complexities behind building a strong streaming video experience, the root of which stems from the rapidly changing consumption patterns of today’s audiences.

Millennials, who now represent over 30% of the US population, consume 283% more media via the internet than non-millennial age groups – a vast change from how their parents watched TV. This reflects a dramatic shift in how TV is being consumed, one that is accelerating in key demographics.

Additionally, the 18-24 year-old demographic saw a 37% decline in traditional TV viewing. Ron attributed the shift to the pervasiveness of online video content across premium video services and social media, on portable media devices including laptops, tablets, and smartphones.

“In order for the industry to succeed in the face of these trends, it needs to look at content and delivery differently,” said Jacoby. “Investments in live streaming and innovation in video protocols and delivery are necessary.”


P.P.S. Narayan followed up with a presentation about the technical opportunities and challenges in video streaming. Echoing Ron’s statements, PPSN said “folks are moving away from TV… and watching video across different social and OTT platforms. Gone are the days of sitting in the same room with everyone watching the same show.”

He added, “The shift in consumer behavior and consumption patterns is leading to the disaggregation of content – providers are taking content from TV and cable, and making it accessible on multiple platforms, such as phones, tablets, and connected devices. Services like Hulu, HBOGo, and MLB.TV have invested heavily in this, which is a clear indication that they, and the rest of the industry, are serious about embracing this consumer shift.”

This move is indicative of bigger technology shifts, which raises the question: “Can the quality of the video be as good as what we see on TV?” Almost. PPSN explained that when Yahoo hosted the first-ever NFL live stream, the technological considerations he and his team had to account for included resolution, bandwidth, encoding, ads, and latency.

He also talked about next-generation immersive experiences, including time-based immersion, made possible by cloud DVRs and live scrubbing; space-based immersion, with VR and 360-degree videos; and people-based immersion, evidenced by the sharing of content on social media. Additionally, he covered how the disaggregation of content without a “TV Guide” is leading to gaps in content discovery and personalization. The Yahoo Video Guide is one example of addressing the growing need for users to discover and consume relevant and contextual content.

PPSN concluded by expressing the importance of groups like the SVA, as they are critical to the industry working together and moving the ball forward in streaming video.

Introducing 1080p Video Experience on Yahoo

By P.P.S. Narayan, VP of Engineering

Today, we are excited to announce the debut of a Full HD experience for Yahoo’s video programming. Going forward, we will be delivering both live and video-on-demand content at 1080p – a high-definition standard characterized by a resolution of 1920 pixels wide by 1080 pixels high, with progressive scan – to our users, based on device, platform, and network capabilities. In addition, users will also be able to enjoy the content at up to 60 frames per second (fps).

Back in October 2015, we delivered the first global live stream of a regular-season NFL game at a maximum of 720p/60fps. And this past week, we streamed the AT&T Pebble Beach Pro-Am from PGA TOUR LIVE on Yahoo at formats up to 1080p/60fps, with TV-like quality. Fans were able to enjoy panoramic views of Pebble Beach while watching players hit shots off the tight fairways, follow balls as they sailed across the horizon, and see the professional and celebrity golfers nail putts – all thanks to this high-resolution and smooth-motion video.

We’re also excited to announce that on April 30, you’ll be able to experience this enhanced video quality as Yahoo Finance hosts the first-ever, global livestream of the Berkshire Hathaway annual shareholders meeting, from Omaha, NE. The “Woodstock of Capitalism” will be broadcast across all devices - with exclusive video-on-demand (VOD) of the event available on Yahoo Finance for 30 days following the meeting.

We have also upgraded our studios to support 1080p/60fps. Starting with Yahoo’s cameras, the whole video pipeline, including video signal acquisition, video mixing, encoding, transcoding, and all the packetization, will be 1080p capable, when we produce the content.

These days, while most of our television displays support Full HD, most HDTV broadcasts are delivered at 1080i/60 fields-per-second, a lower quality than 1080p/60 frames-per-second. And over-the-top (OTT) streaming experiences do not widely support top-quality resolution with purity in signal from start to finish. We are thrilled at Yahoo to be able to deliver this top-notch quality.

We recently delivered our Yahoo Sports’ analysts preview of Super Bowl 50 from our new HD-supported studio. The lineup included: Shaun King, former NFL Quarterback and current Yahoo Sports’ NFL and College Football analyst, Frank Schwab, Editor of Yahoo Sports’ Shutdown Corner blog, and Tank Williams, former NFL safety and Yahoo Sports NFL/Fantasy contributor.

This is an exciting milestone for our team and for Yahoo. We look forward to delivering the best possible viewing experience to users as we continue to iterate on and improve the quality of our stream. Sit back, relax and enjoy the next round.

Hadoop Turns 10




It is hard to believe that 10 years have already passed since Hadoop was started at Yahoo. We initially applied it to web search, but since then, Hadoop has become central to everything we do at the company. Today, Hadoop is the de facto platform for processing and storing big data for thousands of companies around the world, including most of the Fortune 500. It has also given birth to a thriving industry around it, comprised of a number of companies who have built their businesses on the platform and continue to invest and innovate to expand its capabilities.

At Yahoo, Hadoop remains a cornerstone technology on which virtually every part of our business relies to power our world-class products, and deliver user experiences that delight more than a billion users worldwide. Whether it is content personalization for increasing engagement, ad targeting and optimization for serving the right ad to the right consumer, new revenue streams from native ads and mobile search monetization, data processing pipelines, mail anti-spam, or search assist and analytics – Hadoop touches them all.

When it comes to scale, Yahoo still boasts one of the largest Hadoop deployments in the world. From a footprint standpoint, we maintain over 35,000 Hadoop servers as a central hosted platform running across 16 clusters with a combined 600 petabytes in storage capacity (HDFS), allowing us to execute 34 million monthly compute jobs on the platform.

But we aren’t stopping there, and actively collaborate with the Hadoop community to further push the scalability boundaries and advance technological innovation. We have used MapReduce historically to power batch-oriented processing, but continue to invest in and adopt low latency data processing stacks on top of Hadoop, such as Storm for stream processing, and Tez and Spark for faster batch processing.

What’s more, the applications of these innovations have run the gamut – from cool and fun features, like Flickr’s Magic View, to one of our most exciting recent projects, which involves combining Apache Spark and Caffe. The project allows us to leverage GPUs to power deep learning on Hadoop clusters. This custom deployment bridges the gap between HPC (High Performance Computing) and big data, and is helping position Yahoo as a frontrunner in the next generation of computing and machine learning.

We’re delighted by the impact the platform has made on the big data movement, and can’t wait to see what the next 10 years have in store.

Cheers!


Yahoo Champaign - Scoble-ized

Tech evangelist Robert Scoble got a rare glimpse into Yahoo’s research center in Champaign, IL, where he spoke with engineers about the innovations in big data that the teams were working on.

Cathy Singer, senior director of engineering, served as Robert’s gracious tour guide.

Tour of Yahoo research in Illinois. Internet-based computer science. Cathy Singer, senior director of engineering, showing me around.

Posted by Robert Scoble on Friday, January 22, 2016