YDN


Dash Open Podcast: Episode 02 - Building Community and Mentorship around Hackdays

By Ashley Wolf, Open Source Program Manager, Verizon Media

The second installment of Dash Open is ready for you to tune in!

In this episode, Gil Yehuda, Sr. Director of Open Source at Verizon Media, interviews Dav Glass, Distinguished Architect of IaaS and Node.js at Verizon Media. Dav discusses how open source inspired him to start HackSI, a Hack Day for all ages, as well as robotics mentorship programs for the Southern Illinois engineering community.

Listen now on iTunes or SoundCloud.

Dash Open is your place for interesting conversations about open source and other technologies, from the open source program office at Verizon Media. Verizon Media is the home of many leading brands including Yahoo, Aol, Tumblr, TechCrunch, and many more.

Follow us on Twitter @YDN and on LinkedIn.

Announcing OpenTSDB 2.4.0: Rollup and Pre-Aggregation Storage, Histograms, Sketches, and More

By Chris Larsen, Architect


OpenTSDB is one of the first dedicated open source time series databases built on top of Apache HBase and the Hadoop Distributed File System. Today, we are proud to share that version 2.4.0 is now available and has many new features developed in-house and with contributions from the open source community. This release would not have been possible without support from our monitoring team, the Hadoop and HBase developers, as well as contributors from other companies like Salesforce, Alibaba, JD.com, Arista and more. Thank you to everyone who contributed to this release!

A few of the exciting new features include:

Rollup and Pre-Aggregation Storage

As time series data grows, storing the original measurements becomes expensive. Particularly in monitoring workflows, users rarely care about last year’s high-fidelity data. It’s more efficient to store lower-resolution “rollups” for longer periods, discarding the original high-resolution data. OpenTSDB now supports storing and querying such data so that the raw data can expire from HBase or Bigtable while the rollups stick around longer. Queries over long time ranges will read from the lower-resolution data, fetching fewer data points and speeding up queries.

Likewise, when a user wants to query tens of thousands of time series grouped by, for example, data centers, the TSD will have to fetch and process a significant amount of data, making queries painfully slow. To improve query speed, pre-aggregated data can be stored and queried to fetch much less data at query time, while still retaining the raw data. We have an Apache Storm pipeline that computes these rollups and pre-aggregates, and we intend to open source that code in 2019. For more details, please visit http://opentsdb.net/docs/build/html/user_guide/rollups.html.
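
To make the idea concrete, here is a minimal TypeScript sketch of the downsampling step. This is illustrative only: it is not the Storm pipeline mentioned above, and the type and field names are ours. Storing a sum and count per interval (rather than a single average) lets rollups from different sources be merged later without losing information.

  // Illustrative sketch of rollups only, not OpenTSDB's actual pipeline.
  interface DataPoint {
    timestamp: number; // epoch seconds
    value: number;
  }

  interface Rollup {
    intervalStart: number; // epoch seconds, aligned to the interval boundary
    sum: number;
    count: number;
    min: number;
    max: number;
  }

  // Collapse raw points into fixed-interval rollups (e.g., 3600 for hourly buckets).
  function rollup(points: DataPoint[], intervalSeconds: number): Rollup[] {
    const buckets = new Map<number, Rollup>();
    for (const p of points) {
      const start = p.timestamp - (p.timestamp % intervalSeconds);
      const bucket = buckets.get(start);
      if (bucket === undefined) {
        buckets.set(start, { intervalStart: start, sum: p.value, count: 1, min: p.value, max: p.value });
      } else {
        bucket.sum += p.value;
        bucket.count += 1;
        bucket.min = Math.min(bucket.min, p.value);
        bucket.max = Math.max(bucket.max, p.value);
      }
    }
    return Array.from(buckets.values()).sort((a, b) => a.intervalStart - b.intervalStart);
  }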

Histograms and Sketches

When monitoring or performing data analysis, users often want to explore percentiles of their measurements, such as the 99.9th percentile of website request latency, to detect issues and understand what consumers are experiencing. Popular metrics collection libraries will happily report percentiles for the data they collect. Yet while querying the original percentile data for a single time series is useful, combining percentiles from multiple series is mathematically incorrect and leads to errant observations. For example, if you want the 99.9th percentile of latency across a region, you can’t simply sum the per-host 99.9th percentiles, or take the 99.9th percentile of those percentiles.

To solve this issue, we need a data structure that can be merged across sources and still yield an accurate percentile. One such structure that has existed for a long time is the bucketed histogram, where measurements are sliced into value ranges and each range maintains a count of the measurements that fall into it. These buckets can be sized based on the required accuracy, and counts from multiple sources (sharing the same bucket ranges) can be combined to compute an accurate percentile.
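
To see why this works, the sketch below (in TypeScript, with made-up bucket boundaries) merges histograms by element-wise addition of counts and then reads off an estimated percentile:

  // Illustrative sketch: bucket upper bounds are made up; real systems pick
  // them based on the accuracy required. All sources must share the same bounds.
  const BUCKET_UPPER_BOUNDS_MS = [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, Infinity];

  type Histogram = number[]; // one count per bucket, same length as the bounds

  // Merging is element-wise addition of counts -- this is why bucketed
  // histograms combine correctly across hosts where raw percentiles do not.
  function merge(histograms: Histogram[]): Histogram {
    const out = new Array<number>(BUCKET_UPPER_BOUNDS_MS.length).fill(0);
    for (const h of histograms) {
      h.forEach((count, i) => { out[i] += count; });
    }
    return out;
  }

  // Walk the counts until the requested quantile (0.999 for p99.9) is covered,
  // then report that bucket's upper bound as the estimate.
  function percentile(h: Histogram, q: number): number {
    const total = h.reduce((a, b) => a + b, 0);
    const target = total * q;
    let seen = 0;
    for (let i = 0; i < h.length; i++) {
      seen += h[i];
      if (seen >= target) return BUCKET_UPPER_BOUNDS_MS[i];
    }
    return BUCKET_UPPER_BOUNDS_MS[BUCKET_UPPER_BOUNDS_MS.length - 1];
  }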

Bucketed histograms can be expensive to store for highly accurate data, as many buckets and counts are required. Additionally, many measurements don’t have to be perfectly accurate, as long as the error is bounded and well defined. Thus another class of algorithms can approximate the data via sampling, trading a small, quantifiable error for a fixed storage cost. Data scientists at Yahoo (now part of Oath) implemented a great Java library called DataSketches that implements stochastic streaming algorithms to reduce the amount of data stored for high-throughput services. Sketches have been a huge help for the OLAP storage system Druid (also sponsored by Oath) and Bullet, Oath’s open source real-time data query engine.

The latest OpenTSDB version supports bucketed histograms, DataSketches, and T-Digests.

Some additional features include:

  • HBase Date Tiered Compaction support to improve storage efficiency.
  • A new authentication plugin interface to support enterprise use cases.
  • An interface to support fetching data directly from Bigtable or HBase rows using a search index such as Elasticsearch. This improves queries for small subsets of high-cardinality data, and we’re working on open sourcing our code for the Elasticsearch schema.
  • Greater UID cache controls and an optional LRU implementation to reduce the amount of JVM heap allocated to UID to string mappings.
  • Configurable query size and time limits to avoid OOMing a JVM with large queries.

Try the releases on GitHub and let us know of any issues you run into by posting on GitHub issues or the OpenTSDB Forum. Your feedback is appreciated!

OpenTSDB 3.0

Additionally, we’ve started on 3.0, which is a rewrite that will support a slew of new features including:

  • Querying and analyzing data from the plethora of new time series stores.
  • A fully configurable query graph that allows for complex queries OpenTSDB 1.x and 2.x couldn’t support.
  • Streaming results to improve the user experience and avoid overwhelming a single query node.
  • Advanced analytics including support for time series forecasting with Yahoo’s EGADS library.

Please join us in testing out the current 3.0 code, reporting bugs, and adding features.

Vespa Product Updates, December 2018: ONNX Import and Map Attribute Grouping


Today we’re kicking off a blog post series of need-to-know updates on Vespa, summarizing the features and fixes detailed in GitHub issues.

We welcome your contributions and feedback about any new features or improvements you’d like to see.

For December, we’re excited to share the following product news:

Streaming Search Performance Improvement

Streaming Search is a solution for applications where each query only searches a small, statically determined subset of the corpus. In this case, Vespa searches without building inverted indexes, reducing storage cost and making writes more efficient. With the latest changes, the document type is used to further limit data scanning, resulting in lower latencies and higher throughput. Read more here.


ONNX Integration

ONNX is an open ecosystem for interchangeable AI models. Vespa now supports importing models in the ONNX format and transforming them into tensors for use in ranking. This adds to the TensorFlow import included earlier this year and allows Vespa to support many training tools. While Vespa’s strength is real-time model evaluation over large datasets, you can get started with single data points using the stateless model evaluation API. Explore this integration more in Ranking with ONNX models.


Precise Transaction Log Pruning

Vespa is built for large applications running continuous integration and deployment. This means nodes restart often for software upgrades, and node restart time matters. A common pattern is to keep serving while restarting hosts one by one. Vespa has optimized transaction log pruning with prepareRestart, which flushes as much as possible to disk before stopping; this is quicker than replaying the same data from the transaction log after restarting. This feature is on by default. Learn more in live upgrade and prepareRestart.


Grouping on Maps

Grouping is used to implement faceting. Vespa has added support for grouping on map attribute fields, creating a group for values whose keys match the specified key, or for field values referenced by the key. This support is useful to create indirections and relations in data and is great for use cases with structured data like e-commerce. Leverage key values instead of field names to simplify the search definition. Read more in Grouping on Map Attributes.
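
For illustration, a grouping request on a map attribute might look roughly like the sketch below; the attribute name, key, and document type are hypothetical, so consult the linked documentation for the authoritative syntax.

  // Hypothetical example: count documents grouped by the value stored under
  // the "browser" key of a map attribute named my_map. All names are made up.
  const yql =
    'select * from sources * where sddocname contains "event" | ' +
    'all(group(my_map{"browser"}) each(output(count())))';

  const url = 'http://localhost:8080/search/?yql=' + encodeURIComponent(yql);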


Questions or suggestions? Send us a tweet or an email.

Join us at the Machine Learning Meetup hosted by Zillow in Seattle on November 29th

By Kristian Aune, Tech Product Manager, Oath

If you are in Seattle on Thursday, November 29, please join Jon Bratseth (Distinguished Architect, Oath) at a machine learning meetup hosted by Zillow. Jon will share a Vespa overview and answer any questions about it. Largely developed by Yahoo engineers, Vespa is a big data processing and serving engine, available as open source on GitHub. It’s used by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms.

Eric Ringger, Director of Machine Learning for Personalization at Zillow, will discuss some of the models used to help users find homes, including collaborative filtering, a content-based model, and deep learning.

Learn more and RSVP here. If you can’t join, please check out these slides from a recent Vespa presentation to learn more about this technology.


The Vespa Team

Oath’s VP of AI invites you to learn how to build a Terabyte Scale Machine Learning Application at TDA Conference

By Ganesh Harinath, VP Engineering, AI Platform & Applications, Oath

If you’re attending the upcoming Telco Data Analytics and AI Conference in San Francisco, make sure to join my keynote talk. I’ll be presenting “Building a Terabyte Scale Machine Learning Application” on November 28th at 10:10 am PST. You’ll learn about how Oath builds AI platforms at scale.

My presentation will focus on our approach and experience at Oath in architecting and using frameworks to build machine learning models at terabyte scale in near real-time. I’ll also highlight Trapezium, an open source framework based on Spark, developed by Oath’s Big Data and Artificial Intelligence (BDAI) team.

I hope to catch you at the conference. If you would like to connect, reach out to me. If you’re unable to attend the conference and are curious about the topics shared in my presentation, follow @YDN on Twitter and we’ll share highlights during and after the event. 

Introducing the Dash Open Podcast, sponsored by Yahoo Developer Network

By Ashley Wolf, Principal Technical Program Manager, Oath

Is open source the wave of the future, or has it seen its best days already? Which Big Data and AI trends should you be aware of and why? What is 5G and how will it impact the apps you enjoy using? You’ve got questions and we know smart people; together we’ll get answers. Introducing the Dash Open podcast, sponsored by the Yahoo Developer Network and produced by the Open Source team at Oath.

Dash Open will share interesting conversations about tech and the people who spend their day working in tech. We’ll look at the state of technology through the lens of open source, keeping you up-to-date on the trends we’re seeing across the internet. Why Dash Open? Because it’s like a command line argument reminding the command to be open. What can you expect from Dash Open? Interviews with interesting people, occasional witty banter, and a catchy theme song.

In the first episode, Rosalie Bartlett, Open Source community manager at Oath, interviews Gil Yehuda, Senior Director of Open Source at Oath. Tune in to hear one skeptic’s journey from resisting the open source movement to heading one of the more prolific Open Source Program Offices (OSPOs). Gil highlights the benefits of open source to companies and provides actionable advice on how technology companies can start or improve their OSPO.

Give Dash Open a listen and tell us what topics you’d like to hear next.

Ashley Wolf manages the Open Source Program at Oath/Verizon Media Group.  

How to Make Your Web App More Reliable and Performant Using webpack: a Yahoo Mail Case Study


By Murali Krishna Bachhu, Anurag Damle, and Vishal Patel

As engineers on the Yahoo Mail team at Oath, we pride ourselves on the things that matter most to developers: faster development cycles, more reliability, and better performance. Users don’t necessarily see these elements, but they certainly feel the difference when significant improvements are made. Recently, we were able to upgrade all three of these areas at scale by adopting webpack as Yahoo Mail’s underlying module bundler, and you can do the same for your web application.

What is webpack?

webpack is an open source module bundler for modern JavaScript applications. When webpack processes your application, it recursively builds a dependency graph that includes every module your application needs. Then it packages all of those modules into a small number of bundles, often only one, to be loaded by the browser.
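
As a rough sketch, a minimal configuration only needs an entry module, where webpack starts building the dependency graph, and an output location for the resulting bundle (the paths below are placeholders, not Yahoo Mail’s configuration):

  // webpack.config.js -- minimal sketch; entry and output paths are placeholders.
  const path = require('path');

  module.exports = {
    // webpack starts here and recursively follows every import/require
    // to build the dependency graph described above.
    entry: './src/index.js',
    output: {
      path: path.resolve(__dirname, 'dist'),
      filename: 'bundle.js',
    },
  };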

webpack became our module bundler of choice not only because it supports on-demand loading and multiple bundle generation with relatively low runtime overhead, but also because it is better suited to web platforms and Node.js apps and has great community support.

[Image: Comparison of webpack to other open source bundlers]

How did we integrate webpack?

Like any developer integrating a new module bundler, we started integrating webpack into Yahoo Mail by looking at its basic config file. We explored available default webpack plugins as well as third-party webpack plugins and then picked the plugins most suitable for our application. If we didn’t find a plugin that suited a specific need, we wrote the webpack plugin ourselves (e.g., we wrote a plugin to execute Atomic CSS scripts in the latest Yahoo Mail experience in order to decrease our overall CSS payload**).

During the development process for Yahoo Mail, we needed a way to make sure webpack would continuously run in the background. To make this happen, we decided to use the task runner Grunt. Not only does Grunt keep the connection to webpack alive, but it also gives us the ability to pass different parameters to the webpack config file based on the given environment. Some examples of these parameters are source map options, enabling Hot Module Replacement (HMR), and uglification.
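
A Gruntfile wiring this up might look roughly like the sketch below, assuming the grunt-webpack plugin; the target names and option values are illustrative rather than our production setup.

  // Gruntfile.js -- sketch assuming the grunt-webpack plugin; target names
  // and option values are illustrative.
  const webpackConfig = require('./webpack.config');

  module.exports = function (grunt) {
    grunt.loadNpmTasks('grunt-webpack');

    grunt.initConfig({
      webpack: {
        // One-off build for production.
        prod: webpackConfig,
        // Development: keep webpack alive and rebuild on every change, with
        // cheap source maps for faster incremental builds.
        dev: Object.assign({}, webpackConfig, {
          watch: true,
          devtool: 'cheap-module-source-map',
        }),
      },
    });

    grunt.registerTask('build', ['webpack:prod']);
    grunt.registerTask('serve', ['webpack:dev']);
  };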

Before deploying to production, we wanted to optimize the JavaScript bundles for size to make the Yahoo Mail experience faster. webpack provides good default support for this with the UglifyJS plugin. The default options are conservative, but the plugin gives us the ability to configure them; once we modified the options to our specifications, we saved approximately 10KB.

[Image: Code snippet showing the configuration options for the UglifyJS plugin]
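
Since the original snippet was shared as an image, here is an illustrative sketch of what such tuning can look like with webpack’s built-in UglifyJS plugin; the specific flags are examples, not the exact settings we shipped.

  // Sketch: tightening UglifyJS options via webpack's built-in plugin.
  // The flags below are examples, not Yahoo Mail's exact settings.
  const webpack = require('webpack');

  module.exports = {
    // ...entry/output as before...
    plugins: [
      new webpack.optimize.UglifyJsPlugin({
        compress: {
          drop_console: true, // strip console.* calls from production bundles
          dead_code: true,    // remove provably unreachable code
        },
        mangle: true,         // shorten local variable and function names
        output: {
          comments: false,    // drop comments from the emitted bundle
        },
      }),
    ],
  };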

Faster development cycles for developers

While developing a new feature, engineers ideally want to see their code changes reflected in their web app instantaneously. This allows them to maintain their train of thought and ultimately results in more productivity. Before we implemented webpack, it took around 30 seconds to 1 minute for changes to show up in our Yahoo Mail development environment. webpack helped us reduce the wait time to 5 seconds.

More reliability

Consumers love a reliable product, where all the features work seamlessly every time. Before we began using webpack, we were generating JavaScript bundles on demand during run-time, which meant the product was more prone to exceptions or failures while fetching the JavaScript bundles. With webpack, we now generate all the bundles during build time, which means that all the bundles are available whenever consumers access Yahoo Mail. This results in significantly fewer exceptions and failures and a better experience overall.

Better performance

We were able to attain a significant reduction of payload after adopting webpack.

  1. Reduction of about 75 KB in gzipped JavaScript payload
  2. 50% reduction in server-side render time
  3. 10% improvement in Yahoo Mail’s launch performance metrics, as measured by render time above the fold (e.g., time to load the contents of an email)

Below are some charts that demonstrate the payload size of Yahoo Mail before and after implementing webpack.

[Image: Payload before using webpack (JavaScript size = 741.41KB)]

[Image: Payload after switching to webpack (JavaScript size = 669.08KB)]


Conclusion

Shifting to webpack has resulted in significant improvements. We saw a common build process go from 30 seconds to 5 seconds, large reductions in JavaScript bundle size, and server-side render time cut in half. In addition to these benefits, our engineers have found the community support for webpack impressive. webpack has made the development of Yahoo Mail more efficient and enhanced the product for users. We believe you can use it to achieve similar results for your web application as well.

**Optimized CSS generation with Atomizer

Before we implemented webpack into the development of Yahoo Mail, we looked into how we could decrease our CSS payload. To achieve this, we developed an in-house solution for writing modular and scoped CSS in React. Our solution is similar to the Atomizer library, and our CSS is written in JavaScript like the example below:

[Image: Sample snippet of CSS written with Atomizer]
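
The original snippet was shared as an image; the sketch below, with made-up component and property names, gives a feel for the shape of such a styles.js file.

  // styles.js for a hypothetical component; names and values are made up.
  module.exports = {
    container: {
      display: 'flex',
      'padding-top': '8px',
    },
    subject: {
      'font-weight': 'bold',
      color: '#333',
    },
  };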

Every React component creates its own styles.js file with required style definitions. React-Atomic-CSS converts these files into unique class definitions. Our total CSS payload after implementing our solution equaled all the unique style definitions in our code, or only 83KB (21KB gzipped).

During our migration to webpack, we created a custom plugin and loader to parse these files and extract the unique style definitions from all of our CSS files. Since this process is tied to bundling, only CSS files that are part of the dependency chain are included in the final CSS.

Open Sourcing Vespa, Yahoo’s Big Data Processing and Serving Engine




By Jon Bratseth, Distinguished Architect, Vespa

Ever since we open sourced Hadoop in 2006, Yahoo – and now, Oath – has been committed to opening up its big data infrastructure to the larger developer community. Today, we are taking another major step in this direction by making Vespa, Yahoo’s big data processing and serving engine, available as open source on GitHub.

[Image: Vespa architecture overview]

Building applications increasingly means dealing with huge amounts of data. While developers can use the Hadoop stack to store and batch process big data, and Storm to stream-process data, these technologies do not help with serving results to end users. Serving is challenging at large scale, especially when it is necessary to make computations quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.

By releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, in real time and at internet scale – capabilities that, until now, have been within reach of only a few large companies.

Serving often involves more than looking up items by ID or computing a few numbers from a model. Many applications need to compute over large datasets at serving time. Two well-known examples are search and recommendation. To deliver a search result or a list of recommended articles to a user, you need to find all the items matching the query, determine how good each item is for the particular request using a relevance/recommendation model, organize the matches to remove duplicates, add navigation aids, and then return a response to the user. As these computations depend on features of the request, such as the user’s query or interests, it won’t do to compute the result upfront. It must be done at serving time, and since a user is waiting, it has to be done fast. Combining speedy completion of the aforementioned operations with the ability to perform them over large amounts of data requires a lot of infrastructure – distributed algorithms, data distribution and management, efficient data structures and memory management, and more. This is what Vespa provides in a neatly-packaged and easy to use engine.

With over 1 billion users, we currently use Vespa across many different Oath brands – including Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr, and others – to process and serve billions of daily requests over billions of documents while responding to search queries, making recommendations, and providing personalized content and advertisements, to name just a few use cases. In fact, Vespa processes and serves content and ads almost 90,000 times every second with latencies in the tens of milliseconds. On Flickr alone, Vespa performs keyword and image searches on the scale of a few hundred queries per second on tens of billions of images. Additionally, Vespa makes direct contributions to our company’s revenue stream by serving over 3 billion native ad requests per day via Yahoo Gemini, at a peak of 140k requests per second (per Oath internal data).

With Vespa, our teams build applications that:

  • Select content items using SQL-like queries and text search (see the query sketch after this list)
  • Organize all matches to generate data-driven pages
  • Rank matches by handwritten or machine-learned relevance models
  • Serve results with response times in the low milliseconds
  • Write data in real-time, thousands of times per second per node
  • Grow, shrink, and re-configure clusters while serving and writing data

To achieve both speed and scale, Vespa distributes data and computation over many machines without any single master as a bottleneck. Where conventional applications work by pulling data into a stateless tier for processing, Vespa instead pushes computations to the data. This involves managing clusters of nodes with background redistribution of data in case of machine failures or the addition of new capacity, implementing distributed low latency query and processing algorithms, handling distributed data consistency, and a lot more. It’s a ton of hard work!

As the team behind Vespa, we have been working on developing search and serving capabilities ever since building alltheweb.com, which was later acquired by Yahoo. Over the last couple of years we have rewritten most of the engine from scratch to incorporate our experience into a modern technology stack. Vespa is larger in scope and lines of code than any open source project we’ve ever released. Now that it has been battle-proven on Yahoo’s largest and most critical systems, we are pleased to release it to the world.

Vespa gives application developers the ability to feed data and models of any size to the serving system and make the final computations at request time. This often produces a better user experience at lower cost (for buying and running hardware) and complexity compared to pre-computing answers to requests. Furthermore, it allows developers to work in a more interactive way, navigating and interacting with complex calculations in real time rather than having to start offline jobs and check the results later.

Vespa can be run on premises or in the cloud. We provide both Docker images and rpm packages for Vespa, as well as guides for running them both on your own laptop or as an AWS cluster.

We’ll follow up this initial announcement with a series of posts on our blog showing how to build a real-world application with Vespa, but you can get started right now by following the getting started guide in our comprehensive documentation.

Managing distributed systems is not easy. We have worked hard to make it easy to develop and operate applications on Vespa so that you can focus on creating features that make use of the ability to compute over large datasets in real time, rather than the details of managing clusters and data. You should be able to get an application up and running in less than ten minutes by following the documentation.

We can’t wait to see what you’ll build with it!

Change to Yahoo Authentication Policy

By: Lovlesh Chhabra, Product Manager

At Yahoo, our users’ security is paramount, and we continue to update our policies and practices to keep our users’ accounts and data secure. While developers and partners using Yahoo APIs are currently able to use basic authentication protocols and/or ‘plain text’ usernames and passwords to authenticate their users, beginning May 30, 2015, all third-party applications will need to move to OAuth-based authentication. The good news is that Yahoo APIs already support OAuth-based authentication.

We believe that OAuth provides the most secure authentication method with the best user experience. OAuth puts the user’s security in the hands of Yahoo’s systems, allows Yahoo to disclose to the user what the requesting application is attempting to do, and allows Yahoo to issue credentials that are designed to work with the authorization provided by the user.

Starting May 30, 2015, Yahoo will require the use of OAuth 2.0-based authentication from apps that are designed for Yahoo users. We want to promote a consistent user experience for users signing in to their Yahoo accounts regardless of the apps that they choose to use. This also reduces the amount of work developers have to do to keep up with the constant improvements being made to Yahoo’s identity platform. Today, developers need to write specific code to handle error codes sent by Yahoo and respond to the user accordingly. In the future, once integrated with webviews, Yahoo will manage all the error cases within the webview and hand off tokens at the end of a successful authentication.
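
For developers new to OAuth 2.0, the sketch below shows the shape of the standard authorization-code flow. The endpoint URLs follow Yahoo’s OAuth 2.0 documentation, and the client credentials and redirect URI are placeholders; verify the details against the current documentation.

  // Sketch of the standard OAuth 2.0 authorization-code flow. Endpoint URLs
  // follow Yahoo's OAuth 2.0 docs; client ID, secret, and redirect URI are
  // placeholders.

  // Step 1: send the user to the authorization page to approve your app.
  const authorizeUrl =
    'https://api.login.yahoo.com/oauth2/request_auth' +
    '?client_id=YOUR_CLIENT_ID' +
    '&redirect_uri=' + encodeURIComponent('https://example.com/callback') +
    '&response_type=code';

  // Step 2: exchange the code returned to your redirect URI for tokens.
  async function exchangeCode(code: string): Promise<unknown> {
    const response = await fetch('https://api.login.yahoo.com/oauth2/get_token', {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams({
        client_id: 'YOUR_CLIENT_ID',
        client_secret: 'YOUR_CLIENT_SECRET',
        redirect_uri: 'https://example.com/callback',
        code,
        grant_type: 'authorization_code',
      }),
    });
    return response.json(); // contains access_token, refresh_token, expires_in
  }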

Yahoo APIs and standard Internet protocols work with OAuth. If your application currently submits plain-text usernames and passwords to authenticate users with Yahoo, you’ll need to move to OAuth to minimize any disruption in service. Yahoo’s teams are committed to making this transition smooth for you and will be regularly monitoring and responding to queries posted to the Yahoo Developer Network forum. Please feel free to use these forums to report any integration issues or ask any integration-related questions.

Introducing the Revamped Flurry Developer Documentation Site

By: Bisera Ferrero, Architect

Today, we’re excited to announce a new home for the Flurry from Yahoo developer documentation. Based on your feedback, we’ve completely redesigned and reorganized the Flurry documentation. The site includes updated code samples, FAQs, task-based technical content and a new, simple navigation structure, making it easier for you to use Flurry products.

The technical content has been extensively revised and updated with task-oriented developer guides, spotlights and much more. If you’re a Flurry developer building mobile apps for iOS or Android, you’ll find the site tailored to your integration needs and focused on enhancing the quality of your experience with Flurry Analytics, APIs or Flurry for Publishers.

Check out the new Flurry Documentation Site

We’d love to hear what you think, so please send us your feedback!  We’re also hosting the Yahoo Mobile Developer Conference on Feb. 19th, 2015 in San Francisco, where we’ll announce a new suite of tools that will help mobile developers better understand their users in order to improve, grow and monetize their apps. Apply for an invite today!
