Measuring Operational Quality of Recommendations — Quality Oriented Service Level Objectives
by Lina Weichbrodt (Zalando)
With the rise of machine learning in production, we need to talk about operational data science. The talk introduces a pragmatic method for measuring the response quality of a recommendation service. To that end, a definition of a successful response is introduced, and guidelines for capturing the rate of successful responses are presented.
There are several changes that can happen during the serving phase of a model which negatively affect the quality of the algorithmic response. A few examples are:
- The model is updated and the new version is inferior to the previous one.
- The latest deployment of the stack that processes the request and serves the model contains a bug.
- Changes in the infrastructure lead to performance loss. An example in an e-commerce setting is switching to a different microservice to obtain article metadata used for filtering the recommendations.
- The input data changes. Typical reasons are a client application that releases a bug (e.g., lowercasing a case-sensitive identifier) or changes a feature in a way that affects the data distribution, such as allowing all users to use the product cart instead of only logged-in users. If the change is not detected, training data and serving data diverge.
Current monitoring solutions mostly focus on whether a request completes without errors and on the request latency. That means the examples above would be hard to detect, despite the response quality being significantly degraded, sometimes permanently.
In addition to missing the changes above, it can be argued that current monitoring practices do not capture the performance of a recommender system, or of any other data-driven service, in a meaningful way. We might, for instance, have returned popular articles as a fallback when personalized recommendations were requested. If we did so quickly and without error, current monitoring systems would consider the response successful. The customer, meanwhile, has a poor experience, so we should record the response as unsuccessful.
A new paradigm for measuring response quality should fulfil the following criteria:
- comparable across models
- simple and understandable metrics
- measurements are collected in real time
- allows for actionable alerting on problems
The response quality is defined as an approximation of how well the response fits the defined business and modelling case. The goal is to bridge the gap between the metrics used during model training and technical monitoring metrics. Ideally, we would like to obtain Service Level Objectives (SLOs) that capture this quality aspect and can be discussed with the different client applications in terms of their business cases, e.g., 85% of order confirmation emails contain personalized recommendations based on the purchase.
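As a minimal sketch of what such quality-oriented measurement could look like (the labels, class name, and 85% target below are illustrative, not from the talk), one might tag each response with a quality label and compare the rate of fully successful responses against an SLO target:

```python
from collections import Counter

# Hypothetical quality labels for a single response.
PERSONALIZED = "personalized"   # response fits the business case
FALLBACK = "fallback"           # e.g. popular articles instead of personalized ones
ERROR = "error"                 # request failed outright

class QualityTracker:
    """Counts responses by quality label and reports the rate of fully
    successful responses against a target SLO (e.g. 85% personalized)."""

    def __init__(self, slo_target=0.85):
        self.slo_target = slo_target
        self.counts = Counter()

    def record(self, label):
        self.counts[label] += 1

    def success_rate(self):
        total = sum(self.counts.values())
        return self.counts[PERSONALIZED] / total if total else 1.0

    def slo_met(self):
        return self.success_rate() >= self.slo_target

tracker = QualityTracker(slo_target=0.85)
for label in [PERSONALIZED] * 9 + [FALLBACK]:
    tracker.record(label)
print(tracker.success_rate())  # 0.9
print(tracker.slo_met())       # True
```

In a real system the counters would be emitted to the existing monitoring stack in real time, so the quality rate can be alerted on alongside error rate and latency.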
A case study will illustrate how algorithmic monitoring was introduced in the recommendation team at Zalando. Zalando is one of Europe’s largest fashion retailers and multiple recommendation algorithms serve many online and offline use cases. You will see several examples of how the monitoring helped to identify bugs or diagnose quality problems.
About the Speaker
Lina Weichbrodt is a Research Engineer in the recommendation team at Zalando. She focusses on developing scalable machine learning with a user-oriented product design view and on bringing a data-driven perspective to DevOps and backend engineering.
Building Recommender Systems with Strict Privacy Boundaries
by Renaud Bourassa (Slack)
Every day, millions of people rely on Slack to get the information they need to do their jobs. To make their working lives more productive, Slack has built a number of recommender systems to prioritize the content a given user is most likely to need at any point in time. These systems have wide-ranging purposes, from recommending channels for users to join to ranking unread content so users can catch up more easily.
A common trait of all these systems is that they must deal with strict privacy boundaries inherent to the underlying dataset. By policy, users can only be exposed to data that was publicly shared in their own Slack team. These restrictions must carry over into the recommender systems: not only must they refrain from recommending data from foreign teams, but—more subtly—patterns in foreign teams’ data must not be inferable from the usage of these systems.
In this talk, I will discuss how Slack’s dataset differs from those used in traditional recommender systems such as the Netflix Prize dataset. I will also present some techniques we developed to leverage the entire dataset to improve the performance of our recommender systems without jeopardizing the privacy boundaries we guarantee to our customers. These include a mix of algorithms with increased locality as well as the use of metadata over data to generate privacy-sensitive recommendations.
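The talk does not spell out its algorithms, but the idea of "increased locality" can be sketched as follows: a toy channel recommender whose co-occurrence statistics are computed strictly per team, so one team's usage can never influence recommendations served to another team (all function and variable names below are hypothetical):

```python
from collections import defaultdict
from itertools import combinations

def build_per_team_cooccurrence(memberships):
    """memberships: {team_id: {user_id: set(channel_ids)}}.
    Returns {team_id: {(a, b): count}} counting, within each team ONLY,
    how often two channels share a member. No cross-team statistics exist."""
    models = {}
    for team, users in memberships.items():
        counts = defaultdict(int)
        for channels in users.values():
            for a, b in combinations(sorted(channels), 2):
                counts[(a, b)] += 1
        models[team] = dict(counts)
    return models

def recommend_channels(models, team, user_channels, top_k=3):
    """Score candidate channels by co-membership with the user's channels,
    consulting only the model of the user's own team."""
    scores = defaultdict(int)
    for (a, b), count in models.get(team, {}).items():
        if a in user_channels and b not in user_channels:
            scores[b] += count
        elif b in user_channels and a not in user_channels:
            scores[a] += count
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Because each team's model is fit on that team's data alone, patterns in a foreign team's data cannot be inferred from the recommendations a user receives, at the cost of giving up global collaborative signal.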
About the Speaker
Renaud Bourassa is a Staff Software Engineer on the Search, Learning, and Intelligence team at Slack, working on machine learning services that help users find the most relevant information. Prior to Slack, Renaud worked at Google where he built systems to detect place visits from user location data. Renaud holds a degree in Software Engineering from the University of Waterloo.
Artwork Personalization at Netflix
by Fernando Amat (Netflix)
For many years, the main goal of the Netflix personalized recommendation system has been to get the right titles in front of our members at the right time. But the job of recommendation does not end there. The homepage should be able to convey to the member enough evidence of why a title may be good for her, especially for shows that the member has never heard of. One way to address this challenge is to personalize the way we portray the titles on our service. An important aspect of how to portray titles is through the artwork or imagery we display to visually represent each title. The artwork may highlight an actor that you recognize, capture an exciting moment like a car chase, or contain a dramatic scene that conveys the essence of a movie or show. It is important to select good artwork because it may be the first time a member becomes aware of a title (and sometimes the only time), so it must speak to them in a meaningful way. In this talk, we will present an approach for personalizing the artwork we use on the Netflix homepage. The system selects an image for each member and video to give better visual evidence for why the title might be appealing to that particular member.
There are many challenges involved in getting artwork personalization to succeed. One challenge is that we can only select a single piece of artwork to represent each title. In contrast, typical recommendation engines present multiple items (in some order) to a member, allowing us to subsequently learn about preferences between items through the specific item a member selects from the presented assortment. Here, we only collect feedback from the one image that was presented to each member for each title. This leads to a training paradigm based on incomplete logged bandit feedback. Moreover, since the artwork selection process happens on top of a recommendation system, collecting data directly from the production experience (observational data) makes it hard to disentangle whether a play was due to the recommendation or to the incremental effect of personalized evidence. Another challenge is understanding the impact of changing the artwork between sessions and whether that is beneficial or confusing to the user. We also need to consider how diverse artworks perform in relation to one another. Finally, given that the popularity and audiences for titles can change or drop quickly after launch, the system needs to quickly learn how to personalize images for a new item.
All these considerations naturally lead us to frame the problem as online learning with contextual multi-armed bandits. Briefly, contextual bandits are a class of online learning algorithms that balance the cost of gathering randomized training data (which is required for learning an unbiased model on an ongoing basis) with the benefits of applying the learned model to each member context (to maximize user engagement). This is known as the explore-exploit trade-off. In this setting, for a given title the set of actions is the set of available images for the title. We aim to discover the underlying unknown reward, based on probability of play, for each image given a member, a title, and some context. The context could be based on profile attributes (geo-localization, previous plays, etc.), the device, time, and other factors that might affect which image is optimal in each session.
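As a rough illustration of the explore-exploit trade-off described above, here is an epsilon-greedy contextual bandit sketch (not the algorithm Netflix actually deploys; the class and names are illustrative): with probability epsilon it shows a random image to keep gathering unbiased data, otherwise it shows the image with the best observed play rate for the context.

```python
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    """Per-(context, image) empirical play rates with epsilon-greedy selection."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.plays = defaultdict(int)   # (context, image) -> times played
        self.pulls = defaultdict(int)   # (context, image) -> times shown

    def select_image(self, context, images):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(images)          # explore: random image
        def play_rate(img):
            n = self.pulls[(context, img)]
            return self.plays[(context, img)] / n if n else 0.0
        return max(images, key=play_rate)           # exploit: best image so far

    def update(self, context, image, played):
        """Record one impression and whether the member played the title."""
        self.pulls[(context, image)] += 1
        self.plays[(context, image)] += int(played)
```

The randomized explore arm is what later makes unbiased offline evaluation possible, since the probability of each logged image being shown is known.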
With a large member base, many titles in the catalog, and multiple images per title, Netflix’s product is an ideal platform to test ideas for personalization of artwork. At peak, over 20 million personalized image requests per second need to be handled with low latency. To train our model, we leveraged existing logged data from a previous system that chose images in an unpersonalized manner. We will present results comparing the contextual bandit personalization algorithms using offline policy evaluation metrics, such as inverse propensity scoring and doubly robust estimators. We will conclude with a discussion of opportunities to expand and improve our approach. This includes developing algorithms to handle cold-start by quickly personalizing new images and new titles. We also discuss extending this personalization approach across other types of artwork we use and other evidence that describe our titles such as synopses, metadata, and trailers. Finally, we discuss potentially closing the loop by looking at how we can help artists and designers figure out what new imagery they should create to make a title even more compelling and personalizable.
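A minimal sketch of the inverse propensity scoring estimator mentioned above, assuming the logging policy's propensity (the probability with which it showed the logged image) was recorded with each impression; the record layout and names are hypothetical:

```python
def ips_estimate(logs, new_policy):
    """Offline estimate of a new policy's average reward via IPS.

    logs: list of dicts with keys 'context', 'image', 'propensity', 'reward',
          where 'propensity' is the logging policy's probability of showing
          that image and 'reward' is 1 if the title was played.
    new_policy(context, image) -> probability the new policy shows that image.
    Each logged reward is reweighted by how much more (or less) often the new
    policy would have shown the same image than the logging policy did."""
    total = 0.0
    for rec in logs:
        weight = new_policy(rec["context"], rec["image"]) / rec["propensity"]
        total += weight * rec["reward"]
    return total / len(logs)

# Toy usage: the logging policy chose between two images uniformly (0.5 each);
# the candidate policy always shows image "a", which was the one that got played.
logs = [
    {"context": "u1", "image": "a", "propensity": 0.5, "reward": 1},
    {"context": "u1", "image": "b", "propensity": 0.5, "reward": 0},
]
always_a = lambda context, image: 1.0 if image == "a" else 0.0
print(ips_estimate(logs, always_a))  # 1.0
```

A doubly robust estimator additionally uses a learned reward model to reduce the variance that the raw propensity weights introduce.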
About the Speaker
Fernando Amat is a Senior Research Engineer at Netflix working on large-scale machine learning problems related to automated image selection and optimization. Previously he worked at Howard Hughes Medical Institute applying machine learning to biomedical research problems in neuroscience. He holds a PhD in Electrical Engineering from Stanford University.

Ashok Chandrashekar is the manager of the Discovery Research team at Netflix that develops novel machine learning techniques for personalizing the Netflix home page through slate recommendations and evidence optimization. Ashok received his PhD in computer science from Dartmouth College where he developed novel techniques for minimally supervised object recognition tasks in images and videos.

Tony Jebara is Director of Machine Learning at Netflix where he works on personalization, recommendation, search, marketing and content algorithms. Previously, he was a professor at Columbia University. Tony served as general chair in 2017 and program chair in 2014 for the International Conference on Machine Learning. He has won the National Science Foundation Career award and multiple best paper awards. Tony has published over 100 articles across machine learning, computer vision, computational social science, and personalization. He holds a PhD from MIT.

Justin Basilico is a Research/Engineering Director for Page Algorithms Engineering at Netflix. He leads an applied research team focused on developing the next generation of algorithms used to generate the Netflix homepage through machine learning, ranking, recommendation, and large-scale software engineering. He has also developed machine learning approaches that yielded significant improvements in the personalized ranking algorithms that drive the Netflix recommendation system. Prior to Netflix, he worked on machine learning in the Cognitive Systems group at Sandia National Laboratories.
Conversational Content Discovery via Comcast X1 Voice Interface
by Shahin Sefati (Comcast)
The global market for intelligent voice-enabled devices is expanding at a fast pace. Comcast, one of the largest cable providers in the US with about 30 million users, has recently reinvented the way that customers can discover and access content on an entertainment platform by introducing a voice remote control for its Xfinity X1 entertainment platform. Spoken language input allows customers to express what they are interested in on their own terms, which has made it significantly more convenient to find a favorite TV channel or movie compared to the traditional limits of a screen menu navigated with the keys of a TV remote.
The more natural user experience of the voice interface results in voice queries that are considerably more complex to handle than channel numbers typed in or movie titles selected on screen. This poses a challenge for the platform: understanding the user intent and finding the appropriate action for the millions of voice queries that we receive every day. It also makes it necessary to adapt the underlying content recommendation algorithms to incorporate the richer intent context from the users.
We describe some of the key components of our voice-powered content discovery platform that specifically address these issues. We discuss how we leverage multimodal data, including voice queries and a large database of metadata, to enable a more natural search experience via voice for finding relevant movies, TV shows, or even a specific episode of a series. We describe the models that encode semantic similarities between the content and its metadata, allowing users to search for places, people, and topics using keywords or phrases that do not explicitly appear in the movie or show titles, as traditional title search requires. We describe how this category of voice search queries can be framed as a recommendation problem.
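The core of such metadata-based search can be sketched as a similarity problem: embed the query and each title's metadata in a shared vector space, then rank titles by similarity. The toy embeddings and names below are illustrative, not Comcast's actual models:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_titles(query_vec, title_vecs, top_k=2):
    """title_vecs: {title: metadata embedding}. Returns titles sorted by
    similarity to the query embedding, so a query can match a title whose
    name shares no words with it."""
    return sorted(title_vecs,
                  key=lambda t: cosine(query_vec, title_vecs[t]),
                  reverse=True)[:top_k]

# Toy usage: a "movies about space" query vector lands near the metadata
# embedding of a space film rather than an unrelated show.
titles = {"Interstellar": [0.9, 0.1, 0.4], "Cooking Show": [0.0, 1.0, 0.1]}
print(rank_titles([1.0, 0.0, 0.5], titles))
```

Framed this way, the query is simply another entity in the embedding space, which is what lets the problem be treated as recommendation rather than keyword matching.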
Even though voice input is extremely powerful for capturing the intent of our customers, the freedom to say anything also makes it more difficult for a voice remote user to know the range of queries our system supports. We show how we leverage the millions of voice queries that we receive every day to build and train a deep learning-based recommender system that produces different types of recommendations, such as educational suggestions and tips for voice commands the platform supports.
Finally, it is important to consider that the true potential of the voice-powered entertainment experience comes from fusing intents expressed in language with navigation of on-screen content via the remote's navigation buttons. For all the applications and features discussed in this talk, our recommendation systems are adapted to provide the most relevant suggestions whether the user is initiating an action by voice, navigating through the results rendered on the TV screen, or narrowing down the set of results with follow-up queries or button selections.
About the Speaker
Shahin Sefati is the Manager of the Search and Recommendation team at Comcast Applied AI Research. His team applies machine learning and natural language processing to develop algorithms and build data-driven product features for facilitating content discovery via search, browse, recommendation, and personalization on Comcast’s X1 voice-powered platform.

Parsa Saadatpanah is a research member of the Search and Recommendation team at Comcast Applied AI Research. He is also a PhD student in computer science at the University of Maryland.

Hassan Sayyadi is the Director of the Voice Semantic Search Platform at Comcast Applied AI Research. His team develops the core Natural Language Understanding algorithms for the Xfinity X1 voice system that are used for search, browse, and navigation via the voice interface. He also leads the personalization research team.

Jan Neumann leads the Comcast Applied Artificial Intelligence Research group. His team combines large-scale machine learning, deep learning, NLP, and computer vision to develop novel algorithms and product concepts that improve the experience of Comcast’s customers, such as the X1 voice remote and personalization features, virtual assistants and predictive intelligence for customer service, as well as smart video and sensor analytics.
Machine Learning at eBay: Connecting Sellers and Buyers for Search and Recommendation on the World’s Largest Inventory
by Ido Guy (eBay)
At eBay, sellers can offer virtually any type of listing, rendering the world’s largest inventory, with well over a billion items. Yet, the noisy nature of the input data and the extremely long-tailed item distribution pose a variety of challenges for search and recommendation, such as understanding the unique attributes (aspects) of the products, their importance to both sellers and buyers, and their intra-relationships, all essential to providing a high-quality user experience on the site.
In this talk, I will present several challenges and corresponding solution frameworks recently developed at eBay Research for aspect extraction, normalization, weighting, and relation inference; the mapping of relationships between e-commerce entities for matching uploaded listings to catalog products and feeding the e-commerce knowledge graph; the recommendation of categories for sellers’ contributions; and the automatic generation of textual fields (title, description) to bridge the gap between sellers and buyers by helping them speak the same language. Our methods combine a variety of language processing and computer vision approaches applied to the different types of data contributed by sellers. Learning to rank, named entity recognition, object identification, machine translation, and summarization are just a few of the techniques that come into play. Our methods drive different usage scenarios by enabling a better representation of users and items and an effective computation of their similarities. I will also describe how our applied research teams perform their work, from the development of initial prototypes, through offline and online production processes, to different evaluation schemes. I will conclude the talk by reviewing open challenges in large-scale e-commerce that will have to be addressed in the years to come.
About the Speaker
Ido Guy is a Director of Research at eBay, leading teams in Israel and the United States. The teams focus on the application of machine learning and language processing to eBay’s core product. In 2017, Ido’s team led the reduction of duplicates in eBay’s catalog by a factor of three. The team’s end goal is to improve the search and recommendation experience at eBay by better connecting sellers and buyers. Prior to eBay, Ido spent nearly three years at Yahoo Research as a Principal Research Engineer, focusing on the application of machine learning to web search and community question answering. Prior to that, he established and managed the Social Technologies group at IBM Research, developing tools for enterprise social media analysis, including recommenders of people and content. Ido has been active in the Recommender Systems domain during the past decade. He co-authored a variety of conference and journal papers in areas such as enterprise social media, social recommender systems, web search, and community question answering. Ido contributed the chapter on Social Recommender Systems to the RecSys handbook and served in various roles on the RecSys organizing and steering committees, including program co-chair for RecSys 2012.