Inverse Cooking: Recipe Generation from Food Images

Group Members: Nishant (ngovind), Luca (lfonstad), Thomas (tvanderm)

Introduction: We are implementing the model from the paper “Inverse Cooking: Recipe Generation from Food Images”, which takes an image of food as input and generates a novel set of ingredients, as well as cooking instructions based on those ingredients. We chose this project because it is an interesting and novel way to combine different aspects of deep learning that we have learned about, such as transformers, NLP models, and image models.

Related Work: There are many food-related computer vision models, including models for food classification, for estimating the calories in an image of food, for estimating the quantity of food in an image, and even several models for predicting the ingredients in an image of food, as this model also does. There are also many natural language models focused on conditional text generation, including previous models that generate recipes from flow graphs or ingredient lists. Other prior work has focused on retrieval: given a database of recipes, using an image as a query to look up existing recipes. These models work well for ingredient recognition and perform effectively on recipes that are in their database, but they cannot synthesize novel recipes, so an image of a food not in the database would not be handled well.

Public implementations: https://github.com/facebookresearch/inversecooking

Data: We plan to use the Recipe1M dataset, which is composed of 1,029,720 recipes scraped from cooking websites. The dataset contains 720,639 training, 155,036 validation, and 154,045 test recipes, each containing a title, a list of ingredients, a list of cooking instructions, and (optionally) an image.

There will be significant preprocessing required. First, we have to discard any recipes that don’t contain an image. We then clean up the ingredients by normalizing them to reduce redundancy; for example, “cheddar cheese” and “parmesan cheese” merge into “cheese”. For the instructions, we must tokenize the text and add unique tokens to mark the start and end of a recipe, and the end of each instruction. We will process the data so that the title is the first instruction. From this, we have to build the instruction and ingredient vocabularies. For both, we remove low-frequency words (those that appear fewer than 10 times).
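The vocabulary-building step above can be sketched in a few lines of Python. This is a minimal illustration, not our final preprocessing code; the special-token names and the helper name `build_vocab` are our own choices, while the frequency threshold of 10 comes from the description above.

```python
from collections import Counter

MIN_FREQ = 10  # words appearing fewer than 10 times are dropped
# Hypothetical special tokens marking padding, recipe start/end,
# end-of-instruction, and out-of-vocabulary words.
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<eoi>", "<unk>"]

def build_vocab(tokenized_recipes, min_freq=MIN_FREQ):
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter(tok for recipe in tokenized_recipes for tok in recipe)
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok, n in counts.items():
        if n >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab
```

The same routine can build both the instruction and ingredient vocabularies by feeding it the respective token streams.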

Since we plan to use the same dataset as the paper (the most comprehensive recipe database publicly available), we decided to preprocess the data further to trim the recipe and ingredient vocabularies. This will involve further merging of ingredients and a cap on the maximum length of the instructions.

Methodology: The architecture of our model is summarized by the image attached with this submission.

Our architecture consists of several components: an image-processing layer (image encoder) that produces the image embedding, an ingredient decoder that predicts the ingredients, and an instruction decoder that uses an ingredient embedding from the ingredient encoder to produce the instructions for the image, where the first instruction will be the title. We plan to use TensorFlow’s implementation of the ResNet50 model, with the final layer removed, to produce the image features. For the ingredient decoder, we plan to use a set transformer architecture, in which we apply a max-pooling operation between layers in order to reduce the importance of the order of the ingredients (hence “set” transformer). For the instruction decoder, we plan to concatenate the image embeddings and ingredient embeddings and pass them through transformer blocks. We intend to train in two stages: first the image encoder and ingredient decoder, and then the ingredient encoder and the instruction decoder. When training the instruction decoder, we plan to use the ground-truth ingredients as input.

The hardest part will be implementing the transformer architecture for the ingredient and instruction decoders. This will involve dealing with attention (both self-attention and multi-head attention) and the intricacies specific to each decoder. For instance, we need to make sure the set transformer learns when to stop predicting ingredients (a stopping criterion).
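The stopping criterion can be illustrated with a small decoding-loop sketch: the decoder keeps emitting ingredient ids until it produces a special end-of-sequence id or hits a cap. Here `predict_next` is a hypothetical stand-in for one forward pass of the set transformer, and the id values are illustrative only.

```python
EOS_ID = 0           # hypothetical "stop predicting" token id
MAX_INGREDIENTS = 20  # illustrative cap on ingredients per recipe

def decode_ingredients(predict_next, image_embedding, max_len=MAX_INGREDIENTS):
    """Greedily collect ingredient ids until the model emits EOS_ID."""
    predicted = []
    for _ in range(2 * max_len):  # hard cap so the loop always terminates
        next_id = predict_next(image_embedding, predicted)
        if next_id == EOS_ID or len(predicted) >= max_len:
            break
        if next_id not in predicted:  # ingredients form a set: no repeats
            predicted.append(next_id)
    return predicted
```

In the real model the stop decision comes from the decoder itself learning to assign high probability to the end token once the ingredient set is complete.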

Metrics: The model has two parts that are judged separately from each other. The recipe-generating portion’s success is measured by perplexity. The ingredient-predicting portion is evaluated on Intersection over Union (IoU) and F1 scores; precision and recall are also reported. Accuracy doesn’t make much sense for this model, since for any given recipe the number of correct ingredients is vastly outnumbered by the number of possible incorrect ingredients. The authors of the original paper hoped to create a model that would generate correct and usable recipes. To quantify this, they measured the success of the recipes according to human judgment. This option will likely not be available to us in a statistically meaningful way due to small sample size.
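For concreteness, the set-based metrics above can be sketched as follows, treating predicted and ground-truth ingredients as sets of ids; the perplexity helper assumes we already have per-token log-probabilities from the instruction decoder.

```python
import math

def iou(pred, true):
    """Intersection over Union between two ingredient sets."""
    pred, true = set(pred), set(true)
    return len(pred & true) / len(pred | true) if pred | true else 1.0

def f1(pred, true):
    """Harmonic mean of precision and recall on ingredient sets."""
    pred, true = set(pred), set(true)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per generated token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

These would be averaged over the test set in practice; the original paper also reports precision and recall separately, which fall out of the same counts.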

Base/Target/Stretch Goals:

Baseline goal: similar IoU/F1 on ingredient prediction compared to paper

Target goal: recipe generation with reasonable perplexity scores

Ethics: This project will use the Recipe1M dataset used in the original paper. This dataset was collected by scraping recipes “from over two dozen popular cooking websites”. According to the paper presenting the dataset, a wide breadth of cultures and nationalities are represented -- twelve distinct examples are given. This is better than the counterexample mentioned in the Recipe1M+ paper, of a similar work that only contained recipes for Chinese cuisine. However, no information is given on the actual distribution of recipes across various cultural and culinary traditions, and all the examples mentioned are well established in the culinary world. There are valid concerns that the dataset will fail to represent cuisine from different cultures in balanced proportions, or even to represent certain cultures at all.

In the introduction of the original paper, the authors observe that in modern society, it is becoming increasingly difficult to be aware of what goes into the food we eat. There are two primary reasons someone may wish to know the details of a recipe for a given food: desire to replicate the recipe, or dietary constraints. The original paper seems more geared towards the first use case, but the existence of the second raises ethical concerns. If a recipe prediction algorithm were widely available, eventually someone would attempt to use it for the purpose of determining whether something fits their dietary restrictions or not. When this happens, a failure of the algorithm could have consequences ranging from mild distaste all the way to fatal allergic reaction. For the first (and intended) use-case, failure means a ruined meal and wasted ingredients.

Division of Labor: We haven’t decided the division of labor yet, but we have mapped out chunks of the project that we can use to assign roles as we start to code: preprocessing, the image and ingredient encoding layers, and the transformer architecture (which includes both the ingredient and instruction decoders). One option is to divide the work evenly within each chunk, which makes sense so that every person works with all aspects of the project; for instance, one person could work on preprocessing to tokenize the instructions, one could build the recipe vocabulary, and the last could build the ingredient vocabulary, and we can divide each chunk similarly. Another possible plan is to assign one chunk to each person, while making sure we all understand the overall code. We plan to write about how exactly we ended up dividing the work in our second reflection, once we start the implementation.

Built With

  • tensorflow

Updates



Challenges: So far our greatest challenge has been obtaining a good dataset. We have been unable to access the dataset that we had intended to use, the Recipe1M dataset, due to a technical glitch on the website, so for now we are training our architecture on a smaller dataset that doesn’t have instructions, which can be found at http://www.ub.edu/cvub/recipes5k/. This dataset consists of approximately 5,000 images of food and their corresponding ingredients. This is a fairly challenging situation because we will need more data to reach higher accuracies, which means it will be very tough to match the metrics achieved by the original researchers, who used the Recipe1M dataset.

We plan to email the researchers that built the Recipe1M dataset in order to get the full dataset required for this project. Either way, we can start working on the first part where we predict ingredients from an image.

Insights: We haven’t implemented our architecture yet, but we wrote the initial code for our preprocessing, which builds the ingredient vocabulary. We also thoroughly outlined our model, discussing how we go from the preprocessed data to the output of the set transformer that decodes the ingredients (producing the list of ingredients for each image). We got into the details of the transformer, which will be tailored to the specific task of classifying ingredients. This includes adding a max-pool layer between each time step to avoid penalizing for order (the order of ingredients in a recipe should not matter, whereas an autoregressive model like the transformer gives weight to order, e.g. the sequence of words in a sentence), and also learning when the transformer model should stop predicting ingredients for a given image. So far, we are very clear about how the ingredient decoder will work.
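The order-invariance point above can be demonstrated with a toy element-wise max pool over a list of ingredient embeddings: the pooled vector is identical for any permutation of the same embeddings, which is exactly why the pooling step removes the decoder's sensitivity to ingredient order. This is an illustrative plain-Python sketch, not the actual layer.

```python
def max_pool(embeddings):
    """Element-wise max over a list of equal-length embedding vectors."""
    return [max(column) for column in zip(*embeddings)]
```

Because `max` ignores the order of its inputs, shuffling the rows of `embeddings` leaves the pooled output unchanged.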

Plan: Since we have basic preprocessing done and have thoroughly detailed the architecture, we are on track. One major change is that we might only be able to implement the first part of our initially proposed project, predicting the ingredients from the image rather than both the ingredients and the instructions. Upon further review, we realized that implementing the set transformer that actually predicts the ingredients is fairly substantial on its own, and if we cannot find the data for recipe instructions, we plan to optimize our set transformer as much as possible to get the best results on our dataset. Our current plan is to implement the ingredient decoder to finish the first part of our project; given the issue with the dataset, this is now our target goal. Once we do that, if we have the data from Recipe1M, we will try to implement the instruction decoder (which should be similar to the ingredient decoder). If we don’t have access to the data, we might consider writing our own scraper to gather it, which would now be our stretch goal, as it involves significant data-retrieval code.
