Inverse Cooking: Recipe Generation from Food Images
Group Members: Nishant (ngovind), Luca (lfonstad), Thomas (tvanderm)
Introduction: We are implementing the model from the paper “Inverse Cooking: Recipe Generation from Food Images”, which takes as input an image of food and generates a novel set of ingredients, as well as cooking instructions based on those ingredients. We chose this project because it seemed like an interesting and novel way to combine different aspects of deep learning we had learned about, such as transformers, NLP models, and image models.
Related Work: There are many food-related computer vision models, including models for food classification, for estimating the calories in an image of food, for estimating the quantity of food in an image, and even several models for predicting the ingredients in an image of food, as is also done in this model. There are also plenty of natural language models focused on conditional text generation, including previous models that generate recipes from flow graphs or ingredient lists. Other prior work has focused on retrieval: given a database of recipes, using an image as a query to look up the best-matching existing recipe. These models perform well at ingredient recognition and are effective on recipes already in their database, but they cannot synthesize novel recipes, so an image of a food outside the database would not be handled well.
Public implementations: https://github.com/facebookresearch/inversecooking
Data: We plan to use the Recipe1M dataset, which is composed of 1,029,720 recipes scraped from cooking websites. The dataset is split into 720,639 training, 155,036 validation, and 154,045 test recipes, each containing a title, a list of ingredients, a list of cooking instructions, and (optionally) an image.
There will be significant preprocessing required. First, we have to discard any recipes that don’t contain an image. We then normalize the ingredients to reduce redundancy; for example, cheddar cheese and parmesan cheese both merge into cheese. For the instructions, we must tokenize the text and add unique tokens to mark the start and end of a recipe and the end of each instruction. We process the data so that the title is the first instruction. From this, we build the recipe and ingredient vocabularies, removing low-frequency words (those that appear fewer than 10 times) from both.
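As a rough sketch of this preprocessing step (the function and token names here are our own placeholders, not taken from the paper’s code), the vocabulary building and recipe tokenization might look like:

```python
from collections import Counter

# Placeholder special tokens: <eoi> marks the end of each instruction,
# <start>/<end> mark recipe boundaries, <unk> stands in for dropped rare words.
SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<eoi>", "<unk>"]

def build_vocab(tokenized_recipes, min_freq=10):
    # Keep only words that appear at least min_freq times across the corpus.
    counts = Counter(tok for recipe in tokenized_recipes for tok in recipe)
    vocab = list(SPECIAL_TOKENS)
    vocab += [tok for tok, c in counts.items() if c >= min_freq]
    return {tok: i for i, tok in enumerate(vocab)}

def encode_recipe(title, instructions, vocab):
    # The title is treated as the first instruction.
    tokens = ["<start>"]
    for instr in [title] + instructions:
        tokens += instr.lower().split()
        tokens.append("<eoi>")
    tokens.append("<end>")
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]
```

Words pruned by the frequency cutoff simply map to the unknown token at encoding time.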
Since we plan to use the same dataset as the paper (it is the most comprehensive recipe dataset publicly available), we decided to preprocess the data further to shrink the recipe and ingredient vocabularies. This will involve additional merging of ingredients and a cap on the maximum length of the instructions.
Methodology: The architecture of our model is summarized by the image attached with this submission.
Our architecture consists of several components: an image encoder to produce the image embedding, an ingredient decoder to predict the ingredients, and an instruction decoder that uses an ingredient embedding from the ingredient encoder to produce the instructions for the image, where the first instruction is the title. We plan to use TensorFlow’s implementation of the ResNet50 model, minus its final classification layer, to produce the image features. For the ingredient decoder, we plan to use a set transformer architecture, in which we apply a max pooling operation between layers to reduce the importance of the order of the ingredients (hence “set” transformer). For the instruction decoder, we plan to concatenate the image embeddings and ingredient embeddings and pass them through transformer blocks. We intend to train in two stages: first the image encoder and ingredient decoder, and then the ingredient encoder and the instruction decoder. When training the instruction decoder, we plan to use the ground-truth ingredients as input.
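Two pieces of this design can be illustrated with a toy numpy sketch (not the actual model code; the shapes are arbitrary placeholders): max pooling across the ingredient axis gives the same result for any ordering of the ingredients, which is what lets the set transformer ignore ingredient order, and the instruction decoder’s input is formed by concatenating the image embedding with the ingredient embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
ingr_embeds = rng.normal(size=(5, 8))  # 5 ingredients, 8-dim embeddings (toy sizes)

# Max pooling across the ingredient (set) axis is permutation-invariant:
# shuffling the rows leaves the pooled vector unchanged.
pooled = ingr_embeds.max(axis=0)
shuffled = ingr_embeds[rng.permutation(5)]
assert np.allclose(pooled, shuffled.max(axis=0))

# Instruction decoder input: the image embedding is prepended to the
# ingredient embeddings before being passed through the transformer blocks.
img_embed = rng.normal(size=(1, 8))
decoder_input = np.concatenate([img_embed, ingr_embeds], axis=0)
assert decoder_input.shape == (6, 8)
```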
The hardest part will be implementing the transformer architecture for the ingredient and instruction decoders. This will involve dealing with attention (both self-attention and multi-headed attention), as well as the intricacies specific to each decoder. For instance, we need to make sure the set transformer learns when to stop predicting ingredients (a stopping criterion).
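One common way to realize such a stopping criterion, shown here as an entirely illustrative greedy-decoding loop (names like `EOS` and `step_fn` are our placeholders), is to decode until the model emits a special end-of-sequence index or hits a hard cap:

```python
EOS = 0               # reserved index for the learned "stop" signal
MAX_INGREDIENTS = 20  # hard cap as a safety net

def decode_ingredients(step_fn, max_len=MAX_INGREDIENTS):
    """Greedily decode ingredient ids until EOS or max_len is reached.

    step_fn takes the ingredients predicted so far and returns the next id.
    """
    predicted = []
    for _ in range(max_len):
        next_id = step_fn(predicted)
        if next_id == EOS:  # model chose to stop
            break
        predicted.append(next_id)
    return predicted
```

During training, the model would be supervised to emit `EOS` after the last ground-truth ingredient, so it learns the stopping behavior rather than relying on the cap.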
Metrics: The model has two parts that are judged separately. The recipe-generating portion’s success is measured by perplexity. The ingredient-prediction section is evaluated on intersection over union (IoU) and F1 scores; precision and recall are also reported. Accuracy doesn’t make much sense for this model, since for any given recipe the number of correct ingredients is vastly outnumbered by the number of possible incorrect ingredients. The authors of the original paper hoped to create a model that would generate correct and usable recipes; to quantify this, they measured the success of the recipes according to human judgment. This option will likely not be available to us in a statistically meaningful way due to small sample size.
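Since ingredient prediction is scored on sets rather than sequences, these metrics reduce to simple set arithmetic. A sketch of the helpers we expect to write (our own code, not taken from the paper):

```python
def iou(pred, truth):
    """Intersection over union of predicted vs. ground-truth ingredient sets."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0

def f1(pred, truth):
    """F1 score: harmonic mean of set precision and recall."""
    pred, truth = set(pred), set(truth)
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting {cheese, flour, egg} against a ground truth of {flour, egg, milk} gives an IoU of 2/4 = 0.5 and an F1 of 2/3.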
Base/Target/Stretch Goals:
Baseline goal: similar IoU/F1 on ingredient prediction compared to paper
Target goal: recipe generation with reasonable perplexity scores
Ethics: This project will use the Recipe1M dataset used in the original paper. This dataset was collected by scraping “from over two dozen popular cooking websites”. According to the paper presenting the dataset, a wide breadth of cultures and nationalities are represented -- twelve distinct examples are given. This is better than the counterexample mentioned in the Recipe1M+ paper, of a similar work that only contained recipes for Chinese cuisine. However, no information is given on the actual distribution of recipes across various cultural and culinary traditions, and all the examples mentioned are well established in the culinary world. There are valid concerns that the dataset will fail to represent cuisine from different cultures in balanced proportions, or to even represent certain cultures at all.
In the introduction of the original paper, the authors observe that in modern society, it is becoming increasingly difficult to be aware of what goes into the food we eat. There are two primary reasons someone may wish to know the details of a recipe for a given food: desire to replicate the recipe, or dietary constraints. The original paper seems more geared towards the first use case, but the existence of the second raises ethical concerns. If a recipe prediction algorithm were widely available, eventually someone would attempt to use it for the purpose of determining whether something fits their dietary restrictions or not. When this happens, a failure of the algorithm could have consequences ranging from mild distaste all the way to fatal allergic reaction. For the first (and intended) use-case, failure means a ruined meal and wasted ingredients.
Division of Labor: We haven’t decided the division of labor yet, but we mapped out chunks of the project that we can use to assign roles as we start to code: preprocessing, the image and ingredient encoding layers, and the transformer architecture (which includes both the ingredient and instruction decoders). One option is to divide the work within each chunk evenly, which makes sense so that every person works with all aspects of the project. For instance, one person could work on preprocessing to tokenize the instructions, one person could build the recipe vocabulary, and the last person could build the ingredient vocabulary; each chunk could be divided similarly. Another possible plan is to assign one chunk to each person, while making sure we all understand the overall code. We plan to write about how exactly we ended up dividing the work in our second reflection once we start the implementation.
Built With
- tensorflow

