The project uses PyTorch to implement two networks that generate descriptive sentences from an input image. The models were inspired by a couple of existing projects [1, 2].
The networks were trained on the Flickr8k dataset.
Model 1: CNN-RNN with a single-layer LSTM, without attention
- Encoder: pre-trained Inception v3
- Decoder: single-layer LSTM

Model 2: CNN-RNN with a single-layer LSTM, with attention
- Encoder: pre-trained ResNet-50
- Decoder: single-layer LSTM with soft attention
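The soft-attention decoder in Model 2 re-weights the encoder's spatial feature map at every decoding step, using the LSTM hidden state to decide which image regions to attend to. As a rough sketch (not the project's actual module; the class name, layer names, and dimensions are assumptions), additive soft attention can be implemented like this:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over encoder regions.

    Hypothetical sketch: projects the image features and the decoder
    hidden state into a shared space, scores each region, and returns
    a context vector as the attention-weighted sum of the features.
    """

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)    # project image features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)               # scalar score per region

    def forward(self, features, hidden):
        # features: (batch, num_regions, feat_dim)
        # hidden:   (batch, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                    # (batch, num_regions)
        alpha = torch.softmax(e, dim=1)                   # attention weights, sum to 1
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # (batch, feat_dim)
        return context, alpha
```

The returned `alpha` weights are what the attention visualizations below overlay on the input image.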
Below are a few examples of the generated captions:

Below are a few examples of the generated captions and the attention weights:

- Fill in the following paths:

```python
path_images = "/content/flickr8k/Images"          # dataset images
path_captions = "/content/flickr8k/captions.txt"  # dataset captions
path_examples = ""     # images to caption
path_checkpoints = ""  # model checkpoints
```

- Use the following function to caption images:

```python
print_examples(model, device, dataset, path, transform, attention=False, save=False, max_imgs=5, dpi=None)
```

Check out the notebook for additional information.

- `model`: model to evaluate
- `device`: device to use, e.g. `device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")`
- `dataset`: dataset used for the vocabulary
- `path`: directory of the images to caption
- `transform`: transform depending on the model
- `attention`: `True` for Model 2, `False` for Model 1
- `save`: saves the figures with the generated captions
- `max_imgs`: generates captions for at most `max_imgs` random pictures from the folder
- `dpi`: resolution of the saved figures
- Examples: a few sample images from the Flickr30k dataset.
- Checkpoints: checkpoints for the trained models.
- Code: .py and .ipynb files with the code.
[2] https://www.kaggle.com/mdteach/image-captioning-with-attention-pytorch

