Download

*Note: We recently changed the location where we store the VQA dataset. Please use the updated download links given below if you are getting a 404 error while trying to download the dataset.

VQA Annotations

Balanced Real Images [Cite]

Balanced Binary Abstract Scenes [Cite]

Abstract Scenes (same as v1.0 release) [Cite]

VQA Input Questions

VQA Input Images

Complementary Pairs List

*Note: The training and validation set files were updated with minor changes on 04/26/17 to be consistent with the test set. If you downloaded these files before that date, please download them again. Thanks!

The captions for the training and validation sets of the abstract scenes can be downloaded from here.


Overview


The annotations we release are the result of the following post-processing steps on the raw crowdsourced data:

  • Spelling correction (using Bing Speller) of question and answer strings
  • Question normalization (first char uppercase, last char ‘?’)
  • Answer normalization (all chars lowercase, no period except as a decimal point, number words -> digits, strip articles (a, an, the))
  • Adding an apostrophe if a contraction is missing it (e.g., converting "dont" to "don't"); a rough sketch of these normalization steps follows this list
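
A minimal sketch of the normalization steps above, for illustration only. The number-word and contraction tables are small samples (assumptions, not the resources used to build the release), and the Bing Speller step is not reproduced here.

import re

# Rough illustration of the answer-normalization steps listed above.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "isnt": "isn't", "wont": "won't"}  # sample entries only

def normalize_answer(answer):
    answer = answer.lower()
    # Drop periods that are not decimal points (keep the "." in "2.5").
    answer = re.sub(r"(?<!\d)\.(?!\d)", "", answer)
    words = []
    for word in answer.split():
        word = CONTRACTIONS.get(word, word)   # add missing apostrophes
        word = NUMBER_WORDS.get(word, word)   # number words -> digits
        if word not in ARTICLES:              # strip articles
            words.append(word)
    return " ".join(words)

print(normalize_answer("The dog."))  # -> "dog"
print(normalize_answer("Two"))       # -> "2"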

Please follow the instructions in the README to download and set up the VQA data (annotations and images).
By downloading this dataset, you agree to our Terms of Use.


VQA API

getQuesIds - Get question ids that satisfy given filter conditions.
getImgIds - Get image ids that satisfy given filter conditions.
loadQA - Load questions and answers with the specified question ids.
showQA - Display the specified questions and answers.
loadRes - Load result file and create result object.

Here is a link to the Python API demo script.
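
A minimal usage sketch of this API, assuming the vqaTools helper module from the VQA GitHub repository is importable and that the file paths below (placeholders) point to a downloaded annotation/question pair:

from vqaTools.vqa import VQA

ann_file = "v2_mscoco_val2014_annotations.json"          # placeholder path
ques_file = "v2_OpenEnded_mscoco_val2014_questions.json"  # placeholder path

vqa = VQA(ann_file, ques_file)

# Question ids for all "yes/no" questions about one image (example image id).
ques_ids = vqa.getQuesIds(imgIds=[262148], ansTypes=["yes/no"])

# Load and display the corresponding question/answer annotations.
anns = vqa.loadQA(ques_ids)
vqa.showQA(anns)

# Image ids that have at least one question of an example question type.
img_ids = vqa.getImgIds(quesTypes=["how many"])

# Turn a result file into a result object for evaluation (placeholder path).
# vqa_res = vqa.loadRes("results.json", ques_file)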


Input Questions Format

The questions are stored using the JSON file format.

The questions format has the following data structure:

{
"info" : info,
"task_type" : str,
"data_type": str,
"data_subtype": str,
"questions" : [question],
"license" : license
}

info {
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime
}

license{
"name" : str,
"url" : str
}

question{
"question_id" : int,
"image_id" : int,
"question" : str
}

task_type: type of annotations in the JSON file (OpenEnded).
data_type: source of the images (mscoco or abstract_v002).
data_subtype: the data subset (e.g., train2014/val2014/test2015 for mscoco, train2015/val2015 for abstract_v002).
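
For illustration, a questions file can be read with the standard json module; the file name below is a placeholder for whichever subset you downloaded.

import json

# Load a questions file and inspect it according to the structure above.
with open("v2_OpenEnded_mscoco_train2014_questions.json") as f:
    q_data = json.load(f)

print(q_data["task_type"], q_data["data_type"], q_data["data_subtype"])

first = q_data["questions"][0]
print(first["question_id"], first["image_id"], first["question"])

# Index questions by question_id for quick lookup later.
questions_by_id = {q["question_id"]: q for q in q_data["questions"]}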


Annotation Format

The annotations are stored using the JSON file format.

The annotations format has the following data structure:

{
"info" : info,
"data_type": str,
"data_subtype": str,
"annotations" : [annotation],
"license" : license
}

info {
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime
}

license{
"name" : str,
"url" : str
}

annotation{
"question_id" : int,
"image_id" : int,
"question_type" : str,
"answer_type" : str,
"answers" : [answer],
"multiple_choice_answer" : str
}

answer{
"answer_id" : int,
"answer" : str,
"answer_confidence": str
}

data_type: source of the images (mscoco or abstract_v002).
data_subtype: the data subset (e.g., train2014/val2014/test2015 for mscoco, train2015/val2015 for abstract_v002).
question_type: type of the question, determined by the first few words of the question. For details, please see the README.
answer_type: type of the answer. Currently, "yes/no", "number", and "other".
multiple_choice_answer: most frequent ground-truth answer.
answer_confidence: subject's confidence in answering the question. For details, please see Antol et al., ICCV 2015.
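
A small sketch of reading an annotations file and inspecting the answers of one annotation; the path is a placeholder, and json plus collections are the only dependencies.

import json
from collections import Counter

# Load an annotations file and look at one annotation, following the structure above.
with open("v2_mscoco_train2014_annotations.json") as f:
    ann_data = json.load(f)

ann = ann_data["annotations"][0]
print(ann["question_type"], ann["answer_type"], ann["multiple_choice_answer"])

# Each annotation carries the crowdsourced answers; the most frequent one
# should generally match multiple_choice_answer.
answers = [a["answer"] for a in ann["answers"]]
most_common, count = Counter(answers).most_common(1)[0]
print(most_common, count)

# Annotations can be joined with questions by question_id
# (see the questions example above).
# question = questions_by_id[ann["question_id"]]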


Complementary Pairs List Format

The complementary pairs lists are stored using the JSON file format.

The complementary pairs list has the following data structure:

[
(question_id_1, question_id_2)
]

The (question, image, answer) example with question_id_1 and the (question, image, answer) example with question_id_2 are complementary to each other, i.e., they share the same question but have two different images with two different answers. For more details, please see Goyal et al., CVPR 2017.
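
A short sketch of reading a complementary-pairs file (placeholder path); in the JSON file each pair is stored as a two-element list of question ids.

import json

# Load the complementary pairs and look at the first pair.
with open("v2_mscoco_train2014_complementary_pairs.json") as f:
    pairs = json.load(f)

qid_1, qid_2 = pairs[0]
print(qid_1, qid_2)

# With the question index built in the earlier example, the two sides of a
# pair can be compared directly:
# q1, q2 = questions_by_id[qid_1], questions_by_id[qid_2]
# assert q1["question"] == q2["question"] and q1["image_id"] != q2["image_id"]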


Abstract Scenes and Captions

This section provides more information about the abstract scenes' composition files (e.g., the (x, y) pixel coordinates of each clipart object and whether it faces left or right) and the abstract captions. If you are using any data (images, questions, answers, or captions) associated with abstract scenes, please cite Antol et al., ICCV 2015. If you are using the balanced binary abstract scenes dataset, please also cite Zhang et al., CVPR 2016. An example BibTeX entry is:

@InProceedings{VQA,
author = {Stanislaw Antol and Aishwarya Agrawal and Jiasen Lu and Margaret Mitchell and Dhruv Batra and C. Lawrence Zitnick and Devi Parikh},
title = {VQA: Visual Question Answering},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2015},
}

The following links contain the abstract scenes' composition files for the Abstract Scenes v1.0 dataset:

The following links contain the abstract scenes' composition files for the Balanced Binary Abstract Scenes dataset:

Each of the links above contains the following:

  • A file of the type "abstract_v002_[datasubset]_scene_information.json", where [datasubset] is "train2015", "val2015", or "test2015". This file has the following data structure (a small loading sketch appears after this list):

    {
    "info" : info,
    "data_type": str,
    "data_subtype": str,
    "compositions" : [composition],
    "images" : [image],
    "license" : license
    }

    info {
    "year" : int,
    "version" : str,
    "description" : str,
    "contributor" : str,
    "url" : str,
    "date_created" : datetime
    }

    license{
    "name" : str,
    "url" : str
    }

    image{
    "image_id" : int,
    "file_name" : str,
    "url" : str,
    "height" : int,
    "width" : int
    }

    composition{
    "image_id" : int,
    "file_name" : str
    }

    data_type: source of the images (abstract_v002).
    data_subtype: the data subset (train2015/val2015/test2015).
    The file_name field in the images list contains the name of the image file for the corresponding abstract scene. These image files can be downloaded from the links provided in the "Download" section on this page.
    The file_name field in the compositions list contains the name of the scene composition file for the corresponding abstract scene (see the bullet below).

  • A folder of the type "scene_composition_abstract_v002_[datasubset]" where [datasubset] is either "train2015" or "val2015" or "test2015". This folder contains the scene composition files for the corresponding [datasubset].
    • For more information on how to render the scenes from annotation files and to obtain API support for abstract scenes, please visit the GitHub repository.
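
Below is the loading sketch referenced in the first bullet above: it reads a scene-information file (the path is a placeholder for wherever you extracted the download) and maps each image to its scene-composition file.

import json

# Read a scene-information file and pair image files with composition files,
# following the structure described above.
with open("abstract_v002_train2015_scene_information.json") as f:
    scene_info = json.load(f)

image_files = {img["image_id"]: img["file_name"] for img in scene_info["images"]}
composition_files = {c["image_id"]: c["file_name"] for c in scene_info["compositions"]}

some_id = scene_info["images"][0]["image_id"]
print(image_files[some_id], "->", composition_files[some_id])
# The composition file itself lives in the matching
# "scene_composition_abstract_v002_train2015" folder described above.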

The JSON files containing the captions for the training and validation sets of the abstract scenes can be downloaded from the link provided in the "Download" section on this page. These files have the following data structure:

{
"info" : info,
"task_type": str,
"data_type": str,
"data_subtype": str,
"annotations" : [annotation],
"images" : [image],
"license" : license
}

info {
"year" : int,
"version" : str,
"description" : str,
"contributor" : str,
"url" : str,
"date_created" : datetime
}

license{
"name" : str,
"url" : str
}

image{
"image_id" : int,
"file_name" : str,
"url" : str,
"height" : int,
"width" : int
}

annotation{
"id" : int,
"image_id" : int,
"caption" : str
}

task_type: Captioning.
data_type: source of the images (abstract_v002).
data_subtype: the data subset (train2015/val2015).
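
As a rough illustration, the captions can be grouped by image with only the standard library; the file name below is a placeholder for the captions download.

import json
from collections import defaultdict

# Load an abstract-scene captions file and group captions by image,
# following the structure described above.
with open("abstract_v002_train2015_captions.json") as f:
    cap_data = json.load(f)

captions_by_image = defaultdict(list)
for ann in cap_data["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

image = cap_data["images"][0]
print(image["file_name"])
for caption in captions_by_image[image["image_id"]]:
    print(" -", caption)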