The Second Workshop on Fine-grained Activity Detection (FGAD'23) will be held on October 2, 2023 at the International Conference on Computer Vision (ICCV).
This workshop will host a new challenge on fine-grained activity recognition. The competition dataset is the Consented Activities of People (CAP) dataset, an annotated dataset of 1.45M videos of 512 fine-grained activity classes performed by consented people, curated using the Collector platform. The Collector platform provides a framework for collecting ethically sourced video datasets of people performing rare activities in public and shared private spaces. The dataset is annotated with bounding boxes around the primary actor and temporal start/end frames for each activity instance. All videos are collected with informed consent from subjects for how their data will be used, stored, and shared. Videos are submitted by workers around the world using a mobile app, which provides an ethical platform for on-demand visual dataset collection.
The tasks for evaluation in this challenge are:
- Activity Classification (AC). The AC task is to assign one or more activity class labels and confidence scores to each video clip from a set of predefined classes. The metrics for AC performance are Mean Average Precision (mAP) and top-1 and top-5 classification accuracy, averaged over all classes. The sequestered evaluation data includes 4x more data than was evaluated in the first FGAD challenge.
- Temporal Activity Detection (TAD). The TAD task is to detect and temporally localize all activity instances in untrimmed video. The metric for TAD performance is Mean Average Precision (mAP) at fixed temporal intersection over union (IoU) thresholds of 0.2, 0.5, and 0.8.
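The temporal IoU underlying the TAD metric can be sketched in a few lines; this is a minimal illustration, and the function and variable names are ours rather than part of the challenge tooling:

```python
def temporal_iou(seg_a, seg_b):
    """Temporal intersection over union of two (start, end) frame intervals."""
    (a0, a1), (b0, b1) = seg_a, seg_b
    intersection = max(0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - intersection
    return intersection / union if union > 0 else 0.0

# A detection matches a ground truth instance at threshold tau if IoU >= tau
print(temporal_iou((10, 50), (30, 70)))  # 20 / 60 ≈ 0.333
```

A predicted segment is counted as a true positive at a given threshold only if its temporal IoU with an unmatched ground truth instance of the same class meets that threshold.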
This workshop features an open leaderboard on these two tasks, evaluating activity classification and activity detection on the CAP dataset. The leaderboard evaluation will be performed on a video test set available only to challenge participants. Training and validation sets are publicly available; test sets are sequestered and become available for download to challenge participants following registration and submission of a license agreement. Challenge participants are required to upload a CSV file with results on each test set video to an evaluation server, which then performs the evaluation for the public leaderboard. A comprehensive evaluation plan is available for review.
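The evaluation plan defines the exact CSV schema; as an illustrative sketch only (the column layout and video/label names below are our assumptions, not the official format), a classification submission might be written as:

```python
import csv

# Hypothetical results as (video_id, activity_label, confidence) rows.
# The actual required columns are defined in the challenge evaluation plan.
results = [
    ('video_0001', 'person_opens_door', 0.92),
    ('video_0001', 'person_closes_door', 0.41),
    ('video_0002', 'person_waves_hand', 0.87),
]

with open('submission.csv', 'w', newline='') as f:
    csv.writer(f).writerows(results)
```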
- cap_detection_handheld_val.tar.gz (0.9 GB) MD5:72f58e69582c17dd366d3c7e85cf0da8 (05May23)
- Validation set for handheld activity detection in untrimmed clips
- cap_classification_clip.tar.gz (288 GB) MD5:54315e2ce204f0dbbe298490a63b5b3b (02Mar22)
- Tight temporal clip training/validation set for handheld activity classification
- cap_classification_pad.tar.gz (386 GB) MD5:fbdc75e6ef10b874ddda20ee9765a710 (02Mar22)
- Temporally padded (>4s) training/validation set for handheld activity classification
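After downloading, archive integrity can be checked against the listed MD5 sums, for example with a short script (the expected digest below is copied from the download list above):

```python
import hashlib
import os

def md5sum(path, chunk_size=2**20):
    """Compute the MD5 hex digest of a file, reading in chunks to bound memory."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

expected = {'cap_detection_handheld_val.tar.gz': '72f58e69582c17dd366d3c7e85cf0da8'}
for filename, md5 in expected.items():
    if os.path.exists(filename):
        assert md5sum(filename) == md5, f'corrupt download: {filename}'
```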
The CAP dataset contains 1.45M clips of 512 fine-grained activity labels. A participant is required to use the training set of trimmed activity clips to train a system for activity classification; this trained classification system is then used for untrimmed activity detection.
The download section provides two activity classification training sets. The first training set contains examples of each activity class with no temporal padding. The second (optional) training set contains examples of each activity class with centered temporal padding, so that each clip has equal padding before and after and a total length of at least 4 seconds. The training sets are otherwise equivalent, and a participant may choose either one.
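The centered padding in the second training set can be sketched as follows; this is a minimal illustration of the described scheme, not the curators' actual export logic:

```python
def centered_pad(start, end, min_duration=4.0):
    """Pad a (start, end) clip interval, in seconds, equally on both sides
    so that the padded clip is at least min_duration seconds long."""
    duration = end - start
    if duration >= min_duration:
        return (start, end)
    pad = (min_duration - duration) / 2.0
    return (start - pad, end + pad)

print(centered_pad(10.0, 12.0))  # (9.0, 13.0): a 2s clip padded to 4s
```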
The download section also provides an activity detection validation set. Unlike the activity classification task, we do not provide an activity detection training set. A participant is expected to train a representation for each activity class from the AC training set, then leverage this trained system for temporal localization. The validation set includes 47 examples of 45-second videos containing one or more fine-grained activity labels. A participant's system is required to temporally localize the start and end frames of each detected activity instance and report these frame indexes along with a label confidence. The validation set includes visualizations of the ground truth in the 47 videos. Each video in the activity detection test set may contain one or more of the 512 activity labels from the activity classification training set, in any order, and it is the responsibility of the participant's system to localize any detected instances in these untrimmed clips.
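One simple baseline for turning a trained clip classifier into a temporal localizer is to score each frame (or window) per class and group above-threshold runs into segments. A minimal sketch, where the threshold and the mean-score confidence are our assumptions rather than a prescribed method:

```python
def frames_to_segments(frame_scores, threshold=0.5):
    """Group consecutive frames whose class score exceeds threshold into
    (start_frame, end_frame, confidence) segments, scored by the mean."""
    segments, start = [], None
    for i, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            window = frame_scores[start:i]
            segments.append((start, i - 1, sum(window) / len(window)))
            start = None
    if start is not None:
        window = frame_scores[start:]
        segments.append((start, len(frame_scores) - 1, sum(window) / len(window)))
    return segments

scores = [0.1, 0.7, 0.9, 0.8, 0.2, 0.1, 0.6, 0.6]
print(frames_to_segments(scores))  # two segments: frames 1-3 and 6-7
```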
The training and validation sets are exported in an open JSON format. A participant may find the VIPY Python tools useful for dataset transformation, visualization and training preparation. To load and interact with the activity detection dataset:
# Load the activity detection annotations into memory
import vipy
valset = vipy.load('./annotations')  # a flat list of annotated videos
# Retrieve all activities
activities = [a for v in valset for a in v.activitylist()]
# Retrieve all object tracks
tracks = [t for v in valset for t in v.tracklist()]
# Visualize a single frame of a video
im = valset[0].frame(100)
print(im.objects())