¹ University of Washington, ² Allen Institute for AI, ³ Stanford University
VideoNet contains 1,000 actions across 37 domains.
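We provide two evaluation settings: multiple-choice questions and binary (yes/no) questions, the latter posed with zero or more in-context example videos.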
Q: Which of the following Pen Spinning actions is shown in the video?
A: B
Q: Which of the following Figure Skating actions is shown in the video?
A: D
Q: Which of the following Yo-yo actions is shown in the video?
A: D
Q: Which of the following Break Dance actions is shown in the video?
A: C
Q: Which of the following Crochet actions is shown in the video?
A: D
Q: The following video shows a Ghetto Bird, which is an action in Skateboarding.
Now consider the following video. Does it also show a Ghetto Bird?
A: Yes
Q: The following three videos show a Back Essence, which is an action in Tap Dance.
Now consider the following video. Does it also show a Back Essence?
A: Yes
Q: The following video shows Rubenstein's Revenge, which is an action in Juggling.
Now consider the following video. Does it also show Rubenstein's Revenge?
A: No
(If you're curious, the test clip shows Burke's Barrage.)
Q: Recall that a Doosra is an action in Cricket. Does the following video show a Doosra?
A: Yes
Q: The following two videos show Running Into the Kicker, which is a Penalty in American Football.
Now consider the following video. Does it also show Running Into the Kicker?
A: No
(If you're curious, the test clip shows Roughing the Kicker.)
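The examples above suggest a simple way to score both settings. The following is a minimal sketch, assuming exact-match scoring on the option letter (multiple-choice) or on "Yes"/"No" (binary). The helper names `normalize` and `accuracy` and the model predictions are hypothetical and are not taken from the VideoNet evaluation code; only the labels come from the examples above.

```python
def normalize(answer: str) -> str:
    """Strip whitespace and trailing periods, then lowercase a raw answer."""
    return answer.strip().strip(".").lower()


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Return the percentage of predictions that match their labels exactly."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Ground-truth answers copied from the example questions above.
mcq_labels = ["B", "D", "D", "C", "D"]
binary_labels = ["Yes", "Yes", "No", "Yes", "No"]

# Hypothetical model outputs, invented purely for illustration.
mcq_predictions = ["B", "D", "A", "C", "D"]
binary_predictions = ["yes", "Yes.", "Yes", "yes", "No"]

mcq_acc = accuracy(mcq_predictions, mcq_labels)
binary_acc = accuracy(binary_predictions, binary_labels)
print(f"Multiple-choice accuracy: {mcq_acc:.1f}%")
print(f"Binary accuracy: {binary_acc:.1f}%")
```

Normalizing before comparison keeps harmless formatting differences (case, a trailing period) from counting as errors; stricter or looser answer parsing is a design choice this sketch does not settle.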
The ground-truth labels of our clips are estimated to be $97\%$ accurate (see §3.3 of the paper).
All accuracies are reported as percentages (%). Note the gap between closed and open models.
| Model | Multiple-Choice | Binary 0-shot | Binary 3-shot |
|---|---|---|---|
| Gemini 3.1 Pro | 69.9 | 72.0 | 67.2 |
| Gemini 3 Flash | 68.7 | 70.3 | 75.1 |
| GPT-5.4 | 68.0 | 72.4 | 76.3 |
| GPT-5 | 67.6 | 72.9 | 75.4 |
| Molmo2-4B (FT) | 53.5 | 66.6 | - |
| Qwen3VL-8B | 45.0 | 59.3 | 66.2 |
| InternVL3.5-8B | 44.1 | 57.4 | 60.5 |
| Molmo2-8B | 44.9 | 55.8 | - |
| Molmo2-4B (base) | 42.0 | 55.3 | - |
| Non-expert Humans (with definitions) | 68.5 | 69.1 | 82.7 |
Fine-tuning a 4B model on our training data improves MCQ performance by 11.5 points.
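The fine-tuning gain can be checked directly against the table. The sketch below copies the relevant rows and recomputes the differences; the variable names are illustrative, and the binary 0-shot gain and the remaining gap to the strongest closed model are derived here from the table rather than stated in the text.

```python
# Accuracies (%) copied from the results table: (multiple-choice, binary 0-shot).
results = {
    "Gemini 3.1 Pro": (69.9, 72.0),
    "Molmo2-4B (FT)": (53.5, 66.6),
    "Molmo2-4B (base)": (42.0, 55.3),
}

base_mcq, base_bin0 = results["Molmo2-4B (base)"]
ft_mcq, ft_bin0 = results["Molmo2-4B (FT)"]
best_mcq, _ = results["Gemini 3.1 Pro"]

# Gains from fine-tuning, rounded to the table's one-decimal precision.
mcq_gain = round(ft_mcq - base_mcq, 1)
bin0_gain = round(ft_bin0 - base_bin0, 1)

# Remaining multiple-choice gap between the fine-tuned 4B model and the top row.
remaining_gap = round(best_mcq - ft_mcq, 1)

print(f"MCQ gain from fine-tuning: {mcq_gain} points")
print(f"Binary 0-shot gain from fine-tuning: {bin0_gain} points")
print(f"Remaining MCQ gap to Gemini 3.1 Pro: {remaining_gap} points")
```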
[Interactive plot: human performance is marked with stars, closed models with triangles, and open models with circles.]
On average, VLMs improve by 3 points from $k=0$ to $k=3$ in-context examples, whereas humans improve by 13.6 points (from 69.1 to 82.7).
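As a rough check of the few-shot comparison, the sketch below recomputes the change from binary 0-shot to binary 3-shot for every table row that reports both columns. It assumes the table rows are representative; the plot may include models that the table omits, so the mean computed here need not match the stated average exactly.

```python
from statistics import mean

# Binary accuracies (%) copied from the results table: (0-shot, 3-shot).
# Models without a 3-shot result are left out.
vlm_binary = {
    "Gemini 3.1 Pro": (72.0, 67.2),
    "Gemini 3 Flash": (70.3, 75.1),
    "GPT-5.4": (72.4, 76.3),
    "GPT-5": (72.9, 75.4),
    "Qwen3VL-8B": (59.3, 66.2),
    "InternVL3.5-8B": (57.4, 60.5),
}
human_binary = (69.1, 82.7)

# Per-model change in accuracy when three in-context examples are added.
vlm_gains = {
    name: round(three_shot - zero_shot, 1)
    for name, (zero_shot, three_shot) in vlm_binary.items()
}
mean_vlm_gain = round(mean(vlm_gains.values()), 1)
human_gain = round(human_binary[1] - human_binary[0], 1)

for name, gain in vlm_gains.items():
    print(f"{name}: {gain:+.1f}")
print(f"Mean VLM gain over these rows: {mean_vlm_gain:+.1f}")
print(f"Human gain: {human_gain:+.1f}")
```

Reporting the per-model changes alongside the mean makes it easy to see whether any single model moves against the overall trend.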
@misc{yadav2026videonet,
title={VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition},
author={Tanush Yadav and Mohammadreza Salehi and Jae Sung Park and Vivek Ramanujan and Hannaneh Hajishirzi and Yejin Choi and Ali Farhadi and Rohun Tripathi and Ranjay Krishna},
year={2026},
eprint={2605.02834},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.02834},
}
This project was partially funded by a grant from Apple.