VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

Tanush Yadav1,2, Mohammadreza Salehi1,2, Jae Sung Park1,2, Vivek Ramanujan1,
Hannaneh Hajishirzi1,2, Yejin Choi3, Ali Farhadi1,2, Rohun Tripathi2†, Ranjay Krishna1,2†

1 University of Washington    2 Allen Institute for AI    3 Stanford University

VideoNet contains 1,000 actions across 37 domains.

TLDR: We revitalize action recognition in the era of VLMs by focusing on domain-specific actions.

Benchmark

We provide two evaluation settings: multiple-choice and binary few-shot.

Multiple-Choice Examples

Example domains: Pen Spinning, Figure Skating, Yo-yo, Break Dance, Crochet.

Q: Which of the following Pen Spinning actions is shown in the video?

  A. warped sonic
  B. twisted sonic reverse
  C. charge reverse
  D. devil's sonic

A: B

Q: Which of the following Figure Skating actions is shown in the video?

  A. double salchow
  B. double toe loop
  C. double flip
  D. double loop

A: D

Q: Which of the following Yo-yo actions is shown in the video?

  A. jade whip
  B. bee sting
  C. plastic whip
  D. iron whip

A: D

Q: Which of the following Break Dance actions is shown in the video?

  A. ufo
  B. hand hop
  C. jackhammer
  D. 1990 spin

A: C

Q: Which of the following Crochet actions is shown in the video?

  A. bullion stitch
  B. front post double crochet (fpdc)
  C. puff stitch
  D. bobble stitch

A: D
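
The multiple-choice setting can be sketched as a prompt formatter plus an accuracy score. This is a minimal illustration, not the authors' evaluation code: the function names `format_mcq` and `mcq_accuracy` are hypothetical, the options are lettered A to D to match the answer keys, and the model predictions are made up for the example.

```python
from string import ascii_uppercase


def format_mcq(domain: str, options: list[str]) -> str:
    """Render one multiple-choice question with lettered options (hypothetical helper)."""
    lines = [f"Q: Which of the following {domain} actions is shown in the video?", ""]
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"  {letter}. {option}")
    return "\n".join(lines)


def mcq_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Return the percentage of predicted letters that match the answer key."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p.strip().upper() == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


prompt = format_mcq(
    "Pen Spinning",
    ["warped sonic", "twisted sonic reverse", "charge reverse", "devil's sonic"],
)
print(prompt)

# Answer keys of the five examples above; the predictions are invented for illustration.
answers = ["B", "D", "D", "C", "D"]
predictions = ["B", "D", "A", "C", "D"]
print(mcq_accuracy(predictions, answers))  # 80.0
```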

Binary Few-Shot Examples

Example domains: Skateboarding, Tap Dance, Juggling, Cricket, Football.

In this example, $k=1$ in-context example is provided.

Q: The following video shows a Ghetto Bird, which is an action in Skateboarding.

Now consider the following video. Does it also show a Ghetto Bird?

A: Yes

In this example, $k=3$ in-context examples are provided.

Q: The following three videos show a Back Essence, which is an action in Tap Dance.

Now consider the following video. Does it also show a Back Essence?

A: Yes

In this example, $k=1$ in-context example is provided.

Q: The following video shows Rubenstein's Revenge, which is an action in Juggling.

Now consider the following video. Does it also show Rubenstein's Revenge?

A: No

(If you're curious, the test clip shows Burke's Barrage.)

In this example, no in-context examples are provided (i.e., $k=0$).

Q: Recall that a Doosra is an action in Cricket. Does the following video show a Doosra?

A: Yes

In this example, $k=2$ in-context examples are provided.

Q: The following two videos show Running Into the Kicker, which is a Penalty in American Football.

Now consider the following video. Does it also show Running Into the Kicker?

A: No

(If you're curious, the test clip shows Roughing the Kicker.)
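
The question wording above varies only with the number of in-context examples $k$, which the following sketch reproduces. It is an assumption-laden illustration rather than the authors' prompt code: `build_binary_prompt` and its `article` parameter are hypothetical, only $k \in \{0, 1, 2, 3\}$ is handled, and the placement of the video inputs themselves is not modeled.

```python
NUMBER_WORDS = {2: "two", 3: "three"}


def build_binary_prompt(action: str, domain: str, k: int, article: str = "a ") -> str:
    """Build the text of a binary question with k in-context examples (hypothetical helper).

    Pass article="" for names that take no article, such as "Rubenstein's Revenge".
    """
    name = f"{article}{action}"
    if k == 0:
        # No reference clips: the model must recall the action from its name alone.
        return (
            f"Q: Recall that {name} is an action in {domain}. "
            f"Does the following video show {name}?"
        )
    if k == 1:
        intro = f"The following video shows {name}"
    elif k in NUMBER_WORDS:
        intro = f"The following {NUMBER_WORDS[k]} videos show {name}"
    else:
        raise ValueError("this sketch only covers k in {0, 1, 2, 3}")
    return (
        f"Q: {intro}, which is an action in {domain}.\n\n"
        f"Now consider the following video. Does it also show {name}?"
    )


print(build_binary_prompt("Ghetto Bird", "Skateboarding", k=1))
print(build_binary_prompt("Doosra", "Cricket", k=0))
```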

Expert Verification

The ground-truth labels of our clips are estimated to be $97\%$ accurate (see §3.3 of the paper).


Results

Domain-Specific Action Recognition

All accuracies are reported as percentages (%). Note the gap between closed and open models.

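| Model | Multiple-Choice | Binary 0-shot | Binary 3-shot |
| --- | --- | --- | --- |
| Gemini 3.1 Pro | 69.9 | 72.0 | 67.2 |
| Gemini 3 Flash | 68.7 | 70.3 | 75.1 |
| GPT-5.4 | 68.0 | 72.4 | 76.3 |
| GPT-5 | 67.6 | 72.9 | 75.4 |
| Molmo2-4B (FT) | 53.5 | 66.6 | - |
| Qwen3VL-8B | 45.0 | 59.3 | 66.2 |
| Non-expert Humans (with definitions) | 68.5 | 69.1 | 82.7 |
| InternVL3.5-8B | 44.1 | 57.4 | 60.5 |
| Molmo2-8B | 44.9 | 55.8 | - |
| Molmo2-4B (base) | 42.0 | 55.3 | - |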

Fine-tuning Molmo2-4B on our training data improves multiple-choice accuracy by 11.5 points (from 42.0 to 53.5).

Takeaway: Models benefit significantly from domain-specific training data.
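
As a quick check of the fine-tuning gain, the sketch below subtracts the Molmo2-4B (base) row from the Molmo2-4B (FT) row of the results table. The dictionary layout and key names are illustrative assumptions; the numbers are copied from the table.

```python
# Accuracies (%) copied from the results table; key names are illustrative.
molmo2_4b = {
    "base": {"multiple_choice": 42.0, "binary_0shot": 55.3},
    "ft": {"multiple_choice": 53.5, "binary_0shot": 66.6},
}

# Gain of the fine-tuned model over the base model, in percentage points.
gains = {
    setting: round(molmo2_4b["ft"][setting] - molmo2_4b["base"][setting], 1)
    for setting in molmo2_4b["base"]
}
print(gains)  # {'multiple_choice': 11.5, 'binary_0shot': 11.3}
```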

Video In-Context Learning

Figure: Impact of Few-Shot Examples on Binary Performance. Human performance is marked with stars, closed models with triangles, and open models with circles.

On average, VLMs improve by 3 points from $k=0$ to $k=3$. Humans improve by 13.9 points.

Takeaway: Humans excel at exploiting few-shot video examples, while VLMs struggle.
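
The few-shot gain can be approximated from the results table by averaging the change from Binary 0-shot to Binary 3-shot. This sketch assumes the average is taken over the six VLMs that report both columns, which the page does not confirm. It gives about 2.7 points, consistent with the rounded figure of 3. The table's human row (with definitions) gives 82.7 - 69.1 = 13.6 points, slightly below the 13.9 quoted above. The source of that difference is not stated on this page.

```python
from statistics import mean

# (Binary 0-shot, Binary 3-shot) accuracies in %, copied from the results table.
binary = {
    "Gemini 3.1 Pro": (72.0, 67.2),
    "Gemini 3 Flash": (70.3, 75.1),
    "GPT-5.4": (72.4, 76.3),
    "GPT-5": (72.9, 75.4),
    "Qwen3VL-8B": (59.3, 66.2),
    "InternVL3.5-8B": (57.4, 60.5),
}

# Per-model change from k=0 to k=3, in percentage points.
vlm_gains = {model: round(three - zero, 1) for model, (zero, three) in binary.items()}
avg_vlm_gain = mean(vlm_gains.values())

# Human row from the same table (with definitions).
human_gain = round(82.7 - 69.1, 1)

print(vlm_gains["Gemini 3.1 Pro"])  # -4.8 (the only model that gets worse)
print(round(avg_vlm_gain, 2))       # 2.73
print(human_gain)                   # 13.6
```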

Citation

@misc{yadav2026videonet,
    title={VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition}, 
    author={Tanush Yadav and Mohammadreza Salehi and Jae Sung Park and Vivek Ramanujan and Hannaneh Hajishirzi and Yejin Choi and Ali Farhadi and Rohun Tripathi and Ranjay Krishna},
    year={2026},
    eprint={2605.02834},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2605.02834}, 
}

Acknowledgements

This project was partially funded by a grant from Apple.