YOLOv11 and RF-DETR for video object detection: adding temporal features with a superframe
2026/03/27
If you search for "object detection", models of the YOLO family in their various versions show up among the results almost immediately. This is not surprising: they are popular, fast, fairly easy to use, and well suited to applied tasks. Models like YOLO and RF-DETR usually work the same way: an image goes in, and object predictions come out.
This approach is convenient and works well for still images. But video is not just a collection of independent frames.
Why video is different from an image
When we apply YOLOv5, YOLOv8, YOLOv11 or RF-DETR to video, most often each frame is processed as a separate picture.
This has an important limitation.
Video has temporal structure. Each frame is connected to its neighbors. Movement, direction, acceleration, the game action itself - all of this unfolds over time. If you process frames one at a time, this connection is lost, and you then have to try to restore it in post-processing.
I wanted to test a simpler option: is it possible to add temporal information without reworking the YOLO or RF-DETR architecture itself?
Superframe idea
For this I decided to use a superframe.
A superframe is a single input frame assembled from three adjacent video frames.
The idea is simple:
each frame is converted to grayscale,
three adjacent frames are taken,
they are then placed along the channels of a single image.
That is, instead of carrying color as usual, the three channels now encode time.
For a 30 FPS video, three consecutive frames cover about 0.1 seconds of temporal context (3 × 1/30 s):
R = frame -1,
G = current frame,
B = frame +1.
If you increase the step to ±2, the window from frame -2 to frame +2 covers about 0.17 seconds.
The result is that the model still receives a standard three-channel image, but motion information is now encoded in that image.
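The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the experiments: the function names `to_gray` and `make_superframe` are hypothetical, frames are assumed to be H x W x 3 uint8 arrays in RGB order, and the grayscale conversion uses the common BT.601 luma weights.

```python
import numpy as np


def to_gray(frame: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB uint8 frame to an H x W uint8 grayscale image."""
    # BT.601 luma weights; an assumption, any grayscale conversion would do.
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame.astype(np.float32) @ weights
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


def make_superframe(frames, index, step=1):
    """Stack grayscale frames index-step, index, index+step as channels R, G, B."""
    if index - step < 0 or index + step >= len(frames):
        raise IndexError("superframe window falls outside the clip")
    channels = [to_gray(frames[index + offset]) for offset in (-step, 0, step)]
    return np.stack(channels, axis=-1)


# Tiny synthetic clip: five random 4 x 6 RGB frames.
rng = np.random.default_rng(0)
clip = [rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8) for _ in range(5)]

superframe = make_superframe(clip, index=2)
print(superframe.shape, superframe.dtype)
```

The output has the same shape and dtype as an ordinary image, so it can be passed to an image detector unchanged. Note that frames decoded with OpenCV arrive in BGR order, so the weights would need to be reversed in that case.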
Why this might be useful
In this approach, we sacrifice color but keep a short temporal context. For many sports applications, especially with a fixed camera, this turns out to be more important than accurate color information.
A superframe can be seen as an easy way to get a temporally aware representation without going straight to full-fledged video architectures.
In practice, a useful effect appears: moving objects begin to stand out, and the static background remains more stable.
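One way to see this effect numerically is to measure, for every pixel, how much the three time channels disagree. The sketch below is only an illustration on synthetic data: `motion_map` is a hypothetical helper, and the "ball" is a bright 2 x 2 square that shifts to the right between frames.

```python
import numpy as np


def motion_map(superframe: np.ndarray) -> np.ndarray:
    """Per-pixel spread between the three time channels (0 means static)."""
    channels = superframe.astype(np.int16)  # avoid uint8 wrap-around on subtraction
    return (channels.max(axis=-1) - channels.min(axis=-1)).astype(np.uint8)


# Gray background with a bright 2 x 2 "ball" moving to the right.
height, width = 8, 12
channels = []
for x in (2, 5, 8):  # ball column in frame -1, current frame, frame +1
    gray = np.full((height, width), 100, dtype=np.uint8)
    gray[3:5, x:x + 2] = 255
    channels.append(gray)
superframe = np.stack(channels, axis=-1)

spread = motion_map(superframe)
print(int(spread[0, 0]))  # background pixel: identical in all three channels
print(int(spread[3, 5]))  # pixel covered by the ball only in the current frame
```

Static pixels have equal values in all three channels and look gray when the superframe is displayed as RGB, while the moving ball leaves a differently colored trace at each of its three positions.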
First experiment: ball detection
I first used this approach to detect a volleyball in a video from a stationary camera.
This is a good test scenario because the ball is a small object, difficult to detect, and highly dependent on motion.
The experiment showed that the superframe helps improve the detection of small moving objects, a volleyball in particular.
After this, it became interesting to see if the same idea could be used for a more complex problem.
Moving on to annotating game actions

After the ball experiment, we decided to test this approach on the task of detecting game actions.
A quick review showed that in volleyball, various versions of YOLO are already actively used to detect game elements and actions, such as:
serve,
reception,
pass,
attack.
But the superframe approach is non-standard and has a practical disadvantage: there is no ready-made annotation tool for it.
Custom annotation tool
To solve this problem, we made a small video annotation tool - Volleyball Action Annotator (VAA).
Volleyball Action Annotator (VAA): https://github.com/asigatchov/VAA
The tool was created specifically for annotating game actions in volleyball, both in regular frame viewing mode and in superframe mode.
This allowed us to build our own annotation methodology for the task.
Methodology for annotating game actions
The annotation is built around the moment the ball is touched. For one game action we usually take 3-4 frames. The key frame is the frame in which the ball is touched. Relative to it we label:
the frame before the touch,
the touch frame,
the frame after the touch.
In normal RGB viewing mode, this shows the action in the familiar way.
In SUPERFRAME mode, each such moment additionally contains adjacent frames in a window of ±1 frame. Thus, the annotation takes into account not only the moment of contact itself, but also its immediate temporal context.
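To make the frame bookkeeping concrete, here is a small sketch of which source frames end up in each labeled superframe. The `LabeledFrame` structure and `frames_for_action` function are hypothetical and are not the VAA data format; the sketch assumes three labeled frames per action and a ±1 window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledFrame:
    action: str       # e.g. "attack"
    frame_index: int  # frame that carries the label
    window: tuple     # source frames packed into its superframe (R, G, B)


def frames_for_action(action, touch_frame, step=1):
    """Label the frame before the touch, the touch frame and the frame after."""
    labeled = []
    for center in (touch_frame - 1, touch_frame, touch_frame + 1):
        window = (center - step, center, center + step)
        labeled.append(LabeledFrame(action, center, window))
    return labeled


labeled = frames_for_action("attack", touch_frame=120)
for item in labeled:
    print(item.frame_index, item.window)
```

With a touch at frame 120, the three labeled superframes together draw on source frames 118 through 122, so one annotated action covers five consecutive video frames.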
Frame overlay creates an effect similar to a pseudo-attention map for motion. Moving objects become more visible against static backgrounds because their position changes between channels while the background remains relatively stable.
For sports video from a stationary camera, this is especially useful.
Why the dataset had to be collected from scratch
Because of the specifics of this format, existing datasets could not be used for the experiment as they were.
Therefore, we had to collect the dataset from scratch: think through the annotation logic, build a tool, and create our own data preparation pipeline.
This is one of the main disadvantages of the approach: the idea itself is simple, but the process of data preparation is non-standard.
However, this can be a good compromise if you want to improve the quality of action detection without going straight to heavy video models.
First results of training
After labeling the first 50 game actions, we trained RF-DETR for 30 epochs.
Even with such a small starting dataset, the first model already proved useful: it speeds up the second round of annotation by making preliminary predictions that can then be quickly corrected by hand.
The result is a convenient iterative loop:
annotate a small starting dataset,
train the first model,
use the model to speed up the next annotation round,
repeat the cycle.
For experimental sports datasets, this approach can significantly reduce the cost and time of annotation.
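The pre-annotation step of this loop can be sketched as turning model predictions into draft label files for a human to correct. Everything here is an assumption made for illustration: the prediction dictionaries (`class_id`, `confidence`, `box` in pixel `x1, y1, x2, y2`) are a hypothetical format, the `min_confidence` threshold is arbitrary, and the output uses the YOLO text label format (class index followed by normalized center and size), which may differ from what the annotation tool actually stores.

```python
def to_yolo_lines(predictions, image_width, image_height, min_confidence=0.3):
    """Convert pixel-space predictions into draft YOLO-format label lines."""
    lines = []
    for pred in predictions:
        if pred["confidence"] < min_confidence:
            continue  # leave uncertain detections for the annotator to add by hand
        x1, y1, x2, y2 = pred["box"]
        x_center = (x1 + x2) / 2 / image_width
        y_center = (y1 + y2) / 2 / image_height
        width = (x2 - x1) / image_width
        height = (y2 - y1) / image_height
        lines.append(
            f"{pred['class_id']} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
        )
    return lines


# Hypothetical predictions for one 1000 x 500 superframe.
predictions = [
    {"class_id": 3, "confidence": 0.82, "box": (100, 200, 300, 400)},
    {"class_id": 1, "confidence": 0.10, "box": (10, 10, 50, 50)},
]
lines = to_yolo_lines(predictions, image_width=1000, image_height=500)
print("\n".join(lines))
```

The threshold trades annotator effort in two directions: a low value means more wrong boxes to delete, a high value means more missed actions to add manually.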
Conclusion
The main idea of this approach is very simple: instead of completely reworking the detector for video, you can try to embed temporal information directly into the input image.
A superframe is an easy way to add temporal context to standard image detector models like YOLO or RF-DETR. In the case of volleyball video from a stationary camera, the approach has already shown benefits both for ball detection and for annotating game actions.
There are still many areas for development: choosing the best time step, enlarging the dataset, expanding the set of actions, and comparing the approach with full-fledged video architectures. But it is already clear that this is an interesting practical compromise between frame-by-frame detection and full-fledged video understanding.
To be continued.