Region-Constraint In-Context Generation for Instructional Video Editing
Abstract
The in-context generation paradigm has recently demonstrated strong power in instructional image editing, offering both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specifying editing regions, the results can suffer from inaccurate editing regions and token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that delves into constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, conducted on one-step backward-denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between the source and target videos while reducing that of non-editing areas, emphasizing modification within the editing area and alleviating unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to their counterpart tokens in the source video, thereby mitigating interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
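The two regularization terms can be sketched as follows. This is a minimal illustration under assumed conventions, not the paper's exact formulation: the latent shapes, the binary editing-region mask, the sign of the latent loss, and the token indexing over the width-wise concatenated sequence are all assumptions.

```python
import numpy as np

def latent_regularization(z_src, z_tgt, mask):
    """Latent regularization sketch.

    z_src, z_tgt: one-step backward-denoised latents, shape (T, H, W, C).
    mask: binary editing-region mask, shape (T, H, W), 1 inside the edit region.
    Minimizing this term shrinks the source/target discrepancy outside the
    editing region while growing it inside (sign convention is an assumption).
    """
    diff = np.mean((z_tgt - z_src) ** 2, axis=-1)                 # (T, H, W)
    edit = (diff * mask).sum() / (mask.sum() + 1e-6)              # mean inside edit region
    keep = (diff * (1 - mask)).sum() / ((1 - mask).sum() + 1e-6)  # mean outside it
    return keep - edit

def attention_regularization(attn, tgt_edit_idx, src_edit_idx):
    """Attention regularization sketch.

    attn: row-normalized attention map over the concatenated source+target
          token sequence, shape (N, N). tgt_edit_idx / src_edit_idx: indices
    of editing-region tokens in the target and source halves, respectively.
    Minimizing this term suppresses attention from target editing-region
    tokens to their source counterparts.
    """
    return attn[np.ix_(tgt_edit_idx, src_edit_idx)].mean()
```

In training, both scalars would be weighted and added to the diffusion objective; the weights are hyperparameters not shown here.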
Examples 1: Replacement
Here we demonstrate our model's capability in content replacement. ReCo can replace objects in the original video with new ones, achieving natural and seamless compositing.
Edited prompt: Replace the man with a cartoon penguin.
Edited prompt: Replace the woman with a white swan head looking sideways.
Edited prompt: Replace the black beanie with a big fluffy white fur hat.
Edited prompt: Replace the boy with a chimpanzee wearing glasses pointing at the board.
Edited prompt: Replace the black robot vehicle with a white Jeep SUV.
Edited prompt: Replace the man's head with a white cartoon mask and a black hat.
Edited prompt: Replace the woman on the left with a colorful cartoon llama.
Edited prompt: Replace the woman in the field with a young deer.
Edited prompt: Replace the red-haired woman with a cartoon gnome wearing a red hat and blue coat.
Edited prompt: Replace the girl on the right with a cute white unicorn toy.
Edited prompt: Replace the white lamb with a fluffy brown alpaca being petted.
Edited prompt: Replace the woman on the left with a brightly colored cartoon unicorn head.
Edited prompt: Replace the man in the passenger seat with a large yellow teddy bear.
Edited prompt: Replace the woman with a cute cartoon panda holding a clipboard.
Edited prompt: Replace the brown goat grazing in the field with a white llama.
Edited prompt: Replace the crystal stone with a round pastry with orange filling.
Examples 2: Addition
Here we demonstrate our model's capability in content addition. Our method can insert various objects into the original video while achieving natural and seamless compositing.
Edited prompt: Add a sliced cornbread onto the wooden cutting board.
Edited prompt: Add a deer crossing the road.
Edited prompt: Add a flower wreath on the man's head.
Edited prompt: Add a white cartoon character with big black eyes and rosy cheeks next to the woman.
Edited prompt: Add an otter wearing sunglasses next to the man.
Edited prompt: Add a small gray mouse in front of the cat.
Edited prompt: Add a small, golden crown perched on top of the seal's head.
Edited prompt: Add a large, glowing purple crystal standing on the right of the man.
Edited prompt: Add a fierce brown bear standing inside the wooden structure.
Edited prompt: Add a green apron worn by the man.
Edited prompt: Add a large red mushroom with white spots to the left of the woman.
Edited prompt: Add a large black gorilla dancing on the left side.
Examples 3: Removal
Here we demonstrate our model's capability in content removal. Our method can eliminate specific objects from the original video while producing natural and visually coherent results.
Edited prompt: Remove the woman in the white robe on the right.
Edited prompt: Remove the woman in the dark blue shirt.
Edited prompt: Remove the helicopter from the background.
Edited prompt: Remove the seal lying on the sand.
Edited prompt: Remove the man standing behind the car.
Edited prompt: Remove the large white cartoon character on the left.
Edited prompt: Remove the bird on the trunk.
Edited prompt: Remove the white cockatoo from the branch.
Edited prompt: Remove the man sitting on the far left.
Edited prompt: Remove the cartoon penguin from the couch.
Edited prompt: Remove the woman in the leather jacket.
Edited prompt: Remove the woman with glasses walking on the left.
Examples 4: Stylization
Here we demonstrate our model's capability in video stylization. ReCo can generate vivid results while preserving the spatial structure well.
Edited prompt: Convert the video into a flat vector illustration style.
Edited prompt: Convert the video into a watercolor painting style.
Edited prompt: Convert the video into a 3D Chibi style.
Edited prompt: Convert the video into a 3D animated movie style.
Edited prompt: Convert the video into a LEGO style.
Edited prompt: Convert the video into a comic book style.
Edited prompt: Convert the video into a bright 2D cartoon style.
Edited prompt: Convert the video into a Japanese anime style.
Edited prompt: Convert the video into a Paper Cutting style.
Edited prompt: Convert the video into a Simpsons-esque style.
Edited prompt: Convert the video into a claymation style.
Edited prompt: Convert the video into a saturated vector art style.
ReCo-Data Visualization
ReCo-Data is a large-scale, high-quality video editing dataset comprising 500K+ instruction-video pairs. Below we present the statistics and the collection pipeline.
Statistics
Figure 1. (a) Overview of data scale. (b) Task distribution showing balanced quantities:
Replace (156.6K), Style (130.6K), Remove (121.6K), and Add (115.6K). Human evaluation on 200 randomly sampled videos confirms that the proportion of high-quality data exceeds 90% for each task.
(c) Details of video information.
Collection Pipeline
Our data collection pipeline consists of six primary stages:
- (1) Raw data pre-processing: Filtering raw video data based on specific quality criteria.
- (2) Object segmentation: Extracting object masks from videos.
- (3) Instruction generation: Employing VLLM (i.e., Gemini-2.5-Flash-Thinking) to construct editing prompts.
- (4) Condition pair construction: Involving first frame editing and depth map generation to prepare the input conditions for VACE.
- (5) Video synthesis: Employing VACE to generate videos based on conditions.
- (6) Video filtering and re-captioning: Leveraging VLLM (i.e., Gemini-2.5-Flash-Thinking) again to filter out low-quality samples and re-caption the remaining videos.
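As an illustration only, the six stages can be organized as a simple sequential driver in which the filtering stages (1) and (6) drop a sample by returning None. The record schema and stage stubs below are hypothetical; the real pipeline invokes external tools (segmentation, Gemini-2.5-Flash-Thinking, VACE) at the corresponding steps.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Sample:
    """A candidate record flowing through the pipeline (hypothetical schema)."""
    video: str                            # path to the raw clip
    mask: Optional[str] = None            # object-mask path from stage (2)
    instruction: Optional[str] = None     # editing prompt from stage (3)
    conditions: dict = field(default_factory=dict)  # stage (4) outputs
    edited_video: Optional[str] = None    # stage (5) output
    caption: Optional[str] = None         # stage (6) re-caption

def run_pipeline(sample, stages):
    """Apply the stages in order; a stage returning None drops the sample."""
    for stage in stages:
        sample = stage(sample)
        if sample is None:  # filtered out by stage (1) or (6)
            return None
    return sample
```

Each stage is any callable taking and returning a `Sample`, so tool-specific implementations can be swapped in without changing the driver.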
The data synthesis process required approximately 76,800 GPU hours on NVIDIA RTX 4090 GPUs, while the VLLM (i.e., Gemini-2.5-Flash-Thinking) operations incurred a total cost of approximately $13,600. For complete technical specifications, task configurations, and implementation details, please refer to our supplementary materials and the ReCo-Data Details Card.
We visualize diverse instruction–video pairs from ReCo-Data. Each pair shows the source video (left) and the target video (right). The editing instruction appears over the target video when you hover over it.
Comparison with Previous Methods
We compare ReCo with previous instruction-based video editing methods across three tasks, as well as with captioning-based editing pipelines. Hover over edited videos to see the instruction prompt.