Region-Constraint In-Context Generation for Instructional Video Editing

1 University of Science and Technology of China     2 HiDream.ai Inc    

Abstract

The in-context generation paradigm has recently demonstrated strong data efficiency and synthesis quality in instructional image editing. Nevertheless, extending such in-context learning to instruction-based video editing is not trivial. Without explicitly specified editing regions, the results can suffer from inaccurate editing regions and from token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that models constraints between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, applied to one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between the source and target videos while reducing that of non-editing areas, emphasizing the modification of the editing area and alleviating unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to their counterpart tokens in the source video, thereby mitigating their interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
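
The width-wise concatenation and the two regularization terms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the loss formulas, so the function names, the single-channel nested-list latents, the squared-difference discrepancy, and the "outside minus inside" form are all assumptions chosen for readability.

```python
def concat_width(source, target):
    """Join two videos side by side; each is frames x rows x columns of values."""
    return [[s_row + t_row for s_row, t_row in zip(s_frame, t_frame)]
            for s_frame, t_frame in zip(source, target)]

def flatten(video):
    """Flatten frames x rows x columns into one list of per-position values."""
    return [value for frame in video for row in frame for value in row]

def mean(values):
    return sum(values) / len(values) if values else 0.0

def latent_regularizer(source, target, edit_mask):
    """Hypothetical loss: lower when the target differs from the source inside
    the editing region (mask == 1) and matches it elsewhere (mask == 0)."""
    sq_diff = [(t - s) ** 2 for s, t in zip(flatten(source), flatten(target))]
    mask = flatten(edit_mask)
    inside = [d for d, m in zip(sq_diff, mask) if m]
    outside = [d for d, m in zip(sq_diff, mask) if not m]
    return mean(outside) - mean(inside)

def attention_regularizer(attn, edit_mask):
    """Hypothetical loss: mean attention from target tokens in the editing
    region (rows of attn) to source tokens at the same positions (columns)."""
    idx = [i for i, m in enumerate(flatten(edit_mask)) if m]
    return mean([attn[q][k] for q in idx for k in idx])

# Toy example: one frame of 2 x 2 latents, editing region = top-left position.
source = [[[0.0, 0.0], [0.0, 0.0]]]
target = [[[2.0, 0.0], [0.0, 0.0]]]
edit_mask = [[[1, 0], [0, 0]]]
uniform_attn = [[0.25] * 4 for _ in range(4)]

joint = concat_width(source, target)
latent_loss = latent_regularizer(source, target, edit_mask)
attn_loss = attention_regularizer(uniform_attn, edit_mask)
```

In this toy case the target differs from the source only inside the mask, so `latent_regularizer` returns a negative value, which is the direction the abstract describes: a large change inside the editing region and none outside it.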

Examples 1: Replacement

Here we demonstrate our model's capability in content replacement. ReCo can replace objects in the original video with new ones, achieving natural and seamless compositing.

Input
Result

Edited prompt: Replace the man with a cartoon penguin.

Input
Result

Edited prompt: Replace the woman with a white swan head looking sideways.

Input
Result

Edited prompt: Replace the black beanie with a big fluffy white fur hat.

Input
Result

Edited prompt: Replace the boy with a chimpanzee wearing glasses pointing at the board.

Input
Result

Edited prompt: Replace the black robot vehicle with a white Jeep SUV.

Input
Result

Edited prompt: Replace the man's head with a white cartoon mask and a black hat.

Input
Result

Edited prompt: Replace the woman on the left with a colorful cartoon llama.

Input
Result

Edited prompt: Replace the woman in the field with a young deer.

Input
Result

Edited prompt: Replace the red-haired woman with a cartoon gnome wearing a red hat and blue coat.

Input
Result

Edited prompt: Replace the girl on the right with a cute white unicorn toy.

Input
Result

Edited prompt: Replace the white lamb with a fluffy brown alpaca being petted.

Input
Result

Edited prompt: Replace the woman on the left with a brightly colored cartoon unicorn head.

Input
Result

Edited prompt: Replace the man in the passenger seat with a large yellow teddy bear.

Input
Result

Edited prompt: Replace the woman with a cute cartoon panda holding a clipboard.

Input
Result

Edited prompt: Replace the brown goat grazing in the field with a white llama.

Input
Result

Edited prompt: Replace the crystal stone with a round pastry with orange filling.

Examples 2: Addition

Here we demonstrate our model's capability in content addition. Our method can insert various objects into the original video while achieving natural and seamless compositing.

Input
Result

Edited prompt: Add a sliced cornbread onto the wooden cutting board.

Input
Result

Edited prompt: Add a deer crossing the road.

Input
Result

Edited prompt: Add a flowers wreath on the man's head

Input
Result

Edited prompt: Add a white cartoon character with big black eyes and rosy cheeks next to the woman.

Input
Result

Edited prompt: Add an otter wearing sunglasses next to the man.

Input
Result

Edited prompt: Add a small gray mouse in front of the cat.

Input
Result

Edited prompt: Add a small, golden crown perched on top of the seal's head.

Input
Result

Edited prompt: Add a large, glowing purple crystal standing on the right of the man.

Input
Result

Edited prompt: Add a fierce brown bear standing inside the wooden structure.

Input
Result

Edited prompt: Add a green apron worn by the man.

Input
Result

Edited prompt: Add a large red mushroom with white spots to the left of the woman.

Input
Result

Edited prompt: Add a large black gorilla dancing on the left side.

Examples 3: Removal

Here we demonstrate our model's capability in content removal. Our method can eliminate specific objects from the original video while producing natural and visually coherent results.

Input
Result

Edited prompt: Remove the woman in the white robe on the right.

Input
Result

Edited prompt: Remove the woman in the dark blue shirt.

Input
Result

Edited prompt: Remove the helicopter from the background.

Input
Result

Edited prompt: Remove the seal lying on the sand.

Input
Result

Edited prompt: Remove the man standing behind the car.

Input
Result

Edited prompt: Remove the large white cartoon character on the left.

Input
Result

Edited prompt: Remove the bird on the trunk.

Input
Result

Edited prompt: Remove the white cockatoo from the branch.

Input
Result

Edited prompt: Remove the man sitting on the far left.

Input
Result

Edited prompt: Remove the cartoon penguin from the couch.

Input
Result

Edited prompt: Remove the woman in the leather jacket.

Input
Result

Edited prompt: Remove the woman with glasses walking on the left.

Examples 4: Stylization

Here we demonstrate our model's capability in video stylization. ReCo can generate vivid results while preserving the spatial structure well.

Input
Result

Edited prompt: Convert the video into a flat vector illustration style.

Input
Result

Edited prompt: Convert the video into a watercolor painting style.

Input
Result

Edited prompt: Convert the video into a 3D Chibi style.

Input
Result

Edited prompt: Convert the video into a 3D animated movie style.

Input
Result

Edited prompt: Convert the video into a LEGO style.

Input
Result

Edited prompt: Convert the video into a comic book style.

Input
Result

Edited prompt: Convert the video into a bright 2D cartoon style.

Input
Result

Edited prompt: Convert the video into a Japanese anime style.

Input
Result

Edited prompt: Convert the video into a Paper Cutting style.

Input
Result

Edited prompt: Convert the video into a Simpsons-esque style.

Input
Result

Edited prompt: Convert the video into a claymation style.

Input
Result

Edited prompt: Convert the video into a saturated vector art style.

ReCo-Data Visualization

ReCo-Data is a large-scale, high-quality video editing dataset comprising 500K+ instruction-video pairs. Below we present the statistics and the collection pipeline.

Statistics


Figure 1. (a) Overview of data scale. (b) Task distribution showing balanced quantities: Replace (156.6K), Style (130.6K), Remove (121.6K), and Add (115.6K). Human evaluation on 200 randomly sampled videos confirms that the proportion of high-quality data exceeds 90% for each task. (c) Details of video information.
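
As a quick consistency check on the caption, the four task counts can be summed and converted to shares. The snippet uses only the numbers reported above; the variable names are illustrative.

```python
# Task counts in thousands of instruction-video pairs, as reported in Figure 1(b).
task_counts_k = {"Replace": 156.6, "Style": 130.6, "Remove": 121.6, "Add": 115.6}

total_k = round(sum(task_counts_k.values()), 1)  # 524.4, consistent with "500K+"
shares = {task: count / total_k for task, count in task_counts_k.items()}

for task, share in shares.items():
    print(f"{task}: {share:.1%} of {total_k}K pairs")
```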

Collection Pipeline


Our data collection pipeline consists of six primary stages:

  • (1) Raw data pre-processing: Filtering raw video data based on specific quality criteria.
  • (2) Object segmentation: Extracting object masks from videos.
  • (3) Instruction generation: Employing VLLM (i.e., Gemini-2.5-Flash-Thinking) to construct editing prompts.
  • (4) Condition pair construction: Involving first frame editing and depth map generation to prepare the input conditions for VACE.
  • (5) Video synthesis: Employing VACE to generate videos based on conditions.
  • (6) Video filtering and re-captioning: Leveraging VLLM (i.e., Gemini-2.5-Flash-Thinking) again to filter out low-quality samples and re-caption the remaining videos.
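
The order of the six stages above, and the two points where samples are discarded, can be sketched as a simple chain of functions. This is only a structural illustration: every function name, the `ok` and `score` fields, and the 0.5 threshold are hypothetical placeholders, and none of the stubs calls a real segmentation model, Gemini, or VACE.

```python
def preprocess(samples):
    # (1) Keep only raw videos that pass a quality check (hypothetical "ok" flag).
    return [s for s in samples if s.get("ok", True)]

def tag(stage_key):
    # Build a stub stage that only records that it ran on each sample.
    def stage(samples):
        return [{**s, stage_key: True} for s in samples]
    return stage

def filter_and_recaption(samples):
    # (6) Drop low-quality outputs (hypothetical "score" field and threshold),
    # then mark the remaining samples as re-captioned.
    return [{**s, "recaptioned": True} for s in samples if s.get("score", 1.0) >= 0.5]

PIPELINE = [
    ("raw data pre-processing", preprocess),
    ("object segmentation", tag("mask")),
    ("instruction generation", tag("instruction")),
    ("condition pair construction", tag("conditions")),
    ("video synthesis", tag("video")),
    ("video filtering and re-captioning", filter_and_recaption),
]

def run(samples):
    # Apply the stages in order; each stage maps a list of samples to a list.
    for _name, stage in PIPELINE:
        samples = stage(samples)
    return samples

kept = run([{"id": 1}, {"id": 2, "ok": False}, {"id": 3, "score": 0.1}])
```

Here sample 2 is dropped at stage (1) and sample 3 at stage (6), so only sample 1 reaches the final dataset.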

The data synthesis process required approximately 76,800 GPU hours on NVIDIA RTX 4090, while the VLLM (i.e., Gemini-2.5-Flash-Thinking) operations incurred a total cost of approximately $13,600. For complete technical specifications, task configurations, and implementation details, please refer to our supplementary materials and the ReCo-Data Details Card.
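
To put the reported costs in perspective, they can be amortized over the final dataset. This is a rough back-of-the-envelope estimate, not a figure from the authors: it assumes the dataset size is the sum of the four task counts in Figure 1(b), and the reported costs presumably also cover samples that were later filtered out.

```python
gpu_hours = 76_800      # reported synthesis cost on NVIDIA RTX 4090
vlm_cost_usd = 13_600   # reported VLLM (Gemini-2.5-Flash-Thinking) cost
pairs = 524_400         # assumption: sum of the four task counts in Figure 1(b)

gpu_days = gpu_hours / 24                       # single-GPU days
gpu_minutes_per_pair = gpu_hours * 60 / pairs   # amortized over retained pairs
vlm_usd_per_pair = vlm_cost_usd / pairs

print(f"{gpu_days:.0f} GPU-days in total")
print(f"{gpu_minutes_per_pair:.1f} GPU-minutes per retained pair")
print(f"${vlm_usd_per_pair:.3f} of VLLM cost per retained pair")
```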

We visualize diverse instruction-video pairs from ReCo-Data. Each pair shows the source video (left) and the target video (right). The editing instruction appears when you hover over the target video.

Source
Target
Replace the small blonde-haired child wearing a pink top and shorts being held by the man with a yellow and brown stuffed monkey toy wearing a light blue top.
Source
Target
Replace the man wearing a dark green puffer jacket and a dark green knitted hat sitting in the grass with a panda mascot wearing a green puffer jacket and a brown knitted hat sitting in the grass.
Source
Target
Replace the dark brown acoustic guitar on the left with a brightly lit and patterned acoustic guitar.
Source
Target
Replace the white bowl filled with soil on the table with the small green plant in a dark pot.
Source
Target
Add a smiling woman with long brown hair, wearing a black long-sleeved top, black pants, and a black baseball cap, with her hands clasped, sitting on the left side of the frame.
Source
Target
Add a light blue and white robot with a smiling face, blue headphones, and a black lanyard around its neck, holding a red object in its right hand, on the left side of the image.
Source
Target
Add a tall, clear crystal sculpture with multiple sharp points and an ornate base on the right side of the image, next to the white door.
Source
Target
Add a furry, brown cartoon bear with a yellow snout and a dark blue apron, sitting on the couch.
Source
Target
Remove the animated girl with glowing purple and blue hair.
Source
Target
Remove the man wearing a gray shirt on the right.
Source
Target
Remove the large beetle-like character with a pink horn.
Source
Target
Remove the white device with the blue glowing light.
Source
Target
Convert the video into an American Cartoon style.
Source
Target
Convert the video into a Monet style.
Source
Target
Convert the video into an Irasutoya style.
Source
Target
Convert the video into a Watercolor Painting style.

Comparison with Previous Methods

We compare ReCo with previous instruction-based video editing methods across three tasks, and with captioning-based editing pipelines. Hover over edited videos to see the instruction prompt.

Input
InsViE
Lucy Edit
Ditto
ReCo (Ours)
Addition
Edited prompt: Add a small, golden crown perched on top of the seal's head.

Replacement
Edited prompt: Replace the man with a brown chimpanzee wearing a hoodie and typing on the laptop.

Stylization
Edited prompt: Convert the video into a Simpsons style.

Comparison on the object removal task. Hover over edited videos to see the instruction prompt.

Input
InsViE
VACE
ReCo (Ours)
Removal
Edited prompt: Remove the woman with glasses on the left.