Region-Constraint In-Context Generation for Instructional Video Editing
Abstract
The in-context generation paradigm has recently demonstrated strong power in instructional image editing, offering both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specifying editing regions, the results can suffer from inaccurate editing regions and token interference between editing and non-editing areas during denoising. To address these issues, we present ReCo, a new instructional video editing paradigm that delves into constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo concatenates the source and target videos width-wise for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, conducted on one-step backward-denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between the source and target videos while reducing that of non-editing areas, emphasizing modification within the editing area and alleviating unexpected content generation outside it. The latter suppresses the attention of tokens in the editing region to their counterpart tokens in the source video, thereby mitigating interference during novel object generation in the target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
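The two regularization terms can be sketched as follows. This is a minimal illustration under assumed conventions, not the paper's exact formulation: the latent shapes, the binary editing-region mask, the sign of the latent loss, and the token indexing over the width-wise concatenated sequence are all assumptions.

```python
import numpy as np

def latent_regularization(z_src, z_tgt, mask):
    """Latent regularization sketch.

    z_src, z_tgt: one-step backward-denoised latents, shape (T, H, W, C).
    mask: binary editing-region mask, shape (T, H, W), 1 inside the edit region.
    Minimizing this term shrinks the source/target discrepancy outside the
    editing region while growing it inside (sign convention is an assumption).
    """
    diff = np.mean((z_tgt - z_src) ** 2, axis=-1)                 # (T, H, W)
    edit = (diff * mask).sum() / (mask.sum() + 1e-6)              # mean inside edit region
    keep = (diff * (1 - mask)).sum() / ((1 - mask).sum() + 1e-6)  # mean outside it
    return keep - edit

def attention_regularization(attn, tgt_edit_idx, src_edit_idx):
    """Attention regularization sketch.

    attn: row-normalized attention map over the concatenated source+target
          token sequence, shape (N, N). tgt_edit_idx / src_edit_idx: indices
    of editing-region tokens in the target and source halves, respectively.
    Minimizing this term suppresses attention from target editing-region
    tokens to their source counterparts.
    """
    return attn[np.ix_(tgt_edit_idx, src_edit_idx)].mean()
```

In training, both scalars would be weighted and added to the diffusion objective; the weights are hyperparameters not shown here.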
Examples 1: Replacement
Here we demonstrate our model's capability in content replacement. ReCo can replace objects in the original video with new ones, achieving natural and seamless compositing.
Edited prompt: Replace the man with a cartoon penguin.
Edited prompt: Replace the woman with a white swan head looking sideways.
Edited prompt: Replace the black beanie with a big fluffy white fur hat.
Edited prompt: Replace the boy with a chimpanzee wearing glasses pointing at the board.
Edited prompt: Replace the black robot vehicle with a white Jeep SUV.
Edited prompt: Replace the man's head with a white cartoon mask and a black hat.
Edited prompt: Replace the woman on the left with a colorful cartoon llama.
Edited prompt: Replace the woman in the field with a young deer.
Edited prompt: Replace the red-haired woman with a cartoon gnome wearing a red hat and blue coat.
Edited prompt: Replace the girl on the right with a cute white unicorn toy.
Edited prompt: Replace the white lamb with a fluffy brown alpaca being petted.
Edited prompt: Replace the woman on the left with a brightly colored cartoon unicorn head.
Edited prompt: Replace the man in the passenger seat with a large yellow teddy bear.
Edited prompt: Replace the woman with a cute cartoon panda holding a clipboard.
Edited prompt: Replace the brown goat grazing in the field with a white llama.
Edited prompt: Replace the crystal stone with a round pastry with orange filling.
Examples 2: Addition
Here we demonstrate our model's capability in content addition. Our method can insert various objects into the original video while achieving natural and seamless compositing.
Edited prompt: Add a sliced cornbread onto the wooden cutting board.
Edited prompt: Add a deer crossing the road.
Edited prompt: Add a flower wreath on the man's head.
Edited prompt: Add a white cartoon character with big black eyes and rosy cheeks next to the woman.
Edited prompt: Add an otter wearing sunglasses next to the man.
Edited prompt: Add a small gray mouse in front of the cat.
Edited prompt: Add a small, golden crown perched on top of the seal's head.
Edited prompt: Add a large, glowing purple crystal standing on the right of the man.
Edited prompt: Add a fierce brown bear standing inside the wooden structure.
Edited prompt: Add a green apron worn by the man.
Edited prompt: Add a large red mushroom with white spots to the left of the woman.
Edited prompt: Add a large black gorilla dancing on the left side.
Examples 3: Removal
Here we demonstrate our model's capability in content removal. Our method can eliminate specific objects from the original video while producing natural and visually coherent results.
Edited prompt: Remove the woman in the white robe on the right.
Edited prompt: Remove the woman in the dark blue shirt.
Edited prompt: Remove the helicopter from the background.
Edited prompt: Remove the seal lying on the sand.
Edited prompt: Remove the man standing behind the car.
Edited prompt: Remove the large white cartoon character on the left.
Edited prompt: Remove the bird on the trunk.
Edited prompt: Remove the white cockatoo from the branch.
Edited prompt: Remove the man sitting on the far left.
Edited prompt: Remove the cartoon penguin from the couch.
Edited prompt: Remove the woman in the leather jacket.
Edited prompt: Remove the woman with glasses walking on the left.
Examples 4: Stylization
Here we demonstrate our model's capability in video stylization. ReCo can generate vivid results while preserving the spatial structure well.
Edited prompt: Convert the video into a flat vector illustration style.
Edited prompt: Convert the video into a watercolor painting style.
Edited prompt: Convert the video into a 3D Chibi style.
Edited prompt: Convert the video into a 3D animated movie style.
Edited prompt: Convert the video into a LEGO style.
Edited prompt: Convert the video into a comic book style.
Edited prompt: Convert the video into a bright 2D cartoon style.
Edited prompt: Convert the video into a Japanese anime style.
Edited prompt: Convert the video into a Paper Cutting style.
Edited prompt: Convert the video into a Simpsons-esque style.
Edited prompt: Convert the video into a claymation style.
Edited prompt: Convert the video into a saturated vector art style.
ReCo-Data Visualization
ReCo-Data is a large-scale, high-quality video editing dataset comprising 500K+ instruction-video pairs. Below we present the statistics and the collection pipeline.
Statistics
Figure 1. (a) Overview of data scale. (b) Task distribution showing balanced quantities:
Replace (156.6K), Style (130.6K), Remove (121.6K), and Add (115.6K). Human evaluation on 200 randomly sampled videos confirms that the proportion of high-quality data exceeds 90% for each task.
(c) Details of video information.
Collection Pipeline
Our data collection pipeline consists of six primary stages:
- (1) Raw data pre-processing: Filtering raw video data based on specific quality criteria.
- (2) Object segmentation: Extracting object masks from videos.
- (3) Instruction generation: Employing VLLM (i.e., Gemini-2.5-Flash-Thinking) to construct editing prompts.
- (4) Condition pair construction: Involving first frame editing and depth map generation to prepare the input conditions for VACE.
- (5) Video synthesis: Employing VACE to generate videos based on conditions.
- (6) Video filtering and re-captioning: Leveraging VLLM (i.e., Gemini-2.5-Flash-Thinking) again to filter out low-quality samples and re-caption the remaining videos.
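As an illustration only, the six stages can be organized as a simple sequential driver in which the filtering stages (1) and (6) drop a sample by returning None. The record schema and stage stubs below are hypothetical; the real pipeline invokes external tools (segmentation, Gemini-2.5-Flash-Thinking, VACE) at the corresponding steps.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Sample:
    """A candidate record flowing through the pipeline (hypothetical schema)."""
    video: str                            # path to the raw clip
    mask: Optional[str] = None            # object-mask path from stage (2)
    instruction: Optional[str] = None     # editing prompt from stage (3)
    conditions: dict = field(default_factory=dict)  # stage (4) outputs
    edited_video: Optional[str] = None    # stage (5) output
    caption: Optional[str] = None         # stage (6) re-caption

def run_pipeline(sample, stages):
    """Apply the stages in order; a stage returning None drops the sample."""
    for stage in stages:
        sample = stage(sample)
        if sample is None:  # filtered out by stage (1) or (6)
            return None
    return sample
```

Each stage is any callable taking and returning a `Sample`, so tool-specific implementations can be swapped in without changing the driver.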
The data synthesis process required approximately 76,800 GPU hours on NVIDIA RTX 4090 GPUs, while the VLLM (i.e., Gemini-2.5-Flash-Thinking) operations incurred a total cost of approximately $13,600. For complete technical specifications, task configurations, and implementation details, please refer to our supplementary materials and the ReCo-Data Details Card.
We visualize diverse instruction–video pairs from ReCo-Data. Each pair shows the source video (left) and the target video (right). The editing instruction appears over the target video when you hover over it.
Comparison with Previous Methods
We compare ReCo with previous instruction-based video editing methods across three tasks, as well as with captioning-based editing pipelines. Hover over edited videos to see the instruction prompt.