VideoSwap

Customized video subject swapping via point control

Problem Formulation

  • Subject replacement: change the video subject to a customized subject
  • Background preservation: keep the unedited background the same as in the source video

Gu et al., “VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence,” 2023.

✅ Requirements: the background stays consistent, the motion stays consistent, and only the foreground content is replaced.
✅ To this end, key points are extracted from the source video, and control is performed based on these key points.

P227

Motivation

  • Existing methods are promising, but the motion is often still not well aligned
  • Need to ensure precise correspondence of semantic points between the source and the target

✅ (1) Manually annotate the semantic points on each frame (only a small amount of annotation: 8 frames).
✅ (2) Use the point maps as the condition.

P228

Empirical Observations

  • Question: Can we learn semantic point control for a specific source video subject using only a small number of source video frames?
  • Toy Experiment: Manually define and annotate a set of semantic points on 8 frames; use such point maps as the condition for training a control module, i.e., a T2I-Adapter.

✅ The experiment shows that semantic points can serve as the control signal.
✅ Conclusion: the T2I model can generate new content following new point positions (a point-map rendering sketch follows below).
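A minimal sketch of how such a point-map condition could be built, assuming the annotated points are rasterized as Gaussian blobs on a one-channel map (the function name, blob rendering, and radius are my assumptions; VideoSwap itself later attaches learned per-point embeddings, see Steps 1 and 2 below):

```python
import torch

def rasterize_point_map(points, image_size=(512, 512), radius=4.0):
    """Render (x, y) semantic points into a one-channel condition map
    with a Gaussian blob at each point location."""
    h, w = image_size
    ys = torch.arange(h, dtype=torch.float32).view(h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, w)
    cond = torch.zeros(1, h, w)
    for x, y in points.tolist():
        dist2 = (xs - x) ** 2 + (ys - y) ** 2          # (h, w) squared distance
        cond[0] = torch.maximum(cond[0], torch.exp(-dist2 / (2 * radius ** 2)))
    return cond

# Eight annotated frames -> eight condition maps for adapter training.
frames_points = [torch.randint(0, 512, (10, 2)).float() for _ in range(8)]
cond_maps = torch.stack([rasterize_point_map(p) for p in frames_points])
print(cond_maps.shape)  # torch.Size([8, 1, 512, 512])
```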

P229

Empirical Observations

  • Observation 1: If we drag the points, the trained T2I-Adapter can generate new content based on the dragged points (a new condition) → it is feasible to use semantic points as a condition to control and maintain the source motion trajectory.

✅ The car's shape can also be changed by dragging a subset of the points.

P230

Empirical Observations

  • Observation 2: Further, we can drag the semantic points to control the subject’s shape

✅ The dashed box is a ControlNet-like module that extracts the semantic points and feeds them into the denoising module.
✅ Latent Blend better preserves the background information (see the blending sketch below).
✅ The blue part is the motion layer.
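A minimal sketch of the Latent Blend idea, assuming access to the source video's latents at each denoising step and a subject mask downsampled to the latent resolution (the function name and mask convention are mine):

```python
import torch

def latent_blend(edited_latent, source_latent, fg_mask):
    """Keep the edited content inside the subject region and copy the
    source latent everywhere else, so the background is preserved.

    edited_latent, source_latent: (B, C, h, w) latents at the same timestep.
    fg_mask: (B, 1, h, w) in [0, 1]; 1 marks the region being swapped.
    """
    return fg_mask * edited_latent + (1.0 - fg_mask) * source_latent

# Sketch of use inside a denoising loop: after each step, re-impose the
# source background (source_latents[t] = source video noised to step t).
# latent = latent_blend(latent, source_latents[t], subject_mask_latent)
```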

P231

Framework

  • Motion layer: use a pretrained and frozen AnimateDiff to ensure essential temporal consistency

  • ED-LoRA (Mix-of-Show): learn the concept to be customized

  • Key design aims:

    • Introduce semantic point correspondences to guide the motion trajectory
    • Reduce the human effort of annotating points

Gu et al. “Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models.” NeurIPS, 2023.
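A structural sketch of this trainable/frozen split, with stand-in modules (class and attribute names are hypothetical, not the released code):

```python
import torch.nn as nn

class VideoSwapSketch(nn.Module):
    """Structural sketch only: everything pretrained stays frozen;
    only the per-scale point MLPs are trained."""

    def __init__(self, unet_with_motion_layers, point_mlps):
        super().__init__()
        # T2V U-Net with the AnimateDiff motion layers and the ED-LoRA
        # (Mix-of-Show) concept weights already merged in -- kept fixed.
        self.unet = unet_with_motion_layers
        for p in self.unet.parameters():
            p.requires_grad_(False)
        # Small MLPs turning point embeddings into condition maps:
        # the only trainable parameters in the whole system.
        self.point_mlps = point_mlps

# Stand-in modules just to show which parameters remain trainable.
model = VideoSwapSketch(nn.Linear(4, 4), nn.ModuleList([nn.Linear(8, 8)]))
print([n for n, p in model.named_parameters() if p.requires_grad])
# ['point_mlps.0.weight', 'point_mlps.0.bias']
```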

P232

Step 1: Semantic Point Extraction

  • Reduce human effort in annotating points
    • The user defines points on one keyframe
    • Propagate them to the other frames with a point tracker/detector (see the sketch below)
  • Embedding

✅ What is a good representation for a semantic point?
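A minimal sketch of the propagation step, assuming correspondence is found by nearest-neighbor matching in a per-frame semantic feature map (DIFT-style); the actual system can use any point tracker or detector, and all names here are mine:

```python
import torch
import torch.nn.functional as F

def propagate_points(key_feat, key_points, other_feats):
    """Propagate keyframe points to the other frames by nearest-neighbor
    matching in a semantic feature map (DIFT-style correspondence).

    key_feat:    (C, H, W) feature map of the annotated keyframe.
    key_points:  (N, 2) integer (x, y) positions on that feature grid.
    other_feats: (T, C, H, W) feature maps of the remaining frames.
    Returns (T, N, 2) propagated (x, y) positions.
    """
    C, H, W = key_feat.shape
    queries = F.normalize(key_feat[:, key_points[:, 1], key_points[:, 0]].T, dim=1)
    trajectories = []
    for feat in other_feats:
        keys = F.normalize(feat.reshape(C, -1), dim=0)   # (C, H*W)
        idx = (queries @ keys).argmax(dim=1)             # best match per point
        trajectories.append(torch.stack([idx % W, idx // W], dim=1))
    return torch.stack(trajectories)

traj = propagate_points(torch.randn(64, 48, 48),
                        torch.tensor([[12, 30], [40, 8]]),
                        torch.randn(7, 64, 48, 48))
print(traj.shape)  # torch.Size([7, 2, 2])
```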

P233

Methodology – Step 1: Semantic Point Extraction on the source video

  • Reduce human effort in annotating points
  • Embedding
    • Extract DIFT embedding (intermediate U-Net feature) for each semantic point
    • Aggregate over all frames

❓ How is the embedding fed into the network?
✅ The network parameters themselves are fixed; several small MLPs are added to convert the embedding into condition maps at different scales, which serve as conditions for the U-Net (see the sketch below).
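A sketch of both pieces described above, assuming the semantic features are sampled at integer grid positions and averaged over frames, and that each (hypothetical) per-scale MLP scatters the projected embedding onto an otherwise empty condition map; dimensions are illustrative:

```python
import torch
import torch.nn as nn

def aggregate_point_embeddings(feats, points):
    """Average each point's intermediate U-Net (DIFT) feature over all
    frames to obtain one embedding per semantic point.

    feats:  (T, C, H, W) per-frame feature maps.
    points: (T, N, 2) per-frame (x, y) positions on the feature grid.
    Returns (N, C).
    """
    T = feats.shape[0]
    per_frame = torch.stack(
        [feats[t, :, points[t, :, 1], points[t, :, 0]].T for t in range(T)])
    return per_frame.mean(dim=0)

class PointToConditionMap(nn.Module):
    """One small MLP per U-Net scale: projects each point embedding and
    scatters it at the point's location on an otherwise empty map."""

    def __init__(self, embed_dim, cond_dim):
        super().__init__()
        self.cond_dim = cond_dim
        self.mlp = nn.Sequential(nn.Linear(embed_dim, cond_dim), nn.SiLU(),
                                 nn.Linear(cond_dim, cond_dim))

    def forward(self, point_embeds, points, size):
        h, w = size
        cond = point_embeds.new_zeros(self.cond_dim, h, w)
        cond[:, points[:, 1], points[:, 0]] = self.mlp(point_embeds).T
        return cond

embeds = aggregate_point_embeddings(torch.randn(8, 1280, 32, 32),
                                    torch.randint(0, 32, (8, 5, 2)))
cond = PointToConditionMap(1280, 320)(embeds, torch.randint(0, 32, (5, 2)), (32, 32))
print(cond.shape)  # torch.Size([320, 32, 32])
```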

P234

Methodology – Step 2: Semantic Point Registration on the source video

  • Introduce several learnable MLPs, corresponding to different scales
  • Optimize the MLPs
    • Point Patch Loss: restrict the diffusion loss to reconstructing the local patch around each point
    • Semantic-Enhanced Schedule: only sample higher timesteps in (0.5T, T), which prevents overfitting to low-level details (sketched below)

✅ In some scenarios, part of the semantic points need to be removed, or point positions need to be moved.
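A sketch of the two Step-2 training choices above, assuming square patches around each point on the latent grid and uniform timestep sampling in [0.5T, T); the patch size and loss normalization are my assumptions:

```python
import torch

def point_patch_loss(noise_pred, noise, points, patch=8):
    """Diffusion (noise-prediction) loss restricted to local patches
    around each semantic point, so training focuses on point semantics
    rather than reconstructing the whole frame.

    noise_pred, noise: (B, C, H, W); points: (B, N, 2) latent-grid (x, y).
    """
    B, C, H, W = noise.shape
    mask = torch.zeros(B, 1, H, W)
    for b in range(B):
        for x, y in points[b].tolist():
            mask[b, :, max(y - patch, 0):y + patch, max(x - patch, 0):x + patch] = 1.0
    return (mask * (noise_pred - noise) ** 2).sum() / mask.sum().clamp(min=1.0)

# Semantic-Enhanced Schedule: sample only the higher timesteps in
# [0.5T, T), which carry semantics rather than low-level detail.
T_steps = 1000
t = torch.randint(T_steps // 2, T_steps, (2,))

loss = point_patch_loss(torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64),
                        torch.randint(8, 56, (2, 5, 2)))
print(float(loss), t)
```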

P235

Methodology

  • After Step 1 (Semantic Point Extraction) and Step 2 (Semantic Point Registration), those semantic points can be used to guide motion
  • User-point interaction for various applications

✅ The semantic-point movement made on one frame is transferred to the other frames.

P236

Methodology

  • How to drag points for shape change?
    • Dragging at one frame is straightforward, but propagating the drag displacement over time is non-trivial because of complex camera and subject motion in the video.
    • Resort to canonical space (i.e., Layered Neural Atlas) to propagate the displacement (see the sketch below).
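A minimal sketch of the canonical-space propagation, assuming the atlas provides a frame-to-canonical mapping and, hypothetically, an inverse canonical-to-frame mapping; since corresponding points in all frames share one canonical coordinate, the drag is applied once in atlas space and mapped back into every frame:

```python
import torch

def propagate_drag(to_uv, from_uv, key_frame, dst_xy, n_frames):
    """Propagate a drag made on one frame to all frames via the canonical
    atlas space.

    to_uv(f, xy):   frame-f pixel -> canonical coordinate (trained atlas).
    from_uv(f, uv): canonical coordinate -> frame-f pixel (hypothetical
                    inverse mapping; LNA itself learns the forward map).
    """
    uv_dragged = to_uv(key_frame, dst_xy)   # lift the drag target once
    return torch.stack([from_uv(f, uv_dragged) for f in range(n_frames)])

# Identity stand-in mappings just to make the sketch executable.
to_uv = lambda f, xy: xy
from_uv = lambda f, uv: uv
print(propagate_drag(to_uv, from_uv, 0, torch.tensor([30.0, 25.0]), n_frames=4))
```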

P237

P238

P239

✅ Point control can handle scenes with large deformation.

P240

Qualitative Comparisons to previous works

  • VideoSwap can support shape change in the target swap results, leading to the correct identity of the target concept.

P241

✅ Reconstructing in 3D can solve the temporal-consistency problem.