VideoSwap

Customized video subject swapping via point control

Problem Formulation

  • Subject replacement: change the video subject to a customized subject
  • Background preservation: keep the unedited background the same as in the source video

Gu et al., “VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence,” 2023.

✅ Requirements: the background stays consistent, the motion stays consistent, and only the foreground content is replaced.
✅ To this end, key points are extracted from the source video, and control is performed based on these key points.

P227

Motivation

  • Existing methods are promising, but the motion is often still not well aligned
  • Need to ensure precise correspondence of semantic points between the source and the target

✅ (1) Manually annotate the semantic points on each frame (only a small amount of annotation: 8 frames).
✅ (2) Use the point maps as the condition.

P228

Empirical Observations

  • Question: Can we learn semantic point control for a specific source video subject using only a small number of source video frames?
  • Toy Experiment: Manually define and annotate a set of semantic points on 8 frames; use such point maps as the condition for training a control module, i.e., a T2I-Adapter.

✅ The experiment shows that semantic points can serve as the control signal.
✅ Conclusion: the T2I model can generate new content following new point positions (a point-map rendering sketch follows below).
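A minimal sketch of how such a point-map condition could be built, assuming the annotated points are rasterized as Gaussian blobs on a one-channel map (the function name, blob rendering, and radius are my assumptions; VideoSwap itself later attaches learned per-point embeddings, see Steps 1 and 2 below):

```python
import torch

def rasterize_point_map(points, image_size=(512, 512), radius=4.0):
    """Render (x, y) semantic points into a one-channel condition map
    with a Gaussian blob at each point location."""
    h, w = image_size
    ys = torch.arange(h, dtype=torch.float32).view(h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, w)
    cond = torch.zeros(1, h, w)
    for x, y in points.tolist():
        dist2 = (xs - x) ** 2 + (ys - y) ** 2          # (h, w) squared distance
        cond[0] = torch.maximum(cond[0], torch.exp(-dist2 / (2 * radius ** 2)))
    return cond

# Eight annotated frames -> eight condition maps for adapter training.
frames_points = [torch.randint(0, 512, (10, 2)).float() for _ in range(8)]
cond_maps = torch.stack([rasterize_point_map(p) for p in frames_points])
print(cond_maps.shape)  # torch.Size([8, 1, 512, 512])
```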

P229

Empirical Observations

  • Observation 1: If we drag the points, the trained T2I-Adapter can generate new content based on the dragged points (a new condition) → it is feasible to use semantic points as a condition to control and maintain the source motion trajectory.

✅ The car's shape can also be changed by dragging a subset of the points.

P230

Empirical Observations

  • Observation 2: Further, we can drag the semantic points to control the subject’s shape

✅ The dashed box is a ControlNet-like module that extracts the semantic points and feeds them into the denoising module.
✅ Latent Blend better preserves the background information (see the blending sketch below).
✅ The blue part is the motion layer.
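A minimal sketch of the Latent Blend idea, assuming access to the source video's latents at each denoising step and a subject mask downsampled to the latent resolution (the function name and mask convention are mine):

```python
import torch

def latent_blend(edited_latent, source_latent, fg_mask):
    """Keep the edited content inside the subject region and copy the
    source latent everywhere else, so the background is preserved.

    edited_latent, source_latent: (B, C, h, w) latents at the same timestep.
    fg_mask: (B, 1, h, w) in [0, 1]; 1 marks the region being swapped.
    """
    return fg_mask * edited_latent + (1.0 - fg_mask) * source_latent

# Sketch of use inside a denoising loop: after each step, re-impose the
# source background (source_latents[t] = source video noised to step t).
# latent = latent_blend(latent, source_latents[t], subject_mask_latent)
```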

P231

Framework

  • Motion layer: use a pretrained and frozen AnimateDiff to ensure essential temporal consistency

  • ED-LoRA (Mix-of-Show): learn the concept to be customized

  • Key design aims:

    • Introduce semantic point correspondences to guide the motion trajectory
    • Reduce the human effort of annotating points

Gu et al. “Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models.” NeurIPS, 2023.
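A structural sketch of this trainable/frozen split, with stand-in modules (class and attribute names are hypothetical, not the released code):

```python
import torch.nn as nn

class VideoSwapSketch(nn.Module):
    """Structural sketch only: everything pretrained stays frozen;
    only the per-scale point MLPs are trained."""

    def __init__(self, unet_with_motion_layers, point_mlps):
        super().__init__()
        # T2V U-Net with the AnimateDiff motion layers and the ED-LoRA
        # (Mix-of-Show) concept weights already merged in -- kept fixed.
        self.unet = unet_with_motion_layers
        for p in self.unet.parameters():
            p.requires_grad_(False)
        # Small MLPs turning point embeddings into condition maps:
        # the only trainable parameters in the whole system.
        self.point_mlps = point_mlps

# Stand-in modules just to show which parameters remain trainable.
model = VideoSwapSketch(nn.Linear(4, 4), nn.ModuleList([nn.Linear(8, 8)]))
print([n for n, p in model.named_parameters() if p.requires_grad])
# ['point_mlps.0.weight', 'point_mlps.0.bias']
```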

P232

Step 1: Semantic Point Extraction

  • Reduce human effort in annotating points
    • The user defines points on one keyframe
    • Propagate them to the other frames with a point tracker/detector (see the sketch below)
  • Embedding

✅ What is a good representation for a semantic point?
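A minimal sketch of the propagation step, assuming correspondence is found by nearest-neighbor matching in a per-frame semantic feature map (DIFT-style); the actual system can use any point tracker or detector, and all names here are mine:

```python
import torch
import torch.nn.functional as F

def propagate_points(key_feat, key_points, other_feats):
    """Propagate keyframe points to the other frames by nearest-neighbor
    matching in a semantic feature map (DIFT-style correspondence).

    key_feat:    (C, H, W) feature map of the annotated keyframe.
    key_points:  (N, 2) integer (x, y) positions on that feature grid.
    other_feats: (T, C, H, W) feature maps of the remaining frames.
    Returns (T, N, 2) propagated (x, y) positions.
    """
    C, H, W = key_feat.shape
    queries = F.normalize(key_feat[:, key_points[:, 1], key_points[:, 0]].T, dim=1)
    trajectories = []
    for feat in other_feats:
        keys = F.normalize(feat.reshape(C, -1), dim=0)   # (C, H*W)
        idx = (queries @ keys).argmax(dim=1)             # best match per point
        trajectories.append(torch.stack([idx % W, idx // W], dim=1))
    return torch.stack(trajectories)

traj = propagate_points(torch.randn(64, 48, 48),
                        torch.tensor([[12, 30], [40, 8]]),
                        torch.randn(7, 64, 48, 48))
print(traj.shape)  # torch.Size([7, 2, 2])
```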

P233

Methodology – Step 1: Semantic Point Extraction on the source video

  • Reduce human effort in annotating points
  • Embedding
    • Extract DIFT embedding (intermediate U-Net feature) for each semantic point
    • Aggregate over all frames

❓ How is the embedding fed into the network?
✅ The network parameters themselves are fixed; several small MLPs are added to convert the embedding into condition maps at different scales, which serve as conditions for the U-Net (see the sketch below).
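A sketch of both pieces described above, assuming the semantic features are sampled at integer grid positions and averaged over frames, and that each (hypothetical) per-scale MLP scatters the projected embedding onto an otherwise empty condition map; dimensions are illustrative:

```python
import torch
import torch.nn as nn

def aggregate_point_embeddings(feats, points):
    """Average each point's intermediate U-Net (DIFT) feature over all
    frames to obtain one embedding per semantic point.

    feats:  (T, C, H, W) per-frame feature maps.
    points: (T, N, 2) per-frame (x, y) positions on the feature grid.
    Returns (N, C).
    """
    T = feats.shape[0]
    per_frame = torch.stack(
        [feats[t, :, points[t, :, 1], points[t, :, 0]].T for t in range(T)])
    return per_frame.mean(dim=0)

class PointToConditionMap(nn.Module):
    """One small MLP per U-Net scale: projects each point embedding and
    scatters it at the point's location on an otherwise empty map."""

    def __init__(self, embed_dim, cond_dim):
        super().__init__()
        self.cond_dim = cond_dim
        self.mlp = nn.Sequential(nn.Linear(embed_dim, cond_dim), nn.SiLU(),
                                 nn.Linear(cond_dim, cond_dim))

    def forward(self, point_embeds, points, size):
        h, w = size
        cond = point_embeds.new_zeros(self.cond_dim, h, w)
        cond[:, points[:, 1], points[:, 0]] = self.mlp(point_embeds).T
        return cond

embeds = aggregate_point_embeddings(torch.randn(8, 1280, 32, 32),
                                    torch.randint(0, 32, (8, 5, 2)))
cond = PointToConditionMap(1280, 320)(embeds, torch.randint(0, 32, (5, 2)), (32, 32))
print(cond.shape)  # torch.Size([320, 32, 32])
```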

P234

Methodology – Step 2: Semantic Point Registration on the source video

  • Introduce several learnable MLPs, corresponding to different scales
  • Optimize the MLPs
    • Point Patch Loss: restrict the diffusion loss to reconstructing the local patch around each point
    • Semantic-Enhanced Schedule: only sample higher timesteps in (0.5T, T), which prevents overfitting to low-level details (sketched below)

✅ In some scenarios, part of the semantic points need to be removed, or point positions need to be moved.
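A sketch of the two Step-2 training choices above, assuming square patches around each point on the latent grid and uniform timestep sampling in [0.5T, T); the patch size and loss normalization are my assumptions:

```python
import torch

def point_patch_loss(noise_pred, noise, points, patch=8):
    """Diffusion (noise-prediction) loss restricted to local patches
    around each semantic point, so training focuses on point semantics
    rather than reconstructing the whole frame.

    noise_pred, noise: (B, C, H, W); points: (B, N, 2) latent-grid (x, y).
    """
    B, C, H, W = noise.shape
    mask = torch.zeros(B, 1, H, W)
    for b in range(B):
        for x, y in points[b].tolist():
            mask[b, :, max(y - patch, 0):y + patch, max(x - patch, 0):x + patch] = 1.0
    return (mask * (noise_pred - noise) ** 2).sum() / mask.sum().clamp(min=1.0)

# Semantic-Enhanced Schedule: sample only the higher timesteps in
# [0.5T, T), which carry semantics rather than low-level detail.
T_steps = 1000
t = torch.randint(T_steps // 2, T_steps, (2,))

loss = point_patch_loss(torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64),
                        torch.randint(8, 56, (2, 5, 2)))
print(float(loss), t)
```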

P235

Methodology

  • After Step 1 (Semantic Point Extraction) and Step 2 (Semantic Point Registration), those semantic points can be used to guide motion
  • User-point interaction for various applications

✅ The semantic-point movement made on one frame is transferred to the other frames.

P236

Methodology

  • How to drag points for shape change?
    • Dragging at one frame is straightforward, but propagating the drag displacement over time is non-trivial because of complex camera and subject motion in the video.
    • Resort to canonical space (i.e., Layered Neural Atlas) to propagate the displacement (see the sketch below).
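A minimal sketch of the canonical-space propagation, assuming the atlas provides a frame-to-canonical mapping and, hypothetically, an inverse canonical-to-frame mapping; since corresponding points in all frames share one canonical coordinate, the drag is applied once in atlas space and mapped back into every frame:

```python
import torch

def propagate_drag(to_uv, from_uv, key_frame, dst_xy, n_frames):
    """Propagate a drag made on one frame to all frames via the canonical
    atlas space.

    to_uv(f, xy):   frame-f pixel -> canonical coordinate (trained atlas).
    from_uv(f, uv): canonical coordinate -> frame-f pixel (hypothetical
                    inverse mapping; LNA itself learns the forward map).
    """
    uv_dragged = to_uv(key_frame, dst_xy)   # lift the drag target once
    return torch.stack([from_uv(f, uv_dragged) for f in range(n_frames)])

# Identity stand-in mappings just to make the sketch executable.
to_uv = lambda f, xy: xy
from_uv = lambda f, uv: uv
print(propagate_drag(to_uv, from_uv, 0, torch.tensor([30.0, 25.0]), n_frames=4))
```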

P237

P238

P239

✅ Point control can handle scenes with large deformation.

P240

Qualitative Comparisons to previous works

  • VideoSwap can support shape change in the target swap results, leading to the correct identity of the target concept.

P241

✅ Reconstructing in 3D can solve the temporal-consistency problem.