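Paper directions in this repository

Animation

Turn an observed scene into a moving scene. The pipeline is:

scene --(modeling)--> animation proxy --(driving)--> moving proxy --(rendering)--> moving scene

Depending on how the proxy is represented, the pipeline is either a 2D pipeline (2D modeling) or a 3D pipeline (3D modeling).
Each kind of proxy has its own driving and rendering methods. Driving needs three things: appearance information (from the proxy), motion information (from animation data), and the association between appearance and motion. This gives rise to two research areas: motion generation and motion driving.
When modeling and driving are explicitly split into two steps, the approach is called static modeling and driving. When a moving proxy is obtained directly from the scene, it is called dynamic modeling and driving. Going one step further and producing the moving scene directly from the scene is controllable text-to-video generation.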

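3D pipeline

Strengths: because modeling happens in 3D, temporal and spatial consistency is easy to maintain, and driving rarely produces bizarre results.
Limitations: usually applies only to well-defined subjects.

Modeling (appearance source, driving proxy)	Driving method (appearance-motion association)	Motion (motion source)	Rendering
Mesh modeling + skeletal skinning	LBS	Skeletal motion generation, skeletal motion capture, skeletal motion transfer	None
Mesh	None (direct per-vertex driving)	Mesh vertex motion generation, motion capture, motion transfer	None
Static 3DGS modeling	Direct 3DGS driving	None	3DGS rendering
Static 3DGS modeling	3DGS motion extraction	3DGS bound to a skeleton	3DGS rendering
None	Dynamic 3DGS performs modeling and driving jointly	None	3DGS rendering
3D point cloud reconstruction	3D point cloud driving	None	3D point cloud rendering
NeRF modeling			NeRF rendering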

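2D pipeline

Strengths: a wider range of content can be driven; any image content qualifies.
Limitations: the high degree of freedom makes temporal and spatial consistency difficult, bizarre results can occur, and the methods are usually slow.

Modeling (appearance source)	Motion (motion source)	Driving (appearance-motion association)	Rendering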

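Simulation

Create a scene that is physically realistic. The pipeline is:

(modeling)--> simulation proxy --(driving)--> moving proxy --(rendering)--> moving scene

A simulation proxy is similar to an animation proxy but not identical.
The motion information required for driving includes physical laws.
Some works combine animation and simulation.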

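Summary

Pipeline	3D modeling (appearance source, driving/simulation proxy)	Driving method (appearance-motion association)	Motion (motion source)	Rendering
1	3D mesh. Related techniques: 3D mesh reconstruction	Skeleton drives mesh vertices. Related techniques: skinning and rigging, LBS	Skeletal motion. Related techniques: skeletal motion generation, skeletal motion capture, skeletal motion transfer	None
2		Linear combination of mesh deformation bases. Related techniques: blendshape rigging	Blendshape coefficients. Related techniques: blendshape coefficient prediction	None
3		Direct per-vertex driving	Mesh vertex motion. Related techniques: 4D mesh generation, motion capture, motion transfer	None
4		Vertices act as particles and edges as constraints; physical laws control vertex motion. Key technique: soft-body simulation	Forces	None
5	3DGS. Related techniques: static 3DGS modeling	Direct 3DGS driving	3DGS motion. Related techniques: 3DGS driving	3DGS rendering
6		3DGS bound to a skeleton; the skeleton drives the Gaussians. Related techniques: 3DGS rigging	Skeletal motion	3DGS rendering
7		3DGS regenerated at every frame	All 3DGS attributes. Related techniques: dynamic 3DGS	3DGS rendering
8	3D point cloud. Related techniques: 3D point cloud reconstruction	Per-point driving	Point cloud motion data. Related techniques: 3D point cloud driving	3D point cloud rendering. Related techniques: 3D point cloud rendering
9	3D simulation particles	Particle motion governed by physical laws. Related techniques: particle simulation	Particle control	Rendering of simulation particles
10	NeRF. Related techniques: NeRF modeling	[?]	[?]	NeRF rendering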
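
Pipeline 2 in the table drives a mesh as a linear combination of deformation bases. A minimal sketch of that idea follows, assuming the common delta formulation (each basis is stored as a full mesh and applied as an offset from the neutral shape). The function name `blend_shapes` and the toy data are illustrative, not taken from any paper in these notes.

```python
import numpy as np

def blend_shapes(neutral, targets, coeffs):
    """Deform a mesh as neutral + sum_k coeffs[k] * (targets[k] - neutral).

    neutral: (V, 3) rest-shape vertices
    targets: (K, V, 3) one full mesh per deformation basis
    coeffs:  (K,) blendshape coefficients, i.e. the "motion" signal of pipeline 2
    """
    neutral = np.asarray(neutral, dtype=float)
    deltas = np.asarray(targets, dtype=float) - neutral[None]  # (K, V, 3) offsets
    weights = np.asarray(coeffs, dtype=float)
    # Contract the K axis: weighted sum of the per-basis offsets.
    return neutral + np.tensordot(weights, deltas, axes=1)

# Toy example: two vertices, two bases.
neutral = np.zeros((2, 3))
targets = np.stack([np.ones((2, 3)), 2.0 * np.ones((2, 3))])
half_open = blend_shapes(neutral, targets, [0.5, 0.0])
```

Predicting `coeffs` per frame (the "blendshape coefficient prediction" column) is then enough to animate the mesh; no renderer-specific step is involved.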

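Pipeline	2D modeling (appearance source, driving proxy)	Driving method (appearance-motion association)	Motion (motion source)	Rendering
1	MAT modeling	MAT motion extraction	MAT driving	MAT rendering
2	2D graphics modeling	2D graphics motion extraction	2D graphics driving	2D graphics rendering
3	Pixels	Control signals drive pixels directly. Key technique: controllable image-to-video generation	Control signals	None
4	Pixels	Optical flow drives pixels + inpainting. Key technique: inpainting	Optical flow. Key technique: generating optical flow from control signals	None
5	2D mesh modeling	Mesh bound to control points; the control points move the mesh	Control point motion	None
6		Mesh bound to a skeleton; the skeleton moves the mesh	Skeletal motion	None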

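How to read papers

Reposted from: GAMES online talk 227

Finding papers

Finding a direction

Paper sessions of the major conferences

Find surveys and read them closely

Search for keywords related to the direction

Find classic papers and read them closely

Papers highlighted in surveys
Highly cited papers
Award-winning papers

Extending

Reference: papers mentioned in a survey (typically its second chapter) come next in priority
Citation: click "Cited by" in Google Scholar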

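Habits

Maintain a paper queue, and give classic papers higher priority
Take one paper from the head of the queue
Append its references and citations (after filtering out irrelevant ones) to the tail of the queue

Discuss with others often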
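
The paper-queue habit above is essentially a breadth-first traversal of the citation graph with a priority lane for classics. The sketch below is one way to write it down; `related` and `is_relevant` are hypothetical callbacks standing in for "look up references and citations" and "filter out irrelevant papers".

```python
from collections import deque

def read_next(queue, seen, related, is_relevant):
    """Pop the paper at the head of the queue and enqueue its relevant neighbours."""
    paper = queue.popleft()
    for other in related(paper):          # references and citations of `paper`
        if other not in seen and is_relevant(other):
            seen.add(other)
            queue.append(other)           # new papers go to the tail
    return paper

def add_classic(queue, seen, paper):
    """Classic papers get higher priority: they jump to the head of the queue."""
    if paper not in seen:
        seen.add(paper)
        queue.appendleft(paper)

# Toy citation graph (illustrative names only).
graph = {"survey": ["paper A", "paper B", "off-topic paper"]}
queue, seen = deque(["survey"]), {"survey"}
first = read_next(queue, seen,
                  related=lambda p: graph.get(p, []),
                  is_relevant=lambda p: p != "off-topic paper")
add_classic(queue, seen, "classic paper")
```

The `seen` set keeps a paper from re-entering the queue when several papers cite it.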

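Reading a paper

Quick skim

What to read: figures and videos (the conference presentation view)
Purpose: get a rough idea of the paper
Time: ten minutes

Critical reading

What to read:

Title
Abstract
Introduction
Discussion/limitations

Purpose:

What is the core problem?
What is the core contribution?
What is the rough method? Does it work? What are its flaws? How is it validated?
What inspiration does it offer?

Taking notes

Summary
Facts

Motivation
Contribution
Method
Evaluation

Critical thinking

Critique the strengths and weaknesses

Creative thinking

How did it inspire me?

What ideas come to mind
How to improve it
How to generalize it

What I have learned

Close reading: 1 to 2 papers per week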

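Driving a mesh through a skeleton proxy

---
title: Driving a mesh through a skeleton proxy
---
flowchart LR
    Mesh[("Mesh")]
    SkeletalMotion[("Skeletal motion")]
    SkeletonProxy(["Skeleton proxy"])
    SkinningWeights(["Skinning weights"])
    DrivenMesh(["Driven mesh"])

    Artist[("Authored by artists")]
    Animator[("Authored by animators")]

    MeshReconstruction["Mesh reconstruction"] & MeshGeneration["Mesh generation"] & Artist --> Mesh
    MotionExtraction["Motion extraction"] & MotionGeneration["Motion generation"] & Animator --> MotionTransfer["Motion transfer"] --> MotionOptimization["Motion optimization"] --> SkeletalMotion
    Mesh --> Rigging["Skinning and rigging"] --> SkeletonProxy & SkinningWeights
    SkeletonProxy & SkinningWeights & SkeletalMotion --> DrivenMesh

mindmap
Related techniques
    Mesh reconstruction
    Mesh generation
    Motion extraction
    Motion generation
        By motion representation
            Generation with continuous representations
            Generation with discrete representations
        By control type
            Unconditional
            Text control
            Audio control
            Interactive control
    Motion optimization
        Motion priors
    Motion transfer
    Skinning and rigging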
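Motion priors / motion optimization

Papers in this series usually cover three topics:

mindmap
Skeletal motion priors
    Building a skeletal motion prior
    Applications based on a motion prior
        Motion generation
        Motion optimization
    Applications combining a motion prior with other algorithms
        Motion generation & optimization
        Motion interaction

In these notes, the motion generation task, the motion optimization task, and the motion generation & optimization task differ as follows:

	Prior-based motion generation	Prior-based motion optimization	Motion generation & optimization combining a prior with other algorithms
Motion source	Determined by the motion prior	The source motion is given	Determined by the other generation algorithm
Role of the motion prior	Decides which motion is performed	Constrains motion plausibility	Constrains motion plausibility
Are generation and optimization coupled?	No optimization involved	Not coupled	Coupled; how the other generation algorithm cooperates with the prior model must be considered

When a method uses several priors, how the priors are combined must also be considered.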

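Building a motion prior:

Problem to solve: how to construct a motion prior from data or rules.

mindmap
Building a skeletal motion prior
    Data-based priors
    Rule-based priors
    Data + rule-based priors

Applications based on a motion prior

The following tasks are solved with the motion prior alone:
(1) Motion generation: motion can be generated from the prior without any control condition; the result is the motion that best fits the prior.
Problem to solve: how to generate from the motion prior.
(2) Motion optimization: an existing motion can be optimized with the prior; the result is the motion that is closest to the original while best fitting the prior.
Problem to solve: how to optimize an existing motion with the motion prior.

mindmap
Applications based on a motion prior
    Motion generation
        Sample from the prior distribution
        Sample randomly, then move toward the prior distribution
    Motion optimization
        Use motion plausibility as the optimization objective
        Step toward the region of plausible motions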
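Applications combining a motion prior with other algorithms

The motion prior is combined with other algorithms to solve the tasks below; here the prior only constrains motion plausibility.

(1) Motion generation & optimization: an additional (conditional) motion generation method either generates first and optimizes afterwards, or alternates generation and optimization, so that the result satisfies both the generation requirements and the motion prior.
Problem to solve: how to combine motion generation with motion optimization.
(2) Motion interaction: an additional interaction algorithm handles user interaction, and the motion prior constrains the interaction result so that it fits the prior.
Problem to solve: how to blend the motion prior into the interaction result in real time.

If a paper uses a motion prior together with another algorithm but the method does not actually couple them and instead runs them in separate stages, it is classified as prior-based motion optimization.
For papers in this category, those whose key novelty lies in the motion prior are filed on this page; those whose key novelty lies in the other algorithm are filed on that algorithm's page.

mindmap
Applications combining a motion prior with other algorithms
    Motion generation & optimization
    Motion interaction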
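Data-based motion priors

mindmap
Building a skeletal motion prior from data
    By content of the prior
        Model the distribution of a single frame
        Model the distribution of a motion clip
        Model the distribution of the next frame conditioned on the previous frame (autoregressive)
        Model the motion transition conditioned on the previous frame (autoregressive)
    By representation of the prior
        Model the data as a normal distribution (VAE)
        Model the data as an NDF
        Model the data as a normalizing flow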
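VAE

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
14	2021	HuMoR: 3D Human Motion Model for Robust Pose Estimation	Recovering plausible pose sequences under noise and occlusion	1. Motion prior model: based on 136; the prior is trained to approximate the posterior, which makes the prior more accurate. 2. Prior-based motion generation: based on 136. 3. Prior-based motion optimization that yields motion with accurate and plausible movement and contacts.	Conditional VAE, transition modeling	link
136	2021.3.26	Character Controllers Using Motion VAEs	Given a sufficiently rich set of motion capture clips, how to produce purposeful and realistic human motion	1. Motion prior model. Information source: data. Modeled content: the distribution of the next frame conditioned on the previous frame (autoregressive). Modeling method: VAE. 2. Prior-based motion generation: random sampling from the distribution for random generation, Monte Carlo sampling from the distribution for controllable generation. 3. Motion prior combined with another algorithm: deep reinforcement learning.		link

Normalizing Flows

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
134	2020.12.7	MoGlow: Probabilistic and controllable motion synthesis using normalising flows	Data-driven motion modeling and synthesis	A new probabilistic, generative, and controllable model of motion data based on normalizing flows. 1. Motion prior: a normalizing flow describes the probability distribution of the motion data. 2. Prior-based motion generation: new motion is generated by random sampling from the prior distribution. 3. Motion generation task: motion is optimized for both satisfaction of the control signal and the plausibility probability of the motion.		link

NDF

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
137	2025.9.11	Geometric Neural Distance Fields for Learning Human Motion Priors	Robust, temporally consistent, and physically plausible 3D motion reconstruction	1. Motion prior model: explicitly models human motion as the zero level sets of neural distance fields (NDFs) over pose, transition (velocity), and acceleration dynamics. 2. Motion optimization: (1) a new adaptive-step hybrid algorithm for projecting onto the set of plausible motions; (2) a novel geometric integrator that "rolls out" realistic motion trajectories during test-time optimization and generation.
172	2025.5.26	PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation		Music-to-dance generation with the Plausibility-Aware Motion Diffusion model (PAMD)		link
139	2024.4.11	NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors		1. Motion prior model: represents the space of plausible articulations as the zero level set of a neural field in a high-dimensional product-quaternion space. 2. Motion optimization: adaptive-step Riemannian gradient descent keeps the iterates inside the product space SO(3)^K, which gives faster convergence.
138	2022.7.27	Pose-NDF: Modeling Human Pose Manifolds with Neural Distance Fields	Directly models the manifold of realistic poses and preserves distances between poses	1. Motion prior model. Information source: data. Modeled content: a single frame. Modeling method: NDF. 2. Motion optimization: step toward the zero level set according to the distance; after every step the pose must be projected back onto SO(3).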
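
The "step toward the zero level set, then project back" loop described for Pose-NDF can be sketched on a toy problem. The code below is not the paper's implementation: it replaces the learned neural distance field with an analytic nearest-neighbour distance to a small set of "plausible" unit quaternions for a single joint, and uses re-normalization as the projection back onto valid rotations. All names are illustrative.

```python
import numpy as np

def nearest_plausible(q, plausible):
    """Closest plausible quaternion, treating q and -q as the same rotation."""
    dots = plausible @ q
    idx = int(np.argmax(np.abs(dots)))
    sign = 1.0 if dots[idx] >= 0 else -1.0
    return sign * plausible[idx]

def project_to_plausible(q, plausible, step=0.5, iters=20):
    """Iterate q <- q - step * d(q) * grad d(q), then re-normalize q."""
    q = q / np.linalg.norm(q)
    for _ in range(iters):
        target = nearest_plausible(q, plausible)
        distance = np.linalg.norm(q - target)      # toy stand-in for the learned field
        if distance < 1e-8:
            break
        gradient = (q - target) / distance         # unit-length gradient of the distance
        q = q - step * distance * gradient         # move toward the zero level set
        q = q / np.linalg.norm(q)                  # project back onto unit quaternions
    return q

plausible_set = np.array([[1.0, 0.0, 0.0, 0.0]])   # toy manifold: identity rotation only
noisy = np.array([0.6, 0.8, 0.0, 0.0])
cleaned = project_to_plausible(noisy, plausible_set)
```

In the real method the distance and its gradient come from a network evaluated on the full-body pose; the structure of the loop is the part this sketch is meant to show.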

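Rule-based motion priors

Rule-based priors usually rely on physical rules. Physical rules exist objectively and do not need to be constructed; the key question is how to apply them to a concrete task.

mindmap
Applications based on physical rules
    Applications based on a motion prior
        Motion interaction
        Motion optimization
    Applications combining a motion prior with other algorithms
        Motion generation + physical constraints
        Motion capture + physical constraints

Physical rules are not a distribution and cannot be sampled, so there is no generation task based purely on physical rules.
Prior-based motion interaction means: when a force is applied to the character, how does the character respond or keep its balance?
Depending on whether the optimization is coupled with its upstream process, a task is either pure motion optimization or motion optimization combined with other algorithms.

How the rules are injected	Applied to motion interaction	Applied to motion optimization	Applied to "combination with other algorithms"
Through a physics engine			Other algorithm -> physical constraint -> other algorithm
Physical equations as loss terms			Other algorithm and physical constraints optimized jointly
A small step toward physical plausibility (requires a multi-step generator such as diffusion; the step itself may use a physics engine or a loss function)			Other algorithm and physical constraints applied in alternation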

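Applications combining a motion prior with other algorithms

Through a physics engine

Common recipe:
Other algorithm -> physical constraint -> other algorithm

Problems to solve:

After the physics engine applies its constraints, the result may again show artifacts such as stiffness
The physics-engine step may be non-differentiable, which prevents end-to-end optimization
Physics-engine computation can be slow and heavyweight

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
112	2025.6.5	POMP: Physics-consistent Motion Generative Model through Phase Manifolds	Physical prior + motion generation	A kinematics-based framework following other algorithm -> physical constraint -> other algorithm. Other algorithm: a diffusion module generates the motion. Physical constraint: a simulation module refines the motion with a proportional-derivative controller. Other algorithm: phase manifolds align the motion prior with the physical constraints, and the refined result is mapped back to kinematic data to synthesize physically realistic motion.	Physically plausible, autoregressive, motion optimization	link
114	2023	Drop: Dynamics responses from human motion prior and projective dynamics	Physical prior + motion generation	1. DROP, a highly stable, minimalist physics-based human simulator that provides a physical prior for human motion. 2. Other algorithm -> physical constraint -> other algorithm. Other algorithm: MotionVAE is used as the example motion generator. Physical constraint: projective dynamics. Other algorithm: SoftCorrection.		link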

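A small step toward physical plausibility (requires a multi-step generator such as diffusion)

Common recipe:
At each small step of the other algorithm, apply one small physical correction

Problems to solve:

The physical-constraint step can be slow and heavyweight
The physical-constraint step may use a physics engine or a constructed loss, but it must be differentiable
It has to be combined with a multi-step algorithm such as diffusion

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
140	2023.8.18	PhysDiff: Physics-Guided Human Motion Diffusion Model	Physical prior + motion generation (diffusion)	Other algorithm and physical constraints applied in alternation: diffusion + physics-based correction. Limitation: the physics simulator makes the computation very expensive.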

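Physical equations as loss terms

Problems to solve:

Turning physical constraints into a loss

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
26	2024.4.5	PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos	Physical prior + motion capture	Other algorithm and physical constraints optimized jointly. Other algorithm: estimates human dynamics from monocular video based on the SMPL model. Physical constraint: physics is incorporated implicitly through a Lagrangian loss.		link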
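
To make "turning physical constraints into a loss" concrete, here is a deliberately simplified sketch. It is not the Lagrangian loss used by PhysPT; it only penalizes violations of Newton's second law for the center of mass, m·a = m·g + f_contact, with the acceleration estimated by finite differences. The function name, the y-up gravity convention, and the toy trajectories are assumptions.

```python
import numpy as np

def newton_residual_loss(com, contact_force, mass, dt, gravity=(0.0, -9.81, 0.0)):
    """Mean squared residual of m * a - (m * g + f_contact) along a trajectory.

    com:           (T, 3) center-of-mass positions
    contact_force: (T, 3) total external contact force per frame
    """
    com = np.asarray(com, dtype=float)
    # Central second difference gives the acceleration at frames 1 .. T-2.
    acceleration = (com[2:] - 2.0 * com[1:-1] + com[:-2]) / dt ** 2
    expected = mass * np.asarray(gravity) + np.asarray(contact_force, dtype=float)[1:-1]
    residual = mass * acceleration - expected
    return float(np.mean(np.sum(residual ** 2, axis=1)))

dt, mass = 0.01, 70.0
t = np.arange(10) * dt
free_fall = np.stack([np.zeros_like(t), -0.5 * 9.81 * t ** 2, np.zeros_like(t)], axis=1)
hovering = np.zeros((10, 3))            # stays in the air with no support force
no_force = np.zeros((10, 3))

loss_free_fall = newton_residual_loss(free_fall, no_force, mass, dt)
loss_hovering = newton_residual_loss(hovering, no_force, mass, dt)
```

A free-falling trajectory with no contact force satisfies the equation, while a body that hovers without support is penalized. Written in an autodiff framework, the same expression could be added to a training objective.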

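Data + rule-based motion priors

mindmap
Building a skeletal motion prior from "data + rules"
    Apply different priors in sequence
    Learn the relational parameters of the rules from data

ID	Year	Name	Pain point addressed	Main contribution	Tags
135	2025.9.11	Improving Human Motion Plausibility with Body Momentum	The coupling between "local motion in the local frame attached to the root joint" and "global motion of the root joint in the world frame". 1. Handling them separately fails to capture the coupling between local and global dynamics. 2. Physics simulation (deriving the global trajectory from joint torques and external forces) is computationally expensive and complex.	1. A motion prior that accounts for global-local coupling: whole-body linear and angular momentum serve as constraints linking local motion to global motion. The rationale is that momentum reflects the overall effect of joint-level dynamics on the body's movement through space, which gives a physically grounded way to relate local joint behavior to global displacement. 2. Combining data and physical priors: momentum profiles are learned from data. 3. Prior-based motion optimization: a new loss term enforces consistency between the generated momentum profile and the one observed in ground-truth data. The loss reduces foot sliding and jitter, improves balance, and preserves the accuracy of the recovered motion.	link
142	2023.9.24	Incorporating Physics Principles for Precise Human Motion Prediction		Predicts future SMPL pose parameters based on the Euler-Lagrange equation (EL-Eq.); the pipeline is simple.	PhysMoP
141	2022.6	PIMNet: Physics-infused Neural Network for Human Motion Prediction	Future motion prediction	Human dynamics + VAE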

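Skeletal motion generation with discrete representations

Whether the representation is discrete or continuous, motion generation shares similar problem statements, datasets, and evaluation metrics. Discrete-representation generation gets its own page here because:

A discrete representation models a discrete distribution of the real data
Sampling from a discrete distribution versus a continuous one strongly affects how the generative model is built
Sampling is a key step of any generation algorithm

mindmap
Learning-based motion generation
    By generation scheme
        Autoregressive
        Non-autoregressive
            Regression
            Cloze-style (BERT style)
    By generative model
        Deterministic mapping
        Sampling in a discrete space
            Sampling from a discrete distribution (GPT style)
            Masked language model (BERT style)
            Discrete denoising diffusion probabilistic model (D3PM)
        Sampling in a continuous space
            VAE
            GAN
            Diffusion
    By control signal
        Text-driven
            Action/label-driven
            Natural-language-driven
        Audio-driven
            Music-driven dance
            Speech-driven gesture
        Motion-driven
            Trajectory-driven
            Keyframe-driven
        Scene-driven
            Scene interaction

VQ-VAE and its variants encode motion into discrete tokens, which essentially turns motion generation from a regression problem into a classification problem. However, constrained by the codebook structure, VQ-VAE tends to memorize known motions rather than generalize to new ones. These models generate and reconstruct motion accurately within the training distribution, but they struggle with out-of-distribution motion, which leads to information loss and perceptually distorted motion.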

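Sampling in a discrete space

GPT style

The combination "discrete representation + autoregressive generation" can perform text-to-motion generation, and the generated motion quality is very high.

A discrete representation turns a motion sequence into a token sequence.
The control signal for motion generation can itself be discrete or continuous. When the control signal is also expressed as discrete tokens, aligning its discrete representation with that of the motion improves cross-modal consistency.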
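
How a motion sequence becomes a token sequence can be shown with the quantization step at the heart of a VQ-VAE: every encoded frame feature is replaced by the index of its nearest codebook entry. The sketch below covers only this lookup (no encoder, decoder, or training losses), and the tiny codebook is made up for illustration.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (T, D) encoder outputs, one per (downsampled) motion frame
    codebook: (K, D) learned code vectors
    Returns the token ids (T,) and the quantized features (T, D).
    """
    # Squared Euclidean distance between every feature and every code: (T, K).
    distances = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = distances.argmin(axis=1)
    return tokens, codebook[tokens]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
features = np.array([[0.1, 0.0], [0.9, 1.2], [0.2, 1.9], [1.1, 0.8]])
tokens, quantized = quantize(features, codebook)
```

The resulting integer sequence is what a GPT-style model predicts autoregressively. The gap between `features` and `quantized` is the quantization information loss mentioned earlier.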

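Problems to solve:

How well the result matches the control signal
Generation length
Generation quality

Multimodal latent code alignment

Latent code alignment is used when:

Both the input (control signal) and the output (generated motion) are discrete
The input and the output carry different semantics (for example language and motion)
The input contains control signals with different semantics (for example language + motion)

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
87	2024.3.18	MotionGPT: Finetuned LLMs are General-Purpose Motion Generators	1. Uses a VQ-VAE to encode motion sequences into a special "language". 2. Treats motion generation as a sequence-to-sequence task and leverages LLM capabilities for end-to-end text-to-motion generation. 3. The first motion generation method with multimodal control.	VQ-VAE + LLM + LoRA; clear improvement in generation quality (FID)	Control: text (tokens) / key frames. Generation: autoregressive. Representation: discrete (VQ-VAE). Generative model: reuses GPT. Other: LLM	link
146	2023.11.28	AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond	In human-motion research, isolated models are still developed for each task.	VQ-VAE + LLM + Adapter
145	2023.7.20	MotionGPT: Human Motion as a Foreign Language.	Build a model that handles language, motion, and other modalities in a unified way	1. Converts human motion into motion tokens with discrete vector quantization. 2. On top of this "motion vocabulary", models motion and text jointly as language, treating human motion as a special kind of language. 3. (Prompt learning) Pretrains MotionGPT on mixed motion-language data and fine-tunes it on prompt-based question answering.	Control: question (text via T5, motion via VQ-VAE). Generation: autoregressive. Representation: discrete (VQ-VAE). Generative model: GPT-style question-answering model
	2022.8.4	TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts.	Text-to-3D full-body motion; generates multiple distinct motions from the same text and avoids meaningless static pose sequences.	The first reciprocal generation approach with a discretely quantized motion representation; training text→motion and motion→text together markedly improves semantic alignment.	Control: text (NMT encoder). Generation: autoregressive. Representation: discrete (same as VQ-VAE, although the paper does not use the term). Generative model: GPT style (NMT decoder)

Cases that need no latent code alignment

Latent code alignment is unnecessary when:

The input (control signal) and the output (generated motion) share the same semantics, for example predicting future motion from past motion.
The input (control signal) is continuous and cannot share a space with the discrete output (generated motion).

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
151	2024.6.2	T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences	Generates long, complex motion sequences from multi-sentence text by directly learning an end-to-end text-to-motion mapping.	– Continuous long-term VQ-VAE generation framework – 1D (temporal) convolutional VQ-VAE (avoids temporal inconsistency) – Cannot generate fine-grained motion – Supports only short text descriptions	1D convolutional VQ-VAE + Transformer, long-sequence generation	Control: text (CLIP). Generation: autoregressive. Representation: discrete (VQ-VAE). Generative model: GPT style. Other: Transformer
88	2023.9.24	T2m-gpt: Generating human motion from textual descriptions with discrete representations	A text-to-human-motion framework based on VQ-VAE and GPT	1. Discrete motion representation based on VQ-VAE. 2. A text-to-motion framework of VQ-VAE + Transformer (GPT). 3. Clear improvement in generation quality (FID).	Control: text (CLIP). Generation: autoregressive. Representation: discrete (VQ-VAE). Generative model: GPT style. Other: Transformer, open source	link
150	2023.9.2	AttT2M:Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism.	1. The inherently complex spatio-temporal nature of human motion. 2. The difficulty of learning cross-modal relations between text and motion.	– Spatio-temporal VQ-VAE with body-part attention – Global-local attention for learning cross-modal relations – Limited diversity for long-text-driven generation – Data dependence (cannot generate unseen motions)	Control: text (CLIP). Generation: autoregressive. Representation: discrete (VQ-VAE). Generative model: GPT style. Other: Transformer	link
143	2022.10.19	PoseGPT: Quantization-based 3D Human Motion Generation and Forecasting	Motion generation conditioned on observations of arbitrary length (including none)	1. Encoder-decoder architecture with a quantized latent space. 2. Motion generation based on discrete encoding and decoding.	Control: past motion, action. Generation: autoregressive. Representation: discrete. Generative model: a GPT-like model predicts latent-space indices. Other: the quantization scheme limits motion diversity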

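BERT style

Text-to-motion models built on "discrete representation + masked language model".

Core idea: discretize the motion sequence into a token sequence (like words in NLP). During training, mask a subset of tokens at random or according to a strategy, and let the model predict the masked tokens from context (the unmasked tokens and the text condition).
Strength: usually more efficient than diffusion models, and effective at learning the spatio-temporal dependencies of motion.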
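
The training-time masking step of the core idea can be sketched as follows. This covers only the random-masking variant (strategy-based masking and the prediction network are out of scope); `mask_id` is an assumed reserved token id and the function name is illustrative.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of motion tokens with a special mask id.

    Returns the corrupted sequence and a boolean array marking masked positions;
    the model is trained to predict the original tokens at those positions.
    """
    tokens = np.asarray(tokens)
    num_masked = max(1, int(round(mask_ratio * len(tokens))))
    positions = rng.choice(len(tokens), size=num_masked, replace=False)
    corrupted = tokens.copy()
    corrupted[positions] = mask_id
    is_masked = np.zeros(len(tokens), dtype=bool)
    is_masked[positions] = True
    return corrupted, is_masked

rng = np.random.default_rng(0)
motion_tokens = np.arange(10)          # toy token sequence
corrupted, is_masked = mask_tokens(motion_tokens, mask_ratio=0.3, mask_id=-1, rng=rng)
```

The training loss is then computed only where `is_masked` is true. At inference, generation usually starts from a fully masked sequence and fills it in over a few parallel passes, which is where the efficiency advantage over step-by-step samplers comes from.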

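Text to Motion

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2025	BAMM: Bidirectional Autoregressive Motion Model.	BERT style – Conditional masked self-attention Transformer – Training with hybrid attention masks	– Moderate computational complexity – Cannot generate fast-changing root motion
148	2024.3.28	MMM: Generative Masked Motion Model.	A new, simple motion generation paradigm based on a masked motion model.	Very similar to MoMask, though the paper does not compare against MoMask. Input motion tokens are masked at random, and the model predicts all masked tokens simultaneously from all unmasked tokens (the context), i.e. non-autoregressively. Limitation: cannot generate motion from long, detailed text descriptions.	Control: text (CLIP). Generation: BERT style. Representation: discrete (VQ-VAE). Generative model: conditional masked motion model	link
	2023.11.29	MoMask: Generative Masked Modeling of 3D Human Motions	A new text-to-motion framework of VQ-VAE + BERT style	VQ-VAE + hierarchical codebook structure; masked prediction generates coarse motion and residual layers refine it progressively. The first text-to-motion framework combining a discrete motion representation with a masked language model.	Control: text (CLIP). Generation: BERT style. Representation: discrete (VQ-VAE + residual refinement). Generative model: masked language model	link

Music to dance

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2023	TM2D [Gong et al., 2023]	– VQ-VAE framework – Bimodal feature fusion (cross-modal Transformer)	– Lacks paired data (music/text) – Limited to specific dance styles (data dependence)
	2022.11.29	UDE: A Unified Driving Engine for Human Motion Generation	A single model unifying text- and audio-driven generation	Modality-agnostic Transformer encoder + diffusion decoder – Struggles with complex multimodal interactions	link

Discrete denoising diffusion probabilistic model (D3PM)

Text to Motion

ID	Year	Name	Note	Tags	Link
152	2024.7.19	M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models	First obtains discrete motion codes with a VQ-VAE, then learns a denoising diffusion model over the token sequence. A dynamic transition probability is designed for multi-motion generation to ensure smooth transitions between actions.	– Dynamic transition probability model – New evaluation metric Jerk (smoothness at action boundaries), although Jerk cannot assess every scenario	Control: text (CLIP). Generation: non-autoregressive. Representation: discrete (VQ-VAE). Generative model: discrete denoising diffusion probabilistic model (D3PM). Other: action-boundary smoothness metric Jerk
	2023.9.4	DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion	Balancing motion quality and diversity remains an open challenge, caused mainly by two factors: 1) motion-caption pairs in existing benchmarks lack diversity; 2) text prompts are understood in a one-sided, biased way that focuses on the verb and ignores the nuances conveyed by other words.	1. Builds a large-scale in-the-wild motion-caption dataset (WMC). 2. Proposes a Hierarchical Semantic Aggregation (HSA) module to capture fine-grained semantics. 3. Integrates both into an effective Motion Discrete Diffusion (MDD) framework.	Control: text (Hierarchical Semantic Aggregation, HSA). Generation: non-autoregressive. Representation: discrete (VQ-VAE). Generative model: Motion Discrete Diffusion (MDD) framework. Other: dataset
	2023	Text-to-Motion Synthesis using Discrete Diffusion Model	Diffusion models are computationally expensive, and the generated motion may align poorly with the input text.	Combines a discrete latent space with a diffusion model and learns an expressive conditional probabilistic mapping for motion synthesis. 1. Learns a discrete motion representation. 2. Applies a discrete denoising diffusion probabilistic model (D3PM) to learn the conditional distribution of motion tokens. 3. Additionally uses discrete classifier-free guidance during training, aligning motion with its text description through a suitable guidance scale.	Control: text. Generation: non-autoregressive. Representation: discrete (VQ-VAE). Generative model: discrete denoising diffusion probabilistic model (D3PM). Other: MoDDM
147	2023.8.30	Priority-Centric Human Motion Generation in Discrete Latent Space	Not all motions are equally relevant to a given text description; the more salient and informative ones should be prioritized during generation	1. A Transformer-based VQ-VAE that builds a compact discrete motion representation through global self-attention and a regularization term, effectively preventing code collapse. 2. A novel motion discrete diffusion model whose noise schedule is derived from the importance of each motion token within the whole sequence. Limitation: struggles to capture fine-grained motion details.	M2DM. Control: text. Generation: non-autoregressive. Representation: discrete (Transformer-based VQ-VAE). Generative model: discrete denoising diffusion probabilistic model (D3PM)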
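
Several entries above rely on jerk as a smoothness measure at action boundaries. The sketch below shows the generic quantity only, the third time derivative of joint positions estimated by finite differences and averaged over frames and joints; the exact protocol of the M2D2M Jerk metric (which frames around a boundary are evaluated, how values are aggregated) is not reproduced here.

```python
import numpy as np

def mean_jerk(positions, dt):
    """Average jerk magnitude of a motion clip.

    positions: (T, J, 3) joint positions over T frames
    dt:        time between consecutive frames in seconds
    """
    # The third finite difference along time approximates the third derivative.
    jerk = np.diff(positions, n=3, axis=0) / dt ** 3      # (T - 3, J, 3)
    return float(np.linalg.norm(jerk, axis=-1).mean())

dt = 0.1
t = np.arange(8) * dt
constant_velocity = np.zeros((8, 1, 3))
constant_velocity[:, 0, 0] = 2.0 * t                      # x(t) = 2 t, jerk is 0
cubic = np.zeros((8, 1, 3))
cubic[:, 0, 0] = t ** 3                                   # x(t) = t^3, jerk is 6

jerk_smooth = mean_jerk(constant_velocity, dt)
jerk_cubic = mean_jerk(cubic, dt)
```

A transition that pops between two actions shows up as a spike in this quantity around the boundary frames, which is why it is used to compare transition quality.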

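Sampling in a continuous space

Diffusion

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
149	2024.9.17	BAD: Bidirectional Auto-Regressive Diffusion for Text-to-Motion Generation	Autoregressive models struggle to capture complex bidirectional patterns. Mask modeling assumes tokens are mutually independent, which weakens sequence dependencies. Corrupting the sequence through masking or absorbing operations can introduce unnatural distortions and make learning harder.	Bidirectional Auto-regressive Diffusion (BAD): a permutation-based sequence corruption technique that combines the strengths of autoregressive and mask-based generative models, preserving causal dependencies while capturing sequential and bidirectional relations. [?] A novel way of applying diffusion to discrete data.	Control: text (CLIP). Generation: BERT style. Representation: discrete (VQ-VAE). Generative model: a novel corruption (diffusion) technique	link

Score Matching

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
102	2025.5.16	HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining	The inherent ambiguity of text and the complexity of human motion dynamics	1. A residual VQ-VAE similar to MoMask, but a dedicated network is trained to decide which tokens to mask. 2. Encodes the text into embeddings at several granularities, improving both overall and detailed text control.	Control: text (Graph Reasoning). Generation: BERT style. Representation: discrete (hierarchical text encoding, each level is a residual VQ-VAE). Generative model: residual VQ-VAE (a progressively refining generation mode similar to diffusion)	link
92	2025	Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis	Score-based generative models involve complex, curved trajectories during training, which hurts training stability.	1. Stage one: motion reconstruction (a VQ-VAE with a different network) learns latent motion representations. 2. Stage two: a deterministic feature mapping process (DerODE) builds the mapping between a Gaussian distribution and the latent motion distribution. 3. At generation time, controllable noise is injected into the gradient field of the deterministic mapping (DivSDE) to obtain diversity.	Control: action label. Generation: non-autoregressive. Representation: discrete (VQ-VAE). Generative model: flow matching + score matching	link

Discrete vs continuous representations

Dimension	Discrete representation	Continuous representation
Motion encoding	VQ-VAE or a quantizer produces motion tokens from pose sequences	Autoencoder, or raw continuous pose data used directly
Generative model	Transformers (e.g. GPT); masked models (e.g. BERT); discrete diffusion models	Diffusion in raw motion space; latent diffusion (e.g. LDMs)
Text alignment	Integrates easily with NLP models; motion can be treated as a "language"	Needs attention/cross-modal fusion; the mapping structure is weaker
Training stability	Prone to codebook collapse and quantization artifacts	The continuous MSE loss in diffusion keeps training stable
Fidelity and diversity	High fidelity with a large codebook; limited diversity	Naturally diverse through stochastic sampling; highly expressive
Inference speed	Fast for small autoregressive models; slow on long sequences	Iterative sampling is usually slow; LDMs can speed it up
Control and editing	Supports masked inpainting; token-level symbolic control	Fine-grained editing (e.g. FLAME); frame/joint-level control (e.g. SALAD)
Streaming/online capability	Limited by autoregressive decoding; non-causal sequences hinder real-time use	Causal latents enable streaming generation (e.g. MotionStreamer)
Common limitations	Information loss from quantization; tokenizers are hard to train	High compute cost; precise text alignment is hard
Representative work	T2M-GPT [2023] MMM [2024] MotionGPT [2023] MoDDM [2023] M2D2M [2024]	MotionDiffuse [2022] MoFusion [2023] FLAME [2023] SALAD [2025] MoLA [2024] MotionStreamer [2025]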

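Locomotion

Task: control a virtual character in real time so that it performs basic locomotion such as walking, running, and jumping efficiently, realistically, controllably, and adaptively.

mindmap
    Locomotion
        More
            More motion types
                Adapt to different terrain (flat ground, stairs, slopes)
                Adapt to different characters
                    Biped/quadruped
                    Different body shapes
                Different actions
                Different styles
            More application scenarios
                Path guidance
                Style control
                Switching between action types
                Motion in-betweening
                Real-time motion input (e.g. mocap, VR tracking data)
        Faster
            Meet the frame-rate requirements of interactive scenarios
            Fast data preparation
        Better
            Physically realistic motion
                Trajectories, limb poses, and gait rhythm follow biomechanics
                No artifacts such as penetration, stiffness, or floating
            Biologically realistic motion
                Reproduce the natural movement of humans/animals
            Follows user control
                Respond precisely to high-level user commands (speed, direction, style, start/stop)
                Allow fine-grained motion adjustment
            Stability
                Stable under external perturbation
        Cheaper
            Low compute cost
            Less data
            Less manual work
            Less preprocessing
                No need to redesign the motion logic when adapting to different generalization problems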

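Matching-based methods

graph TD
    A[Current frame] --> B{Match?}
    B -- N --> C[Next frame]
    B -- Y --> D[Extract features]
    D --> E[Current features]
    F[Fetch data] --> D
    G["(5) Character state"] --> D
    D --> H[Nearest-neighbor search]
    H --> I[Current frame]
    J[Control target] --> H
    H --> K["(3) Next-frame motion"]
    L["(6) Character motion dataset"] --> M[Extract features]
    M --> N[Feature set]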


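Motion Graph / Motion Matching / Motion Field

Notes P2

flowchart LR
A([Current frame]) --> B("Re-match? (1)")
    B -->|"N"|C([Current frame])
    C --> D["(3) Take next-frame motion"]
    D --> E([Next frame])
    E --> A
    F --> G(["(5) Character state"])
    H([Control target])
    G --> I["(4) Extract features"]
    H --> I
    I --> N([Current features])
    N -->J["(2) Nearest-neighbor search"]
    J -->C
    B --> F["(5) Fetch data"]
    K(["(6) Character motion dataset"])
    K --> L["(4)<br>Extract features"]
    L --> M(["Feature set"])
    M --> J
    K --> D

Notes P1

P41 A

ID	Year	Title	Highlights
		Motion Field
		Motion Graph	Baseline, operates on clips. (1) Re-matches only when a clip ends. (2) Finds the best-fitting clip and bridges the transition by interpolation.
		Motion Matching	Operates on frames. (1) Re-matches every frame or every few frames, so it responds faster. (2) Finds the best-matching frame and bridges the transition with a blend.
	2020	Learned Motion Matching	Trained on the dataset; replaces (1)(2)(3)(4) with network modules, removing the dependency on the database at online inference.
231016	2023.10.16	MOCHA: RealTime Motion Characterization via Context Matching	Lets (5) and (6) be different characters and adds a module that disentangles motion content from motion style. Nearest-neighbor matching is done in motion space and the target style is blended into the match result, achieving online style transfer.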

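Generating motion by retrieval, stitching, and interpolation has long been the mainstream industrial solution.

Drawbacks:

Depends on a large amount of data for a specific character
Nearest-neighbor search is expensive
High memory footprint
Generalizes poorly to different terrain
Scales poorly as the dataset grows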
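
The matching loop in the flowchart (extract a query feature from the character state and control target, search the feature set, then play the frame after the match) can be sketched in a few lines. This is a brute-force illustration with made-up 2-D features; production systems add feature normalization, accelerated search, and blending between clips.

```python
import numpy as np

def nearest_frame(query, feature_set):
    """Index of the database frame whose feature is closest to the query."""
    distances = np.linalg.norm(feature_set - query, axis=1)
    return int(np.argmin(distances))

def next_frame(current, query, feature_set, rematch):
    """Advance playback by one frame, jumping to the best match first if requested."""
    if rematch:
        current = nearest_frame(query, feature_set)
    return min(current + 1, len(feature_set) - 1)   # clamp at the end of the data

# Toy feature set: one feature vector per database frame.
features = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
query = np.array([1.9, 0.1])      # built from character state + control target

matched = nearest_frame(query, features)
played_without_search = next_frame(0, query, features, rematch=False)
played_with_search = next_frame(0, query, features, rematch=True)
```

The cost of `nearest_frame` grows linearly with the number of stored frames, and `features` must stay in memory, which is the source of the search-cost, memory, and scaling drawbacks listed above.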

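Supervised-learning-based methods

Non-phase methods

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2018.10.4	Recurrent Transition Networks for Character Locomotion	Traditional methods (e.g. motion graphs, Gaussian process models) generalize poorly, fit only a single motion type, and are costly at runtime. Deep-learning-based "targeted transition generation from a current state to a target state" was an unexplored gap.	1. Modified LSTM: a conventional LSTM depends only on past states, whereas RTN adds future-context features (target + offset) to the gate computation so that generation always heads toward the target state and avoids unconstrained drift. 2. Hidden-state initialization: instead of a zero vector or a globally shared initial state, an MLP learns the mapping from the first input frame to the best initial hidden state, so the LSTM captures motion features from the start and quality improves. 3. ResNet-style decoder: outputs the offset between the current and next frame rather than the pose itself, reducing the gap between generated frames and the input context and making transitions smoother.	link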

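Phase-based methods

Notes P3

flowchart LR
    C([NN2])-->A
    D([NN3]) -->A
    E([NN4]) -->A
    B([NN1]) -->A[Expert blending]
    A-->F([Mixture-of-experts model])
    F-->G[Model inference]
    H(["Control target<br>(trajectory, labels)"])-->G
    J([Current state])-->G
    G -->K([Phase delta])-->P([Phase])-->O([Current phase])
    G -->L([Contact information])-->R[IK]-->N
    G -->Q([Future trajectory])-->N
    G -->M([Next-frame state])-->N([Motion output])-->J 
    O -->A

Notes P4

Strengths:

Responds to user input stably and in real time.

Limitations:

Relies on careful hand design
Training on highly variable motion data produces averaged results.
Generalizes poorly to motions outside the data.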
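
The "expert blending" node of the flowchart, where the current phase decides how the four expert networks NN1..NN4 are mixed, can be sketched as a cyclic cubic Catmull-Rom interpolation of four sets of weights. The notes themselves only state that the phase mixes the experts; the spline choice and all names below are assumptions in the spirit of phase-functioned networks.

```python
import numpy as np

def phase_function(control_weights, phase):
    """Blend four sets of network weights according to a cyclic phase.

    control_weights: array of shape (4, ...) holding the weights of NN1..NN4
    phase:           scalar in radians; values outside [0, 2*pi) wrap around
    """
    t = 4.0 * (phase % (2.0 * np.pi)) / (2.0 * np.pi)   # position along the 4 experts
    k1 = int(t) % 4                                      # expert just before the phase
    w = t - int(t)                                       # fractional position in [0, 1)
    a0 = control_weights[(k1 - 1) % 4]
    a1 = control_weights[k1]
    a2 = control_weights[(k1 + 1) % 4]
    a3 = control_weights[(k1 + 2) % 4]
    # Cubic Catmull-Rom spline through a1 (w = 0) and a2 (w = 1).
    return (a1
            + w * (0.5 * a2 - 0.5 * a0)
            + w ** 2 * (a0 - 2.5 * a1 + 2.0 * a2 - 0.5 * a3)
            + w ** 3 * (1.5 * a1 - 1.5 * a2 + 0.5 * a3 - 0.5 * a0))

experts = np.arange(4.0).reshape(4, 1)        # toy "weights": expert i holds the value i
at_start = phase_function(experts, 0.0)
at_quarter = phase_function(experts, np.pi / 2)
between = phase_function(experts, np.pi / 4)
```

At phases 0, π/2, π, and 3π/2 the blend reproduces one expert exactly; in between it interpolates smoothly, and the wrap-around indexing keeps the function periodic.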

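ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2023.8.24	Motion In-Betweening with Phase Manifolds	First to bring the phase manifold learned by a periodic autoencoder into character in-betweening	1. Decomposes complex human motion in the frequency domain and encodes its periodicity and timing with phase + amplitude. 2. Phase extraction: DeepPhase. 3. Bidirectional control: character frame + target frame. 4. User control: trajectory control, action-type control.	More: very long in-betweening. Faster: real-time in-betweening. Better: path and style control	PDF
	2022	DeepPhase	Extracts deep multi-dimensional phase information from full-body motion data for better temporal and spatial alignment
	2022.1.12	Real-Time Style Modelling of Human Locomotion via Feature-Wise Transformations and Local Motion Phases	Brings phase-based methods to style transfer.	1. Releases the 100STYLE dataset. 2. Extends phase extraction to non-contact motion: apply PCA to data of the same class, take the coefficient of the first principal component, and fit it with a sine function when it is periodic. 3. Style injection: a style clip is mapped to alpha and beta, which modulate the hidden layers of the main network. Training first fits the main network, then attaches FiLM and fine-tunes FiLM. At inference FiLM is not needed: alpha and beta of a preset style are used, or obtained by interpolation.	More: style transfer; generalizes to non-contact periodic motion. Faster: real-time style switching. Better: continuous style parameters, no popping when switching styles.	PDF
	2020.7.8	Local motion phases for learning multi-contact character movements	1. PFNN only handles periodic motion. 2. PFNN needs manually labeled phases.	1. Non-periodic motion = a superposition of local periodic motions, so several important joints each get an independent phase. 2. Automatic phase extraction: define contact as 1, then fit with a sine function. 3. A weight-estimation network takes n phases and outputs m weight parameters. 4. User control signals are too sparse and lead to averaged results, so a GAN is trained to generate detailed control signals from the sparse ones.	More: generalizes to non-periodic motion. Faster: Better: a generative model avoids averaged motion. Cheaper: no manual labeling.	PDF
	2018.10.1	Few‐shot Learning of Homogeneous Human Locomotion Styles	A few-shot learning strategy	1. Data: (1) a large amount of basic-style data; (2) a small amount of specific-style data. 2. Model: PFNN serves as the generic module and residual adapters as the style module. 3. Training: data (1) disentangles the generic module from the style module; data (2) fine-tunes the style module. 4. Parameters: the adapter matrix is factorized as X=ADB^T to further reduce the parameter count.	More: generalizes to different styles. Faster: transfers from few samples, fast training. Better: no overfitting, good generalization. Cheaper: less training data, less memory.	PDF
	2018.6.30	Mode-adaptive neural networks for quadruped motion control	Traditional data-driven methods (motion graphs, motion matching) must store the full motion database, rely on manual segmentation and labeling, and have a complex search with poor real-time performance. CNNs suffer from ambiguous input-output mappings; RNNs tend to converge to an average pose (floating) in long-term prediction; PFNN resolves the ambiguity but depends on manual phase labels and cannot handle non-cyclic motion.	Proposes the Mode-Adaptive Neural Network (MANN) for quadruped motion control. Its two-module architecture, a gating network plus a motion prediction network, learns motion modes from large-scale unstructured motion capture (MOCAP) data without manual phase or gait labels. The gating network dynamically weights several expert networks based on features such as foot joint velocities and target velocity; the motion prediction network produces smooth, coherent next-frame motion. It supports cyclic and non-cyclic motions such as idle, locomotion, jump, and sit, and lets the user control speed, direction, and action variables interactively. Experiments show better motion quality, real-time performance, and memory footprint than traditional data-driven methods (e.g. motion graphs) and existing neural models (e.g. PFNN), filling the gap in efficient modeling and interactive control of unstructured quadruped motion data.	PPT, video
113	2017.7.20	Phase-functioned neural networks for character control	1. Motion matching needs to store large amounts of data. 2. Autoregressive methods accumulate error. 3. CNN methods are not real-time. 4. Physics-based methods are not controllable. 5. Complex terrain is not supported.	1. A mixture-of-experts model that, for the first time, promotes the motion phase from a network input feature to a global variable parameterizing the network weights. 2. Couples flat-ground mocap data with complex terrain so that the model learns to adapt motion to terrain automatically. Baseline	PFNN. More: generalizes across terrain. Faster: 0.8 ms/frame. Better: 1. the mixture-of-experts model resolves artifacts introduced by phase blending; 2. IK post-processing resolves foot artifacts. Cheaper: the NN removes the need to store raw data	link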

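Generation-based methods

Notes P7

ID	Year	Title	Highlights
136	2021.3.26	Character Controllers Using Motion VAEs	Given a sufficiently rich set of motion capture clips, how to produce purposeful and realistic human motion
	2023.10.16	MOCHA: RealTime Motion Characterization via Context Matching	Style transfer + real-time control
	2024.8.16	Interactive Character Control with Auto-Regressive Motion Diffusion Models	A-MDM: replaces the VAE of 136 with an MLP diffusion model and uses hierarchical reinforcement learning for control.

Conditional generation

flowchart LR
 A([Current state])--> C
 B([Control target])--> C
 C([Condition])--> F[Diffusion]--> G([latent code]) --> H[Decoder]--> I([Next-frame state])
 D([Noise])--> F
 E([t])-->F

Hard to generalize across different control signals.
Hard to generalize to motions outside the data.
Supports multimodal condition signals and guidance by arbitrary losses.
Can achieve longer-term goals and complex tasks.

Year	Name	Pain point addressed	Main contribution	Tags	Link
2025.5.30	MotionPersona: Characteristics-aware Locomotion Control	The first generative character controller. Real-time interaction + character personalization	1. Input: text description -> CLIP -> embedding, SMPL beta, past states, future trajectory, example clip (optional). 2. Model: DiT architecture, 45 frames per pass, more than 60 fps per frame. 3. Motion stitching: smooths the first 5 generated frames against the last frame in noise space. 4. At inference time, fine-tunes the encoding and output layers on a few samples (DreamBooth). 5. A dataset.	More: personalization. Faster: real-time inference. Cheaper: very little custom data and fine-tuning time	PDF
2025.5.13	DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control	Character control through natural language; the condition is natural language. To support trajectory control, an additional control mechanism is introduced, similar to the "generation + control" approach above.
2024.8.16	Interactive Character Control with Auto-Regressive Motion Diffusion Models
2024.7.11	AAMDM: Accelerated Auto-regressive Motion Diffusion Model		Uses DDGAN to accelerate inference
2024.4.23	Taming Diffusion Probabilistic Models for Character Control	This SIGGRAPH 2024 paper targets real-time character control with diffusion models. It proposes the Conditional Autoregressive Motion Diffusion Model (CAMDM), the first to bring motion diffusion probabilistic models into real-time interactive character control. It addresses the heavy computation, weak controllability, and limited diversity of conventional diffusion models: a single model supports multiple styles, responds to user control in real time, generates high-quality and diverse character animation, and transitions seamlessly between styles.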

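Dynamics-based methods

Strengths:
(1) Can generate novel, physically valid motion
(2) Low data dependence

Limitations:
(1) Cost
(2) Scalability

Notes P13

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2022.5.12	AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control	In imitation learning, the imitation target must be chosen carefully and the objective measuring imitation quality is hard to design.	Imitates the style of the target motions rather than a specific motion; adversarial learning judges whether the style looks right. AMP motion prior = adversarial discriminator.

Motion planner + motion controller

Similar to "controllable generation + motion optimization"

A gap between planner and controller degrades motion quality
The control policy struggles to track the planned motion accurately and needs fine-tuning, which limits generalization
Strengths are the same as for "controllable generation"

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2025.5.13	CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control	Diffusion planner + tracking controller	Diffusion planner: conditioned on text and a target location, generates the next motion plan.	Tracking controller: receives the plan from DiP and provides feedback from the environment.

Year	Name	Pain point addressed	Main contribution	Tags
2023.10.18	Interactive Locomotion Style Control for a Human Character based on Gait Cycle Features	Not downloaded
2020.7.26	Feature-based locomotion controllers
2018.4.8	DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills
2010	Real-time planning and control for simulated bipedal locomotion
2007.7.29	SIMBICON: simple biped locomotion control	(1) Zero-moment-point (ZMP) methods depend on precomputed trajectories and lack flexibility; (2) reinforcement learning / policy search needs complex reward design and converges poorly in high-dimensional state spaces; (3) data-driven methods are mostly kinematic and lack physical adaptability.	A minimal combination of "finite state machine + control in world coordinates + center-of-mass feedback" that needs no complex dynamics modeling and produces real-time, robust physics-based biped locomotion	link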
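
The "center-of-mass feedback" ingredient listed for SIMBICON is commonly written as a linear correction of a joint's target angle, θ_d = θ_d0 + c_d·d + c_v·v, where d is the horizontal offset from the stance ankle to the center of mass and v is the center-of-mass velocity; the corrected target is then tracked with a PD controller. The sketch below shows only these two formulas, under that assumed formulation and with made-up gains. The finite state machine and the simulator are omitted.

```python
def balance_feedback_target(theta_default, d, v, c_d, c_v):
    """Adjust a joint target angle using center-of-mass position and velocity feedback.

    theta_default: target angle stored in the current state of the state machine
    d: horizontal distance from the stance ankle to the center of mass
    v: horizontal velocity of the center of mass
    c_d, c_v: feedback gains (hand-tuned per controller)
    """
    return theta_default + c_d * d + c_v * v

def pd_torque(theta_target, theta, theta_dot, kp, kd):
    """PD control torque that drives the joint toward its target angle."""
    return kp * (theta_target - theta) - kd * theta_dot

# Toy numbers, chosen only to make the arithmetic easy to follow.
target = balance_feedback_target(theta_default=0.5, d=0.25, v=0.5, c_d=0.5, c_v=0.25)
torque = pd_torque(target, theta=0.25, theta_dot=1.0, kp=100.0, kd=10.0)
```

When the center of mass runs ahead of the stance foot (larger d or v), the target angle changes so that the swing foot lands further forward. This is what lets a very simple controller recover balance without a full dynamics model.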

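Notes P11

ID	Year	Name	Main contribution
	[2]		1. Merges motion planning and control into one model, removing the domain gap between two models. 2. Multimodal inputs such as text, targets, and trajectories. 3. Trained with the Diffusion Forcing paradigm to remove long-horizon error accumulation. 4. Guided sampling generalizes to different (including unseen) control signals without fine-tuning. The method is behavior cloning: state-policy pairs are first extracted from mocap data, and diffusion then learns the policy.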

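Diffusion-based text-to-motion

Diffusion-based methods operate directly in a continuous motion space (raw motion data or VAE-encoded latents) and support high-quality motion generation. However, they mostly model at the clip level, generating all frames of a sequence at once. With this design it is hard to modify specific frames while keeping overall consistency and quality. They are also usually limited to coarse semantic guidance or trajectory-level control, so they lack fine-grained temporal control and users cannot precisely define or edit motion details during generation.

mindmap
Diffusion-based motion generation
    In common
        Control: text
            CLIP
            RoBERTa
            LLM
        Generation: non-autoregressive
        Representation: continuous
        Generative model: DDPM/DDIM
    By representation
        Raw data
        Latent representation
    By problem addressed
        Matching the text
            Long text
            Fine-grained control
            Text fidelity
        Generation quality
        Generation diversity
        Generation speed
            Latent space
            Fewer sampling steps
            LCM
        Long-sequence generation
        Training data
    By backbone
        Transformer based
        UNet based
        Masked modeling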

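Generation quality

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
155	2024.12.9	StableMoFusion: Towards Robust and Efficient Diffusion-based Motion Generation Framework	Existing diffusion-based methods use differing network architectures and training strategies, and the effect of each design choice is unclear. Foot sliding.	Analyzes and optimizes every component of human motion generation in depth. - Identifies foot-ground contact and corrects foot motion during denoising – Slow inference (high compute cost)	Control: text (CLIP + word embedding). Representation: raw data. Backbone: Transformer, UNet, RetNet. Problem addressed: generation quality	link

Generation diversity

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2023.10.1	ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model	Retrieval-augmented generation that fuses the top-k similar motion samples with CLIP semantic features to improve diversity and contextual relevance	– Semantics-modulated Transformer	– Strong data dependence – High compute cost	link

Framework evolution

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
160	2025.5.26	Absolute Coordinates Make Motion Generation Easy	Local, relative motion representations (encoding motion relative to the pelvis and the previous frame) simplified the training of early generative models, but they impose key limitations on diffusion models and hinder downstream use.	Text-to-motion with absolute joint coordinates in global space; keeps performance while naturally supporting downstream tasks
101	2025.5.16	Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion	VQ-VAE methods: discrete representations struggle to express new motions faithfully. Diffusion methods: continuous representations lack fine-grained per-frame control. The authors want frame-level control together with good generalization to motions outside the training set.	1. Recursively fills in a subset of frames (similar to MoMask) until all frames are generated. 2. Frame-level VAE.	VAE + MoMask + diffusion	link
164	2024.8.30	Text-driven Human Motion Generation with Motion Masked Diffusion Model	The ability to learn spatio-temporal semantic relations of motion through contextual reasoning	– Masked modeling strategy across time frames and body parts – High compute cost	MMDM	link
165	2023.8.30	Human Motion Diffusion as a Generative Prior	Scarce annotated motion data, a focus on single-person motion only, and a lack of fine control	1. Parallel composition: from two fixed priors and a few two-person training samples, a lightweight communication module ComMDM coordinates the interaction between the two generated motions. 2. Sequential composition: a diffusion prior trained only on short clips generates long animations made of prompted intervals and their transitions. 3. Model composition: individual priors are first trained to produce motion satisfying given joint constraints; the DiffusionBlending interpolation mechanism then merges several such models for flexible, efficient joint-level and trajectory-level control and editing.	– Depends on the quality of the initial model – Inconsistent motion over long intervals – Limited generalization	link
	2022.9.29	Human Motion Diffusion Model	First application of diffusion models to conditional motion generation	1. Diversity-fidelity trade-off (limited by curved training/sampling trajectories); results are diverse and realistic. 2. Predicts x0 instead of the noise. Drawbacks: high compute cost, slow inference, suitable only for short sequences.	Control: text (CLIP). Representation: raw data. Backbone: Transformer. Problem addressed: motion + DDPM. HMDM, MDM
132	2022.8.31	MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model	Delicate, fine-grained motion generation from diverse text inputs	The first diffusion-based text-driven motion generation framework, with many improvements on top: 1. self-attention between text features and the noise enables cross-modal text-to-motion generation; 2. blending different text prompts in noise space gives fine-grained control over different body parts; 3. blending different segments in noise space enables long-sequence generation.	Control: text (CLIP). Representation: raw data. Backbone: Transformer. Problem addressed: motion + DDPM. Other: open source	link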

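Generation speed

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
162	2024.12.30	Motionlcm: Real-time controllable motion generation via latent consistency model	motion diffusion + LCM	link
154	2024.11.23	EMDM: Efficient Motion Diffusion Model for Fast and High-Quality Motion Generation	Fast, high-quality human motion generation. 1. Latent-space methods incur the cost of learning the latent space. 2. DDIM degrades quality.	Uses a conditional denoising diffusion GAN to capture the multimodal data distribution at arbitrary (possibly larger) step sizes given the control signal, enabling high-fidelity, diverse few-step motion sampling. – Fast diffusion scheme – May violate physical laws (e.g. floating/ground penetration)	Control: text (CLIP). Representation: raw data. Backbone: Transformer. Problem addressed: generation speed	link
	2023.5.19	Executing your Commands via Motion Diffusion in Latent Space		Applies diffusion in latent space to reduce computational complexity; latent compression speeds up generation – Limited motion length – Body only (no hand/face motion)	Control: text (CLIP). Representation: latent (Transformer-based VAE). Backbone: Transformer. Problem addressed: generation speed. Other: MLD, LMDM	link

Text controllability

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
161	2025.5.8	ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment	Synthesizing 3D human motion from bilingual text input	1. BiHumanML3D, a bilingual human motion dataset. 2. Bilingual Motion Diffusion model (BiMD), which captures semantics through cross-lingual aligned representations for unified bilingual modeling. 3. Reward-guided sampling alignment (ReAlign), consisting of a step-aware reward model that assesses alignment quality during sampling and a reward-guided strategy that steers diffusion toward the optimally aligned distribution. The reward model combines step-aware tokens with a text-alignment module for semantic consistency and a motion-alignment module for realism, refining the noisy motion at every step to balance probability density and alignment. Experiments show state-of-the-art text-motion alignment and motion quality.		link
158	2023.12.22	FineMoGen: Fine-Grained Spatio-Temporal Motion Generation and Editing	Generating complex motion sequences that follow fine-grained descriptions	1. A new Transformer architecture called Spatio-Temporal Mixture Attention (SAMI). SAMI improves the generation of the global attention template in two ways: 1) spatio-temporal mixture attention explicitly models the constraints of spatio-temporal composition; 2) a sparsely activated mixture of experts adaptively extracts fine-grained features. 2. The HuMMan-MoGen dataset with 2,968 videos and 102,336 fine-grained spatio-temporal descriptions. – Limited support for motion data formats – Depends on large language models	Control: text. Representation: (none given). Backbone: Transformer. Problem addressed: text fidelity	link
	2023.12.21	AMD: Anatomical Motion Diffusion with Interpretable Motion Decomposition and Fusion	Handling text inputs that describe complex or long motions	Uses a large language model (LLM) to parse the input text into a sequence of concise, interpretable anatomical scripts corresponding to the target motion	link
	2023.10.19	Humantomato: Text-aligned whole-body motion generation	Text-to-whole-body motion (body, facial expression, hand gesture)	link
157	2023.9.12	Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model	Existing methods only generate deterministic or imprecise motion sequences and cannot effectively control spatio-temporal relations to match the given text.	1) A linguistic-structure assisted module that builds accurate, complete language features to fully exploit the text; 2) a context-aware progressive reasoning module that learns local and global semantic linguistic features with shallow and deep graph neural networks for multi-step reasoning. – Linguistic-structure assisted module – Limited by the capability of the language model	Control: text (CLIP). Representation: raw data. Backbone: others. Problem addressed: text fidelity	link
166	2023.6.26	Flame: Free-form language-based motion synthesis & editing	Motion generation with frame-level and joint-level local editing, without fine-tuning	Pure Transformer decoder + masking strategy; dynamic masking handles variable-length input and flexibly supports complex motion composition – High compute cost	Text (RoBERTa)	link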

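Long-sequence generation

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
170	2024.7.16	Length-Aware Motion Synthesis via Latent Diffusion	Synthesizing motion sequences for a target duration (length-aware motion generation)	1) A length-aware VAE that learns motion representations with length-dependent latent codes; 2) a length-conforming latent diffusion model whose generated motion becomes richer in detail as the target sequence length increases
	2024.2.23	Seamless Human Motion Composition with Blended Positional Encodings	Generating long, continuous motion from a sequence of changing text descriptions	1. Blended positional encodings: the absolute-encoding stage recovers global motion coherence, while the relative-encoding stage builds smooth, realistic transitions. 2. Two new metrics, peak jerk and area under the jerk curve, for detecting abrupt transitions – Fails on complex text descriptions – Some transitions are slightly mismatched	Problem addressed: long-sequence generation. Other: FlowMDM	link

Training data

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
156	2023.5.16	Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation	Diffusion-based motion generation is limited by its reliance on relatively small motion capture datasets and performs poorly on diverse in-the-wild prompts.	Drops sequential generation (where specific losses enforce temporal consistency) and relies only on the standard diffusion loss. Two-stage training: pretraining on a large static 3D pose dataset first learns pose-text associations. Motion continuity is kept through parallel sampling and large-scale pretraining. – High compute cost – Some generated motions are unnatural	Control: text (T5). Representation: raw data. Backbone: UNet. Problem addressed: training data	link

Mamba

ID	Year	Name	Note	Tags	Link
	2025.5.15	Dyadic Mamba: Long-term Dyadic Human Motion Synthesis		Text-to-very-long two-person motion	link
	2025	Motion Mamba: Efficient and Long Sequence Motion Generation	– Dual-module denoising U-Net: • Hierarchical temporal Mamba • Bidirectional spatial Mamba	– Short-sequence performance not shown – Generalization not verified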

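Human Pose Estimation

Outputs joint positions, rotations, and connectivity

mindmap
Vision-based human motion capture
    Input
        Single image/continuous video
        Monocular/multi-view
        Fixed/moving camera pose
    Output format
        SMPL/SMPLX/SMPLH skeletal motion
        Custom skeletal motion
    Output targets
        Single-person skeletal motion
        Multi-person skeletal motion
        Human skeletal motion + object pose
        Camera pose
    Methods
        Per-sequence optimization
        Feed-forward inference
    Problems to solve
        Motion continuity
        Motion plausibility
        Occlusion and ambiguity in video data
        Real-time streaming output
        Consistency with the image data
        Accurate contact without penetration
        Coupling among human motion, camera pose, and body shape

Single-person HPE

Image-based single-person HPE

ID	Year	Name	Note	Tags	Link
31		SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation	SMPL-based HMR with a Transformer framework		link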

Solving Body Structure Understanding

Image-based human-object interaction (HOI)

ID	Year	Name	Note	Tags	Link
	2025.4.24	PICO: Reconstructing 3D People In Contact with Objects			link

Video-based single-person HPE

Solving Single-frame Limitation

ID	Year	Name	Note	Tags	Link
	2025.5.29	GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion		Produces accurate, temporally consistent depth and normal maps from monocular human video	link

Solving Real-time Problems

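ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2025.8.29	Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning	Diffusion-based methods are expensive	A Hierarchical Temporal Pruning (HTP) strategy dynamically prunes redundant pose tokens at both the frame level and the semantic level while preserving key motion dynamics.

Solving Body Structure Understanding

ID	Year	Name	Note	Tags	Link
		Learned Neural Physics Simulation for Articulated 3D Human Pose Reconstruction	Proposes a neural network that replaces a conventional physics engine to assist video-based motion understanding.	LARP

Solving Data Lacking

ID	Year	Name	Note	Tags	Link
103	2025.5.2	GENMO: A GENeralist Model for Human MOtion	Treats HPE as video-conditioned motion generation; estimation and generation reinforce each other, improving estimation accuracy.	Generalist human motion model, motion estimation, motion generation, NVIDIA	link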

Human Mesh Recovery

Template-based human mesh recovery

Naked human body recovery

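ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2025.8.13	HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics	1. Conventional HPE methods ignore the relation between a person's specific body shape and the 3D pose, which costs accuracy. 2. Pose is optimized by how well constraints derived from 2D images align.	1. First calibrates the user's body shape, then fits a personalized pose based on that shape. 2. Develops a shape-conditioned 3D pose prior that mitigates errors caused by over-reliance on 2D constraints. Improves both pelvis-aligned and absolute pose accuracy.	Trained on synthetic data only, plug-and-play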

Multimodal Methods

ID	Year	Name	Tags	Link
[123]	2019
[124]	2022
[125]	2022
[126]	2022
	2023	WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion	单人，移动相机	link
	2024	Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment	2D to 3D lifting	link
Moritz Einfalt, Katja Ludwig, and Rainer Lienhart. Uplift and upsample: Efficient 3d human pose estimation with uplifting transformers. In IEEE Winter Conf. Appl. Comput. Vis., pages 2903–2913, 2023.
Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, and Wenming Yang. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Trans. Multimedia, 25:1282–1293, 2022a.
Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-stmo: Pre-trained spatial temporal many-to-one model for 3d human pose estimation. In Eur. Conf. Comput. Vis., pages 461–478. Springer, 2022.
Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13232–13242, 2022.
Zhenhua Tang, Zhaofan Qiu, Yanbin Hao, Richang Hong, and Ting Yao. 3d human pose estimation with spatio-temporal criss-cross attention. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4790–4799, 2023.
Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, and Chen Chen. Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8877–8886, 2023.

Utilizing Attention Mechanism

ID	Year	Name	Note	Tags	Link
	2023	Humans in 4D: Reconstructing and Tracking Humans with Transformers		图像，开源	link

Exploiting Temporal Information

ID	Year	Name	Note	Tags	Link
[134]	2019
[135]	2021
[136]	2021
[137]	2021
[138]	2022
[139]	2023	Global-to-local modeling for video-based 3d human pose and shape estimation	To effectively balance the learning of short-term and long-term temporal correlations, Global-to-Local Transformer (GLoT) [139] structurally decouples the modeling of long-term and short-term correlations.	video, single person, SMPL, non-streaming, transformer	link
	2024	TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Video	仅图像特征恢复3D动作		link

Developing Various Representations

Utilizing Structural Information

ID	Year	Name	Note	Tags	Link
26	2024.4.5	PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos	利用物理合理化人物动作	基于SMPL模型从单目视频估计人体动力学，但仅通过拉格朗日损失隐式融入物理约束	link

Choosing Appropriate Learning Strategies

ID	Year	Name	Note	Tags	Link
161	2019
44	2020
163	2020	Coherent reconstruction of multiple humans from a single image		图像，多人
164	2021
46	2021
214	2021
165	2022
166	2022
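167	2023	Jotr: 3d joint contrastive learning with transformers for occluded human mesh recovery	Fuses 2D and 3D features and supervises the 3D features through a Transformer-based contrastive learning framework
162	2023	Refit: Recurrent fitting network for 3d human recovery	Reprojects keypoints and refines the body model through a feedback-update loop
4	2023	Co-evolution of pose and mesh for 3d human body estimation from video	Introduces a co-evolution approach to human mesh recovery that uses 3D pose as an intermediary. The process has two stages: first estimate the 3D human pose from video, then regress the mesh vertices from the estimated 3D pose combined with temporal image features	open source, single person, video, mesh	link
168	2023	Cyclic test-time adaptation on monocular video for 3d human mesh reconstruction	To bridge the gap between training and test data, CycleAdapt [168] proposes a domain adaptation method comprising a mesh reconstruction network and a motion denoising network, enabling more effective adaptation.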

Detailed human body recovery

With Hands

ID	Year	Name	Note	Tags	Link
173	2023	SGNify, a model that captures hand pose, facial expression, and body movement from sign language videos. It employs linguistic priors and constraints on 3D hand pose to effectively address the ambiguities in isolated signs.
174	2021	the relationship between Two-Hands
175	2021	the relationship between Hand-Object
	2023	HMP: Hand Motion Priors for Pose and Shape Estimation from Video	先用无视频信息的手势数据做手势动作先验。基于先验再做手势识别	手、开源	link

Whole Body

ID	Year	Name	Note
176
177
178	2021	independently running 3D mesh recovery regression for face, hands, and body and subsequently combining the outputs through an integration module
179	2021	integrates independent estimates from the body, face, and hands using the shared shape space of SMPL-X across all body parts
180	2022	Accurate 3d hand pose estimation for whole-body 3d human mesh estimation	end-to-end framework for whole-body human mesh recovery named Hand4Whole, which employs joint features for 3D joint rotations to enhance the accuracy of 3D hand predictions
181	2023	Pymaf-x: Towards well-aligned full-body model regression from monocular images	to resolve the misalignment issues in regression-based, one-stage human mesh recovery methods by employing a feature pyramid approach and refining the mesh-image alignment parameters.
215
182	2023	One-stage 3d whole-body mesh recovery with component aware transformer	a simple yet effective component-aware transformer that includes a global body encoder and a local face/hand decoder instead of separate networks for each part
183

Template-free human body recovery

运动相机场景

提取相机轨迹

ID	Year	Name	Note	Tags	Link
	2022	BodySLAM: Joint Camera Localisation, Mapping, and Human Motion Tracking
	2023	Decoupling Human and Camera Motion from Videos in the Wild	联合优化人体姿势和相机scale，使人体位移与学习的运动模型相匹配	多人	link

Evaluation

Evaluation metrics

For pose and shape reconstruction

mean per-joint position error (MPJPE), Procrustes-aligned per-joint position error (PA-MPJPE),
per-vertex error (PVE)

To evaluate the motion smoothness

acceleration error (ACCEL) against the ground truth acceleration
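
The pose and smoothness metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not any benchmark's official evaluation code: the function names are hypothetical, joints are assumed to be `(J, 3)` arrays (or `(T, J, 3)` for sequences) in matching units, and the acceleration error is left in per-frame units because scaling conventions differ between papers.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error between (J, 3) joint arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Similarity-align pred to gt (rotation, uniform scale, translation)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Orthogonal Procrustes via SVD of the 3x3 cross-covariance.
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.ones(3)
    if np.linalg.det(u @ vt) < 0:      # avoid returning a reflection
        d[-1] = -1.0
    rot = u @ np.diag(d) @ vt          # applied on the right: p @ rot
    scale = (s * d).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction."""
    return mpjpe(procrustes_align(pred, gt), gt)

def accel_error(pred_seq, gt_seq):
    """Mean difference of finite-difference accelerations, (T, J, 3) inputs."""
    acc_p = pred_seq[2:] - 2 * pred_seq[1:-1] + pred_seq[:-2]
    acc_g = gt_seq[2:] - 2 * gt_seq[1:-1] + gt_seq[:-2]
    return float(np.linalg.norm(acc_p - acc_g, axis=-1).mean())
```

PA-MPJPE removes global rotation, scale, and translation, so it isolates articulated pose error, while MPJPE also penalizes global misalignment.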

For human trajectory evaluation

Slice a sequence into 100-frame segments and evaluate the 3D joint error after aligning the first two frames (W-MPJPE100) or the entire segment (WA-MPJPE100) [93].
Evaluate the error of the entire trajectory after aligning the first frame, using root translation error (RTE), root orientation error (ROE), and egocentric root velocity error (ERVE).

For camera trajectory evaluation

absolute trajectory error (ATE) [75], which performs Procrustes with scaling to align the estimation with ground truth before computing error.

To evaluate the accuracy of scale estimation

Evaluate ATE using the estimated scale (ATE-S) [35].

Reference

Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey

Face Reenactment and Identity Preservation

3D Face Generation and Editing

Text-to-Face and Style-Based Face Generation

【翻译】

可动画头部建模

参数化3D头部模型作为统计先验被广泛应用于可动画头部建模。3D可变形模型（3DMM）[Paysan等，2009]通过低维主成分表示头部形状。在此基础上，FLAME模型[Li等，2017]引入形状与姿势混合形状（blendshapes），实现了下颌、颈部及眼球的运动控制。后续研究[Daněček等，2022；Feng等，2021，2023]基于参数化头部模型[Blanz与Vetter，2023；Li等，2017；Ploumpis等，2020]进一步建模细节表情与情感。ROME方法[Khakhulin等，2022]提出顶点偏移量以捕捉头发几何，但这些方法因固定拓扑和有限表达能力常产生过度平滑的表面，难以处理头饰或复杂发型等几何结构。另一类研究探索混合表示：DELTA[Feng等，2023]将面部显式网格与NeRF头发建模结合，支持多样化发型。

为实现高质量渲染，多项工作[Gafni等，2021；Grassal等，2022；Xu等，2023]采用神经辐射场（NeRF）[Mildenhall等，2021]建模头部虚拟形象。HeadNeRF[Hong等，2022]提出参数化NeRF模型，将头部模型融入NeRF；INSTA[Zielonka等，2023]基于InstantNGP[Müller等，2022]开发动态NeRF。PointAvatar[Zheng等，2023]提出基于点的表征，通过FLAME表情驱动点云形变场。NeRFBlendshape[Gao等，2022]构建基于NeRF的混合形状模型，结合多级体素场与表情系数实现语义动画控制与超写实渲染。

近期研究[Chen等，2024；Dhamo等，2025；Ma等，2024等]利用3D高斯溅射（3D Gaussian Splatting）[Kerbl等，2023]建模头部形象。FlashAvatar[Xiang等，2024]在网格上附加可学习偏移量的高斯点；GaussianBlendshapes[Ma等，2024]将偏移解耦为混合形状。尽管这些方法对写实形象有效，但难以处理风格化内容。

生成式头部建模

头部建模领域的最新进展利用生成模型合成新视角。PanoHead[An等，2023]采用三网格神经体积表征，支持360度头部合成；Rodin[Wang等，2023b]及其扩展RodinHD[Zhang等，2024]通过扩散模型生成头部三平面图。但这些生成的头部均为静态，无法动画。Liveportrait[Guo等，2024]可将单图动态化为视频，但局限于2D空间。CAT4D[Wu等，2024a]训练多视角可变形扩散模型创建动态形象，但基于扩散的方法常面临跨视角一致性挑战。

另一类研究[Chen等，2023a；Liao等，2024等]通过分数蒸馏采样（SDS）将2D扩散先验提炼至3D，虽能实现高质量，但单形象生成需数小时。相比之下，前馈方法[Hong等，2023；Tang等，2025等]在大规模3D数据集训练后可在秒级生成资产，但因训练数据为通用物体，应用于头部时存在显著领域差距，常产生形状失真。总体而言，现有推理方法仍局限于静态形象重建。

【深度解析】

技术演进图谱

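Technical route	Representative methods	Core breakthrough	Key limitation
Parametric modeling	3DMM/FLAME	Establishes an animatable blendshape parameterization	Fixed topology loses geometric detail
Neural radiance fields (NeRF)	HeadNeRF/INSTA	Photorealistic rendering and dynamic lighting	Hard to fit into traditional animation pipelines / high compute cost
Point and Gaussian representations	PointAvatar/GaussianBlendshapes	Flexible representation that supports non-rigid deformation	Adapts poorly to stylized content / lacks semantic control
Hybrid representations	DELTA	Region-wise optimization (face mesh + hair NeRF)	Unnatural transitions at the seams
Generative modeling	Rodin/PanoHead	Zero-shot generation from a single image to 3D	Static output / inconsistent geometry across views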

关键技术瓶颈突破

动态-静态表征鸿沟
- 现有生成式方法（如扩散模型）多聚焦静态输出，需通过时序感知的潜在空间编码将动画参数（如FACS系数）注入生成过程
- 潜在解决方案：在NeRF体积场中嵌入可驱动的形变场（如SE(3)-Field），实现表情驱动的密度场变化
风格化内容建模
- 传统参数化模型对非写实风格的泛化能力弱，需开发解耦式风格迁移框架：
  - 几何风格（如卡通比例）通过对抗学习在顶点位移空间建模
  - 外观风格（如赛博朋克色调）通过纹理生成网络实现
跨模态控制
- 现有方法缺乏多粒度控制接口，理想系统应支持：
  - 高层语义控制：通过自然语言描述调整发型（如"蓬松卷发+金属耳环"）
  - 底层参数控制：精确调节混合形状权重与骨骼绑定

该领域正经历从"重建-驱动"到"生成-动画"的范式转换，下一阶段突破将取决于神经符号系统（结合生成式AI与参数化建模）与物理启发生成（模拟真实肌肉运动）的深度融合。

Key characteristics of text-to-face generation and editing models:

模型名称	基础架构/方法	主要贡献	输入	输出	训练目标/优化方法	关键创新点
AdaTrans [32]	非线性潜在空间变换（基于StyleGAN）	改进复杂条件编辑能力，保持图像真实感	潜在代码 + 编辑条件	编辑后的面部图像	自适应非线性变换优化	非线性潜在空间变换替代传统线性编辑（如StyleGAN），提升编辑灵活性
StyleT2I [33]	StyleGAN + CLIP引导	解决属性组合性与生成忠实度问题	文本描述	符合文本的面部图像	CLIP-guided对比损失 + 文本到方向模块（Text-to-Direction）	文本到方向模块学习潜在方向；组合属性调整确保多属性正确表达
M3Face [34]	Muse/VQ-GAN + ControlNet + Imagic优化	支持多模态输入（多语言文本、分割掩码、地标）	文本/掩码/地标	多模态编辑的面部图像	多模态条件输入融合 + Imagic高保真微调	端到端集成生成与编辑流程，支持多语言与多模态输入
GuidedStyle [35]	StyleGAN + 知识网络（预训练属性分类器）	实现精准、可解释的语义面部编辑	属性条件（如年龄、表情）	属性编辑后的面部图像	稀疏注意力控制分层编辑 + 知识网络引导	稀疏注意力机制实现分层编辑；知识网络防止意外属性变化
AnyFace [36]	StyleGAN + 两流框架 + CLIP	开放世界自由文本生成，解决模式崩溃与词汇限制	自由文本描述	多样化且对齐文本的面部图像	跨模态蒸馏（CLIP） + 多样性三元组损失（Diverse Triplet Loss）	两流框架分离合成与重建；跨模态蒸馏增强文本-图像对齐；多样性损失提升生成丰富性

ID	Year	Name	Note	Tags	Link
	2025.5.8	SOAP: Style-Omniscient Animatable Portraits	从单张图像生成可动画化的3D虚拟头象	FLAME，FACS面部动作编码，多风格3D头像数据集	link
	2025.5.2	Model See Model Do: Speech-Driven Facial Animation with Style Control		语音驱动，唇形同步，风格	link

关键说明

架构演进：
- 基础模型：多数基于StyleGAN，逐步引入CLIP、ControlNet等多模态组件。
- 编辑方式：从线性（StyleGAN）→ 非线性（AdaTrans）→ 分层（GuidedStyle）→ 开放世界（AnyFace）。
多模态支持：
- M3Face支持文本、掩码、地标混合输入，扩展应用场景。
生成可控性：
- StyleT2I通过文本到方向模块实现语义精准控制；GuidedStyle利用稀疏注意力避免属性干扰。
开放性与多样性：
- AnyFace通过两流框架与多样性损失，突破传统模型的词汇限制与模式崩溃问题。

Speech-Driven and Multimodal Expression Generation

Key characteristics of 3D facial animation generation and editing models:

模型名称/引用	基础架构/方法	主要贡献	输入	输出	训练目标/优化方法	关键创新点
[37] 2021	GPT-2文本编码器 + 扩张卷积音频编码器	双模态（音频+文本）驱动，提升上半脸表情与唇同步（优于VOCA [38]/MeshTalk [39]）	音频 + 文本	3D面部动画	联合音频-文本特征对齐	首个双模态联合模型，但缺乏头部与视线控制
CSTalk [40] 2024.4	Transformer编码器	捕捉面部区域相关性，增强情感语音驱动的动画真实感	情感语音	情感面部动画	面部区域关联建模	基于Transformer的跨区域关联编码，但仅支持5种情感
ExpCLIP [41] 2023	CLIP编码器（文本/图像/表情对齐）	支持文本/图像驱动的表情动画，适配多样化情感风格	文本/图像 + 语音	表情丰富的面部动画	CLIP多模态对齐 + TEAD数据集 + 表情提示增强（Expression Prompt Augmentation）	三模态（文本/图像/表情）对齐，扩展情感风格泛化性
[42] 2023.10	解缠表示（风格+内容）	提升身份保持与过渡平滑性，优于FaceFormer [43]的视听同步	语音 + 身份特征	个性化面部动画	解缠风格与内容表征	身份保留优化，但计算效率较低
AdaMesh [44] 2023.10	Expression Adapter (MoLoRA) + Pose Adapter	个性化语音驱动动画，表达力/多样性/同步性优于GeneFace [45]/Imitator [46]	语音 + 个性化参数	个性化表情与姿势动画	MoLoRA增强的表情适配器 + 基于检索的姿势适配器	分模块适配表情与姿势，支持高效个性化定制
[47] 2023	FaceXHuBERT [48] + FaceDiffuser [49]	解耦情感表达与随机运动多样性	语音 + 情感标签	多样化情感动画	随机扩散过程增强运动变化	结合HuBERT语音特征与扩散模型，实现可控随机性
NFR [51] 2023	解耦编码（身份码 $z_i$ + 表情码 $z_e$）	自动绑定与表情重定向，支持可解释参数（zFACS）	无表情网格 + 目标中性网格	重定向后的动画网格	身份与表情解耦训练 + 可解释参数生成	艺术家友好工具，支持自动绑定与参数化表情控制

关键说明

多模态驱动：
- [37] 和 ExpCLIP 通过音频/文本/图像多模态输入增强动画表现力。
- NFR 专注于网格数据的解耦与重定向，适用于影视与游戏制作。
个性化与解耦：
- [42] 和 AdaMesh 通过解缠表示或模块化适配器提升身份保留与个性化控制。
- [47] 结合扩散模型实现随机运动多样性，平衡可控性与自然性。
技术挑战：
- 部分模型（如 [42]）牺牲计算效率以提升生成质量，需进一步优化实时性。
- 情感类型限制（如 CSTalk 仅支持5种情感）仍是细分场景应用的瓶颈。

此表格总结了3D面部动画生成模型的核心技术路径，突出多模态驱动、解耦表示与个性化适配的演进方向。

Reference

Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions

mindmap
基于学习的动作生成
    按生成方式分
        自回归生成
        非自回归生成
            Regression
            完形填空式（Bert Style）
    按动作表示分
        连续表示
            原始表示
            AE/VAE Latent Code
        离散表示
            VQ-VAE
    按生成模型分
        确定性映射
        离散空间采样
            离散分布采样(GPT Style)
            掩码语言模型(Bert Style)
            离散去噪扩散概率模型（D3PM）
        连续空间采样
            VAE
            GAN
            diffusion

无条件动作生成

GAN

ID	Year	Name	解决了什么痛点	主要贡献是什么	Tags	Link
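133	2022.12.18	Modi: Unconditional motion synthesis from diverse data.	Unconditional motion synthesis from a given distribution	1. MoDi, a generative model trained in an unsupervised setting on an extremely diverse, unstructured, and unlabeled dataset. 2. Brings StyleGAN-style control to motion generation, enabling stylized motion synthesis	Condition: none; Generation: regression; Representation: continuous (possibly VAE); Generative model: StyleGAN; Other: mode collapse/mixing (generated motions repeat or become chaotic)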
	2022	Ganimator: Neural motion synthesis from a single sequence	小样本生成

TEXT-CONDITIONED MOTION GENERATION

Action to Motion

VAE

ID	Year	Name	Note	Tags	Link
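	2021.10	Action-Conditioned 3D Human Motion Synthesis with Transformer VAE	Generates diverse, realistic 3D human motions and serves as a baseline for later work. Limitations: non-learnable differentiable SMPL layer; strong data dependence; generating long sequences is computationally intensive	ACTOR, Transformer + VAE, latent Gaussian distribution alignment
	2020	Action2Motion	– First method for action-conditioned motion generation – Lie-algebra-based VAE framework – Builds a new 3D motion dataset: HumanAct12	– Limited generalization – Only generates simple motions of a single action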

Normalizing Flows

ID	Year	Name	Note	Tags	Link
	2024.7	Stylevr: Stylizing character animations with normalizing flows	Style Label

Text to Motion

潜在表征对齐

ID	Year	Name	Note	Tags	Link
100	2025.5.16	MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation	一种代替CLIP的文本编码方式，其编码空间能跟Motion有更好的对齐，因此更适用于文生动作任务。 MoCLIP是CLIP的Motion版，不能独立使用，需结合基于CLIP的文生动作Pipeline。		link
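	2023	TMR [120]	Improves on TEMOS with better text-motion alignment and supports both retrieval and generation: given text, it outputs 3D motion or cross-modal retrieval results. A contrastive loss optimizes the joint latent space, and misleading negatives are filtered out by text similarity (MPNet). It introduces CLIP-style contrastive learning and a better negative selection strategy to improve retrieval. Both TEMOS and TMR align modalities through a shared latent space; TMR adds the contrastive loss to strengthen retrieval.	Contrastive learning in the VAE latent space; text-similarity filtering of negatives. Limitations: limited generalization; memory-inefficient in some scenarios
	2022	Motionclip: Exposing human motion generation to clip space	Aligns the motion latent space directly with CLIP's semantic text embeddings, which gives zero-shot generalization. However, CLIP embeddings mainly capture static image-text semantics and struggle to represent the temporal and kinematic detail needed for realistic motion synthesis. In addition, directly fine-tuning CLIP embeddings may cause catastrophic forgetting of the pretrained semantic knowledge.
	2022	Temos: Generating diverse human motions from textual descriptions.	Improves on ACTOR to generate SMPL motion from text. Text and motion representations are aligned in a shared latent space (cross-modal consistency) using symmetric encoders: a motion-sequence encoder plus a frozen DistilBERT text encoder. The generated motion is realistic, but memory consumption is high (quadratic, so unsuitable for long motions), long sequences are handled poorly, and diversity is limited. – Relies on cross-modal embedding similarity – Tends to fail when the text contains spelling errors – Limited diversity	Transformer VAE, latent space alignment, non-autoregressive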

VAE

ID	Year	Name	Note	Tags
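	2022	TEACH [118]	Extends TEMOS to generate coherent motion from a sequence of text instructions. Generation is hierarchical: non-autoregressive within a single action and autoregressive across actions, which gives temporal composition with smooth transitions and balances quality against efficiency. Known issue: acceleration spikes tend to appear at action transitions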
	2023	ATOM [Zhai et al., 2023]	– CVAE分解复杂动作为原子动作 – 基于掩码运动的课程学习策略	– 解释复杂文本能力有限 – 文本-运动特征融合策略不足
	2023	MultiAct [Lee et al., 2023]	– 条件VAE架构 – 生成多动作长序列模型	– 生成不真实运动 – 无法生成复杂多样动作序列
	2022	ImplicitMotion [Cervantes et al., 2022]	– 变分隐式神经表示 – 线性计算成本	– 参数更新导致性能不稳定
	2022	UM-CVAE [Zhong et al., 2022]	– 解耦序列级CVAE – 基于FiLM的动作感知调制	– 无法生成全新动作 – 生成运动质量有限（数据依赖）
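	2022a	Generating diverse and natural 3d human motions from text	Two-stage framework (convolutional AE + temporal VAE): a pretrained motion encoder extracts motion snippet codes, then a temporal VAE generates the sequence of motion codes. – text2length stage determines the motion duration – text2motion stage generates the motion with the temporal VAE – Cannot handle rare actions (e.g. "stomping") – Fails on fine-grained descriptions and complex actions – Generated motion can be unrealistic	T2M, Transformer VAE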
144	2023.3.7	Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Action Generation	当前运动捕捉数据集中的动作短语通常只包含最精简的核心信息。	通过为大语言模型精心设计提示模板，我们能够生成更丰富、更细粒度的动作描述。 – 首个基于LLM的文本条件运动生成 – 兼容VAE模型的模块 – 无法生成长序列 – 不支持复杂身体运动（瑜伽/舞蹈） – 无手指运动

Motion-Conditioned Motion Generation

ID	Year	Name	Note	Tags	Link
27	2024	Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment	2D轨迹生成3D Motion		link
19	2024	WANDR: Intention-guided Human Motion Generation	基于初始与结束状态控制的动作生成。		link

Training Free

ID	Year	Name	Note	Tags	Link
	2025.5.2	TSTMotion: Training-free Scene-aware Text-to-motion Generation		scene-aware, text-to-motion	link
	2024	Move as you say, interact as you can: Language-guided human motion generation with scene affordance	AffordMotion
	2023	Synthesizing diverse human motions in 3d indoor scenes
	2022	Humanise: Language-conditioned human motion generation in 3d scenes

AUDIO-CONDITIONED MOTION GENERATION

Speech to Gesture

Year	Name	Note	Tags	Link
2025.5.6	PAHA: Parts-Aware Audio-Driven Human Animation with Diffusion Model		音频驱动上半身人体动画	link, link
2025.5.14	CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation		单阶段音频驱动的说话身体生成	link
2025.6.1	TRiMM: Transformer-Based Rich Motion Matching for Real-Time multi-modal Interaction in Digital Humans		实时3D手势生成	link
2025.5.29	MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation		利用音频，以及从音频信号生成的运动掩码和运动特征，共同驱动生成同步的语音-手势视频	link
2025.5.22	MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation		以自我为中心的手-物体运动生成	link
2025.5.21	Intentional Gesture: Deliver Your Intentions with Gestures for Speech		意图驱动手势生成框架	link
2025.5.14	Robust Photo-Realistic Hand Gesture Generation: from Single View to Multiple View		高保真手势生成	link
2025.5.3	Co$^{3}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion		语音生成手势、双人交互、数据集	link
2024	Diffsheg: A diffusion-based approach for real-time speech-driven holistic 3d expression and gesture generation
2024	Emotional speech-driven 3d body animation via disentangled latent diffusion
2024	Semantic gesticulator: Semantics-aware co-speech gesture synthesis
2023	Gesturediffuclip: Gesture diffusion model with clip latents	– 多模态提示控制风格（文本+语音） – CLIP引导的语音同步手势合成	– 数据依赖性强 – CLIP对细节运动建模有限
2023	DiffGesture [Zhu et al.]	– 扩散音频-手势Transformer（多模态信息处理） – 扩散手势稳定器消除时序不一致	– 数据多样性不足 – 计算成本高昂

SCENE-CONDITIONED MOTION GENERATION

确定性映射

ID	Year	Name	解决了什么痛点	主要贡献是什么	Tags	Link
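131	2016	A deep learning framework for character motion synthesis and editing	Automatically generating character motion data	Pioneered deep-learning-based motion generation	Condition: trajectory, style (style transfer); Generation: non-autoregressive; Representation: continuous (AE); Generative model: deterministic mapping	link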

VAE Based

ID	Year	Name	Note	Tags	Link
	2025.6.18	HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization			link
29	2024	PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios	基于2D轨迹或视频的行人动作生成		link

Diffusion

ID	Year	Name	Note	Tags	Link
	2025.5.19	UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes		整合静态环境、可移动物体、自然语言提示和空间路径点等多模态信息的文生动作	link
	2024.3.26	Move as you say, interact as you can: Language-guided human motion generation with scene affordance		3D环境中的文生3D动作	link

交互动作生成

ID	Year	Name	Note	Tags	Link
	2025.5.20	Large-Scale Multi-Character Interaction Synthesis		生成大规模多角色交互的角色动画	link

多人动作生成

ID	Year	Name	解决了什么痛点	主要贡献是什么	Tags	Link
	2025.5.23	Multi-Person Interaction Generation from Two-Person Motion Priors		利用现有双人运动扩散模型作为运动先验，生成逼真且多样化的多人交互动作	link

3D人体运动生成与合成数据集

数据集名称	关键统计	模态	链接/备注
Motion-X++ [301]	1950万3D姿势，120,500序列，80,800视频，45,300音频，自由文本描述	3D/点云、文本、音频、视频	Motion-X++
HumanMM (ms-Motion) [308]	120长序列（237分钟），600多视角视频重建，包含罕见交互动作	3D/点云、视频	HumanMM
Multimodal Anatomical [309]	51,051姿势（53解剖标记），48虚拟视角，2000+病理运动变体	3D/点云、文本	Multimodal Anatomical Motion
AMASS [242]	11,265动作片段（43小时），整合15个数据集（如CMU、KIT），SMPL格式，100+动作类别	3D/点云	AMASS
HumanML3D [119]	14,616序列（28.6小时），44,970文本描述，200+动作类别	3D/点云、文本	HumanML3D
BABEL [307]	43小时动作（AMASS数据），250+动词中心动作类别，13,220序列，含时序动作边界	3D/点云、文本	BABEL
AIST++ [246]	1,408舞蹈序列（1010万帧），9摄像机视角，15小时多视角视频	3D/点云、视频	AIST++
3DPW [245]	60序列（51,000帧），多样化室内/室外场景，挑战性姿势与物体交互	3D/点云、视频	3DPW
PROX [310]	20受试者，12交互场景，180标注RGB帧，场景感知运动分析	3D/点云、图像	PROX
KIT-ML [304]	3,911动作片段（11.23小时），6,278自然语言标注（52,903词），BVH/FBX格式	3D/点云、文本	KIT-ML
CMU MoCap	2605试验，6大类23子类，140+受试者	3D/点云、音频	CMU MoCap

文本到动作生成评估指标

评估指标	定义/计算方式	用途	典型基准
FID (Fréchet Inception Distance)	比较生成与真实动作特征分布的Fréchet距离（低值表示更真实）	真实性评估（如虚拟现实应用）	HumanML3D, KIT Motion-Language
R-Precision [311]	在共享嵌入空间中，正确文本在Top-k匹配中的比例（如Top-1/3）	语义一致性（文本-动作对齐）	HumanML3D, BABEL
MultiModal Distance [312]	动作与文本嵌入的欧氏距离（低值表示强语义耦合）	跨模态语义对齐量化	ExpCLIP [41], TMR [120]
Diversity [313]	随机采样动作对的平均距离（高值表示生成多样性）	动作空间覆盖广度	DiverseMotion [122], Motion Anything [125]
Multimodality [313]	同一文本生成多动作的方差（高值表示单提示下的多样性）	单提示多样性（避免重复）	MoMask [123], TEACH [118]
用户研究 (User Studies)	人工评分自然度、情感表达、上下文相关性	主观质量评估（自动化指标补充）	研究论文中常用（如[314]）
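
As a rough illustration of how the first two metrics in the table are computed, the sketch below implements the Fréchet distance between two sets of motion features and a simplified R-Precision. The function names are hypothetical, the features are assumed to come from some pretrained motion/text encoder that is not shown, and the retrieval pool here is simply the whole batch; published protocols typically fix a specific pool size.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative
    # for covariance matrices (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def r_precision(motion_emb, text_emb, k=3):
    """Fraction of motions whose paired text (same row index) is in the top-k."""
    dists = np.linalg.norm(motion_emb[:, None, :] - text_emb[None, :, :], axis=-1)
    topk = np.argsort(dists, axis=1)[:, :k]            # (N, k) nearest texts
    targets = np.arange(len(motion_emb))[:, None]
    return float((topk == targets).any(axis=1).mean())
```

Lower Fréchet distance means the generated feature distribution is closer to the real one; higher R-Precision means generated motions are retrieved by their own text more often.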

Reference

Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions
Human Motion Generation Summary
Text-driven Motion Generation: Overview, Challenges and Directions

ID	Year	Name	解决了什么痛点	主要贡献是什么	Tags
	2026.1.8	Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video	一种用于单目4D网格重建的前馈模型。给定动态对象的单目视频，我们的模型能够重建对象的完整3D形状与运动，并表示为变形场。	Feed-forward	link
	2025.8.27	ScanMove: Motion Prediction and Transfer for Unregistered Body Meshes	未注册未绑定的人体Mesh难以直接驱动	运动嵌入网络+逐顶点特征场，生成驱动网格变形的时空变形场。
	2025.6.18	GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects			link
109	2025.6.11	AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation	1. 将动态网格分解为初始状态与相对轨迹 2. 融合网格拓扑信息 3. 基于注意力机制实现高效变长压缩与重建	修正流，数据集	link
110	2025.6.9	Drive Any Mesh: 4D Latent Diffusion for Mesh Deformation from Video	1. 以文本和目标视频为条件驱动Mesh 2. 将动态网格分解为初始状态与相对轨迹 3. 使用latent set + Transformer VAE对动态Mesh进行编码 4. 使用diffusion进行生成	Latent Sets，diffusion，数据集	link

NeRF：将一个连续场景表示为一个神经网络。这个网络输入一个3D坐标和2D观察方向，输出该点的颜色和密度。通过体渲染技术，将无数个这样的点合成一张2D图像。其核心是隐式表示和基于辐射场的可微分渲染。
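
The volume rendering step can be made concrete with a small NumPy sketch of compositing along one ray. It assumes the densities and colors have already been produced by the network at `N` sample points (the network itself is not shown), and the function name is hypothetical.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Composite N samples along one ray.

    sigmas: (N,) densities, colors: (N, 3) RGB, deltas: (N,) spacing
    between consecutive samples. Returns the pixel color and the
    per-sample weights.
    """
    # Opacity contributed by each sample segment.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

Every operation here is differentiable with respect to `sigmas` and `colors`, which is what lets a photometric loss on the rendered pixel train the network.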

Nerf的优点与缺点

优点

1. 高质量的连续视图合成与平滑性

优点描述： NeRF学习的是一个连续的场景函数，因此它可以在任意尺度下进行渲染，并且生成的结果非常平滑，没有明显的瑕疵或“孔洞”。对于光滑的表面、复杂的材质和精细的细节，NeRF往往能产生更“保真”和物理上更合理的结果。
原因： 神经网络本身就是一个平滑的先验，它内在地填充了场景的空隙，并对输入数据进行了正则化。

2. 优秀的内存压缩能力

优点描述： NeRF的模型（一个几MB到几十MB的神经网络权重文件）可以表示一个非常庞大的场景。它本质上是一个强大的压缩算法，将数十亿体素的信息压缩到一个紧凑的神经网络中。
原因： 神经网络的权重共享和泛化能力使得它可以用相对较小的参数量拟合一个复杂函数。

3. 强大的泛化与先验知识

优点描述： 这是NeRF一个潜力巨大的优势。通过在大规模数据集上预训练，NeRF模型可以学习到关于物体形状、材质、光照的通用先验知识。这使得它能够：
- 在输入图像极少的情况下进行重建。
- 处理有遮挡物或不确定性的区域。
- 实现语义编辑、风格迁移等任务。
原因： 神经网络架构天然适合迁移学习和预训练。

4. 对噪声和异常值的鲁棒性更强

优点描述： 由于神经网络的平滑性，NeRF对输入图像中的噪声和匹配错误（如SfM产生的错误点）不那么敏感。
原因： 网络在训练过程中倾向于学习数据中的主要模式，而不是过拟合每一个噪声点。

缺点

1. 训练和渲染速度极慢

缺点描述： 这是原始NeRF最大的痛点。训练一个高质量模型需要数小时甚至数天，渲染一张高分辨率图像也需要数秒到数分钟。
原因： 需要为每条射线查询数百次神经网络，计算量巨大。

2. 容易陷入局部最优，出现“浮游物”瑕疵

缺点描述： NeRF在优化过程中，有时会在空白空间错误地生成半透明的“浮游物”或伪几何，尤其是在缺少视角观察的区域。
原因： 基于梯度的优化在复杂的、高维的损失空间中容易收敛到不完美的局部最优点。

3. 编辑和控制的困难

缺点描述： 由于NeRF是隐式表示，场景信息被编码在神经网络的权重中，人类很难直观地理解和编辑它。例如，想要移动场景中的一个杯子，在NeRF中是非常困难的操作。
原因： 隐式表示缺乏显式的几何和语义结构。
3DGS对比： 3DGS是显式的点云，编辑相对直观。你可以直接选择、移动、删除或修改一组高斯球。这为场景编辑、动画和组合打开了大门。

4. 对初始化和超参数敏感

缺点描述： 许多NeRF变体对相机位姿的准确性要求极高，并且其性能受学习率、网络结构等超参数的影响较大。
原因： 神经网络的训练本身就是一个复杂的优化问题。
3DGS对比： 3DGS虽然也依赖SfM初始化，但其优化过程相对鲁棒，且社区已经形成了比较固定的超参数设置，开箱即用性更好。

3D静态Nerf

基于Nerf的3D生成

ID	Year	Name	Note	Tags	Link
68	2022.9.29	DreamFusion: Text-to-3D using 2D Diffusion	利用2D扩散模型的先验知识，绕过3D数据限制，实现开放域文本到3D的高效生成，同时支持多视角一致性和几何细节。	SDS	link

基于Nerf的单图3D场景重建

动态Nerf(基于NeRF的变体，实现动态场景重建)

核心思想：扩展静态 NeRF（学习从空间位置和视角到颜色/密度的映射），增加时间维度或变形场来建模动态。
代表技术：可变形 NeRF, 时变 NeRF。
优点：理论上能建模非常复杂、连续的动态效果（如流体、布料）。
主要缺点：

优化时间长：训练/优化过程非常耗时。
渲染效率低：体渲染过程计算开销巨大。
重建质量受限：由于优化和渲染的挑战，最终重建或生成的质量（清晰度、细节）可能不如人意。
与现代引擎兼容性差：输出格式非标准网格/点云，难以集成到游戏/影视渲染管线。

Year	Name	Note	Tags	Link
2025.6.17	GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation			link
2024	Consistent4d: Consistent 360° dynamic object generation from monocular video	引入了一个视频到4D的框架，通过优化一个级联动态NeRF (Cascaded DyNeRF) 来从静态捕获的视频生成4D内容。	driving video
	Animate124	利用多种扩散先验，能够通过文本运动描述将单张野外图像动画化为3D视频。	SDS
	4D-fy	使用混合分数蒸馏采样 (hybrid SDS)，基于多个预训练扩散模型实现了引人注目的文本到4D生成。	SDS

3DGS：将一个场景显式地表示为数百万个可学习的3D高斯椭球体。每个高斯球拥有位置、协方差（尺度/旋转）、不透明度和球谐函数系数（表示颜色和视角相关外观）。通过光栅化技术，将这些球体投影并融合成2D图像。其核心是显式表示和基于点的可微分光栅化。

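Property	NeRF	3D Gaussian Splatting	Analysis
Core representation	Implicit (neural network)	Explicit (set of 3D Gaussians)
Rendering quality	High fidelity, continuous, smooth; handles fine detail and glossy surfaces better	Very high quality, but can look grainy or show holes, and breaks down at extreme close range	NeRF learns a continuous scene function. 3DGS is a discrete point-based representation, so very sparse or unevenly covered regions can show holes or graininess. Densification and pruning mitigate this, but the representation remains discrete.
Speed	Slow (training: hours to days; rendering: seconds per frame)	Very fast (training: minutes to hours; rendering: real time, >100 FPS)	NeRF queries the network hundreds of times per ray, which is computationally expensive.
Memory efficiency	High (small model, excellent compression)	Low (large model that stores every explicit attribute)	A NeRF model acts as a strong compressor, packing the information of billions of voxels into a compact network. 3DGS must explicitly store millions or even tens of millions of Gaussians.
Editability	Hard (black-box model, difficult to manipulate)	Relatively easy (select, move, and edit like a point cloud)	An implicit representation lacks explicit geometric and semantic structure, whereas an explicit point-based one is comparatively intuitive to edit.
Generalization	Strong (supports pretraining and learned priors)	Weak (each scene is optimized independently)	Neural architectures lend themselves to transfer learning and pretraining. 3DGS essentially optimizes every scene from scratch and lacks this cross-scene generalization; each Gaussian is independent and shares no semantic knowledge.
Robustness	Relatively robust to noise and poor initialization	Relies heavily on a high-quality SfM point cloud for initialization	Neural networks are smooth and do not overfit every noisy point, but many NeRF variants require very accurate camera poses. 3DGS also depends on SfM initialization, yet its optimization is relatively robust and the community has converged on fairly standard hyperparameters, so it works better out of the box.
Main applications	High-quality offline rendering, academic research, scene compression	Real-time applications (VR/AR, games), fast previews, interactive scenes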

3DGS能够克服隐式方法（特别是动态 NeRF）的效率瓶颈和兼容性问题。
但由于缺乏真实的4D标注数据，只能依赖多视角渲染进行监督学习，因此容易出现视角间的不一致性问题。

直接预测动态高斯属性

ID	Year	Name	Note	Tags	Link
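127	2025.7.31	Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis	Running diffusion directly on 4D Gaussians involves a large amount of data, so the method builds a VAE for 4D GS and performs 4D generation in that VAE's latent space		link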
	2025.6.5	SinGS: Animatable Single-Image Human Gaussian Splats with Kinematic Priors			link
	2024.6.14	L4gm: Large 4d gaussian reconstruction model	单视角视频输入生成动态物体的4D大重建模型	1. 多视角视频数据集 2.基于预训练的3D大重建模型LGM, 通过低帧率采样的视频帧生成逐帧的3D高斯泼溅表征	link
	2023.22	Stag4d: Spatial-temporal anchored generative 4d gaussians	实现具有时空一致性的高保真4D生成	单目视频->多目视频，SDS优化出GS属性	link
36	2023.4	GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians	1. 引入可动画化的 3D GS 来明确代表各种姿势和服装风格的人类。 2. 设计一个动态外观网络以及一个可优化的特征张量，用于实现运动到外观的映射。通过动态属性进一步增强3D GS表示。 3. 对运动和外观进行联合优化，缓解『单目视频中运动估计不准确』的问题。	开源、SMPLX、动态高斯	link

显式驱动静态高斯属性

与动态高斯的对比

核心思路：利用控制点/蒙皮等显式或参数化结构来驱动显式图元（如高斯椭球）的变形，从而表示动态。这比纯隐式 NeRF 更高效且渲染质量更高。
优点：

将静态几何与动态运动解耦。静态部分可以高效优化/表示，动态部分专注于运动。这通常比直接拟合整个时空函数更有效率。

主要缺点：

要解决如何有效控制显式图元随时间的变形以保持时空一致性和高质量。

问题定义

输入：首帧图像或静态3DGS，控制信号
输出：GS的动态属性，静态3DGS（Optional）

通过控制信号驱动GS，需要先学习到GS的运动方式与控制信号之间的关联。

flowchart LR

用户 -->|生成| 控制信号 -->|驱动| GS运动

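If the control signal is a video, its relationship to the GS motion is intuitive, and constraining the motion with Video SDS and video reconstruction is enough to achieve the driving effect.
If the relationship between the control signal and the GS motion is less explicit, a motion proxy is needed to drive the GS.
The GS motion can therefore be produced either by driving every Gaussian directly or by driving the Gaussians through a motion proxy.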

flowchart LR

3DGS资产 -->|生成| GS运动代理
用户 -->|生成| 控制信号
控制信号 & GS运动代理 -->|驱动| 变形GS运动代理
变形GS运动代理 & 3DGS资产 --> |带动| 变形GS资产

借助运动代理来驱动GS有这些好处：

高斯资产中的高斯球数量巨大，简化的代理更容易学习
邻近的高斯球的运动是相关联的，通过运动代理可以学到high level的运动趋势
运动代理更方便于运动迁移
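
One common way to let a sparse proxy move a dense set of Gaussians is skinning-style blending, sketched below under loud assumptions: the proxy is a handful of control nodes, each carrying a rigid transform, and each Gaussian center follows a distance-weighted blend of the nodes. The function names, the Gaussian RBF weighting, and the `sigma` bandwidth are illustrative choices, not a method taken from a specific paper in these notes. Only the centers are moved; updating each Gaussian's rotation/covariance is omitted.

```python
import numpy as np

def skinning_weights(means, nodes, sigma=0.5):
    """Soft assignment of N Gaussian centers to K proxy nodes, shape (N, K)."""
    d2 = ((means[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum(axis=1, keepdims=True)

def drive_gaussians(means, nodes, node_rot, node_trans, weights):
    """Move Gaussian centers with blended per-node rigid transforms.

    node_rot: (K, 3, 3), node_trans: (K, 3). Node k maps a point x to
    R_k (x - p_k) + p_k + t_k, where p_k is the node's rest position.
    """
    local = means[:, None, :] - nodes[None, :, :]                 # (N, K, 3)
    moved = np.einsum('kij,nkj->nki', node_rot, local)
    moved = moved + nodes[None, :, :] + node_trans[None, :, :]
    return (weights[:, :, None] * moved).sum(axis=1)              # (N, 3)
```

Because the learnable motion lives on `K` nodes rather than `N` Gaussians, nearby Gaussians move coherently, and the same node motion can be retargeted to another asset by recomputing the weights.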

因此高斯的驱动可以拆分为以下两个模块：
（1）

flowchart LR
高斯球Motion --> 代理Motion --> 高斯球Motion

可以通过以下方式配置运动代理：

人工配置
基于规则
学习

(2)

flowchart LR
驱动代理 & 交互信号 --> 驱动代理Motion

驱动代理可以是：

原始高斯球（无代理）
点云
Mesh
Skeleton
物理仿真对象

交互信号驱动驱动代理的方式与具体的驱动代理的形式有关，因此下文使用不同驱动代理的类型作为第一级分类。

技术图谱

mindmap
静态高斯驱动
    表达对象
        场景
        单个3D对象
        人/动物
    控制信号
        单/多视角视频
        文本
        力
    驱动方式
        直接驱动
        Mesh/骨骼代理驱动
        参数化线条驱动
        物理属性驱动
        稀疏锚点驱动
    运动建模方式
        针对单个视频的优化
        前向推理
    监督方式
        Video SDS
        视频重建
    要解决的问题
        时空一致性
        前向式/非优化/跨ID

Video SDS (视频分数蒸馏) 来从视频扩散模型中“蒸馏”运动信息

输入：单/多视角视频
输出：静态3DGS+GS的动态属性或动态3DGS
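
For reference, score distillation sampling as introduced in DreamFusion updates the scene parameters $\theta$ with the gradient below. The video variant is assumed here to have the same form, with $x = g(\theta)$ being a video rendered from the dynamic GS and $\hat\epsilon_\phi$ the noise predicted by a pretrained video diffusion model; individual papers differ in weighting and conditioning details.

$$ \nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat\epsilon_\phi(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right], \qquad x = g(\theta) $$

Here $x_t$ is the rendering $x$ noised to diffusion step $t$ with noise $\epsilon$, $y$ is the condition (text or image), and $w(t)$ is a timestep-dependent weight.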

ID	Year	Name	解决了什么痛点	主要贡献是什么	Tags
176	2025.6.11	HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene	学习结构化且时间一致的运动表征	一个通过稀疏锚点驱动形变实现结构化一致动态建模的统一框架。 1. 通过锚点过滤器识别运动相关区域，抑制静态区域的冗余更新；2. 利用自监督诱导流引导变形模块，通过多帧特征聚合驱动锚点运动，无需显式光流标签； 3. 为处理细粒度形变，分层锚点传播机制能依据运动复杂度提升锚点分辨率，并传播多级变换关系。	运动信息来源：? 驱动方式：稀疏锚点驱动
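173	2025.5.14	SplineGS: Learning Smooth Trajectories in Gaussian Splatting for Dynamic Scene Reconstruction	Adds a deformation module on top of fast, high-quality static scene reconstruction	Uses splines to represent smooth deformation along the time dimension	Motion source: monocular video; Driving method: parametric curve driving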
	2024.6.15	4d gaussian splatting for real-time dynamic scene rendering	link

GS的运动代理：无

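No proxy is needed; the motion of each Gaussian is controlled directly.

Control signal: single-/multi-view video; GS motion proxy: none

The video drives the motion of each Gaussian directly. The relationship between the control signal and the GS motion is intuitive, so constraining it with Video SDS and video reconstruction is enough to achieve the driving effect.
This is effectively a 3DGS-based 4D reconstruction problem.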

控制信号：文本，GS运动代理：无

先用文本和首帧生成视频，再用视频驱动GS，即：

TI2V + 基于3DGS的4D重建

ID	Year	Name	解决了什么痛点	主要贡献是什么	Tags	Link
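	2024.9.9	Animate3d: Animating any 3d model with multi-view video diffusion	Makes full use of existing 3D assets with multi-view attributes and addresses spatio-temporal inconsistency in generated results	1) A multi-view video diffusion model (MV-VDM). 2) A large-scale multi-view video dataset (MV-Video). 3) A framework built on MV-VDM that combines reconstruction with 4D score distillation sampling (4D-SDS), using the multi-view video diffusion prior to animate 3D objects.	Static Gaussian model: given; Target: single 3D object; Motion source: self-trained multi-view image-to-video model; Driving method: direct (HexPlane); Supervision: 4D-SDS, video reconstruction, ARAP; Motion inference: feed-forward first, then optimization	link
111	2023.12	Dreamgaussian4d: Generative 4d gaussian splatting	Scene reconstruction and driving with implicit representations (NeRF) are both very inefficient	A systematic image-to-4D generation framework	Static Gaussian model: DreamGaussianHD; Target: single 3D object; Motion source: single-view video from image-to-video generation; Driving method: direct (HexPlane); Supervision: video SDS, video reconstruction; Motion inference: optimization	link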

GS运动代理：物理仿真对象

---
title: 基于物理仿真对象代理的GS驱动  
---
flowchart LR

3DGS资产 --> |分割| 具有不同物理属性的高斯球 -->|绑定| 物理仿真代理
用户交互 -->|转化| 力
力 & 物理仿真代理 -->|驱动| 变形物理仿真代理
变形物理仿真代理 & 3DGS资产 --> |带动| 变形GS资产

技术一：高斯对象分割
一个高斯场景中可能包含多个对象，这些对象具有不同的物理属性，因此需要分割。

显式分割
隐式分割

技术二：将静态高斯对象绑定到物理仿真代理上。

代理通常是由高斯球采样或人工挑选出的粒子。
这些粒子可以再重组成Particle或Mesh或Grid或其它混合仿真形式。
仿真粒子带动高斯粒子

技术三：学习物理仿真对象的物理属性。

人工配置
从视频中学习

技术四：使用物理仿真驱动物理仿真对象

借助物理仿真方法（粒子系统、网格系统）
神经网络方法

借助物理仿真的方法

ID	Year	Name	解决了什么痛点	主要贡献是什么	Tags	Link
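129	2025.8.13	TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos	Objects with different physical properties otherwise require manual annotation	Learns the dynamics properties of every Gaussian from videos, with no manual annotation	Segmentation: no explicit segmentation; Simulation proxy: Gaussians as rigid particles; Physical properties: learned from multi-view video; Simulation: rigid particle simulation; open source	link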
175	2025.6.9	PIG: Physically-based Multi-Material Interaction with 3D Gaussians	由3D高斯基元表征的场景中，物体间的交互存在三大缺陷：三维分割精度不足、异质材质形变失准及严重渲染伪影。	1. 从二维像素到三维高斯基元的快速精准映射，从而达成精确的物体级三维分割。 2. 为场景中分割后的物体赋予独特物理属性，以实现多材质耦合交互。 3. 创新性地将约束尺度嵌入变形梯度，通过钳制高斯基元的缩放与旋转属性消除渲染伪影，达成几何保真度与视觉一致性。	对象分割：显式分割物理仿真代理：高斯球采样物理属性：预置？仿真方式：MLS-MPM	link
181	2025.6.5	FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity			link
174	2025.6.4	EnliveningGS: Active Locomotion of 3DGS	3D 高斯溅射(3DGS)表示的 3D 模型能够实现主动运动	高效且鲁棒地建模“活化模型”与环境之间的摩擦接触	对象分割：显式分割物理仿真代理：四面体+肌肉物理属性：预置？仿真方式：肌肉动力学	link
	2024.6.16	Physically embodied gaussian splatting: A realtime correctable world model for robotics	以统一的方式捕捉几何、物理及视觉外观信息	提出一种新颖的“高斯-粒子”双元表征，该表征在建模物理世界的同时，（i）支持对未来状态进行预测性仿真，并（ii）允许在动态世界中基于视觉观测进行在线校正。	对象分割：先分割再建模物理仿真代理：粒子（先建粒子，再细化成高斯）物理属性：从视频中学习仿真方式：通过『粒子仿真+形状约束』实现刚体仿真、软体仿真	link
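	2024.4.15	Physgaussian: Physics-integrated 3d gaussians for generative dynamics		Seamlessly integrates physical dynamics grounded in Newtonian mechanics into 3D Gaussians to synthesize high-quality novel motion	Segmentation: none; Simulation proxy: Gaussians as particles; Physical properties: preset?; Simulation: MPM	link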
	2024.4.1	Language-driven physics-based scene synthesis and editing via feature splatting.	同时操控GS对象的外观与物理属性	1. 提出了一种将高质量、以物体为中心的视觉-语言特征蒸馏至三维高斯模型的方法，从而支持基于文本查询的半自动场景解构。 2. 提出了一种利用粒子仿真器从原本静态的场景中合成基于物理的动态效果的方法，其中材料属性通过文本查询自动分配。	对象分割：基于语言的分割物理仿真代理：高斯球作为粒子物理属性：通过文本查询自动分配仿真方式：MPM	link
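	2024.3.14	Reconstruction and simulation of elastic objects with spring-mass 3d gaussians	Recovering an object's physical properties and simulating it		Segmentation: none; Simulation proxy: anchors obtained by volume sampling, then connected into a spring system; Physical properties: learned from video; Simulation: mass-spring system	link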
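
To make the "simulation proxy drives the Gaussians" idea concrete, here is a minimal mass-spring step of the kind a spring-based proxy could use. It is a generic textbook integrator (semi-implicit Euler), not the implementation of any paper in the table; the function names and parameters are hypothetical, and anchors are assumed never to coincide. `stiffness` may be a scalar or a per-spring array, which is the quantity a method that learns physical properties from video would optimize.

```python
import numpy as np

def spring_forces(pos, springs, rest_len, stiffness):
    """Hooke's-law forces on anchors. pos: (P, 3), springs: (S, 2) index pairs."""
    i, j = springs[:, 0], springs[:, 1]
    d = pos[j] - pos[i]
    length = np.linalg.norm(d, axis=1)
    direction = d / length[:, None]
    # A stretched spring pulls i toward j and j toward i.
    f = (stiffness * (length - rest_len))[:, None] * direction
    forces = np.zeros_like(pos)
    np.add.at(forces, i, f)
    np.add.at(forces, j, -f)
    return forces

def step(pos, vel, springs, rest_len, stiffness, mass, dt, damping=0.0):
    """Advance anchors by one semi-implicit Euler step."""
    acc = spring_forces(pos, springs, rest_len, stiffness) / mass - damping * vel
    vel = vel + dt * acc          # update velocity first ...
    pos = pos + dt * vel          # ... then position with the new velocity
    return pos, vel
```

After each step, the anchor displacements would be transferred to the surrounding Gaussians, for example by interpolating from nearby anchors.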

神经网络方法

ID	Year	Name	解决了什么痛点	主要贡献是什么	Tags	Link
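177	2025.6.18	Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos			Segmentation: explicit; Simulation proxy: Gaussians sampled into particles, with space partitioned into a grid; Physical properties: learned from video; Simulation: particle simulation + grid simulation + data-driven learning	link
179	2025.5.26	ParticleGS: Particle-Based Dynamics Modeling of 3D Gaussians for Prior-free Motion Extrapolation	Dynamic reconstruction methods do not explicitly learn the laws of dynamic evolution, so they cannot extrapolate. Methods that introduce an explicit simulation framework require manually specified external forces. Methods based on physics-informed neural networks (PINNs) or neural ordinary differential equations (Neural ODEs) cannot learn motion laws directly from video.	Models the dynamics of 3D Gaussians from visual observations. The first general, fully end-to-end method for dynamic 3D extrapolation that needs no manually defined physical prior.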
	2019	Occupancy flow: 4d reconstruction by learning particle dynamics

GS运动代理：Mesh

ID	Year	Name	解决了什么痛点	主要贡献是什么	Tags	Link
	2024.10.9	Dreammesh4d: Video-to-4d generation with sparse-controlled gaussian-mesh hybrid representation	时空一致性与表面外观	图像->3DMesh->Mesh形状->GS形状	运动信息来源：单目视频驱动方式：Mesh形变驱动	link

INTRODUCTION

需求：基于图像的3D场景重建

发展史：


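Light fields and basic scene reconstruction	1-3	Limited by the reliance on dense sampling and structured capture, which makes complex scenes and lighting conditions very challenging
Structure-from-motion [4] and multi-view stereo [5] algorithms	4, 5	Struggle with novel view synthesis and are not compatible with deep scene understanding models
NeRF: maps spatial coordinates directly to color and density	6-11	NeRF's success rests on its ability to create a continuous volumetric scene function, producing results with unprecedented detail and realism. Limitations: 1. Computational cost. NeRF-based methods are compute-intensive [6]-[11] and usually need long training times and substantial rendering resources, especially for high-resolution output. 2. Editability. Manipulating an implicitly represented scene is hard, because direct changes to the network weights do not map intuitively to changes in the scene's geometry or appearance.
3D Gaussian splatting (GS) [12]	12	Introduces an advanced explicit scene representation that models a scene with millions of learnable 3D Gaussians in space. The explicit representation and highly parallel workflow enable more efficient computation and rendering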

ID	Year	Name	Note	Tags	Link
12	2023	3D Gaussian Splatting for Real-Time Radiance Field Rendering			link

BACKGROUND

Problem Formulation

Radiance Field

GAMES101课程关于光场的介绍

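A radiance field is a representation of how light is distributed in 3D space; it captures how light interacts with the surfaces and materials in the environment [30]. Mathematically, a radiance field can be described as the function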

$$ L : (x, y, z, θ, φ) \in R^5 → R^+ $$

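where (x, y, z) is a point in 3D space and (θ, φ) is a direction given in spherical coordinates.
The radiance value is non-negative.
L can be encapsulated by an implicit or an explicit representation, each with its own advantages for scene representation and rendering.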

Implicit Radiance Field

隐式辐射场表示场景中的光分布，而无需显式定义场景的几何形状。

L经常使用神经网络来学习连续的体积场景表示[35]，[36]。最突出的例子是 NeRF [15]。

在 NeRF 中，神经网络（通常是MLP）将一组空间坐标 (x, y, z) 和观察方向 (θ, φ) 映射到颜色和密度值。任何点的辐射度都不会显式存储，而是通过查询 MLP 即时计算。因此，该函数可以写为：

$$ L = MLP(x, y, z, θ, φ) $$

Good：这种格式允许对复杂场景进行可微分和紧凑的表示
Bad：然而由于volumetric ray marching [12]，渲染的计算负载高。

Explicit Radiance Field

显式辐射场直接表示离散空间结构中的光分布，例如体素、网格、点云。该结构中的每个元素存储其各自空间位置的辐射信息。

显式辐射场表示的通用形式可以写为：

$$ L = \text {DataStructure}[(x,y,z)] \cdot f(θ, φ) $$

Good：这种方法允许更直接且通常更快地访问radiance value。
Bad：代价是更高的内存使用量和可能更低的分辨率。

3D Gaussian Splatting: 两全其美

3D GS [12]是显式辐射场，又具有隐式辐射场的优点。因为它结合了基于神经网络的优化和显式结构化数据存储的优点。因此可以实时、高质量渲染，并且需要更少的训练时间，特别是对于复杂场景和高分辨率输出。 3D 高斯表示形式为：

$$ L = \sum_i G(x,y,z,\mu_i, \Sigma_i)\cdot c_i(θ, φ) $$

显式的方法，只能把radiance绑定在点上，因此受限于点的分辨率，而点的分辨率又受限于内存。
3D GS把radiance绑定在有体积的点（球）上，所以对点的分辨率要求低一点。球的作用有点像点之间的插值。

背景和术语

场景重建与渲染

3D重建：图像->3D模型
渲染：3D模型->图像

神经渲染和辐射场

体积表示和ray marching

体积表示不仅将对象和场景建模为表面，而且将其建模为充满材料或空白空间的体积[46]。这种方法可以更准确地渲染雾、烟或半透明材料等现象。
光线行进是一种与体积表示一起使用的技术，通过增量跟踪穿过体积的光路来渲染图像[13]、[14]。

非表面模型的渲染

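NeRF [15] shares the spirit of volumetric ray marching and adds importance sampling and positional encoding to improve the quality of the synthesized images. The result is high quality at a high computational cost.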
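
Positional encoding maps each input coordinate to sines and cosines at geometrically increasing frequencies, so that an MLP can fit high-frequency detail. The sketch below follows the frequency schedule of the original NeRF paper (frequencies 2^k·π); the function name is hypothetical, and the output ordering (all sines, then all cosines, per coordinate) is an arbitrary choice that differs from the interleaved order in the paper.

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Encode coordinates x of shape (..., D) into shape (..., D * 2 * num_freqs)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (L,)
    angles = x[..., None] * freqs                         # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2L)
    return enc.reshape(*x.shape[:-1], -1)
```

With `num_freqs=10`, a 3D position becomes a 60-dimensional vector; the NeRF paper uses fewer frequencies for the viewing direction.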

基于点的渲染

基于点的渲染是一种使用点而不是传统多边形来可视化 3D 场景的技术。可以通过可学习神经网络等附加属性来增强点描述符[47]、[48]，并有效渲染[49]、[50]。

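Good: particularly effective for rendering complex, unstructured, or sparse geometry.
Bad: suffers from holes in the rendering or aliasing.
3D GS [12] extends this idea with anisotropic Gaussians to obtain a more continuous and cohesive scene representation.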

3D GAUSSIAN SPLATTING: PRINCIPLES

使用学习的 3D 高斯函数进行新颖视图合成

3D GS 如何在给定结构良好的 3D 高斯的情况下合成图像，即3D GS的前向过程。

$$ L = \sum_i G(x,y,z,\mu_i, \Sigma_i)\cdot c_i(θ, φ) $$

输入

一组高斯球，每个高斯球包含以下信息：

位置：$\mu$
不透明度： $\alpha$
协方差：$\Sigma$
颜色：c

一个高斯球是 3D GS 中场景表示的最小元素。

所有属性都可以通过反向传播来学习和优化。现在假设这些高斯球都已经优化好了。
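
A covariance matrix must stay symmetric positive semi-definite, which plain gradient updates on its entries would not guarantee. The original 3D GS paper therefore optimizes a scale vector and a rotation quaternion and builds Σ = R S Sᵀ Rᵀ from them. The sketch below shows that construction; the function names are hypothetical and the quaternion is assumed to be in (w, x, y, z) order.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion (w, x, y, z); normalizes the input."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance_from_scale_rotation(scale, quat):
    """Build Sigma = R S S^T R^T from per-axis scales and a rotation quaternion."""
    rot = quat_to_rotmat(np.asarray(quat, dtype=float))
    m = rot @ np.diag(scale)
    return m @ m.T
```

The eigenvalues of the result are the squared scales, so the Gaussian's extent along each principal axis is set by `scale` and its orientation by `quat`.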

Splatting

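The 3D Gaussians are first projected onto the pixel-based image plane; this step is called "splatting".

Frustum

Once the camera pose is fixed, the view frustum is used to cull the Gaussians down to the visible ones, and the covariance of each visible Gaussian is computed in the camera view.

Differentiable Rendering by Pixels

Only the basic procedure is described here, not the acceleration algorithms.

Given the position of a pixel x, its distance to every overlapping Gaussian, i.e. the depth of that Gaussian, is computed through the viewing transformation W, which yields a depth-sorted list of Gaussians N.

Alpha compositing is then used to compute the final color of the pixel:

$$ C = \sum_{n=1}^{|N|} c_n \alpha'_n \prod_{j=1}^{n-1} (1 - \alpha'_j) $$

where c_n is the learned color and α'_n is the effective opacity of the n-th Gaussian at this pixel.

As the figure shows, rendering in NeRF and in 3D GS can be seen as inverse processes of each other.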
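Front-to-back alpha compositing over a depth-sorted list can be sketched as follows. This is an illustrative sketch with hypothetical names; it assumes each entry already carries the effective opacity of that Gaussian at the pixel, and it ignores the early-termination and tiling tricks of the real renderer.

```python
def composite(splats):
    """Front-to-back alpha compositing for one pixel.

    splats: list of (depth, (r, g, b), alpha), where alpha is assumed to be
    the effective opacity of that Gaussian at this pixel.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for _, rgb, alpha in sorted(splats, key=lambda s: s[0]):
        weight = alpha * transmittance
        color = [acc + weight * ch for acc, ch in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return tuple(color), transmittance
```

For example, a half-transparent red splat in front of an opaque blue one gives an even mix of the two colors, regardless of the order in which the splats are passed in.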

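Acceleration

Per-pixel computation is expensive, so the granularity is moved from the pixel level to the tile level.

Specifically, 3D GS first divides the image into non-overlapping tiles of 16×16 pixels each.

3D GS then determines which projected Gaussians cover each tile. A projected Gaussian that covers several tiles is duplicated, once per tile.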
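The tile assignment can be sketched in a few lines. This is a simplified illustration, not the actual CUDA rasterizer: each projected Gaussian is approximated by a bounding square of a given `radius`, and the function names are hypothetical.

```python
TILE = 16  # tile side in pixels, as in the text


def tiles_covered(cx, cy, radius, width, height, tile=TILE):
    """Tiles (tx, ty) overlapped by the bounding square of a projected Gaussian."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    x0 = max(0, int((cx - radius) // tile))
    x1 = min(tiles_x - 1, int((cx + radius) // tile))
    y0 = max(0, int((cy - radius) // tile))
    y1 = min(tiles_y - 1, int((cy + radius) // tile))
    return [(tx, ty) for ty in range(y0, y1 + 1) for tx in range(x0, x1 + 1)]


def build_tile_lists(gaussians, width, height):
    """gaussians: list of (cx, cy, radius, depth). One entry is emitted per
    covered tile, so a Gaussian spanning several tiles is duplicated."""
    lists = {}
    for index, (cx, cy, radius, depth) in enumerate(gaussians):
        for t in tiles_covered(cx, cy, radius, width, height):
            lists.setdefault(t, []).append((depth, index))
    for entries in lists.values():
        entries.sort()  # front-to-back order inside each tile
    return lists
```

After this step every pixel in a tile only has to walk its tile's short, depth-sorted list instead of all Gaussians in the scene.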

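Optimization of 3D GS: Obtaining Well-constructed 3D Gaussians for a Given Scene

The optimization of 3D GS aims to build a large set of 3D Gaussians that accurately captures the scene and thus supports free-viewpoint rendering.

Parameter Optimization

Loss

Because ray marching is expensive, NeRF usually computes its loss at the pixel level rather than the image level.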

3D GAUSSIAN SPLATTING: DIRECTIONS

APPLICATION AREAS AND TASKS

Simultaneous Localization and Mapping (SLAM)

Dynamic Scene Reconstruction

ID	Year	Name	Note	Tags	Link
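36	2024	GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians	Input: a single video. Output: a realistic human avatar with dynamic 3D appearance. Goal: free-viewpoint rendering and realistic human avatar animation.		link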

AI-Generated Content (AIGC)

ID	Year	Name	Note	Tags	Link
34	2024	Splatter a Video: Video Gaussian Representation for Versatile Processing	Video editing with Gaussians

Autonomous Driving

Endoscopic Scene Reconstruction

Medical Image

Reference

A Survey on 3D Gaussian Splatting

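3D Animal Modeling

ID	Year	Name	Note	Tags	Link
	2025.5.23	Pose Splatter: A 3D Gaussian Splatting Model for Quantifying Animal Pose and Appearance		Models the full pose and appearance of laboratory animals without prior knowledge of animal geometry, per-frame optimization, or manual annotation	link

3D Animal Motion Sequence Generation

ID	Year	Name	Note	Tags	Link
128	2025.8.7	X-MoGen: Unified Motion Generation across Humans and Animals	The first unified cross-species, text-driven motion generation framework covering humans and animals. Stage 1: a CGVAE learns a canonical T-pose prior, and an AE encodes motion into a shared latent space regularized by a morphology loss. Stage 2: masked motion modeling generates motion embeddings conditioned on the text description.	Cross-species generation
	2025.6.4	AniMo: Species-Aware Model for Text-Driven Animal Motion Generation		Text-to-animal-motion	link
35		MagicPony: Learning Articulated 3D Animals in the Wild	Image to rigged 3D animal mesh; image to 3D motion		link

Animal Video Sequence Generation

ID	Year	Name	Note	Tags	Link
32		Artemis: Articulated Neural Pets with Appearance and Motion Synthesis	NGI: highly realistic rendering of animals		link

Animal 4D Generation

ID	Year	Name	Note	Tags	Link
39	2024.5	Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion	Text-to-3D mesh + text-to-3D motion + retargeting = 3D animal motion sequence		link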

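Video Diffusion Models: Chapter Map

Fundamentals of Diffusion Models
Video Generation
- Closed-source large T2V models
- Open-source T2V base models
- T2I Base Model + Control
- T2V Base Model + Control
- Long video generation
- StoryBoard
- Multiple generation tasks
- Human Video Generation(link)
Video Editing
Evaluation metrics for video generation
Datasets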
Summary

Reference

Mike Shou

Asst Prof, National U. of Singapore

Joint work with Pei Yang & Jay Wu

Slides:https://sites.google.com/view/showlab/tutorial

Others

CVPR Tutorial (English): https://www.youtube.com/watch?v=cS6JQpEY9cs
Lil’s blog: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Hung-yi Lee (Chinese):
- https://www.youtube.com/watch?v=azBugJzmz-o
- https://www.youtube.com/watch?v=ifCDXFdeaaM
Xing et al., “A Survey on Video Diffusion Models,” arXiv 2023.

This article is from CaterpillarStudyGroup. Please credit the source when reposting.

https://caterpillarstudygroup.github.io/ImportantArticles/


Problem Definition

Text-Guided Video Generation

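Input: text prompt (or another control signal)
Output: video

T2I -> T2V

✅ A large open-source pretrained text-to-image model, Stable Diffusion, already exists. To make full use of it, the usual approach is to turn the text-to-image model into a text-to-video model, i.e. go from 2D output to 3D output.
Motion source: text
Appearance source: text

T2I/T2V -> TI2V

Generating video directly from text makes fine-grained control over the content difficult, which led to the Image-2-Video task. I2V is usually built on a pretrained T2I model by injecting a reference image and adding temporal layers. It can also be built by injecting a reference image directly into a pretrained T2V model.

Task 1: Animating an image

Appearance source: image
Motion source: uncontrolled continuation, or text

Task 2: Video generation conditioned on a video

Appearance source: text
Motion source: video

T2I/T2V/TI2V + other control signals

Pick a suitable (open-source) pretrained model, then on top of it:

Inject your own control signal, e.g. image, control points, optical flow, dragging
Build a small task-specific training set (small relative to what the base model was trained on)
Add task-specific tricks
After a small amount of training (again relative to the base model), you obtain a video generation model specialized for the task.

Most community users only have access to open-source pretrained models, so the first step is to know which open-source models are available.

Appearance source: image
Motion source: text, skeletal motion sequences, physical laws, user interaction trajectories, etc.

T2V -> Improved T2V

Starting from a pretrained T2V model, fine-tune it so that it improves in certain respects and becomes a stronger base model.

Motion source: text
Appearance source: text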


ID	Year	Name	Note	Tags	Link
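57	2023.9	Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation	Runs the temporal diffusion model directly in pixel space and combines inpainting and super-resolution to produce high-resolution video		link
	2023.8	I2vgen-xl: High-quality image-to-video	Proposes a cascaded network that improves performance by separating content and motion factors, and uses static images as guidance to strengthen data alignment.
48	2023.4	Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models	First to bring the latent diffusion model (LDM) paradigm to video generation by adding a temporal dimension in latent space. T2I(LDM) -> T2V(SVD). Cascaded generation	Video LDM	link
59	2023	AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning	1. T2I + Transformer = T2V 2. MotionLoRA enables different styles of video motion		link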
	2023	Chen et al., “GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation,”	Transformer-based diffusion for text-to-video generation ✅Transformer-based architecture extended from DiT (class-conditioned transformer-based LDM) ✅Train T2I $\to $ insert temporal self-attn $\to $ joint image-video finetuning (motion-free guidance)
	2023	Gupta et al., “Photorealistic Video Generation with Diffusion Models,”	Transformer-based diffusion for text-to-video generation ✅Transformer-based denoising diffusion backbone ✅Joint image-video training via unified image/video latent space (created by a joint 3D encoder with causal 3D conv layers, allowing the first frame of a video to be tokenized independently) ✅Window attention to reduce computing/memory costs ✅Cascaded pipeline for high-quality generation
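	2022.11	Imagen Video: High Definition Video Generation with Diffusion Models	Proposes cascaded diffusion models for high-definition video and carries the text-to-image paradigm over to video generation; the cascade improves quality and resolution. ✅ Cascaded generation is first done on images. ✅ Video is obtained by super-resolution that adds the temporal dimension on top of images. ✅ Is each super-resolution stage an independent diffusion model? 7 cascade models in total: 1 base model (16x40x24), 3 temporal super-resolution models, 3 spatial super-resolution models. ✅ The 7 cascade stages progressively raise the frame rate and the pixel resolution; the training of each stage depends on the previous one.	Cascade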
56	2022.9	Make-A-Video: Text-to-Video Generation without Text-Video Data			link
55	2022.4	Video Diffusion Models	First diffusion model with a 3D U-Net that predicts and generates video sequences; introduces conv(2+1)D and temporal attention		link

More Works


	MagicVideo (Zhou et al.) Insert causal attention to Stable Diffusion for better temporal coherence “MagicVideo: Efficient Video Generation With Latent Diffusion Models,” arXiv 2022.
	Simple Diffusion Adapter (Xing et al.) Insert lightweight adapters to T2I models, shift latents, and finetune adapters on videos “SimDA: Simple Diffusion Adapter for Efficient Video Generation,” arXiv 2023.
	Dual-Stream Diffusion Net (Liu et al.) Leverage multiple T2I networks for T2V “Dual-Stream Diffusion Net for Text-to-Video Generation,” arXiv 2023.
	MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation,2024

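Training Free

ID	Year	Name	Note	Tags	Link
84	2025.5.14	Generating time-consistent dynamics with discriminator-guided image diffusion models	1. Trains a temporal-consistency discriminator and uses it to guide a T2I model toward temporally consistent output.	Image generation + temporal-consistency discriminator = video generation	link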


T2I -> T2V

ID	Year	Name	Note	Tags	Link
	2025	Wan. Wan-AI/Wan2.1-T2V-14B			https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
	2025	CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
	2024	Hunyuanvideo: A systematic framework for large video generative models
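81	2024	CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers	1. Uses the pretrained T2I model CogView2 2. Generates 1 fps keyframes first, then recursively interpolates intermediate frames 3. Adds a temporal channel and blends it with the spatial channel using a mixing factor $\alpha$	CogView2 (6 billion parameters), Transformer Based	link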
107	2023	Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators			link
58	2023	ModelScope Text-to-Video Technical Report			link
	2023	ZeroScope	✅ ZeroScope is fine-tuned from ModelScope on a very small but very high-quality dataset and achieves high-resolution generation.
50	2023	Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets	Scaling latent video diffusion models to large datasets Data Processing and Annotation		link
	2023	Wang et al., “LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models,”	Joint image-video finetuning with curriculum learning ✅ Provides a high-quality dataset, and the generated videos are also better (the training set matters).
	2023	Chen et al., “VideoCrafter1: Open Diffusion Models for High-Quality Video Generation,”		LDM

T2V -> Improved Text-2-Video

ID	Year	Name	Note	Tags	Link
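105	2025.5.27	Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation	1. An LLM analyzes the expected outcome of the generation and guides it 2. The LLM's evaluation of the generated result is also used as a loss term during training 3. LoRA fine-tuning on top of the Wan model	Dataset, LLM, LoRA, physics	link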


Other Related Work


" Robot dancing in times square,” arXiv 2023.	" Clown fish swimming through the coral reef,” arXiv 2023.	" Melting ice cream dripping down the cone,” arXiv 2023.	" Hyper-realistic photo of an abandoned industrial site during a storm,” arXiv 2023.


Image-2-Video

Convert user control (sparse trajectories, etc.) into a motion representation (optical flow, etc.)
Drive the image with the motion representation

ID	Year	Name	Note	Tags	Link
51	2023	Motion-Conditioned Diffusion Model for Controllable Video Synthesis	✅ User-provided sparse motion trajectories -> dense optical flow ✅ Dense optical flow (condition) + image -> video	Two-stage, autoregressive generation	link
44	2024	Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling	✅ User-provided control signal (condition) + image -> dense optical flow ✅ Dense optical flow (condition) + image -> video	Two-stage, trajectory control	link
	2024	PhysMotion: Physics-grounded dynamics from a single image.	Trajectory control
	2023	LFDM (Ni et al.) “Conditional Image-to-Video Generation with Latent Flow Diffusion Models,”	✅ Video -> optical flow + mask ✅ Optical flow + mask + image -> video
	2024	Generative Image Dynamics (Li et al.) “Generative Image Dynamics,”	Image (no condition) -> SV ✅ SV + force -> optical flow ✅ Optical flow + image -> video
	2023	LaMD: Latent Motion Diffusion for Video Generation	Video -> image features + motion features ✅ Motion features + image features -> video
	2023	Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models	PYoCo (Ge et al.) Generate video frames starting from similar noise patterns
	2023	Animate-a-story: Storytelling with retrieval-augmented video generation	Depth control
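The second stage that several rows share (dense optical flow + image -> frame) rests on warping an image by a flow field. The sketch below shows only that warping step, with nearest-neighbor backward warping on a nested-list image. It is an assumption-laden illustration: the listed papers feed the flow to a diffusion model as a condition rather than warping pixels directly, and `warp` is a hypothetical name.

```python
def warp(image, flow, fill=0):
    """Backward-warp `image` with a dense flow field.

    flow[y][x] = (dx, dy) tells each output pixel where to sample from in
    the source image. Samples that fall outside the image become `fill`,
    which is the hole a generative model would have to inpaint.
    """
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            sx, sy = int(round(x + dx)), int(round(y + dy))
            if 0 <= sx < width and 0 <= sy < height:
                out[y][x] = image[sy][sx]
    return out


image = [[1, 2, 3],
         [4, 5, 6]]
# Every pixel samples from its left neighbor, so the content moves right.
flow = [[(-1, 0)] * 3 for _ in range(2)]
shifted = warp(image, flow)
```

The unfilled column on the left is the kind of disoccluded region that pure warping cannot explain, which is one reason these methods pair the flow with a generative model.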

More Works (closed-source)


	Latent Shift (An et al.) Shift latent features for better temporal coherence “Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation,” arXiv 2023.
	Video Factory (Wang et al.) Modify attention mechanism for better temporal coherence “VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation,” arXiv 2023.
	VideoFusion (Luo et al.) Decompose noise into shared “base” and individual “residuals” “VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation,” CVPR 2023.

✅ Framework: (1) add temporal layers to the original model; (2) freeze the original model and train only the new layers; (3) insert the trained layers into the target T2I model.
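The three-step framework can be sketched with toy layers. Everything here is hypothetical and heavily simplified: `FrozenSpatialLayer` stands in for a pretrained T2I layer, `TemporalLayer` mixes each frame with its neighbors through a single scalar, and the zero initialization of that scalar is an assumption (a common choice so that the video model starts out identical to the image model), not something stated in the notes.

```python
class FrozenSpatialLayer:
    """Stand-in for a pretrained T2I layer; applied to each frame on its own."""
    trainable = False

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, frame):
        return [self.scale * v for v in frame]


class TemporalLayer:
    """New layer that mixes information across frames; the only trained part."""
    trainable = True

    def __init__(self):
        self.mix = 0.0  # assumed zero init: output equals input at the start

    def __call__(self, frames):
        out = []
        last = len(frames) - 1
        for t, frame in enumerate(frames):
            prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, last)]
            out.append([v + self.mix * (0.5 * (p + n) - v)
                        for v, p, n in zip(frame, prev, nxt)])
        return out


def video_forward(frames, spatial, temporal):
    per_frame = [spatial(f) for f in frames]  # frozen image model, frame by frame
    return temporal(per_frame)                # trained temporal mixing


spatial, temporal = FrozenSpatialLayer(1.0), TemporalLayer()
frames = [[0.0], [10.0], [2.0]]
before = video_forward(frames, spatial, temporal)
temporal.mix = 1.0  # pretend training moved the temporal parameter
after = video_forward(frames, spatial, temporal)
```

Because the temporal layer is a separate module, the same trained layer can in principle be placed next to a different frozen spatial layer, which is step (3) of the framework.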

Sound2Video

Year	Name	Note
2023	The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion	Text + Sound -> Video
2023	AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion
2023	Generative Disco (Liu et al.) “Generative Disco: Text-to-Video Generation for Music Visualization,”

Brain Activity 2 Video

✅ Generation controlled by brain signals.

Brain activity-guided video generation

Task: human vision reconstruction via fMRI signal-guided video generation

Chen et al., “Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity,” arXiv 2023.

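✅ Describe the image in plain text.
✅ Method: prepare paired data and fine-tune GPT.
✅ Generate the image from a structured intermediate representation.
✅ First use GPT for text completion.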


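Image (provides motion) + Text (provides appearance)-2-Video

ID	Year	Name	Note	Tags	Link
126	2025.7.22	MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation	1. The reference object (motion comes from the image) and the target object (appearance comes from the text) differ significantly in appearance or structure 2. Explicitly extracts the semantic correspondence between source and target appearance and the deformation between corresponding parts; warping the source yields a rough outline of the target, which is fed into video generation as a condition	training-free, open source

Image (provides appearance)-2-Video

Emphasis on physical plausibility

How physical laws are described: an LLM's understanding of physics, dedicated datasets, existing physical models
How physical laws are used: datasets, losses
Whether physical laws are extracted explicitly

ID	Year	Name	Note	Tags	Link
106	2025.5.26	Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals	1. Physical forces (global forces and point forces) are encoded and used as generation conditions 2. Builds a small dataset 3. Shows that a large TI2V model plus a small number of samples generalizes fairly well	Open source, CogVideoX + ControlNet, physics	link
	2025.5.1	T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation		Text-to-video, physics, evaluation	link
96	2025.3.26	PhysAnimator: Physics-Guided Generative Cartoon Animation	Animates a static anime illustration 1. Segment the deformable parts 2. Convert them to a 2D mesh 3. Drive the 2D mesh with FEM 4. Derive optical flow from the 2D mesh deformation 5. Use the optical flow to drive the image sketch 6. Use the sketch as the control signal to generate the video	2D Mesh, FEM, ControlNet, optical flow, trajectory control, SAM	link
	2025	Physdreamer: Physics-based interaction with 3d objects via video generation
	2024.9.27	PhysGen	Converts a single image and an input force into a realistic video through rigid-body physics simulation, showing that physical parameters can be inferred from visual data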

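Emphasis on temporal consistency

ID	Year	Name	Note	Tags	Link
130	2025.8.25	Multi-Object Sketch Animation with Grouping and Motion Trajectory Priors

Emphasis on controllability

How to represent the control signal
How to inject the control signal

ID	Year	Name	Note	Tags	Link
97	2025	Draganything: Motion control for anything using entity representation	1. Segment the draggable object 2. Extract the object's latent diffusion feature 3. Convert the path into a Gaussian heatmap 4. Use the feature and the heatmap as control signals for generation	Trajectory control, ControlNet, Gaussian heatmap, SAM, latent diffusion feature	link
47	2024	Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics	Drag-controlled video generation of part-level object motion	Part-level motion dataset	link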
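One common way to represent a drag path as a dense control signal, as in the "path to Gaussian heatmap" step above, is to splat a Gaussian at every path point. The sketch below is a simplification with hypothetical names: it collapses the whole path into one map, whereas a video model would typically receive one heatmap per frame, and `sigma` is an assumed parameter.

```python
import math


def trajectory_heatmap(points, width, height, sigma=2.0):
    """Rasterize path points (x, y) into a heatmap with values in [0, 1]."""
    heat = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Keep the strongest response so overlapping points stay <= 1.
                heat[y][x] = max(heat[y][x], value)
    return heat


heat = trajectory_heatmap([(1, 1), (3, 2)], width=5, height=4)
```

The resulting map has the same spatial layout as the image, so it can be concatenated with or injected next to image features, which is what makes it convenient for ControlNet-style conditioning.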

Other (unfiled)

Year	Name	Note	Tags	Link
2025.6.17	VideoMAR: Autoregressive Video Generation with Continuous Tokens			link
2025.5.29	ATI: Any Trajectory Instruction for Controllable Video Generation		Motion control in video generation	link
2025.5.26	MotionPro: A Precise Motion Controller for Image-to-Video Generation		Image animation through interactive motion control	link
2025.5.23	Temporal Differential Fields for 4D Motion Modeling via Image-to-Video Synthesis		Simulates regular motion processes with an image-to-video (I2V) synthesis framework	link
2025.5.20	LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer		Text + image + motion video -> video	link
2025.5.14	CameraCtrl: Enabling Camera Control for Video Diffusion Models		Video generation controlled by camera pose	link
2025.5.4	DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization		Text-to-video	link
2025.4.30	Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis		Text-to-3D video	link
2025	Sparsectrl: Adding sparse controls to text-to-video diffusion models	Depth control
2024	Cinemo: Consistent and controllable image animation with motion diffusion models
2024.06	Mimicmotion: High-quality human motion video generation with confidence-aware pose guidance	Pose control
2024	Vr-gs: A physical dynamics-aware interactive gaussian splatting system in virtual reality


2.5 Storyboard


ID	Year	Name	Note	Tags	Link
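84	2024	Learning Long-form Video Prior via Generative Pre-Training	Uses GPT to generate structured information about long-video content that helps downstream video generation and understanding tasks.	Structured information, dataset	dataset link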
61	2023	Xie et al., “VisorGPT: Learning Visual Prior via Generative Pre-Training,”	A “diffusion over diffusion” architecture for very long video generation		link
	2023	Lin et al., “VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning,”	Use storyboard as condition to generate video ✅ ControlNet: converts the text into a pixel image.


	Dysen-VDM (Fei et al.) Storyboard through scene graphs “Empowering Dynamics-aware Text-to-Video Diffusion with Large Language Models,” arXiv 2023.
	DirectT2V (Hong et al.) Storyboard through bounding boxes “Large Language Models are Frame-level Directors for Zero-shot Text-to-Video Generation,” arXiv 2023.
	Free-Bloom (Huang et al.) Storyboard through detailed text prompts “Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator,” NeurIPS 2023.
	LLM-Grounded Video Diffusion Models (Lian et al.) Storyboard through foreground bounding boxes “LLM-grounded Video Diffusion Models,” arXiv 2023.

✅ Generate movie-length videos rather than clips of a few seconds.

✅ Text → structured intermediate script → video


2.6 Long video generation

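The main difficulties of long video generation are:

Complexity of long video generation
- Train/inference gap: the model only sees short videos during training and cannot learn the global temporal patterns of long videos, so the generated content breaks logically.
- Inefficiency of sequential generation: autoregressive generation processes frames one after another, so generation time grows linearly with video length and cannot meet practical needs.
- Keeping content consistent: long videos contain complex characters, objects, and their dynamic interactions.
Data scarcity
High-quality annotated long-video data (e.g. per-frame annotations) is extremely expensive to obtain, and existing datasets (e.g. short-video collections) can hardly support learning a long-video prior.

ID	Year	Name	Note	Tags	Link
	2025.6.2	DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion		A new high-frame-rate video generation method based on pretrained diffusion models	link
	2025.6.1	FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation		Training-free guidance that improves the coherence of generated video	link
80	2025	One-Minute Video Generation with Test-Time Training	1. Adds TTT layers that dynamically adjust the model's hidden state to strengthen global understanding of long sequences. 2. A gating mechanism keeps the TTT layers from injecting noise early in training. 3. Multi-stage training: extends from 3-second clips step by step to 63 seconds, fine-tuning only the TTT layers and gate parameters to preserve the pretrained model's knowledge.	Test Time Training, RNN	link
41	2024	STORYDIFFUSION: CONSISTENT SELF-ATTENTION FOR LONG-RANGE IMAGE AND VIDEO GENERATION	Generates consistent keyframes first, then interpolates the intermediate frames		link
60	2023	NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation	A recursive diffusion-over-diffusion architecture for long and parallel video generation	coarse-to-fine, dataset	link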
	2025	Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
	2022	Latent Video Diffusion Models for High-Fidelity Long Video Generation (He et al.) Generate long videos via autoregressive generation & interpolation
	2023	VidRD (Gu et al.) Autoregressive long video generation “Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation,” arXiv 2023.
	2023	VideoGen (Li et al.) Cascaded pipeline for long video generation “VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation,” arXiv 2023.
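The "keyframes first, then fill in" pattern that several rows share can be expressed as a coarse-to-fine schedule of frame indices. The sketch below only computes which frames belong to which level; it is a hypothetical illustration (`generation_levels` and the halving stride are assumptions) and does not reproduce any listed model.

```python
def generation_levels(num_frames, stride):
    """Coarse-to-fine schedule of frame indices.

    Level 0 holds keyframes spaced `stride` apart; each later level halves
    the stride and adds the frames that are still missing.
    """
    levels, done = [], set()
    while stride >= 1:
        new = [i for i in range(0, num_frames, stride) if i not in done]
        if new:
            levels.append(new)
        done.update(new)
        stride //= 2
    return levels


levels = generation_levels(num_frames=9, stride=4)
```

Frames inside one level depend only on coarser levels, so they can be generated in parallel. This is the contrast with purely autoregressive generation, whose cost grows frame by frame.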


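✅ Appearance is generated by the text-to-image model; dynamics come from the reference video.

✅ The current frame attends only to the first frame and the previous frame, which greatly reduces computation.
✅ Attention over all frames is expensive.
✅ Solution: attend to the previous frame and the first frame.
❓ How is the generated motion kept consistent with the motion in the original video?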
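The saving from attending only to the first and previous frame can be counted directly. This is a bookkeeping sketch with hypothetical names; it counts frame pairs rather than tokens and does not implement the attention itself.

```python
def attended_frames(t):
    """Frames whose keys/values frame t attends to: the first and the previous."""
    return sorted({0, max(t - 1, 0)})


def attention_cost(num_frames, sparse=True):
    """Number of (query frame, key frame) pairs."""
    if sparse:
        return sum(len(attended_frames(t)) for t in range(num_frames))
    return num_frames * num_frames
```

For 8 frames the sparse pattern needs 14 frame pairs instead of 64, and the gap widens as the clip gets longer because the sparse cost grows linearly while full attention grows quadratically.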

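✅ Solution:
✅ Apply DDIM inversion to the video to be edited to obtain the inverted noise, which preserves the pattern of the original video.
✅ Using this noise as the initial noise, the regenerated video retains the structure of the original video well.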
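A minimal sketch of deterministic DDIM inversion (eta = 0) follows. It makes strong assumptions that are labeled in the code: the noise predictor `eps_model` is a hypothetical callable, and the toy predictor used in the example returns a constant, which makes the round trip exact. With a real network the inversion is only approximate.

```python
import math


def ddim_step(x, eps, abar_from, abar_to):
    """Deterministic DDIM update between two cumulative-alpha noise levels."""
    x0 = [(xi - math.sqrt(1.0 - abar_from) * ei) / math.sqrt(abar_from)
          for xi, ei in zip(x, eps)]
    return [math.sqrt(abar_to) * x0i + math.sqrt(1.0 - abar_to) * ei
            for x0i, ei in zip(x0, eps)]


def ddim_invert(x0, eps_model, abars):
    """Walk from the clean sample toward noise (abars is decreasing)."""
    x = list(x0)
    for a_from, a_to in zip(abars[:-1], abars[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x


def ddim_sample(x_noisy, eps_model, abars):
    """Walk back from the inverted noise to a sample."""
    x, rev = list(x_noisy), abars[::-1]
    for a_from, a_to in zip(rev[:-1], rev[1:]):
        x = ddim_step(x, eps_model(x, a_from), a_from, a_to)
    return x


# Hypothetical toy predictor: constant noise, independent of x and t.
toy_eps = lambda x, abar: [0.3, -0.2]
abars = [1.0, 0.8, 0.5, 0.2]  # assumed cumulative alpha schedule
latent = [1.5, -0.7]
inverted = ddim_invert(latent, toy_eps, abars)
restored = ddim_sample(inverted, toy_eps, abars)
```

Starting an edit from `inverted` instead of fresh Gaussian noise is what lets the regenerated video keep the layout of the source video.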

Multiple generation tasks


	MovieFactory (Zhu et al.) “MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images,” arXiv 2023.
	CoDi (Tang et al.) “Any-to-Any Generation via Composable Diffusion,” NeurIPS 2023.
	MM-Diffusion (Ruan et al.) “MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation,” CVPR 2023.
	NExT-GPT (Wu et al.) “NExT-GPT: Any-to-Any Multimodal LLM,” arXiv 2023.

✅ When the object changes significantly, diffusion works better than other generative methods.


Fundamentals of Human Video Generation

Key Subtasks

Existing methods fall into three categories by the modality that drives generation: text-driven, audio-driven, and pose-driven.

Text-driven Human Video Generation

Covers how text descriptions are used to control the human appearance and motion in the generated video.

ID	Year	Name	Note	Tags	Link
	2025.5.21	Interspatial Attention for Efficient 4D Human Video Generation		Controllable generation of photorealistic videos of digital humans	link
1	2024	ID-Animator	To ensure the consistency of appearance in generated videos with the textual descriptions while preserving identity details during frames, ID-Animator [1] leverages a pre-trained text-to-video (T2V) model with a lightweight face adapter to encode identity-relevant embeddings.	Human appearance control
83		HMTV	Generates motion and camera movement from text, then generates images	Human motion control, two-stage method
84	2020	SignSynth	Gloss2Pose for text-to-motion, GAN for motion-to-video	Human motion control, two-stage method
85	2022	H-DNA		Human motion control, two-stage method
86	2024	SignLLM	Text -> Gloss -> Pose -> Video	Human motion control, two-stage method
89	2024		Text -> Gloss -> Pose -> Video	Human motion control, two-stage method
53		Text2Performer	Involves the motion text and a motion encoder. Motion text describes the movement, such as "She is swinging to the right." The model implicitly models these descriptions by separately representing appearance and motion, thereby generating high-quality videos with consistent appearance and actions.	Text used directly as the prompt to generate video

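Audio-driven Human Video Generation

Speech-driven: the generated body motion must agree with the audio in high-level semantics as well as in emotion and rhythm.
Music-driven: synthesizes a video of a person dancing or playing an instrument guided by a given music clip; the focus is on low-level beat alignment.

Speech-driven Gestures

The table below summarizes the key properties and evolution of speech-driven human video generation models:

Method/Model	Architecture	Main contribution	Input	Output	Training objective/optimization	Key innovation	Limitations
Traditional methods [61][92][93]	2D/3D skeleton + separate rendering	Generate gesture video from a structural prior (skeleton)	Speech + 2D/3D skeleton	Gesture video	Skeleton motion generation decoupled from video rendering	Motion defined by a hand-crafted structural prior (skeleton)	Appearance information is lost and control is difficult; pretrained pose estimators cause jitter and error accumulation
ANGIE [62]	Unsupervised MRAA features + VQ-VAE + GPT network	Improves gesture generation with unsupervised features and discrete modeling	Speech	Gesture video	VQ-VAE quantizes motion patterns + autoregressive prediction of discrete motions	Unsupervised motion features (MRAA) remove the need for skeleton annotation	MRAA's linear modeling limits expressiveness in complex regions; the association between speech and covariance is inaccurate
DiffTED & He et al.	TPS motion model + diffusion model	Decouples motion and appearance and keeps key information of body regions	Speech + TPS keypoints	Diverse gesture video	Diffusion model generates the motion sequence + TPS renders keypoints to images	Diffusion-based diverse generation; decoupled motion and appearance (avoids information loss)	Depends on the accuracy of the TPS model; high computational cost

Key Notes

Technical evolution:
- Traditional methods rely on rigid skeletons, which leads to lost appearance information and jitter;
- ANGIE introduces unsupervised features and discrete modeling but is limited by its linear expressiveness;
- DiffTED decouples motion from appearance and uses a diffusion model to achieve high-quality, diverse generation.
Core challenges:
- Motion-appearance balance: traditional methods sacrifice appearance information, while DiffTED partly preserves it through decoupling;
- Diversity: diffusion models (DiffTED) outperform autoregressive (ANGIE) and skeleton-driven methods.
Future directions:
- Combine with physical simulation to improve motion realism (e.g. reduce jitter);
- Improve fine-grained control of complex regions (hands, micro-expressions).

The table compares the key methods for speech-driven gesture video generation and highlights the path from structural priors to unsupervised learning to decoupled diffusion models.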

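Speech-driven Lip Motion (Video Generation)

Lip synchronization generates lip motion that matches the input audio while keeping head pose and identity consistent.

Image + Audio -> Video

The table below summarizes the categories, key properties, and challenges of audio-driven talking-face generation:

Method type	Key methods/techniques	Input	Output	Strengths	Limitations
Person-Specific	3D models (Song et al., 2020; Thies et al., 2020); NeRF (Park et al., 2022)	Audio + several minutes of training video of the target person	High-fidelity, identity-preserving talking video	High fidelity; accurate lip-audio mapping	Slow to train, needs a lot of target data, hard to run in real time
One-Shot Talking Head	Two-stage pipeline (audio → landmarks → video, Chen et al., 2019); 3D-coefficient driven (Chen et al., 2020)	Audio + a single reference image	Video with diverse expressions and head motion	Driven by a single image, very flexible; diffusion models (Tian et al., 2024) increase diversity	Missing details (teeth/texture); diffusion models lose identity details, are computationally expensive, and need complex inference steps
Few-Shot Face Visual Dubbing	Encoder-decoder (Prajwal et al., 2020a); deformation inpainting network (Zhang et al., 2023)	Audio + source face (a few reference images)	Dubbed video with the mouth region replaced	Directly replaces the lip region; adapts well	Blurry texture, inconsistent identity; the inpainting network overfits easily; local color differences

Key Notes

Differences in input requirements:
- Person-Specific: needs a large amount of training data of the target person;
- One-Shot: needs only a single reference image and is highly flexible;
- Few-Shot: replaces a local region (the mouth) based on a few reference images.
Core challenges:
- Fidelity vs. efficiency: Person-Specific is faithful but inefficient; One-Shot/diffusion is diverse but computationally expensive;
- Detail preservation: teeth, mouth texture, and high-frequency details remain a bottleneck (especially for One-Shot and Few-Shot).
Evolution of representative work:
- 3D models → diffusion models: a shift from physically based modeling to generative AI, gaining diversity at the cost of determinism;
- Encoder-decoder → deformation inpainting: Few-Shot methods gradually improve texture preservation but still have to solve overfitting.

The table compares the main types of audio-driven talking-face generation and highlights their strengths and open problems in different application scenarios.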

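Lip Sync (Video Editing)

Video + Audio -> Video

Diffusion models (e.g. [29]) produce richer detail but generate more slowly;
GAN-style methods (e.g. MuseTalk) trade some detail for speed.

Diffusion-based Lip Sync Methods

ID	Year	Name	Note	Tags	Link
89	2025.3.13	LatentSync: Taming Audio-Conditioned Latent Diffusion Models for Lip Sync with SyncNet Supervision	1. Trains in latent space, supervises in pixel space 2. Uses TREPA in place of a temporal layer 3. Systematically analyzes SyncNet training parameters and their effect	LDM, open source	link
	2023	Speech Driven Video Editing via an Audio-Conditioned Diffusion Model	✅ (1) Mask out the speaking region (2) Use diffusion to generate the speaking region from audio features. ✅ Extra constraints: (1) reference state (2) smoothness across neighboring frames ✅ Speech drives the mouth shape.

Method/Paper	Architecture	Training strategy	Generation stages	Input → Output	Main innovation
[34] & [2] 2024	Pixel-space diffusion model	End-to-end audio-conditioned diffusion	Single stage: directly generates synchronized lip images	Audio → image	End-to-end pixel-level diffusion with no intermediate representation
[57] 2024	Diffusion model + VAE	Two-stage training	Stage 1: diffusion model (audio → motion); Stage 2: VAE (motion → image)	Audio → motion → image	Decouples motion and rendering across stages to reduce generation complexity
[64] 2024.08	Transformer + diffusion model	Two-stage training	Stage 1: Transformer (audio → motion); Stage 2: diffusion model (motion → image)	Audio → motion → image	Transformer encodes audio timing; diffusion model refines generation
[29] 2024	Diffusion autoencoder	Two-stage training	Stage 1: diffusion autoencoder (masked image → semantic code); Stage 2: diffusion model (semantic code + audio → image)	Audio + masked image → image	Combines a semantic latent code with audio conditioning for better controllability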

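Non-diffusion Lip Sync Methods

ID	Year	Name	Note	Tags	Link
	2025.6.17	SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting			link
90	2024.10	MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting	1. Borrows the diffusion architecture but trains GAN-style (no diffusion process), balancing speed and quality 2. Selects the reference image by feature-based filtering to improve quality.	LDM, open source, real-time, GAN, frame-by-frame, VQ-VAE	link
91	2020.8.23	A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild	1. The first lip-sync generation method that works across identities 2. Pretrained lip-sync discriminator (SyncNet supervision) + adversarial learning 3. Proposes the lip-audio alignment metrics LSE-C and LSE-D	Wav2Lip, GAN, SyncNet, LSE-C, LSE-D	link

Method/Paper	Architecture	Training strategy	Generation stages	Input → Output	Main innovation
[20] 2023	VQ-VAE + quantized-space generator	Staged training	Stage 1: VQ-VAE encodes face/head pose; Stage 2: generate high-resolution images in the quantized space	Audio → quantized code → image	Trains the generator in the quantized space to raise image resolution
StyleSync [18] 2023	StyleGAN2 generator	Adversarial learning (SyncNet supervision)	Single stage: StyleGAN2 generates synchronized lip images	Audio → image	Combines StyleGAN2's high-fidelity generation with SyncNet supervision
VideoReTalking [8] 2022	Multi-component framework (reenactment + sync + refinement)	Joint training of separate modules	Stage 1: semantic reenactment network; Stage 2: lip-sync network; Stage 3: identity-aware refinement	Audio → image	Modular design separates semantics, synchronization, and identity control
DINet [63] 2023	Feature-map deformation network, dual encoder + facial action unit (AU) system	End-to-end training	Single stage: the driving audio directly deforms feature maps to produce the mouth shape	Audio → image	Fine mouth-shape control through feature deformation; avoids error accumulation across stages

Model	Core technique	Main contribution	Key innovation	Strengths
Wav2Lip [Pra20b]	Pretrained lip-sync discriminator + adversarial training	Generates highly realistic lip-synced video	Uses SyncNet as a discriminator to supervise the generator and optimize lip-audio alignment	Widely recognized lip-sync quality; works in many scenarios
VideoRetalking [Che22]	Three-stage pipeline (expression neutralization → lip sync → identity enhancement)	Lip-sync generation for high-quality video editing	Staged processing (expression neutralization + identity-aware enhancement) improves identity consistency	Suited to video editing; keeps identity and natural expression
DI-Net [Zha23]	Dual encoder + facial action unit (AU) system	Generates realistic, emotionally consistent face video	Uses the facial action unit system to control emotion; the dual encoder separates content and identity features	High emotional consistency; suited to applications that need emotional expression (e.g. virtual hosts)
TalkLip [Wan23]	Contrastive learning + Transformer audio encoding	Improves global temporal dependency in lip-speech sync	Contrastive learning optimizes audio-video alignment; Transformer models global temporal relations	More accurate sync; handles complex speech rhythm and long-range dependencies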

Pose-driven Human Video Generation

Includes single-condition and multi-condition pose-guided methods.

2D Motion-driven

pose + reference image -> video

ID	Year	Name	Note	Tags	Link
108	2025.4.30	ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction	Explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, which markedly improves its ability to generate high-quality videos with complex motion and interaction 1. A video diffusion model generates a coarse video 2. A set of 2D and 3D features is extracted from the coarse video to build an object-centric 3D representation, which is refined by the proposed parameterized physical prior model into an accurate 3D motion sequence 3. The refined motion sequence is fed back into the same video diffusion model as an additional condition	Three-stage, plug-and-play	link
	2025.5.6	FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios		Pose-guided video synthesis	link
	2025.5.6	Real-Time Person Image Synthesis Using a Flow Matching Model		Pose-guided person image synthesis, flow matching	link
37	2024	TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models	Stabilizes the background over time by correcting the attention map	Diffusion	link
2	2024.1	Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos	Uses text descriptions to provide semantic information about the content of the characters, ensuring the generated videos align with the textual descriptions.	Human appearance control; designs a two-stage training scheme that uses image-pose pairs and pose-free videos to generate pose-controllable character animation
	2023	DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion
121	2023	MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
	2023	Dancing avatar: Pose and text-guided human motion videos synthesis with image diffusion model
	2023	Disco: Disentangled control for referring human dance generation in real world

Video Motion-driven

ID	Year	Name	Note	Tags	Link
99	2025.5.19	FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance	1. Extract 2D pose from the video 2. Lift the 2D pose to 3D 3. Apply physics-based optimization to the 3D pose 4. Guide video generation with the optimized pose	Differentiable physics optimization; pose comes from video; no appearance control	link
53	2024	Implicit Warping for Animation with Image Sets	Uses the person in the driving video to drive the person in the reference image, generating a video in which the reference performs the same motion as the driving video	Pose comes from video; appearance comes from the reference image; Cross Attention	link

3D Motion-driven

ID	Year	Name	Note	Tags	Link
	2025.5.28	LatentMove: Towards Complex Human Movement Video Generation		A DiT (diffusion Transformer) based image-to-video (I2V) framework tailored to highly dynamic human animation	link
42	2024	HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation	3D modeling + 3D retargeting + rendering; motion control + camera control	Human video generation, 3D pipeline	link

Virtual Try-on

ID	Year	Name	Note	Tags	Link
	2025	RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency	Virtual try-on

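Datasets and Evaluation Metrics

Datasets

Evaluation Metrics

link

Challenges and Open Problems

Occlusion: overlapping body parts and occlusion between people are common, but most models do not handle the mutual interference well [98], [138].
Body deformation
Inconsistent appearance
Background influence
Temporal inconsistency
Unnatural poses
Text- and speech-driven generation is inherently one-to-many, so results may be biased by the dataset

Factors That Affect Generation Quality

Generation paradigm

Pose-driven methods can be viewed as one-stage. Text- and audio-driven methods, by contrast, split into one-stage and two-stage methods. One-stage methods use the input text or audio directly as the prompt that guides human video generation. Two-stage methods first generate poses from the text or audio and then use those poses as the guiding signal. Introducing pose types such as skeleton poses in the two-stage methods adds geometric and semantic information, which improves the accuracy and realism of the motion. Two-stage methods are therefore clearly more effective than one-stage methods, at some cost in efficiency.

backbone

Diffusion models such as SD and SVD are widely used in generative tasks, including human video generation, because of their strong performance and diversity. Unlike GANs, which produce a sample in a single sampling step, diffusion models need many sampling steps, which increases the time cost of training and inference.

Pose control signals

Different types of conditioning poses work because they provide complementary information.

Skeleton poses accurately describe the spatial layout of the body in a frame and the relative positions of body parts. However, they capture discrete pose changes rather than continuous motion detail, so their temporal coherence is limited.
Optical flow inherently contains temporal information: it captures the change between consecutive frames and gives continuous motion trajectories in feature space, which lets the model generate videos with smooth transitions between frames and avoid jumps or discontinuities.
Depth maps capture the distance between the body and the background, as well as surface detail and depth variation.
3D meshes provide the detailed surface geometry that skeleton poses lack.

In short, different pose types provide complementary spatio-temporal information, and no single pose type meets every requirement. Different scenarios and problems may call for different poses.

Future Research Directions

Large-scale, high-quality human video datasets
Long video generation
High-fidelity video generation
More efficient human video diffusion models
Fine-grained controllability
Interactivity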

Reference

A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights
https://github.com/wentaoL86/Awesome-Human-Video-Generation

Video Editing

3.1 Tuning-based

One-Shot Tuned Video Editing

Compared to training-free editing methods:

Cons: still need 1 video for training
Pros: supports significant shape change


ID	Year	Name	Note	Tags
118	2023	Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
119	2023	Dreamix: Video Diffusion Models are General Video Editors
	2023	Towards Consistent Video Editing with Text-to-Image Diffusion Models	Modify self-attention for better temporal consistency
	2023	Video-P2P: Video Editing with Cross-attention Control	Improve input-output semantic consistency of video editing via shared embedding optimization and cross-attention control.	Attention control


Multiple-Shot Tuned

Video Editing: Text Conditioned


ID	Year	Name	Note	Tags	Link
120	2023	MotionDirector: Motion Customization of Text-to-Video Diffusion Models


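✅ After training on a single video, the video can be edited.
✅ Training: (1) fine-tune the model's temporal modules.
✅ (2) Shuffle the frames and fine-tune on them as images.
✅ Fine-tune on a mix of video and images.
✅ During image fine-tuning the temporal modules are kept fixed.

✅ Training is required, and it is done for one specific model.
✅ Basic paradigm: input: a video, a text-to-image model, and a text prompt. Output: a text-to-video model derived from the customized text-to-image model.
✅ No large-scale training; training on a single video takes only about ten minutes.

✅ Inference: (1) downsample the video to reduce its dimensions.
✅ (2) Add noise to obtain the initial noise, similar to DDIM inversion.
✅ (3) Generate with the diffusion model.
✅ (4) Upsample.
✅ Would more reference videos help the model learn better?
✅ (1) Learn the concept from several videos.
✅ (2) Plug the concept into the diffusion model.
✅ Learn a motion concept from multiple videos.

✅ Besides object motion, camera motion and object trajectories can also be learned.

✅ How can a concept be applied to different objects?
✅ How can motion be learned without being affected by the object's appearance, i.e. can the two be decoupled?
✅ Branch 1: spatial path; gray is the spatial LoRA, which learns appearance.
✅ Branch 2: temporal path; blue is the temporal LoRA, which learns motion.
✅ Debias: remove the influence of appearance on the loss.
✅ When the temporal LoRA is trained, the spatial LoRA's weights are used but not modified.
✅ Applications: (1) can also be used one-shot
✅ (2) can combine appearance and motion
✅ (3) can be used for image animation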
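The spatial and temporal LoRA branches in these notes both rely on the same low-rank update, which can be sketched on a tiny linear layer. This is a generic LoRA sketch, not code from the paper: the class name, the `enabled` switch, and the initial values of `A` are illustrative assumptions, while zero-initializing `B` follows the usual LoRA convention.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """y = W x + scale * B (A x); W stays frozen, only A and B are trained."""

    def __init__(self, weight, rank, scale=1.0):
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                             # frozen base weights
        self.A = [[0.01] * in_dim for _ in range(rank)]  # rank x in, trainable
        self.B = [[0.0] * rank for _ in range(out_dim)]  # out x rank, zero init
        self.scale = scale
        self.enabled = True  # switch the adapter off to recover the base model

    def __call__(self, x):
        y = matvec(self.weight, x)
        if self.enabled:
            delta = matvec(self.B, matvec(self.A, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1)
untrained = layer([2.0, 3.0])   # B is zero, so this equals the base output
layer.B = [[1.0], [0.0]]        # pretend training updated B
adapted = layer([2.0, 3.0])
layer.enabled = False
base_again = layer([2.0, 3.0])
```

Because each adapter is a separate additive term, an appearance adapter and a motion adapter can be trained on different branches and enabled or dropped independently, which is what the combination of appearance and motion above depends on.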

3.2 Training-free


ID	Year	Name	Note	Tags	Link
117	2023	TokenFlow: Consistent Diffusion Features for Consistent Video Editing
	2023	FateZero: Fusing Attentions for Zero-shot Text-based Video Editing	Attention map fusing for better temporal consistency - During DDIM inversion, save inverted self-/cross-attention maps - During editing, use some algorithms to blend inverted maps and generated maps


More Works


	MeDM (Chu et al.) Optical flow-based guidance for temporal consistency “MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance,” arXiv 2023.
	Ground-A-Video (Jeong et al.) Improve temporal consistency via modified attention and optical flow “Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models,” arXiv 2023.
	Gen-L-Video (Wang et al.) Edit very long videos using existing generators “Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising,” arXiv 2023.
	FLATTEN (Cong et al.) Optical flow-guided attention for temporal consistency “Flatten: optical flow-guided attention for consistent text-to-video editing,” arXiv 2023.
	InFusion (Khandelwal et al.) Improve temporal consistency via fusing latents “InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing,” ICCVW 2023.
	Vid2Vid-Zero (Wang et al.) Improve temporal consistency via cross-attention guidance and null-text inversion “Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models,” arXiv 2023.

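✅ For each word token of the input text, the attention map gives its approximate location in the image. In the source, the tokens to be removed are masked out and the rest is kept. In the generated image, everything except those tokens is masked out. The two parts are then blended.

✅ Various ControlNet variants based on different signals.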


3.3 Controlled Editing (depth/pose/point/ControlNet)

✅ Given an existing video, modify it through guidance or a text description.

Depth Control

Depth estimating network

ID	Year	Name	Note	Link
	2022	Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer	✅ Depth information is encoded into a latent code and concatenated with the noise.
122	2023	Structure and Content-Guided Video Synthesis with Diffusion Models	Transfer the style of a video using text prompts given a “driving video”; extends a pretrained image diffusion model by adding temporal mixing layers in several forms	Gen-1, Framewise, depth-guided
123	2023	Pix2Video: Video Editing using Image Diffusion	Framewise depth-guided video editing

ControlNet / Multiple Control

Also ControlNet-style, but with more control conditions.

ID	Year	Name	Note
124	2023	ControlVideo: Training-free Controllable Text-to-Video Generation
	2023	VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet	Optical flow-guided video editing; I, P, B frames in video compression ✅ Preserves content consistency and suits style transfer, but not edits that change the object substantially (e.g. its shape).
	2023	CCEdit: Creative and Controllable Video Editing via Diffusion Models
	2023	VideoComposer: Compositional Video Synthesis with Motion Controllability	Image-, sketch-, motion-, depth-, mask-controlled video editing ✅ Each condition passes through an STC-Encoder; the conditions are then fused and fed to the U-Net. Spatio-Temporal Condition encoder (STC-encoder): a unified input interface for conditions
	2023	Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models	Generates video from sequential control signals such as edge maps or depth maps, and proposes two motion-adaptive noise initialization strategies
	2024	Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models.	Trajectory control
	2023	MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
	2023	Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance
	2023	MagicEdit: High-Fidelity and Temporally Coherent Video Editing
	2023	EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints

P225

Point-Control

ID	Year	Name	Note	Tags	Link
98	2023	VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

P226

本文出自CaterpillarStudyGroup，转载请注明出处。

https://caterpillarstudygroup.github.io/ImportantArticles/

3.4 3D-Aware

P243

ID	Year	Name	Note	Link
	2021	Layered Neural Atlases for Consistent Video Editing	- Decompose a video into a foreground image + a background image - Edit the foreground/background image = edit the video ✅ Edit the background (image editing, style transfer), then propagate the edit to the other frames.
	2023	VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing	Atlas-based video editing - Decompose a video into a foreground image + a background image - Edit the foreground/background image = edit the video - Use diffusion to edit foreground/background atlas ✅ Foreground editing: (1) Cut out the foreground of the first frame and edit it to obtain a Partial Atlas. ✅ (2) The Partial Atlas serves as the condition for the next frame, so the process is autoregressive overall. ✅ All Partial Atlases are merged into a whole. ✅ The background uses depth information as the condition.
	2023	Shape-aware Text-driven Layered Video Editing	Atlas-based video editing
	2023.11	StableVideo: Text-driven Consistency-aware Diffusion Video Editing	✅ Given multi-view images of a scene, learn an implicit representation of the 3D scene with an MLP.
115	2023	CoDeF: Content Deformation Fields for Temporally Consistent Video Processing		link
	2023	HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
116	2023	DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing		link


3.5 Other Guidance

P264

Year	Name	Note	Link
2023	InstructVid2Vid: Controllable Video Editing with Natural Language Instructions	- Generate ⟨instruction, video⟩ dataset using ChatGPT, BLIP and Tune-A-Video - Train inflated Stable Diffusion for instruction-guided video editing
2023	Soundini: Sound-Guided Diffusion for Natural Video Editing	Sound-guided video editing
2023	DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory		Trajectory control
2023	Collaborative Score Distillation for Consistent Visual Synthesis
2023	Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts

P272

✅ showlab/Awesome-Video-Diffusion


Evaluation Metrics

Image quality, video quality, consistency, diversity, aesthetics, and motion accuracy

Image-level Evaluation Metrics

Fréchet Inception Distance (FID, ↓): semantic similarity between images
Peak Signal-to-Noise Ratio (PSNR, ↑): pixel-level similarity between images
Structural Similarity Index (SSIM, ↑): pixel-level similarity between images
CLIPSIM (↑): image-text relevance

Fréchet Inception Distance (FID)

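✅ FID measures how far apart two distributions are.
✅ Because it uses high-level network features, it can assess high-level semantic similarity.

✅ CNN + Softmax is a pretrained image classification network; the layer just before the softmax is taken as the image feature.
✅ Extract features from a large number of real images and from images produced by the generative model.
✅ Assume the features of each set follow a Gaussian distribution, then compute the distance between the two distributions.
✅ Advantage: the score agrees closely with human judgment. Disadvantage: it needs a large number of samples.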
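The last two steps (fit a Gaussian to each feature set, then measure the distance between the Gaussians) can be sketched as follows. This is a minimal illustration that assumes the features are already extracted; the random arrays stand in for Inception features and the function name `frechet_distance` is made up for this example.

```python
import numpy as np

def frechet_distance(feat_real, feat_gen):
    """Frechet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_r - mu_g) ** 2).sum()
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

# Placeholder "features": in practice these come from the pretrained classifier.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 4))
fake_close = rng.normal(0.0, 1.0, size=(2000, 4))  # same distribution as real
fake_far = rng.normal(3.0, 1.0, size=(2000, 4))    # shifted distribution
fid_close = frechet_distance(real, fake_close)
fid_far = frechet_distance(real, fake_far)
```

A generator whose features match the real ones gets a small distance, while the shifted set gets a large one, which is the behavior the ↓ arrow refers to.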

P49

Peak Signal-to-Noise Ratio (PSNR)

Pixel-level similarity between images

For two images $x,y \text{ of shape } M\times N$:

\begin{align*} \mathrm{PSNR} (x,y) = 10 \log_{10}{} \frac{255^2}{\mathrm{MSE} (x,y)} \end{align*}

where

\begin{align*} \mathrm{MSE} (x,y) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} (x_{ij}-y_{ij})^2\end{align*}
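As a small worked instance of the two formulas above, the sketch below computes MSE and PSNR with NumPy for 8-bit images (peak value 255, as in the formula). The function names are chosen for this example only.

```python
import numpy as np

def mse(x, y):
    # Convert to float first so that uint8 subtraction cannot wrap around.
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(np.mean((x - y) ** 2))

def psnr(x, y, peak=255.0):
    err = mse(x, y)
    if err == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / err))

x = np.full((4, 4), 100, dtype=np.uint8)
y = x.copy()
y[0, 0] = 116            # one of 16 pixels is off by 16, so MSE = 256 / 16 = 16
value = psnr(x, y)       # 10 * log10(255^2 / 16)
```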

P50

Structural Similarity Index Measure (SSIM)

Pixel-level similarity between images

Model any image distortion as a combination of:
(1) loss of correlation, (2) luminance distortion, (3) contrast distortion
For two images $x,y \text{ of shape } M\times N$:

\begin{align*} \mathrm{SSIM} (x,y)=l(x,y)\cdot c(x,y)\cdot s(x,y)\end{align*}

where

\begin{align*} \begin{cases} \text{Luminance Comparison Function: } l(x,y)=\frac{2\mu _x\mu _y+C_1}{\mu _x^2+\mu _y^2+C_1} \\ \text{Contrast Comparison Function: } c(x,y)=\frac{2\sigma _x\sigma _y+C_2}{\sigma _x^2+\sigma _y^2+C_2} \\ \text{Structure Comparison Function: } s(x,y)=\frac{\sigma _{xy}+C_3}{\sigma _{x}\sigma _{y}+C_3} \end{cases}\end{align*}
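The three comparison functions can be evaluated directly from image statistics. The sketch below is a simplified, single-window version that uses global means and variances; practical SSIM implementations average the score over local windows. The constants $C_1=(0.01\cdot 255)^2$, $C_2=(0.03\cdot 255)^2$, $C_3=C_2/2$ are the commonly used defaults and are an assumption here, since the notes do not specify them.

```python
import numpy as np

def ssim_global(x, y, peak=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM from global statistics (simplified illustration)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    sd_x, sd_y = x.std(), y.std()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    lum = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)   # l(x, y)
    con = (2 * sd_x * sd_y + c2) / (sd_x ** 2 + sd_y ** 2 + c2)   # c(x, y)
    struct = (cov_xy + c3) / (sd_x * sd_y + c3)                    # s(x, y)
    return float(lum * con * struct)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(16, 16))
score_same = ssim_global(img, img)            # identical images
score_inverted = ssim_global(img, 255 - img)  # negated image, anti-correlated structure
```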

P51

CLIP Similarity

✅ CLIP Score measures how well the image matches the text.

Year	Name	Link
2017	GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium	link
2023	Hung-Yi Lee, “Machine Learning 2023 Spring,” National Taiwan University.
2010	Horé et al., “Image Quality Metrics: PSNR vs. SSIM,”
2004	Wang et al., “Image Quality Assessment: from Error Visibility to Structural Similarity,”
2021	Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,”

Video-level Evaluation Metrics

Fréchet Video Distance (FVD, ↓): semantic similarity & temporal coherence
Kernel Video Distance (KVD, ↓): video quality (via semantic features and MMD)
Video Inception Score (IS, ↑): video quality and diversity
Frame Consistency CLIP Score (↑): frame temporal semantic consistency

P52

Fréchet Video Distance (FVD)

Semantic similarity and temporal coherence between two videos

P53

Kernel Video Distance

Video quality assessment via semantic features and MMD

P54

Video Inception Score (IS)

Video quality and diversity

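✅ Diversity: how diverse the generated distribution is when no condition is given.
✅ Quality: when a condition is given, the model should generate the specified class.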

P55

Frame Consistency CLIP Score

Frame temporal semantic consistency

Compute CLIP image embeddings for all frames
Report average cosine similarity between all pairs of frames
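The two steps above reduce to an average over the upper triangle of a cosine similarity matrix. The sketch below assumes the per-frame CLIP image embeddings are already available as an array of shape (frames, dim); the toy arrays are placeholders, not real CLIP outputs.

```python
import numpy as np

def frame_consistency(embeddings):
    """Average cosine similarity over all distinct pairs of frame embeddings."""
    e = np.asarray(embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)   # unit-normalize each frame
    sim = e @ e.T                                      # pairwise cosine similarities
    upper = np.triu_indices(len(e), k=1)               # each pair once, no self-pairs
    return float(sim[upper].mean())

identical_frames = np.tile([1.0, 2.0, 3.0], (4, 1))   # four identical "frames"
unrelated_frames = np.eye(3)                           # mutually orthogonal embeddings
score_identical = frame_consistency(identical_frames)
score_unrelated = frame_consistency(unrelated_frames)
```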

ID	Year	Name	Note	Tags	Link
	2019	Unterthiner et al., “FVD: A new Metric for Video Generation,”
	2018	Unterthiner et al., “Towards Accurate Generative Models of Video: A New Metric & Challenges,”
	2016	Salimans et al., “Improved Techniques for Training GANs,”
	2018	Barratt et al., “A Note on the Inception Score,”
	2020	Saito et al., “Train Sparsely, Generate Densely: Memory-Efficient Unsupervised Training of High-Resolution Temporal GAN,”
	2021	Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,”

P57

Subjective Evaluation

Hybrid evaluation：EvalCrafter

Creates a balanced prompt list for evaluation
Multi-criteria decision analysis on 18 metrics: visual quality, content quality…
Regress the coefficients of all metrics to generate an overall score aligned with user opinions

ID	Year	Name	Note	Tags	Link
	2023	Liu et al., “EvalCrafter: Benchmarking and Evaluating Large Video Generation Models,”

P45

Datasets

The WebVid-10M Dataset

Bain et al., “Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval,” ICCV 2021.

✅ WebVid is a widely used video dataset with high-resolution videos and paired text.

ID	Year	Name	Note	Tags	Link
	2025	FlexiClip: Locality-Preserving Free-Form Character Animation

A COMPREHENSIVE ANALYSIS OF PINNS: VARIANTS, APPLICATIONS, AND CHALLENGES

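Physics-informed neural networks (PINNs) are a newer variant of classical neural networks, developed specifically for solving partial differential equations and related forms. Compared with traditional numerical methods, PINNs offer the following advantages:

They are mesh-free, so they can handle problems with irregular, complex, or high-dimensional geometries.
They can encode prior physical knowledge and thereby produce valid approximate solutions.
They can learn the underlying regularities from unlabeled training data.

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
180	2019	Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations	PINN	Solves partial differential equations and related forms	link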

PINNs Architecture

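Input: coordinate points inside the specified integration domain
Output: the approximate solution of the differential equation at those points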

[TODO] Figure 1

3.3 Constructing the loss function

Consider the general form of a parameterized PDE:

$$ \begin{aligned} \text{PDE: } f\left(x, t, \frac{\partial y}{\partial x}, \frac{\partial y}{\partial t}, \ldots; \Psi\right) &= 0, \quad x \in \Omega,\ t \in [0, T] \\ \text{Initial condition (IC): } y(x, t_0) &= h(x), \quad x \in \Omega \\ \text{Boundary condition (BC): } y(x, t) &= g(t), \quad x \in \partial\Omega,\ t \in [0, T] \end{aligned} $$

The equation is defined on a domain $\Omega \subset \mathbb{R}^N$ with boundary $\partial\Omega$, where:

$x = (x_1, x_2, \cdots, x_N) \in \mathbb{R}^N$ is the spatial coordinate,
$t$ is time,
$f$ is the function describing the problem, containing the differential operators and the parameters $\Psi$,
$y(x, t)$ is the solution of the PDE,
$h(x)$ is the initial condition,
$g(t)$ is the boundary condition (Dirichlet, Neumann, Robin, or periodic).

Using the universal approximation capability of neural networks, a surrogate solution $\hat{y}(x, t; \theta)$ of $y(x, t)$ can be constructed, where $\theta$ denotes the set of weights and biases of the network:

$$ y(x, t) \approx \hat{y}(x, t; \theta) $$

The loss function is defined as:

$$ \begin{aligned} \mathcal{L}(\theta) &= w_f \mathcal{L}_f(\theta) + w_{ic} \mathcal{L}_{ic}(\theta) + w_{bc} \mathcal{L}_{bc}(\theta) \\ \mathcal{L}_f(\theta) &= \frac{1}{N_f} \sum_{i=1}^{N_f} \left\| f\left(x, t, \frac{\partial \hat{y}}{\partial x}, \frac{\partial \hat{y}}{\partial t}, \ldots; \Psi\right) \right\|_2^2 \\ \mathcal{L}_{ic}(\theta) &= \frac{1}{N_{ic}} \sum_{i=1}^{N_{ic}} \left\| \hat{y}(x, t_0) - h(x) \right\|_2^2 \\ \mathcal{L}_{bc}(\theta) &= \frac{1}{N_{bc}} \sum_{i=1}^{N_{bc}} \left\| \hat{y}(x, t) - g(t) \right\|_2^2 \end{aligned} $$

where:

$N_f$ is the number of collocation points,
$N_{ic}$ is the number of points at which the initial condition is enforced,
$N_{bc}$ is the number of points at which the boundary condition is enforced,
$w_f$, $w_{ic}$ and $w_{bc}$ are the corresponding weights.
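To make the structure of this loss concrete, the sketch below applies it to the toy ODE $y' + y = 0$, $y(0)=1$ (exact solution $e^{-t}$). It is an illustration under stated assumptions, not a real PINN: the neural network is replaced by a degree-6 polynomial surrogate so that derivatives are analytic and the weighted loss can be minimized by linear least squares; there is no boundary term because the problem has only an initial condition. The weights `w_f`, `w_ic` and the number of collocation points are arbitrary choices.

```python
import numpy as np

degree = 6
t = np.linspace(0.0, 1.0, 50)                  # collocation points (N_f = 50)
powers = np.arange(degree + 1)

phi = t[:, None] ** powers                     # y_hat(t) = phi @ theta
dphi = np.zeros_like(phi)                      # d y_hat / dt = dphi @ theta
dphi[:, 1:] = powers[1:] * t[:, None] ** (powers[1:] - 1)

w_f, w_ic = 1.0, 10.0

# Rows are scaled so that the squared norm of (A @ theta - b) equals
# w_f * mean(residual^2) + w_ic * (y_hat(0) - 1)^2, i.e. the weighted loss.
A_f = np.sqrt(w_f / len(t)) * (dphi + phi)     # ODE residual y' + y at each point
b_f = np.zeros(len(t))
phi_0 = np.zeros(degree + 1)
phi_0[0] = 1.0                                 # y_hat(0) = theta[0]
A_ic = np.sqrt(w_ic) * phi_0[None, :]
b_ic = np.array([np.sqrt(w_ic) * 1.0])

A = np.vstack([A_f, A_ic])
b = np.concatenate([b_f, b_ic])
theta, *_ = np.linalg.lstsq(A, b, rcond=None)

loss = float(np.sum((A @ theta - b) ** 2))     # minimized weighted loss
y_hat_1 = float(theta.sum())                   # surrogate value at t = 1
```

With a neural network surrogate the same loss would be minimized by gradient descent, with the derivatives obtained by automatic differentiation instead of the analytic `dphi`.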

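Solving ODEs with PINNs

Compared with traditional mathematical methods, deep-learning-based methods show several notable advantages when solving ODEs:

However complex the mathematics involved in the solution process, the solutions they produce are highly accurate.
Boundary conditions and dimensionality are key factors that limit traditional mathematical methods, whereas deep learning methods adapt well to both.
They can also solve problems whose data are randomly distributed or noisy.

The main deep learning techniques currently used to solve ODEs are neural ODEs, physics-informed neural networks, and generative adversarial networks. The survey focuses on the second.

[TODO] Table 2

Year	Name	Pain point addressed	Main contribution	Tags
2023	Solving stiff ordinary differential equations using physics informed neural networks (pinns)	Solving stiff ODEs with PINNs
2023	Solving differential equations using physics informed deep learning: a hand-on tutorial with benchmark tests.	Systematically traces how DL techniques for solving ODEs evolved from conventional NNs to PINNs.	1. Explains in detail the factors involved in designing a PINN, including loss construction, the role of physical concepts, and optimization methods. 2. Benchmarks the network on different ODEs and compares it against classical integration methods. The authors find that the main advantage of PINNs is that, for weakly nonlinear problems, they need only a fraction of the data required by traditional methods to match any commonly used technique. For highly nonlinear problems, PINNs struggle under ordinary conditions and need prior knowledge of training data over part of the integration interval to compensate.
2021	Solving ordinary differential equations using an optimization technique based on training improved artificial neural networks	Solving ODEs with DL; can be regarded as one of the key drivers of PINN development.	Proposes a new method that uses an improved ANN to find numerical solutions of ODEs: 1. First compute an approximate solution of the given ODE, then minimize the loss. 2. The loss combines several error functions. 3. Network parameters are updated based on the Levenberg-Marquardt algorithm. The proposed network achieves higher accuracy and faster convergence.
2020	A tutorial on solving ordinary differential equations using python and hybrid physics-informed neural network.	Research on solving ODEs with PINNs was still shallow and had not produced systematic findings.	The first fairly comprehensive discussion of applying PINNs to ODEs. It takes an implementation-oriented view and explains the techniques with a standard Python framework. Rather than focusing heavily on the physics itself, it uses a data-driven kernel as a more convenient route to training convergence, so the resulting hybrid network combines physical concepts with data-driven kernels.	link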

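Solving PDEs with PINNs

Analytical solutions of PDEs still cannot be obtained efficiently. Several mature numerical methods exist, but their complexity is high.

[TODO] Table 3

Solving fractional differential equations (FDEs) with PINNs

[TODO] Table 4

PINN variants

PINN applications

Fluid mechanics

Most problems in this field reduce to solving the Navier-Stokes (NS) equations, which are well suited to approximation by PINN models.

Compared with traditional numerical methods, the core advantages of PINNs in fluid mechanics are:

The same model can handle both forward and inverse problems.
PINNs combine flow observation data with the governing physical equations, so they are driven by both data and physics.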

Physics-based fluid simulation in computer graphics: Survey, research trends, and challenges

link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10885003

[TODO] Figure 1

Notation

[TODO] Table 1

Fundamentals of fluid mechanics

Early development

Agenda

01 Flow Matching Basics

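Basic paradigm of generative models
Parameterization, training, and inference of Flow Matching

02 Flow Matching Advanced Designs

Conditional generation
Inverse Problems (training-based)
Generating geometry (with symmetries, or on Riemannian manifolds) with Flow Matching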

03 Model Adaptation

Faster Sampling
Inverse Problems (Training-Free)
Reward Fine-tuning

04 Generator Matching and Discrete Flows


Flow Matching Basics

P6
WHAT IS FLOW MATCHING?
A scalable method to train flow generative models.

HOW DOES IT WORK?
Train by regressing a velocity, sample by following the velocity

P11

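Incremental generation methods

Marginal probability path

Flow matching is an incremental generation method, and what it learns is a marginal probability path.
The marginal probability path assigns to every time $t$ the distribution $p_t$ that $X_t$ follows, i.e. a family of distributions indexed by continuous time.

The key requirement for a generative model is that the marginal probability path starts at the distribution $P$ and ends at the distribution $Q$.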

P12

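Characteristics of the three incremental generative models

Compared with the other incremental generation methods, flows are (1) deterministic: given $X_t$, $X_{t+h}$ is fully determined; and (2) smooth.
Advantages of flows: (1) sampling is fast; (2) an unbiased estimator of the model likelihood can be constructed.
Diffusion and Jump models have a larger design space, and therefore more generative flexibility.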

P13

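Flow generative models

Parameterizing a flow

$\Psi_t$ is the transition function of a flow generative model.
$\Psi_t$ is a bijection, so it can reshape the space without losing information.
By warping the high-dimensional space, it gradually transforms the distribution $P$ into the distribution $Q$.
Properties of a bijection:

One-to-one and onto: every input maps to a unique output, and every output is reached by some input.
Invertibility: there is an inverse function $ f^{-1}: Y \to X $ such that $ f^{-1}(f(x)) = x $ and $ f(f^{-1}(y)) = y $.

A flow model is a Markov process.
A Markov process is a stochastic process with the memoryless (Markov) property: the future state depends only on the current state, not on the history.

Problems with direct parameterization

A linear combination of two bijections is in general not a bijection, so models built on bijections are hard to parameterize.

$$ \alpha X_ {t|1}+\beta X_ {t|2}\ne \Psi _ t(\alpha X_ {t|1}+\beta X_ {t|2}) $$

Network models usually contain many linear combinations, activation functions, and other structures that break bijectivity, so it is hard to make a network learn a bijection.
"Parameterizing a model" means defining the model's structure and function through a set of adjustable parameters. It is a core step of model design and determines how the model learns from input data, makes predictions, or generates outputs. It covers designing the model structure and connectivity, defining how parameters are initialized, and deciding which parameters are optimized.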

P14

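Parameterizing the flow by its velocity

The flow is therefore parameterized through a velocity. Here the velocity describes how fast, and in which direction, each sample of the distribution $p_t$ moves toward its corresponding sample in the distribution $Q$.
Flow and velocity can be converted into each other: differentiating the flow gives the velocity, and solving the ODE defined by the velocity gives the flow.

Advantage of using the velocity: velocities are linear, so they can be added and decomposed, which makes them easy to parameterize.
Drawback of using the velocity: sampling requires solving an ODE along the predicted velocity to obtain the image.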

$$ \frac{d}{dt} \Psi _t(x)=u_t(\Psi _t(x)) $$

$$ \frac{d}{dt}\Psi _t(\alpha X_1+\beta X_2)=\alpha u_t(\Psi _t(X_1))+\beta u_t(\Psi _t(X_2)) $$

P15

Velocity $u_t$ generates $p_t$ if

$$ X _t=\Psi _t(X_0)\sim p_t $$

The marginal probability path is defined through the velocity; $\Psi_t$ is the transition function induced by that velocity.

P16

Training Flow Matching

Learn a velocity model such that the marginal probability path $p_t$ generated by the velocity satisfies $p_0 = P$ and $p_1 = Q$.

P17

Sampling a flow model

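Inference with Flow Matching:
(1) Sample a noise point from the distribution $P$.
(2) Follow the velocity (solve the ODE) to obtain the corresponding sample in the distribution $Q$.

$$ \frac{d}{dt} X_t=u^\theta_t(X_t) $$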

Use any ODE numerical solver.
One that works well: Midpoint
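A minimal sketch of sampling with the midpoint method. To keep it self-contained, the learned network is replaced by a closed-form velocity for a 1D toy problem: source $P=\mathcal{N}(0,1)$, target $Q=\mathcal{N}(M,S^2)$, independent coupling and the linear path $X_t=(1-t)X_0+tX_1$. The values `M`, `S` and the step count are arbitrary assumptions; for this toy case the exact flow is $x_1 = M + S\,x_0$, which gives a reference to compare against.

```python
import numpy as np

M, S = 2.0, 0.5   # toy target Q = N(M, S^2); source P = N(0, 1)

def velocity(x, t):
    """Closed-form marginal velocity of the toy problem (stands in for the network)."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2          # variance of X_t
    return M + (t * S ** 2 - (1.0 - t)) / var_t * (x - t * M)

def sample_midpoint(x0, n_steps=100):
    """Integrate dX/dt = u_t(X) from t = 0 to t = 1 with the midpoint rule."""
    x = np.array(x0, dtype=np.float64)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        x_mid = x + 0.5 * h * velocity(x, t)          # half Euler step
        x = x + h * velocity(x_mid, t + 0.5 * h)      # full step with midpoint slope
    return x

x0 = np.random.default_rng(0).normal(size=5000)       # (1) sample noise from P
x1 = sample_midpoint(x0)                              # (2) follow the velocity
```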

P19

Simplest version of Flow Matching

Training flow matching

(1) Draw a source sample $X_0$ and a target sample $X_1$ at random.
(2) Sample a time $t$ uniformly from $[0, 1]$.
(3) Form $X_t$ as the linear interpolation $X_t=(1-t)X_0+tX_1$.
(4) Feed $X_t$ to the network and regress its output onto $X_1-X_0$.

$$ \mathbb{E } _{t,X_0,X_1}||u_t^\theta(X_t)-(X_1-X_0)||^2 $$

🔎 "Flow Matching for Generative Modeling" Lipman et al. (2022)
🔎 "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" Liu et al. (2022)
🔎 "Building Normalizing Flows with Stochastic Interpolants" Albergo et al. (2022)
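The four training steps can be sketched on a 1D toy problem. This is an illustration under assumptions, not a practical trainer: the network is replaced by a model that is linear in a few hand-picked polynomial features of $(x, t)$, so the regression can be solved in one shot by least squares instead of stochastic gradient descent. The distributions and feature set are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x0 = rng.normal(0.0, 1.0, size=n)        # (1) source samples X_0 ~ P
x1 = rng.normal(2.0, 0.5, size=n)        # (1) target samples X_1 ~ Q
t = rng.uniform(0.0, 1.0, size=n)        # (2) random time in [0, 1]
xt = (1.0 - t) * x0 + t * x1             # (3) linear interpolation X_t
target = x1 - x0                         # (4) regression target

def features(x, t):
    # Hypothetical feature map standing in for a neural network.
    return np.stack([np.ones_like(x), x, t, x * t, t ** 2, x * t ** 2], axis=1)

theta, *_ = np.linalg.lstsq(features(xt, t), target, rcond=None)

def u_theta(x, t):
    return features(x, t) @ theta

loss_fit = float(np.mean((u_theta(xt, t) - target) ** 2))   # fitted loss
loss_zero = float(np.mean(target ** 2))                     # baseline: predict 0
```

Note that the fitted loss does not go to zero even for a perfect model: the target $X_1-X_0$ is random given $X_t$, and the regression recovers its conditional mean.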

P20

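No restriction is placed here on the distributions of $X_0$ and $X_1$. They can be independent noise and images, or paired data with some relationship (for example black-and-white and color versions).

Why does it work?

• Build flow from conditional flows

Build a velocity or flow from simpler ones, called conditional flows. Conditional flows are simple, fixed building blocks.

• Regress conditional flows

Learn the complex part by observing only the simpler conditional flows.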

P21

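The local problem

Suppose the target distribution consists of a single point $x_1$. The flow and velocity then look as follows.

$$ X_t=\Psi _t(X_0|x_1)=(1-t)X_0+tx_1 $$

This is a conditional flow.
$p_{t|1}(x|x_1)$ is the conditional probability path.
$u_t(x|x_1)$ is the conditional velocity; along each trajectory it is the constant $x_1-X_0$.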

P22

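The global problem

The actual distribution $Q$ contains many samples like $x_1$. Each sample can serve as a condition and yields a conditional path $p_{t|1}(x|x_1)$; the marginal $p_t(x)$ is the expectation of these conditional paths over $X_1$. One can show that $p_t(x)$ starts at $P$ and ends at $Q$. Averaging the conditional velocity $u_t(x|x_1)$ over all $x_1$ in $Q$ gives the velocity that generates the marginal probability path.
$p_t(x)= \mathbb{E} _ {X_ 1}p_{t|1}(x|X_ 1)$
$u_t(x)$ is obtained in the same way.

$u_t(x)=\mathbb{E} [u_t(X_t|X_1)|X_t=x]$

This velocity field is called the marginal velocity.

P23

Theorem*: The marginal velocity generates the marginal probability path.

The expectations in the formulas above should be read as averages.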

P24

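Conditional loss

Objective: regress the marginal velocity field.

(1) Regress the marginal velocity field directly

$$ ℒ_{FM}(θ) = \mathbb{E} _{t,X_t}||u^θ_t (X_t) − u_t(X_t)||^ 2 $$

Here $u_t(X_t)$ is a mean computed over many data points (see the formula above).
(2) Regress the conditional velocity

$$ ℒ_{CFM}(θ) = \mathbb{E} _{t,X_1,X_t}||u^θ_t (X_t) − u_t(X_t|X_1)||^ 2 $$

Theorem: Losses are equivalent,

$$ \nabla_θℒ_{FM}(θ) = \nabla_θℒ_{CFM}(θ) $$

Conclusion: regressing only the conditional velocity is equivalent to regressing the marginal velocity directly.
The benefit of the conditional loss (2) over loss (1) is that it can be evaluated sample by sample, without averaging over the whole dataset.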
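A short sketch of why the two gradients coincide for the squared loss, using only the definition of the marginal velocity $u_t(x)=\mathbb{E} [u_t(X_t|X_1)|X_t=x]$ given earlier (the inner-product notation $\langle\cdot,\cdot\rangle$ is introduced here for the expansion). Expanding both squares:

$$ \begin{aligned} ℒ_{FM}(θ) &= \mathbb{E}||u^θ_t(X_t)||^2 - 2\,\mathbb{E}\langle u^θ_t(X_t), u_t(X_t)\rangle + \mathbb{E}||u_t(X_t)||^2 \\ ℒ_{CFM}(θ) &= \mathbb{E}||u^θ_t(X_t)||^2 - 2\,\mathbb{E}\langle u^θ_t(X_t), u_t(X_t|X_1)\rangle + \mathbb{E}||u_t(X_t|X_1)||^2 \end{aligned} $$

The first terms are identical and the last terms do not depend on $θ$. For the cross terms, conditioning on $X_t$ (tower property) gives

$$ \mathbb{E}\langle u^θ_t(X_t), u_t(X_t|X_1)\rangle = \mathbb{E}\langle u^θ_t(X_t), \mathbb{E}[u_t(X_t|X_1)|X_t]\rangle = \mathbb{E}\langle u^θ_t(X_t), u_t(X_t)\rangle $$

so $ℒ_{FM}$ and $ℒ_{CFM}$ differ only by a constant in $θ$ and have the same gradient.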

P25
Theorem: Losses are equivalent if $D$ is a Bregman divergence.

More generally, the same conclusion holds when the L2 loss $||\cdot - \cdot ||^2$ is replaced by any Bregman divergence $D(\cdot ,\cdot )$; the L2 loss is just one instance.

P26

This works because what is being learned is an expectation.

P27

How to choose $ψ_t(x|x_1)$?

Optimal Transport minimizes Kinetic Energy

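Above, we defined

$$ ψ _t(x|x_1)=tx_1+(1-t)x $$

This definition is motivated by minimizing kinetic energy.

Minimizing kinetic energy makes paths straight and speeds constant.
So $ψ _t(X_0|X_1)$ is defined as a point on the segment between $X_0$ and $X_1$, where $X_0$ can be any point $x$ in the space.

Optimizing the kinetic energy directly is hard because it does not depend on a specific condition. A Jensen bound is therefore used to bound the kinetic energy of the marginal velocity.

The Jensen bound is an expectation over specific conditions $(X_0,X_1)$.
Once $X_0$ and $X_1$ are fixed, the Jensen bound can be evaluated, and it can also be minimized (by optimizing $ψ _t$).

Conclusion: the Jensen bound is minimized when $ψ _t(x|x_1)$ is defined as $tx_1+(1-t)x$, in which case the conditional path from $X_0$ to $X_1$ is a straight line.

Summary of the linear conditional flow:
• Minimizes the bound, rather than the kinetic energy itself.
• Reduces Kinetic Energy of initial coupling
Substituting $ψ _t$ into the Jensen bound gives this result.

• Exact Optimal Transport for single data points
If $Q$ contains a single point $X_1$, the two sides of the bound are equal and the flow is the optimal transport.

• Not Optimal Transport (but in high dim straighter)
If $Q$ contains more than one point, the flow is not the optimal transport, and the trajectories from $X_0$ to $X_1$ are no longer straight lines.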

🔎 "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" Liu et al. (2022)
🔎 "On Kinetic Optimal Probability Paths for Generative Models" Shaul et al. (2023)

P29
With a good optimal transport, a sample can be drawn with a single Euler step.

$$ \frac{d}{dt} \Psi _t(x)=u_t(\Psi _t(x)) $$

$$ ℒ_{CFM}(θ) = \mathbb{E}D(u^θ_t (X_t),u_t(X_t|X_1)) $$

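$D$ is a Bregman divergence, of which the L2 loss is one instance. With $ψ _t$ defined as above, for a specific pair $X_0$, $X_1$ the conditional velocity along the conditional path is $X_1-X_0$. Substituting the L2 loss and this conditional velocity gives:

$$ ℒ_{CFM}(θ) = \mathbb{E}||u^θ_t (X_t)-(X_1-X_0)||^ 2 $$

So the simple algorithm above is one instance of flow matching, with a specific conditional flow and a specific loss.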

P30

Affine paths

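In the methods above, $ψ_t(x|x_1)$ is a linear combination of $x$ and $x_1$, which is only one possible choice. Now assume it is an affine combination.

In this case the path from $X_0$ to $X_1$ is no longer a straight line.
This leads to different ways of parameterizing the velocity, for example:
(1) $u_t(x)=\frac{d\psi_t}{dt}$: predict the velocity directly.

(2) Source prediction: parameterize the velocity through the conditional expectation of $X_0$. Predict $X_0$, then convert it into the velocity at $x$.

(3) Target prediction works analogously: predict $X_1$, then convert it into the velocity at $x$.

Different definitions of $\alpha _t$ and $\sigma _t$ give different $a_t,b_t,c_t,d_t$.
The expectation terms in the formulas above are what the network predicts. The predicted quantity differs, but the goal is always to obtain the velocity at $x$.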
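The conversions between the three parameterizations can be sketched in a few lines. The sketch assumes the affine path $X_t=\alpha_t X_1+\sigma_t X_0$, so that $x=\alpha_t\hat{x}_1+\sigma_t\hat{x}_0$ and $u_t(x)=\dot{\alpha}_t\hat{x}_1+\dot{\sigma}_t\hat{x}_0$, where $\hat{x}_0,\hat{x}_1$ denote the predicted conditional expectations. The function names are made up for this example, and the coefficients that appear are derived from these two identities rather than copied from the slide's $a_t,b_t,c_t,d_t$ table.

```python
def velocity_from_target(x, x1_hat, alpha, d_alpha, sigma, d_sigma):
    """Velocity from a target (X_1) prediction: eliminate x0_hat = (x - alpha*x1_hat)/sigma."""
    return (d_sigma / sigma) * x + (d_alpha - alpha * d_sigma / sigma) * x1_hat

def velocity_from_source(x, x0_hat, alpha, d_alpha, sigma, d_sigma):
    """Velocity from a source (X_0) prediction: eliminate x1_hat = (x - sigma*x0_hat)/alpha."""
    return (d_alpha / alpha) * x + (d_sigma - sigma * d_alpha / alpha) * x0_hat

# Check with the linear scheduler alpha_t = t, sigma_t = 1 - t, where the
# velocity along a single conditional path must equal x1 - x0.
t, x0, x1 = 0.3, -1.0, 2.0
alpha, d_alpha, sigma, d_sigma = t, 1.0, 1.0 - t, -1.0
x = alpha * x1 + sigma * x0
u_from_target = velocity_from_target(x, x1, alpha, d_alpha, sigma, d_sigma)
u_from_source = velocity_from_source(x, x0, alpha, d_alpha, sigma, d_sigma)
```

Both conversions divide by $\sigma_t$ or $\alpha_t$, which is one way to see why some parameterizations become ill-conditioned near the ends of the time interval.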

P31

Gaussian paths

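So far no assumptions have been made about the source distribution $P$ or the target distribution $Q$.
If $P$ is assumed to be Gaussian and $P$ and $Q$ are independent, the process coincides with the ODE formulation of diffusion.

$$ p(x) = 𝒩(x |0 , I) \quad π_{0,1}(x_0, x_1) = p(x_0)q(x_1) $$

The noise prediction used in diffusion has a singularity when $x$ is close to pure noise (the initial steps).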

P32 　

Affine and Gaussian paths

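Comparison of parameterizations

The blue part applies to all affine paths (including Gaussian paths). The pink part applies only to Gaussian paths.
[❓] How should this table be read?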

P33

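Relationship between flow matching and deterministic diffusion:
1. Diffusion generates its probability path by defining a forward process and then reversing it.
Flow matching generates its probability path by aggregating all the known conditional probability paths.
2. Diffusion constructs a forward process and needs a closed-form conditional probability derived from it, so it requires $P$ to be Gaussian and $P$ and $Q$ to be independent.
Flow matching has no such restriction: $P$ and $Q$ can be arbitrary distributions.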


P34

Flow Matching Advanced Designs

P35

1. Conditional generation
2. Settings where the distributions $P$ and $Q$ are coupled
3. Building generative models on geometric domains with flow matching

P37

Conditioning and Guidance

Problem definition:
Dataset: samples + labels
Generation: given a label, sample from the distribution of data with that label

P39

Conditional Models

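Formulation

$$ p_ {t,1|Y} (x, x_1|y) = p_ {t|1}(x|x_1)q(x_1|y) $$


Unconditional	Conditional
Marginal probability path
Marginal velocity

The conditional probability path is constructed so that it does not depend explicitly on the condition $Y$.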

P40

Network training

Train same neural network on all conditions:

For training, the only change is that the network input gains an extra dimension representing $Y$.

P41

Examples

🔎 “Flow Matching for Generative Modeling” Lipman et al. (2022)
🔎 “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” Nichol et al. (2021)

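Limitations

This approach works well when every condition has plenty of data, for example when the condition is a class label.
It does not suit text conditions, because in a dataset one piece of text usually corresponds to a single image.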

P42

Condition as Guidance

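Score matching and diffusion

Classifier guidance: turn an unconditional model into a conditional one by introducing a classifier.

CFG (classifier-free guidance): extrapolate between the conditional and the unconditional generation results.

🔎 CFG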

P43

Flow Matching with Gaussian Paths

Assume a velocity field trained with Gaussian paths. The formulas above, which come from score matching, carry over to flow matching.
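Applied to velocities, the extrapolation can be sketched as below. The weighting convention (`w = 1` recovers the conditional velocity, `w > 1` extrapolates away from the unconditional one) is one common choice and is an assumption here; papers differ in how they define the guidance weight. The two input arrays stand in for the network evaluated with and without the condition.

```python
import numpy as np

def guided_velocity(u_cond, u_uncond, w):
    """Extrapolate from the unconditional velocity toward the conditional one."""
    return u_uncond + w * (u_cond - u_uncond)

u_cond = np.array([2.0, 0.0])     # placeholder for u_theta(x, t, y)
u_uncond = np.array([1.0, 1.0])   # placeholder for u_theta(x, t, no condition)
guided = guided_velocity(u_cond, u_uncond, w=2.0)
```

The guided velocity is then used in place of the model velocity inside the ODE solver.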

P44
Related work:
🔎 "Guided Flows for Generative Modeling and Decision Making" Zheng et al. (2023)
🔎 "Mosaic-SDF for 3D Generative Models" Yariv et al. (2023)
🔎 "Audiobox: Unified Audio Generation with Natural Language Prompts" Vyas et al. (2023)
🔎 "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" Esser et al. (2024)
🔎 "Movie Gen: A Cast of Media Foundation Models" Polyak et al. (2024)

P45

Among these, Movie Gen reports that the flow matching loss outperforms the diffusion loss in both generation quality and text alignment.

P46

Non-Gaussian paths

Open Problem
How to guide FM with non-Gaussian paths?

CFG requires the flow matching model to be trained with Gaussian paths, but flow matching is not restricted to Gaussian sources.

P52

Data Couplings

The work above assumes that $P$ and $Q$ are independent. What about dependent couplings?



• Non-Gaussian source distribution • Alternative conditioning approach • Inverse problems	• Applications to Optimal Transport • Efficiency: straighter trajectories

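Two approaches use the coupling between $P$ and $Q$ to improve generation:
1. Use the coupling to build an alternative conditioning approach, applied to inverse problems.
2. Look for couplings across multiple samples to improve sampling efficiency.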

P58

Paired Data

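Problem definition

Method

Alter source distribution and coupling instead of adding condition

Draw a sample $(X_1,Y)$ from the data.
[❓] What is the difference between $X_1$ and $Y$?
Construct $X_0$ from $Y$:

$$ X_0=Y+\epsilon \sim p $$

The source is not pure noise but $Y$ with added noise; the loss is unchanged.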
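A minimal sketch of how one training batch could be built under this scheme. It assumes, purely for illustration, that $X_1$ is a clean color value and $Y$ is a degraded observation of it (its grayscale version), in the spirit of the black-and-white/color example mentioned earlier; the noise scale `eps_std` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 8
x1 = rng.uniform(0.0, 1.0, size=(batch, 3))                  # clean target: RGB values
y = x1.mean(axis=1, keepdims=True).repeat(3, axis=1)         # paired observation: grayscale

eps_std = 0.1
x0 = y + eps_std * rng.normal(size=y.shape)                  # source: X_0 = Y + eps

t = rng.uniform(0.0, 1.0, size=(batch, 1))
xt = (1.0 - t) * x0 + t * x1                                 # same interpolation as before
target = x1 - x0                                             # same regression target as before
```

Only the way $X_0$ is drawn changes; the interpolation and the regression target are the same as in the unconditional algorithm.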

P61

Result

P63

Multisample Couplings

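Problem definition

Given uncoupled source and target distributions, can we build a coupling to induce straighter paths?

Given a pretrained flow matching model, build a coupling that makes the paths from $P$ to $Q$ straighter, so that $Q$ can be sampled more efficiently.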

P64

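The essence of a coupling

The coupling cost bounds the kinetic energy, so lowering the coupling cost lowers the kinetic energy.

Different couplings give different $u_t$ and different kinetic energies, but the energy has an upper bound, and lowering that bound reduces the kinetic energy.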

Marginal $u_t$ with cond-OT FM and $π_{0,1}$

P69

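Method

Use mini batch optimal transport couplings

Draw $k$ random points from each of the distributions $P$ and $Q$.

Find the optimal permutation between the two point sets, i.e. the one that minimizes the cost.

Given the optimal matching, pick one matched pair at random.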
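The three steps can be sketched for the special case of 1D data with squared-distance cost, where the optimal permutation is simply obtained by sorting both point sets. This shortcut is an assumption that only holds in 1D; in higher dimensions an assignment solver (for example the Hungarian algorithm) would be needed. Batch size and distributions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 64
x0 = rng.normal(0.0, 1.0, size=k)        # k points from P
x1 = rng.normal(2.0, 0.5, size=k)        # k points from Q

# Independent coupling: pair the points in the order they were drawn.
cost_independent = float(np.mean((x1 - x0) ** 2))

# 1D optimal coupling for squared cost: match sorted x0 with sorted x1.
x0_ot = x0[np.argsort(x0)]
x1_ot = x1[np.argsort(x1)]
cost_ot = float(np.mean((x1_ot - x0_ot) ** 2))

# Pick one matched pair at random to use as (X_0, X_1) in the training loss.
j = int(rng.integers(k))
pair = (x0_ot[j], x1_ot[j])
```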

P70
$$ \mathrm{When} \quad k = 1 → π_{0,1} = p(X_0)q(X_1) $$

When $k=1$, this is equivalent to $P$ and $Q$ being independent.

P71

When $k → ∞, u_t$ generates the Optimal Transport map

P72

Result

High dimensions: minor improvement in sampling speed compared to tailored samplers.

In high dimensions the paths are already close to straight, so the gain is small.

Shows promise in lower dimensional problems for scientific applications (e.g. protein backbone design [Bose et al.'23]).

In low dimensions this method clearly reduces the cost.

P73

Geometric Flow Matching

Generating geometry (with symmetries, or on Riemannian manifolds) with Flow Matching


Data with Symmetries	Riemannian Manifolds

• Equivariant flows → invariant densities • Alignment couplings	• Simulation free on simple manifolds • General geometries

P87

Data with Symmetries

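Problem definition

Some objects have symmetries, and generated objects should respect them as well.

Intuition and formulation of symmetry

The original $P$, $Q$ and their symmetry-transformed versions should have the same density or likelihood.
The marginal probability path should be symmetric as well, i.e. unchanged under the transformation.

$$ p_t(g\cdot x)=p_t(x) $$

Equivariance is a term from group theory in mathematics; here it can be read simply as having the symmetry.
Symmetry of the marginal probability path and symmetry of the marginal velocity are equivalent.
An equivariant velocity field generates an invariant probability path and an equivariant flow.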

🔎 "Equivariant Flows: Exact Likelihood Generative Learning for Symmetric Densities" Köhler et al. (2020)

P88

Method

It therefore suffices to build a flow matching model that produces an equivariant velocity.

Equivariant Velocity

$$ u^θ_t (g⋅x) = g⋅u^θ_t(x) $$
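The equivariance condition can be checked numerically for a concrete group. The sketch below assumes $g$ is a 2D rotation and uses a hand-written radial velocity field as a stand-in for an equivariant network; both choices are illustrative.

```python
import numpy as np

def u(x, t):
    """Toy velocity: scales x by a factor that depends only on |x| and t.
    Rotations preserve |x|, so this field is rotation-equivariant."""
    r = np.linalg.norm(x, axis=-1, keepdims=True)
    return (t - 1.0) * x / (1.0 + r ** 2)

def rotation(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 2))       # batch of 2D points (row vectors)
g = rotation(0.7)

lhs = u(x @ g.T, 0.3)              # u_t(g . x)
rhs = u(x, 0.3) @ g.T              # g . u_t(x)
```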

Train with CFM:

🔎 "Equivariant flow matching" Klein et al. (2023)
🔎 "Equivariant Flow Matching with Hybrid Probability Transport" Song et al. (2023)

P89

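Remaining problem

The data has symmetries.

If the symmetry of the data is ignored and $P$ and $Q$ are still assumed to be independent, the following happens.

P90

The model learns curved trajectories, which reduces sampling efficiency.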

P91

Solution

🔎 "Equivariant flow matching" Klein et al. (2023)
🔎 "Equivariant Flow Matching with Hybrid Probability Transport" Song et al. (2023)

These two papers propose alignment couplings to solve the problem above.

P92

Result


"Fast Point Cloud Generation with Straight Flows" Wu et al. (2022)	"Equivariant Flow Matching with Hybrid Probability Transport" Song et al. (2023) "Equivariant flow matching" Klein et al. (2023)

This method applies to point clouds and molecules.

P94

Generative Modeling on Manifolds

Generate data that lives on a manifold, such as meshes, trajectories, or surfaces, rather than in the whole Euclidean space.

P95
Need to re-define the geometric structures we have in Euclidean space.

Redefine the geometric structures so that a flow matching model can be defined.
Riemannian manifolds are used as the example here.

P98

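Defining the geometric structures

🔎 Riemannian manifold

Only Riemannian manifolds are considered here:
1. A smooth manifold, i.e. differentiable, so that tangent spaces can be defined.

The tangent space at a point $x$ is the set of all directional derivatives at $x$.
2. A choice of inner product that defines the Riemannian metric, which describes angles and distances on the manifold.

P99

P100

On a Riemannian manifold, velocities are defined in the tangent space.
The velocity $v$ and the point $x$ on the manifold therefore live in different spaces; after computing $v$, the result has to be projected back onto the manifold.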

P101

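Building Riemannian Flow Matching

Flow matching on images and flow matching on a Riemannian manifold share the same data construction and training procedure. The only difference is the loss: the norm induced by the Riemannian metric replaces the L2 loss.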

• Riemannian Flow Matching loss:

P102
• Riemannian Conditional Flow Matching loss:

The result that the losses are equivalent also holds here:

$$ ∇_θℒ_{RFM}(θ) = ∇_θℒ_{RCFM}(θ) $$

P103

Conditional Flows - Simple Geometries

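The straight line of Euclidean flow matching generalizes to a geodesic here, because a geodesic is the shortest path on the manifold.

For simple manifolds (e.g. Euclidean, sphere, torus, hyperbolic), geodesics have a closed-form expression:

$$ \Psi _t(x_0|x_1)=\mathrm{exp} _{x_0}(\kappa (t)\mathrm{log} _{x_0}(x_1)),\quad t \in [0,1] $$

$$ \mathrm{Scheduler }\quad \kappa (t):\kappa (0)=0,\quad \kappa (1)=1 $$

In this case the conditional flow can be computed without simulation.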
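As a concrete case, the closed-form conditional flow on the unit sphere can be sketched with the standard exponential and logarithm maps. The linear scheduler $\kappa(t)=t$ is an assumed default, and the sketch does not handle antipodal points, where the logarithm map is not unique.

```python
import numpy as np

def log_map(x, y):
    """Tangent vector at x pointing toward y, with length equal to the geodesic distance."""
    cos_theta = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    v = y - cos_theta * x                       # component of y orthogonal to x
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return np.zeros_like(x)
    return theta * v / norm_v

def exp_map(x, v):
    """Follow the geodesic from x in direction v for length |v|."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return x.copy()
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

def conditional_flow(x0, x1, t, kappa=lambda t: t):
    return exp_map(x0, kappa(t) * log_map(x0, x1))

x0 = np.array([1.0, 0.0, 0.0])
x1 = np.array([0.0, 1.0, 0.0])
mid = conditional_flow(x0, x1, 0.5)             # halfway along the great circle
```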

P104

Conditional Flows - General Geometries

For general geometries, two problems may arise:

Geodesics can be hard to compute
Concentrate probability at boundary

This makes the conditional flow hard to compute.

P105

Choose a premetric satisfying:

Non-negative:$d(x,y) ≥ 0$.
Positive: $d(x, y) = 0$ iff $x = y$.
Non-degenerate:$∇d(x, y) ≠ 0$ iff $x ≠ y$.

Build conditional flow satisfying:

$$ d(ψ_t(x_0|x_1),x_1) = \tilde{κ}(t)d(x_0,x_1) $$

$$ \mathrm{Scheduler} \quad \tilde{κ} (t) = 1 − κ(t) $$

To solve the problems above, a new distance-like function (the premetric) is proposed.

P106

Differentiating with respect to time gives a differential equation.

🔎 "Flow Matching on General Geometries" Chen & Lipman (2023)

P107

Comparison of the new premetric with the geodesic distance.

P108

Riemannian Flow vs. Score Matching

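Advantages of flow matching
(1) Simulation-free and fast: 20x faster in the example from the slides.
(2) Solving an ODE is easier than solving an SDE.
(3) $u_t(X_t|X_1)$ is exact, whereas $\nabla \mathrm{log}$ $p_t(x|x_0)$ is an approximation.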

P109

🔎 "Riemannian Score-Based Generative Modelling" De Bortoli et al. (2022)
🔎 "Flow Matching on General Geometries" Chen & Lipman (2023)

P110

Model Adaptation

P112

You’ve trained a model. What next?

Given a pretrained model, what can be done with it?

P113

Faster Sampling

P114

Rectified Flow: faster sampling by straightening the flow

Method

$$ ℒ(θ) = \mathbb{E} _ {t,(X_0,X_1)∼π_ {0,1}^0}||u^θ_t (X_t) − (X_1 − X_0)||^2 $$

Rectified Flow refits using the pre-trained (noise, data) coupling.
Leads to straight flows.

Rectified Flow makes the flow go straight from source to target.
Step 1: train flow matching. The trained model defines a coupling between source and target, and thereby yields paired (noise, data) samples.
Step 2: continue training on this paired data.

🔎 “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow” Liu et al. (2022)

P115

P116

Result

Diffusion vs. Rectified Flow

Limitations

Enforcing straightness restricts the model. Often a slight drop in sample quality

🔎 “InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation” Liu et al. (2023)

P118

Faster sampling by self-consistency loss

Increase $h$ and build a shortcut between $X_t$ and $X_{t+h}$, similar to distillation methods for diffusion.

Principle

P119

Method

P121

Result

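Limitations

Shortcuts with $h$ >0 do not work with classifier-free guidance (CFG).
CFG weight can & must be specified before training.

Shortcut models predict the flow directly rather than the velocity. The flow is nonlinear, so results cannot be combined by weighting, and CFG cannot be applied at sampling time.
Workaround for this problem: fix the CFG weight in advance.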

🔎 “One Step Diffusion via Shortcut Models” Frans et al. (2024)

P124

Faster sampling by only modifying the solver

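Both methods above require training. This method needs no training; it only modifies the solver.

Aside: a trick concerning the scheduler ($\beta$, $\alpha _t$ and $\sigma _t$).

Can adapt pre-trained models to different schedulers.

Given a model trained with scheduler A, suppose training is to continue with scheduler B. How are the two models related?

Conclusion: the two schedulers and their flows are related through a scaling of $X$ and a reparameterization of time.
The time reparameterization matches the SNR and the scaling of the two schedulers.

Related by a scaling & time transformation:

As the figure shows, changing the scheduler changes how the flow looks, but the coupling between $X_0$ and $X_1$ stays the same.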

🔎 “Elucidating the design space of diffusion-based generative models” Karras et al. (2022)

P126

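Example of modifying the scheduler

Bespoke solvers:
Decouples model & solver.
Model is left unchanged.
Parameterize solver and optimize.

Model and solver are decoupled: the model stays fixed and only the solver is optimized.
Parameters (expressing the scheduler) are passed into the solver; optimizing them amounts to optimizing the scheduler.

Can be interpreted as finding best scheduler + more.

Solver consistency: sample quality is retained as NFE → ∞.

Because only the solver is optimized, there are two benefits:
1. Solver consistency: as the number of steps goes to infinity, the ODE is still solved accurately. In practice, the generative model is trained on dataset A, and the new scheduler parameters are then trained on dataset B.
2. The solver transfers across models (trained on different datasets, resolutions, etc.).

Bespoke solvers can transfer across different data sets and resolutions.

Limitations:

Although the solver transfers directly to another model (without retraining the generative model), the result is slightly worse than distilling (retraining) on that model.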

P127

However, does not reach distillation performance at extremely low NFEs.

P128

Inverse Problems (Training-Free)

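Inverse problems: inpainting, deblurring, super-resolution, editing.
Unlike the data coupling setting of the previous section, the goal here is to solve inverse problems with a model pretrained on completely clean data, without any retraining.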

P133

Solving inverse problems by posterior inference

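$x_1$ is the clean image and $y$ is the noisy (corrupted) image.

The unknown part (the score function) is approximated with a Gaussian.
The score function may be multimodal, but experiments show that a Gaussian alone already works reasonably well.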

P134

Limitations

Typically requires known linear corruption and Gaussian prob path.
Can randomly fail due to the heuristic sampling.

🔎 “Pseudoinverse-Guided Diffusion Models for Inverse Problems” Song et al. (2023)
🔎 “Training-free Linear Image Inverses via Flows” Pokle et al. (2024)

P135

Solving inverse problems by optimizing the source

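Observations

Don’t want to rely on likelihoods / densities.

If a pretrained generative model is used to evaluate data, the result is unreliable: it may assign low density to data from the dataset and high density to data outside it.
The reason is that high density $\ne$ high probability of being sampled.

Have observation $y$ being nonlinear in $x_1$.

$y$ is the real image and $X_1$ is the model sample; a decoder sits between $X_1$ and $y$, so their relationship is nonlinear.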

🔎 “Do Deep Generative Models Know What They Don't Know?” Nalisnick et al. (2018)

P138

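Method

Turn the inverse problem into an optimization problem.

$$ X_1=\psi (X_0) $$

$\psi $ is the pretrained generative model. Its parameters are not optimized; $X_0$ is optimized instead, which is possible because $\psi $ is a smooth, invertible, differentiable function.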
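A toy sketch of optimizing the source point. Everything here is an illustrative assumption: the "pretrained model" is a closed-form 1D velocity (source $\mathcal{N}(0,1)$, target $\mathcal{N}(M,S^2)$), the flow $\psi$ is simulated with the midpoint method, the loss is the squared distance to an observation `y`, and the gradient is taken by central finite differences instead of backpropagating through the solver.

```python
M, S = 2.0, 0.5   # toy "pretrained" flow maps N(0, 1) to N(M, S^2)

def velocity(x, t):
    var_t = (1.0 - t) ** 2 + (t * S) ** 2
    return M + (t * S ** 2 - (1.0 - t)) / var_t * (x - t * M)

def flow(x0, n_steps=50):
    """psi(x0): simulate the ODE from t = 0 to t = 1 (midpoint method)."""
    x, h = float(x0), 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        x_mid = x + 0.5 * h * velocity(x, t)
        x = x + h * velocity(x_mid, t + 0.5 * h)
    return x

def loss(x0, y):
    return (flow(x0) - y) ** 2

y = 2.6                    # observation to be matched by the generated sample
x0, lr, eps = 0.0, 0.5, 1e-4
for _ in range(200):
    # Central finite difference through the whole simulation; each gradient
    # costs two full ODE solves, which illustrates why this approach is expensive.
    grad = (loss(x0 + eps, y) - loss(x0 - eps, y)) / (2.0 * eps)
    x0 -= lr * grad

x1 = flow(x0)              # generated sample after source optimization
```

The model parameters never change; only the starting point `x0` moves until the generated sample matches the observation.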

P139

Properties and limitations

$$ \min_{x_0} L(\psi ^\theta _1(x_0)) $$

Theory: Jacobian of the flow $\nabla _{x_0}\psi ^\theta_1$ projects the gradient along the data manifold.

Intuition: Diffeomorphism enables mode hopping!

P140

Simplicity allows application in multiple domains.

Caveat: Requires multiple simulations and differentiation of $\psi ^\theta _1$.

The differentiation chain is long and the computational cost is high.

🔎 “D-Flow: Differentiating through Flows for Controlled Generation” Ben-Hamu et al. (2024)

P141

Inverse problems references

Online sampling methods inspired by posterior inference:

🔎 “Diffusion Posterior Sampling for General Noisy Inverse Problems” Chung et al. (2022)
🔎 “A Variational Perspective on Solving Inverse Problems with Diffusion Models” Mardani et al. (2023)
🔎 “Pseudoinverse-Guided Diffusion Models for Inverse Problems” Song et al. (2023)
🔎 “Training-free Linear Image Inverses via Flows” Pokle et al. (2023)
🔎 “Practical and Asymptotically Exact Conditional Sampling in Diffusion Models” Wu et al. (2023)
🔎 “Monte Carlo guided Diffusion for Bayesian linear inverse problems” Cardoso et al. (2023)

Source point optimization:

🔎 “Differentiable Gaussianization Layers for Inverse Problems Regularized by Deep Generative Models" Li (2021)
🔎 “End-to-End Diffusion Latent Optimization Improves Classifier Guidance” Wallace et al. (2023)
🔎 “D-Flow: Differentiating through Flows for Controlled Generation” Ben-Hamu et al. (2024)

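Approach 1 (online sampling): modify the sampling procedure to approach the target step by step. Most of these methods are inspired by some form of posterior inference and can trade off accuracy against efficiency.
Approach 2 (source point optimization): simple but expensive.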

P144

Reward Fine-tuning

Data-driven and reward-driven fine-tuning



A lot of focus put into data set curation through human filtering.	Can use human preference models or text-to-image alignment.

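Data-driven fine-tuning relies on a carefully curated dataset.
Reward-driven fine-tuning adds no training data; instead, a reward is assigned to the model output, and the goal of fine-tuning is to generate samples with high reward.
Only the latter is covered here.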

P145

Reward fine-tuning by gradient descent

Initializing with a pre-trained flow model $p^\theta$：

$$ \max_{\theta } \mathbb{E} _{X_1\sim p^\theta }[r(X_1)] $$

Optimize the reward model with RL [Black et al. 2023]
or direct gradients [Xu et al. 2023, Clark et al. 2024]

P146
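Advantages:
Different reward models can be combined to obtain a combined effect.

Limitations:
Requires using LoRA to heuristically stay close to the original model.
Still relatively easy to over-optimize reward models; “reward hacking”.

This approach has no ground truth, so the generated results can overfit the reward model. LoRA is therefore needed.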

🔎 “Training diffusion models with reinforcement learning” Black et al. (2023)
🔎 “Imagereward: Learning and evaluating human preferences for text-to-image generation.” Xu et al. (2023)
🔎 “Directly fine-tuning diffusion models on differentiable rewards.” Clark et al. (2024)

P149

Reward fine-tuning by stochastic optimal control

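Method 1: RLHF

Compared with direct optimization, RLHF tilts the pretrained distribution toward one that obtains higher reward.

Regularization: the fine-tuned model distribution should stay close to the pretrained one. A common way is to add a KL term, shown in blue in the formula below. That is not what is done here, because what is optimized is not the probability path but a quantity tied to $X_0$.
Formula (3) is used instead, which introduces a value function bias.
The value function bias is the expectation over all possible $X_1$ given $X=X_0$.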

P150
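Principle:

Intuition: Both initial noise $p(X_0)$ and the model $u_t^{base}$ affect $p^{base}(X_1)$.

The distribution at any time is shaped jointly by the noise distribution and the model: even for the same pretrained model, changing the noise distribution changes the distribution of $X_1$.
Since $X_1$ depends on both the model and the noise distribution, RLHF optimizes both factors.

[Uehara et al. 2024] (i.e. RLHF) proposes to learn the optimal source distribution $p^\ast (X_0)$.

Method 2: Adjoint Matching

Alternatively, change the sampling scheme so that $X_0$ and $X_1$ are independent. The value function is then a constant.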

[Domingo-Enrich et al. 2024] proposes to remove the dependency between $X_0, X_1$.

$$ p^\ast (X_{(0,1)})=p^{base}(X_{(0,1)})\mathrm{exp} (r(X_1)+const.)\Rightarrow p^\ast (X_1)\propto p^{base}(X_1)\mathrm{exp} (r(X_1)) $$
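The right-hand side of this formula, $p^\ast (X_1)\propto p^{base}(X_1)\mathrm{exp} (r(X_1))$, can be illustrated on a discrete toy distribution. The three outcomes, their base probabilities, and their rewards are made-up numbers for this example.

```python
import numpy as np

p_base = np.array([0.5, 0.3, 0.2])    # base model probabilities of three outcomes
reward = np.array([0.0, 1.0, 2.0])    # reward r(x) of each outcome

weights = p_base * np.exp(reward)     # unnormalized tilted distribution
p_star = weights / weights.sum()      # p*(x) proportional to p_base(x) * exp(r(x))

expected_reward_base = float((p_base * reward).sum())
expected_reward_star = float((p_star * reward).sum())
```

The tilted distribution keeps the support of the base model but shifts mass toward high-reward outcomes, which is the target that reward fine-tuning aims for.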

🔎 “Fine-tuning of continuous-time diffusion models as entropy regularized control” Uehara et al. (2024)

P151

🔎 “Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control” Domingo-Enrich et al. (2024)

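Main points of this paper:
1. After training flow matching on real images, ODE sampling produces realistic outputs.
2. If the ODE is replaced by a memoryless SDE (forcing $X_0$ and $X_1$ to be independent), the early sampling steps bring almost no benefit, because $X$ is mostly noise at that stage. As a result, samples from the SDE do not follow the pretrained distribution.
3. Point 2 is applied during fine-tuning: fine-tuning samples with the SDE rather than with the flow (ODE).
4. After fine-tuning, the SDE can be switched back to the ODE.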

P152

Reward fine-tuning summary

Gradient-based optimization:

🔎 “DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models” Fan et al. (2023)
🔎 “Training diffusion models with reinforcement learning” Black et al. (2023)
🔎 “Imagereward: Learning and evaluating human preferences for text-to-image generation.” Xu et al. (2023)
🔎 “Directly fine-tuning diffusion models on differentiable rewards.” Clark et al. (2024)

Stochastic optimal control:

🔎 “Fine-tuning of continuous-time diffusion models as entropy regularized control” Uehara et al. (2024)
🔎 “Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control” Domingo-Enrich et al. (2024)


P153

Generator Matching and Discrete Flows

P155

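This section is fairly abstract. It aims to offer food for thought and to show what else the framework can be used for.

Continuous Time Markov Processes

Flow: a specific "macroscopic stochastic process" that smoothly transforms the source into the target.
This process is called a continuous-time Markov process (CTMP). The state space can be continuous or discrete.

A CTMC (continuous-time Markov chain) is an example of such a process on a discrete space: all states come from a discrete set.


	Continuous time	Discrete time
Continuous space	flow, score matching	diffusion
Discrete space	CTMC

The state transition is described by a transition kernel: it takes the current state as input and outputs the probability distribution of the next state; sampling from that distribution gives the next state.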

P156

Generator

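To realize flow matching with discrete state transitions, the key is to find a linear description of the transition kernel.
The velocity is the key to linearity.
The time derivative of the transition kernel is called the generator.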
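For a discrete state space this can be made concrete: the generator is a rate matrix, and for a small time step $h$ the transition kernel is approximately the identity plus $h$ times the corresponding row of that matrix. The 3-state rate matrix below is a made-up example, and the first-order (Euler) approximation is an assumption that only holds for small $h$.

```python
import numpy as np

# Hypothetical rate matrix (generator) of a 3-state CTMC.
# Off-diagonal entries are jump rates; each row sums to zero.
rate = np.array([
    [-1.0,  1.0,  0.0],
    [ 0.5, -1.0,  0.5],
    [ 0.0,  1.0, -1.0],
])

def transition_kernel(state, h):
    """First-order approximation of p(X_{t+h} = . | X_t = state)."""
    probs = np.zeros(rate.shape[0])
    probs[state] = 1.0                 # identity part: stay where you are
    return probs + h * rate[state]     # plus h times the generator row

def step(state, h, rng):
    """Sample the next state from the transition kernel."""
    return int(rng.choice(rate.shape[0], p=transition_kernel(state, h)))

rng = np.random.default_rng(0)
probs = transition_kernel(1, 0.1)      # distribution of the next state from state 1
trajectory = [0]
for _ in range(100):
    trajectory.append(step(trajectory[-1], 0.1, rng))
```

The kernel is linear in the generator, which is the property that the velocity provided in the continuous case.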

Generalize the notion of velocity to arbitrary CTMP

🔎 "Generator Matching: Generative modeling with arbitrary Markov processes" Holderrieth et al. (2024)

P157

CTMP via generator

Just as a velocity is taken and used to define a flow, a generator is used to define the trajectories of a continuous-time process.

P158

The training objective is still for the marginal probability path to start at the distribution $p$ and end at the distribution $Q$.

P163

Building generator from conditional generators

Repeating the Kata from flows……

P164

The construction also generalizes from a single simple condition to the whole dataset, and the earlier conclusions still apply.

🔎 "Generator Matching: Generative modeling with arbitrary Markov processes" Holderrieth et al. (2024)

P165

Discrete Flow Matching

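What is described here is a general method, independent of any particular application.

$u_t$ is a huge transition matrix.
The colored dots represent the probability mass function, the discrete counterpart of the probability density used earlier.

🔎 “Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design” Campbell et al. (2024)
🔎 “Discrete Flow Matching” Gat et al. (2024)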

P166

Factorized velocities

Similar to continuous case $𝒮 = ℝ^d$ :

$$ u_t(x) = [u^1_t (x),…, u^d_t (x)] $$

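This is infeasible, however, when the state space is too large. The remedy is to factorize the velocity, so that each update changes only a single value along a single dimension.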

🔎 “A Continuous Time Framework for Discrete Denoising Models” Campbell et al. (2022)

P167

Build (factorized) velocities

🔎 “Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design” Campbell et al. (2024)
🔎 “Discrete Flow Matching” Gat et al. (2024)

P168

Discrete Flow Matching Loss

$$ ℒ _ {CDFM}(\theta )=\mathbb{E} _ {t,X_1,X_t} \sum _ {i}^{} D_{X_t}(\frac{1}{1-t}\delta (\cdot ,X_1^i),u_t^{\theta,i}(\cdot ,X_t))
$$

🔎 “Discrete Flow Matching” Gat et al. (2024)
🔎 "Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective” Shaul et al. (2024)
🔎 “Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution” Lou et al. (2024)

P169

Example: code generation model (1.7B)

🔎 “Discrete Flow Matching” Gat et al. (2024)

P171

OPEN PROBLEMS FOR DISCRETE FLOWS

How to go beyond the factorized velocity?
Better sampling?
How to explore the (huge) design space?

Design choices:

Process
Marginal Path
Corrector steps
Models superposition

P172

Flow Matching blueprint


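Diffusion Model chapter map

Fundamentals
- DDPM
  - Principle: Forward/Reverse
  - Training and inference
  - Mathematical principles
- Score-based
  - From DDPM to SDE
  - Describing the DDPM forward and reverse processes with an SDE
  - Fitting the score function
  - From SDE to ODE
- Pipeline
  - Conditional generation
  - Loss function
  - Sampling
T2I base models
Image applications built on T2I base models
- Controlled image generation
- Image editing
- Subject-specific image generation
- Generation of multiple specific subjects
3D applications built on T2I base models
- 3D representation
- 3D generation
- Novel view synthesis
- 3D reconstruction
- 3D editing
Video generation built on T2I base models (see the separate series)
Algorithmic techniques
- Acceleration
- Improving quality
- Improving stability

Reference material

[CVPR #18546] Denoising Diffusion Models: A Generative Learning Big Bang (video, GitHub)
Hung-yi Lee's diffusion model course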

P12

How does a diffusion model work?

P13

Denoising diffusion models consist of two processes:

Forward diffusion process that gradually adds noise to input
Reverse denoising process that learns to generate data by denoising

P14

Forward Diffusion Process

The formal definition of the forward process in T steps:

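Intuition

The actual noising process is not simply image + noise.

Mathematical view

✅ Going from the original image to the final pure noise is really a change of distribution.

Each step scales the sample down so that the mean approaches 0, and injects noise so that the variance approaches 1. The original distribution thus gradually approaches $\mathcal{N} (0,1 )$.

❓ What is the joint distribution used for?

Practical view

✅ In practice, given an image $\mathbf{x}_0$, obtaining the $t$-th noised image does not require stepping from $\mathbf{x} _{t-1}$ to $\mathbf{x} _{t}$ with $q(x_t|x_{t-1})$ one step at a time; any $\mathbf{x}_t$ can be generated directly from $\mathbf{x}_0$.

It can be proved that stepping from $\mathbf{x}_0$ to $\mathbf{x}_t$ and jumping directly from $\mathbf{x}_0$ to $\mathbf{x}_t$ are equivalent: both give the same distribution of $\mathbf{x}_t$.

From the formula $\mathbf{x} _t=\sqrt{\bar{a} _t} \mathbf{x} _0+\sqrt{(1-\bar{a} _t) } \varepsilon $, when $\bar{a} _T → 0$ the mean of $q(x_T)$ tends to 0 and its variance tends to 1, so it becomes pure Gaussian noise.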
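The equivalence between stepping and jumping can be checked numerically. The sketch below assumes a linear $\beta$ schedule from 1e-4 to 0.02 over $T=1000$ steps (a common choice, not stated in these notes); `q_sample_direct` and `q_sample_stepwise` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # \bar{a}_t in the text

def q_sample_direct(x0, t, eps):
    """Jump straight from x_0 to x_t (t is 1-indexed)."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def q_sample_stepwise(x0, t, rng):
    """Apply q(x_s | x_{s-1}) for s = 1..t, one step at a time."""
    x = x0
    for s in range(t):
        x = np.sqrt(alphas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

# Many copies of the same "pixel value" so that statistics can be compared.
x0 = np.full(20000, 3.0)
t = 300
x_direct = q_sample_direct(x0, t, rng.standard_normal(x0.shape))
x_step = q_sample_stepwise(x0, t, rng)
```

Both routes should give samples with matching mean and standard deviation, and `alpha_bar[-1]` is close to 0 under this schedule, which is why $\mathbf{x}_T$ is essentially pure noise.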

P16

Further understanding

So far, we discussed the diffusion kernel $q(\mathbf{x} _t|\mathbf{x} _0)$ but what about $q(\mathbf{x}_t)$?

The diffusion kernel is Gaussian convolution.

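✅ Convolution is a signal-smoothing operation.
✅ $q(\mathbf{x} _ t|\mathbf{x} _ 0)$ is a Gaussian (not a standard one, since its mean depends on $\mathbf{x} _ 0$), so $q(\mathbf{x} _ t)$ is an average over the real data weighted by Gaussians.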

We can sample $\mathbf{x}_t \sim q(\mathbf{x}_t)$ by first sampling $\mathbf{x}_0$ and then sampling $\mathbf{x}_t \sim q(\mathbf{x}_t|\mathbf{x}_0)$ (i.e., ancestral sampling).

✅ In practice, the true distribution $q(\mathbf{x}_t)$ is not available at any time step; only samples from these distributions are.

Reverse Denoising Process

P17

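Intuition

Denoise is a network module that learns the denoising step at every time step.

✅ Noising $\mathbf{x}_0$ into init-noise and then recovering that same $\mathbf{x}_0$ from the init-noise is not feasible.
✅ The reason: by $\mathbf{x} _t=\sqrt{\bar{a} _t} \mathbf{x} _0+\sqrt{(1-\bar{a} _t) } \varepsilon $ with $\bar{a} _T → 0$, after $T$ noising steps $\mathbf{x} _T\approx \varepsilon $. Since $\varepsilon $ is noise that has nothing to do with $\mathbf{x} _ 0$, $\mathbf{x} _ 0$ cannot be recovered from it.

Mathematical view

Going from $\mathbf{x}_T$ to $\mathbf{x}_0$ is also a change of distribution: from $\mathcal{N}(\mathbf{x}_T;\mathbf{0,I})$ to the real data distribution.

Unlike the forward process, $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ has no exact closed-form expression.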

Can we approximate $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$? Yes, we can use a Normal distribution if $\beta _t$ is small in each forward diffusion step.

✅ A Normal distribution here means a Gaussian with a specific mean and variance, not necessarily the standard Gaussian.

P18

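Assume that $p(\mathbf{x} _ T)$ and $p(\mathbf{x}_{t-1}|\mathbf{x}_t)$ follow the distributions above.
Sample $x_T$ from the first distribution, plug it into the second to sample $x_{T-1}$, and repeat until $x_0$ is sampled.

Because the screenshots above come from different sources, $p$ and $q$ are used inconsistently; take care to distinguish them.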

P19

Learning Denoising Model

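✅ The above are the formulas of the denoising model; they are explained in detail below.

P20

Training and inference

Use the forward process to add noise to real data, constructing paired data.
Use the Denoise module to learn the denoising distribution and carry out the denoising process.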

P21

Implementation Considerations

Diffusion models often use U-Net architectures with ResNet blocks and self-attention layers to represent $\epsilon _\theta (\mathbf{x}_t,t)$.

Time representation: sinusoidal positional embeddings or random Fourier features.

Time features are fed to the residual blocks using either simple spatial addition or using adaptive group normalization layers. (see Dhariwal and Nichol NeurIPS 2021).
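As an illustration of the sinusoidal time representation mentioned above, here is a minimal sketch. The function name, the `max_period` value of 10000 and the sin-then-cos layout are assumptions borrowed from the usual Transformer positional encoding, not details given in these notes; `dim` is assumed to be even.

```python
import numpy as np

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Map time steps t (shape [N]) to embeddings of shape [N, dim]."""
    half = dim // 2
    # Geometric progression of frequencies from 1 down to about 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = sinusoidal_embedding(np.array([0, 10, 999]), dim=128)
```

The resulting vector is what would then be added to, or used to modulate the normalization of, the residual-block features.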

✅ How is $\sigma $ defined?

Mathematical principles

P10

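The common goal of generative models

The goal is to learn a distribution

In essence, a generative model learns the distribution of real data, together with a mapping from some known distribution (usually a normal distribution) to that data distribution.

✅ In practice a condition is also added, but this makes no essential difference, so conditions are ignored in the derivations below.

P11

Defining the objective

Minimizing KL divergence as the objective

The goal is to make the distribution of generated data as close as possible to the distribution of real data, but how can the closeness of two distributions be measured?

✅ KL divergence is commonly used to measure the distance between the predicted distribution and the ground-truth (GT) distribution.

Maximum likelihood estimation as the objective

$P_{data}$ is the real distribution; the $x$ sampled from it form the training set.
$x^i$ is one item of the dataset, i.e. one sample from the real data distribution. $P_\theta (x^i)$ is the probability that $P_\theta$ generates $x^i$.

✅ Because $P_\theta$ is very complex, this probability cannot actually be computed; here it is assumed that $P_\theta (x^i)$ is known.

The objective can then be defined as: find the $\theta $ that maximizes the probability of generating the real $x^i$.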

\begin{align*} \theta ^\ast =\text{arg } \max_{\theta } \prod_{i=1}^{m} P_\theta (x^i) \end{align*}

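The two objectives are equivalent

It can be shown mathematically that the two objectives above are essentially the same. The proof is as follows:

P12

Maximum Likelihood = Minimize KL Divergence

✅ Conclusion: maximizing the probability of the real data and making the two distributions as close as possible are mathematically equivalent.
✅ Generative models such as VAEs, diffusion models and flow-based models all maximize likelihood. GANs minimize the JS divergence instead.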
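For reference, the standard argument behind this equivalence can be written out as follows. This is a sketch of the usual textbook derivation, not a transcription of the missing slide; it assumes the samples $x^i$ are drawn i.i.d. from $P_{data}$, so that the sum over samples approximates an expectation, and it uses the fact that the entropy term of $P_{data}$ does not depend on $\theta$.

\begin{align*}
\theta ^\ast &= \arg\max_{\theta } \prod_{i=1}^{m} P_\theta (x^i) = \arg\max_{\theta } \sum_{i=1}^{m} \log P_\theta (x^i) \\
&\approx \arg\max_{\theta } E_{x\sim P_{data}}[\log P_\theta (x)] \\
&= \arg\max_{\theta } \left( \int_x P_{data}(x)\log P_\theta (x)\,dx - \int_x P_{data}(x)\log P_{data}(x)\,dx \right) \\
&= \arg\min_{\theta } \int_x P_{data}(x)\log \frac{P_{data}(x)}{P_\theta (x)}\,dx = \arg\min_{\theta } KL(P_{data}||P_\theta )
\end{align*}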

P13

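Compute $P_\theta(x)$

Common tricks for computing $P_\theta(x)$

✅ VAEs and diffusion models are very similar; many formulas are shared.

Trick 1: predict the mean of the distribution of outputs rather than the output itself

✅ $G(z)$ does not denote a particular generated sample but the mean of a Gaussian; the probability of $x$ under that Gaussian is then computed.

P14

Trick 2: instead of $P_\theta(x)$, compute a lower bound of $\log P(x)$

✅ $P(x)$ usually cannot be maximized directly; a lower bound of $\log P(x)$ is maximized instead.
✅ The parameter $ \theta$ is omitted in the derivation above.

P15

DDPM: Compute $P_\theta(x)$

For a diffusion model, each denoising step is assumed to output the mean of a Gaussian.

❓ Q: Why assume that $G(x_t)$ is the mean of a Gaussian?
✅ A: Other assumptions have been tried without improvement, and a Gaussian is convenient to compute with.

By the chain rule, the probability of $x_0$ under the final distribution is: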

$$ P_ \theta (x_0)=\int\limits _ {x_1:x_T}^{} P(x_T)P_ \theta (x_{T-1}|x_T) \dots P_ \theta (x_ {t-1}|x_t) \dots P_ \theta(x_0|x_1)dx_1:x_T
$$

P16

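DDPM: Lower bound of $\log P(x)$

Computing the lower bound of $\log P(x)$

Computing $q(x_t|x_{t-1})$

P17

✅ A set of $\beta $ values is fixed in advance; it sets how much noise is added at each step.
✅ $q(x_t|x_{t-1})$ is still Gaussian, with mean $\sqrt{1-\beta _t} \cdot x_{t-1}$ and variance $\beta _t$.

Computing $q(x_t|x_{0})$

P18

P19

✅ Because the two sampled noises are independent and identically distributed, their sum in this form also follows a specific Gaussian distribution.

P20

✅ Conclusion: $q(x_t|x_{0})$ is also Gaussian, with mean $\sqrt{\bar{\alpha }_t} \cdot x_0$ and variance ${1-\bar{\alpha }_t}$.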
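The steps whose slides are missing here can be sketched as follows, assuming the usual definitions $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$ (these notes do not spell them out). The third line merges two independent zero-mean Gaussian noises, whose variances add: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$.

\begin{align*}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\varepsilon_t \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\varepsilon_{t-1} + \sqrt{1-\alpha_t}\,\varepsilon_t \\
&= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\varepsilon \\
&= \dots = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon
\end{align*}

Reading off the last line, $x_t$ given $x_0$ is Gaussian with mean $\sqrt{\bar{\alpha}_t}\,x_0$ and variance $1-\bar{\alpha}_t$.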

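Defining the loss function

How should the loss function be defined so that $\log P_{\theta}(x_0)$ is maximized?

Loss function vs. objective function

The objective function is the optimization target derived from what we actually want. The loss function is a function that steers learning toward that target state; it need not have a direct interpretation and need not equal the objective function.
So although the objective is well defined, the loss does not have to coincide with it: the key factors in the objective that affect the result can be extracted and used to guide learning.

Deriving and simplifying the objective $\log P(x)$

P21

P22

The bound finally simplifies to the following three terms: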

\begin{align*} E_{q(x_1|x_0)}[log P(x_0|x_1)]-KL(q(x_T|x_0)||P(x_T)) -\sum_{t=2}^{T}E_{q(x_t|x_0)}[KL(q(x_{t-1}|x_t,x_0)||P(x_{t-1}|x_t))] \end{align*}

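Identifying the terms of the objective that matter for optimization

Conclusion

✅ The goal is to optimize $ \theta$; the second term does not depend on $ \theta$ and can be dropped.
✅ The KL divergence in the third term involves two distributions: distribution 1 is fixed and can be computed; distribution 2 is determined by $ \theta$ and is what gets optimized.

P23

Deriving distribution 1 of the third term

Given $q (x_t\mid x_0)$, $q (x_{t-1} \mid x_0)$ and $q (x_t \mid x_{t-1})$:

find $q (x_{t-1} \mid x_t,x_0)$.

✅ Mathematically, $q(x_{t-1}|x_t,x_0)$ is the distribution of $x_{t-1}$ given $x_0$ and $x_t$.

P24

P25

https://arxiv.org/pdf/2208.11970.pdf

P26

✅ Conclusion: $q(x_{t-1}|x_t,x_0)$ is also Gaussian, and its mean and variance are fixed values that do not depend on $\theta$.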
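The closed form of this posterior can be sanity-checked numerically. The sketch below uses the formula as commonly stated (for example in the arXiv tutorial linked above), with $\alpha_t = 1-\beta_t$ assumed, and compares it with a direct product-of-Gaussians (Bayes rule) computation in 1D. Function names and the example numbers are illustrative.

```python
import numpy as np

def posterior_closed_form(x0, xt, alpha_t, abar_t, abar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the closed-form expression."""
    beta_t = 1.0 - alpha_t
    mean = (np.sqrt(alpha_t) * (1.0 - abar_prev) * xt
            + np.sqrt(abar_prev) * beta_t * x0) / (1.0 - abar_t)
    var = beta_t * (1.0 - abar_prev) / (1.0 - abar_t)
    return mean, var

def posterior_by_bayes(x0, xt, alpha_t, abar_prev):
    """Same posterior via q(x_t | x_{t-1}) * q(x_{t-1} | x_0), viewed as a function of x_{t-1}."""
    beta_t = 1.0 - alpha_t
    precision = alpha_t / beta_t + 1.0 / (1.0 - abar_prev)
    mean = (np.sqrt(alpha_t) * xt / beta_t
            + np.sqrt(abar_prev) * x0 / (1.0 - abar_prev)) / precision
    return mean, 1.0 / precision

alpha_t, abar_prev = 0.98, 0.5      # arbitrary example values
abar_t = alpha_t * abar_prev        # \bar{alpha}_t = alpha_t * \bar{alpha}_{t-1}
m1, v1 = posterior_closed_form(1.5, 0.3, alpha_t, abar_t, abar_prev)
m2, v2 = posterior_by_bayes(1.5, 0.3, alpha_t, abar_prev)
```

Neither function involves $\theta$, which is the point of the conclusion above: distribution 1 is fully determined by the noise schedule, $x_0$ and $x_t$.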

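The simplified objective

Following the derivation above, the objective reduces to minimizing the KL divergence between distribution 1 and distribution 2 in the third term of the original objective.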

\begin{align*} E_{q(x_1|x_0)}[log P(x_0|x_1)]-KL(q(x_T|x_0)||P(x_T)) -\sum_{t=2}^{T}E_{q(x_t|x_0)}[KL(q(x_{t-1}|x_t,x_0)||P(x_{t-1}|x_t))] \end{align*}

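Here distribution 1 is fixed and independent of $\theta$, while distribution 2 depends on $\theta$ and is the one to be optimized.

How to minimize KL divergence?

Option 1: plug into the closed-form formula

✅ The KL divergence between two Gaussians has a closed-form solution, but it is not used here, because $ \theta$ can only affect the mean of distribution 2.

Option 2

The mean and variance of distribution 1 are fixed. The mean of distribution 2 is to be optimized; its variance is fixed.

✅ So the KL divergence is reduced by moving the mean of distribution 2 toward the mean of distribution 1.

Defining the loss function

✅ The mean of distribution 1 can be regarded as the GT for $x_{t-1}$. It is computed as:

The GT formula for $x_{t-1}$ contains both $x_0$ and $x_t$; expressing $x_0$ through $x_t$ gives:

✅ The only unknown relating $x_t$ to the GT of $x_{t-1}$ is the noise $\varepsilon $, so a network is used to learn this noise.

The final loss is defined as the L2 distance between the network output (the predicted noise) and the GT (the noise generated when constructing the training data).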
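A minimal sketch of how such a training pair and loss could be computed is shown below. It assumes a linear $\beta$ schedule and treats the network as an arbitrary callable `eps_model(x_t, t)`; the placeholder `zero_model` only stands in for a real network, and all names are illustrative rather than taken from DDPM code.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # assumed schedule

def make_training_pair(x0, rng):
    """Build (x_t, t, eps): the noised input, its time step, and the GT noise."""
    t = rng.integers(1, T + 1, size=x0.shape[0])   # one random step per sample
    eps = rng.standard_normal(x0.shape)            # GT noise
    a = alpha_bar[t - 1][:, None]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, t, eps

def noise_prediction_loss(eps_model, x0, rng):
    """Mean squared (L2) distance between predicted noise and GT noise."""
    x_t, t, eps = make_training_pair(x0, rng)
    return np.mean((eps_model(x_t, t) - eps) ** 2)

x0 = rng.standard_normal((256, 8))            # stand-in for a batch of data
zero_model = lambda x_t, t: np.zeros_like(x_t)  # placeholder, not a real network
loss_zero = noise_prediction_loss(zero_model, x0, rng)
```

A model that always predicts zero gets a loss close to 1 (the variance of the noise); a trained network is expected to do better than that baseline.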

Other questions

About $\alpha $

✅ $\alpha $ is a predefined hyperparameter; DDPM tried learning $\alpha $ and found no improvement.

Score-based Generative Modeling with Differential Equations

ID	Year	Name	Note	Tags	Link
	2021	Score-Based Generative Modeling through Stochastic Differential Equations			link

P26

DDPM VS Stochastic Differential Equation

🔎 SDE

✅ DDPM is an SDE discretized in time.

P27

Forward Diffusion Process as Stochastic Differential Equation

✅ The drift term pulls $ \mathbf{x} _ t$ toward the origin.
✅ I take "origin" to mean the zero vector $ \vec{0} $.
✅ $ \mathbf{x} _ t$ eventually tends to a standard normal.

P29

The Generative Reverse Stochastic Differential Equation

🔎 Anderson, in Stochastic Processes and their Applications, 1982

✅ $q _ t(\cdot )$ describes the distribution at time $t$.
✅ $q _ t(\mathbf{x} _ t)$ is the probability (density) of $\mathbf{x} _ t$ under $q _ t$.
✅ The key to generation is fitting the score function.

But how to get the score function $\nabla _ {\mathbf{x} _ t} \log q_t(\mathbf{x} _t)$?

P32

Score Matching

Naïve idea, learn model for the score function by direct regression?

✅ Fit the score function directly with a network.

But $\nabla _ {\mathbf{x} _ t} \log q_t(\mathbf{x} _t)$ (score of the marginal diffused density $q_t(\mathbf{x} _t)$) is not tractable!

✅ The problem: we can only sample from $q_t$; there is no closed form for $q_t$.

Vincent, “A Connection Between Score Matching and Denoising Autoencoders”, Neural Computation, 2011

Song and Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution”, NeurIPS, 2019

P33

Denoising Score Matching

Instead, diffuse individual data points $\mathbf{x}_0$. Diffused $q_t(\mathbf{x}_t|\mathbf{x}_0)$ is tractable!

🔎 Vincent, in Neural Computation, 2011

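❓ How are $\gamma _ t$ and $\sigma$ defined? A: see the DDPM derivation on the previous page.

The objective of denoising score matching therefore becomes:

After expectations, $\mathbf{s} _ \theta (\mathbf{x} _ t,t)\approx \nabla _ {\mathbf{x} _ t}\log q _ t(\mathbf{x} _ t)$!

🔎 Song and Ermon, NeurIPS, 2019

✅ In the end, $\mathbf{s} _ \theta (\mathbf{x} _ t,t)$ learns the average of the scores corresponding to all $\mathbf{x} _ 0$.

✅ It turns out that the discrete-time diffusion model (DDPM) and the continuous-time diffusion model (SDE) share the same objective, and the two versions can be converted into each other.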

$$ \min_ {\mathbf{\theta} } \mathbb{E} _ {t\sim u(0,T)}\mathbb{E} _ {\mathbf{x} _ 0\sim q_ 0(\mathbf{x} _ 0)}\mathbb{E} _{\epsilon \sim \mathcal{N}(\mathbf{0,I} ) }\frac{1}{\sigma ^2_t} ||\epsilon -\epsilon _ \theta (\mathbf{x} _ t,t)||^2_2 $$

P35

Different Parameterizations

🔎 Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models", NeurIPS 2022 link

✅ Parameter tuning has a large effect on generation quality.

P36

Synthesis with SDE vs. ODE

Generative Reverse Diffusion SDE (stochastic):

$$ d\mathbf{x} _ t=-\frac{1}{2} \beta (t)[\mathbf{x} _ t+2s_ \theta (\mathbf{x} _ t,t)]dt+\sqrt{\beta (t)} d\bar{\omega} _ t $$

Generative Probability Flow ODE (deterministic):

$$ d\mathbf{x} _ t=-\frac{1}{2} \beta (t)[\mathbf{x} _ t+s_ \theta (\mathbf{x} _ t,t)]dt $$

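✅ Song et al. (ICLR 2021) show that an SDE model can be converted into an ODE model simply by changing the formula used in the sampling process. Each noise sample then corresponds to one specific output.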
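The two sampling formulas can be compared on a toy problem where the score is known exactly. The sketch below assumes 1D Gaussian data $\mathcal{N}(m, s^2)$, a constant $\beta(t)=\beta$, and plain Euler / Euler-Maruyama integration backward in time; in a real model the analytic `score` would be replaced by the learned $s_\theta$. All constants are illustrative.

```python
import numpy as np

beta, T, n_steps = 8.0, 1.0, 500   # assumed constant beta(t) and step count
m, s = 2.0, 0.5                    # toy data distribution N(m, s^2)

def score(x, t):
    """Exact score of q_t when Gaussian data is diffused by the VP forward SDE."""
    decay = np.exp(-beta * t)
    mean_t = m * np.sqrt(decay)
    var_t = s**2 * decay + (1.0 - decay)
    return -(x - mean_t) / var_t

def sample(x_T, stochastic, rng=None):
    """Integrate from t = T down to t = 0 with step dt."""
    dt = T / n_steps
    x = x_T.copy()
    for i in range(n_steps):
        t = T - i * dt
        if stochastic:   # reverse SDE (Euler-Maruyama)
            x = x + 0.5 * beta * (x + 2.0 * score(x, t)) * dt
            x = x + np.sqrt(beta * dt) * rng.standard_normal(x.shape)
        else:            # probability flow ODE (Euler)
            x = x + 0.5 * beta * (x + score(x, t)) * dt
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(20000)
x_ode = sample(x_T, stochastic=False)
x_sde = sample(x_T, stochastic=True, rng=rng)
```

Both samplers should recover roughly the data mean and standard deviation, but only the ODE is deterministic: running it again on the same `x_T` returns the same outputs.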

P37

Diffusion Models as Neural ODEs

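Sampling with the ODE formulation has the following advantages:

ODE inference can use mature ODE solvers to accelerate sampling.
Deterministic encoding and generation (semantic image interpolation, etc.)
Log-likelihood computation (instantaneous change of variables):

❓ I did not follow the third point: is the model being used as a data-driven ODE?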

ScoreSDE: simple linear problems, e.g., inpainting, colorization; later extended to MRI and CT.
ILVR: more linear problems, e.g., super-resolution.
SNIPS: slow solution for noisy linear problems.
CCDF: better initializations.
DDRM: fast solution for all noisy linear problems, and JPEG.


P38

Accelerated Sampling

P39

The generative learning trilemma

🔎 Tackling the Generative Learning Trilemma with Denoising Diffusion GANs, ICLR 2022

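Among these, the main problem of diffusion-based generative models is slow generation, so sampling needs to be accelerated while keeping high sample quality and diversity.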

P41

Acceleration Techniques

Advanced ODE/SDE Solvers
Distillation Techniques
Low-dim. Diffusion Processes
Advanced Diffusion Processes

P42

Advanced ODE/SDE Solvers

✅ The ODE maps between the standard normal distribution and the real data distribution.

P43

Generative ODEs

Solve ODEs with as few function evaluations as possible

$$ dx=\epsilon _\theta (x,t)dt $$

First-order method (Euler): each time step is simplified to a linear process. When the step is large, the result deviates substantially from the GT.
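The effect of step size and solver order can be seen on a toy ODE. The sketch below integrates $dx/dt=-x$ (chosen only because its exact solution is known; it is not the diffusion ODE) with the first-order Euler method and the second-order Heun method, which spends two function evaluations per step.

```python
import numpy as np

def euler(f, x0, t0, t1, n):
    """First-order Euler: one function evaluation per step."""
    x, h = x0, (t1 - t0) / n
    for i in range(n):
        x = x + h * f(x, t0 + i * h)
    return x

def heun(f, x0, t0, t1, n):
    """Second-order Heun: Euler predictor plus trapezoidal corrector."""
    x, h = x0, (t1 - t0) / n
    for i in range(n):
        t = t0 + i * h
        k1 = f(x, t)
        k2 = f(x + h * k1, t + h)
        x = x + 0.5 * h * (k1 + k2)
    return x

f = lambda x, t: -x          # toy ODE with exact solution x(t) = exp(-t)
exact = np.exp(-2.0)
err_euler = {n: abs(euler(f, 1.0, 0.0, 2.0, n) - exact) for n in (4, 16, 64)}
err_heun = {n: abs(heun(f, 1.0, 0.0, 2.0, n) - exact) for n in (4, 16, 64)}
```

For the same number of steps Heun is closer to the exact solution, and the Euler error grows as the number of steps shrinks, which is the deviation described above.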

P44

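Higher-order methods

P45

P46

Related work on ODE/SDE solvers for diffusion models

ID	Year	Name	Note	Link
2	2021	Denoising Diffusion Implicit Models (DDIM)	✅ DDIM can denoise directly from $t_2$ to $t_1$. ✅ After removing a noise estimate from $x_t$, it does not sample a fresh noise; it multiplies the original noise by a coefficient and adds it back.	link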
	2021	Score-Based Generative Modeling through Stochastic Differential Equations	Runge-Kutta adaptive step-size ODE solver
	2021	Gotta Go Fast When Generating Data with Score-Based Models	Higher-Order adaptive step-size SDE solver
	2021	Denoising Diffusion Implicit Models	Reparametrized, smoother ODE
	2022	gDDIM: Generalized denoising diffusion implicit models	Reparametrized, smoother ODE
	2022	Pseudo Numerical Methods for Diffusion Models on Manifolds	Higher-Order ODE solver with linear multistepping
	2022	Fast Sampling of Diffusion Models with Exponential Integrator	Exponential ODE Integrators
	2022	DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps	Exponential ODE Integrators
	2022	DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models	Exponential ODE Integrators
	2022	Elucidating the Design Space of Diffusion-Based Generative Models	Higher-Order ODE solver with Heun’s Method
	2023	UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models
	2023	Parallel Sampling of Diffusion Model
	2023	A Geometric Perspective on Diffusion Models

✅ These solvers can be used as plug-ins and usually converge faster than DDPM.
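The DDIM behaviour described in the table (remove the noise estimate, then add the same noise back with a new coefficient) can be sketched as below. This is the deterministic variant ($\eta = 0$) written in terms of $\bar{\alpha}$; `eps_pred` stands for the network's noise prediction, and in this toy check the true noise is passed in so the result can be verified exactly.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update from noise level abar_t to abar_prev."""
    # Remove the predicted noise to estimate x_0 ...
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # ... then add the same noise back, rescaled for the target noise level.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
abar_t2, abar_t1 = 0.2, 0.7          # t2 is noisier than t1 (example values)
x_t2 = np.sqrt(abar_t2) * x0 + np.sqrt(1.0 - abar_t2) * eps
x_t1 = ddim_step(x_t2, eps, abar_t2, abar_t1)
```

With a perfect noise prediction, one step lands exactly on the corresponding point at the less noisy level, however far apart $t_2$ and $t_1$ are; with a learned network the jump is only approximate, which is why several steps are still used in practice.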

Distillation Techniques

P48

ODE Distillation

Can we train a neural network to directly predict $\mathbf{x} _{{t}'} $ given $\mathbf{x} _t$?

✅ The relation between $\mathbf{x} _{{t}'} $ and $\mathbf{x} _t$ is deterministic.

P49

Year	Name	Note	Link
2022	Progressive distillation for fast sampling of diffusion models	Distillation	link
2023	On Distillation of Guided Diffusion Models	Award Candidate	link
2023	Consistency Models		link

P52

SDE Distillation

Can we train a neural network to directly predict distribution of $\mathbf{x} _ {{t}'} $ given $\mathbf{x} _ t $ ?

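✅ $\mathbf{x} _ t$ and $ \mathbf{x} _ {{t}' }$ are not deterministically related; what is obtained is the distribution of $ \mathbf{x} _ {{t}' }$.

However, the Normal assumption on the denoising distribution holds only for small steps.

✅ When the gap between $t$ and ${t}'$ is too large, a normal distribution is not expressive enough to represent $q(\mathbf{x} _ {{t}'}|\mathbf{x} _ t)$.

This therefore requires more complicated functional approximators, for example GANs or energy-based models.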

ID	Year	Name	Note	Tags	Link
	2022	Tackling the Generative Learning Trilemma with Denoising Diffusion GANs	GAN		link
	2021	Learning energy-based models by diffusion recovery likelihood	Energy-based models

P54

Training-based Sampling Techniques

Year	Name	Note
2021	Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed	Knowledge distillation
2022	Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality	Learned Samplers
2023	Fast Sampling of Diffusion Models via Operator Learning	Neural Operators
2023	Wavelet Diffusion Models Are Fast and Scalable Image Generators	Wavelet Diffusion Models
2022	GENIE: Higher-Order Denoising Diffusion Solvers	Distilled ODE Solvers

P56

Low-dim Diffusion Process

Cascaded Generation

Cascaded Diffusion Models outperform Big-GAN in FID and IS and VQ-VAE2 in Classification Accuracy Score.

Year	Name	Link
2021	Cascaded Diffusion Models for High Fidelity Image Generation	link
2022	Hierarchical Text-Conditional Image Generation with CLIP Latents
2022	Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

P57

Latent Diffusion Models

Main Idea：

Variational autoencoder + score-based prior

Encoder maps the input data to an embedding space
Denoising diffusion models are applied in the latent space

P58

Advantages:

(1) The distribution of latent embeddings close to Normal distribution $\to $ Simpler denoising, Faster synthesis!
(2) Latent space $\to $ More expressivity and flexibility in design!
(3) Tailored Autoencoders $\to $ More expressivity, Application to any data type (graphs, text, 3D data, etc.)!

ID	Year	Name	Note	Link
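	2021	Score-based generative modeling in latent space	End-to-End Training objective ✅ This paper trains the VAE and the diffusion model jointly; its novelty is using information from score matching to compute the cross entropy.
45	2022	High-Resolution Image Synthesis with Latent Diffusion Models	Two-stage training: first train the encoder and decoder (E&D), then the diffusion model. The network trained at each stage is not large.	link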
	2021	D2C: Diffusion-Denoising Models for Few-shot Conditional Generation
	2022	Score-Guided Intermediate Layer Optimization: Fast Langevin Mixing for Inverse Problems
	2022	Dimensionality-Varying Diffusion Process

The efficiency and expressivity of latent diffusion models + open-source access fueled a large body of work in the community

Advanced Diffusion Models

✅ This part was not covered.

P63

ODE interpretation

Viewing the ODE as a deterministic generative process

DDIM sampler can be considered as an integration rule of the following ODE:

$$ d\mathbf{\bar{x} } (t)=\epsilon ^{(t)} _ \theta(\frac{\mathbf{\bar{x} } (t)}{\sqrt{\eta ^2+1}} )d\eta (t); \mathbf{\bar{x} } =\mathbf{x} / \sqrt{\bar{a} },\eta = \sqrt{1-\bar{a}} / \sqrt{\bar{a } } $$

Karras et al. argue that the ODE of DDIM is favored, as the tangent of the solution trajectory always points towards the denoiser output.
This leads to largely linear solution trajectories with low curvature $\to $ Low curvature means less truncation error accumulated over the trajectories.

🔎 Song et al., “Denoising Diffusion Implicit Models”, ICLR 2021.
🔎 Karras et al., “Elucidating the Design Space of Diffusion-Based Generative Models”, arXiv 2022.

ID	Year	Name	Note	Tags	Link
	2022	Progressive distillation for fast sampling of diffusion models	Changes the parameterization to improve stability when the number of sampling steps is reduced.		link

P64

“Momentum-based” diffusion

Introduce a velocity variable and run diffusion in extended space

Dockhorn et al., “Score-Based Generative Modeling with Critically-Damped Langevin Diffusion”, ICLR 2022.

P65

Additional Reading

Schrödinger Bridge:

🔎 Bortoli et al., "Diffusion Schrödinger Bridge", NeurIPS 2021
🔎 Chen et al., “Likelihood Training of Schrödinger Bridge using Forward-Backward SDEs Theory”, ICLR 2022

Diffusion Processes on Manifolds:

🔎 Bortoli et al., "Riemannian Score-Based Generative Modelling", NeurIPS 2022

Cold Diffusion:

🔎 Bansal et al., "Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise", arXiv 2022

Diffusion for Corrupted Data:

🔎 Daras et al., "Soft Diffusion: Score Matching for General Corruptions", TMLR 2023
🔎 Delbracio and Milanfar, "Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration", arXiv 2023
🔎 Luo et al., "Image Restoration with Mean-Reverting Stochastic Differential Equations", ICML 2023
🔎 Liu et al., “I2SB: Image-to-Image Schrödinger Bridge”, ICML 2023

Blurring Diffusion Process:

🔎 Hoogeboom and Salimans, "Blurring Diffusion Models", ICLR 2023
🔎 Rissanen et al, “Generative Modelling With Inverse Heat Dissipation”, ICLR 2023


P66

Conditional Generation and Guidance

P67

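✅ What is usually wanted is a specific output rather than an arbitrary one, so control signals are introduced to express the specific requirement.

Text-to-image examples: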

Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022.
Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022.

P68

Conditioning and Guidance Techniques

Explicit Conditions
Classifier Guidance
Classifier-free Guidance

P69

Explicit Conditions

P70
Conditional sampling can be considered as training $p(\mathbf{x} |\mathbf{y} )$ where $\mathbf{y}$ is the input conditioning (e.g., text) and $\mathbf{x}$ is generated output (e.g., image)

Train the score model for $\mathbf{x}$ conditioned on $\mathbf{y}$ using:

$$ \mathbb{E} _ {(\mathbf{x,y} )\sim P\mathrm{data} (\mathbf{x,y} )}\mathbb{E} _ {\epsilon \sim \mathcal{N}(\mathbf{0,I} ) }\mathbb{E} _{t\sim u[0,T]}||\epsilon _ \theta (\mathbf{x} _ t,t;\mathbf{y} )- \epsilon ||^2_2 $$

The conditional score is simply a U-Net with $\mathbf{x}_t$ and $\mathbf{y}$ together in the input.

✅ Requires paired $(x,y)$ data.

P71

Classifier Guidance

P72

Bayes’ Rule in Action

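✅ $p(y)$ does not depend on $\mathbf{x} _ t$, so it can be dropped.

Training procedure

✅ Step 1: a trained diffusion model for $p(x)$ is required.
✅ Step 2: train a classifier that takes $x_t$ as input and correctly predicts the control condition ($y$ need not be a discrete class).
✅ Step 3: take the gradient from step 2 and combine it, with a weight $w $, into the sampling (network forward) process of the step-1 model. $w $ determines how much influence the classifier has.

✅ Only some paired data plus a large amount of unpaired data is needed, but a separate classifier has to be trained.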
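Step 3 can be illustrated on a toy case where everything is analytic. The sketch assumes a prior $x\sim\mathcal{N}(0,1)$ standing in for the diffusion model's score and a Gaussian "classifier" $p(y|x)=\mathcal{N}(y;x,\sigma^2)$; both are stand-ins chosen so that the exact posterior score is known, and the function names are illustrative.

```python
import numpy as np

def prior_score(x):
    """grad_x log p(x) for p(x) = N(0, 1); plays the role of the diffusion model."""
    return -x

def classifier_grad(x, y, sigma):
    """grad_x log p(y | x) for the toy classifier p(y | x) = N(y; x, sigma^2)."""
    return (y - x) / sigma**2

def guided_score(x, y, sigma, w):
    """Unconditional score plus w times the classifier gradient."""
    return prior_score(x) + w * classifier_grad(x, y, sigma)

def posterior_score(x, y, sigma):
    """Exact grad_x log p(x | y), available in closed form for this toy case."""
    var = sigma**2 / (1.0 + sigma**2)
    mean = y / (1.0 + sigma**2)
    return -(x - mean) / var

x = np.linspace(-3.0, 3.0, 7)
g = guided_score(x, y=1.2, sigma=0.5, w=1.0)
```

With $w=1$ the guided score equals the exact conditional score (Bayes' rule, with $p(y)$ dropped); $w=0$ recovers unconditional sampling, and $w>1$ pushes samples harder toward the condition.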

Classifier-free Guidance

ID	Year	Name	Note	Tags	Link
	2021	Classifier-Free Diffusion Guidance			link
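For comparison with classifier guidance, the combination rule usually associated with classifier-free guidance can be sketched in a few lines. The convention below, $\epsilon = \epsilon_{uncond} + w(\epsilon_{cond}-\epsilon_{uncond})$, is one of two common parameterizations and is an assumption here, since these notes do not give the formula; the two inputs stand for the same network evaluated with and without the condition.

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Combine conditional and unconditional noise predictions with guidance weight w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])    # hypothetical conditional prediction
eps_u = np.array([0.0, 1.0])    # hypothetical unconditional prediction
eps_guided = cfg_eps(eps_c, eps_u, w=2.0)
```

Under this convention $w=0$ ignores the condition, $w=1$ is plain conditional sampling, and $w>1$ extrapolates beyond the conditional prediction; no separate classifier is needed.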

Parameterization

ID	Year	Name	Note	Tags	Link
75	2023	simple diffusion: End-to-end diffusion for high resolution images		DiT base model	link

P76

Summary

We reviewed diffusion fundamentals in 4 parts:

Discrete-time diffusion models
Continuous-time diffusion models
Accelerated sampling from diffusion models
Guidance and conditioning.

Next, we will review different applications and use cases of diffusion models after a break.


Architecture

U-Net Based Diffusion Architecture

U-Net Architecture

✅ U-Net is the most commonly used backbone in large-scale image diffusion models.

🔎 Ronneberger et al., “U-Net: Convolutional Networks for Biomedical Image Segmentation”, MICCAI 2015

Pipeline

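✅ It consists of the input, the U-Net backbone, and the condition.
✅ The condition is usually combined with the content via concatenation or cross-attention.

ID	Year	Name	Note	Tags	Link
45	2022	High-Resolution Image Synthesis with Latent Diffusion Models	Commonly known as Stable Diffusion or LDM; the most classic work on image generation with diffusion, bar none. ✅ (1) Works in latent space. ✅ (2) Introduces multiple kinds of condition.	UNet, latent space	link
69	2022	Photorealistic text-to-image diffusion models with deep language understanding	1. Uses a large language model pretrained on text only (e.g. T5) instead of the usual image-text aligned model (CLIP). 2. Uses cascaded super-resolution in pixel space instead of a latent space.	Imagen, UNet, T5, Google, pixel space	link
70	2022	ediffi: Text-to-image diffusion models with an ensemble of expert denoiser	1. Guidance from a mix of T5 and CLIP. 2. The second stage fine-tunes the first-stage model separately per time-step segment, addressing the way the dependence on text changes across stages of generation in conventional diffusion models. 3. Associates text conditions with specific regions.	NVIDIA, eDiff-I, UNet, pixel space	link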

Transformer Architecture

Vision Transformer(ViT)

ID	Year	Name	Note	Tags	Link
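71	2021	Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale”	Classification task. The core idea is to split the image into fixed-size patches (e.g. 16x16 pixels), treat each patch as a "word", convert the patches into a sequence of embedding vectors by linear projection, and feed that sequence directly into a standard Transformer encoder. This broke the dominance of convolutional neural networks (CNNs) in vision tasks and showed that a pure Transformer is effective for image recognition.	ViT	link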

Pipeline

ID	Year	Name	Note	Tags	Link
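72	2022	All are Worth Words: a ViT Backbone for Score-based Diffusion Models	1. U-ViT, a Transformer-based diffusion network that replaces the conventional U-Net architecture. 2. Treats all inputs to the generation process (noisy image patches, time step, condition) uniformly as tokens, modeled with ViT's global self-attention. 3. Removes diffusion's dependence on U-Net and shows the potential of a pure Transformer architecture for generation.	U-ViT	link
73	2022	Scalable Diffusion Models with Transformers	1. Diffusion Transformer (DiT), a diffusion model with a ViT backbone that replaces the UNet backbone. 2. Models image generation with the Transformer's global self-attention and demonstrates the scalability and performance advantages of Transformers in diffusion models.	DiT, ViT	link

Others

ID	Year	Name	Note	Tags	Link
	2022	DALL-E2	Uses the joint CLIP feature space (Radford et al., 2021) to improve text-image alignment, addressing "semantic drift".
	2021	GLIDE	First to introduce text-conditioned control, improving results with classifier guidance. First to combine conditional control (text) with the diffusion process, achieving precise semantic mapping through gradient-based adjustment.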

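Controllable generation

ID	Year	Name	Note	Tags	Link
65	2023	T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models	1. A lightweight adapter aligns external control signals (e.g. sketches, depth maps) with the model's internal knowledge for more precise control. 2. Only the adapter is optimized, so training is efficient. 3. Non-uniform time-step sampling: raises the sampling probability in the early stage of diffusion (when image structure forms) to make the control signal more effective.	training efficiency	link
66	2023	Adding Conditional Control to Text-to-Image Diffusion Models	Clones the network blocks of a pretrained model and connects them through "zero convolutions", learning conditional control without damaging the original model's capabilities.	ControlNet	link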
67	2023	GLIGEN: Open-Set Grounded Text-to-Image Generation			link

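Image editing

P10

Gaussian-noise methods

ID	Year	Name	Note	Tags	Link
22	2022	SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations	Proposes a unified framework that needs no extra training: images are generated and edited by adding noise and then denoising through the reverse process of a stochastic differential equation (SDE).		link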

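DDIM inversion methods

ID	Year	Name	Note	Tags	Link
23	2023	Dual diffusion implicit bridges for image-to-image translation	DDIB exploits the alignment of diffusion latent spaces and proposes a DDIM-based image-to-image translation method that performs cross-domain translation through implicit bridges.	DDIM	link
24	2023	DiffEdit: Diffusion-based semantic image editing with mask guidance	Uses the difference between the diffusion model's noise predictions under different text conditions to generate a mask of the regions relevant to the edit, enabling precise local editing.	DDIM, auto mask	link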

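Editing the text embedding

ID	Year	Name	Note	Tags	Link
25	2023	Imagic: Text-Based Real Image Editing with Diffusion Models	1. Uses a T2I model for text-based image editing. 2. Requires fine-tuning the T2I model. 3. First solves for $T_{orig}$, then interpolates between $T_{orig}$ and $T_{tgt}$.		link
76	2022	NULL-text Inversion for Editing Real Images Using Guided Diffusion Models	Edits real (non-generated) images. Built on CFG: fixes the conditional branch and optimizes the unconditional branch so that its embedding moves toward the embedding of the conditional branch.	DDIM	link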

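Attention-based methods

ID	Year	Name	Note	Tags	Link
20	2023	Prompt-to-Prompt Image Editing with Cross-Attention Control	Cross-attention layers determine how the text prompt relates to the image's spatial layout; editing the attention maps allows edits that do not destroy the original image structure. Only applicable to images generated with the same pretrained model.	attention control	link
77	2022	Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation	Directly manipulates the spatial features and self-attention inside the diffusion model for fine-grained control of generation. The core idea: extract intermediate-layer spatial features and self-attention maps from the source image and inject them into the generation of the target image, preserving the source's semantic layout while changing appearance attributes according to the text prompt.	attention control	link
21	2023	InstructPix2Pix: Learning to Follow Image Editing Instructions	Given an existing image, typing a complete control prompt is unnatural for users. The user only tells the model how to modify the image, and the instruction is turned into a complete prompt via Prompt-to-Prompt.		link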

P32

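Subject-specific (personalized) image generation

ID	Year	Name	Note	Tags	Link
62	2023	DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation	Each subject is assigned a rare token (e.g. "sks") as its text label. The diffusion model is fine-tuned so that it can faithfully generate the specific subject.	finetune	link
63	2023	An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion	Does not modify model weights; instead optimizes a new embedding vector in the text-embedding space to represent the target concept. The vector can be inserted into natural-language descriptions like an ordinary word, guiding the model to generate images containing the concept.	Textual Inversion, optimization	link
38	2021	Lora: Low-rank adaptation of large language models	Fine-tunes an already trained large model to produce a desired style by learning a residual on top of it. The residual can usually be fitted by a low-rank matrix, hence the name low-rank adaptation. The benefit of low rank is that very few parameters need to be trained or adjusted.	training efficiency	link
		Lora + Dreambooth (by Simo Ryu)	No paper found		https://github.com/cloneofsimo/lora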
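The "low-rank residual" idea in the LoRA row can be sketched as follows. Shapes, the rank `r = 4`, and the initialization (small random `A`, zero `B`, so the adapted layer starts identical to the pretrained one) are illustrative assumptions in the spirit of the paper, not settings taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4                   # r is the (small) rank of the update
W0 = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init

def lora_forward(x, W0, A, B, scale=1.0):
    """y = x W0^T + scale * x (B A)^T; only A and B would receive gradients."""
    return x @ W0.T + scale * (x @ A.T) @ B.T

x = rng.standard_normal((2, d_in))
y = lora_forward(x, W0, A, B)
full_params = W0.size             # parameters touched by full fine-tuning
lora_params = A.size + B.size     # parameters touched by LoRA
```

Here the update `B @ A` has rank at most `r`, and only 768 parameters are trained instead of 8192 for this layer.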

P43

Image generation with multiple specific subjects

ID	Year	Name	Note	Tags	Link
52	2024	Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models	Multi-subject generation: places several specific subjects in a single image and controls their poses with 2D pose.	TI, LoRA	link
64	2023	Multi-Concept Customization of Text-to-Image Diffusion	1. Uses regularization to prevent confusion between concepts. 2. Fine-tunes only KV to improve training efficiency. 3. Uses multi-concept joint optimization to merge several concepts.	training efficiency, TI	link
79	2023	Key-Locked Rank One Editing for Text-to-Image Personalization	✅ Method: dynamic rank one update. ✅ How Perfusion addresses overfitting in image personalization: ✅ (1) During training, introducing new xxxx that locks the new concepts' cross-attention keys to their superordinate category. ✅ (2) At inference, a gated rank-one approach is introduced to control the influence of the learned concept. ✅ (3) Allows the model to combine different concepts and to learn the relations between them. Results: models the interaction of two concepts well.		link

P67

Other applications

P68

Your Diffusion Model is Secretly a Zero-Shot Classifier

✅ A pretrained diffusion model (e.g. Stable Diffusion) can be used as a classifier without extra training, and can even perform zero-shot classification.

Li et al., "Your Diffusion Model is Secretly a Zero-Shot Classifier", arXiv 2023

Pipeline

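✅ Take the input image $x$ and add random noise $\epsilon $; then predict the noise $\epsilon _\theta $ under condition c. Search for the condition C that makes $\epsilon _\theta $ closest to $\epsilon $. The resulting C is the predicted class.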
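The selection rule above can be sketched with a toy conditional model. In the sketch, each class is represented by a single point in 2D and `toy_eps_model` is the optimal noise predictor for that degenerate case; it is a hypothetical stand-in for a real text-conditioned network, and the fixed noise level `abar` and `n_trials` are illustrative choices.

```python
import numpy as np

def diffusion_classify(x, eps_model, classes, abar, rng, n_trials=32):
    """Return the class whose conditional noise prediction best matches the true noise."""
    errors = np.zeros(len(classes))
    for _ in range(n_trials):
        eps = rng.standard_normal(x.shape)
        x_t = np.sqrt(abar) * x + np.sqrt(1.0 - abar) * eps   # noised input
        for k, c in enumerate(classes):
            errors[k] += np.mean((eps_model(x_t, c) - eps) ** 2)
    return classes[int(np.argmin(errors))]

# Hypothetical toy "dataset": each class is concentrated at one center.
centers = {"cat": np.array([2.0, 0.0]), "dog": np.array([-2.0, 0.0])}
abar = 0.5

def toy_eps_model(x_t, c):
    """Noise implied by x_t if the clean sample were the center of class c."""
    return (x_t - np.sqrt(abar) * centers[c]) / np.sqrt(1.0 - abar)

rng = np.random.default_rng(0)
pred = diffusion_classify(np.array([1.8, 0.1]), toy_eps_model, ["cat", "dog"], abar, rng)
```

Averaging the error over several noise draws (and, in practice, several time steps) makes the comparison between candidate classes less noisy.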

P69

Improving Robustness using Generated Data

✅ Uses a diffusion model for data augmentation.

Overview of the approach:

train a generative model and a nonrobust classifier, which are used to provide pseudo-labels to the generated data.
The generated and original training data are combined to train a robust classifier.

Gowal et al., "Improving Robustness using Generated Data", NeurIPS 2021

P70

Better Diffusion Models Further Improve Adversarial Training

Wang et al., "Better Diffusion Models Further Improve Adversarial Training", ICML 2023

Multimodal generation

ID	Year	Name	Note	Tags	Link
74	2023	One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale		U-Vit base model	link

P72

Reference

Li et al., "Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models", NeurIPS 2022
Avrahami et al., "Blended Diffusion for Text-driven Editing of Natural Images", CVPR 2022
Sarukkai et al., "Collage Diffusion", arXiv 2023
Bar-Tal et al., "MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation", ICML 2023
Kumari et al., "Multi-Concept Customization of Text-to-Image Diffusion", CVPR 2023
Tewel et al., "Key-Locked Rank One Editing for Text-to-Image Personalization", SIGGRAPH 2023
Zhao et al., "A Recipe for Watermarking Diffusion Models", arXiv 2023
Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", ICLR 2022
Avrahami et al., "SpaText: Spatio-Textual Representation for Controllable Image Generation", CVPR 2023
Orgad et al., "Editing Implicit Assumptions in Text-to-Image Diffusion Models", arXiv 2023
Han et al., "SVDiff: Compact Parameter Space for Diffusion Fine-Tuning", arXiv 2023
Xie et al., "DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning", arXiv 2023
Saharia et al., "Palette: Image-to-Image Diffusion Models", SIGGRAPH 2022
Whang et al., "Deblurring via Stochastic Refinement", CVPR 2022
Xu et al., "Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models", arXiv 2023
Saxena et al., "Monocular Depth Estimation using Diffusion Models", arXiv 2023
Li et al., "Your Diffusion Model is Secretly a Zero-Shot Classifier", arXiv 2023
Gowal et al., "Improving Robustness using Generated Data", NeurIPS 2021
Wang et al., "Better Diffusion Models Further Improve Adversarial Training", ICML 2023


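Task description

Image denoising
Image super-resolution
Image inpainting

Input:

Output:

Given a pretrained diffusion model, without any condition every generated image follows the distribution of the diffusion model. When conditioned on a specific image (a blurry image, a low-resolution image), the desired output is the distribution of the corresponding sharp, high-resolution images.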

Replacement-based Methods

(Overwrites model prediction with known information)

ID	Year	Name	Note	Tags	Link
	2021	ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models
		Kawar et al., "SNIPS: Solving Noisy Inverse Problems Stochastically", NeurIPS 2021
		Chung et al., "Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction", CVPR 2022
		Song et al., "Solving Inverse Problems in Medical Imaging with Score-Based Generative Models", ICLR 2022
		Kawar et al., "Denoising Diffusion Restoration Models", NeurIPS 2022

Reconstruction-based Methods

(Approximate classifier-free guidance without additional training)

Chung et al., "Diffusion Posterior Sampling for General Noisy Inverse Problems", ICLR 2023

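✅ CFG uses paired $(\mathbf{x},\mathbf{y})$ data to approximate $\nabla _{x_t} \log p_t(\mathbf{y}|\mathbf{x}_t)$, but no paired data is available here, so the aim is to obtain it without training.
✅ The formula is derived from the Markov property. $p(\mathbf{y}|\mathbf{x}_t)$ can be written as an expectation of $p(\mathbf{y}|\mathbf{x}_0)$; the expectation is then moved from the outside to the inside.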

P8
In the Gaussian case,

$$ \log p(\mathbf{y} |\mathbb{E} [\mathbf{x} _ 0|\mathbf{x} _ t])=-c||\mathcal{A} (\hat{\mathbf{x}} _ 0)-\mathbf{y} ||^2_2+\mathrm{const},\quad \hat{\mathbf{x}} _ 0=\mathbb{E} [\mathbf{x} _ 0|\mathbf{x} _ t] $$

Maximizing the likelihood is minimizing the L2 distance between measured and generated!

Chung et al., "Diffusion Posterior Sampling for General Noisy Inverse Problems", ICLR 2023

✅ Reconstruction is performed at the same time as diffusion.

Video Diffusion/Pyramid DDPM: used for super-resolution.
Pseudoinverse guidance: linear and some non-differentiable problems, e.g., JPEG
MCG: combines replacement & reconstruction for linear problems.

Others

CSGM: Posterior sampling with Langevin Dynamics based on the diffusion score model.
RED-Diff: A Regularizing-by-Denoising (RED), variational inference approach.
Posterior sampling: use RealNVP to approximate posterior samples from diffusion models.

ID	Year	Name	Note	Tags	Link
		Chung et al., "Improving Diffusion Models for Inverse Problems using Manifold Constraints", NeurIPS 2022
		Ryu and Ye, "Pyramidal Denoising Diffusion Probabilistic Models", arXiv 2022
		Chung et al., "Diffusion Posterior Sampling for General Noisy Inverse Problems", arXiv 2022
		Song et al., "Pseudoinverse-Guided Diffusion Models for Inverse Problems", ICLR 2023
		Jalal et al., "Robust Compressed Sensing MRI with Deep Generative Priors", NeurIPS 2021
		Mardani et al., "A Variational Perspective on Solving Inverse Problems with Diffusion Models", arXiv 2023
		Feng et al., "Score-Based Diffusion Models as Principled Priors for Inverse Imaging", arXiv 2023

This article is from CaterpillarStudyGroup; please credit the source when reposting.

https://caterpillarstudygroup.github.io/ImportantArticles/

P68

Diffusion Models for Large Contents

The same approach also applies to applications such as long images, looped motion, and 360° images.

Suppose a model is trained on small, square images. How can it be extended to larger images?
Outpainting is always a solution, but not a very efficient one!

Let us generate this image with a diffusion model trained only on square regions:

Generate the center region $q(\mathbf{x} _ 1,\mathbf{x} _ 2)$
Generate the surrounding region conditioned on parts of the center image $q(\mathbf{x} _ 3|\mathbf{x} _ 2)$

Latency scales linearly with the content size!

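✅ Generating the right part conditioned on the left part works, but it is slow.
✅ Generating the large image directly is not an option, because no such training data exists.
✅ Solution: parallelized generation.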

P69

DiffCollage

Unlike autoregressive models, diffusion models can generate large contents in parallel!

P70

A “large” diffusion model from “small” diffusion models!
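
A minimal sketch of the "large from small" idea for a 1D signal, assuming the factorization $q(\mathbf{x} _ 1,\mathbf{x} _ 2,\mathbf{x} _ 3)\approx q(\mathbf{x} _ 1,\mathbf{x} _ 2)\,q(\mathbf{x} _ 2,\mathbf{x} _ 3)/q(\mathbf{x} _ 2)$ implied by the two generation steps above. Under that assumption the score of the large content is the sum of the patch scores minus the scores of the overlaps, and every term can be evaluated in parallel. The names `collage_score` and `toy_score` are illustrative and are not taken from the DiffCollage code.

```python
def collage_score(x, patch, stride, score_fn):
    """Score of a long 1D signal composed from a score model for short segments.

    Adds the score of every window of length `patch` and subtracts the score
    of each region shared by two neighbouring windows.
    """
    n = len(x)
    assert 0 < stride < patch and (n - patch) % stride == 0
    total = [0.0] * n
    starts = list(range(0, n - patch + 1, stride))
    for s in starts:  # patch terms, each independent of the others
        for i, g in enumerate(score_fn(x[s:s + patch])):
            total[s + i] += g
    for s in starts[1:]:  # overlap between the window at s - stride and the one at s
        end = s - stride + patch
        for i, g in enumerate(score_fn(x[s:end])):
            total[s + i] -= g
    return total


def toy_score(segment):
    # Hypothetical stand-in for a trained model: score of an i.i.d. N(0, 1) signal.
    return [-v for v in segment]


x = [float(v) for v in range(8)]
combined = collage_score(x, patch=4, stride=2, score_fn=toy_score)
```

For this toy score, the overlap correction makes the combined score equal the score of the full signal, which is the property the construction aims for.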

P71

More Works

Year	Name	Note
2023	Zhang et al., "DiffCollage: Parallel Generation of Large Content with Diffusion Models"
2023	Jiménez, "Mixture of Diffusers for scene composition and high resolution image generation", arXiv 2023	- Based on similar ideas but differs in how overlapping regions are mixed. ✅ This parallelization approach can be applied to any scenario with overlapping regions.
2023	Bar-Tal et al., "MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation", ICML 2023


P23

Because 3D data is scarce, a 2D T2I base model is used as a prior to achieve 3D generation.

SDS

ID	Year	Name	Note	Tags	Link
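82	2023	Magic3D: High-Resolution Text-to-3D Content Creation	Building on 68: 1. Adopts a two-stage coarse-to-fine optimization strategy that combines diffusion models and scene representations at different resolutions; the coarse stage is faster and the fine stage improves detail. 2. The coarse stage uses Instant-NGP + eDiff-I, which converges quickly and suits complex topology changes. 3. The fine stage uses DMTet + LDM.	SDS, Coarse-to-Fine	link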
	2023	Wang et al.,"Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation",		Alternative to SDS
	2023	Wang et al., "ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation",		Alternative to SDS

P31

Alternative to SDS: Score Jacobian Chaining

A different formulation, motivated from approximating 3D score.

In principle, the diffusion model is the noisy 2D score (over clean images),
but in practice, the diffusion model suffers from out-of-distribution (OOD) issues!

For diffusion model on noisy images, the non-noisy images are OOD!

✅ 2D sample, 3D score

P32

Score Jacobian Chaining

SJC approximates noisy score with “Perturb-and-Average Scoring”, which is not present in SDS.

Use score model on multiple noise-perturbed data, then average it.

✅ This approximates the model output for a clean image and resolves the OOD problem for clean images.
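
A minimal sketch of the perturb-and-average idea on a 1D toy problem: instead of querying the score model at a clean input (which is OOD for it), the input is perturbed with noise several times and the resulting scores are averaged. The toy score assumes clean data distributed as N(0, 1); the function names are illustrative, and SJC's actual estimator may differ in detail.

```python
import random


def paas_score(x, sigma, score_fn, num_samples=1000, seed=0):
    """Average the noisy-data score over several noise perturbations of x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        noisy = x + sigma * rng.gauss(0.0, 1.0)  # move x into the noisy regime
        total += score_fn(noisy, sigma)
    return total / num_samples


def toy_noisy_score(z, sigma):
    # Assumed toy model: clean data ~ N(0, 1), so noisy data ~ N(0, 1 + sigma^2).
    return -z / (1.0 + sigma ** 2)


estimate = paas_score(2.0, sigma=0.5, score_fn=toy_noisy_score, num_samples=20000)
```

In this toy case the average converges to $-x/(1+\sigma^2)$, so the estimate stays well defined even though the score model is only ever evaluated on noisy inputs.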

P33

SJC and SDS

SJC is a competitive alternative to SDS.

P34

Alternative to SDS: ProlificDreamer

SDS-based methods often set the classifier-free guidance (CFG) weight to 100, which limits the “diversity” of the generated samples.
ProlificDreamer reduces this to 7.5, leading to diverse samples.

P35

ProlificDreamer and Variational Score Distillation

Instead of maximizing the likelihood under diffusion model, VSD minimizes the KL divergence via variational inference.

$$ \begin{matrix} \min_{\mu } D _ {\mathrm{KL} }(q^\mu _ 0(\mathbf{x} _ 0|y)||p _ 0(\mathbf{x} _ 0|y)). \\ \quad \mu \quad \text{is the distribution of NeRFs} . \end{matrix} $$

Suppose $\theta _ \tau \sim \mu $ is a NeRF sample, then VSD simulates this ODE:

Diffusion model can be used to approximate score of noisy real images.
How about noisy rendered images?

✅ The first term is obtained from the diffusion model and is treated as ground truth here.

P36

Learn another diffusion model to approximate the score of noisy rendered images!

✅ LoRA is used to approximate the second term.
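
For reference, the two score terms can be written in noise-prediction form. This is a sketch following the notation of the ProlificDreamer paper as best recalled, so symbols may differ from the slides: $g(\theta ,c)$ is the image rendered from NeRF $\theta $ at camera $c$, $\mathbf{x} _ t=\alpha _ t g(\theta ,c)+\sigma _ t\epsilon $, $\epsilon _ {\mathrm{pretrain}}$ is the frozen text-to-image model (score of noisy real images), and $\epsilon _ \phi $ is the LoRA model (score of noisy rendered images).

$$ \nabla _ \theta \mathcal{L} _ {\mathrm{VSD}}(\theta )=\mathbb{E} _ {t,\epsilon ,c}\left [ \omega (t)\left ( \epsilon _ {\mathrm{pretrain}}(\mathbf{x} _ t,t,y)-\epsilon _ \phi (\mathbf{x} _ t,t,c,y) \right ) \frac{\partial g(\theta ,c)}{\partial \theta } \right ] $$

Replacing $\epsilon _ \phi (\mathbf{x} _ t,t,c,y)$ with the sampled noise $\epsilon $ gives the SDS gradient.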

P37

Why does VSD work in practice?

The valid text-to-image NeRFs form a distribution with infinite possibilities!
In SDS, epsilon is the score of noisy “dirac distribution” over finite renders, which converges to the true score with infinite renders!
In VSD, the LoRA model aims to represent the (true) score of noisy distribution over infinite number of renders!
If the generated NeRF distribution is only one point and LoRA overfits perfectly, then VSD = SDS!
But LoRA has good generalization (and learns from a trajectory of NeRFs), so closer to the true score!
This is analogous to
- Representing the dataset score via mixture of Gaussians on the dataset (SDS), versus
- Representing the dataset score via the LoRA UNet (VSD)


Diffusion on various 3D representations

Year	Name	Tags
2021	3D Shape Generation and Completion through Point-Voxel Diffusion	Point-Voxel
2019	Point-Voxel CNN for Efficient 3D Deep Learning	Point-Voxel
2022	Zeng et al., "LION: Latent Point Diffusion Models for 3D Shape Generation"
2022	Nichol et al., "Point-E: A System for Generating 3D Point Clouds from Complex Prompts"	Point cloud
2022	Hui et al., "Neural Wavelet-domain Diffusion for 3D Shape Generation"	SDF
2022	Chou et al., "DiffusionSDF: Conditional Generative Modeling of Signed Distance Functions"	SDF
2022	Shue et al., "3D Neural Field Generation using Triplane Diffusion", arXiv 2022	Nerf
2023	Yang et al., "Learning a Diffusion Prior for NeRFs", ICLR Workshop 2023	Nerf
2023	Jun and Nichol, "Shap-E: Generating Conditional 3D Implicit Functions", arXiv 2023	Nerf

P12

3D Shape Generation and Completion through Point-Voxel Diffusion

Year	Name	Note
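2021	3D Shape Generation and Completion through Point-Voxel Diffusion	A set of points with location information. ✅ Branch 1: a per-point MLP (corresponds to (b) in the figure). ✅ Branch 2: voxels can be viewed as low-resolution points. ✅ The advantage is that voxels are structured and can be processed by a CNN. ❓ Voxels → points: how is the low-to-high-resolution conversion done? ❓ How are the points inside a voxel converted into the voxel's features?
2019	Point-Voxel CNN for Efficient 3D Deep Learning	✅ Completion: depth map → complete point cloud. ✅ Method: (1) generate a point cloud from the depth map, (2) complete it with inpainting techniques. ✅ Generation and completion are two different tasks.
2022	LION: Latent Point Diffusion Models for 3D Shape Generation	✅ 1. A latent diffusion model for point clouds. ✅ 2. A point-voxel CNN architecture encodes the shape into a latent shape and latent points. ✅ 3. The diffusion model reconstructs the original points from the latent points.
2022	Point-E: A System for Generating 3D Point Clouds from Complex Prompts	Point-E uses a synthetic view from fine-tuned GLIDE, and then "lifts" the image to a 3D point cloud. ✅ Point-E task: text-to-point-cloud generation. ✅ Step 1: text-to-image with fine-tuned GLIDE. ✅ Step 2: image-to-point-cloud with a transformer-based diffusion model.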

P16

Diffusion Models for Signed Distance Functions

SDF is a function representation of a surface.
For each location x, |SDF(x)| = smallest distance to any point on the surface.
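
As a concrete illustration, here is the SDF of a sphere, assuming the common sign convention (negative inside, positive outside), which the slides do not state. The function name `sphere_sdf` is illustrative.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius


inside = sphere_sdf((0.0, 0.0, 0.0))    # centre of the unit sphere
on_surface = sphere_sdf((1.0, 0.0, 0.0))
outside = sphere_sdf((0.0, 3.0, 4.0))   # 5 units from the centre
```

The surface is the zero level set, and the absolute value at any location is its distance to the nearest surface point.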

ID	Year	Name	Note	Tags	Link
	2022	Neural Wavelet-domain Diffusion for 3D Shape Generation	- Memory of SDF grows cubically with resolution - Wavelets can be used for compression! - Diffusion for coarse coefficients, then predict detailed ones. ✅ The SDF here is stored discretely, recording the distance at each grid point. ✅ The wavelet transform turns the SDF into coarse coefficients; a diffusion model generates the coarse coefficients, and a second model predicts the detail coefficients.
	2022	DiffusionSDF: Conditional Generative Modeling of Signed Distance Functions	Latent space diffusion for SDFs, where conditioning can be provided with cross attention ✅ The principle is similar to the previous slide, except that wavelets are replaced by a VAE.

P19

Diffusion Models for NeRF

Neural Radiance Fields (NeRF) is another representation of a 3D object.

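✅ NeRF: describes a 3D object volumetrically.
✅ (1) Sample an image from the diffusion model, (2) compute a loss from the image, (3) use the loss to update the image, (4) use the image to update the NeRF.
✅ $(x,y,z,\theta ,\phi )$ is the vector representation of each point: the first three dimensions are the world coordinates and the last two are the viewing direction.
✅ Density describes how transparent (or opaque) the point is.
✅ F is a small network, e.g., an MLP.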
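
A minimal sketch of how per-point outputs become a pixel, using the standard NeRF volume-rendering quadrature. It assumes the densities and colors along one ray have already been predicted by the network F; the function name `render_ray` is illustrative.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered from near to far."""
    transmittance = 1.0  # fraction of light that reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel


# An empty sample followed by a fully opaque green one.
pixel = render_ray([0.0, 1e9], [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.1, 0.1])
```

Zero density contributes nothing to the pixel, while a very high density blocks everything behind it.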

P20

NeRF
(Fully implicit)

Voxels
(Explicit / hybrid)

Triplanes
(Factorized, hybrid)

Image from EG3D paper.

P21

✅ NeRF can take three forms of representation.

Triplanes, regularized ReLU Fields, the MLP of NeRFs...
A good representation is important!

Triplane diffusion

Regularized ReLU Fields

Implicit MLP of NeRFs

Shue et al., "3D Neural Field Generation using Triplane Diffusion", arXiv 2022
Yang et al., "Learning a Diffusion Prior for NeRFs", ICLR Workshop 2023
Jun and Nichol, "Shap-E: Generating Conditional 3D Implicit Functions", arXiv 2023

✅ All three representations can be combined with diffusion.
✅ A good representation matters a lot for the quality of the diffusion results.


P40

Year	Name	Note	Tags	Link
2024b	Unique3D
2023	Novel View Synthesis with Diffusion Models	Sample based on stochastic conditions, allowing the use of multiple conditional frames. ✅ UNet with 2 branches, one for the source view and one for the view to be generated. ✅ Step 2 is introduced for content consistency. ✅ Frame: a coordinate frame; different frames see different viewpoints. ❓ Why are there two poses? ✅ The internal features of each frame are connected via cross-attention.	- Condition on a frame and two poses, predict another frame. UNet with frame cross-attention	3Dim
2024	CAT3D
2023	Generative Novel View Synthesis with 3D-Aware Diffusion Models	- 3D-aware architecture with latent feature field. - Use diffusion model to improve render quality based on structure. ✅ (1) Generate a feature field, (2) render one of the views, (3) refine the rendering. ✅ (2) is an MLP and (3) is a diffusion model.		GenVS


Year	Name	Note
2023	NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views	SDS + Fine-tuned CLIP text embedding + Depth supervision. ✅ Overall an SDS-like optimization method, combined with additional loss functions. ✅ (1) Render different views and score the renders with the CLIP score. ✅ (2) Supervise the depth information.
2023	Zero-1-to-3: Zero-shot One Image to 3D Object	Generate novel view from 1 view and pose, with 2D model. Then, run SJC / SDS-like optimizations with view-conditioned model. ✅ (1) Use 2D diffusion to generate multiple views, then use SDS on the multi-view images to generate 3D.
2024	CAT3D


ID	Year	Name	Note	Tags	Link
	2023	Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions		Nerf
	2023	Vox-E: Text-guided Voxel Editing of 3D Objects	- Text-guided object editing with SDS - Regularize the structure of the new voxel grid.		Voxel


P73

Outline

Safety and limitations of diffusion models

P74

Data Memorization in Diffusion Models

Due to the likelihood-based objective function, diffusion models can “memorize” data.
And with a higher chance than GANs!
Nevertheless, a lot of “memorized images” are highly-duplicated in the dataset.

Carlini et al., "Extracting Training Data from Diffusion Models", arXiv 2023

P75

Erasing Concepts in Diffusion Models

Fine-tune a model to remove unwanted concepts.
From original model, obtain score via negative CFG.
A new model is fine-tuned from the new score function.

Gandikota et al., "Erasing Concepts from Diffusion Models", arXiv 2023

✅ Motivated by concerns such as copyright.
✅ Fine-tunes an existing text-to-image model.
✅ With negative CFG, the model's other existing knowledge is not affected.
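
A minimal sketch of the negative-CFG target, assuming the form used in the Erasing Concepts paper as best understood: the frozen original model's unconditional prediction is pushed away from its concept-conditioned prediction, and the fine-tuned model is trained to output that target when given the concept prompt. The function name `erased_target` and the scale `eta` are illustrative.

```python
def erased_target(eps_uncond, eps_cond, eta=1.0):
    """Noise target that steers the fine-tuned model away from concept c.

    eps_uncond: frozen model's prediction without the concept prompt
    eps_cond:   frozen model's prediction with the concept prompt
    """
    return [u - eta * (c - u) for u, c in zip(eps_uncond, eps_cond)]


target = erased_target([0.0, 1.0], [1.0, 1.0], eta=1.0)
```

Where the two predictions agree, the target equals the original prediction, which matches the note that unrelated content is left unchanged.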

Reference

P77

Part I

Ho et al., "Denoising Diffusion Probabilistic Models", NeurIPS 2020
Kingma et al., "Variational Diffusion Models", arXiv 2021
Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models", NeurIPS 2022
Song et al., "Denoising Diffusion Implicit Models", ICLR 2021
Jolicoeur-Martineau et al., "Gotta Go Fast When Generating Data with Score-Based Models", arXiv 2021
Liu et al., "Pseudo Numerical Methods for Diffusion Models on Manifolds", ICLR 2022
Lu et al., "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps", NeurIPS 2022
Lu et al., "DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models", NeurIPS 2022
Zhang and Chen, "Fast Sampling of Diffusion Models with Exponential Integrator", arXiv 2022
Zhang et al., "gDDIM: Generalized denoising diffusion implicit models", arXiv 2022
Zhao et al., "UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models", arXiv 2023
Shih et al., "Parallel Sampling of Diffusion Models", arXiv 2023
Chen et al., "A Geometric Perspective on Diffusion Models", arXiv 2023
Xiao et al., "Tackling the Generative Learning Trilemma with Denoising Diffusion GANs", arXiv 2021
Salimans and Ho, "Progressive Distillation for Fast Sampling of Diffusion Models", ICLR 2022
Meng et al., "On Distillation of Guided Diffusion Models", arXiv 2022
Dockhorn et al., "GENIE: Higher-Order Denoising Diffusion Solvers", NeurIPS 2022
Watson et al., "Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality", ICLR 2022
Phung et al., "Wavelet Diffusion Models Are Fast and Scalable Image Generators", CVPR 2023
Dhariwal and Nichol, "Diffusion Models Beat GANs on Image Synthesis", arXiv 2021
Ho and Salimans, "Classifier-Free Diffusion Guidance", NeurIPS Workshop 2021
Automatic1111, "Negative Prompt", GitHub
Hong et al., "Improving Sample Quality of Diffusion Models Using Self-Attention Guidance", arXiv 2022
Saharia et al., "Image Super-Resolution via Iterative Refinement", arXiv 2021
Ho et al., "Cascaded Diffusion Models for High Fidelity Image Generation", JMLR 2021
Sinha et al., "D2C: Diffusion-Denoising Models for Few-shot Conditional Generation", NeurIPS 2021
Vahdat et al., "Score-based Generative Modeling in Latent Space", arXiv 2021
Daras et al., "Score-Guided Intermediate Layer Optimization: Fast Langevin Mixing for Inverse Problems", ICML 2022

P78

Part I (cont’d)

Bortoli et al., "Diffusion Schrödinger Bridge", NeurIPS 2021
Bortoli et al., "Riemannian Score-Based Generative Modelling", NeurIPS 2022
Neklyudov et al., "Action Matching: Learning Stochastic Dynamics from Samples", ICML 2023
Bansal et al., "Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise", arXiv 2022
Daras et al., "Soft Diffusion: Score Matching for General Corruptions", TMLR 2023
Delbracio and Milanfar, "Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration", arXiv 2023
Luo et al., "Image Restoration with Mean-Reverting Stochastic Differential Equations", ICML 2023

P79

Part II

Jabri et al., "Scalable Adaptive Computation for Iterative Generation", arXiv 2022
Li et al., "Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models", NeurIPS 2022
Avrahami et al., "Blended Diffusion for Text-driven Editing of Natural Images", CVPR 2022
Sarukkai et al., "Collage Diffusion", arXiv 2023
Bar-Tal et al., "MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation", ICML 2023
Kumari et al., "Multi-Concept Customization of Text-to-Image Diffusion", CVPR 2023
Tewel et al., "Key-Locked Rank One Editing for Text-to-Image Personalization", SIGGRAPH 2023
Zhao et al., "A Recipe for Watermarking Diffusion Models", arXiv 2023
Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models", ICLR 2022
Avrahami et al., "SpaText: Spatio-Textual Representation for Controllable Image Generation", CVPR 2023
Orgad et al., "Editing Implicit Assumptions in Text-to-Image Diffusion Models", arXiv 2023
Han et al., "SVDiff: Compact Parameter Space for Diffusion Fine-Tuning", arXiv 2023
Xie et al., "DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning", arXiv 2023
Saharia et al., "Palette: Image-to-Image Diffusion Models", SIGGRAPH 2022
Whang et al., "Deblurring via Stochastic Refinement", CVPR 2022
Xu et al., "Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models", arXiv 2023
Saxena et al., "Monocular Depth Estimation using Diffusion Models", arXiv 2023
Li et al., "Your Diffusion Model is Secretly a Zero-Shot Classifier", arXiv 2023
Gowal et al., "Improving Robustness using Generated Data", NeurIPS 2021
Wang et al., "Better Diffusion Models Further Improve Adversarial Training", ICML 2023

P81

Part III

Jalal et al., "Robust Compressed Sensing MRI with Deep Generative Priors", NeurIPS 2021
Song et al., "Solving Inverse Problems in Medical Imaging with Score-Based Generative Models", ICLR 2022
Kawar et al., "Denoising Diffusion Restoration Models", NeurIPS 2022
Chung et al., "Improving Diffusion Models for Inverse Problems using Manifold Constraints", NeurIPS 2022
Ryu and Ye, "Pyramidal Denoising Diffusion Probabilistic Models", arXiv 2022
Chung et al., "Diffusion Posterior Sampling for General Noisy Inverse Problems", arXiv 2022
Feng et al., "Score-Based Diffusion Models as Principled Priors for Inverse Imaging", arXiv 2023
Song et al., "Pseudoinverse-Guided Diffusion Models for Inverse Problems", ICLR 2023
Mardani et al., "A Variational Perspective on Solving Inverse Problems with Diffusion Models", arXiv 2023
Delbracio and Milanfar, "Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration", arXiv 2023
Stevens et al., "Removing Structured Noise with Diffusion Models", arXiv 2023
Wang et al., "Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model", ICLR 2023
Zhou et al., "3D Shape Generation and Completion through Point-Voxel Diffusion", ICCV 2021
Zeng et al., "LION: Latent Point Diffusion Models for 3D Shape Generation", NeurIPS 2022
Nichol et al., "Point-E: A System for Generating 3D Point Clouds from Complex Prompts", arXiv 2022
Chou et al., "DiffusionSDF: Conditional Generative Modeling of Signed Distance Functions", arXiv 2022
Cheng et al., "SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation", arXiv 2022
Hui et al., "Neural Wavelet-domain Diffusion for 3D Shape Generation", arXiv 2022
Shue et al., "3D Neural Field Generation using Triplane Diffusion", arXiv 2022
Yang et al., "Learning a Diffusion Prior for NeRFs", ICLR Workshop 2023
Jun and Nichol, "Shap-E: Generating Conditional 3D Implicit Functions", arXiv 2023
Metzer et al., "Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures", arXiv 2022
Hong et al., "Debiasing Scores and Prompts of 2D Diffusion for Robust Text-to-3D Generation", CVPR Workshop 2023
Watson et al., "Novel View Synthesis with Diffusion Models", arXiv 2022
Chan et al., "Generative Novel View Synthesis with 3D-Aware Diffusion Models", arXiv 2023
Zhou and Tulsiani, "SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction", arXiv 2022

P82

Part III (cont’d)

Seo et al., "DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model", arXiv 2023
Haque et al., "Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions", arXiv 2023
Sella et al., "Vox-E: Text-guided Voxel Editing of 3D Objects", arXiv 2023
Harvey et al., "Flexible Diffusion Modeling of Long Videos", arXiv 2022
Voleti et al., "MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation", NeurIPS 2022
Mei and Patel, "VIDM: Video Implicit Diffusion Models", arXiv 2022
Wang et al., "Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models", arXiv 2023
Jiménez, "Mixture of Diffusers for scene composition and high resolution image generation", arXiv 2023
Bar-Tal et al., "MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation", arXiv 2023
Zhang et al., "DiffCollage: Parallel Generation of Large Content with Diffusion Models", CVPR 2023
Du et al., "Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model", CVPR 2023
Somepalli et al., "Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models", CVPR 2023
Carlini et al., "Extracting Training Data from Diffusion Models", arXiv 2023
Gandikota et al., "Erasing Concepts from Diffusion Models", arXiv 2023
Kumari et al., "Ablating Concepts in Text-to-Image Diffusion Models", arXiv 2023
Somepalli et al., "Understanding and Mitigating Copying in Diffusion Models", arXiv 2023


Large Multimodal Models:

Notes on CVPR 2023 Tutorial

Chunyuan Li
Microsoft Research, Redmond
https://chunyuan.li

Abstract

This tutorial note summarizes the presentation on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4, a part of the CVPR 2023 tutorial on Recent Advances in Vision Foundation Models. The tutorial consists of three parts. We first introduce the background on recent GPT-like large models for vision-and-language modeling to motivate the research in instruction-tuned large multimodal models (LMMs). As a pre-requisite, we describe the basics of instruction-tuning in large language models, which is further extended to the multimodal space. Lastly, we illustrate how to build the minimum prototype of multimodal GPT-4 like models with open-source resources, and review the recently emerged topics.

❓ GPT is a language model, so why is it called a multimodal model?
❓ What is instruction tuning?

1 Prologue

In view of the rapid assimilation and widespread adoption of OpenAI ChatGPT [32]/GPT-4 [33] in contemporary society, there has been a growing interest among academics and researchers to develop open-source large language models (LLMs), and simultaneously explore the extensions into large multimodal models (LMMs)$^1$. In order to elucidate this popular topic for a broader audience, in the CVPR 2023 tutorial on Recent Advances in Vision Foundation Models, we have provided a lecture on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4, based on the public materials in the literature. This note summarizes the tutorial presentation and makes it more complete. It gives guided tours through the literature and explains topics to those who seek to learn about LMMs, from the basics to the advances. It is prepared for an audience including graduate students, researchers, and professionals whose specialties lie outside LMMs, to help them develop perspectives and identify trends in LMMs in an accessible way.

✅ In this note, LMM = multimodal LLM.

In the full tutorial, as shown in Figure 2, we have covered the most recent approaches and principles at the frontier of learning and applying vision foundation models, including Q1: Visual and Vision-Language Pre-training; Q2: Generic Vision Interface; Q3: Alignments in Text-to-image Generation; Q4: Large Multimodal Models; and Q5: Multimodal Agents.
This note focuses on Q4: how to leverage LLM for multimodality, and train LMMs in an end-to-end fashion, so that the models can see and chat. The presentation consists of three parts. To start, we first share background on recent GPT-like large models for vision-and-language modeling in Section 2. In the 2nd part, as a pre-requisite, we will introduce the concept of instruction tuning in language domains in Section 3, which empowered ChatGPT. Finally, Section 4 covers the last part of the presentation, where we focus on how to build a minimum version of multimodal GPT-4, using LLaVA as a running example. Since LMM is a popular research topic, many new papers have appeared in this line of research in the past three months, of which we provide a summary, so that the audience may quickly get a picture on what the LMM community has been working on.
The related links of the tutorial presentation on large multimodal models are available at:

Slides: https://tinyurl.com/5c2c2mtm
YouTube Video: https://youtu.be/mkI7EPD1vp8
Bilibili Video: https://www.bilibili.com/video/BV1Ng4y1T7v3/
For the full information and other parts of the CVPR tutorial, please see the official website at:
https://vlp-tutorial.github.io/

2 Background

2.1 Image-to-Text Generative Models

LMMs in their current form are primarily image-to-text generative models, which take images as input and output a text sequence. One example is illustrated in Figure 3 (a) Left. All of the model variants share very similar model architecture and training objective.

Model Architecture. As illustrated in Figure 3 (a) Right, the model typically consists of an image encoder to extract visual features, and a language model to decode the text sequence. The vision and language modalities can be optionally connected by a trainable connection module. The image encoder and language model can be either trained from scratch or initialized from pre-trained models.
Training Objective. As illustrated in Figure 3 (b), it typically employs an auto-regressive loss on the output text tokens. For the attention map in the Transformers [46], image tokens can attend to each other, and each text token depends on all image tokens and the previous text tokens.

✅ Text is typically modeled autoregressively, whereas image tokens typically attend to one another with full attention.
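
A minimal sketch of the attention pattern described above, assuming the tokens are ordered as image tokens followed by text tokens and that a text token may also attend to itself, as in a standard causal mask. The function name `build_attention_mask` is illustrative.

```python
def build_attention_mask(num_image_tokens, num_text_tokens):
    """mask[i][j] is True when token i may attend to token j.

    Tokens are ordered as [image tokens..., text tokens...].
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_image_tokens:
                # Image tokens attend to every image token, never to text.
                mask[i][j] = j < num_image_tokens
            else:
                # Text tokens attend to all image tokens and to text up to themselves.
                mask[i][j] = j <= i
    return mask


mask = build_attention_mask(num_image_tokens=2, num_text_tokens=2)
```

The auto-regressive loss would then be computed only at the text positions, since the image tokens serve as conditioning.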

2.2 Case Studies

We use some known LMMs as examples to illustrate how the network architecture framework can be instantiated in different models, while maintaining the same auto-regressive training objective.

Case Study I: LMM trained with image-text pairwise instances. Most LMMs are trained on a large number of image-text pairs, where each training sample is a pair. GIT and BLIP2 are two large models that achieve state-of-the-art (SoTA) performance on many datasets. The comparisons are shown in Figure 4(a). GIT [48] initializes the image encoder with the contrastive pre-trained Microsoft Florence model, and trains a language model from scratch. On the other hand, BLIP2 freezes the weights of the pre-trained image and language models, and trains a lightweight Q-former. BLIP2 [20] shows higher sample-efficiency with the bootstrapping training method.

✅ GIT trains all modules end to end.
✅ BLIP2 freezes the existing modules and trains only the newly added connection module.

Case Study II: LMM trained with interleaved image-text sequence instances. We use Flamingo [1] as an example, shown in Figure 4(b). It connects the frozen pre-trained image and language models by adding novel architectural components in between. Specifically, the Perceiver Resampler module helps reduce compute complexity, and the Gated Transformer module helps stabilize training in the initial stage. Flamingo is trained on a mixture of complementary large-scale multimodal data coming only from the web, without using any data annotated for machine learning purposes. After this training is done, Flamingo can be directly adapted to vision tasks via simple few-shot learning without any additional task-specific tuning.

❓ How does this dataset differ from paired data?
✅ Flamingo is trained in the same way as BLIP2.

Multimodal In-Context-Learning. Besides the SoTA performance on dozens of academic benchmarks, probably the most appealing aspect of Flamingo is that it exhibits an emergent property: Multimodal In-Context-Learning. Specifically, given a couple of image-text pairs as examples, Flamingo can zero-shot transfer to new, unseen problems, such as solving visual math problems. This means Flamingo can tackle a number of difficult problems with just a handful of task-specific examples, without any additional training required. For example, in Figure 5, two new tasks are presented to Flamingo. The top row provides two image-text pairs as the context in the prompt, where the text describes the name of the animal in the image, followed by the geographical information of the animal. Flamingo is able to understand the patterns in the task instruction illustrated by the examples, and output the corresponding information for a new image. In the bottom row, the text first shows the optical character recognition (OCR) result of the image, followed by the arithmetic result. Flamingo learns the task instruction illustrated in the multimodal context and outputs the correct answer for a new math problem in the image. Therefore, Flamingo is generally considered the GPT-3 moment [3] in the multimodal domain.

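✅ For a new task, no training is needed; a few examples are enough for the model to learn it.
❓ Does Flamingo support interaction? How does it learn from the examples?
❓ How does this property relate to In-Context-Learning?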

2.3 OpenAI Multimodal GPT-4 and Research Gaps

In March 2023, OpenAI released GPT-4 [33], with impressive capability in visual understanding and reasoning. Though the model details are unknown, there is no doubt that GPT-4 enables many new scenarios, based on the examples highlighted in the technical report. For instance, two popular visual examples are illustrated in Figure 6. The first one identifies the uncommon visual region and exhibits strong complex reasoning performance. The second one recognizes text in the image and captures the relationship across image and text. For a while, the research community had no clue how this new ability is achieved (probably because the examples are not tied to any established academic tasks/datasets), but all agreed that these are exciting results. It naturally raises a question: How can we build Multimodal GPT-4 like models?

To answer it, we start by reviewing the big models from OpenAI, highlighting the most appealing properties of each model in Figure 7. There are several key observations: (i) GPT-2 [38] is the auto-regressive counterpart in the BERT era [8] for the paradigm of pre-training then fine-tuning. Compared with GPT-2, GPT-3 [3] is a 175B model trained on a web-scale text corpus, which exhibits two emerging properties with a frozen model: in-context-learning [3] and chain-of-thoughts (CoT) reasoning [53]. This means, without any additional training required, the model can tackle a wide range of new problems with just a few task-specific examples and by properly prompting it step-by-step, respectively. It further leads to a paradigm shift from fine-tuning model weights to prompting frozen models, where the latter shows higher generality and lower adaptation cost in task transfer. (ii) ChatGPT and InstructGPT [34] show the importance of instruction-following and alignment with human intents for LLMs, by fine-tuning the base language model GPT-3/GPT-3.5 on high quality instruction-following data, and improving them with a reward model via reinforcement learning with human feedback. (iii) GPT-4 not only improves the language ability of previous models, but also allows visual signals as additional input for understanding and reasoning. We see that the newer generation model maintains/improves the existing properties of the previous ones, and enables new properties.

✅ In-context learning means learning a new task from examples of that task.
✅ Instruction following means completing a new task by understanding its task description.

In other words, from GPT-3 to GPT-4, we see two new properties: instruction-following and multimodal input. This reveals the gap between existing LMMs such as Flamingo and multimodal GPT-4: how to perform instruction-following and alignment research in the multimodal space, which is thus the focus of this tutorial and note.

3 Pre-requisite: Instruction Tuning in Large Language Models

Note that instruction-following is a notion that originated in natural language processing (NLP). To study the intuition and gain a full picture of the history, we revisit instruction tuning with LLMs.

3.1 Instruction Tuning

Traditional Language Data. As a typical data instance in NLP, the seq2seq representation is quite common for many language tasks: each data instance consists of two parts, one sequence as the input and another sequence as the output. We provide two examples in Figure 8 (a). Without any task instruction specified, we know they are translation and summarization tasks, respectively.

This seq2seq representation is also how the NLP community used to use their data. Task instructions are implicit. Based on each data domain, individual models are trained, or sometimes a single model is trained by multi-tasking over multiple data domains without specifying the task instructions. Models trained this way are hard to generalize to new tasks in a zero-shot fashion, because they do not learn the skill of understanding task instructions, and have no ability to distinguish and generalize what task to perform in the testing stage.

Instruct Language Data. Instead, researchers have recently started to explicitly add task instructions in the model training, as shown in Figure 8 (b). Interestingly, the task instructions of most NLP tasks can be expressed in natural language as well. This leads to a new data format: instruction-input-output triplets. Based on the new format, one single model can be trained, multi-tasking with specified instructions. Since models have observed many task instructions and many instances for each task in training, it is natural and easy for the models to generalize to new tasks by task composition in the inference stage.

For example, in the evaluation stage, a new task that requires both summarization and translation is provided in Figure 8 (c). Though the model has never seen this new task in training, it has observed the individual component tasks and learns to perform the new task. Note that we humans are always creating new tasks in our daily life, and presumably these new tasks would never have been observed by models. It is thus appealing if a model is able to solve thousands of new tasks in the wild without training. This is partially why ChatGPT is becoming popular and prevalent quickly.
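
A minimal sketch of the instruction-input-output triplet format and of a composed task at evaluation time. The field names, the prompt template, and the example strings are illustrative assumptions and are not taken from Figure 8.

```python
def make_example(instruction, input_text, output_text=""):
    """One training or evaluation instance in instruction-input-output form."""
    return {"instruction": instruction, "input": input_text, "output": output_text}


def to_prompt(example):
    # Hypothetical template: the explicit instruction is part of the model input.
    return f"Instruction: {example['instruction']}\nInput: {example['input']}\nResponse:"


train = [
    make_example("Translate the text into French.", "Good morning.", "Bonjour."),
    make_example("Summarize the text in one sentence.", "<long article>", "<short summary>"),
]
# A composed task never seen in training: summarization plus translation.
unseen = make_example("Summarize the text in French.", "<long article>")
prompt = to_prompt(unseen)
```

Because the task is stated in the prompt rather than implied by the dataset, the same model can be asked to combine skills it has only seen separately.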

3.2 Self-Instruct and Open-Source LLMs

How can we collect a diverse set of high-quality instruction-following data? There are two general schemes. One is human-human interaction, where humans (task providers) provide the annotation statement and requirements, based on which another group of humans complete the annotation tasks. Such a scheme is typically costly and time-consuming. The other scheme is human-machine interaction, where humans similarly provide the annotation statement and requirements, but it is now the machines/models that complete the annotation tasks.

To enable LLMs to follow natural language instructions and complete real-world tasks, researchers have been exploring methods for instruction-tuning LLMs. This is implemented either by fine-tuning the model on a wide range of tasks using human-annotated prompts and feedback [34], or by supervised fine-tuning on public benchmarks and datasets augmented with manually or automatically generated instructions [52]. Among these methods, Self-Instruct tuning [51] is a simple and effective way of aligning LLMs to human intent, by learning from instruction-following data generated by SoTA teacher LLMs. This line of instruction-tuning research has produced effective means of improving the zero-shot and few-shot generalization abilities of LLMs.

Self-instruct leverages the in-context-learning ability of LLMs. The pipeline is illustrated in Figure 9. Humans create a few examples (i.e., seed examples) as the context, and ask an LLM such as GPT-3 or GPT-4 to create more instructions and responses that follow the requirements stated in the prompt. The machine-generated instruction-following data can be further selected and combined with the prompt for in-context learning in the next data generation iteration. The procedure iterates until a given number of samples is collected. Due to the relatively lower cost and higher response speed of API calls (compared with human annotation), self-instruct is becoming more favorable in the research community.

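✅ (1) Humans write a few examples. (2) The LLM learns the task from the examples. (3) The LLM generates new questions and answers them. (4) Humans turn the generated results into data.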
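The iterative loop can be sketched as below. This is a minimal sketch under stated assumptions: `fake_teacher` is a stand-in for a real API call to a teacher LLM, the prompt template is invented for illustration, and the only selection step shown is duplicate filtering (real pipelines filter more carefully).

```python
import random


def build_prompt(demos):
    """Compose an in-context prompt from a few demonstration examples."""
    parts = ["Write a new instruction and its response, following the examples."]
    for d in demos:
        parts.append(f"Instruction: {d['instruction']}\nResponse: {d['response']}")
    parts.append("Instruction:")
    return "\n\n".join(parts)


def fake_teacher(prompt, step):
    """Hypothetical stand-in for a teacher LLM; returns a canned sample."""
    return {"instruction": f"Generated task #{step}",
            "response": f"Generated answer #{step}"}


def self_instruct(seeds, target_size, teacher=fake_teacher,
                  n_demos=2, max_steps=1000, rng_seed=0):
    """Grow a pool of instruction-following samples from seed examples."""
    rng = random.Random(rng_seed)
    pool = list(seeds)
    for step in range(max_steps):
        if len(pool) >= target_size:
            break
        # Sample demonstrations from the current pool, so machine-generated
        # data can serve as context in later iterations.
        demos = rng.sample(pool, min(n_demos, len(pool)))
        candidate = teacher(build_prompt(demos), step)
        # Minimal selection step: drop exact duplicate instructions.
        seen = {ex["instruction"] for ex in pool}
        if candidate["instruction"] not in seen:
            pool.append(candidate)
    return pool
```

The `max_steps` bound is a design choice for the sketch: with a real teacher that keeps producing duplicates, the loop would otherwise never reach `target_size`.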

Open-Source LLMs: LLaMA Family. The open-source community has witnessed a surge of open LLMs. The success of ChatGPT [32] and GPT-4 [33] offers tremendous opportunities to improve open-source LLMs using instruction-tuning. Figure 10 compares several open-source instruction-tuned LLMs. LLaMA [45] is a series of open-source LLMs that match the performance of proprietary LLMs such as GPT-3. To teach LLaMA to follow instructions, Self-Instruct tuning has been quickly adopted given its superior performance and low cost. To name a few early attempts in this line of research, Stanford Alpaca [43] uses 52K instruction-following samples generated by GPT-3.5, while Vicuna [47] uses around 500K high-quality instruction-following samples (150K conversations) between users and GPT [39]. To advance the SoTA of instruction-tuning for LLMs, GPT-4 is utilized as the teacher to generate the responses for the Alpaca instructions [36]. Many papers have since proposed ways to improve the instruction-following data and thereby the alignment quality of the model in chat. For a comprehensive review, we refer readers to the recent paper [50], where an LLM, Tulu, is trained on a mix of several high-quality instruction datasets, and comprehensive comparisons are conducted across multiple benchmarks.

P10

Quick Assessment of LLM Chatbots. To study the quality of LLM chatbots, we consider Vicuna-Instructions-80 [47], a dataset of 80 questions that baseline models find challenging (the question file is linked at the end of this section). Besides generic instructions, there are 8 categories: knowledge, math, Fermi, counterfactual, roleplay, coding, writing and common-sense. To quantitatively compare performance, we ask GPT-4 to rate the responses of any two given chatbots on a scale from 1 to 10, then compute the relative score. The results are shown in Figure 11. Surprisingly, this evaluation metric turns out to be quite consistent across different settings. The open-source LLaMA family seems to perform close to SoTA proprietary chatbots.
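A relative score of this kind can be computed as in the sketch below. The exact formula is not given in the text; the sketch assumes the relative score is the candidate's total rating divided by the reference chatbot's total rating, expressed as a percentage, and the ratings in the usage example are made up.

```python
def relative_score(candidate_scores, reference_scores):
    """Relative score (in percent) of a candidate chatbot against a
    reference chatbot, given per-question ratings on a 1-10 scale.

    Assumption: relative score = sum(candidate) / sum(reference) * 100.
    """
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("both chatbots must be rated on the same questions")
    return 100.0 * sum(candidate_scores) / sum(reference_scores)


# Made-up ratings for three questions.
example = relative_score([7, 8, 6], [9, 8, 8])  # 21 / 25 * 100
```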

Further Discussions. There are several important topics on LLMs that we have not covered in the tutorial presentation, but that are worth future exploration.

Data-centric AI. We emphasize that the development of these open-source LLM projects is data-centric [29], rather than model-centric, and we hope readers can align with this perspective when discussing the topic. As the training objectives and network architectures are becoming similar and even identical across GPT-like projects, the key differentiating factor is data. For example, the behaviors of the aforementioned LLMs are determined by the instruction tuning data.
False Promise? There is a debate on whether the claim that open LLMs could catch up with proprietary LLMs is a false promise [14]. To align the discussions, we argue that there are two distinctive abilities for LLMs: the instruction-following ability to know which task to perform, and massive knowledge storage to complete the task with quality. Imitation models are good at the former, mimicking ChatGPT’s style but not its factuality. The authors of [14] conclude that there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs. They also advocate that the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs. Unfortunately, the resources to train such base LMs are only available in a few industry labs, and the formulas to train base LMs are largely well explored. It seems more promising for most academic research labs to explore the opportunities in alignment research with affordable resources, or to explore techniques that reduce the compute barriers.

✅ Imitation models obtain large amounts of data from a base model, which gives them instruction-following ability, but their quality cannot reach that of the base model.

Base LLMs. Developing more capable or commercially usable LLMs is of great value. Besides LLaMA, the open-source community has developed several capable base LLMs such as OpenLLaMA [11], MPT [44] and Falcon [35], or released the training recipe [5].

https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl

P11

4 Instructed Tuned Large Multimodal Models

In this tutorial, we illustrate how to build a minimum prototype of multimodal GPT-4 with open-source resources. Specifically, we use LLaVA [24] as the running example; a similar idea is also proposed in its concurrent work MiniGPT-4 [66].

4.1 Open-Source Prototypes: LLaVA / MiniGPT4

Research in the multimodal space has often been inspired by the latest advances in NLP in recent years. One successful recipe is to keep asking what would happen if the most intriguing and successful NLP ideas were borrowed for the vision-and-language community. Here we leverage the self-instruct idea from the language domain. The unique challenge with self-instruct in this setting is that no strong multimodal teacher is available yet. How can we use a language model such as the language-only GPT-4 to create multimodal instruction-following data?

4.1.1 Data Creation

Instead of directly feeding images into OpenAI GPT, we use their symbolic sequence representations, shown in Figure 12 (a). In LLaVA, the captions and boxes are considered, for the following reasons: (1) it is empirically found that GPT-4 can understand them well, whereas ChatGPT has a difficult time understanding the box data; (2) they are important for representing the image as informatively as possible.

P12

✅ Image → structured text → text output.
✅ The structured text is called the text representation.
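The image-to-structured-text step can be sketched as below. The layout of the text (section headers, normalized corner coordinates, two decimal places) is an assumption for illustration; the exact format used to query GPT-4 is the one shown in Figure 12 (a), which is not reproduced in these notes.

```python
def image_to_text(captions, boxes):
    """Serialize an image's captions and bounding boxes into plain text
    that a language-only model can read.

    captions: list of caption strings
    boxes: list of (label, (x1, y1, x2, y2)) with coordinates assumed to be
           normalized to [0, 1]
    """
    lines = ["Captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines.append("Objects (label: x1, y1, x2, y2):")
    for label, (x1, y1, x2, y2) in boxes:
        lines.append(f"- {label}: {x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}")
    return "\n".join(lines)


# Made-up annotations for a single image.
text_representation = image_to_text(
    ["A group of people standing next to a black SUV."],
    [("person", (0.68, 0.24, 0.77, 0.69)), ("suitcase", (0.43, 0.67, 0.62, 0.87))],
)
```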

As exemplified in Figure 12 (b), three types of instruction-following data are considered: multi-turn conversations, so that users can chat with the bot; detailed descriptions, so that long responses can be generated by the bot; and complex reasoning, which is more about the implications of the image than about the image content. For example, given the question “what challenge do these people face?” for this image, the image shows an SUV in a parking area, while the challenge is how to pack the luggage into the SUV given the tight space in the car. In total, 158K samples are collected.

To summarize, the trick is that whatever tasks one wants the model to perform in the serving stage, it is important to create the corresponding instruction-following data for training.

❓ How can the model go beyond recognizing image content and make complex inferences based on the image?

4.1.2 Network Architecture and Training

As illustrated in Figure 13, the LLaVA network architecture is an instantiation of the general image-to-text generative model framework introduced in Section 2 and Figure 3. Specifically, LLaVA connects the pre-trained CLIP ViT-L/14 visual encoder [37] and the large language model Vicuna [47] using a simple projection matrix. A two-stage instruction-tuning procedure is considered:

Stage 1: Pre-training for Feature Alignment. Only the projection matrix is updated, based on a subset of CC3M [40]. The only task is image captioning.
Stage 2: Fine-tuning End-to-End. Both the projection matrix and LLM are updated for two different use scenarios.

✅ Even when each module has a clear role and has been trained well on its own, end-to-end fine-tuning is still indispensable.
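The two stages differ only in which modules receive gradient updates. The sketch below records that schedule and shows what "a simple projection matrix" does to visual features. It is a plain-Python illustration, not LLaVA code: the module names, the `project` helper and the tiny dimensions are assumptions, and the vision encoder is assumed to stay frozen in both stages since the text never lists it as updated.

```python
# Which modules are updated in each stage (True = trainable).
TRAINABLE = {
    "stage1_feature_alignment": {"vision_encoder": False, "projection": True, "llm": False},
    "stage2_end_to_end": {"vision_encoder": False, "projection": True, "llm": True},
}


def trainable_modules(stage):
    """Return the sorted names of the modules updated in the given stage."""
    return sorted(name for name, on in TRAINABLE[stage].items() if on)


def project(visual_features, weight):
    """Map each visual feature vector (length d_v) into the LLM embedding
    space (length d_llm) using one matrix of shape d_v x d_llm."""
    d_llm = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_llm)]
        for f in visual_features
    ]


# Toy example: one visual token with d_v = 2 projected to d_llm = 3.
tokens = project([[1.0, 2.0]], [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
```

The projected vectors play the role of word embeddings, so the language model can consume them alongside the text tokens of the instruction.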

4.1.3 Performance

Performance on Visual Chat: Towards building multimodal GPT-4 level chatbot. LLaVA is fine-tuned on the generated multimodal instruction-following data, which contains a diverse set of task instructions and responses for daily user-oriented applications. It is empirically found that fine-tuning only the linear projection layer is sufficient for the chat demo/scenarios, though it requires longer training time.

An evaluation dataset with 30 unseen images is constructed; each image is associated with three types of instructions: conversation, detailed description and complex reasoning. This leads to 90 new language-image instructions, on which we test LLaVA and GPT-4, and use GPT-4 to rate their responses on a scale from 1 to 10. The summed score and relative score per type are reported in Figure 14. Overall, LLaVA achieves an 85.1% relative score compared with GPT-4, indicating the effectiveness of the proposed self-instruct method in multimodal settings.

P13
Performance on Science QA: New SoTA with the synergy of LLaVA with GPT-4. LLaVA is fine-tuned on a multimodal reasoning dataset in the science domain [26]. In Figure 15, LLaVA alone achieves 90.92%. We use the language-only GPT-4 as the judge, to predict the final answer based on its own previous answers and the LLaVA answers. This “GPT-4 as judge” scheme yields a new SoTA 92.53%.
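The "GPT-4 as judge" scheme can be sketched as a small combination rule. The text only says that GPT-4 predicts the final answer from its own previous answer and the LLaVA answer; the shortcut of skipping the judge when both answers agree, and the `judge` callable itself, are assumptions of this sketch (a real judge would be a prompted call to GPT-4).

```python
def combine_with_judge(question, llava_answer, gpt4_answer, judge):
    """Return the final answer for one question.

    judge: callable (question, llava_answer, gpt4_answer) -> final answer.
           Here it is a hypothetical stand-in for a language-only model
           that is shown the question and both candidate answers.
    """
    # Assumption: when the two models agree, there is nothing to arbitrate.
    if llava_answer == gpt4_answer:
        return llava_answer
    return judge(question, llava_answer, gpt4_answer)


def prefer_llava(question, llava_answer, gpt4_answer):
    """Toy judge used only to make the sketch runnable."""
    return llava_answer


final = combine_with_judge("Which option is correct?", "B", "C", prefer_llava)
```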

P14
Performance on OCR in the wild: An emerging property. LLaVA has never been explicitly trained on OCR data, i.e., images that contain text from the corresponding caption. Surprisingly, the model shows strong zero-shot OCR task transfer ability in the wild. Some examples are shown in Figure 16.

P16

4.2 Emerging Topics

The history of recent instructed tuned LMMs is illustrated in Figure 17 (a). Due to the popularity of ChatGPT and GPT-4, instructed tuned LMMs have appeared as an emerging line of research in the three months since GPT-4 was proposed. Alpaca and Vicuna were proposed in March to make LLaMA better at following instructions in the language domain. Within two weeks, MiniGPT-4 and LLaVA were proposed to make Vicuna see and chat about the visual world. Within ten days, LLaMA-Adapter V2 and mPLUG-Owl started to compare performance with MiniGPT-4/LLaVA, indicating the beginning of model evolution. The data points in April are relatively sparse. In May, a large number of LMM papers appeared on arXiv, improving this line of research in many different aspects. The momentum is still going in June.

P17
It is easy for readers to lose track of all the recent papers, and the same is true for our literature review. To better organize the literature, we group the papers by specific research topic in this tutorial, as shown in Figure 17 (b). The early LMMs with billions of parameters include GPT-4 [33], Flamingo [1], PaLM-E [9] and KOSMOS-1 [15]. In contrast to these proprietary LMMs, LLaVA/MiniGPT-4 open up opportunities to build LMMs with open-source resources. We discuss several topics below, in addition to dense prediction [49, 60], video [62, 28, 21], image generation [16] and embodied agents [31].

4.2.1 More Modalities (Beyond VL)

🔎 ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst [65]
🔎 PandaGPT: One Model To Instruction-Follow Them All [41]
🔎 SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities [61]
🔎 X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages [4]

While LMMs extend LLMs by adding the vision modality to language, it is natural to further extend the framework to include more modalities beyond vision and language. Several attempts have been made in this spirit. In Figure 18, PandaGPT leverages ImageBind to add more modalities to LMMs. The ImageBind model [12] learns a single, shared representation space for text, image/video, audio, and sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position. ImageBind provides a holistic understanding of the visual world that connects objects in a photo with how they will sound, their 3D shape, how warm or cold they are, and how they move. By training a projection layer for one modality in an LMM, the model can zero-shot transfer to infer over other modalities thanks to the shared multimodal embedding space. Another representative model is SpeechGPT, where language and speech modalities are enabled at both the input and output ends. Despite the rich model variations, the idea of connecting diverse modalities is similar to that of LMMs, which add images to LLMs.
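Why a shared embedding space enables this kind of transfer can be seen in a toy example: if an audio clip and an image of the same concept land near each other, anything trained to consume image embeddings can also be fed audio embeddings. The vectors below are made up for illustration and are not ImageBind outputs; `SHARED_SPACE`, `cosine` and `nearest` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings in a shared 3-D space, keyed by (modality, concept).
SHARED_SPACE = {
    ("image", "dog"): [0.9, 0.1, 0.0],
    ("audio", "dog"): [0.8, 0.2, 0.1],
    ("image", "car"): [0.1, 0.9, 0.2],
    ("audio", "car"): [0.0, 0.8, 0.3],
}


def nearest(query_key, target_modality):
    """Find the entry of target_modality closest to the query embedding."""
    query = SHARED_SPACE[query_key]
    candidates = [k for k in SHARED_SPACE if k[0] == target_modality]
    return max(candidates, key=lambda k: cosine(query, SHARED_SPACE[k]))


match = nearest(("audio", "dog"), "image")
```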

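❓ Fusing information from multiple modalities into one space suggests that multiple kinds of skeletal motion could be fused too, but where would the paired data come from?
❓ If only one modality is trained and the others transfer automatically, how are these modalities aligned?
❓ For motion transfer between different skeletons, could BVH serve as the intermediate structured text?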

P18

4.2.2 Multitask Instruct with Established Academic Datasets/Tasks

🔎 MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning [57]
🔎 mPlug-OWL: Modularization empowers large language models with multimodality [58]
🔎 InstructBLIP: Towards general-purpose vision-language models with instruction tuning [6]
🔎 Multimodal-GPT: A vision and language model for dialogue with humans [13]
🔎 Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT [54]

As discussed earlier in Section 3, instruction tuning in the language domain is implemented in two different ways: fine-tuning the model on a wide range of tasks using human-annotated prompts and feedback [34], or supervised fine-tuning using public benchmarks and datasets augmented with manually or automatically generated instructions [52]. The former is good at user-oriented daily life tasks, and the latter is good at achieving good numbers on established benchmarks. LLaVA/MiniGPT-4 can be categorized into the former class. Several other works either target the latter class or combine both classes.

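✅ Prompting is friendlier to use, but fine-tuning on data gives better results.
✅ The former's data comes from daily conversation, so there is no explicit task type; it is a generalist.
✅ The latter's data comes from dedicated datasets with explicit task types; it is a specialist.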

4.2.3 Multimodal In-Context-Learning

🔎 OpenFlamingo [2]
🔎 Otter: A Multi-Modal Model with In-Context Instruction Tuning [18]
🔎 $M^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning [22]
🔎 MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models [30]

Similar to the behaviour of LLMs, which can address a language task by processing examples of the task in their text prompt, multimodal in-context learning refers to a visual and text interface that can steer the model towards solving a multimodal task. Given a few example pairs of visual inputs and expected text responses composed in the multimodal prompt, the model can be asked a question with a new image or video, and then generate an answer.
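A multimodal in-context prompt of this kind interleaves image placeholders with text. The sketch below builds one as a string; the `<image:...>` placeholder syntax and the question/answer template are assumptions for illustration, since each model defines its own special tokens for images.

```python
def build_icl_prompt(examples, query_image, question):
    """Compose a few-shot multimodal prompt.

    examples: list of (image_reference, expected_answer) pairs used as context
    query_image: reference to the new image the model should answer about
    question: the question asked about every image
    """
    parts = []
    for image_ref, answer in examples:
        parts.append(f"<image:{image_ref}> {question} Answer: {answer}")
    # The query repeats the same pattern but leaves the answer open.
    parts.append(f"<image:{query_image}> {question} Answer:")
    return "\n".join(parts)


prompt = build_icl_prompt(
    [("cat.jpg", "A cat."), ("dog.jpg", "A dog.")],
    "bird.jpg",
    "What animal is shown?",
)
```

Because both demonstrations answer in two words, a model with in-context-learning ability is expected to imitate that concise style for the query image.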

P19
OpenFlamingo [2] is an open-source version of DeepMind’s Flamingo model, trained on the Multimodal C4 dataset [67], a billion-scale corpus of images interleaved with text. To explicitly enhance the multimodal in-context-learning ability of LMMs, the MIMIC-IT [17] dataset is constructed, which contains 2.4M multimodal instruction instances with in-context examples. By tuning OpenFlamingo on MIMIC-IT, a new model, Otter, is obtained with a stronger instruction-following ability. The model life cycle is summarized in Figure 20. Using two image-text pairs as the context, Otter learns the concise answering style demonstrated by the examples; without them, a tedious response is generated.

✅ Improving in-context learning relies mainly on enlarging the dataset.

4.2.4 Parameter-Efficient Training

🔎 LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model [10]
🔎 Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [27]

🔎 QLoRA: Efficient Finetuning of Quantized LLMs [7]

While fine-tuning very large models often leads to high performance, it is prohibitively expensive. For example, regular 16-bit fine-tuning of a LLaMA 65B parameter model [45] requires more than 780 GB of GPU memory [7]. Therefore, it is critical to reduce the memory footprint of LLMs/LMMs, especially when it comes to improving the accessibility of large models to a wider community. Parameter-efficient training is an effective approach for LMM adaptation. Two representative methods are illustrated in Figure 21. They freeze most of the model parameters and only allow a small number of trainable parameters to be updated with domain-specific data. For example, LLaMA-Adapter V2 and LAVIN have only 14M and 3.8M trainable parameters, respectively, compared with 7B/13B LLM parameters. Another efficient training method is quantization. The recent QLoRA fine-tunes a 65B LLaMA for 24 hours on a single GPU, reaching 99.3% of the performance level of ChatGPT. Since instruction tuning typically involves a small amount of data, parameter-efficient training or model quantization is feasible with limited GPU resources.
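The numbers in this paragraph can be checked with a little arithmetic. One accounting that reproduces the 780 GB figure is 12 bytes per parameter: 2 for the 16-bit weights, 2 for the gradients and 8 for two 32-bit Adam optimizer states. That breakdown is an assumption of this sketch (the text only gives the total), and activations are ignored.

```python
def finetune_memory_gb(n_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=8):
    """Rough GPU memory (in GB, 1 GB = 1e9 bytes) for full fine-tuning,
    counting weights, gradients and optimizer states only."""
    bytes_per_param = bytes_weights + bytes_grads + bytes_optimizer
    return n_params * bytes_per_param / 1e9


def trainable_fraction(trainable_params, total_params):
    """Share of the parameters that actually receive updates."""
    return trainable_params / total_params


full_65b = finetune_memory_gb(65e9)                 # 65e9 * 12 bytes = 780 GB
adapter_share = trainable_fraction(14e6, 7e9)       # 14M of 7B = 0.2%
lavin_share = trainable_fraction(3.8e6, 7e9)        # 3.8M of 7B, about 0.054%
```

Under this accounting, freezing the base model removes the gradient and optimizer terms for almost all parameters, which is where most of the saving comes from.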

✅ What kind of technique is quantization?
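In short, quantization stores each weight with fewer bits by mapping it to a small set of integer codes plus a scale factor. The sketch below uses simple symmetric absmax quantization to 4 bits as an illustration of the general idea; it is not the scheme used by QLoRA, which relies on its own 4-bit data type.

```python
def quantize_absmax(values, bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] with one scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit codes
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [7.0, -3.0, 0.0, 1.2]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
```

Each weight now needs only 4 bits instead of 16, at the price of a rounding error of at most half the scale per weight (here 1.2 is restored as 1.0).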

✅ An adapter can be added between two modalities to learn the alignment between them.
✅ Adapters can be added on both modalities to improve their generalization.

P20

4.2.5 Benchmarks

🔎 On the Hidden Mystery of OCR in Large Multimodal Models [25]
🔎 Evaluating Object Hallucination in Large Vision-Language Models [23]
🔎 On Evaluating Adversarial Robustness of Large Vision-Language Models [64]
🔎 LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [59]
🔎 LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models [56]

While LMMs have shown excellent visual recognition and reasoning in an open-set manner with free-form text in many scenarios, the evaluation of LMMs is becoming an urgent and challenging problem. Several related benchmarks have been developed to evaluate various aspects of LMMs, ranging from specific abilities, including OCR [25], object hallucination [23] and adversarial robustness [64], to comprehensive evaluation [59, 56].

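❓ How are these four abilities evaluated?
✅ OCR: recognizing text in images. LMMs have this ability without being trained for it, and BLIP2 even surpasses the supervised SoTA trained specifically for the OCR task.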

It is surprising that LMMs show strong zero-shot OCR performance in the wild, without explicit training on text recognition data. To shed light on the hidden mystery of OCR in LMMs, a comprehensive empirical study is conducted in [25] to compare open-source LMMs on 24 academic text recognition datasets, shown in Figure 22. Three observations are highlighted: (1) LLaVA consistently outperforms MiniGPT-4 on 21 out of 24 datasets, despite LLaVA being trained with an order of magnitude less training data. (2) Training with significantly more data leads to higher OCR performance, as demonstrated by BLIP2 [20] and mPLUG-Owl. (3) In most cases, supervised SoTA results significantly outperform zero-shot LMMs. However, it is worth noting that on the WordArt dataset [55], which primarily features challenging artistic text, BLIP2 surpasses the supervised SoTA. This reveals the potential of LMMs in recognizing more complex text types.

P21

4.2.6 Applications

🔎 PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology [42]
🔎 PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [63]
🔎 LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [19]

The success of ChatGPT/GPT-4 in the general domain has inspired interest in building assistants in vertical domains such as medicine, gaming and education. Such domain-specific assistants can have several advantages over their general-domain counterparts: (1) training on high-quality domain knowledge makes the assistants more helpful; (2) the model size can be smaller, and thus the serving cost is low; (3) sensitive user prompt data can be kept internal by serving the model locally, so the privacy issue can be avoided.

❓ Why would domain-specific assistants be smaller?

LMMs have recently been explored in the biomedical domain [42, 63, 19], where conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners. LLaVA-Med is a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model, LLaVA, using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. In Figure 23, we provide examples of the biomedical visual conversations of different chatbots. LLaVA-Med precisely answers the questions with biomedical knowledge, while LLaVA behaves like a layperson who hallucinates based on commonsense.

P22

5 How Close We Are with OpenAI Multimodal GPT-4?

With all these new works, are we close to or even surpassing OpenAI Multimodal GPT-4? It is encouraging to see that the open-source community has quickly developed a variety of models and prototypes for various new capabilities. For example, LLaVA/MiniGPT-4 pave the way towards building multimodal chatbots, with some examples that reproduce the results in the OpenAI GPT-4 technical report; GILL [16] extends LMMs for end-to-end image generation, which, to the best of our knowledge, is a capability that the current GPT-4 does not exhibit. From the perspective of enabling new multimodal capabilities with minimum prototypes, the open-source community seems close to OpenAI Multimodal GPT-4, by exploring the baby steps towards building a general-purpose multimodal assistant.

However, there is a large gap in terms of scaling a given capability, for example, even for the visual reasoning capability that we have observed in LLaVA. Figure 24 shows two more visual examples from the OpenAI technical report. To correctly answer the questions, models must understand multiple high-resolution images and long sequences, as well as respond with domain knowledge. This requires much larger compute and more powerful language models, which are not available to most people.

In summary, we have presented the background and strong capabilities of large multimodal models, reviewed instruction tuning in LLMs, and shown how we can build a prototype such as LLaVA and MiniGPT-4 using open-source resources. We have also summarized and categorized the most recent papers that have emerged in this line of research, to help those who are interested gain the momentum to start the journey of LMM research.

To discuss the next steps to work on as a community, one sustainable suggestion can be that those with resource can continue focusing on the scaling success and study new emerging properties, while others focus on prototypes for new functionalities and evaluation, as well as developing techniques to reduce the compute barriers and thus allow more accessibility for larger model compute.

P23
Acknowledgments

We thank all authors who have contributed to the related papers in LLM/LMM, which makes the tutorial possible. We have tried to track related papers for the CVPR tutorial before June 19, 2023, but may not cover all the papers on the topic, due to the fast research pace in LMMs. Apologies in advance.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. 5, 6, 17

[2] Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023. 13, 18, 19

[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 5,6

[4] Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160, 2023. 17

[5] Together Computer. Redpajama-data: An open source recipe to reproduce llama training dataset, 2023. 10

[6] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023. 18

[7] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023. 19

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018. 6

[9] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 17

[10] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023. 19

[11] Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023. 10

[12] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023. 17

[13] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023. 18

[14] Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717, 2023. 10

P24
[15] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023. 17

[16] Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023. 17, 22

[17] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023. 19

[18] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023. 18

[19] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023. 20, 21

[20] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 4, 5, 13, 20

[21] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 17

[22] Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3it: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387, 2023. 18

[23] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 20

[24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023. 11, 12, 13, 14

[25] Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023. 20

[26] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022. 13

[27] Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. Cheap and quick: Efficient vision-language instruction tuning for large language models. arXiv preprint arXiv:2305.15023, 2023. 19

[28] Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023. 17

[29] Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Douwe Kiela, David Jurado, et al. Dataperf: Benchmarks for data-centric ai development. arXiv preprint arXiv:2207.10062, 2022. 10

[30] Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin F Yang, and Kai-Wei Chang. Metavl: Transferring in-context learning ability from language models to vision-language models. arXiv preprint arXiv:2306.01311, 2023. 18

P25
[31] Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. arXiv preprint arXiv:2305.15021, 2023. 17

[32] OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2022. 3, 9

[33] OpenAI. GPT-4 technical report. https://arxiv.org/abs/2303.08774, 2023. 3, 6, 9, 13, 14, 17, 22

[34] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. 7, 9, 18

[35] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023. 10

[36] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023. 9

Instruction tuning: Finetuned Language Models Are Zero-Shot Learners

[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 12

[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. 6

[39] ShareGPT. yhttps://sharegpt.com/, 2023. 9

[40] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018. 12

[41] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023. 17

[42] Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Zhongyi Shui, Xiaoxuan Yu, Yizhi Zhao, Honglin Li, Yunlong Zhang, Ruojia Zhao, et al. Pathasst: Redefining pathology through generative foundation ai assistant for pathology. arXiv preprint arXiv:2305.15072, 2023. 21

[43] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 9

[44] MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, ly usable llms, 2023. Accessed: 2023-03-28. 10

[45] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 9, 19

[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 4

[47] Vicuna. Vicuna: An open-source chatbot impressing GPT-4 with 90%* chatgpt quality. https://vicuna.lmsys.org/, 2023. 9, 10, 12

[48] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022. 4, 5

[49] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023. 17

P26
[50] Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023. 9

[51] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instruc-tions. arXiv preprint arXiv:2212.10560, 2022. 9

[52] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2022. 9, 18

[53] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022. 6

[54] Zhenxiang Xiao, Yuzhong Chen, Lu Zhang, Junjie Yao, Zihao Wu, Xiaowei Yu, Yi Pan, Lin Zhao, Chong Ma, Xinyu Liu, et al. Instruction-vit: Multi-modal prompts for instruction learning in vit. arXiv preprint arXiv:2305.00201, 2023. 18

[55] Xudong Xie, Ling Fu, Zhifei Zhang, Zhaowen Wang, and Xiang Bai. Toward understanding wordart: Corner-guided transformer for scene text recognition, 2022. 20

[56] Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023. 20

[57] Zhiyang Xu, Ying Shen, and Lifu Huang. Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint arXiv:2212.10773, 2022. 18

[58] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 18

[59] Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, et al. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687, 2023. 20

[60] Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with multimodal large language models. arXiv preprint arXiv:2305.18279, 2023. 17

[61] Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000, 2023. 17

[62] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 17

[63] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023. 21

[64] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. arXiv preprint arXiv:2305.16934, 2023. 20

[65] Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, and Jing Liu. Chatbridge: Bridging modalities with large language model as a language catalyst. arXiv preprint arXiv:2305.16103, 2023. 17

[66] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023. 11

P27
[67] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023. 19

This article comes from CaterpillarStudyGroup. Please credit the source when reposting.

https://caterpillarstudygroup.github.io/ImportantArticles/

The Generative Modeling Problem

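The square represents the space of all possible states, i.e. the image space. Each point in the square is one sample, i.e. one image.
$P$ is the source distribution and $Q$ is the target distribution.
$X_0$ and $X_1$ are samples from $P$ and $Q$ respectively.
The goal of a generative model is to find a mapping that takes a sample from $P$ to a sample from $Q$.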

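Paradigms of generative models

Generative models fall into two broad paradigms: direct generation and incremental generation.

Direct generation

GANs and VAEs belong to the first paradigm. Their advantage is speed, because generation takes only a single forward pass.

The drawbacks of GANs are:
(1) there is no exact probability model to sample from;
(2) they are hard to train.

Autoregressive vs. non-autoregressive

The content to generate is a whole. It can be generated all at once, or it can be decomposed into small pieces that are generated separately and then assembled into the whole.

Take image generation: models such as PixelCNN (which works pixel by pixel) or autoregressive Transformers over ViT-style patches split an image into small pieces, called patches here, and generate them one after another. To keep the patches coherent with each other, each later patch is generated conditioned on the patches already generated.
Motion generation is similar: to generate a motion sequence, each frame can be treated as a patch, or several consecutive frames can form one patch.
The hallmark of autoregressive generation is that the dependency order is fixed: patches generated earlier influence patches generated later, but not the other way round.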
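
The fixed dependency order can be illustrated with a minimal sketch. Everything in it is hypothetical: `toy_next_patch` stands in for a learned model, and a "patch" is reduced to a single number.

```python
import random

def toy_next_patch(previous, rng):
    """Hypothetical stand-in for a learned model: uses only earlier patches."""
    base = previous[-1] if previous else 0.0
    return base + rng.uniform(-1.0, 1.0)

def generate_autoregressive(num_patches, next_patch, seed=0):
    """Generate patches one at a time, each conditioned on those already generated."""
    rng = random.Random(seed)
    patches = []
    for _ in range(num_patches):
        patches.append(next_patch(patches, rng))
    return patches

short = generate_autoregressive(3, toy_next_patch)
longer = generate_autoregressive(5, toy_next_patch)
# Earlier patches never depend on later ones, so the short run is a prefix of the longer one.
print(longer[:3] == short)
```

For motion, a patch would be one frame or a short window of frames rather than a number, but the loop structure would be the same.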

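Incremental generation

Incremental generation is the other paradigm: instead of producing the final result directly, it generates step by step, and each step improves on the previous one.

Generative model	Characteristic	Link
Flow Matching	The transition process is smooth.	link
Diffusion	The transition process is continuous but not smooth.	link
Jump	The transition process is discontinuous.
Score Matching		link
DSDFM	std normal --(flow matching/score matching)--> VQ-VAE latent --(VQ-VAE)--> pixel	link

What they have in common: all of them are stochastic processes built on a continuous-time Markov process.

$\Phi$ is the transition function from one generation step to the next.
The goal of an incremental generative model is to learn this transition function.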
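
As a rough illustration of what applying a transition function repeatedly looks like, here is a toy sketch in one dimension. It assumes a straight-line path toward a known `target`, which a real model does not have: in practice the velocity inside `phi` would be predicted by a learned network. The names `phi` and `generate` are illustrative only.

```python
def phi(x, t, h, target):
    """One transition step of size h at time t, moving x along a straight line to target."""
    velocity = (target - x) / (1.0 - t)  # assumption: known target; a model would predict this
    return x + h * velocity

def generate(x0, target, steps=8):
    """Start from a source sample x0 and apply the transition function `steps` times."""
    h = 1.0 / steps
    x, trajectory = x0, [x0]
    for k in range(steps):
        x = phi(x, k * h, h, target)
        trajectory.append(x)
    return trajectory

trajectory = generate(0.0, 4.0)
print(trajectory[0], trajectory[-1])
```

Each intermediate value is closer to the target than the previous one, which is the "each step improves on the previous one" behaviour described above.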

ID	Year	Name	Note	Tags	Link
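	2024	PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling	A versatile human-centric dataset for high-fidelity reconstruction and rendering of dynamic human scenes from dense multi-view video. More than 56 synchronized cameras, 45 scenes, 32 subjects, 8.2 million frames. Every frame has highly detailed appearance and realistic human motion.
	2023	BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion
	2023	CIRCLE: Capture In Rich Contextual Environments	A dataset of goal-directed motion
	2022	Artemis: Articulated Neural Pets with Appearance and Motion Synthesis	Dynamic Furry Animal (DFA) dataset: - Modeled by artists. - Contains nine high-quality CGI animals, including a panda, a lion, a cat, etc. - They have fiber/strand-based fur and skeletons. - All CGI animal characters are rendered with a commercial rendering engine (e.g. MAYA) into high-quality multi-view 1080 × 1080 RGBA videos under a variety of representative skeletal motions. Specifically, 36 camera views are arranged uniformly in a circle around the captured animal, and the number of representative poses per animal ranges from 700 to 1000.	Quadrupeds	paper, dataset
	2019	AMASS: Archive of Motion Capture as Surface Shapes	AMASS is a comprehensive and diverse human motion dataset containing more than 11,000 motions from 300 subjects, over 40 hours in total. The motion data, together with the SMPL parameters for skeleton and mesh representation, come from marker-based MoCap systems using 15 optical markers.
	2019	iMapper	i3DB [69] contains RGB videos of person-scene interactions involving medium to heavy occlusions. It provides annotated 3D joint positions and a primitive 3D scene reconstruction.
	2019	Resolving 3D Human Pose Ambiguities With 3D Scene Constraints	PROX [34] contains RGB-D videos of people interacting with indoor environments.
	2018	Recovering Accurate 3D Human Pose in the Wild Using IMUs and a Moving Camera	The 3DPW dataset captures about 51,000 frames of single-view in-the-wild video, supplemented with IMU data. The videos are recorded with a handheld camera, and the IMU data helps associate 2D poses with their 3D counterparts. 3DPW is one of the most robust datasets and has established itself as a benchmark for 3D pose estimation in recent multi-person in-the-wild scenarios.
	2014	Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments	A large collection of 3.6 million poses captured from different viewpoints in real-world environments with RGB and ToF cameras. Includes high-resolution 3D scanner data of body meshes.
225		MPI-INF-3DHP	More than 2K videos with joint annotations of 13 keypoints in outdoor scenes, suitable for 2D and 3D human pose estimation. The ground truth is obtained with a multi-camera setup and a markerless MoCap system, a shift away from traditional marker-based MoCap systems involving real individuals.
226		HumanEva dataset	A multi-view 3D human pose estimation dataset in two versions, HumanEva-I and HumanEva-II. HumanEva-I includes about 40,000 multi-view video frames captured by seven cameras placed at the front, left and right (RGB) and at the four corners (Mono). HumanEva-II has about 2,460 frames, recorded by four cameras at the corners.
227,248		CMU-Panoptic dataset	65 frame sequences, about 5.5 hours of footage, with 1.5 million annotated 3D poses. The dataset is recorded with a large multi-view system equipped with 511 calibrated cameras and 10 RGB-D sensors with hardware-based synchronization, and is crucial for developing weakly supervised methods through multi-view geometry. These methods address the occlusion problems common in traditional computer vision techniques.
115		Multiperson Composited 3D Human Pose (MuCo-3DHP) dataset	Used as a large-scale multi-person occluded training set for 3D human pose estimation. Frames in MuCo-3DHP are generated from the MPI-INF-3DHP dataset through a compositing and augmentation scheme.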
SURREAL dataset [228] is a large synthetic human body dataset containing 6 million RGB video frames. It provides a range of accurate annotations, including depth, body parts, optical flow, 2D/3D poses, and surfaces. In the SURREAL dataset, images exhibit variations in texture, view, and pose, and the body models are based on the SMPL parameters, a widely-recognized mesh representation standard.
3DOH50K dataset [150] offers a collection of 51,600 images obtained from six distinct viewpoints in real-world settings, predominantly featuring object occlusions. Each image is annotated with ground truth 2D and 3D poses, SMPL parameters, and a segmentation mask. Utilized for training human estimation and reconstruction models, the 3DOH50K dataset facilitates exceptional performance in occlusion scenarios.
3DCP dataset [229] represents a 3D human mesh dataset, derived from AMASS [230]. It includes 190 self-contact meshes spanning six human subjects (three males and three females), each modeled with an SMPL-X parameterized template.
DensePose dataset [231] features 50,000 manually annotated real images, comprising 5 million image-to-surface correspondence pairs extracted from the COCO [249] dataset. This dataset proves instrumental for training in dense human pose estimation, as well as in detection and segmentation tasks.
UP-3D dataset [232] is a dedicated 3D human pose and shape estimation dataset featuring extensive annotations in sports scenarios. The UP-3D comprises approximately 8,000 images from the LSP and MPII datasets. Additionally, each image in UP-3D is accompanied by a metadata file indicating the quality (medium or high) of the 3D fit.
THuman dataset [233] constitutes a 3D real-world human mesh dataset. It includes 7,000 RGBD images, each featuring a textured surface mesh obtained using a Kinect camera. Including surface mesh with detailed texture and the aligned SMPL model is anticipated to significantly enhance and stimulate future research in human mesh reconstruction.

Unfiled papers

ID	Year	Name	Note	Tags	Link
30	2024	CAT3D: Create Anything in 3D with Multi-View Diffusion Models	Diffusion-based 3D reconstruction		link

[2025] Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

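Abstract (translated):

First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach to tuning-free long video generation. The technique maintains a queue of video frames with progressively increasing noise, continuously outputting clean frames at the head of the queue while enqueuing Gaussian noise at the tail. However, because it does not model correspondence across frames, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated video. This paper proposes Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling coherent generation of videos of arbitrary length. Specifically:

Latent sampling at the queue tail: an improved latent sampling strategy at the tail of the queue enhances structural consistency and ensures perceptually smooth transitions between frames;
Subject-Aware Cross-Frame Attention (SACFA): aligns subjects across frames within short segments to improve visual coherence;
Self-recurrent guidance: uses information from all previous clean frames at the front of the queue to guide the denoising of the noisier frames at the tail, fostering rich and contextual global information interaction.
Long video generation experiments on the VBench benchmark show that Ouroboros-Diffusion significantly outperforms existing methods on key metrics such as subject consistency, motion smoothness and temporal consistency.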
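
The queue mechanics of the FIFO baseline described above can be sketched as follows. This is a toy illustration only, not the paper's method: a "frame" is a single number, the `0.5` multiplier is a hypothetical stand-in for one denoising step, and none of the three Ouroboros-Diffusion components are modeled.

```python
from collections import deque
import random

def fifo_generate(num_frames, queue_len=4, seed=0):
    """Toy FIFO queue: noise level rises from head to tail, clean frames leave at the head."""
    rng = random.Random(seed)
    # Each entry is [latent, noise_level]; levels start at 1 (head) up to queue_len (tail).
    queue = deque([rng.gauss(0.0, 1.0), level] for level in range(1, queue_len + 1))
    outputs = []
    while len(outputs) < num_frames:
        for frame in queue:
            frame[0] *= 0.5  # hypothetical stand-in for one denoising step
            frame[1] -= 1    # every frame in the queue becomes one level cleaner
        latent, level = queue.popleft()               # the head is now fully denoised
        outputs.append(latent)
        queue.append([rng.gauss(0.0, 1.0), queue_len])  # enqueue fresh Gaussian noise at the tail
    return outputs, [level for _, level in queue]

frames, levels = fifo_generate(6)
print(len(frames), levels)
```

Because one frame leaves and one enters per iteration, the queue length stays constant no matter how many frames are produced, which is what makes arbitrary-length generation possible.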

Key terms (English → Chinese):

FIFO (First-In-First-Out) → 先进先出
Tuning-free long video generation → 无需微调的长视频生成
Long-range temporal consistency → 长程时间一致性
Ouroboros-Diffusion → 衔尾蛇扩散 (the English term is kept; it conveys the self-recurrent nature)
Subject-Aware Cross-Frame Attention (SACFA) → 主体感知跨帧注意力机制（SACFA）
Self-recurrent guidance → 自循环引导
VBench benchmark → VBench基准测试

[2025] RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency

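Abstract (translated):

Virtual try-on, a core task at the intersection of computer vision and fashion, aims to digitally simulate how garments look when worn on a human body. Although single-image virtual try-on (VTO) has made notable progress, existing methods often struggle to keep garment appearance consistent and realistic over long video sequences, owing to the complexity of capturing dynamic human poses while preserving the features of the target garment. Building on existing video foundation models, the authors propose RealVVT, a photorealistic video virtual try-on framework that strengthens stability and realism in dynamic video scenes. The method comprises three core techniques:

Clothing & Temporal Consistency strategy: keeps garment texture, wrinkles and other details continuous across frames;
Agnostic-guided Attention Focus Loss: strengthens spatial consistency by constraining features in irrelevant regions;
Pose-guided Long Video VTO: adapts to long video sequences and improves the fluency of dynamic try-on.
Extensive experiments on multiple datasets show that RealVVT outperforms existing state-of-the-art models on both single-image and video VTO tasks, offering a practical solution for fashion e-commerce and virtual fitting-room scenarios.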

Key terms (English → Chinese):

Virtual try-on (VTO) → 虚拟试穿（VTO）
PhotoRealistic Video Virtual Try-on → 逼真视频虚拟试穿
Clothing & Temporal Consistency → 服饰与时序一致性
Agnostic-guided Attention Focus Loss → 无关性引导的注意力聚焦损失
Pose-guided Long Video VTO → 姿态引导的长视频虚拟试穿
Fashion e-commerce → 时尚电商

[2025] FlexiClip: Locality-Preserving Free-Form Character Animation

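Giving clipart images fluid motion while preserving visual fidelity and temporal coherence is a major challenge. Existing methods such as AniClipart model spatial deformation effectively but often fail to ensure smooth temporal transitions, producing artifacts such as abrupt motion and geometric distortion. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle with such content because the statistics of natural video differ from those of clipart styles. This paper proposes FlexiClip, a new method that overcomes these limitations by jointly addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional Bézier curve-based trajectory modeling with three key innovations:

Temporal Jacobians: correct motion dynamics incrementally to keep motion coherent;
Continuous-time modeling with probability flow ODEs (pfODEs): reduces the impact of temporal noise on generation quality;
A flow matching loss inspired by GFlowNet: optimizes the smoothness of motion transitions.
These improvements allow FlexiClip to generate coherent animation in complex scenarios such as fast motion and non-rigid deformation. Extensive experiments validate its effectiveness at producing smooth, natural and structurally consistent animation across diverse clipart types, including humans and animals. By combining spatio-temporal modeling with a pre-trained video diffusion model, FlexiClip sets a new standard for high-quality clipart animation and shows robust performance across a wide range of visual content.
Project page: https://creative-gen.github.io/flexiclip.github.io/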
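
For readers unfamiliar with the Bézier trajectory modeling that FlexiClip builds on, the sketch below evaluates a cubic Bézier curve in 2D. It shows only the standard curve formula; the control points are made-up values, and nothing here reproduces FlexiClip's temporal Jacobians, pfODEs or loss.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a 2D cubic Bezier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    # Bernstein basis weights for the four control points.
    weights = (s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3)
    points = (p0, p1, p2, p3)
    return tuple(sum(w * p[d] for w, p in zip(weights, points)) for d in range(2))

# Hypothetical control points for the trajectory of one keypoint.
controls = ((0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0))
trajectory = [cubic_bezier(*controls, t=i / 4) for i in range(5)]
print(trajectory[0], trajectory[2], trajectory[4])
```

The curve starts at the first control point and ends at the last, so a keypoint's start and end positions are fixed while the two middle control points shape the path in between.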

Key terms (English → Chinese):

Temporal coherence → 时间连贯性
Bézier curve-based trajectory modeling → 基于贝塞尔曲线的轨迹建模
Temporal Jacobians → 时序雅可比矩阵
Probability flow ODEs (pfODEs) → 概率流常微分方程（pfODEs）
Flow matching loss → 流匹配损失
Non-rigid deformations → 非刚性形变

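20250903 Character Skeletal Motion Generation

Core problem definition

Say it in one sentence: which classic pain point or bottleneck in animation/simulation does this technique mainly aim to solve?

Character skeletal motion generation addresses the pain point that producing animation data takes a long time and has a high barrier to entry.

Technical analysis

What it is

Describe the core idea of the technique in intuitive language

Given the user's intent, automatically generate animation data for a character skeleton. The animation data makes the character move realistically.

Key papers

Key papers/algorithms: find the one or two most representative seminal or key-improvement papers. There is no need to dig into the math, but understand the core architecture diagram and the main contributions.

This work pioneered deep-learning-based motion generation. It supports motion generation, trajectory-controlled motion generation, constrained motion generation, motion style transfer and other tasks.

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
131	2016	A deep learning framework for character motion synthesis and editing	Automatically generates character motion data	Pioneered deep-learning-based motion generation	Trajectory condition, AE, style transfer	link

The first diffusion-based text-to-motion work. It improves the diversity and quality of generated motion, but the diffusion architecture is slow at generation time.

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
132	2022.8.31	MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model	Fine-grained, detailed motion generation from diverse text inputs	The first diffusion-based text-driven motion generation framework. 1. Cross-modal text-to-motion generation through self-attention between text features and noise. 2. Fusing different text prompts in noise space for fine-grained control of different body parts. 3. Fusing different segments in noise space for long-sequence generation.	CLIP, DDPM, Transformer, open source	link

This work represents motion discretely and combines VQ-VAE with GPT, which greatly improves the quality of generated motion.

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
88	2023.9.24	T2m-gpt: Generating human motion from textual descriptions with discrete representations	A text-to-human-motion framework based on VQ-VAE and GPT	1. Discrete motion representation based on VQ-VAE 2. A text-to-motion framework of VQ-VAE + Transformer (GPT) 3. Clear improvement in generation quality (FID)	VQ-VAE + Transformer, CLIP, open source, autoregressive	link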

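Required data

Required data: what kind of data does it need for training? (images, 3D models, motion capture data, simulation data?) Is it supervised, unsupervised or self-supervised learning?

Usually supervised learning, which needs paired "condition, motion" data.

Application scenarios and cases

Academia

At top venues such as SIGGRAPH, where is this technique used most often? Find one or two examples from papers.

Beyond the motion generation task itself, non-generative tasks (such as motion transfer and motion editing) are also solved with generative methods.
The generated motion can be used to drive a Mesh.

Industry

Has any company productized it?

No productization.

Current motion generation is usually implemented for a specific skeleton and needs a large amount of data for that skeleton, so the actual cost of use is high.
The quality of motion produced by generation algorithms is unstable. The output often cannot be used directly and needs motion optimization or cherry-picking.

Film/VFX: how do studios such as Disney and Weta use it?

Games: which game engines or major studios are exploring it?

Startups: are there prominent startups built on this technique?

Value proposition analysis (the strategist's core thinking)

Efficiency: by how many times can it speed up a given stage? How much artist labor can it save?

When keyframing motion by hand, an animator usually creates the keyframes first and then interpolates between them.
With an algorithm generating the skeletal motion data, a 196-frame (6-second) clip takes less than 1 min to generate.
This greatly improves the production efficiency of animation data.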
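
The manual workflow mentioned above (keyframes first, then in-betweens) can be sketched as follows. This is a minimal example that assumes linear interpolation of a single joint angle; production tools typically use spline curves, and the keyframe values here are made up.

```python
def interpolate_keyframes(keyframes, frame):
    """Linearly interpolate a value at `frame` from a sorted list of (frame_index, value)."""
    for (f0, v0), (f1, v1) in zip(keyframes, keyframes[1:]):
        if f0 <= frame <= f1:
            u = (frame - f0) / (f1 - f0)  # position between the two keyframes, in [0, 1]
            return v0 + u * (v1 - v0)
    raise ValueError("frame outside keyframe range")

# Hypothetical elbow angle (degrees) keyed at frames 0, 10 and 20.
elbow_keys = [(0, 0.0), (10, 90.0), (20, 0.0)]
curve = [interpolate_keyframes(elbow_keys, f) for f in range(21)]
print(curve[5], curve[10], curve[15])
```

An animator authors only the three keyed values; the other 18 frames are filled in automatically. A generative model instead produces every frame of every joint directly.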

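Quality: can it reach a quality or realism that traditional methods cannot?

Quality still lags well behind what animators produce.

Innovation: does it open up entirely new creative paradigms or product types? (e.g. real-time virtual production, personalized content generation)

Animators currently tend not to use this technique. Its output quality is far below an animator's, and editing poor-quality animation data is less convenient than redoing it from scratch.

Status and challenges

Current limitations: what is the biggest problem with this technique today? (high compute cost, slow training, insufficient control, poor art-directability?)

Generation quality is not controllable and motion looks unnatural
Generating for a specific character requires a large amount of data for that character
One algorithm only works for one specific character
Generated motion needs more precise controllability
Generation is slow and cannot be controlled in real time
Results depend on the quality of skinning and rigging and are limited by how well LBS drives the mesh.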
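
To make the last limitation concrete, here is a minimal 2D sketch of linear blend skinning (LBS): each bone moves the vertex rigidly, and the results are averaged with the skinning weights. The bones, pivot and weights are made-up values, and `rotate` and `lbs` are illustrative helper names.

```python
import math

def rotate(angle, pivot):
    """Return a 2D rigid transform: rotation by `angle` (radians) about `pivot`."""
    c, s = math.cos(angle), math.sin(angle)
    px, py = pivot
    def apply(v):
        x, y = v[0] - px, v[1] - py
        return (c * x - s * y + px, s * x + c * y + py)
    return apply

def lbs(vertex, bone_transforms, weights):
    """Blend the positions produced by each bone, weighted by the skinning weights."""
    moved = [transform(vertex) for transform in bone_transforms]
    return tuple(sum(w * p[d] for w, p in zip(weights, moved)) for d in range(2))

upper_arm = rotate(0.0, (0.0, 0.0))         # this bone stays still
forearm = rotate(math.pi / 2, (1.0, 0.0))   # this bone bends 90 degrees at the elbow (1, 0)
vertex = (2.0, 0.0)
rigid = lbs(vertex, [upper_arm, forearm], [0.0, 1.0])
blended = lbs(vertex, [upper_arm, forearm], [0.5, 0.5])
print(rigid, blended)
```

With equal weights the vertex lands about 0.71 units from the elbow instead of 1, so the blended position is pulled inward. This kind of shrinkage near joints is one reason the final result is bounded by LBS no matter how good the generated skeletal motion is.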

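Future trends: where might its next breakthrough be?

Lowering the cost of use

Bring in prior information from other fields to reduce the dependence on character-specific data
Make algorithms general, so that a generative model built for one character can be adapted to another character with minor adjustment
Offer more control modes to fit more scenarios

High-quality scenarios such as film

Motion can be controlled more precisely
A higher pick rate for generated motion
More natural and plausible motion

Real-time scenarios such as games

Guarantee a lower bound on generation quality and avoid unacceptable results
Faster generation, enabling real-time interaction

Long-sequence generation
Direct Mesh driving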

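20250914 Discrete Representation of Skeletal Motion

Core problem definition

Say it in one sentence: which classic pain point or bottleneck in animation/simulation does this technique mainly aim to solve?

Discrete representation of skeletal motion addresses the pain point that motion decoded from continuous representations is of inconsistent quality.

Technical analysis

What it is

Describe the core idea of the technique in intuitive language

Describe a motion sequence with discrete codes.

[TODO] Download the figures in the table below and replace them with local links

Generative model	Characteristic	Structure	Link
AE	Dimensionality reduction and clustering, but the latent still follows a complex distribution and cannot be sampled directly
VAE	Dimensionality reduction and clustering. The latent is std normal and can be sampled directly
VQ-VAE	A discrete AE (for dimensionality reduction and clustering). Its distribution is over the whole codebook, but codebook utilization cannot reach 100% (perplexity never maxes out), so it cannot be sampled directly and must be paired with another generative model, for example PixelCNN (for sampling) in image generation
GAN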
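
The quantization step at the heart of VQ-VAE can be sketched as a nearest-neighbour lookup in a codebook. This is an illustration under simplifying assumptions: the codebook and latents are tiny hand-written 2D values, whereas a real model learns the codebook and obtains the latents from an encoder.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

def encode_motion(latents, codebook):
    """Turn a sequence of continuous latents into a sequence of discrete tokens."""
    return [quantize(z, codebook)[0] for z in latents]

# Hypothetical 3-entry codebook and per-window encoder outputs.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]
tokens = encode_motion(latents, codebook)
print(tokens)
```

Once a motion clip is a list of integer tokens like this, it can be modeled with the same sequence models used for text, which is the motivation discussed below.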

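Key papers

Key papers/algorithms: find the one or two most representative seminal or key-improvement papers. There is no need to dig into the math, but understand the core architecture diagram and the main contributions.

Text-to-motion based on discrete representation

Although discrete representations are good at storing training data precisely, discrete motion representations were first adopted in order to process motion the way language is processed.

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2022.8.4	TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts.	Text-to-3D full-body motion. Generates multiple distinct motions from the same text and avoids producing meaningless static pose sequences.	First to propose reciprocal generation with a discretely quantized motion representation. Training the text→motion and motion→text tasks together significantly improves semantic alignment.	Condition: text (NMT Encoder); Generation: autoregressive; Representation: discrete (same as VQ-VAE, although the term is not used); Generative model: sampling from a discrete distribution (NMT Decoder)

T2m-gpt was the first to show that a "discrete representation + autoregressive generation" framework can handle the text-to-motion task, with very high quality of generated motion.

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
88	2023.9.24	T2m-gpt: Generating human motion from textual descriptions with discrete representations	A text-to-human-motion framework based on VQ-VAE and GPT	1. Discrete motion representation based on VQ-VAE 2. A text-to-motion framework of VQ-VAE + Transformer (GPT) 3. Clear improvement in generation quality (FID)	Condition: text (CLIP); Generation: autoregressive; Representation: discrete (VQ-VAE); Generative model: sampling from a discrete distribution (GPT); Other: Transformer, open source	link

MoMask was the first to propose a text-to-motion model built on a "discrete representation + masked language model generation" framework.

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2023	MoMask: Generative Masked Modeling of 3D Human Motions	A new VQ-VAE + BERT-style text-to-motion framework	VQ-VAE + hierarchical codebook structure. Masked prediction generates coarse motion and residual layers refine it step by step. The first text-to-motion framework combining a discrete motion representation with a masked language model.	Condition: text (CLIP); Generation: BERT-style; Representation: discrete (VQ-VAE + residual refinement); Generative model: masked language model

There are also attempts to combine VQ-VAE with other generative frameworks, such as discrete diffusion models and score matching.

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
	2023	Text-to-Motion Synthesis using Discrete Diffusion Model	Diffusion models are computationally expensive, and the generated motion may align poorly with the input text.	Combines a discrete latent space with a diffusion model and learns an expressive conditional probability mapping for motion synthesis. 1. Learns a discrete motion representation. 2. Applies a discrete denoising diffusion probabilistic model (D3PM) to learn the conditional distribution of motion tokens. 3. Uses discrete classifier-free guidance during training, with a suitable guidance scale, to align motion with its text description.	Condition: text; Generation: non-autoregressive; Representation: discrete (VQ-VAE); Generative model: discrete denoising diffusion probabilistic model (D3PM); Other: MoDDM

Subsequent research on motion generation with discrete representations mainly goes in these directions:

Further improving motion quality
Improving diversity and randomness
Control, for example better text understanding, long-text support, and other control modes

Further improving motion quality

Adding residual codebooks on top of the base codebook increases the detail the codebooks can represent. The representative paper is MoMask, mentioned above.
HGM³ builds on MoMask.

ID	Year	Name	Pain point addressed	Main contribution	Tags	Link
102	2025.5.16	HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining	The inherent ambiguity of text and the complexity of human motion dynamics	1. A residual VQ-VAE similar to MoMask, but with a dedicated network trained to decide which tokens to mask 2. Encodes text into embeddings at different granularities, improving both overall grasp and detail control of the text	Condition: text (Graph Reasoning); Generation: BERT-style; Representation: discrete (hierarchical text encoding, each level a residual VQ-VAE); Generative model: residual VQ-VAE (a gradually refining generation mode similar to Diffusion)	link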
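
The idea of a residual codebook can be sketched in a few lines: the base codebook quantizes the value, and each further codebook quantizes what the previous levels left over. This is a scalar toy with made-up codebooks, not the MoMask or HGM³ implementation.

```python
def nearest(value, codebook):
    """Index of the codebook entry closest to a scalar value."""
    return min(range(len(codebook)), key=lambda i: abs(value - codebook[i]))

def residual_quantize(value, codebooks):
    """Quantize with the base codebook, then quantize the remaining error level by level."""
    residual, tokens, approx = value, [], 0.0
    for codebook in codebooks:
        i = nearest(residual, codebook)
        tokens.append(i)
        approx += codebook[i]     # running reconstruction
        residual -= codebook[i]   # what the next level still has to explain
    return tokens, approx

base = [0.0, 1.0, 2.0]          # hypothetical coarse codebook
detail = [-0.25, 0.0, 0.25]     # hypothetical residual codebook
tokens, approx = residual_quantize(1.2, [base, detail])
print(tokens, approx)
```

With the base codebook alone the value 1.2 is reconstructed as 1.0. Adding the residual level brings it to 1.25, which is the sense in which residual codebooks add detail.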

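Required data

What kind of data does it need for training? (images, 3D models, motion capture data, simulation data?) Is it supervised, unsupervised or self-supervised learning?

Real 3D skeletal motion data

Application scenarios and cases

Academia

At top venues such as SIGGRAPH, where is this technique used most often? Find one or two examples from papers.

VQ-VAE is a motion representation and can be used in any scenario involving skeletal motion. In the work surveyed so far it is used only for motion generation, but it should be applicable to other tasks as well.

Industry

Has any company productized it?

No productization.

Film/VFX: how do studios such as Disney and Weta use it?

Games: which game engines or major studios are exploring it?

Startups: are there prominent startups built on this technique?

Value proposition analysis (the strategist's core thinking)

Efficiency: by how many times can it speed up a given stage? How much artist labor can it save?

Compared with continuous representations, there is no efficiency gain.

Quality: can it reach a quality or realism that traditional methods cannot?

Compared with continuous representations, quality improves considerably.

Innovation: does it open up entirely new creative paradigms or product types? (e.g. real-time virtual production, personalized content generation)

It makes it possible to process motion the way language is processed.

Status and challenges

Current limitations: what is the biggest problem with this technique today? (high compute cost, slow training, insufficient control, poor art-directability?)

Most networks are designed for continuous data. Adapting them to this discrete representation requires some extra engineering.
Constrained by the codebook structure, VQ-VAE tends to store known motions rather than generalize to new ones. These models can generate and reconstruct motion accurately within the training distribution, but they struggle with out-of-distribution motion, which leads to information loss and perceptually distorted motion.

Future trends: where might its next breakthrough be?

Since VQ-VAE makes it possible to process motion like language, large motion models analogous to large language models could be developed. The main limitation at present is the limited amount of real motion data.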

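20250914 Controllable Video Generation

Core problem definition

Say it in one sentence: which classic pain point or bottleneck in animation/simulation does this technique mainly aim to solve?

Controllable video generation addresses the pain point that video production is inefficient.

Technical analysis

What it is

Describe the core idea of the technique in intuitive language

Given control conditions and an optional reference image as input, generate a specific video.

Input: text prompt (or other control signals)
Output: video

Key papers

Key papers/algorithms: find the one or two most representative seminal or key-improvement papers. There is no need to dig into the math, but understand the core architecture diagram and the main contributions.

T2I -> T2V

✅ An open-source text-to-image model pre-trained on large-scale data, Stable Diffusion, already exists. To make full use of it, the usual practice is to convert this text-to-image model into a text-to-video model, i.e. to change its output from 2D to 3D.
Source of motion information: text
Source of appearance information: text

ID	Year	Name	Note	Tags	Link
50	2023	Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets	Scaling latent video diffusion models to large datasets Data Processing and Annotation		link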

T2I/T2V -> TI2V

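Task 1: animating an image

Source of appearance information: image
Source of motion information: uncontrolled continuation, or text

Task 2: video generation conditioned on a video

Source of appearance information: text
Source of motion information: video

ID	Year	Name	Note	Tags	Link
126	2025.7.22	MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation	1. The reference object (motion information, from an image) and the target object (appearance information, from text) differ significantly in appearance or structure 2. Explicitly extracts the semantic correspondence in appearance between source and target and the deformation relation between corresponding parts. Warping the source yields a rough outline of the target, which is fed into video generation as a condition	training-free, open source

T2I/T2V/TI2V + other control signals

Pick a suitable (open-source) pre-trained model and, on top of it,

inject your own control signals, such as images, control points, optical flow, or drags
build a small amount of task-specific training data (small relative to training the base model)
introduce some tricks suited to the task
After a small amount of training (again relative to training the base model), you get a domain-specific video generation model for the task.

Most community practitioners can only access open-source pre-trained models, so the first step is to know which open-source models are available.

Source of appearance information: image
Source of motion information: text, skeletal motion sequences, physical laws, user interaction trajectories, etc.

ID	Year	Name	Note	Tags	Link
44	2024	Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling	✅ User-provided control signal (condition) + image -> dense optical flow ✅ Dense optical flow (condition) + image -> video	Two-stage, trajectory control	link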
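
To show what a dense optical flow field encodes, the sketch below moves every pixel of a tiny image by its own integer displacement. It is an illustration only: in Motion-I2V the flow is used as a condition for a video model rather than for explicit warping like this, and the `warp` helper and its integer flow format are assumptions made for the example.

```python
def warp(image, flow, fill=0):
    """Forward-warp: move each pixel by its (dx, dy); uncovered pixels keep `fill`."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                out[ny][nx] = image[y][x]
    return out

image = [[1, 2, 3],
         [4, 5, 6]]
# Dense flow: one displacement per pixel, here "everything moves one pixel to the right".
flow = [[(1, 0)] * 3 for _ in range(2)]
next_frame = warp(image, flow)
print(next_frame)
```

The left column of the result is empty because no pixel moved into it. Holes like this are why flow alone is not enough and a generative model is still needed to synthesize the missing content.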

T2V -> Improved T2V

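Starting from a pre-trained T2V model, fine-tuning makes it better in certain respects and turns it into a stronger base model

Source of motion information: text
Source of appearance information: text

Required data

What kind of data does it need for training? (images, 3D models, motion capture data, simulation data?) Is it supervised, unsupervised or self-supervised learning?

Video data, or paired data of control signals and videos.

Application scenarios and cases

Academia

At top venues such as SIGGRAPH, where is this technique used most often? Find one or two examples from papers.

Besides generating video, text-to-video models are often used to help generate other content, for example producing character motion by first generating a video of the character and then extracting the motion from it. Because video generation models are trained on large amounts of real video, they carry visual priors and can supply extra information when another task lacks training data.

Industry

Has any company productized it?

Company / organization	Model name / series	Key features / positioning
Tencent	Hunyuan video series (HunyuanCustom, image-to-video, etc.)	Strong subject consistency, multimodal control (e.g. image-to-video, audio-driven digital humans), and active open-sourcing.
Alibaba	Tongyi Wanxiang (Wan2.5-preview)	Supports synchronized audio-visual generation, producing matching voice, sound effects and music in one pass.
Zhipu AI	CogVideo series	An early representative Chinese video generation model, later upgraded to versions such as CogVideoX.
Luma AI	Ray3	Emphasizes reasoning ability: it understands complex instructions and performs physical simulation, and supports 4K HDR video output.
OpenAI	Sora	Generated video stands out for realism and coherence and can simulate the real physical world, but it is not yet open to the public.
Runway	Gen series (e.g. Gen-3)	Excels at cinematic quality and motion control, and is popular with many video creators.
Google	Veo	Deeply integrated with products such as YouTube. Supports high-quality, long-duration video.
Stability AI	Stable Video	Built on its image generation technology. Being open source is a key trait, which makes it easy for developers to study and customize.
Meta	Make-A-Video	Drawing on its vast social data, aims to generate short videos directly from text or images.
ByteDance	Boximator	Precisely guides object motion in video through fine-grained controls such as box selection.
Kunlun Wanwei	Tiangong SkyVideo	Supports text-to-video, image-to-video and other modes, aiming at high-quality video content.

Generation is slow
Cannot generate very long videos

Future trends: where might its next breakthrough be?

More controllable editing of video.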