评价指标

图像质量、视频质量、一致性、多样性、美学和动作准确性

Image-level Evaluation Metrics

Fréchet Inception Distance (FID, ↓): semantic similarity between images
Peak Signal-to-Noise Ratio (PSNR, ↑): pixel-level similarity between images
Structural Similarity Index (SSIM, ↓): pixel-level similarity between images
CLIPSIM (↑): image-text relevance

Fréchet Inception Distance (FID)

✅ FID：评估两个 distribution 的差距有多大。
✅ 由于使用了网络的高层 feature，可以评价 high／evel 的语义相似性。

✅ CNN＋Softmax 是一个预训练好的图像分类网络，取 softmax 上一层做为图像的 feature.
✅ 取大量真实图像的 feature 和预训练模型生成的图 feature.
✅ 假设两类图像的 feature 各自符合高斯分布，计算两个分布的距离。
✅ 优点：评价结果与人类直觉很接近，缺点：需要大量 sample.

P49

Peak Signal-to-Noise Ratio (PSNR)

Pixel-level similarity between images

For two images \(x,y \text{ of shape } M\times N\):

\begin{align*} \mathrm{PSNR} (x,y) = 10 \log_{10}{} \frac{255^2}{\mathrm{MSE} (x,y)} \end{align*}

where

\begin{align*} \mathrm{MSE} (x,y) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} (x_{ij}-y_{ij})^2\end{align*}

P50

Structural Similarity Index Measure (SSIM)

Pixel-level similarity between images

Model any image distortion as a combination of:
(1) loss of correlation, (2) luminance distortion, (3) contrast distortion
For two images \(x,y \text{ of shape } M\times N\):

\begin{align*} \mathrm{SSIM} (x,y)=l(x,y)\cdot c(x,y)\cdot s(x,y)\end{align*}

where

\begin{align*} \begin{cases} \text{Lumiannce Comparison Funckon:} l(x,y)=\frac{2\mu _x\mu _y+C_1}{\mu _x^2+\mu _y^2+C_1} \\ \text{Contrast Comparison Funckon:} c(x,y)=\frac{2\sigma _x\sigma _y+C_2}{\sigma _x^2+\sigma _y^2+C_2} \\ \text{Structure Comparison Funckon:} s(x,y)=\frac{\sigma _{xy}+C_3}{\sigma _{x}\sigma _{y}+C_3} \end{cases}\end{align*}

P51

CLIP Similarity

✅ CLIP Score，衡量与文字的匹配度。

Year	Name	Link
2017	GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium	link
2023	Hung-Yi Lee, “Machine Learning 2023 Spring,” National Taiwan University.
2010	Horé et al., “Image Quality Metrics: PSNR vs. SSIM,”
2004	Wang et al., “Image Quality Assessment: from Error Visibility to Structural Similarity,”
2021	Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,”

Video-level Evaluation Metrics

Fréchet Video Distance (FVD, ↓): semantic similarity & temporal coherence
Kernel Video Distance (KVD, ↓): video quality (via semantic features and MMD)
Video Inception Score (IS, ↑): video quality and diversity
Frame Consistency CLIP Score (↑): frame temporal semantic consistency

P52

Fréchet Video Distance (FVD)

Semantic similarity and temporal coherence between two videos

P53

Kernel Video Distance

Video quality assessment via semantic features and MMD

P54

Video Inception Score (IS)

Video quality and diversity

✅ 多样性，在不给定 condition 的情况生成的分布的多样性。
✅ 质量：在给 condition 的条件下应生成特定的类别。

P55

Frame Consistence CLIP scores

Frame temporal semantic consistency

Compute CLIP image embeddings for all frames
Report average cosine similarity between all pairs of frames

ID	Year	Name	Note	Tags	Link
	2019	Unterthiner et al., “FVD: A new Metric for Video Generation,”
	2018	Unterthiner et al., “Towards Accurate Generative Models of Video: A New Metric & Challenges,”
	2016	Salimans et al., “Improved Techniques for Training GANs,”
	2018	Barratt et al., “A Note on the Inception Score,”
	2020	Saito et al., “Train Sparsely, Generated Densely: Memory-Efficient Unsupervised Training of High-Resolution Temporal GAN,”
	2021	Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,”

P57

主观评价

Hybrid evaluation：EvalCrafter

Creates a balanced prompt list for evaluation
Multi-criteria decision analysis on 18 metrics: visual quality, content quality…
Regress the coefficients of all metrics to generate an overall score aligned with user opinions

ID	Year	Name	Note	Tags	Link
	2023	Liu et al., “EvalCrafter: Benchmarking and Evaluating Large Video Generation Models,”

P45

Datasets

The WebVid-10M Dataset

Bain et al., “Frozen in Time: A Joint Video and Image Encoder for End to End Paper,” ICCV 2021.

✅ WebVid 是常用的视频数据集，有高清视频及配对文本。

ImportantArticles