Diffusion Models for Adversarial Purification


1. Introduction

  • There are two main approaches to defending against adversarial attacks.

    1. Adversarial training: The model is trained on adversarial samples so that it learns their features.
      • Pros: Achieves SOTA in almost all cases.
      • Cons: It only defends against the kinds of adversarial examples used during training. It also needs a separate training process, so it is computationally expensive.
    2. Adversarial purification: A generative model first purifies the adversarial example, and the purified sample is then classified.
      • Pros: It works in a plug-and-play manner, so unlike adversarial training it needs no separate training process.
      • Cons: Because of problems inherent to generative models (mode collapse in the case of GANs), it performs worse than adversarial training.
  • A diffusion model (DM hereafter) is a likelihood-based model with several advantages over GANs. Because of these advantages, it is expected to purify better.

  • The paper makes the following contributions.

    1. It proposes DiffPure, which performs adversarial purification through the forward and reverse processes of a pretrained DM.
    2. It analyzes how the forward and reverse processes of a DM effectively remove adversarial perturbations while preserving label semantics.
    3. To evaluate the defense against strong adaptive attacks, it proposes an efficient way to compute the full gradient of the reverse process.
    4. It shows through a range of experiments that DiffPure reaches SOTA on adaptive attack benchmarks.

2. Background

This section briefly reviews the continuous-time diffusion model proposed in the paper 'Score-based generative modeling through stochastic differential equations'. The forward diffusion process comes first: an SDE that gradually perturbs data into noise.
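For reference, the forward SDE is usually written in the following form. The notation is assumed from Song et al.: $f$ is the drift coefficient, $g$ the diffusion coefficient, and $w(t)$ a standard Wiener process.

$$dx = f(x, t)\,dt + g(t)\,dw, \quad t \in [0, 1]$$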

There are two main types of DM: VE-SDE and VP-SDE. This paper uses VP-SDE as the purification model.

Next, sample generation is performed by solving the reverse-time SDE.
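In the same assumed notation, with $p_t$ the marginal density of $x(t)$ and $\bar{w}(t)$ a reverse-time Wiener process, the reverse-time SDE takes the standard form below. In practice the unknown score $\nabla_{\hat{x}} \log p_t(\hat{x})$ is replaced by the learned $s_\theta(\hat{x}, t)$.

$$d\hat{x} = \left[ f(\hat{x}, t) - g(t)^2 \nabla_{\hat{x}} \log p_t(\hat{x}) \right] dt + g(t)\,d\bar{w}$$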

The DM is trained with denoising score matching, which fits the score function $s_\theta (\tilde{x}, t)$ as closely as possible to the score (the gradient of the log-density) of the transition probability.
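As a rough illustration of denoising score matching, the NumPy sketch below estimates the loss at a single time $t$ for the VP-SDE perturbation kernel, whose score is $-(\tilde{x} - \sqrt{\alpha(t)}\,x) / (1 - \alpha(t))$. The linear $\beta(t)$ schedule (`BETA_MIN`, `BETA_MAX`), the weighting $\lambda(t) = 1 - \alpha(t)$, and the function names are assumptions chosen for the example, not details taken from the paper.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear VP-SDE noise schedule


def alpha(t):
    """alpha(t) = exp(-integral_0^t beta(s) ds) for the linear schedule."""
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2))


def dsm_loss(score_fn, x, t, rng):
    """Monte Carlo estimate of the denoising score matching loss at time t.

    score_fn(x_tilde, t) plays the role of s_theta (hypothetical callable).
    x has shape (batch, dim).
    """
    a = alpha(t)
    eps = rng.standard_normal(x.shape)
    # Sample x_tilde from the transition kernel p_{0t}(x_tilde | x).
    x_tilde = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps
    # Score of the Gaussian transition kernel, used as the regression target.
    target = -(x_tilde - np.sqrt(a) * x) / (1.0 - a)
    weight = 1.0 - a  # lambda(t), an assumed (but common) weighting
    residual = score_fn(x_tilde, t) - target
    return weight * np.mean(np.sum(residual ** 2, axis=-1))
```

When every training point is the same vector, the kernel score is also the exact marginal score, so a perfect `score_fn` drives this loss to zero; that makes a convenient sanity check.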

3. Method

3.1 Diffusion purification

This section argues that if noise is added to an adversarial example $x_a$ through the forward process and then removed through the reverse process, the adversarial features of the example disappear and it turns into a clean image. The paper states this claim as a theorem. (The proof is long, so it is omitted here.)


Let $\{x(t)\}_{t\in[0,1]}$ be the forward SDE, $p_t$ the distribution of $x(t)$ when it starts from clean data, and $q_t$ the distribution when it starts from adversarial samples. The theorem shows that $p_t$ and $q_t$ move closer to each other as the forward SDE proceeds.
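Stated as a formula, the theorem (Theorem 3.1 in the DiffPure paper, as this summary reads it) says the KL divergence between the two distributions is non-increasing in $t$, with equality only when $p_t = q_t$:

$$\frac{\partial D_{KL}(p_t \,\|\, q_t)}{\partial t} \leq 0$$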

Since this shows that the forward and reverse processes of the DM bring an adversarial example closer to a clean image, the two processes are described in turn.

  1. Forward process: The paper uses VP-SDE as the DM. With VP-SDE, the noised image at timestep $t^{\ast}$ can be computed in closed form.

  2. Reverse process: The reverse-time SDE is solved from $t^{\ast}$ back to 0 with the Euler-Maruyama solver, written as a call to sdeint.
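The two steps can be written as follows. The symbols $\alpha(t)$, $\beta(t)$, $\epsilon$ and $\bar{w}$ follow the usual VP-SDE notation and are assumed here rather than defined in this post.

$$x(t^{\ast}) = \sqrt{\alpha(t^{\ast})}\, x_a + \sqrt{1 - \alpha(t^{\ast})}\, \epsilon, \quad \alpha(t) = e^{-\int_0^t \beta(s)\,ds}, \quad \epsilon \sim \mathcal{N}(0, I_d)$$

$$\hat{x}(0) = \text{sdeint}\left(x(t^{\ast}), f_{rev}, g_{rev}, \bar{w}, t^{\ast}, 0\right)$$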

The arguments of sdeint are, in order, the initial value, drift coefficient, diffusion coefficient, Wiener process, initial time, and end time. $f_{rev}$ and $g_{rev}$ are the drift and diffusion coefficients of the reverse-time VP-SDE.
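For the VP-SDE, where the forward drift is $-\frac{1}{2}\beta(t)x$ and the forward diffusion coefficient is $\sqrt{\beta(t)}$, substituting the learned score into the reverse-time SDE gives:

$$f_{rev}(x, t) = -\frac{1}{2}\beta(t)\left[x + 2 s_\theta(x, t)\right], \qquad g_{rev}(t) = \sqrt{\beta(t)}$$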

The key question is how to choose the diffusion timestep $t^{\ast}$. A second theorem guides this choice.

Assuming the score function satisfies $\lVert s_\theta (x,t) \rVert \leq \frac{1}{2}C_s$, the paper derives an upper bound on the L2 distance between the clean data $x$ and the purified data $\hat{x}(0)$.
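As this summary reads Theorem 3.2 of the paper (worth checking against the original before relying on the constants), the bound holds with probability at least $1 - \delta$, where $\epsilon_a = x_a - x$ is the adversarial perturbation and $d$ is the data dimension:

$$\lVert \hat{x}(0) - x \rVert \leq \lVert \epsilon_a \rVert + \sqrt{e^{2\gamma(t^{\ast})} - 1}\; C_\delta + \gamma(t^{\ast})\, C_s$$

$$\gamma(t^{\ast}) = \int_0^{t^{\ast}} \frac{1}{2}\beta(s)\,ds, \qquad C_\delta = \sqrt{2d + 4\sqrt{d \log\frac{1}{\delta}} + 4\log\frac{1}{\delta}}$$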

The purified image should keep label semantics as close as possible to those of the clean image, so we want a $t^{\ast}$ that makes $\lVert \hat{x}(0)-x \rVert$ small. That requires $\gamma(t^{\ast})$ to be small, which in turn means $t^{\ast}$ itself must be small, so $t^{\ast}$ should be set as small as possible.

Taken together, the first and second theorems show a trade-off governed by $t^{\ast}$: a larger $t^{\ast}$ removes more of the perturbation from the adversarial sample, while a smaller $t^{\ast}$ better preserves the label semantics of the purified image. Choosing a good $t^{\ast}$ is therefore important. Because the paper notes that perturbations are small in most cases, it sets $t^{\ast}$ to a small value.
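As a rough illustration of Section 3.1, the NumPy sketch below runs the closed-form forward noising to $t^{\ast}$ followed by an Euler-Maruyama pass over the reverse-time VP-SDE. It is not the paper's implementation: the linear $\beta(t)$ schedule (`BETA_MIN`, `BETA_MAX`), the helper names, the default `t_star=0.1`, and the analytic Gaussian score that stands in for the trained $s_\theta$ are all assumptions made for the example.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear VP-SDE noise schedule


def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def alpha(t):
    """alpha(t) = exp(-integral_0^t beta(s) ds) for the linear schedule."""
    return np.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2))


def make_gaussian_score(mu, sigma):
    """Exact score of p_t when clean data ~ N(mu, sigma^2 I).

    A toy stand-in for the trained network s_theta(x, t).
    """
    def score(x, t):
        a = alpha(t)
        var = a * sigma ** 2 + (1.0 - a)
        return -(x - np.sqrt(a) * mu) / var
    return score


def purify(x_a, score, t_star=0.1, n_steps=100, rng=None):
    """Diffuse x_a forward to t_star, then integrate the reverse SDE to 0."""
    rng = np.random.default_rng() if rng is None else rng
    # Forward process: closed-form sample of x(t_star) given x_a.
    a = alpha(t_star)
    x = np.sqrt(a) * x_a + np.sqrt(1.0 - a) * rng.standard_normal(x_a.shape)
    # Reverse process: Euler-Maruyama steps backward from t_star to 0.
    dt = t_star / n_steps
    for i in range(n_steps):
        t = t_star - i * dt
        f_rev = -0.5 * beta(t) * (x + 2.0 * score(x, t))
        g_rev = np.sqrt(beta(t))
        # Time runs backward, so the drift enters with a minus sign.
        x = x - f_rev * dt + g_rev * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x
```

With clean data concentrated around `mu`, an input `mu + delta` passed through `purify` should come back much closer to `mu`. Raising `t_star` removes more of the perturbation, but on real images it also destroys more of the label semantics, which is the trade-off described above.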

3.2 Adaptive attack to diffusion purification

A strong adaptive attack needs the full gradient through the SDE solver. Backpropagating through every solver step is very expensive, however, so the paper uses the adjoint method, which computes the gradient with $O(1)$ memory by solving an augmented SDE.

The adjoint method is described in detail in the paper 'Scalable gradients for stochastic differential equations'.

4. Experiments

DiffPure is compared with the SOTA models on RobustBench, a standard benchmark for evaluating model robustness.

Experimental settings

  1. Datasets: CIFAR-10, CelebA-HQ, and ImageNet.
  2. Classifiers: ResNet, WideResNet, and ViT.
  3. Adversarial attacks: AutoAttack with $l_\infty$ and $l_2$ threat models, StAdv, and BPDA+EOT.
  4. Evaluation metrics: standard accuracy, measured on clean data, and robust accuracy, measured on adversarial examples after purification.

Comparison with the state-of-the-art

The first result is for AutoAttack $l_\infty (\epsilon = 8/255)$ on CIFAR-10. DiffPure achieves SOTA on every evaluation metric.

The next result is for AutoAttack $l_2 (\epsilon = 0.5)$ on CIFAR-10 with different classifiers. Even without training on additional data, DiffPure performs comparably to or better than the SOTA models.

The last result is for AutoAttack $l_\infty (\epsilon = 4/255)$ on ImageNet. Again, DiffPure is SOTA with every classifier.

Defense against unseen threats

The main weakness of adversarial training is that it is vulnerable to attacks it was not trained on. The results table compares DiffPure with adversarially trained models on CIFAR-10, with gray text marking the attack each model was trained against. DiffPure shows high accuracy against every attack.

Comparison with other purification models

DiffPure is also compared with other purification methods, including GAN-based ones. These baselines contain an optimization or sampling loop that makes white-box adaptive attacks such as AutoAttack hard to apply, so the BPDA+EOT attack is used instead. The datasets are CelebA-HQ and CIFAR-10, and the results are reported in the comparison table.

Here ENC and OPT refer to encoder-based and optimization-based GAN inversion, respectively. On both datasets, DiffPure reaches SOTA robust accuracy.

5. Conclusion

  • The paper proposes DiffPure, a new method that purifies adversarial examples before they are passed to the classifier.

  • Evaluating robustness against white-box adaptive attacks requires the full gradient, and the paper uses the adjoint method to compute it.

  • Experiments compare DiffPure with other SOTA methods. The datasets are CIFAR-10, ImageNet, and CelebA-HQ, and the classifiers are ResNet, WideResNet, and ViT. DiffPure outperforms all of them in robust accuracy.

  • However, the method has the following two limitations.

    1. Purification takes too long.
    2. Diffusion models are very sensitive to image colors, so the method cannot effectively defend against color-related attacks.
