DefendaGAN

Leveraging generative adversarial networks for robust training of neural networks against adversarial attacks.

Overview

Deep neural networks are notoriously vulnerable to adversarial examples — inputs with imperceptible perturbations that cause confident misclassification. This is more than an academic concern: drones, autonomous vehicles, and medical imaging systems all run classifiers that can be manipulated by carefully crafted inputs. DefendaGAN uses a Generative Adversarial Network to simultaneously generate adversarial examples and train a multi-class discriminator that is robust to them, all within a single framework.

This was a final project for EECE 7398 (Fall 2024), joint with Hanna Jiamei Zhang. Code is available on GitHub.

Background

Adversarial attacks on classifiers fall into a few well-known families:

  • Carlini–Wagner (CW) — solves a constrained optimization problem to find minimal perturbations that flip the label. Highly effective but slow.
  • Fast Gradient Sign Method (FGSM) — single-step attack along the sign of the loss gradient. Cheap but coarse.
  • Projected Gradient Descent (PGD) — iterative FGSM with projection back to an ε-ball. Strong baseline for evaluating robustness.
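
To make the difference between the two gradient-based attacks concrete, here is a minimal sketch on a toy two-feature logistic model rather than an MNIST network. The model, the weights, and the helper names (loss_grad_wrt_input, fgsm, pgd) are illustrative assumptions, not code from the project. Clipping to the valid pixel range is omitted for brevity.

```python
import math

def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of binary cross-entropy w.r.t. the input x (toy logistic model)."""
    p = 1.0 / (1.0 + math.exp(-logit(w, b, x)))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One step of size eps along the sign of the loss gradient."""
    g = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, step, n_steps):
    """Repeated small sign-gradient steps, projected back into the L-inf eps-ball."""
    x_adv = list(x)
    for _ in range(n_steps):
        g = loss_grad_wrt_input(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection: clip each coordinate to [x_i - eps, x_i + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical toy model and input with true label y = 1.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_fgsm = fgsm(w, b, x, y, eps=0.3)
x_pgd = pgd(w, b, x, y, eps=0.3, step=0.1, n_steps=10)
```

On a linear model both attacks end up at the same corner of the ε-ball. On a deep network the gradient changes between steps, which is where PGD's iterations pay off.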

The standard defense is adversarial training — training the classifier on adversarial examples it is expected to face. DefendaGAN replaces the hand-coded attack with a learned one: a generator that produces perturbations directly, while the discriminator learns to classify both clean and perturbed inputs correctly.

Architecture

The single-generator setup follows the NoiseGAN framework: the generator takes a latent noise vector z plus the original image x, and outputs a perturbation G(z) that gets added back to x. The discriminator is a multi-class classifier (10 MNIST classes) trained to assign the correct label to both x and x + G(z).

Single-generator architecture: the generator perturbs original images, the discriminator classifies both clean and perturbed versions.

The discriminator loss combines cross-entropy on clean and adversarial examples, while the generator loss is an MSE term that pushes the discriminator’s output away from the true label (with several variants for targeted vs. non-targeted attacks).
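
As a rough illustration of how these two losses pull in opposite directions, the sketch below computes both for a single example. The equal weighting of the clean and adversarial cross-entropy terms, the choice of a uniform-over-wrong-classes target for the untargeted variant, and all function names are assumptions made for this example. The project's exact formulation may differ.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    return -math.log(probs[label])

def discriminator_loss(probs_clean, probs_adv, label):
    """Cross-entropy on the clean input plus cross-entropy on the perturbed input."""
    return cross_entropy(probs_clean, label) + cross_entropy(probs_adv, label)

def untargeted_target(label, n_classes):
    """Zero mass on the true class, uniform mass on every wrong class (assumed variant)."""
    return [0.0 if k == label else 1.0 / (n_classes - 1) for k in range(n_classes)]

def generator_loss(probs_adv, target):
    """MSE between the discriminator's output on x + G(z) and the attack target."""
    return sum((p - t) ** 2 for p, t in zip(probs_adv, target)) / len(target)

label, n_classes = 3, 10
target = untargeted_target(label, n_classes)

# Discriminator output on a perturbed "3": still confident vs. fooled.
probs_robust = softmax([5.0 if k == label else 0.0 for k in range(n_classes)])
probs_fooled = softmax([-3.0 if k == label else 1.0 for k in range(n_classes)])

g_loss_robust = generator_loss(probs_robust, target)
g_loss_fooled = generator_loss(probs_fooled, target)
d_loss_robust = discriminator_loss(probs_robust, probs_robust, label)
d_loss_fooled = discriminator_loss(probs_robust, probs_fooled, label)
```

The generator's loss is low exactly when the discriminator's loss on the perturbed input is high, which is the zero-sum tension the training schedule has to manage.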

Contributions

1. Training the generator to “completion”

On a dataset as simple as MNIST, the discriminator dominates the zero-sum game almost immediately, and the generator never produces meaningful adversarial examples. The fix: instead of one generator step per discriminator step, let the generator train continuously until discriminator accuracy drops below 0.8, then alternate. This is loosely inspired by PGD’s iterative refinement.
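
The schedule can be sketched as a small control loop. This is one reading of the description above, not the project's training code: the three callbacks, the safety cap max_generator_steps, and the toy accuracy dynamics (tracked in integer percent to avoid floating-point drift) are all hypothetical.

```python
def train_to_completion(generator_step, discriminator_step, adversarial_accuracy,
                        n_rounds, threshold, max_generator_steps=1000):
    """Per round: generator steps until accuracy < threshold, then one discriminator step."""
    history = []
    for _ in range(n_rounds):
        generator_step()
        g_steps = 1
        # Keep training the generator while the discriminator is still winning.
        while adversarial_accuracy() >= threshold and g_steps < max_generator_steps:
            generator_step()
            g_steps += 1
        discriminator_step()
        history.append(g_steps)
    return history

# Toy stand-in for real training: accuracy on perturbed inputs, in percent.
state = {"acc": 99}

def toy_generator_step():
    state["acc"] -= 5                            # generator gets a bit stronger

def toy_discriminator_step():
    state["acc"] = min(100, state["acc"] + 10)   # discriminator recovers

def toy_accuracy():
    return state["acc"]

history = train_to_completion(toy_generator_step, toy_discriminator_step,
                              toy_accuracy, n_rounds=3, threshold=80)
```

In this toy run the generator needs four steps in the first round to catch up and two per round afterwards. Once the generator is competitive, the loop degenerates into near-ordinary alternation.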

Comparing a "6" image processed by a disabled generator, a generator trained to completion, and one not trained to completion. The completion-trained generator produces much more structured perturbations.

2. Semi-targeted attacks

Standard targeted attacks aim at a fixed wrong class; non-targeted attacks distribute probability uniformly over wrong classes. We propose a middle ground: assign probability to wrong classes in proportion to the discriminator’s current second-, third-, and fourth-place predictions (and so on down the ranking), with the weight decaying exponentially. The intuition is that the discriminator should be easier to push toward classes it already considers plausible.
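
One way to build such a target distribution is sketched below. The specifics are assumptions for illustration: wrong classes are ranked by the discriminator's current probability, only the top three receive mass, and each rank gets half the weight of the one before it (decay = 0.5). The function name and these defaults are hypothetical, not taken from the project.

```python
def semi_targeted_target(probs, label, top_k=3, decay=0.5):
    """Exponentially decaying target mass over the most plausible wrong classes."""
    wrong = [k for k in range(len(probs)) if k != label]
    # Rank wrong classes by how plausible the discriminator currently finds them.
    ranked = sorted(wrong, key=lambda k: probs[k], reverse=True)[:top_k]
    weights = {k: decay ** rank for rank, k in enumerate(ranked)}
    total = sum(weights.values())
    return [weights.get(k, 0.0) / total for k in range(len(probs))]

# Hypothetical discriminator output for a "4" that looks a little like a 9 or a 7.
probs = [0.01, 0.05, 0.01, 0.01, 0.60, 0.01, 0.01, 0.10, 0.01, 0.19]
target = semi_targeted_target(probs, label=4)
```

Here class 9 receives the largest share of the target mass, followed by 7 and then 1, while the true class and all remaining classes receive none.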

3. Multiple generators with diversity loss

To cover a wider range of attack styles, we train multiple generators simultaneously, each with a different ℓₚ-ball regularization strength. A pairwise KL-divergence diversity term in each generator’s loss encourages the generators to produce different attacks rather than collapsing to the same mode.
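
A pairwise KL diversity term might look like the following sketch. It assumes each generator's attack is summarized as a probability distribution (for instance, the discriminator's softmax output on that generator's perturbed image) and that the term is the negated mean KL divergence to the other generators, so that minimizing it drives the generators apart. Both choices, and the function names, are assumptions. The text above does not pin down which distributions are compared.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def diversity_loss(dists, i):
    """Negated mean KL from generator i to every other generator."""
    others = [kl_divergence(dists[i], dists[j]) for j in range(len(dists)) if j != i]
    return -sum(others) / len(others)

# Hypothetical per-generator distributions over three classes.
collapsed = [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.7, 0.2, 0.1]]
diverse = [[0.7, 0.2, 0.1], [0.1, 0.7, 0.2], [0.2, 0.1, 0.7]]

loss_collapsed = diversity_loss(collapsed, 0)
loss_diverse = diversity_loss(diverse, 0)
```

When all generators agree the term contributes nothing, and it becomes more negative (a lower loss) as generator 0 moves away from the others.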

Multi-generator architecture: several generators compete to produce adversarial examples; the one that maximally disrupts the discriminator is selected at each step.
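
The selection step can be sketched as picking the generator whose perturbed batch the discriminator handles worst. Measuring "disruption" by batch accuracy is an assumption made here (loss would be an equally plausible criterion), and select_generator is a hypothetical helper.

```python
def select_generator(labels, predictions_per_generator):
    """Return the index of the most disruptive generator and all batch accuracies."""
    accuracies = [
        sum(p == y for p, y in zip(preds, labels)) / len(labels)
        for preds in predictions_per_generator
    ]
    # Lowest discriminator accuracy means the most disruptive generator.
    best = min(range(len(accuracies)), key=accuracies.__getitem__)
    return best, accuracies

# Hypothetical discriminator predictions on batches perturbed by three generators.
labels = [3, 1, 4, 1]
predictions = [
    [3, 1, 4, 1],  # generator 0: no effect
    [3, 7, 4, 9],  # generator 1: two labels flipped
    [3, 1, 4, 7],  # generator 2: one label flipped
]
best, accuracies = select_generator(labels, predictions)
```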

Results

Single generator

The DefendaGAN-trained discriminators significantly outperform an adversary-unaware MNIST classifier under FGSM, PGD, and CW attacks at ε = 0.3. Where the baseline classifier had attack success rates (ASR) approaching 100%, the GAN-trained discriminators kept ASR below 30% across most settings, with only a small accuracy drop on clean data.
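
For reference, attack success rate can be computed as in the sketch below. It uses one common convention, assumed here rather than taken from the project: only inputs the classifier gets right before the attack are counted, and an attack succeeds when the prediction on the perturbed input is wrong.

```python
def attack_success_rate(labels, clean_preds, adv_preds):
    """Fraction of initially correct inputs that are misclassified after the attack."""
    attacked = [(y, a) for y, c, a in zip(labels, clean_preds, adv_preds) if c == y]
    if not attacked:
        return 0.0
    return sum(a != y for y, a in attacked) / len(attacked)

# Hypothetical predictions: four inputs are correct when clean, two of those get flipped.
labels = [0, 1, 2, 3, 4]
clean_preds = [0, 1, 2, 3, 9]
adv_preds = [0, 8, 2, 5, 9]
asr = attack_success_rate(labels, clean_preds, adv_preds)
```

Under this convention the last input does not count toward the rate, since it was already misclassified without any perturbation.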

The qualitative behavior of the generators is also interesting. Untargeted, non-visually-similar generators produce noise-like perturbations that lighten and darken regions. Untargeted visually-similar generators desaturate the digit. Targeted-to-7 generators produce slanted-stripe patterns that look like overlaid 7’s at multiple positions — likely an artifact of the convolutional discriminator’s translation invariance.

Left: generator outputs across attack variants (A original; B untargeted, non-similar; C untargeted, similar; D targeted-7, non-similar; E targeted-7, similar). Right: comparison of DefendaGAN outputs with NoiseGAN.

Discriminator accuracy during training across the four single-generator variants. Untargeted attacks initially fool the discriminator more than targeted attacks, consistent with untargeted attacks moving toward the nearest decision boundary.

Multi-generator

The 5-generator system yields a discriminator that retains 89% accuracy under FGSM and 70% under PGD. The diversity loss successfully separates the generators’ output domains, and each generator produces a visually distinct attack style owing to its different regularization budget.

Left: max/min perturbed accuracy, diversity losses, and FGSM/PGD perturbed accuracy during training. Right: distinct attack styles produced by three of the five generators.

Future Work

Natural extensions: scaling to harder datasets like CIFAR-10, comparing DefendaGAN-trained classifiers against classifiers adversarially trained on FGSM/PGD perturbations directly, and a more thorough hyperparameter sweep of the multi-generator configuration (number of generators, regularization schedule, KL weight).

Resources