Multi-Intention Inverse Reinforcement Learning

Disentangling internally misaligned demonstrations and learning multiple reward functions using gradient-based classification with MaxEnt IRL.

Overview

When learning from demonstrations, you usually assume the demonstrators all want the same thing. In practice, that assumption breaks: different people have different preferences, biases, and goals, and trying to fit a single reward function to a misaligned mixture often produces a policy that satisfies no one. Multi-Intention IRL (MIIRL) addresses this by learning K reward functions from N demonstrations, partitioning the demonstrations by which underlying intent they reflect.

This project, joint with Misha Lvovsky, proposes a framework for augmenting standard IRL algorithms into MIIRL algorithms by using the gradients of the candidate reward functions to classify demonstrations. We apply this approach to Maximum Entropy IRL and characterize when it works, when it fails, and why.

Background

The MIIRL problem learns K reward functions (intents) from a set of N demonstrations, where each demonstration is assumed to correspond to one of the K classes. Prior approaches mostly use Expectation-Maximization in the inner loop — Babes-Vroman et al. combine EM with MaxEnt; Bighashdel et al. extend it to deep networks; Hausman et al. add a latent intent vector to GAIL.

EM requires running full IRL inside each iteration, which is expensive. Our approach sidesteps this by using a single quantity each step — the gradient magnitude of each demonstration with respect to each candidate reward — to softly classify demonstrations into K sets, then runs the standard MaxEnt update on the resulting weighted mixture.

Method

For MaxEnt IRL specifically, the gradient of the log-likelihood for a demonstration under reward parameters θ is the difference between the demonstration’s empirical feature counts and the feature counts under the policy induced by θ. Demonstrations whose feature counts are closer to those produced by reward k will have smaller gradient magnitudes under θₖ. We exploit this by computing per-demonstration, per-reward gradient magnitudes, applying a softmin to get assignment probabilities, and using those probabilities to weight the feature-count averages used in the MaxEnt update.

The general framework:

  1. Compute per-demonstration, per-reward gradients
  2. Convert gradient magnitudes to soft cluster assignments via softmin
  3. Build K weighted datasets from the assignment logits
  4. Apply each IRL update using its weighted dataset
  5. Repeat until convergence

The full algorithm is general: any IRL method whose gradient is informative about per-demonstration fit can be extended this way. We instantiate it with MaxEnt for tractability.

Experimental Setup

Experiments use 15×15 randomly generated gridworld environments built on OpenAI Gym, with 5 features distributing color-coded rewards across cells. Ground-truth reward functions are generated as small perturbations around a baseline vector — typically perturbation magnitude m = 0.1 to mimic the close-but-not-identical preferences of real demonstrators. We use 64 demonstrations per reward function.

A randomly generated gridworld. The agent starts at the top-left (black square); cell colors correspond to features and their rewards.

Findings

Performance is bimodal

On unconstrained random gridworlds and reward pairs, the algorithm either converges cleanly (cross-entropy near zero) or fails to disentangle the demonstrations at all. There is little middle ground.

Distributions of classification cross-entropy across many runs. Left: random reward proposals. Right: orthogonally initialized proposals. The tall bar at zero represents converged runs; the rest are failures.

Failure modes are pair-dependent, not environment-dependent

Boxplots across many environments with different reward pairs show only mild variation in median performance — no individual gridworld is consistently “bad.” A heatmap of (environment, reward) pairs reveals that the difficulty is in the interaction between specific environments and specific reward functions, not in either alone.

Left: cross-entropy by environment. Middle: best vs. worst sampled environments. Right: a heatmap of environment-reward pairs showing that difficulty is mutual — certain pairs are hard, but no row or column dominates.

Demonstration feature counts can be nearly indistinguishable

For small reward perturbations, the feature-count distributions of the two demonstration classes overlap heavily. When the underlying rewards are too similar, no amount of clever classification will separate them — the signal isn’t there.

Pairplots of feature counts for two reward classes (blue and orange), at perturbation 0.1. From left to right: best-case, median, and worst-case separation. The marginal distributions overlap heavily across the board.

Initialization matters more than expected

Comparing random reward proposals against orthogonal proposals (where the second proposal is the negation of the first) showed dramatic differences. Orthogonal initialization consistently produced lower cross-entropy, even when the true reward functions were similar. The intuition: starting with proposals that are far apart gives the algorithm room to separate demonstrations before the proposals collapse toward each other.

Random-initialization results, alongside the orthogonal-initialization results above. Random initialization yields heavier tails and more failed runs.

Reward similarity is the dominant difficulty axis

Sweeping the perturbation magnitude from 0.01 to 1.0 confirms what theory suggests: the closer the two ground-truth reward functions, the harder the problem. At perturbation 1.0 the runs converge cleanly almost every time; at 0.01–0.1 (the realistic regime for human-generated data) the bimodal failure mode dominates.

Left: cross-entropy at perturbation 0.1. Middle: at perturbation 1.0 the failures largely disappear. Right: cross-entropy distribution across the full perturbation sweep — performance improves monotonically with separation between rewards.

Conclusions

The gradient-augmented MIIRL framework works, but its sensitivity to initialization and reward similarity is significant. The bimodal failure pattern is largely a property of MaxEnt’s optimizer — converged solutions are correct, but bad initializations get stuck. Most environment-reward combinations are learnable in principle; getting there requires either orthogonal initialization or rewards separated by more than ~0.1.

For real-world deployment, this suggests two practical levers: (1) use orthogonal or otherwise diverse reward proposals, and (2) be aware that the method has a floor on how similar the underlying intents can be while remaining recoverable.

Future Work

The most natural extension is applying the gradient-augmentation framework to a stronger IRL backbone — AIRL is the obvious target, since its gradient also carries per-demonstration information about reward fit. Beyond that, applying this to deliberately biased demonstrator pools (left-handed vs right-handed, novice vs expert) would test whether the method recovers semantically meaningful intents from real human data. There’s also a connection to KL-constrained preference learning for LLMs worth exploring, where MIIRL-style separation could let preference budgets be allocated per niche rather than globally averaged.

Resources