Replacing RL w/ Parameter-based Evolutionary Strategies

Logan Riggs

I want to highlight this paper (from Sept 29, 2025) on an alternative to RL (for fine-tuning pre-trained LLMs) which:

Performs better
Requires less data
Consistent across seeds
Robust (ie don't need to do a grid search on your hyperparameters)
Less "Reward Hacking" (eg when optimizing for conciseness, it naturally stays close to the original model, ie low KL-divergence)

They claim the magic sauce behind all this is the evolutionary strategy optimizing over distributions of model parameters. Surprisingly, they've scaled this to optimize over billion-parameter models.

Let's get into their method.

Evolutionary Strategy (ES) Algorithm

They start w/ a "Basic ES Algorithm" which is:

In other words, we're gonna sample noise around the original model's weights N times (ie we're going to explore around the model weights, with each noise vector εₙ drawn from a Gaussian N(0, I), where I is the identity covariance).

[Below is an example explaining more in depth, feel free to skip to the next section if you get it already]

We get the reward for each of these noises/perturbations εₙ, where the full perturbed model weights are θ + σ·εₙ and σ is a learnable hyperparameter to scale the noise (ie increase the variance).

We normalize these rewards, giving a list of high-to-low rewards that sum to 1. Importantly, we don't keep only the best reward. Instead we move towards each perturbation weighted by its reward.

Suppose we have 4 perturbations which got rewards [20, 10, 6, 4]. We normalize to get:

R = [0.5, 0.25, 0.15, 0.10]

These determine how much we move towards each perturbation. For the sum:

$θ_{new} = θ + α * (1 / N) * Σ_{n = 1}^{N} R_{n} * ε_{n}$

Where $ε_{n}$ is the noise perturbation for sample n (so the perturbed weights are θ + σ·$ε_{n}$), we get:

$θ_{new} = θ + α * (1 / 4) * (0.5 * ε_{1} + 0.25 * ε_{2} + 0.15 * ε_{3} + 0.10 * ε_{4})$

This clearly shows we're weighting towards the higher-reward solutions. It's also clear we're optimizing over a distribution of weights (given by the perturbations + original weights).
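Here is a minimal NumPy sketch of that update, using the same four rewards. The 3-parameter "model", the seed, and the value of α are made up for illustration, and the sum-to-one normalization follows this worked example rather than the paper's final implementation.

```python
import numpy as np

def es_step(theta, epsilons, rewards, alpha):
    """One basic ES update: move along each perturbation, weighted by its normalized reward."""
    rewards = np.asarray(rewards, dtype=float)
    weights = rewards / rewards.sum()                   # normalize so the weights sum to 1
    step = (weights[:, None] * epsilons).sum(axis=0)    # sum_n R_n * eps_n
    return theta + alpha * step / len(rewards), weights # the (1 / N) factor

rng = np.random.default_rng(0)            # arbitrary seed
theta = np.zeros(3)                       # toy 3-parameter "model" (illustrative only)
epsilons = rng.standard_normal((4, 3))    # eps_n ~ N(0, I), one row per perturbation
theta_new, weights = es_step(theta, epsilons, [20, 10, 6, 4], alpha=0.1)
```

Every perturbation contributes to the step; none is discarded, and the best one simply counts five times as much as the worst.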

New ES Implementation

They make 7 changes to make this scale to large LLMs:

Changes 1,2,3,6: Better GPU utilization
Change 4: Reward is normalized using Z-scores meaning mean of 0 and standard deviation of 1 (to keep "the reward scale consistent across iterations and tasks")
Change 5: Greedy decoding of the LLM to make it deterministic (this seems correct since they're sampling from nearby weights)
Change 7: They add a learnable hyperparameter that scales the learning rate (which they just fold into the learning rate)
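Change 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; the `zscore` helper name and the small `eps` guard against a zero standard deviation are assumptions added here.

```python
import numpy as np

def zscore(rewards, eps=1e-8):
    """Standardize rewards within one population: mean 0, standard deviation 1."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

z = zscore([20, 10, 6, 4])
z_scaled = zscore([200, 100, 60, 40])   # the same rewards on a 10x larger scale
```

Unlike sum-to-one weights, z-scores go negative for below-average perturbations, so the update moves away from them, and rescaling every reward by a constant leaves the weights essentially unchanged.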

And with just those changes, they achieve really impressive results w/ less compute.

Task 1: Countdown task

The Countdown task (Pan et al., 2025; Goodfellow et al., 2016) requires constructing an arithmetic expression from a given set of numbers using basic operations (+, −, ×, ÷) to match a target value. For instance, the target 950 can be obtained from {100, 50, 6, 3} with 100 × (6 + 3) + 50 = 950. This constitutes a compact test of constrained symbolic reasoning, i.e. an important use case for fine-tuning.
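To make the task concrete, here is a hypothetical checker for a Countdown answer. It assumes each given number must be used exactly once and returns a 0/1 reward; the paper's actual reward function may differ (for example, it may also score formatting), so treat `countdown_reward` as an illustration only.

```python
import ast
import operator
from collections import Counter

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(node, used):
    """Evaluate a parsed expression, recording every number literal it uses."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left = evaluate(node.left, used)
        right = evaluate(node.right, used)
        return OPS[type(node.op)](left, right)
    raise ValueError("only numbers and + - * / are allowed")

def countdown_reward(expr, numbers, target):
    """Return 1.0 if expr uses exactly the given numbers and hits the target, else 0.0."""
    used = []
    try:
        value = evaluate(ast.parse(expr, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
    same_numbers = Counter(used) == Counter(numbers)
    return 1.0 if same_numbers and abs(value - target) < 1e-9 else 0.0

r = countdown_reward("100 * (6 + 3) + 50", [100, 50, 6, 3], 950)  # 1.0
```

Parsing with `ast` instead of calling `eval` keeps arbitrary model output from executing as code.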

The results are much much better, especially for smaller models where RL typically fails.

And look at these training curves!

ES gets solid results in far fewer evaluations (0.5 on the plot is still 500k evaluations though!)

Task 2: Conciseness

Train the model w/ the reward only being the length of the response. In RL, there's typically reward hacking where the model does very short responses that don't answer the question. ES does have a couple of examples, but it's drastically lower!

As a quantitative check, they plot the conciseness reward vs KL-divergence (ie how far the output distribution diverges from the base model).
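For readers who want the KL-divergence spelled out, here is a toy computation over a three-token vocabulary. The distributions are invented, and the direction KL(fine-tuned || base) is an assumption made here; check the paper for the exact definition it plots.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

base = [0.7, 0.2, 0.1]      # base model's next-token distribution (made up)
near = [0.6, 0.3, 0.1]      # a fine-tuned model that stayed close
far = [0.05, 0.05, 0.9]     # a fine-tuned model that drifted a lot

kl_near = kl_divergence(near, base)
kl_far = kl_divergence(far, base)
```

A point with high reward and low KL has become concise while still behaving much like the base model; high reward with high KL is where reward hacking tends to show up.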

It does seem like a common theme that optimizing over a distribution leads to more conservative optimizations. This sort of rhymes w/ Attainable Utility Preservation, where optimizing for maintaining the ability to achieve many auxiliary goals leads to more conservative policies.

Future Work

This is still early work that mostly just advanced ES scaling. I'd be interested in work applying this to OpenAI's reward hacking environment to see if it helps. It's also nice that ES doesn't need huge hyperparameter sweeps and converges to the same solutions (at least in these examples), meaning it shouldn't be too much work to do.

From the paper:

One counterintuitive result is that the ES implementation only needs a population of 30 to effectively optimize billions of parameters. In contrast, previous work (Salimans et al., 2017; Zhang et al., 2017; Lehman et al., 2018; Lorenc & Neruda, 2025) used populations of 10,000 or more for models with millions or fewer parameters. An interesting future direction is to analyze how such small populations are possible. Perhaps this is related to the observed low intrinsic dimensionality of LLMs (Aghajanyan et al., 2021). Another promising direction is to use ES to perform unsupervised fine-tuning based on internal behaviors of LLMs, such as confidence calculated based on semantic entropy and semantic density (Qiu & Miikkulainen, 2024; Farquhar et al., 2024). Such fine-tuning cannot be done with RL, since action space exploration does not change the internal representations of LLMs (that is, each action sampling is generated via output distribution without changing the internal parameters). In a broader sense, since ES does not need process rewards during exploration, it may be a necessary ingredient for superintelligence (Mucci & Stryker, 2023), which would be difficult to achieve by supervised learning using process guidance from human data. Massive parallelization of ES will speed up exploration by distributing the computations across GPU machines or even data centers.

It's a pretty interesting paper, and I've definitely missed a few points. Do give it a read!

Tell me if this is stupid, but my first thought reading this post is that the ES algorithm is literally just estimating the gradient of the expected reward, $\nabla_\theta E[R]$, and then doing gradient descent on it, so I'd be really quite surprised if this leads to qualitatively different results wrt high level behaviors like reward-hacking.

(It might still exhibit efficiency gains or stuff like that though I'm not sure.)

One intuition I can offer is that you end up in wider basins of reward/loss landscape.

If you want to hit a very narrow basin, but your variance is too high, then you might not sample the high reward point.

Although, sampling enough points which do include the reward hacking weights will eventually center you on the reward hacking weights.

Suppose you sample 1k points, and one of them is the reward hacking weight with reward 100 (and the rest 1). Then you will move towards the reward hacking weight the most, which would make it more likely to be sampled the next time AFAIK. So maybe not??
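The arithmetic for that hypothetical is easy to check. The sketch below compares the two reward normalizations mentioned in the post (sum-to-one and z-score) on 999 samples with reward 1 and one with reward 100; it only computes the weights and says nothing about how likely such a sample is in practice.

```python
import numpy as np

rewards = np.ones(1000)
rewards[0] = 100.0                        # the single reward-hacking sample

sum_weights = rewards / rewards.sum()     # sum-to-one normalization
z_weights = (rewards - rewards.mean()) / rewards.std()   # z-score normalization

hack_share = sum_weights[0]   # 100 / 1099, roughly 9% of the total weight
hack_z = z_weights[0]         # sqrt(999), roughly 31.6
other_z = z_weights[1]        # slightly negative for every other sample
```

Under either scheme the single outlier gets far more weight than any other individual sample, which matches the worry that the update would be pulled towards it.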

The second intuition is the paths being substantially different, which can be quantified as well.

Why do you end up in wider basins in the reward/loss landscape? This method and eg policy gradient methods for LLM RLVR are both constructing an estimate of the same quantity. Are you saying this will have higher variance? You can control variance with normal methods, and typically you want low variance.

In general evolutionary methods reward hack just as much as RL I think.

EDIT: I think I misunderstood. As $\sigma \to 0$, you're just estimating $\nabla_\theta E[R]$. However, if it's not very small, you're optimizing a smoothed objective. So it makes sense to me that this would encourage "wider basins".
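For reference, this is the standard Gaussian-smoothing identity behind that point, written with $\sigma$ as the noise scale and $R(\theta)$ as the reward at weights $\theta$ (notation assumed here, not copied from the paper):

$$J_\sigma(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[R(\theta + \sigma\epsilon)\right]$$

$$\nabla_\theta J_\sigma(\theta) = \frac{1}{\sigma}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[R(\theta + \sigma\epsilon)\,\epsilon\right]$$

So a reward-weighted average of perturbations is a Monte Carlo estimate of the gradient of the smoothed objective $J_\sigma$ (up to the $1/\sigma$ factor, which can be absorbed into the learning rate). It approaches the gradient of $R$ itself only as $\sigma \to 0$, and only where that gradient exists.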

That said, I'm still skeptical that this would lead to less reward hacking, at least not in the general case. Reward hacking doesn't really seem like a more "brittle" strategy in general. What makes me skeptical is that reward hacking is not a natural category from the model's or the reward function's perspective, so it doesn't seem plausible to me that it would admit a compact description, like how sensitive the solution is to perturbations in parameter space.

Would be interesting to empirically check the reward surrounding reward hacking solutions. Should be able to plot the reward against variance and see if that’s different than other spots.
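As a sketch of what that check could look like, here is the same measurement on a toy 1-D reward landscape with a tall narrow peak and a lower wide peak. Every number here is invented, and the toy says nothing about whether real reward-hacking solutions are actually narrow; it only shows the shape of the experiment.

```python
import numpy as np

def reward(x):
    """Toy 1-D landscape: a tall narrow peak at x = -2 and a lower wide peak at x = 2."""
    narrow = 2.0 * np.exp(-0.5 * ((x + 2.0) / 0.05) ** 2)
    wide = 1.0 * np.exp(-0.5 * ((x - 2.0) / 1.0) ** 2)
    return narrow + wide

def smoothed_reward(center, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[reward(center + sigma * eps)] with eps ~ N(0, 1)."""
    eps = np.random.default_rng(seed).standard_normal(n_samples)
    return float(reward(center + sigma * eps).mean())

sigmas = [0.01, 0.1, 0.5]
narrow_curve = [smoothed_reward(-2.0, s) for s in sigmas]   # centered on the narrow peak
wide_curve = [smoothed_reward(2.0, s) for s in sigmas]      # centered on the wide peak
```

In this toy, the narrow peak wins at tiny noise scales but its average reward collapses as σ grows, while the wide peak barely changes. Running the analogous sweep around a real reward-hacking checkpoint and a non-hacking one would show whether the two differ in the same way.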

The paper does have a few empirical experiments showing they arrive at different solutions. Specifically the KL-reward plot. Would you need more settings to be convinced here?