EUREKA: Using LLMs to Create Reward Functions
As LLMs get smarter, the potential for using AI to build AI grows.
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize some notion of cumulative reward. In RL, an agent interacts with its environment in discrete time steps, choosing actions based on observed states, and in return, it receives rewards and new state observations from the environment. This process leads to a sequence of state-action-reward tuples, and the agent's task is to learn a policy—mapping from states to actions—that maximizes the total future reward, often discounted over time.
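To make that loop concrete, here is a minimal sketch using the Gymnasium API with a random policy; the CartPole environment and the absence of any actual learning are purely for illustration.

```python
import gymnasium as gym

# One episode of the agent-environment loop; CartPole and the random policy
# are placeholders for illustration.
env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # a trained agent would use its learned policy here
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```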
RL vs. Supervised Learning
RL is fundamentally different from supervised learning, the more traditional form of machine learning. In supervised learning, an algorithm learns from a labeled dataset in which the correct answer (the label) is provided for each input; the model is trained to predict these labels for new, unseen data. Supervised learning is like a student learning with the help of a teacher who provides the answers for training questions, whereas RL is akin to a student learning through trial and error in the world without a teacher, driven by rewards for their actions.

RL has been successfully applied to many Atari games, where the reward signal is clearly defined and multiple simulations can be run
The concept of reward functions is foundational in RL, as they guide the learning agent toward desired behaviors. Traditional RL relies on the reward function to signal the right actions to take in given states, shaping the agent's policy over time. However, crafting an effective reward function can be a daunting task due to the complexity and unpredictability of the agent's environment.
Reward Shaping
To assist with this, "reward shaping" is employed as a technique to provide additional guidance to the learning agent. Reward shaping involves the modification of the reward function to make learning faster and easier for the agent. This is often achieved by incorporating supplementary rewards or penalties that reinforce the intermediate steps towards achieving the final goal, rather than just providing a reward for the end goal itself.
This technique is different from traditional RL reward functions, which typically focus on the ultimate goal without guiding the intermediate steps. For instance, in a maze navigation task, a traditional reward function might only provide a positive reward when the agent reaches the end of the maze. In contrast, reward shaping might give incremental rewards for moving closer to the goal, thus helping the agent learn the desired path more efficiently.
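As a rough sketch of that maze example, a shaped reward might add a small bonus proportional to how much each step reduces the distance to the exit; the grid size and the shaping coefficient below are arbitrary choices for illustration.

```python
import numpy as np

GOAL = np.array([9, 9])  # hypothetical exit cell of a 10x10 maze

def sparse_reward(position: np.ndarray) -> float:
    """Traditional reward: only reaching the exit is rewarded."""
    return 1.0 if np.array_equal(position, GOAL) else 0.0

def shaped_reward(position: np.ndarray, prev_position: np.ndarray) -> float:
    """Shaped reward: add a small bonus for each step that moves closer to the exit."""
    progress = np.linalg.norm(prev_position - GOAL) - np.linalg.norm(position - GOAL)
    return sparse_reward(position) + 0.1 * progress
```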
Reward shaping is used because it can significantly accelerate learning, especially in complex environments where the desired outcomes or goals are sparse or very delayed. By giving the agent more immediate feedback, reward shaping can make the learning process more intuitive and direct. However, the challenge lies in designing these shaping rewards without inadvertently leading the agent to learn suboptimal policies — a process that requires careful consideration to ensure that the additional rewards are aligned with the overall goal and do not encourage undesired behaviors.
EUREKA
Researchers have just unveiled EUREKA, an algorithmic solution for reward generation that consists of three components:
Environment as Context
The environment’s specification is fed directly to the LLM, providing it with the necessary context to generate reward functions. The rationale is that LLMs trained on code perform better when given the environment’s native code as context.
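As a hedged illustration of what "environment as context" could look like, the sketch below places the environment's source code directly in the prompt; the prompt wording, file name, and call_llm helper are assumptions, not EUREKA's actual implementation.

```python
# Illustrative only: EUREKA's real prompts and plumbing differ.
def build_reward_prompt(env_source_code: str, task_description: str) -> str:
    """Place the raw environment code and task description directly in the prompt."""
    return (
        "You are writing a reward function for a reinforcement learning task.\n"
        f"Task: {task_description}\n\n"
        "Environment source code:\n"
        f"{env_source_code}\n\n"
        "Write a Python function compute_reward(...) that returns a scalar reward."
    )

def call_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call to a code-capable LLM such as GPT-4."""
    raise NotImplementedError

env_code = open("shadow_hand_env.py").read()  # hypothetical environment file
reward_code = call_llm(build_reward_prompt(env_code, "Spin the pen to a target orientation"))
```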
Evolutionary Search
This step iteratively generates and refines reward candidates using the LLM. It uses evolutionary strategies to sample multiple reward functions, then selects and mutates the best-performing ones over several iterations, constantly seeking improvement based on feedback from the fitness function.
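A simplified sketch of such a loop follows; train_policy_with_reward and summarize_training are hypothetical helpers standing in for RL training and the reward reflection step described next.

```python
# Simplified evolutionary search over LLM-generated reward functions.
# train_policy_with_reward and summarize_training are hypothetical helpers.
def evolutionary_search(llm, base_prompt, iterations=5, samples_per_iter=16):
    best_code, best_fitness, feedback = None, float("-inf"), ""
    for _ in range(iterations):
        # Sample several reward-function candidates from the LLM.
        candidates = [llm(base_prompt + "\n" + feedback) for _ in range(samples_per_iter)]
        for code in candidates:
            # Train a policy with this candidate and score it with the fitness function.
            fitness, training_log = train_policy_with_reward(code)
            if fitness > best_fitness:
                best_code, best_fitness = code, fitness
                # Reward reflection (next section) turns the training log into feedback
                # that is appended to the prompt for the next iteration's mutations.
                feedback = summarize_training(training_log)
    return best_code
```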
Reward Reflection
This step analyzes the policy training dynamics and provides feedback. Reward reflection tracks the scalar values of all reward components during training and uses this detailed feedback to guide the LLM in improving the reward function. This step acknowledges that the effectiveness of a reward function can vary based on the RL algorithm used and its hyperparameters.
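A rough idea of how that feedback could be produced, using hypothetical names: scalar values logged at training checkpoints are rendered as text that is appended to the next prompt.

```python
# Hypothetical sketch: turn per-component reward logs into textual feedback.
def summarize_training(training_log: dict[str, list[float]]) -> str:
    """training_log maps the task score and each reward component to values at checkpoints."""
    lines = [f"{name}: {[round(v, 2) for v in values]}" for name, values in training_log.items()]
    lines.append("Consider revising components that stay flat or dominate the total reward.")
    return "\n".join(lines)
```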
In summary, the method entails:
1. Feeding the LLM the environment code as context.
2. Using evolutionary strategies to generate a set of reward function candidates.
3. Reflecting on the performance of these candidates by analyzing policy training dynamics and providing feedback.
4. Iteratively refining reward functions to maximize the fitness function score.

Iterative Improvement
By employing this method, the LLM can generate and refine reward functions that are more likely to lead to high-performing policies in the given RL environment. This approach leverages the LLM’s ability to generate code and its understanding of the task to propose reward functions that are both executable and well-suited to guide the agent toward the desired behavior.
Technical Details
EUREKA is evaluated using GPT-4 as its backbone LLM.
Tests are conducted in the Isaac Gym simulator across 10 robots and 29 tasks.
Two benchmarks are used: 9 original Isaac Gym environments and 20 tasks from the Dexterity benchmark.
Training Details
A consistent RL algorithm and set of hyperparameters are used for all tasks.
Each generated reward function is optimized independently, with the resulting policy's performance serving as the measure of reward quality.
Results
EUREKA Performance: EUREKA's rewards generally outperform or match human-engineered rewards across most tasks.
Improvement Over Time: EUREKA shows progressive improvement in reward quality through evolutionary iterations.
Novelty of Rewards: EUREKA often creates rewards that are weakly correlated or even negatively correlated with human rewards, indicating novel solutions.
Targeted Improvement: EUREKA utilizes reward reflections effectively to improve rewards.
Special Case - Dexterous Pen Spinning
Combined with curriculum learning, EUREKA successfully creates rewards that enable a robot to learn complex dexterous tasks like pen spinning.
Human Feedback Integration
EUREKA also supports the incorporation of human feedback, showing that starting with a human-designed reward can be advantageous.
Overall
The results show EUREKA as a robust system that can generate efficient reward functions, improve over time, and produce novel solutions that often surpass human-crafted rewards. The system is also versatile, capable of integrating human feedback effectively, and suitable for complex skill acquisition in robots.