The environment has 3 phases. In phase 1 the agent should pick up a key (no immediate reward). It is then teleported to phase 2, which contains 4 gifts that give immediate rewards, and then to phase 3, where it should go to the goal. If the agent is carrying the key when it reaches the goal, it receives an extra reward for having the key.
Max possible reward in the last phase: 5 (goal) + 15 (key) = 20
The distractor gift phase has 4 gifts. Each experiment (run set) uses a different per-gift reward size (and thus a different max reward from the gift phase):
reward = 0 ; max phase2 reward = 0 ; max_total = 20
reward = 1 ; max phase2 reward = 4 ; max_total = 24
reward = 5 ; max phase2 reward = 20 ; max_total = 40
reward = 8 ; max phase2 reward = 32 ; max_total = 52
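
The arithmetic behind these rows can be checked with a short sketch. It assumes only the numbers stated above (4 gifts, 5 for the goal, 15 for the key). The constant and function names are hypothetical and are not taken from the environment's actual code.

```python
# Values taken from the description above; the names are illustrative only.
GOAL_REWARD = 5   # reward for reaching the goal in phase 3
KEY_BONUS = 15    # extra reward for carrying the key at the goal
NUM_GIFTS = 4     # number of distractor gifts in phase 2


def max_rewards(gift_reward):
    """Return (max phase-2 reward, max total reward) for a per-gift reward."""
    max_phase2 = NUM_GIFTS * gift_reward
    max_total = max_phase2 + GOAL_REWARD + KEY_BONUS
    return max_phase2, max_total


if __name__ == "__main__":
    # Per-gift reward sizes used across the experiments listed above.
    for gift_reward in (0, 1, 5, 8):
        max_phase2, max_total = max_rewards(gift_reward)
        print(f"reward = {gift_reward} ; "
              f"max phase2 reward = {max_phase2} ; "
              f"max_total = {max_total}")
```

Running the script reproduces the values in the four rows above. Keeping the phase 2 and phase 3 terms separate makes it easy to see how much of the maximum return comes from the distractor gifts in each experiment.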