
High fluctuation in Q0 value for TD3 agent while training.

I am training a TD3 RL agent for a pick-and-place robot. The reward function is reward = exp(-E/d), where E is the total energy consumed when the trajectory is complete and d is the distance of the object from the end-effector. Training went smoothly with a DQN agent, but it fails when DDPG or TD3 is used. What could be the reason for this? I used the following code for agent creation.
% 34-element continuous observation space
obsInfo = rlNumericSpec([34 1]);
% 14-element continuous action space, bounded to [-1, 1]
actInfo = rlNumericSpec([14 1], ...
    LowerLimit=-1, ...
    UpperLimit=1);
% Custom environment defined by step and reset functions
env = rlFunctionEnv(obsInfo,actInfo,"KondoStepFunction","KondoResetFunction");
% TD3 agent with default actor/critic networks and options
agent = rlTD3Agent(obsInfo,actInfo);
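For reference, the reward computation at the end of a trajectory looks roughly like this (a simplified sketch; the variable names and the energy bookkeeping are illustrative, not the exact contents of KondoStepFunction):
% Simplified sketch of the reward logic (illustrative variable names)
E = sum(abs(jointTorques .* jointVelocities), 'all') * dt;  % energy consumed over the trajectory
d = norm(objectPosition - endEffectorPosition);             % object-to-end-effector distance
reward = exp(-E / d);                                       % reward = exp(-E/d)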

Answers (1)

Ronit on 23 May 2024
Hello James,
To understand why the Q0 estimate fluctuates so much with different RL agents, we first need to understand how these agents work.
  • The primary difference between DQN and agents like DDPG and TD3 is that DQN is just a value-based learning method, whereas DDPG and TD3 use the actor-critic method.
  • A DQN network predicts the Q value for each state-action pair, so it is a single model. DDPG, on the other hand, has a critic that estimates the Q value and a separate actor that chooses the action. In other words, DDPG learns the policy directly, whereas DQN learns Q values that are then used to define the policy, typically an epsilon-greedy one.
  • So, training an agent with DDPG or TD3 must be done more carefully, not only because its learning can be unstable, but also because the number of hyperparameters to fine-tune is roughly double that of DQN. (A minimal sketch of the actor-critic structure follows this list.)
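To make the actor-critic structure concrete, here is a minimal sketch using the obsInfo and actInfo from your code and the default networks that rlTD3Agent builds; it simply shows that a TD3 agent carries an actor in addition to its critics:
agent = rlTD3Agent(obsInfo, actInfo);  % default actor and twin critics
critics = getCritic(agent);            % Q-value approximators (two for TD3)
actor   = getActor(agent);             % deterministic policy approximator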
Here are a few suggestions that can help you get good results with TD3 or DDPG agents:
  1. Tune Hyperparameters: Adjust learning rates, replay buffer size, and exploration noise (see the options sketch after this list).
  2. Normalize Rewards: Consider scaling your reward to reduce variability and improve learning stability.
  3. Monitor Training: Use diagnostics to understand action, reward, and learning dynamics better.
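For example, these knobs can be set through rlTD3AgentOptions. The sketch below only shows where each setting lives; the specific values are illustrative starting points, not settings tuned for your task:
% Illustrative starting points only; tune for your own task
criticOpts = rlOptimizerOptions(LearnRate=1e-3);   % critic usually learns faster
actorOpts  = rlOptimizerOptions(LearnRate=1e-4);   % than the actor
agentOpts = rlTD3AgentOptions( ...
    ExperienceBufferLength=1e6, ...                % larger replay buffer smooths learning
    MiniBatchSize=256, ...
    DiscountFactor=0.99, ...
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts);
% Exploration noise: too much noise is a common cause of noisy Q0 estimates
agentOpts.ExplorationModel.StandardDeviation = 0.1;
agentOpts.ExplorationModel.StandardDeviationDecayRate = 1e-5;
agent = rlTD3Agent(obsInfo, actInfo, agentOpts);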
Adjusting these aspects can help mitigate the high fluctuation and improve your TD3 agent's training performance.
Hope this helps!
