
Episode Q0 estimate does not converge but I get good results in simulation

Hi, I'm using reinforcement learning on a control problem, specifically a TD3 agent. I have a third-order plant and I want to use RL to find the optimal values for the PI gains, so I'm basing my work on this MATLAB example.
My problem is very similar to the MATLAB example, but instead of a water tank I have to control the input airflow to generate a temperature signal that follows a given temperature reference.
Summarizing:
  1. Action: airflow speed (%)
  2. Observation: error (reference temperature minus measured temperature) and the integral of the error
  3. Reward function (a sketch of this logic is shown after this list):
  • +10 if the error < 0.1, -1 otherwise
  • -1000 if the temperature drops below 0 (episode stop condition)
  • -10 if the action < 50 (this discourages bad states)
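For clarity, here is a minimal sketch of that reward logic, assuming it is computed in a MATLAB Function block inside the environment model; the variable names are placeholders, not my exact signals:

function reward = computeReward(error, temp, action)
% Hypothetical reward computation matching the rules listed above
% error:  reference temperature minus measured temperature
% temp:   measured temperature
% action: airflow speed in percent

% +10 when the tracking error is small, -1 otherwise
if abs(error) < 0.1
    reward = 10;
else
    reward = -1;
end

% large penalty when the temperature goes negative (episode stop condition)
if temp < 0
    reward = reward - 1000;
end

% discourage low airflow actions
if action < 50
    reward = reward - 10;
end
end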
I have 3 neural networks (a sketch of how they could be built is shown after this list):
  1. Actor: a single neuron whose weights are the PI gains.
  2. Critics: the TD3 algorithm uses two critics with the same architecture.
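For illustration, a minimal sketch of how networks like these could be built with the Reinforcement Learning Toolbox; the layer sizes, layer names, and action limits are placeholders, not my exact networks:

% Observation: [error; integral of error]; action: airflow speed in percent
obsInfo = rlNumericSpec([2 1]);
actInfo = rlNumericSpec([1 1],'LowerLimit',0,'UpperLimit',100);

% Actor: one fully connected neuron with a frozen (zero) bias, so that its
% two weights act as the PI gains
actorNet = [
    featureInputLayer(2,'Name','obs')
    fullyConnectedLayer(1,'Name','gains','BiasLearnRateFactor',0)];
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);

% Critics: both TD3 critics use the same architecture (hidden size is a placeholder)
obsPath = [
    featureInputLayer(2,'Name','obs')
    fullyConnectedLayer(32,'Name','obsFC')];
actPath = [
    featureInputLayer(1,'Name','act')
    fullyConnectedLayer(32,'Name','actFC')];
commonPath = [
    concatenationLayer(1,2,'Name','concat')
    reluLayer('Name','relu')
    fullyConnectedLayer(1,'Name','qValue')];
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);
criticNet = connectLayers(criticNet,'obsFC','concat/in1');
criticNet = connectLayers(criticNet,'actFC','concat/in2');

% Two critics built from the same graph get independent random initial weights
critic1 = rlQValueFunction(criticNet,obsInfo,actInfo, ...
    'ObservationInputNames','obs','ActionInputNames','act');
critic2 = rlQValueFunction(criticNet,obsInfo,actInfo, ...
    'ObservationInputNames','obs','ActionInputNames','act');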
This is one of the best agents I've trained so far:
  • Learning rate: actor: 0.001, critic: 0.01
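These learning rates could be set through the agent options roughly like this (a sketch; the sample time Ts and the remaining option values are assumptions, and actor/critic1/critic2 come from the sketch above):

% Optimizer options with the learning rates listed above
actorOpts  = rlOptimizerOptions('LearnRate',1e-3,'GradientThreshold',1);
criticOpts = rlOptimizerOptions('LearnRate',1e-2,'GradientThreshold',1);

% Ts (controller sample time) is assumed to be defined elsewhere
agentOpts = rlTD3AgentOptions( ...
    'SampleTime',Ts, ...
    'ActorOptimizerOptions',actorOpts, ...
    'CriticOptimizerOptions',criticOpts, ...
    'DiscountFactor',0.99);

agent = rlTD3Agent(actor,[critic1 critic2],agentOpts);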
I trained this agent for 1000 more episodes, but I got worse simulation results.
Looking at the action signal generated by this RL agent, it seems fairly good in control terms, I think.
The problem here is Episode Q0. According to MATLAB:
For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, if the critic is well designed, Episode Q0 approaches the true discounted long-term reward, as shown in the preceding figure.
Episode Q0 (yellow line) doesn't approach the episode reward (blue line) or the average reward (red line). So, according to this, my agent is very bad, right? But then why am I getting good results? And how can I fix this? Just by trying another critic architecture, e.g. with more layers?

Accepted Answer

Ayush Modi on 17 Jan 2024
Hi,
I found the following answer in the community regarding Episode Q0: it is not necessary for Episode Q0 to be an indication of the learning quality of the RL agent for actor-critic methods. If you are getting good results, you need not make any changes.
"In general, it is not required for this to happen for actor-critic methods. The actor may converge first, and at that point it would be totally fine to stop training."

