Optimizing a Regression Learner App for an Electrochemical NO2 Sensor: Dealing with Drift and Input Variations

Hello,
I am currently using the Regression Learner App to develop a GPR Exponential model for my Electrochemical NO2 Sensor. This sensor outputs a voltage, and I use reference data alongside temperature and humidity measurements to train my model.
Initially, after creating a model with the App, I find that the GPR Exponential model aligns reasonably well with the sensor data. However, over time, I have noticed a slight drift in the data. I don't believe that this drift is a result of the sensor itself. Instead, it may be influenced by new combinations of sensor output voltage, temperature, humidity, and reference data values, which the model might not have encountered during the training process.
If I rerun the Regression Learner App to update or create a new GPR Exponential model, the sensor output appears to be accurate again. This leads me to believe that the need to retrain the model might be due to changes in the combination of the input parameters.
Considering the potential for a wide array of different parameter combinations, how can I optimize my model to predict more accurately?
Moreover, could the nature of my temperature input impact the prediction? Specifically, would there be a noticeable difference in the accuracy of predictions if I input the absolute temperature compared to inputting the temperature segmented into smaller blocks?
I'm curious to know if anyone else has had similar experiences with their models? Any insights or suggestions to enhance the performance of my GPR Exponential model would be greatly appreciated.
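Since the suspicion is that drift coincides with input combinations the model never saw, a quick sanity check is to compare the ranges of the deployed inputs against the training data. This is a minimal sketch, assuming the inputs are voltage, temperature, and humidity; the variable names (`Xtrain`, `Xnew`, etc.) are placeholders, not from the original post:

```matlab
% Sketch: flag deployed operating points that fall outside the training ranges.
% Column order assumed: [voltage, temperature, humidity].
Xtrain = [Vtrain, Ttrain, RHtrain];    % inputs used when training in the app
Xnew   = [Vnew,   Tnew,   RHnew];      % inputs seen in deployment

lo = min(Xtrain);                      % per-column training minima
hi = max(Xtrain);                      % per-column training maxima
outside = any(Xnew < lo | Xnew > hi, 2);   % true where the model must extrapolate

fprintf('%.1f%% of new points lie outside the training range\n', ...
        100*mean(outside));
```

If a meaningful fraction of points is flagged, the drift is consistent with extrapolation rather than sensor degradation, and widening the training coverage (or retraining on the new conditions) is the natural fix.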

6 Comments

First, why a GPR model instead of one built from physical principles, i.e., from the known physics of the sensor's response to its environment?
Second, see the section on "fitting happenstance data" in a classic text.
"... the need to retrain the model might be due to changes in the combination of the input parameters."
A particularly apt quote for the circumstance above would be
"To find out what happens when you change something, it is necessary to change it."
A fundamental principle of experiment design is to vary the parameter levels of interest over the region they need to cover, that is, the full range for which the derived correlation is to be used.
Also, have you plotted residuals to discover whether, perchance, you have left out potentially important interaction or higher-order terms?
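A residual check like the one suggested above can be done on a model exported from Regression Learner. This is a sketch, not a definitive recipe: the exported struct is named `trainedModel` by default and exposes a `predictFcn` handle, but the table and column names (`dataTable`, `NO2ref`, `Temperature`, `Humidity`) are assumptions here:

```matlab
% Sketch: residuals vs. each input, using a model exported from the app.
yhat = trainedModel.predictFcn(dataTable);   % predictions on the training table
res  = dataTable.NO2ref - yhat;              % residuals against reference data

subplot(2,1,1);
plot(dataTable.Temperature, res, '.');
xlabel('Temperature'); ylabel('Residual');   % curvature here hints at a missing T^2 term
subplot(2,1,2);
plot(dataTable.Humidity, res, '.');
xlabel('Humidity'); ylabel('Residual');      % a trend here hints at a missing RH or T*RH term
```

Structure in either panel (curvature, fanning, a trend) is the signal that a term is missing; featureless scatter around zero is what a well-specified model should leave behind.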
GPR was giving me the best results, so I have been using that as my model.
How do I vary the parameters? Is this done within the Regression Learner App, or do I do it with my actual raw data once the model is created?
You design and execute an experiment that sets the conditions you want to measure; you don't just set something over in the corner and let it go -- that's "happenstance data," and it is rife with trouble. More than likely you don't cover much range; what you do cover will be serially correlated in time; and unless everything else in the environment is controlled, there may be confounding extraneous variables that aren't even being measured but still affect the sensor response.
I've not used one of the electrochemical NO2 sensors, so I don't have real hands-on experience with them, but a quick search turned up a couple of studies -- one did a "calibration" similar to what you describe by placing 16 sensors by a roadway near an official monitoring station and trying to calibrate against its data.
In the end, their best correlation used both the T and RH inputs plus a compensation for ozone levels. Omitting ozone made a significant difference in the observed R-squared of the correlation, but since the devices didn't include an ozone measurement, it couldn't be used in practice. A missing variable like that is quite possibly part of your issue as well. Temperature data above 30 C were simply discarded, as the sensor elements become highly nonlinear at and above those temperatures. (Part of the issue in that specific setup was that the onboard electronics were not actively cooled, so the internal temperatures were quite a bit higher than the ambient air temperature.)
All in all, it was a quite complex effort to get something useful from the sensors, and they also referenced other studies that ended up using time-based drift compensation besides; that study was not able to track down all the confounding variables well enough to compensate for their effects otherwise, it seems.
Data Range Limitations
If my dataset includes variables such as sensor output, temperature, and humidity, and the temperature in the dataset ranges from 15 to 32 degrees, can the model predict outcomes for temperatures that exceed this range? Specifically, could the model provide accurate predictions for temperatures above 32 degrees, based on its learning from the 15-32 degree range? Furthermore, would the model's ability to make such predictions depend on whether the effect of temperature is linear or polynomial?
Input Variability
Would it be necessary to include every possible combination of temperature and humidity in the dataset for accurate modeling? To be specific, if our sensor shows 60% humidity at 22 degrees, do we need to generate a dataset that demonstrates humidity levels ranging from 1% to 100% across the entire temperature spectrum?
A. You can always compute something outside the model's range; how accurate it will be depends on how accurate the model is to begin with, plus how well it extrapolates the response beyond that range.
Clearly, if a sensor's response were purely linear over the entire range, it wouldn't matter; a straight line is a straight line. That is never the case in practice; just how nonlinear the response is, and how well the fitted model holds up, depends entirely on what the particular data/model predict relative to what the sensor output actually is for a given input. Polynomials of higher degree are particularly notorious for "blowing up" as the range grows: a quadratic term alone doubles for every 1.4X increase in input; in other words, a 40% increase in T would, through that term alone, double its contribution to the predicted sensor output. For your range, (38/32)^2 ==> about 1.41, so even a 19% step past the training maximum inflates a quadratic term by roughly 41%. Remember that the slope magnitude of a parabola is always increasing, whether it points up or down.
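The arithmetic in that paragraph is easy to check directly; this small sketch just reproduces the quadratic-term growth numbers (32 and 38 degrees are the figures from the discussion above):

```matlab
% Sketch: how a quadratic term alone grows just outside the fitted range.
T0 = 32;                   % top of the training temperature range
T1 = 38;                   % a temperature seen in deployment
growth = (T1/T0)^2;        % ratio of the quadratic term's contribution
% growth is about 1.41: a ~19% overshoot in T inflates a T^2 term by ~41%
```

The same ratio computed for a cubic term, (38/32)^3, is about 1.67, which is why higher-degree fits extrapolate even more badly.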
B. You clearly can't measure every single combination of all parameters; that's not what experiment design is about. You should, however, cover the RANGE of every parameter over the region where they can jointly exist. Picking that set of points is the subject of experiment design; one method that has generally been found helpful for fitting quadratic response-surface models is the central composite design. Again, I recommend Box, Hunter and Hunter as essential background on the issues and on the techniques designed to avoid pitfalls.
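Since this is a MATLAB workflow, it may help to know that the Statistics and Machine Learning Toolbox can generate a central composite design directly with `ccdesign`. A sketch, with the physical ranges below purely illustrative (they are assumptions, not measured limits of your sensor):

```matlab
% Sketch: a central composite design for 3 factors (voltage, T, RH),
% inscribed so all coded points stay within [-1, 1], then mapped to
% example physical ranges.
dCC = ccdesign(3, 'type', 'inscribed');   % coded design points in [-1, 1]

lo = [0.1, 15, 20];                       % assumed lower bounds: V, T (C), RH (%)
hi = [1.0, 32, 90];                       % assumed upper bounds

runs = lo + (dCC + 1)/2 .* (hi - lo);     % scale coded points to physical units
disp(runs);                               % each row is one experimental condition
```

Running the sensor at those conditions (in randomized order) gives a training set that spans the joint range efficiently, instead of the narrow, time-correlated slice that happenstance logging produces.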
Thanks for the update.
Yes, I can add some additional computation after the model if necessary. My concern is: if I train my model on data with temperature values below 20 degrees, but then, when the model is used in the real world, the temperature becomes 35 degrees, how would the model behave? Would it simply not know, or would it somehow learn and predict?
If I retrain the model, is it possible to see what new elements are learned from the new data?
I have a large amount of data being fed into the Regression Learner App, and it becomes very slow. When I want to see the effect of temperature using the "Partial Dependence Plot", can I simply import my model into my script, keep all variables (apart from temperature) fixed, and observe the effect of varying the temperature value?
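Yes, that kind of one-variable sweep can be done outside the app on the exported model. A sketch, assuming the exported struct is the default `trainedModel` with a `predictFcn` handle, and that `V` and `RH` are the training vectors; the table's variable names must match whatever names were used at training time, so the ones below are placeholders:

```matlab
% Sketch: sweep temperature on the exported model, holding the other
% inputs at their median training values.
n = 100;
Tgrid = linspace(15, 32, n)';                        % training T range, per the thread

tbl = table(repmat(median(V),  n, 1), ...            % fixed voltage
            Tgrid, ...                               % swept temperature
            repmat(median(RH), n, 1), ...            % fixed humidity
            'VariableNames', {'Voltage','Temperature','Humidity'});

plot(Tgrid, trainedModel.predictFcn(tbl));
xlabel('Temperature'); ylabel('Predicted NO2');
```

If the underlying model object is available (for GPR the export struct typically carries it, e.g. as `trainedModel.RegressionGP`), `plotPartialDependence` on that object gives the same kind of plot without the app, averaged over the data rather than pinned at medians.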


Answers (1)

For the model to work well, it needs to see inputs similar to what it saw during training. For example, if the input and output had a linear relationship during training, the model will do well as long as that relationship holds. But if the relationship changes, say to something exponential, during testing, the model may not perform well, and you'll need to retrain it.
To make your model more accurate, gather a training set that covers the full range of operating conditions (sensor voltage, temperature, humidity) the sensor will encounter in deployment. You can also improve the model by retraining it periodically with new data it hasn't seen before.
The way you encode the temperature input can also affect how well the model works. Whatever representation you choose, use the same one both when training the model and when using it to make predictions; that consistency is key to maintaining accuracy.
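The periodic-retraining suggestion can also be scripted rather than done through the app each time. A minimal sketch using `fitrgp`, which is what underlies the app's GPR models (the "Exponential" preset corresponds to the kernel below); all variable names are placeholders:

```matlab
% Sketch: retrain a GPR model outside the app by folding in new data.
Xall = [Xold; Xnew];                         % previous + newly observed inputs
yall = [yold; ynew];                         % matching reference NO2 values

mdl = fitrgp(Xall, yall, ...
             'KernelFunction', 'exponential', ...  % same kernel as the app's preset
             'Standardize', true);                 % z-score inputs before fitting

yhat = predict(mdl, Xquery);                 % predictions at new operating points
```

Scripting it this way makes the refresh repeatable and avoids re-importing a large dataset into the app every time drift appears.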
I hope this helps clear things up!

Release: R2023a
Asked: 6 Aug 2023
Answered: 19 Aug 2024
