Chemo-Informatics: Database Uncertainty Destroys Neural Networks
I have always built neural network (NN) based schemes for estimating thermochemical properties. Once the NNs are built, I can estimate the properties of HFEs (hydrofluoroethers) or run an exhaustive search for CFC alternatives. For example, CFC-113 was used for precision cleaning, and in that application surface tension prediction is very important. To build an NN that predicts surface tension, I need to compile a large amount of surface tension data. Which kinds of compounds are in the database matters a great deal, because the NN learns from the data and returns answers analogous to what it has learned.
The larger the data set becomes, the more conflicting data it contains. For example, the database lists the surface tension of propylene glycol (CH2(OH)CH(OH)CH3) as both 35.5 dyn/cm and 72 dyn/cm.
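Conflicts like this can be flagged automatically before any training. The following is a minimal sketch, not the procedure actually used here: it assumes the database can be read as (compound name, value) pairs, and the function name find_conflicts and the 10% relative tolerance are hypothetical choices for illustration.

```python
from collections import defaultdict

def find_conflicts(records, rel_tol=0.10):
    """Return compounds whose reported values disagree by more than rel_tol.

    records: iterable of (compound_name, value) pairs.
    rel_tol: hypothetical threshold on (max - min) / min; tune it for your data.
    """
    values_by_compound = defaultdict(list)
    for name, value in records:
        values_by_compound[name].append(value)

    conflicts = {}
    for name, values in values_by_compound.items():
        lo, hi = min(values), max(values)
        # Flag the compound when the spread is large relative to the smallest value.
        if lo > 0 and (hi - lo) / lo > rel_tol:
            conflicts[name] = sorted(values)
    return conflicts

# Illustrative records: the propylene glycol values are the two quoted above;
# "compound A" and "compound B" are made-up placeholders, not measured data.
records = [
    ("propylene glycol", 35.5),
    ("propylene glycol", 72.0),
    ("compound A", 20.0),
    ("compound A", 20.5),
    ("compound B", 30.0),
]
conflicts = find_conflicts(records)
print(conflicts)
```

A check like this only finds the conflict. It cannot say which of the two values is correct.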
If I trust 35.5 dyn/cm, set the number of middle-layer neurons to 3, and build a neural network with 291 compounds, the learned result looks like the chart below.
The result is not bad, so I thought 72 dyn/cm was the wrong value. But to confirm, I built a neural network with 72 dyn/cm under the same conditions, and the result, shown below, is also not bad.
Then how can we know which value is more reliable? In the propylene glycol case, if I reduce the number of middle-layer neurons from 3 to 2, the estimate becomes 40 dyn/cm, as in the chart below, so 35.5 dyn/cm seems much more reliable. But it is very rare to find unreliable data this way.
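The comparison in the propylene glycol case comes down to simple arithmetic: which database value lies closer to the estimate from the smaller network. Here is a small sketch of that check, where rank_candidates is a hypothetical helper name and the numbers are the ones quoted in this post.

```python
def rank_candidates(estimate, candidates):
    """Sort candidate values by absolute distance from the model estimate."""
    return sorted(candidates, key=lambda v: abs(v - estimate))

# 40 dyn/cm is the two-neuron estimate quoted above; 35.5 and 72 are the
# conflicting database values for propylene glycol.
estimate = 40.0
ranked = rank_candidates(estimate, [72.0, 35.5])
deviations = [abs(v - estimate) for v in ranked]
print(ranked, deviations)
```

The closest candidate is only a hint, not proof. The estimate itself depends on the rest of the training data, and it is biased if the model was trained with one of the candidates included.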
Conflicting data, input errors, unit selection errors: I always spend a lot of time on these when building a neural network. I know the fundamental problems of neural network systems and try to avoid them with programming techniques. But about uncertainty in the original input data, I can do nothing. I think I need to collaborate with experimental researchers who are interested in this area.