Surrogate Model
After the expected values and errors of the component properties have
been estimated for the sampled grid points, these quantities are
predicted for all grid points. This is done based on the surrogate model.
In gmak, one can individually specify surrogate models for each
property. The available options are:
cubic interpolation;
linear interpolation;
(homoscedastic) Gaussian Process regression.
For flexibility, one can also create customized surrogate models and use them in the program by means of the customization API.
Warning
For the linear and cubic interpolation models, gmak
forces the corners of the grid to be simulated: they are
automatically added to the list of sampled points.
Note
The motivation for applying the surrogate model to predict the component properties (instead of the composite ones) is to allow for the future use of reweighting-based surrogate models. Since the component properties associated with a given composite property may originate from different simulation trajectories, it is not possible in general to use a reweighting-based surrogate model for composite properties.
Linear/Cubic Interpolation
The linear and cubic interpolations are carried out based on the
scipy.interpolate.griddata() function by specifying the
method parameter as "linear" or "cubic". The arguments
points and xi are set as the tuple
indexes of the sampled grid points and
of the entire grid, respectively. The expected values and the
statistical errors are independently interpolated for each component
property.
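As an illustration of the call described above, here is a minimal sketch rather than gmak's actual code. The grid shape, the sampled points and the numerical values are all made up for the example. Note that scipy.interpolate.griddata() returns NaN outside the convex hull of the input points by default, which is presumably why the corners of the grid have to be sampled.

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical 5x5 grid: tuple indexes of every grid point (row-major order).
grid_shape = (5, 5)
all_points = np.array(list(np.ndindex(*grid_shape)), dtype=float)

# Hypothetical sampled points: the four corners plus the center.
sampled_points = np.array([(0, 0), (0, 4), (4, 0), (4, 4), (2, 2)], dtype=float)

# Made-up expected values and statistical errors at the sampled points.
sampled_values = np.array([1.0, 2.0, 3.0, 4.0, 2.5])
sampled_errors = np.array([0.10, 0.12, 0.08, 0.11, 0.09])

# The expected values and the errors are interpolated independently.
predicted_values = griddata(sampled_points, sampled_values, all_points, method="linear")
predicted_errors = griddata(sampled_points, sampled_errors, all_points, method="linear")

# Back to the shape of the grid, so that [i, j] refers to grid point (i, j).
predicted_values = predicted_values.reshape(grid_shape)
predicted_errors = predicted_errors.reshape(grid_shape)
```

Replacing method="linear" by method="cubic" gives the cubic variant; the rest of the call is unchanged.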
Gaussian Process Regression (GPR)
The GPR surrogate model is based on the
sklearn.gaussian_process.GaussianProcessRegressor.
The features of the model are the normalized axes of the main variation. The normalization is done independently for each axis so that the smallest interval containing the normalized values of the corresponding parameter is \([0,1]\).
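The per-axis normalization can be pictured as a min-max scaling. The sketch below is an assumption about the arithmetic, not gmak's implementation; the helper normalize_axes and the parameter values are hypothetical.

```python
import numpy as np

def normalize_axes(points):
    """Min-max scale each column (axis) of `points` independently to [0, 1]."""
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    # Guard against a constant axis, which would otherwise divide by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (points - lo) / span

# Two made-up parameters with very different scales.
parameters = [[0.30, 100.0],
              [0.35, 150.0],
              [0.40, 300.0]]
features = normalize_axes(parameters)
```

After scaling, every axis spans exactly [0, 1], so parameters with different units contribute comparably to the kernel length scales.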
Each regression is carried out for a single component property, using as targets the expected values at the sampled grid points.
The kernel is chosen from a list of commonly used ones
kernels = [
    WhiteKernel(noise_level=noiseLevel, noise_level_bounds='fixed')
        + ConstantKernel(constant_value_bounds=(1e-10,1e+10)) * RBF(),
    WhiteKernel(noise_level=noiseLevel, noise_level_bounds='fixed')
        + ConstantKernel(constant_value_bounds=(1e-10,1e+10)) * Matern(),
    WhiteKernel(noise_level=noiseLevel, noise_level_bounds='fixed')
        + ConstantKernel(constant_value_bounds=(1e-10,1e+10)) * DotProduct(),
    WhiteKernel(noise_level=noiseLevel, noise_level_bounds='fixed')
        + ConstantKernel(constant_value_bounds=(1e-10,1e+10)) * ExpSineSquared(),
    WhiteKernel(noise_level=noiseLevel, noise_level_bounds='fixed')
        + ConstantKernel(constant_value_bounds=(1e-10,1e+10)) * RationalQuadratic(),
]
by using each one to independently fit a surrogate model and selecting
the one with the highest log marginal likelihood. The
training set of each fit comprises all sampled grid points, so no
validation set is held out. The noiseLevel is the mean statistical
error of the property over the sampled grid points.
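The selection procedure can be sketched as follows. This is an illustration under stated assumptions, not gmak's code: the helper fit_best_gpr, the training data and the reduced list of three base kernels are hypothetical, and all other GaussianProcessRegressor options are left at their scikit-learn defaults.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    ConstantKernel, Matern, RBF, RationalQuadratic, WhiteKernel)

def fit_best_gpr(X, y, noise_level, base_kernels):
    """Fit one GPR per base kernel; keep the one with the highest log marginal likelihood."""
    best_model, scores = None, []
    for base in base_kernels:
        # Fixed noise term plus a scaled base kernel, as in the list above.
        kernel = (WhiteKernel(noise_level=noise_level, noise_level_bounds="fixed")
                  + ConstantKernel(constant_value_bounds=(1e-10, 1e10)) * base)
        model = GaussianProcessRegressor(kernel=kernel).fit(X, y)
        lml = float(model.log_marginal_likelihood_value_)
        if not scores or lml > max(scores):
            best_model = model
        scores.append(lml)
    return best_model, scores

# Made-up training data: normalized features, expected values and errors.
X_sampled = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
y_sampled = np.sin(2.0 * np.pi * X_sampled).ravel()
errors = np.full(8, 0.05)

best_model, scores = fit_best_gpr(
    X_sampled, y_sampled, noise_level=errors.mean(),
    base_kernels=[RBF(), Matern(), RationalQuadratic()])
```

Because every sampled point is used for training, the log marginal likelihood is the only model-selection criterion in this sketch; no held-out score is computed.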
Note
Contrary to the linear and cubic interpolations, which require the expected values and statistical errors to be treated as two independent targets, the GPR model estimates these two quantities simultaneously.
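In scikit-learn, a fitted GaussianProcessRegressor returns the predictive mean and standard deviation in a single call when return_std=True is passed to predict(). The sketch below assumes that the standard deviation plays the role of the statistical error; whether gmak uses exactly this call is not stated here, and the data and variable names are made up.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

# Made-up normalized features and expected values at the sampled points.
X_sampled = np.array([[0.0], [0.25], [0.5], [0.75], [1.0]])
y_sampled = np.array([1.0, 1.4, 1.1, 0.7, 0.9])
noise_level = 0.05  # hypothetical mean statistical error

kernel = (WhiteKernel(noise_level=noise_level, noise_level_bounds="fixed")
          + ConstantKernel() * RBF())
gpr = GaussianProcessRegressor(kernel=kernel).fit(X_sampled, y_sampled)

# One call yields both quantities for every point of the (hypothetical) grid.
X_all = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
gpr_mean, gpr_std = gpr.predict(X_all, return_std=True)
```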