mbGDMLTrain
- class mbgdml.train.mbGDMLTrain(entity_ids, comp_ids, use_sym=True, use_E=True, use_E_cstr=False, use_cprsn=False, solver='analytic', lam=1e-10, solver_tol=0.0001, use_torch=False, max_processes=None)
Train many-body GDML models.
- Parameters:
  - entity_ids (numpy.ndarray) – Model entity_ids.
  - comp_ids (numpy.ndarray) – Model comp_ids.
  - use_sym (bool, default: True) – Whether to identify and include symmetries when training GDML models. This usually increases training and prediction times, but comes with accuracy improvements.
  - use_E (bool, default: True) – Whether to reconstruct the potential energy surface with (True) or without (False) energy labels. It is highly recommended to train with energies.
  - use_E_cstr (bool, default: False) – Whether to include energies as part of the model training. True adds another column of alphas that are trained to the energies. This is rarely useful for higher order n-body models.
  - use_cprsn (bool, default: False) – Compresses the kernel matrix along symmetric degrees of freedom to try to reduce training time. Usually does not provide significant benefits.
  - solver (str, default: 'analytic') – The GDML solver to use, either analytic or iterative.
  - lam (float, default: 1e-10) – Hyper-parameter lambda (regularization strength). This generally does not need to change.
  - solver_tol (float, default: 1e-4) – Solver tolerance.
  - use_torch (bool, default: False) – Use PyTorch to enable GPU acceleration.
  - max_processes (int, default: None) – The maximum number of cores to use for the training process. Calculated automatically if not specified.
- active_train(dataset, model_name, n_train_init, n_train_final, n_valid, model0=None, n_train_step=100, n_test=None, save_dir='.', overwrite=False, write_json=True, write_idxs=True)
Trains a GDML model by using Bayesian optimization and adding problematic (high error) structures to the training set.
Trains a GDML model with mbgdml.train.mbGDMLTrain.bayes_opt().
- Parameters:
  - dataset (mbgdml.data.DataSet) – Data set from which the training, validation, and test sets are derived.
  - model_name (str) – User-defined model name without the '.npz' file extension.
  - n_train_init (int) – Initial size of the training set. If model0 is provided, this is the training set size of that model.
  - n_train_final (int) – Training set size of the final model.
  - n_valid (int) – Size of the validation set to be used for each training task. Different structures are sampled for each training task.
  - model0 (dict, default: None) – Initial model to start training with. Training indices will be taken from here.
  - n_train_step (int, default: 100) – Number of problematic structures to add to the training set for each iteration.
  - n_test (int, default: None) – The number of test points to test the validated GDML model. Defaults to testing all available structures.
  - save_dir (str, default: '.') – Path to train and save the mbGDML model.
  - overwrite (bool, default: False) – Overwrite existing files.
  - write_json (bool, default: True) – Write a JSON file containing information about the training job.
  - write_idxs (bool, default: True) – Write npy files for training, validation, and test indices.
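The iterative growth of the training set described for active_train can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `active_train_schedule` and `pick_problematic` are hypothetical and not part of mbgdml, the per-structure errors are made-up numbers, and capping the last step at n_train_final is an assumption about how a non-divisible step would be handled.

```python
def active_train_schedule(n_train_init, n_train_final, n_train_step=100):
    """Training set sizes visited when growing from n_train_init to n_train_final.

    Hypothetical helper: the last step is capped so it never exceeds n_train_final.
    """
    sizes = [n_train_init]
    while sizes[-1] < n_train_final:
        sizes.append(min(sizes[-1] + n_train_step, n_train_final))
    return sizes


def pick_problematic(errors, train_idxs, n_add):
    """Indices of the n_add highest-error structures not already in the training set.

    errors is a list of per-structure prediction errors (illustrative values only).
    """
    in_train = set(train_idxs)
    candidates = [i for i in range(len(errors)) if i not in in_train]
    candidates.sort(key=lambda i: errors[i], reverse=True)
    return candidates[:n_add]


sizes = active_train_schedule(n_train_init=200, n_train_final=500, n_train_step=100)
new_idxs = pick_problematic([0.1, 0.9, 0.3, 0.8, 0.2], train_idxs=[1], n_add=2)
print(sizes, new_idxs)
```

In this sketch each entry of `sizes` would correspond to one training task, with the structures returned by `pick_problematic` appended to the previous training indices.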
- bayes_opt(dataset, model_name, n_train, n_valid, n_test=None, save_dir='.', is_final=False, use_domain_opt=False, plot_bo=True, train_idxs=None, valid_idxs=None, overwrite=False, write_json=True, write_idxs=True)
Train a GDML model using Bayesian optimization for sigma.
Uses the Bayesian optimization package to automatically find the optimal sigma. This will maximize the negative validation loss. gp_params can be used to specify options to the BayesianOptimization.maximize() method.
A sequential domain reduction optimizer is used to accelerate the convergence to an optimal sigma (when requested).
- Parameters:
  - dataset (mbgdml.data.DataSet) – Dataset to train, validate, and test a model on.
  - model_name (str) – User-defined model name without the '.npz' file extension.
  - n_train (int) – The number of training points to use.
  - n_valid (int) – The number of validation points to use.
  - n_test (int, default: None) – The number of test points to test the validated GDML model. Defaults to testing all available structures.
  - save_dir (str, default: '.') – Path to train and save the mbGDML model. Defaults to current directory.
  - is_final (bool, default: False) – Whether to use bayes_opt_params_final.
  - use_domain_opt (bool, default: False) – Whether to use a sequential domain reduction optimizer. This sometimes crashes.
  - plot_bo (bool, default: True) – Plot the Bayesian optimization Gaussian process.
  - train_idxs (numpy.ndarray, default: None) – The specific indices of structures to train the model on. If None, the training data set will be sampled automatically.
  - valid_idxs (numpy.ndarray, default: None) – The specific indices of structures to validate models on. If None, structures will be automatically determined.
  - overwrite (bool, default: False) – Overwrite existing files.
  - write_json (bool, default: True) – Write a JSON file containing information about the training job.
  - write_idxs (bool, default: True) – Write npy files for training, validation, and test indices.
- Returns:
  - dict – Optimal many-body GDML model.
  - bayes_opt.BayesianOptimization – The Bayesian optimizer object.
- bayes_opt_n_check_rising
Number of additional sigma_grid probes to check if the loss continues to rise after finding a minimum.
We often perform a grid search prior to Bayesian optimization. Sometimes, with \(n\)-body training, the loss will start rising but then fall again to a lower value. Thus, we probe some extra (larger) sigmas to check if the loss will fall again. If it does, then we restart the grid search.
- Type:
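The look-ahead described for bayes_opt_n_check_rising can be illustrated with a small standalone sketch. It is one possible reading of the text, not the library's code: the function name `grid_search_with_check` is hypothetical, the losses are made-up numbers, and "restarting" is modeled here as simply continuing the search from the lower-loss sigma that the look-ahead found.

```python
def grid_search_with_check(sigmas, loss_of_sigma, n_check_rising=2):
    """Grid search over sorted sigmas that looks ahead after the loss rises.

    Returns (best_sigma, best_loss, n_probes). Hypothetical helper.
    """
    sigmas = sorted(sigmas)
    best_idx, best_loss = 0, loss_of_sigma(sigmas[0])
    n_probes = 1
    i = 1
    while i < len(sigmas):
        loss = loss_of_sigma(sigmas[i])
        n_probes += 1
        if loss < best_loss:
            best_idx, best_loss = i, loss
            i += 1
            continue
        # The loss rose: probe up to n_check_rising larger sigmas.
        fell_again = False
        for j in range(i + 1, min(i + 1 + n_check_rising, len(sigmas))):
            loss_j = loss_of_sigma(sigmas[j])
            n_probes += 1
            if loss_j < best_loss:
                # The loss fell below the previous minimum: resume from here.
                best_idx, best_loss = j, loss_j
                i = j + 1
                fell_again = True
                break
        if not fell_again:
            break
    return sigmas[best_idx], best_loss, n_probes


# Made-up losses with a false minimum at sigma = 25.
toy_losses = {2: 5.0, 25: 3.0, 50: 4.0, 100: 2.0, 200: 1.0,
              300: 1.5, 400: 1.8, 500: 2.5, 700: 3.0}
with_check = grid_search_with_check(toy_losses, toy_losses.get, n_check_rising=2)
without_check = grid_search_with_check(toy_losses, toy_losses.get, n_check_rising=0)
print(with_check, without_check)
```

With these toy values, the search without look-ahead stops at the false minimum, while the look-ahead version continues past it to a lower loss.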
- bayes_opt_params
Bayesian optimization parameters.
Default: {'init_points': 10, 'n_iter': 10, 'alpha': 1e-7, 'acq': 'ucb', 'kappa': 1.5}
- Type:
- bayes_opt_params_final
Bayesian optimization parameters for the final model.
If None, then bayes_opt_params are used.
Default: None
- Type:
- check_energy_pred
Will return the model with the lowest loss that predicts reasonable energies. If False, the model with the lowest loss is not checked for reasonable energy predictions.
Sometimes, GDML kernels are unable to accurately reconstruct potential energies even if the force predictions are accurate. This can occur in many-body models with very low or very high sigmas (i.e., sigmas less than 5 or greater than 500).
Default: True
- Type:
- create_task(train_dataset, n_train, valid_dataset, n_valid, sigma, train_idxs=None, valid_idxs=None)
Create a single training task that can be used as a template for hyperparameter searches.
- Parameters:
  - train_dataset (mbgdml.data.DataSet) – Dataset for training a model on.
  - n_train (int) – The number of training points to sample.
  - valid_dataset (mbgdml.data.DataSet) – Dataset for validating a model on.
  - n_valid (int) – The number of validation points to sample, without replacement.
  - sigma (float or int) – Kernel length scale of the desired model.
  - train_idxs (numpy.ndarray, default: None) – The specific indices of structures to train the model on. If None, the training data set will be sampled automatically.
  - valid_idxs (numpy.ndarray, default: None) – The specific indices of structures to validate models on. If None, structures will be automatically determined.
- grid_search(dataset, model_name, n_train, n_valid, n_test=None, save_dir='.', train_idxs=None, valid_idxs=None, overwrite=False, write_json=True, write_idxs=True)
Train a GDML model using a grid search for sigma.
Usually, the validation errors will decrease until an optimal sigma is found and then start to increase (overfitting). We sort sigmas from lowest to highest and stop the search once the loss function starts increasing.
- Parameters:
  - dataset (mbgdml.data.DataSet) – Dataset to train, validate, and test a model on.
  - model_name (str) – User-defined model name without the '.npz' file extension.
  - n_train (int) – The number of training points to use.
  - n_valid (int) – The number of validation points to use.
  - n_test (int, default: None) – The number of test points to test the validated GDML model. Defaults to testing all available structures.
  - save_dir (str, default: '.') – Path to train and save the mbGDML model. Defaults to current directory.
  - train_idxs (numpy.ndarray, default: None) – The specific indices of structures to train the model on. If None, the training data set will be sampled automatically.
  - valid_idxs (numpy.ndarray, default: None) – The specific indices of structures to validate models on. If None, structures will be automatically determined.
  - overwrite (bool, default: False) – Overwrite existing files.
  - write_json (bool, default: True) – Write a JSON file containing information about the training job.
  - write_idxs (bool, default: True) – Write npy files for training, validation, and test indices.
- Returns:
GDML model with an optimal hyperparameter found via grid search.
- Return type:
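The stopping rule described for grid_search (sort the sigmas, stop once the loss starts increasing) can be sketched as follows. This is a minimal standalone illustration, not the mbgdml implementation: `grid_search_sigma` and `toy_loss` are hypothetical names, and the toy loss merely stands in for training and validating a model at each sigma.

```python
def grid_search_sigma(sigmas, loss_of_sigma):
    """Probe sigmas from lowest to highest and stop at the first rise in loss.

    Returns (best_sigma, best_loss, probed), where probed lists every
    (sigma, loss) pair that was evaluated. Hypothetical helper.
    """
    best_sigma, best_loss = None, float("inf")
    probed = []
    for sigma in sorted(sigmas):
        loss = loss_of_sigma(sigma)
        probed.append((sigma, loss))
        if loss < best_loss:
            best_sigma, best_loss = sigma, loss
        else:
            # The loss started increasing: stop the search.
            break
    return best_sigma, best_loss, probed


def toy_loss(sigma):
    """Made-up validation loss with a minimum at sigma = 100."""
    return (sigma - 100) ** 2 / 1e4 + 0.5


best_sigma, best_loss, probed = grid_search_sigma([300, 2, 50, 100, 200, 25], toy_loss)
print(best_sigma, best_loss, len(probed))
```

Here the sigma of 300 is never evaluated because the loss already rose at 200, which is the saving that the early stop is meant to provide.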
- keep_tasks
Keep all models trained during the train task if True. They are removed by default.
- Type:
- loss_func
Loss function for validation. The input of this function is the dictionary from mbgdml._gdml.train.add_valid_errors, which contains force and energy errors.
Default: mbgdml.losses.loss_f_rmse
- Type:
callable
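A custom loss function could look something like the sketch below. The layout of the results dictionary (nested "force" and "energy" entries, each with an "rmse" key) is an assumption made for illustration and should be checked against mbgdml._gdml.train.add_valid_errors; the function name and the `energy_weight` keyword are hypothetical.

```python
def loss_force_energy_rmse(results, energy_weight=0.0):
    """Weighted sum of force and energy RMSE from a validation results dict.

    Assumes results looks like {"force": {"rmse": ...}, "energy": {"rmse": ...}};
    this layout is an assumption, not confirmed by the documentation.
    """
    force_rmse = results["force"]["rmse"]
    energy_rmse = results["energy"]["rmse"]
    return force_rmse + energy_weight * energy_rmse


# Made-up validation errors for demonstration.
example_results = {"force": {"rmse": 0.25}, "energy": {"rmse": 0.5}}
force_only = loss_force_energy_rmse(example_results)
combined = loss_force_energy_rmse(example_results, energy_weight=0.5)
print(force_only, combined)
```

Under these assumptions, extra keywords such as `energy_weight` are the kind of argument that would be supplied through loss_kwargs rather than through the results dictionary.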
- loss_kwargs
Loss function keyword arguments with the exception of the validation results dictionary.
- Type:
- min_memory_analytic(n_train, n_atoms)
Minimum memory recommendation for training analytically.
GDML currently only supports closed form solutions (i.e., analytically). Thus, the entire kernel matrix must be in memory which requires \((M * 3N)^2\) double precision (8 byte) entries. This provides a rough estimate for memory requirements.
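The estimate above can be worked through in a few lines. This is a back-of-the-envelope sketch of the \((M * 3N)^2\) formula only; the function name is hypothetical, and the units or any safety margin used by min_memory_analytic itself are not stated here.

```python
def estimate_kernel_memory_bytes(n_train, n_atoms):
    """Bytes needed for a dense (M * 3N) x (M * 3N) double precision matrix.

    M is the number of training structures and N the number of atoms.
    """
    dim = n_train * 3 * n_atoms
    return dim * dim * 8  # 8 bytes per double precision entry


n_bytes = estimate_kernel_memory_bytes(n_train=1000, n_atoms=6)
print(f"{n_bytes / 1e9:.2f} GB")
```

Because the requirement grows quadratically in both the training set size and the number of atoms, doubling n_train quadruples the memory needed.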
- plot_bayes_opt_gp(optimizer)
Prepare a plot of the Bayesian optimization Gaussian process.
- Parameters:
  - optimizer (bayes_opt.BayesianOptimization)
- Returns:
A matplotlib figure object.
- Return type:
object
- require_E_eval
Require energy evaluation even if the energy predictions are terrible.
If False, it defaults to sGDML behavior (this does not work well with n-body training).
Default: True
- Type:
- save_idxs(model, dataset, save_dir, n_test)
Saves npy files of the dataset splits (training, validation, and test).
- Parameters:
  - model (mbgdml.models.gdmlModel) – Many-body GDML model.
  - dataset (dict) – Dataset used for training, validation, and testing.
- save_valid_csv(save_dir, valid_json)
Writes a CSV summary file for validation statistics.
This makes it easier to see how the validation error changes with sigma.
- sigma_bounds
Kernel length scale bounds for the Bayesian optimization.
This is only used if sigma_grid is None.
Default: (2, 400)
- Type:
- sigma_grid
Determining reasonable sigma_bounds is difficult without some prior experience with the system. Even then, the optimal sigma can drastically change depending on the training set size. sigma_grid will assist with determining optimal sigma_bounds by first performing a coarse grid search. The Bayesian optimization will start with the bounds of the grid-search minimum. It is recommended to choose sigma_bounds as large as your sigma_grid; it will be updated internally.
The number of probes done during the initial grid search will be subtracted from the Bayesian optimization init_points.
We recommend that the grid includes several lower sigmas (< 50), a few medium sigmas (< 500), and several large sigmas that span up to at least 1000 for higher-order models.
Default: [2, 25, 50, 100, 200, 300, 400, 500, 700, 900, 1100, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 50000, 100000]
- Type:
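How a coarse grid search might hand off to the Bayesian optimization can be sketched as below. Several details are assumptions rather than documented behavior: taking the grid points on either side of the minimum as the new bounds is one reading of "the bounds of the grid-search minimum", clamping the remaining init_points at zero is a guess, and both helper names are hypothetical.

```python
def bounds_from_grid(probes):
    """Sigma bounds bracketing the grid-search minimum.

    probes is a list of (sigma, loss) pairs sorted by sigma. The bounds are the
    neighboring grid points of the lowest-loss sigma (an assumption).
    """
    i_min = min(range(len(probes)), key=lambda i: probes[i][1])
    lower = probes[max(i_min - 1, 0)][0]
    upper = probes[min(i_min + 1, len(probes) - 1)][0]
    return lower, upper


def remaining_init_points(init_points, n_probes):
    """init_points left after subtracting grid probes, clamped at zero (assumed)."""
    return max(init_points - n_probes, 0)


# Made-up (sigma, loss) probes from a coarse grid search.
probes = [(2, 1.46), (25, 1.06), (50, 0.75), (100, 0.5), (200, 1.5)]
bounds = bounds_from_grid(probes)
n_init = remaining_init_points(init_points=10, n_probes=len(probes))
print(bounds, n_init)
```

In this toy case the Bayesian optimization would search between the two grid points around the minimum and start with fewer random initial probes, since the grid search already supplied some.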
- test_model(model, dataset, n_test=None)
Test model and add mbGDML modifications.
- Parameters:
  - model (mbgdml.models.gdmlModel) – Model to test.
  - dataset (dict) – Test dataset.
- Returns:
Tested and finalized many-body GDML model.
- Return type: