mbGDMLTrain

class mbgdml.train.mbGDMLTrain(entity_ids, comp_ids, use_sym=True, use_E=True, use_E_cstr=False, use_cprsn=False, solver='analytic', lam=1e-10, solver_tol=0.0001, use_torch=False, max_processes=None)

Train many-body GDML models.

Parameters:
  • entity_ids (numpy.ndarray) – Model entity_ids.

  • comp_ids (numpy.ndarray) – Model comp_ids.

  • use_sym (bool, default: True) – Whether to identify and include symmetries when training GDML models. This usually increases training and prediction times, but comes with accuracy improvements.

  • use_E (bool, default: True) – Whether to reconstruct the potential energy surface with (True) or without (False) energy labels. It is highly recommended to train with energies.

  • use_E_cstr (bool, default: False) – Whether or not to include energies as a part of the model training. If True, another column of alphas is added and trained to the energies. This is rarely useful for higher order n-body models.

  • use_cprsn (bool, default: False) – Compresses the kernel matrix along symmetric degrees of freedom to try to reduce training time. Usually does not provide significant benefits.

  • solver (str, default: 'analytic') – The GDML solver to use, either analytic or iterative.

  • lam (float, default: 1e-10) – Hyper-parameter lambda (regularization strength). This generally does not need to change.

  • solver_tol (float, default: 1e-4) – Solver tolerance.

  • use_torch (bool, default: False) – Use PyTorch to enable GPU acceleration.

  • max_processes (int, default: None) – The maximum number of cores to use for the training process. Will automatically calculate if not specified.
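The parameter list above does not spell out the layout of entity_ids and comp_ids. A minimal sketch, assuming entity_ids holds one integer per atom identifying the entity (for example, a molecule) that the atom belongs to, and comp_ids holds one label per entity; the water-dimer values and the 'h2o' label are illustrative only:

```python
import numpy as np

# Two water molecules with atoms ordered O, H, H, O, H, H (illustrative system).
# Assumption: one entity index per atom.
entity_ids = np.array([0, 0, 0, 1, 1, 1])

# Assumption: one component label per entity; the label text is arbitrary here.
comp_ids = np.array(["h2o", "h2o"])

# Consistency check: one label for each unique entity.
n_entities = len(np.unique(entity_ids))

# With mbGDML installed, the arrays would be passed to the constructor
# using the signature documented above:
# from mbgdml.train import mbGDMLTrain
# trainer = mbGDMLTrain(entity_ids, comp_ids, use_sym=True, use_E=True)
```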

active_train(dataset, model_name, n_train_init, n_train_final, n_valid, model0=None, n_train_step=100, n_test=None, save_dir='.', overwrite=False, write_json=True, write_idxs=True)

Trains a GDML model by using Bayesian optimization and adding problematic (high error) structures to the training set.

Trains a GDML model with mbgdml.train.mbGDMLTrain.bayes_opt().

Parameters:
  • dataset (mbgdml.data.DataSet) – Data set from which the training, validation, and test sets are derived.

  • model_name (str) – User-defined model name without the '.npz' file extension.

  • n_train_init (int) – Initial size of the training set. If model0 is provided, this is the size of that model.

  • n_train_final (int) – Training set size of the final model.

  • n_valid (int) – Size of the validation set to be used for each training task. Different structures are sampled for each training task.

  • model0 (dict, default: None) – Initial model to start training with. Training indices will be taken from here.

  • n_train_step (int, default: 100) – Number of problematic structures to add to the training set for each iteration.

  • n_test (int, default: None) – The number of test points to test the validated GDML model. Defaults to testing all available structures.

  • save_dir (str, default: '.') – Path to train and save the mbGDML model.

  • overwrite (bool, default: False) – Overwrite existing files.

  • write_json (bool, default: True) – Write a JSON file containing information about the training job.

  • write_idxs (bool, default: True) – Write npy files for training, validation, and test indices.
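The growth of the training set from n_train_init to n_train_final in steps of n_train_step can be sketched as follows. This is a simplified illustration, not the library's implementation: select_problematic is a hypothetical helper, the per-structure errors are random stand-ins, and a real run would retrain the model and recompute errors at every iteration.

```python
import numpy as np

def select_problematic(errors, train_idxs, n_add):
    """Add the n_add highest-error structures not already in the training set."""
    errors = np.asarray(errors, dtype=float)
    # Only structures outside the current training set are candidates.
    candidates = np.setdiff1d(np.arange(errors.size), train_idxs)
    # Sort candidates by descending error and keep the worst n_add.
    worst = candidates[np.argsort(errors[candidates])[::-1][:n_add]]
    return np.sort(np.concatenate([train_idxs, worst]))

# Toy per-structure errors standing in for model prediction errors.
rng = np.random.default_rng(0)
errors = rng.random(50)

train_idxs = np.arange(5)  # n_train_init = 5
n_train_final, n_train_step = 15, 4

while train_idxs.size < n_train_final:
    # The last step may be smaller so the final size is exact.
    n_add = min(n_train_step, n_train_final - train_idxs.size)
    # A real loop would retrain and re-evaluate errors here.
    train_idxs = select_problematic(errors, train_idxs, n_add)
```

In this sketch the training set grows 5, 9, 13, 15; whether the library shortens the last step in the same way is an assumption.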

bayes_opt(dataset, model_name, n_train, n_valid, n_test=None, save_dir='.', is_final=False, use_domain_opt=False, plot_bo=True, train_idxs=None, valid_idxs=None, overwrite=False, write_json=True, write_idxs=True)

Train a GDML model using Bayesian optimization for sigma.

Uses the Bayesian optimization package to automatically find the optimal sigma. This will maximize the negative validation loss.

gp_params can be used to pass options to the BayesianOptimization.maximize() method.

A sequential domain reduction optimizer is used to accelerate the convergence to an optimal sigma (when requested).

Parameters:
  • dataset (mbgdml.data.DataSet) – Dataset to train, validate, and test a model on.

  • model_name (str) – User-defined model name without the '.npz' file extension.

  • n_train (int) – The number of training points to use.

  • n_valid (int) – The number of validation points to use.

  • n_test (int, default: None) – The number of test points to test the validated GDML model. Defaults to testing all available structures.

  • save_dir (str, default: '.') – Path to train and save the mbGDML model. Defaults to current directory.

  • is_final (bool, default: False) – Whether to use bayes_opt_params_final instead of bayes_opt_params.

  • use_domain_opt (bool, default: False) – Whether to use a sequential domain reduction optimizer. This sometimes crashes.

  • plot_bo (bool, default: True) – Plot the Bayesian optimization Gaussian process.

  • train_idxs (numpy.ndarray, default: None) – The specific indices of structures to train the model on. If None, the training set is sampled automatically.

  • valid_idxs (numpy.ndarray, default: None) – The specific indices of structures to validate models on. If None, structures will be automatically determined.

  • overwrite (bool, default: False) – Overwrite existing files.

  • write_json (bool, default: True) – Write a JSON file containing information about the training job.

  • write_idxs (bool, default: True) – Write npy files for training, validation, and test indices.

Returns:

  • dict – Optimal many-body GDML model.

  • bayes_opt.BayesianOptimization – The Bayesian optimizer object.

bayes_opt_n_check_rising

Number of additional sigma_grid probes used to check whether the loss continues to rise after a minimum is found.

We often perform a grid search prior to Bayesian optimization. Sometimes, with \(n\)-body training, the loss will start rising but then fall again to a lower value. Thus, we probe some extra (larger) sigmas to check whether the loss falls again. If it does, the grid search is restarted.

Type:

int
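The rising-loss check can be sketched with a toy loss table. This is one possible reading of the behavior described above, not the library's code: grid_search_with_rising_check is a hypothetical helper, and "restarting" is modeled by resetting the counter of rising probes whenever a new minimum appears.

```python
def grid_search_with_rising_check(sigmas, loss_fn, n_check_rising=1):
    """Scan sorted sigmas; after the loss rises, allow a few extra probes."""
    best_sigma, best_loss = None, float("inf")
    n_rising = 0  # consecutive probes that did not improve on the best loss
    probed = []
    for sigma in sorted(sigmas):
        loss = loss_fn(sigma)
        probed.append(sigma)
        if loss < best_loss:
            best_sigma, best_loss = sigma, loss
            n_rising = 0  # a new minimum restarts the check
        else:
            n_rising += 1
            # Stop once the allowed number of extra probes is used up.
            if n_rising > n_check_rising:
                break
    return best_sigma, best_loss, probed

# Toy losses: a local minimum at sigma=50 and a lower one at sigma=200.
toy_losses = {2: 5.0, 25: 3.0, 50: 2.0, 100: 2.5,
              200: 1.0, 300: 1.5, 400: 2.0, 500: 3.0}

no_check = grid_search_with_rising_check(toy_losses, toy_losses.get, n_check_rising=0)
with_check = grid_search_with_rising_check(toy_losses, toy_losses.get, n_check_rising=1)
```

Without extra probes the search stops at the local minimum (sigma of 50); with one extra probe it continues past the bump and finds the lower loss at a sigma of 200.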

bayes_opt_params

Bayesian optimization parameters.

Default:

{
    'init_points': 10, 'n_iter': 10, 'alpha': 1e-7, 'acq': 'ucb',
    'kappa': 1.5
}

Type:

dict

bayes_opt_params_final

Bayesian optimization parameters for the final model.

If None, then bayes_opt_params are used.

Default: None

Type:

dict

check_energy_pred

If True, return the model with the lowest loss that also predicts reasonable energies. If False, the model with the lowest loss is not checked for reasonable energy predictions.

Sometimes, GDML kernels are unable to accurately reconstruct potential energies even if the force predictions are accurate. This tends to occur in many-body models with very low or very high sigmas (i.e., sigmas less than 5 or greater than 500).

Default: True

Type:

bool

create_task(train_dataset, n_train, valid_dataset, n_valid, sigma, train_idxs=None, valid_idxs=None)

Create a single training task that can be used as a template for hyperparameter searches.

Parameters:
  • train_dataset (mbgdml.data.DataSet) – Dataset for training a model on.

  • n_train (int) – The number of training points to sample.

  • valid_dataset (mbgdml.data.DataSet) – Dataset for validating a model on.

  • n_valid (int) – The number of validation points to sample, without replacement.

  • sigma (float or int) – Kernel length scale of the desired model.

  • train_idxs (numpy.ndarray, default: None) – The specific indices of structures to train the model on. If None, the training set is sampled automatically.

  • valid_idxs (numpy.ndarray, default: None) – The specific indices of structures to validate models on. If None, structures will be automatically determined.
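When train_idxs and valid_idxs are left as None, indices are sampled automatically. A minimal sketch of sampling without replacement with NumPy follows; sample_idxs is a hypothetical helper, and excluding training indices from the validation pool is an assumption that only makes sense when both sets come from the same data set.

```python
import numpy as np

def sample_idxs(n_total, n_sample, exclude=None, seed=None):
    """Sample n_sample unique structure indices, optionally skipping some."""
    rng = np.random.default_rng(seed)
    pool = np.arange(n_total)
    if exclude is not None:
        # Remove indices that are already used elsewhere (e.g., for training).
        pool = np.setdiff1d(pool, exclude)
    # replace=False guarantees that no structure is drawn twice.
    return np.sort(rng.choice(pool, size=n_sample, replace=False))

# Toy data set of 1000 structures.
train_idxs = sample_idxs(1000, 100, seed=0)
valid_idxs = sample_idxs(1000, 50, exclude=train_idxs, seed=1)
```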

Train a GDML model using a grid search for sigma.

Usually, the validation errors will decrease until an optimal sigma is found then start to increase (overfitting). We sort sigmas from lowest to highest and stop the search once the loss function starts increasing.

Parameters:
  • dataset (mbgdml.data.DataSet) – Dataset to train, validate, and test a model on.

  • model_name (str) – User-defined model name without the '.npz' file extension.

  • n_train (int) – The number of training points to use.

  • n_valid (int) – The number of validation points to use.

  • n_test (int, default: None) – The number of test points to test the validated GDML model. Defaults to testing all available structures.

  • save_dir (str, default: '.') – Path to train and save the mbGDML model. Defaults to current directory.

  • train_idxs (numpy.ndarray, default: None) – The specific indices of structures to train the model on. If None, the training set is sampled automatically.

  • valid_idxs (numpy.ndarray, default: None) – The specific indices of structures to validate models on. If None, structures will be automatically determined.

  • overwrite (bool, default: False) – Overwrite existing files.

  • write_json (bool, default: True) – Write a JSON file containing information about the training job.

  • write_idxs (bool, default: True) – Write npy files for training, validation, and test indices.

Returns:

GDML model with an optimal hyperparameter found via grid search.

Return type:

dict

keep_tasks

Keep all models trained during the train task if True. They are removed by default.

Type:

bool

loss_func

Loss function for validation. The input of this function is the dictionary of mbgdml._gdml.train.add_valid_errors which contains force and energy errors.

Default: mbgdml.losses.loss_f_rmse

Type:

callable

loss_kwargs

Loss function keyword arguments with the exception of the validation results dictionary.

Type:

dict
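A custom loss function paired with loss_kwargs might look like the sketch below. The layout of the results dictionary is not given in this section, so the 'force', 'energy', and 'rmse' keys, the function name, and the weights are all assumptions made for illustration.

```python
def weighted_rmse_loss(results, w_force=1.0, w_energy=0.0):
    """Combine force and energy RMSEs into a single validation loss."""
    # Assumed layout: results[property]['rmse'] holds the validation RMSE.
    force_rmse = results["force"]["rmse"]
    energy_rmse = results["energy"]["rmse"]
    return w_force * force_rmse + w_energy * energy_rmse

# Stand-in validation results with made-up values.
example_results = {"force": {"rmse": 0.25}, "energy": {"rmse": 0.5}}

# Everything except the results dictionary is passed as keyword arguments.
loss_kwargs = {"w_force": 1.0, "w_energy": 0.5}
loss = weighted_rmse_loss(example_results, **loss_kwargs)
```

Under these assumptions, such a function would be assigned to loss_func and its weights to loss_kwargs.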

min_memory_analytic(n_train, n_atoms)

Minimum memory recommendation for training analytically.

GDML currently only supports closed-form (analytic) solutions. Thus, the entire kernel matrix must be in memory, which requires \((M \cdot 3N)^2\) double-precision (8-byte) entries, where \(M\) is the number of training structures and \(N\) is the number of atoms per structure. This provides a rough estimate for memory requirements.

Parameters:
  • n_train (int) – Number of training structures.

  • n_atoms (int) – Number of atoms in a single structure.

Returns:

Minimum memory requirements in MB.

Return type:

float
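The estimate above reduces to a few lines of arithmetic. A minimal sketch, assuming 1 MB is 10^6 bytes and ignoring any overhead beyond the kernel matrix itself; kernel_memory_mb is a hypothetical name, and the library's own method may convert units differently:

```python
def kernel_memory_mb(n_train, n_atoms, bytes_per_entry=8):
    """Rough memory (in MB) for a dense (M * 3N) x (M * 3N) kernel matrix."""
    # Each training structure contributes 3N force components.
    n_rows = n_train * 3 * n_atoms
    # Square matrix of double-precision entries; 1 MB taken as 1e6 bytes.
    return n_rows**2 * bytes_per_entry / 1e6

# 1000 training structures of a 6-atom system (e.g., a water dimer).
mem_mb = kernel_memory_mb(1000, 6)
```

Because the requirement grows quadratically in both n_train and n_atoms, doubling the training set size quadruples the estimate.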

plot_bayes_opt_gp(optimizer)

Prepare a plot of the Bayesian optimization Gaussian process.

Parameters:

optimizer (bayes_opt.BayesianOptimization) –

Returns:

A matplotlib figure object.

Return type:

object

require_E_eval

Require energy evaluation even if the energy predictions are poor.

If False, it defaults to sGDML behavior (this does not work well with n-body training).

Default: True

Type:

bool

save_idxs(model, dataset, save_dir, n_test)

Saves npy files of the dataset splits (training, validation, and test).

save_valid_csv(save_dir, valid_json)

Writes a CSV summary file for validation statistics.

This makes it easier to see the trend between sigma and validation error.

Parameters:
  • save_dir (str) – Where to save the CSV file.

  • valid_json (dict) – The validation json curated during the training routine.

sigma_bounds

Kernel length scale bounds for the Bayesian optimization.

This is only used if sigma_grid is None.

Default: (2, 400)

Type:

tuple

sigma_grid

Determining reasonable sigma_bounds is difficult without some prior experience with the system. Even then, the optimal sigma can drastically change depending on the training set size.

sigma_grid will assist with determining optimal sigma_bounds by first performing a coarse grid search. The Bayesian optimization will start with the bounds of the grid-search minimum. It is recommended to choose sigma_bounds at least as wide as your sigma_grid; it will be updated internally.

The number of probes done during the initial grid search will be subtracted from the Bayesian optimization init_points.

We recommend that the grid includes several lower sigmas (< 50), a few medium sigmas (< 500), and several large sigmas that span up to at least 1000 for higher-order models.

Default:

[
    2, 25, 50, 100, 200, 300, 400, 500, 700, 900, 1100, 1500, 2000,
    2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000,
    50000, 100000
]

Type:

list
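How a coarse grid search could narrow the bounds and reduce init_points is sketched below. Both helpers are hypothetical: taking the grid neighbors of the lowest-loss sigma as the new bounds is one interpretation of "the bounds of the grid-search minimum", and clamping the remaining init_points at zero is an assumption.

```python
def bounds_from_grid(probed_sigmas, losses):
    """Bracket the lowest-loss sigma with its neighboring grid points."""
    i_min = min(range(len(losses)), key=lambda i: losses[i])
    # Fall back to the minimum itself at either end of the probed grid.
    lower = probed_sigmas[max(i_min - 1, 0)]
    upper = probed_sigmas[min(i_min + 1, len(probed_sigmas) - 1)]
    return lower, upper

def remaining_init_points(init_points, n_grid_probes):
    """Subtract grid probes from init_points, never going below zero."""
    return max(init_points - n_grid_probes, 0)

# Toy grid-search results with made-up validation losses.
probed_sigmas = [2, 25, 50, 100, 200]
losses = [5.0, 3.0, 2.0, 2.5, 4.0]

sigma_bounds = bounds_from_grid(probed_sigmas, losses)
init_points = remaining_init_points(10, len(probed_sigmas))
```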

test_model(model, dataset, n_test=None)

Test model and add mbGDML modifications.

Returns:

Tested and finalized many-body GDML model.

Return type:

mbgdml.models.gdmlModel

train_model(task)

Trains a GDML model from a task.

Parameters:
  • task (dict) – Training task.

  • n_train (int) – The number of training points to sample.

Returns:

Trained (not validated or tested) model.

Return type:

dict