GDMLTrain

class mbgdml._gdml.train.GDMLTrain(max_memory=None, max_processes=None, use_torch=False)

Train GDML force fields.

This class trains models using different closed-form and numerical solvers. Some solvers support GPUs through PyTorch, which requires the optional torch dependency to be installed.

Parameters:
  • max_memory (int, default: None) – Limit the maximum memory usage. This is a soft limit that cannot always be enforced.

  • max_processes (int, default: None) – Limit the maximum number of processes. Otherwise, all CPU cores are used. This parameter has no effect if use_torch=True.

  • use_torch (bool, default: False) – Use PyTorch to calculate predictions (if supported by solver).

Raises:
  • Exception – If multiple instances of this class are created.

  • ImportError – If the optional PyTorch dependency is missing, but PyTorch features are used.

_assemble_kernel_mat(R_desc, R_d_desc, tril_perms_lin, sig, desc, use_E_cstr=False, col_idxs=slice(None, None, None), alloc_extra_rows=0)

Compute force field kernel matrix.

The Hessian of the Matern kernel is used with n = 2 (twice differentiable). Each row and column consists of matrix-valued blocks, which encode the interaction of one training point with all others. The result is stored in shared memory (a global variable).

Parameters:
  • R_desc (numpy.ndarray, ndim: 2, optional) – An array of size \(M \times D\) containing the descriptors of dimension \(D\) for \(M\) molecules.

  • R_d_desc (numpy.ndarray, ndim: 2, optional) – An array of size \(M \times D \times 3N\) containing the descriptor Jacobians for \(M\) molecules. The descriptor has dimension \(D\) with \(3N\) partial derivatives with respect to the \(3N\) Cartesian coordinates of each atom.

  • tril_perms_lin (numpy.ndarray, ndim: 1) – An array containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.

  • sig (int) – Hyperparameter sigma (kernel length scale).

  • use_E_cstr (bool, optional) – Include energy constraints in the kernel. This can sometimes be helpful in tricky cases.

  • cols_m_limit (int, optional) – Only generate the columns up to index cols_m_limit. This creates an \(M3N \times\) cols_m_limit\(3N\) kernel matrix, instead of \(M3N \times M3N\).

  • cols_3n_keep_idxs (numpy.ndarray, optional) – Only generate columns with the given indices in the \(3N \times 3N\) kernel function. The resulting kernel matrix will have dimension \(M3N \times M\) len(cols_3n_keep_idxs).

Returns:

Force field kernel matrix.

Return type:

numpy.ndarray
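
The block structure described above can be illustrated with a minimal sketch. It assumes that n = 2 corresponds to the Matérn kernel with smoothness 5/2, and it works directly in descriptor space: the descriptor Jacobians, permutations, energy constraints, and shared-memory storage handled by the real method are all left out. The function names matern52_hessian_block and assemble_kernel are hypothetical and not part of mbgdml.

```python
import numpy as np


def matern52_hessian_block(x_i, x_j, sig):
    """D x D Hessian block of the Matern (nu = 5/2) kernel for two descriptors."""
    r = x_i - x_j
    a = np.sqrt(5.0) / sig
    d = np.linalg.norm(r)
    pref = (a**2 / 3.0) * np.exp(-a * d)
    # Mixed second derivative of k with respect to x_i and x_j.
    return pref * ((1.0 + a * d) * np.eye(r.size) - a**2 * np.outer(r, r))


def assemble_kernel(X, sig, n_cols=None):
    """Stack the blocks into an (M*D) x (n_cols*D) matrix (hypothetical helper)."""
    M, D = X.shape
    n_cols = M if n_cols is None else n_cols
    K = np.empty((M * D, n_cols * D))
    for i in range(M):
        for j in range(n_cols):
            K[i * D:(i + 1) * D, j * D:(j + 1) * D] = matern52_hessian_block(
                X[i], X[j], sig
            )
    return K


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # 4 toy points with 3-dimensional descriptors
K = assemble_kernel(X, sig=10)  # full 12 x 12 matrix
K_partial = assemble_kernel(X, sig=10, n_cols=2)  # only the first 2 block columns
```

Limiting the number of block columns, as in K_partial, mirrors the idea of generating only part of the kernel matrix.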

_recov_int_const(model, task, R_desc=None, R_d_desc=None, require_E_eval=False)

Estimate the integration constant for a force field model.

The offset between the energies predicted for the original training data and the true energy labels is computed in the least-squares sense. In addition, common issues with user-provided datasets are diagnosed here.

Parameters:
  • model (dict) – Data structure of custom type model.

  • task (dict) – Data structure of custom type task.

  • R_desc (numpy.ndarray, ndim: 2, optional) – An array of size \(M \times D\) containing the descriptors of dimension \(D\) for \(M\) molecules.

  • R_d_desc (numpy.ndarray, ndim: 2, optional) – An array of size \(M \times D \times 3N\) containing the descriptor Jacobians for \(M\) molecules. The descriptor has dimension \(D\) with \(3N\) partial derivatives with respect to the \(3N\) Cartesian coordinates of each atom.

  • require_E_eval (bool, default: False) – Force the computation and return of the integration constant regardless of whether significant errors are detected.

Returns:

Estimate for the integration constant.

Return type:

float

Raises:
  • ValueError – If the sign of the force labels in the dataset from which the model emerged is switched (e.g. gradients instead of forces).

  • ValueError – If inconsistent/corrupted energy labels are detected in the provided dataset.

  • ValueError – If different scales in energy vs. force labels are detected in the provided dataset.
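
As a rough illustration of the least-squares offset and the sign diagnostic, the following sketch estimates a constant shift between predicted and reference energies. It is a simplification under stated assumptions: the function name estimate_int_const and the slope-based sign check are hypothetical, and the checks and thresholds used by the actual method may differ.

```python
import numpy as np


def estimate_int_const(E_pred, E_ref):
    """Constant c minimizing sum((E_ref - (E_pred + c))**2) (hypothetical helper)."""
    E_pred = np.asarray(E_pred, dtype=float)
    E_ref = np.asarray(E_ref, dtype=float)
    # Slope of a linear least-squares fit E_ref ~ slope * E_pred + intercept.
    slope = np.polyfit(E_pred, E_ref, 1)[0]
    if slope < 0:
        # A negative slope suggests gradients were supplied instead of forces.
        raise ValueError("Predicted and reference energies have opposite trends.")
    # The least-squares optimal constant offset is the mean residual.
    return float(np.mean(E_ref - E_pred))


E_pred = np.array([-1.2, 0.4, 0.9, -0.3])
E_ref = E_pred + 3.5  # toy labels shifted by a known constant
c = estimate_int_const(E_pred, E_ref)
```

The mean residual is the closed-form solution of the one-parameter least-squares problem, so no iterative fit is needed for the offset itself.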

create_model(task, solver, R_desc, R_d_desc, tril_perms_lin, std, alphas_F, alphas_E=None)

Create a data structure, dict, of custom type model.

These data structures contain the trained model and everything that is needed to generate predictions for new inputs. Each task also contains the MD5 fingerprints of the used datasets.

Parameters:
  • task (dict) – Data structure of custom type task from which the model emerged.

  • solver (str) – Identifier string for the solver that has been used to train this model.

  • R_desc (numpy.ndarray, ndim: 2) – An array of size \(M \times D\) containing the descriptors of dimension \(D\) for \(M\) molecules.

  • R_d_desc (numpy.ndarray, ndim: 2) – An array of size \(M \times D \times 3N\) containing the descriptor Jacobians for \(M\) molecules. The descriptor has dimension \(D\) with \(3N\) partial derivatives with respect to the \(3N\) Cartesian coordinates of each atom.

  • tril_perms_lin (numpy.ndarray, ndim: 1) – An array containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.

  • std (float) – Standard deviation of the training labels.

  • alphas_F (numpy.ndarray, ndim: 1) – An array of size \(3NM\) containing the linear coefficients that correspond to the force constraints.

  • alphas_E (numpy.ndarray, ndim: 1, optional) – An array of size N containing the linear coefficients that correspond to the energy constraints. Only used if use_E_cstr is True.

Returns:

Data structure of custom type model.

Return type:

dict

create_task(train_dataset, n_train, valid_dataset, n_valid, sig, lam=1e-10, perms=None, use_sym=True, use_E=True, use_E_cstr=False, use_cprsn=False, solver=None, solver_tol=0.0001, idxs_train=None, idxs_valid=None)

Create a data structure, dict, of custom type task.

These data structures serve as recipes for model creation, summarizing the configuration of one particular training run. Training and test points are sampled from the provided dataset without replacement. If the same dataset is given for training and testing, the subsets are drawn without overlap.

Each task also contains a choice for the hyperparameters of the training process and the MD5 fingerprints of the used datasets.

Parameters:
  • train_dataset (dict) – Data structure of custom type dataset containing train dataset.

  • n_train (int) – Number of training points to sample.

  • valid_dataset (dict) – Data structure of custom type dataset containing validation dataset.

  • n_valid (int) – Number of validation points to sample.

  • sig (int) – Hyperparameter sigma (kernel length scale).

  • lam (float, default: 1e-10) –

    Hyperparameter lambda (regularization strength).

    Note

    Early sGDML models used 1e-15.

  • perms (numpy.ndarray, optional) – A 2D array of size P x N containing P possible permutations of the N atoms in the system. This argument takes priority over the ones provided in the training dataset. No automatic discovery is run when this argument is provided.

  • use_sym (bool, default: True) – True: include symmetries (sGDML), False: GDML.

  • use_E (bool, optional) –

    True: reconstruct force field with corresponding potential energy surface,

    False: ignore energy during training, even if energy labels are available in the dataset. The trained model will still be able to predict energies up to an unknown integration constant. Note that the accuracy of the energy predictions will be untested.

  • use_E_cstr (bool, default: False) – Include energy constraints in the kernel. This can sometimes be helpful in tricky cases.

  • use_cprsn (bool, default: False) – Compress the kernel matrix along symmetric degrees of freedom. If False, training is done on the full kernel matrix.

  • solver (str, default: None) – Type of solver to use for training. 'analytic' is currently the only option and is used when None is given.

Returns:

Data structure of custom type task.

Return type:

dict

Raises:

ValueError – If a reconstruction of the potential energy surface is requested, but the energy labels are missing in the dataset.
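
The sampling rule described above (no replacement, and no overlap when both subsets come from the same dataset) can be sketched as follows. This is an illustration only: draw_split is a hypothetical helper that samples uniformly at random, and the library may use a different sampling scheme internally.

```python
import numpy as np


def draw_split(n_train_points, n_valid_points, n_train, n_valid,
               same_dataset, seed=None):
    """Draw training and validation indices without replacement (hypothetical)."""
    rng = np.random.default_rng(seed)
    idxs_train = rng.choice(n_train_points, size=n_train, replace=False)
    if same_dataset:
        # Exclude training indices so the two subsets cannot overlap.
        remaining = np.setdiff1d(np.arange(n_train_points), idxs_train)
        idxs_valid = rng.choice(remaining, size=n_valid, replace=False)
    else:
        idxs_valid = rng.choice(n_valid_points, size=n_valid, replace=False)
    return idxs_train, idxs_valid


# Toy case: one dataset with 100 structures used for both subsets.
idxs_train, idxs_valid = draw_split(100, 100, n_train=20, n_valid=30,
                                    same_dataset=True, seed=0)
```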

train(task, require_E_eval=False)

Train a model based on a task.

Parameters:
  • task (dict) – Data structure of custom type task from create_task().

  • require_E_eval (bool, default: False) – Require energy evaluation.

Returns:

Data structure of custom type model from create_model().

Return type:

dict

Raises:

ValueError – If the provided dataset contains invalid lattice vectors.
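
To give a feel for what a closed-form solver with a regularization strength such as lam does, here is a generic regularized linear solve via a Cholesky factorization. It is a textbook kernel ridge regression sketch, not the mbgdml implementation: solve_closed_form is a hypothetical name, the toy matrix stands in for a kernel matrix, and sign conventions or label scaling in the library may differ.

```python
import numpy as np


def solve_closed_form(K, y, lam=1e-10):
    """Solve (K + lam * I) alphas = y for a symmetric positive definite K."""
    K_reg = K + lam * np.eye(K.shape[0])
    L = np.linalg.cholesky(K_reg)  # K_reg = L @ L.T
    z = np.linalg.solve(L, y)  # forward step: L z = y
    return np.linalg.solve(L.T, z)  # backward step: L.T alphas = z


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
K = A @ A.T + np.eye(6)  # toy symmetric positive definite matrix
y = rng.normal(size=6)  # toy training labels
lam = 1e-10
alphas = solve_closed_form(K, y, lam)
```

The regularization term keeps the system well conditioned when the kernel matrix is close to singular.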

train_labels(F, use_E, use_E_cstr, E=None)

Compute custom train labels.

By default, they are the raveled forces scaled by the standard deviation. Energy labels are included if use_E and use_E_cstr are True.

Note

We use negative energies with the mean removed for training labels (if requested).

Parameters:
  • F (numpy.ndarray, ndim: 3) – Train set forces.

  • use_E (bool) – If energies are used during training.

  • use_E_cstr (bool) – If energy constraints are being added to the kernel.

  • E (numpy.ndarray, default: None) – Train set energies.

Returns:

  • numpy.ndarray – Train labels including forces and possibly energies.

  • float – Standard deviation of training labels.

  • float – Mean energy. If energy labels are not included then this is None.
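
The label construction described for train_labels can be sketched as below. The order of operations (appending the negated, mean-removed energies before computing the standard deviation over all labels) is an assumption, and train_labels_sketch is a hypothetical stand-in rather than the library method.

```python
import numpy as np


def train_labels_sketch(F, use_E, use_E_cstr, E=None):
    """Build scaled training labels from forces and, optionally, energies."""
    y = np.asarray(F, dtype=float).ravel().copy()
    E_mean = None
    if use_E and use_E_cstr:
        E = np.asarray(E, dtype=float).ravel()
        E_mean = float(np.mean(E))
        # Negative energies with the mean removed are appended to the forces.
        y = np.hstack((y, -E + E_mean))
    y_std = float(np.std(y))
    y /= y_std  # scale all labels by their standard deviation
    return y, y_std, E_mean


rng = np.random.default_rng(0)
F = rng.normal(size=(5, 3, 3))  # 5 structures, 3 atoms, 3 Cartesian components
E = rng.normal(size=5)
y_all, std_all, E_mean = train_labels_sketch(F, True, True, E)
y_forces, std_forces, no_mean = train_labels_sketch(F, True, False)
```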