GDMLPredict#

class mbgdml._gdml.predict.GDMLPredict(model, batch_size=None, num_workers=None, max_memory=None, max_processes=None, use_torch=False)[source]#

Query trained sGDML force fields.

This class is used to load a trained model and make energy and force predictions for new geometries. GPU support is provided through PyTorch (requires the optional torch dependency to be installed).

Note

The parameters batch_size and num_workers are only relevant if this code runs on a CPU. Both can be set automatically via the function prepare_parallel. Running calculations via PyTorch is only recommended if GPU hardware is available; on a CPU, our NumPy implementation is faster.

Parameters:
  • model (dict) – Data structure that holds all parameters of the trained model. This object is the output of GDMLTrain.train.

  • batch_size (int, optional) – Chunk size for processing parallel tasks.

  • num_workers (int, optional) – Number of parallel workers.

  • max_memory (int, optional) – Limit the maximum memory usage in GB. This is only a soft limit that cannot always be enforced.

  • max_processes (int, optional) – Limit the maximum number of processes; otherwise, all CPU cores are used. This parameter has no effect if use_torch=True.

  • use_torch (bool, optional) – Use PyTorch to calculate predictions.
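
A minimal usage sketch, assuming the model dictionary was produced by GDMLTrain.train and saved as an .npz file (the file name below is a placeholder):

    import numpy as np

    from mbgdml._gdml.predict import GDMLPredict

    # Placeholder path to a saved GDMLTrain.train output.
    model = dict(np.load("model.npz", allow_pickle=True))

    # CPU prediction; cap the number of worker processes at 4.
    gdml = GDMLPredict(model, max_processes=4)

    # GPU prediction instead (requires the optional torch dependency):
    # gdml = GDMLPredict(model, use_torch=True)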

_set_bulk_mp(bulk_mp=False)[source]#

Toggles bulk prediction mode.

If bulk prediction is enabled, the prediction is parallelized across input geometries, i.e., each worker generates the complete prediction for one query. Otherwise (depending on the number of available CPU cores), the input geometries are processed sequentially, but each of them may be processed by multiple workers at once (in chunks).

Note

This parameter can be optimally determined using prepare_parallel.

Parameters:

bulk_mp (bool, optional) – Enable or disable bulk prediction mode.

_set_chunk_size(chunk_size=None)[source]#

Set chunk size for each worker process.

Every prediction is generated as a linear combination of the training points that make up the model. If multiple workers are available (and bulk mode is disabled), each one processes an (approximately equal) part of those training points. The chunk size then determines how much of a worker's workload is passed to NumPy's underlying low-level routines at once. If the chunk size is smaller than the number of points the worker is supposed to process, it processes them in multiple steps using a loop. This can sometimes be faster, depending on the available hardware.

Note

This parameter can be optimally determined using prepare_parallel.

Parameters:

chunk_size (int, default: None) – Chunk size (maximum value is set if None).

_set_num_workers(num_workers=None, force_reset=False)[source]#

Set number of processes to use during prediction.

If bulk_mp is True, each worker handles the complete generation of a single prediction (this is intended for querying multiple geometries at once).

If bulk_mp is False, each worker may handle only a part of a prediction (chunks are defined in 'wkr_starts_stops'). In that scenario, multiple processes are used to distribute the work of generating a single prediction.

This number should not exceed the number of available CPU cores.

Note

This parameter can be optimally determined using prepare_parallel.

Parameters:
  • num_workers (int, optional) – Number of processes (maximum value is set if None).

  • force_reset (bool, optional) – Force applying the new setting.
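
For reference only, the private setters above can be combined to configure the CPU parallelization by hand. prepare_parallel is the recommended way to determine these values; the numbers below are purely illustrative:

    import numpy as np

    from mbgdml._gdml.predict import GDMLPredict

    model = dict(np.load("model.npz", allow_pickle=True))  # placeholder path
    gdml = GDMLPredict(model)

    # Illustrative values; prepare_parallel() would normally choose these.
    gdml._set_num_workers(4, force_reset=True)  # at most the number of CPU cores
    gdml._set_bulk_mp(False)                    # parallelize within each prediction
    gdml._set_chunk_size(None)                  # None selects the maximum chunk size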

get_GPU_batch()[source]#

Get the batch size used by the GPU implementation to process bulk predictions (i.e., predictions for multiple input geometries at once).

This value is determined on-the-fly depending on the available GPU memory.

predict(R=None, return_E=True)[source]#

Predict energy and forces for multiple geometries.

This function can run on the GPU, if the optional PyTorch dependency is installed and use_torch=True was specified during initialization of this class.

Optionally, the descriptors and descriptor Jacobians for the same geometries can be provided if they are already available from previous calculations.

Note

The order of the atoms in R is not arbitrary and must be the same as used for training the model.

Parameters:
  • R (numpy.ndarray, optional) – A 2D array of size M x 3N containing the Cartesian coordinates of each atom in M molecules. If this parameter is omitted, the training error is returned. Note that the training geometries must be set right after initialization using set_R() for this to work.

  • return_E (bool, default: True) – If False, only the forces are returned.

Returns:

  • numpy.ndarray – Energies stored in a 1D array of size M (only returned if return_E is True).

  • numpy.ndarray – Forces stored in a 2D array of size M x 3N.
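
A sketch of a bulk prediction call. The coordinates are random and only illustrate the expected M x 3N layout, and reading the atomic numbers from the model's "z" entry is an assumption about the model dictionary:

    import numpy as np

    from mbgdml._gdml.predict import GDMLPredict

    model = dict(np.load("model.npz", allow_pickle=True))  # placeholder path
    gdml = GDMLPredict(model)

    n_atoms = len(model["z"])  # assumes atomic numbers are stored under "z"
    R = np.random.uniform(-1.0, 1.0, (5, 3 * n_atoms))  # 5 flattened geometries

    E, F = gdml.predict(R)                    # E: shape (5,), F: shape (5, 3N)
    F_only = gdml.predict(R, return_E=False)  # forces only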

prepare_parallel(n_bulk=1, n_reps=1, return_is_from_cache=False)[source]#

Find and set the optimal parallelization parameters for the currently loaded model, running on a particular system. The result also depends on the number of geometries n_bulk that will be passed at once when calling the predict function.

This function runs a benchmark in which the prediction routine is repeatedly called n_reps-times (default: 1) with varying parameter configurations, while the runtime is measured for each one. The optimal parameters are then cached for fast retrieval in future calls of this function.

We recommend calling this function after initialization of this class, as it will drastically increase the performance of the predict function.

Note

Depending on the parameter n_reps, this routine may take some seconds/minutes to complete. However, once a statistically significant number of benchmark results has been gathered for a particular configuration, it starts returning almost instantly.

Parameters:
  • n_bulk (int, optional) – Number of geometries that will be passed to the predict function in each call (performance will be optimized for that exact use case).

  • n_reps (int, optional) – Number of repetitions (bigger value: more accurate, but also slower).

  • return_is_from_cache (bool, optional) – If enabled, this function returns a second value indicating if the returned results were obtained from cache.

Returns:

  • int – Force and energy prediction speed in geometries per second.

  • bool, optional – Whether the results were obtained from cache (only returned if return_is_from_cache is True).
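
A sketch of benchmarking and caching the parallelization settings for a use case of 100 geometries per predict call (the model path is a placeholder):

    import numpy as np

    from mbgdml._gdml.predict import GDMLPredict

    model = dict(np.load("model.npz", allow_pickle=True))  # placeholder path
    gdml = GDMLPredict(model)

    # Benchmark configurations for bulk predictions of 100 geometries,
    # timing each configuration 3 times, and cache the best settings.
    gps, from_cache = gdml.prepare_parallel(
        n_bulk=100, n_reps=3, return_is_from_cache=True
    )
    print(f"~{gps} geometries/s (result from cache: {from_cache})")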

set_R_d_desc(R_d_desc)[source]#

Store a reference to the training geometry descriptor Jacobians.

This function must be called before set_alphas() can be used.

This routine is used during iterative model training.

Parameters:

R_d_desc (numpy.ndarray, optional) – An array of size M x D x 3N containing the descriptor Jacobians for M molecules. The descriptor has dimension D, with 3N partial derivatives with respect to the 3N Cartesian coordinates of each atom.

set_R_desc(R_desc)[source]#

Store a reference to the training geometry descriptors.

This can accelerate iterative model training.

Parameters:

R_desc (numpy.ndarray, optional) – A 2D array of size M x D containing the descriptors of dimension D for M molecules.

set_alphas(alphas_F, alphas_E=None)[source]#

Reconfigure the current model with a new set of regression parameters. R_d_desc must be set (via set_R_d_desc()) before this function can be used.

This routine is used during iterative model training.

Parameters:
  • alphas_F (numpy.ndarray) – 1D array containing the new model parameters.

  • alphas_E (numpy.ndarray, optional) – 1D array containing the additional new model parameters, used if energy constraints are included in the kernel (use_E_cstr=True).
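
A sketch of the call order during iterative training. The array shapes follow the parameter descriptions above, the sizes are placeholders, and D = N(N - 1)/2 assumes the default pairwise inverse-distance descriptor:

    import numpy as np

    from mbgdml._gdml.predict import GDMLPredict

    model = dict(np.load("model.npz", allow_pickle=True))  # placeholder path
    gdml = GDMLPredict(model)

    # Placeholder sizes: M training points, N atoms, descriptor dimension D.
    M, N = 10, 10
    D = N * (N - 1) // 2                # assumes the inverse-distance descriptor
    R_desc = np.zeros((M, D))           # would come from the descriptor routine
    R_d_desc = np.zeros((M, D, 3 * N))  # descriptor Jacobians
    alphas_F = np.zeros(M * 3 * N)      # updated regression parameters (assumed layout)

    gdml.set_R_desc(R_desc)      # optional: can accelerate iterative training
    gdml.set_R_d_desc(R_d_desc)  # required before set_alphas()
    gdml.set_alphas(alphas_F)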