DataSet#

class mbgdml.data.DataSet(dset_path=None, Z_key='Z', R_key='R', E_key='E', F_key='F')[source]#

For creating, loading, manipulating, and using data sets.

Parameters:
  • dset_path (str, optional) – Path to an npz file.

  • Z_key (str, default: 'Z') – Key in the npz file at dset_path for atomic numbers.

  • R_key (str, default: 'R') – Key in the npz file at dset_path for Cartesian coordinates.

  • E_key (str, default: 'E') – Key in the npz file at dset_path for energies.

  • F_key (str, default: 'F') – Key in the npz file at dset_path for atomic forces.
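
As a rough illustration of what the key parameters refer to, the sketch below writes a small npz file whose arrays are stored under non-default names and reads it back with plain NumPy. The file name, the key names (atomic_numbers, coords, energies, forces), the array shapes of Z and R, and all numerical values are assumptions made up for this example; only the DataSet signature quoted in the final comment comes from this page.

```python
import os
import tempfile

import numpy as np

# Toy data: two structures of a three-atom molecule (values are illustrative only).
Z = np.array([8, 1, 1])         # atomic numbers, assumed shape (n_atoms,)
R = np.zeros((2, 3, 3))         # coordinates, assumed shape (n_structures, n_atoms, 3)
E = np.array([-76.40, -76.39])  # energies, one per structure
F = np.zeros((2, 3, 3))         # forces, shape (n_structures, n_atoms, 3)

# Hypothetical file that stores the arrays under non-default key names.
npz_path = os.path.join(tempfile.mkdtemp(), "example-dset.npz")
np.savez(npz_path, atomic_numbers=Z, coords=R, energies=E, forces=F)

# Reading the file back shows the keys that Z_key, R_key, E_key, and F_key
# would have to name.
with np.load(npz_path) as data:
    keys = sorted(data.files)
    loaded_E = data["energies"]

print(keys)

# With mbGDML installed, the matching constructor call would presumably be:
# DataSet(npz_path, Z_key="atomic_numbers", R_key="coords",
#         E_key="energies", F_key="forces")
```

If the file already uses Z, R, E, and F as keys, the defaults should suffice and only dset_path needs to be passed.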

property E#

The energies of structure(s).

A numpy.ndarray with shape of (n,) where n is the number of structures.

Type:

numpy.ndarray

property E_max#

Maximum energy of all structures.

Type:

float

property E_mean#

Mean of all energies.

Type:

float

property E_min#

Minimum energy of all structures.

Type:

float

property E_var#

Energy variance.

Type:

float

property F#

Atomic forces of atoms in structure(s).

A numpy.ndarray with shape of (m, n, 3) where m is the number of structures, n is the number of atoms, and the last axis holds the three Cartesian components.

Type:

numpy.ndarray

property F_max#

Maximum atomic force in all structures.

Type:

float

property F_mean#

Mean of all forces.

Type:

float

property F_min#

Minimum atomic force in all structures.

Type:

float

property F_var#

Force variance.

Type:

float
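
The energy and force summary properties can be pictured with plain NumPy reductions. This is only a sketch: it assumes each statistic is taken over every element of the array (for forces, over all structures, atoms, and Cartesian components), which this page does not spell out, and the numbers are invented.

```python
import numpy as np

# Invented data: m = 2 structures with n = 3 atoms each.
E = np.array([-10.0, -12.0])  # shape (m,)
F = np.array([
    [[0.5, 0.0, -0.5], [1.0, 0.0, 0.0], [-1.5, 0.0, 0.5]],
    [[0.0, 2.0, 0.0], [0.0, -2.0, 0.0], [0.0, 0.0, 0.0]],
])                            # shape (m, n, 3)

# Assumed reductions over all elements of each array.
E_stats = {"max": E.max(), "mean": E.mean(), "min": E.min(), "var": E.var()}
F_stats = {"max": F.max(), "mean": F.mean(), "min": F.min(), "var": F.var()}

print(E.shape, F.shape)
print(E_stats)
print(F_stats)
```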

asdict(gdml_keys=True)[source]#

Converts object into a custom dict.

Parameters:

gdml_keys (bool, default: True) – Data sets can use any keys to specify atomic data. However, mbGDML uses the standard of Z for atomic numbers, R for structure coordinates, E for energies, and F for forces. Using this option changes the data set keys to GDML keys.

Return type:

dict
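
The key renaming that gdml_keys describes amounts to relabeling dictionary entries. The helper below, to_gdml_keys, is a hypothetical stand-in written for this example and is not mbGDML's implementation; the custom key names and values are invented.

```python
def to_gdml_keys(data, Z_key, R_key, E_key, F_key):
    """Return a copy of ``data`` with custom keys renamed to Z, R, E, and F.

    Hypothetical helper for illustration; it is not part of mbGDML.
    """
    rename = {Z_key: "Z", R_key: "R", E_key: "E", F_key: "F"}
    # Keys not listed in ``rename`` (for example metadata) are kept unchanged.
    return {rename.get(key, key): value for key, value in data.items()}


# Invented data set contents stored under non-GDML keys.
custom = {
    "atomic_numbers": [8, 1, 1],
    "coords": [[[0.0, 0.0, 0.0], [0.0, 0.8, 0.6], [0.0, -0.8, 0.6]]],
    "energies": [-76.4],
    "forces": [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]],
    "name": "example",
}

gdml = to_gdml_keys(custom, "atomic_numbers", "coords", "energies", "forces")
print(sorted(gdml))
```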

property comp_ids#

A 1D array relating entity_id to a fragment label for chemical components or species. Labels could be WAT or h2o for water, MeOH for methanol, bz for benzene, etc. There are no standardized labels for species. The index of the label is the respective entity_id. For example, a water and methanol molecule could be ['h2o', 'meoh'].

Examples

Suppose we have a structure containing a water and a methanol molecule. We can use the labels h2o and meoh (which could be anything): ['h2o', 'meoh']. Note that each label is a str, whereas the entity_id is its integer index in the array.

Type:

numpy.ndarray

convertE(E_units)[source]#

Converts energies and updates e_unit.

Parameters:

E_units (str) – Desired units of energy. Options are 'eV', 'hartree', 'kcal/mol', and 'kJ/mol'.

convertF(force_e_units, force_r_units, e_units, r_units)[source]#

Convert forces.

Does not change e_unit or r_unit.

Parameters:
  • force_e_units (str) – Specifies package-specific energy units used in calculation. Available units are 'eV', 'hartree', 'kcal/mol', and 'kJ/mol'.

  • force_r_units (str) – Specifies package-specific distance units used in calculation. Available units are 'Angstrom' and 'bohr'.

  • e_units (str) – Desired units of energy. Available units are 'eV', 'hartree', 'kcal/mol', and 'kJ/mol'.

  • r_units (str) – Desired units of distance. Available units are 'Angstrom' and 'bohr'.
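
Because a force is an energy divided by a distance, converting it combines an energy factor with the inverse of a distance factor. The sketch below shows that arithmetic in plain Python. The function force_factor and the two lookup tables are hypothetical names for this example, and the constants are CODATA 2018 values, which may differ slightly from the ones mbGDML uses internally.

```python
# Energy of one hartree and length of one bohr in each supported unit
# (CODATA 2018 values; mbGDML may use slightly different constants).
ENERGY_PER_HARTREE = {
    "hartree": 1.0,
    "eV": 27.211386245988,
    "kcal/mol": 627.5094740631,
    "kJ/mol": 2625.4996394799,
}
LENGTH_PER_BOHR = {"bohr": 1.0, "Angstrom": 0.529177210903}


def force_factor(force_e_units, force_r_units, e_units, r_units):
    """Factor that turns a force in force_e_units/force_r_units into e_units/r_units."""
    e_factor = ENERGY_PER_HARTREE[e_units] / ENERGY_PER_HARTREE[force_e_units]
    r_factor = LENGTH_PER_BOHR[r_units] / LENGTH_PER_BOHR[force_r_units]
    # Force is energy per distance, so the distance factor divides.
    return e_factor / r_factor


# Example: forces computed in hartree/bohr, wanted in kcal/(mol Angstrom).
factor = force_factor("hartree", "bohr", "kcal/mol", "Angstrom")
forces_hartree_bohr = [0.001, -0.002, 0.0]
forces_kcal_mol_ang = [f * factor for f in forces_hartree_bohr]
print(factor)
print(forces_kcal_mol_ang)
```

Converting energies or coordinates alone uses only the corresponding ratio from one of the two tables.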

convertR(R_units)[source]#

Converts coordinates and updates r_unit.

Parameters:

R_units (str) – Desired units of coordinates. Options are 'Angstrom' and 'bohr'.

property e_unit#

Units of energy. Options are 'eV', 'hartree', 'kcal/mol', and 'kJ/mol'.

Type:

str

property entity_ids#

1D array specifying which atoms belong to which entities.

An entity represents a related set of atoms such as a single molecule, several molecules, or a functional group. For mbGDML, an entity usually corresponds to a model trained to predict energies and forces of those atoms. Each entity_id is an int starting from 0.

It is conceptually similar to the PDBx/mmCIF _atom_site.label_entity_id data item.

Examples

A single water molecule would be [0, 0, 0]. A water (three atoms) and methanol (six atoms) molecule in the same structure would be [0, 0, 0, 1, 1, 1, 1, 1, 1].

Type:

numpy.ndarray
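
To show how entity_ids and comp_ids fit together, the sketch below groups atom indices by entity and looks up each atom's component label. It uses plain Python lists in place of the NumPy arrays the properties return, and the water and methanol layout is the one from the examples on this page.

```python
# Water (3 atoms) followed by methanol (6 atoms).
entity_ids = [0, 0, 0, 1, 1, 1, 1, 1, 1]
comp_ids = ["h2o", "meoh"]  # index in this list is the entity_id

# Collect the atom indices that belong to each entity.
atoms_by_entity = {}
for atom_index, entity_id in enumerate(entity_ids):
    atoms_by_entity.setdefault(entity_id, []).append(atom_index)

# Look up the component label of every atom through its entity_id.
atom_labels = [comp_ids[entity_id] for entity_id in entity_ids]

for entity_id, atom_indices in atoms_by_entity.items():
    print(entity_id, comp_ids[entity_id], atom_indices)
```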

load(dataset_path)[source]#

Read data set.

Parameters:

dataset_path (str) – Path to NumPy npz file.

property mb#

Many-body expansion order of this data set. This is None if the data set does not contain many-body energies and forces.

Type:

int

property mb_dsets_md5#

MD5 hashes of all data sets used to remove n-body contributions from this data set.

Type:

numpy.ndarray

property mb_models_md5#

MD5 hashes of all models used to remove n-body contributions from this data set.

Type:

numpy.ndarray

property md5#

Unique MD5 hash of data set.

Notes

Z and R are always used to generate the MD5 hash. If available, mbgdml.data.DataSet.E and mbgdml.data.DataSet.F are used.

Type:

str
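
The rule in the notes (always hash Z and R, and include E and F when available) can be sketched with hashlib. The helper sketch_md5 is hypothetical, and hashing the raw array bytes is an assumption made for this example; the page does not describe how mbGDML serializes the arrays, so these hashes will not match the ones the library produces.

```python
import hashlib

import numpy as np


def sketch_md5(Z, R, E=None, F=None):
    """Hash Z and R, plus E and F when they are provided.

    Hypothetical helper: hashing the raw array bytes is an assumption made
    for this example and will not reproduce mbGDML's actual hashes.
    """
    md5 = hashlib.md5()
    for array in (Z, R, E, F):
        if array is not None:
            md5.update(np.ascontiguousarray(array).tobytes())
    return md5.hexdigest()


# Invented data for one three-atom structure.
Z = np.array([8, 1, 1])
R = np.zeros((1, 3, 3))
E = np.array([-76.4])

hash_without_E = sketch_md5(Z, R)
hash_with_E = sketch_md5(Z, R, E=E)
print(hash_without_E != hash_with_E)
```

The point of the sketch is that adding energies or forces to a data set changes its hash, so two data sets with identical structures but different labels are distinguishable.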

property name#

Human-readable label for the data set.

Type:

str

print()[source]#

Prints all structure coordinates, energies, and forces of a data set.

property r_prov_ids#

Specifies structure set IDs/labels and corresponding MD5 hashes.

Keys are the Rset IDs (int) and values are MD5 hashes (str) for the particular structure set.

This is used as a breadcrumb trail that specifies where each structure in the data set originates from.

Examples

>>> dset.r_prov_ids
{0: '2339670ad87a606cb11a72191dfd9f58'}

Type:

dict

property r_prov_specs#

An array specifying where each structure in R originates from.

A (n_R, 2 + n_entity) array where each row contains the Rset ID from r_prov_ids (e.g., 0, 1, 2, etc.), then the structure index, and then the entity_ids from the original full structure in the structure set.

If there has been no previous sampling, an array of shape (1, 0) is returned.

Type:

numpy.ndarray
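
One way to read the provenance data is to split each row into its Rset ID, structure index, and entity_ids, then look the Rset ID up in r_prov_ids. The sketch below does this with plain Python lists standing in for the NumPy array; the values are copied from the examples on this page, and the dictionary field names (rset_md5, r_index, entity_ids) are chosen for this illustration only.

```python
# Values copied from the examples on this page.
r_prov_ids = {0: "2339670ad87a606cb11a72191dfd9f58"}
r_prov_specs = [
    [0, 985, 46, 59, 106],
    [0, 174, 51, 81, 128],
]

origins = []
for row in r_prov_specs:
    # First column: Rset ID; second: structure index; rest: entity_ids.
    r_prov_id, r_index, *entity_ids = row
    origins.append(
        {
            "rset_md5": r_prov_ids[r_prov_id],  # which structure set
            "r_index": r_index,                 # which structure in that set
            "entity_ids": entity_ids,           # which entities were sampled
        }
    )

for origin in origins:
    print(origin)
```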

Examples

>>> dset.r_prov_specs  # [r_prov_id, r_index, entity_1, entity_2, entity_3]
array([[0, 985, 46, 59, 106],
       [0, 174, 51, 81, 128]])

property theory#

The level of theory used to compute energy and gradients of the data set.

Type:

str

write_xyz(save_dir)[source]#

Saves xyz file of all structures in data set.

Parameters:

save_dir (str) – Directory in which to save the xyz file.