MDTerp module

MDTerp.init_analysis.py – Initial MDTerp round that discards irrelevant features before the forward feature selection performed in MDTerp.final_analysis.py.

SGDreg(data, labels, seed, alpha)

Fits an L2-regularized (Ridge) linear regression using the stochastic-gradient 'saga' solver.

Parameters:

Name | Type | Description | Default
data | np.ndarray | Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns. | required
labels | np.ndarray | Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability. | required
seed | int | Random seed. | required
alpha | float | L2 regularization strength of the Ridge regression. | required

Returns:

Type | Description
Tuple[np.ndarray, float] | Numpy array with the coefficients of all features of the fitted linear model, followed by the intercept of the fitted linear model.

Source code in MDTerp/init_analysis.py
def SGDreg(data: np.ndarray, labels: np.ndarray, seed: int, alpha: float) -> Tuple[np.ndarray, float]:
    """
    Fits an L2-regularized (Ridge) linear regression using the stochastic-gradient 'saga' solver.

    Args:
        data (np.ndarray): Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns.
        labels (np.ndarray): Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability.
        seed (int): Random seed.
        alpha (float): L2 regularization strength of the Ridge regression.

    Returns:
        np.ndarray: Numpy array with coefficients of all the features of the fitted linear model.
        float: Intercept of the fitted linear model.
    """
    clf = Ridge(alpha, random_state = seed, solver = 'saga')
    clf.fit(data,labels.ravel())
    coefficients = clf.coef_
    intercept = clf.intercept_
    return coefficients, intercept
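
To illustrate what alpha controls, the following is a minimal sketch that solves the same kind of Ridge objective in closed form with NumPy. It is not the 'saga' solver that SGDreg uses, and the helper name ridge_closed_form and the synthetic data are assumptions made only for this illustration. A small alpha recovers the generating coefficients almost exactly, while a large alpha shrinks them toward zero.

```python
import numpy as np

def ridge_closed_form(data, labels, alpha):
    """Hypothetical helper: closed-form Ridge fit with an unpenalized intercept."""
    # Center data and labels so the intercept is not regularized.
    x_mean = data.mean(axis=0)
    y_mean = labels.mean()
    xc = data - x_mean
    yc = labels - y_mean
    n_features = data.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the coefficients.
    coef = np.linalg.solve(xc.T @ xc + alpha * np.eye(n_features), xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data (an assumption for this sketch): y = 2*x0 - x2 + 0.5 + noise.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
labels = data @ np.array([2.0, 0.0, -1.0]) + 0.5 + 0.01 * rng.normal(size=200)

coef_weak, intercept_weak = ridge_closed_form(data, labels, alpha=1e-6)
coef_strong, intercept_strong = ridge_closed_form(data, labels, alpha=1e3)

print("alpha=1e-6:", coef_weak, intercept_weak)
print("alpha=1e3: ", coef_strong, intercept_strong)
```

An iterative solver such as 'saga' should approach a similar solution on well-conditioned data, but convergence depends on feature scaling and the iteration budget.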

init_model(neighborhood_data, pred_proba, cutoff, given_indices, seed, alpha)

Fits the initial linear model used to discard irrelevant features and select promising features for detailed analysis.

Parameters:

Name | Type | Description | Default
neighborhood_data | np.ndarray | Perturbed data generated by MDTerp.neighborhood.py. | required
pred_proba | np.ndarray | Metastable state probabilities obtained from the black-box. | required
cutoff | int | Maximum number of features kept for the final round of MDTerp and forward feature selection. Use it to reduce compute time when the dataset has many features and it is known a priori that no more than cutoff of them are likely to be relevant. | required
given_indices | np.ndarray | Indices of the features to perform this (first) round of MDTerp on. | required
seed | int | Random seed. | required
alpha | float | L2 regularization strength passed to SGDreg. | required

Returns:

Type | Description
np.ndarray | Indices of up to cutoff features, ordered by decreasing absolute coefficient of the fitted linear model, selected for further analysis/forward feature selection.

Source code in MDTerp/init_analysis.py
def init_model(neighborhood_data: np.ndarray, pred_proba: np.ndarray, cutoff: int, given_indices: np.ndarray, seed: int, alpha: float) -> list:
    """
    Function for fitting the initial linear model for discarding irrelevant features and choosing promising features for detailed analysis.

    Args:
        neighborhood_data (np.ndarray): Perturbed data generated by MDTerp.neighborhood.py.
        pred_proba (np.ndarray): Metastable state probabilities obtained from the black-box.
        cutoff (int): Maximum number of features kept for the final round of MDTerp and forward feature selection (use to improve compute time: when too many features are in the dataset and a priori it is known it is unlikely that more features than set by cutoff will be relevant).
        given_indices (np.ndarray): Indices of the features to perform this (first) round of MDTerp on.
        seed (int): Random seed.
        alpha (float): L2 regularization strength passed to SGDreg.

    Returns:
        np.ndarray: Indices of up to cutoff features, ordered by decreasing absolute coefficient of the fitted linear model, selected for further analysis/forward feature selection.
    """
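    # Explain the state that is most probable for the original (first) sample.
    explain_class = np.argmax(pred_proba[0,:])
    target = pred_proba[:,explain_class]

    # Binarize the target probabilities to obtain class labels for LDA.
    threshold, upper, lower = 0.5, 1, 0
    target_binarized = np.where(target>threshold, upper, lower)

    # Project the neighborhood with LDA and compute similarity weights in that space.
    clf = lda()
    clf.fit(neighborhood_data,target_binarized)
    projected_data = clf.transform(neighborhood_data)
    weights = similarity_kernel(projected_data.reshape(-1,1), 1)

    # Scale rows and targets by sqrt(weights) so that samples closer to the original contribute more to the fit.
    data = neighborhood_data*(weights**0.5).reshape(-1,1)
    labels = target.reshape(-1,1)*(weights.reshape(-1,1)**0.5)

    # Keep the `cutoff` features with the largest absolute coefficients.
    coefficients_selection, intercept_selection = SGDreg(data, labels, seed, alpha)
    selected_features = np.argsort(np.absolute(coefficients_selection))[::-1][:cutoff]

    return selected_features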
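
The weighting and ranking steps at the end of init_model can be sketched in isolation. The example below is an illustration under stated assumptions, not the MDTerp implementation: it uses synthetic data, random weights in place of similarity_kernel, and an ordinary least-squares fit (np.linalg.lstsq) in place of SGDreg. It shows how scaling rows and targets by the square root of the weights, followed by sorting on absolute coefficient size, picks out the features that drive the target.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features, cutoff = 300, 6, 2

# Synthetic neighborhood (an assumption): only features 4 and 1 affect the target.
neighborhood = rng.normal(size=(n_samples, n_features))
target = 3.0 * neighborhood[:, 4] - 1.5 * neighborhood[:, 1]

# Stand-in similarity weights; init_model obtains these from similarity_kernel.
weights = rng.uniform(0.1, 1.0, size=n_samples)

# Scale rows and targets by sqrt(weights), as init_model does before fitting.
sqrt_w = np.sqrt(weights).reshape(-1, 1)
data = neighborhood * sqrt_w
labels = target.reshape(-1, 1) * sqrt_w

# Plain least squares is used here instead of the Ridge fit in SGDreg.
coefficients, *_ = np.linalg.lstsq(data, labels.ravel(), rcond=None)

# Rank features by absolute coefficient and keep the top `cutoff`.
selected = np.argsort(np.abs(coefficients))[::-1][:cutoff]
print(selected)
```

Because argsort is reversed before slicing, the returned indices are ordered from the largest absolute coefficient to the smallest.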

similarity_kernel(data, kernel_width=1.0)

Computes the similarity ∈ [0,1] of each perturbed sample to the original sample (the first row of data) from their distance in LDA-transformed space.

Parameters:

Name | Type | Description | Default
data | np.ndarray | LDA transformed data. | required
kernel_width | float | Width of the similarity kernel. Because LDA is used for dimensionality reduction, this hyperparameter does not need tuning. | 1.0

Returns:

Type | Description
np.ndarray | Similarity ∈ [0,1] of neighborhood.

Source code in MDTerp/init_analysis.py
def similarity_kernel(data: np.ndarray, kernel_width: float = 1.0) -> np.ndarray:
    """
    Function for computing similarity∈[0,1] of a perturbed sample with respect to the original sample using LDA transformed distance.

    Args:
        data (np.ndarray): LDA transformed data.
        kernel_width (float): Width of the similarity kernel (Default: 1.0). Since LDA was used for dimensionality reduction, no need to tune this hyperparameter.

    Returns:
        np.ndarray: Similarity∈[0,1] of neighborhood.
    """
    distances = met.pairwise_distances(data,data[0].reshape(1, -1),metric='euclidean').ravel()
    return np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))
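
The square root in the return expression means the kernel equals exp(-d² / (2 · kernel_width²)), a Gaussian in the distance d to the first row. The sketch below checks this numerically with a NumPy-only stand-in; the name similarity_sketch and the sample values are hypothetical, and np.linalg.norm replaces the pairwise_distances call used in the source.

```python
import numpy as np

def similarity_sketch(data, kernel_width=1.0):
    """Hypothetical NumPy-only stand-in for similarity_kernel."""
    # Euclidean distance of every row to the first row (the original sample).
    distances = np.linalg.norm(data - data[0], axis=1)
    return np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))

# One-dimensional projected coordinates (made-up values for illustration).
projected = np.array([[0.0], [0.5], [1.0], [3.0]])
similarity = similarity_sketch(projected)

# Equivalent Gaussian form with kernel_width = 1: exp(-d^2 / 2).
gaussian = np.exp(-((projected.ravel() - projected[0, 0]) ** 2) / 2.0)

print(similarity)
print(np.allclose(similarity, gaussian))
```

The first sample always gets similarity 1, and similarity decays toward 0 as the distance grows.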