MDTerp module

MDTerp.init_analysis.py: initial MDTerp round that discards irrelevant features before the forward feature selection in MDTerp.final_analysis.py.

`SGDreg(data, labels, seed, alpha)`

Function for fitting an L2-regularized linear (Ridge) regression with a stochastic-gradient-type solver (`solver = 'saga'` in the source shown).

Parameters:

Name	Type	Description	Default
`data`	`np.ndarray`	Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns.	required
`labels`	`np.ndarray`	Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability.	required
`seed`	`int`	Random seed.	required
`alpha`	`float`	L2 regularization strength of the Ridge regression.	required

Returns:

Type	Description
`np.ndarray`	Numpy array with coefficients of all the features of the fitted linear model.
`float`	Intercept of the fitted linear model.

Source code in MDTerp/init_analysis.py

# Imports inferred from the names used in this module's functions; the rendered source does not show them.
from typing import Tuple
import numpy as np
import sklearn.metrics as met
from sklearn.linear_model import Ridge
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda

def SGDreg(data: np.ndarray, labels: np.ndarray, seed: int, alpha: float) -> Tuple[np.ndarray, float]:
    """
    Function for implementing linear regression using stochastic gradient descent.

    Args:
        data (np.ndarray): Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns.
        labels (np.ndarray): Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability.
        seed (int): Random seed.
        alpha (float): L2 regularization strength of the Ridge regression.

    Returns:
        np.ndarray: Numpy array with coefficients of all the features of the fitted linear model.
        float: Intercept of the fitted linear model.
    """
    clf = Ridge(alpha, random_state = seed, solver = 'saga')
    clf.fit(data,labels.ravel())
    coefficients = clf.coef_
    intercept = clf.intercept_
    return coefficients, intercept
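
To illustrate what `alpha` controls, the sketch below solves the same kind of Ridge objective, `||y - Xw - b||^2 + alpha * ||w||^2` with an unpenalized intercept, in closed form using NumPy only. This is a swapped-in technique for illustration: `SGDreg` itself uses an iterative `saga` solver, which should approach the same minimizer. The helper name `ridge_closed_form` and the synthetic data are assumptions made for this example, not part of MDTerp.

```python
import numpy as np

def ridge_closed_form(data, labels, alpha):
    """Hypothetical helper: closed-form Ridge fit with an unpenalized intercept."""
    y = np.asarray(labels, dtype=float).ravel()
    x_mean = data.mean(axis=0)
    y_mean = y.mean()
    # Centering removes the intercept from the penalized problem.
    xc = data - x_mean
    yc = y - y_mean
    n_features = data.shape[1]
    coef = np.linalg.solve(xc.T @ xc + alpha * np.eye(n_features), xc.T @ yc)
    intercept = float(y_mean - x_mean @ coef)
    return coef, intercept

# Synthetic data: the third feature is irrelevant by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + 0.01 * rng.normal(size=200)

coef_small, b_small = ridge_closed_form(X, y, alpha=1e-6)  # almost unregularized
coef_large, b_large = ridge_closed_form(X, y, alpha=1e3)   # strongly shrunk
print(coef_small, b_small)
print(coef_large, b_large)
```

With a very small `alpha` the coefficients stay close to the generating values, while a large `alpha` shrinks them toward zero.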

`init_model(neighborhood_data, pred_proba, cutoff, given_indices, seed, alpha)`

Function for fitting the initial linear model for discarding irrelevant features and choosing promising features for detailed analysis.

Parameters:

Name	Type	Description	Default
`neighborhood_data`	`np.ndarray`	Perturbed data generated by MDTerp.neighborhood.py.	required
`pred_proba`	`np.ndarray`	Metastable state probabilities obtained from the black-box.	required
`cutoff`	`int`	Maximum number of features kept for the final round of MDTerp and forward feature selection. Use it to reduce compute time when the dataset has many features and it is known a priori that no more than `cutoff` of them are likely to be relevant.	required
`given_indices`	`np.ndarray`	Indices of the features to perform this (first) round of MDTerp on.	required
`seed`	`int`	Random seed.	required
`alpha`	`float`	L2 regularization strength passed to `SGDreg`.	required

Returns:

Type	Description
`np.ndarray`	Indices of at most `cutoff` features, ordered by decreasing absolute coefficient of the fitted linear model, selected for further analysis/forward feature selection.

Source code in MDTerp/init_analysis.py

def init_model(neighborhood_data: np.ndarray, pred_proba: np.ndarray, cutoff: int, given_indices: np.ndarray, seed: int, alpha: float) -> list:
    """
    Function for fitting the initial linear model for discarding irrelevant features and choosing promising features for detailed analysis.

    Args:
        neighborhood_data (np.ndarray): Perturbed data generated by MDTerp.neighborhood.py.
        pred_proba (np.ndarray): Metastable state probabilities obtained from the black-box.
        cutoff (int): Maximum number of features kept for the final round of MDTerp and forward feature selection (use to improve compute time: when too many features are in the dataset and a priori it is known it is unlikely that more features than set by cutoff will be relevant).
        given_indices (np.ndarray): Indices of the features to perform this (first) round of MDTerp on.
        seed (int): Random seed.
        alpha (float): L2 regularization strength passed to SGDreg.

    Returns:
        np.ndarray: Indices of at most `cutoff` features, ordered by decreasing absolute coefficient of the fitted linear model, selected for further analysis/forward feature selection.
    """
    explain_class = np.argmax(pred_proba[0,:])

    target = pred_proba[:,explain_class]

    threshold, upper, lower = 0.5, 1, 0
    target_binarized = np.where(target>threshold, upper, lower)

    clf = lda()
    clf.fit(neighborhood_data,target_binarized)
    projected_data = clf.transform(neighborhood_data)
    weights = similarity_kernel(projected_data.reshape(-1,1), 1)

    # Scale each sample (features and target) by the square root of its similarity weight before fitting.
    data = neighborhood_data*(weights**0.5).reshape(-1,1)
    labels = target.reshape(-1,1)*(weights.reshape(-1,1)**0.5)

    coefficients_selection, intercept_selection = SGDreg(data, labels, seed, alpha)
    selected_features = np.argsort(np.absolute(coefficients_selection))[::-1][:cutoff]

    return selected_features
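
The weighting and ranking steps in the source above can be sketched with NumPy alone. The sketch assumes synthetic data and random stand-in weights (all names and values here are hypothetical). It also uses a plain least-squares fit without intercept or penalty, which is a simplification of the Ridge fit in `SGDreg`. In that simplified setting, fitting on rows scaled by `sqrt(weight)` gives the same coefficients as solving the weighted normal equations. The ranking then keeps the `cutoff` features with the largest absolute coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features, cutoff = 300, 6, 2

X = rng.normal(size=(n_samples, n_features))
# Hypothetical target: only features 4 and 1 matter.
y = 3.0 * X[:, 4] - 1.5 * X[:, 1] + 0.01 * rng.normal(size=n_samples)
# Stand-in for the similarity weights produced by similarity_kernel.
weights = rng.uniform(0.1, 1.0, size=n_samples)

# Scale rows of features and target by sqrt(weight), as in init_model.
sw = np.sqrt(weights).reshape(-1, 1)
X_scaled = X * sw
y_scaled = y.reshape(-1, 1) * sw

# Ordinary least squares on the scaled arrays (no intercept, no penalty).
coef_scaled = np.linalg.lstsq(X_scaled, y_scaled.ravel(), rcond=None)[0]

# Weighted normal equations (X^T W X) w = X^T W y for comparison.
W = np.diag(weights)
coef_weighted = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Rank features by absolute coefficient and keep the top `cutoff`.
selected = np.argsort(np.abs(coef_scaled))[::-1][:cutoff]
print(selected)
```

Here the two coefficient vectors agree to numerical precision, and the ranking picks features 4 and 1, the only ones used to generate the target.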

`similarity_kernel(data, kernel_width=1.0)`

Function for computing similarity ∈ [0,1] of a perturbed sample with respect to the original sample using LDA transformed distance.

Parameters:

Name	Type	Description	Default
`data`	`np.ndarray`	LDA transformed data.	required
`kernel_width`	`float`	Width of the similarity kernel. Because LDA is used for dimensionality reduction, this hyperparameter does not need tuning.	`1.0`

Returns:

Type	Description
`np.ndarray`	Similarity ∈ [0,1] of neighborhood.

Source code in MDTerp/init_analysis.py

def similarity_kernel(data: np.ndarray, kernel_width: float = 1.0) -> np.ndarray:
    """
    Function for computing similarity∈[0,1] of a perturbed sample with respect to the original sample using LDA transformed distance.

    Args:
        data (np.ndarray): LDA transformed data.
        kernel_width (float): Width of the similarity kernel (Default: 1.0). Since LDA was used for dimensionality reduction, no need to tune this hyperparameter.

    Returns:
        np.ndarray: Similarity∈[0,1] of neighborhood.
    """
    distances = met.pairwise_distances(data,data[0].reshape(1, -1),metric='euclidean').ravel()
    return np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))
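
As a worked example of the formula in the source above, the sketch below recomputes the kernel with NumPy only. The function name `similarity_kernel_np` and the three projected values are hypothetical and chosen for illustration. The first row plays the role of the original sample, so every distance is measured from it. Taking the square root of `exp(-d**2 / kernel_width**2)` is the same as `exp(-d**2 / (2 * kernel_width**2))`.

```python
import numpy as np

def similarity_kernel_np(data, kernel_width=1.0):
    """Hypothetical NumPy-only variant of the kernel shown above."""
    # Euclidean distance of every row from the first row (the original sample).
    distances = np.linalg.norm(data - data[0], axis=1)
    return np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))

# Three one-dimensional (LDA-projected) points at distances 0, 1 and 2.
projected = np.array([[0.0], [1.0], [2.0]])
sims = similarity_kernel_np(projected)
print(sims)  # approximately [1.0, 0.6065, 0.1353]
```

The original sample always gets similarity 1, and the similarity decays toward 0 as the projected distance grows.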
