MDTerp module

MDTerp.init_analysis.py – Initial MDTerp round that discards irrelevant features before the forward feature selection in MDTerp.final_analysis.py.

init_model(neighborhood_data, pred_proba, cutoff, given_indices, seed, alpha)

Fits the initial linear model used to discard irrelevant features and to select promising features for detailed analysis.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| neighborhood_data | np.ndarray | Perturbed data generated by MDTerp.neighborhood.py. | required |
| pred_proba | np.ndarray | Metastable state probabilities obtained from the black-box. | required |
| cutoff | int | Maximum number of features kept for the final round of MDTerp and forward feature selection. Use it to reduce compute time when the dataset has many features and it is known a priori that no more than cutoff of them are likely to be relevant. | required |
| given_indices | np.ndarray | Indices of the features to perform this (first) round of MDTerp on. | required |
| seed | int | Random seed. | required |
| alpha | float | Passed to ridge_regression as its alpha argument. | required |

Returns:

| Type | Description |
|---|---|
| np.ndarray | Indices of the selected features: the (at most) cutoff features with the largest absolute ridge coefficients, in decreasing order of magnitude. |
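
The selection rule used at the end of init_model (rank features by absolute coefficient, keep the first cutoff) can be sketched in isolation. This is a minimal illustration, not MDTerp code: `top_features` is a hypothetical helper name, and it assumes the coefficients arrive as a 1-D array.

```python
import numpy as np

def top_features(coefficients, cutoff):
    """Indices of the `cutoff` largest |coefficient| values, largest first.

    Hypothetical helper; assumes `coefficients` is a 1-D array.
    """
    order = np.argsort(np.absolute(coefficients))[::-1]  # descending by magnitude
    return order[:cutoff]

coefs = np.array([0.1, -2.0, 0.0, 1.5, -0.3])
selected = top_features(coefs, cutoff=3)
print(selected)  # [1 3 4]
```

If cutoff exceeds the number of features, the slice simply returns every index, so a generous cutoff is harmless apart from the extra compute in the later round.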

Source code in MDTerp/init_analysis.py
def init_model(neighborhood_data: np.ndarray, pred_proba: np.ndarray, cutoff: int, given_indices: np.ndarray, seed: int, alpha: float) -> np.ndarray:
    """
    Fits the initial linear model used to discard irrelevant features and to select promising features for detailed analysis.

    Args:
        neighborhood_data (np.ndarray): Perturbed data generated by MDTerp.neighborhood.py.
        pred_proba (np.ndarray): Metastable state probabilities obtained from the black-box.
        cutoff (int): Maximum number of features kept for the final round of MDTerp and forward feature selection. Use it to reduce compute time when the dataset has many features and it is known a priori that no more than cutoff of them are likely to be relevant.
        given_indices (np.ndarray): Indices of the features to perform this (first) round of MDTerp on.
        seed (int): Random seed.
        alpha (float): Passed to ridge_regression as its alpha argument.

    Returns:
        np.ndarray: Indices of the selected features: the (at most) cutoff features with the largest absolute ridge coefficients, in decreasing order of magnitude.
    """
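    # Explain the class with the highest probability in the first row of pred_proba.
    explain_class = np.argmax(pred_proba[0, :])
    target = pred_proba[:, explain_class]

    # Binarize the target at 0.5 to obtain class labels for LDA.
    threshold, upper, lower = 0.5, 1, 0
    target_binarized = np.where(target > threshold, upper, lower)

    # Project the neighborhood with LDA and compute similarity weights
    # from the projected data.
    clf = lda()
    clf.fit(neighborhood_data, target_binarized)
    projected_data = clf.transform(neighborhood_data)
    weights = similarity_kernel(projected_data.reshape(-1, 1), 1)

    # Scale samples and targets by sqrt(weight) so that each sample's
    # squared error in the ridge fit is weighted by its similarity.
    data = neighborhood_data * (weights**0.5).reshape(-1, 1)
    labels = target.reshape(-1, 1) * (weights.reshape(-1, 1)**0.5)

    # Keep the `cutoff` features with the largest absolute coefficients.
    coefficients_selection, intercept_selection = ridge_regression(data, labels, seed, alpha)
    selected_features = np.argsort(np.absolute(coefficients_selection))[::-1][:cutoff]

    return selected_features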
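
The source above multiplies each sample and its target by the square root of its similarity weight before calling ridge_regression. The NumPy sketch below illustrates why that scaling amounts to a sample-weighted ridge fit. Everything in it is an assumption made for illustration: `exp_kernel` is a hypothetical stand-in for similarity_kernel (an exponential kernel on the distance from the first row), `ridge_closed_form` is a hypothetical intercept-free solver rather than MDTerp's ridge_regression, and the data are synthetic. The equivalence shown is exact only without an intercept; whether ridge_regression fits one is not visible in this listing.

```python
import numpy as np

def exp_kernel(projected, width=1.0):
    """Hypothetical stand-in for similarity_kernel: weight each sample by
    its distance from the first row of the projected data."""
    dist = np.linalg.norm(projected - projected[0], axis=1)
    return np.exp(-(dist ** 2) / width ** 2)

def ridge_closed_form(X, y, alpha, sample_weight=None):
    """Intercept-free ridge: solve (X^T W X + alpha I) beta = X^T W y."""
    w = np.ones(len(X)) if sample_weight is None else sample_weight
    XtW = X.T * w  # equivalent to X.T @ diag(w)
    return np.linalg.solve(XtW @ X + alpha * np.eye(X.shape[1]), XtW @ y)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=200)

# A 1-D projection standing in for the LDA output.
projected = X[:, :1]
weights = exp_kernel(projected, width=1.0)

# Route 1: scale rows and targets by sqrt(weight), then fit plain ridge.
sqrt_w = np.sqrt(weights).reshape(-1, 1)
beta_scaled = ridge_closed_form(X * sqrt_w, y * sqrt_w.ravel(), alpha=1.0)

# Route 2: pass the weights to the solver directly.
beta_weighted = ridge_closed_form(X, y, alpha=1.0, sample_weight=weights)

print(np.allclose(beta_scaled, beta_weighted))
```

Scaling the inputs keeps the solver interface simple: any ridge routine that lacks a sample-weight argument can still produce a similarity-weighted fit this way.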