MDTerp module
MDTerp.init_analysis.py – Initial MDTerp round that discards irrelevant features before the forward feature selection in MDTerp.final_analysis.py.
init_model(neighborhood_data, pred_proba, cutoff, given_indices, seed, alpha)
Fits the initial linear model used to discard irrelevant features and select promising ones for detailed analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| neighborhood_data | np.ndarray | Perturbed data generated by MDTerp.neighborhood.py. | required |
| pred_proba | np.ndarray | Metastable state probabilities obtained from the black-box. | required |
| cutoff | int | Maximum number of features kept for the final round of MDTerp and forward feature selection. Use it to reduce compute time when the dataset has many features and it is known a priori that no more than cutoff features are likely to be relevant. | required |
| given_indices | np.ndarray | Indices of the features to perform this (first) round of MDTerp on. | required |
| seed | int | Random seed. | required |
| alpha | float | Passed through to ridge_regression; presumably the ridge regularization strength. | required |
Returns:
| Type | Description |
|---|---|
| np.ndarray | List of three lists containing indices of the selected numeric, angular, sine/cosine features for further analysis/forward feature selection. |
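The effect of cutoff can be illustrated with a minimal sketch that mirrors the final ranking step in the source listing: features are ordered by the absolute value of their fitted coefficients, and only the first cutoff indices are kept. The helper name `top_k_by_magnitude` and the coefficient values are made up for illustration and are not part of MDTerp.

```python
import numpy as np


def top_k_by_magnitude(coefficients: np.ndarray, cutoff: int) -> np.ndarray:
    """Hypothetical helper: indices of the `cutoff` largest |coefficient| values, largest first."""
    # argsort is ascending, so reverse it to rank from largest to smallest magnitude.
    order = np.argsort(np.abs(coefficients))[::-1]
    return order[:cutoff]


# Illustrative coefficients for five features (not real MDTerp output).
coefficients = np.array([0.05, -2.0, 0.7, 0.0, 1.1])
selected = top_k_by_magnitude(coefficients, cutoff=3)
print(selected)  # [1 4 2]
```

The sign of a coefficient does not matter here: feature 1 ranks first because of its magnitude, even though its coefficient is negative.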
Source code in MDTerp/init_analysis.py
def init_model(neighborhood_data: np.ndarray, pred_proba: np.ndarray, cutoff: int, given_indices: np.ndarray, seed: int, alpha: float) -> list:
    """
    Fits the initial linear model used to discard irrelevant features and select promising ones for detailed analysis.

    Args:
        neighborhood_data (np.ndarray): Perturbed data generated by MDTerp.neighborhood.py.
        pred_proba (np.ndarray): Metastable state probabilities obtained from the black-box.
        cutoff (int): Maximum number of features kept for the final round of MDTerp and forward feature selection. Use it to reduce compute time when the dataset has many features and it is known a priori that no more than cutoff features are likely to be relevant.
        given_indices (np.ndarray): Indices of the features to perform this (first) round of MDTerp on.
        seed (int): Random seed.
        alpha (float): Passed through to ridge_regression; presumably the ridge regularization strength.

    Returns:
        np.ndarray: List of three lists containing indices of the selected numeric, angular, sine/cosine features for further analysis/forward feature selection.
    """
    # Explain the class with the highest probability in the first row of pred_proba.
    explain_class = np.argmax(pred_proba[0,:])
    target = pred_proba[:,explain_class]

    # Binarize the target probability at 0.5 to obtain class labels for LDA.
    threshold, upper, lower = 0.5, 1, 0
    target_binarized = np.where(target>threshold, upper, lower)

    # Project the neighborhood onto the LDA axis and compute similarity weights there.
    clf = lda()
    clf.fit(neighborhood_data,target_binarized)
    projected_data = clf.transform(neighborhood_data)
    weights = similarity_kernel(projected_data.reshape(-1,1), 1)

    # Scale rows of the data and the target by sqrt(weights) before the ridge fit.
    predict_proba = pred_proba[:,explain_class]
    data = neighborhood_data*(weights**0.5).reshape(-1,1)
    labels = target.reshape(-1,1)*(weights.reshape(-1,1)**0.5)
    coefficients_selection, intercept_selection = ridge_regression(data, labels, seed, alpha)

    # Keep the `cutoff` features with the largest absolute coefficients.
    selected_features = np.argsort(np.absolute(coefficients_selection))[::-1][:cutoff]
    return selected_features
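The listing multiplies both the data and the target by the square root of the similarity weights before calling ridge_regression. A plausible reading is that this is the usual way of turning an unweighted ridge solver into a weighted one. The sketch below checks that equivalence with a closed-form ridge solution in NumPy. It assumes a model without an intercept; the function names, the synthetic data, and the uniform random weights are all illustrative and are not MDTerp's own `ridge_regression` or `similarity_kernel`.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Minimize ||X b - y||^2 + alpha * ||b||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def weighted_ridge_closed_form(X, y, w, alpha):
    """Minimize sum_i w_i * (x_i . b - y_i)^2 + alpha * ||b||^2 (no intercept)."""
    n_features = X.shape[1]
    XtW = X.T * w  # broadcasting scales the contribution of sample i by w[i]
    return np.linalg.solve(XtW @ X + alpha * np.eye(n_features), XtW @ y)


# Synthetic stand-ins for the neighborhood data, target, and similarity weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.3]) + 0.01 * rng.normal(size=200)
w = rng.uniform(0.1, 1.0, size=200)

# Row scaling by sqrt(w), as in the listing, followed by an unweighted fit.
sqrt_w = np.sqrt(w)
b_scaled = ridge_closed_form(X * sqrt_w[:, None], y * sqrt_w, alpha=1.0)

# Direct weighted fit for comparison.
b_weighted = weighted_ridge_closed_form(X, y, w, alpha=1.0)

print(np.allclose(b_scaled, b_weighted))
```

The two solutions agree because scaling each row by sqrt(w_i) multiplies its squared residual by w_i. If the underlying solver also fits an intercept on the scaled data, the match with a weighted fit is no longer exact, since the implicit column of ones is not scaled along with the features.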