MDTerp module
MDTerp.init_analysis.py: Initial MDTerp round that discards irrelevant features before the forward feature selection in MDTerp.final_analysis.py.
SGDreg(data, labels, seed, alpha)
Fits an L2-regularized (Ridge) linear regression using the stochastic-gradient-based `saga` solver.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | np.ndarray | Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns. | required |
labels | np.ndarray | Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability. | required |
seed | int | Random seed. | required |
alpha | float | Strength of the L2 penalty in the Ridge regression. | required |
Returns:
Type | Description |
---|---|
np.ndarray | Numpy array with the coefficients of all the features of the fitted linear model. |
float | Intercept of the fitted linear model. |
Source code in MDTerp/init_analysis.py
def SGDreg(data: np.ndarray, labels: np.ndarray, seed: int, alpha: float) -> Tuple[np.ndarray, float]:
    """
    Function for implementing linear regression using stochastic gradient descent.

    Args:
        data (np.ndarray): Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns.
        labels (np.ndarray): Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability.
        seed (int): Random seed.
        alpha (float): L2 norm of Ridge regression.

    Returns:
        np.ndarray: Numpy array with coefficients of all the features of the fitted linear model.
        float: Intercept of the fitted linear model.
    """
    clf = Ridge(alpha, random_state = seed, solver = 'saga')
    clf.fit(data,labels.ravel())
    coefficients = clf.coef_
    intercept = clf.intercept_
    return coefficients, intercept
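As a usage illustration, the following minimal sketch fits the same estimator configuration on synthetic data. It assumes `Ridge` is `sklearn.linear_model.Ridge` (the import is not shown in the source); the data, `true_coef`, and the chosen `alpha` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge  # assumption: the Ridge used by SGDreg

# Hypothetical stand-in for the similarity-weighted neighborhood data.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 4))
true_coef = np.array([2.0, 0.0, -1.0, 0.5])
labels = (data @ true_coef + 0.3).reshape(-1, 1)  # noiseless linear target

# Same estimator configuration as in SGDreg, with alpha=1.0 and seed=0.
clf = Ridge(alpha=1.0, random_state=0, solver="saga")
clf.fit(data, labels.ravel())

coefficients, intercept = clf.coef_, clf.intercept_
print(coefficients.shape)  # one coefficient per feature column
```

With a small `alpha` relative to the number of samples, the fitted coefficients should land close to `true_coef`; larger values shrink them toward zero.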
init_model(neighborhood_data, pred_proba, cutoff, given_indices, seed, alpha)
Fits the initial linear model that discards irrelevant features and selects promising features for detailed analysis.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
neighborhood_data | np.ndarray | Perturbed data generated by MDTerp.neighborhood.py. | required |
pred_proba | np.ndarray | Metastable state probabilities obtained from the black-box. | required |
cutoff | int | Maximum number of features kept for the final round of MDTerp and forward feature selection. Use it to reduce compute time when the dataset has many features and it is known a priori that no more than cutoff of them are likely to be relevant. | required |
given_indices | np.ndarray | Indices of the features to perform this (first) round of MDTerp on. | required |
seed | int | Random seed. | required |
alpha | float | Strength of the L2 penalty, passed on to SGDreg. | required |
Returns:
Type | Description |
---|---|
np.ndarray | Indices of up to cutoff features, ordered from largest to smallest absolute coefficient in the fitted linear model, selected for further analysis/forward feature selection. |
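The cutoff-based selection can be illustrated with a small sketch that mirrors the `argsort` expression used in `init_model`. The coefficient values here are hypothetical.

```python
import numpy as np

# Hypothetical coefficients of a fitted linear model, one per feature.
coefficients = np.array([0.05, -1.20, 0.40, 0.00, 0.90])
cutoff = 3  # keep at most three features

# Rank features by absolute coefficient, largest first, and keep the top `cutoff`.
selected_features = np.argsort(np.absolute(coefficients))[::-1][:cutoff]
print(selected_features)  # [1 4 2]
```

The sign of a coefficient does not matter for the ranking, so feature 1 (coefficient -1.20) ranks first.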
Source code in MDTerp/init_analysis.py
def init_model(neighborhood_data: np.ndarray, pred_proba: np.ndarray, cutoff: int, given_indices: np.ndarray, seed: int, alpha: float) -> list:
    """
    Function for fitting the initial linear model for discarding irrelevant features and choosing promising features for detailed analysis.

    Args:
        neighborhood_data (np.ndarray): Perturbed data generated by MDTerp.neighborhood.py.
        pred_proba (np.ndarray): Metastable state probabilities obtained from the black-box.
        cutoff (int): Maximum number of features kept for the final round of MDTerp and forward feature selection (use to improve compute time: when too many features are in the dataset and a priori it is known it is unlikely that more features than set by cutoff will be relevant).
        given_indices (np.ndarray): Indices of the features to perform this (first) round of MDTerp on.
        seed (int): Random seed.
        alpha (float): Strength of the L2 penalty, passed on to SGDreg.

    Returns:
        np.ndarray: Indices of up to cutoff features, ordered from largest to smallest absolute coefficient, selected for further analysis/forward feature selection.
    """
    # Explain the state that is most probable for the original sample (row 0).
    explain_class = np.argmax(pred_proba[0,:])
    target = pred_proba[:,explain_class]

    # Binarize the target probabilities so that LDA has two classes to separate.
    threshold, upper, lower = 0.5, 1, 0
    target_binarized = np.where(target>threshold, upper, lower)

    # Project the neighborhood with LDA and compute similarity to the original sample.
    clf = lda()
    clf.fit(neighborhood_data,target_binarized)
    projected_data = clf.transform(neighborhood_data)
    weights = similarity_kernel(projected_data.reshape(-1,1), 1)

    # Scale rows by sqrt(weight) so that the regression is similarity-weighted.
    data = neighborhood_data*(weights**0.5).reshape(-1,1)
    labels = target.reshape(-1,1)*(weights.reshape(-1,1)**0.5)

    # Fit the linear model and keep the features with the largest absolute coefficients.
    coefficients_selection, intercept_selection = SGDreg(data, labels, seed, alpha)
    selected_features = np.argsort(np.absolute(coefficients_selection))[::-1][:cutoff]
    return selected_features
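Multiplying both the data rows and the labels by the square root of the weights is a standard way to turn an unweighted least-squares fit into a weighted one. The sketch below checks this numerically on hypothetical data. It omits the intercept and the L2 penalty for simplicity, so it illustrates the idea rather than reproducing `SGDreg`.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                    # hypothetical neighborhood data
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
w = rng.uniform(0.1, 1.0, size=200)              # hypothetical similarity weights

# Weighted least squares in closed form: solve (X^T W X) beta = X^T W y.
beta_weighted = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Ordinary least squares on rows scaled by sqrt(w), as done for data and labels above.
X_scaled = X * (w ** 0.5).reshape(-1, 1)
y_scaled = y * (w ** 0.5)
beta_scaled, *_ = np.linalg.lstsq(X_scaled, y_scaled, rcond=None)

print(np.allclose(beta_weighted, beta_scaled))  # True
```

When an intercept is fitted on the scaled data, as `Ridge` does by default, the intercept column is not scaled by the weights, so the correspondence with a weighted fit is no longer exact.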
similarity_kernel(data, kernel_width=1.0)
Computes the similarity, in [0, 1], of each perturbed sample to the original sample from their Euclidean distance in LDA-transformed space.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | np.ndarray | LDA transformed data. | required |
kernel_width | float | Width of the similarity kernel (Default: 1.0). Because LDA is used for dimensionality reduction, this hyperparameter does not need tuning. | 1.0 |
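The LDA-transformed input can be produced the way `init_model` does it: binarize the target probabilities, fit LDA, and project the neighborhood. This is a minimal sketch on synthetic data, assuming `lda` is scikit-learn's `LinearDiscriminantAnalysis` (the import is not shown in the source); the data and the `target` probabilities are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda  # assumed alias

rng = np.random.default_rng(2)
neighborhood_data = rng.normal(size=(300, 5))    # hypothetical perturbed samples
# Hypothetical black-box probability of the explained state, driven by feature 0.
target = 1.0 / (1.0 + np.exp(-3.0 * neighborhood_data[:, 0]))

# Binarize at 0.5 so that LDA has two classes to separate.
target_binarized = np.where(target > 0.5, 1, 0)

clf = lda()
clf.fit(neighborhood_data, target_binarized)
projected_data = clf.transform(neighborhood_data)
print(projected_data.shape)  # (300, 1)
```

With two classes, LDA yields a single discriminant component, so each sample is reduced to one coordinate before distances are computed.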
Returns:
Type | Description |
---|---|
np.ndarray | Similarity, in [0, 1], of each neighborhood sample to the original sample. |
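Written out, the returned similarity follows directly from the function's return expression. The symbols are notation introduced here for illustration: $z_i$ is the LDA-transformed sample $i$, $z_0$ is the original sample (row 0), and $\sigma$ is `kernel_width`.

```latex
d_i = \lVert z_i - z_0 \rVert_2, \qquad
s_i = \sqrt{\exp\!\left(-\frac{d_i^{2}}{\sigma^{2}}\right)}
    = \exp\!\left(-\frac{d_i^{2}}{2\sigma^{2}}\right)
```

The square root therefore makes the kernel a Gaussian with standard deviation $\sigma$, with $s_0 = 1$ for the original sample and values decaying toward 0 as the distance grows.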
Source code in MDTerp/init_analysis.py
def similarity_kernel(data: np.ndarray, kernel_width: float = 1.0) -> np.ndarray:
    """
    Function for computing similarity∈[0,1] of a perturbed sample with respect to the original sample using LDA transformed distance.

    Args:
        data (np.ndarray): LDA transformed data.
        kernel_width (float): Width of the similarity kernel (Default: 1.0). Since LDA was used for dimensionality reduction, no need to tune this hyperparameter.

    Returns:
        np.ndarray: Similarity∈[0,1] of neighborhood.
    """
    distances = met.pairwise_distances(data,data[0].reshape(1, -1),metric='euclidean').ravel()
    return np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))
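A small numeric check of the kernel, as a sketch: it assumes `met` is `sklearn.metrics` (the import is not shown in the source) and uses hypothetical one-dimensional projections in which row 0 plays the role of the original sample.

```python
import numpy as np
from sklearn import metrics as met  # assumption: the `met` alias used in the source

def similarity_kernel(data: np.ndarray, kernel_width: float = 1.0) -> np.ndarray:
    # Euclidean distance of every row to row 0 (the original sample).
    distances = met.pairwise_distances(data, data[0].reshape(1, -1), metric="euclidean").ravel()
    return np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))

# Hypothetical LDA projections at distances 0, 0.5, 1 and 3 from the original sample.
projected = np.array([[0.0], [0.5], [1.0], [3.0]])
weights = similarity_kernel(projected)
print(np.round(weights, 4))
```

The original sample gets similarity 1, and the values decrease monotonically with distance, matching `exp(-d**2 / 2)` for the default `kernel_width`.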