MDTerp module¶

MDTerp.final_analysis.py – Final MDTerp round for implementing forward feature selection and attributing feature importance.

`SGDreg(data, labels, seed, alpha=1.0)` ¶

Function for implementing linear regression using stochastic gradient descent.

Parameters:

Name	Type	Description	Default
`data`	`np.ndarray`	Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns.	required
`labels`	`np.ndarray`	Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability.	required
`seed`	`int`	Random seed.	required
`alpha`	`float`	L2 norm of Ridge regression (Default: 1.0).	`1.0`

Returns:

Type	Description
`np.ndarray`	Numpy array with coefficients of all the features of the fitted linear model. float: Intercept of the fitted linear model.

Source code in MDTerp/final_analysis.py

def SGDreg(data: np.ndarray, labels: np.ndarray, seed: int, alpha: float = 1.0) -> Tuple[np.ndarray, float]:
    """
    Function for implementing linear regression using stochastic gradient descent.

    Args:
        data (np.ndarray): Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns.
        labels (np.ndarray): Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability.
        seed (int): Random seed.
        alpha (float): L2 norm of Ridge regression (Default: 1.0).

    Returns:
        np.ndarray: Numpy array with coefficients of all the features of the fitted linear model.
        float: Intercept of the fitted linear model.
    """
    clf = Ridge(alpha, random_state = seed, solver = 'saga')
    clf.fit(data,labels.ravel())
    coefficients = clf.coef_
    intercept = clf.intercept_
    return coefficients, intercept

`charac_theta(d_U, d_S)` ¶

Function for computing change in unfaithfulness per unit change in interpretation entropy as the number of features used to build a linear model increases by 1.

Parameters:

Name	Type	Description	Default
`d_U`	`np.ndarray`	Change in unfaithfulness with increasing features in linear models.	required
`d_S`	`np.ndarray`	Change in interpretation entropy with increasing features in linear models.	required

Returns:

Type	Description
`np.ndarray`	Change in unfaithfulness per unit change in interpretation entropy.

Source code in MDTerp/final_analysis.py

def charac_theta(d_U: np.ndarray,d_S: np.ndarray) -> np.ndarray:
  """
    Function for computing change in unfaithfulness per unit change in interpretation entropy as the number of features used to build a linear model increases by 1.

    Args:
        d_U (np.ndarray): Change in unfaithfulness with increasing features in linear models.
        d_S (np.ndarray): Change in interpretation entropy with increasing features in linear models.

    Returns:
        np.ndarray: Change in unfaithfulness per unit change in interpretation entropy.
  """
  return -d_U/d_S

`final_model(neighborhood_data, pred_proba, unf_threshold, feature_type_indices, selected_features, seed)` ¶

Function for computing final feature importance by implementing forward feature selection.

Parameters:

Name	Type	Description	Default
`neighborhood_data`	`np.ndarray`	Perturbed data generated by MDTerp.neighborhood.py.	required
`pred_proba`	`np.ndarray`	Metastable state probabilities obtained from the black-box.	required
`unf_threshold`	`float`	Hyperparameter setting a lower limit on model unfaithafulness. Forward feature selection ends when unfaithfulness reaches lower than this threshold.	required
`feature_type_indices`	`np.ndarray`	Indices of the features to perform the final round of MDTerp on.	required
`seed`	`int`	Random seed.	required

Returns:

Type	Description
`np.ndarray`	Normalized feature importance.

Source code in MDTerp/final_analysis.py

def final_model(neighborhood_data: np.ndarray, pred_proba: np.ndarray, unf_threshold: float, feature_type_indices: np.ndarray, selected_features: np.ndarray, seed:int) -> np.ndarray:
    """
    Function for computing final feature importance by implementing forward feature selection.

    Args:
        neighborhood_data (np.ndarray): Perturbed data generated by MDTerp.neighborhood.py.
        pred_proba (np.ndarray): Metastable state probabilities obtained from the black-box.
        unf_threshold (float): Hyperparameter setting a lower limit on model unfaithafulness. Forward feature selection ends when unfaithfulness reaches lower than this threshold.
        feature_type_indices (np.ndarray): Indices of the features to perform the final round of MDTerp on.
        seed (int): Random seed.

    Returns:
        np.ndarray: Normalized feature importance.
    """
    tot_feat = feature_type_indices[0].shape[0] + feature_type_indices[1].shape[0] + feature_type_indices[2].shape[0] + feature_type_indices[3].shape[0]
    k_max = neighborhood_data.shape[1]

    explain_class = np.argmax(pred_proba[0,:])

    target = pred_proba[:,explain_class]

    threshold, upper, lower = 0.5, 1, 0
    target_binarized = np.where(target>threshold, upper, lower)

    clf = lda()
    clf.fit(neighborhood_data,target_binarized)
    projected_data = clf.transform(neighborhood_data)
    weights = similarity_kernel(projected_data.reshape(-1,1), 1)

    predict_proba = pred_proba[:,explain_class]
    data = neighborhood_data*(weights**0.5).reshape(-1,1)
    labels = target.reshape(-1,1)*(weights.reshape(-1,1)**0.5)

    best_parameters_master = []
    best_parameters_converted = []
    best_unfaithfulness_master = []
    best_interp_master = []

    N = data.shape[1]
    k_array = np.arange(1,k_max+1)

    for k in k_array:
      unfaithfulness_calc(k, N, data, predict_proba, best_parameters_master, labels, best_interp_master, best_parameters_converted, best_unfaithfulness_master, tot_feat, selected_features, seed)


    optimal_k = 1


    if N<=3:
      for i in range(1,N):
        prime_model = -1
        if best_unfaithfulness_master[i]<=best_unfaithfulness_master[i-1] - unf_threshold:
          prime_model = copy.deepcopy(i)-1
          continue
        else:
          break

    else:
      charac_theta_mast = []

      d_U_lst = []
      d_S_lst = []
      for i in range(1, len(selected_features)):
        d_U_lst.append(best_unfaithfulness_master[i] - best_unfaithfulness_master[i-1])
        d_S_lst.append(best_interp_master[i] - best_interp_master[i-1])

      for i in range(len(selected_features)-1):
        charac_theta_mast.append(charac_theta(d_U_lst[i], d_S_lst[i]))

      range_theta_mast = []
      for i in range(1,len(charac_theta_mast)):
        range_theta_mast.append(np.array(charac_theta_mast)[i]-np.array(charac_theta_mast)[i-1])

      prime_model = np.argmin(np.array(range_theta_mast))
    return np.absolute(np.array(best_parameters_converted)[prime_model+1])

`interp(coef_array)` ¶

Function for computing interpretation entropy of the coefficients of a fitted linear model.

Parameters:

Name	Type	Description	Default
`coef_array`	`np.ndarray`	Numpy array with coefficients of all the features of the fitted linear model.	required

Returns:

Type	Description
`float`	Interpretation entropy of the coefficients of a fitted linear model.

Source code in MDTerp/final_analysis.py

def interp(coef_array: np.ndarray) -> float:
  """
    Function for computing interpretation entropy of the coefficients of a fitted linear model.

    Args:
        coef_array (np.ndarray): Numpy array with coefficients of all the features of the fitted linear model.

    Returns:
        float: Interpretation entropy of the coefficients of a fitted linear model.
  """
  a = np.absolute(coef_array)/np.sum(np.absolute(coef_array))
  t = 0
  for i in range(a.shape[0]):
    if a[i]==0:
      continue
    else:
      t += a[i]*np.log(a[i])
  return -t/np.log(coef_array.shape[0])

`similarity_kernel(data, kernel_width=1.0)` ¶

Function for computing similarity∈[0,1] of a perturbed sample with respect to the original sample using LDA transformed distance.

Parameters:

Name	Type	Description	Default
`data`	`np.ndarray`	LDA transformed data.	required
`kernel_width`	`float`	Width of the similarity kernel (Default: 1.0).	`1.0`

Returns:

Type	Description
`np.ndarray`	Similarity∈[0,1] of neighborhood.

Source code in MDTerp/final_analysis.py

def similarity_kernel(data: np.ndarray, kernel_width: float = 1.0) -> np.ndarray:
    """
    Function for computing similarity∈[0,1] of a perturbed sample with respect to the original sample using LDA transformed distance.

    Args:
        data (np.ndarray): LDA transformed data.
        kernel_width (float): Width of the similarity kernel (Default: 1.0).

    Returns:
        np.ndarray: Similarity∈[0,1] of neighborhood.
    """
    distances = met.pairwise_distances(data,data[0].reshape(1, -1),metric='euclidean').ravel()
    return np.sqrt(np.exp(-(distances ** 2) / kernel_width ** 2))

`unfaithfulness_calc(k, N, data, predict_proba, best_parameters_master, labels, best_interp_master, best_parameters_converted, best_unfaithfulness_master, tot_feat, selected_features, seed)` ¶

Function for implementing linear regression using stochastic gradient descent.

Parameters:

Name	Type	Description	Default
`k`	`int`	Number of features for building a surrogate, local, linear model.	required
`N`	`int`	Number of features selected for detailed analysis.	required
`data`	`np.ndarray`	Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns.	required
`predict_proba`	`np.ndarray`	Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability.	required
`best_parameters_master`	`list`	List of lists that saves the best fit coefficients for linear models built using k=1, .., N features.	required
`best_interp_master`	`list`	List that saves the interpretation entropy for the linear models built using k=1,...,N features.	required
`best_parameters_converted`	`list`	List of lists that saves the best fit coefficients for linear models built using k=1, .., N features, and imputes the discarded features in the initial MDTerp round with 0 importance to preserve feature ID.	required
`best_unfaithfulness_master`	`list`	List that saves the unfaithfulness of the best-fit linear models built using k=1, ..., N features.	required
`tot_feat`	`int`	Total number of features in the dataset.	required
`selected_features`	`np.ndarray`	Indices of the features selected for detailed MDTerp analysis.	required
`seed`	`int`	Random seed.	required

Returns:

Type	Description
`None`	None

Source code in MDTerp/final_analysis.py

def unfaithfulness_calc(k: int, N: int, data: np.ndarray, predict_proba: np.ndarray, best_parameters_master: list, labels: np.ndarray, best_interp_master: list, best_parameters_converted: list, best_unfaithfulness_master: list, tot_feat: int, selected_features: np.ndarray, seed: int) -> None:
  """
    Function for implementing linear regression using stochastic gradient descent.

    Args:
        k (int): Number of features for building a surrogate, local, linear model.
        N (int): Number of features selected for detailed analysis.
        data (np.ndarray): Numpy 2D array containing the similarity-weighted training data for the black-box model. Samples along rows and features along columns.
        predict_proba (np.ndarray): Numpy array containing metastable state prediction probabilities for a perturbed neighborhood corresponding to a specific sample. Includes the state for which the original sample has the highest probability.
        best_parameters_master (list): List of lists that saves the best fit coefficients for linear models built using k=1, .., N features.
        best_interp_master (list): List that saves the interpretation entropy for the linear models built using k=1,...,N features.
        best_parameters_converted (list): List of lists that saves the best fit coefficients for linear models built using k=1, .., N features, and imputes the discarded features in the initial MDTerp round with 0 importance to preserve feature ID.
        best_unfaithfulness_master (list): List that saves the unfaithfulness of the best-fit linear models built using k=1, ..., N features.
        tot_feat (int): Total number of features in the dataset.
        selected_features (np.ndarray): Indices of the features selected for detailed MDTerp analysis.
        seed (int): Random seed.

    Returns:
        None
  """ 
  models = []
  TERP_SGD_parameters = []
  TERP_SGD_unfaithfulness = []
  TERP_SGD_interp = []
  if k == 1:
    inherited_nonzero = np.array([],dtype=int)
    inherited_zero = np.arange(N)

  elif k > 1:
    inherited_nonzero = np.nonzero(best_parameters_master[k-2][:-1])[0]
    inherited_zero = np.where(best_parameters_master[k-2][:-1] == 0)[0]

  for i in range(N-k+1):
    models.append(np.append(inherited_nonzero, inherited_zero[i]))
    result_a, result_b = SGDreg(data[:,models[i]], labels, seed)
    parameters = np.zeros((N+1))
    parameters[models[i]] = result_a
    parameters[-1] = result_b
    TERP_SGD_parameters.append(parameters)
    residual = np.corrcoef(labels[:,0],(np.column_stack((data, np.ones((data.shape[0]))))@parameters[:]).reshape(-1,1)[:,0])[0,1]
    TERP_SGD_unfaithfulness.append(1-np.absolute(residual))
    TERP_SGD_interp.append(interp(TERP_SGD_parameters[-1][:-1]))
    TERP_SGD_IFE = np.array(TERP_SGD_unfaithfulness)

  best_model = np.argsort(TERP_SGD_IFE)[0]
  best_parameters_master.append(TERP_SGD_parameters[best_model])
  best_interp_master.append(TERP_SGD_interp[best_model])

  temp_coef_1 = TERP_SGD_parameters[best_model][:-1]
  temp_coef_2 = np.zeros((tot_feat))
  temp_coef_2[selected_features] = copy.deepcopy(temp_coef_1)
  best_parameters_converted.append(temp_coef_2)
  best_unfaithfulness_master.append(TERP_SGD_unfaithfulness[best_model])

  surrogate_pred = data@TERP_SGD_parameters[best_model][:-1]

`zeta(U, S, theta)` ¶

Function for computing the interpretation free energy.

Parameters:

Name	Type	Description	Default
`U`	`np.ndarray`	Numpy array with unfaithfulness of the best models for number of features, k = 1, ..., N.	required
`S`	`np.ndarray`	Numpy array with interpretation entropy of the best models for number of features, k = 1, ..., N.	required
`theta`	`float`	Temperature of the fitted linear model.	required

Returns:

Type	Description
`np.ndarray`	Interpretation free energy.

Source code in MDTerp/final_analysis.py

def zeta(U: np.ndarray,S: np.ndarray,theta: float) -> np.ndarray:
  """
    Function for computing the interpretation free energy.

    Args:
        U (np.ndarray): Numpy array with unfaithfulness of the best models for number of features, k = 1, ..., N.
        S (np.ndarray): Numpy array with interpretation entropy of the best models for number of features, k = 1, ..., N.
        theta (float): Temperature of the fitted linear model.

    Returns:
        np.ndarray: Interpretation free energy.
  """
  return U + theta*S

MDTerp module¶

SGDreg(data, labels, seed, alpha=1.0) ¶

charac_theta(d_U, d_S) ¶

final_model(neighborhood_data, pred_proba, unf_threshold, feature_type_indices, selected_features, seed) ¶

interp(coef_array) ¶

similarity_kernel(data, kernel_width=1.0) ¶

unfaithfulness_calc(k, N, data, predict_proba, best_parameters_master, labels, best_interp_master, best_parameters_converted, best_unfaithfulness_master, tot_feat, selected_features, seed) ¶

zeta(U, S, theta) ¶

`SGDreg(data, labels, seed, alpha=1.0)` ¶

`charac_theta(d_U, d_S)` ¶

`final_model(neighborhood_data, pred_proba, unf_threshold, feature_type_indices, selected_features, seed)` ¶

`interp(coef_array)` ¶

`similarity_kernel(data, kernel_width=1.0)` ¶

`unfaithfulness_calc(k, N, data, predict_proba, best_parameters_master, labels, best_interp_master, best_parameters_converted, best_unfaithfulness_master, tot_feat, selected_features, seed)` ¶

`zeta(U, S, theta)` ¶