Skip to contents

1) Selection of critical feature set


A univariate logistic regression is performed separately for every feature f (f = 1, …, total number of features):

P(casei)=11+exp((α+βfzf,i)) P(\text{case}_i) = \frac{1}{1 + \exp\left(-(\alpha + \beta_f z_{f,i})\right)}

where zfz_f is the vector of z-score scaled values for feature f, and α (intercept) and 𝛽f𝛽_f are the estimated parameters. This is performed for all features seperately, and then the features are ordered by their univariate area under the receiver operating characteristic curve (AUROC) or p-value (H0H_0: 𝛽f𝛽_f = 0).

Next let C represent the set of the c most discriminative genes (i.e. the top c features after ordering), then for patient i the classifier score is the sum of the 𝛽f𝛽_fzf,iz_{f,i} values calculated over these c features:


scorei=fCβfzf,i \text{score}_i = \sum_{f \in C} \beta_f z_{f,i}

This function is essentially a prediction model to which various classification metrics (e.g. AUROC(case ~ score), MCC etc) can be applied to assess performance for each model size c. A critical feature set ccritc_{crit} is defined as the subset of features that maximizes the chosen metric across all values of c (i.e the gloabal maximum). It is generally observed that as features are added from 1 to ccritc_{crit} the predictive performance improves; thus the critical feature set can be construed as the “signal”. After ccritc_{crit} adding more features to the model generally leads to a decrease in predictive performance; thus this set of features may be construed as “noise”.

Note if predicted probabilities are required from the critical feature set model then Platt scaling (logistic regression) is applied to the scores:


P(casei)=11+exp((a+b.scorei) P(\text{case}_i) = \frac{1}{1 + \exp\left(-(a + b\text{.score}_i \right)}

2) Selection of minimal feature set


Within the critical set there may be features with redundant information (i.e correlated features). To address this issue we seek a minimal set of informative features from the critical set defined above.

Beginning with an intercept-only model we assess if adding features consecutively improves predictive improvement. First the most discriminative feature (top of the ordered feature list) is used to contruct the model (as above), if this is sufficiently better than the intercept-only model then this feature is included in the minimal feature set. If the feature does not sufficiently improve the model it is excluded. Then the next most discriminative feature is considered, provisionally including it to the minimal set and assessing if the model is improved, including or excluding accordingly. This is iterated over all features in the critical set defined above.

To assess model improvement metrics like LRT p-value or fraction of new information can be used (see Frank Harrel). For both these methods the classifier score is converted to a probability using Platt scaling.