Procedure
Joe Butler
2024-10-28
Derivation.Rmd
1) Selection of critical feature set
A univariate logistic regression is performed separately for every feature f (f = 1, …, total number of features):

$$\operatorname{logit} P(y = 1 \mid x_f) = \alpha + \beta_f x_f$$

where y is case status (1 = case, 0 = control),
$x_f$ is the vector of z-score scaled values for feature f, and
α (intercept) and
$\beta_f$ are the estimated parameters. This is performed for all features
separately, and then the features are ordered by their univariate area
under the receiver operating characteristic curve (AUROC) or p-value
($H_0$: $\beta_f$ = 0).
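The ranking step can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the toy data, the feature names `g1` to `g3`, and the helper names are all hypothetical, and the logistic regression is fitted with a plain Newton-Raphson loop, which assumes no feature perfectly separates cases from controls.

```python
import math
import statistics

def zscore(values):
    """Scale a feature to mean 0 and sample standard deviation 1."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_univariate_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson fit of logit P(y = 1) = alpha + beta * x."""
    alpha = beta = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(alpha + beta * xi)
            w = p * (1.0 - p)
            g0 += yi - p             # gradient w.r.t. alpha
            g1 += (yi - p) * xi      # gradient w.r.t. beta
            h00 += w                 # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d_alpha = (h11 * g0 - h01 * g1) / det
        d_beta = (h00 * g1 - h01 * g0) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) + abs(d_beta) < tol:
            break
    return alpha, beta

def auroc(scores, y):
    """Share of case/control pairs in which the case scores higher (ties count 0.5)."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(raw, y):
    """Fit one model per feature and order features by univariate AUROC."""
    results = []
    for name, values in raw.items():
        x = zscore(values)
        _, beta = fit_univariate_logistic(x, y)
        results.append((name, beta, auroc([beta * xi for xi in x], y)))
    return sorted(results, key=lambda r: r[2], reverse=True)

# Hypothetical toy data: 4 controls followed by 4 cases, 3 features.
y = [0, 0, 0, 0, 1, 1, 1, 1]
raw = {
    "g1": [1.0, 2.0, 3.0, 5.0, 4.0, 6.0, 7.0, 8.0],
    "g2": [3.0, 1.0, 4.0, 6.0, 2.0, 5.0, 7.0, 8.0],
    "g3": [5.0, 3.0, 8.0, 1.0, 4.0, 7.0, 2.0, 6.0],
}
ranking = rank_features(raw, y)
for name, beta, auc in ranking:
    print(f"{name}: beta = {beta:.3f}, AUROC = {auc:.3f}")
```

Ordering by p-value instead would only change the sort key; AUROC is used here because it needs no standard errors.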
Next let C represent the set of the c
most discriminative features (i.e. the top c features after
ordering), then for patient i the classifier score is the sum
of the
$\beta_f x_{if}$ values calculated over these c features:

$$S_i = \sum_{f \in C} \beta_f x_{if}$$
This function is essentially a prediction model to which various
classification metrics (e.g. AUROC(case ~ score), MCC etc.) can be
applied to assess performance for each model size c. A critical
feature set
$C^*$, of size $c^*$,
is defined as the subset of features that maximizes the chosen metric
across all values of c (i.e. the global maximum). It is
generally observed that as features are added from 1 to
$c^*$ the predictive performance improves; thus the critical feature set can
be construed as the “signal”. After
$c^*$, adding more features to the model generally leads to a decrease in
predictive performance; thus these additional features may be construed as
“noise”.
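The search over model sizes can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the scaled values in `x_scaled`, the coefficients in `betas`, and the ordering in `ordered` are hypothetical toy inputs standing in for the output of the univariate step, and AUROC is used as the chosen metric.

```python
def auroc(scores, y):
    """Share of case/control pairs in which the case scores higher (ties count 0.5)."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def classifier_scores(x_scaled, betas, features):
    """Score for each patient: sum of beta_f * x_if over the given features."""
    n = len(x_scaled[features[0]])
    return [sum(betas[f] * x_scaled[f][i] for f in features) for i in range(n)]

def critical_feature_set(x_scaled, betas, ordered, y):
    """Evaluate every model size c and keep the top-c set with the best AUROC."""
    performance = {}
    for c in range(1, len(ordered) + 1):
        scores = classifier_scores(x_scaled, betas, ordered[:c])
        performance[c] = auroc(scores, y)
    # max() returns the first (smallest) c that attains the maximum
    c_star = max(performance, key=performance.get)
    return ordered[:c_star], performance

# Hypothetical toy inputs: 3 controls followed by 3 cases.
y = [0, 0, 0, 1, 1, 1]
x_scaled = {
    "f1": [-1.0, -0.5, 0.5, 0.0, 1.0, 1.5],
    "f2": [0.0, 0.0, -2.0, 2.0, 0.0, 0.0],
    "f3": [0.0, 2.0, -2.0, -2.0, 1.0, 2.0],
}
betas = {"f1": 1.0, "f2": 0.5, "f3": 0.5}   # assumed, not fitted
ordered = ["f1", "f2", "f3"]                # assumed univariate ordering

critical, performance = critical_feature_set(x_scaled, betas, ordered, y)
print(critical)
print(performance)
```

In this toy example the AUROC rises from 8/9 with one feature to 1 with two, then falls back to 8/9 when the third is added, so the critical set is the top two features.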
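Note that if predicted probabilities are required from the
critical feature set model, then Platt scaling (logistic regression) is
applied to the scores:

$$P(y_i = 1 \mid S_i) = \frac{1}{1 + \exp\left(-(\gamma_0 + \gamma_1 S_i)\right)}$$

where $S_i$ is the classifier score for patient i, and $\gamma_0$ and $\gamma_1$ are estimated by logistic regression of case status on the score.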
2) Selection of minimal feature set
Within the critical set there may be features that carry redundant information (i.e. correlated features). To address this, we seek a minimal set of informative features from the critical set defined above.
Beginning with an intercept-only model, we assess whether adding features one at a time improves predictive performance. First, the most discriminative feature (top of the ordered feature list) is used to construct the model (as above); if this model is sufficiently better than the intercept-only model, the feature is included in the minimal feature set, otherwise it is excluded. The next most discriminative feature is then provisionally added to the minimal set and kept only if the model improves. This is iterated over all features in the critical set defined above.
To assess model improvement, metrics such as the likelihood ratio test (LRT) p-value or the fraction of new information can be used (see Frank Harrell). For both of these methods the classifier score is converted to a probability using Platt scaling.
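The forward pass over the critical set can be sketched as below, using the LRT p-value as the improvement criterion. Several parts are assumptions rather than the documented method: the toy data, the coefficients in `betas`, the threshold `alpha=0.05`, and in particular the acceptance rule, which treats twice the gain in Platt-scaled log-likelihood as a chi-square statistic with 1 degree of freedom. The Newton-Raphson fit also assumes that no score perfectly separates cases from controls.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def platt_loglik(scores, y, max_iter=50, tol=1e-10):
    """Fit logit P(y = 1) = a + b * score (Platt scaling); return the log-likelihood."""
    a = b = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for s, yi in zip(scores, y):
            p = sigmoid(a + b * s)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * s
            h00 += w
            h01 += w * s
            h11 += w * s * s
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    # log(1 - sigmoid(t)) is computed as log(sigmoid(-t)) for stability
    return sum(math.log(sigmoid(a + b * s if yi == 1 else -(a + b * s)))
               for s, yi in zip(scores, y))

def null_loglik(y):
    """Log-likelihood of the intercept-only model."""
    n1 = sum(y)
    n0 = len(y) - n1
    p = n1 / len(y)
    return n1 * math.log(p) + n0 * math.log(1.0 - p)

def lrt_pvalue(ll_old, ll_new):
    """Upper tail of a chi-square(1 df) at 2 * (ll_new - ll_old)."""
    stat = max(2.0 * (ll_new - ll_old), 0.0)
    return math.erfc(math.sqrt(stat / 2.0))

def minimal_feature_set(x_scaled, betas, critical, y, alpha=0.05):
    """Keep a feature only if provisionally adding it improves the scaled model."""
    selected, ll_current = [], null_loglik(y)
    for f in critical:
        trial = selected + [f]
        scores = [sum(betas[g] * x_scaled[g][i] for g in trial) for i in range(len(y))]
        ll_trial = platt_loglik(scores, y)
        if lrt_pvalue(ll_current, ll_trial) < alpha:
            selected, ll_current = trial, ll_trial
    return selected

# Hypothetical toy data: 6 controls followed by 6 cases.
y = [0] * 6 + [1] * 6
f1 = [-2.75, -2.25, -1.75, -1.25, 0.25, 0.75, -0.75, -0.25, 1.25, 1.75, 2.25, 2.75]
x_scaled = {
    "f1": f1,
    "f2": list(f1),  # exact copy of f1, i.e. fully redundant
    "f3": [0.0, 0.0, 0.0, 0.0, 0.0, -2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}
betas = {"f1": 1.0, "f2": 1.0, "f3": 0.75}  # assumed, not fitted
selected = minimal_feature_set(x_scaled, betas, ["f1", "f2", "f3"], y)
print(selected)
```

Because `f2` duplicates `f1`, adding it only rescales the score and leaves the Platt-scaled fit unchanged, so it is rejected. `f3` corrects one misordered case/control pair and is retained.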