Machine Learning



6. Data Driven Models

Part I - Prediction of a categorical variable with 2 levels

We have collected gene expression levels for 4654 genes on 97 early-stage breast cancer samples. After surgical removal of the tumour, some patients relapsed within 5 years (label=+1), while others did not (label=0).
The goal of the lab is to build data driven models that predict relapse from the gene expression levels.
1. Nearest Neighbor method
a. Use the nearest neighbor method to predict the relapse. You will have to choose the « best » number of neighbors. Plot a ROC curve (estimated by cross-validation).
b. Computing distances between observations in a high-dimensional space is not a good idea. So, propose a nearest neighbor method based on a lower-dimensional space (PCA, best correlated predictors, predictors extracted from a Lasso regression, etc).
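
As a starting point for question 1, here is a minimal sketch using scikit-learn. It is not the content of the provided starter files. The data are a synthetic stand-in generated with `make_classification` (same shape as the real data, 97 x 4654), and the grid of neighbor counts, the 10 PCA components and the 5-fold scheme are arbitrary assumptions to adapt. The number of neighbors is tuned inside each training fold, so the ROC curve is built only from predictions on held-out samples.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (assumption): replace X, y with the
# gene expression matrix and the relapse labels loaded from the course files.
X, y = make_classification(n_samples=97, n_features=4654,
                           n_informative=20, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scaling and PCA live inside the pipeline so they are refitted on each
# training fold and never see the held-out samples.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),   # arbitrary choice, to be tuned
    ("knn", KNeighborsClassifier()),
])

k_grid = [1, 3, 5, 7, 9]             # arbitrary candidate values
search = GridSearchCV(pipe, {"knn__n_neighbors": k_grid},
                      scoring="roc_auc", cv=cv)

# "Best" number of neighbors according to cross-validated AUC on all data.
search.fit(X, y)
best_k = search.best_params_["knn__n_neighbors"]

# Nested cross-validation: the outer loop gives out-of-fold probabilities,
# the inner loop (inside `search`) chooses k on each outer training fold.
proba = cross_val_predict(search, X, y, cv=cv, method="predict_proba")[:, 1]
fpr, tpr, thresholds = roc_curve(y, proba)
auc = roc_auc_score(y, proba)

print(f"best k = {best_k}, cross-validated AUC = {auc:.3f}")
```

The arrays `fpr` and `tpr` can be passed to any plotting function (for example `matplotlib.pyplot.plot(fpr, tpr)`) to draw the ROC curve. Removing the `"pca"` step from the pipeline gives the plain nearest neighbor method of question 1.a for comparison.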

2. Decision Tree
a. Build a decision tree to predict the relapse. Plot a ROC curve (estimated by cross-validation).
b. Can you improve the previous result if you fit the tree on a lower-dimensional space?
3. Which one is your best model? According to which criterion?
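
For questions 2 and 3, one possible approach is sketched below with scikit-learn, again on synthetic stand-in data. The tree depth, the use of `SelectKBest` with an F-test to keep 20 predictors, and the choice of cross-validated AUC as the comparison criterion are all assumptions made for illustration; other criteria (accuracy, sensitivity at a fixed specificity, etc.) may be more relevant for the application.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumption): replace with the real X, y.
X, y = make_classification(n_samples=97, n_features=4654,
                           n_informative=20, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    # Tree fitted on all genes.
    "tree_all_genes": DecisionTreeClassifier(max_depth=3, random_state=0),
    # Tree fitted on the 20 genes most associated with the label; the
    # selection is inside the pipeline so it is redone on each training fold.
    "tree_top20_genes": Pipeline([
        ("select", SelectKBest(f_classif, k=20)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ]),
}

cv_auc = {}
for name, model in models.items():
    # Out-of-fold probability of relapse for every sample.
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    cv_auc[name] = roc_auc_score(y, proba)

# One possible criterion for "best": the highest cross-validated AUC.
best_name = max(cv_auc, key=cv_auc.get)

for name, auc in cv_auc.items():
    print(f"{name}: AUC = {auc:.3f}")
print("best model by this criterion:", best_name)
```

The same loop can include the nearest neighbor pipelines, so that all candidate models are compared on identical folds.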

Part II - Prediction of a continuous variable

What changes are needed in the previous code to run it on the cookies dataset and predict the fat percentage?
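
The sketch below illustrates the kind of changes involved when moving from classification to regression with scikit-learn: regressors replace classifiers, plain `KFold` replaces stratified folds, and an error measure such as RMSE or R² replaces the ROC curve. The data are synthetic (`make_regression` with an arbitrary shape), not the cookies dataset, and the hyperparameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with an arbitrary shape (assumption): replace X with the
# cookies predictors and y with the measured fat percentage.
X, y = make_regression(n_samples=72, n_features=700, n_informative=10,
                       noise=1.0, random_state=0)

# The target is continuous, so folds are no longer stratified by class.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    # KNeighborsClassifier -> KNeighborsRegressor (average of the neighbors).
    "knn_pca": Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=5)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ]),
    # DecisionTreeClassifier -> DecisionTreeRegressor.
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

results = {}
for name, model in models.items():
    # Out-of-fold predictions of the continuous target (no predict_proba).
    pred = cross_val_predict(model, X, y, cv=cv)
    rmse = float(np.sqrt(mean_squared_error(y, pred)))
    results[name] = {"rmse": rmse, "r2": float(r2_score(y, pred))}

for name, scores in results.items():
    print(f"{name}: RMSE = {scores['rmse']:.2f}, R2 = {scores['r2']:.3f}")
```

A scatter plot of predicted against observed values plays the role that the ROC curve had in Part I.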

Codes: CancerRelapse_DataDrivenModels_ToStart.py, CancerRelapse_DataDrivenModels_ToStart.R, Cookies_DataDrivenModels.py, Cookies_DataDrivenModels.R