Machine Learning


See also Course material, LABS

Projects/datasets

DATASET no 1 [classification]: Forest CoverType
Accurate natural resource inventory information is vital to any private, state, or federal land management agency. Forest cover type is one of the most basic characteristics recorded in such inventories. Generally, cover type data is either directly recorded by field personnel or estimated from remotely sensed data. Both of these techniques may be prohibitively time consuming and/or costly in some situations. Furthermore, an agency may find it useful to have inventory information for adjoining lands that are not directly under its control, where it is often economically or legally impossible to collect inventory data. Predictive models provide an alternative method for obtaining such data. The question is then to propose machine learning algorithms to predict the cover type from cartographic variables only (no remotely sensed data).
A paper describes the data, the problem and a solution based on random forest here. Data can be downloaded here.

DATASET no 2 [regression]: Purity of tumor
With the advent of array-based techniques to measure methylation levels in primary tumor samples, systematic investigations of methylomes have been widely performed on a large number of tumor entities. Most of these approaches are not based on measuring individual cell methylation but rather the bulk tumor sample DNA, which contains a mixture of tumor cells, infiltrating immune cells and other stromal components. This raises questions about the purity of a given tumor sample. The question is then to propose a machine learning algorithm to predict the purity.
A paper describes the data, the problem and a solution based on random forest here. Data can be downloaded here.

DATASET no 3 [classification]: Mesothelioma’s disease
Malignant mesotheliomas (MM) are very aggressive tumors of the pleura. These tumors are connected to asbestos exposure;
however, they may also be related to previous simian virus 40 (SV40) infection and, quite possibly, to genetic predisposition.
Molecular mechanisms can also be implicated in the development of mesothelioma.
The question here is to propose a machine learning algorithm to predict the diagnosis.
The data and their description are available here and the associated paper here.

DATASET no 4 [regression and/or classification]: Pollen emission in Luxembourg
Air pollution in large cities produces numerous diseases and even millions of deaths annually according to the World Health Organization. Pollen exposure is related to allergic diseases, which makes its prediction a valuable tool to assess the risk level to aeroallergens. However, airborne pollen concentrations are difficult to predict due to the inherent complexity of the relationships among both biotic and environmental variables. In this project, you will build machine learning models to predict the presence/absence (or concentration) of pollen in the air from meteorological data. This paper gives a first overview of the meteorological variables that can be useful for predicting pollen emission. However, you may build or find other variables to improve the models. The dataset is available here. Weather data come from the ECA dataset and are recorded at Luxembourg airport.

DATASET no 5 [regression (*)]: HIV mutation and resistance to drugs
Antiretroviral drugs are very effective therapies against HIV infection. However, the high mutation rate of HIV permits the emergence of variants that can be resistant to drug treatment. Predicting drug resistance to previously unobserved variants is therefore very important for optimum medical treatment. The drug susceptibility data comprise 25,434 PI, 19,858 NRTI, 11,546 NNRTI and 4,606 INI susceptibility results from HIV-1 virus isolates. The question here is to predict the resistance given the virus sequences (for one chosen dataset). References to past uses of the data can be found here (and references therein).
The data, their description and R code to preprocess the data are available here.
(*): in this project, kernel representations are used. These concepts are discussed in the last part of the course. However, help will be provided if you choose to work on this subject.
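
As a rough illustration of what a kernel representation of sequences can look like, the sketch below computes a k-mer "spectrum" kernel, i.e. the inner product of the k-mer count vectors of two sequences. This particular kernel, the function names and the toy sequences are assumptions chosen for illustration; they are not taken from the papers or the R code linked above.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every contiguous substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=2):
    """Inner product of the k-mer count vectors of the two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for k-mers it has never seen.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


def gram_matrix(sequences, k=2):
    """Kernel value for every pair of sequences."""
    return [[spectrum_kernel(a, b, k) for b in sequences] for a in sequences]


# Hypothetical toy sequences, not real HIV-1 isolates.
toy_sequences = ["PQITLW", "PQVTLW", "LWQRPL"]
K = gram_matrix(toy_sequences, k=2)
for row in K:
    print(row)
```

A Gram matrix of this kind can be passed to any kernel method that accepts a precomputed kernel (kernel ridge regression, for instance), so the regression never needs an explicit feature vector for each sequence.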

DATASET no 6 [classification]: Down syndrome mice exposed to context fear conditioning.
Down syndrome (DS) is a chromosomal abnormality (trisomy of human chromosome 21) associated with intellectual disability and affecting approximately one in 1000 live births worldwide. The overexpression of genes encoded by the extra copy of a normal chromosome in DS is believed to be sufficient to perturb normal pathways and normal responses to stimulation, causing learning and memory deficits. The questions are to detect/explain the treatment and genotype differences.
This dataset has been used in at least two papers: Higuera et al. (2015), for clustering tasks and for classification, and Ahmed et al. (2015), for extraction. Data are available here with a short description here.

DATASET no 7 [classification, deep learning]: Blood cell images classification

The diagnosis of blood-based diseases often involves identifying and characterizing patient blood samples. Automated methods to detect and classify blood cell subtypes have important medical applications.
The dataset contains 12,500 augmented images of blood cells (JPEG) with accompanying cell type labels (CSV). There are approximately 3,000 images for each of 4 different cell types. The cell types are Eosinophil, Lymphocyte, Monocyte, and Neutrophil. This dataset is accompanied by an additional dataset containing the original 410 images (pre-augmentation).
Data can be downloaded from the Kaggle website here. More details and some ideas to solve the classification problem can be read here.
♠: deep learning will be addressed only during the last lectures of this course and it may require a powerful computer (with GPU computation), although Google Colab can be used instead.

DATASET no 8 [classification, deep learning]: Cloud type detection
Clouds are a major challenge for passive satellite imaging, and daily cloud cover and rain showers in the Amazon basin can significantly complicate monitoring in the area. For this reason we have chosen to include a cloud cover label for each chip. These labels closely mirror what one would see in a local weather forecast: clear, partly cloudy, cloudy, and haze. For our purposes haze is defined as any chip where atmospheric clouds are visible but they are not so opaque as to obscure the ground. Clear scenes show no evidence of clouds, and partly cloudy scenes can show opaque cloud cover over any portion of the image. Cloudy images have 90% of the chip obscured by opaque cloud cover. The dataset to be used gathers satellite images of the Amazon and their labels (clear, partly cloudy, cloudy, and haze); it can be downloaded from the Kaggle website here. This talk describes an analysis of this dataset.
♠: deep learning will be addressed only during the last lectures of this course and it may require a powerful computer (with GPU computation), although Google Colab can be used instead.
♣: if you decide to work on this project, the dataset is available in a Google Drive (link).



DATASET no 9 [regression]: Correction of temperature forecasts in France.
The challenge consists in improving the hourly 2 m temperature forecasts issued 15 h and 27 h ahead for 7 locations in France, given several weather variables. The dataset provides the temperature forecasts and the observations at the same locations and times. Your models will use the temperature forecast and other weather variables to improve the temperature forecast by the Météo-France AROME model.
A paper with an example is available here, and a paper describing the data here. The datasets are available below for the forecast horizons 15 h and 27 h, together with a README file.
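
To make the task concrete, here is a minimal sketch of the simplest possible correction: a least-squares line fitted from the raw forecast to the observation, compared with the raw forecast in terms of RMSE. The data are synthetic (a biased, noisy forecast generated at random), and all names are assumptions for illustration; the real files and their column names are described in the README.

```python
import math
import random


def fit_linear_correction(forecasts, observations):
    """Least-squares fit of observation ~ intercept + slope * forecast."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    mean_o = sum(observations) / n
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecasts, observations))
    var = sum((f - mean_f) ** 2 for f in forecasts)
    slope = cov / var
    intercept = mean_o - slope * mean_f
    return intercept, slope


def rmse(predictions, observations):
    """Root mean squared error between predictions and observations."""
    squared = sum((p - o) ** 2 for p, o in zip(predictions, observations))
    return math.sqrt(squared / len(observations))


# Synthetic stand-in for the real data: the "forecast" is a biased,
# noisy version of the "observed" temperature (degrees Celsius).
rng = random.Random(0)
observed = [rng.uniform(-5.0, 30.0) for _ in range(500)]
forecast = [1.5 + 0.9 * t + rng.gauss(0.0, 1.0) for t in observed]

intercept, slope = fit_linear_correction(forecast, observed)
corrected = [intercept + slope * f for f in forecast]

rmse_raw = rmse(forecast, observed)
rmse_corrected = rmse(corrected, observed)
print(f"RMSE raw forecast: {rmse_raw:.3f}")
print(f"RMSE corrected:    {rmse_corrected:.3f}")
```

The comparison above is made on the same data used for the fit, where a least-squares line can never do worse than the raw forecast. In the project, the correction should be evaluated on held-out dates, and the other weather variables offer room to improve on this baseline.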

DATASET no 10 [classification (rare event)]: Prediction of ozone pollution in Houston.
Accurate ozone alert forecasting systems are necessary to issue warnings to the public before the ozone reaches a dangerous level. However, little is known about exactly which features are important in ozone production and how they actually interact in the formation of ozone. This provides a wonderful opportunity for machine learning.
One of the reference papers is available here, and a short description of the original dataset, which contains 72 meteorological variables, here. A version with no missing values is available below (Ozone_imputed.csv).
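
Because ozone days are rare events, accuracy alone can be misleading. The sketch below uses hypothetical label counts (30 ozone days out of 1000, not the actual class balance of the dataset) to compare a classifier that never raises an alert with one that detects most ozone days at the cost of some false alarms.

```python
def scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = ozone day)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 30 ozone days followed by 970 normal days.
y_true = [1] * 30 + [0] * 970

# Classifier A never raises an alert.
never_alert = [0] * 1000

# Classifier B finds 20 of the 30 ozone days and raises 40 false alarms.
some_alerts = [1] * 20 + [0] * 10 + [1] * 40 + [0] * 930

baseline = scores(y_true, never_alert)
detector = scores(y_true, some_alerts)
print("never alert:", baseline)
print("some alerts:", detector)
```

In this toy setting the classifier that never alerts has the higher accuracy although it misses every ozone day, which is why precision, recall or F1 are usually reported for rare-event problems.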

DATASET no 11 [classification, deep learning (*)]: Birdcall identification
Do you hear the birds chirping outside your window? Over 10,000 bird species occur in the world, and they can be found in nearly every environment, from untouched rainforests to suburbs and even cities. Birds play an essential role in nature. They are high up in the food chain and integrate changes occurring at lower levels. As such, birds are excellent indicators of deteriorating habitat quality and environmental pollution. However, it is often easier to hear birds than see them. With proper sound detection and classification, researchers could automatically intuit factors about an area’s quality of life based on a changing bird population. The dataset can be downloaded from the Kaggle website here. The challenge is to build machine learning algorithm(s) to predict the bird species from audio recordings. Deep learning is only one of the possible solutions.
♠: deep learning will be addressed only during the last lectures of this course and it may require a powerful computer (with GPU computation), although Google Colab or Kaggle facilities can be used instead. You will find lectures about deep learning for audio here.

DATASET no 12 [classification]: Lymphoma
The lymphoma dataset (Shipp et al., 2002) consists of 7129 gene expression levels from 77 lymphomas. The 77 samples are divided into 58 diffuse large B-cell lymphomas (DLBCL) and 19 follicular lymphomas (FL). The data can be found at https://github.com/ramhiser/datamicroarray/blob/master/data/shipp.RData and the reference paper here.

DATASET no 11 [classification]: Colon cancer tumor
The colon dataset comes from the microarray experiment on colon tissue samples of Alon et al. (1999). It contains the expression levels of 2000 genes for 40 tumors and 22 normal colon tissues. The data can be freely downloaded from http://microarray.princeton.edu/oncology/affydata/index.html and the reference paper here.


DATASET no 13 [regression]: Communities and Crime in US.
The data combine socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Admin Stats survey, and crime data from the 1995 FBI UCR. They can be used for various regression tasks, for instance predicting the number of murders in 1995, the number of rapes, or the total number of non-violent crimes per 100K population, and for trying to extract the most relevant features.
A description of the data can be found here and the data can be downloaded here.
The dataset includes missing data.
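
One simple way to handle the missing entries before fitting a model is to replace each one by the median of its column. The sketch below does this on a toy table; the "?" marker for missing values and the toy numbers are assumptions for illustration, so check the data description for the actual encoding.

```python
import statistics

MISSING = "?"  # assumed marker for a missing value; check the data description


def impute_column_medians(rows):
    """Convert string cells to floats, replacing missing cells by the column median."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [float(row[j]) for row in rows if row[j] != MISSING]
        # statistics.median raises StatisticsError if a column is entirely missing.
        medians.append(statistics.median(observed))
    return [
        [medians[j] if row[j] == MISSING else float(row[j]) for j in range(n_cols)]
        for row in rows
    ]


# Toy table with made-up values, not rows from the real dataset.
toy_rows = [
    ["0.19", "0.33", "?"],
    ["0.00", "?", "0.12"],
    ["0.04", "0.45", "0.21"],
    ["?", "0.27", "0.30"],
]
completed = impute_column_medians(toy_rows)
for row in completed:
    print(row)
```

When a train/test split is used, the medians should be computed on the training rows only and then applied to the test rows; columns that are mostly missing may be better dropped than imputed.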