Machine Learning

See also Course material, LABS


DATASET no 1 [classification]: Forest CoverType
Accurate natural resource inventory information is vital to any private, state, or federal land management agency. Forest cover type is one of the most basic characteristics recorded in such inventories. Generally, cover type data is either directly recorded by field personnel or estimated from remotely sensed data. Both of these techniques may be prohibitively time consuming and/or costly in some situations. Furthermore, an agency may find it useful to have inventory information for adjoining lands that are not directly under its control, where it is often economically or legally impossible to collect inventory data. Predictive models provide an alternative method for obtaining such data. The question is then to propose machine learning algorithms to predict the cover type from cartographic variables only (no remotely sensed data).
A paper describes the data, the problem and a solution based on random forest here. Data can be downloaded here.

DATASET no 2 [regression/classification]: Toxicity of drugs
Cholestatic liver injury is frequently associated with drug inhibition of bile salt transporters, such as the bile salt export pump (BSEP). Reliable in silico models to predict BSEP inhibition directly from chemical structures would significantly reduce costs during drug discovery and could help avoid injury to patients.
Here, we propose to develop machine learning algorithms to predict inhibition of the bile salt export pump from the properties of compounds.
A paper with examples can be found here. Data can be downloaded here in the SMILES format. Preprocessing is required.
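As a rough illustration of what the preprocessing step can look like, the sketch below turns a SMILES string into a small table of atom counts using only the standard library. This is a deliberately crude featurization chosen for illustration, not the method of the paper; in practice a cheminformatics toolkit would normally be used to compute proper molecular descriptors. The function name `atom_counts` and the two example molecules are assumptions made for this sketch.

```python
import re
from collections import Counter

# Atoms of the SMILES "organic subset" written outside brackets, plus bracket
# atoms (e.g. [Na+]) kept as whole tokens. Two-letter symbols (Br, Cl) are
# listed before the one-letter alternatives so that they are matched first.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def atom_counts(smiles):
    """Count atom tokens in a SMILES string.

    Bonds, branches and ring-closure digits are ignored; lowercase tokens
    denote aromatic atoms and are counted separately from uppercase ones.
    """
    return dict(Counter(ATOM_TOKEN.findall(smiles)))

# Two small molecules used only as illustrations (not taken from the dataset).
acetyl_chloride = atom_counts("CC(=O)Cl")
phenol = atom_counts("c1ccccc1O")
print(acetyl_chloride)
print(phenol)
```

Each compound then becomes a row of counts (one column per token), which can be fed to any standard classifier or regressor.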

DATASET no 3 [regression]: Purity of tumor
With the advent of array-based techniques to measure methylation levels in primary tumor samples, systematic investigations of methylomes have widely been performed on a large number of tumor entities. Most of these approaches are not based on measuring individual cell methylation but rather the bulk tumor sample DNA, which contains a mixture of tumor cells, infiltrating immune cells and other stromal components. This raises questions about the purity of a certain tumor sample. The question is then to propose  machine learning algorithms to predict the purity.
A paper describes the data, the problem and a solution based on random forest here. Data can be downloaded here (see also an R code).

DATASET no 4 [classification, deep learning]: Blood cell images classification

The diagnosis of blood-based diseases often involves identifying and characterizing patient blood samples. Automated methods to detect and classify blood cell subtypes have important medical applications.
The dataset contains 12,500 augmented images of blood cells (JPEG) with accompanying cell type labels (CSV). There are approximately 3,000 images for each of 4 different cell types. The cell types are Eosinophil, Lymphocyte, Monocyte, and Neutrophil. This dataset is accompanied by an additional dataset containing the original 410 images (pre-augmentation).
Data can be retrieved and downloaded from the Kaggle website here. More details and some ideas to solve the classification problem can be read here.
♠: deep learning will be addressed only during the last lectures of this course and it may require a powerful computer (with GPU support), although Google Colab and/or Kaggle facilities can be used instead.

DATASET no 5 [regression]: Plant species richness in Tibetan alpine grasslands
Species richness is the core of biodiversity-ecosystem functioning (BEF) research. Nevertheless, it is difficult to accurately predict changes in plant species richness under different climate scenarios, especially in alpine biomes. In this project we propose to identify the most critical driver of species richness distribution.
A paper describing an analysis of the data set is available here. The dataset is available  here.

DATASET no 6 [classification, deep learning]: Clouds type detection
Clouds are a major challenge for passive satellite imaging, and daily cloud cover and rain showers in the Amazon basin can significantly complicate monitoring in the area. For this reason, we have chosen to include a cloud cover label for each chip. These labels closely mirror what one would see in a local weather forecast: clear, partly cloudy, cloudy, and haze. For our purposes, haze is defined as any chip where atmospheric clouds are visible but they are not so opaque as to obscure the ground. Clear scenes show no evidence of clouds, and partly cloudy scenes can show opaque cloud cover over any portion of the image. Cloudy images have 90% of the chip obscured by opaque cloud cover. The dataset to be used gathers satellite images of the Amazon and their labels (clear, partly cloudy, cloudy, and haze); it can be downloaded from the Kaggle website here. This talk describes an analysis of this dataset.
♠: deep learning will be addressed only during the last lectures of this course and it may require a powerful computer (with GPU support), although Google Colab can be used instead.
♣: if you decide to work on this project, the dataset is available in a Google Drive (link).

Example code to get started is available under this link. Please save your own copy so that the original code is not modified.

DATASET no 7 [classification]: Mesotheliomas disease
Malignant mesotheliomas (MM) are very aggressive tumors of the pleura. These tumors are linked to asbestos exposure; however, they may also be related to previous simian virus 40 (SV40) infection and, quite possibly, to genetic predisposition.
Molecular mechanisms can also be implicated in the development of mesothelioma.
The question here is to propose machine learning algorithms to predict the diagnosis.
Data and their description are available here and the associated paper here.

DATASET no 8 [classification, regression]: Chinese gut microbiota
Some studies conclude that gut microbiota does not change with age. The data set provides the gut microbiota of about 1000 healthy Chinese individuals with some covariates: sex, age, height, weight, diet, etc. The goal of this project is to propose a complete analysis of these data. Some questions you can try to answer are: does the microbiota differ with age? With BMI? With diet? If yes, which parts of the microbiota are "differentially expressed"?
Data are available here and the associated paper here.

(*) DATASET no 9 [classification, deep learning, transfer learning]: Classification of Butterfly Species
Butterflies are abundant species on the earth, and the task of identification of butterflies is complex. How to
apply image processing methods to automatic identification of butterfly species is a hot issue in current research.

In this project, the problem of automatic detection and classification of butterfly species using butterfly photographs is studied.
Data can be retrieved and downloaded from the Kaggle website here. A paper with ideas and algorithms is available here.
♠: deep learning will be addressed only during the last lectures of this course and it may require a powerful computer (with GPU support), although Google Colab and/or Kaggle facilities can be used instead.

(*) DATASET no 10 [classification, deep learning, transfer learning]: Clinical diagnosis from chest X-ray images
The implementation of clinical-decision support algorithms for medical imaging faces challenges with reliability and interpretability. Here, you have to establish a diagnostic tool based on a deep-learning framework  for diagnosis of pediatric pneumonia using chest X-ray images. This tool may ultimately aid in expediting the diagnosis and referral of these treatable conditions, thereby facilitating earlier treatment, resulting in improved clinical outcomes.
A paper proposing a methodology is available here. The dataset is available on Kaggle here.
♠: deep learning will be addressed only during the last lectures of this course and it may require a powerful computer (with GPU support), although Google Colab and/or Kaggle facilities can be used instead.

(*) DATASET no 11 [classification, deep learning, sound preprocessing]: Birdcall identification
Do you hear the birds chirping outside your window? Over 10,000 bird species occur in the world, and they can be found in nearly every environment, from untouched rainforests to suburbs and even cities. Birds play an essential role in nature. They are high up in the food chain and integrate changes occurring at lower levels. As such, birds are excellent indicators of deteriorating habitat quality and environmental pollution. However, it is often easier to hear birds than see them. With proper sound detection and classification, researchers could automatically intuit factors about an area’s quality of life based on a changing bird population. The dataset can be downloaded from the Kaggle website here.
The challenge is to build machine learning algorithm(s) to predict the bird species from audio recordings. Deep learning is only one of the possible solutions.

♠: deep learning will be addressed only during the last lectures of this course and it may require a powerful computer (with GPU support), although Google Colab or Kaggle facilities can be used instead.
You will find lectures about deep learning for audio here.

DATASET no 12 [classification]: Influenza outbreak event prediction via Twitter
By identifying influenza-related tweets, the goal is to forecast the spatiotemporal patterns of influenza outbreaks for different locations and dates.
The data are from the United States and cover different states over different weeks. For each week, the task is to predict whether or not there is an influenza outbreak on the next date. More specifically, influenza activity is graded on four levels, from minimal to high, according to the CDC Flu Activity Map. An influenza outbreak is indicated if the activity level is high.
Data and paper can be downloaded here.
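The target definition above can be sketched as follows. This is only an illustration: it assumes that "the next date" means the following week in the same state, and the list of activity levels is a made-up example, not data from the project.

```python
# Hypothetical weekly activity levels for one state (illustrative values only).
weekly_levels = ["minimal", "low", "moderate", "high", "high", "low"]

def next_week_outbreak_labels(levels):
    """Return one binary target per week except the last:
    1 if the FOLLOWING week's activity level is 'high', else 0."""
    return [1 if nxt == "high" else 0 for nxt in levels[1:]]

labels = next_week_outbreak_labels(weekly_levels)
# The features of week t (e.g. counts of influenza-related tweets)
# would be paired with labels[t]; the last week has no target.
print(labels)
```

The last week is dropped because its outcome is not yet known, which also avoids leaking future information into the features.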

DATASET no 13 [classification]: Amphibians
This is a multilabel classification problem. The goal is to predict the presence of amphibian species near water reservoirs based on features obtained from GIS systems and satellite images.
The Road A project concerned part of the planned A1 motorway section in Pyrzowice; the section is located along the northern border of the Silesian Voivodship and is about 75 km long. The field research involved a strip of land with a width of 500 m on both sides of the proposed project area. In total, the first project included 80 amphibian breeding sites. Data are also available for a second road.
Data and paper can be downloaded here.
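In a multilabel problem each site can host several species at once, so the target is a binary indicator matrix with one column per species rather than a single class. A minimal sketch, with species names and sites invented for illustration (they are not read from the dataset), together with the Hamming loss, a common multilabel error measure:

```python
def to_indicator_matrix(label_sets, classes):
    """Encode each set of present species as a 0/1 row, one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]

def hamming_loss(Y_true, Y_pred):
    """Fraction of (site, species) entries that are predicted incorrectly."""
    total = sum(len(row) for row in Y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(Y_true, Y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total

# Hypothetical species and sites, for illustration only.
classes = ["green_frog", "common_toad", "tree_frog"]
sites = [{"green_frog"}, {"green_frog", "tree_frog"}, set()]

Y = to_indicator_matrix(sites, classes)
Y_pred = [[1, 0, 0], [1, 1, 1], [0, 0, 0]]  # one wrong entry out of nine
print(Y)
print(hamming_loss(Y, Y_pred))
```

One simple strategy is then to train one binary classifier per column; other multilabel strategies exist and may work better when species co-occur.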

DATASET no 14 [regression]: Predicting CO2 Emissions
The ability to accurately monitor carbon emissions is a critical step in the fight against climate change. Precise carbon readings allow researchers and governments to understand the sources and patterns of carbon mass output. While Europe and North America have extensive systems in place to monitor carbon emissions on the ground, there are few available in Africa.
The objective is to create machine learning models using open-source CO2 emissions data from Sentinel-5P satellite observations to predict future carbon emissions.
These solutions may help governments and other actors to estimate carbon emission levels across Africa, even in places where on-the-ground monitoring is not possible.
Material is available here.

DATASET no 15 [regression and/or classification]: Pollens emission
Air pollution in large cities produces numerous diseases and even millions of deaths annually according to the World Health
Organization. Pollen exposure is related to allergic diseases, which makes its prediction a valuable tool to assess the risk level to
aeroallergens. However, airborne pollen concentrations are difficult to predict due to the inherent complexity of the relationships
among both biotic and environmental variables.
In this project, you will aim to build machine learning models to predict the presence/absence (or concentration) of pollens in the air given meteorological data. This paper gives a first overview of the meteorological variables that can be useful for pollen emission prediction. However, you may build or find other variables to improve the models. Please ask the teacher for the data.
Weather data come from the ECA dataset and are recorded at Luxembourg airport.

DATASET no 16 [clustering, classification]: Down syndrome mice exposed to context fear conditioning.
Down syndrome (DS) is a chromosomal abnormality (trisomy of human chromosome 21) associated with intellectual disability and affecting approximately one in 1000 live births worldwide. The overexpression of genes encoded by the extra copy of a normal chromosome in DS is believed to be sufficient to perturb normal pathways and normal responses to stimulation, causing learning and memory deficits. The questions are to detect/explain the treatment and genotype differences.
This dataset has been used in at least two papers, Higuera et al. (2015) for clustering tasks and for classification and Ahmed et al. (2015) extraction. Data are available here with a short description here.

(**) DATASET no 17 [regression]: HIV mutation and resistance to drugs
Antiretroviral drugs are very effective therapies against HIV infection. However, the high mutation rate of HIV permits the emergence of variants that can be resistant to drug treatment. Predicting the drug resistance of previously unobserved variants is therefore very important for optimum medical treatment. The drug susceptibility data comprise 25,434 PI, 19,858 NRTI, 11,546 NNRTI and 4,606 INI susceptibility results from HIV-1 virus isolates. The question here is to predict the resistance given the virus sequences (for one chosen dataset).
References to past uses of the data can be found here (and references therein). The data, their description and an R code to preprocess the data are available here.
In this project, kernel representations are used. These concepts are discussed in the last part of the course, but help will be provided if you choose to work on this subject.
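To give a first idea of what a kernel on sequences is, here is a minimal sketch of a k-spectrum kernel, which compares two sequences through the substrings of length k they share. It is only one possible choice, picked here for illustration and not necessarily the kernel used in the references; the two short sequences are invented examples.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all contiguous substrings of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k=2):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[m] * ct[m] for m in cs if m in ct)

# Two invented amino-acid fragments differing by one substitution.
a, b = "PQIT", "PQVT"
gram = [[spectrum_kernel(x, y) for y in (a, b)] for x in (a, b)]
print(gram)
```

The resulting Gram matrix can be passed to any kernel method (for example kernel ridge regression or a support vector machine) in place of explicit features.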

DATASET no 18 [classification]: Bone marrow transplantation
The aim of our study was to compare the results of unrelated donor (UD) peripheral blood stem cell transplantation versus UD bone marrow transplantation and to analyze the impact of infused CD34+ and CD3+ cell doses on survival and incidence of severe graft-versus-host disease (GVHD) in 187 children who underwent UD hematopoietic cell transplantation with the use of in vivo T cell depletion (antithymocyte globulin or CAMPATH-1H).
More details and the dataset are available here.

DATASET no 19 [regression]: Correction of temperature forecast in France.
The challenge consists in improving the hourly 2 m temperature forecasts 15 h and 27 h ahead for 7 locations in France, given several weather variables. The dataset provides the temperature forecast and the observation at the same locations and times. Your models will use the temperature forecast and other weather variables to improve the temperature forecast by the Météo-France AROME model.
A paper with an example is available here, and a paper with some description of the data here. The datasets are available below for the forecast horizons 15h and 27h with a README file.
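A natural baseline for this kind of forecast correction is a linear regression of the observation on the raw forecast, which removes a systematic bias. The sketch below uses invented numbers (not the project data) and only the raw forecast as input; the other weather variables would be added as further predictors.

```python
def fit_linear_correction(forecast, observed):
    """Least-squares fit of observed ~ intercept + slope * forecast."""
    n = len(forecast)
    mean_f = sum(forecast) / n
    mean_o = sum(observed) / n
    sxx = sum((f - mean_f) ** 2 for f in forecast)
    sxy = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecast, observed))
    slope = sxy / sxx
    intercept = mean_o - slope * mean_f
    return intercept, slope

def rmse(pred, observed):
    """Root mean squared error between predictions and observations."""
    return (sum((p - o) ** 2 for p, o in zip(pred, observed)) / len(pred)) ** 0.5

# Invented temperatures in degrees Celsius: the forecast is 1 degree too cold.
forecast = [10.0, 12.0, 14.0, 16.0]
observed = [11.0, 13.0, 15.0, 17.0]

a, b = fit_linear_correction(forecast, observed)
corrected = [a + b * f for f in forecast]
rmse_raw = rmse(forecast, observed)
rmse_corrected = rmse(corrected, observed)
print(a, b, rmse_raw, rmse_corrected)
```

Here the error is computed on the same data used for fitting, which is optimistic; in the project the correction should be evaluated on held-out dates.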

DATASET no 20 [classification, rare event]: Prediction of ozone pollution in Houston.
Accurate ozone alert forecasting systems are necessary to issue warnings to the public before the ozone reaches a dangerous level. However, little is known on exactly what features are important in ozone production and how they actually interact in the formation of ozone. This provides wonderful opportunities for machine learning.
One of the reference papers is available here and a short description of the original dataset, which contains 72 meteorological variables, here. A version with no missing values is available below (Ozone_imputed.csv).
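Because ozone days are rare, accuracy alone is misleading: a model that never raises an alert can look very good. The sketch below illustrates this with invented labels (100 days, 5 of them ozone days), comparing accuracy with precision and recall.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 = ozone day."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    """Precision and recall of the positive class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented example: 5 ozone days out of 100, and a model that never alerts.
y_true = [1] * 5 + [0] * 95
never_alert = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, never_alert)) / len(y_true)
precision, recall = precision_recall(y_true, never_alert)
print(accuracy, precision, recall)
```

The useless model reaches 95% accuracy while detecting no ozone day at all, which is why rare-event problems are usually assessed with recall, precision or related measures.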

DATASET no 21 [classification]:  Lymphoma
The lymphoma dataset (Shipp et al., 2002) consists of 7129 gene expression levels from 77 lymphomas. The 77 samples are divided into 58 diffuse large B-cell lymphomas (DLBCL) and 19 follicular lymphomas (FL). The data can be found at https://github.com/ramhiser/datamicroarray/blob/master/data/shipp.RData and the reference paper here.

DATASET no 18 [classification]:  Colon cancer tumor
The colon dataset is from the microarray experiment of colon tissue samples of Alon et al. (1999). It contains the expression level of 2000 genes for 40 tumors and 22 normal colon tissues. The data can be freely downloaded from http://microarray.princeton.edu/oncology/affydata/index.html and the reference paper here.

DATASET no 22 [regression]: Communities and Crime in the US.
The data combine socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Admin Stats survey, and crime data from the 1995 FBI UCR. They can be used for various regression tasks, for instance predicting the number of murders in 1995, the number of rapes, or the total number of non-violent crimes per 100K population, and for extracting the most relevant features.
A description of the data can be found here and the data can be downloaded here.
The dataset includes missing data.
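A simple way to handle missing values before fitting a model is median imputation, sketched below on one invented column. The "?" marker for missing entries is an assumption made for this example; check how missing values are actually coded in the file.

```python
from statistics import median

def impute_median(column, missing="?"):
    """Replace missing entries by the median of the observed values."""
    observed = [float(v) for v in column if v != missing]
    fill = median(observed)
    return [fill if v == missing else float(v) for v in column]

# Invented column read as strings, with one missing entry.
raw = ["0.2", "?", "0.4", "0.9"]
clean = impute_median(raw)
print(clean)
```

The median should be computed on the training set only and reused on the test set; variables with a very large share of missing values may be better dropped than imputed.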