Valérie Monbet

teaching/enseignement

Machine Learning

Projects/datasets examples and ideas

BIOLOGY/HEALTH SCIENCES

Gut microbiota and Inflammatory Bowel Disease [classification]
Despite the availability of various diagnostic tests for inflammatory bowel diseases (IBD), misdiagnosis of IBD occurs frequently, and thus, there is a clinical need to further improve the diagnosis of IBD. As gut dysbiosis is reported in patients with IBD, we hypothesized that supervised machine learning (ML) could be used to analyze gut microbiome data for predictive diagnostics of IBD.
Data are available here, species counts are available here and an associated paper here.

Chinese gut microbiota [classification, regression]

Some studies conclude that gut microbiota does not change with age. The data set provides the gut microbiota of about 1000 healthy Chinese individuals with some covariates: sex, age, height, weight, diet, etc. The goal of this project is to propose a complete analysis of these data. Some questions you can try to answer: does the microbiota differ with age? With BMI? With diet? If so, which parts of the microbiota are "differentially expressed"?

Data are available here and the associated paper here.

DNA sequences classification [classification, deep learning]
In recent years, a deep learning model called the convolutional neural network, which can extract high-level abstract features from minimally preprocessed data, has been widely used. In this project, we propose to use a deep learning algorithm to classify DNA sequences with a convolutional neural network, treating these sequences as text data. One-hot vectors will be used to represent the sequences as input to the model; this encoding preserves the essential position information of each nucleotide in the sequences.
Data can be downloaded here; a paper describing some ideas is available here.
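To make the one-hot representation concrete, here is a minimal sketch in pure Python. The alphabet order (A, C, G, T), the padding symbol "N" and the all-zero encoding of unknown bases are assumptions made for illustration; the referenced paper may use a different convention.

```python
# Illustrative one-hot encoding of a DNA sequence (not the paper's exact code).
ALPHABET = "ACGT"  # assumed column order


def one_hot_encode(sequence, length=None):
    """Return one 4-element row per position of the sequence."""
    seq = sequence.upper()
    if length is not None:
        # Truncate or pad with "N" so that all sequences share the same length.
        seq = seq[:length].ljust(length, "N")
    rows = []
    for base in seq:
        row = [0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:  # unknown symbols such as "N" stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("ACGTN", length=6)
for row in encoded:
    print(row)
```

Each row keeps the position of its nucleotide, so the resulting (length x 4) table can be fed to a 1D convolutional layer after conversion to an array.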

Toxicity of drugs [regression/classification]

Cholestatic liver injury is frequently associated with drug inhibition of bile salt transporters, such as the bile salt export pump (BSEP). Reliable in silico models to predict BSEP inhibition directly from chemical structures would significantly reduce costs during drug discovery and could help avoid injury to patients.

Here, we propose to develop a machine learning algorithm to predict inhibition of the bile salt export pump from properties of compounds.
A paper with examples can be found here. Data can be downloaded here in the SMILES format. Preprocessing is required.

Purity of tumor [regression]

With the advent of array-based techniques to measure methylation levels in primary tumor samples, systematic investigations of methylomes have widely been performed on a large number of tumor entities. Most of these approaches are not based on measuring individual cell methylation but rather the bulk tumor sample DNA, which contains a mixture of tumor cells, infiltrating immune cells and other stromal components. This raises questions about the purity of a certain tumor sample. The question is then to propose machine learning algorithms to predict the purity.

A paper describes the data, the problem and a solution based on random forest here. Data can be downloaded here (see also an R code).

Mesotheliomas disease [classification]

Malignant mesotheliomas (MM) are very aggressive tumors of the pleura. These tumors are linked to asbestos exposure; however, they may also be related to previous simian virus 40 (SV40) infection and possibly to genetic predisposition.
Molecular mechanisms can also be implicated in the development of mesothelioma.
The question here is to propose machine learning algorithms to predict the diagnosis.

Data and their description are available here and the associated paper here.

Clinical diagnosis from chest X-ray images [classification, deep learning, transfer learning]

The implementation of clinical-decision support algorithms for medical imaging faces challenges with reliability and interpretability. Here, you have to establish a diagnostic tool based on a deep-learning framework for diagnosis of pediatric pneumonia using chest X-ray images. This tool may ultimately aid in expediting the diagnosis and referral of these treatable conditions, thereby facilitating earlier treatment, resulting in improved clinical outcomes.

A paper proposing a methodology is available here. The dataset is available on Kaggle here.

Blood cell images classification [classification, deep learning]
The diagnosis of blood-based diseases often involves identifying and characterizing patient blood samples. Automated methods to detect and classify blood cell subtypes have important medical applications.
The dataset contains 12,500 augmented images of blood cells (JPEG) with accompanying cell type labels (CSV). There are approximately 3,000 images for each of 4 different cell types. The cell types are Eosinophil, Lymphocyte, Monocyte, and Neutrophil. This dataset is accompanied by an additional dataset containing the original 410 images (pre-augmentation).
Data can be retrieved and downloaded from the Kaggle website here. More details and some ideas to solve the classification problem can be read here.

Music and mental health [classification, regression]
Music therapy, or MT, is the use of music to improve an individual's stress, mood, and overall mental health. MT is also recognized as an evidence-based practice, using music as a catalyst for "happy" hormones such as oxytocin. However, MT employs a wide range of different genres, varying from one organization to the next. The MxMH dataset aims to identify what, if any, correlations exist between an individual's music taste and their self-reported mental health. Ideally, these findings could contribute to a more informed application of MT or simply provide interesting insights into the mind.
Data are available here.

Influenza outbreak event prediction via Twitter [classification]
By identifying influenza-related tweets, the goal is to forecast the spatiotemporal patterns of influenza outbreaks for different locations and dates.
The data are from the United States and cover different states and different weeks. For each week, the task is to predict whether or not there is an influenza outbreak at the next date. More specifically, influenza activity has four levels, from minimal to high, according to the CDC Flu Activity Map. An influenza outbreak is indicated if the activity level is high.
Data and paper can be downloaded here.
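The target definition above can be sketched as follows. This is only an illustration: the weekly levels are made-up values, the intermediate level names ("low", "moderate") are assumptions, and "next date" is taken to mean the following week.

```python
# Hypothetical weekly activity levels for one state (made-up values).
weekly_levels = ["minimal", "low", "high", "moderate", "high", "high"]


def next_week_outbreak_labels(levels):
    """Return, for each week but the last, 1 if the NEXT week is an outbreak."""
    # An outbreak is indicated when the activity level is "high".
    outbreak = [1 if level == "high" else 0 for level in levels]
    # Features of week t are paired with the outbreak flag of week t + 1;
    # the last week has no known target and is dropped.
    return outbreak[1:]


labels = next_week_outbreak_labels(weekly_levels)
print(labels)
```

The features built from the tweets of week t would then be aligned with `labels[t]`, so that no information from the predicted week leaks into the inputs.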

Down syndrome mice exposed to context fear conditioning [clustering, classification]
Down syndrome (DS) is a chromosomal abnormality (trisomy of human chromosome 21) associated with intellectual disability and affecting approximately one in 1000 live births worldwide. The overexpression of genes encoded by the extra copy of a normal chromosome in DS is believed to be sufficient to perturb normal pathways and normal responses to stimulation, causing learning and memory deficits. The questions are to detect/explain the treatment and genotype differences.

This dataset has been used in at least two papers: Higuera et al. (2015) for clustering and classification tasks, and Ahmed et al. (2015) for extraction. Data are available here with a short description here.

(**) HIV mutation and resistance to drugs [regression]

Antiretroviral drugs are very effective therapies against HIV infection. However, the high mutation rate of HIV permits the emergence of variants that can be resistant to drug treatment. Predicting the drug resistance of previously unobserved variants is therefore very important for optimum medical treatment. The drug susceptibility data comprise 25,434 PI, 19,858 NRTI, 11,546 NNRTI and 4,606 INI susceptibility results from HIV-1 virus isolates. The question here is to predict the resistance given the virus sequences (for one chosen dataset).
References to past uses of the data can be found here (and references therein). The data, their description and an R code to preprocess the data are available here.
In this project, kernel representations are used. These concepts are discussed in the last part of the course, but help will be provided if you choose to work on this subject.
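As a rough idea of what a kernel representation of sequences can look like, the sketch below computes a k-mer spectrum kernel, one common kernel for string data. It is not necessarily the kernel presented in the course or used in the referenced work, and the sequences are made-up toy fragments.

```python
from collections import Counter


def kmer_counts(sequence, k=2):
    """Count all overlapping substrings of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=2):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


# Made-up toy fragments, for illustration only.
sequences = ["PQITLW", "PQVTLW", "PQITLW"]
gram = [[spectrum_kernel(a, b) for b in sequences] for a in sequences]
for row in gram:
    print(row)
```

The resulting Gram matrix is symmetric and can be passed to any kernel method that accepts a precomputed kernel, such as kernel ridge regression or a support vector machine.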

Bone marrow transplantation [classification]
The aim of the original study was to compare the results of unrelated donor (UD) peripheral blood stem cell transplantation versus UD bone marrow transplantation and to analyze the impact of infused CD34⁺ and CD3⁺ cell doses on survival and incidence of severe graft-versus-host disease (GVHD) in 187 children who underwent UD hematopoietic cell transplantation with the use of in vivo T cell depletion (antithymocyte globulin or CAMPATH-1H).
More details and the dataset are available here.

Lymphoma [classification]
The lymphoma dataset (Shipp et al., 2002) consists of 7129 gene expression levels from 77 lymphomas. The 77 samples are divided into 58 diffuse large B-cell lymphomas (DLBCL) and 19 follicular lymphomas (FL). The data can be found at https://github.com/ramhiser/datamicroarray/blob/master/data/shipp.RData and the reference paper here.

Colon cancer tumor [classification]
The colon dataset is from the microarray experiment of colon tissue samples of Alon et al. (1999). It contains the expression level of 2000 genes for 40 tumors and 22 normal colon tissues. The data can be freely downloaded from http://microarray.princeton.edu/oncology/affydata/index.html and the reference paper here.

ECOLOGY/ENVIRONMENT

Plant species richness in Tibetan alpine grasslands [regression]
Species richness is the core of biodiversity-ecosystem functioning (BEF) research. Nevertheless, it is difficult to accurately predict changes in plant species richness under different climate scenarios, especially in alpine biomes. In this project we propose to identify the most critical driver of species richness distribution.
A paper describing an analysis of the data set is available here. The dataset is available here.

Forest CoverType [classification]
Accurate natural resource inventory information is vital to any private, state, or federal land management agency. Forest cover type is one of the most basic characteristics recorded in such inventories. Generally, cover type data is either directly recorded by field personnel or estimated from remotely sensed data. Both of these techniques may be prohibitively time consuming and/or costly in some situations. Furthermore, an agency may find it useful to have inventory information for adjoining lands that are not directly under its control, where it is often economically or legally impossible to collect inventory data. Predictive models provide an alternative method for obtaining such data. The question is then to propose machine learning algorithms to predict the cover type from cartographic variables only (no remotely sensed data).

A paper describes the data, the problem and a solution based on random forest here. Data can be downloaded here.

Impact of Bedrock Fractures on River Erosion [classification, deep learning]
In bedrock-dominated upland terrains, local heterogeneity in the erodibility of rock masses is a critical but under-explored factor constraining sediment erosion, mobilisation and transport. In this project, we propose to analyze original data from laboratory experiments. The experiment is detailed here. It results in a large number of images showing areas of plucking. The objective is to propose a deep learning algorithm to automatically detect the plucking areas. The data are available from me on request.

Clouds type detection [classification, deep learning]
Clouds are a major challenge for passive satellite imaging, and daily cloud cover and rain showers in the Amazon basin can significantly complicate monitoring in the area. For this reason, we have chosen to include a cloud cover label for each chip. These labels closely mirror what one would see in a local weather forecast: clear, partly cloudy, cloudy, and haze. For our purposes, haze is defined as any chip where atmospheric clouds are visible but they are not so opaque as to obscure the ground. Clear scenes show no evidence of clouds, and partly cloudy scenes can show opaque cloud cover over any portion of the image. Cloudy images have 90% of the chip obscured by opaque cloud cover. The dataset to be used gathers satellite images of the Amazon and labels (clear, partly cloudy, cloudy, and haze); it can be downloaded from the Kaggle website here. This talk describes an analysis of this dataset.

♣: If you decide to work on this project, the dataset is available on Google Drive (link).

Example code to get started is available under this link. Please save your own copy so that the original code is not modified.

Classification of Butterfly Species [classification, deep learning, transfer learning]
Butterflies are abundant on Earth, and identifying butterfly species is a complex task. How to apply image processing methods to the automatic identification of butterfly species is an active topic in current research.

In this project, the problem of automatic detection and classification of butterfly species from butterfly photographs is studied.
Data can be retrieved and downloaded from the Kaggle website here. A paper with ideas and algorithms is available here.

Birdcall identification [classification, deep learning, sound preprocessing]
Do you hear the birds chirping outside your window? Over 10,000 bird species occur in the world, and they can be found in nearly every environment, from untouched rainforests to suburbs and even cities. Birds play an essential role in nature. They are high up in the food chain and integrate changes occurring at lower levels. As such, birds are excellent indicators of deteriorating habitat quality and environmental pollution. However, it is often easier to hear birds than see them. With proper sound detection and classification, researchers could automatically intuit factors about an area’s quality of life based on a changing bird population. The dataset can be downloaded from the Kaggle website here.
The challenge is to build machine learning algorithm(s) to predict the bird species from audio records. Deep learning is only one of the possible solutions.
You will find lectures about deep learning for audio here.

Amphibians [classification]

The dataset defines a multilabel classification problem. The goal is to predict the presence of amphibian species near water reservoirs based on features obtained from GIS systems and satellite images.
The Road A project concerned part of the planned A1 motorway section in Pyrzowice; the section is located along the northern border of the Silesian Voivodship and is about 75 km long. The field research involved a strip of land with a width of 500 m on both sides of the proposed project area. In the end, the first project included 80 amphibian breeding sites. Data are also available for a second road.

Data and paper can be downloaded here.

Predicting CO2 Emissions [regression]
The ability to accurately monitor carbon emissions is a critical step in the fight against climate change. Precise carbon readings allow researchers and governments to understand the sources and patterns of carbon mass output. While Europe and North America have extensive systems in place to monitor carbon emissions on the ground, there are few available in Africa.
The objective is to create machine learning models using open-source CO2 emissions data from Sentinel-5P satellite observations to predict future carbon emissions.
These solutions may help governments and other actors to estimate carbon emission levels across Africa, even in places where on-the-ground monitoring is not possible.
Material is available here.

Pollens emission [regression and/or classification]
Air pollution in large cities produces numerous diseases and even millions of deaths annually according to the World Health Organization. Pollen exposure is related to allergic diseases, which makes its prediction a valuable tool to assess the risk level to aeroallergens. However, airborne pollen concentrations are difficult to predict due to the inherent complexity of the relationships among both biotic and environmental variables.
In this project, you will build machine learning models to predict the presence/absence (or concentration) of pollen in the air given meteorological data. This paper gives a first overview of the meteorological variables that can be useful for predicting pollen emission. However, you may build or find other variables to improve the models. Please ask the teacher for the data.
Weather data come from the ECA dataset and are recorded at Luxembourg airport.

Correction of temperature forecast in France [regression]

The challenge consists in improving the hourly 2 m temperature forecasts at 15 h and 27 h lead times for 7 locations in France, given several weather variables. The dataset provides the temperature forecasts and the observations at the same locations and times. Your models will use the temperature forecast and other weather variables to improve the temperature predicted by the Météo-France AROME model.
A paper with an example is available here, and a paper describing the data here. The datasets are available below for the forecast horizons 15h and 27h, with a README file.
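As a possible starting baseline, the sketch below fits a simple linear correction of the observed temperature on the raw forecast by least squares. The numbers are made-up toy values, not taken from the dataset, and the function names are hypothetical; real models would also use the other weather variables and be evaluated on held-out data.

```python
def fit_linear_correction(forecasts, observations):
    """Least-squares fit of: observation ~ intercept + slope * forecast."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    mean_o = sum(observations) / n
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecasts, observations))
    var = sum((f - mean_f) ** 2 for f in forecasts)
    slope = cov / var
    intercept = mean_o - slope * mean_f
    return intercept, slope


def rmse(predictions, observations):
    """Root mean squared error between two equally long lists."""
    n = len(predictions)
    return (sum((p - o) ** 2 for p, o in zip(predictions, observations)) / n) ** 0.5


# Made-up toy data: the raw forecast is systematically 1 degree too cold.
raw_forecast = [10.0, 12.0, 15.0, 18.0, 20.0]
observed = [11.0, 13.0, 16.0, 19.0, 21.0]

intercept, slope = fit_linear_correction(raw_forecast, observed)
corrected = [intercept + slope * f for f in raw_forecast]
print(rmse(raw_forecast, observed), rmse(corrected, observed))
```

Comparing the error of the corrected forecast with that of the raw forecast gives a reference score that more elaborate models should beat.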

Prediction of ozone pollution in Houston [classification, rare event]
Accurate ozone alert forecasting systems are necessary to issue warnings to the public before the ozone reaches a dangerous level. However, little is known on exactly what features are important in ozone production and how they actually interact in the formation of ozone. This provides wonderful opportunities for machine learning.
One of the reference papers is available here, and a short description of the original dataset, which contains 72 meteorological variables, here. A version with no missing values is available below (Ozone_imputed.csv).
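Because ozone alerts are rare events, accuracy alone can be misleading. The sketch below uses made-up labels (5 alert days out of 100, an assumed proportion chosen only for illustration) to show that a classifier that never raises an alert reaches high accuracy while detecting nothing.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 95 normal days, 5 ozone days.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * len(y_true)  # trivial classifier: never predicts an alert

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall = precision_recall(y_true, always_negative)
print(accuracy, precision, recall)
```

Metrics such as recall, precision or the F1 score on the alert class are therefore more informative than accuracy when comparing models on this problem.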

OTHERS

Communities and Crime in the US [regression]

The data combine socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Admin Stats survey, and crime data from the 1995 FBI UCR. They can be used for various regression tasks, for instance predicting the number of murders in 1995, the number of rapes, or the total number of non-violent crimes per 100K population, and for trying to extract the most relevant features.
A description of the data can be found here and the data can be downloaded here.
The dataset includes missing data.