Machine Learning
See also: Course material, LABS
Projects/datasets examples and ideas
BIOLOGY/HEALTH SCIENCES
Gut microbiota and Inflammatory Bowel Disease [classification]
Despite the availability of various diagnostic tests for
inflammatory bowel diseases (IBD), misdiagnosis of IBD occurs
frequently, and thus, there is a clinical need to further
improve the diagnosis of IBD. As gut dysbiosis is reported in
patients with IBD, we hypothesized that supervised machine
learning (ML) could be used to analyze gut microbiome data for
predictive diagnostics of IBD.
Data are available here, species counts are available here,
and an associated paper is available here.
Chinese gut microbiota [classification, regression]
Some studies conclude that gut microbiota does not change
with age. The data set provides the gut microbiota of about 1000
healthy Chinese individuals with some covariates: sex, age,
height, weight, diet, etc. The goal of this project is to propose
a complete analysis of these data. Some questions you can try
to answer are: does the microbiota differ with age? With
BMI? With diet? If yes, which parts of the microbiota are
"differentially expressed"?
Data are available
here and the
associated paper
here.
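One of the questions above involves BMI, which can be derived from the height and weight covariates if it is not already provided. A minimal sketch, assuming weight is recorded in kilograms and height in centimetres (the units and the function name `bmi` are assumptions, not taken from the dataset):

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight (kg) divided by squared height (m)."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0  # assumed unit conversion: cm -> m
    return weight_kg / height_m ** 2


# Hypothetical individual, for illustration only.
example = bmi(weight_kg=70.0, height_cm=175.0)
```

Check the units in the data description before using such a derived variable as a covariate or as a regression target.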
DNA sequences classification [classification, deep learning]
In recent years, a deep learning model called the convolutional
neural network, which can extract features of high-level
abstraction from minimally preprocessed data, has been widely
used. In this project, we propose to use a deep learning
algorithm for classifying DNA sequences using a convolutional
neural network while treating these sequences as text data.
One-hot vectors will be used to represent sequences as input to
the model; this representation conserves the essential position
information of each nucleotide in the sequences.
Data can be downloaded here; a paper describing some ideas is
available here.
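The one-hot representation mentioned above can be sketched as follows. This is only an illustration: the alphabet order `ACGT`, the function name `one_hot_encode`, and the choice to map unknown symbols (such as `N`) to an all-zero row are assumptions, not taken from the paper or the dataset.

```python
from typing import List

NUCLEOTIDES = "ACGT"  # assumed column order


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Return one row per position, with a single 1 in the nucleotide's column."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:  # unknown symbols stay all-zero (assumption)
            row[index[base]] = 1
        encoded.append(row)
    return encoded


matrix = one_hot_encode("ACGT")
```

Each sequence of length L becomes an L x 4 matrix, so the position of every nucleotide is preserved and the matrix can be fed to a 1D convolutional layer.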
Toxicity of drugs [regression/classification]
Cholestatic liver injury is frequently associated with
drug inhibition of bile salt transporters, such as the bile
salt export pump (BSEP). Reliable in silico models
to predict BSEP inhibition directly from chemical structures
would significantly reduce costs during drug discovery and
could help avoid injury to patients.
Here, we propose to develop a machine learning algorithm to
predict inhibition of the bile salt export pump from
properties of compounds.
A paper with examples can be found here. Data can be
downloaded here in the SMILES format.
Preprocessing is required.
Purity of tumor [regression]
With the advent of array-based techniques to measure
methylation levels in primary tumor samples, systematic
investigations of methylomes have widely been performed on a
large number of tumor entities. Most of these approaches are
not based on measuring individual cell methylation but
rather the bulk tumor sample DNA, which contains a mixture
of tumor cells, infiltrating immune cells and other stromal
components. This raises questions about the purity of a
certain tumor sample. The question is then to propose
machine learning algorithms to predict the purity.
A paper describes the data, the problem and a solution
based on random forest here.
Data can be downloaded here (see also an R code).
Mesotheliomas disease [classification]
Malignant mesotheliomas (MM) are very aggressive tumors
of the pleura. These tumors are linked to asbestos
exposure; however, they may also be related to previous
simian virus 40 (SV40) infection and possibly to genetic
predisposition.
Molecular mechanisms can also be implicated in the
development of mesothelioma.
The question here is to propose machine learning
algorithms to predict the diagnosis.
Data and their description are available
here and the associated paper
here.
Clinical diagnosis from chest X-ray images [classification,
deep learning, transfer learning]
The implementation of clinical-decision support algorithms
for medical imaging faces challenges with reliability and
interpretability. Here, you have to establish a diagnostic
tool based on a deep-learning framework for
diagnosis of pediatric pneumonia using chest X-ray
images. This tool may ultimately aid in expediting the
diagnosis and referral of these treatable conditions,
thereby facilitating earlier treatment, resulting in
improved clinical outcomes.
A paper proposing a methodology is available here. The dataset
is available on Kaggle here.
Blood cell images classification [classification, deep learning]
The diagnosis of blood-based diseases often involves
identifying and characterizing patient blood samples.
Automated methods to detect and classify blood cell subtypes
have important medical applications.
The dataset contains 12,500 augmented images of blood cells
(JPEG) with accompanying cell type labels (CSV). There are
approximately 3,000 images for each of 4 different cell
types. The cell types are Eosinophil, Lymphocyte, Monocyte,
and Neutrophil. This dataset is accompanied by an additional
dataset containing the original 410 images
(pre-augmentation).
Data can be retrieved and downloaded from the Kaggle website
here. More details and some ideas to
solve the classification problem can be read
here.
Music and mental health [classification, regression]
Music therapy, or MT, is the use of music to improve an
individual's stress, mood, and overall mental health. MT is
also recognized as an evidence-based practice, using music
as a catalyst for "happy" hormones such as oxytocin.
However, MT employs a wide range of different genres,
varying from one organization to the next. The MxMH
dataset aims to identify what, if any, correlations exist
between an individual's music taste and their self-reported
mental health. Ideally, these findings could contribute to a
more informed application of MT or simply provide
interesting insights about the mind.
Data are available
here.
Influenza outbreak event prediction via Twitter [classification]
By identifying influenza-related tweets, the goal is to
forecast the spatiotemporal patterns of influenza outbreaks
for different locations and dates.
The data are from the United States and cover different
states over different weeks. For each week, the task is to
predict whether or not there is an influenza outbreak on the
next date. More specifically, influenza activity has four
levels, from minimal to high, according to the CDC Flu
Activity Map. An influenza outbreak is indicated if the
activity level is high.
Data and paper can be downloaded here.
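The labelling rule described above (an outbreak when the activity level is high, predicted one step ahead) can be sketched as follows. The intermediate level names `low` and `moderate`, the function name, and the example values are assumptions for illustration; only `minimal` and `high` are named in the description.

```python
from typing import List, Tuple


def make_next_week_pairs(levels: List[str]) -> List[Tuple[int, int]]:
    """Pair each week index t with the outbreak label (0/1) of week t + 1."""
    # An outbreak is indicated only when the activity level is "high".
    outbreak = [1 if level == "high" else 0 for level in levels]
    # The last week has no following week, so it yields no training pair.
    return [(t, outbreak[t + 1]) for t in range(len(levels) - 1)]


# Hypothetical activity levels for one state over five consecutive weeks.
weekly_levels = ["minimal", "low", "high", "moderate", "high"]
pairs = make_next_week_pairs(weekly_levels)
```

In practice, the features computed from the tweets of week t would replace the bare index t, and the pairs would be built separately for each state.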
Down syndrome mice exposed to context fear conditioning
[clustering, classification]
Down syndrome (DS) is a chromosomal abnormality (trisomy
of human chromosome 21) associated with intellectual
disability and affecting approximately one in 1000 live
births worldwide. The overexpression of genes encoded by
the extra copy of a normal chromosome in DS is believed to
be sufficient to perturb normal pathways and normal
responses to stimulation, causing learning and memory
deficits. The questions are to detect/explain the
treatment and genotype differences.
This dataset has been used in at least two papers:
Higuera et al. (2015), for clustering tasks and for
classification, and Ahmed et al. (2015), for extraction.
Data are available here with a short description here.
(**) HIV mutation and resistance to drugs [regression]
Antiretroviral drugs are very effective therapies
against HIV infection. However, the high mutation rate of
HIV permits the emergence of variants that can be
resistant to drug treatment. Predicting drug resistance to
previously unobserved variants is therefore very important
for optimum medical treatment.
The drug susceptibility data comprise 25,434 PI, 19,858
NRTI, 11,546 NNRTI and 4,606 INI susceptibility results
from HIV-1 virus isolates. The question here is to predict
the resistance given the virus sequences (for one chosen
dataset).
References to past uses of the data can be found
here (and references therein). The
data, their description and an R code to preprocess the
data are available
here.
In this project, kernel representations are used. These
concepts are only discussed in the last part of the course,
but help will be provided if you choose to work on this
subject.
Bone marrow transplantation [classification]
The aim of our study was to compare the results of unrelated
donor (UD) peripheral blood stem cell transplantation versus
UD bone marrow transplantation and to analyze the impact of
infused CD34+ and CD3+ cell doses on survival and incidence
of severe graft-versus-host disease (GVHD) in 187 children
who underwent UD hematopoietic cell transplantation with the
use of in vivo T cell depletion (antithymocyte globulin or
CAMPATH-1H).
More details and the dataset are available
here.
Lymphoma [classification]
The lymphoma dataset (Shipp et al., 2002) consists of 7129
gene expression levels from 77 lymphomas. The 77 samples
are divided into 58 diffuse large B-cell lymphomas (DLBCL)
and 19 follicular lymphomas (FL). The data can be found
at https://github.com/ramhiser/datamicroarray/blob/master/data/shipp.RData and
the reference paper here.
Colon cancer tumor [classification]
The colon dataset is from the microarray experiment of
colon tissue samples of Alon et al. (1999). It contains
the expression level of 2000 genes for 40 tumors and 22
normal colon tissues. The data can be freely downloaded
from http://microarray.princeton.edu/oncology/affydata/index.html
and the reference paper here.
ECOLOGY/ENVIRONMENT
Plant species richness in Tibetan alpine grasslands [regression]
Species richness is the core of biodiversity-ecosystem
functioning (BEF) research. Nevertheless, it is difficult to
accurately predict changes in
plant species richness under
different climate scenarios, especially in alpine biomes. In
this project we propose to identify the most critical driver
of species richness distribution.
A paper describing an analysis of the data set is available
here. The dataset is available
here.
Forest CoverType [classification]
Accurate natural resource inventory information is vital
to any private, state, or federal land management agency.
Forest cover type is one of the most basic characteristics
recorded in such inventories. Generally, cover type data
is either directly recorded by field personnel or
estimated from remotely sensed data. Both of these
techniques may be prohibitively time consuming and/or
costly in some situations. Furthermore, an agency may find
it useful to have inventory information for adjoining
lands that are not directly under its control, where it is
often economically or legally impossible to collect
inventory data. Predictive models provide an alternative
method for obtaining such data. The question is then to
propose machine learning algorithms to predict the cover
type from cartographic variables only (no remotely sensed
data).
A paper describes the data, the problem and a solution
based on random forest
here.
Data can be downloaded
here.
Impact of Bedrock Fractures on River Erosion
[classification, deep learning]
In bedrock-dominated upland terrains,
local heterogeneity in the erodibility of rock
masses is a critical but under-explored factor
constraining sediment erosion, mobilisation and
transport. In this project, we propose to analyze
original data from laboratory experiments. The
experiment is detailed here.
It results in a large number of images showing
areas of plucking. The objective is to propose a
deep learning algorithm to automatically detect
the plucking areas. Please ask me for the data.
Clouds type detection [classification, deep learning]
Clouds are a major challenge for passive satellite
imaging, and daily cloud cover and rain showers in the
Amazon basin can significantly complicate monitoring in
the area. For this reason, we have chosen to include a
cloud cover label for each chip. These labels closely
mirror what one would see in a local weather forecast:
clear, partly cloudy, cloudy, and haze. For our purposes,
haze is defined as any chip where atmospheric clouds are
visible but they are not so opaque as to obscure the
ground. Clear scenes show no evidence of clouds, and
partly cloudy scenes can show opaque cloud cover over any
portion of the image. Cloudy images have 90% of the chip
obscured by opaque cloud cover. The dataset to be used
gathers satellite images of the Amazon and labels (clear,
partly cloudy, cloudy, and haze); it can be downloaded
from the Kaggle website here.
This talk describes an analysis of this dataset.
♣: if you decide to work on this project, the dataset is
available in a Google Drive (link).
An example of starter code is available under this link.
Please save your own copy so that the original code is not
modified.
Classification of Butterfly Species [classification, deep
learning, transfer learning]
Butterflies are abundant species on the earth, and
the task of identifying butterflies is complex.
How to apply image processing methods to the automatic
identification of butterfly species is a hot issue in
current research.
In this project, the problem of automatic detection and
classification of butterfly species using butterfly
photographs is studied.
Data can be retrieved and downloaded from the Kaggle
website
here. A paper with ideas and
algorithms is available
here.
Birdcall identification
[classification, deep learning, sound preprocessing]
Do you hear the birds chirping outside your window? Over
10,000 bird species occur in the world, and they can be
found in nearly every environment, from untouched
rainforests to suburbs and even cities. Birds play an
essential role in nature. They are high up in the food
chain and integrate changes occurring at lower levels. As
such, birds are excellent indicators of deteriorating
habitat quality and environmental pollution. However, it
is often easier to hear birds than see them. With proper
sound detection and classification, researchers could
automatically intuit factors about an area's quality of
life based on a changing bird population. The dataset can
be downloaded from the Kaggle website here.
The challenge is to build machine learning
algorithm(s) to predict the bird species from audio
records. Deep learning is only one of the possible
solutions.
You will find lectures about deep learning for audio here.
Amphibians [classification]
The dataset poses a multilabel classification problem. The
goal is to predict the presence of amphibian species near
the water reservoirs based on features obtained from GIS
systems and satellite images.
The Road A project concerned part of the planned A1 motorway
section in Pyrzowice; the section is located along the
northern border of the Silesian Voivodship and is about 75
km long. The field research involved a strip of land with
a width of 500 m on both sides of the proposed project
area. Finally, the first project included 80
amphibian breeding sites. Data are also available for a
second road.
Data and paper can be downloaded here.
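In a multilabel problem such as this one, each site can host several species at once, so the target is usually stored as one binary indicator column per species. A minimal sketch of that encoding follows; the species names, the function name, and the example sites are hypothetical and are not taken from the dataset.

```python
from typing import List, Sequence, Set


def to_indicator_matrix(
    observed: Sequence[Set[str]], species: Sequence[str]
) -> List[List[int]]:
    """One row per site, one 0/1 column per species (1 = species present)."""
    return [[1 if name in site else 0 for name in species] for site in observed]


# Hypothetical species list and observations, for illustration only.
species = ["green_frog", "common_toad", "tree_frog"]
sites = [{"green_frog", "tree_frog"}, set(), {"common_toad"}]
Y = to_indicator_matrix(sites, species)
```

A simple baseline is then to fit one binary classifier per column, which ignores possible dependence between species but is easy to evaluate.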
Predicting CO2 Emissions [regression]
The ability to accurately monitor
carbon emissions is a critical step in the fight against
climate change. Precise carbon readings allow
researchers and governments to understand the sources
and patterns of carbon mass output. While Europe and
North America have extensive systems in place to monitor
carbon emissions on the ground, there are few available
in Africa.
The objective is to create machine learning
models using open-source CO2 emissions data from Sentinel-5P
satellite observations to predict future carbon
emissions.
These solutions may help governments and other
actors to estimate carbon emission levels across Africa,
even in places where on-the-ground monitoring is not
possible.
Material is available here.
Pollens emission [regression and/or classification]
Air pollution in large cities produces numerous diseases and
even millions of deaths annually according to the World
Health Organization. Pollen exposure is related to allergic
diseases, which makes its prediction a valuable tool to
assess the risk level to aeroallergens. However, airborne
pollen concentrations are difficult to predict due to the
inherent complexity of the relationships among both biotic
and environmental variables.
In this project, you will aim at building machine learning
models for the prediction of presence/absence (or
concentration) of pollens in the air given meteorological
data. This paper can give a first overview of the
meteorological variables that can be useful for pollen
emission prediction. However, you may build/find other
variables to improve the models. Please ask the teacher to
get the data.
Weather data come from the ECA dataset and are recorded at
Luxembourg airport.
Correction of temperature forecast in France [regression]
The challenge consists in improving the hourly 2 m temperature
forecasts 15h and 27h ahead for 7 locations in France given
several weather variables. The dataset provides the
temperature forecast and the observation at the same locations
and same time. Your models will use the temperature forecast
and other weather variables to improve the temperature
forecast by the Météo-France AROME model.
A paper with an example is available here
and a paper with some description of the data here.
The datasets are available below for the forecast horizons 15h
and 27h with a README file.
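As a possible starting baseline for this correction task, the observation can be regressed on the raw forecast with ordinary least squares, which removes a constant bias and a scale error. The sketch below uses made-up numbers and a hypothetical function name; it uses the forecast alone and ignores the other weather variables, so it is only a reference point, not the method of the papers cited above.

```python
from typing import Sequence, Tuple


def fit_linear_correction(
    forecasts: Sequence[float], observations: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of observation = intercept + slope * forecast."""
    n = len(forecasts)
    if n != len(observations) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_f = sum(forecasts) / n
    mean_o = sum(observations) / n
    var_f = sum((f - mean_f) ** 2 for f in forecasts)
    if var_f == 0:
        raise ValueError("forecasts must not all be identical")
    cov = sum((f - mean_f) * (o - mean_o) for f, o in zip(forecasts, observations))
    slope = cov / var_f
    intercept = mean_o - slope * mean_f
    return intercept, slope


# Made-up forecast/observation pairs (degrees Celsius), for illustration only.
raw = [10.0, 12.0, 14.0, 16.0]
obs = [6.0, 7.0, 8.0, 9.0]
intercept, slope = fit_linear_correction(raw, obs)
corrected = [intercept + slope * f for f in raw]
```

The coefficients should be fitted on training dates only and evaluated on later dates, separately for the 15h and 27h horizons.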
Prediction of ozone pollution in Houston
[classification, rare event]
Accurate ozone alert forecasting systems are necessary to
issue warnings to the public before the ozone reaches a
dangerous level. However, little is known on exactly what
features are important in ozone production and how they
actually interact in the formation of ozone. This provides
wonderful opportunities for machine learning.
One of the reference papers is available here
and a short description of the original dataset, which
contains 72 meteorological variables, is available here.
A version with no missing values is provided below
(Ozone_imputed.csv).
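Because ozone alerts are rare events, a classifier trained without care may simply predict "no alert" every day. One common remedy is to weight each class inversely to its frequency. The sketch below illustrates that weighting; the 95/5 class split and the function name are made up for the example and do not describe the actual dataset.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label vector: 95 normal days (0) and 5 ozone days (1).
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
```

With such imbalance, accuracy is a poor metric; precision, recall, or the area under the precision-recall curve are usually more informative.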
OTHERS
Communities and Crime in the US [regression]
The data combine socio-economic data from the 1990 Census,
law enforcement data from the 1990 Law Enforcement
Management and Admin Stats survey, and crime data from the
1995 FBI UCR. They can be used for various regression tasks,
for instance predicting the number of murders in 1995,
the number of rapes, or the total number of
non-violent crimes per 100K population, and trying to
extract the most relevant features.
A description of the data can be found here and the data
can be downloaded here.
The dataset includes missing data.
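Since the dataset includes missing data, some handling is needed before fitting most regression models. A minimal sketch of mean imputation for one numeric column follows; it assumes that missing entries are encoded as the string `?`, which should be checked against the data description, and the function name and example values are hypothetical.

```python
from typing import List, Sequence


def impute_column_mean(column: Sequence[str], missing: str = "?") -> List[float]:
    """Replace missing entries of a numeric column by the mean of observed ones."""
    observed = [float(value) for value in column if value != missing]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if value == missing else float(value) for value in column]


# Made-up raw column as read from a text file, for illustration only.
filled = impute_column_mean(["1.0", "?", "3.0"])
```

The mean should be computed on the training set only and reused on the test set; columns that are mostly missing may be better dropped than imputed.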