Malignant mesotheliomas (MM) are very aggressive tumors of the pleura. They are linked to asbestos exposure,
but they may also be related to previous simian virus 40 (SV40) infection and possibly to genetic predisposition.
Molecular mechanisms may also be implicated in the development of mesothelioma.
The task here is to propose a machine learning algorithm to predict the diagnosis.
Air pollution in large cities causes numerous diseases and even millions of deaths annually according to the World Health
Organization. Pollen exposure is related to allergic diseases, which makes its prediction a valuable tool for assessing the level of risk from
aeroallergens. However, airborne pollen concentrations are difficult to predict due to the inherent complexity of the relationships
among both biotic and environmental variables.
In this project, you will build machine learning models to predict the presence/absence (or concentration) of pollen in the air from meteorological data.
This
paper gives a first overview of meteorological variables that can be useful for predicting pollen emission. However, you may build or find other variables to improve the models.
The dataset is available
here.
Weather data come from the
ECA dataset and are recorded at Luxembourg airport.
DATASET no 5 [regression (*)]: HIV mutation and resistance to drugs
Antiretroviral drugs are very effective therapies against HIV infection.
However, the high mutation rate of HIV permits the emergence of variants that can be resistant to drug treatment.
Predicting drug resistance to previously unobserved variants is therefore very important for optimum medical treatment.
The drug susceptibility data comprise 25,434 PI, 19,858 NRTI, 11,546 NNRTI and 4,606 INI susceptibility results from HIV-1 virus isolates. The task here is to predict resistance from the virus sequences (for one chosen dataset).
References to past uses of the data can be found
here (and references therein).
The data, their description and R code to preprocess the data are available
here.
This project uses kernel representations. These concepts are discussed in the last part of the course, but help will be provided if you choose to work on this subject.
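To give a rough idea of what a kernel representation of sequences can look like, here is a minimal sketch of a spectrum (k-mer) kernel in plain Python. This is only one common choice of sequence kernel; whether it matches the kernels presented in the course or used in the reference papers is an assumption. The function names and the toy sequences are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every contiguous substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=2):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for k-mers that are absent from seq_b.
    return sum(counts_a[kmer] * counts_b[kmer] for kmer in counts_a)


def gram_matrix(sequences, k=2):
    """Symmetric matrix of pairwise kernel values."""
    return [[spectrum_kernel(a, b, k) for b in sequences] for a in sequences]


# Toy amino-acid strings, made up for illustration only.
toy_sequences = ["PQITLW", "PQVTLW", "PQITLR"]
K = gram_matrix(toy_sequences, k=2)
for row in K:
    print(row)
```

A Gram matrix of this kind could then be passed to any kernel method that accepts a precomputed kernel, with the measured susceptibility as the regression target.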
DATASET no 6 [classification]: Down syndrome mice exposed to context fear conditioning.
Down
syndrome (DS) is a chromosomal abnormality (trisomy of human chromosome
21) associated with intellectual disability and affecting approximately
one in 1000 live births worldwide. The overexpression of genes encoded
by the extra copy of a normal chromosome in DS is believed to be
sufficient to perturb normal pathways and normal responses to
stimulation, causing learning and memory deficits. The questions are to
detect/explain the treatment and genotype differences.
This dataset has been used in at least two papers,
Higuera et al. (2015) for clustering tasks and for classification and Ahmed et al. (2015) extraction. Data are available
here with a short description
here.
DATASET no 7 [classification, deep learning]: Blood cell images classification
The
diagnosis of blood-based diseases often involves identifying and
characterizing patient blood samples. Automated methods to detect and
classify blood cell subtypes have important medical applications.
The
dataset contains 12,500 augmented images of blood cells (JPEG) with
accompanying cell type labels (CSV). There are approximately 3,000
images for each of 4 different cell types. The cell types are
Eosinophil, Lymphocyte, Monocyte, and Neutrophil. This dataset is
accompanied by an additional dataset containing the original 410 images
(pre-augmentation).
Data can be downloaded from the Kaggle website here. More details and some ideas for solving the classification problem can be read here.
♠:
deep learning will be addressed only during the last lectures of this
course and it may require a powerful computer (allowing GPU computation),
although Google Colab can be used instead.
DATASET no 8 [classification, deep learning]: Cloud type detection
Clouds
are a major challenge for passive satellite imaging, and daily cloud
cover and rain showers in the Amazon basin can significantly complicate
monitoring in the area. For this reason we have chosen to include a
cloud cover label for each chip. These labels closely mirror what one
would see in a local weather forecast: clear, partly cloudy, cloudy, and
haze. For our purposes haze is defined as any chip where atmospheric
clouds are visible but they are not so opaque as to obscure the ground.
Clear scenes show no evidence of clouds, and partly cloudy scenes can
show opaque cloud cover over any portion of the image. Cloudy images
have 90% of the chip obscured by opaque cloud cover. The dataset to be
used gathers satellite images of the Amazon and labels (clear, partly cloudy,
cloudy, and haze); it can be downloaded from the Kaggle website here. This talk describes an analysis of this dataset.
♠:
deep learning will be addressed only during the last lectures of this
course and it may require a powerful computer (allowing GPU computation),
although Google Colab can be used instead.
♣: if you decide to work on this project, the dataset is available in a Google Drive (link).
DATASET no 9 [regression]: Correction of temperature forecast in France.
The
challenge consists in improving the hourly 2 m temperature forecasts 15 h
and 27 h ahead for 7 locations in France, given several weather
variables. The dataset provides the temperature forecasts and the
observations at the same locations and times. Your models will use
the temperature forecast and other weather variables to improve the
temperature forecast by the Météo-France AROME model.
A paper with an example is available
here, and a paper describing the data is available
here. The datasets are available below for the forecast horizons 15h and 27h with a README file.
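As a possible starting point, the simplest form of forecast correction is a linear regression of the observed temperature on the forecast temperature. The sketch below fits that baseline by least squares in plain Python. The temperature values are invented for illustration and do not come from the dataset, and the function names are hypothetical. A real model would also use the other weather variables and would be evaluated on held-out data rather than on the training sample.

```python
from math import sqrt


def fit_linear_correction(forecasts, observations):
    """Least-squares fit of observation = intercept + slope * forecast."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    mean_o = sum(observations) / n
    covariance = sum((f - mean_f) * (o - mean_o)
                     for f, o in zip(forecasts, observations))
    variance = sum((f - mean_f) ** 2 for f in forecasts)
    slope = covariance / variance
    intercept = mean_o - slope * mean_f
    return intercept, slope


def rmse(predictions, observations):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predictions, observations)]
    return sqrt(sum(squared) / len(squared))


# Made-up temperatures in degrees Celsius, for illustration only.
toy_forecasts = [10.0, 12.0, 15.0, 18.0, 20.0]
toy_observations = [11.2, 12.9, 16.1, 19.3, 20.8]

intercept, slope = fit_linear_correction(toy_forecasts, toy_observations)
corrected = [intercept + slope * f for f in toy_forecasts]

raw_rmse = rmse(toy_forecasts, toy_observations)
corrected_rmse = rmse(corrected, toy_observations)
print(f"raw RMSE: {raw_rmse:.3f}, corrected RMSE (in-sample): {corrected_rmse:.3f}")
```

On the training sample the corrected error can never exceed the raw error, because the uncorrected forecast is itself a special case of the fitted line (intercept 0, slope 1). The comparison only becomes informative on data the model has not seen.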
DATASET no 10 [classification (rare event)]: Prediction of ozone pollution in Houston.
Accurate
ozone alert forecasting systems are necessary to issue warnings to the
public before the ozone reaches a dangerous level. However, little is
known about exactly which features are important in ozone production and how
they actually interact in the formation of ozone. This provides a
wonderful opportunity for machine learning.
One of the reference papers is here, and a short description of the original dataset, which contains 72 meteorological variables, is here; a version with no missing values is provided below (Ozone_imputed.csv).
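Because the task is tagged as a rare-event classification, plain accuracy can be misleading. The sketch below illustrates this on invented labels: a classifier that never raises an alert scores high accuracy but detects nothing. The class proportions are an assumption chosen for illustration and are not those of the ozone dataset; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(y_true, y_pred):
    """Fraction of true events that were detected."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Fraction of raised alerts that were true events."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Invented example: 5 ozone days out of 100 (not the real proportions).
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100                      # never raises an alert
cautious = [1, 1, 1, 0, 0] + [1] * 10 + [0] * 85  # some hits, some false alarms

for name, y_pred in [("always negative", always_negative), ("cautious", cautious)]:
    print(name, accuracy(y_true, y_pred), recall(y_true, y_pred), precision(y_true, y_pred))
```

In this toy case the "always negative" rule has the higher accuracy but a recall of zero, which is why metrics such as recall and precision are usually reported for rare-event problems.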
DATASET no 11 [classification, deep learning (*)]: Birdcall identification
Do you hear the birds chirping outside your window? Over 10,000 bird species occur in the world,
and they can be found in nearly every environment, from untouched rainforests to suburbs and even cities. Birds play an essential role in nature.
They are high up in the food chain and integrate changes occurring at lower levels. As such, birds are excellent indicators of deteriorating habitat quality and environmental pollution.
However, it is often easier to hear birds than to see them. With proper sound detection and classification, researchers could automatically intuit factors about an area’s quality
of life based on a changing bird population. The dataset can be downloaded from the Kaggle website
here.
The challenge is to build machine learning algorithm(s) to predict the bird species from audio recordings. Deep learning is only one of the possible solutions.
♠: deep learning will be addressed only during the last lectures of this course and it may require a powerful computer (allowing GPU computation), although Google Colab or Kaggle facilities can be used instead.
You will find lectures about deep learning for audio
here .
DATASET no 12 [classification]: Lymphoma
The
lymphoma dataset (Shipp et al., 2002) consists of 7129 gene expression
levels from 77 lymphomas. The 77 samples are divided into 58 diffuse
large B-cell lymphomas (DLBCL) and 19 follicular lymphomas (FL). The
data can be found at https://github.com/ramhiser/datamicroarray/blob/master/data/shipp.RData and the reference paper here.
DATASET no 11 [classification]: Colon cancer tumor
The
colon dataset is from the microarray experiment on colon tissue
samples of Alon et al. (1999). It contains the expression levels of 2000
genes for 40 tumors and 22 normal colon tissues. The data can be freely
downloaded from http://microarray.princeton.edu/oncology/affydata/index.html
and the reference paper here.
DATASET no 13 [regression]: Communities and Crime in US.
The data
combine socio-economic data from the 1990 Census, law enforcement data
from the 1990 Law Enforcement Management and Admin Stats survey, and
crime data from the 1995 FBI UCR. They can be used for various
regression tasks, for instance predicting the number of murders in
1995, the number of rapes, or the total number of
non-violent crimes per 100K population, and for trying to extract the most
relevant features.
A description of the data can be found
here and the data can be downloaded
here.
The dataset includes missing data.
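One simple way to handle the missing values before fitting a regression model is to replace them with the column median. The sketch below shows this in plain Python. It assumes that missing entries are encoded as the string "?" and that all columns are numeric, which should be checked against the data description; the function name and the toy rows are hypothetical.

```python
from statistics import median


def impute_column_medians(rows, missing_marker="?"):
    """Replace missing entries by the median of the observed values in each column.

    Assumes every column has at least one observed value; otherwise
    statistics.median raises StatisticsError.
    """
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [float(row[j]) for row in rows if row[j] != missing_marker]
        medians.append(median(observed))
    return [
        [medians[j] if row[j] == missing_marker else float(row[j])
         for j in range(n_cols)]
        for row in rows
    ]


# Made-up rows as they might look after reading a CSV file as strings.
toy_rows = [
    ["0.2", "?", "0.5"],
    ["0.4", "0.1", "?"],
    ["0.6", "0.3", "0.7"],
]
imputed = impute_column_medians(toy_rows)
for row in imputed:
    print(row)
```

Median imputation is only a baseline. To avoid leaking information, the medians should be computed on the training set and then reused for the test set.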