The diagnosis of blood-based diseases often involves
identifying and characterizing patient blood samples.
Automated methods to detect and classify blood cell
subtypes have important medical applications.
The dataset contains 12,500 augmented images of blood
cells (JPEG) with accompanying cell type labels (CSV).
There are approximately 3,000 images for each of 4
different cell types. The cell types are Eosinophil,
Lymphocyte, Monocyte, and Neutrophil. This dataset is
accompanied by an additional dataset containing the
original 410 images (pre-augmentation).
Data can be retrieved and downloaded from the Kaggle
website here. More details and some ideas to
solve the classification problem can be read here.
♠: deep learning will be addressed only during the last
lectures of this course, and it may require a powerful
computer (with GPU support), although Google Colab
and/or Kaggle facilities can be used instead.
DATASET no 5 [regression]: Plant species
richness in Tibetan alpine grasslands
Species richness is the core of biodiversity-ecosystem
functioning (BEF) research. Nevertheless, it is difficult
to accurately predict changes in plant species
richness under different climate scenarios, especially in
alpine biomes. In this project we propose to identify the
most critical driver of species richness distribution.
A paper describing an analysis of the data set is
available here. The dataset is available
here.
DATASET no 6 [classification, deeplearning]: Cloud
type detection
Clouds are a major challenge for passive satellite
imaging, and daily cloud cover and rain showers in the
Amazon basin can significantly complicate monitoring in
the area. For this reason, we have chosen to include a
cloud cover label for each chip. These labels closely
mirror what one would see in a local weather forecast:
clear, partly cloudy, cloudy, and haze. For our purposes,
haze is defined as any chip where atmospheric clouds are
visible but they are not so opaque as to obscure the
ground. Clear scenes show no evidence of clouds, and
partly cloudy scenes can show opaque cloud cover over any
portion of the image. Cloudy images have 90% of the chip
obscured by opaque cloud cover. The dataset to be used
gathers satellite images of the Amazon and their labels (clear,
partly cloudy, cloudy, and haze); it can be downloaded
from the Kaggle website here.
This talk
describes an analysis of this dataset.
♠: deep learning will be addressed only during the last
lectures of this course, and it may require a powerful
computer (with GPU support), although Google Colab can
be used instead.
♣: if you decide to work on this project, the dataset is
available on Google Drive (link).
Starter code is available under this link. Please save your own copy
so that the original code is not modified.
In this paper, the problem of automatic detection and classification of butterfly species using butterfly
photographs is studied.
Data can be retrieved and downloaded from the Kaggle
website here. A paper with ideas and
algorithms is available here.
♠: deep learning will be addressed only during the last
lectures of this course, and it may require a powerful
computer (with GPU support), although Google Colab
and/or Kaggle facilities can be used instead.
(*) DATASET no 11 [classification, deeplearning, sound
preprocessing]: Birdcall identification
Do you hear the birds chirping outside your window? Over
10,000 bird species occur in the world, and they can be
found in nearly every environment, from untouched
rainforests to suburbs and even cities. Birds play an
essential role in nature. They are high up in the food
chain and integrate changes occurring at lower levels. As
such, birds are excellent indicators of deteriorating
habitat quality and environmental pollution. However, it
is often easier to hear birds than see them. With proper
sound detection and classification, researchers could
automatically intuit factors about an area’s quality of
life based on a changing bird population. The dataset can be
downloaded from the Kaggle website here.
The challenge is to build machine learning
algorithm(s) to predict the bird species from audio
recordings. Deep learning is only one of the possible
solutions.
♠: deep learning will be addressed only during the last
lectures of this course, and it may require a powerful
computer (with GPU support), although Google Colab or
Kaggle facilities can be used instead.
You will find lectures about deep learning for audio here.
DATASET no 12 [classification]: Influenza outbreak event
prediction via Twitter
By identifying influenza-related tweets, the goal is to
forecast the spatiotemporal patterns of influenza
outbreaks for different locations and dates.
The data are from the United States and cover
different states over different weeks. For each week,
the task is to predict whether or not there is an
influenza outbreak on the next date. More specifically,
for influenza activity, there are four levels of flu
activities from minimal to high according to CDC Flu
Activity Map. An influenza outbreak occurrence is
indicated if the activity level is high.
Data and paper can be downloaded here.
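The labelling rule described above (an outbreak is indicated only when the activity level is high, and the target is the indicator for the next date) can be sketched as follows. This is a minimal illustration, not the code of the paper: the intermediate level names `low` and `moderate`, the `(state, week, level)` record layout, and the helper `next_step_outbreak_labels` are assumptions made for this example.

```python
from collections import defaultdict

# Four activity levels from minimal to high.
# The two intermediate names are assumed for illustration.
LEVELS = ("minimal", "low", "moderate", "high")


def is_outbreak(level):
    """Binary target: an outbreak is indicated only when the level is 'high'."""
    if level not in LEVELS:
        raise ValueError(f"unknown activity level: {level!r}")
    return level == "high"


def next_step_outbreak_labels(records):
    """Pair each (state, week) with the outbreak indicator of the next date.

    `records` is an iterable of (state, week, level) tuples (hypothetical layout).
    The last observed week of each state has no successor and is dropped.
    """
    by_state = defaultdict(list)
    for state, week, level in records:
        by_state[state].append((week, level))

    labels = {}
    for state, series in by_state.items():
        series.sort()  # chronological order within each state
        for (week, _), (_, next_level) in zip(series, series[1:]):
            labels[(state, week)] = is_outbreak(next_level)
    return labels


# Toy records, invented for illustration only.
toy = [
    ("TX", 1, "minimal"), ("TX", 2, "moderate"), ("TX", 3, "high"),
    ("NY", 1, "low"), ("NY", 2, "low"),
]
labels = next_step_outbreak_labels(toy)
```

In a real project the features for each `(state, week)` pair would come from the tweets of that week, while the label comes from the following date, so that no information from the future leaks into the predictors.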
DATASET no 14 [regression]: Predicting
CO2 Emissions
The ability to accurately monitor carbon
emissions is a critical step in the fight against
climate change. Precise carbon readings allow
researchers and governments to understand the sources
and patterns of carbon mass output. While Europe and
North America have extensive systems in place to monitor
carbon emissions on the ground, there are few available
in Africa.
The objective is to create machine learning
models using open-source CO2 emissions data from Sentinel-5P
satellite observations to predict future carbon
emissions.
These solutions may help enable governments and other
actors to estimate carbon emission levels across Africa,
even in places where on-the-ground monitoring is not
possible.
Material is available here.
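Since the goal is to predict future emissions, one reasonable way to evaluate a model is to hold out the most recent observations rather than a random subset. The sketch below illustrates this idea on invented numbers. The choice of RMSE as the error measure, the 25% holdout fraction, the `(time_index, emission)` row layout, and the helper names are all assumptions for this example and are not taken from the project material.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def time_ordered_split(rows, holdout_fraction=0.25):
    """Sort rows by time and keep the last fraction as the holdout set.

    `rows` is a list of (time_index, emission) pairs (hypothetical layout).
    """
    ordered = sorted(rows)
    cut = int(len(ordered) * (1 - holdout_fraction))
    return ordered[:cut], ordered[cut:]


# Toy series, invented for illustration only.
toy = [(t, float(v)) for t, v in enumerate([10, 12, 11, 13, 14, 13, 15, 16])]
train, test = time_ordered_split(toy)

# Naive baseline: predict the training mean for every future time step.
train_mean = sum(v for _, v in train) / len(train)
baseline_rmse = rmse([v for _, v in test], [train_mean] * len(test))
```

Any proposed model should at least beat such a naive baseline on the held-out period.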
DATASET no 18 [classification]: bone marrow
transplantation
The aim of our study was to compare the results of
unrelated donor (UD) peripheral blood stem cell transplantation
versus UD bone marrow
transplantation and to analyze the impact of infused
CD34+ and CD3+
cell doses on survival and incidence of severe graft-versus-host
disease (GVHD) in 187 children who underwent UD hematopoietic
cell transplantation with the use of in vivo T cell depletion
(antithymocyte globulin or CAMPATH-1H).
More details and the dataset are available here.
DATASET no 20 [classification, rare event]: Prediction
of ozone pollution in Houston
Accurate ozone alert forecasting systems are necessary to
issue warnings to the public before the ozone reaches a
dangerous level. However, little is known about exactly which
features are important in ozone production and how they
actually interact in the formation of ozone. This provides
wonderful opportunities for machine learning.
One of the reference papers is available here,
and a short description of the original dataset, which
contains 72 meteorological variables, is available here.
A version with no missing values is provided below (Ozone_imputed.csv).
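Because ozone days are rare events, plain accuracy can be misleading: a classifier that never raises an alert may still look accurate. The sketch below illustrates this on invented labels. The 5% event rate and the helper names are assumptions for this example, not properties of the Houston data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 marks an ozone day."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of correctly classified days."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall(y_true, y_pred):
    """Precision and recall on the rare class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: 5 ozone days out of 100 (made-up proportion).
y_true = [1] * 5 + [0] * 95
never_alert = [0] * 100  # a classifier that never predicts an ozone day

acc = accuracy(y_true, never_alert)
prec, rec = precision_recall(y_true, never_alert)
```

Here the accuracy is high although no ozone day is detected, which is why measures focused on the rare class are usually preferred for this kind of problem.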
DATASET no 21 [classification]: Lymphoma
The lymphoma dataset (Shipp et al., 2002) consists of 7129
gene expression levels from 77 lymphomas. The 77 samples are
divided into 58 diffuse large B-cell lymphomas (DLBCL) and
19 follicular lymphomas (FL). The data can be found at https://github.com/ramhiser/datamicroarray/blob/master/data/shipp.RData and
the reference paper is available here.
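With only 77 samples, 7129 features, and unbalanced classes (58 DLBCL versus 19 FL), the way the data are split for validation matters. The sketch below shows one possible approach, stratified folds that keep the class proportions roughly constant, together with the accuracy of always predicting the majority class. The choice of 5 folds and the helper `stratified_folds` are assumptions for this example and are not prescribed by the reference paper.

```python
import random


def stratified_folds(labels, n_folds, seed=0):
    """Split sample indices into n_folds lists with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class to the folds in turn.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Class sizes taken from the dataset description above.
labels = ["DLBCL"] * 58 + ["FL"] * 19
folds = stratified_folds(labels, n_folds=5)

# Accuracy obtained by always predicting the majority class (DLBCL).
majority_baseline = 58 / 77
```

Each fold then contains 3 or 4 FL samples, and any classifier should be compared against the majority baseline of about 75% accuracy.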
DATASET no 18 [classification]: Colon cancer
tumor
The colon dataset is from the microarray experiment of colon
tissue samples of Alon et al. (1999). It contains the
expression level of 2000 genes for 40 tumors and 22 normal
colon tissues. The data can be freely downloaded from http://microarray.princeton.edu/oncology/affydata/index.html
and the reference paper is available here.