PREDICTION OF THE ENSO INDEX

Data

The database used in this challenge comes from version 4 of the NOAA NCDC Extended Reconstructed Sea Surface Temperature (ERSST) dataset (more details here: https://www.ncdc.noaa.gov/data-access/marineocean-data/extended-reconstructed-sea-surface-temperature-ersst-v4). It consists of monthly maps of Sea Surface Temperature (SST) anomalies (relative to a 1971-2000 climatology) in the Equatorial Pacific. The reconstruction combines in situ measurements (buoys, ships) with satellite observations (available since the 1970s). Data are available between 1854 and 2017.

Aim

The goal is to predict, 6 months ahead, the ENSO index calculated from SST anomalies averaged over the El Nino 3.4 region. More information on the ENSO index and how to calculate it from SST is available at: http://journals.ametsoc.org/doi/pdf/10.1175/1520-0477%281997%29078%3C2771%3ATDOENO%3E2.0.CO%3B2. Below, you will find an explanation of El Nino and La Nina. To compute the ENSO index, the SST anomalies in the Tropical Pacific (El Nino 3.4 region) are averaged. The El Nino/La Nina threshold is 0.5 °C.

Evaluation procedure

To evaluate and compare the results, we propose to use cross validation with the RMSE score, using the following (scikit-learn) commands:

skf = ShuffleSplit(y.shape[0], n_iter=10, test_size=1/3, random_state=1)
sqrt(-cross_val_score(reg, X=X, y=y, scoring='neg_mean_squared_error', cv=skf))

Then, we will test the methodology on the prediction of the last 10 years, and try to see whether we are able to predict the last big El Nino events of 2009-2010 and 2015-2016.

Remarks

Note that the ENSO and SST anomaly time series are not stationary in time (there is a slight increasing trend in the data). You can either take this trend into account or train your predictive model on a stationary sub-part of the time series.
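The snippet above is scikit-learn (Python). Since the rest of this document is in R, here is a minimal sketch of the same protocol in R; the function fit_predict is hypothetical and stands for any regression method that returns predictions on new data.

# 10 random 2/3-1/3 splits; return the RMSE on each held-out third
shuffle_split_rmse = function(X, Y, fit_predict, B = 10, test_frac = 1/3, seed = 1) {
  set.seed(seed)
  n = nrow(X)
  rmse = numeric(B)
  for (b in 1:B) {
    itest  = sample(1:n, round(n * test_frac))   # held-out third
    itrain = setdiff(1:n, itest)                 # remaining two thirds
    yhat = fit_predict(X[itrain, ], Y[itrain], X[itest, ])
    rmse[b] = sqrt(mean((Y[itest] - yhat)^2))
  }
  rmse
}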

Load data

The full NetCDF file contains the 163 years of SST anomalies in the Equatorial Pacific. Here, we have selected a subsample of the dataset, with a smaller area and a time period running from January 1950 to April 2017.
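If you prefer to start from the full dataset, the NetCDF file can be read with the CRAN library ncdf4. A minimal sketch; the file name sst.mnmean.nc and the variable name "sst" are assumptions, to be adapted to your download:

library(ncdf4)
nc = nc_open("sst.mnmean.nc")  # hypothetical file name
print(names(nc$var))           # list the variables actually stored in the file
sst = ncvar_get(nc, "sst")     # assumed variable name: lon x lat x time array
lon = ncvar_get(nc, "lon")
lat = ncvar_get(nc, "lat")
nc_close(nc)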

First, download ElNino.Rdata from the page https://perso.univ-rennes1.fr/valerie.monbet/enseignement.html.

Images are plotted using the CRAN R library fields. In the sequel, if you cannot load the fields library, just use image instead of image.plot.

load("ElNino.Rdata")

library(fields)
## Loading required package: spam
## Loading required package: dotCall64
## Loading required package: grid
## Spam version 2.1-1 (2017-07-02) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction 
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
## 
## Attaching package: 'spam'
## The following objects are masked from 'package:base':
## 
##     backsolve, forwardsolve
## Loading required package: maps
image.plot(lon,lat,matrix(temperatures[1,],length(lon),length(lat)),col=tim.colors(n=64))
title('Example of SST field')
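Since fields loads the maps package, coastlines can be overlaid to make the field easier to read. A small sketch, assuming the longitudes are on a 0-360 grid (use "world" instead of "world2" if they run from -180 to 180):

image.plot(lon,lat,matrix(temperatures[1,],length(lon),length(lat)),col=tim.colors(n=64))
map("world2", add = TRUE)  # coastlines, 0-360 longitude convention
title('Example of SST field, with coastlines')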

Feature definition

n.temp = nrow(temperatures)
# predict 6 months ahead: the SST field of month t is paired with the ENSO index of month t+6
X = temperatures[1:(n.temp-6),]
Y = ENSO[7:n.temp]
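A quick sanity check that features and targets remain aligned after the shift:

stopifnot(nrow(X) == length(Y))  # one target per feature row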

Empirical Orthogonal Functions (EOFs) allow us to visualize the data

Empirical Orthogonal Functions is the usual name for Principal Component Analysis in the environmental sciences.

library(FactoMineR)
pca = PCA(X,ncp=30,graph=FALSE)
# Eigenvalues (100 first); pca$eig is a matrix, so index its "eigenvalue" column
plot(pca$eig[1:100,"eigenvalue"],type="l",xlab="Nb of components",ylab="Eigenvalues")
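As a cross-check, the same decomposition can be computed with base R's prcomp. Note that PCA from FactoMineR standardizes the variables by default, hence scale. = TRUE below, and that the two functions normalize the eigenvalues slightly differently (division by n versus n-1), so small discrepancies are expected:

pr = prcomp(X, center = TRUE, scale. = TRUE)
plot(pr$sdev[1:100]^2, type = "l", xlab = "Nb of components", ylab = "Eigenvalues")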

First EOF

image.plot(lon,lat,matrix(pca$var$coord[,1],length(lon),length(lat)))
title("1st EOF") 

Second EOF

image.plot(lon,lat,matrix(pca$var$coord[,2],length(lon),length(lat)))
title("2nd EOF") 

Example of ridge regression with selection of the regularization parameter

library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-12
n = nrow(X)
DF = data.frame(cbind(Y,X))
colnames(DF)[-1] = paste0("X", 1:ncol(X)) # name the predictors X1, X2, ...
B = 10
# glmnet refits with the lambda sequence sorted in decreasing order, so we
# supply it that way to keep the prediction columns aligned with the labels below
lambda = c(1,5e-1,1e-1,5e-2,1e-2,5e-3)
RMSE = matrix(0,B,length(lambda))
RMSE_LM = matrix(0,B,1)
for (b in 1:B){
    itrain = sample(1:n,round(n*2/3)) # random 2/3-1/3 split
    itest = setdiff(1:n,itrain)
    LM = lm(Y~.,data=DF,subset=itrain) # linear model learned on the itrain samples
    yhat = predict(LM,DF)[itest]
    RMSE_LM[b,1] = sqrt(mean((Y[itest]-yhat)^2))
    ridge = glmnet(X[itrain,], Y[itrain], family="gaussian",lambda=lambda,alpha=0) # alpha=0 is the ridge penalty
    yhat.ridge = predict(ridge,X[itest,]) # one column of predictions per lambda
    for (i in 1:length(lambda)){
        RMSE[b,i] =  sqrt(mean((Y[itest]-yhat.ridge[,i])^2))
    } 
}
## Warning in predict.lm(LM, DF): prediction from a rank-deficient fit may be
## misleading

(This warning is emitted once per split: the gridded predictors are numerous and highly correlated, so the least-squares fit is rank deficient, which explains the poor performance of the plain linear model below.)
Err = data.frame(cbind(RMSE_LM,RMSE))
colnames(Err) <- c("Linear",lambda)
boxplot(Err)

It is useful here to redraw the boxplot without the linear model, whose error is huge compared to the ridge fits.

Err.ridge = data.frame(RMSE)
colnames(Err.ridge) <- c(lambda)
boxplot(Err.ridge)
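Instead of hand-rolled splits, the lambda path can also be cross-validated directly with cv.glmnet from the same package; its fold-based criterion differs slightly from the shuffle-split protocol above, so the selected value may not coincide exactly:

cv = cv.glmnet(X, Y, alpha = 0, lambda = lambda, nfolds = 10)
cv$lambda.min  # lambda with the smallest cross-validated error
plot(cv)       # CV curve with one-standard-error bands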

Let us have a look at the coefficients. One could expect more spatial regularity.

beta = coef(ridge,s=1)
image.plot(lon,lat,matrix(beta[-1],length(lon),length(lat))) # drop the intercept: one coefficient per grid point
title("Ridge coefficients")

Test the statistical predictor

Here, we learn one regression model on the first part of the time series and test the fit on the independent last part (10 years).

itest = (n-119):n              # last 10 years = last 120 months
itrain = setdiff(1:n,itest)    # train on everything before the test period
ridge = glmnet(X[itrain,], Y[itrain], family="gaussian",lambda=1,alpha=0) # lambda=1 gave the best RMSE above
ENSO.ridge = predict(ridge,X)
plot(ts(ENSO,start=1950,freq=12))
# the prediction for row t of X is the ENSO index of month t+6,
# so the predicted series starts 6 months after the observed one
lines(ts(ENSO.ridge,start=c(1950,7),freq=12),col="red")
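Finally, the aim of the challenge can be checked quantitatively: the RMSE on the held-out decade, and a zoom on the recent El Nino events. A short sketch using the same time alignment as above:

rmse.test = sqrt(mean((Y[itest] - ENSO.ridge[itest])^2))
rmse.test  # out-of-sample RMSE on the last 10 years
# zoom on the end of the series to inspect the 2009-2010 and 2015-2016 events
plot(window(ts(ENSO,start=1950,freq=12), start=2007), ylab="ENSO index")
lines(window(ts(ENSO.ridge,start=c(1950,7),freq=12), start=2007), col="red")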