Friday, August 23, 2013

Compensating for autocorrelation in global temperature data

Autocorrelation in global temperature data simply means that the average temperature for any one month is correlated with the average temperature of the previous month.  It is an unfortunately common problem when dealing with time series and spatial statistics.  The gist of the issue is that most standard statistical techniques, such as ANOVA and regression, assume that the variation in the data is random (white noise) when calculating standard errors and p-values.  Autocorrelation means that the noise in the data is not random but correlated (red noise).  The correlation reduces the effective size of the data set, so the standard errors and p-values calculated from normal statistical tests come out lower than they should be, biased toward showing statistical significance in cases where the tests should not show significance.
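To see why this matters, here is a small simulated sketch (all object names are hypothetical): an AR(1) "red noise" series with no trend at all, regressed on time with ordinary least squares.

```r
# Sketch: OLS applied to trendless red noise.
set.seed(1)
n   <- 500
phi <- 0.9                                           # lag-1 autocorrelation
red <- as.numeric(arima.sim(list(ar = phi), n = n))  # AR(1) noise, no trend
t   <- seq_len(n)
p   <- summary(lm(red ~ t))$coefficients["t", "Pr(>|t|)"]
# OLS treats all 500 points as independent, but for AR(1) noise the
# effective sample size is roughly n * (1 - phi) / (1 + phi), about 26
# here, so "significant" trends appear far more often than the nominal 5%.
n_eff <- n * (1 - phi) / (1 + phi)
```

Run this with different seeds and the slope on `t` comes up "significant" far too often, even though the series contains no trend by construction.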

One of the best ways to compensate for autocorrelation is to use an Autoregressive Integrated Moving Average (ARIMA) model to account for the red noise.  Here's an example of how to construct an ARIMA model using GISS global temperature data from January 1880 to June 2013.  The first task was to isolate GISS data from the larger data object and save it as a time series:

Climate.1880 = subset(Climate, Time>=1880)
GISS.ts = ts(Climate.1880$GISS, start = c(1880, 1), frequency = 12)
plot(GISS.ts, ylab="Temperature anomaly (°C)", main="GISS global temperature anomalies January 1880-June 2013")
GISS is not a stationary time series, which ACF and PACF analysis requires; the trend is obvious.  In order to accurately measure the level of autocorrelation, we first have to remove the trend, usually by differencing the data: calculating the difference between the data at time t and time t-1.

GISS.diff = diff(GISS.ts)
plot(GISS.diff, main="Differenced GISS data")

While differencing once worked this time, sometimes the differencing has to be repeated to completely remove the trend.  The next step is to measure the degree of autocorrelation in the differenced data using the autocorrelation function (ACF) and partial autocorrelation function (PACF).
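A quick heuristic sketch for checking whether one difference was enough (using the GISS.ts object created above): over-differencing usually inflates the variance rather than reducing it further.

```r
# If the second difference has a higher variance than the first,
# one round of differencing was probably sufficient.
var(diff(GISS.ts, differences = 1))
var(diff(GISS.ts, differences = 2))
```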

acf(GISS.diff, lag.max=40)
acf(GISS.diff, lag.max=40, plot=FALSE) 
pacf(GISS.diff, lag.max=40)
pacf(GISS.diff, lag.max=40, plot=FALSE)

The ACF results show a negative correlation between adjacent months (ACF = -0.371), with none of the other lags showing much correlation except the one at lag 2, which is likely due to random chance as none of the intervening lags are significant.  The PACF shows a seasonal pattern in the differenced data, with the greatest negative correlation again between adjacent months, tapering off at longer lags.

We can use this information to determine manually which ARIMA model will be most appropriate.  The ACF suggests AR = 1, and since we differenced the data once, I = 1.  The MA term would have to be at least 1, but possibly 2 if the trend is accelerating.  We could figure out which is better by trial and error, or we can let R figure it out for us.
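The trial-and-error route can be sketched with base R's stats::arima, comparing the candidate orders by BIC (lower is better); GISS.ts is the series created above.

```r
# Fit the two candidate models and compare information criteria.
fit_111 <- arima(GISS.ts, order = c(1, 1, 1))
fit_112 <- arima(GISS.ts, order = c(1, 1, 2))
BIC(fit_111)
BIC(fit_112)  # the lower BIC wins
```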

library(forecast)  # provides auto.arima()
GISS.arima = auto.arima(GISS.ts, ic = "bic")  # "bic" penalizes models that have too many parameters
GISS.arima

Series: GISS.ts
ARIMA(1,1,2)

Coefficients:
         ar1      ma1     ma2
      0.8129  -1.3465  0.3661
s.e.  0.0304   0.0438  0.0404

sigma^2 estimated as 0.01115:  log likelihood=1326.71
AIC=-2645.42   AICc=-2645.4   BIC=-2623.91

The best statistical model is ARIMA(1,1,2), as suspected.  The fitted model matches the observed data very well.

Graph code:
plot(GISS.ts, col="blue", type="l", ylab="Temperature anomaly (°C)", main="GISS and ARIMA model results")
points(fitted(GISS.arima), type="l", col="red")
legend(1880, 0.9, c("GISS", "ARIMA"), col=c("blue", "red"), lwd=2)
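A fitted model of this kind can also produce short-term forecasts.  Here is a minimal base-R sketch: refit the chosen (1,1,2) order with stats::arima and project two years ahead with predict() (with the forecast package loaded, forecast(GISS.arima, h = 24) is the equivalent for the auto.arima fit).

```r
# Forecast sketch: two years of monthly point forecasts plus standard errors.
fit <- arima(GISS.ts, order = c(1, 1, 2))
fc  <- predict(fit, n.ahead = 24)
ts.plot(GISS.ts, fc$pred, col = c("blue", "red"), lty = c(1, 2))
```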
We can use the ARIMA model of the temperature data in various ways, including forecasting the short-term behavior of the data and adjusting standard errors and p-values in regression models.  How do we incorporate ARIMA models into regression models?  It's pretty easy with R, actually.  First, I calculated a regression of the GISS data as normal, assuming white noise:

Time = time(GISS.ts)-1880
Time2 = Time^2
GISS.lm = lm(GISS.ts ~ Time + Time2)
summary(GISS.lm)

Call:
lm(formula = GISS.ts ~ Time + Time2)

Residuals:
     Min       1Q   Median       3Q      Max
-0.57989 -0.09593  0.00235  0.09732  0.56760

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.474e-01  1.113e-02 -22.238  < 2e-16 ***
Time        -2.644e-03  3.852e-04  -6.864 9.53e-12 ***
Time2        6.915e-05  2.795e-06  24.737  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1486 on 1599 degrees of freedom
Multiple R-squared: 0.7675,    Adjusted R-squared: 0.7672
F-statistic:  2639 on 2 and 1599 DF,  p-value: < 2.2e-16
Next, I used ACF and PACF on the residuals of the regression model to check for autocorrelation:

acf(resid(GISS.lm), 40)
pacf(resid(GISS.lm), 40)
As expected, the residuals do show autocorrelation, with an ACF of -0.371 between adjacent months as well as a seasonal signal.  I then used auto.arima to compute the best model:

auto.arima(resid(GISS.lm), ic = "bic")

Series: resid(GISS.lm)
ARIMA(1,0,1) with zero mean

Coefficients:
         ar1      ma1
      0.8894  -0.4378
s.e.  0.0162   0.0337

sigma^2 estimated as 0.01114:  log likelihood=1328.93
AIC=-2651.87   AICc=-2651.85   BIC=-2635.73
The resulting model is an ARMA(1,1).  I wouldn't have expected the residuals to need differencing, since the trend had already been removed; if they did need differencing, it would indicate that my regression model was missing a term and hadn't captured the full trend.  And since I had already included an acceleration term (Time2) in the regression model, the MA term is only first-order, removing the seasonal cycle.  I incorporated the ARMA model into a regression analysis using generalized least squares from the nlme package:

library(nlme)  # provides gls() and corARMA()
GISS.gls = gls(GISS.ts ~ Time + Time2, correlation = corARMA(p = 1, q = 1))
summary(GISS.gls)

Generalized least squares fit by REML
  Model: GISS.ts ~ Time + Time2
  Data: NULL
        AIC       BIC   logLik
  -2603.85 -2571.587 1307.925

Correlation Structure: ARMA(1,1)
 Formula: ~1
 Parameter estimate(s):
      Phi1     Theta1
 0.8943245 -0.4420998

Coefficients:
                  Value  Std.Error   t-value p-value
(Intercept) -0.25014128 0.04077149 -6.135201  0.0000
Time        -0.00255438 0.00141217 -1.808825  0.0707
Time2        0.00006854 0.00001024  6.692065  0.0000

 Correlation:
      (Intr) Time
Time  -0.861
Time2  0.738 -0.968

Standardized residuals:
       Min         Q1        Med         Q3        Max
-3.8499081 -0.6388528  0.0195990  0.6492011  3.7759758

Residual standard error: 0.1501945
Degrees of freedom: 1602 total; 1599 residual

While the coefficients are similar to the white noise model, the standard errors and p-values are very different.  For example, the linear Time coefficient is -0.002644 in the white noise model versus -0.002554 in the red noise model.  But where the white noise model gives a standard error of 0.0003852 and a p-value of 9.53e-12, the red noise model shows that the true standard error is 0.00141217, nearly four times larger, and the true p-value, while close, is actually not significant at 0.0707.
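The contrast is easy to pull out directly; a sketch assuming GISS.lm and GISS.gls from above are still in the workspace, using confint() for the OLS fit and nlme's intervals() for the GLS fit.

```r
confint(GISS.lm)["Time", ]          # misleadingly narrow under white noise
library(nlme)
intervals(GISS.gls)$coef["Time", ]  # wider, honest interval under ARMA(1,1) errors
```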

[Figure: GISS temperature data with ARIMA-adjusted regression trend added]

While this seems like quite a bit of work to go through if the overall regression doesn't change, it's necessary to get things right.  And that is the ultimate goal of science.
