This morning, I was working with Julie, a student of mine, coming from Rennes, on mortality tables. Actually, we work on genealogical datasets from a small region in Québec, and we can observe a lot of volatiliy. If I borrow one of her graph, we get something like

Since we have some missing data, we wanted to use some Generalized Nonlinear Models. So let us see how to get a smooth estimator of the mortality surface. We will write some code that we can use on our data later on (the dataset we have has been obtained after signing a lot of official documents, and I guess I cannot upload it here, even partially).

DEATH <- read.table( "http://freakonometrics.free.fr/Deces-France.txt", header=TRUE) EXPO <- read.table( "http://freakonometrics.free.fr/Exposures-France.txt", header=TRUE,skip=2) library(gnm) D=DEATH$Male E=EXPO$Male A=as.numeric(as.character(DEATH$Age)) Y=DEATH$Year I=(A<100) base=data.frame(D=D,E=E,Y=Y,A=A) subbase=base[I,] subbase=subbase[!is.na(subbase$A),]

The first idea can be to use a Poisson model, where the mortality rate is a smooth function of the age and the year, something like

that can be estimated using

library(mgcv) regbsp=gam(D~s(A,Y,bs="cr")+offset(log(E)),data=subbase,family=quasipoisson) predmodel=function(a,y) predict(regbsp,newdata=data.frame(A=a,Y=y,E=1)) vX=trunc(seq(0,99,length=41)) vY=trunc(seq(1900,2005,length=41)) vZ=outer(vX,vY,predmodel) persp(vZ,theta=-30,col="green",shade=TRUE,xlab="Ages (0-100)", ylab="Years (1900-2005)",zlab="Mortality rate (log)")

The mortality surface is here

It is also possible to extract the average value of the years, which is the interpretation of the coefficient in the Lee-Carter model,

predAx=function(a) mean(predict(regbsp,newdata=data.frame(A=a, Y=seq(min(subbase$Y),max(subbase$Y)),E=1))) plot(seq(0,99),Vectorize(predAx)(seq(0,99)),col="red",lwd=3,type="l")

We have the following smoothed mortality rate

Recall that the Lee-Carter model is

where parameter estimates can be obtained using

regnp=gnm(D~factor(A)+Mult(factor(A),factor(Y))+offset(log(E)), data=subbase,family=quasipoisson)

predmodel=function(a,y) predict(regnp,newdata=data.frame(A=a,Y=y,E=1)) vZ=outer(vX,vY,predmodel) persp(vZ,theta=-30,col="green",shade=TRUE,xlab="Ages (0-100)", ylab="Years (1900-2005)",zlab="Mortality rate (log)")

The (crude) mortality surface is

with the following coefficients.

plot(seq(1,99),coefficients(regnp)[2:100],col="red",lwd=3,type="l")

Here we have a lot of coefficients, and unfortunately, on a smaller dataset, we have much more variability. Can we smooth our Lee-Carter model ? To get something which looks like

Actually, we can, and the code is rather simple

library(splines) knotsA=c(20,40,60,80) knotsY=c(1920,1945,1980,2000) regsp=gnm(D~bs(subbase$A,knots=knotsA,Boundary.knots=range(subbase$A),degre=3)+ Mult(bs(subbase$A,knots=knotsA,Boundary.knots=range(subbase$A),degre=3), bs(subbase$Y,knots=knotsY,Boundary.knots=range(subbase$Y),degre=3))+ offset(log(E)),data=subbase, family=quasipoisson) BpA=bs(seq(0,99),knots=knotsA,Boundary.knots=range(subbase$A),degre=3) BpY=bs(seq(min(subbase$Y),max(subbase$Y)),knots=knotsY,Boundary.knots= range(subbase$Y),degre=3) predmodel=function(a,y) predict(regsp,newdata=data.frame(A=a,Y=y,E=1)) v Z=outer(vX,vY,predmodel) persp(vZ,theta=-30,col="green",shade=TRUE,xlab="Ages (0-100)", ylab="Years (1900-2005)",zlab="Mortality rate (log)")

The mortality surface is now

and again, it is possible to extract the *average* mortality rate, as a function of the age, over the years,

BpA=bs(seq(0,99),knots=knotsA,Boundary.knots=range(subbase$A),degre=3) Ax=BpA%*%coefficients(regsp)[2:8] plot(seq(0,99),Ax,col="red",lwd=3,type="l")

We can then play with the smoothing parameters of the spline functions, and see the impact on the mortality surface

knotsA=seq(5,95,by=5) knotsY=seq(1910,2000,by=10) regsp=gnm(D~bs(A,knots=knotsA,Boundary.knots=range(subbase$A),degre=3)+ Mult(bs(A,knots=knotsA,Boundary.knots=range(subbase$A),degre=3), bs(Y,knots=knotsY,Boundary.knots=range(subbase$Y),degre=3)) +offset(log(E)),data=subbase,family=quasipoisson) predmodel=function(a,y) predict(regsp,newdata=data.frame(A=a,Y=y,E=1)) vZ=outer(vX,vY,predmodel) persp(vZ,theta=-30,col="green",shade=TRUE,xlab="Ages (0-100)", ylab="Years (1900-2005)",zlab="Mortality rate (log)")

We now have to use those functions our our small data sample ! That should be fun….

Dear Arthur,

Thank you very much for your code. I got an error when I tried to run this part:

predict(regsp,newdata=data.frame(A=a,Y=y,E=1)) v

Z=outer(vX,vY,predmodel)

I got the following error:

Error in model.frame.default(modelTerms, newdata, na.action = na.action, :

variable lengths differ (found for ‘offset(log(E))’) In addition: Warning message:

‘newdata’ had 1681 rows but variables found have 10700 rows

Could you help me with that? I really appreciate it.

Thank you very much

Victoria

Dear Arthur,

I am still concerned about the fact that the values of parameter a_x becoms higher than 0 for oldest group ages. Is it correct? I am asking because I tried to use the code to produce a forecast of mortality rates and found it hard to apply coefficient values printed by coefficients(regnp).

Kind regards,

Maciej

Dear Arthur,

I have also tried to apply the code presented in this post. However, after some time of struggling, the same error occured:

Error in glm.fit(x, y, weights = weights, start = start, etastart = etastart, :

NA/NaN/Inf in ‘y’

The only difference was the dataset I have chosen: instead of French records, I selected Dutch ones from Human Mortality Database. The rest is exactly the same.

Could you tell me what can be the mistake in code?

Kind regards,

Maciej

It might come from problem when your exposure is null… in that case, I remember I had problems… you have to remove them ! (start with more recent years, or stop at age 100)

I restricted my sample to the same period (1900-2005) and the same age groups (until 100 tears old) as in your article and it finally worked fine. Thank you for advice!

Great post.

It is great to see how these functions can fit existing mortality models so easily and be extended to new models.

One comment: I believe the smoothing on age, but the smoothing on the time effects k(t) seems to lead to some spilling over of what we know are isolated period effects (e.g., 1918 flu) into years prior and after the events.

Yes ! this is why I wanted some flexible model here, so select where I want some very smooth functions, and where there might be breaks (and there are a couple of breaks, at least, on those French data). Good point. But again, the purpose was not to write a detailed study on mortality in France, only to see that a can use smoothing functions in the gnm package (which was not clear for me when I started the post actually).

Dear Arthur,

I am Giancarlo Camarda and I’m researcher at the French National Institute for Demographic Studies in Paris.

A friend of mine pointed me out your post and I have found interesting to see that smoothing mortality rates became a topic in other areas than demography.

Specifically, analysis of mortality by smoothing techniques has been the main topic during my PhD and I also device an R-package (MortalitySmooth) tailored to smooth deaths counts as well as any (quasi-)Poisson data by P-splines and Generalized Linear Array Models.

About your post, you stated that “Since we have some missing data, we wanted to use some Generalized Nonlinear Models.”. I would say that smoothing techniques could be used when noise needs to be disregarded without imposing any model structure to the data.

Moreover it seems to me that whereas GAMs oversmooth your data, the Lee-Carter imposed too much structure in your estimation. In order to see that it would be good to portray the residuals for both models:

subbase$resGAM <- residuals(regbsp)

subbase$resLC <- residuals(regnp)

subbase$resLCsmo <- residuals(regsp)

## histograms

hist(subbase$resGAM, breaks=100)

hist(subbase$resLC, breaks=100)

hist(subbase$resLCsmo, breaks=100)

## over ages and years

library(lattice)

levelplot(resGAM ~ Y*A, subbase,

at=c(min(subbase$resGAM),-2,-1,1,2,max(subbase$resGAM)))

levelplot(resLC ~ Y*A, subbase,

at=c(min(subbase$resLC),-2,-1,1,2,max(subbase$resLC)))

levelplot(resLCsmo1 ~ Y*A, subbase, at=c(min(subbase$resLCsmo),-2,-1,1,2,max(subbase$resLCsmo)))

Additionally assuming smoothness over age and time in your data is not completely suitable:

– French males have experienced many “shocks” during the 20th century and models who accounts for these shocks would be more appropriate (see Kirkby and Currie (2010) on Statistical Modelling)

– French males show an interesting, though not incorporated into the model, cohort effects

Anyway I understand it was just a working example for further datasets.

Finally I would like to contribute by giving a small snippet in which I smoothed the French males using P-splines, I plot the fitted log-rates as well as the deviance residuals. A final section shows how to use the package to interpolate mortality data when missing data are present. A tried to comment the code as much as possible, but do not hesitate to contact me for any further questions.

Best regards and thanks for hosting my reply,

Giancarlo

ps: references could be found in citation("MortalitySmooth") and associated bibliography.

## reading data

Dvec <- DEATH$Male

Evec <- EXPO$Male

## loading package

library(MortalitySmooth)

## preparing the domains and data in matrices

A <- unique(as.numeric(as.character(DEATH$Age)))

A[length(A)] <- 110

Y <- unique(DEATH$Year)

Dmat <- matrix(Dvec, length(A), length(Y))

Emat <- matrix(Evec, length(A), length(Y))

## select the data

agemax <- 100

x <- A[A<=agemax]

y <- Y

D <- Dmat[A<=agemax, ]

E <- Emat[A<=agemax, ]

## fit with 2D P-splines

## default: optimal smoothing parameters selected by BIC

fitPS <- Mort2Dsmooth(x=x, y=y, Z=D, offset=log(E))

## default plot: shaded contour plot

plot(fitPS)

## plotting with perspective plot

persp(x, y, fitPS$logmortality, theta=-30,

col="green", shade=TRUE, xlab="Ages (0-100)",

ylab="Years (1899-2005)", zlab="Mortality rate (log)")

## plotting deviance residuals

## histogram

hist(residuals(fitPS), breaks=100)

## over ages and years

res.list <- list(x=x, y=y)

res.grid <- expand.grid(res.list)

res.grid$dev <- c(residuals(fitPS))

levelplot(dev ~ y*x, res.grid,

at=c(min(residuals(fitPS)), -2, -1, 1, 2, max(residuals(fitPS))))

## it's clear that smoothing is not appropriate for this time-window:

## we have got wars and epidemics

## moroever a cohort effect is clear for those born in

## during the Spanish flu epidemics

## MISSING DATA

## randomly assign zero-weights to about 50% of the data:

## interpolate complete surface using only half of the available information

m <- length(x)

n <- length(y)

set.seed(0)

whi0 <- sample(x=1:(m*n), size=floor(m*n/2))

W <- matrix(1, m, n)

W[whi0] <- 0

fitPSint <- Mort2Dsmooth(x=x, y=y, Z=D, offset=log(E), W=W)

## plotting some years over ages

lmx.hat <- fitPSint$logmortality[, c(1,36,71,107)]

lmx.act0 <- log(D/E)

lmx.act0[whi0] <- NA

lmx.act <- lmx.act0[,c(1,36,71,107)]

matplot(x, lmx.hat, type="l", lty=1,

xlab="Ages (0-100)", ylab="Mortality rate (log)")

matpoints(x, lmx.act, pch = 1)

legend("topleft", legend = y[c(1,36,71,107)], col=1:4,

pch=1, lty=1)

Trying to run your script. can you help me with this error?

> regbsp=gam(D~s(A,Y,bs=”cr”)+offset(log(E)),data=subbase,family=quasipoisson)

Error in smooth.construct.cr.smooth.spec(object, dk$data, dk$knots) :

Basis only handles 1D smooths

sorry… as mentioned in the error, it is simply because you cannot use cr-splines in dimension 2… just remove that option