It’s now December 1st and the last time we got new snow was on November 11th. In my last post I looked at the lengths of snow-free periods in the available weather data for Fairbanks, now at 20 days. That’s a long time, but what I’m interested in looking at today is whether the monthly pattern of snowfall in Fairbanks is changing.

The Alaska Dog Musher’s Association holds a series of weekly sprint races starting at the beginning of December. For the past several years—and this year—there hasn’t been enough snow to hold the earliest of the races because it takes a certain depth of snowpack to allow a snow hook to hold a team back should the driver need to stop. I’m curious to know if scheduling a bunch of races in December and early January is wishful thinking, or if we used to get a lot of snow earlier in the season than we do now. In other words, has the pattern of snowfall in Fairbanks changed?

One way to get at this is to look at the earliest data in the “winter year” (which I’m defining as starting on September 1st, since we do sometimes get significant snowfall in September) when 12 inches of snow has fallen. Here’s what that relationship looks like:

And the results from a linear regression:

```
Call:
lm(formula = winter_doy ~ winter_year, data = first_foot)
Residuals:
Min 1Q Median 3Q Max
-60.676 -25.149 -0.596 20.984 77.152
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -498.5005 462.7571 -1.077 0.286
winter_year 0.3067 0.2336 1.313 0.194
Residual standard error: 33.81 on 60 degrees of freedom
Multiple R-squared: 0.02793, Adjusted R-squared: 0.01173
F-statistic: 1.724 on 1 and 60 DF, p-value: 0.1942
```

According to these results the date of the first foot of snow is getting later
in the year, but it’s not significant, so we can’t say with any authority that
the pattern we see isn’t just random. Worse, this analysis could be confounded
by what appears to be a decline in the total *yearly* snowfall in Fairbanks:

This relationship (less snow every year) has even less statistical significance. If we combine the two analyses, however, there is a significant relationship:

```
Call:
lm(formula = winter_year ~ winter_doy * snow, data = yearly_data)
Residuals:
Min 1Q Median 3Q Max
-35.15 -11.78 0.49 14.15 32.13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.947e+03 2.082e+01 93.520 <2e-16 ***
winter_doy 4.297e-01 1.869e-01 2.299 0.0251 *
snow 5.248e-01 2.877e-01 1.824 0.0733 .
winter_doy:snow -7.022e-03 3.184e-03 -2.206 0.0314 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.95 on 58 degrees of freedom
Multiple R-squared: 0.1078, Adjusted R-squared: 0.06163
F-statistic: 2.336 on 3 and 58 DF, p-value: 0.08317
```

Here we’re “predicting” winter year based on the yearly snowfall, the first date where a foot of snow had fallen, and the interaction between the two. Despite the near-significance of the model and the parameters, it doesn’t do a very good job of explaining the data (almost 90% of the variation is unexplained by this model).

One problem with boiling the data down into a single (or two) values for each year is that we’re reducing the amount of data being analyzed, lowering our power to detect a significant relationship between the pattern of snowfall and year. Here’s what the overall pattern for all years looks like:

And the individual plots for each year in the record:

Because “winter month” isn’t a continuous variable, we can’t use normal linear regression to evaluate the relationship between year and monthly snowfall. Instead we’ll use multinominal logistic regression to investigate the relationship between which month is the snowiest, and year:

```
library(nnet)
model <- multinom(data = snowiest_month, winter_month ~ winter_year)
summary(model)
Call:
multinom(formula = winter_month ~ winter_year, data = snowiest_month)
Coefficients:
(Intercept) winter_year
3 30.66572 -0.015149192
4 62.88013 -0.031771508
5 38.97096 -0.019623059
6 13.66039 -0.006941225
7 -68.88398 0.034023510
8 -79.64274 0.039217108
Std. Errors:
(Intercept) winter_year
3 9.992962e-08 0.0001979617
4 1.158940e-07 0.0002289479
5 1.120780e-07 0.0002218092
6 1.170249e-07 0.0002320081
7 1.668613e-07 0.0003326432
8 1.955969e-07 0.0003901701
Residual Deviance: 221.5413
AIC: 245.5413
```

I’m not exactly sure how to interpret the results, but typically you’re looking to see if the intercepts and coefficients are significantly different from zero. If you look at the difference in magnitude between the coefficients and the standard errors, it appears they are significantly different from zero, which would imply they are statistically significant.

In order to examine what they have to say, we’ll calculate the probability curves for whether each month will wind up as the snowiest month, and plot the results by year.

```
fit_snowiest <- data.frame(winter_year = 1949:2012)
probs <- cbind(fit_snowiest, predict(model, newdata = fit_snowiest, "probs"))
probs.melted <- melt(probs, id.vars = 'winter_year')
names(probs.melted) <- c('winter_year', 'winter_month', 'probability')
probs.melted$month <- factor(probs.melted$winter_month)
levels(probs.melted$month) <- \
list('oct' = 2, 'nov' = 3, 'dec' = 4, 'jan' = 5, 'feb' = 6, 'mar' = 7, 'apr' = 8)
q <- ggplot(data = probs.melted, aes(x = winter_year, y = probability, colour = month))
q + theme_bw() + geom_line(size = 1) + scale_y_continuous(name = "Model probability") \
+ scale_x_continuous(name = 'Winter year', breaks = seq(1945, 2015, 5)) \
+ ggtitle('Snowiest month probabilities by year from logistic regression model,\n
Fairbanks Airport station') \
+ scale_colour_manual(values = \
c("violet", "blue", "cyan", "green", "#FFCC00", "orange", "red"))
```

The result:

Here’s how you interpret this graph. Each line shows how likely it is that a month will be the snowiest month (November is always the snowiest month because it always has the highest probabilities). The order of the lines for any year indicates the monthly order of snowiness (in 1950, November, December and January were predicted to be the snowiest months, in that order), and months with a negative slope are getting less snowy overall (November, December, January).

November is the snowiest month for all years, but it’s declining, as is snow in December and January. October, February, March and April are increasing. From these results, it appears that we’re getting more snow at the very beginning (October) and at the end of the winter, and less in the middle of the winter.

Yesterday a co-worker and I were talking about how we weren’t able to enjoy the new snow because the weather had turned cold as soon as the snow stopped falling. Along the way, she mentioned that it seemed to her that the really cold winter weather was coming later and later each year. She mentioned years past when it was bitter cold by Halloween.

The first question to ask before trying to determine if there has been a change in the date of the first cold snap is what qualifies as “cold.” My officemate said that she and her friends had a contest to guess the first date when the temperature didn’t rise above -20°F. So I started there, looking for the month and day of the winter when the maximum daily temperature was below -20°F.

I’m using the GHCN-Daily dataset from NCDC, which includes daily minimum and maximum temperatures, along with other variables collected at each station in the database.

When I brought in the data for the Fairbanks Airport, which has data available from 1948 to the present, there was absolutely no relationship between the first -20°F or colder daily maximum and year.

However, when I changed the definition of “cold” to the first date when the
daily *minimum* temperature is below -40, I got a weak (but not statistically
significant) positive trend between date and year.

The SQL query looks like this:

```
SELECT year, water_year, water_doy, mmdd, temp
FROM (
SELECT year, water_year, water_doy, mmdd, temp,
row_number() OVER (PARTITION BY water_year ORDER BY water_doy) AS rank
FROM (
SELECT extract(year from dte) AS year,
extract(year from dte + interval '92 days') AS water_year,
extract(doy from dte + interval '92 days') AS water_doy,
to_char(dte, 'mm-dd') AS mmdd,
sum(CASE WHEN variable = 'TMIN'
THEN raw_value * raw_multiplier
ELSE NULL END
) AS temp
FROM ghcnd_obs
INNER JOIN ghcnd_variables USING(variable)
WHERE station_id = 'USW00026411'
GROUP BY extract(year from dte),
extract(year from dte + interval '92 days'),
extract(doy from dte + interval '92 days'),
to_char(dte, 'mm-dd')
ORDER BY water_year, water_doy
) AS foo
WHERE temp < -40 AND temp > -80
) AS bar
WHERE rank = 1
ORDER BY water_year;
```

I used “water year” instead of the actual year because the winter is split between two years. The water year starts on October 1st (we’re in the 2013 water year right now, for example), which converts a split winter (winter of 2012/2013) into a single year (2013, in this case). To get the water year, you add 92 days (the sum of the days in October, November and December) to the date and use that as the year.

Here’s what it looks like (click on the image to view a PDF version):

The dots are the observed date of first -40° daily minimum temperature for each
water year, and the blue line shows a linear regression model fitted to the data
(with 95% confidence intervals in grey). Despite the scatter, you can see a
slightly positive slope, which would indicate that colder temperatures in
Fairbanks *are* coming later now, than they were in the past.

As mentioned, however, our eyes often deceive us, so we need to look at the
regression model to see if the visible relationship is significant. Here’s the
R `lm` results:

Call: lm(formula = water_doy ~ water_year, data = first_cold) Residuals: Min 1Q Median 3Q Max -45.264 -15.147 -1.409 13.387 70.282 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -365.3713 330.4598 -1.106 0.274 water_year 0.2270 0.1669 1.360 0.180 Residual standard error: 23.7 on 54 degrees of freedom Multiple R-squared: 0.0331, Adjusted R-squared: 0.01519 F-statistic: 1.848 on 1 and 54 DF, p-value: 0.1796

The first thing to check in the model summary is the p-value for the entire model on the last line of the results. It’s only 0.1796, which means that there’s an 18% chance of getting these results simply by chance. Typically, we’d like this to be below 5% before we’d consider the model to be valid.

You’ll also notice that the coefficient of the independent variable
(`water_year`) is positive (0.2270), which means the model predicts that the
earliest cold snap is 0.2 days later every year, but that this value is not
significantly different from zero (a p-value of 0.180).

Still, this seems like a relationship worth watching and investigating further. It might be interesting to look at other definitions of “cold,” such as requiring three (or more) consecutive days of -40° temperatures before including that period as the earliest cold snap. I have a sense that this might reduce the year to year variation in the date seen with the definition used here.

I re-ran the analysis of my ski speeds discussed in an earlier post. The model looks like this:

```
lm(formula = mph ~ season_days + temp, data = ski)
Residuals:
Min 1Q Median 3Q Max
-1.76466 -0.20838 0.02245 0.15600 0.90117
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.414677 0.199258 22.156 < 2e-16 ***
season_days 0.008510 0.001723 4.938 5.66e-06 ***
temp 0.027334 0.003571 7.655 1.10e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.428 on 66 degrees of freedom
Multiple R-squared: 0.5321, Adjusted R-squared: 0.5179
F-statistic: 37.52 on 2 and 66 DF, p-value: 1.307e-11
```

What this is saying is that about half the variation in my ski speeds can be
explained by the temperature when I start skiing and how far along in the season
we are (`season_days`). Temperature certainly makes sense—I was reminded of
how little glide you get at cold temperatures skiing to work this week at -25°F.
And it’s gratifying that my speeds are increasing as the season goes on. It’s
not my imagination that my legs are getting stronger and my technique better.

The following figure shows the relationship of each of these two variables
(`season_days` and `temp`) to the average speed of the trip. I used the
`melt` function from the `reshape` package to make the plot:

```
melted <- melt(data = ski,
measure.vars = c('season_days', 'temp'),
id.vars = 'mph')
q <- ggplot(data = melted, aes(x = value, y = mph))
q + geom_point()
+ facet_wrap(~ variable, scales = 'free_x')
+ stat_smooth(method = 'lm')
+ theme_bw()
```

Last week I replaced by eighteen-year-old ski boots with a new pair, and they’re hurting my ankles a little. Worse, the first four trips with my new boots were so slow and frustrating that I thought maybe I’d made a mistake in the pair I’d bought. My trip home on Friday afternoon was another frustrating ski until I stopped and applied warmer kick wax and had a much more enjoyable mile and a half home. There are a lot of other unmeasured factors including the sort of snow on the ground (fresh snow vs. smooth trail vs. a trail ripped up by snowmachines), whether I applied the proper kick wax or not, whether my boots are hurting me, how many times I stopped to let dog teams by, and many other things I can’t think of. Explaining half of the variation in speed is pretty impressive.

January 2012 was a historically cold month in Fairbanks, the fifth-coldest in more than 100 years of records. According to the National Weather Service office in Fairbanks:

January 2012 was the coldest month in more than 40 years in Fairbanks. Not since January 1971 has the Fairbanks area endured a month as cold as this.

The average high temperature for January was 18.2 below and the average low was 35 below. The monthly average temperature of 26.9 below was 19 degrees below normal and made this the fifth coldest January of record. The coldest January of record was 1971, when the average temperature was 31.7 below. The highest temperature at the airport was 21 degrees on the 10th, one of only three days when the temperature rose above zero. This ties with 1966 as the most days in January with highs of zero or lower. There were 16 days with a low temperature of 40 below or lower. Only four months in Fairbanks in more than a century of weather records have had more 40 below days. The lowest temperature at the airport was 51 below on the 29th.

Here’s a figure showing some of the relevant information:

The vertical bars show how much colder (or warmer for the red bars) the average daily temperature at the airport was compared with the 30-year average. You can see from these bars that we had only four days where the temperature was slightly above normal. The blue horizontal line shows the average anomaly for the period, and the orange (Fairbanks airport) and dark cyan (Goldstream Creek) horizontal lines show the actual average temperatures over the period. The average temperature at our house was -27.7°F for the month of January.

Finally, the circles and + symbols represent the minimum daily temperatures recorded at the airport (orange) and our house (dark cyan). You can see the two days late in the month where we got down to -54 and -55°F; cold enough that the propane in our tank remained a liquid and we couldn’t use our stove without heating up the tank.

No matter how you slice it, it was a very cold month.

Here’s some of the R code used to make the plot:

```
library(lubridate)
library(ggplot2)
library(RPostgreSQL)
# (Read in dw1454 data here)
dw_1454$date <- force_tz(
ymd(as.character(dw_1454$date)),
tzone = "America/Anchorage")
dw_1454$label <- 'dw1454 average'
# (Read FAI data here)
plot_data$line_color <- as.factor(
as.numeric(plot_data$avg_temp_anomaly > 0))
plot_data$anomaly <- as.factor(
ifelse(plot_data$line_color == 0,
"degrees colder",
"degrees warmer"))
plot_data$daily <- 'FAI average'
q <- ggplot(data = plot_data,
aes(x = date + hours(9))) # TZ?
q + geom_hline(y = avg_mean_anomaly,
colour = "blue", size = 0.25) +
geom_hline(y = avg_mean_pafg,
colour = "orange", size = 0.25) +
geom_hline(y = avg_mean_dw1454,
colour = "darkcyan", size = 0.25) +
geom_linerange(aes(ymin = avg_temp_anomaly,
ymax = 0, colour = anomaly)) +
theme_bw() +
scale_y_continuous(name = "Temperature (degrees F)") +
scale_color_manual(name = "Daily temperature",
c("degrees colder" = "blue",
"degrees warmer" = "red",
"FAI average" = "orange",
"dw1454 average" = "darkcyan")) +
scale_x_datetime(name = "Date") +
geom_point(aes(y = min_temp,
colour = daily), shape = 1, size = 1) +
geom_point(data = dw_1454,
aes(x = date, y = dw1454_min,
colour = label), shape = 3, size = 1) +
opts(title = "Average Daily Temperature Anomaly") +
geom_text(aes(x = ymd('2012-01-31'),
y = avg_mean_dw1454 - 1.5),
label = round(avg_mean_dw1454, 1),
colour = "darkcyan", size = 4)
```

This morning I skied to work at the coldest temperatures I’ve ever attempted (-31°F when I left). We also got more than an inch of snow yesterday, so not only was it cold, but I was skiing in fresh snow. It was the slowest 4.1 miles I’d ever skied to work (57+ minutes!) and as I was going, I thought about what factors might explain how fast I ski to and from work.

Time to fire up R and run some PostgreSQL queries. The first query grabs the skiing data for this winter:

```
SELECT start_time,
(extract(epoch from start_time) - extract(epoch from '2011-10-01':date))
/ (24 * 60 * 60) AS season_days,
mph,
dense_rank() OVER (
PARTITION BY
extract(year from start_time)
|| '-' || extract(week from start_time)
ORDER BY date(start_time)
) AS week_count,
CASE WHEN extract(hour from start_time) < 12 THEN 'morning'
ELSE 'afternoon'
END AS time_of_day
FROM track_stats
WHERE type = 'Skiing'
AND start_time > '2011-07-03' AND miles > 3.9;
```

This yields data that looks like this:

start_time | season_days | miles | mph | week_count | time_of_day |
---|---|---|---|---|---|

2011-11-30 06:04:21 | 60.29469 | 4.11 | 4.70 | 1 | morning |

2011-11-30 15:15:43 | 60.67758 | 4.16 | 4.65 | 1 | afternoon |

2011-12-02 06:01:05 | 62.29242 | 4.07 | 4.75 | 2 | morning |

2011-12-02 15:19:59 | 62.68054 | 4.11 | 4.62 | 2 | afternoon |

Most of these are what you’d expect. The unconventional ones are
`season_days`, the number of days (and fraction of a day) since October 1st
2011; `week_count`, the count of the number of days in that week that I skied.
What I really wanted `week_count` to be was the number of days in a row I’d
skied, but I couldn’t come up with a quick SQL query to get that, and I think
this one is pretty close.

I got this into R using the following code:

```
library(lubridate)
library(ggplot2)
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname=...)
ski <- dbGetQuery(con, query)
ski$start_time <- ymd_hms(as.character(ski$start_time))
ski$time_of_day <- factor(ski$time_of_day, levels = c('morning', 'afternoon'))
```

Next, I wanted to add the temperature at the start time, so I wrote a function in R that grabs this for any date passed in:

```
get_temp <- function(dt) {
query <- paste("SELECT ... FROM arduino WHERE obs_dt > '",
dt,
"' ORDER BY obs_dt LIMIT 1;", sep = "")
temp <- dbGetQuery(con, query)
temp[[1]]
}
```

The query is simplified, but the basic idea is to build a query that finds the next temperature observation after I started skiing. To add this to the existing data:

```
temps <- sapply(ski[,'start_time'], FUN = get_temp)
ski$temp <- temps
```

Now to do some statistics:

```
model <- lm(data = ski, mph ~ season_days + week_count + time_of_day + temp)
```

Here’s what I would expect. I’d think that `season_days` would be positively
related to speed because I should be getting faster as I build up strength and
improve my skill level. `week_count` should be negatively related to speed
because the more I ski during the week, the more tired I will be. I’m not sure
if `time_of_day` is relevant, but I always get the sense that I’m faster on
the way home so `afternoon` should be positively associated with speed.
Finally, `temp` should be positively associated with speed because the glide
you can get from a properly waxed pair of skis decreases as the temperature
drops.

Here's the results:

```
summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.143760 0.549018 7.548 1.66e-08 ***
season_days 0.006687 0.006097 1.097 0.28119
week_count 0.201717 0.087426 2.307 0.02788 *
time_of_dayafternoon 0.137982 0.143660 0.960 0.34425
temp 0.021539 0.007694 2.799 0.00873 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4302 on 31 degrees of freedom
Multiple R-squared: 0.4393, Adjusted R-squared: 0.367
F-statistic: 6.072 on 4 and 31 DF, p-value: 0.000995
```

The model is significant, and explains about 37% of the variation in speed. The
only variables that are significant are `week_count` and `temp`, but oddly,
`week_count` is *positively* associated with speed, meaning the more I
ski during the week, the faster I get by the end of the week. That doesn’t make
any sense, but it may be because the variable isn’t a good proxy for the
“consecutive days” variable I was hoping for. Temperature *is* positively
associated with speed, which means that I ski faster when it’s warmer.

The other refinement to this model that might have a big impact would be to add a variable for how much snow fell the night before I skied. I am fairly certain that the reason this morning’s -31°F ski was much slower than my return home at -34°F was because I was skiing on an inch of fresh snow in the morning and had tracks to ski in on the way home.