This morning I skied to work at the coldest temperatures I’ve ever attempted (-31°F when I left). We also got more than an inch of snow yesterday, so not only was it cold, but I was skiing in fresh snow. It was the slowest 4.1 miles I’d ever skied to work (57+ minutes!) and as I was going, I thought about what factors might explain how fast I ski to and from work.

Time to fire up R and run some PostgreSQL queries. The first query grabs the skiing data for this winter:

```
SELECT start_time,
       -- days (with fraction) elapsed since October 1st 2011
       (extract(epoch from start_time) - extract(epoch from '2011-10-01'::date))
           / (24 * 60 * 60) AS season_days,
       miles,
       mph,
       -- rank of each ski day within its calendar week
       dense_rank() OVER (
           PARTITION BY
               extract(year from start_time)
               || '-' || extract(week from start_time)
           ORDER BY date(start_time)
       ) AS week_count,
       CASE WHEN extract(hour from start_time) < 12 THEN 'morning'
            ELSE 'afternoon'
       END AS time_of_day
FROM track_stats
WHERE type = 'Skiing'
  AND start_time > '2011-07-03' AND miles > 3.9;
```

This yields data that looks like this:

| start_time | season_days | miles | mph | week_count | time_of_day |
|---|---|---|---|---|---|
| 2011-11-30 06:04:21 | 60.29469 | 4.11 | 4.70 | 1 | morning |
| 2011-11-30 15:15:43 | 60.67758 | 4.16 | 4.65 | 1 | afternoon |
| 2011-12-02 06:01:05 | 62.29242 | 4.07 | 4.75 | 2 | morning |
| 2011-12-02 15:19:59 | 62.68054 | 4.11 | 4.62 | 2 | afternoon |

Most of these are what you’d expect. The unconventional ones are
`season_days`, the number of days (and fraction of a day) since October 1st
2011, and `week_count`, which ranks each ski day within its calendar week: 1
for the first day I skied that week, 2 for the second, and so on. What I
really wanted `week_count` to be was the number of days in a row I’d skied,
but I couldn’t come up with a quick SQL query to get that, and I think this
one is pretty close.
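The consecutive-days count could also be computed in R once the data is loaded, rather than in SQL. This is only a minimal sketch and is not what the model in this post uses; `consecutive_days` is a hypothetical helper name, and it assumes the dates passed in are already in the time zone you care about.

```r
# Hypothetical helper: for each date, how many days in a row (ending on
# that date) had at least one ski. Not part of the original analysis.
consecutive_days <- function(dates) {
  dates <- as.Date(dates)
  days <- sort(unique(dates))
  # A new streak starts wherever the gap to the previous ski day exceeds one day
  new_streak <- c(TRUE, as.numeric(diff(days)) > 1)
  streak_id <- cumsum(new_streak)
  # Position of each ski day within its streak (1, 2, 3, ...)
  position <- ave(seq_along(days), streak_id, FUN = seq_along)
  # Map the per-day result back onto the original (possibly repeated) dates
  position[match(dates, days)]
}

# Made-up dates for illustration: morning and afternoon trips share a date
example_dates <- c("2011-11-30", "2011-11-30", "2011-12-02",
                   "2011-12-03", "2011-12-04", "2011-12-06")
consecutive_days(example_dates)
```

Unlike `week_count`, this resets after any day off and keeps counting across a week boundary.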

I got this into R using the following code:

```
library(lubridate)
library(ggplot2)
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname=...)
# `query` holds the SQL statement shown above as a character string
ski <- dbGetQuery(con, query)
ski$start_time <- ymd_hms(as.character(ski$start_time))
ski$time_of_day <- factor(ski$time_of_day, levels = c('morning', 'afternoon'))
```

Next, I wanted to add the temperature at the start time, so I wrote a function in R that grabs this for any date passed in:

```
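get_temp <- function(dt) {
    # Build a query for the first observation recorded after dt
    query <- paste("SELECT ... FROM arduino WHERE obs_dt > '",
                   dt,
                   "' ORDER BY obs_dt LIMIT 1;", sep = "")
    temp <- dbGetQuery(con, query)
    # Return the first column of the single-row result
    temp[[1]]
}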
```

The query is simplified, but the basic idea is to build a query that finds the next temperature observation after I started skiing. To add this to the existing data:

```
temps <- sapply(ski[,'start_time'], FUN = get_temp)
ski$temp <- temps
```
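Running one query per row is fine for a few dozen trips. If the temperature observations were already loaded into R, the same "next observation after the start time" lookup could be sketched with `findInterval`. The function name `next_obs_temp` and its arguments are assumptions for illustration, not part of the code above.

```r
# Hypothetical in-memory version of the lookup: for each start time, return
# the temperature of the first observation strictly after it.
next_obs_temp <- function(start_times, obs_dt, obs_temp) {
  ord <- order(obs_dt)
  obs_dt <- obs_dt[ord]
  obs_temp <- obs_temp[ord]
  # findInterval counts the observations at or before each start time,
  # so adding one gives the index of the next observation
  idx <- findInterval(as.numeric(start_times), as.numeric(obs_dt)) + 1
  # No later observation available: return NA for that start time
  idx[idx > length(obs_dt)] <- NA
  obs_temp[idx]
}

# Made-up observations for illustration
obs_dt <- as.POSIXct(c("2011-11-30 06:00:00", "2011-11-30 06:05:00",
                       "2011-11-30 06:10:00"), tz = "UTC")
obs_temp <- c(-5, -6, -7)
starts <- as.POSIXct(c("2011-11-30 06:04:21", "2011-11-30 06:11:00"), tz = "UTC")
next_obs_temp(starts, obs_dt, obs_temp)
```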

Now to do some statistics:

```
model <- lm(data = ski, mph ~ season_days + week_count + time_of_day + temp)
```

Here’s what I would expect. I’d think that `season_days` would be positively
related to speed because I should be getting faster as I build up strength and
improve my skill level. `week_count` should be negatively related to speed
because the more I ski during the week, the more tired I will be. I’m not sure
if `time_of_day` is relevant, but I always get the sense that I’m faster on
the way home so `afternoon` should be positively associated with speed.
Finally, `temp` should be positively associated with speed because the glide
you can get from a properly waxed pair of skis decreases as the temperature
drops.

Here are the results:

```
summary(model)
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          4.143760   0.549018   7.548 1.66e-08 ***
season_days          0.006687   0.006097   1.097  0.28119
week_count           0.201717   0.087426   2.307  0.02788 *
time_of_dayafternoon 0.137982   0.143660   0.960  0.34425
temp                 0.021539   0.007694   2.799  0.00873 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4302 on 31 degrees of freedom
Multiple R-squared: 0.4393, Adjusted R-squared: 0.367
F-statistic: 6.072 on 4 and 31 DF, p-value: 0.000995
```

The model is significant, and explains about 37% of the variation in speed
(adjusted R-squared of 0.367). The only variables that are significant are
`week_count` and `temp`, but oddly, `week_count` is *positively* associated
with speed, meaning the more I ski during the week, the faster I get by the
end of the week. That doesn’t make any sense, but it may be because the
variable isn’t a good proxy for the “consecutive days” variable I was hoping
for. Temperature *is* positively associated with speed, which means that I
ski faster when it’s warmer.
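To get a feel for the size of the temperature effect, the coefficient estimates can be plugged back into the model equation by hand. This is a rough sketch: the coefficients are copied from the summary output, but the inputs (day 60 of the season, first ski of the week, a morning trip) are illustrative values I picked, and `predict_mph` is a hypothetical helper.

```r
# Coefficient estimates copied from the summary output
coefs <- c(intercept = 4.143760, season_days = 0.006687,
           week_count = 0.201717, afternoon = 0.137982, temp = 0.021539)

# Hypothetical helper: predicted speed (mph) for a single trip
predict_mph <- function(season_days, week_count, afternoon, temp) {
  unname(coefs["intercept"] +
           coefs["season_days"] * season_days +
           coefs["week_count"] * week_count +
           coefs["afternoon"] * as.numeric(afternoon) +
           coefs["temp"] * temp)
}

# Illustrative inputs (not rows from the data): identical trips, 30°F apart
warm <- predict_mph(season_days = 60, week_count = 1, afternoon = FALSE, temp = 0)
cold <- predict_mph(season_days = 60, week_count = 1, afternoon = FALSE, temp = -30)

# Difference in predicted speed: 30 * 0.021539, roughly 0.65 mph
warm - cold

# Predicted minutes to cover 4.1 miles at each speed
4.1 / c(warm = warm, cold = cold) * 60
```

With the fitted model object available, `predict(model, newdata = ...)` would give the same numbers without copying coefficients by hand.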

Another refinement that might have a big impact would be to add a variable for how much snow fell the night before I skied. I am fairly certain that this morning’s -31°F ski was much slower than my return home at -34°F because I was skiing on an inch of fresh snow in the morning and had tracks to ski in on the way home.