Yesterday I discovered (and ordered!) a new book from O'Reilly called Baseball Hacks by Joseph Alder. I've got a bookshelf full of O'Reilly books on other computer subjects, so I'm very excited to see this. On the web site for the book, there are a couple example hacks from the book.
Last year I spent some time getting the Lahman database into MySQL so I could fool around with some advanced baseball statistics. The Lahman database is a Microsoft Access database, and doesn't allow re-distribution, so for an open-source advocate like me, this isn't exactly the best source for baseball information. It took me a few days to get it all into MySQL successfully, and any of my improvements couldn't be distributed.
Well from reading the sample hacks, I discovered there's a less restrictive database that's also available for MySQL (a free database server). In addition, the author of Baseball Hacks shows how to connect a MySQL database with the fantastic statistical package R. R is also free, and is incredibly powerful. I also found previous article by the same author. Some of what appears below is based on that article.
Anyway, I can't wait to get the book to see what's in it, but meantime I did a very simple analysis comparing payroll to wins for the 2005 season. For the 2005 season, team payroll numbers range from a low of $29.7 million for the Tampa Bay Devil Rays to the Yankee's astronomical payroll of $208.3 million. The second place team, the Red Sox, spent only $123.5 million on player payrolls in 2005. What does all that money buy? I'm sure the owners hope it'll buy them enough wins to make it to the playoffs, and hopefully win the World Series. The White Sox, winners in 2005, were 13th in payroll at $75.2 million.
It turns out that payroll doesn't really account for a lot of whether a team wins or loses. It explained only 24% of the variation in wins in 2005. For comparison, a team's hits and earned run average explains 72% of the variation in wins. Obviously, getting lots of hits, and keeping your opponent from scoring runs will contribute to winning a lot of games.
But what I want to see is whether a team did better than expected based on their player spending. The Yankees didn't wind up with the best record in baseball, despite spending more than twice as much as every other team in baseball except the Red Sox. How badly did they under-perform?
Not that badly, actually. The plot below shows the relationship between payroll and wins for 2005. The straight line is the regression line showing the best linear fit to the data. The team letters on the plot show how they actually performed. Teams that show up above the line, played better than their salaries would have predicted. Those below, did much worse.
For example, look how far the Chicago White Sox (CHA) are from the regression line. The Cardinals also wind up well above what we would expect based solely on their salaries (and that's with Scott Rolen on the DL the whole season!). Also check out the Cleveland Indians. They're a team that has a lot of very good younger players who aren't eligible for arbitration yet, but have loads of talent.
You can see the Yankees over on the right, far from all the other teams. Based on their payroll, they should have won 102 games in 2005, but only managed 95. The Kansas City Royals were much worse, only managing to win 56 games when their player salaries predicted 75 wins. It's easy to explain why teams like the Dodgers or Giants didn't do well in 2005---their high paid players were injured for most of the year---but something else must be going on with Seattle and Kansas City.
What does all this tell us about baseball? Well, I'd argue that this metric (payroll vs. wins) tells us something about how effective the front office of a team is. Smart general managers will pick up talent that is undervalued by the market, buying more wins than they're paying for. Also, teams with a good farm system can "grow their own" talent, rather than having to buy it on the free market. Teams like Cleveland and Oakland are good examples of this. The excesses of George Steinbrenner should have been enough to buy a World Series championship, but the Yankee front office overpaid for all their veteran talent, and in 2005, they didn't live up to their high salaries.
If you want to see the R code I used to generate the plot, you can download it from the link.
I've been reading William T. Vollmann's Europe Central, which is a historical fiction concerned with the Soviet and German regimes of the middle 20th century. It's really good so far, but as with a lot of Vollmann's fiction, it's helpful to read some actual history to fill in the blanks. For example, he continually refers to Hitler as "the sleepwalker", and to this point it's not clear why (syphilis, methamphetamine addiction?).
If anyone reproaches me and asks why I did not resort to the regular courts of justice, then all I can say is this: In this hour I was responsible for the fate of the German people, and thereby I became the supreme judge of the German people. ---William L. Shirer. 1959. The Rise and Fall of the Third Reich. Simon and Schuster. New York, quoted in the Wikipedia.
A powerful member of the executive branch of government breaking the law without judicial oversight. Sound familiar?
Yesterday I was searching the New Yorker DVDs and found an article by Jay McInerney. I enjoyed Bright Lights, Big City, so I went back to the search to see what else he'd written for the magazine (check out Philomena in the Dec 25, 1995 / Jan 1, 1996 issue). But I couldn't find his name in the Authors list. It turns out this is because when his name was entered into the database, he was entered as Mclnerney, Jay rather than as McInerney, Jay. The font in the viewer (as well as the New Yorker's titling font) makes an 'I' (upper-case 'eye') look just like an 'l' (lower case 'ell'), so it's hard to see the error unless you look in the database itself. (In fact, it may be hard to see in this post depending on what sans-serif font you're reading this in!)
The fix is a very simple one if you have the sqlite3 interface to the database:
> sqlite3.exe ny-sqlite-3.db sqlite3> update Authors set Author="McInerney, Jay" where AuthorID=13374; sqlite3> .quit
It's probably a good idea to make a copy of the ny-sqlite-3.db file before manipulating the database. On a Windows machine, this file is in C:\Program Files\New Yorker Viewer\.
I did send email to the New Yorker DVD technical support, but I don't know if this is the sort of thing they'll fix on their own or not. I hope so. Luckily, we can fix it ourselves!
A few hours ago I wrote about a script I wrote that displays your New Yorker DVD reading lists in a convenient format. I just finished writing a similar script (again using Python and the pysqlite python module) which dumps the SQL statements you'd need to restore a reading list to the database. To restore the data to the database you'll also need the command line sqlite3 tool.
Use the script, reading_list_dump.py, to periodically dump your readings lists to a file:
$ ./reading_list_dump.py CSS > css_reading_list.sql
The file you created (css_reading_list.sql in this example) will contain SQL like:
INSERT INTO ReadingListEntry VALUES ( 700, 7, 181437 ); INSERT INTO ReadingListEntry VALUES ( 701, 7, 154899 ); INSERT INTO ReadingLists VALUES ( 7, 'CSS' );
To insert them into the database, close the Viewer program and use the sqlite3 command line program to upload the file:
$ sqlite3 -init css_reading_list.sql ny-sqlite-3.db
One important note: The table that's being updated (ReadingListEntry) has a primary key field that has to be unique in the table. It's the first number in the INSERT statements above (700, 701, etc.). Before running the sqlite3 command, you should make sure that there are no rows in the ReadingListEntry table that match the data you're inserting. If there are, you'll need to edit the dump file and change those numbers until they don't conflict with the existing table.
To find out what the largest value in the ReadingListEntry table is, run the following commands:
$ sqlite3 ny-sqlite-3.db sqlite3> SELECT ReadingListEntryID FROM ReadingListEntry ORDER BY ReadingListEntryID DESC LIMIT 1; 585
All the values for the first field in the dump file for the ReadingListEntry table should be larger than this number.
The New Yorker DVD's allow you to create your own personal reading lists, which are lists of articles that make it easy to access the articles you want to come back to. You can even restrict your searching to your own reading list. The DVD comes with several reading lists already loaded, including one for all the articles mentioned in the hardcover book that comes with the disks.
As it turns out, the database used by the viewer program is a SQLite3 database. It's a public domain database, and the New Yorker hasn't encrypted it so you can view the data in the database with your own custom program, or with a SQLite command line tool available from the SQLite3 web site. You can also extract the data for your reading list, in case something goes wrong with the database and you need to reinstall it.
I wrote a simple Python script, reading_lists.py that allows you to generate a formatted display of all the articles in your reading list. To use it you'll need a standard Python installation, and the pysqlite Python module.
To use is, you'll need to find the database file, which is at C:\Program Files\New Yorker Viewer\ny-sqlite-3.db in a default Windows install. I'm not sure where it installs on a Mac, but the filename is probably the same. Then just run the program specifying the database file, the reading list, and the other options the program offers. The result is a list of the articles complete with abstract and keywords.
Here's the usage statement:
Usage: ./reading_lists.py [-h] [-v] [-c columns] [-d database ] [-r list] [-y year] -v Verbose -h This help page -c columns Number of columns in the output (72 defautl) -d database Full path to the database file -r list Name of reading list (escape spaces or quote list) -y year Year cutoff (negative means article before listed year)
and a sample of the output:
------------------------------------------------------------------------ The Financial Page: The Catastrophe Problem. Surowiecki, James The Talk of the Town Jan. 10, 2005 (Disk_1_1998_2005)
THE FINANCIAL PAGE about the difficulties presented by natural disasters and catastrophes for the insurance industry. . . Writer notes that, in dollar terms, the recent Indian Ocean tsunamis impact was relatively small. . .the affected countries are relatively poor and the businesses there are less likely to have insurance. . . This past year was the most expensive in history for insurers. . . Discusses the actuarial nature of the insurance business: its about distributing risk over a large group of people whose behavior is collectively predictable. . . Catastrophes are not nearly as predictable. The same goes for man-made catastrophes such as September 11th. Insurers can make forecasts as long as they take a much longer view. Notes that even after September 11th, the number of companies that purchased terrorism insurance was relatively small. . .
Keywords: Actuarial Businesses; Hurricanes; September 11th, 2001 (9/11); Catastrophes; Insurance; Terrorism; Asia; Natural Disasters; Tsunamis; Business;
Note that if your database did get corrupted, you'd still have to manually recreate your reading list. I'll probably write a second Python script that will generate all the SQL INSERT statements necessary to recreate your reading lists if something went wrong with the database.