Metropolitan statistical areas (MSAs) are well-defined geographic regions used by many government datasets. However, they can be difficult to work with when building a statistical model. Here, we provide simplified polygon geometries as well as a number of files that facilitate mappings from zipcode, county, city, and state into MSAs.
For latlong point data, mapping a location to an MSA reduces to an elementary computational geometry problem: point-in-polygon. Solving it, however, requires the MSA geometries. This geometric information is available from TIGER/Line shapefiles, but the polygons are often far more detailed than necessary and are split across CBSA and NECTA for unintelligible historical reasons. All told, the relevant files exceed 50MB.
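As a minimal sketch of the point-in-polygon test, the function below uses even-odd ray casting on a polygon given as a list of (longitude, latitude) pairs. The function name and the rectangle are made up for illustration; the rectangle is not a real MSA boundary, and points lying exactly on an edge are not treated specially.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Return True if point lies inside polygon (even-odd ray casting)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the ring
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point.
            if x < x_cross:
                inside = not inside
    return inside


# A made-up rectangle, not a real MSA boundary.
square = [(-75.0, 40.0), (-73.0, 40.0), (-73.0, 41.0), (-75.0, 41.0)]
print(point_in_polygon((-74.0, 40.5), square))  # True
print(point_in_polygon((-72.0, 40.5), square))  # False
```

Treating longitude and latitude as planar coordinates is an approximation, but it is usually adequate at the scale of a metro area.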
Below, we include simplified polygons for all the MSA regions. The simplification was carried out with an implementation of the Douglas-Peucker algorithm. The tolerance was chosen using an eyeball test: the difference between the simplification and the original was imperceptible. The result is a 1MB text file organized as follows:
% CBSAFP NLINES    # a comment line containing metadata
LONG LAT           # NLINES of latlong point data associated with the CBSAFP
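A reader for this layout might look like the sketch below. It assumes that fields are separated by whitespace, that the trailing "#" annotations above are explanatory and do not appear in the file itself, and that a CBSAFP may occur in more than one header (so each code maps to a list of rings). The function name and the code 99999 are made up for illustration.

```python
import io
from collections import defaultdict
from typing import Dict, IO, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def read_msa_polygons(handle: IO[str]) -> Dict[str, List[List[Point]]]:
    """Parse '% CBSAFP NLINES' headers, each followed by NLINES 'LONG LAT' rows."""
    polygons: Dict[str, List[List[Point]]] = defaultdict(list)
    lines = (line.strip() for line in handle)
    for line in lines:
        if not line:
            continue  # tolerate blank lines between polygons
        if not line.startswith("%"):
            raise ValueError(f"expected a '%' header line, got: {line!r}")
        cbsafp, nlines = line.lstrip("%").split()[:2]
        ring: List[Point] = []
        for _ in range(int(nlines)):
            row = next(lines, None)  # consume point rows from the same iterator
            if row is None:
                raise ValueError(f"file ended inside polygon {cbsafp}")
            lon, lat = row.split()[:2]
            ring.append((float(lon), float(lat)))
        polygons[cbsafp].append(ring)
    return dict(polygons)


# Made-up sample in the layout described above; 99999 is not a real CBSAFP.
sample = io.StringIO("% 99999 3\n-75.0 40.0\n-73.0 40.0\n-74.0 41.0\n")
polygons = read_msa_polygons(sample)
print(polygons["99999"][0])
```

CBSAFP codes are kept as strings so that any leading zeros survive the round trip.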
Often, locational information is specified by a mailing address. If the address includes a zipcode, that zipcode can be directly mapped to MSA. The following CSV file contains all zipcode-MSA pairs.
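A lookup along these lines could be sketched as follows. The column names "zipcode" and "msa" are assumptions about the CSV header, and the sample rows and MSA labels are invented. The normalization step is there because zipcodes often lose leading zeros when stored as numbers, or arrive in ZIP+4 form.

```python
import csv
import io
from typing import Dict, IO


def normalize_zip(raw: str) -> str:
    """Keep the 5-digit part of a ZIP or ZIP+4 and restore leading zeros."""
    return raw.strip().split("-")[0].zfill(5)


def load_zip_to_msa(handle: IO[str]) -> Dict[str, str]:
    # "zipcode" and "msa" are assumed column names, not confirmed by the file.
    return {
        normalize_zip(row["zipcode"]): row["msa"]
        for row in csv.DictReader(handle)
    }


# Invented rows: the first zipcode has lost its leading zero.
sample = io.StringIO("zipcode,msa\n1234,MSA_A\n56789,MSA_B\n")
zip_to_msa = load_zip_to_msa(sample)
print(zip_to_msa.get(normalize_zip("01234-5678")))  # MSA_A
print(zip_to_msa.get(normalize_zip("99999")))       # None
```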
Sometimes locational information is only a state and county. That information can be mapped to MSA as well. The following CSV file contains all county-state-MSA pairs. There is also an additional column with a numerical geoid code that is often also useful when working with US Census data.
In some applications, zipcode will have been removed from an address.
One such dataset is the Lending Club dataset, where zipcode has been deemed an illegal input into a borrower's risk profile. The argument, as I understand it, is that the last two digits of the zipcode are often correlated with neighborhood and, therefore, race. However, Lending Club still gives out city and state.
If, for instance, you need to know that Rahway, NJ and Yonkers, NY are both in the New York City metro area, you need a mapping from place name to MSA. The following CSV file contains placename-state-MSA pairs.
NOTE: this is a good collection of place names, but it is not exhaustive and contains some unexpected omissions. For example, Brooklyn does not have an entry in this table.
This file is included in case you want to map metro names to MSA identifiers.
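For the placename-state-MSA pairs, a lookup might be sketched as below. The column names "place", "state", and "msa" are assumptions about the CSV header, and the label MSA_NYC is a placeholder rather than a real identifier. Because the table is not exhaustive, the lookup returns None for a missing place instead of raising.

```python
import csv
import io
import re
from typing import Dict, IO, Optional, Tuple

Key = Tuple[str, str]  # (normalized place name, upper-case state)


def place_key(place: str, state: str) -> Key:
    """Build a key that ignores case, punctuation, and extra spacing."""
    name = re.sub(r"[^a-z0-9 ]", "", place.lower())
    return " ".join(name.split()), state.strip().upper()


def load_place_to_msa(handle: IO[str]) -> Dict[Key, str]:
    # "place", "state", and "msa" are assumed column names.
    return {
        place_key(row["place"], row["state"]): row["msa"]
        for row in csv.DictReader(handle)
    }


def lookup_msa(table: Dict[Key, str], place: str, state: str) -> Optional[str]:
    """Return the MSA for a place, or None if the table has no entry."""
    return table.get(place_key(place, state))


# Invented rows; MSA_NYC is a placeholder label.
sample = io.StringIO("place,state,msa\nRahway,NJ,MSA_NYC\nYonkers,NY,MSA_NYC\n")
table = load_place_to_msa(sample)
print(lookup_msa(table, " rahway ", "nj"))  # MSA_NYC
print(lookup_msa(table, "Brooklyn", "NY"))  # None
```

Counting the None results over a full dataset is a quick way to see how many records the omissions affect.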