CMPSCI 591Y Spring 2005 Tuesday & Thursday 1:00-2:15 Lederle A339
Ben Carterette
carteret@cs.umass.edu
National Exit Polls
For this preliminary analysis, I downloaded raw data from a national exit poll conducted by Edison Media Research and Mitofsky International. The poll was sponsored by the National Election Poll (NEP), comprising ABC News, Associated Press, CBS News, CNN, Fox News, and NBC News. The NEP wrote the questionnaires. 13,718 individuals were polled either at the polling station or by phone. There were four versions of the in-person poll, each with 26 or 28 questions. The phone poll had more than 45 questions, with subquestions and possibly optional questions making it difficult to get an exact count.
The file I downloaded has 13,718 rows (one for each polled individual) and 115 columns. Some columns were recodings or combinations of other columns; I removed those. I also removed any column that was obviously highly correlated with another column (e.g. "Who would you have voted for in a two-candidate race?" vs. "Who did you vote for?"). I ended up with 65 columns, including info about the questionnaire itself (version #, whether the back was completed, whether it was a telephone poll, etc.), demographic information (state, congressional district, sex, race, age, income, education, etc.), voting info (vote for president and House candidate, vote in 2000), positions on issues (for/against gay marriage, abortion; approve/disapprove handling of Iraq, terrorism; etc.), opinion of candidates (Bush, Kerry favorable/unfavorable), feelings about direction of country (right/wrong track), get-out-the-vote efforts (contacted by Bush/Kerry campaign), and a few miscellaneous items. Because of the differences between versions of the poll, there are 411,720 missing values. Therefore D = 13718*65 - 411720 = 479,950.
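The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names `data_size` and `count_missing` are made up for illustration, and representing a missing answer as `None` is an assumption about how the file would be loaded.

```python
from typing import Optional, Sequence


def data_size(n_rows: int, n_cols: int, n_missing: int) -> int:
    """Return D = N*V - M, the number of values actually present."""
    if n_missing > n_rows * n_cols:
        raise ValueError("more missing values than cells in the table")
    return n_rows * n_cols - n_missing


def count_missing(rows: Sequence[Sequence[Optional[str]]]) -> int:
    """Count cells that are None (the assumed marker for a missing answer)."""
    return sum(1 for row in rows for cell in row if cell is None)


# Figures quoted in the text: 13,718 rows, 65 columns, 411,720 missing values.
print(data_size(13718, 65, 411720))
```

For a table already loaded as a list of rows, `count_missing(rows)` would supply the M term.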
Felix Dollinger
felix@dollingers.de
Income and election outcome by county
Data: I was interested in the relationship between average income and election outcome. I felt that doing this at the state level would probably not be fine-grained enough, so I decided to look for data at the county level. This turned out to be quite challenging, but I finally found some relatively good data on income per county at quickfacts.census.gov.
Quite surprisingly, it wasn't easy to get the county-level election outcome data. The only source I was able to find was cnn.com, but they only had it hidden in really ugly HTML code, not as *.csv or something similarly convenient. So I had to strip it out of 209 HTML files... Another problem came up when I realized that CNN had split the county-level election outcome for some states (CT, MA, MD, ME, NH, RI, VT and one county in IL), so I had to reaggregate them (which was possible because I had absolute vote counts). I ran into some further trouble in HI and AK, so I decided to just ignore these two states. Matching the two data sources wasn't easy either, as county names were often spelled differently in each. Fortunately, CNN used the same FIPS codes as hidden attributes in their HTML code as I found in the quickfacts.census.gov data, so except for the split-up states, matching could be done efficiently.
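The reaggregation step can be sketched roughly as follows. It assumes the sub-county results have already been extracted into (FIPS code, candidate, votes) tuples; that layout, the function names, and the sample numbers are all made up for illustration and are not taken from the CNN pages.

```python
from collections import defaultdict


def aggregate_by_fips(records):
    """Sum absolute vote counts per county.

    records: iterable of (fips, candidate, votes) tuples; a county that was
    split into several pieces simply appears more than once.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for fips, candidate, votes in records:
        totals[fips][candidate] += votes
    return {fips: dict(by_candidate) for fips, by_candidate in totals.items()}


def vote_shares(county_totals):
    """Turn one county's absolute counts into percentages."""
    total = sum(county_totals.values())
    if total == 0:
        return {}
    return {cand: 100.0 * votes / total for cand, votes in county_totals.items()}


# Made-up numbers: one county reported as two separate pieces.
pieces = [
    ("25001", "Bush", 100), ("25001", "Kerry", 150),
    ("25001", "Bush", 50), ("25001", "Kerry", 100),
]
counties = aggregate_by_fips(pieces)
print(vote_shares(counties["25001"]))
```

Summing absolute counts first and computing percentages afterwards avoids averaging percentages across pieces of different sizes.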
I ended up with data for a total of 3109 distinct counties. For each of them, I have the median household income; the absolute vote counts for every candidate in the county (and hence the exact percentages as well), Bush and Kerry in particular; the state the county belongs to; and some further figures about poverty (people of all ages in poverty, people age 0-17 in poverty, related children age 5-17 in families in poverty, and people under age 5 in poverty, all as absolute and percentage values). After getting rid of AK and HI, there is no missing data in either of the data sources, and the election outcome should be accurate for all counties (at least the percentages added up to 100% for all counties).
Alex Epshteyn
aepshtey@student.umass.edu
Exit Polls
Source: ftp://ftp.icpsr.umich.edu/pub/FastTrack/General_Election_Exit_Polls2004
Each instance (row) in the data is essentially a tuple of responses of an individual to an exit poll questionnaire. Each questionnaire consisted of about 25 questions. Some of the questions were the same on all questionnaires, such as age, race, sex, and who the responder voted for. Other questions differed depending on which version of a questionnaire a responder was given, in order to keep the number of questions presented to each person reasonable while getting feedback on a broad range of issues. There were 4 versions of the national questionnaire and also a version for each state. A polling place administered one kind of questionnaire -- either one of the national questionnaires or a state questionnaire. The polling agency selected the polling places, the type of questionnaire to be given at each one, and the interval at which people were approached, according to criteria intended to yield the information it thought would be most representative of the population.
The dataset for the national questionnaires contains about 100 variables that represent answers to questions for 13719 rows. There are lots of missing values because each person was only given about a quarter of the available questions, and furthermore, people left some answers blank.
Since all questions were multiple-choice, all of the data is categorical (discrete). It is therefore impossible to calculate useful numerical statistics such as the mean or standard deviation, and the dataset would not be very useful for a regression task. However, it is suitable for classification and clustering. A classification task could be learning a decision tree that predicts which candidate a person supports based only on characteristics such as religion, education, and income. A possible clustering task could be to learn some of the fundamental classes of people that support each candidate, such as union workers or white evangelists. In both cases the missing values might pose problems.
Andreas Fuchs
afuchs@engin.umass.edu
Election outcome by state
My data set combines the election outcome for Kerry and Bush on a state basis (as I could not find the outcome by county) with the size of each state, the size of the urban part of the state, the size of its population, the size of the urban population, the urban density, the amount of population growth from 1990 to 1999, and some combinations of these.
It is supposed to either prove or refute the theory "This election demonstrated a large split between urban and rural voters (Metro vs. Retro)." If I had found the election outcome on a county basis, I could have used county population values, which would lead to a more accurate result, but I was hoping to be able to find patterns even here.
Kyle Harrington
kih03@hampshire.edu
Election outcome and voting technology by county in Ohio
This data set contains county level data for Ohio's presidential election results. It has number of votes and percentage of votes for Bush, Kerry, Badnarik, and Peroutka. It also has the number of precincts in each county in the 2000 and 2004 elections, as well as the voting technology employed in each county. Every instance within this dataset has a value for each variable, thus there should be no problems with holes in the data. I also first examined the data without the percentages of votes for each candidate, but the results seemed to be less useful, which is why I included those variables.
Potential additions to this data set could be variables relating the locations of each county to other counties in order to establish a data set that would be more useful for examining in a relational sense.
This data was obtained from CBS News, verifiedvoting.org, and jqjacobs.net.
Phillip Kirlin
pkirlin@cs.umass.edu
Election outcome and education by county in Virginia
I decided to analyze data from the 2004 Presidential Election from the state of Virginia (the state where I grew up)...
I retrieved data from the USA Today website; these data contained election results by county for every state. I also used the 2000 Census FactFinder website to retrieve data on education levels in Virginia, also separated by county. There are 136 counties/cities in Virginia; some cities are technically not part of any county and are therefore tabulated independently. However, two of these entities, the cities of Clifton Forge and South Boston, were not included in the data. Perhaps they are technically separate from other counties but were counted with a neighboring county because they are extremely small.
The education data I collected were separated by sex. For each sex, I was provided with the number of people with their highest education level corresponding to one of sixteen categories ranging from "no schooling" to "doctorate degree." Note that this data was collected from the census form that was only distributed to about 1-in-6 people during the census, so the numbers are only useful when compared to the total number of people who took the optional survey. Luckily, these numbers were also provided, so I was able to calculate the percentage of people who fall into each of the sixteen categories for each sex.
So for these data, N = 134, V = 32, and M = 0; therefore D = 134*32 - 0 = 4288.
Angela Labrador
alabra@anthro.umass.edu
Election outcome and socio-economic variables by county in Ohio
I started by hunting around for data that I could obtain for the whole US, as I wasn't sure what would be readily available. I first obtained an authority file of every county/state combination in the US with the corresponding FIPS codes. I figured that might be useful as I gathered census data to create smaller sets of data for individual states or regions of the country. I then downloaded verified voter's data for the entire US so that I could deal with one of my task descriptions having to do with hypotheses about voting equipment if I was interested. I then sought a county-level breakdown of all states' election results. I was unable to find this for free and unwilling to compile individual state results in my own file. It was time to pick a state and run with it, which I did with Ohio (I will admit that this is my home state and that this influenced my decision, but it was also a "swing" state, used a mixture of voting technology, and has a high number of counties to work with).
Here's what I found for Ohio, which has 88 counties. All data was per county:
- election results from the Ohio Board of Elections (there were 10 candidates on the ballot in the state, but I only calculated ratios for Kerry and Bush and created another nominal field of the winning candidate's name)
- population data with a breakdown of the population into races in real numbers which I recalculated as decimal percentages of the county's total population so that I could compare counties (and not have highly populated counties skewing my data): total pop, total one race, total two or more races, white, Black, American Indian, Asian, Hawaiian, other, and Hispanic from the 2000 census
- averaged personal per capita income from a 2002 study by the Ohio Office of Strategic Research
- data on the attainment of educational degrees which I recalculated as decimal percentages of the county's total population: no hs diploma, hs graduate, some college (no degree), AA, BA, graduate or professional degree from the 2000 census
- poverty level data which compared 1990 census stats to 2000 census stats: all fields are a # or % below the poverty level: all families, families with related 18 and under, families with related 5 and under, families with single mothers, all individuals, individuals 65+, related children under 18 - from the 2000 census
- employment levels: total population 16+, 16+ in civilian labor force, 16+ % unemployed, 16+ in the military, % not in the labor force, total female population, 16+ females in the labor force, percent of females in civilian labor force, % of 16+ who drove alone on their commute, % carpooled, % using public transport, % walked, % other transp., % worked at home, mean travel time to work
I figured that from this large pool of data (where N=88 and M=0) I could easily construct data sets for different types of analyses with D > 500.
Marc Liberatore
liberato@cs.umass.edu
Election outcome and voting equipment by county in the U.S.
The data represent the per-county summary results of the last U.S. Presidential Election. In addition to the raw numbers of votes cast in each direction (Republican or Democratic), the voter registration totals from the October before the election are listed, as well as the type of vote recording software used.
These data are likely incomplete. In particular, it seems that the "Totals" columns are always equal to the sum of the Democratic and Republican totals for both votes and voter registration. I strongly suspect this is due more to laziness on the part of the original collector of the data than to the absence of third-party voters in the entire state of Florida.
The data indicate the type of voting machine in use in each county. Optical scanners are by far the most prevalent, though some counties did use touchscreens. Further examination of this and other data may reveal interesting patterns correlated with the type of voting machine.
Also notable is the increase in some counties between registered voters in October, and votes cast in November. The clear implication is that turnout in some counties approached 100% of registered voters -- or that some votes were fraudulent.
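A comparison of this kind could be sketched as below. The field names `county`, `registered`, and `votes_cast` are hypothetical (the actual column names in the file are not given here), and the sample rows are made up.

```python
def turnout_report(counties):
    """Return (county, votes_cast / registered) pairs, highest ratio first.

    counties: list of dicts with the hypothetical keys 'county',
    'registered' (October registration) and 'votes_cast' (November votes).
    Rows with no registration figure are skipped.
    """
    report = []
    for row in counties:
        registered = row["registered"]
        if registered <= 0:
            continue
        report.append((row["county"], row["votes_cast"] / registered))
    report.sort(key=lambda item: item[1], reverse=True)
    return report


def above_threshold(counties, threshold=1.0):
    """Names of counties whose turnout ratio exceeds the threshold."""
    return [name for name, ratio in turnout_report(counties) if ratio > threshold]


# Made-up rows for illustration only.
sample = [
    {"county": "A", "registered": 1000, "votes_cast": 700},
    {"county": "B", "registered": 500, "votes_cast": 520},
    {"county": "C", "registered": 0, "votes_cast": 10},
]
print(above_threshold(sample))
```

A ratio above 1.0 only flags a row for a closer look; it could equally reflect registrations added after the October count.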
Marc Maier
maier.marc@gmail.com
Election outcome and race by county in the entire U.S.
Can the percentage breakdown of certain races in a given county be used to determine the likelihood of that county supporting a particular party? In order to solve this question, the percentages of race for each county must be known. Fortunately, this data is readily available at the US census website. There are csv files for each state, broken down into individual counties, which project the numbers for July 2003 (the most recent estimation) for white, black, Asian, Native American/Alaskan, Native Hawaiian/Pacific Islander, and mixed heritage. The data for this are easily transferred into Excel to be cleaned (to calculate percentage values). The other half of the necessary information is how each county voted in this past election. This data is also readily available from a number of sources, including the CNN website. After manually entering the values for Bush and Kerry for each county, the csv file was easily transformed into an arff file for use by Weka.
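The CSV-to-ARFF step mentioned above might look roughly like this in Python. It is a sketch, not the conversion that was actually used: the function name `csv_to_arff`, the column names, and the sample values are made up, and it assumes column names and values contain no spaces or commas.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal=None):
    """Convert CSV text (first row = column names) to ARFF text.

    nominal maps a column name to its list of allowed values; every other
    column is declared numeric. Empty cells become "?", ARFF's missing marker.
    """
    nominal = nominal or {}
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader)]
    lines = ["@relation " + relation, ""]
    for name in header:
        if name in nominal:
            lines.append("@attribute " + name + " {" + ",".join(nominal[name]) + "}")
        else:
            lines.append("@attribute " + name + " numeric")
    lines += ["", "@data"]
    for row in reader:
        if not row:
            continue  # skip blank lines
        lines.append(",".join(cell.strip() or "?" for cell in row))
    return "\n".join(lines) + "\n"


# Made-up sample with one missing value in the second data row.
sample = "pct_white,pct_bush,party\n90.1,61.2,R\n,48.0,D\n"
print(csv_to_arff(sample, "counties", nominal={"party": ["R", "D"]}))
```

Declaring the party column as nominal lets Weka treat it as a class attribute rather than as a string.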
The whole data file consists of 3116 instances (one for each US county) and 7 variables (percent white, black, Asian, Native American/Alaskan, Native Hawaiian/Pacific Islander, two or more races, and percent Bush). There are also variables for percent Kerry and the supported party (R or D). In addition, there are instance identifiers (state and county). However, the necessary variables are the first 7 described. There are also roughly 50 missing data points, specifically those for Maine and Rhode Island. Unfortunately, all of New England made the task of data collection difficult by reporting individual towns instead of counties. The total amount of data is D = NV - M, where N = 3116, V = 7, and M = 50, so D = 3116*7 - 50 = 21762.
County results (excluding New England): www.cnn.com/elections
County race data: www.census.gov
Vermont results: http://vermont-elections.org/elections1/2004_election_info.html
Connecticut results: http://209.101.151.73/statementofvote/Reports%5CPE_by2.html
New Hampshire results: http://www.sos.nh.gov/general%202004/sumpres04.htm
Massachusetts results: http://www.sec.state.ma.us/ele/elepdf/elepresvpres04.pdf
Natasha Mohanty
nmohanty@cs.umass.edu
Election outcome and socio-economic variables by county in Ohio
For the purpose of initial experimentation I restricted myself to examining only a section of the available data. In particular for each county I looked at the percentage of population that is white, the percentage of the population that had registered to vote by 2002 (assuming the change in the population from 2000 to 2002 is negligible), the percentage of unemployed people and the percentage of votes cast in 2004. Thus, N = 88, V = 6, M = 0 and therefore D = 528.
Jamie Rothfeder
jrothfed@cs.umass.edu
Election outcome, moral values, and socioeconomic variables by county in Florida
People say that the election was decided by: 1. Voters who felt that "moral values" was the most important issue. 2. Voters who were not swayed by Kerry's complicated way of discussing problems and related more to Bush's simpler directness. 3. That there was more young voter turnout, but this was balanced by more turnout in general. And, that young people were more likely to vote for Kerry. Data was chosen that would describe the voters indicated by the above list. This data is county-by-county data for Florida taken from the 2000 census: 1) The population that was under the age of 35; 2) The population that was married; 3) The population that was white; 4) The population that owned a house; 5) The population that had children under 18; 6) The population that had a bachelor's degree; 7) The population that was employed; 8) The population that was salaried; 9) The population that had an income less than the average income for the state of Florida.
The first item describes young voters. The second, third, fourth, and fifth items describe what could possibly be the type of voters interested in "moral values". The sixth item describes a voter who may prefer a more complex explanation of problems and solutions to a simpler one. The seventh, eighth, and ninth may indicate the voters who would be most affected by changes in the economy, and who may think this the most important issue.
Andrew Tolopko
cs591y@tolopko.com
2004 Presidential Election Data
Since data mining is about finding unexpected relationships and patterns in data, I decided to try to piece together an assortment of data sources. I was initially interested in determining whether reported voting incidents might have some relationship to the particular type of voting equipment being used. So I obtained Nationwide Election Day 2004 Incidents data from https://voteprotect.org/ for machine-related voting incidents, broken down by voting district (usually county). I combined this with voting technology data from http://verifiedvoting.org/verifier/, again on a per-voting district basis. "Voting technology" is the particular type of voting machines being used by the voting district. To this, I then added 2004 presidential vote counts and election results for all counties in the U.S., obtained from usatoday.com (I could not find a single table for nationwide data, so I ended up piecing together the data from individual states). Finally, I decided to tack on some demographic information for each county, which included total county population, male/female percentages, and a percentage breakdown of individuals younger and older than 55 years old (excluding individuals under 15 years of age). This demographic data was obtained from the U.S. Census government website: http://www2.census.gov/census_2000/datasets/100_and_sample_profile/0_All_State/2khxx.zip.
Joining these four data sources ("relations") together was difficult. Among these relations, state names were presented as either full names or abbreviations, and county names had differing punctuation, abbreviations, and suffixes (e.g. "County", "Parish"; "St." versus "Saint", etc.). In the case of the Census data, counties were listed along with cities, townships, etc. To accurately "join" the various relations, all state and county names were normalized and then concatenated. Normalization of names included lowercasing, removal of all punctuation and white space, and removal of common suffixes (e.g., "County"). The join operation, implemented as a Perl script, reported unmatched records for each data source, which allowed for further hand-editing of county names for which the normalization rules did not work (e.g., to reconcile "Kings County, NY" with "Brooklyn, NY", which are the same).
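The normalize-then-join approach described above can be sketched as follows. The original was a Perl script that is not reproduced here; this is an illustrative Python version, and the suffix list, the partial state table, the override table, and all function names are assumptions.

```python
import re

# Assumed, deliberately partial tables for illustration.
SUFFIXES = ("county", "parish")
STATE_ABBREVIATIONS = {"new york": "NY", "louisiana": "LA"}
OVERRIDES = {"NY:brooklyn": "NY:kings"}  # hand-edited special cases


def normalize_state(state):
    """Map a full state name or an abbreviation to an upper-case abbreviation."""
    state = state.strip()
    if len(state) == 2:
        return state.upper()
    try:
        return STATE_ABBREVIATIONS[state.lower()]
    except KeyError:
        raise ValueError("unknown state name: " + state)


def normalize_county(name):
    """Lowercase, expand "St.", strip punctuation, suffix and white space."""
    name = re.sub(r"\bst\.?\s", "saint ", name.lower())
    name = re.sub(r"[^a-z0-9\s]", "", name)
    words = name.split()
    if words and words[-1] in SUFFIXES:
        words = words[:-1]
    return "".join(words)


def make_key(state, county):
    """Concatenate the normalized state and county into one join key."""
    key = normalize_state(state) + ":" + normalize_county(county)
    return OVERRIDES.get(key, key)


def join_on_key(left, right):
    """Join two dicts keyed by make_key(); also report unmatched keys."""
    joined = {k: (left[k], right[k]) for k in left if k in right}
    unmatched_left = sorted(k for k in left if k not in right)
    unmatched_right = sorted(k for k in right if k not in left)
    return joined, unmatched_left, unmatched_right


print(make_key("Louisiana", "St. Charles Parish"), make_key("LA", "Saint Charles"))
```

Returning the unmatched keys for each side mirrors the reporting step in the text: those lists are what would drive further hand-editing of entries such as the `OVERRIDES` table.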
Not all of the available data was joined together: Of 2103 incident records, 97 were unmatched/ignored. Of 3076 census records, 31 were unmatched/ignored. Of 3216 voting technology records, 42 were unmatched/ignored (27 were for AK alone, for which voting districts were not available).
AK voting data was not broken down by county, and therefore it has no associated voting technology, incidents, or demographic data. CT, MA, ME, NH, RI, and VT have no associated census demographic data, since voting data in these states was reported by townships, not counties. (This could be reconciled with effort.) For the same reason, these states do not have associated voting technology data. However, for MA and NH, the voting technology data provided a county-to-township mapping, which allowed the data to be joined (via special-case logic). CT was a special case, in that it was reported to use the same voting technology state-wide, and so the compiled data was coerced to include voting technology data for CT. VA has spotty census demographic and voting technology data, since voting districts appear to be split among counties and townships. NE, UT, and WI had almost no voting technology data available.
The final ARFF-formatted data file consists of the following attributes (fields):
@attribute Key string
@attribute State {AK, AL, AR, AZ, ...}
@attribute County string
@attribute BushVotes numeric
@attribute KerryVotes numeric
@attribute NaderVotes numeric
@attribute Incidents numeric
@attribute Technology {"", "Lever", "E-Voting", "Paper Ballots", "Punch Card", "Optical Scan"}
@attribute Vendor {...}
@attribute Model {...}
@attribute Population numeric
@attribute MalePct numeric
@attribute FemalePct numeric
@attribute 54YoungerPct numeric
@attribute 55OlderPct numeric
@attribute Winner { Kerry, Bush, Nader, N/A }
There are 4607 records in the data file. 1126 records are missing voting technology data. 1562 records are missing census data. 5 records have missing voting data, for unknown reasons (all in ME). (Incident data is "allowed" to have a value of 0, so there is no "missing" data, per se.)
The data file can be found at http://www.tolopko.com/cs591y/data.arff