Tuesday, April 28, 2015

Assignment 5 - Regression Analysis

Part I

Regression Analysis of Crime Rate and Free Lunches (fig. 1)
Crime rate is the dependent variable and free lunches is the independent variable. The null hypothesis is that there is no linear relationship between free lunches and crime rate in the given areas. The alternative hypothesis is that there is a linear relationship between free lunches and crime rate in the given areas. We reject the null hypothesis because the significance value is under .05, although the relationship between free lunches and crime rate is small. The regression equation is y=1.685x+21.819. According to this model, a crime rate of 79.7 corresponds to 34.35% of persons getting a free lunch. I am not very confident in this result because the model's correlation is weak and the r-squared value is low.
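The 34.35% figure comes from solving the regression equation for x. A minimal sketch of that arithmetic, using only the slope and intercept reported above (the function names are my own, chosen for illustration):

```python
# Coefficients from the reported regression equation y = 1.685x + 21.819,
# where y is the crime rate and x is the percentage getting a free lunch.
SLOPE = 1.685
INTERCEPT = 21.819


def predict_crime_rate(free_lunch_pct):
    """Forward prediction: crime rate for a given free lunch percentage."""
    return SLOPE * free_lunch_pct + INTERCEPT


def free_lunch_for_crime_rate(crime_rate):
    """Inverse: solve y = SLOPE * x + INTERCEPT for x."""
    return (crime_rate - INTERCEPT) / SLOPE


pct = free_lunch_for_crime_rate(79.7)
print(round(pct, 2))  # 34.35
```

Plugging the result back into predict_crime_rate returns the original crime rate of 79.7, which is a quick check that the equation was inverted correctly.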

Part II

Introduction

For this next exercise we look at data from the UW system to find which variables best explain why students choose the school they do, based on the county they come from. The available variables include the number of people within each county that have some college, 2 years of college, a college degree, or a graduate/professional degree, as well as population, population 18-24, and median household income. I picked two schools to focus on, UW-Eau Claire and UW-Madison, and three variables: percentage BS degree, median household income, and population normalized by distance from the school. Students from out-of-state are excluded from this analysis.

Methods

The first thing to do is to perform a regression analysis on each of the three variables for both schools. In each regression analysis output, a variable is significant if its significance value is below .05. First I state the null and alternative hypotheses for each of the three variables at both schools.

Eau Claire student attendance and Population normalized by distance for Eau Claire.
  • The null hypothesis is that there is no significant relationship between Eau Claire student attendance and Population normalized by distance for Eau Claire. The alternative hypothesis is that there is a significant relationship between Eau Claire student attendance and Population normalized by distance for Eau Claire.
Regression analysis of Eau Claire student attendance and population normalized by distance from Eau Claire. The significance value of .000 is under the critical value of .05, which shows a significant relationship. (fig. 2)
We reject the null hypothesis for the variable of population normalized by distance from Eau Claire because the regression analysis shows a value below the critical value.

Eau Claire student attendance and percentage of BS degrees.
  • The null hypothesis is that there is no significant relationship between Eau Claire student attendance and percentage of BS degrees. The alternative hypothesis is that there is a significant relationship between Eau Claire student attendance and percentage of BS degrees.
Regression analysis of Eau Claire student attendance and percentage of BS degrees. The significance value of .003 is under the critical value of .05, which shows a significant relationship. (fig. 3)
We reject the null hypothesis for the variable of percentage of BS degrees because the regression analysis shows a value below the critical value.

Eau Claire student attendance and median household income.
  • The null hypothesis is that there is no significant relationship between Eau Claire student attendance and median household income. The alternative hypothesis is that there is a significant relationship between Eau Claire student attendance and median household income.
Regression analysis of Eau Claire student attendance and median household income. The significance value of 0.104 is above the critical value of .05, which shows no significant relationship. (fig. 4)
We fail to reject the null hypothesis for the variable of median household income because the regression analysis shows a value above the critical value.

Madison student attendance and Population normalized by distance for Madison. 
  • The null hypothesis is that there is no significant relationship between Madison student attendance and Population normalized by distance for Madison. The alternative hypothesis is that there is a significant relationship between Madison student attendance and Population normalized by distance for Madison.
Regression analysis of Madison student attendance and population normalized by distance from Madison. The significance value of .000 is under the critical value of .05, which shows a significant relationship. (fig. 5)
We reject the null hypothesis for the variable of population normalized by distance from Madison because the regression analysis shows a value below the critical value.

Madison student attendance and percentage of BS degrees.
  • The null hypothesis is that there is no significant relationship between Madison student attendance and percentage of BS degrees. The alternative hypothesis is that there is a significant relationship between Madison student attendance and percentage of BS degrees.
Regression analysis of Madison student attendance and percentage of BS degrees. The significance value of .000 is under the critical value of .05, which shows a significant relationship. (fig. 6)
We reject the null hypothesis for the variable of percentage of BS degrees because the regression analysis shows a value below the critical value.

Madison student attendance and median household income.
  • The null hypothesis is that there is no significant relationship between Madison student attendance and median household income. The alternative hypothesis is that there is a significant relationship between Madison student attendance and median household income.
Regression analysis of Madison student attendance and median household income. The significance value of 0.001 is below the critical value of .05, which shows a significant relationship. (fig. 7)
We reject the null hypothesis for the variable of median household income because the regression analysis shows a value below the critical value.
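Each of the six tests above relies on the same fit. As a rough sketch of what the regression output reports, the following computes the slope, intercept, R2 and the slope's t statistic by ordinary least squares. The county values are made up for illustration and are not the assignment data. The significance value itself comes from comparing the t statistic against a t distribution with n-2 degrees of freedom, a step this sketch leaves to the statistics package.

```python
import math


def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual and total sums of squares give R-squared.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - sse / sst

    # Standard error of the slope and its t statistic (n - 2 degrees of freedom).
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / se_slope
    return slope, intercept, r_squared, t_stat


# Hypothetical values: an independent variable and student counts for 5 counties.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
attendance = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r_squared, t_stat = simple_linear_regression(predictor, attendance)
print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r_squared:.3f} t={t_stat:.2f}")
```

A large t statistic in absolute value corresponds to a small significance value, which is the number compared against .05 in the tests above.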

Results

I will map the residuals only for the variables that were found to be significant; the regression analyses above show that all but one of the variables were. To get the residuals, I selected the option to save the standardized residuals when running each regression analysis and then exported the tables of residuals to ArcGIS.
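For readers who want to see what is being mapped, here is a minimal sketch of a standardized residual, assuming it is the raw residual divided by the standard error of the estimate. The five counties and their values are hypothetical, not the assignment data.

```python
import math


def standardized_residuals(x, y):
    """Fit y = a + b*x by least squares and return residual / standard error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Standard error of the estimate with one predictor: sqrt(SSE / (n - 2)).
    std_error = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    return [e / std_error for e in residuals]


# Hypothetical counties: population normalized by distance, and students attending.
pop_by_distance = [12.0, 30.0, 45.0, 60.0, 80.0]
students = [20.0, 70.0, 80.0, 140.0, 150.0]

z_resid = standardized_residuals(pop_by_distance, students)
for value, z in zip(pop_by_distance, z_resid):
    label = "more students than predicted" if z > 0 else "fewer students than predicted"
    print(f"{value:5.1f}  {z:+.2f}  {label}")
```

On a residual map, positive values mark counties that send more students than the model predicts and negative values mark counties that send fewer.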

Eau Claire student attendance and Population normalized by distance for Eau Claire. R2=.753
This variable had a significance value of .000 and shows a pattern of counties with large population centers having large attendance at UW-Eau Claire. (fig. 8)
Eau Claire student attendance and percentage of BS degrees. R2=.121
This variable had a significance value of .003 and shows a pattern that counties with high percentages of BS degrees have large attendance at UW-Eau Claire. (fig. 9)
Madison student attendance and Population normalized by distance for Madison. R2=.853
This variable had a significance value of .000 and shows a pattern of the heavily populated Milwaukee area having large attendance at UW-Madison. (fig. 10)
Madison student attendance and percentage of BS degrees. R2=.154
This variable had a significance value of .000 and shows a pattern that counties with high percentages of BS degrees have large attendance at UW-Madison. (fig. 11)
Madison student attendance and median household income. R2=.363
This variable had a significance value of .001 and shows a pattern that counties with high median household income, such as the Milwaukee and Madison areas, have high attendance at UW-Madison. (fig. 12)

Discussion & Conclusion

The results for Eau Claire show that two variables, population normalized by distance to Eau Claire and percentage of BS degrees, are correlated with the number of students from each county that attend UW-Eau Claire. This means that students that go to UW-Eau Claire are more likely to come from populated areas and areas that have a lot of people with BS degrees. UW-Eau Claire students likely come from populated and educated areas across the state. Median household income does not have a significant relationship with student attendance at UW-Eau Claire, meaning that income does not significantly influence students to attend UW-Eau Claire. The R2 value for the percentage of BS degrees is low, however, which means that the BS degree variable has a weak relationship with student attendance at UW-Eau Claire. The population normalized by distance to Eau Claire variable has a high R2 and a very strong relationship with student attendance at UW-Eau Claire, meaning this is the most influential predictor I looked at for why students choose to go to UW-Eau Claire.

The results for Madison show that all three variables, population normalized by distance to Madison, percentage of BS degrees and median household income, are correlated with the number of students from each county that attend UW-Madison. This means that students that go to UW-Madison are more likely to come from populated areas, areas that have a lot of people with BS degrees and areas with high median household income. UW-Madison students likely come from populated, educated and rich areas across the state. The R2 values for percentage of BS degrees and median household income are low, meaning that they have weak relationships with student attendance at UW-Madison. The population normalized by distance to Madison variable has a very high R2 and a very strong relationship with student attendance at UW-Madison, meaning this is the most influential predictor I looked at for why students choose to go to UW-Madison.

Tuesday, April 7, 2015

Assignment 4 - Correlation & Spatial Autocorrelation

Part 1: Correlation

Scatter-plot Results with Trend-line
Pearson Correlation Results in SPSS
1. The null hypothesis is that there is no relationship between distance and sound level. The alternative hypothesis is that there is a relationship between distance and sound level. This Pearson Correlation shows a strong negative correlation between distance and sound level. Therefore we will reject the null hypothesis.
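As an illustration of what the Pearson correlation measures, here is a small sketch that computes r by hand. The distance and sound readings are invented for the example and are not the assignment data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y over the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (spread_x * spread_y)


# Hypothetical readings: sound level falls as distance grows.
distance = [10, 20, 30, 40, 50]
sound_level = [95, 88, 80, 71, 65]

r = pearson_r(distance, sound_level)
print(f"r = {r:.3f}")
```

A value of r close to -1 means that sound level drops almost in a straight line as distance increases, which is what a strong negative correlation describes.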

Results of Correlation Matrix in SPSS

2. This matrix shows several strong correlations between variables. Some examples include: Percent Black has a very strong negative correlation with Bachelors Degree and a strong positive correlation with Below Poverty, while Percent Hispanic has a very strong positive correlation with No High School Diploma. There seems to be a pattern in which higher non-white percentages go with lacking a high school diploma or bachelors degree, walking to work and being below the poverty level. Meanwhile, higher white percentages go with lower rates of walking to work, being below the poverty level and lacking a high school diploma and/or bachelors degree.

Part 2: Spatial Autocorrelation

Introduction

This exercise looks at the 1980 and 2008 Presidential Elections in Texas and searches for patterns in the voting for the Texas Election Commission (TEC). We will also compare the results to Hispanic population to see if there is a relationship between it and voter turnout. The TEC wants to see if there is clustering in voting patterns in the state and if there has been a change in the past 30 years.

Methods

The first step is to download the information I need to analyze. I downloaded the Hispanic 2010 Population Percentage for the counties of Texas from the U.S. Census website. The voting data was provided by the TEC. I also downloaded a shapefile of the counties in Texas from the U.S. Census website. Next, in ArcGIS I joined the 2010 Census data and the voting data provided by the TEC to the Texas counties shapefile so that all three datasets were in one layer. After joining these tables to the counties, I simply exported the data into a new shapefile.

Once I have my shapefile I can open it in GeoDa to perform some spatial autocorrelation analysis. First, I need to create a .gal file from the shapefile by going to Tools--Weights--Create, clicking on Add ID Variable, selecting Poly-ID and selecting Rook Contiguity under contiguity weight. Make sure to save the file as a .gal file. This will be used in the analysis later.
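Rook contiguity treats two areas as neighbors only when they share an edge, not just a corner. Finding shared county borders takes GeoDa or a GIS library, so the sketch below uses a regular grid of cells as a stand-in for counties. The grid and the function name are assumptions made for illustration.

```python
def rook_neighbors(n_rows, n_cols):
    """Map each grid cell id to the ids of cells that share an edge with it."""
    neighbors = {}
    for row in range(n_rows):
        for col in range(n_cols):
            cell = row * n_cols + col
            adjacent = []
            # Rook moves: up, down, left, right. Diagonals are excluded.
            for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                r, c = row + d_row, col + d_col
                if 0 <= r < n_rows and 0 <= c < n_cols:
                    adjacent.append(r * n_cols + c)
            neighbors[cell] = sorted(adjacent)
    return neighbors


# A 3x3 grid, with cells numbered 0..8 row by row.
grid = rook_neighbors(3, 3)
print(grid[0])  # corner cell: [1, 3]
print(grid[4])  # centre cell: [1, 3, 5, 7]
```

The weights file plays the same role as this dictionary: for each area, it records which other areas count as its neighbors.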

The first analysis we are going to do is a Moran's I. To do this, just click on the Univariate Moran button, select the variable you want to analyze and select the .gal file created before. I repeated this for all five variables. Next, to create a LISA Cluster Map, click the Univariate LISA button, select the variable, select the .gal file from earlier and check the Cluster Map option. Below I have the Moran's I and LISA Cluster Map for each of the five variables.
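To show what the Moran's I statistic measures, here is a minimal sketch assuming row-standardized weights, where each area's neighbors share a total weight of 1. The four-area chain and its values are hypothetical, and the permutation test behind a significance value is not included.

```python
def morans_i(values, neighbors):
    """Global Moran's I with row-standardized contiguity weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]

    cross = 0.0    # sum over i of z_i times the average z of i's neighbors
    weight_sum = 0.0  # total of all weights; each non-empty row sums to 1
    for i, nbrs in neighbors.items():
        if nbrs:
            spatial_lag = sum(z[j] for j in nbrs) / len(nbrs)
            cross += z[i] * spatial_lag
            weight_sum += 1.0

    return (n / weight_sum) * cross / sum(zi * zi for zi in z)


# Four areas in a row: 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([1, 1, 0, 0], chain)    # like values sit side by side
alternating = morans_i([1, 0, 1, 0], chain)  # like values are kept apart
print(clustered, alternating)  # 0.5 -1.0
```

Positive values indicate that similar values cluster together, negative values indicate that neighbors tend to differ, and values near zero indicate no spatial pattern.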

Results


Moran's I and LISA Cluster Map for the Voter Turnout 1980 variable
The voter turnout in 1980 shows some large clustering. The northern counties had a cluster of very high voter turnout while the southern and eastern counties had clusters of low voter turnout. The variable has a moderately strong positive spatial autocorrelation.


Moran's I and LISA Cluster Map for the Percent Democratic Vote 1980 variable
The percentage of Democratic votes in 1980 shows even larger clustering. The northwestern counties have clusters of low Democratic votes while the southern and eastern counties have clusters of high Democratic votes. This variable has a moderately strong positive spatial autocorrelation.

Moran's I and LISA Cluster Map for the Voter Turnout 2008 variable
Voter turnout in 2008 shows small clusters in the counties of Texas. There are clusters of high voter turnout in the northern, northeastern and central counties as well as a cluster of low voter turnout in the southern counties. This variable has a moderately strong positive spatial autocorrelation.

Moran's I and LISA Cluster Map for the Percent Democratic Vote 2008 variable
The percentage of Democratic votes in Texas in 2008 shows large clusters. There are clusters of low percentage of Democratic votes in the northern, northwestern and central counties while there are clusters of high Democratic votes in the southern and southwestern counties. This variable has a very strong positive spatial autocorrelation.

Moran's I and LISA Cluster Map for the Percent Hispanic Population variable
The percentage of Hispanic population in 2010 in Texas shows very large clusters. There is a large cluster of low percentage of Hispanic population in the northeastern counties while there is a large cluster of high percentage of Hispanic population in the southwestern counties. This variable has a very strong positive spatial autocorrelation.
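The high and low clusters described above come from comparing each county with the average of its neighbors. The sketch below shows that classification using a local Moran statistic with row-standardized weights. The six-area chain and its values are hypothetical, and the permutation test that a cluster map uses to keep only significant locations is left out.

```python
def lisa_quadrants(values, neighbors):
    """Return (quadrant, local Moran's I) for each area.

    Assumes every area has at least one neighbor.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    variance = sum(zi * zi for zi in z) / n

    results = []
    for i in range(n):
        nbrs = neighbors[i]
        # Spatial lag: average deviation from the mean among i's neighbors.
        lag = sum(z[j] for j in nbrs) / len(nbrs)
        local_i = z[i] * lag / variance
        own = "High" if z[i] > 0 else "Low"
        around = "High" if lag > 0 else "Low"
        results.append((own + "-" + around, local_i))
    return results


# Six areas in a row, with high values at one end and low values at the other.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
turnout = [10, 9, 8, 1, 2, 1]

lisa = lisa_quadrants(turnout, chain)
for area, (quadrant, local_i) in enumerate(lisa):
    print(area, quadrant, round(local_i, 2))
```

High-High and Low-Low areas are the clusters of similar values, while High-Low and Low-High areas are outliers that differ from their surroundings.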

Conclusion

There are patterns in the voting in Texas counties in 1980 and 2008 as well as in Hispanic population in 2010. In 1980 there is a pattern of high voter turnout in the Republican-leaning north and low voter turnout in the Democratic-leaning south and east. In 2008 there is a similar pattern; however, the Democratic clusters have moved from the east to the west, with low voter turnout still occurring in the southern counties. A variable that could be behind the Democratic vote moving from east to west is the increase in Hispanic population in the western counties and the lack of Hispanic population in the eastern counties.

For the TEC, I would report that there is clustering in the state's voting patterns, with Democrats in the south and Republicans in the north. There has been a change in voting patterns over the last 30 years: the Democratic vote has moved from the east to the west, and this is possibly related to the Hispanic population in the western counties.