Chapter 12
Estimation and Hypothesis Testing IV: Bivariate Correlation and Regression

Learning Objectives:
1. Describe the association between two interval or ratio level variables.
2. Interpret a scatterplot.
3. Calculate and interpret Pearson’s correlation coefficient (r).
4. Test hypotheses using Pearson’s r.
5. Explain linear regression between two interval or ratio level variables.
6. Explain the least-squares regression criterion.
7. Calculate and interpret unstandardized and standardized beta coefficients.
8. Write and interpret a least-squares regression equation.
9. Test hypotheses using bivariate linear regression.

Chapter Summary
In this chapter, students extend their understanding of statistical hypothesis testing to bivariate correlations and regressions. The chapter starts by describing the concepts of association and correlation and then uses scatterplots to illustrate the two. The chapter then moves into a discussion about prediction and the least-squares regression criterion. The chapter ends with an exercise in hypothesis testing using linear regression.

Key Formulas
The following represent the key formulas for this chapter. PowerPoint slides are provided for each chapter. In addition to these slides, a PDF file containing only the formulas is also provided.
Pearson’s r
Pearson’s r using z-scores
Standard error of r
Degrees of freedom for r
t-value for Pearson’s r
Least squares regression
Unstandardized beta coefficient
Standardized beta coefficient
Intercept
Residual
Standard error of the slope
t-value for the linear regression

Interactive Figures:
The textbook contains interactive figures. You may wish to use these in a lecture. Students also have access to these. For this chapter, there are two interactive figures.
1. Figure 12.1 illustrates a scatterplot and Pearson’s correlation coefficient.
2. Regression interactive demonstrates the regression line and hypothesis test.
These interactive figures can be found in the eBook and the Library under Chapter 11 Resources.

Typical Lecture Material
We have provided two sample lectures below. You may wish to add discipline-specific information to make these more relevant to your students.

Lecture 1:
Objective: To understand the conceptual underpinnings of correlation and association, and how to compute Pearson’s r correlation coefficient.

Review the following concept table with your students, and help them to fill in the definitions of each statistical concept. The definitions that the students should come up with are in italics.

Statistical Concept: Definition
Association: The relationship between two or more variables that co-vary with one another.
Scatterplot: A graph that displays individual respondent (or subject) scores on two variables.
Positive Association: When lower (higher) values of one variable correspond with lower (higher) values on another variable.
Negative Association: When lower (higher) values on one variable correspond with higher (lower) values on another variable.
No Association: When a change in one variable does not coincide with a change in another variable.
Strength of Association: The degree to which two variables are associated.
Correlation Coefficient: A numerical value representing the strength and direction of a linear association between two variables.
Pearson’s Correlation Coefficient: The correlation coefficient most commonly used for measuring the association between two interval or ratio level variables.
A Regression Line: A line of best fit that describes the linear association between an independent and a dependent variable.
Least-Squares Regression Line: The regression line that minimizes the sum of the squared distances between the observations and the line itself.
Intercept: The constant in the regression equation. It is the value of y when x equals zero.
Beta: Represents the slope of the regression line. Tells us how the value of y changes in relation to a one-unit change in x.
Unstandardized Beta: The value of the slope of the regression line in raw score values.
Standardized Beta: The value of the slope of the regression line in standardized values (standard deviations).
Coefficient of Determination: The percentage of the variance in the dependent variable that is explained by the independent variable.

Example 1: Ask your students to identify whether the following scatterplots correspond to a positive, negative or no association.
1. (positive association)
2. (negative association)
3. (no association)

Example 2: Read the following scenario to your students:
Suppose that we want to test the hypothesis that grades of the last high school math course taken are positively associated with grades of the first university math course taken. You randomly select 15 students from the population and are testing the hypothesis using an alpha value of 0.05.
Note: Draw the following table on the board and ask students to complete the values for xy, x², and y². You may wish to use the interactive Figure 12.1 for this activity. The answers are in the table below.

1. Define the null and alternative hypothesis.
H0: ρ = 0
HA: ρ ≠ 0

2. Define the sampling distribution and critical values.
First we need to determine the value of Pearson’s r. This is calculated using Equation 12.1, which gives r = 0.542.
The standard error for r is:
$$s_r = \sqrt{\frac{1 - r^2}{n - 2}} = \sqrt{\frac{1 - 0.542^2}{15 - 2}} = \sqrt{\frac{0.706}{13}} = \sqrt{0.054} = 0.232$$
There are 13 degrees of freedom: df = n - 2 = 15 - 2 = 13
The critical value of t at 13 degrees of freedom (using a two-tailed test) is ±2.160.

3. Calculate the test statistic using the t-distribution.
$$t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} = \frac{0.542}{0.232} = 2.34$$

4. Make a decision regarding the hypothesis (i.e. determine if the test statistic passes the critical value).
The test statistic of t = 2.34 is past the critical value of ±2.160.
Therefore we reject the null hypothesis and state that the correlation is significantly different from zero.

5. Interpret the results (i.e. identify if the correlation is positive or negative and statistically significant or not).
The Pearson’s correlation between the high school math grade and the university math grade is positive and statistically significant. We can therefore say that, based on this sample, there appears to be a positive association between the grades students obtain in high school math and the grades students achieve in university math.

Lecture 2:
Objective: To understand the concept of regression, the least-squares regression criterion and the process of hypothesis testing when using regression.

Example 1: Ask your students to answer the following:
1. How do we interpret the unstandardized beta value of 1.24? (A one-unit increase in the independent variable corresponds to a 1.24 unit increase in the dependent variable.)
2. How do we interpret the standardized beta coefficient of -.235? (A one-standard deviation increase in the independent variable (x) is associated with a .235 standard deviation decrease in the dependent variable (y).)
3. True or False: In a bivariate regression, the unstandardized beta is equal to the Pearson’s correlation coefficient. (False, the standardized beta is the correlation coefficient.)
4. If our coefficient of determination is equal to .475, how do we interpret this? (Our independent variable explains approximately 47.5% of the variance in the dependent variable.)

Example 2: Use the regression interactive to demonstrate to students how regression works. If you are not able to access the interactive in the classroom, you could always draw the diagrams below on your board.
1. Start with the “study example” data and uncheck “Show Fit Line” so that the blue dotted fit line is hidden from the scatterplot. The yellow console and scatterplot should look as follows:
2. Ask students to estimate where they think the regression line will be. Then show them the scatterplot with the fitted line by checking “Show Fit Line” in the console. The scatterplot with the fitted line will look like this:
3. Next, show the students how the data and predicted values (from below) match up to the values in the scatterplot. The observations are numbered so that you can match them.
4. Next, show the students how you can plot the residual values to see how far away they are from the regression line. This is shown below:
5. By checking the box (in the console) labelled “Show Predict Values” you can then use the slider to show how the regression line (and equation) is used to predict the values of y based on x.
6. Finally, you can use the values in the results box and change the alpha values, to show students how the hypothesis test works:
7. There are 3 datasets that you can plot with this interactive (shown below). The “Study Example” is the default and represents a positive association. The “Crime Example,” which can be selected from the drop-down menu, represents a negative association. The “Charity Example” represents no association.

Solutions to End-of-Chapter Problems

Problem 12-1
a) Scatter given below. (LO2)
b) r = -0.893 (LO3)
c) Computed statistic = -5.61; critical value of t with 8 degrees of freedom = ±2.306; conclusion: the null hypothesis would be rejected and the result is statistically different from 0. (LO4)
d) price = 88.152 - 8.536 age
e) r² = 0.798, a good fit. F(1, 8) = 31.61; F0.05(1, 8) = 5.32. Based on this information the null hypothesis would be rejected. There is enough evidence to believe the regression is a success.
(LO8/LO9)

Source                        Sum of Squares   df   Mean Sum of Squares   F-Ratio
Between Groups (regression)          2011.14    1               2011.14     31.61
Within Groups (residual)              508.96    8                 63.62
Total                                2520.10    9

Problem 12-2
a) Scatter given below. (LO2)
b) r = 0.547 (LO3); Computed statistic = 2.611; critical value of t with 16 degrees of freedom = ±2.120; conclusion: the null hypothesis would be rejected and the result is statistically different from 0. (LO4)
c) y = 5.941 + 0.273x; Computed statistic = 2.614; critical value of t with 16 degrees of freedom = ±2.120; conclusion: the null hypothesis would be rejected and the result is statistically different from 0. (LO7/LO9)
d) r² = 0.299, a poor fit; F(1, 16) = 6.817; F0.05(1, 16) = 4.49. Based on this information the null hypothesis would be rejected. There is enough evidence to believe the regression is a success. (LO8/LO9)
e) Scatter appears non-linear. Results would not be supported. (LO2)

Problem 12-3
a) r = -0.468 (LO3); Computed statistic = -1.50; critical value of t with 8 degrees of freedom = ±2.306; conclusion: the null hypothesis would be accepted and the result is not statistically different from 0. (LO4)
b) grade = 73.04 - 1.24 hours; Computed statistic = 9.303; critical value of t with 8 degrees of freedom = ±2.306; conclusion: the null hypothesis would be rejected and the result is statistically different from 0. (LO7/LO9)
c) r² = 0.219, a poor fit; F(1, 8) = 2.245; F0.05(1, 8) = 5.32. Based on this information the null hypothesis would be accepted. There is not enough evidence to believe the regression is a success. (LO8/LO9)

Source           Sum of Squares   df   Mean Sum of Squares   F-Ratio
Between Groups           101.10    2                 50.55      7.74
Within Groups             71.83   11                  6.53
Total                    172.93   13

d) 0.025 < p-value < 0.05 from table. Actual p-value for 7 degrees of freedom = 0.0317 (LO7/LO8)

Source           Sum of Squares   df   Mean Sum of Squares   F-Ratio
Between Groups
Within Groups
Total

Solutions to Interactive Exercises

Exercise 12-1
(i) Calculate the coefficient of correlation from the following data:
∑xy = 7000, ∑x = 100, ∑y = 700, ∑x² = 1000, ∑y² = 50 000, n = 12
(ii) Interpret the value.
(iii) Find the coefficient of determination.
Answer:
(i) r = 0.94
(ii) There is a high positive correlation between the two variables.
(iii) 0.8836

Exercise 12-2
From the following data, test whether the correlation coefficient is significant:
∑xy = 7000, ∑x = 100, ∑y = 700, ∑x² = 1000, ∑y² = 50 000, n = 12
Level of significance = 0.05
Answer:
H0: ρ = 0
HA: ρ ≠ 0
Test statistic = t = 8.7199
t critical = 2.228
Test statistic is greater than t critical. Hence we reject H0. There is enough evidence to conclude that the correlation is significant.

Exercise 12-3
It is noted that there is a linear relationship between the number of hours a treadmill is used and the number of months it is owned. The study was based on 15 randomly selected families. The regression equation is given as:
# of hours used = 9.0 - 0.18 * # of months owned.
(i) Interpret the slope of the regression equation.
(ii) If the standard error of the slope is 0.06, is there a significant relationship between the 2 variables? Level of significance = 0.05
Answer:
(i) For each additional month of owning a treadmill, use of the treadmill decreases by 0.18 hours.
(ii) H0: β = 0
HA: β ≠ 0
Test statistic = t = b / sb = -0.18 / 0.06 = -3.0
t critical = ±2.160
The absolute value of the test statistic (3.0) is greater than t critical. Hence we reject H0. There is enough evidence to conclude that there is a significant relationship between the two variables.

Solutions to SPSS Exercises

Exercise 12-1
1. You are given the following information for two variables x and y (explanatory and response variables, respectively).
Use SPSS to assist you in answering the following questions.

x: 2 12 4 6 9 4 11 3 10 11 3 1 13 12 14 7 2 8
y: 4 8 10 9 10 8 8 5 10 9 8 3 9 8 8 11 6 9

a) Draw a scatter plot for these data points. Does there appear to be a positive or negative association?
There appears to be a positive association.
LO: 2 Page: 333-338

b) Calculate r, the correlation coefficient.

Correlations
                              x        y
x   Pearson Correlation       1      .547*
    Sig. (2-tailed)                  .019
    N                        18       18
y   Pearson Correlation     .547*      1
    Sig. (2-tailed)         .019
    N                        18       18
*. Correlation is significant at the 0.05 level (2-tailed).
LO: 3 Page: 338-341

c) Calculate the least squares regression line using the appropriate equation. Plot this line on the scatter plot. Would you believe that there is a good “fit” or poor “fit” and why?

Coefficients (a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B        Std. Error           Beta                        t       Sig.
1 (Constant)     5.941    .884                                             6.722   .000
  x              .273     .105                 .547                        2.611   .019
a. Dependent Variable: y

Regression line: ŷ = 5.941 + 0.273x
Poor fit since r² is very small.
LO: 5, 6, 7 Page: 343-347

d) Do your results support what you see on the scatter plot?
Results may or may not support the scatter. The key issue is that it is inherently non-linear.
LO: 8 Page: 348-349

Equations from Chapter 12
Scott R. Colwell and Edward M. Carter © 2012

Equation 12.1: Pearson’s r Correlation Coefficient
$$r = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$$
Where:
r = Pearson’s r
x = x variable
y = y variable
n = sample size

Equation 12.3: Pearson’s r Correlation Coefficient (using z-scores)
$$r = \frac{\sum z_x z_y}{n}$$
Where:
r = Pearson’s correlation coefficient r
z_x = z-score for the x variable
z_y = z-score for the y variable
n = sample size

Equation 12.4: Standard Error of r
$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$
Where:
s_r = standard error of r
r = Pearson’s correlation coefficient r
n = sample size

Equation 12.6: Degrees of Freedom for Pearson’s r
$$df = n - 2$$
Where:
df = degrees of freedom
n = sample size

Equation 12.7: t-value for Pearson’s r
$$t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$$
Where:
t = test statistic
r = Pearson’s correlation coefficient r
n = sample size

Equation 12.8: Least Squares Regression
$$\hat{y} = \alpha + bx$$
Where:
ŷ = predicted value of the dependent y-variable
α = intercept
b = unstandardized beta (slope of the regression line)
x = observed value of the independent variable

Equation 12.9: Unstandardized Beta Coefficient
$$b = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$
Where:
b = unstandardized beta (slope of the regression line)
x = observed value of the independent variable
y = observed value of the dependent variable
n = sample size

Equation 12.11: Standardized Beta Coefficient
$$\beta = b\,\frac{s_x}{s_y}$$
Where:
β = standardized beta (slope of the regression line)
b = unstandardized beta (slope of the regression line)
s_x = standard deviation of x
s_y = standard deviation of y

Equation 12.13: Formula for Calculating the Intercept
$$\alpha = \bar{y} - b\bar{x}$$
Where:
α = intercept
ȳ = mean value of y
b = unstandardized beta
x̄ = mean value of x

Equation 12.18: Formula for Calculating the Residual
$$\text{Residual} = y - \hat{y}$$
Where:
Residual = residual value (ε)
y = individual value of y
ŷ = predicted value of the dependent y-variable

Equation 12.19: The Standard Error of the Slope
$$s_b = \frac{\sqrt{\frac{\sum (\text{residuals})^2}{n - 2}}}{\sqrt{\sum (x - \bar{x})^2}}$$
Where:
s_b = standard error of the slope
Residual = residual value (ε)
n = sample size
x = observed value of the independent x-variable
x̄ = mean value of x

Equation 12.20: t-value for the Linear Regression
$$t = \frac{b}{s_b}$$
Where:
t = test statistic
b = unstandardized beta
s_b = standard error of the slope

Solution Manual for Introduction to Statistics for Social Sciences, Scott R. Colwell, Edward M. Carter, 9780071319126
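
The Chapter 12 equations can be checked numerically. The following is a minimal Python sketch, assuming only the standard library, that applies Equations 12.1, 12.4, 12.7, 12.9, 12.11, 12.13, 12.18, 12.19 and 12.20 to the x and y values from SPSS Exercise 12-1. The function name bivariate_summary and its dictionary keys are illustrative assumptions, not part of the textbook or of SPSS.

```python
import math

# Data from SPSS Exercise 12-1 (x = explanatory, y = response).
x = [2, 12, 4, 6, 9, 4, 11, 3, 10, 11, 3, 1, 13, 12, 14, 7, 2, 8]
y = [4, 8, 10, 9, 10, 8, 8, 5, 10, 9, 8, 3, 9, 8, 8, 11, 6, 9]


def bivariate_summary(x, y):
    """Hypothetical helper: apply the Chapter 12 equations to paired data."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)

    # Building blocks shared by Equations 12.1 and 12.9.
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_x2 - sum_x ** 2 / n   # algebraically equal to sum((x - mean_x)^2)
    s_yy = sum_y2 - sum_y ** 2 / n

    r = s_xy / math.sqrt(s_xx * s_yy)           # Equation 12.1
    s_r = math.sqrt((1 - r ** 2) / (n - 2))     # Equation 12.4
    t_r = r / s_r                               # Equation 12.7

    b = s_xy / s_xx                             # Equation 12.9
    a = sum_y / n - b * sum_x / n               # Equation 12.13
    # Equation 12.11: s_x / s_y equals sqrt(s_xx / s_yy), since the
    # common divisor of the two standard deviations cancels.
    beta = b * math.sqrt(s_xx / s_yy)

    # Equations 12.8 and 12.18: residual = observed y minus predicted y.
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)
    s_b = math.sqrt(sse / (n - 2)) / math.sqrt(s_xx)   # Equation 12.19
    t_b = b / s_b                                      # Equation 12.20

    return {"n": n, "df": n - 2, "r": r, "s_r": s_r, "t_r": t_r,
            "b": b, "a": a, "beta": beta, "s_b": s_b, "t_b": t_b}


results = bivariate_summary(x, y)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Run on these data, the sketch should agree with the SPSS output for this exercise to three decimals (r = .547, B = .273, constant = 5.941, standard error of the slope = .105, t = 2.611). It also illustrates two points from Lecture 2: in a bivariate regression the standardized beta equals Pearson’s r, and the t-value for the slope equals the t-value for r.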