This Document Contains Chapters 14 to 19

PART II SECTION D TEACHING NOTES FOR CHAPTERS

CHAPTER FOURTEEN
SAMPLING FUNDAMENTALS

Learning Objectives
• Describe the distinction between a census and a sample.
• Describe the differences between sampling and nonsampling errors.
• Describe the sampling process.
• Describe probability sampling procedures.
• Describe nonprobability sampling procedures.
• Discuss determining sample size with ad hoc methods.
• Discuss sampling in the international context.

Teaching Suggestions
The objectives of this chapter are, first, to acquaint students with the basic terminology and issues in sample design and with the alternative methods of selecting a sample. Because this objective has primacy, we have left the technical issues of sample precision and sample size determination to the next chapter. This also helps the achievement of the second objective, which is to develop an appreciation that an appropriate sample design is largely a matter of common sense, exercised within the context of the research purpose and objectives. The third objective is to emphasize that the determinants of an acceptable marketing research sample are the precision and freedom from bias required by the research purpose, within a reasonable budget dictated by the value of the information.

This chapter also provides a good opportunity to review the multiple and conflicting considerations that have to be balanced by research designers, and to further emphasize that random sampling error (due to taking a sample rather than a census of the population) is often a small proportion of the total error. The diagram below has been found to be helpful as a basis for discussion when put on the board at the beginning of the class. Each question for discussion, and each example provided by the instructor, can easily be related to this scheme.

Questions and Problems
1(a). The frame could be those who ride during a one-week period or during a one-day period. Or the frame could be those who enter the system from any of five stations. Alternatively, an area telephone survey could use the telephone directory.
(b) Yellow pages of telephone directories. A list supplied by an association of sporting goods retailers.
(c) Such a frame would have to include sporting goods stores, discount stores, department stores, hardware stores, etc. Again, the yellow pages might provide a start, but it would probably have to include several categories of retailers (not just sporting goods).
(d) The whole population would need to be in the frame, since such a high proportion of people watch TV. A telephone directory could provide the basis (if 1 were added to the last digit, those with unlisted phones would be included) if a telephone survey were planned.
(e) An area sample could be restricted to census districts with high average income. Lists of people with high incomes are also available, such as subscribers to The New Yorker.
(f) Like the television viewers, such a population is so broad that a frame would have to be comprehensive. See (d).
2(a) By where their trip originated, their destination, and their area of residence.
(b) By size, and by location in terms of area of the country (shopping center location vs. stand-alone location vs. other types of locations).
(c) By type of store, by tennis racket sales volume.
(d) By area, by time spent watching TV per week, by program types watched.
(e) By area, by income, by family size.
(f) Male vs. female.
(g) By area, by community.
(h) By shopping frequency, size of disposable income.
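A concrete illustration of the "add 1 to the last digit" idea in 1(d): the short sketch below draws a simple random sample of directory listings and converts each to a plus-one number, so that unlisted numbers adjacent to listed ones get a chance of selection. The directory entries are invented for illustration and the helper name is ours; this is a minimal sketch of the general technique, not the textbook's procedure.

```python
import random

# Hypothetical directory listings (invented for illustration).
directory = ["701-555-0132", "701-555-0478", "701-555-0261", "701-555-0909"]

def plus_one_number(listed: str) -> str:
    """Return the listed number with 1 added to its last digit (9 wraps to 0)."""
    last = (int(listed[-1]) + 1) % 10
    return listed[:-1] + str(last)

# Draw a simple random sample of listings, then convert each to a plus-one number to dial.
random.seed(1)
sample = random.sample(directory, k=2)
numbers_to_dial = [plus_one_number(n) for n in sample]
print(numbers_to_dial)
```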
3. This calls for a stratified sample. If a sample size of 200 is budgeted, a sample of perhaps 100 or 150 should be taken from the stratum of large stores and the rest (50 or 100) should be taken from the population stratum of small stores. Since the information about large stores will probably be at least as valuable as information from small stores, it doesn't make sense to sample them proportionately to their numbers. It would be useful to know the cost of sampling, the relative value of the two strata to the study, and the variance within each stratum.
4. This question is intended to get the student to think through the mechanics of a telephone survey. A key issue is whether it will be worthwhile to attempt to include those without telephone listings by some mechanism like randomizing the last digit. One consideration is the number of unlisted phones in Fargo. Another is the likely bias in excluding them. A second issue is how many call-backs to include in the sampling plan. A third issue is when to schedule calls. Presumably, most would be scheduled as close to 24 hours after the broadcast as possible.
5. The issue here is whether an in-store sampling plan should be designed. If so, people could be intercepted while shopping, at the checkout line, or upon leaving the store. The design would probably stratify by hour of the day and perhaps sample proportionate to the number of shoppers during that hour. The problem is that those who shop exclusively at other stores would be missed. The alternative would be an area sample or a telephone sampling design. The frame could be obtained either from the names of owners of cars in the parking lot (since the license plates can be noted) or from those who fill out a coupon for a free draw. (It should be noted that it is usually difficult to get permission from competitors to sample their shoppers.)
6. If the problem were the low usage of the library, it might well be important to survey users who are not cardholders and even more important to survey those who do not use the library at all. Thus, a general survey of the population might be much more appropriate. A convenience sampling frame can often be biased.
7. The point of these questions is to give the student a hands-on feel for simple random sampling. The act of drawing four samples and observing that the sample mean is different in each should provide an understanding of the concept of the sampling distribution of the sample mean.
8. A serious bias may be introduced because those living in dense neighborhoods (high-rise apartments, for example) will have a much lower probability of selection than those living in roomy subdivisions. The person on the five-acre estate will have the largest probability of being drawn.
9. To obtain information about windmills in use, Robert Ferber (Marketing News, March 24, 1978) attempted to take a census of the 300 or so power-generating windmills in use. He approached 3,000 county extension agents and owners. He then employed snowball techniques to have those owners identify other owners.
10. Students should use the approach given in the textbook to answer this question, using Tables 13-2 and 13-4. For example, a student might take the sixth row of Table 13-2 and start from the right to obtain the number 39,359. The selected city from Table 13-4 would be the one with the cumulative population corresponding to 39,359: Filmore. Students can use the random number table to answer this question: a student takes a number from the random number table and uses it to select the city whose cumulative population range contains that number.
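The city-selection mechanics in problem 10 can be sketched in a few lines: form the cumulative populations, draw a random number between 1 and the total, and pick the city whose cumulative range contains it. The city names and populations below are invented stand-ins, since Table 13-4 itself is not reproduced here.

```python
import bisect
import itertools
import random

# Hypothetical city populations standing in for Table 13-4 (the actual table is in the text).
cities = {"Alton": 12000, "Bedford": 8500, "Claremont": 15300, "Dover": 4100, "Filmore": 9600}

names = list(cities)
cumulative = list(itertools.accumulate(cities.values()))  # running totals, as in the table

def select_city(random_number: int) -> str:
    """Return the city whose cumulative-population interval contains the random number."""
    index = bisect.bisect_left(cumulative, random_number)
    return names[index]

random.seed(7)
draw = random.randint(1, cumulative[-1])  # stands in for a random number table entry
print(draw, select_city(draw))
```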
11. As mentioned in the textbook, one entrance of a shopping center may draw from very different neighborhoods than another. A solution is to stratify by entrance location. To obtain an overall average, the resulting strata averages need to be combined by weighting them to reflect the relative traffic associated with each entrance. In this case, if the sample is stratified by entrances A and B, the proportion of shoppers using entrance A is .44 (1,100 people out of a total of 2,500) and the proportion using entrance B is .56 (1,400 out of 2,500). The estimate of the number of people saying they will buy the product is the entrance A total and the entrance B total weighted by the proportions of shoppers they represent: (.44 x 95) + (.56 x 125) = 111.8. This represents about 4.5% of the total shoppers polled. Another way to stratify the sample might be on the basis of time of day (i.e., weekdays, evenings, and weekends) rather than by entrance, since different shoppers may shop at different times. This would result in a different estimate of the proportion of people who say they will buy the product. Students should justify whichever method they choose to arrive at their answer.
12. Stratified sampling is a probability sampling procedure that divides the population into specific strata, after which units are chosen randomly from each stratum. Two types of stratified sampling are possible: proportionate and disproportionate stratified sampling. An ideal stratum in stratified sampling should be as homogeneous as possible with respect to the variables of interest. The within-stratum similarity of units lowers the standard error contribution from each stratum to the overall standard error. Thus, the more homogeneous the strata, the more precise the confidence interval estimate will be (the simulation sketch following problem 13 illustrates this). A stratified sample is, however, marked by heterogeneity between groups. Cluster sampling is a probability sampling procedure in which the population is divided into clusters; clusters of population units are selected at random, and then all the units in the chosen clusters are studied. When a sampling frame needed to stratify the population is not available, cluster sampling may be resorted to. Unlike stratified sampling, cluster sampling does not include units from all the clusters into which the population is divided, and hence each cluster should be as heterogeneous as possible internally, while the clusters themselves are homogeneous between groups. An ideal cluster is an exact replica of the population.
13. Sampling efficiency is defined as the ratio of accuracy to cost. It reflects the trade-off between the cost of employing a probability sampling procedure and the resulting accuracy that can be achieved: the higher the cost, the higher the accuracy. Sampling efficiency can be increased in various ways, as described below:
(i) Holding accuracy constant and decreasing the cost.
(ii) Holding cost constant and increasing accuracy.
(iii) Increasing accuracy at a faster rate than the rate of cost increase.
(iv) Decreasing accuracy at a slower rate than the rate of cost decrease.
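A minimal simulation, on invented data, of the claim in problem 12 that internally homogeneous strata lower the standard error: repeated samples of the same total size are drawn both as simple random samples and as proportionate stratified samples, and the spread of the resulting sample means is compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population with three internally homogeneous strata (illustrative only).
strata = [rng.normal(loc=mu, scale=2.0, size=5000) for mu in (10, 30, 60)]
population = np.concatenate(strata)
n = 300  # total sample size

def srs_mean(pop):
    return rng.choice(pop, size=n, replace=False).mean()

def stratified_mean(strata):
    # Proportionate allocation: each stratum gets a share of n equal to its population share.
    total = sum(len(s) for s in strata)
    means, weights = [], []
    for s in strata:
        n_h = round(n * len(s) / total)
        means.append(rng.choice(s, size=n_h, replace=False).mean())
        weights.append(len(s) / total)
    return float(np.dot(weights, means))

srs = [srs_mean(population) for _ in range(2000)]
strat = [stratified_mean(strata) for _ in range(2000)]
print("SRS standard error:       ", round(np.std(srs), 3))
print("Stratified standard error:", round(np.std(strat), 3))
```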
14. Simple random sampling is an approach in which each population member, and each possible sample of a given size, has an equal probability of being selected. Systematic sampling spreads the sample through the list of population members in a systematic manner. In systematic sampling the sampling list is of paramount importance, and the sampling efficiency depends on how the list is ordered. For instance, suppose we need to sample 100 Girl Scouts for a survey from a list of 1,200. The list could be arranged in three different ways: in random order, in a cyclic order by the month each scout joined, or in ascending order of their ages. With the random-order list, the sampling efficiency of systematic sampling is equal to that of simple random sampling. With the list in ascending order of age, the sampling efficiency of systematic sampling is higher than that of simple random sampling. With the list ordered by month of joining and a sampling interval of twelve, the sampling efficiency of systematic sampling is lower than that of simple random sampling, because the same month keeps being selected, resulting in lower accuracy.
15. Probability sampling ensures that each element in the population of interest has a known chance of being chosen, whereas nonprobability sampling is a subjective procedure in which the probability of selection is not known. The advantages of nonprobability sampling over probability sampling are that it costs less, is useful in exploratory research, takes less time to execute, and is simpler in design. An important feature of nonprobability sampling is that it offers researchers greater freedom and flexibility in selecting individual population units. Probability samples are drawn when there is a need for highly accurate estimates of market share or volume that have to be projected to the entire market; heterogeneous markets favor probability samples. Nonprobability samples are used when budgets are limited and when probability sampling would be prohibitively expensive, and they are common in concept tests, product tests, package tests, and focus groups.
16. Proportionate stratified sampling is a form of stratified sampling in which the sample consists of units selected from each population stratum in proportion to the total number of units in the stratum. Disproportionate stratified sampling is a form of stratified random sampling in which the number of units selected from each stratum also depends on how varied the units within the stratum are; the total sample is allocated on the basis of relative variabilities rather than relative sizes.
17a. The target population of the study will be all the residents of Winona above the age of fifteen, from households over a stipulated annual income, with an interest in sports.
b. If the study involves only attenders, the sampling frame will be the list of all attenders compiled over the past two seasons. However, in order to gauge the attitudes of both attenders and nonattenders, this sampling frame will be inadequate. Any list that gives the names of the residents of Winona can be used; for instance, a mailing list bought from a survey company can be used to obtain a cross-sectional sample of the residents.
c. The researcher should decide whether to use a traditional method of sampling or a Bayesian method. Most marketing research projects employ a traditional sampling method without replacement, since a respondent is not contacted twice for the same information.
In our case, the scarce financial resources and the nature of the study point toward a nonprobability type of sampling. The specific type of nonprobability sample chosen depends on a host of factors. The students can be asked to assess the advantages and disadvantages of each method and recommend what they think is the optimal method.
18a. In domestic research, the identification of the target population and the determination of the sampling frame are easier than in the international context because of the availability of information. Owing to the paucity of information in many countries, the generation of a sampling frame is a major problem in international research; even when lists are available, they may not provide adequate coverage. Another difference between domestic and international research is that, internationally, sampling may take place at a number of geographic levels. The level at which the sample is drawn depends on the product market, the research objectives, and the availability of lists at each level.
b. Before conducting international marketing research, the researcher should determine whether the research should be conducted in all countries or whether the results are generalizable across countries. Because of the high costs of research in different countries, there is a tradeoff between research costs and the number of countries in which the research is conducted. After the countries have been identified, the sampling technique is to be determined. Probability sampling is often not feasible owing to the paucity of sampling lists. A popular technique is snowball sampling, in which an initial set of respondents is selected at random and additional respondents are identified from the responses given by the initial respondents. The international researcher should not be set on using the same sampling procedure in every country, as each procedure varies in its reliability across countries, and the costs of the various procedures also differ from country to country. An appropriate sampling procedure should be chosen for each country based on the tradeoff between cost and reliability. Students can be asked to go through this exercise by following the procedure outlined in Q.17.

PART II SECTION D TEACHING NOTES FOR CHAPTERS

CHAPTER FIFTEEN
SAMPLE SIZE AND STATISTICAL THEORY

Learning Objectives
• Discuss some ad hoc methods of determining sample size.
• Discuss the concepts of population characteristics.
• Discuss the concepts of sample statistics.
• Discuss sample reliability.
• Discuss confidence intervals and interval estimation.
• Describe how to calculate sample size for a simple random sample.
• Discuss the formulas used for estimating proportions.
• Discuss when to use the coefficient of variation.
• Describe how to calculate sample sizes for stratified sample designs.
• Describe how to calculate sample sizes for multistage designs.
• Describe the concept of sequential sampling.

Teaching Suggestions
This chapter was deliberately separated from Chapter 14 so that it could be bypassed by instructors who want a less quantitatively oriented course. The chapter presents the approach to sample size determination based on statistical theory. This approach is rarely applied directly because of its focus upon a single question, simple random sampling, and the assumption that the population standard deviation and the confidence level are both known.
Further, as the last chapter made clear, several ad hoc methods are available. However, the chapter introduces several useful concepts that provide guidance on the sample size question and are useful in themselves: population characteristics, sample characteristics, sample reliability, and interval estimation. Also, the statistical approach is sometimes used directly to determine sample size, although the student should not get the impression that marketing research cannot be conducted if these formulas are not understood.

The first two sections attempt to make clear, graphically, the distinction between population characteristics (parameters) and sample characteristics (statistics). Students always seem to get these two concepts confused when learning this material; consequently, the distinction should be emphasized in class. Another source of confusion is the distinction between the distribution of X and the distribution of X̄. For some encountering the normal curve for the first time, Figure 15-4 will be important to understand.

One trick to teaching interval estimation is not to get too bogged down in probability theory. Try to keep it as natural and simple as possible. Of course, the concept of a confidence level needs to be explained in the context of probability theory. The sample size formula follows immediately from the interval estimate. It is best to de-emphasize the derivation (the curious students can work it through); the interpretation and determination of the terms in the general formula should be the focus of the discussion.

Questions and Problems
1. This question is intended to make the student note the difference between the population and the sample and to provide an opportunity to actually do some calculating (one of the few in the book). One approach is discussed below.
(a), (b) The population mean is μ = 3.32 and the population variance is σ² = 1.73, so σ = 1.31.
(c)
R      Freq.   R x Freq.   (X - X̄)² x Freq.
5        4        20            13.54
4        9        36             6.35
3        4        12             0.10
2        3         6             4.04
1        5         5            23.33
Total   25        79            47.36
X̄ = 79/25 = 3.16; s² = 47.36/24 = 1.97; s = 1.40.
(d), (e) The population mean (μ) is 3, and the variance (σ²) is [(5 − 3)² x 12,500 + (1 − 3)² x 12,500]/25,000 = 4, so σ = 2. This value of the population standard deviation would be larger than any estimate because it represents the extreme case of the maximum possible variation.
2. n ≥ (cσ)²/(error)², where c = 2, σ = 1.49, and error = .10.
3(a) n = c²/[4(error)²] = 2²/[4(.01)²] = 10,000
(b) n = 2²/[4(.03)²] = 1,111
(c) n = 2²/[4(.06)²] = 278
(d) For a 90 percent confidence level, c = 5/3:
error ± .01: n = (5/3)²/[4(.01)²] = 6,944
error ± .03: n = (5/3)²/[4(.03)²] = 772
error ± .06: n = (5/3)²/[4(.06)²] = 193
4(a) n ≥ c²/[4(error)²] = (5/3)²/[4(.02)²] = 1,736
(b) n ≥ 2²/[4(.02)²] = 2,500
(c) If it were known that P is less than .3, then we would use .3 in the formula instead of .5, because .3(1 − .3) = .21 is smaller than .5(1 − .5) = .25. Then, instead of .25, we would use .21 (if we wanted to be that precise): n = c²p(1 − p)/(error)² = 2,100 for 95 percent confidence and 1,458 for 90 percent confidence.
(d) .3 ± (5/3)(.3)(.7)/√400 = 0.3 ± 0.017.
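The sample-size arithmetic in problems 3 and 4 is easy to check with a short script. It uses the worst-case formula n = c²/[4(error)²] and the refined formula n = c²p(1 − p)/(error)²; results are rounded to the nearest integer, as in the answers above, and the function names are ours.

```python
def n_worst_case(c: float, error: float) -> int:
    """Sample size for a proportion using the conservative p = 0.5 (so p(1 - p) = 0.25)."""
    return round(c**2 / (4 * error**2))

def n_with_prior(c: float, error: float, p: float) -> int:
    """Sample size when a rough prior estimate of the proportion p is available."""
    return round(c**2 * p * (1 - p) / error**2)

# Problem 3: c = 2 for 95% confidence, c = 5/3 for 90% confidence.
for c, label in [(2.0, "95%"), (5/3, "90%")]:
    print(label, [n_worst_case(c, e) for e in (0.01, 0.03, 0.06)])
# Expected output: 95% -> [10000, 1111, 278]; 90% -> [6944, 772, 193]

# Problem 4(c): using p = 0.3 instead of 0.5, with error = 0.02.
print(n_with_prior(2.0, 0.02, 0.3), n_with_prior(5/3, 0.02, 0.3))  # 2100 and 1458
```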
5(a) The assumption is made that usage follows a normal curve, and it is known that 95 percent of a normal curve lies within plus or minus two standard deviations of the population mean. Thus, a range of four population standard deviations spans 16 times per month, and one standard deviation would be 4.
(b) n = (cσ)²/(error)² = (5/3 x 4)²/1² = 45 (90 percent confidence); = (2 x 4)²/1² = 64 (95 percent confidence).
(c) n = (cσ)²/(error)² = (5/3 x 4)²/(.4)² = 281 (90 percent confidence, error = .4); = (2 x 4)²/(.4)² = 400 (95 percent confidence, error = .4).
(d) In selecting the confidence level and desired accuracy, higher confidence and increased accuracy must be traded off against sample size. Some considerations that will affect this decision are the importance of the decision that relies on the estimate, the profit potential and investment required by the decision, and the cost of the sample.
6(a) By looking at the table, it is intuitive that we should sample least from the small stratum and most from the large stratum because of the differences in standard deviations and interview costs across strata. To determine the allocation, use the formula
n_i = n (π_i σ_i/√c_i) / Σ_j (π_j σ_j/√c_j).
(b) We recommend 200 interviews from the large stratum, as opposed to the roughly 30 a simple random sample would yield, because of the large variation in order sizes in this stratum.
(c) X̄ = Σ_i π_i X̄_i = 0.1 x 100 + 0.2 x 8 + 0.7 x 5 = 10 + 1.6 + 3.5 = 15.1 (thousand $).
(d) The estimate of the variance of the population mean is obtained by combining the within-stratum variances. If we had taken a simple random sample of 300, we would sample only about 30 people instead of 200 from the large stratum. There would be much larger variance in the estimate of the mean for this stratum with 30 people instead of 200, and this would raise the variance estimate for the sample as a whole. The reason for using a stratified sample is to improve the precision over a simple random sample.
(e) Total interviewing cost = $19,200.
(f) With a total sample of 300 and the allocation proportions from the formula, n1 = .480(300) = 144, n2 = .072(300) = 22, and n3 = .448(300) = 134.
(g) Total interviewing cost under (f) = 64(144 + 22) + 9(134) = $11,830.
(h) If the total budget were $19,200 as in (e): budget = Σ_i c_i n_i, so 19,200 = 64n1 + 64n2 + 9n3. With n1 = .480n, n2 = .072n, and n3 = .448n, we have 19,200 = 64(.480)n + 64(.072)n + 9(.448)n = 30.72n + 4.61n + 4.03n = 39.36n, so n = 488. Revised allocation: n1 = .480(488) = 234, n2 = .072(488) = 35, n3 = .448(488) = 219; total = 488.
7a. Assume that the total qualified population (over a specified age and able to attend sports events) is 144,000. The list of attenders includes 1,200 names, so the number of nonattenders is 142,800 (144,000 − 1,200). The proportion of nonattenders in the population is about 99.1% (142,800/144,000), and the sample should contain that same proportion of nonattenders.
b. The nonattender population contains 142,800 people, and about 99.1% of the sample should be nonattenders.
8a. The owner of Galaxy Pizza can use census information for the area, a municipal list of the households in the area, and so on; this can serve as a sampling frame.
b. Assuming each household has 5 people, the total number of households is 20,000. A random sample of 2,000 households implies that 10% of the households are covered in the sample.
c. The residential areas within a ten-mile radius and/or within a thirty-minute journey time of the store are identified. Area sampling can be used: after selecting a sample of clusters from the city, a sample of consumers can be contacted from each cluster. This helps in reducing traveling time and money.
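The budget arithmetic in problem 6(h) can be verified with a few lines of code. The per-interview costs ($64, $64, $9) and allocation proportions (.480, .072, .448) come from the answer above; the variable names are ours, and n is rounded to the nearest integer as in the manual's answer of 488.

```python
# Allocation proportions and per-interview costs from problem 6(h).
proportions = [0.480, 0.072, 0.448]   # share of the total sample n going to each stratum
costs = [64, 64, 9]                   # interviewing cost per respondent in each stratum
budget = 19_200

# Total cost is n * sum(p_i * c_i); solve for the largest n the budget allows.
cost_per_sampled_unit = sum(p * c for p, c in zip(proportions, costs))   # 39.36
n = round(budget / cost_per_sampled_unit)                                # 488
allocation = [round(p * n) for p in proportions]                         # [234, 35, 219]

print(n, allocation, "total cost:", sum(a * c for a, c in zip(allocation, costs)))
```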
PART III TEACHING NOTES FOR CHAPTERS

CHAPTER SIXTEEN
FUNDAMENTALS OF DATA ANALYSIS

Learning Objectives
• Discuss the need for preliminary data preparation techniques such as data editing, coding, and statistically adjusting the data where required.
• Describe the various statistical techniques for adjusting the data.
• Discuss the significance of data tabulation.
• Discuss the factors that influence the selection of an appropriate data analysis strategy.
• Discuss the various statistical techniques available for data analysis.

Teaching Suggestions
This chapter is written to be used in conjunction with Chapters 17 and 18 for maximal benefit to the students. It covers the basics of data analysis, including an introduction to data preparation: editing, coding, weighting, and statistically adjusting the data. The reality is that most practical marketing research data analysis uses the measures and techniques described in this chapter. In part, that is because these techniques are extremely powerful and useful, although it is also, in part, due to the fact that many analysts and their clients are not familiar with more advanced techniques. Whatever the emphasis of the course, this chapter will deserve thorough treatment. Each section can legitimately be stressed as important to effective data analysis.

At the outset a very useful six-step process is put forward. It represents a structuring of the chapter and of data analysis in general, and it suggests a natural, logical flow to the process. The Table 16-2 calculation can be tricky for some, and the instructor may want to walk the students through it. The difference-between-means discussion might be related to the theory of market segmentation: the development and pursuit of marketing programs for market subgroups. Multivariate analysis is only mentioned in this chapter. The results presentation is most important.

Questions and Problems
1. The coding question was taken from one of the California Poll surveys shortly after Brown was first elected Governor. One way to compensate for the number of answers a respondent gives is to code the first, second, third, etc., responses separately. Then the analysis can be conducted separately for only the first response, in addition to analyzing all responses together.
(a) Should the first comment be coded as "like his ideas" or in a "welfare" category, or both? (No, the category is for answers that literally say something similar to "like his ideas.") Should the next two involve a "business climate" category or a "regulation" category, or both? Should "too soon" really be combined with "not bad or good"?
(b) Does this respondent like his ideas? Should category one be used because he is doing OK? (The respondent did not say "so far," so he really doesn't fit there.)
(c) This response will probably require two new categories.
(d) Should the "strong stand" be coded as category 5? Probably not, since category 5 is a positive category.
(e) Should these be coded as category 3? Probably not, since category 3 refers to "people" and this comment refers to "farm workers." Similarly, "is doing a good job" and "I like him" really do not fit well into any of the categories.
Some of the other categories used, and the number of responses in each, are as follows:
(56) Young.
(57) Not influenced by private interest groups, others.
(52) Sets an example: small car, small apartment, walks to work, etc.
(53) Like what he did about the smog device bill.
(30) Like position on Vietnamese refugees.
(34) Has not kept promises.
(33) Policies are poor, expected more, general negative mention.
(25) Is keeping campaign promises.
(25) Improving economy.
(20) Likes position on farm workers.
(21) Like appointments and what they are doing.
(20) All other positive.
(19) Too young, immature.
(24) Negative personality mentions (too politically ambitious, doesn't think, too quick, not realistic).
(39) All other negative responses.
(135) No answer.
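A minimal sketch of the mechanics being described in problem 1: open-ended responses are assigned numeric category codes by a coder, and the codes are then tabulated. The code frame and assigned codes below are invented for illustration; they are not the California Poll categories.

```python
from collections import Counter

# Invented code frame (code -> label) and coder-assigned codes for six open-ended answers.
code_frame = {1: "doing OK so far", 2: "like his ideas", 3: "cares about people",
              5: "positive, other", 7: "negative, other", 9: "no answer"}

assigned_codes = [2, 1, 7, 2, 9, 5]   # one code per respondent's first response

tabulation = Counter(assigned_codes)
for code, count in sorted(tabulation.items()):
    print(f"{code} {code_frame[code]:<20} {count}")
```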
2. Figure 16-4 indicates that those most interested in the HMO tended to be lower-income and younger families. The implication is that it might be useful to use income and age to identify target segments. A good starting point is to first cross-tabulate income by age to make sure that there are really two variables involved. It may be that the low-income families are also the young families, and thus the two variables would really be measuring the same thing, in which case we have found only one segment-defining variable. The second data analysis step would be to describe the "low income" segment in terms of other variables that we have in the survey. This step is a standard approach in a segmentation study.
3. This figure indicates that those not intending to have children are better prospects than the others. Here again it is very likely that "not intend to have children" reflects an older family and a higher-income family, and thus it would be premature to say that a new segment has been identified. It is more likely that this question simply describes further the young, low-income segment. Cross-tabulating intentions to have more children with income or age would confirm this judgment.

PART III TEACHING NOTES FOR CHAPTERS

CHAPTER SEVENTEEN
HYPOTHESIS TESTING: BASIC CONCEPTS AND TESTS OF ASSOCIATIONS

Learning Objectives
• Discuss the logic behind hypothesis testing.
• Describe the steps involved in testing a hypothesis.
• Discuss the concepts basic to the hypothesis-testing procedure.
• Discuss the significance level of a test.
• Describe the difference between Type I and Type II errors.
• Describe the chi-square test of independence and the chi-square goodness-of-fit test.
• Discuss the purpose of measuring the strength of association.

Teaching Suggestions
The emphasis in this chapter and the next (as in the whole data analysis section of the book) is not upon calculation. The computer or a statistical consultant can do the calculating. The emphasis is, rather, upon asking the right questions and making proper interpretation of the results. Thus, the chapter seeks to get the student to:
1. Ask the hypothesis test question. Maybe these empirical findings simply represent sampling variation. What is the probability that such (or even more impressive) results would have emerged if the null hypothesis were true? A low p-value means the results are impressive and that their implications are worth considering. A high p-value means that the results should be disregarded or discounted.
2. Interpret the significance level properly: a significance level of 0.10 simply means that the p-value was less than 0.10.
The four steps in Figure 17-1 summarize the logic.

There are two things that this book is not. It is not a reference book for the hundreds of statistical tests that could be used; we do not feel a marketing research book should provide that function or that a student should be burdened with sorting out all the available tests. Second, this book does not attempt to provide the ability to perform calculations. Rather, it emphasizes inputs, outputs, assumptions, and interpretations. The inclusion of the formulas behind chi-square tests in cross tabulations is the exception to the rule. The computer will perform the calculations; the task is again to ask the right questions and to interpret the results appropriately. The cross-tabulation example is used as a vehicle to explain conceptually what independence (the null hypothesis) is.
The experiment in Table 17-2 is the primary vehicle. The use of chi-square as an association measure is discussed, but it is the appendix that provides a more detailed discussion of association measures for nominally scaled variables.

Questions and Problems
1(a)
              Column 1     Column 2      Column 3     Row Total
Row 1         E1 = 17.8    E4 = 46.8     E7 = 24.5    22.3% (89)
Row 2         E2 = 19.8    E5 = 52.1     E8 = 27.3    24.8% (99)
Row 3         E3 = 42.4    E6 = 111.3    E9 = 58.3    53.0% (212)
Column Total    (80)         (210)         (110)
(b) E1 means that if the rows and columns were independent (a knowledge of one provides no information about the other, like flipping a coin or drawing a card), then a total of 17.8 people would be "expected" to be in cell 1. If the experiment were repeated many times, on average 17.8 would be in cell 1.
(c)
(d) With four degrees of freedom, (r − 1)(c − 1), the critical value given in the table at the end of the book is 18.5 at the 0.001 level and 14.9 at the 0.005 level. Thus, the chi-square statistic is significant at the 0.005 significance level, and we would reject the independence null hypothesis.
(e) False. It just shows that if usage did not differ by age, the probability of getting a chi-square value this large or larger would be very small. Thus, the evidence points to the conclusion that usage differs by age.
2. Ho: Preferences and brands are not related. Ha: Preferences and brands are related.
Purchaser                 A        B        C        D      Total
Buys the brand          45 (50)  50 (50)  45 (50)  60 (50)   200
Doesn't buy the brand   55 (50)  50 (50)  55 (50)  40 (50)   200
Total                     100      100      100      100      400
Expected values (in parentheses) = (row total x column total)/grand total; for example, E11 = (200 x 100)/400 = 50.
χ²cal = Σ(O − E)²/E = (45 − 50)²/50 + (50 − 50)²/50 + ... + (40 − 50)²/50 = 0.5 + 0 + ... + 2 = 6.
Degrees of freedom = (4 − 1)(2 − 1) = 3. At α = 0.05, χ²table = 7.815. Since χ²cal < χ²table, Ho cannot be rejected: preferences and brands are not related.
3. Ho: The observed distribution of those attending the concert fits the on-campus distribution (they are statistically equivalent). Ha: The observed distribution does not fit the on-campus distribution.
Observed value                     Expected value
Juniors = 74% (59)                 Juniors = 62% (50)
Seniors = 17% (14)                 Seniors = 23% (18)
Freshmen & Sophomores = 9% (7)     Freshmen & Sophomores = 15% (12)
Degrees of freedom = 3 − 1 = 2; χ²tab at α = 0.05 = 5.991.
χ²cal = Σ(O − E)²/E = (59 − 50)²/50 + (14 − 18)²/18 + (7 − 12)²/12 = 1.62 + 0.889 + 2.083 = 4.59.
Since χ²cal < χ²tab, do not reject Ho; conclude that the observed distribution is consistent with the on-campus distribution.
4. Ho: The observed application pool coincides with the historical pattern. Ha: The observed application pool does not coincide with the historical pattern. α = 0.05.
Observed pattern               Expected pattern
In-state = 75                  In-state = 70
Neighboring states = 15        Neighboring states = 20
Other states = 10              Other states = 10
χ²tab (df = 3 − 1 = 2) at α = 0.05 = 5.991.
χ²cal = (75 − 70)²/70 + (15 − 20)²/20 + (10 − 10)²/10 = 0.357 + 1.25 + 0 = 1.607.
Since χ²cal < χ²tab, do not reject Ho and conclude that the observed application pool coincides with the historical pattern.
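The chi-square computations in problems 2 and 4 can be reproduced directly; scipy returns the statistic, degrees of freedom, and p-value, which line up with the table look-ups above.

```python
import numpy as np
from scipy.stats import chi2_contingency, chisquare

# Problem 2: test of independence on the brand-preference table.
observed = np.array([[45, 50, 45, 60],    # buys the brand
                     [55, 50, 55, 40]])   # does not buy the brand
chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), dof, round(p, 3))   # 6.0 on 3 df, p about 0.11 -> do not reject Ho

# Problem 4: goodness-of-fit test of the application pool against the historical pattern.
stat, p4 = chisquare(f_obs=[75, 15, 10], f_exp=[70, 20, 10])
print(round(stat, 3), round(p4, 3))       # about 1.607, p well above 0.05 -> do not reject Ho
```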
5. At α = 0.01:
Ho: There is no association between a child's sex and the hours of play. Ha: There is an association between a child's sex and the hours of play.
          Less than 2.5   2.5 or more   Total
Boys        18 (16.9)      10 (11.1)      28
Girls       17 (18.1)      13 (11.9)      30
Total          35              23          58
Expected values are given in parentheses. χ²tab at α = 0.01 with df = 1 is 6.635.
χ²cal = (18 − 16.9)²/16.9 + (10 − 11.1)²/11.1 + (17 − 18.1)²/18.1 + (13 − 11.9)²/11.9 = 0.072 + 0.109 + 0.067 + 0.102 = 0.349.
Since χ²cal < χ²tab, do not reject Ho, and conclude that there is no evidence of an association between a child's sex and hours of play.

PART III TEACHING NOTES FOR CHAPTERS

CHAPTER EIGHTEEN
HYPOTHESIS TESTING: MEANS AND PROPORTIONS

Learning Objectives
• Discuss the more commonly used hypothesis tests in marketing research: tests of means and proportions.
• Describe the relationship between confidence intervals and hypothesis testing.
• Discuss the use of the analysis of variance technique.
• Describe one-way and n-way analyses of variance.
• Discuss the probability-values (p-values) approach to hypothesis testing.
• Describe the effect of sample size on hypothesis testing.

Teaching Suggestions
As reiterated in Chapter 17, this chapter should not be used for calculation purposes. It aims to provide a conceptual framework upon which the commonly used techniques in marketing research can be built. The instructor should be able to provide insights to the students and elicit proper interpretation of results. ANOVA is introduced in the context of a small numerical example with actual (though contrived) data. The use of "actual data" is intended to make the discussion more understandable and less abstract, and this will be the rule followed in the later chapters. The chapter has a minimum of symbols and concepts, but it still includes some concepts that are normally taught in a statistics course. Further, the basic idea is introduced in Chapter 16. This chapter, like other technical chapters, should usually be supported by a lecture and discussion that follows the text fairly closely. The students should understand the various figures and tables. Make sure that they see the link between the difference-between-means discussion in Chapters 16 and 17 and the ANOVA table. The interaction discussion is also worth reviewing.
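Because the chapter builds ANOVA from a small numerical example, a parallel sketch may be useful in class. The recall scores below are invented (they are not the text's data); scipy's f_oneway returns the F-ratio and p-value that the chapter's ANOVA table organizes.

```python
from scipy.stats import f_oneway

# Invented recall scores for three advertisements (not the text's data).
ad_a = [12, 15, 14, 11, 13]
ad_b = [16, 18, 17, 15, 19]
ad_c = [13, 12, 15, 14, 12]

f_ratio, p_value = f_oneway(ad_a, ad_b, ad_c)
print(round(f_ratio, 2), round(p_value, 4))
# A small p-value is evidence against the null hypothesis that the three
# population mean recall scores are equal.
```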
Questions and Problems
1. The evidence is indeed that the gasoline cost question is the one that best distinguishes the two groups. However, before question 3 is singled out, it should occur to the reader to check the probability (p-value) that a difference of 1.7 or greater would have occurred if the two groups were equal. Footnote C indicates that this p-value is less than the 0.01 level. Thus, we reject the null hypothesis that the two population means are equal. However, a sophisticated reader should also ask the question with a null hypothesis that the difference is no greater than that for question 1. Data and methodology to answer that are not available to the student, but he or she should still intuitively see that the difference between 1.4 (question 1) and 1.7 (question 3) is not statistically significant. Of course, question 3 is answered with a different frame of reference and contains a different amount of variation than question 1, so the two should not be compared directly. But it is still appropriate to consider a null hypothesis that the difference between population means for question 3 is no more than 1.4, as it places the results in a useful perspective.
2. Since the p-value is 0.07, the results are significant at the 0.10 level but not at the 0.05 level. Suppose we felt that 0.07 was a "low p-value." Does that mean that the point has been proved, that the null hypothesis that the population mean is 10 ounces has been disproved? No. You never prove or disprove anything with a sample or with the hypothesis test associated with the sample; you just generate strong or weak evidence. And you certainly do not generate a decision, like whether to boycott; such a decision would involve a host of considerations.
3(a) The null hypothesis is that the population proportions are the same: that trial would be the same in each city.
(b) That trial would be higher in Tulsa than in Fresno.
(c) 0.06.
(d) The p-level (0.06) would be significant at the .10 level but not the .05 level. The null hypothesis would be rejected at the .10 level but not at the .05 level.
(e) The hypothesis test only provides the p-level, a measure of the strength of the evidence against the null hypothesis; it does not show the hypothesis to be true or false. To determine whether to use a $.50 coupon we would need much more information, such as costs of various types.
4. A random sample of 100 automobiles. Ho: μ ≥ 5 miles/gallon; Ha: μ < 5 miles/gallon. Since |Zcal| > Ztab, reject Ho and conclude that the population mean is less than 5 miles/gallon.
For the question on the proportion of purchases made by women: since |Zcal| > Ztab, reject Ho and conclude that half of all purchases are not made by women.
8. Ho: p ≥ 0.45; Ha: p < 0.45. Since |Zcal| > Ztab, reject Ho and conclude that the proportion of members opting for international marketing research is lower.
9. n1 = 120, n2 = 100. Ho: μ1 − μ2 = 0 (the population means are equal); Ha: μ1 − μ2 ≠ 0 (the population means are not equal).
Standard error of the difference = √(0.0343 + 0.0441) = 0.28.
Zcal = [(3.355 − 9.5) − (μ1 − μ2)]/0.28 = [(3.355 − 9.5) − 0]/0.28 = −21.9.
Ztab at α = 0.05: Zα/2 = ±1.96. Since |Zcal| > Ztab, reject Ho and conclude that the population means are not equal.
10. α = 0.10, σ = 0.1, μ = 5.0, n = 25. Ho: μ = 5.0; Ha: μ ≠ 5.0. For a two-tailed test at α = 0.10, Ztab = ±1.645.
Zcal = (5.1 − 5.0)/(0.1/√25) = 0.1/0.02 = 5.
Since |Zcal| > Ztab, reject Ho and conclude that the mean preference is not 5.0.
11. n = 9, μ = 2.0, σ = 0.06. Ho: μ ≥ 2; Ha: μ < 2. Since |Zcal| > Ztab, reject Ho and conclude that the mean is less than 2 units.
12(a) The null hypothesis is that the "population" means are equal: there is no difference among the three advertisements. The alternative hypothesis is that there is some difference; they are not all equally effective.
(b) The F-ratio is 6.0/2.0 = 3.0. The p-value is about .055.
(c) The p-value is significant at the .10 level but not at the .05 level.
(d) There may be. The evidence against the null hypothesis is fairly strong, but we don't know for sure.
13(a) The new F-ratio is 6.0/1.81 = 3.315.
(b) The p-level is about .049. It is different because the unexplained variance has been reduced. The F-ratio 24.0/1.81 = 13.3 is significant at the .001 level.

PART IV TEACHING NOTES FOR CHAPTERS

CHAPTER NINETEEN
CORRELATION ANALYSIS AND REGRESSION ANALYSIS

Learning Objectives
• Discuss the use of correlation as a measure of association and describe the distinction between simple correlation and partial correlation.
• Discuss the objectives of regression analysis.
• Discuss the application of regression analysis.

Teaching Suggestions
The instructor can start off by explaining correlation analysis, with specific emphasis on the reasons underlying correlation analysis. In discussing correlation analysis, start with the Pearson correlation coefficient by going through the example in the text (Table 19-1). After discussing correlation analysis, the instructor can discuss regression analysis. Care should be taken to differentiate between the two techniques.
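A quick illustration of the Pearson coefficient for class use; the paired observations below are invented, not those of Table 19-1.

```python
import numpy as np

# Invented paired observations (e.g., advertising spend and sales), for illustration only.
x = np.array([10, 12, 15, 17, 20, 22, 25, 28])
y = np.array([30, 34, 33, 40, 45, 44, 50, 55])

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))   # a value near +1 indicates a strong positive linear association
```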
The regression material was written to be as accessible as possible. The emphasis is upon the output and its interpretation. Although the hypothesis test on the regression coefficient is covered, statistical theory is either deleted or relegated to footnotes. Students should be encouraged to bypass the footnotes, regarding them as reference material rather than material that really must be understood. It is tempting to delete them entirely, but they do provide a sense of completeness, and some readers, particularly those who have had some regression elsewhere or who will be applying the technique, will find them useful.

Estimation is introduced as a curve-fitting procedure. The concept that an independent variable can serve to explain and predict a dependent variable is stressed. The students should be pushed to realize that the key assumption is that the independent variables are appropriate and that some conceptual theory and hard thinking are usually needed to identify good independent variables. A related point is that regression can be used in an exploratory, model-building context, as was done in the HMO example. However, it is always good policy to hold back some data so that a test of the model can be conducted using as independent variables those variables that had high beta coefficients. The HMO study would have been an ideal occasion to do exactly that, because the sample size was large enough that both the exploratory phase and the testing phase would have had enough data. Unfortunately, it was not done at the time, and the data are not now available to us to redo the analysis.

Stepwise regression is not covered, although some instructors may want to mention it. Stepwise regression is where the computer has a set of candidate independent variables and selects the one that will provide the highest r². With the first variable specified, the computer selects a second variable that will generate the largest incremental r². The problem with stepwise regression is that one variable (e.g., income) may just miss getting selected in the first round. If that variable (income) is correlated with the variable that was selected in the first round (e.g., education), it might never get selected. The analyst might then be tempted to conclude, erroneously, that it should not be part of the model. Stepwise regression is a technique for an exploratory or model-building phase.

The raw data for the Figure 18-5 illustration are given below in case student analysis or replication is desired. The students might be asked to do a stepwise regression, to attempt models without using the advertising variable, or to exclude the urban-suburban variable (which really explains little variance). The correlations with store traffic are:
I. Store traffic      1.00
II. Advertising        .59
III. Store size        .62
IV. Urban-Suburban    -.25

Obs.   I. Store Traffic   II. Advertising   III. Store Size   IV. Urban-Suburban
1.            90                  0               1.3                 1
2.           550                  0               2.0                 1
3.           380                 40               2.1                 1
4.           180                100               1.5                 1
5.           200                110               1.0                 0
6.           600                180               1.9                 0
7.           300                200               1.0                 1
8.           220                300               1.5                 1
9.           790                310               2.0                 0
10.          700                380               2.0                 1
11.          380                420               1.6                 1
12.         1000                480               2.8                 0
13.          870                500               1.4                 1
14.          200                520               2.3                 1
15.          500                550               1.7                 1
16.          580                580               1.9                 0
17.         1000                570               2.8                 1
18.          600                720               2.5                 0
19.          730                680               2.4                 0
20.         1020                690               2.0                 1

The regression models discussed in this chapter are cross-sectional models. Time series models are not covered here.
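If students replicate the analysis, a least-squares fit of store traffic on the three predictors can be run directly on the data above. The sketch below uses numpy's lstsq; the printed simple correlations and coefficients can be compared with the figures reported in the text.

```python
import numpy as np

# Columns: store traffic, advertising, store size, urban (1) / suburban (0); data from the table above.
data = np.array([
    [90, 0, 1.3, 1], [550, 0, 2.0, 1], [380, 40, 2.1, 1], [180, 100, 1.5, 1],
    [200, 110, 1.0, 0], [600, 180, 1.9, 0], [300, 200, 1.0, 1], [220, 300, 1.5, 1],
    [790, 310, 2.0, 0], [700, 380, 2.0, 1], [380, 420, 1.6, 1], [1000, 480, 2.8, 0],
    [870, 500, 1.4, 1], [200, 520, 2.3, 1], [500, 550, 1.7, 1], [580, 580, 1.9, 0],
    [1000, 570, 2.8, 1], [600, 720, 2.5, 0], [730, 680, 2.4, 0], [1020, 690, 2.0, 1],
])

y = data[:, 0]                                       # store traffic
X = np.column_stack([np.ones(len(y)), data[:, 1:]])  # intercept plus the three predictors

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, b(advertising), b(store size), b(urban):", np.round(coef, 2))

# Simple correlations of each predictor with store traffic, for comparison with the text.
print(np.round([np.corrcoef(data[:, j], y)[0, 1] for j in (1, 2, 3)], 2))
```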
Questions and Problems
1a. Sx = 8.72, Sy = 2.193, ryx = 0.43.
(b) Ho: ρ = 0; Ha: ρ ≠ 0, at the 5% level. tcal = 1.16; tcri at α = 0.05 with df = 6 is 2.45. Since tcal < tcri, do not reject Ho.
For the test of whether β1 is positive: Ho: β1 ≤ 0; Ha: β1 > 0. Substituting the appropriate values into the t statistic gives tcal = 2.93, while tcri = 1.67 (at α = 0.05). Therefore, reject the null hypothesis and conclude that β1 is positive: the change in the spot rate predicted by the forward rate does have an effect on the actual change in the spot rate.
(d) Ho: β1 = 1; Ha: β1 ≠ 1. tcal = −0.77; tcri = 2 (at α = 0.05). Do not reject the null hypothesis, and conclude that β1 is one. This means that a unit change in the independent variable (the change predicted by the forward rate) is associated, on average, with a unit change in the dependent variable (the actual change in the spot rate).
7(a) The value of r² means that 30 percent of the variation in Y is accounted for by X. It also means that the correlation between X and Y is the square root of .30.
(b) The parameter estimate b1 is the estimate of β1, which is the change in Y to be expected if X changes by one unit. The parameter estimate a is the value of Y expected when X is zero (assuming the linearity assumption of the model holds at X = 0).
(c) Yes, at the 0.05 level.
(d) 103,000. The major assumption is that the linear relationship holds at such an extreme value of X; there are no data at that level of X.
8. This question is intended to get the student to do some hard thinking about what independent variables should be used and how they should be measured. Among the possible variables are:
(a) The size of the population within a two-mile radius (or within the area that contains 90 percent of customers).
(b) The population income (which could be obtained from census tract data), or age, or the average percentage of families with children under 16.
(c) Distance from competing stores or the number of competing stores within the drawing area.
(d) The size and quality of competing stores.
(e) The size of the shopping center in which the store is located, or the number of large department or discount stores in the shopping center.
(f) Parking.
9. The beta coefficient is interpreted as the change, in standard deviations of Y, that would be expected if Xi were increased by one standard deviation of Xi while the other independent variables were unchanged. It removes the units problem (one variable, like age, might be measured in years and another, like income, in dollars) and therefore makes the relative comparison of the regression coefficients less ambiguous. The footnote means that the hypothesis test that βi is zero has an associated p-value under 0.01: the probability of obtaining an estimate this far from zero, if βi were actually zero, is under 0.01. It is interesting that different independent variables appear in the two models. The coverage provided is the dominant basis upon which the present plan is evaluated; thus, when selecting target segments, a useful approach might be to look at people who have plans that are short on coverage. The coefficients in the "proposed HMO" model indicate that distance to doctor or hospital and ability to choose a doctor seem to be important explanatory variables, whereas in the other model they were not significant. Thus, those dimensions might be given a close look when making the final design decisions.
10(a) Significance is determined by the t-value, and X1 is the most significant. Some students might select X3 because it has the largest coefficient.
(b) None are significant at even the 0.13 level. Yet r² has increased dramatically. Why? The reason is that the independent variables are intercorrelated (there is multicollinearity). Such a situation holds down the t-values. In essence, the new variables are important, but the model does not know which one matters because they are correlated. Thus, the t-values for each are lower than we might expect given the increase in r².
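Problem 10(b)'s point, that intercorrelated predictors can leave r² high while depressing every individual t-value, is easy to demonstrate on invented data; the sketch below uses statsmodels for the t-statistics.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 40

# Two highly correlated predictors (synthetic): x2 is essentially x1 plus a little noise.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 3 * x1 + 2 * x2 + rng.normal(scale=1.0, size=n)

both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
alone = sm.OLS(y, sm.add_constant(x1)).fit()

print("r-squared with both predictors:", round(both.rsquared, 3))
print("t-values with both predictors: ", np.round(both.tvalues[1:], 2))  # both shrink
print("t-value with x1 alone:         ", round(alone.tvalues[1], 2))     # much larger
```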
(c) The estimate would be 100,000, and the values of the independent variables are within the range of the data. A primary concern might be that conditions surrounding the estimation context differ from those surrounding the data. The student should observe that the sales will surely not be exactly 100,000; there will be uncertainty associated with our estimate. The biggest source of uncertainty will be the fact that 55 percent of the variance remains unexplained. Other sources include the uncertainty associated with our estimates of the regression coefficients.
(d) Crosby, North Dakota, might be quite unlike the stations in our random sample. It is a farm community, and the station involved will draw from a large area. There is no independent variable that reflects these special conditions, although it might be possible to create one.
11. The regression coefficient for X2 (product mailing expenditures in year t) is the highest. The t-value for X2 is also the highest, which leads us to believe that X2 is the most significant variable in the equation. However, caution should be exercised when interpreting the regression equation. It should be verified that all the relevant variables are accommodated in the model and that no irrelevant variables are included. It should also be clear that the model was specified with the possible multicollinearity among the independent variables in mind. In this context, it is not advisable to make a recommendation without making sure that the model was correctly specified.

Instructor Manual for Marketing Research
V. Kumar; Robert P. Leone; David A. Aaker; George S. Day
ISBN 9781119497493