Evaluation of a risk factor survey with three assessment methods
Beth Theis, Jennifer Frood, Diane Nishri, and Loraine D Marrett

Abstract

This paper describes the evaluation of questions on a cancer risk factor survey using three different methods: dataset response patterns, qualitative feedback, and questionnaire appraisal. These methods addressed the survey data, procedures and questions. The three methods identified similar issues, but each also made unique contributions. Dataset response patterns showed missing and out-of-range data, an order effect, and mixed coding. Qualitative feedback revealed lack of clarity, sensitive topics, technical or undefined terms, failure to hear all response options, overlapping response options (as perceived by respondents), coding problems and recall difficulties. Questionnaire appraisal showed technical or undefined terms, complex syntax, hidden definitions, and ambiguous wording. The survey assessment methods described here can improve data quality, especially when limited time and resources preclude in-depth questionnaire development.

Key words: data collection; health surveys; population surveillance; questionnaires
Introduction

This paper describes the evaluation of cancer risk factor survey questions through the application of three different assessment methods to the data, process and questions from a pilot survey. We report each method's unique contribution, and the areas where the different approaches converge, in identifying opportunities for improved data collection or caution in interpreting responses.

Newell and co-authors, in reviewing the accuracy of self-reported cancer-related health behaviours, suggest strategies that include ensuring that respondents fully understand questions, phrasing questions to minimize social desirability bias, encouraging exact rather than rounded-off answers for continuous variables, and ensuring that questions have clear, exhaustive, mutually exclusive response options.1 Investigators collecting and using survey data need mechanisms to assess attempts to implement these strategies.

Rapid risk factor surveillance systems offer opportunities for ongoing evaluation and ideally offer some flexibility in introducing changes to questions. A pilot test of such a system, carried out in Durham Region, Ontario, provided an opportunity for the assessment described here. The pilot was a collaboration between Health Canada, the Ontario Ministry of Health and Long-Term Care, Durham Region Public Health Unit, and Cancer Care Ontario (CCO).

Materials and Methods

The Durham Region pilot survey

The pilot survey was designed to test collaboration among the sponsoring agencies, including the process of formulating, adding and changing survey content, and to determine whether survey data could be generated quickly in a useful format. Actual survey results and quality evaluation were secondary aspects. Interviews were held in five monthly waves of approximately 200 each from June through October 1999, resulting in 1,047 completed interviews with Durham Region residents aged 18 through 89. Of the eligible individuals contacted, 69% completed the interview.
The Institute for Social Research (ISR) at York University, Toronto, was contracted to conduct the survey using Computer Assisted Telephone Interviewing (CATI). The content development group, comprising three epidemiologists and two survey methodologists, represented Durham Region Health Department, CCO, Health Canada and ISR. Content was limited to approximately 80 questions for a target average interview length of 20 minutes.

Cancer risk factor questions

Content areas of particular interest to a provincial cancer agency were addressed in 45 questions about 1) sun-related behaviour; 2) screening for breast, cervical, colorectal and prostate cancer; 3) diet; 4) physical activity; and 5) tobacco consumption. The Appendix shows these in their final (fifth survey wave) form.

Questions on sun-related behaviour were adapted from those developed for surveys at the 1998 Canadian National Workshop on Measurement of Sun-Related Behaviours.2 Slight changes to this group's wording were made by our content group's survey methodologists, based on their knowledge and experience of telephone surveys.

Questions on screening for breast, cervical, colorectal and prostate cancer all used the same format, asking about 1) ever being tested, 2) time since last test and 3) reason for last test. Breast and cervical cancer questions were from the National Population Health Survey; the content group developed the prostate and colorectal cancer screening questions. The reference period of two years since last test reflected breast and cervical screening guidelines. Mammogram questions were asked of women aged 35 and older, and colorectal screening questions of respondents 40 and older. Pap test questions were restricted to women who reported not having had a hysterectomy. Questions on prostate-specific antigen (PSA) tests had no age restriction because some men in the pretest reported PSA testing in their 30s.
Questions about reasons for cancer detection tests were expanded from the US Behavioral Risk Factor Surveillance System (BRFSS)3 question about Pap smears (routine examination or to check a current or previous problem) to include a third response option distinguishing concern about symptoms from follow-up of a medically diagnosed problem. To address diet, we incorporated a set of BRFSS questions on fruit and vegetable consumption. Physical activity questions were adapted from a set proposed for the BRFSS by the Centers for Disease Control and Prevention's Physical Activity and Health Branch. Tobacco consumption questions were those used in the BRFSS (1999)3 with minor changes to reflect Canadian experience and to capture quit attempts.

Evaluation methods

We evaluated the 45 questions on cancer risk factors using 1) analysis of traditional dataset descriptors and response patterns; 2) qualitative feedback from interview monitoring, interviewer debriefing, and direct questions to respondents; and 3) questionnaire appraisal with a checklist, modified from a published questionnaire coding system, to describe and assess potential problems related to comprehension or response generation.4

Dataset descriptors and response patterns

Data characteristics alone can yield substantial information on question quality. For instance, a substantial number of refusals to answer a particular question may indicate a sensitive topic that could be dropped, or a need to reword the question; unexpected answers may mean that a question is being misunderstood. The response patterns used to assess the quality of this set of questions were adherence to skip patterns, the proportion of refusals or "don't know" responses, the range of responses, and ease of analysis. An apparent order effect in days of vigorous and moderate activity was tested with a chi-squared statistic on three degrees of freedom.
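As an illustration of this kind of response-pattern screening, the sketch below tallies "don't know" and refusal proportions per question from a toy dataset and flags questions that exceed chosen cut-offs (here mirroring the 10% and 1.5% thresholds discussed in this paper). The column names, numeric codes and data are invented for the example, not taken from the pilot survey.

```python
# Hypothetical sketch: flag survey questions with high "don't know"
# or refusal rates. Codes and column names are invented.
import pandas as pd

DONT_KNOW, REFUSED = 7, 9  # hypothetical response codes

responses = pd.DataFrame({
    "sun_hours":   [2, 3, 7, 7, 1, 7, 9, 4],  # detailed-recall item
    "fruit_daily": [1, 2, 2, 3, 3, 1, 2, 2],
})

def flag_questions(df, dk=DONT_KNOW, ref=REFUSED,
                   dk_cutoff=0.10, ref_cutoff=0.015):
    """Return per-question DK/refusal proportions and a flag column."""
    summary = pd.DataFrame({
        "pct_dont_know": (df == dk).mean(),
        "pct_refused":   (df == ref).mean(),
    })
    summary["flag"] = ((summary["pct_dont_know"] > dk_cutoff) |
                       (summary["pct_refused"] > ref_cutoff))
    return summary

print(flag_questions(responses))
```

In practice the same pass can also report out-of-range values and skip-pattern violations, giving a one-page quality summary per survey wave.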
Qualitative feedback

Qualitative analysis of text compiled from three activities (interview monitoring, interviewer debriefing and respondent feedback) revealed themes in question and interview attributes that indicated potential problems with the survey data. ISR's equipment enables switching among interviews undetected by interviewers and respondents. Four pilot survey investigators monitored interviews by telephone and computer on separate evenings during wave three. Three investigators debriefed interviewers and supervisors together after completion of all five waves.

Respondent feedback was sought with two questions at the end of the interview in the two final waves, which included 412 respondents. Interviewers first asked whether any questions had been confusing or difficult to understand and, if yes, which questions. Cancer risk factor questions were cited as difficult in four instances: three people had difficulty with the physical activity questions, and one said "the food questions" were confusing. Interviewers then asked all 412 respondents whether there were questions they understood but still found difficult to answer. One respondent reported trouble in answering the physical activity questions, four the fruit and vegetable questions and one the reason for a Pap test.

Questionnaire appraisal

Lessler and colleagues have developed a scheme for coding questions, response options and instructions to characterize the mental burden involved in responding to a questionnaire.4,5 Its purpose is to identify features that may affect question comprehension and interpretation, response accuracy and willingness to respond. In adapting their scheme we excluded items relating to attitude rather than behaviour, and items that we felt would require a cognitive interview. (Cognitive interviews use various techniques for investigating the mental information processing necessary to respond to questions.)
We then fine-tuned the coding scheme by independently coding three questions, comparing results, and achieving consensus on coding definitions and on elements inappropriate for our risk factor questionnaire. One author (JF) then coded all the questions using the resulting refined scheme.

Results

Dataset descriptors and response patterns

Skip patterns were appropriate, with minor exceptions. A few males were asked female cancer screening questions because interviewers asked respondents all questions when they could not determine sex from a person's voice. (If still in doubt, interviewers asked directly whether respondents were male or female at the end of the interview.) None of the questions evaluated had more than 1.5% refusals. Ten questions had more than 10% "don't know" responses; all were questions requiring detailed recall about time or frequency, such as hours spent in the sun, time since screening, or frequency of fruit and vegetable consumption.

Responses sometimes did not match questions as asked. For example, four respondents reported an answer of less than 10 minutes to the question "On days when you do moderate activity for at least 10 minutes at a time, how much total time do you spend doing these activities?". Other responses seemed unlikely or extreme (more than eight hours of vigorous physical activity a day, smoking 90 cigarettes a day, PSA testing 24 years ago).

Figure 1 shows an order effect in the distribution of the number of days per week respondents reported that they engaged in vigorous or moderate activity. Although the preamble to the physical activity questions told respondents they would be asked about participation in vigorous and moderate activity, "vigorous" and "moderate" were not defined until each question was read. In waves two and three, respondents were asked first about the number of days per week they engaged in vigorous activity, and then about moderate activity. The order was reversed for waves four and five.
The first definition heard by respondents may have become a reference point for answering the second question. As a group, respondents reported engaging in vigorous activity on more days when the reference point (the first definition) was vigorous activity and on fewer days when the reference point was moderate activity (p = 0.003). Similarly, they reported engaging in moderate activity on more days when the reference point was moderate activity and on fewer days when the first definition was vigorous activity (p = 0.001).
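An order-effect comparison of this kind can be sketched as a chi-squared test on a 2 x 4 contingency table: two question orders by four grouped day-of-week categories, giving the three degrees of freedom mentioned above. The counts and groupings below are invented for illustration and are not the Durham pilot data.

```python
# Illustrative order-effect test (hypothetical counts, not the pilot data).
from scipy.stats import chi2_contingency

# Rows: question order; columns: reported days of vigorous activity,
# grouped (for example) as 0, 1-2, 3-4, 5-7 days per week.
observed = [
    [60, 85, 40, 25],  # vigorous asked first
    [80, 90, 25, 15],  # moderate asked first
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f} on {dof} df, p = {p:.4f}")
```

A (2 - 1) x (4 - 1) table yields 3 degrees of freedom; a small p-value indicates that the distribution of reported days differs by question order.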
[Figure 1: Distribution of reported days per week of vigorous or moderate activity, by question order]
Analytic difficulty arose from combined response options for three types of questions. Response coding options for physical activity were a mixture of continuous and categorical: <enter the number of MINUTES> or <more than 8 hours>. Average time spent exercising cannot be calculated in the presence of the <more than 8 hours> categorical response unless these respondents are removed from the calculation or some assumption is made about the distribution of their values. Similarly, the question about time in the sun mixed numeric and text response fields. Interviewers were instructed to code responses as answered, in minutes, hours or a combination of the two. While responses in minutes or whole hours were recorded as numeric values, combination responses ("an hour and a half", for example) were recorded as text, which then had to be converted to numeric form (1.5 hours) and manually entered into the numeric field for combining with numeric responses. Partial answers to the fruit and vegetable questions resulted in missing or excluded information when daily, weekly or monthly consumption was reported but quantity could not be recalled: 14.6% of respondents were unable to estimate amounts for their daily, weekly or monthly consumption of at least one fruit and vegetable category.

Analysis attempts also revealed missing "zero" values where some responses were contingent on answers to preceding questions. For instance, when respondents said "no" to the question asking whether they engaged in moderate or vigorous activity for at least 10 minutes at a time, the CATI system was programmed to skip the following question asking how many days they engaged in such activity, but was not programmed to enter "0" for physical activity days. Compensating for this oversight required some vigilance before data analysis reflected actual reporting.

Qualitative feedback

Four major themes emerged: stylistic problems, sensitive questions, question clarity and response validity.
Stylistic issues were noted during interview monitoring and interviewer debriefing. Investigators monitoring interviews were concerned that some interviewers' monotone and rapid pace might interfere with question comprehension or lead to respondent frustration, although they did not detect any such frustration. Both interviewers and investigators felt the need for more transitional statements, particularly before such sensitive topics as tobacco use and (for interviewers) colorectal and women's cancer screening. In addition, investigators heard inaccurate or incomplete explanations from interviewers in response to questions about the purpose of the survey, how results would be used, and the reasons for randomly selecting respondents.

Topics noted as sensitive during interview monitoring and interviewer debriefing were not necessarily mentioned in respondent feedback. Both interviewers and investigators observed respondent defensiveness about weight and tobacco consumption. Interviewers also felt the colorectal screening question was sensitive. Respondents, on the other hand, were more likely to report that the survey questions about income and education were too personal; only one mentioned weight as uncomfortably personal, and none reported discomfort with cancer screening questions.

Problems with question clarity were noted from all three qualitative sources. Both interviewers and investigators felt that some questions were open to misinterpretation. In some cases this was related to undefined or unfamiliar terms. Interviewers reported, for instance, that many respondents apparently thought PSA was a routine blood test, and gave a potentially invalid "yes" response; investigators noted that definition was an issue for some fruit and vegetable questions (some respondents had trouble understanding "green salad", for instance). One respondent reported that the definitions of moderate and vigorous physical activity were not clear enough to distinguish them.
In other cases, question intent was unclear. For instance, interviewers felt that the sun avoidance question might need clarifying if meant to capture moving purposely "out of the sunlight" as opposed to "out of the heat", and that "clothing with long sleeves" would better capture covering-up behaviour than asking specifically about a "shirt".

Interview monitoring identified questions to which respondents offered answers before all response options had been read or terms defined. Questions incorporating lists of responses (reasons for cancer screening tests, for example) needed rewording to signal clearly that a list was coming; definitions of "moderate" and "vigorous" in the physical activity questions needed to be placed so that respondents heard them before offering an answer.

Difficulties in interpreting some response categories could result in misclassification. As a reason for screening tests, interviewers described some respondents answering "concerned that I might have a problem", yet saying that this was routine screening. Only one respondent singled out a screening question as difficult to answer; she had had a Pap smear "because I was having my tubes tied" and did not see how this fit the offered response options. Investigators noted that interviewers had difficulty appropriately coding some responses to questions about time in the sun and fruit and vegetable consumption. Some interviews showed the necessity of providing coding instructions, for instance about what counts as "fruit" or "fruit juice", when questions have been adopted from another survey. (One respondent in a monitored interview asked whether apple juice counted in response to a question from the BRFSS about "fruit juices such as orange, grapefruit, or tomato".)
Respondents reported trouble in answering some questions because behavioural details were difficult to report correctly (time exercising, vegetable consumption) or because the question did not ask for a response about a specific time period (vegetable consumption).

Questionnaire appraisal

Table 1 summarizes the results of applying the questionnaire coding scheme to the 45 cancer risk factor questions. The scheme scores the questions themselves, the memory/judgement tasks required to answer the questions, and the responses. Most questions asked about past rather than current behaviour. The frequency of carry-over and embedded reference periods reflects several question series asking for increasingly detailed information; for instance, "Have you ever had a mammogram?", then "Did you have your mammogram in the last 2 years?", and then "How many years/months ago was that?". Undefined reference periods occurred in questions about current behaviour; ill-defined reference periods occurred in the physical activity questions.

Between a quarter and a third of questions used technical terms (many undefined), ambiguous wording, and/or complex syntax. Technical terms were usually screening test names; complex syntax was largely needed to clarify the wording of "moderate" and "vigorous" physical activity and the reference time and type of day for being outside on a sunny day. Most memory retrieval and judgement tasks involved remembering an episode or set of episodes that included a blend of common habits, distinct habits, rare events and time estimates. Most questions required qualitative judgement, reflecting the large number of yes/no and categorical response options, whereas fewer required estimation of the actual number of times something happened or how long ago. Questions about cancer screening were coded as sensitive in this scheme because of the physically personal nature of breast, cervical, colorectal and PSA screening.
One major problem with response options was hidden definitions, information provided only if respondents requested clarification. Although most were for fruits and vegetables (respondents asked, for example, "Are potato chips vegetables?", "Does the fruit in a Pop Tart count?"), there were others throughout the questionnaire. The other response problem was the inclusion of ambiguous or vague terms, mainly in sun behaviour questions ("rarely" and "often") and in reasons for screening tests ("routine screening", "ongoing problem", "concerned about a problem").
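An appraisal scheme of this kind can be represented as a simple per-question tally of problem codes. The sketch below is a toy illustration of the idea, not the authors' adapted instrument; the code names and example questions are our own.

```python
# Toy sketch of a questionnaire-appraisal tally (codes are illustrative,
# not the published Lessler-Forsyth scheme itself).
APPRAISAL_CODES = [
    "technical_term", "undefined_term", "complex_syntax",
    "hidden_definition", "ambiguous_wording", "sensitive_topic",
    "carryover_reference_period",
]

def appraise(question_codes):
    """Count how often each appraisal code appears across a questionnaire."""
    counts = {code: 0 for code in APPRAISAL_CODES}
    for codes in question_codes.values():
        for c in codes:
            counts[c] += 1
    return counts

# Hypothetical coding of two questions:
coded = {
    "Have you ever had a PSA test?":
        {"technical_term", "sensitive_topic"},
    "Did you have your mammogram in the last 2 years?":
        {"technical_term", "carryover_reference_period", "sensitive_topic"},
}
print(appraise(coded))
```

Summarized over all questions, such a tally yields the kind of frequency table reported in Table 1 and helps quantify respondent burden by code.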
[Table 1: Questionnaire appraisal coding results for the 45 cancer risk factor questions]
All three methods (dataset response patterns, qualitative feedback, and questionnaire appraisal) pointed to potential problems with response validity, respondent reluctance, and recall difficulty (Table 2). By validity we mean the extent to which responses were directed to the intent of a question and were correctly captured by interviewers. While potential problems were identified, usually through contributions unique to each method, there was convergence on the broad areas of sensitive topics (although respondents singled out different topics than interviewers, monitors and questionnaire coding), undefined technical terms, question clarity and hard-to-remember information. In our evaluation, only examination of the dataset revealed analytic difficulties associated with the responses as entered.
[Table 2: Potential problems identified by the three assessment methods]
Discussion

Without critical assessment of survey data and the methods used to collect them, health agencies risk basing policy decisions on inaccurate information. Users of survey data know that self-reports are, to varying degrees, the product of imperfect recall, biased reporting6 and misclassified responses. Within these limitations, Newell and co-authors emphasize the scope for improved data collection on health-related behaviours.1 Each of the assessment methods we describe can offer insight into the data or opportunities for improvement. These methods identified problems not only in aspects of this pilot survey, but also in individual questions adopted from other surveys. In addressing the problems raised, investigators must often weigh the pros and cons of opposing approaches. In ongoing surveys, the benefits of changes may not outweigh the benefits of data comparability.
Dataset descriptors and response patterns are traditional evaluation tools; skip patterns, refusals, question-response mismatches and extreme responses indicate areas for cautious interpretation of data or, in ongoing surveys, for programming changes and interviewer instructions. Many refusals or "don't know" responses may identify sensitive or misunderstood questions. A high proportion of "don't know" responses, or evidence of an order effect (if different orders have been tried), raises an alert about the validity of all responses to those questions. Responses outside the expected range identify items for which programming to restrict allowable CATI entries, or to prompt interviewers to repeat a question, may improve data quality. The analytic difficulties of mixed continuous and categorical, or numeric and text, responses should be avoided unless theoretical reasons exist for including them. (There may, for instance, be arguments for grouping numeric responses above a certain threshold for some behaviours.) Similarly, although allowing respondents to select their own units of reporting presented analytic problems in our pilot survey, this must be weighed against the benefit of giving respondents the freedom to provide information at the level they feel is most accurate.

Qualitative feedback can point to possible areas for change. In this survey, interview monitoring identified areas where data quality could be improved through question rewording, additional interviewer training, or more comprehensive coding instructions. Monitoring can reveal particular wording requirements of a telephone interview, especially for investigators more familiar with self-administered or face-to-face questionnaires. In this ongoing pilot survey, for instance, we altered wording so that respondents would wait to hear a list of response options. Debriefing interviewers provides the experience of a wider range of interviews than investigators can monitor.
Interviewers are especially aware of the usefulness of transitional statements to alert respondents that a personal question is coming, suggest that no personal judgements will be made, or generally "soften" the approach. For this pilot survey, interviewers requested definitions, described a response option problem, and noted areas where transitional statements would be helpful.
Despite the rich detail on potential problems that interviewer and respondent feedback provided, such information may need careful assessment before it prompts changes. Interviewer discomfort may be less informative than refusal or quit rates in identifying topics sensitive enough to warrant changes in wording, transitions, or placement. In this survey, while only a small proportion of our wave four and five interviewees responded to the request for feedback, they did provide qualitative detail on the high proportion of "don't know" responses for activity time and fruit and vegetable consumption. Whether a questionnaire change is justified when comparatively few respondents are willing to lengthen the interview to provide negative feedback is best decided in the context of the project as a whole. Different decisions may be made depending on, for instance, the importance of data comparability across survey waves or different surveys, or whether an ongoing survey is at an early or later phase.

Our questionnaire appraisal was exploratory and carried out after the pilot survey had been conducted. A more appropriate use would be to identify areas for change before field testing. We adapted another group's published scheme for application to a health behaviour interview. Our adaptation may need further revision for application to other questionnaires. Because applying the codes necessarily involves individual judgement, another group intending to use the scheme will need to agree internally on item definitions (what constitutes a "technical term", for instance). More fundamentally, the published scheme that we adapted depends on the validity of the underlying models of the cognitive processes involved in question response.4 Intuitively, however, some form of checklist seems appropriate for indicating potential problems prior to any pretest in the field.
Shorter lists have been published.7,8 A coding scheme or checklist could be expanded to include aspects identified in this pilot survey through response pattern analysis (mixed categorical and continuous responses, for instance) or qualitative feedback (such as announcing a list of response options). An advantage of the list we used is that it aids choices in wording by quantifying different aspects of respondent burden. Analysis could show, for example, that a high proportion of questions required complex estimation on the part of respondents. These might be memory retrieval tasks that are unavoidable when reports of preventive health behaviours are required. In such a case those designing the survey might want to make other changes (dropping some questions, for instance) to compensate for this aspect of respondent burden. This is best done in the context of the survey as a whole, rather than by trying to establish acceptable levels of respondent burden or potential problems. As with the other methods described here, changes must be weighed against new problems they might introduce or advantages that would be lost. Although complex syntax increases burden, for instance, it may be required to clarify questions and provide definitions. Another example is a reference period tied to a previous question; while this can reduce topic sensitivity by minimizing repetition of sensitive wording, extensive use could lead to response fatigue.

Ideally, survey questions are developed using focus groups, cognitive interviews and pretesting, or at least some of these, in the population targeted for the survey. Additional methods of assessing survey quality exist. Questionnaire responses can be compared, for instance, with food records or 24-hour food recall, pedometers or other physical activity monitors, mammogram reports in medical records, or responses to related questions within the same survey.
When limited time and resources preclude in-depth development and assessment, the combination of methods described here offers useful insights. These methods are equally important for questions adopted from other surveys, as differences in populations, questionnaire administration and question order can alter validity. Rapid risk factor surveys lend themselves especially well to these data quality measures: data are available quickly for quality assessment, and flexibility in altering question wording or transitions is likely to be a key principle. Such surveys may provide suitable frameworks for investigating order effects like the one revealed for physical activity in our Durham pilot survey, or for using more than one question on the same issue and assessing inter-item correlation. Again, insights gained from such experiments in wording or order must be weighed against data comparability. A multimodal approach like the one described here can confirm observations where the findings of different methods converge.9 More important, when resources are scarce the use of these different methods can compensate for aspects missed by any single method and thus reveal a broader range of potential problems.

Acknowledgements

We thank Dr Bernard Choi and Dr Philippa Holowaty for their helpful comments on an earlier draft of this paper. David Northrup and his staff at the Institute for Social Research, York University, collaborated in this work by inviting investigators to monitor interviews, providing insightful feedback from interviewers, and suggesting the addition of questions for respondent feedback. Dr. Holowaty, Mr. Northrup and (originally) Dr. Margaret de Groh of Health Canada, with two of the authors (BT and LDM), worked as the content development group for the pilot survey.

References

1. Newell SA, Girgis A, Sanson-Fisher RW, Savolainen NJ. The accuracy of self-reported health behaviors and risk factors relating to cancer and cardiovascular disease in the general population. A critical review. Am J Prev Med 1999;17:211-29.
2. Lovato C, Shoveller J, Mills C. Canadian national workshop on measurement of sun-related behaviours [Workshop report]. Chronic Dis Can 1999;20:96-100.
3. Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System. <http://www.cdc.gov/nccdphp/brfss/index.htm>. Accessed May 4, 2001.
4. Lessler JT, Forsyth BH. A coding system for appraising questionnaires. In: Schwarz N, Sudman S, eds. Answering questions: methodology for determining cognitive and communicative processes in survey research. San Francisco: Jossey-Bass, 1996:259-291.
5. Forsyth BH, Lessler JT, Hubbard ML. Cognitive evaluation of the questionnaire, and Appendix B: Cognitive form appraisal codes. In: Turner CF, Lessler JT, Gfroerer JC, eds. Survey measurement of drug use: methodological studies. Washington, DC: US Government Printing Office, 1992:13-52 and 327-36.
6. Choi BCK, Pak AWP. Bias, overview. In: Armitage P, Colton T, eds. Encyclopedia of biostatistics. Volume 1. Chichester, UK: John Wiley & Sons, 1998:331-338.
7. Armstrong BK, White E, Saracci R. Principles of exposure measurement in epidemiology. New York: Oxford University Press, 1994:144.
8. Woodward CA, Chambers LW. Guide to questionnaire construction and question writing, 3rd ed. Ottawa: Canadian Public Health Association, 1986:23.
9. Friedemann ML, Smith AA. A triangulation approach to testing a family instrument. West J Nurs Res 1997;19:364-78.
APPENDIX

[Appendix: the 45 cancer risk factor questions in their final (fifth wave) form]
Author References

Beth Theis, Jennifer Frood, Diane Nishri, Division of Preventive Oncology, Cancer Care Ontario

Loraine D Marrett, Division of Preventive Oncology, Cancer Care Ontario and Department of Public Health Sciences, University of Toronto

Correspondence: Beth Theis, Division of Preventive Oncology, Cancer Care Ontario, 620 University Avenue, Toronto, Ontario M5G 2L7; Fax: (416) 971-6888; E-mail: beth.theis@cancercare.on.ca
Last Updated: 2002-02-21