Volume 18, No. 4 - 1997

Capture-Recapture: Reconnaissance of a Demographic Technique in Epidemiology

Debra J Nanan and Franklin White


Abstract

The objective of this paper is to review capture-recapture (CR) methodology and its usefulness in epidemiology. Capture-recapture is an established and well-accepted sampling tool in wildlife studies, and it has been proposed as a cost-effective demographic technique for conducting censuses. However, the application of CR in the field of epidemiology requires consideration of relevant factors such as the nature of the condition under surveillance, its case definition, patient characteristics, reporting source and propensity for misdiagnosis and underdiagnosis. The use of CR in epidemiology has expanded over the last 10 years and no doubt will continue to be adopted. Although it has a role in public health surveillance, a more traditional approach to disease monitoring seems more advantageous in certain instances.

Key words: Capture-recapture; epidemiologic methods; population surveillance


Introduction

A necessary component of the development and implementation of effective public health strategies for the prevention and control of disease is adequate and accurate information on when, where, how and who is affected. Epidemiology is the study of patterns of disease occurrence in human populations in terms of time, place and persons, and the factors that influence these patterns.1 Observing and monitoring health and behaviour trends requires a surveillance system that captures useful data on persons correctly identified with the characteristic under study, from which a descriptive epidemiologic profile can be formed. With this information, priorities can be identified and groups targeted for specific interventions based on their profile. It also allows for evaluation of interventions and the best use of resources in the management of the condition. This process relies on accurate identification of the condition (and its various stages) and a valid, reliable surveillance system with complete, accurate and timely monitoring.

A concern with any surveillance system is the quality of the data collected, including the degree of ascertainment of affected individuals. Although some diseases and/or their risk factors may have a high prevalence in a population, the number of reported cases may greatly underestimate the number of persons with the condition. This may be due to a variety of reasons, e.g. poorly defined criteria for diagnosis, missed diagnosis, poorly designed surveillance systems, lack of awareness of the need to report or lack of health-seeking behaviour by those with the disease and/or risk factor.

Therefore, to determine the usefulness of any surveillance system, there must be some way of assessing the quality of the data and completeness of ascertainment. One approach that attempts to accomplish this is the capture-recapture (CR) method.

Capture-Recapture Methodology

History

The basic methodology has been applied in different scientific areas and has a long history. It was first introduced by ecologists as a means of estimating the size of wildlife populations.2-4 In demography, it has been used to adjust for undercounting in population censuses and to estimate birth and death rates and the extent of registration in developing countries.5

The US censuses have utilized similar approaches to estimate the undercount known to occur with the decennial censuses. The 1950 census was the first to use the CR approach, referred to as "dual system estimation," to evaluate the undercount.6 The principle is as follows: after the census has been carried out, a second, more thorough sample, called the "post-enumeration survey" (PES), is performed and matched to the census records; statistical techniques are used to adjust for matching errors, omissions and erroneous enumerations, such as duplications. Although all subgroups of the population are missed to some degree, relative under-enumeration of particular minority groups and the poor is greater than for whites.

In 1990, a PES was conducted and matched to that year's US census. Although the methodology provided revised estimates, the US Secretary of Commerce decided in 1991 not to apply them despite their statistical validity, a decision upheld by the Supreme Court in 1996 against a suit, filed by the City of New York in 1980, seeking to have the method used and thereby increase the city's federal entitlements.7

A new strategy for estimating population size has been proposed by the US Census Bureau for the 2000 census, using statistical sampling and analysis (employing CR techniques) that would correct for the undercount and reduce the costs associated with door-to-door surveys; the Bureau also cites the fact that the US population is too large and too mobile for physical counts. However, the House Committee on Government Reform and Oversight opposes the change, apparently out of concern that seats in the House of Representatives would be reapportioned once undercount corrections are made.

There are two ways of applying CR: with two data sources (the two-sample approach) or with three or more data sources (the multiple source approach).

Two-Sample Approach

There are four basic assumptions underlying the CR approach.8

  • Closure: The population under study is closed, i.e. there are no additions or losses through births, deaths, immigration or emigration during the sampling process (demographic closure).
  • Independence: The sources are independent of one another, i.e. the probability of appearing on one list is not affected by the probability of being on another.
  • Homogeneity: All individuals in the defined population under study have equal probabilities of being observed (captured) in any sample.
  • Perfect matching: Individuals identified in one source can be perfectly matched to another source without error, i.e. no mismatches or non-matches.

As used in wildlife studies, the basic CR principle is as follows: sequential independent samples of animals are captured at different stations; the animals are tagged and allowed to mix with those still untagged; and estimation of the size of the population is based on the number of animals caught in successive samplings and the proportion of those caught that are tagged.

An example of the two-sample approach to estimate population size is given below.2

  • First sample: 1000 animals are captured and tagged, and allowed to remix with the population.
  • Second sample: 500 animals are recaptured, of which 450 are found to be untagged and 50, tagged.
  • The capture probability, p, is estimated from the second sample as p̂ = 50/500 = 0.1.
  • Assuming the capture probability is the same for the two samples, an estimate of the total population is N̂ = 1000/0.1 = 10,000. (Extending to include more than two samples would improve the precision of N̂.)
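
For readers who want to see the arithmetic spelled out, the following Python sketch reproduces the calculation above using the classical Lincoln-Petersen estimator; the Chapman correction and the approximate 95% confidence interval shown are standard textbook additions, not results from this paper.

import math

# Two-sample capture-recapture sketch using the figures from the example above:
# 1000 animals tagged, 500 recaptured, of which 50 carry tags.
n1 = 1000   # captured and tagged in the first sample
n2 = 500    # captured in the second sample
m = 50      # tagged animals found in the second sample

# Lincoln-Petersen: capture probability p = m/n2, so N = n1*n2/m
p_hat = m / n2                                    # 0.1
n_lp = n1 * n2 / m                                # 10,000

# Chapman's nearly unbiased variant and its approximate variance
n_chap = (n1 + 1) * (n2 + 1) / (m + 1) - 1
var_chap = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)
            / ((m + 1) ** 2 * (m + 2)))
half_width = 1.96 * math.sqrt(var_chap)

print(f"Capture probability: {p_hat:.2f}")
print(f"Lincoln-Petersen estimate: {n_lp:,.0f}")
print(f"Chapman estimate: {n_chap:,.0f} "
      f"(approx. 95% CI {n_chap - half_width:,.0f} to {n_chap + half_width:,.0f})")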

In general, most estimation methods appear to be very sensitive to the breakdown of certain assumptions: they are not "robust." Even with wildlife populations, the traditional assumption that all members of a given population are equally "catchable" on all occasions is now recognized to rarely hold, and much work has been done in recent years to relax this assumption, leading to the construction of models that allow for variation in the capture probabilities. The three major sources of variation are as follows.2

  • Capture probabilities that vary by time
  • Capture probabilities that vary by behavioural responses
  • Capture probabilities that vary by the individual (heterogeneity among individuals)

Multiple Source Approach

With three sources, 2³ (= 8) cells (subgroups) are obtained, denoting the number of possible combinations by which observations may be recorded simultaneously from each of the three sources, e.g. an observation may be reported by sources 1 and 3, but not by the second source. With k sources, there are 2ᵏ cells. In any cross-classification there will be one cell where no observations are recorded, corresponding to those individuals who have not been recorded by any source. The objective is to estimate the number of observations in this missing cell, which is then used to estimate the total population size.

The multiple source approach is more flexible, allowing consideration of variables that may influence reporting, and can identify reporting patterns for the different sources. The assumption of closure of the population still applies. However, the assumption of independence can be dropped, and interdependence among data sets can be accounted for by using Bernoulli census and log-linear modelling techniques to assess source dependencies.

The Bernoulli census approach plots all possible pairwise comparisons of two-sample estimates; if dependence between a pair of sources is suspected, they may be merged and treated as a single source.9 With the log-linear modelling approach, models are fitted to the 2ᵏ contingency table (described above); an estimate may be derived from a model that best fits the data or from calculating a weighted estimate by combining results from different models.10,11
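
To make the log-linear approach concrete, the Python sketch below (using the statsmodels package) fits a Poisson log-linear model with all two-way interactions to a hypothetical three-source cross-classification; the counts are invented purely for illustration. The unobserved cell is estimated from the fitted intercept, and in practice simpler models would also be fitted and compared (e.g. by deviance or AIC) before settling on an estimate.

import numpy as np
import statsmodels.api as sm

# Hypothetical counts for the seven observed combinations of sources S1, S2, S3,
# ordered (1,1,1), (1,1,0), (1,0,1), (0,1,1), (1,0,0), (0,1,0), (0,0,1).
counts = np.array([20, 35, 30, 25, 60, 55, 45])
s1 = np.array([1, 1, 1, 0, 1, 0, 0])
s2 = np.array([1, 1, 0, 1, 0, 1, 0])
s3 = np.array([1, 0, 1, 1, 0, 0, 1])

# Main effects plus all two-way interactions; the three-way term is omitted
# because it cannot be estimated from the observed cells.
X = sm.add_constant(np.column_stack([s1, s2, s3, s1*s2, s1*s3, s2*s3]))
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

# With 0/1 coding, the unobserved cell (0,0,0) has every covariate equal to zero,
# so its fitted count is exp(intercept).
missing = np.exp(fit.params[0])
print(f"Estimated unobserved cell: {missing:.1f}")
print(f"Estimated population size: {counts.sum() + missing:.1f}")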

The effects of heterogeneity among individuals, which produces apparent dependence, can sometimes be reduced by stratifying the population of interest by any known factor thought likely to influence the capture probabilities, although one must ensure sufficient observations in each cell (see below); another approach is to use a model that accounts for the heterogeneity (e.g. logistic regression).12,13
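
One way to see the effect of stratification is to apply the two-sample estimator within each stratum and sum the results. In the Python sketch below the counts are invented for two age strata whose capture probabilities differ; in this example the pooled estimate falls below the stratified one, illustrating how unmodelled heterogeneity can bias the estimate.

# Stratified two-sample estimate with invented counts:
# (caught by source 1, caught by source 2, caught by both) per stratum.
strata = {
    "under_40": (300, 260, 120),
    "40_plus":  (150, 90, 20),
}

def petersen(n1, n2, m):
    """Simple two-sample (Lincoln-Petersen) estimate."""
    return n1 * n2 / m

stratified = sum(petersen(*c) for c in strata.values())    # 650 + 675 = 1,325

# Pooled (unstratified) estimate for comparison
n1 = sum(c[0] for c in strata.values())
n2 = sum(c[1] for c in strata.values())
m = sum(c[2] for c in strata.values())
pooled = petersen(n1, n2, m)                               # 450*350/140 = 1,125

print(f"Stratified estimate: {stratified:,.0f}")
print(f"Pooled estimate:     {pooled:,.0f}")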

With human populations, matching involves the use of identifiers common to sources (e.g. birth date, name, race). Probabilistic record linkage makes it feasible and efficient to link large databases in a statistically justifiable manner, while addressing the problem of matching two files under conditions of uncertainty. Automated linkage of records is accomplished by the use of statistical packages, which also account for matching errors.14 These include general record linkage packages, such as GLIM, and more specialized software, such as GIRLS.12,15,16 Where the issue of confidentiality arises, such as with human immunodeficiency virus disease, this may limit the ability to match if sufficient useful variables are not available.
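
The general idea behind probabilistic record linkage can be sketched with a simple Fellegi-Sunter-style score: each identifier contributes an agreement or disagreement weight derived from assumed probabilities of agreement among true matches (m) and among non-matches (u), and record pairs whose total weight exceeds a chosen threshold are treated as links. The field names, probabilities and threshold in the Python sketch below are illustrative assumptions only and are not drawn from any of the packages mentioned above.

import math

# Assumed per-field probabilities: m = P(agreement | true match),
# u = P(agreement | non-match). Values are illustrative only.
FIELDS = {
    "surname":    {"m": 0.95, "u": 0.01},
    "birth_date": {"m": 0.98, "u": 0.003},
    "sex":        {"m": 0.99, "u": 0.5},
}

def match_weight(rec_a, rec_b):
    """Sum of log2 agreement/disagreement weights over the identifiers."""
    total = 0.0
    for field, p in FIELDS.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(p["m"] / p["u"])              # agreement weight
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return total

a = {"surname": "SMITH", "birth_date": "1949-03-07", "sex": "F"}
b = {"surname": "SMITH", "birth_date": "1949-03-07", "sex": "F"}
c = {"surname": "SMYTHE", "birth_date": "1951-11-23", "sex": "F"}

THRESHOLD = 5.0   # illustrative cut-off; real systems typically use two thresholds
print(match_weight(a, b) > THRESHOLD)   # True  -> treat as a link
print(match_weight(a, c) > THRESHOLD)   # False -> treat as a non-link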

One disadvantage of using multiple sources is the need for a sufficient number of observations in each cell; if data are sparse, the estimate will be unreliable. In some instances, it may be advantageous to pool sources; this, however, may result in the loss of useful ("overlap") information.

Current Applications In Epidemiology

In epidemiology, "being caught in a sample" is replaced by "appearing on a list." These "lists" are represented by the information sources or surveillance systems. Routine databases can be used as sources, e.g. disease registries, hospital discharge data, death certificates, medical prescriptions. Where the surveillance system relies on voluntary reporting (often the case), there is very likely some form of bias in the system. With respect to human populations, the primary assumption of independence seems unlikely. For example, persons identified as injection-drug users on one list are more likely to appear on rehabilitative treatment lists if cases are referred for treatment once identified.17

The assumption of homogeneity is also questionable in human populations. Variations in ascertainment among sources are often determined by factors such as the reporting source, severity of illness, quality of care, legal requirements for reporting and patient characteristics. That is, determinants exist that increase the likelihood that a person with a given condition is diagnosed and appears on a particular list. For example, persons with a lower income are more likely to use public sector health services than persons who can afford private sector health services (which are less likely to comply with reporting requirements).

As a result of violation of these assumptions, the two-sample approach is rarely used with human populations. Nonetheless, the use of CR in epidemiology has expanded over the last 10 years.18 Some of these uses are listed below (categorized by disease group).

  • Birth defects: Studies related to birth defects (resulting from congenital rubella, cleft lip and cleft palate, spina bifida, Down's syndrome and fetal alcohol syndrome) applied CR techniques to correct for the number of incident or prevalent cases and completeness of reporting.
  • Cancer: CR methods were used to estimate breast cancer screening sensitivity and false negative rates. Other studies employed CR to ascertain the completeness of cancer registries.
  • Drug use: CR methods have been used in several prevalence studies and, in one instance, to estimate patterns of utilization of methicillin. The method was also used to correct prevalence estimates of intravenous drug use and to estimate the size of particular groups of users.
  • Infectious disease: These studies were related to sexually transmitted diseases, especially acquired immunodeficiency syndrome. CR was used to estimate either or both prevalence and efficiency of the reporting systems.
  • Injuries: CR provided ascertainment-adjusted estimates of dog bite injuries, all-terrain vehicle injuries, sports injuries and motor vehicle fatalities. The method was also used to evaluate the cost-effectiveness of various source combinations.
  • Insulin-dependent diabetes mellitus: Currently, most registries use this procedure for checking the degree of ascertainment and providing ascertainment-adjusted rates.
  • Others: CR methods have also been applied to estimate the incidence or prevalence of hemophilia, myocardial infarction, Huntington's disease and mental disease. Other areas of CR usage in epidemiology include the size of the homeless population, the number of children dependent on medical support, evaluation of the effectiveness of surveillance systems for monitoring abortion mortality, infections among hospitalized patients and vaccine-associated paralytic poliomyelitis.

Currently, several large multinational projects are under way with CR as a design component.19 Examples of these large-scale studies are listed below.

  • World Health Organization's Multinational Project for Childhood Diabetes, where 155 registries in over 70 countries monitor insulin-dependent diabetes mellitus in children (the DiaMond project)
  • Global Lower Extremity Amputation (LEA) Study to enable comparisons between and within countries across the world and over time on the incidence of LEA
  • Global Spine and Head Injury Project to monitor incidence of head injuries in over 20 countries
  • Taiwan Head Injury Project to determine and compare incidence of head injury in Taipei City and a rural district

Discussion

Assessment of a single source or multiple source databases for a specific condition is generally performed to evaluate data quality and the level of ascertainment. The usefulness of the CR approach is that it attempts to account for deficiencies that may exist in a single source approach. However, neither single nor multiple source ascertainment will account for those persons not identified as a case, e.g. because of missed diagnosis, incorrect diagnosis, poorly defined criteria or failure to seek health care. Although the method can account for false negatives, there is the possibility of false positives that are not identified by the CR technique. The method thus relies on a standardized case definition with high sensitivity and specificity. For example, systemic lupus erythematosus is a poorly defined disease; false positives are therefore more likely, resulting in overestimation.

Use of a log-linear model entails the selection of the most appropriate model, given the actual data.20 Models have been constructed to describe matching errors and can be used when errors are expected to occur during record linkage due to mismatches and non-matches. However, the ability to match will depend on the quality of the data and the availability of unique identifiers.

Even with perfect matching and good sensitivity and specificity, the information on those with the condition relates to individuals who have been reported and thus may represent a selective group and not the entire population. How far the results can be extrapolated will depend on varying factors, e.g. the characteristics of those reported, the timeliness of reporting or the nature of the condition.

Conclusion

In designing studies, the investigator must be aware of the basic underlying assumptions and make the correct transition from model assumptions to the real world of human populations. With this in mind, there appear to be two main roles for CR in public health.

  • To assess the degree of ascertainment of a given condition using a particular source
  • To augment/adjust for the degree of ascertainment by using a multiple source model

Where multiple sources exist, the application of CR methods may provide a saving in time, effort and expense compared with using the traditional field survey to determine ascertainment. New information may be gathered on the use of services by subgroups and the interaction effects. As neither the true value of the parameter to be estimated nor the correct assumptions about capture probabilities are known, whatever estimate is computed from the selected model should be accompanied by confidence limits to give an idea of its reliability. In practice, this has resulted in wide confidence intervals that raise doubts regarding the reliability of the estimate and the realistic nature of the model.

The CR approach holds continuing promise for its application to epidemiologic surveillance. However, even though CR is a valuable method for enhancing existing surveillance data, there is an ongoing need to strengthen more "traditional" surveillance systems and data collection sources. This involves such activities as improving and validating case definitions, promoting diagnosis and reporting, developing information systems and training in the use of health records. In the final analysis, the usefulness of CR in surveillance must be evaluated in terms of its public health utility.



References

1. Lilienfeld A, Lilienfeld DE. Foundations of epidemiology. 2nd ed. New York (NY): Oxford University Press, 1980.

2. Los Alamos National Laboratory. Capture-recapture and removal methods for sampling closed populations. Los Alamos (NM), 1982; Cat LA-8787-NERP UC1.

3. Cormack RM. The statistics of capture-recapture methods. Oceanog Mar Biol Ann Rev 1968;6:455-506.

4. Chapman DG. The estimation of wildlife populations. Ann Math Stat 1954;25:1-15.

5. Sekar C, Deming WE. On a method of estimating birth and death rates and extent of registration. J Am Stat Assoc 1949;44:1059-68.

6. Himes CL, Clogg CC. An overview of demographic analysis as a method for evaluating census coverage in the US. Popul Index 1992;58:587-607.

7. The Supreme Court on the adjustment of the US census. Popul Develop Rev 1996 June:399-405.

8. Ding Y, Fienberg SE. Multiple sample estimation of population and census undercount in the presence of matching errors. Surv Methodol 1996;22:55-64.

9. Wittes JT, Colton T, Sidel VW. Capture-recapture models for assessing the completeness of case ascertainment when using multiple information sources. J Chronic Dis 1974;27:25-36.

10. Fienberg SE. The multiple-recapture census for closed populations and incomplete 2ᵏ contingency tables. Biometrika 1972;59:591-603.

11. Hook EB, Regal RR. Internal validity analysis: a method for adjusting capture-recapture estimates of prevalence. Am J Epidemiol 1995;142:48-52.

12. International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple record systems estimation I: history and theoretical development. Am J Epidemiol 1995;142:1047-58.

13. Hook EB, Regal RR. Effect of variation in probability of ascertainment by sources ("variable catchability") upon "capture-recapture" estimates of prevalence. Am J Epidemiol 1993;137:1148-66.

14. Jaro MA. Probabilistic linkage of large public health data files. Stat Med 1995;14:491-8.

15. Newcombe HB, Kennedy JM, Axford SJ, James AP. Automatic linkage of vital records. Science 1959;130:954-9.

16. Howe GR, Spasoff RA, editors. Proceedings of the Workshop on Computerized Record Linkage in Health Research; 1986 May 21-23; Ottawa, Ontario. Toronto: University of Toronto, 1986.

17. Neugebauer R, Wittes J. Voluntary and involuntary capture-recapture samples-problems in the estimation of hidden and elusive populations [letter]. Am J Public Health 1994;84:1068-9.

18. International Working Group for Disease Monitoring and Forecasting. Capture-recapture and multiple record systems estimation II: applications in human diseases. Am J Epidemiol 1995;142:1059-68.

19. Summary report: Capture-Recapture Injury Epidemiology Conference; 1996 Sept 9-10; University of Pittsburgh [unpublished manuscript].

20. McCullagh P, Nelder JA. Generalized linear models. In: Cox DR, Hinkley DV, Rubin D, Silverman BW, editors. Monographs on statistics and applied probability, No 37. London: Chapman and Hall, 1983.



Author References

Debra Nanan, School of Public Health, George Washington University, Washington, DC, USA
Franklin White, Program Co-ordinator, Non-communicable Diseases, Pan American Health Organization, Regional Office of the World Health Organization, 525 Twenty-third Street NW, Washington, DC, USA  20037-2895
