Public Health Agency of Canada / Agence de santé publique du Canada



Volume 21, No.1 - 2000


An Assessment of the Validity of a Computer System for Probabilistic Record Linkage of Birth and Infant Death Records in Canada

Martha Fair, Margaret Cyr, Alexander C Allen, Shi Wu Wen, Grace Guyon and Ralph C MacDonald, for the Fetal and Infant Health Study Group (a)


Abstract

Studies on the validity of probabilistic record linkage are sparse. We performed a probabilistic linkage to link the 1984-1994 birth records (obtained from the Canadian Birth Data Base) with 1984-1995 infant death records (from the Canadian Mortality Data Base) in Canada. We extracted the linked birth-death records for Nova Scotia and Alberta (from January 1990 to December 1991) obtained from Statistics Canada's vital registration data and compared them with corresponding records from provincial data (primarily hospital records). The results showed that 99% of infant deaths (153/155) in the Nova Scotia provincial data were successfully located in the linked Statistics Canada file; the corresponding figure for Alberta neonatal deaths was also 99% (365/367). The distributions of gestational age and birth weight in matched cases demonstrated high agreement between the two data sources. We conclude that the computer system for probabilistic linkage developed by Statistics Canada using the available personal identifying variables in the Canadian Birth Data Base and the Canadian Mortality Data Base is valid.

Key words: births; deaths; epidemiologic methods; infant; medical record linkage; validity


Introduction

Existing databases such as the Canadian Birth Data Base (CBDB)1 and the Canadian Mortality Data Base (CMDB)2 are attractive to health and health care analysts because they permit detailed studies of large populations. However, a single database may lack information that is crucial to the analysis. For example, information on birth weight and gestational age is usually recorded on birth registrations, whereas information on infant mortality is recorded on death registrations. Thus, an analysis of birth weight-specific or gestational age-specific infant mortality is impossible using separately registered data.

In recent years, developments in computerized record linkage methodology have made it possible to bring such databases together.3-6 To facilitate the surveillance of fetal and infant health, the Canadian Perinatal Surveillance System (CPSS) funded a record linkage of the Canadian Birth Data Base for 1985-1994 with infant deaths from the Canadian Mortality Data Base for 1985-1995 at Statistics Canada. Infant death records were linked to their corresponding birth records by probabilistic linkage (also called maximum likelihood linkage), using the Generalized Record Linkage System developed by Statistics Canada6,7 and a statistical probability calculated from relevant identifying variables (e.g. name, date of birth).

To assess the validity of this record linkage procedure, the records located in the Statistics Canada linked file were compared with information on the corresponding births and deaths collected independently by the provincial perinatal surveillance programs in Nova Scotia and Alberta. A full account of this validation study has been published by Statistics Canada as a technical report.8 Studies on the validity of probabilistic linkage are sparse in the published literature, despite widespread use of this methodology in public health surveillance and epidemiologic research activities. We felt it important to disseminate the main results of this validation study to a broad audience through a peer-reviewed publication.


Methods

The Linked Statistics Canada Birth and Death Files

Statistics Canada's vital registration data are primarily collected through parental reporting at registration. The registrars of the 10 provinces and 2 territories collect data on live births, fetal deaths and other deaths. Paper and electronic (if available) copies of the registration documents are submitted to Statistics Canada under agreements between federal and provincial governments. The data are reformatted, edited, verified and stored in Statistics Canada's Integrated Vital Statistics files.

Statistics Canada's files were preprocessed into a format suitable for record linkage. A phonetic code from the New York State Identification and Intelligence System (NYSIIS)9 was generated to encode the surname values so that, during the linkage process, records with similar-sounding surnames could be compared despite misspellings or key entry errors. Data items from other sources, such as postal codes, were added, and alternative records were created using other surname fields (e.g. parental surnames). The 1990-1991 live birth events recorded in the CBDB were linked to the 1990-1992 deaths recorded in the CMDB. A probabilistic record linkage was performed using the Generalized Record Linkage System,6,7 which compares fields common to the two files, assigns weights to the resulting links and calculates a total weight. Links with a total weight of less than -90 were automatically rejected, and links with a total weight greater than +300 were automatically accepted as matches. Manual resolution was carried out to confirm all linked records with a total weight ranging from -90 to 300, and to confirm all links to multiple births. Updates were made to the computer decisions where necessary, to create the linked Statistics Canada birth-infant death file.
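The decision rule described above can be illustrated with a minimal Python sketch. Only the two thresholds (-90 and +300) come from the text. The field names and the per-field agreement and disagreement weights are hypothetical, chosen purely for illustration; the weights actually used by the Generalized Record Linkage System are not reported in this article.

```python
# Thresholds reported in the text.
REJECT_BELOW = -90
ACCEPT_ABOVE = 300

# HYPOTHETICAL per-field weights: (weight if values agree, weight if they disagree).
# These numbers are assumptions for illustration, not values from the study.
FIELD_WEIGHTS = {
    "surname_code": (120, -60),
    "date_of_birth": (150, -80),
    "sex": (10, -40),
    "postal_code": (60, -10),
}


def total_weight(birth: dict, death: dict) -> int:
    """Sum the field weights for one candidate birth-death pair."""
    total = 0
    for field, (agree, disagree) in FIELD_WEIGHTS.items():
        a, b = birth.get(field), death.get(field)
        if a is None or b is None:
            continue  # assumption: a missing value contributes nothing
        total += agree if a == b else disagree
    return total


def classify(weight: int) -> str:
    """Apply the thresholds: reject, accept, or send to manual resolution."""
    if weight < REJECT_BELOW:
        return "reject"
    if weight > ACCEPT_ABOVE:
        return "accept"
    return "manual review"  # total weight from -90 to 300 inclusive
```

Under these assumed weights, a pair agreeing on every field scores 340 and is accepted automatically, whereas a pair that disagrees only on postal code scores 270 and goes to manual resolution.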

Two subsets of the linked birth-death file were extracted for the validation comparison: (1) all 1990-1991 live birth-death linkages for which the province of birth was Nova Scotia and the age at death was 0-364 days (158 infant deaths); and (2) all 1990-1991 live birth-death linkages for which the province of birth and death was Alberta and the age at death was 0 to <28 days (371 neonatal deaths). These two provinces were chosen because they were the only ones with a perinatal data system that would allow an independent assessment of Statistics Canada's vital registration data.


The Nova Scotia and Alberta Provincial Data Files

In contrast to Statistics Canada's vital registration data, the provincial perinatal data were obtained primarily from hospital discharge records.

The Reproductive Care Program of Nova Scotia supports the Nova Scotia Atlee Perinatal Database, which collects data on obstetric and neonatal events from hospital records in the province, including information on birth weight and gestational age.10 To ensure that all cases are received by the Reproductive Care Program, these records are matched with Nova Scotia's vital statistics. Birth and death registration numbers are entered in the database. Records of infant deaths occurring after discharge from hospital are captured from Nova Scotia's vital statistics. Nova Scotia supplied the data for this validation study for births that took place only in Nova Scotia: babies born out of province and transferred to Nova Scotia were not included, nor were babies born out of province to Nova Scotia residents. The Nova Scotia file contains data for infant deaths occurring from 0 to 364 days after birth and includes birth weight and gestational age.

The Alberta Medical Association's Committee on Reproductive Care obtains mortality data from Alberta's hospital records for case review of fetal, neonatal and maternal deaths. Data are also collected from hospital birth records, including delivery, prenatal and newborn records as well as autopsy reports. Patient records are verified with Alberta's vital statistics (no information is extracted) to ensure that all cases have been received.11 The cases reviewed were those for which the province of birth and death was Alberta: babies born out of province and transferred to Alberta were not included, nor were babies born out of province to Alberta residents. The Alberta file contains data for neonatal deaths occurring less than 28 days after birth and includes birth weight and gestational age.

In order for the records of deaths to be matched, each province was asked to provide Statistics Canada with a special file containing all infant (Nova Scotia) and neonatal (Alberta) deaths in 1990-1991.


Assessment Technique

The Nova Scotia and Alberta files were processed at Statistics Canada to facilitate the match to the Statistics Canada files. Name fields were standardized, geographic fields were recoded and infant (neonatal) death records were grouped. Agreement (matching rate) between provincial records and Statistics Canada's records was used as the indicator of validity.

For data from Nova Scotia, the infant deaths on the provincial file were matched to the linked Statistics Canada file by the death registration and birth registration numbers supplied by the province. An exact match technique was employed using the SAS software package.12 For data from Alberta, registration numbers were not available on the provincial fetal and neonatal death records. Therefore, they were matched using common personal identifiers and processed together to maximize the number of individual matches.

Because there was no "type of event" field (i.e. live birth or stillbirth) on the Alberta file, this variable was imputed using "age at death" and "time of death" variables. When there were discrepancies in "type of event" between the Statistics Canada and Alberta files, the Alberta database administrator was contacted. There were 22 cases with such anomalies, 19 of which were corrected to agree with the Statistics Canada file; 3 of the 22 were not reconciled. In 2 of these cases, the records were coded as fetal death on the Alberta file and infant death on the linked Statistics Canada file. In the remaining case, the event was coded as neonatal death on the Alberta file and fetal death on the Statistics Canada file. For the 3 cases, Statistics Canada's "type of event" code was used in the presentation of the results.

Three passes were executed to maximize the number of potential matches. The selected fields for matching in the first pass were sex, the first four bytes of the surname (equivalent to first four letters) and date of birth (year, month, day). The second pass used sex, the first four bytes of the surname field and birth weight. The third pass used sex and date of birth. All matches were manually verified.
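A minimal Python sketch of the three passes follows. The matching keys for each pass are taken from the text. Two details are assumptions, since the article does not specify them: records left unmatched after one pass are carried forward to the next, and a match is proposed only when exactly one unused candidate shares the key. The record field names (`id`, `sex`, `surname`, `date_of_birth`, `birth_weight`) are hypothetical. In the study, every proposed match was then verified manually.

```python
# One key function per pass, in the order given in the text.
PASS_KEYS = [
    lambda r: (r["sex"], r["surname"][:4].upper(), r["date_of_birth"]),
    lambda r: (r["sex"], r["surname"][:4].upper(), r["birth_weight"]),
    lambda r: (r["sex"], r["date_of_birth"]),
]


def three_pass_match(provincial, statcan):
    """Return (matches, unmatched provincial records).

    Each match is (provincial id, Statistics Canada id, pass number).
    """
    matches = []
    unmatched = list(provincial)
    used = set()  # Statistics Canada ids that are already matched
    for pass_no, key in enumerate(PASS_KEYS, start=1):
        # Index the still-available Statistics Canada records by this pass's key.
        index = {}
        for rec in statcan:
            if rec["id"] not in used:
                index.setdefault(key(rec), []).append(rec)
        remaining = []
        for rec in unmatched:
            candidates = [c for c in index.get(key(rec), []) if c["id"] not in used]
            if len(candidates) == 1:  # assumption: accept unique candidates only
                used.add(candidates[0]["id"])
                matches.append((rec["id"], candidates[0]["id"], pass_no))
            else:
                remaining.append(rec)
        unmatched = remaining
    return matches, unmatched
```

Ordering the passes from the strictest key to the loosest means that a loose key (sex and date of birth alone) is tried only on records that the stricter keys could not place.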

Birth weight and gestational age in the two provincial files were compared with the Statistics Canada file if the records referred to the same individual.


Results

Nova Scotia

All but 2 (153/155, or 99%) of the infant deaths in the Nova Scotia file matched the corresponding Statistics Canada file. When gestational age was grouped by important analytic categories, agreement was found on 105 of 153 (69%) infant deaths (Table 1). Most of the disagreements were off by only one category in either direction. When birth weight was grouped by important analytic categories, agreement was reached on 147 of 153 (96%) (Table 2).

Alberta

All but 2 (365/367, or 99%) of the neonatal deaths in the Alberta file matched the corresponding Statistics Canada file. When gestational age was grouped by important analytic categories, there was agreement on 317 of 365 (87%) neonatal deaths (Table 3). When birth weight was grouped by important analytic categories, there was agreement on 354 of 365 (97%) (Table 4).
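The agreement figures above are the share of matched records that fall on the diagonal of the corresponding cross-tabulation. As a minimal sketch, the Python below recomputes the Nova Scotia birth weight agreement from the non-empty cells of Table 2. The cell counts are transcribed from that table; the function and variable names are illustrative only.

```python
# Non-empty cells of Table 2:
# (Nova Scotia category, Statistics Canada category) -> number of infant deaths.
TABLE_2 = {
    ("<=399", "<=399"): 4,
    ("400-499", "400-499"): 9,
    ("400-499", "2500-4499"): 1,
    ("500-749", "400-499"): 2,
    ("500-749", "500-749"): 20,
    ("500-749", "750-999"): 1,
    ("750-999", "500-749"): 1,
    ("750-999", "750-999"): 14,
    ("1000-1499", "1000-1499"): 11,
    ("1500-2499", "1500-2499"): 26,
    ("2500-4499", "2500-4499"): 60,
    ("2500-4499", "N/A"): 1,
    (">=4500", ">=4500"): 3,
}


def percent_agreement(cells):
    """Return (agreeing records, all records, percentage on the diagonal)."""
    total = sum(cells.values())
    agree = sum(n for (row, col), n in cells.items() if row == col)
    return agree, total, 100.0 * agree / total


agree, total, pct = percent_agreement(TABLE_2)
print(f"{agree} of {total} ({pct:.0f}%)")  # 147 of 153 (96%)
```

Note that a record with a missing value in either source counts as a disagreement under this definition, because it cannot fall on the diagonal.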


TABLE 1
Comparison of gestational age on Statistics Canada's records with gestational age on Nova Scotia's records: infant deaths, for birth years 1990-1991

Rows: gestational age (weeks) on Nova Scotia's records. Columns: gestational age (weeks) on Statistics Canada's records. A hyphen marks an empty cell.

Nova Scotia      20-21      22   23-24   25-26   27-28   29-31   32-33   34-36   37-41    >=42   Total      (%)
20-21                1       3       -       -       -       -       -       -       -       -       4   (2.61)
22                   -       3       -       1       -       -       -       -       -       -       4   (2.61)
23-24                -       4      12       1       -       -       -       -       -       -      17  (11.11)
25-26                -       -       3       7       2       -       -       -       -       -      12   (7.84)
27-28                -       -       -       2       8       -       -       -       -       -      10   (6.54)
29-31                -       -       1       1       1      10       -       1       -       -      14   (9.15)
32-33                -       -       -       -       -       1       3       2       -       -       6   (3.92)
34-36                -       -       -       1       -       -       1      13       4       -      19  (12.42)
37-41                -       -       -       -       -       1       -       3      43       4      51  (33.33)
>=42                 -       -       -       -       -       -       -       -       2       5       7   (4.58)
N/A                  1       -       1       1       -       -       -       1       5       -       9   (5.88)
Column total         2      10      17      14      11      12       4      20      54       9     153 (100.00)
(%)             (1.31)  (6.54) (11.11)  (9.15)  (7.19)  (7.84)  (2.61) (13.07) (35.29)  (5.88)


TABLE 2
Comparison of birth weight on Statistics Canada's records with birth weight on Nova Scotia's records: infant deaths, for birth years 1990-1991

Rows: birth weight (grams) on Nova Scotia's records. Columns: birth weight (grams) on Statistics Canada's records. A hyphen marks an empty cell.

Nova Scotia        <=399   400-499   500-749   750-999 1000-1499 1500-2499 2500-4499    >=4500       N/A   Total      (%)
<=399                  4         -         -         -         -         -         -         -         -       4   (2.61)
400-499                -         9         -         -         -         -         1         -         -      10   (6.54)
500-749                -         2        20         1         -         -         -         -         -      23  (15.03)
750-999                -         -         1        14         -         -         -         -         -      15   (9.80)
1000-1499              -         -         -         -        11         -         -         -         -      11   (7.19)
1500-2499              -         -         -         -         -        26         -         -         -      26  (16.99)
2500-4499              -         -         -         -         -         -        60         -         1      61  (39.87)
>=4500                 -         -         -         -         -         -         -         3         -       3   (1.96)
Column total           4        11        21        15        11        26        61         3         1     153 (100.00)
(%)               (2.61)    (7.19)   (13.73)    (9.80)    (7.19)   (16.99)   (39.87)    (1.96)    (0.65)


TABLE 3
Comparison of gestational age on Statistics Canada's records with gestational age on Alberta's records: neonatal deaths, for birth years 1990-1991

Rows: gestational age (weeks) on Alberta's records. Columns: gestational age (weeks) on Statistics Canada's records. A hyphen marks an empty cell.

Alberta           <=20   20-21      22   23-24   25-26   27-28   29-31   32-33   34-36   37-41    >=42   Total      (%)
<=20                 -       1       -       -       -       -       -       -       -       -       -       1   (0.27)
20-21                3      27       3       -       -       -       -       -       -       -       -      33   (9.04)
22                   -       3      26       2       -       -       -       -       -       -       -      31   (8.49)
23-24                1       2       4      55       4       1       -       -       -       -       -      67  (18.36)
25-26                -       -       1       3      32       4       -       -       -       1       -      41  (11.23)
27-28                -       -       -       1       2      27       2       -       -       -       -      32   (8.77)
29-31                -       -       -       -       -       2      22       -       -       -       -      24   (6.58)
32-33                -       -       -       -       -       -       -      12       2       -       -      14   (3.84)
34-36                -       -       -       -       -       -       -       -      38       3       -      41  (11.23)
37-41                -       -       -       -       -       -       -       -       2      72       1      75  (20.55)
>=42                 -       -       -       -       -       -       -       -       -       -       6       6   (1.64)
Column total         4      33      34      61      38      34      24      12      42      76       7     365 (100.00)
(%)             (1.10)  (9.04)  (9.32) (16.71) (10.41)  (9.32)  (6.58)  (3.29) (11.51) (20.82)  (1.92)


TABLE 4
Comparison of birth weight on Statistics Canada's records with birth weight on Alberta's records: neonatal deaths, for birth years 1990-1991

Rows: birth weight (grams) on Alberta's records. Columns: birth weight (grams) on Statistics Canada's records. A hyphen marks an empty cell.

Alberta            <=399   400-499   500-749   750-999 1000-1499 1500-2499 2500-4499    >=4500       N/A   Total      (%)
<=399                 29         -         -         -         -         -         -         -         -      29   (7.95)
400-499                1        40         -         -         -         -         -         -         -      41  (11.23)
500-749                -         -        81         1         -         -         -         -         1      83  (22.74)
750-999                -         -         -        30         -         -         -         -         -      30   (8.22)
1000-1499              -         -         1         1        45         -         -         -         -      47  (12.88)
1500-2499              -         -         -         -         -        44         -         -         -      44  (12.05)
2500-4499              -         -         -         -         -         1        84         -         -      85  (23.29)
>=4500                 -         -         -         -         -         -         4         1         -       5   (1.37)
N/A                    -         -         -         1         -         -         -         -         -       1   (0.27)
Column total          30        40        82        33        45        45        88         1         1     365 (100.00)
(%)               (8.22)   (10.96)   (22.47)    (9.04)   (12.33)   (12.33)   (24.11)    (0.27)    (0.27)


   

Discussion

Probabilistic linkage is considered the preferable linkage method because the calculation of the probability can be refined in various respects to accommodate weights associated with identifier values and coding errors, and can thus maximize the available information. As a result, computer systems used for probabilistic linkage, including the one developed by Statistics Canada, have become popular tools in large population-based studies.3-6 However, before applying the methodology to different data sets, it is important to assess the validity of the linked records, that is, whether and to what extent the linked records are accurate.

The matching rate between provincial records and Statistics Canada's records is high: for the Nova Scotia file, 99% of the infant deaths were successfully located on the linked Statistics Canada file, and for Alberta neonatal deaths, the agreement was over 99%. The high agreement was demonstrated not only in the matching percentage of the study records but also in the agreement of birth weights obtained from these records. The agreement for gestational age, a variable that is more vulnerable to misclassification, was still quite satisfactory. The robustness of Statistics Canada's computer system is further documented by the consistently high agreement in data collected in this study from different jurisdictions through different data collection mechanisms and with different available personal identifying variables. Furthermore, a matching rate of 84-98% has been observed in a few earlier studies using similar methods.13-17

We acknowledge that there is no "gold standard" for the current study. Nova Scotia's and Alberta's provincial surveillance data may not necessarily be superior to Statistics Canada's registry data, and the reverse is also true. However, if we can assume that a high degree of agreement between records collected from two different systems suggests a high degree of validity, then these results indicate that Statistics Canada's computer system for probabilistic linkage using the available personal identifying variables in the CBDB and the CMDB is valid.

Record linkage also provides an opportunity to improve the data quality of participating databases. In this case, the linkage study located three infants born in Nova Scotia who had died in another province and who had not been identified on the Nova Scotia file. The linkage also located 22 events in Alberta (i.e. fetal or neonatal deaths) that had been incorrectly identified in the Alberta file.

This validation study examines the linkage results and analytic variables in only two provinces over a two-year period. Whether the results are applicable to other jurisdictions or other periods of time needs consideration. There may be variations in coding and recording practices over time and across jurisdictions. For example, some provinces have historically recorded birth weight in pounds and ounces whereas other provinces have used grams. Standard definitions and coding schemes are required in all jurisdictions in order to carry out a comprehensive comparison. Also noted within the Alberta file were inconsistencies between the type of death and the age of death. To improve record linkage, definitions should be followed precisely and the personal identifiers necessary for record linkage and grouping of analytic variables should be as complete as possible.
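The unit problem mentioned above can be made concrete with a short Python sketch that converts a birth weight recorded in pounds and ounces to grams and then assigns it to the analytic categories used in Tables 2 and 4. The conversion factor is the standard definition of the avoirdupois pound. Rounding to the nearest gram and the function names are assumptions for illustration, not a description of how any province or Statistics Canada actually recoded its data.

```python
GRAMS_PER_POUND = 453.59237          # exact definition of the avoirdupois pound
GRAMS_PER_OUNCE = GRAMS_PER_POUND / 16

# Upper bound (inclusive, in grams) and label of each birth weight category
# used in Tables 2 and 4; weights above the last bound fall in ">=4500".
CATEGORY_BOUNDS = [
    (399, "<=399"),
    (499, "400-499"),
    (749, "500-749"),
    (999, "750-999"),
    (1499, "1000-1499"),
    (2499, "1500-2499"),
    (4499, "2500-4499"),
]


def to_grams(pounds, ounces):
    """Convert pounds and ounces to grams, rounded to the nearest gram (assumption)."""
    return round(pounds * GRAMS_PER_POUND + ounces * GRAMS_PER_OUNCE)


def birth_weight_category(grams):
    """Map a weight in grams to its analytic category; None maps to 'N/A'."""
    if grams is None:
        return "N/A"
    for upper, label in CATEGORY_BOUNDS:
        if grams <= upper:
            return label
    return ">=4500"
```

Because one ounce is roughly 28 grams, a weight recorded to the nearest ounce can land on either side of a category boundary after conversion, which is one way that differences in recording practice could produce category disagreements between sources.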

Acknowledgements

We thank the Vital Statistics Registrars of the provinces and territories who gave us access to their files. This study was conducted under the auspices of the Canadian Perinatal Surveillance System. The study's participants would like to thank Monique Atkinson and Ernesto Delgado for their assistance in preparing the text and tables.

References

1. Fair M, Cyr M. The Canadian Birth Data Base: a new research tool to study reproductive outcomes. Health Rep 1993;5(3):281-90.

2. Smith ME, Newcombe HB. Use of the Canadian Mortality Data Base for epidemiological follow-up. Can J Public Health 1982;73:39-44.

3. Newcombe HB, Kennedy JM, Axford SJ. Automatic linkage of vital records. Science 1959;130:954-9.

4. Newcombe HB. Handbook of record linkage: methods for health and statistical studies, administration, and business. Oxford (England): Oxford University Press, 1988.

5. Howe GR. Use of computerized record linkage in cohort studies. Epidemiol Rev 1998;20:112-21.

6. Howe GR, Lindsay J. A generalized iterative record linkage computer system for use in medical follow-up studies. Comput Biomed Res 1981;14:327-40.

7. Smith ME, Silins J. Generalized iterative record linkage system. In: Proceedings of the American Statistical Association, Social Statistics Section. 1981:128-37.

8. Fair M, Cyr M, Allen AC, Wen SW, Guyon G, MacDonald RC and the Fetal-Infant Mortality Study Group of the Canadian Perinatal Surveillance System. Validation study for a record linkage of births and infant deaths in Canada. Ottawa: Statistics Canada, 1999; Cat 84F0013XIE.

9. Lynch BT, Arends WL. Selection of surname coding procedure for the SRS record linkage system. Washington (DC): US Department of Agriculture, 1977 Feb.

10. Allen A, Attenborough R, Dodds L, Luther E, Pole J. Perinatal care in Nova Scotia, 1988-1995. Halifax: The Reproductive Care Program, 1996.

11. The Alberta Medical Association Committee on Reproductive Care. 1995 Alberta perinatal and neonatal statistics and maternal mortality annual report. Edmonton: Alberta Medical Association, 1997.

12. SAS Institute Inc. Statistical Analysis System, Version 6. Cary (NC): SAS Institute Inc., 1989.

13. Kusiak RA, Springer J, Ritchie AC, Muller J. Carcinoma of the lung in Ontario gold miners: possible etiological factors. Br J Ind Med 1991;48:808-17.

14. Goldberg MS, Carpenter M, Thériault G, Fair ME. The accuracy of ascertaining vital status in a historical cohort study of synthetic textiles workers using computerized record linkage to the Canadian Mortality Data Base. Can J Public Health 1993;84(3):201-4.

15. Shannon HS, Jamieson E, Walsh C, Julian JA, Fair ME, Buffet A. Comparison of individual follow-up and computerized record linkage using the Canadian Mortality Data Base. Can J Public Health 1989;80:54-7.

16. Newcombe HB, Smith ME, Howe GR, Mingay J, Strugnell, Abatt A. Reliability of computerized versus manual death searches in a study of the health of Eldorado uranium workers. Comput Biol Med 1983;13:111-23.

17. Schnatter AR, Thériault G, Katz AM, Thompson FS, Donaleski D, Murray N. A retrospective mortality study within operating segments of a petroleum company. Am J Ind Med 1992;22:209-29.



Author References

Martha Fair, Margaret Cyr and Ralph C MacDonald, Statistics Canada, Ottawa, Ontario

Alexander C Allen, Department of Pediatrics, Dalhousie University, Halifax, Nova Scotia

Shi Wu Wen, Bureau of Reproductive and Child Health, Laboratory Centre for Disease Control, Health Protection Branch, Health Canada, Ottawa, Ontario

Grace Guyon, Alberta Medical Association, Edmonton, Alberta

Correspondence: Shi Wu Wen, Bureau of Reproductive and Child Health, Laboratory Centre for Disease Control, Tunney's Pasture, Address Locator: 0701D, Ottawa, Ontario  K1A  0L2

a Contributing members of the Fetal and Infant Health Study Group: Linda Bartlett (Centers for Disease Control and Prevention), KS Joseph (Dalhousie University), Michael S Kramer (McGill University), Robert M Liston (University of British Columbia), Sylvie Marcoux (Université Laval), Brian McCarthy (Centers for Disease Control and Prevention), Douglas D McMillan (University of Calgary), Arne Ohlsson (University of Toronto), Reg Sauvé (University of Calgary) and Russell Wilkins (Statistics Canada)


 

Last Updated: 2002-10-11