Validity of International Health Regulations in Reporting Emerging Infectious Diseases

Use of more prescriptive criteria and training of persons responsible for reporting could improve results.

The great influenza pandemic of 1918 and the increase in HIV/AIDS are 2 striking examples of the devastation and profound effect on human societies caused by emerging infectious diseases (EIDs) (1). The Institute of Medicine defines EIDs as "new, re-emerging, or drug-resistant infections whose incidence in humans has increased or whose incidence threatens to increase in the near future" (2). EIDs are a global phenomenon, with hotspots from which EIDs are more likely to appear, concentrated in low-latitude developing countries (3). EIDs are probably underreported, particularly in areas that have hotspots and also weak surveillance systems (4). A study in 2008 by Jones et al. reported 335 EIDs during 1940-2004 (3).
The purpose of the 2005 International Health Regulations (IHR) is to "help the international community prevent and respond to acute public health risks that have the potential to cross borders and threaten people worldwide" (5). This purpose includes development of an international reporting system, under which member states have a duty to report to the World Health Organization (WHO) "all events which may constitute a public health emergency of international concern" (5). These events are not limited to communicable diseases and can include contaminated food, chemical contamination of products or the environment, release of radionuclear material, or other toxic release (6). Events are reported to WHO by designated national focal points (NFPs) in each member state. WHO has designed a decision instrument contained in Annex 2 of the 2005 IHR (7) to assist with the notification process on the basis of an algorithm comprising 4 main criteria: the event has a serious public health effect, the event is unusual or unexpected, there is a major risk for international spread, and there is a major risk for international travel or trade restrictions. At least 2 of the criteria must be satisfied for an event to be notifiable.
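The notification rule summarized above can be sketched in a few lines of Python. This is a minimal illustration that covers only the 4 criteria and the "at least 2" threshold as described in this paragraph; the class and field names are hypothetical and are not part of the IHR text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annex2Assessment:
    """Yes/no answers to the 4 criteria described in the text (names are illustrative)."""
    serious_public_health_effect: bool
    unusual_or_unexpected: bool
    risk_of_international_spread: bool
    risk_of_travel_or_trade_restrictions: bool

    def criteria_met(self) -> int:
        # Count how many of the 4 criteria were answered "yes".
        return sum((
            self.serious_public_health_effect,
            self.unusual_or_unexpected,
            self.risk_of_international_spread,
            self.risk_of_travel_or_trade_restrictions,
        ))

    def is_notifiable(self) -> bool:
        # At least 2 criteria must be satisfied for an event to be notifiable.
        return self.criteria_met() >= 2
```

For example, an event judged serious and unusual, but with no major risk for spread or restrictions, would be classed as notifiable under this rule.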
An IHR expert committee suggested regular evaluations of the notification process (8). However, the only published evaluation of the Annex 2 decision instrument is a reliability study that analyzed NFP notification concordance (9). This study also reported a sensitivity of 80% (on the basis of 5 events) and a specificity of 50% (on the basis of 4 events). Although the study reported a high reliability, the number of events was too low to adequately assess the sensitivity and specificity of the decision instrument. A 2008 WHO technical report on Annex 2 (10) mentions a 2006 workshop assessing the decision instrument validity and finding a sensitivity of 100% and a specificity of 55% on the basis of 10 events. There were no details on the methods used and the study results were not published.
The aim of this study was to evaluate the predictive validity of the Annex 2 decision instrument. We focused on EIDs by applying screening test evaluation methods to the IHR Annex 2 decision instrument and estimated its sensitivity, specificity, and positive predictive value (PPV).

Methods
The sensitivity, specificity, and PPV of the Annex 2 decision instrument were calculated by asking an investigator to decide whether each event in a series of historical EID events would have been reported to WHO by using the criteria of the instrument. A panel of 3 internationally recognized EID and IHR experts, who were independent of the notifying investigator, was then asked whether each event was truly of international public health concern. The sensitivity, specificity, and PPV of the decision instrument were then calculated by cross-tabulating the outcome of the notification process and the true outcome of each event (taken as the expert panel consensus decision) in a 2 × 2 table.
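The cross-tabulation described above can be illustrated with a short Python sketch. It assumes each event is represented by 2 Boolean values (the notification decision and the expert panel consensus); the function names are hypothetical, and the study itself used Stata rather than this code.

```python
from collections import Counter


def two_by_two(notified, truly_of_concern):
    """Cross-tabulate notification decisions against the reference outcome."""
    if len(notified) != len(truly_of_concern):
        raise ValueError("both sequences must describe the same events")
    cells = Counter(zip(notified, truly_of_concern))
    return {
        "tp": cells[(True, True)],    # notified and truly of concern
        "fp": cells[(True, False)],   # notified but not of concern
        "fn": cells[(False, True)],   # not notified but truly of concern
        "tn": cells[(False, False)],  # not notified and not of concern
    }


def screening_metrics(table):
    """Sensitivity, specificity, and PPV from a 2 x 2 table (no zero-division guard)."""
    tp, fp, fn, tn = table["tp"], table["fp"], table["fn"], table["tn"]
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }
```

With 6 made-up events, `two_by_two([True, True, True, False, False, True], [True, False, True, False, True, False])` yields 2 true positives, 2 false positives, 1 false negative, and 1 true negative.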
The EID events used were sampled systematically from the list of 335 EID events identified by Jones et al. (3), starting from the most recent and going back until the required sample size was reached. The study required 160 events to have CIs that did not exceed 10% on each side of the point estimate of sensitivity and specificity if sensitivity was 90%, specificity was 55%, and 40% of events were truly of international public health concern. These values were chosen on the basis of the best available information (9,10). The IHR Annex 2 decision instrument was used to decide whether each EID event fulfilled the notification criteria. The decision was based on the information available in the references for each EID event provided in the original report by Jones et al. (3).
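The sample size reasoning can be approximated as follows. The exact method used is not described in the text, so this sketch assumes a 95% CI based on the normal approximation for a proportion; the function names are hypothetical. Under that assumption, the calculation gives about 159 events, in line with the 160 stated above.

```python
import math


def n_for_proportion(p, half_width, z=1.96):
    """Events needed so that a normal-approximation CI for p has the given half-width."""
    return z ** 2 * p * (1 - p) / half_width ** 2


def required_events(sensitivity, specificity, prevalence, half_width=0.10, z=1.96):
    """Total events needed to meet the CI target for both sensitivity and specificity."""
    # Sensitivity is estimated only among events truly of concern,
    # specificity only among events not of concern.
    n_concern = n_for_proportion(sensitivity, half_width, z)
    n_no_concern = n_for_proportion(specificity, half_width, z)
    total_for_sensitivity = n_concern / prevalence
    total_for_specificity = n_no_concern / (1 - prevalence)
    return math.ceil(max(total_for_sensitivity, total_for_specificity))


print(required_events(sensitivity=0.90, specificity=0.55, prevalence=0.40))
```

In this sketch the specificity target drives the total, because a proportion near 50% has the widest CI for a given number of events.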
To emulate real-life conditions, the investigator used only information available at the time of event occurrence. Each criterion was answered by yes or no, and ≥2 positive answers classed the EID event as notifiable, according to WHO guidance. To establish the true outcome for every EID event, each expert had to give an opinion on 4 statements: the public health effect of the event was serious; the event was unusual or unexpected; the event spread internationally; and the event led to travel or trade restrictions. The 4 statements were derived from the IHR Annex 2 criteria, but were retrospective and ascertained the a posteriori outcome of each EID event. A Likert scale was used to score each statement with scores from 1 (strongly disagree) to 5 (strongly agree).
Experts based their decisions on their opinion and knowledge and on a supplied information sheet for each event. They were blinded to the notification outcome of each EID event and assessed each event independently. The opinion on each statement of each event for each expert was converted to a numerical score from −2 to +2 (Table 1), which was then summed to give an overall value for each statement and 4 values per EID event. For each statement, an overall positive score was considered a consensus agreement with the statement, and an overall negative score was considered a consensus disagreement with the statement. A null score was considered a failure to agree on that criterion. Events with ≥2 agreed statements were considered to be of international public health concern. Events with ≤1 agreed statement and ≥1 disagreed statement were considered to be of no international public health concern. Events for which there was 1 agreed statement and for which no agreement could be reached on 3 statements were not used in the study.
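The consensus rules above can be expressed as a small Python sketch. Table 1 is not reproduced here, so the linear mapping from Likert scores 1-5 to −2 to +2 is an assumption, as are the function names and the "unresolved" label used for events on which the panel could not reach a decision.

```python
# Assumed linear mapping from Likert score to numerical score (Table 1 not shown here).
LIKERT_TO_SCORE = {1: -2, 2: -1, 3: 0, 4: 1, 5: 2}


def statement_consensus(likert_ratings):
    """Sum the experts' converted scores for one statement and read off the sign."""
    total = sum(LIKERT_TO_SCORE[rating] for rating in likert_ratings)
    if total > 0:
        return "agree"
    if total < 0:
        return "disagree"
    return "none"  # null score: failure to agree on that criterion


def classify_event(ratings_by_statement):
    """Classify one event from the ratings given to each of the 4 statements."""
    outcomes = [statement_consensus(ratings) for ratings in ratings_by_statement]
    agreed = outcomes.count("agree")
    disagreed = outcomes.count("disagree")
    if agreed >= 2:
        return "concern"
    if disagreed >= 1:  # reached only when at most 1 statement is agreed
        return "no concern"
    return "unresolved"  # no disagreed statement and at most 1 agreed statement
```

For instance, `classify_event([[5, 4, 3], [4, 4, 2], [1, 2, 3], [3, 3, 3]])` returns `"concern"` because the panel agrees on the first 2 statements.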
Statistical analysis was performed by using Stata version 11 (StataCorp LP, College Station, TX, USA). A description was made of the distribution of events according to WHO region, type of pathogen, and type of event. We calculated the notification rate; the prevalence of EID events of international public health concern according to the expert panel; and the distribution of these events by type of pathogen, WHO region, and type of event. Sensitivity, specificity, PPV, and CIs of the decision instrument were then calculated. Concordance and its association with type of event, type of pathogen, and WHO region were calculated by using logistic regression. Concordance for each of the 4 criteria was also calculated. An intraclass correlation coefficient (11) was calculated for the combined score allocated by each expert (aggregated scores of all 4 criteria for each event, which provided a measure of overall concern; possible score of 20) to each EID event.
The appropriateness of the consensus-building method was tested by translating the judgment of each expert panel member into a binary scoring system, in which, for each criterion, a score of 4 or 5 would translate to "I agree" and a score of 1, 2, or 3 would translate to "I disagree." This process enabled identification of which EID events experts individually considered to be of public health concern: EID events for which an expert agreed with ≥2 criteria were taken as signifying international public health concern. Agreement levels between individual experts and the consensus were then calculated.
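The binary check described above can be sketched as follows, assuming each expert's ratings for an event are given as 4 Likert scores and the consensus is a Boolean per event. The function names are hypothetical.

```python
def expert_flags_concern(likert_by_criterion):
    """Binary judgment for one expert and one event: 4 or 5 counts as 'I agree'."""
    agreed = sum(1 for score in likert_by_criterion if score >= 4)
    # An event with at least 2 agreed criteria signifies international concern.
    return agreed >= 2


def agreement_with_consensus(expert_ratings, consensus_flags):
    """Proportion of events on which one expert's binary judgment matches the consensus."""
    matches = sum(
        expert_flags_concern(ratings) == consensus
        for ratings, consensus in zip(expert_ratings, consensus_flags)
    )
    return matches / len(consensus_flags)
```

With 4 made-up events rated `[[5, 4, 1, 1], [3, 3, 2, 1], [4, 2, 2, 5], [5, 1, 1, 1]]` and a consensus of `[True, False, False, False]`, the expert matches the consensus on 3 of 4 events.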

Results
Of 204 identified EID events, 13 were not eligible because they did not fit the definition of an EID or were duplicates. Sixteen events were discarded because of insufficient information, leaving 175 (92%) of 191 eligible events to be analyzed. Their characteristics are summarized in Table 2. A total of 124 (70.9%) of 175 events fulfilled ≥2 of the 4 decision instrument criteria according to the notifying investigator and should have been reported to WHO, according to the Annex 2 decision instrument. No EID event was withdrawn from the study because of failure of the expert panel to agree. Of the 175 EID events assessed by the expert panel, 46 (26.3%) were deemed to be of international public health concern. Characteristics of these 46 events are shown in Table 3.
The intraclass correlation coefficient for assessing the agreement level for overall public health concern for each event, by using an aggregated score of 20, was 0.68 (95% CI 0.60-0.74). After simplifying the scores to obtain a judgment for each EID event for each expert, the agreement levels for each panel member compared with those of the consensus were 76.5%, 84.6%, and 85.7%, respectively.

Discussion
The IHR Annex 2 decision instrument has a high sensitivity (95.6%; 95% CI 89.8%-100%) but a low specificity (38%; 95% CI 29.6%-46.3%). These figures are consistent with previous anecdotal evidence (9,10). In this situation, trading specificity for high sensitivity is desirable because missing events of international public health concern would have serious consequences and would outweigh benefits of a lower volume of false-positive results. A low specificity is not a major concern as long as the volume of notification is low (9), and currently there is "little evidence that Annex 2 is being frequently or routinely used by State Parties in the assessment of events" (12). A low specificity could become problematic if the volume of events reported through Annex 2 increased. The low specificity would result in an increase in false-positive results and increased costs associated with the notification process and determination of serious events.
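The estimates above can be approximately reproduced from counts given elsewhere in the article: 175 events, 124 classed as notifiable, 46 deemed of international concern, and 2 events of concern missed by the instrument. The 2 × 2 cell counts below are inferred from those figures rather than taken from a published table, and the normal-approximation (Wald) interval truncated to the 0-1 range is an assumption about how the CIs were obtained.

```python
import math


def wald_ci(successes, total, z=1.96):
    """Proportion with a normal-approximation 95% CI, truncated to [0, 1]."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Cell counts inferred from the text (not from a published table).
events, notified, of_concern, missed = 175, 124, 46, 2
tp = of_concern - missed            # events of concern that were notified
fn = missed                         # events of concern that were not notified
fp = notified - tp                  # notified events that were not of concern
tn = events - notified - fn         # events neither notified nor of concern

sensitivity = wald_ci(tp, tp + fn)  # roughly 0.957 (0.898 to 1.0)
specificity = wald_ci(tn, tn + fp)  # roughly 0.380 (0.296 to 0.464)
```

The close match between these values and the reported intervals suggests, but does not confirm, that a similar approximation was used in the original analysis.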
The low specificity is reflected in a PPV of 35.8%. The calculated PPV could be underestimated for 2 main reasons. First, the prevalence of events identified as being of international public health concern might not reflect the prevalence of events truly reported to WHO. Second, in the current study, all EID events selected were submitted to the decision instrument, regardless of personal judgment. In real-life conditions, events least likely to be of international public health concern would have been excluded even before being submitted to the decision instrument, which would increase the prevalence of events of international public health concern in events submitted to the decision instrument and consequently the PPV.
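The dependence of PPV on prevalence described above follows from the standard relationship among sensitivity, specificity, and prevalence. The sketch below uses the rounded estimates from this study (sensitivity 95.6%, specificity 38%) and the observed prevalence of 46 of 175 events; the higher prevalence values are purely illustrative and are not study results.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


observed_prevalence = 46 / 175  # about 26.3%, as reported in the Results

# Hypothetical prevalences that prefiltering of events might produce.
for prevalence in (observed_prevalence, 0.40, 0.60):
    print(f"prevalence {prevalence:.1%}: PPV {ppv(0.956, 0.38, prevalence):.1%}")
```

At the observed prevalence, this calculation gives a PPV of about 35.5%, close to the 35.8% reported, and the PPV rises as the prevalence of events of concern among submitted events increases.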
The specificity estimate was lower than that in 2 other evaluations (9,10) (38% vs. 50% and 55%, respectively). Although our estimate could be a more accurate reflection of the instrument specificity, it could also be an underestimate. Because instrument criteria are quite flexible and subject to interpretation, it is possible to reach a decision to report an event in which the likelihood of it becoming of international public health concern is small. In addition, courtesy bias, in which the assessor believes that erring on the side of caution is more acceptable than not reporting an event, could have occurred. The current study strictly applied the criteria described in the Annex 2 guidance without using the context of the event or personal judgment. The decision instrument criteria are designed to take context and personal judgment into account to be adaptable to current and future unknown threats (13). Use of personal judgment rather than strictly applying the decision instrument criteria leads to a lower notification rate (9).
Two events of international public health concern were missed despite the high sensitivity of the instrument, which reflected challenges of predicting evolution of an event as it occurs and potential for human error. Although a sensitivity of 100% would be difficult to attain, maintaining the number of missed events at an absolute minimum should be a priority when the instrument is revised or evaluated.
Predictions of seriousness and unusualness of events were least accurate and showed concordance rates of 49.7% and 58.3%, respectively. This finding reflects the subjectivity and broad spectrum of the seriousness and unusualness criteria. Although these findings might lead to overreporting, criteria flexibility is also "a major strength that makes the IHR future-proof against new and unforeseeable threats" (9). The other 2 criteria of international spread and restriction to travel and trade have higher concordance rates of 81.1% and 96%, respectively. Should there be a need to increase the specificity of the instrument, the focus should be on tightening the first 2 criteria and being more specific about what makes an event serious or unusual. Training staff at NFPs could also increase the specificity of the instrument (by perfecting their use of the decision instrument) and its PPV (by prefiltering which events to submit to the decision instrument). Staff of NFPs have been trained in the past by using online tools and workshops (10,14), and both approaches could be used.
Sensitivity and specificity of the decision instrument did not depend on event type, pathogen type, or WHO region of occurrence because no strong evidence of an association between concordance and these factors was found. This finding suggests the Annex 2 decision tool is adequate for reporting antimicrobial drug resistance, although it was not designed with drug resistance in mind. There have been calls to use the decision instrument for antimicrobial drug resistance events (15).
Although EID events were systematically, rather than randomly, sampled from the EID list compiled by Jones et al. (3), the distribution of events by type of pathogen was not significantly different from the distribution of events in the complete list from which the study sample was extracted. The study sample and the database from which it was extracted have a proportion of bacteria that is higher than other estimates of EID distribution (16,17). This finding can be explained by the fact that a large (43.8%) proportion of bacterial events are antimicrobial drug resistance events, which were not included as EIDs in many other studies. Jones et al. also reported a bias toward events occurring in industrialized countries, which reflects publication bias and better surveillance systems in these countries (3). However, these findings do not affect the internal validity of the study, and the fact that the current sample includes a wide variety of types of events can give confidence that the types of EID events truly reported to WHO are likely represented in the study sample. The 16 events for which no information could be obtained did not statistically differ from the rest of the events, and the proportion of events without information was relatively low (8%), which made bias caused by information availability unlikely. The notifying investigator could not be blinded to EID events he or she was assessing, and it was possible to identify famous EIDs (such as emergence of Nipah virus) from the information, potentially introducing a bias toward reporting famous events. However, knowledge of these events is often the result of international concern, and they would have been reported regardless of these factors.
The intraclass correlation coefficient of 0.68 showed moderate-to-strong levels of agreement between expert panel members. The overall score given by each judge for each event was believed to be a good overall reflection of the role of the event. One limitation of this method was that the same score could be obtained with different opinions: e.g., if 1 expert strongly agreed that an event was serious but strongly disagreed that an event spread internationally, it would produce the same score as another expert strongly disagreeing with seriousness but strongly agreeing with international spread. However, when agreement levels were assessed for each criterion by calculating 4 intraclass correlation coefficients, there was no strong disagreement on any of the criteria, making that scenario unlikely.
The method showed agreement levels between experts and the consensus >75%. This agreement could have been improved by using a Delphi style approach (18), showing panel members their results compared with the mean of the whole panel and having a second round of evaluation.
This study took the approach of treating the IHR decision instrument as a screening tool, thus enabling screening evaluation methods to be applied. One strong point of this study was the sample size: 175 real-life events, a large enough sample to accurately estimate sensitivity and specificity with relatively narrow CIs. Furthermore, the fact that retrospective events were used enabled testing for predictive validity because in hindsight it was possible to evaluate the true international public health role of each event rather than just its potential for international public health concern. All panel members were internationally recognized as experts in the field. Therefore, their opinions were as reliable as can be obtained by using such a method. The fact that they were blinded to whether each event would be reported and to each other's opinions, and the objective method used to decide on consensus for each EID event, further strengthen the method. Increasing the size of the panel may also have added rigor to the evaluation.
The definition of an EID was wider and more encompassing than most definitions used in the literature, particularly because it included antimicrobial drug resistance. Therefore, the validity of the decision instrument was tested by using a wide variety of types of events likely to represent the range of events NFP staff would encounter in real life.
This study attempted to replicate real-life situations by means of a theoretical exercise. The amount of information available on each event was limited, and the WHO Annex 2 decision instrument criteria described in the guidance were rigidly applied. Furthermore, political or economic considerations that could not be replicated in a study are often taken into account when reporting an event (19). Therefore, the study implies a degree of simplification of real-life conditions.
The sample of events was limited to EIDs, whereas the Annex 2 decision instrument is used for a variety of events, including radiation and chemical incidents and outbreaks of well-established pathogens. Whether the results of this study can be extrapolated to such events is not clear.
Although as much care as possible was taken to make the expert panel method objective, it still relied to some extent on individual opinion, and expert panel judgment on each event could not claim to be the definitive and universal truth. This shortcoming is inherent to the method and has been noted in other studies of the IHR Annex 2 decision instrument that used expert panels (9,10). Every attempt was made to minimize subjectivity by giving clear written guidelines to each expert, blinding the experts to the notification outcome, preventing experts from discussing the events, and deriving agreement mathematically.
The IHR Annex 2 decision instrument is a sensitive tool for reporting EIDs of international public health concern. The instrument lacks specificity mainly because of broad, nonspecific criteria that can lead to overreporting. The PPV of the instrument is also relatively low. If one considers the nature of the instrument and potential consequences of WHO not being aware of an EID event of international public health concern, sensitivity should be prioritized over specificity. In the current situation in which the volume of notification remains low, the instrument is adequate. However, if the IHR Annex 2 decision instrument is to be used more systematically in reporting and the volume of notification increases, there may be a need to increase the specificity and PPV of the instrument. This increase could be achieved by setting more prescriptive seriousness and unusualness criteria that are more specific about what constitutes a serious or an unusual event, and by regular training of NFP staff online and through workshops to ensure that NFP staff report only relevant events, which would improve specificity without decreasing sensitivity and in turn increase PPV. Also, focus should be placed on keeping the number of missed events to a minimum. However, instrument criteria must retain a certain level of interpretability so that the instrument can be adapted to a variety of unknown threats in the future, and must not sacrifice sensitivity, which should remain the priority of the instrument. Finally, the approach taken in treating the IHR decision instrument as a screening tool and evaluating it as such has proved useful in understanding its value and limitations.