Bulletin of the World Health Organization

Structured approaches for the screening and diagnosis of childhood tuberculosis in a high prevalence region of South Africa

Mark Hatherill a, Monique Hanslo a, Tony Hawkridge b, Francesca Little c, Lesley Workman a, Hassan Mahomed a, Michele Tameris a, Sizulu Moyo a, Hennie Geldenhuys a, Willem Hanekom a, Lawrence Geiter d & Gregory Hussey a

a. School of Child and Adolescent Health, University of Cape Town, Anzio Road, Cape Town, 7925, South Africa.
b. Aeras Global TB Vaccine Foundation, Rockville, United States of America (USA).
c. Department of Statistical Sciences, University of Cape Town, Cape Town, South Africa.
d. Otsuka Pharmaceutical Development and Commercialization Inc., Rockville, USA.

Correspondence to Mark Hatherill (e-mail: mark.hatherill@uct.ac.za).

(Submitted: 09 January 2009 – Revised version received: 11 September 2009 – Accepted: 07 October 2009 – Published online: 29 December 2009.)

Bulletin of the World Health Organization 2010;88:312-320. doi: 10.2471/BLT.09.062893


Despite the scale of the worldwide tuberculosis epidemic, the disease remains very difficult to diagnose in children, especially in regions with limited resources.1 Childhood tuberculosis is often paucibacillary and the diagnosis rests on interpretation of chest radiograph findings and non-specific symptoms and signs.1 Improving diagnostic accuracy and reliability is key to integrating childhood tuberculosis into national control programmes, and the World Health Organization (WHO) has thus prioritized diagnostic criteria for childhood tuberculosis.2 Objective, reproducible tuberculosis diagnosis will also be pivotal for defining end-points in trials of new tuberculosis vaccines.3 The need for accurate diagnosis is felt most acutely among younger children, who contribute substantially to the burden of tuberculosis in high prevalence regions.4–7

Routine clinical use of a structured diagnostic approach that is unsuited to a particular setting can result in systematic errors in estimating the burden of tuberculosis and in patient management. It follows that regional guidelines for screening and diagnosis of childhood tuberculosis should be tailored to their epidemiological context.

The relative merits of existing structured diagnostic approaches are debatable.5,8–17 Hesseling et al. reviewed 16 such approaches and noted that few of the scoring systems, algorithms and classifications for the screening and diagnosis of childhood tuberculosis have been validated against a gold standard. Most have been developed for hospital-based studies and their usefulness in community settings is relatively unknown.5,18–21 Some have suggested that structured diagnostic approaches should be used only as screening tools to select children for further investigation,9,10 while others have proposed a simplified case definition of childhood tuberculosis, based on cardinal symptoms, as an alternative to complex diagnostic systems.1,22

Existing structured approaches to childhood tuberculosis provide a logical and reproducible basis for diagnosis based on clinical acumen, which Cundall termed “the art of the possible”.23 However, we hypothesized that commonly used, structured approaches for screening and diagnosing childhood tuberculosis may show poor agreement and yield highly variable case frequency results. The objectives of this paper were to quantify the tuberculosis case frequencies obtained by means of nine different diagnostic systems, to assess agreement between systems, and to offer possible explanations for discordant findings.


This analysis is based on data collected during a bacille Calmette-Guérin (BCG) vaccine trial conducted by the South African Tuberculosis Vaccine Initiative (SATVI) from March 2001 to August 2006 near Cape Town, South Africa (clinical trials identifier: NCT00242047).8 In the Boland-Overberg region of South Africa, tuberculosis incidence among children aged < 2 years was estimated as > 3000 cases per 100 000 in 2006.6,8,24 In the trial, which compared the vaccine efficacy obtained with percutaneous versus intradermal Tokyo-172 BCG, 11 680 neonates were followed up for a minimum of 2 years after vaccination.8

Children in the community suspected of having tuberculosis due to a history of contact with an adult case or to the presence of symptoms compatible with the disease were identified by a regional surveillance system. All such children underwent comprehensive radiological and bacteriological investigation, even if they had no symptoms. The presence and duration of cough, wheezing, fever or weight loss; the response to antibiotics; and the proximity of contact with an adult having tuberculosis (mother, other person within the household, person outside of the household), were recorded. Human immunodeficiency virus (HIV) status was determined by a rapid antibody test and, if the result was positive, confirmatory polymerase chain reaction (PCR) was performed as well. Tuberculin skin tests included both Mantoux and Tine. Chest radiographs (anteroposterior and lateral) were reviewed by three paediatricians and classified in terms of the likelihood of tuberculosis (Table 1). Two consecutive, paired gastric lavages and induced sputum samples were obtained for smear microscopy and culture of Mycobacterium tuberculosis using mycobacteria growth indicator tubes (Becton Dickinson and Co., Sparks, MD, United States of America). A diagnostic algorithm was developed, based on approaches described by Cundall and WHO, for objective post hoc determination of tuberculosis status as the trial end-point.21,23 The decision to start tuberculosis treatment was made on discharge by the attending clinician on the basis of all available results, independent of the assigned trial end-point.

A protocol-specified objective was to compare the structured approaches used to diagnose childhood tuberculosis in developing countries with a high prevalence of tuberculosis and limited resources. Diagnostic approaches relevant to sub-Saharan Africa, dating from 1990 onwards, were selected by literature review and expert consultation. Recent modifications were preferred over versions predating the HIV era. Eight structured approaches were compared with the SATVI trial algorithm for tuberculosis case frequency.8 The country of origin, lineage and type of approach are summarized in Table 2.

Structured diagnostic approaches were categorized as follows:

  • binary, with the diagnosis being simply positive or negative (yes = tuberculosis; no = not tuberculosis);12,15
  • hierarchical, with stratification into categories of diagnostic certainty, such as “definite”, “probable”, “possible”, “unlikely” or “not tuberculosis”;8,14,16 or
  • numerical, with a score obtained by adding the weighted values assigned to each variable (score ≥ x = tuberculosis).9–11,13

Data for the variables used in these diagnostic approaches were collected prospectively during the trial. Missing variables were assigned a zero value. Published threshold values were used for the analysis; where a cut-off threshold was unspecified, the trial algorithm value was used as the default.8 To standardize reporting, the terms for the hierarchical categories of diagnostic certainty were “unlikely/not”, “possible”, “probable”, and “definite” tuberculosis.11,13,14,16 Details of the various diagnostic approaches are provided in Appendix A (available at: http://vacfa.com/index.php?option=com_content&view=section&layout=blog&id=10&Itemid=10).
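
To illustrate how a numerical system's total might be computed when missing variables are assigned a zero value, the following Python sketch may be helpful. The variable names and weights are hypothetical and are not taken from any of the nine systems; the study itself was programmed in STATA.

```python
from typing import Dict, Optional


def total_score(weights: Dict[str, float],
                observed: Dict[str, Optional[float]]) -> float:
    """Sum weighted variable values; absent or None variables contribute zero."""
    total = 0.0
    for name, weight in weights.items():
        value = observed.get(name)  # None if the variable was never recorded
        if value is None:
            value = 0               # missing variables are assigned a zero value
        total += weight * value
    return total


# Hypothetical weights and one child's record (illustrative only).
example_weights = {"tb_contact": 2, "cough_over_2_weeks": 1, "positive_tst": 3}
example_child = {"tb_contact": 1, "positive_tst": None}
example_total = total_score(example_weights, example_child)
```

Under this rule a child with incomplete data can only score lower than, or the same as, a child with complete data, which is one way in which missing values could reduce case frequency in numerical systems.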

The variables required by each system to compute a tuberculosis outcome for each child were programmed using STATA version 10 (StataCorp, Inc., College Station, TX, USA). Tuberculosis cases were defined by:

  • “positive” classification for binary (tuberculosis/not tuberculosis) systems;
  • “definite”, “probable” or “possible” classification for hierarchical systems; or
  • score ≥ the specified cut-off for numerical scoring systems.
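
The three case definitions above can be sketched as a single function that condenses each system's native output to a binary result. This is a minimal Python illustration, not the study's STATA code; the function name, the `system_type` labels and the cut-off used in the example are assumptions.

```python
from typing import Optional, Union

# Hierarchical categories that count as a tuberculosis case in this analysis.
CASE_CATEGORIES = {"possible", "probable", "definite"}


def is_tb_case(system_type: str,
               output: Union[bool, str, float],
               cutoff: Optional[float] = None) -> bool:
    """Condense a system's native output to tuberculosis / not tuberculosis."""
    if system_type == "binary":
        return bool(output)                # "positive" classification
    if system_type == "hierarchical":
        return output in CASE_CATEGORIES   # "unlikely/not" is not a case
    if system_type == "numerical":
        if cutoff is None:
            raise ValueError("numerical systems need a cut-off")
        return output >= cutoff            # score >= the specified cut-off
    raise ValueError(f"unknown system type: {system_type}")


# Hypothetical cut-off of 7, for illustration only.
at_threshold = is_tb_case("numerical", 7, cutoff=7)
below_threshold = is_tb_case("numerical", 6, cutoff=7)
```

Note that "possible" counts as a case here, so a hierarchical system that places many children in its lowest positive category will report a high binary case frequency.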

The analysis of binary outcomes compared the nine diagnostic approaches in terms of the number and percentage of tuberculosis cases diagnosed among the children investigated. McNemar’s test was used to compare the paired proportions of tuberculosis cases diagnosed with each system. P-values were not adjusted for multiple comparisons. Cohen’s kappa coefficient (Κ) was used to examine agreement between individual observations for each system. Weighted Κ statistics were calculated for systems with hierarchical classifications. The degree of agreement was defined by the following values of Κ: 0–0.2 = slight; 0.2–0.4 = fair; 0.4–0.6 = moderate; 0.6–0.8 = substantial; and 0.8–1.0 = nearly perfect.25
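
For readers who wish to reproduce this type of paired analysis, a minimal Python sketch follows. It rests on several assumptions: the exact (binomial) form of McNemar's test is used because the text does not say which variant was applied, boundary values of Κ are assigned to the lower band because the published bands share their end-points, and all function names are illustrative.

```python
from math import comb


def paired_counts(x, y):
    """2x2 counts for two binary classifications of the same children."""
    a = sum(1 for i, j in zip(x, y) if i and j)          # both: tuberculosis
    b = sum(1 for i, j in zip(x, y) if i and not j)      # only first system
    c = sum(1 for i, j in zip(x, y) if not i and j)      # only second system
    d = sum(1 for i, j in zip(x, y) if not i and not j)  # both: not tuberculosis
    return a, b, c, d


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar P-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohen_kappa(a, b, c, d):
    """Unweighted Cohen's kappa for a 2x2 table."""
    n = a + b + c + d
    po = (a + d) / n                                         # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2    # chance agreement
    return (po - pe) / (1 - pe)


def agreement_label(kappa):
    """Map kappa to the bands used in the text (boundaries go to the lower band)."""
    if not 0 <= kappa <= 1:
        raise ValueError("bands are defined only for 0 <= kappa <= 1")
    for upper, label in [(0.2, "slight"), (0.4, "fair"),
                         (0.6, "moderate"), (0.8, "substantial")]:
        if kappa <= upper:
            return label
    return "nearly perfect"


# Toy data for eight children (1 = tuberculosis); not study data.
system_1 = [1, 1, 1, 0, 0, 0, 1, 0]
system_2 = [1, 0, 1, 0, 1, 0, 1, 0]
a, b, c, d = paired_counts(system_1, system_2)
```

McNemar's test uses only the discordant pairs (b and c), so two systems can have similar case frequencies and a non-significant test while still agreeing poorly on which children are cases; Κ addresses the latter.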

In total, 1869 case episodes involving 1654 children were investigated, and one case episode was selected for each child. Since children older than 2 years were excluded, 1445 children were included in this analysis.


The median age at investigation was 11.4 months (interquartile range: 6.0–17.4). Contact with an adult with tuberculosis was reported for 952 children (65.9%), and 628 children (43.5%) had cough lasting > 2 weeks. Weight was recorded as being 60–80% of expected weight-for-age in 316 (21.9%) children and as being < 60% of expected weight-for-age in 29 children (2.0%). Of the 1445 children studied, 54 (3.7%) tested positive for HIV with enzyme-linked immunosorbent assay, and 28 of these children (1.9%) were confirmed positive for HIV by polymerase chain reaction (PCR) assay. The chest radiograph was compatible with tuberculosis in 271 children (18.8%) and Mycobacterium tuberculosis was cultured from induced sputum or gastric lavage in 172 children (11.9%). Treatment for tuberculosis was started by the attending clinician in 611 children (42.3%).

Comparison of binary outcomes

Fig. 1 illustrates the number and percentage of tuberculosis cases diagnosed with each system. The median tuberculosis case frequency was 41.7% (602 of the 1445 children investigated).

Fig. 1. Frequency of cases classified as tuberculosis with various scoring systems, with hierarchical and numerical outcomes condensed to a binary “tuberculosis/not tuberculosis” output, South Africa, 2001–2006
MASA, Medical Association of South Africa; SATVI, South African Tuberculosis Vaccine Initiative; WHO, World Health Organization.

Differences in tuberculosis case frequency are shown in Table 3. The differences were significant (P < 0.05) in 34 of 36 possible pair-wise comparisons between the various structured diagnostic approaches. Only the comparisons between the Stegen–Toledo and SATVI approaches and between the Stoltz–Donald and Fourie approaches yielded non-significant differences. The pair-wise differences in tuberculosis case frequency ranged from 1.5% to 82.3%.

Table 4 summarizes the observed agreement between all structured diagnostic approaches and shows the Κ statistics for binary “tuberculosis/not tuberculosis” outcomes. For the 36 pair-wise comparisons, Κ ranged from 0.02 to 0.71 (median Κ: 0.18).

Two systems based on clinical, radiological and bacteriological source data (Osborne and Kibel) generated the highest tuberculosis case frequencies, yet showed only fair agreement. Four systems – MASA, Osborne, Fourie and WHO–Harries – demonstrated poor to fair agreement with all of the structured diagnostic approaches analysed. Notably, two numerical systems – MASA and WHO–Harries – classified the fewest case episodes as tuberculosis, but showed only slight agreement.

Comparison of hierarchical outcomes

The distribution of diagnoses in categories of ascending diagnostic certainty is illustrated for three hierarchical and two numerical-hierarchical scoring systems (Fig. 2). The distribution of the diagnostic categories assigned by the Osborne and Kibel systems was similar: a bell-shaped curve with most diagnoses grouped in the “possible” and “probable” categories. By contrast, the Stegen–Toledo and Stoltz–Donald systems yielded results with opposite distributions, with most cases in the “not”/“unlikely” or “definite” categories.

Fig. 2. Frequency of tuberculosis diagnoses assigned to each category of diagnostic certainty, in order of increasing certainty of tuberculosis, with five hierarchical or hierarchical–numerical systems, South Africa, 2001–2006
SATVI, South African Tuberculosis Vaccine Initiative

Table 5 summarizes the observed agreement and weighted Κ for hierarchical and numerical-hierarchical systems across categories of increasing diagnostic certainty. Hierarchical agreement was nearly perfect between SATVI and Stoltz–Donald, and substantial between Kibel and Osborne.
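
The weighted Κ gives partial credit when two systems place a child in neighbouring categories. The Python sketch below shows one common formulation with linear weights; the weighting scheme actually used in the study is not stated, so the linear weights, the function name and the toy ratings are all assumptions.

```python
# Categories in order of increasing diagnostic certainty.
CATEGORIES = ["unlikely/not", "possible", "probable", "definite"]


def weighted_kappa(ratings_1, ratings_2, categories=CATEGORIES):
    """Weighted kappa with linear agreement weights (assumed scheme)."""
    k = len(categories)
    index = {name: i for i, name in enumerate(categories)}
    n = len(ratings_1)

    # Observed joint proportions for each pair of categories.
    observed = [[0.0] * k for _ in range(k)]
    for r1, r2 in zip(ratings_1, ratings_2):
        observed[index[r1]][index[r2]] += 1 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # 1 for identical categories, falling linearly to 0 at maximum distance.
        return 1 - abs(i - j) / (k - 1)

    po = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    pe = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if pe == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (po - pe) / (1 - pe)


# Toy ratings for four children; not study data.
first = ["unlikely/not", "possible", "probable", "definite"]
adjacent = ["possible", "unlikely/not", "definite", "probable"]
kappa_identical = weighted_kappa(first, first)
kappa_adjacent = weighted_kappa(first, adjacent)
```

In the toy example the two systems never choose the same category, yet the weighted statistic remains positive because every disagreement is only one category apart. This may help explain why related hierarchical systems can show better weighted agreement than their condensed binary outputs.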

Comparison of numerical outcomes

Tuberculosis case frequency ranged from 10.0% to 70.0% across four numerical scoring systems (Kibel, Fourie, WHO–Harries and Stegen–Toledo) when set at the pre-specified threshold (Fig. 3). Relative to the observed distribution of scores, two of the numerical systems (Kibel and Stegen–Toledo) used a low threshold for tuberculosis diagnosis, resulting in case frequencies of 70.0% and 53.4%, respectively. The other two systems (Fourie and WHO–Harries) used a relatively high diagnostic threshold, resulting in case frequencies of only 30.4% and 10.0%.
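
The dependence of case frequency on the position of the cut-off relative to the score distribution can be illustrated with a short Python sketch. The scores and cut-offs below are invented for illustration and do not correspond to any of the four systems.

```python
def case_frequency(scores, cutoff):
    """Percentage of children whose score is at or above the cut-off."""
    cases = sum(1 for s in scores if s >= cutoff)
    return 100 * cases / len(scores)


# Ten hypothetical scores spread evenly from 0 to 9.
toy_scores = list(range(10))

low_cutoff_frequency = case_frequency(toy_scores, 3)   # cut-off low in the distribution
high_cutoff_frequency = case_frequency(toy_scores, 9)  # cut-off high in the distribution
```

With the same set of children, moving the cut-off from the lower to the upper end of the distribution changes the toy case frequency from 70% to 10%, which is why a threshold has to be judged against the score distribution observed in the target setting.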

Fig. 3. Distribution of scores (n = 1445) obtained with different numerical scoring systems for the diagnosis of childhood tuberculosis, South Africa, 2001–2006


The most striking finding of this study was the wide variation (6.9–89.2%) in the frequency of tuberculosis cases diagnosed with the nine structured diagnostic systems. The fact that the differences in tuberculosis case frequency were statistically significant for all but two of 36 possible paired comparisons between systems suggests that the burden of childhood tuberculosis in a given population could be under- or overestimated by as much as 82 percentage points. The risk of systematic clinical error is clearly high, and excess morbidity or unnecessary treatment may result if an inappropriate diagnostic system is used for routine management. The variability in tuberculosis case frequency also underscores the importance of accurate phenotyping for interpretation of clinical trial end-points, genotypic studies, and studies of immune correlates.

The second major finding is that the two systems that yielded the highest tuberculosis case frequencies, Osborne (89.2%) and Kibel (70.0%), showed only fair agreement with each other, and the two that yielded the lowest, MASA (6.9%) and WHO–Harries (10.0%), showed only slight agreement. Although the two outlier systems that generated the lowest results yielded similar tuberculosis case frequencies, the slight agreement suggests that they may be identifying different subpopulations.

In this study, the variation in tuberculosis case frequency observed when different structured diagnostic approaches were used and the relatively poor agreement between systems were more pronounced than previously reported. Edwards et al. retrospectively assessed agreement between clinical scoring systems used to diagnose tuberculosis among 91 children at a hospital in Kinshasa, Democratic Republic of the Congo. The four approaches (Fourie, WHO provisional guidelines, Stegen–Kaplan, and Ghidey–Habte) generated tuberculosis case frequencies ranging from 87% to 96%.9,19–21 Agreement between systems ranged from fair (Κ: < 0.4) to moderate (Κ: 0.4–0.6).26 The reason Edwards et al. found less variation in case frequency may be that the study was hospital-based and all children had been diagnosed with tuberculosis on the original Edwards scale.18,26

We have also shown marked variation between hierarchical systems in the certainty of the diagnosis of tuberculosis.13,14 The evaluation of related hierarchical approaches with similar distributions (SATVI and Stoltz–Donald) by weighting Κ for concordant and discordant categories resulted in better agreement than for binary outcomes.8,16 Although hierarchical and numerical systems that share key variables, such as a positive tuberculin skin test, a positive chest radiograph, and a positive sputum culture (Stegen–Toledo, Stoltz–Donald, and SATVI) showed moderate agreement, other systems with the same common variables showed less agreement and outlying case frequencies (Kibel, Osborne).8,11,13,14,16 It follows that system structure, weighting of variables and the exact order of Boolean decision-making may be as important as the constituent variables in determining the diagnostic output of each system.

There are several other reasons for the observed variation in tuberculosis case frequency and the relatively poor agreement between diagnostic approaches. They include differences in: (i) the purpose for which the systems were developed (as a screening tool or for definitive diagnosis; for clinical management or to obtain a trial end-point); (ii) clinical setting (community or hospital); (iii) disease severity (mild or severe tuberculosis); and (iv) regional prevalence of tuberculosis and/or HIV infection (low or high). Ideally, for clinical trials a low-yielding diagnostic system should be used to minimize false positives at the expense of lower sensitivity.8 On the other hand, clinicians might prioritize sensitivity to avoid the potentially fatal consequences of underdiagnosis and delayed treatment.14,27 Therefore, approaches designed for clinical management, especially to serve as screening tools, might yield higher tuberculosis case frequencies.9,14,27 Although the SATVI trial algorithm lay in the mid-range of case frequency estimates, in the absence of a gold standard it is not possible to determine which of the nine approaches yielded the most accurate rate of tuberculosis.8 However, the proportion of children treated for tuberculosis on clinical grounds (42.3%) was almost identical to the median tuberculosis case frequency across all nine diagnostic approaches (41.7%).

The importance of context

This study was carried out in a community in which children with suspected tuberculosis were identified early, when the disease was probably mild.8 By contrast, the WHO–Harries system assigns the highest diagnostic weight to chronic illness, severe malnutrition and extra-pulmonary tuberculosis, all of which occur more frequently in hospitalized children. It is therefore not surprising that this approach yielded a low tuberculosis case frequency in our context.10 Similarly, the MASA approach, which requires the presence of the complete triad of symptoms compatible with tuberculosis, as well as a positive tuberculin skin test and a suggestive chest radiograph, is designed as a treatment guideline for hospitalized children.15 The Osborne approach, which yielded results at the upper extreme of tuberculosis case frequency, was designed in a developing country setting where the index of suspicion for tuberculosis is high. It functions best as a screening tool, since children with suspected or possible tuberculosis are not necessarily treated.11,14,16 Similarly, the Kibel system is designed to guide initial treatment decisions rather than to establish a definitive diagnosis in resource-limited settings.11,27 The Fourie system, also designed as a screening tool, yielded one of the lowest tuberculosis case frequencies, which suggests that it may be unsuitable for screening in our epidemiological setting.9 Some have noted that regional HIV prevalence may affect the performance of a particular diagnostic approach unless HIV infection status is incorporated.5,8,14 The confounding effect of HIV status on diagnostic decision-making is likely to be greatest in systems that emphasize the non-specific features of malnutrition.10 Edwards et al. noted that HIV-infected children scored higher on the Keith Edwards scale,18 a feature that would be common to the WHO–Harries approach. Consequently, the current edition of the WHO’s TB/HIV: a clinical manual no longer recommends the use of diagnostic scoring systems.10,26

Study limitations

This study has several limitations. Investigations were nested within a clinical trial that might not reflect clinical practice in developing regions. Variables were analysed in a standardized fashion that may differ from that used in the original diagnostic systems, and we acknowledge the potential limitations of Κ scores for assessing agreement. Children were younger than 2 years (an age group in which diagnostic imprecision is highest) and the findings may not be applicable to older children with a different disease spectrum. Since the study was community-based and investigations were geared towards pulmonary tuberculosis, there may have been a bias against diagnostic approaches that included features of extra-pulmonary tuberculosis. Furthermore, since all children identified by active case-finding were investigated for tuberculosis, even if they had no symptoms, the discrepancies between clinical, symptom-based and bacteriology-based systems may have been exaggerated. Structured diagnostic approaches were selected on the basis of relevance to the sub-Saharan region. Thus, four of the nine approaches were of South African origin.8,11,15,16 We acknowledge the existence of other structured approaches for diagnosing childhood tuberculosis, such as the Sant’Anna score, but they were not included in this analysis.17,28

Significance of findings

The public health significance of these findings is illustrated by the marked differences in tuberculosis case frequency and the poor agreement between diagnostic systems. Regional tuberculosis control programmes should make an informed decision to advocate a specific approach for the screening and diagnosis of childhood tuberculosis. Clearly, the study data do not support the routine, uncritical use of any particular diagnostic system for therapeutic decision-making. Some diagnostic approaches may in fact be best suited to specific settings. For example, a high-yielding system, such as Osborne, may be suitable as a screening tool, whereas the low-yielding WHO–Harries system may be most appropriate as a tool for diagnosing severe tuberculosis in regions with a low prevalence of HIV infection.


Although systems with a moderate case yield are less prone to extreme diagnostic error, the predictive value of any one system cannot be determined in the absence of a gold standard. Any structured approach to estimate tuberculosis case frequency can yield biased results if used in a way that differs from that for which it was originally designed, whether for clinical care or research purposes, screening or definitive diagnosis, mild or severe disease, or in low or high tuberculosis prevalence regions. However, in the absence of validation cohorts, there is limited evidence that these systems would have better diagnostic accuracy in their original settings. The findings of this study should not undermine confidence in existing diagnostic methods. Instead, they should encourage innovative research and critical analysis in the search for improved diagnostics for childhood tuberculosis.


We thank the staff of The South African Bacille Calmette-Guérin Trial Team for data collection; Maurice Kibel, John Burgess and Robert Gie for expert radiology review; Suzanne Verver for epidemiological support; and Lyness Matizirofa for statistical support.


The study was supported by the Aeras Global TB Vaccine Foundation, a non-profit organization that aims to develop tuberculosis vaccines.

Competing interests:

TH and LG are current and previous full-time employees of the trial sponsor. The authors have not entered into any agreements that have limited the completion of the research as planned, and they have had full control of all primary data.