
Inter- and Intrarater Reliability of Osteoarthritis Classification at the Trapeziometacarpal Joint

Published: October 14, 2014. DOI: https://doi.org/10.1016/j.jhsa.2014.09.007

      Purpose

      To assess the reliability of the Eaton and Glickel classification for base of thumb osteoarthritis.

      Methods

      The interrater and intrarater reliability of this classification were assessed by comparing ratings from 6 raters using quadratic weighted kappa scores.

      Results

Median interrater reliability (quadratic weighted kappa) ranged from .53 to .54; intrarater reliability ranged from .60 to .82. Using unweighted kappa, interrater agreement was "slight" and intrarater agreement was "fair." Overall, the intraclass correlation coefficient for all 6 raters was .56.

      Conclusions

This radiological classification does not describe all stages of carpometacarpal joint osteoarthritis accurately enough to permit reliable and consistent communication between clinicians. Therefore, we believe it should be used with an understanding of its limitations when communicating disease severity between clinicians or as a tool to assist in clinical decision making.

      Type of study/level of evidence

      Diagnostic III.


Trapeziometacarpal (TMC) joint osteoarthritis (OA) is a common clinical problem, most frequently encountered in the dominant hand and in women. Various classification systems exist to facilitate communication between clinicians regarding TMC joint OA disease stage and treatment options. The Burton classification takes into account clinical and radiographic findings. Eaton and Littler devised a purely radiological classification system for staging the severity of OA at the TMC joint. Eaton and Glickel later refined their classification to include scaphotrapezial joint degeneration. This refined classification system is the most commonly used scale in clinical practice.
A number of studies, including those by Kubik and Lubahn and by Spaans et al, have investigated the interrater and intrarater reliability of this classification system. These studies used unweighted kappa scores, whereas we believe the kappa scores should be weighted. The Eaton and Glickel classification has ordinal categories that reflect increasing levels of joint disease. Disagreement by 1 scale point (eg, grade 1 vs grade 2) is less serious than disagreement by 2 scale points (eg, grade 1 vs grade 3). To reflect the degree of disagreement, kappa can be weighted so that it attaches greater emphasis to large differences between ratings than to small differences. Weighted kappa penalizes disagreements in proportion to their seriousness, whereas unweighted kappa treats all disagreements equally; unweighted kappa is therefore inappropriate for ordinal scales. Through the use of weighted kappa, as described by Sim and Wright, we believe that the levels of interrater and intrarater agreement can be assessed more accurately than in previous studies.
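To make the weighting scheme concrete, the following is a minimal sketch of how a quadratic weighted kappa might be computed for 2 raters grading the same radiographs on the 5-point Eaton and Glickel scale (grades 0–4). The function name and the example ratings are illustrative assumptions, not study data or the authors' own code.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_grades=5):
    """Quadratic weighted kappa for 2 raters scoring the same items on an
    ordinal scale 0..n_grades-1 (eg, Eaton and Glickel grades 0-4)."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    # Observed joint distribution of the 2 raters' grades
    obs = np.zeros((n_grades, n_grades))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()
    # Expected joint distribution under chance, from the marginal distributions
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    # Quadratic disagreement weights: a 2-grade discrepancy is penalized
    # 4 times as heavily as a 1-grade discrepancy
    grades = np.arange(n_grades)
    w = (grades[:, None] - grades[None, :]) ** 2 / (n_grades - 1) ** 2
    return 1 - (w * obs).sum() / (w * exp).sum()

# Hypothetical ratings for 6 radiographs (not study data)
print(quadratic_weighted_kappa([0, 1, 2, 3, 4, 2], [0, 2, 2, 3, 4, 0]))
```

With unweighted kappa, every off-diagonal cell would carry the same weight, so a 2-grade disagreement would count no more heavily than a 1-grade disagreement.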

      Methods

Six raters, comprising 3 hand surgery consultants and 3 registrars in hand surgery, participated in the study. All 3 consultants were plastic surgeons with a subspecialty interest in hand and upper limb surgery who frequently treat this condition. The 3 registrars comprised a year-3 trainee, a year-4 trainee, and a senior fellow in hand surgery. This permitted comparisons of agreement to be drawn within individuals, within the consultant and registrar groups, and between these 2 groups. The purpose of making comparisons both within and between groups of doctors of different grades was to identify whether greater experience in interpreting TMC joint radiographs correlated with better grading reliability. Quadratic weighted kappa was used to assess the correlation between 2 ratings at a time, which allowed interrater and intrarater comparisons to be made. Because quadratic weighted kappa does not extend simply to more than 2 raters, the intraclass correlation was used to assess agreement across all raters. The justification for using intraclass correlation coefficients was that the calculated differences between quadratic weighted kappa and intraclass correlation coefficients were less than .005; as shown by Fleiss and Cohen, calculating weighted kappa using quadratic weights is virtually the same as calculating intraclass correlation coefficients provided the sample size is large enough, as it was in this study.
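As a rough illustration of the equivalence described by Fleiss and Cohen, the sketch below computes a single-measures, absolute-agreement intraclass correlation coefficient (one common form, often written ICC(2,1)) from a subjects-by-raters grid of grades; we assume this is the type of coefficient intended, and the function name and example grid are illustrative only.

```python
import numpy as np

def icc_single_rater(ratings):
    """Single-measures, absolute-agreement ICC (the ICC(2,1) of Shrout and
    Fleiss) from an (n_subjects x k_raters) array of ordinal grades."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # one mean grade per radiograph
    col_means = x.mean(axis=0)   # one mean grade per rater
    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((col_means - grand) ** 2).sum() / (k - 1)
    ms_err = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical 6-radiograph x 2-rater grid (not study data)
grid = np.array([[0, 0], [1, 2], [2, 2], [3, 3], [4, 4], [2, 1]])
print(icc_single_rater(grid))
```

For 2 raters and a reasonably large sample, this value and the quadratic weighted kappa of the same ratings should agree to within a few thousandths, which is the property the study relies on.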
Each rater was asked to score base of thumb radiographs from 52 patients arranged into a slideshow. The sample size was calculated from the width of the confidence interval for the intraclass correlation, using formulas devised by Shoukri et al. A sample size of 52 was sufficient for the width of the 95% confidence interval to be .25 or less if the intraclass correlation was greater than .5, and .2 or less if the intraclass correlation was greater than .68.
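The sample-size argument rests on the expected width of the 95% confidence interval for the intraclass correlation. One hedged way to check such a width empirically, without reproducing the closed-form results of Shoukri et al, is a percentile bootstrap that resamples subjects; the sketch below reuses the icc_single_rater helper from the previous example, and the simulated 52 x 6 grid is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_icc_ci(ratings, n_boot=2000, alpha=0.05):
    """Percentile bootstrap 95% CI (and its width) for the single-rater ICC,
    resampling radiographs (rows). Relies on icc_single_rater defined above."""
    x = np.asarray(ratings, dtype=float)
    n = x.shape[0]
    boot = [icc_single_rater(x[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi, hi - lo  # the width is the quantity the sample size targets

# Simulated 52-radiograph x 6-rater grid with a moderate underlying ICC
true_grade = rng.integers(0, 5, size=52)
noise = rng.integers(-1, 2, size=(52, 6))          # rater disagreement of at most 1 grade
simulated = np.clip(true_grade[:, None] + noise, 0, 4)
print(bootstrap_icc_ci(simulated))
```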
Because the radiographs were de-identified, our institution did not require institutional review board approval. At the outset, a presentation explaining the scale to be evaluated was given to the raters. Radiographs for each grade were shown, highlighting the pertinent features that distinguish them. Each rater's understanding of the scale was checked before proceeding with the slideshow.
The posteroanterior and lateral radiographs included a spectrum of images, ranging from normal joints to stage 4 disease, with at least 8 of each grade. The radiographs had been obtained from 23 men and 29 women with a mean age of 58 years (range, 35–82 y). The distribution of right to left TMC joint radiographs was 30:22. The images were randomly selected over a 6-month period from patients known to have TMC joint OA and from patients with normal joints who had had appropriate radiographs taken for other reasons.
      The slideshow was shown to all raters simultaneously to ensure each rater spent the same amount of time analyzing each patient’s radiographs. After an interval of one week, the same raters were shown the same radiographs arranged in a different order and asked to score them again. Raters were asked not to confer with each other regarding the scoring of the radiographs to reduce bias.

      Results

The median results and ranges for the quadratic weighted kappa scores in the different clinician groups are shown in Table 1, which also includes an overall score for all 6 raters.
Table 1. Interrater Reliability Between Different Clinician Groups

Clinician Group              Median Score    Range
Consultant vs consultant     0.55            0.51–0.66
Consultant vs registrar      0.54            0.49–0.65
Registrar vs registrar       0.53            0.50–0.60
Overall score: all raters    0.56            N/A

Median quadratic weighted kappa scores (interrater reliability) between different clinician groups and overall score between all raters (intraclass correlation coefficient).
N/A, not applicable.
To allow comparison with previous studies, we also derived unweighted kappa scores from our data; these are shown in Table 2.
Table 2. Comparison With Previous Studies

Study                             Year    Observers                                      No. Patients    Intrarater Reliability    Interrater Reliability
Kubik and Lubahn                  2002    3 hand surgeons, 3 orthopedic residents        40              .66                       .53
Dela Rosa et al                   2004    6 hand surgeons                                40              .59–.61                   .37–.56
Spaans et al                      2011    5 radiologists, 8 hand surgeons                40              NA                        .50
Hansen et al                      2013    2 hand surgeons, 2 residents, 1 radiologist    43              .54                       .11
Choa and Giele (current study)    2014    3 hand surgeons, 3 hand surgery registrars     52              .37 (.17–.53)             .22 (.05–.38)

Comparison with previous studies on intra- and interrater reliability of the Eaton classification using unweighted kappa.
NA, not available.
We calculated quadratic weighted kappa scores for each individual clinician by comparing their first and second sets of ratings; these scores correspond to intrarater reliability and are shown in Table 3.
Table 3. Individual Intrarater Reliability

Clinician       Intrarater Reliability
Consultant 1    .82
Consultant 2    .78
Consultant 3    .76
Registrar 1     .60
Registrar 2     .74
Registrar 3     .77

Individual quadratic weighted kappa scores (intrarater reliability).
Additional analyses were undertaken to evaluate whether the raters showed greater reliability when scoring grades at the extremes of the scale (ie, grades 0, 1, and 4) than when scoring those in the middle (grades 2 and 3). These analyses are presented in Table 4.
Table 4. Reliability of Grading Radiographs at the Extremes of the Scale (Grades 0, 1, and 4) Compared With the Central Grades (2 and 3)

Rater           Group 1 (Grades 0, 1, 4)    Group 2 (Grades 2, 3)
Consultant 1    .84                         .78
Consultant 2    .85                         .68
Consultant 3    .79                         .54
Registrar 1     .67                         .36
Registrar 2     .75                         .70
Registrar 3     .80                         .69

      Discussion

Quadratic weighted kappa scores are valuable because they demonstrate the proportion of the total variability in the grades that is due to variability in the rated radiographs. If every radiograph were given the same grade by rater 1 and every radiograph were given the same grade by rater 2, but the grades given by the 2 raters were different, none of the variability in the grades would be due to variability in the radiographs, so the quadratic weighted kappa score would be 0 because the raters were in total disagreement.
Conversely, if the radiographs were not all given the same grade by rater 1 and rater 2 gave the same grade as rater 1 in every case, all of the variability in the grades would be due to variability in the radiographs, so the quadratic weighted kappa score would be 1 because the raters were in total agreement.
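These 2 boundary cases can be checked directly with the quadratic weighted kappa sketch given earlier; the ratings below are contrived examples, not study data.

```python
# Rater 1 grades every radiograph 1 and rater 2 grades every radiograph 3:
# none of the variability comes from the radiographs, so kappa is 0
print(quadratic_weighted_kappa([1] * 6, [3] * 6))                        # -> 0.0

# The grades vary across radiographs and the raters agree on every one,
# so all variability comes from the radiographs and kappa is 1
print(quadratic_weighted_kappa([0, 1, 2, 3, 4, 2], [0, 1, 2, 3, 4, 2]))  # -> 1.0
```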
The overall score of .56 across all the raters means that approximately half of the total variability in the grades is due to variability in the radiographs being graded, with the remainder due to the raters themselves. We interpret this as inadequate agreement for the grading system to be deemed useful.
The anticipated biases in this study were those involving study design. To reduce this type of bias, a validated outcome measure would normally be used, but because the Eaton scale itself was being assessed, this was not possible. It is conceivable that the participants understood the nature of the study or had preconceptions about the applicability of the Eaton classification, which could have led to bias when scoring the radiographs. The participants, however, were not asked to comment on the scale themselves. It is difficult to control for such preconceptions; nevertheless, by asking participants not to confer with each other, these sources of bias may have been minimized.
It is not surprising that intrarater reliability was better than interrater reliability, because the same person is likely to rate the same radiograph with the same score. In general, consultants had higher levels of intrarater reliability than registrars, possibly reflecting greater experience in interpreting radiographs within the allotted time.
All raters graded the radiographs at the extremes of the rating scale more reliably than those in the middle (Table 4). Although the scoring of the radiographs at the extremes of the scale was not perfect, the reduced reliability of scoring grades 2 and 3 appears to affect the reliability of the scale to a greater extent.
Previous studies, including those by Kubik and Lubahn and by Spaans et al, used various scales for categorizing their unweighted kappa scores to determine the strength of inter- and intrarater agreement. The Landis and Koch scale is one of these, assigning the following descriptive terms for strength of agreement to various kappa scores: less than .00 = poor; .00 to .20 = slight; .21 to .40 = fair; .41 to .60 = moderate; .61 to .80 = substantial; and .81 to 1.00 = almost perfect.
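For reference, these cutoffs translate directly into a simple lookup, as in the sketch below; the function name is ours, and the thresholds are exactly those quoted above.

```python
def landis_koch_label(kappa):
    """Descriptive strength-of-agreement label for a kappa value,
    using the Landis and Koch cutoffs quoted in the text."""
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch_label(0.53))  # -> "moderate"
```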
The problem with using the type of scale proposed by Landis and Koch is that the categories are arbitrary, and the effects of prevalence and bias on kappa must be considered when judging its magnitude, as discussed by Sim and Wright and by Brennan and Silman.
      We therefore did not apply a rating to our quadratic weighted results. Instead, we believe that the interrater reliability is inadequate and the intrarater reliability only marginally better.
A systematic review of intra- and interrater reliability of the Eaton classification by Berger et al compared 4 studies and found interrater reliability to be poor to fair, with intrarater reliability being fair to moderate. The system used to classify these amalgamated results was not specified, however.
      When we used the Landis and Koch scale on our unweighted kappa values, there was slight to fair interrater agreement between all the clinician groups. Furthermore, the intrarater reliabilities ranged from slight to moderate.
We believe that quadratic weighted kappa scores are more appropriate for testing inter- and intrarater reliability for this type of clinical test with multiple ordered categories. Regardless of the type of test used, the level of interrater agreement ranged from slight to moderate when a scale such as the one described previously was applied to the unweighted kappa data. Given this relatively low level of interrater reliability, we do not believe this classification should be used to communicate disease severity or as a tool to assist in clinical decision making unless the limitations of the test are fully understood. Furthermore, we do not believe that the decision to treat patients should be based on a radiographic system alone, given that some patients with severe radiographic findings are asymptomatic and vice versa. Instead, the decision to treat should be based on a combination of clinical and radiographic findings.

      References

• Burton RI. Basal joint arthrosis of the thumb. Orthop Clin North Am. 1973;4:331-338.
• Eaton RG, Littler JW. Ligament reconstruction for the painful thumb carpometacarpal joint. J Bone Joint Surg Am. 1973;55:1655-1666.
• Eaton RG, Glickel SZ. Trapeziometacarpal osteoarthritis: staging as a rationale for treatment. Hand Clin. 1987;3:455-471.
• Kubik NJ, Lubahn JD. Intra-rater and inter-rater reliability of the Eaton classification of basal joint arthritis. J Hand Surg Am. 2002;27:882-885.
• Spaans AJ, van Laarhoven CM, Schuurman AH, van Minnen LP. Inter-observer agreement of the Eaton–Littler classification system and treatment strategy of thumb carpometacarpal joint osteoarthritis. J Hand Surg Am. 2011;36:1467-1470.
• Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther. 2005;85:257-268.
• Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas. 1973;33:613-619.
• Shoukri MM, Asyali MH, Donner A. Sample size requirements for the design of reliability study: review and new results. Stat Methods Med Res. 2004;13:251-271.
• Dela Rosa TL, Vance MC, Stern PJ. Radiographic optimization of the Eaton classification. J Hand Surg Br. 2004;29:173-177.
• Hansen TB, Sørensen OG, Kirkeby L, Homilius M, Amstrup AL. Computed tomography improves intra-observer reliability, but not the inter-observer reliability of the Eaton–Glickel classification. J Hand Surg Eur Vol. 2013;38:187-191.
• Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.
• Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. BMJ. 1992;304:1491-1494.
• Berger AJ, Momeni A, Ladd AL. Intra- and inter-observer reliability of the Eaton classification for trapeziometacarpal arthritis. Clin Orthop Relat Res. 2014;472:1155-1159.