Our primary purpose was to evaluate the reliability of telephone administration of the Patient-Reported Outcomes Measurement Information System (PROMIS) Upper Extremity (UE) Computer Adaptive Test (CAT) version 2.0 in a hand and upper extremity population, and secondarily to make comparisons with the abbreviated version of the Disabilities of the Arm, Shoulder, and Hand (QuickDASH).
Patients more than 1 year out from hand surgeries performed at a single tertiary institution were enrolled. Half of the patients completed telephone PROMIS UE CAT and QuickDASH surveys first, followed by computer-based surveys 1 to 10 days later, and the other half completed them in the reverse order. Telephone surveys were readministered 2 to 6 weeks later to evaluate test-retest reliability. Concordance correlation coefficients (CCCs) were used to assess agreement between telephone and computer-based scores, and intraclass correlation coefficients (ICCs) were used to assess test-retest reliability. The proportion of patients with discrepancies in follow-up scores that exceeded estimates of the minimal clinically important difference (MCID) was evaluated.
For the 89 enrolled patients, the PROMIS UE CAT CCC was 0.82 (83% confidence interval [CI], 0.77–0.86; good), which was significantly lower than the QuickDASH CCC of 0.92 (83% CI, 0.89–0.94; good to excellent). The PROMIS UE CAT ICC did not differ significantly from that of the QuickDASH (0.85 and 0.91, respectively). Differences between telephone and computer scores exceeded the MCID estimate of 5 points for the PROMIS UE CAT in 34% of patients, versus 5% of patients exceeding the MCID estimate of 14 points for the QuickDASH.
Significantly better reliability was observed for the QuickDASH than the PROMIS UE CAT when comparing telephone with computer-based score acquisition. Over one-third of patients demonstrated a clinically relevant difference in scores between the telephone and the computer-administered tests. We conclude that the PROMIS UE CAT should only be administered through computer-based methods.
These findings suggest that differences in collection methods for the PROMIS UE CAT may systematically affect the scores obtained, which may erroneously influence the interpretation of postoperative scores for hand surgery patients.
In 2004, the National Institutes of Health initiated the Patient-Reported Outcomes Measurement Information System (PROMIS) with the goal of developing and validating PROMs to assess a diverse array of health domains.
These instruments were designed to be administered to patients in digital format (desktop, laptop, or tablet computers) to allow for Computer Adaptive Testing (CAT). The CAT involves the administration of a subset of questions from a testing bank to a patient, while tailoring subsequent questions based upon responses to previous items.
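The adaptive mechanism described above can be illustrated with a toy sketch. The real PROMIS CAT selects items using item response theory; the item bank, binary scoring, and simple step rule below are invented purely for illustration:

```python
def adaptive_test(item_bank, answer, start_difficulty=3, n_items=4):
    """Toy computer-adaptive testing loop: after each response, the next
    item is chosen as harder (if the last answer indicated good function)
    or easier. `item_bank` maps difficulty levels 1-5 to items; `answer`
    scores a response to an item as 1 (success) or 0. This is a sketch,
    not the PROMIS IRT-based item-selection algorithm."""
    difficulty = start_difficulty
    administered = []
    for _ in range(n_items):
        item = item_bank[difficulty]
        score = answer(item)  # patient's response to this item
        administered.append((item, score))
        # step toward harder items after success, easier after difficulty
        difficulty = min(5, difficulty + 1) if score else max(1, difficulty - 1)
    return administered
```

A patient answering every item "well" would be routed to progressively harder items, while divergent answers early on send the patient down a different question track, which is relevant to the reliability concerns discussed later in this study.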
The PROMIS Upper Extremity (UE) CAT has recently received increasing attention in the hand and upper extremity literature. Specifically, PROMIS UE CAT scores have been demonstrated to correlate with traditional legacy scales, including the Disabilities of the Arm, Shoulder, and Hand (DASH) and its abbreviated version, the QuickDASH.
Situations exist in which verbal administration of follow-up PROMs would be advantageous. Previous studies have demonstrated greater rates of response and survey completeness using telephone-acquired PROMs, compared with other methods of administration.
Furthermore, outcome assessment for patients lacking computer access, those who are illiterate, or those who do not regularly use e-mail may be more successful by telephone. Requesting patients return to the office for collection of mid- to long-term PROMs may also be impractical. Because PROMs are often obtained via telephone in clinical research, it is important to understand the implications of changing the administration format because the long-term efficacy of various treatments may be based upon this method of collection.
Despite the advantages of telephone acquisition of PROMs, authors have cautioned against administering surveys in a format other than that for which the measure was originally designed and validated.
Our main purpose was to assess the reliability of the PROMIS UE CAT administered by telephone compared with administration using the intended computer-based format. Secondarily, we aimed to compare the reliability of the PROMIS UE CAT with that of the QuickDASH.
Materials and Methods
This institutional review board–approved study was performed at a single tertiary academic center. We identified patients who were more than 1 year after surgery performed by 1 of 5 fellowship-trained orthopedic hand surgeons (J.T.W. and J.W.C.). We excluded non–English-speaking patients, minors (<18 years), and patients undergoing nonsurgical procedures. In addition, we excluded patients indicating on a screening questionnaire that they had developed an upper extremity injury, or had undergone additional upper extremity treatment, since the listed date of their index surgery. This exclusion, together with the requirement of a minimum of 1 year of postoperative follow-up, was implemented to exclude patients who could be expected to demonstrate changes in PROMs for upper extremity function or disability. In this sample of patients meeting the selection criteria, we assumed that any changes in PROM scores between telephone surveys would be due to variation in how the patient rates their function (which would reflect instrument reliability), rather than actual variation in their function.
Postoperative patients were identified by surgical Current Procedural Terminology codes performed by the 5 surgeons. Surgical procedures, postoperative diagnoses, and baseline demographic data were verified and recorded through manual chart review. Eligible patients were mailed opt-out consent forms. After excluding those who opted out, patients were informed that study participation would involve 3 contacts for survey completion over a several-week period. E-mail addresses were confirmed, and all participants were asked the screening question regarding new upper extremity conditions or treatments. Patients were informed that their responses would be helpful in understanding how to measure their surgical outcome without specific discussion of the planned comparisons.
Remaining patients were electronically randomized to 1 of 2 study arms in blocks of 4: (1) arm 1 received the telephone survey first, followed by the computer-based survey, and (2) arm 2 received the computer-based survey first, followed by the telephone survey. Specifically, arm 1 patients were administered the PROMIS UE CAT v2.0 followed by the QuickDASH during this first contact, which was by telephone. Computer-based administration of both instruments was solicited by e-mail with a personalized link within a day after the telephone survey. The link was valid 1 to 10 days from completion of the initial telephone survey. Reminder e-mails were sent daily until completion, or until the window of eligibility had lapsed. Arm 2 patients were e-mailed a personal link for completion of the PROMIS UE CAT and QuickDASH, with reminder e-mails every other day until completed. Once this computer-based administration was completed, attempts to contact each participant by telephone to complete the same surveys verbally began the next day and continued for up to 10 days. Attempts to contact participants were made daily until completed or until outside of the eligibility window.
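Permuted-block randomization of the kind described above can be sketched as follows. This is a minimal illustration; the arm labels, seeding, and block handling are assumptions, not the study's actual randomization software:

```python
import random

def block_randomize(n_patients, block_size=4, arms=("arm1", "arm2"), seed=None):
    """Permuted-block randomization: each block of `block_size` contains an
    equal number of assignments to each arm in shuffled order, keeping group
    sizes balanced as enrollment proceeds."""
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_patients:
        # build one balanced block, e.g. ["arm1", "arm2", "arm1", "arm2"]
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n_patients]
```

With a block size of 4 and 2 arms, group sizes can never differ by more than 2 at any point during enrollment, which is the usual motivation for blocking over simple randomization.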
All participants in both study arms were then contacted by telephone between 2 and 6 weeks following the completion of their second set of surveys, as previously recommended for test-retest analysis.
Verbal administration of both the PROMIS UE CAT and QuickDASH was performed during this third contact.
Basic descriptive statistics were calculated. Continuous variables were compared using the Student t test, and categorical variables were compared using either chi-square or Fisher exact tests. Response rate was calculated as the number of patients who responded to the PROMIS UE CAT and QuickDASH for both telephone and computer-based administrations, divided by the total number of patients for whom contact was attempted.
Differences in PROMIS UE CAT and QuickDASH scores between the first 2 contacts were calculated as the first telephone contact score minus the computer-based score. The concordance correlation coefficient (CCC) was used to assess agreement between the first verbal score and the computer-based score. Ninety-five percent confidence intervals (CIs) for the CCC were calculated for all participants, and separately for arms 1 and 2. For comparisons of scores between the 2 instruments and between the 2 arms, we used 83% CIs to assess statistical significance because overlap of 95% CIs from 2 independent samples does not necessarily indicate a lack of statistical significance at the .05 level. Because the 2 sets of CIs reflect variability around 2 different sample means, narrower 83% (or 83.7%) CIs are required so that overlap of the intervals corresponds to a lack of statistical significance at the .05 level.
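The narrower-interval logic can be illustrated with a simple sketch. This example approximates an 83% CI for an ordinary correlation coefficient via the Fisher z transformation; the study's CCC intervals were presumably computed with different (bootstrap or asymptotic CCC-specific) methods, so the numbers here are illustrative only:

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.83):
    """Approximate CI for a correlation-type coefficient using the
    Fisher z transformation (illustrative sketch only)."""
    z = 0.5 * math.log((1 + r) / (1 - r))      # Fisher z of r
    se = 1 / math.sqrt(n - 3)                  # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lo, hi = z - crit * se, z + crit * se
    return tuple(math.tanh(v) for v in (lo, hi))  # back-transform to r scale

# Hypothetical illustration using the paper's point estimates and n = 89:
ci_cat = fisher_ci(0.82, 89)   # PROMIS UE CAT
ci_qd = fisher_ci(0.92, 89)    # QuickDASH
overlap = ci_cat[1] >= ci_qd[0]  # overlapping 83% CIs ~ not significant at .05
```

Here the two 83% intervals do not overlap, mirroring the paper's conclusion that the CCCs differ at the .05 level; with 95% intervals, overlap alone would not have settled the question.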
Pearson correlation coefficients were calculated to evaluate the relationship between patient age and the difference in telephone versus computer-based scores, with P values calculated using the Fisher z transformation. Univariate linear regression was used to model the change in the differences between scores at the first 2 contacts (first telephone score minus computer-based score) for both instruments. Predictor variables included study arm, age, sex, race, anesthetic type, body mass index, surgical location (ambulatory surgery center versus main operating room), diagnosis, provider, American Society of Anesthesiologists score, and marital status. Owing to the potential for unreliable estimates, we did not provide coefficients for predictor variable categories with fewer than 5 observations.
Test-retest reliability was used to assess agreement between the 2 telephone scores using intraclass correlation coefficients (ICCs). Differences between scores collected at the 2 telephone contacts were calculated as the second telephone score minus the first telephone score for both instruments. Estimates and 95% CIs were calculated for the entire cohort, and separately for arms 1 and 2. For statistical comparison of ICCs between the 2 instruments and between the 2 arms, 83% CIs were calculated.
Bland-Altman plots were created to illustrate differences between telephone and computer-based scores for both instruments. Bland-Altman plots were also created to demonstrate differences between the 2 telephone scores. The number and percentage of patients with differences in scores exceeding estimates of the minimal clinically important difference (MCID) were calculated for each plot, as has been done previously in orthopedic outcomes studies.
One-sided paired t tests and associated 95% CIs were used to determine whether the mean absolute differences in telephone versus computer-based scores, or between the pair of telephone scores, fell within the tolerance of the MCID estimate for the study population.
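The difference-based summary underlying the Bland-Altman and MCID-exceedance analysis can be sketched as follows. The patient scores below are invented for illustration and are not study data:

```python
from statistics import mean, stdev

def mcid_summary(phone, computer, mcid):
    """Per-patient differences (telephone minus computer), mean bias and SD
    of the differences (the quantities plotted in a Bland-Altman analysis),
    and the percentage of patients whose absolute difference exceeds the
    MCID estimate."""
    diffs = [p - c for p, c in zip(phone, computer)]
    n_exceed = sum(abs(d) > mcid for d in diffs)
    return {
        "mean_diff": mean(diffs),
        "sd_diff": stdev(diffs),
        "pct_exceeding_mcid": 100 * n_exceed / len(diffs),
    }

# Hypothetical scores for five patients (illustration only):
phone = [38.0, 45.2, 51.1, 33.4, 47.9]
computer = [40.5, 44.0, 44.8, 34.0, 49.1]
summary = mcid_summary(phone, computer, mcid=5.0)
```

Note that the mean difference can sit well below the MCID even when a sizable fraction of individual patients exceed it, which is exactly the pattern reported for the PROMIS UE CAT in this study.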
The CCCs and ICCs were interpreted as follows in terms of strength: poor reliability (<0.50), moderate reliability (0.50–0.75), good reliability (0.75–0.90), and excellent reliability (>0.90).
Absolute values of Pearson correlation r values were interpreted as follows in terms of strength of the association: negligible (r < 0.3), low (0.3 ≤ r < 0.5), moderate (0.5 ≤ r < 0.7), high (0.7 ≤ r < 0.9), and very high (≥0.9).
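A minimal sketch of Lin's concordance correlation coefficient and the strength bands above follows; this is illustrative only, as the study's CCC and ICC estimates came from standard statistical software:

```python
from statistics import mean

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient:
    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2).
    Unlike the Pearson r, the CCC is penalized for location or scale
    shifts between the two sets of measurements."""
    mx, my = mean(x), mean(y)
    n = len(x)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def interpret(coef):
    """Map a CCC or ICC to the strength labels used in this study."""
    if coef < 0.50:
        return "poor"
    if coef <= 0.75:
        return "moderate"
    if coef <= 0.90:
        return "good"
    return "excellent"
```

Perfect agreement (identical scores on both administrations) yields a CCC of 1; a constant offset between telephone and computer scores lowers the CCC even when the Pearson correlation remains 1.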
Of 465 reviewed patients, 112 were excluded owing to injury or additional upper extremity treatment or surgery that occurred after their index surgery. A total of 15 patients opted out (10 in arm 1, 5 in arm 2). An additional 8 patients were excluded owing to recent additional surgery (3 in each arm) or injury (2 in arm 2) after the initial survey was completed. Among the remaining 330 patients, only 89 (27%) completed all 3 sets of questionnaires. Of the 89 included patients, the mean age was 50.6 ± 15.9 years and 54 (59%) were female. Baseline characteristics for patients in arm 1 and arm 2 are provided in Table 1.
Comparison of telephone and computer-based PROMIS UE CAT scores revealed a CCC of 0.82 (83% CI, 0.77–0.88) in the moderate to good reliability range, which was significantly lower than the QuickDASH CCC of 0.92 (83% CI, 0.89–0.94) in the good to excellent range (Table 2). Given no overlap between the respective 83% CIs, the CCC for the QuickDASH should be considered significantly greater than that for the PROMIS UE CAT at the .05 significance level. Bland-Altman plots illustrating differences between telephone and computer-based scores are provided in Figure 1 (PROMIS UE CAT) and Figure 2 (QuickDASH). For the PROMIS UE CAT, 34% of patients had a score difference that exceeded an MCID estimate of 5, and for the QuickDASH, 5% had score differences beyond an MCID estimate of 14. One-sided paired t test results demonstrated that the overall differences between telephone and computer-based scores were below MCID estimates for both the PROMIS UE CAT (mean difference, 4.17; 95% CI, 0–4.99; P = .049) and the QuickDASH (mean difference, 4.76; 95% CI, 0–5.63; P < .05).
Table 2. Comparison of Telephone and Computer-Based Scores
In the PROMIS UE CAT univariate regression analysis, increasing age, American Society of Anesthesiologists class 2 and class 3, minimal anesthetic concentration anesthesia, and Medicare insurance status were significantly associated with a greater score difference between telephone and computer-based modalities (P < .05 for all comparisons; Table 3). These factors were not significant for the QuickDASH (P = .30–.95). A diagnosis of a ligament/tendon injury was significantly associated with a greater score difference for the QuickDASH (P < .05). Differences in all other baseline patient characteristics under study were not significant (all P > .05).
Table 3. Univariate Regression of Differences in Telephone and Computer-Based Scores
Test-retest results comparing the 2 telephone scores are presented in Table 4. The PROMIS UE CAT ICC was 0.85 (83% CI, 0.80–0.89) in the good to excellent range. The QuickDASH ICC was 0.91 (83% CI, 0.88–0.93), which was not significantly different, and also in the good to excellent range. Given overlap between the 83% CIs, there was no significant difference between PROMIS UE CAT and QuickDASH ICCs at the .05 significance level. Bland-Altman plots illustrating differences between the first and second telephone scores are provided in Figure 3 (PROMIS UE CAT) and Figure 4 (QuickDASH). For the PROMIS UE CAT, 29% of patients had a score difference that exceeded an MCID estimate of 5, and for the QuickDASH, 6% had score differences beyond an MCID estimate of 14. One-sided paired t test results demonstrated that the overall differences between the 2 telephone scores were below MCID estimates for both the PROMIS UE CAT (mean difference, 3.58; 95% CI, 0–4.36; P < .05) and the QuickDASH (mean difference, 4.89; 95% CI, 0–5.85; P < .05).
Table 4. Test-Retest: Comparison of First and Second Telephone Scores
Pearson correlations between age and difference in scores are provided in Table 5. The only statistically significant finding was a low correlation between increasing age and score difference between the first telephone score and the computer-based score for the PROMIS UE CAT (P < .05). This association is illustrated as a scatterplot in Figure 5.
Table 5. Correlations Between Age and Difference in Scores
The main finding of this study was that the PROMIS UE CAT demonstrated significantly lower reliability than the QuickDASH when telephone administration was compared with computer-based administration. Although the PROMIS UE CAT reliability was considered good, it is noteworthy that the difference between the telephone and the computer scores exceeded the PROMIS UE CAT MCID estimate one-third of the time (vs 5% for the QuickDASH). These findings illustrate the clinical relevance of the differing modes of survey administration.
Although test-retest reliability did not significantly differ between instruments, the difference between the 2 telephone scores for the PROMIS UE CAT exceeded an MCID estimate in 29% of cases as opposed to only 6% for the QuickDASH. As a final concern, a weak but significant correlation was observed such that increasing age was associated with greater differences between telephone and computer-based PROMIS UE CAT scores—this association was absent for the QuickDASH.
In light of these observations regarding the proportion of patients with scores exceeding a difference of an MCID estimate upon repeated surveys, it is noteworthy that applying the MCID concept to individual patients may be controversial. However, multiple studies across several orthopedic specialties have published analyses of this nature.
The optimal way to apply the MCID, and the most appropriate way to report outcome results, has yet to be determined and has recently been the focus of increased research effort. Despite these findings, it should be noted that the paired differences between repeat survey scores were less than the chosen MCID estimates for the PROMIS UE CAT and QuickDASH comparisons of telephone with computer scores, and for comparisons between the 2 telephone scores. Because the MCID for the PROMIS UE CAT version 2.0 has yet to be elucidated (eg, the chosen estimate of 5 is hypothetical), it is possible that an estimate lower than the value of 5 used here would lead to significant differences for the PROMIS UE CAT. Mean differences between score acquisitions for the PROMIS UE CAT were in the range of 3.6 to 4.2 on the paired t test analysis; therefore, our conclusion that the overall difference in scores is lower than an MCID could be reversed should new estimates arise that fall within or below these values. Notably, scores from different versions of the PROMIS UE CAT are not interchangeable, nor can their values be directly compared.
Although we could not identify a publication with PROMIS UE CAT v2.0 MCID values, one published abstract suggested this value falls in the range of 3.0 to 4.1 (Kazmers et al, presented at the 74th Annual Meeting of the American Society for Surgery of the Hand, 2020). This range of values, obtained in a general hand and upper extremity population similar to that of the current study, is interestingly below the paired differences between computer and telephone scores. It is possible that the differences between telephone and computer scores observed in the current study actually exceed that MCID estimate, which raises further concerns about telephone acquisition of the PROMIS UE CAT. In contrast, paired differences for the QuickDASH were in the range of 4.8 to 4.9, which is subjectively far from even the lower of the MCID estimates in the literature (6.8–19).
Although further investigation is warranted, we hypothesize that a negative trade-off of the PROMIS UE CAT's brevity could be slightly lower reliability. That is, if a patient completes a second survey and provides a slightly different response to even the same initial question, the computer adaptive algorithm may place that patient on a different track of subsequent questions, which may pertain to higher or lower overall function than the track of questioning for their initial survey. It should be noted that this is a hypothesis, and formally commenting upon it is beyond the scope of the study results.
Although our findings for the PROMIS UE CAT are unique, our findings with regard to the QuickDASH may be compared with those of prior literature. London et al compared telephone administration of the QuickDASH with its written version and reported reliability that is subjectively similar to the value of 0.92 and 95% CI spanning the good to excellent range in the current study. Similarly, the authors did not observe a correlation between patient age and difference in QuickDASH scores. Minor differences between the studies should be noted, however. The current study differs in its use of CCCs (although they are very similar to the ICCs) and by comparing telephone scores with computer-based scores rather than with the written version of the QuickDASH. London et al
observed good test-retest reliability, as opposed to good to excellent reliability noted in the current study for the QuickDASH. Similarly, the ICCs for test-retest reliability observed in the current study were subjectively greater (0.91 vs 0.68). It is possible that our shorter time frame between the test and the retest administrations (mean, 20 days vs 5 months), or comparison with computer-based scores rather than those from paper forms, could contribute to these differences. Nonetheless, as in the current study, the authors concluded that these differences were not clinically relevant.
Although the QuickDASH demonstrated superior performance over the PROMIS UE CAT in certain aspects of reliability, both telephone and computer-based acquisition methods may potentially be suitable for both instruments to be utilized in clinical research. Frost et al
have recommended a minimum reliability threshold of 0.70 for an instrument used for clinical trial outcomes, and both instruments exceeded this threshold when comparing telephone with computer-based scores and for test-retest reliability. Furthermore, other studies have evaluated the reliability of select PROMIS instruments with favorable results, but we were unable to locate prior studies on the PROMIS UE CAT for comparison. Deyo et al
evaluated multiple modes of administration (interactive voice response, paper form, hand-held computer, or personal computer) for 8-item short forms derived from the PROMIS Physical Function, Fatigue, and Depression item banks. High levels of reliability (average ICC of 0.90) between different modes of administration were noted with no clinically relevant effect attributed to the mode of administration within a score tolerance of ±2 points. Broderick et al
collected several PROMIS instruments (Pain Intensity, Pain Interference, Physical Function, and Fatigue) in CAT and short-form formats for a large cohort of patients with osteoarthritis. Good to excellent test-retest performance for both modes of administration were observed, and less than 25% of patients had differences in CAT and short-form scores that exceeded MCID estimates. For a sample of spinal cord or traumatic brain injury patients, Kisala et al
observed no effect of mode of administration for multiple PROMIS fixed-length short forms (Physical Function, Fatigue, Pain Interference, Anger, Anxiety, Depression), which were administered by an interviewer (in-person form or telephone) or by a computer-based format. Lastly, Magnus et al
observed similarity between telephone and computer-based PROMIS Depressive Symptoms, Fatigue, and Mobility short-form scores in a pediatric population.
Our study limitations deserve mention. Although our findings should not be generalized to all PROMIS instruments or all study populations, we agree with the notion that nonstandard administration methods should be evaluated before clinical or research use.
Our study sample's age should be considered when interpreting the results because it is possible that the results could differ in a younger population more comfortable with technology. Our study did not investigate the reasons why the PROMIS UE CAT performs less favorably than the QuickDASH. It is unclear whether a higher response rate would affect our results. It remains possible that the advantage of fewer questions for PROMIS UE CAT completion could decrease precision or reliability, although future work is needed to elucidate and potentially improve upon this. In addition, it remains unclear what level of reliability is acceptable, which makes it more challenging to interpret the study results with regard to reliability between different modes of survey administration. Our finding that patients with diagnoses related to tendon/ligament pathology demonstrated a greater score difference between telephone and computer administration of the QuickDASH on univariate analysis likely represents alpha error (that is, erroneously concluding there is a difference when one does not actually exist), or possibly reflects that these patients were older. However, this study was not designed to specifically evaluate reliability of outcome collection between diagnostic categories. Lastly, it is unclear whether our response rate of 26% to 28% had an impact on the study findings because it is possible that responders and nonresponders differed in terms of treatment or patient factors.
In conclusion, the QuickDASH and PROMIS UE CAT demonstrated similar test-retest reliability for telephone administration, in the good to excellent range. However, the QuickDASH demonstrated significantly better reliability than the PROMIS UE CAT when comparing telephone with computer-based administration. In addition, we observed a high rate of clinically important differences with the PROMIS UE CAT in both the verbal test-retest and the telephone versus computer portions of this study. Based on these findings, we recommend that the PROMIS UE CAT be administered only in a computer-based format.
This investigation was supported by the University of Utah Study Design and Biostatistics Center, with funding in part from the National Center for Research Resources and the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant UL1TR002538.