Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) Checklist

Correspondence: w.mokkink@vumc.nl. Department of Epidemiology and Biostatistics and the EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands. Full list of author information is available at the end of the article.
© 2010 Mokkink et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background: The COSMIN checklist is a tool for evaluating the methodological quality of studies on measurement properties of health-related patient-reported outcomes. The aim of this study was to determine the inter-rater agreement and reliability of each item score of the COSMIN checklist (n = 114).

Methods: 75 articles evaluating measurement properties were randomly selected from the bibliographic database compiled by the Patient-Reported Outcome Measurement Group, Oxford, UK. Raters were asked to assess the methodological quality of three articles, using the COSMIN checklist. In a one-way design, percentage agreement and intraclass kappa coefficients or quadratic-weighted kappa coefficients were calculated for each item.

Results: 88 raters participated. Of the 75 selected articles, 26 were rated by four to six participants, and 49 by two or three participants. Overall, percentage agreement was appropriate (68% of the items had more than 80% agreement), and the kappa coefficients for the COSMIN items were low (61% were below 0.40; 6% were above 0.75). Reasons for low inter-rater agreement were the need for subjective judgement and raters being accustomed to different standards, terminology, and definitions.

Conclusions: The results indicated that raters often choose the same response option, but that at item level it is difficult to distinguish between articles. When using the COSMIN checklist in a systematic review, we recommend obtaining some training and experience, having the checklist completed by two independent raters, and reaching consensus on one final rating. The instructions for using the checklist have been improved.

Background
Recently, a checklist for the evaluation of the methodological quality of studies on measurement properties of health-related patient-reported outcomes (HR-PROs) - the COSMIN checklist - was developed in an international Delphi study [1]. COSMIN is an acronym for COnsensus-based Standards for the selection of health status Measurement INstruments. This checklist can be used for the appraisal of the methodological quality of studies included in a systematic review of measurement properties of HR-PROs. It can also be used to design and report a study on measurement properties. In addition, reviewers and editors could use it to identify shortcomings in studies on measurement properties, and to assess whether the methodological quality of such studies is high enough to justify publication.
The COSMIN checklist contains twelve boxes [1]. Ten boxes can be used to assess whether a study meets the standards for good methodological quality (each containing 5-18 items). Nine of these boxes contain the standards for the measurement properties considered (internal consistency (box A), reliability (box B), measurement error (box C), content validity (box D), structural validity (box E), hypotheses testing (box F), cross-cultural
validity (box G), criterion validity (box H), and responsiveness (box I)), and one box contains standards for studies on interpretability (box J). In addition, one box (IRT box) contains requirements for articles in which Item Response Theory (IRT) methods are applied (4 items), and one box (Generalisability box) contains requirements for the generalisability of the results (8 items).
It is important to assess the quality of the COSMIN checklist itself. For example, it is important that different researchers who use the COSMIN checklist to rate the same article give the same ratings on each item. Therefore, the aim of this study was to determine the inter-rater agreement and reliability of each item score of the COSMIN checklist among potential users.

Methods
Because the COSMIN checklist will be applied in the future to a variety of studies on different topics and study populations, of both low and high quality, it was our goal to generalise the results of this study to a broad range of articles on measurement properties. In addition, the COSMIN checklist will be used by many researchers, using the instructions in the COSMIN manual as guidance. We were interested in the inter-rater agreement and reliability in this situation. Often, only a selection of measurement properties is evaluated in an article; consequently, only parts of the COSMIN checklist can be completed. We arbitrarily decided in advance that (1) we aimed for four ratings of each item of the COSMIN checklist on the same article, and (2) we aimed for each measurement property to be evaluated in at least 20 different articles. This was done to increase the representativeness of studies and raters.

Article selection
In this study we included articles that were representative of studies on measurement properties. We selected articles from the bibliographic database compiled by the Patient-Reported Outcome Measurement (PROM) Group, Oxford, UK (http://phi.uhce.ox.ac.uk). The bibliography includes evaluations of PROs with information about psychometric properties and operational characteristics, and applications where, for example, a PRO has been used in a trial as a primary or secondary endpoint. The online PROM database comprises records downloaded from several electronic databases using a comprehensive search strategy (details available on request).
The selection of articles for this study was a two-step procedure. First, for each of the 30,000+ included articles it was determined, based on the title, whether it concerned a study on the evaluation of measurement properties of a PRO; for example, the title included terms for a specific measurement property, such as reliability, validity, or responsiveness. A total of 5137 articles were eligible. Second, from these articles, we randomly selected studies that fulfilled our inclusion criteria.
Inclusion criteria were:
- the purpose of the study was to evaluate one or more measurement properties;
- the instrument under study was a HR-PRO instrument;
- English language publication.
Articles from any setting and any population could be included, and articles could have used Classical Test Theory (CTT) or modern test theory (i.e. Item Response Theory (IRT)) or both.
Exclusion criteria were:
- systematic reviews, case reports, letters to editors;
- studies that evaluated construct validity of two or more instruments at the same time by correlating the scores of the instruments with each other, without indicating one of the instruments as the instrument of interest. In these studies, it is unclear of which instrument the construct validity is being assessed.
One of the authors (LM) selected articles until each measurement property was assessed in at least 20 articles. It appeared that we needed to select 75 articles. For each included article, LM determined the relative workload for a rater to evaluate the methodological quality of the article, i.e. high, moderate, or low workload. The relative workload was based on the number of measurement properties assessed in the study, the number of instruments that were studied, the number of pages, and whether IRT was used. For example, an article in which IRT is used was considered to have a high workload, and an article in which three measurement properties were evaluated in a four-page paper was considered to have a low workload. We decided to ask each rater to evaluate three articles. We provided each rater with one article with a low workload, one with a moderate workload, and one with a high workload.

Selection of participants
Raters were professionals who had some experience with assessing measurement properties. This could range from having little experience to being an expert. We chose to select a heterogeneous group of raters, because this best reflects the raters who will potentially use the COSMIN checklist in the future. We invited the international panel of the COSMIN Delphi study [1] to participate in the inter-rater agreement and reliability study (n = 91), attendees of two courses on clinimetrics given in 2009 by the department of Epidemiology and Biostatistics of the VU University Medical Center (n = 72), researchers on the mailing list of the Dutch chapter of the International Society for Quality of Life Research (ISOQOL-NL) (n = 295), members of the EMGO Clinimetrics working group (n = 32), members of the PRO Methods Group of the Cochrane Collaboration (n = 79), researchers who had previously shown interest in the COSMIN checklist (n = 15), colleagues of the authors, and other researchers who were likely to show interest. We also asked these people whether they knew other researchers who were interested in participating.

Procedure
Those who agreed to participate received three selected articles, together with a manual of the COSMIN checklist [2] and a data collection form to enter their scores. For each article, they were asked to follow all the COSMIN evaluation steps. Step 1: to indicate, for each measurement property, whether it was evaluated in the article ('yes'/'no'). The participants had to determine themselves which boxes they should complete for each of the three papers. Step 2: they were asked whether IRT was used in the article, and if so, they were asked to complete the IRT box. Step 3: they were asked to complete the relevant boxes of the COSMIN checklist. Step 4: raters were asked to complete the Generalisability box for each measurement property assessed in the article.
Instructions on how to complete the boxes were provided in the COSMIN manual [2]. Raters did not receive any additional training in completing the COSMIN checklist and were not familiar with the checklist. Items could be answered with "yes"/"no", with "yes"/"?"/"no", or with "yes"/"no"/"not applicable" ("na"). One item had four response options, i.e. "yes"/"?"/"no"/"na".

Statistical analyses
Each rater scored three of the 75 selected articles, and in each article a selection of the measurement properties was evaluated. Therefore, we analyzed each COSMIN item score using a one-way design.
We calculated percentage agreement for each item. This measure indicates how often raters who rated the same items on the same articles chose the same response category. We considered the highest number of similar ratings per item per article as agreement, and the other ratings as non-agreement. For example, if five raters rated the same item for the same article, and three of the raters rated 'yes' and two rated 'no', we considered three ratings as agreement. Percentage agreement was calculated as the number of ratings with agreement on all articles, divided by the total number of ratings on all articles for which that measurement property was assessed. A percentage agreement above 80% was considered appropriate (arbitrarily chosen).
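As an illustration (not part of the original study; the function name and example data are ours), the short Python sketch below computes this percentage agreement for a single COSMIN item. Articles scored by only one rater are skipped, as described above.

from collections import Counter

def percentage_agreement(ratings_by_article):
    """Percentage agreement for one COSMIN item in a one-way design.

    ratings_by_article maps an article id to the list of ratings
    ('yes', 'no', '?', 'na') given by the raters of that article.
    """
    agree, total = 0, 0
    for ratings in ratings_by_article.values():
        if len(ratings) < 2:
            continue  # articles scored by a single rater are not taken into account
        counts = Counter(ratings)
        agree += max(counts.values())  # the largest group of identical ratings counts as agreement
        total += len(ratings)
    return 100.0 * agree / total

# Example from the text: five raters, three 'yes' and two 'no' -> 3 of 5 ratings agree
print(percentage_agreement({"article 1": ["yes", "yes", "yes", "no", "no"]}))  # 60.0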
In addition, we calculated the reliability of the items using kappa coefficients. This is a measure that indicates how well articles can be distinguished from each other based on the given COSMIN item score. Dichotomous items were analysed using intraclass kappa coefficients [3]; the scoring was yes = 1 and no = 0:

  \kappa_{\mathrm{COSMIN\ item}} = \frac{\sigma^2_{\mathrm{article}}}{\sigma^2_{\mathrm{article}} + \sigma^2_{\mathrm{error}}},

where \sigma^2_{\mathrm{article}} denotes the variance due to systematic differences between the articles for which the item was scored, and \sigma^2_{\mathrm{error}} denotes the random error variance.
Ordinal items were analyzed with weighted kappa coefficients using quadratic weights; the scoring was 'yes' = 1, '?' = 2, and 'no' = 3. (Note that the order of the response options in the COSMIN checklist is yes/no/?.) These measures are numerically the same as intraclass correlation coefficients (ICCs) obtained from analysis of variance (ANOVA) [4-6].
Twenty-two items could be answered with "na", which makes the scale of these items a multi-categorical nominal scale. For these items, we calculated kappas after all possible dichotomizations. For example, item A9 has three response options, i.e. 'yes', 'no', and 'na'. This item was dichotomized three times, i.e. into yes = 1 and not yes = 0 (dummy variable 1), into no = 1 and not no = 0 (dummy variable 2), and into na = 1 and not na = 0 (dummy variable 3). Next, the variance components for the intraclass kappa were calculated for each dummy variable, and a summary intraclass (SI) kappa was calculated using the formula

  \kappa_{SI,\ \mathrm{COSMIN\ item}} = \frac{\sum_i \sigma^2_{\mathrm{article}(i)}}{\sum_i \left( \sigma^2_{\mathrm{article}(i)} + \sigma^2_{\mathrm{error}(i)} \right)},

where the sums run over the dummy variables i. The numerator reflects the variance due to the article, and the denominator reflects the total variance. In case a variance component was negative, we set the variance at zero.
Since we do not calculate overall scores per box, we only calculated kappa coefficients per COSMIN item. We considered a kappa below 0.40 as poor, between 0.40 and 0.75 as moderate to good, and above 0.75 as excellent [6].
Reliability measures such as kappa are dependent on the distribution of the data (\sigma^2_{\mathrm{article}}). Vach showed that reliability measures are low when data are skewed [7]. We considered a distribution of scores as skewed when more than 75% of the raters who responded to an item used the same response category. Percentage agreement is not dependent on the distribution of the data.
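The Python sketch below shows one way to obtain these quantities from a one-way layout (articles as groups, raters nested within articles). It is a minimal illustration written by us, not the authors' code: intraclass_kappa applies the variance-component formula above to dichotomous items coded 0/1; applying the same function to ordinal items coded yes = 1, ? = 2, no = 3 gives the quadratic-weighted kappa via the ICC equivalence mentioned in the text; and summary_intraclass_kappa pools the components over all dichotomizations of a nominal item. It assumes at least two articles and more ratings than articles.

def one_way_components(scores_by_article):
    """Between-article and error variance components from a one-way ANOVA.

    scores_by_article maps an article id to the list of numeric scores
    given by the raters of that article (unbalanced groups are allowed).
    """
    groups = list(scores_by_article.values())
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective group size for an unbalanced one-way design
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)
    var_article = max((ms_between - ms_within) / n0, 0.0)  # negative components set to zero
    return var_article, ms_within

def intraclass_kappa(scores_by_article):
    """sigma2_article / (sigma2_article + sigma2_error); for ordinal items coded
    1/2/3 this equals the quadratic-weighted kappa in a one-way design."""
    var_article, var_error = one_way_components(scores_by_article)
    total = var_article + var_error
    return var_article / total if total > 0 else 0.0

def summary_intraclass_kappa(scores_by_article, categories=("yes", "no", "na")):
    """SI kappa for nominal items: dichotomize each response category,
    then pool the variance components over the dummy variables."""
    numerator = denominator = 0.0
    for category in categories:
        dummies = {a: [1.0 if r == category else 0.0 for r in ratings]
                   for a, ratings in scores_by_article.items()}
        var_article, var_error = one_way_components(dummies)
        numerator += var_article
        denominator += var_article + var_error
    return numerator / denominator if denominator > 0 else 0.0

# Example with hypothetical ratings for item A9 on three articles
print(summary_intraclass_kappa({
    "article 1": ["yes", "yes", "na"],
    "article 2": ["no", "no", "no"],
    "article 3": ["yes", "yes", "yes"],
}))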
In our analysis we combined the scores of the items of the Generalisability box over all measurement properties, so that we calculated percentage agreement and kappa coefficients only once for each of the items from this box, and not separately for each measurement property.

Results
A total of 154 raters agreed to participate in this study. We received ratings from 88 (57%) of the participants. The responders came from the Netherlands (58%), Canada (10%), the UK (7%), Australia or New Zealand (6%), Europe other than the Netherlands and the UK (15%), and other countries (5%). The mean number of years of experience in research was 12 (SD = 8.7), and the mean number of years of experience in research related to measurement properties was 9 (SD = 7.1).
Of the 75 selected articles, 8 articles were rated by six participants, 7 by five participants, 11 by four participants, 38 by three participants, and 11 by two participants. The percentages of missing items per box were 7% for box A Internal Consistency (11 items), 5% for box B Reliability (14 items), 1% for box D Content Validity (5 items), 11% for box E Structural Validity (7 items), 7% for box F Hypotheses Testing (10 items), 5% for box G Cross-cultural Validity (15 items), 5% for box H Criterion Validity (7 items), 18% for box I Responsiveness (18 items), 3% for box J Interpretability (9 items), and 1% for the Generalisability box (8 items).
Items of the IRT box had 26 ratings for 13 articles; for 6 articles this box was completed by one rater, for two articles by two raters, for four articles by three raters, and for one article by four raters. Box C Measurement error had 17 ratings for 14 articles; for twelve articles this box was completed by one rater, for one article by two raters, and for one article by three raters. The results of these items are not shown, because percentage agreement and kappa coefficients based on such small numbers are unreliable. For the property measurement error, however, we do have some information, because 10 of the 11 items from this box (i.e. all items on design requirements) were exactly the same as the items about design requirements from box B Reliability (i.e. items B1 to B10).
Table 1 shows the inter-rater agreement and reliability of the questions regarding whether the property was evaluated in an article (step 1 of the COSMIN checklist). Note that these scores are not summary scores of the overall methodological quality of the property. All properties had high percentage agreement (range 84% to 96%). Two of the ten properties, i.e. Reliability and Responsiveness, had an excellent kappa coefficient, i.e. above 0.75. Three properties had moderate to good kappa coefficients and five had poor kappa coefficients.

Table 1. Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) on whether the property was evaluated in an article (COSMIN step 1)
Property                   % agreement   Intraclass kappa
Internal consistency       94            0.66
Reliability                94            0.77
Measurement error          94            0.02
Content validity           84            0.29
Structural validity        86            0.48
Hypotheses testing         87            0.29
Cross-cultural validity    95            0.66
Criterion validity         93            0.23
Responsiveness             96            0.81
Interpretability           86            0.02
Notes to Table 1: number of ratings on the 75 articles = 263; some items showed low dispersal, i.e. more than 75% of the raters who responded to the item rated the same response category. Bold in the original indicates kappa > 0.70 or % agreement > 80%.

In Table 2 we describe percentage agreement and kappa coefficients for each item of the COSMIN boxes A to J (step 3). Fifty-nine (61%) of the 96 items in Table 2 had a percentage agreement above 80%. Thirty items (31%) had a percentage agreement between 70% and 80%, and seven items (7%) between 60% and 70%. Of the 96 items, five (5%) had an excellent kappa coefficient, thirty (31%) had a moderate to good kappa coefficient, and 61 (64%) had a poor kappa coefficient (including the 15 items for which we set negative variance components to 0). Sample sizes for percentage agreement and kappa coefficients per item differed slightly, due to articles that were scored only once by one rater; when calculating percentage agreement, these articles could not be taken into account.
In Table 3 percentage agreement and kappa coefficients are given for the eight items from the Generalisability box (step 4). We combined the scores of the items of the Generalisability box over all measurement properties; therefore, the sample sizes are much higher. All items in Table 3 had a percentage agreement above 80%. None of the items had an excellent kappa coefficient. Four items had a moderate to good kappa coefficient, and four items had a poor kappa coefficient.
We observed two issues. Firstly, thirty-two of the 114 items (Tables 1, 2 and 3; 28%) showed hardly any dispersal, i.e. more than 75% of the raters who responded to the item rated the same response category. When data are skewed, the between-article variance \sigma^2_{\mathrm{article}} is low, and thus the kappa will be low. Secondly, in Table 2 it can be seen that twenty-nine items (28%) had a sample size below 50 for the calculation of kappa coefficients, of which four (4%) were below 30. For the calculation of percentage agreement, thirty-five items (34%) had a sample size below 50, of which twenty-nine (28%) were below 30. Percentage agreement and kappa coefficients based on such small numbers should be interpreted with caution.

Table 2. Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COSMIN checklist (COSMIN step 3). Columns: item number; item; N (minus articles with one rating); % agreement; N; kappa.

Box A. Internal consistency (n = 195)
A1 Does the scale consist of effect indicators, i.e. is it based on a reflective model? 185 82 193 0.06
Design requirements
A2 Was the percentage of missing items given? 183 87 190 0.48
A3 Was there a description of how missing items were handled? 180 90 187 0.54
A4 Was the sample size included in the internal consistency analysis adequate? 177 87 185 0.06
A5 Was the unidimensionality of the scale checked, i.e. was factor analysis or an IRT model applied? 180 92 187 0.69
A6 Was the sample size included in the unidimensionality analysis adequate? 166 79 178 0.27
A7 Was an internal consistency statistic calculated for each (unidimensional) (sub)scale separately? 179 85 187 0.31
A8 Were there any important flaws in the design or methods of the study? 174 86 179 0.22
Statistical methods
A9 for Classical Test Theory (CTT): Was Cronbach's alpha calculated? 179 93 187 0.27
A10 for dichotomous scores: Was Cronbach's alpha or KR-20 calculated? 151 91 165 0.17
A11 for IRT: Was a goodness-of-fit statistic at a global level calculated? e.g. χ², reliability coefficient of estimated latent trait value (index of (subject or item) separation) 154 93 167 0.46

Box B. Reliability (n = 141)
Design requirements
B1 Was the percentage of missing items given? 129 87 140 0.39
B2 Was there a description of how missing items were handled? 125 91 137 0.43
B3 Was the sample size included in the analysis adequate? 127 77 139 0.35
B4 Were at least two measurements available? 129 98 140 0.72
B5 Were the administrations independent? 129 73 139 0.18
B6 Was the time interval stated? 125 94 136 0.50
B7 Were patients stable in the interim period on the construct to be measured? 126 75 138 0.24
B8 Was the time interval appropriate? 125 84 137 0.45
B9 Were the test conditions similar for both measurements? e.g. type of administration, environment, instructions 127 83 138 0.30
B10 Were there any important flaws in the design or methods of the study? 117 77 129 0.08
Statistical methods
B11 for continuous scores: Was an intraclass correlation coefficient (ICC) calculated? 119 86 133 0.59
B12 for dichotomous/nominal/ordinal scores: Was kappa calculated? 111 81 127 0.32
B13 for ordinal scores: Was a weighted kappa calculated? 111 83 127 0.42
B14 for ordinal scores: Was the weighting scheme described? e.g. linear, quadratic 108 81 124 0.35

Box D. Content validity (n = 83)
Design requirements
D1 Was there an assessment of whether all items refer to relevant aspects of the construct to be measured? 62 79 83 0.33
D2 Was there an assessment of whether all items are relevant for the study population? (e.g. age, gender, disease characteristics, country, setting) 62 76 83 0.46
D3 Was there an assessment of whether all items are relevant for the purpose of the measurement instrument? (discriminative, evaluative, and/or predictive) 62 66 83 0.21
D4 Was there an assessment of whether all items together comprehensively reflect the construct to be measured? 62 66 83 0.15
D5 Were there any important flaws in the design or methods of the study? 58 76 78 0.13

Box E. Structural validity (n = 118)
E1 Does the scale consist of effect indicators, i.e. is it based on a reflective model? 99 78 116 0
Design requirements
E2 Was the percentage of missing items given? 95 87 110 0.41
E3 Was there a description of how missing items were handled? 93 91 109 0.55
E4 Was the sample size included in the analysis adequate? 94 87 109 0.56
E5 Were there any important flaws in the design or methods of the study? 89 84 103 0.27
Statistical methods
E6 for CTT: Was exploratory or confirmatory factor analysis performed? 92 90 106 0.51
E7 for IRT: Were IRT tests for determining the (uni-)dimensionality of the items performed? 62 87 80 0.39

Box F. Hypotheses testing (n = 170)
Design requirements
F1 Was the percentage of missing items given? 158 87 168 0.41
F2 Was there a description of how missing items were handled? 159 92 169 0.60
F3 Was the sample size included in the analysis adequate? 157 84 167 0.12
F4 Were hypotheses regarding correlations or mean differences formulated a priori (i.e. before data collection)? 158 74 168 0.42
F5 Was the expected direction of correlations or mean differences included in the hypotheses? 159 75 169 0.26
F6 Was the expected absolute or relative magnitude of correlations or mean differences included in the hypotheses? 159 82 168 0.29
F7 for convergent validity: Was an adequate description provided of the comparator instrument(s)? 125 83 136 0.30
F8 for convergent validity: Were the measurement properties of the comparator instrument(s) adequately described? 124 81 135 0.35
F9 Were there any important flaws in the design or methods of the study? 131 81 145 0.17
Statistical methods
F10 Were design and statistical methods adequate for the hypotheses to be tested? 150 78 161 0.00

Box G. Cross-cultural validity (n = 33)
Design requirements
G1 Was the percentage of missing items given? 25 88 32 0.52
G2 Was there a description of how missing items were handled? 22 82 30 0.32
G3 Was the sample size included in the analysis adequate? 26 81 33 0.23
G4 Were both the original language in which the HR-PRO instrument was developed, and the language in which the HR-PRO instrument was translated, described? 28 89 33 0.34
G5 Was the expertise of the people involved in the translation process adequately described? e.g. expertise in the disease(s) involved, expertise in the construct to be measured, expertise in both languages 28 86 33 0.46
G6 Did the translators work independently from each other? 28 89 33 0.61
G7 Were items translated forward and backward? 28 100 33 1.00
G8 Was there an adequate description of how differences between the original and translated versions were resolved? 28 86 33 0.50
G9 Was the translation reviewed by a committee (e.g. original developers)? 25 88 31 0.56
G10 Was the HR-PRO instrument pre-tested (e.g. cognitive interviews) to check interpretation, cultural relevance of the translation, and ease of comprehension? 21 90 29 0.61
G11 Was the sample used in the pre-test adequately described? 28 79 32 0
G12 Were the samples similar for all characteristics except language and/or cultural background? 26 81 31 0.41
G13 Were there any important flaws in the design or methods of the study? 26 85 31 0.42
Statistical methods
G14 for CTT: Was confirmatory factor analysis performed? 27 74 32 0.03
G15 for IRT: Was differential item functioning (DIF) between language groups assessed? 13 77 23 0.28

Box H. Criterion validity (n = 57)
Design requirements
H1 Was the percentage of missing items given? 35 91 56 0.59
H2 Was there a description of how missing items were handled? 35 97 56 0.79
H3 Was the sample size included in the analysis adequate? 35 69 54 0.06
H4 Can the criterion used or employed be considered as a reasonable 'gold standard'? 37 62 57 0
H5 Were there any important flaws in the design or methods of the study? 33 79 54 0.10
Statistical methods
H6 for continuous scores: Were correlations, or the area under the receiver operating curve, calculated? 37 78 56 0.16
H7 for dichotomous scores: Were sensitivity and specificity determined? 29 83 47 0.28

Box I. Responsiveness (n = 79)
Design requirements
I1 Was the percentage of missing items given? 71 82 76 0.14
I2 Was there a description of how missing items were handled? 73 92 77 0.36
I3 Was the sample size included in the analysis adequate? 72 72 76 0.40
I4 Was a longitudinal design with at least two measurements used? 73 100 78 1.00
I5 Was the time interval stated? 73 89 78 0.25
I6 If anything occurred in the interim period (e.g. intervention, other relevant events), was it adequately described? 72 78 75 0.17
I7 Was a proportion of the patients changed (i.e. improvement or deterioration)? 70 97 73 0.32
Design requirements for hypotheses testing
For constructs for which a gold standard was not available:
I8 Were hypotheses about changes in scores formulated a priori (i.e. before data collection)? 65 69 72 0.35
I9 Was the expected direction of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? 60 78 65 0.19
I10 Were the expected absolute or relative magnitude of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? 61 90 66 0.05
I11 Was an adequate description provided of the comparator instrument(s)? 56 70 63 0
I12 Were the measurement properties of the comparator instrument(s) adequately described? 56 80 63 0.06
I13 Were there any important flaws in the design or methods of the study? 63 71 68 0.03
Statistical methods
I14 Were design and statistical methods adequate for the hypotheses to be tested? 63 73 67 0.21
Design requirements for comparison to a gold standard
For constructs for which a gold standard was available:
I15 Can the criterion for change be considered as a reasonable 'gold standard'? 21 67 28 0
I16 Were there any important flaws in the design or methods of the study? 12 67 21 0
Statistical methods
I17 for continuous scores: Were correlations between change scores, or the area under the receiver operating characteristic (ROC) curve, calculated? 28 79 39 0.47
I18 for dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined? 28 79 37 0.15

Box J. Interpretability (n = 42)
J1 Was the percentage of missing items given? 22 95 41 0.80
J2 Was there a description of how missing items were handled? 21 76 41 0.19
J3 Was the sample size included in the analysis adequate? 23 74 41 0
J4 Was the distribution of the (total) scores in the study sample described? 23 74 41 0.08
J5 Was the percentage of the respondents who had the lowest possible (total) score described? 20 95 40 0.84
J6 Was the percentage of the respondents who had the highest possible (total) score described? 21 90 41 0.70
J7 Were scores and change scores (i.e. means and SD) presented for relevant (sub)groups? e.g. for normative groups, subgroups of patients, or the general population 21 76 41 0.05
J8 Was the minimal important change (MIC) or the minimal important difference (MID) determined? 19 89 40 0.26
J9 Were there any important flaws in the design or methods of the study? 21 71 41 0
Notes to Table 2: when calculating percentage agreement, articles that were only scored once on the particular item were not taken into account; n denotes the number of times a box was evaluated; some items are dichotomous; items with low dispersal are those for which more than 75% of the raters who responded rated the same response category; for items with a nominal response scale a combined (summary intraclass) kappa coefficient was calculated because of the one-way design; negative variance components in the calculation of kappa were set at 0. Bold in the original indicates kappa > 0.70 or % agreement > 80%.
Table 3. Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COSMIN checklist (COSMIN step 4). Columns: item number; item; N (minus articles with one rating); % agreement; N; kappa.

Generalisability box (n = 866)
Was the sample in which the HR-PRO instrument was evaluated adequately described? In terms of:
1 median or mean age (with standard deviation or range)? 733 86 865 0.36
2 distribution of sex? 735 88 863 0.38
3 important disease characteristics (e.g. severity, status, duration) and description of treatment? 746 80 862 0.39
4 setting(s) in which the study was conducted? e.g. general population, primary care or hospital/rehabilitation care 735 89 863 0.30
5 countries in which the study was conducted? 733 90 861 0.40
6 language in which the HR-PRO instrument was evaluated? 733 86 861 0.41
7 Was the method used to select patients adequately described? e.g. convenience, consecutive, or random 729 81 857 0.40
8 Was the percentage of missing responses (response rate) acceptable? 724 82 849 0.48
Notes to Table 3: when calculating percentage agreement, articles that were only scored once on the particular item were not taken into account; n denotes the number of times the box was evaluated. Sample sizes are much higher than for the other items, because the scores of the Generalisability box items were combined over all measurement properties; some items are dichotomous and some showed low dispersal; the combined kappa coefficient was calculated because of the nominal response scale in a one-way design. Bold in the original indicates kappa > 0.70 or % agreement > 80%.

Discussion
In this study we investigated the inter-rater agreement and reliability of the item scores of the COSMIN checklist. Overall, the percentages of agreement were high, indicating that raters often choose the same response option. The kappa coefficients were low, indicating that it is difficult to distinguish between articles at item level. We start the discussion with reasons for the low kappa coefficients and for the low percentages of agreement.
Although the term inter-rater agreement does not appear in the COSMIN taxonomy [8], we used it in this study. For measurement instruments that have continuous scores the measurement error can be investigated. However, instruments with a nominal or ordinal score do not have a unit of measurement, and consequently measurement error cannot be calculated. Because we were interested in whether the ratings were similar, we present the percentage agreement of all nominal and ordinal items.

Reasons for low kappa coefficients
Kappa coefficients for 70 of the 114 items were poor. This is partly due to a skewed distribution of the item scores. Low dispersal strongly influences the kappa, because if the variance between articles is low, the error variance is large in relation to the article variance. For example, item I5 of the box Responsiveness (i.e. was the time interval stated) had a kappa of 0.25; raters scored "yes" 65 times (83%) and "no" 13 times (17%).
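To illustrate this numerically, the Python sketch below (ours, not from the paper; the simulated data are artificial) compares the one-way intraclass kappa for a 0/1 item whose true article values are roughly balanced with one where almost every article is a 'yes'. The raters are equally consistent in both cases, but the skewed item has little between-article variance and therefore a much lower kappa.

import random

def one_way_kappa(groups):
    """Intraclass kappa from a one-way layout: groups is a list of per-article 0/1 rating lists."""
    k, n = len(groups), sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ms_b = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups) / (k - 1)
    ms_w = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups) / (n - k)
    n0 = (n - sum(len(g) ** 2 for g in groups) / n) / (k - 1)
    var_a = max((ms_b - ms_w) / n0, 0.0)
    return var_a / (var_a + ms_w) if (var_a + ms_w) > 0 else 0.0

def simulate(prop_yes_articles, n_articles=100, raters=3, error=0.1, seed=7):
    """Each article has a true 'yes'/'no' value; each rater misreads it with probability `error`."""
    rng = random.Random(seed)
    truths = [1 if i < prop_yes_articles * n_articles else 0 for i in range(n_articles)]
    return [[t if rng.random() > error else 1 - t for _ in range(raters)] for t in truths]

print(round(one_way_kappa(simulate(0.5)), 2))  # roughly balanced item: higher kappa
print(round(one_way_kappa(simulate(0.9)), 2))  # skewed item: same rater error, lower kappa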
Reasons for low inter-rater agreement between raters
Percentage agreement was below 80% for 37 of the 114 items. For many items of the COSMIN checklist a subjective judgement is needed. For example, each box includes the item 'were there any important flaws in the design or the methods of the study?' (e.g. B10, I13, I16 and J9). To answer this question, the rater has to judge this based on his or her own experience and knowledge; therefore, some kind of subjective evaluation is involved. Some other items might be rather difficult to score, because the information needed to answer the item is not reported in the article. For example, the information needed to respond to the item 'were the administrations independent?' (B5) is often not reported. Although raters should score '?' in this case, raters are likely to guess, or to skip these items. This influences the kappa coefficients and the percentage agreement.
Furthermore, the COSMIN checklist contains consensus-based standards that may deviate from how people are used to evaluating measurement properties, or a person may disagree with a particular item. Consequently, a rater may score an item differently than recommended in the COSMIN manual. For example, many people consider effect sizes appropriate measures of responsiveness. Within the COSMIN Delphi study, we decided to consider this inappropriate [9]. We believe that only when clear hypotheses are formulated about the expected magnitude of the effect sizes (ES) is it appropriate as an indicator of responsiveness (I14). Another example is the issue of the gold standard. The COSMIN panel did not consider a commonly used measurement instrument, such as the SF-36, a reasonable gold standard. However, raters may disagree with this, and rate the item 'can the criterion (for change) be considered as a reasonable gold standard?' (H4 and I15) as 'yes', while according to the COSMIN manual this item should be scored 'no'. Consequently, the kappa coefficient and the percentage agreement will be low.
Last, the distinction between rating the methodological quality of the study and rating the quality of the instrument that is evaluated in the study may be difficult, especially for content validity. Therefore, the items on content validity are difficult to score. All items of box D on content validity had low kappa coefficients and percentage agreement. They ask whether the article under study appropriately investigated whether the items were relevant and comprehensive. This refers to the methodological quality of a study. For example, an appropriate method to investigate the content validity of a HR-PRO is to involve patients from the target population, by asking them about the relevance and comprehensiveness of the items. These COSMIN items do not ask whether the items of the PRO under study are relevant and comprehensive, which refers to the quality of an instrument. Raters may have been confused about this distinction.

Strengths and weaknesses of the study
We are confident that the raters who participated in this study are representative of the future users of the COSMIN checklist, since their number of years of experience in research varied widely. We used a wide range of articles that are likely to be a representative sample of articles on measurement properties. The distribution of many articles over many raters (no pairs, no ordering) enhances the generalisability of our results and leads to conservative estimates. Also, we did not intervene beyond the delivery of the checklist and the instruction manual. In all, the study should be seen as very similar to the usual conditions of its use.
It was our aim to randomly select equal numbers of studies on each measurement property. However, studies on internal consistency and hypotheses testing are more common than studies on measurement error and interpretability, and studies based on CTT are more common than studies that apply IRT methods. Consequently, these less common measurement properties were less often selected for this study. This prevented analysis of the items on measurement error and on IRT analysis.
In addition, it was our aim to include a representative sample of potential users of the COSMIN checklist. As expected, the years of experience of the participants in this study, both in research in general and in research on measurement instruments, differed widely. Although more than half of the raters came from the Netherlands, we do not expect that the country of origin will have a major influence on the results.
In this study it was not feasible to train the raters, because we expected that this would dramatically decrease the response rate. However, we recommend getting some experience in completing the COSMIN checklist before conducting a systematic review. In the future, when more raters are trained in completing the checklist, a reliability study among trained raters could be performed.
Due to the incomplete study design (i.e. not all raters scored all articles, and in an article not all measurement properties are evaluated) we had a one-way design. Therefore, the variance due to raters could not be distinguished from the error variance. Other possible designs would be asking a few raters to evaluate many articles, or asking many raters to evaluate the same few articles. Both designs were considered poor. In the first case, it is likely that we would not find participants, due to the large amount of work each rater would have to do. We felt that we, as authors of the COSMIN checklist, should not be these raters, because of our involvement in the development of the checklist. The second design is considered poor because we would have to include a few articles in which all measurement properties were evaluated. It is very likely that such articles do not exist, and if such an article were published, it is very likely that it would not be a good representation of studies on measurement properties.

Recommendations for improvement of the inter-rater agreement and reliability of the COSMIN checklist
Firstly, based on the results of this study and the feedback we received from raters, we improved the wording and grammar of a few items and we adapted the instructions in the manual. This might improve the agreement on the COSMIN item scores. Secondly, the COSMIN checklist is not a ready-made checklist, in the sense that the user can instantly complete all items. We recommend that researchers who use the COSMIN checklist, for example in a systematic review, agree beforehand on how to handle items that need a subjective judgement, and how to deal with lack of reporting in the original article. For example, based on the topic of the review, they should agree on what they consider an appropriate time interval for reliability (B8), an adequate description of the comparator instrument(s) (F7 and I11), or an acceptable percentage of missing responses (item 8 of the Generalisability box). This may also increase the inter-rater agreement. Thirdly, some experience in completing the checklist before conducting a systematic review is also likely to increase the inter-rater agreement of the COSMIN checklist. Therefore, we are developing a training set of articles (to be published on our website), explaining how these articles should be evaluated using the COSMIN checklist. Fourthly, we strongly recommend using the taxonomy and terminology of the COSMIN checklist. For example, if authors compare their PRO to a commonly used PRO such as the SF-36 and refer to this as criterion validity, we recommend considering this an evaluation of hypotheses testing, which is an aspect of construct validity, and completing box F. Fifthly, when using the checklist in a systematic review of HR-PROs, we recommend having the checklist completed by at least two independent raters, and reaching consensus on one final rating. In this study we used the ratings of single raters to determine the inter-rater agreement of the checklist, because a design with consensus scores of two raters was not feasible. We recommend evaluating the inter-rater agreement of the consensus scores of pairs of raters in a future study, when more raters are trained.
Note that in this study we investigated the inter-rater agreement and reliability at item level. The results showed that it is difficult to distinguish articles at item level. When using the COSMIN checklist in a systematic review on measurement properties, an overall score per box is useful to decide whether the methodological quality can be considered good. For such a score, the reliability might be better.
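As a small illustration of the two-independent-raters procedure recommended above (our sketch, not an official COSMIN tool; the item codes and ratings are invented), the following Python fragment merges two raters' scores for a box and lists the items on which they disagree, so that these can be discussed until one consensus rating remains.

def merge_ratings(rater_a, rater_b):
    """Return (consensus, to_discuss): items with identical scores are accepted,
    items with different scores are flagged for a consensus meeting."""
    consensus, to_discuss = {}, []
    for item in rater_a:
        if rater_a[item] == rater_b.get(item):
            consensus[item] = rater_a[item]
        else:
            to_discuss.append((item, rater_a[item], rater_b.get(item)))
    return consensus, to_discuss

# Hypothetical scores for part of box B (Reliability)
rater_1 = {"B1": "yes", "B2": "yes", "B5": "?", "B8": "yes"}
rater_2 = {"B1": "yes", "B2": "no", "B5": "?", "B8": "no"}
agreed, disputed = merge_ratings(rater_1, rater_2)
print(agreed)    # {'B1': 'yes', 'B5': '?'}
print(disputed)  # [('B2', 'yes', 'no'), ('B8', 'yes', 'no')]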
Reliability of other checklists
We found three studies in which the inter-rater agreement and reliability of a similar kind of checklist were investigated.
In one study, the reliability of a 39-item appraisal tool to evaluate PRO instruments (EMPRO) [10] was investigated. In this study five panels (in which three or four raters participated) each assessed the quality of the Spanish version of one well-known and widely used PRO instrument. Intraclass correlation coefficients (two-way model, absolute agreement) were calculated for the overall assessment of the quality score. High ICCs were found (all above 0.75) [10]. COSMIN and EMPRO both focus on PROs. However, with the COSMIN checklist it is not yet possible to calculate an overall score per box or an overall score for the quality of all measurement properties together. In addition, EMPRO assesses the overall quality of a measurement instrument, while COSMIN assesses the methodological quality of studies on measurement properties.
In two other studies, two independent raters scored a number of articles using either the STAndards for the Reporting of Diagnostic accuracy studies (STARD) [11] or the Nelson-Moberg Expanded CONSORT Instrument (NMECI) [12]. Both studies reported percentage agreement and kappa coefficients. In the study by Smidt et al. [11], percentage agreement ranged between 63% and 100%, and kappa coefficients between -0.032 and 1.00. About the same percentage of items as in COSMIN (61% of the STARD items) showed high percentage agreement (i.e. above 80%). However, more items had higher kappa coefficients, i.e. 23% of the STARD items showed excellent kappa coefficients (i.e. above 0.70). In the study by Moberg-Mogren & Nelson [12], 77% of the CONSORT items showed a high ICC (i.e. above 0.70), and 57% of the NMECI items showed high kappa coefficients (i.e. above 0.70). Of the NMECI items, 29 of the 176 kappa coefficients were below 0.40; for these items percentage agreement was also reported, ranging between 43% and 93%. CONSORT and NMECI items had higher values for reliability than the COSMIN items.

Conclusion
The inter-rater agreement of the COSMIN items was adequate, i.e. raters mostly rated the items of the COSMIN checklist in the same way. The inter-rater reliability of the COSMIN items was poor for many items; it was difficult to distinguish between articles based on item-level scores. Some disagreements between raters are likely to be influenced by the subjective judgement needed to answer an item. Therefore, we recommend making decisions in advance about how to score these issues. The inter-rater agreement on other items may have improved after this study, since we have tried to improve the instructions in the manual on some issues, based on the feedback of raters. When using the COSMIN checklist it is important to read the manual carefully, and to get some training and experience in completing the checklist.

Acknowledgements
We are grateful to all the participants of the COSMIN inter-rater reliability study: Femke Abma, Gwenda Albers, Jagath Amarasehera, Adri Apeldoorn, Ingrid Arévalo Rodríguez, Susan Armijo Olivo, Geert Aufdemkampe, Ruth Barclay-Goddard, Ilse Beljouw, Sandra Beurskens, Michiel de Boer, Sandra Bot, Han Boter, Laurien Buffart, Mauro Carone, Oren Cheifetz, Bert Chesworth, Anne Christie, Heather Christie, Heather Colguhoun, Janet Copeland, Dominique Dubois, Michael Echteld, Roy Elbers, Willem Eijzenga, Antonio Escobar, Brigitte Essers, Marie Louise Essink-Bot, Teake Ettema, Silvia Evers, Wouter van de Fliert, Jorge Fuentes, Carlos Garcia Forero, Fania Gartner, Claudia Gorecki, Francis Guillemin, Alice Hammink, Graeme Hawthorne, Nick Henschke, Kelvin Jordan, Sophia Kramer, Joke Korevaar, Hilde Lamberts, Henrik Lauridsen, Hanneke van der Lee, Tim Lucket, Han Marinus, Belle van der Meer, Henk Mokkink, Paola Mosconi, Sara Muller, Ricky Mullis, Joanneke van der Nagel, Rinske Nijland, Ruth van Nispen, Jan Passchier, George Peat, Hein Raat, Luis Rajmil, Bryce Reeve, Leo Roorda, Sabine Roos, Nancy Salbach, Jasper Schellingerhout, Wouter Schuller, Hanneke Schuurmans, Jane Scott, Jos Smeets, Antonette Smelt, Kevin Smith, Eric van Sonderen, Alan Stanton, Ben Steenkiste, Raymond Swinkels, Fred Tromp, Joan Trujols, Arianne Verhagen, Gemma Vilagut Saiz, Torquil Watt, Adrian Wenban, Daniëlle van der Windt, Harriet Wittink, Virginia Wright, and Carlijn van der Zee.
This study was financially supported by the EMGO Institute, VU University Medical Center, Amsterdam, the Netherlands, and the Anna Foundation, Leiden, the Netherlands.

Author details
Department of Epidemiology and Biostatistics and the EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands. Department of Public Health, Patient-reported Outcome Measurement Group, University of Oxford, Oxford, UK. School of Rehabilitation Science and Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Canada. Health Services Research Unit, IMIM-Institut de Recerca Hospital del Mar, Parc de Salud Mar de Barcelona, Spain. CIBER en Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain. Department of Health Services, University of Washington, Seattle, USA. Executive Board of VU University Amsterdam, Amsterdam, The Netherlands.

Authors' contributions
LB, CT and HdV secured funding for the study. CT, HdV, LB, DK, DP, JA, PS, and EG conceived the idea for the study. EG prepared the database and LM selected the articles. All authors invited potential raters. LM coordinated the study and managed the data. LM, CT, DK and HdV interpreted the data. CT, EG, DP, JA, PS, DK, LB and HdV supervised the study. LM wrote the manuscript with input from all the authors. All authors read and approved the final version of the report.

Competing interests
The authors, except for E. Gibbons, were the developers of the COSMIN checklist.

Received: 23 June 2010 Accepted: 22 September 2010 Published: 22 September 2010

References
1. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, De Vet HCW: The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res 2010, 19:539-549.
2. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, De Vet HCW: The COSMIN checklist manual. [http://www.cosmin.nl].
3. Landis JR, Koch GG: A one-way components of variance model for categorical data. Biometrics 1977, 33:671-679.
4. Kraemer HC, Periyakoil VS, Noda A: Tutorial in biostatistics. Kappa coefficients in medical research. Stat Med 2002, 21:2109-2129.
5. Lin L, Hedayat AS, Wu W: A unified approach for assessing agreement for continuous and categorical data. J Biopharm Stat 2007, 17:629-652.
6. Fleiss JL: Statistical methods for rates and proportions. New York: John Wiley & Sons; 1981.
7. Vach W: The dependence of Cohen's kappa on the prevalence does not matter. J Clin Epidemiol 2005, 58:655-661.
8. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, de Vet HC: The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol 2010, 63:737-745.
9. Mokkink LB, Terwee CB, Knol DL, Stratford PW, Alonso J, Patrick DL, Bouter LM, De Vet HCW: The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: a clarification of its content. BMC Med Res Methodol 2010, 10:22.
10. Valderas JM, Ferrer M, Mendivil J, Garin O, Rajmil L, Herdman M, Alonso J: Development of EMPRO: a tool for the standardized assessment of patient-reported outcome measures. Value Health 2008, 11:700-708.
11. Smidt N, Rutjes AW, Van der Windt DA, Ostelo RW, Bossuyt PM, Reitsma JB, Bouter LM, De Vet HCW: Reproducibility of the STARD checklist: an instrument to assess the quality of reporting of diagnostic accuracy studies. BMC Med Res Methodol 2006, 6:12.
12. Moberg-Mogren E, Nelson DL: Research concepts in clinical scholarship: evaluating the quality of reporting occupational therapy randomized controlled trials by expanding the CONSORT criteria. Am J Occup Ther 2006, 60:226-235.

Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/10/82/prepub

doi:10.1186/1471-2288-10-82
Cite this article as: Mokkink et al.: Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) Checklist. BMC Medical Research Methodology 2010, 10:82.

Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) Checklist

Loading next page...
 
/lp/springer-journals/inter-rater-agreement-and-reliability-of-the-cosmin-consensus-based-oW06fe2c11

References (34)

Publisher
Springer Journals
Copyright
Copyright © 2010 by Mokkink et al; licensee BioMed Central Ltd.
Subject
Medicine & Public Health; Theory of Medicine/Bioethics; Statistical Theory and Methods; Statistics for Life Sciences, Medicine, Health Sciences
eISSN
1471-2288
DOI
10.1186/1471-2288-10-82
pmid
20860789
Publisher site
See Article on Publisher Site

Abstract

Background: The COSMIN checklist is a tool for evaluating the methodological quality of studies on measurement properties of health-related patient-reported outcomes. The aim of this study is to determine the inter-rater agreement and reliability of each item score of the COSMIN checklist (n = 114). Methods: 75 articles evaluating measurement properties were randomly selected from the bibliographic database compiled by the Patient-Reported Outcome Measurement Group, Oxford, UK. Raters were asked to assess the methodological quality of three articles, using the COSMIN checklist. In a one-way design, percentage agreement and intraclass kappa coefficients or quadratic-weighted kappa coefficients were calculated for each item. Results: 88 raters participated. Of the 75 selected articles, 26 articles were rated by four to six participants, and 49 by two or three participants. Overall, percentage agreement was appropriate (68% was above 80% agreement), and the kappa coefficients for the COSMIN items were low (61% was below 0.40, 6% was above 0.75). Reasons for low inter-rater agreement were need for subjective judgement, and accustom to different standards, terminology and definitions. Conclusions: Results indicated that raters often choose the same response option, but that it is difficult on item level to distinguish between articles. When using the COSMIN checklist in a systematic review, we recommend getting some training and experience, completing it by two independent raters, and reaching consensus on one final rating. Instructions for using the checklist are improved. Background properties of HR-PROs. It can also be used to design Recently, a checklist for the evaluation of the methodo- and report a study on measurement properties. Also, logical quality of studies on measurement properties of reviewers and editors could use it to identify shortcom- health-related patient-reported outcomes (HR-PROs) - ings in studies on measurement properties, and to assess the COSMIN checklist - was developed in an interna- whether the methodological quality of such studies is tional Delphi study [1]. COSMIN is an acronym for high enough to justify publication. COnsensus-based Standards for the selection of health The COSMIN checklist contains twelve boxes [1]. Ten status Measurement INstruments. This checklist can be boxes can be used to assess whether a study meets the used for the appraisal of the methodological quality of standards for good methodological quality (ranging from studies included in a systematic review of measurement 5-18 items). Nine of these boxes contain the standards for the measurement properties considered (internal * Correspondence: w.mokkink@vumc.nl consistency (box A), reliability (box B), measurement Department of Epidemiology and Biostatistics and the EMGO Institute for error (box C), content validity (box D), structural valid- Health and Care Research, VU University Medical Center, Amsterdam, The ity (box E), hypotheses testing (box F) and cross-cultural Netherlands Full list of author information is available at the end of the article © 2010 Mokkink et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Mokkink et al. 
validity (box G), criterion validity (box H), and responsiveness (box I)), and one box contains standards for studies on interpretability (box J). In addition, one box (IRT box) contains requirements for articles in which Item Response Theory (IRT) methods are applied (4 items), and one box (Generalisability box) is included in the checklist that contains requirements for the generalisability of the results (8 items).

It is important to assess the quality of the COSMIN checklist itself. For example, it is important that different researchers, who use the COSMIN checklist to rate the same article, give the same ratings on each item. Therefore, the aim of this study is to determine the inter-rater agreement and reliability of each item score of the COSMIN checklist among potential users.

Methods
Because the COSMIN checklist will be applied in the future to a variety of studies on different topics and study populations, with low and high quality, it was our goal to generalise the results of this study to a broad range of articles on measurement properties. In addition, the COSMIN checklist will be used by many researchers, using the instructions in the COSMIN manual as guidance. We were interested in the inter-rater agreement and reliability in this situation. Often, only a selection of measurement properties is evaluated in an article. Consequently, only parts of the COSMIN checklist can be completed. We arbitrarily decided in advance that (1) we aimed for four ratings for each item of the COSMIN checklist on the same article, and (2) we aimed for each measurement property to be evaluated in at least 20 different articles. This was done to increase the representativity of studies and raters.

Article selection
In this study we included articles that were representative of studies on measurement properties. We selected articles from the bibliographic database compiled by the Patient-Reported Outcome Measurement (PROM) Group, Oxford, UK (http://phi.uhce.ox.ac.uk). The bibliography includes evaluations of PROs with information about psychometric properties and operational characteristics, and applications where, for example, a PRO has been used in a trial as a primary or secondary endpoint. The online PROM database comprises records downloaded from several electronic databases using a comprehensive search strategy (details available on request).

The selection of articles for this study was a two-step procedure. First, for the 30,000+ included articles it was determined, based on the title, whether the article concerned a study on the evaluation of measurement properties of a PRO; for example, the title included terms for a specific measurement property, such as reliability, validity, or responsiveness. A total of 5137 articles were eligible. Second, from these articles, we randomly selected studies that fulfilled our inclusion criteria.

Inclusion criteria were:
- The purpose of the study was to evaluate one or more measurement properties
- The instrument under study was a HR-PRO instrument
- English language publication

Articles from any setting and any population could be included, and articles could have used Classical Test Theory (CTT) or modern test theory (i.e. Item Response Theory (IRT)) or both.

Exclusion criteria were:
- Systematic reviews, case reports, and letters to editors
- Studies that evaluated the construct validity of two or more instruments at the same time by correlating the scores of the instruments with each other, without indicating one of the instruments as the instrument of interest. In these studies, it is unclear for which instrument the construct validity is being assessed.

One of the authors (LM) selected articles until each measurement property was assessed in at least 20 articles. It appeared that we needed to select 75 articles. For each included article LM determined the relative workload for a rater to evaluate the methodological quality of the article, i.e. high, moderate, or low workload. The relative workload was based on the number of measurement properties assessed in the study, the number of instruments that were studied, the number of pages, and whether IRT was used. For example, an article in which IRT is used was considered to have a high workload, and an article in which three measurement properties were evaluated in a four-page paper was considered to have a low workload. We decided to ask each rater to evaluate three articles. We provided each rater with one article with a low workload, one with a moderate workload, and one with a high workload.

Selection of participants
Raters were professionals who had some experience with assessing measurement properties, ranging from having little experience to being an expert. We chose to select a heterogeneous group of raters, because this best reflects the raters who will potentially use the COSMIN checklist in the future. We invited the international panel of the COSMIN Delphi study [1] to participate in the inter-rater agreement and reliability study (n = 91), as well as attendees of two courses on clinimetrics given in 2009 by the Department of Epidemiology and Biostatistics of the VU University Medical Center (n = 72), researchers on the mailing list of the Dutch chapter of the International Society for Quality of Life Research (ISOQOL-NL) (n = 295), members of the EMGO Clinimetrics working group (n = 32), members of the PRO Methods Group of the Cochrane Collaboration (n = 79), researchers who had previously shown interest in the COSMIN checklist (n = 15), colleagues of the authors, and other researchers who were likely to show interest. We also asked these people if they knew other researchers who were interested in participating.

Procedure
Those who agreed to participate received three selected articles, together with a manual of the COSMIN checklist [2] and a data collection form to enter their scores. For each article, they were asked to follow all the COSMIN evaluation steps. Step 1: to indicate, for each measurement property, whether it was evaluated in the article ('yes/no'). The participants had to determine themselves which boxes they should complete for each of the three papers. Step 2: they were asked whether IRT was used in the article, and if so, they were asked to complete the IRT box. Step 3: they were asked to complete the relevant boxes of the COSMIN checklist. Step 4: raters were asked to complete the Generalisability box for each measurement property assessed in the article.

Instructions on how to complete the boxes were provided in the COSMIN manual [2]. Raters did not receive any additional training in completing the COSMIN checklist and were not familiar with the checklist. Items could be answered with "yes"/"no", with "yes"/"?"/"no", or with "yes"/"no"/"not applicable" ("na"). One item had four response options, i.e. "yes"/"?"/"no"/"na".

Statistical analyses
Each rater scored three of the 75 selected articles, and in each article a selection of the measurement properties was evaluated. Therefore, we analyzed each COSMIN item score using a one-way design.

We calculated percentage agreement for each item. This measure indicates how often raters who rated the same items on the same articles chose the same response category. We considered the highest number of similar ratings per item per article as agreement, and the other ratings as non-agreement. For example, if five raters rated the same item for the same article, and three of the raters rated 'yes' and two rated 'no', we considered three ratings as agreement. Percentage agreement was calculated as the number of ratings with agreement on all articles, divided by the total number of ratings on all articles for which that measurement property was assessed. A percentage agreement above 80% was considered appropriate (arbitrarily chosen).

In addition, we calculated the reliability of the items using kappa coefficients. This is a measure that indicates how well articles can be distinguished from each other based on the given COSMIN item score. Dichotomous items were analysed using intraclass kappa coefficients [3]; the scoring was yes = 1 and no = 0:

\kappa_{\mathrm{intraclass,\,COSMIN\ item}} = \frac{\sigma^2_{\mathrm{article}}}{\sigma^2_{\mathrm{article}} + \sigma^2_{\mathrm{error}}}

where \sigma^2_{\mathrm{article}} denotes the variance due to systematic differences between the articles for which the item was scored, and \sigma^2_{\mathrm{error}} denotes the random error variance.
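To make these one-way calculations concrete, the sketch below estimates percentage agreement and the intraclass kappa for a single dichotomous COSMIN item from (article, score) pairs. It is a minimal illustration under stated assumptions, not the authors' analysis code: the data layout, function names, and the toy ratings at the bottom are invented for the example, and the variance components use the standard unbalanced one-way ANOVA estimators.

```python
# Minimal sketch (assumed layout, not the authors' code): ratings for one COSMIN
# item are stored as (article_id, score) pairs, with dichotomous items coded
# yes = 1, no = 0, as described in the text above.
from collections import Counter, defaultdict


def group_by_article(ratings):
    by_article = defaultdict(list)
    for article, score in ratings:
        by_article[article].append(score)
    return list(by_article.values())


def percentage_agreement(ratings):
    """Majority ratings per article count as agreement; singly rated articles are skipped."""
    agree = total = 0
    for scores in group_by_article(ratings):
        if len(scores) < 2:
            continue  # an article scored by only one rater carries no agreement information
        agree += Counter(scores).most_common(1)[0][1]  # size of the largest group of identical scores
        total += len(scores)
    return 100.0 * agree / total


def one_way_variance_components(ratings):
    """Unbalanced one-way ANOVA estimates of (sigma2_article, sigma2_error)."""
    groups = group_by_article(ratings)
    k = len(groups)
    n_tot = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_tot
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_tot - k)
    n0 = (n_tot - sum(len(g) ** 2 for g in groups) / n_tot) / (k - 1)  # average group size correction
    sigma2_article = max((ms_between - ms_within) / n0, 0.0)  # negative components truncated to 0
    return sigma2_article, ms_within


def one_way_intraclass_kappa(ratings):
    sigma2_article, sigma2_error = one_way_variance_components(ratings)
    total = sigma2_article + sigma2_error
    return sigma2_article / total if total > 0 else 0.0


if __name__ == "__main__":
    toy = [("a1", 1), ("a1", 1), ("a1", 0), ("a2", 0), ("a2", 0), ("a3", 1), ("a3", 1)]
    print(percentage_agreement(toy))       # 85.7: 6 of 7 ratings sit in the per-article majority
    print(one_way_intraclass_kappa(toy))   # ratio of between-article to total variance for this toy item
```

As the text notes, applying the same variance-component ratio to the ordinal items coded 1/2/3 gives the quadratically weighted kappa, which is why those values are numerically identical to an ANOVA-based ICC.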
Ordinal items were analyzed with weighted kappa coefficients using quadratic weights; the scoring was 'yes' = 1, '?' = 2, and 'no' = 3. (Note that the order of the response options in the COSMIN checklist is yes/no/?.) These measures are numerically the same as intraclass correlation coefficients (ICCs) obtained from analysis of variance (ANOVA) [4-6].

Twenty-two items could be answered with "na", which makes the scale of these items a multi-categorical nominal scale. For these items, we calculated kappas after all possible dichotomizations. For example, item A9 has three response options, i.e. 'yes', 'no', and 'na'. This item was dichotomized three times, i.e. into yes = 1 and not yes = 0 (dummy variable 1), into no = 1 and not no = 0 (dummy variable 2), and into na = 1 and not na = 0 (dummy variable 3). Next, the components for the intraclass kappa were calculated, and a summary intraclass (SI) kappa was calculated using the formula [3]:

\kappa_{\mathrm{SI,\,COSMIN\ item}} = \frac{\sum_i \sigma^2_{\mathrm{article}(i)}}{\sum_i \sigma^2_{\mathrm{article}(i)} + \sum_i \sigma^2_{\mathrm{error}(i)}}

where the sums run over the dummy variables i. The numerator reflects the variance due to the article, and the denominator reflects the total variance. In case a variance component was negative, we set the variance at zero.

Since we do not calculate overall scores per box, we only calculated kappa coefficients per COSMIN item. We considered a kappa below 0.40 as poor, between 0.40 and 0.75 as moderate to good, and above 0.75 as excellent [6].

Reliability measures such as kappa are dependent on the distribution of the data (\sigma^2_{\mathrm{article}}). Vach showed that reliability measures are low when data are skewed [7]. We considered a distribution of scores as skewed when more than 75% of the raters who responded to an item used the same response category. Percentage agreement is not dependent on the distribution of the data.
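The dichotomization procedure for the nominal items can be sketched in the same style. The helper below assumes the one_way_variance_components function from the previous sketch is in scope (place both in one file to run it); the category labels and the example ratings are illustrative assumptions, not study data.

```python
def summary_intraclass_kappa(ratings, categories=("yes", "no", "na")):
    """SI kappa for a nominal COSMIN item: dichotomize per response category,
    then pool the one-way variance components over the dummy variables."""
    article_var_sum = total_var_sum = 0.0
    for category in categories:
        dummy = [(article, 1 if score == category else 0) for article, score in ratings]
        sigma2_article, sigma2_error = one_way_variance_components(dummy)
        article_var_sum += sigma2_article
        total_var_sum += sigma2_article + sigma2_error
    return article_var_sum / total_var_sum if total_var_sum > 0 else 0.0


# Illustrative ratings for an item such as A9 (three response options):
a9 = [("a1", "yes"), ("a1", "yes"), ("a2", "na"), ("a2", "na"), ("a3", "no"), ("a3", "yes")]
print(summary_intraclass_kappa(a9))
```

Summing the variance components before taking the ratio, rather than averaging three separate per-dummy kappas, follows the formula given above.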
In our analysis we combined the scores of the items of the Generalisability box across all measurement properties, so that we calculated percentage agreement and kappa coefficients only once for each of the items from this box, and not separately for each measurement property.

Results
A total of 154 raters agreed to participate in this study. We received ratings from 88 (57%) of the participants. The responders came from the Netherlands (58%), Canada (10%), the UK (7%), Australia or New Zealand (6%), Europe without the Netherlands and UK (15%), and other countries (5%). The mean number of years of experience in research was 12 years (SD = 8.7), with 9 years (SD = 7.1) of experience in research related to measurement properties.

Of the 75 selected articles, 8 articles were rated by six participants, 7 articles by five participants, 11 by four participants, 38 by three participants, and 11 by two participants. The percentages of missing items per box were 7% for box A Internal Consistency (11 items), 5% for box B Reliability (14 items), 1% for box D Content Validity (5 items), 11% for box E Structural Validity (7 items), 7% for box F Hypotheses Testing (10 items), 5% for box G Cross-cultural Validity (15 items), 5% for box H Criterion Validity (7 items), 18% for box I Responsiveness (18 items), 3% for box J Interpretability (9 items), and 1% for the Generalisability box (8 items).

Items of the IRT box had 26 ratings for 13 articles; for 6 articles this box was completed by one rater, for two articles by two raters, for four articles by three raters, and for one article by four raters. Box C Measurement error had 17 ratings for 14 articles; for twelve articles this box was completed by one rater, for one article by two raters, and for one article by three raters. The results of these items are not shown, because percentage agreement and kappa coefficients based on such small numbers are unreliable. For the property measurement error, however, we have some information, because 10 of the 11 items from this box (i.e. all items on design requirements) were exactly the same as the items about design requirements from box B Reliability (i.e. items B1 to B10).

Table 1 shows the inter-rater agreement and reliability of the questions regarding whether the property was evaluated in an article (step 1 of the COSMIN checklist). Note that these scores are not summary scores of the overall methodological quality of the property. All properties had high percentage agreement (range 84% to 96%). Two of the ten properties, i.e. Reliability and Responsiveness, had an excellent kappa coefficient, i.e. above 0.75. Three properties had moderate to good kappa coefficients and five had poor kappa coefficients.

Table 1 Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) on whether the property was evaluated in an article (COSMIN step 1)
Measurement property  Percentage agreement  Intraclass kappa
Internal consistency  94  0.66
Reliability  94  0.77
Measurement error  94  0.02
Content validity  84  0.29
Structural validity  86  0.48
Hypotheses testing  87  0.29
Cross-cultural validity  95  0.66
Criterion validity  93  0.23
Responsiveness  96  0.81
Interpretability  86  0.02
Notes: number of ratings on the 75 articles = 263; some items had low dispersal, i.e. more than 75% of the raters who responded to the item rated the same response category; printed in bold indicates kappa > 0.70 or % agreement > 80%.

In Table 2 we describe percentage agreement and kappa coefficients for each item of COSMIN boxes A to J (step 3). Fifty-nine (61%) of the 96 items in Table 2 had a percentage agreement above 80%, thirty items (31%) had a percentage agreement between 70% and 80%, and seven items (7%) between 60% and 70%. Of the 96 items, five (5%) had an excellent kappa coefficient, thirty (31%) had a moderate to good kappa coefficient, and 61 items (64%) had a poor kappa coefficient (including the 15 items for which we set negative variance components to 0). Sample sizes for percentage agreement and kappa coefficients per item were slightly different, due to articles that were scored only once by one rater. When calculating percentage agreement, these articles could not be taken into account.

In Table 3 percentage agreement and kappa coefficients are given for the eight items from the Generalisability box (step 4). We combined the scores of the items of the Generalisability box across all measurement properties; therefore, the sample sizes are much higher. All items in Table 3 had a percentage agreement above 80%. None of the items had an excellent kappa coefficient. Four items had a moderate to good kappa coefficient, and four items had a poor kappa coefficient.

We observed two issues. Firstly, thirty-two of the 114 items (Tables 1, 2 and 3; 28%) showed hardly any dispersal, i.e. more than 75% of the raters who responded to the item rated the same response category. When data are skewed, the between-article variance (\sigma^2_{\mathrm{article}}) is low, and thus the kappa will be low. Secondly, in Table 2 it can be seen that twenty-nine items (28%) had a sample size below 50 for the calculation of kappa coefficients, of which four (4%) were below 30. For the calculation of percentage agreement, thirty-five items (34%) had a sample size below 50, of which twenty-nine (28%) were below 30. These percentage agreement and kappa coefficients based on such small numbers should be interpreted with caution.

Table 2 Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COSMIN checklist (COSMIN step 3)
Item nr  Item  N (minus articles with 1 rating)  % agreement  N  Kappa
Box A Internal consistency (n = 195)
A1 Does the scale consist of effect indicators, i.e. is it based on a reflective model?  185  82  193  0.06
Design requirements
A2 Was the percentage of missing items given?  183  87  190  0.48
A3 Was there a description of how missing items were handled?  180  90  187  0.54
A4 Was the sample size included in the internal consistency analysis adequate?  177  87  185  0.06
A5 Was the unidimensionality of the scale checked? i.e. was factor analysis or IRT model applied?  180  92  187  0.69
A6 Was the sample size included in the unidimensionality analysis adequate?  166  79  178  0.27
A7 Was an internal consistency statistic calculated for each (unidimensional) (sub)scale separately?  179  85  187  0.31
A8 (c,d) Were there any important flaws in the design or methods of the study?  174  86  179  0.22
Statistical methods
A9 (d,e) for Classical Test Theory (CTT): Was Cronbach's alpha calculated?  179  93  187  0.27
A10 (d,e) for dichotomous scores: Was Cronbach's alpha or KR-20 calculated?  151  91  165  0.17
A11 (d,e) for IRT: Was a goodness of fit statistic at a global level calculated? e.g. chi-square, reliability coefficient of estimated latent trait value (index of (subject or item) separation)  154  93  167  0.46
Box B. Reliability (n = 141)
Design requirements
B1 Was the percentage of missing items given?
129 87 140 0.39 c d B2 Was there a description of how missing items were handled? 125 91 137 0.43 B3 Was the sample size included in the analysis adequate? 127 77 139 0.35 c d B4 Were at least two measurements available? 129 98 140 0.72 B5 Were the administrations independent? 129 73 139 0.18 c d B6 Was the time interval stated? 125 94 136 0.50 B7 Were patients stable in the interim period on the construct to be measured? 126 75 138 0.24 B8 Was the time interval appropriate? 125 84 137 0.45 B9 Were the test conditions similar for both measurements? e.g. type of administration, 127 83 138 0.30 environment, instructions B10 Were there any important flaws in the design or methods of the study? 117 77 129 0.08 Statistical methods B11 for continuous scores: Was an intraclass correlation coefficient (ICC) calculated? 119 86 133 0.59 B12 for dichotomous/nominal/ordinal scores: Was kappa calculated? 111 81 127 0.32 B13 for ordinal scores: Was a weighted kappa calculated? 111 83 127 0.42 B14 for ordinal scores: Was the weighting scheme described? e.g. linear, quadratic 108 81 124 0.35 Box D. Content validity (n = 83) Design requirements D1 Was there an assessment of whether all items refer to relevant aspects of the construct to 62 79 83 0.33 be measured? D2 Was there an assessment of whether all items are relevant for the study population? (e.g. 62 76 83 0.46 age, gender, disease characteristics, country, setting) D3 Was there an assessment of whether all items are relevant for the purpose of the 62 66 83 0.21 measurement instrument? (discriminative, evaluative, and/or predictive) D4 Was there an assessment of whether all items together comprehensively reflect the 62 66 83 0.15 construct to be measured? D5 Were there any important flaws in the design or methods of the study? 58 76 78 0.13 Box E. Structural validity (n = 118) E1 Does the scale consist of effect indicators, i.e. is it based on a reflective model? 99 78 116 0 Design requirements Mokkink et al. BMC Medical Research Methodology 2010, 10:82 Page 6 of 11 http://www.biomedcentral.com/1471-2288/10/82 Table 2: Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COS- MIN checklist (COSMIN step 3) (Continued) E2 Was the percentage of missing items given? 95 87 110 0.41 E3 Was there a description of how missing items were handled? 93 91 109 0.55 E4 Was the sample size included in the analysis adequate? 94 87 109 0.56 E5 Were there any important flaws in the design or methods of the study? 89 84 103 0.27 Statistical methods d,e E6 for CTT: Was exploratory or confirmatory factor analysis performed? 92 90 106 0.51 e,f E7 for IRT: Were IRT tests for determining the (uni-) dimensionality of the items performed? 62 87 80 0.39 Box F. Hypotheses testing (n = 170) Design requirements F1 Was the percentage of missing items given? 158 87 168 0.41 c d F2 Was there a description of how missing items were handled? 159 92 169 0.60 F3 Was the sample size included in the analysis adequate? 157 84 167 0.12 F4 Were hypotheses regarding correlations or mean differences formulated a priori (i.e. before 158 74 168 0.42 data collection)? F5 Was the expected direction of correlations or mean differences included in the 159 75 169 0.26 hypotheses? F6 Was the expected absolute or relative magnitude of correlations or mean differences 159 82 168 0.29 included in the hypotheses? F7 for convergent validity: Was an adequate description provided of the comparator 125 83 136 0.30 instrument(s)? 
F8 for convergent validity: Were the measurement properties of the comparator instrument(s) 124 81 135 0.35 adequately described? F9 Were there any important flaws in the design or methods of the study? 131 81 145 0.17 Statistical methods d,e, F10 Were design and statistical methods adequate for the hypotheses to be tested? 150 78 161 0.00 Box G. Cross-cultural validity (n = 33) Design requirements G1 Was the percentage of missing items given? 25 88 32 0.52 G2 Was there a description of how missing items were handled? 22 82 30 0.32 G3 Was the sample size included in the analysis adequate? 26 81 33 0.23 c d G4 Were both the original language in which the HR-PRO instrument was developed, and the 28 89 33 0.34 language in which the HR-PRO instrument was translated described? G5 Was the expertise of the people involved in the translation process adequately described? 28 86 33 0.46 e.g. expertise in the disease(s) involved, expertise in the construct to be measured, expertise in both languages G6 Did the translators work independently from each other? 28 89 33 0.61 G7 Were items translated forward and backward? 28 100 33 1.00 G8 Was there an adequate description of how differences between the original and translated 28 86 33 0.50 versions were resolved? G9 Was the translation reviewed by a committee (e.g. original developers)? 25 88 31 0.56 G10 Was the HR-PRO instrument pre-tested (e.g. cognitive interviews) to check interpretation, 21 90 29 0.61 cultural relevance of the translation, and ease of comprehension? c f G11 Was the sample used in the pre-test adequately described? 28 79 32 0 G12 Were the samples similar for all characteristics except language and/or cultural 26 81 31 0.41 background? G13 Were there any important flaws in the design or methods of the study? 26 85 31 0.42 Statistical methods e,f G14 for CTT: Was confirmatory factor analysis performed? 27 74 32 0.03 e,f G15 for IRT: Was differential item function (DIF) between language groups assessed? 13 77 23 0.28 Mokkink et al. BMC Medical Research Methodology 2010, 10:82 Page 7 of 11 http://www.biomedcentral.com/1471-2288/10/82 Table 2: Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COS- MIN checklist (COSMIN step 3) (Continued) Box H. Criterion validity (n = 57) Design requirements c d H1 Was the percentage of missing items given? 35 91 56 0.59 c d H2 Was there a description of how missing items were handled? 35 97 56 0.79 H3 Was the sample size included in the analysis adequate? 35 69 54 0.06 H4 Can the criterion used or employed be considered as a reasonable ‘gold standard’?37 62570 H5 Were there any important flaws in the design or methods of the study? 33 79 54 0.10 Statistical methods H6 for continuous scores: Were correlations, or the area under the receiver operating curve 37 78 56 0.16 calculated? e,f H7 for dichotomous scores: Were sensitivity and specificity determined? 29 83 47 0.28 Box I. Responsiviness (n = 79) Design requirements c d I1 Was the percentage of missing items given? 71 82 76 0.14 c d I2 Was there a description of how missing items were handled? 73 92 77 0.36 I3 Was the sample size included in the analysis adequate? 72 72 76 0.40 c d I4 Was a longitudinal design with at least two measurement used? 73 100 78 1.00 c d I5 Was the time interval stated? 73 89 78 0.25 I6 If anything occurred in the interim period (e.g. intervention, other relevant events), was it 72 78 75 0.17 adequately described? c d I7 Was a proportion of the patients changed (i.e. 
improvement or deterioration)?  70  97  73  0.32
Design requirements for hypotheses testing (for constructs for which a gold standard was not available)
I8 Were hypotheses about changes in scores formulated a priori (i.e. before data collection)?  65  69  72  0.35
I9 Was the expected direction of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses?  60  78  65  0.19
I10 (d,e) Were the expected absolute or relative magnitude of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses?  61  90  66  0.05
I11 (c,f) Was an adequate description provided of the comparator instrument(s)?  56  70  63  0
I12 Were the measurement properties of the comparator instrument(s) adequately described?  56  80  63  0.06
I13 Were there any important flaws in the design or methods of the study?  63  71  68  0.03
Statistical methods
I14 (e,f) Were design and statistical methods adequate for the hypotheses to be tested?  63  73  67  0.21
Design requirements for comparison to a gold standard (for constructs for which a gold standard was available)
I15 Can the criterion for change be considered as a reasonable 'gold standard'?  21  67  28  0
I16 (c,f) Were there any important flaws in the design or methods of the study?  12  67  21  0
Statistical methods
I17 (e,f) for continuous scores: Were correlations between change scores, or the area under the Receiver Operator Curve (ROC), calculated?  28  79  39  0.47
I18 for dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined?  28  79  37  0.15
Box J. Interpretability (n = 42)
J1 Was the percentage of missing items given?  22  95  41  0.80
J2 Was there a description of how missing items were handled?  21  76  41  0.19
J3 Was the sample size included in the analysis adequate?  23  74  41  0
J4 Was the distribution of the (total) scores in the study sample described?  23  74  41  0.08
J5 Was the percentage of the respondents who had the lowest possible (total) score described?  20  95  40  0.84
J6 Was the percentage of the respondents who had the highest possible (total) score described?  21  90  41  0.70
J7 Were scores and change scores (i.e. means and SD) presented for relevant (sub)groups? e.g. for normative groups, subgroups of patients, or the general population  21  76  41  0.05
J8 (c,d) Was the minimal important change (MIC) or the minimal important difference (MID) determined?  19  89  40  0.26
J9 (c,f) Were there any important flaws in the design or methods of the study?  21  71  41  0
Notes: (a) when calculating percentage agreement, articles that were only scored once on the particular item were not taken into account; (b) number of times a box was evaluated; (c) dichotomous item; (d) item with low dispersal, i.e. more than 75% of the raters who responded to the item rated the same response category; (e) combined kappa coefficient calculated because of a nominal response scale in a one-way design; (f) negative variance component in the calculation of kappa was set at 0; sample sizes of the Generalisability box are much higher than for other items, because scores of the items of the Generalisability box for all measurement properties were combined; printed in bold indicates kappa > 0.70 or % agreement > 80%.
Table 3 Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COSMIN checklist (COSMIN step 4)
Generalisability box (n = 866). Was the sample in which the HR-PRO instrument was evaluated adequately described? In terms of:
Item nr  Item  N (minus articles with 1 rating)  % agreement  N  Kappa
1 median or mean age (with standard deviation or range)?  733  86  865  0.36
2 (d,e) distribution of sex?  735  88  863  0.38
3 important disease characteristics (e.g. severity, status, duration) and description of treatment?  746  80  862  0.39
4 (d,e) setting(s) in which the study was conducted? e.g. general population, primary care or hospital/rehabilitation care  735  89  863  0.30
5 (d,e) countries in which the study was conducted?  733  90  861  0.40
6 (d,e) language in which the HR-PRO instrument was evaluated?  733  86  861  0.41
7 Was the method used to select patients adequately described? e.g. convenience, consecutive, or random  729  81  857  0.40
8 Was the percentage of missing responses (response rate) acceptable?  724  82  849  0.48
Notes: (a) when calculating percentage agreement, articles that were only scored once on the particular item were not taken into account; (b) number of times a box was evaluated; sample sizes of the Generalisability box are much higher than for other items, because scores of the items of the Generalisability box for all measurement properties were combined; (d) dichotomous item; (e) item with low dispersal, i.e. more than 75% of the raters who responded to the item rated the same response category; combined kappa coefficient calculated because of a nominal response scale in a one-way design; printed in bold indicates kappa > 0.70 or % agreement > 80%.

Discussion
In this study we investigated the inter-rater agreement and reliability of the item scores of the COSMIN checklist. Overall, the percentages of agreement were high, indicating that raters often chose the same response option. The kappa coefficients were low, indicating that it is difficult to distinguish between articles at item level. We will start the discussion with reasons for low kappa coefficients and for low percentages of agreement.

Although the term inter-rater agreement does not appear in the COSMIN taxonomy [8], we used it in this study. For measurement instruments that have continuous scores the measurement error can be investigated. However, instruments with a nominal or ordinal score do not have a unit of measurement, and consequently, measurement error cannot be calculated. Because we were interested in whether the ratings were similar, we present the percentage agreement of all nominal and ordinal items.

Reasons for low kappa coefficients
Kappa coefficients for 70 of the 114 items were poor. This is partly due to a skewed distribution of the item scores. Low dispersal rates strongly influence the kappa, because if the variance between articles is low, the error variance is large in relation to the article variance. For example, item I5 of the box Responsiveness (i.e. was the time interval stated) had a kappa of 0.25; 65 times raters scored "yes" (83%), and 13 times they scored "no" (17%).
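This effect can be reproduced numerically. The simulation below uses illustrative parameters only, not data from this study, and relies on the percentage_agreement and one_way_intraclass_kappa helpers sketched in the Methods section (place them in the same file): it generates an item that is almost always answered 'yes', much like item I5, and shows that agreement stays high while the one-way kappa stays low.

```python
import random

random.seed(2010)  # any fixed seed keeps the toy example reproducible

ratings = []
for article in range(40):                          # 40 articles, 2-4 raters each
    n_raters = random.choice([2, 3, 4])
    # most articles clearly state the time interval, a few are ambiguous
    p_yes = 0.9 if random.random() < 0.8 else 0.6
    for _ in range(n_raters):
        ratings.append((article, 1 if random.random() < p_yes else 0))

print("percentage agreement:", round(percentage_agreement(ratings), 1))          # usually well above 80
print("one-way intraclass kappa:", round(one_way_intraclass_kappa(ratings), 2))  # usually below 0.40
```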
Reasons for low inter-rater agreement between raters
Percentage agreement was below 80% for 37 of the 114 items. For many items of the COSMIN checklist a subjective judgement is needed. For example, each box includes the item 'were there any important flaws in the design or methods of the study' (e.g. B10, I13, I16 and J9). To answer this question, the rater should judge this based on his or her own experience and knowledge. Therefore, some kind of subjective evaluation is involved. Some other items might be rather difficult to score, because the information needed to answer the item is not reported in the article. For example, the information needed to respond to the item 'were the administrations independent' (B5) is often not reported. Although raters should score '?' in this case, raters are likely to guess, or to skip these items. This influences the kappa coefficients and the percentage agreement.

Furthermore, the COSMIN checklist contains consensus-based standards that may deviate from how persons are used to evaluating measurement properties, or a person may disagree with a particular item. Consequently, a rater may score an item differently than recommended in the COSMIN manual. For example, many people consider effect sizes to be appropriate measures for responsiveness. Within the COSMIN Delphi study, we decided to consider this as inappropriate [9]. We believe that only when clear hypotheses are formulated about the expected magnitude of the effect sizes (ES) is it appropriate as an indicator of responsiveness (I14). Another example is the issue of the gold standard. The COSMIN panel considered a commonly used measurement instrument, such as the SF-36, not to be a reasonable gold standard. However, raters may disagree with this, and rate the item 'can the criterion (for change) be considered as a reasonable gold standard' (H4 and I15) as 'yes', while according to the COSMIN manual this item should be scored with 'no'. Consequently, the kappa coefficient and the percentage agreement will be low.

Last, the distinction between rating the methodological quality of the study and rating the quality of the instrument that is evaluated in the study may be difficult, especially for content validity. Therefore, the items on content validity are difficult to score. All items of box D on content validity had low kappa coefficients and percentage agreement. They ask whether the article under study appropriately investigated whether the items were relevant and comprehensive. This refers to the methodological quality of a study. For example, an appropriate method to investigate the content validity of a HR-PRO is to involve patients from the target population, by asking them about the relevance and comprehensiveness of the items. These COSMIN items do not ask whether the items of the PRO under study are relevant and comprehensive, which refers to the quality of an instrument. Raters may have been confused about this distinction.

Strengths and weaknesses of the study
We are confident that the raters who participated in this study are representative of the future users of the COSMIN checklist, since the number of years of experience in research varied widely. We used a wide range of articles that are likely to be a representative sample of articles on measurement properties. The distribution of many articles over many raters (no pairs, no ordering) enhances the generalisability of our results and leads to conservative estimates. Also, we did not intervene beyond the delivery of the checklist and the instruction manual. In all, the study should be seen as very similar to the usual conditions of its use.

It was our aim to randomly select equal numbers of studies on each measurement property. However, studies on internal consistency and hypotheses testing are more common than studies on measurement error and interpretability, and studies based on CTT are more common than studies that apply IRT methods. Consequently, these less common measurement properties were less often selected for this study. This prevented analysis of the items on measurement error and on IRT analysis.

In addition, it was our aim to include a representative sample of potential users of the COSMIN checklist. As expected, the years of experience of the participants in this study, both in research in general and in research on measurement instruments, differed widely. Although more than half of the raters came from the Netherlands, we do not expect that the country of origin will have a major influence on the results.

In this study it was not feasible to train the raters, because we expected that this would dramatically decrease the response rate. However, we recommend getting some experience in completing the COSMIN checklist before conducting a systematic review. In the future, when more raters are trained in completing the checklist, a reliability study among trained raters could be performed.

Due to the incomplete study design (i.e. not all raters scored all articles, and in an article not all measurement properties are evaluated) we had a one-way design. Therefore, the variance due to raters could not be distinguished from the error variance. Other possible designs would be asking a few raters to evaluate many articles, or asking many raters to evaluate the same few articles. Both designs were considered poor. In the first case, it is likely that we would not find participants, due to the large amount of work each rater would have to do. We felt that we as authors of the COSMIN checklist should not be these raters, because of our involvement in the development of the checklist. The second design was considered poor because we would have to include a few articles in which all measurement properties were evaluated. It is very likely that such articles do not exist, and if such an article were published, it is very likely that it would not be a good representation of studies on measurement properties.
Recommendations for improvement of the inter-rater agreement and reliability of the COSMIN checklist
Firstly, based on the results of this study and the feedback we received from raters, we improved the wording and grammar of a few items and we adapted the instructions in the manual. This might improve the agreement on the COSMIN item scores. Secondly, the COSMIN checklist is not a ready-made checklist, in the sense that the user can instantly complete all items. We recommend that researchers who use the COSMIN checklist, for example in a systematic review, agree beforehand on how to handle items that need a subjective judgement, and on how to deal with lack of reporting in the original article. For example, based on the topic of the review, they should agree on what they consider an appropriate time interval for reliability (B8), on an adequate description of the comparator instrument(s) (F7 and I11), or on an acceptable percentage of missing responses (item 8 of the Generalisability box). This may also increase the inter-rater agreement. Thirdly, some experience in completing the checklist before conducting a systematic review is also likely to increase the inter-rater agreement of the COSMIN checklist. Therefore, we are developing a training set of articles (to be published on our website), explaining how these articles should be evaluated using the COSMIN checklist. Fourthly, we strongly recommend using the taxonomy and terminology of the COSMIN checklist. For example, if authors compare their PRO to a commonly used PRO such as the SF-36, and they refer to this as criterion validity, we recommend considering this an evaluation of hypotheses testing, which is an aspect of construct validity, and completing box F. Fifthly, when using the checklist in a systematic review of HR-PROs, we recommend completing the checklist by at least two independent raters, and reaching consensus on one final rating. In this study we used the ratings of single raters to determine the inter-rater agreement of the checklist, because a design with consensus scores of two raters was not feasible. We recommend evaluating the inter-rater agreement of the consensus scores of pairs of raters in a future study, when more raters are trained.

Note that in this study we investigated the inter-rater agreement and reliability at item level. The results showed that it is difficult to distinguish articles at item level. When using the COSMIN checklist in a systematic review on measurement properties, an overall score per box is useful to decide whether the methodological quality can be considered good. For such a score, the reliability might be better.

Reliability of other checklists
We found three studies in which the inter-rater agreement and reliability of a similar kind of checklist were investigated.

In one study the reliability of a 39-item appraisal tool to evaluate PRO instruments (EMPRO) [10] was investigated. In this study five panels (in which three or four raters participated) each assessed the quality of the Spanish version of one well-known and widely used PRO instrument. Intraclass correlation coefficients (two-way model, absolute agreement) were calculated for the overall assessment of the quality of the score. High ICCs were found (all above 0.75) [10]. COSMIN and EMPRO both focus on PROs. However, with the COSMIN checklist it is not yet possible to calculate an overall score per box or an overall score for the quality of all measurement properties together. In addition, EMPRO assesses the overall quality of a measurement instrument, while COSMIN assesses the methodological quality of studies on measurement properties.

In two other studies, two independent raters scored a number of articles using either the STAndards for the Reporting of Diagnostic accuracy studies (STARD) [11] or the Nelson-Moberg Expanded CONSORT Instrument (NMECI) [12]. Both studies reported percentage agreement and kappa coefficients. In the study by Smidt et al. [11], percentage agreement was between 63% and 100%, and kappa coefficients were between -0.032 and 1.00. About the same percentage of items as in COSMIN (61% of the STARD items) showed high percentage agreement (i.e. above 80%). However, more items had higher kappa coefficients, i.e. 23% of the STARD items showed excellent kappa coefficients (i.e. above 0.70). In the study by Moberg-Mogren & Nelson [12], 77% of the CONSORT items showed a high ICC (i.e. above 0.70), and 57% of the NMECI items showed high kappa coefficients (i.e. above 0.70). Of the NMECI items, 29 of the 176 kappa coefficients were below 0.40; for these items they also reported percentage agreement, ranging between 43% and 93%. CONSORT and NMECI items had higher values for reliability than the COSMIN items.
Conclusion
The inter-rater agreement of the COSMIN items was adequate, i.e. raters mostly rated the items of the COSMIN checklist in the same way. The inter-rater reliability of the COSMIN items was poor for many items; it was difficult to distinguish between articles at item level. Some disagreements between raters are likely to be influenced by the subjective judgement needed to answer an item. Therefore, we recommend making decisions in advance about how to score these issues. The inter-rater agreement on other items may have improved after this study, since we have tried to improve the instructions in the manual on some issues, based on the feedback of raters. When using the COSMIN checklist it is important to read the manual carefully, and to get some training and experience in completing the checklist.

Acknowledgements
We are grateful to all the participants of the COSMIN inter-rater reliability study: Femke Abma, Gwenda Albers, Jagath Amarasehera, Adri Apeldoorn, Ingrid Arévalo Rodríguez, Susan Armijo Olivo, Geert Aufdemkampe, Ruth Barclay-Goddard, Ilse Beljouw, Sandra Beurskens, Michiel de Boer, Sandra Bot, Han Boter, Laurien Buffart, Mauro Carone, Oren Cheifetz, Bert Chesworth, Anne Christie, Heather Christie, Heather Colguhoun, Janet Copeland, Dominique Dubois, Michael Echteld, Roy Elbers, Willem Eijzenga, Antonio Escobar, Brigitte Essers, Marie Louise Essink-Bot, Teake Ettema, Silvia Evers, Wouter van de Fliert, Jorge Fuentes, Carlos Garcia Forero, Fania Gartner, Claudia Gorecki, Francis Guillemin, Alice Hammink, Graeme Hawthorne, Nick Henschke, Kelvin Jordan, Sophia Kramer, Joke Korevaar, Hilde Lamberts, Henrik Lauridsen, Hanneke van der Lee, Tim Lucket, Han Marinus, Belle van der Meer, Henk Mokkink, Paola Mosconi, Sara Muller, Ricky Mullis, Joanneke van der Nagel, Rinske Nijland, Ruth van Nispen, Jan Passchier, George Peat, Hein Raat, Luis Rajmil, Bryce Reeve, Leo Roorda, Sabine Roos, Nancy Salbach, Jasper Schellingerhout, Wouter Schuller, Hanneke Schuurmans, Jane Scott, Jos Smeets, Antonette Smelt, Kevin Smith, Eric van Sonderen, Alan Stanton, Ben Steenkiste, Raymond Swinkels, Fred Tromp, Joan Trujols, Arianne Verhagen, Gemma Vilagut Saiz, Torquil Watt, Adrian Wenban, Daniëlle van der Windt, Harriet Wittink, Virginia Wright, and Carlijn van der Zee.
This study was financially supported by the EMGO Institute, VU University Medical Center, Amsterdam, the Netherlands, and the Anna Foundation, Leiden, the Netherlands.
Author details
Department of Epidemiology and Biostatistics and the EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, The Netherlands. Department of Public Health, Patient-reported Outcome Measurement Group, University of Oxford, Oxford, UK. School of Rehabilitation Science and Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Canada. Health Services Research Unit, IMIM-Institut de Recerca Hospital del Mar, Parc de Salud Mar de Barcelona, Spain. CIBER en Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain. Department of Health Services, University of Washington, Seattle, USA. Executive Board of VU University Amsterdam, Amsterdam, The Netherlands.

Authors' contributions
LB, CT and HdV secured funding for the study. CT, HdV, LB, DK, DP, JA, PS, and EG conceived the idea for the study. EG prepared the database and LM selected the articles. All authors invited potential raters. LM coordinated the study and managed the data. LM, CT, DK and HdV interpreted the data. CT, EG, DP, JA, PS, DK, LB and HdV supervised the study. LM wrote the manuscript with input from all the authors. All authors read and approved the final version of the report.

Competing interests
The authors, except for E. Gibbons, were the developers of the COSMIN checklist.

Received: 23 June 2010. Accepted: 22 September 2010. Published: 22 September 2010.

References
1. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, De Vet HCW: The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res 2010, 19:539-549.
2. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, De Vet HCW: The COSMIN checklist manual. [http://www.cosmin.nl].
3. Landis JR, Koch GG: A one-way components of variance model for categorical data. Biometrics 1977, 33:671-679.
4. Kraemer HC, Periyakoil VS, Noda A: Tutorial in biostatistics: Kappa coefficients in medical research. Stat Med 2002, 21:2109-2129.
5. Lin L, Hedayat AS, Wu W: A unified approach for assessing agreement for continuous and categorical data. J Biopharm Stat 2007, 17:629-652.
6. Fleiss JL: Statistical methods for rates and proportions. New York: John Wiley & Sons; 1981.
7. Vach W: The dependence of Cohen's kappa on the prevalence does not matter. J Clin Epidemiol 2005, 58:655-661.
8. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, de Vet HC: The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol 2010, 63:737-745.
9. Mokkink LB, Terwee CB, Knol DL, Stratford PW, Alonso J, Patrick DL, Bouter LM, De Vet HCW: The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: a clarification of its content. BMC Med Res Methodol 2010, 10:22.
10. Valderas JM, Ferrer M, Mendivil J, Garin O, Rajmil L, Herdman M, Alonso J: Development of EMPRO: a tool for the standardized assessment of patient-reported outcome measures. Value Health 2008, 11:700-708.
11. Smidt N, Rutjes AW, Van der Windt DA, Ostelo RW, Bossuyt PM, Reitsma JB, Bouter LM, De Vet HCW: Reproducibility of the STARD checklist: an instrument to assess the quality of reporting of diagnostic accuracy studies. BMC Med Res Methodol 2006, 6:12.
12. Moberg-Mogren E, Nelson DL: Research concepts in clinical scholarship: evaluating the quality of reporting occupational therapy randomized controlled trials by expanding the CONSORT criteria. Am J Occup Ther 2006, 60:226-235.

Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/10/82/prepub

doi:10.1186/1471-2288-10-82
Cite this article as: Mokkink et al.: Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) Checklist. BMC Medical Research Methodology 2010, 10:82.
