This guidance describes how the FDA evaluates patient-reported outcome (PRO) instruments used as effectiveness endpoints in clinical trials. It also describes our current thinking on how sponsors can develop and use study results measured by PRO instruments to support claims in approved product labeling (see appendix point 1). It does not address the use of PRO instruments for purposes beyond evaluation of claims made about a drug or medical product in its labeling. By explicitly addressing the review issues identified in this guidance, sponsors can increase the efficiency of their endpoint discussions with the FDA during the product development process, streamline the FDA's review of PRO endpoint adequacy, and provide optimal information about the patient's perspective of treatment benefit at the time of product approval. A PRO is a measurement of any aspect of a patient's health status that comes directly from the patient (i.e., without the interpretation of the patient's responses by a physician or anyone else). In clinical trials, a PRO instrument can be used to measure the impact of an intervention on one or more aspects of patients' health status, hereafter referred to as PRO concepts, ranging from the purely symptomatic (e.g., response of a headache) to more complex concepts (e.g., ability to carry out activities of daily living), to extremely complex concepts such as quality of life, which is widely understood to be a multidomain concept with physical, psychological, and social components. Data generated by a PRO instrument can provide evidence of a treatment benefit from the patient perspective. For these data to be meaningful, however, there should be evidence that the PRO instrument effectively measures the particular concept that is studied.
Generally, findings measured Page 1 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 by PRO instruments may be used to support claims in approved product labeling if the claims are derived from adequate and well-controlled investigations that use PRO instruments that reliably and validly measure the specific concepts at issue. The glossary defines many of the terms used in this guidance. In particular, the term instrument refers to the actual questions or items contained in a questionnaire or interview schedule along with all the additional information and documentation that supports the use of these items in producing a PRO measure (e.g., interviewer training and instructions, scoring and interpretation manual). The term conceptual framework refers to how items are grouped according to subconcepts or domains (e.g., the item walking without help may be grouped with another item, walking with difficulty, within the domain of ambulation, and ambulation may be further grouped into the concept of physical ability). FDA's guidance documents, including this guidance, do not establish legally enforceable responsibilities. Instead, guidance documents describe the Agency's current thinking on a topic and should be viewed only as recommendations, unless specific regulatory or statutory requirements are cited. The use of the word should in Agency guidance documents means that something is suggested or recommended but not required. First publication of the Draft Guidance by the Food and Drug Administration- February 2006. complex claim. For example, PRO-based evidence of 1. 
Background PRO instruments provide a means for measuring treat- improved symptoms alone generally is not sufficient to ment benefits by capturing concepts related to how a substantiate a claim related to improvement in a patient's patient feels or functions with respect to his or her health ability to function or the patient's psychological state. or condition. The concepts, events, behaviors, or feelings Rather, to substantiate such a general claim, a sponsor measured by PRO instruments can be either readily should develop evidence to show not only a change in observed or verified (e.g., walking) or can be non-observ- symptoms, but how that change translates into other spe- able, known only to the patient and not easily verified cific endpoints such as ability to perform activities of daily (e.g., feeling depressed). Although an assessment of symp- living, or improved psychological state. Accordingly, tom improvement or pertinent function depends on many PRO instruments are specifically designed to assess patient perception, historically these assessments were both symptoms and other possible consequences of treat- often made by physicians who observed and interacted ment. with patients (depression scales, heart failure severity scales, activities of daily living scales). Increasingly, such 2. Patient-reported outcomes – regulatory assessments are based on PRO instruments. The purpose perspective of this guidance is to explain how the FDA evaluates such 2.1 Why use patient-reported outcome instruments in instruments for their usefulness in measuring and charac- medical product development? terizing the benefit of medical product treatment. 
PRO instruments are included in clinical trials for new medical products because (1) some treatment effects are The amount and kind of evidence that the FDA expects to known only to the patient; (2) there is a desire to know support a labeling claim measured by a PRO instrument is the patient perspective about the effectiveness of a treat- the same as that required for any other labeling claim (see ment; or (3) systematic assessment of the patient's per- appendix point 2). As with other labeling claims, the spective may provide valuable information that can be determination of whether the PRO instrument supports lost when that perspective is filtered through a clinician's an effectiveness endpoint includes an assessment of the evaluation of the patient's response to clinical interview ability of the PRO instrument to measure the claimed questions. treatment benefit and is specific to the intended popula- tion and to the characteristics of the condition or disease 2.1.1 Some treatment effects are known only to the patient treated. Endpoints measured by PRO instruments are For some treatment effects, the patient is the only source most often used in support of claims that refer to a of data. For example, pain intensity and pain relief are the patient's symptoms or ability to function. fundamental measures used in the development of anal- gesic products. There are no observable or physical meas- Note, however, that PRO instruments that measure a sim- ures for these concepts. ple concept may not be adequate to substantiate a more Page 2 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 2.1.2 Patients provide a unique perspective on treatment others). 
PRO concepts can be general (e.g., improvement effectiveness in physical function, psychological well-being, or treat- PRO instruments can be developed to measure what ment satisfaction) or specific (e.g., decreased frequency, patients want and expect from their treatment and what is severity, or how bothersome the symptoms are). PRO most important to them. When used to measure study concepts can also be generic (i.e., applicable in a broad endpoints, PRO instruments can augment what is known scope of diseases or conditions as in the case of physical about the product based on the clinician perspective or functioning), condition-specific (e.g., asthma-specific), or physiologic measures. This is important because improve- treatment-specific (e.g., measures of the toxicities of a ments in clinical measures of a condition may not neces- class of drugs such as interferons or opioids). sarily correspond to improvements in how the patient functions or feels. For example, clinically meaningful Some PRO instruments (e.g., health-related quality of life improvements in lung function as measured by spirome- instruments) attempt to measure both the effectiveness try may not correlate well with improvements in asthma- and the side effects of treatment. PRO instruments that are related symptoms and their impact on a patient's ability used in clinical trials to support effectiveness claims to perform daily activities. should measure the adverse consequences of treatment separately from the effectiveness of treatment. 2.1.3 Formal assessment may be more reliable than informal interview The specific attributes of a PRO instrument will affect the Seeking information from patients about their symptoms way it is developed, tested, and incorporated into a study and the impact of those symptoms on function is not new. protocol to support conclusions of treatment benefit. 
In clinical practice, to obtain information known only to Table 1 lists some of the ways that PRO instruments can the patients, clinicians often assess patient status by infor- vary in their objectives, uses, and characteristics. When the mally asking questions such as, "How many pillows do FDA reviews a PRO instrument, our goal is to determine you sleep on?" or, "Do you cough at night?" In clinical tri- whether its characteristics are appropriate and adequate to als, clinical assessments are formalized using specific support the study objectives. questions because a structured interview technique mini- mizes measurement error and ensures consistency. Self- 3. Evaluating pro instruments completed questionnaires that are given directly to The adequacy of a PRO instrument as a measure to sup- patients without the intervention of clinicians are often port medical product claims depends on its developmen- preferable to the clinician-administered interview and rat- tal history and demonstrated measurement properties. ing. Self-completed questionnaires capture directly the Sponsors are encouraged to identify all endpoint meas- patient's perceived response to treatment, without a third urement goals early in product development, before stud- party's interpretation, and may be more reliable than ies are initiated, to provide the basis for product approval observer-reported measures because they are not affected or claim substantiation, allowing adequate time for PRO by interobserver variability (which usually can be reduced instrument identification, modification, or if necessary, only by extensive training of observers). On the other new instrument development. A new PRO instrument can hand, PRO measures may be affected by interpatient vari- be developed or an existing instrument can be modified if ability if the instrument is not easily understood and com- sponsors determine that none is available, adequate, or pleted by patients. 
Despite these concerns, well-developed applicable to their product development program. When and adequately validated PRO instruments have been considering an instrument that has been modified from shown to give answers that match the results obtained by the original, the FDA generally plans to evaluate the mod- the most expert assessors (indeed, that is the usual way ified instrument just as it would a new one. Therefore, in their validity is assessed), and they appear to be particu- such instances, we encourage sponsors to document the larly suitable in studies involving many investigators. original development processes, all modifications made, and updated assessments of its measurement properties. 2.2.1 A taxonomy of PRO instruments PRO instruments measure concepts ranging from the state PRO instrument development, modification, and valida- of discrete symptoms or signs (e.g., pain severity or seizure tion usually occur in a nonlinear fashion with a varying frequency) to the overall state of a condition (e.g., depres- sequence of events, simultaneous processes, or iterations. sion, heart failure, angina, asthma, urinary incontinence, This iterative process is presented as a wheel and spokes dia- or rheumatoid arthritis), where both specific symptoms gram, shown in Figure 1, and discussed in detail in Sec- and the impact of the condition (e.g., on function, activi- tions 3.1. – 3.4. 
One or more parts of the original process ties, or feelings) can be measured, to feelings about the may be repeated in new PRO instrument development, condition or treatment (e.g., worry about getting worse, modification, or change in application of an existing having to avoid certain situations, feeling different from Page 3 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 Table 1: Taxonomy of PROs Used in Clinical Trials Attribute Types Intended use of the measure • To define entry criteria for study populations • To evaluate efficacy • To evaluate adverse events Concepts measured • Overall health status • Symptoms/signs, individually or as a syndrome associated with a medical condition • Functional status (physical, psychological or social) • Health perceptions (e.g., self-rating of health or worry about condition) • Satisfaction with treatment or preference for treatment • Adherence to medical treatment Number of items • Single item for single concept • Multiple items for single concept • Multiple items for multiple domains within a concept Intended measurement population or condition • Generic • Condition-specific • Population-specific Mode of data collection • Interviewer-administered • Self-administered, with or without supervision • Computer-administered or computer-assisted • Interactively administered (e.g., interactive voice response systems or Web-based systems) Timing and frequency of administration • As events occur • At regular intervals throughout a study • Baseline and end of treatment Types of scores • Single rating on a single concept (e.g., pain severity) • Index – single score combining multiple ratings of related domains or independent concepts • Profile – multiple uncombined scores of multiple-related domains • Battery – multiple uncombined scores of independent concepts • Composite – an index, profile, or battery Weighting of items or concepts • All items and 
domains are equally weighted • Items are assigned variable weights • Domains are assigned variable weights Response options • See Table 2 for examples of response options (types of PRO scales) instrument. The following five sections describe the steps on patient interviews along with reviews of the literature usually taken in instrument development. and expert opinion. 3.1 Development of the conceptual framework and If documentation exists that a single item is a reliable and identification of the intended application valid measure of the concept of interest (e.g., pain sever- During the planning of clinical development programs, ity), a one-item PRO instrument may be a reasonable the FDA encourages sponsors to specify what claims they measure to support a claim concerning that concept. If the seek, determine what concepts underlie those claims, and concept of interest is general (e.g., physical function), a then determine whether an adequate PRO instrument single-item PRO instrument is usually unable to provide a exists to assess and measure those concepts. If it doesn't, a complete understanding of the treatment's effect because new PRO instrument can be developed. The typical steps a single item cannot capture all the domains of the general involved in the selection or development of PRO instru- concept. For this reason, single-item questions about gen- ments for endpoints for clinical trials are described in the eral concepts that imply multiple domains rarely provide following sections. sufficient evidence to support claims about that general concept. 
However, single-item questions about general 3.1.1 Identification of concepts and domains that are to be concepts can be useful to help interpret multi-item meas- measured ures of the same concept and to determine whether One fundamental consideration in the development and important items or domains of a general concept are miss- use of a PRO instrument is whether the instrument's con- ing (e.g., when results using single general questions do ceptual framework is appropriate and clearly defined. In not correlate with results using a multi-item question- some cases, of course, the question of what to measure naire, this may be evidence that the questionnaire is not may be obvious given the nature of the condition being capturing all the important domains of the concept con- treated. Generally, however, instrument developers tained in the claim). Evidence from the patient cognitive choose the concepts and domains to be measured based debriefing studies (i.e., the interview schedule, transcript, Page 4 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 i. Identify Concepts and Develop Conceptual Framework Identify concepts and domains that are important to patients. Determine intended population and research application. Hypothesize expected relationships among concepts. ii. Create Instrument iv. Modify Instrument Generate items. Choose administration method, Change concepts measured, recall period, and response scales. populations studied, Draft instructions. research application, PRO Format instrument. instrumentation, Draft procedures for scoring and or method of administration. administration. Pilot test draft instrument. Refine instrument and procedures. iii. Assess Measurement Properties Assess score reliability, validity, and ability to detect change. Evaluate administrative and respondent burden. Add, delete, or revise items. Identify meaningful differences in scores. 
Finalize instrument formats, scoring, procedures, and training materials. The Figure 1 PRO instrument development and modification process The PRO instrument development and modification process. and listing of all concepts elicited by a single item) can be particular attention to the precise claim that is supported used to determine when a concept is adequately captured by the results in the measured concepts or domains. by a single item. Documentation of the instrument development process Multidomain PRO instruments can be used to support should reveal the means by which the domains were iden- claims about a general concept if the PRO instrument has tified and named. This helps substantiate the adequacy of been appropriately developed and validated to measure the measure to support both the general concept and the the important and relevant domains of the general con- named domains. If a sponsor desires to support a claim cept. The complex nature of multidomain PRO instru- based on a portion of a multi-item instrument (a domain ments, however, often raises significant questions about or an item), the development and validation process how to interpret and report results in a way that is not mis- should ensure that the instrument supports the measure- leading. For example, if improvements in a score for a gen- ment of the claimed concept. For example, some broad eral concept (e.g., physical function) is driven by a single health status measures include item lists of symptoms that responsive domain (e.g., symptom improvement) while are summed in an overall score. Individual items that con- other important domains (e.g., physical abilities and tribute to the overall score (e.g., dyspnea) generally would activities of daily living) did not show a response, a gen- not support a dyspnea claim unless the items were devel- eral claim about improvements in physical function oped to measure the claimed concept (e.g., the items val- would not be supported. 
The FDA intends to review all idly and reliably capture the impact of treatment on evidence based on multidomain PRO measurements with dyspnea). Page 5 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 Item A Domain Item B Score 1 Item C Overall Score Item D Domain Item E Score 2 Item F Item G Diagra Figure 2 m of a conceptual framework Diagram of a conceptual framework. For measures of general concepts, the FDA intends to whether the instrument is appropriate to that population review how individual items are associated with each with respect to patient age, sex, ethnic identity, and cogni- other, how items are associated with each domain, and tive ability. Specific measurement considerations posed how domains are associated with each other and the gen- by pediatric, cognitively impaired, or seriously ill patients eral concept of interest. A diagram of the expected rela- are discussed in Section 3.5. tionships among the PRO items and domains can help reviewers evaluate these relationships. The diagram in Fig- 3.2 Creation of the PRO instrument ure 2 depicts a generic example of a conceptual framework When developing a PRO instrument, sponsors are encour- where Domain Score 1, Domain Score 2, and Overall aged to assess its adequacy in the context of the following Score each represent related but separate concepts. Items development processes. in this diagram are aggregated into domains. In some measures, domains can be aggregated into an overall 3.2.1 Generation of items score. These expectations should be specified before the It is important to consider the procedures used to identify validation process begins. the set of items selected to measure a specific concept. 
PRO instrument items can be generated from literature 3.1.2 Identification of the intended application of the PRO instrument reviews, transcripts from focus groups, or interviews with It is also important to consider whether the development patients, clinicians, family members, researchers, or other and demonstrated measurement properties of a PRO sources. Depending on the conceptual framework, the instrument provide an adequate basis for its planned use FDA may review whether appropriate individuals and in the study to support a claim. This is best established sources were used and how information gleaned from before the study commences, but would in any case be those sources was used in the PRO instrument develop- part of the FDA's application review. This is true whether ment process. the PRO instrument is generic, intended for use across multiple applications and populations, or specific, devel- PRO instrument item generation is incomplete without oped for a certain condition or population. The PRO patient involvement. Item generation generally incorpo- instrument can be developed for a variety of roles, includ- rates the input of a wide range of patients with the condi- ing defining trial entry criteria, including excessive sever- tion of interest to represent appropriate variations in ity, evaluating treatment benefit, or monitoring adverse severity and in population characteristics such as age or events. sex. The FDA plans to review instrument development (e.g., results from patient interviews or focus groups) to 3.1.3 Identification of the intended population determine whether adequate numbers of patients have The FDA plans to compare the patient population used in supported the opinion that the specific items in the instru- the PRO instrument development process to the study ment are adequate and appropriate to measure the con- populations enrolled in clinical trials to determine cept. 
Page 6 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 Table 2: Types of Response Options Type Description Visual analog scale (VAS) A line of fixed length (usually 100 mm) with words that anchor the scale at the extreme ends and no words describing intermediate positions. Patients are instructed to place a mark on the line corresponding to their perceived state. These scales often produce a false sense of precision. Anchored or categorized VAS A VAS that has the addition of one or more intermediate marks positioned along the line with reference terms assigned to each mark to help patients identify the locations (e.g., half-way) between the ends of the scale. Likert scale An ordered set of discrete terms or statements from which patients are asked to choose the response that best describes their state or experience. Rating scale A set of numerical categories from which patients are asked to choose the category that best describes their state or experience. The ends of rating scales are anchored with words but the categories do not have labels. Event log Specific events are recorded as they occur using a patient diary or other reporting system (e.g., interactive voice response system) Pictorial scale A set of pictures applied to any of the other types of response options. Pictorial scales are often used in pediatric questionnaires but also have been used for patients with cognitive impairments and for patients who are otherwise unable to speak or write. Checklist Checklists provide a simple choice between a limited set of options, such as Yes, No, and Don't know. Some checklists ask patients to place a mark in a space if the statement in the item is true. Checklists are reviewed for completeness and nonredundancy. 
Items that ask patients to respond hypothetically or that 3.2.3 Choice of the recall period give patients the opportunity to respond on the basis of Sponsors should also evaluate the rationale and the their desired condition rather than on their actual condi- appropriateness of the recall period for a PRO instrument. tion are not recommended. For example, in assessing the To this end, it is important to consider patients' ability to concept performance of daily activities, it is more appropri- accurately recall the information requested as proposed. ate to ask whether or not the respondent performs specific The choice of recall period that is most suitable depends activities (and if so, with how much difficulty) than on the purpose and intended use of the instrument, the whether or not he or she can perform daily activities characteristics of the disease/condition, and the treatment (because patients may report they are able to perform a to be tested. When evaluating PRO-based claims, the FDA task even when they never do so). Of course, it would be intends to review the study protocol to determine what critical to know that each item refers to something that steps were taken to ensure that patients understand the patients actually do. appropriate recall period. If a patient diary or some other form of unsupervised data entry is used, the FDA plans to It is also important to consider all of the item generation review the protocol to determine what measures are taken techniques used, including any theoretical approach used, to ensure that patients make entries according to the study the populations studied, sources of items, selection and design and not, for example, just before a clinic visit when reduction of items, cognitive debriefing interviews, pilot their reports will be collected. 
testing, importance ratings, and quantitative techniques for item evaluation such as factor analysis and item- PRO instruments that require patients to rely on memory, response analysis. especially if they must recall over a period of time, or to average their response over a period of time may threaten 3.2.2 Choice of the data collection method the accuracy of the PRO data. It is usually better to con- Sponsors should consider the method of data collection struct items that ask patients to describe their current state and all procedures and protocols associated with instru- than to ask them to compare their current state with an ment administration, including instructions to interview- earlier period or to attempt to average their experiences ers, instructions for self-administration, instructions for over a period of time. supervising self-administration, case report forms or 3.2.4 Choice of response options examples of electronic PRO instruments, and other spe- cial considerations specific to the mode of administration It is also important to consider whether the response including data quality control procedures. Modes of options are consistent with the purpose and intended use administration include interview, paper-based, electronic, of the PRO instrument. Table 2 describes the types of Web-based, and interactive voice response formats. The response options that are typically used in clinical trials. FDA intends to review the comparability of data obtained when using multiple modes of administration to deter- Response choices are generally considered appropriate mine whether pooling of results from the multiple modes when: is appropriate. Page 7 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 Wording used in responses is clear and appropriate (e.g., any potentially important changes in presentation or for- anchoring a scale using the term normal assumes that mat. 
understand what is normal).
• Responses are appropriate for the intended population. For example, patients with visual impairment may find the VAS difficult to complete.
• Responses offer a clear distinction between choices (e.g., patients may not distinguish between intense and severe if both are offered as response choices to describe their pain).
• Instructions to patients for completing the questionnaire and selecting response options are adequate.
• The number of response options is justified.
• Response options are appropriately ordered and appear to represent equal intervals.
• Response options avoid potential ceiling or floor effects (e.g., introducing more categories to capture worsening or improvement so that fewer patients respond at the top or bottom of the response continuum).
• Response options do not bias the direction of responses (e.g., offering one negative choice, one neutral choice, and two or more positive choices on a scale makes it more likely for patients to respond that they feel or function better).

3.2.5 Evaluation of patient understanding
Sponsors are encouraged to examine the procedures used with patients to determine readability and understanding of the items included in the PRO instrument. The FDA's evaluation of these procedures is likely to include a review of a cognitive debriefing report containing the readability test used, the script used in patient cognitive debriefing interviews, the transcript of the interviews, the analysis of the interview results, and the actions taken to delete or modify an item in response to the cognitive debriefing interview or pilot test results.

3.2.6 Development of format, instructions, and training
PRO study results can vary according to the instructions to patients or the training given to the interviewer or persons supervising PRO data collection. Sponsors should consider all PRO instrument instructions and procedures contained in publications and user manuals provided by developers, including procedures for reviewing completed questionnaires and re-administration to avoid missing data or clarify responses. Other important considerations include the format of the questionnaire and the final wording of PRO instruments as implemented in clinical trials.

Examples of changes that can alter the way that patients respond to the same set of questions include:
• Changing an instrument from paper to electronic format
• Changing the timing of or procedures for PRO instrument administration within the clinic visit
• Changing the order of items or deleting portions of a questionnaire
• Changing the instructions or the placement of instructions within the PRO instrument

It is important that the PRO instrument format used in the clinical trial be consistent with the format that is used in the instrument validation process. Format refers to the exact appearance of the instrument. Instrument format is specific to the mode of administration, including paper and pencil, interviewer-administered or supervised, or electronic data collection. The FDA plans to review the PRO instrument in the format used in the clinical trial case report forms, including the order and numbering of items, the presentation of response options in single response or grid formats, the grouping of items, patterns for skipping questions that are not applicable, and all instructions to patients in the interview schedule or on the questionnaire.

The FDA recommends that the PRO instrument development process include the generation of a user manual that specifies how to incorporate the instrument into a clinical trial in a way that minimizes administrator burden, patient burden, missing data, and poor data quality.

3.2.7 Identification of preliminary scoring of items and domains
For each item, numerical scores are generally assigned to each answer category based on the most appropriate scale of measurement for the item (e.g., nominal, ordinal, interval, or ratio scales). The FDA intends to consider whether a PRO measure conforms to assumptions that the response choices represent appropriate intervals by reviewing distributions of item responses.

A scoring algorithm creates a single score from multiple items. Equally weighted scores for each item are appropriate only when the responses to the items are relatively uncorrelated. Otherwise, the assignment of equal weights will overweight correlated items and underweight independent items. Even when items are uncorrelated, assigning equal weights to each item may overweight certain items if the number of response options or the values associated with response options varies by item.
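
The point in Section 3.2.7 that equal weights overweight correlated items can be illustrated numerically. The sketch below is not part of the guidance: the function name `variance_shares`, the item names, and the six-patient data set are all invented for illustration. It computes each item's share of the variance of a unit-weighted total score as Cov(item, total) / Var(total).

```python
from statistics import mean

def variance_shares(items):
    """Return each item's share of the variance of the unit-weighted total score.

    `items` maps an item name to a list of patient responses (all lists the
    same length). An item's share is Cov(item, total) / Var(total), so the
    shares always sum to 1.
    """
    names = list(items)
    n = len(items[names[0]])
    # Unit-weighted (equal-weight) total score for each patient.
    total = [sum(items[name][i] for name in names) for i in range(n)]
    total_mean = mean(total)
    var_total = sum((t - total_mean) ** 2 for t in total) / (n - 1)
    shares = {}
    for name in names:
        item_mean = mean(items[name])
        cov = sum((x - item_mean) * (t - total_mean)
                  for x, t in zip(items[name], total)) / (n - 1)
        shares[name] = cov / var_total
    return shares

# Hypothetical responses from six patients on a 0-4 scale (invented data).
responses = {
    "walking": [0, 1, 2, 3, 4, 2],
    "stairs": [1, 1, 2, 3, 4, 1],   # closely tracks "walking"
    "sleep": [2, 0, 3, 1, 2, 4],    # nearly unrelated to the mobility items
}
shares = variance_shares(responses)
for name, share in shares.items():
    print(f"{name}: {share:.2f}")
```

In this invented data set the sleep item has the same variance as the walking item, yet it accounts for roughly half as much of the total score's variance, because the two correlated mobility items reinforce each other. Together they account for about 77 percent of the variance of the equally weighted total.
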
The same weighting concerns apply with added complexity when combining domain scores into a single overall score.

When empirically determined patient preference ratings are used to weight items or domains, the FDA also intends to review the composition of samples and the process used to determine the preference weights. Because preference weights are often developed for use in resource allocation (e.g., as in cost-effectiveness analysis that may use predetermined community weights), it is tempting to use those same weights in the clinical trial setting to demonstrate treatment benefit. However, this practice is discouraged unless the relationship of the preference weights to the intended study population is known and found adequate and appropriate.

3.2.8 Assessment of respondent and administrator burden
Undue physical, emotional, or cognitive strain on patients is a burden that will generally decrease the quality and quantity of PRO data. Factors that can contribute to respondent burden include the following:
• Length of questionnaire or interview
• Formatting
• Font size too small to read easily
• New instructions for each item
• Words or sentence structures that require a technical knowledge or developmental level beyond that of the patients in the trials
• Requirement that patients consult records to complete responses
• Privacy of the setting in which the PRO is completed (e.g., not providing a private space for patients to complete questionnaires containing sensitive information about their sexual performance or substance abuse history)
• Inadequate time to complete questionnaires or interviews
• Literacy level too high for population
• Questions that patients are unwilling to answer
• Perception by patients that the interviewer wants or expects a particular response

The degree of respondent burden that is acceptable for instruments in clinical trials depends on the frequency and timing of PRO assessments in a protocol and on the severity of the illness or toxicity of the treatment studied. For example, if the questionnaire contains instructions to skip one or more questions based on responses to a previous question, respondents may fail to understand what is required and make errors in responding or find the assessment too complicated to complete. Sponsors should consider missing data and the refusal rate as possible indications of unacceptable patient burden or inappropriate items or response options.

3.2.9 Confirmation of the conceptual framework and finalization of the instrument
The FDA intends to examine the final version of an instrument in light of its development history, including documentation of the complete list of items generated and the reasons for deleting or modifying items, as illustrated in Table 3. It will be important to determine from empirical data submitted whether the conceptual framework (e.g., the expected relationships between items, domains, and measurement concepts as diagrammed in Figure 2) has been demonstrated.

Table 3: Common Reasons for Changing PRO Instruments During Initial Development
(Item property, followed by the reasons for change or deletion)

Clarity or relevance
• Reported as not relevant by a large segment of the population of interest
• Generates an unacceptably large amount of missing data points
• Generates many questions or requests for clarification from patients as they complete the PRO instrument
• Patients interpret items and responses in a way that is inconsistent with the conceptual framework

Response range
• A high percent of patients respond at the floor (worst end of the response scale) or ceiling (optimal end of the response scale)
• Patients note that none of the response choices apply to them
• Item means are highly skewed

Variability
• All patients give the same answer (i.e., no variance)
• Most patients choose only one of the response choices
• Differences among patients are not detected when important differences are known

Reproducibility
• Unstable scores over time when there is no logical reason for variation from one assessment to the next

Inter-item correlation
• Item uncorrelated with other items in the same concept of interest

Ability to detect change
• Item is nonresponsive (i.e., does not change when there is a known change in the concepts of interest)

Item discrimination
• Item is highly correlated with measures of concepts other than the one it is intended to measure

Redundancy
• Item duplicates information collected with other items that have equal or better measurement properties

3.3 Assessment of measurement properties
The FDA generally intends to review a PRO instrument for reliability, validity, ability to detect change, and interpretability (e.g., minimum important difference). The FDA plans to review the measurement properties that are specific to the documented conceptual framework, confirmed scoring algorithm, administration procedures, and questionnaire format in light of the study population, study design, and statistical analysis plan. The sociodemographic and medical characteristics of any sample used to develop or validate a PRO instrument determine its appropriateness for future clinical study settings. (See Table 4.)

3.3.1 Evaluation of reliability
Because clinical trials involve change over time, the adequacy of a PRO instrument for use in a clinical trial depends on its reliability. Because clinical trials are intended to provide unbiased estimates of true treatment impact, systematic and/or other changes in measurement methods may undermine the purpose of the trial.

Test-retest reliability is the most important type of reliability for PRO instruments used in clinical trials. Test-retest is most informative when the time interval chosen between the test and retest is appropriate for identifying stability in reference to the clinical trial protocol.

Internal consistency reliability, in the absence of test-retest reliability, does not generally constitute sufficient evidence of reliability for clinical trial purposes. When PRO instruments are interviewer-administered, inter-interviewer reproducibility is critical.

3.3.2 Evaluation of validity
The FDA recognizes that the validation of an instrument is an ongoing process and that validity relates to both the instrument itself and how it is used. Sponsors should consider a PRO endpoint for evidence of content-related validity, the instrument's ability to measure the stated concepts, and the instrument's ability to predict future outcomes, as illustrated in Table 4.

If instrument developers expected the instrument to give results for the measured concept similar to those measured by existing PRO or non-PRO measures (e.g., physical or physician-based measures), the FDA is interested in documented demonstration of those relationships to determine whether the instrument convincingly measures that concept and can therefore support a claim about that concept. If developers expected the instrument to discriminate between patient groups (e.g., between patients with different levels of severity), the FDA is interested in evidence that shows the instrument meaningfully discriminates.

In some cases, some types of validity testing are not possible due to the nature of the concept to be measured. In such instances, the FDA generally plans to review the cumulative evidence for the appropriate use of the measure and apply it to the interpretation of clinical study results.

3.3.3 Evaluation of ability to detect change
When a concept is expected to change, the values for the PRO instrument measuring that concept should change. If there is clear evidence that patient experience relative to the concept has changed, but the PRO scores do not change, the validity of the PRO instrument should be questioned. If there is evidence that PRO scores are affected by changes that are not specific to the concept of interest, the validity of the PRO instrument should be questioned.

The ability of an instrument to detect change influences the sample size needed to evaluate the effectiveness of treatment. The extent to which the PRO instrument's ability to detect change varies by important patient subgroups (e.g., sex, race, age, or ethnicity) can affect clinical trial results. It is important to identify any important subgroup differences in ability to detect change so that these differences can be taken into account in assessing results.

3.3.4 Choice of methods for interpretation
The following sections describe some of the methods that have helped sponsors and the FDA interpret clinical trial results based on PRO endpoints.

3.3.4.1 Defining a minimum important difference
Many PRO instruments are able to detect mean changes that are very small; accordingly it is important to consider whether such changes are meaningful. Therefore, it is appropriate for a critical distinction to be made between the mean effect seen (and what effect might be considered important) and a change in an individual that would be considered important, perhaps leading to a definition of a responder. For many widely used measures (pain, treadmill distance, HamD), the ability to show any difference between treatment groups has been considered evidence of a relevant treatment effect. If PRO instruments are to be considered more sensitive than past measures, it can be useful to specify a minimum important difference (MID) as a benchmark for interpreting mean differences. An MID is usually specific to the population under study.

Table 4: Measurement Properties Reviewed for PRO Instruments Used in Clinical Trials
(For each measurement property: the test, what is assessed, and the FDA review considerations)

Reliability
• Test-retest. What is assessed: stability of scores over time when no change has occurred in the concept of interest.
• Internal consistency. What is assessed: whether the items in a domain are intercorrelated, as evidenced by an internal consistency statistic (e.g., coefficient alpha).
• Inter-interviewer reproducibility (for interviewer-administered PROs only). What is assessed: agreement between responses when the PRO is administered by two or more different interviewers.
FDA review considerations: Does the PRO instrument reliably measure the concepts it was designed to measure? Were appropriate reliability tests conducted? What was the quality of the evidence of reliability?

Validity
• Content-related. What is assessed: whether items and response options are relevant and are comprehensive measures of the domain or concept. FDA review considerations: Do items in the verbatim copy of the PRO instrument appear to measure the concepts they are intended to measure in a useful way? Have patients similar to those participating in the clinical trial confirmed the completeness and relevance of all items?
• Ability to measure the concept (also known as construct-related validity; can include tests for discriminant, convergent, and known-groups validity). What is assessed: whether relationships among items, domains, and concepts conform to what is predicted by the conceptual framework for the PRO instrument itself and its validation hypotheses. FDA review considerations: Do observed relationships between the items and domains confirm the hypotheses in the conceptual framework? Do results compare favorably with results from a similar but independent measure? Do results distinguish one group from another based on a prespecified variable that is relevant to the concept of interest?
• Ability to predict future outcomes (also known as predictive validity). What is assessed: whether future events or status can be predicted by changes in the PRO scores. FDA review considerations: Do PRO scores predict subsequent events or outcomes accurately?

Ability to detect change
• Test: includes calculations of effect size and standard error of measurement, among others. What is assessed: whether PRO scores are stable when there is no change in the patient, and the scores change in the predicted direction when there has been a notable change in the patient as evidenced by some effect size statistic. Ability to detect change is always specific to a time interval. FDA review considerations: Has ability to detect change been demonstrated in a comparative trial setting, comparing mean group scores or proportion of patients who experienced a response to the treatment? Has ability to detect change been assessed for the time interval appropriate to study?

Interpretability
• Smallest difference that is considered clinically important; this can be a specified difference (the minimum important difference (MID)) or, in some cases, any detectable difference. The MID is used as a benchmark to interpret mean score differences between treatment arms in a clinical trial. What is assessed: difference in mean score between treatment groups that provides convincing evidence of a treatment benefit. Can be based on experience with the measure using a distribution-based approach, a clinical or nonclinical anchor, an empirical rule, or a combination of approaches. The definition of an MID using a clinical anchor is sometimes called an MCID. FDA review considerations: The FDA is specifically requesting comment on appropriate review of derivation and application of an MID in the clinical trial setting.
• Responder definition: used to identify responders in clinical trials for analyzing differences in the proportion of responders between treatment arms. What is assessed: change in score that would be clear evidence that an individual patient experienced a treatment benefit. Can be based on experience with the measure using a distribution-based approach, a clinical or nonclinical anchor, an empirical rule, or a combination of approaches. FDA review considerations: The FDA is specifically requesting comment on appropriate review of derivation and application of responder definitions when used in clinical trials.

The FDA has reviewed MIDs derived in many ways. Examples include:
• Mapping changes in PRO scores to clinically relevant and important changes in non-PRO measures of treatment outcome in the condition of interest (e.g., when PRO measures of asthma or COPD are mapped to spirometry scores).
• Mapping changes in PRO scores to other PRO scores to arrive at an MID that is appreciable to patients (e.g., when multi-item PROs are mapped to a single question asking the patient to rate his or her global impression of change since the start of treatment). A problem with this approach is that it uses individual rates to reach a conclusion about mean effects. It may be more useful to look at the distribution of individual effects in treatment and control groups.
• Using a distribution-based approach (e.g., defining the MID as 0.5 times the standard deviation). This, of course, may bear no relation to the patient's assessment and is usually inadequate in isolation.
• Using an empirical rule (e.g., 8 percent of the theoretical range of scores). Again, this arbitrary approach does not take into account patient preferences or assessment.

If an MID is to be applied to clinical study results, it is generally helpful to use a variety of methods to discover whether concordance among methods confirms the choice of an MID (see appendix point 3).
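
The derivation methods listed in Section 3.3.4.1 can be compared side by side before an MID is chosen. The following is a minimal sketch under stated assumptions, not an FDA-endorsed procedure: the function names, the 0 to 100 score range, the anchor category "a little better", and every data value are hypothetical. The anchor-based function stands in for the mapping of PRO score changes to a global impression of change; the 0.5 standard deviation and 8 percent figures are the examples given in the text.

```python
from statistics import mean, stdev

def mid_distribution_based(baseline_scores, fraction=0.5):
    """Distribution-based candidate: a fraction of the baseline standard deviation."""
    return fraction * stdev(baseline_scores)

def mid_empirical_rule(score_min, score_max, percent=8.0):
    """Empirical-rule candidate: a fixed percent of the theoretical score range."""
    return (score_max - score_min) * percent / 100.0

def mid_anchor_based(score_changes, anchor_ratings, anchor_level):
    """Anchor-based candidate: mean score change among patients whose global
    rating of change equals `anchor_level`."""
    selected = [c for c, r in zip(score_changes, anchor_ratings) if r == anchor_level]
    return mean(selected)

# Hypothetical data on a 0-100 scale (all values invented for illustration).
baseline = [40, 50, 60, 45, 55, 50, 35, 65]
changes = [2, 6, 7, 0, 12, 5, -1, 6]
ratings = ["no change", "a little better", "a little better", "no change",
           "much better", "a little better", "no change", "a little better"]

candidates = {
    "distribution_based": mid_distribution_based(baseline),
    "empirical_rule": mid_empirical_rule(0, 100),
    "anchor_based": mid_anchor_based(changes, ratings, "a little better"),
}
for method, value in candidates.items():
    print(f"{method}: {value:.1f}")
```

With these invented numbers the three candidates come out at 5.0, 8.0, and 6.0 points. A spread of this kind is what a concordance check across methods is meant to expose before a single MID is prespecified.
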
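
3.3.4.2 Definition of responders
There may be situations where it is more reasonable to characterize the meaningfulness of an individual's response to treatment than a group's response, and there may be interest in characterizing an individual patient as a responder to treatment, based upon prespecified criteria backed by empirically derived evidence supporting the responder definition as a measure of benefit. Such examples include categorizing a patient as a responder based upon a prespecified change from baseline on one or more scales; a change in score of a certain size or greater (e.g., a 2-point change on an 8-point scale); or a percent change from baseline (see appendix point 4).

3.4 Modification of an existing instrument
When a PRO instrument is modified, additional validation studies may be needed to confirm the adequacy of the modified instrument's measurement properties. The extent of additional validation recommended depends on the type of modification made. For example, small nonrandomized studies may be adequate to assess the results of changing a response scale from vertical to horizontal. On the other hand, if the PRO instrument is to be used in an entirely new population of patients, a small randomized study to ascertain the measurement properties in the new population may minimize the risk that the instrument will not perform adequately in a phase 3 study.

The FDA intends to consider a modified instrument as a different instrument from the original and will consider measurement properties to be version-specific. The FDA recommends additional validation to support the development of a modified PRO instrument when one or more of the following modifications occur.

3.4.1 Revised measurement concept
An instrument that is developed and validated to measure one concept is used to measure a different concept. For example:
• A single domain from a multiple domain PRO is administered without the other domains
• Response options are changed to assess a different quality (e.g., frequency versus how bothersome)
• An index or composite score is used to summarize multiple PRO concepts/domains when existing validation applies only to concept/domain-specific scores
• Items from an existing PRO instrument are used to create a new instrument
• One or more items from an existing instrument are used to support a claim for a concept the items were not developed to measure

3.4.2 Application to a new population or condition
An instrument developed for use in one population or condition is used in a different patient population or condition. For example:
• Patients in the proposed trial have a disease, condition, or severity level that is different from that of the patient population used for instrument development and validation
• Patients in the proposed trial differ in age, gender, race, or developmental or life stage from those for instrument development and validation

3.4.3 Changed item content or instrument format
An instrument is altered in item content or format. This includes changes in the following:
• Number of items (more or fewer) used to assess a concept or domain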
• Wording or placement of instructions
• Wording or order of the items
• Wording, scaling, ordering, or number of response options
• Recall period associated with an item
• Point of reference for comparison for an item or domain
• Weighting of items
• Scoring (including creation of summary scores, subdomain scores, or cut-points)
• Any changes that could alter the patient's interpretation of the instructions, items, or response options

3.4.4 Changed mode of administration
An instrument's data collection mode is altered. For example:
• An interviewer-administered or supervised questionnaire is modified for self-administration (skip patterns can be a problem in this situation)
• Paper-and-pencil self-administered PRO is modified to be administered by computer or other electronic device (e.g., computer adaptive testing, interactive voice response systems, Web-based questionnaire administration, computer)
• Instructions or procedures for administration within a trial differ from those used in validation studies (can alter the meaning of the responses from that of the original version)

3.4.5 Changed culture or language of application
An instrument developed in one language or culture is adapted or translated for use in another language or culture. The FDA recommends that sponsors provide evidence that the methods and results of the translation process were adequate to ensure that the validity of the responses is not affected. Some examples include the following:
• PRO instruments are developed initially in one language, culture, or ethnic group and are used subsequently in another
• PRO instruments developed and validated outside the United States are applied to the U.S. population

Sponsors should consider whether generally accepted standards for translation and cultural adaptation have been used to support the validity of data from a translated/adapted PRO instrument, including but not restricted to the following:
• The background and experience of the persons involved in the translation/adaptation
• The translation/adaptation methodology used
• The harmonization of different versions
• The evidence that measurement properties for translated versions are comparable

3.4.6 Other changes
Other changes to the PRO instrument or the way in which it is assessed that may necessitate additional validation include:
• The PRO instrument was not developed and validated for use in a clinical trial
• A PRO instrument developed and previously used as a stand-alone assessment is included as a part of a battery of measures
• A PRO developed to measure a treatment benefit is subsequently used to measure a decrement as interpreted by a score change in the opposite direction

3.5 Development of PRO instruments for specific populations
Measurement of PRO concepts in children and youth, and in patients who have cognitive impairment, introduces challenges in addition to those already mentioned. These are discussed in the following sections.

3.5.1 Children and youth
In general, the review issues related to the development and validation of pediatric PRO instruments are similar to those detailed for adults. It is important that PRO instruments developed for adults are not used in pediatric populations unless the measurement properties are similar in all age groups tested. We recommend that instruments intended for use in pediatric populations be rigorously developed and validated according to the principles described earlier. Additional review issues for PRO instruments applied in children and youth include age-related vocabulary, language comprehension, comprehension of the health concept measured, and duration of recall. Instrument development and validation testing within fairly narrow age groupings is important to account for developmental differences and to determine the lower age limit at which children can understand the questions and provide reliable and valid responses that can be compared across age categories.

3.5.2 Patients cognitively impaired or unable to communicate
Over the course of some clinical trials, it can be anticipated that patients may become too ill to complete a questionnaire or to respond to an interviewer. In such cases, proxy reporting may help to prevent missing data. When this situation is anticipated, the FDA encourages the inclusion of proxy reports in parallel with patient self-report from the beginning of the study (i.e., even before the patient is no longer able to answer independently) so that the relationship between the patient reports and the proxy reports can be assessed.

4. Study design
The same study design principles that apply to other endpoint measures apply to PROs. This section, therefore, focuses primarily on issues unique to PROs.

4.1 General protocol considerations
If the goal of PRO measurement is to support claims, we recommend that measurement of the PRO concept be clearly stated as a specific study objective. It is important that the protocol include the exact format and version of the specific PRO instrument to be administered. In the process of considering the NDA/BLA/PMA or NDA/BLA/PMA supplement, the FDA intends to compare both the planned and actual use of the PRO instrument and its analysis.

4.1.1 Blinding and randomization
Because responses to PRO measures are subjective, representing a patient's impression, open-label studies, where patients and investigators are aware of assigned therapy, are rarely credible. Patients who know they are in an active treatment group may overestimate benefit while those who know they are not receiving active treatment may underreport any improvement actually experienced. Every effort should be made to assure that patients are masked to treatment assignment throughout the trial. If the treatment has obvious effects, blinding may be difficult. The impact of possible unblinding is important to consider in the interpretation of study results.

The importance of blinding can be determined, in part, by the characteristics of the PRO instrument used. For example, questions that ask how patients' current status compares to baseline seem likely to be more influenced by unblinding (optimism can readily be expressed as a favorable comparison) than questions that ask about current status (which requires a current assessment, not a statement about duration). Questions that ask for current status, or PRO instruments that ask many questions, are harder to answer in a biased way when previous answers are not available. For the same reasons, allowing patients access to previous responses can bias results when unblinding is a possibility. This is, however, an area that could benefit from rigorous study.

There are certain situations, particularly in the development of medical devices, where blinding is not feasible and other situations where there is no reasonable control group (and therefore no randomization). When a PRO instrument appears useful in assessing patient benefit in those situations, the FDA encourages sponsors to confer with the appropriate review division.

4.1.2 Clinical trial quality control
Study quality can be optimized at the design stage by specifying procedures to minimize inconsistencies in trial conduct. Examples of standardized instructions and processes that may appear in the protocol include:
• Standardized training and instructions to patients for self-administered PRO instruments
• Standardized interviewer training and interview format for PRO instruments administered in an interview format
• Standardized instructions for the clinical investigators regarding patient supervision, timing and order of questionnaire administration during or outside the office visit, processes and rules for questionnaire review for completeness, and documentation of how and when data are filed, stored, and transmitted to or from the study site

4.1.3 Designing the trial to avoid data missing due to withdrawal from exposure
Sometimes patients fail to report for visits, fail to complete questionnaires that contain response endpoints, or withdraw from assigned treatment prior to planned completion of a clinical trial without contributing PRO information. The resulting missing data can introduce bias and interfere with the ability to compare effects in the test group with the control group because only a subset of the initial randomized population contributes, and these patient groups may no longer be comparable. Missing data is a major challenge to the success and interpretation of any clinical trial.

The protocol can increase the likelihood that a trial will still be informative by establishing plans for gathering all treatment-related reasons for patients withdrawing from a trial and by trying to minimize patient dropouts prior to trial completion. We recommend the study protocol describe how missing data will be handled in the analysis. It could also establish a process by which PRO measurement is ascertained before or shortly after patient withdrawal from treatment exposure due to lack of efficacy or toxicity.

4.2 Frequency of measurements
The frequency of PRO assessment depends on the natural history of the disease and the nature of the treatment. Some diseases, conditions, or study designs may necessitate more than one baseline assessment and several PRO assessments during treatment. The frequency of PRO assessment should correspond with the demonstrated measurement properties of the instrument and with the planned data analysis.

4.3 Duration of study
It is also important to consider whether the duration of the study is of adequate length to support the proposed claim and assess a durable outcome in the disease or condition being studied. Generally, duration of follow-up with a PRO assessment should be at least as long as for other measures of effectiveness. It should be noted, however, that the study duration appropriate for the PRO-related study objective may not be the same as the study duration for other study endpoints. In a trial for a progressive disease where the PRO concept of interest does not change until after the follow-up required for other clinical efficacy parameters, longer study duration can be indicated.

4.4 Design considerations for multiple endpoints
The hierarchy of endpoints is determined by the stated objectives of the trial and the clinical relevance and importance of each specific measure independently and in relationship to each other. A PRO instrument could be the primary endpoint measure of the study, a co-primary endpoint measure in conjunction with other objective or physician-rated measurements, or a secondary endpoint measure whose analysis would be considered according to a hierarchical sequence. The FDA recommends that the study protocol define the study endpoint measures and the criteria for the statistical analysis and interpretation of results, including a clear specification of the conditions for a positive study conclusion.

4.5 Planning for study interpretation
The FDA recommends that sponsors discuss with the appropriate review division how best to plan for the interpretation of study findings. In some cases, the FDA may request an a priori definition of the minimum observed difference between treatment group means (i.e., MID) that will serve as a benchmark to interpret whether study findings are conclusive. In other cases, the FDA may request an a priori definition of a treatment responder that can be applied to individual patient changes over time. Prespecification of methods for interpretation is particularly important with new or unfamiliar instruments or when patient dropouts, withdrawals from exposure, or missing data are expected (e.g., in studies where repeated PRO measurement is planned). See Section 5.5 for guidance on interpretation considerations for a study's statistical analysis plan.

4.6 Specific concerns when using electronic PRO instruments
When electronic PRO instruments are used, sponsors should plan carefully to ensure that FDA regulatory requirements are met for sponsor and investigator record keeping, maintenance, and access (see appendix point 5). These responsibilities are independent of the method used to record clinical trial data and, therefore, apply to electronic PRO data. Sponsors are responsible for providing investigators with the information they need to conduct the investigation properly, for monitoring the investigation, for ensuring that the investigation is conducted in accordance with the investigational plan, and for permitting the FDA to access, copy, and verify records and reports relating to the investigation.

The principal record keeping requirements for clinical investigators include the preparation and maintenance of adequate and accurate case histories (including the case report forms and supporting data), record retention, and provision for the FDA to access, copy, and verify records (i.e., source data verification). The investigator's responsibility to control, access, and maintain source documentation can be satisfied easily when paper PRO instruments are used, because the subject usually returns the diary to the investigator who either retains the original or a certified copy as part of the case history. The use of electronic PRO instruments, however, may pose a problem if direct control over source data is maintained by the sponsor or the contract research organization and not by the clinical investigator. The FDA considers the investigator to have met his or her responsibility when the investigator retains the ability to control and provide access to the records that serve as the electronic source documentation for the purpose of an FDA inspection. The FDA recommends that the study protocol, or a separate document, clearly specify how the electronic PRO source data will be maintained.

In addition, the FDA has previously provided guidance to address the use of computerized systems to create, modify, maintain, archive, retrieve, or transmit clinical data to the agency (see appendix point 6) and to clarify the requirements and application of 21 CFR part 11 (see appendix point 7). Because electronic PRO data (including data gathered by personal digital assistants or phone-based interactive voice recording systems) are part of the case history, the FDA expects electronic PRO data to be consistent with the data standards described in that guidance. Sponsors should plan carefully to establish appropriate system and security controls, as well as cybersecurity and system maintenance plans that address how to ensure data integrity during network attacks and software updates.

Sponsors should also plan to avoid the following (see appendix point 8):
• Direct PRO data transmission from the PRO data collection device to the sponsor (i.e., the sponsor should not have exclusive control of the source document)
• The existence of only one database without backup (i.e., risk of data corruption or loss during the trial with no way to reconstitute or verify the data)
• Removal of investigator accountability for confirming the accuracy of the data
• Loss of adverse event data
• Access to unblinded data
• Inability of an FDA investigator to inspect, verify, and copy the data at the clinical site during an inspection
• An insecure system that allows for easily alterable records

5. Data analysis
Incorporating PRO instruments as study endpoint measures introduces challenges in the analysis of clinical trial data.

variable (mean scores), dichotomous variable (success/failure), or some graded response, the primary and secondary endpoints, corrections for multiplicity, and the specific statistical methods planned.

In some situations, the SAP can specify that two or more variables must be statistically demonstrated to be superior to control group findings to support a claim. This may be the case, for example, when a clinician-reported endpoint and a patient-reported endpoint both need to be shown better than the control. Control for multiplicity (i.e., adjustment of the Type I error) generally is not a concern when all endpoints are shown to be superior to those of the comparison group, but we recommend carefully considering the impact of choosing multiple primary endpoints on Type II error and sample size. The sample size of the trial may be affected by how many endpoints are measured, the overall strategy planned to integrate all endpoints in the SAP, and the decision rule for declaring a successful study outcome.

Because each PRO item or domain often can represent an endpoint that could imply a distinct claim on its own, we recommend careful planning to avoid substantial increases in Type I error from multiple endpoints. If it is important in a study to demonstrate that PROs have the same directional effect as other measures of treatment benefit, then statistical procedures can be considered to minimize the impact of multiple endpoint comparisons.

There is no single best statistical procedure for multiplicity adjustment because the choice of procedure depends
Some of these challenges are discussed in the follow- upon the study objectives, the most important endpoints ing sections. among the collection, and other considerations. Some of the statistical procedures that can be useful for a more effi- 5.1 General statistical considerations cient analysis approach include methods that prespecify a The statistical analysis considerations for PRO endpoints sequence or order of the testing or that have a hierarchy of are not unlike statistical considerations for any other end- comparisons that first need to be satisfied before others point used in drug development (see appendix point 9). are considered for testing (i.e., closed testing procedures, We recommend that the principal features of the planned gatekeeper strategies). Generally, these statistical methods statistical analysis of the data be described in the statistical are less conservative than the classical Bonferroni or other section of the protocol and in a detailed elaboration of the statistical multiplicity adjustments that are used to control analysis often called the Statistical Analysis Plan (SAP). false positive conclusions from a family of eligible The FDA intends to determine the adequacy of study data hypotheses. Another reason to consider less conservative to support claims in light of the prespecified method for methods is to adjust for what are often strong correlations endpoint analysis. Unplanned or post hoc statistical anal- among the endpoints (causing a Bonferroni adjustment to yses are usually viewed as exploratory and, therefore, una- be too conservative). These strategies reduce the need for ble to serve as the basis of a claim of effectiveness. more stringent statistical tests for the subsequent end- points, but do not allow statistical testing for endpoint 5.2 Statistical considerations for using multiple endpoints combinations not prespecified. 
It is important that the study protocol specify all end- points that will be considered, including each domain A multidomain PRO measure can successfully support a score targeted to support a specific claim. The SAP should claim based on one or a subset of the domains measured describe the planned primary analysis in detail, noting if an a priori analysis plan prespecifies the domains that whether the endpoint will be analyzed as a continuous will be targeted as endpoints for the study. However, dem- Page 16 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 onstration that only a subset of domains is affected by In general, if analysis of scores for the individual compo- treatment (e.g., the physical function domain) generally nent endpoints of a composite shows the improvement is will not support a general claim (e.g., a claim of improved driven primarily by a single domain (e.g., performance of HRQL) because such a claim implies improvement on all a specific activity), the findings for the composite score domains that are important to the general concept. Use of would not support a general claim (e.g., psychological or domain subsets as study endpoints presupposes that the emotional benefit, or even general physical state if all that PRO instrument was adequately developed and validated is shown is symptom improvement). to measure the subset of domains independently from the other domains. 5.4 Statistical considerations for patient-level missing data The FDA recommends that the SAP address plans for how The FDA recommends that the sponsor discuss with the the statistical analyses will handle missing data when eval- FDA in advance of the study the appropriateness of the uating treatment efficacy and when considering patient statistical strategies proposed in the SAP. success or patient response. 
5.3 Statistical considerations for composite measures 5.4.1 Missing items within domains Understanding the usefulness and measurement proper- At a specific patient visit, a domain measurement may be ties of a composite endpoint (i.e., an index, profile, or bat- missing some, but not all, items. Defining rules that spec- tery of scores) is an iterative process that evolves over ify the number of items that can be missing and still con- time. Rules for interpretation of composite measures sider the domain to have been measured is one approach depend on substantial clinical experience with the meas- to handling this type of missing data. Rules for handling ure in the clinical trial setting. Development of a compos- missing data should be specific to each PRO instrument ite endpoint at the time the confirmatory clinical study and should usually be determined during the instrument protocol is generated is discouraged unless there is sub- development and validation process. The FDA recom- stantial prior empirical evidence of the value of the chosen mends that all rules be specified in the SAP. For example, components of the composite. Though one reason for use the SAP can specify that a domain will be treated as miss- of a composite is to reduce the multiplicity problems asso- ing if more than 25 percent of the items are missing; if less ciated with multiple separate endpoints, composites can than 25 percent of the items are missing, the domain score do so only if it is agreed that treatment impact on each of can be taken to be the average of the nonmissing items. the endpoints is of value and if the endpoints move in the same direction. 5.4.2 Missing entire domains or entire measurements When the amount of missing data becomes large, study Establishing benefit is difficult if only one component of results can be inconclusive. As described earlier, the FDA a composite endpoint responds to the treatment. 
For encourages prespecified procedures in the study protocol, example, a treatment may relieve certain symptoms or particularly when patients discontinue study treatment. improve functioning but this benefit may not be detected Because missing data may be due to the treatment using a composite score that includes other endpoints received or the underlying disease and can introduce bias (e.g., psychological or emotional well-being) that fail to in the analysis of treatment differences and conclusions improve with the treatment. In any such composite, it is about treatment impact, the FDA encourages sponsors to critical to ensure that patients enrolled in a clinical study obtain data on each patient at the time of withdrawal to are impaired in all domains (e.g., psychological or emo- determine the reason for withdrawal. When available, this tional well-being) because they cannot improve in information can be taken into account in the analysis. domains if they are not impaired in whatever concept the domain measures. A variety of statistical strategies have been proposed in the literature and applications to the FDA to deal with missing Multiplicity problems arise when the multiple individual data due to patient withdrawal from assigned treatment components of a composite endpoint are intended as pos- exposure prior to planned completion of the trial. No sin- sible claims. In general, individual components of a com- gle method is generally accepted as preferred. One used in posite measure will not be adequate to support a claim the past was to exclude subjects from the analyses if they unless the components are prespecified in the SAP as sep- did not complete the study (i.e., completers' analysis). 
This arate endpoints, either sharing overall study alpha (co-pri- strategy is generally inadvisable because the reason for mary endpoints) or identified in a sequential analysis, missing data can be treatment-related and these patients and the study results are found statistically and clinically may not adequately represent the study population. meaningful in the context of the total composite and other individual component results. Another common, albeit problematic, strategy is to use the last observation available as the final evaluation – usu- Page 17 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 ally referred to as last observation carried forward (LOCF). When clinical trials show small mean effect sizes, rather Even though LOCF enables every patient randomized to than considering results in terms of an MID, it may be contribute some observation to the analysis, it can be more informative to examine the distribution of problematic for the following reasons: responses between treatment groups to more fully charac- terize the treatment effect and examine the possibility that If the objective of the trial is to detect a treatment effect the mean improvement reflects very different responses in after a certain duration of treatment (e.g., at 8 weeks), subsets of patients. When only a modest fraction of peo- then a comparison that includes only measurements on ple respond to a treatment, that fraction may experience patients at earlier times or visits is not addressing the orig- meaningful change in the face of a mean effect that is very inal trial objective. The average of patient responses, many small. When defining a meaningful change on an individ- of which are at different times or visits, may be uninter- ual patient basis (i.e., a responder), that definition is gen- pretable. erally larger than the minimum important difference for application to group mean comparisons. 
LOCF makes an implicit assumption that the patient would sustain the same response seen at an early study Glossary visit for the entire duration of the trial. This assumption is Claim untestable and potentially unrealistic. A statement of treatment benefit or comparative safety advantage. A claim can appear in any section of a medical Some other approaches involve imputation of missing product's FDA-approved label or in advertising of pre- data on a per-patient basis. These strategies try to predict scription drugs. missing outcomes for a patient who has withdrawn from the trial using data from subjects who stayed in the trial Cognitive debriefing and for whom all data have been collected. All of these A qualitative research tool used to determine whether strategies are imperfect, as they involve strong or weak concepts and items are understood by patients in the same assumptions about what caused data to be missing, way that instrument developers intend. Cognitive debrief- assumptions that usually cannot be verified from the data. ing interviews involve incorporating follow-up questions If missing data are associated with treatment effect in ways in a field test interview to gain a better understanding of that cannot be predicted from measurements on subjects how patients interpret questions asked of them. with complete data, analyses using imputation proce- dures will be biased. When there are few patients with Concept missing measurements and the frequency of missing data The specific goal of measurement (i.e., the thing that is to or proportion of patients with missing data is comparable be measured by a PRO instrument). across treatment groups, most approaches will yield simi- lar results. When a higher proportion of patients have Conceptual framework missing data, the FDA recommends the use of several dif- The expected relationships of items within a domain and ferent imputation methods (including a worst-case sce- of domains within a PRO concept. 
The validation process nario in which missing data are assumed to be confirms the conceptual framework. When used in a clin- unfavorable for those on the investigational treatment ical trial, the observed relationships among items and and favorable for those in the control group) and an domains will again confirm the conceptual framework. assessment of the consistency of the study results using each method. These analyses will demonstrate the sensi- Domain tivity of the conclusions to the assumptions made by the A domain is a discrete concept within a multidomain con- different methods. cept. All the items in a single domain contribute to the measurement of the domain concept. 5.5 Interpretation of study results Because statistical significance can sometimes be achieved Health-related quality of life (HRQL) for very small changes if a study is large enough, it is A multidomain concept that represents the patient's over- tempting to identify an MID as a benchmark for interpret- all perception of the impact of an illness and its treatment. ing the clinical importance or relevance of study results. If An HRQL measure captures, at a minimum, physical, psy- the MID is truly to be the smallest effect considered mean- chological (including emotional and cognitive), and ingful, however, it would be logical to establish the null social functioning. Claiming a statistical and meaningful hypothesis to rule out a difference less than or equal to the improvement in HRQL implies: (1) that the instrument MID. This is rarely done, and would have major implica- measures all HRQL domains that are important to inter- tions for sample size. preting change in how the study population feels or func- tions as a result of treatment; and (2) that improvement Page 18 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 was demonstrated in all of the important domains. 
An Validation HRQL instrument is a particular type of PRO instrument. The process of assessing a PRO instrument's ability to measure a specific concept or collection of concepts. This Instrument ability is described in terms of the instrument's measure- A means to capture data (e.g., questionnaire, diary) plus ment properties that are derived during the validation all the information and documentation that supports its process. At the conclusion of the process, a set of measure- use. Generally, that includes clearly defined methods and ment properties is produced that are specific to the spe- instructions for administration or responding, a standard cific population and the specific form and format of the format for data collection, and well-documented methods PRO instrument tested. The validation process involves: for scoring, analysis, and interpretation of results. Identifying the concept to be measured Item An individual question, statement, or task that is evalu- Assessing the content validity (i.e., being sure the items ated by the patient to address a particular concept. in the questionnaire cover all important aspects of the concept from the patient perspective) Minimum important difference (MID) The amount of difference or change observed in a PRO Evaluating the proposed scores to be obtained from the measure between treatment groups in a clinical trial that instrument will be interpreted as a treatment benefit. Defining a priori hypotheses of the expected relation- Patient-reported outcome (PRO) ships between PRO concepts and other measures Any report coming directly from patients (i.e., study sub- jects) about a health condition and its treatment. Testing the hypotheses by reporting the observed corre- lations among scores Quality of life A general concept that implies an evaluation of the impact Availability of all aspects of life on general well-being. 
Because this For questions regarding this draft document contact Lau- term implies the evaluation of nonhealth-related aspects rie Burke (CDER) 301-796-0700, Toni Stifano (CBER) of life, it is too broad to be considered appropriate for a 301-827-6190, or Sahar Dawisha (CDRH) 301-594-3090. medical product claim. Additional copies are available from: Questionnaire A set of questions or items shown to a respondent in order Office of Training and Communications, Division of to get answers for research purposes. Drug Information, HFD-240, Center for Drug Evaluation and Research, Food and Drug Administration, 5600 Fish- Scale ers Lane, Rockville, MD 20857, USA The system of numbers or verbal anchors by which a value or score is derived. Examples include visual analogue (Tel) 301-827-4573 scales, Likert scales, and rating scales. http://www.fda.gov/cder/guidance/index.htm Score A number derived from a patient's response to items in a Office of Communication, Training, and Manufacturers questionnaire. A score is computed based on a prespeci- Assistance, HFM-40, Center for Biologics Evaluation and fied, validated scoring algorithm and is subsequently used Research, Food and Drug Administration, 1401 Rockville in statistical analyses of clinical study results. Scores can Pike, Rockville, MD 20852-1448 be computed for individual items, domains, or concepts, or as a summary of items, domains, or concepts. (Tel) 800-835-4709 or 301-827-1800 Treatment benefit http://www.fda.gov/cber/guidelines.htm An improvement in how a patient survives, feels, or func- tions as a result of treatment. Measures that do not directly Office of Communication, Education, and Radiological capture the impact of treatment on how a patient survives, Programs, Division of Small Manufacturers Assistance, feels, or functions are surrogate measures of treatment HFZ-220, Center for Devices and Radiological Health, benefit. 
Food and Drug Administration, 1350 Piccard Drive Page 19 of 20 (page number not for citation purposes) Health and Quality of Life Outcomes 2006, 4:79 http://www.hqlo.com/content/4/1/79 Rockville, MD 20850-4307, USA 4. The FDA is specifically asking for comment on the appropriate review standards for the definition of a (Tel) Manufacturers Assistance: 800-638-2041 or 301- responder when applied to PRO instruments used in clin- 443-6597 ical studies to support medical product development. (Tel) International Staff Phone: 301-827-3993 5. For the principal record keeping requirements for clini- cal investigators and sponsors, see 21 CFR 312.50, 312.58, E-mail: email@example.com 312.62, 312.68, 812.140, and 812.145. Fax: 301-443-8818 6. See the draft guidance for industry Computerized Systems Used in Clinical Trials. When final, this guidance will http://www.fda.gov/cdrh/ggpmain.html supersede the guidance of the same name issued in April 1999 and will represent the FDA's current thinking on this Authors' contributions topic. For the most recent version of a guidance, check the This guidance has been prepared by the Office of New CDER guidance Web page at http://www.fda.gov/cder/ Drugs and the Office of Medical Policy in the Center for guidance/index.htm. Drug Evaluation and Research (CDER) in cooperation with the Center for Biologics Evaluation and Research 7. See the guidance for industry Part 11, Electronic Records; (CBER) and the Center for Devices and Radiological Electronic Signatures – Scope and Application http:// Health (CDRH) at the Food and Drug Administration. www.fda.gov/cder/guidance/index.htm Appendix 8. The FDA specifically welcomes comment and addi- 1. Labeling, as used in this guidance, refers to the medical tional information that will inform these policies as new product description and summary of use, safety, and effec- electronic PRO technology is developed and used in the tiveness that must be approved by the FDA. 
See 21 CFR medical product development setting. 201.56 and 201.57 for regulations pertaining to prescrip- tion drug (including biological drug) labeling. For medi- 9. See the ICH guidance for industry E9 Statistical Princi- cal device labeling, see 21 CFR 801. For blood and blood ples for Clinical Trials http://www.fda.gov/cder/guidance/ products for transfusion, see 21 CFR 606.122 Instruction index.htm Circular. Disclaimer 2. For drugs, section 505(d) of the Federal Food, Drug, This draft guidance, when finalized, will represent the and Cosmetic Act (the Act) establishes substantial evidence Food and Drug Administration's (FDA's) current thinking as the evidence standard for making conclusions that a on this topic. It does not create or confer any rights for or drug will have a claimed effect and states that reports of on any person and does not operate to bind FDA or the adequate and well-controlled investigations provide the public. You can use an alternative approach if the basis for determining whether there is substantial evidence approach satisfies the requirements of the applicable stat- to support claims of effectiveness for new drugs. See 21 utes and regulations. If you want to discuss an alternative CFR 314.126 for a description of the characteristics of an approach, contact the FDA staff responsible for imple- adequate and well-controlled investigation. See the guid- menting this guidance. If you cannot identify the appro- ance for industry Providing Clinical Evidence of Effectiveness priate FDA staff, call the appropriate number listed on the for Human Drug and Biological Products for considerations title page of this guidance. concerning the quantity of evidence necessary to meet the substantial evidence standard http://www.fda.gov/cder/ Acknowledgements This article is a reprint of draft guidance distributed by the U.S. Department guidance/index.htm. For medical devices, the Medical of Health and Human Services Food and Drug Administration. 
This guid- Device Amendments of 1976 to the Act established the ance document is being distributed for comment purposes only. Comments assurance of safety and effectiveness of medical devices and suggestions regarding this draft document should be submitted within intended for human use. See 21 CFR 860.7 for the evi- 60 days of publication in the Federal Register of the notice announcing the dence used in the determination of safety and effective- availability of the draft guidance. Submit comments to the Division of Dock- ness of a medical device. ets Management (HFA-305), Food and Drug Administration, 5630 Fishers Lane, rm. 1061, Rockville, MD 20852. All comments should be identified 3. The FDA is specifically asking for comment on the need with the docket number listed in the notice of availability that publishes in the Federal Register. for, and appropriate standards for, MID definitions applied to PRO instruments used in clinical studies. Page 20 of 20 (page number not for citation purposes)
Health and Quality of Life Outcomes – Springer Journals
Published: Oct 11, 2006