Systematic Reviews to Answer Health Care Questions
SYSTEMATIC REVIEWS to Answer Health Care Questions
SECOND EDITION
HEIDI D. NELSON
Systematic Reviews to Answer Health Care Questions
Second Edition
Heidi D. Nelson, MD, MPH, MACP, FRCP
Professor, Department of Health Systems Science
Kaiser Permanente Bernard J. Tyson School of Medicine
Pasadena, California
Acquisitions Editor: Joe Cho
Development Editor: Cindy Yoo
Editorial Coordinator: Janet Jayne
Editorial Assistant: Kristen Kardoley
Marketing Manager: Kirsten Watrud
Production Project Manager: Justin Wright
Manager, Graphic Arts & Design: Stephen Druding
Manufacturing Coordinator: Bernard Tomboc
Prepress Vendor: Lumina Datamatics
Second Edition
Copyright © 2025 Wolters Kluwer.
Copyright © 2014 by LIPPINCOTT WILLIAMS & WILKINS, a WOLTERS KLUWER business. All rights reserved. This book is protected by copyright. No part of this book may be reproduced or transmitted in any form or by any means, including as photocopies or scanned-in or other electronic copies, or utilized by any information storage and retrieval system without written permission from the copyright owner, except for brief quotations embodied in critical articles and reviews. Materials appearing in this book prepared by individuals as part of their official duties as U.S. government employees are not covered by the above-mentioned copyright. To request permission, please contact Wolters Kluwer at Two Commerce Square, 2001 Market Street, Philadelphia, PA 19103, via email at permissions@lww.com, or via our website at shop.lww.com (products and services).
9 8 7 6 5 4 3 2 1
Printed in the United States of America
Library of Congress Cataloging-in-Publication Data
ISBN-13: 978-1-9752-1109-7
Cataloging in Publication data available on request from publisher.
This work is provided “as is,” and the publisher disclaims any and all warranties, express or implied, including any warranties as to accuracy, comprehensiveness, or currency of the content of this work.
This work is no substitute for individual patient assessment based upon healthcare professionals’ examination of each patient and consideration of, among other things, age, weight, gender, current or prior medical conditions, medication history, laboratory data and other factors unique to the patient. The publisher does not provide medical advice or guidance and this work is merely a reference tool. Healthcare professionals, and not the publisher, are solely responsible for the use of this work including all medical judgments and for any resulting diagnosis and treatments. Given continuous, rapid advances in medical science and health information, independent professional verification of medical diagnoses, indications, appropriate pharmaceutical selections and dosages, and treatment options should be made and healthcare professionals should consult a variety of sources. When prescribing medication, healthcare professionals are advised to consult the product information sheet (the manufacturer’s package insert) accompanying each drug to verify, among other things, conditions of use, warnings and side effects and identify any changes in dosage schedule or contraindications, particularly if the medication to be administered is new, infrequently used or has a narrow therapeutic range. To the maximum extent permitted under applicable law, no responsibility is assumed by the publisher for any injury and/or damage to persons or property, as a matter of products liability, negligence law or otherwise, or from any reference to or use by any person of this work.
shop.lww.com
To Don, Norris, and Amelia Comer and Don and Marian Nelson
Contributing Authors
Amy G. Cantor, MD, MPH, FAAFP
Associate Professor
Departments of Medical Informatics and Clinical Epidemiology, Family Medicine, and Obstetrics and Gynecology
Core Investigator, Pacific Northwest Evidence-based Practice Center
Oregon Health and Science University
Portland, Oregon

Rongwei Fu, PhD
Professor of Biostatistics
School of Public Health
Oregon Health and Science University
Portland, Oregon

Rebecca M. Jungbauer, DrPH, MPH, MA
Researcher
Department of Medical Informatics and Clinical Epidemiology
Pacific Northwest Evidence-based Practice Center
Portland, Oregon

Robin Paynter, MLIS
Information Specialist
Fertility Regulation Review Group
Cochrane Collaboration
Portland, Oregon
Preface
Systematic reviews use scientific methods to identify, select, assess, and summarize the findings of studies to answer health care questions. They provide the evidence for evidence-based medicine and are essential in determining health care guidelines and policies. As such, a systematic review can have a huge impact on how health care is practiced and funded. To meet this challenge, systematic reviews must adhere to methodological standards. Reviews that fall short may include only some or the wrong kinds of studies or provide incorrect conclusions. The selection of studies could be biased or the statistical analysis inappropriate. The studies included in a systematic review could be so flawed that their results are unreliable. A systematic review that simply collects and catalogs studies will miss these possibilities, whereas one that accurately evaluates and synthesizes the evidence will reveal them.

This book is a guide to conducting comprehensive systematic reviews to answer health care questions based on currently accepted methods and standards in the field. It is most relevant to health care practices and populations in the United States but can be applied more broadly. Although intended primarily for researchers, its concise format and practical approach make it suitable for multiple types of users. It emphasizes main concepts, incorporates examples and case studies, and provides references for additional sources. Most examples are based on recent real-world projects conducted by the authors.

The second edition is an updated resource that describes essential components in designing and conducting a systematic review. These include defining its purpose, topic, and scope; developing research questions; building the team and managing the project; identifying and selecting studies; extracting relevant data; assessing studies for quality and applicability; synthesizing the evidence using qualitative and quantitative analysis; assessing the strength of evidence; and preparing and disseminating the report. New chapters include how to assess the quality of diagnostic accuracy studies, qualitative studies, and systematic reviews; a guide to electronic tools for systematic reviews; and answers to case studies. Each component provides the necessary underpinnings for a comprehensive systematic review that accurately reflects a body of evidence that could ultimately lead to improvements in health care.
Heidi D. Nelson, MD, MPH, MACP, FRCP
Acknowledgments
At a time when facts and evidence are often maligned, ignored, or misrepresented, the rigorous pursuit of truth continues to be a principle and driving force in science and medicine. While the COVID-19 pandemic raged across the world, scientists rapidly mobilized efforts to understand the virus, its epidemiology and health effects, and how to prevent and treat infections and their complications. Among them emerged several international collaborations creating living systematic reviews that required continual updating and ongoing surveillance of emerging research evidence. This commitment to finding truth in the midst of confusion is a hallmark of systematic review science.

This book draws from the collective knowledge of systematic review scientists internationally and the first-hand experiences of the contributing authors of the first and second editions. We have had tremendous opportunities to contribute to the emerging field of systematic review and actively participate in the historic shift to evidence-based health care. I acknowledge all the truth finders in the field, particularly those who have journeyed with me and contributed to this book.
Contents
Contributing Authors  vi
Preface  vii
Acknowledgments  viii
1  Systematic Reviews  1
2  Defining the Topic and Scope and Developing Research Questions, Analytic Frameworks, and Protocols  9
3  Building the Systematic Review Team, Engaging Stakeholders, and Managing the Project  22
4  Determining Inclusion and Exclusion Criteria for Studies  35
5  Conducting Searches for Relevant Studies  48
6  Selecting Studies for Inclusion  65
7  Extracting Data from Studies and Constructing Evidence Tables  78
8  Assessing Quality and Applicability of Controlled Clinical Trials, Cohort Studies, and Case-Control Studies  94
9  Assessing Quality and Applicability of Diagnostic Accuracy Studies, Qualitative Studies, and Systematic Reviews  120
10  Qualitative Analysis  135
11  Quantitative Analysis  149
12  Assessing and Rating the Strength of the Body of Evidence  183
13  Preparing and Disseminating the Report  199
14  Guide to Electronic Tools for Systematic Reviews  223
15  Answers to Case Studies  230
Index  239
CHAPTER 12
Assessing and Rating the Strength of the Body of Evidence
Heidi D. Nelson
■ ■ INTRODUCTION

The goal of a systematic review is to provide the best evidence available to make health care decisions.1 The final step in the synthesis of a systematic review is determining how strong the best evidence is. This involves evaluating the strength or quality of the body of evidence for specific research questions and outcomes.

Previous chapters outlined the issues to consider and steps required to select and evaluate studies addressing the research questions posed in the systematic review, including both qualitative and quantitative analysis. These steps result in an understanding of the quality and applicability of individual studies and synthesis of their results across studies. However, at this point, a systematic approach to understanding the strength of the body of evidence for a given outcome or question is needed to provide an assessment of the reliability of the findings.

It is important to distinguish between assessing the strength of the evidence itself and determining how well the evidence supports a specific recommendation or guideline. These are separate processes. For example, the strength of the evidence may be high for a given intervention and outcome, but the strength of the recommendation made by a guideline development group may be low because of other factors, such as patient preferences, adverse effects, or relevance to current practice.

This chapter describes how to evaluate the strength of a body of evidence, including the development of current methods; definitions of the domains or characteristics of evidence and how they are evaluated; criteria unique to observational studies; and how to make final assessments of the strength of evidence. Accepted standards for rating the strength of a body of evidence guide these methods (Table 12.1).2–8

TABLE 12.1 STANDARDS FOR SYNTHESIZING THE BODY OF EVIDENCE

USE A PRESPECIFIED METHOD TO EVALUATE THE BODY OF EVIDENCE
• For each outcome, systematically assess the following characteristics of the body of evidence:
  1. Risk of bias
  2. Consistency
  3. Precision
  4. Directness
  5. Reporting bias
• For bodies of evidence that include observational research, also systematically assess the following characteristics for each outcome:
  1. Dose–response association
  2. Plausible confounding that would change the observed effect
  3. Strength of association
• For each outcome specified in the protocol, use consistent language to characterize the level of confidence in the estimates of the effect of an intervention

Source: Institute of Medicine. Finding What Works in Health Care: Standards for Systematic Reviews. Washington, DC: The National Academies Press; 2011. Reprinted with permission from the National Academies Press, Copyright 2011, National Academy of Sciences.

■ ■ DEVELOPMENT

Systems to provide structure and hierarchy to bodies of evidence have been developed by many groups for multiple purposes. In general, they were intended to help users quickly and simply understand the relevant research to make evidence-based health care, guideline, or policy decisions. Traditionally, members of guideline groups were clinical rather than research methods experts. Groups with greater expertise in methodology created systems that were more flexible and detailed, but also more complicated and difficult to understand. As a result, more than 50 types of grading systems have been used.9,10

Dissatisfaction with the methodology, variation, and number of competing systems led a group of evidence-based guideline developers, researchers, and methodologists to create a new, more universal, and transparent approach to rating the strength of a body of evidence, the Grading of Recommendations Assessment, Development and Evaluation (GRADE).11 Although these efforts focused on a system primarily for guideline development groups, the essential concepts are relevant to systematic reviews used for other purposes. Key principles of GRADE include
an approach that is simple and transparent and based on methodological characteristics of evidence in defined domains to capture the most important concepts.11 The assessment of the body of evidence is individually rated for each important outcome. In addition, GRADE has an explicit methodology for consistency across systematic reviews.

■ ■ CURRENT METHODS

The GRADE method has been well accepted and adopted by more than 110 organizations internationally, such as the World Health Organization, the American College of Physicians, and the Cochrane Collaboration.11 It is generally agreed that the concepts addressed by GRADE are comprehensive and accurately reflect the most important issues in considering the strength of a body of evidence. A series of articles describes each domain,6,12–25 and software applications are available to support the GRADE approach.26 Training sessions are held at various venues worldwide, primarily through the Cochrane Collaboration meetings.27

The Agency for Healthcare Research and Quality (AHRQ) Evidence-based Practice Centers (EPC) Program has adapted the GRADE methodology to address issues specific to comparative effectiveness reviews of a broad range of topics.3 The AHRQ EPC method varies from GRADE in a few key areas. These include the terminology used to describe the overall process: GRADE refers to grading the quality of the evidence, whereas the AHRQ EPC method refers to it as grading the strength of the evidence.3 The terminology for the overall ratings also differs. The GRADE method defines four categories as high, moderate, low, and very low, whereas the AHRQ EPC method defines them as high, moderate, low, and insufficient. Also, GRADE considers observational studies as inherently biased, whereas the AHRQ EPC method allows them greater weight under certain circumstances.

Variations of these methods are used by guideline groups. For example, the U.S. Preventive Services Task Force (USPSTF) uses a methodology to rate the strength of the body of evidence that addresses similar characteristics as both the GRADE and AHRQ EPC methods.28 However, in the USPSTF method, the body of evidence is based on the research question the evidence is supporting as well as the overall evidence supporting a given preventive service, rather than the outcome alone. In the first step, the evidence is assessed according to the research question, based on aggregate internal and external validity of studies and the consistency and coherence of results. In the second step, the quality of the evidence for each
question is viewed across the analytic framework to determine whether there is adequate evidence to support a complete chain of linkages connecting the preventive service to health outcomes, and the degree to which the evidence directly addresses the populations, conditions, and outcomes identified in the research questions. The evidence is graded as good, fair, or poor. For this method, the systematic reviewers assess the evidence for the first step, and the USPSTF members assess the evidence for the second step, similar to guideline groups rating the strength of a recommendation in a guideline.

This chapter highlights the GRADE and AHRQ EPC methods because they closely align with accepted standards, are explicit and similar, and are widely used. This book refers to the evaluation of the body of evidence as strength of evidence while acknowledging that other groups use the term quality of evidence. Although both terms are accurate and acceptable, the term “quality” has also been applied to the assessment of internal validity (risk of bias) of individual studies, and use of both terms could be confusing.

■ ■ HOW TO ASSESS THE STRENGTH OF EVIDENCE

Systematic reviewers assess the strength of evidence, whereas guideline development groups determine the strength of a recommendation based on the evidence. This section describes how to assess the strength of evidence as a final step in synthesizing studies in a systematic review based on the GRADE and AHRQ EPC methods.

Assessing the strength of evidence begins by determining how well studies address methodological domains (characteristics). These include study limitations, directness, consistency, precision, and reporting bias of the body of evidence (Table 12.2). Additional domains for observational studies include magnitude of effect (strength of association), dose–response association, and plausible confounding that could change the observed effect.

The GRADE method assigns ratings by identifying problems with the body of evidence rather than affirming the lack of a problem,5,11 whereas the AHRQ EPC method uses the inverse approach.3 In GRADE, a body of evidence consisting of randomized controlled trials (RCTs) begins with a high strength of evidence rating, whereas one consisting of observational studies begins with a low strength of evidence rating. It is important to be aware of which approach is used to avoid misinterpretation. Also, although the domains represent distinct concepts, they often overlap or are interwoven. Nonetheless, breaking the concepts into domains improves transparency and outlines the rationale behind the overall rating. The ratings of strength of evidence were developed to be applied to individual outcomes. Depending on the purpose of the systematic review, ratings can also be applied to research questions. The approach is similar regardless.

Each domain is evaluated separately and given a rating based on specific metrics. In the GRADE method, most domains are rated as no, serious, and very serious, whereas publication bias is rated as undetected or strongly suspected. Each domain starts at the highest level, and the levels are then reduced depending on the specific limitations of the evidence. GRADE provides guidance about how to reduce the levels within a domain for various types of limitations.5,11 In the AHRQ EPC method, the metrics are different for each domain.
For example, study limitations are rated high, medium, or low; directness is direct or indirect; consistency is consistent, inconsistent, or unknown/none; and precision is rated precise or imprecise. AHRQ EPC ratings are based on global judgments about the evidence, rather than reducing levels based on limitations.3

The overall rating is based on the ratings of the individual domains. Neither method uses a cumulative scoring system to reach an overall rating. Both methods use four categories, including high, moderate, and low. The GRADE method uses a very low category, whereas the AHRQ EPC method uses an insufficient category (Table 12.3). When the body of evidence consists of both trials and observational studies, ratings can be done initially for the separate study designs to accommodate the inherent issues related
TABLE 12.2 DOMAINS USED IN RATING STRENGTH OF EVIDENCE
(Measures are shown for GRADE5,11 and AHRQ EPC3.)

Study limitations
  Definition: The extent to which methodological deficiencies across studies could bias results
  Considerations: What are the quality (risk of bias) ratings of individual studies?
  GRADE measures: No limitations, serious limitations, very serious limitations
  AHRQ EPC measures: High, medium, low

Directness
  Definition: The relevance of the evidence to the research question based on patient population, interventions, comparators, and outcomes
  Considerations: Do most studies address the PICOTS elements of the key question?
  GRADE measures: No indirectness, serious indirectness, very serious indirectness; includes applicability
  AHRQ EPC measures: Direct, indirect; applicability assessed separately

Consistency
  Definition: Degree of similarity in the direction and magnitude of effect of different studies in a body of evidence
  Considerations: How consistent are results based on overlapping confidence intervals, similarity of point estimates, between-study variance, measures of heterogeneity (eg, I²) as relevant to the review?
  GRADE measures: No inconsistency, serious inconsistency, very serious inconsistency
  AHRQ EPC measures: Consistent, inconsistent, unknown/none

Precision
  Definition: Degree of certainty surrounding an estimate of effect for a specific outcome
  Considerations: Are the number of events sufficiently high and width of the confidence intervals adequate to support a meaningful effect?
  GRADE measures: No imprecision, serious imprecision, very serious imprecision
  AHRQ EPC measures: Precise, imprecise

Reporting bias
  Definition: Includes publication bias (the entire study is missing), outcome reporting bias (specific outcomes that are measured are not reported), and analysis reporting bias (specific preplanned analyses are conducted but not reported)
  Considerations: Is there evidence of small study effects or selective reporting or publishing?
  GRADE measures: Publication bias assessed separately as undetected, strongly suspected; other biases considered under study limitations
  AHRQ EPC measures: Undetected, suspected

Large magnitude of effect (strength of association)
  Definition: The size of the effect is so large that results are believable despite potential study bias
  Considerations: What is the magnitude of effect and width of the confidence intervals?
  GRADE measures: Level upgraded by one (if RR >2 or <0.5) or two categories (if RR >5 or <0.2)
  AHRQ EPC measures: Weak, strong; taken into account to improve rating

Dose–response association
  Definition: Effects are greater with increasingly higher levels of interventions or exposures
  Considerations: Is a dose–response effect present?
  GRADE measures: Level upgraded by one
  AHRQ EPC measures: Weak, strong; taken into account to improve rating

Opposing plausible residual bias and confounding
  Definition: Confounding factors lead to an underestimate of the effect of an intervention
  Considerations: What are possible biases and confounding factors that were not considered but could have influenced the estimate of effect?
  GRADE measures: Level upgraded by one
  AHRQ EPC measures: Weak, strong; taken into account to improve rating

Abbreviations: AHRQ EPC, Agency for Healthcare Research and Quality Evidence-based Practice Centers; GRADE, Grading of Recommendations Assessment, Development and Evaluation; RR, relative risk.
TABLE 12.3 OVERALL RATING OF THE STRENGTH OF EVIDENCE (GRADE AND AHRQ EPC)

GRADE5,11
  High: Considerable confidence in estimate of effect
  Moderate: Further research likely to have an impact on confidence in estimate, may change estimate
  Low: Further research is very likely to have an impact on confidence in estimate, likely to change the estimate
  Very Low: Any estimate of effect is very uncertain

AHRQ EPC3
  High: Very confident that the estimate of effect lies close to the true effect; evidence has few or no deficiencies; findings are stable
  Moderate: Moderately confident that the estimate of effect lies close to the true effect; evidence has some deficiencies; findings are likely to be stable
  Low: Limited confidence that the estimate of effect lies close to the true effect; evidence has major and/or numerous deficiencies; additional evidence is needed to make conclusions
  Insufficient: No evidence, unable to estimate an effect, or no confidence in the estimate

Abbreviations: AHRQ EPC, Agency for Healthcare Research and Quality Evidence-based Practice Centers; GRADE, Grading of Recommendations Assessment, Development and Evaluation.
to each. The overall rating then considers the ratings of all study designs. This approach can also be taken when both direct and indirect evidence are available for an outcome. For example, a systematic review comparing the effectiveness of two interventions may include results of meta-analyses of placebo-controlled trials in addition to head-to-head trials. The ratings of strength of evidence for the placebo-controlled and head-to-head trials can be determined separately and then combined.

In most cases, ratings will be made regarding the evidence for specific outcomes and intervention/comparison pairs. The systematic reviewers must determine which outcomes will be graded based on the research questions and purpose of the systematic review. In general, these are the outcomes most relevant to decision making; they reflect the input of clinical experts, stakeholders, and users and consider patient preferences and values. The number of outcomes may vary by the scope, research questions, and intent of the systematic review.

Study Limitations

The study limitations domain, also referred to as risk of bias, incorporates both study design and study quality (risk of bias, internal validity).3,14 Based on the hierarchy of evidence (Chapter 8), a body of evidence composed entirely of RCTs has a higher strength of evidence rating than one with only observational studies.29 Also, among observational studies, cohort studies are generally ranked higher than case–control studies because they are inherently less biased, whereas before–after (pre–post) and time-series studies are ranked lower.

To determine the rating for this domain, the quality of the studies is considered collectively while considering the study design. Earlier in the systematic review process, individual studies were evaluated for quality using prespecified criteria (Chapter 8), resulting in a rating assigned to each study. In this domain, the reviewer makes an assessment about the quality of the entire group of studies. This is a subjective assessment, where outliers may be given less weight. For example, if a group of RCTs includes four moderately sized fair-quality trials and one small poor-quality trial, the reviewer may weigh the larger studies more heavily. The body of evidence would then get a rating of medium for study limitations using the AHRQ EPC method, indicating moderate methodological limitations. The GRADE method would downgrade the trials from the no limitations to the serious limitations rating.
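To make the GRADE bookkeeping concrete, the following is a minimal illustrative sketch in Python of the starting-level and downgrading logic described above. It is not an official GRADE tool: the function names and numeric level scale are hypothetical, and in practice these ratings are expert judgments, not mechanical calculations.

```python
# Illustrative sketch only; real GRADE ratings are qualitative judgments.
GRADE_LEVELS = ["very low", "low", "moderate", "high"]

def start_level(study_design: str) -> int:
    """Bodies of RCT evidence start high; observational bodies start low."""
    return 3 if study_design == "rct" else 1

def apply_downgrades(level: int, limitations: dict) -> int:
    """limitations maps a domain name to 0 (none), 1 (serious), or 2 (very serious)."""
    for penalty in limitations.values():
        level -= penalty
    return max(level, 0)

# The example from the text: fair-quality RCTs with moderate methodological
# deficiencies are downgraded one level for study limitations.
level = start_level("rct")                              # starts at "high"
level = apply_downgrades(level, {"study limitations": 1})
print(GRADE_LEVELS[level])                              # -> "moderate"
```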
Directness

Directness describes the relevance of the evidence to the PICOTS elements of the research question.3,18 Evidence is direct when the interventions and comparisons in studies are the same as those specified by the research question. For example, if the research questions are about the comparative benefits and harms of Drug A versus Drug B, direct evidence would consist of studies comparing the two drugs against each other (eg, head-to-head trials). Indirect evidence would consist of studies of Drug A and Drug B compared to placebo, but not to each other. In this case, the systematic reviewer would make indirect comparisons across the placebo-controlled trials to compare the two drugs, often using statistical indirect comparison meta-analysis, if feasible (a sketch of this calculation follows at the end of this section).

Directness also describes how well the study population compares with the target population of the systematic review. For example, in a systematic review of treatments for osteoarthritis, the long-term cardiovascular harms of nonselective nonsteroidal anti-inflammatory drugs (NSAIDs) were determined in placebo-controlled trials of NSAIDs for the prevention of Alzheimer disease.30 Although results of trials to prevent Alzheimer disease may provide insights into possible adverse outcomes in patients with osteoarthritis, the two populations are markedly different. These studies would be considered indirect evidence.

Evidence is also indirect when intermediate or surrogate outcomes are used instead of the intended health outcome. This form of indirectness is presumed in systematic reviews that explicitly include intermediate or surrogate outcomes in the study selection criteria. For example, to determine whether statin drugs reduce cardiovascular disease events in high-risk patients, selection criteria could include studies with blood lipid measures as outcomes in addition to studies with cardiovascular event outcomes. The inclusion of studies with intermediate measures could be justified by the established relationship between lipid levels and cardiovascular events. However, this evidence would be considered indirect if cardiovascular events are the specified health outcomes of the research question. This approach becomes more complicated when the relationships between the intermediate and health outcomes are not established, or the relationships vary across interventions or populations. For example, directness would be more difficult to determine when the relationships between lipid levels and cardiovascular outcomes differ between the various types of statin drugs. These issues need to be considered early in the systematic review process and involve technical and clinical experts.

The applicability of individual studies was described in Chapter 8 as the extent to which the effects of an intervention observed in a study are likely to reflect the expected results when the intervention is applied under real-world conditions.31 Other terms used when referring to applicability include external validity, generalizability, and relevance. Many of the issues considered when determining directness also concern applicability, and the GRADE method refers to applicability as another dimension of directness.18 A body of evidence is applicable if it focuses on the specific condition, patient population, intervention, comparators, and health outcomes that are the focus of the systematic review’s research protocol. Applicability is considered separately from strength of evidence in the AHRQ EPC method.
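For the Drug A versus Drug B scenario above, one common statistical approach (the Bucher adjusted indirect comparison) subtracts the two placebo-controlled log relative risks and adds their standard errors in quadrature. The sketch below illustrates the arithmetic with made-up relative risks; it is a hedged example, not a result from any study cited in this chapter.

```python
import math

def indirect_rr(rr_a_placebo, ci_a, rr_b_placebo, ci_b, z=1.96):
    """Bucher-style indirect A-vs-B estimate from two placebo comparisons.
    Standard errors are recovered from the 95% CIs and added in quadrature
    on the log scale."""
    log_rr = math.log(rr_a_placebo) - math.log(rr_b_placebo)
    se_a = (math.log(ci_a[1]) - math.log(ci_a[0])) / (2 * z)
    se_b = (math.log(ci_b[1]) - math.log(ci_b[0])) / (2 * z)
    se = math.sqrt(se_a**2 + se_b**2)
    return (math.exp(log_rr),
            math.exp(log_rr - z * se),
            math.exp(log_rr + z * se))

# Hypothetical inputs: Drug A vs placebo RR 0.70 (0.55-0.89);
# Drug B vs placebo RR 0.85 (0.70-1.03).
rr, lo, hi = indirect_rr(0.70, (0.55, 0.89), 0.85, (0.70, 1.03))
print(f"Indirect RR (A vs B): {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
# -> Indirect RR (A vs B): 0.82 (95% CI 0.60-1.12)
```

Note how the indirect confidence interval is wider than either trial's own interval; this loss of precision is one reason indirect evidence is rated lower for directness.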
Consistency

Consistency refers to the degree of similarity of results of different studies in a body of evidence.3,17 This concept is important because a body of evidence is stronger when studies agree with each other. The AHRQ EPC method distinguishes between consistency in direction of effect and magnitude of effect and requires the systematic reviewer to determine when the magnitude of effect is important based on the underlying research questions.

In assessing consistency of the direction of effect, the primary consideration is whether the point estimates are on the same side of the point of no effect (1.0 for relative measures,
■■ FIGURE 12.1 Primary prevention trials of tamoxifen versus placebo are consistent for venous thromboembolic and coronary heart disease outcomes. (Left panel: consistent same direction of effects, venous thromboembolic events; right panel: consistent overlapping confidence intervals, coronary heart disease events. Forest plots show risk ratios with 95% CIs for tamoxifen versus placebo on a log scale from 0.125 to 2.0.)
0 for absolute measures). If a meta-analysis has been done, visually inspecting the alignment of results in the forest plot is an easy first step (Figure 12.1). The overlap of the 95% confidence intervals is also useful in determining consistency, because confidence intervals reflect the possible range of true point estimates. For example, if a minority of studies has point estimates that are not consistent with the direction of the other studies, but their 95% confidence intervals overlap with the other studies, findings could be considered consistent. Greater overlap indicates greater consistency.

When equivalence (noninferiority) is being determined, systematic reviewers must first determine the minimal important difference between the outcomes for two competing interventions that is considered clinically meaningful. For example, if two treatments for depression are being examined, what difference in the change in symptom score is clinically meaningful? Differences between interventions that do not meet this threshold (ie, are smaller than the minimal important difference) indicate equivalence. Determining consistency for equivalence trials is similar to other efficacy trials except that the studies are compared based on the minimal important difference instead of the point of no effect. The minimal important difference is based on clinical interpretation and must be prespecified, although it is not always possible to define. In these cases, consistency cannot be evaluated.

The determination of statistical heterogeneity in a meta-analysis of studies can be used to assess consistency using the chi-square test (with its Q-statistic) and the I² measure of inconsistency (Chapter 11). High levels of statistical heterogeneity indicate less consistency. Guidance from the Cochrane Collaboration suggests that I² values of 25% indicate low inconsistency, 50% moderate inconsistency, and 75% high inconsistency.32

Rating consistency is straightforward when the studies clearly agree. However, determining thresholds for various degrees of inconsistency is mostly subjective and requires consideration of statistical as well as clinical factors. For example, initial ratings of consistency may depend on magnitudes of effect, directions of effect, and sizes and overlap of confidence intervals. When inconsistency is identified based on these parameters, the studies can then be examined to determine underlying reasons, such as variations in the intervention (eg, dosing, duration, intensity) or prognostic characteristics of the population (eg, age, severity of illness). If consistency improves after examining these types of variations, the evidence could be evaluated by relevant subgroups.
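As a worked illustration of the heterogeneity statistics cited above, the following sketch computes Cochran's Q and the Higgins I² measure from a hypothetical set of per-study log risk ratios and standard errors using fixed-effect inverse-variance weights. The study estimates are invented for illustration.

```python
import math

log_rrs = [-0.42, -0.31, -0.55, 0.10]   # hypothetical log risk ratios
ses     = [0.15, 0.20, 0.25, 0.18]      # hypothetical standard errors

weights = [1 / se**2 for se in ses]                       # inverse-variance weights
pooled  = sum(w * y for w, y in zip(weights, log_rrs)) / sum(weights)
q_stat  = sum(w * (y - pooled)**2 for w, y in zip(weights, log_rrs))
df      = len(log_rrs) - 1
i2      = max(0.0, (q_stat - df) / q_stat) * 100          # Higgins I^2

print(f"Q = {q_stat:.2f} on {df} df; I^2 = {i2:.0f}%")
# -> Q = 6.52 on 3 df; I^2 = 54%, moderate inconsistency by the
#    Cochrane benchmarks quoted above (25% low, 50% moderate, 75% high).
```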
■■ FIGURE 12.2 Optimal information size calculations. Number of events given alpha of 0.05 and beta of 0.20 for varying control event rates and RRR (relative risk reduction) of 20%, 25%, and 30%. (Curves plot the total number of events needed, from 0 to 700, against the control group event rate, from 0.0 to 1.0; for any chosen relative risk reduction, the available evidence meets optimal information size criteria if the number of events is above the associated line.) Source: Guyatt GH, Oxman AD, Kunz R, et al. GRADE guidelines: 6. Rating the quality of evidence—imprecision. J Clin Epidemiol. 2011;64(12):1283–1293. Reprinted with permission.
Precision

Precision is the degree of certainty surrounding an estimate of effect for a specific outcome.16 For a meta-analysis of studies, precision is reflected in the width of the confidence interval. For studies that cannot be combined in a meta-analysis, precision can be determined qualitatively.

The first step in assessing precision is to determine whether the studies in a systematic review collectively have adequate power to show a statistically significant difference where one exists.33 Adequate power is estimated from the number of participants enrolled in the studies and the number of outcome events. The GRADE method refers to this as the optimal information size (OIS), which is similar to a sample size calculation for an individual trial. A sample size calculation estimates the number of study subjects required for a prespecified effect size, whereas the OIS estimates the number of study subjects required for a prespecified number of outcome events. The required number of outcome events varies with the baseline risk of the outcome and the prespecified effect size, but 200 to 300 events are typically required (Figure 12.2).16

If the OIS is met, precision can be determined from the confidence interval of an estimate from a meta-analysis of studies, or from studies with very large sample sizes and adequate follow-up periods. For dichotomous outcomes, the systematic reviewer must determine acceptable thresholds for an appreciable benefit and an appreciable harm, for example a 25% increase or decrease in relative risk. Confidence intervals that extend beyond either threshold and cross the line of no effect (ie, are not statistically significant) are imprecise (Figure 12.3). Confidence intervals that reflect a statistically significant difference between groups are considered precise. A confidence interval that is not statistically significant but does not cross the preestablished threshold for appreciable benefit or harm is also precise.

For continuous outcomes, thresholds for benefits or harms are determined by the minimum change in the outcome that is clinically important, such as a change in score on a symptom scale (ie, minimal important difference). Similar to the approach for dichotomous outcomes, confidence intervals for continuous outcomes that are not statistically significant and cross the minimal important difference thresholds are not precise. Thresholds for appreciable benefits and harms and minimal important differences should be prespecified, and the rationale for these decisions should be clearly described in the systematic review.
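The OIS calculation behind Figure 12.2 is essentially the familiar two-proportion sample size formula re-expressed as a required number of outcome events. The sketch below illustrates the arithmetic under the figure's stated alpha of 0.05 (two-sided) and beta of 0.20; the helper function and example inputs are illustrative assumptions, not a prescribed method.

```python
import math

Z_ALPHA = 1.96    # two-sided alpha = 0.05
Z_BETA  = 0.8416  # beta = 0.20 (80% power)

def required_events(control_rate: float, rrr: float) -> int:
    """Approximate total events needed across both arms of a two-arm
    comparison, from the standard two-proportion sample size formula."""
    p1 = control_rate
    p2 = control_rate * (1 - rrr)          # intervention event rate
    n_per_group = ((Z_ALPHA + Z_BETA) ** 2
                   * (p1 * (1 - p1) + p2 * (1 - p2))
                   / (p1 - p2) ** 2)
    return math.ceil(n_per_group * (p1 + p2))   # expected events, both arms

# For a 20% control event rate and a 25% relative risk reduction:
print(required_events(0.20, 0.25))
# -> 316 events with these inputs, broadly consistent with the
#    "200 to 300 events typically required" noted above.
```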
■■ FIGURE 12.3 Primary prevention trials of tamoxifen versus placebo are precise for estrogen receptor positive breast cancer but imprecise for noninvasive breast cancer outcomes. The dashed lines indicate thresholds for appreciable benefits and harms. (Forest plots show risk ratios with 95% CIs for tamoxifen versus placebo on a log scale from 0.125 to 2.0.)
Reporting Bias

Reporting biases include publication bias (the entire study is missing),15 outcome reporting bias (specific outcomes that are measured are not reported), and analysis reporting bias (specific preplanned analyses are conducted but not reported).3 Reporting bias is important to determine because it potentially reduces the strength of the evidence. The measurement and evaluation of these and other types of reporting biases for individual studies are described in Chapter 8, although methods are changing as research in this area continues to develop. In the AHRQ EPC method, the choices for rating the risk of reporting bias are suspected and undetected. In GRADE, publication bias is considered as a separate domain and is also rated as suspected or undetected, whereas other reporting biases are considered under study limitations.

Observational Studies

Observational studies are considered inherently biased and given less weight when determining the strength of evidence. In GRADE, a body of evidence consisting of observational studies begins with a low rating of strength of evidence. However, ratings can be upgraded for studies with characteristics that reduce observational design limitations and increase the believability of the findings. These include studies demonstrating a large magnitude of effect (strength of association) or a dose–response association, and studies where all plausible biases and confounders would theoretically reduce the reported treatment effect.3,19 In all three situations, the exact point estimate may be inaccurate, but the effect itself is likely to be real.

In studies with a large magnitude of effect, the size of the effect may overcome potential bias introduced by the study design. Although it is possible that results could be an overestimation because of bias in these studies, the real effect is likely to be significant regardless of bias. For example, observational studies indicate that smoking cigarettes is associated with a 9- to 10-fold increase in lung cancer.34 This magnitude of effect is so large that the association is believable and ultimately drove changes in health policy and practice. GRADE suggests that relative risks of more than 2 or less than 0.5 constitute large effects, and studies with these magnitudes of effects can be upgraded by one level (ie, from very low to low). Relative risks of more than 5 or less than 0.2 are considered very large, and these studies can be upgraded by two levels (ie, from very low to moderate).19
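The GRADE large-effect rule quoted above reduces to two relative risk thresholds, as in this small illustrative sketch (the function name is hypothetical):

```python
def magnitude_upgrade(rr: float) -> int:
    """Levels to upgrade observational evidence for magnitude of effect,
    per the GRADE thresholds described in the text."""
    if rr > 5 or rr < 0.2:
        return 2   # very large effect: upgrade two levels
    if rr > 2 or rr < 0.5:
        return 1   # large effect: upgrade one level
    return 0       # no upgrade

# The smoking and lung cancer example above (roughly 9- to 10-fold risk):
print(magnitude_upgrade(9.5))   # -> 2 (eg, from very low to moderate)
```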
In a similar way, studies demonstrating dose–response associations also overcome bias. In these studies, effects are greater with increasingly higher levels of interventions or exposures. For example, in studies of smoking and lung cancer, relative risks are greatest for smokers with the highest packs per day of use compared to smokers with lower use.

The concept of plausible confounders describes the situation where confounding factors lead to an underestimate of the effect of an intervention. These biases create a situation that makes it less likely to find an effect, and an effect found under these circumstances is more likely to be real. For example, the association of smoking and heart disease could be confounded in an observational study comparing smokers and nonsmokers if age was not considered, because age is also associated with heart disease. However, if the smokers in this study were younger than the nonsmokers, the association between smoking and heart disease would be biased against an effect. The estimate of the effect of smoking from this study could be considered an underestimate of the true effect.

To increase the rating of strength of evidence for observational studies based on a large magnitude of effect (strength of association), dose–response association, or plausible confounders, the body of evidence must adequately meet criteria for the other domains. Therefore, it is intentionally difficult for observational evidence to be rated as high using the GRADE method.

The AHRQ EPC method rates observational studies according to how the domain criteria are fulfilled without starting at a specific level based on study design. In this approach, observational evidence for harms is considered stronger evidence than for benefits.3 In some cases, observational studies of harms are superior to RCTs because they represent more real-world situations and can include much larger sample sizes.

Overall Assessment

The overall rating of the strength of evidence considers the assessment of all the individual domains. In the GRADE and AHRQ EPC methods, final ratings include high, moderate, low, and either very low or insufficient (Table 12.3).3,20 The overall ratings are not necessarily cumulative. Issues contributing to the ratings of individual domains may overlap, and specific domains may be given more or less weight in individual situations. Additionally, the overall rating must take into account all bodies of evidence relating to the outcome being assessed, including direct and indirect evidence, trials and observational studies, and other relevant evidence. The final rating is qualitative and dependent on an in-depth knowledge of the evidence and understanding of the domains.35 However, despite efforts to standardize the rating process, it has been found to be subjective and highly variable across systematic reviewers.36 Dual, independent assessment of ratings and descriptions of their rationale improve the transparency of this process.

Information relating to the body of evidence, domain ratings, and the overall strength of evidence rating are typically summarized in tables.5 Although the structure and format of the summary tables may vary, tables that concisely summarize key information in a clear and transparent manner are most useful to users. An example of a strength of evidence table outlining the domains and ratings for a systematic review is described in Box 12.1.37
BOX 12.1 Example of Determining Strength of Evidence Grades for Studies of Patient Navigation to Increase Cancer Screening

A systematic review of the effectiveness of health system interventions to reduce disparities in preventive health services included studies of patient navigation versus usual care to increase rates of screening for colorectal, breast, cervical, and lung cancer.37,38 Patient
navigation refers to services that provide personal guidance through the healthcare system to meet an individual patient’s needs during the course of care. In the cancer screening studies, patient navigation included any of an array of services, such as education, scheduling, transportation, information, financial, referral, and reminders. A strength of evidence summary is outlined in Table 12.4. Using the AHRQ EPC approach, the strength of evidence for each type of cancer screening was assessed for: study limitations (low, medium, or high level); consistency (consistent, inconsistent, or unknown/none); directness (direct or indirect); precision (precise or imprecise); and reporting bias (suspected or undetected).

Study limitations: The study limitations domain was rated medium for colorectal, breast, and cervical cancer screening because most included studies were fair-quality RCTs with some methodological deficiencies. The single lung cancer screening trial was given a poor-quality rating because of lack of reporting on randomization and allocation concealment, unclear masking of assessors or patients, and large loss to follow-up. This led to a domain rating of high study limitations.

Consistency: Most studies for all types of cancer screening showed increased screening rates for intervention versus usual care groups. Consequently, colorectal, breast, and cervical cancer screening studies achieved a rating of consistent for the consistency domain. This rating was none for the single lung cancer screening study.

Directness: This domain was rated direct for all types of cancer screening because studies closely matched the PICOTS elements defined by the key questions. Most studies were RCTs that directly compared patient navigation with usual care and reported screening rates as the primary outcome measure.

Precision: The precision domain was rated precise for colorectal, breast, and cervical cancer studies. Meta-analyses of RCTs of colorectal cancer screening (RR 1.64; 95% CI, 1.42-1.92; 22 trials) and breast cancer screening (RR 1.50; 95% CI, 1.22-1.91; 10 trials) indicated statistically significant effects with narrow confidence intervals. Although studies of cervical cancer screening could not be combined because of statistical heterogeneity, results were statistically significant and effect sizes were clinically relevant. The domain for lung cancer was rated imprecise because of the uncertainty of the results of the single screening study in the context of its high loss to follow-up.

Reporting bias: Although reporting bias is difficult to assess, the investigators did not detect small study effects for the meta-analyses and did not suspect selective outcome reporting because the main outcome, the screening rate, was prespecified for each study. This domain was rated undetected for all types of cancer screening.
Using the definitions in Table 12.3, the strength of evidence for each type of cancer screening was assigned an overall grade of high, moderate, low, or insufficient by evaluating and weighing the combined results of the above domains, and by applying the systematic reviewers’ comprehensive understanding of the studies. The strength of evidence was graded high for colorectal cancer screening; moderate for breast and cervical cancer screening; and low for lung cancer screening. The large numbers of trials and participants and the extensive sensitivity analysis of the meta-analysis placed the colorectal cancer screening evidence at a higher level than the other types. Results of the single poor-quality RCT for lung cancer screening showed higher screening rates for the intervention versus usual care group, demonstrating an effect consistent with the other types of screening. Although a low grade was assigned, an insufficient grade could also be justified.
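As a hedged worked check of the precision judgments in Box 12.1, the sketch below applies the dichotomous-outcome rules from the Precision section to the pooled colorectal screening estimate reported in the box (RR 1.64; 95% CI, 1.42-1.92). The 1.25 and 0.80 thresholds for appreciable benefit and harm are assumptions for illustration, not values taken from the review.

```python
def classify_precision(ci_lo: float, ci_hi: float,
                       benefit: float = 1.25, harm: float = 0.80) -> str:
    """Apply the text's rules for dichotomous outcomes: a CI is imprecise
    only if it crosses the line of no effect (1.0) AND extends beyond a
    prespecified appreciable-benefit or appreciable-harm threshold."""
    crosses_null = ci_lo < 1.0 < ci_hi
    crosses_threshold = ci_lo < harm or ci_hi > benefit
    return "imprecise" if (crosses_null and crosses_threshold) else "precise"

# Colorectal screening result from Box 12.1: statistically significant,
# so the domain rating is precise regardless of the thresholds.
print(classify_precision(1.42, 1.92))   # -> "precise"
```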