
Athletic Training & Evidence-Based Practice

Online and print library sources for athletic trainers using evidence-based practice


3. Appraise that evidence for its validity (closeness to the truth) and applicability (usefulness in clinical practice)

We have now identified current information which can answer our clinical question. The next step is to read the article and evaluate the study. There are three basic questions that need to be answered for every type of study:

  • Are the results of the study valid?
  • What are the results?
  • Will the results help in caring for my athlete?

Scales to Help You Appraise

The health care community has developed scales to help you appraise your articles.  

Appraising Study Types

This section provides other questions that may be helpful as you appraise the research. Because study types have different features, you will not use the same validity criteria for all articles. Click below to review questions to appraise various articles.

Are the results of this article valid?

1. Did the review explicitly address a sensible question?

The systematic review should address a specific question that indicates the patient problem, the exposure and one or more outcomes. General reviews, which usually do not address specific questions, may be too broad to provide an answer to the clinical question for which you are seeking information.

2. Was the search for relevant studies detailed and exhaustive?

Researchers should conduct a thorough search of appropriate bibliographic databases. The databases and search strategies should be outlined in the methodology section. Researchers should also show evidence of searching for non-published evidence by contacting experts in the field. Cited references at the end of articles should also be checked.

3. Were the primary studies of high methodological quality?

Researchers should evaluate the validity of each study included in the systematic review. The same EBP criteria used to critically appraise studies should be used to evaluate studies to be included in the systematic review. Differences in study results may be explained by differences in methodology and study design.

4. Were selection and assessments of the included studies reproducible?

More than one researcher should evaluate each study and make decisions about its validity and inclusion. Bias (systematic errors) and mistakes (random errors) can be avoided when judgment is shared. A third reviewer should be available to break a tie vote.

Key issues for Systematic Reviews:
  • focused question
  • thorough literature search
  • include validated studies
  • selection of studies reproducible


What are the results?

Were the results similar from study to study?
How similar were the point estimates?
Do confidence intervals overlap between studies?

What are the overall results of the review?
Were results weighted both quantitatively and qualitatively in summary estimates?

How precise were the results?
What is the confidence interval for the summary or cumulative effect size?
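The quantitative weighting mentioned above is commonly inverse-variance weighting: each study's effect estimate is weighted by the inverse of its variance, so more precise studies count for more in the summary estimate. A minimal fixed-effect sketch, with made-up effect sizes and standard errors:

```python
import math

# Hypothetical per-study effect sizes (log risk ratios) and standard errors
effects = [0.25, 0.40, 0.10]
ses = [0.10, 0.20, 0.15]

weights = [1 / se**2 for se in ses]  # inverse-variance weights
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
se_pooled = math.sqrt(1 / sum(weights))  # pooled SE is smaller than any single study's
ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)  # 95% CI for the summary
```

A real meta-analysis would also assess heterogeneity (e.g., Cochran's Q, I²) before trusting a fixed-effect summary.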


More information on reading forest plots:

Deciphering a forest plot for a systematic review/meta-analysis (UNC)

Ried K. Interpreting and understanding meta-analysis graphs: a practical
guide. Aust Fam Physician. 2006 Aug;35(8):635-8. PubMed PMID: 16894442.

Greenhalgh T. Papers that summarise other papers (systematic
reviews and meta-analyses). BMJ. 1997 Sep 13;315(7109):672-5.
PubMed PMID: 9310574.


How can I apply the results to patient care?

Were all patient-important outcomes considered?
Did the review omit outcomes that could change decisions?

Are any postulated subgroup effects credible?
Were subgroup differences postulated before data analysis?
Were subgroup differences consistent across studies?

What is the overall quality of the evidence?
Were prevailing study design, size, and conduct reflected in a summary of the quality of evidence?

Are the benefits worth the costs and potential risks?
Does the cumulative effect size cross a test or therapeutic threshold?

Based on: Guyatt G, Rennie D, Meade MO, Cook DJ. Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. 2nd ed. 2008.

Evaluating the Validity of a Harm Study


Are the results of this article valid?


FOR COHORT STUDIES:   Aside from the exposure of interest, did the exposed and control groups start and finish with the same risk for the outcome?

1. Were patients similar for prognostic factors that are known to be associated with the outcome (or did statistical adjustment level the playing field)?
The two groups, those exposed to the harm and those not exposed, must begin with the same prognosis. The characteristics of the exposed and non-exposed patients need to be carefully documented and their similarity (except for the exposure) needs to be demonstrated. The choice of comparison groups has a significant influence on the credibility of the study results. The researchers should identify an appropriate control population before making a strong inference about a harmful agent. The two groups should have the same baseline characteristics. If there are differences, investigators should use statistical techniques to adjust or correct for them.

2. Were the circumstances and methods for detecting the outcome similar? 
In cohort studies determination of the outcome is critical.  It is important to define the outcome and use objective measures to avoid possible bias.  Detection bias may be an issue for these studies, as unblinded researchers may look deeper to detect disease or an outcome.

3. Was follow-up sufficiently complete? 
Patients unavailable for complete follow-up may compromise the validity of the research because often these patients have very different outcomes than those that stayed with the study. This information must be factored into the study results.


FOR CASE CONTROL STUDIES:  Did  the cases and control group have the same risk (chance) of being exposed in the past?

1. Were cases and controls similar with respect to the indication or circumstances that would lead to exposure?
The characteristics of the cases and controls need to be carefully documented and their similarity needs to be demonstrated. The choice of comparison groups has a significant influence on the credibility of the study results. The researchers should identify an appropriate control population that would be eligible or likely to have the same exposure as the cases.

2. Were the circumstances and methods for determining exposure similar for cases and controls? 
In a case control study determination of the exposure is critical.  The exposure in the two groups should be identified by the same method. The identification should avoid any kind of bias, such as recall bias. Sometimes using objective data, such as medical records, or blinding the interviewer can help eliminate bias.


Key issues for Harm Studies:
  • similarity of comparison groups
  • outcomes and exposures measured same for both groups
  • follow-up sufficiently complete (80% or better)


What are the results?

How strong is the association between exposure and outcome?
* What is the risk ratio or odds ratio?
* Is there a dose-response relationship between exposure and outcome?

How precise was the estimate of the risk?
* What is the confidence interval for the relative risk or odds ratio?


Strength of inference:

For RCT or Prospective cohort studies: Relative Risk



                   Outcome present    Outcome not present
Exposure Yes             a                    b
Exposure No              c                    d


Relative Risk (RR) = [a / (a + b)] / [c / (c + d)]

RR is the risk of the outcome in the exposed group divided by the risk of the outcome in the unexposed group:

RR = (exposed outcome yes / all exposed) / (not exposed outcome yes / all not exposed)

Example: “RR of 3.0 means that the outcome occurs 3 times more often in those exposed versus unexposed.”
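In code, the RR calculation from the 2×2 table looks like this (the counts are invented for illustration, not from any real study):

```python
# Illustrative 2x2 counts (not from a real study)
a, b = 30, 70    # exposed: outcome present / outcome not present
c, d = 10, 90    # not exposed: outcome present / outcome not present

risk_exposed = a / (a + b)        # 30/100 = 0.30
risk_unexposed = c / (c + d)      # 10/100 = 0.10
relative_risk = risk_exposed / risk_unexposed   # 0.30 / 0.10 = 3.0
```

With these counts the RR is 3.0: the outcome occurred three times as often in the exposed group.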


For case-control or retrospective studies: Odds Ratio



                   Outcome present    Outcome not present
Exposure Yes             a                    b
Exposure No              c                    d


Odds Ratio (OR) = (a / c) / (b / d)

OR is the odds of previous exposure in a case divided by the odds of exposure in a control patient:

OR = (exposed - outcome yes / not exposed - outcome yes) / (exposed - outcome no / not exposed - outcome no)

Example: “OR of 3.0 means that cases were 3 times more likely to have been exposed than were control patients.”
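The same illustrative 2×2 counts give the OR:

```python
# Illustrative 2x2 counts (not from a real study)
a, b = 30, 70    # exposed: outcome present / outcome not present
c, d = 10, 90    # not exposed: outcome present / outcome not present

odds_ratio = (a / c) / (b / d)    # (30/10) / (70/90), roughly 3.86
```

Note that the OR (about 3.86) is larger than the RR (3.0) computed from the same counts; the two converge only when the outcome is rare.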


Confidence Intervals are a measure of the precision of a study's results. For example, in "36 [95% CI 27-51]", the 95% confidence interval means that if the same study were repeated many times, about 95% of the intervals calculated in this way would contain the true value; 27-51 is the interval from this study. Wider intervals indicate lower precision; narrower intervals indicate greater precision.
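For an odds ratio, a 95% CI is usually computed on the log scale. A sketch using the normal approximation, with the same invented counts as above:

```python
import math

# Illustrative 2x2 counts (not from a real study)
a, b, c, d = 30, 70, 10, 90

or_ = (a * d) / (b * c)                        # odds ratio, roughly 3.86
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of ln(OR)
low = math.exp(math.log(or_) - 1.96 * se_log_or)
high = math.exp(math.log(or_) + 1.96 * se_log_or)  # 95% CI roughly (1.77, 8.42)
```

Because this interval excludes 1.0, these illustrative data would suggest a real association between exposure and outcome.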

Confounding Variable is one whose influence distorts the true relationship between a potential risk factor and the clinical outcome of interest.

Read more on odds ratios: The odds ratio Douglas G Altman & J Martin Bland BMJ 2000;320:1468 (27 May)

Watch more on odds ratios:  Understanding odds ratio with Gordon Guyatt. (21 minutes.)


How can I apply the results to patient care?

Were the study subjects similar to your patients or population?
Is your patient so different from those included in the study that the results may not apply?

Was the follow-up sufficiently long?
Were study participants followed-up long enough for important harmful effects to be detected?

Is the exposure similar to what might occur in your patient?
Are there important differences in exposures (dose, duration, etc) for your patients?

What is the magnitude of the risk?
What level of baseline risk for the harm is amplified by the exposure studied?

Are there any benefits known to be associated with the exposure?
What is the balance between benefits and harms for patients like yours?


Source: Guyatt G, Rennie D, Meade MO, Cook DJ. Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. 2nd ed. 2008.

Evaluating the Validity of a Diagnostic Test Study


Are the results valid?

1. Did participating patients present a diagnostic dilemma?

The group of patients in which the test was conducted should include patients with a high, medium and low probability of having the target disease. The clinical usefulness of a test is demonstrated in its ability to distinguish between obvious illness and those cases where it is not so obvious or where the diagnosis might otherwise be confused. The patients in the study should resemble what might be expected in a clinical practice.

2. Did investigators compare the test to an appropriate, independent reference standard?

The reference (or gold)  standard refers to the commonly accepted proof that the target disorder is present or not present. The reference standard might be an autopsy or biopsy. The reference standard provides objective criteria (e.g., laboratory test not requiring subjective interpretation) or a current clinical standard (e.g., a venogram for deep venous thrombosis) for diagnosis. Sometimes there may not be a widely accepted reference standard. The author will then need to clearly justify their selection of the reference test. 

3. Were those interpreting the test and reference standard blind to the other results?

To avoid potential bias, those conducting the test should not know or be aware of the results of the other test.

4. Did the investigators perform the same reference standard to all patients regardless of the results of the test under investigation?

Researchers should conduct both tests (the study test and the reference standard) on all patients in the study regardless of the results of the test in question. Researchers should not be tempted to forgo either test based on the results of only one of the tests. Nor should they apply a different reference standard to patients with a negative result on the study test.

Key issues for Diagnostic Studies:

  • diagnostic uncertainty
  • blind comparison to gold standard
  • each patient gets both tests


What are the results?



                        Reference Standard      Reference Standard
                        Disease Positive        Disease Negative

Study Test Positive     True Positive (TP)      False Positive (FP)
Study Test Negative     False Negative (FN)     True Negative (TN)


Sensitivity = true positives / all disease positives

measures the proportion of patients with the disease who also test positive for the disease in this study. It is the probability that a person with the disease will have a positive test result. 

Specificity = true negatives / all disease negatives

measures the proportion of patients without the disease who also test negative for the disease in this study. It is the probability that a person without the disease will have a negative test result. 
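Both calculations can be sketched from the 2×2 table above (counts are invented for illustration):

```python
# Illustrative counts from a hypothetical diagnostic study
tp, fp = 90, 20    # test positive: disease present / disease absent
fn, tn = 10, 80    # test negative: disease present / disease absent

sensitivity = tp / (tp + fn)   # 90/100 = 0.90 -> 90% of diseased patients test positive
specificity = tn / (tn + fp)   # 80/100 = 0.80 -> 80% of disease-free patients test negative
```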

Sensitivity and specificity are characteristics of the test but do not provide enough information for the clinician to act on the test results.  Likelihood ratios can be used to help adapt the results of a study to specific patients. They help determine the probability of disease in a patient.

Likelihood ratios (LR):

LR + = positive test in patients with disease / positive test in patients without disease

LR - =  negative test in patients with disease / negative test in patients without disease

Likelihood ratios indicate the likelihood that a given test result would be expected in a patient with the target disorder compared to the likelihood that the same result would be expected in a patient without that disorder.

Likelihood ratio of a positive test result (LR+) increases the odds of having the disease after a positive test result.

Likelihood ratio of a negative test result (LR-) decreases the odds of having the disease after a negative test result.
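Equivalently, LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. A sketch using the illustrative values of 0.90 and 0.80:

```python
sensitivity, specificity = 0.90, 0.80   # illustrative values, not from a real study

lr_positive = sensitivity / (1 - specificity)   # 0.90 / 0.20 = 4.5
lr_negative = (1 - sensitivity) / specificity   # 0.10 / 0.80 = 0.125
```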


How much do LRs change disease likelihood?

LRs greater than 10 or less than 0.1 cause large changes
LRs 5 – 10 or 0.1 – 0.2 cause moderate changes
LRs 2 – 5 or 0.2 – 0.5 cause small changes
LRs 1 – 2 or 0.5 – 1 cause tiny (rarely important) changes
LRs = 1.0 cause no change at all
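Applying an LR means converting the pretest probability to odds, multiplying by the LR, and converting back to a probability — the calculation a nomogram performs graphically. A minimal sketch (the pretest probability and LR are illustrative):

```python
def post_test_probability(pretest_prob, lr):
    """Convert pretest probability to odds, apply the likelihood ratio,
    and convert the posttest odds back to a probability."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# e.g. a 30% pretest probability and an LR+ of 4.5
prob = post_test_probability(0.30, 4.5)   # roughly 0.66
```

A positive test with this LR would raise the probability of disease from 30% to about 66% — a moderate, potentially decision-changing shift.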


Want to know more about nomograms? See How to use a nomogram (pretest-probability) with a likelihood ratio

More about likelihood ratios: Diagnostic tests 4: likelihood ratios. JJ Deeks & Douglas G Altman BMJ 2004 329:168-169


How can I apply the results to patient care?

Will the reproducibility of the test result and its interpretation be satisfactory in your clinical setting?
Does the test yield the same result when reapplied to stable participants?
Do different observers agree about the test results?

Are the study results applicable to the patients in your practice?
Does the test perform differently (different LRs) for different severities of disease?
Does the test perform differently for populations with different mixes of competing conditions?

Will the test results change your management strategy?
What are the test and treatment thresholds for the health condition to be detected?
Are the test LRs high or low enough to shift posttest probability across a test or treatment threshold?

Will patients be better off as a result of the test?
Will patient care differ for different test results?
Will the anticipated changes in care do more good than harm?

Based on: Guyatt G, Rennie D, Meade MO, Cook DJ. Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. 2nd ed. 2008.

Evaluating the Validity of a Prognosis Study


Are the results valid?

1. Was the sample of patients representative?

The patient groups should be clearly defined and representative of the spectrum of disease found in most practices. Failure to clearly define the patients who entered the study increases the risk that the sample is unrepresentative. To help you decide about the appropriateness of the sample, look for a clear description of which patients were included and excluded from the study. The way the sample was selected should be clearly specified, along with the objective criteria used to diagnose the patients with the disorder.

2. Were the patients sufficiently homogeneous with respect to prognostic factors?

Prognostic factors are characteristics of a particular patient that can be used to more accurately predict the course of a disease. These factors, which can be demographic (age, gender, race, etc.) or disease specific (e.g., stage of a tumor or disease) or comorbid (other conditions existing in the patient at the same time), can also help predict good or bad outcomes.

In comparing the prognosis of the 2 study groups, researchers should consider whether or not the patients' clinical characteristics are similar. It may be that adjustments have to be made based on prognostic factors to get a true picture of the clinical outcome. This may require clinical experience or knowledge of the underlying biology to determine if all relevant factors were considered.

3. Was the follow-up sufficiently complete?

Follow-up should be complete and all patients accounted for at the end of the study. Patients who are lost to follow-up may often suffer the adverse outcome of interest and therefore, if not accounted for, may bias the results of the study. Determining if the number of patients lost to follow up affects the validity depends on the proportion of patients lost and the proportion of patients suffering the adverse outcome.

Patients should be followed until they fully recover or one of the disease outcomes occur. The follow-up should be long enough to develop a valid picture of the extent of the outcome of interest. Follow-up should include at least 80% of participants until the occurrence of a major study end point or to the end of the study.

4. Were objective and unbiased outcome criteria used?

Some outcomes are clearly defined, such as death or full recovery. Between these extremes lies a wide range of outcomes that may be less clearly defined. Investigators should establish specific criteria that define each possible outcome of the disease and use these same criteria during patient follow-up. Investigators making judgments about the clinical outcomes may have to be “blinded” to the patient characteristics and prognostic factors in order to eliminate possible bias in their observations.


Key issues for Prognosis Studies:
  • well-defined sample
  • similar prognosis
  • follow-up complete
  • objective and unbiased outcome criteria



What are the results?


How likely are the outcomes over time?

  • What are the event rates at different points in time?
  • If event rates vary with time, are the results shown using a survival curve?

How precise are the estimates of likelihood?

  • What is the confidence interval for the principal event rate?
  • How do confidence intervals change over time?

Prognostic Results are the numbers of events that occur over time, expressed in:

  • absolute terms: e.g. 5 year survival rate
  • relative terms: e.g. risk from prognostic factor
  • survival curves: cumulative events over time
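These three ways of expressing prognostic results can be illustrated with a simple life-table calculation (all numbers are invented):

```python
# Hypothetical follow-up: patients at risk at the start of each year,
# and events (e.g. deaths) during that year
at_risk = [100, 90, 78, 70]   # under observation at the start of years 1-4
events = [10, 12, 8, 5]       # events during each year

surviving = 1.0
survival_curve = []
for n, e in zip(at_risk, events):
    surviving *= (n - e) / n           # proportion event-free through this interval
    survival_curve.append(round(surviving, 2))
# survival_curve traces cumulative event-free survival year by year
```

The final entry is the absolute survival rate at the end of follow-up; plotting the list over time gives the survival curve.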

Evaluating the Validity of a Therapy Study


Are the results of the study valid?

1. Were patients randomized? The assignment of patients to either group (treatment or control) must be done by a random allocation. This might include a coin toss (heads to treatment/tails to control) or use of randomization tables, often computer generated. Research has shown that random allocation comes closest to ensuring the creation of groups of patients who will be similar in their risk of the events you hope to prevent. Randomization balances the groups for known prognostic factors (such as age, weight, gender, etc.) and unknown prognostic factors (such as compliance, genetics, socioeconomics, etc.). This reduces the chance of over-representation of any one characteristic within the study groups.
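Computer-generated random allocation can be as simple as shuffling the enrollment list; a toy sketch (patient labels are invented):

```python
import random

random.seed(7)   # fixed seed only so this illustration is reproducible

patients = [f"patient_{i:02d}" for i in range(1, 21)]
random.shuffle(patients)                          # random allocation sequence
treatment, control = patients[:10], patients[10:]  # split into two equal groups
```

Real trials build on the same idea with safeguards such as block or stratified randomization and concealed allocation.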


2. Was group allocation concealed? The randomization sequence should be concealed from the clinicians and researchers of the study to further eliminate conscious or unconscious selection bias. Concealment (part of the enrollment process) ensures that the researchers cannot predict or change the assignments of patients to treatment groups. If allocation is not concealed it may be possible to influence the outcome (consciously or unconsciously) by changing the enrollment order or the order of treatment which has been randomly assigned. Concealed allocation can be done by using a remote call center for enrolling patients or the use of opaque envelopes with assignments.  This is different from blinding which happens AFTER randomization.


3. Were patients in the study groups similar with respect to known prognostic variables?  The treatment and the control group should be similar for all prognostic characteristics except whether or not they received the experimental treatment. This information is usually displayed in Table 1, which outlines the baseline characteristics of both groups.  This is a good way to verify that randomization resulted in similar groups.


4. To what extent was the study blinded? Blinding means that the people involved in the study do not know which treatments were given to which patients. Patients, researchers, data collectors and others involved in the study should not know which treatment is being administered. This helps eliminate assessment bias and preconceived notions as to how the treatments should be working. When it is difficult or even unethical to blind patients to a treatment, such as a surgical procedure, then a "blinded" clinician or researcher is needed to interpret the results.


5. Was follow-up complete?  The study should begin and end with the same number of patients in each group. Patients lost to the study must be accounted for or risk making the conclusions invalid. Patients may drop out because of the adverse effects of the therapy being tested. If not accounted for, this can lead to conclusions that may be overly confident in the efficacy of the therapy.  Good studies will have better than 80% follow-up for their patients. When there is a large loss to follow-up, the lost patients should be assigned to the "worst-case" outcomes and the results recalculated. If these results still support the original conclusion of the study then the loss may be acceptable.
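The worst-case recalculation described above can be sketched like this: count every lost treatment-arm patient as having the event and every lost control-arm patient as event-free, then see whether the treatment still looks better (all counts invented):

```python
def event_rate_worst_case(events, completed, lost, count_lost_as_events):
    """Event rate with lost-to-follow-up patients assigned a worst-case outcome."""
    total = completed + lost
    return (events + (lost if count_lost_as_events else 0)) / total

# Treatment arm: 10 events among 90 completers, 10 lost -> assume all lost had events
treatment_rate = event_rate_worst_case(10, 90, 10, count_lost_as_events=True)   # 20/100
# Control arm: 20 events among 90 completers, 10 lost -> assume none of the lost did
control_rate = event_rate_worst_case(20, 90, 10, count_lost_as_events=False)    # 20/100
```

Here the worst case erases the apparent benefit (both rates become 0.20), so a 10% loss to follow-up would be a genuine threat to this hypothetical study's conclusion.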


6. Were patients analyzed in the groups to which they were first allocated? Anything that happens after randomization can affect the chances that a patient in a study has an event. Patients who forget or refuse their treatment should not be eliminated from the study results or allowed to “change groups”. Excluding noncompliant patients from a study group may leave only those that may be more likely to have a positive outcome, thus compromising the unbiased comparison that we got from the process of randomization. Therefore all patients must be analyzed within their assigned group. Randomization must be preserved.  This is called "intention to treat" analysis. 


7. Aside from the experimental intervention, were the groups treated equally?  Both groups must be treated the same except for administration of the experimental treatment. If "cointerventions" (interventions other than the study treatment which are applied differently to both groups) exist they must be described in the methods section of the study.

How can I apply the results to patient care?

Were the study patients similar to my population of interest? 
Does your population match the study inclusion criteria?
If not, are there compelling reasons why the results should not apply to your population?

Were all clinically important outcomes considered? 
What were the primary and secondary endpoints studied?
Were surrogate endpoints used?

Are the likely treatment benefits worth the potential harm and costs?
What is the number needed to treat (NNT) to prevent one adverse outcome or produce one positive outcome?
Is the reduction in clinical endpoints worth the potential harms and costs of the treatment?
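The NNT is the reciprocal of the absolute risk reduction (ARR); a minimal sketch with invented event rates:

```python
import math

control_event_rate = 0.20     # illustrative
treatment_event_rate = 0.12   # illustrative

arr = control_event_rate - treatment_event_rate   # absolute risk reduction = 0.08
nnt = 1 / arr                 # 12.5 -> treat about 13 patients to prevent one event
nnt_rounded = math.ceil(nnt)  # conventionally rounded up to a whole patient
```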

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License