Are Press Ganey Statistics Reliable?
Part II: Small Samples Create Questionable Results

How many patient satisfaction surveys are necessary to obtain a statistically reliable look at the performance of hospitals and health care providers? Press Ganey states that only 30 survey responses are needed to draw meaningful conclusions, although they prefer to have at least 50 responses before analyzing the data. We asked Dr. Eric Armbrecht, a statistician and Assistant Professor at St. Louis University’s Center for Outcomes Research, and Dana Oliver, a biostatistician at St. Louis University, whether they agreed.

Dr. Armbrecht suggested that analyzing only 30-50 responses would lead to unacceptably wide confidence intervals and would substantially limit the generalizability and use of the data obtained, regardless of whether 3,000 or 10,000 patients were surveyed. He explained that low response rates could create confidence intervals as wide as 50%, which could be similar to just flipping a coin to determine whether the data is representative of an entire population’s perceptions. Breaking down those same 30-50 responses in an attempt to analyze satisfaction scores of individual physicians would create even less reliable results, as the number of responses per physician would be even smaller. Ms. Oliver also disagreed with Press Ganey’s assertion that 30 or 50 responses would result in statistically sound data, noting that those numbers could be “arbitrarily chosen” by some survey methodologists.

How many responses are necessary in order to have statistically reliable data? The answer depends upon the size of the population being surveyed. Assuming a margin of error of 4% (which is double the margin of error that Press Ganey would like to use) and the standard 95% confidence level, the minimum sample sizes that Dr. Armbrecht recommended for populations of 1000, 2500, 5000, and 7500 would be 375, 484, 536, and 556 respectively.
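The article does not say which formula produced these minimum sample sizes. As a rough sketch, the standard sample-size formula for a proportion with a finite population correction reproduces the four figures, assuming the most conservative 50/50 split of yes/no answers and a z-value of 1.96 for 95% confidence. The function name and its defaults are illustrative assumptions, not something taken from Dr. Armbrecht or Press Ganey.

```python
def sample_size_proportion(population, margin_of_error=0.04, z=1.96, p=0.5):
    """Approximate responses needed for a yes/no question.

    Assumptions (not stated in the article): z = 1.96 for 95% confidence,
    p = 0.5 as the worst-case split, and a finite population correction.
    """
    # Sample size if the population were effectively unlimited.
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # Reduce it to account for the finite population being surveyed.
    n = n0 / (1 + (n0 - 1) / population)
    # Rounded to the nearest whole response to match the quoted figures;
    # rounding up instead would give slightly larger numbers.
    return round(n)


for population in (1000, 2500, 5000, 7500):
    print(population, sample_size_proportion(population))
```

Under these assumptions the required sample grows by only about 180 responses while the population grows from 1000 to 7500.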
Dr. Armbrecht noted that the required sample size tends to flatten out as the population grows, and cautioned that these sample sizes would apply only to “yes/no” questions (such as whether or not a doctor was “very good”). For rating scales (such as those from 1-5), the calculations become somewhat more difficult and depend upon the standard deviation in the sample population. As an example, Dr. Armbrecht said that with a 1-5 scale, a standard deviation of 0.7, and a margin of error of 10% (which is five times higher than Press Ganey seeks), 188 responses would be needed in order to reliably estimate the responses from the general population. Dr. Armbrecht recommended online statistical calculators such as those available at Creative Research Systems (http://www.surveysystem.com/sscalc.htm) to help determine the statistical significance of most data.

Aside from low response rates, Dr. Armbrecht and Ms. Oliver described additional problems that can occur when using a 1-5 scale in satisfaction surveys. If hospital administrators seek to be at or above the 90th percentile in satisfaction scores, asking patients to grade performance on a 1-5 scale essentially creates a system with one passing grade and four failing grades. If patients are not aware that a score of “4” is a failing grade, the data that they provide may be misinterpreted when it is analyzed. In addition, patients may perceive a small relative difference between a grade of “4” and “5” on a survey, but a larger relative difference between a “3” and a “4” on the same survey, creating a system in which they grade “so-so” care with the same score as “just less than perfect” care. Finally, with small sample sizes, one unhappy customer can turn many “passing” grades into failing grades. Four patient scores of “perfect” fives can be brought down to “failing” fours by one extremely unhappy patient who grades a provider or hospital with scores of all zero.
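The rating-scale figure quoted earlier (a standard deviation of 0.7 and 188 responses) can be checked with the usual sample-size formula for estimating a mean. This is only a sketch: it assumes 95% confidence (z = 1.96) and reads the 10% margin of error as 0.1 points on the 1-5 scale, an interpretation the article does not spell out but which reproduces the number given. The function name is hypothetical.

```python
def sample_size_mean(std_dev, margin_of_error, z=1.96):
    """Approximate responses needed to estimate a mean rating.

    Assumes a large population and z = 1.96 for 95% confidence;
    margin_of_error is expressed in the same units as the rating scale.
    """
    n = (z * std_dev / margin_of_error) ** 2
    # Rounded to the nearest whole response to match the quoted figure.
    return round(n)


# Standard deviation of 0.7 on a 1-5 scale, margin of error of 0.1 points.
print(sample_size_mean(0.7, 0.1))
```

Because the margin of error is squared in the denominator, halving it roughly quadruples the number of responses needed.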
Our experts noted that a simple way to avoid these analytical problems is to create a dichotomous scoring system with “yes-no” questions. For example, “Did your care meet your expectations?”

The “standard error of the mean” is the standard deviation of a sample population’s mean.

Ms. Oliver noted that before performing any type of statistical testing, it is a good idea to first plot a histogram of multiple sample responses to determine whether survey data will be distributed in a normal bell curve pattern. If the survey responses are not distributed in a bell curve pattern, conclusions cannot be drawn from the data, unless the variability of the data is low.

Press Ganey literature relies on the “central limit theorem” in justifying a reliance upon sample sizes as low as thirty. Ms. Oliver explained that the central limit theorem holds that the mean and median scores from very large survey samples tend to form a typical bell curve. In most cases, the central limit theorem only applies if there is a similar distribution of variables in each survey. Because patient satisfaction survey samples from specific hospitals are generally not large, and because the surveys do not always have a similar distribution of variables, the central limit theorem probably would not apply to satisfaction survey data.

Analysis of survey results depends in part on the “margin of error” of the survey data. Margin of error expresses how much confidence can be placed in survey responses when an entire survey population is incompletely sampled. For example, suppose that five percent of a population is surveyed and one question has a mean score of 50. If the margin of error for the question is 30, then the actual value for the entire population could be anywhere between 20 and 80 (the mean score of 50 plus or minus 30). Dr.
Armbrecht stated that a good estimate of a margin of error is given by the formula 1/[square root of the number of participants in the sample] (Niles, 2006). In other words, for a sample size of 100, the margin of error would be roughly 10%, and for a sample size of 9, the margin of error would be roughly 33%. Achieving Press Ganey’s goal margin of error of 2% or less would require a sample size of approximately 2500.

If you want to see how good the soup in a pot tastes, first the ingredients in the pot must be well mixed. The “mixing” of the soup is analogous to obtaining completely random data from a sample population. If you only mix the top layers of the pot, you might not get the beans and pasta on the bottom of the pot, so your sample taste will not be representative of the true flavor of the soup. Similarly, failure to completely randomize data samples by excluding certain segments in a population (such as admitted, transferred or LWOBS patients) significantly increases the likelihood that the results will be inaccurate. Likewise, small sample sizes from a large population are likely to provide misleading data.

Once an appropriate sample is taken, surveys can only be used to determine whether there has been a change in the sample population. Using the soup analogy, you tweak the recipe by adding or changing ingredients and take another sample to see if people like the new recipe better. Surveys can only be used to measure how the soup in a single pot is changing over time.

What are the takeaway points about analysis of satisfaction survey data? First, small sample sizes can lead to significantly unreliable data. Last month, we showed how small sample sizes resulted in a 99% change in a hospital’s percentile rank in just two months. Simply put, small response sizes lead to inaccurate results.

Glossary of Statistical Terms
Mean: The average of all the responses.
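The 1/[square root of n] rule of thumb quoted above is easy to check numerically. The sketch below simply evaluates that rule and its inverse. The function names are made up for illustration, and the rule itself is only an approximation of the 95% margin of error for a yes/no question with an even split of answers.

```python
import math


def approx_margin_of_error(sample_size):
    """Rule-of-thumb margin of error: 1 / sqrt(number of participants)."""
    return 1 / math.sqrt(sample_size)


def approx_sample_size(margin_of_error):
    """Invert the rule of thumb: participants needed for a target margin."""
    return round(1 / margin_of_error ** 2)


for n in (9, 30, 50, 100):
    print(n, f"{approx_margin_of_error(n):.0%}")

print(approx_sample_size(0.02))
```

By the same rule, the 30 to 50 responses that Press Ganey considers sufficient correspond to a margin of error of very roughly 14% to 18%.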
Comments
Recently I became aware of a nurse manager who simply sent a PG form to all patients who complained.
When I approached the hospital board with this data and the learned conclusions, they were not willing to admit that they had made a mistake investing over $100,000/year in P-G. They got a slock statistician (adjunct part-timer) from the university to discredit ASQ and Deming. Wonder what that cost?
Common sense can tell you that too few data points do not allow for drastic conclusions. When actual data gets worse but percentiles get better - duh!
Keep up the meaningful editorial coverage.
John C. Johnson, MD, FACEP, FACPE
Past President - ACEP
EM doc for 35 years
I like the soup-pot analogy. To carry it a bit further: Let's say that every spoonful you tasted had a bean in it. "Confidence interval" is the likelihood that it's bean soup, rather than a soup with only a few beans where you were lucky enough to get one every time.
Bottom line, this survey would NEVER pass muster as a population-based public health survey that could be used for any policy or resource allocation, for one simple reason: it is not statistically reliable data.