Emerging Insights About Measuring Disease Management Outcomes

MANAGED CARE May 2011. © MediMedia USA

Many problems attributed to outcomes measurement result from poor planning before a program is initiated. A DM expert lays down some ‘must-do’ rules for success in this excerpt from Disease Management and Wellness in the Post-Reform Era, published by Atlantic Information Services Inc.

Al Lewis

Perhaps no issue in disease management (DM) is more controversial than outcomes measurement. As for wellness, that field is five years behind DM in the ability to measure outcomes validly. Being five years behind DM in measurement is like being five years behind Iraq in democracy.

Many — if not most — reported results are wrong, infected by either obvious or insidious regression to the mean and by distortions due to faulty trend calculations. How do you know whether your results are among those so infected? Three simple tests will tell you:

(1) Did you see cost or utilization declines in categories which do not normally decline in DM, such as physician visits or drugs?

(2) Did drug costs decline (a reduction attributed to the program) while the quality indicators showed an improvement in adherence to drug therapy?

(3) Is the stated decline in admissions of a much greater proportion than the improvement in quality indicators?

If the answer to any one of these questions is positive, your results are infected and hence invalid at worst and controversial at best. But help is on the way. We are already seeing a glimpse of the future in measurement, and the good news is that “regression to the mean” specifically — and complex, invalid, expensive actuarial methodologies generally — are being banished to what Leon Trotsky once called “the dustbin of history.”

What follows are the emerging insights which, taken into consideration when you measure, will remove most of the controversy around measurement and produce generally valid results … and will save both time and money in the process. That’s because validity of outcomes and complexity of the process used to generate those outcomes turn out to be inversely correlated.

Rule #1

Though generally not practical for health plans, the only truly valid methodology is randomized controlled trials (RCTs). Any other methodology needs to be confirmed with plausibility checking before being accepted.

Randomized controlled trials were used by the Centers for Medicare & Medicaid Services (CMS) in its Medicare Health Support project. Whatever other mistakes CMS made that perhaps caused the contracted vendors to miss their targets (and there were many), there was no issue with measuring in this manner, the closest approximation to a double-blind study there can be in a field where placebos aren’t possible. Of course, in a “real” RCT, the doctors would not have patients in both the control and study groups, as they did in this situation. That is just one example of the mistakes made in the CMS study design.

Before embarking on your own RCT or accepting a study provided by a vendor, keep in mind that not all RCTs are created equal. In particular, there are a number of comparisons between the two groups that must be checked in an RCT, and rarely are:

What was the previous hospitalization rate of the two groups? Often, the groups look like a match on demographics and illness burden, but had a much different rate of hospitalizations in the six months prior to the start of the trial.

Could the difference in results have been caused by the intervention? It’s not enough simply to accept differing results between the control group and the intervention group in the study period. Some differences are not plausibly due to DM, including very large percentage differences of any type; differences between the groups in categories like radiology or post-acute care, which DM simply does not noticeably affect; and differences that are larger in lower-acuity members than in high-acuity members.

Even within the “expectable” categories such as hospitalizations, did the researchers rule out other possible causes for differing results? Once one focuses on knowing that only certain categories are affected by DM, one must go a step further to determine whether the differences — even if in the “expectable” categories like hospitalizations — were in fact due to the program. Was a differential decline in hospitalizations due to fewer hospitalizations for the conditions actually being managed? Was a differential decline in surgeries concentrated in the surgeries where patient preference can make a difference? Or was it across the board?

Did you achieve cost reductions in most or all categories? Keep in mind that it is not possible to reduce costs in most or all categories — the cost has to go somewhere. It might move from inpatient to outpatient, or inpatient to drugs, or ER to physician office visits. But it doesn’t go away.

Does the reported savings change when the outlier cutoff point is changed? If so, the savings are not likely caused by the program, since a few phone calls can’t prevent a six-figure hospitalization. A good way to check this: Does the vendor, who is touting the RCT, tell you what the outlier cutoff is, and whether changing it changed the savings? If not, chances are that they picked the cutoff which resulted in the greatest savings.

Assuming the checks above are taken into account, the RCT is the best comparison available. That is why it is used in drug trials. However, the major disadvantage is that only rarely does a health plan or any other entity find itself in a position to conduct one. Occasionally the opportunity arises when a program is offered to the insured population but many self-insured groups don’t buy it, as when Blue Shield of California offered a catastrophic case management program to its own members and used the California Public Employees’ Retirement System as a control. The study had several hundred thousand people in each group, with essentially no migration between groups. The outcomes, peer-reviewed and published in the February 2007 American Journal of Managed Care, appeared to be valid.

RCTs provide a far more accurate analysis than a “pre-post” study, in which a single population is used as both the control and study group over two periods of time. In a pre-post study, generally it is assumed that the baseline cohort’s costs would stay the same adjusted for trend (as measured by the nondiseased population’s cost change) absent the DM intervention. Therefore, any change in costs (adjusted for the change in costs of the nondiseased population) is attributed to the program.
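
To make the arithmetic concrete, here is a minimal sketch of a trend-adjusted pre-post calculation as just described (the variable names and figures are hypothetical, purely for illustration):

```python
# Minimal sketch of a trend-adjusted pre-post savings calculation.
# All figures below are invented for illustration, not data from the article.

baseline_cost_pmpm = 950.0   # baseline cost per diseased member per month
actual_cost_pmpm = 980.0     # contract-year cost per diseased member per month

# Trend is usually taken from the nondiseased population's observed cost change.
nondiseased_trend = 0.08     # 8% cost increase in the nondiseased group

# Pre-post logic: absent the program, the baseline cohort is assumed to have
# cost baseline * (1 + trend); anything below that is attributed to the program.
expected_cost_pmpm = baseline_cost_pmpm * (1 + nondiseased_trend)
claimed_savings_pmpm = expected_cost_pmpm - actual_cost_pmpm

print(f"Expected (trended) cost: ${expected_cost_pmpm:,.2f} PMPM")
print(f"Claimed savings:         ${claimed_savings_pmpm:,.2f} PMPM")
```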

Among the many problems with this methodology, the most obvious is that the baseline does not include the entire population, only the population sick enough to have claims. Hence the “planes on the ground” (as explained in Rule #5, below) are not included, and the calculated average cost of everyone with claims for the condition is higher than the underlying average cost of everyone with the condition.

Pre-post methodologies can be divided into two types: “prospective identification,” in which anyone who ever had a claim for a condition is counted in all future periods, and “annual requalification,” in which only people with claims in any period are counted going forward.

Rules #2 and #3

Before initiating a program, you need to know which conditions are most out of control and are creating the most unnecessary admissions.

To know which conditions are out of control, you need to know basic facts. For instance, if you are managing or considering managing heart disease, you need to know your rate per 1,000 for heart attacks, angina attacks, and other cardiac events.

These two rules can be considered together. Today, too many health plans and employers say, “Let’s do DM.” Too many employers say, “Let’s do wellness.” Dollars are committed and spent and measured by actuaries … and yet, basic questions don’t get asked or answered. Let us use the example of heart disease. Health plans and employers are spending millions to manage this category, to reduce heart attacks and other ischemic events. But almost no one can answer basic questions like:

What is our rate per 1,000 patients for heart attacks, angina attacks and other cardiac events?

How has it been trending since we started this program?

How does it compare to other similar populations? Are we out of control or in control?

That set of epidemiological questions begs another set of managerial questions: How can you manage something if you don’t know what you are managing? How do you know where to focus your DM efforts if you don’t know whether and where your adverse event rates are out of control?

These event-rate tests are very simple, and avoid all the actuarial data-crunching and what-if scenarios found in the typical benefits consultant analysis. You divide the number of ER and inpatient events primary-coded for the condition in question by the total plan membership, just as if you were calculating a birth rate.

For instance, if you count 3,000 asthma attacks overall, and you have 1 million members in your plan, your asthma attack rate is 3 per 1,000.
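
As a minimal sketch of that arithmetic (the helper function below is illustrative; it assumes you can already count primary-coded ER and inpatient events from claims):

```python
def event_rate_per_1000(primary_coded_events: int, total_members: int) -> float:
    """Condition-specific event rate per 1,000 plan members,
    calculated like a birth rate: events divided by total membership."""
    return primary_coded_events / total_members * 1000

# The asthma example from the text: 3,000 attacks across 1 million members.
print(event_rate_per_1000(3_000, 1_000_000))  # -> 3.0 per 1,000
```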

It is vastly more actionable to know one’s out-of-control event rate, which is a known, valid, replicable figure, than it is to know the prevalence rate. Prevalence is a term of art whose parameters vary according to the “claims-extraction algorithm” used to find members. Suppose you are satisfied with the prevalence-rate algorithm and find that the prevalence is high. Does that mean you should “do DM?” Not necessarily. Perhaps usual care is quite good — that means there are few events left to avoid. The Boston area, for example, is a hotbed of asthma. Yet in the commercial population, the health plan which has the best event avoidance in the entire U.S. pulls its membership largely from greater Boston. How can they have such a low event rate in a high-asthma-prevalence environment? Because usual care is quite good thanks to years of physician education and disease management, so this plan is trying to move its customers into other care management programs.

Here are two examples of what health plans can learn by looking at their event-rate trend over time and by comparing that trend to national averages.

For instance, a Southeast health plan implemented programs but never asked whether they were actually doing what they were intended to do — reduce adverse events in the conditions being managed. A look at condition-specific event rates showed no program impact on utilization (and hence cost), notwithstanding the actuarial calculations of large savings produced by its modeling system.

Comparing oneself to historical performance yields some insight, but one can’t be certain that the lack of decline isn’t reflective of excellent initial performance, and therefore the expectation of improvement could be unrealistic. In that particular case, it turned out that the Southeast health plan’s performance was average and therefore should have improved.

How is it possible to know that it was “average”? The results from 29 commercial health plans and employers (but not Medicare health plans or Medicaid health plans, which would have different event rates) were combined into one average. This allows a health plan to compare itself to a benchmark and see how it is performing over time. Another case is Harvard Pilgrim Health Care Inc. Harvard Pilgrim has the best outcomes in the country, roughly tied with Providence Health Plans in Oregon.

The trend lines suggest that Harvard Pilgrim had been improving both in absolute terms and versus the national averages, and — unlike the Southeast health plan above — was already much better than average before implementing its DM programs. Even so, its performance has improved since DM implementation.

The broader question: Why doesn’t everyone look ahead of time at adverse events by condition before deciding which programs to do? The goal of chronic DM is to reduce adverse events, so it would seem very logical to see ahead of time if — and in which conditions — there are enough to merit an attempt to reduce them.

Rule #4

In addition to looking at these event rates — the so-called “plausibility test” — biostatisticians also recommend a “number needed to decrease” (NND) analysis to confirm whether the ROI you believe you have achieved actually was achieved.

An NND test tells you how many of these events you need to avoid in order to hit your ROI targets, given inputs for program costs and emergency and inpatient care expenses. Then, you input an explicit, transparent assumption about the likelihood of comorbidities being reduced as well, if admissions for the specific primary morbidity are avoided. For instance, in asthma most of the event avoidance will take place in asthma itself. But in diabetes, good DM could avoid events across many related comorbidities.

In addition to the basic assumption that a DM program should reduce events in the disease being managed, there are two other assumptions implicit in an NND test. The second critical assumption is that events associated with those related comorbidities are falling at the same rate that the events coded to the primary morbidity are falling. The third is that it is plausible to say that related comorbidities could fall only if there appears to have been an impact on the primary morbidity.

For this last assumption, an analogy could be made to, yes, sports. If you watch a player hit a bunch of slow balls down the middle for home runs, and the player tells you he can also hit sliders on the corners, you might believe him. However, if he misses the slow balls down the middle, that very same statement about being able to hit sliders on the corners is simply not plausible. That’s why both the straight plausibility test and the NND analysis are so concerned with success or lack thereof in the primary morbidity. While easily measurable on its own, that success would also certainly correlate with much less easily measurable results across a range of comorbidities.

A very simple example of an NND analysis might be as follows. Assume you are spending $1 million on asthma, and that avoiding an average event — the weighted average of the costs of an admission and an ER visit — saves $1,000. If you are targeting a 2:1 ROI, and you assume that whatever minor comorbidity reduction you might achieve is offset by higher drug costs, you must therefore avoid 2,000 asthma events to save $2 million.

Is that achievable? Go back to the event-rate chart. There are about three asthma events for every 1,000 plan members. Recall that this is not diagnosed members or members participating in the DM program — this is just a raw rate of event incidence. Since asthma attacks can occur in anyone, since the health plan pays claims for everyone, and since it’s the program’s job to save money by avoiding events, the raw rate of incidence is the correct rate to measure.

As one example from an event-rate chart, 1 million members would yield about 3,000 asthma events in total. This would make avoidance of 2,000 events extraordinarily unlikely — it would be a 67 percent reduction, would generate a very sharp decline in the event-rate line, and would run into the reality that about a third of asthmatics are simply unknown to a health plan in the first place. Either they themselves don’t know, or they received their diagnosis while belonging to another health plan.

However, if you have 10 million members generating 30,000 events, it would indeed be possible to avoid 2,000 of them. You would just track events the following year using the event-rate test above to see if indeed you avoided 2,000 events — 6.7 percent of the total. An event-rate chart database would reveal multiple instances of a 6.7 percent decline in events.
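
Here is a minimal sketch of the NND arithmetic and the plausibility check just walked through (the function names and figures are illustrative, not from any specific vendor reconciliation):

```python
def events_needed_to_avoid(program_cost: float, target_roi: float,
                           savings_per_event: float) -> float:
    """How many events must be avoided to hit the ROI target."""
    return program_cost * target_roi / savings_per_event

def required_reduction(events_to_avoid: float, members: int,
                       event_rate_per_1000: float) -> float:
    """Required avoidance as a share of all events actually occurring."""
    total_events = members / 1000 * event_rate_per_1000
    return events_to_avoid / total_events

# Asthma example from the text: $1 million spend, 2:1 ROI, $1,000 saved per avoided event.
needed = events_needed_to_avoid(1_000_000, 2.0, 1_000)      # -> 2,000 events

# 1 million members at 3 events per 1,000 -> 3,000 events: a 67% reduction (implausible).
print(required_reduction(needed, 1_000_000, 3.0))           # -> ~0.667

# 10 million members -> 30,000 events: a 6.7% reduction (plausible).
print(required_reduction(needed, 10_000_000, 3.0))          # -> ~0.067
```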

Adding comorbidities creates another layer of complexity in search of more validity. Vary the asthma example above to substitute heart failure for asthma. For heart failure, the value of an avoided event is much greater than for asthma, because a much higher percentage of patients presenting in the ER get admitted, and the lengths of stay are much longer. By looking at the composition of the event rate as between ER and inpatient, and applying your costs for hospital use, you can figure that perhaps the average avoidable heart failure event saved $10,000, rather than $1,000 as in asthma.

Event rates for heart failure fluid overload in the commercial population are about 0.5 per 1,000 members. So assuming the same 1 million people as in the asthma example and the same $1 million in spending, you would have to avoid 200 events to get a 2:1 ROI.

But with an event rate of just 0.5 per 1,000, there are only 500 such events to begin with, making avoidance of 200 — a 40 percent reduction — a difficult challenge. This is where the comorbidity assumption comes in. Suppose that instead of virtually no comorbidity impact from a DM program, as in asthma, you assume that for every fluid overload case your disease managers avoid, they avoid four cases of other medically-related complications or comorbidities. These are nowhere near as measurable because they are spread out over many ICD-9 codes. That assumption significantly reduces the number of avoidable, measured fluid-overload cases needed to decrease to reach the target ROI.

Specifically, to avoid 200 total events in members with congestive heart failure, one has to measure only 40 avoided cases specifically of fluid overload, or about 8 percent of the 500 expected. This is enough to show up on the event-rate charts as a noticeable decline in all but the smallest health plans.
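
The same sketch can be extended with the comorbidity multiplier described above (again, the figures are illustrative):

```python
def primary_events_needed(program_cost: float, target_roi: float,
                          savings_per_event: float,
                          comorbid_events_per_primary: float = 0.0) -> float:
    """Primary-morbidity events that must be measurably avoided, assuming each
    avoided primary event also avoids a stated number of comorbid events."""
    total_needed = program_cost * target_roi / savings_per_event
    return total_needed / (1 + comorbid_events_per_primary)

# Heart failure example: $1 million spend, 2:1 ROI, $10,000 saved per avoided event,
# with an explicit assumption of 4 avoided comorbid events per avoided fluid-overload case.
print(primary_events_needed(1_000_000, 2.0, 10_000, comorbid_events_per_primary=4))
# -> 40.0 measured fluid-overload cases, about 8% of the ~500 expected in 1 million members
```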

While the “comorbidity multiplier” is, of course, an assumption, the NND analysis has two huge advantages over actuarial pre-post analysis: measurement of comorbidities is explicit and transparent, and the “comorbidity multiplier” can easily be varied. In actuarial methodologies, the sources of savings, by condition, are totally implicit, because results are presented in dollars. The best example: A public presentation by a William M. Mercer consultant showed savings of $6 million in asthma, without even checking to see whether asthma admissions and ER visits changed. Had he done so, he would have noticed that his client, a large retailer, didn’t even incur enough asthma events to spend $6 million on them in the first place, let alone save $6 million by avoiding them.

Rule #5

The “once chronic, always chronic” methodology will invariably overstate savings.

A popular methodology among health plans, vendors and especially benefits consultants is the “once chronic, always chronic” approach, a form of pre-post analysis. The Care Continuum Alliance refers to it formally as the “prospective identification” methodology, in which they find the same flaws as described in this section. In this methodology, any member who is identified in any period as having a chronic condition is assumed to continue to have that chronic condition in future periods. The assumption is based on the logic that chronic conditions, by definition, don’t go away and therefore everyone who has them, even if they are totally under control, should be tracked.

The flaw in this logic is that only members who have high enough claims to be identified through a claims algorithm are counted in the baseline. As is well known by now, tracking members with high claims forward will always yield a decline in costs, through regression to the mean.

The classic analogy, well-known to veterans of the DM field, is to aviation. Radar measures the altitude of all the flights it tracks. One could use that data to measure the altitude of planes actually in the air. This measurement will overstate the altitude of the average plane, because many planes are on the ground at any given time. So the “baseline” measurement of altitude will overstate the actual average altitude of all planes because planes on the ground are not captured in the initial average. Over time they will be, because the radar will “know” which planes have landed. So over time, the average will migrate from an average of flights in the air, to an average of all planes including the planes on the ground, thus showing a decline in measured altitude even if there is no change in actual average altitude in the U.S. aviation system.

Now assume that the radar is a claims-extraction algorithm and that the different times of the reading are years of the program. One can see, by analogy, that measured claims will decline as more people with the disease who didn’t happen to have claims in the baseline — metaphorical “planes on the ground” — are included in the measurement.

If indeed everyone with a chronic condition had claims in every period, a “prospective identification” methodology would work. However, as long as there are any “planes on the ground,” the average “altitude” (cost per disease-eligible member) noted by claims-extraction algorithms will overstate the true average cost per disease-eligible member.
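
The bias is easy to demonstrate with a toy simulation (a minimal sketch; the population size, claim probability, and cost figure are made up purely for illustration):

```python
import random

random.seed(0)

# Hypothetical population of 10,000 members, all of whom truly have asthma.
# Each year, each member independently has a 60% chance of generating $1,000 of
# asthma claims and a 40% chance of generating none (a "plane on the ground").
# The true average cost per person with the condition never changes.
N, P_CLAIMS, COST = 10_000, 0.6, 1_000

def yearly_costs():
    return [COST if random.random() < P_CLAIMS else 0 for _ in range(N)]

baseline, year1 = yearly_costs(), yearly_costs()

# Prospective identification: only members with baseline claims are tracked forward.
identified = [i for i in range(N) if baseline[i] > 0]
baseline_avg = sum(baseline[i] for i in identified) / len(identified)  # exactly $1,000
year1_avg = sum(year1[i] for i in identified) / len(identified)        # roughly $600

print(f"Baseline cost per identified member: ${baseline_avg:,.0f}")
print(f"Year-1 cost per identified member:   ${year1_avg:,.0f}")
# An apparent ~40% "savings" with no program and no real change in costs.
```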

Rule #6

The “annual requalification” methodology should prevent overstatement of savings, but often doesn’t, due to the correlation between higher compliance and recent events.

In theory, the problems noted above should be avoided by a methodology, recommended by the DMAA, which is essentially the same as the “prospective identification” methodology except that it does not count “planes on the ground” in any period, thus canceling out the bias by creating seemingly symmetrical measurement periods. It is a vastly preferable methodology, but still should not be taken as valid unless checked via an event-rate-based “plausibility test.”

Table 1 shows a hypothetical example illustrating how the “annual requalification” methodology gives a much more valid result than does the prospective identification methodology. Assume that there are only two asthmatics in the health plan, and one baseline and one program year. Further assume that inflation/trend have already been taken into account.

TABLE 1  Cost per person with asthma in both periods using both methodologies (scenario 1)

                                  2005 (baseline year)           2006 (contract year)
Patient #1                        $1,000                         0
Patient #2                        0                              $1,000
Cost per person with asthma       $1,000 (both methodologies)    $500 prospective; $1,000 annual requalification

In the baseline, both methodologies yield the same result — $1,000 is the average cost per asthmatic because the second asthmatic, a classic “plane on the ground,” doesn’t show up in the measurement. In the contract period, however, the methodologies yield dramatically different results. The “annual requalification” shows a $1,000 cost per asthmatic because #1 is not counted since he had no asthma-identifiable claims. The prospective methodology, though, counts him because he had asthma in the baseline so he certainly still has it, even if it’s under control enough not to generate claims.

Even though the total costs to the plan have not changed, the “prospective” methodology shows a 50 percent reduction in cost per member just by counting both members, while the annual requalification methodology shows the correct result, that costs did not decline. Curiously, even though the annual requalification methodology finds the correct mathematical answer in this case and “prospective” does not, most epidemiologists would argue the opposite — that prospective identification truly captures the right population because actual chronic disease itself does not go away. Hence, people who have shown that they have it should be counted in all future periods, even without claims. The proponents of the “annual requalification” methodology would respond that the large majority of those who do not requalify represented false positives in the baseline, and so eliminating them is epidemiologically correct as well as mathematically correct.
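
Expressed as code, the two identification rules applied to this two-patient example look like the sketch below (a minimal illustration; the helper function is hypothetical and the claims dictionaries are the figures from Table 1):

```python
def contract_year_cost(baseline: dict, contract: dict, method: str) -> float:
    """Contract-year cost per person with asthma under either identification rule."""
    if method == "prospective":
        # Once identified in any period, always counted going forward.
        cohort = [m for m in contract if baseline.get(m, 0) > 0 or contract[m] > 0]
    else:  # annual requalification
        # Only members with claims in the contract year are counted.
        cohort = [m for m in contract if contract[m] > 0]
    return sum(contract[m] for m in cohort) / len(cohort)

# Scenario 1 (Table 1): total plan costs are identical in both years.
baseline = {"patient_1": 1_000, "patient_2": 0}
contract = {"patient_1": 0, "patient_2": 1_000}
print(contract_year_cost(baseline, contract, "prospective"))   # 500.0 -- a spurious 50% "savings"
print(contract_year_cost(baseline, contract, "requalify"))     # 1000.0 -- correct: no change
```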

Yet even the annual requalification methodology must be plausibility-checked with an event-rate measurement, because it too can be flawed. Assume the previous example, but in this scenario assume that #1 takes drugs for a while, having had a “scare” (see Table 2).

TABLE 2  Cost per person with asthma in both periods using both methodologies (scenario 2)

                                  2005 (baseline year)   2006 (contract year)
Patient #1                        $1,000                 $100
Patient #2                        0                      $1,000
Cost per person with asthma       $1,000                 $550

Both methodologies identify the member in the contract period and both would then show a 45 percent decline in costs, even though the costs actually increased. Note, below, how a simple application of an event-based plausibility test highlights the flaw in the measurement. Having seen this red flag, the actuaries can now go back and remeasure to get the right answer (Table 3).

TABLE 3  Asthma events in the payer as a whole

                                  2005 (baseline year)   2006 (contract year)
Patient #1                        $1,000                 $100
Patient #2                        0                      $1,000
Cost per person with asthma       $1,000                 $550
Events per person with asthma     1                      1

One might say, “That’s not fair — you added the drugs to 2006 to create an artificial scenario where annual requalification wouldn’t work.” However, there is nothing “artificial” about this scenario. It is actually the most common scenario imaginable: people are much more likely to take their drugs after they have a “scare” than before. If taking drugs were not more likely after a scare, a “$100” would also consistently show up in Patient #2’s baseline-year cell, and the average in both years would be $550. But who among us isn’t much more careful just after having had a “scare” of any kind than after the memory of the scare has faded?
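
A minimal sketch of that red-flag check, using the figures from Tables 2 and 3 (the threshold logic here is an illustration, not a formal standard):

```python
# Event-based plausibility check for scenario 2 (Tables 2 and 3).
# Measured cost per asthmatic fell 45% under both methodologies, yet the
# payer-wide event count is unchanged, flagging the "savings" as implausible.

baseline_events, contract_events = 1, 1        # asthma events in the payer as a whole
baseline_cost, contract_cost = 1_000, 550      # measured cost per person with asthma

claimed_cost_decline = 1 - contract_cost / baseline_cost    # 0.45
event_decline = 1 - contract_events / baseline_events       # 0.0

if claimed_cost_decline > 0 and event_decline <= 0:
    print("Red flag: costs 'fell' while avoidable events did not; remeasure.")
```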

Rules #7 and #8

There are only two significant sources of savings: a reduction in inpatient admissions for the condition(s) and emergency room (ER) avoidance for the specific conditions being managed and their closely related comorbidities. No other savings of significance can be attributed to DM, so there is no reason to complicate measurement with claims from other cost categories.

There is no unit-cost change possible in DM and, therefore, no reason to measure inflation, or “trend.” Doing so just increases the cost and complexity of measurement, while reducing the validity.

These two facts can be grouped because they both point in the same direction: Keep the calculation simple and population-based.

If the answer is so simple and these facts are so incontrovertible, why did reconciliations develop into the complex, expensive, invalid methodologies that have been, until recently, so widely used?

There are three reasons:

(1) The industry evolved based on savings guarantees. Guarantees had to be financially based. So methodologies were developed which were totally based on financial results, and usually never even considered whether the underlying utilization declines necessary to support that analysis were even possible. To this day, one national health plan routinely presents financial savings which, according to its own data, are impossible, and no one notices.

(2) Even without guarantees, the program sponsors within a health plan felt that they needed to present “a number” to senior management. It did not matter that the “number” was no more relevant to the program’s success than the North Vietnamese “body count” was to the Vietnam War’s success. People felt that they needed “a number.”

(3) Most health plans rely on their actuaries, as employers rely on their actuarial consultants, for financial calculations. So the actuarial departments were either given responsibility or took responsibility for this function. And actuaries are clearly the authorities when it comes to answering questions like, “How will a three-tier drug program affect medical spending?” That is an actuarial question and should be given an actuarial answer.

However, DM is not an actuarial science, involving the application of numerical models to a set of assumptions. It is a biostatistical science requiring knowledge and inferences about dose-response relationships for behavior change and event avoidance. It is all about the avoidance of exacerbations and complications, either currently (as noted in the event-rate calculations) or in the future, through favorable changes in quality indicators which presage the avoidance of future events. It is not about unrelated hospitalizations. It is not about lab tests, radiology, home care or any other element of cost which gets included in savings calculations.

And it is not about where one sets the “outlier filter.” By the time people get anywhere near the six-figure claims level, they have long since surpassed a point where events can be prevented by phone calls. Yet, changes in outlier filters can dramatically change the savings “number” despite the lack of impact of DM on those very high-cost members. It’s all about luck, at that point: Did there happen to be more outliers in the baseline or in the study period?
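
A toy example of that sensitivity (the claims figures are invented; only the pattern matters):

```python
# How the choice of outlier cutoff can swing measured "savings" when one
# six-figure claim happens to land in the baseline rather than the study period.

baseline_claims = [2_000, 3_500, 8_000, 45_000, 260_000]   # includes one six-figure outlier
contract_claims = [2_500, 3_000, 7_500, 40_000, 95_000]

def trimmed_total(claims, cutoff):
    """Total claims with each member's spend capped at the outlier cutoff."""
    return sum(min(c, cutoff) for c in claims)

for cutoff in (50_000, 100_000, 250_000):
    savings = trimmed_total(baseline_claims, cutoff) - trimmed_total(contract_claims, cutoff)
    print(f"Cutoff ${cutoff:>7,}: measured savings ${savings:,}")
# Savings swing from $5,500 to $160,500 with no change in the underlying claims.
```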

Likewise, pricing has nothing to do with it. DM vendors do not affect contract pricing. So why include inflation in the calculations? Small changes in inflationary trend assumptions can create massive changes in perceived savings, when inflation has nothing to do with it.

Mercer’s lead actuary, Seth Serxner, graciously and candidly acknowledges this fatal flaw in actuarial methodologies in a 2008 article when he writes:

“We can conclude, however, that the choice [emphasis added] of trend has a large impact on estimates of financial savings. Evaluators may be wise, therefore, to conduct their analyses with more than one trend in mind in order to attain a range.”

In other words, there is no way of knowing what the underlying savings actually are since it is all dependent on one’s “choice” of inflation trend. A methodology which does not need to be adjusted for inflation avoids that fatal flaw.
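
A small sketch of that sensitivity (the cost and membership figures are invented; only the spread of answers matters):

```python
# Pre-post "savings" under three different assumed trends, everything else held constant.

baseline_cost_pmpm = 900.0    # baseline cost per diseased member per month
actual_cost_pmpm = 930.0      # contract-year cost per diseased member per month
members, months = 50_000, 12

for trend in (0.04, 0.06, 0.08):
    expected = baseline_cost_pmpm * (1 + trend)
    savings = (expected - actual_cost_pmpm) * members * months
    print(f"Assumed trend {trend:.0%}: claimed savings ${savings:,.0f}")
# The same program "saves" $3.6 million, $14.4 million, or $25.2 million,
# depending solely on the choice of trend.
```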

Rule #9

Speaking of “trend,” there is no evidence that the trend for the nonchronic conditions can be used as a proxy for what the chronic-disease trend would have been absent the intervention.

Contrary to what actuaries will tell you, the nonchronically ill population typically differs from the diseased population on nearly all demographic or economic variables. Using a noncomparable group to determine expected trends in cost will introduce measurement bias and limit the ability to draw accurate conclusions about the results. Only if many serial observations of cost are determined to be equivalent between the populations can some degree of confidence be achieved in using the nonchronic trend as a comparator for the chronic population.

These concerns are illustrated by the National Hospital Discharge Survey (NHDS) data presented in Figure 3. Three major chronic disease categories (circulatory, endocrine, and respiratory) are compared with “all other” discharges. As shown, the categories in which the majority of conditions are chronic have been flat, while nonchronic discharges rose by 7.5 percent over the observed five-year period. It can also be assumed that the “all other” category of discharges is more costly than the chronic conditions, because many of the diagnoses require surgeries (e.g., injuries, deliveries, and complications), as opposed to less costly medical stays. While these data do not represent chronic versus nonchronic populations per se, and some degree of overlap is inevitable, they do demonstrate that both the level and the trend of discharges for nonchronic conditions are significantly higher than for the categories considered chronic. These findings suggest that applying a nonchronic “trend” to the diseased population will bias the results in favor of the DM program.

There is also the problem that the two populations aren’t static. The “planes on the ground” will show up in the nonchronic population in the baseline. Then suppose some of them have an event. That event will show up in the nonchronic trend line, causing it to rise even though the person had the condition. In the next period, those people will be shifted into the chronic population. And, like many people with an event, they will then regress to the mean and not have another event. Thus their spike in costs is counted as part of the nonchronic trend, and their subsequent regression to the mean as part of the chronic trend. As a result, the calculation is biased in favor of the program.

Or, some actuaries might do it the other way around to avoid this, and recalculate the trend for both the chronic and nonchronic populations retrospectively, once it is learned that some of the people in the baseline were really “planes on the ground” and not nonchronic, and should have been in the chronic population to begin with. How many years would one do this for, retrospectively? How many times would one recalculate?

No matter which way one looks at it, finding an answer is cumbersome and the answer is probably invalid anyway. One can see why simply counting how many events are avoided is becoming the preferred approach.

Rule #10

A population selected based on “risk scores” of any type — including members selected on the basis of predictive modeling — will also regress to the mean.

One of the most common fallacies of the actuarial approach is the belief that starting not with a population identified through claims, but rather with a population selected by its risk score, will avoid regression to the mean. However, all risk-scoring methodologies weight either last year’s claims or some proxy for last year’s claims. Why? For the simple reason that last year’s claims are a good predictor of this year’s claims. Many people who were high-cost last year will stay high-cost, though some will move to low-cost.

Likewise, most of the people who were low-cost last year will remain so, though some will move to high-cost. If predictive models or risk scoring could predict those two moves, then indeed “risk scores” would avoid regression to the mean. But they can’t. If your doctor can’t predict when you will have a heart attack, how can a software claims algorithm?

A major risk-bearing health system once tested the ability of predictive algorithms to truly “predict,” meaning to determine which low-cost people would transition to become high-cost. What they found was, ironically, in itself quite predictable but also disappointing. Only a few low-cost members were correctly predicted to become high-cost.

In one exercise, predictive modeling vendors were asked to, well, predict. They were given a two-year-old data set and asked to predict which low-cost people would become high-cost in the following year. “Low cost” was defined as having had claims of $4,000 or less in the base year, while “high cost” was defined as $10,000 or more. Very few members were predicted to transition in this manner, with only Vendor A predicting more than a handful. The percentage of those predicted who actually became high-cost tells a story, too: even Vendor D, which tried to be most specific in its prediction and flagged fewer than 20 members, was right about only one of them.

In reality, several thousand people transitioned in this manner. Most were a surprise to the predictive modeling vendors, and were probably also surprises to themselves and their physicians.
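
A minimal sketch of how such a test can be scored (the member IDs below are stand-ins; the $4,000 and $10,000 thresholds come from the exercise described above):

```python
# Positive predictive value of a model's low-cost-to-high-cost "predictions":
# of the members the model flagged, how many actually became high-cost?

def transition_ppv(predicted_high: set, actually_high: set) -> float:
    """Share of predicted transitions that really happened."""
    if not predicted_high:
        return float("nan")
    return len(predicted_high & actually_high) / len(predicted_high)

predicted_high = {"m017", "m203", "m344", "m512"}           # members a model flagged
actually_high = {"m203", "m790", "m811", "m902", "m955"}    # members who truly transitioned

print(f"PPV: {transition_ppv(predicted_high, actually_high):.0%}")    # 25% in this toy example
print(f"Missed transitions: {len(actually_high - predicted_high)}")   # 4
```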

Conclusion: While risk scoring and its predictive-modeling cousin may have a role in identifying members for DM, they cannot be relied upon as a tool to predict a cohort’s claims cost, nor can they be used as a study design when trying to select a population whose future claims will be immune to regression to the mean.

Rule #11

The good news is that you can measure validly using “ingredients you have around the kitchen,” without the need for expensive actuarial consulting.

The preferred methodologies described in this chapter can be applied by any health plan or large employer using just ICD-9 codes. There is no reason to spend large sums of money on actuarial modeling when you can get greater validity and transparency simply by determining whether you have avoided the events and complications closely associated with the conditions you are managing, which is, after all, the very point of managing them.

Taking the advice in this chapter will improve your measurement dramatically. Too many reports are accepted that contain too many unnoticed mistakes, mistakes that would be caught if these simple rules and observations were followed. You may think you can spot these mistakes already. But you probably didn’t even notice that this chapter on “10 Things You Need To Know about Measuring Outcomes” actually contained a list of eleven.

For further reading

  • Linden A. What will it take for disease management to demonstrate a return on investment? New perspectives on an old theme. Am J Manag Care. 2006;12(4):217–222.
  • Linden A. Use of the total population approach to measure U.S. disease management industry’s cost savings: issues and implications. Dis Manage Health Outcomes. 2007;15(1):13–18.
  • Linden A, Biuso TJ, Gopal A, Barker AF, Cigarroa J, Haranath SP, Rinkevich D, Stajduhar K. Consensus development and application of ICD-9 codes for defining chronic illnesses and their complications. Dis Manage Health Outcomes. 2007;15(5):315–322.
  • Serxner S, et al. Testing the DMAA’s recommendation for disease management program evaluation. Population Health Management. 2008;11(5).

Al Lewis is executive director of the Disease Management Purchasing Consortium International Inc. in Wellesley, Mass., and a co-editor of this book with Jill Brown of Atlantic Information Services. Contact Lewis at alewis@dismgmt.com.

Copyright ©2011 by Atlantic Information Services Inc. (AIS). This is an excerpt from Disease Management and Wellness in the Post-Reform Era, published by Atlantic Information Services Inc. (www.AISHealth.com). It is reprinted with permission from AIS. For more information on this book, visit http://aishealth.com/marketplace/disease-management-and-wellness-post-reform-era