EVALUATING THE "National Demonstration Study"

Bradley, R.T., McCraty, R., Atkinson, M., Arguelles, L., Rees, R.A. & Tomasino, D. (2007). Reducing Test Anxiety and Improving Test Performance in America’s Schools: Results from the TestEdge National Demonstration Study. Boulder Creek, CA: Institute of HeartMath.

     In this study, the subjects were tenth-grade students at two high schools in Northern California. At one school, 602 students received HeartMath training. At the other school, described as the “control group,” 332 students were not trained. The HeartMath TestEdge® program included practicing the Freeze-Frame technique and trying to “re-experience” positive emotions. The students did so while using the computerized “emWave® PC Stress Relief System,” which provides biofeedback.

     When reporting the students’ test scores, the researchers’ summary says:

“In four matched-group comparisons (involving subsamples of 50 to 129 students) there was a significant increase in test performance in the experimental group over the control group, ranging on average from 10 to 25 points.” (p. 6)

In the sections that follow, I discuss the weaknesses in that claim.

WHEN A "CONTROL GROUP" IS NOT REALLY A CONTROL GROUP

       This study did not have random assignment of individual subjects. Instead, using a quasi-experimental procedure, the researchers randomly assigned conditions to two schools: large, naturally occurring groups of subjects. This is a legitimate research method, but it requires collecting extra data to assess how similar the natural groups are. The danger is that the schools could have pre-existing differences that cause their students to perform differently.
       In this study, one pre-existing difference was the number of people in the groups. At the start of the study, complete data were collected for 602 students at the school assigned to use HeartMath, but for only 332 students at the other school. This discrepancy complicates any statistical test of whether the average test scores at the two schools differed significantly, whether because of HeartMath or for any other reason.
       There also was a sex difference between the groups. At the school assigned to use HeartMath, 48% of the students were female; at the other school, 60% of the students were female. There also were ethnic differences: at the HeartMath school about 50% of the students were Hispanic or Latino, 37% were White, and 3% were Asian. At the other school, only about 12% were Hispanic or Latino, 54% were White, and 20% were Asian. (These figures are for students who did the posttests as well as the pretests.)
       The researchers also compared the schools on a standardized measure of academic performance: California’s Academic Performance Index (API). The API results were for the year before this study was conducted. At the HeartMath school, the API was 666; at the other school the API was 740, a significant 74-point difference. (At that time, the statewide average score was 671.) Because the “control” school was performing at a higher level before the study began, there would be less room for these students to improve by the time the study was completed. In other words, the control group might not show much improvement in test scores simply because their scores were so high already.
       On the other hand, “Teacher workloads were higher and per pupil expenditures were lower in the intervention site compared to the control site.” (p. 65) This difference could reduce the chances of student success at the HeartMath school and inhibit the possible benefits of any learning technique.
       You should not assume that the advantages and disadvantages of the schools somehow balance each other out. There is no scientific way to equate such different variables. The general problem is that the schools appear to have been very different at the start of the study, and so any differences at the end could be due to characteristics of the schools and not because one school used HeartMath.

       In addition to the lack of random assignment of individual subjects, and the apparent nonequivalence of the two schools, the study did not have “blind” observers. So, this study had all three of the major problems described on my other page.

       Also, the researchers used some dubious statistical procedures when analyzing the results.

STATISTICAL ISSUES

Attrition: Dealing with Dropouts

       Although the study lasted only four months, many students who started the study did not finish it. At the school using HeartMath, 602 started but only 488 finished--an attrition rate of 18.9%. At the other school, 332 started the study but only 261 finished--an attrition rate of 21.4%. The large number of dropouts means that the populations tested at the end of the study may have been significantly different from the populations tested at the beginning.
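The attrition rates above are simple arithmetic on the head counts. A minimal check in Python (the function name is mine; the counts for the control school are the ones given in the report):

```python
def attrition_rate(started, finished):
    """Percentage of subjects who began the study but did not finish it."""
    return 100 * (started - finished) / started

# Control school: 332 students started the study, 261 finished.
control = attrition_rate(332, 261)
print(f"Control school attrition: {control:.1f}%")  # 21.4%
```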
         What could be the consequences of so many people dropping out? One possibility is that smarter, more highly motivated students were more likely to continue participating. Let’s assume, for a moment, that this is what happened. If it did, the average IQ and/or need for achievement of the students remaining at the end of the study would be higher than the average of all the students who started the study. This means that the higher average test scores at the end of the study could be caused by higher average intelligence and/or motivation, and not by the way that students were taught. The researchers did not even mention this possibility in their report. Of course, I’m only making a suggestion here. Attrition could have other consequences. The problem is that high attrition rates raise doubts about the meaning of the results.
         Also, the researchers reported the pretest scores of all the students who began the study and then compared those with the posttest scores of just the students who were still participating at the end. This means that the pretest performance of larger groups was compared with the posttest performance of smaller groups—they compared apples with oranges. The researchers should have compared the pretest and posttest scores of only the students who finished the study—the ones for whom complete data were available—to see whether significant changes occurred.
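A complete-case comparison of the kind the researchers should have made can be sketched in a few lines of Python. The student IDs and scores below are invented for illustration; nothing here comes from the study’s data:

```python
# Hypothetical pretest scores for all students who started the study.
pretest = {"s1": 310, "s2": 295, "s3": 330, "s4": 288, "s5": 305}

# Posttest scores exist only for students who finished (s4 dropped out).
posttest = {"s1": 322, "s2": 301, "s3": 328, "s5": 315}

# Complete-case analysis: compare pre and post for the SAME students only.
finished = sorted(pretest.keys() & posttest.keys())
pre_mean = sum(pretest[s] for s in finished) / len(finished)
post_mean = sum(posttest[s] for s in finished) / len(finished)
print(f"n = {len(finished)}, pretest mean = {pre_mean:.1f}, "
      f"posttest mean = {post_mean:.1f}, change = {post_mean - pre_mean:+.1f}")
```

Comparing the mean pretest score of all five students with the mean posttest score of the four who finished would mix two different populations, which is the apples-and-oranges problem described above.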

Post Hoc Analysis: Creating Sub-Samples

         The researchers ran many statistical analyses of students’ test scores. They reported the results of their first analysis this way:

“However, the results of an ANCOVA (not shown) for all students in the intervention school did not find evidence of a relationship between the frequency of use of the TestEdge tools and pre-post intervention test anxiety reduction. There was also no evidence of a relationship between student use of the tools and 10th grade CST ELA test performance [the California Standardized Test of English-Language Arts].” (p. 119)

       In other words, when they first looked at the data for all the students, they did not find evidence that HeartMath techniques reduced test anxiety or increased scores on a standardized English test.
       This initial failure, however, didn’t slow them down for long. Like many other researchers, they applied a simple rule: “If you don’t find what you’re looking for, look somewhere else.” They began carving their total population of students into various subgroups, and analyzing the scores of the subgroups to see if there were differences.
       To be fair, creating one of these subgroups addressed a problem with the study. The researchers had intended to measure test performance by using students’ scores on the California Standardized Test (CST). The students’ CST scores in the 9th grade would serve as a pretest—a measure taken before the start of the HeartMath treatment. The students’ CST scores in the 10th grade would serve as the posttest, taken after the treatment. Then the pretest and posttest scores could be compared to see whether the treatment made a difference. This procedure is typical of quasi-experiments, and a very good way to test hypotheses. In this study, however, the researchers ran into trouble:

“. . . with the exception of the CST English-Language Arts test, which appeared to be administered universally on a standardized basis to all students in both 9th and 10th grades, and thus met our need for a repeated measures format, a number of unanticipated complications prevented our use of much of the CST data.” (p. 87)

          The problem was that the CST is actually a collection of tests in various subjects, and different students took different combinations of these tests. Students did not always take the same tests in the 9th and 10th grades, and some students at the two schools took different combinations of tests.

“For example, in the 9th grade 91% of the experimental group took Earth Science while 85% of the control group took Biology; in the 10th grade most of the experimental group took Biology while the control group took Chemistry. This meant that the CST Science scores could not be compared and thus were unusable.” (p. 87)

         The good news, mentioned above, was that all students took the English tests, and so their pre- and post-HeartMath English scores could be compared. The researchers also got creative and found a way to use the math scores of some of the students:

“. . . we found a notable subset of 183 students (121 in the experimental group and 62 in the control group) who all took Geometry in the 9th grade and who also all took Algebra 2 in the 10th grade. In the analyses that follow in a later section, this group of students is referred to as Math Group 1.” (p. 87)

       I don’t know what math faculty may think about equating performance in geometry with performance in algebra, but let’s say that this was a reasonable way to select students who had both a pretest and a posttest in math. Unfortunately, the group of math students at the HeartMath school was almost twice as large as the group at the other school. As I mentioned earlier, large differences in the sizes of groups make statistical analyses very problematic.
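One way to see why the imbalance matters: the standard error of a difference between two group means depends on both sample sizes, and for a fixed total it is smallest when the groups are equal. A sketch in Python, with an invented score variance of 400 assumed for both groups:

```python
import math

def se_diff(var1, n1, var2, n2):
    """Standard error of the difference between two independent group means."""
    return math.sqrt(var1 / n1 + var2 / n2)

# Same assumed score variance in both groups; only the group sizes differ.
balanced   = se_diff(400, 92, 400, 91)   # an even split of the same 183 students
imbalanced = se_diff(400, 121, 400, 62)  # the 121-versus-62 split from the report
print(f"balanced SE = {balanced:.2f}, imbalanced SE = {imbalanced:.2f}")
```

With the same 183 students in total, the 121-versus-62 split produces a larger standard error than an even split would, so the comparison between groups is less precise.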

         There’s another problem: when the researchers report the results of the analyses of the “Math 1” groups, the numbers change:

“Moving to the results for Math Group 1 . . . Altogether, there was a total of 129 students in this sub-sample, of whom 69 (53.5%) were in the experimental group and 60 (46.5%) were in the control group.” (p. 157)

       Somehow, the 183 students with math scores became 129 students; the 121 of them at the HeartMath school became 69; and the 62 students at the other school became 60. It’s true that the sizes of the groups became similar—and this is good—but the report never explains how the researchers selected these students for the analysis and excluded the rest. Discrepancies such as this raise further doubts about the meaning of the reported results.

     To return to the discussion of claims about HeartMath, click here.

© 2009 David Douglass