Outline of Evaluating Research Paper

1. Overview of the Paper
a. Statement of the paper's research question
b. Description of the statistics used to address the question
2. Methodology and Methods
a. Methodological assumptions behind the use of these particular statistics
b. Strengths and weaknesses of their use in addressing the research question
3. Statistical Calculations
4. Conclusions

Moody, S.W., Vaughn, S., Hughes, M., & Fischer, M. (2000). Reading Instruction in the Resource Room: Set Up for Failure. Exceptional Children, Vol. 66, No. 3, pp. 305-316. (PDF is available on the wiki site. It was uploaded July 31, 2009.)

Overview of the paper

Although the Individuals with Disabilities Education Act (IDEA) states that students with disabilities will be provided "specially designed instruction" to meet their unique needs, related studies have reported that programs fail to provide specialized instruction. A previous study by Vaughn, Moody, & Schumm (1998) indicated that students with special needs received little individualized instruction during reading instruction, made no significant gains in reading comprehension, and made inadequate progress in reading fluency. The authors state that the present structure of the resource room is failing to provide opportunities for a "special" education. The current study by Moody, Vaughn, Hughes, & Fischer (2000) was designed as a 2-year follow-up of the teachers who participated in the 1998 study. The authors were interested in reexamining these teachers' instructional practices in light of reading reform initiatives implemented over the intervening 2 years. The authors clearly discuss the educational significance of the problem, citing several previous studies.

The researchers indicate 4 areas of focus for the study: a) the instructional practices for reading implemented by special education resource room teachers, b) the reading outcomes of students with disabilities in resource room settings, c) the grouping practices for instruction, and d) the teachers' perspectives on what makes special education instruction special.

The authors used traditional measures to describe study participants (mean IQ scores, disability label, reading comprehension tests, demographic data). In addition to traditional measures, the researchers used a teacher checklist to collect data regarding the amount of time each student spent on reading/language arts instruction in their classrooms. Teacher interviews were conducted to ascertain instructor perceptions of particular reading practices. The interview questions were evaluated for clarity, brevity, appropriateness, and freedom from ambiguity, social desirability, and potential inferences of bias. Interviews were conducted individually before and after the study.

A Likert-type scale was used to measure time in groups, the extent to which the teacher monitors ongoing performance, and how frequently the teacher provides positive feedback. This measure included a quantitative section as well as a descriptive section. A teacher self-report measure was also used to collect information regarding instructor perceptions of the day's lesson. Both a test for reading fluency (TORF) and a test for reading comprehension (Woodcock-Johnson Achievement Test-Revised, or WJ-R) were administered. Both assessments are considered to have acceptable reliability and validity. Two statistical measures were modified for use in this study, although the modifications are not explicitly described.

Methodology and methods

Conceptually, statistics appear near the end of a research project, and it is important to evaluate them within that context:

Research question → theory, epistemology → methodology (assumptions about the practical means by which the research question can be answered) → methods → data → statistical analysis.

Critiquing the use of statistics within a study is thus difficult to do without critiquing the authors’ methodologies and methods. For example, in this paper, the authors support their choice of the Wilcoxon Signed Ranks test (which we did not cover in this course) for the data on classroom grouping practices, and it may well be that this specific test is the best choice for these data. However, even when the choice of statistical test is sound, other assumptions may be contestable. The authors assume that grouping practices can be represented on a Likert scale, ordered from whole-class to individual. This methodological assumption risks ignoring other nuances in classroom dynamics. For example, while the difference between individual instruction and whole-class instruction is in some respects ordinal, there are often other dynamics at play. Some students might in fact benefit more from working in pairs than alone with a teacher. A statistical analysis that treated these practices as nominal categories, rather than as ordinal ones, would have made it conceptually easier to look for advantages and disadvantages of specific grouping practices regardless of their rank on a scale from whole-class to individual. Similarly, the Likert scale can mask variations within the different grouping formats. “Use whole-class activities” is analyzed as one grouping format, yet in practice, whole-class activities involve a diversity of dynamics that might relate to personalities, class size, duration of activity, and other contextual factors, in addition to the whole-class aspect of the activity. In sum, the statistical analysis of classroom grouping practices is only as sound as the underlying assumptions.
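
To illustrate the nominal treatment described above, the following minimal Python sketch summarizes an outcome separately for each grouping format. The format names, their positions on the scale, and all of the numbers are our own invented assumptions, not data from Moody et al. (2000). The point is only that a format in the middle of the scale (here, pairs) can stand out once the categories are no longer forced into a rank order.

```python
from statistics import mean

# Hypothetical ordering of grouping formats, from whole-class (1) to individual (4).
SCALE = {"whole_class": 1, "small_group": 2, "pairs": 3, "individual": 4}

# Hypothetical (grouping format, reading gain) observations, invented for illustration.
observations = [
    ("whole_class", 1.0), ("whole_class", 2.0),
    ("small_group", 3.0), ("small_group", 2.0),
    ("pairs", 6.0), ("pairs", 5.0),
    ("individual", 2.0), ("individual", 3.0),
]

def mean_gain_by_format(rows):
    """Nominal view: summarize each format on its own, ignoring any ordering."""
    formats = {fmt for fmt, _ in rows}
    return {fmt: mean([g for f, g in rows if f == fmt]) for fmt in formats}

by_format = mean_gain_by_format(observations)
best = max(by_format, key=by_format.get)

# Print the summaries in scale order to show that the largest mean gain
# need not sit at either end of the whole-class-to-individual scale.
for fmt in sorted(by_format, key=SCALE.get):
    print(SCALE[fmt], fmt, by_format[fmt])
print("largest mean gain:", best)
```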

Additional statistics might have enhanced Moody et al.'s (2000) conclusions. The authors are broadly concerned with what they argue to be the failure of the present structure of special education resource rooms to provide effective reading instruction, and they address this concern through the four areas of focus described above. Indeed, reading the introduction, one is interested in how research on these different areas interrelates: for example, how changes in instructional practices might correlate with reading outcomes. However, their methods and analyses ultimately address these questions separately. They present data on classroom grouping practices and on changes in reading growth, but they do not attempt to correlate the two categories. Grouping practices are discussed at the classroom level, and reading outcomes are discussed at the school district level. One wonders whether the reading outcome data could have been broken down by classroom in order to make a test of association between these two categories possible. It seems that this might well have been possible, since the authors use paired samples t-tests to evaluate gain in reading ability. The paired samples test evaluates the change in ability for each student, rather than for students in the aggregate. If data on reading ability were available at the level of the student, they presumably could have been analyzed at the classroom level.
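
The following Python sketch shows why student-level data would support both analyses. It computes a paired samples t statistic from per-student differences and then averages the same differences within each classroom. The classroom labels and scores are hypothetical values we invented for illustration; they are not taken from Moody et al. (2000), and the sketch is not a reconstruction of the authors' calculations.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (classroom, pretest score, posttest score) per student.
# These numbers are invented and are not from the study under review.
records = [
    ("A", 20, 24), ("A", 31, 30), ("A", 18, 25),
    ("B", 40, 41), ("B", 22, 29), ("B", 35, 35),
]

def paired_t(pre, post):
    """Return (t statistic, degrees of freedom) for a paired samples t-test."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean(diffs) / se, n - 1

def mean_gain_by_classroom(rows):
    """Average the per-student gain within each classroom."""
    gains = defaultdict(list)
    for room, pre, post in rows:
        gains[room].append(post - pre)
    return {room: mean(g) for room, g in gains.items()}

t_stat, df = paired_t([r[1] for r in records], [r[2] for r in records])
gains = mean_gain_by_classroom(records)
print(round(t_stat, 3), df, gains)
```

Because the test already depends on each student's own difference score, grouping those differences by classroom requires no additional data, only a classroom label for each student.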

Statistical calculations

Moody et al. (2000) report several different statistical calculations in their results. The classroom climate scale was analyzed using the Wilcoxon Signed Ranks test, a non-parametric alternative to the paired samples t-test. Among the assumptions that must be satisfied is that the scale of measurement for the two sets of values has the properties of an equal-interval scale (Lowry, 2009). We argue above that the imposition of an ordinal Likert scale on classroom groupings might be conceptually problematic. If the steps on that scale are not equally spaced, the data might not meet this assumption. However, we would need to be more familiar with this test before evaluating its use here.
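
As a rough illustration of how the test works, the sketch below follows the usual textbook procedure for the signed-rank sums: drop zero differences, rank the remaining differences by absolute size (tied values share an average rank), and sum the ranks separately for positive and negative differences. The ratings are hypothetical 1-5 Likert values we made up, and the function name is our own; this is not the authors' computation. Note that the procedure ranks the sizes of the differences, which is where the equal-interval assumption enters.

```python
def signed_rank_sums(before, after):
    """Return (W_plus, W_minus), the rank sums of positive and negative differences.

    Zero differences are dropped, and tied absolute differences share the
    average of the ranks they occupy.
    """
    diffs = [b - a for a, b in zip(before, after) if b != a]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 through j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus

# Hypothetical Likert ratings (1-5) for eight teachers at two observation times.
first = [3, 4, 2, 5, 3, 4, 2, 3]
second = [2, 4, 1, 3, 4, 2, 2, 1]

w_plus, w_minus = signed_rank_sums(first, second)
print(w_plus, w_minus)
```

The smaller of the two sums would then be compared against a critical value or converted to a z score; we leave that step out because it depends on sample size and on how ties are handled.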

The Wilcoxon test produced several results regarding the various instructional grouping formats. The calculations for each of the instructional groupings were all negative correlations. The authors state that these findings fell within the significant range, but they do not clarify in what manner the findings were significant or how the negative correlations affect interpretation of the data. The discussion of teacher interviews in the results section is problematic from a calculation standpoint, because the authors describe the procedure and not the results. Had the authors stated their findings at each step of coding the interviews, they would have had data on which to base conclusions and recommendations. The authors performed several calculations evaluating grouping formats of instruction and teacher behavior. However, calculations in specific achievement ranges for teachers would have enhanced the quality and impact of the data.

The other area in which the authors perform data analysis is student achievement, where they conduct paired samples t-tests on both TORF and WJ-R scores. One area of concern is the selection of the WJ-R. The WJ-R is a standardized test battery, so one would assume that scores would remain similar across time and that there would not be a significant increase in reading achievement. Throughout the results, the authors use several statements that imply calculations were made; however, these statements are not quantified or supported by data (e.g., "considerably more"). Such terms are ambiguous and leave the reader to interpret the findings personally. Overall, the study's statistical calculations show both strengths and areas for improvement.

We omit complete examples of the statistical tests used in the paper, partly in the interest of remaining within this assignment's space constraints. We note that we did not cover the Wilcoxon Signed Ranks test in class, and students conducted a paired samples t-test in Homework 5.


Conclusions

Throughout this paper there are gaps in the conceptual framework, methodology, and discussion. One major disconnect between the data collection and the discussion is that the methods evaluated instruction within the general education setting, whereas the discussion opens with special education resource room instruction, which was not evaluated in the current study. The discussion also expresses a preference for phonics instruction. This preference was never fully supported by research in the introduction, nor was it evaluated within the methods. The authors also make some large leaps in drawing conclusions from the data. One of the major claims is that we cannot blame teachers for students’ lack of progress; instead, the authors attribute it to the large caseloads of special education teachers. This study did not examine the impact of caseload on students’ performance, so the claim is not supported by the data. The authors also do not mention any limitations. Acknowledging limitations is vital for reflective researchers, who should recognize the limits and implications of their research as well as the areas that still need investigation. The gaps in the overall conceptual framework become apparent when the authors draw conclusions about the data produced in the study. A well-designed research study must connect three things: the conceptual framework, which is rooted in previous research; the research design, which rests on an understanding of the assumptions made in various designs; and the implications, which concern how the research can be interpreted and how it can affect practice within the field.

Lowry, R. (2009). The Wilcoxon Signed-Rank Test. http://faculty.vassar.edu/lowry/ch12a.html. Last accessed August 16, 2009.
Dumont, R., Willis, J.O., & Janetti, J. Tables to Aid in the Interpretation of the Woodcock Johnson - Revised Cognitive Battery. http://alpha.fdu.edu/psychology/WJR_tables_to_aid_interp.htm. Last accessed August 16, 2009.