
Chen on Program Theory and Apt Evaluation Methods: Applications to Team Read


Presentation Transcript


  1. Chen on Program Theory and Apt Evaluation Methods: Applications to Team Read PA 522—Professor Mario Rivera

  2. Articulating and testing program theory • Chen addresses the role of stakeholders with respect to program theory by positing that there is both scientific validity and stakeholder validity. • A program’s change and action models can be determined by reviewing existing program documents and interviewing program principals and stakeholders. Evaluators may also facilitate discussions, for instance on strategic planning, logic models, and change models. • Discussion of program theory entails forward reasoning and backward reasoning—either (1) projecting from program premises (“feed-forward”) or (2) reasoning back from actual or desired program outcomes (“feedback”). • An action model may be articulated in facilitated discussion, then distributed to stakeholders for further consideration. • Evaluation design incorporates explicitly articulated program theory, so that single or multiple program stages can be tested against it.

  3. Evaluation and threats to internal and external validity • Are you sure that it is the program that caused the effect? • Typical concerns: secular trends, history effects, attrition and turnover, testing effects, regression to the mean, etc. • Mixed-methods evaluation can help uncover complex causal relationships • As with Team Read, evaluating for multiple purposes and at multiple levels at the same time (e.g., learning, self-efficacy, organizational/program capacity) may be uniquely revealing • Program evaluation for partnered programs often means assessment along several lines of inquiry—and therefore evaluating more inclusively than is usually the case • Formative evaluation occurs at the organizational and program levels (internal validity), while impacts are assessed more comprehensively, encompassing more actors (external validity) • Context is important—that is, the program environment and the broader strands of causation that include exogenous influences

  4. “Integrative Validity”—from Chen, Huey T., 2010. “The bottom-up approach to integrative validity: A new perspective for program evaluation,” Evaluation and Program Planning, 33(3), pp. 205-214, August. • “Evaluators and researchers have . . . increasingly recognized that in an evaluation, the over-emphasis on internal validity reduces that evaluation's usefulness and contributes to the gulf between academic and practical communities regarding interventions” (p. 205). • Chen proposes an alternative, integrative validity model for program evaluation, premised on viability and “bottom-up” incorporation of stakeholders’ views and concerns. The integrative validity model and the bottom-up approach enable evaluators to meet scientific and practical requirements, help advance external validity, and gain a new perspective on methods. • For integrative validity to obtain, stakeholders must be centrally involved; this is consistent with Chen’s emphasis on addressing both scientific and stakeholder validity. Was there sufficient stakeholder involvement in the Team Read evaluation?

  5. A Program Theory or Change Model traces the causal chains anticipated by program designers • Descriptive and prescriptive theory assumptions: the activities and outcomes that must obtain before intermediate- to long-term program outcomes can be expected (this is the action model) • The evaluator works with program principals and other stakeholders to uncover and make explicit a program’s underlying theory of change, or change model • [Diagram elements: program outcome; mediating variables; moderating variables; external, internal, and intervention determinants]

  6. Team Read and mediating & moderating variables • Action model (which together with the change model = program theory): implementation/intervention determinants → program outcomes • Mediators (mediating variables, facilitators): generative, enabling elements essential to the intervention itself. In Team Read, cross-age tutoring, a critical implementation method, is the most essential intervention determinant, or mediator. Parental involvement (supportive, engaged parents) may also be considered a key determinant, one that is external to the program but also encouraged by it (to that extent it is also an internal determinant). • Moderating/attenuating variables (impediments): external barriers to success such as socioeconomic variables (poverty, immigrant status, parents’ education, English as a second language, prejudice) and staff problems with cultural competency (an internal moderator). • Net impact on individual subjects of the intervention; aggregate net impacts = program outcomes

  7. Logic Models versus Change & Action Models: How are they different? Applicable? • Logic models graphically illustrate operational linkages among program components, and creating one helps stakeholders clearly identify outcomes, inputs, and activities • Theories of change (change and action models) link outcomes and activities to explain how and why the desired change is expected to come about, causally. These are sometimes called causal logic models, which combine causal mapping and logic modeling. • Just as outcomes are partly or largely beyond the direct reach of a program, Chen indicates (p. 42) that determinants are partly beyond program reach (such as a drug education program participant’s prior knowledge of the dangers of drug use, his or her prior attitudes toward drug use, or his or her cultural frames). • Consider and discuss a causal logic model and change model for Team Read. Did the evaluator, Jones, provide one? What was the change model, in your view? What determinants of success were beyond program reach and needed to be factored into program design and evaluation planning, consistent with a change model?

  8. Feedforward and feedback loops in logic models—logic models as graphic representations of program dynamics, and of program theory. Outcomes should correspond to strategic goals and objectives. • [Diagram: INPUTS (program investments) → OUTPUTS (program activities, interventions) → OUTCOMES (short-term, medium-term, long-term)]
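
One way to make the feed-forward chain concrete is to treat the logic model as a simple data structure. The sketch below (Python) is illustrative only: the Team Read entries are hypothetical placeholders, not items drawn from the program’s actual documents.

```python
from dataclasses import dataclass

# A minimal sketch of a logic model as a data structure. The stage names follow
# the slide (inputs -> outputs -> outcomes); the Team Read entries are
# illustrative placeholders, not taken from actual program documents.

@dataclass
class LogicModel:
    inputs: list[str]               # program investments
    outputs: list[str]              # program activities, interventions
    outcomes: dict[str, list[str]]  # keyed by horizon: short/medium/long term

team_read = LogicModel(
    inputs=["funding", "site coordinators", "high-school reading coaches"],
    outputs=["cross-age tutoring sessions", "coach training"],
    outcomes={
        "short-term": ["improved decoding and fluency"],
        "medium-term": ["reading at or near grade level"],
        "long-term": ["sustained reading habits and self-efficacy"],
    },
)

# Feedback loop: evaluation findings about outcomes would feed back into
# revising the inputs and outputs in the next iteration of the model.
for horizon, items in team_read.outcomes.items():
    print(horizon, "->", ", ".join(items))
```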

  9. Intervening Variables, Confounding Factors • Intervening variables can take various forms • Direct and indirect effects of one’s intervention (for instance, stakeholder support or resistance); other interventions, messages, or influences; interactions among variables • Reciprocal relations among variables involve mutual causation (for example, the interaction between [1] lack of parental inclusion and [2] limited tutoring effectiveness in the Team Read case). • How context-dependent is the intervention? The more complex it is, the more dependent on context are both program implementation and program evaluation (cf. Holden text). Why would that be? Many interventions are very sensitive to context and can be implemented successfully only if adapted to suit the social, political, and other dimensions of context.

  10. Evaluating complex, partnered programs • Program evaluation in collaborative/partnership contexts—questions pertinent to Team Read: • Does it matter to the functioning and success of a program that it involves different sectors, settings, stakeholders, and standards? What is the role of funders such as foundations? • How do we determine whether partnerships have been strengthened or new linkages formed as a result of a particular program? • How can we evaluate the development of organizational capacity along with program capacity and outcomes? • To what extent have program managers and evaluators consulted with each other and with key constituencies and stakeholders in establishing goals and designing programs? In after-school programs, working partnerships between teachers and after-school personnel, and between both and parents, are essential.

  11. Articulation or specification and refinement of program theory (an iterative process) • [Cycle: implicit theory → specification of that theory → findings of the program evaluation → re-specification or refinement of program theory]

  12. Team Read Evaluation • Evaluation conducted by Margo Jones, a statistician • 17 schools in the program by June 2000 but only 10 used in the evaluation. • 3 main goals or objectives relating to Team Read’s impact: • Goal #1: Do the reading skills of the student readers improve significantly during their participation in the Team Read program? • Goal #2: How does the program affect the reading coaches? How do the coaches see the program? • Goal #3: What is working well, and what can be improved?

  13. Evaluation Methods for Goal #1 (improve reading skills): a two-fold approach

  14. Jones was comparing apples and oranges

  15. Methods used to measure Goal #1 • Both methods for Goal #1 are flawed. • A note from Jones in the report confirms problems with the use of different tests, which are not strictly comparable: “The question then becomes ‘Is it reasonable to assume that the pre- and post-tests measure the same skills?’” • The evaluator adopted a correlation criterion for answering this question. If the correlation between pre- and post-tests was near or above .8 (as was the case for the 2nd grade, where pre- and post-tests were essentially the same), the pre- to post-test change score was interpreted as a gain score. If the correlation coefficient was much less than .8, no analysis was performed. • Yet Jones proceeded with comparisons when tests fell just short of the .8 threshold, calling the NCE-based criterion into question.
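
As a rough illustration of that screening rule, the sketch below (Python, using simulated scores rather than Team Read data) computes the pre/post correlation and interprets the change as a gain score only when the correlation clears roughly the .8 mark.

```python
import numpy as np

# Illustration of the correlation screen described above, using simulated
# NCE-style scores (not Team Read data): interpret pre-to-post change as a
# gain score only when pre- and post-tests correlate at roughly .8 or better.

rng = np.random.default_rng(0)
pre = rng.normal(40, 10, size=200)             # simulated pre-test NCE scores
post = 0.9 * pre + rng.normal(8, 4, size=200)  # simulated post-test NCE scores

r = np.corrcoef(pre, post)[0, 1]
if r >= 0.8:
    gain = (post - pre).mean()
    print(f"r = {r:.2f}: tests treated as comparable; mean gain = {gain:.1f} NCE points")
else:
    print(f"r = {r:.2f}: tests not comparable enough; no gain-score analysis")
```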

  16. Goal #1, second method • 2nd method: determine the proportion of Team Read participants who moved from a below-grade-level test score to at or above grade level, compared with all students in the entire district. • Note: one of the selection criteria for Team Read is scoring within the bottom 25% of district reading test scores. Consequently, what was wrong with this vector of comparison (Team Read cohort versus the entire school district)? (See the simulation sketch below.)
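
Part of the answer is a selection artifact. The simulation below (Python, purely synthetic scores, with no program effect modeled at all) shows that a cohort chosen from the bottom quartile of a noisy test will appear to move toward grade level on retest simply through regression to the mean, so comparing that cohort with the whole district overstates apparent gains.

```python
import numpy as np

# Simulated illustration of regression to the mean: no program effect is
# modeled, yet students selected from the bottom 25% of a noisy pre-test
# still show an apparent average improvement at post-test.

rng = np.random.default_rng(1)
ability = rng.normal(50, 8, size=10_000)       # stable underlying reading skill
pre = ability + rng.normal(0, 6, size=10_000)  # noisy fall test
post = ability + rng.normal(0, 6, size=10_000) # noisy spring test, no intervention

selected = pre < np.percentile(pre, 25)        # Team Read-like selection rule

print("mean change, bottom-quartile cohort:", round((post - pre)[selected].mean(), 1))
print("mean change, whole district:        ", round((post - pre).mean(), 1))
```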

  17. Testing problems • Major, obvious problems with the before-and-after testing approach used in the evaluation: • Pre- and post-tests do not measure the same skills except in the 2nd grade • All Team Read participants are within the bottom quartile of reading test scores and should not be compared with students district-wide, but rather pre- and post-program in a cross-panel evaluation, if possible against lowest-quartile nonparticipants in Team Read (or in other after-school reading programs) • Analysis could not be carried out for some grades (the 3rd) owing to the incompatibility of the reading tests

  18. Evaluation Method, Goal #2 • Method: survey questionnaire given to reading coaches during their last week of coaching. • Goal: to assess the impact of Team Read on the coaches. What are some built-in weaknesses of satisfaction surveys in evaluating impact? Did Jones take these shortcomings into account? • The survey addressed three main areas of satisfaction: • How positive is the coaches’ experience with TR? • The extent to which the coaches feel that TR helps student readers • How supportive are TR’s site coordinators and staff?

  19. Evaluation Method, Goal #2 – Assessment of Coaches’ Satisfaction • Jones’ survey results showed that coaches were satisfied with their work and felt a sense of pride and accomplishment. The program therefore appeared successful in meeting this goal, unlike Goal #1. • Some questions sought ways to improve the program (a formative purpose). These survey results were useful to the program manager, in ways that the evaluation as a whole, as a formative evaluation, was not managerially useful.

  20. Evaluation Method, Goal #3 • Method: interviews with major stakeholders, one site visit, a review of the last 5 years’ research literature on cross-age tutoring, and the last 3 years’ research on best practices. • Goal: assess the overall progress of the program. • Were the stakeholder interviews adequate? Can stakeholder interviews be adequate if stakeholder involvement has been inadequate? Did the interviews work together with the other methods to compose a fully dimensional picture of the Team Read program?

  21. Evaluation Methods—Summary

  22. Methods and results summary • Reading skills: statistical analysis of standard reading scores before and after the program, using the whole district as a control or reference group against the backdrop of state standardized test scores. Results varied by grade and were not as promising as expected for most grades; one grade (the 3rd) could not be measured. • Coaches: qualitative survey gauging “feelings of satisfaction.” Responses were generally positive; analysis examined school-to-school variation in satisfaction. • Program implementation: literature review—some TR elements worked well according to the current literature; the literature review, as a research synthesis, was a key measure. Also considered feedback from staff, teachers, and managers, plus an unobtrusive measure, observation of a single Team Read session. Found a need to improve training and to match and synchronize pre- and post-tests.
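
The school-to-school satisfaction comparison mentioned above can be sketched in a few lines of analysis code. Everything below is hypothetical: the column names, Likert coding, schools, and values are placeholders, not the actual survey instrument or data.

```python
import pandas as pd

# Hypothetical sketch of a school-to-school satisfaction comparison for the
# coach survey, assuming responses coded on a 1-5 Likert scale. Column names,
# schools, and values are placeholders, not actual Team Read survey data.

survey = pd.DataFrame({
    "school":        ["A", "A", "B", "B", "C", "C"],
    "satisfaction":  [5, 4, 3, 4, 2, 3],   # overall experience with Team Read
    "helps_readers": [5, 5, 4, 4, 3, 2],   # perceived benefit to student readers
})

by_school = survey.groupby("school")[["satisfaction", "helps_readers"]].agg(["mean", "std"])
print(by_school.round(2))
```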

  23. Follow-up evaluation • In 2004, Team Read hired a different consultant, Evergreen Training and Evaluation, to provide an independent impact evaluation of the program. This firm surveyed student readers and reading coaches to ascertain the value of Team Read to students. As before, coaches reported that the program benefited individual student readers, and they gave positive feedback concerning the guidance received from their site coordinators. Meanwhile, student readers reported reading more often and said that reading had become more fun since they joined the program. Parents (approached this time) also reported greater enthusiasm about reading on the part of their children (TeamRead.org). Evergreen did not compare TR results to the entire district, and it engaged more stakeholders. Their evaluation results follow.

  24. IMPACT EVALUATION: 2005-2006 Results From Team Read’s 2006 Annual Report. http://www.teamread.com/downloads/team_read_106903_report_proof.pdf

  25. Summary Team Read Evaluation Update for 2006 • Program Evaluation Results for 2005-2006 School Year    • 70% of 2nd graders and 52% of 3rd graders were reading at/above or approaching grade level • 54% of 2nd graders and 35% of 3rd graders gained greater than 1.5 grade levels in reading • 98% of the parents of 2nd & 3rd graders and 92% of their teachers reported increased reading skills as a result of participation in Team Read • 76% of 2nd & 3rd graders said that reading was more fun since joining Team Read

  26. What Could Jones Have Done Better?

  27. Other options—cross-case comparison • Need for control (or, better, “comparison”) groups that are closely comparable and undergo closely comparable assessments (standardized tests). Comparing Team Read participants with the entire district was poorly conceived and made for disappointing results. • For instance, “Title III” students (Limited English Proficient and immigrant students receiving language instruction) could have been sought out as one demographic in a stratified random or purposive sampling effort, to attain comparability with similarly situated students in the Team Read cohorts. Similarly, levels of parental involvement could have been gauged for both Team Read students and comparison-group students (see the sketch below). • How would such evaluation factors have aligned with coherent change and action models? Did Jones work with stakeholders to clarify program theory?
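
A stratified draw of comparison students might look something like the following. This is a sketch under assumed data only: the DataFrame columns ("grade", "title_iii", "baseline_quartile", "in_team_read") are hypothetical stand-ins for district records, not fields known to exist in the actual district data.

```python
import pandas as pd

# Hypothetical sketch of a stratified comparison-group draw: for each stratum
# (grade x Title III status) among Team Read students, sample an equal number
# of bottom-quartile non-participants from district records. All column names
# are assumptions for illustration, not actual district data fields.

def draw_comparison_group(district: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    pool = district[(district["baseline_quartile"] == 1) & (~district["in_team_read"])]
    team_read = district[district["in_team_read"]]
    pieces = []
    for (grade, title_iii), stratum in team_read.groupby(["grade", "title_iii"]):
        matches = pool[(pool["grade"] == grade) & (pool["title_iii"] == title_iii)]
        n = min(len(stratum), len(matches))      # take as many matches as available
        pieces.append(matches.sample(n=n, random_state=seed))
    return pd.concat(pieces) if pieces else pool.iloc[0:0]
```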

  28. Chen Ch. 9: efficacy & effectiveness evaluation • Chen differentiates “efficacy evaluation” from “effectiveness evaluation.” The main difference is that effectiveness evaluation does not need a “perfect setting,” random assignment of subjects, and the other elements of a controlled experiment in order to work. The contrast corresponds to the distinction Chen draws between scientific and stakeholder validity. Effectiveness evaluation can work well in the uncontrolled real world (p. 205)—it is much more feasible, and works better, in community and human services contexts than does efficacy evaluation. • The case suggests that the evaluator, Margo Jones, wanted to approximate scientific rigor (particularly in her use of statistics) but failed to do so, both by using different, incommensurable tests and by applying a statistical equivalency adjustment that was poorly justified and, in any case, not applied consistently. At the same time, there was little consideration of cultural or demographic factors or of the need for stakeholder inclusion.

  29. Efficacy evaluation & effectiveness evaluation • Chen indicates that many researchers believe that efficacy evaluation should precede effectiveness evaluation. The argument is that only when efficacy evaluation has determined desirable outcomes and appropriate program design, especially a program’s change and action models, can effectiveness evaluation come in and try to check actual outcomes in a real-world setting. • Chen explains how this sequence is used in the biomedical field, and that it is oftentimes successful there. However, while this sequencing may work in clinical settings where all variables can be controlled, with social programs like Team Read these controls are not possible for a number of reasons—costs, ethics, and the variability introduced by human behavior.

  30. Efficacy evaluation & effectiveness evaluation • If effectiveness evaluation was therefore indicated for the Team Read program, there is still a problem: Jones did not involve stakeholders nearly enough in the evaluation. Chen explains that when this occurs, stakeholders are prone to become suspicious of the evaluator’s motives. In this case, stakeholders were not given the opportunity to contribute to (and therefore buy into) the evaluation. • Jones also did not make enough visits to different schools and sites (site visits being another means of stakeholder involvement). The evaluation could have been better had there been an evaluation team (incorporating stakeholders, including TR Advisory Board members) that attended tutoring sessions at different sites. An engaged evaluation team could have built a stronger working relationship and greater trust with key stakeholders (teachers, coaches, parents, students, and program funders and administrators).

  31. Evaluating complex, partnered programs • Collaborative management involves creating and operating through multi-organizational associations and partnerships to address difficult problems that lie beyond the capacity of single organizations. Student educational performance in K-12 presents many such problems, because multiple constituencies, including federal and state governments, school districts, foundations and other private funders, nonprofit agency actors, business interests, and others set disparate expectations and requirements. An additional challenge here is that communication and coordination functions are more complex and taxing than is the case for single-organization programs; evaluation is also more difficult. To become aligned and effective, partnered management and evaluation require closer, more continuous communication than do single-agency counterparts. It is important to gauge partnership functioning early on, and to move judiciously from formative to summative evaluation. Did Jones prematurely move into impact evaluation of TR?

  32. Effective Collaborations • Building relationships is fundamental to the success of collaboration in program planning, design, implementation, and evaluation. Effective collaborations are characterized by: • building clear expectations • defining relationships • identifying tasks, roles, and responsibilities • identifying work plans • reaching agreement on desired outcomes • consulting participants and stakeholders

  33. Evaluation then entails multiple purposes, levels • Assessing the merit and worth of programs • Does the program work? Jones addressed this question too narrowly and tried to gauge program impact too early • Randomized surveys, focus groups, observational studies—the right mix of methods is important for replicability, validity, and reliability • Program and organizational improvement • Formative and process evaluation—process improvement • Oversight, accountability, and compliance • Summative evaluation (of program effectiveness, fidelity) • Implementation fidelity—efficacy as understood by program principals, funders, the school district, and other stakeholders • Knowledge development in multi-level evaluation • How does the program work? What are its essential, defining elements? What are the conditions necessary for its replication? • What adaptations may be needed in other settings?

  34. Concluding guidelines Any valid and effective human services evaluation needs to: • Involve all partners and key stakeholders • Include a genuine feedback loop so that the engagement process truly informs the development of the partnership • Find a good balance between external ‘objectivity’ and internal or local knowledge/experience (stakeholder vs. scientific validity) in designing the program and its evaluation • To these ends, rely on an appropriate or apt mix of methods (mixed methods). While Jones used mixed methods to an extent, the methods chosen were not well suited to the program. They were particularly unsuited to an early-stage program like Team Read.

  35. CDC Evaluation Model (Holden): stakeholder-engaging evaluation as might have been applied to Team Read • [Cycle diagram elements: consultatively defining and refining goals, objectives, program rationale, indicators, and measures; determining program strategies and activities and tailoring these to the student population; implementation and evaluation roles (role clarification for participants and evaluator); participatory, consultative data collection and analysis; formative evaluation of the student assessment process, for program improvement; summative (impact) evaluation of testing outcomes and protocols (cross-panel, pre-post, quasi-experimental designs); consultative effort at testing based on true comparability vectors (consistent tests, no comparing TR to the district); improved student reading performance]

  36. Becoming aware of the ‘filters’ or biases one brings to the task of evaluation • Preconceptions, assumptions, and possibly prejudices • Cultural/political ‘lens’ • Personal values/belief system • Professional discipline/training (leading, for example, to a predilection for some methods over others that may be better suited to the program under review). Did Jones exhibit this bias? How? • Capacity to make sense of complex, multi-source data through evaluation methods that correspond to the program’s level of complexity; using mixed methods that closely match program features is a form of “requisite complexity” (cf. Ashby’s law of requisite variety).

  37. Human services evaluation requires: • Active engagement and interest • Attention to detail • Willingness to explore contradictions • Sifting and selecting data and methods with due sensitivity to the problem and populations addressed • Interpretation and clarification of data and findings accordingly, in a consultative fashion, learning from program principals and stakeholders • ‘Triangulation’ (confirmation) of findings with multiple sources of data, multiple methods, avoiding overreliance on single methods (Jones) • Realistic movement from formative to outcome evaluation • Focus on real-world effectiveness evaluation rather than ‘ideal-circumstances’ efficacy evaluation, when warranted.

  38. A Few Evaluation Resources http://www.cdc.gov/eval/evalcbph.pdf Request print copy from Laurie or Chris http://www.cdc.gov/eval/resources.htm

  39. A Few Evaluation Resources http://ctb.ku.edu/en/ http://www.wkkf.org/Pubs/Tools/Evaluation/Pub770.pdf

  40. Partnership Assessment Elements • Member perspective: relationships and synergy; use of complementary knowledge, skills, abilities, and resources to accomplish more as a group than individually • Trust: includes measures of reliability, shared belief in mission, openness, etc.; perceived benefits, drawbacks, satisfaction • Connectivity (measured interactions between partners): dimensions include interaction patterns, communication, trust, reciprocity; use mapping/graphing or strategic network analysis methodology (see the sketch below) • Sample assessment item: By working together, how well are these partners able to (1) respond to the needs and problems of the community and (2) implement strategies that work?
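
A bare-bones version of the mapping/graphing idea is shown below, assuming a hypothetical set of partners and reported interactions (none of these edges come from actual Team Read data); the point is simply that connectivity dimensions such as density and reciprocity can be computed once interactions are recorded as a directed graph.

```python
import networkx as nx

# Hypothetical sketch of the "mapping/graphing" connectivity measure: each
# reported interaction becomes a directed edge between partners, and simple
# network statistics summarize connectivity. Partners and edges are invented.

G = nx.DiGraph()
G.add_edges_from([
    ("school_district", "team_read"), ("team_read", "school_district"),
    ("foundation_funder", "team_read"), ("team_read", "coaches"),
    ("coaches", "teachers"), ("teachers", "team_read"),
])

print("density:    ", round(nx.density(G), 2))      # share of possible ties present
print("reciprocity:", round(nx.reciprocity(G), 2))  # share of ties that are mutual
```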

  41. Example from the CDC Prevention Research Centers’ Partnership Trust Tool Survey Instrument • Partners identify their top three components of trust • Partners are asked to participate in facilitated discussions of the results

  42. Resources for Partnership Assessments http://www.cdc.gov/dhdsp/state_program/evaluation_guides/evaluating_partnerships.htm http://www.cdc.gov/prc/about-prc-program/partnership-trust-tools.htm

  43. Other Resources • http://www.cdc.gov/dhdsp/state_program/evaluation_guides/evaluating_partnerships.htm CDC Division for Heart Disease and Stroke Prevention, “Fundamentals of Evaluating Partnerships: Evaluation Guide” (2008) • http://www.cdc.gov/prc/about-prc-program/partnership-trust-tools.htm CDC Prevention Research Centers’ Partnership Trust Tool • http://www.cacsh.org/ Center for Advancement of Collaborative Strategies in Health, Partnership Self-Assessment Tool • http://www.joe.org/joe/1999april/tt1.php Journal of Extension article: “Assessing Your Collaboration: A Self-Evaluation Tool” by L.M. Borden and D.F. Perkins (1999)

