Training Evaluation

By Noe, R.A.

Edited by Paul Ducham


Reaction outcomes refer to trainees’ perceptions of the program, including the facilities, trainers, and content. (Reaction outcomes are often referred to as a measure of “creature comfort.”) They are often called class or instructor evaluations. This information is typically collected at the program’s conclusion. You probably have been asked to complete class or instructor evaluations either at the end of a college course or a training program at work. Reactions are useful for identifying what trainees thought was successful or what inhibited learning. Reaction outcomes are level 1 (reaction) criteria in Kirkpatrick’s framework.

Reaction outcomes are typically collected via a questionnaire completed by trainees. A reaction measure should include questions related to the trainee’s satisfaction with the instructor, training materials, and training administration (ease of registration, accuracy of course description) as well as the clarity of course objectives and usefulness of the training content. Table 6.3 shows a reaction measure that contains questions about these areas.

An accurate evaluation needs to include all the factors related to a successful learning environment. Most instructor or class evaluations include items related to the trainer’s preparation, delivery, ability to lead a discussion, organization of the training materials and content, use of visual aids, presentation style, ability and willingness to answer questions, and ability to stimulate trainees’ interest in the course. These items come from trainer’s manuals, trainer certification programs, and observation of successful trainers. Conventional wisdom suggests that trainees who like a training program (who have positive reactions) learn more and are more likely to change behaviors and improve their performance (transfer of training). Is this the case? Recent research results suggest that reactions have the largest relationship with changes in affective learning outcomes. Also, research has found that reactions are significantly related to changes in declarative and procedural knowledge, which challenges previous research suggesting that reactions are unrelated to learning. For courses such as diversity training or ethics training, trainee reactions are especially important because they affect learners’ receptivity to attitude change. Reactions have been found to have the strongest relationship with post-training motivation, trainee self-efficacy, and declarative knowledge when technology is used for instructional delivery. This suggests that for online or e-learning training methods, it is important to ensure that it is easy for trainees to access them and the training content is meaningful, i.e., linked to their current job experiences, tasks, or work issues.



Cognitive outcomes are used to determine the degree to which trainees are familiar with principles, facts, techniques, procedures, or processes emphasized in the training program. Cognitive outcomes measure what knowledge trainees learned in the program. Cognitive outcomes are level 2 (learning) criteria in Kirkpatrick’s framework. Typically, pencil-and-paper tests are used to assess cognitive outcomes. Table 6.4 provides an example of items from a pencil-and-paper test used to measure trainees’ knowledge of decision-making skills. These items help to measure whether a trainee knows how to make a decision (the process he or she would use). They do not help to determine if the trainee will actually use decision-making skills on the job.



Skill-based outcomes are used to assess the level of technical or motor skills and behaviors. Skill-based outcomes include acquisition or learning of skills (skill learning) and use of skills on the job (skill transfer). Skill-based outcomes relate to Kirkpatrick’s level 2 (learning) and level 3 (behavior). The extent to which trainees have learned skills can be evaluated by observing their performance in work samples such as simulators. Skill transfer is usually determined by observation. For example, a resident medical student may perform surgery while the surgeon carefully observes, giving advice and assistance as needed. Trainees may be asked to provide ratings of their own behavior or skills (self-ratings). Peers, managers, and subordinates may also be asked to rate trainees’ behavior or skills based on their observations. Because research suggests that the use of only self-ratings likely results in an inaccurately positive assessment of skill or behavior transfer of training, it is recommended that skill or behavior ratings be collected from multiple perspectives (e.g., managers and subordinates or peers). Table 6.5 shows a sample rating form. This form was used as part of an evaluation of a training program developed to improve school principals’ management skills.



Affective outcomes include attitudes and motivation. Affective outcomes that might be collected in an evaluation include tolerance for diversity, motivation to learn, safety attitudes, and customer service orientation. Affective outcomes can be measured using surveys. Table 6.6 shows an example of questions on a survey used to measure career goals, plans, and interests. The specific attitude of interest depends on the program objectives. Affective outcomes relate to Kirkpatrick’s level 2 (learning) or level 3 (behavior) depending on how they are evaluated. If trainees were asked about their attitudes on a survey, that would be considered a learning measure. For example, attitudes toward career goals and interests might be an appropriate outcome to use to evaluate training focusing on employees self-managing their careers.



Results are used to determine the training program’s payoff for the company. Examples of results outcomes include increased production and reduced costs related to employee turnover, accidents, and equipment downtime as well as improvements in product quality or customer service. Results outcomes are level 4 (results) criteria in Kirkpatrick’s framework. For example, Kroger, the supermarket chain, hires more than 100,000 new employees each year who need to be trained. Kroger collected productivity data for an evaluation comparing cashiers who received computer-based training to those who were trained in the classroom and on the job. The measures of productivity included rate of scanning grocery items, recognition of produce that had to be identified and weighed at the checkout, and the amount of time that store offices spent helping the cashiers deal with more complex transactions such as food stamps and checks.

LQ Management LLC, the parent company of La Quinta Inns and Suites, is responsible for training for the hotels. The company recently implemented a new sales strategy designed to generate the best available rate for each hotel location based on customer demand and occupancy rates. A training program involving an experiential game (“Buddy’s View”) was used to engage staff in understanding how the new sales strategy would impact the business as well as to improve customer service. To evaluate the effectiveness of the program, the company collects business results (Kirkpatrick’s level 4 criteria), specifically, percent changes in service quality and customers’ intent to return before and after the program.


Return on investment (ROI) refers to comparing the training’s monetary benefits with the cost of the training. ROI is often referred to as level 5 evaluation (see Table 6.2). Training costs can be direct and indirect. Direct costs include salaries and benefits for all employees involved in training, including trainees, instructors, consultants, and employees who design the program; program material and supplies; equipment or classroom rentals or purchases; and travel costs. Indirect costs are not related directly to the design, development, or delivery of the training program. They include general office supplies, facilities, equipment, and related expenses; travel and expenses not directly billed to one program; training department management and staff salaries not related to any one program; and administrative and staff support salaries. Benefits are the value that the company gains from the training program.
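The ROI arithmetic itself is straightforward: subtract total training costs (direct plus indirect) from the monetary benefits, then divide by total costs. A minimal sketch, using hypothetical figures rather than numbers from the text:

```python
# Illustrative ROI calculation; all dollar figures are hypothetical.
direct_costs = 40_000    # trainee/instructor salaries, materials, equipment, travel
indirect_costs = 10_000  # office supplies, facilities, administrative support
benefits = 95_000        # e.g., value of reduced turnover and equipment downtime

total_costs = direct_costs + indirect_costs
net_benefit = benefits - total_costs
roi_percent = net_benefit / total_costs * 100

print(f"ROI = {roi_percent:.0f}%")  # prints "ROI = 90%"
```

A positive percentage means the program returned more value than it cost; the calculation is only as credible as the benefit estimates behind it.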

The Northwest Airlines technical operations training department includes 72 instructors who are responsible for training thousands of aircraft technicians and more than 10,000 outside vendors who work on maintaining the Northwest aircraft fleet. Each of the training instructors works with one type of aircraft, such as the Airbus 320. Most of the department’s training is instructor-led in a classroom, but other instruction programs use a simulator or take place in an actual airplane.

By tracking department training data, which allowed for training evaluation, the technical operations department was able to demonstrate its worth by showing how its services contribute to the airline’s business. For example, the technical operations department reduced the cost of training an individual technician by 16 percent; increased customer satisfaction through training; increased training productivity; made the case for upper management to provide financial resources for training; and improved postcourse evaluations, knowledge, and performance gains.

To achieve these results, the technical operations training department developed the Training Quality Index (TQI). The TQI is a computer application that collects data about training department performance, productivity, budget, and courses and allows for detailed analysis of the data. TQI tracks all department training data into five categories: effectiveness, quantity, perceptions, financial impact, and operational impact. The quality of training is included under the effectiveness category. For example, knowledge gain relates to the difference in trainees’ pretraining and posttraining knowledge measured by exams. The system can provide performance reports that relate to budgets and the cost of training per student per day and other costs of training. The measures that are collected are also linked to department goals, to department strategy, and ultimately, to Northwest Airlines’ overall strategy. Questions that were often asked before TQI was developed but couldn’t easily be answered—such as how can the cost of training be justified, what is the operational impact of training, and what amount of training have technicians received—can now be answered through the TQI system. Training demand can be compared against passenger loads and the number of flying routes to determine the right number of trainers in the right locations to support business needs. These adjustments increase customer satisfaction and result in positive views of the training operations.



Criteria relevance refers to the extent to which training outcomes are related to the learned capabilities emphasized in the training program. The learned capabilities required to succeed in the training program should be the same as those required to be successful on the job. The outcomes collected in training should be as similar as possible to what trainees learned in the program. That is, the outcomes need to be valid measures of learning. One way to ensure the relevancy of the outcomes is to choose outcomes based on the learning objectives for the program. The learning objectives show the expected action, the conditions under which the trainee is to perform, and the level or standard of performance.

Figure 6.2 shows two ways that training outcomes may lack relevance. Criterion contamination refers to the extent that training outcomes measure inappropriate capabilities or are affected by extraneous conditions. For example, if managers’ evaluations of job performance are used as a training outcome, trainees may receive higher ratings of job performance simply because the managers know they attended the training program, believe the program is valuable, and therefore give high ratings to ensure that the training looks like it positively affects performance. Criteria may also be contaminated if the conditions under which the outcomes are measured vary from the learning environment. That is, trainees may be asked to perform their learned capabilities using equipment, time constraints, or physical working conditions that are not similar to those in the learning environment.

For example, trainees may be asked to demonstrate spreadsheet skills using a newer version of spreadsheet software than they used in the training program. This demonstration likely will result in no changes in their spreadsheet skills from pretraining levels. In this case, poor-quality training is not the cause for the lack of change in their spreadsheet skills. Trainees may have learned the necessary spreadsheet skills, but the environment for the evaluation differs substantially from the learning environment, so no change in skill level is observed.

Criteria may also be deficient. Criterion deficiency refers to the failure to measure training outcomes that were emphasized in the training objectives. For example, the objectives of a spreadsheet skills training program emphasize that trainees both understand the commands available on the spreadsheet (e.g., compute) and use the spreadsheet to calculate statistics using a data set. An evaluation design that uses only learning outcomes such as a test of knowledge of the purpose of keystrokes is deficient, because the evaluation does not measure outcomes that were included in the training objectives (e.g., using the spreadsheet to compute the mean and standard deviation of a set of data).



Reliability refers to the degree to which outcomes can be measured consistently over time. For example, a trainer gives restaurant employees a written test measuring knowledge of safety standards to evaluate a safety training program they attended. The test is given before (pretraining) and after (posttraining) employees attend the program. A reliable test includes items for which the meaning or interpretation does not change over time. A reliable test allows the trainer to have confidence that any improvements in posttraining test scores from pretraining levels are the result of learning that occurred in the training program, not test characteristics (e.g., items are more understandable the second time) or the test environment (e.g., trainees performed better on the posttraining test because the classroom was more comfortable and quieter).


Discrimination refers to the degree to which trainees’ performance on the outcome actually reflects true differences in performance. For example, a paper-and-pencil test that measures electricians’ knowledge of electrical principles must detect true differences in trainees’ knowledge of electrical principles. That is, the test should discriminate on the basis of trainees’ knowledge of electrical principles. (People who score high on the test have a better understanding of the principles of electricity than do those who score low.)


Practicality refers to the ease with which the outcome measures can be collected. One reason companies give for not including learning, performance, and behavior outcomes in their evaluation of training programs is that collecting them is too burdensome. (It takes too much time and energy, which detracts from the business.) For example, in evaluating a sales training program, it may be impractical to ask customers to rate the salesperson’s behavior because this would place too much of a time commitment on the customer (and probably damage future sales relationships).


From our discussion of evaluation outcomes and evaluation practices, you may have the mistaken impression that it is necessary to collect all five levels of outcomes to evaluate a training program. While collecting all five levels of outcomes is ideal, the training program objectives and their link to the broader business strategy determine which outcomes should be collected. To ensure adequate training evaluation, companies should collect outcome measures related to both learning (levels 1 and 2) and transfer of training (levels 3, 4, or 5).

It is important to recognize the limitations of choosing to measure only reaction and cognitive outcomes. Consider the previous discussions of learning and transfer of training. Remember that for training to be successful, learning and transfer of training must occur. Figure 6.4 shows the multiple objectives of training programs and their implication for choosing evaluation outcomes. Training programs usually have objectives related to both learning and transfer. That is, they want trainees to acquire knowledge and cognitive skill and also to demonstrate the use of the knowledge or strategy in their on-the-job behavior. As a result, to ensure an adequate training evaluation, companies must collect outcome measures related to both learning and transfer.

Ernst & Young’s training function uses knowledge testing (level 2) for all of the company’s e-learning courses, which account for 50 percent of training. New courses and programs use behavior transfer (level 3) and business results (level 4). Regardless of the program, the company’s leaders are interested in whether the trainees feel that training has been a good use of their time and money, and whether they would recommend it to other employees (level 1). The training function automatically tracks these outcomes. Managers use training and development to encourage observable behavior changes in employees that will result in business results such as client satisfaction and lower turnover, which they also monitor.

Note that outcome measures are not perfectly related to each other. That is, it is tempting to assume that satisfied trainees learn more and will apply their knowledge and skill to the job, resulting in behavior change and positive results for the company. However, research indicates that the relationships among reaction, cognitive, behavior, and results outcomes are small.

Which training outcomes measure is best? The answer depends on the training objectives. For example, if the instructional objectives identified business-related outcomes such as increased customer service or product quality, then results outcomes should be included in the evaluation. As Figure 6.4 shows, both reaction and cognitive outcomes may affect learning. Reaction outcomes provide information regarding the extent to which the trainer, facilities, or learning environment may have hindered learning. Learning or cognitive outcomes directly measure the extent to which trainees have mastered training content. However, reaction and cognitive outcomes do not help determine how much trainees actually use the training content in their jobs. As much as possible, evaluation should include behavior or skill-based, affective, or results outcomes to determine the extent to which transfer of training has occurred—that is, whether training has influenced a change in behavior, skill, or attitude or has directly influenced objective measures related to company effectiveness (e.g., sales).

How long after training should outcomes be collected? There is no accepted standard for when the different training outcomes should be collected. In most cases, reactions are measured immediately after training. Learning, behavior, and results should be measured after sufficient time has elapsed to determine whether training has had an influence on these outcomes. Positive transfer of training is demonstrated when learning occurs and positive changes in skill-based, affective, or results outcomes are also observed. No transfer of training is demonstrated if learning occurs but no changes are observed in skill-based, affective, or results outcomes. Negative transfer is evident when learning occurs but skills, affective outcomes, or results are less than at pretraining levels. Results of evaluation studies that find no transfer or negative transfer suggest that the trainer and the manager need to investigate whether a good learning environment (e.g., opportunities for feedback and practice) was provided in the training program, trainees were motivated and able to learn, and the needs assessment correctly identified training needs.



Table 6.7 presents threats to validity of an evaluation. Threats to validity refer to factors that will lead an evaluator to question either (1) the believability of the study results or (2) the extent to which the evaluation results are generalizable to other groups of trainees and situations. The believability of study results refers to internal validity. The internal threats to validity relate to characteristics of the company (history), the outcome measures (instrumentation, testing), and the persons in the evaluation study (maturation, regression toward the mean, mortality, initial group differences). These characteristics can cause the evaluator to reach the wrong conclusions about training effectiveness. An evaluation study needs internal validity to provide confidence that the results of the evaluation (particularly if they are positive) are due to the training program and not to another factor. For example, consider a group of managers who have attended a communication skills training program. At the same time that they attend the program, it is announced that the company will be restructured. After the program, the managers may become better communicators simply because they are scared that otherwise they will lose their jobs. Perhaps no learning actually occurred in the training program!

Trainers are also interested in the generalizability of the study results to other groups and situations (i.e., they are interested in the external validity of the study). As shown in Table 6.7, threats to external validity relate to how study participants react to being included in the study and the effects of multiple types of training. Because evaluation usually does not involve all employees who have completed a program (or who may take training in the future), trainers want to be able to say that the training program will be effective in the future with similar groups.

Methods to Control for Threats to Validity

Because trainers often want to use evaluation study results as a basis for changing training programs or demonstrating that training does work (as a means to gain additional funding for training from those who control the training budget), it is important to minimize the threats to validity. There are three ways to minimize threats to validity: the use of pretests and posttests in evaluation designs, comparison groups, and random assignment.

Pretests and Posttests One way to improve the internal validity of the study results is to first establish a baseline or pretraining measure of the outcome. Another measure of the outcomes can be taken after training. This is referred to as a posttraining measure. A comparison of the posttraining and pretraining measures can indicate the degree to which trainees have changed as a result of training.

Use of Comparison Groups Internal validity can be improved by using a control or comparison group. A comparison group refers to a group of employees who participate in the evaluation study but do not attend the training program. The comparison employees should have personal characteristics (e.g., gender, education, age, tenure, skill level) as similar to those of the trainees as possible. Use of a comparison group in training evaluation helps to rule out the possibility that changes found in the outcome measures are due to factors other than training. The Hawthorne effect refers to employees in an evaluation study performing at a high level simply because of the attention they are receiving. Use of a comparison group helps to show that any effects observed are due specifically to the training rather than the attention the trainees are receiving. Use of a comparison group helps to control for the effects of history, testing, instrumentation, and maturation because both the comparison group and the training group are treated similarly, receive the same measures, and have the same amount of time to develop.

For example, consider an evaluation of a safety training program. Safe behaviors are measured before and after safety training for both trainees and a comparison group. If the level of safe behavior improves for the training group from pretraining levels but remains relatively the same for the comparison group at both pretraining and posttraining, the reasonable conclusion is that the observed differences in safe behaviors are due to the training and not some other factor, such as the attention given to both the trainees and the comparison group by asking them to participate in the study.
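The logic of this safety example reduces to a simple difference-in-changes comparison. A sketch with hypothetical percentages of safe behaviors (none of these numbers appear in the text):

```python
# Hypothetical percentages of safe work behaviors, before and after training.
trainee_pre, trainee_post = 60.0, 85.0
comparison_pre, comparison_post = 61.0, 62.0

trainee_change = trainee_post - trainee_pre           # 25.0 points
comparison_change = comparison_post - comparison_pre  # 1.0 point

# The effect attributable to training is the trainees' improvement beyond
# whatever change the untrained comparison group also showed (e.g., from
# the attention both groups received by participating in the study).
training_effect = trainee_change - comparison_change
print(training_effect)  # prints 24.0
```

If the comparison group had improved just as much as the trainees, the training effect would be near zero, and some other factor would be the likely explanation.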

Random Assignment Random assignment refers to assigning employees to the training or comparison group on the basis of chance. That is, employees are assigned to the training program without consideration of individual differences (ability, motivation) or prior experiences. Random assignment helps to ensure that trainees are similar in individual differences such as age, gender, ability, and motivation. Because it is often impossible to identify and measure all the individual characteristics that might influence the outcome measures, random assignment ensures that these characteristics are equally distributed in the comparison group and the training group. Random assignment helps to reduce the effects of employees dropping out of the study (mortality) and differences between the training group and comparison group in ability, knowledge, skill, or other personal characteristics.
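Mechanically, random assignment amounts to shuffling the candidate pool and splitting it, so that no individual characteristic influences which group an employee lands in. A minimal sketch (the function name and seed handling are illustrative):

```python
import random

def randomly_assign(employees, seed=None):
    """Split employees into training and comparison groups purely by chance,
    so measured and unmeasured characteristics are evenly distributed on average."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    pool = list(employees)
    rng.shuffle(pool)
    half = len(pool) // 2
    return pool[:half], pool[half:]

training_group, comparison_group = randomly_assign(range(20), seed=7)
```

Every employee appears in exactly one group, and nothing about ability, motivation, or prior experience affects the placement.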

Keep in mind that random assignment is often impractical. Companies want to train employees who need training. Also, companies may be unwilling to provide a comparison group. One solution to this problem is to identify the factors in which the training and comparison groups differ and control for these factors in the analysis of the data (a statistical procedure known as analysis of covariance). Another method is to determine trainees’ characteristics after they are assigned and ensure that the comparison group includes employees with similar characteristics.



A number of different designs can be used to evaluate training programs. Table 6.8 compares each design on the basis of who is involved (trainees, comparison group), when measures are collected (pretraining, posttraining), the costs, the time it takes to conduct the evaluation, and the strength of the design for ruling out alternative explanations for the results. As shown in Table 6.8, research designs vary based on whether they include pretraining and posttraining measurement of outcomes and a comparison group. In general, designs that use pretraining and posttraining measures of outcomes and include a comparison group reduce the risk that alternative factors (other than the training itself) are responsible for the results of the evaluation. This increases the trainer’s confidence in using the results to make decisions. Of course, the trade-off is that evaluations using these designs are more costly and take more time to conduct than do evaluations not using pretraining and posttraining measures or comparison groups.

Posttest Only

The posttest-only design refers to an evaluation design in which only posttraining outcomes are collected. This design can be strengthened by adding a comparison group (which helps to rule out alternative explanations for changes). The posttest-only design is appropriate when trainees (and the comparison group, if one is used) can be expected to have similar levels of knowledge, behavior, or results outcomes (e.g., same number of sales, equal awareness of how to close a sale) prior to training.

Consider the evaluation design that Mayo Clinic used to compare two methods for delivering new manager training. Mayo Clinic is one of the world’s leading centers of medical education and research. Recently, Mayo has undergone considerable growth because a new hospital and clinic have been added in the Phoenix area (Mayo Clinic is also located in Rochester, Minnesota). In the process, employees who were not fully prepared were moved into management positions, which resulted in increased employee dissatisfaction and employee turnover rates. After a needs assessment indicated that employees were leaving because of dissatisfaction with management, Mayo decided to initiate a new training program designed to help the new managers improve their skills. There was some debate whether the training would be best administered in a classroom or one-on-one with a coach. Because of the cost implications of using coaching instead of classroom training (the costs were higher for coaching), Mayo decided to conduct an evaluation using a posttest comparison group design. Before training all managers, Mayo held three training sessions. No more than 75 managers were included in each session. Within each session managers were divided into three groups: a group that received four days of classroom training, a group that received one-on-one training from a coach, and a group that received no training (a comparison group). Mayo collected reaction (did the trainees like the program?), learning, transfer, and results outcomes. The evaluation found no statistically significant differences in the effects of the coaching compared to classroom and no training. As a result, Mayo decided to rely on classroom courses for new managers and to consider coaching only for managers with critical and immediate job issues.


Pretest/Posttest

The pretest/posttest refers to an evaluation design in which both pretraining and posttraining outcome measures are collected. There is no comparison group. The lack of a comparison group makes it difficult to rule out the effects of business conditions or other factors as explanations for changes. This design is often used by companies that want to evaluate a training program but are uncomfortable with excluding certain employees or that only intend to train a small group of employees.

Pretest/Posttest with Comparison Group

The pretest/posttest with comparison group refers to an evaluation design that includes trainees and a comparison group. Pretraining and posttraining outcome measures are collected from both groups. If improvement is greater for the training group than the comparison group, this finding provides evidence that training is responsible for the change. This type of design controls for most of the threats to validity.

Table 6.9 presents an example of a pretest/posttest comparison group design. This evaluation involved determining the relationship between three conditions or treatments and learning, satisfaction, and use of computer skills. The three conditions or treatments (types of computer training) were behavior modeling, self-paced study, and lecture. A comparison group was also included in the study. Behavior modeling involved watching a video showing a model performing key behaviors necessary to complete a task. In this case, the task involved performing procedures on the computer.

Forty trainees were included in each condition. Measures of learning included a test consisting of 11 items designed to measure information that trainees needed to know to operate the computer system (e.g., “Does formatting destroy all data on the disk?”). Also, trainees’ comprehension of computer procedures (procedural comprehension) was measured by presenting trainees with scenarios on the computer screens and asking them what would appear next on the screen. Use of computer skills (skill-based learning outcome) was measured by asking trainees to complete six computer tasks (e.g., changing directories). Satisfaction with the program (reaction) was measured by six items (e.g., “I would recommend this program to others”).

As shown in Table 6.9, measures of learning and skills were collected from the trainees prior to attending the program (pretraining). Measures of learning and skills were also collected immediately after training (posttraining time 1) and four weeks after training (posttraining time 2). The satisfaction measure was collected immediately following training.

The posttraining time 2 measures collected in this study help to determine the occurrence of training transfer and retention of the information and skills. That is, immediately following training, trainees may have appeared to learn and acquire skills related to computer training. Collection of the posttraining measures four weeks after training provides information about trainees’ level of retention of the skills and knowledge.

Statistical procedures known as analysis of variance and analysis of covariance were used to test for differences between pretraining measures and posttraining measures for each condition. Also, differences between each of the training conditions and the comparison group were analyzed. These procedures determine whether differences between the groups are large enough to conclude with a high degree of confidence that the differences were caused by training rather than by chance fluctuations in trainees’ scores on the measures.
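The logic behind these statistical comparisons is that the trained group's pre-to-post gain, not its posttraining score alone, is contrasted with the comparison group's gain. A minimal sketch of that logic in Python follows; the scores are hypothetical, and a real evaluation would use ANOVA or ANCOVA from a statistics package rather than simple mean differences:

```python
# Hypothetical pretest/posttest scores on the 11-item knowledge test
# for one training condition and the no-training comparison group.
behavior_modeling = {"pre": [4, 5, 3, 6, 4], "post": [9, 10, 8, 10, 9]}
comparison        = {"pre": [4, 4, 5, 5, 3], "post": [5, 4, 6, 5, 4]}

def mean(xs):
    return sum(xs) / len(xs)

def gain(group):
    # Average pre-to-post change for one condition.
    return mean(group["post"]) - mean(group["pre"])

# If the trained group's gain clearly exceeds the comparison group's gain,
# that difference (rather than the posttest score alone) is the evidence
# that training, not chance or outside events, produced the change.
training_effect = gain(behavior_modeling) - gain(comparison)
print(round(training_effect, 2))
```

An ANOVA or ANCOVA adds what this sketch omits: a formal test of whether a difference this large could plausibly arise from chance fluctuations in trainees' scores.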

Time Series

Time series refers to an evaluation design in which training outcomes are collected at periodic intervals both before and after training. (In the other evaluation designs discussed here, training outcomes are collected only once after training and, at most, once before training.) The strength of this design can be improved by using reversal, which refers to a time period in which participants no longer receive the training intervention. A comparison group can also be used with a time series design. One advantage of the time series design is that it allows an analysis of the stability of training outcomes over time. Another advantage is that using both a reversal and a comparison group helps to rule out alternative explanations for the evaluation results. The time series design is frequently used to evaluate training programs that focus on improving readily observable outcomes (such as accident rates, productivity, and absenteeism) that vary over time.

Table 6.10 shows a time series design that was used to evaluate how much a training program improved the number of safe work behaviors in a food manufacturing plant. This plant was experiencing an accident rate similar to that of the mining industry, the most dangerous area of work. Employees were engaging in unsafe behaviors such as putting their hands into conveyors to unjam them (resulting in crushed limbs).

To improve safety, the company developed a training program that taught employees safe behaviors, provided them with incentives for safe behaviors, and encouraged them to monitor their own behavior. To evaluate the program, the design included a comparison group (the Makeup Department) and a trained group (the Wrapping Department). The Makeup Department is responsible for measuring and mixing ingredients, preparing the dough, placing the dough in the oven and removing it when it is cooked, and packaging the finished product. The Wrapping Department is responsible for bagging, sealing, and labeling the packaging and stacking it on skids for shipping. Outcomes included observations of safe work behaviors. These observations were taken over a 25-week period.

The baseline shows the percentage of safe acts prior to the introduction of the safety training program. Training directed at increasing the number of safe behaviors was introduced after approximately five weeks (20 observation sessions) in the Wrapping Department and 10 weeks (50 observation sessions) in the Makeup Department. As shown, the number of safe acts observed varied across the observation period for both groups. However, after the training program was conducted, the number of safe behaviors increased in the trained group (the Wrapping Department) and remained stable across the intervention period. When the Makeup Department received training (at 10 weeks, or after 50 observations), a similar increase in the percentage of safe behaviors was observed. Training was then withdrawn from both departments after approximately 62 observation sessions, and the withdrawal resulted in a reduction of the percentage of work incidents performed safely to pretraining levels.
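The within-group logic of a time series with reversal amounts to comparing the mean outcome level across phases. A short sketch, using hypothetical percentage-of-safe-acts observations rather than the study's actual data:

```python
# Hypothetical weekly percentages of safely performed acts in one department.
baseline     = [70, 72, 68, 71, 69]   # before training
intervention = [88, 90, 91, 89, 92]   # training in effect
reversal     = [72, 70, 71, 73, 69]   # training withdrawn

def mean(xs):
    return sum(xs) / len(xs)

# A rise during the intervention followed by a return toward baseline after
# withdrawal is the pattern that rules out most alternative explanations:
# an outside cause would not be expected to switch on and off with training.
print(mean(baseline), mean(intervention), mean(reversal))
```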

Solomon Four-Group

The Solomon four-group design combines the pretest/posttest comparison group and the posttest-only control group design. In the Solomon four-group design, a training group and a comparison group are measured on the outcomes both before and after training. Another training group and control group are measured only after training. This design controls for most threats to internal and external validity.

An application of the Solomon four-group design is shown in Table 6.11. This design was used to compare the effects of training based on integrative learning (IL) with traditional (lecture-based) training in manufacturing resource planning. Manufacturing resource planning is a method for effectively planning, coordinating, and integrating the use of all resources of a manufacturing company. The IL-based training differed from the traditional training in several ways. IL-based training sessions began with a series of activities intended to create a relaxed, positive environment for learning. The students were asked what manufacturing resource planning meant to them, and attempts were made to reaffirm their beliefs and unite the trainees around a common understanding of manufacturing resource planning. Students presented training material and participated in group discussions, games, stories, and poetry related to the manufacturing processes.

Because the company was interested in the effects of IL relative to traditional training, groups who received traditional training were used as the comparison group (rather than groups who received no training). A test of manufacturing resource planning (knowledge test) and a reaction measure were used as outcomes. The study found that participants in the IL-based learning groups learned slightly less than participants in the traditional training groups. However, IL-group participants had much more positive reactions than did those in the traditional training program.



There is no one appropriate evaluation design. An evaluation design should be chosen based on an evaluation of the factors shown in Table 6.12. There are several reasons why no evaluation or a less rigorous evaluation design may be more appropriate than a more rigorous design that includes a comparison group, random assignment, or pretraining and posttraining measures. First, managers and trainers may be unwilling to devote the time and effort necessary to collect training outcomes. Second, managers or trainers may lack the expertise to conduct an evaluation study. Third, a company may view training as an investment from which it expects to receive little or no return. A more rigorous evaluation design (pretest/posttest with comparison group) should be considered if any of the following conditions are true:

  1. The evaluation results can be used to change the program. 
  2. The training program is ongoing and has the potential to have an important influence on employees or customers.
  3. The training program involves multiple classes and a large number of trainees.
  4. Cost justification for training is based on numerical indicators. (Here the company has a strong orientation toward evaluation.)
  5. Trainers or others in the company have the expertise (or the budget to purchase expertise from outside the company) to design and evaluate the data collected from an evaluation study.
  6. The cost of the training creates a need to show that it works.
  7. There is sufficient time for conducting an evaluation. Here, information regarding training effectiveness is not needed immediately.
  8. There is interest in measuring change (in knowledge, behavior, skill, etc.) from pretraining levels or in comparing two or more different programs.

For example, if the company is interested in determining how much employees’ communications skills have changed as a result of a training program, a pretest/posttest comparison group design is necessary. Trainees should be randomly assigned to training and no-training conditions. These evaluation design features offer a high degree of confidence that any communication skill change is the result of participation in the training program. This type of evaluation design is also necessary if the company wants to compare the effectiveness of two training programs.

Evaluation designs without pretest or comparison groups are most appropriate in situations in which the company is interested in identifying whether a specific level of performance has been achieved. (For example, are employees who participated in training able to adequately communicate their ideas?) In these situations, companies are not interested in determining how much change has occurred but rather in whether the trainees have achieved a certain proficiency level.

One company’s evaluation strategy for a training course delivered to the company’s tax professionals shows how company norms regarding evaluation and the purpose of training influence the type of evaluation design chosen. This accounting firm views training as an effective method for developing human resources. Training is expected to provide a good return on investment. The company used a combination of affective, cognitive, behavior, and results criteria to evaluate a five-week course designed to prepare tax professionals to understand state and local tax law. The course involved two weeks of self-study and three weeks of classroom work. A pretest/posttest comparison design was used. Before they took the course, trainees were tested to determine their knowledge of state and local tax laws, and they completed a survey designed to assess their self-confidence in preparing accurate tax returns. The evaluators also identified the trainees’ (accountants’) billable hours related to calculating state and local tax returns and the revenue generated by the activity. After the course, evaluators again identified billable hours and surveyed trainees’ self-confidence. The results of the evaluation indicated that the accountants were spending more time doing state and local tax work than before training. Also, the trained accountants produced more revenue doing state and local tax work than did accountants who had not yet received the training (comparison group). There was also a significant improvement in the accountants’ confidence following training, and they were more willing to promote their expertise in state and local tax preparation. Finally, after 15 months, the revenue gained by the company more than offset the cost of training. On average, the increase in revenue for the trained tax accountants was more than 10 percent.



One method for comparing costs of alternative training programs is the resource requirements model. The resource requirements model compares equipment, facilities, personnel, and materials costs across different stages of the training process (needs assessment, development, training design, implementation, and evaluation). Use of the resource requirements model can help determine overall differences in costs among training programs. Also, costs incurred at different stages of the training process can be compared across programs.

Accounting can also be used to calculate costs. Seven categories of cost sources are costs related to: program development or purchase, instructional materials for trainers and trainees, equipment and hardware, facilities, travel and lodging, salary of trainer and support staff, and the cost of lost productivity while trainees attend the program (or the cost of temporary employees who replace the trainees while they are at training). This method also identifies when the costs are incurred. One-time costs include those related to needs assessment and program development. Costs per offering relate to training site rental fees, trainer salaries, and other costs that are realized every time the program is offered. Costs per trainee include meals, materials, and lost productivity or expenses incurred to replace the trainees while they attend training.
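The accounting approach above lends itself to a simple spreadsheet-style total: one-time costs, plus per-offering costs times the number of offerings, plus per-trainee costs times the number of trainees. A sketch of that arithmetic follows; all dollar figures and cost-item names are hypothetical:

```python
# Hypothetical cost figures, grouped by when the cost is incurred.
one_time = {"needs assessment": 3_000, "program development": 12_000}
per_offering = {"site rental": 800, "trainer salary": 1_500}
per_trainee = {"meals": 60, "materials": 40, "lost productivity": 400}

def total_cost(offerings, trainees):
    """Total program cost for a given number of offerings and trainees."""
    return (sum(one_time.values())
            + offerings * sum(per_offering.values())
            + trainees * sum(per_trainee.values()))

# Total cost of running the program four times for 60 trainees overall.
print(total_cost(offerings=4, trainees=60))
```

Separating the categories this way makes it easy to see how cost per trainee falls as a fixed development cost is spread over more offerings and participants.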


To identify the potential benefits of training, the company must review the original reasons that the training was conducted. For example, training may have been conducted to reduce production costs or overtime costs or to increase the amount of repeat business. A number of methods may be helpful in identifying the benefits of training:

  1. Technical, academic, and practitioner literature summarizes the benefits that have been shown to relate to a specific training program.
  2. Pilot training programs assess the benefits from a small group of trainees before a company commits more resources.
  3. Observation of successful job performers helps a company determine what successful job performers do differently from unsuccessful job performers.
  4. Trainees and their managers provide estimates of training benefits.

For example, a training and development consultant at Apple Computer was concerned with the quality and consistency of the training program used in assembly operations. She wanted to show that training was not only effective but also resulted in financial benefits. To do this, the consultant chose an evaluation design that involved two separately trained groups—each consisting of 27 employees—and two untrained groups (comparison groups). The consultant collected a pretraining history of what was happening on the production line in each outcome she was measuring (productivity, quality, and labor efficiency). She determined the effectiveness of training by comparing performance between the comparison and training groups for two months after training. The consultant was able to show that the untrained comparison group had 2,000 more minutes of downtime than the trained group did. This finding meant that the trained employees built and shipped more products to customers—showing definitively that training was contributing to Apple’s business objectives.

To conduct a cost-benefit analysis, the consultant had each employee in the training group estimate the effect of a behavior change on a specific business measure (e.g., breaking down tasks will improve productivity or efficiency). The trainees assigned a confidence percentage to the estimates. To get a cost-benefit estimate for each group of trainees, the consultant multiplied the monthly cost-benefit by the confidence level and divided by the number of trainees. For example, one group of 20 trainees estimated a total overall monthly cost-benefit of $336,000 related to business improvements and reported an average confidence level of 70 percent in that estimate. Seventy percent multiplied by $336,000 gave a cost-benefit of $235,200. This number was divided by 20 ($235,200/20 trainees) to give an average estimated cost-benefit for each of the 20 trainees ($11,760). To calculate return on investment, follow these steps:

  1. Identify outcomes (e.g., quality, accidents).
  2. Place a value on the outcomes.
  3. Determine the change in performance after eliminating other potential influences on training results.
  4. Obtain an annual amount of benefits (operational results) from training by comparing results after training to results before training (in dollars).
  5. Determine the training costs (direct costs + indirect costs + development costs + overhead costs + compensation for trainees).
  6. Calculate the total savings by subtracting the training costs from benefits (operational results).
  7. Calculate the ROI by dividing benefits (operational results) by costs. The ROI gives an estimate of the dollar return expected from each dollar invested in training.
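The confidence-weighted estimate and the ROI steps above can be collapsed into a short calculation. The sketch below reuses the figures from the consultant's example ($336,000 estimated monthly benefit, 70 percent confidence, 20 trainees); the annual benefit and cost figures in the ROI portion are hypothetical stand-ins:

```python
# Confidence-weighted benefit estimate (figures from the example above).
monthly_benefit = 336_000
confidence = 0.70            # trainees' average confidence in their estimate
trainees = 20

adjusted_benefit = monthly_benefit * confidence     # about $235,200
per_trainee_benefit = adjusted_benefit / trainees   # about $11,760 each

# ROI steps 5-7, with hypothetical annual figures.
annual_benefits = 280_000    # operational results, in dollars (step 4)
training_costs = 40_000      # direct + indirect + development + overhead
                             # + compensation for trainees (step 5)

total_savings = annual_benefits - training_costs    # step 6
roi = annual_benefits / training_costs              # step 7: dollars returned
                                                    # per dollar invested
print(per_trainee_benefit, total_savings, roi)
```

With these hypothetical figures, each training dollar returns seven dollars in benefits, which matches the order of magnitude reported in the wood plant example that follows.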

A cost-benefit analysis is best explained by an example. A wood plant produced panels that contractors used as building materials. The plant employed 300 workers, 48 supervisors, 7 shift superintendents, and a plant manager. The business had three problems. First, 2 percent of the wood panels produced each day were rejected because of poor quality. Second, the production area was experiencing poor housekeeping, such as improperly stacked finished panels that would fall on employees. Third, the number of preventable accidents was higher than the industry average. To correct these problems, the supervisors, shift superintendents, and plant manager attended training in (1) performance management and interpersonal skills related to quality problems and poor work habits of employees and (2) rewarding employees for performance improvement. Training was conducted in a hotel close to the plant. The training program was a purchased videotape, and the instructor for the program was a consultant. Table 6.13 shows each type of cost and how it was determined.

The benefits of the training were identified by considering the objectives of the training program and the type of outcomes the program was to influence. These outcomes included the quality of panels, housekeeping in the production area, and the accident rate. Table 6.14 shows how the benefits of the program were calculated. Once the costs and benefits of the program are determined, ROI is calculated by dividing return or benefits by costs. In this example, ROI was 6.7. That is, every dollar invested in the program returned approximately seven dollars in benefits. How can the company determine if the ROI is acceptable? One way is for managers and trainers to agree on what level of ROI is acceptable. Another method is to use the ROI that other companies obtain from similar types of training. Table 6.15 provides examples of ROI obtained from several types of training programs.

Recall the discussion of the new manager training program at Mayo Clinic. To determine Mayo’s return on investment, the human resource department calculated that one-third of the 84 employees retained (29 employees) would have left Mayo as a result of dissatisfaction. The department believed that this retention was due to the impact of the training. Mayo’s cost of a single employee turnover was calculated to be 75 percent of average total compensation, or $42,000 per employee. The savings attributed to retaining these employees was calculated to be $609,000. However, the cost of the training program needs to be considered. If the annual cost of the training program ($125,000) is subtracted from the savings, the net savings amount is $484,000. These numbers are based on estimates, but even if the net savings figure were cut in half, the ROI is still over 100 percent. By being able to quantify the benefits delivered by the program, Mayo’s human resource department achieved greater credibility within the company.
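Mayo's net-savings arithmetic can be reproduced directly. The figures come from the example above; expressing the return as net savings divided by program cost is one common way to state an ROI percentage, not necessarily the exact formula Mayo used:

```python
savings_from_retention = 609_000   # estimated value of turnover avoided
program_cost = 125_000             # annual cost of the training program

net_savings = savings_from_retention - program_cost    # $484,000
roi_percent = net_savings / program_cost * 100

# Even with the estimated net savings cut in half, as the text suggests
# for a conservative check, ROI stays above 100 percent.
conservative_roi = (net_savings / 2) / program_cost * 100
print(round(roi_percent), round(conservative_roi))
```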




Other more sophisticated methods are available for determining the dollar value of training. For example, utility analysis is a cost-benefit analysis method that involves assessing the dollar value of training based on estimates of the difference in job performance between trained and untrained employees, the number of individuals trained, the length of time a training program is expected to influence performance, and the variability in job performance in the untrained group of employees. Utility analysis requires the use of a pretest/posttest design with a comparison group to obtain an estimate of the difference in job performance for trained versus untrained employees. Other types of economic analyses evaluate training as it benefits the firm or the government using direct and indirect training costs, government incentives paid for training, wage increases received by trainees as a result of completion of training, tax rates, and discount rates.
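One common formulation of utility analysis is the Brogden-Cronbach-Gleser model: the dollar gain equals the expected duration of the training effect, times the number trained, times the standardized trained-versus-untrained performance difference, times the dollar variability of job performance, minus total training cost. A sketch with hypothetical input values:

```python
def training_utility(n_trained, years, effect_size, sd_dollars, cost_per_trainee):
    """Brogden-Cronbach-Gleser utility estimate (dollar value of training).

    effect_size: standardized difference in job performance between trained
        and untrained employees, obtained from a pretest/posttest design
        with a comparison group.
    sd_dollars: standard deviation of job performance expressed in dollars
        for the untrained group.
    """
    benefit = years * n_trained * effect_size * sd_dollars
    cost = n_trained * cost_per_trainee
    return benefit - cost

# Hypothetical inputs: 50 trainees, a 2-year effect, an effect size of 0.5,
# performance variability of $10,000, and a cost of $1,200 per trainee.
print(training_utility(50, 2, 0.5, 10_000, 1_200))
```

The formula makes the levers explicit: utility grows with the size and durability of the performance difference and with the dollar variability of the job, and shrinks with per-trainee cost.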


As mentioned earlier in the discussion, ROI analysis may not be appropriate for all training programs. Training programs best suited for ROI analysis have clearly identified outcomes, are not one-time events, are highly visible in the company, are strategically focused, and have effects that can be isolated. In the examples of ROI analysis in this chapter, the outcomes were very measurable. That is, in the wood plant example, it was easy to see changes in quality, to count accident rates, and to observe housekeeping behavior. For training programs that focus on soft outcomes (e.g., attitudes, interpersonal skills), it may be more difficult to estimate the value.

Showing the link between training and market share gain or other higher level strategic business outcomes can be very problematic. These outcomes can be influenced by too many other factors not directly related to training (or even under the control of the business), such as competitors’ performance and economic upswings and downturns. Business units may not be collecting the data needed to identify the ROI of training programs on individual performance. Also, the measurement of training can often be very expensive. Verizon Communications employs 240,000 people. The company estimates that it spends approximately $5,000 for an ROI study. Given the large number of training programs the company offers, it is too expensive to conduct an ROI for each program.

Companies are finding that, despite these difficulties, the demand for measuring ROI is still high. As a result, companies are using creative ways to measure the costs and benefits of training. For example, to calculate ROI for a training program designed to cut absenteeism, trainees and their supervisors were asked to estimate the cost of an absence. The values were averaged to obtain an estimate. Cisco Systems tracks how often its partners return to its Web site for additional instruction. A. T. Kearney, a management consulting firm, tracks the success of its training by how much business is generated from past clients. Rather than relying on ROI, Verizon Communications uses training ROE, or return on expectations. Prior to training, the senior managers who are financially accountable for the training program are asked to identify their expectations regarding what the training program should accomplish as well as a cost estimate of the current issue or problem. After training, the senior managers are asked whether their expectations have been met, and they are encouraged to attach a monetary value to those met expectations. The ROE is used as an estimate in an ROI analysis. Verizon Communications continues to conduct ROI analysis for training programs and courses in which objective numbers are available (e.g., sales training) and in which the influence of training can be better isolated (evaluation designs that have comparison groups and that collect pretraining and posttraining outcomes).

Statistical analyses of costs and benefits are desirable; however, they are not the only way to demonstrate that training delivers bottom-line results. Success cases can also be used. Success cases refer to concrete examples of the impact of training that show how learning has led to results that the company finds worthwhile and the managers find credible. Success cases do not attempt to isolate the influence of training but rather to provide evidence that it was useful. For example, Federated Department Stores wanted to show the effectiveness of its Leadership Choice program, which was designed to increase sales and profit performance in Federated stores by helping leaders enhance personal effectiveness, develop a customer-focused plan to improve business performance, and recognize the impact of decisions on business results. At the conclusion of the program, participants were asked to set a business and leadership goal to work on in the next three months. Because the managers in the program had a wide range of responsibilities in Federated’s various divisions, it was necessary to show that the program produced results for different managers with different responsibilities. As a result, concrete examples were used to illustrate the value of the program to management. Recall from Table 6.12 that a company may want to consider its organizational culture when choosing an evaluation design. Because storytelling is an important part of Federated’s culture, using success cases was an acceptable way to show senior managers the impact of the leadership program.

Participants who reported good progress toward their goals were interviewed, and case studies were written to highlight the program’s impact in management’s priority areas, which were categories such as “in-store shopping experience” and “differentiated assortment of goods.” Because the case studies told a retailing story, they communicated the kind of results that the program was designed to achieve. For example, one manager set a goal of increasing sales by designing new neckwear assortments so that the selling floor would be visually exciting. He directed a new associate buyer to take more ownership of the neckwear business, he reviewed future buying strategies, and he visited competitors’ stores. As a result, this manager exceeded his goal of increasing sales 5 percent.