Source: Evaluation Set-Aside Program, Division of Program Coordination, Planning, and Strategic Initiatives (DPCPSI), Office of the Director, NIH. Used by permission.
Archival data (also called secondary data) is information collected prior to the evaluation and for another purpose. An extensive amount of archival data is available from NIH and/or other organizations, some of which may be relevant to the proposed evaluation. Typical archival data sources include program documents, computerized information systems, Web sites, reports and other publications, and CD-ROMs containing census data and other national survey statistics. Commonly used strategies for collecting archival data include document reviews, database extractions, Web site reviews and literature reviews. Using archival data generally requires much less time and expense than collecting new data.
Clinger-Cohen Act Requirements
The Clinger-Cohen Act (CCA) of 1996 (also known as the Information Technology Management Reform Act) is designed to improve the way the Federal Government acquires and manages information technology (IT). The law requires that Federal agencies use performance-based management principles in operating their IT systems, as is done by efficient and profitable businesses. Acquisition, planning and management of IT must be treated as a "capital investment" and tied to agency missions and strategic goals. While the law is complex, all consumers of hardware and software should be aware of the leadership responsibilities of each agency’s chief information officer in implementing this statute.
A codebook documents how raw data will be synthesized, categorized and transformed (usually to numeric values) so that the information gathered can be tabulated and analyzed using statistical tests and/or other standardized procedures.
Comparison measures are measurements against which the performance of a program will be compared. There are three types of comparison measures:
- Measures of the program’s prior performance.
- Measures of a comparable program’s performance or a control group’s performance.
- Recognized standards of performance.
A conceptual framework (also called a logic model) describes, usually in the form of a diagram, how a particular program is intended to work. It typically illustrates how program resources, population characteristics, program activities and external factors (if any) are expected to influence the achievement of a program’s specific process, intermediate and/or long-term goals. A conceptual framework may be simple or elaborate, and it can be developed using either common sense or a specific theory as a foundation.
A cost-benefit measure expresses the relationship between net discounted program costs and net discounted program outcomes, usually as the ratio of two monetary amounts.
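As an illustrative sketch, a benefit-cost ratio might be computed from yearly cost and benefit streams discounted to present value. The cash-flow figures and discount rate below are hypothetical, not drawn from any actual program:

```python
# Sketch: benefit-cost ratio from yearly cost and benefit streams,
# discounted to present value. All figures are hypothetical.

def present_value(cash_flows, rate):
    """Discount a list of yearly amounts (year 0 first) to present value."""
    return sum(amount / (1 + rate) ** year
               for year, amount in enumerate(cash_flows))

costs = [100_000, 40_000, 40_000]   # program costs, years 0-2 (made up)
benefits = [0, 90_000, 120_000]     # monetized outcomes, years 0-2 (made up)
rate = 0.03                         # assumed annual discount rate

ratio = present_value(benefits, rate) / present_value(costs, rate)
print(f"benefit-cost ratio: {ratio:.2f}")
```

A ratio above 1.0 would indicate that discounted benefits exceed discounted costs under the assumed discount rate.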
Cost-Effectiveness Measures
See Efficiency Measures.
Descriptive statistics are used to tabulate, depict and describe collections of data. The data may be either quantitative (e.g., number of scientific papers published) or categorical (e.g., gender, geographic region) in nature. Examples of descriptive statistics include frequency distributions, contingency or cross-tabulation tables, measures of central tendency (e.g., mean, mode, median) and measures of variability (e.g., standard deviation, margin of error). These varied descriptive statistical techniques may be used to reduce a large amount of information into a more manageable (i.e., summarized) form.
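The measures of central tendency and variability named above can be computed with Python's standard-library `statistics` module; the publication counts in this sketch are hypothetical:

```python
# Sketch: common descriptive statistics via the standard library.
import statistics

papers = [2, 3, 3, 5, 7, 8, 12]   # papers published per grantee (made up)

print("mean:  ", statistics.mean(papers))     # central tendency
print("median:", statistics.median(papers))
print("mode:  ", statistics.mode(papers))
print("stdev: ", statistics.stdev(papers))    # variability
```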
Direct Labor Costs
Direct labor costs are expenses directly attributable to the workforce conducting the evaluation, based on the salary levels of the individuals performing the work. The mixture of skills, training and experience needed by the individuals conducting the evaluation is often called the “labor mix.” Total direct labor costs depend on the labor mix as well as the hourly rate and total hours of effort for each labor category. Typically, the first step in constructing an evaluation budget estimate is to predict the mixture of skills needed to perform the project.
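The arithmetic is straightforward: total direct labor is the sum, over the labor mix, of each category's hourly rate times its hours of effort. The categories, rates and hours below are illustrative, not actual labor categories:

```python
# Sketch: total direct labor cost from an assumed labor mix.
# Categories, rates and hours are illustrative.

labor_mix = [
    # (category, hourly rate in dollars, hours of effort)
    ("project director",   95.0, 200),
    ("senior analyst",     70.0, 600),
    ("research assistant", 35.0, 800),
]

total = sum(rate * hours for _, rate, hours in labor_mix)
print(f"total direct labor: ${total:,.2f}")   # → total direct labor: $89,000.00
```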
Effectiveness research seeks to determine to what degree an intervention, such as a specific program, works under average conditions, such as in diverse populations, institutions and clinical practice settings. Many outcome evaluations can be considered examples of effectiveness research.
Efficacy research seeks to understand whether or not an intervention, such as a specific program, works for a targeted outcome when delivered under ideal conditions. “Ideal” is defined as the best possible control of conditions, such as in carefully controlled research settings, and is therefore not characteristic of most evaluations of existing programs.
Efficiency measures are measurements of program performance that focus on the cost (in terms of dollars, FTEs, employee-hours, facilities or other resources) per unit of output or outcome. Efficiency measures are sometimes called cost-effectiveness measures.
Evaluability Assessment
See Feasibility Study.
An evaluation is a formal appraisal of an entity, which involves a systematic investigation of its worth and/or performance. In the case of a program evaluation, the object being investigated is a program (i.e., a set of activities designed to achieve one or more predefined goals). See Program and Program Evaluation.
Expert panels are groups of individuals with expertise in specific areas. Meetings of expert panels usually involve a trained facilitator to stimulate discussion and help the group reach consensus, if possible, on a few major issues. For program evaluations, the issues to be considered by the panel are usually programmatic in nature and may involve making recommendations to program administrators. Expert panel meetings are sometimes audiotaped (with the approval of the panel members).
External factors (sometimes called confounding variables) are conditions or circumstances beyond the control of the program that may influence program success. Examples include other programs with similar goals, unexpected positive events (such as a state budget surplus) and unexpected negative events (such as a natural disaster).
A feasibility study is a systematic assessment of the optimal approach for evaluating a program, including which evaluation designs and data collection strategies can and should be used. It usually includes determining whether conducting an evaluation is appropriate, designing a process evaluation or outcome evaluation for a proposed or existing program and/or determining whether the evaluation can be conducted at a reasonable cost. A feasibility study may serve as a preliminary evaluation aimed at determining the optimal approach for a full-scale outcome evaluation. This type of study is sometimes called an evaluability assessment.
The fee (or profit) in an evaluation budget estimate is the dollar amount over and above allowable costs that is to be paid to the organization responsible for conducting the evaluation. The complexity of the task, the level of risk to the organization performing the work and other factors determine the fee, which is usually presented as a percentage of the total estimated costs.
Focus groups are group interviews in which a trained facilitator asks general questions about one or more topics and encourages the participants to interact and consider each other’s comments. The combined effort of the group may produce a wide range of information, insight and ideas. Focus group sessions are often audiotaped, videotaped and/or observed by others (with the approval of the focus group participants). See Paperwork Reduction Act Requirements.
See Indirect Costs.
A hypothesis is an assumption about the relationship between two or more measurable variables. Inferential statistics may be used to test a specific hypothesis at a predefined significance level, the probability of concluding that a hypothesized effect exists when the observed result actually occurred by chance. Study questions are frequently addressed by testing one or more hypotheses.
Impact Evaluation
See Outcome Evaluation.
Indirect costs are expenses that are difficult to assign to specific project functions. They typically include:
- Fringe benefits for the individuals performing direct labor.
- Overhead costs.
- General and administrative (G&A) expenses.
Fringe benefits may include paid holidays, vacations, sick leave, retirement benefits and Social Security tax funded by the employer. Overhead costs generally include infrastructure expenses associated with the performance of a project, such as building rent and maintenance, utilities and depreciation of equipment. G&A expenses often include the costs of personnel who are indirectly involved with project work (e.g., senior executives, human resource and accounting personnel) and other indirect costs associated with the overall management of an organization (e.g., advertising, marketing, taxes).
Inferential statistics are used to make inferences (i.e., to draw conclusions or to generalize) about the properties of a population by examining a sample of the population. Inferential statistics are essential when assessing the relationships among program variables and testing hypotheses associated with particular study questions. Examples of inferential statistical methods include analysis of variance, regression analysis, correlation analysis, discriminant analysis and analyses using chi-square tests, t-tests and other parametric or nonparametric statistical tests.
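As one minimal sketch of an inferential test, a large-sample two-group comparison can be approximated with a z-test using only the standard library. For small samples or real analyses, a statistics package with proper t-tests would normally be used; the scores below are hypothetical:

```python
# Sketch: large-sample two-sample z-test (normal approximation),
# standard library only. Data are hypothetical.
import math
import statistics

group_a = [4, 5, 6, 5, 7, 6, 5, 6, 4, 5, 6, 7, 5, 6, 5, 4, 6, 5, 7, 6]
group_b = [3, 4, 4, 5, 3, 4, 5, 4, 3, 4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 5]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
se = math.sqrt(statistics.variance(group_a) / len(group_a)
               + statistics.variance(group_b) / len(group_b))
z = (mean_a - mean_b) / se
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value, normal approximation

print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A p-value below the predefined significance level (commonly 0.05) would lead to rejecting the hypothesis that the two groups perform equally.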
Intermediate goals describe specific outcomes the program should achieve in the near term. Examples of intermediate goals include increased publications in peer-reviewed journals, more individuals obtaining doctoral degrees in health-related sciences, development of an instrument for use in research or medicine that meets certain standards and achievement of a specified level of satisfaction reported by scientists using the program. Whether or not intermediate goals have been achieved is commonly asked in outcome evaluations.
Institutional Review Board (IRB) Approval
Studies that involve a potential risk to the rights and welfare of human subjects may require prior approval of the study design, including the method for obtaining informed consent, by an institutional review board (IRB). Unlike clinical research studies, program evaluations usually do not require IRB approval if potential respondents are clearly informed that they may choose not to participate.
Logic Model
See Conceptual Framework.
Long-term goals describe the ultimate outcomes the program is designed to achieve. Examples of long-term goals include discovery of a new treatment for a specific disease, more NIH-sponsored trainees/fellows pursuing biomedical research careers and development of an improved approach for preventing disease or disability. Whether or not long-term goals have been achieved is commonly asked in outcome evaluations.
A needs assessment is a type of program evaluation aimed at systematically determining the nature and extent of the problems that a proposed or existing program should address. It usually includes assessing the needs of stakeholders, developing appropriate program goals and determining how a program should be designed or modified to achieve those goals. A needs assessment is often used as a tool for strategic planning and priority setting.
New data (also called primary data) is information collected specifically for the evaluation. Commonly used strategies for collecting new data include personal interviews, focus groups, expert panels, questionnaires or other data collection instruments (i.e., forms) to be completed, adding evaluation questions to broader surveys (sometimes called omnibus surveys) and structured observations of program processes. Collecting new data generally requires more time and expense than using archival data. See Paperwork Reduction Act Requirements.
OMB Clearance
See Paperwork Reduction Act Requirements.
Other Direct Costs
Other direct costs are expenses directly attributable to the proposed evaluation, excluding direct labor costs. Examples include costs related to consultants, subcontracts, meetings/travel and miscellaneous supplies and services.
An outcome evaluation is a systematic assessment of program accomplishments and effects to determine the extent to which a program is achieving its intermediate and/or long-term goals. Often, this type of evaluation involves a comparison between current program performance and either:
- Prior program performance.
- Current performance of a comparable control or comparison group.
- Recognized standards of performance.
An outcome evaluation may also include an examination of the relationship between program activities and their effects, both intended and unintended, to identify why some program variations or strategies worked better than others. This type of evaluation is sometimes called an impact evaluation.
Outcome measures are measurements of program performance that focus on the intermediate and/or long-term accomplishments and effects of the program.
Output measures are measurements of program performance that focus on the number of program activities conducted or products produced.
Overhead Costs
See Indirect Costs.
Paperwork Reduction Act Requirements
The Paperwork Reduction Act of 1995, administered by the U.S. Office of Management and Budget (OMB), requires all Federal agencies to obtain OMB clearance prior to collecting the same information from 10 or more persons who are not Federal employees, a process that often requires 6 to 9 months.
Performance measures are measurements of program performance during a given time period. There are three types of performance measures:
- Output measures.
- Outcome measures.
- Efficiency measures.
Personal interviews may be conducted via telephone or in person, with one individual or a few individuals, for the purpose of collecting data needed to answer questions on one or more topics. An interview guide (or discussion guide) is usually used by the interviewer to ask specific questions, some of which may be followed by probes for additional information. The interviewer generally summarizes the answers of the respondent(s) either during the interview or immediately afterward. Personal interviews are sometimes audiotaped (with the approval of the respondents).
Pilot tests (sometimes called pretests) are trial runs designed to improve data collection instruments and procedures before the data collection effort is begun. They are usually conducted as part of a feasibility study. Pilot tests typically include:
- Using the data collection instruments to examine a small number of cases (e.g., asking a few individuals to fill out questionnaires, conducting interviews with a few people, completing a few observations, examining a small set of records).
- Reviewing the completed forms for problem areas (e.g., blank responses, misinterpretations).
- Conducting personal or group interviews with the data collectors and/or respondents to discuss their general impressions of the questionnaire and to identify any items that were difficult to understand or problematic.
- Analyzing the pilot data collected to determine the effectiveness of the instruments in gathering the desired information.
- Using the analyses and comments to revise the data collection instruments and procedures.
- Conducting field tests of the data collection instruments and procedures to find out how they work under realistic conditions. Field tests are particularly useful for determining the overall feasibility of the proposed data collection and analysis strategies, making final revisions and estimating the total costs of the study.
Population characteristics are variables that describe differences among the members of the target population, particularly characteristics that may be related to program success. Examples include demographic characteristics (e.g., age, gender, socioeconomic status), measures of health status and characteristics of grant applications.
Privacy Act Requirements
The Privacy Act of 1974 restricts the use and disclosure of personally identifiable information maintained by NIH and other Federal agencies in organized “systems of records.” The Privacy Act also specifies that information collected for one purpose may not be used for another purpose without notifying or obtaining the consent of the subject of the record. In program evaluations, the Privacy Act generally applies when data to be collected and maintained can be linked with a personal identifier (e.g., name, Social Security number, date of birth, patient identifier or a randomly assigned computer number that is linked to a master index of individual identifiers). In cases where the Privacy Act applies, a Privacy Act Notification Statement is required so that potential study participants know:
- The statutory authority for the data collection.
- Whether or not their response is voluntary.
- The consequence, if any, of not providing the information.
- The extent to which confidentiality of the information is protected.
A process evaluation is a systematic assessment of program operations to determine whether a program is being conducted as planned, whether expected output is being produced and/or how program-critical processes can be improved. Often, this type of evaluation involves a comparison between current program operations and either:
- Prior program operations.
- The current operations of a comparable control or comparison group.
- Recognized standards of operations.

It usually includes assessing the extent to which process goals have been achieved.
Process goals describe how the program should operate and what levels of output should be expected. Process goals are often expressed in terms of the number of activities to be conducted, services to be provided, products to be produced or efficiency of program operations to be achieved during a given time period. Examples of process goals include adherence to a pre-established timeline and budget, an increased level of program activities and a reduction in unit costs. Whether or not process goals have been achieved is commonly asked in process evaluations, although it may also be asked in outcome evaluations.
The term “program” is broadly defined as a set of activities to achieve one or more predefined goals (referred to as “program goals”). Examples of programs include national health awareness campaigns, initiatives to enhance the research capacity of academic institutions, grants management programs, training programs for intramural researchers and activities to improve the efficiency and/or effectiveness of NIH operations (e.g., computerized information systems, Web sites).
Program activities are the specific actions, operations, processes or other functions that are essential to the conduct of the program. Examples include initiating pilot projects, holding workshops, reviewing grants, holding media events and providing new incentives to encourage research.
Program evaluations are systematic investigations or studies that involve assessing the worth and/or performance of particular programs. In most cases, the underlying purpose of a program evaluation is to help program administrators improve a program or make other programmatic decisions. There are four types of program evaluations:
- Needs assessments.
- Feasibility studies.
- Process evaluations.
- Outcome evaluations.
Needs assessments and feasibility studies are usually conducted as preliminary studies to improve the design of a more complex process evaluation or outcome evaluation. Experts external to the program often conduct program evaluations, but program managers may also conduct them.
Program goals are the intended effects of a program, as noted in authorizing legislation or other documents written when the program was established. For a program that is not yet established, the proposed program goals should summarize the intended effects of the new program. There are three types of program goals:
- Process goals.
- Intermediate goals.
- Long-term goals.
The extent to which a particular program goal has been achieved should be assessed using one or more performance measures and comparison measures.
Program resources are the funding, human capital (e.g., FTEs), infrastructure and/or other assets allocated to the program or specific program components during a given time period. Examples include employee hours, total square feet of laboratory space and the average amount of dollars budgeted per year for program salaries and wages, consultant services, equipment, supplies, travel costs and other direct costs.
A power analysis is a statistical analysis that determines how large a sample of measurements (e.g., survey respondents or months during which a program’s output is observed) is needed to make valid inferences about the population(s) from which the sample is drawn. For a given sample size, a power analysis can determine how precisely a particular characteristic of one or more populations, such as the difference between program participants and comparison group participants on a performance measure, can be estimated. See Sample Size.
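A related and very common calculation is the sample size needed to estimate a population proportion within a given margin of error. The sketch below uses the classic normal-approximation formula; the confidence level, assumed proportion and margin of error are illustrative choices:

```python
# Sketch: sample size to estimate a population proportion within a
# margin of error at 95% confidence (normal-approximation formula).
import math

z = 1.96   # z-score for 95% confidence
p = 0.5    # assumed proportion (0.5 is the most conservative choice)
e = 0.05   # desired margin of error (plus or minus 5 percentage points)

n = math.ceil(z**2 * p * (1 - p) / e**2)
print(f"required sample size: {n}")   # → required sample size: 385
```

Tightening the margin of error or raising the confidence level increases the required sample size, which is why these choices should be settled before data collection begins.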
Qualitative analyses are used to describe and/or interpret data presented in the form of words rather than numbers. In program evaluations, qualitative analyses are typically used when data are collected from document reviews, expert panels, focus groups, personal interviews and structured observations. Examples include pattern (or thematic) coding, content analysis, triangulation, within-case and cross-case analyses and the use of matrices, chronological models and other displays to explain qualitative findings.
Quality Control Procedures
Quality control procedures refer to the steps taken to improve the reliability and validity of the data collected and analyzed in a program evaluation. The most common quality control procedure is checking the data for inconsistent, unlikely or otherwise erroneous responses. Other commonly used quality control procedures include training and monitoring of individuals handling data (e.g., data collectors, coders, data entry personnel and data analysts), developing written instructions and codebooks and conducting pilot tests of instruments and procedures, inter-rater reliability checks and double data entry.
Quantitative analyses are used to describe and/or interpret data presented in the form of numbers rather than words. This type of data can be measured along a continuum and is characterized by having additive properties, equal intervals and usually a zero point. Both descriptive and inferential statistical techniques may be used with quantitative data.
Questionnaires are written data collection instruments (i.e., forms) that include instructions and a set of questions about one or more topics. They may be administered in person, by mail or electronically (e.g., via e-mail or Web sites). Newly developed questionnaires should be pilot-tested for effectiveness. See Paperwork Reduction Act Requirements.
Reliability is the extent to which a data collection instrument or effort yields consistent and stable results over repeated measurements conducted under similar conditions. For example, a bathroom scale is unreliable if it produces three different weights in three consecutive weighings of the same person. Reliability may be assessed in several ways, including:
- Inter-rater reliability, which measures the similarity of scores assigned by different raters (e.g., interviewers, observers) to the same phenomenon.
- Test-retest reliability, which measures the similarity of scores (or responses to a particular set of questions) obtained at different times from the same individuals.
- Internal-consistency reliability (often assessed by Cronbach’s coefficient α, and sometimes by a method resulting in a similar “split-half reliability” coefficient), which measures the similarity of scores obtained from the same individuals responding to different sets of items measuring the same concept, construct, or trait.

A data collection instrument or effort must be reliable to be considered valid.
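As a sketch of how internal-consistency reliability might be computed, Cronbach's coefficient α can be calculated directly from a matrix of item scores (the respondent scores below are hypothetical):

```python
# Sketch: Cronbach's coefficient alpha from hypothetical item scores
# (rows = respondents, columns = items on the same construct).
import statistics

scores = [
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
]

k = len(scores[0])   # number of items
item_vars = [statistics.pvariance(col) for col in zip(*scores)]
total_var = statistics.pvariance([sum(row) for row in scores])

# alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")   # → Cronbach's alpha: 0.92
```

Values closer to 1.0 indicate that the items measure the same underlying construct consistently.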
The response rate for a data collection effort is the number of actual respondents divided by the number of potential respondents. The denominator consists of all of the individuals in the target population who were sampled, including those who did not respond for any reason (e.g., refusal, language problems, inability to contact).
A sample is a subset of individuals or objects selected (or drawn) from the target population by means of a sampling strategy.
Sample size is the number of individuals or objects selected from the target population for data collection purposes. The size of a sample is important because it must be large enough to make valid inferences about the population from which the sample was drawn. Many factors should be considered when determining the sample size, including the planned sampling strategy, the number of subgroups within the target population for which separate estimates are required and estimates of the proportion of the population that fall into those subgroups.
A sampling strategy is the approach used to select a sample. Sampling strategies are used to increase the likelihood that the inferences made about the target population are valid. Examples include one-stage sampling techniques (e.g., simple random sampling, stratified sampling) and multi-stage sampling techniques (e.g., random digit dialing, area probability sampling).
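The two one-stage techniques named above can be sketched with the standard library's `random` module. The population, strata and sampling fractions below are hypothetical:

```python
# Sketch: simple random and stratified sampling with the `random` module.
# Population, strata and sampling fractions are hypothetical.
import random

random.seed(7)   # fixed seed so the sketch is reproducible

population = [f"grantee_{i:03d}" for i in range(100)]

# Simple random sampling: every unit has an equal chance of selection.
srs = random.sample(population, 10)

# Stratified sampling: partition the population into strata (here, two
# arbitrary groups) and draw proportionally (10%) from each stratum.
strata = {"early": population[:40], "late": population[40:]}
stratified = [unit
              for members in strata.values()
              for unit in random.sample(members, len(members) // 10)]

print(len(srs), len(stratified))   # 10 units drawn under each strategy
```

Stratified sampling guarantees that each stratum is represented in proportion to its size, which simple random sampling does only on average.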
Stakeholders are individuals or groups who are likely to be interested in, to be impacted by or to use the findings of the evaluation. Stakeholders typically include those involved in program operations (e.g., NIH researchers, staff, administrators), those served by the program and others who have an investment or interest in the program (e.g., academic institutions, advocates, the public).
Standards of Performance
Standards of performance are levels of program processes, outputs and/or outcomes established by authority or general agreement as being acceptable. Examples include a defined timeline or budget, a certain level of work output or product/service quality and a specific outcome.
A state-of-the-science assessment is a systematic review of existing research and recent advances in a specific area of biomedical research for the purpose of identifying scientific achievements, gaps and opportunities. It is usually conducted via a conference, workshop or expert panel meeting. State-of-the-science assessments are designed to help NIH program administrators and researchers identify research priorities and develop or modify program goals.
Strategic planning is a process that involves setting goals for a program or organization, developing strategies for achieving those goals and determining how success will be measured and evaluated.
A structured observation is a type of data collection in which the situation of interest is watched by one or more observers trained to record relevant facts, actions and behaviors in a standardized way. Structured observations are usually recorded on data collection forms and may include the use of audiotape or videotape (with the approval of the individuals being observed). See Paperwork Reduction Act Requirements.
Study questions are the key questions that the evaluation is designed to answer. For process and outcome evaluations, the study questions usually address the extent to which specific program goals have been achieved. Study questions are often answered by testing specific hypotheses.
The target population is the primary group about which information is needed to answer the study questions. It is frequently a group of individuals having certain characteristics, such as the participants in a specific NIH training program, the members of an IC’s scientific review groups (study sections), the individuals who called an NIH health hotline during a given time period or the NIH administrators who implemented a new program. The target population may also consist of a group of objects having certain characteristics, such as the academic institutions funded or the R01 grants awarded by an IC during a given period.
Unit of Analysis
The unit of analysis is the individual item within the target population for which data will be collected and analyzed to answer the study questions. The unit of analysis, for example, may be defined as a program participant, a member of the general public who accessed an NIH service, an academic department or an individual grant award. In some cases, more than one unit of analysis may be included in the evaluation design.
Validity is the extent to which a data collection instrument or effort accurately measures what it is supposed to measure. Validity may be assessed in several different ways, for example:
- The face validity of a questionnaire or other data collection instrument is assessed using human judgment, frequently the judgment of a group of experts in the field, to determine whether the instrument appears to measure what it claims to measure.
- Construct validity is assessed by determining the extent to which the underlying construct, concept or theory accounts for respondents’ scores.
- Concurrent validity is assessed by comparing the similarity of respondents’ scores to other criteria that are assumed to measure the same construct.
- Predictive validity is assessed by comparing respondents’ scores to future measures of performance.
It is generally agreed that there is no simple, uniform, wholly objective procedure for determining the validity of a data collection instrument or effort. For a data collection instrument or effort to be considered valid, it must also be reliable.
A variable is a factor, construct or characteristic of a person, object or program that can be measured or classified. In a program evaluation, the key variables are those for which data will be gathered to answer one or more study questions. Measures of a program’s performance (e.g., output measures, outcome measures or efficiency measures) are sometimes called dependent variables, while factors that may be predictive of a program’s performance (e.g., program resources, population characteristics or program activities) are sometimes called independent variables.