Advanced Measurement Designs for Survey Research

By Hair, J.F., Bush, R.P., Ortinau, D.J.

Edited by Paul Ducham


Attitudes have three components: cognitive, affective, and behavioral. Marketing researchers and decision makers must understand all three components.

Cognitive Component

The cognitive component of attitude is a person’s beliefs, perceptions, and knowledge about an object and its attributes. For example, as a college student you may believe that your university:

  • is a prestigious place to get a degree.
  • has professors with excellent reputations.
  • is a good value for the money.
  • needs more and better computer labs.

These beliefs make up the cognitive component of your attitude toward your university. Your beliefs may or may not be true, but they represent reality to you. The more positive beliefs you hold about your university, and the stronger each of those beliefs is, the more favorable the overall cognitive component is assumed to be.

Affective Component

The affective component of an attitude is the person’s emotional feelings toward a given object. This component is most frequently revealed when a person is asked to verbalize his or her attitude toward some object, person, or phenomenon. For example, if you claim you “love your university” or your university “has the best athletes possible or smartest students around” you are expressing your emotional feelings. These emotional feelings are the affective component of your attitude about your university. Your overall feelings about your university may be based on years of observing it, or they may be based on little actual knowledge. Your attitude could change as you are exposed to more information (e.g., from your freshman to your senior year), or it may remain essentially the same. Finally, two individuals may have different affective responses to the same experience (one student may like a particular professor’s teaching approach while another one may hate it).

Behavioral Component

The behavioral component, also sometimes referred to as a conative component, is a person’s intended or actual behavioral response to an object. For example, your decision to return to your university for the sophomore year is the behavioral component of your attitude. The behavioral component is an observable outcome driven by the interaction of a person’s cognitive component (beliefs) and affective component (emotional strength of beliefs) as they relate to a particular object. The behavioral component may represent future intentions (your plan to get an MBA degree after you finish your BA), but it usually is limited to a specific time period. Recommendations also represent a behavioral component (such as recommending that another student take a class from a particular professor).

Attitudes are a complex area to understand fully. In the next section we discuss the different scales used to measure attitudes and behaviors.


A Likert scale asks respondents to indicate the extent to which they either agree or disagree with a series of mental or behavioral belief statements about a given object. Typically the scale format is balanced between agreement and disagreement scale descriptors. Named after its original developer, Rensis Likert, this scale typically has five scale descriptors: “strongly agree,” “agree,” “neither agree nor disagree,” “disagree,” “strongly disagree.” A series of hierarchical steps is followed in developing a Likert scale:

Step #1: Identify and understand the concept to be studied. For example, assume the concept is voting in Florida.

Step #2: Assemble a large number of belief statements (e.g., 50 to 100) concerning the general public’s sentiments toward voting in Florida.

Step #3: Subjectively classify each statement as having either a “favorable” or an “unfavorable” relationship to the specific attitude under investigation. Then, the entire list of statements is pretested (e.g., through a pilot test) using a sample of respondents.

Step #4: Respondents decide the extent to which they either agree or disagree with each statement, using the intensity descriptors “strongly agree,” “agree,” “not sure,” “disagree,” “strongly disagree.” Each response is then given a numerical weight, such as 5, 4, 3, 2, 1. For assumed favorable statements, a weight of 5 would be given to a “strongly agree” response; for assumed unfavorable statements, the weights are reversed, so a weight of 5 would be given to a “strongly disagree” response (and a weight of 1 to “strongly agree”).

Step #5: A respondent’s overall-attitude score is calculated by the summation of the weighted values associated with the statements rated.

Step #6: Only statements that appear to discriminate between the high and low total scores are retained in the analysis. One possible method is a simple comparison of the top (or highest) 25 percent of the total mean scores with the bottom (or lowest) 25 percent of total mean scores.

Step #7: In determining the final set of statements (normally 20 to 25), statements that exhibit the greatest differences in mean values between the top and bottom total scores are selected.

Step #8: Using the final set of statements, steps 3 and 4 are repeated in a full study. By using the summation of the weights associated with all the statements, researchers can tell whether a person’s attitude toward the object is overall positive or negative. For example, the maximum favorable score on a 25-item scale would be 125 (5 × 25 = 125). Therefore a person scoring 110 would be assumed to hold a positive (favorable) attitude. Another respondent who scores 45 would be assumed to hold a negative attitude toward the object. The total scores do not identify any of the possible differences that might exist on an individual statement basis between respondents.
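The scoring and item-analysis logic in steps 4 through 6 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, sample data, and the simple top/bottom split are our own, not part of any standard scale-development library.

```python
# Illustrative Likert scoring: weight responses, reverse-score unfavorable
# statements, sum per respondent (step 5), and check item discrimination
# by comparing the top and bottom 25 percent of total scores (step 6).

WEIGHTS = {"strongly agree": 5, "agree": 4, "not sure": 3,
           "disagree": 2, "strongly disagree": 1}

def score_item(response, favorable=True):
    # Unfavorable statements are reverse-scored, so "strongly disagree" -> 5.
    w = WEIGHTS[response]
    return w if favorable else 6 - w

def total_score(responses, favorability):
    # One respondent's overall attitude score: the sum of the weighted items.
    return sum(score_item(r, f) for r, f in zip(responses, favorability))

def discrimination(item_scores, totals, pct=0.25):
    # For each item, the gap between its mean among the highest-scoring
    # respondents and its mean among the lowest-scoring respondents;
    # items with small gaps would be dropped in step 7.
    k = max(1, int(len(totals) * pct))
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    low, high = order[:k], order[-k:]
    return {item: sum(s[i] for i in high) / k - sum(s[i] for i in low) / k
            for item, s in item_scores.items()}
```

For example, a respondent who answers “agree” to a favorable statement and “disagree” to an unfavorable one earns 4 + 4 = 8, since the unfavorable item is reverse-scored.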

Over the years researchers have extensively modified the design of the Likert scale. Today, the modified Likert scale expands the original five-point format to either a six-point forced-choice format with scale descriptors such as “definitely agree,” “generally agree,” “slightly agree,” “slightly disagree,” “generally disagree,” “definitely disagree” or a seven-point free-choice format with these same descriptors plus “neither agree nor disagree” (sometimes labeled “not sure”) in the middle. In addition, many researchers treat the Likert scale format as an interval scale.

Regardless of the actual number of scale descriptors that are used, Likert scales have several other unique characteristics. First, the Likert scale is the only summated rating scale that uses a set of agreement/disagreement scale descriptors. A Likert scale collects only cognitive-based or specific behavioral beliefs. Despite the popular notion that Likert scales can measure a person’s complete attitude, they can capture only the cognitive components of a person’s attitude and are therefore only partial measures. They also do not capture the different possible intensity levels of expressed affective or behavioral components of a person’s attitude.

Likert scales are best for research designs that use self-administered surveys, personal interviews, or most online methods to collect the data. It is difficult to administer a Likert scale over the telephone because respondents have trouble visualizing and remembering the relative magnitudes of agreement and disagreement that make up the scale descriptors. Exhibit 12.2 shows an example of a partial modified Likert scale in a self-administered survey.

To point out the interpretive difficulties associated with the Likert scale, in Exhibit 12.2 we have used boldface in each of the statements for the words that indicate a single level of intensity. For example, in the first statement (I buy many things with a credit card), the main belief focuses on many things. If respondents check the “generally disagree” response, it would be a leap of faith for researchers to interpret that response to mean that respondents buy only a few things with a credit card. In addition, it would be a speculative guess on the part of researchers to assume that the respondents’ attitudes toward purchasing products or services with a credit card are unfavorable. The intensity levels assigned to the agree/disagree scale point descriptors do not truly represent the respondents’ feelings associated with the belief response. The intensity levels used in a Likert scale identify only the extent to which respondents think the statement represents their own belief about credit card purchases.

Consider the last statement in Exhibit 12.2 (I am never influenced by advertisements) as another example. The key words in this statement are never influenced. If respondents check “definitely disagree,” it would again be researchers’ subjective guess that the response means that respondents are very much influenced by advertisements. In reality, all that the “definitely disagree” response indicates is that the statement is not one that respondents would make. No measure of feeling can be attached to the statement.

Likert scales can be used to identify and assess personal or psychographic (lifestyle) traits of individuals. To see how international marketing research companies, like the Gallup Organization, use attitude and psychographic scale measurements to profile consumers across Latin American countries, visit the book’s Web site and follow the links.



Another rating scale used quite often in marketing research projects is the semantic differential scale. This type of scale is unique in its use of bipolar adjectives and adverbs (good/bad, like/dislike, competitive/noncompetitive, helpful/unhelpful, high quality/low quality, dependable/undependable) as the endpoints of a symmetrical continuum. Typically there will be one object and a related set of factors (or attributes), each with its own set of bipolar adjectives to measure either a cognitive or an affective element. Because the individual scale descriptors are not identified, each bipolar scale appears to be a continuum. In most cases, semantic differential scales will use between five and seven scale descriptors, though only the endpoints are identified. Respondents are asked to select the point on the continuum that expresses their thoughts or feelings about the given object.

In most cases a semantic differential scale will use an odd number of scale points, thus creating a so-called neutral response that symmetrically divides the positive and negative poles into two equal parts. An interpretive problem that arises with an odd-number scale point format comes from the natural neutral response in the middle of the scale. A neutral response has little or no diagnostic value to researchers or decision makers. Sometimes it is interpreted as meaning “no opinion,” “don’t know,” “neither/nor,” or “average.” None of these interpretations give much information to researchers. To overcome this problem, researchers can use an even-point (or forced-choice) format and incorporate a “not applicable” response off to the side of the bipolar scale.

A semantic differential scale is one of the few attitudinal scales that enable researchers to collect both cognitive and affective data for any given factor. But both types of data cannot be collected at the same time. For a given factor, a bipolar scale can be designed to capture either a person’s feelings or cognitive beliefs. Although some researchers believe a semantic differential scale can be used to measure a person’s complete attitude about an object or behavior, this scale type is best for identifying a “perceptual image profile” about the object or behavior of concern.

The actual design of a semantic differential scale can vary from situation to situation. To help understand the benefits and weaknesses associated with design differences, we present three different formats and discuss the pros and cons of each. In the first situation, researchers are interested in developing a credibility scale that can be used by Nike to assess the credibility of Tiger Woods as a spokesperson in TV or print advertisements for Nike brands of personal grooming products. Researchers determine the credibility construct consists of three factors—(1) expertise, (2) trustworthiness, and (3) attractiveness—with each factor measured using a specific set of five bipolar scales (see Exhibit 12.3).

Randomization of Positive and Negative Pole Descriptors

While the semantic differential scale format in Exhibit 12.3 appears to be correctly designed, there are several technical problems that can create response bias. First, notice that all the positive pole descriptors are arranged on the left side of each scale and the negative pole descriptors are all on the right side. This approach can cause a halo effect bias.4 That is, it tends to lead respondents to react more favorably to the positive poles on the left side than to the negative poles on the right side. To prevent this problem, researchers should randomly mix the positions of the positive and negative pole descriptors.

Lack of Extreme Magnitude Expressed in the Pole Descriptors

A second response problem with the scale format displayed in Exhibit 12.3 is that the descriptors at the ends of each scale do not express the extreme intensity associated with end poles. Respondents are asked to check one of seven possible lines to express their opinion, but only the two end lines are given narrative meaning. Researchers can only guess how respondents are interpreting the other positions between the two endpoints. Consider, for example, the “dependable/undependable” scale for the trustworthiness dimension. Notice that the extreme left scale position represents “dependable” and the extreme right scale position represents “undependable.” Because dependable and undependable are natural dichotomous phrase descriptors, the scale design does not allow for any significant magnitudes to exist between them. The logical question is what the other five scale positions represent, which in turn raises the question of whether or not the scale truly is a continuum ranging from dependable to undependable. This problem can be corrected by attaching a narratively expressed extreme magnitude to the bipolar descriptors (“extremely” or “quite” dependable, and “extremely” or “quite” undependable).

Use of Nonbipolar Descriptors to Represent the Poles

A third response problem that occurs in designing semantic differential scales relates to the inappropriate narrative expressions of the scale descriptors. In a good semantic differential scale design, the individual scales should be truly bipolar so that a symmetrical scale can be designed. Sometimes researchers will express the negative pole in such a way that the positive one is not really its opposite. This creates a skewed scale design that is difficult for respondents to interpret correctly.

Consider the “expert/not an expert” scale in the “expertise” dimension in Exhibit 12.3. While the scale is dichotomous, the words “not an expert” do not allow respondents to interpret any of the other scale points as being relative magnitudes of that phrase. Other than the one endpoint described as “not an expert,” all the other scale points would have to represent some intensity of “expert,” thus creating an unbalanced, skewed scale toward the positive pole. In other words, interpreting “not an expert” as really meaning “extremely” or “quite” not an expert makes little or no diagnostic sense. Researchers must be careful when selecting bipolar descriptors to make sure that the words or phrases are truly extremely bipolar in nature and that they allow for creating symmetrically balanced scale designs. For example, researchers could use pole descriptors such as “complete expert” and “complete novice” to correct the above-described scale point descriptor problems.

Matching Standardized Intensity Descriptors to Pole Descriptors

The scale design used by Bank of America for a bank image study in Exhibit 12.4 eliminates the three problems identified in the example in Exhibit 12.3, as well as a fourth—it gives narrative expression to the intensity level of each scale point. Notice that all the separate poles and scale points in between them are anchored by the same set of intensity descriptors (“very,” “moderately,” “slightly,” “neither one nor the other,” “slightly,” “moderately,” “very”). In using standardized intensity descriptors, however, researchers must be extra careful in determining the specific phrases for each pole—each phrase must fit the set of intensity descriptors in order for the scale points to make complete sense to respondents. Consider the “makes you feel at home/makes you feel uneasy” scale in Exhibit 12.4. The intensity descriptor of “very” does not make much sense when applied to that scale (“very makes you feel at home” or “very makes you feel uneasy”). Thus, including standardized intensity descriptors in a semantic differential scale design may force researchers to limit the types of bipolar phrases used to describe or evaluate the object or behavior of concern. This can only raise questions about the appropriateness of the data collected using this type of scale design.

These scale design fundamentals can help researchers develop customized scales to collect attitudinal or behavioral data. To illustrate this point, Exhibit 12.5 shows a semantic differential scale used by Midas Auto Systems Experts to collect attitudinal data about the performance of Midas. Notice that each of the 15 different features that make up Midas’s service profile has its own bipolar scale communicating the intensity level for the positive and negative poles. This reduces the possibility that respondents will misunderstand the scale.

Exhibit 12.5 also illustrates the use of an “NA”—not applicable—response as a replacement for the more traditional mid-scale neutral response. After the data are collected from this scale format, researchers can calculate aggregate mean values for each of the 15 features, plot those mean values on each of their respective scale lines, and graphically display the results using “profile” lines. The result is an overall profile that depicts Midas’s service performance patterns (see Exhibit 12.6). In addition, researchers can use the same scale and collect data on several competing automobile service providers (Firestone Car Care, Sears Auto Center), then show each of the semantic differential profiles on one display.
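As a sketch, the aggregate mean values behind such a profile can be computed along these lines. The feature names and ratings below are hypothetical, and “NA” responses are simply excluded from each feature’s mean.

```python
# Illustrative profile computation for a semantic differential scale with
# an "NA" option: average the valid 7-point ratings for each feature.

def profile_means(ratings_by_feature):
    profile = {}
    for feature, ratings in ratings_by_feature.items():
        valid = [r for r in ratings if r != "NA"]
        profile[feature] = sum(valid) / len(valid) if valid else None
    return profile

ratings = {
    "courteous staff": [7, 6, "NA", 5],   # hypothetical respondent ratings
    "fair prices": [3, 4, 4, 2],
}
print(profile_means(ratings))  # {'courteous staff': 6.0, 'fair prices': 3.25}
```

Plotting each feature’s mean on its scale line, for Midas and for each competitor, produces the overlaid profile lines described above.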

Exhibit 12.3

Exhibit 12.5


One of the most widely used scale formats in marketing research is the behavior intention scale. In using this scale, decision makers are attempting to obtain some idea of the likelihood that people will demonstrate some type of predictable behavior regarding the purchase of a product or service. In general, behavior intent scales have been found to be good predictors of consumers’ choices of frequently purchased and durable consumer products.

Behavior intention scales (purchase intent, attendance intent, shopping intent, usage intent) are easy to construct. Consumers are asked to make a subjective judgment of their likelihood of buying a product or service or taking a specified action. The scale descriptors typically used with a behavior intention scale are “definitely would,” “probably would,” “not sure,” “probably would not,” and “definitely would not.” For example, the Vail Valley Foundation wanted to identify how likely people were to attend a variety of performing arts events at its new outdoor Ford Amphitheater in Vail, Colorado; Exhibit 12.7 illustrates the behavior intention scale the Vail Valley Foundation management team used to collect the intention data. Note that this scale uses a forced-choice design by omitting the middle logical scale point of “not sure.” When designing behavior intention scales, it is important to include a specific time frame (“would consider attending in the next six months”) in the question/setup portion of the scale. Without an expressed time frame, researchers increase the possibility that respondents will bias their responses toward the “definitely would” or “probably would” scale categories.

To increase the clarity of the scale point descriptors, researchers can attach a percentage equivalent expression to each one. To illustrate this concept, let’s assume that Sears is interested in knowing how likely it is that customers will shop at certain types of retail stores for men’s casual clothing. The following set of scale points could be used to obtain the intention data: “definitely would shop at (90% to 100% chance)”; “probably would shop at (50% to 89% chance)”; “probably would not shop at (10% to 49% chance)”; and “definitely would not shop at (less than 10% chance).” Exhibit 12.8 shows what the complete shopping intention scale might look like.
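The percentage-equivalent descriptors can be sketched as a simple lookup table. Collapsing a checked category to the midpoint of its range, as the function below does, is our own illustrative convention, not something prescribed by the scale itself.

```python
# Illustrative mapping of intention descriptors to the percentage ranges
# from the Sears shopping example; the midpoint conversion is an assumption.

RANGES = {
    "definitely would shop at": (0.90, 1.00),
    "probably would shop at": (0.50, 0.89),
    "probably would not shop at": (0.10, 0.49),
    "definitely would not shop at": (0.00, 0.09),
}

def midpoint_probability(descriptor):
    # Rough single-number summary of the checked category's range.
    low, high = RANGES[descriptor]
    return (low + high) / 2
```

A summary like this can be useful when aggregating intention responses into an expected share of respondents likely to shop, though the ranges themselves remain the only information the scale actually collects.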

For more examples of Likert, semantic differential, and behavior intention types of scale designs, visit the book’s Web site and follow the links.

Exhibit 12.7

Exhibit 12.8


Before concluding the discussion of advanced scale measures, several comments are needed about when and why researchers use single-item and multiple-item scaling formats. First, a scale design can be characterized as being a single-item scale design when the data requirements focus on collecting data about only one attribute of the object or construct being investigated. An easy example to remember is collecting “age” data. Here the object is “a person” and the single attribute of interest is that person’s “age.” Only one measure is needed to collect the required age data. Respondents are asked a single question about their age and supply only one possible response to the question. In contrast, most marketing research projects that involve collecting attitudinal, emotional, and behavioral data require some type of multiple-item scale design. Basically, when using a multiple-item scale to measure the object or construct of interest, researchers will have to measure several items simultaneously rather than measuring just one item. Most advanced attitude, emotion, and behavior scales are multiple-item scales.

The decision to use a single-item versus a multiple-item scale is made in the construct development stage. Two factors play a significant role in the process. First, researchers must assess the dimensionality of the construct under investigation. Any construct that is viewed as consisting of several different, unique subdimensions will require researchers to measure each of those subcomponents. Second, researchers must deal with the reliability and validity issues of the scales used to collect data. Consequently, researchers are forced to measure each subcomponent using a set of different scale items. To illustrate these two points, consider again the Tiger Woods as a spokesperson example in Exhibit 12.3. Here the main construct of interest was “credibility as a spokesperson.” Credibility was made up of three key subcomponents (expertise, trustworthiness, and attractiveness). Each of the subcomponents was measured using five different seven-point scale items (e.g., expertise: knowledgeable/unknowledgeable, expert/not expert, skilled/unskilled, qualified/unqualified, experienced/inexperienced).

Another point to remember about multiple-item scales is that there are two types of scales: formative and reflective. A formative composite scale is used when each of the individual scale items measures some part of the whole construct, object, or phenomenon. For example, to measure the overall image of a 2010 Hummer, researchers have to measure the different attributes that make up that automobile’s image, such as performance, resale value, gas mileage, styling, price, safety features, sound system, and craftsmanship. By creating a scale that measures each pertinent attribute, researchers can sum the parts into a complete (that is, formative) whole that measures the overall image held by respondents toward the 2010 Hummer. With a reflective composite scale design, researchers use multiple items to measure an individual subcomponent of a construct, object, or phenomenon. For example, in isolating the investigation to the performance dimension of the 2010 Hummer, researchers can use a common performance rating scale and measure those identified attributes (trouble-free, MPG rating, comfort of ride, workmanship, overall quality, dependability, responsiveness) that make up the performance dimension. Each of these attributes reflects performance, and an average of the reflective scale items can be interpreted as a measure of performance.
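The computational difference between the two composite types can be sketched as a sum versus an average. The attribute names and ratings below are hypothetical, echoing the 2010 Hummer example.

```python
# Illustrative formative vs. reflective composites on 7-point ratings.

def formative_score(attribute_ratings):
    # Formative: each item measures a different part of the whole,
    # so the parts are summed into one overall score.
    return sum(attribute_ratings.values())

def reflective_score(item_ratings):
    # Reflective: each item reflects the same underlying dimension,
    # so the items are averaged.
    return sum(item_ratings.values()) / len(item_ratings)

image = {"performance": 6, "resale value": 4, "gas mileage": 2, "styling": 5}
performance = {"trouble-free": 6, "comfort of ride": 5, "dependability": 7}
print(formative_score(image))         # 17
print(reflective_score(performance))  # 6.0
```

In practice the choice between summing and averaging matters less than recognizing which items form the construct and which merely reflect one of its dimensions.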


The main design issues related to both construct development and scale measurement are reviewed below.

Construct Development Issues

Researchers must clearly define and operationalize constructs before they attempt to develop their scales. For each construct being investigated, researchers must determine its dimensionality characteristics (i.e., single versus multidimensional) before developing appropriate scales. In a multidimensional construct, all relevant dimensions must be identified as well as their related attributes.

Researchers also must not create double-barreled dimensions. That is, two different dimensions of a construct should not be represented as if they are one. For example, when investigating consumers’ perceptions of service quality, do not combine the service provider’s “technological competence” and “diagnostic competence” into a single competence dimension. Similarly, with a singular dimension, do not include double-barreled attributes. For example, avoid asking respondents to rate two attributes simultaneously (e.g., indicate to what extent you agree or disagree that Martha Stewart “perjured herself” and “should have been indicted”). For a multidimensional construct, use scale designs in which multiple attribute items are represented separately to measure each dimension independently from the other dimensions (see again the Tiger Woods example in Exhibit 12.3). Finally, construct validity must always be assessed before creating the final scales.

Measurement Scale Issues

When phrasing the question/setup element of a scale, use clear wording and avoid ambiguity. Also avoid using “leading” words or phrases in the question/setup part of any scale measurement.

Regardless of the data collection method (personal, telephone, or computer-assisted interviews, or any type of offline or online self-administered survey), all necessary instructions for both respondents and interviewers should be part of the scale measurement’s setup. All instructions should be kept simple and clear. When using multiattribute items, make sure the items are phrased unidimensionally (avoid double-barreled item phrases). When determining the appropriate set of scale point descriptors, make sure the descriptors are relevant to the type of data being sought. Use only scale descriptors and formats that have been pretested and evaluated for scale reliability and validity. Scale descriptors should have adequate discriminatory power, be mutually exclusive, and make sense to respondents.

Screening Questions

Screening questions (also referred to as screeners or filter questions) should be used in any type of interview. Their purpose is to identify qualified prospective respondents and prevent unqualified respondents from being included in the study. It is difficult to use screening questions in many self-administered questionnaires, except for computer-assisted surveys. Screening questions need to be administered separately, before the beginning of the main interview.

Skip Questions

Skip questions (also referred to as conditional or branching questions) should be avoided if at all possible. If they are needed, the instructions must be clearly communicated to respondents or interviewers. Skip questions can appear anywhere within the questionnaire and are used if the next question (or set of questions) should be responded to only by respondents who meet a previous condition. A simple expression of a skip command might be: “If you answered yes to Question 5, skip to Question 9.” Skip questions help ensure that only specifically qualified respondents answer certain items.

Ethical Responsibility of Researchers

In the development of scale measurements, researchers must use the most appropriate scales possible. Intentionally using scale measurements to produce biased information raises questions about the professional ethics of researchers. Any set of scale point descriptors used to frame a noncomparative rating scale can be manipulated to bias the results in any direction. Inappropriate scale descriptors to collect brand-image data can be used to create a positive view of one brand or a negative view of a competitor’s brand, which might not paint a true picture of the situation. To illustrate this point, let’s revisit the Aca Joe example. Let’s assume that in creating the seven-point semantic differential scale used to collect the image data for the seven dimensions of Aca Joe’s store image (quality, assortment, style/fashion, prices of merchandise, store’s location, overall reputation, and knowledgeability of sales staff), researchers decided not to follow many of the process guidelines for developing accurate scale measurements we have discussed, including no pretesting of the scales. Instead, they just used their intuitive judgment of what they thought the owner of Aca Joe’s was hoping for. Consequently, the following semantic differential scale measurement was developed:


Now, select a retail store of your choice, assume it to be Aca Joe’s, and rate that store using the above scale. Interpret the image profile that you create and compare it to Aca Joe’s desired and actual images. What differences do you detect? How objective were your ratings? Did you find yourself rating your store positively like Aca Joe? What problems did you encounter on each dimension?

Using the above scale, researchers can negatively bias evaluations of competitors’ images by pairing mildly positive descriptors against strongly negative ones, or vice versa. For example, the descriptor “truly terrible” is much more negative than “outstanding” is positive. Similarly, “extremely high” applied to merchandise prices is much more negative than the positive descriptor “reasonable.” Ethically, it is important to use balanced scales with comparable positive and negative descriptors. In addition, when researchers do not follow scale development guidelines, responses can be biased. This example also points out the need to pretest scale measurements and establish that they have adequate reliability, validity, and generalizability. Remember, scales that are unreliable, invalid, or lacking in generalizability to the defined target population will provide misleading findings (garbage in, garbage out).