Center for Advanced Studies in Measurement and Assessment

CASMA Research Report Number 37

Effects of the Number of Common Items on Equating Precision and Estimates of the Lower Bound to the Number of Common Items Needed∗

Mengyao Zhang and Michael J. Kolen†

August 2013

∗ The authors thank Robert L. Brennan and Won-Chan Lee for helpful comments on a previous draft.
† Mengyao Zhang is a research assistant in the Center for Advanced Studies in Measurement and Assessment (CASMA), College of Education, University of Iowa (email: [email protected]). Michael J. Kolen is Professor, College of Education, University of Iowa (email: [email protected]).

Zhang and Kolen: Effects of the Number of Common Items

Center for Advanced Studies in Measurement and Assessment (CASMA)
College of Education
University of Iowa
Iowa City, IA 52242
Tel: 319-335-5439
Web: www.education.uiowa.edu/casma

All rights reserved

Contents

1 Introduction
2 Classical Congeneric Model Results
3 Chained Linear Equating Method
4 Direct Estimates of Lower Bound of the Number of Common Items Needed
5 Simulation
  5.1 Simulation Study 1
  5.2 Simulation Results 1
  5.3 Simulation Study 2
  5.4 Simulation Results 2
6 Discussion
7 Appendix
  7.1 Evaluating the Expected Correlation, $\rho(X,V)$
  7.2 Estimating the relative length of common items, $k$
8 References

List of Tables

Table 1. $\rho^2(X,V)$ as a function of $\rho(X,X')$ and $k$ using external common items
Table 2. $\rho^2(X,V)$ as a function of $\rho(X,X')$ and $k$ using internal common items
Table 3. Estimated standard errors of equating using external common items ($\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$)
Table 4. Estimated standard errors of equating using internal common items ($\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$)
Table 5. Estimated lower bound of relative length of the external set of common items ($\rho(X,X') = 0.8$, $-3 \le z_i \le 3$)
Table 6. Estimated lower bound of relative length of the internal set of common items ($\rho(X,X') = 0.8$, $-3 \le z_i \le 3$)
Table 7. Parameters for Simulation Study 1
Table 8. Modified $k$ for Tables 5 and 6 based on Simulation Study 2 ($\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$, $u = 0.10$)
Table 9. Modified numbers of common items needed based on Simulation Study 2 (total number of items on either Form X or Y is 50, $\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$, $u = 0.10$)

List of Figures

Figure 1. $\rho^2(X,V)$ as a function of $\rho(X,X')$ and $k$
Figure 2. Estimated SEE ($\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$)
Figure 3. Flowchart of a process for estimating lower bound of the number of common items needed
Figure 4. Difference between analytic SEE and empirical SEE (normal distribution, $\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$)
Figure 5. Difference between analytic SEE and empirical SEE (positively skewed distribution, $\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$)
Figure 6. Difference between analytic SEE and empirical SEE (negatively skewed distribution, $\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$)
Figure 7. Difference between analytic SEE and empirical SEE when $ES = 0$ ($\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$)
Figure 8. Modified numbers of common items needed (total number of items on either Form X or Y is 50, $\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$, $u = 0.10$)

Abstract

The construction of test forms including common items can be challenging. By combining the classical congeneric model with analytic standard errors derived by the delta method, this study develops a process for estimating the numbers of common items that are necessary to provide the desired equating precision indexed by the standard error of equating. The chained linear equating method is studied. Both external and internal sets of common items are considered, along with a variety of real test situations represented by test reliability, sample size available, and score range of interest.

1 Introduction

In common item equating, scores on test Form X are equated to scores on test Form Y using scores on a set of items, V, that are in common to the two forms. When scores on the common items contribute to the total score on test forms X and Y, the common items are referred to as being internal. When scores on the common items do not contribute to the total score on test forms X and Y, the common items are referred to as being external. In common item equating, the groups of examinees taking test forms X and Y can be considered to be equivalent in ability, such as when forms X and Y are randomly assigned to examinees for the purposes of equating using the random groups design (Kolen & Brennan, 2004). Alternatively, the groups can be considered not equivalent in ability using what is sometimes referred to as the common item nonequivalent groups design (CINEG, Kolen & Brennan, 2004) or as the nonequivalent groups anchor test design (NEAT, Holland & Dorans, 2006).

The construction of test forms including common items is one of the most challenging parts of common item equating (Kolen & Brennan, 2004). Some previous studies have empirically shown that larger numbers of common items generally produced greater equating precision (Puhan, 2010; Ricker & von Davier, 2007; Wang, Lee, Brennan, & Kolen, 2006; Yang & Houang, 1996). However, since the findings were based on specific test data and situations by manipulating the numbers of common items in a limited manner, the generalizability of these results is uncertain. Also, no general analytic process exists in the literature for estimating the numbers of common items leading to desired equating precision.

In this study it is also shown that the number of common items included in equating has a direct effect on the precision of the estimates of the equating relationship, with larger numbers of common items leading to greater precision. Furthermore, when designing equating studies, the test developer can base the choice of the number of common items on the degree of equating precision desired. The purpose of this study is to detail a process that can be used to choose the numbers of common items that are necessary to provide the desired degree of equating precision when using chained linear equating procedures. Both external and internal sets of common items are considered.

In this study, under the classical congeneric model, the precision of equating is shown to be related directly to the correlation between the scores on the total test and scores on the common items. The development of the approach begins by showing how this correlation relates to reliability for the total test and the ratio of test lengths for the common items and total numbers of items on the test. The development of the approach then relates this correlation to equating precision as indexed by the standard error of equating (SEE). After specifying reliability of the total test, the sample size available, the score range of interest, and the degree of precision desired, the procedures described in this study allow the test developer to choose the length of the set of common items that will lead to the desired equating precision when the chained linear equating method

is used. In the present study, two major assumptions are made: that the groups are equivalent in ability and that the score distributions are normal. The simulation provides an empirical check on the extent to which the results hold when these assumptions are violated.

2 Classical Congeneric Model Results

Let $X$, $Y$, and $V$ represent the random variables for observed scores on test forms X, Y, and the set of common items V, respectively. As assumed in classical test theory, every observed score on a test form or, more generally, on a set of items is the sum of two exclusive components, true score $T$ and error of measurement $E$ (Feldt & Brennan, 1989; Kolen & Brennan, 2004). Subscripts are needed to specify which test form or item set is considered. For example, $T_X$ and $E_X$ denote true score and error of measurement related to test form X.

In classical test theory, varying degrees of heterogeneity between test forms are studied by using different conceptions of parallel measurements (Feldt & Brennan, 1989). In this study, the classical congeneric model is chosen for test form X and the set of common items V because of its flexibility in reflecting similarity and dissimilarity between a test form and a subset of items on the test. Similar results can be extended to test form Y and the set of common items V. According to Kolen and Brennan (2004), the following properties hold if the classical congeneric model is assumed.

1. True scores $T_X$ and $T_V$ are linearly related, where the $\lambda$'s and $\delta$'s are slopes and intercepts respectively, as

$$X = T_X + E_X = (\lambda_X T + \delta_X) + E_X, \tag{1}$$

and

$$V = T_V + E_V = (\lambda_V T + \delta_V) + E_V. \tag{2}$$

2. Error variances are proportional to effective test lengths $\lambda_X$ and $\lambda_V$ as

$$\sigma^2(E_X) = \lambda_X \sigma^2(E), \tag{3}$$

and

$$\sigma^2(E_V) = \lambda_V \sigma^2(E). \tag{4}$$

3. Score variances and covariances are derived from Equations 1 through 4 as

$$\sigma^2(X) = \lambda_X^2 \sigma^2(T) + \lambda_X \sigma^2(E), \tag{5}$$

$$\sigma^2(V) = \lambda_V^2 \sigma^2(T) + \lambda_V \sigma^2(E), \tag{6}$$

and

$$\sigma(X,V) = \lambda_X \lambda_V \sigma^2(T) + \sigma(E_X, E_V). \tag{7}$$
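As a rough illustration of Equations 5 through 7, the sketch below computes the variances, the covariance, and the resulting squared correlation between $X$ and $V$. The function names and the numeric values are hypothetical. The error covariance term is taken to be 0 for an external set and $\lambda_V \sigma^2(E)$ for an internal set, which are the values this report adopts from Kolen and Brennan (2004).

```python
def congeneric_moments(lam_x, lam_v, var_t, var_e, internal):
    """Return (var_x, var_v, cov_xv) under the classical congeneric model."""
    var_x = lam_x**2 * var_t + lam_x * var_e      # Equation 5
    var_v = lam_v**2 * var_t + lam_v * var_e      # Equation 6
    # Assumed error covariance: lam_v * var_e if V is part of X, else 0.
    cov_err = lam_v * var_e if internal else 0.0
    cov_xv = lam_x * lam_v * var_t + cov_err      # Equation 7
    return var_x, var_v, cov_xv


def squared_correlation(lam_x, lam_v, var_t, var_e, internal):
    """Squared Pearson correlation between X and V from the moments above."""
    var_x, var_v, cov_xv = congeneric_moments(lam_x, lam_v, var_t, var_e, internal)
    return cov_xv**2 / (var_x * var_v)


# Hypothetical case: reliability of X is 0.8 and V is half as long as X.
for internal in (False, True):
    print(internal, round(squared_correlation(1.0, 0.5, 0.8, 0.2, internal), 4))
```

With these illustrative inputs the squared correlation is about 0.53 for the external set and about 0.83 for the internal set.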

For convenience, Equations 6 and 7 can be rewritten as equations involving only the observed score variance $\sigma^2(X)$, reliability for the total test $\rho(X,X')$, and effective test lengths $\lambda_X$ and $\lambda_V$, where reliability $\rho(X,X')$ is defined as the ratio $\frac{\sigma^2(T_X)}{\sigma^2(X)} = 1 - \frac{\sigma^2(E_X)}{\sigma^2(X)}$ (Feldt & Brennan, 1989). Specifically, for the classical congeneric model, Equation 6 is rewritten as

$$
\begin{aligned}
\sigma^2(V) &= \lambda_V^2 \sigma^2(T) + \lambda_V \sigma^2(E) \\
&= \frac{\lambda_V^2}{\lambda_X^2} \cdot \lambda_X^2 \sigma^2(T) + \frac{\lambda_V}{\lambda_X} \cdot \lambda_X \sigma^2(E) \\
&= \frac{\lambda_V^2}{\lambda_X^2} \sigma^2(T_X) + \frac{\lambda_V}{\lambda_X} \sigma^2(E_X) \\
&= \frac{\lambda_V^2}{\lambda_X^2} \sigma^2(X) \rho(X,X') + \frac{\lambda_V}{\lambda_X} \sigma^2(X) [1 - \rho(X,X')] \\
&= \frac{\lambda_V}{\lambda_X} \sigma^2(X) \left[ 1 + \left( \frac{\lambda_V}{\lambda_X} - 1 \right) \rho(X,X') \right].
\end{aligned}
\tag{8}
$$

When V is an external set of common items, $\sigma(E_X, E_V) = 0$, and $\sigma(X,V) = \lambda_X \lambda_V \sigma^2(T)$ (Kolen & Brennan, 2004). Thus, Equation 7 can be rewritten as

$$
\begin{aligned}
\sigma(X,V) &= \lambda_X \lambda_V \sigma^2(T) \\
&= \frac{\lambda_V}{\lambda_X} \cdot \lambda_X^2 \sigma^2(T) \\
&= \frac{\lambda_V}{\lambda_X} \sigma^2(T_X) \\
&= \frac{\lambda_V}{\lambda_X} \sigma^2(X) \rho(X,X').
\end{aligned}
\tag{9}
$$

When V is an internal set of common items, $\sigma(E_X, E_V) = \lambda_V \sigma^2(E)$, and $\sigma(X,V) = \lambda_X \lambda_V \sigma^2(T) + \lambda_V \sigma^2(E)$ (Kolen & Brennan, 2004). Similarly, Equation 7 can be rewritten as

$$
\begin{aligned}
\sigma(X,V) &= \lambda_X \lambda_V \sigma^2(T) + \lambda_V \sigma^2(E) \\
&= \frac{\lambda_V}{\lambda_X} [\lambda_X^2 \sigma^2(T) + \lambda_X \sigma^2(E)] \\
&= \frac{\lambda_V}{\lambda_X} \sigma^2(X).
\end{aligned}
\tag{10}
$$

For an external set of common items, by substituting Equations 8 and 9 in the equation for the Pearson product-moment correlation coefficient, an expression for the squared correlation between the scores on the total test and scores on the common items, $\rho^2(X,V)$, is developed, where $k$ is defined as the ratio $\frac{\lambda_V}{\lambda_X}$ representing the relative length of the set of common items ($k > 0$), as

$$
\rho^2(X,V) = \frac{[\sigma(X,V)]^2}{\sigma^2(X)\,\sigma^2(V)}
= \frac{\left[ \frac{\lambda_V}{\lambda_X} \sigma^2(X) \rho(X,X') \right]^2}{\sigma^2(X) \cdot \frac{\lambda_V}{\lambda_X} \sigma^2(X) \left[ 1 + \left( \frac{\lambda_V}{\lambda_X} - 1 \right) \rho(X,X') \right]}
= \frac{k \rho^2(X,X')}{1 + (k - 1) \rho(X,X')}.
\tag{11}
$$

For an internal set of common items, the expression of $\rho^2(X,V)$ is developed by using Equations 8 and 10 in a similar manner ($0 < k \le 1$) as

$$
\rho^2(X,V) = \frac{[\sigma(X,V)]^2}{\sigma^2(X)\,\sigma^2(V)}
= \frac{\left[ \frac{\lambda_V}{\lambda_X} \sigma^2(X) \right]^2}{\sigma^2(X) \cdot \frac{\lambda_V}{\lambda_X} \sigma^2(X) \left[ 1 + \left( \frac{\lambda_V}{\lambda_X} - 1 \right) \rho(X,X') \right]}
= \frac{k}{1 + (k - 1) \rho(X,X')}.
\tag{12}
$$

Table 1 shows that when V is an external set of common items, the squared correlation $\rho^2(X,V)$ changes as $\rho(X,X')$ and $k$ change. Figure 1 (upper part) graphically illustrates this relationship. For a fixed $k$, higher test reliability leads to higher $\rho^2(X,V)$. For fixed test reliability, the larger $k$ is, the higher $\rho^2(X,V)$. $\rho^2(X,V)$ would reach 1 only when $\rho(X,X') = 1$.

The internal common items case is shown in Table 2 and Figure 1 (lower part). Note that V and X are actually the same when $k = 1$. The relationship of $\rho^2(X,V)$, $\rho(X,X')$, and $k$ is similar to the external common items case, except that $\rho^2(X,V)$ eventually reaches 1 when $k = 1$.

3 Chained Linear Equating Method

In the CINEG design, the chained equating method generally involves two steps (Kolen & Brennan, 2004). First, scores on test form X are converted to scores on the common items V based on the group of examinees taking test form X (Group 1), denoted as $X \rightarrow V$. Next, scores on the common items V are converted to scores on test form Y based on the group of examinees taking test form Y (Group 2), denoted as $V \rightarrow Y$. In this study, the chained linear equating method is considered. As its name suggests, this equating method contains two linear conversions, $X \rightarrow V$ and $V \rightarrow Y$.
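The two linear conversions can be sketched as follows. This is a minimal illustration that assumes each link sets z-scores equal, the usual definition of linear equating; the function names, argument names, and score values are hypothetical.

```python
def linear_link(mu_from, sd_from, mu_to, sd_to):
    """Return (slope, intercept) of a linear conversion that equates z-scores."""
    slope = sd_to / sd_from
    return slope, mu_to - slope * mu_from


def chained_linear(x, mu_x, sd_x, mu_v1, sd_v1, mu_v2, sd_v2, mu_y, sd_y):
    """Equate a Form X score to the Form Y scale through the common items V."""
    a1, b1 = linear_link(mu_x, sd_x, mu_v1, sd_v1)   # X -> V, Group 1 statistics
    a2, b2 = linear_link(mu_v2, sd_v2, mu_y, sd_y)   # V -> Y, Group 2 statistics
    v = a1 * x + b1
    return a2 * v + b2


# Hypothetical moments; the two groups have identical common-item statistics.
print(chained_linear(36.0, 30.0, 6.0, 12.0, 3.0, 12.0, 3.0, 32.0, 7.0))
```

When the two groups have the same mean and standard deviation on V, a Form X score one standard deviation above the mean maps to a Form Y score one standard deviation above the Form Y mean.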
The chained linear equating method is relatively simple and straightforward compared with other equating methods, and it still can be formulated within the general framework for observed-score equating relationships (Brennan, 2006). In addition, the chained linear equating method often leads to greater random error of equating compared to other linear methods used for common-item equating (Kolen & Brennan, 2004). Thus, estimation of the number of common items based on this equating method might be fairly conservative.

Suppose scores on test form X and the common items V in Group 1 satisfy a bivariate normal distribution. Let $\mu(X)$ and $\sigma(X)$ denote the mean and

standard deviation of scores on form X, and let $\mu(V)$ and $\sigma(V)$ denote the mean and standard deviation of scores on the set of common items V. Use $N$ to represent the sample size. Subscripts are used to differentiate group membership only when confusion may otherwise occur. For every possible score $x_i$ on test form X, an approximation of random error variance for the single group linear equating $X \rightarrow V$ was originally proposed by Lord (1950) (also see Angoff, 1971; Kolen & Brennan, 2004) as

$$
\operatorname{var}[\hat{l}_V(x_i)] \cong \frac{\sigma_1^2(V)[1 - \rho(X,V)]}{N_1} \left\{ 2 + [1 + \rho(X,V)] \left[ \frac{x_i - \mu(X)}{\sigma(X)} \right]^2 \right\}.
\tag{13}
$$

A similar equation also holds for the single group linear equating $V \rightarrow Y$ in Group 2 if scores on test form Y and the common items V are assumed to have a bivariate normal distribution as

$$
\operatorname{var}[\hat{l}_Y(v_i)] \cong \frac{\sigma^2(Y)[1 - \rho(Y,V)]}{N_2} \left\{ 2 + [1 + \rho(Y,V)] \left[ \frac{v_i - \mu_2(V)}{\sigma_2(V)} \right]^2 \right\}.
\tag{14}
$$

According to Braun and Holland (1982), if two equating chains, $X \rightarrow V$ and $V \rightarrow Y$, are statistically independent, the error variance of the entire chained equating, $\operatorname{var}[\hat{e}_Y(x_i)]$, could be estimated based on $\operatorname{var}[\hat{l}_V(x_i)]$ and $\operatorname{var}[\hat{l}_Y(v_i)]$,

$$
\operatorname{var}[\hat{e}_Y(x_i)] \cong \operatorname{var}[\hat{l}_Y(v_i)] + [\hat{l}'_Y(v_i)]^2 \operatorname{var}[\hat{l}_V(x_i)],
\tag{15}
$$

where $\hat{l}'_Y(v_i)$ indicates the slope of the linear conversion from V to Y, that is, $\frac{\sigma(Y)}{\sigma_2(V)}$ by the definition of linear equating (Kolen & Brennan, 2004). As a result of a linear conversion, the two z-scores $\frac{x_i - \mu(X)}{\sigma(X)}$ and $\frac{v_i - \mu_1(V)}{\sigma_1(V)}$ should be equal, denoted by $z_i = \frac{x_i - \mu(X)}{\sigma(X)} = \frac{v_i - \mu_1(V)}{\sigma_1(V)}$. By substituting Equations 13 and 14 in Equation 15 and assuming that,

1. groups are equivalent in ability, so that $\mu_1(V) = \mu_2(V)$ and $\sigma_1(V) = \sigma_2(V)$,

2. the correlation between X and V in Group 1 equals the correlation between Y and V in Group 2, $\rho(X,V) = \rho(Y,V)$, and

3. numbers of examinees taking test forms X and Y are equal, namely $N_1 = N_2 = \frac{N_{tot}}{2}$, where $N_{tot}$ represents the total sample size,

an approximation of random error variance for the chained linear equating method is as follows

$$
\begin{aligned}
\operatorname{var}[\hat{e}_Y(x_i)] &\cong \operatorname{var}[\hat{l}_Y(v_i)] + [\hat{l}'_Y(v_i)]^2 \operatorname{var}[\hat{l}_V(x_i)] \\
&= \frac{\sigma^2(Y)[1 - \rho(Y,V)]}{N_2} \left\{ 2 + [1 + \rho(Y,V)] \left[ \frac{v_i - \mu_2(V)}{\sigma_2(V)} \right]^2 \right\} \\
&\quad + \left[ \frac{\sigma(Y)}{\sigma_2(V)} \right]^2 \frac{\sigma_1^2(V)[1 - \rho(X,V)]}{N_1} \left\{ 2 + [1 + \rho(X,V)] \left[ \frac{x_i - \mu(X)}{\sigma(X)} \right]^2 \right\} \\
&= \frac{4 \sigma^2(Y)[1 - \rho(X,V)]}{N_{tot}} \left\{ 2 + [1 + \rho(X,V)] z_i^2 \right\}.
\end{aligned}
\tag{16}
$$

Letting $Y$ be standardized to have a mean of 0 and a standard deviation of 1,

$$
\operatorname{var}[\hat{e}_Y(x_i)] \cong \frac{4[1 - \rho(X,V)]}{N_{tot}} \left\{ 2 + [1 + \rho(X,V)] z_i^2 \right\}.
\tag{17}
$$

This result is also consistent with the equation presented by Lord (1950) for “Case IV” in which test forms X and Y are both equated to the set of common items V (also see Angoff, 1971).

Example

By substituting the expressions in Equation 11 or 12 for $\rho(X,V)$ in Equation 16 or 17, error variance for equating can be viewed as a function of reliability, $\rho(X,X')$, relative effective test length, $k$, sample size, $N_{tot}$, and standardized score $z_i$. Suppose that there is a desire to estimate the length of the set of common items necessary for the equating to have a certain level of precision over a range of z-scores. Assume that reliability of the test is known and that the sample size available for equating is known as well. Equations 16 and 17 can be used to find the approximate number of common items needed to achieve the desired equating precision.

Consider the following example. An external set of common items is to be used. The test contains 50 multiple-choice items. Test reliability is 0.8 and the available sample size for equating is $N_{tot} = 2{,}000$. Also assume that the target equating precision is a standard error of equating of 0.1 or below over the range of z-scores from -3 to 3. Table 3 provides standard errors of equating (square root of error variance) for this situation at various values of $k$ and at various z-scores. Based on this table, approximately $k = 0.50$ or greater is necessary to achieve the precision target. Because the test length is 50 items, the common items length should be at least 25 items to achieve the precision target. Note that when using external common items, the relative length of the set of common items can be even longer than the total test.

Now consider that all of the same characteristics hold, except that an internal set of common items is to be used. Based on Table 4, approximately $k = 0.20$ or greater would be needed to achieve the target precision. Thus, an internal set

of common items of at least 10 items would be needed to achieve the precision target.

The values in Tables 3 and 4 are shown graphically in Figure 2. As can be seen, for a given value of $k$ the minimum standard error of equating is at a z-score of 0, and the more the z-scores deviate from 0 the greater is the standard error of equating. For the external common items, the standard errors of equating are clustered together more than for the internal common items. In addition, using the internal set of common items leads to smaller standard errors than using the external set of common items.

4 Direct Estimates of Lower Bound of the Number of Common Items Needed

As mentioned in the previous example, in practice, the test developer may need to decide on the relative length of the set of common items that is necessary to provide the desired degree of equating precision. In the previous section, two tables were created that were used to provide an approximate procedure for finding the length of the external and internal set of common items respectively. In this section, a procedure that can be used to directly estimate the length of the common items is developed. It is based on test reliability, $\rho(X,X')$, the sample size available, $N_{tot}$, target SEE in terms of numbers of standard deviation units, $u$, and standardized test scores of interest, $z_i$. Some tables are also created so the test developer can easily deal with a variety of test construction situations. In general, this process involves four steps.

Step 1. Specify $N_{tot}$ and $u$.

It is not surprising that, when the sample size used for equating is large or the test developer has a high tolerance of equating error or both, the target equating precision can be achieved even when relatively shorter sets of common items are used.
However, when the sample size available is limited or the desired equating precision is strict, constraints placed on the number of common items become stringent. Then there is a need to estimate the lower bound of the number of common items necessary.

Step 2. Determine $z_i$.

Every test is designed to fulfill some specific purposes. As a result, the score range of interest varies from test to test. For example, a test that provides information for selecting scholarship recipients tends to focus more on better-than-average performances, whereas a test that is to be used for a variety of purposes might need precision for a wide range of scores. Accordingly, in terms of standardized score $z_i$, the range of interest might be $1.5 \le z_i \le 3$ for the former, whereas it could be $-3 \le z_i \le 3$ for the latter. As shown in Figure 2 in the previous example, random equating error reaches its lowest value at $z_i = 0$, and increases as it deviates from the middle scores. Consequently, especially when the score range of interest covers some extreme score values, the number of common items should be large enough to provide the desired equating precision

at these extreme values.

Step 3. Choose the type of common items, internal or external.

The type of common items, internal or external, may also be a factor when deciding the length of the set of common items needed.

Step 4. Specify test reliability.

If test reliability is too low, it might be impossible to find a lower bound of the number of common items needed. Note that under the classical congeneric model, reliability for test form X can be estimated by Feldt's internal consistency coefficient (Feldt & Brennan, 1989) as

$$
{}_{F}\hat{\rho}_{XX'} = \frac{S_X^2 \left( S_X^2 - \sum S_{X_f}^2 \right)}{S_X^4 - \sum S_{X_f X}^2},
$$

where $S_X^2$ is the total score variance, $S_{X_f}^2$ is the variance for individual item $X_f$, and $S_{X_f X}$ is the covariance between individual item $X_f$ and total test score $X$. In practice, however, Cronbach's alpha is routinely reported as an index of test reliability, which is under the essentially tau-equivalent model. Feldt and Brennan (1989) provided an example that compared different reliability coefficient estimates based on the same variance-covariance matrix (see pp. 114–116).

The process for choosing $k$ is represented in the flowchart in Figure 3, and detailed analytic derivations are provided in the Appendix. As seen in the flowchart, there are five different results regarding the choice of the number of common items necessary to provide the target SEE, and sometimes a result can be reached without going through all four steps. The following set of examples is used to demonstrate the use of the flowchart in practice.

Example

Consider the example described in the previous section. The sample size available for equating is $N_{tot} = 2{,}000$, and the target SEE is assumed to be $u = 0.1$ standard deviation units. At Step 1,

$$N_{tot} u^2 = 2000(0.1)^2 = 20 > 8,$$

so branch left and move to Step 2. At Step 2, squared z-scores of interest are compared with a criterion,

$$\frac{N_{tot} u^2 - 8}{4} = \frac{2000(0.1)^2 - 8}{4} = 3.$$

Suppose the z-score range of interest is from -1.5 to 1.5. Note that every possible $z_i^2 \le (1.5)^2 = 2.25 < 3$, so branch left again and move to Result 1. That is, under this situation, the target SEE will always be achieved regardless of the value of $k$, assuming that the test satisfies the requirements of the classical congeneric model and randomly equivalent groups are administered Forms X and Y.
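The Step 1 and Step 2 checks in this example, together with Equations 11, 12, and 17, can be sketched as below. The function names are hypothetical, and treating equality with the Step 2 criterion as acceptable is an assumption rather than something the flowchart is known to specify.

```python
import math


def rho2_xv(rel, k, internal=False):
    """Squared correlation rho^2(X,V): Equation 12 if internal, else Equation 11."""
    numerator = k if internal else k * rel**2
    return numerator / (1.0 + (k - 1.0) * rel)


def see(rel, k, n_tot, z, internal=False):
    """Standard error of equating in SD units of Y (square root of Equation 17)."""
    rho = math.sqrt(rho2_xv(rel, k, internal))
    return math.sqrt(4.0 * (1.0 - rho) / n_tot * (2.0 + (1.0 + rho) * z**2))


def target_met_for_any_k(n_tot, u, z_max):
    """Steps 1 and 2: True when the target SEE u holds whatever k is chosen."""
    criterion = (n_tot * u**2 - 8.0) / 4.0
    # Assumption: z_max**2 equal to the criterion still counts as meeting the target.
    return n_tot * u**2 > 8.0 and z_max**2 <= criterion


print(target_met_for_any_k(2000, 0.1, 1.5))   # z-scores from -1.5 to 1.5
print(target_met_for_any_k(2000, 0.1, 3.0))   # z-scores from -3 to 3
print(see(0.8, 0.5, 2000, 3.0))               # external set, k = 0.50
```

For the first range the check succeeds, and evaluating `see` at $z_i = 1.5$ for very small $k$ stays below 0.1, in line with Result 1. For the wider range the check fails, so the remaining steps are needed.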

Suppose that all of the same characteristics hold, except that the z-score range of interest now is from -3 to 3. As a result, at Step 2, some $z_i^2$ can exceed the criterion, so branch right instead, and move to Step 3. Now, the type of common items directly affects the estimation of the lower bound of the number of common items necessary to achieve the target SEE.

If external common items are to be used, then test reliability needs to be higher than a criterion,

$$
\rho_H^2 = \left[ \frac{-1 + \sqrt{1 + z_i^2 \left( z_i^2 - \frac{N_{tot} u^2 - 8}{4} \right)}}{z_i^2} \right]^2 \cong 0.51.
$$

Otherwise, it is impossible to achieve the target SEE regardless of the choice of the number of common items included (Result 3). Assume that test reliability is 0.8, which is higher than 0.51, and then go to Result 2; that is, the relative length of the set of common items is

$$
k \ge \frac{\rho_H^2 [1 - \rho(X,X')]}{\rho(X,X') [\rho(X,X') - \rho_H^2]} \cong 0.44.
$$

Thus, if the test contains 50 items, at least $0.44(50) = 22$ external common items should be used.

If internal common items are to be used, go to Result 4; that is, the relative length of the set of common items is

$$
k \ge \frac{\rho_H^2 [1 - \rho(X,X')]}{1 - \rho_H^2 \rho(X,X')} \cong 0.17.
$$

Similarly, for a test including 50 items, at least $0.17(50) = 8.5 \cong 9$ internal common items are necessary. Tables 5 and 6 provide estimates of $k$ using external and internal sets of common items at various combinations of sample sizes, $N_{tot}$, and degree of precision, $u$, where reliability is 0.8, and z-scores range from -3 to 3.

5 Simulation

In this study, two major assumptions are made in order to derive a simplified form for estimating SEE for the chained linear equating method and a practical process for directly estimating the number of common items needed to achieve desired equating precision. These assumptions are discussed in more detail in this section. Two separate simulation studies are presented.
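Looking back at the Section 4 example, the criterion $\rho_H^2$ and the lower bounds on $k$ for Results 2 and 4 can be sketched as follows. The function names are hypothetical, the criterion is written as the positive root of the quadratic obtained by setting Equation 17 equal to $u^2$, and the sketch assumes Steps 1 and 2 have already led to Step 3, so that the largest $z_i^2$ exceeds the Step 2 criterion.

```python
import math


def rho_h_squared(n_tot, u, z_max):
    """Smallest rho^2(X,V) that meets the target SEE u at the extreme z-score."""
    criterion = (n_tot * u**2 - 8.0) / 4.0
    z2 = z_max**2
    rho_h = (-1.0 + math.sqrt(1.0 + z2 * (z2 - criterion))) / z2
    return rho_h**2


def min_relative_length(rel, n_tot, u, z_max, internal=False):
    """Lower bound on k, or None when no external set can reach the target."""
    h = rho_h_squared(n_tot, u, z_max)
    if internal:
        return h * (1.0 - rel) / (1.0 - h * rel)      # Result 4
    if rel <= h:
        return None                                   # Result 3
    return h * (1.0 - rel) / (rel * (rel - h))        # Result 2


# Values from the example: reliability 0.8, N_tot = 2,000, u = 0.1, |z| up to 3.
for internal in (False, True):
    k = min_relative_length(0.8, 2000, 0.1, 3.0, internal)
    print(internal, round(k, 2), math.ceil(50 * k))
```

Rounding the item count up gives 22 external or 9 internal common items for a 50-item test, matching the example.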
Simulation Study 1 focuses on the accuracy of simplified random error estimation for the chained linear equating when two major assumptions are violated to varying degrees. Simulation Study 2 provides some modifications for using Tables 5 and 6 to choose appropriate numbers of common items when the two major assumptions are violated.

The first assumption is that the two groups used for equating are equivalent in ability, and in particular, that the means and variances of scores on the common items in the two groups are identical. Namely, $\mu_1(V) = \mu_2(V)$ and

$\sigma_1^2(V) = \sigma_2^2(V)$. This assumption is very useful in deriving a simplified form for estimating random error variance for the chained linear equating based on two equating links $X \rightarrow V$ and $V \rightarrow Y$. However, in many situations where common items are involved, the groups taking different forms differ from one another by varying amounts. The group equivalence assumption can be violated slightly or dramatically. In the simulation studies, an effect size parameter is defined to reflect the group difference,

$$
ES = \frac{\mu_1(V) - \mu_2(V)}{\sqrt{\dfrac{N_1 \sigma_1^2(V) + N_2 \sigma_2^2(V)}{N_1 + N_2}}},
\tag{18}
$$

where all the parameters involved in the equation are defined in previous sections. When groups are equivalent in terms of equal means, $ES$ is exactly zero. Otherwise, a larger $ES$ is associated with greater group differences. When the sample sizes taking two forms are identical, Equation 18 is simplified to Equation 19 as

$$
ES = \frac{\mu_1(V) - \mu_2(V)}{\sqrt{\dfrac{\sigma_1^2(V) + \sigma_2^2(V)}{2}}}.
\tag{19}
$$

The second assumption is that score distributions are normal. Specifically, scores on test form X and the common items V in Group 1 follow a bivariate normal distribution, and similarly, scores on test form Y and the common items V in Group 2 follow a bivariate normal distribution. The normality assumption is important for developing a simplified form for estimating random error variance for the chained linear equating. In practice, however, the normality assumption might be violated. In the simulation studies, a lognormal distribution and its translated mirror image are applied to simulate positively and negatively skewed score distributions to assess the impact of violation of the normality assumption.

5.1 Simulation Study 1

A crucial equation in this study is a simplified form for estimating the SEE for the chained linear equating method, as shown in Equation 16 or 17.
The only difference between Equations 16 and 17 is whether $Y$ is standardized to have a mean of 0 and a standard deviation of 1.

Continue using the scenario that was introduced and discussed in previous sections. That is, the test contains 50 multiple-choice items, test reliability is 0.8, and the available sample size for equating is $N_{tot} = 2{,}000$, where 1,000 examinees take each form. In Simulation Study 1, the number of common items is fixed at 20, which is 40% of the total test length. Both external and internal sets of common items are considered. The following steps are used to evaluate estimation accuracy of the analytic SEE for the chained linear equating:

1. Take a random sample of size 1,000 from a bivariate distribution of scores on test Form X and common items V in Group 1. Take a random sample of size 1,000 from a bivariate distribution of scores on test Form Y and common items V in Group 2.

2. Equate Form X to Form Y using the chained linear equating method based on these random samples. The estimated equating relationship for replication $r$ is denoted as $\hat{e}_Y^{(r)}(x_i)$.

3. Use Equation 16 to estimate the SEE with statistics based on these random samples replacing parameters, referred to as $\widehat{SEE}^{(r)}(x_i)$.

4. Repeat steps 1-3 $R$ times ($R = 10{,}000$). There are two ways of estimating SEE at every possible Form X raw score, $x_i$, denoted as $\widehat{SEE}_a(x_i)$ and $\widehat{SEE}_b(x_i)$ respectively.

(a) Compute the standard deviation of equated scores as

$$
\widehat{SEE}_a(x_i) = \sqrt{\frac{\sum_{r=1}^{R} \left[ \hat{e}_Y^{(r)}(x_i) - \bar{\hat{e}}_Y(x_i) \right]^2}{R - 1}},
\tag{20}
$$

where $\bar{\hat{e}}_Y(x_i)$ is the average equated score over $R$ replications.

(b) Average $\widehat{SEE}^{(r)}(x_i)$ over $R$ replications, where $\widehat{SEE}^{(r)}(x_i)$ is computed at Step 3,

$$
\widehat{SEE}_b(x_i) = \frac{1}{R} \sum_{r=1}^{R} \widehat{SEE}^{(r)}(x_i).
\tag{21}
$$

For the two estimates of SEE, $\widehat{SEE}_a(x_i)$ is a straightforward estimated standard error of equating obtained by simulation that is not affected by the analytic estimation process. $\widehat{SEE}_b(x_i)$ represents how Equation 16 might be used in practice when values of population parameters are unknown.

Different degrees of violation of group equivalence and normality assumptions are reflected in the characteristics of population score distributions from which Steps 1 and 2 draw samples. For potential group differences, eight levels of $ES$ are considered: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, and 1.0. For potential score distribution shapes, three levels are considered: normal, lognormal, and translated mirror image of lognormal, representing normal, positively skewed, and negatively skewed distributions, respectively. In addition, two types of common items, external and internal, are included.
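Steps 1 through 4 can be sketched for the simplest condition (normal scores, $ES = 0$) as below. This is only an illustration in Python, not the R code used in the study; the standardized score scale, the correlation of 0.73, the reduced number of replications, and all function names are assumptions made for the sketch.

```python
import math
import random
import statistics


def draw_pairs(n, rho, rng):
    """Draw n standardized bivariate-normal (form score, common-item score) pairs."""
    form, common = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        form.append(z1)
        common.append(rho * z1 + math.sqrt(1.0 - rho**2) * z2)
    return form, common


def correlation(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cross = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cross / math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))


def one_replication(x_i, n, rho, rng):
    """Steps 1-3: sample, equate x_i by chained linear equating, apply Equation 16."""
    x, v1 = draw_pairs(n, rho, rng)                  # Group 1 takes Form X
    y, v2 = draw_pairs(n, rho, rng)                  # Group 2 takes Form Y
    mx, sx = statistics.fmean(x), statistics.stdev(x)
    my, sy = statistics.fmean(y), statistics.stdev(y)
    mv1, sv1 = statistics.fmean(v1), statistics.stdev(v1)
    mv2, sv2 = statistics.fmean(v2), statistics.stdev(v2)
    z_i = (x_i - mx) / sx
    v_i = mv1 + sv1 * z_i                            # X -> V in Group 1
    equated = my + sy * (v_i - mv2) / sv2            # V -> Y in Group 2
    r_xv = correlation(x, v1)
    analytic = math.sqrt(4.0 * sy**2 * (1.0 - r_xv) / (2 * n)
                         * (2.0 + (1.0 + r_xv) * z_i**2))
    return equated, analytic


def simulate(x_i, n=1000, rho=0.73, reps=200, seed=1):
    """Step 4: return (SEE_a, SEE_b) as in Equations 20 and 21."""
    rng = random.Random(seed)
    results = [one_replication(x_i, n, rho, rng) for _ in range(reps)]
    see_a = statistics.stdev([e for e, _ in results])
    see_b = statistics.fmean([s for _, s in results])
    return see_a, see_b
```

Calling `simulate(0.0)` returns the empirical and analytic estimates at the Form X mean. Under these assumed conditions the two values should be close, which is the comparison the simulation study makes across its 48 conditions.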
In total, there are 8 × 3 × 2 = 48 combinations of conditions. The replication process runs separately for each condition by using R (R Development Core Team, 2005).

5.2 Simulation Results 1

Parameters for the simulation are summarized in Table 7. When a positively skewed distribution is used, the skewness of scores on Form X in Group 1 is approximately 0.77, and skewness of scores on Form Y in Group 2 varies from

0.77 to 1.10 as the difference between the two populations increases from 0 to 1.0 in terms of ES. When a negatively skewed distribution is used, only the direction of skewness changes to negative.

The accuracy of analytic SEE estimation is evaluated by examining the difference between the empirical SEE, $\widehat{SEE}_a(x_i)$, and the analytic SEE, $\widehat{SEE}_b(x_i)$, along the Form X z-score scale, where $z_i = [x_i - \mu(X)]/\sigma(X)$. The difference (analytic minus empirical) is expressed in standard deviation units. The ideal value is 0, indicating that the analytic SEE estimation led to the same result as the empirical SEE. Positive values indicate "overestimation" by the analytic procedure, and negative values indicate "underestimation" by the analytic procedure. The greater the deviation from zero, the less accurate the analytic SEE estimation.

As shown in Figures 4 to 6, within each distribution condition, larger ES is associated with more "bias" in analytic estimation compared to empirical estimation. When the normality assumption holds, the effect of group difference on the accuracy of analytic SEE estimation is not very large. Even when the group difference is as large as ES = 1.0, the difference between analytic SEE and empirical SEE is still within -0.02 to 0.02 standard deviation units. However, when score distributions are positively or negatively skewed, the accuracy of analytic SEE is reduced, especially when groups differ substantially in ability. Holding the level of group difference and the shape of score distributions constant, the use of an internal set of common items tends to produce more accurate analytic SEE estimation than the use of an external set of common items.

Figure 7 focuses on the effect of the shape of score distributions on the accuracy of analytic SEE estimation. For the six conditions shown in the figure, groups taking Forms X and Y are equivalent, because ES = 0.
Score distributions are either normal, positively skewed, or negatively skewed. Both external and internal common items are considered. When score distributions are normal, the analytic SEE is very close to the SEE estimated empirically. However, when score distributions are skewed, analytic estimation tends to overestimate the SEE for scores near the middle and underestimate the SEE for scores near either extreme. Using an internal set of common items generally produces more accurate analytic estimation than using an external set of common items.

5.3 Simulation Study 2

Tables 5 and 6 in the previous section can provide test developers some practical guidance in choosing the number of common items under certain conditions. Estimates in these tables are obtained by following the direct estimation procedure, which is also based on the group equivalence and normality assumptions. Simulation Study 2 intends to examine and modify estimates of the relative length of the set of common items, k, when these two assumptions are violated by different amounts.

Three levels of group difference are considered: ES = 0, ES = 0.2, and ES = 0.5. Three shapes of score distribution are considered: normal, moderately positively skewed, and extremely positively skewed. A lognormal transformation is used to simulate the extremely skewed condition where

the skewness is approximately from 0.7 to 0.8. A hybrid lognormal and normal transformation is used to simulate the moderately skewed condition where the skewness is around 0.2 to 0.3. Results for positively skewed distributions are generalizable to those for negatively skewed distributions. Both external and internal common items are used. In total, 3 × 3 × 2 = 18 combinations of conditions are displayed.

Similar to Simulation Study 1, the test still consists of 50 multiple-choice items, test reliability is 0.8, and the available sample size for equating is $N_{tot} = 2{,}000$. The target SEE, $u$, in terms of numbers of standard deviation units is fixed to be 0.10, and z-scores of interest range from -3 to 3. The following steps are followed:

1. Initialize $k$ using the value from Table 5 or 6, depending on which type of common items is used (see footnote 1).

2. Run $R$ replications ($R = 10{,}000$) as described in Simulation Study 1. Only compute the empirical SEE, $\widehat{SEE}_a(x_i)$, where $z_i = [x_i - \mu(X)]/\sigma(X)$ ranges from -3 to 3.

3. Compare $\widehat{SEE}_a(x_i)$ with $u\sigma(Y)$. If Table 5 or 6 works perfectly, the SEE at the z-scores of interest, which range from -3 to 3, should fall below $u$ standard deviation units, which is $u\sigma(Y)$. Otherwise, because the SEE decreases as $k$ increases, a larger value of $k$ will be considered.

(a) If $\max_{-3 \le z_i \le 3} \widehat{SEE}_a(x_i) < u\sigma(Y)$, stop and report $k$ as the modified relative length of the set of common items.

(b) If $\max_{-3 \le z_i \le 3} \widehat{SEE}_a(x_i) \ge u\sigma(Y)$, add 0.02 to the current $k$ (i.e., use 50 × 0.02 = 1 additional common item). For an internal set of common items, if the updated $k$ exceeds 1.0, stop and report an error message. Otherwise, go back to Step 2.

4. Repeat Steps 1-3 $T$ times ($T = 20$) to stabilize the estimation of $k$.

The replication process was conducted using R (R Development Core Team, 2005).
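The search in Step 3 can be sketched in Python as follows. This is a hedged illustration, not the authors' procedure: to keep it short and deterministic, the simulated empirical SEE of Step 2 is replaced by the analytic SEE implied by Equations 11, 12, and 22, and everything is expressed in SD units so that $\sigma(Y)$ cancels. The names `rho_xv`, `max_see`, and `modified_items`, and the `max_items` safety stop, are hypothetical.

```python
import math

def rho_xv(k, rho_xx, internal):
    """Correlation between total and common-item scores (Equations 11 and 12)."""
    numerator = k if internal else k * rho_xx**2
    return math.sqrt(numerator / (1.0 + (k - 1.0) * rho_xx))

def max_see(k, rho_xx=0.8, n_tot=2000, z_max=3.0, internal=False):
    """Stand-in for Step 2: largest SEE (SD units) over -z_max <= z <= z_max.

    Assumption: the analytic SEE is used instead of the simulated one.
    The SEE grows with |z|, so the maximum is attained at |z| = z_max.
    """
    r = rho_xv(k, rho_xx, internal)
    return math.sqrt(4.0 * (1.0 - r) * (2.0 + (1.0 + r) * z_max**2) / n_tot)

def modified_items(start_items, n_items=50, u=0.10, internal=False,
                   see_fn=max_see, max_items=500):
    """Step 3: add one common item (0.02 in k for 50 items) until SEE < u."""
    items = start_items
    while see_fn(items / n_items, internal=internal) >= u:
        items += 1
        if internal and items > n_items:
            raise ValueError("updated k exceeds 1.0 for internal common items")
        if items > max_items:  # hypothetical safety stop for unreachable targets
            raise ValueError("target SEE not reached")
    return items

external_items = modified_items(15)               # start below the Table 5 value
internal_items = modified_items(5, internal=True)  # start below the Table 6 value
print(external_items / 50, internal_items / 50)
```

Counting whole items instead of adding 0.02 to a floating-point k avoids accumulated rounding error. Passing a simulation-based function as `see_fn` would turn the sketch into the empirical version described above.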
Footnote 1. To decide the initial value of k for some conditions, instead of using Table 5 or 6, which might be inefficient and time-consuming, several additional simulation runs were done first to gather a rough idea of the relationship between k and empirical SEE (similar to Simulation Study 1).

5.4 Simulation Results 2

Tables 8 and 9 contain summaries of modified k values and corresponding numbers of common items needed under the 18 different simulation conditions. Figure 8 directly illustrates the results shown in Table 9. As expected, when the two major assumptions both hold, values from Tables 5 and 6 lead to the target SEE. When a set of external common items is used, as the condition moves from

ideal (i.e., ES = 0, normal) to extreme (i.e., ES = 0.5, extremely skewed), the number of common items that is necessary to meet the target SEE steadily increases. When a set of internal common items is used, violation of the normality assumption tends to affect the number of common items needed more dramatically than violation of the group equivalence assumption. For example, when the normality assumption holds, even when the two groups differ from each other by as much as ES = 0.5, only two additional common items are needed to achieve the target SEE. In general, compared to the use of a set of external common items, the use of a set of internal common items leads to more stable estimation of the number of common items necessary, although violation of both assumptions still requires approximately twice as many common items to achieve the same target SEE.

6 Discussion

Most previous studies on numbers of common items only provided exploratory results concerning the relationship between the numbers of common items and equating precision. These results were applicable to specific tests, situations, and limited conditions of lengths for the common items. In this study a novel way of understanding the relationship between the number of common items and equating precision is provided, by combining the classical congeneric model with analytic standard errors derived by the delta method. This study describes a process, along with some figures and tables, that can be used by test developers to choose the length of the common item set that leads to the desired equating precision under various real test situations.
For both external and internal common items, the relationship between test score reliability for the total score on a test, the effective test length of the common items, and the correlation between test scores and scores on common items was derived analytically in this study using the classical congeneric model. These relationships show clearly that as reliability and effective test length for the common items increase, the correlation between total test and common item scores increases. These derivations were used to illustrate how the standard error of equating for chained linear equating is related directly to test reliability and to the effective test length of the common items. In addition, a process was developed to estimate the number of common items needed for a specified degree of equating precision for chained linear equating.

Two points are worth noting when the estimated lower bound for the number of common items is used in practice. Theoretically, the estimated lower bound of the relative length of common items provided in this study is always a ratio of effective test lengths, $k = \lambda_V/\lambda_X$. In some situations, this ratio can be viewed, approximately, as the ratio of actual numbers of items, as described in the examples in this study. In other situations, the relationship between the ratio of effective test lengths and the ratio of actual test lengths needs careful consideration, such as when a test contains both multiple-choice and constructed-response questions. In addition, as k approaches 0, the two test forms X and Y share few, if any,

items in common. However, small values of k might lead to some problems in applying these methods. One such problem is that, when the number of common items is very small, content and statistical representativeness are difficult to maintain. In practice, the number of common items should, at a minimum, be large enough to adequately represent the content of the total test.

Two simulation studies were used to empirically check the accuracy of the simplified form for estimating SEE for the chained linear equating method and the process for directly estimating the number of common items needed to achieve desired equating precision when the group equivalence and normality assumptions are violated. It appears that violation of the normality assumption can lead to the analytic SEE being a substantial underestimate of the SEE and to the number of common items required to meet target precision being substantially underestimated by the simplified procedure. When the normality assumption holds and the group differences are greater than zero, the analytic SEE is a slight underestimate of the SEE and the number of common items required to meet target precision is slightly underestimated. Overall, it appears that the simplified process described in this study for estimating the SEE and the number of common items needed to meet the target precision is reasonably accurate when the scores are close to being normally distributed and the group differences are not large.

7 Appendix

Larger numbers of common items generally provide greater equating precision. Thus, estimation of the lower bound of the relative length of common items required by the specified SEE is important. The following derivation consists of two parts. First, the expected correlation between the scores on the total test and scores on the common items, $\rho(X,V)$, given the specified situation is evaluated.
Next, the lower bound of the relative length of the common item set, $k$, under the classical congeneric model is estimated. Various equating methods may perform differently in terms of random error. The chained linear method is examined in this study.

7.1 Evaluating the Expected Correlation, $\rho(X,V)$

Let $u$ index the target SEE in terms of numbers of standard deviation units such that $\mathrm{var}[\hat{e}_Y(x_i)] \le u^2\sigma^2(Y)$. Specifically, according to Equation 17, when $Y$ is standardized to have a mean of 0 and a standard deviation of 1,

$$
\begin{aligned}
\mathrm{var}[\hat{e}_Y(x_i)] \le u^2
&\Leftrightarrow \frac{4[1-\rho(X,V)]}{N_{tot}}\left\{2 + [1+\rho(X,V)]z_i^2\right\} \le u^2 \\
&\Leftrightarrow \frac{1}{N_{tot}}\left\{-4z_i^2\rho^2(X,V) - 8\rho(X,V) + (8 + 4z_i^2)\right\} \le u^2 \\
&\Leftrightarrow \frac{z_i^2}{2}\rho^2(X,V) + \rho(X,V) + \frac{(u^2N_{tot} - 8) - 4z_i^2}{8} \ge 0. 
\end{aligned}
\tag{22}
$$
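As a numerical sanity check on the algebra in Equation 22, the following Python sketch evaluates both ends of the chain of inequalities. It relies only on the expressions above: the left-hand side of the final line should equal $(N_{tot}/8)\,\{u^2 - \mathrm{var}[\hat{e}_Y(x_i)]\}$, so the two inequalities agree in sign. The function names `see_squared` and `quadratic_lhs` are hypothetical.

```python
def see_squared(rho_xv, z, n_tot):
    """First line of Equation 22: var of the equated score with Y standardized."""
    return 4.0 * (1.0 - rho_xv) * (2.0 + (1.0 + rho_xv) * z**2) / n_tot

def quadratic_lhs(rho_xv, z, u, n_tot):
    """Left-hand side of the final line of Equation 22."""
    return (z**2 / 2.0) * rho_xv**2 + rho_xv + ((u**2 * n_tot - 8.0) - 4.0 * z**2) / 8.0

# Largest discrepancy between the two forms over a small grid of inputs.
max_gap = 0.0
for rho in (0.1, 0.5, 0.7, 0.9):
    for z in (0.0, 1.0, 3.0):
        for u in (0.05, 0.10, 0.30):
            for n in (500, 2000):
                direct = (n / 8.0) * (u**2 - see_squared(rho, z, n))
                max_gap = max(max_gap, abs(direct - quadratic_lhs(rho, z, u, n)))
print(max_gap)
```

A gap at the level of floating-point rounding indicates that the quadratic form is an exact rearrangement of the variance inequality, not an approximation.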

The final expression in Equation 22 is a quadratic inequality in $\rho(X,V)$ when $z_i \ne 0$. The purpose of this step is to determine possible values of $\rho(X,V)$ which make the inequality in Equation 22 true. The relationship between the discriminant and the roots of a quadratic function plays an important role. The discriminant of the quadratic in Equation 22 is

$$\Delta_1 = 1 - 4\cdot\frac{z_i^2}{2}\cdot\frac{(u^2N_{tot} - 8) - 4z_i^2}{8} = z_i^4 - \frac{u^2N_{tot} - 8}{4}z_i^2 + 1. \tag{23}$$

Note that $\Delta_1$ is another quadratic function of $z_i^2$, and its discriminant is

$$\Delta_2 = \left(\frac{u^2N_{tot} - 8}{4}\right)^2 - 4\cdot 1\cdot 1 = \frac{u^2N_{tot}(u^2N_{tot} - 16)}{16}. \tag{24}$$

Three conditions are discussed to explore possible values of $\rho(X,V)$.

Condition 1: $u^2N_{tot} > 16$. Under this condition, $\Delta_2$ is always positive, and consequently, $\Delta_1 = 0$ has two distinct roots,

$$(z_i^2)_L \equiv \frac{\frac{u^2N_{tot} - 8}{4} - \sqrt{\Delta_2}}{2} = \frac{(u^2N_{tot} - 8) - 4\sqrt{\Delta_2}}{8}, \tag{25}$$

and

$$(z_i^2)_H \equiv \frac{\frac{u^2N_{tot} - 8}{4} + \sqrt{\Delta_2}}{2} = \frac{(u^2N_{tot} - 8) + 4\sqrt{\Delta_2}}{8}. \tag{26}$$

It can be shown that both $(z_i^2)_L$ and $(z_i^2)_H$ are positive. Furthermore, if $(z_i^2)_L \le z_i^2 \le (z_i^2)_H$, then $\Delta_1 \le 0$ and the inequality in Equation 22 is always true. Otherwise, if $0 < z_i^2 < (z_i^2)_L$ or $z_i^2 > (z_i^2)_H$, then $\Delta_1 > 0$ and the quadratic in Equation 22 has two distinct roots,

$$\rho_L \equiv \frac{-1 - \sqrt{\Delta_1}}{z_i^2}, \tag{27}$$

and

$$\rho_H \equiv \frac{-1 + \sqrt{\Delta_1}}{z_i^2}. \tag{28}$$

The inequality in Equation 22 holds only if $\rho(X,V) \le \rho_L$ or $\rho(X,V) \ge \rho_H$. Since $\rho_L$ is always negative whereas the correlation between the scores on the total test and scores on the common items, $\rho(X,V)$, is expected to be positive for a well-designed test, possible values of $\rho(X,V)$ should be no less than $\rho_H$. The sign of $\rho_H$ decides whether this requirement is trivial. Specifically,

$$
\begin{aligned}
\rho_H \equiv \frac{-1 + \sqrt{\Delta_1}}{z_i^2} > 0
&\Leftrightarrow -1 + \sqrt{\Delta_1} > 0 \\
&\Leftrightarrow \Delta_1 > 1 \\
&\Leftrightarrow z_i^4 - \frac{u^2N_{tot} - 8}{4}z_i^2 > 0 \\
&\Leftrightarrow z_i^2 > \frac{u^2N_{tot} - 8}{4}.
\end{aligned}
\tag{29}
$$

Thus, if $z_i^2 > \frac{u^2N_{tot} - 8}{4}$, then $\rho(X,V)$ needs to be no less than $\rho_H$, which is positive. Otherwise, because $\rho(X,V)$ is expected to be positive, the inequality $\rho(X,V) > 0 \ge \rho_H$ always holds. It can be shown that both $(z_i^2)_L$ and $(z_i^2)_H$ are smaller than $\frac{u^2N_{tot} - 8}{4}$, so the above results with regard to the intervals of $z_i^2$ can be combined and simplified.

In sum, under Condition 1, the specified precision is obtained for $-\frac{\sqrt{u^2N_{tot} - 8}}{2} \le z_i \le \frac{\sqrt{u^2N_{tot} - 8}}{2}$ as long as the test is well developed. For either $z_i < -\frac{\sqrt{u^2N_{tot} - 8}}{2}$ or $z_i > \frac{\sqrt{u^2N_{tot} - 8}}{2}$, the specified precision can only be achieved when the expected correlation between the scores on the total test and scores on the common items, $\rho(X,V)$, is no less than $\rho_H$.

Condition 2: $8 < u^2N_{tot} \le 16$. Under this condition, $\Delta_2$ is no longer positive. As a result, $\Delta_1 \ge 0$ and the quadratic in Equation 22 always has the roots $\rho_L$ and $\rho_H$. Again, for $-\frac{\sqrt{u^2N_{tot} - 8}}{2} \le z_i \le \frac{\sqrt{u^2N_{tot} - 8}}{2}$, the equating precision does not depend heavily on slight changes in $\rho(X,V)$ as long as the test is well developed, whereas for $z_i < -\frac{\sqrt{u^2N_{tot} - 8}}{2}$ or $z_i > \frac{\sqrt{u^2N_{tot} - 8}}{2}$, the specified precision requires $\rho(X,V) \ge \rho_H$.

Condition 3: $0 < u^2N_{tot} \le 8$. As in Condition 2, the quadratic in Equation 22 has two distinct roots, and the sign of $\rho_H$ directly affects the possible values of $\rho(X,V)$. However, as $u^2N_{tot} \le 8$, the final inequality in Equation 29 is always true for $z_i \ne 0$, such that $\rho_H > 0$. Thus, to provide the specified equating precision, $\rho(X,V)$ needs to be no less than $\rho_H$.

Next, corresponding relative lengths of the common item set are estimated.
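The case analysis above reduces to a single computation of the smallest admissible correlation, which can be sketched in Python as follows. The function name `rho_h` is hypothetical, and returning 0.0 to signal that any positive correlation suffices is a convention of this sketch, not of the derivation.

```python
import math

def rho_h(z, u, n_tot):
    """Smallest rho(X,V) satisfying Equation 22.

    Returns 0.0 when the inequality holds for every positive correlation.
    """
    a = u**2 * n_tot - 8.0
    if z == 0:
        # Equation 22 is linear in rho(X,V): rho + a / 8 >= 0.
        return max(0.0, -a / 8.0)
    delta1 = z**4 - (a / 4.0) * z**2 + 1.0      # Equation 23
    if delta1 <= 0:
        return 0.0                               # quadratic is never negative
    return max(0.0, (-1.0 + math.sqrt(delta1)) / z**2)  # Equation 28

# Required correlation at z = 3 for u = 0.10 and N_tot = 2,000.
print(rho_h(3.0, 0.10, 2000))
```

Because the requirement tightens as |z| grows, evaluating `rho_h` at the largest |z| of interest gives the binding value for a whole score range such as -3 to 3.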
7.2 Estimating the relative length of common items, $k$

According to the discussion in the previous subsection, sometimes the target equating precision can be achieved as long as the test forms are well constructed, whereas other times, $\rho(X,V)$ needs to be no less than $\rho_H$, which is a positive value once test reliability, the sample size available, the degree of precision desired, and the standardized score range have been clearly specified. Estimation

of $k$ for the three different conditions defined in the previous subsection is very similar. Using external or internal common items always leads to different estimates.

For an external set of common items, substitute Equation 11 in $\rho(X,V) \ge \rho_H$,

$$
\begin{aligned}
\rho(X,V) \ge \rho_H
&\Leftrightarrow \rho^2(X,V) \ge \rho_H^2 \\
&\Leftrightarrow \frac{k\rho^2(X,X')}{1 + (k-1)\rho(X,X')} \ge \rho_H^2 \\
&\Leftrightarrow \rho(X,X')[\rho(X,X') - \rho_H^2]\,k \ge \rho_H^2[1 - \rho(X,X')].
\end{aligned}
\tag{30}
$$

Note that if $\rho(X,X') \le \rho_H^2$, the final inequality in Equation 30 never holds. In other words, in some situations where $\rho(X,X') \le \rho_H^2$, the specified equating precision cannot be obtained no matter how many common items are included. The test developer needs to redesign the test or may be able to modify the situation, such as by increasing the sample size. If $\rho(X,X') > \rho_H^2$, the lower bound of the relative length of the set of common items is

$$k \ge \frac{\rho_H^2[1 - \rho(X,X')]}{\rho(X,X')[\rho(X,X') - \rho_H^2]}. \tag{31}$$

For an internal set of common items, substitute Equation 12 in $\rho(X,V) \ge \rho_H$,

$$
\begin{aligned}
\rho(X,V) \ge \rho_H
&\Leftrightarrow \rho^2(X,V) \ge \rho_H^2 \\
&\Leftrightarrow \frac{k}{1 + (k-1)\rho(X,X')} \ge \rho_H^2 \\
&\Leftrightarrow k \ge \frac{\rho_H^2[1 - \rho(X,X')]}{1 - \rho_H^2\rho(X,X')}.
\end{aligned}
\tag{32}
$$

In theory, the internal set of common items can be lengthened to be the total test form eventually, so it is not surprising that any specified equating precision can be provided.

Although the mathematical procedures dealing with Condition 1 and Condition 2 are different, the final results turn out to be identical, so these two conditions are combined into a single condition, $u^2N_{tot} > 8$, for simplicity.

If $z_i = 0$, the inequality in Equation 22 is linear rather than quadratic. The whole procedure is analogous but can be simplified. The same conditions are considered.

Condition 1: $u^2N_{tot} > 8$. Under this condition, the equating precision does not heavily depend on varying numbers of common items for a well-developed test.

Condition 2: $0 < u^2N_{tot} \le 8$. Under this condition, if external common items are used and reliability of the total test exceeds $\frac{(8 - u^2N_{tot})^2}{64}$, the specified precision is satisfied by choosing

$$k \ge \frac{(8 - u^2N_{tot})^2[1 - \rho(X,X')]}{\rho(X,X')[64\rho(X,X') - (8 - u^2N_{tot})^2]}. \tag{33}$$

If internal common items are used, the precision is always achieved by choosing

$$k \ge \frac{(8 - u^2N_{tot})^2[1 - \rho(X,X')]}{64 - (8 - u^2N_{tot})^2\rho(X,X')}. \tag{34}$$

8 References

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education.

Braun, H. I., & Holland, P. W. (1982). Observed-score test equating: A mathematical analysis of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 9–49). New York: Academic.

Brennan, R. L. (2006). Chained linear equating (CASMA Technical Note No. 3). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: Macmillan.

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 187–220). Westport, CT: Praeger.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.

Lord, F. M. (1950). Notes on comparable scales for test scores (Research Bulletin 5048). Princeton, NJ: Educational Testing Service.

Puhan, G. (2010). A comparison of chained linear and poststratification linear equating under different testing conditions. Journal of Educational Measurement, 47, 54–75.

R Development Core Team (2005). R: A language and environment for statistical computing, reference index version 2.1.1. [Computer software]. Vienna, Austria: R Foundation for Statistical Computing.

Ricker, K. L., & von Davier, A. A. (2007). The impact of anchor test length on equating results in a nonequivalent groups design (ETS Research Report 07-44). Princeton, NJ: Educational Testing Service.

Wang, T., Lee, W., Brennan, R. L., & Kolen, M. J. (2008). A comparison of the frequency estimation and chained equipercentile methods under the common-item nonequivalent groups design. Applied Psychological Measurement, 32, 632–651.

Yang, W., & Houang, R. T. (1996, April). The effect of anchor length and equating method on the accuracy of test equating: Comparisons of linear and IRT-based equating using an anchor-item design. Paper presented at the annual meeting of the American Educational Research Association, New York.

Table 1: $\rho^2(X,V)$ as a function of $\rho(X,X')$ and $k$ using external common items

                                   ρ(X,X')
k = λ_V/λ_X   0.70     0.75     0.80     0.85     0.90     0.95     0.99
0.10          0.1324   0.1731   0.2286   0.3075   0.4263   0.6224   0.8992
0.20          0.2227   0.2813   0.3556   0.4516   0.5786   0.7521   0.9424
0.30          0.2882   0.3553   0.4364   0.5352   0.6568   0.8082   0.9578
0.40          0.3379   0.4091   0.4923   0.5898   0.7044   0.8395   0.9656
0.50          0.3769   0.4500   0.5333   0.6283   0.7364   0.8595   0.9704
0.60          0.4083   0.4821   0.5647   0.6568   0.7594   0.8734   0.9736
0.70          0.4342   0.5081   0.5895   0.6789   0.7767   0.8836   0.9759
0.80          0.4558   0.5294   0.6095   0.6964   0.7902   0.8914   0.9777
0.90          0.4742   0.5473   0.6261   0.7107   0.8011   0.8975   0.9790
1.00          0.4900   0.5625   0.6400   0.7225   0.8100   0.9025   0.9801

Table 2: $\rho^2(X,V)$ as a function of $\rho(X,X')$ and $k$ using internal common items

                                   ρ(X,X')
k = λ_V/λ_X   0.70     0.75     0.80     0.85     0.90     0.95     0.99
0.10          0.2703   0.3077   0.3571   0.4255   0.5263   0.6897   0.9174
0.20          0.4546   0.5000   0.5556   0.6250   0.7143   0.8333   0.9615
0.30          0.5882   0.6316   0.6818   0.7407   0.8108   0.8955   0.9772
0.40          0.6897   0.7273   0.7692   0.8163   0.8696   0.9302   0.9852
0.50          0.7692   0.8000   0.8333   0.8696   0.9091   0.9524   0.9901
0.60          0.8333   0.8571   0.8824   0.9091   0.9375   0.9677   0.9934
0.70          0.8861   0.9032   0.9211   0.9396   0.9589   0.9790   0.9957
0.80          0.9302   0.9412   0.9524   0.9639   0.9756   0.9877   0.9975
0.90          0.9677   0.9730   0.9783   0.9836   0.9890   0.9945   0.9989
1.00          1.0000   1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
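The entries of Tables 1 and 2 follow from the two closed forms that appear as the second lines of Equations 30 and 32. A short Python sketch, with the hypothetical function names `rho2_external` and `rho2_internal`, reproduces a few cells:

```python
def rho2_external(k, rho_xx):
    """Squared correlation of X and V for external common items (Equation 11)."""
    return k * rho_xx**2 / (1.0 + (k - 1.0) * rho_xx)

def rho2_internal(k, rho_xx):
    """Squared correlation of X and V for internal common items (Equation 12)."""
    return k / (1.0 + (k - 1.0) * rho_xx)

# A few cells for reliability 0.80, printed to four decimals as in the tables.
for k in (0.10, 0.50, 1.00):
    print(k, round(rho2_external(k, 0.80), 4), round(rho2_internal(k, 0.80), 4))
```

At k = 1.00 the external value equals the squared reliability and the internal value equals 1, which matches the last rows of the two tables.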

Table 3: Estimated standard errors of equating using external common items ($\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$)

                                     z_i
k = λ_V/λ_X   0        ±0.5     ±1.0     ±1.5     ±2.0     ±2.5     ±3.0
0.10          0.0457   0.0497   0.0603   0.0746   0.0909   0.1083   0.1264
0.20          0.0402   0.0440   0.0539   0.0672   0.0823   0.0983   0.1150
0.30          0.0369   0.0405   0.0499   0.0624   0.0766   0.0917   0.1073
0.40          0.0346   0.0380   0.0470   0.0590   0.0725   0.0868   0.1017
0.50          0.0329   0.0362   0.0449   0.0564   0.0694   0.0831   0.0974
0.60          0.0315   0.0348   0.0432   0.0543   0.0669   0.0802   0.0940
0.70          0.0305   0.0337   0.0418   0.0527   0.0649   0.0779   0.0912
0.80          0.0296   0.0328   0.0407   0.0513   0.0633   0.0759   0.0889
0.90          0.0289   0.0320   0.0398   0.0502   0.0619   0.0742   0.0870
1.00          0.0283   0.0313   0.0390   0.0492   0.0607   0.0728   0.0853

Table 4: Estimated standard errors of equating using internal common items ($\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$)

                                     z_i
k = λ_V/λ_X   0        ±0.5     ±1.0     ±1.5     ±2.0     ±2.5     ±3.0
0.10          0.0401   0.0439   0.0538   0.0671   0.0822   0.0982   0.1148
0.20          0.0319   0.0352   0.0437   0.0549   0.0676   0.0811   0.0950
0.30          0.0264   0.0293   0.0365   0.0461   0.0569   0.0684   0.0802
0.40          0.0222   0.0246   0.0309   0.0391   0.0484   0.0581   0.0682
0.50          0.0187   0.0208   0.0261   0.0331   0.0410   0.0493   0.0579
0.60          0.0156   0.0174   0.0219   0.0278   0.0344   0.0414   0.0486
0.70          0.0127   0.0142   0.0179   0.0227   0.0282   0.0339   0.0398
0.80          0.0098   0.0110   0.0138   0.0176   0.0219   0.0263   0.0309
0.90          0.0066   0.0074   0.0093   0.0119   0.0148   0.0178   0.0209
1.00          0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000

Table 5: Estimated lower bound of relative length of the external set of common items ($\rho(X,X') = 0.8$, $-3 \le z_i \le 3$)

                            u
N_tot   0.05   0.10   0.15   0.20   0.25   ≥0.30
500     -      -      -      0.44   0.11   <.01
550     -      -      -      0.34   0.07   <.01
600     -      -      -      0.27   0.04   <.01
650     -      -      0.99   0.21   0.02   <.01
700     -      -      0.81   0.17   <.01   <.01
750     -      -      0.67   0.13   <.01   <.01
800     -      -      0.57   0.10   <.01   <.01
850     -      -      0.49   0.07   <.01   <.01
900     -      -      0.42   0.05   <.01   <.01
950     -      -      0.37   0.03   <.01   <.01
1000    -      -      0.32   0.02   <.01   <.01
1200    -      -      0.19   <.01   <.01   <.01
1400    -      -      0.11   <.01   <.01   <.01
1600    -      0.78   0.05   <.01   <.01   <.01
1800    -      0.57   0.02   <.01   <.01   <.01
2000    -      0.44   <.01   <.01   <.01   <.01
2200    -      0.34   <.01   <.01   <.01   <.01
2400    -      0.27   <.01   <.01   <.01   <.01
2600    -      0.21   <.01   <.01   <.01   <.01
2800    -      0.17   <.01   <.01   <.01   <.01
3000    -      0.13   <.01   <.01   <.01   <.01

Note. Blank areas (shown here as -) indicate that the precision target can never be achieved regardless of the number of common items included. Some bounds may be too low to maintain content and statistical representativeness and need to be used with caution.

Table 6: Estimated lower bound of relative length of the internal set of common items ($\rho(X,X') = 0.8$, $-3 \le z_i \le 3$)

                            u
N_tot   0.05   0.10   0.15   0.20   0.25   ≥0.30
500     0.86   0.58   0.34   0.17   0.06   <.01
550     0.85   0.56   0.31   0.15   0.04   <.01
600     0.84   0.53   0.28   0.12   0.02   <.01
650     0.83   0.51   0.26   0.10   0.01   <.01
700     0.81   0.49   0.24   0.09   <.01   <.01
750     0.80   0.47   0.22   0.07   <.01   <.01
800     0.79   0.45   0.20   0.06   <.01   <.01
850     0.78   0.43   0.18   0.04   <.01   <.01
900     0.77   0.41   0.17   0.03   <.01   <.01
950     0.76   0.39   0.15   0.02   <.01   <.01
1000    0.75   0.38   0.14   0.01   <.01   <.01
1200    0.71   0.32   0.09   <.01   <.01   <.01
1400    0.68   0.27   0.06   <.01   <.01   <.01
1600    0.64   0.23   0.03   <.01   <.01   <.01
1800    0.61   0.20   0.01   <.01   <.01   <.01
2000    0.58   0.17   <.01   <.01   <.01   <.01
2200    0.56   0.15   <.01   <.01   <.01   <.01
2400    0.53   0.12   <.01   <.01   <.01   <.01
2600    0.51   0.10   <.01   <.01   <.01   <.01
2800    0.49   0.09   <.01   <.01   <.01   <.01
3000    0.47   0.07   <.01   <.01   <.01   <.01

Note. Some bounds may be too low to maintain content and statistical representativeness and need to be used with caution.
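The lower bounds in Tables 5 and 6 can be recomputed from Equations 23, 28, 31, and 32, evaluated at the largest |z| of interest. The Python sketch below is one possible arrangement under that assumption. The function name `lower_bound_k` is hypothetical, it assumes `z_max` is nonzero, and it returns `None` where Equation 30 cannot be satisfied.

```python
import math

def lower_bound_k(u, n_tot, rho_xx=0.8, z_max=3.0, internal=False):
    """Lower bound of k for a target SEE of u SD units over |z| <= z_max."""
    a = u**2 * n_tot - 8.0
    delta1 = z_max**4 - (a / 4.0) * z_max**2 + 1.0            # Equation 23
    if delta1 <= 0:
        rho_h = 0.0
    else:
        rho_h = max(0.0, (math.sqrt(delta1) - 1.0) / z_max**2)  # Equation 28
    r2 = rho_h**2
    if internal:
        return r2 * (1.0 - rho_xx) / (1.0 - r2 * rho_xx)      # Equation 32
    if r2 >= rho_xx:
        return None                 # target unattainable with external items
    return r2 * (1.0 - rho_xx) / (rho_xx * (rho_xx - r2))     # Equation 31

# Row N_tot = 2,000: external and internal bounds for u = 0.10.
print(lower_bound_k(0.10, 2000), lower_bound_k(0.10, 2000, internal=True))
```

For external common items the function can return values above 1.0 (for example at u = 0.15 and N_tot = 600). Table 5 appears to leave such cells blank as well, which is an inference from the tabled values rather than something the note states.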

Table 7: Parameters for Simulation Study 1

Parameter Name                                                          Value

Test information
  Number of items on Form X                                             50
  Number of items on Form Y                                             50
  Relative length of the common item set V, k                           0.4
  Number of common items                                                20
  Reliability coefficient, ρ(X,X') = ρ(Y,Y')                            0.8

Population where Group 1 is from
  Mean score on common items, μ_1(V)                                    10
  Variance of scores on common items, σ_1²(V)                           9
  Mean score on Form X, μ(X) = μ_1(V)/k (a)                             25
  Variance of scores on Form X, σ²(X) [Equation 8]                      43.26923
  Covariance between X and V, σ(X,V) [Equation 9 or 10] (b)             13.84615 or 17.30769

Population where Group 2 is from
  Mean score on common items, μ_2(V) [Equation 19] (c)                  Varies
  Variance of scores on common items, σ_2²(V)                           9
  Mean score on Form Y, μ(Y) = μ_2(V)/k                                 Varies
  Variance of scores on Form Y, σ²(Y) [Equation 8]                      43.26923
  Covariance between Y and V, σ(Y,V) [Equation 9 or 10]                 13.84615 or 17.30769

Sample Size
  Number of examinees taking Form X                                     1,000
  Number of examinees taking Form Y                                     1,000

Group Differences
  Effect size, ES                                                       Varies

Non-Normality                                                           Lognormal transformation

(a) The equation holds when the ratio of effective test lengths equals the ratio of actual test lengths.
(b) Covariance between X and V varies when different types of common items are used. For external common items, it is 13.84615 by using Equation 9, and for internal common items, it is 17.30769 by using Equation 10. Similar results apply to Y and V.
(c) Amount of group difference is reflected by the mean score difference between the two populations on common items. To obtain ES levels of 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, and 1.0, the mean for Group 2 is lower than the mean for Group 1 by 3 × ES score points according to Equation 19.
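The variance and covariance entries in Table 7 can be recovered from k, the reliability, and the variance of V. Equations 8 to 10 are not reproduced in this part of the report, so the Python sketch below rests on assumptions: it uses textbook congeneric-model relations (Spearman-Brown for the reliability of V, true-score variance proportional to the squared effective length, and error variance of V entering the covariance only for internal items), which happen to reproduce the tabled values. The function name `table7_parameters` is hypothetical.

```python
def table7_parameters(k=0.4, rho_xx=0.8, var_v=9.0, mu_v1=10.0, es=0.0):
    """Population values implied by k, reliability, and the moments of V."""
    rel_v = k * rho_xx / (1.0 + (k - 1.0) * rho_xx)   # reliability of V (assumed)
    true_var_v = rel_v * var_v
    true_var_x = true_var_v / k**2
    var_x = true_var_x / rho_xx
    cov_external = k * true_var_x                      # true-score covariance only
    cov_internal = cov_external + (var_v - true_var_v) # plus error variance of V
    mu_v2 = mu_v1 - es * var_v**0.5                    # footnote (c): 3 x ES lower
    return {
        "mu_x": mu_v1 / k,
        "var_x": var_x,
        "cov_external": cov_external,
        "cov_internal": cov_internal,
        "mu_v2": mu_v2,
        "mu_y": mu_v2 / k,
    }

params = table7_parameters(es=0.5)
for name, value in params.items():
    print(name, round(value, 5))
```

As a cross-check, `cov_external**2 / (var_x * var_v)` and `cov_internal**2 / (var_x * var_v)` equal the k = 0.40, reliability 0.80 cells of Tables 1 and 2.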

Table 8: Modified $k$ for Tables 5 and 6 based on Simulation Study 2 ($\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$, $u = 0.10$)

                      ES = 0             ES = 0.2           ES = 0.5
Using a set of external common items
Normal                0.4530 (0.0149)    0.5250 (0.0128)    0.6490 (0.0165)
Moderately Skewed     0.5920 (0.0136)    0.6540 (0.0114)    0.7670 (0.0149)
Extremely Skewed      0.7140 (0.0131)    0.7750 (0.0089)    0.8630 (0.0163)
Using a set of internal common items
Normal                0.1820 (0.0062)    0.2000 (0.0000)    0.2210 (0.0045)
Moderately Skewed     0.2800 (0.0000)    0.3000 (0.0000)    0.3240 (0.0082)
Extremely Skewed      0.3620 (0.0062)    0.3800 (0.0000)    0.4090 (0.0102)

Note. Because the test in the simulation studies contains 50 items, k = 0.18, which gives an integer number of common items needed to provide the target SEE, is used as the initial value when internal common items are considered. Values in parentheses are standard deviations of k over 20 replications.

Table 9: Modified numbers of common items needed based on Simulation Study 2 (total number of items on either Form X or Y is 50, $\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$, $u = 0.10$)

                      ES = 0    ES = 0.2    ES = 0.5
Using a set of external common items
Normal                23        27          33
Moderately Skewed     30        33          39
Extremely Skewed      36        39          44
Using a set of internal common items
Normal                10        10          12
Moderately Skewed     14        15          17
Extremely Skewed      19        19          21

Note. Numbers of common items needed are calculated by 50 × k, where the values of k are displayed in Table 8. If the resulting number is not an integer, it is rounded up to the next integer.
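The conversion from Table 8 to Table 9 is a round-up of 50 × k. A minimal Python sketch is given below. The function name `items_needed` is hypothetical, and the intermediate `round` call is a design choice of this sketch that guards against floating-point products such as 50 × 0.28 landing slightly above an integer.

```python
import math

def items_needed(k, n_items=50):
    """Number of common items: n_items * k, rounded up to the next integer."""
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(n_items * k, 8))

# Mean modified k values from Table 8, ordered ES = 0, 0.2, 0.5.
table8 = {
    ("external", "Normal"): (0.4530, 0.5250, 0.6490),
    ("external", "Moderately Skewed"): (0.5920, 0.6540, 0.7670),
    ("external", "Extremely Skewed"): (0.7140, 0.7750, 0.8630),
    ("internal", "Normal"): (0.1820, 0.2000, 0.2210),
    ("internal", "Moderately Skewed"): (0.2800, 0.3000, 0.3240),
    ("internal", "Extremely Skewed"): (0.3620, 0.3800, 0.4090),
}
table9 = {key: tuple(items_needed(k) for k in ks) for key, ks in table8.items()}
for key, counts in table9.items():
    print(key, counts)
```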

Figure 1: $\rho^2(X,V)$ as a function of $\rho(X,X')$ and $k$. [Two panels: Using External Common Items; Using Internal Common Items. Horizontal axis: k. Vertical axis: Squared Correlation Between X and V. Lines: Reliability of Form X = 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 0.99.]

Figure 2: Estimated SEE ($\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$). [Two panels: Using External Common Items; Using Internal Common Items. Horizontal axis: Z-score of X. Vertical axis: Estimated Standard Error of Equating. Lines: Relative Length of Common Items, k = 0.10 to 1.00 in steps of 0.10.]

Figure 3: Flowchart of a process for estimating lower bound of the number of common items needed. [Flowchart. Decision nodes include whether $z_i \ne 0$, whether $N_{tot}u^2 > 8$, and the type of common items (external or internal). The chart ends in five result boxes (Results 1 to 5), including "The target SEE can always be achieved" (Result 1), "The target SEE cannot be achieved" (Result 3), and lower bounds on k for external and internal common items.]

Figure 4: Difference between analytic SEE and empirical SEE (normal distribution, $\rho(X,X') = 0.8$, $N_{tot} = 2{,}000$). [Two panels: Using External Common Items; Using Internal Common Items. Horizontal axis: Z-Score of X. Vertical axis: Analytic SEE Minus Empirical SEE (SD Unit). Lines: Effect Size, ES = 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 1.00.]

[Figure: two panels, "Using External Common Items" and "Using Internal Common Items". x-axis: Z-score of X (−3 to 3); y-axis: Analytic SEE Minus Empirical SEE (SD Unit) (−0.10 to 0.10); one curve per effect size, ES = 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 1.00.]

Figure 5: Difference between analytic SEE and empirical SEE (positively skewed distribution, ρ(X,X′) = 0.8, N_tot = 2,000)

[Figure: two panels, "Using External Common Items" and "Using Internal Common Items". x-axis: Z-score of X (−3 to 3); y-axis: Analytic SEE Minus Empirical SEE (SD Unit) (−0.10 to 0.10); one curve per effect size, ES = 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 1.00.]

Figure 6: Difference between analytic SEE and empirical SEE (negatively skewed distribution, ρ(X,X′) = 0.8, N_tot = 2,000)

[Figure: single panel. x-axis: Z-score of X (−3 to 3); y-axis: Analytic SEE Minus Empirical SEE (SD Unit) (−0.10 to 0.10); six curves: External, Normal; Internal, Normal; External, Positively Skewed; Internal, Positively Skewed; External, Negatively Skewed; Internal, Negatively Skewed.]

Figure 7: Difference between analytic SEE and empirical SEE when ES = 0 (ρ(X,X′) = 0.8, N_tot = 2,000)

[Figure: two panels, "Using External Common Items" and "Using Internal Common Items". x-axis: Group Difference (ES = 0, ES = 0.2, ES = 0.5); y-axis: Number of Common Items Needed (0 to 50); three series: Normal, Moderately Skewed, Extremely Skewed.]

Figure 8: Modified numbers of common items needed (total number of items on either Form X or Y is 50, ρ(X,X′) = 0.8, N_tot = 2,000, u = 0.10)
