Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge

Transcript

Using Technology-Enhanced Items to Measure Fourth Grade Geometry Knowledge

Jessica Masters, Ph.D. ([email protected])
Matthew Gushta, Ph.D. ([email protected])

Paper presented at the American Educational Research Association Annual Meeting, Washington, D.C., April 2018.

Abstract

Technology-enhanced items have the potential to provide improved measures of student knowledge compared to traditional item types. This paper uses quantitative analysis of fourth grade geometry field test data to explore a) the validity of inferences made from certain kinds of technology-enhanced items and b) whether those item types provide improved measurement compared to traditional item types. There was strong evidence, based on internal structure and on the relationship with other variables, that technology-enhanced items provided valid inferences. The evidence of whether that measurement was an improvement over the measurement provided by traditional selected-response items was mixed.

High-quality assessment is critical to the educational process. Teachers and educational stakeholders can only make effective decisions about instruction and student progress if they have access to assessments that result in reliable and valid inferences about student knowledge. Technology-enhanced (TE) items have the potential to provide improved measures of student knowledge over traditional item types because they require students to produce, rather than select, a response. TE items can create a more engaging environment for students and can reduce the effects of guessing and test-taking skills. Constructed-response (CR) items can often provide similar benefits, but TE items have the additional benefit of supporting automated scoring, a critical feature in the modern classroom that requires fast-paced, inexpensive, and accurate assessment feedback. For these and other reasons, national assessment consortia, state departments of education, and assessment developers have widely incorporated TE items into formative and summative assessments. While the potential benefits of TE items have spurred the forward momentum of TE item development and use, there has been a lack of rigorous research on the validity of inferences made from TE items and the ability of TE items to provide improved measurement over traditional selected-response (SR) items (Bryant, 2017). Researchers have stressed the need for evidence that TE items are more than merely engaging, but also provide an accurate, and potentially improved, measure of student knowledge.

The current paper contributes to the small but growing base of research related to whether the inferences made from TE items are valid and whether TE items provide improved measurement over SR items. Field test data are examined to address the following research questions:

RQ1) To what extent do TE items provide a valid measure of geometry standards in the elementary grades?

RQ2) To what extent do TE items provide improved measurement compared to SR items?

To address these questions, the Validity of Technology Enhanced Assessment in Geometry (VTAG) project collected data from classroom administration of TE, SR, and CR items targeting the fourth grade Common Core State Standards in geometry (shown in Figure 1).

CCSS.MATH.CONTENT.4.G.A.1: Draw points, lines, line segments, rays, angles (right, acute, obtuse), and perpendicular and parallel lines. Identify these in two-dimensional figures.

CCSS.MATH.CONTENT.4.G.A.2: Classify two-dimensional figures based on the presence or absence of parallel or perpendicular lines, or the presence or absence of angles of a specified size. Recognize right triangles as a category, and identify right triangles.

CCSS.MATH.CONTENT.4.G.A.3: Recognize a line of symmetry for a two-dimensional figure as a line across the figure such that the figure can be folded along the line into matching parts. Identify line-symmetric figures and draw lines of symmetry.

Figure 1: Common Core State Standards in Fourth Grade Geometry

1. Theoretical Framework

Assessment is a critical component within the instructional process, and instruction should be differentiated based on the results of assessments (Pellegrino, Chudowsky, & Glaser, 2001). The 2010 National Education Technology (NET) Plan's goal related to assessment is that "our education system at all levels will leverage the power of technology to measure what matters and use assessment data for continuous improvement" (USDE, 2010, p. xvii). Research has documented the inadequacies of SR items for measuring many types of high-level knowledge and understanding (Archbald & Newmann, 1988; Bennett, 1993; Birenbaum & Tatsuoka, 1987; Darling-Hammond & Lieberman, 1992; Hickson & Reed, 2009; Lane, 2004; Livingston, 2009; Quellmalz, Timms, & Schneider, 2009). One approach to overcoming the shortcomings of SR items is the use of text-entry or CR items, which have been used to measure higher-order skills and knowledge. In recent years, researchers have leveraged technological advancements to combine the measurement power of CR items with the automated-scoring capability of SR items. One branch of this research has focused on automated text and essay scoring (e.g., Dikli, 2006), while another branch has focused on using technology to allow students to interact with digital content in innovative ways, through the development of TE items. This second line of research is consistent with the NET Plan's assessment-related recommendations, which include the development of assessments that provide "new and better ways" to assess students and the expansion of the capacity to design, develop, and validate technology-enhanced assessments that can access constructs difficult to measure with traditional assessments (USDE, 2010). For this recommendation to be realized, more research is needed on the validity of technology-enhanced assessments in a variety of contexts.

TE items offer many potential benefits over SR items. The most significant is that TE items have the potential to provide improved measurement of certain constructs, specifically high-level or cognitively complex constructs, because they require students to produce information rather than simply select information, which is often a more authentic form of measurement (Archbald & Newmann, 1988; Bennett, 1999; Harlen & Crick, 2003; Huff & Sireci, 2001; Jodoin, 2003; McFarlane, Williams, & Bonnett, 2000; Sireci & Zenisky, 2006; Zenisky & Sireci, 2002). A second benefit is that TE items reduce the effects of test-taking skills and random guessing (Huff & Sireci, 2001). A third benefit is that TE items have the potential to provide richer diagnostic information by recording not only the student's final response, but also the interaction and response processes that can reveal the student's thought process leading to that response (Birenbaum & Tatsuoka, 1987). CR items have always offered the first two of these benefits, but TE items allow these benefits to be leveraged on items administered via computer that can be automatically and instantly scored. A fourth potential benefit of TE items is a possible reduction of cognitive load from non-relevant constructs, such as the reading load for non-reading-related items or the cognitive load required to keep various item components in memory (Mayer & Moreno, 2003; Thomas, 2016).
Finally, TE items tend to be more engaging to students, an important consideration in an era when students frequently feel over-tested (Strain-Seymour, Way, & Dolan, 2009; Dolan, Goodman, Strain-Seymour, Adams, & Sethuraman, 2011).

The potential of TE items to provide improved measurement was initially underscored by the awarding of Race to the Top Assessment funds to PARCC and Smarter Balanced, who proposed to develop next-generation assessment systems that would incorporate TE items in their summative and non-summative assessments (PARCC, 2010; SBAC, 2010). Since that time, state departments of education have continued to pursue the promise of TE items. Statewide summative assessment Requests for Proposals frequently include specific provisions for the development and administration of TE items, e.g., stipulating the availability of specific interaction types (State of Maine, Department of Education, 2014) and presuming the measurement of higher-order thinking skills (e.g., Oklahoma State Department of Education, 2017). Despite the forward momentum to develop and use TE items, there is only a small research base evaluating the validity of TE items in various contexts within K-12 education.

Pearson conducted cognitive labs with elementary, middle, and high school students to evaluate perceptions of TE items, the cognitive processes used to respond to TE items, and the potential for TE items to better evaluate constructs in both mathematics and English language arts (Dolan, Goodman, Strain-Seymour, Adams, & Sethuraman, 2011). Although the results cannot be broadly generalized because of the small sample sizes, the research found preliminary evidence to suggest that TE items were highly usable and engaging. More importantly, the research found that TE items produced measurements of constructs, particularly high-level constructs, that were not easily measured with traditional item types. The study found that the use of TE items reduced guessing and allowed students to have more authentic interactions with content. The study also found that TE items required more time to complete and that this factor was influenced by students' technical proficiencies (ibid.).

In a separate research effort, Pearson partnered with the Minnesota State Department of Education to evaluate and compare the performance of TE items, SR items, and CR items in the context of fifth grade, eighth grade, and high school science (Wan & Henley, 2012). This study explored TE items of the figural response type, which includes "hotspot" identification, drag-and-drop, and reordering. Through item response theory analyses, this study found that TE items provided the same amount of information as SR items in fifth and eighth grade, and slightly more information in high school. While CR items provided more information than both TE items and SR items, those items required human scoring. This study also found that TE items and SR items were equally efficient (i.e., they provided equivalent amounts of information per unit of time). The researchers concluded that their statistical analyses support the use of TE items in K-12. However, they were also careful to note that further psychological testing (e.g., cognitive labs) should be conducted to confirm the results of their statistical testing. The researchers also advised caution in using TE items when standard SR items are able to measure a construct: "We reviewed the test forms administered in this study and found that a number of innovative items

could be easily replaced by [multiple choice] items without changing the [knowledge, skills, and abilities] measured. This issue is not uncommon in the development of innovative assessments: the face validity of innovative item formats is so appealing that their real potential of providing something beyond what is available using the [multiple choice] format may be overlooked" (ibid., p. 74). This is supported by other researchers who warn of innovation solely for innovation's sake when traditional items can fully and accurately measure a construct (Haladyna, 1999; Sireci & Zenisky, 2006). The researchers recommended both psychometric and psychological research to determine when TE items are appropriate.

Research conducted by Pacific Metrics Corporation, with funding from the U.S. Department of Education through a Grant for Enhanced Assessment Instruments, explored SR items, CR items, and TE items in the context of seventh grade mathematics and Algebra I. The researchers found no significant difference when they compared the correlation between a test comprised of CR and TE items and teacher ratings of student knowledge with the correlation between a test comprised of SR items and teacher ratings of student knowledge. The CR/TE test was reviewed by experts and found to be similar to the SR test in terms of measuring the intent of the standards and the depth of knowledge. The CR/TE test was found to be significantly more reliable and to provide more information than the SR test (Winter, Wood, Lottridge, Hughes, & Walker, 2012). "These results indicate that tests incorporating CR/TE items can measure some mathematics content with less error than tests comprising only [SR] items" (ibid., p. 53). This study provides promising results, but must be generalized cautiously because of the narrow content focus and the blending of text-based CR items and TE items on the same test form.

The VTAG project contributes to this small but critical base of research related to the validity of TE items, differing from previous efforts and making new contributions to the research base in two important ways. First, VTAG used a mixed-methods design: the researchers conducted an analysis of qualitative data collected via cognitive labs with small samples and a statistical analysis of field test data collected from larger samples. The latter analysis is the focus of the current paper. Second, the VTAG project focuses on a broad content area not previously studied: elementary geometry.

2. Selected-Response and Technology-Enhanced Item Types

The VTAG project uses the taxonomy of items defined by Scalise and Gifford (2006) to determine whether an item is categorized as SR or TE. This taxonomy, displayed in Figure 2, organizes items by the degree of constraint placed on the student's options for responding to or interacting with the item. The taxonomy does not classify items based on the method or mode of interaction required by the student or the media included in the item.

Figure 2: Taxonomy of Item Types based on Level of Constraint (Scalise & Gifford, 2006, p. 9)

For example, the item displayed in Exhibit A.1(a) is a traditional multiple-choice item. This item is clearly SR. The item could be reformatted, however, to use a different method of interaction. In Exhibit A.1(b), a similar item is rewritten using a drag-and-drop interface. Exhibit A.1(c) shows another similar item that uses a hotspot interface. For the VTAG project, the researchers argue that the interface or mode of interaction is irrelevant and that all three of these items are SR because they have the same degree of constraint: the student is asked to select one option out of four options. Although Exhibit A.1(b) and Exhibit A.1(c) use a more modern and interactive interface, the content of the items and the extent to which they ask students to select (rather than produce) content is the same as in the item displayed in Exhibit A.1(a); all three of the items displayed in Exhibit A.1 are consistent with the "Multiple Choice" category of the Scalise and Gifford taxonomy (displayed in Figure 2). What defines an SR item is that it places a high degree of constraint on the student's response, thus asking them to select, rather than produce, a response.

A TE item, on the other hand, requires the student to produce a response. For the VTAG project, the item must also be able to be automatically scored. For example, the item in Exhibit A.2(a) asks a student to produce (draw) a shape as their response. Although this item is similar to what could be offered on paper-and-pencil as a (non-text-based) CR item, the item is administered via computer and can be

automatically scored via the computer; this difference makes the item TE for the purposes of the VTAG project. Other items require a student to produce a different kind of response compared to what could be elicited from a paper-and-pencil CR item. For example, the item displayed in Exhibit A.2(b) asks students to move objects into various categories or buckets, which would be more difficult to directly translate to a traditional non-computer-based administration. The key distinctions, again, of why this item is considered TE for the VTAG project are that it requires a student to produce a response and can be automatically scored.

Some TE items do ask students to select a response. But, unlike SR items, the combination of possible responses from which to select results in an item with a higher cognitive complexity. While these items place more constraint on the student than one in which the student is truly producing the response, they place much less constraint than a selection of one in four, five, or six. For example, the item in Exhibit A.3 is classified as a TE item. In theory, this item could be rewritten as a series of three multiple-choice items, each with a list of all 36 possible street pairs as the options. Thus, while technically the student is selecting their response, not producing it, they are selecting it from a large enough sample of responses that this item is cognitively more akin to item types that require a produced response.

To summarize, VTAG researchers defined SR items as those that require a student to select a response from a small set of responses. SR items place a high degree of constraint on the response. TE items were defined as those that require a student to either produce a response or select a response from a very large set of responses (which are not explicitly enumerated). TE items place a low degree of constraint on the response.

It should be noted that the VTAG project aimed to contribute to the current knowledge base in topics that have been less explored, including those that would inform the ongoing and increasing use of computer-based assessments in large-scale summative and classroom-based formative and diagnostic assessment. To this end, the project compared the measurement qualities of TE items to SR items, not to CR items. The researchers acknowledge the possibility (even the probability!) that many CR items can provide improved measurement of constructs. However, many of these items either cannot currently be automatically scored or require substantial investment to establish automated scoring algorithms. Thus, the current study addresses the comparison of item types that can be widely administered and instantly scored by the computer (i.e., TE vs SR). The researchers further acknowledge the large body of existing research exploring automated text and numerical response analysis; they acknowledge that many of these types of CR items can be automatically scored. However, in the interest of contributing to the current knowledge base in an area less explored, those item types were not included in the VTAG project.

2.1. Selected-Response Item Types

As described above, for the purposes of the VTAG project, SR items are items that place a high degree of constraint on the response. The VTAG project defines four SR item types: standard multiple choice, select-all, complex multiple choice, and matching. These items come primarily from the two most constrained categories of the Scalise and Gifford (2006) taxonomy (Multiple Choice and Selection/Identification) and approximately correspond to levels 1C (Conventional or Standard Multiple Choice), 2A (Multiple True/False), 2C (Multiple Answer), 2D (Complex Multiple Choice), and 3A (Matching) (as displayed in Figure 2).

Standard multiple choice items and select-all items require a student to select one or more options from a small set of options. Exhibit A.4 shows example standard multiple choice items; Exhibit A.5 shows example select-all items. Complex multiple choice items are a combination of more than one standard multiple choice or select-all item. These types of items could be revised into multiple items that use either a standard multiple choice or a select-all format (each with a reasonable number of response options). Exhibit A.6 shows example complex multiple choice items. Matching items require a student to connect pieces of content. There is a one-to-one correspondence between the content, i.e., if content A, B, C, and D is matched with 1, 2, 3, and 4, exactly one letter will be matched to exactly one number. Exhibit A.7 shows example matching items. Note that while a more interactive interface may be employed with the aforementioned SR items, they are still considered SR because of the degree of constraint placed on student responses.

2.2. Technology-Enhanced Item Types

TE items place a lower degree of constraint on the response. The VTAG project defines five TE item types: categorization, hotspot, matrix completion, figure placement, and drawing. These items come primarily from constraint categories 3-6 of the Scalise and Gifford (2006) taxonomy (Reordering/Rearrangement, Substitution/Correction, Completion, and Construction) and approximately correspond to levels 3B (Categorizing), 3C (Ranking and Sequencing), 4C (Limited Figural Drawing), 5D (Matrix Completion), and 6B (Figural Constructed Response) (as displayed in Figure 2).

Categorization items require students to sort or classify content. It is important to note that, in theory, categorization items could be rewritten as SR items; however, many of these items would have too large a response space, and would thus be unworkable to administer in SR format. Exhibit A.8 shows examples of categorization items. For categorization items, the general guideline used in the VTAG project was that if at least one "content tile" (i.e., object to be classified) belonged in more than one "bucket" (i.e., category), the item would likely be classified as a TE item. Some items fit this description but overall had a small, constrained response space (e.g., the item displayed in Exhibit A.6); these items were classified as SR instead of TE because of the higher level of constraint placed on the response options.

Hotspot items require the student to identify part of a graphical figure. Exhibit A.9 shows examples of hotspot items. Similar to the categorization items, it is often possible to use the hotspot interface to write SR items. The general guideline used for hotspot items was that, for a hotspot item to be considered a TE item, the placement of the response within the overall visual space must be critical to the overall item context (e.g., select the parallel lines in the images below, or select all coastal cities on the map below). Matrix completion items require students to place content into pre-defined spaces within the problem space. This includes items asking students to rank or order content. Exhibit A.10 shows examples of this type of item. Figure placement items require students to place content within the larger space, which does not have pre-defined placement options (e.g., no "slots" or "buckets" in which to drop objects). Exhibit A.11 shows example items. Drawing items require students to draw a response, including the drawing of points, line segments, lines, rays, angles, or polygons.[i] Exhibit A.12 shows example drawing items.

3. Methodology

The researchers administered SR, TE, and CR items, along with surveys, to fourth grade students in a field test designed to collect data to address RQ1 and RQ2.

3.1. Instruments

The researchers developed a set of approximately ten SR and ten TE items to target each of the three standards in fourth grade geometry. The items were revised based on feedback from internal project content experts and independent experts on the project advisory board, and based on the qualitative analysis of data collected from cognitive labs (Masters, in press; Masters, Famularo, & King, 2016). These items were assembled to create three fourth grade test forms, each of which consisted of two sub-forms, covering standards 4.G.A.1 (Test AB), 4.G.A.2 (Test CD), and 4.G.A.3 (Test EF). Table 1 presents test construction information for these three test forms (six sub-forms), showing the number of SR and TE items on each sub-form. The tests were presented to students in sub-forms. For example, a student might take sub-form A in a single sitting and then complete sub-form B in another sitting.

A small set of text-based CR items was developed for each of the targeted standards. CR items are generally accepted as capable of providing a strong and reliable measure of student knowledge (e.g., Bennett, 1993; Hickson & Reed, 2009; Livingston, 2009). Just like any item, the ability to provide an accurate measure is dependent on the quality of the item, the context of administration, and the scoring approach or rubrics (which is particularly important when scoring involves human, subjective judgment). However, well-designed and well-administered CR items (including text-based CR items) are often considered an improved form of measurement (albeit one that cannot be administered or scored at a large scale). Accordingly, for VTAG, the CR items were designed to serve as an independent measure of student knowledge, something that

could be considered as close to the true measure of the student's knowledge as is possible to measure within the current research context.

Table 1: Test Construction

Sub-Form   Standard   Number of SR Items   Number of TE Items
A          4.G.A.1    6                    8
B          4.G.A.1    8                    8
C          4.G.A.2    7                    6
D          4.G.A.2    5                    8
E          4.G.A.3    6                    8
F          4.G.A.3    8                    6

The researchers collected released items and item contexts from state and national tests,[ii] modifying the language, targeted content, or item structure when necessary. To supplement the limited pool of publicly available test items, the VTAG researchers developed additional CR items specifically for use in this project. Exhibit A.13 shows example CR items.

A Technology Survey was designed to measure students' comfort with, proficiency with, access to, and frequency of use related to technology, as well as their attitude towards technology and attitude towards mathematics. These areas were chosen because they represent the various factors that might influence student performance on TE items. The researchers gathered released survey items from various years of the fourth grade National Assessment of Educational Progress (NCES, 2016) and Trends in International Mathematics and Science Study (NCES, 2018) surveys. Some questions were modified based on other smaller surveys (e.g., the Use, Support, and Effect of Instructional Technology Project) (UseIt, 2018). The survey items were reviewed internally by project experts, reviewed by members of the external VTAG expert advisory board, and piloted with a sample of over 740 elementary students; the survey items were revised accordingly.

3.2. Data Collection

The researchers collected initial validity evidence based on test content, e.g., the extent to which the assessment represents the domain of interest with minimal construct-irrelevant information. This type of evidence was collected through review by the expert advisory panel. Because evidence based on test content is a fundamental underpinning of validity, item revisions were

made based on the feedback of the expert advisors before the items were used to collect additional validity evidence.

Cognitive labs, or think-alouds, are face-to-face interactions during which a researcher observes and evaluates a student's cognitive processes. Cognitive labs have become a widely used method of gathering evidence related to the validity of inferences made by assessments, specifically evidence about whether assessment items are measuring the intended constructs (Dolan, Goodman, Strain-Seymour, Adams, & Sethuraman, 2011; Ericsson & Simon, 1999; Gorin, 2006; Willis, 1999). This type of qualitative evidence adds significant value to more traditional, quantitative validity evidence collected from larger samples (Beatty & Willis, 2007; Willis, 1999; Zucker, Sassman, & Case, 2004). The expert review and cognitive labs were used both to collect validity evidence based on test content and evidence based on student thought processes. They were also used to revise and refine the items used in the field test described in the current paper (Masters, in press; Masters, Famularo, & King, 2016). The cognitive labs also revealed that: a) the increased number of components in the TE items resulted in those items being both more difficult and more time intensive; b) students generally preferred the most interactive TE items; and c) when the technology of the testing interface was not a barrier, the TE items provided an equally accurate, and sometimes improved, measure of knowledge (ibid.). These results contributed to the overall understanding related to the research questions and helped inform the final revisions to the items and testing system that were used in the field test described in the current paper.

In spring 2016, the researchers conducted a field test of the SR and TE items to collect validity evidence based on internal structure and based on the relationship to other variables. All SR and TE items were instantly computer-scored so that teachers received immediate feedback with student results. In addition, students completed the CR items, a background survey, and the Technology Survey (all administered via computer).

4. Results

There were 384 students that completed the combined Test Form AB (targeting standard 4.G.A.1), 374 students that completed the combined Test Form CD (targeting standard 4.G.A.2), and 342 students that completed the combined Test Form EF (targeting standard 4.G.A.3). Table 2 presents information about student gender, race, ethnicity, and the frequency with which a language other than English was spoken in the home. Answering these questions was voluntary for students; thus, percentages may not always add up to 100%. Table 3 shows the average test scores, ranges of item difficulty, and ranges of item-total correlations. For the item analysis described throughout this section, all items were worth 1 point. Some items were dichotomously scored and others were scored with partial credit of 0, 0.5, or 1.[iii]
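As an illustration of the classical item analysis summarized in Table 3, the sketch below computes item difficulties and corrected item-total correlations from a students-by-items score matrix. This is not the VTAG analysis code; the data, dimensions, and variable names (e.g., scores) are hypothetical, assuming items scored 0, 0.5, or 1 as described above.

```python
import numpy as np
import pandas as pd

# Hypothetical score matrix: 384 students by 30 one-point items, entries 0, 0.5, or 1.
rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.choice([0, 0.5, 1], size=(384, 30)),
                      columns=[f"item_{i + 1}" for i in range(30)])

# Item difficulty (classical p-value): mean score on each 1-point item.
difficulty = scores.mean(axis=0)

# Corrected item-total correlation: correlate each item with the total score
# of the remaining items, so the item does not inflate its own correlation.
total = scores.sum(axis=1)
item_total = {col: scores[col].corr(total - scores[col]) for col in scores}

print(difficulty.round(3))
print(pd.Series(item_total).round(3))
```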

Table 2: Participating Student Demographics

Demographic                     Subgroup                             Value
Total                           n                                    425
Gender                          Female                               53.20%
                                Male                                 38.60%
Race                            Asian                                5.20%
                                American Indian / Alaskan Native     7.80%
                                Black                                22.40%
                                Native Hawaiian / Pacific Islander   4.20%
                                White                                42.60%
Ethnicity                       Hispanic / Latino                    12.50%
Language Other than English     All or most of the time              12.20%
                                About half of the time               5.40%
                                Once in a while                      12.20%
                                Never                                58.10%

Table 3: Item Difficulties and Item-Total Correlations

Item Set                     Number of Items   Average Score        Range of Item Difficulty   Range of Item-Total Correlations
Test Form AB (Standard 4.G.A.1)
Combined SR and TE Items     30                11.065 / 30 = 37%    0.063 - 0.865              0.269 - 0.644
SR Items                     14                6.307 / 14 = 45%     0.216 - 0.865              0.129 - 0.560
TE Items                     16                4.758 / 16 = 30%     0.063 - 0.543              0.365 - 0.587
Test Form CD (Standard 4.G.A.2)[iv]
Combined SR and TE Items     24                7.253 / 24 = 30%     0.055 - 0.74               0.293 - 0.713
SR Items                     11                4.350 / 11 = 40%     0.156 - 0.74               0.304 - 0.623
TE Items                     13                2.902 / 13 = 22%     0.055 - 0.72               0.275 - 0.652
Test Form EF (Standard 4.G.A.3)
Combined SR and TE Items     28                11.610 / 28 = 41%    0.039 - 0.824              0.222 - 0.631
SR Items                     14                6.949 / 14 = 50%     0.326 - 0.824              0.156 - 0.602
TE Items                     14                4.661 / 14 = 33%     0.039 - 0.614              0.191 - 0.624

Table 4 presents the results of a two-way ANOVA examining item difficulty by test form (AB, CD, EF) and item type (SR or TE). In addition, the effect size for each factor was calculated as η² = SS_effect / SS_total. The results show a significant effect of Item Type (F(1,76) = 14.265, p < 0.001) and a medium-to-large effect size (η² = 0.158; Cohen, 1988). A follow-up t-test confirms that the TE items were more difficult than the SR items (t = 3.825, df = 77.154, p < 0.001). The effect of test form was non-significant and small, indicating no difference in difficulty between Test AB, Test CD, and Test EF. Similarly, Table 5 presents the results of a two-way ANOVA examining item-total correlation by test form and item type. The results show no significant effects of test form, item type, or the interaction.

Table 4: Item Difficulty ANOVA and Effect Size

Factor        df   Sum of Squares   F        sig         η²
Test Form     2    0.171            2.291    n.s.        0.048
Item Type     1    0.533            14.265   p < 0.001   0.150
Interaction   2    0.001            0.014    n.s.        0.000
Residuals     76   2.830                                 0.801

Table 5: Item-Total Correlation ANOVA and Effect Size

Factor        df   Sum of Squares   F       sig    η²
Test Form     2    0.010            0.421   n.s.   0.011
Item Type     1    0.015            1.258   n.s.   0.016
Interaction   2    0.011            0.456   n.s.   0.012
Residuals     76   0.883                           0.962

4.1. Internal Structure via Classical Test Theory

RQ1 asks whether TE items provide valid measures. One type of evidence is based on the internal structure of the assessments, i.e., the statistical properties of an assessment which support the idea that respondents are applying the targeted construct without applying additional constructs not of direct interest (Goodwin & Leech, 2003). This was first evaluated using methods from classical test theory. Cronbach's alpha was calculated for all items on the test, the subset of SR items, and the subset of TE items. As shown in Table 6, for all three tests, the reliability was high for the full set of items and for the subsets of SR and TE items (α ≥ 0.70 is often considered the threshold for acceptable reliability of smaller, formative assessments, while α ≥ 0.80 is often considered the threshold for acceptable reliability of large-scale, high-stakes assessments; see Salvia, Ysseldyke, & Bolt, 2007). The Spearman-Brown correction was applied to the SR reliability index to estimate the value it would have if the number of SR points were equal to the number of TE points.
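A minimal sketch of the two reliability quantities used here: Cronbach's alpha for a set of items, and the Spearman-Brown prophecy formula used to project the SR reliability to the length of the TE set. The score matrix and the 16/14 lengthening factor below are hypothetical illustrations, not the VTAG data.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a students-by-items score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def spearman_brown(alpha: float, length_factor: float) -> float:
    """Predicted reliability if the test were lengthened by length_factor
    (e.g., 16/14 to bring 14 SR points up to 16 TE points)."""
    return (length_factor * alpha) / (1 + (length_factor - 1) * alpha)

# Hypothetical correlated 0/1 responses: 384 students, 14 SR items.
rng = np.random.default_rng(1)
ability = rng.normal(size=(384, 1))
sr_scores = (ability + rng.normal(scale=1.0, size=(384, 14)) > 0).astype(float)

alpha_sr = cronbach_alpha(sr_scores)
print(round(alpha_sr, 3), round(spearman_brown(alpha_sr, 16 / 14), 3))
```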

Table 6 shows the results, which are comparable to the reliability analysis conducted without the correction: both item sets showed high reliabilities, and the TE items showed a marginally higher reliability for one of the test forms (Test CD).

To further evaluate internal consistency (RQ1), an exploratory factor analysis was conducted using a Varimax rotation on the full set of items, the SR subset, and the TE subset. As shown in Table 7, the variance accounted for by the first factor in the full set of items was comparable to the variance accounted for by the first factor in the SR and TE subtests. For Test Forms AB and CD, the first factor accounted for a somewhat greater percentage of variance for the TE items compared to the SR items or the full test form. For all three item sets of all three tests, the items loaded strongly on the first factor and the average factor loadings were strong. These results indicate that for all three tests, the group of items as a whole, the subset of SR items, and the subset of TE items each show evidence of forming a unidimensional scale.

The reliability and factor analysis results provide one source of evidence that the TE items are a valid measure of knowledge of the targeted standard (RQ1); the strong internal structure demonstrated by these items suggests that valid inferences may be drawn about the assessed geometry standards. One way to evaluate if there is an improved measure (RQ2) is to evaluate whether the TE items provide better information than the SR items. Better information might be indicated by larger reliability indices, a greater proportion of variance accounted for by the first factor, or stronger factor loadings when comparing the SR and TE item subsets. As shown in Table 6, the Spearman-Brown corrected reliability indices for the TE items were higher than those for the SR items for Tests AB and CD. The difference in reliabilities was statistically significant (p < 0.05; Diedenhofen & Musch, 2016) only for Test AB. For Tests AB and CD, the TE items also showed stronger internal structure based on the other characteristics (including higher factor loadings).

4.2. Internal Structure via Item Response Theory

Item response theory (IRT) was employed to further examine the internal structure of the assessments. A two-parameter logistic (2PL) model was fit to the dichotomous items and a graded response model (GRM) was fit to the polytomous items, allowing the IRT models to accommodate differences across items with respect to both difficulty and discrimination.[v] Similar to the study by Jodoin (2003), this analysis focused on the Test Information Functions (TIFs), which are derived from the IRT item parameters estimated separately for the SR and TE subtests.
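The sketch below illustrates how a test information function can be assembled from 2PL item parameters and converted to a conditional standard error of measurement. The item parameter values and the 16/14 adjustment factor are hypothetical; the VTAG analysis also included graded response model items, which are omitted here for brevity.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

theta = np.linspace(-4, 4, 161)

# Hypothetical discrimination (a) and difficulty (b) parameters for an SR subtest.
sr_params = [(1.2, -0.5), (0.9, 0.0), (1.5, 0.4), (1.1, -1.0)]

# Test information is the sum of the item information functions.
tif_sr = sum(item_information_2pl(theta, a, b) for a, b in sr_params)

# The paper scales the SR TIF upward by the points ratio before comparison
# (analogous to the Spearman-Brown correction); 16/14 is illustrative only.
tif_sr_adj = (16 / 14) * tif_sr

# Conditional standard error of measurement at each theta.
sem_sr = 1.0 / np.sqrt(tif_sr_adj)

print(theta[tif_sr_adj.argmax()], tif_sr_adj.max(), sem_sr.min())
```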

Table 6: Reliability Analyses

Item Set                     Number of Items   Reliability (Cronbach's α)   Reliability (Spearman-Brown)
Test Form AB (Standard 4.G.A.1)
Combined SR and TE Items     30                0.897                        --
SR Items                     14                0.789                        0.84
TE Items                     16                0.834                        0.84
Test Form CD (Standard 4.G.A.2)
Combined SR and TE Items     24                0.889                        --
SR Items                     11                0.790                        0.80
TE Items                     13                0.827                        0.84
Test Form EF (Standard 4.G.A.3)
Combined SR and TE Items     28                0.902                        --
SR Items                     14                0.829                        0.84
TE Items                     14                0.818                        0.82

Table 7: Factor Analysis

Item Set                     Number of Items   % of Variance Accounted for by First Factor   Range of Factor Loadings on First Factor   Average Factor Loadings
Test Form AB (Standard 4.G.A.1)
Combined SR and TE Items     30                27.095%                                       0.307 - 0.680                              0.510
SR Items                     14                28.076%                                       0.289 - 0.677                              0.513
TE Items                     16                31.095%                                       0.430 - 0.668                              0.553
Test Form CD (Standard 4.G.A.2)
Combined SR and TE Items     24                31.495%                                       0.314 - 0.770                              0.547
SR Items                     11                33.422%                                       0.378 - 0.748                              0.566
TE Items                     13                36.906%                                       0.357 - 0.755                              0.594
Test Form EF (Standard 4.G.A.3)
Combined SR and TE Items     28                28.760%                                       0.253 - 0.690                              0.526
SR Items                     14                31.852%                                       0.402 - 0.705                              0.557
TE Items                     14                31.178%                                       0.248 - 0.730                              0.545

The TIF is inversely related to the conditional standard error of measurement for the estimate of theta (the latent trait variable), which describes the uncertainty in theta at various ability levels (SE(θ) = 1/√TIF(θ); Lohman & Al-Mahrazi, 2003). To oversimplify, this can be restated as such: when the test gives more information, there is less uncertainty in the estimate of the respondent's ability.

Interpretation of TIF values can be difficult, but one way to compare tests is to note the relationship between TIF peak values and respondent ability. Because the TIF is linearly related to the number of items (and, thus, the number of points) on a test, the use of the TIF as a comparison must allow for an upward adjustment of the TIF for the SR items, to account for there being fewer SR points; this adjustment is analogous to the Spearman-Brown correction applied when comparing the reliabilities.

Plots of the adjusted TIF values are presented in Figure 3 for the SR (solid line) and TE (dashed line) items. Table 8 shows the comparison of the adjusted peak values. The difference in TIF peak values between two tests can allow us to compare the standard errors, or measurement precision, of the two tests. The peaks of the SR and TE TIFs were comparable for Test AB; the TE peak was notably higher for Test CD; the SR peak was slightly higher for Test EF.

Figure 3: TIF Curves (Solid Line - SR, Dashed Line - TE)

Table 8: Peak TIF Values

Item Set                          TIF Peak   TIF Ratio
Test Form AB (Standard 4.G.A.1)
SR Items                          8.72       0.998
TE Items                          8.75       1.002
Test Form CD (Standard 4.G.A.2)
SR Items                          6.21       0.829
TE Items                          9.04       1.207
Test Form EF (Standard 4.G.A.3)
SR Items                          8.00       1.069
TE Items                          7.00       0.935

By taking the square root of the ratio of the TIF peak value for the TE items on Test CD to the TIF peak value for the SR items on Test CD (√(9.04 / 6.21) ≈ 1.21), it can be estimated that scale scores from a test composed only of SR items would have a standard error approximately 20% larger than the standard error of scale scores from a test composed only of TE items. This indicates that this set of TE items did, in fact, provide better information compared to the SR items. Using this same analysis, the comparison of TIF peak values for Tests AB and EF indicates that the difference is insignificant.

It should also be noted that the TIF peak value for the TE items always occurred at a higher value of theta than for the SR items. The SR items had greater TIF than the TE items at lower values of theta, while the opposite was true at higher values of theta. These differences in TIF values and shape indicate that the TE items tended to have notably higher difficulty than the SR items, and this difference was the primary difference in TIF between the SR and TE items for these tests.

The TIF analysis, resulting from IRT estimation, provides evidence that the TE items not only provide statistical information, or certainty about student ability, comparable to that provided by the SR items (RQ1), but may provide differential certainty, i.e., an improvement over the SR items (RQ2). Assessment via the TE items for Test CD demonstrated approximately 20% less error than assessment via the traditional SR items.

4.3. Comparison with CR Items

Another source of validity evidence is based on the relationship of test scores to other variables, which describes the extent to which an assessment provides information consistent with other information about the examinee with respect to the construct of interest. The researchers used the CR items as the external measure of geometry knowledge for comparison.

Students' performance on the CR items is summarized in Table 9. The average scores and item difficulties are generally comparable to performance on the SR and TE items, separately and combined (displayed in Table 3), with the CR items tending to be slightly more difficult. A one-way ANOVA was conducted across item difficulties (SR, TE, and CR items), along with an effect size (η²) and the necessary follow-up analyses (displayed in Table 10). Test form was excluded from the analysis, having previously been shown to be a non-significant factor in item difficulty. Again, item type was found to be a significant factor in item difficulty (F(2,94) = 8.89, p < 0.001), accounting for 15.9% of the total variance in item difficulty. Follow-up t-tests indicated again that TE items are more difficult than SR items (t = 3.825, df = 77.154, p < 0.001) and that CR items are also more difficult than SR items (t = 3.4087, df = 47.009, p < 0.001); no significant difference was demonstrated between TE and CR item difficulty.

Table 9: CR Item Set Difficulties

Test Form      Number of Items   Average Score         Range of Item Difficulty
Test Form AB   15                5.3313 / 15 = 36%     0.148 - 0.628
Test Form CD   15                5.2120 / 15 = 35%     0.151 - 0.603
Test Form EF   15                5.3957 / 15 = 36%     0.153 - 0.621

Table 10: Item Difficulty ANOVA and Effect Size

Factor      df   Sum of Squares   F       sig         η²
Item Type   2    0.595            8.890   p < 0.001   0.159
Residuals   94   3.146                                0.841
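A sketch of a one-way ANOVA across item types with an eta-squared effect size and a follow-up t-test. The item difficulty values below are hypothetical (only the group sizes mirror the study); the Welch (unequal variances) form of the t-test is assumed here because the paper reports non-integer degrees of freedom (e.g., df = 77.154), which is characteristic of that correction.

```python
import numpy as np
from scipy import stats

# Hypothetical item difficulties (proportion correct) grouped by item type.
rng = np.random.default_rng(2)
difficulty = {
    "SR": rng.uniform(0.2, 0.9, 39),
    "TE": rng.uniform(0.05, 0.7, 43),
    "CR": rng.uniform(0.1, 0.65, 15),
}

groups = list(difficulty.values())
f_stat, p_val = stats.f_oneway(*groups)

# Eta-squared: between-group sum of squares over total sum of squares.
all_vals = np.concatenate(groups)
grand_mean = all_vals.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_vals - grand_mean) ** 2).sum()
eta_sq = ss_between / ss_total

# Follow-up Welch t-test (unequal variances), e.g., TE vs SR difficulties.
t_stat, t_p = stats.ttest_ind(difficulty["TE"], difficulty["SR"], equal_var=False)

print(f"F = {f_stat:.3f}, p = {p_val:.4f}, eta^2 = {eta_sq:.3f}")
print(f"Welch t = {t_stat:.3f}, p = {t_p:.4f}")
```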

Students' performance on the SR and TE items was compared to their performance on the CR items. As displayed in Table 11, the combined items and the two subsets of items all displayed a strong and statistically significant correlation to the CR items. The strong, positive correlations to CR item performance provide another source of evidence that the TE items provided a valid measure of knowledge (RQ1). It should also be noted that the SR and TE items were strongly and statistically significantly correlated to each other (r = 0.810** for Test AB, r = 0.755** for Test CD, and r = 0.813** for Test EF), which suggests that the two item sets might be measuring the same construct, the construct measured by the independent measure (the CR items), which is being used as an independent estimate of the student's knowledge. This further indicates that the TE items are a valid measure of knowledge.

TE items might be considered an improved measure if they provide a better measure, i.e., a stronger correlation between the TE items and the CR items. Again, as shown in Table 11, for all three tests, the correlation between the TE items and the CR items was stronger than the correlation between the SR items and the CR items, although, again, both correlations were strong and statistically significant. The correlation between performance on the CR items and either the SR or TE items was compared according to Fisher's r-to-z transformation (Preacher, 2002), indicating no significant difference for Test AB (z = -1.655, p > 0.05) or Test EF (z = -0.342, p > 0.05). However, the correlation with CR item performance is stronger for TE items than SR items on Test CD (z = -2.021, p < 0.05). The significance testing indicates that, for Test CD, the TE items might be a better measure (i.e., closer to the measure provided by the CR items) of that construct compared to the SR items.
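The comparison reported above can be reproduced with a Fisher r-to-z test of two correlations, sketched below. The independent-samples form of the test is assumed (it yields a value close to the reported z for Test CD); because both correlations involve the same students and the same CR scores, a dependent-correlations test such as Steiger's would be a defensible alternative.

```python
import numpy as np
from scipy import stats

def fisher_z_compare(r1: float, r2: float, n1: int, n2: int):
    """Compare two Pearson correlations via Fisher's r-to-z transformation
    (independent-samples form)."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed p-value
    return z, p

# Test CD: SR-CR correlation (0.689) vs TE-CR correlation (0.759), n = 374 students.
z, p = fisher_z_compare(0.689, 0.759, 374, 374)
print(f"z = {z:.3f}, p = {p:.4f}")  # roughly z = -2.0, p < 0.05
```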

Table 11: Correlations with CR Items for Fourth Grade

Item Set                          Correlation with CR Items
Test Form AB (Standard 4.G.A.1)
Combined SR and TE Items          r = 0.797**
SR Items                          r = 0.732**
TE Items                          r = 0.782**
Test Form CD (Standard 4.G.A.2)
Combined SR and TE Items          r = 0.771**
SR Items                          r = 0.689**
TE Items                          r = 0.759**
Test Form EF (Standard 4.G.A.3)
Combined SR and TE Items          r = 0.682**
SR Items                          r = 0.643**
TE Items                          r = 0.658**

* p < 0.05, ** p < 0.01

4.4. Survey Data and Time

The researchers explored the relationship between subtest performance and students' technology affect, frequency of technology use, and math affect. Table 12 shows correlations between these factors and each of the three test forms, plus the CR items. Student affect towards technology demonstrated weak positive correlations to all four sets of items (r = 0.195 to 0.261, p < 0.05). These results suggest no practical relationship between technology affect and performance on SR, TE, and CR item types. Frequency of technology use was weakly and negatively correlated to each of the four sets of items (r = -0.098 to -0.187, p < 0.05), suggesting no practical relationship between the frequency with which students use technology and their performance on any of the item types. Lastly, student affect towards math in general demonstrated weak positive correlations to all four item types, indicating that students' opinions towards math were not meaningfully related to their performance on any of the item types.

It is often the case that TE items take longer to complete, which is often considered a potential negative factor that can counter the potential benefits of TE items (unless TE items provide more information, thus potentially justifying the extra time required). Therefore, the average time to complete each item was compared across TE and SR items. The median number of seconds to answer the SR items was 42, 26, and 11 for Test AB, Test CD, and Test EF, respectively. The median number of seconds to answer the TE items was 64, 63, and 24. For each test, the difference in time to answer by item type was statistically significant (p < 0.01).
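A sketch of the survey-correlation and response-time comparisons described above. All data below are hypothetical, and the paper does not name the significance test used for the time-on-item comparison; a Mann-Whitney U test on the (typically skewed) response-time distributions is shown here as one reasonable choice.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical per-student data: TE subtest score and survey scale scores.
df = pd.DataFrame({
    "te_score": rng.integers(0, 17, 374),
    "tech_affect": rng.normal(3.0, 0.8, 374),
    "tech_use": rng.normal(2.5, 0.9, 374),
    "math_affect": rng.normal(3.2, 0.7, 374),
})

# Pearson correlations between TE subtest performance and each survey scale.
for scale in ["tech_affect", "tech_use", "math_affect"]:
    r, p = stats.pearsonr(df["te_score"], df[scale])
    print(f"{scale}: r = {r:.3f}, p = {p:.4f}")

# Hypothetical per-item response times (seconds), compared nonparametrically.
sr_times = rng.lognormal(mean=3.6, sigma=0.5, size=2000)
te_times = rng.lognormal(mean=4.1, sigma=0.5, size=2000)
u_stat, u_p = stats.mannwhitneyu(te_times, sr_times, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.0f}, p = {u_p:.4g}")
```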

Table 12: Correlations with Student Survey Data for Fourth Grade

Item Set                     Technology Affect   Frequency of Technology Use   Math Affect
Test Form AB (Standard 4.G.A.1)
Combined SR and TE Items     r = 0.225**         r = -0.110*                   r = 0.263**
SR Items                     r = 0.211**         r = -0.111*                   r = 0.259**
TE Items                     r = 0.218**         r = -0.098                    r = 0.243**
CR Items                     r = 0.244**         r = -0.139*                   r = 0.246**
Test Form CD (Standard 4.G.A.2)
Combined SR and TE Items     r = 0.222**         r = -0.182**                  r = 0.274**
SR Items                     r = 0.202**         r = -0.187*                   r = 0.286**
TE Items                     r = 0.216*          r = -0.152**                  r = 0.227**
CR Items                     r = 0.261**         r = -0.143*                   r = 0.241**
Test Form EF (Standard 4.G.A.3)
Combined SR and TE Items     r = 0.220**         r = -0.179**                  r = 0.236**
SR Items                     r = 0.223**         r = -0.173**                  r = 0.213**
TE Items                     r = 0.195**         r = -0.169**                  r = 0.239**
CR Items                     r = 0.236**         r = -0.161*                   r = 0.239**

* p < 0.05, ** p < 0.01

These results show a weak relationship between student performance on the items and both technology use and affect towards technology. This suggests that technology use and affect do not present construct-irrelevant barriers to students' performance on the test. These are positive results in support of the use of TE items, as issues related to technology use and affect are often considered potential reasons to avoid TE items. The results did show that the TE items took significantly more time to complete, but that finding must be considered along with the overall findings about whether the items also provide more or better information (Jodoin, 2003).

5. Discussion

A condensed summary of the results is provided in Figure 4. The analyses clearly provide evidence of the validity of inferences made from the TE items (RQ1). The evidence on whether the TE items provided improved measures (RQ2) is somewhat inconsistent. In several cases, there was evidence to support better or more information being obtained from the TE items. The evidence was not generalizable across all tests or overwhelmingly conclusive for any individual test, particularly considering the equally strong performance of the SR items. The analysis seems to support the development of tests with a combination of SR and TE items as the best measurement of student knowledge. It is clear that the circumstances under which TEs will provide an improved measure are very context-dependent.

RQ     Evidence                                                                              AB   CD   EF
RQ1    Internal structure (reliability)                                                      ✓    ✓    ✓
       Internal structure (unidimensionality)                                                ✓    ✓    ✓
       Internal structure (information)                                                      ✓    ✓    ✓
       Relationship to other variables (correlation with CR items)                           ✓    ✓    ✓
       Lack of evidence of the effect of construct-irrelevant factors                        ✓    ✓    ✓
       (technology use and affect)
       Comparable item difficulties between CR and TE items (compared to                     ✓    ✓    ✓
       significant differences in difficulty between CR and SR items)
RQ2    Internal structure (significantly improved reliability of TE over SR items)           ✓
       Stronger internal structure (unidimensionality) for TE versus SR items                ✓    ✓
       More information (higher TIF values and smaller IRT-based standard errors)                 ✓
       for TE versus SR items
       Stronger relationship to other variables (correlation between performance             ✓    ✓
       on CR items and TE items, versus CR items and SR items)

Figure 4: Summary of Evidence from Field Test

There was strong evidence from all three tests that the TE items were a valid measure of student knowledge for the three targeted standards (RQ1). The TE items had strong internal structure, evidence of unidimensionality, and a strong relationship to an external measure of knowledge. Further, there was no evidence found to support what are often considered potential negative confounding factors that inhibit the ability of TE items to provide an accurate measure of knowledge. Specifically, there was no evidence of a meaningful relationship between performance on TE items and students' opinions towards technology, frequency of use of technology, or opinions towards math. These findings alleviate some of the more common concerns about the use of TE items. Thus, there is strong evidence related to RQ1.

There was some evidence that TE items provided an improved measure (RQ2), but the evidence was not consistent across tests. Test CD had the most consistent evidence that TE items provided an improved measure. For this test, TE and CR items demonstrated comparable difficulties, indicating that, despite the TE items being more difficult than the SR items, they were closer to the independent measure of knowledge, as provided by the CR items. This idea

was also supported by the significantly stronger correlation between the TE and CR items, compared to the correlation between the SR and CR items. The TE items on Test CD also had higher reliability and higher average factor loadings. Additionally, the IRT analysis indicated that the TE items on Test CD provided more statistical information and, therefore, improved measurement over the SR items.

The evidence was less consistent for the other two tests. For Test AB, the TE items had higher average factor loadings compared to the SR items, a significantly higher reliability, and a stronger (though not significantly stronger) correlation to the CR items. For Test EF, the only evidence that TE items might provide improved measurement was the non-significant difference between TE and CR item difficulties.

There are many indications in the data that, from an assessment development perspective, when limited only to items that can be computer-scored, the best measurement might be achieved by including both SR and TE items. For all three tests, the individual item subsets and the combined item sets all showed strong and often comparable measures of internal structure, and the evidence of TE items providing improved measurement was inconsistent. Thus, the circumstances under which TEs will provide an improved measure might be incredibly context-dependent, and, without further research on specific contexts, the best approach is likely a blend of item types. Further, the inclusion of SR items can mitigate the effects of the increased time required to respond to TE items.

This recommendation is supported by an examination of the context of the three targeted standards. The standards contained a mix of components, asking students to draw (4.G.A.1 - Test AB, 4.G.A.3 - Test EF), identify (4.G.A.1, 4.G.A.2, 4.G.A.3 - all three tests), classify (4.G.A.2 - Test CD), and recognize (4.G.A.2, 4.G.A.3 - Test CD and Test EF). Two of these actions, identifying and recognizing, are more likely to be able to be accurately measured by an item with a high level of constraint, meaning a typical SR item. By definition, these actions do not require a student to produce their own response. Drawing, on the other hand, by definition does require a student to produce a response, and thus it is perhaps best to measure this ability using an item with a lower level of constraint (a TE item). The ability to classify appears to also be more accurately measured by TE items with less constraint. While a more constrained, typical SR item can certainly measure a student's ability to classify, it also allows guessing and "chance classifications." A less constrained, TE-style item requires the student to classify with less opportunity for guessing. A TE-style item also allows for greater flexibility in the grouping of objects, which might allow response options to more closely match a student's true understanding. The three targeted standards contain a mix of components, some of which would seemingly be more appropriate to measure with more constrained SR items and others with less constrained TE items. This helps contextualize the results and supports the idea that the content being measured should be the driving force in selecting appropriate item types and that a blend of item types might often be the best approach.

TE items are quickly becoming ubiquitous, and thus it is critical to generate a research base about the quality of measurement and the validity of inferences based on TE items in a variety of contexts. The current paper contributes to this small but important base of research. The results provide evidence that TE items can provide a valid measure of geometry standards in the elementary grades.

Based on the VTAG findings, the researchers recommend that item developers not consider TE items a panacea to solve all modern measurement challenges. Rather, TE item types should be treated as tools, one of many in the item developer's toolbox. This represents a kind of back to basics for item writing: the selection of item type should be based on the construct being measured. TE items should not be automatically favored over SR items because it is assumed they will provide a better measure. Whether TE items will provide improved measurement is likely to be highly dependent on the construct. Item writers should consider the action students will take to demonstrate knowledge to inform the selection of item type. For example, a student's ability to identify or recognize might be accurately and fully measured with an SR item, because the action can be captured even when a high degree of constraint is placed upon the response options. A student's ability to draw, classify, represent, or interpret might be better measured with a TE item, because those actions might be better demonstrated with a less constrained response space.

The researchers further recommend that TE items should not be avoided on the basis that they are typically more expensive to develop and test, that they might introduce construct-irrelevant factors, or that they require more time to answer. Well-designed TE items do have the potential to provide improved measurement of some constructs. Thus, again, item developers should choose the item type based on the best way to measure a specifically targeted construct, considering the varying levels of constraint represented by SR and TE item types as the broad set of possible item types from which to choose.

There are limitations of the VTAG project. Participating teachers and students were small samples of convenience, limiting the generalizability and the available statistical procedures. With larger, representative samples, more complex IRT models could be estimated and the impact of guessing evaluated. Finally, the research presented was conducted only with students that did not require accommodations during testing. Future research should explore the same questions posed by VTAG, using the VTAG item sets, but with the full breadth of the student population, including students with a variety of visual, auditory, physical, and motor disabilities. The researchers acknowledge the strong (and continually growing) base of research related to TE accommodations and acknowledge that the VTAG project did not include test accommodations.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant Number 1316557. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Notes

i. For drawing items, given the targeted age range, the interface decision was made to use "snap-to" behavior, meaning that if the student clicks somewhere in between grid lines, rather than directly on the intersection of two grid lines, the point is placed at the nearest grid intersection. This is one of two possible behaviors for graphing (the other being allowing a student to place a point anywhere on the grid). Given that the constructs being measured did not relate to concepts involving fractions or decimals, and that the item writers and researchers purposefully avoided fractions and decimals due to the age range and the confounding influence of these mathematical ideas, the snap-to behavior was chosen. Snap-to behavior also allows for more precise scoring algorithms that do not require tolerances around correct values (which, in turn, would require extensive field and validity testing). A minimal sketch of this snapping logic appears after these notes.
ii. Because the majority of the items were released prior to the adoption of the Common Core State Standards, and thus were based on other standards, many of these items did not align clearly to the standards being targeted for the VTAG project. Some required modification. Further, the researchers required that the items be strictly constructed-response, meaning that items could not primarily use an interaction that could be used in an SR or TE item (e.g., selecting options, graphing, plotting, drawing, etc.). The researchers contended that the inclusion of these kinds of CR items could influence the alignment between CR items and similar SR or TE items. For example, if the set of CR items targeting a standard included items asking students to plot points, and the analysis later revealed a stronger correlation between the CR and TE items (which also asked students to plot points) than between the CR and SR items, one could argue that the correlation was a result of the interface rather than the accuracy of the construct measurement.

iii. This assessment design was purposefully chosen so as not to presuppose that items supporting partial credit would be worth more or weigh more heavily toward the overall test score. During initial item design, the decision about the number of points an item was worth was made for each item on an individual basis, as was the decision of whether to award partial credit. Half points were not awarded; thus, items that awarded partial credit were, by default, written as 2-point items. These are common conventions used when writing items for large-scale assessment. In response to the expert advisors' feedback, the researchers made all items worth one point. This decision was made so as not to assume that any individual item or item type would return more information than any other, since whether or not this is the case is part of what the VTAG project seeks to discover (i.e., do TE items provide more information than SR items). The opportunity to earn partial credit (a score of 0.5) was used for all items where a response would indicate some level of partial understanding without the demonstration of conflicting misunderstanding. Partial credit was awarded to both SR items (select-all, matching, and complex multiple choice types only) and TE items (all types).

iv. Two items were dropped from Test CD (one SR item and one TE item). These items both had very low item difficulty statistics (.073 and .005), which means they were very difficult items, and had very low or negative item-total correlations, both when compared with the combined set of SR and TE items and with the SR or TE subset. The analysis in this section is based on the CD test without these two items.

v. For the analysis using item response theory (IRT), response data were recoded such that items were either dichotomously scored or polytomously scored as 0, 1, or 2.
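The following is a minimal sketch of the snap-to logic described in note i, assuming a square grid defined by a spacing and an origin in screen coordinates. It is illustrative only and is not the VTAG delivery code; the spacing, origin, and function name are hypothetical.

```python
# Minimal sketch (hypothetical parameters): move any click to the nearest grid
# intersection so that drawn points always land exactly on the grid.
def snap_to_grid(x: float, y: float, spacing: float = 40.0,
                 origin: tuple = (0.0, 0.0)) -> tuple:
    """Return the grid intersection nearest to the clicked point (x, y)."""
    ox, oy = origin
    snapped_x = ox + round((x - ox) / spacing) * spacing
    snapped_y = oy + round((y - oy) / spacing) * spacing
    return snapped_x, snapped_y

# A click at (130, 95) on a 40-unit grid snaps to the intersection (120, 80).
print(snap_to_grid(130, 95))
```

Because every response lands exactly on an intersection, a scoring key can compare coordinates directly rather than testing whether a drawn point falls within a tolerance band, which is the scoring simplification the note describes.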

Appendix A: Example Items

Note: Example items are presented to display SR and TE item types. Some items target the fourth grade standards described in the current paper. Other items target fifth grade standards, also a target of focus of the VTAG project.

Exhibit A.1: Examples of Traditionally Formatted and Interactive SR Items (panels a, b, c)
Exhibit A.2: Examples of TE Items (panels a, b)
Exhibit A.3: Example TE Item for which the Response Space is Too Large to Administer as SR
Exhibit A.4: Example SR Items (Standard Multiple Choice)
Exhibit A.5: Example SR Items (Select-All)
Exhibit A.6: Example SR Item (Complex Multiple Choice)
Exhibit A.7: Example SR Items (Matching)
Exhibit A.8: Example TE Items (Categorization)
Exhibit A.9: Example TE Item (Hotspot)
Exhibit A.10: Example TE Items (Matrix Completion)
Exhibit A.11: Example TE Items (Figure Placement)
Exhibit A.12: Example TE Item (Drawing)
Exhibit A.13: Example CR Items

References Cited

Archbald, D. A. & Newmann, F. M. (1988). Beyond Standardized Testing: Assessing Authentic Academic Achievement in Secondary School. Reston, VA: National Association of Secondary School Principals.
Beatty, P. C. & Willis, G. B. (2007). Research synthesis: The practice of cognitive interviewing. Public Opinion Quarterly, 71(2), 287-311.
Bennett, R. E. (1993). On the meaning of constructed response. In R. E. Bennett and W. C. Ward (Eds.), Construction versus Choice in Cognitive Measurement: Issues in Constructed Response, Performance Testing, and Portfolio Assessment (pp. 1-27). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bennett, R. E. (1999). Using new technology to improve assessment. Educational Measurement: Issues and Practice, 18(3), 5-12.
Birenbaum, M., Kelly, A. E. & Tatsuoka, K. (1992). Towards a stable diagnostic representation of students' errors in algebra. ETS-RR-92-58-ONR. ERIC Clearinghouse Doc. ED 356973.
Bryant, W. (2017). Developing a strategy for using technology-enhanced items in large-scale standardized tests. Practical Assessment, Research & Evaluation, 22(1).
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Darling-Hammond, L. & Lieberman, A. (1992). The shortcomings of standardized tests. The Chronicle of Higher Education, B1-B2.
Diedenhofen, B., & Musch, J. (2016). cocron: A web interface and R package for the statistical comparison of Cronbach's alpha coefficients. International Journal of Internet Science, 11, 51-60.
Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology, Learning, and Assessment, 5(1).
Dolan, R. P., Goodman, J., Strain-Seymour, E., Adams, J., & Sethuraman, S. (2011). Cognitive Lab Evaluation of Innovative Items in Mathematics and English Language Arts Assessment of Elementary, Middle, and High School Students. Iowa City, IA: Pearson Education.
Ericsson, K. A., & Simon, H. A. (1999). Protocol analysis: Verbal reports as data. Cambridge, MA: Massachusetts Institute of Technology.
Goodwin, L. D. & Leech, N. L. (2003). The meaning of validity in the new Standards for Educational and Psychological Testing: Implications for measurement courses. Measurement and Evaluation in Counseling and Development, 36, 181-191.
Gorin, J. S. (2006). Test design with cognition in mind. Educational Measurement: Issues and Practice, 25(4), 21-35.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items. Hillsdale, NJ: Erlbaum.
Harlen, W. & Crick, R. D. (2003). A Systematic Review of the Impact on Students and Teachers of the Use of ICT for Assessment of Creative and Critical Thinking Skills. London: Institute of Education, University of London.
Hickson, S., & Reed, W. R. (2009). Do constructed-response and multiple-choice questions measure the same thing? Department of Economics Working Paper Series. Retrieved March, 2018 from http://hdl.handle.net/10092/2465.
Huff, K. L. & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measurement: Issues and Practice, 20(3), 16-25.
inTASC. (2018). USEiT: Use, Support, and Effect of Instructional Technology Study. Retrieved March, 2018, from http://www.bc.edu/research/intasc/researchprojects/USEIT/useit.shtml
Jodoin, M. G. (2003). Measurement efficiency of innovative item formats in computer-based testing. Journal of Educational Measurement, 40(1), 1-15.
Lane, S. (2004). Validity of high-stakes assessment: Are students engaged in complex thinking? Educational Measurement: Issues and Practice, 23(3), 6-14.
Livingston, S. (2009, September). Constructed-response test questions: Why we use them; how we score them. ETS R&D Connections, no. 11.
Lohman, D. F., & Al-Mahrzi, R. (2003). Personal standard errors of measurement. Retrieved March, 2018 from https://faculty.education.uiowa.edu/docs/dlohman/personalSEMr5.pdf
Masters, J. (in press). The Validity of Technology-Enhanced Assessment in Geometry (VTAG) Project: Technical Report. Measured Progress.
Masters, J., Famularo, L., & King, K. (2016). Using Technology Enhanced Items to Measure Fifth Grade Geometry. Annual Meeting of the National Council on Measurement in Education. Washington, D.C.
Mayer, R. E., & Moreno, R. (2003). Nine ways to reduce cognitive load in multimedia learning. Educational Psychologist, 38, 43-52.
McFarlane, A., Williams, J. M., & Bonnett, M. (2000). Assessment and multimedia authoring: a tool for externalizing understanding. Journal of Computer Assisted Learning, 16, 201-212.
National Center for Education Statistics (NCES). (2016). NAEP Tools and Applications. Retrieved March, 2018, from https://nces.ed.gov/nationsreportcard/about/naeptools.aspx/
National Center for Education Statistics (NCES). (2018). Trends in International Math and Science Study. Retrieved March, 2018, from https://nces.ed.gov/timss/educators.asp
Oklahoma State Department of Education (2017). Solicitation #2650000340. Retrieved March, 2018 from https://www.ok.gov/cio/documents/Solicitation2650000340.pdf
Partnership for Assessment of Readiness for College and Careers (PARCC). (2010). Application for the Race to the Top Comprehensive Assessment Systems Competition. Retrieved December, 2011 from http://www.parcconline.org/sites/parcc/files/PARCC%20Application%20-%20FINAL.pdf
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds.) (2001). Knowing What Students Know: The Science and Design of Educational Assessment. Washington, DC: The National Academy Press.
Preacher, K. J. (2002, May). Calculation for the test of the difference between two independent correlation coefficients [Computer software]. Available from http://quantpsy.org. Retrieved March, 2018.
Quellmalz, E. S., Timms, M. J., & Schneider, S. A. (2009). Assessment of student learning in science simulations and games. Paper prepared for the National Research Council Workshop on Gaming and Simulations. Washington, DC: National Research Council.
Salvia, J., Ysseldyke, J. E., & Bolt, S. (2007). Assessment in special and inclusive education. Wadsworth.
Scalise, K. & Gifford, B. (2006). Computer-based assessment in e-learning: A framework for constructing "intermediate constraint" questions and tasks for technology platforms. The Journal of Technology, Learning, and Assessment, 4(6).
Sireci, S. G. & Zenisky, A. L. (2006). Innovative item formats in computer-based testing: In pursuit of improved construct representation. In S. M. Downing and T. M. Haladyna (Eds.), Handbook of test development (pp. 329-347). Mahwah, NJ: Lawrence Erlbaum Associates Publishers.
Smarter Balanced Assessment Consortium (SBAC). (2010). Race to the Top Assessment Program Application for New Grants; Comprehensive Assessment Systems CFDA Number: 84.395B. Washington, D.C.: U.S. Department of Education.
State of Maine, Department of Education (2014). Maine Comprehensive Assessment System: Procurement for the Implementation of English Language Arts and Mathematics Assessment in Grades 3 through 8 and High School. Retrieved March, 2018 from http://www.maine.gov/doe/assessment/documents/rfp.doc
Strain-Seymour, E., Way, W., & Dolan, R. (2009). Strategies and Processes for Developing Innovative Items in Large-Scale Assessments. Iowa City, IA: Pearson Education.
Thomas, A. (2016). Evaluating the Validity of Technology-Enhanced Educational Assessment Items and Tasks: An Empirical Approach to Studying Item Features and Scoring Rubrics. CUNY Academic Works.
U.S. Department of Education. (2010). Transforming American Education: Learning Powered by Technology. National Educational Technology Plan 2010. Washington, D.C.
Wan, L. & Henly, G. A. (2012). Measurement properties of two innovative item formats in a computer-based test. Applied Measurement in Education, 25(1), 58-78.
Willis, G. B. (1999). Cognitive interviewing: A "how to" guide. Paper presented at the Meeting of the American Statistical Association. North Carolina: Research Triangle Institute.
Winter, P. C., Wood, S. W., Lottridge, S. M., Hughes, T. B., & Walker, T. E. (2012). The utility of online mathematics constructed-response items: Maintaining important mathematics in state assessments and providing appropriate access to students. Final research report. Pacific Metrics Corporation.
Zenisky, A. L. & Sireci, S. G. (2002). Technological innovations in large-scale assessment. Applied Measurement in Education, 15(4), 337-362.
Zucker, S., Sassman, C., & Case, B. J. (2004). Cognitive Labs. San Antonio, TX: Harcourt Assessment, Inc.
