Preliminary Item Statistics Using Point-Biserial Correlation and P-Values

By Seema Varma, Ph.D.
Educational Data Systems, Inc.
15850 Concord Circle, Suite A, Morgan Hill, CA 95037
www.eddata.com

Overview

Educators in schools and districts routinely develop and administer classroom or benchmark tests throughout the academic year to get a quick overview of student mastery of relevant material. Generally these tests are used as diagnostic tools and are not used to measure growth over time or for official or statewide reporting purposes. For these reasons such tests are perceived to be low stakes, and they undergo minimal or no statistical review for reliability and validity.

We argue that even though internally developed tests may appear to be less important due to their low-stakes nature, the fact that they are used for student diagnostics, as well as for classroom instruction, curriculum development, and to steer teacher professional development, makes them one of the most vital sources of information at the school and district levels. For these reasons, it is essential that the data these tests provide be reliable and meaningful. We believe that with some pre-planning, schools and districts can significantly improve the quality of internally developed tests. One of the ways by which tests are checked for quality is analysis of each question or "item." The objective of this paper is to help educators conduct item analysis of internally developed assessments to obtain higher quality and more reliable test results.

What is Item Analysis?

Item analysis is a method of reviewing items on a test, both qualitatively and statistically, to ensure that they all meet minimum quality-control criteria. The difference between qualitative review and statistical analysis is that the former uses the expertise of content experts and test review boards to identify items that do not appear to meet minimum quality-control criteria. Such qualitative review is essential during item development, when no data are available for quantitative analysis.
A statistical analysis, such as item analysis, is conducted after items have been administered and real-world data are available for analysis. Statistical analysis also helps to identify items that may have slipped through item review boards; item defects can be hard to identify. The objective of qualitative and statistical review is the same – to identify problematic items on the test. Problematic items are also called "bad" or "misfitting" items. Items may be problematic for one or more of the following reasons:

- Items may be poorly written, causing students to be confused when responding to them.
- Graphs, pictures, diagrams or other information accompanying the items may not be clearly depicted or may actually be misleading.
- Items may not have a clear correct response, and a distractor could potentially qualify as the correct answer.
- Items may contain distractors that most students can see are obviously wrong, increasing the odds of students guessing the correct answer.
- Items may represent a different content area than that measured by the rest of the test (also known as multidimensionality).
- Bias for or against a gender, ethnic or other sub-group may be present in the item or the distractors.

Educational Data Systems 1

Normally, output from a standard Rasch program, such as Winsteps, provides statistics on various item analysis indices. However, using such statistical programs and interpreting the output requires training. This document is intended to help educators in schools and district offices perform preliminary item analysis without extensive training in psychometrics. The item analyses we discuss here are point-biserial correlations and p-values. We will show how to compute and interpret these statistics using two different programs: Excel and SPSS.

Implications of Problematic Items in Terms of Test Scores

One may ask, why is it so important to review every item on a test? One may speculate that as long as the majority of the items on a test are good, there may not be much impact if a few items are problematic. However, based on statistical theory and previous experience, we know that the presence of even a few problematic items reduces overall test reliability and validity, sometimes markedly. All measurement tools, such as tests, surveys, and questionnaires, are assessed in terms of these two criteria, reliability and validity. We will briefly explain these two terms in the context of assessments.

Reliability tells us whether a test is likely to yield the same results if administered to the same group of test-takers multiple times. Another indication of reliability is that the test items should behave the same way with different populations of test-takers, by which is generally meant that the items should have approximately the same ranking when sorted by their p-values, which are indicators of "item difficulty." The items on a math test administered to fifth graders in Arizona should show similar p-value rankings when administered to fifth graders in California. The same test administered the next year to another cohort of fifth graders should again show similar rankings.
Large fluctuations in the rank of an item (one year it was the easiest item on the test; the next year it was the hardest) would indicate a problem with both the item and, by implication, the test itself.

Validity tells us whether the test is measuring what it purports to measure. For example, an algebra test is supposed to measure student competency in algebra. But if it includes word problems, such a test may be a challenge for students with poor English language skills. It would, in effect, be measuring not simply algebra skills but English language skills as well. Thus, the test would measure both algebra and English skills for one subgroup of students, and algebra skills alone for the rest of the students for whom language is not an issue. This is different from the stated goal of being an algebra test for all students.

Both reliability and validity are important in any assessment. If a test is not reliable and valid, then the student scores obtained from that test are not reliable and valid, and are therefore not indicative of the student's mastery of the material or his or her ability in that content. Non-reliable and non-valid test scores are simply meaningless numbers. Thus, even when tests are administered at the school level (as opposed to a large-scale assessment), it is important to identify problematic items to ensure the test results are meaningful. While educators may reuse items over several years, it is advisable to remove bad items from the item pool. Besides lowering the reliability of the test, bad items also confuse students during the test-taking process. Students cannot be expected to be adept at identifying and rejecting items that appear incorrect and moving on to the next question. They spend time and energy responding to poorly written items.
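The cross-administration check described above can be sketched in code. The example below is illustrative, not part of the paper's procedure: given p-values for the same items from two administrations (the numbers here are hypothetical), Spearman's rank correlation of the two lists will be close to +1.0 when the items keep roughly the same difficulty ordering, and will fall toward 0 or below when the rankings fluctuate.

```python
# Sketch: check p-value rank stability across two test administrations.
# The p-value lists and helper functions are illustrative assumptions.

def ranks(values):
    """Rank values from smallest to largest (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

year1 = [0.89, 0.74, 0.61, 0.55, 0.32]  # hypothetical p-values, year 1
year2 = [0.85, 0.70, 0.66, 0.49, 0.28]  # same items, another cohort
print(round(spearman(year1, year2), 2))  # prints 1.0: the difficulty ordering is identical
```

A value near +1.0 here is the quantitative version of "the items should have approximately the same ranking when sorted by their p-values."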

Point-Biserial Correlation and P-Values

We now discuss two simple statistics used to determine whether a test item is likely to be valid and reliable: point-biserial correlations and p-values.

The point-biserial correlation is the correlation between the right/wrong scores that students receive on a given item and the total scores that the students receive when summing up their scores across the remaining items. It is a special type of correlation between a dichotomous variable (the multiple-choice item score, which is right or wrong, 0 or 1) and a continuous variable (the total score on the test, ranging from 0 to the maximum number of multiple-choice items on the test). As in all correlations, point-biserial values range from -1.0 to +1.0. A large positive point-biserial value indicates that students with high scores on the overall test are also getting the item right (which we would expect) and that students with low scores on the overall test are getting the item wrong (which we would also expect). A low point-biserial implies that students who get the item correct tend to do poorly on the overall test (which would indicate an anomaly) and that students who get the item wrong tend to do well on the test (also an anomaly).

The p-value of an item tells us the proportion of students that get the item correct. When multiplied by 100, the p-value converts to a percentage, which is the percentage of students that got the item correct. The p-value statistic ranges from 0 to 1.

Computation and Interpretation of Point-Biserial Correlation and P-Values

Illustrated below is a sample data matrix comprised of 10 items and 9 students (labeled Kid-A through Kid-I). Items are represented in the matrix columns from left to right, and students are represented as rows. A value of "1" in the grid signifies that the student got the item correct; "0" indicates the student got it wrong.
For instance, Kid-A got all the items right except for Item 9; Kid-I got only Items 2 and 3 right and the rest of the items wrong.

Table 1: Sample Data Matrix

          Item:  1  2  3  4  5  6  7  8  9  10
Students
Kid-A            1  1  1  1  1  1  1  1  0  1
Kid-B            1  1  1  1  1  1  0  1  1  0
Kid-C            1  1  1  1  1  1  1  0  0  0
Kid-D            1  1  1  1  1  0  1  0  1  0
Kid-E            1  1  1  1  1  1  0  1  0  0
Kid-F            1  1  1  0  1  0  0  0  0  0
Kid-G            1  1  0  1  0  1  0  0  0  0
Kid-H            1  0  1  0  1  0  0  0  0  0
Kid-I            0  1  1  0  0  0  0  0  0  0

Computing Point-Biserial Correlations and P-Values in Excel

To compute point-biserials and p-values in Excel, replicate the sample data matrix, above, in an Excel worksheet. Now, referring to Table 2, the bold alphabetical letters on the top row (A, B, C, etc.) represent the column labels in Excel. Underneath them are the test item labels. Rows 2 through 10 contain students Kid-A through Kid-I. The steps for computing the point-biserial correlation are as follows:

1. Compute the total student score (sum columns B through K for each row), as shown in Table 2, column L.

2. Compute the total score minus each item score, shown in Table 3, columns M through V, so that the total score minus the first item is in column M, the total score minus the second item is in column N, and so forth.

3. Compute the point-biserial correlation for each item using the "Correl" function. This computation results in the correlation of the item score and the total score minus that item score. For example, the Item 1 correlation is computed by correlating Columns B and M. To compute point-biserials, insert the Excel function =CORREL(array1, array2) into Row 12 for each of Columns M through V, as shown in Table 4. The result is the point-biserial correlation for each item.

In Row 13, compute each item p-value by calculating the sum of the correct scores for each item, as shown in Table 2, Row 11, then divide this number by the total number of students who took that item (e.g., Item 1 has 8 correct answers and 9 students, or 8/9 = 0.89).

Table 2: Sample Student Data Matrix in Excel

     A           B  C  D  E  F  G  H  I  J  K   L
                 Items                          Total
     Students    1  2  3  4  5  6  7  8  9  10  Score
2    Kid-A       1  1  1  1  1  1  1  1  0  1   9    =SUM(B2:K2)
3    Kid-B       1  1  1  1  1  1  0  1  1  0   8
4    Kid-C       1  1  1  1  1  1  1  0  0  0   7
5    Kid-D       1  1  1  1  1  0  1  0  1  0   7
6    Kid-E       1  1  1  1  1  1  0  1  0  0   7
7    Kid-F       1  1  1  0  1  0  0  0  0  0   4
8    Kid-G       1  1  0  1  0  1  0  0  0  0   4
9    Kid-H       1  0  1  0  1  0  0  0  0  0   3
10   Kid-I       0  1  1  0  0  0  0  0  0  0   2
11   Item total  8  8  8  6  7  5  3  3  2  1        =SUM(B2:B10)

Table 3: Computation of Total Score for Point-Biserial Correlation

          M       N       O       P       Q       R       S       T       U       V
          total-  total-  total-  total-  total-  total-  total-  total-  total-  total-
Students  item1   item2   item3   item4   item5   item6   item7   item8   item9   item10
Kid-A     8       8       8       8       8       8       8       8       9       8       =L2-B2
Kid-B     7       7       7       7       7       7       8       7       7       8
Kid-C     6       6       6       6       6       6       6       7       7       7
Kid-D     6       6       6       6       6       7       6       7       6       7
Kid-E     6       6       6       6       6       6       7       6       7       7
Kid-F     3       3       3       4       3       4       4       4       4       4
Kid-G     3       3       4       3       4       3       4       4       4       4
Kid-H     2       3       2       3       2       3       3       3       3       3
Kid-I     2       1       1       2       2       2       2       2       2       2
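The Excel steps above can also be scripted. The sketch below is plain Python (no libraries) and reproduces the same computation on the Table 1 data: each item column is correlated with the total score minus that item (the corrected point-biserial), and the item total is divided by the number of students for the p-value. Item 3 comes out near 0.12 and Item 10 near 0.40, matching the tables that follow.

```python
# Corrected point-biserial and p-value for the Table 1 data matrix.
# Rows are students Kid-A through Kid-I; columns are Items 1-10 (1 = correct).
data = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],  # Kid-A
    [1, 1, 1, 1, 1, 1, 0, 1, 1, 0],  # Kid-B
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # Kid-C
    [1, 1, 1, 1, 1, 0, 1, 0, 1, 0],  # Kid-D
    [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],  # Kid-E
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],  # Kid-F
    [1, 1, 0, 1, 0, 1, 0, 0, 0, 0],  # Kid-G
    [1, 0, 1, 0, 1, 0, 0, 0, 0, 0],  # Kid-H
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # Kid-I
]

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

totals = [sum(row) for row in data]          # column L in Table 2
for j in range(len(data[0])):
    item = [row[j] for row in data]          # one item column
    rest = [t - s for t, s in zip(totals, item)]  # total minus this item (Table 3)
    pb = pearson(item, rest)                 # corrected point-biserial
    p = sum(item) / len(item)                # p-value: proportion correct
    print(f"Item {j + 1}: point-biserial = {pb:.2f}, p-value = {p:.2f}")
```

Because the item score is subtracted from the total before correlating, these values agree with the "Corrected Item-Total Correlation" column that SPSS reports below.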

Table 4: Computation of Point-Biserial Correlations and P-Values

                    M      N      O      P      Q      R      S      T      U      V
                    total- total- total- total- total- total- total- total- total- total-
                    item1  item2  item3  item4  item5  item6  item7  item8  item9  item10
12  Point-Biserial  0.46   0.29   0.12   0.73   0.49   0.49   0.59   0.46   0.26   0.40    =CORREL(B2:B10, M2:M10)
13  P-Value         0.89   0.89   0.89   0.67   0.78   0.56   0.33   0.33   0.22   0.11    =B11/9

Computing Point-Biserial Correlations Using SPSS

To compute the point-biserial correlation using SPSS, copy the sample data in Table 1 into an SPSS data window. Open a syntax window (File\New\Syntax), paste the following syntax and click Run.

RELIABILITY
  /VARIABLES=item1 to item10
  /FORMAT=NOLABELS
  /SCALE(ALPHA)=ALL
  /MODEL=ALPHA
  /STATISTICS=SCALE
  /SUMMARY=TOTAL.

The SPSS output window will show the following table. The column labeled Corrected Item-Total Correlation provides the point-biserial correlation.

Table 5: Item Point-Biserial Output from SPSS

Items  Scale Mean if   Scale Variance    Corrected Item-     Cronbach's Alpha
       Item Deleted    if Item Deleted   Total Correlation   if Item Deleted
i1     4.7778          5.194             0.457               0.746
i2     4.7778          5.444             0.286               0.763
i3     4.7778          5.694             0.122               0.779
i4     5.0000          4.250             0.728               0.699
i5     4.8889          4.861             0.486               0.739
i6     5.1111          4.611             0.491               0.739
i7     5.3333          4.500             0.589               0.722
i8     5.3333          4.750             0.459               0.743
i9     5.4444          5.278             0.260               0.770
i10    5.5556          5.278             0.399               0.752

"Corrected" point-biserial correlation indicates that the item score is excluded from the total score before computing the correlation (as we did manually in Excel). This is a minor but important detail because inclusion of the item score in the total score can artificially inflate the point-biserial value (due to correlation of the item score with itself).

Interpretation of Results

Interpretation of point-biserial correlations

Referring to Tables 4 and 5, all items except for Item 3 – which has a point-biserial of 0.12 – show acceptable point-biserial values.

A low point-biserial implies that students who got the item incorrect scored high on the test overall while students who got the item correct scored low on the test overall. Therefore, items with low point-biserial values need further examination. Something in the wording, presentation or content of such items may explain the low point-biserial correlation. However, even if nothing appears visibly faulty with such items, it is recommended that they be removed from scoring and future testing. When evaluating items it is helpful to use a minimum threshold value for the point-biserial correlation. A point-biserial value of at least 0.15 is recommended, though our experience has shown that "good" items have point-biserials above 0.25.

Interpretation of p-values

The p-value of an item provides the proportion of students that got the item correct, and is a proxy for item difficulty (or more precisely, item easiness). Refer to Table 4 and note that Item 10 has the lowest p-value (0.11). A brief examination of the data matrix explains why – only one student got that item correct. The highest p-value, 0.89, is associated with three items: 1, 2 and 3. Eight out of nine students got each of these three items correct. The higher the p-value, the easier the item. Low p-values indicate a difficult item. In general, tests are more reliable when the p-values are spread across the entire 0.0 to 1.0 range with a larger concentration toward the center, around 0.50. (Note: Better item difficulty statistics can be computed using psychometric models, but p-values give a reasonable estimate.)

The relationship between point-biserial correlations and p-values

Problematic items (items with a low point-biserial correlation) may show high p-values, but the high p-values should not be taken as indicative of item quality. Only the point-biserial should be used to judge item quality.
Our sample data matrix contains two items that appear to have "conflicting" p-value and point-biserial statistics. One is Item 3, which has a low point-biserial (0.12) but a high p-value (0.89); the second is Item 10, which has a high point-biserial (0.40) but a low p-value (0.11).

Examination of the data matrix shows that Item 3 was answered incorrectly by Kid-G. Even though Kid-G did not correctly answer Item 3, she did correctly respond to Items 4 and 6, which are harder items (as indicated by their lower p-values). One explanation of how Kid-G could get the harder items correct but the easier item wrong is that she guessed on Items 4 and 6. Let us assume, however, that she did not guess and that she actually did answer Items 4 and 6 correctly. This would suggest that Item 3 measures something different from the rest of the test, at least for Kid-G, as she was unable to respond to this item correctly even though it is a relatively easy item. Although in this article we are dealing with a very small data matrix, in real life there may be a group of students like Kid-G. When faced with statistics such as we see here for Item 3 (high p-value, low point-biserial), it is recommended that the item be qualitatively reviewed for content and wording. Something in the wording of the item, its presentation or the content caused Kid-G to get it wrong and caused the low item point-biserial.

In our little sample matrix, Item 3 is a problematic item because it does not fit the model, meaning that this item behaves differently from other items for Kid-G. The model says that Kid-G should have got that item right, but she got it wrong. Even if qualitative review of the item does not reveal any obvious reason for the low point-biserial, it is often advisable that this item be removed from future testing. Items that measure another content area (also called multidimensionality) often show low point-biserials.
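The screening rule suggested above (point-biserial of at least 0.15, with "good" items above 0.25) is easy to apply mechanically once the statistics are computed. The helper below is an illustrative sketch, not part of the original procedure; only the thresholds come from the text.

```python
# Sketch: screen items by point-biserial, using the thresholds suggested above.
# Below 0.15 an item is flagged for qualitative review; 0.15-0.25 is borderline.
def screen(point_biserials, floor=0.15, good=0.25):
    """Return a review status for each item (keys are 1-based item numbers)."""
    report = {}
    for i, pb in enumerate(point_biserials, start=1):
        if pb < floor:
            report[i] = "flag: review wording/content, consider removal"
        elif pb < good:
            report[i] = "borderline: monitor this item"
        else:
            report[i] = "ok"
    return report

# Point-biserials from Table 4:
pbs = [0.46, 0.29, 0.12, 0.73, 0.49, 0.49, 0.59, 0.46, 0.26, 0.40]
for item, status in screen(pbs).items():
    print(f"Item {item}: {status}")  # only Item 3 falls below the 0.15 floor
```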

Now let us examine Item 10, which shows an opposite pattern. Item 10 has a high point-biserial (0.40) and a low p-value (0.11). This item was correctly answered by only one student (which explains the low p-value). However, the one student that got the item correct is also the "smartest" kid as measured on this test, which is why the point-biserial for the item is high. The low p-value and high point-biserial are perfectly acceptable statistics. The two numbers in this case tell us that Item 10 is a difficult item but not a problematic item.

Thus, there is no real relationship between the point-biserial correlation and p-value statistics. Problematic items will always show low point-biserial correlations, but the accompanying p-value may be low or high. The point-biserial correlation should be used to assess item quality; p-values should be used to assess item difficulty.

Problematic items and test reliability

Let us briefly refer to one other statistic reported in the SPSS output in Table 5, labeled "Cronbach's Alpha if Item Deleted", which is an indicator of overall test reliability. Cronbach's Alpha ranges from 0 to 1, with 0 indicating no test reliability and fractions approaching 1 indicating high test reliability. The SPSS output computes the reliability coefficient for the test excluding one item at a time. If the reliability increases when an item is deleted, that indicates that the item is problematic and reduces test reliability instead of increasing it. In our example, the test reliability is highest (0.779) when Item 3 is deleted. Remember that Item 3 was the most problematic item. This SPSS statistic emphasizes the point made earlier that removal of problematic items (misfitting items, multidimensional items, poorly written items) increases the overall test reliability.

Using point-biserial statistics to check multiple-choice keys

Another useful role of point-biserial correlations involves validating the multiple-choice scoring key.
The key for all items on a test is communicated from item developers to form developers, then to test administrators and finally to test scorers. In this flow of information the scoring key may be incorrectly communicated or inadvertently mis-programmed into computerized scoring programs, which can cause serious errors in student results. One way to catch such errors is to run the point-biserial statistic on all items after the tests are scored. This is a quick, easy and reliable method for catching incorrect scoring keys. Items with incorrect keys will show point-biserials close to or below zero. As a rule of thumb, items with point-biserials below 0.10 should be examined for a possible incorrect key.
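This key-check rule of thumb is easy to automate once point-biserials have been computed. A minimal sketch (the function name and example values are hypothetical): a mis-keyed item rewards the wrong response, so high scorers "miss" it and its point-biserial drops toward or below zero.

```python
# Sketch: flag possible mis-keyed items after scoring.
# The 0.10 cutoff is the rule of thumb given above; items near or below
# zero are the strongest candidates for an incorrect key.
def possible_miskeys(point_biserials, cutoff=0.10):
    """Return 1-based numbers of items whose point-biserial falls below the cutoff."""
    return [i for i, pb in enumerate(point_biserials, start=1) if pb < cutoff]

# Hypothetical scored test in which Item 4 was keyed incorrectly,
# driving its point-biserial negative:
pbs = [0.41, 0.33, 0.27, -0.35, 0.30]
print(possible_miskeys(pbs))  # prints [4]
```

On a real administration this check runs in seconds and is worth doing before any scores are reported.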
