
Item Response Theory (IRT) in Educational Assessment

Item Response Theory (IRT) is a statistical framework used to design, analyze, and score educational and psychological tests. It models the relationship between an examinee's performance on individual test items and an unobservable trait or ability, often called a latent trait (e.g., mathematical ability, reading comprehension).

Description and Analysis of Tests

IRT goes beyond traditional scoring methods by providing detailed information about the test and the test-takers:

Item Parameters: Each test item is statistically characterized by parameters that describe its properties, such as:

Difficulty (b): The ability level at which an examinee has a 50% chance of answering the item correctly (when no guessing parameter is modeled; with guessing, the probability at this point is halfway between c and 1).

Discrimination (a): How well the item differentiates between examinees with ability levels near the item's difficulty.

Guessing (c): The probability that an examinee with very low ability can still answer the item correctly by guessing (used mainly for multiple-choice questions).
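These three parameters are typically combined in the three-parameter logistic (3PL) model, which gives the probability that an examinee with ability θ answers an item correctly. Below is a minimal sketch in Python (assuming NumPy is available); the item values are invented purely for illustration:

```python
import numpy as np

def p_correct(theta, a, b, c):
    """3PL model: probability of a correct response at ability theta.
    a = discrimination, b = difficulty, c = guessing (lower asymptote)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, 4-option guess rate
print(p_correct(theta=0.0, a=1.2, b=0.0, c=0.25))   # 0.625 when theta equals b
print(p_correct(theta=-3.0, a=1.2, b=0.0, c=0.25))  # approaches c for very low ability
```

Note that the curve never drops below c, which is why the 50% point shifts upward once guessing is modeled.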

Ability Estimation (Invariance): A major advantage of IRT is that an examinee's ability estimate (often denoted θ) does not depend on the specific set of items administered, provided those items measure the same latent trait. Likewise, item parameter estimates do not depend on the particular group of people taking the test. This is an improvement over Classical Test Theory, where item statistics and test scores are tied to the particular sample and test form.
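To make ability estimation concrete, here is an illustrative sketch (not a production estimator) that finds the θ maximizing the likelihood of an observed response pattern, given item parameters assumed to be already calibrated; the five items and the response pattern are made up for the example:

```python
import numpy as np

def p_correct(theta, a, b, c):
    # Same 3PL response function as in the sketch above
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(responses, a, b, c, grid=np.linspace(-4, 4, 801)):
    """Maximum-likelihood estimate of theta via a simple grid search."""
    log_lik = []
    for theta in grid:
        p = p_correct(theta, a, b, c)   # P(correct) for each item at this theta
        log_lik.append(np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p)))
    return grid[int(np.argmax(log_lik))]

# Five hypothetical calibrated items and one examinee's 0/1 response pattern
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
c = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
responses = np.array([1, 1, 1, 0, 0])
print(estimate_theta(responses, a, b, c))   # theta estimate near 0 for this pattern
```

Because the calibrated item parameters, not the raw number-correct score, drive the estimate, two examinees who answered different item sets can still be placed on the same θ scale.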

Computerized Adaptive Testing (CAT): IRT enables Computerized Adaptive Testing (CAT), where a computer selects the next test item based on the examinee's performance on previous items. This ensures the test is constantly adapted to provide the most informative items for that individual's ability level, leading to more precise measurement with fewer questions.
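A hedged sketch of the item-selection step: at each point in the test, a CAT typically administers the unanswered item that carries the most Fisher information at the examinee's current ability estimate. The item bank below is hypothetical, and real CAT systems add exposure control and content constraints on top of this rule:

```python
import numpy as np

def p_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a**2 * ((p - c) / (1.0 - c))**2 * (1.0 - p) / p

def pick_next_item(theta_hat, a, b, c, administered):
    """Return the index of the most informative item not yet administered."""
    info = item_information(theta_hat, a, b, c)
    info[list(administered)] = -np.inf          # never repeat an item
    return int(np.argmax(info))

# Hypothetical four-item bank; current ability estimate is 0.3, item 0 already given
a = np.array([1.0, 1.4, 0.7, 1.2])
b = np.array([-1.0, 0.2, 0.8, 1.5])
c = np.array([0.2, 0.2, 0.2, 0.2])
print(pick_next_item(0.3, a, b, c, administered={0}))   # selects item 1 (difficulty near 0.3)
```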

Application Example: Standardized Achievement Tests

IRT is the foundation for major standardized achievement tests (e.g., those used for college admissions or state-wide academic proficiency).

It allows test developers to create and maintain item banks—large collections of calibrated questions—for generating multiple, non-identical test forms that are statistically equivalent in difficulty.

It supports equating, meaning a score on a test form given one year can be compared accurately with a score on a form given in a different year, even though the specific questions differ.

It helps identify item bias through Differential Item Functioning (DIF) analysis, which checks whether questions unfairly favor or disadvantage particular subgroups (e.g., by gender or ethnicity) after controlling for overall ability.
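As one illustration of a DIF screen (operational testing programs use a variety of methods), logistic regression can test whether group membership still predicts success on an item after conditioning on an ability proxy such as total score. The sketch below uses synthetic data and assumes pandas and statsmodels are available:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000

# Synthetic examinees: total test score as an ability proxy and a 0/1 group indicator
group = rng.integers(0, 2, n)
score = rng.normal(50, 10, n)

# Simulate an item that is harder for group 1 even at the same total score (built-in DIF)
true_logit = 0.1 * (score - 50) - 0.8 * group
correct = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

df = pd.DataFrame({"correct": correct, "score": score, "group": group})

# DIF screen: does 'group' still predict the item response once 'score' is controlled for?
model = smf.logit("correct ~ score + group", data=df).fit(disp=0)
print(model.summary())   # a significant negative 'group' coefficient flags uniform DIF
```

Adding a score-by-group interaction term to the formula extends the same screen to non-uniform DIF, where the amount of bias differs across ability levels.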