Item Response Theory (IRT) is transforming the landscape of entrepreneurial assessment through precision adaptive testing and scientifically defensible ability estimation.
Item Response Theory (IRT) in Supsindex: Adaptive Testing & Ability Estimation
What Makes Founder Assessment Scientifically Defensible
Entrepreneurial capability is inherently latent — you cannot observe a founder’s strategic judgment directly, only its manifestations through their responses to carefully designed questions and simulated scenarios. The challenge, then, is to move from observed responses to an accurate estimate of underlying ability, while acknowledging the psychometric reality that measurement is never perfect. This is precisely where Item Response Theory (IRT) becomes indispensable.
Why Classical Test Theory Falls Short Without Item Response Theory (IRT)

Traditional approaches, grounded in Classical Test Theory (CTT), have dominated assessment for decades. In CTT, a test taker’s observed score is understood as the sum of two components: a true score (the hypothetical average score if the test were taken an infinite number of times) and an error component. While intuitive, this framework carries two significant limitations that become consequential in high-stakes entrepreneurial assessment.
First, CTT assumes that measurement error is constant across all ability levels — that is, the standard error of measurement is a single number applied uniformly to every test taker. In reality, measurement precision varies meaningfully depending on where a founder stands along the ability continuum. For example, an assessment tool may be quite accurate at distinguishing average performers but considerably less precise at the extremes (very low or very high ability), yet CTT provides no mechanism to capture this variation.
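For reference, the CTT decomposition and its single standard error of measurement (SEM) can be written as follows. The symbols are the conventional textbook ones rather than notation taken from Supsindex: X is the observed score, T the true score, E the error, \sigma_X the standard deviation of observed scores, and \rho_{XX'} the test's reliability coefficient.

```latex
X = T + E,
\qquad
\mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}
```

Because \sigma_X and \rho_{XX'} are properties of the whole test in a given sample, the resulting SEM is one number shared by every test taker, which is the uniformity described above.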
Second, CTT’s item statistics are sample-dependent. The difficulty of a question is defined as the proportion of test takers who answer it correctly in a given sample. This means that an item’s difficulty index changes when administered to a different group — a high-performing cohort will make every question appear easier, while a weaker cohort will make the same questions appear harder. This instability fundamentally limits the comparability of scores across different test administrations. More troubling still, research demonstrates that CTT’s equating errors range between 0.7 and 1.6, while IRT-based methods achieve errors between 0.2 and 0.6 — a difference of practical significance when decisions about capital allocation, accelerator admission, or team formation hang in the balance. Moreover, multiple empirical studies have confirmed that IRT produces significantly less measurement error than CTT, and that IRT item difficulty estimates remain stable across different samples — a property CTT conspicuously lacks.
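The sample dependence of the CTT difficulty index can be illustrated with a small numerical sketch. It assumes a 2PL response curve and normally distributed cohorts; the item parameters and cohort means are hypothetical values chosen for illustration, not Supsindex data.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response under a 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def expected_p_value(a, b, cohort_mean, cohort_sd=1.0, steps=2000):
    """CTT difficulty index: expected proportion correct in a cohort.

    Integrates the response curve against a normal ability distribution
    with a midpoint rule over +/- 6 standard deviations.
    """
    lo = cohort_mean - 6.0 * cohort_sd
    width = 12.0 * cohort_sd / steps
    total = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        density = normal_pdf(theta, cohort_mean, cohort_sd)
        total += icc_2pl(theta, a, b) * density * width
    return total

# One hypothetical item, administered to two hypothetical cohorts.
a, b = 1.2, 0.0
p_weak = expected_p_value(a, b, cohort_mean=-1.0)
p_strong = expected_p_value(a, b, cohort_mean=1.0)
print(f"weaker cohort proportion correct:   {p_weak:.2f}")
print(f"stronger cohort proportion correct: {p_strong:.2f}")
```

The item's (a, b) parameters are identical in both runs, yet its proportion-correct index differs between the cohorts. That is the instability the paragraph above describes.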
How Item Response Theory (IRT) Addresses These Gaps in Adaptive Testing

IRT represents a fundamentally different measurement paradigm. Rather than treating test performance as a simple sum of correct responses, IRT models the probabilistic relationship between a respondent’s latent ability (denoted conventionally as θ) and their probability of answering each item correctly. The mathematical foundation is an Item Characteristic Curve (ICC) — a monotonic function that describes, for any given level of ability, the likelihood of a correct or specific response. The shape and position of this curve are governed by parameters estimated separately for each item, enabling IRT to address the limitations of CTT directly.
IRT offers three principal advantages for entrepreneurial assessment:
- Variable measurement precision. Unlike CTT’s uniform standard error, IRT calculates measurement precision as a function of ability level. Items contribute most information near their difficulty point, where the ICC is steepest. This means Supsindex can quantify, for each founder, exactly how confident we are in their estimated ability — and where the measurement carries more uncertainty.
- Sample-independent item parameters. Once an item’s difficulty (the ability level at which a test taker has a 50% probability of answering correctly) and discrimination (the steepness of the ICC at its inflection point, measured by parameter a) are calibrated on a large sample, they are in theory invariant across populations, up to a linear rescaling of the ability scale and provided the model fits the data. This property makes it possible to build a bank of calibrated items that can be used interchangeably across different test forms while maintaining score comparability.
- Person-invariant measurement. Just as item parameters are independent of the sample, a test taker’s estimated ability is theoretically independent of which particular items they receive — provided all items are calibrated onto the same scale. This property underpins the possibility of computerized adaptive testing.
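The first advantage in the list above, precision that varies with ability, can be made concrete with a minimal sketch. It assumes the two-parameter logistic form of the ICC, in which item information is a² · P · (1 − P) and the standard error at a given θ is the inverse square root of the summed information. The four (a, b) pairs are hypothetical, not calibrated Supsindex items.

```python
import math

def icc_2pl(theta, a, b):
    """Item Characteristic Curve: P(correct | theta) under a 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one 2PL item; it peaks at theta == b."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def standard_error(theta, items):
    """Standard error of the ability estimate at theta for a set of items."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)

# Hypothetical (discrimination a, difficulty b) pairs.
items = [(1.0, -1.0), (1.5, 0.0), (1.2, 0.5), (0.8, 1.0)]

for theta in (-3.0, 0.0, 3.0):
    print(f"theta={theta:+.1f}  SE={standard_error(theta, items):.2f}")
```

With these values the standard error is smallest near the middle of the scale, where the item difficulties cluster, and grows toward both extremes. CTT's single SEM cannot express this pattern.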
Operationalizing Item Response Theory (IRT) Within Supsindex’s Assessment Philosophy
At Supsindex, our commitment to IRT is not merely technical — it flows directly from our core belief that subjective judgment must be replaced with objective, evidence-based measurement. The startup ecosystem has long suffered from assessment practices that are slow, ambiguous, costly, and unreliable — methods that default to intuition precisely where the risk is highest. IRT provides the scientific scaffolding to move beyond this impasse. By modeling the relationship between founder responses and underlying entrepreneurial knowledge (FPA), behavioral judgment (GEB), and ecosystem literacy (EEA), we can quantify measurement uncertainty with mathematical rigor rather than concealing it behind false precision.
Our implementation varies by index to match the specific measurement challenges each dimension presents:
- For the FPA Index (Founder Public Awareness), we apply a 2-Parameter Logistic (2PL) IRT model. This estimates two parameters per item: difficulty (b), representing the ability level at which a founder has a 50% probability of answering correctly, and discrimination (a), representing how sharply the item separates high-ability from low-ability founders. The 2PL model is appropriate for knowledge-focused assessment where guessing is not a dominant concern (distractors in our multiple-choice items are carefully engineered to be plausible without yielding undue guessing advantage). Our 2PL IRT implementation is protected under international patent law (PCT/IB2025/056957).
- For the GEB Index (General Entrepreneurial Behavior), we employ a Thurstonian Item Response Theory (T-IRT) model, which is specifically designed for forced-choice format items where test takers select the “most effective” and “least effective” response among options. Traditional IRT models assume that item responses are locally independent (independent of one another once the latent trait is accounted for), but the ranked options within a forced-choice block violate this assumption. Thurstonian IRT recovers latent utility values for each choice option, overcoming the ipsative bias problem that renders conventional reliability coefficients (like Cronbach’s alpha) mathematically invalid for such instruments. This approach enables us to measure behavioral traits like resilience, adaptability, and ethical judgment without forcing artificial trade-offs between them.
- For the EEA Index (Ecosystem Environmental Awareness), cross-ecosystem fairness is a non-negotiable requirement. To ensure that a given score reflects the same level of mastery regardless of whether a founder is being assessed on a relatively permissive regulatory environment (e.g., Delaware) or a more constrained one (e.g., Germany), we employ anchor item equating. At least 20% of the questions in every EEA test deck are universal anchor items — identical across regions and industries. Performance on these anchors is used to statistically adjust the difficulty of context-specific questions, ensuring comparability without sacrificing local relevance.
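The anchor-item adjustment described for the EEA Index can be sketched with mean-sigma linking, one common textbook method. The text above does not say which linking method Supsindex uses, so the choice of method, the function names, and all parameter values here are assumptions made for illustration.

```python
from statistics import mean, pstdev

def mean_sigma_link(anchor_b_new, anchor_b_ref):
    """Slope A and intercept B that map the new scale onto the reference scale.

    Both lists hold difficulty estimates for the SAME anchor items,
    obtained from two separate calibrations.
    """
    A = pstdev(anchor_b_ref) / pstdev(anchor_b_new)
    B = mean(anchor_b_ref) - A * mean(anchor_b_new)
    return A, B

def rescale_item(a, b, A, B):
    """Since theta_ref = A * theta_new + B: b_ref = A * b + B, a_ref = a / A."""
    return a / A, A * b + B

# Hypothetical anchor difficulties from a reference and a regional calibration.
anchor_b_ref = [-1.0, -0.2, 0.4, 1.2]
anchor_b_new = [-1.5, -0.7, -0.1, 0.7]

A, B = mean_sigma_link(anchor_b_new, anchor_b_ref)

# A hypothetical context-specific item calibrated only in the regional deck.
local_a, local_b = 1.1, 0.3
linked_a, linked_b = rescale_item(local_a, local_b, A, B)
print(f"A={A:.2f}, B={B:.2f}, linked difficulty={linked_b:.2f}")
```

In this toy example the regional calibration places every anchor half a unit lower than the reference calibration does, so the context-specific item is shifted up by the same amount before scores are compared.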
What Would Be Lost Without Item Response Theory (IRT)
To appreciate the value IRT delivers, consider the counterfactual: a Supsindex built on CTT principles. The consequences would ripple across every function the platform serves:
- For founders, a CTT-based assessment would provide a single error estimate — say, ±5 points — applied uniformly to all test takers. A founder scoring near the distinguishing threshold for certificate eligibility (top quartile) would have no way of knowing whether the measurement error at that specific ability level was larger or smaller than this single value suggests. This opacity would make it impossible to interpret whether a borderline score represented genuine capability or mere measurement imprecision.
- For investors and accelerator partners, CTT’s sample-dependent item statistics would mean that an assessment calibrated on one population (e.g., early-stage founders in the United States) might not be directly comparable when administered to a different population (e.g., growth-stage founders in Southeast Asia). The ability to benchmark founders against global peers — a core value proposition of Supsindex — would be seriously compromised.
- For the platform’s scientific credibility, the absence of IRT would preclude the possibility of computerized adaptive testing (CAT), a methodology that dramatically reduces test length while preserving or even improving measurement precision. CAT has been successfully deployed in high-stakes domains ranging from medical licensing examinations to professional certification, precisely because it builds on IRT’s mathematical guarantees about ability estimation. Without IRT, this pathway to reducing test-taker burden while improving measurement efficiency would be closed entirely.
Moreover, IRT’s framework for Differential Item Functioning (DIF) analysis — which detects whether an item favors or disadvantages a demographic group (such as gender, ethnicity, or nationality) after controlling for ability — would be unavailable. DIF represents systematic measurement bias that threatens the validity of group comparisons. Without the ability to detect and remove biased items, Supsindex could inadvertently penalize founders from certain backgrounds, undermining the principle that talent is universally distributed even if opportunity is not.
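One simple way to picture an IRT-based DIF check is to calibrate the same item separately in two groups, place both calibrations on a common scale, and measure the area between the two response curves. The sketch below assumes a 2PL model. The parameter values and the flagging threshold are hypothetical, and an operational analysis would add a significance test rather than rely on a fixed cut-off.

```python
import math

def icc_2pl(theta, a, b):
    """P(correct | theta) under a 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def unsigned_area(params_ref, params_focal, lo=-6.0, hi=6.0, steps=2400):
    """Midpoint-rule approximation of the area between two ICCs on [lo, hi]."""
    width = (hi - lo) / steps
    area = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        gap = icc_2pl(theta, *params_ref) - icc_2pl(theta, *params_focal)
        area += abs(gap) * width
    return area

# Hypothetical (a, b) estimates for one item in two groups, already linked
# to the same ability scale.
reference_group = (1.2, 0.0)
focal_group = (1.2, 0.5)

FLAG_THRESHOLD = 0.3  # arbitrary illustrative cut-off, not a Supsindex rule

area = unsigned_area(reference_group, focal_group)
verdict = "review item" if area > FLAG_THRESHOLD else "no flag"
print(f"area between curves: {area:.2f} -> {verdict}")
```

When the two groups share the same discrimination, the area is approximately the gap between their difficulty estimates. Here the item behaves as if it were about half a unit harder for the focal group at every ability level.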
Our Calibration Roadmap for Item Response Theory (IRT)

Building an IRT-based assessment system requires calibration — the process of administering items to a representative sample and estimating their difficulty, discrimination, and (where applicable) guessing parameters with sufficient precision that they can be trusted for subsequent operational use. To date, Supsindex has completed calibration on a baseline sample of 300 founders distributed across our initial industry and ecosystem coverage. While this sample size is adequate for basic parameter estimation (published research suggests that sample sizes as small as 300 respondents per item can be adequate for estimating ability and classifying examinees), it represents a founding iteration rather than a final state. Larger calibration samples yield narrower confidence intervals around parameter estimates, and for the more complex 3PL model (which includes a guessing parameter for items where pure guessing is a meaningful concern), common recommendations suggest at least 1,000 respondents. Our scientific roadmap therefore targets a calibration expansion to 3,000 respondents by the end of 2026.
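As a rough guide to what the expansion buys, the standard error of a maximum-likelihood item parameter estimate shrinks approximately with the square root of the calibration sample size N. This is an asymptotic rule of thumb that assumes a similar ability distribution in both samples, not a result reported by Supsindex:

```latex
\mathrm{SE}(\hat{b}) \propto \frac{1}{\sqrt{N}}
\quad\Longrightarrow\quad
\frac{\mathrm{SE}_{N=300}}{\mathrm{SE}_{N=3000}} \approx \sqrt{\frac{3000}{300}} = \sqrt{10} \approx 3.16
```

Under that approximation, moving from 300 to 3,000 respondents would narrow the confidence intervals around item parameters to roughly a third of their current width.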
This expanded sample will enhance parameter stability across our 55+ industries and 20+ ecosystems, enable finer-grained differentiation among high-ability founders (where small differences in knowledge or judgment can have large consequences for venture outcomes), and allow us to build item banks large enough for full-scale computerized adaptive testing. As the calibration sample grows, we will also pursue cross-ecosystem equating studies to establish that our difficulty parameters remain stable across international contexts. This research program has already begun through our Faculty network.
Toward Computerized Adaptive Testing Using Item Response Theory (IRT)
The same IRT foundation that powers precision measurement also enables Computerized Adaptive Testing (CAT) — a delivery model in which the assessment system selects each subsequent item based on the test taker’s current estimated ability. Founders who demonstrate strong knowledge early receive more challenging items to refine the estimate; those who struggle receive easier items to ensure their ability can still be estimated from lower-difficulty content. CAT offers three benefits directly aligned with Supsindex’s mission to make entrepreneurial assessment efficient and accessible:
- Reduced test length: Because items are targeted to the test taker’s ability level, fewer items are needed to achieve a given level of measurement precision. Research in comparable high-stakes domains shows that CAT can reduce test length by 50% or more while maintaining or improving measurement accuracy.
- Improved test-taker experience: Removing items that are far too easy or far too difficult — where a founder gains little information but expends significant time and mental energy — makes the assessment feel appropriately challenging rather than either boring or demoralizing.
- Enhanced security: With an item bank sufficiently large, no two test takers need receive the same set of items, reducing the risk of item exposure and answer sharing.
We anticipate launching CAT for select indices following the completion of our 3,000-respondent calibration target, with a phased rollout that prioritizes indices (such as FPA) where item bank size and calibration stability are already sufficient.
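The adaptive loop described in this section can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the planned Supsindex implementation. It assumes 2PL items, picks the unused item with maximum information at the current estimate, updates ability with an expected a posteriori (EAP) estimate under a standard normal prior, and stops at a target standard error. The item bank, the stopping values, and the deterministic respondent are all hypothetical.

```python
import math

GRID = [-4.0 + 0.1 * i for i in range(81)]  # theta grid used for EAP

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def eap_estimate(responses):
    """Posterior mean and SD of theta under a standard normal prior.

    responses: list of ((a, b), correct) pairs.
    """
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) prior
        for (a, b), correct in responses:
            p = icc_2pl(theta, a, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_target=0.4, max_items=30):
    """Administer items adaptively until the SE target or item cap is hit."""
    remaining = list(bank)
    responses = []
    theta, se = 0.0, 1.0  # prior mean and SD before any response
    while remaining and len(responses) < max_items and se > se_target:
        # Select the unused item that is most informative at the current estimate.
        item = max(remaining, key=lambda ab: item_information(theta, *ab))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta, se = eap_estimate(responses)
    return theta, se, len(responses)

# Hypothetical bank: 41 items with a = 1.5 and b spread from -2 to 2.
bank = [(1.5, -2.0 + 0.1 * i) for i in range(41)]

TRUE_THETA = 1.05  # hypothetical respondent

def deterministic_founder(item):
    """Toy respondent: correct whenever the item difficulty is below TRUE_THETA."""
    _, b = item
    return b <= TRUE_THETA

theta_hat, se, n_used = run_cat(bank, deterministic_founder)
print(f"estimate={theta_hat:.2f}, SE={se:.2f}, items used={n_used} of {len(bank)}")
```

A real respondent answers probabilistically rather than by a fixed rule, and an operational CAT would also need exposure control and content balancing. The sketch still shows the core mechanism: because each item is targeted at the current estimate, the loop can stop well before the bank is exhausted.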
Closing Remarks on Item Response Theory (IRT)
Item Response Theory is not merely a statistical detail — it is the organizing scientific framework that makes Supsindex’s core promise possible: replacing guesswork with measurement, intuition with evidence, and subjective judgment with objective, defensible data. By modeling the probabilistic relationship between founder responses and underlying capability, quantifying measurement uncertainty, and enabling sample-independent item parameters, IRT transforms entrepreneurial assessment from an art into a science.
Our calibration roadmap, from the current 300-founder baseline to the 3,000-founder target by the end of 2026, reflects our commitment to continuous improvement and scientific rigor. As the sample grows, so too will the precision of every estimate, the fairness of every comparison, and the confidence that founders, investors, and institutions can place in our results. In a world where the cost of misjudging founder potential is measured in wasted capital, lost opportunity, and unrealized human potential, measurement science is not optional. It is essential. And IRT is how we meet that obligation.