Introduction

Computer adaptive testing (CAT) is widely recognized in the psychometric community for its powerful capability to precisely estimate examinee ability in short testing periods. Adaptive testing achieves this end by estimating the ability of the examinee at designated points during the test and then tailoring the selection of subsequent items to those items that will provide the most information about the examinee. The appeal of a shorter testing time makes an adaptive testing approach highly desirable for use in multiple assessment and learning contexts. Computerized adaptive assessment is currently used in a number of contexts in the educational ecosystem, including as a diagnostic to initialize a learner’s location within an adaptive instructional system (e.g., Scootpad by ACT, IXL, Exact Path by Edmentum, i-Ready by Curriculum Associates), in interim benchmark assessment (e.g., NWEA MAP), as end-of-course or end-of-unit assessment (e.g., NWEA End-of-Course), and in summative assessments meant either for individual high-stakes decisions or as accountability measures (e.g., Graduate Management Admission Test, Smarter Balanced).

However, for those who have been tasked with designing, configuring, and deploying adaptive tests for operational use at scale, authoring an adaptive test is anything but simple. The process often involves a complex interplay among psychometricians, content experts, and technologists who operate with different vocabularies and subject matter expertise. The challenges of shifting test assembly to smart tools mirror the challenges in shifting item authoring to automatic question generation. Manually assembling multiple exams from a single, large item pool demands “training, experience, and resources” (Kurdi et al. 2020). Similar challenges are also described in the authoring of intelligent tutoring systems (Dermeval et al. 2018; Sottilare et al. 2018). The operational efficiencies gained by embedding feedback during the test assembly process reduce the number of hand-offs between the groups of experts and remove subjective evaluations from the test authoring process. Therefore, a platform that can simplify those exchanges while allowing for the complexities of each area of interest is paramount to efficient operational processes for designing, configuring, and deploying adaptive assessments at scale. In addition, embedding smart feedback into the platform can reduce rework and frustration at later steps. This paper presents the authors’ experience of developing smart platforms for designing, configuring, and deploying adaptive assessments, along with findings from a survey and cognitive labs with users of a platform we will highlight, Echo-Adapt, which is currently used operationally. It is our intent to outline the many considerations and trade-offs discovered during development and to present the choices made with regard to the adaptive algorithm that allows for the separation of concerns necessary for a productive user experience.

Process of Building an Adaptive Assessment

To design and configure an adaptive assessment for deployment, several steps are typically followed as illustrated in Fig. 1. These steps include: 1) identifying the item pool to be used, 2) selecting the adaptive item selection algorithms and configuring related parameters, 3) specifying content and other test constraints, 4) simulating the assessment, and 5) deploying the assessment for live test delivery. Steps 1 and 3 are typically conducted by a content expert, step 2 is typically conducted by a psychometrician, and step 5 is typically conducted by a technologist. To the extent configuration parameters are embedded directly into the adaptive algorithm software code, many of these steps may involve the technologist as well to adjust the code. Within each of these steps are typically several sub-steps that can be completed in any order. In addition, the process of configuring an adaptive assessment is typically iterative. Step 1 may be repeated should the desired statistical attributes and content and other test constraints be infeasible given the identified item pools. Steps 2-4 may also be adjusted multiple times following observations during simulation of the assessment. It is for this reason that a smart authoring tool can easily introduce operational efficiencies. In the authors’ experience, embedding smart capabilities during step 3 results in the most substantial efficiency gains.

Fig. 1 Process of designing and configuring an adaptive assessment

The reader will note that the aforementioned process does not include steps associated with the generation of item content or the estimation or prediction of the statistical attributes of that content. While these are critically important processes that precede the selection of an item pool for adaptive assessment, and smart systems developed to support them will also benefit from the design principles presented here, descriptions of specific technical solutions for content generation are beyond the scope of this paper. The reader may refer to Kurdi et al. (2020) and Pandarova et al. (2019) as examples of the complexities of those processes.

Item Pool Specification

First, a content developer must determine the pool (collection of items and stimuli) that will be used for the assessment. The choice of CAT pool is linked directly to the purpose of the assessment and the inferences to be made from test scores, and therefore the item pool requirements may include the input of content developers, psychometricians, scientists, test security analysts, and so on. For example, pools associated with high-stakes CAT assessments may require different numbers of items across the ability distribution. For these tests, substantially more items may be required, particularly at the upper end of the difficulty range, than for pools from which non-adaptive linear tests are formed, in order to prevent overexposure of items associated with higher scores (although item selection algorithms can also help with this, the pool still must have sufficient items for them to work well). Pools for diagnostic assessments are developed with less concern about exposure and more emphasis on having sufficient item coverage for all characteristics required for the diagnosis to be supported. Pools associated with certification tests around a single cut score should feature items with content and statistical characteristics that promote efficiency and precision around the cut. One common misconception about item pools for adaptive testing is that CAT pools should only be composed of discrete and dichotomous items. However, sophisticated adaptive algorithms can administer item pools of discrete items or pools composed of sets of items around stimuli, adapting prior to each set of items or even within the sets of items.

The efficiency of the adaptive exam depends on the robustness of the item pool and on how the estimated item difficulties, discriminations, and all content constraints of interest intersect. For that reason, the selected item pool must include the item and stimulus metadata required for adaptive decisions. Statistical metadata aligned with the chosen psychometric model must be available for each item and shared stimulus throughout the pool. Those statistics must be available to the adaptive engine to make optimal item selections and to employ stopping rules. Content metadata must be available for each item and stimulus to ensure that the blueprint requirements are met.

Adaptive Item Selection Algorithms and Related Parameters

In this step, critical decisions are made with regard to the algorithms used to estimate the examinee’s ability throughout and at the end of the assessment, how to handle selection of the first item before an estimated ability is available, and how to impose item and stimulus exposure control for test security and item pool usage needs. In some cases, the objective to be optimized during item selection may also be specified; in many cases, and in our later example, this is taken to be the test information function.

Content and Other Test Constraints

Additional constraints must be applied to passage and item selection to ensure the assessment meets the test blueprint as well as other non-statistical desirable characteristics. Often, additional content constraints are needed to successfully sample a domain or to support reporting at a finer grain level. For example, a mathematics assessment may need to have a specified number of items measuring algebra, a specified number of items measuring geometry, and a specified number of items measuring statistics. Controlling other constraints such as depth of knowledge level, gender/diversity codes, content balancing, and item type (multiple choice, technology enhanced, and constructed response) helps support high-quality sampling of evidence from the examinee. Meanwhile, constraints on attributes of shared stimuli, such as total word count and genre, and on the patterns of answer options used help give examinees more parallel experiences. A summative standardized assessment used for high-stakes purposes may have hundreds of constraints; shorter adaptive assessments for formative inferences may have only a few. Regardless, each of these constraints must be communicated to the engine by the user configuring the test.
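As a concrete illustration of the kind of blueprint constraints described above, the short Python sketch below shows one way such requirements might be expressed as data before being translated into an engine’s selection constraints. The attribute names, values, and counts are hypothetical and are not drawn from any particular operational blueprint or from Echo-Adapt’s schema.

```python
# Hypothetical blueprint fragment: each row bounds how many administered items
# may carry a given metadata value. Names and counts are illustrative only.
blueprint_constraints = [
    # (attribute,          value,                  min_count, max_count)
    ("domain",             "algebra",              10, 12),
    ("domain",             "geometry",              8, 10),
    ("domain",             "statistics",            6,  8),
    ("depth_of_knowledge", "3",                     4, 40),
    ("item_type",          "constructed_response",  2,  4),
]

def check_blueprint(selected_items, constraints):
    """Return the constraints violated by a selected set of items, where each
    item is a dict of metadata fields (e.g., {"domain": "algebra", ...})."""
    violations = []
    for attribute, value, lo, hi in constraints:
        n = sum(1 for item in selected_items if item.get(attribute) == value)
        if not lo <= n <= hi:
            violations.append((attribute, value, n))
    return violations
```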

Simulation of the Configured Assessment

Once the adaptive algorithms and their parameters are specified and content and other constraints are applied, it is common practice to run simulations to ensure that the resulting adaptive tests meet psychometric and content requirements. During this phase, statistics such as root mean square error (RMSE) and bias are carefully examined. Adherence to the content and other test constraints is also examined. Should the simulation reveal concerns, content developers and psychometricians will return to steps 2 and 3 to make adjustments. In some cases, a return to step 1 may even be warranted to augment or amend the item pool. It should be noted that simulation can place a high load on the system if concurrent examinees are modeled, yet sequential processing may result in a long wait time for the simulation to finish.
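As a minimal sketch of the two recovery statistics named above, the following Python computes bias and RMSE from the true abilities used to generate simulated examinees and the final ability estimates returned by a simulation run. The function and variable names are ours, not the output schema of any particular platform.

```python
import numpy as np

def recovery_stats(theta_true, theta_hat):
    """Bias and RMSE of ability estimates across simulated examinees."""
    err = np.asarray(theta_hat, dtype=float) - np.asarray(theta_true, dtype=float)
    bias = err.mean()                    # positive values indicate overestimation
    rmse = np.sqrt(np.mean(err ** 2))    # overall recovery error
    return bias, rmse
```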

Deploying the Total Configuration

Finally, once the simulations run without concerns, the item pool, adaptive configurations, and content and other constraints must be deployed into a live testing production environment. If the adaptive configurations or test blueprint are coded directly into the algorithms, deployment may require a software engineer to modify the engine code. In this case, longer development schedules with additional rounds of quality assurance may be expected.

System Design Principles for Authoring Adaptive Assessments at Scale and Implementation Example

In this section, we describe a variety of design principles used when building systems to facilitate the process outlined above, specifically highlighting our use of smart technologies to reduce rework and wait time. The authors have distilled these design principles from their experiences building and launching multiple adaptive assessment authoring platforms in multiple organizations.

Six design principles are presented below, along with illustrative examples from a scaled authoring platform that follows them: the commercial software-as-a-service Echo-Adapt (ACT 2020a). The paper will also reflect on user feedback associated with this example software.

  1. Architect the solution to meet non-functional requirements of scale, reliability, extensibility, and usability.

  2. Comply with interoperability standards for efficient deployment.

  3. Select algorithms and methods that enable simplicity in configuration. Only algorithms with a high degree of configurability will be able to support the following design principle.

  4. Provide a user experience that allows non-programmer content developers to fully author the adaptive assessment.

  5. Allow for simulation and adjustment of the authored assessment prior to live deployment, with quickly obtainable results and visualization. A fast feedback loop ensures shorter turn-around time in the adaptive assessment authoring process.

  6. The extent to which users are supported in the use of a new smart authoring system determines the ultimate integration with current work and processes. Therefore, engage users in the design activities, provide change management support, and include functionality that leverages automation in the areas of greatest inefficiency.

Platform Architecture

The first design principle calls for solution architecture that will enable the system to meet non-functional requirements of scale, reliability, extensibility, and usability.

Echo-Adapt is software-as-a-service built for adaptive assessment at scale. The Echo-Adapt architecture and its interaction with users and other platforms are shown in Fig. 2. Echo-Adapt consists of three major components: the intuitive UI, the Amazon Relational Database Service (RDS) database, and the CAT engine, which are deployed on Amazon Web Services (AWS) and loosely coupled by APIs to conduct various CAT-related tasks. In alignment with the commonly understood benefits of service-oriented architecture, the three separate and loosely coupled components improve the extensibility of the solution, as individual components may be updated and deployed to production at a separate cadence from the other components.

Fig. 2 Echo-Adapt architecture as software-as-a-service

Although examinees are presented with Echo-Adapt’s item choices, Echo-Adapt is not seen by the examinee during administration. Therefore, the users of the Echo-Adapt UI are not the examinees but the psychometricians and content developers who assemble the adaptive assessments prior to administration. Content developers and psychometricians can access the same CAT configuration on the Echo-Adapt UI to prepare for the same CAT administration. To achieve the efficiencies required by these users, the Echo-Adapt architecture is designed to accommodate the various technical capabilities required, and the UI has been developed by prioritizing user feedback and building a consistent experience across the application.

The CAT configuration, its associated item pool data, and interim CAT data are persisted in the AWS RDS database. The Echo-Adapt CAT engine retrieves configuration data from RDS via API calls and conducts CAT tasks in parallel, i.e., live item administrations for CAT, CAT simulations, and configuration feasibility checks. The computing capacity is pre-scaled based on the potential peak demand of CAT tasks. Thus, the CAT engine may be deployed on multiple Amazon Elastic Compute Cloud (Amazon EC2) instances to enhance system reliability and to balance runtime performance and operational cost for large-scale assessments.

The CAT engine is also optimized to achieve the required performance for large-scale assessment, i.e., less than 500 ms latency per item administration with 40,000 concurrent examinees. In live CAT administration, Echo-Adapt communicates with the test delivery platform via APIs that comply with the IMS Global Question & Test Interoperability (QTI) specification.

Interoperability for Efficient Deployment

The second design principle calls for interoperability of the system with other systems in the learning and assessment eco-system. Interoperability reduces barriers to adoption as it allows for faster cross-system integration.

Because adaptive assessment is an active area of research, and new algorithms may be desired by psychometricians, we have found that it is much more effective to decouple the adaptive testing engine and its design and configuration UI from the test delivery platform itself. When implemented in this way, the adaptive engine may be versioned independent of the test delivery platform, the adaptive engine may be used with multiple different test delivery platforms, and test delivery platforms can interact with multiple different adaptive testing engines.

The standards organization IMS Global commits to advancing technologies that can affordably scale and improve educational participation and attainment (IMS Global 2020). Among the many IMS interoperability standards that facilitate easy and secure connection between learning and assessment applications, the IMS QTI specification enables the exchange of item and test content and results data between authoring tools, item banks, test construction tools, and other related systems (IMS Global 2015).

Echo-Adapt serves as a reference implementation for the IMS Global CAT standard and may be used by member organizations to test their APIs. It is currently integrated with a QTI-compliant test delivery platform.

Configurability

The third design principle calls for careful selection of algorithms and methods that enable simplicity in configuration. Whenever possible, the system design should separate the configuration of an algorithm from the algorithmic coding itself. This allows users who are not software engineers to design, configure, and deploy an adaptive assessment from a user interface. Some adaptive algorithm choices are more amenable than others to separation of configuration from the algorithm code.

In our implementation example, an adaptive method called the shadow-test approach is used for a number of reasons, one of which is the ease of separating the algorithm code and mathematical models from the configuration of the assessment. An overview of this approach, along with brief background on the psychometric models Echo-Adapt employs in its implementation, is provided below.

Psychometric Models

To maximize the configurability, efficiency, and reliability of CAT delivery, Echo-Adapt relies on the shadow-test approach to select the optimal item for an examinee at each adaptive stage. Echo-Adapt supports adaptation at both the item and stimulus (e.g., a block or a passage) levels. While conforming to all content constraints, Echo-Adapt prioritizes the selection and delivery of stimuli with high average item information. Within a stimulus, the items with the highest information are delivered first. The shadow-test model is integrated with the 3-parameter logistic item response theory (3PL IRT) model, scoring models, and item/stimulus exposure control models. After receiving the examinee’s response, the scoring model immediately updates the examinee’s ability estimate. To meet test security requirements, the item exposure control model balances item usage for each test administration.

Response Model

Operational items used for Echo-Adapt CAT delivery are calibrated to fit the 3PL IRT model. The probability of a correct response to a dichotomous item i is

$$ p_{i}(\theta) = c_{i} + (1-c_{i})\frac{\exp{\left [a_{i}(\theta - b_{i}) \right ]}}{1+\exp{\left [a_{i}(\theta - b_{i}) \right ]}}, $$
(1)

where θ is the examinee’s ability, and \(a_{i}\), \(b_{i}\), and \(c_{i}\) are the discrimination, difficulty, and pseudo-guessing parameters of item i, respectively. The Fisher information of item i at ability θ is calculated as

$$ I_{i}(\theta) = {a_{i}^{2}}\left( \frac{1-p_{i}(\theta)}{p_{i}(\theta)}\right)\left (\frac{p_{i}(\theta)-c_{i}}{1-c_{i}} \right)^{2}. $$
(2)
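Equations 1 and 2 translate directly into code. The short Python sketch below is our illustration, not Echo-Adapt source, and is reused by later sketches in this section.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (Eq. 1)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def fisher_info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta (Eq. 2)."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2
```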

Shadow-Test Approach

At each adaptive stage in CAT, the shadow-test approach sequentially assembles an entire test form (shadow test) and administers the optimal item based on the real-time updated estimate of the examinee’s ability (van der Linden 2009). The shadow-test assembly is modeled as a mixed integer programming (MIP) problem, which optimizes a function of variables (the objective) by selecting the best possible set of decisions (Smith and Taskin 2007). In typical CAT, the shadow-test assembly MIP selects a subset of items from the item pool to maximize the test information as

$$ \begin{array}{@{}rcl@{}} \text{Maximize} \qquad && \sum\limits_{i_{j} \in S}{I_{i_{j}}(\hat{\theta})}x_{i_{j}}, \\ \text{Subject to} \qquad && \text{content specification constraints}, \end{array} $$
(3)

where S is the set of items in the item pool, \(I_{i_{j}}(\hat {\theta })\) is the Fisher information of item i associated with stimulus j at the examinee’s ability estimate \(\hat {\theta }\), and \(x_{i_{j}}\) is the binary decision variable for the selection of item \(i_{j}\) in the shadow test. \(x_{i_{j}} = 1\) indicates that item \(i_{j}\) is selected in the shadow test; otherwise, \(x_{i_{j}} = 0\). Similar to a fixed-form test, a CAT must conform to multiple content specification constraints to meet the test blueprint requirements, including but not limited to: 1) test length, 2) number of stimuli in the test, 3) number of items/stimuli with specific attributes, and 4) enemy items. An enemy-items constraint specifies that one or more items must not be administered on a test if a given item has been administered. For example, suppose item A has enemy items B and C; if item A has been administered in the test, then items B and C cannot be administered in the same test. The shadow-test approach models these constraints as MIP constraints and ensures that each shadow test complies with the test blueprint.
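A minimal sketch of the shadow-test assembly MIP in Eq. 3 follows, written with the open-source PuLP modeling library and its bundled CBC solver purely for illustration; Echo-Adapt’s actual solver and constraint set are not implied. Only test-length, previously administered item, and enemy-item constraints are modeled here, and all names are ours.

```python
from pulp import (LpProblem, LpMaximize, LpVariable, LpBinary, lpSum,
                  PULP_CBC_CMD, LpStatusOptimal)

def assemble_shadow_test(info, administered, enemy_pairs, test_length):
    """Assemble one shadow test per Eq. 3: maximize information at the current
    ability estimate subject to a few illustrative blueprint constraints.

    info:         dict item_id -> Fisher information at the current theta estimate
    administered: set of item_ids already delivered (must remain on the form)
    enemy_pairs:  iterable of (item_id, item_id) pairs that may not co-occur
    test_length:  total number of items on the form
    """
    prob = LpProblem("shadow_test", LpMaximize)
    x = {i: LpVariable(f"x_{i}", cat=LpBinary) for i in info}

    # Objective: total Fisher information of the selected form.
    prob += lpSum(info[i] * x[i] for i in info)

    # Test-length constraint and re-selection of already administered items.
    prob += lpSum(x.values()) == test_length
    for i in administered:
        prob += x[i] == 1

    # Enemy items may not appear together.
    for i, j in enemy_pairs:
        prob += x[i] + x[j] <= 1

    if prob.solve(PULP_CBC_CMD(msg=False)) != LpStatusOptimal:
        return None  # infeasible: the pool cannot support the constraints
    return [i for i in info if x[i].value() > 0.5]
```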

At the beginning of an adaptive stage, the shadow-test approach administers optimal items in two steps, as shown in Fig. 3. The first step is to construct the shadow test by solving the shadow-test assembly MIP. A shadow test consists of two parts: a set of items that have already been administered and a set of items that are unseen by the examinee. The second step is to administer the optimal item from the set of unseen items, following rules that include maximizing the item information and ensuring the correct item/stimulus order within a stimulus. When a new shadow test is assembled for the next adaptive stage, all previously administered items are constrained to be selected in the MIP model.

Fig. 3 Shadow-test assembly and item selection

Scoring Method

The CAT design principles require the immediate update of the examinee’s ability estimate after the response to the previously administered item is received. Echo-Adapt uses a simple but effective scoring method, the expected a posteriori (EAP) estimator (Bock and Mislevy 1982), to reduce the CAT cycle runtime for large-scale assessment. The EAP scoring method estimates the examinee’s ability \(\hat {\theta }\) with the associated standard error \(\hat {\sigma }\).
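As an illustration of the EAP update (not the engine’s implementation), the sketch below computes the posterior mean and posterior standard deviation on a fixed quadrature grid under a normal prior, reusing p_3pl from the earlier 3PL sketch. The grid size and prior parameters are illustrative defaults.

```python
import numpy as np

def eap_estimate(responses, items, n_quad=61, prior_mean=0.0, prior_sd=1.0):
    """EAP ability estimate and posterior SD from scored responses.

    responses: list of 0/1 scores for the administered items
    items:     list of (a, b, c) 3PL parameter tuples aligned with responses
    """
    nodes = np.linspace(-4.0, 4.0, n_quad)
    prior = np.exp(-0.5 * ((nodes - prior_mean) / prior_sd) ** 2)

    likelihood = np.ones_like(nodes)
    for u, (a, b, c) in zip(responses, items):
        p = p_3pl(nodes, a, b, c)
        likelihood *= p if u == 1 else 1.0 - p

    posterior = prior * likelihood
    posterior /= posterior.sum()
    theta_hat = float(np.sum(nodes * posterior))                       # EAP estimate
    se = float(np.sqrt(np.sum((nodes - theta_hat) ** 2 * posterior)))  # posterior SD
    return theta_hat, se
```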

Item Exposure Rate Control

The shadow-test approach supports the seamless integration of various item exposure control methods into CAT delivery, including the alpha stratification method (Chang and van der Linden 2003), the Sympson-Hetter method (Sympson and Hetter 1985), and the ineligible constraint method (van der Linden and Veldkamp 2007). Echo-Adapt is implemented with the ineligible constraint method.

In the ineligible constraint method, the item/stimulus administration eligibility is represented by an I × K probability matrix, where I is the number of items/stimuli in the pool and K is the number of contiguous intervals across the theta continuum (from \(-\infty \) to \(+\infty \)). An individual probability \(\hat {P}^{(j+1)}(E_{i}|\theta _{k})\) is calculated to determine if item/stimulus i is eligible for administration to an examinee with ability in the theta range k.

$$ \hat{P}^{(j+1)}(E_{i}|\theta_{k}) = \begin{cases} \min\left\{\frac{r_{\max} \epsilon_{ijk}}{\alpha_{ijk}}, 1 \right\}, \quad &\text{if} \quad \alpha_{ijk}>0\\ 1, &\text{otherwise} \end{cases} $$
(4)

where \(r_{\max}\) is the maximum allowed exposure rate, \(\alpha_{ijk}\) is the number of examinees through examinee j who visited theta range k and took item/stimulus i, and \(\epsilon_{ijk}\) is the number of examinees through examinee j who visited theta range k when item/stimulus i was eligible. I × K binomial experiments are then conducted with the values of \(\hat{P}^{(j+1)}(E_{i}|\theta_{k})\). If the experiment result \(X_{ik} = 0\), then item/stimulus i is ineligible at theta interval k; otherwise the item/stimulus is eligible.
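A sketch of the eligibility update in Eq. 4 and the subsequent Bernoulli experiments follows; the array shapes and names are ours, offered only to make the bookkeeping concrete.

```python
import numpy as np

def eligibility_probabilities(alpha, eps, r_max):
    """I x K eligibility probabilities per Eq. 4.

    alpha[i, k]: examinees so far who visited theta range k and took item/stimulus i
    eps[i, k]:   examinees so far who visited theta range k while i was eligible
    r_max:       maximum allowed exposure rate
    """
    p = np.ones_like(alpha, dtype=float)
    taken = alpha > 0
    p[taken] = np.minimum(r_max * eps[taken] / alpha[taken], 1.0)
    return p

def draw_ineligible(p, rng=None):
    """Run the I x K Bernoulli experiments; entries equal to 0 mark ineligibility."""
    rng = rng or np.random.default_rng()
    return rng.binomial(1, p)
```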

To avoid the infeasibility (no solutions) issue when shadow tests are being assembled, the item/stimulus ineligibility constraints are added to the MIP model as soft constraints. Specifically, a penalty term is subtracted from the objective function when ineligible items/stimuli are selected in the shadow test

$$ \text{Maximize} \qquad \sum\limits_{i_{j} \in S}{I_{i_{j}}(\hat{\theta})}x_{i_{j}} - M \sum\limits_{i_{j} \in V}x_{i_{j}} $$
(5)

where V is the set of ineligible items due to the exposure control experiment. M is selected as a value greater than the maximum item information value of the items in the pool at the current ability estimate. The penalty term avoids selecting ineligible items if feasible shadow tests exist after excluding them, because the selection of ineligible items decreases the MIP objective value that is to be maximized. Otherwise, ineligible items are still allowed for selection to prevent infeasibility and test interruption.
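In the PuLP sketch above, the penalized objective of Eq. 5 would simply replace the plain information objective; an illustrative helper is shown below, where big_m is assumed to be chosen larger than the largest item information in the pool at the current ability estimate.

```python
from pulp import lpSum

def penalized_objective(info, x, ineligible, big_m):
    """Eq. 5: subtract big_m for every ineligible item selected, so such items
    enter a shadow test only when no feasible test exists without them."""
    return (lpSum(info[i] * x[i] for i in info)
            - big_m * lpSum(x[i] for i in ineligible if i in x))
```

In assemble_shadow_test above, `prob += penalized_objective(info, x, ineligible, big_m)` would then stand in for the unpenalized objective line.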

Readers interested in the algorithms may also explore an open-source version, RSCAT (Jiang 2020), which provides the algorithms but with substantially limited user interface (UI) functionality, as it is intended for research and development rather than scaled authoring use.

User Interface Simplicity with Smart Feedback

The fourth design principle calls for a user experience that would allow non-programmer content developers to fully configure the adaptive assessment. Similar to findings in the authoring of intelligent tutoring systems, in which evidence has shown that the effectiveness, efficiency, quality of authored artifacts, and usability are improved when interfaces are available to allow non-programmers to author content (Dermeval et al. 2018), we find that following this principle substantially improves the efficiency of adaptive assessment authoring.

Echo-Adapt’s UI presents users with a way to assemble test configurations and load item pools into those configurations. The algorithm is integrated on the back end of the system and is separated from the Echo-Adapt UI as described above. Although data scientists have programmed the MIP models and MIP solver behind the scenes to generate shadow tests on the fly, the primary users of the Echo-Adapt UI are test content developers who may have minimal to no experience in programming. The interface was therefore developed to allow users to input constraints and other test variables in a transparent and familiar way without requiring any knowledge of programming, and to provide immediate, algorithmic feedback to the user on a number of dimensions. The functionalities described below provide examples of the UI simplicity and immediate, smart feedback.

Configurability and Flexibility of Item Pools

Echo-Adapt allows for a variety of different metadata fields to be uploaded inside item and stimulus pools. For items, two CSV files are required in a zip file: an item data CSV file and an item data definition CSV file. If stimuli are used, then two additional similar files are required. The data files have a small number of required fields (such as item identifiers and IRT parameters), and the data definition files allow users to name and categorize any additional metadata fields they want to include as parameters in their configuration file. The data definition files list characteristics such as the names of the attributes, the attribute categories (metadata or statistics), and the attribute type (continuous or categorical for metadata). The two-file “paradigm enables Echo-Adapt to support predefined as well as entirely user-defined item/passage attributes (e.g., for defining item/passage selection constraints)” (ACT 2020b). This high level of configurability enables a wide variety of test configurations to be supported in Echo-Adapt and eliminates the need for code deployments to manage metadata fields within the application. In addition, if a pool of both items and stimuli is uploaded, then content developers can set constraints on both of these entities to take advantage of the engine’s MIP algorithms and post-processing logic for adapting among and within sets of items with a common stimulus.
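To make the two-file paradigm concrete, the sketch below packages an item data CSV and an item data definition CSV into a single zip. The file names, column names, and category labels are hypothetical stand-ins for illustration; the authoritative field list is given in the Echo-Adapt documentation (ACT 2020b).

```python
import csv
import io
import zipfile

# Hypothetical rows: identifiers and IRT parameters plus user-defined metadata.
ITEM_ROWS = [
    {"item_id": "ITM001", "a": "1.12", "b": "-0.35", "c": "0.18",
     "domain": "algebra", "dok": "2"},
    {"item_id": "ITM002", "a": "0.87", "b": "0.40", "c": "0.21",
     "domain": "geometry", "dok": "3"},
]

# Hypothetical data definition rows describing the user-defined attributes.
ITEM_DEFS = [
    {"attribute": "domain", "category": "metadata", "type": "categorical"},
    {"attribute": "dok",    "category": "metadata", "type": "categorical"},
]

def write_pool_zip(path, item_rows, item_defs):
    """Write the item data and item data definition CSVs into one zip package."""
    def to_csv(rows):
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()

    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("items.csv", to_csv(item_rows))
        zf.writestr("item_definitions.csv", to_csv(item_defs))

# Example usage: write_pool_zip("item_pool.zip", ITEM_ROWS, ITEM_DEFS)
```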

Echo-Adapt provides two different paths for verifying the item pool in the UI. Through the item pools tab, the user can click into the pool to check the number of items and the metadata fields that have been loaded in the system. This functionality is useful for finding the correct item pool prior to associating it with a test configuration. The other path for examining an item pool is through the constraint editor in a test configuration, once an item pool has been associated with the configuration and the user is ready to filter and sort the item pool columns for constraint building. An unfiltered item pool is shown in Fig. 4.

Fig. 4 The stimulus portion of an item pool as viewed in Echo-Adapt for verification

An item pool must be selected in the UI prior to setting up constraints in a test configuration because each item pool contains unique metadata. If a user sets up a test configuration with a specific item pool, Echo-Adapt allows the user to switch item pools but requires the same columns to be present in the new pool; otherwise, the configuration could become infeasible. Users would like more flexibility in moving pools between configurations, but validating metadata against constraints is an essential component of doing so. Therefore, Echo-Adapt provides smart guardrails for the user, preventing the choice of a newly selected item pool if its columns do not match those of the original pool. The validation helps to maintain the configurability of the system by checking for consistency without requiring metadata fields to be hard-coded into the system.
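A guardrail of this kind amounts to a simple column check; a minimal sketch of the idea (ours, not the platform’s actual validation logic) is shown below.

```python
def validate_pool_swap(constraint_columns, new_pool_columns):
    """Reject a replacement item pool that lacks any metadata column referenced
    by the configuration's existing constraints."""
    missing = set(constraint_columns) - set(new_pool_columns)
    if missing:
        raise ValueError(
            f"New pool is missing columns used by constraints: {sorted(missing)}")
```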

After a test has been configured, Echo-Adapt’s item pools can be re-used across test configurations. One user noted, “It’s nice that once an item pool has been uploaded, many different trials can be run against it.” By sharing the item pools in other configurations, users gain the following efficiencies: test configurations can be copied with a button click and varied for easy comparison of simulation results; new configurations can be built from existing pools; and pools only need to exist in the system once, reducing database size and the confusion of managing duplicate pools.

Constraint Editor

The more constraints that are applied, the less adaptive the test will typically become, and the test will eventually become infeasible if the item pool cannot support the multiple multivariate constraints applied. Therefore, at this stage, smart algorithms that can provide instant feedback to the user become very important in system design.

Content developers prefer spreadsheets for listing, sorting, and filtering their item metadata. The Echo-Adapt UI was therefore designed to work like a spreadsheet so that users are more comfortable working with the test constraints. The constraint editor provides a page that displays either the uploaded item or passage data, depending on which type of constraint is being added or modified. Users “can scroll through the rows and columns of data via the vertical and horizontal scroll bars and navigate the pages of data using the navigation controls at the bottom of the page. Applying filters based on the attributes (i.e., columns) restricts the items/passages display to those matching the filter criteria” (ACT 2020b).

Figure 5 illustrates Echo-Adapt’s constraint filtering functionality and item pool feedback in the UI. Users have indicated that they like “being able to manipulate the constraints myself” and “that the constraints help show the pool health.”

Fig. 5 Echo-Adapt’s intuitive constraint editor for filtering and authoring item-level constraints on uploaded metadata fields

In a small survey of current users of Echo-Adapt, when asked about the degree to which they found the constraint editor usable, the majority of users indicated that the constraint editor is moderately to very simple to use. Overall, the users described the filtering capability as “a fairly standard interface of this type” that is “intuitive” and “very easy to use.” Only one user indicated that the editor was difficult to use, noting along with one other user that the ability to upload constraints directly to the application would be most desirable. However, because Echo-Adapt consciously makes this filtering visible on the item pool in the UI, uploading constraints through a flat file would run counter to the design goals of the interface.

Beyond making the effects of the constraints on the item pool visible in the UI, Echo-Adapt also delivers smart feedback to the users through feasibility checks, both on constraints and on form builds. If the feasibility analysis yields an infeasible combination of item pool and constraints, the user is notified that the configuration is infeasible and is given options to relax constraints. For combinations that are feasible, the total number of feasible tests is returned, as this can provide a gut check on the likely adaptivity of the resulting CAT: very few feasible tests mean that examinees are unlikely to have a highly personalized assessment. In addition, because running the feasibility check can result in long wait times when the number of feasible tests is very high, the feasibility analysis cutoff allows the user to enter a specific number of feasible tests to be counted for the current test configuration and to receive feedback in the “Total Feasible Tests” field, as illustrated in Fig. 6. Note that in the configuration shown, 47 constraints are simultaneously applied to the item pool via a MIP model and the solver returns the number of feasible test forms. This type of smart analysis enables more efficient use of the system, which users noted they appreciated as “the ability to fine-tune things after making trials.” The default is 1000 feasible tests that could be administered through CAT; however, users enter much lower numbers when using the system to build one or more linear test forms, a process called automated test assembly that will be discussed briefly later in this paper. It is worth noting that the choice of adaptive algorithm described above (shadow test with MIP modeling) uniquely allows for this feature.

Fig. 6 Feasibility analysis feedback with constraints on test configuration

A feasibility analysis cutoff of 0 will disable the feasibility analyses that run when the configuration is saved, but this is not recommended; Echo-Adapt delivers a warning to the user that “setting the feasibility analysis cutoff to zero may result in a simulation failure” (ACT 2020b). The feasibility analysis cutoff delivers feedback to the user with each save, checking that at least one shadow test can be generated from the configuration. With feedback delivered on the feasibility of the constraints as the user assembles the configuration, the user can make adjustments efficiently before spending time running simulations or form builds.
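One way such a capped feasibility count could be obtained, shown here only as an illustration under our own assumptions and not as Echo-Adapt’s implementation, is to solve the constraint MIP repeatedly and exclude each solution found with a “no-good” cut until the cutoff is reached.

```python
from pulp import (LpProblem, LpMaximize, LpVariable, LpBinary, lpSum,
                  PULP_CBC_CMD, LpStatusOptimal)

def count_feasible_tests(item_ids, add_constraints, test_length, cutoff=1000):
    """Count distinct feasible fixed-length forms, stopping at `cutoff`.

    add_constraints: callback that adds the blueprint constraints to (prob, x).
    """
    prob = LpProblem("feasibility_check", LpMaximize)
    x = {i: LpVariable(f"x_{i}", cat=LpBinary) for i in item_ids}
    prob += lpSum(x.values())               # dummy objective; feasibility only
    prob += lpSum(x.values()) == test_length
    add_constraints(prob, x)

    found = 0
    while found < cutoff:
        if prob.solve(PULP_CBC_CMD(msg=False)) != LpStatusOptimal:
            break
        found += 1
        selected = [i for i in item_ids if x[i].value() > 0.5]
        # No-good cut: this exact combination of items may not be chosen again.
        prob += lpSum(x[i] for i in selected) <= test_length - 1
    return found
```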

Simulation Using the Same Algorithmic Code as Live Delivery

The fifth design principle calls for fast feedback loops that allow for efficient simulation and adjustment of the authored assessment prior to live deployment, with quickly obtainable results and visualization. To assist test developers in building test blueprints and setting CAT configurations, Echo-Adapt provides the functionality to run CAT simulations before the actual CAT administration. Users can easily configure simulation parameters, e.g., the number of simulated examinees and their true ability distribution, on the same UI where the CAT is configured. The simulations run on secure, scaled cloud infrastructure. Simulation results are generated at multiple granularities and formatted in a CSV file. Users can check detailed CAT information at each adaptive stage or evaluate ability estimation performance within an ability interval. After verification through simulations, the same test blueprint and CAT configuration can be seamlessly used for the large-scale CAT administration.
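The sketch below shows the general shape of such a simulation loop, stripped of content constraints and exposure control for brevity; it reuses p_3pl, fisher_info_3pl, and eap_estimate from the earlier sketches and is not the Echo-Adapt simulation code.

```python
import numpy as np

def simulate_cat(item_pool, true_thetas, test_length, rng=None):
    """Toy CAT simulation: maximum-information selection with EAP rescoring.

    item_pool:   list of (a, b, c) 3PL parameter tuples
    true_thetas: generating abilities for the simulated examinees
    """
    rng = rng or np.random.default_rng()
    estimates = []
    for theta_true in true_thetas:
        administered, responses = [], []
        theta_hat = 0.0
        for _ in range(test_length):
            remaining = [k for k in range(len(item_pool)) if k not in administered]
            # Select the unadministered item with maximum information at theta_hat.
            best = max(remaining,
                       key=lambda k: fisher_info_3pl(theta_hat, *item_pool[k]))
            # Draw a 3PL response at the examinee's true ability.
            responses.append(int(rng.random() < p_3pl(theta_true, *item_pool[best])))
            administered.append(best)
            theta_hat, _ = eap_estimate(responses,
                                        [item_pool[k] for k in administered])
        estimates.append(theta_hat)
    return np.array(estimates)
```

Feeding the returned estimates and the generating abilities into the recovery_stats sketch from the simulation step earlier would yield the bias and RMSE figures discussed above.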

Results for Easy Post-Analysis and Visualization

Echo-Adapt records live CAT audit data, CAT simulation results, and automated test assembly results in CSV files that can be downloaded for easy post-CAT analysis and visualization. The CSV files are formatted to include multi-granularity results, from the highest test administration level to the lowest adaptive stage level. Some data, e.g., item administrations and ability estimates, are included in both audit data and simulation results, reflecting the mirroring between live CAT delivery and simulation. In addition, simulation metrics, e.g., bias and RMSE of ability estimates, are derived and recorded in the simulation result CSV to assist with CAT performance analysis. This detailed feedback enables visualization of results as in Fig. 7.

Fig. 7 Test form visualization using simulation results

Support of Users and Process at Scale

The sixth and final design principle calls for engagement of users in design activities, sufficient change management support, and inclusion of automation to address the areas of greatest inefficiency, as the extent to which users are supported in the use of a new smart authoring system determines how fully it is integrated with current work and processes. While technical functionality is imperative to the output of the system, the system is only successful if users engage with it. Therefore, when building new smart technologies to support or automate existing processes, preparing for use at scale requires several critical user- and process-focused steps for a successful operational roll-out.

The Echo-Adapt team analyzed content developers’ and psychometricians’ needs to configure tests, run simulations for delivery, and analyze output, with the goal of delivering business value and efficiencies. Echo-Adapt’s business integration further relies on process support before and after working inside the system to fully realize those benefits.

Process Transition Support

An important starting point for moving a system to operational use in a business is for users to accept the untenable nature of the current state. Building this realization of the need for change can draw on a variety of methods, such as SIPOC (suppliers, inputs, process, outputs, customers) analyses or process mapping. Whatever the format, users should be engaged in each step of mapping the current state to the future state. Some of these steps will be system-based while others will be processes that surround the system. For example, with Echo-Adapt, psychometricians and content developers worked together on a method for producing the item pools to upload to Echo-Adapt. The pools are not built within the system, yet without this process in place prior to the roll-out of the system, users would have stumbled in their initial use.

Comfort with a new authoring tool is even more important when supporting an adaptive system such as Echo-Adapt, which users may misinterpret as usurping their expertise in test construction. Enacting some tenets of change management philosophies can help users understand that their roles are not disappearing but simply changing (Hiatt 2006). Participation in mapping the new process is critical for them to accept the change and fully realize the business efficiencies. Furthermore, this process work provided a foundation for the requirements of Echo-Adapt. As users disclosed their problems with the process, the system was built to address the areas of greatest need: feedback on the number of form builds possible, visibility and control over constraints, concurrent runs by users, and iterative management of item pools.

In this new process, Echo-Adapt greatly improves the efficiency of test construction for CAT and automated test assembly (ATA). For CAT, Echo-Adapt selects and delivers the optimal items with maximum Fisher information to an examinee, which is equivalent to minimizing the variance of the examinee’s ability estimate and improving the accuracy of the assessment result (van der Linden 2005). In addition, the built-in exposure control functionality in Echo-Adapt balances item usage for each CAT administration, thereby using item bank assets more efficiently, e.g., fewer items are considered compromised and removed from the item bank due to test security issues after a test administration. With regard to ATA, Echo-Adapt automatically assembles the optimal test forms that best fit design requirements, e.g., ideal test characteristic functions, while conforming to all complex content specification constraints. Therefore, the effort of adjusting/exchanging items to satisfy the design requirements is minimized, resulting in reduced form construction time and improved form reliability. Because MIP provides a flexible framework, a well-designed MIP model requires little change of code as the specific test construction problem changes (Luo 2020), which further reduces the time spent on model tuning.

Involving users prior to the roll-out date resulted in a smooth transition to the internal use of Echo-Adapt. In a small survey, eight users responded that their production process had changed to some degree, and seven of these users described their evaluation of the change as positive (five of them as “strongly positive”). One respondent described the iterations of reviews decreasing “dramatically” and another described being able to create tests “more on-demand, and to iterate and fit things within our schedule better.” The work performed prior to the actual use of Echo-Adapt supported a fast, efficient integration of the adaptive system into the business because the initial system use was not burdened with additional process confusion, and Echo-Adapt was constructed with the users’ process at the forefront of its architectural and UI implementation.

Discussion

As shown in the example above, the design and implementation of a smart authoring system for designing, configuring, and deploying adaptive assessments at scale requires careful consideration of disparate user needs and mechanisms for meaningful and actionable feedback. These considerations may introduce additional requirements that directly influence the choice of algorithms and architecture for adaptive testing engines and should not be taken lightly if a scaled solution for adaptive testing is desired. Using the six design principles presented in this article has proven to be effective in bringing a smart authoring platform for computerized adaptive assessment to production use at scale.

In addition, we have found that by attending to user needs and feedback loops, it is possible to widen the utility of the new process and software to address adjacent business needs. In our case, the Echo-Adapt platform has been extensible and has provided additional process improvement beyond its initial intent of designing, configuring, and deploying adaptive tests. Since its launch, additional functionality for automated test assembly (ATA) has been added to Echo-Adapt for users to build multiple parallel linear forms against a single test configuration. Similar to the shadow-test CAT, the ATA model is formulated as a MIP. While the ATA MIP model presents additional complexity in objective functions and constraints and takes more time to solve, the user-experience foundations established for adaptive testing (setting constraints and receiving feedback) allowed the same experience to be extended to new constraint types without substantial confusion on the part of the user.

In addition, because it addressed a separate business need from CAT, ATA has now replaced manual test construction with an algorithm, and the business has recognized benefits beyond the original intent for Echo-Adapt. Content developers had been using older tools such as VBA macros to select items for linear test forms. In some cases, the tools were only partially functioning, and in other cases, test forms were built entirely by content developers selecting each item manually. When asked to increase the volume of test forms administered each year, content developers could not meet this output with the current manual process. Over a period of several months, psychometricians and content developers worked together to map out the current process and inform the requirements for Echo-Adapt’s ATA functionality. Operating under the design principles discussed in Section “Process Transition Support”, users were able to enact the new process as soon as the system was ready, realizing significant efficiencies. When users put the new process in place with Echo-Adapt’s ATA functionality, the test form construction process was reduced from 2-4 weeks to 2 days.

Building smart software to replace manual or previously impossible processes is not an easy feat. When those processes involve complex algorithms and highly specialized skill sets as in the example we have discussed above, it becomes even more difficult. We cannot overstate the importance of process transition support, nor can we ignore the contributions of a thoughtful and integrated user experience that allows each of the people in the process to work in a context most accessible to them given their background and experience.