
Appendix B: Methods

Overview

GSA, OMB, and the U.S. Access Board (Access Board) built upon the groundwork laid by the FY23 Assessment to develop data-driven methods for the FY24 Assessment. (For reference, see Appendix C: Methods from the FY23 Assessment). We identified primary research questions, transformed them into testable hypotheses structured by Assessment criteria, and conducted statistical analysis to test these hypotheses. Our approach to analysis was designed to gain insight into the current state of Section 508 programs, identify their key drivers, and trace their evolution year over year. Notably, access to two reporting periods’ worth of Assessment data gave us a new opportunity to pursue analysis of how Section 508 Programs changed over time. (See Pre-Post Analysis.)

Development and Dissemination of Assessment Criteria

To better evaluate the current state of Section 508 compliance and digital accessibility across the federal government, GSA and OMB, in collaboration with the Access Board and OSTP, refined the FY23 Assessment criteria language for FY24 with a focus on making questions and response options easier to interpret. For example, we used the term “reporting entity” in place of “agency” to encompass both agencies (i.e., bureaus, departments, and headquarters) and components (i.e., organizational units that reside within a department or large agency). Additionally, we added frequency percentages for never, sometimes, regularly, frequently, and almost always directly to response options to enhance clarity. We introduced several new questions, covering topics such as total federal employees, ICT test processes utilized, and exceptions processes. We also introduced 10 new questions (questions 80 to 89) on a rotating basis to broaden the scope of inquiries regarding ICT. We removed five questions due to data quality issues or redundancies and significantly revised answer choices for the following criteria: questions 30 to 33, 36 (specifically answer choice d), 39 to 42, 53 to 57, 60, 62 to 63, and 65. All 103 questions were mandatory for FY24, some with dependencies. For a complete list of Assessment criteria changes, please reference the crosswalk between FY23 and FY24 criteria.

While the criteria underwent minor structural changes, their major organizing framework remained intact. Please see the subsection in Appendix C: Methods from the FY23 Assessment for more information on how we developed the original Assessment criteria. The 11 dimensions that categorize the criteria remained unchanged from FY23. Table B1 describes each of the 11 dimensions.

Table B1. Description of Assessment dimensions.
Dimension Description
General Information Information and metrics related to the reporting entity’s Section 508 Program or equivalent activities.
IT Accessibility Program Office Reporting entity’s program management, reporting, benchmarking, risk management, continuous process improvement, and other business-related functions that align to the development, implementation, and maintenance of the reporting entity’s Section 508 Program or equivalent.
Policies, Procedures, and Standards Reporting entity’s development, implementation, and continuous improvement of digital accessibility-related policies, procedures, directives and standards, and the inclusion of digital accessibility into relevant policies across all business functions of the reporting entity.
Communications Reporting entity’s internal and external communication accessibility considerations.
Content Creation Reporting entity’s development, testing, remediation, and conformance tracking of digital content, including but not limited to documents, presentations, PDFs, spreadsheets, audio, video, multimedia, social media, and digital forms.
Human Capital, Culture, and Leadership Reporting entity’s leadership and professional development, and how digital accessibility is integrated into mission-related strategic planning.
Technology Lifecycle Activities Reporting entity’s level of inclusion of accessibility in the technology lifecycle to include design, development, operation, and maintenance of ICT.
Testing and Validation Reporting entity’s level of inclusion of digital accessibility in the testing and evaluation of reporting entity’s products and services, including processes, tools, templates, best practices, and guidance.
Acquisition and Procurement Reporting entity’s level of inclusion of digital accessibility in procurement lifecycle processes.
Training Reporting entity’s development, use, and tracking of digital accessibility-related training.
Conformance Metrics Specific data points and outcomes related to measuring reporting entity’s program inclusion of digital accessibility and conformance to the ICT Standards and Guidelines.

On April 8, 2024, OMB disseminated 103 Assessment criteria to reporting entities that may be subject to Section 508 requirements. OMB distributed this material to heads of reporting entities, reporting entity CIOs, and Section 508 PMs. Simultaneously, GSA posted the Assessment instructions and criteria on this website, Section508.gov.

Reporting entities designated POCs and coordinated with OMB to determine their reporting structure as a “reporting entity,” either as a standalone organizational unit or as a component of a larger parent agency. GSA maintained a list of designated POCs in preparation for the release of the reporting tool. The reporting tool was released on May 29, 2024. 

Data Collection

GSA and OMB received data submissions from 245 reporting entities between May 29, 2024, and July 31, 2024. GSA provided reporting entities with a reporting entity-specific link to submit their data within the eight-week reporting submission window. Before data validation and subsequent analysis, the data underwent quality testing to identify and remove outliers, including extreme values and data entry errors.

Data Validation

GSA developed a script to systematically validate data submitted by reporting entities. Like FY23, this script primarily operated according to conditional if-then logic, relying on interconnections between different response options for a given reporting entity. When the validation tests identified inconsistencies among the response options for a given reporting entity, they triggered flags. While GSA categorized and tabulated these flags, it did not alter or remove any data for analysis. Please refer to the “Data Validation for FY24 Governmentwide Annual Assessment” for a summary of validation tests, associated flag counts, and the validation script written in R.
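
The validation logic can be illustrated with a brief sketch in R. This is not the production script; the data frame name (submissions) and the column names (conformant_public_pages, total_public_pages) are hypothetical, and the rule shown is only one example of the conditional if-then checks described above.

  library(dplyr)

  # Example rule: a reporting entity cannot report more fully conformant public
  # web pages than total public web pages. Inconsistencies raise a flag.
  validated <- submissions %>%
    mutate(
      flag_conformant_pages = if_else(
        conformant_public_pages > total_public_pages, 1, 0, missing = 0
      )
    )

  # Flags are categorized and tallied; the underlying responses are not altered
  # or removed.
  sum(validated$flag_conformant_pages)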

Descriptive Analysis

Our descriptive analysis approach followed the methodology used in the FY23 Assessment. We conducted a descriptive study of the data, akin to an inventory or initial exploration, to provide a holistic view of reporting entity data and determine key patterns and trends. We maintained a dual focus from FY23 on “business function maturity” and “operational conformance,” which is a reporting entity’s conformance to the applicable requirements in the ICT Standards and Guidelines.

First, we created an index to assess reporting entity business function maturity (m-index). This index quantified reporting entity responses to criteria across nine dimensions: IT Accessibility Program Office; Policies, Procedures, and Standards; Communications; Content Creation; Human Capital, Culture, and Leadership; Technology Lifecycle Activities; Testing and Validation; Acquisition and Procurement; and Training. The m-index encompassed Questions 29 to 66; all questions were multiple choice, equally weighted, and scored as follows:

  • a) = 0; signifying very low

  • b) = 1; signifying low

  • c) = 2; signifying moderate

  • d) = 3; signifying high

  • e) = 4; signifying very high

Furthermore, a selection of “Unknown” received a 0 and a selection of “Not applicable” or “N/A” received a 4. For two criteria (49 and 56), a selection of (f) also received a 4, signifying very high. We recognized that scoring N/A as a 4 could inflate a reporting entity’s score for a dimension. Nonetheless, we chose this approach so that all reporting entities had an equal number of questions to score (the denominator would be the same for each reporting entity) and no reporting entity was penalized with a low score for things that do not apply to it.
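
The scoring can be summarized with a short R sketch. The data frame (responses) and column names (q29 through q66) are assumptions for illustration, and the sketch applies the (f) = 4 rule to all questions rather than only to criteria 49 and 56.

  library(dplyr)

  # Map a multiple-choice selection to its 0-4 maturity score.
  score_choice <- function(choice) {
    case_when(
      choice %in% c("a", "Unknown") ~ 0,
      choice == "b" ~ 1,
      choice == "c" ~ 2,
      choice == "d" ~ 3,
      choice %in% c("e", "f", "Not applicable", "N/A") ~ 4,
      TRUE ~ NA_real_
    )
  }

  # Equal weighting across Questions 29 to 66: every entity is scored over the
  # same denominator.
  m_scores <- responses %>%
    mutate(across(q29:q66, score_choice)) %>%
    mutate(m_index = rowMeans(across(q29:q66)))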

Second, we created an operational conformance index, referred to as “conformance” or “c-index,” to assess how well reporting entities performed in meeting Section 508 and digital accessibility requirements. This index quantified reporting entity responses to 16 specific criteria in the Conformance section that directly relate to quantifiable compliance outcomes: Q69a, Q71, Q74a, Q76, Q77, Q78, and Q80 to Q89. Responses were assigned numerical values and weighted as shown in Table B2.

Table B2. Topics, Conversion Approaches, and Weights of Conformance Criteria
Topic Criteria Conversion Approach Weight
Internet Q69a Provided as a percentage by reporting entity; no conversion needed 12.50%
Internet Q71 Converted the number of fully conformant public internet web pages into a percentage of the total public internet web pages the reporting entity specified 12.50%
Intranet Q74a Provided as a percentage by reporting entity; no conversion needed 12.50%
Intranet Q76 Converted the number of fully conformant internal intranet web pages into a percentage of the total internal intranet web pages the reporting entity specified 12.50%
Documents Q77 Converted the number of fully conformant electronic documents into a percentage of the total electronic documents the reporting entity specified 12.50%
Miscellaneous Q80 to Q89 Converted each response to a numerical value:
  • 100% = 1
  • 90%-99% = 0.9
  • 50%-90% = 0.5
  • Less than 50% = 0.25
  • Unknown = 0
  • N/A = 1
2.50% each

The internet and intranet are essential and increasingly important mediums for digital commerce, communication, and collaboration across the federal government. Additionally, web testing methodologies and tools are much more mature than other ICT types, as noted by the number of web pages that reporting entities regularly test. In the development of the c-index, we placed slightly more emphasis on these areas relative to documents, videos, and other covered ICT. GSA and the Access Board are driving initiatives to develop best practices and guidance to standardize testing for other ICT, including software, hardware, and electronic documents. As reporting entities adopt and implement these standardized testing methodologies, GSA will work with OMB and the Access Board to modify Assessment criteria to more consistently evaluate ICT conformance. We also intend to adjust the c-index to incorporate expanded measurement of other types of ICT in addition to web content.

By converting and totaling reporting entity-specific responses to each criterion listed above, we determined a reporting entity-specific value for the m-index and c-index. Importantly, the c-index was rescaled by a factor of 4 to equal the scale of the m-index.
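
A condensed R sketch of the index calculation follows. The column names (q69a, q71_pct, q74a, q76_pct, q77_pct, and q80 through q89) are hypothetical, and the sketch assumes the percentage-based responses have already been expressed on a 0-to-1 scale and the rotating criteria converted per Table B2.

  library(dplyr)

  c_scores <- responses %>%
    mutate(
      # Internet, intranet, and document criteria weighted at 12.5% each.
      web_share  = 0.125 * (q69a + q71_pct + q74a + q76_pct + q77_pct),
      # Rotating criteria Q80 to Q89 weighted at 2.5% each.
      misc_share = 0.025 * rowSums(across(q80:q89)),
      # Rescale by a factor of 4 so the c-index matches the scale of the m-index.
      c_index    = 4 * (web_share + misc_share)
    )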

Pre-Post Analysis

Two consecutive reporting periods, FY23 and FY24, provided us with the opportunity to gain insight into how Section 508 activities changed over time. Through pre-post analysis, we evaluated the impact of an intervention – in this case, the release of the FY23 Assessment – by measuring changes in relevant criteria and indices over time. We asked three broad questions.

First, was there a meaningful change? For a given research question, a p-value told us whether a meaningful change occurred over the two reporting periods. For pre-post analysis, the p-value is the probability of observing a change of at least the observed size if no change between the two periods actually occurred. A small p-value, typically less than 0.05, indicates the observed change is unlikely to be due to chance and is deemed a statistically significant difference between the two periods. However, a p-value can indicate statistical significance even for very small changes, especially when the amount of data is large. Consequently, we factored in the amount of data when determining the magnitude of the change.

Second, what was the magnitude of the change? To determine the amount of change between FY23 and FY24, we used effect size, a measure that helped us understand the magnitude of a phenomenon or effect. The effect size accounted for the impact of sample size on the significance of the results. This was particularly useful when dealing with Likert scale data like m-index data, which lacks inherent numerical meaning by itself and only gains meaning when interpreted through the scale. For example, a score of 4 on a survey is meaningful only in the context of a scale ranging from 1 to 5.

We used the following standard categories for effect sizes: small (0.1 to 0.3), medium (0.3 to 0.5), and large (0.5 to 0.9). These categories, expressed in terms of standard deviations, facilitated a more straightforward and standardized interpretation of results. However, because effect sizes are absolute values and do not indicate the direction of change (i.e., increase or decrease), we pursued further analysis to determine the direction.

Third, in which direction did the change occur? Did we see an increase or a decrease? We used mean and median values to determine the direction of change between FY23 and FY24. For example, if the difference was positive, the result for a question in FY24 was higher than in FY23 and the response improved over the past year. Conversely, if the difference was negative, the result for a question in FY24 was lower than in FY23 and the response worsened over the past year.

While no single test can answer all three questions, a combination of tests and confirmations can. We performed the sequence of tests listed below, further summarized in Table B3; an illustrative sketch of the sequence in R follows the table.

  • Normality Assessment: We used the Shapiro-Wilk test to evaluate normality, helping us identify whether the data was normally distributed or skewed. For comparisons used in our pre-post analysis, we expected a subset to exhibit normal distributions. However, only one out of approximately 100 comparisons met the normality criteria, requiring us to adapt our approach.

  • Skewness: Next, to examine the asymmetry, or lopsidedness, in our data distribution, we measured skewness. For pre-post analysis, we considered values between -0.5 and +0.5 as indicative of a symmetrical distribution.

  • Optional Pre-Test for Significance: When the differences between groups are non-normal and asymmetric, we initially investigated statistical significance with the Sign test. We considered it a weak statistical test that sometimes shows statistical significance more than we would expect, even when the underlying differences are not substantial. Rather than rely on it as the sole determinant of significance, we used it as an initial indicator and then followed up with the Wilcoxon signed-rank test (See 4A) to validate the results for non-normal and asymmetric data.

  • Test for Significance:

    • Wilcoxon signed-rank test (WSRT): We applied this test for statistical significance when the data was non-normal or ordinal.

    • Paired t-test (PTT): As an alternative, we used the paired t-test when the differences were normally distributed.

  • Effect Size:

    • Wilcoxon effect size test: To quantify the magnitude of the effect of non-normal differences, we used the Wilcoxon effect size. This step allowed us to understand the practical significance of observed changes. Notably, effect sizes fell into three categories: small (0.1 to 0.3), medium (0.3 to 0.5), and large (0.5 to 0.9).

    • Cohen’s D test: To quantify the magnitude of the effect of normal, bell-shaped differences, we used Cohen’s D test. This step allowed us to understand the practical significance of observed changes. We applied the same standard categories for effect size as we did for the Wilcoxon test.

Table B3. Summary of Pre-Post Analysis Approach
Order Purpose Name of Test Conditions Comments
1 Test for normality Shapiro-Wilk test N/A N/A
2 Test for asymmetry Skewness N/A Values between -0.5 and 0.5 denote approximate symmetry (see note 1)
3 Pre-test for statistical significance Sign test Use if differences are non-normal and asymmetric Follow up with Wilcoxon signed-rank test
4A Test for statistical significance Wilcoxon signed-rank test Use if differences are non-normal or if data is ordinal N/A
4B Test for statistical significance Paired t-test Use if differences are normal N/A
5A Effect size Wilcoxon effect size test Use if differences are non-normal or if data is ordinal Expected range of values: small (0.1 to 0.3), medium (0.3 to 0.5), large (0.5 to 0.9)
5B Effect size Cohen’s D test Use if differences are normal Expected range of values: small (0.1 to 0.3), medium (0.3 to 0.5), large (0.5 to 0.9)
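
The sequence in Table B3 can be sketched in R as follows for a single criterion. The paired vectors fy23 and fy24 are assumed to hold entity-level scores for the two reporting periods, and the package choices (e1071 for skewness, rstatix for effect sizes) are illustrative rather than a statement of the tools actually used.

  library(e1071)    # skewness()
  library(rstatix)  # wilcox_effsize(), cohens_d()

  diffs <- fy24 - fy23

  # 1. Normality of the paired differences.
  shapiro.test(diffs)

  # 2. Asymmetry: values between -0.5 and +0.5 indicate approximate symmetry.
  skewness(diffs)

  # 3. Optional pre-test for significance (sign test via an exact binomial test).
  binom.test(sum(diffs > 0), sum(diffs != 0))

  # 4A/4B. Significance: Wilcoxon signed-rank test for non-normal or ordinal
  # data; paired t-test when differences are approximately normal.
  wilcox.test(fy24, fy23, paired = TRUE)
  t.test(fy24, fy23, paired = TRUE)

  # 5A/5B. Effect size: Wilcoxon effect size for non-normal data, Cohen's d for
  # normal data.
  paired_data <- data.frame(
    score = c(fy23, fy24),
    year  = rep(c("FY23", "FY24"), each = length(fy23))
  )
  wilcox_effsize(paired_data, score ~ year, paired = TRUE)
  cohens_d(paired_data, score ~ year, paired = TRUE)

  # Direction of change: positive differences indicate improvement in FY24.
  mean(diffs); median(diffs)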

A probability value (p-value) helps us determine whether the difference we observe between two groups is real or due to chance. It tells us how likely it is that we would observe these results if there were no real difference at all. The lower the p-value, the stronger the evidence that the difference is meaningful and not due to random chance. A low p-value, typically 0.05 or less, suggests the difference is meaningful, while a high p-value, greater than 0.05, suggests the difference might be due to chance and we lack enough evidence to call it statistically significant. We notate the extent of statistical significance as summarized in Table B4.

Table B4. Summary of Statistical Significance Notation
Meaning Description Notation
P > 0.05 not significant ns
P ≤ 0.05 statistically significant *
P ≤ 0.01 highly statistically significant **
P ≤ 0.001 very highly statistically significant ***
P ≤ 0.0001 extremely statistically significant ****

Throughout this report, we present mean values or averages to provide a straightforward understanding of year-over-year (YOY) changes in the Section 508 landscape. However, since the data distribution was often non-normal, we used the Wilcoxon signed-rank test to evaluate statistical significance by assessing median differences. While averages are presented in the text for ease of interpretation, the Wilcoxon test offers a robust analysis by accounting for non-normality. Therefore, when the Wilcoxon signed-rank test indicates statistical significance, it reflects changes in the median, even if mean values are reported for simplicity.

Regression Analysis

Regression analysis helped us explore the relationships between independent variables and Section 508 compliance outcomes. For FY24, we conducted 22 regressions (XLSX) using both simple and multivariable models to explore which criteria drive Section 508 program maturity and conformance, and to what extent. However, none of these regressions produced both a p-value below the threshold for statistical significance (0.05) and a high R² value (above 0.75). This suggests that while the models captured relevant factors, other dynamics, such as persistent or compounding data quality issues, hindered efforts to isolate the specific drivers of Section 508 compliance.

Given the lack of statistically significant findings and the absence of high R² values, detailed regression methods and results are not included in this year’s report. However, a condensed methodology is provided below and full regression methods and earlier findings remain available for reference in the previous year’s report.

A typical regression equation used in the analysis took the following form:

Dependent Variable = β0 + β1(Independent Variable 1) + β2(Independent Variable 2) +…+ ε

Where:

  • β0 is the intercept or value of the dependent variable if all independent variables are 0.

  • β1, β2, … are coefficients that describe the strength and direction of relationships between independent variables and the outcome.

  • ε accounts for unexplained factors.
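
As an illustration, a model of this form can be fit in R with lm(); the outcome and predictor names below (c_index and selected dimension scores) are assumptions, not the actual model specifications used.

  # Simple regression: one independent variable.
  simple_model <- lm(c_index ~ training_score, data = entities)

  # Multivariable regression: several independent variables.
  multi_model <- lm(c_index ~ training_score + testing_score + acquisition_score,
                    data = entities)

  summary(multi_model)  # coefficients, p-values, and R-squared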

Time Fixed Effects

New to FY24, time fixed effects were incorporated to control for YOY influences. This approach helps regression models account for shifts between FY23 and FY24, isolating time-specific factors that might otherwise skew the results. For example:

Dependent Variable = β0 + β1(Independent Variable 1) + γ1(FY23) + γ2(FY24) + ε

By including these time-specific effects, our analysis aimed to accurately capture the underlying relationships between independent and dependent variables without biases introduced by changes over time. Although we ran eight regression models with time fixed effects using data from both FY23 and FY24, none produced results substantial enough to include in this year’s Findings.
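
A minimal sketch of such a specification in R is shown below, assuming the FY23 and FY24 submissions are stacked into one data frame with a fiscal_year indicator; the variable names are again hypothetical. Note that lm() absorbs one year level into the intercept when encoding the dummy variables.

  # Pool the two reporting periods and tag each row with its fiscal year.
  pooled <- rbind(
    transform(fy23_data, fiscal_year = "FY23"),
    transform(fy24_data, fiscal_year = "FY24")
  )

  # Time fixed effects enter the model as a categorical year term.
  fe_model <- lm(c_index ~ training_score + factor(fiscal_year), data = pooled)
  summary(fe_model)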

P-Values and R² Values

The key metrics we used to evaluate the regressions were:

  • P-value: Indicates whether relationships are statistically significant, with values below 0.05 suggesting significance.

  • R²: Measures how well the independent variables explain variation in the outcome. Values above 0.75 are ideal, though moderate values can still offer insights.
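
Both metrics can be read from a fitted lm() object; the sketch below reuses the hypothetical multi_model from the regression example above.

  model_summary <- summary(multi_model)

  model_summary$coefficients[, "Pr(>|t|)"]  # p-value for each coefficient
  model_summary$r.squared                   # proportion of variation explained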


  1. A 0 denotes perfect symmetry or normal distribution of differences, which should be a rare outcome given our data.

Reviewed/Updated: December 2024
