Financial Reporting and XBRL | Open access

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

arXiv (Cornell University) | Jun 16, 2026

Abstract

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.
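The scale figures quoted in the abstract can be put in proportion with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the three constants are the numbers stated in the abstract, while the function names and the derived quantities (the snapshot's share of the estimated archive and the implied average filing length) are assumptions introduced here, not statistics reported by the authors.

```python
# Figures quoted in the abstract.
SNAPSHOT_TOKENS = 152e9   # SEFD-v1 initial public snapshot
ARCHIVE_TOKENS = 550e9    # estimated size of the larger archive
ARCHIVE_FILINGS = 18.5e6  # filings in the larger archive


def snapshot_share(snapshot_tokens: float, archive_tokens: float) -> float:
    """Fraction of the estimated archive covered by the snapshot (hypothetical helper)."""
    return snapshot_tokens / archive_tokens


def mean_tokens_per_filing(total_tokens: float, n_filings: float) -> float:
    """Implied average filing length in tokens (hypothetical helper)."""
    return total_tokens / n_filings


share = snapshot_share(SNAPSHOT_TOKENS, ARCHIVE_TOKENS)
mean_len = mean_tokens_per_filing(ARCHIVE_TOKENS, ARCHIVE_FILINGS)

print(f"Snapshot share of estimated archive: {share:.1%}")
print(f"Implied mean tokens per filing: {mean_len:,.0f}")
```

Under these figures the snapshot covers a little over a quarter of the estimated archive, and the implied average filing runs to roughly thirty thousand tokens. Because the 550B figure is itself an estimate, both derived numbers are approximate.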


Authors

Researchers on this paper

Nick Bettencourt (first author)

Xiaowei Ding (middle author)

Kay Giesecke (last author) | ORCID 0000-0002-5380-2918

Research areas

Financial Reporting and XBRL

Stock Market Forecasting Methods

Auditing, Earnings Management, Governance

Citation

BibTeX

@article{Bettencourt2026Stanford,
  title = {The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data},
  author = {Nick Bettencourt and Xiaowei Ding and Kay Giesecke},
  journal = {arXiv (Cornell University)},
  year = {2026},
  doi = {10.48550/arxiv.2606.18192},
  url = {https://doi.org/10.48550/arxiv.2606.18192}
}

Related papers

Artificial Intelligence in accounting: A corpus-wide bibliometric review and decade-by-decade thematic evolution

Role of Artificial Intelligence in Transforming Accounting Practices and Financial Reporting

The Role of Information Governance in Navigating the Impact of Big Data Analytics Capabilities on Sustainable Business Performance

MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios

Digital disclosure complexity and tax aggressiveness: insights from US XBRL filings

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

A Literature Review of XBRL Studies in Information Systems