The Form 497K Files Dataset is a continuous, accession-level corpus of every Form 497K summary prospectus transmitted to EDGAR under Rule 497(k) of the Securities Act of 1933. Each record represents a single 497K submission by a registered open-end management investment company (a U.S. mutual fund or exchange-traded fund) and pairs a normalized metadata.json header with the original SGML-wrapped HTML summary prospectus as filed. The dataset begins in April 2009, when Rule 497(k) first became operational after the SEC's January 2009 Enhanced Disclosure and New Prospectus Delivery Option adopting release (Release No. 33-8998, effective 31 March 2009), and is refreshed monthly as new filings are transmitted. Records are grouped into monthly ZIP containers named YYYY-MM.zip, each carrying one subdirectory per accession with the primary .htm document and its structured header. The corpus is designed for section-level extraction of the standardized Item 2 through Item 8 disclosures of Form N-1A that Rule 498 prescribes for summary prospectuses.
The dataset packages every EDGAR submission of Form 497K from April 2009 forward. Form 497K is not a free-standing registration form; it is the Rule 497 submission sub-type used to transmit a summary prospectus, the concise standardized disclosure document authorized by Rule 498 under the Securities Act of 1933. Rule 498 allows a fund with an effective Form N-1A registration statement to satisfy statutory prospectus-delivery obligations by providing (or making available online) a short summary prospectus rather than the full statutory prospectus, provided the summary adheres to the disclosure content and ordering requirements codified in Items 2 through 8 of Form N-1A. Rule 497(k) in turn requires that summary prospectus to be filed with the Commission.
A 497K filing therefore occupies a tightly constrained disclosure envelope. It is always a summary prospectus or a supplement/sticker to one; it is always tied back to an effective N-1A shelf; and it always arrives on EDGAR with structured header data identifying the specific fund series and share classes the summary covers. The content is investor-facing, narrative-plus-tabular, and driven by fund-level disclosure events (fund launches, outcome-period rollovers, annual updates, fee changes, and similar) rather than by corporate events.
The dataset is distributed as monthly ZIP containers. Each accession folder inside a container holds a normalized metadata.json header and a single SGML-wrapped HTML document carrying the summary-prospectus text. Image binaries (performance charts, logos, payoff diagrams) and the composite SGML .txt submission wrapper are referenced by EDGAR URL in metadata.json but are not materialized on disk.
One record in the Form 497K Files Dataset is a single EDGAR submission of Form 497K. Physically, the record is a per-accession subdirectory inside a monthly ZIP container, named with the 18-digit zero-padded SEC accession number with dashes removed (for example 000121390025059889/ for accession 0001213900-25-059889). Each accession folder holds two materialized files: a structured metadata.json header and a single SGML-wrapped HTML document containing the summary-prospectus text itself.
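The folder-naming rule above can be sketched as a pair of small helpers. The function names are illustrative and not part of the dataset or any provider library; the only facts used are the 18-digit length and the 10-2-6 dash pattern shown in the example accession number.

```python
import re


def accession_to_folder(accession_no: str) -> str:
    """Strip dashes from a dashed accession number to get the folder name."""
    digits = accession_no.replace("-", "")
    if not re.fullmatch(r"\d{18}", digits):
        raise ValueError(f"unexpected accession number: {accession_no!r}")
    return digits


def folder_to_accession(folder: str) -> str:
    """Re-insert dashes in the 10-2-6 pattern used by EDGAR accession numbers."""
    name = folder.strip("/")
    if not re.fullmatch(r"\d{18}", name):
        raise ValueError(f"unexpected folder name: {folder!r}")
    return f"{name[:10]}-{name[10:12]}-{name[12:]}"


print(accession_to_folder("0001213900-25-059889"))  # 000121390025059889
print(folder_to_accession("000121390025059889/"))   # 0001213900-25-059889
```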
A record ties together three layers: (1) the packaging unit on disk, (2) the normalized EDGAR submission header, and (3) the underlying summary-prospectus disclosure document as originally filed.
At the container level, records are grouped into monthly ZIP archives named YYYY-MM.zip. Inside, a single top-level directory YYYY-MM/ holds one subdirectory per accession. A representative month contains on the order of 1,200–1,500 accession folders; the sampled June 2025 container holds 1,255 accession subdirectories.
Inside each accession folder, the on-disk payload is essentially fixed:
metadata.json: structured EDGAR header object. Always present, exactly one per record.
<primary>.htm: one SGML-wrapped HTML document carrying the Form 497K summary prospectus. Always present, always exactly one.
Everything else that the filer attached to the EDGAR submission (GIF/JPG performance-chart and logo graphics and the concatenated SGML .txt composite submission file) is listed under metadata.json.documentFormatFiles[] with an EDGAR URL, but is not materialized inside the ZIP. The dataset deliberately omits image binaries; the composite .txt is not duplicated on disk because the primary HTML already carries the document content. Although the dataset brief advertises PDF and TXT format support, 497K submissions in practice use HTML as the primary presentation format almost universally, so the per-accession payload is effectively HTML plus JSON.
HTML filenames are filer-specific and do not reliably contain the literal string 497k. Observed conventions include:
<accessionStub>_497k.htm (e.g. ea0245768-03_497k.htm, tm2517758-2_497k.htm)
d<digits>d497k.htm (e.g. d93511d497k.htm)
f<digits>d1.htm (e.g. f42385d1.htm, which notably omits 497k from the filename)
c497k.htm and etf1_497k.htm
g108182_rdv-isi.htm
The authoritative form identifier is the SGML <TYPE>497K tag inside the document wrapper and the formType field inside metadata.json, not the filename.
metadata.json header
The per-accession JSON object is a flat, consistently keyed record that mirrors and normalizes the SGML <SEC-HEADER> block of the original EDGAR submission. Observed top-level fields:
formType: always the string "497K".
accessionNo: dashed SEC accession number, e.g. "0001213900-25-059889". The containing folder name is the same number with dashes stripped and zero-padded to 18 digits.
effectivenessDate: ISO YYYY-MM-DD date on which the summary prospectus becomes effective; typically the first day of the month following filing, or a series-launch-specific date.
filedAt: full ISO-8601 timestamp with timezone offset (e.g. "2025-06-30T20:23:34-04:00") marking EDGAR acceptance.
description: boilerplate human-readable form label, uniform across the dataset: "Form 497K - Summary Prospectus for certain open-end management investment companies filed pursuant to Securities Act Rule 497(K)".
linkToFilingDetails: EDGAR URL to the primary HTML summary prospectus.
linkToTxt: URL to the concatenated SGML submission text file.
linkToHtml: URL to the accession's -index.htm landing page on EDGAR.
linkToXbrl: empty string throughout. Summary prospectuses filed under 497K do not carry interactive-data attachments; the structured Risk/Return Summary XBRL that corresponds to the same disclosure is filed separately under the companion registration-statement amendment (typically 485BPOS).
id: 32-character hexadecimal record identifier used by the provider's API.
documentFormatFiles[]: enumeration of every attachment in the EDGAR submission.
entities[]: filer-entity array.
seriesAndClassesContractsInformation[]: structured fund series and share-class data.
dataFiles[]: structured data-file array; consistently empty ([]) for 497K records because no XBRL instance is filed under this sub-type.
documentFormatFiles[]
Each element describes one file that was part of the original EDGAR submission package:
sequence: filer-assigned ordering string ("1" for the primary document, "2", "3", … for subsequent graphics; a literal single space " " for the composite .txt submission wrapper).
size: byte count, serialized as a string.
documentUrl: canonical EDGAR URL to the individual file.
description: filer-supplied free-text caption (e.g. "FORM 497K", "GRAPHIC", "Complete submission text file"); occasionally absent.
type: EDGAR document type code: "497K" for the primary prospectus, "GRAPHIC" for embedded images, and " " (single space) for the composite text submission.
Sequence 1 is always the primary 497K HTML and is the only entry materialized on disk inside the accession folder. The remaining entries enumerate graphic attachments (performance bar charts, hypothetical-growth line charts, adviser logos, and occasional payoff-diagram images for defined-outcome ETFs) and the composite .txt, all represented as URLs, not binaries. Observed attachment counts range from two entries (HTML plus composite text only, typical of short stickers) up to eight or more for full prospectuses with many embedded charts; the sampled Royce Fund record, for instance, lists seven graphic attachments.
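As a rough illustration of how the attachment list might be consumed, the sketch below collects every documentFormatFiles[] entry that is not materialized on disk (everything other than sequence "1"). The function name, the "COMPLETE_SUBMISSION" label, and the example.invalid URLs are assumptions made for the example; only the field names and the single-space convention come from the description above.

```python
from typing import Dict, List


def remote_attachments(metadata: dict) -> List[Dict[str, object]]:
    """Return the attachments that exist only as EDGAR URLs, not as files in the ZIP."""
    remote = []
    for doc in metadata.get("documentFormatFiles", []):
        if doc.get("sequence", "").strip() == "1":
            continue  # the primary 497K HTML is already on disk
        # The composite .txt carries a single-space type; give it a readable label.
        kind = doc.get("type", "").strip() or "COMPLETE_SUBMISSION"
        remote.append({
            "kind": kind,
            "url": doc.get("documentUrl"),
            "size": int(doc.get("size") or 0),  # size is serialized as a string
        })
    return remote


# Hypothetical metadata fragment; URLs are placeholders, not real EDGAR paths.
sample = {
    "documentFormatFiles": [
        {"sequence": "1", "type": "497K", "size": "48211",
         "documentUrl": "https://example.invalid/primary.htm"},
        {"sequence": "2", "type": "GRAPHIC", "size": "5120",
         "documentUrl": "https://example.invalid/chart.jpg"},
        {"sequence": " ", "type": " ", "size": "60233",
         "documentUrl": "https://example.invalid/submission.txt"},
    ]
}
for item in remote_attachments(sample):
    print(item["kind"], item["size"], item["url"])
```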
entities[]
An array of filer-entity objects capturing the SEC-registered parties attached to the submission. Per-entity fields observed:
companyName: legal entity name with its submission role appended in parentheses, e.g. "Innovator ETFs Trust (Filer)".
cik: unpadded Central Index Key.
fileNo: SEC file number. For 497K filings this is usually the Securities Act registration file number for the N-1A shelf ("333-xxxxxx"); older trusts use legacy "002-xxxxx" formats, and Investment Company Act registration numbers ("811-xxxxx") can also appear on multi-registered entities.
filmNo: the SEC-assigned film number for this specific filing event.
type: entity-scoped filing type ("497K").
act: Securities Act reference code ("33" for the 1933 Act).
irsNo: IRS Employer Identification Number; frequently "000000000" for mutual-fund statutory trusts.
fiscalYearEnd: four-digit MMDD; may be absent.
stateOfIncorporation: two-letter jurisdiction code (commonly "DE" for Delaware statutory trusts or "MA" for Massachusetts business trusts); occasionally absent.
seriesAndClassesContractsInformation[]
This is the structurally distinguishing header block for open-end fund filings, reflecting the Investment Company Act of 1940 series/class reporting framework mandatory since 2006. Each array element represents one fund series covered by the summary prospectus:
series: SEC series identifier of the form S000######.
name: series name (typically the marketing name of the fund, e.g. "Innovator Equity Dual Directional 15 Buffer ETF - July").
classesContracts[]: array of share-class objects, each with:
  classContract: SEC class identifier of the form C000######.
  name: share-class name (e.g. "Investor Class", "Service Class", or simply the ETF name for single-class ETFs).
  ticker: exchange ticker symbol when one has been assigned (e.g. "BFJL", "RYDVX", "RDVIX"); the field is omitted entirely when no ticker applies, not emitted as an empty string.
Most 497K records describe a single series with a single share class, which is characteristic of ETFs and of per-series summary prospectuses issued by mutual-fund trusts. Multi-class mutual-fund series populate multiple classesContracts entries (for example a Royce Fund record listing both the Service Class RYDVX and the Investment Class RDVIX under one series). Less commonly, a single 497K covers multiple series, in which case the top-level array holds several series objects.
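The nested series/class layout can be flattened into one row per share class with a short generator. This is a minimal sketch: the function name is illustrative, and the series and class identifiers in the sample are placeholders rather than real SEC identifiers. A missing ticker key is returned as None instead of raising.

```python
from typing import Iterator, Optional, Tuple

ClassRow = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]


def iter_share_classes(metadata: dict) -> Iterator[ClassRow]:
    """Yield (series_id, class_id, class_name, ticker) for every share class."""
    for series in metadata.get("seriesAndClassesContractsInformation", []):
        for cls in series.get("classesContracts", []):
            # The ticker key may be absent entirely; .get() maps that to None.
            yield (series.get("series"), cls.get("classContract"),
                   cls.get("name"), cls.get("ticker"))


# Placeholder identifiers for illustration only.
sample = {
    "seriesAndClassesContractsInformation": [
        {
            "series": "S000000001",
            "name": "Example Fund",
            "classesContracts": [
                {"classContract": "C000000001", "name": "Service Class", "ticker": "RYDVX"},
                {"classContract": "C000000002", "name": "Investment Class", "ticker": "RDVIX"},
                {"classContract": "C000000003", "name": "Class Without Ticker"},
            ],
        }
    ]
}
rows = list(iter_share_classes(sample))
for row in rows:
    print(row)
```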
The primary .htm file in each accession folder is an SGML-wrapped HTML rendering of the statutory summary prospectus. The outer SGML envelope follows the canonical EDGAR document-wrapper pattern:
<DOCUMENT>
<TYPE>497K
<SEQUENCE>1
<FILENAME>d93511d497k.htm
<DESCRIPTION>ISHARES LARGE CAP DEEP BUFFER ETF
<TEXT>
<HTML>...</HTML>
</TEXT>
</DOCUMENT>
Inside <TEXT> is a self-contained HTML document. Observed sizes range from roughly 7 KB for one-page supplements or stickers up to 250–270 KB for fully embedded multi-series summary prospectuses with inline tables and graphic references.
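A minimal way to read this envelope is to pull the tag lines that precede <TEXT> and then slice out the HTML body. The sketch below assumes the wrapper looks exactly like the pattern shown above (one tag per line, a single <TEXT> block); the function name and the regular expressions are illustrative, not a general EDGAR SGML parser.

```python
import re

TAG_RE = re.compile(r"^<(TYPE|SEQUENCE|FILENAME|DESCRIPTION)>(.*)$", re.MULTILINE)
TEXT_RE = re.compile(r"<TEXT>\s*(.*?)\s*</TEXT>", re.DOTALL)


def parse_sgml_document(raw: str) -> dict:
    """Extract the wrapper tags and the embedded HTML from one SGML document."""
    head, _, _ = raw.partition("<TEXT>")  # only scan tag lines before the body
    fields = {name.lower(): value.strip() for name, value in TAG_RE.findall(head)}
    match = TEXT_RE.search(raw)
    fields["html"] = match.group(1) if match else ""
    return fields


raw = """<DOCUMENT>
<TYPE>497K
<SEQUENCE>1
<FILENAME>d93511d497k.htm
<DESCRIPTION>ISHARES LARGE CAP DEEP BUFFER ETF
<TEXT>
<HTML><BODY>Summary Prospectus</BODY></HTML>
</TEXT>
</DOCUMENT>"""

doc = parse_sgml_document(raw)
print(doc["type"], doc["filename"], len(doc["html"]))
```

Checking doc["type"] against "497K" gives a filename-independent confirmation of the form type, alongside the formType field in metadata.json.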
The body of the HTML implements the Rule 498 content schema — the ordered set of Items 2 through 8 of Form N-1A that defines a compliant summary prospectus — preceded by a cover-page block:
For multi-series trusts, this ordered block repeats once per series in the same HTML document, each introduced by its own fund-name heading.
A substantial minority of 497K submissions are supplements / stickers rather than full fresh summary prospectuses — short documents (often 1–10 KB) that amend a previously filed summary prospectus (for example updating a cap rate on an outcome-period ETF, adding a new share class, correcting a portfolio manager's name, or extending a fee waiver). These supplements reuse the 497K submission type because they modify the summary-prospectus content; their HTML body is correspondingly short and narrative rather than a complete Items 2–8 sequence.
Graphics embedded by the HTML — performance bar charts, hypothetical-growth line graphs, fund-company logos, and occasional payoff-diagram images for defined-outcome ETFs — are part of the original EDGAR submission and are fully enumerated under documentFormatFiles[] with their EDGAR documentUrl. They are, however, not materialized inside the ZIP; the dataset excludes image binaries by design. Likewise, the composite SGML .txt file that EDGAR produces as a concatenation of all attachments is referenced by URL (linkToTxt and the documentFormatFiles entry with type: " ") but not duplicated on disk. Consumers wanting those assets retrieve them directly from EDGAR via the URLs in metadata.json.
Multi-class and multi-series variants are not separate files — they are handled by repeated content blocks within the single HTML document and by multiple entries inside seriesAndClassesContractsInformation[].
Included in each record
The primary .htm document with its SGML <DOCUMENT>, <TYPE>, <SEQUENCE>, <FILENAME>, <DESCRIPTION>, and <TEXT> envelope.
metadata.json capturing the EDGAR header, filer entities, series/class identifiers and tickers, document list, timestamps, and EDGAR URLs.
Referenced but not materialized
The composite SGML .txt submission wrapper.
The -index.htm accession landing page.
Series (S######) and class/contract (C######) identifiers, mandatory for registered open-end funds since 2006, are present in the SGML header from the start of the 497K window and are reflected in every record's seriesAndClassesContractsInformation[].
S###### namespace.
Form 497K has always been filed electronically on EDGAR with an SGML wrapper and an HTML primary document. There is no pre-HTML era for this form type, because the form did not exist before 2009. The practical format variation is therefore narrow: SGML-wrapped HTML is used universally for the primary document, graphics are attached as <TYPE>GRAPHIC SGML documents referenced from inline <img> tags, and the dataset's exclusion of those binaries is a packaging choice rather than a format change.
Not every .htm contains a complete Items 2–8 disclosure. A 7 KB file is almost certainly a sticker or supplement amending a previously filed summary prospectus; a 200+ KB file is typically a full summary prospectus, potentially stacking multiple series.
seriesAndClassesContractsInformation[] is the reliable indicator of how many series are packed into the record.
Some records omit the ticker key inside a class entry entirely. Downstream parsers should treat absence as "not yet assigned or not applicable", not as an error.
Filenames do not always contain 497k (see PGIM's f42385d1.htm). Rely on the SGML <TYPE>497K tag and the formType field in metadata.json.
Amendments are not marked with an /A variant; the sequence of filedAt timestamps and distinct effectivenessDate values identifies successive versions. There is no separate 497K/A form type.
Every record has an empty linkToXbrl and an empty dataFiles[], which is a normal state, not a gap.
<table> elements versus nested <div> grids.
Form 497K is filed by registered open-end management investment companies, that is, investment companies registered under the Investment Company Act of 1940 and classified as open-end management companies. The legal filer is the registrant (the trust or corporation that serves as the issuer), not its investment adviser, distributor, or transfer agent, even though those parties typically prepare the document.
The filer universe consists of:
A fund enters the 497K population only after its board and registrant affirmatively elect the Rule 498 summary prospectus option. Election is voluntary — a fund may continue delivering the full statutory prospectus and never file a 497K — but most large U.S. open-end complexes use summary prospectuses for retail share classes. A single 497K typically covers one or more specific series and classes within the registrant, identified on EDGAR by series/class IDs.
Form 497K is specific to the open-end fund summary prospectus regime. It is not filed by:
Variable annuity and variable life contract summary prospectuses fall under the separate Rule 498A regime (adopted 2020) and are filed under insurance-product submission types tied to Forms N-3, N-4, and N-6, not as 497K.
Form 497K implements Rule 497(k) under the Securities Act of 1933, part of the Rule 497 prospectus-filing family that carries out the prospectus-filing obligations of Section 10 of the Securities Act. Rule 497(k) requires that a summary prospectus relied upon under Rule 498 be filed with the Commission no later than the date of first use.
The substantive content is governed by Items 2 through 8 of Form N-1A (fund name, investment objectives, fees and expenses, principal strategies and risks, past performance, adviser and portfolio managers, purchase/sale procedures, tax information, and payments to intermediaries), combined with the cover-page and required-legend provisions of Rule 498. The summary prospectus is effectively the Items 2 through 8 summary section of the statutory prospectus packaged as a standalone document with legends pointing investors to the statutory prospectus, SAI, and shareholder reports.
The regime was established by the SEC's 2009 rulemaking "Enhanced Disclosure and New Prospectus Delivery Option for Registered Open-End Management Investment Companies" (Investment Company Act Release No. 28584 / Securities Act Release No. 8998, adopted January 13, 2009), which created Rule 498, added Rule 497(k), and amended Form N-1A. The earliest possible 497K filings on EDGAR accordingly begin in early 2009; there is no pre-2009 paper analog.
A registrant files a new Form 497K on each of the following events:
Rule 497(k) imposes a first-use deadline, not a periodic schedule. Filings cluster around:
Because cadence is driven by the fund's own update cycle and the occurrence of material changes, a stable single-series fund may file only a handful of 497Ks per year, while a large multi-series trust with frequent strategy or personnel changes may file many dozens.
Form 497K sits inside a dense cluster of mutual-fund disclosures on EDGAR. The most useful comparisons fall into four groups: other Rule 497 sub-types, the Form N-1A registration statement and its 485 amendment pathway, the shareholder- and portfolio-reporting regime (N-CSR/N-CSRS, N-PORT, N-CEN), and prospectus forms for non-open-end fund structures (N-2, N-3, N-4, N-6, N-14).
Rule 497 is the post-effective filing rule that transmits prospectus materials to EDGAR. The sub-type suffix is load-bearing.
Among the 497 family, only 497K carries the standardized summary-prospectus content defined by Items 2 through 8 of Form N-1A, which is why it merits a dedicated, parseable dataset.
Form N-1A is the registration statement for open-end funds; it contains the full statutory prospectus (Part A) plus the Statement of Additional Information (Part B). The 485 series updates it — 485APOS (post-effective amendment subject to SEC review), 485BPOS (automatically effective), 485BXT (effective-date extensions).
497K is not independent disclosure. It is a condensed Items 2 through 8 extract of the N-1A prospectus, repackaged in the Rule 498 summary format. N-1A and the 485 series are the authoritative long-form source; 497K is the investor-facing short form derived from them. Timing also differs: 485BPOS filings cluster around annual update cycles and 485APOS filings carry 60- or 75-day effectiveness windows, while 497K is transmitted whenever the summary prospectus itself is refreshed, stickered, or reissued for a new share class, producing a one-to-many relationship from 485 events to 497K transmissions.
N-CSR (annual) and N-CSRS (semi-annual) are certified shareholder reports carrying financial statements, schedules of investments, management performance discussion, and officer certifications. They are retrospective; 497K is prospective. 497K tells a prospective investor what they are buying; N-CSR tells existing shareholders what happened during the period.
The closest point of confusion is the Tailored Shareholder Report (TSR) regime effective July 2024, filed as part of N-CSR. TSRs are short, plain-English, visually formatted investor summaries and are superficially similar to 497K. The distinction is function: TSRs replace the long-form annual/semi-annual report and communicate historical results; 497K remains the pre-sale offering summary. The formats resemble each other; the content does not overlap.
N-PORT (monthly holdings, filed quarterly with the third month public) and N-CEN (annual fund census) are structured data filings delivering position-level holdings, derivatives, liquidity classifications, and registrant-level attributes. They carry no narrative offering content. Overlap with 497K is effectively zero, but the two are complementary for a complete fund profile: 497K supplies the stated strategy, fees, and risk narrative; N-PORT supplies the actual holdings that implement it.
Rule 497(k) applies only to open-end management investment companies. Neighboring structures use different forms and, where a summary regime exists, a different rule.
497K is strictly the open-end mutual fund population.
Rule 498 defines a layered-delivery model:
497K is the only tier that is its own distinct EDGAR submission type; the SAI and statutory prospectus are embedded in N-1A and its 485 amendments. That regulatory discreteness plus content standardization is why 497K supports a dedicated dataset in a way the other two tiers do not.
A generic 497 dataset aggregates every Rule 497 sub-type other than 497K (stickers, supplements, full re-filings, procedural certifications) with highly variable structure. 497K is carved out because the Items 2 through 8 format is standardized and therefore suitable for section-level extraction, fee-table parsing, risk-factor normalization, and performance-table extraction at scale. Users who need consistent, machine-parseable summary-prospectus content should go directly to 497K rather than filtering a broader 497 corpus.
Form 497K is narrowly defined: the Rule 497(k) summary prospectus for open-end management investment companies, filed from April 2009 forward, containing a condensed Items 2 through 8 extract of the statutory prospectus. It is not marketing material (497AD), not a sticker or supplement filed under another sub-type (497, 497H2), not a certification (497J), not the registration statement (N-1A via 485APOS/485BPOS/485BXT), not a shareholder report (N-CSR/N-CSRS/TSR), not holdings or census data (N-PORT/N-CEN), and not applicable to closed-end, variable, or merger contexts (N-2, N-3, N-4, N-6, N-14). Its value lies in regulatory precision and content standardization: it is the single EDGAR submission type that reliably delivers the short-form, investor-facing mutual-fund summary prospectus and nothing else.
Form 497K summary prospectuses are the front-line retail disclosure for open-end mutual funds. A continuous corpus from April 2009 onward gives several professional functions a long panel on fees, risks, strategies, advisers, and share-class structure, joined across the structured metadata.json header and the narrative HTML body.
Analysts parse the fee and expense table (management fee, 12b-1, other expenses, acquired fund fees, total and net expense ratios, waivers, and 1/3/5/10-year example dollars) and the principal investment strategies and principal risks sections to drive category mapping, peer grouping, and scoring. metadata.json.seriesAndClassesContractsInformation[].classesContracts[].ticker joins each class to NAV and flow feeds; effectivenessDate anchors the disclosure vintage used at any ratings cutoff. The adviser and sub-adviser block maintains the fund-adviser graph.
Quant teams treat 497K filings as an event stream for fee changes, waiver extensions, new share classes, strategy rewrites, risk additions, benchmark changes, and adviser swaps. Panels are keyed on CIK, series ID, class ID, and ticker. Expense-ratio time series, strategy-text embeddings, and risk-section diffs feed flow-prediction, performance-persistence, and fee-compression models. entities[] and seriesAndClassesContractsInformation[] reconstruct fund-family hierarchies without scraping.
In-house and outside counsel benchmark peer drafting: principal risk wording, fee-table footnotes, waiver and expense-limitation language, class-structure descriptions, and investment-objective phrasing. When launching a new class or amending a fee schedule, teams filter recent peer 497Ks by effectivenessDate to validate that their own language tracks current market practice.
Product ops ingest 497Ks to keep shelves accurate. They reconcile ticker and class mappings from seriesAndClassesContractsInformation[] against internal CUSIP records, detect class launches and closures via filedAt and effectivenessDate, and refresh internal fact sheets, platform screens, and point-of-sale materials from the HTML body. The stream also surfaces adviser changes, fee reductions, and reorganizations that drive shelf approval, breakpoints, and commission grids, and confirms that a current summary prospectus exists before a purchase is allowed.
Platforms need the current summary prospectus on demand, indexed by ticker, for prospectus delivery, cost-transparency tools, and Reg BI documentation. The dataset backs APIs that return the latest 497K by ticker or CIK/series/class, auto-populates account-opening disclosure bundles, and surfaces the relevant prospectus alongside client holdings.
NLP pipelines perform risk-factor extraction, fee-table parsing, supplement and sticker detection, and revision diffing. Outputs include expense-ratio benchmarking dashboards, risk-taxonomy tracking, and alerts on material edits to principal strategies or adviser identification. Join keys come from CIK, series ID, class ID, ticker, and effectivenessDate.
The 2009 start date aligns with the SEC summary prospectus regime, making the corpus a natural panel for fee compression, share-class proliferation, risk-disclosure evolution, strategy drift, adviser-subadviser networks, and disclosure-flow relationships. Researchers extract fee tables, performance, and risk text keyed on class, series, and CIK, then merge with returns and flows. effectivenessDate and entities[] anchor family identity and vintage across longitudinal samples.
Cost calculators, fee-transparency apps, and fund comparison interfaces populate fields by ticker using seriesAndClassesContractsInformation[].classesContracts[].ticker. Fee-table parsing drives cost projections; the strategies paragraph yields category labels and plain-English summaries; the adviser block supplies branding cues.
Section 36(b) excessive-fee teams, mis-selling matters, and class-action counsel use effectivenessDate and filedAt to reconstruct the fee table, waivers, risk disclosures, and adviser identification in force on specific dates. Expert witnesses compile exhibits comparing a defendant fund's fee trajectory to peer funds, document when specific risk language appeared or disappeared, and trace class-specific fee differentials over time.
Across these functions, the join between entities[], seriesAndClassesContractsInformation[], effectivenessDate, filedAt, and the HTML body (fee table, principal risks, performance, adviser) is what makes each workflow scale.
Each use case below anchors to specific fields in metadata.json and to identifiable sections of the Items 2-8 disclosure inside the primary HTML.
Parse the Item 3 Annual Fund Operating Expenses table from every 497K HTML and key each row by seriesAndClassesContractsInformation[].series, .classesContracts[].classContract, and .ticker. Stamp every observation with effectivenessDate to produce a monthly panel of management fee, 12b-1, other expenses, acquired fund fees, gross expense ratio, waiver, and net expense ratio from April 2009 forward. The output drives fee-compression regressions, peer-group median benchmarking, and share-class fee-differential analysis for Section 36(b) expert reports.
Filter accessions where entities[].companyName matches Innovator, First Trust, AllianzIM, Calamos, BlackRock iShares Deep Buffer, or TrueShares, then group sibling accessions within a single monthly ZIP by filer CIK. Extract cap rate, buffer level, floor, and outcome-period start/end dates from the Principal Investment Strategies section, joined to the ETF ticker in seriesAndClassesContractsInformation[]. The result is a structured ladder of active outcome-period series for portfolio construction, laddered-product monitoring, and distribution-shelf maintenance.
For a given series identifier, sort 497K accessions by filedAt and run heading-anchored extraction on the Principal Risks block. Produce a diff between consecutive versions to detect added, removed, or rewritten risk factors (e.g., appearance of crypto-derivatives risk, FLEX options risk, or new concentration language). Alerts feed regtech monitoring dashboards and compliance review of peer-family disclosure practice; the effectivenessDate pair identifies when each change took effect.
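The version-to-version comparison described above could be prototyped with the standard-library difflib module. This sketch assumes the Principal Risks text has already been extracted from each HTML document into plain text with one risk caption or paragraph per line; that extraction step and the function name are hypothetical and not shown.

```python
import difflib
from typing import Dict, List


def diff_risk_sections(old_text: str, new_text: str) -> Dict[str, List[str]]:
    """Report lines added to or removed from the newer Principal Risks text."""
    old_lines = [line.strip() for line in old_text.splitlines() if line.strip()]
    new_lines = [line.strip() for line in new_text.splitlines() if line.strip()]
    added, removed = [], []
    for entry in difflib.ndiff(old_lines, new_lines):
        if entry.startswith("+ "):
            added.append(entry[2:])
        elif entry.startswith("- "):
            removed.append(entry[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return {"added": added, "removed": removed}


old_version = "Market Risk.\nOptions Risk."
new_version = "Market Risk.\nFLEX Options Risk.\nOptions Risk."
changes = diff_risk_sections(old_version, new_version)
print(changes)
```

A rewritten risk paragraph shows up as one removed line plus one added line, so a downstream step would be needed to pair them if rewrites must be distinguished from pure additions.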
Index every record by each classesContracts[].ticker and keep the accession with the latest effectivenessDate per ticker. Serve the primary HTML on demand so broker-dealers, RIA platforms, and account-opening flows can attach the current summary prospectus to a point-of-sale record, populate Reg BI disclosure bundles, and confirm that a live summary prospectus exists before permitting a buy ticket.
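One possible shape for the ticker index is shown below. It keeps, per ticker, the accession with the latest effectivenessDate and breaks ties on filedAt. The function name is illustrative, the accession numbers in the sample are placeholders, and the sketch assumes both date fields are present on every record.

```python
from datetime import datetime
from typing import Dict, Iterable


def latest_by_ticker(records: Iterable[dict]) -> Dict[str, str]:
    """Map each ticker to the accession number of its most recent 497K."""
    best = {}
    for md in records:
        # ISO dates sort lexicographically; filedAt is parsed so that
        # differing UTC offsets compare correctly.
        rank = (md["effectivenessDate"], datetime.fromisoformat(md["filedAt"]))
        for series in md.get("seriesAndClassesContractsInformation", []):
            for cls in series.get("classesContracts", []):
                ticker = cls.get("ticker")
                if ticker and (ticker not in best or rank > best[ticker][0]):
                    best[ticker] = (rank, md["accessionNo"])
    return {ticker: accession for ticker, (_, accession) in best.items()}


def _record(accession: str, effective: str, filed: str) -> dict:
    """Build a placeholder metadata record for the demo."""
    return {
        "accessionNo": accession,
        "effectivenessDate": effective,
        "filedAt": filed,
        "seriesAndClassesContractsInformation": [
            {"series": "S000000001",
             "classesContracts": [{"classContract": "C000000001", "ticker": "BFJL"}]}
        ],
    }


records = [
    _record("0000000000-25-000001", "2025-07-01", "2025-06-30T20:23:34-04:00"),
    _record("0000000000-24-000009", "2024-07-01", "2024-06-28T16:05:00-04:00"),
]
index = latest_by_ticker(records)
print(index)
```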
Stream new 497K accessions by monthly container and compare the seriesAndClassesContractsInformation[] block against a rolling registry of known (series, classContract, ticker) tuples. First-time appearance of a C000###### identifier or ticker flags a share-class launch; disappearance across successive filings for the same series flags a closure. Events feed product-ops shelf updates, commission-grid refreshes, and quant flow-prediction models that condition on share-class proliferation.
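Launch detection against a rolling registry might look like the following sketch. It covers only first-time appearances of a class identifier; closure detection would need per-series history and is left out. The function name and the identifiers in the sample are placeholders.

```python
from typing import List, Optional, Set, Tuple

Launch = Tuple[Optional[str], str, Optional[str]]


def detect_launches(known_classes: Set[str], metadata: dict) -> List[Launch]:
    """Return (series, classContract, ticker) for class IDs not seen before.

    The registry set is updated in place so later calls see earlier launches.
    """
    launches = []
    for series in metadata.get("seriesAndClassesContractsInformation", []):
        for cls in series.get("classesContracts", []):
            class_id = cls.get("classContract")
            if class_id and class_id not in known_classes:
                launches.append((series.get("series"), class_id, cls.get("ticker")))
                known_classes.add(class_id)
    return launches


registry = {"C000000001"}
filing = {
    "seriesAndClassesContractsInformation": [
        {"series": "S000000001",
         "classesContracts": [
             {"classContract": "C000000001", "ticker": "RYDVX"},
             {"classContract": "C000000002"},  # new class, ticker not yet assigned
         ]}
    ]
}
first_pass = detect_launches(registry, filing)
second_pass = detect_launches(registry, filing)
print(first_pass, second_pass)
```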
Extract the Item 5 Investment Adviser block from the HTML and join to entities[] (filer CIK and file number) and seriesAndClassesContractsInformation[].series. Accumulated over the full 2009-present window, this yields a longitudinal bipartite graph of advisers, sub-advisers, and fund series, with portfolio-manager names and tenures. Detecting adviser swaps via changes in the Item 5 text between accessions for the same series supports subadvisory-mandate tracking and Item 4.01-style manager-change monitoring for fund complexes.
Use HTML byte size, section-heading coverage, and the presence of a complete Items 2-8 sequence to split the corpus into full summary prospectuses and shorter stickers/supplements. Route full prospectuses to fee-table, performance-chart, and risk-extraction pipelines; route supplements to a narrower change-event extractor that captures cap-rate resets, fee-waiver extensions, portfolio-manager corrections, and benchmark substitutions. The split keeps downstream NLP pipelines from misreading 7 KB stickers as structured prospectuses.
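A first-pass splitter along these lines could combine byte size with heading coverage. Everything specific in this sketch is an assumption: the heading strings are typical captions that filers may word differently, and the 10,000-byte and five-heading thresholds are illustrative values that would need tuning against real filings. Ambiguous documents are routed to a "review" bucket rather than forced into either class.

```python
import re

# Typical section captions; actual filer wording varies (assumption).
SECTION_HEADINGS = [
    "investment objective",
    "fees and expenses",
    "principal investment strategies",
    "principal risks",
    "performance",
    "purchase and sale of fund shares",
    "tax information",
]


def classify_497k(html: str, min_headings: int = 5, small_bytes: int = 10_000) -> str:
    """Label a document 'full', 'supplement', or 'review' using size and headings."""
    text = re.sub(r"<[^>]+>", " ", html)           # crude tag stripping
    text = re.sub(r"\s+", " ", text).lower()
    hits = sum(1 for heading in SECTION_HEADINGS if heading in text)
    size = len(html.encode("utf-8"))
    if hits >= min_headings and size > small_bytes:
        return "full"
    if hits < min_headings and size <= small_bytes:
        return "supplement"
    return "review"


full_html = "<html>" + "".join(
    f"<h2>{heading.title()}</h2><p>{'x' * 2000}</p>" for heading in SECTION_HEADINGS
) + "</html>"
sticker_html = "<html><p>Supplement dated June 30, 2025. The cap is updated.</p></html>"
print(classify_497k(full_html), classify_497k(sticker_html))
```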
The Form 497K Files Dataset can be accessed programmatically through a JSON index endpoint, downloaded in full as a single archive, or retrieved one monthly container at a time. Containers are organized by month, starting from April 2009.
Dataset Index JSON API: https://api.sec-api.io/datasets/form-497k-files.json
Returns dataset-level metadata (name, description, last updated timestamp, earliest sample date, total records, total size, form types, container format, and file types) along with a containers array listing every per-period ZIP archive with its key, download URL, size, record count, and last-modified timestamp. This endpoint does not require an API key and can be polled to monitor which monthly containers were refreshed in the most recent run, enabling incremental downloads instead of re-fetching the entire dataset.
Example response:
{
  "datasetId": "1f13365b-9ade-61dc-9da2-e4b13255c3bd",
  "datasetDownloadUrl": "https://api.sec-api.io/datasets/form-497k-files.zip",
  "name": "Form 497K Files Dataset",
  "updatedAt": "2026-04-22T02:58:37.780Z",
  "earliestSampleDate": "2009-04-01",
  "totalRecords": 323522,
  "totalSize": 3665685620,
  "formTypes": ["497K"],
  "containerFormat": "ZIP",
  "fileTypes": ["HTML", "JSON", "TXT", "PDF"],
  "containers": [
    {
      "downloadUrl": "https://api.sec-api.io/datasets/form-497k-files/2025/2025-06.zip",
      "key": "2025/2025-06.zip",
      "size": 13818783,
      "records": 154,
      "updatedAt": "2026-04-22T02:58:37.780Z"
    }
  ]
}
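Polling the index for refreshed containers can be sketched as follows. The field names (containers, key, updatedAt) come from the example response; the function names, the last_seen bookkeeping dictionary, and the 30-second timeout are assumptions. The demo runs against an inline copy of the sample response so that no network call is made.

```python
import json
import urllib.request
from typing import Dict, List

INDEX_URL = "https://api.sec-api.io/datasets/form-497k-files.json"


def fetch_index(url: str = INDEX_URL) -> dict:
    """Download and decode the dataset index (no API key needed per the text above)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def containers_to_refresh(index: dict, last_seen: Dict[str, str]) -> List[dict]:
    """Return containers whose updatedAt differs from the value recorded locally."""
    return [
        container
        for container in index.get("containers", [])
        if last_seen.get(container["key"]) != container["updatedAt"]
    ]


# Inline copy of the sample response, trimmed to the fields used here.
sample_index = {
    "containers": [
        {"key": "2025/2025-06.zip",
         "downloadUrl": "https://api.sec-api.io/datasets/form-497k-files/2025/2025-06.zip",
         "updatedAt": "2026-04-22T02:58:37.780Z"}
    ]
}
stale = containers_to_refresh(sample_index, last_seen={})
fresh = containers_to_refresh(
    sample_index, last_seen={"2025/2025-06.zip": "2026-04-22T02:58:37.780Z"}
)
print(len(stale), len(fresh))
```

In a live run, fetch_index() would supply the index and last_seen would be persisted between runs, for example as a small JSON file.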
Download Entire Dataset: https://api.sec-api.io/datasets/form-497k-files.zip?token=YOUR_API_KEY
Downloads the complete dataset as one ZIP archive containing every monthly container. This endpoint requires an SEC API key, passed via the token query parameter or an Authorization header. Use this option for one-time bulk ingestion; for ongoing updates prefer the per-container approach below.
Download Single Container: https://api.sec-api.io/datasets/form-497k-files/2025/2025-06.zip?token=YOUR_API_KEY
Downloads one monthly container ZIP. Each container extracts to a YYYY-MM/ directory containing one subdirectory per accession number, which holds the metadata file and all documents from the original EDGAR submission (excluding image files). This endpoint requires an SEC API key, passed via the token query parameter or an Authorization header.
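Once a container has been downloaded, its accession folders can be walked with the standard-library zipfile module. The sketch below assumes the YYYY-MM/<accession>/<file> layout described above; the function name and the primary.htm filename in the demo are placeholders, and the demo builds a tiny in-memory archive instead of using a real container.

```python
import io
import json
import zipfile
from typing import Dict


def read_container(zf: zipfile.ZipFile) -> Dict[str, dict]:
    """Group a container's entries by accession folder."""
    records: Dict[str, dict] = {}
    for name in zf.namelist():
        parts = name.strip("/").split("/")
        if len(parts) != 3:  # skip directory entries and unexpected paths
            continue
        _month, accession, filename = parts
        record = records.setdefault(accession, {})
        if filename == "metadata.json":
            record["metadata"] = json.loads(zf.read(name))
        elif filename.lower().endswith((".htm", ".html")):
            record["html_name"] = filename
    return records


# Build a tiny stand-in archive with the same layout as a monthly container.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr(
        "2025-06/000121390025059889/metadata.json",
        json.dumps({"formType": "497K", "accessionNo": "0001213900-25-059889"}),
    )
    demo.writestr("2025-06/000121390025059889/primary.htm", "<DOCUMENT>...</DOCUMENT>")

with zipfile.ZipFile(buffer) as zf:
    records = read_container(zf)
print(sorted(records), records["000121390025059889"]["html_name"])
```

For a real container, zipfile.ZipFile("2025-06.zip") would replace the in-memory buffer.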
The dataset covers Form 497K, the Rule 497(k) submission sub-type used to transmit a Rule 498 summary prospectus under the Securities Act of 1933. Every record has formType equal to the string "497K"; the dataset includes no other form types.
One record represents a single EDGAR submission of Form 497K, materialized as a per-accession subdirectory inside a monthly ZIP container. Each subdirectory holds exactly two files: a normalized metadata.json header and one SGML-wrapped HTML document containing the summary-prospectus text.
Form 497K is filed by registered open-end management investment companies — traditional U.S. mutual funds and exchange-traded funds structured as open-end funds — that have affirmatively elected the Rule 498 summary prospectus option. Election is voluntary, so registrants that continue to deliver the full statutory prospectus do not appear in the 497K population.
The dataset begins on 2009-04-01 and extends to the present. That start date aligns with the effective date of the summary prospectus regime under Release No. 33-8998 (effective 31 March 2009); there is no pre-2009 paper analog for Form 497K.
The dataset is distributed as monthly ZIP containers named YYYY-MM.zip. Inside each container, every accession folder contains an HTML primary document (SGML-wrapped) and a metadata.json header. The dataset advertises support for HTML, JSON, TXT, and PDF file types, though 497K submissions in practice use HTML almost universally.
Form 497 (base) carries definitive prospectus materials under Rule 497(a)-(j) — full statutory prospectuses, stickers, and supplements of highly variable structure. Form 497K is specifically the Rule 497(k) summary prospectus: short, standardized, and governed by Items 2 through 8 of Form N-1A. 497K isolates the parseable summary-prospectus population that a generic 497 filter would dilute with heterogeneous content.
Why is linkToXbrl always empty and dataFiles[] always []?
Summary prospectuses filed under 497K do not carry interactive-data attachments. The structured Risk/Return Summary XBRL that corresponds to the same disclosure is filed separately under the companion registration-statement amendment (typically 485BPOS) as an interactive-data exhibit under Regulation S-T. The empty XBRL fields are a normal state of every 497K record, not a gap in the dataset.