
Download 10-K Filings of Russell 3000 Companies from SEC EDGAR


On this page:

  • Create a List of Tickers of Russell 3000 Constituents
  • Create a List of All 10-K URLs on SEC EDGAR
  • Download All 10-K Filings

        This Python tutorial demonstrates how to download all SEC 10-K filings of Russell 3000 companies from the EDGAR database.

        The process involves three steps: first, creating a list of all company tickers currently included in the Russell 3000 index; second, using the sec-api Python package and the Query API to find the URLs of all 10-K filings filed within a given timeframe (e.g., 2015 to 2022); and third, downloading all filings with the Render API, exposed as the RenderApi class of the sec-api package. We use pandarallel to parallelize the download process and speed it up.

        The source code can also be used to download any other type of SEC EDGAR filing filtered by other criteria, such as the 10-Qs of all S&P 500 companies. Because the Render API serves the filings directly from the sec-api.io servers, we also avoid SEC.gov's limit of 10 requests per second and can download up to 40 filings per second.

        To structure the downloaded HTML files of 10-K filings, we can develop a function that will save the files in a simple folder structure with the ticker symbol as the name of the parent folder. An example folder structure is shown below:

        [Image: example folder structure — one folder per ticker holding the downloaded 10-K HTML files]

        Create a List of Tickers of Russell 3000 Constituents

        First, let's download the CSV file containing all recent holdings of the Russell 3000 index from ishares.com and save the content as russell-3000.csv. The source file can be found by clicking the "Detailed Holdings and Analytics" link on the ishares.com website of the corresponding index. See the screenshot below for more details.

        [Screenshot: location of the "Detailed Holdings and Analytics" link on ishares.com]

        Please note that depending on when you are reading this tutorial, you may need to update the asOfDate query parameter in the URL to obtain the most recent holdings. The date format should be YYYYMMDD. For example:

        txt
        https://ishares.com/..ishares-russell-3000-etf..?..asOfDate=20221230
                                                                    ^------^
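Rather than hard-coding the date, the asOfDate value can also be generated with Python's datetime module; a minimal sketch (the example date is illustrative):

```python
from datetime import date

# format a date as YYYYMMDD for the asOfDate query parameter
as_of_date = date(2022, 12, 30).strftime('%Y%m%d')
print(as_of_date)  # 20221230
```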
        import requests

        url = 'https://www.ishares.com/us/products/239714/ishares-russell-3000-etf/1467271812596.ajax?fileType=csv&fileName=IWV_holdings&dataType=fund&asOfDate=20221230'
        response = requests.get(url)

        with open('russell-3000.csv', 'wb') as f:
            f.write(response.content)

        The CSV file containing the recent holdings of the Russell 3000 index needs to be cleaned before converting it into a pandas DataFrame. The first 10 lines of metadata have a different structure compared to the actual holding data, which starts at line 11. The file also contains a text block at the end that must be removed.

        [Image: raw CSV file — metadata in lines 1 to 10]

        [Image: raw CSV file — text block after line 2611]

        To clean the file, we need to locate the line numbers (indices) of the rows containing the non-breaking space character \xa0, which appear to be empty rows. These rows separate the metadata section at the beginning and the text block at the end from the actual holdings data. Using the indices of these rows, we can split the CSV file and extract the holdings data.

        # clean the CSV file
        import csv

        with open('russell-3000.csv', 'r', encoding='utf-8') as f:
            reader = csv.reader(f)
            rows = list(reader)

        # rows containing only the non-breaking space '\xa0' (or no fields at all)
        # separate the metadata header and the trailing text block from the holdings
        empty_row_indices = [i for i in range(len(rows)) if (len(rows[i]) == 0 or '\xa0' in rows[i])]

        print('Empty rows:', empty_row_indices)

        start = empty_row_indices[0] + 1
        end = empty_row_indices[1]
        cleaned_rows = rows[start:end]

        with open('russell-3000-clean.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerows(cleaned_rows)
        Empty rows: [8, 2621]
        import pandas as pd

        # load Russell 3000 holdings CSV into a dataframe
        holdings = pd.read_csv('./russell-3000-clean.csv')

        print('Russell 3000 Constituents')
        holdings
        Russell 3000 Constituents
        Out:
        Ticker | Name | Sector | Asset Class | Market Value | Weight (%) | Notional Value | Shares | Price | Location | Exchange | Currency | FX Rate | Market Currency | Accrual Date
        0 | AAPL | APPLE INC | Information Technology | Equity | 559,365,151.11 | 5.16 | 559,365,151.11 | 4,305,127.00 | 129.93 | United States | NASDAQ | USD | 1.0 | USD | -
        1 | MSFT | MICROSOFT CORP | Information Technology | Equity | 513,917,712.42 | 4.74 | 513,917,712.42 | 2,142,931.00 | 239.82 | United States | NASDAQ | USD | 1.0 | USD | -
        2 | AMZN | AMAZON COM INC | Consumer Discretionary | Equity | 213,823,596.00 | 1.97 | 213,823,596.00 | 2,545,519.00 | 84.00 | United States | NASDAQ | USD | 1.0 | USD | -
        3 | BRKB | BERKSHIRE HATHAWAY INC CLASS B | Financials | Equity | 159,603,687.60 | 1.47 | 159,603,687.60 | 516,684.00 | 308.90 | United States | New York Stock Exchange Inc. | USD | 1.0 | USD | -
        4 | GOOGL | ALPHABET INC CLASS A | Communication | Equity | 151,996,026.75 | 1.40 | 151,996,026.75 | 1,722,725.00 | 88.23 | United States | NASDAQ | USD | 1.0 | USD | -
        ...
        2606 | P5N994 | Petrocorp Inc Escrow | Energy | Equity | 0.01 | 0.00 | 0.01 | 1,248.00 | 0.00 | United States | Non-Nms Quotation Service (Nnqs) | USD | 1.0 | USD | -
        2607 | - | OMNIAB INC $12.50 VESTING Prvt | Health Care | Equity | 0.02 | 0.00 | 0.02 | 1,502.00 | 0.00 | United States | NO MARKET (E.G. UNLISTED) | USD | 1.0 | USD | -
        2608 | - | OMNIAB INC $15.00 VESTING Prvt | Health Care | Equity | 0.02 | 0.00 | 0.02 | 1,502.00 | 0.00 | United States | NO MARKET (E.G. UNLISTED) | USD | 1.0 | USD | -
        2609 | TFM | FRESH MARKET ESCROW | Consumer Staples | Equity | 0.11 | 0.00 | 0.11 | 10,847.00 | 0.00 | United States | NASDAQ | USD | 1.0 | USD | -
        2610 | MARGIN_USD | FUTURES USD MARGIN BALANCE | Cash and/or Derivatives | Cash Collateral and Margins | -75,205.00 | 0.00 | -75,205.00 | -75,205.00 | 100.00 | United States | - | USD | 1.0 | USD | Jan 01, 2000

        2611 rows × 15 columns

        Create a List of All 10-K URLs on SEC EDGAR

        Next, we create a list of URLs for all 10-K filings associated with each ticker listed in the holdings DataFrame. The Query API helps us achieve this by allowing us to filter the EDGAR database by any filing metadata field, such as form type, ticker of filer, and filing date.

        We also save additional metadata, such as the company ticker and filing date, for each URL. To filter for filings, we use the Lucene search syntax.
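For a flavor of the Lucene syntax before diving into the full code, here is an illustrative query string (the tickers and date range are placeholders, not part of the tutorial's actual run):

```python
# build an illustrative Lucene query: 10-K filings of two tickers filed in 2022
tickers = ['AAPL', 'MSFT']
ticker_query = 'ticker:({})'.format(', '.join(tickers))
query_string = ticker_query + ' AND filedAt:[2022-01-01 TO 2022-12-31] AND formType:"10-K"'
print(query_string)
# ticker:(AAPL, MSFT) AND filedAt:[2022-01-01 TO 2022-12-31] AND formType:"10-K"
```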

        !pip install -q sec-api

        API_KEY = 'YOUR_API_KEY'

        A company publishes one 10-K report per year, and the Query API can return up to 50 search results per request. For this tutorial, we ignore all Form 10-K/A and NT 10-K filings, which represent amended filings and notifications of late filing, respectively.

        To stay below the 50-result limit, we divide the tickers into batches of at most 25: since each company files one 10-K per year, a single batch yields at most roughly 25 filings per request and year. We then iterate over the range of years we are interested in, such as 2020 to 2022, and download all 10-K URLs per batch.

        For instance, we begin with the year 2020 and download the URLs for the first 25 tickers, then the next 25, and so on until we have downloaded all URLs for 2020. We repeat this process for each subsequent year.

        # create batches of tickers: [[A,B,C], [D,E,F], ...]
        # a single batch holds at most max_length_of_batch tickers
        def create_batches(tickers=[], max_length_of_batch=25):
          batches = [[]]

          for ticker in tickers:
            if len(batches[-1]) == max_length_of_batch:
              batches.append([])

            batches[-1].append(ticker)

          return batches


        batches = create_batches(list(holdings['Ticker']))
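As a quick sanity check of the batching logic, the helper is restated below so the snippet runs on its own; the tickers are placeholders:

```python
def create_batches(tickers=[], max_length_of_batch=25):
    batches = [[]]
    for ticker in tickers:
        if len(batches[-1]) == max_length_of_batch:
            batches.append([])
        batches[-1].append(ticker)
    return batches

# five placeholder tickers, batches of at most two
print(create_batches(['T1', 'T2', 'T3', 'T4', 'T5'], max_length_of_batch=2))
# [['T1', 'T2'], ['T3', 'T4'], ['T5']]
```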

        We define the function download_10K_metadata(tickers, start_year, end_year) to download the URLs of the 10-K filings for all companies listed in tickers within the specified date range of start_year and end_year.

        The function saves the complete list of URLs to the metadata.csv file on our local disk. Once completed, we can re-run download_10K_metadata() without re-downloading all URLs.

        from sec_api import QueryApi, RenderApi
        from pathlib import Path

        queryApi = QueryApi(api_key=API_KEY)

        def download_10K_metadata(tickers=[], start_year=2021, end_year=2022):
          # reuse the cached metadata if it was downloaded before
          if Path('metadata.csv').is_file():
            print('✅ Reading metadata from metadata.csv')
            result = pd.read_csv('metadata.csv')
            return result

          print('✅ Starting downloading metadata for years {} to {}'.format(start_year, end_year))

          # create ticker batches, with 25 tickers per batch
          batches = create_batches(tickers)
          frames = []

          for year in range(start_year, end_year + 1):
            for batch in batches:
              tickers_joined = ', '.join(batch)
              ticker_query = f'ticker:({tickers_joined})'
              filed_at_query = f'filedAt:[{year}-01-01 TO {year}-12-31]'
              form_type_query = 'formType:"10-K" AND NOT formType:"10-K/A" AND NOT formType:NT'

              query_string = ticker_query + ' AND ' + filed_at_query + ' AND ' + form_type_query

              query = {
                "query": query_string,
                "from": "0",
                "size": "50",
                "sort": [{ "filedAt": { "order": "desc" } }]
              }

              response = queryApi.get_filings(query)

              filings = response['filings']

              metadata = list(map(lambda f: {'ticker': f['ticker'],
                                             'cik': f['cik'],
                                             'formType': f['formType'],
                                             'filedAt': f['filedAt'],
                                             'filingUrl': f['linkToFilingDetails']}, filings))

              df = pd.DataFrame.from_records(metadata)

              frames.append(df)

            print('✅ Downloaded metadata for year', year)


          result = pd.concat(frames)
          result.to_csv('metadata.csv', index=False)

          number_metadata_downloaded = len(result)
          print('✅ Download completed. Metadata downloaded for {} filings.'.format(number_metadata_downloaded))

          return result


        tickers = list(holdings['Ticker'])

        metadata = download_10K_metadata(tickers=tickers, start_year=2021, end_year=2022)
        ✅ Starting downloading metadata for years 2021 to 2022
        ✅ Downloaded metadata for year 2021
        ✅ Downloaded metadata for year 2022
        ✅ Download completed. Metadata downloaded for 4861 filings.
        print('SEC form types:', list(metadata.formType.unique()))
        print('Number of 10-K URLs:', len(metadata))
        print('Metadata of the first 10 filings:')
        metadata.head(10)
        SEC form types: ['10-K']
        Number of 10-K URLs: 4861
        Metadata of the first 10 filings:
        Out:
        ticker | cik | formType | filedAt | filingUrl
        0 | AVGO | 1730168 | 10-K | 2021-12-17T16:42:51-05:00 | https://www.sec.gov/Archives/edgar/data/173016...
        1 | AMAT | 6951 | 10-K | 2021-12-17T16:14:51-05:00 | https://www.sec.gov/Archives/edgar/data/6951/0...
        2 | DE | 315189 | 10-K | 2021-12-16T11:39:34-05:00 | https://www.sec.gov/Archives/edgar/data/315189...
        3 | ADI | 6281 | 10-K | 2021-12-03T16:02:52-05:00 | https://www.sec.gov/Archives/edgar/data/6281/0...
        4 | DIS | 1744489 | 10-K | 2021-11-24T16:34:25-05:00 | https://www.sec.gov/Archives/edgar/data/174448...
        5 | SBUX | 829224 | 10-K | 2021-11-19T16:46:02-05:00 | https://www.sec.gov/Archives/edgar/data/829224...
        6 | V | 1403161 | 10-K | 2021-11-18T16:06:50-05:00 | https://www.sec.gov/Archives/edgar/data/140316...
        7 | QCOM | 804328 | 10-K | 2021-11-03T16:06:41-04:00 | https://www.sec.gov/Archives/edgar/data/804328...
        8 | AAPL | 320193 | 10-K | 2021-10-28T18:04:28-04:00 | https://www.sec.gov/Archives/edgar/data/320193...
        9 | ACN | 1467373 | 10-K | 2021-10-15T06:54:29-04:00 | https://www.sec.gov/Archives/edgar/data/146737...

        Download All 10-K Filings

        Let's continue with developing the code responsible for downloading the actual filing content. The function download_filing(metadata) creates a new folder for a ticker and downloads the HTML version of the 10-K filing into the newly created folder.

        The function also removes the iXBRL inline reader parameter from the filing URL using the .replace('ix?doc=/', '') method. For example:

        Original URL:

        https://www.sec.gov/ix?doc=/Archives/edgar/data/1045....0222.htm
                            ^------^ to be removed

        Cleaned URL:

        https://www.sec.gov/Archives/edgar/data/1045....0222.htm
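The replacement can be checked in isolation with a hypothetical filing URL (the CIK and file name below are made up for illustration):

```python
# strip the iXBRL inline reader prefix from a hypothetical filing URL
url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/1234567/example-10k.htm'
cleaned = url.replace('ix?doc=/', '')
print(cleaned)
# https://www.sec.gov/Archives/edgar/data/1234567/example-10k.htm
```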
        import os

        renderApi = RenderApi(api_key=API_KEY)

        def download_filing(metadata):
          ticker = metadata['ticker']
          # remove the iXBRL inline reader parameter from the URL
          url = metadata['filingUrl'].replace('ix?doc=/', '')

          try:
            new_folder = "./filings/" + ticker

            if not os.path.isdir(new_folder):
              os.makedirs(new_folder)

            file_content = renderApi.get_filing(url)
            file_name = url.split("/")[-1]

            with open(new_folder + "/" + file_name, "w") as f:
              f.write(file_content)
          except Exception:
            print('❌ {ticker}: download failed: {url}'.format(ticker=ticker, url=url))
        # download a sample filing
        download_filing(metadata.iloc[0])
        print('✅ Sample 10-K filing downloaded for {}'.format(metadata.iloc[0]['ticker']))
        ✅ Sample 10-K filing downloaded for AVGO

        To fully utilize the Render API and reduce the overall download time, we parallelize the process using pandarallel, a library that applies a function in parallel across the rows or columns of a pandas DataFrame.

        Downloading all 4,861 filings one after another in a for loop would take around 45 minutes, assuming an average of 550 ms per filing. With four workers downloading filings simultaneously, the same job finishes in roughly a quarter of that time, about 11 minutes.
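The back-of-the-envelope arithmetic behind these estimates (the 550 ms per filing is an assumed average, not a measured constant):

```python
n_filings = 4861
seconds_per_filing = 0.55  # assumed average download time per filing
workers = 4

sequential_minutes = n_filings * seconds_per_filing / 60
parallel_minutes = sequential_minutes / workers

print(round(sequential_minutes))  # ~45 minutes sequentially
print(round(parallel_minutes))   # ~11 minutes with four workers
```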

        The figure below highlights the difference between downloading one filing at a time versus downloading four filings in parallel.

        [Figure: downloading one filing at a time vs. four filings in parallel]

        !pip install -q pandarallel
        from pandarallel import pandarallel

        number_of_workers = 4
        pandarallel.initialize(progress_bar=True, nb_workers=number_of_workers, verbose=0)

        # run a quick sample and download the first 50 filings
        sample = metadata.head(50)
        sample.parallel_apply(download_filing, axis=1)

        # uncomment to download all filings instead
        # metadata.parallel_apply(download_filing, axis=1)

        print('✅ Download completed')
        ✅ Download completed

        © 2025 sec-api.io by Data2Value GmbH. All rights reserved.
