Download SEC Filings from EDGAR
This Python tutorial will guide you through the process of downloading SEC filings from the EDGAR database and saving them to your local disk. By following this tutorial, you will be able to download filings without being blocked by sec.gov, as we will utilize the Query API and Render API provided by sec-api.io. No prior Python experience is required, and you can run the example in your browser using the "Open in Colab" button.
The tutorial consists of two main steps:
- Building a list of HTML filing URLs: Using the Query API, you will search and filter the EDGAR database to create a list of URLs for all the HTML filings. This list will be saved to your local disk.
- Downloading and scraping the filings: Using the Render API, you will download and scrape the filings, saving them to the `filings` folder on your local disk. This step allows you to download up to 40 filings in parallel while accessing the filings directly from sec-api.io servers, avoiding any issues with being blocked by sec.gov servers. We will also cover how to download all filings as PDF files.
The tutorial focuses on downloading 10-K filings filed between 2020 and 2022, but you can adjust the search criteria and date range as needed. Additionally, the examples provided can be adapted to download other form types, filing exhibits, and XBRL files.
Following is an illustrative example of the folder structure for the downloaded filings; the exact folder and file names depend on which filings you download (the `aapl-20200926.htm` name below is just an example):
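filings/
├── AAPL/
│   └── 2020-10-29_10-K_aapl-20200926.htm
└── <TICKER>/
    └── <filing date>_<form type>_<original EDGAR file name>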
Please note that downloading and parsing EDGAR's master.idx index files is not required for this tutorial.
Let's get started with downloading SEC filings from EDGAR!
Getting Started
To begin, we need to install the `sec-api` Python package, which will enable us to utilize the Query API and Render API for accessing and downloading SEC filings from the EDGAR database.
The Query API allows us to filter the EDGAR database using different search criteria, such as form types, filing dates, tickers, and more. On the other hand, the Render API enables us to download any EDGAR filing or exhibit at a speed of up to 40 downloads per second.
!pip install -q sec-api
API_KEY = 'YOUR_API_KEY'
Create a List of URLs for All EDGAR Filings
To obtain the URLs of all EDGAR filings that match our search criteria, we will utilize the Query API. This API allows us to search and filter all filings on the EDGAR database filed since 1994 using various parameters, such as form types, filing dates, tickers, and more. By defining a search query, we can retrieve the metadata of all filings that meet our specified criteria.
Please refer to the full documentation of the Query API to learn more about all available search parameters.
Define the Filing Search Query
Defining a search query is straightforward. For example, to retrieve all 10-K filings, we can use the following query:
formType:"10-K"
It's important to note that this search query will also include "10-K/A" (amended 10-K) and "NT 10-K" (notification of inability to timely file Form 10-K) filings. If you want to exclude these types, you can modify the query as follows:
formType:"10-K" AND NOT formType:("10-K/A", NT)
Additionally, you can narrow down the search by specifying a filing date range. For example, to search for filings filed between January 1, 2020, and December 31, 2020, you can use the following search term:
filedAt:[2020-01-01 TO 2020-12-31]
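Both filters can be combined with `AND`. For example, to retrieve all original 10-K filings filed in 2020, the combined query (this is exactly how the code below assembles its queries) looks like this:

formType:"10-K" AND NOT formType:("10-K/A", NT) AND filedAt:[2020-01-01 TO 2020-12-31]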
The search query follows the Lucene syntax. For more information on building complex search terms and using Lucene, refer to the Lucene Query Syntax Overview documentation.
Response of the Query API
The Query API will return the metadata of all filings that match our search query. Each filing's metadata contains various information, including:
- `formType`: The EDGAR form type (e.g., `10-K`, `10-Q`).
- `cik`: The Central Index Key (CIK) of the filer, with trailing zeros removed.
- `companyName`: The name of the filer.
- `linkToFilingDetails`: The URL to the HTML version of the filing on EDGAR.
- `linkToHtml`: The URL to the index page of the filing, which lists all attachments, exhibits, XBRL files, images, and more.
- `filedAt`: The date and time when the filing was accepted by the EDGAR system.
- `periodOfReport`: The reporting period covered by the filing.
- `documentFormatFiles`: An array of primary files associated with the filing, including the filing itself and additional exhibits or documents.
- `dataFiles`: A list of data files attached to the filing, such as XBRL files.
These are just a few key parameters of the metadata. For a comprehensive list of available response parameters per filing, refer to the Query API documentation.
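Before building the full download pipeline, you can try the Query API with a single request. The following minimal sketch (not part of the original tutorial flow, and assuming `API_KEY` from above) fetches the most recently filed 10-K and prints a few of the fields listed above:

from sec_api import QueryApi

queryApi = QueryApi(api_key=API_KEY)

# fetch the single most recently filed original 10-K
query = {
    "query": 'formType:"10-K" AND NOT formType:("10-K/A", NT)',
    "from": 0,
    "size": 1,
    "sort": [{"filedAt": {"order": "desc"}}]
}

response = queryApi.get_filings(query)
filing = response['filings'][0]

print(filing['formType'], '-', filing['companyName'])
print('Filed at:', filing['filedAt'])
print('HTML filing:', filing['linkToFilingDetails'])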
Create a List of 10-K URLs on EDGAR
To create a comprehensive list of URLs for all 10-K filings on EDGAR, we need to handle pagination, use date range filters, and iterate month by month to avoid hitting the maximum response limit of 10,000 filings per search universe.
The code provided demonstrates how to achieve this. Here are the key components explained:
- The `get_10K_metadata()` function takes `start_year` and `end_year` parameters to define the range of years for which we want to retrieve the metadata.
- Within the `get_10K_metadata()` function, a nested loop iterates over the specified years and months. For each month, a Lucene query is constructed using the `date_range_filter` and `form_type_filter` variables. The `date_range_filter` ensures that only filings filed within the specified month are included, while the `form_type_filter` excludes amended filings (10-K/A) and notifications (NT).
- The `query_from` and `query_size` variables are initialized to handle pagination. The `query_from` parameter represents the offset, or starting position, in the search results, and the `query_size` parameter determines the number of filings to retrieve per request.
- The `while True` loop ensures that all filings are fetched by incrementing the `query_from` value and retrieving the next set of filings until no more matches are returned.
- The metadata of each filing is extracted and stored in a dataframe. Any entries without a ticker symbol are removed, and the `standardize_filing_url()` function strips the `ix?doc=/` part from URLs that link to the iXBRL reader instead of the original HTML filing.
- The resulting dataframe is appended to the `frames` list, and the number of downloaded metadata objects is tracked.
- After iterating through all specified years and months, the `frames` are concatenated into a single dataframe called `result`.
from sec_api import QueryApi
import pandas as pd

queryApi = QueryApi(api_key=API_KEY)

def standardize_filing_url(url):
    # remove the iXBRL viewer prefix so the URL points to the raw HTML filing,
    # e.g. .../ix?doc=/Archives/edgar/... -> .../Archives/edgar/...
    return url.replace('ix?doc=/', '')
def get_10K_metadata(start_year=2021, end_year=2022):
    frames = []

    for year in range(start_year, end_year + 1):
        number_of_objects_downloaded = 0

        for month in range(1, 13):
            padded_month = str(month).zfill(2)  # "1" -> "01"
            date_range_filter = f'filedAt:[{year}-{padded_month}-01 TO {year}-{padded_month}-31]'
            form_type_filter = 'formType:"10-K" AND NOT formType:("10-K/A", NT)'
            lucene_query = date_range_filter + ' AND ' + form_type_filter

            # pagination: start at offset 0, fetch 200 filings per request
            query_from = 0
            query_size = 200

            while True:
                query = {
                    "query": lucene_query,
                    "from": query_from,
                    "size": query_size,
                    "sort": [{"filedAt": {"order": "desc"}}]
                }

                response = queryApi.get_filings(query)
                filings = response['filings']

                if len(filings) == 0:
                    break
                else:
                    query_from += query_size

                metadata = list(map(lambda f: {'ticker': f['ticker'],
                                               'cik': f['cik'],
                                               'formType': f['formType'],
                                               'filedAt': f['filedAt'],
                                               'filingUrl': f['linkToFilingDetails']}, filings))

                df = pd.DataFrame.from_records(metadata)
                # remove all entries without a ticker symbol
                df = df[df['ticker'].str.len() > 0]
                df['filingUrl'] = df['filingUrl'].apply(standardize_filing_url)

                frames.append(df)
                number_of_objects_downloaded += len(df)

        print(f'✅ Downloaded {number_of_objects_downloaded} metadata objects for year {year}')

    result = pd.concat(frames)
    print(f'✅ Download completed. Metadata downloaded for {len(result)} filings.')

    return result
metadata_10K = get_10K_metadata(start_year=2020, end_year=2022)
✅ Downloaded 5019 metadata objects for year 2020
✅ Downloaded 5890 metadata objects for year 2021
✅ Downloaded 6454 metadata objects for year 2022
✅ Download completed. Metadata downloaded for 17363 filings.
metadata_10K
| | ticker | cik | formType | filedAt | filingUrl |
|---|---|---|---|---|---|
0 | DOMH | 12239 | 10-K | 2020-01-31T18:42:32-05:00 | https://www.sec.gov/Archives/edgar/data/12239/... |
1 | SCRH | 831489 | 10-K | 2020-01-31T17:25:50-05:00 | https://www.sec.gov/Archives/edgar/data/831489... |
2 | EBAY | 1065088 | 10-K | 2020-01-31T16:53:51-05:00 | https://www.sec.gov/Archives/edgar/data/106508... |
4 | BA | 12927 | 10-K | 2020-01-31T13:23:40-05:00 | https://www.sec.gov/Archives/edgar/data/12927/... |
5 | NOBH | 72205 | 10-K | 2020-01-31T11:54:47-05:00 | https://www.sec.gov/Archives/edgar/data/72205/... |
... | ... | ... | ... | ... | ... |
154 | TGL | 1905956 | 10-K | 2022-12-05T16:38:57-05:00 | https://www.sec.gov/Archives/edgar/data/190595... |
155 | DLHC | 785557 | 10-K | 2022-12-05T16:16:18-05:00 | https://www.sec.gov/Archives/edgar/data/785557... |
156 | VERU | 863894 | 10-K | 2022-12-05T15:23:56-05:00 | https://www.sec.gov/Archives/edgar/data/863894... |
157 | MCLE | 1827855 | 10-K | 2022-12-02T16:27:58-05:00 | https://www.sec.gov/Archives/edgar/data/182785... |
159 | RGCO | 1069533 | 10-K | 2022-12-02T14:47:39-05:00 | https://www.sec.gov/Archives/edgar/data/106953... |
17363 rows × 5 columns
You can save the entire list of URLs of all 10-K filings to a CSV file named `metadata_10K.csv` using the following command:
metadata_10K.to_csv('metadata_10K.csv', index=False)
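In a later session, you can reload the saved metadata from disk instead of re-querying the API, for example:

import pandas as pd

# reload the previously saved metadata without hitting the Query API again
metadata_10K = pd.read_csv('metadata_10K.csv')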
Let's inspect the downloaded metadata by displaying all 10-K filings filed by Apple. We expect to see three filings, and we can verify this by executing the following code:
metadata_10K[metadata_10K['ticker'] == 'AAPL']
| | ticker | cik | formType | filedAt | filingUrl |
|---|---|---|---|---|---|
4 | AAPL | 320193 | 10-K | 2020-10-29T18:06:25-04:00 | https://www.sec.gov/Archives/edgar/data/320193... |
12 | AAPL | 320193 | 10-K | 2021-10-28T18:04:28-04:00 | https://www.sec.gov/Archives/edgar/data/320193... |
28 | AAPL | 320193 | 10-K | 2022-10-27T18:01:14-04:00 | https://www.sec.gov/Archives/edgar/data/320193... |
Create a List of 10-Q URLs on EDGAR
To create a list of URLs for all 10-Q filings on EDGAR, you can update the `form_type_filter` in the `get_10K_metadata(start_year, end_year)` function to include the desired form type. The resulting Lucene search query looks like this:
formType:"10-Q" AND NOT formType:("10-Q/A", NT)
You can rename the `get_10K_metadata` function to `get_10Q_metadata` to reflect the change in form type. The rest of the function remains the same.
def get_10Q_metadata(start_year=2021, end_year=2022):
    frames = []

    for year in range(start_year, end_year + 1):
        number_of_objects_downloaded = 0

        for month in range(1, 13):
            padded_month = str(month).zfill(2)  # "1" -> "01"
            date_range_filter = f'filedAt:[{year}-{padded_month}-01 TO {year}-{padded_month}-31]'
            form_type_filter = 'formType:"10-Q" AND NOT formType:("10-Q/A", NT)'
            lucene_query = date_range_filter + ' AND ' + form_type_filter

            # pagination: start at offset 0, fetch 200 filings per request
            query_from = 0
            query_size = 200

            while True:
                query = {
                    "query": lucene_query,
                    "from": query_from,
                    "size": query_size,
                    "sort": [{"filedAt": {"order": "desc"}}]
                }

                response = queryApi.get_filings(query)
                filings = response['filings']

                if len(filings) == 0:
                    break
                else:
                    query_from += query_size

                metadata = list(map(lambda f: {'ticker': f['ticker'],
                                               'cik': f['cik'],
                                               'formType': f['formType'],
                                               'filedAt': f['filedAt'],
                                               'filingUrl': f['linkToFilingDetails']}, filings))

                df = pd.DataFrame.from_records(metadata)
                # remove all entries without a ticker symbol
                df = df[df['ticker'].str.len() > 0]
                df['filingUrl'] = df['filingUrl'].apply(standardize_filing_url)

                frames.append(df)
                number_of_objects_downloaded += len(df)

        print(f'✅ Downloaded {number_of_objects_downloaded} metadata objects for year {year}')

    result = pd.concat(frames)
    print(f'✅ Download completed. Metadata downloaded for {len(result)} filings.')

    return result
metadata_10Q = get_10Q_metadata(start_year=2020, end_year=2020)
✅ Downloaded 15638 metadata objects for year 2020
✅ Download completed. Metadata downloaded for 15638 filings.
metadata_10Q
| | ticker | cik | formType | filedAt | filingUrl |
|---|---|---|---|---|---|
1 | SOBR | 1425627 | 10-Q | 2020-01-31T17:38:31-05:00 | https://www.sec.gov/Archives/edgar/data/142562... |
2 | BTTR | 1471727 | 10-Q | 2020-01-31T17:19:14-05:00 | https://www.sec.gov/Archives/edgar/data/147172... |
3 | KOSS | 56701 | 10-Q | 2020-01-31T16:37:01-05:00 | https://www.sec.gov/Archives/edgar/data/56701/... |
4 | FLEX | 866374 | 10-Q | 2020-01-31T16:24:59-05:00 | https://www.sec.gov/Archives/edgar/data/866374... |
5 | CVCO | 278166 | 10-Q | 2020-01-31T16:21:17-05:00 | https://www.sec.gov/Archives/edgar/data/278166... |
... | ... | ... | ... | ... | ... |
181 | HOME | 1646228 | 10-Q | 2020-12-02T06:29:18-05:00 | https://www.sec.gov/Archives/edgar/data/164622... |
182 | KDCE | 1049011 | 10-Q | 2020-12-01T19:18:45-05:00 | https://www.sec.gov/Archives/edgar/data/104901... |
183 | NIHK | 1084475 | 10-Q | 2020-12-01T14:09:58-05:00 | https://www.sec.gov/Archives/edgar/data/108447... |
184 | TJX | 109198 | 10-Q | 2020-12-01T11:19:23-05:00 | https://www.sec.gov/Archives/edgar/data/109198... |
185 | GSGG | 1668523 | 10-Q | 2020-12-01T11:11:48-05:00 | https://www.sec.gov/Archives/edgar/data/166852... |
15638 rows × 5 columns
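Since the two functions differ only in the form type, you could also factor the form type out into a parameter instead of duplicating the code. Below is a minimal sketch of such a helper; the `get_metadata` function is not part of the original tutorial, and it assumes `queryApi`, `pd`, and `standardize_filing_url` from the cells above:

def get_metadata(form_type, start_year=2021, end_year=2022):
    # download filing metadata for any form type, e.g. '10-K' or '10-Q'
    frames = []

    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            padded_month = str(month).zfill(2)
            lucene_query = (
                f'filedAt:[{year}-{padded_month}-01 TO {year}-{padded_month}-31]'
                f' AND formType:"{form_type}" AND NOT formType:("{form_type}/A", NT)'
            )

            query_from = 0
            query_size = 200

            while True:
                query = {
                    "query": lucene_query,
                    "from": query_from,
                    "size": query_size,
                    "sort": [{"filedAt": {"order": "desc"}}]
                }

                filings = queryApi.get_filings(query)['filings']
                if len(filings) == 0:
                    break
                query_from += query_size

                df = pd.DataFrame.from_records([{'ticker': f['ticker'],
                                                 'cik': f['cik'],
                                                 'formType': f['formType'],
                                                 'filedAt': f['filedAt'],
                                                 'filingUrl': standardize_filing_url(f['linkToFilingDetails'])}
                                                for f in filings])
                # keep only filings with a ticker symbol
                frames.append(df[df['ticker'].str.len() > 0])

    return pd.concat(frames)

# for example:
# metadata_10Q = get_metadata('10-Q', start_year=2020, end_year=2020)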
Download EDGAR Filings to Disk
In this final step, we will create a `download_filing(metadata)` function that uses the `get_filing(filing_url)` method of the `RenderApi` class to download the content of the filings. Each filing will be saved in a folder named after the corresponding ticker. The file name will include the filing date, form type, and the original name of the file on EDGAR.
The resulting folder structure follows the pattern `./filings/<ticker>/<filing date>_<form type>_<original file name>`, as illustrated at the beginning of this tutorial.
To speed up the download process, we will utilize the `pandarallel` package, which allows us to apply the `download_filing` function to multiple rows in parallel. By specifying `number_of_workers`, we can control the number of workers running in parallel. It's important to note that setting a high number of workers may lead to rate-limit issues with the Render API, so it's recommended to choose a reasonable value.
from sec_api import RenderApi
import os

renderApi = RenderApi(api_key=API_KEY)

def download_filing(metadata):
    ticker = metadata['ticker']
    url = metadata['filingUrl']

    try:
        new_folder = './filings/' + ticker
        date = metadata['filedAt'][:10]
        # e.g. "2020-10-29_10-K_aapl-20200926.htm"
        file_name = date + '_' + metadata['formType'] + '_' + url.split('/')[-1]

        # exist_ok avoids a race when parallel workers create the same folder
        os.makedirs(new_folder, exist_ok=True)

        file_content = renderApi.get_filing(url)

        with open(new_folder + '/' + file_name, 'w') as f:
            f.write(file_content)
    except Exception:
        print(f'❌ {ticker}: download failed: {url}')
download_filing(metadata_10K.iloc[0])
print('✅ Sample 10-K filing downloaded for {}'.format(metadata_10K.iloc[0]['ticker']))
✅ Sample 10-K filing downloaded for DOMH
!pip install -q pandarallel
from pandarallel import pandarallel
number_of_workers = 4
pandarallel.initialize(progress_bar=True, nb_workers=number_of_workers, verbose=0)
# run a quick sample and download 500 filings
sample = metadata_10K.sort_values('ticker').head(500)
sample.parallel_apply(download_filing, axis=1)
# download all filings
# metadata_10K.parallel_apply(download_filing, axis=1)
print('✅ Download completed')
✅ Download completed
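As a quick sanity check, you can count the number of files written to disk; a small sketch, assuming the `./filings` folder layout created above:

import os

# count all downloaded files under ./filings
total = sum(len(files) for _, _, files in os.walk('./filings'))
print(f'{total} filings on disk')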
Download EDGAR Filings as PDFs
In this section, we demonstrate how to download the 10-K filings as PDF files using the Render API, which can convert HTML filings into PDF versions.
The code uses the `requests` library to make a GET request to the PDF Generator API (`PDF_GENERATOR_API`) with the appropriate parameters. The API converts the HTML filing into a PDF file, which is then streamed and saved to the specified folder location.
Note that `response.raise_for_status()` checks the status of the API response and raises an exception if an error occurs during the request. This helps handle any potential errors during the download process.
The code example demonstrates downloading a sample of 10 filings in parallel. You can adjust the number of filings to download by modifying the `sample2` dataframe.
import requests

PDF_GENERATOR_API = 'https://api.sec-api.io/filing-reader'

def download_pdf(metadata):
    ticker = metadata['ticker']
    filing_url = metadata['filingUrl']

    try:
        new_folder = './filings/' + ticker
        date = metadata['filedAt'][:10]
        file_name = date + '_' + metadata['formType'] + '_' + filing_url.split('/')[-1] + '.pdf'

        # exist_ok avoids a race when parallel workers create the same folder
        os.makedirs(new_folder, exist_ok=True)

        api_url = f'{PDF_GENERATOR_API}?token={API_KEY}&type=pdf&url={filing_url}'
        response = requests.get(api_url, stream=True)
        response.raise_for_status()

        # stream the PDF to disk in 8 KB chunks
        with open(new_folder + '/' + file_name, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
    except Exception:
        print(f'❌ {ticker}: download failed: {filing_url}')
sample2 = metadata_10K.sort_values('ticker').head(10)
sample2.parallel_apply(download_pdf, axis=1)
# download all filings as PDFs
# metadata_10K.parallel_apply(download_pdf, axis=1)
print('✅ Download completed')
✅ Download completed