Extract Section Items From SEC Filings With Python
In this guide, we'll explore how you can extract sections from 10-K, 10-Q and 8-K SEC filings using the Extractor API in Python.
Quick Start
This ready-to-execute example demonstrates how to extract text and HTML sections from SEC filings, including 10-K, 10-Q, and 8-K forms, using the .get_section(filing_url, item_id, return_type) method of the ExtractorApi class in the sec-api Python package. The example covers extracting both HTML and text sections for the following items:
- 10-K, Item 1.A: Risk Factors
- 10-K, Item 7: Management’s Discussion and Analysis (MD&A)
- 10-Q, Part 2, Item 1.A: Risk Factors
- 10-Q, Part 1, Item 2: MD&A
- 8-K, Item 1.01: Entry into a Material Definitive Agreement
- 8-K, Item 4.01: Changes in Registrant’s Certifying Accountant
!pip install sec-api
from sec_api import ExtractorApi
extractorApi = ExtractorApi("YOUR_API_KEY")
# 10-K example
url_10k = "https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm"
# extract Item 1.A Risk Factors from 10-K filing in text format
item_1A_text = extractorApi.get_section(url_10k, "1A", "text")
# extract Item 7 "MD&A" from 10-K filing in html format
item_7_html = extractorApi.get_section(url_10k, "7", "html")
# 10-Q example
url_10q = "https://www.sec.gov/Archives/edgar/data/1318605/000095017022006034/tsla-20220331.htm"
# extract Part II Item 1A Risk Factors from 10-Q filing in text format
part2_item_1A_text = extractorApi.get_section(url_10q, "part2item1a", "text")
# extract Part I Item 2 "MD&A" from 10-Q filing in html format
part1_item_2_html = extractorApi.get_section(url_10q, "part1item2", "html")
# 8-K example
url_8k = "https://www.sec.gov/Archives/edgar/data/66600/000149315222016468/form8-k.htm"
# extract Item 1.01 Entry into a Material Definitive Agreement from 8-K filing in text format
item_1_1_text = extractorApi.get_section(url_8k, "1-1", "text")
# extract Item 4.01 Changes in Registrant's Certifying Accountant from 8-K filing in html format
item_4_1_html = extractorApi.get_section(url_8k, "4-1", "html")
item_ids_10K = [
"1", "1A", "1B", "1C", "2", "3", "4",
"5", "6", "7", "7A", "8", "9", "9A", "9B",
"10", "11", "12", "13", "14", "15"
]
item_ids_10Q = [
# Part 1
"part1item1", "part1item2", "part1item3", "part1item4",
# Part 2
"part2item1", "part2item1a", "part2item2", "part2item3",
"part2item4", "part2item5", "part2item6"
]
item_ids_8K = [
# Item 1.x
"1-1", "1-2", "1-3", "1-4", "1-5",
# Item 2.x
"2-1", "2-2", "2-3", "2-4", "2-5", "2-6",
# Item 3.x
"3-1", "3-2", "3-3",
# Item 4.x
"4-1", "4-2",
# Item 5.x
"5-1", "5-2", "5-3", "5-4", "5-5", "5-6", "5-7", "5-8",
# Item 6.x
"6-1", "6-2", "6-3", "6-4", "6-5", "6-6", "6-10",
# Item 7.x
"7-1",
# Item 8.x
"8-1",
# Item 9.x
"9-1",
# Miscellaneous
"signature"
]
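The item ID lists above can also be used to catch a wrong item code before a request is sent. The following is a minimal sketch: the names ITEM_IDS, to_8k_item_code and is_supported are hypothetical helpers, not part of the sec-api package, and the conversion rule ("1.01" becomes "1-1") is inferred from the examples in this guide.

```python
# Item codes copied from the lists above, grouped by form type.
# ITEM_IDS, to_8k_item_code and is_supported are hypothetical helper names.
ITEM_IDS = {
    "10-K": {
        "1", "1A", "1B", "1C", "2", "3", "4", "5", "6", "7", "7A",
        "8", "9", "9A", "9B", "10", "11", "12", "13", "14", "15",
    },
    "10-Q": {
        "part1item1", "part1item2", "part1item3", "part1item4",
        "part2item1", "part2item1a", "part2item2", "part2item3",
        "part2item4", "part2item5", "part2item6",
    },
    "8-K": {
        "1-1", "1-2", "1-3", "1-4", "1-5",
        "2-1", "2-2", "2-3", "2-4", "2-5", "2-6",
        "3-1", "3-2", "3-3", "4-1", "4-2",
        "5-1", "5-2", "5-3", "5-4", "5-5", "5-6", "5-7", "5-8",
        "6-1", "6-2", "6-3", "6-4", "6-5", "6-6", "6-10",
        "7-1", "8-1", "9-1", "signature",
    },
}


def to_8k_item_code(item_number):
    """Convert an 8-K item number such as "1.01" into an item code such as "1-1"."""
    section, _, sub_item = item_number.strip().partition(".")
    # int() drops the leading zero of the sub-item ("01" -> 1)
    return f"{int(section)}-{int(sub_item)}"


def is_supported(form_type, item_code):
    """Return True if item_code appears in the list for the given form type."""
    return item_code in ITEM_IDS.get(form_type, set())


print(to_8k_item_code("4.01"))
print(is_supported("10-Q", "part2item1a"))
```

Checking a code with is_supported before calling get_section avoids spending a request on an item that does not exist for that form type.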
Extract Item Sections from 10-K Filings
The following Python code extracts specific sections from Tesla’s 10-K filing: Item 1.A "Risk Factors" as plain text (without HTML tags) and Item 7 "MD&A" as original HTML. By specifying the filing URL, item ID, and desired return type ("text" or "html"), the Extractor API returns the requested section content.
!pip install sec-api
from sec_api import ExtractorApi
extractorApi = ExtractorApi("YOUR_API_KEY")
# Tesla 10-K filing
filing_url = "https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm"
# get the standardized and cleaned text of section 1A "Risk Factors"
section_text = extractorApi.get_section(filing_url, "1A", "text")
# get the original HTML of section 7
# "Management’s Discussion and Analysis of Financial Condition and Results of Operations"
section_html = extractorApi.get_section(filing_url, "7", "html")
The first 1,000 characters of the section_text response for Item 1A (Risk Factors) show the cleaned and standardized text version of the extracted item. Newline characters (\n) and special character entities are preserved, allowing algorithms to easily identify lists and headings.
print("Tesla 10-K Risk Factors Section:")
print("--------------------------------")
print(section_text[:1000] + '...')
Tesla 10-K Risk Factors Section:
--------------------------------
ITEM 1A. RISK FACTORS
You should carefully consider the risks described below together with the other information set forth in this report, which could materially affect our business, financial condition and future results. The risks described below are not the only risks facing our company. Risks and uncertainties not currently known to us or that we currently deem to be immaterial also may materially adversely affect our business, financial condition and operating results.
Risks Related to Our Ability to Grow Our Business
We may be impacted by macroeconomic conditions resulting from the global COVID-19 pandemic.
Since the first quarter of 2020, there has been a worldwide impact from the COVID-19 pandemic. Government regulations and shifting social behaviors have limited or closed non-essential transportation, government functions, business activities and person-to-person interactions. In some cases, the relaxation of such trends has recently been followed by actual or contempla...
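Because newline characters are preserved, the text can be split into paragraphs and scanned for likely headings. The sketch below is only a rough heuristic under stated assumptions: split_paragraphs and guess_headings are hypothetical helpers, and treating short lines without closing punctuation as headings is an assumption that will miss headings written as full sentences.

```python
def split_paragraphs(section_text):
    """Split the section on newline characters and drop empty lines."""
    return [line.strip() for line in section_text.split("\n") if line.strip()]


def guess_headings(paragraphs, max_length=100):
    """Heuristic (an assumption, not an API feature): a heading is a short
    line that does not end with sentence punctuation."""
    return [
        p for p in paragraphs
        if len(p) <= max_length and not p.endswith((".", ":", ";"))
    ]


sample = (
    "ITEM 1A. RISK FACTORS\n\n"
    "You should carefully consider the risks described below.\n\n"
    "Risks Related to Our Ability to Grow Our Business\n\n"
    "We may be impacted by macroeconomic conditions.\n"
)
for heading in guess_headings(split_paragraphs(sample)):
    print(heading)
```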
To inspect the extracted HTML version of the MD&A section (Item 7), use the display and HTML functions from IPython.display as shown below. For brevity, only about the first 2,000 characters of the extracted HTML content are displayed.
To ensure correct rendering when viewing the notebook here, the HTML section is prepended with <div><table><tr><td> tags. This step is not necessary when running the notebook locally.
from IPython.display import display, HTML
display(HTML("<div><table><tr><td>" + section_html[0:2034]))
ITEM 7. | MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS |
The following discussion and analysis should be read in conjunction with the consolidated financial statements and the related notes included elsewhere in this Annual Report on Form 10-K. For discussion related to changes in financial condition and the results of operations for fiscal year 2018-related items, refer to Part II, Item 7. Management's Discussion and Analysis of Financial Condition and Results of Operations in our Annual Report on Form 10-K for fiscal year 2019, which was filed with the Securities and Exchange Commission on February 13, 2020.
Overview and 2020 Highlights
Our mission is to accelerate the world’s transition to sustainable energy. We design, develop, manufacture, lease and sell high-performance fully electric vehicles, solar energy generation systems and energy storage products. We also offer maintenance, installation, operation, financial and other services related to our products.
The URL of the text version of the filing, which ends in .txt, is also supported as an alternative to the HTML version ending in .htm.
# txt version of Tesla's 10-K filing
filing_url = "https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/0001564590-21-004599.txt"
section_text = extractorApi.get_section(filing_url, "1A", "text")
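The two Tesla URLs used above suggest how the .txt URL relates to the .htm URL: the 18-digit folder name is the accession number, and the .txt file is named after it with dashes inserted. The sketch below derives one from the other; to_txt_url is a hypothetical helper, and the pattern is an assumption inferred from this example pair rather than something the Extractor API requires.

```python
from urllib.parse import urlsplit, urlunsplit


def to_txt_url(htm_url):
    """Derive the .txt filing URL from an EDGAR .htm document URL.

    Assumes the path looks like .../data/<cik>/<18-digit accession>/<file>.htm
    """
    parts = urlsplit(htm_url)
    segments = parts.path.split("/")
    accession = segments[-2]
    if len(accession) != 18 or not accession.isdigit():
        raise ValueError(f"unexpected accession folder: {accession!r}")
    # 000156459021004599 -> 0001564590-21-004599
    dashed = f"{accession[:10]}-{accession[10:12]}-{accession[12:]}"
    segments[-1] = f"{dashed}.txt"
    return urlunsplit(parts._replace(path="/".join(segments)))


print(to_txt_url(
    "https://www.sec.gov/Archives/edgar/data/1318605/"
    "000156459021004599/tsla-10k_20201231.htm"
))
```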
Extract Item Sections from 10-Q Filings
Extracting sections from 10-Q filings uses the same boilerplate code as for 10-K filings; the only differences are the 10-Q filing URL and the item code.
Since the structure of 10-Q filings differs from that of 10-Ks, it’s essential to use the item codes specific to each filing type. For example, to extract Item 1A (Risk Factors) from Part 2 of a 10-Q filing, the item code is part2item1a. In contrast, extracting the same section from a 10-K filing requires the item code 1A.
The example below demonstrates this approach for a 10-Q filing.
from sec_api import ExtractorApi
extractorApi = ExtractorApi("YOUR_API_KEY")
# Tesla 10-Q filing
filing_url = "https://www.sec.gov/Archives/edgar/data/1318605/000095017022006034/tsla-20220331.htm"
# extract section 1A "Risk Factors" in part 2 as cleaned text
section_text = extractorApi.get_section(filing_url, "part2item1a", "text")
print('Tesla 10-Q filing section 1A "Risk Factors" in part 2 as cleaned text:')
print('---------------------------------------------------------------------')
print(section_text[:1000] + '...')
Tesla 10-Q filing section 1A "Risk Factors" in part 2 as cleaned text:
---------------------------------------------------------------------
ITEM 1A.RISK FACTORS
You should carefully consider the risks described below together with the other information set forth in this report, which could materially affect our business, financial condition and future results. The risks described below are not the only risks facing our company. Risks and uncertainties not currently known to us or that we currently deem to be immaterial also may materially adversely affect our business, financial condition and operating results.
Risks Related to Our Ability to Grow Our Business
We may be impacted by macroeconomic conditions resulting from the global COVID-19 pandemic.
Since the first quarter of 2020, there has been a worldwide impact from the COVID-19 pandemic. Government regulations and shifting social behaviors have limited or closed non-essential transportation, government functions, business activities and person-to-person interactions. In some cases, the relaxation of such trends has been followed by actual or contemplated ret...
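The difference between 10-K and 10-Q item codes described above can be captured in a small lookup. This is a sketch only: build_10q_item_code and RISK_FACTORS_CODE are hypothetical names, and the "part{n}item{x}" pattern is inferred from the item codes shown in this guide.

```python
def build_10q_item_code(part, item):
    """Build a 10-Q item code such as "part2item1a" from a part number and item."""
    if part not in (1, 2):
        raise ValueError("a 10-Q filing has Part 1 and Part 2 only")
    return f"part{part}item{item.strip().lower()}"


# The same section has a different item code depending on the form type.
RISK_FACTORS_CODE = {
    "10-K": "1A",
    "10-Q": build_10q_item_code(2, "1A"),
}

for form_type, code in RISK_FACTORS_CODE.items():
    print(form_type, code)
```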
Extract Item Sections from 8-K Filings
Extracting sections from 8-K filings is straightforward and can be accomplished with just a few lines of Python code. The filing_url parameter accepts both .htm and .txt versions of a filing. A complete list of supported item codes is available in the Request & Response section of the documentation.
The example below demonstrates how to extract Item 1.01, "Entry into a Material Definitive Agreement," as cleaned text:
from sec_api import ExtractorApi
extractorApi = ExtractorApi("YOUR_API_KEY")
filing_url = "https://www.sec.gov/Archives/edgar/data/66600/000149315222016468/form8-k.htm"
# extract section 1.01 "Entry into Material Definitive Agreement" as cleaned text
section_text = extractorApi.get_section(filing_url, "1-1", "text")
print("Section 1.01 text:")
print("------------------")
print(section_text)
Section 1.01 text:
------------------
Item 1.01 Entry into a Material Definitive Agreement.
 
Quad M Solutions, Inc., an Idaho corporation, (the “Company” or “Quad M”), is a public holding company that offers staffing services and employee benefits, such as health plans, HR-human resources, and payroll services, to small and mid-sized group employers. The Company is filing this Current Report on Form 8-K to disclose recent material events, including the Company’s entry into a material agreements, through its wholly-owned subsidiary Physicians HealthCare Services LLC (“PHCS”), with Advent Health, a Florida-based clinically-integrated network that contracts with health care providers to provide certain Covered Services to Members (“Advent Health Participating Providers”) and has the ability to sign Payor contracts with Advent Health Participating Providers.
 
Through PHCS, the Company now has immediate access to approximately 10,000 employee/workers at the 2,000+ physician offices operated by Advent Health. These employees will be immediately eligible for health coverage through the self-funded plans operated by Quad M’s subsidiaries, Nuaxess and OpenAxess.
 
The Advent Health project was approved recently by the Company’s Board of Directors. Advent Health shares Quad M’s vision to form a strategic care program that seeks to provide quality, cost-effective Covered Services to persons enrolled in Nuaxess and OpenAxess.
 
The Agreements between the Company and PHCS and Advent are attached hereto as Exhibit 10.13 and 10.14, respectively.
 
SEC filings might contain HTML character entities, such as &#160;, &#8220;, and &amp;, which represent the non-breaking space, the left double quotation mark “, and the ampersand &, respectively. These entities are encoded in the HTML file and can be converted to plain text with the html.unescape function from Python's html module, which replaces each entity with its corresponding character: &#160; becomes a non-breaking space, &#8220; becomes a left double quotation mark “, and &amp; becomes an ampersand &.
import html
text = html.unescape(section_text)
print("Section 1.01 text after unescaping HTML character entities:")
print("-----------------------------------------------------------")
print(text.strip())
Section 1.01 text after unescaping HTML character entities:
-----------------------------------------------------------
Item 1.01 Entry into a Material Definitive Agreement.
Quad M Solutions, Inc., an Idaho corporation, (the “Company” or “Quad M”), is a public holding company that offers staffing services and employee benefits, such as health plans, HR-human resources, and payroll services, to small and mid-sized group employers. The Company is filing this Current Report on Form 8-K to disclose recent material events, including the Company’s entry into a material agreements, through its wholly-owned subsidiary Physicians HealthCare Services LLC (“PHCS”), with Advent Health, a Florida-based clinically-integrated network that contracts with health care providers to provide certain Covered Services to Members (“Advent Health Participating Providers”) and has the ability to sign Payor contracts with Advent Health Participating Providers.
Through PHCS, the Company now has immediate access to approximately 10,000 employee/workers at the 2,000+ physician offices operated by Advent Health. These employees will be immediately eligible for health coverage through the self-funded plans operated by Quad M’s subsidiaries, Nuaxess and OpenAxess.
The Advent Health project was approved recently by the Company’s Board of Directors. Advent Health shares Quad M’s vision to form a strategic care program that seeks to provide quality, cost-effective Covered Services to persons enrolled in Nuaxess and OpenAxess.
The Agreements between the Company and PHCS and Advent are attached hereto as Exhibit 10.13 and 10.14, respectively.
Extract and Download Sections from 10-K Filings Over Multiple Years
Begin by aggregating the URLs for all 10-K filings submitted over the last 10 years, from 2014 to 2023. For this, refer to the Query API boilerplate code example.
Next, iterate over the collected 10-K filing URLs and extract all sections from each filing using the extract_items_10k(filing_url) function. The multiprocessing library parallelizes this step by spawning worker processes (number_of_processes, set to 2 in the code below), each extracting all sections of one filing at a time. The extracted sections are immediately ready for further analysis; if analysis is planned for a later stage, consider saving the sections either locally or in a document database to avoid "out of memory" errors.
Repeat these steps to extract sections from 10-Q and 8-K filings. Ensure the items list is updated with the appropriate section codes specific to 10-Q and 8-K filings to match their respective structures.
Note: It is recommended to run this code as a Python script (e.g. python extract_sections.py) rather than in a Jupyter notebook to avoid memory issues.
from sec_api import ExtractorApi
import multiprocessing
extractorApi = ExtractorApi("YOUR_API_KEY")
# number of processes to run in parallel.
# each process will extract all items from a 10-K filing
# if you have a large number of URLs, you may want to increase this number
# to speed up the extraction process.
number_of_processes = 2
urls_10k = [
"https://www.sec.gov/Archives/edgar/data/815094/000156459019020329/abmd-10k_20190331.htm",
"https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm",
# add more URLs of 10-K filings here
]
def extract_items_10k(filing_url):
    items_10_K = [
        "1", "1A", "1B", "2", "3",
        "4", "5", "6", "7", "7A",
        "8", "9A", "9B", "10", "11",
        "12", "13", "14"
    ]
    for item in items_10_K:
        print(f"Extracting item {item} from 10-K filing {filing_url}")
        try:
            section_text = extractorApi.get_section(
                filing_url=filing_url, section=item, return_type="text"
            )
            # Process section_text as needed: save to disk, store in a database, or perform analytics.
            # IMPORTANT: Avoid holding a large number of sections in memory by appending them to a list,
            # as this can lead to out-of-memory issues. Instead, ensure that memory is freed regularly
            # by allowing garbage collection to manage unused objects.
        except Exception as e:
            print(e)
if __name__ == "__main__":
    with multiprocessing.Pool(number_of_processes) as pool:
        pool.map(extract_items_10k, urls_10k)
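One way to follow the advice about not keeping sections in memory is to write each section to its own file as soon as it is extracted. The sketch below is one possible layout, not a prescribed one: save_section and section_file_path are hypothetical helpers, and using the accession folder from the filing URL as the directory name is an assumption.

```python
from pathlib import Path


def section_file_path(output_dir, filing_url, item):
    """Build <output_dir>/<accession folder>/item_<item>.txt for a filing URL."""
    # the second-to-last URL segment is the filing's accession folder
    accession = filing_url.rstrip("/").split("/")[-2]
    return Path(output_dir) / accession / f"item_{item}.txt"


def save_section(output_dir, filing_url, item, section_text):
    """Write one extracted section to disk and return the file path."""
    path = section_file_path(output_dir, filing_url, item)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(section_text, encoding="utf-8")
    return path


example_url = (
    "https://www.sec.gov/Archives/edgar/data/815094/"
    "000156459019020329/abmd-10k_20190331.htm"
)
print(section_file_path("sections", example_url, "1A"))
```

Inside extract_items_10k, a call such as save_section("sections", filing_url, item, section_text) could replace the processing comment, so each worker process releases the text once it is written.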
Cleaning Extracted Sections: Removing Newline Characters and Decoding HTML Entities
When extracting sections with the Extractor API, either the HTML or the plain text version can be requested. The text versions often contain newline characters (\n) and HTML entities, such as &#160;. HTML entities are special character codes; for example, &#8220; represents the left double quotation mark (“).
To clean the extracted text, two main options are available:
- Remove All Newline Characters and HTML Entities: This option replaces newline characters and HTML entities with empty strings, effectively removing them. This approach is useful if you want to condense the text into a single line or strip out non-visible formatting characters.
import re
# Example: Removing newline characters and HTML entities
clean_text = re.sub(r"\n|&#\d+;", "", extracted_text)
- Decode HTML Entities to UTF-8 Characters: This option converts HTML entities into their readable UTF-8 characters, preserving special characters and symbols. This approach is beneficial if you want the text to retain its intended symbols and punctuation.
import html
# Example: Decoding HTML entities
readable_text = html.unescape(extracted_text)
Choosing an Approach
The choice between these options depends on the use case and the desired output format:
- Option 1: If you need a clean, unformatted string without special symbols or formatting, removing newline characters and HTML entities is ideal.
- Option 2: If you need the text to retain special characters for readability or further processing, decoding the HTML entities to UTF-8 is more appropriate.
# text with new line characters "\n" and HTML entities "&#160;", "&#8220;", "&#8221;"
extracted_section = (
"Item 1.01 Entry into a Material Definitive Agreement."
+ " \n\n&#160;\n\nQuad M Solutions, Inc., an Idaho corporation, "
+ "(the &#8220;Company&#8221; or &#8220;Quad M&#8221;),"
)
# the output of extracted_section includes "\n" and the HTML entities.
# "\n" is not actually converted into a new line here. we need to print()
# the string first to make Python convert "\n" into a line break.
extracted_section
'Item 1.01 Entry into a Material Definitive Agreement. \n\n&#160;\n\nQuad M Solutions, Inc., an Idaho corporation, (the &#8220;Company&#8221; or &#8220;Quad M&#8221;),'
# we don't see "\n" in the printed version anymore
# because the printer replaced "\n" with an actual line break
print(extracted_section)
Item 1.01 Entry into a Material Definitive Agreement.
&#160;
Quad M Solutions, Inc., an Idaho corporation, (the &#8220;Company&#8221; or &#8220;Quad M&#8221;),
# we use a regular expression to substitute new line characters and HTML entities
# with an empty string ""
import re
cleaned_section = re.sub(r"\n|&#[0-9]+;", "", extracted_section)
# "\n" and HTML entities are now removed
cleaned_section
'Item 1.01 Entry into a Material Definitive Agreement. Quad M Solutions, Inc., an Idaho corporation, (the Company or Quad M),'
print(cleaned_section)
Item 1.01 Entry into a Material Definitive Agreement. Quad M Solutions, Inc., an Idaho corporation, (the Company or Quad M),
# let's decode all HTML entities to their UTF-8 equivalents
# line breaks "\n" are kept
import html
import unicodedata
# all HTML entities are converted into human-readable characters
decoded_section = html.unescape(extracted_section)
# normalize compatibility characters: the non-breaking space "\xa0" becomes
# a regular space, while curly quotes such as "\u201d" are left unchanged
decoded_section = unicodedata.normalize("NFKC", decoded_section)
decoded_section
'Item 1.01 Entry into a Material Definitive Agreement. \n\n \n\nQuad M Solutions, Inc., an Idaho corporation, (the “Company” or “Quad M”),'
print(decoded_section)
Item 1.01 Entry into a Material Definitive Agreement.
Quad M Solutions, Inc., an Idaho corporation, (the “Company” or “Quad M”),
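The decoding steps shown above can be wrapped into a single function. The following is a minimal sketch: clean_section and its keep_newlines parameter are hypothetical, and collapsing repeated whitespace is an extra step added here on the assumption that it is wanted, not something the Extractor API does.

```python
import html
import re
import unicodedata


def clean_section(text, keep_newlines=True):
    """Decode HTML entities, normalize the text and tidy up whitespace."""
    decoded = html.unescape(text)                      # "&#8220;" -> left curly quote
    decoded = unicodedata.normalize("NFKC", decoded)   # "\xa0" -> regular space
    if not keep_newlines:
        # collapse every run of whitespace, including newlines, into one space
        return re.sub(r"\s+", " ", decoded).strip()
    # collapse spaces and tabs within each line, then drop empty lines
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in decoded.split("\n"))
    return "\n".join(line for line in lines if line)


sample = (
    "Item 1.01 Entry into a Material Definitive Agreement."
    " \n\n&#160;\n\nQuad M Solutions, Inc., an Idaho corporation, "
    "(the &#8220;Company&#8221; or &#8220;Quad M&#8221;),"
)
print(clean_section(sample))
print(clean_section(sample, keep_newlines=False))
```

With keep_newlines=True the paragraph structure survives for later splitting; with keep_newlines=False the section becomes a single line while the quotation marks remain, unlike the regular-expression approach that removes entities outright.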