
Download and Analyze SEC Comment Letters


On this page:

  • Introduction to SEC Comment Letters
    • EDGAR Form Types of SEC Comment Letters
    • When are SEC Comment Letters released to the public?
  • Search and Find SEC Comment Letters
    • Find UPLOAD/LETTER Filings
    • Find CORRESP Filings
    • Example LETTER Filing
    • Example CORRESP Filing
    • Find SEC Comment Letters
  • Download SEC Comment Letters
  • Analyze SEC Comment Letters
    • Number of SEC Comment Letters per EDGAR Form Type
    • Topic Modelling of SEC Comment Letters with BERTopic
      • Vocabulary Generation
      • Train BERTopic
  • References
    • SEC.gov Resources
    • Papers
    • Topic Modelling
    • Other Resources

                                In this tutorial, you will learn how to find and download SEC Comment Letters from EDGAR and perform analysis on these filings.

                                The tutorial consists of the following steps:

                                1. Retrieving SEC Comment Letter URLs: Utilize the Query API to search and gather all URLs of SEC Comment Letters available on EDGAR, creating a comprehensive list.
                                2. Downloading Comment Letters: Use the Render API to download all Comment Letters as .TXT files and save them to your local disk for further analysis.
                                3. Analysis: In this tutorial, we will demonstrate how to analyze the distribution of comment letters over EDGAR form types in the year 2021. Additionally, you will learn how to perform topic modeling using BERTopic.

                                Introduction to SEC Comment Letters

                                SEC Comment letters are official correspondence issued by the U.S. Securities and Exchange Commission (SEC) to companies in response to their filed disclosure documents, such as annual reports (Form 10-K), quarterly reports (Form 10-Q), and registration statements (Form S-1). These letters are part of the SEC's review and comment process, which aims to ensure compliance with securities regulations and promote transparent and accurate financial reporting. They provide valuable insights into the SEC's concerns, highlight areas of potential risk, and give investors a better understanding of the company's financial disclosures and accounting practices.

                                When the SEC reviews a company's disclosure documents, they may identify issues, concerns, or requests for additional information. These findings are communicated to the company through Comment letters, also known as SEC Staff Letters. The letters highlight specific areas where the SEC believes further clarification, disclosure, or revision is necessary.

                                SEC Comment letters typically address various aspects of the company's financial statements, accounting practices, disclosures, risk factors, legal compliance, internal controls, executive compensation, and other matters relevant to public disclosure. The letters are intended to help the company enhance the quality and accuracy of its public filings and ensure compliance with applicable securities laws and regulations.

Upon receiving a Comment letter, the company is required to respond to each comment, either by providing the requested information, clarifying existing disclosures, or justifying its position. The response is submitted either as an amendment or updated version of the original filing or as a separate letter to the SEC staff. The SEC may then review the company's response and provide further guidance or request additional revisions if necessary. The resolution of comments may involve multiple rounds of letters exchanged between the SEC staff and the filer.

Comment letters and company responses have been published on the EDGAR system since August 1, 2004. The SEC issues Comment letters for various EDGAR filings. The most common types of filings that may trigger SEC Comment letters include:

                                • Annual and Quarterly Reports (10-K, 10-Q, etc.): Companies are required to file annual and quarterly reports that provide a comprehensive overview of their financial performance, business operations, risk factors, and other relevant information. The SEC may issue Comment letters to seek clarification or request additional disclosures regarding the company's financial statements, accounting practices, significant transactions, and other areas of concern.
                                • Registration Statements (S-1, S-3, etc.): When a company plans to offer securities to the public, it must file a registration statement with the SEC. These filings provide detailed information about the company, its business, and the securities being offered. The SEC may issue Comment letters during the registration review process to ensure compliance with disclosure requirements and address any concerns or deficiencies identified in the filing.
                                • Proxy Statements (DEF 14A, etc.): Proxy statements are filed when a company seeks shareholder approval on matters such as elections of directors, executive compensation, and mergers. The SEC may issue Comment letters to seek clarification or request revisions regarding the information provided in the proxy statement to ensure it is accurate and complete.
• Offering Statements (Form 1-A, Form 1-K, etc.): Companies conducting offerings under Regulation A must file offering statements and related reports with the SEC, which are also subject to staff review and comment.
                                • Others (8-K, 13F, Form 4, 14D-9, etc.): The SEC may issue Comment letters for various other filings made by companies, such as current reports (Form 8-K), periodic reports by institutional investment managers (Form 13F), reports by insiders (Form 4), and filings related to mergers and acquisitions (Schedule 14D-9, Schedule TO, etc.).

                                It's important to note that the SEC's review and comment process is not a guarantee of the accuracy or quality of a company's financial statements. It is primarily a regulatory oversight mechanism aimed at promoting transparency and maintaining fair and efficient markets.

                                EDGAR Form Types of SEC Comment Letters

                                There are two EDGAR form types associated with SEC Comment letters and responses: UPLOAD and CORRESP.

                                The UPLOAD form type is used for SEC-originated letters sent to filers, while the CORRESP form type is used for filer response letters to SEC staff that are not part of amended filings.

                                There is sometimes confusion between the UPLOAD and LETTER types, but they actually refer to the same thing. The UPLOAD form type represents the type of the form, while LETTER is the name of the file type that is uploaded to EDGAR. Therefore, when searching for SEC Comment letters, it is recommended to look for the file type LETTER instead of the UPLOAD form type.

                                When are SEC Comment Letters released to the public?

The SEC releases Comment letters and responses no earlier than 20 business days following the completion of a filing review. Prior to January 1, 2012, the stated policy was to release the correspondence no earlier than 45 days after the review of the disclosure filing was complete.

                                Search and Find SEC Comment Letters

                                The Query API provides a straightforward way to search and find SEC Comment letters and responses by using the Lucene search syntax. By specifying the search expression, such as formType:CORRESP, the API retrieves the metadata of the matching filings, including details about the filer (CIK, ticker, filer name, etc.) and the filing itself (filing URL, filing date, filing type, exhibit URLs, etc.).
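As a quick illustration before the full walkthrough below, the following minimal sketch (assuming the sec-api Python package and a valid API key) retrieves the most recent CORRESP filings and prints a few metadata fields for each match:

from sec_api import QueryApi

queryApi = QueryApi(api_key='YOUR_API_KEY')

# search for the most recent filer response letters (CORRESP)
query = {
  "query": "formType:CORRESP",
  "from": "0",
  "size": "10",
  "sort": [{ "filedAt": { "order": "desc" } }]
}

response = queryApi.get_filings(query)

# print the filer name, filing date and filing URL of each match
for filing in response['filings']:
  print(filing['companyName'], filing['filedAt'], filing['linkToFilingDetails'])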

                                Find UPLOAD/LETTER Filings

                                To locate all SEC Comment letters of the UPLOAD form type, you can use the following search expression:

                                formType:LETTER

As explained above, UPLOAD is the EDGAR form type while LETTER is the file type uploaded to EDGAR; both refer to the same correspondence, which is why the search expression targets the LETTER file type rather than the UPLOAD form type.

                                When analyzing UPLOAD filings, there is no need to perform OCR (Optical Character Recognition) on the PDF files. Instead, you can directly download the attached .TXT file of the filing, which contains the complete content of the PDF in text format. This allows for easier text analysis without the need for additional conversion steps.
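For instance, the .TXT version of an UPLOAD filing can be fetched directly from its linkToTxt URL. The sketch below uses the Render API, which is also used in the download section later in this tutorial; the URL shown is only a placeholder:

from sec_api import RenderApi

renderApi = RenderApi(api_key='YOUR_API_KEY')

# placeholder URL of an UPLOAD filing's .TXT attachment, taken from the
# linkToTxt field returned by the Query API
txt_url = 'https://www.sec.gov/Archives/edgar/data/.../0000000000-21-000000.txt'

# returns the complete content of the filing in text format, no OCR required
content = renderApi.get_filing(txt_url)
print(content[:500])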

                                Find CORRESP Filings

                                CORRESP filings are usually published in HTML format and can easily be located using the following search expression:

                                formType:CORRESP
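To retrieve the responses of a specific filer, the formType filter can be combined with a ticker or CIK filter. The sketch below reuses the queryApi instance from the example above; the ticker is hypothetical and only used for illustration:

# find CORRESP response letters filed by a specific company
query = {
  "query": "ticker:TSLA AND formType:CORRESP",
  "from": "0",
  "size": "50",
  "sort": [{ "filedAt": { "order": "desc" } }]
}

response = queryApi.get_filings(query)

for filing in response['filings']:
  print(filing['filedAt'], filing['linkToFilingDetails'])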

                                Example LETTER Filing

                                Let's examine two examples of LETTER filings to get a better understanding of their structure and content. LETTER filings can range from simple notifications of the SEC's completion of a review to more complex filings that include detailed requests and comments from the SEC staff.

                                Example of a Simple LETTER Filing

                                Source: Simple LETTER Filing

                                On the other hand, LETTER filings can also contain a comprehensive set of requests and comments from the SEC staff that the filer needs to address. Here's an example of a more detailed LETTER filing:

                                Example of a Complex LETTER Filing

                                Source: Complex LETTER Filing

                                Example CORRESP Filing

                                Let's also have a look at multiple examples of CORRESP filings.

Examples:

1. Requesting the effectiveness of a Registration Statement on Form S-3 at a specific date and time.

CORRESP Filing - Request for Registration Statement

Source: CORRESP Filing - Request for Registration Statement

2. Inclusion of additional risk factors as discussed with the SEC.

CORRESP EDGAR Filing Example 2

Source: CORRESP Filing - Additional Risk Factors

3. Request for clarification regarding whether a sponsor is controlled by a non-U.S. person. Source Filing

4. Request for more details about the effect on investors of not having made specific arrangements, such as not utilizing an escrow account. Source Filing

5. Requesting clarification on the use of specific non-GAAP line items. Source Filing

                                Find SEC Comment Letters

                                Let's explore how to find SEC Comment letters to build up a list of URLs pointing to such letters. As an example, we will demonstrate how to locate all Comment letters published in the year 2021 using the Query API. The Lucene search expression to retrieve the metadata, including the URL, of such filings is as follows:

                                formType:LETTER AND filedAt:[2021-01-01 TO 2021-12-31]

                                To begin, install the sec-api package and import the necessary modules:

                                !pip install -q sec-api
                                API_KEY = 'YOUR_API_KEY'
                                from sec_api import QueryApi

                                queryApi = QueryApi(api_key=API_KEY)

                                Next, create a query using the Lucene search expression to find the Comment letters published in 2021 and retrieve the filings using the Query API:

                                query = {
                                  "query": "formType:LETTER AND filedAt:[2021-01-01 TO 2021-12-31]",
                                  "from": "0",
                                  "size": "200",
                                  "sort": [{ "filedAt": { "order": "desc" } }]
                                }

                                response = queryApi.get_filings(query)
                                print('Number of LETTER filings published in 2021:', response['total']['value'])
                                Number of LETTER filings published in 2021: 9245

                                Finally, you can convert the metadata into a pandas DataFrame for further analysis:

                                import pandas as pd

                                metadata = pd.DataFrame.from_records(response['filings'])
                                metadata.head(3)
                                Out:
(selected columns shown; URL, entity and file-list columns truncated)

   accessionNo            cik      ticker  companyName                     formType  description                         filedAt
0  0000000000-21-015509   1801417  BYNO    byNordic Acquisition Corp       LETTER    Form UPLOAD - SEC-generated letter  2021-12-31T07:30:12-05:00
1  0000000000-21-015508   1737995  STSS    Sharps Technology Inc.          LETTER    Form UPLOAD - SEC-generated letter  2021-12-30T18:30:03-05:00
2  0000000000-21-015503   895464   YBGJ    Yubo International Biotech Ltd  LETTER    Form UPLOAD - SEC-generated letter  2021-12-30T16:30:04-05:00

Other columns returned per filing: id, companyNameLong, linkToTxt, linkToHtml, linkToXbrl, linkToFilingDetails, entities, documentFormatFiles, dataFiles, seriesAndClassesContractsInformation.

                                The following function get_metadata(start_year=2021, end_year=2022) retrieves the metadata of SEC Comment letters for a specified date range. Here's an explanation of how the function works:

                                1. The function iterates over the specified range of years and months to return a dataframe with metadata of SEC Comment letters.

                                2. Inside the nested loop, a Lucene query is constructed based on the year and month to filter the filings for that specific period.

                                3. Pagination is implemented using the query_from and query_size variables. The initial value of query_from is 0, and query_size is set to 200. These variables determine the starting index and the number of filings to retrieve in each API request.

                                4. A while loop is used to handle pagination. It continues until there are no more filings returned in the response.

                                5. Within each iteration of the while loop, a Query API request is made with the constructed query to retrieve a batch of Comment letters.

                                6. The retrieved filings are processed to extract the desired metadata, including the ticker, cik, formType, filedAt, filingUrl, and linkToTxt (URL of the TXT file).

                                7. The extracted metadata is appended to a dataframe called df, and any filings without a ticker symbol are filtered out. df is then added to a list called frames to store the metadata for all retrieved filings.

                                8. Once all years and months are processed, the frames list is concatenated into a single dataframe called result, which contains the metadata for all downloaded filings.

                                def get_metadata(start_year=2021, end_year=2022):
                                  form_type_filter = 'formType:LETTER'
                                  frames = []

                                  for year in range(start_year, end_year + 1):
                                    for month in range(1, 13):
                                      download_counter = 0
                                      padded_month = str(month).zfill(2) # "1" -> "01"
                                      date_range_filter = f'filedAt:[{year}-{padded_month}-01 TO {year}-{padded_month}-31]'
                                      lucene_query = date_range_filter + ' AND ' + form_type_filter

                                      query_from = 0
                                      query_size = 200

                                      while True:
                                        query = {
                                          "query": lucene_query,
                                          "from": query_from,
                                          "size": query_size,
                                          "sort": [{ "filedAt": { "order": "desc" } }]
                                        }

                                        response = queryApi.get_filings(query)
                                        filings = response['filings']

                                        if len(filings) == 0:
                                          break
                                        else:
                                          query_from += query_size

                                        metadata = list(map(lambda f: {'ticker': f['ticker'],
                                                                       'cik': f['cik'],
                                                                       'formType': f['formType'],
                                                                       'filedAt': f['filedAt'],
                                                                       'filingUrl': f['linkToFilingDetails'],
                                                                       'linkToTxt': f['linkToTxt']
                                                                      }, filings))

                                        df = pd.DataFrame.from_records(metadata)
                                        # remove all entries without a ticker symbol
                                        df = df[df['ticker'].str.len() > 0]
                                        df.loc[:, 'filingUrl'] = df['filingUrl'].apply(lambda url: url.replace('ix?doc=/', ''))
                                        frames.append(df)
                                        download_counter += len(df)

                                      print(f'✅ Downloaded {download_counter} metadata objects for {year}-{padded_month}')

                                  result = pd.concat(frames)

                                  print(f'✅ Download completed. Metadata downloaded for {len(result)} filings.')

                                  return result
                                metadata_2021 = get_metadata(start_year=2021, end_year=2021)
                                ✅ Downloaded 548 metadata objects for 2021-01
                                ✅ Downloaded 590 metadata objects for 2021-02
                                ✅ Downloaded 685 metadata objects for 2021-03
                                ✅ Downloaded 604 metadata objects for 2021-04
                                ✅ Downloaded 584 metadata objects for 2021-05
                                ✅ Downloaded 666 metadata objects for 2021-06
                                ✅ Downloaded 618 metadata objects for 2021-07
                                ✅ Downloaded 600 metadata objects for 2021-08
                                ✅ Downloaded 678 metadata objects for 2021-09
                                ✅ Downloaded 586 metadata objects for 2021-10
                                ✅ Downloaded 540 metadata objects for 2021-11
                                ✅ Downloaded 512 metadata objects for 2021-12
                                ✅ Download completed. Metadata downloaded for 7211 filings.
                                metadata_2021.head(3)
                                Out:
   ticker  cik      formType  filedAt                    filingUrl                                           linkToTxt
0  OWL     1823945  LETTER    2021-01-29T18:30:04-05:00  https://www.sec.gov/Archives/edgar/data/182394...  https://www.sec.gov/Archives/edgar/data/182394...
1  SIVR    1450922  LETTER    2021-01-29T17:30:06-05:00  https://www.sec.gov/Archives/edgar/data/145092...  https://www.sec.gov/Archives/edgar/data/145092...
2  IMCR    1671927  LETTER    2021-01-29T17:30:06-05:00  https://www.sec.gov/Archives/edgar/data/167192...  https://www.sec.gov/Archives/edgar/data/167192...

                                Download SEC Comment Letters

                                In this section, we demonstrate how to download all SEC Comment letters using the Render API and pandarallel. The code allows for parallel downloading of up to 10 letters in text format (.TXT) at the same time. The letters are saved to the local disk in the filings folder, organized by company ticker.

                                Example Folder with downloaded SEC Comment Letters

                                The download_content(filing_metadata) function is used to download and process the content of each filing. It extracts the text content from the downloaded .TXT file using regular expressions and saves it to the corresponding file in the folder structure. If the download fails, the function retries up to four times before giving up.

                                The number_of_workers variable determines the number of parallel workers used for downloading. In this example, we set it to 10. The sample dataframe contains a subset of the metadata (500 filings in this case), and the download_content function is applied in parallel using sample.parallel_apply.

                                The text content of each Comment letter is also saved to sample['content'].

                                !pip install -q pandarallel
                                from sec_api import RenderApi

                                renderApi = RenderApi(api_key=API_KEY)
                                import re, os, time


                                def download_content(filing, retry_counter=0):
                                  url = filing['linkToTxt']
                                  ticker = filing['ticker']
                                  formType = filing['formType']

                                  try:
                                    new_folder = './filings/' + ticker
                                    date = filing['filedAt'][:10]
                                    file_name = date + '_' + formType + '_' + url.split('/')[-1]
                                    file_path = new_folder + '/' + file_name

                                    if not os.path.isdir(new_folder):
                                      os.makedirs(new_folder)

                                    content = renderApi.get_filing(url)

    pattern = r"<TEXT>(.*?)</TEXT>"
    matches = re.findall(pattern, content, re.DOTALL)

    if matches:
      # the first <TEXT> block of an UPLOAD filing may contain the encoded PDF;
      # in that case the plain-text rendering is in the second <TEXT> block
      if '<PDF>' in matches[0]:
        extracted_text = matches[1]
      else:
        extracted_text = matches[0]

      with open(file_path, 'w') as f:
        f.write(extracted_text)

      return extracted_text
    else:
      print("No text found between <TEXT> and </TEXT> tags.")
      return ''

                                  except Exception as e:
                                    print(f"❌ {ticker}: download failed: {url}")
                                    print(f"Error: {str(e)}")

                                    if retry_counter < 4:
                                      time.sleep(2)
                                      return download_content(filing, retry_counter=retry_counter+1)
                                    else:
                                      return ''


                                filing = metadata_2021.iloc[140]
                                content = download_content(filing)
                                print(content)
                                United States securities and exchange commission logo





                                                             January 25, 2021

                                       Mark A. Goldsmith, M.D., Ph.D.
                                       President and Chief Executive Officer
                                       Revolution Medicines, Inc.
                                       700 Saginaw Drive
                                       Redwood City, CA 94063

                                                                                        Re: Revolution
                                Medicines, Inc.
                                                                                            Draft Registration
                                Statement on Form S-1
                                                                                            Submitted January
                                21, 2021
                                                                                            CIK No. 0001628171

                                       Dear Mr. Goldsmith:

                                                                                        This is to advise you
                                that we do not intend to review your registration statement.

                                               We request that you publicly file your registration statement no
                                later than 48 hours prior
                                       to the requested effective date and time. Please refer to Rules 460 and
                                461 regarding requests for
                                       acceleration. We remind you that the company and its management are
                                responsible for the
                                       accuracy and adequacy of their disclosures, notwithstanding any review,
                                comments, action or
                                       absence of action by the staff.

                                                                                        Please contact David
                                Gessert at 202-551-2326 with any questions.




                                                             Sincerely,


                                                             Division of Corporation Finance

                                                             Office of Life Sciences

                                from pandarallel import pandarallel

                                number_of_workers = 10
                                pandarallel.initialize(progress_bar=True, nb_workers=number_of_workers, verbose=0)
# download a sample of 500 comment letters
                                sample = metadata_2021.head(500)
                                sample['content'] = sample.parallel_apply(download_content, axis=1)

                                # uncomment the line below to download all SEC comment letters
                                # sample = metadata_2021
                                # sample['content'] = metadata_2021.parallel_apply(download_content, axis=1)

                                print('✅ Download complete')
                                ✅ Download complete
                                <ipython-input-15-8fc70936294f>:3: SettingWithCopyWarning: 
                                A value is trying to be set on a copy of a slice from a DataFrame.
                                Try using .loc[row_indexer,col_indexer] = value instead

                                See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
                                  sample['content'] = sample.parallel_apply(download_content, axis=1)
                                def get_directory_size(directory):
                                  total_size = 0

                                  for path, dirs, files in os.walk(directory):
                                    for file in files:
                                      file_path = os.path.join(path, file)
                                      total_size += os.path.getsize(file_path)

                                  return total_size / (1024 * 1024)


                                directory_path = './filings'
                                total_size_mb = get_directory_size(directory_path)
                                print(f"Total size of all downloaded SEC Comment letters: {total_size_mb:.2f} MB")
                                Total size of all downloaded SEC Comment letters: 2.02 MB
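If the analysis is continued in a separate session, the letters saved above can be reloaded from the filings folder instead of downloading them again. A minimal sketch, assuming the ./filings/<ticker>/ folder structure created by download_content:

import os
import pandas as pd

# rebuild a DataFrame from the letters saved as ./filings/<ticker>/<file>
records = []
for ticker in os.listdir('./filings'):
  ticker_folder = os.path.join('./filings', ticker)
  for file_name in os.listdir(ticker_folder):
    with open(os.path.join(ticker_folder, file_name), 'r') as f:
      records.append({'ticker': ticker, 'file': file_name, 'content': f.read()})

letters = pd.DataFrame.from_records(records)
print(f'{len(letters)} letters loaded from disk')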

                                Analyze SEC Comment Letters

                                Now that we have downloaded all SEC Comment letters, i.e. UPLOAD filings, we can start analyzing their content. In this section, we will explore two examples of analysis using SEC Comment letters.

                                Examples

1. Plot the distribution of SEC Comment letters over EDGAR form types. The x-axis represents the EDGAR form types, and the y-axis represents the number of comment letters per form type.

                                2. Perform topic modeling with BERTopic to identify different topics within SEC Comment letters and cluster the letters based on their topics.

                                Number of SEC Comment Letters per EDGAR Form Type

                                To determine the EDGAR form type referred to in a comment letter, we need to extract the form type from the letter's header.

                                Here's an example header of a letter filing that refers to the EDGAR form type "Form S-3":

                                sample_content = sample['content'].values[50]
                                print(sample_content[:800], '...')
                                United States securities and exchange commission logo





                                                          January 28, 2021

                                       Kyle Guse
                                       Chief Financial Officer and General Counsel
                                       ATOSSA THERAPEUTICS, INC.
                                       107 Spring Street
                                       Seattle, Washington 98104

                                                                                        Re: ATOSSA
                                THERAPEUTICS, INC.
                                                                                            Registration
                                Statement on Form S-3
                                                                                            Filed January 22,
                                2021
                                                                                            File No. 333-252335

                                       Dear Mr. Guse:

                                              This is to advise you that we have not reviewed and will not
                                review your registration
                                       statement.

                                               Please refer to Rules 460 and 461 ...

We can extract the form type by using a regular expression that first isolates the letter header (the text before the salutation) and then matches occurrences of "Statement on ..." or "Form ..." within it.

                                def extract_form_type(text):
                                  try:
                                    header_text_pattern = r".*Dear"
                                    header_text = re.findall(header_text_pattern, text, re.I | re.S)
                                    header_text = header_text[0]

                                    pattern = r"(Statement .*|Form [^\s]+)"
                                    matches = re.findall(pattern, header_text, re.I)
                                    matches = list(set(matches))
                                    matches = [re.sub(r'Statement (on|off)\s?', '', x, flags=re.I) for x in matches]

                                    return matches

                                  except Exception as e:
                                    print(e)
                                    return []


                                extract_form_type(sample_content)
                                Out:
                                ['Form S-3']
                                # Blackberry letter refers to two form types: 10-K, 10-Q
                                sample_content = sample['content'].values[100]
                                print(sample_content[:800], '...')
                                United States securities and exchange commission logo





                                                           January 26, 2021

                                       Steve Rai
                                       Chief Financial Officer
                                       Blackberry Limited
                                       2200 University Ave East
                                       Waterloo, ON N2K 0A7,
                                       Canada

                                                                                        Re: Blackberry Ltd.
                                                                                            Form 10-K for
                                Fiscal Year Ended February 29, 2020
                                                                                            Filed April 7, 2020
                                                                                            Form 10-Q for
                                Quarterly Periods Ended November 30, 2020
                                                                                            Filed September 24,
                                2020
                                                                                            File No. 001-38232 ...
                                print('Form types associated with the letter sent to Blackberry:')
                                print(extract_form_type(sample_content))
                                Form types associated with the letter sent to Blackberry:
                                ['Form 10-Q', 'Form 10-K']

                                By applying the extract_form_type function to the content of each letter, we can extract the relevant form types associated with the comment letters.

                                sample['refers_to'] = sample['content'].parallel_apply(extract_form_type)
df = sample['refers_to'].explode()
                                df.reset_index(drop=True, inplace=True)
                                form_types = sorted(list(df.unique().astype(str)))

                                print('EDGAR Form Types of SEC Comment Letters:')
print('\n'.join(form_types))
                                EDGAR Form Types of SEC Comment Letters:
                                Form 1-A
                                Form 10
                                Form 10-12G
                                Form 10-K
                                Form 10-K/A
                                Form 10-Q
                                Form 20-F
                                Form 8-K
                                Form 8-K/A
                                Form F-1
                                Form F-3
                                Form F-4
                                Form S-1
                                Form S-11
                                Form S-3
                                Form S-4
                                Form SF-1
                                From S-1
                                From S-4
                                Schedule 14A
                                Schedule 14C
                                nan
                                import matplotlib.pyplot as plt

df = sample['refers_to'].explode()
                                df.reset_index(drop=True, inplace=True)

                                value_counts = df.value_counts()

                                plt.figure(figsize=(10, 6))
                                value_counts.plot(kind='bar')
                                plt.xlabel('Form Type')
                                plt.ylabel('Number of SEC Comment Letters')
                                plt.title('Number of SEC Comment Letters per EDGAR Form Type (2021)')
                                plt.xticks(rotation=45)
                                plt.show()
                                Number of SEC Comment Letters per EDGAR Form Type (2021)

                                Topic Modelling of SEC Comment Letters with BERTopic

                                In this section, we will explore how to use BERTopic for topic modeling on SEC Comment letters. The process is divided into two main parts:

1. Text cleaning and preparation: We will remove the header text from each comment letter, eliminate stopwords such as "we," "it," and "what," and apply lemmatization to transform words to their root form. Lemmatization ensures that inflected forms of a word are treated as the same word; for instance, "running" is lemmatized to "run", and with part-of-speech information "better" can be reduced to "good."

                                2. Fitting BERTopic to the document corpus: We will generate embedding vectors for the SEC Comment letters and create topics using the HDBSCAN clustering algorithm. This step involves grouping similar letters into topics based on their content and context.

                                def remove_header(letter):
                                  cleaned_letter = re.sub(r'.*Dear', '', letter, flags=re.S)
                                  return cleaned_letter


                                sample['cleaned_content'] = sample['content'].parallel_apply(remove_header)
                                print('Example of a comment letter without its header')
                                print('----------------------------------------------')
                                print(sample['cleaned_content'].values[0][:1500], '...')
                                Example of a comment letter without its header
                                ----------------------------------------------
                                 Mr. Wasserman:

                                              We have reviewed your registration statement and have the
                                following comments. In
                                       some of our comments, we may ask you to provide us with information so
                                we may better
                                       understand your disclosure.

                                              Please respond to this letter by amending your registration
                                statement and providing the
                                       requested information. If you do not believe our comments apply to your
                                facts and
                                       circumstances or do not believe an amendment is appropriate, please tell
                                us why in your
                                       response.

                                              After reviewing any amendment to your registration statement and
                                the information you
                                       provide in response to these comments, we may have additional comments.

                                       Registration Statement on From S-4 filed January 4, 2021

                                       Selected Definitions, page 3

                                   1. Please revise to
                                include Seller Earnout Securities, Seller Earnout Share and Seller Earnout
                                                                                        Unit in your selected
                                definitions.
                                       Questions and Answers About this Business Combination, page 13

                                   2. Please revise this
                                section to include a question that asks and answers the question, "What
                                                                                        will Altimar Acquistion
                                Corporation equity holders receive as a result of the business
                                                                                        combination?" In th ...
                                !pip install -q nltk
                                import nltk

                                nltk.download('stopwords')
                                nltk.download('omw-1.4')
                                nltk.download('wordnet')

                                wn = nltk.WordNetLemmatizer()
                                [nltk_data] Downloading package stopwords to /root/nltk_data...
                                [nltk_data] Unzipping corpora/stopwords.zip.
                                [nltk_data] Downloading package omw-1.4 to /root/nltk_data...
                                [nltk_data] Downloading package wordnet to /root/nltk_data...
                                stopwords = nltk.corpus.stopwords.words('english')
                                print(f'There are {len(stopwords)} default stopwords. They are {stopwords}')
                                There are 179 default stopwords. They are ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
                                # remove stop words
                                sample['content_standardized'] = sample['cleaned_content'].apply(lambda x: ' '.join([w for w in x.split() if w.lower() not in stopwords]))
                                # lemmatize
                                sample['content_standardized'] = sample['content_standardized'].apply(lambda x: ' '.join([wn.lemmatize(w) for w in x.split() if w not in stopwords]))
                                # remove other non-informative words
                                wordlist = ['please', 'page', 'comment']
                                sample['content_standardized'] = sample['content_standardized'].apply(lambda x: ' '.join([w for w in x.split() if w.lower() not in wordlist]))
                                print(sample['content_standardized'].values[0][:300])
                                Mr. Wasserman: reviewed registration statement following comments. comments, may ask provide u information may better understand disclosure. respond letter amending registration statement providing requested information. believe apply fact circumstance believe amendment appropriate, tell u response.
                                # docs = list(sample['cleaned_content'])
                                docs = list(sample['content_standardized'])

                                Vocabulary Generation

                                To optimize memory usage before training our model, we can pre-process our vocabulary. By creating a vocabulary of words in our dataset and setting a minimum frequency threshold, we can reduce the computational load on the tokenizer. In the code below, we generate the vocabulary by selecting words that appear at least 15 times in our data.

                                import collections
                                from tqdm import tqdm
                                from sklearn.feature_extraction.text import CountVectorizer

                                # extract vocab to be used in BERTopic
                                vocab = collections.Counter()
                                tokenizer = CountVectorizer().build_tokenizer()

                                for doc in tqdm(docs):
                                  vocab.update(tokenizer(doc))

                                vocab = [word for word, frequency in vocab.items() if frequency >= 15]; len(vocab)
                                100%|██████████| 500/500 [00:00<00:00, 8673.34it/s]
                                Out:
                                1237

                                Train BERTopic

                                import locale
                                locale.getpreferredencoding = lambda: 'UTF-8'
                                !pip install -q bertopic
                                from bertopic import BERTopic
                                # source: https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing#scrollTo=DIv6fRecTvTw
                                from sentence_transformers import SentenceTransformer
                                from umap import UMAP
                                from hdbscan import HDBSCAN

                                # prepare sub-models
                                # see here for a list of all models: https://www.sbert.net/docs/pretrained_models.html
                                embedding_model = SentenceTransformer('all-mpnet-base-v2')
                                embeddings = embedding_model.encode(docs, show_progress_bar=True)

                                umap_model = UMAP(n_components=5, n_neighbors=15, random_state=42, metric='cosine', verbose=True)
                                hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20)
                                # vectorizer_model = CountVectorizer(vocabulary=vocab, stop_words="english")

# fit BERTopic using the precomputed embeddings and the UMAP/HDBSCAN sub-models
topic_model = BERTopic(
                                        embedding_model=embedding_model,
                                        umap_model=umap_model,
                                        hdbscan_model=hdbscan_model,
                                        # vectorizer_model=vectorizer_model, # causes divide by zero encountered in true_divide
                                        verbose=True
                                ).fit(docs, embeddings=embeddings)
                                topic_model.get_topic_info()
                                Out:
   Topic  Count  Name                                      Representation                                      Representative_Docs
0      0    339  0_statement_disclosure_may_registration  [statement, disclosure, may, registration, not...  [Mr. Balkin: reviewed registration statement f...
1      1     58  1_review_action_absence_461              [review, action, absence, 461, acceleration, r...  [Mr. Flood: advise reviewed review registratio...
2      2     57  2_review_action_sciences_life            [review, action, sciences, life, adequacy, acc...  [Mr. Schreiber: advise reviewed review registr...
3      3     46  3_review_action_completed_finance        [review, action, completed, finance, adequacy,...  [Mr. Beekhuisen: completed review filing. remi...
                                topic_model.get_topic(topic=1)
                                Out:
                                [('review', 0.16467594236548108),
                                 ('action', 0.15724217716679395),
                                 ('absence', 0.10993984620559281),
                                 ('461', 0.10682466720551433),
                                 ('acceleration', 0.10671969930903165),
                                 ('request', 0.10343612706686907),
                                 ('staff', 0.10165338949306918),
                                 ('advise', 0.10045037768051748),
                                 ('finance', 0.10039777356376828),
                                 ('adequacy', 0.09678466838447443)]
                                # reduce dimensionality of embeddings
                                umap_model = UMAP(n_components=5, n_neighbors=15, random_state=42, metric='cosine', verbose=True)
                                reduced_embeddings_2d = umap_model.fit_transform(embeddings)
                                UMAP(angular_rp_forest=True, metric='cosine', n_components=5, random_state=42, verbose=True)
                                Wed Jun 14 14:38:46 2023 Construct fuzzy simplicial set
                                Wed Jun 14 14:38:46 2023 Finding Nearest Neighbors
                                Wed Jun 14 14:38:46 2023 Finished Nearest Neighbor Search
                                Wed Jun 14 14:38:47 2023 Construct embedding
                                Epochs completed:   0%|            0/500 [00:00]
                                Wed Jun 14 14:38:48 2023 Finished embedding
                                import itertools
                                import pandas as pd

                                # define colors for the visualization to iterate over
                                colors = itertools.cycle(['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231',
                                                          '#911eb4', '#46f0f0', '#f032e6', '#bcf60c', '#fabebe',
                                                          '#008080', '#e6beff', '#9a6324', '#fffac8', '#800000',
                                                          '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080',
                                                          '#ffffff', '#000000'])
                                color_key = {str(topic): next(colors) for topic in set(topic_model.topics_) if topic != -1}

                                # prepare dataframe and ignore outliers
                                df = pd.DataFrame({"x": reduced_embeddings_2d[:, 0], "y": reduced_embeddings_2d[:, 1], "Topic": [str(t) for t in topic_model.topics_]})
                                df['Length'] = [len(doc) for doc in docs]
                                df = df.loc[df.Topic != '-1']
                                df = df.loc[(df.y > -10) & (df.y < 10) & (df.x < 10) & (df.x > -10), :]
                                df['Topic'] = df['Topic'].astype('category')

                                # get centroids of clusters
                                mean_df = df.groupby('Topic').mean().reset_index()
                                mean_df.Topic = mean_df.Topic.astype(int)
                                mean_df = mean_df.sort_values('Topic')
                                !pip install -q adjustText
                                import seaborn as sns
                                from matplotlib import pyplot as plt
                                from adjustText import adjust_text
                                import matplotlib.patheffects as pe

                                fig = plt.figure(figsize=(10, 10))
                                sns.scatterplot(data=df, x='x', y='y', c=df['Topic'].map(color_key), alpha=0.4, sizes=(0.4, 10), size="Length")

                                # annotate top 50 topics
                                texts, xs, ys = [], [], []
                                for row in mean_df.iterrows():
                                  topic = row[1]["Topic"]
                                  name = " - ".join(list(zip(*topic_model.get_topic(int(topic))))[0][:3])

                                  if int(topic) <= 50:
                                    xs.append(row[1]["x"])
                                    ys.append(row[1]["y"])
                                    texts.append(plt.text(row[1]["x"], row[1]["y"], name, size=10, ha="center", color=color_key[str(int(topic))],
                                                          path_effects=[pe.withStroke(linewidth=0.5, foreground="black")]))

                                # adjust annotations such that they do not overlap
                                adjust_text(texts, x=xs, y=ys, time_lim=1, force_text=(0.01, 0.02), force_static=(0.01, 0.02), force_pull=(0.5, 0.5))
                                plt.show()
                                # plt.savefig("visualization2.png", dpi=600)
                                BERT Topics Visualization
                                topic_model.visualize_topics()
                                topic_model.visualize_barchart()
                                topic_model.visualize_heatmap()
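Finally, the topic assignment of each letter can be mapped back onto the metadata. A minimal sketch, assuming the sample DataFrame and the topic_model fitted above (topic_model.topics_ is ordered like docs, which was built from sample):

# attach the topic found for each letter to the metadata DataFrame
sample = sample.copy()
sample['topic'] = topic_model.topics_

# number of letters per topic; -1 marks outliers not assigned to any topic
print(sample['topic'].value_counts())

# inspect a few letters assigned to topic 1
topic_1_letters = sample[sample['topic'] == 1]
print(topic_1_letters[['ticker', 'filedAt', 'filingUrl']].head())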

                                References

                                SEC.gov Resources

                                • https://www.sec.gov/answers/commentletters.htm
                                • https://www.sec.gov/answers/edgarletters.htm
                                • https://www.sec.gov/news/press/2004-89.htm
                                • https://www.sec.gov/divisions/corpfin/cfannouncements/edgarcorrespondence
                                • https://www.sec.gov/news/press/2005-72.htm
                                • https://www.sec.gov/page/edgar-how-do-i-submit-correspondence-corresp
                                • https://www.sec.gov/divisions/corpfin/cffilingreview

                                Papers

                                • Causes and Consequences of Earnings Manipulation: An Analysis of Firms Subject to Enforcement Actions by the SEC

                                Topic Modelling

                                • https://github.com/MaartenGr/BERTopic
                                • https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html
                                • https://www.sbert.net/docs/usage/semantic_textual_similarity.html

                                Other Resources

                                • SEC Comment Letter Trends by PwC
