
Webinar Summary

ToxicDocs: A database of once-secret chemical industry documents

November 11, 2025

The ToxicDocs dataset and website contain millions of pages of industry documents about lead, asbestos, silica, PCBs, and other toxic substances. This collection includes internal memoranda, emails, slides, board minutes, unpublished scientific studies, and other documents. 

In a recent webinar, one of the ToxicDocs founders, Dr. Merlin Chowkwanyun, gave an overview of this continuously growing dataset, introducing the interface, explaining the technology behind it, and offering a tour of the searchable content. 

Peeking into a corporation’s mind

The once-secret documents in the database have been made public through the discovery process of toxic tort litigation. As one scholar wrote, the database “allows us to peek into a corporation's mind.”

As Chowkwanyun explained:

“Our database lets you try to understand modernity, but through the lens of these toxic substances, the products that they were packaged in, and the essential processes that they were and still are a part of.”

He shared several examples of important scholarship that has relied on ToxicDocs. The resource has been tapped by researchers, journalists, and others exploring a new world of environmental health risk and how it came to be.

Parsing millions of pages

The mission of ToxicDocs is to make the information contained in the millions of pages of its database easy to access and understand. Chowkwanyun explained the technological advances the team has been using to improve the database, such as parallel computing to quickly add Optical Character Recognition (OCR) to the files.

ToxicDocs is also working on innovations such as Named Entity Extraction. With Named Entity Extraction, names that occur frequently can be identified. This can be a first step toward parsing out the corporate roles and relationships of the people named within the documents.
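To make the idea of surfacing frequently occurring names concrete, here is a minimal sketch in Python. It is not the ToxicDocs pipeline: the sample pages are invented, and the capitalized-word pattern is a crude stand-in for a trained entity-extraction model, which a real system would almost certainly use instead.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for OCR output; not taken from ToxicDocs.
SAMPLE_PAGES = [
    "Memo from John Smith to Mary Jones regarding the plant inspection.",
    "Mary Jones replied to John Smith and copied Robert Brown.",
    "John Smith presented the findings to the board.",
]

# Assumption: a run of two or more capitalized words is treated as a name.
# This heuristic misses many names and catches some non-names.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def frequent_names(pages, min_count=2):
    """Return (name, count) pairs for names seen at least min_count times."""
    counts = Counter()
    for text in pages:
        counts.update(NAME_PATTERN.findall(text))
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


print(frequent_names(SAMPLE_PAGES))
```

On the sample pages, the two names that recur are kept and the name that appears only once is dropped. Counting how often names appear together in the same document would be one possible next step toward mapping roles and relationships.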

Another innovation being developed involves running nearest neighbor analysis on the documents and using the results to create multidimensional vector transformations of the information. The ToxicDocs managers have been experimenting with entering these vector transformations into Large Language Models (LLMs) to analyze the data more quickly. Early tests have shown promising results for certain tasks, such as identifying the most important papers on a given topic (for example, articles on asbestos and health).
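As a rough illustration of what nearest neighbor analysis on documents can look like, the sketch below represents each document as a vector and finds the most similar one. The webinar summary does not say how ToxicDocs builds its vectors or measures distance, so the word-count vectors, cosine similarity, and the three tiny example documents here are all assumptions chosen for simplicity.

```python
import math
from collections import Counter

# Hypothetical documents, invented for illustration only.
DOCS = {
    "memo_a": "asbestos exposure health risk workers",
    "memo_b": "asbestos health study workers lung",
    "memo_c": "lead paint marketing campaign",
}


def to_vector(text):
    """Represent a document as word counts (one dimension per distinct word)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(query_id, docs, k=1):
    """Return the ids of the k documents most similar to docs[query_id]."""
    vectors = {doc_id: to_vector(text) for doc_id, text in docs.items()}
    query = vectors[query_id]
    scored = [
        (cosine(query, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


print(nearest_neighbors("memo_a", DOCS))
```

Here the two asbestos memos share three words and are each other's nearest neighbor, while the lead paint memo shares none. At the scale of millions of pages, a production system would likely rely on learned embeddings and an approximate nearest neighbor index rather than this exhaustive comparison.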

These new tools are not yet publicly available, but will be released when they are more fully developed.

AI poses challenges

While LLMs could become useful tools for ToxicDocs and other large document databases, Chowkwanyun highlighted several challenges and risks posed by AI:

  • Just as AI is already being used to produce fake photographs and videos, it can also be used to fake documents. Ensuring the authenticity of electronic documents will only get more difficult as AI technology gets more powerful. Partly to ensure authenticity, ToxicDocs does not accept leaked documents into the database.
  • The use of AI as a research crutch could make researchers more prone to the Missing Context/Smoking Gun Fallacy. While one document can sometimes serve as a smoking gun, it usually takes much more to truly understand a full story. One document taken out of context could lead to erroneous conclusions. Even with hundreds of documents, a researcher could need a deep background in the history of epidemiology and toxicology during the period being studied to fully understand the implications of the available documents.
  • The use of AI comes with ecological costs. Chowkwanyun noted that ToxicDocs tries to be very cognizant of those costs and avoids using AI without a strong, clear purpose.

For more on the database and how it can be used, see our webinar ToxicDocs: A database of once-secret chemical industry documents.

 

This organizational blog was produced by CHE's Science Writer, Matt Lilley.
