Industry Documents Library API and Data Sets

API

The Industry Documents Library uses Solr to index the document corpus. Users who want to access the data programmatically can query the Industry Documents Solr server directly through our application programming interface (API). The API lets you run search queries, process the results in your own programs, and export documents to another system. Results can be returned in these formats: XML, JSON, Python, Ruby, PHP, and CSV.
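As a rough illustration of querying a Solr index programmatically, the sketch below builds a search URL from standard Solr query parameters (q, wt, rows, start) and reads a JSON response. The endpoint address, the function names, and the example search term are assumptions made for illustration only; the real server address and the available fields are given in the API documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical placeholder address, NOT the real IDL endpoint.
# Replace it with the URL given in the API documentation.
SOLR_SELECT_URL = "https://solr.example.org/solr/documents/select"


def build_query_url(query, rows=10, start=0, fmt="json", base_url=SOLR_SELECT_URL):
    """Build a Solr select URL from standard Solr query parameters."""
    params = {
        "q": query,      # search expression
        "wt": fmt,       # response writer, e.g. json, xml, csv
        "rows": rows,    # number of results per page
        "start": start,  # offset of the first result (for paging)
    }
    return f"{base_url}?{urlencode(params)}"


def fetch_results(query, rows=10, start=0, base_url=SOLR_SELECT_URL):
    """Run the query and return the parsed JSON response as a dict."""
    url = build_query_url(query, rows=rows, start=start, fmt="json", base_url=base_url)
    with urlopen(url, timeout=30) as response:
        return json.load(response)
```

With a real endpoint in place, a call such as fetch_results("nicotine", rows=5) would typically return a dictionary whose matching documents sit under ["response"]["docs"], which is the usual layout of a Solr JSON response. Paging through a large result set is a matter of increasing start by rows on each request.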

Download documentation

IDL Datasets

For researchers who would prefer to work with Industry Documents Library (IDL) metadata and optical character recognition (OCR) text within their own database systems, IDL has made these files available for free download via the link below. Please consult the included readme file for instructions. Note that the IDL website provides access to the most current data, because the website undergoes a new release each month. In contrast, due to time constraints, the downloadable IDL dataset is updated only twice a year. These files are provided on a do-it-yourself basis: IDL is unable to provide individual technical support for downloading the files or for setting up your own database in which to ingest them. We do welcome feedback. Please contact us at industrydocuments@ucsf.edu.
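For readers planning their own ingestion, the following is a minimal sketch of loading a delimited metadata file into a local SQLite database. It assumes the file is a UTF-8 CSV with a header row and the same number of fields on every line; the actual file format, file names, and columns are described in the readme, which takes precedence. The function name and the table name idl_metadata are hypothetical.

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, db_path, table="idl_metadata"):
    """Copy a CSV file (with a header row) into a SQLite table.

    Every column is stored as TEXT and named after its header field.
    Returns the number of rows in the table after loading.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first line supplies the column names
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        placeholders = ", ".join("?" for _ in header)

        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
                conn.executemany(
                    f'INSERT INTO "{table}" VALUES ({placeholders})', reader
                )
            return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        finally:
            conn.close()
```

Storing everything as TEXT keeps the sketch independent of the real schema; once the columns are known from the readme, typed columns and indexes on frequently searched fields would be a natural next step.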

Access data sets

Other Datasets

A growing number of research projects using industry documents have made their datasets publicly available, including:

(2024). Industry Documents Library dataset (pixparse/idl-wds). HuggingFace. https://huggingface.co/datasets/pixparse/idl-wds
(2023). Opioid Industry Documents Archive (OIDA) Data on AWS. Registry of Open Data on AWS. https://registry.opendata.aws/oida/
Soboroff, I. (2022). Complex Document Information Processing (CDIP) dataset. NIST Public Data Repository. https://data.nist.gov/od/id/mds2-2531

If you are aware of other publicly available datasets that use IDL documents or metadata and could be added to this list, please contact us at industrydocuments@ucsf.edu.