High-quality datasets are critically important for training many types of AI systems. Federal agencies continue to pursue efforts to increase access to data while maintaining safety, security, civil liberties, privacy, and confidentiality protections. This site provides links to a number of high-quality Federal datasets that are useful for AI research.
DATA.GOV
Data.gov was launched in 2009 as the central clearinghouse for open data from the Federal government. It also provides access to local government and non-Federal open data resources.
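For researchers who want to search the clearinghouse programmatically rather than through the website, the following is a minimal sketch and not an official client. It assumes the Data.gov catalog exposes a CKAN-style `package_search` endpoint at the URL shown and that responses follow the usual CKAN layout (`success`, `result`, `results`); both assumptions should be checked against the current Data.gov documentation. The function names are hypothetical and exist only for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Data.gov catalog serves a CKAN-style search API at this URL.
CATALOG_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Build a keyword-search URL (hypothetical helper, not part of any library)."""
    return CATALOG_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


def extract_titles(payload):
    """Return dataset titles from a parsed CKAN-style JSON response."""
    # A CKAN response reports failure with "success": false.
    if not payload.get("success"):
        return []
    results = payload.get("result", {}).get("results", [])
    return [item.get("title", "") for item in results]


def search_datasets(query, rows=5, timeout=30):
    """Query the catalog over the network and return matching dataset titles."""
    with urlopen(build_search_url(query, rows), timeout=timeout) as response:
        return extract_titles(json.load(response))
```

Calling `search_datasets("sea surface temperature")` would, under these assumptions, return up to five dataset titles. URL construction and response parsing are kept separate from the network call so that each piece can be checked without network access.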
NASA OPEN DATA
NASA makes tens of thousands of datasets available to the public, cataloged on its clearinghouse site for open data, data.nasa.gov. Some datasets on it are harvested from other NASA data archives and others exist solely on data.nasa.gov. A data visualization page helps users better understand the wide array of data they can access through the repository.
NOAA BIG DATA PROGRAM
NOAA generates tens of terabytes of data per day from satellites, radars, ships, weather models, and other sources. The NOAA Big Data Program (BDP) uses commercial cloud platforms to remove obstacles to public use of open NOAA data. More than 150 NOAA datasets are available on the cloud service providers' platforms.
NIH-SUPPORTED OPEN DATA REPOSITORIES
NIH has a long tradition of making publicly available the results of the research it supports and conducts, including publications and scientific data. Sharing data enables reuse, increases transparency, and facilitates research reproducibility. Open NIH-supported domain-specific repositories house data of specific types or data related to specific disciplines.
NIH-SUPPORTED LIMITED ACCESS DATA REPOSITORIES
Some NIH-supported repositories restrict data submission and access to specific authorized researchers. This resource links to those repositories and to knowledgebases, which the NIH Strategic Plan for Data Science defines as resources that “accumulate, organize, and link growing bodies of information related to core datasets.”
NIST SCIENCE DATA PORTAL
The National Institute of Standards and Technology (NIST) Science Data Portal provides a user-friendly discovery and exploration tool for publicly available datasets at NIST. These data products are generated as part of the NIST mission and span multiple disciplines of scientific, engineering, and technology research. NIST’s publicly available datasets showcase its commitment to providing accurate, well-curated measurements of physical properties, exemplified by the Standard Reference Data program, as well as its commitment to advancing basic research.
PATENT AND TRADEMARK DATASETS
The United States Patent and Trademark Office (USPTO) has expansive collections of scientific, technical, and commercial records, including millions of patents, published patent applications, and registered trademarks. Available through the USPTO Open Data Portal, these collections have enabled numerous AI research projects such as training large-scale language models.
Beyond enabling AI research, patent documents can also be used to understand the dynamics of AI innovation at large. To that end, in June 2021, the USPTO released an AI Patent Dataset that identifies which of 13.2 million United States patents and pre-grant publications include AI technology. This novel dataset can help researchers, policymakers, and the public explore the growing role of AI in invention.