Sixty-Four Free Chemistry Databases Part 1: PubChem
In this first installment of a series exploring free chemistry databases and resources on the Web, we look at PubChem. PubChem is not the first free public-facing chemical database on the Web, but in many ways it is the most important. And it continues to host one of the largest collections of freely available and reusable chemical structures and associated data available anywhere.
PubChem is both large and powerful, which tends to obscure its purpose and capabilities. From the PubChem FAQ:
PubChem provides information on the biological activities of small molecules. It is a component of NIH's Molecular Libraries Roadmap Initiative.
PubChem includes substance information, compound structures, and BioActivity data in three primary databases, Pcsubstance, Pccompound, and PCBioAssay, respectively.
- Pcsubstance contains more than 40 million records. You can check the count of substance records as of today.
- Pccompound contains more than 19 million unique structures. You can check the count of compound records as of today.
- PCBioAssay contains more than 1000 BioAssays. Each BioAssay contains a various number of data points. You can check the count of BioAssay records as of today.
In PubChem, a Compound is a unique chemical structure, whereas a Substance is one deposited instance of that compound, not unlike the concept of a "batch" in many drug discovery organizations. A Compound can have many Substances, but a Substance can have only one Compound. BioAssay results are reported against Substances, from which the Compound can be inferred.
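The one-to-many relationship described above can be sketched in a few lines of Python. This is only an illustrative model, not PubChem's actual schema: the class names, the `Registry` helper, and the identifier values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Compound:
    cid: int  # identifier of one unique structure
    substance_ids: List[int] = field(default_factory=list)


@dataclass
class Substance:
    sid: int  # identifier of one deposited record
    cid: int  # the single Compound this record maps to


class Registry:
    """Hypothetical in-memory stand-in for the Substance/Compound link."""

    def __init__(self) -> None:
        self.compounds: Dict[int, Compound] = {}
        self.substances: Dict[int, Substance] = {}

    def deposit(self, sid: int, cid: int) -> Substance:
        # A Substance belongs to exactly one Compound, so an SID
        # cannot be deposited twice.
        if sid in self.substances:
            raise ValueError(f"SID {sid} is already registered")
        compound = self.compounds.setdefault(cid, Compound(cid))
        compound.substance_ids.append(sid)
        substance = Substance(sid, cid)
        self.substances[sid] = substance
        return substance

    def compound_for_assay_result(self, sid: int) -> Compound:
        # Assay results name a Substance; the Compound is inferred from it.
        return self.compounds[self.substances[sid].cid]


registry = Registry()
registry.deposit(sid=101, cid=1)  # two depositors, same structure
registry.deposit(sid=102, cid=1)
registry.deposit(sid=201, cid=2)
```

Here an assay result reported against Substance 102 resolves to Compound 1 through `registry.compound_for_assay_result(102)`, while Compound 1 itself lists both 101 and 102 as its Substances.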
So What?
Although PubChem's stated purpose may not strike most chemists as something they could use in their daily work, a few additional applications have been found over the years.
One of the interesting things about PubChem Compound and Substance records is the registration data associated with them.
For example, many depositors provide CAS Registry Numbers® for their substances. To date, PubChem contains over 300,000 of them. This makes it very easy to use PubChem to find the structure associated with a number of important compounds from a CAS Number and vice versa.
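As a rough sketch of what such a lookup might look like in code, the snippet below builds a request against PubChem's PUG REST interface, which is not discussed in this article and is assumed here. It relies on the assumption that a depositor-supplied CAS number can be searched like any other compound name. The helper names are hypothetical, and no network request is made.

```python
import json
from typing import List
from urllib.parse import quote

# Assumed base address of the PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cids_url_for_cas(cas_number: str) -> str:
    """Build a URL asking for the Compound IDs that match a CAS number.

    The CAS number is passed through the 'name' lookup, on the assumption
    that it was deposited as a synonym.
    """
    return f"{PUG_REST}/compound/name/{quote(cas_number, safe='')}/cids/JSON"


def parse_cids(response_text: str) -> List[int]:
    """Pull the list of Compound IDs out of a JSON response body."""
    data = json.loads(response_text)
    return data.get("IdentifierList", {}).get("CID", [])


# 50-78-2 is the CAS number of aspirin.
url = cids_url_for_cas("50-78-2")

# Illustrative response body, written by hand rather than fetched.
sample_response = '{"IdentifierList": {"CID": [2244]}}'
cids = parse_cids(sample_response)
```

Fetching `url` with any HTTP client and passing the body to `parse_cids` would complete the lookup. Because CAS numbers come from depositors, a number may match no compound or more than one, so the result is kept as a list.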
PubChem also auto-generates systematic nomenclature for its compounds. This means that if you find a compound in PubChem, you can frequently find a reasonable name for it.
Every Compound in PubChem gets a unique numerical identifier, which can be used to later refer to the compound in spreadsheets, emails, and Web pages. One application would be to use PubChem identifiers to replace CAS numbers in certain situations, although this approach is not without its own problems.
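To make the idea of referring to compounds by identifier concrete, here is a minimal sketch that turns a Compound identifier (CID) into a Web link, a spreadsheet formula, and a request for the auto-generated name. The URL layouts are assumptions based on PubChem's current site and its PUG REST interface, and the function names are hypothetical.

```python
# Assumed base address of the PubChem site.
PUBCHEM = "https://pubchem.ncbi.nlm.nih.gov"


def compound_page_url(cid: int) -> str:
    """Link to the summary page of a compound, suitable for emails or Web pages."""
    if cid <= 0:
        raise ValueError("a CID must be a positive integer")
    return f"{PUBCHEM}/compound/{cid}"


def spreadsheet_hyperlink(cid: int) -> str:
    """Formula for a clickable spreadsheet cell, using the HYPERLINK function."""
    return f'=HYPERLINK("{compound_page_url(cid)}", "CID {cid}")'


def iupac_name_url(cid: int) -> str:
    """PUG REST request for the systematic name of a compound (assumed layout)."""
    if cid <= 0:
        raise ValueError("a CID must be a positive integer")
    return f"{PUBCHEM}/rest/pug/compound/cid/{cid}/property/IUPACName/JSON"


# 2244 is the CID of aspirin.
page = compound_page_url(2244)
cell = spreadsheet_hyperlink(2244)
name_request = iupac_name_url(2244)
```

Storing the bare integer and generating links on demand keeps spreadsheets compact and avoids rewriting every cell if the site's URL layout changes.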
To date, the most significant use of PubChem in chemistry has not been by chemists, but rather by database developers who have used the freely downloadable PubChem dataset as a starting point for more specialized chemical databases and services. Some of these will be featured in future articles.
The Future of PubChem
Although the size of the PubChem database has grown dramatically over the last four years, the system's scope and capabilities (at least those visible to end users) have not changed much. PubChem remains a system in which external depositors add chemical structures in batch mode with limited optional metadata, and through which biological assay results can be published.
Many of the kinds of information that chemists find most interesting (links to the primary literature, characterization data in the form of spectra, solubilities, melting/boiling points, etc.) don't appear in PubChem. Although anything is possible, this situation seems unlikely to change. For that, we'll need to look to other services, both existing and to be developed.

