BigSMILES

A compact yet robust structure-based identifier or representation system, together with a widely accepted format for encoding data, is a key enabler for efficient sharing and dissemination of research results within the chemistry community. Such systems lay the foundations for future informatics and data-driven research. While substantial advances have been made for small molecules, the polymer community has struggled to develop an efficient representation system. The reason is that, unlike in other areas of chemistry, the basic premise that each distinct chemical species corresponds to a well-defined chemical structure does not hold for polymers. Polymers are intrinsically stochastic: a polymer is often an ensemble of molecules with a distribution of chemical structures. This limits the applicability of the deterministic representations developed for small molecules. In the Olsen Group, we work on designing new representation systems that can handle the stochastic nature of polymers. In particular, we developed BigSMILES, a line notation designed specifically for polymers. The notation is based on the popular "simplified molecular-input line-entry system" (SMILES), with additional features to accommodate the stochastic nature of polymers. BigSMILES aims to provide representations that can serve as indexing identifiers for entries in polymer databases. Alongside the BigSMILES line notation, other essential components, such as a data encoding standard for polymer characterization, a molecular string generator, and other relevant standards and suites of code, are also key thrusts of the overall polymer representation project.
Ultimately, we hope that the resulting system will provide a more effective language for communication within the polymer community, increase cohesion among its researchers, provide an interface between polymer scientists and computer scientists, and speed up data-driven research on polymers.
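
To make the idea of a SMILES-based notation with stochastic parts more concrete, the following Python sketch separates a string into ordinary SMILES fragments and curly-brace "stochastic objects", then splits a stochastic object into repeat units and end groups. It is an illustration only, not the project's parser. It assumes the general pattern described in the BigSMILES publication (stochastic objects enclosed in `{...}`, repeat units separated by commas, end groups listed after a semicolon, bonding descriptors such as `[$]`). The function names `split_stochastic_objects` and `split_units` are hypothetical, the example string is an assumed one written in that pattern, and nested stochastic objects are not handled by `split_units`. The official documentation remains the reference for the exact syntax.

```python
def split_stochastic_objects(bigsmiles):
    """Split a string into ('smiles', text) and ('stochastic', text) segments.

    Hypothetical helper: only top-level curly braces are treated as
    stochastic-object delimiters; the text inside is returned unparsed.
    """
    segments = []
    depth = 0   # current nesting depth of curly braces
    start = 0   # index where the current segment begins
    for i, ch in enumerate(bigsmiles):
        if ch == "{":
            if depth == 0:
                if i > start:
                    segments.append(("smiles", bigsmiles[start:i]))
                start = i + 1
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unmatched '}'")
            if depth == 0:
                segments.append(("stochastic", bigsmiles[start:i]))
                start = i + 1
    if depth != 0:
        raise ValueError("unmatched '{'")
    if start < len(bigsmiles):
        segments.append(("smiles", bigsmiles[start:]))
    return segments


def split_units(body):
    """Split the inside of a stochastic object into repeat units and end groups.

    Assumes commas separate units and an optional semicolon precedes the
    end-group list; nested stochastic objects are not supported here.
    """
    repeat_part, _, end_part = body.partition(";")
    repeat_units = [u for u in repeat_part.split(",") if u]
    end_groups = [u for u in end_part.split(",") if u]
    return repeat_units, end_groups


if __name__ == "__main__":
    # Assumed example: a random copolymer with two repeat units.
    example = "{[$]CC[$],[$]CC(CC)[$]}"
    for kind, text in split_stochastic_objects(example):
        if kind == "stochastic":
            print(kind, split_units(text))
        else:
            print(kind, text)
```

The sketch keeps the deterministic SMILES fragments untouched and isolates only the stochastic parts, which mirrors the way the notation extends SMILES rather than replacing it.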

Highlighted Publications

Lin, T.-S. et al. (2019). BigSMILES: A Structurally-Based Line Notation for Describing Macromolecules. ACS Central Science, 5(9), 1523-1531. Link

BigSMILES Project official webpage: Link

BigSMILES Line Notation documentation: Link