DO YOU speak chemistry? Analysing molecular structures as if they were sentences has revealed hidden "words" that are key to their make-up.
The approach suggests that algorithms like the ones Google uses in search engines might reveal new ways to make molecules or invent drugs.
Linguists can analyse text by ranking
words according to how often they appear. This "bag-of-words" approach
can help to distinguish different kinds of texts. For example, the word
rankings for spam emails are different to those of genuine messages,
which helps algorithms filter out spam.
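As a rough illustration of the bag-of-words idea, the Python sketch below ranks the words in two short texts by frequency. The sample sentences and the `word_ranking` helper are invented for this example; a real spam filter would compare rankings across many messages.

```python
from collections import Counter
import re

def word_ranking(text):
    """Return (word, count) pairs, most frequent first, ignoring word order."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

# Hypothetical sample texts, made up for illustration only.
spam = "win cash now win a free prize now"
genuine = "are we still meeting for lunch at noon"

spam_rank = word_ranking(spam)
genuine_rank = word_ranking(genuine)

print(spam_rank[:2])     # the two most frequent words in the spam text
print(genuine_rank[:2])  # the two most frequent words in the genuine text
```

The two rankings look quite different, and it is that difference a filter can exploit.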
Bartosz Grzybowski of Northwestern University in Evanston, Illinois,
and his colleagues
wondered if a similar approach could find the most important parts of a
molecule. "Chemistry is about recognising certain patterns of atoms," he
says. "In linguistics we also recognise patterns: those are words."
The team took thousands of molecules, each
representing a sentence, and applied a bag-of-words algorithm. By
comparing pairs of molecules, they noted arrangements of atoms that
appear in both, like a ring of carbons or a particular group that
connects to oxygen and hydrogen, and ranked the frequency of these
common fragments.
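The pairwise comparison can be sketched in Python on toy data. This is a simplification under stated assumptions: each molecule is reduced by hand to a set of named fragments, whereas the team worked on actual molecular structures, and the `molecules` table and `rank_shared_fragments` function are hypothetical names for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each molecule reduced by hand to a set of fragments.
molecules = {
    "phenol": {"benzene ring", "hydroxyl"},
    "benzoic acid": {"benzene ring", "carboxyl"},
    "ethanol": {"ethyl", "hydroxyl"},
    "toluene": {"benzene ring", "methyl"},
}

def rank_shared_fragments(mols):
    """Count how often each fragment is shared by a pair of molecules."""
    counts = Counter()
    for (_, a), (_, b) in combinations(mols.items(), 2):
        counts.update(a & b)  # fragments present in both molecules of the pair
    return counts.most_common()

ranking = rank_shared_fragments(molecules)
print(ranking)
```

Fragments that never turn up in two molecules at once, such as the carboxyl group here, drop out of the ranking altogether.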
The researchers had expected the "words" to
correspond to functional groups – clusters of atoms that chemists
recognise as controlling a molecule's chemical reactions. But
surprisingly, it was other, larger fragments that seemed to make up
crucial chemical "words". The distribution of the functional groups was
much less language-like.
To test the resulting chemical dictionary,
the team borrowed the logic of search engines to find the
fragments that carry the most information. Search engines serve up the
most relevant sites by looking at how often your search term appears on a
particular page in comparison with the internet as a whole. For
example, the word "the" crops up frequently across all internet pages,
so it doesn't carry much weight in determining what a page is about. The
same technique is used to produce "word clouds" that can visually
summarise a document.
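One standard way to express this weighting is TF-IDF (term frequency times inverse document frequency). The Python sketch below assumes that formula for illustration; the article does not say exactly which weighting the team or any search engine uses, and the three sample "pages" are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # number of documents containing each term
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # frequency within this document, damped by how common the term is overall
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical mini "internet" of three pages.
pages = [
    "the ring of the carbons".split(),
    "the spam filter".split(),
    "the search engine ranks the page".split(),
]
scores = tf_idf(pages)
print(scores[0])
```

Because "the" appears on every page, its score is zero, while a word confined to one page, such as "ring", gets a positive weight. Swapping words for molecular fragments and pages for molecules gives the flavour of the chemical version.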
So the team ran an algorithm to identify
the fragments in each molecule with the highest information content. When
chemists synthesise complex molecules from scratch, they look for key
bonds that join simpler compounds, which can then serve as building blocks. It
turned out that the bonds connecting the most informative fragments were
often these key bonds.
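How bonds might be ranked from fragment scores can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the fragment labels, the scores, and the rule of adding the scores of the two fragments a bond joins are not taken from the paper.

```python
# Hypothetical information scores for four fragments (made-up numbers).
fragment_score = {"A": 0.9, "B": 0.7, "C": 0.2, "D": 0.1}

# Hypothetical bonds, each joining two fragments of the same molecule.
bonds = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]

def top_bonds(bonds, scores, k=3):
    """Rank bonds by the summed score of the fragments they join (assumed rule)."""
    ranked = sorted(bonds, key=lambda b: scores[b[0]] + scores[b[1]], reverse=True)
    return ranked[:k]

candidates = top_bonds(bonds, fragment_score)
print(candidates)
```

The three candidates returned play the role of the "top three bonds" that a panel of chemists could then assess.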
In a test of 68 molecules, a panel of 10
chemists agreed that, for 66 of them, one of the top three bonds the
algorithm chose was an important bond. "The most informative ones
appear to be the best," says Grzybowski (Angewandte Chemie, doi.org/f2s2vc). This shows the algorithm, with no chemical knowledge of its own, can replicate some of the skill of human chemists.
"We're trained to recognise patterns as organic chemists, and the patterns are related to the functionalities," says Robert Paton
of the University of Oxford. "This approach is obviously not limited by
the constraints of the human mind, so it can pick out unique fragments
that you might not always spot."
That insight could help to shake up the way compounds are made. Grzybowski previously developed Chematica,
software that wires up millions of molecules via their reaction
pathways in a sort of chemical internet capable of identifying quicker
and cheaper ways to produce substances. He is already using Chematica to
help pharmaceutical companies to simplify their drug production, and
plans to add the linguistic algorithm to the software.
Phil Blunsom, a computational linguist at the University of Oxford,
says it isn't
clear that chemistry is acting like a language, but the tools might
prove useful for data analysis. "It's always interesting to get
inspiration from techniques used in different fields, but that's
different to saying they are in some way the same problem."
Grzybowski thinks the language approach
has further promise. He plans to investigate more advanced linguistic
tools, like algorithms that measure how often words appear together or
in particular sequences, to see if they can also give context to
chemistry.
"They are setting a very interesting foundation that allows people to think differently," says Lee Cronin
of the University of Glasgow, UK, although he adds that the approach
still needs to prove itself. "If by doing a linguistic analysis you are
able to find a hidden pattern that allows you to make a new class of
molecule, then I think this is an incredibly profound development."