Using just the language in millions of old scientific papers, a machine learning algorithm was able to make completely new scientific discoveries.
In a study published in Nature on July 3, researchers from the Lawrence Berkeley National Laboratory used an algorithm called Word2vec to sift through scientific papers for connections humans had missed. Their algorithm then spit out predictions for possible thermoelectric materials, which convert heat to electricity and are used in many heating and cooling applications.
The algorithm didn’t know the definition of thermoelectric, though. It received no training in materials science. Using only word associations, the algorithm was able to provide candidates for future thermoelectric materials, some of which may be better than those we currently use.
“It can read any paper on material science, so can make connections that no scientists could,” researcher Anubhav Jain said. “Sometimes it does what a researcher would do; other times it makes these cross-discipline associations.”
To train the algorithm, the researchers assessed the language in 3.3 million abstracts related to materials science, ending up with a vocabulary of about 500,000 words. They fed the abstracts to Word2vec, which used machine learning to analyze relationships between words.
“The way that this Word2vec algorithm works is that you train a neural network model to remove each word and predict what the words next to it will be,” Jain said. “By training a neural network on a word, you get representations of words that can actually confer knowledge.”
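Jain's description matches what is usually called the skip-gram setup, in which each word is used to predict its neighbors. The sketch below shows only how the training pairs could be generated. The sample sentence, the window size, and the function name `skipgram_pairs` are illustrative assumptions, not details from the study.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbors.

    A neighbor is any token at most `window` positions away from the center.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Made-up sentence for illustration; real input would be tokenized abstracts.
sentence = "bi2te3 is a thermoelectric material".split()
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

A neural network trained to predict the second item of each pair from the first ends up with one vector per word, and those vectors are the representations Jain refers to.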
Using just the words found in scientific abstracts, the algorithm was able to capture concepts such as the periodic table and the chemical structure of molecules. The algorithm linked words that were found close together, creating vectors of related words that helped define concepts. In some cases, words were linked to thermoelectric concepts but had never been written about as thermoelectric in any abstract the researchers surveyed. This gap in knowledge is hard to catch with a human eye, but easy for an algorithm to spot.
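One way to picture that search for gaps: score every material by how close its vector sits to the vector for "thermoelectric," then drop the ones the literature has already described that way. The sketch below assumes cosine similarity as the closeness measure. The three-number vectors, the material names, and the `already_reported` set are all invented for illustration; real embeddings are learned from text and have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


# Hypothetical embeddings, hand-written rather than learned.
embeddings = {
    "thermoelectric": [0.90, 0.10, 0.20],
    "material_a": [0.80, 0.20, 0.10],
    "material_b": [0.85, 0.05, 0.30],
    "material_c": [0.10, 0.90, 0.40],
}

# Hypothetical set of materials already described as thermoelectric.
already_reported = {"material_a"}


def rank_candidates(embeddings, target, known):
    """Rank words not in `known` by similarity to `target`, best first."""
    scores = [
        (name, cosine(vec, embeddings[target]))
        for name, vec in embeddings.items()
        if name != target and name not in known
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranking = rank_candidates(embeddings, "thermoelectric", already_reported)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data, material_b lands at the top of the list: it sits close to "thermoelectric" yet has never been reported as one, which is the kind of gap the article describes.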
After showing the algorithm's capacity to predict future materials, the researchers took their work back in time, virtually. They scrapped recent data and tested the algorithm on old papers, seeing if it could predict scientific discoveries before they happened. Once again, the algorithm worked.
In one experiment, researchers analyzed only papers published before 2009 and were able to predict one of the best modern-day thermoelectric materials four years before it was discovered in 2012.
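A back-in-time test of this kind comes down to a date cutoff: train only on the older papers, then check the predictions against what was published afterward. The sketch below shows that bookkeeping under simplifying assumptions. The records, the compound names, and the `predictions` list are invented, and the training step itself is left out.

```python
def split_by_year(records, cutoff):
    """Separate records published before `cutoff` from those published later."""
    train = [r for r in records if r["year"] < cutoff]
    held_out = [r for r in records if r["year"] >= cutoff]
    return train, held_out


def confirmed_later(predictions, held_out, keyword):
    """Return predictions that appear alongside `keyword` in a later record."""
    hits = []
    for name in predictions:
        for record in held_out:
            words = record["text"].split()
            if name in words and keyword in words:
                hits.append(name)
                break  # one confirming record is enough
    return hits


# Invented records standing in for dated abstracts.
records = [
    {"year": 2005, "text": "compound_x shows unusual band structure"},
    {"year": 2007, "text": "compound_y is a thermoelectric material"},
    {"year": 2012, "text": "compound_x is a promising thermoelectric"},
    {"year": 2014, "text": "compound_z used in batteries"},
]

train, held_out = split_by_year(records, cutoff=2009)

# Hypothetical output of a model trained on `train` only.
predictions = ["compound_x", "compound_z"]

hits = confirmed_later(predictions, held_out, "thermoelectric")
print(hits)
```

Here compound_x counts as a hit because a paper after the cutoff describes it as thermoelectric, while compound_z does not.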
This new application of machine learning goes beyond materials science. Because it’s not trained on a specific scientific dataset, you could easily apply it to other disciplines, retraining it on literature of whatever subject you wanted. Vahe Tshitoyan, the lead author on the study, says other researchers have already reached out, wanting to learn more.
“This algorithm is unsupervised and it builds its own connections,” Tshitoyan said. “You could use this for things like medical research or drug discovery. The information is out there. We just haven’t made these connections yet because you can’t read every article.”