Software engineers optimize code to accelerate machine learning research at Princeton

Written by
Allison Gasparini, Center for Statistics and Machine Learning
March 29, 2024

In laboratories dealing with computational research, the computer programs used by researchers are sometimes many years old, originally written in outdated programming languages and passed down through several generations of students. If a person doesn’t have a background in software development, they may not know how to write code in a way that is understandable to all. What’s more, each new researcher who joins the lab may write their own version of the research code. The result can be a tangle of messy code that’s difficult for outside researchers to understand, much less work with.

“This is a ubiquitous challenge in any academic setting where you have students and postdocs creating research-grade code that you hope can have a broader contribution to the community,” said Michael Skinnider, an assistant professor at Princeton University’s Lewis-Sigler Institute for Integrative Genomics. So how can academics take code that is messy, obscure, or both and turn it into reproducible, sustainable software that other researchers can use?

Across the Princeton campus, a diffuse group of research software engineers are tackling this challenge head-on. Stationed across departments, the RSEs — as they’re referred to in shorthand — work to support computational research projects run by faculty. Skinnider himself uses machine learning software in his research on substances known as “designer drugs.” Hoping to make his tool usable to other researchers, Skinnider took his software to the RSEs at the Center for Statistics and Machine Learning (CSML).

“I’ve never really been at an institution before that had anything quite like the research software engineer program,” said Skinnider, who began his position at Princeton in September 2023. “It’s been a fantastic experience.”

Making code reproducible, maintainable, and sustainable

Skinnider’s project is concerned with identifying the designer drugs emerging in the illicit market. These substances often have unknown chemical structures and are created in an attempt to skirt the law while producing effects similar to those of common illegal substances. The wave of designer drugs has proven to be dangerous and even fatal. Skinnider’s research has shown that generative AI could help anticipate the chemical structures of designer drugs that are likely to hit the market.

 

Schematic showing how generative model predicts chemical structure of a designer drug

Given just the molecular mass of the unknown compound, a generative model is able to predict the most likely chemical structure of a designer drug in a patient's bloodstream. Image courtesy of Michael Skinnider, LSI, Princeton University

Given the urgent implications of this research, it’s easy to see why researchers in other fields would want to take advantage of Skinnider’s software to help tackle the designer drug epidemic. Anushka Acharya, one of CSML’s two in-house RSEs, has been working with Skinnider’s software since the fall to package it into user-friendly, installable code.

Before she started working on the software, Acharya said, anyone who wanted to use it needed background knowledge of how to run it, especially because it required human intervention as it ran. “It would’ve been a lot to learn for anyone who was just looking to get down to the science,” said Acharya. “Now, we’ve packaged it in a way you can install and put information into the software pipeline and all the researchers have to do is wait for the results.”

Skinnider’s code still serves its original purpose. “It just runs faster, it’s more readable,” said Acharya. Should Skinnider want to reach out to other software engineers in the future to add new features to his code, Acharya has tweaked it so that “it’ll be easier for them to read and understand what the code was originally doing,” she said.

Anushka Acharya and Vineet Bansal talk with Michael Skinnider

Photo by Allison Gasparini

While Acharya has been working with Skinnider, her fellow CSML RSE, Vineet Bansal, has been collaborating with Christine Allen-Blanchette, an assistant professor in the Department of Mechanical and Aerospace Engineering. Bansal is focused on ensuring that Allen-Blanchette’s code is sustainable, that is to say, coherent enough for students and engineers to keep working with it over time. “I’m concerned with the minutiae of what’s going on under the hood and how can people use this system longer term?” said Bansal.

Right now, video content generated by AI often fails to adhere to the laws of physics. Allen-Blanchette is developing a system that generates realistic videos of physics-inspired simulations, such as the movement of a ball as it swings on a pendulum. “You can probably look at an artificial video and tell whether you’re looking at a real pendulum or a made up one because of the way it moves and how fast it moves,” said Bansal. Allen-Blanchette’s motivation is to create a neural network that allows these videos to take into account principles like the conservation of energy and momentum.

 

Anushka Acharya and Vineet Bansal looking at computer

Anushka Acharya and Vineet Bansal work as research software engineers out of the Center for Statistics and Machine Learning, collaborating with faculty across the Princeton campus to help them optimize their research code. Photo by Allison Gasparini

“I wrote this codebase and sometimes in prototyping, the code gets unwieldy,” said Allen-Blanchette. “Vineet definitely made the code more legible and also made it into a package so that it can be installed.” If code is difficult to understand, other researchers won’t be able to use it even if it is made open source. “We always want to open source our code,” said Allen-Blanchette. “Vineet’s been helping with that.”

The system built by Allen-Blanchette is trained on thousands of videos of simulated physical systems, like pendulums. After spending long hours optimizing the code to get the best output, Bansal said, “it’s exciting to actually see these generated videos.”

Bansal said Allen-Blanchette’s code is a good example of the typical project in which he and Acharya can take machine learning code and make it more reproducible, maintainable, and sustainable. “Between Skinnider’s code and Christine’s code, we have codes in shape where we can be almost 100 percent sure there’s no bugs in the code,” said Bansal. “And we can keep producing reliable results day after day, week after week.”