In the quiet foothills of Kentucky, a massive supercomputer is churning through data. It is hunting for new drugs to fight cancer.
Every week, the DataseamGrid processes 300 man-years worth of calculations. Yeah, that’s 300 years of calculations every week. Drug discovery usually takes 10 to 15 years, but the DataseamGrid blazes through that work in a fraction of the usual time. It is one of the largest pipelines of potential new cancer drugs in the country. Researchers here are about to start human trials this year of a new drug discovered by the supercomputer, which, if successful, may lead to an entirely new class of cancer drugs.
The DataseamGrid is the biggest grid of its kind on the planet — a monster machine that has received no national press, until now, which is remarkable for such a large, ambitious and successful project.
But the greatest innovation may be its architecture. The Grid’s processing power comes from 14,000 ordinary iMacs sitting in school classrooms spread across Kentucky. The DataseamGrid is a poor man’s supercomputer, strung together from classroom computers in impoverished public schools across the Bluegrass State.
Kids in one-third of Kentucky’s school districts use Dataseam computers every day. Few have any idea that the machine helping them understand fractions is also running a state-of-the-art drug discovery program in the background.
“It’s counterintuitive that this high-end research is being done in one of the poorest parts of the country,” said Brian Gupton, Dataseam’s co-founder and executive director.
The DataseamGrid is part of a new wave of grid-computing projects, which have come a long way since the ET-hunting SETI@home grabbed headlines in the 1990s. CERN in Switzerland has a 100,000-processor grid that crunches the 15 petabytes of data produced each year by the Large Hadron Collider. That grid contributed to the discovery of the Higgs boson “God particle” in 2012. In other parts of Europe, supercomputers like the European Grid Infrastructure and France’s Wisdom grid have sought drugs to fight bird flu, malaria and the AIDS virus.
In the U.S., the FightAIDS@Home project connects thousands of computers in a search for better anti-HIV drugs, while the Einstein@Home project is looking for evidence of gravitational waves. Einstein@Home is one of the most popular distributed computing projects, attracting more than 300,000 volunteers in 221 countries. In February 2014, the network clocked about 1,100 TFLOPS of computational power, placing it among the top 30 of the world’s TOP500 supercomputers.
But the DataseamGrid is markedly different from earlier grid computing projects like SETI@home, a screensaver that used spare CPU cycles to look for signs of alien intelligence in radio signals from outer space. SETI@home farmed out number-crunching tasks to a screensaver downloaded by thousands of volunteers, but the nodes were unconnected to each other. When the calculations were complete, each node uploaded the results to centralized servers, which tabulated the results. The task was distributed, not the supercomputer itself.
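The volunteer model described above can be sketched in a few lines of Python. This is only an illustration of the structure, not SETI@home’s actual code: the function names `make_work_units`, `volunteer_node` and `tabulate` are hypothetical, and the “analysis” is a stand-in that simply reports the peak value in a chunk of numbers. The point is that nodes never talk to each other, and only the central server sees every result.

```python
from typing import Dict, List, Tuple


def make_work_units(signal: List[float], unit_size: int) -> Dict[int, List[float]]:
    """Central server: cut one big dataset into independent work units."""
    return {
        unit_id: signal[start:start + unit_size]
        for unit_id, start in enumerate(range(0, len(signal), unit_size))
    }


def volunteer_node(unit_id: int, samples: List[float]) -> Tuple[int, float]:
    """One volunteer machine: process a single unit in isolation.

    Stand-in analysis (an assumption for illustration): report the peak sample.
    """
    return unit_id, max(samples)


def tabulate(results: List[Tuple[int, float]]) -> Tuple[int, float]:
    """Central server: combine the uploaded results and pick the strongest unit."""
    return max(results, key=lambda result: result[1])


# Made-up sample data standing in for a recorded radio signal.
signal = [0.1, 0.3, 0.2, 0.9, 0.4, 0.1, 0.2, 0.5]
units = make_work_units(signal, unit_size=3)

# Each call below could run on a different machine; none needs the others' data.
uploads = [volunteer_node(unit_id, samples) for unit_id, samples in units.items()]

best_unit, best_peak = tabulate(uploads)
print(f"strongest signal {best_peak} found in work unit {best_unit}")
```

In this arrangement, losing a node only delays its own work unit, because no other node depends on it.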
The DataseamGrid in Kentucky is more like a vast virtual machine that treats each node as one of its processing units. It coordinates work across the nodes, and what emerges is one big virtual supercomputer.
“Grid computing never earned respect,” says Gupton. “It was seen as a novelty, a nice way to make Mandelbrot sets. We’re legitimizing it as a computing method.”
Gupton is a distinctly blue-collar idealist-cum-entrepreneur. The son of a coal miner, he seems more interested in getting computers into classrooms than running cutting-edge cancer research, although he’s managing to do both.
In 2001, Gupton teamed up with Dr. John Trent, a cancer researcher at the University of Louisville, and two other local entrepreneurs: Dean Hughes and Henry Hunt, who is now Dataseam’s COO. They set out to build a supercomputer on the cheap.
Private grids are costly, as are supercomputers made from off-the-shelf hardware, like Virginia Tech’s Big Mac, which was decommissioned in 2012. The DataseamGrid is a unique public/private partnership, built largely using state economic development funds, established to diversify Kentucky’s economic profile and offset the state’s declining coal industry.
Kentucky is one of the poorest states in America. Its eastern section covers the Appalachian mountains, which have long been associated with hardship. As unemployment mushroomed, more stimulus money became available. Dataseam is using the funds to buy new computers for schools, and in exchange the schools join the grid. The computers serve a dual purpose: they equip some of the nation’s poorest school districts with new machines, and they give cancer researchers a powerful supercomputer to search for cancer drugs.
“We’re building a computing infrastructure,” says Gupton, “and Johnny can use this for his school work.”
The grid was originally built with Apple’s Xgrid, a software package for distributed computing baked into Mac OS X from 10.4 to 10.7. With a few tweaks and some proprietary software, Xgrid made setting up the system super simple: it was easy to add new nodes (a machine, a classroom, an entire school), and the system barely hiccuped if whole districts went offline.
Enabling Xgrid was a one-button click in OS X. The system auto-detected all the machines on the network that were available. If a network of computers was already in place, setting up a cluster was basically free — everything was already included in OS X or downloadable from Apple. Harvard and Stanford Universities have Xgrid clusters of about 400 or 500 computers each doing tasks like genome sequence searching and X-ray crystallography; but Dataseam is by far the largest.
Apple removed Xgrid from Mountain Lion (OS X 10.8), forcing Dataseam to write its own proprietary grid software.
The Dataseam Grid is a monster machine. It’s made from 14,000 desktop computers in 54 school districts spread across Kentucky (see map). The project affects about 100,000 kids and up to 8,000 teachers. The iMacs have dual-core processors running at 1.83GHz or better. They are more than powerful enough to run the grid processing in the background. Some schools have hundreds of machines on the grid; others have dozens. Most of the computers are relatively new.
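These numbers give a rough sanity check on the claim of 300 years of calculations every week. The sketch below is back-of-the-envelope arithmetic, not Dataseam’s own accounting. It assumes that a “year of calculation” means one processor running for one year, and that every machine is available around the clock, which is the best case.

```python
# Figures quoted in the article.
MACHINES = 14_000
CORES_PER_MACHINE = 2  # "dual-core processors"
CLAIMED_YEARS_PER_WEEK = 300

# Assumption: an average year of 365.25 days.
WEEKS_PER_YEAR = 365.25 / 7

# Upper bounds on compute delivered in one calendar week, if every
# machine ran flat out for all seven days.
machine_years_per_week = MACHINES / WEEKS_PER_YEAR
core_years_per_week = machine_years_per_week * CORES_PER_MACHINE

print(f"machine-years per week: {machine_years_per_week:.0f}")
print(f"core-years per week:    {core_years_per_week:.0f}")

# Share of the theoretical core capacity needed to reach the claimed figure.
utilization = CLAIMED_YEARS_PER_WEEK / core_years_per_week
print(f"implied utilization:    {utilization:.0%}")
```

Under these assumptions, the quoted figure sits between the per-machine and per-core ceilings, so it is plausible for a grid that shares its processors with classroom use.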
“Theoretically, we have some school districts with as much horsepower as Los Alamos,” says Gupton, referring to the nuclear testing supercomputers at Los Alamos National Laboratory.
The Dataseam Grid is wired together over the school districts’ gigabit fiber backbone. Over the last few years, the state of Kentucky has connected most of its schools with a very fast, high-bandwidth infrastructure. This was largely due to federal eRate funding from the FCC for backbone communication projects, which awards subsidies based on the number of students on free- or reduced-rate lunch programs. In many Kentucky schools, 80 to 90 percent of students are enrolled in those programs.
The grid is extremely flexible. Dataseam can keep adding and upgrading machines, adding a classroom or an entire school district.
“It’s a very cost effective way of getting it done,” says Gupton.
The Dataseam grid never stops. “We’re running 24/7, even when the kids are on the boxes,” says Gupton. The computing tasks can be processor intensive, but Gupton and Trent have to ensure they don’t overload the computers.
“Whatever we’re running can’t affect the client’s performance,” says Trent. “We can’t slow down one third of the school districts’ education computers.”
In the future, Gupton says, public/private grids like Dataseam will serve as utilities, offering on-demand number-crunching for big research programs. It’s a form of cloud computing. Just as Amazon rents out the spare capacity of its massive data centers, corporations and schools will be able to tap the unused CPU cycles on their networks.
“When Brian and I started the research, we had five processors,” says Trent. “It’s grown incredibly. Everything we did in the first year we can now do in one night.”
However, with Apple’s discontinuation of Xgrid, Mac-based supercomputers and clusters are becoming a rarity. The Virginia Tech supercomputer has been decommissioned, and I couldn’t find any mention of Macs in the latest Top500 list of the world’s fastest machines. Professor Jack Dongarra, who maintains the list, says there aren’t any. “I don’t think there are any Mac clusters on the Top500,” he said.
Trent and his colleagues are looking for chemicals that will disrupt or inhibit the growth of cancer. Based on the modeling techniques that just won the 2013 Nobel Prize for Chemistry, they’ve built a simulator that takes a 3D model of a cancer protein and matches it against a molecular model of a chemical. The grid moves the 3D models around, trying to fit them together like a 3D jigsaw puzzle. A match represents a potential anti-cancer drug. If the chemical binds strongly to the protein, it may be able to stop its growth in a cancer cell. They’re working with a library of 20 million chemicals.
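The screening step amounts to scoring every chemical against the target and keeping the best few for follow-up. The sketch below shows only that ranking logic and is not Trent’s simulator. The `screen` function, the compound names and the scores are all made up for illustration. It also assumes the common docking convention that a lower (more negative) score means stronger predicted binding. A real scoring function would come from the 3D fitting described above.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def screen(
    library: Iterable[str],
    score: Callable[[str], float],
    keep: int,
) -> List[Tuple[float, str]]:
    """Return the `keep` best-scoring compounds, strongest predicted binder first.

    Assumption: lower score = stronger predicted binding.
    """
    return heapq.nsmallest(keep, ((score(compound), compound) for compound in library))


# Toy data: hypothetical compounds with invented scores.
toy_scores = {
    "compound_A": -4.2,
    "compound_B": -9.1,
    "compound_C": -6.5,
    "compound_D": -1.0,
}

# Iterating a dict yields its keys, so the dict doubles as the "library".
hits = screen(toy_scores, toy_scores.__getitem__, keep=2)
for value, name in hits:
    print(f"{name}: {value}")
```

Because each compound is scored independently, the library can be split across as many machines as are available, with only the short list of hits sent back.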
Trent and his team are working with 14 other cancer research groups around the country and internationally. If they decide a target is amenable to modeling (and not all are), it goes on the grid to be matched against the chemical library. The grid can churn through 7 million compounds in one weekend. It produces about 1.6TB of data per run. The results are returned a few days later. If the grid returns promising matches, Trent’s team tests the chemicals physically in the lab. The cancer proteins and chemicals are mixed together in the laboratory to see if they bind, in what’s known as a functional assay. If they do, the chemical, now a potential anti-cancer drug, may proceed into a long series of trials to test its suitability for treating human cancer.
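The throughput figures can be broken down per machine with some simple arithmetic. The sketch below is an estimate under stated assumptions, not measured data: it treats a weekend as 48 hours, assumes all 14,000 machines take part in a run, takes one run to cover 7 million compounds, and counts a terabyte as 10^12 bytes.

```python
import math

# Figures quoted in the article.
COMPOUNDS_PER_RUN = 7_000_000
LIBRARY_SIZE = 20_000_000
MACHINES = 14_000
DATA_PER_RUN_BYTES = 1.6e12  # 1.6TB, assuming 1TB = 10**12 bytes

# Assumption: a "weekend" is 48 hours of wall-clock time.
WEEKEND_SECONDS = 48 * 3600

# Work handled by each machine if the run is spread evenly.
compounds_per_machine = COMPOUNDS_PER_RUN / MACHINES

# Average wall-clock time a machine can spend on each compound.
seconds_per_compound = WEEKEND_SECONDS / compounds_per_machine

# Average output generated per compound screened.
kilobytes_per_compound = DATA_PER_RUN_BYTES / COMPOUNDS_PER_RUN / 1000

# Weekend-sized runs needed to cover the whole chemical library.
runs_for_library = math.ceil(LIBRARY_SIZE / COMPOUNDS_PER_RUN)

print(f"compounds per machine:  {compounds_per_machine:.0f}")
print(f"seconds per compound:   {seconds_per_compound:.1f}")
print(f"output per compound:    {kilobytes_per_compound:.0f} KB")
print(f"runs for full library:  {runs_for_library}")
```

On these assumptions, each machine has several minutes to spend on each compound, which fits the description of the work running in the background on classroom computers.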
“The drug discovery stuff takes a few days,” says Trent. “The clinical testing takes years.”
Dataseam speeds up the drug discovery process by orders of magnitude. Ten years ago, this work would have been performed by teams of grad students using Petri dishes. They’d mix compound after compound to see if there was a binding. Now, those same tests can be simulated and bindings tested virtually. All the research is controlled by a single iMac in Trent’s office.
There are only two or three other cancer research centers in the U.S. doing similar high-throughput screenings using supercomputers, among them Georgetown, the University of Michigan and University of California at San Francisco.
So far, the DataseamGrid has examined more than 250 different cancer targets and found 30 chemicals that have been verified clinically. One of the potential drugs discovered by the grid is about to go into human trials. It’s for treating solid tumors; that is, most cancers other than leukemia. Trent says if successful, the compound may represent a new class of anti-cancer drug. Ironically, some of the research program’s most promising anti-cancer compounds are derived from the tobacco plant – one of the state’s biggest cash crops.
However, the clinical testing process is the hard part. Two potential drugs previously discovered at the University’s cancer center were in Phase II testing until last year. The company running the tests ran into financial trouble and the trials are on hold. Trent says he hopes one of the drugs will reenter tests later this year. The other is in clinical limbo. “That’s the drug business,” he acknowledges.
Trent hopes that grid computers can revolutionize drug discovery for rarer cancers, which don’t attract the attention of pharmaceutical companies because of the economics. He’s particularly interested in childhood cancers. “There’s no money in research for kids,” he says. “Cancer is older person’s disease.”
Kentucky is an unlikely place to be running a state-of-the-art medical research program. The state is known for its bourbon, fried food and tobacco, and cancer rates here are 220% higher than the national average. It leads the nation in lung and colorectal cancers, as well as heart disease.
“There’s a lot of smoking and poor dietary habits here,” says Gupton.
Kentucky is very poor. Median state income is $42,000 per family, but can dip to $10,000 for the poorest counties, where unemployment soars to 20% and nearly 40% of the population live below the national poverty line. If you look at a map of educational attainment against income level, “large swathes of Kentucky look like Indian reservations in New Mexico,” says Gupton.
Gupton recalls the reaction of a member of his hometown school board when first told about the Dataseam Grid. “He’s a farmer. Eaten up by cancer. He was alive just to spite it. Overalls. Muddy boots. He said, ‘I don’t know anything about computers, but if it’ll help fight cancer, we’ll help to put it in tonight.’”
Gupton continues: “I hate to sound like we’re unhealthy and uneducated, but we have our challenges.” Employment in Kentucky has traditionally been coal mining. But as the industry has shrunk, jobs have disappeared and new ones haven’t materialized. Lack of education only compounds the problem. “But we also have our opportunities,” he adds.
“Educate the workforce, and the companies will come,” says Henry Hunt, Dataseam’s chief operating officer. “Before, these kids’ future was mining, maybe. Now we’re sending kids to college with scholarships.”
Education is really what gets the Dataseam team fired up. This is the part of the program that Gupton is most passionate about. Finding anti-cancer drugs is cool, but giving kids an education is what really drives him.
Some of the schools were running Macs from the LC era. Now the kids are producing podcasts, newsletters and digital videos. They’re streaming basketball games, communicating with other kids and troops overseas, and using the new computers for schoolwork.
Dataseam awards about 20 paid scholarships a year, sending students to two local universities — University of Louisville and Morehead State University. It has also provided training and career development for about 6,000 teachers.
It has also trained more than 112 school IT technicians with Apple professional certifications, giving Kentucky the highest concentration of Apple-certified technicians in the U.S. on a per capita basis.
“We’re growing minds,” Gupton says. “It’s not about having a strong back. It’s about whether you can think.”
He continues: “The jobs we’re trying to create are the ones that create a new economy in the state and sustainable wealth. We’re driving economic development. We’re building human infrastructure and capital.”
Gupton talks about a man from Martin County, where Lyndon B. Johnson declared the war on poverty in 1964. He started out as a janitor in the school system making less than $14 an hour. He studied to be a certified Apple systems engineer, a program supported by Dataseam. Now he works in a local school district as a certified technology professional, making $50,000 to $60,000 a year. “This guy is one of the successes of the program,” says Gupton. “He shows what education is all about. It’s had a tremendous impact on his life and his family.”
“It used to be about seams of coal,” Gupton says. “Now it’s seams of data.”