Tech by VICE

This Classic Algorithmic Problem Can't Get Any Less Complex (Deal With It)

Researchers offer a new proof that "edit distance" can't be solved any more efficiently.

by Michael Byrne
Jun 15 2015, 11:10am

The human genome, printed. Image: Adam Nieman/Flickr/Flickr

A group of MIT computer scientists has proven—presented a proof for—a vexing and outstanding question in algorithmic complexity related to a rather dry-sounding but also completely fascinating and weird problem known as "edit distance." For some 40 years, researchers have puzzled over whether or not the problem can be solved efficiently or not. It looks an awful lot like "not," but until now we haven't known for sure.

The problem is this: Take two strings of characters in a given alphabet—which can just be viewed as a particular finite set of representable things, whether it's ASCII characters or binary 1s and 0s or any other sort of set you might possibly imagine—and determine how dissimilar they are. Or, more formally, how they are different in terms of the number of operations required to make one string the same as another string. Operations in the edit distance problem are limited to addition, deletion, and substitution.

So, the edit distance between "michael" and "byrne" in the alphabet of ASCII characters is 5 substitutions plus 2 additions: the distance is 7. Seems easy enough, right? It's easy in trivially small cases, but applied to strings as long as, say, genomes, the problem becomes very complex; prohibitively so.

Just a small part of the human genome. Image: Holly Gramazio/Flickr

Edit-distance can be solved in what's known as quadratic time. Given two strings with two lengths, there is a method, the Wagner-Fischer algorithm, that can find the dissimilarity between the two in time proportional to the product of the two lengths. Given two eight-character strings, it will take 64 operations to return a correct result. Double the length of each of these strings, and the running time quadruples. Hence, quadratic time.

"A considerable effort has been invested into designing faster algorithms, either by assuming that the edit distance is bounded, by considering the average case, or by resorting to approximation," Piotr Indyk and Arturs Backurs, the two MIT computer scientists behind the new proof, write in an open-access paper posted to the arXiv pre-print server. The best algorithm yet devised is barely a shade faster than quadratic—remaining functionally quadratic then—suggesting the there exists a lower bound on the complexity of edit distance.

The Wagner-Fischer algorithm, still the classic and simplest approach to solving the edit distance problem, is reasonably simple to explain: take each string, string x and string y, and create a x X y grid, with each cell containing the operation required to make a character in string y the same as the corresponding character in string x. This results in x * y comparisons. Worth noting is that Wagner-Fischer is a dynamic programming algorithm—a favorite problem-solving paradigm among many algorithmists—which means that it avoids doing recalculations by implementing some storage or memory scheme, like putting earlier sub-solutions in a list.

Wagner-Fischer doesn't sound so bad, but figure that comparing two human genomes exhaustively would still take about 1,000 years (given some 3 billion base pairs). This relative inefficiency has made edit distance a hot problem to improve on, and a sub-quadratic time algorithm would go a long way in a large number of IRL research fields, particularly within bioinformatics. That is, if such an algorithm were to exist, which it doesn't, according to the MIT group, who will present their findings at next week's ACM Symposium on Theory of Computing. Turns out this has some deeper implications for computational complexity in general.

In a deeper sense, what the proof purports to show is that if the edit distance problem can be solved in sub-quadratic time, then the class of problems known as "NP-complete" problems can be solved in sub-exponential time. These are the problems that get truly, deeply burly: Rather than taking n2 operations to solve (where n is the size of the input), as in a quadratic problem, an exponential problem will take 2n operations. (A quadratic problem is also exponential in that it can be solved in exponential time, but not the other way around.)

And if an NP-complete problem can be solved in sub-exponential time, then the strong exponential time hypothesis (SETH)—which states that the family of most-general problems in NP-complete cannot be solved in sub-exponential time—is bunk. This if-then—if edit distance is sub-quadratic, then exponential problems are sub-exponential—is a pretty good argument for edit distance not having a sub-quadratic solution.

"There are lots of people who have been working on this problem, because it has a big practical impact."

All of this folding problems into other problems has to do with a complexity concept called reducibility. It's kind of a tricky concept, but the gist is that the complexity of different problems can be proved by showing that some problems can be reduced or generalized to other problems. The ur-NP-complete problem is called satisfiability (SAT): take a bunch of AND/OR logical statements/constraints and put them together—a satisfiability algorithm will tell you if it's possible to come up with some combination of true/false values that might make the statement true.

The proof shows that, by essentially breaking apart a bunch of logical statements/constraints into two substatements, satisfiability can be seen as a version of edit distance, where each substatement is like a string to be compared in an edit distance problem. The result is that if edit-distance is sub-quadratic, satisfiability is sub-exponential. Which violates SETH, which is most assuredly not violate-able.

"The rationale behind [SETH] is that, despite decades of research on fast algorithms for satisfiability and related problems, no algorithm was yet shown to run in time faster than [exponential]," Indyk and Backurs write. "Because of this state of affairs, SETH has served as the basis for proving conditional lower bounds for several important computational problems. "Our paper builds on these works, identifying a new important member of the class of 'SETH-hard' problems." Edit-distance just doesn't get any easier and let's just deal with it.

So, yeah: phew. Complexity is weird and kind of inscrutable (but incredibly important). Let's just leave it at that for today.

"This is really nice work," offered Barna Saha, an assistant professor of computer science at the University of Massachusetts, in a statement from MIT. "There are lots of people who have been working on this problem, because it has a big practical impact. But they won't keep trying to develop a subquadratic algorithm, because that seems very unlikely to happen, given the result of this paper."

Tagged:
Tech
Motherboard
computer science
Complexity
algorithms
Genome sequencing
motherboard show
computing
genomes
Arturs Backurs
Piotr Indy
dynamic programming
edit distance
np-complete