Solving mysteries with string algorithms - Channel Magazine

By Michael Soltys, Ph.D., Chair and Professor of Computer Science

String algorithms and Michael Soltys

Over the past several years, I have become interested in a field of Computer Science known as String Algorithms. It is an area of research that has come to the fore because it provides solutions to three very important applications of computing: analysis of the genetic code, data mining and text processing. The last one is perhaps the most familiar to the typical user of computers, who does text processing such as searching of text, replacing of text and compression of files. All these operations are enabled by applications running SA.

We know from genetics that the blueprint of all living organisms can be encoded with very long words (strings) over a four-letter alphabet, (A, G, C, U). In order to understand how these words define living organisms and how mutations and malformations in these words are responsible for traits and diseases, we need to understand the structure of these words, especially their patterns and repetitions. This is where biology and SA meet, and, in a very real sense, SA provides a glimpse into the secrets of life.

The third application is in the area of big data. We all witness daily that there is an overwhelming flow of information in our knowledge-based society. The questions is: How do we interpret this information, that is, how do we extract knowledge out of this relentless flow of bits? This is the problem of data mining and SA provides answers as researchers discover faster and faster algorithms that are able to process extremely long words, extracting structural information about them and gleaning meaning hidden underneath layers and layers of data.

There are many contributions to be made in the exciting field of SA. Over the past few years I have had the great luck of being able to solve an open problem about one of the most basic operations on strings: shuffle. Shuffle takes two words and combines them into one in such a way that the order of letters in the original words are preserved. For example, the two words aaa and bbb can shuffle into ababab. This seemingly simple operation has countless applications, from optimization to concurrency. Together with Professor Sam Buss of the University of California, San Diego, we were able to give a long-sought impossibility result showing that there do not exist fast algorithms for deciding whether a given word is the shuffle of the same smaller word with itself. That is: square roots are hard for words.

This work and other results created an exciting collaboration between King’s College London and CI, and, under the auspices of the British Royal Society, we will be able to have a vibrant exchange program between scientists and students at both institutions. One of the nicer aspects of the field of SA is that it is new, and as such, it is relatively easy to join for newcomers, which is the case of most students. One can quickly start participating in exciting research. As such, the field is ideal for undergraduate research initiatives, but of course it also offers challenging problems for graduate students and faculty.