In many applications, e.g. biological applications, one wants to compare or match two sequences. For example, two organisms have their own DNA sequence [WIKI] and a comparison would reveal their similarity based on molecules called bases (Adenine, Guanine, Cytosine and Thymine). Looking for longest common substring could be of limited use, since some random insertions may hinder the matching score. A similarity measure obtained from Longest Common Subsequence algorithm output finds a substring such that the order of matching elements is preserved but is not necessarily consecutive.

There is a number of algorithms solving this Dynamic Programming [WIKI] problem which is generally *O(mn)*. The optimization may be done either in time or space while the complexity of the algorithm still remains the same. I developed a straightforward version of the algorithm for MATLAB [WWW] in MEX which uses memory to gain some speed. Sure, there is a pure MATLAB implementation [HERE] but it uses heavily the for loops which will slow down the execution.

Download the source code and compiled MEX file for Mac 64bit [HERE]. The implementation should be considered as beta since it has been verified to work on few simple, easy to verify examples.

The function expects two int32 arrays. Many existing solutions work with character strings only and thus limited to 255 element vocabulary. This implementation allows to use much richer vocabulary (more than 4 billion).

If you want to improve the straightforward version, take a look at Generalized Suffix Trees [WWW] solving the LCS problem [HERE] with some code.

I think your blog needs some fresh posts. Writing manually takes a lot of time, but

there is tool for this boring task, search for; Boorfe’s tips

unlimited content