Detecting Copy-and-Pasted Code
I came across this interesting little program that detects code that appears to have been copied and pasted from one place to another. It costs about $20.00, is free to try on Linux and Windows, and works with any text. It doesn't seem like a difficult program to write; it sounds like it might be similar to the rsync algorithm. The differences would be that you chunk the input into blocks of n lines instead of bytes, and you store all the checksums in a hash table to find duplicates. Cleaning up whitespace and removing comments might also improve the results.
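
Here is a rough Python sketch of how I imagine that could work. It is only my guess at the approach, not how the program actually does it. The names `normalize` and `find_duplicates`, the window size `n=3`, the `#` comment prefix, and the choice of SHA-1 as the checksum are all my own assumptions, and the comment stripping is naive (it would mangle a `#` inside a string literal).

```python
import hashlib
from collections import defaultdict


def normalize(line, comment_prefix="#"):
    """Drop a trailing comment and collapse runs of whitespace (naive)."""
    code = line.split(comment_prefix, 1)[0]
    return " ".join(code.split())


def find_duplicates(files, n=3):
    """Find n-line windows that occur more than once.

    `files` is a dict of {name: text}. Returns a list of groups, where each
    group is a list of (name, 1-based starting line) for one repeated window.
    """
    table = defaultdict(list)  # checksum -> [(name, start_line), ...]
    for name, text in files.items():
        # Remember original line numbers, then skip lines empty after cleanup.
        lines = [(i, normalize(raw)) for i, raw in enumerate(text.splitlines(), 1)]
        lines = [(i, s) for i, s in lines if s]
        # Slide a window of n cleaned lines over the file, one line at a time.
        for k in range(len(lines) - n + 1):
            window = lines[k:k + n]
            joined = "\n".join(s for _, s in window)
            digest = hashlib.sha1(joined.encode("utf-8")).hexdigest()
            table[digest].append((name, window[0][0]))
    # Any checksum seen more than once is a candidate copy-and-paste.
    return [locs for locs in table.values() if len(locs) > 1]


if __name__ == "__main__":
    files = {
        "a.py": "x = 1\ny = 2\nz = x + y\nprint(z)\n",
        "b.py": "# copied\nx  =  1\ny = 2   # two\n\nz = x + y\nw = 0\n",
    }
    for group in find_duplicates(files):
        print(group)
```

I used overlapping windows rather than fixed chunks so that a pasted block does not have to line up on a chunk boundary to be found. The downside is that one long copied block shows up as several overlapping matches, so a real tool would probably want to merge adjacent hits.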