After making much progress on pal, I realized I had made an organisational mistake. Not a huge one, just one that will take a few days to work around. At the same time, I was pondering how pal will work with generics, which depends a great deal on how the Go team deals with generics in golang.org/x/tools/go/ssa. While waiting patiently for some kind of discussion on the topic, an old colleague reached out to me with a problem that really caught my attention.
She has a client who needs to review large sets of documents from time to time. The review happens under time pressure, so there’s no possibility of, say, hiring an army of subject experts to read it all. A problem in this setting is duplication of text, which introduces noise into the analysis and generally leads to inefficient review.
The classic work on finding duplicates in documents is based on document similarity measures, which are used for filtering web search results and selecting representative news stories. That classic approach just wasn’t working here. There was no time to come up with a dedicated model for similarity over the set of documents. Nothing is known about them in advance, and they come in all shapes and sizes, including private communication, scanned documents, and god knows what.
The question was: how can one find all duplicate text across a set of hundreds of thousands or millions of documents in short order, with no knowledge ahead of time of what those documents will look like?
We discussed things such as document size, formats, and what should count as a duplicate in terms of the extent of duplicated text.
Something struck me about this. It is more challenging than search, or even Elastic’s percolated search, because we don’t know the queries that would find the duplicates. This makes the whole thing basically N² in the number of documents, or worse, in the number of snippets to be considered for duplication. That just doesn’t work for a million documents in short order unless you happen to have quite substantial computational resources lying around.
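To make that quadratic blow-up concrete, here is a minimal sketch of the brute-force baseline. The names (Snippet, similar, findDuplicates) are hypothetical, just to show the shape of the problem, not anything any of the tools below actually do.

```go
// Brute-force duplicate detection: with no queries known in advance,
// every snippet has to be compared against every other snippet.
package main

import "fmt"

type Snippet struct {
	Doc  string
	Text string
}

// similar is a stand-in for whatever similarity test would be used.
func similar(a, b Snippet) bool {
	return a.Text == b.Text
}

// findDuplicates is the O(N²) baseline over N snippets, which is
// hopeless for a million documents' worth of snippets.
func findDuplicates(snips []Snippet) [][2]int {
	var dups [][2]int
	for i := 0; i < len(snips); i++ {
		for j := i + 1; j < len(snips); j++ {
			if similar(snips[i], snips[j]) {
				dups = append(dups, [2]int{i, j})
			}
		}
	}
	return dups
}

func main() {
	snips := []Snippet{
		{"a.txt", "the quick brown fox"},
		{"b.txt", "the quick brown fox"},
		{"c.txt", "something else entirely"},
	}
	fmt.Println(findDuplicates(snips)) // [[0 1]]
}
```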
Hmm.
Several questions came to mind.
Can we use Lucene or a traditional inverted index?
What is this LSH thing I was referred to?
Should I use existing tools?
Are there any existing tools?
I tried blevesearch on the Enron data set. I was too impatient to wait for it to finish, and it only solved the problem of indexing for search, not of finding all duplicates.
I tried Lucene. It was much faster, but it wasn’t indexing everything I needed, and I wasn’t sure how it would scale if I asked it to add the minimal information needed to find all the documents (the problem of using one file containing all the data).
I tried microfts, which was reasonably fast and recorded position info for all trigrams out of the box.
So I started thinking about trigrams and looking at how to build indices that might be easier to exploit for finding duplicates.
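Roughly, the idea is an index mapping each trigram to the positions where it occurs, so that runs of shared trigrams between two documents can later be stitched into candidate duplicate spans. The sketch below is an illustration under that assumption, not microfts’s or dupi’s actual data layout.

```go
// A toy trigram index: map each trigram to (document, offset) postings.
package main

import "fmt"

type posting struct {
	doc int // document id
	off int // byte offset of the trigram in the document
}

type trigramIndex map[string][]posting

// add records every overlapping 3-byte trigram of text with its position.
func (idx trigramIndex) add(doc int, text string) {
	for i := 0; i+3 <= len(text); i++ {
		tri := text[i : i+3]
		idx[tri] = append(idx[tri], posting{doc: doc, off: i})
	}
}

func main() {
	idx := trigramIndex{}
	idx.add(0, "duplicate text here")
	idx.add(1, "more duplicate text")
	// Trigrams whose postings span more than one document are the raw
	// material for reconstructing shared spans.
	for tri, posts := range idx {
		if len(posts) > 1 && posts[0].doc != posts[len(posts)-1].doc {
			fmt.Printf("%q -> %v\n", tri, posts)
		}
	}
}
```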
Having refreshed myself on the state of the open source art of full text search indexing, I turned to LSH, as my old colleague had initially directed me.
Playing with the Enron data, it occurred to me that for LSH to work, it would need to be integrated into the index, and moreover it would need to be applied to snippet-sized sub-documents. Otherwise, we’re back to N², and N, as always, is big.
But then the follow-up question: do we really want LSH applied to snippets? It seems a bit much.
Looking at the classic LSH forest after a friendly suggestion on the Gophers Slack, I was quite impressed with the work on indexing, but it just seemed like overkill for the problem at hand.
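For a sense of what “LSH on snippet-sized sub-documents” would even mean, here is a minimal MinHash sketch over a snippet’s trigrams. Real LSH (banding, an LSH forest, tuned parameters) is considerably more involved; this only computes a small signature per snippet whose slot-wise agreement approximates the Jaccard similarity of the trigram sets. It is an assumption-laden illustration, not what dupi does.

```go
// MinHash signatures for snippets, built from their trigrams.
package main

import (
	"fmt"
	"hash/fnv"
)

const numHashes = 16

// hashTrigram simulates a family of hash functions by mixing a seed
// into an FNV-1a hash of the trigram.
func hashTrigram(tri string, seed uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(seed), byte(seed >> 8)})
	h.Write([]byte(tri))
	return h.Sum32()
}

// signature keeps, for each hash function, the minimum hash value seen
// over all trigrams of the snippet.
func signature(snippet string) [numHashes]uint32 {
	var sig [numHashes]uint32
	for i := range sig {
		sig[i] = ^uint32(0)
	}
	for i := 0; i+3 <= len(snippet); i++ {
		tri := snippet[i : i+3]
		for s := uint32(0); s < numHashes; s++ {
			if h := hashTrigram(tri, s); h < sig[s] {
				sig[s] = h
			}
		}
	}
	return sig
}

func main() {
	a := signature("the quick brown fox jumps over the lazy dog")
	b := signature("the quick brown fox leaps over the lazy dog")
	match := 0
	for i := range a {
		if a[i] == b[i] {
			match++
		}
	}
	// The fraction of matching slots estimates the Jaccard similarity
	// of the two snippets' trigram sets.
	fmt.Printf("estimated similarity: %.2f\n", float64(match)/numHashes)
}
```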
So then what? Didn’t BLAST solve this problem long ago? Looking over the algorithm description, I’d have to say no: it reduces the quadratic behavior by exploiting statistics about tendencies in DNA sequences, but it is still fundamentally quadratic.
And so dupi was born: yet another tool in a family of different search and analysis algorithms, with yet another design.
I hope this tool finds its uses. I think it can be tweaked for many different use cases, including code and maybe even DNA.
Dupi indexes the Enron dataset on my MacBook in 17s. The duplicate extraction mechanism is still buggy and rough around the edges, but the underpinnings work and are efficient.