Find Similar Documents Without Using a Full Text Index

Many systems that identify similar content do so by storing a collection of fingerprints (sometimes called a sketch) for each document in a database alongside the sketches of other documents. When similar content is requested, these systems apply various algorithms to match the selected content's fingerprints against those stored in the database. Full text indexing solutions also require databases and index files to store word tokens, stems, synonyms, locations, and so on to facilitate identification of similar content. Some full text search engines can be configured to select the most important words from a document and build a query from those words to identify similar content in their indexes.

In this Knowledge Sharing article, Scott Roth discusses how to configure Documentum to enable identification of syntactically similar content without the use of a full text indexing engine. The technique described utilizes a Java Aspect to calculate SimHash values for content objects and stores them in a database view. The database view can then be queried programmatically via an Aspect or by using DQL to identify content similar to a selected object.

Scott's solution condenses the salient features of a document into a single, 64-bit hash value that can be attached directly to the content object as metadata, thus eliminating the need for additional databases, indexes, or advanced detection algorithms. Similar content can be detected by simply comparing hash values.
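
To make the idea concrete, here is a minimal sketch of how a 64-bit SimHash might be computed and compared in plain Java. It is not the code from Scott's article: the class name SimHashSketch, the word-level tokenization, the FNV-1a token hash, and the unweighted voting are all assumptions chosen for illustration, and it does not touch any Documentum API. Two fingerprints are compared by counting the bits in which they differ (the Hamming distance); a smaller distance suggests more similar text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class for illustration only; not taken from the article.
final class SimHashSketch {

    // 64-bit FNV-1a hash of a token. Assumption: any well-mixed 64-bit
    // hash function could be substituted here.
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;          // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;               // FNV prime
        }
        return hash;
    }

    // Condenses the text into one 64-bit fingerprint. Each token votes
    // +1 or -1 on every bit position; the sign of the total sets the bit.
    static long simHash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;                         // skip leading empty split result
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

In use, one would compute simHash once per document, store the value with the object's metadata, and treat two documents as candidates for similarity when hammingDistance falls at or below a small threshold. The right threshold is a tuning decision that depends on the content and is not specified here.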

Read the full article.