Showing 2 of 2 results
ropensci
textreuse: Detect Text Reuse and Document Similarity
Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
Maintained by Yaoxiang Li. Last updated 1 month ago.
200 stars 9.28 score 226 scripts
beniaminogreen
zoomerjoin: Superlatively Fast Fuzzy Joins
Empowers users to fuzzily-merge data frames with millions or tens of millions of rows in minutes with low memory usage. The package uses the locality sensitive hashing algorithms developed by Datar, Immorlica, Indyk and Mirrokni (2004) <doi:10.1145/997817.997857>, and Broder (1998) <doi:10.1109/SEQUEN.1997.666900> to avoid having to compare every pair of records in each dataset, resulting in fuzzy-merges that finish in linear time.
Maintained by Beniamino Green. Last updated 2 months ago.
blazinglyfast fuzzyjoin join rust zoomer cargo
102 stars 7.31 score 11 scripts