Showing 4 of 4 results
bnosac
tokenizers.bpe: Byte Pair Encoding Text Tokenization
Unsupervised text tokenizer focused on computational efficiency. Wraps the 'YouTokenToMe' library <https://github.com/VKCOM/YouTokenToMe>, which is an implementation of fast Byte Pair Encoding (BPE) <https://aclanthology.org/P16-1162/>.
Maintained by Jan Wijffels. Last updated 2 years ago.
bpe · byte-pair-encoding · text-mining · tokenization · cpp
52.3 match · 15 stars · 4.56 score · 48 scripts
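A minimal sketch of training and applying a BPE model with this package, assuming the bpe() / bpe_encode() interface from its documentation; the corpus file path and vocabulary size are illustrative:

    library(tokenizers.bpe)

    ## Train a BPE model on a plain-text corpus file (path is illustrative)
    model <- bpe("corpus.txt", coverage = 0.999, vocab_size = 5000)

    ## Segment a sentence into the subword units the model learned
    bpe_encode(model, x = "Byte pair encoding splits rare words into subwords",
               type = "subwords")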
tidymodels
textrecipes: Extra 'Recipes' for Text Processing
Converting text to numerical features requires specifically created procedures, which are implemented as steps compatible with the 'recipes' package. These steps allow for tokenization, filtering, counting (tf and tf-idf), and feature hashing.
Maintained by Emil Hvitfeldt. Last updated 9 days ago.
1.9 match · 160 stars · 10.87 score · 964 scripts · 1 dependent
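A minimal sketch of how these steps compose into a preprocessing recipe, assuming a data frame my_data with a character column text (both illustrative); step_tokenize(), step_tokenfilter(), and step_tfidf() are steps the package documents:

    library(recipes)
    library(textrecipes)

    ## Tokenize the text column, keep the 100 most frequent tokens,
    ## and turn them into tf-idf features
    rec <- recipe(~ text, data = my_data) |>
      step_tokenize(text) |>
      step_tokenfilter(text, max_tokens = 100) |>
      step_tfidf(text)

    ## Estimate the steps, then apply them to the training data
    bake(prep(rec), new_data = NULL)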
davzim
rtiktoken: A Byte-Pair-Encoding (BPE) Tokenizer for OpenAI's Large Language Models
A thin wrapper around the tiktoken-rs crate for encoding text into Byte-Pair-Encoding (BPE) tokens and decoding tokens back into text. This is useful for understanding how Large Language Models (LLMs) perceive text.
Maintained by David Zimmermann-Kollenda. Last updated 4 months ago.
3.3 match · 11 stars · 4.22 score · 3 scripts
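A minimal sketch of round-tripping text through an OpenAI tokenizer, assuming the get_tokens() / decode_tokens() helpers as shown in the package README; the model name is illustrative:

    library(rtiktoken)

    ## Encode text into BPE token ids using the tokenizer for a given model
    tokens <- get_tokens("Large language models read tokens, not words",
                         "gpt-4o")

    ## Decode the token ids back into text
    decode_tokens(tokens, "gpt-4o")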
bnosac
sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling
Unsupervised text tokenizer for byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer to split text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
Maintained by Jan Wijffels. Last updated 2 years ago.
byte · natural-language-processing · sentencepiece · word-segmentation · cpp
2.6 match · 25 stars · 4.10 score · 8 scripts
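A minimal sketch of using one of the pretrained BPEmb models the description mentions, assuming the sentencepiece_download_model() / sentencepiece_load_model() / sentencepiece_encode() helpers from the package documentation; the language and vocabulary size are illustrative:

    library(sentencepiece)

    ## Fetch a pretrained BPEmb byte pair encoding model for English
    ## with a 5000-token vocabulary, then load it
    dl    <- sentencepiece_download_model("English", vocab_size = 5000)
    model <- sentencepiece_load_model(dl$file_model)

    ## Segment a sentence into subword units
    sentencepiece_encode(model, x = "Subword units cover rare words",
                         type = "subwords")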