Showing 32 of 32 results
ngram:Fast n-Gram 'Tokenization'
An n-gram is a sequence of n "words" taken, in order, from a body of text. This is a collection of utilities for creating, displaying, summarizing, and "babbling" n-grams. The 'tokenization' and "babbling" are handled by very efficient C code, which can even be built as its own standalone library. The babbler is a simple Markov chain. The package also offers a vignette with complete example 'workflows' and information about the utilities offered in the package.
Maintained by Drew Schmidt. Last updated 1 year ago.
85.7 match 71 stars 10.45 score 844 scripts 7 dependents
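The "babbler" described above is a simple Markov chain over n-grams. As a rough illustration of the idea — a minimal Python sketch, not the package's C implementation, with hypothetical function names `build_chain` and `babble` — the chain maps each (n-1)-word context to the words observed after it and then random-walks:

```python
import random
from collections import defaultdict

def build_chain(text, n=2):
    """Map each (n-1)-word context to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - n + 1):
        chain[tuple(words[i:i + n - 1])].append(words[i + n - 1])
    return dict(chain)

def babble(chain, length=10, seed=42):
    """Random-walk the chain to 'babble' a sequence of `length` words."""
    rng = random.Random(seed)
    ctx_len = len(next(iter(chain)))
    out = list(rng.choice(list(chain)))
    while len(out) < length:
        nexts = chain.get(tuple(out[-ctx_len:]))
        if not nexts:                      # dead end: restart from a random context
            out.extend(rng.choice(list(chain)))
            continue
        out.append(rng.choice(nexts))
    return " ".join(out[:length])

chain = build_chain("a b a c a b a d", n=2)
print(babble(chain, length=8))
```

The real package does this in C for speed; the logic is the same two steps — tabulate contexts, then sample.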
refinr:Cluster and Merge Similar Values Within a Character Vector
These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool Open Refine <https://openrefine.org/>. More info on key collision and ngram fingerprint can be found here <https://openrefine.org/docs/technical-reference/clustering-in-depth>.
Maintained by Chris Muir. Last updated 1 year ago.
approximate-string-matching clustering data-cleaning data-clustering fuzzy-matching ngram openrefine cpp
13.6 match 104 stars 6.80 score 121 scripts
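The key collision method can be sketched language-agnostically: normalize each value into a "fingerprint" (lowercase, strip punctuation, sort its unique tokens), cluster values that share a fingerprint, and replace each cluster member with the cluster's most frequent spelling. A minimal Python approximation — not refinr's C++ code, and a simplification of OpenRefine's actual fingerprint rules:

```python
import re
from collections import Counter, defaultdict

def fingerprint(s):
    """Simplified key-collision fingerprint: lowercase, strip
    punctuation, then sort and dedupe the remaining tokens."""
    tokens = re.sub(r"[^\w\s]", "", s.lower()).split()
    return " ".join(sorted(set(tokens)))

def key_collision_merge(values):
    """Map every value in a cluster to the cluster's most frequent spelling."""
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    canonical = {fp: Counter(vs).most_common(1)[0][0]
                 for fp, vs in clusters.items()}
    return [canonical[fingerprint(v)] for v in values]

print(key_collision_merge(["Acme Corp.", "acme corp", "Acme Corp.", "Corp Acme"]))
```

Because the fingerprint ignores case, punctuation, and token order, "Corp Acme" collides with "Acme Corp." and gets merged into the majority spelling.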
qdap:Bridging the Gap Between Qualitative Data and Quantitative Analysis
Automates many of the tasks associated with quantitative discourse analysis of transcripts, including frequency counts of sentence types, words, sentences, turns of talk, syllables, and other assorted analyses. The package provides parsing tools for preparing transcript data. Many functions enable the user to aggregate data by any number of grouping variables, providing analysis and seamless integration with other R packages that undertake higher-level analysis and visualization of text. This affords the user a more efficient and targeted analysis. 'qdap' is designed for transcript analysis; however, many functions are applicable to other areas of Text Mining / Natural Language Processing.
Maintained by Tyler Rinker. Last updated 4 years ago.
qdap quantitative-discourse-analysis text-analysis text-mining text-plotting openjdk
7.3 match 176 stars 9.61 score 1.3k scripts 3 dependents
biogram:N-Gram Analysis of Biological Sequences
Tools for extraction and analysis of various n-grams (k-mers) derived from biological sequences (proteins or nucleic acids). Contains QuiPT (quick permutation test) for fast feature-filtering of the n-gram data.
Maintained by Michal Burdukiewicz. Last updated 7 months ago.
biological-sequences ngram-analysis
7.5 match 10 stars 7.50 score 87 scripts 3 dependents
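The n-grams (k-mers) biogram works with are simply the overlapping length-k substrings of a sequence; counting them is a one-liner. An illustrative Python sketch, unrelated to the package's implementation:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers (n-grams) in a biological sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(kmer_counts("ATGATGA", 3))  # trigram counts of a toy nucleotide sequence
```

A sequence of length L yields L - k + 1 overlapping k-mers, so the counts above come from five windows over the seven-base toy sequence.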
textrecipes:Extra 'Recipes' for Text Processing
Converting text to numerical features requires specifically created procedures, which are implemented as steps according to the 'recipes' package. These steps allow for tokenization, filtering, counting (tf and tf-idf), and feature hashing.
Maintained by Emil Hvitfeldt. Last updated 9 days ago.
4.5 match 160 stars 10.87 score 964 scripts 1 dependent
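The counting steps mentioned above weight token counts by how document-specific each token is. A simplified Python sketch of tf-idf weighting — conceptual only, not how the 'recipes' steps are implemented:

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document token counts weighted by log inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                 # number of documents each token occurs in
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

weights = tfidf(["the cat sat", "the dog ran"])
# a token occurring in every document gets idf log(n/n) = 0, i.e. zero weight
```

Tokens shared by all documents carry no discriminative signal under this weighting, which is why "the" drops out while "cat" and "dog" survive.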
jstor:Read Data from JSTOR/DfR
Functions and helpers to import metadata, ngrams, and full texts delivered by JSTOR's Data for Research service.
Maintained by Thomas Klebel. Last updated 8 months ago.
jstor peer-reviewed text-analysis text-mining
6.0 match 47 stars 7.29 score 55 scripts
polmineR:Verbs and Nouns for Corpus Analysis
Package for corpus analysis using the Corpus Workbench ('CWB', <https://cwb.sourceforge.io>) as an efficient back end for indexing and querying large corpora. The package offers functionality to flexibly create subcorpora and to carry out basic statistical operations (count, co-occurrences etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. Respective data structures (document-term matrices, term-co-occurrence matrices etc.) can be created based on the indexed corpora.
Maintained by Andreas Blaette. Last updated 1 year ago.
5.3 match 49 stars 7.96 score 311 scripts
sbo:Text Prediction via Stupid Back-Off N-Gram Models
Utilities for training and evaluating text predictors based on Stupid Back-Off N-gram models (Brants et al., 2007, <https://www.aclweb.org/anthology/D07-1090/>).
Maintained by Valerio Gherardi. Last updated 4 years ago.
natural-language-processing ngram-models predictive-text sbo cpp
7.5 match 10 stars 4.78 score 12 scripts
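Stupid Back-Off scores a candidate word by the relative frequency of the longest matching n-gram, backing off to a shorter context with a fixed discount (0.4 in Brants et al., 2007) whenever the longer n-gram was never seen. A minimal Python sketch of the scoring rule — illustrative, not the package's C++ implementation:

```python
from collections import Counter

def ngram_counts(words, max_n):
    """Counts of all n-grams up to order max_n, keyed by word tuples."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def stupid_backoff(counts, context, word, alpha=0.4):
    """Score = relative frequency of the longest matched n-gram,
    discounted by alpha for each back-off step (Brants et al., 2007)."""
    if context:
        ngram, prefix = tuple(context) + (word,), tuple(context)
        if counts.get(ngram, 0) > 0 and counts.get(prefix, 0) > 0:
            return counts[ngram] / counts[prefix]
        # drop the leftmost context word and discount
        return alpha * stupid_backoff(counts, context[1:], word, alpha)
    total = sum(c for g, c in counts.items() if len(g) == 1)
    return counts.get((word,), 0) / total

counts = ngram_counts("the cat sat on the mat".split(), 3)
print(stupid_backoff(counts, ["the"], "cat"))  # count("the cat") / count("the")
```

Note the scores are not normalized probabilities — that is precisely the "stupid" simplification that makes the model cheap at web scale.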
ngramr:Retrieve and Plot Google n-Gram Data
Retrieve and plot word frequencies through time from the "Google Ngram Viewer" <https://books.google.com/ngrams>.
Maintained by Sean Carmody. Last updated 2 months ago.
6.0 match 49 stars 5.79 score 42 scripts
JATSdecoder:A Metadata and Text Extraction and Manipulation Tool Set
Provides a function collection to extract metadata, sectioned text and study characteristics from scientific articles in 'NISO-JATS' format. Articles in PDF format can be converted to 'NISO-JATS' with the 'Content ExtRactor and MINEr' ('CERMINE', <https://github.com/CeON/CERMINE>). For convenience, two functions bundle the extraction heuristics: JATSdecoder() converts 'NISO-JATS'-tagged XML files to a structured list with elements title, author, journal, history, 'DOI', abstract, sectioned text and reference list. study.character() extracts multiple study characteristics like number of included studies, statistical methods used, alpha error, power, statistical results, correction method for multiple testing, and software used. An estimation of the involved sample size is performed based on reports within the abstract and the reported degrees of freedom within statistical results. In addition, the package contains some useful functions to process text (text2sentences(), text2num(), ngram(), strsplit2(), grep2()). See Böschen, I. (2021) <doi:10.1007/s11192-021-04162-z>, Böschen, I. (2021) <doi:10.1038/s41598-021-98782-3>, and Böschen, I. (2023) <doi:10.1038/s41598-022-27085-y>.
Maintained by Ingmar Böschen. Last updated 6 days ago.
cermine niso-jats pubmedcentral text-extraction text-mining xml-files openjdk
7.1 match 18 stars 4.56 score 7 scripts
lares:Analytics & Machine Learning Sidekick
Auxiliary package for better/faster analytics, visualization, data mining, and machine learning tasks. With a wide variety of family functions, like Machine Learning, Data Wrangling, Marketing Mix Modeling (Robyn), Exploratory, API, and Scraper, it helps the analyst or data scientist get quick and robust results without the need for repetitive coding or advanced R programming skills.
Maintained by Bernardo Lares. Last updated 24 days ago.
analytics api automation automl data-science descriptive-statistics h2o machine-learning marketing mmm predictive-modeling puzzle rlanguage robyn visualization
3.0 match 233 stars 9.84 score 185 scripts 1 dependent
sparklyr:R Interface to Apache Spark
R interface to Apache Spark, a fast and general engine for big data processing, see <https://spark.apache.org/>. This package supports connecting to local and remote Apache Spark clusters, provides a 'dplyr' compatible back-end, and provides an interface to Spark's built-in machine learning algorithms.
Maintained by Edgar Ruiz. Last updated 10 days ago.
apache-spark distributed dplyr ide livy machine-learning remote-clusters spark sparklyr
1.9 match 959 stars 15.16 score 4.0k scripts 21 dependents
NLP:Natural Language Processing Infrastructure
Basic classes and methods for Natural Language Processing.
Maintained by Kurt Hornik. Last updated 4 months ago.
3.0 match 6 stars 9.37 score 1.0k scripts 127 dependents
dials:Tools for Creating Tuning Parameter Values
Many models contain tuning parameters (i.e. parameters that cannot be directly estimated from the data). These tools can be used to define objects for creating, simulating, or validating values for such parameters.
Maintained by Hannah Frick. Last updated 30 days ago.
1.8 match 114 stars 14.31 score 426 scripts 52 dependents
qlcMatrix:Utility Sparse Matrix Functions for Quantitative Language Comparison
Extension of the functionality of the 'Matrix' package for using sparse matrices. Some of the functions are very general, while others are highly specific to the special data formats used for quantitative language comparison.
Maintained by Michael Cysouw. Last updated 9 months ago.
3.6 match 6 stars 6.98 score 256 scripts 1 dependent
RMeCab:Interface to 'MeCab'
Parses Japanese texts with 'MeCab'. The original 'MeCab' is licensed under the BSD 3-Clause "New" or "Revised" License. See the "LICENSE.note" file for its license notice.
Maintained by Motohiro Ishida. Last updated 11 days ago.
6.6 match 3.10 score
stylo:Stylometric Multivariate Analyses
Supervised and unsupervised multivariate methods, supplemented by GUI and some visualizations, to perform various analyses in the field of computational stylistics, authorship attribution, etc. For further reference, see Eder et al. (2016), <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>. You are also encouraged to visit the Computational Stylistics Group's website <https://computationalstylistics.github.io/>, where a reasonable amount of information about the package and related projects is provided.
Maintained by Maciej Eder. Last updated 2 months ago.
2.3 match 187 stars 8.58 score 462 scripts
storywranglr:Explore Twitter Trends with the 'Storywrangler' API
An interface to explore trends in Twitter data using the 'Storywrangler' Application Programming Interface (API), which can be found here: <https://github.com/janeadams/storywrangler>.
Maintained by Christopher Belanger. Last updated 4 years ago.
4.8 match 2.70 score 3 scripts
fastText:Efficient Learning of Word Representations and Sentence Classification
An interface to the 'fastText' <https://github.com/facebookresearch/fastText> library for efficient learning of word representations and sentence classification. The 'fastText' algorithm is explained in detail in (i) "Enriching Word Vectors with Subword Information", Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, 2017, <doi:10.1162/tacl_a_00051>; (ii) "Bag of Tricks for Efficient Text Classification", Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, 2017, <doi:10.18653/v1/e17-2068>; (iii) "FastText.zip: Compressing text classification models", Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, Tomas Mikolov, 2016, <arXiv:1612.03651>.
Maintained by Lampros Mouselimis. Last updated 1 year ago.
1.7 match 42 stars 7.37 score 56 scripts
audubon:Japanese Text Processing Tools
A collection of Japanese text processing tools for filling Japanese iteration marks, Japanese character type conversions, segmentation by phrase, and text normalization which is based on rules for the 'Sudachi' morphological analyzer and the 'NEologd' (Neologism dictionary for 'MeCab'). These features are specific to Japanese and are not implemented in 'ICU' (International Components for Unicode).
Maintained by Akiru Kato. Last updated 22 days ago.
2.3 match 10 stars 5.61 score 3 scripts 1 dependent
doc2concrete:Measuring Concreteness in Natural Language
Models for detecting concreteness in natural language. This package is built in support of Yeomans (2021) <doi:10.1016/j.obhdp.2020.10.008>, which reviews linguistic models of concreteness in several domains. Here, we provide an implementation of the best-performing domain-general model (from Brysbaert et al., (2014) <doi:10.3758/s13428-013-0403-5>) as well as two pre-trained models for the feedback and plan-making domains.
Maintained by Mike Yeomans. Last updated 1 year ago.
2.3 match 13 stars 5.59 score 20 scripts 1 dependent
ruimtehol:Learn Text 'Embeddings' with 'Starspace'
Wraps the 'StarSpace' library <https://github.com/facebookresearch/StarSpace> allowing users to calculate word, sentence, article, document, webpage, link and entity 'embeddings'. By using the 'embeddings', you can perform text based multi-label classification, find similarities between texts and categories, do collaborative-filtering based recommendation as well as content-based recommendation, find out relations between entities, calculate graph 'embeddings' as well as perform semi-supervised learning and multi-task learning on plain text. The techniques are explained in detail in the paper: 'StarSpace: Embed All The Things!' by Wu et al. (2017), available at <arXiv:1709.03856>.
Maintained by Jan Wijffels. Last updated 1 year ago.
classification embeddings natural-language-processing nlp similarity starspace text-mining cpp
1.9 match 101 stars 6.65 score 44 scripts
textanalyzer:'textanalyzer', an R Package to Analyze Text
It analyzes text to create counts of top n-grams, including tokens (one-word), bigrams (two-word), and trigrams (three-word), while removing all stopwords. It also plots the n-grams and corresponding counts as a bar chart.
Maintained by Pushker Ravindra. Last updated 2 months ago.
4.5 match 2.70 score
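Top n-gram counting after stopword removal reduces to filtering the token stream and tallying adjacent tuples. An illustrative Python sketch with a toy stopword list — not the package's code:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # toy subset for illustration

def top_ngrams(text, n=2, k=3):
    """Top-k n-grams of a text after removing stopwords."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    grams = zip(*(words[i:] for i in range(n)))      # sliding windows of length n
    return Counter(" ".join(g) for g in grams).most_common(k)

print(top_ngrams("the quick fox and the quick fox", n=2, k=1))
```

Filtering stopwords before forming the n-grams (rather than after) means the surviving words are treated as adjacent, which is the usual convention for this kind of summary.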
gibasa:An Alternative 'Rcpp' Wrapper of 'MeCab'
A plain 'Rcpp' wrapper for 'MeCab' that can segment Chinese, Japanese, and Korean text into tokens. The main goal of this package is to provide an alternative to 'tidytext' using morphological analysis.
Maintained by Akiru Kato. Last updated 29 days ago.
2.3 match 15 stars 5.02 score 3 scripts
wpa:Tools for Analysing and Visualising Viva Insights Data
Opinionated functions that enable easier and faster analysis of Viva Insights data. There are three main types of functions in 'wpa': (i) Standard functions create a 'ggplot' visual or a summary table based on a specific Viva Insights metric; (ii) Report Generation functions generate HTML reports on a specific analysis area, e.g. Collaboration; (iii) Other miscellaneous functions cover more specific applications (e.g. Subject Line text mining) of Viva Insights data. This package adheres to 'tidyverse' principles and works well with the pipe syntax. 'wpa' is built with beginner-to-intermediate R users in mind, and is optimised for simplicity.
Maintained by Martin Chan. Last updated 4 months ago.
1.7 match 30 stars 6.69 score 39 scripts 1 dependent
vivainsights:Analyze and Visualize Data from 'Microsoft Viva Insights'
Provides a versatile range of functions, including exploratory data analysis, time-series analysis, organizational network analysis, and data validation, whilst at the same time implementing a set of best practices for analyzing and visualizing data specific to 'Microsoft Viva Insights'.
Maintained by Martin Chan. Last updated 24 days ago.
1.7 match 11 stars 6.12 score 68 scripts
ProcData:Process Data Analysis
Provides tools for exploratory process data analysis. Process data refers to the data describing participants' problem-solving processes in computer-based assessments. It is often recorded in computer log files. This package provides functions to read, process, and write process data. It also implements two feature extraction methods to compress the information stored in process data into standard numerical vectors. This package also provides recurrent neural network based models that relate response processes with other binary or scale variables of interest. The functions that involve training and evaluating neural networks are wrappers of functions in 'keras'.
Maintained by Xueying Tang. Last updated 4 years ago.
2.0 match 10 stars 3.70 score 2 scripts
pipian:Tiny Interface to CaboCha for R
A tiny interface to 'CaboCha', a Japanese dependency structure parser. The main goal of this package is to implement a parser for its XML output.
Maintained by Akiru Kato. Last updated 2 months ago.
2.3 match 4 stars 3.00 score 1 script
vibrrt:An R Wrapper for 'vibrato'
An R wrapper for 'vibrato' <https://github.com/daac-tools/vibrato>, a Rust reimplementation of 'MeCab' for fast tokenization.
Maintained by Akiru Kato. Last updated 16 days ago.
2.3 match 2.30 score 1 script
sudachir2:R Wrapper for 'sudachi.rs'
Offers bindings to 'sudachi.rs' <https://github.com/WorksApplications/sudachi.rs>, a Rust implementation of 'Sudachi' Japanese morphological analyzer.
Maintained by Akiru Kato. Last updated 16 days ago.
2.3 match 3 stars 2.18 score 3 scripts
EnvNJ:Whole Genome Phylogenies Using Sequence Environments
Contains utilities for the analysis of protein sequences in a phylogenetic context. Allows the generation of phylogenetic trees based on protein sequences in an alignment-independent way. Two different methods have been implemented. One approach is based on the frequency analysis of n-grams, previously described in Stuart et al. (2002) <doi:10.1093/bioinformatics/18.1.100>. The other approach is based on the species-specific neighborhood preference around amino acids. Features include the conversion of a protein set into a vector reflecting these neighborhood preferences, pairwise distances (dissimilarities) between these vectors, and the generation of trees based on these distance matrices.
Maintained by Juan Carlos Aledo. Last updated 3 years ago.
3.3 match 1.04 score 11 scripts
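The n-gram approach above reduces each protein to a frequency vector and builds the tree from pairwise distances between those vectors. A toy Python sketch of the vectorization-and-distance step — cosine distance is chosen here purely for illustration; EnvNJ's actual dissimilarity measures may differ:

```python
import math
from collections import Counter

def ngram_freqs(seq, n=2):
    """Relative frequencies of overlapping n-grams in a protein sequence."""
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return {g: c / len(grams) for g, c in Counter(grams).items()}

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse frequency vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (norm_u * norm_v)

d = cosine_distance(ngram_freqs("MKVLAAGG"), ngram_freqs("MKVLAAGG"))
print(round(d, 6))  # identical sequences -> distance 0
```

Because no alignment is needed, sequences of different lengths compare directly through their shared n-gram vocabulary, which is the point of alignment-independent phylogenetics.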
textTools:Functions for Text Cleansing and Text Analysis
A framework for text cleansing and analysis. Conveniently prepare and process large amounts of text for analysis. Includes various metrics for word counts/frequencies that scale efficiently. Quickly analyze large amounts of text data using a text.table (a data.table created with one word (or unit of text analysis) per row, similar to the tidytext format). Offers flexibility to efficiently work with text data stored in vectors as well as text data formatted as a text.table.
Maintained by Timothy Conwell. Last updated 4 years ago.
3.0 match 1.00 score 4 scripts