Showing 32 of 32 results
ngram:Fast n-Gram 'Tokenization'
An n-gram is a sequence of n "words" taken, in order, from a body of text. This is a collection of utilities for creating, displaying, summarizing, and "babbling" n-grams. The 'tokenization' and "babbling" are handled by very efficient C code, which can even be built as its own standalone library. The babbler is a simple Markov chain. The package also offers a vignette with complete example 'workflows' and information about the utilities offered in the package.
Maintained by Drew Schmidt. Last updated 1 year ago.
85.7 match 71 stars 10.45 score 844 scripts 7 dependents
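The "babbler" described above is a simple Markov chain over n-grams. As a rough illustration of the idea — a minimal Python sketch, not the package's C implementation, with hypothetical function names `build_chain` and `babble` — the chain maps each (n-1)-word context to the words observed after it and then random-walks:

```python
import random
from collections import defaultdict

def build_chain(text, n=2):
    """Map each (n-1)-word context to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - n + 1):
        chain[tuple(words[i:i + n - 1])].append(words[i + n - 1])
    return dict(chain)

def babble(chain, length=10, seed=42):
    """Random-walk the chain to 'babble' a sequence of `length` words."""
    rng = random.Random(seed)
    ctx_len = len(next(iter(chain)))
    out = list(rng.choice(list(chain)))
    while len(out) < length:
        nexts = chain.get(tuple(out[-ctx_len:]))
        if not nexts:                      # dead end: restart from a random context
            out.extend(rng.choice(list(chain)))
            continue
        out.append(rng.choice(nexts))
    return " ".join(out[:length])

chain = build_chain("a b a c a b a d", n=2)
print(babble(chain, length=8))
```

The real package does this in C for speed; the logic is the same two steps — tabulate contexts, then sample.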
refinr:Cluster and Merge Similar Values Within a Character Vector
These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool Open Refine <https://openrefine.org/>. More info on key collision and ngram fingerprint can be found here <https://openrefine.org/docs/technical-reference/clustering-in-depth>.
Maintained by Chris Muir. Last updated 1 year ago.
approximate-string-matching clustering data-cleaning data-clustering fuzzy-matching ngram openrefine cpp
13.6 match 104 stars 6.80 score 121 scripts
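The key collision method can be sketched language-agnostically: normalize each value into a "fingerprint" (lowercase, strip punctuation, sort its unique tokens), cluster values that share a fingerprint, and replace each cluster member with the cluster's most frequent spelling. A minimal Python approximation — not refinr's C++ code, and a simplification of OpenRefine's actual fingerprint rules:

```python
import re
from collections import Counter, defaultdict

def fingerprint(s):
    """Simplified key-collision fingerprint: lowercase, strip
    punctuation, then sort and dedupe the remaining tokens."""
    tokens = re.sub(r"[^\w\s]", "", s.lower()).split()
    return " ".join(sorted(set(tokens)))

def key_collision_merge(values):
    """Map every value in a cluster to the cluster's most frequent spelling."""
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    canonical = {fp: Counter(vs).most_common(1)[0][0]
                 for fp, vs in clusters.items()}
    return [canonical[fingerprint(v)] for v in values]

print(key_collision_merge(["Acme Corp.", "acme corp", "Acme Corp.", "Corp Acme"]))
```

Because the fingerprint ignores case, punctuation, and token order, "Corp Acme" collides with "Acme Corp." and gets merged into the majority spelling.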
qdap:Bridging the Gap Between Qualitative Data and Quantitative Analysis
Automates many of the tasks associated with quantitative discourse analysis of transcripts, including frequency counts of sentence types, words, sentences, turns of talk, syllables, and other assorted analyses. The package provides parsing tools for preparing transcript data. Many functions enable the user to aggregate data by any number of grouping variables, providing analysis and seamless integration with other R packages that undertake higher-level analysis and visualization of text. This affords the user a more efficient and targeted analysis. 'qdap' is designed for transcript analysis; however, many functions are applicable to other areas of Text Mining / Natural Language Processing.
Maintained by Tyler Rinker. Last updated 4 years ago.
qdap quantitative-discourse-analysis text-analysis text-mining text-plotting openjdk
7.3 match 176 stars 9.61 score 1.3k scripts 3 dependents
biogram:N-Gram Analysis of Biological Sequences
Tools for extraction and analysis of various n-grams (k-mers) derived from biological sequences (proteins or nucleic acids). Contains QuiPT (quick permutation test) for fast feature-filtering of the n-gram data.
Maintained by Michal Burdukiewicz. Last updated 7 months ago.
biological-sequences ngram-analysis
7.5 match 10 stars 7.50 score 87 scripts 3 dependents
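The n-grams (k-mers) biogram works with are simply the overlapping length-k substrings of a sequence; counting them is a one-liner. An illustrative Python sketch, unrelated to the package's implementation:

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers (n-grams) in a biological sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

print(kmer_counts("ATGATGA", 3))  # trigram counts of a toy nucleotide sequence
```

A sequence of length L yields L - k + 1 overlapping k-mers, so the counts above come from five windows over the seven-base toy sequence.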
textrecipes:Extra 'Recipes' for Text Processing
Converting text to numerical features requires specifically created procedures, which are implemented as steps according to the 'recipes' package. These steps allow for tokenization, filtering, counting (tf and tf-idf), and feature hashing.
Maintained by Emil Hvitfeldt. Last updated 9 days ago.
4.5 match 160 stars 10.87 score 964 scripts 1 dependent
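The counting steps mentioned above weight token counts by how document-specific each token is. A simplified Python sketch of tf-idf weighting — conceptual only, not how the 'recipes' steps are implemented:

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document token counts weighted by log inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                 # number of documents each token occurs in
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
            for toks in tokenized]

weights = tfidf(["the cat sat", "the dog ran"])
# a token occurring in every document gets idf log(n/n) = 0, i.e. zero weight
```

Tokens shared by all documents carry no discriminative signal under this weighting, which is why "the" drops out while "cat" and "dog" survive.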
jstor:Read Data from JSTOR/DfR
Functions and helpers to import metadata, ngrams, and full texts delivered by JSTOR's Data for Research service.
Maintained by Thomas Klebel. Last updated 8 months ago.
jstor peer-reviewed text-analysis text-mining
6.0 match 47 stars 7.29 score 55 scripts
polmineR:Verbs and Nouns for Corpus Analysis
Package for corpus analysis using the Corpus Workbench ('CWB', <https://cwb.sourceforge.io>) as an efficient back end for indexing and querying large corpora. The package offers functionality to flexibly create subcorpora and to carry out basic statistical operations (count, co-occurrences etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. Respective data structures (document-term matrices, term-co-occurrence matrices etc.) can be created based on the indexed corpora.
Maintained by Andreas Blaette. Last updated 1 year ago.
5.3 match 49 stars 7.96 score 311 scripts
sbo:Text Prediction via Stupid Back-Off N-Gram Models
Utilities for training and evaluating text predictors based on Stupid Back-Off N-gram models (Brants et al., 2007, <https://www.aclweb.org/anthology/D07-1090/>).
Maintained by Valerio Gherardi. Last updated 4 years ago.
natural-language-processing ngram-models predictive-text sbo cpp
7.5 match 10 stars 4.78 score 12 scripts
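Stupid Back-Off scores a candidate word by the relative frequency of the longest matching n-gram, backing off to a shorter context with a fixed discount (0.4 in Brants et al., 2007) whenever the longer n-gram was never seen. A minimal Python sketch of the scoring rule — illustrative, not the package's C++ implementation:

```python
from collections import Counter

def ngram_counts(words, max_n):
    """Counts of all n-grams up to order max_n, keyed by word tuples."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def stupid_backoff(counts, context, word, alpha=0.4):
    """Score = relative frequency of the longest matched n-gram,
    discounted by alpha for each back-off step (Brants et al., 2007)."""
    if context:
        ngram, prefix = tuple(context) + (word,), tuple(context)
        if counts.get(ngram, 0) > 0 and counts.get(prefix, 0) > 0:
            return counts[ngram] / counts[prefix]
        # drop the leftmost context word and discount
        return alpha * stupid_backoff(counts, context[1:], word, alpha)
    total = sum(c for g, c in counts.items() if len(g) == 1)
    return counts.get((word,), 0) / total

counts = ngram_counts("the cat sat on the mat".split(), 3)
print(stupid_backoff(counts, ["the"], "cat"))  # count("the cat") / count("the")
```

Note the scores are not normalized probabilities — that is precisely the "stupid" simplification that makes the model cheap at web scale.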
ngramr:Retrieve and Plot Google n-Gram Data
Retrieve and plot word frequencies through time from the "Google Ngram Viewer" <https://books.google.com/ngrams>.
Maintained by Sean Carmody. Last updated 2 months ago.
6.0 match 49 stars 5.79 score 42 scripts
JATSdecoder:A Metadata and Text Extraction and Manipulation Tool Set
Provides a function collection to extract metadata, sectioned text and study characteristics from scientific articles in 'NISO-JATS' format. Articles in PDF format can be converted to 'NISO-JATS' with the 'Content ExtRactor and MINEr' ('CERMINE', <https://github.com/CeON/CERMINE>). For convenience, two functions bundle the extraction heuristics: JATSdecoder() converts 'NISO-JATS'-tagged XML files to a structured list with elements title, author, journal, history, 'DOI', abstract, sectioned text and reference list. study.character() extracts multiple study characteristics like number of included studies, statistical methods used, alpha error, power, statistical results, correction method for multiple testing, and software used. An estimation of the involved sample size is performed based on reports within the abstract and the reported degrees of freedom within statistical results. In addition, the package contains some useful functions to process text (text2sentences(), text2num(), ngram(), strsplit2(), grep2()). See Böschen, I. (2021) <doi:10.1007/s11192-021-04162-z>, Böschen, I. (2021) <doi:10.1038/s41598-021-98782-3>, and Böschen, I. (2023) <doi:10.1038/s41598-022-27085-y>.
Maintained by Ingmar Böschen. Last updated 6 days ago.
cermine niso-jats pubmedcentral text-extraction text-mining xml-files openjdk
7.1 match 18 stars 4.56 score 7 scripts
lares:Analytics & Machine Learning Sidekick
Auxiliary package for better/faster analytics, visualization, data mining, and machine learning tasks. With a wide variety of family functions, like Machine Learning, Data Wrangling, Marketing Mix Modeling (Robyn), Exploratory, API, and Scraper, it helps the analyst or data scientist get quick and robust results without the need for repetitive coding or advanced R programming skills.
Maintained by Bernardo Lares. Last updated 24 days ago.
analytics api automation automl data-science descriptive-statistics h2o machine-learning marketing mmm predictive-modeling puzzle rlanguage robyn visualization
3.0 match 233 stars 9.84 score 185 scripts 1 dependent
sparklyr:R Interface to Apache Spark
R interface to Apache Spark, a fast and general engine for big data processing, see <https://spark.apache.org/>. This package supports connecting to local and remote Apache Spark clusters, provides a 'dplyr' compatible back-end, and provides an interface to Spark's built-in machine learning algorithms.
Maintained by Edgar Ruiz. Last updated 10 days ago.
apache-spark distributed dplyr ide livy machine-learning remote-clusters spark sparklyr
1.9 match 959 stars 15.16 score 4.0k scripts 21 dependents
NLP:Natural Language Processing Infrastructure
Basic classes and methods for Natural Language Processing.
Maintained by Kurt Hornik. Last updated 4 months ago.
3.0 match 6 stars 9.37 score 1.0k scripts 127 dependents
dials:Tools for Creating Tuning Parameter Values
Many models contain tuning parameters (i.e. parameters that cannot be directly estimated from the data). These tools can be used to define objects for creating, simulating, or validating values for such parameters.
Maintained by Hannah Frick. Last updated 30 days ago.
1.8 match 114 stars 14.31 score 426 scripts 52 dependents
qlcMatrix:Utility Sparse Matrix Functions for Quantitative Language Comparison
Extension of the functionality of the 'Matrix' package for using sparse matrices. Some of the functions are very general, while others are highly specific to the special data formats used for quantitative language comparison.
Maintained by Michael Cysouw. Last updated 9 months ago.
3.6 match 6 stars 6.98 score 256 scripts 1 dependent
RMeCab:Interface to 'MeCab'
Parses Japanese texts with 'MeCab'. The original 'MeCab' is licensed under the BSD 3-Clause "New" or "Revised" License. See the "LICENSE.note" file for its license notice.
Maintained by Motohiro Ishida. Last updated 11 days ago.
6.6 match 3.10 score
stylo:Stylometric Multivariate Analyses
Supervised and unsupervised multivariate methods, supplemented by GUI and some visualizations, to perform various analyses in the field of computational stylistics, authorship attribution, etc. For further reference, see Eder et al. (2016), <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>. You are also encouraged to visit the Computational Stylistics Group's website <https://computationalstylistics.github.io/>, where a reasonable amount of information about the package and related projects is provided.
Maintained by Maciej Eder. Last updated 2 months ago.
2.3 match 187 stars 8.58 score 462 scripts
storywranglr:Explore Twitter Trends with the 'Storywrangler' API
An interface to explore trends in Twitter data using the 'Storywrangler' Application Programming Interface (API), which can be found here: <https://github.com/janeadams/storywrangler>.
Maintained by Christopher Belanger. Last updated 4 years ago.
4.8 match 2.70 score 3 scripts
fastText:Efficient Learning of Word Representations and Sentence Classification
An interface to the 'fastText' <https://github.com/facebookresearch/fastText> library for efficient learning of word representations and sentence classification. The 'fastText' algorithm is explained in detail in (i) "Enriching Word Vectors with Subword Information", Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, 2017, <doi:10.1162/tacl_a_00051>; (ii) "Bag of Tricks for Efficient Text Classification", Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, 2017, <doi:10.18653/v1/e17-2068>; (iii) "FastText.zip: Compressing text classification models", Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve Jegou, Tomas Mikolov, 2016, <arXiv:1612.03651>.
Maintained by Lampros Mouselimis. Last updated 1 year ago.
1.7 match 42 stars 7.37 score 56 scripts
audubon:Japanese Text Processing Tools
A collection of Japanese text processing tools for filling Japanese iteration marks, Japanese character type conversions, segmentation by phrase, and text normalization which is based on rules for the 'Sudachi' morphological analyzer and the 'NEologd' (Neologism dictionary for 'MeCab'). These features are specific to Japanese and are not implemented in 'ICU' (International Components for Unicode).
Maintained by Akiru Kato. Last updated 22 days ago.
2.3 match 10 stars 5.61 score 3 scripts 1 dependent
doc2concrete:Measuring Concreteness in Natural Language
Models for detecting concreteness in natural language. This package is built in support of Yeomans (2021) <doi:10.1016/j.obhdp.2020.10.008>, which reviews linguistic models of concreteness in several domains. Here, we provide an implementation of the best-performing domain-general model (from Brysbaert et al., (2014) <doi:10.3758/s13428-013-0403-5>) as well as two pre-trained models for the feedback and plan-making domains.
Maintained by Mike Yeomans. Last updated 1 year ago.
2.3 match 13 stars 5.59 score 20 scripts 1 dependent
ruimtehol:Learn Text 'Embeddings' with 'Starspace'
Wraps the 'StarSpace' library <https://github.com/facebookresearch/StarSpace> allowing users to calculate word, sentence, article, document, webpage, link and entity 'embeddings'. By using the 'embeddings', you can perform text based multi-label classification, find similarities between texts and categories, do collaborative-filtering based recommendation as well as content-based recommendation, find out relations between entities, calculate graph 'embeddings' as well as perform semi-supervised learning and multi-task learning on plain text. The techniques are explained in detail in the paper: 'StarSpace: Embed All The Things!' by Wu et al. (2017), available at <arXiv:1709.03856>.
Maintained by Jan Wijffels. Last updated 1 year ago.
classification embeddings natural-language-processing nlp similarity starspace text-mining cpp
1.9 match 101 stars 6.65 score 44 scripts
textanalyzer:'textanalyzer', an R Package to Analyze Text
It analyzes text to create counts of top n-grams, including tokens (one-word), bigrams (two-word), and trigrams (three-word), while removing all stopwords. It also plots the n-grams and corresponding counts as a bar chart.
Maintained by Pushker Ravindra. Last updated 2 months ago.
4.5 match 2.70 score
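Top n-gram counting after stopword removal reduces to filtering the token stream and tallying adjacent tuples. An illustrative Python sketch with a toy stopword list — not the package's code:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # toy subset for illustration

def top_ngrams(text, n=2, k=3):
    """Top-k n-grams of a text after removing stopwords."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    grams = zip(*(words[i:] for i in range(n)))      # sliding windows of length n
    return Counter(" ".join(g) for g in grams).most_common(k)

print(top_ngrams("the quick fox and the quick fox", n=2, k=1))
```

Filtering stopwords before forming the n-grams (rather than after) means the surviving words are treated as adjacent, which is the usual convention for this kind of summary.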
gibasa:An Alternative 'Rcpp' Wrapper of 'MeCab'
A plain 'Rcpp' wrapper for 'MeCab' that can segment Chinese, Japanese, and Korean text into tokens. The main goal of this package is to provide an alternative to 'tidytext' using morphological analysis.
Maintained by Akiru Kato. Last updated 29 days ago.
2.3 match 15 stars 5.02 score 3 scripts
wpa:Tools for Analysing and Visualising Viva Insights Data
Opinionated functions that enable easier and faster analysis of Viva Insights data. There are three main types of functions in 'wpa': (i) Standard functions create a 'ggplot' visual or a summary table based on a specific Viva Insights metric; (ii) Report Generation functions generate HTML reports on a specific analysis area, e.g. Collaboration; (iii) Other miscellaneous functions cover more specific applications (e.g. Subject Line text mining) of Viva Insights data. This package adheres to 'tidyverse' principles and works well with the pipe syntax. 'wpa' is built with beginner-to-intermediate R users in mind, and is optimised for simplicity.
Maintained by Martin Chan. Last updated 4 months ago.
1.7 match 30 stars 6.69 score 39 scripts 1 dependent
vivainsights:Analyze and Visualize Data from 'Microsoft Viva Insights'
Provides a versatile range of functions, including exploratory data analysis, time-series analysis, organizational network analysis, and data validation, whilst at the same time implementing a set of best practices for analyzing and visualizing data specific to 'Microsoft Viva Insights'.
Maintained by Martin Chan. Last updated 24 days ago.
1.7 match 11 stars 6.12 score 68 scripts
ProcData:Process Data Analysis
Provides tools for exploratory process data analysis. Process data refers to the data describing participants' problem-solving processes in computer-based assessments. It is often recorded in computer log files. This package provides functions to read, process, and write process data. It also implements two feature extraction methods to compress the information stored in process data into standard numerical vectors. This package also provides recurrent neural network based models that relate response processes with other binary or scale variables of interest. The functions that involve training and evaluating neural networks are wrappers of functions in 'keras'.
Maintained by Xueying Tang. Last updated 4 years ago.
2.0 match 10 stars 3.70 score 2 scripts
pipian:Tiny Interface to CaboCha for R
A tiny interface to 'CaboCha', a Japanese dependency structure parser. The main goal of this package is to implement a parser for its XML output.
Maintained by Akiru Kato. Last updated 2 months ago.
2.3 match 4 stars 3.00 score 1 script
vibrrt:An R Wrapper for 'vibrato'
An R wrapper for 'vibrato' <https://github.com/daac-tools/vibrato>, a Rust reimplementation of 'MeCab' for fast tokenization.
Maintained by Akiru Kato. Last updated 16 days ago.
2.3 match 2.30 score 1 script
sudachir2:R Wrapper for 'sudachi.rs'
Offers bindings to 'sudachi.rs' <https://github.com/WorksApplications/sudachi.rs>, a Rust implementation of 'Sudachi' Japanese morphological analyzer.
Maintained by Akiru Kato. Last updated 16 days ago.
2.3 match 3 stars 2.18 score 3 scripts
EnvNJ:Whole Genome Phylogenies Using Sequence Environments
Contains utilities for the analysis of protein sequences in a phylogenetic context. Allows the generation of phylogenetic trees based on protein sequences in an alignment-independent way. Two different methods have been implemented. One approach is based on the frequency analysis of n-grams, previously described in Stuart et al. (2002) <doi:10.1093/bioinformatics/18.1.100>. The other approach is based on the species-specific neighborhood preference around amino acids. Features include the conversion of a protein set into a vector reflecting these neighborhood preferences, pairwise distances (dissimilarities) between these vectors, and the generation of trees based on these distance matrices.
Maintained by Juan Carlos Aledo. Last updated 3 years ago.
3.3 match 1.04 score 11 scripts
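The n-gram approach above reduces each protein to a frequency vector and builds the tree from pairwise distances between those vectors. A toy Python sketch of the vectorization-and-distance step — cosine distance is chosen here purely for illustration; EnvNJ's actual dissimilarity measures may differ:

```python
import math
from collections import Counter

def ngram_freqs(seq, n=2):
    """Relative frequencies of overlapping n-grams in a protein sequence."""
    grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return {g: c / len(grams) for g, c in Counter(grams).items()}

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse frequency vectors."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (norm_u * norm_v)

d = cosine_distance(ngram_freqs("MKVLAAGG"), ngram_freqs("MKVLAAGG"))
print(round(d, 6))  # identical sequences -> distance 0
```

Because no alignment is needed, sequences of different lengths compare directly through their shared n-gram vocabulary, which is the point of alignment-independent phylogenetics.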
textTools:Functions for Text Cleansing and Text Analysis
A framework for text cleansing and analysis. Conveniently prepare and process large amounts of text for analysis. Includes various metrics for word counts/frequencies that scale efficiently. Quickly analyze large amounts of text data using a text.table (a data.table created with one word (or unit of text analysis) per row, similar to the tidytext format). Offers flexibility to efficiently work with text data stored in vectors as well as text data formatted as a text.table.
Maintained by Timothy Conwell. Last updated 4 years ago.
3.0 match 1.00 score 4 scripts