Showing 28 of 28 results
dselivanov
text2vec:Modern Text Mining Framework for R
Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities. This package provides a source-agnostic streaming API, which allows researchers to perform analysis of collections of documents which are larger than available RAM. All core functions are parallelized to benefit from multicore machines.
Maintained by Dmitriy Selivanov. Last updated 8 months ago.
glove latent-dirichlet-allocation natural-language-processing text-mining topic-modeling vectorization word-embeddings word2vec cpp
860 stars 13.48 score 1.3k scripts 23 dependents
oscarkjell
text:Analyses of Text using Transformers Models from HuggingFace, Natural Language Processing and Machine Learning
Link R with Transformers from Hugging Face to transform text variables to word embeddings, where the word embeddings are used to statistically test the mean difference between sets of texts, compute semantic similarity scores between texts, predict numerical variables, and visualize statistically significant words according to various dimensions. For more information see <https://www.r-text.org>.
Maintained by Oscar Kjell. Last updated 7 days ago.
deep-learning machine-learning nlp transformers openjdk
145 stars 13.21 score 436 scripts 1 dependents
tommyjones
textmineR:Functions for Text Mining and Topic Modeling
An aid for text mining in R, with a syntax that should be familiar to experienced R users. Provides a wrapper for several topic models that take similarly-formatted input and give similarly-formatted output. Has additional functionality for analyzing and diagnosing topic models.
Maintained by Tommy Jones. Last updated 2 years ago.
106 stars 10.83 score 310 scripts 7 dependents
matloff
regtools:Regression and Classification Tools
Tools for linear, nonlinear and nonparametric regression and classification. Novel graphical methods for assessment of parametric models using nonparametric methods. One vs. All and All vs. All multiclass classification, optional class probabilities adjustment. Nonparametric regression (k-NN) for general dimension, local-linear option. Nonlinear regression with Eicker-White method for dealing with heteroscedasticity. Utilities for converting time series to rectangular form. Utilities for conversion between factors and indicator variables. Some code related to "Statistical Regression and Classification: from Linear Models to Machine Learning", N. Matloff, 2017, CRC, ISBN 9781498710916.
Maintained by Norm Matloff. Last updated 2 months ago.
127 stars 9.39 score 48 scripts 3 dependents
prodriguezsosa
conText:'a la Carte' on Text (ConText) Embedding Regression
A fast, flexible and transparent framework to estimate context-specific word and short document embeddings using the 'a la carte' embeddings approach developed by Khodak et al. (2018) <arXiv:1805.05388> and evaluate hypotheses about covariate effects on embeddings using the regression framework developed by Rodriguez et al. (2021) <https://github.com/prodriguezsosa/EmbeddingRegression>.
Maintained by Pedro L. Rodriguez. Last updated 11 months ago.
104 stars 9.10 score 1.7k scripts
theharmonylab
topics:Creating and Significance Testing Language Features for Visualisation
Implements differential language analysis with statistical tests and offers various language visualization techniques for n-grams and topics. It also supports the 'text' package. For more information, visit <https://r-topics.org/> and <https://www.r-text.org/>.
Maintained by Oscar Kjell. Last updated 3 days ago.
5 stars 8.38 score 22 scripts 2 dependents
matloff
qeML:Quick and Easy Machine Learning Tools
The letters 'qe' in the package title stand for "quick and easy," alluding to the convenience goal of the package. We bring together a variety of machine learning (ML) tools from standard R packages, providing wrappers with a simple, convenient, and uniform interface.
Maintained by Norm Matloff. Last updated 9 days ago.
41 stars 8.37 score 48 scripts 1 dependents
matloff
dsld:Data Science Looks at Discrimination
Statistical and graphical tools for detecting and measuring discrimination and bias, be it racial, gender, age or other. Detection and remediation of bias in machine learning algorithms. 'Python' interfaces available.
Maintained by Norm Matloff. Last updated 2 months ago.
12 stars 7.81 score 35 scripts
mhahsler
markovDP:Infrastructure for Discrete-Time Markov Decision Processes (MDP)
Provides the infrastructure to work with Markov Decision Processes (MDPs) in R. The focus is on convenience in formulating MDPs, the support of sparse representations (using sparse matrices, lists and data.frames) and visualization of results. Some key components are implemented in C++ to speed up computation. Several popular solvers are implemented.
Maintained by Michael Hahsler. Last updated 16 days ago.
control-theory markov-decision-process optimization cpp
7 stars 5.51 score 4 scripts
david-cortes
recometrics:Evaluation Metrics for Implicit-Feedback Recommender Systems
Calculates evaluation metrics for implicit-feedback recommender systems that are based on low-rank matrix factorization models, given the fitted model matrices and data, thus allowing comparison of models from a variety of libraries. Metrics include P@K (precision-at-k, for top-K recommendations), R@K (recall at k), AP@K (average precision at k), NDCG@K (normalized discounted cumulative gain at k), Hit@K (from which the 'Hit Rate' is calculated), RR@K (reciprocal rank at k, from which the 'MRR' or 'mean reciprocal rank' is calculated), ROC-AUC (area under the receiver-operating characteristic curve), and PR-AUC (area under the precision-recall curve). These are calculated on a per-user basis according to the ranking of items induced by the model, using efficient multi-threaded routines. Also provides functions for creating train-test splits for model fitting and evaluation.
Maintained by David Cortes. Last updated 3 months ago.
implicit-feedback matrix-factorization recommender-systems openblas cpp openmp
28 stars 5.45 score
occupationmeasurement
occupationMeasurement:Interactively Measure Occupations in Interviews and Beyond
Perform interactive occupation coding during interviews as described in Peycheva, D., Sakshaug, J., Calderwood, L. (2021) <doi:10.2478/jos-2021-0042> and Schierholz, M., Gensicke, M., Tschersich, N., Kreuter, F. (2018) <doi:10.1111/rssa.12297>. Generate suggestions for occupational categories based on free text input, with pre-trained machine learning models in German and a ready-to-use shiny application provided for quick and easy data collection.
Maintained by Jan Simson. Last updated 8 months ago.
3 stars 5.18 score 17 scripts
bioc
ttgsea:Tokenizing Text of Gene Set Enrichment Analysis
Functional enrichment analysis methods such as gene set enrichment analysis (GSEA) have been widely used for analyzing gene expression data. GSEA is a powerful method to infer results of gene expression data at the level of gene sets by calculating enrichment scores for predefined sets of genes. GSEA depends on the availability and accuracy of gene sets. There are overlaps between terms of gene sets or categories because multiple terms may exist for a single biological process, which can lead to redundancy within enriched terms. In other words, the sets of related terms are overlapping. Using deep learning, this package aims to predict enrichment scores for unique tokens or words from text in names of gene sets to resolve this overlapping set issue. Furthermore, we can coin a new term by combining tokens and find its enrichment score by predicting such combined tokens.
Maintained by Dongmin Jung. Last updated 5 months ago.
software geneexpression genesetenrichment
4.95 score 3 scripts 3 dependents
bioc
DeepPINCS:Protein Interactions and Networks with Compounds based on Sequences using Deep Learning
The identification of novel compound-protein interactions (CPI) is important in drug discovery. Revealing unknown compound-protein interactions is useful for designing a new drug for a target protein by screening candidate compounds. Accurate CPI prediction assists in an effective drug discovery process. To identify potential CPI effectively, prediction methods based on machine learning and deep learning have been developed. Data for sequences are provided as discrete symbolic data. In the data, compounds are represented as SMILES (simplified molecular-input line-entry system) strings and proteins are sequences in which the characters are amino acids. The outcome is defined as a variable that indicates how strongly two molecules interact with each other or whether there is an interaction between them. In this package, a deep-learning based model that takes only sequence information of both compounds and proteins as input and the outcome as output is used to predict CPI. The model is implemented by using compound and protein encoders with useful features. The CPI model also supports other modeling tasks, including protein-protein interaction (PPI), chemical-chemical interaction (CCI), or single compounds and proteins. Although the model is designed for proteins, DNA and RNA can be used if they are represented as sequences.
Maintained by Dongmin Jung. Last updated 5 months ago.
software network graphandnetwork neuralnetwork openjdk
4.78 score 4 scripts 2 dependents
mkearney
wactor:Word Factor Vectors
A user-friendly factor-like interface for converting strings of text into numeric vectors and rectangular data structures.
Maintained by Michael W. Kearney. Last updated 5 years ago.
text text-classification text-processing text-vectorization word-embeddings word-vectors word2vec
33 stars 4.52 score 3 scripts
javierdelahoz
LDAShiny:User-Friendly Interface for Review of Scientific Literature
Contains the development of a tool that provides a web-based graphical user interface (GUI) to perform a review of the scientific literature under the Bayesian approach of Latent Dirichlet Allocation (LDA) and machine learning algorithms. The application methodology is framed by the well-known procedures in topic modelling on how to clean and process data. Contains methods described by Blei, David M., Andrew Y. Ng, and Michael I. Jordan (2003) <https://jmlr.org/papers/volume3/blei03a/blei03a.pdf>; Thomas L. Griffiths and Mark Steyvers (2004) <doi:10.1073/pnas.0307752101>; Xiong Hui, et al. (2019) <doi:10.1016/j.cie.2019.06.010>.
Maintained by Javier De La Hoz Maestre. Last updated 4 years ago.
3 stars 4.48 score 3 scripts
huongtran53
PlotNormTest:Graphical Univariate/Multivariate Assessments for Normality Assumption
Graphical methods for testing the multivariate normality assumption. Methods include assessing the score function and cumulant generating functions, independent transformations, and linear transformations.
Maintained by Huong Tran. Last updated 5 months ago.
4.30 score
robindenz1
CareDensity:Calculate the Care Density or Fragmented Care Density Given a Patient-Sharing Network
Given a patient-sharing network, calculate either the classic care density as proposed by Pollack et al. (2013) <doi:10.1007/s11606-012-2104-7> or the fragmented care density as proposed by Engels et al. (2024) <doi:10.1186/s12874-023-02106-0>. By utilizing the 'igraph' and 'data.table' packages, the provided functions scale well for very large graphs.
Maintained by Robin Denz. Last updated 5 months ago.
care-coordination network-analysis patient-care
1 star 4.18 score 6 scripts
bioc
IFAA:Robust Inference for Absolute Abundance in Microbiome Analysis
This package offers a robust approach to make inference on the association of covariates with the absolute abundance (AA) of the microbiome in an ecosystem. It can also be applied directly to relative abundance (RA) data to make inference on AA because the ratio of two RA is equal to the ratio of their AA. This algorithm can estimate and test the associations of interest while adjusting for potential confounders. The estimates of this method are as easy to interpret as those of a typical regression analysis. High-dimensional covariates are handled with regularization, which is implemented by parallel computing. False discovery rate is automatically controlled by this approach. Zeros do not need to be imputed by a positive value for the analysis. The IFAA package also offers the 'MZILN' function for estimating and testing associations of abundance ratios with covariates.
Maintained by Zhigang Li. Last updated 5 months ago.
software technology sequencing microbiome regression
4.15 score 14 scripts
psychbruce
PsychWordVec:Word Embedding Research Framework for Psychological Science
An integrative toolbox of word embedding research that provides: (1) a collection of 'pre-trained' static word vectors in the '.RData' compressed format <https://psychbruce.github.io/WordVector_RData.pdf>; (2) a series of functions to process, analyze, and visualize word vectors; (3) a range of tests to examine conceptual associations, including the Word Embedding Association Test <doi:10.1126/science.aal4230> and the Relative Norm Distance <doi:10.1073/pnas.1720347115>, with permutation test of significance; (4) a set of training methods to locally train (static) word vectors from text corpora, including 'Word2Vec' <arXiv:1301.3781>, 'GloVe' <doi:10.3115/v1/D14-1162>, and 'FastText' <arXiv:1607.04606>; (5) a group of functions to download 'pre-trained' language models (e.g., 'GPT', 'BERT') and extract contextualized (dynamic) word vectors (based on the R package 'text').
Maintained by Han-Wu-Shuang Bao. Last updated 1 year ago.
bert cosine-similarity fasttext glove gpt language-model natural-language-processing nlp pretrained-models psychology semantic-analysis text-analysis text-mining tsne word-embeddings word-vectors word2vec openjdk
22 stars 4.04 score 10 scripts
bioc
VAExprs:Generating Samples of Gene Expression Data with Variational Autoencoders
A fundamental problem in biomedical research is the low number of observations, mostly due to a lack of available biosamples, prohibitive costs, or ethical reasons. Augmenting a few real observations with artificially generated samples could make their analysis more robust and more reproducible. One possible solution to the problem is the use of generative models, which are statistical models of data that attempt to capture the entire probability distribution from the observations. Using the variational autoencoder (VAE), a well-known deep generative model, this package aims to generate samples of gene expression data, especially for single-cell RNA-seq data. Furthermore, the VAE can use conditioning to produce specific cell types or subpopulations. The conditional VAE (CVAE) allows us to create targeted samples rather than completely random ones.
Maintained by Dongmin Jung. Last updated 5 months ago.
software geneexpression singlecell openjdk
4.00 score 4 scripts
emilhvitfeldt
wordsalad:Provide Tools to Extract and Analyze Word Vectors
Provides access to various word embedding methods (GloVe, fasttext and word2vec) to extract word vectors using a unified framework to increase reproducibility and correctness.
Maintained by Emil Hvitfeldt. Last updated 5 years ago.
8 stars 3.60 score 9 scripts
pilacuan-bonete-luis
LDABiplots:Biplot Graphical Interface for LDA Models
Contains the development of a tool that provides a web-based graphical user interface (GUI) to perform Biplot representations from a scraping of news from digital newspapers under the Bayesian approach of Latent Dirichlet Allocation (LDA) and machine learning algorithms. Contains LDA methods described by Blei, David M., Andrew Y. Ng and Michael I. Jordan (2003) <https://jmlr.org/papers/volume3/blei03a/blei03a.pdf>, and Biplot methods described by Gabriel K.R. (1971) <doi:10.1093/biomet/58.3.453> and Galindo-Villardon P. (1986) <https://diarium.usal.es/pgalindo/files/2012/07/Questiio.pdf>.
Maintained by Luis Pilacuan-Bonete. Last updated 3 years ago.
3.00 score 4 scripts
theogrost
NUSS:Mixed N-Grams and Unigram Sequence Segmentation
Segmentation of short text sequences, such as hashtags, into sequences of separate words, using a dictionary that may be built on a custom corpus of texts. A unigram dictionary is used to find the most probable sequence, and an n-gram approach is used to determine possible segmentations given the text corpus.
Maintained by Oskar Kosch. Last updated 8 months ago.
3.00 score 8 scripts
cran
cdparcoord:Top Frequency-Based Parallel Coordinates
Parallel coordinate plotting with resolutions for large data sets and missing values.
Maintained by Norm Matloff. Last updated 6 years ago.
2.70 score
kidoishi
MadanText:Persian Textmining Tool for Frequency Analysis, Statistical Analysis, and Word Clouds
MadanText is an open-source software designed specifically for text mining in the Persian language. It allows users to examine word frequencies, download data for analysis, and generate word clouds. This tool is particularly useful for researchers and analysts working with Persian language data.
Maintained by Kido Ishikawa. Last updated 1 year ago.
2.70 score
kidoishi
MadanTextNetwork:Persian Textmining Tool for Co-Occurrence_Network
MadanText_co-occurrence_network is an open-source software designed specifically for text mining in the Persian language. It adds co-occurrence network functionality to MadanText. The input file replaces the text format with an Excel format.
Maintained by Kido Ishikawa. Last updated 1 year ago.
2.70 score