Showing 200 of 407 total results
topepo
caret:Classification and Regression Training
Misc functions for training and plotting classification and regression models.
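A minimal illustrative sketch of the train() interface; assumes the 'randomForest' package backing method = "rf" is installed.
library(caret)
set.seed(123)
# 5-fold cross-validated random forest on the built-in iris data
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(Species ~ ., data = iris, method = "rf", trControl = ctrl)
fit$results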
Maintained by Max Kuhn. Last updated 4 months ago.
1.6k stars 19.24 score 61k scripts 303 dependents
tidymodels
recipes:Preprocessing and Feature Engineering Steps for Modeling
A recipe prepares your data for modeling. An extensible framework for pipeable sequences of feature engineering steps provides preprocessing tools to be applied to data. Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets. The resulting processed output can then be used as inputs for statistical or machine learning models.
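A minimal sketch of the estimate-then-apply cycle on built-in data (illustrative only):
library(recipes)
# Declare steps, estimate their parameters on a training set, then apply them
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())
prepped <- prep(rec, training = mtcars)
bake(prepped, new_data = mtcars)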
Maintained by Max Kuhn. Last updated 3 days ago.
586 stars 18.80 score 7.2k scripts 383 dependents
tidymodels
tidymodels:Easily Install and Load the 'Tidymodels' Packages
The tidy modeling "verse" is a collection of packages for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse.
Maintained by Max Kuhn. Last updated 1 months ago.
783 stars 16.52 score 66k scripts 15 dependents
tidymodels
tune:Tidy Tuning Tools
The ability to tune models is important. 'tune' contains functions and classes to be used in conjunction with other 'tidymodels' packages for finding reasonable values of hyper-parameters in models, pre-processing methods, and post-processing steps.
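An illustrative grid-search sketch; assumes the 'rpart' engine is available.
library(tidymodels)
# Tune a decision-tree cost-complexity parameter with 5-fold cross-validation
spec  <- decision_tree(cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("classification")
wf    <- workflow() |> add_model(spec) |> add_formula(Species ~ .)
folds <- vfold_cv(iris, v = 5)
res   <- tune_grid(wf, resamples = folds, grid = 10)
show_best(res, metric = "accuracy")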
Maintained by Max Kuhn. Last updated 27 days ago.
293 stars 14.27 score 756 scripts 39 dependents
business-science
timetk:A Tool Kit for Working with Time Series
Easy visualization, wrangling, and feature engineering of time series data for forecasting and machine learning prediction. Consolidates and extends time series functionality from packages including 'dplyr', 'stats', 'xts', 'forecast', 'slider', 'padr', 'recipes', and 'rsample'.
Maintained by Matt Dancho. Last updated 1 years ago.
coercion, coercion-functions, data-mining, dplyr, forecast, forecasting, forecasting-models, machine-learning, series-decomposition, series-signature, tibble, tidy, tidyquant, tidyverse, time, time-series, timeseries
626 stars 14.20 score 4.0k scripts 16 dependents
tidymodels
workflows:Modeling Workflows
Managing both a 'parsnip' model and a preprocessor, such as a model formula or recipe from 'recipes', can often be challenging. The goal of 'workflows' is to streamline this process by bundling the model alongside the preprocessor, all within the same object.
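A minimal sketch of bundling a recipe and a model specification (illustrative only):
library(parsnip)
library(recipes)
library(workflows)
# Bundle preprocessor and model, then fit both in one step
rec  <- recipe(mpg ~ ., data = mtcars) |> step_normalize(all_numeric_predictors())
spec <- linear_reg() |> set_engine("lm")
wf   <- workflow() |> add_recipe(rec) |> add_model(spec)
fit(wf, data = mtcars)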
Maintained by Simon Couch. Last updated 1 months ago.
207 stars 13.97 score 876 scripts 43 dependents
business-science
tidyquant:Tidy Quantitative Financial Analysis
Bringing business and financial analysis to the 'tidyverse'. The 'tidyquant' package provides a convenient wrapper to various 'xts', 'zoo', 'quantmod', 'TTR' and 'PerformanceAnalytics' package functions and returns the objects in the tidy 'tibble' format. The main advantage is being able to use quantitative functions with the 'tidyverse' functions including 'purrr', 'dplyr', 'tidyr', 'ggplot2', 'lubridate', etc. See the 'tidyquant' website for more information, documentation and examples.
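An illustrative sketch; assumes internet access, and the symbol and dates are placeholders.
library(tidyquant)
# Download daily prices as a tibble, then compute monthly returns
aapl <- tq_get("AAPL", get = "stock.prices", from = "2023-01-01")
aapl |>
  tq_transmute(select = adjusted, mutate_fun = periodReturn, period = "monthly")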
Maintained by Matt Dancho. Last updated 2 months ago.
dplyr, financial-analysis, financial-data, financial-statements, multiple-stocks, performance-analysis, performanceanalytics, quantmod, stock, stock-exchanges, stock-indexes, stock-lists, stock-performance, stock-prices, stock-symbol, tidyverse, time-series, timeseries, xts
872 stars 13.34 score 5.2k scripts
oscarkjell
text:Analyses of Text using Transformers Models from HuggingFace, Natural Language Processing and Machine Learning
Link R with Transformers from Hugging Face to transform text variables into word embeddings; the word embeddings can then be used to statistically test the mean difference between sets of texts, compute semantic similarity scores between texts, predict numerical variables, and visualise statistically significant words according to various dimensions. For more information see <https://www.r-text.org>.
Maintained by Oscar Kjell. Last updated 9 days ago.
deep-learning, machine-learning, nlp, transformers, openjdk
145 stars 13.21 score 436 scripts 1 dependents
r-dbi
bigrquery:An Interface to Google's 'BigQuery' 'API'
Easily talk to Google's 'BigQuery' database from R.
Maintained by Hadley Wickham. Last updated 1 months ago.
520 stars 12.47 score 1.8k scripts 4 dependents
tidymodels
probably:Tools for Post-Processing Predicted Values
Models can be improved by post-processing class probabilities via recalibration, conversion of probabilities to hard class predictions, assessment of equivocal zones, and other activities. 'probably' contains tools for conducting these operations as well as calibration tools and conformal inference techniques for regression models.
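A sketch using example predictions assumed to ship with the package (segment_logistic):
library(probably)
data(segment_logistic)  # predicted probabilities (.pred_good, .pred_poor) plus truth (Class)
# Flag predictions near 0.5 as equivocal, then profile performance across thresholds
segment_logistic$cls <- make_two_class_pred(
  segment_logistic$.pred_good, levels(segment_logistic$Class),
  threshold = 0.5, buffer = 0.05
)
threshold_perf(segment_logistic, Class, .pred_good, thresholds = seq(0.2, 0.8, by = 0.1))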
Maintained by Max Kuhn. Last updated 6 months ago.
115 stars 12.09 score 21k scripts 1 dependents
tidymodels
workflowsets:Create a Collection of 'tidymodels' Workflows
A workflow is a combination of a model and preprocessors (e.g., a formula, recipe, etc.) (Kuhn and Silge (2021) <https://www.tmwr.org/>). In order to try different combinations of these, an object can be created that contains many workflows. There are functions to create workflows en masse, as well as to train them and visualize the results.
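An illustrative sketch crossing two preprocessors with two model specifications:
library(tidymodels)
library(workflowsets)
# Build every preprocessor x model combination, then resample each workflow
rec_plain <- recipe(mpg ~ ., data = mtcars)
rec_norm  <- rec_plain |> step_normalize(all_numeric_predictors())
wfs <- workflow_set(
  preproc = list(plain = rec_plain, normalized = rec_norm),
  models  = list(lm = linear_reg(), tree = decision_tree(mode = "regression"))
)
wfs |> workflow_map("fit_resamples", resamples = vfold_cv(mtcars, v = 5))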
Maintained by Simon Couch. Last updated 5 months ago.
94 stars 12.04 score 294 scripts 19 dependents
zachmayer
caretEnsemble:Ensembles of Caret Models
Functions for creating ensembles of caret models: caretList() and caretStack(). caretList() is a convenience function for fitting multiple caret::train() models to the same dataset. caretStack() will make linear or non-linear combinations of these models, using a caret::train() model as a meta-model.
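A minimal sketch (illustrative; assumes the 'rpart' engine is installed):
library(caret)
library(caretEnsemble)
set.seed(42)
dat  <- twoClassSim(200)  # simulated two-class data from caret
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     savePredictions = "final")
models <- caretList(Class ~ ., data = dat, trControl = ctrl,
                    methodList = c("glm", "rpart"))
stack  <- caretStack(models, method = "glm")
summary(stack)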
Maintained by Zachary A. Deane-Mayer. Last updated 3 months ago.
226 stars 11.98 score 780 scripts 1 dependents
hannameyer
CAST:'caret' Applications for Spatial-Temporal Models
Supporting functionality to run 'caret' with spatial or spatial-temporal data. 'caret' is a frequently used package for model training and prediction using machine learning. CAST includes functions to improve spatial or spatial-temporal modelling tasks using 'caret'. It includes the newly suggested 'Nearest neighbor distance matching' cross-validation to estimate the performance of spatial prediction models and allows for spatial variable selection to select suitable predictor variables in view of their contribution to spatial model performance. CAST further includes functionality to estimate the (spatial) area of applicability of prediction models. Methods are described in Meyer et al. (2018) <doi:10.1016/j.envsoft.2017.12.001>; Meyer et al. (2019) <doi:10.1016/j.ecolmodel.2019.108815>; Meyer and Pebesma (2021) <doi:10.1111/2041-210X.13650>; Milà et al. (2022) <doi:10.1111/2041-210X.13851>; Meyer and Pebesma (2022) <doi:10.1038/s41467-022-29838-9>; Linnenbrink et al. (2023) <doi:10.5194/egusphere-2023-1308>; Schumacher et al. (2024) <doi:10.5194/egusphere-2024-2730>. The package is described in detail in Meyer et al. (2024) <doi:10.48550/arXiv.2404.06978>.
Maintained by Hanna Meyer. Last updated 2 months ago.
autocorrelation, caret, feature-selection, machine-learning, overfitting, predictive-modeling, spatial, spatio-temporal, variable-selection
114 stars 11.85 score 298 scripts 1 dependents
tidymodels
stacks:Tidy Model Stacking
Model stacking is an ensemble technique that involves training a model to combine the outputs of many diverse statistical models, and has been shown to improve predictive performance in a variety of settings. 'stacks' implements a grammar for 'tidymodels'-aligned model stacking.
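A compact sketch of the stacking grammar (illustrative; assumes 'glmnet' is installed for blending):
library(tidymodels)
library(stacks)
set.seed(1)
folds <- vfold_cv(mtcars, v = 5)
ctrl  <- control_stack_resamples()   # keep predictions and workflows for stacking
lin   <- workflow(mpg ~ ., linear_reg()) |> fit_resamples(folds, control = ctrl)
tree  <- workflow(mpg ~ ., decision_tree(mode = "regression")) |>
  fit_resamples(folds, control = ctrl)
ens <- stacks() |> add_candidates(lin) |> add_candidates(tree) |>
  blend_predictions() |> fit_members()
predict(ens, mtcars)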
Maintained by Simon Couch. Last updated 5 months ago.
298 stars 11.46 score 840 scripts
tidymodels
textrecipes:Extra 'Recipes' for Text Processing
Converting text to numerical features requires specifically created procedures, which are implemented as steps according to the 'recipes' package. These steps allow for tokenization, filtering, counting (tf and tfidf) and feature hashing.
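A small sketch of the tokenize/filter/tf-idf steps (illustrative; assumes the 'tokenizers' backend is installed):
library(recipes)
library(textrecipes)
# Tokenize a text column, keep the most frequent tokens, then compute tf-idf
df  <- data.frame(text = c("feature engineering for text", "text steps in recipes"))
rec <- recipe(~ text, data = df) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 10) |>
  step_tfidf(text)
rec |> prep() |> bake(new_data = NULL)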
Maintained by Emil Hvitfeldt. Last updated 12 days ago.
160 stars 10.86 score 964 scripts 1 dependents
business-science
modeltime:The Tidymodels Extension for Time Series Modeling
The time series forecasting framework for use with the 'tidymodels' ecosystem. Models include ARIMA, Exponential Smoothing, and additional time series models from the 'forecast' and 'prophet' packages. Refer to "Forecasting Principles & Practice, Second edition" (<https://otexts.com/fpp2/>). Refer to "Prophet: forecasting at scale" (<https://research.facebook.com/blog/2017/02/prophet-forecasting-at-scale/>).
Maintained by Matt Dancho. Last updated 5 months ago.
arima, data-science, deep-learning, ets, forecasting, machine-learning, machine-learning-algorithms, modeltime, prophet, tbats, tidymodeling, tidymodels, time, time-series, time-series-analysis, timeseries, timeseries-forecasting
551 stars 10.61 score 1.1k scripts 7 dependents
tidymodels
themis:Extra Recipes Steps for Dealing with Unbalanced Data
A dataset with an uneven number of cases in each class is said to be unbalanced. Many models produce subpar performance on unbalanced datasets. A dataset can be balanced by increasing the number of minority cases using SMOTE 2011 <doi:10.48550/arXiv.1106.1813>, BorderlineSMOTE 2005 <doi:10.1007/11538059_91> and ADASYN 2008 <https://ieeexplore.ieee.org/document/4633969>, or by decreasing the number of majority cases using NearMiss 2003 <https://www.site.uottawa.ca/~nat/Workshop2003/jzhang.pdf> or Tomek link removal 1976 <https://ieeexplore.ieee.org/document/4309452>.
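An illustrative SMOTE sketch on an artificially unbalanced subset of iris:
library(recipes)
library(themis)
# 50 setosa vs. 10 versicolor; SMOTE oversamples the minority class
imbal <- iris[c(1:50, 51:60), ]
imbal$Species <- factor(imbal$Species)   # drop the unused third level
rec <- recipe(Species ~ ., data = imbal) |> step_smote(Species)
table(bake(prep(rec), new_data = NULL)$Species)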
Maintained by Emil Hvitfeldt. Last updated 2 months ago.
143 stars 10.37 score 1.3k scripts 2 dependents
ludvigolsen
cvms:Cross-Validation for Model Selection
Cross-validate one or multiple regression and classification models and get relevant evaluation metrics in a tidy format. Validate the best model on a test set and compare it to a baseline evaluation. Alternatively, evaluate predictions from an external model. Currently supports regression and classification (binary and multiclass). Described in chp. 5 of Jeyaraman, B. P., Olsen, L. R., & Wambugu M. (2019, ISBN: 9781838550134).
Maintained by Ludvig Renbo Olsen. Last updated 25 days ago.
39 stars 10.31 score 492 scripts 5 dependents
bioc
pRoloc:A unifying bioinformatics framework for spatial proteomics
The pRoloc package implements machine learning and visualisation methods for the analysis and interrogation of quantitative mass spectrometry data to reliably infer protein sub-cellular localisation.
Maintained by Lisa Breckels. Last updated 4 days ago.
immunooncology, proteomics, massspectrometry, classification, clustering, qualitycontrol, bioconductor, proteomics-data, spatial-proteomics, visualisation, openblas, cpp
15 stars 10.31 score 101 scripts 2 dependents
bleutner
RStoolbox:Remote Sensing Data Analysis
Toolbox for remote sensing image processing and analysis such as calculating spectral indexes, principal component transformation, unsupervised and supervised classification or fractional cover analyses.
Maintained by Konstantin Mueller. Last updated 2 months ago.
ggplot2, land-cover-mapping, remote-sensing, spectral-unmixing, supervised-classification, unsupervised-classification, openblas, cpp
275 stars 10.10 score 1.1k scripts
tlverse
sl3:Pipelines for Machine Learning and Super Learning
A modern implementation of the Super Learner prediction algorithm, coupled with a general purpose framework for composing arbitrary pipelines for machine learning tasks.
Maintained by Jeremy Coyle. Last updated 5 months ago.
data-science, ensemble-learning, ensemble-model, machine-learning, model-selection, regression, stacking, statistics
100 stars 9.94 score 748 scripts 7 dependents
ohdsi
CohortConstructor:Build and Manipulate Study Cohorts Using a Common Data Model
Create and manipulate study cohorts in data mapped to the Observational Medical Outcomes Partnership Common Data Model.
Maintained by Edward Burn. Last updated 3 days ago.
2 stars 9.73 score 207 scripts 2 dependents
bblonder
hypervolume:High Dimensional Geometry, Set Operations, Projection, and Inference Using Kernel Density Estimation, Support Vector Machines, and Convex Hulls
Estimates the shape and volume of high-dimensional datasets and performs set operations: intersection / overlap, union, unique components, inclusion test, and hole detection. Uses stochastic geometry approach to high-dimensional kernel density estimation, support vector machine delineation, and convex hull generation. Applications include modeling trait and niche hypervolumes and species distribution modeling.
Maintained by Benjamin Blonder. Last updated 2 months ago.
23 stars 9.69 score 211 scripts 7 dependents
business-science
anomalize:Tidy Anomaly Detection
The 'anomalize' package enables a "tidy" workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it's quite simple to decompose time series, detect anomalies, and create bands separating the "normal" data from the anomalous data at scale (i.e. for multiple time series). Time series decomposition is used to remove trend and seasonal components via the time_decompose() function and methods include seasonal decomposition of time series by Loess ("stl") and seasonal decomposition by piecewise medians ("twitter"). The anomalize() function implements two methods for anomaly detection of residuals including using an inner quartile range ("iqr") and generalized extreme studentized deviation ("gesd"). These methods are based on those used in the 'forecast' package and the Twitter 'AnomalyDetection' package. Refer to the associated functions for specific references for these methods.
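A sketch of the three-step workflow on the bundled CRAN downloads data (illustrative):
library(dplyr)
library(anomalize)
# Decompose, flag anomalous remainders, then recompose bounds for plotting
tidyverse_cran_downloads %>%
  filter(package == "tidyr") %>%
  time_decompose(count, method = "stl") %>%
  anomalize(remainder, method = "iqr") %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE)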
Maintained by Matt Dancho. Last updated 1 years ago.
anomaly, anomaly-detection, decomposition, detect-anomalies, iqr, time-series
339 stars 9.56 score 332 scripts
ndphillips
FFTrees:Generate, Visualise, and Evaluate Fast-and-Frugal Decision Trees
Create, visualize, and test fast-and-frugal decision trees (FFTs) using the algorithms and methods described by Phillips, Neth, Woike & Gaissmaier (2017), <doi:10.1017/S1930297500006239>. FFTs are simple and transparent decision trees for solving binary classification problems. FFTs can be preferable to more complex algorithms because they require very little information, are easy to understand and communicate, and are robust against overfitting.
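An illustrative sketch using the heart disease data bundled with the package:
library(FFTrees)
# Fit a fast-and-frugal tree, then evaluate and plot it on the test set
heart_fft <- FFTrees(formula = diagnosis ~ .,
                     data = heart.train, data.test = heart.test,
                     decision.labels = c("Low-Risk", "High-Risk"))
plot(heart_fft, data = "test")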
Maintained by Hansjoerg Neth. Last updated 5 months ago.
136 stars 9.53 score 144 scripts
microsoft
finnts:Microsoft Finance Time Series Forecasting Framework
Automated time series forecasting developed by Microsoft Finance. The Microsoft Finance Time Series Forecasting Framework, aka Finn, can be used to forecast any component of the income statement, balance sheet, or any other area of interest by finance. Finn can be used to forecast any numerical quantity over time. While it can be applied outside of the finance domain, Finn was built to meet the needs of financial analysts to better forecast their businesses within a company, and has many built-in features that are specific to the needs of financial forecasters. Happy forecasting!
Maintained by Mike Tokic. Last updated 1 months ago.
business, data-science, feature-selection, finance, finnts, forecasting, machine-learning, microsoft, time-series
194 stars 9.30 score 39 scripts
business-science
sweep:Tidy Tools for Forecasting
Tidies up the forecasting modeling and prediction work flow, extends the 'broom' package with 'sw_tidy', 'sw_glance', 'sw_augment', and 'sw_tidy_decomp' functions for various forecasting models, and enables converting 'forecast' objects to "tidy" data frames with 'sw_sweep'.
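A minimal sketch with a 'forecast' model on a built-in series (illustrative only):
library(forecast)
library(sweep)
# Tidy the fitted model and sweep its forecasts into a tidy data frame
fit <- auto.arima(USAccDeaths)
sw_tidy(fit)
sw_glance(fit)
forecast(fit, h = 12) |> sw_sweep()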
Maintained by Matt Dancho. Last updated 1 years ago.
broom, forecast, forecasting-models, prediction, tidy, tidyverse, time, time-series, timeseries
155 stars 9.23 score 399 scripts 1 dependents
tidymodels
embed:Extra Recipes for Encoding Predictors
Predictors can be converted to one or more numeric representations using a variety of methods. Effect encodings using simple generalized linear models <doi:10.48550/arXiv.1611.09477> or nonlinear models <doi:10.48550/arXiv.1604.06737> can be used. There are also functions for dimension reduction and other approaches.
Maintained by Emil Hvitfeldt. Last updated 2 months ago.
142 stars 9.18 score 1.1k scripts
mlverse
tabnet:Fit 'TabNet' Models for Classification and Regression
Implements the 'TabNet' model by Sercan O. Arik et al. (2019) <doi:10.48550/arXiv.1908.07442> with 'Coherent Hierarchical Multi-label Classification Networks' by Giunchiglia et al. <doi:10.48550/arXiv.2010.10151> and provides a consistent interface for fitting and creating predictions. It's also fully compatible with the 'tidymodels' ecosystem.
Maintained by Christophe Regouby. Last updated 14 hours ago.
109 stars 9.05 score 65 scripts
pedrohcgs
DRDID:Doubly Robust Difference-in-Differences Estimators
Implements the locally efficient doubly robust difference-in-differences (DiD) estimators for the average treatment effect proposed by Sant'Anna and Zhao (2020) <doi:10.1016/j.jeconom.2020.06.003>. The estimator combines inverse probability weighting and outcome regression estimators (also implemented in the package) to form estimators with more attractive statistical properties. Two different estimation methods can be used to estimate the nuisance functions.
Maintained by Pedro H. C. SantAnna. Last updated 6 months ago.
92 stars 8.88 score 133 scripts 5 dependents
evolecolgroup
tidysdm:Species Distribution Models with Tidymodels
Fit species distribution models (SDMs) using the 'tidymodels' framework, which provides a standardised interface to define models and process their outputs. 'tidysdm' expands 'tidymodels' by providing methods for spatial objects, models and metrics specific to SDMs, as well as a number of specialised functions to process occurrences for contemporary and palaeo datasets. The full functionalities of the package are described in Leonardi et al. (2023) <doi:10.1101/2023.07.24.550358>.
Maintained by Andrea Manica. Last updated 24 days ago.
species-distribution-modelling, tidymodels
31 stars 8.82 score 51 scripts
ropensci
weatherOz:An API Client for Australian Weather and Climate Data Resources
Provides automated downloading, parsing and formatting of weather data for Australia through API endpoints provided by the Department of Primary Industries and Regional Development ('DPIRD') of Western Australia and by the Science and Technology Division of the Queensland Government's Department of Environment and Science ('DES'). As well as the Bureau of Meteorology ('BOM') of the Australian government precis and coastal forecasts, and downloading and importing radar and satellite imagery files. 'DPIRD' weather data are accessed through public 'APIs' provided by 'DPIRD', <https://www.agric.wa.gov.au/weather-api-20>, providing access to weather station data from the 'DPIRD' weather station network. Australia-wide weather data are based on data from the Australian Bureau of Meteorology ('BOM') data and accessed through 'SILO' (Scientific Information for Land Owners) Jeffrey et al. (2001) <doi:10.1016/S1364-8152(01)00008-1>. 'DPIRD' data are made available under a Creative Commons Attribution 3.0 Licence (CC BY 3.0 AU) license <https://creativecommons.org/licenses/by/3.0/au/deed.en>. SILO data are released under a Creative Commons Attribution 4.0 International licence (CC BY 4.0) <https://creativecommons.org/licenses/by/4.0/>. 'BOM' data are (c) Australian Government Bureau of Meteorology and released under a Creative Commons (CC) Attribution 3.0 licence or Public Access Licence ('PAL') as appropriate, see <http://www.bom.gov.au/other/copyright.shtml> for further details.
Maintained by Rodrigo Pires. Last updated 1 months ago.
dpird, bom, meteorological-data, weather-forecast, australia, weather, weather-data, meteorology, western-australia, australia-bureau-of-meteorology, western-australia-agriculture, australia-agriculture, australia-climate, australia-weather, api-client, climate, data, rainfall, weather-api
31 stars 8.47 score 40 scripts
tidymodels
tidyposterior:Bayesian Analysis to Compare Models using Resampling Statistics
Bayesian analysis used here to answer the question: "when looking at resampling results, are the differences between models 'real'?" To answer this, a model can be created where the performance statistic is the resampling statistic (e.g. accuracy or RMSE). These values are explained by the model types. In doing this, we can get parameter estimates for each model's effect on performance and make statistical (and practical) comparisons between models. The methods included here are similar to Benavoli et al (2017) <https://jmlr.org/papers/v18/16-305.html>.
Maintained by Max Kuhn. Last updated 6 months ago.
102 stars 8.44 score 273 scripts
tidymodels
finetune:Additional Functions for Model Tuning
The ability to tune models is important. 'finetune' enhances the 'tune' package by providing more specialized methods for finding reasonable values of model tuning parameters. Two racing methods described by Kuhn (2014) <arXiv:1405.6974> are included. An iterative search method using generalized simulated annealing (Bohachevsky, Johnson and Stein, 1986) <doi:10.1080/00401706.1986.10488128> is also included.
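An illustrative racing sketch; assumes the 'rpart' engine is available.
library(tidymodels)
library(finetune)
# ANOVA-based racing discards clearly inferior candidates early
spec  <- decision_tree(cost_complexity = tune()) |>
  set_engine("rpart") |> set_mode("classification")
wf    <- workflow() |> add_model(spec) |> add_formula(Species ~ .)
res   <- tune_race_anova(wf, resamples = vfold_cv(iris, v = 5), grid = 20)
show_best(res, metric = "accuracy")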
Maintained by Max Kuhn. Last updated 8 months ago.
62 stars 8.36 score 704 scripts 1 dependents
cefet-rj-dal
harbinger:A Unified Time Series Event Detection Framework
By analyzing time series, it is possible to observe significant changes in the behavior of observations that frequently characterize events. Events present themselves as anomalies, change points, or motifs. In the literature, there are several methods for detecting events. However, searching for a suitable time series method is a complex task, especially considering that the nature of events is often unknown. This work presents Harbinger, a framework for integrating and analyzing event detection methods. Harbinger contains several state-of-the-art methods described in Salles et al. (2020) <doi:10.5753/sbbd.2020.13626>.
Maintained by Eduardo Ogasawara. Last updated 4 months ago.
18 stars 8.32 score 216 scripts
business-science
modeltime.ensemble:Ensemble Algorithms for Time Series Forecasting with Modeltime
A 'modeltime' extension that implements time series ensemble forecasting methods including model averaging, weighted averaging, and stacking. These techniques are popular methods to improve forecast accuracy and stability.
Maintained by Matt Dancho. Last updated 9 months ago.
ensemble, ensemble-learning, forecast, forecasting, modeltime, stacking, stacking-ensemble, tidymodels, time, time-series, timeseries
77 stars 8.30 score 143 scripts
darwin-eu
DrugUtilisation:Summarise Patient-Level Drug Utilisation in Data Mapped to the OMOP Common Data Model
Summarise patient-level drug utilisation cohorts using data mapped to the Observational Medical Outcomes Partnership (OMOP) common data model. New users and prevalent users cohorts can be generated and their characteristics, indication and drug use summarised.
Maintained by Martí Català. Last updated 2 months ago.
8.20 score 156 scripts 2 dependents
bioc
POMA:Tools for Omics Data Analysis
The POMA package offers a comprehensive toolkit designed for omics data analysis, streamlining the process from initial visualization to final statistical analysis. Its primary goal is to simplify and unify the various steps involved in omics data processing, making it more accessible and manageable within a single, intuitive R package. Emphasizing reproducibility and user-friendliness, POMA leverages the standardized SummarizedExperiment class from Bioconductor, ensuring seamless integration and compatibility with a wide array of Bioconductor tools. This approach guarantees maximum flexibility and replicability, making POMA an essential asset for researchers handling omics datasets. See <https://github.com/pcastellanoescuder/POMAShiny> and Castellano-Escuder et al. (2021) <doi:10.1371/journal.pcbi.1009148> for more details.
Maintained by Pol Castellano-Escuder. Last updated 4 months ago.
batcheffect, classification, clustering, decisiontree, dimensionreduction, multidimensionalscaling, normalization, preprocessing, principalcomponent, regression, rnaseq, software, statisticalmethod, visualization, bioconductor, bioinformatics, data-visualization, dimension-reduction, exploratory-data-analysis, machine-learning, omics-data-integration, pipeline, pre-processing, statistical-analysis, user-friendly, workflow
11 stars 8.16 score 20 scripts 1 dependents
darwin-eu
IncidencePrevalence:Estimate Incidence and Prevalence using the OMOP Common Data Model
Calculate incidence and prevalence using data mapped to the Observational Medical Outcomes Partnership (OMOP) common data model. Incidence and prevalence can be estimated for the total population in a database or for a stratification cohort.
Maintained by Edward Burn. Last updated 21 days ago.
9 stars 7.96 score 102 scripts 1 dependents
brian-j-smith
MachineShop:Machine Learning Models and Tools
Meta-package for statistical and machine learning with a unified interface for model fitting, prediction, performance assessment, and presentation of results. Approaches for model fitting and prediction of numerical, categorical, or censored time-to-event outcomes include traditional regression models, regularization methods, tree-based methods, support vector machines, neural networks, ensembles, data preprocessing, filtering, and model tuning and selection. Performance metrics are provided for model assessment and can be estimated with independent test sets, split sampling, cross-validation, or bootstrap resampling. Resample estimation can be executed in parallel for faster processing and nested in cases of model tuning and selection. Modeling results can be summarized with descriptive statistics; calibration curves; variable importance; partial dependence plots; confusion matrices; and ROC, lift, and other performance curves.
Maintained by Brian J Smith. Last updated 7 months ago.
classification-models, machine-learning, predictive-modeling, regression-models, survival-models
62 stars 7.95 score 121 scripts
bcallaway11
BMisc:Miscellaneous Functions for Panel Data, Quantiles, and Printing Results
These are miscellaneous functions for working with panel data, quantiles, and printing results. For panel data, the package includes functions for making a panel data balanced (that is, dropping missing individuals that have missing observations in any time period), converting id numbers to row numbers, and to treat repeated cross sections as panel data under the assumption of rank invariance. For quantiles, there are functions to make distribution functions from a set of data points (this is particularly useful when a distribution function is created in several steps), to combine distribution functions based on some external weights, and to invert distribution functions. Finally, there are several other miscellaneous functions for obtaining weighted means, weighted distribution functions, and weighted quantiles; to generate summary statistics and their differences for two groups; and to add or drop covariates from formulas.
Maintained by Brantly Callaway. Last updated 2 months ago.
7 stars 7.92 score 110 scripts 8 dependents
tlverse
tmle3:The Extensible TMLE Framework
A general framework supporting the implementation of targeted maximum likelihood estimators (TMLEs) of a diverse range of statistical target parameters through a unified interface. The goal is that the exposed framework be as general as the mathematical framework upon which it draws.
Maintained by Jeremy Coyle. Last updated 5 months ago.
causal-inference, machine-learning, targeted-learning, variable-importance
38 stars 7.91 score 286 scripts 5 dependents
myles-lewis
nestedcv:Nested Cross-Validation with 'glmnet' and 'caret'
Implements nested k*l-fold cross-validation for lasso and elastic-net regularised linear models via the 'glmnet' package and other machine learning models via the 'caret' package <doi:10.1093/bioadv/vbad048>. Cross-validation of 'glmnet' alpha mixing parameter and embedded fast filter functions for feature selection are provided. Described as double cross-validation by Stone (1977) <doi:10.1111/j.2517-6161.1977.tb01603.x>. Also implemented is a method using outer CV to measure unbiased model performance metrics when fitting Bayesian linear and logistic regression shrinkage models using the horseshoe prior over parameters to encourage a sparse model as described by Piironen & Vehtari (2017) <doi:10.1214/17-EJS1337SI>.
Maintained by Myles Lewis. Last updated 11 days ago.
12 stars 7.90 score 46 scripts
kylebutts
did2s:Two-Stage Difference-in-Differences Following Gardner (2021)
Estimates Two-way Fixed Effects difference-in-differences/event-study models using the approach proposed by Gardner (2021) <doi:10.48550/arXiv.2207.05943>. To avoid the problems caused by OLS estimation of the Two-way Fixed Effects model, this function first estimates the fixed effects and covariates using untreated observations and then in a second stage, estimates the treatment effects.
Maintained by Kyle Butts. Last updated 15 days ago.
97 stars 7.89 score 134 scripts
bioc
mistyR:Multiview Intercellular SpaTial modeling framework
mistyR is an implementation of the Multiview Intercellular SpaTialmodeling framework (MISTy). MISTy is an explainable machine learning framework for knowledge extraction and analysis of single-cell, highly multiplexed, spatially resolved data. MISTy facilitates an in-depth understanding of marker interactions by profiling the intra- and intercellular relationships. MISTy is a flexible framework able to process a custom number of views. Each of these views can describe a different spatial context, i.e., define a relationship among the observed expressions of the markers, such as intracellular regulation or paracrine regulation, but also, the views can also capture cell-type specific relationships, capture relations between functional footprints or focus on relations between different anatomical regions. Each MISTy view is considered as a potential source of variability in the measured marker expressions. Each MISTy view is then analyzed for its contribution to the total expression of each marker and is explained in terms of the interactions with other measurements that led to the observed contribution.
Maintained by Jovan Tanevski. Last updated 5 months ago.
software, biomedicalinformatics, cellbiology, systemsbiology, regression, decisiontree, singlecell, spatial, bioconductor, biology, intercellular, machine-learning, modular, molecular-biology, multiview, spatial-transcriptomics
52 stars 7.87 score 160 scripts
schlosslab
mikropml:User-Friendly R Package for Supervised Machine Learning Pipelines
An interface to build machine learning models for classification and regression problems. 'mikropml' implements the ML pipeline described by Topçuoğlu et al. (2020) <doi:10.1128/mBio.00434-20> with reasonable default options for data preprocessing, hyperparameter tuning, cross-validation, testing, model evaluation, and interpretation steps. See the website <https://www.schlosslab.org/mikropml/> for more information, documentation, and examples.
Maintained by Kelly Sovacool. Last updated 2 years ago.
56 stars 7.83 score 86 scripts
matloff
dsld:Data Science Looks at Discrimination
Statistical and graphical tools for detecting and measuring discrimination and bias, be it racial, gender, age or other. Detection and remediation of bias in machine learning algorithms. 'Python' interfaces available.
Maintained by Norm Matloff. Last updated 2 months ago.
12 stars 7.81 score 35 scripts
ohdsi
PhenotypeR:Assess Study Cohorts Using a Common Data Model
Phenotype study cohorts in data mapped to the Observational Medical Outcomes Partnership Common Data Model. Diagnostics are run at the database, code list, cohort, and population level to assess whether study cohorts are ready for research.
Maintained by Edward Burn. Last updated 2 hours ago.
3 stars 7.76 score 57 scripts
nsaph-software
CausalGPS:Matching on Generalized Propensity Scores with Continuous Exposures
Provides a framework for estimating causal effects of a continuous exposure using observational data, and implementing matching and weighting on the generalized propensity score. Wu, X., Mealli, F., Kioumourtzoglou, M.A., Dominici, F. and Braun, D., 2022. Matching on generalized propensity scores with continuous exposures. Journal of the American Statistical Association, pp.1-29.
Maintained by Naeem Khoshnevis. Last updated 10 months ago.
24 stars 7.67 score 39 scripts
spsanderson
healthyR.ts:The Time Series Modeling Companion to 'healthyR'
Hospital time series data analysis workflow tools, modeling, and automations. This library provides many useful tools to review common administrative time series hospital data. Some of these include average length of stay, and readmission rates. The aim is to provide a simple and consistent verb framework that takes the guesswork out of everything.
Maintained by Steven Sanderson. Last updated 6 months ago.
ai, arima-forecasting, arima-model, ets, forecasting, ggplot2, machine-learning, modeling, prophet, time-series, time-series-analysis, workflows
19 stars 7.58 score 56 scripts 1 dependents
ohdsi
CohortSymmetry:Sequence Symmetry Analysis Using the Observational Medical Outcomes Partnership Common Data Model
Calculating crude sequence ratio, adjusted sequence ratio and confidence intervals using data mapped to the Observational Medical Outcomes Partnership Common Data Model.
Maintained by Xihang Chen. Last updated 7 days ago.
1 stars 7.52 score 73 scripts
risktoollib
RTL:Risk Tool Library - Trading, Risk, Analytics for Commodities
A toolkit for Commodities 'analytics', risk management and trading professionals. Includes functions for API calls to <https://commodities.morningstar.com/#/>, <https://developer.genscape.com/>, and <https://www.bankofcanada.ca/valet/docs>.
Maintained by Philippe Cote. Last updated 1 months ago.
analytics, api, commodities, commodities-api, finance, genscape, morningstar, python, risk-management, cpp
30 stars 7.51 score 198 scripts
ohdsi
OmopSketch:Characterise Tables of an OMOP Common Data Model Instance
Summarises key information in data mapped to the Observational Medical Outcomes Partnership (OMOP) common data model. Assess suitability to perform specific epidemiological studies and explore the different domains to obtain feasibility counts and trends.
Maintained by Cecilia Campanile. Last updated 15 days ago.
2 stars 7.47 score 16 scripts 1 dependents
rgcca-factory
RGCCA:Regularized and Sparse Generalized Canonical Correlation Analysis for Multiblock Data
Multi-block data analysis concerns the analysis of several sets of variables (blocks) observed on the same group of individuals. The main aims of the RGCCA package are: to study the relationships between blocks and to identify subsets of variables of each block which are active in their relationships with the other blocks. This package allows the user to (i) run R/SGCCA and related methods, (ii) find the optimal parameters for R/SGCCA such as regularization parameters (tau or sparsity), (iii) evaluate the stability of the RGCCA results and their significance, (iv) build predictive models from R/SGCCA, and (v) apply generic print() and plot() functions to all these functionalities.
Maintained by Arthur Tenenhaus. Last updated 9 months ago.
12 stars 7.43 score 74 scripts
spsanderson
healthyR.ai:The Machine Learning and AI Modeling Companion to 'healthyR'
Hospital machine learning and ai data analysis workflow tools, modeling, and automations. This library provides many useful tools to review common administrative hospital data. Some of these include predicting length of stay, and readmits. The aim is to provide a simple and consistent verb framework that takes the guesswork out of everything.
Maintained by Steven Sanderson. Last updated 2 months ago.
ai, artificial-intelligence, healthcare, analytics, healthyr, healthyverse, machine-learning
16 stars 7.37 score 36 scripts 1 dependents
spsanderson
healthyR:Hospital Data Analysis Workflow Tools
Hospital data analysis workflow tools, modeling, and automations. This library provides many useful tools to review common administrative hospital data. Some of these include average length of stay, readmission rates, average net pay amounts by service lines just to name a few. The aim is to provide a simple and consistent verb framework that takes the guesswork out of everything.
Maintained by Steven Sanderson. Last updated 9 months ago.
analysis, analytics, healthcare, healthyr
30 stars 7.27 score 103 scripts 1 dependents
bioc
tidytof:Analyze High-dimensional Cytometry Data Using Tidy Data Principles
This package implements an interactive, scientific analysis pipeline for high-dimensional cytometry data built using tidy data principles. It is specifically designed to play well with both the tidyverse and Bioconductor software ecosystems, with functionality for reading/writing data files, data cleaning, preprocessing, clustering, visualization, modeling, and other quality-of-life functions. tidytof implements a "grammar" of high-dimensional cytometry data analysis.
Maintained by Timothy Keyes. Last updated 5 months ago.
singlecell, flowcytometry, bioinformatics, cytometry, data-science, single-cell, tidyverse, cpp
18 stars 7.24 score 35 scripts
tidymodels
tidyclust:A Common API to Clustering
A common interface to specifying clustering models, in the same style as 'parsnip'. Creates unified interface across different functions and computational engines.
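A minimal k-means sketch in the parsnip style (illustrative only):
library(tidyclust)
# Specify, fit, and extract cluster assignments
km_spec <- k_means(num_clusters = 3) |> set_engine("stats")
km_fit  <- fit(km_spec, ~ ., data = mtcars)
extract_cluster_assignment(km_fit)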
Maintained by Emil Hvitfeldt. Last updated 2 months ago.
112 stars 7.21 score 139 scripts
business-science
correlationfunnel:Speed Up Exploratory Data Analysis (EDA) with the Correlation Funnel
Speeds up exploratory data analysis (EDA) by providing a succinct workflow and interactive visualization tools for understanding which features have relationships to target (response). Uses binary correlation analysis to determine relationship. Default correlation method is the Pearson method. Lian Duan, W Nick Street, Yanchi Liu, Songhua Xu, and Brook Wu (2014) <doi:10.1145/2637484>.
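A sketch of the binarize/correlate/plot workflow on the package's bundled marketing data (illustrative):
library(dplyr)
library(correlationfunnel)
# Binarize features, correlate them with the target level, then plot the funnel
marketing_campaign_tbl %>%
  select(-ID) %>%
  binarize(n_bins = 4, thresh_infreq = 0.01) %>%
  correlate(target = TERM_DEPOSIT__yes) %>%
  plot_correlation_funnel(interactive = FALSE)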
Maintained by Matt Dancho. Last updated 1 years ago.
correlation, exploratory-analysis, exploratory-data-analysis, exploratory-data-visualizations, tidyverse
137 stars 7.20 score 115 scripts
roux-ohdsi
allofus:Interface for 'All of Us' Researcher Workbench
Streamline use of the 'All of Us' Researcher Workbench (<https://www.researchallofus.org/data-tools/workbench/>) with tools to extract and manipulate data from the 'All of Us' database. Increase interoperability with the Observational Health Data Science and Informatics ('OHDSI') tool stack by decreasing reliance on 'All of Us' tools and allowing for cohort creation via 'Atlas'. Improve reproducible and transparent research using 'All of Us'.
Maintained by Rob Cavanaugh. Last updated 5 months ago.
16 stars 7.19 score 30 scripts
darwin-eu
DrugExposureDiagnostics:Diagnostics for OMOP Common Data Model Drug Records
Ingredient specific diagnostics for drug exposure records in the Observational Medical Outcomes Partnership (OMOP) common data model.
Maintained by Ger Inberg. Last updated 18 days ago.
4 stars 7.11 score 41 scripts
bioc
animalcules:Interactive microbiome analysis toolkit
animalcules is an R package for utilizing up-to-date data analytics, visualization methods, and machine learning models to provide users an easy-to-use interactive microbiome analysis framework. It can be used as a standalone software package or users can explore their data with the accompanying interactive R Shiny application. Traditional microbiome analysis such as alpha/beta diversity and differential abundance analysis are enhanced, while new methods like biomarker identification are introduced by animalcules. Powerful interactive and dynamic figures generated by animalcules enable users to understand their data better and discover new insights.
Maintained by Jessica McClintock. Last updated 5 months ago.
microbiome, metagenomics, coverage, visualization
55 stars 6.95 score 23 scripts
bioc
pRolocGUI:Interactive visualisation of spatial proteomics data
The package pRolocGUI comprises functions to interactively visualise spatial proteomics data on the basis of pRoloc, pRolocdata and shiny.
Maintained by Lisa Breckels. Last updated 5 months ago.
8 stars 6.90 score 3 scripts
cytomining
cytominer:Methods for Image-Based Cell Profiling
`cytominer` is a suite of common functions used to process high-dimensional readouts from image-based cell profiling experiments.
Maintained by Shantanu Singh. Last updated 2 years ago.
50 stars 6.89 score 44 scripts
tidymodels
agua:'tidymodels' Integration with 'h2o'
Create and evaluate models using 'tidymodels' and 'h2o' <https://h2o.ai/>. The package enables users to specify 'h2o' as an engine for several modeling methods.
Maintained by Qiushi Yan. Last updated 10 months ago.
22 stars 6.88 score 80 scripts
tidymodels
usemodels:Boilerplate Code for 'Tidymodels' Analyses
Code snippets to fit models using the tidymodels framework can be easily created for a given data set.
Maintained by Max Kuhn. Last updated 6 months ago.
84 stars 6.88 score 128 scripts
raymondbalise
rUM:R Templates from the University of Miami
This package holds R Markdown and Quarto templates, plus a template to create a research project in 'RStudio'.
Maintained by Raymond Balise. Last updated 10 days ago.
9 stars 6.84 score 16 scripts
kozodoi
fairness:Algorithmic Fairness Metrics
Offers calculation, visualization and comparison of algorithmic fairness metrics. Fair machine learning is an emerging topic with the overarching aim to critically assess whether ML algorithms reinforce existing social biases. Unfair algorithms can propagate such biases and produce predictions with a disparate impact on various sensitive groups of individuals (defined by sex, gender, ethnicity, religion, income, socioeconomic status, physical or mental disabilities). Fair algorithms possess the underlying foundation that these groups should be treated similarly or have similar prediction outcomes. The fairness R package offers the calculation and comparisons of commonly and less commonly used fairness metrics in population subgroups. These methods are described by Calders and Verwer (2010) <doi:10.1007/s10618-010-0190-x>, Chouldechova (2017) <doi:10.1089/big.2016.0047>, Feldman et al. (2015) <doi:10.1145/2783258.2783311> , Friedler et al. (2018) <doi:10.1145/3287560.3287589> and Zafar et al. (2017) <doi:10.1145/3038912.3052660>. The package also offers convenient visualizations to help understand fairness metrics.
Maintained by Nikita Kozodoi. Last updated 2 years ago.
algorithmic-discrimination, algorithmic-fairness, discrimination, disparate-impact, fairness, fairness-ai, fairness-ml, machine-learning
32 stars 6.82 score 69 scripts 1 dependents
michaellli
evalITR:Evaluating Individualized Treatment Rules
Provides various statistical methods for evaluating Individualized Treatment Rules under randomized data. The provided metrics include Population Average Value (PAV), Population Average Prescription Effect (PAPE), Area Under Prescription Effect Curve (AUPEC). It also provides the tools to analyze Individualized Treatment Rules under budget constraints. Detailed reference in Imai and Li (2019) <arXiv:1905.05389>.
Maintained by Michael Lingzhi Li. Last updated 2 years ago.
14 stars 6.78 score 36 scripts
harrison4192
autostats:Auto Stats
Automatically do statistical exploration. Create formulas using 'tidyselect' syntax, and then determine cross-validated model accuracy and variable contributions using 'glm' and 'xgboost'. Contains additional helper functions to create and modify formulas. Has a flagship function to quickly determine relationships between categorical and continuous variables in the data set.
Maintained by Harrison Tietze. Last updated 26 days ago.
6 stars 6.76 score 5 scripts 2 dependents
bioc
SPONGE:Sparse Partial Correlations On Gene Expression
This package provides methods to efficiently detect competitive endogeneous RNA interactions between two genes. Such interactions are mediated by one or several miRNAs such that both gene and miRNA expression data for a larger number of samples is needed as input. The SPONGE package now also includes spongEffects: ceRNA modules offer patient-specific insights into the miRNA regulatory landscape.
Maintained by Markus List. Last updated 5 months ago.
geneexpression, transcription, generegulation, networkinference, transcriptomics, systemsbiology, regression, randomforest, machinelearning
6.66 score 38 scripts 1 dependents
business-science
modeltime.resample:Resampling Tools for Time Series Forecasting
A 'modeltime' extension that implements forecast resampling tools that assess time-based model performance and stability for a single time series, panel data, and cross-sectional time series analysis.
Maintained by Matt Dancho. Last updated 1 years ago.
accuracy-metrics, backtesting, bootstrap, bootstrapping, cross-validation, forecasting, modeltime, modeltime-resample, resampling, statistics, tidymodels, time-series
19 stars 6.64 score 38 scripts 1 dependents
bioc
scAnnotatR:Pretrained learning models for cell type prediction on single cell RNA-sequencing data
The package comprises a set of pretrained machine learning models to predict basic immune cell types. This enables all users to quickly get a first annotation of the cell types present in their dataset without requiring prior knowledge. scAnnotatR also allows users to train their own models to predict new cell types based on specific research needs.
Maintained by Johannes Griss. Last updated 5 months ago.
singlecell, transcriptomics, geneexpression, supportvectormachine, classification, software
15 stars 6.61 score 20 scripts
spsanderson
tidyAML:Automatic Machine Learning with 'tidymodels'
The goal of this package is to provide a simple interface for automatic machine learning that fits the 'tidymodels' framework. The intention is to work for regression and classification problems with a simple verb framework.
Maintained by Steven Sanderson. Last updated 11 months ago.
automatic-machine-learning, automl, classification, machine-learning, parsnip, r-language, r-programming, regression, tidy, tidymodels, tidyverse
68 stars 6.56 score 36 scripts 1 dependents
tsailintung
fastdid:Fast Staggered Difference-in-Difference Estimators
A fast and flexible implementation of Callaway and Sant'Anna's (2021)<doi:10.1016/j.jeconom.2020.12.001> staggered Difference-in-Differences (DiD) estimators, 'fastdid' reduces the computation time from hours to seconds, and incorporates extensions such as time-varying covariates and multiple events.
Maintained by Lin-Tung Tsai. Last updated 4 months ago.
difference-in-differences, event-study, staggered-did
28 stars 6.56 score 4 scripts
bioc
condiments:Differential Topology, Progression and Differentiation
This package encapsulates many functions to conduct a differential topology analysis. It focuses on analyzing an 'omic dataset with multiple conditions. While the package is mostly geared toward scRNASeq, it does not place any restriction on the actual input format.
Maintained by Hector Roux de Bezieux. Last updated 4 months ago.
rnaseq, sequencing, software, singlecell, transcriptomics, multiplecomparison, visualization
26 stars 6.52 score 17 scripts
statsgary
OddsPlotty:Odds Plot to Visualise a Logistic Regression Model
Uses the outputs of a logistic regression model, from caret <https://CRAN.R-project.org/package=caret>, to build an odds plot. This allows for the rapid visualisation of odds plot ratios and works best with the outputs of CARET's GLM model class, by returning the final trained model.
Maintained by Gary Hutson. Last updated 1 months ago.
17 stars 6.39 score 48 scripts 1 dependents
mattheaphy
actxps:Create Actuarial Experience Studies: Prepare Data, Summarize Results, and Create Reports
Experience studies are used by actuaries to explore historical experience across blocks of business and to inform assumption setting activities. This package provides functions for preparing data, creating studies, visualizing results, and beginning assumption development. Experience study methods, including exposure calculations, are described in: Atkinson & McGarry (2016) "Experience Study Calculations" <https://www.soa.org/49378a/globalassets/assets/files/research/experience-study-calculations.pdf>. The limited fluctuation credibility method used by the 'exp_stats()' function is described in: Herzog (1999, ISBN:1-56698-374-6) "Introduction to Credibility Theory".
Maintained by Matt Heaphy. Last updated 3 months ago.
14 stars 6.38 score 23 scripts
thewileylab
ReviewR:A Light-Weight, Portable Tool for Reviewing Individual Patient Records
A portable Shiny tool to explore patient-level electronic health record data and perform chart review in a single integrated framework. This tool supports browsing clinical data in many different formats including multiple versions of the 'OMOP' common data model as well as the 'MIMIC-III' data model. In addition, chart review information is captured and stored securely via the Shiny interface in a 'REDCap' (Research Electronic Data Capture) project using the 'REDCap' API. See the 'ReviewR' website for additional information, documentation, and examples.
Maintained by David Mayer. Last updated 2 years ago.
24 stars 6.33 score 6 scripts
egeulgen
driveR:Prioritizing Cancer Driver Genes Using Genomics Data
Cancer genomes contain large numbers of somatic alterations but few genes drive tumor development. Identifying cancer driver genes is critical for precision oncology. Most current approaches either identify driver genes based on mutational recurrence or using estimated scores predicting the functional consequences of mutations. 'driveR' is a tool for personalized or batch analysis of genomic data for driver gene prioritization by combining genomic information and prior biological knowledge. As features, 'driveR' uses coding impact metaprediction scores, non-coding impact scores, somatic copy number alteration scores, hotspot gene/double-hit gene condition, 'phenolyzer' gene scores and memberships to cancer-related KEGG pathways. It uses these features to estimate cancer-type-specific probability for each gene of being a cancer driver using the related task of a multi-task learning classification model. The method is described in detail in Ulgen E, Sezerman OU. 2021. driveR: a novel method for prioritizing cancer driver genes using somatic genomics data. BMC Bioinformatics <doi:10.1186/s12859-021-04203-7>.
Maintained by Ege Ulgen. Last updated 2 years ago.
cancer-driverness, driver, driver-gene-prioritization, identify-driver-genes, ranking-genes, scoring
15 stars 6.29 score 260 scripts
bioc
iNETgrate:Integrates DNA methylation data with gene expression in a single gene network
The iNETgrate package provides functions to build a correlation network in which nodes are genes. DNA methylation and gene expression data are integrated to define the connections between genes. This network is used to identify modules (clusters) of genes. The biological information in each of the resulting modules is represented by an eigengene. These biological signatures can be used as features e.g., for classification of patients into risk categories. The resulting biological signatures are very robust and give a holistic view of the underlying molecular changes.
Maintained by Habil Zare. Last updated 5 months ago.
geneexpression, rnaseq, dnamethylation, networkinference, network, graphandnetwork, biomedicalinformatics, systemsbiology, transcriptomics, classification, clustering, dimensionreduction, principalcomponent, mrnamicroarray, normalization, geneprediction, kegg, survival, core-services
74 stars 6.21 score 1 scripts
tidymodels
shinymodels:Interactive Assessments of Models
Launch a 'shiny' application for 'tidymodels' results. For classification or regression models, the app can be used to determine if there is lack of fit or poorly predicted points.
Maintained by Simon Couch. Last updated 5 months ago.
48 stars 6.21 score 48 scripts
jbryer
login:'shiny' Login Module
Framework for adding authentication to 'shiny' applications. Provides flexibility compared to other options in terms of where user credentials are saved, allows users to create their own accounts, and includes password reset functionality. Bryer (2024) <doi:10.5281/zenodo.10987876>.
Maintained by Jason Bryer. Last updated 12 months ago.
21 stars 6.15 score 45 scripts
ipd-tools
ipd:Inference on Predicted Data
Performs valid statistical inference on predicted data (IPD) using recent methods, where for a subset of the data, the outcomes have been predicted by an algorithm. Provides a wrapper function with specified defaults for the type of model and method to be used for estimation and inference. Further provides methods for tidying and summarizing results. Salerno et al., (2024) <doi:10.48550/arXiv.2410.09665>.
Maintained by Stephen Salerno. Last updated 3 months ago.
8 stars 6.13 score 5 scripts
andreanini
idiolect:Forensic Authorship Analysis
Carry out comparative authorship analysis of disputed and undisputed texts within the Likelihood Ratio Framework for expressing evidence in forensic science. This package contains implementations of well-known algorithms for comparative authorship analysis, such as Smith and Aldridge's (2011) Cosine Delta <doi:10.1080/09296174.2011.533591> or Koppel and Winter's (2014) Impostors Method <doi:10.1002/asi.22954>, as well as functions to measure their performance and to calibrate their outputs into Log-Likelihood Ratios.
Maintained by Andrea Nini. Last updated 24 days ago.
14 stars 6.12 score 3 scripts
sentometricsresearch
sentometrics:An Integrated Framework for Textual Sentiment Time Series Aggregation and Prediction
Optimized prediction based on textual sentiment, accounting for the intrinsic challenge that sentiment can be computed and pooled across texts and time in various ways. See Ardia et al. (2021) <doi:10.18637/jss.v099.i02>.
Maintained by Samuel Borms. Last updated 4 years ago.
nlp, prediction, sentiment-analysis, text-mining, time-series, openblas, cpp, openmp
83 stars 6.09 score 49 scripts
nicholasjclark
MRFcov:Markov Random Fields with Additional Covariates
Approximate node interaction parameters of Markov Random Fields graphical networks. Models can incorporate additional covariates, allowing users to estimate how interactions between nodes in the graph are predicted to change across covariate gradients. The general methods implemented in this package are described in Clark et al. (2018) <doi:10.1002/ecy.2221>.
Maintained by Nicholas J Clark. Last updated 1 years ago.
conditional-random-fields, graphical-models, machine-learning, markov-random-field, multivariate-analysis, multivariate-statistics, network-analysis, networks
24 stars 6.03 score 30 scripts
jacekbialek
PriceIndices:Calculating Bilateral and Multilateral Price Indexes
Preparing a scanner data set for price dynamics calculations (data selecting, data classification, data matching, data filtering). Computing bilateral and multilateral indexes. For details on these methods see: Diewert and Fox (2020) <doi:10.1080/07350015.2020.1816176>, Białek (2019) <doi:10.2478/jos-2019-0014> or Białek (2020) <doi:10.2478/jos-2020-0037>.
Maintained by Jacek Białek. Last updated 2 months ago.
11 stars 6.02 score 16 scripts
business-science
alphavantager:Lightweight Interface to the Alpha Vantage API
Alpha Vantage has free historical financial information. All you need to do is get a free API key at <https://www.alphavantage.co>. Then you can use the R interface to retrieve free equity information. Refer to the Alpha Vantage website for more information.
Maintained by Matt Dancho. Last updated 2 years ago.
alpha-vantage, financial-data, historical-financial-data
70 stars 5.98 score 64 scripts
rajohansen
waterquality:Satellite Derived Water Quality Detection Algorithms
The main purpose of waterquality is to quickly and easily convert satellite-based reflectance imagery into one or many well-known water quality algorithms designed for the detection of harmful algal blooms or the following pigment proxies: chlorophyll-a, blue-green algae (phycocyanin), and turbidity. Johansen et al. (2019) <doi:10.21079/11681/35053>.
Maintained by Richard Johansen. Last updated 1 years ago.
algal-bloom, algorithms, landsat-8, meris, modis, olci, remote-sensing, sentinel-2, water-quality
44 stars 5.97 score 21 scripts
bioc
REMP:Repetitive Element Methylation Prediction
Machine learning-based tools to predict DNA methylation of locus-specific repetitive elements (RE) by learning surrounding genetic and epigenetic information. These tools provide genomewide and single-base resolution of DNA methylation prediction on RE that are difficult to measure using array-based or sequencing-based platforms, which enables epigenome-wide association study (EWAS) and differentially methylated region (DMR) analysis on RE.
Maintained by Yinan Zheng. Last updated 5 months ago.
dnamethylation, microarray, methylationarray, sequencing, genomewideassociation, epigenetics, preprocessing, multichannel, twochannel, differentialmethylation, qualitycontrol, dataimport
2 stars 5.94 score 18 scripts
bioc
miRspongeR:Identification and analysis of miRNA sponge regulation
This package provides several functions to explore miRNA sponge (also called ceRNA or miRNA decoy) regulation from putative miRNA-target interactions and/or transcriptomics data (including bulk, single-cell and spatial gene expression data). It provides eight popular methods for identifying miRNA sponge interactions and an integrative method to combine miRNA sponge interactions from different methods, as well as functions to validate miRNA sponge interactions, infer miRNA sponge modules, and conduct enrichment and survival analysis of miRNA sponge modules. By using a sample control variable strategy, it provides a function to infer sample-specific miRNA sponge interactions. For sample-specific miRNA sponge interactions, it implements three similarity methods to construct sample-sample correlation networks.
Maintained by Junpeng Zhang. Last updated 5 months ago.
geneexpressionbiomedicalinformaticsnetworkenrichmentsurvivalmicroarraysoftwaresinglecellspatialrnaseqcernamirnasponge
5 stars 5.88 score 8 scriptsphytoclass
phytoclass:Estimate Chla Concentrations of Phytoplankton Groups
Determine the chlorophyll a (Chl a) concentrations of different phytoplankton groups based on their pigment biomarkers. The method uses non-negative matrix factorisation and simulated annealing to minimise error between the observed and estimated values of pigment concentrations (Hayward et al. (2023) <doi:10.1002/lom3.10541>). The approach is similar to the widely used 'CHEMTAX' program (Mackey et al. 1996) <doi:10.3354/meps144265>, but is more straightforward, accurate, and not reliant on initial guesses for the pigment to Chl a ratios for phytoplankton groups.
Maintained by Alexander Hayward. Last updated 25 days ago.
2 stars 5.88 score 9 scriptsbcallaway11
qte:Quantile Treatment Effects
Provides several methods for computing the Quantile Treatment Effect (QTE) and Quantile Treatment Effect on the Treated (QTT). The main cases covered are (i) Treatment is randomly assigned, (ii) Treatment is as good as randomly assigned after conditioning on some covariates (also called conditional independence or selection on observables) using the methods developed in Firpo (2007) <doi:10.1111/j.1468-0262.2007.00738.x>, (iii) Identification is based on a Difference in Differences assumption (several varieties are available in the package e.g. Athey and Imbens (2006) <doi:10.1111/j.1468-0262.2006.00668.x> Callaway and Li (2019) <doi:10.3982/QE935>, Callaway, Li, and Oka (2018) <doi:10.1016/j.jeconom.2018.06.008>).
Maintained by Brantly Callaway. Last updated 11 months ago.
9 stars 5.87 score 55 scriptsvdblab
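Under random assignment (case (i) above), the QTE at a quantile tau reduces to the difference between the treated and control outcome quantiles. A base-R illustration with simulated outcomes; the package's estimators for the other identification strategies are not shown.
  # QTE under random assignment: difference of marginal quantiles.
  set.seed(1)
  y0 <- rnorm(500, mean = 0)                    # control outcomes
  y1 <- rnorm(500, mean = 0.5, sd = 1.3)        # treated outcomes
  taus <- c(0.25, 0.5, 0.75)
  round(quantile(y1, taus) - quantile(y0, taus), 2)   # effect varies across quantiles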
FLORAL:Fit Log-Ratio Lasso Regression for Compositional Data
Log-ratio Lasso regression for continuous, binary, and survival outcomes with (longitudinal) compositional features. See Fei and others (2024) <doi:10.1016/j.crmeth.2024.100899>.
Maintained by Teng Fei. Last updated 1 months ago.
12 stars 5.85 score 13 scriptsnelson-gon
manymodelr:Build and Tune Several Models
Frequently one needs a convenient way to build and tune several models in one go. The goal is to provide a number of machine learning convenience functions, including the ability to build, tune and obtain predictions from several models in one function call. The models are built using functions from 'caret' with easier-to-read syntax. Kuhn (2014) <doi:10.48550/arXiv.1405.6974>.
Maintained by Nelson Gonzabato. Last updated 11 days ago.
analysis-of-varianceanovacorrelationcorrelation-coefficientgeneralized-linear-modelsgradient-boosting-decision-treesknn-classificationlinear-modelslinear-regressionmachine-learningmissing-valuesmodelsr-programmingrandom-forest-algorithmregression-models
2 stars 5.78 score 50 scriptsshinyworks
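The build-several-models-in-one-go idea can be sketched with plain 'caret' calls by looping train() over a vector of methods on shared resampling folds; none of 'manymodelr''s own functions appear below.
  # Fit several caret models on the same folds and compare their CV error.
  library(caret)
  set.seed(1)
  folds <- createFolds(iris$Sepal.Length, k = 5, returnTrain = TRUE)
  ctrl  <- trainControl(method = "cv", index = folds)
  fits  <- lapply(c(lm = "lm", cart = "rpart", knn = "knn"), function(m)
    train(Sepal.Length ~ ., data = iris, method = m,
          trControl = ctrl, tuneLength = 3))
  sapply(fits, function(f) min(f$results$RMSE))   # quick comparison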
cookies:Use Browser Cookies with 'shiny'
Browser cookies are name-value pairs that are saved in a user's browser by a website. Cookies allow websites to persist information about the user and their use of the website. Here we provide tools for working with cookies in 'shiny' apps, in part by wrapping the 'js-cookie' JavaScript library <https://github.com/js-cookie/js-cookie>.
Maintained by Jon Harmon. Last updated 5 months ago.
33 stars 5.71 score 26 scripts 2 dependentsstatsgary
MLDataR:Collection of Machine Learning Datasets for Supervised Machine Learning
Contains a collection of datasets for working with machine learning tasks. It contains datasets for supervised machine learning (Jiang (2020) <doi:10.1016/j.beth.2020.05.002>), including datasets for classification and regression. The aim of this package is to make available data generated around health and other domains.
Maintained by Gary Hutson. Last updated 1 years ago.
53 stars 5.70 score 19 scriptsbioc
CytoGLMM:Conditional Differential Analysis for Flow and Mass Cytometry Experiments
The CytoGLMM R package implements two multiple regression strategies: A bootstrapped generalized linear model (GLM) and a generalized linear mixed model (GLMM). Most current data analysis tools compare expressions across many computationally discovered cell types. CytoGLMM focuses on just one cell type. Our narrower field of application allows us to define a more specific statistical model with easier to control statistical guarantees. As a result, CytoGLMM finds differential proteins in flow and mass cytometry data while reducing biases arising from marker correlations and safeguarding against false discoveries induced by patient heterogeneity.
Maintained by Christof Seiler. Last updated 5 months ago.
flowcytometryproteomicssinglecellcellbasedassayscellbiologyimmunooncologyregressionstatisticalmethodsoftware
2 stars 5.68 score 1 scripts 1 dependentsbioc
metabCombiner:Method for Combining LC-MS Metabolomics Feature Measurements
This package aligns LC-HRMS metabolomics datasets acquired from biologically similar specimens analyzed under similar, but not necessarily identical, conditions. Peak-picked and simply aligned metabolomics feature tables (consisting of m/z, rt, and per-sample abundance measurements, plus optional identifiers & adduct annotations) are accepted as input. The package outputs a combined table of feature pair alignments, organized into groups of similar m/z, and ranked by a similarity score. Input tables are assumed to be acquired using similar (but not necessarily identical) analytical methods.
Maintained by Hani Habra. Last updated 5 months ago.
softwaremassspectrometrymetabolomicsmass-spectrometry
10 stars 5.65 score 5 scriptsakai01
caretForecast:Conformal Time Series Forecasting Using State of Art Machine Learning Algorithms
Conformal time series forecasting using the caret infrastructure. It provides access to state-of-the-art machine learning models for forecasting applications. The hyperparameters of each model are selected by time series cross-validation, and forecasting is done recursively.
Maintained by Resul Akay. Last updated 2 years ago.
caretconformal-predictiondata-scienceeconometricsforecastforecastingforecasting-modelsmachine-learningmacroeconometricsmicroeconometricstime-seriestime-series-forcastingtime-series-prediction
25 stars 5.62 score 28 scripts 4 dependentstrevorld
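A rough base-R-plus-'caret' sketch of the recursive scheme described above: build lagged predictors, tune with time-slice cross-validation, then feed each one-step-ahead prediction back in. The package's own interface is not shown and will differ.
  # Recursive ML forecasting with lagged predictors and time-slice CV.
  library(caret)
  set.seed(1)
  y <- as.numeric(AirPassengers)
  p <- 12                                        # number of lags
  lagged <- embed(y, p + 1)                      # col 1 = y_t, cols 2.. = lags 1..p
  dat <- data.frame(y = lagged[, 1], lagged[, -1])
  ctrl <- trainControl(method = "timeslice", initialWindow = 100,
                       horizon = 12, fixedWindow = TRUE)
  fit <- train(y ~ ., data = dat, method = "rpart", trControl = ctrl, tuneLength = 5)
  history <- tail(y, p)
  fc <- numeric(12)
  for (h in 1:12) {                              # recursive one-step-ahead forecasts
    newx <- as.data.frame(t(rev(tail(history, p))))   # lag 1 = most recent value
    names(newx) <- names(dat)[-1]
    fc[h] <- predict(fit, newx)
    history <- c(history, fc[h])
  }
  round(fc)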
datetimeoffset:Datetimes with Optional UTC Offsets and/or Heterogeneous Time Zones
Supports import/export for a number of datetime string standards and R datetime classes often including lossless re-export of any original reduced precision including 'ISO 8601' <https://en.wikipedia.org/wiki/ISO_8601> and 'pdfmark' <https://opensource.adobe.com/dc-acrobat-sdk-docs/library/pdfmark/> datetime strings. Supports local/global datetimes with optional UTC offsets and/or (possibly heterogeneous) time zones with up to nanosecond precision.
Maintained by Trevor L. Davis. Last updated 7 days ago.
6 stars 5.56 score 1 scripts 2 dependentsbioc
bandle:An R package for the Bayesian analysis of differential subcellular localisation experiments
The Bandle package enables the analysis and visualisation of differential localisation experiments using mass-spectrometry data. Experimental methods supported include dynamic LOPIT-DC, hyperLOPIT, Dynamic Organellar Maps, Dynamic PCP. It provides Bioconductor infrastructure to analyse these data.
Maintained by Oliver M. Crook. Last updated 2 months ago.
bayesianclassificationclusteringimmunooncologyqualitycontroldataimportproteomicsmassspectrometryopenblascppopenmp
4 stars 5.56 score 3 scriptstechtonique
learningmachine:Machine Learning with Explanations and Uncertainty Quantification
Regression-based Machine Learning with explanations and uncertainty quantification.
Maintained by T. Moudiki. Last updated 4 months ago.
conformal-predictionmachine-learningmachine-learning-algorithmsmachinelearningstatistical-learninguncertainty-quantificationcpp
5 stars 5.53 score 21 scriptsjeffreyhanson
raptr:Representative and Adequate Prioritization Toolkit in R
Biodiversity is in crisis. The overarching aim of conservation is to preserve biodiversity patterns and processes. To this end, protected areas are established to buffer species and preserve biodiversity processes. But resources are limited and so protected areas must be cost-effective. This package contains tools to generate plans for protected areas (prioritizations), using spatially explicit targets for biodiversity patterns and processes. To obtain solutions in a feasible amount of time, this package uses the commercial 'Gurobi' software (obtained from <https://www.gurobi.com/>). For more information on using this package, see Hanson et al. (2018) <doi:10.1111/2041-210X.12862>.
Maintained by Jeffrey O Hanson. Last updated 1 years ago.
8 stars 5.52 score 83 scriptsjonathan-g
datafsm:Estimating Finite State Machine Models from Data
Automatic generation of finite state machine models of dynamic decision-making that both have strong predictive power and are interpretable in human terms. We use an efficient model representation and a genetic algorithm-based estimation process to generate simple deterministic approximations that explain most of the structure of complex stochastic processes. We have applied the software to empirical data, and demonstrated its ability to recover known data-generating processes by simulating data with agent-based models and correctly deriving the underlying decision models for multiple agent models and degrees of stochasticity.
Maintained by Jonathan M. Gilligan. Last updated 4 years ago.
11 stars 5.52 score 30 scriptsriccardo-df
causalQual:Causal Inference for Qualitative Outcomes
Implements the framework introduced in Di Francesco and Mellace (2025) <doi:10.48550/arXiv.2502.11691>, shifting the focus to well-defined and interpretable estimands that quantify how treatment affects the probability distribution over outcome categories. It supports selection-on-observables, instrumental variables, regression discontinuity, and difference-in-differences designs.
Maintained by Riccardo Di Francesco. Last updated 11 days ago.
11 stars 5.44 scorehehta
RESIDE:Rapid Easy Synthesis to Inform Data Extraction
Developed to assist researchers with planning analyses prior to obtaining data from Trusted Research Environments (TREs), also known as safe havens. It provides functionality to export and import marginal distributions and to synthesise data from them, with and without correlations, using a multivariate cumulative distribution (copula). Additionally, the International Stroke Trial (IST) is included as an example dataset under the ODC-By licence: Sandercock et al. (2011) <doi:10.7488/ds/104>, Sandercock et al. (2011) <doi:10.1186/1745-6215-12-101>.
Maintained by Ryan Field. Last updated 25 days ago.
5.44 score 5 scriptspersimune
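The copula-based synthesis described above can be illustrated with base R plus MASS: draw correlated normals, turn them into uniform scores, and push the scores through the inverse CDFs of the chosen marginals. A generic sketch, not the package's import/export workflow; the variable names are made up.
  # Gaussian-copula synthesis from marginal distributions plus a correlation.
  library(MASS)
  set.seed(1)
  rho  <- matrix(c(1, 0.6, 0.6, 1), 2, 2)        # target dependence
  z    <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = rho)
  u    <- pnorm(z)                               # copula scores in (0, 1)
  age  <- qnorm(u[, 1], mean = 60, sd = 10)      # continuous marginal
  stay <- qpois(u[, 2], lambda = 5)              # count marginal
  cor(age, stay)                                 # dependence is roughly carried over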
explainer:Machine Learning Model Explainer
It enables detailed interpretation of complex classification and regression models through Shapley analysis including data-driven characterization of subgroups of individuals. Furthermore, it facilitates multi-measure model evaluation, model fairness, and decision curve analysis. Additionally, it offers enhanced visualizations with interactive elements.
Maintained by Ramtin Zargari Marandi. Last updated 6 months ago.
aiclassificationclinical-researchexplainabilityexplainable-aiinterpretabilitymachine-learningregressionshapstatistics
15 stars 5.43 score 12 scriptsjaipizgon
NeuralSens:Sensitivity Analysis of Neural Networks
Analysis functions to quantify inputs importance in neural network models. Functions are available for calculating and plotting the inputs importance and obtaining the activation function of each neuron layer and its derivatives. The importance of a given input is defined as the distribution of the derivatives of the output with respect to that input in each training data point <doi:10.18637/jss.v102.i07>.
Maintained by Jaime Pizarroso Gonzalo. Last updated 6 months ago.
15 stars 5.43 score 24 scriptsshinyworks
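The derivative-based importance measure defined above can be imitated crudely with finite differences around each training point; the package computes these derivatives analytically, so the 'nnet' example below is only a stand-in.
  # Approximate d(output)/d(input) per training point and summarise per input.
  library(nnet)
  set.seed(1)
  x <- scale(as.matrix(iris[, 2:4]))
  y <- iris$Sepal.Length
  fit <- nnet(x, y, size = 4, linout = TRUE, trace = FALSE)
  eps <- 1e-4
  sens <- sapply(seq_len(ncol(x)), function(j) {
    xp <- x; xp[, j] <- xp[, j] + eps
    (predict(fit, xp) - predict(fit, x)) / eps   # finite-difference derivative
  })
  colnames(sens) <- colnames(x)
  apply(sens, 2, function(d) c(mean = mean(d), sd = sd(d)))   # importance summary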
scenes:Switch Between Alternative 'shiny' UIs
Sometimes it is useful to serve up alternative 'shiny' UIs depending on information passed in the request object, such as the value of a cookie or a query parameter. This package facilitates such switches.
Maintained by Jon Harmon. Last updated 5 months ago.
16 stars 5.41 score 16 scriptsflaviomoc
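The request-based switching described above can be sketched with plain 'shiny' by making the UI a function of the request and inspecting a query parameter; 'scenes' wraps this pattern, and none of its own functions appear below.
  # Serve one of two UIs depending on a query parameter, e.g. ?view=admin.
  library(shiny)
  ui <- function(request) {
    view <- parseQueryString(request$QUERY_STRING)$view
    if (identical(view, "admin")) {
      fluidPage(titlePanel("Admin view"), verbatimTextOutput("who"))
    } else {
      fluidPage(titlePanel("Default view"), textOutput("who"))
    }
  }
  server <- function(input, output, session) {
    output$who <- renderText("hello")
  }
  shinyApp(ui, server)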
divraster:Diversity Metrics Calculations for Rasterized Data
Alpha and beta diversity for taxonomic (TD), functional (FD), and phylogenetic (PD) dimensions based on rasters. Spatial and temporal beta diversity can be partitioned into replacement and richness difference components. It also calculates standardized effect size for FD and PD alpha diversity and the average individual traits across multilayer rasters. The layers of the raster represent species, while the cells represent communities. Methods details can be found at Cardoso et al. 2022 <https://CRAN.R-project.org/package=BAT> and Heming et al. 2023 <https://CRAN.R-project.org/package=SESraster>.
Maintained by Flávio M. M. Mota. Last updated 14 days ago.
10 stars 5.40 score 7 scriptsfawda123
WRTDStidal:Weighted Regression for Water Quality Evaluation in Tidal Waters
An adaptation for estuaries (tidal waters) of weighted regression on time, discharge, and season to evaluate trends in water quality time series. Please see Beck and Hagy (2015) <doi:10.1007/s10666-015-9452-8> for details.
Maintained by Marcus W. Beck. Last updated 1 years ago.
4 stars 5.38 score 119 scriptssmaakage85
modelgrid:A Framework for Creating, Managing and Training Multiple Caret Models
A minimalistic but flexible framework that facilitates the creation, management and training of multiple 'caret' models. A model grid consists of two components: (1) a set of settings that is shared by all models by default, and (2) specifications that apply only to the individual models. When the model grid is trained, model and training specifications are first consolidated from the shared and the model specific settings into complete 'caret' model configurations. These models are then trained with the 'train' function from the 'caret' package.
Maintained by Lars Kjeldgaard. Last updated 6 years ago.
caretmachine-learningpredictive-analyticspredictive-modeling
23 stars 5.34 score 19 scriptsadrientaudiere
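The shared-plus-model-specific consolidation described above can be imitated in a few lines with modifyList() and do.call() around caret::train(); a minimal sketch, not the package's API.
  # Merge shared settings with per-model overrides, then train each model.
  library(caret)
  set.seed(1)
  shared <- list(x = iris[, 1:4], y = iris$Species,
                 trControl = trainControl(method = "cv", number = 5))
  models <- list(cart = list(method = "rpart", tuneLength = 5),
                 knn  = list(method = "knn",   tuneLength = 10))
  fits <- lapply(models, function(spec) do.call(train, modifyList(shared, spec)))
  sapply(fits, function(f) max(f$results$Accuracy))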
cati:Community Assembly by Traits: Individuals and Beyond
Detect and quantify community assembly processes using trait values of individuals or populations, the T-statistics and other metrics, and dedicated null models.
Maintained by Adrien Taudiere. Last updated 5 months ago.
12 stars 5.33 score 15 scriptstlverse
tmle3shift:Targeted Learning of the Causal Effects of Stochastic Interventions
Targeted maximum likelihood estimation (TMLE) of population-level causal effects under stochastic treatment regimes and related nonparametric variable importance analyses. Tools are provided for TML estimation of the counterfactual mean under a stochastic intervention characterized as a modified treatment policy, such as treatment policies that shift the natural value of the exposure. The causal parameter and estimation were described in Díaz and van der Laan (2013) <doi:10.1111/j.1541-0420.2011.01685.x> and an improved estimation approach was given by Díaz and van der Laan (2018) <doi:10.1007/978-3-319-65304-4_14>.
Maintained by Nima Hejazi. Last updated 6 months ago.
causal-inferencemachine-learningmarginal-structural-modelsstochastic-interventionstargeted-learningtreatment-effectsvariable-importance
17 stars 5.33 score 42 scripts 1 dependentsalexkychen
assignPOP:Population Assignment using Genetic, Non-Genetic or Integrated Data in a Machine Learning Framework
Use Monte-Carlo and K-fold cross-validation coupled with machine-learning classification algorithms to perform population assignment, with functionalities of evaluating discriminatory power of independent training samples, identifying informative loci, reducing data dimensionality for genomic data, integrating genetic and non-genetic data, and visualizing results.
Maintained by Kuan-Yu (Alex) Chen. Last updated 1 years ago.
cross-validationdata-integrationgbsmachine-learningpopulation-assignmentpopulation-genomicsradseq
17 stars 5.33 score 25 scriptsbioc
DaMiRseq:Data Mining for RNA-seq data: normalization, feature selection and classification
The DaMiRseq package offers a tidy pipeline of data mining procedures to identify transcriptional biomarkers and exploit them for both binary and multi-class classification purposes. The package accepts any kind of data presented as a table of raw counts and allows including both continuous and factorial variables that occur within the experimental setting. A series of functions enable the user to clean up the data by filtering genomic features and samples, to adjust data by identifying and removing the unwanted source of variation (i.e. batches and confounding factors) and to select the best predictors for modeling. Finally, a "stacking" ensemble learning technique is applied to build a robust classification model. Every step includes a checkpoint that the user may exploit to assess the effects of data management by looking at diagnostic plots, such as clustering and heatmaps, RLE boxplots, MDS or correlation plots.
Maintained by Mattia Chiesa. Last updated 5 months ago.
sequencingrnaseqclassificationimmunooncologyopenjdk
5.32 score 7 scripts 1 dependentsbioc
preciseTAD:preciseTAD: A machine learning framework for precise TAD boundary prediction
preciseTAD provides functions to predict the location of boundaries of topologically associated domains (TADs) and chromatin loops at base-level resolution. As an input, it takes BED-formatted genomic coordinates of domain boundaries detected from low-resolution Hi-C data, and coordinates of high-resolution genomic annotations from ENCODE or other consortia. preciseTAD employs several feature engineering strategies and resampling techniques to address class imbalance, and trains an optimized random forest model for predicting low-resolution domain boundaries. Translated on a base-level, preciseTAD predicts the probability for each base to be a boundary. Density-based clustering and scalable partitioning techniques are used to detect precise boundary regions and summit points. Compared with low-resolution boundaries, preciseTAD boundaries are highly enriched for CTCF, RAD21, SMC3, and ZNF143 signal and more conserved across cell lines. The pre-trained model can accurately predict boundaries in another cell line using CTCF, RAD21, SMC3, and ZNF143 annotation data for this cell line.
Maintained by Mikhail Dozmorov. Last updated 5 months ago.
softwarehicsequencingclusteringclassificationfunctionalgenomicsfeatureextraction
7 stars 5.29 score 14 scriptsrandcorporation
optic:Simulation Tool for Causal Inference Using Longitudinal Data
Implements a simulation study to assess the strengths and weaknesses of causal inference methods for estimating policy effects using panel data. See Griffin et al. (2021) <doi:10.1007/s10742-022-00284-w> and Griffin et al. (2022) <doi:10.1186/s12874-021-01471-y> for a description of our methods.
Maintained by Pedro Nascimento de Lima. Last updated 2 months ago.
causal-inferencediff-in-difflongitudinal-datasimulation
9 stars 5.26 score 6 scriptsbioc
scGPS:A complete analysis of single cell subpopulations, from identifying subpopulations to analysing their relationship (scGPS = single cell Global Predictions of Subpopulation)
The package implements two main algorithms to answer two key questions: a SCORE (Stable Clustering at Optimal REsolution) to find subpopulations, followed by scGPS to investigate the relationships between subpopulations.
Maintained by Quan Nguyen. Last updated 5 months ago.
singlecellclusteringdataimportsequencingcoverageopenblascpp
4 stars 5.20 score 7 scriptsbioc
squallms:Speedy quality assurance via lasso labeling for LC-MS data
squallms is a Bioconductor R package that implements a "semi-labeled" approach to untargeted mass spectrometry data. It pulls in raw data from mass-spec files to calculate several metrics that are then used to label MS features in bulk as high or low quality. These metrics of peak quality are then passed to a simple logistic model that produces a fully-labeled dataset suitable for downstream analysis.
Maintained by William Kumler. Last updated 5 months ago.
massspectrometrymetabolomicsproteomicslipidomicsshinyappsclassificationclusteringfeatureextractionprincipalcomponentregressionpreprocessingqualitycontrolvisualization
3 stars 5.13 score 5 scriptsbioc
SGCP:SGCP: A semi-supervised pipeline for gene clustering using self-training approach in gene co-expression networks
SGC is a semi-supervised pipeline for gene clustering in gene co-expression networks. SGC consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner. But unlike all existing frameworks, it further incorporates a novel step that leverages Gene Ontology information in a semi-supervised clustering method that further improves the quality of the computed modules.
Maintained by Niloofar AghaieAbiane. Last updated 5 months ago.
geneexpressiongenesetenrichmentnetworkenrichmentsystemsbiologyclassificationclusteringdimensionreductiongraphandnetworkneuralnetworknetworkmrnamicroarrayrnaseqvisualizationbioinformaticsgenecoexpressionnetworkgraphsnetworkclusteringnetworksself-trainingsemi-supervised-learningunsupervised-learning
2 stars 5.12 score 44 scriptsspsanderson
healthyverse:Easily Install and Load the 'healthyverse'
The 'healthyverse' is a set of packages that work in harmony because they share common data representations and 'API' design. This package is designed to make it easy to install and load multiple 'healthyverse' packages in a single step.
Maintained by Steven Sanderson. Last updated 6 months ago.
analyticshealthcarehealthcare-applicationinstallationinstallermetapackages
11 stars 5.12 score 24 scriptspromidat
loadeR:Load Data for Analysis System
Provides a framework to load text and excel files through a 'shiny' graphical interface. It allows renaming, transforming, ordering and removing variables. It includes basic exploratory methods such as the mean, median, mode, normality test, histogram and correlation.
Maintained by Oldemar Rodriguez. Last updated 2 years ago.
5.09 score 275 scripts 3 dependentsadefazio
classifierplots:Generates a Visualization of Classifier Performance as a Grid of Diagnostic Plots
Generates a visualization of binary classifier performance as a grid of diagnostic plots with just one function call. Includes ROC curves, prediction density, accuracy, precision, recall and calibration plots, all using ggplot2 for easy modification. Debug your binary classifiers faster and easier!
Maintained by Aaron Defazio. Last updated 4 years ago.
50 stars 5.08 score 16 scriptsfrankiethull
maize:Specialty Kernels for SVMs
Bindings for SVM kernels via 'kernlab' for use with the 'parsnip' package, specifically specialty kernels for support vector machines that are not available in 'parsnip'. The package includes an interface for various 'kernlab' kernels as well as custom kernels.
Maintained by Frankie T. Hull. Last updated 3 days ago.
10 stars 5.08 score 3 scriptslanl
NEONiso:Tools to Calibrate and Work with NEON Atmospheric Isotope Data
Functions for downloading, calibrating, and analyzing atmospheric isotope data bundled into the eddy covariance data products of the National Ecological Observatory Network (NEON) <https://www.neonscience.org>. Calibration tools are provided for carbon and water isotope products. Carbon isotope calibration details are found in Fiorella et al. (2021) <doi:10.1029/2020JG005862>, and the readme file at <https://github.com/lanl/NEONiso>. Tools for calibrating water isotope products have been added as of 0.6.0, but have known deficiencies and should be considered experimental and unsupported.
Maintained by Rich Fiorella. Last updated 1 months ago.
2 stars 5.08 score 6 scriptstrevorld
xmpdf:Edit 'XMP' Metadata and 'PDF' Bookmarks and Documentation Info
Edit 'XMP' metadata <https://en.wikipedia.org/wiki/Extensible_Metadata_Platform> in a variety of media file formats as well as edit bookmarks (aka outline aka table of contents) and documentation info entries in 'pdf' files. Can detect and use a variety of command-line tools to perform these operations such as 'exiftool' <https://exiftool.org/>, 'ghostscript' <https://www.ghostscript.com/>, and/or 'pdftk' <https://gitlab.com/pdftk-java/pdftk>.
Maintained by Trevor L Davis. Last updated 1 years ago.
4 stars 5.08 score 1 scripts 1 dependentsff1201
sgs:Sparse-Group SLOPE: Adaptive Bi-Level Selection with FDR Control
Implementation of Sparse-group SLOPE (SGS) (Feser and Evangelou (2023) <doi:10.48550/arXiv.2305.09467>) models. Linear and logistic regression models are supported, both of which can be fit using k-fold cross-validation. Dense and sparse input matrices are supported. In addition, a general Adaptive Three Operator Splitting (ATOS) (Pedregosa and Gidel (2018) <doi:10.48550/arXiv.1804.02339>) implementation is provided. Group SLOPE (gSLOPE) (Brzyski et al. (2019) <doi:10.1080/01621459.2017.1411269>) and group-based OSCAR models (Feser and Evangelou (2024) <doi:10.48550/arXiv.2405.15357>) are also implemented. All models are available with strong screening rules (Feser and Evangelou (2024) <doi:10.48550/arXiv.2405.15357>) for computational speed-up.
Maintained by Fabio Feser. Last updated 6 days ago.
1 stars 5.07 score 13 scripts 1 dependentsjameshwade
measure:A Recipes-style Interface to Tidymodels for Analytical Measurements
Analytical measurements...
Maintained by James Wade. Last updated 1 months ago.
5 stars 5.06 score 58 scriptsaleksandarsekulic
meteo:RFSI & STRK Interpolation for Meteo and Environmental Variables
Random Forest Spatial Interpolation (RFSI, Sekulić et al. (2020) <doi:10.3390/rs12101687>) and spatio-temporal geostatistical (spatio-temporal regression Kriging (STRK)) interpolation for meteorological (Kilibarda et al. (2014) <doi:10.1002/2013JD020803>, Sekulić et al. (2020) <doi:10.1007/s00704-019-03077-3>) and other environmental variables. Contains global spatio-temporal models calculated using publicly available data.
Maintained by Aleksandar Sekulić. Last updated 6 months ago.
18 stars 5.06 score 64 scriptspyanglab
AdaSampling:Adaptive Sampling for Positive Unlabeled and Label Noise Learning
Implements the adaptive sampling procedure, a framework for both positive unlabeled learning and learning with class label noise. Yang, P., Ormerod, J., Liu, W., Ma, C., Zomaya, A., Yang, J. (2018) <doi:10.1109/TCYB.2018.2816984>.
Maintained by Pengyi Yang. Last updated 6 years ago.
11 stars 5.04 score 10 scriptsacabassi
coca:Cluster-of-Clusters Analysis
Contains the R functions needed to perform Cluster-Of-Clusters Analysis (COCA) and Consensus Clustering (CC). For further details please see Cabassi and Kirk (2020) <doi:10.1093/bioinformatics/btaa593>.
Maintained by Alessandra Cabassi. Last updated 5 years ago.
cluster-analysiscluster-of-clustersclusteringcocagenomicsintegrative-clusteringmulti-omics
6 stars 5.03 score 12 scripts 1 dependentssstoeckl
FFdownload:Download Data from Kenneth French's Website
Downloads all the datasets (you can exclude the daily ones or specify a list of those you are targeting specifically) from Kenneth French's Website at <https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html>, process them and convert them to list of 'xts' (time series).
Maintained by Sebastian Stoeckl. Last updated 10 months ago.
9 stars 5.03 score 12 scriptsjobnmadu
Dyn4cast:Dynamic Modeling and Machine Learning Environment
Estimates, predicts and forecasts dynamic models, as well as machine learning metrics that assist in model selection for further analysis. The package also provides tools and metrics that are useful in machine learning and modeling, for example quick summaries, percent formatting, Mallow's Cp and others. The ecosystem of this package is the analysis of economic data for national development. The package is stable so far and aims at reliability, efficiency and time-saving.
Maintained by Job Nmadu. Last updated 15 days ago.
data-scienceequal-lenght-forecastforecastingknotsmachine-learningnigeriapredictionregression-modelsspline-modelsstatisticstime-series
4 stars 5.03 score 38 scriptscaranathunge
promor:Proteomics Data Analysis and Modeling Tools
A comprehensive, user-friendly package for label-free proteomics data analysis and machine learning-based modeling. Data generated from 'MaxQuant' can be easily used to conduct differential expression analysis, build predictive models with top protein candidates, and assess model performance. promor includes a suite of tools for quality control, visualization, missing data imputation (Lazar et. al. (2016) <doi:10.1021/acs.jproteome.5b00981>), differential expression analysis (Ritchie et. al. (2015) <doi:10.1093/nar/gkv007>), and machine learning-based modeling (Kuhn (2008) <doi:10.18637/jss.v028.i05>).
Maintained by Chathurani Ranathunge. Last updated 2 years ago.
biomarkersdifferential-expressionlfqmachine-learningmass-spectrometrymodelingproteomics
15 stars 5.02 score 14 scriptslanedrew
ldmppr:Estimate and Simulate from Location Dependent Marked Point Processes
A suite of tools for estimating, assessing model fit, simulating from, and visualizing location dependent marked point processes characterized by regularity in the pattern. You provide a reference marked point process, a set of raster images containing location specific covariates, and select the estimation algorithm and type of mark model. 'ldmppr' estimates the process and mark models and allows you to check the appropriateness of the model using a variety of diagnostic tools. Once a satisfactory model fit is obtained, you can simulate from the model and visualize the results. Documentation for the package 'ldmppr' is available in the form of a vignette.
Maintained by Lane Drew. Last updated 1 months ago.
1 stars 5.00 score 2 scriptsbioc
jazzPanda:Finding spatially relevant marker genes in image based spatial transcriptomics data
This package contains functions to find marker genes for image-based spatial transcriptomics data. There are functions to create spatial vectors from the cell and transcript coordinates, which are passed as inputs to find marker genes. Marker genes are detected for every cluster by two approaches. The first approach is permutation testing, which is implemented in parallel for finding marker genes in one-sample studies. The other approach is to build a linear model for every gene; this approach can account for multiple samples and background noise.
Maintained by Melody Jin. Last updated 29 days ago.
spatialgeneexpressiondifferentialexpressionstatisticalmethodtranscriptomicscorrelationlinear-modelsmarker-genesspatial-transcriptomics
2 stars 5.00 scorebioc
MAI:Mechanism-Aware Imputation
A two-step approach to imputing missing data in metabolomics. Step 1 uses a random forest classifier to classify missing values as either Missing Completely at Random/Missing At Random (MCAR/MAR) or Missing Not At Random (MNAR). MCAR/MAR are combined because it is often difficult to distinguish these two missing types in metabolomics data. Step 2 imputes the missing values based on the classified missing mechanisms, using the appropriate imputation algorithms. Imputation algorithms tested and available for MCAR/MAR include Bayesian Principal Component Analysis (BPCA), Multiple Imputation No-Skip K-Nearest Neighbors (Multi_nsKNN), and Random Forest. Imputation algorithms tested and available for MNAR include nsKNN and a single imputation approach for imputation of metabolites where left-censoring is present.
Maintained by Jonathan Dekermanjian. Last updated 5 months ago.
softwaremetabolomicsstatisticalmethodclassificationimputation-methodsmachine-learningmissing-data
2 stars 5.00 score 6 scriptsbioc
GARS:GARS: Genetic Algorithm for the identification of Robust Subsets of variables in high-dimensional and challenging datasets
Feature selection aims to identify and remove redundant, irrelevant and noisy variables from high-dimensional datasets. Selecting informative features affects the subsequent classification and regression analyses by improving their overall performances. Several methods have been proposed to perform feature selection: most of them rely on univariate statistics, correlation, entropy measurements or the usage of backward/forward regressions. Herein, we propose an efficient, robust and fast method that adopts stochastic optimization approaches for high-dimensional data. GARS is an innovative implementation of a genetic algorithm that selects robust features in high-dimensional and challenging datasets.
Maintained by Mattia Chiesa. Last updated 5 months ago.
classificationfeatureextractionclusteringopenjdk
5.00 score 2 scriptskevinhzq
healthdb:Working with Healthcare Databases
A system for identifying diseases or events from healthcare databases and preparing data for epidemiological studies. It includes capabilities not supported by 'SQL', such as matching strings by 'stringr' style regular expressions, and can compute comorbidity scores (Quan et al. (2005) <doi:10.1097/01.mlr.0000182534.19832.83>) directly on a database server. The implementation is based on 'dbplyr' with full 'tidyverse' compatibility.
Maintained by Kevin Hu. Last updated 1 months ago.
2 stars 4.95 scoretylerjpike
OOS:Out-of-Sample Time Series Forecasting
A comprehensive and cohesive API for the out-of-sample forecasting workflow: data preparation, forecasting - including both traditional econometric time series models and modern machine learning techniques - forecast combination, model and error analysis, and forecast visualization.
Maintained by Tyler J. Pike. Last updated 4 years ago.
econometricsforecast-combinationforecastingmachine-learning
9 stars 4.95 score 5 scriptsshanpengli
FastJM:Semi-Parametric Joint Modeling of Longitudinal and Survival Data
Maximum likelihood estimation for the semi-parametric joint modeling of competing risks and longitudinal data applying customized linear scan algorithms, proposed by Li and colleagues (2022) <doi:10.1155/2022/1362913>. The time-to-event data is modelled using a (cause-specific) Cox proportional hazards regression model with time-fixed covariates. The longitudinal outcome is modelled using a linear mixed effects model. The association is captured by shared random effects. The model is estimated using an Expectation Maximization algorithm.
Maintained by Shanpeng Li. Last updated 9 days ago.
5 stars 4.95 score 2 scripts 2 dependentsbioc
HPiP:Host-Pathogen Interaction Prediction
HPiP (Host-Pathogen Interaction Prediction) uses an ensemble learning algorithm for prediction of host-pathogen protein-protein interactions (HP-PPIs) using structural and physicochemical descriptors computed from the amino acid composition of host and pathogen proteins. The proposed package can effectively address data shortages and data unavailability for HP-PPI network reconstructions. Moreover, establishing computational frameworks in that regard will reveal mechanistic insights into infectious diseases and suggest potential HP-PPI targets, thus narrowing down the range of possible candidates for subsequent wet-lab experimental validations.
Maintained by Matineh Rahmatbakhsh. Last updated 5 months ago.
proteomicssystemsbiologynetworkinferencestructuralpredictiongenepredictionnetwork
3 stars 4.95 score 6 scriptstjetka
SLEMI:Statistical Learning Based Estimation of Mutual Information
Implementation of an algorithm for the estimation of mutual information and channel capacity from experimental data by classification procedures (logistic regression). Technically, it allows one to estimate information-theoretic measures between a finite-state input and a multivariate, continuous output. The method is described in Jetka et al. (2019) <doi:10.1371/journal.pcbi.1007132>.
Maintained by Tomasz Jetka. Last updated 1 years ago.
channel-capacityinformation-theorylogistic-regressionmutual-information-estimation
4 stars 4.92 score 21 scriptsconnor-reid-tiffany
omu:A Metabolomics Analysis Tool for Intuitive Figures and Convenient Metadata Collection
Facilitates the creation of intuitive figures to describe metabolomics data by utilizing Kyoto Encyclopedia of Genes and Genomes (KEGG) hierarchy data, and gathers functional orthology and gene data from the KEGG-REST API.
Maintained by Connor Tiffany. Last updated 1 years ago.
3 stars 4.89 score 52 scriptsfmgarciadiaz
PortalHacienda:Access the Portal de Hacienda Data with R
Obtain the list of datasets, and access and extend series from the Portal de Datos de Hacienda: search, download and forecast time series from the Ministry of Economy of Argentina. Forecasts are built with the 'forecast' package, Hyndman RJ, Khandakar Y (2008) <doi:10.18637/jss.v027.i03>.
Maintained by Fernando Garcia Diaz. Last updated 2 years ago.
apiargentinaeconomiaministerio-de-economiaseries-de-tiempo
15 stars 4.88 score 7 scriptsbdwilliamson
flevr:Flexible, Ensemble-Based Variable Selection with Potentially Missing Data
Perform variable selection in settings with possibly missing data based on extrinsic (algorithm-specific) and intrinsic (population-level) variable importance. Uses a Super Learner ensemble to estimate the underlying prediction functions that give rise to estimates of variable importance. For more information about the methods, please see Williamson and Huang (2023+) <arXiv:2202.12989>.
Maintained by Brian D. Williamson. Last updated 1 years ago.
5 stars 4.88 score 2 scriptsrobson-fernandes
bnviewer:Bayesian Networks Interactive Visualization and Explainable Artificial Intelligence
Bayesian networks provide an intuitive framework for probabilistic reasoning and their graphical nature can be interpreted quite clearly. Graph-based methods of machine learning are becoming more popular because they offer a richer model of knowledge that can be understood by a human in a graphical format. 'bnviewer' is an R package that allows the interactive visualization of Bayesian networks. The aim of this package is to improve Bayesian network visualization over the basic and static views offered by existing packages.
Maintained by Robson Fernandes. Last updated 5 years ago.
bayesian-inferencebayesian-networkbayesian-networksprobabilistic-graphical-models
7 stars 4.86 score 69 scripts 1 dependentshectorrdb
Ecume:Equality of 2 (or k) Continuous Univariate and Multivariate Distributions
We implement (or re-implement in R) a variety of statistical tools. They are focused on non-parametric two-sample (or k-sample) distribution comparisons in the univariate or multivariate case. See the vignette for more info.
Maintained by Hector Roux de Bezieux. Last updated 10 months ago.
1 stars 4.86 score 16 scripts 3 dependentsgokmenzararsiz
dtComb:Statistical Combination of Diagnostic Tests
A system for combining two diagnostic tests using various approaches that include statistical and machine-learning-based methodologies. These approaches are divided into four groups: linear combination methods, non-linear combination methods, mathematical operators, and machine learning algorithms. See the <https://biotools.erciyes.edu.tr/dtComb/> website for more information, documentation, and examples.
Maintained by Gokmen Zararsiz. Last updated 4 days ago.
4.85 score 7 scriptsepivec
TDLM:Systematic Comparison of Trip Distribution Laws and Models
The main purpose of this package is to propose a rigorous framework to fairly compare trip distribution laws and models as described in Lenormand et al. (2016) <doi:10.1016/j.jtrangeo.2015.12.008>.
Maintained by Maxime Lenormand. Last updated 25 days ago.
2 stars 4.85 score 3 scriptsbioc
MLSeq:Machine Learning Interface for RNA-Seq Data
This package applies several machine learning methods, including SVM, bagSVM, Random Forest and CART to RNA-Seq data.
Maintained by Gokmen Zararsiz. Last updated 5 months ago.
immunooncologysequencingrnaseqclassificationclustering
4.81 score 27 scripts 1 dependentsjoelcuerrier
cdid:The Chained Difference-in-Differences
Extends the 'did' package to improve efficiency and handling of unbalanced panel data. Bellego, Benatia, and Dortet-Bernadet (2024), "The Chained Difference-in-Differences", Journal of Econometrics, <doi:10.1016/j.jeconom.2024.105783>.
Maintained by David Benatia. Last updated 2 months ago.
2 stars 4.78 score 3 scriptsbioc
supersigs:Supervised mutational signatures
Generate SuperSigs (supervised mutational signatures) from single nucleotide variants in the cancer genome. Functions included in the package allow the user to learn supervised mutational signatures from their data and apply them to new data. The methodology is based on the one described in Afsari (2021, ELife).
Maintained by Albert Kuo. Last updated 5 months ago.
featureextractionclassificationregressionsequencingwholegenomesomaticmutation
3 stars 4.78 score 3 scriptsadimajo
glmtree:Logistic Regression Trees
A logistic regression tree is a decision tree with logistic regressions at its leaves. A particular stochastic expectation maximization algorithm is used to draw a few good trees, which are then assessed via the user's criterion of choice among BIC / AIC / test-set Gini. The formal development is given in a PhD chapter; see Ehrhardt (2019) <https://github.com/adimajo/manuscrit_these/releases/>.
Maintained by Adrien Ehrhardt. Last updated 1 years ago.
6 stars 4.78 score 3 scriptsharrison4192
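A two-stage caricature of a logistic regression tree, with an 'rpart' tree defining the leaves and one glm() fitted inside each leaf; the package instead draws and assesses trees with a stochastic EM algorithm, so this is only an illustration of the model class.
  # Shallow tree to define leaves, then a logistic regression per leaf.
  library(rpart)
  set.seed(1)
  d <- data.frame(x1 = rnorm(600), x2 = rnorm(600))
  d$y <- rbinom(600, 1, plogis(ifelse(d$x1 > 0, -1 + 2 * d$x2, 1 - 2 * d$x2)))
  tree <- rpart(factor(y) ~ x1, data = d,
                control = rpart.control(maxdepth = 1, cp = 0.001, minbucket = 50))
  d$leaf <- factor(tree$where)                   # leaf membership of each row
  leaf_fits <- lapply(split(d, d$leaf), function(dd)
    glm(y ~ x2, data = dd, family = binomial()))
  lapply(leaf_fits, coef)                        # the x2 slope differs between leaves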
tidybins:Make Tidy Bins
Multiple ways to bin numeric columns with a tidy output. Wraps a variety of existing binning methods into one function, and includes a new method for binning by equal value, which is useful for sales data. Provides a function to automatically summarize the properties of the binned columns.
Maintained by Harrison Tietze. Last updated 10 months ago.
4 stars 4.78 score 2 scripts 1 dependentscefet-rj-dal
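For comparison, equal-width, equal-frequency and equal-value bins can be built directly in base R; this is a generic illustration, not the package's interface, and "equal value" is taken here on one reading of the term: each bin carries roughly the same share of the total.
  # Three ways to cut a numeric sales column into 4 bins.
  set.seed(1)
  sales <- rgamma(100, shape = 2, scale = 50)
  equal_width <- cut(sales, breaks = 4)                      # same interval width
  equal_freq  <- cut(sales, breaks = quantile(sales, 0:4 / 4),
                     include.lowest = TRUE)                  # same count per bin
  ord <- order(sales)                                        # "equal value" bins
  equal_value <- integer(length(sales))
  equal_value[ord] <- ceiling(cumsum(sales[ord]) / sum(sales) * 4)
  table(equal_freq)                                          # balanced counts
  tapply(sales, equal_value, sum)                            # roughly balanced totals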
heimdall:Drift Adaptable Models
By analyzing streaming datasets, it is possible to observe significant changes in the data distribution or in model accuracy during prediction (concept drift). The goal of 'heimdall' is to measure when concept drift occurs. The package makes available several state-of-the-art methods. It also tackles how to adapt models in a nonstationary context. Some concept drift methods are described in Tavares (2022) <doi:10.1007/s12530-021-09415-z>.
Maintained by Eduardo Ogasawara. Last updated 2 months ago.
2 stars 4.77 score 45 scriptshknd23
DeepLearningCausal:Causal Inference with Super Learner and Deep Neural Networks
Functions to estimate Conditional Average Treatment Effects (CATE) and Population Average Treatment Effects on the Treated (PATT) from experimental or observational data using the Super Learner (SL) ensemble method and Deep neural networks. The package first provides functions to implement meta-learners such as the Single-learner (S-learner) and Two-learner (T-learner) described in Künzel et al. (2019) <doi:10.1073/pnas.1804597116> for estimating the CATE. The S- and T-learner are each estimated using the SL ensemble method and deep neural networks. It then provides functions to implement the Ottoboni and Poulos (2020) <doi:10.1515/jci-2018-0035> PATT-C estimator to obtain the PATT from experimental data with noncompliance by using the SL ensemble method and deep neural networks.
Maintained by Nguyen K. Huynh. Last updated 2 months ago.
causal-inferencedeep-neural-networksmachine-learning
2 stars 4.73 score 5 scriptsbioc
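The T-learner mentioned above is easy to sketch with plain lm() fits standing in for the Super Learner and deep-network learners the package actually uses: fit one outcome model per arm and take the difference of predictions as the CATE estimate.
  # T-learner sketch: separate outcome models by arm, CATE = difference in predictions.
  set.seed(1)
  n <- 2000
  x <- rnorm(n)
  w <- rbinom(n, 1, 0.5)                         # randomised treatment
  y <- 1 + 2 * x + w * (1 + x) + rnorm(n)        # true CATE(x) = 1 + x
  d <- data.frame(x, w, y)
  m1 <- lm(y ~ x, data = subset(d, w == 1))      # treated-arm model
  m0 <- lm(y ~ x, data = subset(d, w == 0))      # control-arm model
  cate_hat <- predict(m1, d) - predict(m0, d)
  head(cbind(true = 1 + d$x, estimated = cate_hat))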
TOP:TOP Constructs Transferable Model Across Gene Expression Platforms
TOP constructs a transferable model across gene expression platforms for prospective experiments. Such a transferable model can be trained to make predictions on independent validation data with an accuracy that is similar to a re-substituted model. The TOP procedure also has the flexibility to be adapted to suit the most common clinical response variables, including linear response, binomial and Cox PH models.
Maintained by Harry Robertson. Last updated 5 months ago.
softwaresurvivalgeneexpression
4.70 score 50 scriptspierreroudier
dissever:Spatial Downscaling using the Dissever Algorithm
Spatial downscaling of coarse grid mapping to fine grid mapping using predictive covariates and a model fitted using the 'caret' package. The original dissever algorithm was published by Malone et al. (2012) <doi:10.1016/j.cageo.2011.08.021>, and extended by Roudier et al. (2017) <doi:10.1016/j.compag.2017.08.021>.
Maintained by Pierre Roudier. Last updated 5 years ago.
10 stars 4.70 score 6 scriptsduolajiang
RCTrep:Validation of Estimates of Treatment Effects in Observational Data
Validates estimates of (conditional) average treatment effects obtained using observational data by a) making it easy to obtain and visualize estimates derived using a large variety of methods (G-computation, inverse propensity score weighting, etc.), and b) ensuring that estimates are easily compared to a gold standard (i.e., estimates derived from randomized controlled trials). 'RCTrep' offers a generic protocol for treatment effect validation based on four simple steps, namely, set-selection, estimation, diagnosis, and validation. 'RCTrep' provides a simple dashboard to review the obtained results. The validation approach is introduced by Shen, L., Geleijnse, G. and Kaptein, M. (2023) <doi:10.21203/rs.3.rs-2559287/v1>.
Maintained by Lingjie Shen. Last updated 2 years ago.
8 stars 4.68 score 12 scriptssahirbhatnagar
eclust:Environment Based Clustering for Interpretable Predictive Models in High Dimensional Data
Companion package to the paper: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures. Bhatnagar, Yang, Khundrakpam, Evans, Blanchette, Bouchard, Greenwood (2017) <DOI:10.1101/102475>. This package includes an algorithm for clustering high dimensional data that can be affected by an environmental factor.
Maintained by Sahir Rai Bhatnagar. Last updated 8 years ago.
2 stars 4.62 score 14 scriptsbioc
branchpointer:Prediction of intronic splicing branchpoints
Predicts branchpoint probability for sites in intronic branchpoint windows. Queries can be supplied as intronic regions; or to evaluate the effects of mutations, SNPs.
Maintained by Beth Signal. Last updated 5 months ago.
softwaregenomeannotationgenomicvariationmotifannotation
4.62 score 21 scriptsabichat
scimo:Extra Recipes Steps for Dealing with Omics Data
Omics data (e.g. transcriptomics, proteomics, metagenomics...) offer a detailed and multi-dimensional perspective on the molecular components and interactions within complex biological (eco)systems. Analyzing these data requires adapted procedures, which are implemented as steps according to the 'recipes' package.
Maintained by Antoine BICHAT. Last updated 10 months ago.
4 stars 4.60 score 4 scriptsccy-dev
LongDat:A Tool for 'Covariate'-Sensitive Longitudinal Analysis on 'omics' Data
This tool takes a longitudinal dataset as input and analyzes whether there is significant change in the features over time (a proxy for treatments), while detecting and controlling for 'covariates' simultaneously. 'LongDat' is able to take in several data types as input, including count, proportion, binary, ordinal and continuous data. The output table contains p values, effect sizes and 'covariates' of each feature, making the downstream analysis easy.
Maintained by Chia-Yu Chen. Last updated 4 months ago.
4 stars 4.60 score 4 scriptsbioc
MAIT:Statistical Analysis of Metabolomic Data
The MAIT package contains functions to perform end-to-end statistical analysis of LC/MS metabolomic data. Special emphasis is put on peak annotation and on the modular design of its functions.
Maintained by Pol Sola-Santos. Last updated 5 months ago.
immunooncologymassspectrometrymetabolomicssoftware
4.60 score 20 scriptsbioc
SVMDO:Identification of Tumor-Discriminating mRNA Signatures via Support Vector Machines Supported by Disease Ontology
It is an easy-to-use GUI using disease information for detecting tumor/normal sample discriminating gene sets from differentially expressed genes. Our approach is based on an iterative algorithm that filters genes using disease ontology enrichment analysis and the Wilks' lambda criterion, connected to SVM classification model construction. Along with gene set extraction, SVMDO also provides individual prognostic marker detection. The algorithm is designed for FPKM- and RPKM-normalized RNA-Seq transcriptome datasets.
Maintained by Mustafa Erhan Ozer. Last updated 5 months ago.
genesetenrichmentdifferentialexpressionguiclassificationrnaseqtranscriptomicssurvivalmachine-learningrna-seqshiny
4.60 score 2 scriptsriccardo-df
aggTrees:Aggregation Trees
Nonparametric data-driven approach to discovering heterogeneous subgroups in a selection-on-observables framework. 'aggTrees' allows researchers to assess whether there exists relevant heterogeneity in treatment effects by generating a sequence of optimal groupings, one for each level of granularity. For each grouping, we obtain point estimation and inference about the group average treatment effects. Please reference the use as Di Francesco (2024) <doi:10.48550/arXiv.2410.11408>.
Maintained by Riccardo Di Francesco. Last updated 1 months ago.
4.60 score 4 scriptsssa-statistical-team-projects
povmap:Extension to the 'emdi' Package
The R package 'povmap' supports small area estimation of means and poverty headcount rates. It adds several new features to the 'emdi' package (see "The R Package emdi for Estimating and Mapping Regionally Disaggregated Indicators" by Kreutzmann et al. (2019) <doi:10.18637/jss.v091.i07>). These include new options for incorporating survey weights, ex-post benchmarking of estimates, two additional transformations, several new convenient functions to assist with reporting results, and a wrapper function to facilitate access from 'Stata'.
Maintained by Ifeanyi Edochie. Last updated 5 months ago.
1 stars 4.60 score 10 scriptscolemanrharris
mxnorm:Apply Normalization Methods to Multiplexed Images
Implements methods to normalize multiplexed imaging data, including statistical metrics and visualizations to quantify technical variation in this data type. Reference for methods listed here: Harris, C., Wrobel, J., & Vandekar, S. (2022). mxnorm: An R Package to Normalize Multiplexed Imaging Data. Journal of Open Source Software, 7(71), 4180, <doi:10.21105/joss.04180>.
Maintained by Coleman Harris. Last updated 2 years ago.
7 stars 4.54 score 7 scriptsvascobranco
red:IUCN Redlisting Tools
Includes algorithms to facilitate the assessment of extinction risk of species according to the IUCN (International Union for Conservation of Nature, see <https://www.iucn.org/> for more information) red list criteria.
Maintained by Vasco V. Branco. Last updated 3 months ago.
1 stars 4.54 score 29 scripts 1 dependentserblast
parcats:Interactive Parallel Categories Diagrams for 'easyalluvial'
Complex graphical representations of data are best explored using interactive elements. 'parcats' adds interactive graphing capabilities to the 'easyalluvial' package. The 'plotly.js' parallel categories diagrams offer a good framework for creating interactive flow graphs that allow manual drag and drop sorting of dimensions and categories, highlighting single flows and displaying mouse over information. The 'plotly.js' dependency is quite heavy and therefore is outsourced into a separate package.
Maintained by Bjoern Koneswarakantha. Last updated 1 years ago.
25 stars 4.53 score 27 scriptsbioc
tLOH:Assessment of evidence for LOH in spatial transcriptomics pre-processed data using Bayes factor calculations
tLOH, or transcriptomicsLOH, assesses evidence for loss of heterozygosity (LOH) in pre-processed spatial transcriptomics data. This tool requires spatial transcriptomics cluster and allele count information at likely heterozygous single-nucleotide polymorphism (SNP) positions in VCF format. Bayes factors are calculated at each SNP to determine likelihood of potential loss of heterozygosity event. Two plotting functions are included to visualize allele fraction and aggregated Bayes factor per chromosome. Data generated with the 10X Genomics Visium Spatial Gene Expression platform must be pre-processed to obtain an individual sample VCF with columns for each cluster. Required fields are allele depth (AD) with counts for reference/alternative alleles and read depth (DP).
Maintained by Michelle Webb. Last updated 5 months ago.
copynumbervariationtranscriptionsnpgeneexpressiontranscriptomics
3 stars 4.48 score 4 scriptsenriquegit
ssr:Semi-Supervised Regression Methods
An implementation of semi-supervised regression methods including self-learning and co-training by committee based on Hady, M. F. A., Schwenker, F., & Palm, G. (2009) <doi:10.1007/978-3-642-04274-4_13>. Users can define which set of regressors to use as base models from the 'caret' package, other packages, or custom functions.
Maintained by Enrique Garcia-Ceja. Last updated 6 years ago.
data-sciencemachine-learningregressionsemi-supervised-learning
2 stars 4.46 score 29 scriptsnhejazi
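The self-learning strategy described above can be sketched with a single lm() base model: fit on the labelled rows, pseudo-label a batch of unlabelled rows with the model's own predictions, refit, and repeat. The package's co-training by committee and its 'caret' integration are not shown.
  # Bare-bones self-training for regression.
  set.seed(1)
  d <- data.frame(x = runif(300, -2, 2))
  d$y <- sin(d$x) + rnorm(300, sd = 0.2)
  labelled   <- d[1:60, ]
  unlabelled <- d[-(1:60), "x", drop = FALSE]
  for (iter in 1:5) {
    fit <- lm(y ~ poly(x, 3), data = labelled)
    if (nrow(unlabelled) == 0) break
    take   <- seq_len(min(30, nrow(unlabelled)))  # a batch; real self-learning would
    pseudo <- data.frame(x = unlabelled$x[take],  # keep only the most reliable predictions
                         y = predict(fit, unlabelled[take, , drop = FALSE]))
    labelled   <- rbind(labelled, pseudo)
    unlabelled <- unlabelled[-take, , drop = FALSE]
  }
  summary(fit)$r.squared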
medoutcon:Efficient Natural and Interventional Causal Mediation Analysis
Efficient estimators of interventional (in)direct effects in the presence of mediator-outcome confounding affected by exposure. The effects estimated allow for the impact of the exposure on the outcome through a direct path to be disentangled from that through mediators, even in the presence of intermediate confounders that complicate such a relationship. Currently supported are non-parametric efficient one-step and targeted minimum loss estimators based on the formulation of Díaz, Hejazi, Rudolph, and van der Laan (2020) <doi:10.1093/biomet/asaa085>. Support for efficient estimation of the natural (in)direct effects is also provided, appropriate for settings in which intermediate confounders are absent. The package also supports estimation of these effects when the mediators are measured using outcome-dependent two-phase sampling designs (e.g., case-cohort).
Maintained by Nima Hejazi. Last updated 1 years ago.
causal-inferencecausal-machine-learninginverse-probability-weightsmachine-learningmediation-analysisstochastic-interventionstargeted-learningtreatment-effects
13 stars 4.46 score 22 scriptssarahleavitt
nbTransmission:Naive Bayes Transmission Analysis
Estimates the relative transmission probabilities between cases in an infectious disease outbreak or cluster using naive Bayes. Included are various functions to use these probabilities to estimate transmission parameters such as the generation/serial interval and reproductive number as well as finding the contribution of covariates to the probabilities and visualizing results. The ideal use is for an infectious disease dataset with metadata on the majority of cases but more informative data such as contact tracing or pathogen whole genome sequencing on only a subset of cases. For a detailed description of the methods see Leavitt et al. (2020) <doi:10.1093/ije/dyaa031>.
Maintained by Sarah V Leavitt. Last updated 19 days ago.
4 stars 4.45 score 14 scriptsbioc
PRONE:The PROteomics Normalization Evaluator
High-throughput omics data are often affected by systematic biases introduced throughout all the steps of a clinical study, from sample collection to quantification. Normalization methods aim to adjust for these biases to make the actual biological signal more prominent. However, selecting an appropriate normalization method is challenging due to the wide range of available approaches. Therefore, a comparative evaluation of unnormalized and normalized data is essential in identifying an appropriate normalization strategy for a specific data set. This R package provides different functions for preprocessing, normalizing, and evaluating different normalization approaches. Furthermore, normalization methods can be evaluated on downstream steps, such as differential expression analysis and statistical enrichment analysis. Spike-in data sets with known ground truth and real-world data sets of biological experiments acquired by either tandem mass tag (TMT) or label-free quantification (LFQ) can be analyzed.
Maintained by Lis Arend. Last updated 10 days ago.
proteomics preprocessing normalization differentialexpression visualization data-analysis evaluation
2 stars 4.41 score 9 scripts
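
The kind of comparison PRONE automates can be approximated by hand. The code below is generic base R, not PRONE's API: apply two candidate normalizations to a log-intensity matrix and check how much within-replicate-group spread each leaves behind.

    ## Generic comparison of two normalizations on a proteins x samples matrix;
    ## pooled within-group variance is used as a crude evaluation metric.
    set.seed(4)
    mat <- matrix(rnorm(1000 * 12, mean = 20, sd = 2), nrow = 1000)
    mat <- sweep(mat, 2, runif(12, -1.5, 1.5), "+")   # inject per-sample bias
    group <- rep(c("A", "B"), each = 6)

    ## 1) median centring
    med_norm <- sweep(mat, 2, apply(mat, 2, median), "-") + median(mat)
    ## 2) quantile normalization
    ref    <- rowMeans(apply(mat, 2, sort))
    q_norm <- apply(mat, 2, function(col) ref[rank(col, ties.method = "first")])

    ## evaluation: mean within-group variance per protein (lower = less technical noise)
    within_var <- function(x) mean(sapply(split(1:12, group),
                                          function(i) mean(apply(x[, i], 1, var))))
    c(raw = within_var(mat), median = within_var(med_norm), quantile = within_var(q_norm))
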
acabassi
klic:Kernel Learning Integrative Clustering
Kernel Learning Integrative Clustering (KLIC) is an algorithm that combines multiple kernels, each representing a different measure of similarity between a set of observations. The contribution of each kernel to the final clustering is weighted according to the amount of information it carries. As well as providing the functions required to perform the kernel-based clustering, this package also allows the user to simply supply the data as input: the kernels are then built using consensus clustering. Different strategies for choosing the best number of clusters are also available. For further details please see Cabassi and Kirk (2020) <doi:10.1093/bioinformatics/btaa593>.
Maintained by Alessandra Cabassi. Last updated 5 years ago.
cluster-analysis clustering coca genomics integrative-clustering kernel-methods multi-omics
5 stars 4.40 score 10 scripts
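
The multiple-kernel idea can be sketched in base R. This is not the klic API: klic learns the kernel weights and builds its kernels via consensus clustering, whereas the weights below are fixed by hand. The sketch computes one similarity kernel per data view, combines them with weights, and clusters the combined kernel with spectral clustering.

    ## Weighted multiple-kernel combination + spectral clustering, base R only.
    set.seed(5)
    n  <- 90
    cl <- rep(1:3, each = n / 3)                             # true groups
    view1 <- matrix(rnorm(n * 5, mean = cl), ncol = 5)       # informative data view
    view2 <- matrix(rnorm(n * 5, mean = 0.2 * cl), ncol = 5) # weaker data view

    rbf <- function(x, sigma = 2) exp(-as.matrix(dist(x))^2 / (2 * sigma^2))
    K <- 0.7 * rbf(view1) + 0.3 * rbf(view2)                 # weighted kernel sum

    ## spectral clustering of the combined kernel
    d    <- rowSums(K)
    Lsym <- diag(n) - diag(1 / sqrt(d)) %*% K %*% diag(1 / sqrt(d))
    U    <- eigen(Lsym, symmetric = TRUE)$vectors[, (n - 2):n]  # 3 smallest eigenvalues
    U    <- U / sqrt(rowSums(U^2))                           # row-normalise
    table(kmeans(U, centers = 3, nstart = 20)$cluster, cl)   # compare with truth
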
robson-fernandes
dbnlearn:Dynamic Bayesian Network Structure Learning, Parameter Learning and Forecasting
It allows the user to learn the structure of a dynamic Bayesian network from univariate time series, estimate its parameters, and produce forecasts. Implements dynamic Bayesian networks with temporal windows, with collections of linear regressors for Gaussian nodes, based on the introductory texts of Korb and Nicholson (2010) <doi:10.1201/b10391> and Nagarajan, Scutari and Lèbre (2013) <doi:10.1007/978-1-4614-6446-4>.
Maintained by Robson Fernandes. Last updated 5 years ago.
bayesian-inference bayesian-networks dynamic-bayesian-networks probabilistic-graphical-models time-series
16 stars 4.32 score 26 scripts
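
The temporal-window construction can be sketched without dbnlearn: embed the univariate series into a matrix of lagged copies, fit a linear Gaussian model of the current value on its lags, and forecast recursively. The code below uses only base R and is not dbnlearn's interface.

    ## Lagged linear-Gaussian model for a univariate series: a temporal-window sketch.
    set.seed(6)
    y <- as.numeric(arima.sim(list(ar = c(0.6, 0.2)), n = 300))
    p <- 4                                           # window size (number of lags)

    E   <- embed(y, p + 1)                           # col 1 = y_t, cols 2..p+1 = its lags
    dat <- as.data.frame(E)
    names(dat) <- c("y", paste0("lag", 1:p))
    fit <- lm(y ~ ., data = dat)                     # linear regressors, Gaussian node

    ## recursive multi-step forecast
    h      <- 10
    window <- tail(y, p)                             # most recent p observations
    fc     <- numeric(h)
    for (i in 1:h) {
      newd        <- as.data.frame(t(rev(window)))   # newest value becomes lag1
      names(newd) <- paste0("lag", 1:p)
      fc[i]       <- predict(fit, newdata = newd)
      window      <- c(window[-1], fc[i])            # slide the window forward
    }
    fc
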
bioc
PDATK:Pancreatic Ductal Adenocarcinoma Tool-Kit
Pancreatic ductal adenocarcinoma (PDA) has a relatively poor prognosis and is one of the most lethal cancers. Molecular classification of gene expression profiles holds the potential to identify meaningful subtypes that can inform therapeutic strategy in the clinical setting. The Pancreatic Cancer Adenocarcinoma Tool-Kit (PDATK) provides an S4 class-based interface for performing unsupervised subtype discovery, cross-cohort meta-clustering, gene-expression-based classification, and subsequent survival analysis to identify prognostically useful subtypes in pancreatic cancer and beyond. Two novel methods, Consensus Subtypes in Pancreatic Cancer (CSPC) and Pancreatic Cancer Overall Survival Predictor (PCOSP), are included for consensus-based meta-clustering and overall-survival prediction, respectively. Additionally, four published subtype classifiers and three published prognostic gene signatures are included to allow users to easily recreate published results, apply existing classifiers to new data, and benchmark the relative performance of new methods. The use of existing Bioconductor classes as input to all PDATK classes and methods enables integration with existing Bioconductor datasets, including the 21 pancreatic cancer patient cohorts available in the MetaGxPancreas data package. PDATK has been used to replicate results from Sandhu et al. (2019) <doi:10.1200/cci.18.00102>, and an additional paper is in the works using CSPC to validate subtypes from the included published classifiers; both use the data available in MetaGxPancreas. The inclusion of subtype centroids and prognostic gene signatures from these and other publications will enable researchers and clinicians to classify novel patient gene expression data, allowing the direct clinical application of the classifiers included in PDATK. Overall, PDATK provides a rich set of tools to identify and validate useful prognostic and molecular subtypes based on gene-expression data, benchmark new classifiers against existing ones, and apply discovered classifiers to novel patient data to inform clinical decision making.
Maintained by Benjamin Haibe-Kains. Last updated 5 months ago.
geneexpression pharmacogenetics pharmacogenomics software classification survival clustering geneprediction
1 stars 4.31 score 17 scripts
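
A deliberately simplified version of the workflow PDATK formalizes, using k-means and the survival package on simulated data rather than PDATK's S4 classes and its CSPC/PCOSP methods: discover subtypes from expression, then test whether they separate survival.

    ## Toy subtype discovery + survival comparison; not PDATK's S4 interface.
    library(survival)
    set.seed(7)
    n_pat <- 120
    expr  <- matrix(rnorm(n_pat * 50), nrow = n_pat)      # patients x genes
    expr[1:60, 1:10] <- expr[1:60, 1:10] + 2              # built-in subtype signal

    subtype <- kmeans(scale(expr), centers = 2, nstart = 25)$cluster
    ## simulated follow-up with worse outcomes in one subtype
    time    <- rexp(n_pat, rate = ifelse(subtype == 1, 0.08, 0.04))
    status  <- rbinom(n_pat, 1, 0.8)                      # 1 = event observed

    ## do the discovered subtypes separate survival?
    survdiff(Surv(time, status) ~ subtype)                # log-rank test
    coxph(Surv(time, status) ~ factor(subtype))           # hazard ratio estimate
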
bioc
m6Aboost:m6Aboost
This package helps users run the m6Aboost model on their own miCLIP2 data. It includes functions to assign read counts and extract the features required to run the m6Aboost model. The miCLIP2 data should be stored in a GRanges object. More details can be found in the vignette.
Maintained by You Zhou. Last updated 5 months ago.
sequencing epigenetics genetics experimenthub software
2 stars 4.30 score 5 scripts
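
A minimal illustration of the expected input format and of tallying read support at candidate sites with GenomicRanges; this uses the generic countOverlaps() rather than m6Aboost's own read-assignment and feature-extraction functions.

    ## Candidate m6A sites and aligned reads as GRanges; counting read support.
    ## Generic GenomicRanges code, not m6Aboost's feature-extraction functions.
    library(GenomicRanges)
    set.seed(8)
    sites <- GRanges("chr1",
                     IRanges(start = c(100, 250, 400), width = 1),
                     strand = "+")
    reads <- GRanges("chr1",
                     IRanges(start = sample(80:420, 50, replace = TRUE), width = 30),
                     strand = "+")
    sites$read_count <- countOverlaps(sites, reads)   # per-site read support
    sites
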
bioc
AnVILBilling:Provide functions to retrieve and report on usage expenses in NHGRI AnVIL (anvilproject.org).
AnVILBilling helps monitor AnVIL-related costs in R, using queries to a BigQuery table to which costs are exported daily. Functions are defined to help categorize tasks and associated expenditures, and to visualize and explore expense profiles over time. This package will be expanded to help users estimate costs for specific task sets.
Maintained by Vince Carey. Last updated 5 months ago.
4.30 score 5 scripts
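
The same kind of query AnVILBilling issues can be sketched directly with bigrquery; the project, dataset, and table names below are placeholders for a user's own billing export and are not values supplied by the package.

    ## Querying a GCP billing export table directly with bigrquery.
    ## Placeholder project/dataset/table names; not AnVILBilling's own interface.
    library(bigrquery)

    sql <- "
      SELECT service.description AS service, SUM(cost) AS total_cost
      FROM `my-billing-project.my_dataset.gcp_billing_export_v1`  -- placeholder table
      WHERE usage_start_time >= TIMESTAMP('2024-01-01')
      GROUP BY service
      ORDER BY total_cost DESC"

    tb    <- bq_project_query("my-billing-project", sql)   # runs the query
    costs <- bq_table_download(tb)                          # results as a tibble
    costs
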