R-universe search: imputations

amices

mice:Multivariate Imputation by Chained Equations

Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm as described in Van Buuren and Groothuis-Oudshoorn (2011) <doi:10.18637/jss.v045.i03>. Each variable has its own imputation model. Built-in imputation models are provided for continuous data (predictive mean matching, normal), binary data (logistic regression), unordered categorical data (polytomous logistic regression) and ordered categorical data (proportional odds). MICE can also impute continuous two-level data (normal model, pan, second-level variables). Passive imputation can be used to maintain consistency between variables. Various diagnostic plots are available to inspect the quality of the imputations.

Maintained by Stef van Buuren. Last updated 5 days ago.

chained-equations fcs imputation mice missing-data missing-values multiple-imputation multivariate-data cpp

178.5 match 462 stars 16.50 score 10k scripts 154 dependents

alexanderrobitzsch

miceadds:Some Additional Multiple Imputation Functions, Especially for 'mice'

Contains functions for multiple imputation which complements existing functionality in R. In particular, several imputation methods for the mice package (van Buuren & Groothuis-Oudshoorn, 2011, <doi:10.18637/jss.v045.i03>) are implemented. Main features of the miceadds package include plausible value imputation (Mislevy, 1991, <doi:10.1007/BF02294457>), multilevel imputation for variables at any level or with any number of hierarchical and non-hierarchical levels (Grund, Luedtke & Robitzsch, 2018, <doi:10.1177/1094428117703686>; van Buuren, 2018, Ch.7, <doi:10.1201/9780429492259>), imputation using partial least squares (PLS) for high dimensional predictors (Robitzsch, Pham & Yanagida, 2016), nested multiple imputation (Rubin, 2003, <doi:10.1111/1467-9574.00217>), substantive model compatible imputation (Bartlett et al., 2015, <doi:10.1177/0962280214521348>), and features for the generation of synthetic datasets (Reiter, 2005, <doi:10.1111/j.1467-985X.2004.00343.x>; Nowok, Raab, & Dibben, 2016, <doi:10.18637/jss.v074.i11>).

Maintained by Alexander Robitzsch. Last updated 14 days ago.

missing-data multiple-imputation openblas cpp

171.0 match 16 stars 9.16 score 542 scripts 9 dependents

statistikat

VIM:Visualization and Imputation of Missing Values

New tools for the visualization of missing and/or imputed values are introduced, which can be used for exploring the data and the structure of the missing and/or imputed values. Depending on this structure of the missing values, the corresponding methods may help to identify the mechanism generating the missing values and allows to explore the data including missing values. In addition, the quality of imputation can be visually explored using various univariate, bivariate, multiple and multivariate plot methods. A graphical user interface available in the separate package VIMGUI allows an easy handling of the implemented plot methods.

Maintained by Matthias Templ. Last updated 7 months ago.

hotdeck imputation-methods model-predictions visualization cpp

84.4 match 85 stars 14.44 score 2.6k scripts 19 dependents

matteo21q

jomo:Multilevel Joint Modelling Multiple Imputation

Similarly to Schafer's package 'pan', 'jomo' is a package for multilevel joint modelling multiple imputation (Carpenter and Kenward, 2013) <doi:10.1002/9781119942283>. Novel aspects of 'jomo' are the possibility of handling binary and categorical data through latent normal variables, the option to use cluster-specific covariance matrices and to impute compatibly with the substantive model.

Maintained by Matteo Quartagno. Last updated 2 years ago.

81.4 match 3 stars 9.58 score 126 scripts 154 dependents

pharmaverse

admiral:ADaM in R Asset Library

A toolbox for programming Clinical Data Interchange Standards Consortium (CDISC) compliant Analysis Data Model (ADaM) datasets in R. ADaM datasets are a mandatory part of any New Drug or Biologics License Application submitted to the United States Food and Drug Administration (FDA). Analysis derivations are implemented in accordance with the "Analysis Data Model Implementation Guide" (CDISC Analysis Data Model Team, 2021, <https://www.cdisc.org/standards/foundational/adam>).

Maintained by Ben Straub. Last updated 3 days ago.

cdisc clinical-trials open-source

40.0 match 236 stars 13.89 score 486 scripts 4 dependents

simongrund1

mitml:Tools for Multiple Imputation in Multilevel Modeling

Provides tools for multiple imputation of missing data in multilevel modeling. Includes a user-friendly interface to the packages 'pan' and 'jomo', and several functions for visualization, data management and the analysis of multiply imputed data sets.

Maintained by Simon Grund. Last updated 1 years ago.

imputation missing-data mixed-effects multilevel-data multilevel-models

44.7 match 29 stars 12.36 score 246 scripts 153 dependents

bioc

impute:impute: Imputation for microarray data

Imputation for microarray data (currently KNN only)

Maintained by Balasubramanian Narasimhan. Last updated 5 months ago.

microarray fortran

60.3 match 9.04 score 952 scripts 131 dependents

tirgit

missCompare:Intuitive Missing Data Imputation Framework

Offers a convenient pipeline to test and compare various missing data imputation algorithms on simulated and real data. These include simpler methods, such as mean and median imputation and random replacement, but also include more sophisticated algorithms already implemented in popular R packages, such as 'mi', described by Su et al. (2011) <doi:10.18637/jss.v045.i02>; 'mice', described by van Buuren and Groothuis-Oudshoorn (2011) <doi:10.18637/jss.v045.i03>; 'missForest', described by Stekhoven and Buhlmann (2012) <doi:10.1093/bioinformatics/btr597>; 'missMDA', described by Josse and Husson (2016) <doi:10.18637/jss.v070.i01>; and 'pcaMethods', described by Stacklies et al. (2007) <doi:10.1093/bioinformatics/btm069>. The central assumption behind 'missCompare' is that structurally different datasets (e.g. larger datasets with a large number of correlated variables vs. smaller datasets with non correlated variables) will benefit differently from different missing data imputation algorithms. 'missCompare' takes measurements of your dataset and sets up a sandbox to try a curated list of standard and sophisticated missing data imputation algorithms and compares them assuming custom missingness patterns. 'missCompare' will also impute your real-life dataset for you after the selection of the best performing algorithm in the simulations. The package also provides various post-imputation diagnostics and visualizations to help you assess imputation performance.

Maintained by Tibor V. Varga. Last updated 4 years ago.

comparison comparison-benchmarks imputation imputation-algorithm imputation-methods imputations kolmogorov-smirnov missing missing-data missing-data-imputation missing-status-check missing-values missingness post-imputation-diagnostics rmse

90.5 match 39 stars 5.89 score 40 scripts

steffenmoritz

imputeTS:Time Series Missing Value Imputation

Imputation (replacement) of missing values in univariate time series. Offers several imputation functions and missing data plots. Available imputation algorithms include: 'Mean', 'LOCF', 'Interpolation', 'Moving Average', 'Seasonal Decomposition', 'Kalman Smoothing on Structural Time Series models', 'Kalman Smoothing on ARIMA models'. Published in Moritz and Bartz-Beielstein (2017) <doi:10.32614/RJ-2017-009>.

Maintained by Steffen Moritz. Last updated 3 years ago.

data-visualization imputation imputation-algorithm imputets missing-data time-series cpp

39.9 match 162 stars 12.18 score 1.9k scripts 27 dependents

njtierney

naniar:Data Structures, Summaries, and Visualisations for Missing Data

Missing values are ubiquitous in data and need to be explored and handled in the initial stages of analysis. 'naniar' provides data structures and functions that facilitate the plotting of missing values and examination of imputations. This allows missing data dependencies to be explored with minimal deviation from the common work patterns of 'ggplot2' and tidy data. The work is fully discussed at Tierney & Cook (2023) <doi:10.18637/jss.v105.i07>.

Maintained by Nicholas Tierney. Last updated 2 days ago.

data-visualisation ggplot2 missing-data missingness tidy-data

27.6 match 657 stars 15.63 score 5.1k scripts 9 dependents

steffenmoritz

imputeR:A General Multivariate Imputation Framework

Multivariate Expectation-Maximization (EM) based imputation framework that offers several different algorithms. These include regularisation methods like Lasso and Ridge regression, tree-based models and dimensionality reduction methods like PCA and PLS.

Maintained by Steffen Moritz. Last updated 4 years ago.

missing-data

85.3 match 16 stars 4.94 score 54 scripts

mwheymans

psfmi:Prediction Model Pooling, Selection and Performance Evaluation Across Multiply Imputed Datasets

Pooling, backward and forward selection of linear, logistic and Cox regression models in multiply imputed datasets. Backward and forward selection can be done from the pooled model using Rubin's Rules (RR), the D1, D2, D3, D4 and the median p-values method. This is also possible for Mixed models. The models can contain continuous, dichotomous, categorical and restricted cubic spline predictors and interaction terms between all these type of predictors. The stability of the models can be evaluated using (cluster) bootstrapping. The package further contains functions to pool model performance measures as ROC/AUC, Reclassification, R-squared, scaled Brier score, H&L test and calibration plots for logistic regression models. Internal validation can be done across multiply imputed datasets with cross-validation or bootstrapping. The adjusted intercept after shrinkage of pooled regression coefficients can be obtained. Backward and forward selection as part of internal validation is possible. A function to externally validate logistic prediction models in multiple imputed datasets is available and a function to compare models. For Cox models a strata variable can be included. Eekhout (2017) <doi:10.1186/s12874-017-0404-7>. Wiel (2009) <doi:10.1093/biostatistics/kxp011>. Marshall (2009) <doi:10.1186/1471-2288-9-57>.

Maintained by Martijn Heymans. Last updated 2 years ago.

cox-regression imputation imputed-datasets logistic multiple-imputation pool predictor regression selection spline spline-predictors

53.5 match 10 stars 7.17 score 70 scripts

insightsengineering

rbmi:Reference Based Multiple Imputation

Implements standard and reference based multiple imputation methods for continuous longitudinal endpoints (Gower-Page et al. (2022) <doi:10.21105/joss.04251>). In particular, this package supports deterministic conditional mean imputation and jackknifing as described in Wolbers et al. (2022) <doi:10.1002/pst.2234>, Bayesian multiple imputation as described in Carpenter et al. (2013) <doi:10.1080/10543406.2013.834911>, and bootstrapped maximum likelihood imputation as described in von Hippel and Bartlett (2021) <doi: 10.1214/20-STS793>.

Maintained by Isaac Gravestock. Last updated 22 days ago.

42.3 match 18 stars 8.78 score 33 scripts 1 dependents

bioc

snpStats:SnpMatrix and XSnpMatrix classes and methods

Classes and statistical methods for large SNP association studies. This extends the earlier snpMatrix package, allowing for uncertainty in genotypes.

Maintained by David Clayton. Last updated 5 months ago.

microarray snp geneticvariability zlib

35.8 match 9.41 score 674 scripts 17 dependents

jeffreyevans

yaImpute:Nearest Neighbor Observation Imputation and Evaluation Tools

Performs nearest neighbor-based imputation using one or more alternative approaches to processing multivariate data. These include methods based on canonical correlation: analysis, canonical correspondence analysis, and a multivariate adaptation of the random forest classification and regression techniques of Leo Breiman and Adele Cutler. Additional methods are also offered. The package includes functions for comparing the results from running alternative techniques, detecting imputation targets that are notably distant from reference observations, detecting and correcting for bias, bootstrapping and building ensemble imputations, and mapping results.

Maintained by Jeffrey S. Evans. Last updated 6 months ago.

imputation cpp

42.5 match 3 stars 7.40 score 94 scripts 12 dependents

polkas

miceFast:Fast Imputations Using 'Rcpp' and 'Armadillo'

Fast imputations under the object-oriented programming paradigm. Moreover there are offered a few functions built to work with popular R packages such as 'data.table' or 'dplyr'. The biggest improvement in time performance could be achieve for a calculation where a grouping variable have to be used. A single evaluation of a quantitative model for the multiple imputations is another major enhancement. A new major improvement is one of the fastest predictive mean matching in the R world because of presorting and binary search.

Maintained by Maciej Nasinski. Last updated 1 months ago.

cpp fast fast-imputations grouping imputation imputations matrix mro multiple-imputation rcpp rcpparmadillo vif weighting openblas cpp openmp

49.1 match 20 stars 5.94 score 29 scripts

matthewblackwell

Amelia:A Program for Missing Data

A tool that "multiply imputes" missing data in a single cross-section (such as a survey), from a time series (like variables collected for each year in a country), or from a time-series-cross-sectional data set (such as collected by years for each of several countries). Amelia II implements our bootstrapping-based algorithm that gives essentially the same answers as the standard IP or EMis approaches, is usually considerably faster than existing approaches and can handle many more variables. Unlike Amelia I and other statistically rigorous imputation software, it virtually never crashes (but please let us know if you find to the contrary!). The program also generalizes existing approaches by allowing for trends in time series across observations within a cross-sectional unit, as well as priors that allow experts to incorporate beliefs they have about the values of missing cells in their data. Amelia II also includes useful diagnostics of the fit of multiple imputation models. The program works from the R command line or via a graphical user interface that does not require users to know R.

Maintained by Matthew Blackwell. Last updated 4 months ago.

openblas cpp

31.7 match 1 stars 9.06 score 1.4k scripts 7 dependents

harrelfe

Hmisc:Harrell Miscellaneous

Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, simulation, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX and html code, recoding variables, caching, simplified parallel computing, encrypting and decrypting data using a safe workflow, general moving window statistical estimation, and assistance in interpreting principal component analysis.

Maintained by Frank E Harrell Jr. Last updated 2 days ago.

fortran

16.1 match 210 stars 17.61 score 17k scripts 750 dependents

nerler

JointAI:Joint Analysis and Imputation of Incomplete Data

Joint analysis and imputation of incomplete data in the Bayesian framework, using (generalized) linear (mixed) models and extensions there of, survival models, or joint models for longitudinal and survival data, as described in Erler, Rizopoulos and Lesaffre (2021) <doi:10.18637/jss.v100.i20>. Incomplete covariates, if present, are automatically imputed. The package performs some preprocessing of the data and creates a 'JAGS' model, which will then automatically be passed to 'JAGS' <https://mcmc-jags.sourceforge.io/> with the help of the package 'rjags'.

Maintained by Nicole S. Erler. Last updated 12 months ago.

bayesian generalized-linear-models glm glmm imputation imputations jags joint-analysis linear-mixed-models linear-regression-models mcmc-sample mcmc-sampling missing-data missing-values survival cpp

37.9 match 28 stars 7.30 score 59 scripts 1 dependents

tidymodels

recipes:Preprocessing and Feature Engineering Steps for Modeling

A recipe prepares your data for modeling. We provide an extensible framework for pipeable sequences of feature engineering steps provides preprocessing tools to be applied to data. Statistical parameters for the steps can be estimated from an initial data set and then applied to other data sets. The resulting processed output can then be used as inputs for statistical or machine learning models.

Maintained by Max Kuhn. Last updated 4 days ago.

14.3 match 584 stars 18.71 score 7.2k scripts 380 dependents

bluefoxr

COINr:Composite Indicator Construction and Analysis

A comprehensive high-level package, for composite indicator construction and analysis. It is a "development environment" for composite indicators and scoreboards, which includes utilities for construction (indicator selection, denomination, imputation, data treatment, normalisation, weighting and aggregation) and analysis (multivariate analysis, correlation plotting, short cuts for principal component analysis, global sensitivity analysis, and more). A composite indicator is completely encapsulated inside a single hierarchical list called a "coin". This allows a fast and efficient work flow, as well as making quick copies, testing methodological variations and making comparisons. It also includes many plotting options, both statistical (scatter plots, distribution plots) as well as for presenting results.

Maintained by William Becker. Last updated 2 months ago.

28.8 match 26 stars 9.07 score 73 scripts 1 dependents

mayer79

missRanger:Fast Imputation of Missing Values

Alternative implementation of the beautiful 'MissForest' algorithm used to impute mixed-type data sets by chaining random forests, introduced by Stekhoven, D.J. and Buehlmann, P. (2012) <doi:10.1093/bioinformatics/btr597>. Under the hood, it uses the lightning fast random forest package 'ranger'. Between the iterative model fitting, we offer the option of using predictive mean matching. This firstly avoids imputation with values not already present in the original data (like a value 0.3334 in 0-1 coded variable). Secondly, predictive mean matching tries to raise the variance in the resulting conditional distributions to a realistic level. This would allow, e.g., to do multiple imputation when repeating the call to missRanger(). Out-of-sample application is supported as well.

Maintained by Michael Mayer. Last updated 3 months ago.

imputation machine-learning missing-values random-forest

23.0 match 69 stars 11.07 score 208 scripts 6 dependents

shangzhi-hong

RfEmpImp:Multiple Imputation using Chained Random Forests

An R package for multiple imputation using chained random forests. Implemented methods can handle missing data in mixed types of variables by using prediction-based or node-based conditional distributions constructed using random forests. For prediction-based imputation, the method based on the empirical distribution of out-of-bag prediction errors of random forests and the method based on normality assumption for prediction errors of random forests are provided for imputing continuous variables. And the method based on predicted probabilities is provided for imputing categorical variables. For node-based imputation, the method based on the conditional distribution formed by the predicting nodes of random forests, and the method based on proximity measures of random forests are provided. More details of the statistical methods can be found in Hong et al. (2020) <arXiv:2004.14823>.

Maintained by Shangzhi Hong. Last updated 2 years ago.

imputation missing-data random-forest

51.2 match 5 stars 4.40 score 8 scripts

bioc

DAPAR:Tools for the Differential Analysis of Proteins Abundance with R

The package DAPAR is a Bioconductor distributed R package which provides all the necessary functions to analyze quantitative data from label-free proteomics experiments. Contrarily to most other similar R packages, it is endowed with rich and user-friendly graphical interfaces, so that no programming skill is required (see `Prostar` package).

Maintained by Samuel Wieczorek. Last updated 5 months ago.

proteomics normalization preprocessing massspectrometry qualitycontrol go dataimport prostar1

41.3 match 2 stars 5.42 score 22 scripts 1 dependents

mlr-org

mlr3pipelines:Preprocessing Operators and Pipelines for 'mlr3'

Dataflow programming toolkit that enriches 'mlr3' with a diverse set of pipelining operators ('PipeOps') that can be composed into graphs. Operations exist for data preprocessing, model fitting, and ensemble learning. Graphs can themselves be treated as 'mlr3' 'Learners' and can therefore be resampled, benchmarked, and tuned.

Maintained by Martin Binder. Last updated 7 days ago.

bagging data-science dataflow-programming ensemble-learning machine-learning mlr3 pipelines preprocessing stacking

17.8 match 141 stars 12.36 score 448 scripts 7 dependents

billdenney

PKNCA:Perform Pharmacokinetic Non-Compartmental Analysis

Compute standard Non-Compartmental Analysis (NCA) parameters for typical pharmacokinetic analyses and summarize them.

Maintained by Bill Denney. Last updated 15 days ago.

nca noncompartmental-analysis pharmacokinetics

17.2 match 73 stars 12.61 score 214 scripts 4 dependents

farrellday

miceRanger:Multiple Imputation by Chained Equations with Random Forests

Multiple Imputation has been shown to be a flexible method to impute missing values by Van Buuren (2007) <doi:10.1177/0962280206074463>. Expanding on this, random forests have been shown to be an accurate model by Stekhoven and Buhlmann <arXiv:1105.0828> to impute missing values in datasets. They have the added benefits of returning out of bag error and variable importance estimates, as well as being simple to run in parallel.

Maintained by Sam Wilson. Last updated 3 years ago.

imputation-methods machine-learning mice missing-data missing-values random-forests

28.3 match 67 stars 7.09 score 41 scripts 1 dependents

jwb133

smcfcs:Multiple Imputation of Covariates by Substantive Model Compatible Fully Conditional Specification

Implements multiple imputation of missing covariates by Substantive Model Compatible Fully Conditional Specification. This is a modification of the popular FCS/chained equations multiple imputation approach, and allows imputation of missing covariate values from models which are compatible with the user specified substantive model.

Maintained by Jonathan Bartlett. Last updated 15 hours ago.

21.5 match 11 stars 9.00 score 59 scripts 1 dependents

bioc

HIBAG:HLA Genotype Imputation with Attribute Bagging

Imputes HLA classical alleles using GWAS SNP data, and it relies on a training set of HLA and SNP genotypes. HIBAG can be used by researchers with published parameter estimates instead of requiring access to large training sample datasets. It combines the concepts of attribute bagging, an ensemble classifier method, with haplotype inference for SNPs and HLA types. Attribute bagging is a technique which improves the accuracy and stability of classifier ensembles using bootstrap aggregating and random variable selection.

Maintained by Xiuwen Zheng. Last updated 4 months ago.

genetics statisticalmethod bioinformatics gpu hla imputation mhc snp cpp

23.3 match 30 stars 8.24 score 48 scripts

markvanderloo

simputation:Simple Imputation

Easy to use interfaces to a number of imputation methods that fit in the not-a-pipe operator of the 'magrittr' package.

Maintained by Mark van der Loo. Last updated 8 months ago.

data-science imputation officialstatistics

22.3 match 92 stars 8.38 score 350 scripts

samwieczorek

imputeLCMD:A Collection of Methods for Left-Censored Missing Data Imputation

A collection of functions for left-censored missing data imputation. Left-censoring is a special case of missing not at random (MNAR) mechanism that generates non-responses in proteomics experiments. The package also contains functions to artificially generate peptide/protein expression data (log-transformed) as random draws from a multivariate Gaussian distribution as well as a function to generate missing data (both randomly and non-randomly). For comparison reasons, the package also contains several wrapper functions for the imputation of non-responses that are missing at random. * New functionality has been added: a hybrid method that allows the imputation of missing values in a more complex scenario where the missing data are both MAR and MNAR.

Maintained by Samuel Wieczorek. Last updated 3 years ago.

40.9 match 2 stars 4.55 score 93 scripts 5 dependents

choonghyunryu

dlookr:Tools for Data Diagnosis, Exploration, Transformation

A collection of tools that support data diagnosis, exploration, and transformation. Data diagnostics provides information and visualization of missing values, outliers, and unique and negative values to help you understand the distribution and quality of your data. Data exploration provides information and visualization of the descriptive statistics of univariate variables, normality tests and outliers, correlation of two variables, and the relationship between the target variable and predictor. Data transformation supports binning for categorizing continuous variables, imputes missing values and outliers, and resolves skewness. And it creates automated reports that support these three tasks.

Maintained by Choonghyun Ryu. Last updated 9 months ago.

16.0 match 212 stars 11.05 score 748 scripts 2 dependents

torockel

missMethods:Methods for Missing Data

Supply functions for the creation and handling of missing data as well as tools to evaluate missing data methods. Nearly all possibilities of generating missing data discussed by Santos et al. (2019) <doi:10.1109/ACCESS.2019.2891360> and some additional are implemented. Functions are supplied to compare parameter estimates and imputed values to true values to evaluate missing data methods. Evaluations of these types are done, for example, by Cetin-Berber et al. (2019) <doi:10.1177/0013164418805532> and Kim et al. (2005) <doi:10.1093/bioinformatics/bth499>.

Maintained by Tobias Rockel. Last updated 2 years ago.

24.6 match 7 stars 6.98 score 113 scripts 4 dependents

albertofranzin

bnstruct:Bayesian Network Structure Learning from Data with Missing Values

Bayesian Network Structure Learning from Data with Missing Values. The package implements the Silander-Myllymaki complete search, the Max-Min Parents-and-Children, the Hill-Climbing, the Max-Min Hill-climbing heuristic searches, and the Structural Expectation-Maximization algorithm. Available scoring functions are BDeu, AIC, BIC. The package also implements methods for generating and using bootstrap samples, imputed data, inference.

Maintained by Alberto Franzin. Last updated 1 years ago.

31.0 match 1 stars 5.40 score 111 scripts 3 dependents

agnesdeng

mixgb:Multiple Imputation Through 'XGBoost'

Multiple imputation using 'XGBoost', subsampling, and predictive mean matching as described in Deng and Lumley (2023) <doi:10.1080/10618600.2023.2252501>. The package supports various types of variables, offers flexible settings, and enables saving an imputation model to impute new data. Data processing and memory usage have been optimised to speed up the imputation process.

Maintained by Yongshi Deng. Last updated 2 months ago.

cpp openmp

25.1 match 23 stars 6.58 score 82 scripts

inbo

multimput:Using Multiple Imputation to Address Missing Data

Accompanying package for the paper: Working with population totals in the presence of missing data comparing imputation methods in terms of bias and precision. Published in 2017 in the Journal of Ornithology volume 158 page 603–615 (<doi:10.1007/s10336-016-1404-9>).

Maintained by Thierry Onkelinx. Last updated 16 days ago.

imputation imputation-model

45.4 match 1 stars 3.62 score 14 scripts 1 dependents

midasverse

rMIDAS:Multiple Imputation with Denoising Autoencoders

A tool for multiply imputing missing data using 'MIDAS', a deep learning method based on denoising autoencoder neural networks. This algorithm offers significant accuracy and efficiency advantages over other multiple imputation strategies, particularly when applied to large datasets with complex features. Alongside interfacing with 'Python' to run the core algorithm, this package contains functions for processing data before and after model training, running imputation model diagnostics, generating multiple completed datasets, and estimating regression models on these datasets.

Maintained by Thomas Robinson. Last updated 1 years ago.

deep-learning imputation-methods neural-network reticulate tensorflow

24.7 match 34 stars 6.53 score 33 scripts

dsalfran

ImputeRobust:Robust Multiple Imputation with Generalized Additive Models for Location Scale and Shape

Provides new imputation methods for the 'mice' package based on generalized additive models for location, scale, and shape (GAMLSS) as described in de Jong, van Buuren and Spiess <doi:10.1080/03610918.2014.911894>.

Maintained by Daniel Salfran. Last updated 6 years ago.

imputation missing-data multiple-imputation

44.2 match 9 stars 3.65 score 4 scripts

farhadpishgar

MatchThem:Matching and Weighting Multiply Imputed Datasets

Provides essential tools for the pre-processing techniques of matching and weighting multiply imputed datasets. The package includes functions for matching within and across multiply imputed datasets using various methods, estimating weights for units in the imputed datasets using multiple weighting methods, calculating causal effect estimates in each matched or weighted dataset using parametric or non-parametric statistical models, and pooling the resulting estimates according to Rubin's rules (please see <https://journal.r-project.org/archive/2021/RJ-2021-073/> for more details).

Maintained by Farhad Pishgar. Last updated 5 months ago.

21.8 match 16 stars 7.34 score 112 scripts

welch-lab

cytosignal:What the Package Does (One Line, Title Case)

What the package does (one paragraph).

Maintained by Jialin Liu. Last updated 5 days ago.

openblas cpp

26.7 match 16 stars 5.95 score 6 scripts

haghish

mlim:Single and Multiple Imputation with Automated Machine Learning

Machine learning algorithms have been used for performing single missing data imputation and most recently, multiple imputations. However, this is the first attempt for using automated machine learning algorithms for performing both single and multiple imputation. Automated machine learning is a procedure for fine-tuning the model automatic, performing a random search for a model that results in less error, without overfitting the data. The main idea is to allow the model to set its own parameters for imputing each variable separately instead of setting fixed predefined parameters to impute all variables of the dataset. Using automated machine learning, the package fine-tunes an Elastic Net (default) or Gradient Boosting, Random Forest, Deep Learning, Extreme Gradient Boosting, or Stacked Ensemble machine learning model (from one or a combination of other supported algorithms) for imputing the missing observations. This procedure has been implemented for the first time by this package and is expected to outperform other packages for imputing missing data that do not fine-tune their models. The multiple imputation is implemented via bootstrapping without letting the duplicated observations to harm the cross-validation procedure, which is the way imputed variables are evaluated. Most notably, the package implements automated procedure for handling imputing imbalanced data (class rarity problem), which happens when a factor variable has a level that is far more prevalent than the other(s). This is known to result in biased predictions, hence, biased imputation of missing data. However, the autobalancing procedure ensures that instead of focusing on maximizing accuracy (classification error) in imputing factor variables, a fairer procedure and imputation method is practiced.

Maintained by E. F. Haghish. Last updated 8 months ago.

automatic-machine-learning automl classimbalance data-science elastic-net extreme-gradient-boosting gbm glm gradient-boosting gradient-boosting-machine imputation imputation-algorithm imputation-methods machine-learning missing-data multipleimputation stack-ensemble

35.0 match 31 stars 4.49 score 7 scripts

david-cortes

isotree:Isolation-Based Outlier Detection

Fast and multi-threaded implementation of isolation forest (Liu, Ting, Zhou (2008) <doi:10.1109/ICDM.2008.17>), extended isolation forest (Hariri, Kind, Brunner (2018) <doi:10.48550/arXiv.1811.02141>), SCiForest (Liu, Ting, Zhou (2010) <doi:10.1007/978-3-642-15883-4_18>), fair-cut forest (Cortes (2021) <doi:10.48550/arXiv.2110.13402>), robust random-cut forest (Guha, Mishra, Roy, Schrijvers (2016) <http://proceedings.mlr.press/v48/guha16.html>), and customizable variations of them, for isolation-based outlier detection, clustered outlier detection, distance or similarity approximation (Cortes (2019) <doi:10.48550/arXiv.1910.12362>), isolation kernel calculation (Ting, Zhu, Zhou (2018) <doi:10.1145/3219819.3219990>), and imputation of missing values (Cortes (2019) <doi:10.48550/arXiv.1911.06646>), based on random or guided decision tree splitting, and providing different metrics for scoring anomalies based on isolation depth or density (Cortes (2021) <doi:10.48550/arXiv.2111.11639>). Provides simple heuristics for fitting the model to categorical columns and handling missing data, and offers options for varying between random and guided splits, and for using different splitting criteria.

Maintained by David Cortes. Last updated 13 days ago.

anomaly-detection imputation isolation-forest outlier-detection cpp openmp

14.2 match 203 stars 10.41 score 115 scripts 6 dependents

jinghuazhao

gap:Genetic Analysis Package

As first reported [Zhao, J. H. 2007. "gap: Genetic Analysis Package". J Stat Soft 23(8):1-18. <doi:10.18637/jss.v023.i08>], it is designed as an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, statistics in linkage analysis, and association analysis involving genetic markers including haplotype analysis with or without environmental covariates. Over years, the package has been developed in-between many projects hence also in line with the name (gap).

Maintained by Jing Hua Zhao. Last updated 15 days ago.

genetics imputation lmm fortran

11.9 match 12 stars 11.88 score 448 scripts 16 dependents

jinghuazhao

pan:Multiple Imputation for Multivariate Panel or Clustered Data

It provides functions and examples for maximum likelihood estimation for generalized linear mixed models and Gibbs sampler for multivariate linear mixed models with incomplete data, as described in Schafer JL (1997) "Imputation of missing covariates under a multivariate linear mixed model". Technical report 97-04, Dept. of Statistics, The Pennsylvania State University.

Maintained by Jing hua Zhao. Last updated 2 years ago.

fortran

17.4 match 1 stars 7.86 score 65 scripts 155 dependents

bioc

xcms:LC-MS and GC-MS Data Analysis

Framework for processing and visualization of chromatographically separated and single-spectra mass spectral data. Imports from AIA/ANDI NetCDF, mzXML, mzData and mzML files. Preprocesses data for high-throughput, untargeted analyte profiling.

Maintained by Steffen Neumann. Last updated 1 months ago.

immunooncology massspectrometry metabolomics bioconductor feature-detection mass-spectrometry peak-detection cpp

9.4 match 192 stars 14.32 score 984 scripts 11 dependents

xsswang

remiod:Reference-Based Multiple Imputation for Ordinal/Binary Response

Reference-based multiple imputation of ordinal and binary responses under Bayesian framework, as described in Wang and Liu (2022) <arXiv:2203.02771>. Methods for missing-not-at-random include Jump-to-Reference (J2R), Copy Reference (CR), and Delta Adjustment which can generate tipping point analysis.

Maintained by Tony Wang. Last updated 2 years ago.

bayesian control-based copy-reference delta-adjustment generalized-linear-models glm jags jump-to-reference mcmc missing-at-random missing-data missing-not-at-random multiple-imputation non-ignorable ordinal-regression pattern-mixture-model reference-based statistics cpp

30.2 match 4.30 score 3 scripts

mwheymans

miceafter:Data and Statistical Analyses after Multiple Imputation

Statistical Analyses and Pooling after Multiple Imputation. A large variety of repeated statistical analysis can be performed and finally pooled. Statistical analysis that are available are, among others, Levene's test, Odds and Risk Ratios, One sample proportions, difference between proportions and linear and logistic regression models. Functions can also be used in combination with the Pipe operator. More and more statistical analyses and pooling functions will be added over time. Heymans (2007) <doi:10.1186/1471-2288-7-33>. Eekhout (2017) <doi:10.1186/s12874-017-0404-7>. Wiel (2009) <doi:10.1093/biostatistics/kxp011>. Marshall (2009) <doi:10.1186/1471-2288-9-57>. Sidi (2021) <doi:10.1080/00031305.2021.1898468>. Lott (2018) <doi:10.1080/00031305.2018.1473796>. Grund (2021) <doi:10.31234/osf.io/d459g>.

Maintained by Martijn Heymans. Last updated 2 years ago.

26.6 match 2 stars 4.84 score 23 scripts

tdjorgensen

lavaan.mi:Fit Structural Equation Models to Multiply Imputed Data

The primary purpose of 'lavaan.mi' is to extend the functionality of the R package 'lavaan', which implements structural equation modeling (SEM). When incomplete data have been multiply imputed, the imputed data sets can be analyzed by 'lavaan' using complete-data estimation methods, but results must be pooled across imputations (Rubin, 1987, <doi:10.1002/9780470316696>). The 'lavaan.mi' package automates the pooling of point and standard-error estimates, as well as a variety of test statistics, using a familiar interface that allows users to fit an SEM to multiple imputations as they would to a single data set using the 'lavaan' package.

Maintained by Terrence D. Jorgensen. Last updated 3 days ago.

23.4 match 3 stars 5.45 score

bioc

ADImpute:Adaptive Dropout Imputer (ADImpute)

Single-cell RNA sequencing (scRNA-seq) methods are typically unable to quantify the expression levels of all genes in a cell, creating a need for the computational prediction of missing values (‘dropout imputation’). Most existing dropout imputation methods are limited in the sense that they exclusively use the scRNA-seq dataset at hand and do not exploit external gene-gene relationship information. Here we propose two novel methods: a gene regulatory network-based approach using gene-gene relationships learnt from external data and a baseline approach corresponding to a sample-wide average. ADImpute can implement these novel methods and also combine them with existing imputation methods (currently supported: DrImpute, SAVER). ADImpute can learn the best performing method per gene and combine the results from different methods into an ensemble.

Maintained by Ana Carolina Leote. Last updated 5 months ago.

geneexpression network preprocessing sequencing singlecell transcriptomics

29.6 match 4.30 score 7 scripts

phargarten2

miWQS:Multiple Imputation Using Weighted Quantile Sum Regression

The miWQS package handles the uncertainty due to below the detection limit in a correlated component mixture problem. Researchers want to determine if a set/mixture of continuous and correlated components/chemicals is associated with an outcome and if so, which components are important in that mixture. These components share a common outcome but are interval-censored between zero and low thresholds, or detection limits, that may be different across the components. This package applies the multiple imputation (MI) procedure to the weighted quantile sum regression (WQS) methodology for continuous, binary, or count outcomes (Hargarten & Wheeler (2020) <doi:10.1016/j.envres.2020.109466>). The imputation models are: bootstrapping imputation (Lubin et al (2004) <doi:10.1289/ehp.7199>), univariate Bayesian imputation (Hargarten & Wheeler (2020) <doi:10.1016/j.envres.2020.109466>), and multivariate Bayesian regression imputation.

Maintained by Paul M. Hargarten. Last updated 1 years ago.

26.2 match 2 stars 4.78 score 20 scripts 1 dependents

bioc

DEP:Differential Enrichment analysis of Proteomics data

This package provides an integrated analysis workflow for robust and reproducible analysis of mass spectrometry proteomics data for differential protein expression or differential enrichment. It requires tabular input (e.g. txt files) as generated by quantitative analysis softwares of raw mass spectrometry data, such as MaxQuant or IsobarQuant. Functions are provided for data preparation, filtering, variance normalization and imputation of missing values, as well as statistical testing of differentially enriched / expressed proteins. It also includes tools to check intermediate steps in the workflow, such as normalization and missing values imputation. Finally, visualization tools are provided to explore the results, including heatmap, volcano plot and barplot representations. For scientists with limited experience in R, the package also contains wrapper functions that entail the complete analysis workflow and generate a report. Even easier to use are the interactive Shiny apps that are provided by the package.

Maintained by Arne Smits. Last updated 5 months ago.

immunooncology proteomics massspectrometry differentialexpression datarepresentation

17.6 match 7.10 score 628 scripts

vaudigier

micemd:Multiple Imputation by Chained Equations with Multilevel Data

Addons for the 'mice' package to perform multiple imputation using chained equations with two-level data. Includes imputation methods dedicated to sporadically and systematically missing values. Imputation of continuous, binary or count variables are available. Following the recommendations of Audigier, V. et al (2018) <doi:10.1214/18-STS646>, the choice of the imputation method for each variable can be facilitated by a default choice tuned according to the structure of the incomplete dataset. Allows parallel calculation and overimputation for 'mice'.

Maintained by Vincent Audigier. Last updated 1 years ago.

40.3 match 1 stars 3.08 score 80 scripts 1 dependents

mariechion

mi4p:Multiple Imputation for Proteomics

A framework for multiple imputation for proteomics is proposed by Marie Chion, Christine Carapito and Frederic Bertrand (2021) <doi:10.1371/journal.pcbi.1010420>. It is dedicated to dealing with multiple imputation for proteomics.

Maintained by Frederic Bertrand. Last updated 5 months ago.

23.9 match 6 stars 4.91 score 27 scripts

jwb133

InformativeCensoring:Multiple Imputation for Informative Censoring

Multiple Imputation for Informative Censoring. This package implements two methods. Gamma Imputation described in <DOI:10.1002/sim.6274> and Risk Score Imputation described in <DOI:10.1002/sim.3480>.

Maintained by Jonathan Bartlett. Last updated 2 years ago.

24.2 match 4.78 score 9 scripts 1 dependents

elliecurnow

midoc:A Decision-Making System for Multiple Imputation

A guidance system for analysis with missing data. It incorporates expert, up-to-date methodology to help researchers choose the most appropriate analysis approach when some data are missing. You provide the available data and the assumed causal structure, including the likely causes of missing data. 'midoc' will advise which analysis approaches can be used, and how best to perform them. 'midoc' follows the framework for the treatment and reporting of missing data in observational studies (TARMOS). Lee et al (2021). <doi:10.1016/j.jclinepi.2021.01.008>.

Maintained by Elinor Curnow. Last updated 5 months ago.

missing-data multiple-imputation

21.6 match 6 stars 5.32 score 8 scripts

amices

ggmice:Visualizations for 'mice' with 'ggplot2'

Enhance a 'mice' imputation workflow with visualizations for incomplete and/or imputed data. The plotting functions produce 'ggplot' objects which may be easily manipulated or extended. Use 'ggmice' to inspect missing data, develop imputation models, evaluate algorithmic convergence, or compare observed versus imputed data.

Maintained by Hanne Oberman. Last updated 8 months ago.

ggplot2 mice visualization

15.4 match 32 stars 7.42 score 165 scripts

bioc

pcaMethods:A collection of PCA methods

Provides Bayesian PCA, Probabilistic PCA, Nipals PCA, Inverse Non-Linear PCA and the conventional SVD PCA. A cluster based method for missing value estimation is included for comparison. BPCA, PPCA and NipalsPCA may be used to perform PCA on incomplete data as well as for accurate missing value estimation. A set of methods for printing and plotting the results is also provided. All PCA methods make use of the same data structure (pcaRes) to provide a common interface to the PCA results. Initiated at the Max-Planck Institute for Molecular Plant Physiology, Golm, Germany.

Maintained by Henning Redestig. Last updated 5 months ago.

bayesian cpp

8.7 match 49 stars 13.10 score 538 scripts 73 dependents

tslumley

mitools:Tools for Multiple Imputation of Missing Data

Tools to perform analyses and combine results from multiple-imputation datasets.

Maintained by Thomas Lumley. Last updated 6 years ago.

11.9 match 1 stars 9.50 score 716 scripts 249 dependents

mroman-ibs

FuzzyImputationTest:Imputation Procedures and Quality Tests for Fuzzy Data

Special procedures for the imputation of missing fuzzy numbers are still underdeveloped. The goal of the package is to provide the new d-imputation method (DIMP for short, Romaniuk, M. and Grzegorzewski, P. (2023) "Fuzzy Data Imputation with DIMP and FGAIN" RB/23/2023) and covert some classical ones applied in R packages ('missForest','miceRanger','knn') for use with fuzzy datasets. Additionally, specially tailored benchmarking tests are provided to check and compare these imputation procedures with fuzzy datasets.

Maintained by Maciej Romaniuk. Last updated 2 months ago.

21.9 match 5.06 score 2 scripts

jaredlander

useful:A Collection of Handy, Useful Functions

A set of little functions that have been found useful to do little odds and ends such as plotting the results of K-means clustering, substituting special text characters, viewing parts of a data.frame, constructing formulas from text and building design and response matrices.

Maintained by Jared P. Lander. Last updated 1 years ago.

14.3 match 8 stars 7.75 score 748 scripts 6 dependents

bioc

methyLImp2:Missing value estimation of DNA methylation data

This package allows to estimate missing values in DNA methylation data. methyLImp method is based on linear regression since methylation levels show a high degree of inter-sample correlation. Implementation is parallelised over chromosomes since probes on different chromosomes are usually independent. Mini-batch approach to reduce the runtime in case of large number of samples is available.

Maintained by Anna Plaksienko. Last updated 1 months ago.

dnamethylation microarray software methylationarray regression imputation methylation missing-value-imputation

18.5 match 6 stars 5.62 score 3 scripts

stekhoven

missForest:Nonparametric Missing Value Imputation using Random Forest

The function 'missForest' in this package is used to impute missing values particularly in the case of mixed-type data. It uses a random forest trained on the observed values of a data matrix to predict the missing values. It can be used to impute continuous and/or categorical data including complex interactions and non-linear relations. It yields an out-of-bag (OOB) imputation error estimate without the need of a test set or elaborate cross-validation. It can be run in parallel to save computation time.

Maintained by Daniel J. Stekhoven. Last updated 1 years ago.

9.0 match 92 stars 11.53 score 1.1k scripts 32 dependents

alexwhitworth

imputeMulti:Imputation Methods for Multivariate Multinomial Data

Implements imputation methods using EM and Data Augmentation for multinomial data following the work of Schafer 1997 <ISBN: 978-0-412-04061-0>.

Maintained by Alex Whitworth. Last updated 3 years ago.

imputation imputation-methods multivariate-multinomial-data cpp

24.6 match 5 stars 4.15 score 28 scripts

matthias-da

robCompositions:Compositional Data Analysis

Methods for analysis of compositional data including robust methods (<doi:10.1007/978-3-319-96422-5>), imputation of missing values (<doi:10.1016/j.csda.2009.11.023>), methods to replace rounded zeros (<doi:10.1080/02664763.2017.1410524>, <doi:10.1016/j.chemolab.2016.04.011>, <doi:10.1016/j.csda.2012.02.012>), count zeros (<doi:10.1177/1471082X14535524>), methods to deal with essential zeros (<doi:10.1080/02664763.2016.1182135>), (robust) outlier detection for compositional data, (robust) principal component analysis for compositional data, (robust) factor analysis for compositional data, (robust) discriminant analysis for compositional data (Fisher rule), robust regression with compositional predictors, functional data analysis (<doi:10.1016/j.csda.2015.07.007>) and p-splines (<doi:10.1016/j.csda.2015.07.007>), contingency (<doi:10.1080/03610926.2013.824980>) and compositional tables (<doi:10.1111/sjos.12326>, <doi:10.1111/sjos.12223>, <doi:10.1080/02664763.2013.856871>) and (robust) Anderson-Darling normality tests for compositional data as well as popular log-ratio transformations (addLR, cenLR, isomLR, and their inverse transformations). In addition, visualisation and diagnostic tools are implemented as well as high and low-level plot functions for the ternary diagram.

Maintained by Matthias Templ. Last updated 25 days ago.

cpp

10.8 match 11 stars 9.19 score 226 scripts 2 dependents

ngreifer

cobalt:Covariate Balance Tables and Plots

Generate balance tables and plots for covariates of groups preprocessed through matching, weighting or subclassification, for example, using propensity scores. Includes integration with 'MatchIt', 'WeightIt', 'MatchThem', 'twang', 'Matching', 'optmatch', 'CBPS', 'ebal', 'cem', 'sbw', and 'designmatch' for assessing balance on the output of their preprocessing functions. Users can also specify data for balance assessment not generated through the above packages. Also included are methods for assessing balance in clustered or multiply imputed data sets or data sets with multi-category, continuous, or longitudinal treatments.

Maintained by Noah Greifer. Last updated 11 months ago.

causal-inference propensity-scores

7.6 match 75 stars 12.98 score 1.0k scripts 8 dependents

bgoodri

mi:Missing Data Imputation and Model Checking

The mi package provides functions for data manipulation, imputing missing values in an approximate Bayesian framework, diagnostics of the models used to generate the imputations, confidence-building mechanisms to validate some of the assumptions of the imputation algorithm, and functions to analyze multiply imputed data sets with the appropriate degree of sampling uncertainty.

Maintained by Ben Goodrich. Last updated 3 years ago.

11.8 match 2 stars 8.25 score 244 scripts 47 dependents

rezakj

iCellR:Analyzing High-Throughput Single Cell Sequencing Data

A toolkit that allows scientists to work with data from single cell sequencing technologies such as scRNA-seq, scVDJ-seq, scATAC-seq, CITE-Seq and Spatial Transcriptomics (ST). Single (i) Cell R package ('iCellR') provides unprecedented flexibility at every step of the analysis pipeline, including normalization, clustering, dimensionality reduction, imputation, visualization, and so on. Users can design both unsupervised and supervised models to best suit their research. In addition, the toolkit provides 2D and 3D interactive visualizations, differential expression analysis, filters based on cells, genes and clusters, data merging, normalizing for dropouts, data imputation methods, correcting for batch differences, pathway analysis, tools to find marker genes for clusters and conditions, predict cell types and pseudotime analysis. See Khodadadi-Jamayran, et al (2020) <doi:10.1101/2020.05.05.078550> and Khodadadi-Jamayran, et al (2020) <doi:10.1101/2020.03.31.019109> for more details.

Maintained by Alireza Khodadadi-Jamayran. Last updated 8 months ago.

10xgenomics 3d batch-normalization cell-type-classification cite-seq clustering clustering-algorithm diffusion-maps dropout icellr imputation intractive-graph normalization pseudotime scrna-seq scvdj-seq singel-cell-sequencing umap cpp

17.3 match 121 stars 5.56 score 7 scripts 1 dependents

bips-hb

micd:Multiple Imputation in Causal Graph Discovery

Modified functions of the package 'pcalg' and some additional functions to run the PC and the FCI (Fast Causal Inference) algorithm for constraint-based causal discovery in incomplete and multiply imputed datasets. Foraita R, Friemel J, Günther K, Behrens T, Bullerdiek J, Nimzyk R, Ahrens W, Didelez V (2020) <doi:10.1111/rssa.12565>; Andrews RM, Foraita R, Didelez V, Witte J (2021) <arXiv:2108.13395>; Witte J, Foraita R, Didelez V (2022) <doi:10.1002/sim.9535>.

Maintained by Ronja Foraita. Last updated 2 years ago.

causal-discovery graphical-models multiple-imputation

24.2 match 5 stars 3.70 score 20 scripts

rqtl

qtl2:Quantitative Trait Locus Mapping in Experimental Crosses

Provides a set of tools to perform quantitative trait locus (QTL) analysis in experimental crosses. It is a reimplementation of the 'R/qtl' package to better handle high-dimensional data and complex cross designs. Broman et al. (2019) <doi:10.1534/genetics.118.301595>.

Maintained by Karl W Broman. Last updated 7 days ago.

cpp

9.2 match 34 stars 9.48 score 1.1k scripts 5 dependents

jacky11

imp4p:Imputation for Proteomics

Functions to analyse missing value mechanisms and to impute data sets in the context of bottom-up MS-based proteomics.

Maintained by Quentin Giai Gianetto. Last updated 4 years ago.

cpp

43.4 match 1 stars 2.00 score 33 scripts 1 dependents

bioc

MSnbase:Base Functions and Classes for Mass Spectrometry and Proteomics

MSnbase provides infrastructure for manipulation, processing and visualisation of mass spectrometry and proteomics data, ranging from raw to quantitative and annotated data.

Maintained by Laurent Gatto. Last updated 17 hours ago.

immunooncology infrastructure proteomics massspectrometry qualitycontrol dataimport bioconductor bioinformatics mass-spectrometry proteomics-data visualisation cpp

6.8 match 130 stars 12.81 score 772 scripts 36 dependents

privefl

bigsnpr:Analysis of Massive SNP Arrays

Easy-to-use, efficient, flexible and scalable tools for analyzing massive SNP arrays. Privé et al. (2018) <doi:10.1093/bioinformatics/bty185>.

Maintained by Florian Privé. Last updated 9 days ago.

big-data bioinformatics memory-mapped-file parallel-computing polygenic-scores population-structure-inference snp-data statistical-methods openblas zlib cpp openmp

7.5 match 200 stars 11.44 score 1.5k scripts 3 dependents

sibipx

missForestPredict:Missing Value Imputation using Random Forest for Prediction Settings

Missing data imputation based on the 'missForest' algorithm (Stekhoven, Daniel J (2012) <doi:10.1093/bioinformatics/btr597>) with adaptations for prediction settings. The function missForest() is used to impute a (training) dataset with missing values and to learn imputation models that can be later used for imputing new observations. The function missForestPredict() is used to impute one or multiple new observations (test set) using the models learned on the training data.

Maintained by Elena Albu. Last updated 1 years ago.

21.3 match 4.00 score 3 scripts

bioc

MAI:Mechanism-Aware Imputation

A two-step approach to imputing missing data in metabolomics. Step 1 uses a random forest classifier to classify missing values as either Missing Completely at Random/Missing At Random (MCAR/MAR) or Missing Not At Random (MNAR). MCAR/MAR are combined because it is often difficult to distinguish these two missing types in metabolomics data. Step 2 imputes the missing values based on the classified missing mechanisms, using the appropriate imputation algorithms. Imputation algorithms tested and available for MCAR/MAR include Bayesian Principal Component Analysis (BPCA), Multiple Imputation No-Skip K-Nearest Neighbors (Multi_nsKNN), and Random Forest. Imputation algorithms tested and available for MNAR include nsKNN and a single imputation approach for imputation of metabolites where left-censoring is present.

Maintained by Jonathan Dekermanjian. Last updated 5 months ago.

software metabolomics statisticalmethod classification imputation-methods machine-learning missing-data

16.9 match 2 stars 5.00 score 6 scripts

randel

MixRF:A Random-Forest-Based Approach for Imputing Clustered Incomplete Data

It offers random-forest-based functions to impute clustered incomplete data. The package is tailored for but not limited to imputing multitissue expression data, in which a gene's expression is measured on the collected tissues of an individual but missing on the uncollected tissues.

Maintained by Jiebiao Wang. Last updated 8 years ago.

gene-expression imputation mixed-models random-forest

19.2 match 35 stars 4.39 score 14 scripts

bioc

rexposome:Exposome exploration and outcome data analysis

Package that allows to explore the exposome and to perform association analyses between exposures and health outcomes.

Maintained by Xavier Escribà Montagut. Last updated 5 months ago.

software biologicalquestion infrastructure dataimport datarepresentation biomedicalinformatics experimentaldesign multiplecomparison classification clustering

14.6 match 5.70 score 28 scripts 1 dependents

bioc

QFeatures:Quantitative features for mass spectrometry data

The QFeatures infrastructure enables the management and processing of quantitative features for high-throughput mass spectrometry assays. It provides a familiar Bioconductor user experience to manages quantitative data across different assay levels (such as peptide spectrum matches, peptides and proteins) in a coherent and tractable format.

Maintained by Laurent Gatto. Last updated 11 days ago.

infrastructure massspectrometry proteomics metabolomics bioconductor mass-spectrometry

6.9 match 27 stars 11.87 score 278 scripts 49 dependents

bioc

Melissa:Bayesian clustering and imputationa of single cell methylomes

Melissa is a Baysian probabilistic model for jointly clustering and imputing single cell methylomes. This is done by taking into account local correlations via a Generalised Linear Model approach and global similarities using a mixture modelling approach.

Maintained by C. A. Kapourani. Last updated 5 months ago.

immunooncology dnamethylation geneexpression generegulation epigenetics genetics clustering featureextraction regression rnaseq bayesian kegg sequencing coverage singlecell

16.1 match 4.90 score 7 scripts

zjg540066169

SBMTrees:Sequential Imputation with Bayesian Trees Mixed-Effects Models for Longitudinal Data

Implements a sequential imputation framework using Bayesian Mixed-Effects Trees ('SBMTrees') for handling missing data in longitudinal studies. The package supports a variety of models, including non-linear relationships and non-normal random effects and residuals, leveraging Dirichlet Process priors for increased flexibility. Key features include handling Missing at Random (MAR) longitudinal data, imputation of both covariates and outcomes, and generating posterior predictive samples for further analysis. The methodology is designed for applications in epidemiology, biostatistics, and other fields requiring robust handling of missing data in longitudinal settings.

Maintained by Jungang Zou. Last updated 3 months ago.

bayesian-machine-learning longitudinal-data missing-data-imputation openblas cpp

17.1 match 1 stars 4.40 score 10 scripts

alexpkeil1

qgcomp:Quantile G-Computation

G-computation for a set of time-fixed exposures with quantile-based basis functions, possibly under linearity and homogeneity assumptions. This approach estimates a regression line corresponding to the expected change in the outcome (on the link basis) given a simultaneous increase in the quantile-based category for all exposures. Works with continuous, binary, and right-censored time-to-event outcomes. Reference: Alexander P. Keil, Jessie P. Buckley, Katie M. OBrien, Kelly K. Ferguson, Shanshan Zhao, and Alexandra J. White (2019) A quantile-based g-computation approach to addressing the effects of exposure mixtures; <doi:10.1289/EHP5838>.

Maintained by Alexander Keil. Last updated 3 days ago.

exposure exposure-mixture exposure-mixtures quantile-gcomputation survival

8.5 match 37 stars 8.73 score 70 scripts 2 dependents

bioc

MOFA2:Multi-Omics Factor Analysis v2

The MOFA2 package contains a collection of tools for training and analysing multi-omic factor analysis (MOFA). MOFA is a probabilistic factor model that aims to identify principal axes of variation from data sets that can comprise multiple omic layers and/or groups of samples. Additional time or space information on the samples can be incorporated using the MEFISTO framework, which is part of MOFA2. Downstream analysis functions to inspect molecular features underlying each factor, vizualisation, imputation etc are available.

Maintained by Ricard Argelaguet. Last updated 5 months ago.

dimensionreduction bayesian visualization factor-analysis mofa multi-omics

7.3 match 319 stars 10.02 score 502 scripts

bioc

gwasurvivr:gwasurvivr: an R package for genome wide survival analysis

gwasurvivr is a package to perform survival analysis using Cox proportional hazard models on imputed genetic data.

Maintained by Abbas Rizvi. Last updated 5 months ago.

genomewideassociation survival regression genetics snp geneticvariability pharmacogenomics biomedicalinformatics

11.3 match 12 stars 6.43 score 75 scripts

bioc

Pirat:Precursor or Peptide Imputation under Random Truncation

Pirat enables the imputation of missing values (either MNARs or MCARs) in bottom-up LC-MS/MS proteomics data using a penalized maximum likelihood strategy. It does not require any parameter tuning, it models the instrument censorship from the data available. It accounts for sibling peptides correlations and it can leverage complementary transcriptomics measurements.

Maintained by Samuel Wieczorek. Last updated 5 months ago.

proteomics massspectrometry preprocessing software

14.9 match 4.81 score 9 scripts

boennecd

mdgc:Missing Data Imputation Using Gaussian Copulas

Provides functions to impute missing values using Gaussian copulas for mixed data types as described by Christoffersen et al. (2021) <arXiv:2102.02642>. The method is related to Hoff (2007) <doi:10.1214/07-AOAS107> and Zhao and Udell (2019) <arXiv:1910.12845> but differs by making a direct approximation of the log marginal likelihood using an extended version of the Fortran code created by Genz and Bretz (2002) <doi:10.1198/106186002394> in addition to also support multinomial variables.

Maintained by Benjamin Christoffersen. Last updated 2 years ago.

binary gaussian-copula imputation multinomial-variables ordinal semi-parametric fortran openblas cpp openmp

19.0 match 10 stars 3.78 score 12 scripts

jwb133

dejaVu:Multiple Imputation for Recurrent Events

Performs reference based multiple imputation of recurrent event data based on a negative binomial regression model, as described by Keene et al (2014) <doi:10.1002/pst.1624>.

Maintained by Jonathan Bartlett. Last updated 8 months ago.

15.0 match 4.68 score 24 scripts

marberts

piar:Price Index Aggregation

Most price indexes are made with a two-step procedure, where period-over-period elemental indexes are first calculated for a collection of elemental aggregates at each point in time, and then aggregated according to a price index aggregation structure. These indexes can then be chained together to form a time series that gives the evolution of prices with respect to a fixed base period. This package contains a collection of functions that revolve around this work flow, making it easy to build standard price indexes, and implement the methods described by Balk (2008, <doi:10.1017/CBO9780511720758>), von der Lippe (2007, <doi:10.3726/978-3-653-01120-3>), and the CPI manual (2020, <doi:10.5089/9781484354841.069>) for bilateral price indexes.

Maintained by Steve Martin. Last updated 13 days ago.

economics inflation official-statistics statistics

9.6 match 4 stars 7.32 score 25 scripts

wadpac

GGIR:Raw Accelerometer Data Analysis

A tool to process and analyse data collected with wearable raw acceleration sensors as described in Migueles and colleagues (JMPB 2019), and van Hees and colleagues (JApplPhysiol 2014; PLoSONE 2015). The package has been developed and tested for binary data from 'GENEActiv' <https://activinsights.com/>, binary (.gt3x) and .csv-export data from 'Actigraph' <https://theactigraph.com> devices, and binary (.cwa) and .csv-export data from 'Axivity' <https://axivity.com>. These devices are currently widely used in research on human daily physical activity. Further, the package can handle accelerometer data file from any other sensor brand providing that the data is stored in csv format. Also the package allows for external function embedding.

Maintained by Vincent T van Hees. Last updated 15 hours ago.

accelerometer activity-recognition circadian-rhythm movement-sensor sleep

5.3 match 109 stars 13.20 score 342 scripts 3 dependents

japal

zCompositions:Treatment of Zeros, Left-Censored and Missing Values in Compositional Data Sets

Principled methods for the imputation of zeros, left-censored and missing data in compositional data sets (Palarea-Albaladejo and Martin-Fernandez (2015) <doi:10.1016/j.chemolab.2015.02.019>).

Maintained by Javier Palarea-Albaladejo. Last updated 9 months ago.

censored-data compositional-data imputation-methods missing-data nondetection

8.0 match 7 stars 8.55 score 370 scripts 11 dependents

jackdunnnz

iai:Interface to 'Interpretable AI' Modules

An interface to the algorithms of 'Interpretable AI' <https://www.interpretable.ai> from the R programming language. 'Interpretable AI' provides various modules, including 'Optimal Trees' for classification, regression, prescription and survival analysis, 'Optimal Imputation' for missing data imputation and outlier detection, and 'Optimal Feature Selection' for exact sparse regression. The 'iai' package is an open-source project. The 'Interpretable AI' software modules are proprietary products, but free academic and evaluation licenses are available.

Maintained by Jack Dunn. Last updated 5 months ago.

33.9 match 1 stars 2.00 score 7 scripts

marjoleinf

pre:Prediction Rule Ensembles

Derives prediction rule ensembles (PREs). Largely follows the procedure for deriving PREs as described in Friedman & Popescu (2008; <DOI:10.1214/07-AOAS148>), with adjustments and improvements. The main function pre() derives prediction rule ensembles consisting of rules and/or linear terms for continuous, binary, count, multinomial, and multivariate continuous responses. Function gpe() derives generalized prediction ensembles, consisting of rules, hinge and linear functions of the predictor variables.

Maintained by Marjolein Fokkema. Last updated 9 months ago.

7.8 match 58 stars 8.49 score 98 scripts 1 dependents

bioc

aCGH:Classes and functions for Array Comparative Genomic Hybridization data

Functions for reading aCGH data from image analysis output files and clone information files, creation of aCGH S3 objects for storing these data. Basic methods for accessing/replacing, subsetting, printing and plotting aCGH objects.

Maintained by Peter Dimitrov. Last updated 5 months ago.

copynumbervariation dataimport genetics cpp

12.3 match 5.38 score 9 scripts 4 dependents

martinster

modi:Multivariate Outlier Detection and Imputation for Incomplete Survey Data

Algorithms for multivariate outlier detection when missing values occur. Algorithms are based on Mahalanobis distance or data depth. Imputation is based on the multivariate normal model or uses nearest neighbour donors. The algorithms take sample designs, in particular weighting, into account. The methods are described in Bill and Hulliger (2016) <doi:10.17713/ajs.v45i1.86>.

Maintained by Beat Hulliger. Last updated 2 years ago.

10.7 match 4 stars 6.02 score 88 scripts 1 dependents

stamats

MKinfer:Inferential Statistics

Computation of various confidence intervals (Altman et al. (2000), ISBN:978-0-727-91375-3; Hedderich and Sachs (2018), ISBN:978-3-662-56657-2) including bootstrapped versions (Davison and Hinkley (1997), ISBN:978-0-511-80284-3) as well as Hsu (Hedderich and Sachs (2018), ISBN:978-3-662-56657-2), permutation (Janssen (1997), <doi:10.1016/S0167-7152(97)00043-6>), bootstrap (Davison and Hinkley (1997), ISBN:978-0-511-80284-3), intersection-union (Sozu et al. (2015), ISBN:978-3-319-22005-5) and multiple imputation (Barnard and Rubin (1999), <doi:10.1093/biomet/86.4.948>) t-test; furthermore, computation of intersection-union z-test as well as multiple imputation Wilcoxon tests. Graphical visualization by volcano and Bland-Altman plots (Bland and Altman (1986), <doi:10.1016/S0140-6736(86)90837-8>; Shieh (2018), <doi:10.1186/s12874-018-0505-y>).

Maintained by Matthias Kohl. Last updated 11 months ago.

9.7 match 6 stars 6.56 score 71 scripts 4 dependents

jwb133

gFormulaMI:G-Formula for Causal Inference via Multiple Imputation

Implements the G-Formula method for causal inference with time-varying treatments and confounders using Bayesian multiple imputation methods, as described by Bartlett, Olarte Parra and Daniel (2023) <arXiv:2301.12026>. It creates multiple synthetic imputed datasets under treatment regimes of interest using the 'mice' package. These can then be analysed using rules developed for analysing multiple synthetic datasets.

Maintained by Jonathan Bartlett. Last updated 1 years ago.

14.0 match 7 stars 4.54 score 7 scripts

kogalur

randomForestSRC:Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC)

Fast OpenMP parallel computing of Breiman's random forests for univariate, multivariate, unsupervised, survival, competing risks, class imbalanced classification and quantile regression. New Mahalanobis splitting for correlated outcomes. Extreme random forests and randomized splitting. Suite of imputation methods for missing data. Fast random forests using subsampling. Confidence regions and standard errors for variable importance. New improved holdout importance. Case-specific importance. Minimal depth variable importance. Visualize trees on your Safari or Google Chrome browser. Anonymous random forests for data privacy.

Maintained by Udaya B. Kogalur. Last updated 2 months ago.

openmp

8.0 match 10 stars 7.90 score 1.2k scripts 12 dependents

bioc

msImpute:Imputation of label-free mass spectrometry peptides

MsImpute is a package for imputation of peptide intensity in proteomics experiments. It additionally contains tools for MAR/MNAR diagnosis and assessment of distortions to the probability distribution of the data post imputation. The missing values are imputed by low-rank approximation of the underlying data matrix if they are MAR (method = "v2"), by Barycenter approach if missingness is MNAR ("v2-mnar"), or by Peptide Identity Propagation (PIP).

Maintained by Soroor Hediyeh-zadeh. Last updated 5 months ago.

massspectrometry proteomics software label-free-proteomics low-rank-approximation

12.1 match 14 stars 5.15 score 7 scripts

bioc

sesame:SEnsible Step-wise Analysis of DNA MEthylation BeadChips

Tools For analyzing Illumina Infinium DNA methylation arrays. SeSAMe provides utilities to support analyses of multiple generations of Infinium DNA methylation BeadChips, including preprocessing, quality control, visualization and inference. SeSAMe features accurate detection calling, intelligent inference of ethnicity, sex and advanced quality control routines.

Maintained by Wanding Zhou. Last updated 2 months ago.

dnamethylation methylationarray preprocessing qualitycontrol bioinformatics dna-methylation microarray

6.8 match 69 stars 9.08 score 258 scripts 1 dependents

bioc

MAST:Model-based Analysis of Single Cell Transcriptomics

Methods and models for handling zero-inflated single cell assay data.

Maintained by Andrew McDavid. Last updated 5 months ago.

geneexpression differentialexpression genesetenrichment rnaseq transcriptomics singlecell

4.8 match 230 stars 12.75 score 1.8k scripts 5 dependents

ohdsi

PatientLevelPrediction:Develop Clinical Prediction Models Using the Common Data Model

A user friendly way to create patient level prediction models using the Observational Medical Outcomes Partnership Common Data Model. Given a cohort of interest and an outcome of interest, the package can use data in the Common Data Model to build a large set of features. These features can then be used to fit a predictive model with a number of machine learning algorithms. This is further described in Reps (2017) <doi:10.1093/jamia/ocy032>.

Maintained by Egill Fridgeirsson. Last updated 7 days ago.

hades openjdk

5.5 match 190 stars 10.85 score 297 scripts

jpquast

protti:Bottom-Up Proteomics and LiP-MS Quality Control and Data Analysis Tools

Useful functions and workflows for proteomics quality control and data analysis of both limited proteolysis-coupled mass spectrometry (LiP-MS) (Feng et. al. (2014) <doi:10.1038/nbt.2999>) and regular bottom-up proteomics experiments. Data generated with search tools such as 'Spectronaut', 'MaxQuant' and 'Proteome Discover' can be easily used due to flexibility of functions.

Maintained by Jan-Philipp Quast. Last updated 5 months ago.

data-analysis lip-ms mass-spectrometry omics protein proteomics systems-biology

7.0 match 61 stars 8.58 score 83 scripts

jinghuazhao

tdthap:TDT Tests for Extended Haplotypes

Functions and examples are provided for Transmission/disequilibrium tests for extended marker haplotypes, as in Clayton, D. and Jones, H. (1999) "Transmission/disequilibrium tests for extended marker haplotypes". Amer. J. Hum. Genet., 65:1161-1169, <doi:10.1086/302566>.

Maintained by Jing Hua Zhao. Last updated 15 days ago.

genetics imputation lmm

10.0 match 12 stars 5.92 score 6 scripts

olssol

idem:Inference in Randomized Controlled Trials with Death and Missingness

In randomized studies involving severely ill patients, functional outcomes are often unobserved due to missed clinic visits, premature withdrawal or death. It is well known that if these unobserved functional outcomes are not handled properly, biased treatment comparisons can be produced. In this package, we implement a procedure for comparing treatments that is based on the composite endpoint of both the functional outcome and survival. The procedure was proposed in Wang et al. (2016) <DOI:10.1111/biom.12594> and Wang et al. (2020) <DOI:10.18637/jss.v093.i12>. It considers missing data imputation with different sensitivity analysis strategies to handle the unobserved functional outcomes not due to death.

Maintained by Chenguang Wang. Last updated 2 years ago.

cpp

16.7 match 3.51 score 16 scripts

pharmaverse

admiralonco:Oncology Extension Package for ADaM in 'R' Asset Library

Programming oncology specific Clinical Data Interchange Standards Consortium (CDISC) compliant Analysis Data Model (ADaM) datasets in 'R'. ADaM datasets are a mandatory part of any New Drug or Biologics License Application submitted to the United States Food and Drug Administration (FDA). Analysis derivations are implemented in accordance with the "Analysis Data Model Implementation Guide" (CDISC Analysis Data Model Team (2021), <https://www.cdisc.org/standards/foundational/adam>). The package is an extension package of the 'admiral' package.

Maintained by Stefan Bundfuss. Last updated 2 months ago.

6.8 match 32 stars 8.66 score 30 scripts

bioc

ccImpute:ccImpute: an accurate and scalable consensus clustering based approach to impute dropout events in the single-cell RNA-seq data (https://doi.org/10.1186/s12859-022-04814-8)

Dropout events make the lowly expressed genes indistinguishable from true zero expression and different than the low expression present in cells of the same type. This issue makes any subsequent downstream analysis difficult. ccImpute is an imputation algorithm that uses cell similarity established by consensus clustering to impute the most probable dropout events in the scRNA-seq datasets. ccImpute demonstrated performance which exceeds the performance of existing imputation approaches while introducing the least amount of new noise as measured by clustering performance characteristics on datasets with known cell identities.

Maintained by Marcin Malec. Last updated 5 months ago.

singlecell sequencing principalcomponent dimensionreduction clustering rnaseq transcriptomics openblas cpp openmp

12.7 match 2 stars 4.48 score 2 scripts

bioc

PhosR:A set of methods and tools for comprehensive analysis of phosphoproteomics data

PhosR is a package for the comprenhensive analysis of phosphoproteomic data. There are two major components to PhosR: processing and downstream analysis. PhosR consists of various processing tools for phosphoproteomics data including filtering, imputation, normalisation, and functional analysis for inferring active kinases and signalling pathways.

Maintained by Taiyun Kim. Last updated 5 months ago.

software researchfield proteomics

12.1 match 4.71 score 51 scripts

belayb

drugprepr:Prepare Electronic Prescription Record Data to Estimate Drug Exposure

Prepare prescription data (such as from the Clinical Practice Research Datalink) into an analysis-ready format, with start and stop dates for each patient's prescriptions. Based on Pye et al (2018) <doi:10.1002/pds.4440>.

Maintained by David Selby. Last updated 3 years ago.

15.4 match 1 stars 3.70 score 3 scripts

bioc

AffiXcan:A Functional Approach To Impute Genetically Regulated Expression

Impute a GReX (Genetically Regulated Expression) for a set of genes in a sample of individuals, using a method based on the Total Binding Affinity (TBA). Statistical models to impute GReX can be trained with a training dataset where the real total expression values are known.

Maintained by Alessandro Lussana. Last updated 5 months ago.

geneexpression transcription generegulation dimensionreduction regression principalcomponent

14.1 match 4.00 score

missvalteam

Iscores:Proper Scoring Rules for Missing Value Imputation

Implementation of a KL-based scoring rule to assess the quality of different missing value imputations in the broad sense as introduced in Michel et al. (2021) <arXiv:2106.03742>.

Maintained by Loris Michel. Last updated 2 years ago.

imputation-methods machine-learning missing-values random-forest

14.4 match 7 stars 3.91 score 23 scripts

tetratech

baytrends:Long Term Water Quality Trend Analysis

Enable users to evaluate long-term trends using a Generalized Additive Modeling (GAM) approach. The model development includes selecting a GAM structure to describe nonlinear seasonally-varying changes over time, incorporation of hydrologic variability via either a river flow or salinity, the use of an intervention to deal with method or laboratory changes suspected to impact data values, and representation of left- and interval-censored data. The approach has been applied to water quality data in the Chesapeake Bay, a major estuary on the east coast of the United States to provide insights to a range of management- and research-focused questions. Methodology described in Murphy (2019) <doi:10.1016/j.envsoft.2019.03.027>.

Maintained by Erik W Leppo. Last updated 5 months ago.

8.4 match 12 stars 6.67 score 97 scripts

liamrevell

phytools:Phylogenetic Tools for Comparative Biology (and Other Things)

A wide range of methods for phylogenetic analysis - concentrated in phylogenetic comparative biology, but also including numerous techniques for visualizing, analyzing, manipulating, reading or writing, and even inferring phylogenetic trees. Included among the functions in phylogenetic comparative biology are various for ancestral state reconstruction, model-fitting, and simulation of phylogenies and trait data. A broad range of plotting methods for phylogenies and comparative data include (but are not restricted to) methods for mapping trait evolution on trees, for projecting trees into phenotype space or a onto a geographic map, and for visualizing correlated speciation between trees. Lastly, numerous functions are designed for reading, writing, analyzing, inferring, simulating, and manipulating phylogenetic trees and comparative data. For instance, there are functions for computing consensus phylogenies from a set, for simulating phylogenetic trees and data under a range of models, for randomly or non-randomly attaching species or clades to a tree, as well as for a wide range of other manipulations and analyses that phylogenetic biologists might find useful in their research.

Maintained by Liam J. Revell. Last updated 26 days ago.

4.0 match 218 stars 13.85 score 4.8k scripts 76 dependents

bioc

mixOmics:Omics Data Integration Project

Multivariate methods are well suited to large omics data sets where the number of variables (e.g. genes, proteins, metabolites) is much larger than the number of samples (patients, cells, mice). They have the appealing properties of reducing the dimension of the data by using instrumental variables (components), which are defined as combinations of all variables. Those components are then used to produce useful graphical outputs that enable better understanding of the relationships and correlation structures between the different data sets that are integrated. mixOmics offers a wide range of multivariate methods for the exploration and integration of biological datasets with a particular focus on variable selection. The package proposes several sparse multivariate models we have developed to identify the key variables that are highly correlated, and/or explain the biological outcome of interest. The data that can be analysed with mixOmics may come from high throughput sequencing technologies, such as omics data (transcriptomics, metabolomics, proteomics, metagenomics etc) but also beyond the realm of omics (e.g. spectral imaging). The methods implemented in mixOmics can also handle missing values without having to delete entire rows with missing data. A non exhaustive list of methods include variants of generalised Canonical Correlation Analysis, sparse Partial Least Squares and sparse Discriminant Analysis. Recently we implemented integrative methods to combine multiple data sets: N-integration with variants of Generalised Canonical Correlation Analysis and P-integration with variants of multi-group Partial Least Squares.

Maintained by Eva Hamrud. Last updated 2 days ago.

immunooncology microarray sequencing metabolomics metagenomics proteomics geneprediction multiplecomparison classification regression bioconductor genomics genomics-data genomics-visualization multivariate-analysis multivariate-statistics omics r-pkg r-project

4.0 match 182 stars 13.71 score 1.3k scripts 22 dependents

evolecolgroup

tidypopgen:Tidy Population Genetics

We provide a tidy grammar of population genetics, facilitating the manipulation and analysis of data on biallelic single nucleotide polymorphisms (SNPs).

Maintained by Andrea Manica. Last updated 2 days ago.

openblas zlib cpp openmp

9.4 match 4 stars 5.83 score 8 scripts

philchalmers

mirt:Multidimensional Item Response Theory

Analysis of discrete response data using unidimensional and multidimensional item analysis models under the Item Response Theory paradigm (Chalmers (2012) <doi:10.18637/jss.v048.i06>). Exploratory and confirmatory item factor analysis models are estimated with quadrature (EM) or stochastic (MHRM) methods. Confirmatory bi-factor and two-tier models are available for modeling item testlets using dimension reduction EM algorithms, while multiple group analyses and mixed effects designs are included for detecting differential item, bundle, and test functioning, and for modeling item and person covariates. Finally, latent class models such as the DINA, DINO, multidimensional latent class, mixture IRT models, and zero-inflated response models are supported, as well as a wide family of probabilistic unfolding models.

Maintained by Phil Chalmers. Last updated 10 days ago.

irt mirt openblas cpp openmp

3.6 match 210 stars 14.98 score 2.5k scripts 40 dependents

ikwak2

DrImpute:Imputing Dropout Events in Single-Cell RNA-Sequencing Data

R codes for imputing dropout events. Many statistical methods in cell type identification, visualization and lineage reconstruction do not account for dropout events ('PCAreduce', 'SC3', 'PCA', 't-SNE', 'Monocle', 'TSCAN', etc). 'DrImpute' can improve the performance of such software by imputing dropout events.

Maintained by Il-Youp Kwak. Last updated 8 years ago.

cpp

8.1 match 4 stars 6.66 score 77 scripts 1 dependents

bioc

MSPrep:Package for Summarizing, Filtering, Imputing, and Normalizing Metabolomics Data

Package performs summarization of replicates, filtering by frequency, several different options for imputing missing data, and a variety of options for transforming, batch correcting, and normalizing data.

Maintained by Max McGrath. Last updated 5 months ago.

metabolomics massspectrometry preprocessing

10.3 match 10 stars 5.20 score 4 scripts

ropengov

regions:Processing Regional Statistics

Validating sub-national statistical typologies, re-coding across standard typologies of sub-national statistics, and making valid aggregate level imputation, re-aggregation, re-weighting and projection down to lower hierarchical levels to create meaningful data panels and time series.

Maintained by Daniel Antal. Last updated 2 years ago.

observatory regions ropengov statistics

6.0 match 12 stars 8.81 score 67 scripts 5 dependents

billpetti

baseballr:Acquiring and Analyzing Baseball Data

Provides numerous utilities for acquiring and analyzing baseball data from online sources such as 'Baseball Reference' <https://www.baseball-reference.com/>, 'FanGraphs' <https://www.fangraphs.com/>, and the 'MLB Stats' API <https://www.mlb.com/>.

Maintained by Saiem Gilani. Last updated 4 months ago.

baseball pitchfx sabermetrics statcast

5.9 match 380 stars 8.98 score 582 scripts

rrwen

nbc4va:Bayes Classifier for Verbal Autopsy Data

An implementation of the Naive Bayes Classifier (NBC) algorithm used for Verbal Autopsy (VA) built on code from Miasnikof et al (2015) <DOI:10.1186/s12916-015-0521-2>.

Maintained by Richard Wen. Last updated 3 years ago.

autopsy bayes cause classifier coded computer death estimate imputation learning machine mds million naive nbc probability study theory va verbal

11.3 match 4.60 score 79 scripts

vaudigier

clusterMI:Cluster Analysis with Missing Values by Multiple Imputation

Allows clustering of incomplete observations by addressing missing values using multiple imputation. For achieving this goal, the methodology consists in three steps, following Audigier and Niang 2022 <doi:10.1007/s11634-022-00519-1>. I) Missing data imputation using dedicated models. Four multiple imputation methods are proposed, two are based on joint modelling and two are fully sequential methods, as discussed in Audigier et al. (2021) <doi:10.48550/arXiv.2106.04424>. II) cluster analysis of imputed data sets. Six clustering methods are available (distances-based or model-based), but custom methods can also be easily used. III) Partition pooling. The set of partitions is aggregated using Non-negative Matrix Factorization based method. An associated instability measure is computed by bootstrap (see Fang, Y. and Wang, J., 2012 <doi:10.1016/j.csda.2011.09.003>). Among applications, this instability measure can be used to choose a number of clusters with missing values. The package also proposes several diagnostic tools to tune the number of imputed data sets, to tune the number of iterations in fully sequential imputation, to check the fit of imputation models, etc.

Maintained by Vincent Audigier. Last updated 19 days ago.

openblas cpp

17.3 match 2.95 score

business-science

timetk:A Tool Kit for Working with Time Series

Easy visualization, wrangling, and feature engineering of time series data for forecasting and machine learning prediction. Consolidates and extends time series functionality from packages including 'dplyr', 'stats', 'xts', 'forecast', 'slider', 'padr', 'recipes', and 'rsample'.

Maintained by Matt Dancho. Last updated 1 years ago.

coercion coercion-functions data-mining dplyr forecast forecasting forecasting-models machine-learning series-decomposition series-signature tibble tidy tidyquant tidyverse time time-series timeseries

3.6 match 625 stars 14.15 score 4.0k scripts 16 dependents

larsenlab

hlaR:Tools for HLA Data

A streamlined tool for eplet analysis of donor and recipient HLA (human leukocyte antigen) mismatch. Messy, low-resolution HLA typing data is cleaned, and imputed to high-resolution using the NMDP (National Marrow Donor Program) haplotype reference database <https://haplostats.org/haplostats>. High resolution data is analyzed for overall or single antigen eplet mismatch using a reference table (currently supporting 'HLAMatchMaker' <http://www.epitopes.net> versions 2 and 3). Data can enter or exit the workflow at different points depending on the user's aims and initial data quality.

Maintained by Joan Zhang. Last updated 2 years ago.

9.8 match 7 stars 5.15 score 9 scripts

andrisignorell

DescTools:Tools for Descriptive Statistics

A collection of miscellaneous basic statistic functions and convenience wrappers for efficiently describing data. The author's intention was to create a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. The package contains furthermore functions to produce documents using MS Word (or PowerPoint) and functions to import data from Excel. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R. The reason for collecting them here, was primarily to have them consolidated in ONE instead of dozens of packages (which themselves might depend on other packages which are not needed at all), and to provide a common and consistent interface as far as function and arguments naming, NA handling, recycling rules etc. are concerned. Google style guides were used as naming rules (in absence of convincing alternatives). The 'BigCamelCase' style was consequently applied to functions borrowed from contributed R packages as well.

Maintained by Andri Signorell. Last updated 9 days ago.

fortran cpp

3.0 match 87 stars 16.68 score 7.7k scripts 99 dependents

jgill22

hot.deck:Multiple Hot Deck Imputation

Performs multiple hot-deck imputation of categorical and continuous variables in a data frame.

Maintained by Jeff Gill. Last updated 4 years ago.

17.5 match 2.80 score 21 scripts 1 dependents

bioc

ptairMS:Pre-processing PTR-TOF-MS Data

This package implements a suite of methods to preprocess data from PTR-TOF-MS instruments (HDF5 format) and generates the 'sample by features' table of peak intensities in addition to the sample and feature metadata (as a singl<e ExpressionSet object for subsequent statistical analysis). This package also permit usefull tools for cohorts management as analyzing data progressively, visualization tools and quality control. The steps include calibration, expiration detection, peak detection and quantification, feature alignment, missing value imputation and feature annotation. Applications to exhaled air and cell culture in headspace are described in the vignettes and examples. This package was used for data analysis of Gassin Delyle study on adults undergoing invasive mechanical ventilation in the intensive care unit due to severe COVID-19 or non-COVID-19 acute respiratory distress syndrome (ARDS), and permit to identfy four potentiel biomarquers of the infection.

Maintained by camille Roquencourt. Last updated 5 months ago.

software massspectrometry preprocessing metabolomics peakdetection alignment cpp

9.5 match 7 stars 5.15 score 3 scripts

thevaachandereng

bayesCT:Simulation and Analysis of Adaptive Bayesian Clinical Trials

Simulation and analysis of Bayesian adaptive clinical trials for binomial, continuous, and time-to-event data types, incorporates historical data and allows early stopping for futility or early success. The package uses novel and efficient Monte Carlo methods for estimating Bayesian posterior probabilities, evaluation of loss to follow up, and imputation of incomplete data. The package has the functionality for dynamically incorporating historical data into the analysis via the power prior or non-informative priors.

Maintained by Thevaa Chandereng. Last updated 5 years ago.

adaptive bayesian-methods bayesian-trial clinical-trials statistical-power

7.6 match 14 stars 6.30 score 36 scripts

bmcclintock

momentuHMM:Maximum Likelihood Analysis of Animal Movement Behavior Using Multivariate Hidden Markov Models

Extended tools for analyzing telemetry data using generalized hidden Markov models. Features of momentuHMM (pronounced ``momentum'') include data pre-processing and visualization, fitting HMMs to location and auxiliary biotelemetry or environmental data, biased and correlated random walk movement models, discrete- or continuous-time HMMs, continuous- or discrete-space movement models, approximate Langevin diffusion models, hierarchical HMMs, multiple imputation for incorporating location measurement error and missing data, user-specified design matrices and constraints for covariate modelling of parameters, random effects, decoding of the state process, visualization of fitted models, model checking and selection, and simulation. See McClintock and Michelot (2018) <doi:10.1111/2041-210X.12995>.

Maintained by Brett McClintock. Last updated 29 days ago.

openblas cpp

5.7 match 43 stars 8.47 score 162 scripts

covaruber

sommer:Solving Mixed Model Equations in R

Structural multivariate-univariate linear mixed model solver for estimation of multiple random effects with unknown variance-covariance structures (e.g., heterogeneous and unstructured) and known covariance among levels of random effects (e.g., pedigree and genomic relationship matrices) (Covarrubias-Pazaran, 2016 <doi:10.1371/journal.pone.0156744>; Maier et al., 2015 <doi:10.1016/j.ajhg.2014.12.006>; Jensen et al., 1997). REML estimates can be obtained using the Direct-Inversion Newton-Raphson and Direct-Inversion Average Information algorithms for the problems r x r (r being the number of records) or using the Henderson-based average information algorithm for the problem c x c (c being the number of coefficients to estimate). Spatial models can also be fitted using the two-dimensional spline functionality available.

Maintained by Giovanny Covarrubias-Pazaran. Last updated 20 days ago.

average-information mixed-models rcpparmadillo openblas cpp openmp

3.8 match 43 stars 12.70 score 300 scripts 9 dependents

laresbernardo

lares:Analytics & Machine Learning Sidekick

Auxiliary package for better/faster analytics, visualization, data mining, and machine learning tasks. With a wide variety of family functions, like Machine Learning, Data Wrangling, Marketing Mix Modeling (Robyn), Exploratory, API, and Scrapper, it helps the analyst or data scientist to get quick and robust results, without the need of repetitive coding or advanced R programming skills.

Maintained by Bernardo Lares. Last updated 23 days ago.

analytics api automation automl data-science descriptive-statistics h2o machine-learning marketing mmm predictive-modeling puzzle rlanguage robyn visualization

4.8 match 233 stars 9.84 score 185 scripts 1 dependents

simsem

semTools:Useful Tools for Structural Equation Modeling

Provides miscellaneous tools for structural equation modeling, many of which extend the 'lavaan' package. For example, latent interactions can be estimated using product indicators (Lin et al., 2010, <doi:10.1080/10705511.2010.488999>) and simple effects probed; analytical power analyses can be conducted (Jak et al., 2021, <doi:10.3758/s13428-020-01479-0>); and scale reliability can be estimated based on estimated factor-model parameters.

Maintained by Terrence D. Jorgensen. Last updated 2 days ago.

3.4 match 79 stars 13.74 score 1.1k scripts 31 dependents

teebusch

mifa:Multiple Imputation for Exploratory Factor Analysis

Impute the covariance matrix of incomplete data so that factor analysis can be performed. Imputations are made using multiple imputation by Multivariate Imputation with Chained Equations (MICE) and combined with Rubin's rules. Parametric Fieller confidence intervals and nonparametric bootstrap confidence intervals can be obtained for the variance explained by different numbers of principal components. The method is described in Nassiri et al. (2018) <doi:10.3758/s13428-017-1013-4>.

Maintained by Tobias Busch. Last updated 4 years ago.

factor-analysis imputation

15.7 match 2 stars 3.00 score 5 scripts

o1iv3r

ClustImpute:K-Means Clustering with Build-in Missing Data Imputation

This k-means algorithm is able to cluster data with missing values and as a by-product completes the data set. The implementation can deal with missing values in multiple variables and is computationally efficient since it iteratively uses the current cluster assignment to define a plausible distribution for missing value imputation. Weights are used to shrink early random draws for missing values (i.e., draws based on the cluster assignments after few iterations) towards the global mean of each feature. This shrinkage slowly fades out after a fixed number of iterations to reflect the increasing credibility of cluster assignments. See the vignette for details.

Maintained by Oliver Pfaffel. Last updated 3 years ago.

9.4 match 7 stars 4.96 score 13 scripts

openpharma

brms.mmrm:Bayesian MMRMs using 'brms'

The mixed model for repeated measures (MMRM) is a popular model for longitudinal clinical trial data with continuous endpoints, and 'brms' is a powerful and versatile package for fitting Bayesian regression models. The 'brms.mmrm' R package leverages 'brms' to run MMRMs, and it supports a simplified interfaced to reduce difficulty and align with the best practices of the life sciences. References: Bürkner (2017) <doi:10.18637/jss.v080.i01>, Mallinckrodt (2008) <doi:10.1177/009286150804200402>.

Maintained by William Michael Landau. Last updated 5 months ago.

brms life-sciences mc-stan mmrm stan statistics

5.2 match 21 stars 8.80 score 13 scripts

eliocamp

metR:Tools for Easier Analysis of Meteorological Fields

Many useful functions and extensions for dealing with meteorological data in the tidy data framework. Extends 'ggplot2' for better plotting of scalar and vector fields and provides commonly used analysis methods in the atmospheric sciences.

Maintained by Elio Campitelli. Last updated 19 days ago.

atmospheric-science ggplot2 visualization

3.8 match 144 stars 12.19 score 1000 scripts 22 dependents

andyliaw-mrk

randomForest:Breiman and Cutlers Random Forests for Classification and Regression

Classification and regression based on a forest of trees using random inputs, based on Breiman (2001) <DOI:10.1023/A:1010933404324>.

Maintained by Andy Liaw. Last updated 6 months ago.

fortran

3.8 match 47 stars 12.11 score 35k scripts 282 dependents

cran

CALIBERrfimpute:Multiple Imputation Using MICE and Random Forest

Functions to impute using random forest under full conditional specifications (multivariate imputation by chained equations). The methods are described in Shah and others (2014) <doi:10.1093/aje/kwt312>.

Maintained by Anoop Shah. Last updated 2 years ago.

17.4 match 2 stars 2.60 score

ropensci

dynamite:Bayesian Modeling and Causal Inference for Multivariate Longitudinal Data

Easy-to-use and efficient interface for Bayesian inference of complex panel (time series) data using dynamic multivariate panel models by Helske and Tikka (2024) <doi:10.1016/j.alcr.2024.100617>. The package supports joint modeling of multiple measurements per individual, time-varying and time-invariant effects, and a wide range of discrete and continuous distributions. Estimation of these dynamic multivariate panel models is carried out via 'Stan'. For an in-depth tutorial of the package, see (Tikka and Helske, 2024) <doi:10.48550/arXiv.2302.01607>.

Maintained by Santtu Tikka. Last updated 18 days ago.

bayesian-inference panel-data stan statistical-models

5.7 match 29 stars 7.92 score 20 scripts

stamats

MKmisc:Miscellaneous Functions from M. Kohl

Contains several functions for statistical data analysis; e.g. for sample size and power calculations, computation of confidence intervals and tests, and generation of similarity matrices.

Maintained by Matthias Kohl. Last updated 2 years ago.

6.0 match 11 stars 7.40 score 129 scripts 1 dependents

cran

e1071:Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien

Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, generalized k-nearest neighbour ...

Maintained by David Meyer. Last updated 6 months ago.

cpp

3.0 match 28 stars 14.46 score 19k scripts 2.0k dependents

umich-cphds

lodi:Limit of Detection Imputation for Single-Pollutant Models

Impute observed values below the limit of detection (LOD) via censored likelihood multiple imputation (CLMI) in single-pollutant models, developed by Boss et al (2019) <doi:10.1097/EDE.0000000000001052>. CLMI handles exposure detection limits that may change throughout the course of exposure assessment. 'lodi' provides functions for imputing and pooling for this method.

Maintained by Alexander Rix. Last updated 5 years ago.

11.7 match 1 stars 3.70 score 10 scripts

jeffreyevans

spatialEco:Spatial Analysis and Modelling Utilities

Utilities to support spatial data manipulation, query, sampling and modelling in ecological applications. Functions include models for species population density, spatial smoothing, multivariate separability, point process model for creating pseudo- absences and sub-sampling, Quadrant-based sampling and analysis, auto-logistic modeling, sampling models, cluster optimization, statistical exploratory tools and raster-based metrics.

Maintained by Jeffrey S. Evans. Last updated 12 days ago.

biodiversity conservation ecology r-spatial raster spatial vector

4.5 match 110 stars 9.55 score 736 scripts 2 dependents

emmaskarstein

inlamemi:Missing Data and Measurement Error Modelling in INLA

Facilitates fitting measurement error and missing data imputation models using integrated nested Laplace approximations, according to the method described in Skarstein, Martino and Muff (2023) <doi:10.1002/bimj.202300078>. See Skarstein and Muff (2024) <doi:10.48550/arXiv.2406.08172> for details on using the package.

Maintained by Emma Skarstein. Last updated 4 months ago.

7.1 match 5.97 score 19 scripts

bips-hb

arf:Adversarial Random Forests

Adversarial random forests (ARFs) recursively partition data into fully factorized leaves, where features are jointly independent. The procedure is iterative, with alternating rounds of generation and discrimination. Data becomes increasingly realistic at each round, until original and synthetic samples can no longer be reliably distinguished. This is useful for several unsupervised learning tasks, such as density estimation and data synthesis. Methods for both are implemented in this package. ARFs naturally handle unstructured data with mixed continuous and categorical covariates. They inherit many of the benefits of random forests, including speed, flexibility, and solid performance with default parameters. For details, see Watson et al. (2023) <https://proceedings.mlr.press/v206/watson23a.html>.

Maintained by Marvin N. Wright. Last updated 18 days ago.

6.4 match 14 stars 6.65 score 16 scripts

luca-scr

mclust:Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation

Gaussian finite mixture models fitted via EM algorithm for model-based clustering, classification, and density estimation, including Bayesian regularization, dimension reduction for visualisation, and resampling-based inference.

Maintained by Luca Scrucca. Last updated 11 months ago.

fortran openblas

3.5 match 21 stars 12.23 score 6.6k scripts 587 dependents

andrija-djurovic

monobinShiny:Shiny User Interface for 'monobin' Package

This is an add-on package to the 'monobin' package that simplifies its use. It provides shiny-based user interface (UI) that is especially handy for less experienced 'R' users as well as for those who intend to perform quick scanning of numeric risk factors when building credit rating models. The additional functions implemented in 'monobinShiny' that do no exist in 'monobin' package are: descriptive statistics, special case and outliers imputation. The function descriptive statistics is exported and can be used in 'R' sessions independently from the user interface, while special case and outlier imputation functions are written to be used with shiny UI.

Maintained by Andrija Djurovic. Last updated 3 years ago.

15.5 match 1 stars 2.70 score 6 scripts

cardiomoon

autoReg:Automatic Linear and Logistic Regression and Survival Analysis

Make summary tables for descriptive statistics and select explanatory variables automatically in various regression models. Support linear models, generalized linear models and cox-proportional hazard models. Generate publication-ready tables summarizing result of regression analysis and plots. The tables and plots can be exported in "HTML", "pdf('LaTex')", "docx('MS Word')" and "pptx('MS Powerpoint')" documents.

Maintained by Keon-Woong Moon. Last updated 1 years ago.

5.9 match 47 stars 7.00 score 69 scripts

branchlab

metasnf:Meta Clustering with Similarity Network Fusion

Framework to facilitate patient subtyping with similarity network fusion and meta clustering. The similarity network fusion (SNF) algorithm was introduced by Wang et al. (2014) in <doi:10.1038/nmeth.2810>. SNF is a data integration approach that can transform high-dimensional and diverse data types into a single similarity network suitable for clustering with minimal loss of information from each initial data source. The meta clustering approach was introduced by Caruana et al. (2006) in <doi:10.1109/ICDM.2006.103>. Meta clustering involves generating a wide range of cluster solutions by adjusting clustering hyperparameters, then clustering the solutions themselves into a manageable number of qualitatively similar solutions, and finally characterizing representative solutions to find ones that are best for the user's specific context. This package provides a framework to easily transform multi-modal data into a wide range of similarity network fusion-derived cluster solutions as well as to visualize, characterize, and validate those solutions. Core package functionality includes easy customization of distance metrics, clustering algorithms, and SNF hyperparameters to generate diverse clustering solutions; calculation and plotting of associations between features, between patients, and between cluster solutions; and standard cluster validation approaches including resampled measures of cluster stability, standard metrics of cluster quality, and label propagation to evaluate generalizability in unseen data. Associated vignettes guide the user through using the package to identify patient subtypes while adhering to best practices for unsupervised learning.

Maintained by Prashanth S Velayudhan. Last updated 3 days ago.

bioinformatics clustering metaclustering snf

5.0 match 8 stars 8.21 score 30 scripts

michaelwrobbins

gerbil:Generalized Efficient Regression-Based Imputation with Latent Processes

Implements a new multiple imputation method that draws imputations from a latent joint multivariate normal model which underpins generally structured data. This model is constructed using a sequence of flexible conditional linear models that enables the resulting procedure to be efficiently implemented on high dimensional datasets in practice. See Robbins (2021) <arXiv:2008.02243>.

Maintained by Michael Robbins. Last updated 2 years ago.

15.0 match 2.70 score 3 scripts

bioc

RNAseqCovarImpute:Impute Covariate Data in RNA Sequencing Studies

The RNAseqCovarImpute package makes linear model analysis for RNA sequencing read counts compatible with multiple imputation (MI) of missing covariates. A major problem with implementing MI in RNA sequencing studies is that the outcome data must be included in the imputation prediction models to avoid bias. This is difficult in omics studies with high-dimensional data. The first method we developed in the RNAseqCovarImpute package surmounts the problem of high-dimensional outcome data by binning genes into smaller groups to analyze pseudo-independently. This method implements covariate MI in gene expression studies by 1) randomly binning genes into smaller groups, 2) creating M imputed datasets separately within each bin, where the imputation predictor matrix includes all covariates and the log counts per million (CPM) for the genes within each bin, 3) estimating gene expression changes using `limma::voom` followed by `limma::lmFit` functions, separately on each M imputed dataset within each gene bin, 4) un-binning the gene sets and stacking the M sets of model results before applying the `limma::squeezeVar` function to apply a variance shrinking Bayesian procedure to each M set of model results, 5) pooling the results with Rubins’ rules to produce combined coefficients, standard errors, and P-values, and 6) adjusting P-values for multiplicity to account for false discovery rate (FDR). A faster method uses principal component analysis (PCA) to avoid binning genes while still retaining outcome information in the MI models. Binning genes into smaller groups requires that the MI and limma-voom analysis is run many times (typically hundreds). The more computationally efficient MI PCA method implements covariate MI in gene expression studies by 1) performing PCA on the log CPM values for all genes using the Bioconductor `PCAtools` package, 2) creating M imputed datasets where the imputation predictor matrix includes all covariates and the optimum number of PCs to retain (e.g., based on Horn’s parallel analysis or the number of PCs that account for >80% explained variation), 3) conducting the standard limma-voom pipeline with the `voom` followed by `lmFit` followed by `eBayes` functions on each M imputed dataset, 4) pooling the results with Rubins’ rules to produce combined coefficients, standard errors, and P-values, and 5) adjusting P-values for multiplicity to account for false discovery rate (FDR).

Maintained by Brennan Baker. Last updated 5 months ago.

rnaseq geneexpression differentialexpression sequencing

9.0 match 1 stars 4.48 score 6 scripts

jepusto

clubSandwich:Cluster-Robust (Sandwich) Variance Estimators with Small-Sample Corrections

Provides several cluster-robust variance estimators (i.e., sandwich estimators) for ordinary and weighted least squares linear regression models, including the bias-reduced linearization estimator introduced by Bell and McCaffrey (2002) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2002002/article/9058-eng.pdf> and developed further by Pustejovsky and Tipton (2017) <DOI:10.1080/07350015.2016.1247004>. The package includes functions for estimating the variance- covariance matrix and for testing single- and multiple- contrast hypotheses based on Wald test statistics. Tests of single regression coefficients use Satterthwaite or saddle-point corrections. Tests of multiple- contrast hypotheses use an approximation to Hotelling's T-squared distribution. Methods are provided for a variety of fitted models, including lm() and mlm objects, glm(), geeglm() (from package 'geepack'), ivreg() (from package 'AER'), ivreg() (from package 'ivreg' when estimated by ordinary least squares), plm() (from package 'plm'), gls() and lme() (from 'nlme'), lmer() (from `lme4`), robu() (from 'robumeta'), and rma.uni() and rma.mv() (from 'metafor').

Maintained by James Pustejovsky. Last updated 14 days ago.

3.5 match 48 stars 11.25 score 656 scripts 4 dependents

dppalomar

imputeFin:Imputation of Financial Time Series with Missing Values and/or Outliers

Missing values often occur in financial data due to a variety of reasons (errors in the collection process or in the processing stage, lack of asset liquidity, lack of reporting of funds, etc.). However, most data analysis methods expect complete data and cannot be employed with missing values. One convenient way to deal with this issue without having to redesign the data analysis method is to impute the missing values. This package provides an efficient way to impute the missing values based on modeling the time series with a random walk or an autoregressive (AR) model, convenient to model log-prices and log-volumes in financial data. In the current version, the imputation is univariate-based (so no asset correlation is used). In addition, outliers can be detected and removed. The package is based on the paper: J. Liu, S. Kumar, and D. P. Palomar (2019). Parameter Estimation of Heavy-Tailed AR Model With Missing Data Via Stochastic EM. IEEE Trans. on Signal Processing, vol. 67, no. 8, pp. 2159-2172. <doi:10.1109/TSP.2019.2899816>.

Maintained by Daniel P. Palomar. Last updated 3 years ago.

financial-data missing-values outliers time-series

6.9 match 25 stars 5.80 score 25 scripts

torockel

imputeGeneric:Ease the Implementation of Imputation Methods

The general workflow of most imputation methods is quite similar. The aim of this package is to provide parts of this general workflow to make the implementation of imputation methods easier. The heart of an imputation method is normally the used model. These models can be defined using the 'parsnip' package or customized specifications. The rest of an imputation method are more technical specification e.g. which columns and rows should be used for imputation and in which order. These technical specifications can be set inside the imputation functions.

Maintained by Tobias Rockel. Last updated 3 years ago.

14.7 match 1 stars 2.70 score 2 scripts

neerajdhanraj

imputeTestbench:Test Bench for the Comparison of Imputation Methods

Provides a test bench for the comparison of missing data imputation methods in uni-variate time series. Imputation methods are compared using different error metrics. Proposed imputation methods and alternative error metrics can be used.

Maintained by Marcus W. Beck. Last updated 8 years ago.

8.0 match 5 stars 4.94 score 20 scripts 2 dependents

wjunger

mtsdi:Multivariate Time Series Data Imputation

This is an EM algorithm based method for imputation of missing values in multivariate normal time series. The imputation algorithm accounts for both spatial and temporal correlation structures. Temporal patterns can be modeled using an ARIMA(p,d,q), optionally with seasonal components, a non-parametric cubic spline or generalized additive models with exogenous covariates. This algorithm is specially tailored for climate data with missing measurements from several monitors along a given region.

Maintained by Washington Junger. Last updated 3 years ago.

9.8 match 1 stars 4.00 score 22 scripts 3 dependents

jwb133

mlmi:Maximum Likelihood Multiple Imputation

Implements so called Maximum Likelihood Multiple Imputation as described by von Hippel and Bartlett (2021) <doi:10.1214/20-STS793>. A number of different imputations are available, by utilising the 'norm', 'cat' and 'mix' packages. Inferences can be performed either using combination rules similar to Rubin's or using a likelihood score based approach based on theory by Wang and Robins (1998) <doi:10.1093/biomet/85.4.935>.

Maintained by Jonathan Bartlett. Last updated 2 years ago.

14.4 match 2.70 score 10 scripts

trevorhastie

softImpute:Matrix Completion via Iterative Soft-Thresholded SVD

Iterative methods for matrix completion that use nuclear-norm regularization. There are two main approaches.The one approach uses iterative soft-thresholded svds to impute the missing values. The second approach uses alternating least squares. Both have an 'EM' flavor, in that at each iteration the matrix is completed with the current estimate. For large matrices there is a special sparse-matrix class named "Incomplete" that efficiently handles all computations. The package includes procedures for centering and scaling rows, columns or both, and for computing low-rank SVDs on large sparse centered matrices (i.e. principal components).

Maintained by Trevor Hastie. Last updated 4 years ago.

fortran

5.2 match 10 stars 7.47 score 253 scripts 22 dependents

nicokubi

penetrance:Methods for Penetrance Estimation in Family-Based Studies

Implements statistical methods for estimating disease penetrance in family-based studies. Penetrance refers to the probability of disease§ manifestation in individuals carrying specific genetic variants. The package provides tools for age-specific penetrance estimation, handling missing data, and accounting for ascertainment bias in family studies. Cite as: Kubista, N., Braun, D. & Parmigiani, G. (2024) <doi:10.48550/arXiv.2411.18816>.

Maintained by Nicolas Kubista. Last updated 16 days ago.

7.1 match 5.41 score

bioc

POMA:Tools for Omics Data Analysis

The POMA package offers a comprehensive toolkit designed for omics data analysis, streamlining the process from initial visualization to final statistical analysis. Its primary goal is to simplify and unify the various steps involved in omics data processing, making it more accessible and manageable within a single, intuitive R package. Emphasizing on reproducibility and user-friendliness, POMA leverages the standardized SummarizedExperiment class from Bioconductor, ensuring seamless integration and compatibility with a wide array of Bioconductor tools. This approach guarantees maximum flexibility and replicability, making POMA an essential asset for researchers handling omics datasets. See https://github.com/pcastellanoescuder/POMAShiny. Paper: Castellano-Escuder et al. (2021) <doi:10.1371/journal.pcbi.1009148> for more details.

Maintained by Pol Castellano-Escuder. Last updated 4 months ago.

batcheffect classification clustering decisiontree dimensionreduction multidimensionalscaling normalization preprocessing principalcomponent regression rnaseq software statisticalmethod visualization bioconductor bioinformatics data-visualization dimension-reduction exploratory-data-analysis machine-learning omics-data-integration pipeline pre-processing statistical-analysis user-friendly workflow

4.7 match 11 stars 8.23 score 20 scripts 1 dependents

mikejseo

bipd:Bayesian Individual Patient Data Meta-Analysis using 'JAGS'

We use a Bayesian approach to run individual patient data meta-analysis and network meta-analysis using 'JAGS'. The methods incorporate shrinkage methods and calculate patient-specific treatment effects as described in Seo et al. (2021) <DOI:10.1002/sim.8859>. This package also includes user-friendly functions that impute missing data in an individual patient data using mice-related packages.

Maintained by Michael Seo. Last updated 3 years ago.

jags cpp

9.0 match 3 stars 4.26 score 20 scripts

cran

hsphase:Phasing, Pedigree Reconstruction, Sire Imputation and Recombination Events Identification of Half-sib Families Using SNP Data

Identification of recombination events, haplotype reconstruction, sire imputation and pedigree reconstruction using half-sib family SNP data.

Maintained by Mohammad Ferdosi. Last updated 1 years ago.

cpp openmp

11.5 match 1 stars 3.32 score 1 dependents

jhstaudacher

CoopGame:Important Concepts of Cooperative Game Theory

The theory of cooperative games with transferable utility offers useful insights into the way parties can share gains from cooperation and secure sustainable agreements, see e.g. one of the books by Chakravarty, Mitra and Sarkar (2015, ISBN:978-1107058798) or by Driessen (1988, ISBN:978-9027727299) for more details. A comprehensive set of tools for cooperative game theory with transferable utility is provided. Users can create special families of cooperative games, like e.g. bankruptcy games, cost sharing games and weighted voting games. There are functions to check various game properties and to compute five different set-valued solution concepts for cooperative games. A large number of point-valued solution concepts is available reflecting the diverse application areas of cooperative game theory. Some of these point-valued solution concepts can be used to analyze weighted voting games and measure the influence of individual voters within a voting body. There are routines for visualizing both set-valued and point-valued solutions in the case of three or four players.

Maintained by Jochen Staudacher. Last updated 4 years ago.

9.2 match 4.10 score 424 scripts 1 dependents

bioc

MAPFX:MAssively Parallel Flow cytometry Xplorer (MAPFX): A Toolbox for Analysing Data from the Massively-Parallel Cytometry Experiments

MAPFX is an end-to-end toolbox that pre-processes the raw data from MPC experiments (e.g., BioLegend's LEGENDScreen and BD Lyoplates assays), and further imputes the ‘missing’ infinity markers in the wells without those measurements. The pipeline starts by performing background correction on raw intensities to remove the noise from electronic baseline restoration and fluorescence compensation by adapting a normal-exponential convolution model. Unwanted technical variation, from sources such as well effects, is then removed using a log-normal model with plate, column, and row factors, after which infinity markers are imputed using the informative backbone markers as predictors. The completed dataset can then be used for clustering and other statistical analyses. Additionally, MAPFX can be used to normalise data from FFC assays as well.

Maintained by Hsiao-Chi Liao. Last updated 5 months ago.

software flowcytometry cellbasedassays singlecell proteomics clustering

8.3 match 1 stars 4.54 score

cran

miselect:Variable Selection for Multiply Imputed Data

Penalized regression methods, such as lasso and elastic net, are used in many biomedical applications when simultaneous regression coefficient estimation and variable selection is desired. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors, making it difficult to ascertain a final active set without resorting to ad hoc combination rules. 'miselect' presents Stacked Adaptive Elastic Net (saenet) and Grouped Adaptive LASSO (galasso) for continuous and binary outcomes, developed by Du et al (2022) <doi:10.1080/10618600.2022.2035739>. They, by construction, force selection of the same variables across multiply imputed data. 'miselect' also provides cross validated variants of these methods.

Maintained by Michael Kleinsasser. Last updated 1 years ago.

13.9 match 2.70 score

jmpsteen

medflex:Flexible Mediation Analysis Using Natural Effect Models

Run flexible mediation analyses using natural effect models as described in Lange, Vansteelandt and Bekaert (2012) <DOI:10.1093/aje/kwr525>, Vansteelandt, Bekaert and Lange (2012) <DOI:10.1515/2161-962X.1014> and Loeys, Moerkerke, De Smet, Buysse, Steen and Vansteelandt (2013) <DOI:10.1080/00273171.2013.832132>.

Maintained by Johan Steen. Last updated 2 years ago.

causal-inference flexible-modeling mediation-analysis

5.3 match 23 stars 7.09 score 54 scripts

xiaolei-lab

rMVP:Memory-Efficient, Visualize-Enhanced, Parallel-Accelerated GWAS Tool

A memory-efficient, visualize-enhanced, parallel-accelerated Genome-Wide Association Study (GWAS) tool. It can (1) effectively process large data, (2) rapidly evaluate population structure, (3) efficiently estimate variance components several algorithms, (4) implement parallel-accelerated association tests of markers three methods, (5) globally efficient design on GWAS process computing, (6) enhance visualization of related information. 'rMVP' contains three models GLM (Alkes Price (2006) <DOI:10.1038/ng1847>), MLM (Jianming Yu (2006) <DOI:10.1038/ng1702>) and FarmCPU (Xiaolei Liu (2016) <doi:10.1371/journal.pgen.1005767>); variance components estimation methods EMMAX (Hyunmin Kang (2008) <DOI:10.1534/genetics.107.080101>;), FaSTLMM (method: Christoph Lippert (2011) <DOI:10.1038/nmeth.1681>, R implementation from 'GAPIT2': You Tang and Xiaolei Liu (2016) <DOI:10.1371/journal.pone.0107684> and 'SUPER': Qishan Wang and Feng Tian (2014) <DOI:10.1371/journal.pone.0107684>), and HE regression (Xiang Zhou (2017) <DOI:10.1214/17-AOAS1052>).

Maintained by Xiaolei Liu. Last updated 2 months ago.

openblas cpp openmp

4.6 match 287 stars 8.06 score 38 scripts

cran

bnlearn:Bayesian Network Structure Learning, Parameter Learning and Inference

Bayesian network structure learning, parameter learning and inference. This package implements constraint-based (PC, GS, IAMB, Inter-IAMB, Fast-IAMB, MMPC, Hiton-PC, HPC), pairwise (ARACNE and Chow-Liu), score-based (Hill-Climbing and Tabu Search) and hybrid (MMHC, RSMAX2, H2PC) structure learning algorithms for discrete, Gaussian and conditional Gaussian networks, along with many score functions and conditional independence tests. The Naive Bayes and the Tree-Augmented Naive Bayes (TAN) classifiers are also implemented. Some utility functions (model comparison and manipulation, random data generation, arc orientation testing, simple and advanced plots) are included, as well as support for parameter estimation (maximum likelihood and Bayesian) and inference, conditional probability queries, cross-validation, bootstrap and model averaging. Development snapshots with the latest bugfixes are available from <https://www.bnlearn.com/>.

Maintained by Marco Scutari. Last updated 2 months ago.

openblas

4.8 match 57 stars 7.72 score 32 dependents

kjhealy

gssrdoc:Document General Social Survey Variable

The General Social Survey (GSS) is a long-running, mostly annual survey of US households. It is administered by the National Opinion Research Center (NORC). This package contains the a tibble with information on the survey variables, together with every variable documented as an R help page. For more information on the GSS see \url{http://gss.norc.org}.

Maintained by Kieran Healy. Last updated 11 months ago.

16.1 match 2.28 score 38 scripts

mrc-ide

EpiEstim:Estimate Time Varying Reproduction Numbers from Epidemic Curves

Tools to quantify transmissibility throughout an epidemic from the analysis of time series of incidence as described in Cori et al. (2013) <doi:10.1093/aje/kwt133> and Wallinga and Teunis (2004) <doi:10.1093/aje/kwh255>.

Maintained by Anne Cori. Last updated 7 months ago.

3.0 match 95 stars 12.00 score 1.0k scripts 7 dependents

paulkinyanjui01

CondMVT:Conditional Multivariate t Distribution

The packages helps sample from the conditional multivariate t distribution.

Maintained by Paul Kimani Kinyanjui. Last updated 3 years ago.

13.3 match 2.70 score

bioc

scp:Mass Spectrometry-Based Single-Cell Proteomics Data Analysis

Utility functions for manipulating, processing, and analyzing mass spectrometry-based single-cell proteomics data. The package is an extension to the 'QFeatures' package and relies on 'SingleCellExpirement' to enable single-cell proteomics analyses. The package offers the user the functionality to process quantitative table (as generated by MaxQuant, Proteome Discoverer, and more) into data tables ready for downstream analysis and data visualization.

Maintained by Christophe Vanderaa. Last updated 15 days ago.

geneexpression proteomics singlecell massspectrometry preprocessing cellbasedassays bioconductor mass-spectrometry single-cell software

4.0 match 25 stars 8.94 score 115 scripts

bioc

autonomics:Unified Statistical Modeling of Omics Data

This package unifies access to Statistal Modeling of Omics Data. Across linear modeling engines (lm, lme, lmer, limma, and wilcoxon). Across coding systems (treatment, difference, deviation, etc). Across model formulae (with/without intercept, random effect, interaction or nesting). Across omics platforms (microarray, rnaseq, msproteomics, affinity proteomics, metabolomics). Across projection methods (pca, pls, sma, lda, spls, opls). Across clustering methods (hclust, pam, cmeans). It provides a fast enrichment analysis implementation. And an intuitive contrastogram visualisation to summarize contrast effects in complex designs.

Maintained by Aditya Bhagwat. Last updated 2 months ago.

software dataimport preprocessing dimensionreduction principalcomponent regression differentialexpression genesetenrichment transcriptomics transcription geneexpression rnaseq microarray proteomics metabolomics massspectrometry

6.0 match 5.95 score 5 scripts

alexanderrobitzsch

sirt:Supplementary Item Response Theory Models

Supplementary functions for item response models aiming to complement existing R packages. The functionality includes among others multidimensional compensatory and noncompensatory IRT models (Reckase, 2009, <doi:10.1007/978-0-387-89976-3>), MCMC for hierarchical IRT models and testlet models (Fox, 2010, <doi:10.1007/978-1-4419-0742-4>), NOHARM (McDonald, 1982, <doi:10.1177/014662168200600402>), Rasch copula model (Braeken, 2011, <doi:10.1007/s11336-010-9190-4>; Schroeders, Robitzsch & Schipolowski, 2014, <doi:10.1111/jedm.12054>), faceted and hierarchical rater models (DeCarlo, Kim & Johnson, 2011, <doi:10.1111/j.1745-3984.2011.00143.x>), ordinal IRT model (ISOP; Scheiblechner, 1995, <doi:10.1007/BF02301417>), DETECT statistic (Stout, Habing, Douglas & Kim, 1996, <doi:10.1177/014662169602000403>), local structural equation modeling (LSEM; Hildebrandt, Luedtke, Robitzsch, Sommer & Wilhelm, 2016, <doi:10.1080/00273171.2016.1142856>).

Maintained by Alexander Robitzsch. Last updated 2 months ago.

item-response-theory openblas cpp

3.6 match 23 stars 10.01 score 280 scripts 22 dependents

sfcheung

manymome:Mediation, Moderation and Moderated-Mediation After Model Fitting

Computes indirect effects, conditional effects, and conditional indirect effects in a structural equation model or path model after model fitting, with no need to define any user parameters or label any paths in the model syntax, using the approach presented in Cheung and Cheung (2024) <doi:10.3758/s13428-023-02224-z>. Can also form bootstrap confidence intervals by doing bootstrapping only once and reusing the bootstrap estimates in all subsequent computations. Supports bootstrap confidence intervals for standardized (partially or completely) indirect effects, conditional effects, and conditional indirect effects as described in Cheung (2009) <doi:10.3758/BRM.41.2.425> and Cheung, Cheung, Lau, Hui, and Vong (2022) <doi:10.1037/hea0001188>. Model fitting can be done by structural equation modeling using lavaan() or regression using lm().

Maintained by Shu Fai Cheung. Last updated 21 days ago.

bootstrapping confidence-interval lavaan manymome mediation moderated-mediation moderation regression sem standardized-effect-size structural-equation-modeling

4.4 match 1 stars 8.06 score 172 scripts 4 dependents

statnet

tergm:Fit, Simulate and Diagnose Models for Network Evolution Based on Exponential-Family Random Graph Models

An integrated set of extensions to the 'ergm' package to analyze and simulate network evolution based on exponential-family random graph models (ERGM). 'tergm' is a part of the 'statnet' suite of packages for network analysis. See Krivitsky and Handcock (2014) <doi:10.1111/rssb.12014> and Carnegie, Krivitsky, Hunter, and Goodreau (2015) <doi:10.1080/10618600.2014.903087>.

Maintained by Pavel N. Krivitsky. Last updated 4 months ago.

3.8 match 27 stars 9.29 score 78 scripts 3 dependents

gianmarcoborrata

Indicator:Composite 'Indicator' Construction and Imputation Data

Different functions includes constructing composite indicators, imputing missing data, and evaluating imputation techniques. Additionally, different tools for data normalization. Detailed methodologies of 'Indicator' package are: OECD/European Union/EC-JRC (2008), "Handbook on Constructing Composite Indicators: Methodology and User Guide", OECD Publishing, Paris, <DOI:10.1787/533411815016>, Matteo Mazziotta & Adriano Pareto, (2018) "Measuring Well-Being Over Time: The Adjusted Mazziotta–Pareto Index Versus Other Non-compensatory Indices" <DOI:10.1007/s11205-017-1577-5> and De Muro P., Mazziotta M., Pareto A. (2011), "Composite Indices of Development and Poverty: An Application to MDGs" <DOI:10.1007/s11205-010-9727-z>.

Maintained by Gianmarco Borrata. Last updated 4 months ago.

11.7 match 1 stars 3.00 score 1 scripts

bioc

LEA:LEA: an R package for Landscape and Ecological Association Studies

LEA is an R package dedicated to population genomics, landscape genomics and genotype-environment association tests. LEA can run analyses of population structure and genome-wide tests for local adaptation, and also performs imputation of missing genotypes. The package includes statistical methods for estimating ancestry coefficients from large genotypic matrices and for evaluating the number of ancestral populations (snmf). It performs statistical tests using latent factor mixed models for identifying genetic polymorphisms that exhibit association with environmental gradients or phenotypic traits (lfmm2). In addition, LEA computes values of genetic offset statistics based on new or predicted environments (genetic.gap, genetic.offset). LEA is mainly based on optimized programs that can scale with the dimensions of large data sets.

Maintained by Olivier Francois. Last updated 4 days ago.

software statistical method clustering regression openblas

5.3 match 6.63 score 534 scripts

pdwaggoner

hdImpute:A Batch Process for High Dimensional Imputation

A correlation-based batch process for fast, accurate imputation for high dimensional missing data problems via chained random forests. See Waggoner (2023) <doi:10.1007/s00180-023-01325-9> for more on 'hdImpute', Stekhoven and Bühlmann (2012) <doi:10.1093/bioinformatics/btr597> for more on 'missForest', and Mayer (2022) <https://github.com/mayer79/missRanger> for more on 'missRanger'.

Maintained by Philip Waggoner. Last updated 2 months ago.

10.2 match 2 stars 3.41 score 13 scripts

bioc

scRecover:scRecover for imputation of single-cell RNA-seq data

scRecover is an R package for imputation of single-cell RNA-seq (scRNA-seq) data. It will detect and impute dropout values in a scRNA-seq raw read counts matrix while keeping the real zeros unchanged, since there are both dropout zeros and real zeros in scRNA-seq data. By combination with scImpute, SAVER and MAGIC, scRecover not only detects dropout and real zeros at higher accuracy, but also improve the downstream clustering and visualization results.

Maintained by Zhun Miao. Last updated 5 months ago.

geneexpression singlecell rnaseq transcriptomics sequencing preprocessing software

6.6 match 8 stars 5.20 score 9 scripts

tomasfryda

h2o:R Interface for the 'H2O' Scalable Machine Learning Platform

R interface for 'H2O', the scalable open source machine learning platform that offers parallelized implementations of many supervised and unsupervised machine learning algorithms such as Generalized Linear Models (GLM), Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks (Deep Learning), Stacked Ensembles, Naive Bayes, Generalized Additive Models (GAM), ANOVA GLM, Cox Proportional Hazards, K-Means, PCA, ModelSelection, Word2Vec, as well as a fully automatic machine learning algorithm (H2O AutoML).

Maintained by Tomas Fryda. Last updated 1 years ago.

4.1 match 3 stars 8.20 score 7.8k scripts 11 dependents

bioc

scone:Single Cell Overview of Normalized Expression data

SCONE is an R package for comparing and ranking the performance of different normalization schemes for single-cell RNA-seq and other high-throughput analyses.

Maintained by Davide Risso. Last updated 24 days ago.

immunooncology normalization preprocessing qualitycontrol geneexpression rnaseq software transcriptomics sequencing singlecell coverage

3.7 match 53 stars 9.12 score 104 scripts

felixthoemmes

rddapp:Regression Discontinuity Design Application

Estimation of both single- and multiple-assignment Regression Discontinuity Designs (RDDs). Provides both parametric (global) and non-parametric (local) estimation choices for both sharp and fuzzy designs, along with power analysis and assumption checks. Introductions to the underlying logic and analysis of RDDs are in Thistlethwaite, D. L., Campbell, D. T. (1960) <doi:10.1037/h0044319> and Lee, D. S., Lemieux, T. (2010) <doi:10.1257/jel.48.2.281>.

Maintained by Felix Thoemmes. Last updated 2 years ago.

non-parametric-rdd parametric-rdd rdd

5.3 match 9 stars 6.30 score 44 scripts

handcock

RDS:Respondent-Driven Sampling

Provides functionality for carrying out estimation with data collected using Respondent-Driven Sampling. This includes Heckathorn's RDS-I and RDS-II estimators as well as Gile's Sequential Sampling estimator. The package is part of the "RDS Analyst" suite of packages for the analysis of respondent-driven sampling data. See Gile and Handcock (2010) <doi:10.1111/j.1467-9531.2010.01223.x>, Gile and Handcock (2015) <doi:10.1111/rssa.12091> and Gile, Beaudry, Handcock and Ott (2018) <doi:10.1146/annurev-statistics-031017-100704>.

Maintained by Mark S. Handcock. Last updated 6 months ago.

8.6 match 1 stars 3.87 score 82 scripts 3 dependents

federico-m-stefanini

convergEU:Monitoring Convergence of EU Countries

Indicators and measures by country and time describe what happens at economic and social levels. This package provides functions to calculate several measures of convergence after imputing missing values. The automated downloading of Eurostat data, followed by the production of country fiches and indicator fiches, makes possible to produce automated reports. The Eurofound report (<doi:10.2806/68012>) "Upward convergence in the EU: Concepts, measurements and indicators", 2018, is a detailed presentation of convergence.

Maintained by Federico M. Stefanini. Last updated 2 years ago.

6.7 match 1 stars 4.95 score 89 scripts

weirichs

eatRep:Educational Assessment Tools for Replication Methods

Replication methods to compute some basic statistic operations (means, standard deviations, frequency tables, percentiles, mean comparisons using weighted effect coding, generalized linear models, and linear multilevel models) in complex survey designs comprising multiple imputed or nested imputed variables and/or a clustered sampling structure which both deserve special procedures at least in estimating standard errors. See the package documentation for a more detailed description along with references.

Maintained by Sebastian Weirich. Last updated 16 days ago.

6.3 match 1 stars 5.16 score 13 scripts

bioc

missRows:Handling Missing Individuals in Multi-Omics Data Integration

The missRows package implements the MI-MFA method to deal with missing individuals ('biological units') in multi-omics data integration. The MI-MFA method generates multiple imputed datasets from a Multiple Factor Analysis model, then the yield results are combined in a single consensus solution. The package provides functions for estimating coordinates of individuals and variables, imputing missing individuals, and various diagnostic plots to inspect the pattern of missingness and visualize the uncertainty due to missing values.

Maintained by Gonzalez Ignacio. Last updated 5 months ago.

software statisticalmethod dimensionreduction principalcomponent mathematicalbiology visualization

9.8 match 3.30 score 3 scripts

kaz-yos

regmedint:Regression-Based Causal Mediation Analysis with Interaction and Effect Modification Terms

This is an extension of the regression-based causal mediation analysis first proposed by Valeri and VanderWeele (2013) <doi:10.1037/a0031034> and Valeri and VanderWeele (2015) <doi:10.1097/EDE.0000000000000253>). It supports including effect measure modification by covariates(treatment-covariate and mediator-covariate product terms in mediator and outcome regression models) as proposed by Li et al (2023) <doi:10.1097/EDE.0000000000001643>. It also accommodates the original 'SAS' macro and 'PROC CAUSALMED' procedure in 'SAS' when there is no effect measure modification. Linear and logistic models are supported for the mediator model. Linear, logistic, loglinear, Poisson, negative binomial, Cox, and accelerated failure time (exponential and Weibull) models are supported for the outcome model.

Maintained by Yi Li. Last updated 1 years ago.

causal-inference mediation-analysis

4.6 match 29 stars 6.84 score 40 scripts

georgheinze

logistf:Firth's Bias-Reduced Logistic Regression

Fit a logistic regression model using Firth's bias reduction method, equivalent to penalization of the log-likelihood by the Jeffreys prior. Confidence intervals for regression coefficients can be computed by penalized profile likelihood. Firth's method was proposed as ideal solution to the problem of separation in logistic regression, see Heinze and Schemper (2002) <doi:10.1002/sim.1047>. If needed, the bias reduction can be turned off such that ordinary maximum likelihood logistic regression is obtained. Two new modifications of Firth's method, FLIC and FLAC, lead to unbiased predictions and are now available in the package as well, see Puhr et al (2017) <doi:10.1002/sim.7273>.

Maintained by Georg Heinze. Last updated 2 years ago.

3.4 match 12 stars 9.23 score 346 scripts 16 dependents

alinetalhouk

diceR:Diverse Cluster Ensemble in R

Performs cluster analysis using an ensemble clustering framework, Chiu & Talhouk (2018) <doi:10.1186/s12859-017-1996-y>. Results from a diverse set of algorithms are pooled together using methods such as majority voting, K-Modes, LinkCluE, and CSPA. There are options to compare cluster assignments across algorithms using internal and external indices, visualizations such as heatmaps, and significance testing for the existence of clusters.

Maintained by Derek Chiu. Last updated 1 months ago.

cpp

3.9 match 37 stars 8.13 score 60 scripts 3 dependents

daiscode

TestDataImputation:Missing Item Responses Imputation for Test and Assessment Data

Functions for imputing missing item responses for dichotomous and polytomous test and assessment data. This package enables missing imputation methods that are suitable for test and assessment data, including: listwise (LW) deletion (see De Ayala et al. 2001 <doi:10.1111/j.1745-3984.2001.tb01124.x>), treating as incorrect (IN, see Lord, 1974 <doi: 10.1111/j.1745-3984.1974.tb00996.x>; Mislevy & Wu, 1996 <doi: 10.1002/j.2333-8504.1996.tb01708.x>; Pohl et al., 2014 <doi: 10.1177/0013164413504926>), person mean imputation (PM), item mean imputation (IM), two-way (TW) and response function (RF) imputation, (see Sijtsma & van der Ark, 2003 <doi: 10.1207/s15327906mbr3804_4>), logistic regression (LR) imputation, predictive mean matching (PMM), and expectation–maximization (EM) imputation (see Finch, 2008 <doi: 10.1111/j.1745-3984.2008.00062.x>).

Maintained by Shenghai Dai. Last updated 3 years ago.

17.2 match 1.82 score 11 scripts 1 dependents

marco-geraci

Qtools:Utilities for Quantiles

Functions for unconditional and conditional quantiles. These include methods for transformation-based quantile regression, quantile-based measures of location, scale and shape, methods for quantiles of discrete variables, quantile-based multiple imputation, restricted quantile regression, directional quantile classification, and quantile ratio regression. A vignette is given in Geraci (2016, The R Journal) <doi:10.32614/RJ-2016-037> and included in the package.

Maintained by Marco Geraci. Last updated 1 years ago.

cpp

7.6 match 4.10 score 33 scripts 2 dependents

abshev

superMICE:SuperLearner Method for MICE

Adds a Super Learner ensemble model method (using the 'SuperLearner' package) to the 'mice' package. Laqueur, H. S., Shev, A. B., Kagawa, R. M. C. (2021) <doi:10.1093/aje/kwab271>.

Maintained by Aaron B. Shev. Last updated 3 years ago.

9.8 match 3 stars 3.18 score

jwb133

bootImpute:Bootstrap Inference for Multiple Imputation

Bootstraps and imputes incomplete datasets. Then performs inference on estimates obtained from analysing the imputed datasets as proposed by von Hippel and Bartlett (2021) <doi:10.1214/20-STS793>.

Maintained by Jonathan Bartlett. Last updated 6 months ago.

11.4 match 1 stars 2.70 score 6 scripts

bioc

FLAMES:FLAMES: Full Length Analysis of Mutations and Splicing in long read RNA-seq data

Semi-supervised isoform detection and annotation from both bulk and single-cell long read RNA-seq data. Flames provides automated pipelines for analysing isoforms, as well as intermediate functions for manual execution.

Maintained by Changqing Wang. Last updated 4 days ago.

rnaseq singlecell transcriptomics dataimport differentialsplicing alternativesplicing geneexpression longread zlib curl bzip2 xz-utils cpp

3.9 match 31 stars 7.95 score 12 scripts

jimb3

BinaryDosage:Creates, Merges, and Reads Binary Dosage Files

Tools to create binary dosage from either VCF or GEN files, merge binary dosage files, and read binary dosage files.

Maintained by John Morrison. Last updated 5 years ago.

cpp

7.6 match 4.06 score 33 scripts

bioc

structToolbox:Data processing & analysis tools for Metabolomics and other omics

An extensive set of data (pre-)processing and analysis methods and tools for metabolomics and other omics, with a strong emphasis on statistics and machine learning. This toolbox allows the user to build extensive and standardised workflows for data analysis. The methods and tools have been implemented using class-based templates provided by the struct (Statistics in R Using Class-based Templates) package. The toolbox includes pre-processing methods (e.g. signal drift and batch correction, normalisation, missing value imputation and scaling), univariate (e.g. ttest, various forms of ANOVA, Kruskal–Wallis test and more) and multivariate statistical methods (e.g. PCA and PLS, including cross-validation and permutation testing) as well as machine learning methods (e.g. Support Vector Machines). The STATistics Ontology (STATO) has been integrated and implemented to provide standardised definitions for the different methods, inputs and outputs.

Maintained by Gavin Rhys Lloyd. Last updated 24 days ago.

workflowstep metabolomics bioconductor-package dims lc-ms machine-learning multivariate-analysis statistics univariate

4.9 match 10 stars 6.26 score 12 scripts

jknowles

merTools:Tools for Analyzing Mixed Effect Regression Models

Provides methods for extracting results from mixed-effect model objects fit with the 'lme4' package. Allows construction of prediction intervals efficiently from large scale linear and generalized linear mixed-effects models. This method draws from the simulation framework used in the Gelman and Hill (2007) textbook: Data Analysis Using Regression and Multilevel/Hierarchical Models.

Maintained by Jared E. Knowles. Last updated 1 years ago.

2.9 match 105 stars 10.49 score 768 scripts

decisionpatterns

na.tools:Comprehensive Library for Working with Missing (NA) Values in Vectors

This comprehensive toolkit provide a consistent and extensible framework for working with missing values in vectors. The companion package 'tidyimpute' provides similar functionality for list-like and table-like structures). Functions exist for detection, removal, replacement, imputation, recollection, etc. of 'NAs'.

Maintained by Christopher Brown. Last updated 6 years ago.

7.5 match 2 stars 4.04 score 109 scripts

alexanderrobitzsch

mdmb:Model Based Treatment of Missing Data

Contains model-based treatment of missing data for regression models with missing values in covariates or the dependent variable using maximum likelihood or Bayesian estimation (Ibrahim et al., 2005; <doi:10.1198/016214504000001844>; Luedtke, Robitzsch, & West, 2020a, 2020b; <doi:10.1080/00273171.2019.1640104><doi:10.1037/met0000233>). The regression model can be nonlinear (e.g., interaction effects, quadratic effects or B-spline functions). Multilevel models with missing data in predictors are available for Bayesian estimation. Substantive-model compatible multiple imputation can be also conducted.

Maintained by Alexander Robitzsch. Last updated 8 months ago.

missing-data multiple-imputation openblas cpp

8.0 match 4 stars 3.78 score 26 scripts

nanxstats

hdnom:Benchmarking and Visualization Toolkit for Penalized Cox Models

Creates nomogram visualizations for penalized Cox regression models, with the support of reproducible survival model building, validation, calibration, and comparison for high-dimensional data.

Maintained by Nan Xiao. Last updated 6 months ago.

benchmark high-dimensional-data linear-regression nomogram-visualization penalized-cox-models survival-analysis openblas

3.8 match 43 stars 8.07 score 68 scripts 1 dependents

cran

mix:Estimation/Multiple Imputation for Mixed Categorical and Continuous Data

Estimation/multiple imputation programs for mixed categorical and continuous data.

Maintained by Brian Ripley. Last updated 3 months ago.

fortran

7.2 match 2 stars 4.21 score 5 dependents

wraff

wrProteo:Proteomics Data Analysis Functions

Data analysis of proteomics experiments by mass spectrometry is supported by this collection of functions mostly dedicated to the analysis of (bottom-up) quantitative (XIC) data. Fasta-formatted proteomes (eg from UniProt Consortium <doi:10.1093/nar/gky1049>) can be read with automatic parsing and multiple annotation types (like species origin, abbreviated gene names, etc) extracted. Initial results from multiple software for protein (and peptide) quantitation can be imported (to a common format): MaxQuant (Tyanova et al 2016 <doi:10.1038/nprot.2016.136>), Dia-NN (Demichev et al 2020 <doi:10.1038/s41592-019-0638-x>), Fragpipe (da Veiga et al 2020 <doi:10.1038/s41592-020-0912-y>), ionbot (Degroeve et al 2021 <doi:10.1101/2021.07.02.450686>), MassChroq (Valot et al 2011 <doi:10.1002/pmic.201100120>), OpenMS (Strauss et al 2021 <doi:10.1038/nmeth.3959>), ProteomeDiscoverer (Orsburn 2021 <doi:10.3390/proteomes9010015>), Proline (Bouyssie et al 2020 <doi:10.1093/bioinformatics/btaa118>), AlphaPept (preprint Strauss et al <doi:10.1101/2021.07.23.453379>) and Wombat-P (Bouyssie et al 2023 <doi:10.1021/acs.jproteome.3c00636>. Meta-data provided by initial analysis software and/or in sdrf format can be integrated to the analysis. Quantitative proteomics measurements frequently contain multiple NA values, due to physical absence of given peptides in some samples, limitations in sensitivity or other reasons. Help is provided to inspect the data graphically to investigate the nature of NA-values via their respective replicate measurements and to help/confirm the choice of NA-replacement algorithms. Meta-data in sdrf-format (Perez-Riverol et al 2020 <doi:10.1021/acs.jproteome.0c00376>) or similar tabular formats can be imported and included. Missing values can be inspected and imputed based on the concept of NA-neighbours or other methods. Dedicated filtering and statistical testing using the framework of package 'limma' <doi:10.18129/B9.bioc.limma> can be run, enhanced by multiple rounds of NA-replacements to provide robustness towards rare stochastic events. Multi-species samples, as frequently used in benchmark-tests (eg Navarro et al 2016 <doi:10.1038/nbt.3685>, Ramus et al 2016 <doi:10.1016/j.jprot.2015.11.011>), can be run with special options considering such sub-groups during normalization and testing. Subsequently, ROC curves (Hand and Till 2001 <doi:10.1023/A:1010920819831>) can be constructed to compare multiple analysis approaches. As detailed example the data-set from Ramus et al 2016 <doi:10.1016/j.jprot.2015.11.011>) quantified by MaxQuant, ProteomeDiscoverer, and Proline is provided with a detailed analysis of heterologous spike-in proteins.

Maintained by Wolfgang Raffelsberger. Last updated 4 months ago.

8.2 match 3.67 score 17 scripts 1 dependents