R-universe search: challenge

bioc

KBoost:Inference of gene regulatory networks from gene expression data

Reconstructing gene regulatory networks and transcription factor activity is crucial to understand biological processes and holds potential for developing personalized treatment. Yet, it is still an open problem as state-of-art algorithm are often not able to handle large amounts of data. Furthermore, many of the present methods predict numerous false positives and are unable to integrate other sources of information such as previously known interactions. Here we introduce KBoost, an algorithm that uses kernel PCA regression, boosting and Bayesian model averaging for fast and accurate reconstruction of gene regulatory networks. KBoost can also use a prior network built on previously known transcription factor targets. We have benchmarked KBoost using three different datasets against other high performing algorithms. The results show that our method compares favourably to other methods across datasets.

Maintained by Luis F. Iglesias-Martinez. Last updated 5 months ago.

network graphandnetwork bayesian networkinference generegulation transcriptomics systemsbiology transcription geneexpression regression principalcomponent

40.1 match 4 stars 4.60 score 9 scripts

doehm

survivoR:Data from all Seasons of Survivor (US) TV Series in Tidy Format

Datasets detailing the results, castaways, and events of each season of Survivor for the US, Australia, South Africa, New Zealand, and the UK. This includes details on the cast, voting history, immunity and reward challenges, jury votes, boot order, advantage details, and episode ratings. Use this for analysis of trends and statistics of the game.

Maintained by Daniel Oehm. Last updated 4 days ago.

14.0 match 73 stars 7.08 score 94 scripts

adrtod

rchallenge:A Simple Data Science Challenge System

A simple data science challenge system using R Markdown and 'Dropbox' <https://www.dropbox.com/>. It requires no network configuration, does not depend on external platforms like e.g. 'Kaggle' <https://www.kaggle.com/> and can be easily installed on a personal computer.

Maintained by Adrien Todeschini. Last updated 4 years ago.

challenge data-science

20.7 match 7 stars 3.85 score 20 scripts

dicook

mulgar:Functions for Pre-Processing Data for Multivariate Data Visualisation using Tours

This is a companion to the book Cook, D. and Laa, U. (2023) <https://dicook.github.io/mulgar_book/> "Interactively exploring high-dimensional data and models in R". by Cook and Laa. It contains useful functions for processing data in preparation for visualising with a tour. There are also several sample data sets.

Maintained by Dianne Cook. Last updated 2 months ago.

15.0 match 4 stars 4.50 score 79 scripts

coolbutuseless

emphatic:Exploratory Analysis of Tabular Data using Colour Highlighting

Tools for exploratory analysis of tabular data using colour highlighting. Highlighting is displayed in any console supporting 'ANSI' colours, and can be converted to 'HTML', 'typst', 'latex' and 'SVG'. 'quarto' and 'rmarkdown' rendering are directly supported. It is also possible to add colour to regular expression matches and highlight differences between two arbitrary R objects.

Maintained by Mike Cheng. Last updated 3 months ago.

8.0 match 141 stars 7.55 score 12 scripts

apreshill

bakeoff:Data from "The Great British Bake Off"

Data about the bakers, challenges, and ratings for "The Great British Bake Off", from Wikipedia <https://en.wikipedia.org/wiki/The_Great_British_Bake_Off>.

Maintained by Alison Hill. Last updated 2 years ago.

10.5 match 67 stars 5.71 score 77 scripts

miyamot0

fxl:'fxl' Single Case Design Charting Package

The 'fxl' Charting package is used to prepare and design single case design figures that are typically prepared in spreadsheet software. With 'fxl', there is no need to leave the R environment to prepare these works and many of the more unique conventions in single case experimental designs can be performed without the need for physically constructing features of plots (e.g., drawing annotations across plots). Support is provided for various different plotting arrangements (e.g., multiple baseline), annotations (e.g., brackets, arrows), and output formats (e.g., svg, rasters).

Maintained by Shawn Gilroy. Last updated 3 months ago.

behavior-analysis single-case-design visual-analysis

10.8 match 8 stars 5.46 score 24 scripts

bioc

CellNOptR:Training of boolean logic models of signalling networks using prior knowledge networks and perturbation data

This package does optimisation of boolean logic networks of signalling pathways based on a previous knowledge network and a set of data upon perturbation of the nodes in the network.

Maintained by Attila Gabor. Last updated 5 months ago.

cellbasedassays cellbiology proteomics pathways network timecourse immunooncology

7.5 match 6.72 score 98 scripts 6 dependents

r-lib

testthat:Unit Testing for R

Software testing is important, but, in part because it is frustrating and boring, many of us avoid it. 'testthat' is a testing framework for R that is easy to learn and use, and integrates with your existing 'workflow'.

Maintained by Hadley Wickham. Last updated 18 days ago.

unit-testing cpp

2.0 match 900 stars 20.97 score 74k scripts 465 dependents

friendly

HistData:Data Sets from the History of Statistics and Data Visualization

The 'HistData' package provides a collection of small data sets that are interesting and important in the history of statistics and data visualization. The goal of the package is to make these available, both for instructional use and for historical research. Some of these present interesting challenges for graphics or analysis in R.

Maintained by Michael Friendly. Last updated 10 months ago.

graphics historical-data

4.5 match 63 stars 9.19 score 732 scripts 2 dependents

openintrostat

openintro:Datasets and Supplemental Functions from 'OpenIntro' Textbooks and Labs

Supplemental functions and data for 'OpenIntro' resources, which includes open-source textbooks and resources for introductory statistics (<https://www.openintro.org/>). The package contains datasets used in our open-source textbooks along with custom plotting functions for reproducing book figures. Note that many functions and examples include color transparency; some plotting elements may not show up properly (or at all) when run in some versions of Windows operating system.

Maintained by Mine Çetinkaya-Rundel. Last updated 3 months ago.

data openintro

3.6 match 240 stars 11.39 score 6.0k scripts

rconsortium

S7:An Object Oriented System Meant to Become a Successor to S3 and S4

A new object oriented programming system designed to be a successor to S3 and S4. It includes formal class, generic, and method specification, and a limited form of multiple dispatch. It has been designed and implemented collaboratively by the R Consortium Object-Oriented Programming Working Group, which includes representatives from R-Core, 'Bioconductor', 'Posit'/'tidyverse', and the wider R community.

Maintained by Hadley Wickham. Last updated 4 months ago.

3.0 match 432 stars 13.15 score 86 scripts 22 dependents

bioc

IgGeneUsage:Differential gene usage in immune repertoires

Detection of biases in the usage of immunoglobulin (Ig) genes is an important task in immune repertoire profiling. IgGeneUsage detects aberrant Ig gene usage between biological conditions using a probabilistic model which is analyzed computationally by Bayes inference. With this IgGeneUsage also avoids some common problems related to the current practice of null-hypothesis significance testing.

Maintained by Simo Kitanovski. Last updated 5 months ago.

differentialexpression regression genetics bayesian biomedicalinformatics immunooncology mathematicalbiology b-cell-receptor bcr-repertoire differential-analysis differential-gene-expression high-throughput-sequencing immune-repertoire immune-repertoire-analysis immune-repertoires immunogenomics immunoglobulin immunoinformatics immunological-bioinformatics immunology tcr-repertoire vdj-recombination cpp

6.6 match 6 stars 5.92 score 1 scripts

julianfaraway

faraway:Datasets and Functions for Books by Julian Faraway

Books are "Linear Models with R" published 1st Ed. August 2004, 2nd Ed. July 2014, 3rd Ed. February 2025 by CRC press, ISBN 9781439887332, and "Extending the Linear Model with R" published by CRC press in 1st Ed. December 2005 and 2nd Ed. March 2016, ISBN 9781584884248 and "Practical Regression and ANOVA in R" contributed documentation on CRAN (now very dated).

Maintained by Julian Faraway. Last updated 1 months ago.

data

3.6 match 29 stars 9.43 score 1.7k scripts 1 dependents

krzjoa

m5:'M5 Forecasting' Challenges Data

Contains functions, which facilitate downloading, loading and preparing data from 'M5 Forecasting' challenges (by 'University of Nicosia', hosted on 'Kaggle'). The data itself is set of time series of different product sales in 'Walmart'. The package also includes a ready-to-use built-in M5 subset named 'tiny_m5'. For detailed information about the challenges, see: Makridakis, Spyros & Spiliotis, Evangelos & Assimakopoulos, Vassilis. (2020). The M5 Accuracy competition: Results, findings and conclusions. <doi:10.1016/j.ijforecast.2021.10.009>

Maintained by Krzysztof Joachimiak. Last updated 3 years ago.

data-science kaggle-competition kaggle-dataset m5-competition m5-forecasting time-series-forecasting walmart walmart-sales-forecasting

7.3 match 2 stars 4.45 score 28 scripts

celevitz

topChef:Top Chef Data

Several datasets which describe the chef contestants in Top Chef, the challenges that they compete in, and the results of those challenges. This data is useful for practicing data wrangling, graphing, and analyzing how each season of Top Chef played out.

Maintained by Levitz Carly E. Last updated 2 days ago.

5.3 match 3 stars 5.99 score 26 scripts

wjbraun

DAAG:Data Analysis and Graphics Data and Functions

Functions and data sets used in examples and exercises in the text Maindonald, J.H. and Braun, W.J. (2003, 2007, 2010) "Data Analysis and Graphics Using R", and in an upcoming Maindonald, Braun, and Andrews text that builds on this earlier text.

Maintained by W. John Braun. Last updated 11 months ago.

3.8 match 8.25 score 1.2k scripts 1 dependents

reed-evic

cpsvote:A Toolbox for Using the CPS’s Voting and Registration Supplement

Provides automated methods for downloading, recoding, and merging selected years of the Current Population Survey's Voting and Registration Supplement, a large N national survey about registration, voting, and non-voting in United States federal elections. Provides documentation for appropriate use of sample weights to generate statistical estimates, drawing from Hur & Achen (2013) <doi:10.1093/poq/nft042> and McDonald (2018) <http://www.electproject.org/home/voter-turnout/voter-turnout-data>.

Maintained by Jay Lee. Last updated 2 years ago.

5.5 match 3 stars 5.58 score 21 scripts

alanarnholt

BSDA:Basic Statistics and Data Analysis

Data sets for book "Basic Statistics and Data Analysis" by Larry J. Kitchens.

Maintained by Alan T. Arnholt. Last updated 2 years ago.

3.4 match 7 stars 9.11 score 1.3k scripts 6 dependents

bioc

GARS:GARS: Genetic Algorithm for the identification of Robust Subsets of variables in high-dimensional and challenging datasets

Feature selection aims to identify and remove redundant, irrelevant and noisy variables from high-dimensional datasets. Selecting informative features affects the subsequent classification and regression analyses by improving their overall performances. Several methods have been proposed to perform feature selection: most of them relies on univariate statistics, correlation, entropy measurements or the usage of backward/forward regressions. Herein, we propose an efficient, robust and fast method that adopts stochastic optimization approaches for high-dimensional. GARS is an innovative implementation of a genetic algorithm that selects robust features in high-dimensional and challenging datasets.

Maintained by Mattia Chiesa. Last updated 5 months ago.

classification featureextraction clustering openjdk

6.0 match 5.00 score 2 scripts

lightbluetitan

usdatasets:A Comprehensive Collection of U.S. Datasets

Provides a diverse collection of U.S. datasets encompassing various fields such as crime, economics, education, finance, energy, healthcare, and more. It serves as a valuable resource for researchers and analysts seeking to perform in-depth analyses and derive insights from U.S.-specific data.

Maintained by Renzo Caceres Rossi. Last updated 5 months ago.

3.6 match 7 stars 5.99 score 141 scripts

learnitr

learnitdown:R Markdown, Bookdown and Learnr Additions for Learning Material

Extension to R Markdown, Bookdown and Learnr for building better learning and e-learning material: H5P integration, course-contextual divs, differed loading of Shiny and learnr applications, and much more ...

Maintained by Philippe Grosjean. Last updated 6 months ago.

bookdown learning-resources r-markdown teaching-materials

4.8 match 13 stars 4.49 score 16 scripts

njlyon0

dndR:Dungeons & Dragons Functions for Players and Dungeon Masters

The goal of 'dndR' is to provide a suite of Dungeons & Dragons related functions. This package is meant to be useful both to players and Dungeon Masters (DMs). All functions currently focus on Fifth Edition (a.k.a. "5e") but once the next edition is published functions will likely be expanded to include any rule changes.

Maintained by Nicholas Lyon. Last updated 11 months ago.

data-science dungeons-and-dragons ttrpg

3.0 match 17 stars 6.98 score 16 scripts

fbertran

ModStatR:Statistical Modelling in Action with R

Datasets and functions for the book "Modélisation statistique par la pratique avec R", F. Bertrand, E. Claeys and M. Maumy-Bertrand (2019, ISBN:9782100793525, Dunod, Paris). The first chapter of the book is dedicated to an introduction to the R statistical software. The second chapter deals with correlation analysis: Pearson, Spearman and Kendall simple, multiple and partial correlation coefficients. New wrapper functions for permutation tests or bootstrap of matrices of correlation are provided with the package. The third chapter is dedicated to data exploration with factorial analyses (PCA, CA, MCA, MDA) and clustering. The fourth chapter is dedicated to regression analysis: fitting and model diagnostics are detailed. The exercises focus on covariance analysis, logistic regression, Poisson regression, two-way analysis of variance for fixed or random factors. Various example datasets are shipped with the package: for instance on pokemon, world of warcraft, house tasks or food nutrition analyses.

Maintained by Frederic Bertrand. Last updated 1 years ago.

6.0 match 5 stars 3.40 score 4 scripts

sistm

cytometree:Automated Cytometry Gating and Annotation

Given the hypothesis of a bi-modal distribution of cells for each marker, the algorithm constructs a binary tree, the nodes of which are subpopulations of cells. At each node, observed cells and markers are modeled by both a family of normal distributions and a family of bi-modal normal mixture distributions. Splitting is done according to a normalized difference of AIC between the two families. Method is detailed in: Commenges, Alkhassim, Gottardo, Hejblum & Thiebaut (2018) <doi: 10.1002/cyto.a.23601>.

Maintained by Boris P Hejblum. Last updated 2 years ago.

cpp

3.3 match 9 stars 5.91 score 15 scripts 1 dependents

rudeboybert

resampledata:Data Sets for Mathematical Statistics with Resampling in R

Package of data sets from "Mathematical Statistics with Resampling in R" (1st Ed. 2011, 2nd Ed. 2018) by Laura Chihara and Tim Hesterberg.

Maintained by Albert Y. Kim. Last updated 4 months ago.

3.8 match 15 stars 5.15 score 187 scripts

angabrio

missingHE:Missing Outcome Data in Health Economic Evaluation

Contains a suite of functions for health economic evaluations with missing outcome data. The package can fit different types of statistical models under a fully Bayesian approach using the software 'JAGS' (which should be installed locally and which is loaded in 'missingHE' via the 'R' package 'R2jags'). Three classes of models can be fitted under a variety of missing data assumptions: selection models, pattern mixture models and hurdle models. In addition to model fitting, 'missingHE' provides a set of specialised functions to assess model convergence and fit, and to summarise the statistical and economic results using different types of measures and graphs. The methods implemented are described in Mason (2018) <doi:10.1002/hec.3793>, Molenberghs (2000) <doi:10.1007/978-1-4419-0300-6_18> and Gabrio (2019) <doi:10.1002/sim.8045>.

Maintained by Andrea Gabrio. Last updated 2 years ago.

cost-effectiveness-analysis health-economic-evaluation individual-level-data jags missing-data parametric-modelling sensitivity-analysis cpp

3.4 match 5 stars 5.38 score 24 scripts

modeloriented

DALEXtra:Extension for 'DALEX' Package

Provides wrapper of various machine learning models. In applied machine learning, there is a strong belief that we need to strike a balance between interpretability and accuracy. However, in field of the interpretable machine learning, there are more and more new ideas for explaining black-box models, that are implemented in 'R'. 'DALEXtra' creates 'DALEX' Biecek (2018) <arXiv:1806.08915> explainer for many type of models including those created using 'python' 'scikit-learn' and 'keras' libraries, and 'java' 'h2o' library. Important part of the package is Champion-Challenger analysis and innovative approach to model performance across subsets of test data presented in Funnel Plot.

Maintained by Szymon Maksymiuk. Last updated 2 years ago.

extension-for-dalex-package

2.4 match 67 stars 7.71 score 400 scripts 1 dependents

antonio-pgarcia

evoper:Evolutionary Parameter Estimation for 'Repast Simphony' Models

The EvoPER, Evolutionary Parameter Estimation for Individual-based Models is an extensible package providing optimization driven parameter estimation methods using metaheuristics and evolutionary computation techniques (Particle Swarm Optimization, Simulated Annealing, Ant Colony Optimization for continuous domains, Tabu Search, Evolutionary Strategies, ...) which could be more efficient and require, in some cases, fewer model evaluations than alternatives relying on experimental design. Currently there are built in support for models developed with 'Repast Simphony' Agent-Based framework (<https://repast.github.io/>) and with NetLogo (<https://ccl.northwestern.edu/netlogo/>) which are the most used frameworks for Agent-based modeling.

Maintained by Antonio Prestes Garcia. Last updated 5 years ago.

openjdk

4.5 match 6 stars 3.92 score 28 scripts

loelschlaeger

fHMM:Fitting Hidden Markov Models to Financial Data

Fitting (hierarchical) hidden Markov models to financial data via maximum likelihood estimation. See Oelschläger, L. and Adam, T. "Detecting Bearish and Bullish Markets in Financial Time Series Using Hierarchical Hidden Markov Models" (2021, Statistical Modelling) <doi:10.1177/1471082X211034048> for a reference on the method. A user guide is provided by the accompanying software paper "fHMM: Hidden Markov Models for Financial Time Series in R", Oelschläger, L., Adam, T., and Michels, R. (2024, Journal of Statistical Software) <doi:10.18637/jss.v109.i09>.

Maintained by Lennart Oelschläger. Last updated 6 months ago.

finance hidden-markov-models cpp openmp

2.5 match 16 stars 6.95 score 5 scripts

biodiverse

unmarked:Models for Data from Unmarked Animals

Fits hierarchical models of animal abundance and occurrence to data collected using survey methods such as point counts, site occupancy sampling, distance sampling, removal sampling, and double observer sampling. Parameters governing the state and observation processes can be modeled as functions of covariates. References: Kellner et al. (2023) <doi:10.1111/2041-210X.14123>, Fiske and Chandler (2011) <doi:10.18637/jss.v043.i10>.

Maintained by Ken Kellner. Last updated 3 days ago.

openblas cpp openmp

1.3 match 4 stars 13.03 score 652 scripts 12 dependents

sanfordweisberg

alr4:Data to Accompany Applied Linear Regression 4th Edition

Datasets to Accompany S. Weisberg (2014, ISBN: 978-1-118-38608-8), "Applied Linear Regression," 4th edition. Many data files in this package are included in the `alr3` package as well, so only one of them should be used.

Maintained by Sanford Weisberg. Last updated 7 years ago.

4.5 match 1 stars 3.45 score 306 scripts

shikokuchuo

mirai:Minimalist Async Evaluation Framework for R

Designed for simplicity, a 'mirai' evaluates an R expression asynchronously in a parallel process, locally or distributed over the network. The result is automatically available upon completion. Modern networking and concurrency, built on 'nanonext' and 'NNG' (Nanomsg Next Gen), ensures reliable and efficient scheduling over fast inter-process communications or TCP/IP secured by TLS. Distributed computing can launch remote resources via SSH or cluster managers. An inherently queued architecture handles many more tasks than available processes, and requires no storage on the file system. Innovative features include support for otherwise non-exportable reference objects, event-driven promises, and asynchronous parallel map.

Maintained by Charlie Gao. Last updated 3 hours ago.

async asynchronous-tasks concurrency distributed-computing high-performance-computing parallel-computing

1.3 match 217 stars 11.89 score 130 scripts 7 dependents

higgi13425

medicaldata:Data Package for Medical Datasets

Provides access to well-documented medical datasets for teaching. Featuring several from the Teaching of Statistics in the Health Sciences website <https://www.causeweb.org/tshs/category/dataset/>, a few reconstructed datasets of historical significance in medical research, some reformatted and extended from existing R packages, and some data donations.

Maintained by Peter Higgins. Last updated 2 years ago.

datasets

2.0 match 48 stars 7.43 score 317 scripts

kwstat

agridat:Agricultural Datasets

Datasets from books, papers, and websites related to agriculture. Example graphics and analyses are included. Data come from small-plot trials, multi-environment trials, uniformity trials, yield monitors, and more.

Maintained by Kevin Wright. Last updated 30 days ago.

data

1.3 match 126 stars 10.78 score 1.7k scripts 1 dependents

carpentries

sandpaper:Create and Curate Carpentries Lessons

We provide tools to build a Carpentries-themed lesson repository into an accessible standalone static website. These include local tools and those designed to be used in a continuous integration context so that all the lesson author needs to focus on is writing the content of the actual lesson.

Maintained by Robert Davey. Last updated 2 months ago.

carpentries carpentries-infrastructure carpentries-workbench lesson-template lessons markdown static-site-generator

1.8 match 44 stars 7.68 score 8 scripts

jacob-long

panelr:Regression Models and Utilities for Repeated Measures and Panel Data

Provides an object type and associated tools for storing and wrangling panel data. Implements several methods for creating regression models that take advantage of the unique aspects of panel data. Among other capabilities, automates the "within-between" (also known as "between-within" and "hybrid") panel regression specification that combines the desirable aspects of both fixed effects and random effects econometric models and fits them as multilevel models (Allison, 2009 <doi:10.4135/9781412993869.d33>; Bell & Jones, 2015 <doi:10.1017/psrm.2014.7>). These models can also be estimated via generalized estimating equations (GEE; McNeish, 2019 <doi:10.1080/00273171.2019.1602504>) and Bayesian estimation is (optionally) supported via 'Stan'. Supports estimation of asymmetric effects models via first differences (Allison, 2019 <doi:10.1177/2378023119826441>) as well as a generalized linear model extension thereof using GEE.

Maintained by Jacob A. Long. Last updated 1 years ago.

social-science statistics

1.5 match 101 stars 8.76 score 181 scripts 1 dependents

itsleeds

pct:Propensity to Cycle Tool

Functions and example data to teach and increase the reproducibility of the methods and code underlying the Propensity to Cycle Tool (PCT), a research project and web application hosted at <https://www.pct.bike/>. For an academic paper on the methods, see Lovelace et al (2017) <doi:10.5198/jtlu.2016.862>.

Maintained by Robin Lovelace. Last updated 14 days ago.

2.0 match 20 stars 6.54 score

mrc-ide

malariasimulation:An individual based model for malaria

Specifies the latest and greatest malaria model.

Maintained by Giovanni Charles. Last updated 29 days ago.

cpp

1.5 match 16 stars 8.17 score 146 scripts

lukejharmon

geiger:Analysis of Evolutionary Diversification

Methods for fitting macroevolutionary models to phylogenetic trees Pennell (2014) <doi:10.1093/bioinformatics/btu181>.

Maintained by Luke Harmon. Last updated 2 years ago.

openblas cpp

1.6 match 1 stars 7.84 score 2.3k scripts 28 dependents

mikejohnson51

climateR:climateR

Find, subset, and retrive geospatial data by AOI.

Maintained by Mike Johnson. Last updated 3 months ago.

aoi climate dataset geospatial gridded-climate-data weather

1.3 match 187 stars 8.74 score 156 scripts 1 dependents

chrisbrownlie

bushtucker:'I'm a Celebrity Get Me Out of Here' Data

Data on the first 24 seasons of the UK TV show 'I'm a Celebrity, Get Me Out of Here', broadcast from 2002-2024.

Maintained by Chris Brownlie. Last updated 1 months ago.

3.8 match 3.00 score 3 scripts

bioc

maPredictDSC:Phenotype prediction using microarray data: approach of the best overall team in the IMPROVER Diagnostic Signature Challenge

This package implements the classification pipeline of the best overall team (Team221) in the IMPROVER Diagnostic Signature Challenge. Additional functionality is added to compare 27 combinations of data preprocessing, feature selection and classifier types.

Maintained by Adi Laurentiu Tarca. Last updated 5 months ago.

microarray classification

4.8 match 2.30 score 2 scripts

f0nzie

rODE:Ordinary Differential Equation (ODE) Solvers Written in R Using S4 Classes

Show physics, math and engineering students how an ODE solver is made and how effective R classes can be for the construction of the equations that describe natural phenomena. Inspiration for this work comes from the book on "Computer Simulations in Physics" by Harvey Gould, Jan Tobochnik, and Wolfgang Christian. Book link: <http://www.compadre.org/osp/items/detail.cfm?ID=7375>.

Maintained by Alfonso R. Reyes. Last updated 7 years ago.

2.0 match 5.50 score 71 scripts

canmod

macpan2:Fast and Flexible Compartmental Modelling

Fast and flexible compartmental modelling with Template Model Builder.

Maintained by Steve Walker. Last updated 1 days ago.

compartmental-models epidemiology forecasting mixed-effects model-fitting optimization simulation simulation-modeling cpp

1.2 match 4 stars 8.90 score 246 scripts 1 dependents

bioc

cellxgenedp:Discover and Access Single Cell Data Sets in the CELLxGENE Data Portal

The cellxgene data portal (https://cellxgene.cziscience.com/) provides a graphical user interface to collections of single-cell sequence data processed in standard ways to 'count matrix' summaries. The cellxgenedp package provides an alternative, R-based inteface, allowind data discovery, viewing, and downloading.

Maintained by Martin Morgan. Last updated 5 months ago.

singlecell dataimport thirdpartyclient

1.5 match 8 stars 6.64 score 27 scripts

usepa

ctxR:Utilities for Interacting with the 'CTX' APIs

Access chemical, hazard, bioactivity, and exposure data from the Computational Toxicology and Exposure ('CTX') APIs <https://www.epa.gov/comptox-tools/computational-toxicology-and-exposure-apis>. 'ctxR' was developed to streamline the process of accessing the information available through the 'CTX' APIs without requiring prior knowledge of how to use APIs. Most data is also available on the CompTox Chemical Dashboard ('CCD') <https://comptox.epa.gov/dashboard/> and other resources found at the EPA Computational Toxicology and Exposure Online Resources <https://www.epa.gov/comptox-tools>.

Maintained by Paul Kruse. Last updated 2 months ago.

ccte comptox ord

1.2 match 10 stars 8.02 score 13 scripts 1 dependents

githubwilly

SymbolicDeterminants:Symbolic Representation of Matrix Determinant

Creates a numeric guide for writing the formula for the determinant of a square matrix (a detguide) as a function of the elements of the matrix and writes out that formula, the symbolic representation.

Maintained by William Fairweather. Last updated 4 years ago.

3.5 match 2.70 score

bergsmat

nonmemica:Create and Evaluate NONMEM Models in a Project Context

Systematically creates and modifies NONMEM(R) control streams. Harvests NONMEM output, builds run logs, creates derivative data, generates diagnostics. NONMEM (ICON Development Solutions <https://www.iconplc.com/>) is software for nonlinear mixed effects modeling. See 'package?nonmemica'.

Maintained by Tim Bergsma. Last updated 2 months ago.

2.0 match 4 stars 4.58 score 45 scripts

timhesterberg

resampledata3:Data Sets for "Mathematical Statistics with Resampling and R" (3rd Ed)

Data sets for Chihara and Hesterberg (2022, ISBN: 978-1-119-87404-1) "Mathematical Statistics with Resampling in R" (3rd Ed).

Maintained by Tim Hesterberg. Last updated 3 years ago.

6.0 match 1.52 score 33 scripts

tuncerkerem

experimentr:Datasets Used in Social Science Experiments: A Hands-on Introduction

Contains all the datasets that were used in Social Science Experiments: A Hands-On Introduction and in its R Companion. Relevant materials can be found at <https://osf.io/b78je>.

Maintained by Kerem Tuncer. Last updated 3 years ago.

3.3 match 2.70 score 8 scripts

great-northern-diver

loon.ggplot:A Grammar of Interactive Graphics

Provides a bridge between the 'loon' and 'ggplot2' packages. Extends the grammar of ggplot to add clauses to create interactive 'loon' plots. Existing ggplot(s) can be turned into interactive 'loon' plots and 'loon' plots into static ggplot(s); the function 'loon.ggplot()' is the bridge from one plot structure to the other.

Maintained by Zehao Xu. Last updated 10 months ago.

data-analysis ggplot ggplot-features graphics interactive-plots loon visualizations

1.3 match 24 stars 7.11 score 9 scripts 3 dependents

bryanhanson

readJDX:Import Data in the JCAMP-DX Format

Import data written in the JCAMP-DX format. This is an instrument-independent format used in the field of spectroscopy. Examples include IR, NMR, and Raman spectroscopy. See the vignette for background and supported formats. The official JCAMP-DX site is <http://www.jcamp-dx.org/>.

Maintained by Bryan A. Hanson. Last updated 1 years ago.

jcamp-dx-format spectroscopy

1.3 match 8 stars 6.48 score 7 scripts 5 dependents

rnuske

komaletter:Simply Beautiful PDF Letters from Markdown

Write beautiful yet customizable letters in R Markdown and directly obtain the finished PDF. Smooth generation of PDFs is realized by 'rmarkdown', the 'pandoc-letter' template and the 'KOMA-Script' letter class. 'KOMA-Script' provides enhanced replacements for the standard 'LaTeX' classes with emphasis on typography and versatility. 'KOMA-Script' is particularly useful for international writers as it handles various paper formats well, provides layouts for many common window envelope types (e.g. German, US, French, Japanese) and lets you define your own layouts. The package comes with a default letter layout based on 'DIN 5008B'.

Maintained by Robert Nuske. Last updated 9 months ago.

koma-script latex letter markdown pandoc pandoc-letter pdf

1.3 match 87 stars 6.78 score 3 scripts

dpmcsuss

iGraphMatch:Tools for Graph Matching

Versatile tools and data for graph matching analysis with various forms of prior information that supports working with 'igraph' objects, matrix objects, or lists of either.

Maintained by Daniel Sussman. Last updated 10 months ago.

graph-algorithms graph-matching cpp

1.5 match 9 stars 5.65 score 9 scripts

carpentries

pegboard:Explore and Manipulate Markdown Curricula from the Carpentries

The Carpentries (<https://carpentries.org>) curricula is made of of lessons that are hosted as websites. Each lesson represents between a half day to two days of instruction and contains several episodes, which are written as 'kramdown'-flavored 'markdown' documents and converted to HTML using the 'Jekyll' static website generator. This package builds on top of the 'tinkr' package; reads in these markdown documents to 'XML' and stores them in R6 classes for convenient exploration and manipulation of sections within episodes.

Maintained by Robert Davey. Last updated 24 days ago.

1.8 match 6 stars 4.58 score 4 scripts 1 dependents

oobianom

quickcode:Quick and Essential 'R' Tricks for Better Scripts

The NOT functions, 'R' tricks and a compilation of some simple quick plus often used 'R' codes to improve your scripts. Improve the quality and reproducibility of 'R' scripts.

Maintained by Obinna Obianom. Last updated 16 days ago.

colors data distributions images

1.0 match 5 stars 7.76 score 7 scripts 6 dependents

usepa

ccdR:Utilities for Interacting with the 'CTX' APIs

Access chemical, hazard, bioactivity, and exposure data from the Computational Toxicology and Exposure ('CTX') APIs <https://api-ccte.epa.gov/docs/>. 'ccdR' was developed to streamline the process of accessing the information available through the 'CTX' APIs without requiring prior knowledge of how to use APIs. Most data is also available on the CompTox Chemical Dashboard ('CCD') <https://comptox.epa.gov/dashboard/> and other resources found at the EPA Computational Toxicology and Exposure Online Resources <https://www.epa.gov/comptox-tools>.

Maintained by Paul Kruse. Last updated 8 months ago.

1.2 match 2 stars 6.38 score 7 scripts

tidymodels

workflows:Modeling Workflows

Managing both a 'parsnip' model and a preprocessor, such as a model formula or recipe from 'recipes', can often be challenging. The goal of 'workflows' is to streamline this process by bundling the model alongside the preprocessor, all within the same object.

Maintained by Simon Couch. Last updated 27 days ago.

0.5 match 207 stars 13.80 score 876 scripts 43 dependents

simonlabcode

bakR:Analyze and Compare Nucleotide Recoding RNA Sequencing Datasets

Several implementations of a novel Bayesian hierarchical statistical model of nucleotide recoding RNA-seq experiments (NR-seq; TimeLapse-seq, SLAM-seq, TUC-seq, etc.) for analyzing and comparing NR-seq datasets (see 'Vock and Simon' (2023) <doi:10.1261/rna.079451.122>). NR-seq is a powerful extension of RNA-seq that provides information about the kinetics of RNA metabolism (e.g., RNA degradation rate constants), which is notably lacking in standard RNA-seq data. The statistical model makes maximal use of these high-throughput datasets by sharing information across transcripts to significantly improve uncertainty quantification and increase statistical power. 'bakR' includes a maximally efficient implementation of this model for conservative initial investigations of datasets. 'bakR' also provides more highly powered implementations using the probabilistic programming language 'Stan' to sample from the full posterior distribution. 'bakR' performs multiple-test adjusted statistical inference with the output of these model implementations to help biologists separate signal from background. Methods to automatically visualize key results and detect batch effects are also provided.

Maintained by Isaac Vock. Last updated 4 months ago.

cpp

1.2 match 6 stars 6.12 score 21 scripts

pmair78

MPsychoR:Modern Psychometrics with R

Supplementary materials and datasets for the book "Modern Psychometrics With R" (Mair, 2018, Springer useR! series).

Maintained by Patrick Mair. Last updated 5 years ago.

4.0 match 1.73 score 54 scripts

le-huynh

lehuynh:Le-Huynh Truc-Ly's R Code and Templates

Miscellaneous R functions (for graphics, data import, data transformation, and general utilities) and templates (for exploratory analysis, Bayesian modeling, and crafting scientific manuscripts).

Maintained by Truc-Ly Le-Huynh. Last updated 9 months ago.

personal-rpackage

1.8 match 3 stars 3.88 score 4 scripts

connormayer

maxent.ot:Perform Phonological Analyses using Maximum Entropy Optimality Theory

Fit Maximum Entropy Optimality Theory models to data sets, generate the predictions made by such models for novel data, and compare the fit of different models using a variety of metrics. The package is described in Mayer, C., Tan, A., Zuraw, K. (in press) <https://sites.socsci.uci.edu/~cjmayer/papers/cmayer_et_al_maxent_ot_accepted.pdf>.

Maintained by Connor Mayer. Last updated 4 months ago.

1.2 match 8 stars 5.51 score 6 scripts

bioc

SNPRelate:Parallel Computing Toolset for Relatedness and Principal Component Analysis of SNP Data

Genome-wide association studies (GWAS) are widely used to investigate the genetic basis of diseases and traits, but they pose many computational challenges. We developed an R package SNPRelate to provide a binary format for single-nucleotide polymorphism (SNP) data in GWAS utilizing CoreArray Genomic Data Structure (GDS) data files. The GDS format offers the efficient operations specifically designed for integers with two bits, since a SNP could occupy only two bits. SNPRelate is also designed to accelerate two key computations on SNP data using parallel computing for multi-core symmetric multiprocessing computer architectures: Principal Component Analysis (PCA) and relatedness analysis using Identity-By-Descent measures. The SNP GDS format is also used by the GWASTools package with the support of S4 classes and generic functions. The extended GDS format is implemented in the SeqArray package to support the storage of single nucleotide variations (SNVs), insertion/deletion polymorphism (indel) and structural variation calls in whole-genome and whole-exome variant data.

Maintained by Xiuwen Zheng. Last updated 5 months ago.

infrastructure genetics statisticalmethod principalcomponent bioinformatics gds-format pca simd snp openblas cpp

0.5 match 104 stars 12.69 score 1.6k scripts 18 dependents

patzaw

BED:Biological Entity Dictionary (BED)

An interface for the 'Neo4j' database providing mapping between different identifiers of biological entities. This Biological Entity Dictionary (BED) has been developed to address three main challenges. The first one is related to the completeness of identifier mappings. Indeed, direct mapping information provided by the different systems are not always complete and can be enriched by mappings provided by other resources. More interestingly, direct mappings not identified by any of these resources can be indirectly inferred by using mappings to a third reference. For example, many human Ensembl gene ID are not directly mapped to any Entrez gene ID but such mappings can be inferred using respective mappings to HGNC ID. The second challenge is related to the mapping of deprecated identifiers. Indeed, entity identifiers can change from one resource release to another. The identifier history is provided by some resources, such as Ensembl or the NCBI, but it is generally not used by mapping tools. The third challenge is related to the automation of the mapping process according to the relationships between the biological entities of interest. Indeed, mapping between gene and protein ID scopes should not be done the same way than between two scopes regarding gene ID. Also, converting identifiers from different organisms should be possible using gene orthologs information. The method has been published by Godard and van Eyll (2018) <doi:10.12688/f1000research.13925.3>.

Maintained by Patrice Godard. Last updated 3 months ago.

0.9 match 8 stars 6.85 score 25 scripts

a-dudek-ue

mdsOpt:Searching for Optimal MDS Procedure for Metric and Interval-Valued Data

Selecting the optimal multidimensional scaling (MDS) procedure for metric data via metric MDS (ratio, interval, mspline) and nonmetric MDS (ordinal). Selecting the optimal multidimensional scaling (MDS) procedure for interval-valued data via metric MDS (ratio, interval, mspline).Selecting the optimal multidimensional scaling procedure for interval-valued data by varying all combinations of normalization and optimization methods.Selecting the optimal MDS procedure for statistical data referring to the evaluation of tourist attractiveness of Lower Silesian counties. (Borg, I., Groenen, P.J.F., Mair, P. (2013) <doi:10.1007/978-3-642-31848-1>, Walesiak, M. (2016) <doi:10.15611/ekt.2016.2.01>, Walesiak, M. (2017) <doi:10.15611/ekt.2017.3.01>).

Maintained by Andrzej Dudek. Last updated 1 years ago.

2.6 match 2.28 score 19 scripts

erichson

rsvd:Randomized Singular Value Decomposition

Low-rank matrix decompositions are fundamental tools and widely used for data analysis, dimension reduction, and data compression. Classically, highly accurate deterministic matrix algorithms are used for this task. However, the emergence of large-scale data has severely challenged our computational ability to analyze big data. The concept of randomness has been demonstrated as an effective strategy to quickly produce approximate answers to familiar problems such as the singular value decomposition (SVD). The rsvd package provides several randomized matrix algorithms such as the randomized singular value decomposition (rsvd), randomized principal component analysis (rpca), randomized robust principal component analysis (rrpca), randomized interpolative decomposition (rid), and the randomized CUR decomposition (rcur). In addition several plot functions are provided.

Maintained by N. Benjamin Erichson. Last updated 4 years ago.

dimension-reduction matrix-approximation pca principal-component-analysis probabilistic-algorithms randomized-algorithm singular-value-decomposition svd

0.5 match 98 stars 10.80 score 408 scripts 119 dependents

bioc

BASiCS:Bayesian Analysis of Single-Cell Sequencing data

Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model to perform statistical analyses of single-cell RNA sequencing datasets in the context of supervised experiments (where the groups of cells of interest are known a priori, e.g. experimental conditions or cell types). BASiCS performs built-in data normalisation (global scaling) and technical noise quantification (based on spike-in genes). BASiCS provides an intuitive detection criterion for highly (or lowly) variable genes within a single group of cells. Additionally, BASiCS can compare gene expression patterns between two or more pre-specified groups of cells. Unlike traditional differential expression tools, BASiCS quantifies changes in expression that lie beyond comparisons of means, also allowing the study of changes in cell-to-cell heterogeneity. The latter can be quantified via a biological over-dispersion parameter that measures the excess of variability that is observed with respect to Poisson sampling noise, after normalisation and technical noise removal. Due to the strong mean/over-dispersion confounding that is typically observed for scRNA-seq datasets, BASiCS also tests for changes in residual over-dispersion, defined by residual values with respect to a global mean/over-dispersion trend.

Maintained by Catalina Vallejos. Last updated 5 months ago.

immunooncology normalization sequencing rnaseq software geneexpression transcriptomics singlecell differentialexpression bayesian cellbiology bioconductor-package gene-expression rcpp rcpparmadillo scrna-seq single-cell openblas cpp openmp

0.5 match 83 stars 10.26 score 368 scripts 1 dependents

justinmshea

wooldridge:115 Data Sets from "Introductory Econometrics: A Modern Approach, 7e" by Jeffrey M. Wooldridge

Students learning both econometrics and R may find the introduction to both challenging. The wooldridge data package aims to lighten the task by efficiently loading any data set found in the text with a single command. Data sets have been compressed to a fraction of their original size. Documentation files contain page numbers, the original source, time of publication, and notes from the author suggesting avenues for further analysis and research. If one needs an introduction to R model syntax, a vignette contains solutions to examples from chapters of the text. Data sets are from the 7th edition (Wooldridge 2020, ISBN-13 978-1-337-55886-0), and are backwards compatible with all previous versions of the text.

Maintained by Justin M. Shea. Last updated 4 months ago.

econometrics

0.5 match 203 stars 9.38 score 1.4k scripts

bioc

tenXplore:ontological exploration of scRNA-seq of 1.3 million mouse neurons from 10x genomics

Perform ontological exploration of scRNA-seq of 1.3 million mouse neurons from 10x genomics.

Maintained by VJ Carey. Last updated 5 months ago.

immunooncology dimensionreduction principalcomponent transcriptomics singlecell

1.1 match 4.18 score 7 scripts

bioc

scmap:A tool for unsupervised projection of single cell RNA-seq data

Single-cell RNA-seq (scRNA-seq) is widely used to investigate the composition of complex tissues since the technology allows researchers to define cell-types using unsupervised clustering of the transcriptome. However, due to differences in experimental methods and computational analyses, it is often challenging to directly compare the cells identified in two different experiments. scmap is a method for projecting cells from a scRNA-seq experiment on to the cell-types or individual cells identified in a different experiment.

Maintained by Vladimir Kiselev. Last updated 5 months ago.

immunooncology singlecell software classification supportvectormachine rnaseq visualization transcriptomics datarepresentation transcription sequencing preprocessing geneexpression dataimport bioconductor-package human-cell-atlas projection-mapping single-cell-rna-seq openblas cpp

0.5 match 95 stars 8.82 score 172 scripts

gogonzo

sport:Sequential Pairwise Online Rating Techniques

Calculates ratings for two-player or multi-player challenges. Methods included in package such as are able to estimate ratings (players strengths) and their evolution in time, also able to predict output of challenge. Algorithms are based on Bayesian Approximation Method, and they don't involve any matrix inversions nor likelihood estimation. Parameters are updated sequentially, and computation doesn't require any additional RAM to make estimation feasible. Additionally, base of the package is written in C++ what makes sport computation even faster. Methods used in the package refers to Mark E. Glickman (1999) <http://www.glicko.net/research/glicko.pdf>; Mark E. Glickman (2001) <doi:10.1080/02664760120059219>; Ruby C. Weng, Chih-Jen Lin (2011) <http://jmlr.csail.mit.edu/papers/volume12/weng11a/weng11a.pdf>; W. Penny, Stephen J. Roberts (1999) <doi:10.1109/IJCNN.1999.832603>.

Maintained by Dawid Kałędkowski. Last updated 5 years ago.

cpp

0.8 match 25 stars 5.78 score 24 scripts

cran

gains:Lift (Gains) Tables and Charts

Constructs gains tables and lift charts for prediction algorithms. Gains tables and lift charts are commonly used in direct marketing applications. The method is described in Drozdenko and Drake (2002), "Optimal Database Marketing", Chapter 11.

Maintained by Craig A. Rolling. Last updated 8 years ago.

3.4 match 1.26 score

bioc

SIMLR:Single-cell Interpretation via Multi-kernel LeaRning (SIMLR)

Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical for the identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. We develop a novel similarity-learning framework, SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization.

Maintained by Luca De Sano. Last updated 5 months ago.

immunooncology clustering geneexpression sequencing singlecell openblas cpp

0.5 match 111 stars 8.49 score 69 scripts

razrahman

IntegratedMRF:Integrated Prediction using Uni-Variate and Multivariate Random Forests

An implementation of a framework for drug sensitivity prediction from various genetic characterizations using ensemble approaches. Random Forests or Multivariate Random Forest predictive models can be generated from each genetic characterization that are then combined using a Least Square Regression approach. It also provides options for the use of different error estimation approaches of Leave-one-out, Bootstrap, N-fold cross validation and 0.632+Bootstrap along with generation of prediction confidence interval using Jackknife-after-Bootstrap approach.

Maintained by Raziur Rahman. Last updated 7 years ago.

cpp

3.4 match 1.26 score 18 scripts

numbersman77

eventTrack:Event Prediction for Time-to-Event Endpoints

Implements the hybrid framework for event prediction described in Fang & Zheng (2011, <doi:10.1016/j.cct.2011.05.013>). To estimate the survival function the event prediction is based on, a piecewise exponential hazard function is fit to the time-to-event data to infer the potential change points. Prior to the last identified change point, the survival function is estimated using Kaplan-Meier, and the tail after the change point is fit using piecewise exponential.

Maintained by Kaspar Rufibach. Last updated 24 days ago.

2.0 match 1 stars 2.00 score 2 scripts

bioc

SRAdb:A compilation of metadata from NCBI SRA and tools

The Sequence Read Archive (SRA) is the largest public repository of sequencing data from the next generation of sequencing platforms including Roche 454 GS System, Illumina Genome Analyzer, Applied Biosystems SOLiD System, Helicos Heliscope, and others. However, finding data of interest can be challenging using current tools. SRAdb is an attempt to make access to the metadata associated with submission, study, sample, experiment and run much more feasible. This is accomplished by parsing all the NCBI SRA metadata into a SQLite database that can be stored and queried locally. Fulltext search in the package make querying metadata very flexible and powerful. fastq and sra files can be downloaded for doing alignment locally. Beside ftp protocol, the SRAdb has funcitons supporting fastp protocol (ascp from Aspera Connect) for faster downloading large data files over long distance. The SQLite database is updated regularly as new data is added to SRA and can be downloaded at will for the most up-to-date metadata.

Maintained by Jack Zhu. Last updated 3 months ago.

infrastructure sequencing dataimport

0.5 match 2 stars 7.81 score 200 scripts

reichlab

zoltr:Interface to the 'Zoltar' Forecast Repository API

'Zoltar' <https://www.zoltardata.com/> is a website that provides a repository of model forecast results in a standardized format and a central location. It supports storing, retrieving, comparing, and analyzing time series forecasts for prediction challenges of interest to the modeling community. This package provides functions for working with the 'Zoltar' API, including connecting and authenticating, getting meta information (projects, models, and forecasts, and truth), and uploading, downloading, and deleting forecast and truth data.

Maintained by Matthew Cornell. Last updated 12 days ago.

0.5 match 2 stars 7.58 score 175 scripts 3 dependents

christopherkenny

royale:Clash Royale API

R interface to the official API for Clash Royale <https://developer.clashroyale.com/#/>.

Maintained by Christopher T. Kenny. Last updated 1 years ago.

2.3 match 1.70 score 4 scripts

yuelyu21

SCIntRuler:Guiding the Integration of Multiple Single-Cell RNA-Seq Datasets

The accumulation of single-cell RNA-seq (scRNA-seq) studies highlights the potential benefits of integrating multiple datasets. By augmenting sample sizes and enhancing analytical robustness, integration can lead to more insightful biological conclusions. However, challenges arise due to the inherent diversity and batch discrepancies within and across studies. SCIntRuler, a novel R package, addresses these challenges by guiding the integration of multiple scRNA-seq datasets.

Maintained by Yue Lyu. Last updated 5 months ago.

sequencing geneticvariability singlecell cpp

0.8 match 2 stars 4.85 score 3 scripts

extremestats

DATAstudio:The Research Data Warehouse of Miguel de Carvalho

Pulls together a collection of datasets from Miguel de Carvalho research articles. Including, for example: - de Carvalho (2012) <doi:10.1016/j.jspi.2011.08.016>; - de Carvalho et al (2012) <doi:10.1080/03610926.2012.709905>; - de Carvalho et al (2012) <doi:10.1016/j.econlet.2011.09.007>); - de Carvalho and Davison (2014) <doi:10.1080/01621459.2013.872651>; - de Carvalho and Rua (2017) <doi:10.1016/j.ijforecast.2015.09.004>; - de Carvalho et al (2023) <doi:10.1002/sta4.560>; - de Carvalho et al (2022) <doi:10.1007/s13253-021-00469-9>; - Palacios et al (2024) <doi:10.1214/24-BA1420>.

Maintained by Miguel de Carvalho. Last updated 1 days ago.

3.8 match 1.00 score 2 scripts

cran

SeqDetect:Sequence and Latent Process Detector

Sequence detector in this package contains a specific automaton model that can be used to learn and detect data and process sequences. Automaton model in this package is capable of learning and tracing sequences. Automaton model can be found in Krleža, Vrdoljak, Brčić (2019) <doi:10.1109/ACCESS.2019.2955245>. This research has been partly supported under Competitiveness and Cohesion Operational Programme from the European Regional and Development Fund, as part of the Integrated Anti-Fraud System project no. KK.01.2.1.01.0041. This research has also been partly supported by the European Regional Development Fund under the grant KK.01.1.1.01.0009.

Maintained by Dalibor Krleža. Last updated 5 years ago.

cpp

1.9 match 2.00 score 2 scripts

bioc

AMARETTO:Regulatory Network Inference and Driver Gene Evaluation using Integrative Multi-Omics Analysis and Penalized Regression

Integrating an increasing number of available multi-omics cancer data remains one of the main challenges to improve our understanding of cancer. One of the main challenges is using multi-omics data for identifying novel cancer driver genes. We have developed an algorithm, called AMARETTO, that integrates copy number, DNA methylation and gene expression data to identify a set of driver genes by analyzing cancer samples and connects them to clusters of co-expressed genes, which we define as modules. We applied AMARETTO in a pancancer setting to identify cancer driver genes and their modules on multiple cancer sites. AMARETTO captures modules enriched in angiogenesis, cell cycle and EMT, and modules that accurately predict survival and molecular subtypes. This allows AMARETTO to identify novel cancer driver genes directing canonical cancer pathways.

Maintained by Olivier Gevaert. Last updated 5 months ago.

statisticalmethod differentialmethylation generegulation geneexpression methylationarray transcription preprocessing batcheffect dataimport mrnamicroarray micrornaarray regression clustering rnaseq copynumbervariation sequencing microarray normalization network bayesian exonarray onechannel twochannel proprietaryplatforms alternativesplicing differentialexpression differentialsplicing genesetenrichment multiplecomparison qualitycontrol timecourse

0.8 match 4.88 score 15 scripts

eltebioinformatics

mulea:Enrichment Analysis Using Multiple Ontologies and False Discovery Rate

Background - Traditional gene set enrichment analyses are typically limited to a few ontologies and do not account for the interdependence of gene sets or terms, resulting in overcorrected p-values. To address these challenges, we introduce mulea, an R package offering comprehensive overrepresentation and functional enrichment analysis. Results - mulea employs a progressive empirical false discovery rate (eFDR) method, specifically designed for interconnected biological data, to accurately identify significant terms within diverse ontologies. mulea expands beyond traditional tools by incorporating a wide range of ontologies, encompassing Gene Ontology, pathways, regulatory elements, genomic locations, and protein domains. This flexibility enables researchers to tailor enrichment analysis to their specific questions, such as identifying enriched transcriptional regulators in gene expression data or overrepresented protein domains in protein sets. To facilitate seamless analysis, mulea provides gene sets (in standardised GMT format) for 27 model organisms, covering 22 ontology types from 16 databases and various identifiers resulting in almost 900 files. Additionally, the muleaData ExperimentData Bioconductor package simplifies access to these pre-defined ontologies. Finally, mulea's architecture allows for easy integration of user-defined ontologies, or GMT files from external sources (e.g., MSigDB or Enrichr), expanding its applicability across diverse research areas. Conclusions - mulea is distributed as a CRAN R package. It offers researchers a powerful and flexible toolkit for functional enrichment analysis, addressing limitations of traditional tools with its progressive eFDR and by supporting a variety of ontologies. Overall, mulea fosters the exploration of diverse biological questions across various model organisms.

Maintained by Tamas Stirling. Last updated 3 months ago.

annotation differentialexpression geneexpression genesetenrichment go graphandnetwork multiplecomparison pathways reactome software transcription visualization enrichment enrichment-analysis functional-enrichment-analysis gene-set-enrichment ontologies transcriptomics cpp

0.5 match 28 stars 7.36 score 34 scripts

fawda123

SWMPr:Retrieving, Organizing, and Analyzing Estuary Monitoring Data

Tools for retrieving, organizing, and analyzing environmental data from the System Wide Monitoring Program of the National Estuarine Research Reserve System <https://cdmo.baruch.sc.edu/>. These tools address common challenges associated with continuous time series data for environmental decision making.

Maintained by Marcus W. Beck. Last updated 1 months ago.

0.5 match 13 stars 7.05 score 143 scripts 1 dependents

psyen0824

copulaSim:Virtual Patient Simulation by Copula Invariance Property

To optimize clinical trial designs and data analysis methods consistently through trial simulation, we need to simulate multivariate mixed-type virtual patient data independent of designs and analysis methods under evaluation. To make the outcome of optimization more realistic, relevant empirical patient level data should be utilized when it’s available. However, a few problems arise in simulating trials based on small empirical data, where the underlying marginal distributions and their dependence structure cannot be understood or verified thoroughly due to the limited sample size. To resolve this issue, we use the copula invariance property, which can generate the joint distribution without making a strong parametric assumption. The function copula.sim can generate virtual patient data with optional data validation methods that are based on energy distance and ball divergence measurement. The function compare.copula.sim can conduct comparison of marginal mean and covariance of simulated data. To simulate patient-level data from a hypothetical treatment arm that would perform differently from the observed data, the function new.arm.copula.sim can be used to generate new multivariate data with the same dependence structure of the original data but with a shifted mean vector.

Maintained by Pei-Shan Yen. Last updated 3 years ago.

1.2 match 3.00 score 7 scripts

statdivlab

rigr:Regression, Inference, and General Data Analysis Tools in R

A set of tools to streamline data analysis. Learning both R and introductory statistics at the same time can be challenging, and so we created 'rigr' to facilitate common data analysis tasks and enable learners to focus on statistical concepts. We provide easy-to-use interfaces for descriptive statistics, one- and two-sample inference, and regression analyses. 'rigr' output includes key information while omitting unnecessary details that can be confusing to beginners. Heteroscedasticity-robust ("sandwich") standard errors are returned by default, and multiple partial F-tests and tests for contrasts are easy to specify. A single regression function can fit both linear and generalized linear models, allowing students to more easily make connections between different classes of models.

Maintained by Amy D Willis. Last updated 9 months ago.

0.5 match 10 stars 7.09 score 39 scripts

cmerow

rangeModelMetadata:Provides Templates for Metadata Files Associated with Species Range Models

Range Modeling Metadata Standards (RMMS) address three challenges: they (i) are designed for convenience to encourage use, (ii) accommodate a wide variety of applications, and (iii) are extensible to allow the community of range modelers to steer it as needed. RMMS are based on a data dictionary that specifies a hierarchical structure to catalog different aspects of the range modeling process. The dictionary balances a constrained, minimalist vocabulary to improve standardization with flexibility for users to provide their own values. Merow et al. (2019) <DOI:10.1111/geb.12993> describe the standards in more detail. Note that users who prefer to use the R package 'ecospat' can obtain it from <https://github.com/ecospat/ecospat>.

Maintained by Cory Merow. Last updated 8 months ago.

ecological-metadata-language ecological-modelling ecological-models ecology species-distribution-modelling species-distributions

0.5 match 6 stars 6.96 score 16 scripts 3 dependents

ropensci

waywiser:Ergonomic Methods for Assessing Spatial Models

Assessing predictive models of spatial data can be challenging, both because these models are typically built for extrapolating outside the original region represented by training data and due to potential spatially structured errors, with "hot spots" of higher than expected error clustered geographically due to spatial structure in the underlying data. Methods are provided for assessing models fit to spatial data, including approaches for measuring the spatial structure of model errors, assessing model predictions at multiple spatial scales, and evaluating where predictions can be made safely. Methods are particularly useful for models fit using the 'tidymodels' framework. Methods include Moran's I ('Moran' (1950) <doi:10.2307/2332142>), Geary's C ('Geary' (1954) <doi:10.2307/2986645>), Getis-Ord's G ('Ord' and 'Getis' (1995) <doi:10.1111/j.1538-4632.1995.tb00912.x>), agreement coefficients from 'Ji' and Gallo (2006) (<doi: 10.14358/PERS.72.7.823>), agreement metrics from 'Willmott' (1981) (<doi: 10.1080/02723646.1981.10642213>) and 'Willmott' 'et' 'al'. (2012) (<doi: 10.1002/joc.2419>), an implementation of the area of applicability methodology from 'Meyer' and 'Pebesma' (2021) (<doi:10.1111/2041-210X.13650>), and an implementation of multi-scale assessment as described in 'Riemann' 'et' 'al'. (2010) (<doi:10.1016/j.rse.2010.05.010>).

Maintained by Michael Mahoney. Last updated 23 hours ago.

spatial spatial-analysis tidymodels tidyverse

0.5 match 37 stars 6.93 score 19 scripts

bioc

rawDiag:Brings Orbitrap Mass Spectrometry Data to Life; Fast and Colorful

Optimizing methods for liquid chromatography coupled to mass spectrometry (LC-MS) poses a nontrivial challenge. The rawDiag package facilitates rational method optimization by generating MS operator-tailored diagnostic plots of scan-level metadata. The package is designed for use on the R shell or as a Shiny application on the Orbitrap instrument PC.

Maintained by Christian Panse. Last updated 4 months ago.

massspectrometry proteomics metabolomics infrastructure software shinyapps fast mass-spectrometry multiplatform orbitrap visualization

0.5 match 36 stars 6.71 score 18 scripts

bioc

escheR:Unified multi-dimensional visualizations with Gestalt principles

The creation of effective visualizations is a fundamental component of data analysis. In biomedical research, new challenges are emerging to visualize multi-dimensional data in a 2D space, but current data visualization tools have limited capabilities. To address this problem, we leverage Gestalt principles to improve the design and interpretability of multi-dimensional data in 2D data visualizations, layering aesthetics to display multiple variables. The proposed visualization can be applied to spatially-resolved transcriptomics data, but also broadly to data visualized in 2D space, such as embedding visualizations. We provide this open source R package escheR, which is built off of the state-of-the-art ggplot2 visualization framework and can be seamlessly integrated into genomics toolboxes and workflows.

Maintained by Boyi Guo. Last updated 5 months ago.

spatial singlecell transcriptomics visualization software multidimensional single-cell spatial-omics

0.5 match 6 stars 6.74 score 153 scripts 1 dependents

bioc

Modstrings:Working with modified nucleotide sequences

Representing nucleotide modifications in a nucleotide sequence is usually done via special characters from a number of sources. This represents a challenge to work with in R and the Biostrings package. The Modstrings package implements this functionallity for RNA and DNA sequences containing modified nucleotides by translating the character internally in order to work with the infrastructure of the Biostrings package. For this the ModRNAString and ModDNAString classes and derivates and functions to construct and modify these objects despite the encoding issues are implemenented. In addition the conversion from sequences to list like location information (and the reverse operation) is implemented as well.

Maintained by Felix G.M. Ernst. Last updated 5 months ago.

dataimport datarepresentation infrastructure sequencing software bioconductor biostrings dna dna-modifications modified-nucleotides nucleotides rna rna-modification-alphabet rna-modifications sequences

0.5 match 1 stars 6.64 score 5 scripts 8 dependents

cefet-rj-dal

daltoolbox:Leveraging Experiment Lines to Data Analytics

The natural increase in the complexity of current research experiments and data demands better tools to enhance productivity in Data Analytics. The package is a framework designed to address the modern challenges in data analytics workflows. The package is inspired by Experiment Line concepts. It aims to provide seamless support for users in developing their data mining workflows by offering a uniform data model and method API. It enables the integration of various data mining activities, including data preprocessing, classification, regression, clustering, and time series prediction. It also offers options for hyper-parameter tuning and supports integration with existing libraries and languages. Overall, the package provides researchers with a comprehensive set of functionalities for data science, promoting ease of use, extensibility, and integration with various tools and libraries. Information on Experiment Line is based on Ogasawara et al. (2009) <doi:10.1007/978-3-642-02279-1_20>.

Maintained by Eduardo Ogasawara. Last updated 1 months ago.

0.5 match 1 stars 6.65 score 536 scripts 4 dependents

srkobakian

sugarbag:Create Tessellated Hexagon Maps

Create a hexagon tile map display from spatial polygons. Each polygon is represented by a hexagon tile, placed as close to it's original centroid as possible, with a focus on maintaining spatial relationship to a focal point. Developed to aid visualisation and analysis of spatial distributions across Australia, which can be challenging due to the concentration of the population on the coast and wide open interior.

Maintained by Dianne Cook. Last updated 2 years ago.

0.5 match 42 stars 6.52 score 53 scripts

epinowcast

epidist:Estimate Epidemiological Delay Distributions With brms

Understanding and accurately estimating epidemiological delay distributions is important for public health policy. These estimates influence epidemic situational awareness, control strategies, and resource allocation. This package provides methods to address the key challenges in estimating these distributions, including truncation, interval censoring, and dynamical biases. These issues are frequently overlooked, resulting in biased conclusions. Built on top of 'brms', it allows for flexible modelling including time-varying spatial components and partially pooled estimates of demographic characteristics.

Maintained by Sam Abbott. Last updated 11 days ago.

0.5 match 14 stars 6.52 score 7 scripts

vkrakovna

sbfc:Selective Bayesian Forest Classifier

An MCMC algorithm for simultaneous feature selection and classification, and visualization of the selected features and feature interactions. An implementation of SBFC by Krakovna, Du and Liu (2015), <arXiv:1506.02371>.

Maintained by Viktoriya Krakovna. Last updated 3 years ago.

cpp

3.3 match 1.00 score 4 scripts

radicalcommecol

cxr:A Toolbox for Modelling Species Coexistence in R

Recent developments in modern coexistence theory have advanced our understanding on how species are able to persist and co-occur with other species at varying abundances. However, applying this mathematical framework to empirical data is still challenging, precluding a larger adoption of the theoretical tools developed by empiricists. This package provides a complete toolbox for modelling interaction effects between species, and calculate fitness and niche differences. The functions are flexible, may accept covariates, and different fitting algorithms can be used. A full description of the underlying methods is available in García-Callejas, D., Godoy, O., and Bartomeus, I. (2020) <doi:10.1111/2041-210X.13443>. Furthermore, the package provides a series of functions to calculate dynamics for stage-structured populations across sites.

Maintained by David Garcia-Callejas. Last updated 1 months ago.

0.5 match 10 stars 6.51 score 27 scripts

bioc

SpliceWiz:interactive analysis and visualization of alternative splicing in R

The analysis and visualization of alternative splicing (AS) events from RNA sequencing data remains challenging. SpliceWiz is a user-friendly and performance-optimized R package for AS analysis, by processing alignment BAM files to quantify read counts across splice junctions, IRFinder-based intron retention quantitation, and supports novel splicing event identification. We introduce a novel visualization for AS using normalized coverage, thereby allowing visualization of differential AS across conditions. SpliceWiz features a shiny-based GUI facilitating interactive data exploration of results including gene ontology enrichment. It is performance optimized with multi-threaded processing of BAM files and a new COV file format for fast recall of sequencing coverage. Overall, SpliceWiz streamlines AS analysis, enabling reliable identification of functionally relevant AS events for further characterization.

Maintained by Alex Chit Hei Wong. Last updated 6 days ago.

software transcriptomics rnaseq alternativesplicing coverage differentialsplicing differentialexpression gui sequencing cpp openmp

0.5 match 16 stars 6.41 score 8 scripts

ropensci

epubr:Read EPUB File Metadata and Text

Provides functions supporting the reading and parsing of internal e-book content from EPUB files. The 'epubr' package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame. E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content across an arbitrary collection of e-books perfectly and in completely general form, to yield a singular, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata. EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with 'epubr'. Text is read 'as is' for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user's discretion, such as with functions from packages like 'tm' or 'qdap'.

Maintained by Matthew Leonawicz. Last updated 6 months ago.

epub epub-files epub-format peer-reviewed

0.5 match 24 stars 6.37 score 49 scripts

sentometricsresearch

sentometrics:An Integrated Framework for Textual Sentiment Time Series Aggregation and Prediction

Optimized prediction based on textual sentiment, accounting for the intrinsic challenge that sentiment can be computed and pooled across texts and time in various ways. See Ardia et al. (2021) <doi:10.18637/jss.v099.i02>.

Maintained by Samuel Borms. Last updated 4 years ago.

nlp prediction sentiment-analysis text-mining time-series openblas cpp openmp

0.5 match 83 stars 6.09 score 49 scripts

cran

dsample:Discretization-Based Direct Random Sample Generation

Discretization-based random sampling algorithm that is useful for a complex model in high dimension is implemented. The normalizing constant of a target distribution is not needed. Posterior summaries are compared with those by 'OpenBUGS'. The method is described: Wang and Lee (2014) <doi:10.1016/j.csda.2013.06.011> and exercised in Lee (2009) <http://hdl.handle.net/1993/21352>.

Maintained by Chel Hee Lee. Last updated 2 years ago.

1.2 match 2.70 score

bioc

knowYourCG:Functional analysis of DNA methylome datasets

KnowYourCG (KYCG) is a supervised learning framework designed for the functional analysis of DNA methylation data. Unlike existing tools that focus on genes or genomic intervals, KnowYourCG directly targets CpG dinucleotides, featuring automated supervised screenings of diverse biological and technical influences, including sequence motifs, transcription factor binding, histone modifications, replication timing, cell-type-specific methylation, and trait-epigenome associations. KnowYourCG addresses the challenges of data sparsity in various methylation datasets, including low-pass Nanopore sequencing, single-cell DNA methylomes, 5-hydroxymethylation profiles, spatial DNA methylation maps, and array-based datasets for epigenome-wide association studies and epigenetic clocks.

Maintained by Goldberg David. Last updated 2 months ago.

epigenetics dnamethylation sequencing singlecell spatial methylationarray zlib

0.5 match 2 stars 6.10 score 4 scripts

modeloriented

arenar:Arena for the Exploration and Comparison of any ML Models

Generates data for challenging machine learning models in 'Arena' <https://arena.drwhy.ai> - an interactive web application. You can start the server with XAI (Explainable Artificial Intelligence) plots to be generated on-demand or precalculate and auto-upload data file beside shareable 'Arena' URL.

Maintained by Piotr Piątyszek. Last updated 4 years ago.

axplainable-artificial-intelligence ema explainability explanatory-model-analysis iml interactive-xai interpretability xai

0.5 match 31 stars 5.94 score 14 scripts

bioc

Dino:Normalization of Single-Cell mRNA Sequencing Data

Dino normalizes single-cell, mRNA sequencing data to correct for technical variation, particularly sequencing depth, prior to downstream analysis. The approach produces a matrix of corrected expression for which the dependency between sequencing depth and the full distribution of normalized expression; many existing methods aim to remove only the dependency between sequencing depth and the mean of the normalized expression. This is particuarly useful in the context of highly sparse datasets such as those produced by 10X genomics and other uninque molecular identifier (UMI) based microfluidics protocols for which the depth-dependent proportion of zeros in the raw expression data can otherwise present a challenge.

Maintained by Jared Brown. Last updated 5 months ago.

software normalization rnaseq singlecell sequencing geneexpression transcriptomics regression cellbasedassays

0.5 match 9 stars 6.02 score 13 scripts

bioc

dar:Differential Abundance Analysis by Consensus

Differential abundance testing in microbiome data challenges both parametric and non-parametric statistical methods, due to its sparsity, high variability and compositional nature. Microbiome-specific statistical methods often assume classical distribution models or take into account compositional specifics. These produce results that range within the specificity vs sensitivity space in such a way that type I and type II error that are difficult to ascertain in real microbiome data when a single method is used. Recently, a consensus approach based on multiple differential abundance (DA) methods was recently suggested in order to increase robustness. With dar, you can use dplyr-like pipeable sequences of DA methods and then apply different consensus strategies. In this way we can obtain more reliable results in a fast, consistent and reproducible way.

Maintained by Francesc Catala-Moll. Last updated 3 days ago.

software sequencing microbiome metagenomics multiplecomparison normalization bioconductor biomarker-discovery differential-abundance-analysis feature-selection microbiology phyloseq

0.5 match 2 stars 5.98 score 8 scripts

bioc

transcriptR:An Integrative Tool for ChIP- And RNA-Seq Based Primary Transcripts Detection and Quantification

The differences in the RNA types being sequenced have an impact on the resulting sequencing profiles. mRNA-seq data is enriched with reads derived from exons, while GRO-, nucRNA- and chrRNA-seq demonstrate a substantial broader coverage of both exonic and intronic regions. The presence of intronic reads in GRO-seq type of data makes it possible to use it to computationally identify and quantify all de novo continuous regions of transcription distributed across the genome. This type of data, however, is more challenging to interpret and less common practice compared to mRNA-seq. One of the challenges for primary transcript detection concerns the simultaneous transcription of closely spaced genes, which needs to be properly divided into individually transcribed units. The R package transcriptR combines RNA-seq data with ChIP-seq data of histone modifications that mark active Transcription Start Sites (TSSs), such as, H3K4me3 or H3K9/14Ac to overcome this challenge. The advantage of this approach over the use of, for example, gene annotations is that this approach is data driven and therefore able to deal also with novel and case specific events. Furthermore, the integration of ChIP- and RNA-seq data allows the identification all known and novel active transcription start sites within a given sample.

Maintained by Armen R. Karapetyan. Last updated 5 months ago.

immunooncology transcription software sequencing rnaseq coverage

0.9 match 3.30 score 2 scripts

imbs-hl

fuseMLR:Fusing Machine Learning in R

Recent technological advances have enable the simultaneous collection of multi-omics data i.e., different types or modalities of molecular data, presenting challenges for integrative prediction modeling due to the heterogeneous, high-dimensional nature and possible missing modalities of some individuals. We introduce this package for late integrative prediction modeling, enabling modality-specific variable selection and prediction modeling, followed by the aggregation of the modality-specific predictions to train a final meta-model. This package facilitates conducting late integration predictive modeling in a systematic, structured, and reproducible way.

Maintained by Cesaire J. K. Fouodo. Last updated 7 days ago.

0.5 match 6 stars 5.80 score 3 scripts

icra

ediblecity:Modeling Urban Agriculture at City Scale

The purpose of this package is to estimate the potential of urban agriculture to contribute to addressing several urban challenges at the city-scale. Within this aim, we selected 8 indicators directly related to one or several urban challenges. Also, a function is provided to compute new scenarios of urban agriculture. Methods are described by Pueyo-Ros, Comas & Corominas (2023) <doi:10.12688/openreseurope.16054.1>.

Maintained by Josep Pueyo-Ros. Last updated 1 years ago.

0.8 match 3.70 score 10 scripts

bioc

DMCHMM:Differentially Methylated CpG using Hidden Markov Model

A pipeline for identifying differentially methylated CpG sites using Hidden Markov Model in bisulfite sequencing data. DNA methylation studies have enabled researchers to understand methylation patterns and their regulatory roles in biological processes and disease. However, only a limited number of statistical approaches have been developed to provide formal quantitative analysis. Specifically, a few available methods do identify differentially methylated CpG (DMC) sites or regions (DMR), but they suffer from limitations that arise mostly due to challenges inherent in bisulfite sequencing data. These challenges include: (1) that read-depths vary considerably among genomic positions and are often low; (2) both methylation and autocorrelation patterns change as regions change; and (3) CpG sites are distributed unevenly. Furthermore, there are several methodological limitations: almost none of these tools is capable of comparing multiple groups and/or working with missing values, and only a few allow continuous or multiple covariates. The last of these is of great interest among researchers, as the goal is often to find which regions of the genome are associated with several exposures and traits. To tackle these issues, we have developed an efficient DMC identification method based on Hidden Markov Models (HMMs) called “DMCHMM” which is a three-step approach (model selection, prediction, testing) aiming to address the aforementioned drawbacks.

Maintained by Farhad Shokoohi. Last updated 5 months ago.

differentialmethylation sequencing hiddenmarkovmodel coverage

0.8 match 3.78 score 3 scripts

cadam00

prior3D:3D Prioritization Algorithm

Three-dimensional systematic conservation planning, conducting nested prioritization analyses across multiple depth levels and ensuring efficient resource allocation throughout the water column. It provides a structured workflow designed to address biodiversity conservation and management challenges in the 3 dimensions, while facilitating users’ choices and parameterization (Doxa et al. 2025 <doi:10.1016/j.ecolmodel.2024.110919>).

Maintained by Christos Adam. Last updated 2 months ago.

biodiversity conservation conservation-planning depth marine-spatial-planning multidimensional-environments prioritization

0.5 match 6 stars 5.62 score 3 scripts

mamc-dci

disttools:Distance Object Manipulation Tools

Provides convenient methods for accessing the data in 'dist' objects with minimal memory and computational overhead. 'disttools' can be used to extract the distance between any pair or combination of points encoded by a 'dist' object using only the indices of those points. This is an improvement over existing functionality, which requires either coercing a 'dist' object into a matrix or calculating the one dimensional index corresponding to a pair of observations. Coercion to a matrix is undesirable because doing so doubles the amount of memory required for storage. In contrast, there is no inherent downside to the latter solution. However, in part due to several edge cases, correctly and efficiently implementing such a solution can be challenging. 'disttools' abstracts away these challenges and provides a simple interface to access the data in a 'dist' object using the latter approach.

Maintained by Zachary Colburn. Last updated 3 years ago.

distance

0.8 match 3.70 score 4 scripts

johnbaums

hues:Distinct Colour Palettes Based on 'iwanthue'

Creating effective colour palettes for figures is challenging. This package generates and plot palettes of optimally distinct colours in perceptually uniform colour space, based on 'iwanthue' <http://tools.medialab.sciences-po.fr/iwanthue/>. This is done through k-means clustering of CIE Lab colour space, according to user-selected constraints on hue, chroma, and lightness.

Maintained by John Baumgartner. Last updated 5 years ago.

color palettes

0.5 match 34 stars 5.46 score 170 scripts

juanv66x

qvirus:Quantum Computing for Analyzing CD4 Lymphocytes and Antiretroviral Therapy

Resources, tutorials, and code snippets dedicated to exploring the intersection of quantum computing and artificial intelligence (AI) in the context of analyzing Cluster of Differentiation 4 (CD4) lymphocytes and optimizing antiretroviral therapy (ART) for human immunodeficiency virus (HIV). With the emergence of quantum artificial intelligence and the development of small-scale quantum computers, there's an unprecedented opportunity to revolutionize the understanding of HIV dynamics and treatment strategies. This project leverages the R package 'qsimulatR' (Ostmeyer and Urbach, 2023, <https://CRAN.R-project.org/package=qsimulatR>), a quantum computer simulator, to explore these applications in quantum computing techniques, addressing the challenges in studying CD4 lymphocytes and enhancing ART efficacy.

Maintained by Juan Pablo Acuña González. Last updated 13 days ago.

0.5 match 5.43 score 15 scripts

japilo

colorednoise:Simulate Temporally Autocorrelated Populations

Temporally autocorrelated populations are correlated in their vital rates (growth, death, etc.) from year to year. It is very common for populations, whether they be bacteria, plants, or humans, to be temporally autocorrelated. This poses a challenge for stochastic population modeling, because a temporally correlated population will behave differently from an uncorrelated one. This package provides tools for simulating populations with white noise (no temporal autocorrelation), red noise (positive temporal autocorrelation), and blue noise (negative temporal autocorrelation). The algebraic formulation for autocorrelated noise comes from Ruokolainen et al. (2009) <doi:10.1016/j.tree.2009.04.009>. Models for unstructured populations and for structured populations (matrix models) are available.

Maintained by July Pilowsky. Last updated 11 months ago.

openblas cpp

0.5 match 10 stars 5.43 score 18 scripts

igordot

clustermole:Unbiased Single-Cell Transcriptomic Data Cell Type Identification

Assignment of cell type labels to single-cell RNA sequencing (scRNA-seq) clusters is often a time-consuming process that involves manual inspection of the cluster marker genes complemented with a detailed literature search. This is especially challenging when unexpected or poorly described populations are present. The clustermole R package provides methods to query thousands of human and mouse cell identity markers sourced from a variety of databases.

Maintained by Igor Dolgalev. Last updated 1 years ago.

cell-type cell-type-annotation cell-type-classification cell-type-identification cell-type-matching gene-expression-signatures scrna-seq single-cell

0.5 match 13 stars 5.37 score 36 scripts

timothy-barry

ondisc:Algorithms and data structures for large single-cell expression matrices

Single-cell datasets are growing in size, posing challenges as well as opportunities for genomics researchers. `ondisc` is an R package that facilitates analysis of large-scale single-cell data out-of-core on a laptop or distributed across tens to hundreds processors on a cluster or cloud. In both of these settings, `ondisc` requires only a few gigabytes of memory, even if the input data are tens of gigabytes in size. `ondisc` mainly is oriented toward single-cell CRISPR screen analysis, but ondisc also can be used for single-cell differential expression and single-cell co-expression analyses. ondisc is powered by several new, efficient algorithms for manipulating and querying large, sparse expression matrices.

Maintained by Timothy Barry. Last updated 11 months ago.

dataimport singlecell differentialexpression crispr zlib cpp

0.5 match 11 stars 5.13 score 62 scripts

ropensci

qcoder:Lightweight Qualitative Coding

A free, lightweight, open source option for analyzing text-based qualitative data. Enables analysis of interview transcripts, observation notes, memos, and other sources. Supports the work of social scientists, historians, humanists, and other researchers who use qualitative methods. Addresses the unique challenges faced in analyzing qualitative data analysis. Provides opportunities for researchers who otherwise might not develop software to build software development skills.

Maintained by Elin Waring. Last updated 3 years ago.

unconf unconf18

0.5 match 134 stars 5.05 score 13 scripts

noaa-ocm

SWMPrExtension:Functions for Analyzing and Plotting Estuary Monitoring Data

Tools for performing routine analysis and plotting tasks with environmental data from the System Wide Monitoring Program of the National Estuarine Research Reserve System <https://cdmo.baruch.sc.edu/>. This package builds on the functionality of the 'SWMPr' package <https://cran.r-project.org/package=SWMPr>, which is used to retrieve and organize the data. The combined set of tools address common challenges associated with continuous time series data for environmental decision making, and are intended for use in annual reporting activities. References: Beck, Marcus W. (2016) <ISSN 2073-4859><https://journal.r-project.org/archive/2016-1/beck.pdf> Rudis, Bob (2014) <https://rud.is/b/2014/11/16/moving-the-earth-well-alaska-hawaii-with-r/>. United States Environmental Protection Agency (2015) <https://cfpub.epa.gov/si/si_public_record_Report.cfm?Lab=OWOW&dirEntryId=327030>. United States Environmental Protection Agency (2012) <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.646.1973&rep=rep1&type=pdf>.

Maintained by Matt Dornback. Last updated 2 years ago.

0.5 match 12 stars 5.10 score 42 scripts

syksy

ePCR:Ensemble Penalized Cox Regression for Survival Prediction

The top-performing ensemble-based Penalized Cox Regression (ePCR) framework developed during the DREAM 9.5 mCRPC Prostate Cancer Challenge <https://www.synapse.org/ProstateCancerChallenge> presented in Guinney J, Wang T, Laajala TD, et al. (2017) <doi:10.1016/S1470-2045(16)30560-5> is provided here-in, together with the corresponding follow-up work. While initially aimed at modeling the most advanced stage of prostate cancer, metastatic Castration-Resistant Prostate Cancer (mCRPC), the modeling framework has subsequently been extended to cover also the non-metastatic form of advanced prostate cancer (CRPC). Readily fitted ensemble-based model S4-objects are provided, and a simulated example dataset based on a real-life cohort is provided from the Turku University Hospital, to illustrate the use of the package. Functionality of the ePCR methodology relies on constructing ensembles of strata in patient cohorts and averaging over them, with each ensemble member consisting of a highly optimized penalized/regularized Cox regression model. Various cross-validation and other modeling schema are provided for constructing novel model objects.

Maintained by Teemu Daniel Laajala. Last updated 1 years ago.

0.5 match 5.00 score 20 scripts

thomaschln

nlpembeds:Natural Language Processing Embeddings

Provides efficient methods to compute co-occurrence matrices, pointwise mutual information (PMI) and singular value decomposition (SVD). In the biomedical and clinical settings, one challenge is the huge size of databases, e.g. when analyzing data of millions of patients over tens of years. To address this, this package provides functions to efficiently compute monthly co-occurrence matrices, which is the computational bottleneck of the analysis, by using the 'RcppAlgos' package and sparse matrices. Furthermore, the functions can be called on 'SQL' databases, enabling the computation of co-occurrence matrices of tens of gigabytes of data, representing millions of patients over tens of years. Partly based on Hong C. (2021) <doi:10.1038/s41746-021-00519-z>.

Maintained by Thomas Charlon. Last updated 27 days ago.

0.5 match 4.98 score

bioc

spiky:Spike-in calibration for cell-free MeDIP

spiky implements methods and model generation for cfMeDIP (cell-free methylated DNA immunoprecipitation) with spike-in controls. CfMeDIP is an enrichment protocol which avoids destructive conversion of scarce template, making it ideal as a "liquid biopsy," but creating certain challenges in comparing results across specimens, subjects, and experiments. The use of synthetic spike-in standard oligos allows diagnostics performed with cfMeDIP to quantitatively compare samples across subjects, experiments, and time points in both relative and absolute terms.

Maintained by Tim Triche. Last updated 5 months ago.

differentialmethylation dnamethylation normalization preprocessing qualitycontrol sequencing

0.5 match 2 stars 4.90 score 3 scripts

bioc

immunotation:Tools for working with diverse immune genes

MHC (major histocompatibility complex) molecules are cell surface complexes that present antigens to T cells. The repertoire of antigens presented in a given genetic background largely depends on the sequence of the encoded MHC molecules, and thus, in humans, on the highly variable HLA (human leukocyte antigen) genes of the hyperpolymorphic HLA locus. More than 28,000 different HLA alleles have been reported, with significant differences in allele frequencies between human populations worldwide. Reproducible and consistent annotation of HLA alleles in large-scale bioinformatics workflows remains challenging, because the available reference databases and software tools often use different HLA naming schemes. The package immunotation provides tools for consistent annotation of HLA genes in typical immunoinformatics workflows such as for example the prediction of MHC-presented peptides in different human donors. Converter functions that provide mappings between different HLA naming schemes are based on the MHC restriction ontology (MRO). The package also provides automated access to HLA alleles frequencies in worldwide human reference populations stored in the Allele Frequency Net Database.

Maintained by Katharina Imkeller. Last updated 5 months ago.

software immunooncology biomedicalinformatics genetics annotation

0.5 match 8 stars 4.90 score 3 scripts

mamba413

Ball:Statistical Inference and Sure Independence Screening via Ball Statistics

Hypothesis tests and sure independence screening (SIS) procedure based on ball statistics, including ball divergence <doi:10.1214/17-AOS1579>, ball covariance <doi:10.1080/01621459.2018.1543600>, and ball correlation <doi:10.1080/01621459.2018.1462709>, are developed to analyze complex data in metric spaces, e.g, shape, directional, compositional and symmetric positive definite matrix data. The ball divergence and ball covariance based distribution-free tests are implemented to detecting distribution difference and association in metric spaces <doi:10.18637/jss.v097.i06>. Furthermore, several generic non-parametric feature selection procedures based on ball correlation, BCor-SIS and all of its variants, are implemented to tackle the challenge in the context of ultra high dimensional data. A fast implementation for large-scale multiple K-sample testing with ball divergence <doi: 10.1002/gepi.22423> is supported, which is particularly helpful for genome-wide association study.

Maintained by Jin Zhu. Last updated 2 years ago.

0.5 match 1 stars 4.81 score 65 scripts

yannabraham

hilbertSimilarity:Hilbert Similarity Index for High Dimensional Data

Quantifying similarity between high-dimensional single cell samples is challenging, and usually requires some simplifying hypothesis to be made. By transforming the high dimensional space into a high dimensional grid, the number of cells in each sub-space of the grid is characteristic of a given sample. Using a Hilbert curve each sample can be visualized as a simple density plot, and the distance between samples can be calculated from the distribution of cells using the Jensen-Shannon distance. Bins that correspond to significant differences between samples can identified using a simple bootstrap procedure.

Maintained by Yann Abraham. Last updated 5 years ago.

cpp

0.5 match 5 stars 4.74 score 11 scripts

josesamos

clc:CORINE Land Cover Data and Styles

Streamline the management, analysis, and visualization of CORINE Land Cover data. Addresses challenges associated with its classification system and related styles, such as color mappings and descriptive labels.

Maintained by Jose Samos. Last updated 3 months ago.

0.5 match 4.52 score 11 scripts 1 dependents

egeulgen

PANACEA:Personalized Network-Based Anti-Cancer Therapy Evaluation

Identification of the most appropriate pharmacotherapy for each patient based on genomic alterations is a major challenge in personalized oncology. 'PANACEA' is a collection of personalized anti-cancer drug prioritization approaches utilizing network methods. The methods utilize personalized "driverness" scores from 'driveR' to rank drugs, mapping these onto a protein-protein interaction network. The "distance-based" method scores each drug based on these scores and distances between drugs and genes to rank given drugs. The "RWR" method propagates these scores via a random-walk with restart framework to rank the drugs. The methods are described in detail in Ulgen E, Ozisik O, Sezerman OU. 2023. PANACEA: network-based methods for pharmacotherapy prioritization in personalized oncology. Bioinformatics <doi:10.1093/bioinformatics/btad022>.

Maintained by Ege Ulgen. Last updated 2 years ago.

drug network-analysis oncology personalized-medicine

0.5 match 10 stars 4.70 score 3 scripts

bioc

zitools:Analysis of zero-inflated count data

zitools allows for zero inflated count data analysis by either using down-weighting of excess zeros or by replacing an appropriate proportion of excess zeros with NA. Through overloading frequently used statistical functions (such as mean, median, standard deviation), plotting functions (such as boxplots or heatmap) or differential abundance tests, it allows a wide range of downstream analyses for zero-inflated data in a less biased manner. This becomes applicable in the context of microbiome analyses, where the data is often overdispersed and zero-inflated, therefore making data analysis extremly challenging.

Maintained by Carlotta Meyring. Last updated 5 months ago.

software statisticalmethod microbiome

0.5 match 4.60 score 6 scripts

bioc

NoRCE:NoRCE: Noncoding RNA Sets Cis Annotation and Enrichment

While some non-coding RNAs (ncRNAs) are assigned critical regulatory roles, most remain functionally uncharacterized. This presents a challenge whenever an interesting set of ncRNAs needs to be analyzed in a functional context. Transcripts located close-by on the genome are often regulated together. This genomic proximity on the sequence can hint to a functional association. We present a tool, NoRCE, that performs cis enrichment analysis for a given set of ncRNAs. Enrichment is carried out using the functional annotations of the coding genes located proximal to the input ncRNAs. Other biologically relevant information such as topologically associating domain (TAD) boundaries, co-expression patterns, and miRNA target prediction information can be incorporated to conduct a richer enrichment analysis. To this end, NoRCE includes several relevant datasets as part of its data repository, including cell-line specific TAD boundaries, functional gene sets, and expression data for coding & ncRNAs specific to cancer. Additionally, the users can utilize custom data files in their investigation. Enrichment results can be retrieved in a tabular format or visualized in several different ways. NoRCE is currently available for the following species: human, mouse, rat, zebrafish, fruit fly, worm, and yeast.

Maintained by Gulden Olgun. Last updated 5 months ago.

biologicalquestion differentialexpression genomeannotation genesetenrichment genetarget genomeassembly go

0.5 match 1 stars 4.60 score 6 scripts

koenniem

mpathsenser:Process and Analyse Data from m-Path Sense

Overcomes one of the major challenges in mobile (passive) sensing, namely being able to pre-process the raw data that comes from a mobile sensing app, specifically 'm-Path Sense' <https://m-path.io>. The main task of 'mpathsenser' is therefore to read 'm-Path Sense' JSON files into a database and provide several convenience functions to aid in data processing.

Maintained by Koen Niemeijer. Last updated 22 days ago.

mobile-sensing

0.5 match 1 stars 4.48 score 6 scripts

bioc

HiCBricks:Framework for Storing and Accessing Hi-C Data Through HDF Files

HiCBricks is a library designed for handling large high-resolution Hi-C datasets. Over the years, the Hi-C field has experienced a rapid increase in the size and complexity of datasets. HiCBricks is meant to overcome the challenges related to the analysis of such large datasets within the R environment. HiCBricks offers user-friendly and efficient solutions for handling large high-resolution Hi-C datasets. The package provides an R/Bioconductor framework with the bricks to build more complex data analysis pipelines and algorithms. HiCBricks already incorporates example algorithms for calling domain boundaries and functions for high quality data visualization.

Maintained by Koustav Pal. Last updated 5 months ago.

dataimport infrastructure software technology sequencing hic

0.5 match 4.48 score 9 scripts 1 dependents

fberding

aifeducation:Artificial Intelligence for Education

In social and educational settings, the use of Artificial Intelligence (AI) is a challenging task. Relevant data is often only available in handwritten forms, or the use of data is restricted by privacy policies. This often leads to small data sets. Furthermore, in the educational and social sciences, data is often unbalanced in terms of frequencies. To support educators as well as educational and social researchers in using the potentials of AI for their work, this package provides a unified interface for neural nets in 'PyTorch' to deal with natural language problems. In addition, the package ships with a shiny app, providing a graphical user interface. This allows the usage of AI for people without skills in writing python/R scripts. The tools integrate existing mathematical and statistical methods for dealing with small data sets via pseudo-labeling (e.g. Cascante-Bonilla et al. (2020) <doi:10.48550/arXiv.2001.06001>) and imbalanced data via the creation of synthetic cases (e.g. Bunkhumpornpat et al. (2012) <doi:10.1007/s10489-011-0287-y>). Performance evaluation of AI is connected to measures from content analysis which educational and social researchers are generally more familiar with (e.g. Berding & Pargmann (2022) <doi:10.30819/5581>, Gwet (2014) <ISBN:978-0-9708062-8-4>, Krippendorff (2019) <doi:10.4135/9781071878781>). Estimation of energy consumption and CO2 emissions during model training is done with the 'python' library 'codecarbon'. Finally, all objects created with this package allow to share trained AI models with other people.

Maintained by Berding Florian. Last updated 1 months ago.

cpp

0.5 match 4.48 score 8 scripts

paul-haimerl

PAGFL:Joint Estimation of Latent Groups and Group-Specific Coefficients in Panel Data Models

Latent group structures are a common challenge in panel data analysis. Disregarding group-level heterogeneity can introduce bias. Conversely, estimating individual coefficients for each cross-sectional unit is inefficient and may lead to high uncertainty. This package addresses the issue of unobservable group structures by implementing the pairwise adaptive group fused Lasso (PAGFL) by Mehrabani (2023) <doi:10.1016/j.jeconom.2022.12.002>. PAGFL identifies latent group structures and group-specific coefficients in a single step. On top of that, we extend the PAGFL to time-varying coefficient functions.

Maintained by Paul Haimerl. Last updated 21 days ago.

classification panel-data-model time-varying-coefficients openblas cpp openmp

0.5 match 2 stars 4.43 score 3 scripts

syoung9836

knfi:Analysis of Korean National Forest Inventory Database

Understanding the current status of forest resources is essential for monitoring changes in forest ecosystems and generating related statistics. In South Korea, the National Forest Inventory (NFI) surveys over 4,500 sample plots nationwide every five years and records 70 items, including forest stand, forest resource, and forest vegetation surveys. Many researchers use NFI as the primary data for research, such as biomass estimation or analyzing the importance value of each species over time and space, depending on the research purpose. However, the large volume of accumulated forest survey data from across the country can make it challenging to manage and utilize such a vast dataset. To address this issue, we developed an R package that efficiently handles large-scale NFI data across time and space. The package offers a comprehensive workflow for NFI data analysis. It starts with data processing, where read_nfi() function reconstructs NFI data according to the researcher's needs while performing basic integrity checks for data quality.Following this, the package provides analytical tools that operate on the verified data. These include functions like summary_nfi() for summary statistics, diversity_nfi() for biodiversity analysis, iv_nfi() for calculating species importance value, and biomass_nfi() and cwd_biomass_nfi() for biomass estimation. Finally, for visualization, the tsvis_nfi() function generates graphs and maps, allowing users to visualize forest ecosystem changes across various spatial and temporal scales. This integrated approach and its specialized functions can enhance the efficiency of processing and analyzing NFI data, providing researchers with insights into forest ecosystems. The NFI Excel files (.xlsx) are not included in the R package and must be downloaded separately. Users can access these NFI Excel files by visiting the Korea Forest Service Forestry Statistics Platform <https://kfss.forest.go.kr/stat/ptl/article/articleList.do?curMenu=11694&bbsId=microdataboard> to download the annual NFI Excel files, which are bundled in .zip archives. Please note that this website is only available in Korean, and direct download links can be found in the notes section of the read_nfi() function.

Maintained by Sinyoung Park. Last updated 4 months ago.

data-analysis-r forestry

0.5 match 1 stars 4.48 score 2 scripts

bioc

PRONE:The PROteomics Normalization Evaluator

High-throughput omics data are often affected by systematic biases introduced throughout all the steps of a clinical study, from sample collection to quantification. Normalization methods aim to adjust for these biases to make the actual biological signal more prominent. However, selecting an appropriate normalization method is challenging due to the wide range of available approaches. Therefore, a comparative evaluation of unnormalized and normalized data is essential in identifying an appropriate normalization strategy for a specific data set. This R package provides different functions for preprocessing, normalizing, and evaluating different normalization approaches. Furthermore, normalization methods can be evaluated on downstream steps, such as differential expression analysis and statistical enrichment analysis. Spike-in data sets with known ground truth and real-world data sets of biological experiments acquired by either tandem mass tag (TMT) or label-free quantification (LFQ) can be analyzed.

Maintained by Lis Arend. Last updated 19 days ago.

proteomics preprocessing normalization differentialexpression visualization data-analysis evaluation

0.5 match 2 stars 4.38 score 9 scripts

joemsong

FunChisq:Model-Free Functional Chi-Squared and Exact Tests

Statistical hypothesis testing methods for inferring model-free functional dependency using asymptotic chi-squared or exact distributions. Functional test statistics are asymmetric and functionally optimal, unique from other related statistics. Tests in this package reveal evidence for causality based on the causality-by- functionality principle. They include asymptotic functional chi-squared tests (Zhang & Song 2013) <doi:10.48550/arXiv.1311.2707>, an adapted functional chi-squared test (Kumar & Song 2022) <doi:10.1093/bioinformatics/btac206>, and an exact functional test (Zhong & Song 2019) <doi:10.1109/TCBB.2018.2809743> (Nguyen et al. 2020) <doi:10.24963/ijcai.2020/372>. The normalized functional chi-squared test was used by Best Performer 'NMSUSongLab' in HPN-DREAM (DREAM8) Breast Cancer Network Inference Challenges (Hill et al. 2016) <doi:10.1038/nmeth.3773>. A function index (Zhong & Song 2019) <doi:10.1186/s12920-019-0565-9> (Kumar et al. 2018) <doi:10.1109/BIBM.2018.8621502> derived from the functional test statistic offers a new effect size measure for the strength of functional dependency, a better alternative to conditional entropy in many aspects. For continuous data, these tests offer an advantage over regression analysis when a parametric functional form cannot be assumed; for categorical data, they provide a novel means to assess directional dependency not possible with symmetrical Pearson's chi-squared or Fisher's exact tests.

Maintained by Joe Song. Last updated 10 months ago.

cpp

0.5 match 4.37 score 29 scripts

bioc

Uniquorn:Identification of cancer cell lines based on their weighted mutational/ variational fingerprint

'Uniquorn' enables users to identify cancer cell lines. Cancer cell line misidentification and cross-contamination reprents a significant challenge for cancer researchers. The identification is vital and in the frame of this package based on the locations/ loci of somatic and germline mutations/ variations. The input format is vcf/ vcf.gz and the files have to contain a single cancer cell line sample (i.e. a single member/genotype/gt column in the vcf file).

Maintained by Raik Otto. Last updated 5 months ago.

immunooncology statisticalmethod wholegenome exomeseq

0.5 match 4.30 score

bioc

gmoviz:Seamless visualization of complex genomic variations in GMOs and edited cell lines

Genetically modified organisms (GMOs) and cell lines are widely used models in all kinds of biological research. As part of characterising these models, DNA sequencing technology and bioinformatics analyses are used systematically to study their genomes. Therefore, large volumes of data are generated and various algorithms are applied to analyse this data, which introduces a challenge on representing all findings in an informative and concise manner. `gmoviz` provides users with an easy way to visualise and facilitate the explanation of complex genomic editing events on a larger, biologically-relevant scale.

Maintained by Kathleen Zeglinski. Last updated 5 months ago.

visualization sequencing geneticvariability genomicvariation coverage

0.5 match 4.30 score 9 scripts

drkowal

SeBR:Semiparametric Bayesian Regression Analysis

Monte Carlo sampling algorithms for semiparametric Bayesian regression analysis. These models feature a nonparametric (unknown) transformation of the data paired with widely-used regression models including linear regression, spline regression, quantile regression, and Gaussian processes. The transformation enables broader applicability of these key models, including for real-valued, positive, and compactly-supported data with challenging distributional features. The samplers prioritize computational scalability and, for most cases, Monte Carlo (not MCMC) sampling for greater efficiency. Details of the methods and algorithms are provided in Kowal and Wu (2024) <doi:10.1080/01621459.2024.2395586>.

Maintained by Dan Kowal. Last updated 8 days ago.

0.5 match 1 stars 4.30 score 3 scripts

bioc

OSAT:OSAT: Optimal Sample Assignment Tool

A sizable genomics study such as microarray often involves the use of multiple batches (groups) of experiment due to practical complication. To minimize batch effects, a careful experiment design should ensure the even distribution of biological groups and confounding factors across batches. OSAT (Optimal Sample Assignment Tool) is developed to facilitate the allocation of collected samples to different batches. With minimum steps, it produces setup that optimizes the even distribution of samples in groups of biological interest into different batches, reducing the confounding or correlation between batches and the biological variables of interest. It can also optimize the even distribution of confounding factors across batches. Our tool can handle challenging instances where incomplete and unbalanced sample collections are involved as well as ideal balanced RCBD. OSAT provides a number of predefined layout for some of the most commonly used genomics platform. Related paper can be find at http://www.biomedcentral.com/1471-2164/13/689 .

Maintained by Li Yan. Last updated 5 months ago.

datarepresentation visualization experimentaldesign qualitycontrol

0.5 match 4.30 score 3 scripts

bioc

GIGSEA:Genotype Imputed Gene Set Enrichment Analysis

We presented the Genotype-imputed Gene Set Enrichment Analysis (GIGSEA), a novel method that uses GWAS-and-eQTL-imputed trait-associated differential gene expression to interrogate gene set enrichment for the trait-associated SNPs. By incorporating eQTL from large gene expression studies, e.g. GTEx, GIGSEA appropriately addresses such challenges for SNP enrichment as gene size, gene boundary, SNP distal regulation, and multiple-marker regulation. The weighted linear regression model, taking as weights both imputation accuracy and model completeness, was used to perform the enrichment test, properly adjusting the bias due to redundancy in different gene sets. The permutation test, furthermore, is used to evaluate the significance of enrichment, whose efficiency can be largely elevated by expressing the computational intensive part in terms of large matrix operation. We have shown the appropriate type I error rates for GIGSEA (<5%), and the preliminary results also demonstrate its good performance to uncover the real signal.

Maintained by Shijia Zhu. Last updated 5 months ago.

genesetenrichment snp variantannotation geneexpression generegulation regression differentialexpression

0.5 match 4.30 score 2 scripts

jocelynchi

SEAGLE:Scalable Exact Algorithm for Large-Scale Set-Based Gene-Environment Interaction Tests

The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and rich collection in genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in GxE assessment, especially for set-based GxE variance component (VC) tests, a widely used strategy to boost overall GxE signals and to evaluate the joint GxE effect of multiple variants from a biologically meaningful unit (e.g., gene). We present 'SEAGLE', a Scalable Exact AlGorithm for Large-scale Set-based GxE tests, to permit GxE VC test scalable to biobank data. 'SEAGLE' employs modern matrix computations to achieve the same “exact” results as the original GxE VC tests, and does not impose additional assumptions nor relies on approximations. 'SEAGLE' can easily accommodate sample sizes in the order of 10^5, is implementable on standard laptops, and does not require specialized equipment. The accompanying manuscript for this package can be found at Chi, Ipsen, Hsiao, Lin, Wang, Lee, Lu, and Tzeng. (2021+) <arXiv:2105.03228>.

Maintained by Jocelyn Chi. Last updated 3 years ago.

0.5 match 4.30 score 8 scripts

subroy13

rsvddpd:Robust Singular Value Decomposition using Density Power Divergence

Computing singular value decomposition with robustness is a challenging task. This package provides an implementation of computing robust SVD using density power divergence (<arXiv:2109.10680>). It combines the idea of robustness and efficiency in estimation based on a tuning parameter. It also provides utility functions to simulate various scenarios to compare performances of different algorithms.

Maintained by Subhrajyoty Roy. Last updated 2 years ago.

openblas cpp openmp

0.5 match 3 stars 4.18 score 6 scripts

atbounds

ATbounds:Bounding Treatment Effects by Limited Information Pooling

Estimation and inference methods for bounding average treatment effects (on the treated) that are valid under an unconfoundedness assumption. The bounds are designed to be robust in challenging situations, for example, when the conditioning variables take on a large number of different values in the observed sample, or when the overlap condition is violated. This robustness is achieved by only using limited "pooling" of information across observations. For more details, see the paper by Lee and Weidner (2021), "Bounding Treatment Effects by Pooling Limited Information across Observations," <arXiv:2111.05243>.

Maintained by Sokbae Lee. Last updated 3 years ago.

causal-inference lack-of-overlap limited-overlap partial-identification treatment-effects unconfoundedness-assumption

0.5 match 3 stars 4.18 score 6 scripts

corradolanera

depigner:A Utility Package to Help you Deal with "Pignas"

Pigna [_pìn'n'a_] is the Italian word for pine cone. In jargon, it is used to identify a task which is boring, banal, annoying, painful, frustrating and maybe even with a not so beautiful or rewarding result, just like the obstinate act of trying to challenge yourself in extracting pine nuts from a pine cone, provided that, in the end, you will find at least one inside it. Here you can find a backpack of functions to be used to solve small everyday problems of coding or analyzing (clinical) data, which would be normally solved using quick-and-dirty patches. You will be able to convert 'Hmisc' and 'rms' summary()es into data.frames ready to be rendered by 'pander' and 'knitr'. You can access easy-to-use wrappers to activate essential but useful progress bars (from 'progress') into your loops or functionals. Easy setup and control Telegram's bots (from 'telegram.bot') to send messages or to divert error messages to a Telegram's chat. You also have some utilities helping you in the development of packages, like the activation of the same user interface of 'usethis' into your package, or call polite functions to ask a user to install other packages. Finally, you find a set of thematic sets of packages you may use to set up new environments quickly, installing them in a single call.

Maintained by Corrado Lanera. Last updated 2 years ago.

hmisc pigne rms telegram

0.5 match 3 stars 4.14 score 93 scripts

thomaswiemann

civ:Categorical Instrumental Variables

Implementation of the categorical instrumental variable (CIV) estimator proposed by Wiemann (2023) <arXiv:2311.17021>. CIV allows for optimal instrumental variable estimation in settings with relatively few observations per category. To obtain valid inference in these challenging settings, CIV leverages a regularization assumption that implies existence of a latent categorical variable with fixed finite support achieving the same first stage fit as the observed instrument.

Maintained by Thomas Wiemann. Last updated 1 years ago.

0.5 match 2 stars 4.00 score 5 scripts

technoslerphile

autoCovariateSelection:R Package to Implement Automated Covariate Selection for Two Exposure Cohorts Using High-Dimensional Propensity Score Algorithm

Contains functions to implement automated covariate selection using methods described in the high-dimensional propensity score (HDPS) algorithm by Schneeweiss et.al. Covariate adjustment in real-world-observational-data (RWD) is important for for estimating adjusted outcomes and this can be done by using methods such as, but not limited to, propensity score matching, propensity score weighting and regression analysis. While these methods strive to statistically adjust for confounding, the major challenge is in selecting the potential covariates that can bias the outcomes comparison estimates in observational RWD (Real-World-Data). This is where the utility of automated covariate selection comes in. The functions in this package help to implement the three major steps of automated covariate selection as described by Schneeweiss et. al elsewhere. These three functions, in order of the steps required to execute automated covariate selection are, get_candidate_covariates(), get_recurrence_covariates() and get_prioritised_covariates(). In addition to these functions, a sample real-world-data from publicly available de-identified medical claims data is also available for running examples and also for further exploration. The original article where the algorithm is described by Schneeweiss et.al. (2009) <doi:10.1097/EDE.0b013e3181a663cc> .

Maintained by Dennis Robert. Last updated 2 months ago.

0.5 match 4 stars 4.03 score 54 scripts

mroman-ibs

FuzzySimRes:Simulation and Resampling Methods for Epistemic Fuzzy Data

Random simulations of fuzzy numbers are still a challenging problem. The aim of this package is to provide the respective procedures to simulate fuzzy random variables, especially in the case of the piecewise linear fuzzy numbers (PLFNs, see Coroianua et al. (2013) <doi:10.1016/j.fss.2013.02.005> for the further details). Additionally, the special resampling algorithms known as the epistemic bootstrap are provided (see Grzegorzewski and Romaniuk (2022) <doi:10.34768/amcs-2022-0021>, Grzegorzewski and Romaniuk (2022) <doi:10.1007/978-3-031-08974-9_39>) together with the functions to apply statistical tests and estimate various characteristics based on the epistemic bootstrap. The package also includes a real-life data set of epistemic fuzzy triangular numbers. The fuzzy numbers used in this package are consistent with the 'FuzzyNumbers' package.

Maintained by Maciej Romaniuk. Last updated 7 months ago.

openblas

0.5 match 4.02 score 35 scripts 1 dependents

bioc

MethPed:A DNA methylation classifier tool for the identification of pediatric brain tumor subtypes

Classification of pediatric tumors into biologically defined subtypes is challenging and multifaceted approaches are needed. For this aim, we developed a diagnostic classifier based on DNA methylation profiles. We offer MethPed as an easy-to-use toolbox that allows researchers and clinical diagnosticians to test single samples as well as large cohorts for subclass prediction of pediatric brain tumors. The current version of MethPed can classify the following tumor diagnoses/subgroups: Diffuse Intrinsic Pontine Glioma (DIPG), Ependymoma, Embryonal tumors with multilayered rosettes (ETMR), Glioblastoma (GBM), Medulloblastoma (MB) - Group 3 (MB_Gr3), Group 4 (MB_Gr3), Group WNT (MB_WNT), Group SHH (MB_SHH) and Pilocytic Astrocytoma (PiloAstro).

Maintained by Helena Carén. Last updated 5 months ago.

immunooncology dnamethylation classification epigenetics

0.5 match 4.00 score 1 scripts

bioc

POWSC:Simulation, power evaluation, and sample size recommendation for single cell RNA-seq

Determining the sample size for adequate power to detect statistical significance is a crucial step at the design stage for high-throughput experiments. Even though a number of methods and tools are available for sample size calculation for microarray and RNA-seq in the context of differential expression (DE), this topic in the field of single-cell RNA sequencing is understudied. Moreover, the unique data characteristics present in scRNA-seq such as sparsity and heterogeneity increase the challenge. We propose POWSC, a simulation-based method, to provide power evaluation and sample size recommendation for single-cell RNA sequencing DE analysis. POWSC consists of a data simulator that creates realistic expression data, and a power assessor that provides a comprehensive evaluation and visualization of the power and sample size relationship.

Maintained by Kenong Su. Last updated 5 months ago.

differentialexpression immunooncology singlecell software

0.5 match 4.00 score 7 scripts

rshudde

BFF:Bayes Factor Functions

Bayes factors represent the ratio of probabilities assigned to data by competing scientific hypotheses. However, one drawback of Bayes factors is their dependence on prior specifications that define null and alternative hypotheses. Additionally, there are challenges in their computation. To address these issues, we define Bayes factor functions (BFFs) directly from common test statistics. BFFs express Bayes factors as a function of the prior densities used to define the alternative hypotheses. These prior densities are centered on standardized effects, which serve as indices for the BFF. Therefore, BFFs offer a summary of evidence in favor of alternative hypotheses that correspond to a range of scientifically interesting effect sizes. Such summaries remove the need for arbitrary thresholds to determine "statistical significance." BFFs are available in closed form and can be easily computed from z, t, chi-squared, and F statistics. They depend on hyperparameters "r" and "tau^2", which determine the shape and scale of the prior distributions defining the alternative hypotheses. Plots of BFFs versus effect size provide informative summaries of hypothesis tests that can be easily aggregated across studies.

Maintained by Rachael Shudde. Last updated 7 months ago.

0.5 match 1 stars 4.00 score 5 scripts

bioc

GSEAmining:Make Biological Sense of Gene Set Enrichment Analysis Outputs

Gene Set Enrichment Analysis is a very powerful and interesting computational method that allows an easy correlation between differential expressed genes and biological processes. Unfortunately, although it was designed to help researchers to interpret gene expression data it can generate huge amounts of results whose biological meaning can be difficult to interpret. Many available tools rely on the hierarchically structured Gene Ontology (GO) classification to reduce reundandcy in the results. However, due to the popularity of GSEA many more gene set collections, such as those in the Molecular Signatures Database are emerging. Since these collections are not organized as those in GO, their usage for GSEA do not always give a straightforward answer or, in other words, getting all the meaninful information can be challenging with the currently available tools. For these reasons, GSEAmining was born to be an easy tool to create reproducible reports to help researchers make biological sense of GSEA outputs. Given the results of GSEA, GSEAmining clusters the different gene sets collections based on the presence of the same genes in the leadind edge (core) subset. Leading edge subsets are those genes that contribute most to the enrichment score of each collection of genes or gene sets. For this reason, gene sets that participate in similar biological processes should share genes in common and in turn cluster together. After that, GSEAmining is able to identify and represent for each cluster: - The most enriched terms in the names of gene sets (as wordclouds) - The most enriched genes in the leading edge subsets (as bar plots). In each case, positive and negative enrichments are shown in different colors so it is easy to distinguish biological processes or genes that may be of interest in that particular study.

Maintained by Oriol Arqués. Last updated 5 months ago.

genesetenrichment clustering visualization

0.5 match 4.00 score 7 scripts

dgkf

parttime:Partial Datetime Handling

Datetimes and timestamps are invariably an imprecise notation, with any partial representation implying some amount of uncertainty. To handle this, 'parttime' provides classes for embedding partial missingness as a central part of its datetime classes. This central feature allows for more ergonomic use of datetimes for challenging datetime computation, including calculations of overlapping date ranges, imputations, and more thoughtful handling of ambiguity that arises from uncertain time zones. This package was developed first and foremost with pharmaceutical applications in mind, but aims to be agnostic to application to accommodate general use cases just as conveniently.

Maintained by Doug Kelkhoff. Last updated 1 years ago.

hacktoberfest

0.5 match 17 stars 3.93 score 3 scripts

taylor-arnold

sotu:United States Presidential State of the Union Addresses

The President of the United States is constitutionally obligated to provide a report known as the 'State of the Union'. The report summarizes the current challenges facing the country and the president's upcoming legislative agenda. While historically the State of the Union was often a written document, in recent decades it has always taken the form of an oral address to a joint session of the United States Congress. This package provides the raw text from every such address with the intention of being used for meaningful examples of text analysis in R. The corpus is well suited to the task as it is historically important, includes material intended to be read and material intended to be spoken, and it falls in the public domain. As the corpus spans over two centuries it is also a good test of how well various methods hold up to the idiosyncrasies of historical texts. Associated data about each address, such as the year, president, party, and format, are also included.

Maintained by Taylor B. Arnold. Last updated 3 years ago.

0.5 match 2 stars 3.87 score 74 scripts

celevitz

touRnamentofchampions:Tournament of Champions Data

Several datasets which describe the challenges and results of competitions in Tournament of Champions. This data is useful for practicing data wrangling, graphing, and analyzing how each season of Tournament of Champions played out.

Maintained by Levitz Carly. Last updated 9 days ago.

0.5 match 3.70 score

imalagaris

RCTRecruit:Non-Parametric Recruitment Prediction for Randomized Clinical Trials

Accurate prediction of subject recruitment for Randomized Clinical Trials (RCT) remains an ongoing challenge. Many previous prediction models rely on parametric assumptions. We present functions for non-parametric RCT recruitment prediction under several scenarios.

Maintained by Ioannis Malagaris. Last updated 2 months ago.

cpp

0.5 match 1 stars 3.65 score 3 scripts

kunfa

mixedBayes:Bayesian Longitudinal Regularized Quantile Mixed Model

In longitudinal studies, the same subjects are measured repeatedly over time, leading to correlations among the repeated measurements. Properly accounting for the intra-cluster correlations in the presence of data heterogeneity and long tailed distributions of the disease phenotype is challenging, especially in the context of high dimensional regressions. In this package, we developed a Bayesian quantile mixed effects model with spike- and -slab priors to dissect important gene - environment interactions under longitudinal genomics studies. An efficient Gibbs sampler has been developed to facilitate fast computation. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in 'C++'. The development of this software package and the associated statistical methods have been partially supported by an Innovative Research Award from Johnson Cancer Research Center, Kansas State University.

Maintained by Kun Fan. Last updated 29 days ago.

openblas cpp openmp

0.5 match 3.74 score 4 scripts

mncube

idmact:Interpreting Differences Between Mean ACT Scores

Interpreting the differences between mean scale scores across various forms of an assessment can be challenging. This difficulty arises from different mappings between raw scores and scale scores, complex mathematical relationships, adjustments based on judgmental procedures, and diverse equating functions applied to different assessment forms. An alternative method involves running simulations to explore the effect of incrementing raw scores on mean scale scores. The 'idmact' package provides an implementation of this approach based on the algorithm detailed in Schiel (1998) <https://www.act.org/content/dam/act/unsecured/documents/ACT_RR98-01.pdf> which was developed to help interpret differences between mean scale scores on the American College Testing (ACT) assessment. The function idmact_subj() within the package offers a framework for running simulations on subject-level scores. In contrast, the idmact_comp() function provides a framework for conducting simulations on composite scores.

Maintained by Mackson Ncube. Last updated 2 years ago.

assessment measurement psychometrics scale

0.5 match 3.70 score 4 scripts

bioc

SCAN.UPC:Single-channel array normalization (SCAN) and Universal exPression Codes (UPC)

SCAN is a microarray normalization method to facilitate personalized-medicine workflows. Rather than processing microarray samples as groups, which can introduce biases and present logistical challenges, SCAN normalizes each sample individually by modeling and removing probe- and array-specific background noise using only data from within each array. SCAN can be applied to one-channel (e.g., Affymetrix) or two-channel (e.g., Agilent) microarrays. The Universal exPression Codes (UPC) method is an extension of SCAN that estimates whether a given gene/transcript is active above background levels in a given sample. The UPC method can be applied to one-channel or two-channel microarrays as well as to RNA-Seq read counts. Because UPC values are represented on the same scale and have an identical interpretation for each platform, they can be used for cross-platform data integration.

Maintained by Stephen R. Piccolo. Last updated 5 months ago.

immunooncology software microarray preprocessing rnaseq twochannel onechannel

0.5 match 3.48 score 15 scripts

gshs-ornl

revengc:Reverse Engineering Summarized Data

Decoupled (e.g. separate averages) and censored (e.g. > 100 species) variables are continually reported by many well-established organizations (e.g. World Health Organization (WHO), Centers for Disease Control and Prevention (CDC), World Bank, and various national censuses). The challenge therefore is to infer what the original data could have been given summarized information. We present an R package that reverse engineers decoupled and/or censored count data with two main functions. The cnbinom.pars() function estimates the average and dispersion parameter of a censored univariate frequency table. The rec() function reverse engineers summarized data into an uncensored bivariate table of probabilities.

Maintained by Samantha Duchscherer. Last updated 6 years ago.

0.5 match 5 stars 3.44 score 11 scripts

bitansa

MR.RGM:Multivariate Bidirectional Mendelian Randomization Networks

Addressing a central challenge encountered in Mendelian randomization (MR) studies, where MR primarily focuses on discerning the effects of individual exposures on specific outcomes and establishes causal links between them. Using a network-based methodology, the intricacy involving interdependent outcomes due to numerous factors has been tackled through this routine. Based on Ni et al. (2018) <doi:10.1214/17-BA1087>, 'MR.RGM' extends to a broader exploration of the causal landscape by leveraging on network structures and involves the construction of causal graphs that capture interactions between response variables and consequently between responses and instrument variables. The resulting Graph visually represents these causal connections, showing directed edges with effect sizes labeled. 'MR.RGM' facilitates the navigation of various data availability scenarios effectively by accommodating three input formats, i.e., individual-level data and two types of summary-level data. In the process, causal effects, adjacency matrices, and other essential parameters of the complex biological networks, are estimated. Besides, 'MR.RGM' provides uncertainty quantification for specific network structures among response variables.

Maintained by Bitan Sarkar. Last updated 16 days ago.

openblas cpp openmp

0.5 match 1 stars 3.40 score

egpivo

QuantRegGLasso:Adaptively Weighted Group Lasso for Semiparametric Quantile Regression Models

Implements an adaptively weighted group Lasso procedure for simultaneous variable selection and structure identification in varying coefficient quantile regression models and additive quantile regression models with ultra-high dimensional covariates. The methodology, grounded in a strong sparsity condition, establishes selection consistency under certain weight conditions. To address the challenge of tuning parameter selection in practice, a BIC-type criterion named high-dimensional information criterion (HDIC) is proposed. The Lasso procedure, guided by HDIC-determined tuning parameters, maintains selection consistency. Theoretical findings are strongly supported by simulation studies. (Toshio Honda, Ching-Kang Ing, Wei-Ying Wu, 2019, <DOI:10.3150/18-BEJ1091>).

Maintained by Wen-Ting Wang. Last updated 5 months ago.

admm group-lasso high-dimensional quantile-regression rcpp rcpparmadillo openblas cpp

0.5 match 2 stars 3.30 score 2 scripts

pridiltal

clap:Detecting Class Overlapping Regions in Multidimensional Data

The issue of overlapping regions in multidimensional data arises when different classes or clusters share similar feature representations, making it challenging to delineate distinct boundaries between them accurately. This package provides methods for detecting and visualizing these overlapping regions using partitional clustering techniques based on nearest neighbor distances.

Maintained by Priyanga Dilini Talagala. Last updated 9 months ago.

0.5 match 1 stars 3.18 score 2 scripts

jrhub

spinBayes:Semi-Parametric Gene-Environment Interaction via Bayesian Variable Selection

Many complex diseases are known to be affected by the interactions between genetic variants and environmental exposures beyond the main genetic and environmental effects. Existing Bayesian methods for gene-environment (G×E) interaction studies are challenged by the high-dimensional nature of the study and the complexity of environmental influences. We have developed a novel and powerful semi-parametric Bayesian variable selection method that can accommodate linear and nonlinear G×E interactions simultaneously (Ren et al. (2020) <doi:10.1002/sim.8434>). Furthermore, the proposed method can conduct structural identification by distinguishing nonlinear interactions from main effects only case within Bayesian framework. Spike-and-slab priors are incorporated on both individual and group level to shrink coefficients corresponding to irrelevant main and interaction effects to zero exactly. The Markov chain Monte Carlo algorithms of the proposed and alternative methods are efficiently implemented in C++.

Maintained by Jie Ren. Last updated 1 months ago.

bayesian-variable-selection gene-environment-interactions high-dimensional-data semi-parametric-modeling openblas cpp openmp

0.5 match 1 stars 3.18 score 3 scripts

pwkraft

discursive:Measuring Discursive Sophistication in Open-Ended Survey Responses

A simple approach to measure political sophistication based on open-ended survey responses. Discursive sophistication captures the complexity of individual attitude expression by quantifying its relative size, range, and constraint. For more information on the measurement approach see: Kraft, Patrick W. 2023. "Women Also Know Stuff: Challenging the Gender Gap in Political Sophistication." American Political Science Review (forthcoming).

Maintained by Patrick Kraft. Last updated 2 years ago.

0.5 match 2 stars 3.00 score 5 scripts

jienagu

dataMojo:Reshape Data Table

A grammar of data manipulation with 'data.table', providing a consistent a series of utility functions that help you solve the most common data manipulation challenges.

Maintained by Jiena McLellan. Last updated 2 years ago.

0.5 match 2.88 score 15 scripts

eogasawara

tspredit:Time Series Prediction Integrated Tuning

Prediction is one of the most important activities while working with time series. There are many alternative ways to model the time series. Finding the right one is challenging to model them. Most data-driven models (either statistical or machine learning) demand tuning. Setting them right is mandatory for good predictions. It is even more complex since time series prediction also demands choosing a data pre-processing that complies with the chosen model. Many time series frameworks have features to build and tune models. The package differs as it provides a framework that seamlessly integrates tuning data pre-processing activities with the building of models. The package provides functions for defining and conducting time series prediction, including data pre(post)processing, decomposition, tuning, modeling, prediction, and accuracy assessment. More information is available at Izau et al. <doi:10.5753/sbbd.2022.224330>.

Maintained by Eduardo Ogasawara. Last updated 3 months ago.

0.5 match 2.92 score 56 scripts

cran

HDMT:A Multiple Testing Procedure for High-Dimensional Mediation Hypotheses

A multiple-testing procedure for high-dimensional mediation hypotheses. Mediation analysis is of rising interest in epidemiology and clinical trials. Among existing methods for mediation analyses, the popular joint significance (JS) test yields an overly conservative type I error rate and therefore low power. In the R package 'HDMT' we implement a multiple-testing procedure that accurately controls the family-wise error rate (FWER) and the false discovery rate (FDR) when using JS for testing high-dimensional mediation hypotheses. The core of our procedure is based on estimating the proportions of three component null hypotheses and deriving the corresponding mixture distribution of null p-values. Results of the data examples include better-behaved quantile-quantile plots and improved detection of novel mediation relationships on the role of DNA methylation in genetic regulation of gene expression. With increasing interest in mediation by molecular intermediaries such as gene expression, the proposed method addresses an unmet methodological challenge. Methods used in the package refer to James Y. Dai, Janet L. Stanford & Michael LeBlanc (2020) <doi:10.1080/01621459.2020.1765785>.

Maintained by James Dai. Last updated 3 years ago.

0.5 match 2.86 score 12 scripts 2 dependents

cran

NST:Normalized Stochasticity Ratio

To estimate ecological stochasticity in community assembly. Understanding the community assembly mechanisms controlling biodiversity patterns is a central issue in ecology. Although it is generally accepted that both deterministic and stochastic processes play important roles in community assembly, quantifying their relative importance is challenging. The new index, normalized stochasticity ratio (NST), is to estimate ecological stochasticity, i.e. relative importance of stochastic processes, in community assembly. With functions in this package, NST can be calculated based on different similarity metrics and/or different null model algorithms, as well as some previous indexes, e.g. previous Stochasticity Ratio (ST), Standard Effect Size (SES), modified Raup-Crick metrics (RC). Functions for permutational test and bootstrapping analysis are also included. Previous ST is published by Zhou et al (2014) <doi:10.1073/pnas.1324044111>. NST is modified from ST by considering two alternative situations and normalizing the index to range from 0 to 1 (Ning et al 2019) <doi:10.1073/pnas.1904623116>. A modified version, MST, is a special case of NST, used in some recent or upcoming publications, e.g. Liang et al (2020) <doi:10.1016/j.soilbio.2020.108023>. SES is calculated as described in Kraft et al (2011) <doi:10.1126/science.1208584>. RC is calculated as reported by Chase et al (2011) <doi:10.1890/ES10-00117.1> and Stegen et al (2013) <doi:10.1038/ismej.2013.93>. Version 3 added NST based on phylogenetic beta diversity, used by Ning et al (2020) <doi:10.1038/s41467-020-18560-z>.

Maintained by Daliang Ning. Last updated 3 years ago.

0.5 match 2 stars 2.85 score 35 scripts

lcbc-uio

nettskjema.tsd:Decrypt and Organise Nettskjema Data Within TSD

Working with Nettskjema (<https://nettskjema.no/>) data inside TSD can be challenging. This package aims to aid users in managing their incoming Nettskjema data by decrypting the data and storing the nettskjema data and meta-data in convenient and standardised ways. This package functionality currently only works on the Linux VMs of TSD, and for version 1 of the nettskjema data delivery to TSD.

Maintained by Athanasia Mo Mowinckel. Last updated 3 years ago.

0.5 match 2.70 score 3 scripts

sigbertklinke

exams.forge:Support for Compiling Examination Tasks using the 'exams' Package

The main aim is to further facilitate the creation of exercises based on the package 'exams' by Grün, B., and Zeileis, A. (2009) <doi:10.18637/jss.v029.i10>. Creating effective student exercises involves challenges such as creating appropriate data sets and ensuring access to intermediate values for accurate explanation of solutions. The functionality includes the generation of univariate and bivariate data including simple time series, functions for theoretical distributions and their approximation, statistical and mathematical calculations for tasks in basic statistics courses as well as general tasks such as string manipulation, LaTeX/HTML formatting and the editing of XML task files for 'Moodle'.

Maintained by Sigbert Klinke. Last updated 8 months ago.

0.5 match 2.70 score 1 scripts

davezes

mactivate:Multiplicative Activation

Provides methods and classes for adding m-activation ("multiplicative activation") layers to MLR or multivariate logistic regression models. M-activation layers created in this library detect and add input interaction (polynomial) effects into a predictive model. M-activation can detect high-order interactions -- a traditionally non-trivial challenge. Details concerning application, methodology, and relevant survey literature can be found in this library's vignette, "About."

Maintained by Dave Zes. Last updated 4 years ago.

0.5 match 2.68 score 12 scripts

sinast3000

sectorgap:Consistent Economic Trend Cycle Decomposition

Determining potential output and the output gap - two inherently unobservable variables - is a major challenge for macroeconomists. 'sectorgap' features a flexible modeling and estimation framework for a multivariate Bayesian state space model identifying economic output fluctuations consistent with subsectors of the economy. The proposed model is able to capture various correlations between output and a set of aggregate as well as subsector indicators. Estimation of the latent states and parameters is achieved using a simple Gibbs sampling procedure and various plotting options facilitate the assessment of the results. For details on the methodology and an illustrative example, see Streicher (2024) <https://www.research-collection.ethz.ch/handle/20.500.11850/653682>.

Maintained by Sina Streicher. Last updated 1 years ago.

0.5 match 2.70 score

jmoonen

compositeReliabilityInNestedDesigns:Optimizing the Composite Reliability in Multivariate Nested Designs

The reliability of assessment tools is a crucial aspect of monitoring student performance in various educational settings. It ensures that the assessment outcomes accurately reflect a student's true level of performance. However, when assessments are combined, determining composite reliability can be challenging, especially for naturalistic and unbalanced datasets in nested design as is often the case for Workplace-Based Assessments. This package is designed to estimate composite reliability in nested designs using multivariate generalizability theory and enhance the analysis of assessment data. The package allows for the inclusion of weight per assessment type and produces extensive G- and D-study results with graphical interpretations, and options to find the set of weights that maximizes the composite reliability or minimizes the standard error of measurement (SEM).

Maintained by Joyce Moonen - van Loon. Last updated 6 months ago.

0.5 match 2.70 score

cran

rplec:Placental Epigenetic Clock to Estimate Aging by DNA Methylation

Placental epigenetic clock to estimate aging based on gestational age using DNA methylation levels, so called placental epigenetic clock (PlEC). We developed a PlEC for the 2024 Placental Clock DREAM Challenge (<https://www.synapse.org/Synapse:syn59520082/wiki/628063>). Our PlEC achieved the top performance based on an independent test set. PlEC can be used to identify accelerated/decelerated aging of placenta for understanding placental dysfunction-related conditions, e.g., great obstetrical syndromes including preeclampsia, fetal growth restriction, preterm labor, preterm premature rupture of the membranes, late spontaneous abortion, and placental abruption. Detailed methodologies and examples are documented in our vignette, available at <https://herdiantrisufriyana.github.io/rplec/doc/placental_aging_analysis.html>.

Maintained by Herdiantri Sufriyana. Last updated 2 months ago.

0.5 match 2.70 score

dscolby

spacejamr:Simulate Spatial Bernoulli Networks

Social network analysis is becoming commonplace in many social science disciplines, but access to useful network data, especially among marginalized populations, still remains a formidable challenge. This package mitigates that problem by providing tools to simulate spatial Bernoulli networks as proposed in Carter T. Butts (2002, ISBN:978-0-493-72676-2), "Spatial models of large-scale interpersonal networks." Using this package, network analysts can simulate a spatial point process or sequence with a given number of nodes inside a geographical boundary and estimate the probability of a tie formation between all node pairs. When simulating a network, an analyst can choose between five spatial interaction functions. The package also enables quick comparison of summary statistics for simulated networks and provides simple to use plotting methods for its classes that return plots which can be further refined with the 'ggplot2' package.

Maintained by Darren Colby. Last updated 3 years ago.

0.5 match 1 stars 2.70 score 8 scripts

sinast3000

RGAP:Production Function Output Gap Estimation

The output gap indicates the percentage difference between the actual output of an economy and its potential. Since potential output is a latent process, the estimation of the output gap poses a challenge and numerous filtering techniques have been proposed. 'RGAP' facilitates the estimation of a Cobb-Douglas production function type output gap, as suggested by the European Commission (Havik et al. 2014) <https://ideas.repec.org/p/euf/ecopap/0535.html>. To that end, the non-accelerating wage rate of unemployment (NAWRU) and the trend of total factor productivity (TFP) can be estimated in two bivariate unobserved component models by means of Kalman filtering and smoothing. 'RGAP' features a flexible modeling framework for the appropriate state-space models and offers frequentist as well as Bayesian estimation techniques. Additional functionalities include direct access to the 'AMECO' <https://economy-finance.ec.europa.eu/economic-research-and-databases/economic-databases/ameco-database_en> database and automated model selection procedures. See the paper by Streicher (2022) <http://hdl.handle.net/20.500.11850/552089> for details.

Maintained by Sina Streicher. Last updated 1 years ago.

0.5 match 2.70 score 9 scripts

mncube

swaprinc:Swap Principal Components into Regression Models

Obtaining accurate and stable estimates of regression coefficients can be challenging when the suggested statistical model has issues related to multicollinearity, convergence, or overfitting. One solution is to use principal component analysis (PCA) results in the regression, as discussed in Chan and Park (2005) <doi:10.1080/01446190500039812>. The swaprinc() package streamlines comparisons between a raw regression model with the full set of raw independent variables and a principal component regression model where principal components are estimated on a subset of the independent variables, then swapped into the regression model in place of those variables. The swaprinc() function compares one raw regression model to one principal component regression model, while the compswap() function compares one raw regression model to many principal component regression models. Package functions include parameters to center, scale, and undo centering and scaling, as described by Harvey and Hansen (2022) <https://cran.r-project.org/package=LearnPCA/vignettes/Vig_03_Step_By_Step_PCA.pdf>. Additionally, the package supports using Gifi methods to extract principal components from categorical variables, as outlined by Rossiter (2021) <https://www.css.cornell.edu/faculty/dgr2/_static/files/R_html/NonlinearPCA.html#2_Package>.

Maintained by Mackson Ncube. Last updated 2 years ago.

0.5 match 2.70 score 2 scripts

cran

PytrendsLongitudinalR:Create Longitudinal Google Trends Data

'Google Trends' provides cross-sectional and time-series data on searches, but lacks readily available longitudinal data. Researchers, who want to create longitudinal 'Google Trends' on their own, face practical challenges, such as normalized counts that make it difficult to combine cross-sectional and time-series data and limitations in data formats and timelines that limit data granularity over extended time periods. This package addresses these issues and enables researchers to generate longitudinal 'Google Trends' data. This package is built on 'pytrends', a Python library that acts as the unofficial 'Google Trends API' to collect 'Google Trends' data. As long as the 'Google Trends API', 'pytrends' and all their dependencies are working, this package will work. During testing, we noticed that for the same input (keyword, topic, data_format, timeline), the output index can vary from time to time. Besides, if the keyword is not very popular, then the resulting dataset will contain a lot of zeros, which will greatly affect the final result. While this package has no control over the accuracy or quality of 'Google Trends' data, once the data is created, this package coverts it to longitudinal data. In addition, the user may encounter a 429 Too Many Requests error when using cross_section() and time_series() to collect 'Google Trends' data. This error indicates that the user has exceeded the rate limits set by the 'Google Trends API'. For more information about the 'Google Trends API' - 'pytrends', visit <https://pypi.org/project/pytrends/>.

Maintained by Taeyong Park. Last updated 6 months ago.

0.5 match 2.70 score

alighanbari26

GPRMortality:Gaussian Process Regression for Mortality Rates

A Bayesian statistical model for estimating child (under-five age group) and adult (15-60 age group) mortality. The main challenge is how to combine and integrate these different time series and how to produce unified estimates of mortality rates during a specified time span. GPR is a Bayesian statistical model for estimating child and adult mortality rates which its data likelihood is mortality rates from different data sources such as: Death Registration System, Censuses or surveys. There are also various hyper-parameters for completeness of DRS, mean, covariance functions and variances as priors. This function produces estimations and uncertainty (95% or any desirable percentiles) based on sampling and non-sampling errors due to variation in data sources. The GP model utilizes Bayesian inference to update predicted mortality rates as a posterior in Bayes rule by combining data and a prior probability distribution over parameters in mean, covariance function, and the regression model. This package uses Markov Chain Monte Carlo (MCMC) to sample from posterior probability distribution by 'rstan' package in R. Details are given in Wang H, Dwyer-Lindgren L, Lofgren KT, et al. (2012) <doi:10.1016/S0140-6736(12)61719-X>, Wang H, Liddell CA, Coates MM, et al. (2014) <doi:10.1016/S0140-6736(14)60497-9> and Mohammadi, Parsaeian, Mehdipour et al. (2017) <doi:10.1016/S2214-109X(17)30105-5>.

Maintained by Ali Ghanbari. Last updated 4 years ago.

0.5 match 2.70 score 7 scripts

rajkumpismb

PCAPAM50:Enhanced 'PAM50' Subtyping of Breast Cancer

Accurate classification of breast cancer tumors based on gene expression data is not a trivial task, and it lacks standard practices.The 'PAM50' classifier, which uses 50 gene centroid correlation distances to classify tumors, faces challenges with balancing estrogen receptor (ER) status and gene centering. The 'PCAPAM50' package leverages principal component analysis and iterative 'PAM50' calls to create a gene expression-based ER-balanced subset for gene centering, avoiding the use of protein expression-based ER data resulting into an enhanced Breast Cancer subtyping.

Maintained by Praveen-Kumar Raj-Kumar. Last updated 2 months ago.

0.5 match 2.48 score 3 scripts

imarkonis

csa:A Cross-Scale Analysis Tool for Model-Observation Visualization and Integration

Integration of Earth system data from various sources is a challenging task. Except for their qualitative heterogeneity, different data records exist for describing similar Earth system process at different spatio-temporal scales. Data inter-comparison and validation are usually performed at a single spatial or temporal scale, which could hamper the identification of potential discrepancies in other scales. 'csa' package offers a simple, yet efficient, graphical method for synthesizing and comparing observed and modelled data across a range of spatio-temporal scales. Instead of focusing at specific scales, such as annual means or original grid resolution, we examine how their statistical properties change across spatio-temporal continuum.

Maintained by Yannis Markonis. Last updated 5 years ago.

0.5 match 1 stars 2.48 score 3 scripts

cran

Numero:Statistical Framework to Define Subgroups in Complex Datasets

High-dimensional datasets that do not exhibit a clear intrinsic clustered structure pose a challenge to conventional clustering algorithms. For this reason, we developed an unsupervised framework that helps scientists to better subgroup their datasets based on visual cues, please see Gao S, Mutter S, Casey A, Makinen V-P (2019) Numero: a statistical framework to define multivariable subgroups in complex population-based datasets, Int J Epidemiology, 48:369-37, <doi:10.1093/ije/dyy113>. The framework includes the necessary functions to construct a self-organizing map of the data, to evaluate the statistical significance of the observed data patterns, and to visualize the results.

Maintained by Ville-Petteri Makinen. Last updated 6 months ago.

cpp

0.5 match 2.30 score

andriyprotsak5

UAHDataScienceUC:Learn Clustering Techniques Through Examples and Code

A comprehensive educational package combining clustering algorithms with detailed step-by-step explanations. Provides implementations of both traditional (hierarchical, k-means) and modern (Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Gaussian Mixture Models (GMM), genetic k-means) clustering methods as described in Ezugwu et. al., (2022) <doi:10.1016/j.engappai.2022.104743>. Includes educational datasets highlighting different clustering challenges, based on 'scikit-learn' examples (Pedregosa et al., 2011) <https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html>. Features detailed algorithm explanations, visualizations, and weighted distance calculations for enhanced learning.

Maintained by Andriy Protsak Protsak. Last updated 29 days ago.

0.5 match 2.30 score

dranthropoid

mmodely:Modeling Multivariate Origins Determinants - Evolutionary Lineages in Ecology

Perform multivariate modeling of evolved traits, with special attention to understanding the interplay of the multi-factorial determinants of their origins in complex ecological settings (Stephens, 2007 <doi:10.1016/j.tree.2006.12.003>). This software primarily concentrates on phylogenetic regression analysis, enabling implementation of tree transformation averaging and visualization functionality. Functions additionally support information theoretic approaches (Grueber, 2011 <doi:10.1111/j.1420-9101.2010.02210.x>; Garamszegi, 2011 <doi:10.1007/s00265-010-1028-7>) such as model averaging and selection of phylogenetic models. Accessory functions are also implemented for coef standardization (Cade 2015), selection uncertainty, and variable importance (Burnham & Anderson 2000). There are other numerous functions for visualizing confounded variables, plotting phylogenetic trees, as well as reporting and exporting modeling results. Lastly, as challenges to ecology are inherently multifarious, and therefore often multi-dataset, this package features several functions to support the identification, interpolation, merging, and updating of missing data and outdated nomenclature.

Maintained by David M Schruth. Last updated 2 years ago.

0.5 match 2.30 score 4 scripts

vanduttran

ConsensusOPLS:Consensus OPLS for Multi-Block Data Fusion

Merging data from multiple sources is a relevant approach for comprehensively evaluating complex systems. However, the inherent problems encountered when analyzing single tables are amplified with the generation of multi-block datasets, and finding the relationships between data layers of increasing complexity constitutes a challenging task. For that purpose, a generic methodology is proposed by combining the strength of established data analysis strategies, i.e. multi-block approaches and the Orthogonal Partial Least Squares (OPLS) framework to provide an efficient tool for the fusion of data obtained from multiple sources. The package enables quick and efficient implementation of the consensus OPLS model for any horizontal multi-block data structures (observation-based matching). Moreover, it offers an interesting range of metrics and graphics to help to determine the optimal number of components and check the validity of the model through permutation tests. Interpretation tools include score and loading plots, Variable Importance in Projection (VIP), functionality predict for SHAP computing, and performance coefficients such as R2, Q2, and DQ2 coefficients. J. Boccard and D.N. Rutledge (2013) <doi:10.1016/j.aca.2013.01.022>.

Maintained by Van Du T. Tran. Last updated 20 days ago.

0.5 match 2.30 score 9 scripts

katilingban

ennet:Utilities to Extract and Analyse Text Data from the Emergency Nutrition Network Forum

The Emergency Nutrition Network or en-net forum is the go to online forum for field practitioners requiring prompt technical advice for operational challenges for which answers are not readily accessible in current guidelines. The questions and the corresponding answers raised within en-net can provide insight into what the key topics of discussion are within the nutrition sector. This package provides utility functions for the extraction, processing and analysis of text data from the online forum.

Maintained by Ernest Guevarra. Last updated 2 years ago.

en-net nutrition

0.5 match 2 stars 2.08 score 12 scripts

pbiecek

BetaBit:Mini Games from Adventures of Beta and Bit

Three games: proton, frequon and regression. Each one is a console-based data-crunching game for younger and older data scientists. Act as a data-hacker and find Slawomir Pietraszko's credentials to the Proton server. In proton you have to solve four data-based puzzles to find the login and password. There are many ways to solve these puzzles. You may use loops, data filtering, ordering, aggregation or other tools. Only basics knowledge of R is required to play the game, yet the more functions you know, the more approaches you can try. In frequon you will help to perform statistical cryptanalytic attack on a corpus of ciphered messages. This time seven sub-tasks are pushing the bar much higher. Do you accept the challenge? In regression you will test your modeling skills in a series of eight sub-tasks. Try only if ANOVA is your close friend. It's a part of Beta and Bit project. You will find more about the Beta and Bit project at <https://github.com/BetaAndBit/Charts>.

Maintained by Przemyslaw Biecek. Last updated 2 years ago.

0.5 match 1 stars 2.03 score 106 scripts

cran

SOFIA:Making Sophisticated and Aesthetical Figures in R

Software that leverages the capabilities of Circos by manipulating data, preparing configuration files, and running the Perl-native Circos directly from the R environment with minimal user intervention. Circos is a novel software that addresses the challenges in visualizing genetic data by creating circular ideograms composed of tracks of heatmaps, scatter plots, line plots, histograms, links between common markers, glyphs, text, and etc. Please see <http://www.circos.ca>.

Maintained by Luis Diaz-Garcia. Last updated 8 years ago.

0.5 match 2.00 score

sunnypig1988

BCSub:A Bayesian Semiparametric Factor Analysis Model for Subtype Identification (Clustering)

Gene expression profiles are commonly utilized to infer disease subtypes and many clustering methods can be adopted for this task. However, existing clustering methods may not perform well when genes are highly correlated and many uninformative genes are included for clustering. To deal with these challenges, we develop a novel clustering method in the Bayesian setting. This method, called BCSub, adopts an innovative semiparametric Bayesian factor analysis model to reduce the dimension of the data to a few factor scores for clustering. Specifically, the factor scores are assumed to follow the Dirichlet process mixture model in order to induce clustering.

Maintained by Jiehuan Sun. Last updated 8 years ago.

openblas cpp

0.5 match 2.00 score 2 scripts

hanjunwei-lab

iPRISM:Intelligent Predicting Response to Cancer Immunotherapy Through Systematic Modeling

Immunotherapy has revolutionized cancer treatment, but predicting patient response remains challenging. Here, we presented Intelligent Predicting Response to cancer Immunotherapy through Systematic Modeling (iPRISM), a novel network-based model that integrates multiple data types to predict immunotherapy outcomes. It incorporates gene expression, biological functional network, tumor microenvironment characteristics, immune-related pathways, and clinical data to provide a comprehensive view of factors influencing immunotherapy efficacy. By identifying key genetic and immunological factors, it provides an insight for more personalized treatment strategies and combination therapies to overcome resistance mechanisms.

Maintained by Junwei Han. Last updated 8 months ago.

0.5 match 2.00 score

cran

beanz:Bayesian Analysis of Heterogeneous Treatment Effect

It is vital to assess the heterogeneity of treatment effects (HTE) when making health care decisions for an individual patient or a group of patients. Nevertheless, it remains challenging to evaluate HTE based on information collected from clinical studies that are often designed and conducted to evaluate the efficacy of a treatment for the overall population. The Bayesian framework offers a principled and flexible approach to estimate and compare treatment effects across subgroups of patients defined by their characteristics. This package allows users to explore a wide range of Bayesian HTE analysis models, and produce posterior inferences about HTE. See Wang et al. (2018) <DOI:10.18637/jss.v085.i07> for further details.

Maintained by Chenguang Wang. Last updated 2 years ago.

cpp

0.5 match 1 stars 2.00 score 7 scripts

cran

UAHDataScienceUC:Learn Clustering Techniques Through Examples and Code

A comprehensive educational package combining clustering algorithms with detailed step-by-step explanations. Provides implementations of both traditional (hierarchical, k-means) and modern (Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Gaussian Mixture Models (GMM), genetic k-means) clustering methods as described in Ezugwu et. al., (2022) <doi:10.1016/j.engappai.2022.104743>. Includes educational datasets highlighting different clustering challenges, based on 'scikit-learn' examples (Pedregosa et al., 2011) <https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html>. Features detailed algorithm explanations, visualizations, and weighted distance calculations for enhanced learning.

Maintained by Andriy Protsak Protsak. Last updated 29 days ago.

0.5 match 2.00 score

jjr1234

BClustLonG:A Dirichlet Process Mixture Model for Clustering Longitudinal Gene Expression Data

Many clustering methods have been proposed, but most of them cannot work for longitudinal gene expression data. 'BClustLonG' is a package that allows us to perform clustering analysis for longitudinal gene expression data. It adopts a linear-mixed effects framework to model the trajectory of genes over time, while clustering is jointly conducted based on the regression coefficients obtained from all genes. To account for the correlations among genes and alleviate the high dimensionality challenges, factor analysis models are adopted for the regression coefficients. The Dirichlet process prior distribution is utilized for the means of the regression coefficients to induce clustering. This package allows users to specify which variables to use for clustering (intercepts or slopes or both) and whether a factor analysis model is desired. More details about this method can be found in Jiehuan Sun, et al. (2017) <doi:10.1002/sim.7374>.

Maintained by Jiehuan Sun. Last updated 5 years ago.

openblas cpp

0.5 match 2.00 score 3 scripts

laxchan

SparseMSE:'Multiple Systems Estimation for Sparse Capture Data'

Implements the routines and algorithms developed and analysed in "Multiple Systems Estimation for Sparse Capture Data: Inferential Challenges when there are Non-Overlapping Lists" Chan, L, Silverman, B. W., Vincent, K (2019) <https://www.tandfonline.com/doi/full/10.1080/01621459.2019.1708748> and in "Bootstrapping multiple systems estimates to account for model selection" Silverman, B. W., Chan, L, Vincent, K (2023)<https://doi.org/10.1007/s11222-023-10346-9>. This package explicitly handles situations where there are pairs of lists which have no observed individuals in common. It deals correctly with parameters whose estimated values can be considered as being negative infinity. It also addresses other possible issues of non-existence and non-identifiability of maximum likelihood estimates.

Maintained by Lax Chan. Last updated 1 years ago.

0.5 match 2.00 score 7 scripts

harish11999

transformerForecasting:Transformer Deep Learning Model for Time Series Forecasting

Time series forecasting faces challenges due to the non-stationarity, nonlinearity, and chaotic nature of the data. Traditional deep learning models like Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) process data sequentially but are inefficient for long sequences. To overcome the limitations of these models, we proposed a transformer-based deep learning architecture utilizing an attention mechanism for parallel processing, enhancing prediction accuracy and efficiency. This paper presents user-friendly code for the implementation of the proposed transformer-based deep learning architecture utilizing an attention mechanism for parallel processing. References: Nayak et al. (2024) <doi:10.1007/s40808-023-01944-7> and Nayak et al. (2024) <doi:10.1016/j.simpa.2024.100716>.

Maintained by G H Harish Nayak. Last updated 11 days ago.

0.5 match 1 stars 2.00 score

cran

lab2clean:Automation and Standardization of Cleaning Clinical Lab Data

Navigating the shift of clinical laboratory data from primary everyday clinical use to secondary research purposes presents a significant challenge. Given the substantial time and expertise required for lab data pre-processing and cleaning and the lack of all-in-one tools tailored for this need, we developed our algorithm 'lab2clean' as an open-source R-package. 'lab2clean' package is set to automate and standardize the intricate process of cleaning clinical laboratory results. With a keen focus on improving the data quality of laboratory result values, our goal is to equip researchers with a straightforward, plug-and-play tool, making it smoother for them to unlock the true potential of clinical laboratory data in clinical research and clinical machine learning (ML) model development. Version 1.0 of the algorithm is described in detail in 'Zayed et al. (2024)' <doi:10.1186/s12911-024-02652-7>.

Maintained by Ahmed Zayed. Last updated 6 months ago.

0.5 match 2.00 score

wraff

wrTopDownFrag:Internal Fragment Identification from Top-Down Mass Spectrometry

Top-Down mass spectrometry aims to identify entire proteins as well as their (post-translational) modifications or ions bound (eg Chen et al (2018) <doi:10.1021/acs.analchem.7b04747>). The pattern of internal fragments (Haverland et al (2017) <doi:10.1007/s13361-017-1635-x>) may reveal important information about the original structure of the proteins studied (Skinner et al (2018) <doi:10.1038/nchembio.2515> and Li et al (2018) <doi:10.1038/nchem.2908>). However, the number of possible internal fragments gets huge with longer proteins and subsequent identification of internal fragments remains challenging, in particular since the the accuracy of measurements with current mass spectrometers represents a limiting factor. This package attempts to deal with the complexity of internal fragments and allows identification of terminal and internal fragments from deconvoluted mass-spectrometry data.

Maintained by Wolfgang Raffelsberger. Last updated 5 years ago.

0.5 match 2.00 score 2 scripts

dziakj1

MOST:Multiphase Optimization Strategy

Provides functions similar to the 'SAS' macros previously provided to accompany Collins, Dziak, and Li (2009) <DOI:10.1037/a0015826> and Dziak, Nahum-Shani, and Collins (2012) <DOI:10.1037/a0026972>, papers which outline practical benefits and challenges of factorial and fractional factorial experiments for scientists interested in developing biological and/or behavioral interventions, especially in the context of the multiphase optimization strategy (see Collins, Kugler & Gwadz 2016) <DOI:10.1007/s10461-015-1145-4>. The package currently contains three functions. First, RelativeCosts1() draws a graph of the relative cost of complete and reduced factorial designs versus other alternatives. Second, RandomAssignmentGenerator() returns a dataframe which contains a list of random numbers that can be used to conveniently assign participants to conditions in an experiment with many conditions. Third, FactorialPowerPlan() estimates the power, detectable effect size, or required sample size of a factorial or fractional factorial experiment, for main effects or interactions, given several possible choices of effect size metric, and allowing pretests and clustering.

Maintained by John Dziak. Last updated 3 years ago.

0.5 match 2.00 score 5 scripts

kathbaum

DrDimont:Drug Response Prediction from Differential Multi-Omics Networks

While it has been well established that drugs affect and help patients differently, personalized drug response predictions remain challenging. Solutions based on single omics measurements have been proposed, and networks provide means to incorporate molecular interactions into reasoning. However, how to integrate the wealth of information contained in multiple omics layers still poses a complex problem. We present a novel network analysis pipeline, DrDimont, Drug response prediction from Differential analysis of multi-omics networks. It allows for comparative conclusions between two conditions and translates them into differential drug response predictions. DrDimont focuses on molecular interactions. It establishes condition-specific networks from correlation within an omics layer that are then reduced and combined into heterogeneous, multi-omics molecular networks. A novel semi-local, path-based integration step ensures integrative conclusions. Differential predictions are derived from comparing the condition-specific integrated networks. DrDimont's predictions are explainable, i.e., molecular differences that are the source of high differential drug scores can be retrieved. Our proposed pipeline leverages multi-omics data for differential predictions, e.g. on drug response, and includes prior information on interactions. The case study presented in the vignette uses data published by Krug (2020) <doi:10.1016/j.cell.2020.10.036>. The package license applies only to the software and explicitly not to the included data.

Maintained by Katharina Baum. Last updated 2 years ago.

0.5 match 2.00 score 2 scripts

vivid225

planningML:A Sample Size Calculator for Machine Learning Applications in Healthcare

Advances in automated document classification has led to identifying massive numbers of clinical concepts from handwritten clinical notes. These high dimensional clinical concepts can serve as highly informative predictors in building classification algorithms for identifying patients with different clinical conditions, commonly referred to as patient phenotyping. However, from a planning perspective, it is critical to ensure that enough data is available for the phenotyping algorithm to obtain a desired classification performance. This challenge in sample size planning is further exacerbated by the high dimension of the feature space and the inherent imbalance of the response class. Currently available sample size planning methods can be categorized into: (i) model-based approaches that predict the sample size required for achieving a desired accuracy using a linear machine learning classifier and (ii) learning curve-based approaches (Figueroa et al. (2012) <doi:10.1186/1472-6947-12-8>) that fit an inverse power law curve to pilot data to extrapolate performance. We develop model-based approaches for imbalanced data with correlated features, deriving sample size formulas for performance metrics that are sensitive to class imbalance such as Area Under the receiver operating characteristic Curve (AUC) and Matthews Correlation Coefficient (MCC). This is done using a two-step approach where we first perform feature selection using the innovated High Criticism thresholding method (Hall and Jin (2010) <doi:10.1214/09-AOS764>), then determine the sample size by optimizing the two performance metrics. Further, we develop software in the form of an R package named 'planningML' and an 'R' 'Shiny' app to facilitate the convenient implementation of the developed model-based approaches and learning curve approaches for imbalanced data. We apply our methods to the problem of phenotyping rare outcomes using the MIMIC-III electronic health record database. We show that our developed methods which relate training data size and performance on AUC and MCC, can predict the true or observed performance from linear ML classifiers such as LASSO and SVM at different training data sizes. Therefore, in high-dimensional classification analysis with imbalanced data and correlated features, our approach can efficiently and accurately determine the sample size needed for machine-learning based classification.

Maintained by Xinying Fang. Last updated 2 years ago.

0.5 match 1 stars 2.00 score 2 scripts