Showing 54 of total 54 results
end-to-end-provenance
provSummarizeR:Summarizes Provenance Related to Inputs and Outputs of a Script or Console Commands
Reads the provenance created by the execution of a script or console session and collected by the 'rdtLite' or 'rdt' packages (or other tools producing compatible PROV JSON output), and provides a human-readable summary identifying the input and output files, the scripts used (if any), the errors and warnings produced, and the environment in which the code was executed. It can also optionally package all the files into a zip file. The exact format of the PROV JSON file created by 'rdtLite' and 'rdt' is described in <https://github.com/End-to-end-provenance/ExtendedProvJson>. More information about 'rdtLite' and associated tools is available at <https://github.com/End-to-end-provenance/> and in Lerner, Boose, and Perez (2018), Using Introspection to Collect Provenance in R, Informatics, <doi:10.3390/informatics5010012>.
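For orientation, a minimal sketch of a typical call, assuming the package's documented prov.summarize() family of functions:
    library(rdtLite)
    prov.run("analysis.R")             # collect provenance while running a script
    library(provSummarizeR)
    prov.summarize()                   # summarize the most recently collected provenance
    # prov.summarize.file("prov.json") reads a saved PROV JSON file instead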
Maintained by Emery Boose. Last updated 3 years ago.
19.1 match 4.18 score 7 scripts
end-to-end-provenance
rdtLite:Provenance Collector
Defines functions that can be used to collect provenance as an 'R' script executes or during a console session. The output is a text file in 'PROV-JSON' format.
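A minimal sketch of the two documented usage modes (script execution and console session):
    library(rdtLite)
    # Mode 1: run a script while collecting provenance
    prov.run("analysis.R")
    # Mode 2: collect provenance for an interactive console session
    prov.init()
    x <- rnorm(100)
    summary(x)
    prov.quit()    # ends collection and writes the PROV-JSON file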
Maintained by Barbara Lerner. Last updated 3 years ago.
21.6 match 2 stars 3.56 score 36 scripts
billdenney
PKNCA:Perform Pharmacokinetic Non-Compartmental Analysis
Compute standard Non-Compartmental Analysis (NCA) parameters for typical pharmacokinetic analyses and summarize them.
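A minimal sketch of the standard workflow, assuming hypothetical concentration (d_conc) and dosing (d_dose) data frames:
    library(PKNCA)
    conc_obj <- PKNCAconc(d_conc, conc ~ time | subject)   # concentration-time data
    dose_obj <- PKNCAdose(d_dose, dose ~ time | subject)   # dosing data
    data_obj <- PKNCAdata(conc_obj, dose_obj)              # combine, with default intervals
    results  <- pk.nca(data_obj)                           # compute AUC, Cmax, half-life, ...
    summary(results)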
Maintained by Bill Denney. Last updated 16 days ago.
nca, noncompartmental-analysis, pharmacokinetics
5.4 match 73 stars 12.61 score 214 scripts 4 dependents
cnathe
Rlabkey:Data Exchange Between R and 'LabKey' Server
The 'LabKey' client library for R makes it easy for R users to load live data from a 'LabKey' Server, <https://www.labkey.com/>, into the R environment for analysis, provided users have permissions to read the data. It also enables R users to insert, update, and delete records stored on a 'LabKey' Server, provided they have appropriate permissions to do so.
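A minimal sketch of reading live data; the server, folder, schema, and query names are placeholders:
    library(Rlabkey)
    rows <- labkey.selectRows(
      baseUrl    = "https://myserver.labkey.com",   # placeholder server
      folderPath = "/MyProject",
      schemaName = "lists",
      queryName  = "Demographics"
    )
    head(rows)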
Maintained by Cory Nathe. Last updated 3 days ago.
15.8 match 4.25 score 388 scripts 1 dependents
cboettig
neonstore:NEON Data Store
The National Ecological Observatory Network (NEON) provides access to its numerous data products through its REST API, <https://data.neonscience.org/data-api/>. This package provides a high-level user interface for downloading and storing NEON data products. Unlike 'neonUtilities', this package will avoid repeated downloading, provides persistent storage, and improves performance. 'neonstore' can also construct a local 'duckdb' database of stacked tables, making it possible to work with tables that are far too big to fit into memory.
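A minimal sketch; the product id and table name below are illustrative values, not defaults:
    library(neonstore)
    neon_download(product = "DP1.10003.001")      # breeding landbird counts (example id)
    neon_index()                                  # inventory of locally stored files
    birds <- neon_read("brd_countdata-expanded")  # read a stacked table (example name)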
Maintained by Carl Boettiger. Last updated 11 months ago.
database, ecology, neon-data, provenance
10.0 match 9 stars 6.67 score 143 scripts 11 dependents
end-to-end-provenance
provViz:Provenance Visualizer
Displays provenance graphically for provenance collected by the 'rdt' or 'rdtLite' packages, or other tools providing compatible PROV JSON output. The exact format of the JSON created by 'rdt' and 'rdtLite' is described in <https://github.com/End-to-end-provenance/ExtendedProvJson>. More information about rdtLite and associated tools is available at <https://github.com/End-to-end-provenance/> and Barbara Lerner, Emery Boose, and Luis Perez (2018), Using Introspection to Collect Provenance in R, Informatics, <doi: 10.3390/informatics5010012>.
Maintained by Barbara Lerner. Last updated 3 years ago.
18.1 match 3.48 score 2 scripts 1 dependents
end-to-end-provenance
provTraceR:Uses Provenance to Trace File Lineage for One or More R Scripts
Uses provenance collected by the 'rdtLite' package or a comparable tool to display information about input files, output files, and exchanged files for a single R script or a series of R scripts.
Maintained by Emery Boose. Last updated 5 years ago.
16.7 match 3.70 score 4 scripts
dataobservatory-eu
dataset:Create Data Frames that are Easier to Exchange and Reuse
The aim of the 'dataset' package is to make tidy datasets easier to release, exchange and reuse. It organizes and formats data frame 'R' objects into well-referenced, well-described, interoperable datasets in a release- and reuse-ready form.
Maintained by Daniel Antal. Last updated 20 days ago.
7.9 match 15 stars 7.81 score 76 scripts 1 dependents
blernermhc
provDebugR:A Time-Travelling Debugger
Uses provenance post-execution to help the user understand and debug their script by providing functions to look at intermediate steps and data values, their forwards and backwards lineage, and to understand the steps leading up to warning and error messages. 'provDebugR' uses provenance produced by 'rdtLite' (available on CRAN), stored in PROV-JSON format.
Maintained by Barbara Lerner. Last updated 4 years ago.
16.0 match 3.64 score 22 scripts
end-to-end-provenance
provExplainR:Compare Provenance Collections to Explain Changed Script Outputs
Inspects provenance collected by the 'rdt' or 'rdtLite' packages, or other tools providing compatible PROV JSON output created by the execution of a script, and finds differences between two provenance collections. Factors under examination include the hardware and software used to execute the script, versions of attached libraries, use of global variables, modified inputs and outputs, and changes in main and sourced scripts. Based on the detected changes, 'provExplainR' can be used to study how these factors affect the behavior of the script and to suggest likely causes of differing script results. More information about 'rdtLite' and associated tools is available at <https://github.com/End-to-end-provenance/> and in Barbara Lerner, Emery Boose, and Luis Perez (2018), Using Introspection to Collect Provenance in R, Informatics, <doi:10.3390/informatics5010012>.
Maintained by Barbara Lerner. Last updated 3 years ago.
19.2 match 3.00 score 8 scripts
emitanaka
edibble:Encapsulating Elements of Experimental Design
A system to facilitate designing comparative (and non-comparative) experiments using the grammar of experimental designs <https://emitanaka.org/edibble-book/>. An experimental design is treated as an intermediate, mutable object that is built progressively by fundamental experimental components like units, treatments, and their relation. The system aids in experimental planning, management and workflow.
Maintained by Emi Tanaka. Last updated 4 months ago.
6.6 match 217 stars 7.43 score 62 scripts
ropensci
bowerbird:Keep a Collection of Sparkly Data Resources
Tools to get and maintain a data repository from third-party data providers.
Maintained by Ben Raymond. Last updated 5 days ago.
ropensci, antarctic, southern ocean, data, environmental, satellite, climate, peer-reviewed
6.3 match 50 stars 7.16 score 16 scripts 1 dependents
tesselle
nexus:Sourcing Archaeological Materials by Chemical Composition
Exploration and analysis of compositional data in the framework of Aitchison (1986, ISBN: 978-94-010-8324-9). This package provides tools for chemical fingerprinting and source tracking of ancient materials.
Maintained by Nicolas Frerebeau. Last updated 12 days ago.
archaeology, archaeological-science, archaeometry, compositional-data, provenance-studies
7.5 match 5.21 score 26 scripts 1 dependents
ropensci
rdataretriever:R Interface to the Data Retriever
Provides an R interface to the Data Retriever <https://retriever.readthedocs.io/en/latest/> via the Data Retriever's command line interface. The Data Retriever automates the tasks of finding, downloading, and cleaning public datasets, and then stores them in a local database.
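A minimal sketch (the Data Retriever command-line tool must already be installed); 'iris' is a placeholder dataset name:
    library(rdataretriever)
    rdataretriever::datasets()                     # list datasets known to the Retriever
    iris_tables <- rdataretriever::fetch("iris")   # download, clean, and load into R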
Maintained by Henry Senyondo. Last updated 8 months ago.
data, data-science, database, datasets, science
4.8 match 46 stars 7.70 score 36 scripts
canmod
iidda.api:IIDDA API
R Bindings for the IIDDA API.
Maintained by Steve Walker. Last updated 4 months ago.
6.0 match 5.29 score 10 scripts
ropensci
drake:A Pipeline Toolkit for Reproducible Computation at Scale
A general-purpose computational engine for data analysis, drake rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date. Not every execution starts from scratch: there is native support for parallel and distributed computing, and completed projects have tangible evidence that they are reproducible. Extensive documentation, from beginner-friendly tutorials to practical examples and more, is available at the reference website <https://docs.ropensci.org/drake/> and the online manual <https://books.ropensci.org/drake/>.
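A minimal sketch of a drake plan; the file and variable names are placeholders:
    library(drake)
    plan <- drake_plan(
      raw  = read.csv(file_in("data.csv")),   # file_in() registers the file dependency
      fit  = lm(y ~ x, data = raw),
      summ = summary(fit)
    )
    make(plan)   # builds all targets; a second make(plan) skips up-to-date work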
Maintained by William Michael Landau. Last updated 3 months ago.
data-science, drake, high-performance-computing, makefile, peer-reviewed, pipeline, reproducibility, reproducible-research, ropensci, workflow
2.0 match 1.3k stars 11.49 score 1.7k scripts 1 dependents
bioc
HuBMAPR:Interface to 'HuBMAP'
'HuBMAP' provides an open, global bio-molecular atlas of the human body at the cellular level. The `datasets()`, `samples()`, `donors()`, `publications()`, and `collections()` functions retrieve the information for each of these entity types. `*_details()` functions are available for individual entries of each entity type. `*_derived()` functions are available for retrieving derived datasets or samples for individual entries of each entity type. Data files can be accessed using `bulk_data_transfer()`.
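A minimal sketch using the entity-level functions named above:
    library(HuBMAPR)
    ds  <- datasets()    # one row per HuBMAP dataset
    smp <- samples()     # sample-level metadata
    dnr <- donors()      # donor-level metadata
    # *_details() and *_derived() variants drill into a single entry,
    # and bulk_data_transfer() retrieves the underlying data files.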
Maintained by Christine Hou. Last updated 1 month ago.
software, singlecell, dataimport, thirdpartyclient, spatial, infrastructure, bioconductor-package, client, hubmap, rstudio
3.8 match 3 stars 5.80 score 1 scripts
blernermhc
provParseR:Pulls Information from Prov.Json Files
R functions to access provenance information collected by 'rdt' or 'rdtLite'. The information is stored inside a 'ProvInfo' object and can be accessed through a collection of functions that will return the requested data. The exact format of the JSON created by 'rdt' and 'rdtLite' is described in <https://github.com/End-to-end-provenance/ExtendedProvJson>.
Maintained by Barbara Lerner. Last updated 3 years ago.
5.0 match 3.20 score 21 scripts 5 dependents
bioc
BiocPkgTools:Collection of simple tools for learning about Bioconductor Packages
Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages.
Maintained by Sean Davis. Last updated 12 days ago.
software, infrastructure, bioconductor, metadata
2.0 match 21 stars 7.67 score 68 scripts
canmod
iidda:Processing Infectious Disease Datasets in IIDDA.
Part of an open toolchain for processing infectious disease datasets available through the IIDDA data repository.
Maintained by Steve Walker. Last updated 4 months ago.
2.3 match 6.07 score 133 scripts 3 dependents
ropensci
EDIutils:An API Client for the Environmental Data Initiative Repository
A client for the Environmental Data Initiative repository REST API. The 'EDI' data repository <https://portal.edirepository.org/nis/home.jsp> is for publication and reuse of ecological data with emphasis on metadata accuracy and completeness. It is built upon the 'PASTA+' software stack <https://pastaplus-core.readthedocs.io/en/latest/index.html#> and was developed in collaboration with the US 'LTER' Network <https://lternet.edu/>. 'EDIutils' includes functions to search and access existing data, evaluate and upload new data, and assist other data management tasks common to repository users.
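A minimal sketch of searching and reading data; the query string and package identifier are placeholders:
    library(EDIutils)
    res  <- search_data_packages(query = 'q="water+temperature"&fl=packageid,title')
    ents <- read_data_entity_names(packageId = "edi.100.1")   # placeholder id
    raw  <- read_data_entity(packageId = "edi.100.1",
                             entityId  = ents$entityId[1])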
Maintained by Colin Smith. Last updated 1 year ago.
ecology, eml-metadata, open-access, open-data, research-data-management, research-data-repository
2.0 match 10 stars 6.47 score 117 scripts
bioc
ndexr:NDEx R client library
This package offers an interface to NDEx servers, e.g. the public server at http://ndexbio.org/. It can retrieve and save networks via the API. Networks are offered as RCX objects and as igraph representations.
Maintained by Florian Auer. Last updated 5 months ago.
2.0 match 9 stars 6.44 score 38 scripts
terminological
dtrackr:Track your Data Pipelines
Track and document 'dplyr' data pipelines. As you filter, mutate, and join your way through a data set, 'dtrackr' seamlessly keeps track of your data flow and makes publication-ready documentation of a data pipeline simple.
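A minimal sketch, assuming the track()/comment()/flowchart() verbs from the package documentation:
    library(dplyr)
    library(dtrackr)
    out <- iris %>%
      track() %>%                              # start recording the pipeline
      comment("{.count} rows at start") %>%
      filter(Species != "setosa") %>%
      comment("{.count} rows after filter")
    out %>% flowchart()                        # render the recorded steps as a flowchart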
Maintained by Robert Challen. Last updated 5 months ago.
1.3 match 69 stars 8.78 score 362 scripts 1 dependents
guillaumepressiat
pmeasyr:PMSI Data with R
Import of PMSI data. Archive management. Data formats from 2011 onwards. Connection and interface with a database. 'requetr'. Valuation of RSA and RAPSS records.
Maintained by Guillaume Pressiat. Last updated 13 days ago.
1.6 match 20 stars 6.76 score 53 scripts
ropensci
datapack:A Flexible Container to Transport and Manipulate Data and Associated Resources
Provides a flexible container to transport and manipulate complex sets of data. These data may consist of multiple data files and associated metadata and ancillary files. Individual data objects have associated system-level metadata, and data files are linked together using the OAI-ORE standard resource map, which describes the relationships between the files. The OAI-ORE standard is described at <https://www.openarchives.org/ore/>. Data packages can be serialized and transported as structured files that have been created following the BagIt specification. The BagIt specification is described at <https://tools.ietf.org/html/draft-kunze-bagit-08>.
Maintained by Matthew B. Jones. Last updated 3 years ago.
1.2 match 44 stars 8.56 score 195 scripts 4 dependents
green-striped-gecko
dartR.captive:Analysing 'SNP' Data to Support Captive Breeding
Functions are provided that facilitate the analysis of SNP (single nucleotide polymorphism) data to answer questions regarding captive breeding and relatedness between individuals. 'dartR.captive' is part of the 'dartRverse' suite of packages. Gruber et al. (2018) <doi:10.1111/1755-0998.12745>. Mijangos et al. (2022) <doi:10.1111/2041-210X.13918>.
Maintained by Bernd Gruber. Last updated 27 days ago.
5.0 match 1 stars 2.00 score 3 scripts
predictiveecology
LandR:Landscape Ecosystem Modelling in R
Utilities for 'LandR' suite of landscape simulation models. These models simulate forest vegetation dynamics based on LANDIS-II, and incorporate fire and insect disturbance, as well as other important ecological processes. Models are implemented as 'SpaDES' modules.
Maintained by Eliot J B McIntire. Last updated 4 days ago.
ecological-modelling, landscape-ecosystem-modelling, spades
1.7 match 17 stars 6.07 score 12 scripts 4 dependents
nceas
metajam:Easily Download Data and Metadata from 'DataONE'
A set of tools to foster the development of reproducible analytical workflows by simplifying the download of data and metadata from 'DataONE' (<https://www.dataone.org>) and easily importing this information into R.
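A minimal sketch; the data URL is a placeholder for a real DataONE object URL:
    library(metajam)
    pkg_dir <- download_d1_data(
      data_url = "https://cn.dataone.org/cn/v2/resolve/urn:uuid:example",  # placeholder
      path     = tempdir()
    )
    obj <- read_d1_files(pkg_dir)   # list holding the data and its metadata together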
Maintained by Julien Brun. Last updated 7 months ago.
data, data-analysis, metadata, repositories
1.1 match 16 stars 8.21 score 75 scripts
bristol-vaccine-centre
avoncap:AvonCap Study Analysis
A work-in-progress set of functions for loading and wrangling the AvonCap data set.
Maintained by Rob Challen. Last updated 3 months ago.
3.6 match 2.34 score 11 scripts
nlmixr2
lbfgsb3c:Limited Memory BFGS Minimizer with Bounds on Parameters with optim() 'C' Interface
Interfacing to Nocedal et al.'s L-BFGS-B.3.0 (see <http://users.iems.northwestern.edu/~nocedal/lbfgsb.html>) limited-memory BFGS minimizer with bounds on parameters. This is a fork of 'lbfgsb3'. It registers an 'R'-compatible 'C' interface to L-BFGS-B.3.0 that uses the same function types and optimization as the optim() function (see Writing R Extensions and the source for details). This package also adds more stopping criteria and allows the adjustment of more tolerances.
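A minimal sketch of the optim()-style calling convention, minimizing the Rosenbrock function:
    library(lbfgsb3c)
    rosen <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
    res <- lbfgsb3c(par = c(-1.2, 1), fn = rosen,
                    lower = c(-2, -2), upper = c(2, 2))
    res$par   # should be close to c(1, 1)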
Maintained by Matthew L Fidler. Last updated 6 months ago.
1.1 match 1 stars 7.33 score 17 scripts 16 dependents
famuvie
breedR:Statistical Methods for Forest Genetic Resources Analysts
Statistical tools to build predictive models for the breeders' community. It aims to assess the genetic value of individuals under a number of situations, including spatial autocorrelation, genetic/environment interaction and competition. It is under active development as part of the Trees4Future project, developed particularly with forest genetic trials in mind, but it can be used for animals or other situations as well.
Maintained by Facundo Muñoz. Last updated 8 months ago.
1.3 match 33 stars 5.44 score 24 scripts
sbg
biocompute:Create and Manipulate BioCompute Objects
Tools to create, validate, and export BioCompute Objects described in King et al. (2019) <doi:10.17605/osf.io/h59uh>. Users can encode information in data frames, and compose BioCompute Objects from the domains defined by the standard. A checksum validator and a JSON schema validator are provided. This package also supports exporting BioCompute Objects as JSON, PDF, HTML, or 'Word' documents, and exporting to cloud-based platforms.
Maintained by Soner Koc. Last updated 9 months ago.
biocompute, biocompute-objects, bioinformatics, science-communication, sevenbridges, standardization, workflow
1.7 match 3 stars 4.07 score 13 scripts
blernermhc
provGraphR:Creates Adjacency Matrices for Lineage Searches
Creates and manages a provenance graph corresponding to the provenance created by the 'rdtLite' package, which collects provenance from R scripts. 'rdtLite' is available on CRAN. The provenance format is an extension of the W3C PROV JSON format (<https://www.w3.org/Submission/2013/SUBM-prov-json-20130424/>). The extended JSON provenance format is described in <https://github.com/End-to-end-provenance/ExtendedProvJson>.
Maintained by Barbara Lerner. Last updated 3 years ago.
3.1 match 2.18 score 4 scripts 1 dependents
ropensci
DataPackageR:Construct Reproducible Analytic Data Sets as R Packages
A framework to help construct R data packages in a reproducible manner. Potentially time consuming processing of raw data sets into analysis ready data sets is done in a reproducible manner and decoupled from the usual 'R CMD build' process so that data sets can be processed into R objects in the data package and the data package can then be shared, built, and installed by others without the need to repeat computationally costly data processing. The package maintains data provenance by turning the data processing scripts into package vignettes, as well as enforcing documentation and version checking of included data objects. Data packages can be version controlled on 'GitHub', and used to share data for manuscripts, collaboration and reproducible research.
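A minimal sketch of the scaffold-and-build cycle; the script and object names are placeholders:
    library(DataPackageR)
    datapackage_skeleton(
      name = "MyData", path = tempdir(),
      code_files     = "process_raw.R",   # processing script, turned into a vignette
      r_object_names = "mydata"           # object(s) the script must create
    )
    package_build(file.path(tempdir(), "MyData"))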
Maintained by Dave Slager. Last updated 6 months ago.
0.5 match 156 stars 9.38 score 72 scripts
philipmostert
PointedSDMs:Fit Models Derived from Point Processes to Species Distributions using 'inlabru'
Integrated species distribution modeling is a rising field in quantitative ecology thanks to significant rises in the quantity of data available, increases in computational speed, and the proven benefits of using such models. Despite this, general software to help ecologists construct such models in an easy-to-use framework is lacking. We therefore introduce the R package 'PointedSDMs', which provides the tools to help ecologists set up integrated models and perform inference on them. There are also functions within the package to help run spatial cross-validation for model selection, as well as generic plotting and predicting functions. An introduction to these methods is discussed in Isaac, Jarzyna, Keil, Dambly, Boersch-Supan, Browning, Freeman, Golding, Guillera-Arroita, Henrys, Jarvis, Lahoz-Monfort, Pagel, Pescott, Schmucki, Simmonds and O'Hara (2020) <doi:10.1016/j.tree.2019.08.006>.
Maintained by Philip Mostert. Last updated 2 months ago.
0.5 match 25 stars 8.57 score 50 scripts 1 dependents
gmbecker
switchr:Installing, Managing, and Switching Between Distinct Sets of Installed Packages
Provides an abstraction for managing, installing, and switching between sets of installed R packages. This allows users to maintain multiple package libraries simultaneously, e.g. to maintain strict, package-version-specific reproducibility of many analyses, or work within a development/production release paradigm. Introduces a generalized package installation process which supports multiple repository and non-repository sources and tracks package provenance.
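A minimal sketch; 'projectA' is a placeholder library name:
    library(switchr)
    switchTo("projectA")   # create (if needed) and switch to an isolated library
    # ...install and use packages pinned for project A...
    switchBack()           # restore the previously active library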
Maintained by Gabriel Becker. Last updated 2 years ago.
0.5 match 59 stars 6.49 score 52 scripts
ropensci
dendroNetwork:Create Networks of Dendrochronological Series using Pairwise Similarity
Creating dendrochronological networks based on the similarity between tree-ring series or chronologies. The package includes various functions to compare tree-ring curves building upon the 'dplR' package. The networks can be used to visualise and understand the relations between tree-ring curves. These networks are also very useful to estimate the provenance of wood as described in Visser (2021) <DOI:10.5334/jcaa.79> or wood-use within a structure/context/site as described in Visser and Vorst (2022) <DOI:10.1163/27723194-bja10014>.
Maintained by Ronald Visser. Last updated 1 month ago.
visualization, graphandnetwork, thirdpartyclient, network, archaeology, dendrochronology, dendroprovenance, network-analysis, tree-rings
0.5 match 7 stars 6.05 score 9 scripts
bioc
ppcseq:Probabilistic Outlier Identification for RNA Sequencing Generalized Linear Models
Relative transcript abundance has proven to be a valuable tool for understanding the function of genes in biological systems. For the differential analysis of transcript abundance using RNA sequencing data, the negative binomial model is by far the most frequently adopted. However, common methods that are based on a negative binomial model are not robust to extreme outliers, which we found to be abundant in public datasets. So far, no rigorous and probabilistic methods for detection of outliers have been developed for RNA sequencing data, leaving the identification mostly to visual inspection. Recent advances in Bayesian computation allow large-scale comparison of observed data against its theoretical distribution given in a statistical model. Here we propose ppcseq, a key quality-control tool for identifying transcripts that include outlier data points in differential expression analysis, which do not follow a negative binomial distribution. Applying ppcseq to analyse several publicly available datasets using popular tools, we show that from 3 to 10 percent of differentially abundant transcripts across algorithms and datasets had statistics inflated by the presence of outliers.
Maintained by Stefano Mangiola. Last updated 5 months ago.
rnaseq, differentialexpression, geneexpression, normalization, clustering, qualitycontrol, sequencing, transcription, transcriptomics, bayesian-inference, deseq2, edger, negative-binomial, outliers, stan, cpp
0.5 match 7 stars 5.65 score 16 scripts
larsvancutsem
piratings:Calculate Pi Ratings for Teams Competing in Sport Matches
Calculate and optimize dynamic performance ratings of association football teams competing in matches, in accordance with the method used in the research paper "Determining the level of ability of football teams by dynamic ratings based on the relative discrepancies in scores between adversaries", by Dr. Constantinou and Dr. Fenton. This dynamic rating system has proven to provide superior results for predicting association football outcomes. The research paper can be found at <http://www.constantinou.info/downloads/papers/pi-ratings.pdf>.
Maintained by Lars Van Cutsem. Last updated 6 years ago.
0.5 match 13 stars 5.29 score 9 scripts
tidylab
R6P:Design Patterns in R
Build robust and maintainable software with object-oriented design patterns in R. Design patterns abstract and present in neat, well-defined components and interfaces the experience of many software designers and architects over many years of solving similar problems. These are solutions that have withstood the test of time with respect to re-usability, flexibility, and maintainability. 'R6P' provides abstract base classes with examples for a few known design patterns. The patterns were selected by their applicability to analytic projects in R. Using these patterns in R projects has proven effective in dealing with the complexity that data-driven applications possess.
Maintained by Harel Lustiger. Last updated 3 months ago.
0.5 match 10 stars 4.88 score 2 scripts 5 dependents
fairdatapipeline
rDataPipeline:Functions to Interact with the 'FAIR Data Pipeline'
R implementation of the 'FAIR Data Pipeline API'. The 'FAIR Data Pipeline' is intended to enable tracking of provenance of FAIR (findable, accessible, interoperable and reusable) data used in epidemiological modelling.
Maintained by Ryan Field. Last updated 3 months ago.
0.5 match 4 stars 4.52 score 11 scripts
krashkov
pcSteiner:Convenient Tool for Solving the Prize-Collecting Steiner Tree Problem
The Prize-Collecting Steiner Tree problem asks for a subgraph connecting a given set of vertices that includes the most expensive nodes and the least expensive edges. Since the problem is proven to be NP-hard, no exact and efficient algorithm exists. This package provides convenient functionality for obtaining an approximate solution to this problem using a loopy belief propagation algorithm.
Maintained by Aleksei Krasikov. Last updated 5 years ago.
graph-algorithms, r-language, steiner-tree, steiner-tree-problem
0.5 match 2 stars 4.00 score 3 scripts
cboettig
taxalight:A Lightweight and Lightning-Fast Taxonomic Naming Interface
Creates a local Lightning Memory-Mapped Database ('LMDB') of many commonly used taxonomic authorities and provides functions that can quickly query this data. Supported taxonomic authorities include the Integrated Taxonomic Information System ('ITIS'), National Center for Biotechnology Information ('NCBI'), Global Biodiversity Information Facility ('GBIF'), Catalogue of Life ('COL'), and Open Tree Taxonomy ('OTT'). Name and identifier resolution using 'LMDB' can be hundreds of times faster than either relational databases or internet-based queries. Precise data provenance information for data derived from naming providers is also included.
Maintained by Carl Boettiger. Last updated 4 years ago.
0.5 match 5 stars 3.40 score 4 scripts
wpihongzhang
TFisher:Optimal Thresholding Fisher's P-Value Combination Method
We provide the cumulative distribution function (CDF), quantile, and statistical power calculator for a collection of thresholding Fisher's p-value combination methods, including Fisher's p-value combination method, the truncated product method and, in particular, the soft-thresholding Fisher's p-value combination method, which is proven to be optimal in some contexts of signal detection. The p-value calculator for the omnibus version of these tests is also included. For reference, please see Hong Zhang and Zheyang Wu, "TFisher Tests: Optimal and Adaptive Thresholding for Combining p-Values", submitted.
Maintained by Hong Zhang. Last updated 7 years ago.
0.5 match 3.34 score 18 scripts 15 dependents
korydjohnson
rai:Revisiting-Alpha-Investing for Polynomial Regression
A modified implementation of stepwise regression that greedily searches the space of interactions among features in order to build polynomial regression models. Furthermore, the hypothesis tests conducted are valid post model selection due to the use of a revisiting procedure that implements an alpha-investing rule. As a result, the set of rejected sequential hypotheses is proven to control the marginal false discovery rate. When not searching for polynomials, the package provides a statistically valid algorithm to run and terminate stepwise regression. For more information, see Johnson, Stine, and Foster (2019) <arXiv:1510.06322>.
Maintained by Kory D. Johnson. Last updated 3 years ago.
0.5 match 3 stars 3.18 score 7 scripts
audreh
mergeTrees:Aggregating Trees
Aggregates a set of trees with the same leaves to create a consensus tree. The trees are typically obtained via hierarchical clustering, hence the hclust format is used to encode both the aggregated trees and the final consensus tree. The method is exact and proven to be O(nq log(n)), where n is the number of individuals and q the number of trees to aggregate.
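A minimal sketch, assuming the package's main mergeTrees() entry point applied to a list of hclust objects:
    library(mergeTrees)
    # Two clusterings of the same individuals, from different variable subsets
    h1 <- hclust(dist(iris[, 1:2]))
    h2 <- hclust(dist(iris[, 3:4]))
    cons <- mergeTrees(list(h1, h2))   # consensus tree, returned as hclust
    plot(cons)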
Maintained by Audrey Hulot. Last updated 6 years ago.
0.5 match 2 stars 3.00 score 7 scripts
ivanlizaga
fingerPro:Sediment Source Fingerprinting
Quantifies the provenance of the sediments in a catchment or study area. Based on a comprehensive characterization of the sediment sources and the end sediment mixtures, a mixing model algorithm is applied to the sediment mixtures in order to estimate the relative contribution of each potential source. The package includes several statistical methods, such as the Kruskal-Wallis test, discriminant function analysis ('DFA'), and principal component plots ('PCA'), to select the optimal subset of tracer properties. The variability within each sediment source is also considered to estimate the statistical distribution of the sources' contributions.
Maintained by Ivan Lizaga. Last updated 7 years ago.
0.5 match 1.11 score 13 scripts
ghahn-hsph
fastOnlineCpt:Online Multivariate Changepoint Detection
Implementation of a simple algorithm designed for online multivariate changepoint detection of a mean in sparse changepoint settings. The algorithm is based on a modified cusum statistic and guarantees control of the type I error for any false discoveries, while featuring O(1) time and O(1) memory updates per series as well as a proven detection delay.
Maintained by Georg Hahn. Last updated 4 years ago.
0.5 match 1.00 score 8 scripts
aurora-torrente
briKmeans:Package for Brik, Fabrik and Fdebrik Algorithms to Initialise Kmeans
Implementation of the BRIk, FABRIk and FDEBRIk algorithms to initialise k-means. These methods are intended for the clustering of multivariate and functional data, respectively. They make use of the Modified Band Depth and bootstrap to identify appropriate initial seeds for k-means, which are proven to be better options than many techniques in the literature; see Torrente and Romo (2021) <doi:10.1007/s00357-020-09372-3>. It makes use of the functions kma and kma.similarity from the archived package fdakma, by Alice Parodi et al.
Maintained by Aurora Torrente. Last updated 3 years ago.
0.5 match 1.00 score
cran
kpcaIG:Variables Interpretability with Kernel PCA
The kernelized version of principal component analysis (KPCA) has proven to be a valid nonlinear alternative for tackling the nonlinearity of biological sample spaces. However, it poses new challenges in terms of the interpretability of the original variables. 'kpcaIG' aims to provide a tool to select the most relevant variables based on the kernel PCA representation of the data, as in Briscik et al. (2023) <doi:10.1186/s12859-023-05404-y>. It also includes functions for 2D and 3D visualization of the original variables (as arrows) onto the kernel principal component axes, highlighting the contribution of the most important ones.
Maintained by Mitja Briscik. Last updated 8 months ago.
0.5 match 1 stars 1.00 score