R-universe search: jaccard

ncchung

jaccard:Testing similarity between binary datasets using Jaccard/Tanimoto coefficients

Calculate statistical significance of Jaccard/Tanimoto similarity coefficients.

Maintained by Neo Christopher Chung. Last updated 5 years ago.

binary-data hypothesis-testing jaccard similarity statistics tanimoto cpp

96.5 match 5 stars 5.03 score 85 scripts
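
For orientation, the Jaccard/Tanimoto coefficient of two binary vectors is the number of positions where both are 1 divided by the number of positions where at least one is 1. The sketch below is a minimal Python illustration only: the function name `jaccard_similarity` is hypothetical and is not the 'jaccard' package's API, and returning 0.0 for two all-zero vectors is an assumed convention. It does not perform the significance testing that the package provides.

```python
from typing import Sequence


def jaccard_similarity(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard/Tanimoto coefficient of two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Positions where both vectors are 1 (size of the intersection).
    both = sum(1 for a, b in zip(x, y) if a and b)
    # Positions where at least one vector is 1 (size of the union).
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumed convention: two all-zero vectors get similarity 0.0.
    return both / either if either else 0.0


print(jaccard_similarity([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))  # 0.5
```

Here two positions are shared and four positions are set in at least one vector, giving 2/4.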

drostlab

philentropy:Similarity and Distance Quantification Between Probability Functions

Computes 46 optimized distance and similarity measures for comparing probability functions (Drost (2018) <doi:10.21105/joss.00765>). These comparisons between probability functions have their foundations in a broad range of scientific disciplines from mathematics to ecology. The aim of this package is to provide a core framework for clustering, classification, statistical inference, goodness-of-fit, non-parametric statistics, information theory, and machine learning tasks that are based on comparing univariate or multivariate probability functions.

Maintained by Hajk-Georg Drost. Last updated 4 months ago.

distance-measures distance-quantification information-theory jensen-shannon-divergence parametric-distributions similarity-measures statistics cpp

5.2 match 137 stars 12.44 score 484 scripts 24 dependents

stuart-lab

Signac:Analysis of Single-Cell Chromatin Data

A framework for the analysis and exploration of single-cell chromatin data. The 'Signac' package contains functions for quantifying single-cell chromatin data, computing per-cell quality control metrics, dimension reduction and normalization, visualization, and DNA sequence motif analysis. Reference: Stuart et al. (2021) <doi:10.1038/s41592-021-01282-5>.

Maintained by Tim Stuart. Last updated 7 months ago.

atac bioinformatics single-cell zlib cpp

5.1 match 355 stars 12.18 score 3.7k scripts 1 dependent

eagerai

fastai:Interface to 'fastai'

The 'fastai' <https://docs.fast.ai/index.html> library simplifies training fast and accurate neural networks using modern best practices. It is based on research into deep learning best practices undertaken at 'fast.ai', including 'out of the box' support for vision, text, tabular, audio, time series, and collaborative filtering models.

Maintained by Turgut Abdullayev. Last updated 12 months ago.

audio collaborative-filtering darknet darknet-image-classification fastai medical object-detection tabular text vision

6.6 match 118 stars 9.40 score 76 scripts

thie1e

cutpointr:Determine and Evaluate Optimal Cutpoints in Binary Classification Tasks

Estimate cutpoints that optimize a specified metric in binary classification tasks and validate performance using bootstrapping. Some methods for more robust cutpoint estimation are supported, e.g. a parametric method assuming normal distributions, bootstrapped cutpoints, and smoothing of the metric values per cutpoint using Generalized Additive Models. Various plotting functions are included. For an overview of the package see Thiele and Hirschfeld (2021) <doi:10.18637/jss.v098.i11>.

Maintained by Christian Thiele. Last updated 4 months ago.

bootstrapping cutpoint-optimization roc-curve cpp

5.3 match 88 stars 10.44 score 322 scripts 1 dependent

serkor1

SLmetrics:Machine Learning Performance Evaluation on Steroids

Performance evaluation metrics for supervised and unsupervised machine learning, statistical learning and artificial intelligence applications. Core computations are implemented in 'C++' for scalability and efficiency.

Maintained by Serkan Korkmaz. Last updated 4 days ago.

cpp data-analysis data-science eigen3 machine-learning performance-metrics rcpp rcppeigen statistics supervised-learning cpp openmp

7.8 match 22 stars 6.56 score

bernd-mueller

epos:Epilepsy Ontologies' Similarities

Analysis and visualization of similarities between epilepsy ontologies based on text mining results by comparing ranked lists of co-occurring drug terms in the BioASQ corpus. The ranked result lists of neurological drug terms co-occurring with terms from the epilepsy ontologies EpSO, ESSO, EPILONT, EPISEM and FENICS undergo further analysis. The source data to create the ranked lists of drug names is produced using the text mining workflows described in Mueller, Bernd and Hagelstein, Alexandra (2016) <doi:10.4126/FRL01-006408558>, Mueller, Bernd et al. (2017) <doi:10.1007/978-3-319-58694-6_22>, Mueller, Bernd and Rebholz-Schuhmann, Dietrich (2020) <doi:10.1007/978-3-030-43887-6_52>, and Mueller, Bernd et al. (2022) <doi:10.1186/s13326-021-00258-w>.

Maintained by Bernd Mueller. Last updated 1 year ago.

11.8 match 4.03 score 53 scripts

igraph

igraph:Network Analysis and Visualization

Routines for simple graphs and network analysis. It can handle large graphs very well and provides functions for generating random and regular graphs, graph visualization, centrality methods and much more.

Maintained by Kirill Müller. Last updated 6 hours ago.

complex-networks graph-algorithms graph-theory mathematics network-analysis network-graph fortran libxml2 glpk openblas cpp

2.3 match 584 stars 21.13 score 31k scripts 1.9k dependents

pboutros

bedr:Genomic Region Processing using Tools Such as 'BEDTools', 'BEDOPS' and 'Tabix'

Genomic regions processing using open-source command line tools such as 'BEDTools', 'BEDOPS' and 'Tabix'. These tools offer scalable and efficient utilities to perform genome arithmetic, e.g. indexing, formatting and merging. The bedr API enhances access to these tools and offers additional utilities for genomic regions processing.

Maintained by Paul C. Boutros. Last updated 6 years ago.

8.6 match 4.98 score 264 scripts 2 dependents

mw201608

SuperExactTest:Exact Test and Visualization of Multi-Set Intersections

Identification of sets of objects with shared features is a common operation in all disciplines. Analysis of intersections among multiple sets is fundamental for in-depth understanding of their complex relationships. This package implements a theoretical framework for efficient computation of statistical distributions of multi-set intersections based upon combinatorial theory, and provides multiple scalable techniques for visualizing the intersection statistics. The statistical algorithm behind this package was published in Wang et al. (2015) <doi:10.1038/srep16923>.

Maintained by Minghui Wang. Last updated 1 year ago.

intersection set statistics visualization

5.3 match 28 stars 7.47 score 70 scripts 1 dependent

bioc

BiRewire:High-performing routines for the randomization of a bipartite graph (or a binary event matrix), undirected and directed signed graph preserving degree distribution (or marginal totals)

Fast functions for bipartite network rewiring through N consecutive switching steps (see References) and for the computation of the minimal number of switching steps to be performed in order to maximise the dissimilarity with respect to the original network. Includes functions for the analysis of the introduced randomness across the switching steps and several other routines to analyse the resulting networks and their natural projections. Extension to undirected networks and directed signed networks is also provided. Starting from version 1.9.7 a more precise bound (especially for small networks) has been implemented. Starting from version 2.2.0 the analysis routine is more complete and visual monitoring of the underlying Markov Chain has been implemented. Starting from version 3.6.0 the library can also handle matrices with NA (not for the directed signed graphs). Since version 3.27.1 it is possible to add a constraint for dsg generation: typically, having both a positive and a negative arc between two nodes may not be acceptable.

Maintained by Andrea Gobbi. Last updated 5 months ago.

network

8.5 match 4.54 score 35 scripts

ms609

TreeDist:Calculate and Map Distances Between Phylogenetic Trees

Implements measures of tree similarity, including information-based generalized Robinson-Foulds distances (Phylogenetic Information Distance, Clustering Information Distance, Matching Split Information Distance; Smith 2020) <doi:10.1093/bioinformatics/btaa614>; Jaccard-Robinson-Foulds distances (Bocker et al. 2013) <doi:10.1007/978-3-642-40453-5_13>, including the Nye et al. (2006) metric <doi:10.1093/bioinformatics/bti720>; the Matching Split Distance (Bogdanowicz & Giaro 2012) <doi:10.1109/TCBB.2011.48>; Maximum Agreement Subtree distances; the Kendall-Colijn (2016) distance <doi:10.1093/molbev/msw124>, and the Nearest Neighbour Interchange (NNI) distance, approximated per Li et al. (1996) <doi:10.1007/3-540-61332-3_168>. Includes tools for visualizing mappings of tree space (Smith 2022) <doi:10.1093/sysbio/syab100>, for identifying islands of trees (Silva and Wilkinson 2021) <doi:10.1093/sysbio/syab015>, for calculating the median of sets of trees, and for computing the information content of trees and splits.

Maintained by Martin R. Smith. Last updated 2 months ago.

phylogenetics tree-distance phylogenetic-trees tree-distances trees cpp

3.6 match 32 stars 10.32 score 97 scripts 5 dependents

c0webster

fedmatch:Fast, Flexible, and User-Friendly Record Linkage Methods

Provides a flexible set of tools for matching two un-linked data sets. 'fedmatch' allows for three ways to match data: exact matches, fuzzy matches, and multi-variable matches. It also allows an easy combination of these three matches via the tier matching function.

Maintained by Chris Webster. Last updated 2 months ago.

cpp openmp

7.5 match 1 star 4.61 score 80 scripts

snoweye

EMCluster:EM Algorithm for Model-Based Clustering of Finite Mixture Gaussian Distribution

EM algorithms and several efficient initialization methods for model-based clustering of finite mixture Gaussian distribution with unstructured dispersion in both unsupervised and semi-supervised learning.

Maintained by Wei-Chen Chen. Last updated 7 months ago.

openblas

4.5 match 18 stars 7.53 score 123 scripts 2 dependents

chrhennig

prabclus:Functions for Clustering and Testing of Presence-Absence, Abundance and Multilocus Genetic Data

Distance-based parametric bootstrap tests for clustering with spatial neighborhood information. Some distance measures, clustering of presence-absence, abundance and multilocus genetic data for species delimitation, nearest neighbor based noise detection. Genetic distances between communities. Tests whether various distance-based regressions are equal. Try package?prabclus for an overview.

Maintained by Christian Hennig. Last updated 6 months ago.

5.3 match 1 star 6.07 score 90 scripts 70 dependents

koheiw

proxyC:Computes Proximity in Large Sparse Matrices

Computes proximity between rows or columns of large matrices efficiently in C++. Functions are optimised for large sparse matrices using the Armadillo and Intel TBB libraries. Among various built-in similarity/distance measures, computation of correlation, cosine similarity and Euclidean distance is particularly fast.

Maintained by Kohei Watanabe. Last updated 5 months ago.

data-science distance-measures similarity-measures openblas onetbb cpp

3.5 match 28 stars 8.90 score 23 scripts 33 dependents

mlr-org

mlr3:Machine Learning in R - Next Generation

Efficient, object-oriented programming on the building blocks of machine learning. Provides 'R6' objects for tasks, learners, resamplings, and measures. The package is geared towards scalability and larger datasets by supporting parallelization and out-of-memory data-backends like databases. While 'mlr3' focuses on the core computational operations, add-on packages provide additional functionality.

Maintained by Marc Becker. Last updated 20 days ago.

classification data-science machine-learning mlr3 regression

2.0 match 972 stars 14.86 score 2.3k scripts 35 dependents

bioc

scp:Mass Spectrometry-Based Single-Cell Proteomics Data Analysis

Utility functions for manipulating, processing, and analyzing mass spectrometry-based single-cell proteomics data. The package is an extension to the 'QFeatures' package and relies on 'SingleCellExperiment' to enable single-cell proteomics analyses. The package offers the user the functionality to process quantitative tables (as generated by MaxQuant, Proteome Discoverer, and more) into data tables ready for downstream analysis and data visualization.

Maintained by Christophe Vanderaa. Last updated 1 month ago.

geneexpression proteomics singlecell massspectrometry preprocessing cellbasedassays bioconductor mass-spectrometry single-cell software

3.2 match 26 stars 8.95 score 115 scripts

zpneal

backbone:Extracts the Backbone from Graphs

An implementation of methods for extracting an unweighted unipartite graph (i.e. a backbone) from an unweighted unipartite graph, a weighted unipartite graph, the projection of an unweighted bipartite graph, or the projection of a weighted bipartite graph (Neal, 2022 <doi:10.1371/journal.pone.0269137>).

Maintained by Zachary Neal. Last updated 1 year ago.

cpp

4.0 match 41 stars 7.06 score 31 scripts 2 dependents

rich-iannone

DiagrammeR:Graph/Network Visualization

Build graph/network structures using functions for stepwise addition and deletion of nodes and edges. Work with data available in tables for bulk addition of nodes, edges, and associated metadata. Use graph selections and traversals to apply changes to specific nodes or edges. A wide selection of graph algorithms allows for the analysis of graphs. Visualize the graphs and take advantage of any aesthetic properties assigned to nodes and edges.

Maintained by Richard Iannone. Last updated 2 months ago.

graph graph-functions network-graph property-graph visualization

1.8 match 1.7k stars 15.29 score 3.8k scripts 86 dependents

luismurao

bamm:Species Distribution Models as a Function of Biotic, Abiotic and Movement Factors (BAM)

Species Distribution Modeling (SDM) is a practical methodology that aims to estimate the area of distribution of a species. However, most of the work has focused on estimating static expressions of the correlation between environmental variables. The outputs of correlative species distribution models can be interpreted as maps of the suitable environment for a species but not generally as maps of its actual distribution. Soberón and Peterson (2005) <doi:10.17161/bi.v2i0.4> presented the BAM scheme, a heuristic framework that states that the occupied area of a species occurs on sites that have been accessible through dispersal (M) and have both favorable biotic (B) and abiotic conditions (A). The 'bamm' package implements classes and functions to operate on each element of the BAM scheme, using a cellular automata model in which the occupied area of a species at time t is estimated by the multiplication of three binary matrices: one matrix represents movements (M), another abiotic -niche- tolerances (A), and a third, biotic interactions (B). The theoretical background of the package can be found in Soberón and Osorio-Olvera (2023) <doi:10.1111/jbi.14587>.

Maintained by Luis Osorio-Olvera. Last updated 8 months ago.

cpp

6.1 match 1 star 4.40 score 4 scripts

bioc

PIUMA:Phenotypes Identification Using Mapper from topological data Analysis

The PIUMA package offers a tidy pipeline of Topological Data Analysis frameworks to identify and characterize communities in high-dimensional and heterogeneous data.

Maintained by Mattia Chiesa. Last updated 5 months ago.

clustering graphandnetwork dimensionreduction network classification

5.2 match 4 stars 5.08 score 2 scripts

beniaminogreen

zoomerjoin:Superlatively Fast Fuzzy Joins

Empowers users to fuzzily-merge data frames with millions or tens of millions of rows in minutes with low memory usage. The package uses the locality sensitive hashing algorithms developed by Datar, Immorlica, Indyk and Mirrokni (2004) <doi:10.1145/997817.997857>, and Broder (1998) <doi:10.1109/SEQUEN.1997.666900> to avoid having to compare every pair of records in each dataset, resulting in fuzzy-merges that finish in linear time.

Maintained by Beniamino Green. Last updated 2 months ago.

blazinglyfast fuzzyjoin join rust zoomer cargo

3.5 match 102 stars 7.31 score 11 scripts

cran

fossil:Palaeoecological and Palaeogeographical Analysis Tools

A set of analytical tools useful in analysing ecological and geographical data sets, both ancient and modern. The package includes functions for estimating species richness (Chao 1 and 2, ACE, ICE, Jackknife), shared species/beta diversity, species area curves and geographic distances and areas.

Maintained by Matthew J. Vavrek. Last updated 5 years ago.

7.3 match 1 star 3.44 score 7 dependents

jarioksa

natto:An Extreme 'vegan' Package of Experimental Code

Random code that is too experimental or too weird to be included in the vegan package.

Maintained by Jari Oksanen. Last updated 1 month ago.

5.2 match 8 stars 4.68 score 1 script

djvanderlaan

reclin2:Record Linkage Toolkit

Functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, em-algorithm for estimating m- and u-probabilities (I. Fellegi & A. Sunter (1969) <doi:10.1080/01621459.1969.10501049>, T.N. Herzog, F.J. Scheuren, & W.E. Winkler (2007), "Data Quality and Record Linkage Techniques", ISBN:978-0-387-69502-0), forcing one-to-one matching. Can also be used for pre- and post-processing for machine learning methods for record linkage. Focus is on memory, CPU performance and flexibility.

Maintained by Jan van der Laan. Last updated 1 year ago.

cpp

3.3 match 43 stars 7.36 score 89 scripts 1 dependent

bioc

HarmonizR:Handles missing values and makes more data available

An implementation that takes input data and makes it available for proper batch effect removal by ComBat or limma. The implementation appropriately handles missing values by dissecting the input matrix into smaller matrices with sufficient data to feed the ComBat or limma algorithm. The adjusted data is returned to the user as a rebuilt matrix. The implementation is meant to make as much data available as possible with minimal data loss.

Maintained by Simon Schlumbohm. Last updated 5 months ago.

batcheffect

5.8 match 4.20 score 16 scripts

anespinosa

netmem:Social Network Measures using Matrices

Measures to describe and manipulate networks using matrices.

Maintained by Alejandro Espinosa-Rada. Last updated 21 days ago.

matrices multilayer-networks network-analysis network-science sna social-network social-network-analysis sociology

5.6 match 11 stars 4.33 score 13 scripts

elies-ramon

kerntools:Kernel Functions and Tools for Machine Learning Applications

Kernel functions for diverse types of data (including, but not restricted to: nonnegative and real vectors, real matrices, categorical and ordinal variables, sets, strings), plus other utilities like kernel similarity, kernel Principal Components Analysis (PCA) and features' importance for Support Vector Machines (SVMs), which expand other 'R' packages like 'kernlab'.

Maintained by Elies Ramon. Last updated 1 day ago.

kernel-methods pca

4.8 match 1 star 4.86 score 12 scripts

ocbe-uio

DIscBIO:A User-Friendly Pipeline for Biomarker Discovery in Single-Cell Transcriptomics

An open, multi-algorithmic pipeline for easy, fast and efficient analysis of cellular sub-populations and the molecular signatures that characterize them. The pipeline consists of four successive steps: data pre-processing, cellular clustering with pseudo-temporal ordering, defining differentially expressed genes and biomarker identification. More details in Ghannoum et al. (2021) <doi:10.3390/ijms22031399>. This package implements extensions of the work published by Ghannoum et al. (2019) <doi:10.1101/700989>.

Maintained by Waldir Leoncio. Last updated 1 year ago.

biomarker-discovery jupyter-notebook scrna-seq single-cell-analysis transcriptomics openjdk

5.3 match 12 stars 4.38 score 5 scripts

kylebittinger

abdiv:Alpha and Beta Diversity Measures

A collection of measures for measuring ecological diversity. Ecological diversity comes in two flavors: alpha diversity measures the diversity within a single site or sample, and beta diversity measures the diversity across two sites or samples. This package overlaps considerably with other R packages such as 'vegan', 'gUniFrac', 'betapart', and 'fossil'. We also include a wide range of functions that are implemented in software outside the R ecosystem, such as 'scipy', 'Mothur', and 'scikit-bio'. The implementations here are designed to be basic and clear to the reader.

Maintained by Kyle Bittinger. Last updated 1 year ago.

5.2 match 9 stars 4.14 score 31 scripts

bioc

GeDi:Defining and visualizing the distances between different genesets

The package provides different distance measurements to calculate the difference between genesets. Based on these scores the genesets are clustered and visualized as a graph. This is all presented in an interactive Shiny application for easy usage.

Maintained by Annekathrin Nedwed. Last updated 5 months ago.

gui genesetenrichment software transcription rnaseq visualization clustering pathways reportwriting go kegg reactome shinyapps

3.9 match 1 star 5.36 score 22 scripts

bzhanglab

WebGestaltR:Gene Set Analysis Toolkit WebGestaltR

The web version WebGestalt <https://www.webgestalt.org> supports 12 organisms, 354 gene identifiers and 321,251 function categories. Users can upload the data and functional categories with their own gene identifiers. In addition to the Over-Representation Analysis, WebGestalt also supports Gene Set Enrichment Analysis and Network Topology Analysis. The user-friendly output report allows interactive and efficient exploration of enrichment results. The WebGestaltR package not only supports all of the above functions but can also be integrated into other pipelines or used to analyze multiple gene lists simultaneously.

Maintained by John Elizarraras. Last updated 5 days ago.

rust cargo

2.3 match 35 stars 9.18 score 180 scripts

yijuanhu

LDM:Testing Hypotheses About the Microbiome using the Linear Decomposition Model

A single analysis path that includes distance-based ordination, global tests of any effect of the microbiome, and tests of the effects of individual taxa with false-discovery-rate (FDR) control. It accommodates both continuous and discrete covariates as well as interaction terms to be tested either singly or in combination, allows for adjustment of confounding covariates, and uses permutation-based p-values that can control for sample correlations. It can be applied to transformed data, and an omnibus test can combine results from analyses conducted on different transformation scales. It can also be used for testing presence-absence associations based on an infinite number of rarefaction replicates, testing mediation effects of the microbiome, analyzing censored time-to-event outcomes, and for compositional analysis by fitting linear models to centered-log-ratio taxa count data.

Maintained by Yi-Juan Hu. Last updated 2 years ago.

4.0 match 7 stars 4.91 score 23 scripts

lixiangzhang

OTclust:Mean Partition, Uncertainty Assessment, Cluster Validation and Visualization Selection for Cluster Analysis

Provides a mean partition for ensemble clustering by optimal transport alignment (OTA), uncertainty measures for both partition-wise and cluster-wise assessment, and multiple visualization functions to show uncertainty, for instance, a membership heat map and a plot of the covering point set. A partition refers to an overall clustering result. Jia Li, Beomseok Seo, and Lin Lin (2019) <doi:10.1002/sam.11418>. Lixiang Zhang, Lin Lin, and Jia Li (2020) <doi:10.1093/bioinformatics/btaa165>.

Maintained by Lixiang Zhang. Last updated 1 year ago.

cpp

5.3 match 3.70 score 6 scripts

bioc

demuxSNP:scRNAseq demultiplexing using cell hashing and SNPs

This package assists in demultiplexing scRNAseq data using both cell hashing and SNP data. The SNP profile of each group is learned using high confidence assignments from the cell hashing data. Cells which cannot be assigned with high confidence from the cell hashing data are assigned to their most similar group based on their SNPs. We also provide some helper functions to optimise SNP selection, create training data and merge SNP data into the SingleCellExperiment framework.

Maintained by Michael Lynch. Last updated 5 months ago.

classification singlecell

3.5 match 6 stars 5.52 score 22 scripts

chrhennig

fpc:Flexible Procedures for Clustering

Various methods for clustering and cluster validation. Fixed point clustering. Linear regression clustering. Clustering by merging Gaussian mixture components. Symmetric and asymmetric discriminant projections for visualisation of the separation of groupings. Cluster validation statistics for distance based clustering including corrected Rand index. Standardisation of cluster validation statistics by random clusterings and comparison between many clustering methods and numbers of clusters based on this. Cluster-wise cluster stability assessment. Methods for estimation of the number of clusters: Calinski-Harabasz, Tibshirani and Walther's prediction strength, Fang and Wang's bootstrap stability. Gaussian/multinomial mixture fitting for mixed continuous/categorical variables. Variable-wise statistics for cluster interpretation. DBSCAN clustering. Interface functions for many clustering methods implemented in R, including estimating the number of clusters with kmeans, pam and clara. Modality diagnosis for Gaussian mixtures. For an overview see package?fpc.

Maintained by Christian Hennig. Last updated 6 months ago.

1.9 match 11 stars 9.32 score 2.6k scripts 69 dependents

bioc

ClustAll:ClustAll: Data driven strategy to robustly identify stratification of patients within complex diseases

Data-driven strategy to find hidden groups of patients with complex diseases using clinical data. ClustAll facilitates the unsupervised identification of multiple robust stratifications. ClustAll is able to overcome the most common limitations found when dealing with clinical data (missing values, correlated data, mixed data types).

Maintained by Asier Ortega-Legarreta. Last updated 5 months ago.

software statisticalmethod clustering dimensionreduction principalcomponent

4.7 match 3.70 score 1 script

rnabioco

valr:Genome Interval Arithmetic

Read and manipulate genome intervals and signals. Provides functionality similar to command-line tool suites within R, enabling interactive analysis and visualization of genome-scale data. Riemondy et al. (2017) <doi:10.12688/f1000research.11997.1>.

Maintained by Kent Riemondy. Last updated 23 days ago.

bedtools genome interval-arithmetic cpp

1.8 match 90 stars 9.69 score 227 scripts

anttonalberdi

hilldiv:Integral Analysis of Diversity Based on Hill Numbers

Tools for analysing, comparing, visualising and partitioning diversity based on Hill numbers. 'hilldiv' is an R package that provides a set of functions to assist analysis of diversity for diet reconstruction, microbial community profiling or more general ecosystem characterisation analyses based on Hill numbers, using OTU/ASV tables and associated phylogenetic trees as inputs. The package includes functions for (phylo)diversity measurement, (phylo)diversity profile plotting, (phylo)diversity comparison between samples and groups, (phylo)diversity partitioning and (dis)similarity measurement. All of these are grounded in abundance-based and incidence-based Hill numbers. The statistical framework developed around Hill numbers encompasses many of the most broadly employed diversity (e.g. richness, Shannon index, Simpson index), phylogenetic diversity (e.g. Faith's PD, Allen's H, Rao's quadratic entropy) and dissimilarity (e.g. Sorensen index, Unifrac distances) metrics. This enables the most common analyses of diversity to be performed while grounded in a single statistical framework. The methods are described in Jost et al. (2007) <DOI:10.1890/06-1736.1>, Chao et al. (2010) <DOI:10.1098/rstb.2010.0272> and Chiu et al. (2014) <DOI:10.1890/12-0960.1>; and reviewed in the framework of molecularly characterised biological systems in Alberdi & Gilbert (2019) <DOI:10.1111/1755-0998.13014>.

Maintained by Antton Alberdi. Last updated 4 years ago.

3.9 match 11 stars 4.35 score 41 scripts

bblonder

hypervolume:High Dimensional Geometry, Set Operations, Projection, and Inference Using Kernel Density Estimation, Support Vector Machines, and Convex Hulls

Estimates the shape and volume of high-dimensional datasets and performs set operations: intersection / overlap, union, unique components, inclusion test, and hole detection. Uses a stochastic geometry approach to high-dimensional kernel density estimation, support vector machine delineation, and convex hull generation. Applications include modeling trait and niche hypervolumes and species distribution modeling.

Maintained by Benjamin Blonder. Last updated 2 months ago.

openblas cpp

1.7 match 23 stars 9.69 score 211 scripts 7 dependents

bioc

clustifyr:Classifier for Single-cell RNA-seq Using Cell Clusters

Package designed to aid in classifying cells from single-cell RNA sequencing data using external reference data (e.g., bulk RNA-seq, scRNA-seq, microarray, gene lists). A variety of correlation based methods and gene list enrichment methods are provided to assist cell type assignment.

Maintained by Rui Fu. Last updated 5 months ago.

singlecell annotation sequencing microarray geneexpression assign-identities clusters marker-genes rna-seq single-cell-rna-seq

1.7 match 120 stars 9.63 score 296 scripts

core-bioinformatics

bulkAnalyseR:Interactive Shiny App for Bulk Sequencing Data

Given an expression matrix from a bulk sequencing experiment, pre-processes it and creates a shiny app for interactive data analysis and visualisation. The app contains quality checks, differential expression analysis, volcano and cross plots, enrichment analysis and gene regulatory network inference, and can be customised to contain more panels by the user.

Maintained by Ilias Moutsopoulos. Last updated 1 year ago.

3.4 match 27 stars 4.47 score 11 scripts

bioc

GeneTonic:Enjoy Analyzing And Integrating The Results From Differential Expression Analysis And Functional Enrichment Analysis

This package provides functionality to combine the existing pieces of the transcriptome data and results, making it easier to generate insightful observations and hypotheses. Its usage is made easy with a Shiny application, combining the benefits of interactivity and reproducibility e.g. by capturing the features and gene sets of interest highlighted during the live session, and creating an HTML report as an artifact where text, code, and output coexist. Using the GeneTonicList as a standardized container for all the required components, it is possible to simplify the generation of multiple visualizations and summaries.

Maintained by Federico Marini. Last updated 3 months ago.

gui geneexpression software transcription transcriptomics visualization differentialexpression pathways reportwriting genesetenrichment annotation go shinyapps bioconductor bioconductor-package data-exploration data-visualization functional-enrichment-analysis gene-expression pathway-analysis reproducible-research rna-seq-analysis rna-seq-data shiny transcriptome user-friendly

1.8 match 77 stars 8.28 score 37 scripts 1 dependent

helixcn

spaa:SPecies Association Analysis

Miscellaneous functions for analysing species association and niche overlap.

Maintained by Jinlong Zhang. Last updated 4 years ago.

2.0 match 12 stars 7.40 score 155 scripts 1 dependent

adriancorrendo

metrica:Prediction Performance Metrics

A compilation of more than 80 functions designed to quantitatively and visually evaluate prediction performance of regression (continuous variables) and classification (categorical variables) of point-forecast models (e.g. APSIM, DSSAT, DNDC, supervised Machine Learning). For regression, it includes functions to generate plots (scatter, tiles, density, & Bland-Altman plot), and to estimate error metrics (e.g. MBE, MAE, RMSE), error decomposition (e.g. lack of accuracy-precision), model efficiency (e.g. NSE, E1, KGE), indices of agreement (e.g. d, RAC), goodness of fit (e.g. r, R2), adjusted correlation coefficients (e.g. CCC, dcorr), symmetric regression coefficients (intercept, slope), and mean absolute scaled error (MASE) for time series predictions. For classification (binomial and multinomial), it offers functions to generate and plot confusion matrices, and to estimate performance metrics such as accuracy, precision, recall, specificity, F-score, Cohen's Kappa, G-mean, and many more. For more details visit the vignettes <https://adriancorrendo.github.io/metrica/>.

Maintained by Adrian A. Correndo. Last updated 9 months ago.

1.8 match 77 stars 7.88 score 49 scripts

cran

clv:Cluster Validation Techniques

The package contains most of the popular internal and external cluster validation methods, ready to use with most of the outputs produced by functions from the "cluster" package. It also contains functions and usage examples for the cluster stability approach, which can be applied to algorithms implemented in the "cluster" package as well as to user-defined clustering algorithms.

Maintained by Lukasz Nieweglowski. Last updated 2 years ago.

3.9 match 1 stars 3.50 score 17 dependents

mlampros

textTinyR:Text Processing for Small or Big Data Files

It offers functions for splitting, parsing, tokenizing and creating a vocabulary for big text data files. Moreover, it includes functions for building a document-term matrix and extracting information from those (term-associations, most frequent terms). It also embodies functions for calculating token statistics (collocations, look-up tables, string dissimilarities) and functions to work with sparse matrices. Lastly, it includes functions for Word Vector Representations (i.e. 'GloVe', 'fasttext') and incorporates functions for the calculation of (pairwise) text document dissimilarities. The source code is based on 'C++11' and exported in R through the 'Rcpp', 'RcppArmadillo' and 'BH' packages.

Maintained by Lampros Mouselimis. Last updated 1 year ago.

bh boost cpp11 processing rcpp rcpparmadillo text openblas cpp openmp

1.8 match 39 stars 7.64 score 244 scripts 1 dependents

january3

tmod:Feature Set Enrichment Analysis for Metabolomics and Transcriptomics

Methods and feature set definitions for feature or gene set enrichment analysis in transcriptional and metabolic profiling data. Package includes tests for enrichment based on ranked lists of features, functions for visualisation and multivariate functional analysis. See Zyla et al (2019) <doi:10.1093/bioinformatics/btz447>.

Maintained by January Weiner. Last updated 2 months ago.

2.0 match 3 stars 6.88 score 168 scripts 1 dependents

blansche

fdm2id:Data Mining and R Programming for Beginners

Contains functions to simplify the use of data mining methods (classification, regression, clustering, etc.), for students and beginners in R programming. Various R packages are used and wrappers are built around the main functions, to standardize the use of data mining methods (input/output): it brings a certain loss of flexibility, but also a gain of simplicity. The package name came from the French "Fouille de Données en Master 2 Informatique Décisionnelle".

Maintained by Alexandre Blansché. Last updated 2 years ago.

8.5 match 1 stars 1.62 score 42 scripts

ozvan

Ravages:Rare Variant Analysis and Genetic Simulations

Rare variant association tests: burden tests (Bocher et al. 2019 <doi:10.1002/gepi.22210>) and the Sequence Kernel Association Test (Bocher et al. 2021 <doi:10.1038/s41431-020-00792-8>) in the whole genome; and genetic simulations.

Maintained by Ozvan Bocher. Last updated 2 years ago.

cpp

5.6 match 2.30 score 2 scripts

bnosac

textrank:Summarize Text by Ranking Sentences and Finding Keywords

The 'textrank' algorithm is an extension of the 'Pagerank' algorithm for text. The algorithm summarizes text by calculating how sentences are related to one another. This is done by looking at overlapping terminology used in sentences in order to set up links between sentences. The resulting sentence network is then plugged into the 'Pagerank' algorithm, which identifies the most important sentences in your text and ranks them. In a similar way, 'textrank' can also be used to extract keywords. A word network is constructed by checking whether words follow one another. The 'Pagerank' algorithm is then applied to that network to extract relevant words, after which relevant words that follow one another are combined into keywords. More information can be found in the paper by Mihalcea, Rada & Tarau, Paul (2004) <https://www.aclweb.org/anthology/W04-3252/>.

Maintained by Jan Wijffels. Last updated 4 years ago.

natural-language-processing nlp textrank textrank-algorithm

1.7 match 77 stars 7.38 score 103 scripts 2 dependents

bommert

stabm:Stability Measures for Feature Selection

An implementation of many measures for the assessment of the stability of feature selection. Both simple measures and measures which take into account the similarities between features are available, see Bommert (2020) <doi:10.17877/DE290R-21906>.

Maintained by Andrea Bommert. Last updated 2 years ago.

2.0 match 6 stars 6.29 score 33 scripts 3 dependents

nanne-aben

iTOP:Inferring the Topology of Omics Data

Infers a topology of relationships between different datasets, such as multi-omics and phenotypic data recorded on the same samples. We based this methodology on the RV coefficient (Robert & Escoufier, 1976, <doi:10.2307/2347233>), a measure of matrix correlation, which we have extended for partial matrix correlations and binary data (Aben et al., 2018, <doi:10.1101/293993>).

Maintained by Nanne Aben. Last updated 7 years ago.

5.6 match 2.23 score 17 scripts

bioc

TMSig:Tools for Molecular Signatures

The TMSig package contains tools to prepare, analyze, and visualize named lists of sets, with an emphasis on molecular signatures (such as gene or kinase sets). It includes fast, memory efficient functions to construct sparse incidence and similarity matrices and filter, cluster, invert, and decompose sets. Additionally, bubble heatmaps can be created to visualize the results of any differential or molecular signatures analysis.

Maintained by Tyler Sagendorf. Last updated 5 months ago.

clustering genesetenrichment graphandnetwork pathways visualization gene-sets molecular-signatures

2.2 match 4 stars 5.58 score 4 scripts

mwsill

s4vd:Biclustering via Sparse Singular Value Decomposition Incorporating Stability Selection

The main function s4vd() performs biclustering via sparse singular value decomposition with a nested stability selection. The result is a biclust object, so all methods of the biclust package can be applied.

Maintained by Martin Sill. Last updated 5 years ago.

2.3 match 4 stars 5.31 score 17 scripts 2 dependents

tesselle

tabula:Analysis and Visualization of Archaeological Count Data

An easy way to examine archaeological count data. This package provides several tests and measures of diversity: heterogeneity and evenness (Brillouin, Shannon, Simpson, etc.), richness and rarefaction (Chao1, Chao2, ACE, ICE, etc.), turnover and similarity (Brainerd-Robinson, etc.). It makes it easy to visualize count data and statistical thresholds: rank vs abundance plots, heatmaps, Ford (1962) and Bertin (1977) diagrams, etc.

Maintained by Nicolas Frerebeau. Last updated 16 hours ago.

data-visualization archaeology archaeological-science

2.3 match 5.14 score 38 scripts 1 dependents

larssnip

micropan:Microbial Pan-Genome Analysis

A collection of functions for computations and visualizations of microbial pan-genomes.

Maintained by Lars Snipen. Last updated 3 years ago.

1.9 match 21 stars 6.15 score 67 scripts

bioc

SynExtend:Tools for Working With Synteny Objects

Shared order between genomic sequences provides a great deal of information. Synteny objects produced by the R package DECIPHER provide quantitative information about that shared order. SynExtend provides tools for extracting information from Synteny objects.

Maintained by Nicholas Cooley. Last updated 18 days ago.

genetics clustering comparativegenomics dataimport fortran openmp

1.8 match 1 stars 6.42 score 77 scripts

deisygysi

NetSci:Calculates Basic Network Measures Commonly Used in Network Medicine

Calculates network measures commonly used in Network Medicine. Measures such as the Largest Connected Component, the Relative Largest Connected Component, Proximity and Separation are calculated along with their statistical significance. Significance can be computed both using a degree-preserving randomization and non-degree preserving.

Maintained by Deisy Morselli Gysi. Last updated 6 months ago.

6.6 match 1.70 score 9 scripts

bioc

CNVMetrics:Copy Number Variant Metrics

The CNVMetrics package calculates similarity metrics to facilitate copy number variant comparison among samples and/or methods. Similarity metrics can be employed to compare CNV profiles of genetically unrelated samples as well as those with a common genetic background. Some metrics are based on the shared amplified/deleted regions while other metrics rely on the level of amplification/deletion. The data type used as input is a plain text file containing the genomic position of the copy number variations, as well as the status and/or the log2 ratio values. Finally, a visualization tool is provided to explore resulting metrics.

Maintained by Astrid Deschênes. Last updated 5 months ago.

biologicalquestion software copynumbervariation cnv copy-number-variation metrics r-language

2.2 match 4 stars 5.08 score 8 scripts

bioc

omicsViewer:Interactive and explorative visualization of SummarizedExperiment or ExpressionSet using omicsViewer

omicsViewer visualizes ExpressionSet (or SummarizedExperiment) in an interactive way. The omicsViewer has a separate back- and front-end. In the back-end, users need to prepare an ExpressionSet that contains all the necessary information for the downstream data interpretation. Some extra requirements on the headers of phenotype data or feature data are imposed so that the provided information can be clearly recognized by the front-end while keeping modifications to the existing ExpressionSet object to a minimum. The pure dependency on R/Bioconductor guarantees maximum flexibility in the statistical analysis in the back-end. Once the ExpressionSet is prepared, it can be visualized using the front-end, implemented with shiny and plotly. Both features and samples can be selected from (data) tables or graphs (scatter plot/heatmap). Different types of analyses, such as enrichment analysis (using the Bioconductor package fgsea or Fisher's exact test) and STRING network analysis, are performed on the fly and the results are visualized simultaneously. When a subset of samples and a phenotype variable are selected, a significance test on means (t-test or rank-based test, when the phenotype variable is quantitative) or a test of independence (chi-square or Fisher's exact test, when the phenotype data is categorical) is performed to test the association between the phenotype of interest and the selected samples. Additionally, other analyses can be easily added as extra shiny modules. Therefore, omicsViewer greatly facilitates data exploration: many different hypotheses can be explored in a short time without the need for knowledge of R. In addition, the resulting data can be easily shared using a shiny server. Alternatively, a standalone version of omicsViewer together with designated omics data can be easily created by integrating it with portable R, which can be shared with collaborators or submitted as supplementary data together with a manuscript.

Maintained by Chen Meng. Last updated 2 months ago.

software visualization genesetenrichment differentialexpression motifdiscovery network networkenrichment

1.9 match 4 stars 5.82 score 22 scripts

kit-iism-em

partitionComparison:Implements Measures for the Comparison of Two Partitions

Provides several measures ((dis)similarity, distance/metric, correlation, entropy) for comparing two partitions of the same set of objects. The different measures can be assigned to three different classes: Pair comparison (containing the famous Jaccard and Rand indices), set based, and information theory based. Many of the implemented measures can be found in Albatineh AN, Niewiadomska-Bugaj M and Mihalko D (2006) <doi:10.1007/s00357-006-0017-z> and Meila M (2007) <doi:10.1016/j.jmva.2006.11.013>. Partitions are represented by vectors of class labels which allow a straightforward integration with existing clustering algorithms (e.g. kmeans()). The package is mostly based on the S4 object system.

Maintained by Fabian Ball. Last updated 2 years ago.

comparison dissimilarity-measures distance-measures partitions similarity-measures

2.8 match 2 stars 3.78 score 60 scripts
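
The pair-comparison class mentioned in the partitionComparison description can be illustrated with a minimal sketch. This is plain Python rather than the package's S4 API; the function name `pair_jaccard` and the convention for the degenerate case are assumptions made here for illustration. Partitions are given as vectors of class labels, as in the description, and the Jaccard index is the number of object pairs grouped together in both partitions divided by the number of pairs grouped together in at least one.

```python
from itertools import combinations


def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two partitions of the same objects.

    Hypothetical helper for illustration only; not the partitionComparison API.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label vectors must describe the same objects")

    n11 = n10 = n01 = 0  # together in both / only in A / only in B
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1

    denom = n11 + n10 + n01
    if denom == 0:
        # Assumption: if no pair is ever grouped together (all singletons in
        # both partitions), treat the partitions as identical.
        return 1.0
    return n11 / denom


# Label names do not matter, only which objects share a label.
same = pair_jaccard([0, 0, 1, 1], [1, 1, 0, 0])
partial = pair_jaccard([0, 0, 1, 1], [0, 0, 0, 1])
```

The quadratic loop over pairs keeps the sketch readable; a contingency-table formulation would be the usual choice for large label vectors.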

bioc

GeneOverlap:Test and visualize gene overlaps

Test two sets of gene lists and visualize the results.

Maintained by António Miguel de Jesus Domingues, Max-Planck Institute for Cell Biology and Genetics. Last updated 5 months ago.

multiplecomparison visualization

1.6 match 6.46 score 266 scripts

cran

fclust:Fuzzy Clustering

Algorithms for fuzzy clustering, cluster validity indices and plots for cluster validity and visualizing fuzzy clustering results.

Maintained by Paolo Giordani. Last updated 2 years ago.

openblas cpp

4.3 match 1 stars 2.38 score 2 dependents

josetamezpena

FRESA.CAD:Feature Selection Algorithms for Computer Aided Diagnosis

Contains a set of utilities for building and testing statistical models (linear, logistic, ordinal or Cox) for Computer Aided Diagnosis/Prognosis applications. Utilities include data adjustment, univariate analysis, model building, model-validation, longitudinal analysis, reporting and visualization.

Maintained by Jose Gerardo Tamez-Pena. Last updated 2 months ago.

openblas cpp openmp

1.8 match 7 stars 5.59 score 31 scripts

blasbenito

distantia:Advanced Toolset for Efficient Time Series Dissimilarity Analysis

Fast C++ implementation of Dynamic Time Warping for time series dissimilarity analysis, with applications in environmental monitoring and sensor data analysis, climate science, signal processing and pattern recognition, and financial data analysis. Built upon the ideas presented in Benito and Birks (2020) <doi:10.1111/ecog.04895>, it provides tools for analyzing time series of varying lengths and structures, including irregular multivariate time series. Key features include individual variable contribution analysis, restricted permutation tests for statistical significance, and imputation of missing data via GAMs. Additionally, the package provides an ample set of tools to prepare and manage time series data.

Maintained by Blas M. Benito. Last updated 1 month ago.

1.8 match 23 stars 5.73 score 11 scripts

bioc

IsoformSwitchAnalyzeR:Identify, Annotate and Visualize Isoform Switches with Functional Consequences from both short- and long-read RNA-seq data

Analysis of alternative splicing and isoform switches with predicted functional consequences (e.g. gain/loss of protein domains) from quantification of all types of RNA-seq data by tools such as Kallisto, Salmon, StringTie and Cufflinks/Cuffdiff.

Maintained by Kristoffer Vitting-Seerup. Last updated 5 months ago.

geneexpression transcription alternativesplicing differentialexpression differentialsplicing visualization statisticalmethod transcriptomevariant biomedicalinformatics functionalgenomics systemsbiology transcriptomics rnaseq annotation functionalprediction geneprediction dataimport multiplecomparison batcheffect immunooncology

1.1 match 108 stars 9.26 score 125 scripts

hflavio12

cmahalanobis:Calculate Distance Measures for a Given List of Data Frames with Factors

It provides functions that calculate Mahalanobis distance, Euclidean distance, Manhattan distance, Chebyshev distance, Hamming distance, Canberra distance, Minkowski distance, Cosine distance, Bhattacharyya distance, Jaccard distance, Hellinger distance, Bray-Curtis distance, Sorensen-Dice distance between each pair of species in a list of data frames. These metrics are fundamental in various fields, such as cluster analysis, classification, and other applications of machine learning and data mining, where assessing similarity or dissimilarity between data is crucial. The package is designed to be flexible and easily integrated into data analysis workflows, providing reliable tools for evaluating distances in multidimensional contexts.

Maintained by Flavio Gioia. Last updated 3 months ago.

5.7 match 1.70 score

cadam00

prior3D:3D Prioritization Algorithm

Three-dimensional systematic conservation planning, conducting nested prioritization analyses across multiple depth levels and ensuring efficient resource allocation throughout the water column. It provides a structured workflow designed to address biodiversity conservation and management challenges in the 3 dimensions, while facilitating users’ choices and parameterization (Doxa et al. 2025 <doi:10.1016/j.ecolmodel.2024.110919>).

Maintained by Christos Adam. Last updated 2 months ago.

biodiversity conservation conservation-planning depth marine-spatial-planning multidimensional-environments prioritization

1.7 match 6 stars 5.62 score 3 scripts

mengxu98

inferCSN:Inferring Cell-Specific Gene Regulatory Network

An R package for inferring cell-type specific gene regulatory network from single-cell RNA data.

Maintained by Meng Xu. Last updated 22 hours ago.

openblas cpp

2.0 match 3 stars 4.80 score 6 scripts

dgrun

RaceID:Identification of Cell Types, Inference of Lineage Trees, and Prediction of Noise Dynamics from Single-Cell RNA-Seq Data

Application of 'RaceID' allows inference of cell types and prediction of lineage trees by the 'StemID2' algorithm (Herman, J.S., Sagar, Grun D. (2018) <DOI:10.1038/nmeth.4662>). 'VarID2' is part of this package and allows quantification of biological gene expression noise at single-cell resolution (Rosales-Alvarez, R.E., Rettkowski, J., Herman, J.S., Dumbovic, G., Cabezas-Wallscheid, N., Grun, D. (2023) <DOI:10.1186/s13059-023-02974-1>).

Maintained by Dominic Grün. Last updated 4 months ago.

cpp

2.0 match 4.74 score 110 scripts

bioc

FEAST:FEAture SelecTion (FEAST) for Single-cell clustering

Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as “features”), whose expression patterns will then be used for downstream clustering. A good set of features should include the ones that distinguish different cell types, and the quality of such a set could have a significant impact on the clustering accuracy. FEAST is an R library for selecting the most representative features before performing the core of scRNA-seq clustering. It can be used as a plug-in for established clustering algorithms such as SC3, TSCAN, SHARP, SIMLR, and Seurat. The core of the FEAST algorithm includes three steps: 1. consensus clustering; 2. gene-level significance inference; 3. validation of an optimized feature set.

Maintained by Kenong Su. Last updated 5 months ago.

sequencing singlecell clustering featureextraction

1.6 match 10 stars 5.97 score 47 scripts

jonasrieger

ldaPrototype:Prototype of Multiple Latent Dirichlet Allocation Runs

Determines a prototype from a number of runs of Latent Dirichlet Allocation (LDA), measuring their similarities with S-CLOP: a procedure to select the LDA run with the highest mean pairwise similarity to all other runs, where similarity is measured by S-CLOP (Similarity of multiple sets by Clustering with Local Pruning). LDA runs are specified by their assignments, which lead to estimators for the distribution parameters. Repeated runs lead to different results, which we counter by choosing the most representative LDA run as the prototype.

Maintained by Jonas Rieger. Last updated 2 years ago.

latent-dirichlet-allocation lda model-selection modelselection reliability text-mining textdata topic-model topic-models topic-similarities topicmodeling topicmodelling

2.0 match 8 stars 4.44 score 23 scripts 1 dependents

bioc

epiregulon.extra:Companion package to epiregulon with additional plotting, differential and graph functions

Gene regulatory networks model the underlying gene regulation hierarchies that drive gene expression and observed phenotypes. Epiregulon infers TF activity in single cells by constructing a gene regulatory network (regulons). This is achieved through integration of scATAC-seq and scRNA-seq data and incorporation of public bulk TF ChIP-seq data. Links between regulatory elements and their target genes are established by computing correlations between chromatin accessibility and gene expressions.

Maintained by Xiaosai Yao. Last updated 13 days ago.

generegulation network geneexpression transcription chiponchip differentialexpression genetarget normalization graphandnetwork

1.8 match 4.95 score 10 scripts

recon-icm

linkprediction:Link Prediction Methods

Implementations of most of the existing proximity-based methods of link prediction in graphs. Among the 20 implemented methods are e.g.: Adamic L. and Adar E. (2003) <doi:10.1016/S0378-8733(03)00009-1>, Leicht E., Holme P., Newman M. (2006) <doi:10.1103/PhysRevE.73.026120>, Zhou T. and Zhang Y (2009) <doi:10.1140/epjb/e2009-00335-8>, and Fouss F., Pirotte A., Renders J., and Saerens M. (2007) <doi:10.1109/TKDE.2007.46>.

Maintained by Michal Bojanowski. Last updated 5 months ago.

1.5 match 12 stars 5.40 score 14 scripts

markvanderloo

stringdist:Approximate String Matching, Fuzzy Text Search, and String Distance Functions

Implements an approximate string matching version of R's native 'match' function. Also offers fuzzy text search based on various string distance measures. Can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), qgrams (q-gram, cosine, jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding or between integer vectors representing generic sequences. This package is built for speed and runs in parallel by using 'openMP'. An API for C or C++ is exposed as well. Reference: MPJ van der Loo (2014) <doi:10.32614/RJ-2014-011>.

Maintained by Mark van der Loo. Last updated 4 months ago.

openmp

0.5 match 327 stars 15.54 score 2.0k scripts 179 dependents
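
As a rough illustration of the q-gram Jaccard distance named in the stringdist description, the sketch below splits each string into its set of overlapping substrings of length q and returns one minus the ratio of shared to total distinct q-grams. It is plain Python, not the stringdist API; the helper names and the handling of strings shorter than q are assumptions made for this example.

```python
def qgrams(s, q=2):
    """Set of all overlapping substrings of length q (empty if len(s) < q)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_jaccard_distance(a, b, q=2):
    """1 - |Q(a) & Q(b)| / |Q(a) | Q(b)| on q-gram sets.

    Hypothetical helper for illustration only; not the stringdist API.
    """
    qa, qb = qgrams(a, q), qgrams(b, q)
    union = qa | qb
    if not union:
        # Assumption: two strings with no q-grams at all are treated as
        # identical, so their distance is 0.
        return 0.0
    return 1.0 - len(qa & qb) / len(union)


# "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}: 1 shared of 7.
d = qgram_jaccard_distance("night", "nacht")
```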

bioc

BioCor:Functional similarities

Calculates functional similarities based on the pathways described on KEGG and REACTOME or in gene sets. These similarities can be calculated for pathways or gene sets, genes, or clusters and combined with other similarities. They can be used to improve networks, gene selection, testing relationships...

Maintained by Lluís Revilla Sancho. Last updated 5 months ago.

statisticalmethod clustering geneexpression network pathways networkenrichment systemsbiology bioconductor-packages bioinformatics functional-similarity gene gene-sets pathway-analysis similarity similarity-measurement

1.2 match 14 stars 6.47 score

bioc

mina:Microbial community dIversity and Network Analysis

An increasing number of microbiome datasets have been generated and analyzed with the help of rapidly developing sequencing technologies. At present, analysis of taxonomic profiling data is mainly conducted using composition-based methods, which ignore interactions between community members. In addition, a lack of efficient ways to compare microbial interaction networks has limited the study of community dynamics. To better understand how community diversity is affected by complex interactions between its members, we developed mina (Microbial community dIversity and Network Analysis), a comprehensive framework for microbial community diversity analysis and network comparison. By defining and integrating network-derived community features, we greatly reduce the noise-to-signal ratio for diversity analyses. A bootstrap and permutation-based method was implemented to assess community network dissimilarities and extract discriminative features in a statistically principled way.

Maintained by Rui Guan. Last updated 5 months ago.

software workflowstep cpp

1.6 match 5 stars 4.85 score 14 scripts

bioc

PRONE:The PROteomics Normalization Evaluator

High-throughput omics data are often affected by systematic biases introduced throughout all the steps of a clinical study, from sample collection to quantification. Normalization methods aim to adjust for these biases to make the actual biological signal more prominent. However, selecting an appropriate normalization method is challenging due to the wide range of available approaches. Therefore, a comparative evaluation of unnormalized and normalized data is essential in identifying an appropriate normalization strategy for a specific data set. This R package provides different functions for preprocessing, normalizing, and evaluating different normalization approaches. Furthermore, normalization methods can be evaluated on downstream steps, such as differential expression analysis and statistical enrichment analysis. Spike-in data sets with known ground truth and real-world data sets of biological experiments acquired by either tandem mass tag (TMT) or label-free quantification (LFQ) can be analyzed.

Maintained by Lis Arend. Last updated 11 days ago.

proteomics preprocessing normalization differentialexpression visualization data-analysis evaluation

1.7 match 2 stars 4.41 score 9 scripts

jreisner

biclustermd:Biclustering with Missing Data

Biclustering is a statistical learning technique that simultaneously partitions and clusters rows and columns of a data matrix. Since the solution space of biclustering is infeasible to search completely with current computational mechanisms, this package uses a greedy heuristic. The algorithm featured in this package is, to the best of our knowledge, the first biclustering algorithm to work on data with missing values. Li, J., Reisner, J., Pham, H., Olafsson, S., and Vardeman, S. (2020) Biclustering with Missing Data. Information Sciences, 510, 304–316.

Maintained by John Reisner. Last updated 4 years ago.

1.8 match 3 stars 4.18 score 4 scripts

wraff

wrMisc:Analyze Experimental High-Throughput (Omics) Data

This collection of diverse functions facilitates the efficient treatment and convenient analysis of experimental high-throughput (omics) data. Several functions address advanced object conversions, like manipulating lists of lists or lists of arrays, reorganizing lists into arrays or into separate vectors, merging multiple entries, etc. Another set of functions provides speed-optimized calculation of standard deviation (sd), coefficient of variation (CV) or standard error of the mean (SEM) for data in matrices, or means per line with respect to additional grouping (e.g. n groups of replicates). A group of functions facilitates dealing with non-redundant information, by indexing unique entries, adding counters to redundant ones, or eliminating lines with respect to redundancy in a given reference column, etc. Help is provided to identify very closely matching numeric values, to generate (partial) distance matrices for very big data in a memory-efficient manner, or to reduce the complexity of large datasets by combining very close values. Other functions help align a matrix or data.frame to a reference using partial matching, or mine an experimental setup to extract patterns of replicate samples. Large experimental datasets often need additional filtering, and suitable functions are provided. Convenient data normalization is supported in various modes, as are parameter estimation via permutations or bootstrap and flexible testing of multiple pairwise combinations using the framework of 'limma'. Batch reading (or writing) of sets of files and combining data into arrays is supported, too.

Maintained by Wolfgang Raffelsberger. Last updated 7 months ago.

1.7 match 4.23 score 33 scripts 4 dependents

nabod0815

ConNEcT:Contingency Measure-Based Networks for Binary Time Series

The ConNEcT approach investigates the pairwise association strength of binary time series by calculating contingency measures and depicts the results in a network. The package includes features to explore and visualize the data. To calculate the pairwise concurrent or temporally sequenced relationship between the variables, the package provides seven contingency measures (proportion of agreement, classical & corrected Jaccard, Cohen's kappa, phi correlation coefficient, odds ratio, and log odds ratio); others can easily be implemented. The package also includes non-parametric significance tests that can be applied to test whether the contingency value quantifying the relationship between the variables is significantly higher than chance level. Most importantly, this test accounts for auto-dependence and relative frequency. See Bodner et al. (2021) <doi:10.1111/bmsp.12222>. Finally, a network can be drawn. Variables are depicted as the nodes of the network, with the node size adapted to the prevalence. The association strength between the variables defines the undirected (concurrent) or directed (temporally sequenced) links between the nodes. The results of the non-parametric significance test can be included by depicting either all links or only the significant ones. For a tutorial, see Bodner et al. (2021) <doi:10.3758/s13428-021-01760-w>.

Maintained by Nadja Bodner. Last updated 3 years ago.

4.0 match 1.70 score 2 scripts

lazappi

doilinker:Link Preprints And Publications By DOI

Links preprints to publications using the method described in Cabanac G, Oikonomidi T, Boutron I. "Day-to-day discovery of preprint-publication links". Scientometrics. 2021;1–20. DOI: 10.1007/s11192-021-03900-7.

Maintained by Luke Zappia. Last updated 1 year ago.

doi preprint publication

2.0 match 5 stars 3.40 score 3 scripts

adamlilith

statisfactory:Statistical and Geometrical Tools

A collection of statistical and geometrical tools including the aligned rank transform (ART; Higgins et al. 1990 <doi:10.4148/2475-7772.1443>; Peterson 2002 <doi:10.22237/jmasm/1020255240>; Wobbrock et al. 2011 <doi:10.1145/1978942.1978963>), 2-D histograms and histograms with overlapping bins, a function for making all possible formulae within a set of constraints, amongst others.

Maintained by Adam B. Smith. Last updated 6 months ago.

2d-histograms aligned-rank-transform sampling

2.0 match 3.38 score 16 scripts 1 dependents

davharris

blender:Analyze biotic homogenization of landscapes

Tools for assessing exotic species' contributions to landscape homogeneity using average pairwise Jaccard similarity and an analytical approximation derived in Harris et al. (2011, "Occupancy is nine-tenths of the law," The American Naturalist). Also includes a randomization method for assessing sources of model error.

Maintained by David J. Harris. Last updated 13 years ago.

2.2 match 3.00 score 4 scripts
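
The "average pairwise Jaccard similarity" used in the blender description can be sketched in a few lines. The example below is plain Python, not the blender package's interface; the function names, the toy species lists, and the convention for two empty sites are assumptions made for illustration. Each site is represented as the set of species present, and the landscape-level value is the mean Jaccard similarity over all unordered pairs of sites.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two species sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sites are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def mean_pairwise_jaccard(sites):
    """Mean Jaccard similarity over all unordered pairs of sites.

    Hypothetical helper for illustration only; not the blender API.
    """
    pairs = list(combinations(sites, 2))
    if not pairs:
        raise ValueError("need at least two sites")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy landscape: three sites with made-up species lists.
sites = [{"oak", "pine", "fern"}, {"oak", "pine"}, {"oak", "moss"}]
homogeneity = mean_pairwise_jaccard(sites)
```

Higher values mean the sites share more of their species, which is the sense in which the measure tracks landscape homogeneity.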

ewouddt

BiBitR:R Wrapper for Java Implementation of BiBit

A simple R wrapper for the Java BiBit algorithm from "A biclustering algorithm for extracting bit-patterns from binary datasets" from Domingo et al. (2011) <DOI:10.1093/bioinformatics/btr464>. A simple adaptation of the BiBit algorithm which allows noise in the biclusters is also introduced, as well as a function to guide the algorithm towards given (sub)patterns. Further, a workflow to derive noisy biclusters from discovered larger column patterns is included as well.

Maintained by De Troyer Ewoud. Last updated 7 years ago.

1.8 match 1 stars 3.76 score 19 scripts 2 dependents

cran

Mercator:Clustering and Visualizing Distance Matrices

Defines the classes used to explore, cluster and visualize distance matrices, especially those arising from binary data. See Abrams and colleagues, 2021, <doi:10.1093/bioinformatics/btab037>.

Maintained by Kevin R. Coombes. Last updated 5 months ago.

clustering

1.5 match 4.26 score 1 dependents

yhenryli

PAC:Partition-Assisted Clustering and Multiple Alignments of Networks

Implements partition-assisted clustering and multiple alignments of networks. It 1) utilizes partition-assisted clustering to find robust and accurate clusters and 2) discovers coherent relationships of clusters across multiple samples. It is particularly useful for analyzing single-cell data sets. Please see Li et al. (2017) <doi:10.1371/journal.pcbi.1005875> for a detailed description of the method.

Maintained by Ye Henry Li. Last updated 4 years ago.

cpp

1.9 match 3.30 score 7 scripts

andymckenzie

bayesbio:Miscellaneous Functions for Bioinformatics and Bayesian Statistics

A hodgepodge of hopefully helpful functions. Two of these perform shrinkage estimation: one using a simple weighted method where the user can specify the degree of shrinkage required, and one using James-Stein shrinkage estimation for the case of unequal variances.

Maintained by Andrew McKenzie. Last updated 6 years ago.

1.8 match 1 stars 3.18 score 30 scripts

bioc

MesKit:A tool kit for dissecting cancer evolution from multi-region derived tumor biopsies via somatic alterations

MesKit provides commonly used analysis and visualization modules based on mutational data generated by multi-region sequencing (MRS). This package allows users to depict mutational profiles, measure heterogeneity within or between tumors from the same patient, track evolutionary dynamics, and characterize mutational patterns at different levels. A Shiny application is also provided for GUI-based analysis. As a handy tool, MesKit can facilitate the interpretation of tumor heterogeneity and the understanding of evolutionary relationships between regions in MRS studies.

Maintained by Mengni Liu. Last updated 5 months ago.

1.2 match 4.73 score 18 scripts 1 dependents

mpiet11

divo:Tools for Analysis of Diversity and Similarity in Biological Systems

A set of tools for empirical analysis of diversity (a number and frequency of different types in a population) and similarity (a number and frequency of shared types in two populations) in biological or ecological systems.

Maintained by Maciej Pietrzak. Last updated 1 month ago.

2.0 match 2.72 score 26 scripts

cran

jacpop:Jaccard Index for Population Structure Identification

Uses the Jaccard similarity index to account for population structure in sequencing studies. This method was specifically designed to detect population stratification based on rare variants, hence it will be especially useful in rare variant analysis.

Maintained by Dmitry Prokopenko. Last updated 6 years ago.

5.3 match 1.00 score

moseleybioinformaticslab

categoryCompare2:Meta-Analysis of High-Throughput Experiments Using Feature Annotations

Facilitates comparison of significant annotations (categories) generated on one or more feature lists. Interactive exploration is facilitated through the use of RCytoscape (heavily suggested).

Maintained by Robert M Flight. Last updated 5 months ago.

annotation go multiplecomparison pathways geneexpression bioconductor bioinformatics gene-annotation gene-expression gene-sets

2.3 match 1 stars 2.30 score 9 scripts

vpetrosyan

CTD:A Method for 'Connecting The Dots' in Weighted Graphs

A method for pattern discovery in weighted graphs as outlined in Thistlethwaite et al. (2021) <doi:10.1371/journal.pcbi.1008550>. Two use cases are achieved: 1) Given a weighted graph and a subset of its nodes, do the nodes show significant connectedness? 2) Given a weighted graph and two subsets of its nodes, are the subsets close neighbors or distant?

Maintained by Varduhi Petrosyan. Last updated 8 months ago.

1.8 match 2.70 score 1 scripts

tianmoul

bootcluster:Bootstrapping Estimates of Clustering Stability

Implements the bootstrapping approach for estimating clustering stability and its application to estimating the number of clusters, as introduced by Yu et al. (2016) <doi:10.1142/9789814749411_0007>. Also implements the non-parametric bootstrap approach to assessing the stability of module detection in a graph, its extension to selecting a parameter set that defines a graph from data in a way that optimizes stability, and the corresponding visualization functions, as introduced by Tian et al. (2021) <doi:10.1002/sam.11495>. Includes an out-of-bag stability estimation function and an Smin-based k-selection function, as introduced by Liu et al. (2022) <doi:10.1002/sam.11593>, and an ensemble clustering method based on k-means, spectral, and hierarchical clustering.

Maintained by Tianmou Liu. Last updated 5 months ago.

1.9 match 1 stars 2.18 score 3 scripts

clarkevansteenderen

BinMat:Processes Binary Data Obtained from Fragment Analysis (Such as AFLPs, ISSRs, and RFLPs)

A molecular genetics tool that processes binary data from fragment analysis. It consolidates replicate sample pairs, outputs summary statistics, and produces hierarchical clustering trees and nMDS plots. This package was developed from the publication available here: <https://www.sciencedirect.com/science/article/pii/S1049964420306538>. The GUI version of this package is available on the R Shiny online server at <https://clarkevansteenderen.shinyapps.io/BINMAT/>, or it is accessible via GitHub by typing shiny::runGitHub("BinMat", "clarkevansteenderen") into the R console. Two real-world datasets accompany the package: an AFLP dataset of Bunias orientalis samples from Tewes et al. (2017) <https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2745.12869>, and an ISSR dataset of Nymphaea specimens from Reid et al. (2021) <https://www.sciencedirect.com/science/article/pii/S0304377021000218>. The authors of these publications are thanked for allowing the use of their data.

Maintained by Clarke van Steenderen. Last updated 3 years ago.

1.8 match 2.00 score 3 scripts

jessicakubrusly

CFilt:Recommendation by Collaborative Filtering

Provides methods and functions to implement a Recommendation System based on Collaborative Filtering Methodology. See Aggarwal (2016) <doi:10.1007/978-3-319-29659-3> for an overview.

Maintained by Jessica Kubrusly. Last updated 6 months ago.

3.3 match 1.00 score

bioc

HTSFilter:Filter replicated high-throughput transcriptome sequencing data

This package implements a filtering procedure for replicated transcriptome sequencing data based on a global Jaccard similarity index in order to identify genes with low, constant levels of expression across one or more experimental conditions.

Maintained by Andrea Rau. Last updated 5 months ago.

sequencing rnaseq preprocessing differentialexpression geneexpression normalization immunooncology

0.5 match 6.24 score 58 scripts 1 dependents

tkhamiak

superbiclust:Generating Robust Biclusters from a Bicluster Set (Ensemble Biclustering)

Biclusters are submatrices of a data matrix that satisfy certain conditions of homogeneity. The package contains functions for generating biclusters that are robust with respect to the initialization parameters for a given bicluster solution contained in a bicluster set; this procedure is also known as ensemble biclustering. The set of biclusters is evaluated based on the similarity of its elements (the overlap), and a hierarchical tree is then constructed to obtain cut-off points for the classes of robust biclusters. The result is a number of robust (or super) biclusters with little or no overlap.

Maintained by Tatsiana Khamiakova. Last updated 4 years ago.

1.8 match 1.48 score 2 scripts 1 dependents

bioc

FindIT2:find influential TF and Target based on multi-omics data

This package implements functions to find influential TFs and targets based on different input types. It has five modules: multi-peak multi-gene annotation (mmPeakAnno module), calculation of regulation potential (calcRP module), finding influential targets based on ChIP-Seq and RNA-Seq data (Find influential Target module), finding influential TFs based on different inputs (Find influential TF module), and calculation of peak-gene or peak-peak correlation (peakGeneCor module). It also provides other useful functions, such as integrating information from different sources and calculating Jaccard similarity for your TF.

Maintained by Guandong Shang. Last updated 5 months ago.

software annotation chipseq atacseq generegulation multiplecomparison genetarget

0.5 match 6 stars 5.08 score 7 scripts

marc-75

MSCA:Clustering of Multiple Censored Time-to-Event Endpoints

Provides basic tools for computing clusters of instances described by multiple time-to-event censored endpoints. From long-format datasets, where one instance is described by one or more records of events, a procedure is used to compute state matrices. Then, from state matrices, a procedure provides optimised computation of the Jaccard distance between instances. The library is currently in development, and more options and tools allowing graphical representation of typologies are expected. For methodological details, see our methodological paper: Delord M, Douiri A (2025) <doi:10.1186/s12874-025-02476-7>.

Maintained by Marc Delord. Last updated 1 month ago.

cpp

2.3 match 1.00 score

gzt

catsim:Binary and Categorical Image Similarity Index

Computes a structural similarity metric (after the style of MS-SSIM for images) for binary and categorical 2D and 3D images. Can be based on accuracy (simple matching), Cohen's kappa, Rand index, adjusted Rand index, Jaccard index, Dice index, normalized mutual information, or adjusted mutual information. In addition, provides fast computation of Cohen's kappa, the Rand indices, and the two mutual information measures. Implements the methods of Thompson and Maitra (2020) <doi:10.48550/arXiv.2004.09073>.

Maintained by Geoffrey Thompson. Last updated 6 months ago.

binary-data binary-image-classification binary-image-processing categorical-data categorical-images classification image-processing cpp

0.5 match 5 stars 4.40 score 5 scripts

singator

autoharp:Semi-Automatic Grading of R and Rmd Scripts

A customisable set of tools for assessing and grading R or R-markdown scripts from students. It allows for checking correctness of code output, runtime statistics and static code analysis. The latter feature is made possible by representing R expressions using a tree structure.

Maintained by Vik Gopal. Last updated 3 years ago.

2.0 match 1 stars 1.00 score 8 scripts

bewicklab

holobiont:Microbiome Analysis Tools

We provide functions for identifying the core community phylogeny in any microbiome, drawing phylogenetic Venn diagrams, calculating the core Faith’s PD for a set of communities, and calculating the core UniFrac distance between two sets of communities. All functions rely on construction of a core community phylogeny, which is a phylogeny where branches are defined based on their presence in multiple samples from a single type of habitat. Our package provides two options for constructing the core community phylogeny, a tip-based approach, where the core community phylogeny is identified based on incidence of leaf nodes and a branch-based approach, where the core community phylogeny is identified based on incidence of individual branches. We suggest use of the microViz package, which can be downloaded from the website provided under Additional repositories.

Maintained by Sharon Bewick. Last updated 6 months ago.

software phyloseq

1.8 match 1.00 score

zhang-zeyu

countTransformers:Transform Counts in RNA-Seq Data Analysis

Provides data transformation functions to transform counts in RNA-seq data analysis. Please see the reference: Zhang Z, Yu D, Seo M, Hersh CP, Weiss ST, Qiu W. (2019) <doi:10.1038/s41598-019-41315-w>.

Maintained by Zeyu Zhang. Last updated 6 years ago.

bioinformatics differentialexpression

1.8 match 1.00 score 10 scripts

chongwu-biostat

prclust:Penalized Regression-Based Clustering Method

Clustering is unsupervised and exploratory in nature. Yet, it can be performed through penalized regression with grouping pursuit. In this package, we provide two algorithms for fitting the penalized regression-based clustering (PRclust) with non-convex grouping penalties, such as group truncated lasso, MCP and SCAD. One algorithm is based on a quadratic penalty and the difference convex method. The other is based on difference convex and ADMM, called DC-ADD, and is more efficient. Generalized cross-validation and a stability-based method are provided to select the tuning parameters. The Rand index, adjusted Rand index and Jaccard index are provided to estimate the agreement between estimated cluster memberships and the truth.

Maintained by Chong Wu. Last updated 8 years ago.

cpp

0.5 match 2.70 score 6 scripts

cran

MCSim:Determine the Optimal Number of Clusters

Identifies the optimal number of clusters by calculating the similarity between two clustering methods at the same number of clusters using the corrected indices of Rand and Jaccard as described in Albatineh and Niewiadomska-Bugaj (2011). The number of clusters at which the index most frequently attains its maximum is a candidate for the optimal number of clusters.

Maintained by Ahmed N. Albatineh. Last updated 6 years ago.

0.5 match 2.00 score
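
The MCSim and prclust entries above both use a Jaccard index to measure agreement between two cluster assignments. A minimal sketch of the plain pair-counting form is given here in Python for illustration. It is not either package's API, it omits the chance correction that MCSim describes, and the function name and example labels are assumptions.

```python
from itertools import combinations


def clustering_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two clusterings of the same items.

    Counts item pairs placed together in both clusterings, divided by the
    pairs placed together in at least one of them. Uncorrected for chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
    denom = both + only_a + only_b
    # Assumption: if no pair is co-clustered anywhere, treat as full agreement.
    return both / denom if denom else 1.0


# Hypothetical cluster labels for six items from two methods.
method_1 = [0, 0, 0, 1, 1, 1]
method_2 = [0, 0, 1, 1, 1, 1]
agreement = clustering_jaccard(method_1, method_2)
```

Because only co-membership of pairs is compared, the index does not depend on how the clusters are numbered: relabeling `method_1` as `[1, 1, 1, 0, 0, 0]` gives the same partition and therefore the same result.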

damienfinn

MicroNiche:Microbial Niche Measurements

Measures niche breadth and overlap of microbial taxa from large matrices. Niche breadth measurements include Levins' niche breadth (Bn) index, Hurlbert's Bn and Feinsinger's proportional similarity (PS) index (Feinsinger, P., Spears, E.E., Poole, R.W. (1981) <doi:10.2307/1936664>). Niche overlap measurements include Levins' Overlap (Ludwig, J.A. and Reynolds, J.F. (1988, ISBN:0471832359)) and a Jaccard similarity index of Feinsinger's PS values between taxa pairs, as Proportional Overlap.

Maintained by Damien Finn. Last updated 5 years ago.

0.5 match 3 stars 1.48 score 5 scripts