R-universe search: cluster

mmaechler

cluster:"Finding Groups in Data": Cluster Analysis Extended Rousseeuw et al.

Methods for cluster analysis. Much extended from the original by Peter Rousseeuw, Anja Struyf and Mia Hubert, based on Kaufman and Rousseeuw (1990) "Finding Groups in Data".

Maintained by Martin Maechler. Last updated 4 days ago.

94.9 match 3 stars 11.98 score 14k scripts 2.2k dependents

bioc

clusterExperiment:Compare Clusterings for Single-Cell Sequencing

Provides functionality for running and comparing many different clusterings of single-cell sequencing data or other large mRNA Expression data sets.

Maintained by Elizabeth Purdom. Last updated 5 months ago.

clustering rnaseq sequencing software singlecell cpp

105.5 match 39 stars 9.63 score 192 scripts 1 dependents

chrhennig

fpc:Flexible Procedures for Clustering

Various methods for clustering and cluster validation. Fixed point clustering. Linear regression clustering. Clustering by merging Gaussian mixture components. Symmetric and asymmetric discriminant projections for visualisation of the separation of groupings. Cluster validation statistics for distance based clustering including corrected Rand index. Standardisation of cluster validation statistics by random clusterings and comparison between many clustering methods and numbers of clusters based on this. Cluster-wise cluster stability assessment. Methods for estimation of the number of clusters: Calinski-Harabasz, Tibshirani and Walther's prediction strength, Fang and Wang's bootstrap stability. Gaussian/multinomial mixture fitting for mixed continuous/categorical variables. Variable-wise statistics for cluster interpretation. DBSCAN clustering. Interface functions for many clustering methods implemented in R, including estimating the number of clusters with kmeans, pam and clara. Modality diagnosis for Gaussian mixtures. For an overview see package?fpc.

Maintained by Christian Hennig. Last updated 6 months ago.

85.0 match 11 stars 9.25 score 2.6k scripts 70 dependents

mlr-org

mlr3cluster:Cluster Extension for 'mlr3'

Extends the 'mlr3' package with cluster analysis.

Maintained by Maximilian Mücke. Last updated 26 days ago.

cluster-analysis clustering mlr3

87.4 match 23 stars 8.21 score 50 scripts 2 dependents

bioc

bluster:Clustering Algorithms for Bioconductor

Wraps common clustering algorithms in an easily extended S4 framework. Backends are implemented for hierarchical, k-means and graph-based clustering. Several utilities are also provided to compare and evaluate clustering results.

Maintained by Aaron Lun. Last updated 5 months ago.

immunooncology software geneexpression transcriptomics singlecell clustering cpp

73.4 match 9.43 score 636 scripts 51 dependents

mhahsler

dbscan:Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

A fast reimplementation of several density-based algorithms of the DBSCAN family. Includes the clustering algorithms DBSCAN (density-based spatial clustering of applications with noise) and HDBSCAN (hierarchical DBSCAN), the ordering algorithm OPTICS (ordering points to identify the clustering structure), shared nearest neighbor clustering, and the outlier detection algorithms LOF (local outlier factor) and GLOSH (global-local outlier score from hierarchies). The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search. An R interface to fast kNN and fixed-radius NN search is also provided. Hahsler, Piekenbrock and Doran (2019) <doi:10.18637/jss.v091.i01>.

Maintained by Michael Hahsler. Last updated 2 months ago.

clustering dbscan density-based-clustering hdbscan lof optics cpp

42.7 match 321 stars 15.62 score 1.6k scripts 84 dependents

branchlab

metasnf:Meta Clustering with Similarity Network Fusion

Framework to facilitate patient subtyping with similarity network fusion and meta clustering. The similarity network fusion (SNF) algorithm was introduced by Wang et al. (2014) in <doi:10.1038/nmeth.2810>. SNF is a data integration approach that can transform high-dimensional and diverse data types into a single similarity network suitable for clustering with minimal loss of information from each initial data source. The meta clustering approach was introduced by Caruana et al. (2006) in <doi:10.1109/ICDM.2006.103>. Meta clustering involves generating a wide range of cluster solutions by adjusting clustering hyperparameters, then clustering the solutions themselves into a manageable number of qualitatively similar solutions, and finally characterizing representative solutions to find ones that are best for the user's specific context. This package provides a framework to easily transform multi-modal data into a wide range of similarity network fusion-derived cluster solutions as well as to visualize, characterize, and validate those solutions. Core package functionality includes easy customization of distance metrics, clustering algorithms, and SNF hyperparameters to generate diverse clustering solutions; calculation and plotting of associations between features, between patients, and between cluster solutions; and standard cluster validation approaches including resampled measures of cluster stability, standard metrics of cluster quality, and label propagation to evaluate generalizability in unseen data. Associated vignettes guide the user through using the package to identify patient subtypes while adhering to best practices for unsupervised learning.

Maintained by Prashanth S Velayudhan. Last updated 4 days ago.

bioinformatics clustering metaclustering snf

74.0 match 8 stars 8.21 score 30 scripts

philips-software

latrend:A Framework for Clustering Longitudinal Data

A framework for clustering longitudinal datasets in a standardized way. The package provides an interface to existing R packages for clustering longitudinal univariate trajectories, facilitating reproducible and transparent analyses. Additionally, standard tools are provided to support cluster analyses, including repeated estimation, model validation, and model assessment. The interface enables users to compare results between methods, and to implement and evaluate new methods with ease. The 'akmedoids' package is available from <https://github.com/MAnalytics/akmedoids>.

Maintained by Niek Den Teuling. Last updated 2 months ago.

cluster-analysis clustering-evaluation clustering-methods data-science longitudinal-clustering longitudinal-data mixture-models time-series-analysis

85.9 match 30 stars 6.77 score 26 scripts

mhahsler

stream:Infrastructure for Data Stream Mining

A framework for data stream modeling and associated data mining tasks such as clustering and classification. The development of this package was supported in part by NSF IIS-0948893, NSF CMMI 1728612, and NIH R21HG005912. Hahsler et al (2017) <doi:10.18637/jss.v076.i14>.

Maintained by Michael Hahsler. Last updated 4 days ago.

data-stream-clustering datastream stream-mining cpp

53.3 match 39 stars 10.05 score 132 scripts 3 dependents

luca-scr

mclust:Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation

Gaussian finite mixture models fitted via EM algorithm for model-based clustering, classification, and density estimation, including Bayesian regularization, dimension reduction for visualisation, and resampling-based inference.

Maintained by Luca Scrucca. Last updated 11 months ago.

fortran openblas

42.6 match 21 stars 12.23 score 6.6k scripts 587 dependents

spatstat

spatstat.model:Parametric Statistical Modelling and Inference for the 'spatstat' Family

Functionality for parametric statistical modelling and inference for spatial data, mainly spatial point patterns, in the 'spatstat' family of packages. (Excludes analysis of spatial data on a linear network, which is covered by the separate package 'spatstat.linnet'.) Supports parametric modelling, formal statistical inference, and model validation. Parametric models include Poisson point processes, Cox point processes, Neyman-Scott cluster processes, Gibbs point processes and determinantal point processes. Models can be fitted to data using maximum likelihood, maximum pseudolikelihood, maximum composite likelihood and the method of minimum contrast. Fitted models can be simulated and predicted. Formal inference includes hypothesis tests (quadrat counting tests, Cressie-Read tests, Clark-Evans test, Berman test, Diggle-Cressie-Loosmore-Ford test, scan test, studentised permutation test, segregation test, ANOVA tests of fitted models, adjusted composite likelihood ratio test, envelope tests, Dao-Genton test, balanced independent two-stage test), confidence intervals for parameters, and prediction intervals for point counts. Model validation techniques include leverage, influence, partial residuals, added variable plots, diagnostic plots, pseudoscore residual plots, model compensators and Q-Q plots.

Maintained by Adrian Baddeley. Last updated 7 days ago.

analysis-of-variance cluster-process confidence-intervals cox-process determinantal-point-processes gibbs-process influence leverage model-diagnostics neyman-scott parameter-estimation poisson-process spatial-analysis spatial-modelling spatial-point-processes statistical-inference

55.6 match 5 stars 9.09 score 6 scripts 46 dependents

talgalili

dendextend:Extending 'dendrogram' Functionality in R

Offers a set of functions for extending 'dendrogram' objects in R, letting you visualize and compare trees of 'hierarchical clusterings'. You can (1) adjust a tree's graphical parameters (the color, size, type, etc. of its branches, nodes and labels) and (2) visually and statistically compare different 'dendrograms' to one another.

Maintained by Tal Galili. Last updated 2 months ago.

28.6 match 154 stars 17.02 score 6.0k scripts 164 dependents

revelle

psych:Procedures for Psychological, Psychometric, and Personality Research

A general purpose toolbox developed originally for personality, psychometric theory and experimental psychology. Functions are primarily for multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis and reliability analysis, although others provide basic descriptive statistics. Item Response Theory is done using factor analysis of tetrachoric and polychoric correlations. Functions for analyzing data at multiple levels include within and between group statistics, including correlations and factor analysis. Validation and cross validation of scales developed using basic machine learning algorithms are provided, as are functions for simulating and testing particular item and test structures. Several functions serve as a useful front end for structural equation modeling. Graphical displays of path diagrams, including mediation models, factor analysis and structural equation models are created using basic graphics. Some of the functions are written to support a book on psychometric theory as well as publications in personality research. For more information, see the <https://personality-project.org/r/> web page.

Maintained by William Revelle. Last updated 3 months ago.

34.9 match 52 stars 13.94 score 29k scripts 317 dependents

asardaes

dtwclust:Time Series Clustering Along with Optimizations for the Dynamic Time Warping Distance

Time series clustering along with optimized techniques related to the Dynamic Time Warping distance and its corresponding lower bounds. Implementations of partitional, hierarchical, fuzzy, k-Shape and TADPole clustering are available. Functionality can be easily extended with custom distance measures and centroid definitions. Implementations of DTW barycenter averaging, a distance based on global alignment kernels, and the soft-DTW distance and centroid routines are also provided. All included distance functions have custom loops optimized for the calculation of cross-distance matrices, including parallelization support. Several cluster validity indices are included.

Maintained by Alexis Sarda. Last updated 8 months ago.

clustering dtw time-series openblas cpp

39.0 match 261 stars 12.39 score 406 scripts 14 dependents

okgreece

Cluster.OBeu:Cluster Analysis 'OpenBudgets.eu'

Estimate and return the parameters needed for visualisations designed for 'OpenBudgets' <http://openbudgets.eu/> data. Calculate cluster analysis measures for budget data of municipalities across Europe, according to the 'OpenBudgets' data model. It involves a set of techniques and algorithms used to find and divide the data into groups of similar observations. It can also be used more generally to extract visualisation parameters, convert them to 'JSON' format, and use them as input in a different graphical interface.

Maintained by Kleanthis Koupidis. Last updated 4 years ago.

cluster cluster-analysis clustering-algorithm clustering-measures estimate-clustering-parameters obeu open-budgets openbudgets

101.0 match 2 stars 4.75 score 14 scripts

bioc

CATALYST:Cytometry dATa anALYSis Tools

CATALYST provides tools for preprocessing of and differential discovery in cytometry data such as FACS, CyTOF, and IMC. Preprocessing includes i) normalization using bead standards, ii) single-cell deconvolution, and iii) bead-based compensation. For differential discovery, the package provides a number of convenient functions for data processing (e.g., clustering, dimension reduction), as well as a suite of visualizations for exploratory data analysis and exploration of results from differential abundance (DA) and state (DS) analysis in order to identify differences in composition and expression profiles at the subpopulation-level, respectively.

Maintained by Helena L. Crowell. Last updated 4 months ago.

clustering dataimport differentialexpression experimentaldesign flowcytometry immunooncology massspectrometry normalization preprocessing singlecell software statisticalmethod visualization

40.3 match 67 stars 11.06 score 362 scripts 2 dependents

bioc

tidytof:Analyze High-dimensional Cytometry Data Using Tidy Data Principles

This package implements an interactive, scientific analysis pipeline for high-dimensional cytometry data built using tidy data principles. It is specifically designed to play well with both the tidyverse and Bioconductor software ecosystems, with functionality for reading/writing data files, data cleaning, preprocessing, clustering, visualization, modeling, and other quality-of-life functions. tidytof implements a "grammar" of high-dimensional cytometry data analysis.

Maintained by Timothy Keyes. Last updated 5 months ago.

singlecell flowcytometry bioinformatics cytometry data-science single-cell tidyverse cpp

60.7 match 19 stars 7.26 score 35 scripts

lazappi

clustree:Visualise Clusterings at Different Resolutions

Deciding what resolution to use can be a difficult question when approaching a clustering analysis. One way to approach this problem is to look at how samples move as the number of clusters increases. This package allows you to produce clustering trees, a visualisation for interrogating clusterings as resolution increases.

Maintained by Luke Zappia. Last updated 1 year ago.

clustering clustering-trees visualisation visualization

36.8 match 219 stars 11.40 score 1.9k scripts 5 dependents

bioc

clusterProfiler:A universal enrichment tool for interpreting omics data

This package supports functional characteristics of both coding and non-coding genomics data for thousands of species with up-to-date gene annotation. It provides a universal interface for gene functional annotation from a variety of sources and thus can be applied in diverse scenarios. It provides a tidy interface to access, manipulate, and visualize enrichment results to help users achieve efficient data interpretation. Datasets obtained from multiple treatments and time points can be analyzed and compared in a single run, easily revealing functional consensus and differences among distinct conditions.

Maintained by Guangchuang Yu. Last updated 4 months ago.

annotation clustering genesetenrichment go kegg multiplecomparison pathways reactome visualization enrichment-analysis gsea

23.4 match 1.1k stars 17.03 score 11k scripts 48 dependents

bioc

celda:CEllular Latent Dirichlet Allocation

Celda is a suite of Bayesian hierarchical models for clustering single-cell RNA-sequencing (scRNA-seq) data. It is able to perform "bi-clustering" and simultaneously cluster genes into gene modules and cells into cell subpopulations. It also contains DecontX, a novel Bayesian method to computationally estimate and remove RNA contamination in individual cells without empty droplet information. A variety of scRNA-seq data visualization functions is also included.

Maintained by Joshua Campbell. Last updated 27 days ago.

singlecell geneexpression clustering sequencing bayesian immunooncology dataimport cpp openmp

37.0 match 147 stars 10.47 score 256 scripts 2 dependents

bioc

clustifyr:Classifier for Single-cell RNA-seq Using Cell Clusters

Package designed to aid in classifying cells from single-cell RNA sequencing data using external reference data (e.g., bulk RNA-seq, scRNA-seq, microarray, gene lists). A variety of correlation based methods and gene list enrichment methods are provided to assist cell type assignment.

Maintained by Rui Fu. Last updated 5 months ago.

singlecell annotation sequencing microarray geneexpression assign-identities clusters marker-genes rna-seq single-cell-rna-seq

40.2 match 119 stars 9.63 score 296 scripts

jepusto

clubSandwich:Cluster-Robust (Sandwich) Variance Estimators with Small-Sample Corrections

Provides several cluster-robust variance estimators (i.e., sandwich estimators) for ordinary and weighted least squares linear regression models, including the bias-reduced linearization estimator introduced by Bell and McCaffrey (2002) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2002002/article/9058-eng.pdf> and developed further by Pustejovsky and Tipton (2017) <DOI:10.1080/07350015.2016.1247004>. The package includes functions for estimating the variance-covariance matrix and for testing single- and multiple-contrast hypotheses based on Wald test statistics. Tests of single regression coefficients use Satterthwaite or saddle-point corrections. Tests of multiple-contrast hypotheses use an approximation to Hotelling's T-squared distribution. Methods are provided for a variety of fitted models, including lm() and mlm objects, glm(), geeglm() (from package 'geepack'), ivreg() (from package 'AER'), ivreg() (from package 'ivreg' when estimated by ordinary least squares), plm() (from package 'plm'), gls() and lme() (from 'nlme'), lmer() (from 'lme4'), robu() (from 'robumeta'), and rma.uni() and rma.mv() (from 'metafor').

Maintained by James Pustejovsky. Last updated 15 days ago.

33.1 match 48 stars 11.25 score 656 scripts 4 dependents

matteo21q

jomo:Multilevel Joint Modelling Multiple Imputation

Similarly to Schafer's package 'pan', 'jomo' is a package for multilevel joint modelling multiple imputation (Carpenter and Kenward, 2013) <doi:10.1002/9781119942283>. Novel aspects of 'jomo' are the possibility of handling binary and categorical data through latent normal variables, the option to use cluster-specific covariance matrices and to impute compatibly with the substantive model.

Maintained by Matteo Quartagno. Last updated 3 years ago.

37.1 match 3 stars 9.58 score 126 scripts 154 dependents

gagolews

genieclust:Fast and Robust Hierarchical Clustering with Noise Points Detection

A retake on the Genie algorithm (Gagolewski, 2021 <DOI:10.1016/j.softx.2021.100722>), which is a robust hierarchical clustering method (Gagolewski, Bartoszuk, Cena, 2016 <DOI:10.1016/j.ins.2016.05.003>). It is now faster and more memory efficient; determining the whole cluster hierarchy for datasets of 10M points in low-dimensional Euclidean spaces or 100K points in high-dimensional ones takes only a minute or so. Allows clustering with respect to mutual reachability distances so that it can act as a noise point detector or a robustified version of 'HDBSCAN*' (that is able to detect a predefined number of clusters and hence does not depend on the somewhat fragile 'eps' parameter). The package also features an implementation of inequality indices (e.g., Gini and Bonferroni), external cluster validity measures (e.g., the normalised clustering accuracy, the adjusted Rand index, the Fowlkes-Mallows index, and normalised mutual information), and internal cluster validity indices (e.g., the Calinski-Harabasz, Davies-Bouldin, Ball-Hall, Silhouette, and generalised Dunn indices). See also the 'Python' version of 'genieclust' available on 'PyPI', which supports sparse data, more metrics, and even larger datasets.

Maintained by Marek Gagolewski. Last updated 4 days ago.

cluster-analysis clustering clustering-algorithm data-analysis data-mining data-science genie hdbscan hierarchical-clustering hierarchical-clustering-algorithm machine-learning machine-learning-algorithms mlpack nmslib python python3 sparse cpp openmp

48.5 match 61 stars 7.29 score 13 scripts 5 dependents

acabassi

coca:Cluster-of-Clusters Analysis

Contains the R functions needed to perform Cluster-Of-Clusters Analysis (COCA) and Consensus Clustering (CC). For further details please see Cabassi and Kirk (2020) <doi:10.1093/bioinformatics/btaa593>.

Maintained by Alessandra Cabassi. Last updated 5 years ago.

cluster-analysis cluster-of-clusters clustering coca genomics integrative-clustering multi-omics

70.1 match 6 stars 5.03 score 12 scripts 1 dependents

bioc

SC3:Single-Cell Consensus Clustering

A tool for unsupervised clustering and analysis of single cell RNA-Seq data.

Maintained by Vladimir Kiselev. Last updated 5 months ago.

immunooncology singlecell software classification clustering dimensionreduction supportvectormachine rnaseq visualization transcriptomics datarepresentation gui differentialexpression transcription bioconductor-package human-cell-atlas single-cell-rna-seq openblas cpp

34.9 match 122 stars 10.09 score 374 scripts 1 dependents

bioc

singleCellTK:Comprehensive and Interactive Analysis of Single Cell RNA-Seq Data

The Single Cell Toolkit (SCTK) in the singleCellTK package provides an interface to popular tools for importing, quality control, analysis, and visualization of single cell RNA-seq data. SCTK allows users to seamlessly integrate tools from various packages at different stages of the analysis workflow. A general "a la carte" workflow gives users access to multiple methods for data importing, calculation of general QC metrics, doublet detection, ambient RNA estimation and removal, filtering, normalization, batch correction or integration, dimensionality reduction, 2-D embedding, clustering, marker detection, differential expression, cell type labeling, pathway analysis, and data exporting. Curated workflows can be used to run Seurat and Celda. Streamlined quality control can be performed on the command line using the SCTK-QC pipeline. Users can analyze their data using commands in the R console or by using an interactive Shiny Graphical User Interface (GUI). Specific analyses or entire workflows can be summarized and shared with comprehensive HTML reports generated by Rmarkdown. Additional documentation and vignettes can be found at camplab.net/sctk.

Maintained by Joshua David Campbell. Last updated 23 days ago.

singlecell geneexpression differentialexpression alignment clustering immunooncology batcheffect normalization qualitycontrol dataimport gui

32.9 match 181 stars 10.16 score 252 scripts

bioc

ComplexHeatmap:Make Complex Heatmaps

Complex heatmaps are efficient to visualize associations between different sources of data sets and reveal potential patterns. Here the ComplexHeatmap package provides a highly flexible way to arrange multiple heatmaps and supports various annotation graphics.

Maintained by Zuguang Gu. Last updated 5 months ago.

software visualization sequencing clustering complex-heatmaps heatmap

19.4 match 1.3k stars 16.93 score 16k scripts 151 dependents

bioc

Mfuzz:Soft clustering of omics time series data

The Mfuzz package implements noise-robust soft clustering of omics time-series data, including transcriptomic, proteomic or metabolomic data. It is based on the use of c-means clustering. For convenience, it includes a graphical user interface.

Maintained by Matthias Futschik. Last updated 5 months ago.

microarray clustering timecourse preprocessing visualization

42.4 match 7.64 score 338 scripts 4 dependents

kkholst

mets:Analysis of Multivariate Event Times

Implementation of various statistical models for multivariate event history data <doi:10.1007/s10985-013-9244-x>, including multivariate cumulative incidence models <doi:10.1002/sim.6016> and bivariate random effects probit models (Liability models) <doi:10.1016/j.csda.2015.01.014>. Also provides modern methods for survival analysis, including regression modelling (Cox, Fine-Gray, Ghosh-Lin, Binomial regression) with fast computation of influence functions.

Maintained by Klaus K. Holst. Last updated 2 days ago.

multivariate-time-to-event survival-analysis time-to-event fortran openblas cpp

23.5 match 14 stars 13.47 score 236 scripts 42 dependents

clugen

clugenr:Multidimensional Cluster Generation Using Support Lines

An implementation of the clugen algorithm for generating multidimensional clusters with arbitrary distributions. Each cluster is supported by a line segment, the position, orientation and length of which guide where the respective points are placed. This package is described in Fachada & de Andrade (2023) <doi:10.1016/j.knosys.2023.110836>.

Maintained by Nuno Fachada. Last updated 7 months ago.

multidimensional-clusters multidimensional-data synthetic-clusters synthetic-data-generator synthetic-dataset-generation

57.9 match 5 stars 5.39 score 14 scripts

bioc

Banksy:Spatial transcriptomic clustering

Banksy is an R package that incorporates spatial information to cluster cells in a feature space (e.g. gene expression). To incorporate spatial information, BANKSY computes the mean neighborhood expression and azimuthal Gabor filters that capture gene expression gradients. These features are combined with the cell's own expression to embed cells in a neighbor-augmented product space which can then be clustered, allowing for accurate and spatially-aware cell typing and tissue domain segmentation.

Maintained by Joseph Lee. Last updated 12 days ago.

clustering spatial singlecell geneexpression dimensionreduction clustering-algorithm single-cell-omics spatial-omics

34.3 match 90 stars 9.03 score 248 scripts

mlampros

ClusterR:Gaussian Mixture Models, K-Means, Mini-Batch-Kmeans, K-Medoids and Affinity Propagation Clustering

Gaussian mixture models, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering with the option to plot, validate, predict (new data) and estimate the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to speed up the computationally intensive parts of the functions. For more information, see (i) "Clustering in an Object-Oriented Environment" by Anja Struyf, Mia Hubert, Peter Rousseeuw (1997), Journal of Statistical Software, <doi:10.18637/jss.v001.i04>; (ii) "Web-scale k-means clustering" by D. Sculley (2010), ACM Digital Library, <doi:10.1145/1772690.1772862>; (iii) "Armadillo: a template-based C++ library for linear algebra" by Sanderson et al (2016), The Journal of Open Source Software, <doi:10.21105/joss.00026>; (iv) "Clustering by Passing Messages Between Data Points" by Brendan J. Frey and Delbert Dueck, Science 16 Feb 2007: Vol. 315, Issue 5814, pp. 972-976, <doi:10.1126/science.1136800>.

Maintained by Lampros Mouselimis. Last updated 9 months ago.

affinity-propagation cpp11 gmm kmeans kmedoids-clustering mini-batch-kmeans rcpparmadillo openblas cpp openmp

27.7 match 84 stars 11.04 score 640 scripts 24 dependents

ropensci

phylotaR:Automated Phylogenetic Sequence Cluster Identification from 'GenBank'

A pipeline for the identification, within taxonomic groups, of orthologous sequence clusters from 'GenBank' <https://www.ncbi.nlm.nih.gov/genbank/> as the first step in a phylogenetic analysis. The pipeline depends on a local alignment search tool and is, therefore, not dependent on differences in gene naming conventions and naming errors.

Maintained by Shixiang Wang. Last updated 8 months ago.

blastn genbank peer-reviewed phylogenetics sequence-alignment

51.8 match 23 stars 5.86 score 156 scripts

bioc

genefu:Computation of Gene Expression-Based Signatures in Breast Cancer

This package contains functions implementing various tasks usually required by gene expression analysis, especially in breast cancer studies: gene mapping between different microarray platforms, identification of molecular subtypes, implementation of published gene signatures, gene selection, and survival analysis.

Maintained by Benjamin Haibe-Kains. Last updated 4 months ago.

differentialexpression geneexpression visualization clustering classification

40.4 match 7.42 score 193 scripts 3 dependents

s3alfisc

fwildclusterboot:Fast Wild Cluster Bootstrap Inference for Linear Models

Implementation of fast algorithms for wild cluster bootstrap inference developed in 'Roodman et al' (2019, 'STATA' Journal, <doi:10.1177/1536867X19830877>) and 'MacKinnon et al' (2022), which makes it feasible to quickly calculate bootstrap test statistics based on a large number of bootstrap draws even for large samples. Multiple bootstrap types as described in 'MacKinnon, Nielsen & Webb' (2022) are supported. Further, 'multiway' clustering, regression weights, bootstrap weights, fixed effects and 'subcluster' bootstrapping are supported. Further, both restricted ('WCR') and unrestricted ('WCU') bootstrap are supported. Methods are provided for a variety of fitted models, including 'lm()', 'feols()' (from package 'fixest') and 'felm()' (from package 'lfe'). Additionally implements a 'heteroskedasticity-robust' ('HC1') wild bootstrap. Last, the package provides an R binding to 'WildBootTests.jl', which provides additional speed gains and functionality, including the 'WRE' bootstrap for instrumental variable models (based on models of type 'ivreg()' from package 'ivreg') and hypotheses with q > 1.

Maintained by Alexander Fischer. Last updated 2 years ago.

clustered-standard-errors linear-regression-models wild-bootstrap wild-cluster-bootstrap openblas cpp openmp

44.8 match 24 stars 6.67 score 109 scripts 2 dependents

wlandau

crew.cluster:Crew Launcher Plugins for Traditional High-Performance Computing Clusters

In computationally demanding analysis projects, statisticians and data scientists asynchronously deploy long-running tasks to distributed systems, ranging from traditional clusters to cloud services. The 'crew.cluster' package extends the 'mirai'-powered 'crew' package with worker launcher plugins for traditional high-performance computing systems. Inspiration also comes from packages 'mirai' by Gao (2023) <https://github.com/shikokuchuo/mirai>, 'future' by Bengtsson (2021) <doi:10.32614/RJ-2021-048>, 'rrq' by FitzJohn and Ashton (2023) <https://github.com/mrc-ide/rrq>, 'clustermq' by Schubert (2019) <doi:10.1093/bioinformatics/btz284>, and 'batchtools' by Lang, Bischl, and Surmann (2017) <doi:10.21105/joss.00135>.

Maintained by William Michael Landau. Last updated 1 months ago.

crew high-performance-computing

43.8 match 28 stars 6.81 score 68 scripts

rezakj

iCellR:Analyzing High-Throughput Single Cell Sequencing Data

A toolkit that allows scientists to work with data from single cell sequencing technologies such as scRNA-seq, scVDJ-seq, scATAC-seq, CITE-Seq and Spatial Transcriptomics (ST). Single (i) Cell R package ('iCellR') provides unprecedented flexibility at every step of the analysis pipeline, including normalization, clustering, dimensionality reduction, imputation, visualization, and so on. Users can design both unsupervised and supervised models to best suit their research. In addition, the toolkit provides 2D and 3D interactive visualizations, differential expression analysis, filters based on cells, genes and clusters, data merging, normalizing for dropouts, data imputation methods, correcting for batch differences, pathway analysis, tools to find marker genes for clusters and conditions, predict cell types and pseudotime analysis. See Khodadadi-Jamayran, et al (2020) <doi:10.1101/2020.05.05.078550> and Khodadadi-Jamayran, et al (2020) <doi:10.1101/2020.03.31.019109> for more details.

Maintained by Alireza Khodadadi-Jamayran. Last updated 8 months ago.

10xgenomics 3d batch-normalization cell-type-classification cite-seq clustering clustering-algorithm diffusion-maps dropout icellr imputation intractive-graph normalization pseudotime scrna-seq scvdj-seq singel-cell-sequencing umap cpp

53.4 match 121 stars 5.56 score 7 scripts 1 dependents

bioc

cogena:co-expressed gene-set enrichment analysis

cogena is a workflow for co-expressed gene-set enrichment analysis. It aims to discover smaller-scale but highly correlated cellular events that may be of great biological relevance. A novel pipeline for drug discovery and drug repositioning based on the cogena workflow is proposed. In particular, candidate drugs can be predicted based on the gene expression of disease-related data, or other similar drugs can be identified based on the gene expression of drug-related data. Moreover, the drug mode of action can be disclosed by the associated pathway analysis. In summary, cogena is a flexible workflow for various gene set enrichment analyses for co-expressed genes, with a focus on pathway/GO analysis and drug repositioning.

Maintained by Zhilong Jia. Last updated 5 months ago.

clustering genesetenrichment geneexpression visualization pathways kegg go microarray sequencing systemsbiology datarepresentation dataimport bioconductor bioinformatics

39.9 match 12 stars 7.36 score 32 scripts

bioc

immunoClust:immunoClust - Automated Pipeline for Population Detection in Flow Cytometry

immunoClust is a model-based clustering approach for Flow Cytometry samples. The cell-events of single Flow Cytometry samples are modelled by a mixture of multivariate normal- or t-distributions. The cell-event clusters of several samples are modelled by a mixture of multivariate normal-distributions, aiming at stable co-clusters across these samples.

Maintained by Till Soerensen. Last updated 4 months ago.

clustering flowcytometry singlecell cellbasedassays immunooncology gsl cpp

65.6 match 4.38 score 4 scripts

sparklyr

sparklyr:R Interface to Apache Spark

R interface to Apache Spark, a fast and general engine for big data processing, see <https://spark.apache.org/>. This package supports connecting to local and remote Apache Spark clusters, provides a 'dplyr' compatible back-end, and provides an interface to Spark's built-in machine learning algorithms.

Maintained by Edgar Ruiz. Last updated 9 days ago.

apache-spark distributed dplyr ide livy machine-learning remote-clusters spark sparklyr

18.9 match 959 stars 15.16 score 4.0k scripts 21 dependents

joemsong

Ckmeans.1d.dp:Optimal, Fast, and Reproducible Univariate Clustering

Fast, optimal, and reproducible weighted univariate clustering by dynamic programming. Four problems are solved, including univariate k-means (Wang & Song 2011) <doi:10.32614/RJ-2011-015> (Song & Zhong 2020) <doi:10.1093/bioinformatics/btaa613>, k-median, k-segments, and multi-channel weighted k-means. Dynamic programming is used to minimize the sum of (weighted) within-cluster distances using respective metrics. Its advantage over heuristic clustering in efficiency and accuracy is pronounced when there are many clusters. Multi-channel weighted k-means groups multiple univariate signals into k clusters. An auxiliary function generates histograms adaptive to patterns in data. This package provides a powerful set of tools for univariate data analysis with guaranteed optimality, efficiency, and reproducibility, useful for peak calling on temporal, spatial, and spectral data.

Maintained by Joe Song. Last updated 2 years ago.

cpp

33.2 match 19 stars 8.62 score 339 scripts 19 dependents

core-bioinformatics

ClustAssess:Tools for Assessing Clustering

A set of tools for evaluating clustering robustness using proportion of ambiguously clustered pairs (Senbabaoglu et al. (2014) <doi:10.1038/srep06207>), as well as similarity across methods and method stability using element-centric clustering comparison (Gates et al. (2019) <doi:10.1038/s41598-019-44892-y>). Additionally, this package enables stability-based parameter assessment for graph-based clustering pipelines typical in single-cell data analysis.

Maintained by Andi Munteanu. Last updated 1 month ago.

software singlecell rnaseq atacseq normalization preprocessing dimensionreduction visualization qualitycontrol clustering classification annotation geneexpression differentialexpression bioinformatics genomics machine-learning parameter-optimization robustness single-cell unsupervised-learning cpp

50.0 match 22 stars 5.68 score 18 scripts

bioc

flowClust:Clustering for Flow Cytometry

Robust model-based clustering using a t-mixture model with Box-Cox transformation. Note: users should have GSL installed. Windows users: 'consult the README file available in the inst directory of the source distribution for necessary configuration instructions'.

Maintained by Greg Finak. Last updated 5 months ago.

immunooncology clustering visualization flowcytometry

38.7 match 7.30 score 83 scripts 6 dependents

bioc

ChemmineR:Cheminformatics Toolkit for R

ChemmineR is a cheminformatics package for analyzing drug-like small molecule data in R. Its latest version contains functions for efficient processing of large numbers of molecules, physicochemical/structural property predictions, structural similarity searching, classification and clustering of compound libraries with a wide spectrum of algorithms. In addition, it offers visualization functions for compound clustering results and chemical structures.

Maintained by Thomas Girke. Last updated 5 months ago.

cheminformatics biomedicalinformatics pharmacogenetics pharmacogenomics microtitreplateassay cellbasedassays visualization infrastructure dataimport clustering proteomics metabolomics cpp

29.7 match 14 stars 9.42 score 253 scripts 12 dependents

tomasfryda

h2o:R Interface for the 'H2O' Scalable Machine Learning Platform

R interface for 'H2O', the scalable open source machine learning platform that offers parallelized implementations of many supervised and unsupervised machine learning algorithms such as Generalized Linear Models (GLM), Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks (Deep Learning), Stacked Ensembles, Naive Bayes, Generalized Additive Models (GAM), ANOVA GLM, Cox Proportional Hazards, K-Means, PCA, ModelSelection, Word2Vec, as well as a fully automatic machine learning algorithm (H2O AutoML).

Maintained by Tomas Fryda. Last updated 1 year ago.

34.0 match 3 stars 8.20 score 7.8k scripts 11 dependents

emf-creaf

vegclust:Fuzzy Clustering of Vegetation Data

A set of functions to: (1) perform fuzzy clustering of vegetation data (De Caceres et al, 2010) <doi:10.1111/j.1654-1103.2010.01211.x>; (2) assess ecological community similarity on the basis of structure and composition (De Caceres et al, 2013) <doi:10.1111/2041-210X.12116>.

Maintained by Miquel De Cáceres. Last updated 8 months ago.

43.8 match 2 stars 6.28 score 52 scripts 6 dependents

kassambara

factoextra:Extract and Visualize the Results of Multivariate Data Analyses

Provides some easy-to-use functions to extract and visualize the output of multivariate data analyses, including 'PCA' (Principal Component Analysis), 'CA' (Correspondence Analysis), 'MCA' (Multiple Correspondence Analysis), 'FAMD' (Factor Analysis of Mixed Data), 'MFA' (Multiple Factor Analysis) and 'HMFA' (Hierarchical Multiple Factor Analysis) functions from different R packages. It also contains functions for simplifying some clustering analysis steps and provides elegant 'ggplot2'-based data visualization.

Maintained by Alboukadel Kassambara. Last updated 5 years ago.

19.0 match 363 stars 14.13 score 15k scripts 52 dependents

laperez

Clustering:Techniques for Evaluating Clustering

This package is designed to run different clustering packages and compare their results, to determine which algorithm behaves best on the data provided. See Martos, L.A.P., García-Vico, Á.M., González, P. et al. (2023) <doi:10.1007/s13748-022-00294-2> "Clustering: an R library to facilitate the analysis and comparison of cluster algorithms.", Martos, L.A.P., García-Vico, Á.M., González, P. et al. "A Multiclustering Evolutionary Hyperrectangle-Based Algorithm" <doi:10.1007/s44196-023-00341-3> and Martos, L.A.P., García-Vico, Á.M., González, P. et al. "An Evolutionary Fuzzy System for Multiclustering in Data Streaming" <doi:10.1016/j.procs.2023.12.058>.

Maintained by Luis Alfonso Perez Martos. Last updated 11 months ago.

66.3 match 5 stars 4.04 score 7 scripts

cran

ppclust:Probabilistic and Possibilistic Cluster Analysis

Partitioning clustering divides the objects in a data set into non-overlapping subsets or clusters by using prototype-based probabilistic and possibilistic clustering algorithms. This package covers a set of functions for Fuzzy C-Means (Bezdek, 1974) <doi:10.1080/01969727308546047>, Possibilistic C-Means (Krishnapuram & Keller, 1993) <doi:10.1109/91.227387>, Possibilistic Fuzzy C-Means (Pal et al, 2005) <doi:10.1109/TFUZZ.2004.840099>, Possibilistic Clustering Algorithm (Yang et al, 2006) <doi:10.1016/j.patcog.2005.07.005>, Possibilistic C-Means with Repulsion (Wachs et al, 2006) <doi:10.1007/3-540-31662-0_6> and other variants of hard and soft clustering algorithms. The cluster prototypes and membership matrices required by these partitioning algorithms are initialized with different initialization techniques that are available in the package 'inaparc'. Besides the Euclidean distance, a set of commonly used distance metrics is available for some of the algorithms in the package.

Maintained by Zeynel Cebeci. Last updated 1 year ago.

69.1 match 1 stars 3.86 score 4 dependents

rte-antares-rpackage

antaresEditObject:Edit an 'Antares' Simulation

Edit an 'Antares' simulation before running it: create new areas, links, thermal clusters or binding constraints, or edit existing ones. Update 'Antares' general & optimization settings. 'Antares' is an open source power system generator; more information is available here: <https://antares-simulator.org/>.

Maintained by Tatiana Vargas. Last updated 27 days ago.

antares-simulation cluster energy monte-carlo-simulation rte

29.6 match 8 stars 8.76 score 101 scripts

bioc

BayesSpace:Clustering and Resolution Enhancement of Spatial Transcriptomes

Tools for clustering and enhancing the resolution of spatial gene expression experiments. BayesSpace clusters a low-dimensional representation of the gene expression matrix, incorporating a spatial prior to encourage neighboring spots to cluster together. The method can enhance the resolution of the low-dimensional representation into "sub-spots", for which features such as gene expression or cell type composition can be imputed.

Maintained by Matt Stone. Last updated 5 months ago.

software clustering transcriptomics geneexpression singlecell immunooncology dataimport openblas cpp openmp

29.0 match 123 stars 8.89 score 278 scripts 1 dependents

satijalab

Seurat:Tools for Single Cell Genomics

A toolkit for quality control, analysis, and exploration of single cell RNA sequencing data. 'Seurat' aims to enable users to identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and to integrate diverse types of single cell data. See Satija R, Farrell J, Gennert D, et al (2015) <doi:10.1038/nbt.3192>, Macosko E, Basu A, Satija R, et al (2015) <doi:10.1016/j.cell.2015.05.002>, Stuart T, Butler A, et al (2019) <doi:10.1016/j.cell.2019.05.031>, and Hao, Hao, et al (2020) <doi:10.1101/2020.10.12.335331> for more details.

Maintained by Paul Hoffman. Last updated 1 year ago.

human-cell-atlas single-cell-genomics single-cell-rna-seq cpp

15.2 match 2.4k stars 16.86 score 50k scripts 73 dependents

bioc

diffcyt:Differential discovery in high-dimensional cytometry via high-resolution clustering

Statistical methods for differential discovery analyses in high-dimensional cytometry data (including flow cytometry, mass cytometry or CyTOF, and oligonucleotide-tagged cytometry), based on a combination of high-resolution clustering and empirical Bayes moderated tests adapted from transcriptomics.

Maintained by Lukas M. Weber. Last updated 1 month ago.

immunooncology flowcytometry proteomics singlecell cellbasedassays cellbiology clustering featureextraction software

25.6 match 20 stars 9.98 score 225 scripts 5 dependents

mlizhangx

NAIR:Network Analysis of Immune Repertoire

Pipelines for studying the adaptive immune repertoire of T cells and B cells via network analysis based on receptor sequence similarity. Relate clinical outcomes to immune repertoires based on their network properties, or to particular clusters and clones within a repertoire. Yang et al. (2023) <doi:10.3389/fimmu.2023.1181825>.

Maintained by Brian Neal. Last updated 2 months ago.

cpp openmp

36.1 match 7 stars 6.83 score 27 scripts

bioc

CAGEfightR:Analysis of Cap Analysis of Gene Expression (CAGE) data using Bioconductor

CAGE is a widely used high-throughput assay for measuring transcription start site (TSS) activity. CAGEfightR is an R/Bioconductor package for performing a wide range of common data analysis tasks for CAGE and 5'-end data in general. Core functionality includes: import of CAGE TSSs (CTSSs), tag (or unidirectional) clustering for TSS identification, bidirectional clustering for enhancer identification, annotation with transcript and gene models, correlation of TSS and enhancer expression, calculation of TSS shapes, quantification of CAGE expression as expression matrices and genome browser visualization.

Maintained by Malte Thodberg. Last updated 5 months ago.

software transcription coverage geneexpression generegulation peakdetection dataimport datarepresentation transcriptomics sequencing annotation genomebrowsers normalization preprocessing visualization

33.0 match 8 stars 7.46 score 67 scripts 1 dependents

kurthornik

clue:Cluster Ensembles

CLUster Ensembles.

Maintained by Kurt Hornik. Last updated 4 months ago.

24.9 match 2 stars 9.85 score 496 scripts 401 dependents

pneuvial

adjclust:Adjacency-Constrained Clustering of a Block-Diagonal Similarity Matrix

Implements a constrained version of hierarchical agglomerative clustering, in which each observation is associated with a position, and only adjacent clusters can be merged. Typical application fields in bioinformatics include Genome-Wide Association Studies or Hi-C data analysis, where the similarity between items is a decreasing function of their genomic distance. Taking advantage of this feature, the implemented algorithm is time and memory efficient. This algorithm is described in Ambroise et al (2019) <doi:10.1186/s13015-019-0157-4>.

Maintained by Pierre Neuvial. Last updated 5 months ago.

clustering featureextraction gwas hi-c hierarchical-clustering linkage-disequilibrium cpp openmp

33.0 match 16 stars 7.35 score 13 scripts 2 dependents

a-dudek-ue

clusterSim:Searching for Optimal Clustering Procedure for a Data Set

Distance measures (GDM1, GDM2, Sokal-Michener, Bray-Curtis, for symbolic interval-valued data), cluster quality indices (Calinski-Harabasz, Baker-Hubert, Hubert-Levine, Silhouette, Krzanowski-Lai, Hartigan, Gap, Davies-Bouldin), data normalization formulas (metric data, interval-valued symbolic data), data generation (typical and non-typical data), HINoV method, replication analysis, linear ordering methods, spectral clustering, agreement indices between two partitions, plot functions (for categorical and symbolic interval-valued data). (MILLIGAN, G.W., COOPER, M.C. (1985) <doi:10.1007/BF02294245>, HUBERT, L., ARABIE, P. (1985) <doi:10.1007/BF01908075>, RAND, W.M. (1971) <doi:10.1080/01621459.1971.10482356>, JAJUGA, K., WALESIAK, M. (2000) <doi:10.1007/978-3-642-57280-7_11>, MILLIGAN, G.W., COOPER, M.C. (1988) <doi:10.1007/BF01897163>, JAJUGA, K., WALESIAK, M., BAK, A. (2003) <doi:10.1007/978-3-642-55721-7_12>, DAVIES, D.L., BOULDIN, D.W. (1979) <doi:10.1109/TPAMI.1979.4766909>, CALINSKI, T., HARABASZ, J. (1974) <doi:10.1080/03610927408827101>, HUBERT, L. (1974) <doi:10.1080/01621459.1974.10480191>, TIBSHIRANI, R., WALTHER, G., HASTIE, T. (2001) <doi:10.1111/1467-9868.00293>, BRECKENRIDGE, J.N. (2000) <doi:10.1207/S15327906MBR3502_5>, WALESIAK, M., DUDEK, A. (2008) <doi:10.1007/978-3-540-78246-9_11>).

Maintained by Andrzej Dudek. Last updated 6 months ago.

cpp

37.5 match 2 stars 6.35 score 512 scripts 9 dependents

bioc

CatsCradle:This package provides methods for analysing spatial transcriptomics data and for discovering gene clusters

This package addresses two broad areas. It allows for in-depth analysis of spatial transcriptomic data by identifying tissue neighbourhoods. These are contiguous regions of tissue surrounding individual cells. 'CatsCradle' allows for the categorisation of neighbourhoods by the cell types contained in them and the genes expressed in them. In particular, it produces Seurat objects whose individual elements are neighbourhoods rather than cells. In addition, it enables the categorisation and annotation of genes by producing Seurat objects whose elements are genes.

Maintained by Michael Shapiro. Last updated 1 month ago.

biologicalquestion statisticalmethod geneexpression singlecell transcriptomics spatial

36.4 match 3 stars 6.50 score

ovvo-financial

NNS:Nonlinear Nonparametric Statistics

Nonlinear nonparametric statistics using partial moments. Partial moments are the elements of variance and asymptotically approximate the area of f(x). These robust statistics provide the basis for nonlinear analysis while retaining linear equivalences. NNS offers: Numerical integration, Numerical differentiation, Clustering, Correlation, Dependence, Causal analysis, ANOVA, Regression, Classification, Seasonality, Autoregressive modeling, Normalization, Stochastic dominance and Advanced Monte Carlo sampling. All routines based on: Viole, F. and Nawrocki, D. (2013), Nonlinear Nonparametric Statistics: Using Partial Moments (ISBN: 1490523995).

Maintained by Fred Viole. Last updated 5 days ago.

clustering econometrics machine-learning nonlinear nonparametric partial-moments statistics time-series cpp

21.5 match 71 stars 10.96 score 66 scripts 3 dependents

ubod

apcluster:Affinity Propagation Clustering

Implements Affinity Propagation clustering introduced by Frey and Dueck (2007) <DOI:10.1126/science.1136800>. The algorithms are largely analogous to the 'Matlab' code published by Frey and Dueck. The package further provides leveraged affinity propagation and an algorithm for exemplar-based agglomerative clustering that can also be used to join clusters obtained from affinity propagation. Various plotting functions are available for analyzing clustering results.

Maintained by Ulrich Bodenhofer. Last updated 11 months ago.

cpp

24.1 match 10 stars 9.82 score 270 scripts 25 dependents

bioc

spatialHeatmap:spatialHeatmap: Visualizing Spatial Assays in Anatomical Images and Large-Scale Data Extensions

The spatialHeatmap package offers the primary functionality for visualizing cell-, tissue- and organ-specific assay data in spatial anatomical images. Additionally, it provides extended functionalities for large-scale data mining routines and co-visualizing bulk and single-cell data. A description of the project is available here: https://spatialheatmap.org.

Maintained by Jianhai Zhang. Last updated 4 months ago.

spatial visualization microarray sequencing geneexpression datarepresentation network clustering graphandnetwork cellbasedassays atacseq dnaseq tissuemicroarray singlecell cellbiology genetarget

37.7 match 5 stars 6.26 score 12 scripts

bioc

CAGEr:Analysis of CAGE (Cap Analysis of Gene Expression) sequencing data for precise mapping of transcription start sites and promoterome mining

The CAGEr package identifies transcription start sites (TSS) and their usage frequency from CAGE (Cap Analysis Gene Expression) sequencing data. It normalises raw CAGE tag counts, clusters TSSs into tag clusters (TC) and aggregates them across multiple CAGE experiments to construct consensus clusters (CC) representing the promoterome. CAGEr provides functions to profile expression levels of these clusters by cumulative expression and rarefaction analysis, and outputs the plots in ggplot2 format for further facetting and customisation. After clustering, CAGEr performs analyses of promoter width and detects differential usage of TSSs (promoter shifting) between samples. CAGEr also exports its data as genome browser tracks, and as R objects for downstream expression analysis by other Bioconductor packages such as DESeq2, CAGEfightR, or seqArchR.

Maintained by Charles Plessy. Last updated 5 months ago.

preprocessing sequencing normalization functionalgenomics transcription geneexpression clustering visualization

38.3 match 6.12 score 73 scripts

crj32

Spectrum:Fast Adaptive Spectral Clustering for Single and Multi-View Data

A self-tuning spectral clustering method for single or multi-view data. 'Spectrum' uses a new type of adaptive density aware kernel that strengthens connections in the graph based on common nearest neighbours. It uses a tensor product graph data integration and diffusion procedure to integrate different data sources and reduce noise. 'Spectrum' uses either the eigengap or multimodality gap heuristics to determine the number of clusters. The method is sufficiently flexible so that a wide range of Gaussian and non-Gaussian structures can be clustered with automatic selection of K.

Maintained by Christopher R John. Last updated 5 years ago.

clustering spectral-clustering

38.8 match 7 stars 5.99 score 47 scripts 1 dependents

bioc

monocle:Clustering, differential expression, and trajectory analysis for single-cell RNA-Seq

Monocle performs differential expression and time-series analysis for single-cell expression experiments. It orders individual cells according to progress through a biological process, without knowing ahead of time which genes define progress through that process. Monocle also performs differential expression analysis, clustering, visualization, and other useful tasks on single cell expression data. It is designed to work with RNA-Seq and qPCR data, but could be used with other types as well.

Maintained by Cole Trapnell. Last updated 5 months ago.

immunooncology sequencing rnaseq geneexpression differentialexpression infrastructure dataimport datarepresentation visualization clustering multiplecomparison qualitycontrol cpp

26.1 match 8.89 score 1.6k scripts 2 dependents

hiweller

colordistance:Distance Metrics for Image Color Similarity

Loads and displays images, selectively masks specified background colors, bins pixels by color using either data-dependent or automatically generated color bins, quantitatively measures color similarity among images using one of several distance metrics for comparing pixel color clusters, and clusters images by object color similarity. Uses CIELAB, RGB, or HSV color spaces. Originally written for use with organism coloration (reef fish color diversity, butterfly mimicry, etc), but easily applicable for any image set.

Maintained by Hannah Weller. Last updated 1 year ago.

28.9 match 37 stars 7.93 score 76 scripts 2 dependents

juba

rainette:The Reinert Method for Textual Data Clustering

An R implementation of the Reinert text clustering method. For more details about the algorithm see the included vignettes or Reinert (1990) <doi:10.1177/075910639002600103>.

Maintained by Julien Barnier. Last updated 11 months ago.

text-analysis text-classification cpp

32.7 match 55 stars 6.90 score 24 scripts

bioc

scran:Methods for Single-Cell RNA-Seq Data Analysis

Implements miscellaneous functions for interpretation of single-cell RNA-seq data. Methods are provided for assignment of cell cycle phase, detection of highly variable and significantly correlated genes, identification of marker genes, and other common tasks in routine single-cell analysis workflows.

Maintained by Aaron Lun. Last updated 5 months ago.

immunooncology normalization sequencing rnaseq software geneexpression transcriptomics singlecell clustering bioconductor-package human-cell-atlas single-cell-rna-seq openblas cpp

17.1 match 41 stars 13.14 score 7.6k scripts 36 dependents

bioc

timeOmics:Time-Course Multi-Omics data integration

timeOmics is a generic data-driven framework to integrate multi-Omics longitudinal data measured on the same biological samples and select key temporal features with strong associations within the same sample group. The main steps of timeOmics are: 1. Platform and time-specific normalization and filtering steps; 2. Modelling each biological feature into one time expression profile; 3. Clustering features with the same expression profile over time; 4. Post-hoc validation step.

Maintained by Antoine Bodein. Last updated 5 months ago.

clustering featureextraction timecourse dimensionreduction software sequencing microarray metabolomics metagenomics proteomics classification regression immunooncology geneprediction multiplecomparison cluster integration multi-omics time-series

37.6 match 24 stars 5.98 score 10 scripts

bioc

GeDi:Defining and visualizing the distances between different genesets

The package provides different distance measurements to calculate the difference between genesets. Based on these scores, the genesets are clustered and visualized as a graph. This is all presented in an interactive Shiny application for easy usage.

Maintained by Annekathrin Nedwed. Last updated 5 months ago.

gui genesetenrichment software transcription rnaseq visualization clustering pathways reportwriting go kegg reactome shinyapps

40.6 match 1 stars 5.52 score 22 scripts

jayanilakshika

cardinalR:Collection of Data Structures

A collection of simple simulation datasets designed for generating representations with Nonlinear Dimension Reduction techniques such as t-distributed Stochastic Neighbor Embedding and Uniform Manifold Approximation and Projection. These datasets serve as a valuable resource for understanding the reliability of Nonlinear Dimension Reduction representations in various contexts.

Maintained by Jayani P.G. Lakshika. Last updated 11 days ago.

49.3 match 4.54 score

tudo-r

BatchJobs:Batch Computing with R

Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine. Multicore and SSH systems are also supported. For further details see the project web page.

Maintained by Bernd Bischl. Last updated 3 years ago.

26.0 match 85 stars 8.57 score 616 scripts 3 dependents

nvelden

geneviewer:Gene Cluster Visualizations

Provides tools for plotting gene clusters and transcripts by importing data from GenBank, FASTA, and GFF files. It performs BLASTP and MUMmer alignments [Altschul et al. (1990) <doi:10.1016/S0022-2836(05)80360-2>; Delcher et al. (1999) <doi:10.1093/nar/27.11.2369>] and displays results on gene arrow maps. Extensive customization options are available, including legends, labels, annotations, scales, colors, tooltips, and more.

Maintained by Niels van der Velden. Last updated 29 days ago.

genetics

37.7 match 43 stars 5.86 score 13 scripts

bioc

ILoReg:ILoReg: a tool for high-resolution cell population identification from scRNA-Seq data

ILoReg is a tool for identification of cell populations from scRNA-seq data. In particular, ILoReg is useful for finding cell populations with subtle transcriptomic differences. The method utilizes a self-supervised learning method, called Iterative Clustering Projection (ICP), to find cluster probabilities, which are used in noise reduction prior to PCA and the subsequent hierarchical clustering and t-SNE steps. Additionally, functions for differential expression analysis to find gene markers for the populations and gene expression visualization are provided.

Maintained by Johannes Smolander. Last updated 5 months ago.

singlecell software clustering dimensionreduction rnaseq visualization transcriptomics datarepresentation differentialexpression transcription geneexpression

44.9 match 5 stars 4.88 score 2 scripts

matthias-studer

WeightedCluster:Clustering of Weighted Data

Clusters state sequences and weighted data. It provides an optimized weighted PAM algorithm as well as functions for aggregating replicated cases, computing cluster quality measures for a range of clustering solutions and plotting (fuzzy) clusters of state sequences. Parametric bootstrap methods to validate typologies of sequences are also provided. Finally, it provides fuzzy and crisp CLARA algorithms to cluster large databases in sequence analysis.

Maintained by Matthias Studer. Last updated 3 months ago.

cpp

39.0 match 5.55 score 106 scripts 4 dependents

r-forge

sandwich:Robust Covariance Matrix Estimators

Object-oriented software for model-robust covariance matrix estimators. Starting out from the basic robust Eicker-Huber-White sandwich covariance, methods include: heteroscedasticity-consistent (HC) covariances for cross-section data; heteroscedasticity- and autocorrelation-consistent (HAC) covariances for time series data (such as Andrews' kernel HAC, Newey-West, and WEAVE estimators); clustered covariances (one-way and multi-way); panel and panel-corrected covariances; outer-product-of-gradients covariances; and (clustered) bootstrap covariances. All methods are applicable to (generalized) linear model objects fitted by lm() and glm() but can also be adapted to other classes through S3 methods. Details can be found in Zeileis et al. (2020) <doi:10.18637/jss.v095.i01>, Zeileis (2004) <doi:10.18637/jss.v011.i10> and Zeileis (2006) <doi:10.18637/jss.v016.i09>.

Maintained by Achim Zeileis. Last updated 2 months ago.

14.2 match 14.92 score 11k scripts 887 dependents

bioc

ViSEAGO:ViSEAGO: a Bioconductor package for clustering biological functions using Gene Ontology and semantic similarity

The main objective of the ViSEAGO package is to carry out data mining of biological functions and establish links between genes involved in the study. We developed ViSEAGO in R to facilitate functional Gene Ontology (GO) analysis of complex experimental designs with multiple comparisons of interest. It allows large-scale datasets to be studied together and GO profiles to be visualized to capture biological knowledge. The acronym stands for three major concepts of the analysis: Visualization, Semantic similarity and Enrichment Analysis of Gene Ontology. It provides access to the latest GO annotations, which are retrieved from one of the NCBI EntrezGene, Ensembl or Uniprot databases for several species. Using available R packages and novel developments, ViSEAGO extends classical functional GO analysis to focus on functional coherence by aggregating closely related biological themes while studying multiple datasets at once. It provides both a synthetic and a detailed view using interactive functionalities respecting the GO graph structure and ensuring functional coherence supplied by semantic similarity. ViSEAGO has been successfully applied to several datasets from different species with a variety of biological questions. Results can be easily shared between bioinformaticians and biologists, enhancing reporting capabilities while maintaining reproducibility.

Maintained by Aurelien Brionne. Last updated 2 months ago.

software annotation go genesetenrichment multiplecomparison clustering visualization

31.9 match 6.64 score 22 scripts

alinetalhouk

diceR:Diverse Cluster Ensemble in R

Performs cluster analysis using an ensemble clustering framework, Chiu & Talhouk (2018) <doi:10.1186/s12859-017-1996-y>. Results from a diverse set of algorithms are pooled together using methods such as majority voting, K-Modes, LinkCluE, and CSPA. There are options to compare cluster assignments across algorithms using internal and external indices, visualizations such as heatmaps, and significance testing for the existence of clusters.

Maintained by Derek Chiu. Last updated 1 month ago.

cpp

25.5 match 37 stars 8.13 score 60 scripts 3 dependents

tidymodels

tidyclust:A Common API to Clustering

A common interface to specifying clustering models, in the same style as 'parsnip'. Creates unified interface across different functions and computational engines.

Maintained by Emil Hvitfeldt. Last updated 2 months ago.

27.8 match 111 stars 7.45 score 139 scripts

thibautjombart

adegenet:Exploratory Analysis of Genetic and Genomic Data

Toolset for the exploration of genetic and genomic data. Adegenet provides formal (S4) classes for storing and handling various genetic data, including genetic markers with varying ploidy and hierarchical population structure ('genind' class), allele counts by population ('genpop'), and genome-wide SNP data ('genlight'). It also implements original multivariate methods (DAPC, sPCA), graphics, statistical tests, simulation tools, distance and similarity measures, and several spatial methods. A range of both empirical and simulated datasets is also provided to illustrate various methods.

Maintained by Zhian N. Kamvar. Last updated 1 month ago.

16.3 match 182 stars 12.60 score 1.9k scripts 29 dependents

bioc

flowMatch:Matching and meta-clustering in flow cytometry

Matching cell populations and building meta-clusters and templates from a collection of FC samples.

Maintained by Ariful Azad. Last updated 5 months ago.

immunooncology clustering flowcytometry cpp

52.2 match 3.90 score 1 script

csafe-isu

handwriter:Handwriting Analysis in R

Perform statistical writership analysis of scanned handwritten documents. Webpage provided at: <https://github.com/CSAFE-ISU/handwriter>.

Maintained by Stephanie Reinders. Last updated 1 month ago.

cpp jags

23.3 match 24 stars 8.70 score 27 scripts 2 dependents

bioc

CluMSID:Clustering of MS2 Spectra for Metabolite Identification

CluMSID is a tool that aids the identification of features in untargeted LC-MS/MS analysis by the use of MS2 spectra similarity and unsupervised statistical methods. It offers functions for a complete and customisable workflow from raw data to visualisations and is interfaceable with the xcms family of preprocessing packages.

Maintained by Tobias Depke. Last updated 5 months ago.

metabolomics preprocessing clustering

33.5 match 10 stars 6.04 score 22 scripts

easystats

parameters:Processing of Model Parameters

Utilities for processing the parameters of various statistical models. Beyond computing p values, CIs, and other indices for a wide variety of models (see list of supported models using the function 'insight::supported_models()'), this package implements features like bootstrapping or simulating of parameters and models, feature reduction (feature extraction and variable selection) as well as functions to describe data and variable characteristics (e.g. skewness, kurtosis, smoothness or distribution).

Maintained by Daniel Lüdecke. Last updated 2 days ago.

beta bootstrap ci confidence-intervals data-reduction easystats fa feature-extraction feature-reduction hacktoberfest parameters pca pvalues regression-models robust-statistics standardize standardized-estimates statistical-models

12.8 match 453 stars 15.65 score 1.8k scripts 56 dependents

bioc

scBubbletree:Quantitative visual exploration of scRNA-seq data

scBubbletree is a quantitative method for the visual exploration of scRNA-seq data, preserving key biological properties such as local and global cell distances and cell density distributions across samples. It effectively resolves overplotting and enables the visualization of diverse cell attributes from multiomic single-cell experiments. Additionally, scBubbletree is user-friendly and integrates seamlessly with popular scRNA-seq analysis tools, facilitating comprehensive and intuitive data interpretation.

Maintained by Simo Kitanovski. Last updated 5 months ago.

visualization clustering singlecell transcriptomics rnaseq big-data bigdata scrna-seq scrna-seq-analysis visual visual-exploration

34.5 match 6 stars 5.82 score 8 scripts

bioc

ChromSCape:Analysis of single-cell epigenomics datasets with a Shiny App

ChromSCape - Chromatin landscape profiling for Single Cells - is a ready-to-launch user-friendly Shiny Application for the analysis of single-cell epigenomics datasets (scChIP-seq, scATAC-seq, scCUT&Tag, ...) from aligned data to differential analysis & gene set enrichment analysis. It is highly interactive, enables users to save their analysis and covers a wide range of analytical steps: QC, preprocessing, filtering, batch correction, dimensionality reduction, visualization, clustering, differential analysis and gene set analysis.

Maintained by Pacome Prompsy. Last updated 5 months ago.

shinyapps software singlecell chipseq atacseq methylseq classification clustering epigenetics principalcomponent annotation batcheffect multiplecomparison normalization pathways preprocessing qualitycontrol reportwriting visualization genesetenrichment differentialpeakcalling epigenomics shiny single-cell cpp

34.2 match 14 stars 5.83 score 16 scripts

bioc

GOSemSim:GO-terms Semantic Similarity Measures

The semantic comparisons of Gene Ontology (GO) annotations provide quantitative ways to compute similarities between genes and gene groups, and have become an important basis for many bioinformatics analysis approaches. GOSemSim is an R package for semantic similarity computation among GO terms, sets of GO terms, gene products and gene clusters. GOSemSim implements five methods, proposed by Resnik, Schlicker, Jiang, Lin and Wang respectively.

Maintained by Guangchuang Yu. Last updated 5 months ago.

annotation go clustering pathways network software bioinformatics gene-ontology semantic-similarity cpp

14.1 match 63 stars 14.12 score 708 scripts 68 dependents

bioc

scGPS:A complete analysis of single cell subpopulations, from identifying subpopulations to analysing their relationship (scGPS = single cell Global Predictions of Subpopulation)

The package implements two main algorithms to answer two key questions: a SCORE (Stable Clustering at Optimal REsolution) to find subpopulations, followed by scGPS to investigate the relationships between subpopulations.

Maintained by Quan Nguyen. Last updated 5 months ago.

singlecell clustering dataimport sequencing coverage openblas cpp

38.0 match 4 stars 5.20 score 7 scripts

bioc

FlowSOM:Using self-organizing maps for visualization and interpretation of cytometry data

FlowSOM offers visualization options for cytometry data, by using Self-Organizing Map clustering and Minimal Spanning Trees.

Maintained by Sofie Van Gassen. Last updated 5 months ago.

cellbiology flowcytometry clustering visualization software cellbasedassays

25.2 match 7.71 score 468 scripts 10 dependents

md-anderson-bioinformatics

NGCHM:Next Generation Clustered Heat Maps

Next-Generation Clustered Heat Maps (NG-CHMs) allow for dynamic exploration of heat map data in a web browser. 'NGCHM' allows users to create both stand-alone HTML files containing a Next-Generation Clustered Heat Map, and .ngchm files to view in the NG-CHM viewer. See Ryan MC, Stucky M, et al (2020) <doi:10.12688/f1000research.20590.2> for more details.

Maintained by Mary A Rohrdanz. Last updated 8 days ago.

heatmap nci-itcr ng-chm

35.2 match 9 stars 5.48 score 28 scripts

mllg

batchtools:Tools for Computation on Batch Systems

As a successor of the packages 'BatchJobs' and 'BatchExperiments', this package provides a parallel implementation of the Map function for high performance computing systems managed by schedulers 'IBM Spectrum LSF' (<https://www.ibm.com/products/hpc-workload-management>), 'OpenLava' (<https://www.openlava.org/>), 'Univa Grid Engine'/'Oracle Grid Engine' (<https://www.univa.com/>), 'Slurm' (<https://slurm.schedmd.com/>), 'TORQUE/PBS' (<https://adaptivecomputing.com/cherry-services/torque-resource-manager/>), or 'Docker Swarm' (<https://docs.docker.com/engine/swarm/>). A multicore and socket mode allow parallelization on a local machine, and multiple machines can be hooked up via SSH to create a makeshift cluster. Moreover, the package provides an abstraction mechanism to define large-scale computer experiments in a well-organized and reproducible way.

Maintained by Michel Lang. Last updated 2 years ago.

batchexperiments batchjobs docker-swarm high-performance-computing hpc hpc-clusters lsf openlava parallel-computing reproducibility sge slurm torque

16.5 match 175 stars 11.39 score 772 scripts 14 dependents

bioc

slingshot:Tools for ordering single-cell sequencing

Provides functions for inferring continuous, branching lineage structures in low-dimensional data. Slingshot was designed to model developmental trajectories in single-cell RNA sequencing data and serve as a component in an analysis pipeline after dimensionality reduction and clustering. It is flexible enough to handle arbitrarily many branching events and allows for the incorporation of prior knowledge through supervised graph construction.

Maintained by Kelly Street. Last updated 5 months ago.

clustering differentialexpression geneexpression rnaseq sequencing software singlecell transcriptomics visualization

15.6 match 283 stars 12.01 score 1.0k scripts 4 dependents

bioc

phyloseq:Handling and analysis of high-throughput microbiome census data

phyloseq provides a set of classes and tools to facilitate the import, storage, analysis, and graphical display of microbiome census data.

Maintained by Paul J. McMurdie. Last updated 5 months ago.

immunooncology sequencing microbiome metagenomics clustering classification multiplecomparison geneticvariability

13.4 match 597 stars 13.90 score 8.4k scripts 37 dependents

bioc

FEAST:FEAture SelecTion (FEAST) for Single-cell clustering

Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as “features”), whose expression patterns will then be used for downstream clustering. A good set of features should include the ones that distinguish different cell types, and the quality of such a set could have a significant impact on the clustering accuracy. FEAST is an R library for selecting the most representative features before performing the core of scRNA-seq clustering. It can be used as a plug-in for established clustering algorithms such as SC3, TSCAN, SHARP, SIMLR, and Seurat. The core of the FEAST algorithm includes three steps: 1. consensus clustering; 2. gene-level significance inference; 3. validation of an optimized feature set.

Maintained by Kenong Su. Last updated 5 months ago.

sequencing singlecell clustering featureextraction

31.1 match 10 stars 5.97 score 47 scripts

cran

e1071:Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien

Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, generalized k-nearest neighbour ...

Maintained by David Meyer. Last updated 6 months ago.

cpp

12.8 match 28 stars 14.46 score 19k scripts 2.0k dependents

gagolews

genie:Fast, Robust, and Outlier Resistant Hierarchical Clustering

Includes the reference implementation of Genie - a hierarchical clustering algorithm that links two point groups in such a way that an inequity measure (namely, the Gini index) of the cluster sizes does not significantly increase above a given threshold. This method most often outperforms many other data segmentation approaches in terms of clustering quality as tested on a wide range of benchmark datasets. At the same time, Genie retains the high speed of the single linkage approach, therefore it is also suitable for analysing larger data sets. For more details see (Gagolewski et al. 2016 <DOI:10.1016/j.ins.2016.05.003>). For an even faster and more feature-rich implementation, including, amongst others, noise point detection, see the 'genieclust' package (Gagolewski, 2021 <DOI:10.1016/j.softx.2021.100722>).

Maintained by Marek Gagolewski. Last updated 3 years ago.

cluster cluster-analysis clustering data-analysis data-mining data-science datascience genie hierarchical-clustering-algorithm machine-learning machine-learning-algorithms outliers cpp openmp

40.7 match 22 stars 4.55 score 16 scripts

kisungyou

T4cluster:Tools for Cluster Analysis

Cluster analysis is one of the most fundamental problems in data science. We provide a variety of algorithms, from clustering to learning on the space of partitions. See Hennig, Meila, and Rocci (2016, ISBN:9781466551886) for a general exposition of cluster analysis.

Maintained by Kisung You. Last updated 3 years ago.

openblas cpp openmp

43.1 match 6 stars 4.26 score 9 scripts 2 dependents

bioc

vsclust:Feature-based variance-sensitive quantitative clustering

Feature-based variance-sensitive clustering of omics data. Optimizes cluster assignment by taking into account individual feature variance. Includes several modules for statistical testing, clustering and enrichment analysis.

Maintained by Veit Schwammle. Last updated 2 months ago.

clustering annotation principalcomponent differentialexpression visualization proteomics metabolomics cpp

38.8 match 4.70 score 9 scripts

bioc

DESeq2:Differential gene expression analysis based on the negative binomial distribution

Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution.

Maintained by Michael Love. Last updated 11 days ago.

sequencing rnaseq chipseq geneexpression transcription normalization differentialexpression bayesian regression principalcomponent clustering immunooncology openblas cpp

11.2 match 375 stars 16.11 score 17k scripts 115 dependents

bioc

iClusterPlus:Integrative clustering of multi-type genomic data

Integrative clustering of multiple genomic data using a joint latent variable model.

Maintained by Qianxing Mo. Last updated 4 months ago.

multi-omics clustering fortran openblas

30.9 match 5.76 score 190 scripts

bioc

evaluomeR:Evaluation of Bioinformatics Metrics

Evaluating the reliability of your own metrics and the measurements done on your own datasets by analysing the stability and goodness of the classifications of such metrics.

Maintained by José Antonio Bernabé-Díaz. Last updated 5 months ago.

clustering classification featureextraction assessment clustering-evaluation evaluome evaluomer metrics

36.9 match 4.82 score 33 scripts

bioc

simplifyEnrichment:Simplify Functional Enrichment Results

A new clustering algorithm, "binary cut", for clustering similarity matrices of functional terms is implemented in this package. It also provides functions for visualizing, summarizing and comparing the clusterings.

Maintained by Zuguang Gu. Last updated 5 months ago.

software visualization go clustering genesetenrichment

22.1 match 113 stars 8.02 score 196 scripts

azure

azuremlsdk:Interface to the 'Azure Machine Learning' 'SDK'

Interface to the 'Azure Machine Learning' Software Development Kit ('SDK'). Data scientists can use the 'SDK' to train, deploy, automate, and manage machine learning models on the 'Azure Machine Learning' service. To learn more about 'Azure Machine Learning' visit the website: <https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml>.

Maintained by Diondra Peck. Last updated 3 years ago.

amlcompute azure azure-machine-learning azureml dsi machine-learning rstudio sdk-r

19.9 match 106 stars 8.91 score 221 scripts

husson

FactoMineR:Multivariate Exploratory Data Analysis and Data Mining

Exploratory data analysis methods to summarize, visualize and describe datasets. The main principal component methods are available, those with the largest potential in terms of applications: principal component analysis (PCA) when variables are quantitative, correspondence analysis (CA) and multiple correspondence analysis (MCA) when variables are categorical, Multiple Factor Analysis when variables are structured in groups, etc. and hierarchical cluster analysis. F. Husson, S. Le and J. Pages (2017).

Maintained by Francois Husson. Last updated 3 months ago.

12.0 match 47 stars 14.71 score 5.6k scripts 112 dependents

cbhurley

gclus:Clustering Graphics

Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level.

Maintained by Catherine Hurley. Last updated 6 years ago.

21.4 match 8.23 score 406 scripts 82 dependents

declaredesign

randomizr:Easy-to-Use Tools for Common Forms of Random Assignment and Sampling

Generates random assignments for common experimental designs and random samples for common sampling designs.

Maintained by Alexander Coppock. Last updated 1 month ago.

17.8 match 37 stars 9.90 score 396 scripts 13 dependents

igraph

igraph:Network Analysis and Visualization

Routines for simple graphs and network analysis. It can handle large graphs very well and provides functions for generating random and regular graphs, graph visualization, centrality methods and much more.

Maintained by Kirill Müller. Last updated 2 days ago.

complex-networks graph-algorithms graph-theory mathematics network-analysis network-graph fortran libxml2 glpk openblas cpp

8.3 match 581 stars 21.10 score 31k scripts 1.9k dependents

bioc

seqArchRplus:Downstream analyses of promoter sequence architectures and HTML report generation

seqArchRplus facilitates downstream analyses of promoter sequence architectures/clusters identified by seqArchR (or any other tool/method). With additional available information such as the TPM values and interquantile widths (IQWs) of the CAGE tag clusters, seqArchRplus can order the input promoter clusters by their shape (IQWs), and write the cluster information as browser/IGV track files. Provided visualizations are of two kinds: per sample/stage and per cluster visualizations. Those of the first kind include: plot panels for each sample showing per cluster shape, TPM and other score distributions, sequence logos, and peak annotations. Those of the second kind include per cluster chromosome-wise and strand distributions, motif occurrence heatmaps and GO term enrichments. Additionally, seqArchRplus can also generate HTML reports for easy viewing and comparison of promoter architectures between samples/stages.

Maintained by Sarvesh Nikumbh. Last updated 5 months ago.

annotation visualization reportwriting go motifannotation clustering

43.0 match 1 star 4.00 score 2 scripts

civisanalytics

civis:R Client for the 'Civis Platform API'

A convenient interface for making requests directly to the 'Civis Platform API' <https://www.civisanalytics.com/platform/>. Full documentation available 'here' <https://civisanalytics.github.io/civis-r/>.

Maintained by Peter Cooman. Last updated 2 months ago.

21.6 match 16 stars 7.84 score 144 scripts

bioc

ggtree:an R package for visualization of tree and annotation data

'ggtree' extends the 'ggplot2' plotting system which implemented the grammar of graphics. 'ggtree' is designed for visualization and annotation of phylogenetic trees and other tree-like structures with their annotation data.

Maintained by Guangchuang Yu. Last updated 5 months ago.

alignment annotation clustering dataimport multiplesequencealignment phylogenetics reproducibleresearch software visualization annotations ggplot2 phylogenetic-trees

10.0 match 864 stars 16.86 score 5.1k scripts 109 dependents

zcebeci

odetector:Outlier Detection Using Partitioning Clustering Algorithms

An object is called an "outlier" if it deviates remarkably from the other objects in a data set. Outlier detection is the process of finding outliers by using methods that are based on distance measures, clustering and spatial methods (Ben-Gal, 2005 <ISBN 0-387-24435-2>). It is one of the intensively studied research topics for identification of novelties, frauds, anomalies, deviations or exceptions, in addition to its use for outlier removal in data processing. This package provides implementations of some novel approaches to detect outliers based on typicality degrees that are obtained with soft partitioning clustering algorithms such as Fuzzy C-means and its variants.

Maintained by Zeynel Cebeci. Last updated 2 years ago.

anomaly-detection cluster-analysis clustering clustering-methods data datapreparation datapreprocessing exception-handling fcm fraud-detection fuzzy-clustering novelty-detection outlier-detection outlier-removal outliers partitioning pcm surprise-exploration

45.3 match 3.70 score 4 scripts

mpadge

spatialcluster:R port of redcap

R port of redcap (Regionalization with dynamically constrained agglomerative clustering and partitioning).

Maintained by Mark Padgham. Last updated 2 months ago.

cluster clustering-algorithm spatial cpp

33.7 match 31 stars 4.97 score 1 script

rhenkin

visxhclust:A Shiny App for Visual Exploration of Hierarchical Clustering

A Shiny application and functions for visual exploration of hierarchical clustering with numeric datasets. Allows users to iteratively set hyperparameters, select features and evaluate results through various plots and computation of evaluation criteria.

Maintained by Rafael Henkin. Last updated 2 years ago.

clustering data-analysis data-science r-shiny shiny-apps

34.4 match 4 stars 4.86 score 12 scripts

bioc

MLInterfaces:Uniform interfaces to R machine learning procedures for data in Bioconductor containers

This package provides uniform interfaces to machine learning code for data in R and Bioconductor containers.

Maintained by Vincent Carey. Last updated 5 months ago.

classification clustering

21.9 match 7.63 score 79 scripts 6 dependents

spatstat

spatstat.random:Random Generation Functionality for the 'spatstat' Family

Functionality for random generation of spatial data in the 'spatstat' family of packages. Generates random spatial patterns of points according to many simple rules (complete spatial randomness, Poisson, binomial, random grid, systematic, cell), randomised alteration of patterns (thinning, random shift, jittering), simulated realisations of random point processes including simple sequential inhibition, Matern inhibition models, Neyman-Scott cluster processes (using direct, Brix-Kendall, or hybrid algorithms), log-Gaussian Cox processes, product shot noise cluster processes and Gibbs point processes (using Metropolis-Hastings birth-death-shift algorithm, alternating Gibbs sampler, or coupling-from-the-past perfect simulation). Also generates random spatial patterns of line segments, random tessellations, and random images (random noise, random mosaics). Excludes random generation on a linear network, which is covered by the separate package 'spatstat.linnet'.

Maintained by Adrian Baddeley. Last updated 6 months ago.

point-processes random-generation simulation spatial-sampling spatial-simulation cpp

15.4 match 5 stars 10.77 score 84 scripts 173 dependents

bioc

BioNAR:Biological Network Analysis in R

The R package BioNAR was developed for step-by-step analysis of PPI networks. The aim is to quantify and rank each protein’s simultaneous impact on multiple complexes based on network topology and clustering. The package also enables estimation of the co-occurrence of diseases across the network and specific clusters, pointing towards shared/common mechanisms.

Maintained by Anatoly Sorokin. Last updated 18 days ago.

software graphandnetwork network

28.0 match 3 stars 5.90 score 35 scripts

bioc

DirichletMultinomial:Dirichlet-Multinomial Mixture Model Machine Learning for Microbiome Data

Dirichlet-multinomial mixture models can be used to describe variability in microbial metagenomic data. This package is an interface to code originally made available by Holmes, Harris, and Quince, 2012, PLoS ONE 7(2): 1-15, as discussed further in the man page for this package, ?DirichletMultinomial.

Maintained by Martin Morgan. Last updated 5 months ago.

immunooncology microbiome sequencing clustering classification metagenomics gsl

15.0 match 11 stars 10.97 score 125 scripts 26 dependents

datastorm-open

visNetwork:Network Visualization using 'vis.js' Library

Provides an R interface to the 'vis.js' JavaScript charting library. It allows an interactive visualization of networks.

Maintained by Benoit Thieurmel. Last updated 2 years ago.

10.8 match 549 stars 15.14 score 4.1k scripts 195 dependents

jchiquet

aricode:Efficient Computations of Standard Clustering Comparison Measures

Implements an efficient O(n) algorithm based on bucket-sorting for fast computation of standard clustering comparison measures. Available measures include adjusted Rand index (ARI), normalized information distance (NID), normalized mutual information (NMI), adjusted mutual information (AMI), normalized variation information (NVI) and entropy, as described in Vinh et al (2009) <doi:10.1145/1553374.1553511>. Includes AMI (Adjusted Mutual Information) since version 0.1.2, and a modified version of ARI (MARI), as described in Sundqvist et al. <doi:10.1007/s00180-022-01230-7>, and simple Chi-square distance since version 1.0.0.

Maintained by Julien Chiquet. Last updated 1 year ago.

bucket-sort clustering clustering-comparison-measures cpp

20.1 match 25 stars 8.15 score 542 scripts 14 dependents
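
To make the adjusted Rand index mentioned in the aricode entry concrete, here is a minimal Python sketch of the standard pair-counting formula. It is not the package's bucket-sort implementation; the function name `adjusted_rand_index` is hypothetical, and hash-based counting stands in for the bucket sort.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items.

    Illustrative only: counts are gathered with hash maps (Counter),
    not with the bucket sort described in the aricode entry.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on (a convention chosen here)

    # Contingency counts: how many items share each (cluster_a, cluster_b) pair.
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # expected agreement under chance
    maximum = (sum_a + sum_b) / 2          # upper bound on the agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings have one cluster
    return (sum_joint - expected) / (maximum - expected)
```

The index is invariant to renaming the clusters, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` gives 1.0, while labelings that agree no better than chance score close to 0.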

bioc

M3C:Monte Carlo Reference-based Consensus Clustering

M3C is a consensus clustering algorithm that uses a Monte Carlo simulation to eliminate overestimation of K and can reject the null hypothesis K=1.

Maintained by Christopher John. Last updated 5 months ago.

clustering geneexpression transcription rnaseq sequencing immunooncology

24.7 match 6.59 score 174 scripts 1 dependent

comeetie

greed:Clustering and Model Selection with the Integrated Classification Likelihood

An ensemble of algorithms that enable the clustering of networks and data matrices (such as counts, categorical or continuous) with different types of generative models. Model selection and clustering are performed in combination by optimizing the Integrated Classification Likelihood (which is equivalent to minimizing the description length). Several models are available such as: Stochastic Block Model, degree corrected Stochastic Block Model, Mixtures of Multinomial, Latent Block Model. The optimization is performed thanks to a combination of greedy local search and a genetic algorithm (see <arXiv:2002:11577> for more details).

Maintained by Etienne Côme. Last updated 2 years ago.

openblas cpp openmp

27.3 match 14 stars 5.94 score 41 scripts

mhahsler

rEMM:Extensible Markov Model for Modelling Temporal Relationships Between Clusters

Implements TRACDS (Temporal Relationships between Clusters for Data Streams), a generalization of Extensible Markov Model (EMM). TRACDS adds a temporal or order model to data stream clustering by superimposing a dynamically adapting Markov Chain. Also provides an implementation of EMM (TRACDS on top of tNN data stream clustering). Development of this package was supported in part by NSF IIS-0948893 and R21HG005912 from the National Human Genome Research Institute. Hahsler and Dunham (2010) <doi:10.18637/jss.v035.i05>.

Maintained by Michael Hahsler. Last updated 7 months ago.

clustering data-stream sequence-analysis

33.8 match 2 stars 4.79 score 31 scripts

biorgeo

bioregion:Comparison of Bioregionalisation Methods

The main purpose of this package is to propose a transparent methodological framework to compare bioregionalisation methods based on hierarchical and non-hierarchical clustering algorithms (Kreft & Jetz (2010) <doi:10.1111/j.1365-2699.2010.02375.x>) and network algorithms (Lenormand et al. (2019) <doi:10.1002/ece3.4718> and Leroy et al. (2019) <doi:10.1111/jbi.13674>).

Maintained by Maxime Lenormand. Last updated 10 days ago.

biogeography bioregion bioregionalization cpp

25.7 match 7 stars 6.27 score 11 scripts

cleanzr

clevr:Clustering and Link Prediction Evaluation in R

Tools for evaluating link prediction and clustering algorithms with respect to ground truth. Includes efficient implementations of common performance measures such as pairwise precision/recall, cluster homogeneity/completeness, variation of information, Rand index etc.

Maintained by Neil Marchant. Last updated 1 year ago.

clustering-evaluation entity-resolution evaluation-metrics link-prediction record-linkage cpp

33.6 match 12 stars 4.77 score 49 scripts

egenn

rtemis:Machine Learning and Visualization

Advanced Machine Learning and Visualization. Unsupervised Learning (Clustering, Decomposition), Supervised Learning (Classification, Regression), Cross-Decomposition, Bagging, Boosting, Meta-models. Static and interactive graphics.

Maintained by E.D. Gennatas. Last updated 1 month ago.

data-science data-visualization machine-learning machine-learning-library visualization

22.4 match 145 stars 7.09 score 50 scripts 2 dependents

chrhennig

prabclus:Functions for Clustering and Testing of Presence-Absence, Abundance and Multilocus Genetic Data

Distance-based parametric bootstrap tests for clustering with spatial neighborhood information. Some distance measures, clustering of presence-absence, abundance and multilocus genetic data for species delimitation, nearest neighbor based noise detection. Genetic distances between communities. Tests whether various distance-based regressions are equal. Try package?prabclus for an overview.

Maintained by Christian Hennig. Last updated 6 months ago.

26.4 match 1 star 5.99 score 90 scripts 71 dependents

bioc

cola:A Framework for Consensus Partitioning

Subgroup classification is a basic task in genomic data analysis, especially for gene expression and DNA methylation data analysis. It can also be used to test the agreement to known clinical annotations, or to test whether there exist significant batch effects. The cola package provides a general framework for subgroup classification by consensus partitioning. It has the following features: 1. It modularizes the consensus partitioning processes so that various methods can be easily integrated. 2. It provides rich visualizations for interpreting the results. 3. It allows running multiple methods at the same time and provides functionalities to compare results straightforwardly. 4. It provides a new method to extract features which are more efficient to separate subgroups. 5. It automatically generates detailed reports for the complete analysis. 6. It allows applying consensus partitioning in a hierarchical manner.

Maintained by Zuguang Gu. Last updated 1 month ago.

clustering geneexpression classification software consensus-clustering cpp

21.1 match 61 stars 7.49 score 112 scripts

chrismuir

refinr:Cluster and Merge Similar Values Within a Character Vector

These functions take a character vector as input, identify and cluster similar values, and then merge clusters together so their values become identical. The functions are an implementation of the key collision and ngram fingerprint algorithms from the open source tool Open Refine <https://openrefine.org/>. More info on key collision and ngram fingerprint can be found here <https://openrefine.org/docs/technical-reference/clustering-in-depth>.

Maintained by Chris Muir. Last updated 1 year ago.

approximate-string-matching clustering data-cleaning data-clustering fuzzy-matching ngram openrefine cpp

23.0 match 104 stars 6.80 score 121 scripts
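
As a rough illustration of the key collision idea named in the refinr entry, the Python sketch below builds an OpenRefine-style fingerprint key for each value and merges values whose keys collide. It is an approximation, not refinr's code: the names `fingerprint` and `key_collision_merge` are hypothetical, and replacing each cluster with its most frequent member is a merging rule assumed here.

```python
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value):
    """Build a normalised key: trim, lowercase, strip accents and
    punctuation, then sort and de-duplicate whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def key_collision_merge(values):
    """Replace every value with the most frequent member of its key cluster.

    Assumption: ties are broken by first occurrence; the real package
    may choose the representative differently.
    """
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    representative = {
        key: Counter(members).most_common(1)[0][0]
        for key, members in clusters.items()
    }
    return [representative[fingerprint(value)] for value in values]
```

For example, `key_collision_merge(["Acme Inc.", "acme inc", "Inc. Acme", "Acme Inc.", "Widget Co"])` maps the first four values to `"Acme Inc."` and leaves `"Widget Co"` unchanged, because the first four share the key `"acme inc"`.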

bioc

tradeSeq:trajectory-based differential expression analysis for sequencing data

tradeSeq provides a flexible method for fitting regression models that can be used to find genes that are differentially expressed along one or multiple lineages in a trajectory. Based on the fitted models, it uses a variety of tests suited to answer different questions of interest, e.g. the discovery of genes for which expression is associated with pseudotime, or which are differentially expressed (in a specific region) along the trajectory. It fits a negative binomial generalized additive model (GAM) for each gene, and performs inference on the parameters of the GAM.

Maintained by Hector Roux de Bezieux. Last updated 5 months ago.

clustering regression timecourse differentialexpression geneexpression rnaseq sequencing software singlecell transcriptomics multiplecomparison visualization

15.5 match 247 stars 10.06 score 440 scripts

mhahsler

seriation:Infrastructure for Ordering Objects Using Seriation

Infrastructure for ordering objects with an implementation of several seriation/sequencing/ordination techniques to reorder matrices, dissimilarity matrices, and dendrograms. Also provides (optimally) reordered heatmaps, color images and clustering visualizations like dissimilarity plots, and visual assessment of cluster tendency plots (VAT and iVAT). Hahsler et al (2008) <doi:10.18637/jss.v025.i03>.

Maintained by Michael Hahsler. Last updated 3 months ago.

combinatorial-optimization ordination seriation fortran

11.1 match 77 stars 14.07 score 640 scripts 79 dependents

bioc

MetaNeighbor:Single cell replicability analysis

MetaNeighbor allows users to quantify cell type replicability across datasets using neighbor voting.

Maintained by Stephan Fischer. Last updated 5 months ago.

immunooncology geneexpression go multiplecomparison singlecell transcriptomics

26.3 match 5.89 score 78 scripts

rpatin

segclust2d:Bivariate Segmentation/Clustering Methods and Tools

Provides two methods for segmentation and joint segmentation/clustering of bivariate time-series. Originally intended for ecological segmentation (home-range and behavioural modes) but easily applied to other series, the package also provides tools for analysing outputs from R packages 'moveHMM' and 'marcher'. The segmentation method is a bivariate extension of Lavielle's method available in 'adehabitatLT' (Lavielle, 1999 <doi:10.1016/S0304-4149(99)00023-X> and 2005 <doi:10.1016/j.sigpro.2005.01.012>). This method relies on dynamic programming for efficient segmentation. The segmentation/clustering method alternates steps of dynamic programming with an Expectation-Maximization algorithm. This is an extension of the method of Picard et al (2007) <doi:10.1111/j.1541-0420.2006.00729.x> (formerly available in the 'cghseg' package) to the bivariate case. The method is fully described in Patin et al (2018) <doi:10.1101/444794>.

Maintained by Remi Patin. Last updated 11 months ago.

cpp

28.1 match 7 stars 5.50 score 30 scripts

swarm-lab

CEC:Cross-Entropy Clustering

Splits data into Gaussian type clusters using the Cross-Entropy Clustering ('CEC') method. This method allows for the simultaneous use of various types of Gaussian mixture models, for performing the reduction of unnecessary clusters, and for discovering new clusters by splitting them. 'CEC' is based on the work of Spurek, P. and Tabor, J. (2014) <doi:10.1016/j.patcog.2014.03.006>.

Maintained by Simon Garnier. Last updated 5 months ago.

clustering cross-entropy openblas cpp

36.2 match 10 stars 4.26 score 18 scripts

joemsong

OptCirClust:Circular, Periodic, or Framed Data Clustering: Fast, Optimal, and Reproducible

Fast, optimal, and reproducible clustering algorithms for circular, periodic, or framed data. The algorithms introduced here are based on a core algorithm for optimal framed clustering the authors have developed (Debnath & Song 2021) <doi:10.1109/TCBB.2021.3077573>. The runtime of these algorithms is O(K N log^2 N), where K is the number of clusters and N is the number of circular data points. On a desktop computer using a single processor core, millions of data points can be grouped into a few clusters within seconds. One can apply the algorithms to characterize events along circular DNA molecules, circular RNA molecules, and circular genomes of bacteria, chloroplast, and mitochondria. One can also cluster climate data along any given longitude or latitude. Periodic data clustering can be formulated as circular clustering. The algorithms offer a general high-performance solution to circular, periodic, or framed data clustering.

Maintained by Joe Song. Last updated 4 years ago.

cpp

34.8 match 4.42 score 22 scripts 2 dependents

bioc

Cardinal:A mass spectrometry imaging toolbox for statistical analysis

Implements statistical & computational tools for analyzing mass spectrometry imaging datasets, including methods for efficient pre-processing, spatial segmentation, and classification.

Maintained by Kylie Ariel Bemis. Last updated 3 months ago.

software infrastructure proteomics lipidomics massspectrometry imagingmassspectrometry immunooncology normalization clustering classification regression

14.8 match 47 stars 10.34 score 200 scripts

ethanyxu

ADPclust:Fast Clustering Using Adaptive Density Peak Detection

An implementation of ADPclust clustering procedures (Fast Clustering Using Adaptive Density Peak Detection). The work builds and improves upon the idea of Rodriguez and Laio (2014) <DOI:10.1126/science.1242072>. ADPclust clusters data by finding density peaks in a density-distance plot generated from local multivariate Gaussian density estimation. It includes an automatic centroid selection and parameter optimization algorithm, which finds the number of clusters and cluster centroids by comparing average silhouettes on a grid of testing clustering results; it also includes a user-interactive algorithm that allows the user to manually select cluster centroids from a two-dimensional "density-distance plot". Here is the research article associated with this package: Wang, Xiao-Feng, and Yifan Xu (2015) <DOI:10.1177/0962280215609948> "Fast clustering using adaptive density peak detection." Statistical Methods in Medical Research. url: <http://smm.sagepub.com/content/early/2015/10/15/0962280215609948.abstract>.

Maintained by Ethan Yifan Xu. Last updated 3 years ago.

28.4 match 10 stars 5.34 score 44 scripts

plangfelder

WGCNA:Weighted Correlation Network Analysis

Functions necessary to perform Weighted Correlation Network Analysis on high-dimensional data as originally described in Horvath and Zhang (2005) <doi:10.2202/1544-6115.1128> and Langfelder and Horvath (2008) <doi:10.1186/1471-2105-9-559>. Includes functions for rudimentary data cleaning, construction of correlation networks, module identification, summarization, and relating of variables and modules to sample traits. Also includes a number of utility functions for data manipulation and visualization.

Maintained by Peter Langfelder. Last updated 6 months ago.

cpp

15.7 match 54 stars 9.65 score 5.3k scripts 32 dependents

dgrun

RaceID:Identification of Cell Types, Inference of Lineage Trees, and Prediction of Noise Dynamics from Single-Cell RNA-Seq Data

Application of 'RaceID' allows inference of cell types and prediction of lineage trees by the 'StemID2' algorithm (Herman, J.S., Sagar, Grun D. (2018) <DOI:10.1038/nmeth.4662>). 'VarID2' is part of this package and allows quantification of biological gene expression noise at single-cell resolution (Rosales-Alvarez, R.E., Rettkowski, J., Herman, J.S., Dumbovic, G., Cabezas-Wallscheid, N., Grun, D. (2023) <DOI:10.1186/s13059-023-02974-1>).

Maintained by Dominic Grün. Last updated 4 months ago.

cpp

32.0 match 4.74 score 110 scripts

astamm

fdacluster:Joint Clustering and Alignment of Functional Data

Implementations of the k-means, hierarchical agglomerative and DBSCAN clustering methods for functional data which allow for jointly aligning and clustering curves. It supports functional data defined on one-dimensional domains but possibly taking values in multivariate codomains. It supports functional data defined in arrays but also via the 'fd' and 'funData' classes for functional data defined in the 'fda' and 'funData' packages respectively. It currently supports shift, dilation and affine warping functions for functional data defined on the real line and uses the SRVF framework to handle boundary-preserving warping for functional data defined on a specific interval. Main reference for the k-means algorithm: Sangalli L.M., Secchi P., Vantini S., Vitelli V. (2010) "k-mean alignment for curve clustering" <doi:10.1016/j.csda.2009.12.008>. Main reference for the SRVF framework: Tucker, J. D., Wu, W., & Srivastava, A. (2013) "Generative models for functional data using phase and amplitude separation" <doi:10.1016/j.csda.2012.12.001>.

Maintained by Aymeric Stamm. Last updated 2 months ago.

openblas cpp openmp

24.6 match 5 stars 6.14 score 31 scripts 1 dependents

bioc

mobileRNA:mobileRNA: Investigate the RNA mobilome & population-scale changes

Genomic analysis can be utilised to identify differences between RNA populations in two conditions, both in production and abundance. This includes the identification of RNAs produced by multiple genomes within a biological system. For example, RNA produced by pathogens within a host or mobile RNAs in plant graft systems. The mobileRNA package provides methods to pre-process, analyse and visualise the sRNA and mRNA populations based on the premise of mapping reads to all genotypes at the same time.

Maintained by Katie Jeynes-Cupper. Last updated 5 months ago.

visualization rnaseq sequencing smallrna genomeassembly clustering experimentaldesign qualitycontrol workflowstep alignment preprocessing bioinformatics plant-science

30.0 match 4 stars 5.00 score 2 scripts

cran

epiR:Tools for the Analysis of Epidemiological Data

Tools for the analysis of epidemiological and surveillance data. Contains functions for directly and indirectly adjusting measures of disease frequency, quantifying measures of association on the basis of single or multiple strata of count data presented in a contingency table, computation of confidence intervals around incidence risk and incidence rate estimates and sample size calculations for cross-sectional, case-control and cohort studies. Surveillance tools include functions to calculate an appropriate sample size for 1- and 2-stage representative freedom surveys, functions to estimate surveillance system sensitivity and functions to support scenario tree modelling analyses.

Maintained by Mark Stevenson. Last updated 2 months ago.

18.3 match 10 stars 8.18 score 10 dependents

cran

flexclust:Flexible Cluster Algorithms

The main function kcca implements a general framework for k-centroids cluster analysis supporting arbitrary distance measures and centroid computation. Further cluster methods include hard competitive learning, neural gas, and QT clustering. There are numerous visualization methods for cluster results (neighborhood graphs, convex cluster hulls, barcharts of centroids, ...), and bootstrap methods for the analysis of cluster stability.

Maintained by Bettina Grün. Last updated 16 days ago.

25.6 match 3 stars 5.81 score 52 dependents

bioc

cyanoFilter:Phytoplankton Population Identification using Cell Pigmentation and/or Complexity

An approach to filter out and/or identify phytoplankton cells from all particles measured via flow cytometry, using pigment and cell complexity information. It does this using a sequence of one-dimensional gates on pre-defined channels measuring certain pigmentation and complexity. The package is especially tuned for cyanobacteria, but will work fine for phytoplankton communities where there is at least one cell characteristic that differentiates every phytoplankton in the community.

Maintained by Oluwafemi Olusoji. Last updated 5 months ago.

flowcytometry clustering onechannel

34.6 match 4.30 score 4 scripts

noramvillanueva

clustcurv:Determining Groups in Multiple Curves

A method for determining groups in multiple curves with an automatic selection of their number based on k-means or k-medians algorithms. The selection of the optimal number is provided by bootstrap methods. The methodology can be applied both in regression and survival framework. Implemented methods are: Grouping multiple survival curves described by Villanueva et al. (2018) <doi:10.1002/sim.8016>.

Maintained by Nora M. Villanueva. Last updated 4 months ago.

clustering data-analytics machinelearning multiple-curves nonparametric-statistics number-of-clusters regression survival-analysis

26.9 match 3 stars 5.53 score 38 scripts

bioc

Melissa:Bayesian clustering and imputation of single cell methylomes

Melissa is a Bayesian probabilistic model for jointly clustering and imputing single cell methylomes. This is done by taking into account local correlations via a Generalised Linear Model approach and global similarities using a mixture modelling approach.

Maintained by C. A. Kapourani. Last updated 5 months ago.

immunooncology dnamethylation geneexpression generegulation epigenetics genetics clustering featureextraction regression rnaseq bayesian kegg sequencing coverage singlecell

30.3 match 4.90 score 7 scripts

harrelfe

Hmisc:Harrell Miscellaneous

Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, simulation, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX and html code, recoding variables, caching, simplified parallel computing, encrypting and decrypting data using a safe workflow, general moving window statistical estimation, and assistance in interpreting principal component analysis.

Maintained by Frank E Harrell Jr. Last updated 2 days ago.

fortran

8.4 match 210 stars 17.61 score 17k scripts 750 dependents

cran

clusterability:Performs Tests for Cluster Tendency of a Data Set

Test for cluster tendency (clusterability) of a data set. The methods implemented - reducing the data set to a single dimension using principal component analysis or computing pairwise distances, and performing a multimodality test like the Dip Test or Silverman's Critical Bandwidth Test - are described in Adolfsson, Ackerman, and Brownstein (2019) <doi:10.1016/j.patcog.2018.10.026>. Such methods can inform whether clustering algorithms are appropriate for a data set.

Maintained by Zachariah Neville. Last updated 5 years ago.

73.2 match 2.02 score 21 scripts

cran

inaparc:Initialization Algorithms for Partitioning Cluster Analysis

Partitioning clustering algorithms divide data sets into k subsets or partitions, so-called clusters. They require some initialization procedures for starting the algorithms. Initialization of cluster prototypes is one such procedure for most of the partitioning algorithms. Cluster prototypes are the centers of clusters, i.e. centroids or medoids, representing the clusters in a data set. In order to initialize cluster prototypes, the package 'inaparc' contains a set of functions that implement several linear time-complexity and loglinear time-complexity methods in addition to some novel techniques. Initialization of fuzzy membership degrees matrices is another important task for starting the probabilistic and possibilistic partitioning algorithms. In order to initialize membership degrees matrices required by these algorithms, a number of functions based on some traditional and novel initialization techniques are also available in the package 'inaparc'.

Maintained by Zeynel Cebeci. Last updated 3 years ago.

54.3 match 2.69 score 33 scripts 5 dependents

bioc

DepecheR:Determination of essential phenotypic elements of clusters in high-dimensional entities

The purpose of this package is to identify traits in a dataset that can separate groups. This is done on two levels. First, clustering is performed, using an implementation of sparse K-means. Secondly, the generated clusters are used to predict outcomes of groups of individuals based on their distribution of observations in the different clusters. As certain clusters with separating information will be identified, and these clusters are defined by a sparse number of variables, this method can reduce the complexity of data, to only emphasize the data that actually matters.

Maintained by Jakob Theorell. Last updated 5 months ago.

software cellbasedassays transcription differentialexpression datarepresentation immunooncology transcriptomics classification clustering dimensionreduction featureextraction flowcytometry rnaseq singlecell visualization cpp

28.2 match 5.18 score 15 scripts

bioc

omada:Machine learning tools for automated transcriptome clustering analysis

Symptomatic heterogeneity in complex diseases reveals differences in molecular states that need to be investigated. However, selecting the numerous parameters of an exploratory clustering analysis in RNA profiling studies requires deep understanding of machine learning and extensive computational experimentation. Tools that assist with such decisions without prior field knowledge are nonexistent and further gene association analyses need to be performed independently. We have developed a suite of tools to automate these processes and make robust unsupervised clustering of transcriptomic data more accessible through automated machine learning based functions. The efficiency of each tool was tested with four datasets characterised by different expression signal strengths. Our toolkit’s decisions reflected the real number of stable partitions in datasets where the subgroups are discernible. Even in datasets with less clear biological distinctions, stable subgroups with different expression profiles and clinical associations were found.

Maintained by Sokratis Kariotis. Last updated 5 months ago.

software clustering rnaseq geneexpression

40.3 match 3.60 score 5 scripts

ropensci

treeio:Base Classes and Functions for Phylogenetic Tree Input and Output

'treeio' is an R package to make it easier to import and store phylogenetic trees with associated data, and to link external data from different sources to phylogeny. It also supports exporting phylogenetic trees with heterogeneous associated data to a single tree file and can serve as a platform for merging trees with associated data and converting file formats.

Maintained by Guangchuang Yu. Last updated 5 months ago.

software annotation clustering dataimport datarepresentation alignment multiplesequencealignment phylogenetics exporter parser phylogenetic-trees

11.6 match 102 stars 12.46 score 1.3k scripts 122 dependents

hiweller

recolorize:Color-Based Image Segmentation

Automatic, semi-automatic, and manual functions for generating color maps from images. The idea is to simplify the colors of an image according to a metric that is useful for the user, using deterministic methods whenever possible. Many images will be clustered well using the out-of-the-box functions, but the package also includes a toolbox of functions for making manual adjustments (layer merging/isolation, blurring, fitting to provided color clusters or those from another image, etc). Also includes export methods for other color/pattern analysis packages (pavo, patternize, colordistance).

Maintained by Hannah Weller. Last updated 13 days ago.

18.7 match 39 stars 7.68 score 87 scripts

csafe-isu

handwriterRF:Handwriting Analysis with Random Forests

Perform forensic handwriting analysis of two scanned handwritten documents. This package implements the statistical method described by Madeline Johnson and Danica Ommen (2021) <doi:10.1002/sam.11566>. Similarity measures and a random forest produce a score-based likelihood ratio that quantifies the strength of the evidence in favor of the documents being written by the same writer or different writers.

Maintained by Stephanie Reinders. Last updated 8 days ago.

jags cpp

23.2 match 2 stars 6.18 score 15 scripts 1 dependents

thomasp85

densityClust:Clustering by Fast Search and Find of Density Peaks

An improved implementation (based on k-nearest neighbors) of the density peak clustering algorithm, originally described by Alex Rodriguez and Alessandro Laio (Science, 2014 vol. 344). It can handle large datasets (> 100,000 samples) very efficiently. It was initially implemented by Thomas Lin Pedersen, with inputs from Sean Hughes and later improved by Xiaojie Qiu to handle large datasets with kNNs.

Maintained by Thomas Lin Pedersen. Last updated 1 year ago.

cpp

20.0 match 153 stars 7.14 score 75 scripts
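
The density peak idea of Rodriguez and Laio rests on two per-point quantities: a local density and the distance to the nearest point of higher density. The sketch below computes both in plain Python with a brute-force distance matrix. It is only an illustration under stated assumptions, not densityClust's kNN-based implementation: the function name `density_peaks`, the `cutoff` parameter, and the toy data are hypothetical, and ties in density are not handled specially.

```python
import math


def density_peaks(points, cutoff):
    """Return (rho, delta) for each point, using a hard distance cutoff."""
    n = len(points)
    # Brute-force pairwise Euclidean distances (fine for small examples only).
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho: number of other points closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to all points with strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # The densest point has no such neighbour; use its largest distance.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Two well-separated toy groups; candidate centers combine high rho and high delta.
pts = [(0, 0), (1, 0), (-1, 0), (10, 0), (11, 0), (9, 0), (10, 1.2)]
rho, delta = density_peaks(pts, cutoff=1.5)
centers = sorted(range(len(pts)), key=lambda i: delta[i], reverse=True)[:2]
```

In this toy data the two points with the largest delta are the middle point of each group, which matches the intuition that cluster centers are dense points far from any denser point.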

samhforbes

eyetrackingR:Eye-Tracking Data Analysis

Addresses tasks along the pipeline from raw data to analysis and visualization for eye-tracking data. Offers several popular types of analyses, including linear and growth curve time analyses, onset-contingent reaction time analyses, as well as several non-parametric bootstrapping approaches. For references to the approach see Mirman, Dixon & Magnuson (2008) <doi:10.1016/j.jml.2007.11.006>, and Barr (2008) <doi:10.1016/j.jml.2007.09.002>.

Maintained by Samuel Forbes. Last updated 2 years ago.

18.1 match 22 stars 7.84 score 60 scripts

kharchenkolab

pagoda2:Single Cell Analysis and Differential Expression

Analyzing and interactively exploring large-scale single-cell RNA-seq datasets. 'pagoda2' primarily performs normalization and differential gene expression analysis, with an interactive application for exploring single-cell RNA-seq datasets. It performs basic tasks such as cell size normalization, gene variance normalization, and can be used to identify subpopulations and run differential expression within individual samples. 'pagoda2' was written to rapidly process modern large-scale scRNAseq datasets of approximately 1e6 cells. The companion web application allows users to explore which gene expression patterns form the different subpopulations within their data. The package also serves as the primary method for preprocessing data for conos, <https://github.com/kharchenkolab/conos>. This package interacts with data available through the 'p2data' package, which is available in a 'drat' repository. To access this data package, see the instructions at <https://github.com/kharchenkolab/pagoda2>. The size of the 'p2data' package is approximately 6 MB.

Maintained by Evan Biederstedt. Last updated 1 year ago.

scrna-seq single-cell single-cell-rna-seq transcriptomics openblas cpp openmp

17.7 match 222 stars 8.00 score 282 scripts

bioc

BPRMeth:Model higher-order methylation profiles

The BPRMeth package is a probabilistic method to quantify explicit features of methylation profiles, in a way that would make it easier to formally use such profiles in downstream modelling efforts, such as predicting gene expression levels or clustering genomic regions or cells according to their methylation profiles.

Maintained by Chantriolnt-Andreas Kapourani. Last updated 5 months ago.

immunooncology dnamethylation geneexpression generegulation epigenetics genetics clustering featureextraction regression rnaseq bayesian kegg sequencing coverage singlecell openblas cpp

24.6 match 5.75 score 94 scripts 1 dependents

bioc

InterCellar:InterCellar: an R-Shiny app for interactive analysis and exploration of cell-cell communication in single-cell transcriptomics

InterCellar is implemented as an R/Bioconductor Package containing a Shiny app that allows users to interactively analyze cell-cell communication from scRNA-seq data. Starting from precomputed ligand-receptor interactions, InterCellar provides filtering options, annotations and multiple visualizations to explore clusters, genes and functions. Finally, based on functional annotation from Gene Ontology and pathway databases, InterCellar implements data-driven analyses to investigate cell-cell communication in one or multiple conditions.

Maintained by Marta Interlandi. Last updated 5 months ago.

software singlecell visualization go transcriptomics

28.4 match 9 stars 4.95 score 7 scripts

seborinos

NCutYX:Clustering of Omics Data of Multiple Types with a Multilayer Network Representation

Omics data come in different forms: gene expression, methylation, copy number, protein measurements and more. 'NCutYX' allows clustering of variables, of samples, and both variables and samples (biclustering), while incorporating the dependencies across multiple types of Omics data. (SJ Teran Hidalgo et al (2017), <doi:10.1186/s12864-017-3990-1>).

Maintained by Sebastian J. Teran Hidalgo. Last updated 7 years ago.

c-plus-plus cancer-genomics clustering copy-number-variation devtools gene-expression graph-algorithms graph-cut graphs proteins rcpp cpp

31.3 match 4 stars 4.48 score 15 scripts

kharchenkolab

conos:Clustering on Network of Samples

Wires together large collections of single-cell RNA-seq datasets, which allows for both the identification of recurrent cell clusters and the propagation of information between datasets in multi-sample or atlas-scale collections. 'Conos' focuses on the uniform mapping of homologous cell types across heterogeneous sample collections. For instance, users could investigate a collection of dozens of peripheral blood samples from cancer patients combined with dozens of controls, which perhaps includes samples of a related tissue such as lymph nodes. This package interacts with data available through the 'conosPanel' package, which is available in a 'drat' repository. To access this data package, see the instructions at <https://github.com/kharchenkolab/conos>. The size of the 'conosPanel' package is approximately 12 MB.

Maintained by Evan Biederstedt. Last updated 1 year ago.

batch-correction scrna-seq single-cell-rna-seq openblas cpp openmp

19.1 match 204 stars 7.32 score 258 scripts

stan-dev

rstanarm:Bayesian Applied Regression Modeling via Stan

Estimates previously compiled regression models using the 'rstan' package, which provides the R interface to the Stan C++ library for Bayesian estimation. Users specify models via the customary R syntax with a formula and data.frame plus some additional arguments for priors.

Maintained by Ben Goodrich. Last updated 9 months ago.

bayesian bayesian-data-analysis bayesian-inference bayesian-methods bayesian-statistics multilevel-models rstan rstanarm stan statistical-modeling cpp

8.9 match 393 stars 15.65 score 5.0k scripts 12 dependents

bioc

CDI:Clustering Deviation Index (CDI)

Single-cell RNA-sequencing (scRNA-seq) is widely used to explore cellular variation. The analysis of scRNA-seq data often starts from clustering cells into subpopulations. This initial step has a high impact on downstream analyses, and hence it is important to be accurate. However, there has been no unsupervised metric designed for scRNA-seq to evaluate clustering performance. Hence, we propose clustering deviation index (CDI), an unsupervised metric based on the modeling of scRNA-seq UMI counts to evaluate clustering of cells.

Maintained by Jiyuan Fang. Last updated 5 months ago.

singlecell software clustering visualization sequencing rnaseq cellbasedassays

27.7 match 5 stars 5.00 score 4 scripts

bioc

csaw:ChIP-Seq Analysis with Windows

Detection of differentially bound regions in ChIP-seq data with sliding windows, with methods for normalization and proper FDR control.

Maintained by Aaron Lun. Last updated 2 months ago.

multiplecomparison chipseq normalization sequencing coverage genetics annotation differentialpeakcalling curl bzip2 xz-utils zlib cpp

16.6 match 8.32 score 498 scripts 7 dependents

bioc

limma:Linear Models for Microarray and Omics Data

Data analysis, linear models and differential expression for omics data.

Maintained by Gordon Smyth. Last updated 5 days ago.

exonarray geneexpression transcription alternativesplicing differentialexpression differentialsplicing genesetenrichment dataimport bayesian clustering regression timecourse microarray micrornaarray mrnamicroarray onechannel proprietaryplatforms twochannel sequencing rnaseq batcheffect multiplecomparison normalization preprocessing qualitycontrol biomedicalinformatics cellbiology cheminformatics epigenetics functionalgenomics genetics immunooncology metabolomics proteomics systemsbiology transcriptomics

10.0 match 13.81 score 16k scripts 585 dependents

bioc

Linnorm:Linear model and normality based normalization and transformation method (Linnorm)

Linnorm is an algorithm for normalizing and transforming RNA-seq, single cell RNA-seq, ChIP-seq count data or any large scale count data. It has been independently reviewed by Tian et al. in Nature Methods (https://doi.org/10.1038/s41592-019-0425-8). Linnorm can work with raw count, CPM, RPKM, FPKM and TPM.

Maintained by Shun Hang Yip. Last updated 5 months ago.

immunooncology sequencing chipseq rnaseq differentialexpression geneexpression genetics normalization software transcription batcheffect peakdetection clustering network singlecell cpp

21.9 match 6.26 score 61 scripts 5 dependents

cbg-ethz

clustNet:Network-Based Clustering

Network-based clustering using a Bayesian network mixture model with optional covariate adjustment.

Maintained by Fritz Bayer. Last updated 1 year ago.

bayesian-network bayesian-networks clustering dag genomics mixture-model network-clustering

26.5 match 7 stars 5.16 score 41 scripts

bioc

SGCP:SGCP: A semi-supervised pipeline for gene clustering using self-training approach in gene co-expression networks

SGC is a semi-supervised pipeline for gene clustering in gene co-expression networks. SGC consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner. But unlike all existing frameworks, it further incorporates a novel step that leverages Gene Ontology information in a semi-supervised clustering method that further improves the quality of the computed modules.

Maintained by Niloofar AghaieAbiane. Last updated 5 months ago.

geneexpression genesetenrichment networkenrichment systemsbiology classification clustering dimensionreduction graphandnetwork neuralnetwork network mrnamicroarray rnaseq visualization bioinformatics genecoexpressionnetwork graphs networkclustering networks self-training semi-supervised-learning unsupervised-learning

26.7 match 2 stars 5.12 score 44 scripts

acabassi

klic:Kernel Learning Integrative Clustering

Kernel Learning Integrative Clustering (KLIC) is an algorithm that allows combining multiple kernels, each representing a different measure of the similarity between a set of observations. The contribution of each kernel to the final clustering is weighted according to the amount of information carried by it. As well as providing the functions required to perform the kernel-based clustering, this package also allows the user to simply give the data as input: the kernels are then built using consensus clustering. Different strategies to choose the best number of clusters are also available. For further details please see Cabassi and Kirk (2020) <doi:10.1093/bioinformatics/btaa593>.

Maintained by Alessandra Cabassi. Last updated 5 years ago.

cluster-analysis clustering coca genomics integrative-clustering kernel-methods multi-omics

31.0 match 5 stars 4.40 score 10 scripts

bioc

mbkmeans:Mini-batch K-means Clustering for Single-Cell RNA-seq

Implements the mini-batch k-means algorithm for large datasets, including support for on-disk data representation.

Maintained by Davide Risso. Last updated 5 months ago.

clustering geneexpression rnaseq software transcriptomics sequencing singlecell human-cell-atlas cpp

18.4 match 10 stars 7.41 score 54 scripts 2 dependents

mhahsler

streamMOA:Interface for MOA Stream Clustering Algorithms

Interface for data stream clustering algorithms implemented in the MOA (Massive Online Analysis) framework (Albert Bifet, Geoff Holmes, Richard Kirkby, Bernhard Pfahringer (2010). MOA: Massive Online Analysis, Journal of Machine Learning Research 11: 1601-1604).

Maintained by Michael Hahsler. Last updated 7 months ago.

clustering datamining datastream openjdk

22.6 match 13 stars 5.98 score 37 scripts

dicook

mulgar:Functions for Pre-Processing Data for Multivariate Data Visualisation using Tours

This is a companion to the book Cook, D. and Laa, U. (2023) <https://dicook.github.io/mulgar_book/> "Interactively exploring high-dimensional data and models in R". It contains useful functions for processing data in preparation for visualising with a tour. There are also several sample data sets.

Maintained by Dianne Cook. Last updated 2 months ago.

30.1 match 4 stars 4.50 score 79 scripts

bioc

tidySingleCellExperiment:Brings SingleCellExperiment to the Tidyverse

'tidySingleCellExperiment' is an adapter that abstracts the 'SingleCellExperiment' container in the form of a 'tibble'. This allows *tidy* data manipulation, nesting, and plotting. For example, a 'tidySingleCellExperiment' is directly compatible with functions from 'tidyverse' packages `dplyr` and `tidyr`, as well as plotting with `ggplot2` and `plotly`. In addition, the package provides various utility functions specific to single-cell omics data analysis (e.g., aggregation of cell-level data to pseudobulks).

Maintained by Stefano Mangiola. Last updated 5 months ago.

assaydomain infrastructure rnaseq differentialexpression singlecell geneexpression normalization clustering qualitycontrol sequencing bioconductor dplyr ggplot2 plotly single-cell-rna-seq single-cell-sequencing singlecellexperiment tibble tidyr tidyverse

15.3 match 36 stars 8.86 score 125 scripts 2 dependents

bioc

MetaCyto:MetaCyto: A package for meta-analysis of cytometry data

This package provides functions for preprocessing, automated gating and meta-analysis of cytometry data. It also provides functions that facilitate the collection of cytometry data from the ImmPort database.

Maintained by Zicheng Hu. Last updated 5 months ago.

immunooncology cellbiology flowcytometry clustering statisticalmethod software cellbasedassays preprocessing

28.6 match 4.73 score 18 scripts

spatstat

spatstat:Spatial Point Pattern Analysis, Model-Fitting, Simulation, Tests

Comprehensive open-source toolbox for analysing Spatial Point Patterns. Focused mainly on two-dimensional point patterns, including multitype/marked points, in any spatial region. Also supports three-dimensional point patterns, space-time point patterns in any number of dimensions, point patterns on a linear network, and patterns of other geometrical objects. Supports spatial covariate data such as pixel images. Contains over 3000 functions for plotting spatial data, exploratory data analysis, model-fitting, simulation, spatial sampling, model diagnostics, and formal inference. Data types include point patterns, line segment patterns, spatial windows, pixel images, tessellations, and linear networks. Exploratory methods include quadrat counts, K-functions and their simulation envelopes, nearest neighbour distance and empty space statistics, Fry plots, pair correlation function, kernel smoothed intensity, relative risk estimation with cross-validated bandwidth selection, mark correlation functions, segregation indices, mark dependence diagnostics, and kernel estimates of covariate effects. Formal hypothesis tests of random pattern (chi-squared, Kolmogorov-Smirnov, Monte Carlo, Diggle-Cressie-Loosmore-Ford, Dao-Genton, two-stage Monte Carlo) and tests for covariate effects (Cox-Berman-Waller-Lawson, Kolmogorov-Smirnov, ANOVA) are also supported. Parametric models can be fitted to point pattern data using the functions ppm(), kppm(), slrm(), dppm() similar to glm(). Types of models include Poisson, Gibbs and Cox point processes, Neyman-Scott cluster processes, and determinantal point processes. Models may involve dependence on covariates, inter-point interaction, cluster formation and dependence on marks. Models are fitted by maximum likelihood, logistic regression, minimum contrast, and composite likelihood methods. A model can be fitted to a list of point patterns (replicated point pattern data) using the function mppm(). 
The model can include random effects and fixed effects depending on the experimental design, in addition to all the features listed above. Fitted point process models can be simulated automatically. Formal hypothesis tests of a fitted model are supported (likelihood ratio test, analysis of deviance, Monte Carlo tests) along with basic tools for model selection (stepwise(), AIC()) and variable selection (sdr). Tools for validating the fitted model include simulation envelopes, residuals, residual plots and Q-Q plots, leverage and influence diagnostics, partial residuals, and added variable plots.

Maintained by Adrian Baddeley. Last updated 2 months ago.

cluster-process cox-point-process gibbs-process kernel-density network-analysis point-process poisson-process spatial-analysis spatial-data spatial-data-analysis spatial-statistics spatstat statistical-methods statistical-models statistical-tests statistics

8.3 match 200 stars 16.32 score 5.5k scripts 41 dependents

zdebruine

RcppML:Rcpp Machine Learning Library

Fast machine learning algorithms including matrix factorization and divisive clustering for large sparse and dense matrices.

Maintained by Zach DeBruine. Last updated 2 years ago.

clustering matrix-factorization nmf rcpp rcppeigen sparse-matrix cpp openmp

12.8 match 104 stars 10.53 score 125 scripts 46 dependents

tengmcing

spotoroo:Spatiotemporal Clustering of Satellite Hot Spot Data

An algorithm to cluster satellite hot spot data spatially and temporally.

Maintained by Weihao Li. Last updated 4 months ago.

28.8 match 5 stars 4.65 score 18 scripts

bioc

edgeR:Empirical Analysis of Digital Gene Expression Data in R

Differential expression analysis of sequence count data. Implements a range of statistical methodology based on the negative binomial distributions, including empirical Bayes estimation, exact tests, generalized linear models, quasi-likelihood, and gene set enrichment. Can perform differential analyses of any type of omics data that produces read counts, including RNA-seq, ChIP-seq, ATAC-seq, Bisulfite-seq, SAGE, CAGE, metabolomics, or proteomics spectral counts. RNA-seq analyses can be conducted at the gene or isoform level, and tests can be conducted for differential exon or transcript usage.

Maintained by Yunshun Chen. Last updated 5 days ago.

alternativesplicing batcheffect bayesian biomedicalinformatics cellbiology chipseq clustering coverage differentialexpression differentialmethylation differentialsplicing dnamethylation epigenetics functionalgenomics geneexpression genesetenrichment genetics immunooncology multiplecomparison normalization pathways proteomics qualitycontrol regression rnaseq sage sequencing singlecell systemsbiology timecourse transcription transcriptomics openblas

10.0 match 13.40 score 17k scripts 255 dependents

immunomind

immunarch:Bioinformatics Analysis of T-Cell and B-Cell Immune Repertoires

A comprehensive framework for bioinformatics exploratory analysis of bulk and single-cell T-cell receptor and antibody repertoires. It provides seamless data loading, analysis and visualisation for AIRR (Adaptive Immune Receptor Repertoire) data, both bulk immunosequencing (RepSeq) and single-cell sequencing (scRNAseq). Immunarch implements most of the widely used AIRR analysis methods, such as: clonality analysis, estimation of repertoire similarities in distribution of clonotypes and gene segments, repertoire diversity analysis, annotation of clonotypes using external immune receptor databases and clonotype tracking in vaccination and cancer studies. A successor to our previously published 'tcR' immunoinformatics package (Nazarov 2015) <doi:10.1186/s12859-015-0613-1>.

Maintained by Vadim I. Nazarov. Last updated 12 months ago.

airr-analysis b-cell-receptor bcr bcr-repertoire bioinformatics ig ig-repertoire immune-repertoire immune-repertoire-analysis immune-repertoire-data immunoglobulin immunoinformatics immunology rep-seq repertoire-analysis single-cell single-cell-analysis t-cell-receptor tcr tcr-repertoire cpp

14.1 match 315 stars 9.49 score 203 scripts

bioc

hopach:Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH)

The HOPACH clustering algorithm builds a hierarchical tree of clusters by recursively partitioning a data set, while ordering and possibly collapsing clusters at each level. The algorithm uses the Mean/Median Split Silhouette (MSS) criteria to identify the level of the tree with maximally homogeneous clusters. It also runs the tree down to produce a final ordered list of the elements. The non-parametric bootstrap allows one to estimate the probability that each element belongs to each cluster (fuzzy clustering).

Maintained by Katherine S. Pollard. Last updated 5 months ago.

clustering

22.1 match 6.05 score 54 scripts 5 dependents

m-py

anticlust:Subset Partitioning via Anticlustering

The method of anticlustering partitions a pool of elements into groups (i.e., anticlusters) with the goal of maximizing between-group similarity or within-group heterogeneity. The anticlustering approach thereby reverses the logic of cluster analysis that strives for high within-group homogeneity and clear separation between groups. Computationally, anticlustering is accomplished by maximizing instead of minimizing a clustering objective function, such as the intra-cluster variance (used in k-means clustering) or the sum of pairwise distances within clusters. The main function anticlustering() gives access to optimal and heuristic anticlustering methods described in Papenberg and Klau (2021; <doi:10.1037/met0000301>), Brusco et al. (2020; <doi:10.1111/bmsp.12186>), and Papenberg (2024; <doi:10.1111/bmsp.12315>). The optimal algorithms require that an integer linear programming solver is installed. This package will install 'lpSolve' (<https://cran.r-project.org/package=lpSolve>) as a default solver, but it is also possible to use the package 'Rglpk' (<https://cran.r-project.org/package=Rglpk>), which requires the GNU linear programming kit (<https://www.gnu.org/software/glpk/glpk.html>), or the package 'Rsymphony' (<https://cran.r-project.org/package=Rsymphony>), which requires the SYMPHONY ILP solver (<https://github.com/coin-or/SYMPHONY>). 'Rglpk' and 'Rsymphony' have to be manually installed by the user because they are only "suggested" dependencies. Full access to the bicriterion anticlustering method proposed by Brusco et al. (2020) is given via the function bicriterion_anticlustering(), while kplus_anticlustering() implements the full functionality of the k-plus anticlustering approach proposed by Papenberg (2024). Some other functions are available to solve classical clustering problems. The function balanced_clustering() applies a cluster analysis under size constraints, i.e., creates equal-sized clusters. 
The function matching() can be used for (unrestricted, bipartite, or K-partite) matching. The function wce() can be used to optimally solve the (weighted) cluster editing problem, also known as correlation clustering, clique partitioning problem or transitivity clustering.
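To make the "maximize instead of minimize" idea concrete, here is a minimal sketch of anticlustering on one-dimensional data. It is not the package's anticlustering() function: the names `diversity` and `anticluster_exchange`, the round-robin starting partition, and the simple pairwise-swap search are all assumptions chosen for illustration. The objective is the sum of pairwise distances within groups, which the description above names as one possible criterion.

```python
from itertools import combinations


def diversity(values, groups):
    """Sum of pairwise absolute differences within each group."""
    total = 0.0
    for g in set(groups):
        members = [v for v, lab in zip(values, groups) if lab == g]
        total += sum(abs(a - b) for a, b in combinations(members, 2))
    return total


def anticluster_exchange(values, k):
    """Hypothetical helper: swap pairs of elements while diversity increases."""
    n = len(values)
    groups = [i % k for i in range(n)]  # round-robin start, near-equal sizes
    best = diversity(values, groups)
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                if groups[i] == groups[j]:
                    continue
                # Swapping labels keeps every group size unchanged.
                groups[i], groups[j] = groups[j], groups[i]
                score = diversity(values, groups)
                if score > best + 1e-12:
                    best = score
                    improved = True
                else:
                    # Undo the swap if it did not increase the objective.
                    groups[i], groups[j] = groups[j], groups[i]
    return groups, best
```

A call such as `anticluster_exchange([1, 2, 3, 4, 10, 11, 12, 13], 2)` returns group labels that mix low and high values in each group, the opposite of what a clustering of the same data would produce.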

Maintained by Martin Papenberg. Last updated 4 days ago.

14.2 match 31 stars 9.35 score 60 scripts 2 dependents

mikewlcheung

metaSEM:Meta-Analysis using Structural Equation Modeling

A collection of functions for conducting meta-analysis using a structural equation modeling (SEM) approach via the 'OpenMx' and 'lavaan' packages. It also implements various procedures to perform meta-analytic structural equation modeling on the correlation and covariance matrices, see Cheung (2015) <doi:10.3389/fpsyg.2014.01521>.

Maintained by Mike Cheung. Last updated 9 days ago.

meta-analysis meta-analytic-sem missing-data multilevel-models multivariate-analysis structural-equation-modeling structural-equation-models

14.0 match 30 stars 9.43 score 208 scripts 1 dependents

e-sensing

sits:Satellite Image Time Series Analysis for Earth Observation Data Cubes

An end-to-end toolkit for land use and land cover classification using big Earth observation data, based on machine learning methods applied to satellite image data cubes, as described in Simoes et al (2021) <doi:10.3390/rs13132428>. Builds regular data cubes from collections in AWS, Microsoft Planetary Computer, Brazil Data Cube, Copernicus Data Space Environment (CDSE), Digital Earth Africa, Digital Earth Australia, NASA HLS using the Spatio-temporal Asset Catalog (STAC) protocol (<https://stacspec.org/>) and the 'gdalcubes' R package developed by Appel and Pebesma (2019) <doi:10.3390/data4030092>. Supports visualization methods for images and time series and smoothing filters for dealing with noisy time series. Includes functions for quality assessment of training samples using self-organized maps as presented by Santos et al (2021) <doi:10.1016/j.isprsjprs.2021.04.014>. Includes methods to reduce training samples imbalance proposed by Chawla et al (2002) <doi:10.1613/jair.953>. Provides machine learning methods including support vector machines, random forests, extreme gradient boosting, multi-layer perceptrons, temporal convolutional neural networks proposed by Pelletier et al (2019) <doi:10.3390/rs11050523>, and temporal attention encoders by Garnot and Landrieu (2020) <doi:10.48550/arXiv.2007.00586>. Supports GPU processing of deep learning models using torch <https://torch.mlverse.org/>. Performs efficient classification of big Earth observation data cubes and includes functions for post-classification smoothing based on Bayesian inference as described by Camara et al (2024) <doi:10.3390/rs16234572>, and methods for active learning and uncertainty assessment. Supports region-based time series analysis using package supercells <https://jakubnowosad.com/supercells/>. Enables best practices for estimating area and assessing accuracy of land change as recommended by Olofsson et al (2014) <doi:10.1016/j.rse.2014.02.015>. 
Minimum recommended requirements: 16 GB RAM and 4 CPU dual-core.

Maintained by Gilberto Camara. Last updated 1 month ago.

big-earth-data cbers earth-observation eo-datacubes geospatial image-time-series land-cover-classification landsat planetary-computer r-spatial remote-sensing rspatial satellite-image-time-series satellite-imagery sentinel-2 stac-api stac-catalog cpp

13.9 match 494 stars 9.50 score 384 scripts

mschubert

clustermq:Evaluate Function Calls on HPC Schedulers (LSF, SGE, SLURM, PBS/Torque)

Evaluate arbitrary function calls using workers on HPC schedulers in a single line of code. All processing is done on the network without accessing the file system. Remote schedulers are supported via SSH.

Maintained by Michael Schubert. Last updated 24 days ago.

cluster high-performance-computing lsf sge slurm ssh zeromq3 cpp

12.9 match 149 stars 10.23 score 253 scripts

melodyaowen

crt2power:Designing Cluster-Randomized Trials with Two Continuous Co-Primary Outcomes

Provides methods for powering cluster-randomized trials with two continuous co-primary outcomes using five key design techniques. Includes functions for calculating required sample size and statistical power. For more details on methodology, see Owen et al. (2025) <doi:10.1002/sim.70015>, Yang et al. (2022) <doi:10.1111/biom.13692>, Pocock et al. (1987) <doi:10.2307/2531989>, Vickerstaff et al. (2019) <doi:10.1186/s12874-019-0754-4>, and Li et al. (2020) <doi:10.1111/biom.13212>.

Maintained by Melody Owen. Last updated 2 days ago.

36.5 match 3.60 score 2 scripts

bioc

ctc:Cluster and Tree Conversion.

Tools for exporting and importing classification trees and clusters to and from other programs.

Maintained by Antoine Lucas. Last updated 5 months ago.

microarray clustering classification dataimport visualization

23.5 match 5.56 score 61 scripts 2 dependents

keefe-murphy

MoEClust:Gaussian Parsimonious Clustering Models with Covariates and a Noise Component

Clustering via parsimonious Gaussian Mixtures of Experts using the MoEClust models introduced by Murphy and Murphy (2020) <doi:10.1007/s11634-019-00373-8>. This package fits finite Gaussian mixture models with a formula interface for supplying gating and/or expert network covariates using a range of parsimonious covariance parameterisations from the GPCM family via the EM/CEM algorithm. Visualisation of the results of such models using generalised pairs plots and the inclusion of an additional noise component is also facilitated. A greedy forward stepwise search algorithm is provided for identifying the optimal model in terms of the number of components, the GPCM covariance parameterisation, and the subsets of gating/expert network covariates.

Maintained by Keefe Murphy. Last updated 11 days ago.

gaussian-mixture-models mixture-of-experts model-based-clustering

19.9 match 7 stars 6.51 score 44 scripts 1 dependents

vinhtantran

monoClust:Perform Monothetic Clustering with Extensions to Circular Data

Implementation of the Monothetic Clustering algorithm (Chavent, 1998 <doi:10.1016/S0167-8655(98)00087-7>) on continuous data sets. The package also includes several extensions: applying monothetic clustering to data sets with circular variables, visualizing the results, and permutation- and cross-validation-based tests to support the decision on the number of clusters.

Maintained by Tan Tran. Last updated 4 years ago.

circular-variables clusters ggplot2 monothetic plot visualization

31.0 match 1 stars 4.18 score 7 scripts 1 dependents

bioc

DuplexDiscovereR:Analysis of the data from RNA duplex probing experiments

DuplexDiscovereR is a package designed for analyzing data from RNA cross-linking and proximity ligation protocols such as SPLASH, PARIS, LIGR-seq, and others. DuplexDiscovereR accepts input in the form of chimerically or split-aligned reads. It includes procedures for alignment classification, filtering, and efficient clustering of individual chimeric reads into duplex groups (DGs). Once DGs are identified, the package predicts RNA duplex formation and their hybridization energies. Additional metrics, such as p-values for the random ligation hypothesis or mean DG alignment scores, can be calculated to rank the final set of RNA duplexes. Data from multiple experiments or replicates can be processed separately and further compared to check the reproducibility of the experimental method.

Maintained by Egor Semenchenko. Last updated 2 months ago.

sequencing transcriptomics structuralprediction clustering splicedalignment

28.1 match 1 stars 4.60 score 5 scripts

bioc

flowMerge:Cluster Merging for Flow Cytometry Data

Merging of mixture components for model-based automated gating of flow cytometry data using the flowClust framework. Note: users should have a working copy of flowClust 2.0 installed.

Maintained by Greg Finak. Last updated 5 months ago.

immunooncology clustering flowcytometry

28.3 match 4.56 score 6 scripts 1 dependents

kgoldfeld

simstudy:Simulation of Study Data

Simulates data sets in order to explore modeling techniques or better understand data generating processes. The user specifies a set of relationships between covariates, and generates data based on these specifications. The final data sets can represent data from randomized control trials, repeated measure (longitudinal) designs, and cluster randomized trials. Missingness can be generated using various mechanisms (MCAR, MAR, NMAR).

Maintained by Keith Goldfeld. Last updated 8 months ago.

data-generation data-simulation simulation statistical-models cpp

11.7 match 82 stars 11.00 score 972 scripts 1 dependents

bioc

iSEE:Interactive SummarizedExperiment Explorer

Create an interactive Shiny-based graphical user interface for exploring data stored in SummarizedExperiment objects, including row- and column-level metadata. The interface supports transmission of selections between plots and tables, code tracking, interactive tours, interactive or programmatic initialization, preservation of app state, and extensibility to new panel types via S4 classes. Special attention is given to single-cell data in a SingleCellExperiment object with visualization of dimensionality reduction results.

Maintained by Kevin Rue-Albrecht. Last updated 10 days ago.

cellbasedassays clustering dimensionreduction featureextraction geneexpression gui immunooncology shinyapps singlecell transcription transcriptomics visualization dimension-reduction feature-extraction gene-expression hacktoberfest human-cell-atlas shiny single-cell

10.0 match 225 stars 12.86 score 380 scripts 9 dependents

stemangiola

tidyseurat:Brings Seurat to the Tidyverse

Creates an invisible layer that allows the 'Seurat' object to be viewed as a tibble and to interact seamlessly with the tidyverse.

Maintained by Stefano Mangiola. Last updated 8 months ago.

assaydomain infrastructure rnaseq differentialexpression geneexpression normalization clustering qualitycontrol sequencing transcription transcriptomics dplyr ggplot2 pca purrr sct seurat single-cell single-cell-rna-seq tibble tidyr tidyverse transcripts tsne umap

13.3 match 158 stars 9.66 score 398 scripts 1 dependents

evelinag

clusternomics:Integrative Clustering for Heterogeneous Biomedical Datasets

Integrative context-dependent clustering for heterogeneous biomedical datasets. Identifies local clustering structures in related datasets, and global clusters that exist across the datasets.

Maintained by Evelina Gabasova. Last updated 8 years ago.

26.0 match 14 stars 4.92 score 12 scripts

ms609

TreeDist:Calculate and Map Distances Between Phylogenetic Trees

Implements measures of tree similarity, including information-based generalized Robinson-Foulds distances (Phylogenetic Information Distance, Clustering Information Distance, Matching Split Information Distance; Smith 2020) <doi:10.1093/bioinformatics/btaa614>; Jaccard-Robinson-Foulds distances (Bocker et al. 2013) <doi:10.1007/978-3-642-40453-5_13>, including the Nye et al. (2006) metric <doi:10.1093/bioinformatics/bti720>; the Matching Split Distance (Bogdanowicz & Giaro 2012) <doi:10.1109/TCBB.2011.48>; Maximum Agreement Subtree distances; the Kendall-Colijn (2016) distance <doi:10.1093/molbev/msw124>, and the Nearest Neighbour Interchange (NNI) distance, approximated per Li et al. (1996) <doi:10.1007/3-540-61332-3_168>. Includes tools for visualizing mappings of tree space (Smith 2022) <doi:10.1093/sysbio/syab100>, for identifying islands of trees (Silva and Wilkinson 2021) <doi:10.1093/sysbio/syab015>, for calculating the median of sets of trees, and for computing the information content of trees and splits.

Maintained by Martin R. Smith. Last updated 1 month ago.

phylogenetics tree-distance phylogenetic-trees tree-distances trees cpp

12.4 match 32 stars 10.32 score 97 scripts 5 dependents

mthrun

DatabionicSwarm:Swarm Intelligence for Self-Organized Clustering

Algorithms implementing populations of agents that interact with one another and sense their environment may exhibit emergent behavior such as self-organization and swarm intelligence. Here, a swarm system called Databionic swarm (DBS) is introduced which was published in Thrun, M.C., Ultsch A.: "Swarm Intelligence for Self-Organized Clustering" (2020), Artificial Intelligence, <DOI:10.1016/j.artint.2020.103237>. DBS is able to adapt itself to structures of high-dimensional data such as natural clusters characterized by distance and/or density based structures in the data space. The first module is the parameter-free projection method called Pswarm (Pswarm()), which exploits the concepts of self-organization and emergence, game theory, swarm intelligence and symmetry considerations. The second module is the parameter-free high-dimensional data visualization technique, which generates projected points on the topographic map with hypsometric tints defined by the generalized U-matrix (GeneratePswarmVisualization()). The third module is the clustering method itself with non-critical parameters (DBSclustering()). Clustering can be verified by the visualization and vice versa. The term DBS refers to the method as a whole. It enables even a non-professional in the field of data mining to apply its algorithms for visualization and/or clustering to data sets with completely different structures drawn from diverse research fields. The comparison to common projection methods can be found in the book of Thrun, M.C.: "Projection Based Clustering through Self-Organization and Swarm Intelligence" (2018) <DOI:10.1007/978-3-658-20540-9>.

Maintained by Michael Thrun. Last updated 1 year ago.

openblas cpp

20.7 match 12 stars 6.16 score 27 scripts 1 dependents

yjunechoe

jlmerclusterperm:Cluster-Based Permutation Analysis for Densely Sampled Time Data

An implementation of fast cluster-based permutation analysis (CPA) for densely-sampled time data developed in Maris & Oostenveld, 2007 <doi:10.1016/j.jneumeth.2007.03.024>. Supports (generalized, mixed-effects) regression models for the calculation of timewise statistics. Provides both a wholesale and a piecemeal interface to the CPA procedure with an emphasis on interpretability and diagnostics. Integrates 'Julia' libraries 'MixedModels.jl' and 'GLM.jl' for performance improvements, with additional functionalities for interfacing with 'Julia' from 'R' powered by the 'JuliaConnectoR' package.

Maintained by June Choe. Last updated 6 days ago.

cluster-based-permutation-test eeg eyetracking mixed-effects-models timeseries

21.8 match 13 stars 5.86 score 14 scripts

drordas

D2MCS:Data Driving Multiple Classifier System

Provides a novel framework able to automatically develop and deploy an accurate Multiple Classifier System based on the feature-clustering distribution achieved from an input dataset. 'D2MCS' was developed with a focus on four main aspects: (i) the ability to determine an effective method to evaluate the independence of features, (ii) the identification of the optimal number of feature clusters, (iii) the training and tuning of ML models and (iv) the execution of voting schemes to combine the outputs of each classifier comprising the Multiple Classifier System.

Maintained by Miguel Ferreiro-Díaz. Last updated 3 years ago.

openjdk

34.2 match 3.70 score

bioc

DECIPHER:Tools for curating, analyzing, and manipulating biological sequences

A toolset for deciphering and managing biological sequences.

Maintained by Erik Wright. Last updated 5 days ago.

clustering genetics sequencing dataimport visualization microarray qualitycontrol qpcr alignment wholegenome microbiome immunooncology geneprediction openmp

15.0 match 8.40 score 1.1k scripts 14 dependents

bioc

SingleR:Reference-Based Single-Cell RNA-Seq Annotation

Performs unbiased cell type recognition from single-cell RNA sequencing data, by leveraging reference transcriptomic datasets of pure cell types to infer the cell of origin of each single cell independently.

Maintained by Aaron Lun. Last updated 28 days ago.

software singlecell geneexpression transcriptomics classification clustering annotation bioconductor singler cpp

10.0 match 182 stars 12.60 score 2.1k scripts 1 dependents