Showing 200 of 223 results


BSDA:Basic Statistics and Data Analysis

Data sets for book "Basic Statistics and Data Analysis" by Larry J. Kitchens.

Maintained by Alan T. Arnholt. Last updated 2 years ago.

3.4 match 7 stars 9.11 score 1.3k scripts 6 dependents


malariasimulation:An individual based model for malaria

Specifies the latest and greatest malaria model.

Maintained by Giovanni Charles. Last updated 29 days ago.


1.5 match 16 stars 8.17 score 146 scripts


geiger:Analysis of Evolutionary Diversification

Methods for fitting macroevolutionary models to phylogenetic trees Pennell (2014) <doi:10.1093/bioinformatics/btu181>.

Maintained by Luke Harmon. Last updated 2 years ago.


1.6 match 1 star 7.84 score 2.3k scripts 28 dependents



Find, subset, and retrieve geospatial data by AOI.

Maintained by Mike Johnson. Last updated 3 months ago.


1.3 match 187 stars 8.74 score 156 scripts 1 dependent


BED:Biological Entity Dictionary (BED)

An interface for the 'Neo4j' database providing mapping between different identifiers of biological entities. This Biological Entity Dictionary (BED) has been developed to address three main challenges. The first one is related to the completeness of identifier mappings. Indeed, direct mapping information provided by the different systems is not always complete and can be enriched by mappings provided by other resources. More interestingly, direct mappings not identified by any of these resources can be indirectly inferred by using mappings to a third reference. For example, many human Ensembl gene IDs are not directly mapped to any Entrez gene ID but such mappings can be inferred using respective mappings to HGNC ID. The second challenge is related to the mapping of deprecated identifiers. Indeed, entity identifiers can change from one resource release to another. The identifier history is provided by some resources, such as Ensembl or the NCBI, but it is generally not used by mapping tools. The third challenge is related to the automation of the mapping process according to the relationships between the biological entities of interest. Indeed, mapping between gene and protein ID scopes should not be done the same way as between two scopes regarding gene ID. Also, converting identifiers from different organisms should be possible using gene ortholog information. The method has been published by Godard and van Eyll (2018) <doi:10.12688/f1000research.13925.3>.

Maintained by Patrice Godard. Last updated 3 months ago.

0.9 match 8 stars 6.85 score 25 scripts


BASiCS:Bayesian Analysis of Single-Cell Sequencing data

Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels in seemingly homogeneous populations of cells. However, these experiments are prone to high levels of technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the population of cells under study. BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model to perform statistical analyses of single-cell RNA sequencing datasets in the context of supervised experiments (where the groups of cells of interest are known a priori, e.g. experimental conditions or cell types). BASiCS performs built-in data normalisation (global scaling) and technical noise quantification (based on spike-in genes). BASiCS provides an intuitive detection criterion for highly (or lowly) variable genes within a single group of cells. Additionally, BASiCS can compare gene expression patterns between two or more pre-specified groups of cells. Unlike traditional differential expression tools, BASiCS quantifies changes in expression that lie beyond comparisons of means, also allowing the study of changes in cell-to-cell heterogeneity. The latter can be quantified via a biological over-dispersion parameter that measures the excess of variability that is observed with respect to Poisson sampling noise, after normalisation and technical noise removal. Due to the strong mean/over-dispersion confounding that is typically observed for scRNA-seq datasets, BASiCS also tests for changes in residual over-dispersion, defined by residual values with respect to a global mean/over-dispersion trend.

Maintained by Catalina Vallejos. Last updated 5 months ago.


0.5 match 83 stars 10.26 score 368 scripts 1 dependent


royale:Clash Royale API

R interface to the official API for Clash Royale.

Maintained by Christopher T. Kenny. Last updated 1 year ago.

2.3 match 1.70 score 4 scripts


mulea:Enrichment Analysis Using Multiple Ontologies and False Discovery Rate

Background - Traditional gene set enrichment analyses are typically limited to a few ontologies and do not account for the interdependence of gene sets or terms, resulting in overcorrected p-values. To address these challenges, we introduce mulea, an R package offering comprehensive overrepresentation and functional enrichment analysis. Results - mulea employs a progressive empirical false discovery rate (eFDR) method, specifically designed for interconnected biological data, to accurately identify significant terms within diverse ontologies. mulea expands beyond traditional tools by incorporating a wide range of ontologies, encompassing Gene Ontology, pathways, regulatory elements, genomic locations, and protein domains. This flexibility enables researchers to tailor enrichment analysis to their specific questions, such as identifying enriched transcriptional regulators in gene expression data or overrepresented protein domains in protein sets. To facilitate seamless analysis, mulea provides gene sets (in standardised GMT format) for 27 model organisms, covering 22 ontology types from 16 databases and various identifiers resulting in almost 900 files. Additionally, the muleaData ExperimentData Bioconductor package simplifies access to these pre-defined ontologies. Finally, mulea's architecture allows for easy integration of user-defined ontologies, or GMT files from external sources (e.g., MSigDB or Enrichr), expanding its applicability across diverse research areas. Conclusions - mulea is distributed as a CRAN R package. It offers researchers a powerful and flexible toolkit for functional enrichment analysis, addressing limitations of traditional tools with its progressive eFDR and by supporting a variety of ontologies. Overall, mulea fosters the exploration of diverse biological questions across various model organisms.

Maintained by Tamas Stirling. Last updated 3 months ago.


0.5 match 28 stars 7.36 score 34 scripts


waywiser:Ergonomic Methods for Assessing Spatial Models

Assessing predictive models of spatial data can be challenging, both because these models are typically built for extrapolating outside the original region represented by training data and due to potential spatially structured errors, with "hot spots" of higher than expected error clustered geographically due to spatial structure in the underlying data. Methods are provided for assessing models fit to spatial data, including approaches for measuring the spatial structure of model errors, assessing model predictions at multiple spatial scales, and evaluating where predictions can be made safely. Methods are particularly useful for models fit using the 'tidymodels' framework. Methods include Moran's I ('Moran' (1950) <doi:10.2307/2332142>), Geary's C ('Geary' (1954) <doi:10.2307/2986645>), Getis-Ord's G ('Ord' and 'Getis' (1995) <doi:10.1111/j.1538-4632.1995.tb00912.x>), agreement coefficients from 'Ji' and Gallo (2006) (<doi: 10.14358/PERS.72.7.823>), agreement metrics from 'Willmott' (1981) (<doi: 10.1080/02723646.1981.10642213>) and 'Willmott' 'et' 'al'. (2012) (<doi: 10.1002/joc.2419>), an implementation of the area of applicability methodology from 'Meyer' and 'Pebesma' (2021) (<doi:10.1111/2041-210X.13650>), and an implementation of multi-scale assessment as described in 'Riemann' 'et' 'al'. (2010) (<doi:10.1016/j.rse.2010.05.010>).

Maintained by Michael Mahoney. Last updated 23 hours ago.


0.5 match 37 stars 6.93 score 19 scripts


epubr:Read EPUB File Metadata and Text

The 'epubr' package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame. E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content across an arbitrary collection of e-books perfectly and in completely general form, to yield a singular, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata. EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with 'epubr'. Text is read 'as is' for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user's discretion, such as with functions from packages like 'tm' or 'qdap'.

Maintained by Matthew Leonawicz. Last updated 6 months ago.


0.5 match 24 stars 6.37 score 49 scripts


transcriptR:An Integrative Tool for ChIP- And RNA-Seq Based Primary Transcripts Detection and Quantification

The differences in the RNA types being sequenced have an impact on the resulting sequencing profiles. mRNA-seq data is enriched with reads derived from exons, while GRO-, nucRNA- and chrRNA-seq demonstrate substantially broader coverage of both exonic and intronic regions. The presence of intronic reads in GRO-seq type of data makes it possible to use it to computationally identify and quantify all de novo continuous regions of transcription distributed across the genome. This type of data, however, is more challenging to interpret and less commonly used than mRNA-seq. One of the challenges for primary transcript detection concerns the simultaneous transcription of closely spaced genes, which needs to be properly divided into individually transcribed units. The R package transcriptR combines RNA-seq data with ChIP-seq data of histone modifications that mark active Transcription Start Sites (TSSs), such as H3K4me3 or H3K9/14Ac, to overcome this challenge. The advantage of this approach over the use of, for example, gene annotations is that it is data driven and therefore also able to deal with novel and case-specific events. Furthermore, the integration of ChIP- and RNA-seq data allows the identification of all known and novel active transcription start sites within a given sample.

Maintained by Armen R. Karapetyan. Last updated 5 months ago.


0.9 match 3.30 score 2 scripts


DMCHMM:Differentially Methylated CpG using Hidden Markov Model

A pipeline for identifying differentially methylated CpG sites using Hidden Markov Model in bisulfite sequencing data. DNA methylation studies have enabled researchers to understand methylation patterns and their regulatory roles in biological processes and disease. However, only a limited number of statistical approaches have been developed to provide formal quantitative analysis. Specifically, a few available methods do identify differentially methylated CpG (DMC) sites or regions (DMR), but they suffer from limitations that arise mostly due to challenges inherent in bisulfite sequencing data. These challenges include: (1) that read-depths vary considerably among genomic positions and are often low; (2) both methylation and autocorrelation patterns change as regions change; and (3) CpG sites are distributed unevenly. Furthermore, there are several methodological limitations: almost none of these tools is capable of comparing multiple groups and/or working with missing values, and only a few allow continuous or multiple covariates. The last of these is of great interest among researchers, as the goal is often to find which regions of the genome are associated with several exposures and traits. To tackle these issues, we have developed an efficient DMC identification method based on Hidden Markov Models (HMMs) called “DMCHMM” which is a three-step approach (model selection, prediction, testing) aiming to address the aforementioned drawbacks.

Maintained by Farhad Shokoohi. Last updated 5 months ago.


0.8 match 3.78 score 3 scripts


NoRCE:NoRCE: Noncoding RNA Sets Cis Annotation and Enrichment

While some non-coding RNAs (ncRNAs) are assigned critical regulatory roles, most remain functionally uncharacterized. This presents a challenge whenever an interesting set of ncRNAs needs to be analyzed in a functional context. Transcripts located close-by on the genome are often regulated together. This genomic proximity on the sequence can hint at a functional association. We present a tool, NoRCE, that performs cis enrichment analysis for a given set of ncRNAs. Enrichment is carried out using the functional annotations of the coding genes located proximal to the input ncRNAs. Other biologically relevant information such as topologically associating domain (TAD) boundaries, co-expression patterns, and miRNA target prediction information can be incorporated to conduct a richer enrichment analysis. To this end, NoRCE includes several relevant datasets as part of its data repository, including cell-line specific TAD boundaries, functional gene sets, and expression data for coding & ncRNAs specific to cancer. Additionally, users can supply custom data files in their investigation. Enrichment results can be retrieved in a tabular format or visualized in several different ways. NoRCE is currently available for the following species: human, mouse, rat, zebrafish, fruit fly, worm, and yeast.

Maintained by Gulden Olgun. Last updated 5 months ago.


0.5 match 1 star 4.60 score 6 scripts


aifeducation:Artificial Intelligence for Education

In social and educational settings, the use of Artificial Intelligence (AI) is a challenging task. Relevant data is often only available in handwritten forms, or the use of data is restricted by privacy policies. This often leads to small data sets. Furthermore, in the educational and social sciences, data is often unbalanced in terms of frequencies. To support educators as well as educational and social researchers in using the potential of AI for their work, this package provides a unified interface for neural nets in 'PyTorch' to deal with natural language problems. In addition, the package ships with a shiny app, providing a graphical user interface. This allows people without skills in writing Python/R scripts to use AI. The tools integrate existing mathematical and statistical methods for dealing with small data sets via pseudo-labeling (e.g. Cascante-Bonilla et al. (2020) <doi:10.48550/arXiv.2001.06001>) and imbalanced data via the creation of synthetic cases (e.g. Bunkhumpornpat et al. (2012) <doi:10.1007/s10489-011-0287-y>). Performance evaluation of AI is connected to measures from content analysis which educational and social researchers are generally more familiar with (e.g. Berding & Pargmann (2022) <doi:10.30819/5581>, Gwet (2014) <ISBN:978-0-9708062-8-4>, Krippendorff (2019) <doi:10.4135/9781071878781>). Estimation of energy consumption and CO2 emissions during model training is done with the 'python' library 'codecarbon'. Finally, all objects created with this package allow trained AI models to be shared with other people.

Maintained by Berding Florian. Last updated 1 month ago.


0.5 match 4.48 score 8 scripts


knfi:Analysis of Korean National Forest Inventory Database

Understanding the current status of forest resources is essential for monitoring changes in forest ecosystems and generating related statistics. In South Korea, the National Forest Inventory (NFI) surveys over 4,500 sample plots nationwide every five years and records 70 items, including forest stand, forest resource, and forest vegetation surveys. Many researchers use NFI as the primary data for research, such as biomass estimation or analyzing the importance value of each species over time and space, depending on the research purpose. However, the large volume of accumulated forest survey data from across the country can make it challenging to manage and utilize such a vast dataset. To address this issue, we developed an R package that efficiently handles large-scale NFI data across time and space. The package offers a comprehensive workflow for NFI data analysis. It starts with data processing, where the read_nfi() function reconstructs NFI data according to the researcher's needs while performing basic integrity checks for data quality. Following this, the package provides analytical tools that operate on the verified data. These include functions like summary_nfi() for summary statistics, diversity_nfi() for biodiversity analysis, iv_nfi() for calculating species importance value, and biomass_nfi() and cwd_biomass_nfi() for biomass estimation. Finally, for visualization, the tsvis_nfi() function generates graphs and maps, allowing users to visualize forest ecosystem changes across various spatial and temporal scales. This integrated approach and its specialized functions can enhance the efficiency of processing and analyzing NFI data, providing researchers with insights into forest ecosystems. The NFI Excel files (.xlsx) are not included in the R package and must be downloaded separately. Users can access these NFI Excel files by visiting the Korea Forest Service Forestry Statistics Platform to download the annual NFI Excel files, which are bundled in .zip archives. Please note that this website is only available in Korean, and direct download links can be found in the notes section of the read_nfi() function.

Maintained by Sinyoung Park. Last updated 4 months ago.


0.5 match 1 star 4.48 score 2 scripts


FunChisq:Model-Free Functional Chi-Squared and Exact Tests

Statistical hypothesis testing methods for inferring model-free functional dependency using asymptotic chi-squared or exact distributions. Functional test statistics are asymmetric and functionally optimal, unique from other related statistics. Tests in this package reveal evidence for causality based on the causality-by-functionality principle. They include asymptotic functional chi-squared tests (Zhang & Song 2013) <doi:10.48550/arXiv.1311.2707>, an adapted functional chi-squared test (Kumar & Song 2022) <doi:10.1093/bioinformatics/btac206>, and an exact functional test (Zhong & Song 2019) <doi:10.1109/TCBB.2018.2809743> (Nguyen et al. 2020) <doi:10.24963/ijcai.2020/372>. The normalized functional chi-squared test was used by Best Performer 'NMSUSongLab' in HPN-DREAM (DREAM8) Breast Cancer Network Inference Challenges (Hill et al. 2016) <doi:10.1038/nmeth.3773>. A function index (Zhong & Song 2019) <doi:10.1186/s12920-019-0565-9> (Kumar et al. 2018) <doi:10.1109/BIBM.2018.8621502> derived from the functional test statistic offers a new effect size measure for the strength of functional dependency, a better alternative to conditional entropy in many aspects. For continuous data, these tests offer an advantage over regression analysis when a parametric functional form cannot be assumed; for categorical data, they provide a novel means to assess directional dependency not possible with symmetrical Pearson's chi-squared or Fisher's exact tests.

Maintained by Joe Song. Last updated 10 months ago.


0.5 match 4.37 score 29 scripts


autoCovariateSelection:R Package to Implement Automated Covariate Selection for Two Exposure Cohorts Using High-Dimensional Propensity Score Algorithm

Contains functions to implement automated covariate selection using methods described in the high-dimensional propensity score (HDPS) algorithm by Schneeweiss et al. Covariate adjustment in real-world observational data (RWD) is important for estimating adjusted outcomes, and this can be done by using methods such as, but not limited to, propensity score matching, propensity score weighting and regression analysis. While these methods strive to statistically adjust for confounding, the major challenge is in selecting the potential covariates that can bias the outcome comparison estimates in observational RWD. This is where the utility of automated covariate selection comes in. The functions in this package help to implement the three major steps of automated covariate selection as described by Schneeweiss et al. These three functions, in the order of the steps required to execute automated covariate selection, are get_candidate_covariates(), get_recurrence_covariates() and get_prioritised_covariates(). In addition to these functions, a sample real-world dataset from publicly available de-identified medical claims data is also available for running examples and for further exploration. The algorithm is described in the original article by Schneeweiss (2009) <doi:10.1097/EDE.0b013e3181a663cc>.

Maintained by Dennis Robert. Last updated 2 months ago.

0.5 match 4 stars 4.03 score 54 scripts


GSEAmining:Make Biological Sense of Gene Set Enrichment Analysis Outputs

Gene Set Enrichment Analysis is a very powerful and interesting computational method that allows an easy correlation between differentially expressed genes and biological processes. Unfortunately, although it was designed to help researchers interpret gene expression data, it can generate huge amounts of results whose biological meaning can be difficult to interpret. Many available tools rely on the hierarchically structured Gene Ontology (GO) classification to reduce redundancy in the results. However, due to the popularity of GSEA, many more gene set collections, such as those in the Molecular Signatures Database, are emerging. Since these collections are not organized as those in GO, their usage for GSEA does not always give a straightforward answer or, in other words, getting all the meaningful information can be challenging with the currently available tools. For these reasons, GSEAmining was born to be an easy tool to create reproducible reports to help researchers make biological sense of GSEA outputs. Given the results of GSEA, GSEAmining clusters the different gene set collections based on the presence of the same genes in the leading edge (core) subset. Leading edge subsets are those genes that contribute most to the enrichment score of each collection of genes or gene sets. For this reason, gene sets that participate in similar biological processes should share genes in common and in turn cluster together. After that, GSEAmining is able to identify and represent for each cluster: - The most enriched terms in the names of gene sets (as wordclouds) - The most enriched genes in the leading edge subsets (as bar plots). In each case, positive and negative enrichments are shown in different colors so it is easy to distinguish biological processes or genes that may be of interest in that particular study.

Maintained by Oriol Arqués. Last updated 5 months ago.


0.5 match 4.00 score 7 scripts


NST:Normalized Stochasticity Ratio

To estimate ecological stochasticity in community assembly. Understanding the community assembly mechanisms controlling biodiversity patterns is a central issue in ecology. Although it is generally accepted that both deterministic and stochastic processes play important roles in community assembly, quantifying their relative importance is challenging. The new index, normalized stochasticity ratio (NST), is to estimate ecological stochasticity, i.e. relative importance of stochastic processes, in community assembly. With functions in this package, NST can be calculated based on different similarity metrics and/or different null model algorithms, as well as some previous indexes, e.g. previous Stochasticity Ratio (ST), Standard Effect Size (SES), modified Raup-Crick metrics (RC). Functions for permutational test and bootstrapping analysis are also included. Previous ST is published by Zhou et al (2014) <doi:10.1073/pnas.1324044111>. NST is modified from ST by considering two alternative situations and normalizing the index to range from 0 to 1 (Ning et al 2019) <doi:10.1073/pnas.1904623116>. A modified version, MST, is a special case of NST, used in some recent or upcoming publications, e.g. Liang et al (2020) <doi:10.1016/j.soilbio.2020.108023>. SES is calculated as described in Kraft et al (2011) <doi:10.1126/science.1208584>. RC is calculated as reported by Chase et al (2011) <doi:10.1890/ES10-00117.1> and Stegen et al (2013) <doi:10.1038/ismej.2013.93>. Version 3 added NST based on phylogenetic beta diversity, used by Ning et al (2020) <doi:10.1038/s41467-020-18560-z>.

Maintained by Daliang Ning. Last updated 3 years ago.

0.5 match 2 stars 2.85 score 35 scripts


PytrendsLongitudinalR:Create Longitudinal Google Trends Data

'Google Trends' provides cross-sectional and time-series data on searches, but lacks readily available longitudinal data. Researchers who want to create longitudinal 'Google Trends' data on their own face practical challenges, such as normalized counts that make it difficult to combine cross-sectional and time-series data and limitations in data formats and timelines that limit data granularity over extended time periods. This package addresses these issues and enables researchers to generate longitudinal 'Google Trends' data. This package is built on 'pytrends', a Python library that acts as the unofficial 'Google Trends API' to collect 'Google Trends' data. As long as the 'Google Trends API', 'pytrends' and all their dependencies are working, this package will work. During testing, we noticed that for the same input (keyword, topic, data_format, timeline), the output index can vary from time to time. Besides, if the keyword is not very popular, then the resulting dataset will contain a lot of zeros, which will greatly affect the final result. While this package has no control over the accuracy or quality of 'Google Trends' data, once the data is created, this package converts it to longitudinal data. In addition, the user may encounter a 429 Too Many Requests error when using cross_section() and time_series() to collect 'Google Trends' data. This error indicates that the user has exceeded the rate limits set by the 'Google Trends API'. For more information about the 'Google Trends API' - 'pytrends', visit <>.

Maintained by Taeyong Park. Last updated 6 months ago.

0.5 match 2.70 score


GPRMortality:Gaussian Process Regression for Mortality Rates

A Bayesian statistical model for estimating child (under-five age group) and adult (15-60 age group) mortality. The main challenge is how to combine and integrate these different time series and how to produce unified estimates of mortality rates during a specified time span. GPR is a Bayesian statistical model for estimating child and adult mortality rates whose data likelihood is based on mortality rates from different data sources, such as the Death Registration System (DRS), censuses, or surveys. There are also various hyper-parameters for completeness of DRS, mean, covariance functions and variances as priors. This function produces estimates and uncertainty (95% or any desirable percentiles) based on sampling and non-sampling errors due to variation in data sources. The GP model utilizes Bayesian inference to update predicted mortality rates as a posterior in Bayes rule by combining data and a prior probability distribution over parameters in mean, covariance function, and the regression model. This package uses Markov Chain Monte Carlo (MCMC) to sample from the posterior probability distribution using the 'rstan' package in R. Details are given in Wang H, Dwyer-Lindgren L, Lofgren KT, et al. (2012) <doi:10.1016/S0140-6736(12)61719-X>, Wang H, Liddell CA, Coates MM, et al. (2014) <doi:10.1016/S0140-6736(14)60497-9> and Mohammadi, Parsaeian, Mehdipour et al. (2017) <doi:10.1016/S2214-109X(17)30105-5>.

Maintained by Ali Ghanbari. Last updated 4 years ago.

0.5 match 2.70 score 7 scripts


DrDimont:Drug Response Prediction from Differential Multi-Omics Networks

While it has been well established that drugs affect and help patients differently, personalized drug response predictions remain challenging. Solutions based on single omics measurements have been proposed, and networks provide means to incorporate molecular interactions into reasoning. However, how to integrate the wealth of information contained in multiple omics layers still poses a complex problem. We present a novel network analysis pipeline, DrDimont, Drug response prediction from Differential analysis of multi-omics networks. It allows for comparative conclusions between two conditions and translates them into differential drug response predictions. DrDimont focuses on molecular interactions. It establishes condition-specific networks from correlation within an omics layer that are then reduced and combined into heterogeneous, multi-omics molecular networks. A novel semi-local, path-based integration step ensures integrative conclusions. Differential predictions are derived from comparing the condition-specific integrated networks. DrDimont's predictions are explainable, i.e., molecular differences that are the source of high differential drug scores can be retrieved. Our proposed pipeline leverages multi-omics data for differential predictions, e.g. on drug response, and includes prior information on interactions. The case study presented in the vignette uses data published by Krug (2020) <doi:10.1016/j.cell.2020.10.036>. The package license applies only to the software and explicitly not to the included data.

Maintained by Katharina Baum. Last updated 2 years ago.

0.5 match 2.00 score 2 scripts


planningML:A Sample Size Calculator for Machine Learning Applications in Healthcare

Advances in automated document classification have led to identifying massive numbers of clinical concepts from handwritten clinical notes. These high dimensional clinical concepts can serve as highly informative predictors in building classification algorithms for identifying patients with different clinical conditions, commonly referred to as patient phenotyping. However, from a planning perspective, it is critical to ensure that enough data is available for the phenotyping algorithm to obtain a desired classification performance. This challenge in sample size planning is further exacerbated by the high dimension of the feature space and the inherent imbalance of the response class. Currently available sample size planning methods can be categorized into: (i) model-based approaches that predict the sample size required for achieving a desired accuracy using a linear machine learning classifier and (ii) learning curve-based approaches (Figueroa et al. (2012) <doi:10.1186/1472-6947-12-8>) that fit an inverse power law curve to pilot data to extrapolate performance. We develop model-based approaches for imbalanced data with correlated features, deriving sample size formulas for performance metrics that are sensitive to class imbalance such as Area Under the receiver operating characteristic Curve (AUC) and Matthews Correlation Coefficient (MCC). This is done using a two-step approach where we first perform feature selection using the innovated Higher Criticism thresholding method (Hall and Jin (2010) <doi:10.1214/09-AOS764>), then determine the sample size by optimizing the two performance metrics. Further, we develop software in the form of an R package named 'planningML' and an 'R' 'Shiny' app to facilitate the convenient implementation of the developed model-based approaches and learning curve approaches for imbalanced data. We apply our methods to the problem of phenotyping rare outcomes using the MIMIC-III electronic health record database. We show that our developed methods, which relate training data size and performance on AUC and MCC, can predict the true or observed performance from linear ML classifiers such as LASSO and SVM at different training data sizes. Therefore, in high-dimensional classification analysis with imbalanced data and correlated features, our approach can efficiently and accurately determine the sample size needed for machine-learning based classification.

Maintained by Xinying Fang. Last updated 2 years ago.

0.5 match 1 star 2.00 score 2 scripts