Showing 21 of 21 results
imbalance: Preprocessing Algorithms for Imbalanced Datasets
Class imbalance usually damages the performance of classifiers. Thus, it is important to treat data before applying a classifier algorithm. This package includes recent resampling algorithms from the literature: (Barua et al. 2014) <doi:10.1109/tkde.2012.232>; (Das et al. 2015) <doi:10.1109/tkde.2014.2324567>; (Zhang et al. 2014) <doi:10.1016/j.inffus.2013.12.003>; (Gao et al. 2014) <doi:10.1016/j.neucom.2014.02.006>; (Almogahed et al. 2014) <doi:10.1007/s00500-014-1484-5>. It also includes a useful interface to perform oversampling.
Maintained by Ignacio Cordón. Last updated 5 years ago.
binary-classification, imbalanced-data, oversampling, openblas, cpp
22.6 match 36 stars 7.14 score 98 scripts
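A minimal usage sketch of the oversampling interface described above; the glass0 example dataset, the imbalanceRatio() helper and the "MWMOTE" method name are assumptions about the package's current API.

# Rebalance a binary dataset with the imbalance package (assumed interface).
library(imbalance)

data(glass0)                               # example dataset assumed to ship with the package
imbalanceRatio(glass0)                     # minority/majority ratio before resampling
balanced <- oversample(glass0,
                       ratio = 0.8,        # target minority/majority ratio
                       method = "MWMOTE",  # one of the cited algorithms
                       classAttr = "Class")
table(balanced$Class)                      # class counts after oversampling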
smotefamily: A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE
A collection of various oversampling techniques developed from SMOTE is provided. SMOTE is an oversampling technique which synthesizes a new minority instance between a pair consisting of one minority instance and one of its K nearest neighbors. The other techniques adopt this concept with other criteria in order to generate a balanced dataset for the class imbalance problem.
Maintained by Wacharasak Siriseriwan. Last updated 1 year ago.
7.2 match 2 stars 5.93 score 512 scripts 8 dependents
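A short sketch of the basic SMOTE() call; the (X, target, K, dup_size) argument names and the $data element with its "class" column are assumptions about the package's interface.

# Generate synthetic minority instances with smotefamily (assumed interface).
library(smotefamily)

set.seed(1)
X <- data.frame(f1 = rnorm(300), f2 = rnorm(300))   # numeric features only
y <- c(rep("majority", 270), rep("minority", 30))   # imbalanced class labels

out <- SMOTE(X, target = y, K = 5, dup_size = 2)  # K nearest neighbours, 2 synthetic copies per minority case
table(out$data$class)                             # balanced data returned in out$data (assumed element)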
SMOTEWB: Imbalanced Resampling using SMOTE with Boosting (SMOTEWB)
Provides the SMOTE with Boosting (SMOTEWB) algorithm. See F. Sağlam, M. A. Cengiz (2022) <doi:10.1016/j.eswa.2022.117023>. It is a SMOTE-based resampling technique which creates synthetic data on the links between nearest neighbors. SMOTEWB uses boosting weights to determine where to generate new samples and automatically decides the number of neighbors for each sample. It is robust to noise and outperforms most of the alternatives according to the Matthews Correlation Coefficient metric. Alternative resampling methods are also available in the package.
Maintained by Fatih Saglam. Last updated 2 months ago.
9.2 match 1 stars 3.77 score 13 scripts 1 dependents
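A toy sketch of the SMOTEWB resampler; the SMOTEWB(x, y) entry point and the y_new element of its result are assumptions and should be checked against the package manual.

# SMOTE with Boosting on a small imbalanced problem (assumed interface).
library(SMOTEWB)

set.seed(1)
x <- matrix(rnorm(400), ncol = 2)                 # 200 observations, 2 features
y <- factor(c(rep("neg", 180), rep("pos", 20)))   # imbalanced labels

res <- SMOTEWB(x = x, y = y)   # boosting weights decide where synthetic points are placed
table(res$y_new)               # resampled labels (element name is an assumption)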
PTXQC: Quality Report Generation for MaxQuant and mzTab Results
Generates Proteomics (PTX) quality control (QC) reports for shotgun LC-MS data analyzed with the MaxQuant software suite (from .txt files) or mzTab files (ideally from the OpenMS 'QualityControl' tool). Reports are customizable (target thresholds, subsetting) and available in HTML or PDF format. Published in J. Proteome Res., Proteomics Quality Control: Quality Control Software for MaxQuant Results (2015) <doi:10.1021/acs.jproteome.5b00780>.
Maintained by Chris Bielow. Last updated 1 year ago.
drag-and-drop, hacktoberfest, heatmap, match-between-runs, maxquant, metric, mztab, openms, proteomics, quality-control, quality-metrics, report
3.6 match 42 stars 9.35 score 105 scripts 1 dependents
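A minimal sketch of report generation; the createReport() entry point reflects the documented interface as far as recalled, and the folder path is a placeholder.

# Build a QC report from a MaxQuant 'txt' output folder (assumed interface).
library(PTXQC)

txt_folder <- "path/to/combined/txt"   # placeholder: MaxQuant results folder containing the .txt files
createReport(txt_folder)               # writes the HTML/PDF report next to the txt files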
ModTools: Building Regression and Classification Models
Consistent user interface to the most common regression and classification algorithms, such as random forest, neural networks, C5 trees and support vector machines, complemented with a handful of auxiliary functions, such as variable importance and a tuning function for the parameters.
Maintained by Andri Signorell. Last updated 2 months ago.
5.3 match 2 stars 4.20 score 3 scripts
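A brief sketch of the consistent model-fitting interface; the FitMod() function and its fitfn argument are assumptions about the package's API.

# Fit two classifiers through one interface and predict (assumed ModTools API).
library(ModTools)

fit_rf  <- FitMod(Species ~ ., data = iris, fitfn = "randomForest")  # random forest
fit_svm <- FitMod(Species ~ ., data = iris, fitfn = "svm")           # support vector machine
predict(fit_rf, newdata = iris[1:5, ])                               # predicted classes for a few rows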
JOUSBoost: Implements Under/Oversampling for Probability Estimation
Implements under/oversampling for probability estimation. To be used with machine learning methods such as AdaBoost, random forests, etc.
Maintained by Matthew Olson. Last updated 8 years ago.
5.4 match 3.33 score 43 scripts
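A sketch of probability estimation via the jous() under/oversampling wrapper; the class_func/pred_func plug-in style, the -1/+1 label coding and the keep_models/type arguments are assumptions about the package's API.

# Calibrated probabilities from AdaBoost via jittered under/oversampling (assumed JOUSBoost API).
library(JOUSBoost)

set.seed(1)
X <- matrix(rnorm(1000), ncol = 2)         # 500 observations, 2 features
y <- ifelse(X[, 1] + X[, 2] > 0, 1, -1)    # labels coded as -1 / +1 (assumed requirement)

class_func <- function(X, y) adaboost(X, y, tree_depth = 2, n_rounds = 50)  # base learner
pred_func  <- function(fit, X) predict(fit, X)                              # label predictions

jous_fit <- jous(X, y, class_func = class_func, pred_func = pred_func,
                 delta = 10, keep_models = TRUE)   # keep models so predict() works afterwards
phat <- predict(jous_fit, X, type = "prob")        # estimated P(y = +1 | x)
head(phat)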
nestedcv: Nested Cross-Validation with 'glmnet' and 'caret'
Implements nested k*l-fold cross-validation for lasso and elastic-net regularised linear models via the 'glmnet' package and other machine learning models via the 'caret' package <doi:10.1093/bioadv/vbad048>. Cross-validation of 'glmnet' alpha mixing parameter and embedded fast filter functions for feature selection are provided. Described as double cross-validation by Stone (1977) <doi:10.1111/j.2517-6161.1977.tb01603.x>. Also implemented is a method using outer CV to measure unbiased model performance metrics when fitting Bayesian linear and logistic regression shrinkage models using the horseshoe prior over parameters to encourage a sparse model as described by Piironen & Vehtari (2017) <doi:10.1214/17-EJS1337SI>.
Maintained by Myles Lewis. Last updated 7 days ago.
2.3 match 12 stars 7.92 score 46 scripts
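A small nested cross-validation sketch; the nestcv.glmnet() call with (y, x), alphaSet and n_outer_folds reflects the documented interface as far as recalled, but treat the argument names as assumptions.

# Nested CV of an elastic-net classifier on simulated data (assumed nestedcv API).
library(nestedcv)

set.seed(1)
x <- matrix(rnorm(200 * 25), nrow = 200)       # 25 candidate features
colnames(x) <- paste0("var", 1:25)
y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))   # binary outcome driven by two features

fit <- nestcv.glmnet(y, x,
                     family = "binomial",
                     alphaSet = c(0.5, 1),     # elastic-net mixing values tuned in the inner CV
                     n_outer_folds = 5)
summary(fit)                                   # unbiased outer-CV performance of the tuned model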
randomUniformForest: Random Uniform Forests for Classification, Regression and Unsupervised Learning
Ensemble model, for classification, regression and unsupervised learning, based on a forest of unpruned and randomized binary decision trees. Each tree is grown by sampling, with replacement, a set of variables at each node. Each cut-point is generated randomly, according to the continuous Uniform distribution. For each tree, data are either bootstrapped or subsampled. The unsupervised mode introduces clustering, dimension reduction and variable importance, using a three-layer engine. Random Uniform Forests are mainly aimed at lowering correlation between trees (or tree residuals), providing a deep analysis of variable importance, and allowing native distributed and incremental learning.
Maintained by Saip Ciss. Last updated 3 years ago.
4.7 match 3 stars 3.77 score 99 scripts
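A short classification sketch; the formula interface and the ntree argument are assumptions about the randomUniformForest() call.

# Train a random uniform forest and check hold-out predictions (assumed interface).
library(randomUniformForest)

set.seed(1)
idx <- sample(nrow(iris), 100)                                            # training rows
ruf <- randomUniformForest(Species ~ ., data = iris[idx, ], ntree = 100)
preds <- predict(ruf, iris[-idx, ])                                       # predicted classes
table(preds, iris$Species[-idx])                                          # confusion table on hold-out rows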
imbalanceDatRel: Relocated Data Oversampling for Imbalanced Data Classification
Relocates data oversampled by a specific oversampling method to cover the area determined by pure and proper class cover catch digraphs (PCCCD). This prevents any data from being generated in the class-overlapping area.
Maintained by Fatih Saglam. Last updated 11 months ago.
5.5 match 2.70 score
mldr.resampling: Resampling Algorithms for Multi-Label Datasets
Collection of state-of-the-art multi-label resampling algorithms. The objective of these algorithms is to achieve balance in multi-label datasets.
Maintained by Miguel Ángel Dávila. Last updated 1 year ago.
5.2 match 1 stars 2.70 score 7 scripts
espadon: Easy Study of Patient DICOM Data in Oncology
Exploitation, processing and 2D-3D visualization of DICOM-RT files (structures, dosimetry, imagery) for medical physics and clinical research, from a patient-oriented perspective.
Maintained by Cathy Fontbonne. Last updated 1 month ago.
4.5 match 2.85 score
ILoReg: a tool for high-resolution cell population identification from scRNA-Seq data
ILoReg is a tool for identification of cell populations from scRNA-seq data. In particular, ILoReg is useful for finding cell populations with subtle transcriptomic differences. The method utilizes a self-supervised learning method, called Iterative Clustering Projection (ICP), to find cluster probabilities, which are used in noise reduction prior to PCA and the subsequent hierarchical clustering and t-SNE steps. Additionally, functions for differential expression analysis to find gene markers for the populations and gene expression visualization are provided.
Maintained by Johannes Smolander. Last updated 5 months ago.
singlecell, software, clustering, dimensionreduction, rnaseq, visualization, transcriptomics, datarepresentation, differentialexpression, transcription, geneexpression
2.3 match 5 stars 4.88 score 2 scripts
scutr: Balancing Multiclass Datasets for Classification Tasks
Imbalanced training datasets impede many popular classifiers. To balance training data, a combination of oversampling minority classes and undersampling majority classes is useful. This package implements the SCUT (SMOTE and Cluster-based Undersampling Technique) algorithm as described in Agrawal et al. (2015) <doi:10.5220/0005595502260234>. Their paper uses model-based clustering and synthetic oversampling to balance multiclass training datasets, although other resampling methods are provided in this package.
Maintained by Keenan Ganz. Last updated 1 year ago.
2.8 match 2 stars 3.68 score 16 scripts 1 dependents
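A toy sketch of SCUT balancing; the SCUT() signature and the oversample_smote / undersample_mclust helper names are assumptions about the package's API.

# Balance a three-class dataset with SCUT (assumed scutr interface).
library(scutr)

set.seed(1)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300),
                class = rep(c("a", "b", "c"), times = c(200, 70, 30)))  # imbalanced classes

balanced <- SCUT(d, cls_col = "class",
                 oversample  = oversample_smote,      # SMOTE for minority classes
                 undersample = undersample_mclust)    # model-based clustering for majority classes
table(balanced$class)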
daltoolboxdp: Data Pre-Processing Extensions
An important aspect of data analytics is data management support for artificial intelligence, in particular preparing data correctly. This package provides extensions to support data preparation in terms of both data sampling and data engineering. Overall, the package provides researchers with a comprehensive set of functionalities for data science based on experiment lines, promoting ease of use, extensibility, and integration with various tools and libraries. Information on Experiment Lines is based on Ogasawara et al. (2009) <doi:10.1007/978-3-642-02279-1_20>.
Maintained by Eduardo Ogasawara. Last updated 3 months ago.
3.0 match 1 stars 3.26 score 12 scripts
hmer: History Matching and Emulation Package
A set of objects and functions for Bayes Linear emulation and history matching. Core functionality includes automated training of emulators to data, diagnostic functions to ensure suitability, and a variety of proposal methods for generating 'waves' of points. For details on the mathematical background, there are many papers available on the topic (see references attached to function help files or the below references); for details of the functions in this package, consult the manual or help files. Iskauskas, A, et al. (2024) <doi:10.18637/jss.v109.i10>. Bower, R.G., Goldstein, M., and Vernon, I. (2010) <doi:10.1214/10-BA524>. Craig, P.S., Goldstein, M., Seheult, A.H., and Smith, J.A. (1997) <doi:10.1007/978-1-4612-2290-3_2>.
Maintained by Andrew Iskauskas. Last updated 13 days ago.
1.3 match 16 stars 7.19 score 37 scripts
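A compressed sketch of the train-emulators-then-propose-points loop; the emulator_from_data() and generate_new_design() calls and the list(val, sigma) target format are assumptions based on the package's documented workflow, and the simulator is a toy stand-in.

# Bayes linear emulation and one wave of history matching (assumed hmer workflow).
library(hmer)

sim <- function(x, y) x^2 + sin(3 * y)   # toy simulator with two inputs, one output

set.seed(1)
design <- data.frame(x = runif(30, -1, 1), y = runif(30, -1, 1))
design$f <- sim(design$x, design$y)      # training runs

ranges  <- list(x = c(-1, 1), y = c(-1, 1))                                 # input ranges
ems     <- emulator_from_data(design, output_names = "f", ranges = ranges)  # trained emulators
targets <- list(f = list(val = 0.5, sigma = 0.05))                          # observation and uncertainty (assumed format)

new_pts <- generate_new_design(ems, 50, targets)   # 50 non-implausible proposals for the next wave
head(new_pts)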
PDtoolkit: Collection of Tools for PD Rating Model Development and Validation
The goal of this package is to cover the most common steps in probability of default (PD) rating model development and validation. The main procedures available are those that refer to univariate, bivariate and multivariate analysis, calibration and validation. Along with the accompanying 'monobin' and 'monobinShiny' packages, 'PDtoolkit' provides functions suitable for different data transformation and modeling tasks such as: imputation, monotonic binning of numeric risk factors, binning of categorical risk factors, weights of evidence (WoE) and information value (IV) calculations, WoE coding (replacement of risk factor modalities with WoE values), risk factor clustering, area under curve (AUC) calculation and others. Additionally, the package provides a set of validation functions for testing the homogeneity, heterogeneity, discriminatory and predictive power of the model.
Maintained by Andrija Djurovic. Last updated 1 year ago.
1.8 match 14 stars 4.78 score 86 scripts
FeaLect: Scores Features for Feature Selection
For each feature, a score is computed that can be useful for feature selection. Several random subsets are sampled from the input data and, for each random subset, various linear models are fitted using the LARS method. A score is assigned to each feature based on the tendency of the LASSO to include that feature in the models. Finally, the average score and the models are returned as the output. Features with relatively low scores are recommended to be ignored because they can lead to overfitting of the model to the training data. Moreover, for each random subset, the best set of features in terms of global error is returned. These are useful for applying Bolasso, an alternative feature selection method that recommends the intersection of feature subsets.
Maintained by Habil Zare. Last updated 5 years ago.
1.8 match 2 stars 3.44 score 23 scripts
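A small feature-scoring sketch; the F/L argument names, the named-label requirement and the log.scores element are assumptions about the FeaLect() interface.

# Score features by their tendency to enter LASSO models (assumed FeaLect interface).
library(FeaLect)

set.seed(1)
F <- matrix(rnorm(60 * 20), nrow = 60,
            dimnames = list(paste0("s", 1:60), paste0("g", 1:20)))  # 60 samples, 20 features
L <- rbinom(60, 1, plogis(F[, 1] - F[, 2]))                         # labels driven by the first two features
names(L) <- rownames(F)                                             # labels named by sample (assumed requirement)

scores <- FeaLect(F = F, L = L, total.num.of.models = 50, talk = FALSE)
sort(scores$log.scores, decreasing = TRUE)[1:5]                     # top-ranked features (assumed element)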
gssrdoc: Document General Social Survey Variable
The General Social Survey (GSS) is a long-running, mostly annual survey of US households. It is administered by the National Opinion Research Center (NORC). This package contains a tibble with information on the survey variables, together with every variable documented as an R help page. For more information on the GSS see <http://gss.norc.org>.
Maintained by Kieran Healy. Last updated 11 months ago.
2.0 match 2.28 score 38 scripts
sambia: A Collection of Techniques Correcting for Sample Selection Bias
A collection of various techniques for correcting statistical models for sample selection bias is provided. In particular, the resampling-based methods "stochastic inverse-probability oversampling" and "parametric inverse-probability bagging" are provided, which generate synthetic observations for correcting classifiers for biased samples resulting from stratified random sampling. For further information, see the article by Krautenbacher, Theis, and Fuchs (2017) <doi:10.1155/2017/7847531>. The methods may be used for further purposes where weighting and the generation of new observations are needed.
Maintained by Norbert Krautenbacher. Last updated 7 years ago.
2.2 match 1.18 score 15 scripts
SmartMeterAnalytics: Methods for Smart Meter Data Analysis
Methods for analysis of energy consumption data (electricity, gas, water) at different data measurement intervals. The package provides feature extraction methods and algorithms to prepare data for data mining and machine learning applications. Detailed descriptions of the methods and their application can be found in Hopf (2019, ISBN:978-3-86309-669-4) "Predictive Analytics for Energy Efficiency and Energy Retailing" <doi:10.20378/irbo-54833> and Hopf et al. (2016) <doi:10.1007/s12525-018-0290-9> "Enhancing energy efficiency in the residential sector with smart meter data analytics".
Maintained by Konstantin Hopf. Last updated 5 years ago.
1.9 match 1.00 score