Showing 15 of 15 results
business-science
timetk: A Tool Kit for Working with Time Series
Easy visualization, wrangling, and feature engineering of time series data for forecasting and machine learning prediction. Consolidates and extends time series functionality from packages including 'dplyr', 'stats', 'xts', 'forecast', 'slider', 'padr', 'recipes', and 'rsample'.
Maintained by Matt Dancho. Last updated 1 year ago.
coercion, coercion-functions, data-mining, dplyr, forecast, forecasting, forecasting-models, machine-learning, series-decomposition, series-signature, tibble, tidy, tidyquant, tidyverse, time, time-series, timeseries
626 stars 14.20 score 4.0k scripts 16 dependents
dboslab
expowo: An R package for mining global plant diversity and distribution data
Produces diversity estimates and species lists with associated global distribution for any vascular plant family and genus from 'Plants of the World Online' database <https://powo.science.kew.org/>, by interacting with the source code of each plant taxon page. It also creates global maps of species richness, graphics of species discoveries and nomenclatural changes over time. For more details
Maintained by Debora Zuanny. Last updated 8 days ago.
8 stars 7.44 score 64 scripts
gagolews
genieclust: Fast and Robust Hierarchical Clustering with Noise Points Detection
A retake on the Genie algorithm (Gagolewski, 2021 <DOI:10.1016/j.softx.2021.100722>), which is a robust hierarchical clustering method (Gagolewski, Bartoszuk, Cena, 2016 <DOI:10.1016/j.ins.2016.05.003>). It is now faster and more memory efficient; determining the whole cluster hierarchy for datasets of 10M points in low-dimensional Euclidean spaces or 100K points in high-dimensional ones takes only a minute or so. Allows clustering with respect to mutual reachability distances so that it can act as a noise point detector or a robustified version of 'HDBSCAN*' (that is able to detect a predefined number of clusters and hence does not depend on the somewhat fragile 'eps' parameter). The package also features an implementation of inequality indices (e.g., Gini and Bonferroni), external cluster validity measures (e.g., the normalised clustering accuracy, the adjusted Rand index, the Fowlkes-Mallows index, and normalised mutual information), and internal cluster validity indices (e.g., the Calinski-Harabasz, Davies-Bouldin, Ball-Hall, Silhouette, and generalised Dunn indices). See also the 'Python' version of 'genieclust' available on 'PyPI', which supports sparse data, more metrics, and even larger datasets.
Maintained by Marek Gagolewski. Last updated 12 days ago.
cluster-analysis, clustering, clustering-algorithm, data-analysis, data-mining, data-science, genie, hdbscan, hierarchical-clustering, hierarchical-clustering-algorithm, machine-learning, machine-learning-algorithms, mlpack, nmslib, python, python3, sparse, cpp, openmp
61 stars 7.33 score 13 scripts 5 dependents
petolau
TSrepr: Time Series Representations
Methods for representations (i.e. dimensionality reduction, preprocessing, feature extraction) of time series to support more accurate and effective time series data mining. Non-data adaptive, data adaptive, model-based and data dictated (clipped) representation methods are implemented. Various normalisation methods (min-max, z-score, Box-Cox, Yeo-Johnson) and forecasting accuracy measures are also implemented.
Maintained by Peter Laurinec. Last updated 5 years ago.
data-analysis, data-mining, data-mining-algorithms, data-science, representation, time-series, time-series-analysis, time-series-classification, time-series-clustering, time-series-data-mining, time-series-representations, cpp
97 stars 7.23 score 117 scripts
mingzehuang
latentcor: Fast Computation of Latent Correlations for Mixed Data
The first stand-alone R package for computation of latent correlation that takes into account all variable types (continuous/binary/ordinal/zero-inflated), comes with an optimized memory footprint, and is computationally efficient, essentially making latent correlation estimation almost as fast as rank-based correlation estimation. The estimation is based on latent copula Gaussian models. For continuous/binary types, see Fan, J., Liu, H., Ning, Y., and Zou, H. (2017). For ternary type, see Quan X., Booth J.G. and Wells M.T. (2018) <arXiv:1809.06255>. For truncated type or zero-inflated type, see Yoon G., Carroll R.J. and Gaynanova I. (2020) <doi:10.1093/biomet/asaa007>. For approximation method of computation, see Yoon G., Müller C.L. and Gaynanova I. (2021) <doi:10.1080/10618600.2021.1882468>. The latter method uses multi-linear interpolation originally implemented in the R package <https://cran.r-project.org/package=chebpol>.
Maintained by Mingze Huang. Last updated 3 years ago.
data-analysis, data-mining, data-processing, data-science, data-structures, machine-learning, mixed-types, statistics
16 stars 6.65 score 46 scripts 1 dependent
r-spark
sparklyr.flint: Sparklyr Extension for 'Flint'
This sparklyr extension makes 'Flint' time series library functionalities (<https://github.com/twosigma/flint>) easily accessible through R.
Maintained by Edgar Ruiz. Last updated 3 years ago.
apache-spark, data-analysis, data-mining, data-science, distributed, distributed-computing, flint, remote-clusters, spark, sparklyr, statistical-analysis, statistics, stats, summarization, summary-statistics, time-series, time-series-analysis, twosigma-flint
9 stars 6.46 score 54 scripts
jrdnbradford
readMDTable: Read Markdown Tables into Tibbles
Efficient reading of raw markdown tables into tibbles. Designed to accept content from strings, files, and URLs with the ability to extract and read multiple tables from markdown for analysis.
Maintained by Jordan Bradford. Last updated 2 months ago.
data, data-analysis, data-analytics, data-extraction, data-mining, data-science, markdown, markdown-parser, markdown-table, r-programming
7 stars 6.10 score 3 scripts 1 dependent
gi0na
ghypernet: Fit and Simulate Generalised Hypergeometric Ensembles of Graphs
Provides functions for model fitting and selection of generalised hypergeometric ensembles of random graphs (gHypEG). To learn how to use it, check the vignettes for a quick tutorial. Please reference its use as Casiraghi, G., Nanumyan, V. (2019) <doi:10.5281/zenodo.2555300> together with the relevant references from those listed below. The package is based on the research developed at the Chair of Systems Design, ETH Zurich. Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2016) <arXiv:1607.02441>. Casiraghi, G., Nanumyan, V., Scholtes, I., Schweitzer, F. (2017) <doi:10.1007/978-3-319-67256-4_11>. Casiraghi, G. (2017) <arXiv:1702.02048>. Brandenberger, L., Casiraghi, G., Nanumyan, V., Schweitzer, F. (2019) <doi:10.1145/3341161.3342926>. Casiraghi, G. (2019) <doi:10.1007/s41109-019-0241-1>. Casiraghi, G., Nanumyan, V. (2021) <doi:10.1038/s41598-021-92519-y>. Casiraghi, G. (2021) <doi:10.1088/2632-072X/ac0493>.
Maintained by Giona Casiraghi. Last updated 12 months ago.
data-mining, data-science, graphs, network, network-analysis, random-graph-generation, random-graphs
8 stars 5.68 score 20 scripts
beerda
nuggets: Extensible Data Pattern Searching Framework
Extensible framework for subgroup discovery (Atzmueller (2015) <doi:10.1002/widm.1144>), contrast patterns (Chen (2022) <doi:10.48550/arXiv.2209.13556>), emerging patterns (Dong (1999) <doi:10.1145/312129.312191>), association rules (Agrawal (1994) <https://www.vldb.org/conf/1994/P487.PDF>) and conditional correlations (Hájek (1978) <doi:10.1007/978-3-642-66943-9>). Both crisp (Boolean, binary) and fuzzy data are supported. It generates conditions in the form of elementary conjunctions, evaluates them on a dataset and checks the induced sub-data for interesting statistical properties. A user-defined function may be supplied to evaluate each generated condition and search for custom patterns.
Maintained by Michal Burda. Last updated 19 days ago.
association-rule-mining, contrast-pattern-mining, data-mining, fuzzy, knowledge-discovery, pattern-recognition, cpp, openmp
2 stars 5.38 score 10 scripts
gagolews
genie: Fast, Robust, and Outlier Resistant Hierarchical Clustering
Includes the reference implementation of Genie - a hierarchical clustering algorithm that links two point groups in such a way that an inequity measure (namely, the Gini index) of the cluster sizes does not significantly increase above a given threshold. This method most often outperforms many other data segmentation approaches in terms of clustering quality as tested on a wide range of benchmark datasets. At the same time, Genie retains the high speed of the single linkage approach, therefore it is also suitable for analysing larger data sets. For more details see (Gagolewski et al. 2016 <DOI:10.1016/j.ins.2016.05.003>). For an even faster and more feature-rich implementation, including, amongst others, noise point detection, see the 'genieclust' package (Gagolewski, 2021 <DOI:10.1016/j.softx.2021.100722>).
Maintained by Marek Gagolewski. Last updated 3 years ago.
cluster, cluster-analysis, clustering, data-analysis, data-mining, data-science, datascience, genie, hierarchical-clustering-algorithm, machine-learning, machine-learning-algorithms, outliers, cpp, openmp
22 stars 4.55 score 16 scripts
dosorio
rPanglaoDB: Download and Merge Single-Cell RNA-Seq Data from the PanglaoDB Database
Download and merge labeled single-cell RNA-seq data from the PanglaoDB <https://panglaodb.se/> into a Seurat object.
Maintained by Daniel Osorio. Last updated 2 years ago.
data-integration, data-mining, rna-seq, single-cell, single-cell-rna-seq
26 stars 4.41 score 20 scripts
bioc
getDEE2: Programmatic access to the DEE2 RNA expression dataset
Digital Expression Explorer 2 (or DEE2 for short) is a repository of processed RNA-seq data in the form of counts. It was designed so that researchers could undertake re-analysis and meta-analysis of published RNA-seq studies quickly and easily. As of April 2020, over 1 million SRA datasets have been processed. This package provides an R interface to access these expression data. More information about the DEE2 project can be found at the project homepage (http://dee2.io) and main publication (https://doi.org/10.1093/gigascience/giz022).
Maintained by Mark Ziemann. Last updated 3 months ago.
geneexpression, transcriptomics, sequencing, bioinformatics, data-mining, genomics, rna-expression, rna-seq
4 stars 4.20 score 5 scripts
sherrisherry
cleandata: To Inspect and Manipulate Data; and to Keep Track of This Process
Functions to work with data frames to prepare data for further analysis. The functions for imputation, encoding, partitioning, and other manipulation can produce log files to keep track of the process.
Maintained by Sherry Zhao. Last updated 6 years ago.
data-analysis, data-mining, machine-learning, wrangling
3 stars 3.72 score 35 scripts
mlaib
SFtools: Space Filling Based Tools for Data Mining
Contains space filling based tools for machine learning and data mining. Some functions offer several computational techniques and handle out-of-memory issues with large data sets by using the ff package.
Maintained by Mohamed Laib. Last updated 4 years ago.
3.00 score 6 scripts
jeangolay
IDmining: Intrinsic Dimension for Data Mining
Contains techniques for mining large and high-dimensional data sets by using the concept of Intrinsic Dimension (ID). Here the ID is not necessarily an integer: it is extended to fractal dimensions. The Morisita estimator is used for the ID estimation, but other tools are included as well.
Maintained by Jean Golay. Last updated 4 years ago.
data-mining, feature-selection, intrinsic-dimension, morisita-estimator
1 star 2.90 score 16 scripts