lingtypology:Linguistic Typology and Mapping
Provides R with the Glottolog database <> and additional tools for linguistic mapping. The Glottolog database contains a catalogue of the world's languages. This package helps researchers make linguistic maps, following the philosophy of the Cross-Linguistic Linked Data project <>, which facilitates uniform access to data across publications. A tutorial for this package is available on GitHub Pages <> and in the package vignette. Maps created with this package can be used for both research and linguistics teaching. In addition, the package can download data from typological databases such as WALS, AUTOTYP, and others, and can be used to create your own database website.
Maintained by George Moroz. Last updated 5 months ago.
lfl:Linguistic Fuzzy Logic
Various algorithms related to linguistic fuzzy logic: mining for linguistic fuzzy association rules, composition of fuzzy relations, performing perception-based logical deduction (PbLD), and forecasting time-series using fuzzy rule-based ensemble (FRBE). The package also contains basic fuzzy-related algebraic functions capable of handling missing values in different styles (Bochvar, Sobocinski, Kleene etc.), computation of Sugeno integrals and fuzzy transform.
Maintained by Michal Burda. Last updated 4 months ago.
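To make the three missing-value styles named in the 'lfl' description concrete, here is a minimal Python sketch of a fuzzy conjunction (Gödel minimum) under Bochvar, Sobocinski, and Kleene handling. It is not the package's R code: the function names and the use of `None` for a missing truth value are assumptions made for illustration.

```python
from typing import Optional

Truth = Optional[float]  # a degree in [0, 1], or None for "missing"


def bochvar_and(a: Truth, b: Truth) -> Truth:
    """Bochvar style: a missing operand makes the whole result missing."""
    if a is None or b is None:
        return None
    return min(a, b)


def sobocinski_and(a: Truth, b: Truth) -> Truth:
    """Sobocinski style: a missing operand is ignored (the other one wins)."""
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)


def kleene_and(a: Truth, b: Truth) -> Truth:
    """Kleene style: missing means 'unknown', so a known 0 still decides."""
    if a is not None and b is not None:
        return min(a, b)
    if a == 0 or b == 0:
        return 0.0
    return None


if __name__ == "__main__":
    for name, f in [("Bochvar", bochvar_and),
                    ("Sobocinski", sobocinski_and),
                    ("Kleene", kleene_and)]:
        print(name, f(0.7, None), f(0.0, None), f(0.3, 0.8))
```

With fully known operands all three variants agree; they differ only in how a missing operand propagates.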
interlineaR:Importing Interlinearized Corpora and Dictionaries as Produced by Descriptive Linguistics Software
Interlinearized glossed texts (IGT) are used in descriptive linguistics for representing a morphological analysis of a text through a morpheme-by-morpheme gloss. 'InterlineaR' provides a set of functions that target several popular formats of IGT ('SIL Toolbox', 'EMELD XML') and that turn an IGT into a set of data frames following a relational model (the tables represent the different linguistic units: texts, sentences, words, morphemes). The same pieces of software ('SIL FLEX', 'SIL Toolbox') typically produce dictionaries of the morphemes used in the glosses. 'InterlineaR' provides a function for turning the LIFT XML dictionary format into a set of data frames following a relational model in order to represent the dictionary entries, the sense(s) attached to the entries, the example(s) attached to senses, etc.
Maintained by Sylvain Loiseau. Last updated 7 years ago.
lingglosses:Interlinear Glossed Linguistic Examples and Abbreviation Lists Generation
Helps to render interlinear glossed linguistic examples in html 'rmarkdown' documents and then semi-automatically compiles the list of glosses at the end of the document. It also provides a database of linguistic glosses.
Maintained by George Moroz. Last updated 10 days ago.
phonics:Phonetic Spelling Algorithms
Provides a collection of phonetic algorithms including Soundex, Metaphone, NYSIIS, Caverphone, and others. The package is documented in <doi:10.18637/jss.v095.i08>.
Maintained by James Howard. Last updated 4 years ago.
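As an illustration of the kind of phonetic algorithm collected in 'phonics', here is a minimal Python sketch of classic American Soundex. It is not the package's implementation: the function name `soundex` and the choice to drop non-ASCII and non-letter characters are assumptions.

```python
# Map each coded consonant to its Soundex digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex code of `word`."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")  # the first letter's code can block a repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]  # pad with zeros or truncate to 4 characters


if __name__ == "__main__":
    for name in ["Robert", "Rupert", "Ashcraft", "Tymczak", "Pfister"]:
        print(name, soundex(name))
```

Names that sound alike, such as "Robert" and "Rupert", receive the same code (R163), which is what makes the scheme useful for fuzzy name matching.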
linguisticsdown:Easy Linguistics Document Writing with R Markdown
Provides 'Shiny gadgets' to search, type, and insert IPA symbols into documents or scripts, requiring only knowledge about phonetics or 'X-SAMPA'. Also provides functions to facilitate the rendering of IPA symbols in 'LaTeX' and PDF format, making IPA symbols properly rendered in all output formats. A minimal R Markdown template for authoring linguistics-related documents is also bundled with the package. Some helper functions to facilitate authoring with R Markdown are also provided.
Maintained by Yongfu Liao. Last updated 6 years ago.
ngramr:Retrieve and Plot Google n-Gram Data
Retrieve and plot word frequencies through time from the "Google Ngram Viewer" <>.
Maintained by Sean Carmody. Last updated 2 months ago.
textgRid:Praat TextGrid Objects in R
The software application Praat can be used to annotate waveform data (e.g., to mark intervals of interest or to label events). (See <> for more information about Praat.) These annotations are stored in a Praat TextGrid object, which consists of a number of interval tiers and point tiers. An interval tier consists of sequential (i.e., not overlapping) labeled intervals. A point tier consists of labeled events that have no duration. The 'textgRid' package provides S4 classes, generics, and methods for accessing information that is stored in Praat TextGrid objects.
Maintained by Patrick Reidy. Last updated 7 years ago.
mclm:Mastering Corpus Linguistics Methods
Read, inspect and process corpus files for quantitative corpus linguistics. Obtain concordances via regular expressions, tokenize texts, and compute frequencies and association measures. Useful for collocation analysis, keyword analysis and variationist studies (comparison of linguistic variants and of linguistic varieties).
Maintained by Mariana Montes. Last updated 2 years ago.
phonfieldwork:Linguistic Phonetic Fieldwork Tools
Phonetic research and experiments involve many routine tasks. These include creating a presentation that contains all stimuli, renaming and concatenating multiple sound files recorded during a session, automatic annotation in 'Praat' TextGrids (one of the sound annotation standards provided by the 'Praat' software, see Boersma & Weenink 2020 <>), creating an HTML table with annotations and spectrograms, and converting between multiple formats ('Praat' TextGrid, 'ELAN', 'EXMARaLDA', 'Audacity', subtitles '.srt', and 'FLEx' flextext). All of these tasks can be solved by a mixture of different tools (any programming language has programs for automatic renaming, and Praat contains scripts for concatenating and renaming files, etc.). 'phonfieldwork' provides functionality that makes it easier to solve those tasks independently of any additional tools. You can also compare the functionality with other packages: 'rPraat' <>, 'textgRid' <>.
Maintained by George Moroz. Last updated 8 months ago.
runes:Convert Strings to Elder Futhark Runes
Convert a string of text characters to Elder Futhark Runes <>.
Maintained by Bryan Jenks. Last updated 4 years ago.
DramaAnalysis:Analysis of Dramatic Texts
Analysis of preprocessed dramatic texts, with respect to literary research. The package provides functions to analyze and visualize information about characters, stage directions, the dramatic structure and the text itself. The dramatic texts are expected to be in CSV format; such texts can be installed from within the package, and sample texts are provided. The package and the reasoning behind it are described in Reiter et al. (2017) <doi:10.18420/in2017_119>.
Maintained by Nils Reiter. Last updated 4 years ago.
glottospace:Language Mapping and Geospatial Analysis of Linguistic and Cultural Data
Streamlined workflows for geolinguistic analysis, including: accessing global linguistic and cultural databases, data import, data entry, data cleaning, data exploration, mapping, visualization and export.
Maintained by Rui Dong. Last updated 3 months ago.
lingmatch:Linguistic Matching and Accommodation
Measure similarity between texts. Offers a variety of processing tools and similarity metrics to facilitate flexible representation of texts and matching. Implements forms of Language Style Matching (Ireland & Pennebaker, 2010) <doi:10.1037/a0020386> and Latent Semantic Analysis (Landauer & Dumais, 1997) <doi:10.1037/0033-295X.104.2.211>.
Maintained by Micah Iserman. Last updated 25 days ago.
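The Language Style Matching score mentioned in the 'lingmatch' description can be sketched in a few lines of Python. The per-category formula, 1 - |p1 - p2| / (p1 + p2 + 0.0001), follows Ireland & Pennebaker (2010); everything else here is an assumption for illustration: the toy `CATEGORIES` dictionary, the whitespace tokenizer, and the function names are hypothetical and are not the package's API.

```python
# Hypothetical toy function-word categories; real dictionaries are far larger.
CATEGORIES = {
    "articles": {"a", "an", "the"},
    "negations": {"no", "not", "never"},
}


def category_rates(text: str) -> dict:
    """Percentage of tokens in `text` that fall into each category."""
    tokens = text.lower().split()
    total = len(tokens)
    if total == 0:
        return {name: 0.0 for name in CATEGORIES}
    return {name: 100.0 * sum(tok in words for tok in tokens) / total
            for name, words in CATEGORIES.items()}


def category_lsm(p1: float, p2: float) -> float:
    """Style matching for one category; 0.0001 avoids division by zero."""
    return 1.0 - abs(p1 - p2) / (p1 + p2 + 0.0001)


def lsm_score(text_a: str, text_b: str) -> float:
    """Mean per-category matching score between two texts."""
    rates_a, rates_b = category_rates(text_a), category_rates(text_b)
    scores = [category_lsm(rates_a[c], rates_b[c]) for c in CATEGORIES]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(lsm_score("the cat sat on a mat", "a dog ate the bone"))
    print(lsm_score("the cat", "not cat"))
```

Scores close to 1 indicate that two texts use function-word categories at similar rates; scores close to 0 indicate that they do not.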
sdamr:Statistics: Data Analysis and Modelling
Data sets and functions to support the books "Statistics: Data analysis and modelling" by Speekenbrink, M. (2021) <> and "An R companion to Statistics: data analysis and modelling" by Speekenbrink, M. (2021) <>. All datasets analysed in these books are provided in this package. In addition, the package provides functions to compute sample statistics (variance, standard deviation, mode), create raincloud and enhanced Q-Q plots, and expand Anova results into omnibus tests and tests of individual contrasts.
Maintained by Maarten Speekenbrink. Last updated 1 month ago.
text2vec:Modern Text Mining Framework for R
Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities. This package provides a source-agnostic streaming API, which allows researchers to perform analysis of collections of documents which are larger than available RAM. All core functions are parallelized to benefit from multicore machines.
Maintained by Dmitriy Selivanov. Last updated 7 months ago.
udpipe:Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit
This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Next to text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at <>. The techniques are explained in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>. The toolkit also contains functionality for commonly used data manipulations on texts that are enriched with the output of the parser. This includes functionality and algorithms for collocations, token co-occurrence, document term matrix handling, term frequency inverse document frequency calculations, information retrieval metrics (Okapi BM25), handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring, and semantic similarity analysis.
Maintained by Jan Wijffels. Last updated 2 years ago.
SemNetCleaner:An Automated Cleaning Tool for Semantic and Linguistic Data
Implements several functions that automate the cleaning and spell-checking of text data. Also converges, finalizes, removes plurals and continuous strings, and puts text data in binary format for semantic network analysis. Uses the 'SemNetDictionaries' package to make the cleaning process more accurate, efficient, and reproducible.
Maintained by Alexander P. Christensen. Last updated 3 years ago.
corpora:Statistics and Data Sets for Corpus Frequency Data
Utility functions for the statistical analysis of corpus frequency data. This package is a companion to the open-source course "Statistical Inference: A Gentle Introduction for Computational Linguists and Similar Creatures" ('SIGIL').
Maintained by Stephanie Evert. Last updated 1 month ago.
languageR:Analyzing Linguistic Data: A Practical Introduction to Statistics
Data sets exemplifying statistical methods, and some facilitatory utility functions used in "Analyzing Linguistic Data: A practical introduction to statistics using R", Cambridge University Press, 2008.
Maintained by R. H. Baayen. Last updated 6 years ago.
regioncode:Convert Region Names and Division Codes of China Over Years
A tool to overcome the difficulties of converting various region names and administrative division codes of Chinese regions. The current version enables seamless conversion between Chinese regions' formal names, commonly used names, and codes at the city level from 1986 to 2019.
Maintained by Yue Hu. Last updated 3 months ago.
qlcData:Processing Data for Quantitative Language Comparison
Functionality to read, recode, and transcode data as used in quantitative language comparison, specifically to deal with multilingual orthographic variation (Moran & Cysouw (2018) <doi:10.5281/zenodo.1296780>) and with the recoding of nominal data.
Maintained by Michael Cysouw. Last updated 9 months ago.
ACSWR:A Companion Package for the Book "A Course in Statistics with R"
A book designed to meet the requirements of master's students. Tattar, P.N., Suresh, R., and Manjunath, B.G. "A Course in Statistics with R", J. Wiley, ISBN 978-1-119-15272-9.
Maintained by Prabhanjan Tattar. Last updated 10 years ago.
qlcVisualize:Visualization for Quantitative Language Comparison
Collection of visualizations as used in quantitative language comparison. Currently implemented are visualizations dealing with nominal data with multiple levels ("level map" and "factor map"), and assistance for making weighted geographical Voronoi-maps ("weighted map").
Maintained by Michael Cysouw. Last updated 6 months ago.
jimstools:Tools for R
Maintained by Jimmy Briggs. Last updated 3 years ago.
nametagger:Named Entity Recognition in Texts using 'NameTag'
Wraps the 'nametag' library <>, allowing users to find and extract entities (names, persons, locations, addresses, ...) in raw text and to build their own entity recognition models. Based on a maximum entropy Markov model which is described in Strakova J., Straka M. and Hajic J. (2013) <>.
Maintained by Jan Wijffels. Last updated 1 year ago.
pseudobibeR:Aggregate Counts of Linguistic Features
Calculates the lexicogrammatical and functional features described by Biber (1985) <doi:10.1515/ling.1985.23.2.337> and widely used for text-type, register, and genre classification tasks.
Maintained by David Brown. Last updated 4 months ago.
GeoFIS:Spatial Data Processing for Decision Making
Methods for processing spatial data for decision-making. This package is an R implementation of methods provided by the open source software GeoFIS <> (Leroux et al. 2018) <doi:10.3390/agriculture8060073>. The main functionalities are the management zone delineation (Pedroso et al. 2010) <doi:10.1016/j.compag.2009.10.007> and data aggregation (Mora-Herrera et al. 2020) <doi:10.1016/j.compag.2020.105624>.
Maintained by Jean-Luc Lablée. Last updated 3 months ago.
politeness:Detecting Politeness Features in Text
Detecting markers of politeness in English natural language. This package allows researchers to easily visualize and quantify politeness between groups of documents. This package combines prior research on the linguistic markers of politeness. We thank the Spencer Foundation, the Hewlett Foundation, and Harvard's Institute for Quantitative Social Science for support.
Maintained by Mike Yeomans. Last updated 1 months ago.
pangoling:Access to Large Language Model Predictions
Provides access to word predictability estimates using large language models (LLMs) based on 'transformer' architectures via integration with the 'Hugging Face' ecosystem. The package interfaces with pre-trained neural networks and supports both causal/auto-regressive LLMs (e.g., 'GPT-2'; Radford et al., 2019) and masked/bidirectional LLMs (e.g., 'BERT'; Devlin et al., 2019, <doi:10.48550/arXiv.1810.04805>) to compute the probability of words, phrases, or tokens given their linguistic context. By enabling a straightforward estimation of word predictability, the package facilitates research in psycholinguistics, computational linguistics, and natural language processing (NLP).
Maintained by Bruno Nicenboim. Last updated 4 days ago.
act:Aligned Corpus Toolkit
The Aligned Corpus Toolkit (act) is designed for linguists who work with time-aligned transcription data. It offers functions to import and export various annotation file formats ('ELAN' .eaf, 'EXMARaLDA' .exb and 'Praat' .TextGrid files), create print transcripts in the style of conversation analysis, search transcripts (span searches across multiple annotations, search in normalized annotations, make concordances etc.), export and re-import search results (.csv and 'Excel' .xlsx format), create cuts for the search results (print transcripts, audio/video cuts using 'FFmpeg' and video subtitles in 'SubRip' .srt format), modify the data in a corpus (search/replace, delete, filter etc.), interact with 'Praat' using 'Praat'-scripts, and exchange data with the 'rPraat' package. The package is itself written in R and may be expanded by other users.
Maintained by Oliver Ehmer. Last updated 2 years ago.
doc2concrete:Measuring Concreteness in Natural Language
Models for detecting concreteness in natural language. This package is built in support of Yeomans (2021) <doi:10.1016/j.obhdp.2020.10.008>, which reviews linguistic models of concreteness in several domains. Here, we provide an implementation of the best-performing domain-general model (from Brysbaert et al., (2014) <doi:10.3758/s13428-013-0403-5>) as well as two pre-trained models for the feedback and plan-making domains.
Maintained by Mike Yeomans. Last updated 1 year ago.
qtkit:Quantitative Text Kit
Support package for the textbook "An Introduction to Quantitative Text Analysis for Linguists: Reproducible Research Using R" (Francom, 2024) <doi:10.4324/9781003393764>. Includes functions to acquire, clean, and analyze text data as well as functions to document and share the results of text analysis. The package is designed to be used in conjunction with the book, but can also be used as a standalone package for text analysis.
Maintained by Jerid Francom. Last updated 2 months ago.
RKorAPClient:'KorAP' Web Service Client Package
A client package that makes the 'KorAP' web service API accessible from R. The corpus analysis platform 'KorAP' has been developed as a scientific tool to make potentially large, stratified and multiply annotated corpora, such as the 'German Reference Corpus DeReKo' or the 'Corpus of the Contemporary Romanian Language CoRoLa', accessible for linguists to let them verify hypotheses and to find interesting patterns in real language use. The 'RKorAPClient' package provides access to 'KorAP' and the corpora behind it for user-created R code, as a programmatic alternative to the 'KorAP' web user-interface. You can learn more about 'KorAP' and use it directly on 'DeReKo' at <>.
Maintained by Marc Kupietz. Last updated 15 days ago.
Rexperigen:R Interface to Experigen
Provides convenience functions to communicate with an Experigen server: Experigen (<>) is an online framework for creating linguistic experiments, and it stores the results on a dedicated server. This package can be used to retrieve the results from the server, and it is especially helpful with registered experiments, which require authentication with the server.
Maintained by Daniel Szeredi. Last updated 9 years ago.
keyperm:Keyword Analysis Using Permutation Tests
Fast implementation of permutation tests for keyword analysis in corpus linguistics. The aim is to identify words that are significantly more frequent in one corpus than in another. The method is described in Mildenberger (2023) <arXiv:2308.13383>.
Maintained by Thoralf Mildenberger. Last updated 2 years ago.
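The idea behind a permutation test for keywords can be sketched in Python as follows. This is only an illustration, not the method implemented in 'keyperm': it assumes that whole documents are the units being shuffled, uses a plain difference in relative frequency as the test statistic, and all function names and parameters (`rel_freq`, `permutation_p_value`, `n_perm`, `seed`) are hypothetical.

```python
import random


def rel_freq(docs, word):
    """Relative frequency of `word` across a list of tokenized documents."""
    total = sum(len(doc) for doc in docs)
    return sum(doc.count(word) for doc in docs) / total if total else 0.0


def permutation_p_value(corpus_a, corpus_b, word, n_perm=2000, seed=0):
    """One-sided p-value for `word` being more frequent in corpus_a.

    Documents are reassigned to the two corpora at random, keeping the
    corpus sizes (in documents) fixed, and the statistic is recomputed.
    """
    rng = random.Random(seed)
    observed = rel_freq(corpus_a, word) - rel_freq(corpus_b, word)
    pooled = list(corpus_a) + list(corpus_b)
    n_a = len(corpus_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = rel_freq(pooled[:n_a], word) - rel_freq(pooled[n_a:], word)
        if stat >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    corpus_a = [["tea", "tea", "milk"] for _ in range(6)]
    corpus_b = [["milk", "milk", "milk"] for _ in range(6)]
    print(permutation_p_value(corpus_a, corpus_b, "tea"))
    print(permutation_p_value(corpus_a, corpus_b, "milk"))
```

Shuffling documents rather than individual tokens keeps each word's within-document clustering intact, so a word that is frequent in only a handful of documents is less likely to be flagged as a keyword.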
rLDCP:Text Generation from Data
Linguistic Descriptions of Complex Phenomena (LDCP) is an architecture and methodology that allows us to model complex phenomena, interpret input data, and generate automatic text reports customized to the user's needs (see <doi:10.1016/j.ins.2016.11.002> and <doi:10.1007/s00500-016-2430-5>). The proposed package contains a set of methods that facilitate the development of LDCP systems. Its main goal is to increase the visibility and practical use of this research line.
Maintained by Patricia Conde-Clemente. Last updated 7 years ago.
