Showing 3 of 3 results
ropensci
antiword: Extract Text from Microsoft Word Documents
Wraps the 'AntiWord' utility to extract text from Microsoft Word documents. The utility only supports the old 'doc' format, not the newer XML-based 'docx' format. Use the 'xml2' package to read the latter.
Maintained by Jeroen Ooms. Last updated 6 months ago.
59 stars 6.98 score 7 scripts 7 dependents
ropensci
rtika: R Interface to 'Apache Tika'
Extract text or metadata from over a thousand file types, using Apache Tika <https://tika.apache.org/>. Get either plain text or structured XHTML content.
Maintained by Sasha Goodman. Last updated 2 years ago.
extract-metadata extract-text java parse pdf-files peer-reviewed tesseract tika
55 stars 6.00 score 12 scripts
ropensci
unrtf: Extract Text from Rich Text Format (RTF) Documents
Wraps the 'unrtf' utility <https://www.gnu.org/software/unrtf/> to extract text from RTF files. Supports document conversion to HTML, LaTeX or plain text. Output in HTML is recommended because 'unrtf' has limited support for converting between character encodings.
Maintained by Jeroen Ooms. Last updated 5 months ago.
14 stars 4.36 score 11 scripts