Much like Python with its libraries, R ships with a set of standard packages and lets you add more from CRAN, Bioconductor, GitHub, and Omegahat. For example, the CRAN Task Views can help you find packages for medical analysis, genetics, machine learning, finance, and so on. But calling the number of these packages huge doesn't tell the true story; it's actually humongous. That is why tools such as Packrat and Checkpoint are must-haves for managing package dependencies and reproducibility in R. Or you can simply use the list that RStudio itself recommends.
Data importing
R can consume many kinds of data: files from statistical software, text files, flat files, databases, and even web data.
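As a minimal sketch of the flat-file case, the snippet below reads comma-separated text with base R's `read.csv()`. The inline `csv_text` string and its column names are made up for illustration and stand in for a real file on disk.

```r
# Inline CSV text stands in for a real flat file (hypothetical data).
csv_text <- "patient_id,age,group
1,34,control
2,51,treatment
3,47,treatment"

# read.csv() normally takes a file path; `text =` lets the sketch run without one.
patients <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the column types R inferred during import.
str(patients)
```

With a real file, the `text =` argument would simply be replaced by the path to that file.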
Scraping, however, isn't one of R's strengths out of the box, which is why you need packages like rvest (roughly the counterpart of Python's Beautiful Soup) and RSelenium, both built to make web data extraction easy in R.
Names like RPostgreSQL, RMySQL, ROracle, and RODBC may sound familiar or even funny, and that's because each package was designed to connect to, and extract data from, the database whose name it carries minus the R. The xlsx package is great for managing Excel data, readr helps you read flat and tabular text files, and haven handles all sorts of Stata, SPSS, and SAS files.
In other words, importing isn't trivial, but it's very doable with some extra help from packages.
Pre-processing of the data
Given the usual scale of statistical computing, pre-processing is never a short affair, but R is good at facilitating this stage. It isn't done in a single step; instead, it consists of many stages, including cleansing, centering, scaling, transformation, and normalization. That is why pre-processing is often done with the help of several packages.
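To make centering, scaling, and normalization concrete, here is a small sketch in base R. The `raw` data frame and the `min_max` helper are hypothetical names introduced only for this example; real pipelines often lean on dedicated packages instead.

```r
# Hypothetical measurements on very different scales.
raw <- data.frame(
  height_cm = c(150, 160, 170, 180, 190),
  income    = c(20000, 35000, 50000, 65000, 80000)
)

# Centering and scaling: subtract each column's mean, divide by its standard deviation.
standardized <- scale(raw, center = TRUE, scale = TRUE)

# Min-max normalization: rescale each column to the [0, 1] range.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
normalized <- as.data.frame(lapply(raw, min_max))

colMeans(standardized)       # column means after centering
apply(standardized, 2, sd)   # column standard deviations after scaling
range(normalized$income)     # bounds after min-max normalization
```

After standardization every column has a mean of (numerically) zero and a standard deviation of one, so variables measured in centimeters and in currency become comparable.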
Packages such as tidyr, editrules, deducorrect, MatchIt, and many others lend a helping hand when you need to restructure, aggregate, tidy up, and parse data, or apply cleaning rules, both manually and automatically.
Deep analysis
This is the area where R truly shines: it lets you perform such specific tasks as regression, principal component analysis, bootstrapping, analysis of variance, generalized additive models, and statistical modeling. These features hardly apply to trivial needs, but they serve as a unique way to refine massive data for all kinds of research. Data analysis is arguably the main reason for R's existence; research projects and standalone analyses on large servers are its main specialties.
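Three of the tasks listed above (regression, principal component analysis, and analysis of variance) can be sketched with functions that ship with base R. The choice of the built-in `mtcars` dataset and of the variables is an assumption made purely for illustration.

```r
# mtcars ships with base R (datasets package).
data(mtcars)

# Linear regression: fuel economy as a function of weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)

# Principal component analysis on a few standardized numeric columns.
pca <- prcomp(mtcars[, c("mpg", "wt", "hp", "disp")], center = TRUE, scale. = TRUE)
summary(pca)

# One-way analysis of variance: does mpg differ by number of cylinders?
anova_fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(anova_fit)
```

Each call returns a model object that `summary()` can describe, which is the usual pattern for statistical modeling in R.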
But if we're talking about real business use cases, R still delivers great value when applied to machine learning algorithms. Here, too, there's a set of useful packages that take a lot of hard work off your hands and make data refining smooth. The plyr, dplyr, and data.table packages come in handy for dataset manipulation; stringr offers simple wrappers for string operations; Hmisc serves all kinds of purposes; igraph delivers network analysis; rpart helps you build classification, regression, and survival trees. A bigger list was provided by KDnuggets, so you can pick a package for any of these goals.
Post-processing
This crucial step makes data easy to interpret, and R delivers when it comes to filtering, pruning, transforming, rescaling, merging, and splitting data. Basically, all these actions strip out the "trash" data picked up during the previous stages.
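The operations named above can be sketched in base R as follows. The `results` and `samples` tables are invented for the example, and the packages mentioned in this article offer faster or more expressive equivalents.

```r
# Hypothetical results table and a lookup table of sample metadata.
results <- data.frame(
  sample_id = c(1, 2, 3, 4, 5),
  score     = c(0.91, NA, 0.42, 0.77, 0.15)
)
samples <- data.frame(
  sample_id = c(1, 2, 3, 4, 5),
  group     = c("a", "a", "b", "b", "b")
)

# Filtering / pruning: drop rows with missing scores.
clean <- subset(results, !is.na(score))

# Rescaling: express scores as percentages.
clean$score_pct <- clean$score * 100

# Merging: attach group labels by the shared key.
merged <- merge(clean, samples, by = "sample_id")

# Splitting: one data frame per group.
by_group <- split(merged, merged$group)

sapply(by_group, nrow)
```

The row with the missing score disappears at the filtering step, so it never reaches the merged or split output.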
When it comes to transaction data, arules is the best fit for manipulating, representing, and analyzing it. The above-mentioned tidyr and data.table, along with caret, are a nice fit for post-processing as well as for the "pre" stage.
Visualization
An essential part of post-processing, visualization is one of R's strong sides: the language was built from the start to display analytical and statistical results. Its graphics capabilities are what make it good, covering dynamic graphs, graphics devices, and graphic displays. Once again, there are a ton of packages that make this process an easy affair rather than a convoluted mess.
Once you've applied ggplot2 and its "grammar of graphics," ggvis and plotly can turn data into web-based visualizations; shiny does the same for interactive web apps; rgl does 3D with OpenGL. R Markdown builds on Markdown, weaving your results into customizable MS Word, HTML, and PDF reports. Finally, googleVis creates interactive charts from data frames through the Google Charts API.
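Even before reaching for any of these packages, base R's own graphics devices cover the basics. The sketch below writes a scatter plot to a temporary PDF. It deliberately uses base graphics rather than the packages named above, and the built-in `mtcars` dataset is an arbitrary choice for illustration.

```r
# Write a scatter plot to a temporary PDF using a base graphics device.
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 6, height = 4)          # open the graphics device
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy vs. weight")
abline(lm(mpg ~ wt, data = mtcars), lty = 2)  # add a fitted regression line
invisible(dev.off())                          # close the device so the file is written

file.exists(out_file)
```

Swapping `pdf()` for another device such as `png()` changes the output format without touching the plotting code.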