R is one of the most popular languages for learning data science. Even so, beginners often struggle to find the right learning path: the internet is full of data science and R learning materials, which can be overwhelming at first.
As the saying goes, ‘Think Twice, Code Once’: with a clear plan, you save time and effort. This blog will help you lay out a clear, systematic approach to learning R for data science.
1. Download R and RStudio
The first thing you need is R, which is the core piece. R is a free, open-source language and environment for statistical analysis. The next thing you need is RStudio, an integrated development environment (IDE) that makes R much easier to use.
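Download R from CRAN, the Comprehensive R Archive Network (https://cran.r-project.org)
Download RStudio Desktop from Posit (https://posit.co/download/rstudio-desktop/)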
If you open R itself, it will look very plain.
It is not necessary to open RStudio to use R, but having RStudio as your default IDE for R development makes R a little more user-friendly as it provides all the plots, package management and the editor in one place.
2. Get familiar with the RStudio IDE
When you first open RStudio, this is what you see.
The left panel is the console for R. Type 1 + 1, hit ‘Enter’ and R will return the answer.
It’s a good idea to use a script so you can save your code. Open a new script by selecting File -> New File -> R Script and it will appear in the top left panel of RStudio.
A script is basically a text document that can be saved (File -> Save As). You can run more than one line at a time by highlighting the lines and clicking the ‘Run’ button on the script toolbar.
The top right panel gives you information about what variables you’re working with during your R session.
The bottom right panel can be used to find and open files, view plots, load packages, and look at help pages.
3. Learn the basics
The next thing is to learn the basics of R. Try doing some math operations.
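As a starting point, here is a small sketch of the kind of math operations you can type into the console or run from a script. The particular numbers are only illustrative.

```r
# Basic arithmetic; each line prints its result when run
1 + 1      # addition
7 - 3      # subtraction
4 * 2.5    # multiplication
10 / 4     # division
2 ^ 3      # exponent
10 %% 3    # remainder after division
10 %/% 3   # integer division
sqrt(16)   # square root, using a built-in function
```

Run the lines one at a time and compare each answer with what you expected.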
Create a variable
To create a variable in R, we use:
Variable <- Value
e.g. number <- 1
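The short sketch below shows the assignment in action. The variable names are made up for illustration.

```r
# Store a value in a variable, then look at it
number <- 1
number

# A variable can be reused and overwritten
number <- number + 4
number

# Variables can hold text as well as numbers
greeting <- "hello"
greeting
```

After running this, both variables also appear in the top right panel of RStudio.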
Learn about variable types
R has three main variable types – character, numeric and logical.
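To see the three types side by side, you can ask R for the class of a value. This is a minimal sketch with hypothetical variable names.

```r
name <- "Ada"          # character: text in quotes
height <- 1.68         # numeric: a number
is_student <- TRUE     # logical: TRUE or FALSE

# class() reports the type of each variable
class(name)
class(height)
class(is_student)

# Text that looks like a number can be converted to numeric
as.numeric("42") + 1
```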
Learn about Grouping Data
Learn about vectors, lists, matrices and data frames.
Vectors – contain multiple values of the same type (e.g., all numbers or all words)
Lists – contain multiple values of different types (e.g., some numbers and some words)
Matrix – a table, like a spreadsheet, with only one data type
Data Frames – Like a matrix, but you can mix data types
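The four structures above can be sketched as follows. All names and values here are invented for illustration.

```r
# Vector: several values of the same type
scores <- c(90, 85, 72)

# List: values of different types, optionally named
mixed <- list(name = "Ada", score = 90, passed = TRUE)

# Matrix: a table holding a single data type
grid <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: a table whose columns can have different types
students <- data.frame(
  name = c("Ada", "Ben", "Cy"),
  score = scores,
  passed = c(TRUE, TRUE, FALSE)
)

# Pulling values back out
scores[2]        # second element of the vector
mixed$name       # list element by name
grid[2, 3]       # row 2, column 3 of the matrix
students$score   # one column of the data frame
```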
Learn about Functions
Functions are a way to repeat the same task on different data.
e.g. x <- c(2, 4, 5, 6, 10) uses the function c() to combine values into a vector.
Another example of a function is plot().
To write a comment, type # in front of your comment. Comments do not get evaluated.
e.g. #This is a comment
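Here is a brief sketch that puts functions and comments together. mean() and length() are built-in functions; rescale() is a hypothetical function written only for this example.

```r
# Combine five values into a vector
x <- c(2, 4, 5, 6, 10)

# Built-in functions repeat the same task on whatever data you pass in
mean(x)     # average of the values
length(x)   # how many values there are

# You can also write your own function.
# rescale() is a made-up example that maps values onto the range 0 to 1.
rescale <- function(values) {
  (values - min(values)) / (max(values) - min(values))
}

rescale(x)
rescale(c(10, 20, 30))   # same task, different data
```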
4. Learn to access the help files
This will be your most useful tool. You should learn how to get help for a particular function, search the help files for a word or phrase, and find help for a package.
?function_name –> get help for a function
help.search("phrase") –> search the help files for a word or phrase
e.g. help.search("weighted mean")
help(package = "package_name") –> find help for a package
e.g. help(package = "dplyr")
5. Learn about R packages
R comes with basic functionality, meaning that some functions are always available when you start an R session. However, anyone can write functions that are not part of the base functionality and make them available to other R users in a package. A package must be installed once and then loaded before you can use it.
To install a package, click on the ‘Packages’ tab on the bottom right panel of RStudio and then click ‘Install’ on the toolbar.
Once you click on ‘Install’, a window will pop up. Now type the name of the package into the ‘Packages’ box, select that package, and click ‘Install’.
Installing the package is not enough to use its functions. You must load the package first with the library() function, e.g. library(dplyr) to load the ‘dplyr’ package.
You don’t have to install the package again after you close and re-open RStudio, but you do need to load it in each new session before using any of its functions.
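The same two steps can also be done from the console. The sketch below assumes you want ‘dplyr’; the install line is commented out because it needs an internet connection and only has to run once. The `loaded` variable is just a hypothetical flag for this example.

```r
# Step 1 (once per machine): download and install the package from CRAN.
# Uncomment the next line to run it.
# install.packages("dplyr")

# Step 2 (every new session): load the package so its functions are available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  loaded <- "package:dplyr" %in% search()
} else {
  loaded <- FALSE
  message("dplyr is not installed yet; run install.packages(\"dplyr\") first")
}

loaded
```

The requireNamespace() check is optional; it is here so the script gives a friendly message instead of an error when the package is missing.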
6. Data Loading
Getting the data into R is the first step of the data science process. R offers a variety of options for reading data in many formats. This is a common list of packages best suited for data loading.
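Before reaching for extra packages, it may help to see that base R can already read a CSV file with read.csv(). This is a minimal sketch: the file is a temporary one created on the spot, and its columns (city, temp_c) are invented for illustration.

```r
# Create a small CSV file to read (stands in for a file you already have)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("city,temp_c",
    "Oslo,4.5",
    "Cairo,28.1",
    "Lima,19.3"),
  csv_path
)

# Read the file into a data frame
weather <- read.csv(csv_path)

# Check what was loaded
str(weather)
weather
```

With your own data, replace csv_path with the path to your file.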
7. Exploring the data
Use the following code to obtain the data.
airquality is a data frame that is built into R, so you can start exploring without loading anything from a file.
Try the following commands and see what you get:
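As one possible set of commands for a first look, the sketch below loads the built-in airquality data and inspects it with base R functions. The choice of commands is an assumption about what is useful at this stage, not a fixed recipe.

```r
# Make the built-in data set available in your session
data(airquality)

head(airquality)      # first six rows
tail(airquality, 3)   # last three rows
dim(airquality)       # number of rows and columns
names(airquality)     # column names
str(airquality)       # type of each column
summary(airquality)   # min, max, mean, quartiles and missing values per column
```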
Viewing the data
RStudio has a special function called View() that makes it easier to look at the data in a data frame, e.g. View(airquality) opens the data in a spreadsheet-style tab.
8. Data Analysis and Visualization
Once you know how to get data into R and have tried some data exploration techniques, it’s time to learn some exploratory analysis. Install the R packages listed below, which help simplify data analysis and visualization.
· dplyr – helps you do simple and elegant data manipulation
· data.table – handles big data with ease, provides faster data analysis
· ggplot2 – awesome package for data visualization
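To give a flavour of two of these packages, here is a small sketch using the built-in airquality data. It assumes dplyr and ggplot2 are installed and simply skips each part if they are not. The names monthly and p are hypothetical.

```r
# dplyr: average ozone per month, ignoring missing readings
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  monthly <- airquality %>%
    filter(!is.na(Ozone)) %>%
    group_by(Month) %>%
    summarise(mean_ozone = mean(Ozone), readings = n())
  print(monthly)
}

# ggplot2: scatter plot of ozone against temperature
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
    geom_point(na.rm = TRUE) +   # drop rows with missing values
    labs(x = "Temperature", y = "Ozone")
}
```

Type p or print(p) to draw the plot in the bottom right panel of RStudio.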
9. Data Preparation
Data preparation is another important step because clean data is hard to find, and often needs to be transformed and molded into a form on which we can run models.
· reshape2 – melt and cast the dataset into the shape you want
· tidyr – provides a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning)
· Amelia – missing value imputation
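A typical first preparation step is to find missing values and reshape the table. The sketch below uses base R for the missing-value check and tidyr's pivot_longer() for the reshaping; it assumes tidyr (version 1.0.0 or later) is installed and skips that part otherwise. Variable names are invented for illustration.

```r
# Count missing values in each column (base R)
na_counts <- colSums(is.na(airquality))
na_counts

# Keep only the rows with no missing values
complete_rows <- airquality[complete.cases(airquality), ]
nrow(complete_rows)

# Reshape from wide to long: one row per measurement
if (requireNamespace("tidyr", quietly = TRUE)) {
  long <- tidyr::pivot_longer(
    airquality,
    cols = c(Ozone, Solar.R, Wind, Temp),
    names_to = "measure",
    values_to = "value"
  )
  head(long)
}
```

Dropping incomplete rows is the simplest option; packages such as Amelia estimate the missing values instead.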
10. Communicate Results
You have learned to extract insights from data, but those insights are of little use without effective communication or display of the results. R Markdown is a great tool for reporting your insights and sharing your findings with fellow data scientists.
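For a rough idea of what an R Markdown report looks like, the sketch below writes a tiny .Rmd file from R and renders it to HTML. It assumes the rmarkdown package and pandoc are available (RStudio ships with pandoc) and only writes the file if they are not. The report title and file name are made up.

```r
# Three backticks mark the start and end of a code chunk in R Markdown;
# they are built here with strrep() to keep this example easy to read.
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",
  'title: "Air quality summary"',
  "output: html_document",
  "---",
  "",
  "Mean temperature by month:",
  "",
  paste0(fence, "{r}"),
  "aggregate(Temp ~ Month, data = airquality, FUN = mean)",
  fence
)

# Save the report source to a temporary folder
rmd_path <- file.path(tempdir(), "report.Rmd")
writeLines(rmd_lines, rmd_path)

# Turn the source into an HTML report, if the tools are installed
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  out_path
}
```

In everyday use you would create the file with File -> New File -> R Markdown in RStudio and click ‘Knit’ instead.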