A Quick Introduction to R


R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis. Polls, surveys of data miners, and studies of scholarly literature databases show that R's popularity has increased substantially in recent years. As of May 2018, R ranks 11th in the TIOBE index, a measure of the popularity of programming languages.

R is a GNU package. The source code for the R software environment is written primarily in C, Fortran, and R itself. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems. While R has a command line interface, there are several graphical front-ends and integrated development environments available.

The following blogs give information about R gathered from various sources

Introduction


R is an interpreted programming language and software environment for statistical analysis, graphics representation and reporting. R also allows integration with the procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency. R is named so after the creators Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team.

RStudio is the most commonly used development environment for R. It lets you run individual commands as well as whole scripts. You can install R and RStudio from their respective websites.

R Command Prompt


Once you have the R environment set up, start the R command prompt (for example by typing R in your terminal) and enter the following statements at the > prompt:

> myString <- "Hello, World!"
> print ( myString)
[1] "Hello, World!"

Here the first statement defines a string variable myString and assigns it the string "Hello, World!"; the next statement uses print() to print the value stored in myString.

R Script File


Usually, you will write your programs in script files and then execute those scripts at your command prompt with the help of the R front-end called Rscript. So let's start by writing the following code in a text file called test.R:

# Hello World
myString <- "Hello, World!"
print ( myString)

Save the above code in the file test.R and execute it at the command prompt as given below. You can also run it within RStudio if you want to avoid the hassle of setting up the path.

$ Rscript test.R 
[1] "Hello, World!"

Data Types


Like most scripting languages, R is dynamically typed: you do not declare a variable to be limited to a given data type. Variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are:

  • Vectors
  • Lists
  • Matrices
  • Arrays
  • Factors
  • Data Frames

The simplest of these objects is the vector object and there are six data types of these atomic vectors, also termed as six classes of vectors. The other R-Objects are built upon the atomic vectors.

  • Logical (TRUE / FALSE)
  • Numeric (1, 2, 33.44)
  • Integer (1L, -100L, 0L)
  • Complex (4 + 10i)
  • Character ("1", "examples", 'of', "characters")
  • Raw (A raw sequence of bytes.)

v <- TRUE
print(class(v))
[1] "logical" 

v <- 23.5
print(class(v))
[1] "numeric"

v <- 2L
print(class(v))
[1] "integer"

v <- 2+5i
print(class(v))
[1] "complex"

v <- "TRUE"
print(class(v))
[1] "character"

v <- charToRaw("Convert Characters to RAW")
print(class(v))
[1] "raw"

Variable


A valid variable name consists of letters, numbers and the dot or underline characters. The variable name starts with a letter or the dot not followed by a number.

var_name2. # Valid Has letters, numbers, dot and underscore
var_name% # Invalid Has the character '%'. Only dot(.) and underscore allowed.
2var_name # Invalid Starts with a number
.var_name # Valid Can start with a dot(.) but the dot(.)should not be followed by a number.
var.name # Valid Variable name can contain a dot(.)
.2var_name # Invalid The starting dot is followed by a number making it invalid.
_var_name # Invalid Starts with _ which is not valid

The variables can be assigned values using leftward, rightward and equal to operator.

# Assignment using equal operator.
var.1 = c(0,1,2,3)

# Assignment using leftward operator.
var.2 <- c("learn","R")

# Assignment using rightward operator.   
c(TRUE,1) -> var.3

cat ("var.1 is ", var.1 ,"\n")
cat ("var.2 is ", var.2 ,"\n")
cat ("var.3 is ", var.3 ,"\n")

print() / cat()


We can view the contents of a variable using the print or cat functions. Print takes a single parameter while cat takes multiple parameters and concatenates them all.

print("Hello World")
[1] "Hello World"

cat("Hello", "World")
Hello World

ls() / rm()


R has no block-level scope. (Packages have their own namespaces, but code you type at the prompt shares one workspace.) For example, a variable created inside an if block is still available after the block ends. It is very easy to lose track of the variables available at a given point. R provides two very useful functions to deal with this.

ls() lists the variables defined at a given time, and rm() deletes a variable from memory.
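
A minimal sketch of both functions follows. The function demo and its variable names are made up for illustration; the code is wrapped in a function only so that it does not touch your own workspace. At the console you would call ls() and rm() directly.

```r
# Hypothetical helper: inside a function, ls() lists the local variables.
demo <- function() {
    total <- 120
    if (total > 100) {
        flag <- TRUE          # created inside the if block
    }
    print(ls())               # flag is still listed after the block ends
    rm(flag)                  # delete it
    print(ls())               # only total is left
    exists("flag", inherits = FALSE)   # FALSE once flag has been removed
}

still.there <- demo()
```

rm() also accepts several names at once, for example rm(a, b).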

Operators


R is quite rich in the operators it provides. It may not be as rich as Perl, but the operators in R are tailored towards handling chunks of data. By default, operators applied to vectors work element by element on corresponding elements. Operators in R can be classified into 5 major types:

Arithmetic


R defines these arithmetic operators +, -, *, /, %%, %/%, ^

The meaning of +, -, *, / and ^ is the same as in most other languages and needs no clarification. The %% and %/% operators are more interesting. Both are related to integer division: %/% gives the quotient and %% gives the remainder.

> # / performs the usual division
> c(4, 2, 5.5, 6.5) / c(2, 4, 2.5, 3)
[1] 2.000000 0.500000 2.200000 2.166667

> # %% gives the remainder. Note that both the operands could be non integers. 
> # But the operator ensures integer division.
> c(4, 2, 5.5, 6.5) %% c(2, 4, 2.5, 3)
[1] 0.0 2.0 0.5 0.5

> # %/% gives the quotient. Note that both the operands could be non integers. 
> # But the operator ensures integer division.
> c(4, 2, 5.5, 6.5) %/% c(2, 4, 2.5, 3)
[1] 2 0 2 2

Relational


R defines the usual relational operators: <, >, <=, >=, ==, !=

They mean almost what they would mean in any other language. But, as mentioned above, the operators work on individual elements in the vector and produce another vector of boolean elements that stand for the result of each individual comparison. For example:

> c(4, 2, 5.5, 6.5) < c(2, 4, 2.5, 3)
[1] FALSE  TRUE FALSE FALSE

Logical


R defines all the usual logical operators: &, |, !, && and ||

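The operators &, | and ! do just what one would expect: they operate on individual elements of the operand vectors and produce another boolean vector as the result. The && and || operators work differently: they are meant for single values (as in an if condition), evaluate the right-hand side only when needed, and return a single boolean value. Older versions of R silently used only the first element of longer vectors; R 4.3.0 and later raise an error instead.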
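
The difference between the two families can be sketched as follows. The variable names are illustrative only, and && and || are applied to single values here so that the example runs on any R version.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Element-wise operators return a vector of the same length.
elementwise.and <- a & b     # TRUE FALSE FALSE
elementwise.or  <- a | b     # TRUE TRUE TRUE
negated         <- !a        # FALSE TRUE FALSE

# && and || return a single value, typically inside an if() condition.
x <- 5
in.range <- (x > 0) && (x < 10)    # TRUE

# Short-circuit: the right side is skipped when the left side already
# decides the result, so stop() is never reached here.
short.circuit <- (x > 0) || stop("never evaluated")
```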

Assignment


There are two types of assignments in R. Left assignment and Right assignment.

> a <- c(1,2,3)
> a
[1] 1 2 3
> c(3,4,5,6) -> a
> a
[1] 3 4 5 6

You can also use <<-, ->> and of course =. There are subtle differences between these; we will check them out down the line.
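
As a first taste of one of those differences, here is a small sketch contrasting <- and <<- inside a function. The function names add.local and add.shared are hypothetical, chosen only for this example.

```r
total <- 0

# <- inside a function creates a local variable; the outer total is untouched.
add.local <- function(n) {
    total <- total + n
    total
}

# <<- assigns to the existing variable in the enclosing environment.
add.shared <- function(n) {
    total <<- total + n
    invisible(total)
}

local.result <- add.local(5)    # returns 5, outer total is still 0
after.local  <- total
add.shared(5)                   # outer total becomes 5
after.shared <- total
```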

Miscellaneous


R also provides other operators :, %in% and %*%

print(2:8) 
[1] 2 3 4 5 6 7 8

print(8 %in% 1:10) 
[1] TRUE
print(12 %in% 1:10)
[1] FALSE

These are not limited to numbers. They work as well on other data types.
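
For instance, %in% works on character vectors, and %*% multiplies matrices. The following sketch uses made-up values:

```r
brands <- c("Honda", "Yamaha", "Suzuki")

has.honda <- "Honda" %in% brands                # TRUE
matches   <- c("Honda", "Ducati") %in% brands   # TRUE FALSE

# %*% is matrix multiplication.
m <- matrix(1:4, nrow = 2)    # columns are (1, 2) and (3, 4)
v <- c(1, 1)
product <- m %*% v            # 2 x 1 matrix holding 4 and 6
```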

Code Flow


Like most programming languages, R provides control flow using if/else blocks and for/while loops. Most developers will not need a detailed explanation of these. The code snippets below give an overview of how they are used in R.

if / else / else if


report <- 'blank'
number <- 10

if(number > 10){
    report <- "Greater than 10"
}else if (number < 10){
    report <- "Less than 10"
}else{
    report <- 'Equal to 10'
}
print(report)

[1] "Equal to 10"

for loops


We have versatile for loops in R. They can loop through various data structures such as vectors, lists, matrices and arrays.

vec <- c(1,3,4,6,9)
for (v in vec) {
    print(v)
}

[1] 1
[1] 3
[1] 4
[1] 6
[1] 9

You can do the same with other data structures as well. Also the collection can be generated dynamically in the command:

for ( i in 1:10 ){
    print (i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
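
The same loop syntax works on lists and matrices. A small sketch with made-up data (note that a for loop over a matrix visits the elements one by one, column by column):

```r
# Looping over a list: each element may have a different type.
items <- list(1L, "two", TRUE)
classes <- character(0)
for (item in items) {
    classes <- c(classes, class(item))
}
print(classes)    # "integer" "character" "logical"

# Looping over a matrix visits every element.
m <- matrix(1:6, nrow = 2)
total <- 0
for (value in m) {
    total <- total + value
}
print(total)      # 21
```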

While Loops


While loops provide a more generic and more powerful mechanism to loop. The While loops in R are quite similar to most other languages:

> x <- 0
> while(x < 10){
+   cat('Value of x: ',x)
+   print("X is still less than 10")
+   # add one to x
+   x <- x+1
+ }
Value of x:  0[1] "X is still less than 10"
Value of x:  1[1] "X is still less than 10"
Value of x:  2[1] "X is still less than 10"
Value of x:  3[1] "X is still less than 10"
Value of x:  4[1] "X is still less than 10"
Value of x:  5[1] "X is still less than 10"
Value of x:  6[1] "X is still less than 10"
Value of x:  7[1] "X is still less than 10"
Value of x:  8[1] "X is still less than 10"
Value of x:  9[1] "X is still less than 10"
>

While loops also support break and next if you want to exit the loop early or skip to the next iteration.
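
A short sketch of both statements, using an illustrative loop that collects odd numbers:

```r
x <- 0
seen <- c()
while (TRUE) {
    x <- x + 1
    if (x %% 2 == 0) next    # skip even numbers, go to the next iteration
    if (x > 7) break         # leave the loop entirely
    seen <- c(seen, x)
}
print(seen)    # 1 3 5 7
```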

Matrices


Matrix is a very useful data structure in R. A lot of data processing and machine learning computations involve Matrices. So it is important that we understand this well.

As one would expect, the Matrix is a two dimensional data structure consisting of rows and columns.

Creating a Matrix


A matrix can be defined using the function matrix(). The first (and only required) argument to this function is the vector that should be cast into a matrix. The matrix() function splits the vector into a matrix based on the parameters passed in.

> matrix(1:20, nrow=4)
        [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

By default, the values are split along the columns. But, we can make it flow along rows by setting the byrow parameter.

> matrix(1:20, byrow=TRUE, nrow=4)
        [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20

OR

> matrix(1:20, byrow=T, nrow=4)
        [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20
>

Another way to create a matrix from vectors is to bind two or more vectors of same size. This can be done using rbind() or cbind().

> rbind(1:5, 2:6)
        [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    3    4    5    6
> 
> cbind(1:5, 2:6)
        [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    4
[4,]    4    5
[5,]    5    6
>

When the vectors are named variables, their names are used as the row labels (rbind) or column labels (cbind) of the matrix.

> v1 <- c(1:4)
> v2 <- c(4:1)
> rbind(v1,v2)
    [,1] [,2] [,3] [,4]
v1    1    2    3    4
v2    4    3    2    1
> cbind(v1,v2)
        v1 v2
[1,]  1  4
[2,]  2  3
[3,]  3  2
[4,]  4  1
>

Labels


It makes a lot of sense to label the rows and columns so that the code and graphs look a lot more meaningful. We can do that using the methods like colnames and rownames. Consider the below example that depicts the sale of bikes by different brands.

First define the two vectors

> honda <- c(10, 14, 12, 13, 11)
> honda
[1] 10 14 12 13 11
> yamaha <- c(12, 13, 14, 11, 10)
> yamaha
[1] 12 13 14 11 10

Now combine into a single vector that can be split to create a matrix.

> sales <- c(honda, yamaha)
> sales
[1] 10 14 12 13 11 12 13 14 11 10

Next, split the vector into a matrix.

> sales.matrix <- matrix(sales, byrow=T, nrow=2)
> sales.matrix
        [,1] [,2] [,3] [,4] [,5]
[1,]   10   14   12   13   11
[2,]   12   13   14   11   10

Finally, use the functions colnames and rownames to add labels to the matrix.

> colnames(sales.matrix) <- c("2013", "2014", "2015", "2016", "2017")
> rownames(sales.matrix) <- c("Honda", "Yamaha")
> sales.matrix
        2013 2014 2015 2016 2017
Honda    10   14   12   13   11
Yamaha   12   13   14   11   10
>

Matrix Arithmetic


Just like vectors, arithmetic operations on matrices work on the individual elements.

> mat <- matrix(1:20, byrow = T, nrow=5)

Scalar multiplication results in multiplication on each element.

> mat * 2
        [,1] [,2] [,3] [,4]
[1,]    2    4    6    8
[2,]   10   12   14   16
[3,]   18   20   22   24
[4,]   26   28   30   32
[5,]   34   36   38   40

Scalar division results in division of each element

> mat / 2
        [,1] [,2] [,3] [,4]
[1,]  0.5    1  1.5    2
[2,]  2.5    3  3.5    4
[3,]  4.5    5  5.5    6
[4,]  6.5    7  7.5    8
[5,]  8.5    9  9.5   10

Similarly, the exponent method works on each element.

> mat ^2
        [,1] [,2] [,3] [,4]
[1,]    1    4    9   16
[2,]   25   36   49   64
[3,]   81  100  121  144
[4,]  169  196  225  256
[5,]  289  324  361  400

Dividing 1 by a matrix gives the reciprocal of each element. Note that this is not the matrix inverse in the linear-algebra sense.

> 1/mat
            [,1]       [,2]       [,3]       [,4]
[1,] 1.00000000 0.50000000 0.33333333 0.25000000
[2,] 0.20000000 0.16666667 0.14285714 0.12500000
[3,] 0.11111111 0.10000000 0.09090909 0.08333333
[4,] 0.07692308 0.07142857 0.06666667 0.06250000
[5,] 0.05882353 0.05555556 0.05263158 0.05000000
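
Note that 1/mat is an element-wise reciprocal. If you need the matrix inverse in the linear-algebra sense, base R provides solve() for square, invertible matrices. A minimal sketch, using a small made-up matrix sq (the 5 x 4 mat above is not square, so it has no inverse):

```r
sq <- matrix(c(2, 1, 1, 3), nrow = 2)   # hypothetical invertible 2 x 2 matrix

reciprocal <- 1 / sq       # element-wise: 1/2, 1/1, 1/1, 1/3
inverse    <- solve(sq)    # true matrix inverse

# Multiplying a matrix by its inverse gives the identity matrix
# (up to floating-point rounding).
identity.check <- sq %*% inverse
print(round(identity.check, 10))
```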

Logical operations result in a matrix of boolean values.

> mat > 15
        [,1]  [,2]  [,3]  [,4]
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE  TRUE
[5,]  TRUE  TRUE  TRUE  TRUE

You can also filter the matrix elements to get a vector output.

> mat[mat > 15]
[1] 17 18 19 16 20

Just like scalar addition, you can use the operations on two matrices to get result on individual elements.

> mat + mat
        [,1] [,2] [,3] [,4]
[1,]    2    4    6    8
[2,]   10   12   14   16
[3,]   18   20   22   24
[4,]   26   28   30   32
[5,]   34   36   38   40
> mat * mat
        [,1] [,2] [,3] [,4]
[1,]    1    4    9   16
[2,]   25   36   49   64
[3,]   81  100  121  144
[4,]  169  196  225  256
[5,]  289  324  361  400
> mat / mat
        [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1
[3,]    1    1    1    1
[4,]    1    1    1    1
[5,]    1    1    1    1
> mat ^ mat
                [,1]         [,2]         [,3]         [,4]
[1,] 1.000000e+00 4.000000e+00 2.700000e+01 2.560000e+02
[2,] 3.125000e+03 4.665600e+04 8.235430e+05 1.677722e+07
[3,] 3.874205e+08 1.000000e+10 2.853117e+11 8.916100e+12
[4,] 3.028751e+14 1.111201e+16 4.378939e+17 1.844674e+19
[5,] 8.272403e+20 3.934641e+22 1.978420e+24 1.048576e+26
>

Matrix multiplication (the matrix product) is denoted by %*%

> mat %*% t(mat)
        [,1] [,2] [,3] [,4] [,5]
[1,]   30   70  110  150  190
[2,]   70  174  278  382  486
[3,]  110  278  446  614  782
[4,]  150  382  614  846 1078
[5,]  190  486  782 1078 1374
>

The data operations like sum and mean are implemented by functions like colSums, colMeans, rowSums, rowMeans, sum

> colSums(sales.matrix)
2013 2014 2015 2016 2017 
    22   27   26   24   21 

> colMeans(sales.matrix)
2013 2014 2015 2016 2017 
11.0 13.5 13.0 12.0 10.5 

> rowSums(sales.matrix)
    Honda Yamaha 
    60     60 

> rowMeans(sales.matrix)
    Honda Yamaha 
    12     12 

> sum(sales.matrix)
[1] 120
>

Data slicing and indexing are required for any data processing. They are implemented as follows

> mat[1,]
[1] 1 2 3 4

> mat[1,3:4]
[1] 3 4

> mat[1:3,1:3]
        [,1] [,2] [,3]
[1,]    1    2    3
[2,]    5    6    7
[3,]    9   10   11

> mat[,3:4]
        [,1] [,2]
[1,]    3    4
[2,]    7    8
[3,]   11   12
[4,]   15   16
[5,]   19   20
>

Built-in Data Sets


R provides several built-in data sets. They are of reasonable size and accuracy, and they help with rapid prototyping and with using standard values in regular code.

R provides these data sets as vectors, matrices, time series and data frames. Here are a few examples:

States:


> state.abb
    [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH"
[30] "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"
> state.area
    [1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876   6450  83557  56400  36291  56290  82264  40395  48523  33215  10577   8257
[22]  58216  84068  47716  69686 147138  77227 110540   9304   7836 121666  49576  52586  70665  41222  69919  96981  45333   1214  31055  77047  42244
[43] 267339  84916   9609  40815  68192  24181  56154  97914
>
> head(state.x77)
            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

WorldPhones:


> WorldPhones
        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076
>

There are many other data sets.

Dataset                 Contents
AirPassengers           Monthly Airline Passenger Numbers 1949-1960
BJsales                 Sales Data with Leading Indicator
BJsales.lead (BJsales)  Sales Data with Leading Indicator
BOD                     Biochemical Oxygen Demand
CO2                     Carbon Dioxide Uptake in Grass Plants
ChickWeight             Weight versus age of chicks on different diets
DNase                   Elisa assay of DNase
EuStockMarkets          Daily Closing Prices of Major European Stock Indices, 1991-1998
Formaldehyde            Determination of Formaldehyde
HairEyeColor            Hair and Eye Color of Statistics Students
Harman23.cor            Harman Example 2.3
Harman74.cor            Harman Example 7.4
Indometh                Pharmacokinetics of Indomethacin
InsectSprays            Effectiveness of Insect Sprays
JohnsonJohnson          Quarterly Earnings per Johnson & Johnson Share
LakeHuron               Level of Lake Huron 1875-1972
LifeCycleSavings        Intercountry Life-Cycle Savings Data
Loblolly                Growth of Loblolly pine trees
Nile                    Flow of the River Nile
Orange                  Growth of Orange Trees
OrchardSprays           Potency of Orchard Sprays
PlantGrowth             Results from an Experiment on Plant Growth
Puromycin               Reaction Velocity of an Enzymatic Reaction
Seatbelts               Road Casualties in Great Britain 1969-84
Theoph                  Pharmacokinetics of Theophylline
Titanic                 Survival of passengers on the Titanic
ToothGrowth             The Effect of Vitamin C on Tooth Growth in Guinea Pigs
UCBAdmissions           Student Admissions at UC Berkeley
UKDriverDeaths          Road Casualties in Great Britain 1969-84
UKgas                   UK Quarterly Gas Consumption
USAccDeaths             Accidental Deaths in the US 1973-1978
USArrests               Violent Crime Rates by US State
USJudgeRatings          Lawyers' Ratings of State Judges in the US Superior Court
USPersonalExpenditure   Personal Expenditure Data
UScitiesD               Distances Between European Cities and Between US Cities
VADeaths                Death Rates in Virginia (1940)
WWWusage                Internet Usage per Minute
WorldPhones             The World's Telephones
ability.cov             Ability and Intelligence Tests
airmiles                Passenger Miles on Commercial US Airlines, 1937-1960
airquality              New York Air Quality Measurements
anscombe                Anscombe's Quartet of 'Identical' Simple Linear Regressions
attenu                  The Joyner-Boore Attenuation Data
attitude                The Chatterjee-Price Attitude Data
austres                 Quarterly Time Series of the Number of Australian Residents
beaver1 (beavers)       Body Temperature Series of Two Beavers
beaver2 (beavers)       Body Temperature Series of Two Beavers
cars                    Speed and Stopping Distances of Cars
chickwts                Chicken Weights by Feed Type
co2                     Mauna Loa Atmospheric CO2 Concentration
crimtab                 Student's 3000 Criminals Data
discoveries             Yearly Numbers of Important Discoveries
esoph                   Smoking, Alcohol and (O)esophageal Cancer
euro                    Conversion Rates of Euro Currencies
euro.cross (euro)       Conversion Rates of Euro Currencies
eurodist                Distances Between European Cities and Between US Cities
faithful                Old Faithful Geyser Data
fdeaths (UKLungDeaths)  Monthly Deaths from Lung Diseases in the UK
freeny                  Freeny's Revenue Data
freeny.x (freeny)       Freeny's Revenue Data
freeny.y (freeny)       Freeny's Revenue Data
infert                  Infertility after Spontaneous and Induced Abortion
iris                    Edgar Anderson's Iris Data
iris3                   Edgar Anderson's Iris Data
islands                 Areas of the World's Major Landmasses
ldeaths (UKLungDeaths)  Monthly Deaths from Lung Diseases in the UK
lh                      Luteinizing Hormone in Blood Samples
longley                 Longley's Economic Regression Data
lynx                    Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths)  Monthly Deaths from Lung Diseases in the UK
morley                  Michelson Speed of Light Data
mtcars                  Motor Trend Car Road Tests
nhtemp                  Average Yearly Temperatures in New Haven
nottem                  Average Monthly Temperatures at Nottingham, 1920-1939
npk                     Classical N, P, K Factorial Experiment
occupationalStatus      Occupational Status of Fathers and their Sons
precip                  Annual Precipitation in US Cities
presidents              Quarterly Approval Ratings of US Presidents
pressure                Vapor Pressure of Mercury as a Function of Temperature
quakes                  Locations of Earthquakes off Fiji
randu                   Random Numbers from Congruential Generator RANDU
rivers                  Lengths of Major North American Rivers
rock                    Measurements on Petroleum Rock Samples
sleep                   Student's Sleep Data
stack.loss (stackloss)  Brownlee's Stack Loss Plant Data
stack.x (stackloss)     Brownlee's Stack Loss Plant Data
stackloss               Brownlee's Stack Loss Plant Data
state.abb (state)       US State Facts and Figures
state.area (state)      US State Facts and Figures
state.center (state)    US State Facts and Figures
state.division (state)  US State Facts and Figures
state.name (state)      US State Facts and Figures
state.region (state)    US State Facts and Figures
state.x77 (state)       US State Facts and Figures
sunspot.month           Monthly Sunspot Data, from 1749 to "Present"
sunspot.year            Yearly Sunspot Data, 1700-1988
sunspots                Monthly Sunspot Numbers, 1749-1983
swiss                   Swiss Fertility and Socioeconomic Indicators (1888) Data
treering                Yearly Treering Data, -6000-1979
trees                   Girth, Height and Volume for Black Cherry Trees
uspop                   Populations Recorded by the US Census
volcano                 Topographic Information on Auckland's Maunga Whau Volcano
warpbreaks              The Number of Breaks in Yarn during Weaving
women                   Average Heights and Weights for American Women

Data Frames


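All data in a vector or matrix is coerced to a single data type. Data frames let you overcome this limitation: each column of a data frame can hold a different type. A data frame prints as a labelled table, much like the WorldPhones data shown here (WorldPhones itself is stored as a matrix; as.data.frame(WorldPhones) converts it to a data frame):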

> WorldPhones
        N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076
>

Creating Data Frame


A new data frame object can be created using the function data.frame()

> empty <- data.frame() # empty data frame
> vector.1 <- 1:10 # vector of integers
> vector.2 <- letters[1:10] # vector of strings
> df <- data.frame(column.1=vector.1,column.2=vector.2)
> 
> df
    column.1 column.2
1         1        a
2         2        b
3         3        c
4         4        d
5         5        e
6         6        f
7         7        g
8         8        h
9         9        i
10       10        j
>
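
To see that each column of a data frame keeps its own type, while a matrix forces everything into one type, consider this small sketch. The data frame people and its contents are invented for the example.

```r
people <- data.frame(
    name   = c("Ann", "Bob"),
    age    = c(31L, 45L),
    member = c(TRUE, FALSE),
    stringsAsFactors = FALSE    # keep name as character on older R versions
)

# One class per column.
col.types <- sapply(people, class)
print(col.types)    # character, integer, logical

# Converting to a matrix coerces every value to a single type (character here).
as.mat <- as.matrix(people)
print(class(as.mat[1, "age"]))    # "character"
```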

Importing and Exporting Data


You can export and import the data frame to a CSV file. This is useful for saving the context of a data operation.

> write.csv(df, file='mydata.csv')     # Save the data frame to CSV file
>

You can load the contents from the CSV file as below

> d2 <- read.csv('mydata.csv')         # Load the data frame from CSV file
> d2
    X column.1 column.2
1   1        1        a
2   2        2        b
3   3        3        c
4   4        4        d
5   5        5        e
6   6        6        f
7   7        7        g
8   8        8        h
9   9        9        i
10 10       10        j

Please note that there is a difference in what we saved and what we read from the file. The row numbers are also saved in the CSV and then loaded as an independent column when reading from the CSV.
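
If you do not want that extra column, write.csv() accepts row.names = FALSE. The sketch below uses a temporary file and a small made-up data frame so that it does not overwrite mydata.csv:

```r
df.demo  <- data.frame(column.1 = 1:3, column.2 = letters[1:3])
csv.path <- tempfile(fileext = ".csv")    # throwaway file for the example

# Skip the row names when saving.
write.csv(df.demo, file = csv.path, row.names = FALSE)

d3 <- read.csv(csv.path)
print(colnames(d3))    # "column.1" "column.2", no extra X column

unlink(csv.path)       # clean up the temporary file
```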

Analyzing Data Frames


While analyzing the data, it is very useful if we can have an initial idea about the kind of data present in the data frame - the columns, the data type, max/min/mean values for numbers, etc. R provides a good set of utilities to make this job simpler. Let us try to understand the data set state.x77 (strictly speaking a matrix, but the utilities below work the same way on data frames).

Head / Tail


The data set is too big to be inspected by eye. We can get a very basic glimpse of the data by using head().

> head(state.x77)
            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
>

Or you can use tail() to get the last 6 rows.

> tail(state.x77)
                Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
Vermont              472   3907        0.6    71.64    5.5    57.1   168  9267
Virginia            4981   4701        1.4    70.08    9.5    47.8    85 39780
Washington          3559   4864        0.6    71.72    4.3    63.5    32 66570
West Virginia       1799   3617        1.4    69.48    6.7    41.6   100 24070
Wisconsin           4589   4468        0.7    72.48    3.0    54.5   149 54464
Wyoming              376   4566        0.6    70.29    6.9    62.9   173 97203
>

Please note that 6 is just the default number of rows for head and tail. You can always override it using the second parameter.

> head(mtcars, 3)
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
> 
> tail(mtcars, 3)
                mpg cyl disp  hp drat   wt qsec vs am gear carb
Ferrari Dino  19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
Maserati Bora 15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
Volvo 142E    21.4   4  121 109 4.11 2.78 18.6  1  1    4    2
>

Summary and Structure


R also provides two more utility functions that help you understand the data

> #Structure
> str(state.x77)
    num [1:50, 1:8] 3615 365 2212 2110 21198 ...
    - attr(*, "dimnames")=List of 2
    ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
    ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
> 
> summary(state.x77)
    Population        Income       Illiteracy       Life Exp         Murder          HS Grad          Frost             Area       
    Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96   Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
    1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12   1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
    Median : 2838   Median :4519   Median :0.950   Median :70.67   Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
    Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88   Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
    3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89   3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81163  
    Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60   Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432  
>

Counts


There is another set of functions that report the dimensions and labels of the data. They work on both matrices and data frames; here they are applied to state.x77.

> ncol(state.x77)
[1] 8
> nrow(state.x77)
[1] 50
> 
> colnames(state.x77)
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"     "HS Grad"    "Frost"      "Area"      
> rownames(state.x77)
    [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"       "California"     "Colorado"       "Connecticut"    "Delaware"       "Florida"       
[10] "Georgia"        "Hawaii"         "Idaho"          "Illinois"       "Indiana"        "Iowa"           "Kansas"         "Kentucky"       "Louisiana"     
[19] "Maine"          "Maryland"       "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"    "Missouri"       "Montana"        "Nebraska"      
[28] "Nevada"         "New Hampshire"  "New Jersey"     "New Mexico"     "New York"       "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina" "South Dakota"   "Tennessee"      "Texas"          "Utah"           "Vermont"       
[46] "Virginia"       "Washington"     "West Virginia"  "Wisconsin"      "Wyoming"       
>

Filter Data


You can also filter the data to get a subset of what is available in the data frame. For example, if we want to pull out only those cars that have 5 gears:

> mtcars[mtcars$gear == 5, ]
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
>

We can also use logical operators in the condition. For example, if we want the additional criterion that the car should also have more than 4 cylinders, we can do this:

> mtcars[mtcars$gear == 5 & mtcars$cyl > 4, ]
                mpg cyl disp  hp drat   wt qsec vs am gear carb
Ford Pantera L 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
Ferrari Dino   19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
Maserati Bora  15.0   8  301 335 3.54 3.57 14.6  0  1    5    8
>

We can also perform statistical operations on this data:

> mean(mtcars[mtcars$hp > 100 & mtcars$wt > 2.5, ]$mpg)
[1] 16.86364
>

Indexing Data Frames


By indexing, we can obtain subsets of a given data frame. Often we also need to add new rows and columns to a data frame. R provides two functions for this: rbind and cbind.

Add Row


Let's first check out the row bind functionality. To start with, pick two parts of the mtcars dataset.

> df1 = mtcars[1:5, 1:5]
> df1
                    mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15
> 
> df2 = mtcars[6, 1:5]
> df2
            mpg cyl disp  hp drat
Valiant 18.1   6  225 105 2.76

Now, we can join these using rbind

> df <- rbind(df1, df2)
> df
                    mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15
Valiant           18.1   6  225 105 2.76
>

Add Column


Similarly, we can also join columns using the cbind command.

> df1 <- mtcars[1:5, 1:5]
> df2 <- mtcars[1:5, 6:7]
> 
> df1
                    mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15
> df2
                        wt  qsec
Mazda RX4         2.620 16.46
Mazda RX4 Wag     2.875 17.02
Datsun 710        2.320 18.61
Hornet 4 Drive    3.215 19.44
Hornet Sportabout 3.440 17.02
>

Now, we can merge these using the cbind function

> cbind(df1, df2)
                    mpg cyl disp  hp drat    wt  qsec
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02
Datsun 710        22.8   4  108  93 3.85 2.320 18.61
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02
>

Of course, for cbind and rbind to work properly, the other dimension must match. When appending rows using rbind, the columns of both data frames must match; when appending columns using cbind, the number of rows must match.
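
The dimension requirement can be checked with a small sketch. Both calls below are deliberately given mismatched inputs and are wrapped in tryCatch so that the error is turned into FALSE instead of stopping the script. The names rbind_ok and cbind_ok are made up for this example.

```r
df1 <- mtcars[1:5, 1:5]

# rbind: the new row has only 4 of the 5 columns, so this fails.
rbind_ok <- tryCatch({
  rbind(df1, mtcars[6, 1:4])
  TRUE
}, error = function(e) FALSE)

# cbind: the second data frame has 4 rows instead of 5, so this fails too.
cbind_ok <- tryCatch({
  cbind(df1, mtcars[1:4, 6:7])
  TRUE
}, error = function(e) FALSE)

rbind_ok  # FALSE
cbind_ok  # FALSE
```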

Lists


Lists are objects that can contain elements of different types, such as numbers, strings, vectors, data frames and even other lists. A list can also contain a matrix or a function as an element. Lists are typically used for organizing data rather than processing it.

Creating a List


Lists are created using the list() function. The following example creates a list containing a data frame, a character vector, a numeric vector, a logical value and two numbers.

# Create a list containing a data frame, two vectors, a logical value
# and two numbers.
> list_data <- list(mtcars[1:5,], c('A', 'Sample', 'Vector'), c(21,32,11), TRUE, 51.23, 119.1)
> list_data
[[1]]
                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

[[2]]
[1] "A"      "Sample" "Vector"

[[3]]
[1] 21 32 11

[[4]]
[1] TRUE

[[5]]
[1] 51.23

[[6]]
[1] 119.1

>

As you can see, each item in the list is associated with an index number, shown as [[1]], [[2]] and so on. We can also assign names to these elements.

Naming List Elements


The list elements can be given names and they can be accessed using these names.

> # Create a named list containing a data frame, two vectors, a logical value and two numbers.
> list_data <- list(df = mtcars[1:5,], vec1 = c('A', 'Sample', 'Vector'), vec2 = c(21,32,11), bln = TRUE, num1 = 51.23, num2 = 119.1)
> list_data
$df
                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

$vec1
[1] "A"      "Sample" "Vector"

$vec2
[1] 21 32 11

$bln
[1] TRUE

$num1
[1] 51.23

$num2
[1] 119.1

>

You can check the names in a list using the names() function

> names(list_data)
[1] "df"   "vec1" "vec2" "bln"  "num1" "num2"

We can also assign new names to the elements

> names(list_data) <- c("Data Frame", "Vector 1", "Vector 2", "Boolean", "Number 1", "Number 2")
>

This updates the names of the list elements

> list_data
$`Data Frame`
                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

$`Vector 1`
[1] "A"      "Sample" "Vector"

$`Vector 2`
[1] 21 32 11

$Boolean
[1] TRUE

$`Number 1`
[1] 51.23

$`Number 2`
[1] 119.1

>

Accessing List Elements


Elements of a list can be accessed by their index. In a named list, they can also be accessed by name.

> list_data[1]
$`Data Frame`
                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

> 
> list_data$`Vector 1`
[1] "A"      "Sample" "Vector"
>
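
Note that single brackets return a list containing the selected element, which is why the output above is still printed under its name, whereas double brackets and $ return the element itself. A minimal sketch, using a small list whose names are made up for the example:

```r
# A small named list; the names nums and flag are hypothetical.
demo <- list(nums = c(21, 32, 11), flag = TRUE)

# Single brackets return a list containing the selected element.
class(demo[1])        # "list"

# Double brackets (or $) return the element itself.
class(demo[[1]])      # "numeric"
demo[["nums"]][2]     # 32
demo$flag             # TRUE
```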

Manipulating List Elements


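We can add, delete and update list elements. The simplest way to add an element is at the end of the list, while any existing element can be updated or deleted. The example below deletes the fourth element (Boolean) by assigning NULL to it.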

> list_data[4] <- NULL
> list_data
$`Data Frame`
                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

$`Vector 1`
[1] "A"      "Sample" "Vector"

$`Vector 2`
[1] 21 32 11

$`Number 1`
[1] 51.23

$`Number 2`
[1] 119.1

>
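
Adding and updating follow the same assignment pattern as deleting. The following is a small sketch on a throwaway list; the name items and its contents are made up for the example.

```r
items <- list("a", 2, TRUE)

# Add: assign to the index just past the current end.
items[[length(items) + 1]] <- "new element"

# Update: assign a new value to an existing index.
items[[2]] <- "updated element"

# Delete: assign NULL; the later elements shift down by one.
items[1] <- NULL

length(items)   # 3
items[[1]]      # "updated element"
```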

Merging Lists


You can merge several lists into one list by passing them to the c() function.

# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")

# Merge the two lists.
merged.list <- c(list1,list2)

# Print the merged list.
merged.list

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] "Sun"

[[5]]
[1] "Mon"

[[6]]
[1] "Tue"
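
It is worth seeing why c() is used here rather than list(). The sketch below compares the two; the names flat and nested are only illustrative.

```r
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")

# c() concatenates the elements into one flat list of six elements.
flat <- c(list1, list2)

# list() instead wraps the two lists, giving a list of two lists.
nested <- list(list1, list2)

length(flat)     # 6
length(nested)   # 2
```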

Converting List to Vector


A list can be converted to a vector so that the elements of the vector can be used for further manipulation. All the arithmetic operations on vectors can be applied after the list is converted into vectors. To do this conversion, we use the unlist() function. It takes the list as input and produces a vector.

> list1 <- list(1:5)
> list1
[[1]]
[1] 1 2 3 4 5

> 
> list2 <- list(10:14)
> list2
[[1]]
[1] 10 11 12 13 14

> 
> v1 <- unlist(list1)
> v2 <- unlist(list2)
> 
> v1
[1] 1 2 3 4 5
> v2
[1] 10 11 12 13 14
> 
> result <- v1+v2
> result
[1] 11 13 15 17 19
>
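
Because a vector can hold only one type, unlist() coerces mixed elements to a common type, and it carries over element names. A short sketch with made-up values:

```r
# Mixed types are coerced to a common type, here character.
mixed <- unlist(list(1, "a", TRUE))
mixed                     # "1" "a" "TRUE"

# Names are carried over; elements longer than one get numbered suffixes.
named <- unlist(list(a = 1:2, b = 3))
names(named)              # "a1" "a2" "b"
```

Arithmetic such as v1 + v2 therefore only makes sense when every element of the list is numeric.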

File IO


R deals with data, so it has functions for various aspects of data processing. But how does it get this data? Reading from an input file is a very important aspect of data processing. R provides simple functions for reading and writing data in various file formats.

File Dump


R allows you to dump data into a file. Such a file can only be read back by R.

> df = mtcars
> save(file = "file.out", compress = T, list = c("df"))
>

This saves the contents of df into file.out. The same data can be loaded back from the file using the load function

> load(file = "file.out")
> head(df, 3)
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
>

Note the parameter compress = T, which produces a compressed output file. If you open the generated file, you will find that it is binary and not human-readable. You also have the option to use ascii = T, which generates a file with ASCII content.
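
To see that the data really comes back from the file, the object can be removed before loading. The sketch below assumes a temporary file (via tempfile) instead of file.out so that the working directory is left untouched; the names path and restored are made up for the example.

```r
# Write to a temporary file rather than the working directory.
path <- tempfile(fileext = ".RData")

df <- mtcars
save(df, file = path)      # objects can also be passed directly by name

# Remove the object, then restore it from the file.
rm(df)
restored <- load(path)     # load() returns the names of the restored objects

restored                   # "df"
identical(df, mtcars)      # TRUE
file.remove(path)
```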

Dump Everything


There is an extension of this method that lets you dump everything in memory. You can specify a file name; otherwise, the data is saved to ".RData"

> save.image()
>

You can also set ascii and compress here. The load function does not change: it reads the data from the specified file and restores the variables in memory.

> load(".RData")
>

CSV File


You can also save a data frame in the form of a CSV file

> head(df)
                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
>
> write.csv(df, file="df.csv")
> df = read.csv("df.csv")
> head(df)
                    X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
>

Note that on reading the CSV file, the row names are treated as the first column of the data frame (named X here).
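
If the original row names should be restored instead, read.csv can be told to use the first column as row names through its row.names argument. A minimal sketch, assuming a temporary file rather than df.csv; the names plain and restored are only illustrative.

```r
path <- tempfile(fileext = ".csv")
write.csv(mtcars, file = path)

# Default: the row names come back as an ordinary first column
# (named "X" because its header cell in the CSV is empty).
plain <- read.csv(path)

# row.names = 1 tells read.csv() to use the first column as row names.
restored <- read.csv(path, row.names = 1)

names(plain)[1]            # "X"
head(rownames(restored), 3)
file.remove(path)
```

Alternatively, write.csv(df, file = "df.csv", row.names = FALSE) leaves the row names out of the file altogether.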

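Table


A data frame can also be written to a plain text table with write.table and read back with read.table. Unlike in the CSV example, the row names are restored as row names.
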
> head(df)
                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> 
> write.table(df, "df.table")
> 
> head(read.table("df.table"))
                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
>
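
By default write.table separates fields with spaces and quotes strings. The separator can be changed with the sep argument, as in this sketch of a tab-separated round trip. It assumes a temporary file; the names path and back are made up for the example.

```r
path <- tempfile(fileext = ".tsv")

# Write a tab-separated file; quote = FALSE leaves strings unquoted.
write.table(mtcars, file = path, sep = "\t", quote = FALSE)

# Read it back with the same separator. The header line has one field
# fewer than the data lines, so the first column becomes the row names.
back <- read.table(path, sep = "\t", header = TRUE)

dim(back)                  # 32 11
head(rownames(back), 3)
file.remove(path)
```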