Introduction to R

# This is the editor and the part below the white line is called the
# console.

# The hashtag is used to add comments

# An addition
5 + 5

# A substraction
5 – 5

# A division
(5 + 5)/2

# A plot
plot(cars, type = “o”, col = “blue”)

# Read in your data set and assign to the object:
present = read.table(“”)
# This will print your data set in the console:

# The present data frame is already loaded. Print the number of rows and
# variables with the ‘dim’ function:

# Print the names of the variables of the data frame:

# The present dataset is already loaded. Find the number of boys, and
# assign your answer to
boys = present$boys
# Find the number of girls, and assign your answer to
girls = present$girls
# This will print the results

# The present dataset is already loaded.

# Type here the code to create the plot:
plot(present$year, present$girls)

# The present dataset is already loaded.

# Create the plot here:
year = present$year
girls_born = present$girls
plot(year, girls_born, t =”l”)

# Look for the documentation of the read.table function.

# The present dataset is already loaded.
# The vector babies:
babies = present$boys + present$girls
year = present$year

# Your plot
plot(year, babies, type = “l”)

# The present dataset is already loaded.

# Check when boys outnumber girls
boys_outnumber_girls = present$boys > present$girls

# The present dataset is already loaded. Plot the boy-to-girl ratio for
# every year:
year = present$year
boys_to_girls_ratio = present$boys/present$girls
plot(year, boys_to_girls_ratio, type = “l”)


Defining Cell Names

Watch the video.

  1. Right click and press Defining Names and type in the name
  2. Click on Formula on the ribbon and click Defining Names under Formula Auditing
  3. Just enter the name on the name tag which is located on the left side of formula bar.

Rules for defining cell names

  1. Can only begin with a letter, / (backslash) or _ (underscore)
  2. No spaces allowed in between eg: Sales Tax has to be SalesTax or Sales_Tax or Sales.Tax
  3. case insensitive eg: Peter, peTer, petER or PETEr
  4. No cell references allowed eg: A45, C$3

Linear Regression in R NBA

In this exercise, we’ll apply some of the ideas from baseball to data from the National Basketball Association (NBA). Please download the datasets NBA_train.csv and NBA_test.csv, and save them to a location on your computer that you will remember. This data comes from

A script file containing all of the R commands used in this recitation can be downloaded here.

# Read in the data
NBA = read.csv(“NBA_train.csv”)

# How many wins to make the playoffs?
table(NBA$W, NBA$Playoffs)

# Compute Points Difference

# Check for linear relationship
plot(NBA$PTSdiff, NBA$W)

# Linear regression model for wins
WinsReg = lm(W ~ PTSdiff, data=NBA)

# Linear regression model for points scored
PointsReg = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + BLK + TOV + STL, data=NBA)

# Sum of Squared Errors
SSE = sum(PointsReg$residuals^2)

# Root mean squared error
RMSE = sqrt(SSE/nrow(NBA))

# Remove insignifcant variables
PointsReg3 = lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data=NBA)

# Compute SSE and RMSE again
SSE = sum(PointsReg3$residuals^2)
RMSE = sqrt(SSE/nrow(NBA))

# Check for correlations
RegVar = NBA[c(“X2PA”, “X3PA”, “FTA”, “AST”, “ORB”, “STL”)]

# Read in test set
NBA_test = read.csv(“NBA_test.csv”)

# Make predictions on test set
PointsPredictions = predict(PointsReg, newdata=NBA_test)

# Compute out-of-sample R^2
SSE = sum((PointsPredictions – NBA_test$PTS)^2)
SST = sum((mean(NBA$PTS) – NBA_test$PTS)^2)
R2 = 1 – SSE/SST

# Compute the RMSE
RMSE = sqrt(SSE/nrow(NBA_test))



, ,


In the wake of the Great Recession of 2009, there has been a good deal of focus on employment statistics, one of the most important metrics policymakers use to gauge the overall strength of the economy. In the United States, the government measures unemployment using the Current Population Survey (CPS), which collects demographic and employment information from a wide range of Americans each month. In this exercise, we will employ the topics reviewed in the lectures as well as a few new techniques using the September 2013 version of this rich, nationally representative dataset (available online).

The observations in the dataset represent people surveyed in the September 2013 CPS who actually completed a survey. While the full dataset has 385 variables, in this exercise we will use a more compact version of the dataset, CPSData.csv, which has the following variables:

PeopleInHousehold: The number of people in the interviewee’s household.

Region: The census region where the interviewee lives.

State: The state where the interviewee lives.

MetroAreaCode: A code that identifies the metropolitan area in which the interviewee lives (missing if the interviewee does not live in a metropolitan area). The mapping from codes to names of metropolitan areas is provided in the file MetroAreaCodes.csv.

Age: The age, in years, of the interviewee. 80 represents people aged 80-84, and 85 represents people aged 85 and higher.

Married: The marriage status of the interviewee.

Sex: The sex of the interviewee.

Education: The maximum level of education obtained by the interviewee.

Race: The race of the interviewee.

Hispanic: Whether the interviewee is of Hispanic ethnicity.

CountryOfBirthCode: A code identifying the country of birth of the interviewee. The mapping from codes to names of countries is provided in the file CountryCodes.csv.

Citizenship: The United States citizenship status of the interviewee.

EmploymentStatus: The status of employment of the interviewee.

Industry: The industry of employment of the interviewee (only available if they are employed).

Load the dataset from CPSData.csv into a data frame called CPS, and view the dataset with the summary() and str() commands.

How many interviewees are in the dataset?

> CPS <- read.csv(“CPSData.csv”)
> str(CPS)

> summary(CPS)

Among the interviewees with a value reported for the Industry variable, what is the most common industry of employment? Please enter the name exactly how you see it.

> sort(table(CPS$Industry))

Recall from the homework assignment “The Analytical Detective” that you can call the sort() function on the output of the table() function to obtain a sorted breakdown of a variable. For instance, sort(table(CPS$Region)) sorts the regions by the number of interviewees from that region.

Which state has the fewest interviewees?

Which state has the largest number of interviewees?

> sort(table(CPS$State))

What proportion of interviewees are citizens of the United States?

> table(CPS$Citizenship)

The CPS differentiates between race (with possible values American Indian, Asian, Black, Pacific Islander, White, or Multiracial) and ethnicity. A number of interviewees are of Hispanic ethnicity, as captured by the Hispanic variable. For which races are there at least 250 interviewees in the CPS dataset of Hispanic ethnicity?

> hispanic <- subset(CPS, CPS$Hispanic==”1″)
> sort(table(hispanic$Race))

Which variables have at least one interviewee with a missing (NA) value?

> summary(CPS)

Often when evaluating a new dataset, we try to identify if there is a pattern in the missing values in the dataset. We will try to determine if there is a pattern in the missing values of the Married variable. The function$Married) returns a vector of TRUE/FALSE values for whether the Married variable is missing. We can see the breakdown of whether Married is missing based on the reported value of the Region variable with the function table(CPS$Region,$Married)). Which is the most accurate:

> table(CPS$Region,$Married))

> table(CPS$Sex,$Married))

> table(CPS$Age,$Married))

> table(CPS$Citizenship,$Married))

As mentioned in the variable descriptions, MetroAreaCode is missing if an interviewee does not live in a metropolitan area. Using the same technique as in the previous question, answer the following questions about people who live in non-metropolitan areas.

How many states had all interviewees living in a non-metropolitan area (aka they have a missing MetroAreaCode value)? For this question, treat the District of Columbia as a state (even though it is not technically a state).

> nonmetro <- subset(CPS,$MetroAreaCode))

> summary(CPS$State)-summary(nonmetro$State)     note the entire state should be nonmetro

How many states had all interviewees living in a metropolitan area? Again, treat the District of Columbia as a state.

> summary(nonmetro$State)     note the entire state should be a metro

Which region of the United States has the largest proportion of interviewees living in a non-metropolitan area?

> sort(summary(nonmetro$Region)/summary(CPS$Region))     Note I am not comfortable in using table function when this is much more simpler way of doing things.

While we were able to use the table() command to compute the proportion of interviewees from each region not living in a metropolitan area, it was somewhat tedious (it involved manually computing the proportion for each region) and isn’t something you would want to do if there were a larger number of options. It turns out there is a less tedious way to compute the proportion of values that are TRUE. The mean() function, which takes the average of the values passed to it, will treat TRUE as 1 and FALSE as 0, meaning it returns the proportion of values that are true. For instance, mean(c(TRUE, FALSE, TRUE, TRUE)) returns 0.75. Knowing this, use tapply() with the mean function to answer the following questions:

Which state has a proportion of interviewees living in a non-metropolitan area closest to 30%?

Which state has the largest proportion of non-metropolitan interviewees, ignoring states where all interviewees were non-metropolitan?

> sort(summary(nonmetro$State)/summary(CPS$State))     Again I dont know why I dont conform to what is told by the instructors but I find my way a bit simpler. Beside it is just plain wrong to undermine the power of R as it is lot more versatile then what is given in the instruction. Anyway that is my excuse.

Codes like MetroAreaCode and CountryOfBirthCode are a compact way to encode factor variables with text as their possible values, and they are therefore quite common in survey datasets. In fact, all but one of the variables in this dataset were actually stored by a numeric code in the original CPS datafile.

When analyzing a variable stored by a numeric code, we will often want to convert it into the values the codes represent. To do this, we will use a dictionary, which maps the the code to the actual value of the variable. We have provided dictionaries MetroAreaCodes.csv and CountryCodes.csv, which respectively map MetroAreaCode and CountryOfBirthCode into their true values. Read these two dictionaries into data frames MetroAreaMap and CountryMap.

How many metropolitan areas are stored in MetroAreaMap?

> MetroAreaMap <- read.csv(“MetroAreaCodes.csv”)
> CountryMap <- read.csv(“CountryCodes.csv”)

How many countries are stored in CountryMap?

> str(MetroAreaMap)

> str(CountryMap)

To merge in the metropolitan areas, we want to connect the field MetroAreaCode from the CPS data frame with the field Code in MetroAreaMap. The following command merges the two data frames on these columns, overwriting the CPS data frame with the result:

CPS = merge(CPS, MetroAreaMap, by.x=”MetroAreaCode”, by.y=”Code”, all.x=TRUE)

The first two arguments determine the data frames to be merged (they are called “x” and “y”, respectively, in the subsequent parameters to the merge function). by.x=”MetroAreaCode” means we’re matching on the MetroAreaCode variable from the “x” data frame (CPS), while by.y=”Code” means we’re matching on the Code variable from the “y” data frame (MetroAreaMap). Finally, all.x=TRUE means we want to keep all rows from the “x” data frame (CPS), even if some of the rows’ MetroAreaCode doesn’t match any codes in MetroAreaMap (for those familiar with database terminology, this parameter makes the operation a left outer join instead of an inner join).

Review the new version of the CPS data frame with the summary() and str() functions. What is the name of the variable that was added to the data frame by the merge() operation?

How many interviewees have a missing value for the new metropolitan area variable? Note that all of these interviewees would have been removed from the merged data frame if we did not include the all.x=TRUE parameter.

> CPS = merge(CPS, MetroAreaMap, by.x=”MetroAreaCode”, by.y=”Code”, all.x=TRUE)

> str(CPS)

> summary(CPS)

> names(CPS)    my preference as it takes out all the unwanted things.

Which of the following metropolitan areas has the largest number of interviewees?

> sort(summary(CPS$MetroArea))

Which metropolitan area has the highest proportion of interviewees of Hispanic ethnicity? Hint: Use tapply() with mean, as in the previous subproblem. Calling sort() on the output of tapply() could also be helpful here.

> sort(tapply(CPS$Hispanic==”1″,CPS$MetroArea,mean))

Remembering that CPS$Race == “Asian” returns a TRUE/FALSE vector of whether an interviewee is Asian, determine the number of metropolitan areas in the United States from which at least 20% of interviewees are Asian.

> sort(tapply(CPS$Race==”Asian”,CPS$MetroArea,mean))

Normally, we would look at the sorted proportion of interviewees from each metropolitan area who have not received a high school diploma with the command:

sort(tapply(CPS$Education == “No high school diploma”, CPS$MetroArea, mean))

However, none of the interviewees aged 14 and younger have an education value reported, so the mean value is reported as NA for each metropolitan area. To get mean (and related functions, like sum) to ignore missing values, you can pass the parameter na.rm=TRUE. Passing na.rm=TRUE to the tapply function, determine which metropolitan area has the smallest proportion of interviewees who have received no high school diploma.

> sort(tapply(CPS$Education == “No high school diploma”, CPS$MetroArea, mean,na.rm=TRUE))

Just as we did with the metropolitan area information, merge in the country of birth information from the CountryMap data frame, replacing the CPS data frame with the result. If you accidentally overwrite CPS with the wrong values, remember that you can restore it by re-loading the data frame from CPSData.csv and then merging in the metropolitan area information using the command provided in the previous subproblem.

What is the name of the variable added to the CPS data frame by this merge operation?

> CPS = merge(CPS, CountryMap, by.x=”CountryOfBirthCode”, by.y=”Code”, all.x=TRUE)
> names(CPS)

How many interviewees have a missing value for the new country of birth variable?

> summary(CPS$Country)

Among all interviewees born outside of North America, which country was the most common place of birth?

> summary(CPS$Country)     note it is outside of North America and not outside of USA

What proportion of the interviewees from the “New York-Northern New Jersey-Long Island, NY-NJ-PA” metropolitan area have a country of birth that is not the United States? For this computation, don’t include people from this metropolitan area who have a missing country of birth.

> NY <- subset(CPS, CPS$MetroArea==”New York-Northern New Jersey-Long Island, NY-NJ-PA”,na.rm=TRUE)

> NY <- subset(NY, NY$Country!=”NA’s”)     in order to clean the data from NA’s
> NYNonUS <- subset(NY,NY$Country!=”United States”)

> str(NY)

> str(NYNonUS)

Which metropolitan area has the largest number (note — not proportion) of interviewees with a country of birth in India? Hint — remember to include na.rm=TRUE if you are using tapply() to answer this question.

> India <- subset(CPS, CPS$Country==”India”)

> sort(summary(India$MetroArea))

In Brazil?

> Brazil <- subset(CPS, CPS$Country==”Brazil”)

> sort(summary(Brazil$MetroArea))

In Somalia?

> Somalia <- subset(CPS, CPS$Country==”Somalia”)

> sort(summary(Somalia$MetroArea))

Phew! that was a long assignment for first week


Working with Data in R part 2

First go to Go to website and download USDA.csv

> USDA <- read.csv(“USDA.csv”)

> str(USDA)

> summary(USDA)

> names(USDA)

> USDA$Description[which.max(USDA$Sodium)]     To get the description of the observation that has the max Sodium. The answer being Table Salt.

> HighSodium <- subset(USDA, Sodium> 10000)     To get a sub data frame of items with high sodium

> nrow(HighSodium)

[1] 10

> HighSodium$Description     To get names of the items

> match(“CAVIAR”, USDA$Description)     gives the observation number of the Item CAVIAR

> USDA$Sodium[match(“CAVIAR”, USDA$Description)     gives you the sodium content of caviar.

> summary(USDA$Sodium)     to get a summary of Sodium content

> sd(USDA$Sodium)     to get the standard deviation of Sodium across all items. Note since the answer is NA we need to remove NA

> sd(USDA$Sodium, na.rm = TRUE)

> plot(USDA$Protein, USDA$TotalFat, xlab = “Protein, ylab = “Fat”, main = “Protein vs Fat”, col = “Red”)     Note that this changes the plot color to red and nicely labels it instead of default

> hist(USDA$VitaminC, xlab = “Vitamin C”, main = “Histogram of Vitamin C distrubution”, xlim = c(0,100), breaks = 2000)     To create a histogram. xlim specifies the limit on x axis so that we dont have too many outliers and breaks is set as 2000 in order to indicate 1mg of vitamin represents a frequency.

> boxplot(USDA$Carbohydrate, USDA$TotalFat, USDA$Protein, main = “Box plot of three main ingradients”, ylab = “Ingradients in grams”, xlab = “Carbohydrates     Total Fat     Protein”)     Creates a box plot

> HighSodium <- USDA$Sodium > mean(USDA$Sodium, na.rm = TRUE)     To create a vector of boolean if items have above average Sodium content.

> str(HighSodium)

> HighSodium <- as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm = TRUE))     To convert boolean into a vectors of binary (0, 1).

> str(HighSodium)

> USDA$HighSodium <- as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm = TRUE))     To directly add high sodium into USDA data frame.

> str(USDA)

> USDA$HighCarb <- as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm = TRUE))

> USDA$HighFat <- as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm = TRUE))

> USDA$HighProtein <- as.numeric(USDA$Protein > mean(USDA$Protein, na.rm = TRUE))

> str(USDA)

> table(USDA$HighProtein, USDA$HighFat)     to get a table of items with High Protein and High Fat = TRUE

> tapply(USDA$Iron, USDA$HighProtein, mean, na.rm = TRUE)     To analyse the relationship between Iron content and Protein. Remember to put na.rm = TRUE always.

> tapply(USDA$VitaminC, USDA$HighCarb, summary, na.rm = TRUE)     To analyse the Vitamin C content in food with High Carbs.

This concludes part II. I am including a script file and a link for USDA.csv for you to run. Enjoy!

# Reading in the Dataset

# Get the current directory
# Read the csv file
USDA = read.csv(“USDA.csv”)
# Structure of the dataset
# Statistical summary
# Basic Data Analysis

# Vector notation
# Finding the index of the food with highest sodium levels
# Get names of variables in the dataset
# Get the name of the food with highest sodium levels
# Create a subset of the foods with sodium content above 10,000mg
HighSodium = subset(USDA, Sodium>10000)
# Output names of the foods with high sodium content
# Finding the index of CAVIAR in the dataset
match(“CAVIAR”, USDA$Description)
# Find amount of sodium in caviar
# Doing it in one command!
USDA$Sodium[match(“CAVIAR”, USDA$Description)]
# Summary function over Sodium vector
# Standard deviation
sd(USDA$Sodium, na.rm = TRUE)

# Plots

# Scatter Plots
plot(USDA$Protein, USDA$TotalFat)
# Add xlabel, ylabel and title
plot(USDA$Protein, USDA$TotalFat, xlab=”Protein”, ylab = “Fat”, main = “Protein vs Fat”, col = “red”)
# Creating a histogram
hist(USDA$VitaminC, xlab = “Vitamin C (mg)”, main = “Histogram of Vitamin C”)
# Add limits to x-axis
hist(USDA$VitaminC, xlab = “Vitamin C (mg)”, main = “Histogram of Vitamin C”, xlim = c(0,100))
# Specify breaks of histogram
hist(USDA$VitaminC, xlab = “Vitamin C (mg)”, main = “Histogram of Vitamin C”, xlim = c(0,100), breaks=100)
hist(USDA$VitaminC, xlab = “Vitamin C (mg)”, main = “Histogram of Vitamin C”, xlim = c(0,100), breaks=2000)
# Boxplots
boxplot(USDA$Sugar, ylab = “Sugar (g)”, main = “Boxplot of Sugar”)
# Adding a variable

# Creating a variable that takes value 1 if the food has higher sodium than average, 0 otherwise
HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
# Adding the variable to the dataset
USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
# Similarly for HighProtein, HigCarbs, HighFat
USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))
# Summary Tables

# How many foods have higher sodium level than average?
# How many foods have both high sodium and high fat?
table(USDA$HighSodium, USDA$HighFat)
# Average amount of iron sorted by high and low protein?
tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
# Maximum level of Vitamin C in hfoods with high and low carbs?
tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
# Using summary function with tapply
tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)

Working with Data in R

Once you have loaded the data, let us start working with R.

First to segregate a variable, simply type > data$var1 and enter. This will load the Variable var1 from the dataframe.

to get Mean type > mean(data$var1) and enter. Similarly > sd(data$var1) gives you standard deviation.

> summary(data$var1)

> which.min(data$var1)     This gives the observation no of the min

> data$var2[which.min(data$var1)]     This gives the var2 when var1 = min

To generate plots we can use

> plot(data$var1,data$var2)     This gives a plot of var1 vs var2 a dot plot

To find the outliers

> outlier <- subset(data, var1>80 & var2<20)

This gives a subset of outliers

> nrow(outlier)      gives the number of rows in that subset.

> outlier[c(var1,var2)]     This gives a table consist of selected variables of outlier subset

> hist(data$var2)     gives you a histogram of var2

> boxplot(data$var1 ~ data$var2, xlab = “GDP”, ylab = “Life Expectancy, main = “A box plot of GDP vs Life Expectancy of various countries”)     gives you a box plot of var1 vs var2. Note use “~” sign.

> table(data$var3)     This sort of gives the summary for that variable

> tapply(data$var1, data$var3, mean)     This gives the summary of the mean of var1 under var3. Note var3 is an ordinal data such as continent.

Sometimes a few data have NAs in it and this is a problem when you are doing a tapply with min wherein it gives NA one way to handle this is

> tapply(data$var1, data$var3, min, na.rm = TRUE)

Loading Data to R

Step1: First change the directory to the location of the Data.

Step2: Type > getwd() and enter. This will give you the current working directory. This is to confirm we are in the same directory.

Step3: Type > data <- read.csv(“data.csv”) and enter. This is to read in the data.

Step4: Type >str(data) and enter. This shows the structure of the data. Weather numeric, integer or factor and a sample of first few data and then number of observations.

Step5: Type > summary(data) and enter. This gives summary of the entire data frame. The numeric data will have the min, max, median, mean, 1st quartile, 3rd quartile and even NAs. The factor variables gives the first few samples or categories.

Step6: Type > data.sub <- subset(data, region == “Europe”) and enter. This will create a subset of data where the region is Europe for eg.

Step7: you can do a str(data.sub) to check the structure and similarly do the summary too. Note the data differs in terms of mean, median etc.

Step8: Type > write.csv(data.sub, “new_data.csv”) and enter. This will create a new CSV dataframe.

Step9: In order to cleanup some space from the memory you can remove some data frames that you are not working on. To do that type > ls() and enter. This will list all the variables. Now to remove type > rm(data.sub) and enter. This will remove the data frame data.sub. You can confirm again using ls() again.

Microsoft Excel 2014 Bible Book review

I am reading Microsoft Excel 2014 Bible The comprehensive tutorial resource and wanted to post what I have learned from it.

Chapter 1: Introducing Excel

You can go straight to the last topic in the chapter: Creating your first Excel Workbook.

Step 1: File ➪ New ➪ Blank Workbook

Step 2: Type Month into cell A1

Step 3: Type Projected Sales into cell B1

Step 4: Type Jan in A2

Step 5: Drag A2 to A13 to autofill Jan to Dec

Step 6: Type 50,000 on B2

Step 7: On B3 enter formula B3 = B2*1.035 indicating a 3.5% increase in sales month on month

Step 8: Drag A3 to autofill to A13

Now comes of couple of cool steps I fell in love with

Cool Step 9: Select B2 to B13 and on the ribbon click the drop down that says general under File>Number now click on currency (no more right clicking)

Cool Step 10: Select 0 decimal by clicking twice to reduce and round off to the nearest $

Cool Step 11: Select A1 to B13 Choose Table Tools ➪ Design ➪ Table Style Options ➪ Total Row. (Ya this is super cool)

Cool Step 12: Choose Insert ➪ Charts ➪ Recommended Charts.

12b: In the Insert Chart dialog box, click the second recommended chart (a column

chart), and click OK

12c: Chart Tools ➪ Design ➪ Chart Styles options

Now save or print do as you wish that is the gist of Chapter 1