Moving to Blogger!

From now on I will be using Blogger - http://rationshop.blogspot.com/

Thanks for visiting!

FEMA’s Flood Insurance Program - Analysis & Maps using R

The National Flood Insurance Program (NFIP), run by the Federal Emergency Management Agency (FEMA) in the United States, enables property owners in participating communities to purchase insurance protection from the government against losses from flooding (http://en.wikipedia.org/wiki/National_Flood_Insurance_Program). A number of recent studies have analyzed the NFIP and its feasibility.

To my knowledge, there is no readily available code to analyze the NFIP data and statistics provided by FEMA (http://www.fema.gov/policy-claim-statistics-flood-insurance/policy-claim-statistics-flood-insurance/policy-claim-13).

Here is my attempt to clean and format the FEMA data, along with some interesting graphics.

Graphics, code and data are here - https://github.com/RationShop/nfip

I wanted to make these maps interactive using googleVis, but googleVis does not appear to support county-level resolution (it only seems to work at the state and metro resolutions).
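For reference, a state-resolution map takes only a few lines; the numbers below are made-up placeholders, not NFIP statistics.

# sketch: state-level choropleth with googleVis; "provinces" (states) is
# the finest resolution gvisGeoChart supports for the US
library(googleVis)
stateData <- data.frame(state = c("California", "Texas", "Florida"), 
    policies = c(100, 200, 300))  # placeholder numbers
stateMap <- gvisGeoChart(stateData, locationvar = "state", colorvar = "policies", 
    options = list(region = "US", displayMode = "regions", resolution = "provinces"))
plot(stateMap)  # opens the interactive map in a browser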

American Indians vs Indian Americans, statewide population using googleVis

The population of Native Americans in the United States is estimated to be anywhere up to 20 million (http://en.wikipedia.org/wiki/Native_Americans_in_the_United_States). I wanted to find out the population of Native Americans (originally referred to as American Indians by immigrants) and compare it to the number of Indians from India (Indian Americans).

I wanted to do this comparison by county, but the US Census Bureau and other Federal websites are unavailable because of the Government shutdown (as of Oct 7, 2013). Hence, I obtained statewide totals of American Indians and Indian Americans from several online sources and made the maps below using R and googleVis.

The graphics and code are here - http://rpubs.com/RationShop/indians_americans

All code and data are here - https://github.com/RationShop/indians

In states with big cities, the number of Indian Americans exceeds the number of American Indians. Moreover, I am not sure whether the statewide totals include or exclude Indian immigrants who are not citizens of the United States. Also, the spatial patterns at the county resolution are likely to be different. I can dig into this a little deeper only once the Census Bureau websites are back online.

US High School Graduation Rates, using googleVis

Here is an analysis and some graphics using googleVis on high school graduation rates in the United States.

A report from the US Department of Education on increasing high school graduation rates was published earlier this year (Jan 2013), and some summary graphics were shown on the Department’s blog (http://www.ed.gov/blog/2013/01/high-school-graduation-rate-at-highest-level-in-three-decades). However, that blog post shows only one static map.

Here is my attempt to do it better using R, googleVis and Rmarkdown - http://rpubs.com/RationShop/afgr

All relevant files are on my GitHub site - https://github.com/RationShop/afgr

R code to obtain and plot rainfall data for the whole world

Hydrological and climatological studies sometimes require rainfall data for the entire world over long periods of time. The Climate Prediction Center (CPC) provides a good resource - daily data from 1979 to the present - at its ftp site (ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/). However, there is no readily available R code or package for creating rainfall maps of the whole world, and the data publicly available from research institutions is generally not in plain text or other familiar formats.

I created R code to download rain/snow (or precipitation, to be scientific) data from CPC’s ftp site and plot it. All my code is available at my GitHub site - https://github.com/RationShop/cpcRain.
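For reference, below is a minimal sketch of reading and mapping a single day of the global data. The file name is just an example, and the binary layout (a 0.5-degree, 720 x 360 global grid of little-endian 4-byte floats, with precipitation in units of 0.1 mm) is my assumption based on the documentation at the ftp site - verify both before use.

# sketch: read and map one day of CPC global gauge data
# assumed layout: 720 x 360 grid of little-endian 4-byte floats, 0.1 mm units
inFile <- "PRCP_CU_GAUGE_V1.0GLB_0.50deg.lnx.20130101.gz"  # example file name
con <- gzfile(inFile, open = "rb")
prcp <- readBin(con, what = "numeric", n = 720 * 360, size = 4, endian = "little")
close(con)
prcp[prcp < 0] <- NA  # negative values flag missing data
prcpGrid <- matrix(prcp/10, nrow = 720, ncol = 360)  # convert to mm
# quick plot; grid assumed to run 0-360 longitude and -90 to 90 latitude
image(seq(0.25, 359.75, by = 0.5), seq(-89.75, 89.75, by = 0.5), prcpGrid, 
    xlab = "longitude", ylab = "latitude", main = "CPC daily precipitation (mm)")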

The Rmarkdown document showing some examples is at my RPubs site - http://rpubs.com/RationShop/cpcRain

R code for generating multi-site stochastic precipitation

In water resource management, climate change studies, hydrology and related disciplines, long time series of precipitation/rainfall data are required. Since historical records are relatively short, typically 50 years or less, mathematical/statistical models are used to generate synthetic data of the required length. The generation process requires that the spatio-temporal statistics of the observations be preserved in the synthetic data. There are a number of procedures in the literature, largely beginning in the late 1990s with the work of Wilks (see the references further below). However, the code required to generate the synthetic data is typically not available.

I created R code for the computationally and conceptually simple implementation of the modified Wilks approach of Mhanna and Bauwens (2012). See my GitHub repo - https://github.com/RationShop/StochasticPrecipitation
Two R packages have also been made available recently (see the links at the above GitHub site).

For the example dataset provided, the R code runs in a few minutes or less. Someday I plan to turn this into a package.
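To give a flavor of the approach, below is a minimal sketch of the core idea behind multi-site occurrence generation - correlated latent Gaussians thresholded at each site's wet-day probability. The site count, latent correlation and wet-day probabilities are made-up illustrative numbers; in the actual method the latent correlation is calibrated so that the occurrence correlations match observations.

# sketch: multi-site occurrence via correlated latent Gaussians
set.seed(1)
nSites <- 3
nDays <- 1000
omega <- matrix(0.6, nSites, nSites)  # assumed latent correlation
diag(omega) <- 1
pWet <- c(0.3, 0.25, 0.35)  # assumed wet-day probability at each site
# correlated standard normals via the Cholesky factor of omega
z <- matrix(rnorm(nDays * nSites), nDays, nSites) %*% chol(omega)
# a day is wet at a site when its normal exceeds the dry-probability quantile
occ <- sweep(z, 2, qnorm(1 - pWet), ">") * 1
colMeans(occ)  # close to pWet
cor(occ)  # positive, but weaker than omega - hence the calibration step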

Fixing non-positive definite correlation matrices using R

Problem

When a correlation or covariance matrix is not positive definite (i.e., when some or all of its eigenvalues are negative), a Cholesky decomposition cannot be performed. Sometimes these eigenvalues are very small negative numbers that arise from rounding or from noise in the data. In simulation studies, a known/given correlation has to be imposed on an input dataset; in such cases one has to make the correlation matrix positive definite. The following papers in the field of stochastic precipitation use such matrices.

  • FP Brissette, M Khalili, R Leconte, Journal of Hydrology, 2007, “Efficient stochastic generation of multi-site synthetic precipitation data”
  • GA Baigorria, JW Jones, Journal of Climate, 2010, “GiST: A stochastic model for generating spatially and temporally correlated daily rainfall data”
  • M Mhanna and W Bauwens, International Journal of Climatology, 2012, “A stochastic space-time model for the generation of daily rainfall in the Gaza Strip”

A Solution

The paper by Rebonato and Jackel, “The most general methodology for creating a valid correlation matrix for risk management and option pricing purposes”, Journal of Risk, Vol 2, No 2, 2000, presents a methodology to create a positive definite matrix out of a non-positive definite matrix.

Below is my attempt to reproduce the example from Rebonato and Jackel (2000); the correlation matrix below is taken from their example.


origMat <- array(c(1, 0.9, 0.7, 0.9, 1, 0.3, 0.7, 0.3, 1), dim = c(3, 
    3))

origEig <- eigen(origMat)
origEig$values
## [1]  2.296728  0.710625 -0.007352

As you can see, the third eigenvalue is negative. Trying a Cholesky decomposition on this matrix fails, as expected.

cholStatus <- try(u <- chol(origMat), silent = FALSE)
cholError <- inherits(cholStatus, "try-error")  # inherits() is robust; class() can have length > 1

Here, I use the method of Rebonato and Jackel (2000), as elaborated by Brissette et al. (2007), to fix the correlation matrix. As per the method, replace the negative eigenvalues with 0 (or a small positive number, as Brissette et al. 2007 suggest), reconstruct the matrix from the eigenvectors, and then normalize it so that the diagonal is all 1s.

# fix the correl matrix
newMat <- origMat

iter <- 0
while (cholError) {

    iter <- iter + 1
    cat("iteration ", iter, "\n")

    # replace -ve eigen values with small +ve number
    newEig <- eigen(newMat)
    newEig2 <- ifelse(newEig$values < 0, 0, newEig$values)

    # create modified matrix eqn 5 from Brissette et al 2007, inv = transp for
    # eig vectors
    newMat <- newEig$vectors %*% diag(newEig2) %*% t(newEig$vectors)

    # normalize modified matrix eqn 6 from Brissette et al 2007
    newMat <- newMat/sqrt(diag(newMat) %*% t(diag(newMat)))

    # try chol again
    cholStatus <- try(u <- chol(newMat), silent = TRUE)
    cholError <- inherits(cholStatus, "try-error")
}
## iteration  1 
## iteration  2 

# final check
eigen(newMat)$values
## [1]  2.290e+00  7.096e-01 -1.332e-15

Unresolved Issue(?)

However, as you can see, the third eigenvalue is still negative (but very close to zero). The “chol” function in R does not give an error, probably because this negative eigenvalue is within numerical tolerance; at about -1.3e-15, it is on the order of machine epsilon (.Machine$double.eps, about 2.2e-16). I would like to know exactly what these tolerance limits are.
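As an aside, the Matrix package offers an off-the-shelf alternative: nearPD() implements Higham’s (2002) algorithm for finding the nearest positive definite matrix, and with corr = TRUE it also keeps the unit diagonal. A minimal sketch:

# off-the-shelf alternative from the Matrix package
library(Matrix)
pdMat <- nearPD(origMat, corr = TRUE)
eigen(as.matrix(pdMat$mat))$values  # non-negative, up to numerical tolerance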

Long-term precipitation data for your location from the US Historical Climatology Network using R

Abstract

Long-term daily precipitation records from the US Historical Climatology Network (USHCN) are processed and summarized using R. Below is R code which examines the data from Livermore, CA, the closest available USHCN station to my present home in Fremont, CA.

Introduction

Long-term daily precipitation (rain/snow, and also temperature) records for the United States are available from the United States Historical Climatology Network (USHCN). These records typically span about 100 years or longer. The USHCN stations are a subset of the larger/denser (but shorter-record) observation network of the Global Historical Climatology Network (GHCN).

I obtained the data for California, along with the other relevant files, from here.

R code

Identify the station nearest to your location. For me, it is Livermore, CA.

rm(list = ls())

workDir <- "C:/Rstuff/DATA/ushcn/"
setwd(workDir)

# id of my nearest station (Livermore, CA), identified from
# 'ushcn-stations.txt'
myId <- "044997"

Following is the format of the data.

# format of the data, from 'data_format.txt'
# (Each record in a file contains one month of daily data.)
#
# Variable  Columns  Type
# COOP ID   1-6      Character
# YEAR      7-10     Integer
# MONTH     11-12    Integer
# ELEMENT   13-16    Character
# VALUE1    17-21    Integer
# MFLAG1    22       Character
# QFLAG1    23       Character
# SFLAG1    24       Character
# VALUE2    25-29    Integer
# MFLAG2    30       Character
# QFLAG2    31       Character
# SFLAG2    32       Character
# ...

Process precipitation data for your state and extract a subset of data corresponding to your station.

# read data for all stations in the state
allData <- readLines("state04_CA.txt")

# extract station ids from the data
idData <- substr(allData, 1, 6)
# create a new data frame, with ids as the first column of the frame
newData <- data.frame(idData, allData, stringsAsFactors = FALSE)
# extract data corresp to your nearest station
myData <- subset(newData, idData == myId)
myData <- myData[, 2]  #throw away the previously added first column

The function below, used later, determines the number of days in a month, accounting for leap years.

# function to compute the days in a month; inputs are year and month
# (this is the best I could do without downloading external R libraries)
FnDaysInMonth <- function(yr, mo) {
    date1 <- paste(yr, mo, "01", sep = "-")  #current month, day 1

    mo2 <- ifelse(mo < 12, mo + 1, 1)
    yr2 <- ifelse(mo < 12, yr, yr + 1)
    date2 <- paste(yr2, mo2, "01", sep = "-")  #next month, day 1

    return(as.numeric(difftime(as.Date(date2), as.Date(date1))))
}
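
A quick sanity check of the function (the dates below are arbitrary examples):

FnDaysInMonth(2000, 2)  # 29 - leap year
FnDaysInMonth(1900, 2)  # 28 - century years are leap only if divisible by 400
FnDaysInMonth(2013, 12)  # 31 - the December-to-January rollover is handled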

Open an output file to store the processed data; it is read back later for plotting.

outFile <- file(paste(myId, ".txt", sep = ""), "wt")

Each line of the data file contains the data for all the days in one month. Discard the temperature records and read only “PRCP”. Also check the data quality flags.

## each line is 1 month of data
for (eachLine in 1:length(myData)) {
    yrVar <- as.numeric(substr(myData[eachLine], 7, 10))
    moVar <- as.numeric(substr(myData[eachLine], 11, 12))
    metVar <- substr(myData[eachLine], 13, 16)

    # only extract precipitation info
    if (metVar == "PRCP") {
        ## for each day of the month , check the data flags and get the data
        for (eachDay in 1:FnDaysInMonth(yrVar, moVar)) {
            dayOffset <- 17 + ((eachDay - 1) * 8)
            metVal <- as.numeric(substr(myData[eachLine], dayOffset, dayOffset + 
                4))
            mflag <- substr(myData[eachLine], dayOffset + 5, dayOffset + 5)  #is irrelevant
            qflag <- substr(myData[eachLine], dayOffset + 6, dayOffset + 6)  #should be blank
            sflag <- substr(myData[eachLine], dayOffset + 7, dayOffset + 7)  #should not be blank

            # write to output
            if (qflag == " " & sflag != " ") {
                writeLines(paste(yrVar, moVar, eachDay, metVal, sep = ","), 
                  outFile)
            }
        }
    }
}
close(outFile)

Read the data back for the summary graphs.

prcp <- read.csv(paste(myId, ".txt", sep = ""), header = FALSE, sep = ",", 
    as.is = TRUE)
colnames(prcp) <- c("yr", "mo", "day", "val")
prcp$val <- prcp$val/100  #convert hundredths of inches to inches

Finally, the graphs:

# yearly total
yrtot <- aggregate(val ~ yr, data = prcp, FUN = sum)
png(filename = "fig1.png")
plot(yrtot$val ~ yrtot$yr, type = "h", main = "Annual Precipitation Total (inches), Livermore, CA", 
    ylab = "inches/year", xlab = "year")
garbage <- dev.off()

# monthly total
montot <- aggregate(val ~ yr + mo, data = prcp, FUN = sum)
png(filename = "fig2.png")
boxplot(montot$val ~ montot$mo, range = 0, main = "Monthly Precipitation Total (inches), Livermore, CA", 
    ylab = "inches/month", xlab = "calendar month")
garbage <- dev.off()

# number of rainy days per month; a rainy day has a rain amount > 0.01 inches
prcp$val <- ifelse(prcp$val <= 0.01, 0, 1)
raindays <- aggregate(val ~ yr + mo, data = prcp, FUN = sum)
png(filename = "fig3.png")
boxplot(raindays$val ~ raindays$mo, range = 0, main = "Monthly Rainy Days, Livermore, CA", 
    ylab = "days/month", xlab = "calendar month")
garbage <- dev.off()

Figures 1-3: annual precipitation totals (inches/year), monthly precipitation totals (inches/month), and monthly rainy days (days/month) for Livermore, CA.

Summary

This code, when modified slightly, could also be used to read GHCN daily precipitation data.
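Essentially, only the column offsets change. Below is a sketch of parsing one line of a GHCN-Daily (.dly) file; the offsets (an 11-character station id, then VALUE/MFLAG/QFLAG/SFLAG in 8 characters per day for a fixed 31 days) follow my reading of the GHCN-Daily readme and should be verified against the current documentation.

# sketch: parse one line of a GHCN-Daily (.dly) file; offsets assumed from
# the GHCN-Daily readme
FnParseGhcnLine <- function(lineVar) {
    vals <- sapply(1:31, function(eachDay) {
        dayOffset <- 22 + ((eachDay - 1) * 8)
        as.numeric(substr(lineVar, dayOffset, dayOffset + 4))
    })
    list(id = substr(lineVar, 1, 11), 
        yr = as.numeric(substr(lineVar, 12, 15)), 
        mo = as.numeric(substr(lineVar, 16, 17)), 
        element = substr(lineVar, 18, 21), 
        values = vals)  # -9999 denotes missing
}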

California High School Graduation and Dropout Rates

Abstract

The California Department of Education recently (June 2012) had a news release on the increase in high school (grades 9-12) graduation rates and the decrease in dropout rates. The data used by the Department came from two cohorts (4-year periods) on a statewide basis. While the statewide numbers show an increase (decrease) in graduation (dropout) rates, a particular school district’s numbers do not necessarily follow the statewide trend. Moreover, the longer-term data (about 20 years) on graduation and dropout rates have been ignored by the Department in its analysis. Following is my attempt to dig deeper into the data - just for kicks!

Overview

The California Department of Education recently (June 2012, news article) publicized the increase in high school graduation rates and the decrease in dropout rates. Newspapers have also reported on this topic.

However, the data is available for only two cohorts (‘09 and ‘10). Moreover, the available cohort data is at an aggregated scale (school district or state), not for each individual school.

I am interested in individual school performance - particularly the five high schools in the Fremont Unified district, since I currently live here. The following is my attempt to dig deeper into this data, of course, using R. The data sources are noted in the code comments below; I downloaded all the files (2 on cohort rates, 1 with the school list and 20 on dropout rates) before running the following R code.

Analysis

Reproduce Dept of Education’s news release data

First, I try to reproduce the results from the news release.

rm(list = ls())

workDir <- "C:/Rstuff/DATA/CA Schools/"
setwd(workDir)

# read data
coh09 <- read.csv("cohort09.txt", header = TRUE, sep = "\t", as.is = TRUE)
coh10 <- read.csv("cohort10.txt", header = TRUE, sep = "\t", as.is = TRUE)

News release numbers are based on “Statewide” data. Extract only this subset and produce a table similar to the one in the news release.

subsetName <- "Statewide"
# cohort09
data09 <- subset(coh09, Name == subsetName & (Subgroup == "All" | 
    Subgrouptype == "All"))
# cohort10
data10 <- subset(coh10, Name == subsetName & (Subgroup == "All" | 
    Subgrouptype == "All"))

Reproduce table of Graduation Rates

grad09 <- data09[, c("Subgroup", "Subgrouptype", "NumGraduates", 
    "Cohort.Graduation.Rate")]
colnames(grad09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910")
grad10 <- data10[, c("Subgroup", "Subgrouptype", "NumGraduates", 
    "Cohort.Graduation.Rate")]
colnames(grad10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011")

# create a table similar to the news release
gradTable <- merge(grad09, grad10)
gradTable$Type <- c(gradTable$Subgrouptype[1:9], gradTable$Subgroup[10:16])
gradTable <- gradTable[, c("Type", "Per0910", "Num0910", "Per1011", 
    "Num1011")]
row.names(gradTable) <- gradTable$Type
gradTable <- gradTable[c("All", "5", "1", "2", "3", "4", "6", "7", 
    "9", "0", "EL", "MIG", "SE", "SD"), ]  #re-order to match news release order
gradTable$Change <- as.numeric(gradTable$Per1011 - gradTable$Per0910)
gradTable
##     Type Per0910    Num0910 Per1011    Num1011 Change
## All  All   74.72 378976       76.26 382558       1.54
## 5      5   68.08 159364       70.41 167886       2.33
## 1      1   67.32 2933         68.00 2692         0.68
## 2      2   89.01 40384        89.66 39717        0.65
## 3      3   72.33 2520         74.26 2432         1.93
## 4      4   87.34 12002        89.00 12104        1.66
## 6      6   60.54 25224        62.76 24917        2.22
## 7      7   83.52 130539       85.42 124863       1.90
## 9      9   82.80 4534         81.36 5311        -1.44
## 0      0   53.81 1476         46.26 2636        -7.55
## EL    EL   56.39 53015        60.21 60280        3.82
## MIG  MIG   71.14 10790        71.86 9794         0.72
## SE    SE   56.72 34385        59.04 34156        2.32
## SD    SD   68.04 204189       69.94 219856       1.90

Reproduce table of Dropout Rates

drop09 <- data09[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")]
colnames(drop09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910")
drop10 <- data10[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")]
colnames(drop10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011")

# create a table similar to the news release
dropTable <- merge(drop09, drop10)
dropTable$Type <- c(dropTable$Subgrouptype[1:9], dropTable$Subgroup[10:16])
dropTable <- dropTable[, c("Type", "Per0910", "Num0910", "Per1011", 
    "Num1011")]
row.names(dropTable) <- dropTable$Type
dropTable <- dropTable[c("All", "5", "1", "2", "3", "4", "6", "7", 
    "9", "0", "EL", "MIG", "SE", "SD"), ]  #re-order to match news release order
dropTable$Change <- as.numeric(dropTable$Per1011 - dropTable$Per0910)
dropTable
##     Type Per0910    Num0910 Per1011    Num1011 Change
## All  All    16.6 84298         14.4 72314        -2.2
## 5      5    20.8 48706         17.7 42126        -3.1
## 1      1    22.1 962           20.7 818          -1.4
## 2      2     7.2 3270           6.2 2764         -1.0
## 3      3    19.6 683           17.4 571          -2.2
## 4      4     7.8 1078           6.7 906          -1.1
## 6      6    26.7 11143         24.7 9791         -2.0
## 7      7    10.7 16759          8.9 12980        -1.8
## 9      9    10.2 556           11.2 730           1.0
## 0      0    41.6 1141          28.6 1628        -13.0
## EL    EL    29.0 27264         24.8 24858        -4.2
## MIG  MIG    18.8 2845          17.3 2352         -1.5
## SE    SE    21.9 13296         18.4 10628        -3.5
## SD    SD    20.1 60309         17.6 55483        -2.5

Note

  • The data used by the news release appears to be similar, but not identical, to the data currently available on the web. The discrepancies between the above numbers and those in the news article are attributed to the slightly different datasets.
  • Also, the graduation and dropout percentages for a category need not add up to 100 because some students take longer than 4 years to graduate and are not (yet) accounted for here.
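
As an aside, the table-building steps above are repeated verbatim in the Fremont Unified analysis below. A small helper function (a sketch, not part of the original analysis) could wrap those steps:

# sketch: wrap the repeated table-building steps into a function
FnMakeTable <- function(d09, d10, numCol, rateCol) {
    t09 <- d09[, c("Subgroup", "Subgrouptype", numCol, rateCol)]
    colnames(t09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910")
    t10 <- d10[, c("Subgroup", "Subgrouptype", numCol, rateCol)]
    colnames(t10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011")
    out <- merge(t09, t10)
    out$Type <- c(out$Subgrouptype[1:9], out$Subgroup[10:16])
    out <- out[, c("Type", "Per0910", "Num0910", "Per1011", "Num1011")]
    row.names(out) <- out$Type
    out <- out[c("All", "5", "1", "2", "3", "4", "6", "7", "9", "0", 
        "EL", "MIG", "SE", "SD"), ]  #re-order to match news release order
    out$Change <- as.numeric(out$Per1011 - out$Per0910)
    out
}
# gradTable <- FnMakeTable(data09, data10, "NumGraduates", "Cohort.Graduation.Rate")
# dropTable <- FnMakeTable(data09, data10, "NumDropouts", "Cohort.Dropout.Rate")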

Repeat the analysis on Fremont Unified School District

subsetName <- "Fremont Unified"
# cohort09
data09 <- subset(coh09, Name == subsetName & (Subgroup == "All" | 
    Subgrouptype == "All"))
# cohort10
data10 <- subset(coh10, Name == subsetName & (Subgroup == "All" | 
    Subgrouptype == "All"))

Table of Graduation Rates

grad09 <- data09[, c("Subgroup", "Subgrouptype", "NumGraduates", 
    "Cohort.Graduation.Rate")]
colnames(grad09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910")
grad10 <- data10[, c("Subgroup", "Subgrouptype", "NumGraduates", 
    "Cohort.Graduation.Rate")]
colnames(grad10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011")

# create a table similar to the news release
gradTable <- merge(grad09, grad10)
gradTable$Type <- c(gradTable$Subgrouptype[1:9], gradTable$Subgroup[10:16])
gradTable <- gradTable[, c("Type", "Per0910", "Num0910", "Per1011", 
    "Num1011")]
row.names(gradTable) <- gradTable$Type
gradTable <- gradTable[c("All", "5", "1", "2", "3", "4", "6", "7", 
    "9", "0", "EL", "MIG", "SE", "SD"), ]  #re-order to match news release order
gradTable$Change <- as.numeric(gradTable$Per1011 - gradTable$Per0910)
gradTable
##     Type Per0910    Num0910 Per1011    Num1011 Change
## All  All   82.74 2177         89.34 1952         6.60
## 5      5   69.06 279          79.42 220         10.36
## 1      1   90.00          *  100.00          *  10.00
## 2      2   89.63 1055         93.49 977          3.86
## 3      3   80.95 17           68.42 13         -12.53
## 4      4   80.37 131          92.25 119         11.88
## 6      6   67.86 76           82.52 85          14.66
## 7      7   82.19 586          87.98 505          5.79
## 9      9   81.48 22           85.71 24           4.23
## 0      0   50.00          *   83.33          *  33.33
## EL    EL   68.56 205          73.06 179          4.50
## MIG  MIG   71.43          *   85.71          *  14.28
## SE    SE   55.47 137          65.41 121          9.94
## SD    SD   72.69 503          79.42 436          6.73

Table of Dropout Rates

drop09 <- data09[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")]
colnames(drop09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910")
drop10 <- data10[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")]
colnames(drop10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011")

# create a table similar to the news release
dropTable <- merge(drop09, drop10)
dropTable$Type <- c(dropTable$Subgrouptype[1:9], dropTable$Subgroup[10:16])
dropTable <- dropTable[, c("Type", "Per0910", "Num0910", "Per1011", 
    "Num1011")]
row.names(dropTable) <- dropTable$Type
dropTable <- dropTable[c("All", "5", "1", "2", "3", "4", "6", "7", 
    "9", "0", "EL", "MIG", "SE", "SD"), ]  #re-order to match news release order
dropTable$Change <- as.numeric(dropTable$Per1011 - dropTable$Per0910)
dropTable
##     Type Per0910    Num0910 Per1011    Num1011 Change
## All  All     5.1 133            5.5 120           0.4
## 5      5     9.7 39            13.4 37            3.7
## 1      1     0.0          *     0.0          *    0.0
## 2      2     1.7 20             2.5 26            0.8
## 3      3     4.8          *    10.5          *    5.7
## 4      4     4.9          *     6.2          *    1.3
## 6      6    17.0 19             9.7          *   -7.3
## 7      7     5.8 41             5.9 34            0.1
## 9      9    11.1          *    10.7          *   -0.4
## 0      0    50.0          *     0.0          *  -50.0
## EL    EL    11.4 34            16.3 40            4.9
## MIG  MIG     7.1          *    14.3          *    7.2
## SE    SE    12.6 31            11.4 21           -1.2
## SD    SD     9.4 65            11.8 65            2.4

Some Observations

  • Statewide graduation rates, consistent with the news release, have increased across the different ethnic and other demographic groups considered.
  • The Fremont Unified School District’s graduation rates show a similar pattern - an increase across the groups.
  • Statewide dropout rates have decreased across most categories. However, Fremont Unified’s dropout rates have increased across most categories.
  • For whatever reason, the results in the news release are not presented by gender.

Closer look at Fremont Unified’s High Schools

Now I use the annual enrollment and dropout numbers, 1992-2011, to examine the time series of dropout rates for the five high schools in Fremont Unified.

Identify the high schools in Fremont Unified - I look only at public schools.

# school info from http://www.cde.ca.gov/ds/si/ds/pubschls.asp
allSchools <- read.csv("pubschls.txt", header = TRUE, sep = "\t", 
    as.is = TRUE, colClasses = "character")
# Fremont specific data
freSchools <- subset(allSchools, County == "Alameda" & District == 
    "Fremont Unified" & StatusType == "Active" & SOC == 66)
freSchools$School
## [1] "American High"         "Irvington High"        "John F. Kennedy High" 
## [4] "Mission San Jose High" "Washington High"      

For each year, extract the total dropouts (“DTOT”) and total enrollment (“ETOT”) for grades 9 through 12. Estimate the average dropout rate as the fraction DTOT/ETOT.

# dropout data from http://www.cde.ca.gov/ds/sd/sd/filesdropouts.asp
extns <- substr(as.character(seq(1992, 2011, by = 1)), 3, 4)
freData <- NULL  #stores summary
for (eachYr in extns) {
    dropData <- read.csv(paste("dropouts", eachYr, ".txt", sep = ""), header = TRUE, 
        sep = "\t", as.is = TRUE, colClasses = "character")

    # Fremont specific data
    yearData <- subset(dropData, CDS_CODE %in% freSchools$CDSCode)
    yearData$ETOT <- as.numeric(yearData$ETOT)
    yearData$DTOT <- as.numeric(yearData$DTOT)
    # total dropouts and total enrolled for each school by year
    yearSumm <- aggregate(cbind(DTOT, ETOT) ~ CDS_CODE + YEAR, data = yearData, 
        FUN = sum)
    # update output
    freData <- rbind(freData, yearSumm)
}

Plot the results.

schoolCodes <- levels(as.factor(freData$CDS_CODE))
schoolNames <- freSchools$School[match(schoolCodes, freSchools$CDSCode)]  #align names with the sorted codes
png(filename = "fredrop.png")
par(mfrow = c(5, 1), mar = c(1, 4, 3, 2))
yearNames <- seq(1992, 2011, by = 1)
for (eachSchool in 1:length(schoolCodes)) {
    subData <- subset(freData, CDS_CODE == schoolCodes[eachSchool])
    barplot(subData$DTOT * 100/subData$ETOT, xlab = "", ylim = c(0, 5), ylab = "drop %", 
        names.arg = yearNames, main = schoolNames[eachSchool])
}
garbage <- dev.off()

Figure: annual dropout rates (%) for the five Fremont Unified high schools, 1992-2011.

From the graph it is evident that between 2010 and 2011 the dropout rates at American and Irvington decreased, while those at Kennedy, Mission and Washington increased. However, both the increases and decreases in recent dropout rates are relatively small. Fremont schools have pretty low dropout rates!

Summary

While what I did here sheds light on some of the issues with the Department’s analysis, it is by no means complete. The University of California, Santa Barbara has an entire project dedicated to analyzing dropout rates - the UCSB Dropout Research Project.

So many ways to Tuesday…

My first post on R, using the tools Markdown and knitr.

Purpose

I often deal with data that have both spatial (geographical) and temporal attributes, and I often end up plotting such data under some time constraint. Still, I keep thinking of ways to plot the data better. Here is one attempt to plot such typical data in as many “unique” ways as possible. The constraints are -

  • data should all be in one plot (multiple panels allowed)
  • convey all the information (and more) in the data and
  • the code should not be too long.

I picked the San Francisco Bay Area census data on population. This dataset has the population of each of the 9 Bay Area counties for the period 1860 through 2010.

First, I had to format the data. Comparing raw population numbers is not easy, so I computed the population percentage (of the total Bay Area population) for each county and each decade. Then, I used the formatted data to create some plots.

# analyze population of Bay Area counties
# data from http://www.bayareacensus.ca.gov/historical/copop18602000.htm,
# saved to a file called 'SFbayarea_pop.txt'

rm(list = ls())

# all data resides here
setwd("C:/Rstuff/DATA/")

# read data ----
popData <- read.csv("SFbayarea_pop.txt", header = TRUE, sep = "\t", 
    colClasses = "character")

# original data
head(popData)
##      Alameda Contra.Costa  Marin   Napa San.Francisco San.Mateo
## 1860   8,927        5,328  3,334  5,521        56,802     3,214
## 1870  24,237        8,461  6,903  7,163       149,473     6,635
## 1880  62,976       12,525 11,324 13,235       233,959     8,669
## 1890  93,864       13,515 13,072 16,411       298,997    10,087
## 1900 130,197       18,046 15,702 16,451       342,782    12,094
## 1910 246,131       31,674 25,114 19,800       416,912    26,585
##      Santa.Clara Solano Sonoma TOTAL.BAY.AREA
## 1860      11,912  7,169 11,867        114,074
## 1870      26,246 16,871 19,819        265,808
## 1880      35,039 18,475 25,926        422,128
## 1890      48,005 20,946 32,721        547,618
## 1900      60,216 24,143 38,480        658,111
## 1910      83,539 27,559 48,394        925,708

Format data for later use: remove commas in numbers and “.”s in county names:

colnames(popData) <- gsub(".", " ", colnames(popData), fixed = TRUE)
saveDimNames <- dimnames(popData)  # save names for later use

# remove commas in the numbers
popData <- as.data.frame(lapply(popData, function(x) {
    as.numeric(gsub(",", "", x))
}))
# names disappear after the above operation; rename columns
dimnames(popData) <- saveDimNames

# pop as percent of total
popFrac <- popData * 100/popData$TOTAL

# remove last ('total') column
popData <- popData[, -ncol(popData)]
popFrac <- popFrac[, -ncol(popFrac)]

# data, after formatting
head(popData)
##      Alameda Contra Costa Marin  Napa San Francisco San Mateo Santa Clara
## 1860    8927         5328  3334  5521         56802      3214       11912
## 1870   24237         8461  6903  7163        149473      6635       26246
## 1880   62976        12525 11324 13235        233959      8669       35039
## 1890   93864        13515 13072 16411        298997     10087       48005
## 1900  130197        18046 15702 16451        342782     12094       60216
## 1910  246131        31674 25114 19800        416912     26585       83539
##      Solano Sonoma
## 1860   7169  11867
## 1870  16871  19819
## 1880  18475  25926
## 1890  20946  32721
## 1900  24143  38480
## 1910  27559  48394
head(round(popFrac, 1))
##      Alameda Contra Costa Marin Napa San Francisco San Mateo Santa Clara
## 1860     7.8          4.7   2.9  4.8          49.8       2.8        10.4
## 1870     9.1          3.2   2.6  2.7          56.2       2.5         9.9
## 1880    14.9          3.0   2.7  3.1          55.4       2.1         8.3
## 1890    17.1          2.5   2.4  3.0          54.6       1.8         8.8
## 1900    19.8          2.7   2.4  2.5          52.1       1.8         9.1
## 1910    26.6          3.4   2.7  2.1          45.0       2.9         9.0
##      Solano Sonoma
## 1860    6.3   10.4
## 1870    6.3    7.5
## 1880    4.4    6.1
## 1890    3.8    6.0
## 1900    3.7    5.8
## 1910    3.0    5.2

# color scheme for the plots
plotClr <- rainbow(ncol(popData))
popMax <- ceiling(max(popData)/10^5)

Figure 1

png(filename = "fig1.png")
layout(matrix(c(1, 1, 1, 2, 2), 5, 1, byrow = TRUE))
matplot(rownames(popData), popData[, 1:9]/10^5, type = "o", col = plotClr, 
    pch = 1:9, xaxt = "n", xlab = "Year", ylab = "Pop (x 100,000)", ylim = c(0, 
        popMax), main = "Population of Bay Area Counties", cex = 2, cex.axis = 1.2, 
    cex.lab = 1.2)
axis(1, at = seq(1860, 2010, 10), cex.axis = 1.2, cex.lab = 1.2)
legend("top", ncol = 3, legend = colnames(popData), cex = 1.2, col = plotClr, 
    pch = 1:9)
matplot(rownames(popFrac), round(popFrac[, 1:9], 1), type = "o", 
    col = plotClr, pch = 1:9, xaxt = "n", xlab = "Year", ylab = "% of Bay Area Tot.", 
    ylim = c(0, max(popFrac)), main = "County Pop as % of Bay Area Total", cex = 2, 
    cex.axis = 1.2, cex.lab = 1.2)
axis(1, at = seq(1860, 2010, 10), cex.axis = 1.2, cex.lab = 1.2)
garbage <- dev.off()

Figure 1: line plots of county populations (top) and each county’s share of the Bay Area total (bottom), 1860-2010.

Figure 2

png(filename = "fig2.png")
par(mfrow = c(3, 3), mar = c(5, 4, 2, 5) + 0.1, oma = c(0, 0, 4, 
    0))
for (eachCnty in 1:ncol(popData)) {
    barplot(popData[, eachCnty]/10^5, names.arg = rownames(popData), xlab = "Year", 
        ylab = "Pop (x 100,000)", main = colnames(popData)[eachCnty])
    par(new = TRUE)
    plot(rownames(popFrac), popFrac[, eachCnty], type = "o", col = "blue", xaxt = "n", 
        yaxt = "n", xlab = "", ylab = "")
    axis(4, col = "blue", col.ticks = "blue", col.axis = "blue")
    mtext("% of Bay Area Pop", side = 4, line = 3, cex = 0.7, col = "blue")
}
mtext("Population of Bay Area Counties", side = 3, line = 1, outer = TRUE, 
    font = 2)
garbage <- dev.off()

Figure 2: per-county bar plots of population, with each county’s percentage of the Bay Area total overlaid in blue.

Figure 3

png(filename = "fig3.png")
layout(matrix(c(1, 1, 1, 1, 2:17), 5, 4, byrow = TRUE))
par(mar = c(2, 4, 2, 2), oma = c(0, 0, 4, 0))
barplot(0, 0, axes = FALSE)
legend("center", ncol = 3, legend = colnames(popData), cex = 1.3, 
    fill = plotClr)
for (eachYr in 1:nrow(popData)) {
    barplot(as.vector(unlist(popData[eachYr, ]/10^5)), col = plotClr, names.arg = NULL, 
        xlab = "", ylab = "Pop (x 100,000)", main = rownames(popData)[eachYr])
}
mtext("Population of Bay Area Counties", side = 3, line = 1, outer = TRUE, 
    font = 2)
garbage <- dev.off()

Figure 3: bar plots of county populations, one panel per decade, with a shared color legend.

Figure 4

png(filename = "fig4.png")
layout(matrix(c(1, 2, 2), 3, 1, byrow = TRUE))
par(mar = c(4, 4, 2, 2), oma = c(0, 0, 3, 0))
plot(rownames(popData), rowSums(popData)/10^5, type = "o", xaxt = "n", 
    xlab = "", ylab = "Pop (x 100,000)", cex = 2, cex.axis = 1.2, cex.lab = 1.2)
axis(1, at = seq(1860, 2010, 10), cex.axis = 1.2, cex.lab = 1.2)
barplot(t(as.matrix(popFrac)), main = "County Pop as % of Bay Area Total", 
    ylim = c(0, 120), yaxt = "n", col = plotClr, cex.axis = 1.2, cex.lab = 1.2, 
    cex.names = 1.2)
axis(2, at = seq(0, 100, 20), cex.axis = 1.2, cex.lab = 1.2)
legend("top", ncol = 3, legend = colnames(popFrac), cex = 1, fill = plotClr)
mtext("Population of Bay Area Counties", side = 3, line = 1, outer = TRUE, 
    cex = 1, font = 2)
garbage <- dev.off()

Figure 4: total Bay Area population (top) and stacked bars of county percentages (bottom).

Summary

I tried 4 different ways of plotting the population data. Each of the 4 methods has pros and cons; my favorite is the last one. I would also like to add a figure with a map of the counties, color-coded by population (or population percentage). That requires additional data, such as GIS boundaries of the counties. I will try to do this for the next post.