a modeler's tribulations, gopi goteti's web log

  1. Search
  2. About
  3. Comments/Questions?
  4. Subscribe
  5. Archive
  6. Random
  1. Climate, Models & Uncertainty
  2. Python, also for Hydrologists

a modeler's tribulations, gopi goteti's web log

I am a scientist/engineer/hydrologist working in Oakland, California. This blog is about R and data analysis with a flavor of hydrology and some good memories of Python.

  • R code to obtain and plot rainfall data for the whole world

    If you want to create rainfall maps for the whole world in R there is no readily available code or package to do this. Moreover, data publicly available from research institutions is not generally in plain text format or other familiar formats. Hydrological and climatological studies sometimes require rainfall data over the entire world for long periods of time. The Climate Prediction Center’s (CPC) site, daily data from 1979 to present, is a good resource. This data is available at CPC’s ftp site (ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/).

    I created R code to download rain/snow (or precipitation to be scientific) data from the CPC’s ftp site and plot it. All my code is available at my GitHub site - https://github.com/RationShop/cpcRain.

    The Rmarkdown document showing some examples is at my RPubs site - http://rpubs.com/RationShop/cpcRain

    Posted on May 19, 2013

  • R code for generating multi-site stochastic precipitation

    In water resource management, climate change, hydrology and related disciplines long time series of precipitation/rainfall data is required. Since historical records are relatively short, typically 50 years or less, mathematical/statistical models are used to generate synthetic data of required length. This data generation process requires that the spatio-temporal statistics are preserved between the observations and the synthetic data. There are a number of procedures in the literature, largely beginning in the late 1990s with the work of Wilks (see references below). However, the code required to generate the synthetic data is typically not available.

    I created R code for the computationally and conceptually simple implementation of the modified Wilks approach of Mhanna and Bauwens (2012). See my GitHub repo - https://github.com/RationShop/StochasticPrecipitation
    There are also two R packages which have been made available recently (see links at the above GitHub site).

    The R code, for the example dataset provided, takes only a few minutes or less. Someday I plan to make a package out of this.

    Posted on May 15, 2013

  • Fixing non positive definite correlation matrices using R

    Problem

    When a correlation or covariance matrix is not positive definite (i.e., in instances when some or all eigenvalues are negative), a cholesky decomposition cannot be performed. Sometimes, these eigenvalues are very small negative numbers and occur due to rounding or due to noise in the data. In simulation studies a known/given correlation has to be imposed on an input dataset. In such cases one has to deal with the issue of making a correlation matrix positive definite. Following are papers in the field of stochastic precipitation where such matrices are used.

    • FP Brissette, M Khalili, R Leconte, Journal of Hydrology, 2007, “Efficient stochastic generation of multi-site synthetic precipitation data”
    • GA Baigorria, JW Jones, Journal of Climate, 2010, “GiST: A stochastic model for generating spatially and temporally correlated daily rainfall data”
    • M Mhanna and W Bauwens, International Journal of Climatology, 2012, “A stochastic space-time model for the generation of daily rainfall in the Gaza Strip”

    A Solution

    The paper by Rebonato and Jackel, “The most general methodology for creating a valid correlation matrix for risk management and option pricing purposes”, Journal of Risk, Vol 2, No 2, 2000, presents a methodology to create a positive definite matrix out of a non-positive definite matrix.

    Below is my attempt to reproduce the example from Rebonato and Jackel (2000). The correlation matrix below is from the example.

    
    origMat <- array(c(1, 0.9, 0.7, 0.9, 1, 0.3, 0.7, 0.3, 1), dim = c(3, 
        3))
    
    origEig <- eigen(origMat)
    origEig$values
    
    ## [1]  2.296728  0.710625 -0.007352
    

    As you can see, the third eigenvalue is negative. Trying a cholesky decomposition on this matrix fails, as expected.

    cholStatus <- try(u <- chol(origMat), silent = FALSE)
    cholError <- ifelse(class(cholStatus) == "try-error", TRUE, FALSE)
    

    Here, I use the method of Rebonato and Jackel (2000), as elaborated by Brissette et al. (2007), to fix the correlation matrix. As per the method, replace the negative eigenvalues with 0 (or a small positive number as Brissette et al. 2007 suggest), then normalize the new vector.

    # fix the correl matrix
    newMat <- origMat
    
    iter <- 0
    while (cholError) {
    
        iter <- iter + 1
        cat("iteration ", iter, "\n")
    
        # replace -ve eigen values with small +ve number
        newEig <- eigen(newMat)
        newEig2 <- ifelse(newEig$values < 0, 0, newEig$values)
    
        # create modified matrix eqn 5 from Brissette et al 2007, inv = transp for
        # eig vectors
        newMat <- newEig$vectors %*% diag(newEig2) %*% t(newEig$vectors)
    
        # normalize modified matrix eqn 6 from Brissette et al 2007
        newMat <- newMat/sqrt(diag(newMat) %*% t(diag(newMat)))
    
        # try chol again
        cholStatus <- try(u <- chol(newMat), silent = TRUE)
        cholError <- ifelse(class(cholStatus) == "try-error", TRUE, FALSE)
    }
    
    ## iteration  1 
    ## iteration  2 
    
    
    # final check
    eigen(newMat)$values
    
    ## [1]  2.290e+00  7.096e-01 -1.332e-15
    

    Unresolved Issue(?)

    However, as you can see, the third eigenvalue is still negative (but very close to zero). The “chol” function in R is not giving an error probably because this negative eigenvalue is within the “tolerance limits”. I would like to know what these “tolerance limits” are.

    Tagged: cholesky non positive definite correlation matrix ill conditioned matrix stochastic precipitation

    Posted on October 14, 2012

  • Long-term precipitation data for your location from the US Historical Climatology Network using R

    Abstract

    Long-term daily precipitation records from the US Historical Climatology Network (USHCN) are processed and summarized using R. Below is R code which examines the data from Livermore, CA, the closest available USHCN station to my present home in Fremont, CA.

    Introduction

    Long-term daily precipitation (rain/snow, and also temperature) records for the United States are available from the United States Historical Climatology Network - USHCN. These are observations of precipitation for typically about 100 years or longer. The USHCN stations are a subset of the larger/denser (but shorter) observation network from the Global Historical Climatology Network - GHCN

    I obtained the data for California and other relevant data from here.

    R code

    Identify the station nearest to your location. For me its Livermore, CA.

    rm(list = ls())
    
    workDir <- "C:/Rstuff/DATA/ushcn/"
    setwd(workDir)
    
    # id of my nearest station (Livermore, CA), identified from
    # 'ushcn-stations.txt'
    myId <- "044997"
    

    Following is the format of the data.

    # format of the data, from 'data_format.txt'# .... (Each record in a file contains one month of daily data.)# # Variable Columns Type COOP ID 1-6 Character YEAR 7-10 Integer MONTH# 11-12 Integer ELEMENT 13-16 Character VALUE1 17-21 Integer MFLAG1 22# Character QFLAG1 23 Character SFLAG1 24 Character VALUE2 25-29 Integer# MFLAG2 30 Character QFLAG2 31 Character SFLAG2 32 Character .  .  .
    

    Process precipitation data for your state and extract a subset of data corresponding to your station.

    # read data for all stations in the state
    allData <- readLines("state04_CA.txt")
    
    # extract station ids from the data
    idData <- substr(allData, 1, 6)
    # create a new data frame, with ids as the first column of the frame
    newData <- data.frame(idData, allData, stringsAsFactors = FALSE)
    # extract data corresp to your nearest station
    myData <- subset(newData, idData == myId)
    myData <- myData[, 2]  #throw away the previously added first column
    

    Below function used later to determine the number of days in a month, including leap years.

    # function to computes days in a month: input is year and month this is
    # the best I could do without downloading external R libraries
    FnDaysInMonth <- function(yr, mo) {
        date1 <- paste(yr, mo, "01", sep = "-")  #current month, day 1
    
        mo2 <- ifelse(mo < 12, mo + 1, 1)
        yr2 <- ifelse(mo < 12, yr, yr + 1)
        date2 <- paste(yr2, mo2, "01", sep = "-")  #next month, day 1
    
        return(as.numeric(difftime(as.Date(date2), as.Date(date1))))
    }
    

    Output file to store the data and read it back again for plotting

    outFile <- file(paste(myId, ".txt", sep = ""), "wt")
    

    Each line of the data file contains data corresponding to all the days in the month. Discard temperature records and read only “PRCP”. Also, check for the data quality flags.

    ## each line is 1 month of data
    for (eachLine in 1:length(myData)) {
        yrVar <- as.numeric(substr(myData[eachLine], 7, 10))
        moVar <- as.numeric(substr(myData[eachLine], 11, 12))
        metVar <- substr(myData[eachLine], 13, 16)
    
        # only extract precipitation info
        if (metVar == "PRCP") {
            ## for each day of the month , check the data flags and get the data
            for (eachDay in 1:FnDaysInMonth(yrVar, moVar)) {
                dayOffset <- 17 + ((eachDay - 1) * 8)
                metVal <- as.numeric(substr(myData[eachLine], dayOffset, dayOffset + 
                    4))
                mflag <- substr(myData[eachLine], dayOffset + 5, dayOffset + 5)  #is irrelevant
                qflag <- substr(myData[eachLine], dayOffset + 6, dayOffset + 6)  #should be blank
                sflag <- substr(myData[eachLine], dayOffset + 7, dayOffset + 7)  #should not be blank
    
                # write to ouput
                if (qflag == " " & sflag != " ") {
                    writeLines(paste(yrVar, moVar, eachDay, metVal, sep = ","), 
                      outFile)
                }
            }
        }
    }
    close(outFile)
    

    Read back data for summary graphs

    prcp <- read.csv(paste(myId, ".txt", sep = ""), header = FALSE, sep = ",", 
        as.is = TRUE)
    colnames(prcp) <- c("yr", "mo", "day", "val")
    prcp$val <- prcp$val/100  #convert hundredths of inches to inches
    

    Graphs …

    # yearly total
    yrtot <- aggregate(val ~ yr, data = prcp, FUN = sum)
    png(filename = "fig1.png")
    plot(yrtot$val ~ yrtot$yr, type = "h", main = "Annual Precipitation Total (inches), Livermore, CA", 
        ylab = "inches/year", xlab = "year")
    garbage <- dev.off()
    
    # monthly total
    montot <- aggregate(val ~ yr + mo, data = prcp, FUN = sum)
    png(filename = "fig2.png")
    boxplot(montot$val ~ montot$mo, range = 0, main = "Monthly Precipitation Total (inches), Livermore, CA", 
        ylab = "inches/month", xlab = "calendar month")
    garbage <- dev.off()
    
    # number of rainy days per month, rainy day of rain amount > 0.01 inches
    prcp$val <- ifelse(prcp$val <= 0.01, 0, 1)
    raindays <- aggregate(val ~ yr + mo, data = prcp, FUN = sum)
    png(filename = "fig3.png")
    boxplot(raindays$val ~ raindays$mo, range = 0, main = "Monthly Rainy Days, Livermore, CA", 
        ylab = "days/month", xlab = "calendar month")
    garbage <- dev.off()
    

    fig1
    fig2
    fig3

    Summary

    This code, when modified slightly, could also be used to read GHCN daily precipitation data.

    Tagged: R USHCN precip precipitation Livermore timeseries california GHCND GHCN

    Posted on September 19, 2012 with 1 note

  • California High School Graduation and Dropout Rates

    Abstract

    The California Deparment of Education recently (June 2012) had a news release on the increase in high school (grades 9-12) graduation rates and decrease in dropout rates. The data used by the Department was from two cohorts (4-year periods) on a statewide basis. While the statewide numbers show an increase (decrease) in graduation (dropout) rates, a particular school district’s numbers do not necessarily follow the statewide trend. Moreover, the longer-term data (about 20 years) on graduation and dropout rates has been ignored by the Department in their analyses. Following is my attempt to dig deeper into the data - just for kicks!

    Overview

    The California Deparment of Education recently (June 2012, news article) publicized the increase in high school graduation rates and decrease in dropout rates. Newspapers have also reported on this topic:

    • San Francisco Chronicle
    • Los Angeles times
    • San Jose Mercury News

    However, the data is available for only two cohorts (‘09 and ‘10). Moreover, the available cohort data is at an aggregated scale - school district or state, and not for each individual school.

    I am interested in individual school performance - particulalry the five high schools in the Fremont Unified district, since I currently live here. The following is my attempt to dig deeper into this data, of course, using R. Below are the data sources. I downloaded all the files (2 on cohort rates, 1 on school list and 20 on dropout rates) before running the following R code.

    • Cohort Outcome Data
    • List of California Schools
    • California High School Dropout data

    Analysis

    Reproduce Dept of Education’s news release data

    First, I try to reproduce the results from the news release.

    rm(list = ls())
    
    workDir <- "C:/Rstuff/DATA/CA Schools/"
    setwd(workDir)
    
    # read data
    coh09 <- read.csv("cohort09.txt", header = TRUE, sep = "\t", as.is = TRUE)
    coh10 <- read.csv("cohort10.txt", header = TRUE, sep = "\t", as.is = TRUE)
    

    News release numbers are based on “Statewide” data. Extract only this subset and produce a table similar to the one in the news release.

    subsetName <- "Statewide"
    # cohort09
    data09 <- subset(coh09, Name == subsetName & (Subgroup == "All" | 
        Subgrouptype == "All"))
    # cohort10
    data10 <- subset(coh10, Name == subsetName & (Subgroup == "All" | 
        Subgrouptype == "All"))
    

    Reproduce table of Graduation Rates

    grad09 <- data09[, c("Subgroup", "Subgrouptype", "NumGraduates", 
        "Cohort.Graduation.Rate")]
    colnames(grad09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910")
    grad10 <- data10[, c("Subgroup", "Subgrouptype", "NumGraduates", 
        "Cohort.Graduation.Rate")]
    colnames(grad10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011")
    
    # create a table similar to the news release
    gradTable <- merge(grad09, grad10)
    gradTable$Type <- c(gradTable$Subgrouptype[1:9], gradTable$Subgroup[10:16])
    gradTable <- gradTable[, c("Type", "Per0910", "Num0910", "Per1011", 
        "Num1011")]
    row.names(gradTable) <- gradTable$Type
    gradTable <- gradTable[c("All", "5", "1", "2", "3", "4", "6", "7", 
        "9", "0", "EL", "MIG", "SE", "SD"), ]  #re-order to match news release order
    gradTable$Change <- as.numeric(gradTable$Per1011 - gradTable$Per0910)
    
    gradTable
    
    ##     Type Per0910    Num0910 Per1011    Num1011 Change
    ## All  All   74.72 378976       76.26 382558       1.54
    ## 5      5   68.08 159364       70.41 167886       2.33
    ## 1      1   67.32 2933         68.00 2692         0.68
    ## 2      2   89.01 40384        89.66 39717        0.65
    ## 3      3   72.33 2520         74.26 2432         1.93
    ## 4      4   87.34 12002        89.00 12104        1.66
    ## 6      6   60.54 25224        62.76 24917        2.22
    ## 7      7   83.52 130539       85.42 124863       1.90
    ## 9      9   82.80 4534         81.36 5311        -1.44
    ## 0      0   53.81 1476         46.26 2636        -7.55
    ## EL    EL   56.39 53015        60.21 60280        3.82
    ## MIG  MIG   71.14 10790        71.86 9794         0.72
    ## SE    SE   56.72 34385        59.04 34156        2.32
    ## SD    SD   68.04 204189       69.94 219856       1.90
    

    Reproduce table of Dropout Rates

    drop09 <- data09[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")]
    colnames(drop09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910")
    drop10 <- data10[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")]
    colnames(drop10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011")
    
    # create a table similar to the news release
    dropTable <- merge(drop09, drop10)
    dropTable$Type <- c(dropTable$Subgrouptype[1:9], dropTable$Subgroup[10:16])
    dropTable <- dropTable[, c("Type", "Per0910", "Num0910", "Per1011", 
        "Num1011")]
    row.names(dropTable) <- dropTable$Type
    dropTable <- dropTable[c("All", "5", "1", "2", "3", "4", "6", "7", 
        "9", "0", "EL", "MIG", "SE", "SD"), ]  #re-order to match news release order
    dropTable$Change <- as.numeric(dropTable$Per1011 - dropTable$Per0910)
    
    dropTable
    
    ##     Type Per0910    Num0910 Per1011    Num1011 Change
    ## All  All    16.6 84298         14.4 72314        -2.2
    ## 5      5    20.8 48706         17.7 42126        -3.1
    ## 1      1    22.1 962           20.7 818          -1.4
    ## 2      2     7.2 3270           6.2 2764         -1.0
    ## 3      3    19.6 683           17.4 571          -2.2
    ## 4      4     7.8 1078           6.7 906          -1.1
    ## 6      6    26.7 11143         24.7 9791         -2.0
    ## 7      7    10.7 16759          8.9 12980        -1.8
    ## 9      9    10.2 556           11.2 730           1.0
    ## 0      0    41.6 1141          28.6 1628        -13.0
    ## EL    EL    29.0 27264         24.8 24858        -4.2
    ## MIG  MIG    18.8 2845          17.3 2352         -1.5
    ## SE    SE    21.9 13296         18.4 10628        -3.5
    ## SD    SD    20.1 60309         17.6 55483        -2.5
    

    Note

    • Data used by the news release appears to be similar, but not identical, to the data currently available on the web. The discrepancies in the above numbers and those in the news article are attrbuted to the slightly different datasets.
    • Also, the sum of graduation percentage and dropout percentage for a category need not add up to 100 because some students take longer than 4 years to graduate and are not (yet) accounted here.

    Repeat the analysis on Fremont Unified School District

    subsetName <- "Fremont Unified"
    # cohort09
    data09 <- subset(coh09, Name == subsetName & (Subgroup == "All" | 
        Subgrouptype == "All"))
    # cohort10
    data10 <- subset(coh10, Name == subsetName & (Subgroup == "All" | 
        Subgrouptype == "All"))
    

    Table of Graduation Rates

    grad09 <- data09[, c("Subgroup", "Subgrouptype", "NumGraduates", 
        "Cohort.Graduation.Rate")]
    colnames(grad09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910")
    grad10 <- data10[, c("Subgroup", "Subgrouptype", "NumGraduates", 
        "Cohort.Graduation.Rate")]
    colnames(grad10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011")
    
    # create a table similar to the news release
    gradTable <- merge(grad09, grad10)
    gradTable$Type <- c(gradTable$Subgrouptype[1:9], gradTable$Subgroup[10:16])
    gradTable <- gradTable[, c("Type", "Per0910", "Num0910", "Per1011", 
        "Num1011")]
    row.names(gradTable) <- gradTable$Type
    gradTable <- gradTable[c("All", "5", "1", "2", "3", "4", "6", "7", 
        "9", "0", "EL", "MIG", "SE", "SD"), ]  #re-order to match news release order
    gradTable$Change <- as.numeric(gradTable$Per1011 - gradTable$Per0910)
    
    gradTable
    
    ##     Type Per0910    Num0910 Per1011    Num1011 Change
    ## All  All   82.74 2177         89.34 1952         6.60
    ## 5      5   69.06 279          79.42 220         10.36
    ## 1      1   90.00          *  100.00          *  10.00
    ## 2      2   89.63 1055         93.49 977          3.86
    ## 3      3   80.95 17           68.42 13         -12.53
    ## 4      4   80.37 131          92.25 119         11.88
    ## 6      6   67.86 76           82.52 85          14.66
    ## 7      7   82.19 586          87.98 505          5.79
    ## 9      9   81.48 22           85.71 24           4.23
    ## 0      0   50.00          *   83.33          *  33.33
    ## EL    EL   68.56 205          73.06 179          4.50
    ## MIG  MIG   71.43          *   85.71          *  14.28
    ## SE    SE   55.47 137          65.41 121          9.94
    ## SD    SD   72.69 503          79.42 436          6.73
    

    Table of Dropout Rates

    drop09 <- data09[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")]
    colnames(drop09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910")
    drop10 <- data10[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")]
    colnames(drop10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011")
    
    # create a table similar to the news release
    dropTable <- merge(drop09, drop10)
    dropTable$Type <- c(dropTable$Subgrouptype[1:9], dropTable$Subgroup[10:16])
    dropTable <- dropTable[, c("Type", "Per0910", "Num0910", "Per1011", 
        "Num1011")]
    row.names(dropTable) <- dropTable$Type
    dropTable <- dropTable[c("All", "5", "1", "2", "3", "4", "6", "7", 
        "9", "0", "EL", "MIG", "SE", "SD"), ]  #re-order to match news release order
    dropTable$Change <- as.numeric(dropTable$Per1011 - dropTable$Per0910)
    
    dropTable
    
    ##     Type Per0910    Num0910 Per1011    Num1011 Change
    ## All  All     5.1 133            5.5 120           0.4
    ## 5      5     9.7 39            13.4 37            3.7
    ## 1      1     0.0          *     0.0          *    0.0
    ## 2      2     1.7 20             2.5 26            0.8
    ## 3      3     4.8          *    10.5          *    5.7
    ## 4      4     4.9          *     6.2          *    1.3
    ## 6      6    17.0 19             9.7          *   -7.3
    ## 7      7     5.8 41             5.9 34            0.1
    ## 9      9    11.1          *    10.7          *   -0.4
    ## 0      0    50.0          *     0.0          *  -50.0
    ## EL    EL    11.4 34            16.3 40            4.9
    ## MIG  MIG     7.1          *    14.3          *    7.2
    ## SE    SE    12.6 31            11.4 21           -1.2
    ## SD    SD     9.4 65            11.8 65            2.4
    

    Some Observations

    • Statewide graduation rates, consistent with the news release, have increased across the different ethinic and other demographic groups considered.
    • The Fremont Unified School District’s graduation rates show a similar pattern - increase in graduation rates across the groups.

    • Statewide dropout rates have decreased across most categories. However, Fremont Unified’s dropout rates have increased across most categories.

    • For whatever reason the results in the news release are not presented by gender.

    Closer look at Fremont Unified’s High Schools

    Now I use the enrollment and dropout numbers by year data, 1992-2011, to see the time series of dropout rates for the 5 high schools in Fremont Unified.

    Identify high schools in Fremont Unified - I look at only public schools.

    # school info from http://www.cde.ca.gov/ds/si/ds/pubschls.asp
    allSchools <- read.csv("pubschls.txt", header = TRUE, sep = "\t", 
        as.is = TRUE, colClasses = "character")
    # Fremont specific data
    freSchools <- subset(allSchools, County == "Alameda" & District == 
        "Fremont Unified" & StatusType == "Active" & SOC == 66)
    
    freSchools$School
    
    ## [1] "American High"         "Irvington High"        "John F. Kennedy High" 
    ## [4] "Mission San Jose High" "Washington High"      
    

    For each year, extract the total dropouts (“DTOT”) and total enrollment (“ETOT”) for 9,10,11 and 12 grades. Estimate the average dropout rate as the fraction DTOT/ETOT.

    # dropout data from http://www.cde.ca.gov/ds/sd/sd/filesdropouts.asp
    extns <- substr(as.character(seq(1992, 2011, by = 1)), 3, 4)
    freData <- NULL  #stores summary
    for (eachYr in extns) {
        dropData <- read.csv(paste("dropouts", eachYr, ".txt", sep = ""), header = TRUE, 
            sep = "\t", as.is = TRUE, colClasses = "character")
    
        # Fremont specific data
        yearData <- subset(dropData, CDS_CODE %in% freSchools$CDSCode)
        yearData$ETOT <- as.numeric(yearData$ETOT)
        yearData$DTOT <- as.numeric(yearData$DTOT)
        # total dropouts and total enrolled for each school by year
        yearSumm <- aggregate(cbind(DTOT, ETOT) ~ CDS_CODE + YEAR, data = yearData, 
            FUN = sum)
        # update output
        freData <- rbind(freData, yearSumm)
    }
    

    Plot the results.

    schoolCodes <- levels(as.factor(freData$CDS_CODE))
    schoolNames <- freSchools$School[schoolCodes %in% freSchools$CDSCode]
    png(filename = "fredrop.png")
    par(mfrow = c(5, 1), mar = c(1, 4, 3, 2))
    yearNames <- seq(1992, 2011, by = 1)
    for (eachSchool in 1:length(schoolCodes)) {
        subData <- subset(freData, CDS_CODE == schoolCodes[eachSchool])
        barplot(subData$DTOT * 100/subData$ETOT, xlab = "", ylim = c(0, 5), ylab = "drop %", 
            names.arg = yearNames, main = schoolNames[eachSchool])
    }
    garbage <- dev.off()
    

    figure 1

    From the graph it is evident that between 2011 and 2010 American and Irvington’s droprates have decreased, while Kennedy, Mission and Washington’s droprates have increased. However, both increases and decreases in recent droprates are relatively small. Fremont schools have pretty low dropout rates!

    Summary

    While what I did here sheds light on some of the issues with the Department’s analyis, it is by no means complete. The University of California in Santa Barbara has an entire project dedicated to analyze the dropout rates - UCSB Dropout Research Project.

    Tagged: R California High School Graduation Rates R-bloggers Fremont Unified Dropout Rates

    Posted on September 16, 2012

  • So many ways to Tuesday…

    My first post on R using the tools Markdown and Knitr.

    Purpose

    I often deal with data that have both spatial (geogrphical) and temporal attributes. I also often end up plotting such data under some time constraint. But I always think of ways to plot the data better. Here is one attempt to plot such typical data in as many “unique” ways as possible. The constaints are -

    • data should all be in one plot (multiple panels allowed)
    • convey all the information (and more) in the data and
    • the code should not be too long.

    I picked the San Francisco Bay Area census data on population. This dataset has the population of each of the 9 Bay Area counties for the period 1860 through 2010.

    First I had to format the data. Comparison by population numbers is not so easy. So I computed the population percentage (of total Bay Area population) for each county, for each decade. Then, I used the formatted data to create some plots.

    # analyze population of bay area counties; data from
    # http://www.bayareacensus.ca.gov/historical/copop18602000.htm save the
    # data to a file called 'SFbayarea_pop.txt'
    
    rm(list = ls())
    
    # all data resides here
    setwd("C:/Rstuff/DATA/")
    
    # read data ----
    popData <- read.csv("SFbayarea_pop.txt", header = TRUE, sep = "\t", 
        colClasses = "character")
    
    # original data
    head(popData)
    
    ##      Alameda Contra.Costa  Marin   Napa San.Francisco San.Mateo
    ## 1860   8,927        5,328  3,334  5,521        56,802     3,214
    ## 1870  24,237        8,461  6,903  7,163       149,473     6,635
    ## 1880  62,976       12,525 11,324 13,235       233,959     8,669
    ## 1890  93,864       13,515 13,072 16,411       298,997    10,087
    ## 1900 130,197       18,046 15,702 16,451       342,782    12,094
    ## 1910 246,131       31,674 25,114 19,800       416,912    26,585
    ##      Santa.Clara Solano Sonoma TOTAL.BAY.AREA
    ## 1860      11,912  7,169 11,867        114,074
    ## 1870      26,246 16,871 19,819        265,808
    ## 1880      35,039 18,475 25,926        422,128
    ## 1890      48,005 20,946 32,721        547,618
    ## 1900      60,216 24,143 38,480        658,111
    ## 1910      83,539 27,559 48,394        925,708
    

    Format data for later use: remove commas in numbers and “.”s in county names:

    colnames(popData) <- gsub(".", " ", colnames(popData), fixed = TRUE)
    saveDimNames <- dimnames(popData)  # save names for later use
    
    # remove commas in the numbers
    popData <- as.data.frame(lapply(popData, function(x) {
        as.numeric(gsub(",", "", x))
    }))
    # names disappear after the above operation; rename columns
    dimnames(popData) <- saveDimNames
    
    # pop as percent of total
    popFrac <- popData * 100/popData$TOTAL
    
    # remove last ('total') column
    popData <- popData[, -ncol(popData)]
    popFrac <- popFrac[, -ncol(popFrac)]
    
    # data, after formatting
    head(popData)
    
    ##      Alameda Contra Costa Marin  Napa San Francisco San Mateo Santa Clara
    ## 1860    8927         5328  3334  5521         56802      3214       11912
    ## 1870   24237         8461  6903  7163        149473      6635       26246
    ## 1880   62976        12525 11324 13235        233959      8669       35039
    ## 1890   93864        13515 13072 16411        298997     10087       48005
    ## 1900  130197        18046 15702 16451        342782     12094       60216
    ## 1910  246131        31674 25114 19800        416912     26585       83539
    ##      Solano Sonoma
    ## 1860   7169  11867
    ## 1870  16871  19819
    ## 1880  18475  25926
    ## 1890  20946  32721
    ## 1900  24143  38480
    ## 1910  27559  48394
    
    head(round(popFrac, 1))
    
    ##      Alameda Contra Costa Marin Napa San Francisco San Mateo Santa Clara
    ## 1860     7.8          4.7   2.9  4.8          49.8       2.8        10.4
    ## 1870     9.1          3.2   2.6  2.7          56.2       2.5         9.9
    ## 1880    14.9          3.0   2.7  3.1          55.4       2.1         8.3
    ## 1890    17.1          2.5   2.4  3.0          54.6       1.8         8.8
    ## 1900    19.8          2.7   2.4  2.5          52.1       1.8         9.1
    ## 1910    26.6          3.4   2.7  2.1          45.0       2.9         9.0
    ##      Solano Sonoma
    ## 1860    6.3   10.4
    ## 1870    6.3    7.5
    ## 1880    4.4    6.1
    ## 1890    3.8    6.0
    ## 1900    3.7    5.8
    ## 1910    3.0    5.2
    
    
    # color scheme for the plots
    plotClr <- rainbow(ncol(popData))
    popMax <- ceiling(max(popData)/10^5)
    

    Figure 1

    png(filename = "fig1.png")
    layout(matrix(c(1, 1, 1, 2, 2), 5, 1, byrow = TRUE))
    matplot(rownames(popData), popData[, 1:9]/10^5, type = "o", col = plotClr, 
        pch = 1:9, xaxt = "n", xlab = "Year", ylab = "Pop (x 100,000)", ylim = c(0, 
            popMax), main = "Population of Bay Area Counties", cex = 2, cex.axis = 1.2, 
        cex.lab = 1.2)
    axis(1, at = seq(1860, 2010, 10), cex.axis = 1.2, cex.lab = 1.2)
    legend("top", ncol = 3, legend = colnames(popData), cex = 1.2, col = plotClr, 
        pch = 1:9)
    matplot(rownames(popFrac), round(popFrac[, 1:9], 1), type = "o", 
        col = plotClr, pch = 1:9, xaxt = "n", xlab = "Year", ylab = "% of Bay Area Tot.", 
        ylim = c(0, max(popFrac)), main = "County Pop as % of Bay Area Total", cex = 2, 
        cex.axis = 1.2, cex.lab = 1.2)
    axis(1, at = seq(1860, 2010, 10), cex.axis = 1.2, cex.lab = 1.2)
    garbage <- dev.off()
    

    figure 1

    Figure 2

    png(filename = "fig2.png")
    par(mfrow = c(3, 3), mar = c(5, 4, 2, 5) + 0.1, oma = c(0, 0, 4, 
        0))
    for (eachCnty in 1:ncol(popData)) {
        barplot(popData[, eachCnty]/10^5, names.arg = rownames(popData), xlab = "Year", 
            ylab = "Pop (x 100,000)", main = colnames(popData)[eachCnty])
        par(new = TRUE)
        plot(rownames(popFrac), popFrac[, eachCnty], type = "o", col = "blue", xaxt = "n", 
            yaxt = "n", xlab = "", ylab = "")
        axis(4, col = "blue", col.ticks = "blue", col.axis = "blue")
        mtext("% of Bay Area Pop", side = 4, line = 3, cex = 0.7, col = "blue")
    }
    mtext("Population of Bay Area Counties", side = 3, line = 1, outer = TRUE, 
        font = 2)
    garbage <- dev.off()
    

    figure 2

    Figure 3

    png(filename = "fig3.png")
    layout(matrix(c(1, 1, 1, 1, 2:17), 5, 4, byrow = TRUE))
    par(mar = c(2, 4, 2, 2), oma = c(0, 0, 4, 0))
    barplot(0, 0, axes = FALSE)
    legend("center", ncol = 3, legend = colnames(popData), cex = 1.3, 
        fill = plotClr)
    for (eachYr in 1:nrow(popData)) {
        barplot(as.vector(unlist(popData[eachYr, ]/10^5)), col = plotClr, names.arg = NULL, 
            xlab = "", ylab = "Pop (x 100,000)", main = rownames(popData)[eachYr])
    }
    mtext("Population of Bay Area Counties", side = 3, line = 1, outer = TRUE, 
        font = 2)
    garbage <- dev.off()
    

    figure 3

    Figure 4

    png(filename = "fig4.png")
    layout(matrix(c(1, 2, 2), 3, 1, byrow = TRUE))
    par(mar = c(4, 4, 2, 2), oma = c(0, 0, 3, 0))
    plot(rownames(popData), rowSums(popData)/10^5, type = "o", xaxt = "n", 
        xlab = "", ylab = "Pop (x 100,000)", cex = 2, cex.axis = 1.2, cex.lab = 1.2)
    axis(1, at = seq(1860, 2010, 10), cex.axis = 1.2, cex.lab = 1.2)
    barplot(t(as.matrix(popFrac)), main = "County Pop as % of Bay Area Total", 
        ylim = c(0, 120), yaxt = "n", col = plotClr, cex.axis = 1.2, cex.lab = 1.2, 
        cex.names = 1.2)
    axis(2, at = seq(0, 100, 20), cex.axis = 1.2, cex.lab = 1.2)
    legend("top", ncol = 3, legend = colnames(popFrac), cex = 1, fill = plotClr)
    mtext("Population of Bay Area Counties", side = 3, line = 1, outer = TRUE, 
        cex = 1, font = 2)
    garbage <- dev.off()
    

    figure 4

    Summary

    I tried 4 different ways of plotting the population data. Each of the 4 methods have pros and cons. Of the above 4, my favorite is the last one. Also, I would like to add one figure with a map showing the counties where the counties are color coded by the population (or pop percentage). For this, additional data is required, such as GIS maps of the counties. Will try to do this for the next post.

    Tagged: bay area population in R R

    Posted on June 29, 2012 with 1 note

  • R is the new (not so blind) love!

    I switched to R (sorry python!). The support for statistics, graphics and the resources/community support are unparalleled. Check out R’s home (http://www.r-project.org/) and the R-bloggers sites (http://www.r-bloggers.com/). 

    Posted on June 12, 2012

  • More python fans

    I came across more people/groups using python - the Catchment Model Framework - http://www.uni-giessen.de/~gh1961/ and the hydropy project (documentation is in Spanish, but code is in English) - http://code.google.com/p/hydropy/

    Posted on January 28, 2011 with 1 note

  • Up and running!

    Finally the blog is up and running! After trying blogger.com and wordpress.com, I’ve settled on tumblr.  Check out the pages on this blog. These pages will be updated periodically.  

    Posted on January 17, 2011 with 1 note

Field Notes Theme. Designed by Manasto Jones. Powered by Tumblr.