-
R code to obtain and plot rainfall data for the whole world
If you want to create rainfall maps for the whole world in R there is no readily available code or package to do this. Moreover, data publicly available from research institutions is not generally in plain text format or other familiar formats. Hydrological and climatological studies sometimes require rainfall data over the entire world for long periods of time. The Climate Prediction Center’s (CPC) site, daily data from 1979 to present, is a good resource. This data is available at CPC’s ftp site (ftp://ftp.cpc.ncep.noaa.gov/precip/CPC_UNI_PRCP/GAUGE_GLB/).
I created R code to download rain/snow (or precipitation to be scientific) data from the CPC’s ftp site and plot it. All my code is available at my GitHub site - https://github.com/RationShop/cpcRain.
The Rmarkdown document showing some examples is at my RPubs site - http://rpubs.com/RationShop/cpcRain
-
R code for generating multi-site stochastic precipitation
In water resource management, climate change, hydrology and related disciplines long time series of precipitation/rainfall data is required. Since historical records are relatively short, typically 50 years or less, mathematical/statistical models are used to generate synthetic data of required length. This data generation process requires that the spatio-temporal statistics are preserved between the observations and the synthetic data. There are a number of procedures in the literature, largely beginning in the late 1990s with the work of Wilks (see references below). However, the code required to generate the synthetic data is typically not available.
I created R code for the computationally and conceptually simple implementation of the modified Wilks approach of Mhanna and Bauwens (2012). See my GitHub repo - https://github.com/RationShop/StochasticPrecipitation
There are also two R packages which have been made available recently (see links at the above GitHub site).The R code, for the example dataset provided, takes only a few minutes or less. Someday I plan to make a package out of this.
-
Fixing non positive definite correlation matrices using R
Problem
When a correlation or covariance matrix is not positive definite (i.e., in instances when some or all eigenvalues are negative), a cholesky decomposition cannot be performed. Sometimes, these eigenvalues are very small negative numbers and occur due to rounding or due to noise in the data. In simulation studies a known/given correlation has to be imposed on an input dataset. In such cases one has to deal with the issue of making a correlation matrix positive definite. Following are papers in the field of stochastic precipitation where such matrices are used.
- FP Brissette, M Khalili, R Leconte, Journal of Hydrology, 2007, “Efficient stochastic generation of multi-site synthetic precipitation data”
- GA Baigorria, JW Jones, Journal of Climate, 2010, “GiST: A stochastic model for generating spatially and temporally correlated daily rainfall data”
- M Mhanna and W Bauwens, International Journal of Climatology, 2012, “A stochastic space-time model for the generation of daily rainfall in the Gaza Strip”
A Solution
The paper by Rebonato and Jackel, “The most general methodology for creating a valid correlation matrix for risk management and option pricing purposes”, Journal of Risk, Vol 2, No 2, 2000, presents a methodology to create a positive definite matrix out of a non-positive definite matrix.
Below is my attempt to reproduce the example from Rebonato and Jackel (2000). The correlation matrix below is from the example.
origMat <- array(c(1, 0.9, 0.7, 0.9, 1, 0.3, 0.7, 0.3, 1), dim = c(3, 3)) origEig <- eigen(origMat) origEig$values## [1] 2.296728 0.710625 -0.007352As you can see, the third eigenvalue is negative. Trying a cholesky decomposition on this matrix fails, as expected.
cholStatus <- try(u <- chol(origMat), silent = FALSE) cholError <- ifelse(class(cholStatus) == "try-error", TRUE, FALSE)Here, I use the method of Rebonato and Jackel (2000), as elaborated by Brissette et al. (2007), to fix the correlation matrix. As per the method, replace the negative eigenvalues with 0 (or a small positive number as Brissette et al. 2007 suggest), then normalize the new vector.
# fix the correl matrix newMat <- origMat iter <- 0 while (cholError) { iter <- iter + 1 cat("iteration ", iter, "\n") # replace -ve eigen values with small +ve number newEig <- eigen(newMat) newEig2 <- ifelse(newEig$values < 0, 0, newEig$values) # create modified matrix eqn 5 from Brissette et al 2007, inv = transp for # eig vectors newMat <- newEig$vectors %*% diag(newEig2) %*% t(newEig$vectors) # normalize modified matrix eqn 6 from Brissette et al 2007 newMat <- newMat/sqrt(diag(newMat) %*% t(diag(newMat))) # try chol again cholStatus <- try(u <- chol(newMat), silent = TRUE) cholError <- ifelse(class(cholStatus) == "try-error", TRUE, FALSE) }## iteration 1 ## iteration 2# final check eigen(newMat)$values## [1] 2.290e+00 7.096e-01 -1.332e-15Unresolved Issue(?)
However, as you can see, the third eigenvalue is still negative (but very close to zero). The “chol” function in R is not giving an error probably because this negative eigenvalue is within the “tolerance limits”. I would like to know what these “tolerance limits” are.
-
Long-term precipitation data for your location from the US Historical Climatology Network using R
Abstract
Long-term daily precipitation records from the US Historical Climatology Network (USHCN) are processed and summarized using R. Below is R code which examines the data from Livermore, CA, the closest available USHCN station to my present home in Fremont, CA.
Introduction
Long-term daily precipitation (rain/snow, and also temperature) records for the United States are available from the United States Historical Climatology Network - USHCN. These are observations of precipitation for typically about 100 years or longer. The USHCN stations are a subset of the larger/denser (but shorter) observation network from the Global Historical Climatology Network - GHCN
I obtained the data for California and other relevant data from here.
R code
Identify the station nearest to your location. For me its Livermore, CA.
rm(list = ls()) workDir <- "C:/Rstuff/DATA/ushcn/" setwd(workDir) # id of my nearest station (Livermore, CA), identified from # 'ushcn-stations.txt' myId <- "044997"Following is the format of the data.
# format of the data, from 'data_format.txt'# .... (Each record in a file contains one month of daily data.)# # Variable Columns Type COOP ID 1-6 Character YEAR 7-10 Integer MONTH# 11-12 Integer ELEMENT 13-16 Character VALUE1 17-21 Integer MFLAG1 22# Character QFLAG1 23 Character SFLAG1 24 Character VALUE2 25-29 Integer# MFLAG2 30 Character QFLAG2 31 Character SFLAG2 32 Character . . .Process precipitation data for your state and extract a subset of data corresponding to your station.
# read data for all stations in the state allData <- readLines("state04_CA.txt") # extract station ids from the data idData <- substr(allData, 1, 6) # create a new data frame, with ids as the first column of the frame newData <- data.frame(idData, allData, stringsAsFactors = FALSE) # extract data corresp to your nearest station myData <- subset(newData, idData == myId) myData <- myData[, 2] #throw away the previously added first columnBelow function used later to determine the number of days in a month, including leap years.
# function to computes days in a month: input is year and month this is # the best I could do without downloading external R libraries FnDaysInMonth <- function(yr, mo) { date1 <- paste(yr, mo, "01", sep = "-") #current month, day 1 mo2 <- ifelse(mo < 12, mo + 1, 1) yr2 <- ifelse(mo < 12, yr, yr + 1) date2 <- paste(yr2, mo2, "01", sep = "-") #next month, day 1 return(as.numeric(difftime(as.Date(date2), as.Date(date1)))) }Output file to store the data and read it back again for plotting
outFile <- file(paste(myId, ".txt", sep = ""), "wt")Each line of the data file contains data corresponding to all the days in the month. Discard temperature records and read only “PRCP”. Also, check for the data quality flags.
## each line is 1 month of data for (eachLine in 1:length(myData)) { yrVar <- as.numeric(substr(myData[eachLine], 7, 10)) moVar <- as.numeric(substr(myData[eachLine], 11, 12)) metVar <- substr(myData[eachLine], 13, 16) # only extract precipitation info if (metVar == "PRCP") { ## for each day of the month , check the data flags and get the data for (eachDay in 1:FnDaysInMonth(yrVar, moVar)) { dayOffset <- 17 + ((eachDay - 1) * 8) metVal <- as.numeric(substr(myData[eachLine], dayOffset, dayOffset + 4)) mflag <- substr(myData[eachLine], dayOffset + 5, dayOffset + 5) #is irrelevant qflag <- substr(myData[eachLine], dayOffset + 6, dayOffset + 6) #should be blank sflag <- substr(myData[eachLine], dayOffset + 7, dayOffset + 7) #should not be blank # write to ouput if (qflag == " " & sflag != " ") { writeLines(paste(yrVar, moVar, eachDay, metVal, sep = ","), outFile) } } } } close(outFile)Read back data for summary graphs
prcp <- read.csv(paste(myId, ".txt", sep = ""), header = FALSE, sep = ",", as.is = TRUE) colnames(prcp) <- c("yr", "mo", "day", "val") prcp$val <- prcp$val/100 #convert hundredths of inches to inchesGraphs …
# yearly total yrtot <- aggregate(val ~ yr, data = prcp, FUN = sum) png(filename = "fig1.png") plot(yrtot$val ~ yrtot$yr, type = "h", main = "Annual Precipitation Total (inches), Livermore, CA", ylab = "inches/year", xlab = "year") garbage <- dev.off() # monthly total montot <- aggregate(val ~ yr + mo, data = prcp, FUN = sum) png(filename = "fig2.png") boxplot(montot$val ~ montot$mo, range = 0, main = "Monthly Precipitation Total (inches), Livermore, CA", ylab = "inches/month", xlab = "calendar month") garbage <- dev.off() # number of rainy days per month, rainy day of rain amount > 0.01 inches prcp$val <- ifelse(prcp$val <= 0.01, 0, 1) raindays <- aggregate(val ~ yr + mo, data = prcp, FUN = sum) png(filename = "fig3.png") boxplot(raindays$val ~ raindays$mo, range = 0, main = "Monthly Rainy Days, Livermore, CA", ylab = "days/month", xlab = "calendar month") garbage <- dev.off()Summary
This code, when modified slightly, could also be used to read GHCN daily precipitation data.
-
California High School Graduation and Dropout Rates
Abstract
The California Deparment of Education recently (June 2012) had a news release on the increase in high school (grades 9-12) graduation rates and decrease in dropout rates. The data used by the Department was from two cohorts (4-year periods) on a statewide basis. While the statewide numbers show an increase (decrease) in graduation (dropout) rates, a particular school district’s numbers do not necessarily follow the statewide trend. Moreover, the longer-term data (about 20 years) on graduation and dropout rates has been ignored by the Department in their analyses. Following is my attempt to dig deeper into the data - just for kicks!
Overview
The California Deparment of Education recently (June 2012, news article) publicized the increase in high school graduation rates and decrease in dropout rates. Newspapers have also reported on this topic:
However, the data is available for only two cohorts (‘09 and ‘10). Moreover, the available cohort data is at an aggregated scale - school district or state, and not for each individual school.
I am interested in individual school performance - particulalry the five high schools in the Fremont Unified district, since I currently live here. The following is my attempt to dig deeper into this data, of course, using R. Below are the data sources. I downloaded all the files (2 on cohort rates, 1 on school list and 20 on dropout rates) before running the following R code.
Analysis
Reproduce Dept of Education’s news release data
First, I try to reproduce the results from the news release.
rm(list = ls()) workDir <- "C:/Rstuff/DATA/CA Schools/" setwd(workDir) # read data coh09 <- read.csv("cohort09.txt", header = TRUE, sep = "\t", as.is = TRUE) coh10 <- read.csv("cohort10.txt", header = TRUE, sep = "\t", as.is = TRUE)News release numbers are based on “Statewide” data. Extract only this subset and produce a table similar to the one in the news release.
subsetName <- "Statewide" # cohort09 data09 <- subset(coh09, Name == subsetName & (Subgroup == "All" | Subgrouptype == "All")) # cohort10 data10 <- subset(coh10, Name == subsetName & (Subgroup == "All" | Subgrouptype == "All"))Reproduce table of Graduation Rates
grad09 <- data09[, c("Subgroup", "Subgrouptype", "NumGraduates", "Cohort.Graduation.Rate")] colnames(grad09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910") grad10 <- data10[, c("Subgroup", "Subgrouptype", "NumGraduates", "Cohort.Graduation.Rate")] colnames(grad10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011") # create a table similar to the news release gradTable <- merge(grad09, grad10) gradTable$Type <- c(gradTable$Subgrouptype[1:9], gradTable$Subgroup[10:16]) gradTable <- gradTable[, c("Type", "Per0910", "Num0910", "Per1011", "Num1011")] row.names(gradTable) <- gradTable$Type gradTable <- gradTable[c("All", "5", "1", "2", "3", "4", "6", "7", "9", "0", "EL", "MIG", "SE", "SD"), ] #re-order to match news release order gradTable$Change <- as.numeric(gradTable$Per1011 - gradTable$Per0910)gradTable## Type Per0910 Num0910 Per1011 Num1011 Change ## All All 74.72 378976 76.26 382558 1.54 ## 5 5 68.08 159364 70.41 167886 2.33 ## 1 1 67.32 2933 68.00 2692 0.68 ## 2 2 89.01 40384 89.66 39717 0.65 ## 3 3 72.33 2520 74.26 2432 1.93 ## 4 4 87.34 12002 89.00 12104 1.66 ## 6 6 60.54 25224 62.76 24917 2.22 ## 7 7 83.52 130539 85.42 124863 1.90 ## 9 9 82.80 4534 81.36 5311 -1.44 ## 0 0 53.81 1476 46.26 2636 -7.55 ## EL EL 56.39 53015 60.21 60280 3.82 ## MIG MIG 71.14 10790 71.86 9794 0.72 ## SE SE 56.72 34385 59.04 34156 2.32 ## SD SD 68.04 204189 69.94 219856 1.90Reproduce table of Dropout Rates
drop09 <- data09[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")] colnames(drop09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910") drop10 <- data10[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")] colnames(drop10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011") # create a table similar to the news release dropTable <- merge(drop09, drop10) dropTable$Type <- c(dropTable$Subgrouptype[1:9], dropTable$Subgroup[10:16]) dropTable <- dropTable[, c("Type", "Per0910", "Num0910", "Per1011", "Num1011")] row.names(dropTable) <- dropTable$Type dropTable <- dropTable[c("All", "5", "1", "2", "3", "4", "6", "7", "9", "0", "EL", "MIG", "SE", "SD"), ] #re-order to match news release order dropTable$Change <- as.numeric(dropTable$Per1011 - dropTable$Per0910)dropTable## Type Per0910 Num0910 Per1011 Num1011 Change ## All All 16.6 84298 14.4 72314 -2.2 ## 5 5 20.8 48706 17.7 42126 -3.1 ## 1 1 22.1 962 20.7 818 -1.4 ## 2 2 7.2 3270 6.2 2764 -1.0 ## 3 3 19.6 683 17.4 571 -2.2 ## 4 4 7.8 1078 6.7 906 -1.1 ## 6 6 26.7 11143 24.7 9791 -2.0 ## 7 7 10.7 16759 8.9 12980 -1.8 ## 9 9 10.2 556 11.2 730 1.0 ## 0 0 41.6 1141 28.6 1628 -13.0 ## EL EL 29.0 27264 24.8 24858 -4.2 ## MIG MIG 18.8 2845 17.3 2352 -1.5 ## SE SE 21.9 13296 18.4 10628 -3.5 ## SD SD 20.1 60309 17.6 55483 -2.5Note
- Data used by the news release appears to be similar, but not identical, to the data currently available on the web. The discrepancies in the above numbers and those in the news article are attrbuted to the slightly different datasets.
- Also, the sum of graduation percentage and dropout percentage for a category need not add up to 100 because some students take longer than 4 years to graduate and are not (yet) accounted here.
Repeat the analysis on Fremont Unified School District
subsetName <- "Fremont Unified" # cohort09 data09 <- subset(coh09, Name == subsetName & (Subgroup == "All" | Subgrouptype == "All")) # cohort10 data10 <- subset(coh10, Name == subsetName & (Subgroup == "All" | Subgrouptype == "All"))Table of Graduation Rates
grad09 <- data09[, c("Subgroup", "Subgrouptype", "NumGraduates", "Cohort.Graduation.Rate")] colnames(grad09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910") grad10 <- data10[, c("Subgroup", "Subgrouptype", "NumGraduates", "Cohort.Graduation.Rate")] colnames(grad10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011") # create a table similar to the news release gradTable <- merge(grad09, grad10) gradTable$Type <- c(gradTable$Subgrouptype[1:9], gradTable$Subgroup[10:16]) gradTable <- gradTable[, c("Type", "Per0910", "Num0910", "Per1011", "Num1011")] row.names(gradTable) <- gradTable$Type gradTable <- gradTable[c("All", "5", "1", "2", "3", "4", "6", "7", "9", "0", "EL", "MIG", "SE", "SD"), ] #re-order to match news release order gradTable$Change <- as.numeric(gradTable$Per1011 - gradTable$Per0910)gradTable## Type Per0910 Num0910 Per1011 Num1011 Change ## All All 82.74 2177 89.34 1952 6.60 ## 5 5 69.06 279 79.42 220 10.36 ## 1 1 90.00 * 100.00 * 10.00 ## 2 2 89.63 1055 93.49 977 3.86 ## 3 3 80.95 17 68.42 13 -12.53 ## 4 4 80.37 131 92.25 119 11.88 ## 6 6 67.86 76 82.52 85 14.66 ## 7 7 82.19 586 87.98 505 5.79 ## 9 9 81.48 22 85.71 24 4.23 ## 0 0 50.00 * 83.33 * 33.33 ## EL EL 68.56 205 73.06 179 4.50 ## MIG MIG 71.43 * 85.71 * 14.28 ## SE SE 55.47 137 65.41 121 9.94 ## SD SD 72.69 503 79.42 436 6.73Table of Dropout Rates
drop09 <- data09[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")] colnames(drop09) <- c("Subgroup", "Subgrouptype", "Num0910", "Per0910") drop10 <- data10[, c("Subgroup", "Subgrouptype", "NumDropouts", "Cohort.Dropout.Rate")] colnames(drop10) <- c("Subgroup", "Subgrouptype", "Num1011", "Per1011") # create a table similar to the news release dropTable <- merge(drop09, drop10) dropTable$Type <- c(dropTable$Subgrouptype[1:9], dropTable$Subgroup[10:16]) dropTable <- dropTable[, c("Type", "Per0910", "Num0910", "Per1011", "Num1011")] row.names(dropTable) <- dropTable$Type dropTable <- dropTable[c("All", "5", "1", "2", "3", "4", "6", "7", "9", "0", "EL", "MIG", "SE", "SD"), ] #re-order to match news release order dropTable$Change <- as.numeric(dropTable$Per1011 - dropTable$Per0910)dropTable## Type Per0910 Num0910 Per1011 Num1011 Change ## All All 5.1 133 5.5 120 0.4 ## 5 5 9.7 39 13.4 37 3.7 ## 1 1 0.0 * 0.0 * 0.0 ## 2 2 1.7 20 2.5 26 0.8 ## 3 3 4.8 * 10.5 * 5.7 ## 4 4 4.9 * 6.2 * 1.3 ## 6 6 17.0 19 9.7 * -7.3 ## 7 7 5.8 41 5.9 34 0.1 ## 9 9 11.1 * 10.7 * -0.4 ## 0 0 50.0 * 0.0 * -50.0 ## EL EL 11.4 34 16.3 40 4.9 ## MIG MIG 7.1 * 14.3 * 7.2 ## SE SE 12.6 31 11.4 21 -1.2 ## SD SD 9.4 65 11.8 65 2.4Some Observations
- Statewide graduation rates, consistent with the news release, have increased across the different ethinic and other demographic groups considered.
The Fremont Unified School District’s graduation rates show a similar pattern - increase in graduation rates across the groups.
Statewide dropout rates have decreased across most categories. However, Fremont Unified’s dropout rates have increased across most categories.
For whatever reason the results in the news release are not presented by gender.
Closer look at Fremont Unified’s High Schools
Now I use the enrollment and dropout numbers by year data, 1992-2011, to see the time series of dropout rates for the 5 high schools in Fremont Unified.
Identify high schools in Fremont Unified - I look at only public schools.
# school info from http://www.cde.ca.gov/ds/si/ds/pubschls.asp allSchools <- read.csv("pubschls.txt", header = TRUE, sep = "\t", as.is = TRUE, colClasses = "character") # Fremont specific data freSchools <- subset(allSchools, County == "Alameda" & District == "Fremont Unified" & StatusType == "Active" & SOC == 66)freSchools$School## [1] "American High" "Irvington High" "John F. Kennedy High" ## [4] "Mission San Jose High" "Washington High"For each year, extract the total dropouts (“DTOT”) and total enrollment (“ETOT”) for 9,10,11 and 12 grades. Estimate the average dropout rate as the fraction DTOT/ETOT.
# dropout data from http://www.cde.ca.gov/ds/sd/sd/filesdropouts.asp extns <- substr(as.character(seq(1992, 2011, by = 1)), 3, 4) freData <- NULL #stores summary for (eachYr in extns) { dropData <- read.csv(paste("dropouts", eachYr, ".txt", sep = ""), header = TRUE, sep = "\t", as.is = TRUE, colClasses = "character") # Fremont specific data yearData <- subset(dropData, CDS_CODE %in% freSchools$CDSCode) yearData$ETOT <- as.numeric(yearData$ETOT) yearData$DTOT <- as.numeric(yearData$DTOT) # total dropouts and total enrolled for each school by year yearSumm <- aggregate(cbind(DTOT, ETOT) ~ CDS_CODE + YEAR, data = yearData, FUN = sum) # update output freData <- rbind(freData, yearSumm) }Plot the results.
schoolCodes <- levels(as.factor(freData$CDS_CODE)) schoolNames <- freSchools$School[schoolCodes %in% freSchools$CDSCode] png(filename = "fredrop.png") par(mfrow = c(5, 1), mar = c(1, 4, 3, 2)) yearNames <- seq(1992, 2011, by = 1) for (eachSchool in 1:length(schoolCodes)) { subData <- subset(freData, CDS_CODE == schoolCodes[eachSchool]) barplot(subData$DTOT * 100/subData$ETOT, xlab = "", ylim = c(0, 5), ylab = "drop %", names.arg = yearNames, main = schoolNames[eachSchool]) } garbage <- dev.off()From the graph it is evident that between 2011 and 2010 American and Irvington’s droprates have decreased, while Kennedy, Mission and Washington’s droprates have increased. However, both increases and decreases in recent droprates are relatively small. Fremont schools have pretty low dropout rates!
Summary
While what I did here sheds light on some of the issues with the Department’s analyis, it is by no means complete. The University of California in Santa Barbara has an entire project dedicated to analyze the dropout rates - UCSB Dropout Research Project.
-
So many ways to Tuesday…
My first post on R using the tools Markdown and Knitr.
Purpose
I often deal with data that have both spatial (geogrphical) and temporal attributes. I also often end up plotting such data under some time constraint. But I always think of ways to plot the data better. Here is one attempt to plot such typical data in as many “unique” ways as possible. The constaints are -
- data should all be in one plot (multiple panels allowed)
- convey all the information (and more) in the data and
- the code should not be too long.
I picked the San Francisco Bay Area census data on population. This dataset has the population of each of the 9 Bay Area counties for the period 1860 through 2010.
First I had to format the data. Comparison by population numbers is not so easy. So I computed the population percentage (of total Bay Area population) for each county, for each decade. Then, I used the formatted data to create some plots.
# analyze population of bay area counties; data from # http://www.bayareacensus.ca.gov/historical/copop18602000.htm save the # data to a file called 'SFbayarea_pop.txt' rm(list = ls()) # all data resides here setwd("C:/Rstuff/DATA/") # read data ---- popData <- read.csv("SFbayarea_pop.txt", header = TRUE, sep = "\t", colClasses = "character") # original data head(popData)## Alameda Contra.Costa Marin Napa San.Francisco San.Mateo ## 1860 8,927 5,328 3,334 5,521 56,802 3,214 ## 1870 24,237 8,461 6,903 7,163 149,473 6,635 ## 1880 62,976 12,525 11,324 13,235 233,959 8,669 ## 1890 93,864 13,515 13,072 16,411 298,997 10,087 ## 1900 130,197 18,046 15,702 16,451 342,782 12,094 ## 1910 246,131 31,674 25,114 19,800 416,912 26,585 ## Santa.Clara Solano Sonoma TOTAL.BAY.AREA ## 1860 11,912 7,169 11,867 114,074 ## 1870 26,246 16,871 19,819 265,808 ## 1880 35,039 18,475 25,926 422,128 ## 1890 48,005 20,946 32,721 547,618 ## 1900 60,216 24,143 38,480 658,111 ## 1910 83,539 27,559 48,394 925,708Format data for later use: remove commas in numbers and “.”s in county names:
colnames(popData) <- gsub(".", " ", colnames(popData), fixed = TRUE) saveDimNames <- dimnames(popData) # save names for later use # remove commas in the numbers popData <- as.data.frame(lapply(popData, function(x) { as.numeric(gsub(",", "", x)) })) # names disappear after the above operation; rename columns dimnames(popData) <- saveDimNames # pop as percent of total popFrac <- popData * 100/popData$TOTAL # remove last ('total') column popData <- popData[, -ncol(popData)] popFrac <- popFrac[, -ncol(popFrac)] # data, after formatting head(popData)## Alameda Contra Costa Marin Napa San Francisco San Mateo Santa Clara ## 1860 8927 5328 3334 5521 56802 3214 11912 ## 1870 24237 8461 6903 7163 149473 6635 26246 ## 1880 62976 12525 11324 13235 233959 8669 35039 ## 1890 93864 13515 13072 16411 298997 10087 48005 ## 1900 130197 18046 15702 16451 342782 12094 60216 ## 1910 246131 31674 25114 19800 416912 26585 83539 ## Solano Sonoma ## 1860 7169 11867 ## 1870 16871 19819 ## 1880 18475 25926 ## 1890 20946 32721 ## 1900 24143 38480 ## 1910 27559 48394head(round(popFrac, 1))## Alameda Contra Costa Marin Napa San Francisco San Mateo Santa Clara ## 1860 7.8 4.7 2.9 4.8 49.8 2.8 10.4 ## 1870 9.1 3.2 2.6 2.7 56.2 2.5 9.9 ## 1880 14.9 3.0 2.7 3.1 55.4 2.1 8.3 ## 1890 17.1 2.5 2.4 3.0 54.6 1.8 8.8 ## 1900 19.8 2.7 2.4 2.5 52.1 1.8 9.1 ## 1910 26.6 3.4 2.7 2.1 45.0 2.9 9.0 ## Solano Sonoma ## 1860 6.3 10.4 ## 1870 6.3 7.5 ## 1880 4.4 6.1 ## 1890 3.8 6.0 ## 1900 3.7 5.8 ## 1910 3.0 5.2# color scheme for the plots plotClr <- rainbow(ncol(popData)) popMax <- ceiling(max(popData)/10^5)Figure 1
png(filename = "fig1.png") layout(matrix(c(1, 1, 1, 2, 2), 5, 1, byrow = TRUE)) matplot(rownames(popData), popData[, 1:9]/10^5, type = "o", col = plotClr, pch = 1:9, xaxt = "n", xlab = "Year", ylab = "Pop (x 100,000)", ylim = c(0, popMax), main = "Population of Bay Area Counties", cex = 2, cex.axis = 1.2, cex.lab = 1.2) axis(1, at = seq(1860, 2010, 10), cex.axis = 1.2, cex.lab = 1.2) legend("top", ncol = 3, legend = colnames(popData), cex = 1.2, col = plotClr, pch = 1:9) matplot(rownames(popFrac), round(popFrac[, 1:9], 1), type = "o", col = plotClr, pch = 1:9, xaxt = "n", xlab = "Year", ylab = "% of Bay Area Tot.", ylim = c(0, max(popFrac)), main = "County Pop as % of Bay Area Total", cex = 2, cex.axis = 1.2, cex.lab = 1.2) axis(1, at = seq(1860, 2010, 10), cex.axis = 1.2, cex.lab = 1.2) garbage <- dev.off()Figure 2
png(filename = "fig2.png") par(mfrow = c(3, 3), mar = c(5, 4, 2, 5) + 0.1, oma = c(0, 0, 4, 0)) for (eachCnty in 1:ncol(popData)) { barplot(popData[, eachCnty]/10^5, names.arg = rownames(popData), xlab = "Year", ylab = "Pop (x 100,000)", main = colnames(popData)[eachCnty]) par(new = TRUE) plot(rownames(popFrac), popFrac[, eachCnty], type = "o", col = "blue", xaxt = "n", yaxt = "n", xlab = "", ylab = "") axis(4, col = "blue", col.ticks = "blue", col.axis = "blue") mtext("% of Bay Area Pop", side = 4, line = 3, cex = 0.7, col = "blue") } mtext("Population of Bay Area Counties", side = 3, line = 1, outer = TRUE, font = 2) garbage <- dev.off()Figure 3
png(filename = "fig3.png") layout(matrix(c(1, 1, 1, 1, 2:17), 5, 4, byrow = TRUE)) par(mar = c(2, 4, 2, 2), oma = c(0, 0, 4, 0)) barplot(0, 0, axes = FALSE) legend("center", ncol = 3, legend = colnames(popData), cex = 1.3, fill = plotClr) for (eachYr in 1:nrow(popData)) { barplot(as.vector(unlist(popData[eachYr, ]/10^5)), col = plotClr, names.arg = NULL, xlab = "", ylab = "Pop (x 100,000)", main = rownames(popData)[eachYr]) } mtext("Population of Bay Area Counties", side = 3, line = 1, outer = TRUE, font = 2) garbage <- dev.off()Figure 4
png(filename = "fig4.png") layout(matrix(c(1, 2, 2), 3, 1, byrow = TRUE)) par(mar = c(4, 4, 2, 2), oma = c(0, 0, 3, 0)) plot(rownames(popData), rowSums(popData)/10^5, type = "o", xaxt = "n", xlab = "", ylab = "Pop (x 100,000)", cex = 2, cex.axis = 1.2, cex.lab = 1.2) axis(1, at = seq(1860, 2010, 10), cex.axis = 1.2, cex.lab = 1.2) barplot(t(as.matrix(popFrac)), main = "County Pop as % of Bay Area Total", ylim = c(0, 120), yaxt = "n", col = plotClr, cex.axis = 1.2, cex.lab = 1.2, cex.names = 1.2) axis(2, at = seq(0, 100, 20), cex.axis = 1.2, cex.lab = 1.2) legend("top", ncol = 3, legend = colnames(popFrac), cex = 1, fill = plotClr) mtext("Population of Bay Area Counties", side = 3, line = 1, outer = TRUE, cex = 1, font = 2) garbage <- dev.off()Summary
I tried 4 different ways of plotting the population data. Each of the 4 methods have pros and cons. Of the above 4, my favorite is the last one. Also, I would like to add one figure with a map showing the counties where the counties are color coded by the population (or pop percentage). For this, additional data is required, such as GIS maps of the counties. Will try to do this for the next post.
-
R is the new (not so blind) love!
I switched to R (sorry python!). The support for statistics, graphics and the resources/community support are unparalleled. Check out R’s home (http://www.r-project.org/) and the R-bloggers sites (http://www.r-bloggers.com/).
-
More python fans
I came across more people/groups using python - the Catchment Model Framework - http://www.uni-giessen.de/~gh1961/ and the hydropy project (documentation is in Spanish, but code is in English) - http://code.google.com/p/hydropy/
-
Up and running!
Finally the blog is up and running! After trying blogger.com and wordpress.com, I’ve settled on tumblr. Check out the pages on this blog. These pages will be updated periodically.