Lunch with Dr. Ian Lipkin

Please join us for lunch with Dr. Ian Lipkin (Columbia) immediately following his CIDD seminar. RSVP using the poll below. We hope to see you there!

When: Thursday, May 7th, 2015 12:00pm – 1:00pm

Where: W-201 Millennium Science Complex

Posted in News | Leave a comment

Lunch with Dr. Paul Turner

Join us for lunch with Dr. Paul Turner (Yale University) immediately following his CIDD seminar. RSVP using the poll below so that we can order enough food!

When: Thursday, April 23rd, 2015 12:00-1:00pm

Where: W-201 Millennium Science Complex

Posted in Speaker Lunches | Leave a comment

A few steps toward cleaner, better-organized code

This post follows from a discussion with my Bozeman lab on code management (see my very ugly slides with more details, especially on using git and github through Rstudio, here).

Developing good coding habits takes a little time and thought, but the pay-off potential is high for (at least) these three reasons.

  • Well-maintained code is usually more reproducible than poorly-maintained code.
  • Clean code is often easier to run on a cluster than poorly-maintained code (for example, it’s easier to ship clean simulations and analyses away to a “big” computer, and save computing time on your local machine).
  • It’s likely that code, as well as data, will soon be a required component of publication, and we might as well be ready.

Since biologists and ecologists receive limited code management training, many grad students are unaware of established code and project management tools, and wind up reinventing the wheel. In this post, I go through a few simple steps to improve code management and keep analyses organized. Most of my examples pertain to R, since that’s my language of choice, but these same principles can be applied to most languages. Here’s my list, with more details on each element (and strategies for implementing them) below.

  1. Find a style guide you like, and use it
  2. Use a consistent filing system for all projects
  3. Functionalize and annotate analyses
  4. Use an integrated development environment like RStudio
  5. Use version control software like git, SVN, or Hg and incorporate github or bitbucket into your workflow

Like much of science, programming benefits from a little sunlight. If you’re serious about improving your code, get someone else to look at it and give you feedback (in fact, for R users, I can probably be that person; shoot me an email)!

1. Find a style guide you like, and use it

Most programming languages (R, Python, C++, etc.) have a particular set of conventions dictating code appearance. These conventions are documented in style guides, like this one for R. The point of a style guide is to keep code readable, and there’s solid research sitting behind most of the advice. In my opinion, these are the two most important stylistic suggestions for R:

  • Spacing: put a space around all binary operators (=, +, <-, -), and after all commas
  • Line breaks: break lines at 80 characters (you can set RStudio to display a vertical reminder line under Tools -> Global Options -> select Code Editing on the left, and then set “Show margins” to 80).

Following the suggested indentation specs is also a good idea, especially if you work (or plan to eventually work) in Python or other languages in which indentation and spacing are interpreted pieces of the code syntax.
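To make the spacing and line-break suggestions concrete, here is a small before-and-after sketch. The variable names and values are made up for illustration, and the one-argument-per-line layout for long calls is just one common way to stay under 80 characters, not a rule from any particular guide.

```r
# Harder to read: no spaces around operators or after commas
x<-c(1,2,3)
y<-x*2+1

# Easier to read: spaces around binary operators and after commas
x <- c(1, 2, 3)
y <- x * 2 + 1

# A longer call broken across lines so that no line runs past 80 characters
z <- seq(from = min(x),
         to = max(y),
         length.out = 5)
```

Both versions of the first two lines do exactly the same thing; only the readability changes.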

2. Use a consistent filing system for all projects

Because of the ebbs and flows of data collection, teaching, and travel, it’s not unusual for scientists to juggle five to ten different projects simultaneously. One way to overcome the resulting chaos is to keep a consistent directory structure for all projects.

I keep every project (for me, “project” is usually synonymous with “manuscript”) in a folder named Research.  The Research folder has subdirectories (a.k.a. subfolders) for each project, and the folders are named to reflect their topics. Inside each project folder, I have exactly the same set of folders. These are

  • Data: contains all datasets associated with this project
  • Code: contains all code files required for this project
  • Documentation: contains manuscript drafts, notes, presentations associated with this project
  • Figures: contains all project figures

Collaborative projects (reviews, analyses with multiple analysts, or projects where I’m working as a consultant) have an additional “Communications” folder. I often use subdirectories within these folders, but their composition varies.
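If you want every new project to start with the same skeleton, a few lines of R can build it. This is only a sketch: the function name CreateProject, its arguments, and the example project name are hypothetical, and the demo writes into a temporary directory rather than a real Research folder.

```r
CreateProject <- function(project.name, research.dir = "Research") {
  # this function creates the standard set of subfolders for a new project.
  #
  # Args
  # project.name = name of the new project folder (hypothetical example)
  # research.dir = path to the top-level Research folder
  #
  # Returns
  # project.path = path to the newly created project folder
  #
  subfolders <- c("Data", "Code", "Documentation", "Figures")
  project.path <- file.path(research.dir, project.name)
  for (folder in subfolders) {
    # recursive = TRUE also creates the project folder itself if needed
    dir.create(file.path(project.path, folder),
               recursive = TRUE, showWarnings = FALSE)
  }
  return(project.path)
}

# demo: build the skeleton inside a temporary directory
demo.path <- CreateProject("ExampleProject", research.dir = tempdir())
list.files(demo.path)
```

For a collaborative project, adding "Communications" to the subfolders vector would cover the extra folder described above.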

3. Functionalize and annotate analyses

Code for data analysis generally does one of the following: loads data, cleans data, generates functionality (reads in source functions), or runs analysis. Some programmers advocate having a separate script allocated specifically to each of these processes (I found these suggestions, and particularly this embedded link, very useful). I haven’t fully integrated this idea into my workflow yet, but I like it.

Functions are like little machines that take a set of inputs (or “arguments”), do some stuff based on those inputs, and return a set of outputs (or “values”).

There are at least two good reasons to functionalize. First, it facilitates error checking, and second, it improves code readability. My very basic rule of thumb is that if it takes me more than about five lines of code to get from inputs to outputs, it’s probably worth functionalizing.

Here’s an example of a function in R:

 
MyFunction <- function(input.in, input.out) {
  # this function returns a vector of integers from 
  # "input.in" to "input.out".
  #
  # Args
  # input.in = first integer in sequence
  # input.out = last integer in sequence
  #
  # Returns
  # k = vector of integers from input.in to input.out
  #
  k <- seq(from = input.in, to = input.out)
  return(k)
}

I strongly recommend using a header structure like I’ve used here, that specifically lists function purpose, arguments, and returns. It’s a good policy to save each function in its own file (I give those files the same name as the function they contain, so the file containing this function would be named MyFunction.R). To load the function into a different file, use R’s source command.

# head of new file.
# source in all necessary functions (including MyFunction) first
source("MyFunction.R")

my.function.test <- MyFunction(input.in = 2, input.out = 7)
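
Once a project has more than a handful of function files, sourcing them one by one gets tedious. One option, sketched here under the assumption that all function files live together in a single Code folder, is to loop over every .R file in that folder. The temporary folder and the writeLines step exist only to make the demo self-contained; in a real project the files would already be there.

```r
# demo setup only: write MyFunction into its own file in a temporary Code folder
code.dir <- file.path(tempdir(), "Code")
dir.create(code.dir, showWarnings = FALSE)
writeLines(c("MyFunction <- function(input.in, input.out) {",
             "  k <- seq(from = input.in, to = input.out)",
             "  return(k)",
             "}"),
           con = file.path(code.dir, "MyFunction.R"))

# source every .R file found in the Code folder
function.files <- list.files(code.dir, pattern = "\\.R$", full.names = TRUE)
for (f in function.files) {
  source(f)
}

my.function.test <- MyFunction(input.in = 2, input.out = 7)
```

This only works cleanly if the folder holds function definitions and nothing else; an analysis script sitting in the same folder would get run too.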

 

Good code designers plan their code structure ahead of time (e.g., I want this big output; to get it, I will need to work through these five subprocesses; each subprocess gets its own function, and a wrapper function integrates them all into a single entity that I will eventually call in one line).

In my experience, this is often not how biologists write code. Instead, we often write our analyses (simulations or samplers or whatever) in giant, free-flowing scripts. To get from that programming style to functionalized code, I recommend breaking code into functions after-the-fact in a few existing projects. Doing this a few times helped me learn to identify relevant subroutines at the project’s outset and write in functions from the beginning. Here are a couple examples of relatively common disease ecology programming tasks, with off-the-cuff suggestions about reasonable subroutines.

Example 1: Simulating a discrete-time, individual-based SIR process
I know from the beginning that I’ll likely want “Births” and “Deaths” functions, a “GetInfected” function that moves individuals from S to I, and a “Recover” function that moves individuals from I to R (e.g., a function for each major demographic process). If each process gets its own function, I know exactly where to go to incorporate additional complexity (like density dependence, pulsed births, age-specific demographic and disease processes, etc.). I wrap all these functions in a single function that calls each subroutine in sequence.
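
Here is a rough sketch of how that structure might look in R. The four process names come from the description above; everything else is an assumption made for illustration: the wrapper name SimulateSIR, storing the population as a character vector of states, Poisson births, frequency-dependent transmission, the order of events within a time step, and all parameter values.

```r
Births <- function(pop, birth.rate) {
  # adds susceptible newborns; the number is Poisson with mean birth.rate * N
  n.births <- rpois(1, lambda = birth.rate * length(pop))
  return(c(pop, rep("S", n.births)))
}

Deaths <- function(pop, death.prob) {
  # removes each individual independently with probability death.prob
  survives <- runif(length(pop)) > death.prob
  return(pop[survives])
}

GetInfected <- function(pop, beta) {
  # moves individuals from S to I (assumes frequency-dependent transmission)
  n <- length(pop)
  if (n == 0) {
    return(pop)
  }
  inf.prob <- 1 - exp(-beta * sum(pop == "I") / n)
  new.cases <- pop == "S" & runif(n) < inf.prob
  pop[new.cases] <- "I"
  return(pop)
}

Recover <- function(pop, gamma) {
  # moves each infected individual from I to R with probability gamma
  recovers <- pop == "I" & runif(length(pop)) < gamma
  pop[recovers] <- "R"
  return(pop)
}

SimulateSIR <- function(n.s, n.i, n.steps, birth.rate, death.prob, beta,
                        gamma) {
  # wrapper: calls each subroutine in sequence at every time step and
  # returns a matrix of S, I, and R counts (one row per time step)
  pop <- c(rep("S", n.s), rep("I", n.i))
  counts <- matrix(0, nrow = n.steps, ncol = 3,
                   dimnames = list(NULL, c("S", "I", "R")))
  for (t in seq_len(n.steps)) {
    pop <- Births(pop, birth.rate)
    pop <- Deaths(pop, death.prob)
    pop <- GetInfected(pop, beta)
    pop <- Recover(pop, gamma)
    counts[t, ] <- c(sum(pop == "S"), sum(pop == "I"), sum(pop == "R"))
  }
  return(counts)
}

set.seed(1)
sir.out <- SimulateSIR(n.s = 99, n.i = 1, n.steps = 50, birth.rate = 0.01,
                       death.prob = 0.01, beta = 0.5, gamma = 0.1)
```

With this ordering, an individual infected in a given step can also recover in that same step. Because each process is its own function, changing that choice, or adding something like density-dependent births, means editing one small function rather than the whole simulation.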

Example 2: Writing an MCMC sampler
Usually I know from the beginning what parameters I’m estimating, and I should have a clear plan about how to step through the algorithm (e.g., which updates are Gibbs, which are Metropolis, which control a reversible jump, etc.). In this case, I’d use a separate function to update each element (or block of elements) in the parameter vector. This gives me the flexibility to change some aspect of one update without having to dig through the whole sampler and risk messing up some other piece. Again, I’d use a wrapper to link all the subroutines into one entity.
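
As a toy illustration of that layout, here is a sketch of a sampler for the mean and standard deviation of normally distributed data. The model, the priors (a normal prior on mu and a normal prior on log(sigma)), the function names, and the tuning values are all assumptions chosen to keep the example short; the point is the structure, with one function per update and a wrapper that calls them in turn.

```r
UpdateMu <- function(y, sigma, prior.mean = 0, prior.var = 100) {
  # Gibbs update: draws mu from its normal full conditional distribution
  post.var <- 1 / (1 / prior.var + length(y) / sigma^2)
  post.mean <- post.var * (prior.mean / prior.var + sum(y) / sigma^2)
  return(rnorm(1, mean = post.mean, sd = sqrt(post.var)))
}

LogPostSigma <- function(log.sigma, y, mu) {
  # log posterior (up to a constant) for log(sigma), given mu;
  # assumes a Normal(0, sd = 10) prior placed directly on log(sigma)
  log.lik <- sum(dnorm(y, mean = mu, sd = exp(log.sigma), log = TRUE))
  log.prior <- dnorm(log.sigma, mean = 0, sd = 10, log = TRUE)
  return(log.lik + log.prior)
}

UpdateSigma <- function(y, mu, sigma, step.sd = 0.2) {
  # Metropolis update: symmetric random-walk proposal on log(sigma)
  current <- log(sigma)
  proposal <- rnorm(1, mean = current, sd = step.sd)
  log.ratio <- LogPostSigma(proposal, y, mu) - LogPostSigma(current, y, mu)
  if (log(runif(1)) < log.ratio) {
    return(exp(proposal))
  }
  return(sigma)
}

RunSampler <- function(y, n.iter, mu.init = 0, sigma.init = 1) {
  # wrapper: links the two updates and stores the draws from each iteration
  draws <- matrix(NA_real_, nrow = n.iter, ncol = 2,
                  dimnames = list(NULL, c("mu", "sigma")))
  mu <- mu.init
  sigma <- sigma.init
  for (i in seq_len(n.iter)) {
    mu <- UpdateMu(y, sigma)
    sigma <- UpdateSigma(y, mu, sigma)
    draws[i, ] <- c(mu, sigma)
  }
  return(draws)
}

# simulated data for the demo (true mean 3, true sd 2)
set.seed(1)
y.sim <- rnorm(100, mean = 3, sd = 2)
draws <- RunSampler(y.sim, n.iter = 2000)
colMeans(draws[-(1:500), ])  # posterior means after dropping 500 burn-in draws
```

Swapping the Metropolis step for a different proposal, or changing a prior, only touches UpdateSigma or LogPostSigma; the wrapper and the Gibbs update stay exactly as they are.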

My rule of thumb is to annotate everything that I can’t quickly regenerate from basic logic. One of the reasons I’ve pushed toward more functionalized code (and git/github) is because it tightens my annotation process: even if all I do is write an appropriate header for the function, my code is already easier to follow than it was in a single giant script.

4. Use an integrated development environment like RStudio

At our roots most of us are scientists, not programmers or developers, and as such we need to keep up with what’s going on in science, not software development. The Open Science movement has invested a lot of effort in helping scientists incorporate modern computational workflows into research. One example project that tries to do this is RStudio. RStudio is an integrated development environment (“IDE”) set up for scientists and analysts. It integrates an R interpreter with LaTeX (a document generating program commonly used in math, computer science, and physics) and git (one flavor of version control software). Because of these integrations, I highly recommend that scientists using R use it through RStudio (this from someone who used the R-GUI for two years and vim-r-plugin for another two). In my mind, the integration is what makes RStudio worthwhile, even for Mac users whose R-GUI isn’t quite so bad.

5. Use version control software like git, SVN, or Hg and incorporate github or bitbucket into your workflow

Version control software solves the problem of ending all your code files in “_today’s_date”. These tools keep track of changes you make, and allow you to annotate precisely what you’re doing. Importantly, they also let you dig discarded code chunks out of the trash.

Several flavors of version control software are now integrated with online platforms. This allows for easy code sharing and cloud backup (and code publication). One platform, github (and regular git run locally), integrates directly with RStudio; another, bitbucket, has unlimited free repositories. Students also have access to five free repos on github — go to education.github.com and request the student pack; in my experience, it helps to send them a reminder email a few weeks after your first request. An aside: women (or, users with female first names) are crazily under-represented on github. Growing the female online programming community is important for science, and good for your code to boot!

Wrap-up
Some of these steps are easier to implement than others. I recommend establishing some protocols for yourself, applying these protocols to new projects, and gradually moving old projects into compliance as needed.

In my experience, adopting some parts of the style conventions (especially those pertaining to spacing and characters-per-line) was easy; other parts were harder. Start with the low-hanging fruit: any improvement is better than none!

I don’t know many people (and especially not many scientists) who are particularly proud of how their code looks, but having people look at your code can be incredibly instructive. Find a buddy and work together, if you can.

Finally, a few small changes can make a big difference, not only in the reproducibility of your science, but also in your confidence as a programmer. People do ask biologists and ecologists for code samples in the job application process, and a little finesse goes a long way!

Posted in News | 4 Comments

Creepy, crawly, crunchy: Can insects feed the future?

As CIDD graduate students, we think about insects as vectors of disease, but we don’t always consider their other important attributes, like how tasty they might be. Many insects are edible, and raising them as a novel form of livestock may help address global food insecurity.

Why should we worry about food insecurity? We live on a hungry planet. The Food and Agriculture Organization of the United Nations (FAO) estimates that 805 million people are chronically undernourished. Projections of food requirements and population growth suggest that we would need to increase current global food production by 70 percent to keep up with demands for human food by 2050. We could also shift paradigms in food production to incorporate novel sources of food. Are insects the livestock of the future?

On Tuesday, April 21 a panel discussion will be held to discuss the merits of using insects as nontraditional livestock to help feed our globe. Let’s eat more bugs. The discussion will be focused on using insects as a human food source, with particular focus on the barriers to insect-rearing, and insect-eating or “entomophagy”, in the developed and developing world.

The panelists are Robert (Bob) Anderson, founder of Sustainable Strategies LLC and advisor to the U.S. Department of Agriculture, Dr. Florence Dunkel, an associate professor in the College of Agriculture at Montana State University, Dr. Dorothy Blair, a former assistant professor of Nutrition at Penn State, and Dr. Alyssa Chilton, a Penn State staff sensory scientist in the Department of Food Science. Florence has an interesting TEDx talk which can be found here.

Tuesday, April 21, 2015

12 – 1:30 p.m.

Foster Auditorium

Paterno Library

Posted in News | Leave a comment

Opportunity to meet with Florence Dunkel, April 20 @ 11:00 a.m.

All graduate students (CIDD, biology, entomology, INTAD) are welcome to meet with Florence Dunkel on Monday, April 20 at 11:00 a.m. Location TBA. Please RSVP to Jo at jo.ohm@psu.edu if you would like to attend.

 

Posted in News | Leave a comment

Nita Bharti seminar

Don’t forget that CIDD’s Nita Bharti will be giving a talk tomorrow:

  • The role of movement in the spread and control of disease
  • March 17, 4-5pm
  • 8 Mueller Lab
Posted in News | Leave a comment

Lunch with Dr. Seth Barribeau

Please join us for lunch with Dr. Seth Barribeau (East Carolina University) following his seminar titled “Specificity, memory, and immune system evolution across a social gradient.”  He is very interested in host-parasite interactions within insects!

Where: W-203 Millennium Science Complex

When: March 19th, 2015 12:00 – 1:00PM

Please RSVP before Wednesday, March 18th at 5pm using the poll below. Hope to see you there!

Posted in News | Leave a comment

Big in Japan

CIDD faculty member Matt Ferrari recently traveled to Miyazaki, Japan for a research collaboration on foot and mouth disease. Below is a link to local news coverage along with a rough translation:

http://mrt.jp/localnews/?newsid=00013384

A very rough translation follows:

Foreign research teams working on FMD visited Miyazaki and exchanged opinions with the local government staff who were in charge of FMD containment 5 years ago, to learn what kinds of control measures had been taken.

The visitors are a team of mathematicians from 4 countries, including the US and UK.

FMD occurred in the UK in 2001, which led to the slaughter of more than 6 million livestock. It is also still spreading in Korea now; FMD poses significant international concerns.
Against this background, the team is developing simulation models that can help predict how the disease would spread if FMD were to occur. On the 27th, they exchanged opinions with local government staff who were actually involved in the FMD containment 5 years ago.

(Miyazaki staff) Depending on the outbreak situation, we have to decide, say, whether or not to use vaccine, or whether or not to conduct preemptive culling. This requires vast financial support.
(Reporter) Here is the FMD memorial centre in Takanabe-town. Researchers from abroad will now visit here to learn about the FMD outbreak in Miyazaki.

The researchers looked at the exhibitions very seriously.

This was followed by Colleen’s interview…

Posted in News | Leave a comment

Science image contest

If you use any hi-res imaging in your research, you may want to check out the College of Engineering Art in Science competition (mentioned today at the Millennium Cafe). The website information is a bit sparse, but it’s sponsored by the Engineering Graduate Student Council, and they’ll be awarding $100 to the top three science images.

Posted in News | Leave a comment

Research and industry

There’s an upcoming Vet-Biomed talk that may be of interest if you’re pursuing opportunities in industry:

  • Evolution of pharmaceutical research at the interface of academia and industry
  • Andrew Dahlem, Eli Lilly and Co.
  • March 4, 2015, 10-11am
  • W201 MSC
Posted in News | Leave a comment