Why aren’t there more women in ecological computing?

In case you somehow missed it, we’re living in an information era. For most scientists, this means that an ever-growing component of our jobs is to access and synthesize information. The old naturalist skills that drove ecology in the past are being replaced, in many cases wholesale, with information technology acumen.

Despite the importance of IT in modern ecological research, computational skills in the ecological community remain pretty gender-imbalanced. As a computationally savvy(ish) female scientist, I find this worrying. My personal experience is that computational skills take work to acquire, and in the end they are based more on effort than on talent or intuition. Ecologists, and particularly ecological women, need to develop and communicate these skills if we want to hold on to our federal support. At the moment, we’re operating under digital silence. In this post, I’ll

  • Look at the numbers on female participation in ecological computing
  • Think about what these numbers mean
  • Suggest some steps to improve ecologists’ computational skills

The intent of this post is NOT to criticize the computational community, and particularly the R project, whose community is very actively trying to improve female participation. It is also not intended as a criticism of Penn State, whose biology program has given me a huge amount of flexibility to pursue my graduate work. Instead, I hope that this post will inspire smart women to recognize one another, and enter and participate in the computational domain in a more visible way.

    A few numbers on women in ecological computing

    1. Female participation in R / Open Source

    The vast majority of academic ecologists now use the statistical computing environment R (www.r-project.org). R is an open source/open access project, driven by a board of directors. R packages are contributed by the broader scientific community and adopted for hosting at the Comprehensive R Archive Network (“CRAN”). While gender representation among PhD recipients in the scientific fields that interact most heavily with R is increasingly balanced (~46% of statistics PhDs awarded to women over the last decade; ~50% of science PhDs awarded to women), female contribution to R remains low.

    I checked the packages listed on the CRAN environmetrics task view. Of the 107 packages listed, five were maintained by authors with clearly female names, whereas 94 had maintainers with clearly male monikers (the maintainers of the remaining 8 packages had names that could not be readily classified using the gender package in R). In short,

      5% of the ecology/environment packages hosted on CRAN are maintained by people with clearly female names.

    There are currently 24 men and no women on the R Core Development Team (which is an elected body). Of the 58 other folks that the R project recognizes as major contributors to R, only one has a clearly female first name.

    The dearth of women in computational ecology reflects the general absence of female participation in scientific computing. Although it’s notoriously difficult to measure, female Linux usership likely remains below 5%. StackOverflow, R-help, and other web forums for computing assistance see very little female participation, even among individuals posing questions. Women in ecology confront the same challenging computational culture that women encounter across computing. While there are some signs of progress (for example, r-help doesn’t seem to be getting meaner), a lot of work remains.

    2. Female participation and recognition on github

    Additionally, female participation on the code management site github remains incredibly low.

      Less than 3% of code repositories with 5 or more stars map to owners with clearly female names (see Alyssa Frazee’s analysis here).

    The low rates hold across a whole bunch of programming languages. And don’t be fooled: R looks better than most because the gender classifier Frazee used classes “Hadley” as female.

    3. Female participation in the ecological literature

    The trend toward female underparticipation holds in the ecological literature as well. 28% of the authors on 500 randomly chosen publications in Ecology Letters have clearly female first names (as opposed to 71% with clearly male names). This drops to 25-26% in the journals Ecological Modelling and Journal of Theoretical Biology (note, however, that many names could not be classified: 18.8% in Ecology Letters, 42.0% in Ecological Modelling, and 39.5% in JTB).

    What to make of this?

    First, these numbers are in no way perfect. The participation rates presented here hinge on first name as a true signal of an individual’s gender. Gender associations with a given name are probabilistic, and certainly there may be some error in the gender classification of each name. Generally, women in science are aware of gendered stereotypes, and may use gender-ambiguous monikers, or forego names for initials. Lastly, the distribution of the number of packages per maintainer is somewhat overdispersed, and the maintainers with more than four packages on the environmetrics list all have male names.

    Even with these caveats, I think these data convey a real, emerging problem for the ecological community. Even though many women participate in graduate-level ecology, relatively few participate in the computational realm. Increasingly, that’s the area where the money is. We need to make some changes, or women are likely to be left out.

    These numbers don’t tell the whole story, though. There are some women who are doing amazing things in the computational domain, whose work is often overlooked. Karline Soetaert, for example, is the author of the deSolve package in R, a workhorse package that gets used all over the place. The University of Washington is home to several prominent female R developers, including Daniela Witten and Hana Sevcikova. We may see a generational shift (someday) to a more gender-inclusive computing environment…

    Solutions
    … but that day hasn’t arrived just yet. I think these steps would help get ecological women (and all entry-level users) assimilated into the broader computational community.

    1. A clear statement of computational expectations for ecologists.

    In my experience, grad students really like checking things off lists. If they’re given a list of required competencies, they’ll achieve them. The problem with computation is that people don’t know what skills they should be working to attain. Here’s one example of such a list for ecology grad students:

    – basic knowledge of file directory structure on your own operating system
    – ability to launch a terminal and navigate in it
    – basic knowledge of SQL (set up a database; run a query)
    – ability to fit basic statistical models (ANOVA, linear models)
    – ability to simulate simple datasets
    – basic plotting in R (with overachievers picking up gnuplot or javascript); see the sketch after this list for a taste of several of these
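
    As a minimal sketch of several of these skills in R (the simulated effect sizes are invented purely for illustration, and the SQL bit assumes the DBI and RSQLite packages are installed):

    # simulate a simple dataset
    n <- 100
    x <- runif(n, min = 0, max = 10)
    y <- 2 + 0.5 * x + rnorm(n, sd = 1)

    # fit and inspect a basic linear model
    fit <- lm(y ~ x)
    summary(fit)

    # basic plotting: the data plus the fitted line
    plot(x, y)
    abline(fit)

    # basic SQL from R: set up an in-memory database, load a table, run a query
    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "iris", iris)
    dbGetQuery(con, "SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species")
    dbDisconnect(con)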

    In my mind, these are NOT statistical competencies. There is a TON of material that has to get covered in statistical courses for ecologists. Computational skill belongs elsewhere in the ecological curriculum (if for no other reason than that, just like ecologists, many statisticians lack the training to effectively outfit students with these skills).

    2. Local contact and community

    User groups and forums, computing “corners” where people can work together, and formal code mentoring are all enormously helpful in getting smart new users up to speed computationally. These forums exist at many institutions, but my impression is that they tend to be chronically underutilized. Graduate programs should encourage participation in these forums. A competency list might foster participation.

    3. Inclusion of computing in the general ecology curriculum.

    I wish more graduate programs would incorporate a one-credit, one-hour-a-week-for-one-semester course for all ecology graduate students that taught basic computing, got students started with R (and maybe also Python, javascript, etc.), and set them up computationally to conduct their own research.

    This post builds on a talk for the Montana State University Department of Statistics, given in April of 2015.


    Lunch with Dr. Ian Lipkin

    Please join us for lunch with Dr. Ian Lipkin (Columbia) immediately following his CIDD seminar. RSVP using the poll below. Hope to see you there!

    When: Thursday, May 7th, 2015 12:00pm – 1:00pm

    Where: W-201 Millennium Science Complex


    Lunch with Dr. Paul Turner

    Join us for lunch with Dr. Paul Turner (Yale University) immediately following his CIDD seminar. RSVP using the poll below so that we can order enough food!

    When: Thursday, April 23rd, 2015 12:00-1:00pm

    Where: W-201 Millennium Science Complex


    A few steps toward cleaner, better-organized code

    This post follows from a discussion with my Bozeman lab on code management (see my very ugly slides with more details, especially on using git and github through Rstudio, here).

    Developing good coding habits takes a little time and thought, but the pay-off potential is high for (at least) these three reasons.

    • Well-maintained code is usually more reproducible than poorly-maintained code.
    • Clean code is often easier to run on a cluster than poorly-maintained code (for example, it’s easier to ship clean simulations and analyses away to a “big” computer, and save computing time on your local machine).
    • It’s likely that code, as well as data, will soon be a required component of publication, and we might as well be ready.

    Since biologists and ecologists receive limited code management training, many grad students are unaware of established code and project management tools, and wind up reinventing the wheel. In this post, I go through a few simple steps to improve code management and keep analyses organized. Most of my examples pertain to R, since that’s my language of choice, but these same principles can be applied to most languages. Here’s my list, with more details on each element (and strategies for implementing them) below.

    1. Find a style guide you like, and use it
    2. Use a consistent filing system for all projects
    3. Functionalize and annotate analyses
    4. Use an integrated development environment like RStudio
    5. Use version control software like git, SVN, or Hg and incorporate github or bitbucket into your workflow

    Like much of science, programming benefits from a little sunlight. If you’re serious about improving your code, get someone else to look at it and give you feedback (in fact, for R users, I can probably be that person: shoot me an email)!

    1. Find a style guide you like, and use it

    Most programming languages (R, Python, C++, etc.) have a particular set of conventions dictating code appearance. These conventions are documented in style guides, like this one for R. The point of a style guide is to keep code readable, and there’s solid research sitting behind most of the advice. In my opinion, these are the two most important stylistic suggestions for R:

    • Spacing: put a space around all binary operators (=, +, <-, -), and after all commas
    • Line breaks: break lines at 80 characters (you can set RStudio to display a vertical reminder line under Tools -> Global Options -> select Code Editing on the left, and then set “Show margins” to 80).

    Following the suggested indentation specs is also a good idea, especially if you work (or plan to eventually work) in Python or other languages in which indentation and spacing are interpreted pieces of the code syntax.
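
    As a tiny before-and-after sketch of those two spacing conventions (the variables here are invented):

    # cramped: no spaces around operators or after commas
    x<-c(1,2,3);y<-2*x+1

    # styled: spaces around binary operators and after commas
    x <- c(1, 2, 3)
    y <- 2 * x + 1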

    2. Use a consistent filing system for all projects

    Because of the ebbs and flows of data collection, teaching, and travel, it’s not unusual for scientists to juggle five to ten different projects simultaneously. One way to overcome the resulting chaos is to keep a consistent directory structure for all projects.

    I keep every project (for me, “project” is usually synonymous with “manuscript”) in a folder named Research.  The Research folder has subdirectories (a.k.a. subfolders) for each project, and the folders are named to reflect their topics. Inside each project folder, I have exactly the same set of folders. These are

    • Data: contains all datasets associated with this project
    • Code: contains all code files required for this project
    • Documentation: contains manuscript drafts, notes, and presentations associated with this project
    • Figures: contains all project figures

    Collaborative projects (reviews, analyses with multiple analysts, or projects where I’m working as a consultant) have an additional “Communications” folder. I often use subdirectories within these folders, but their composition varies.
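
    To make this concrete, here is a small sketch that scaffolds that folder set for a new project from within R (the project name is made up):

    # create the standard subdirectories for a hypothetical new project
    project <- "~/Research/ElkDemography"  # invented project name
    for (d in c("Data", "Code", "Documentation", "Figures")) {
      dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
    }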

    3. Functionalize and annotate analyses

    Code for data analysis generally does one of the following: loads data, cleans data, generates functionality (reads in source functions), and runs analysis. Some programmers advocate having a separate script allocated specifically to each of these processes (I found these suggestions, and particularly this embedded link, very useful). I haven’t fully integrated this idea into my workflow yet, but I like it.

    Functions are like little machines that take a set of inputs (or “arguments”), do some stuff based on those inputs, and return a set of outputs (or “values”).

    There are at least two good reasons to functionalize. First, it facilitates error checking, and second, it improves code readability. My very basic rule of thumb is that if it takes me more than about five lines of code to get from inputs to outputs, it’s probably worth functionalizing.

    Here’s an example of a function in R:

    MyFunction <- function(input.in, input.out) {
      # this function returns a vector of integers from 
      # "input.in" to "input.out".
      #
      # Args
      # input.in = first integer in sequence
      # input.out = last integer in sequence
      #
      # Returns
      # k = vector of integers from input.in to input.out
      #
      k <- seq(from = input.in, to = input.out)
      return(k)
    }
    

    I strongly recommend using a header structure like the one I’ve used here, which specifically lists the function’s purpose, arguments, and returns. It’s a good policy to save each function in its own file (I give those files the same name as the function they contain, so the file containing this function would be named MyFunction.R). To load the function into a different file, use R’s source command.

    # head of new file.
    # source in all necessary functions (including MyFunction) first
    source("MyFunction.R")
    
    my.function.test <- MyFunction(input.in = 2, input.out = 7)


    Good code designers plan their code structure ahead of time (e.g., I want this big output; to get it, I will need to work through these five subprocesses; each subprocess gets its own function, and a wrapper function integrates them all into a single entity that I will eventually call in one line).

    In my experience, this is often not how biologists write code. Instead, we often write our analyses (simulations or samplers or whatever) in giant, free-flowing scripts. To get from that programming style to functionalized code, I recommend breaking code into functions after-the-fact in a few existing projects. Doing this a few times helped me learn to identify relevant subroutines at the project’s outset and write in functions from the beginning. Here are a couple examples of relatively common disease ecology programming tasks, with off-the-cuff suggestions about reasonable subroutines.

    Example 1: Simulating a discrete-time, individual-based SIR process
    I know from the beginning that I’ll likely want “Births” and “Deaths” functions, a “GetInfected” function that moves individuals from S to I, and a “Recover” function that moves individuals from I to R (e.g., a function for each major demographic process). If each process gets its own function, I know exactly where to go to incorporate additional complexity (like density dependence, pulsed births, age-specific demographic and disease processes, etc.). I wrap all these functions in a single function that calls each subroutine in sequence.
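
    Here’s a bare-bones sketch of that subroutine layout (the function names follow the ones above, but the parameter values and per-step probabilities are invented purely for illustration):

    # individual states are stored as a character vector of "S", "I", "R"
    Births <- function(pop, birth.rate) {
      # add new susceptibles in proportion to current population size
      c(pop, rep("S", rpois(1, birth.rate * length(pop))))
    }

    Deaths <- function(pop, death.prob) {
      # each individual survives the step with probability 1 - death.prob
      pop[runif(length(pop)) > death.prob]
    }

    GetInfected <- function(pop, beta) {
      # susceptibles are infected with a probability that grows with
      # the proportion currently infectious
      p.inf <- 1 - exp(-beta * mean(pop == "I"))
      new.inf <- pop == "S" & runif(length(pop)) < p.inf
      pop[new.inf] <- "I"
      pop
    }

    Recover <- function(pop, gamma) {
      # each infectious individual recovers with probability gamma
      rec <- pop == "I" & runif(length(pop)) < gamma
      pop[rec] <- "R"
      pop
    }

    SimulateSIR <- function(pop, n.steps, birth.rate, death.prob, beta, gamma) {
      # wrapper: calls each demographic/disease subroutine in sequence
      # and records the number infectious at each time step
      n.inf <- numeric(n.steps)
      for (t in seq_len(n.steps)) {
        pop <- Births(pop, birth.rate)
        pop <- Deaths(pop, death.prob)
        pop <- GetInfected(pop, beta)
        pop <- Recover(pop, gamma)
        n.inf[t] <- sum(pop == "I")
      }
      n.inf
    }

    # toy usage: 200 susceptibles and 5 infectious individuals
    out <- SimulateSIR(pop = c(rep("S", 200), rep("I", 5)), n.steps = 100,
                       birth.rate = 0.01, death.prob = 0.01, beta = 2, gamma = 0.3)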

    Example 2: Writing an MCMC sampler
    Usually I know from the beginning what parameters I’m estimating, and I should have a clear plan about how to step through the algorithm (e.g., which updates are Gibbs, which are Metropolis, which control a reversible jump, etc.). In this case, I’d use a separate function to update each element (or block of elements) in the parameter vector. This gives me the flexibility to change some aspect of one update without having to dig through the whole sampler and risk messing up some other piece. Again, I’d use a wrapper to link all the subroutines into one entity.
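
    In the same spirit, here’s a minimal sketch of the update-function structure (this uses simple Metropolis updates and assumes a user-supplied log-posterior function; the proposal standard deviation is an invented tuning value):

    UpdateElement <- function(theta, i, log.post, sd.prop) {
      # Metropolis update for element i of the parameter vector,
      # using a symmetric normal proposal
      prop <- theta
      prop[i] <- rnorm(1, mean = theta[i], sd = sd.prop)
      if (log(runif(1)) < log.post(prop) - log.post(theta)) prop else theta
    }

    RunSampler <- function(theta0, log.post, n.iter, sd.prop = 0.5) {
      # wrapper: sweeps through the parameter vector, calling one
      # update function per element, and stores every iteration
      draws <- matrix(NA, nrow = n.iter, ncol = length(theta0))
      theta <- theta0
      for (iter in seq_len(n.iter)) {
        for (i in seq_along(theta)) {
          theta <- UpdateElement(theta, i, log.post, sd.prop)
        }
        draws[iter, ] <- theta
      }
      draws
    }

    # toy usage: sample from a standard bivariate normal "posterior"
    log.post <- function(theta) sum(dnorm(theta, log = TRUE))
    draws <- RunSampler(theta0 = c(0, 0), log.post = log.post, n.iter = 1000)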

    My rule of thumb is to annotate everything that I can’t quickly regenerate from basic logic. One of the reasons I’ve pushed toward more functionalized code (and git/github) is because it tightens my annotation process: even if all I do is write an appropriate header for the function, my code is already easier to follow than it was in a single giant script.

    4. Use an integrated development environment like RStudio

    At our roots most of us are scientists, not programmers or developers, and as such we need to keep up with what’s going on in science, not software development. The Open Science movement has invested a lot of effort in helping scientists incorporate modern computational workflows into research. One example project that tries to do this is RStudio. RStudio is an integrated development environment (“IDE”) set up for scientists and analysts. It integrates the R console with LaTeX (a document preparation system commonly used in math, computer science, and physics) and git (one flavor of version control software). Because of these integrations, I highly recommend that scientists using R use it through RStudio (this from someone who used the R-GUI for two years and the vim-r-plugin for another two). In my mind, the integration is what makes RStudio worthwhile, even for Mac users, whose R-GUI isn’t quite so bad.

    5. Use version control software like git, SVN, or Hg and incorporate github or bitbucket into your workflow

    Version control software solves the problem of ending all your code files in “_today’s_date”. These tools keep track of changes you make, and allow you to annotate precisely what you’re doing. Importantly, they also let you dig discarded code chunks out of the trash.

    Several flavors of version control software are now integrated with online platforms. This allows for easy code sharing and cloud backup (and code publication). One platform, github (and regular git run locally), integrates directly with RStudio; another, bitbucket, has unlimited free repositories. Students also have access to five free repos on github: go to education.github.com and request the student pack (in my experience, it helps to send them a reminder email a few weeks after your first request). An aside: women (or, more precisely, users with female first names) are crazily under-represented on github. Growing the female online programming community is important for science, and good for your code to boot!

    Wrap-up
    Some of these steps are easier to implement than others. I recommend establishing some protocols for yourself, applying these protocols to new projects, and gradually moving old projects into compliance as needed.

    In my experience, adopting some parts of the style conventions (especially those pertaining to spacing and characters-per-line) was easy; others were harder. Start with the low-hanging fruit: any improvement is better than none!

    I don’t know many people (and especially not many scientists) who are particularly proud of how their code looks, but having people look at your code can be incredibly instructive. Find a buddy and work together, if you can.

    Finally, a few small changes can make a big difference, not only in the reproducibility of your science, but also in your confidence as a programmer. People do ask biologists and ecologists for code samples in the job application process, and a little finesse goes a long way!


    Creepy, crawly, crunchy: Can insects feed the future?

    As CIDD graduate students, we think about insects as vectors of disease, but we don’t always consider their other important attributes, like how tasty they might be. Many insects are edible, and considering their use as novel livestock may help address global food insecurity.

    Why should we worry about food insecurity? We live on a hungry planet. The Food and Agriculture Organization of the United Nations (FAO) estimates that 805 million people are chronically undernourished. Projections of food requirements and population growth suggest that we would need to increase current global food production by 70 percent to keep up with demands for human food by 2050. We could also shift paradigms in food production to incorporate novel sources of food. Are insects the livestock of the future?

    On Tuesday, April 21, a panel discussion will be held on the merits of using insects as nontraditional livestock to help feed the globe. Let’s eat more bugs. The discussion will focus on using insects as a human food source, with particular attention to the barriers to insect rearing and insect eating, or “entomophagy”, in the developed and developing worlds.

    The panelists are Robert (Bob) Anderson, founder of Sustainable Strategies LLC and advisor to the U.S. Department of Agriculture, Dr. Florence Dunkel, an associate professor in the College of Agriculture at Montana State University, Dr. Dorothy Blair, a former assistant professor of Nutrition at Penn State, and Dr. Alyssa Chilton, a Penn State staff sensory scientist in the Department of Food Science. Florence has an interesting TEDx talk which can be found here.

    Tuesday, April 21, 2015

    12 – 1:30 p.m.

    Foster Auditorium

    Paterno Library


    Opportunity to meet with Florence Dunkel, April 20 @ 11:00 a.m.

    All graduate students (CIDD, biology, entomology, INTAD) are welcome to meet with Florence Dunkel on Monday, April 20 at 11:00 a.m. Location TBA. Please RSVP to Jo at jo.ohm@psu.edu if you would like to attend.



    Nita Bharti seminar

    Don’t forget that CIDD’s Nita Bharti will be giving a talk tomorrow:

    • The role of movement in the spread and control of disease
    • March 17, 4-5pm
    • 8 Mueller Lab

    Lunch with Dr. Seth Barribeau

    Please join us for lunch with Dr. Seth Barribeau (East Carolina University) following his seminar titled “Specificity, memory, and immune system evolution across a social gradient.”  He is very interested in host-parasite interactions within insects!

    Where: W-203 Millennium Science Complex

    When: March 19th, 2015 12:00 – 1:00PM

    Please RSVP before Wednesday, March 18th at 5pm using the poll below. Hope to see you there!


    Big in Japan

    CIDD faculty member Matt Ferrari recently traveled to Miyazaki, Japan for a research collaboration on foot and mouth disease. Below is a link to local news coverage along with a rough translation:

    http://mrt.jp/localnews/?newsid=00013384

    A very rough translation follows:

    Foreign research teams working on FMD visited Miyazaki and exchanged opinions with the local government staff who were in charge of FMD containment five years ago, in order to learn what kinds of control measures had been taken.

    The visitors are a team of mathematicians from four countries, including the US and the UK.

    FMD occurred in the UK in 2001, leading to the slaughter of more than 6 million livestock. It is also still spreading in Korea now, and FMD poses significant international concern.
    Against this background, the team is developing simulation models that can help predict how the disease would spread if an FMD outbreak were to occur. On the 27th, they exchanged opinions with the local government staff who were actually involved in the FMD containment five years ago.

    (Miyazaki staff) Depending on the outbreak situation, we have to decide, say, whether or not to use vaccine, or whether or not to conduct preemptive culling. This requires vast financial support.
    (Reporter) Here is the FMD memorial centre in Takanabe town. The researchers from abroad will now visit it to learn about the FMD outbreak in Miyazaki.

    The researchers examined the exhibits very seriously.

    This was followed by Colleen’s interview…


    Science image contest

    If you use any hi-res imaging in your research, you may want to check out the College of Engineering Art in Science competition (mentioned today at the Millennium Cafe). The website information is a bit sparse, but it’s sponsored by the Engineering Graduate Student Council, and they’ll be awarding $100 to the top three science images.
