In your career as an environmental data scientist, you will often have to learn new analytic approaches to solve problems. Learning an unfamiliar statistical model is an important skill in its own right. In this final project, you will use the skills you’ve developed in class to teach yourself a statistical model and apply it to an environmental dataset.
Timeline
We’re going to build up to the final project gradually over the whole quarter. These assignments are ungraded. Their purpose is to give you structure and get you instructor feedback early and often.
Week | Assignment |
---|---|
2 | Choose data and question |
4 | Exploratory analysis |
7 | Describe hypotheses with text and visuals |
8 | Describe model in statistical notation and fit model to simulated data |
9 | Fit model to real world data |
10 | First draft of blog post due Monday 12/1 |
Exam | Final draft of blog post due Thursday 12/11 |
Model and response options
The following models are suggestions for your final project, listed with the response characteristics they’re designed for. You may choose a model not on this list if you’re motivated to do so, but clear it with the instructors first. The models are grouped by complexity level so you can choose one that matches how far you want to stretch. You’re encouraged to discuss options with the instructors!
Complexity level | Model name | Response characteristics | Example |
---|---|---|---|
Familiar | Gamma | Positive continuous | Pollution concentrations |
Familiar | Negative binomial | Counts | Water quality violations |
Familiar | Beta | Proportions | Algae coverage on a reef |
More complex | Hurdle | Inflated zeroes | Occurrence of rare species |
More complex | Segmented | Breakpoint | Policy intervention |
More complex | Multinomial | Unordered categories | Land cover types |
Most complex | Ordinal | Ordered categories | Likert-scale surveys |
Most complex | State-space | Time series | Salmon returns |
Most complex | Spatial error | Spatially autocorrelated | Energy costs in counties |
Specifications
Question and data
- The blog post provides context and background for the question
- The data are explained using text and figures
- The hypothesized causal relationships among variables are described with a directed acyclic graph (DAG)
Statistical model
- The statistical model is explained conceptually and using formal statistical notation
- The blog post demonstrates how to simulate data according to model assumptions
- The blog post demonstrates that a model fit to the simulated data recovers the parameters
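To make the simulate-and-recover requirement concrete, here is a minimal sketch for one model from the table, a Gamma regression with a log link. It is written in Python using only the standard library, purely for illustration; the function names (`simulate_gamma`, `fit_gamma_log_link`), the single predictor `x`, and the parameter values are all hypothetical choices, not part of the assignment. In your own post you would more likely fit the model with a GLM routine from a statistics package in whatever language you use in class.

```python
import math
import random


def simulate_gamma(n, beta0, beta1, shape, seed=0):
    """Simulate n observations from a Gamma regression with a log link.

    Assumed model (illustrative): log(mu_i) = beta0 + beta1 * x_i and
    y_i ~ Gamma(shape, scale = mu_i / shape), so that E[y_i] = mu_i.
    """
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 2.0) for _ in range(n)]
    y = []
    for xi in x:
        mu = math.exp(beta0 + beta1 * xi)
        # random.gammavariate takes (shape, scale); mean = shape * scale = mu
        y.append(rng.gammavariate(shape, mu / shape))
    return x, y


def fit_gamma_log_link(x, y, n_iter=25):
    """Estimate (beta0, beta1, shape) for the model above.

    Coefficients come from iteratively reweighted least squares (IRLS).
    With a Gamma response and log link the working weights are all equal,
    so each iteration is an ordinary least-squares fit to the working
    response z. The shape is estimated from the Pearson residuals.
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1 = math.log(sum(y) / n), 0.0  # start from an intercept-only fit
    for _ in range(n_iter):
        z = []
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            z.append(eta + (yi - mu) / mu)  # working response
        z_bar = sum(z) / n
        b1 = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
        b0 = z_bar - b1 * x_bar
    # Pearson estimate of the dispersion, which equals 1 / shape
    dispersion = sum(
        ((yi - math.exp(b0 + b1 * xi)) / math.exp(b0 + b1 * xi)) ** 2
        for xi, yi in zip(x, y)
    ) / (n - 2)
    return b0, b1, 1.0 / dispersion


# Hypothetical "true" values chosen only for this demonstration
true_params = {"beta0": 0.5, "beta1": 0.8, "shape": 5.0}
x, y = simulate_gamma(5000, **true_params, seed=1)
b0_hat, b1_hat, shape_hat = fit_gamma_log_link(x, y)
print("true:     ", true_params)
print("estimated:", {"beta0": round(b0_hat, 3),
                     "beta1": round(b1_hat, 3),
                     "shape": round(shape_hat, 3)})
```

If the estimates land close to the values used to simulate the data, the fitting code and your understanding of the model assumptions agree. Repeating the exercise with a smaller sample size is a quick way to see how much the estimates vary.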
Inference
- Hypotheses are stated in plain language and with visualizations
- Model estimates are presented with appropriate uncertainty (e.g., confidence intervals)
- A hypothesis is tested and the evidence is interpreted
Professionalism
- The overall appearance of the blog post (e.g., figures, code outputs) is portfolio-quality
- The writing is comprehensible to a technical audience
- The code is well-organized and appropriately documented