Tuned Models of Peer Assessment in MOOCs. Chris Piech, Jonathan Huang (Stanford); Zhenghao Chen, Chuong Do, Andrew Ng, Daphne Koller (Coursera)
A variety of assessments: how can we efficiently grade 10,000 students?
The Assessment Spectrum: from multiple choice (short response) to coding assignments, proofs, and essay questions (long response). The short-response end is easy to automate but offers limited ability to ask expressive questions or require creativity; the long-response end is hard to grade automatically but lets us assign complex assignments and provide complex feedback.
Stanford/Coursera's HCI course: video lectures + embedded questions, weekly quizzes, open-ended assignments
Student work. Slide credit: Chinmay Kulkarni
Calibrated peer assessment [Russell '05; Kulkarni et al. '13]: 1) Calibration (✓ staff-graded) 2) Assess 5 peers 3) Self-assess. A similar process is also used in Mathematical Thinking, Programming Python, Listening to World Music, Fantasy and Science Fiction, Sociology, Social Network Analysis, .... Slide credit: Chinmay Kulkarni (http://hci.stanford.edu/research/assess/). Image credit: Debbie Morrison (http://onlinelearninginsights.wordpress.com/)
Largest peer grading network to date • 77 “ground truth” submissions graded by everyone (staff included) • HCI 1, Homework #5
How well did peer grading do? Much room for improvement: up to 20% of students get a grade more than 10% away from ground truth (~1,400 students). (Histogram: fraction of grades within 5 pp and within 10 pp of ground truth.)
Peer Grading Desiderata: • Highly reliable/accurate assessment • Reduced workload for both students and course staff • Scalability (to, say, tens of thousands of students). Our work: • A statistical model for estimating and correcting for grader reliability/bias • A simple method for reducing grader workload • A scalable estimation algorithm that easily handles MOOC-sized courses
How to decide if a grader is good: who should we trust? Idea: look at the other submissions graded by these graders; we need to reason over all submissions and peer grades jointly! (Figure: bipartite graph linking graders to the submissions they graded, with example scores such as 100%, 50%, 55%, 30%.)
Statistical Formalization: each submission has a true score; each student/grader has a bias (average grade inflation/deflation) and a reliability (grading variance). Every observed score depends on the submission's true score and on the grader's bias and reliability.
Model PG1: modeling grader bias and reliability. Variables: true score of student u; grader bias of student v; grader reliability of student v; student v's assessment of student u (observed). Related models in the literature: crowdsourcing [Whitehill et al. ('09), Bachrach et al. ('12), Kamar et al. ('12)]; anthropology [Batchelder & Romney ('88)]; peer assessment [Goldin & Ashley ('11), Goldin ('12)]
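As a sketch, the PG1 generative story can be written out as below; the hyperparameter symbols (mu_0, gamma_0, eta_0, alpha_0, beta_0) are illustrative placeholders rather than the exact priors used in the paper:

\begin{align*}
s_u &\sim \mathcal{N}(\mu_0,\ 1/\gamma_0) && \text{true score of student } u \\
b_v &\sim \mathcal{N}(0,\ 1/\eta_0) && \text{bias of grader } v \\
\tau_v &\sim \mathrm{Gamma}(\alpha_0,\ \beta_0) && \text{reliability of grader } v \\
z_{v \to u} \mid s_u, b_v, \tau_v &\sim \mathcal{N}(s_u + b_v,\ 1/\tau_v) && \text{observed grade of } u \text{ by } v
\end{align*}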
Correlating bias variables across assignments: biases estimated from assignment T correlate with biases at assignment T+1. (Scatter plot: bias on Assignment 4 vs. bias on Assignment 5, both in points; best-fit line y = 0.48*x + 0.16.)
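As a minimal illustration of a fit like the one above (synthetic data; not the course data or the authors' analysis code), a least-squares line through per-grader bias estimates can be computed as follows:

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic per-grader bias estimates for two consecutive assignments;
    # in practice these would come from the fitted peer-grading model.
    bias_assn4 = rng.normal(0.0, 3.0, size=500)
    bias_assn5 = 0.5 * bias_assn4 + rng.normal(0.0, 2.0, size=500)

    slope, intercept = np.polyfit(bias_assn4, bias_assn5, deg=1)
    print(f"bias on assn 5 ≈ {slope:.2f} * bias on assn 4 + {intercept:+.2f}")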
Model PG2: temporal coherence. Grader bias at homework T depends on the bias at T-1. Variables as in PG1: true score of student u; grader reliability of student v; grader bias of student v; student v's assessment of student u (observed)
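One natural way to write the temporal coupling is sketched below; the exact form of the prior in the paper may differ, so treat this as illustrative:

\[
b_v^{(1)} \sim \mathcal{N}(0,\ 1/\eta_0), \qquad
b_v^{(t)} \sim \mathcal{N}\!\big(b_v^{(t-1)},\ 1/\eta_0\big) \quad \text{for } t > 1,
\]

with the true-score, reliability, and observation components unchanged from PG1.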
Model PG3: coupled grader score and reliability. Your reliability as a grader depends on your ability (your own true score). Variables: true score of student u; grader bias of student v; student v's assessment of student u (observed). Approximate inference: Gibbs sampling (we also implemented EM and variational methods for a subset of the models). Running time: ~5 minutes for HCI 1. ** PG3 cannot be Gibbs sampled in “closed form”
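For concreteness, here is a self-contained toy Gibbs sampler for the PG1-style model sketched earlier (PG3's score-reliability coupling breaks these closed-form conditionals, as noted above). This is not the authors' implementation; the function name gibbs_pg1 and all hyperparameter defaults are made up for illustration:

    import numpy as np

    def gibbs_pg1(grades, n_students, n_iters=500,
                  mu0=5.0, gamma0=1.0, eta0=1.0, alpha0=2.0, beta0=2.0, seed=0):
        """Toy Gibbs sampler for a PG1-style bias/reliability model.

        grades: iterable of (grader_v, gradee_u, observed_score) triples, with
        integer student ids in [0, n_students). Hyperparameter defaults are
        illustrative. Returns posterior-mean estimates of true scores, biases,
        and reliabilities.
        """
        rng = np.random.default_rng(seed)
        s = np.full(n_students, float(mu0))   # true scores
        b = np.zeros(n_students)              # grader biases
        tau = np.ones(n_students)             # grader reliabilities (precisions)
        s_avg = np.zeros(n_students)
        b_avg = np.zeros(n_students)
        tau_avg = np.zeros(n_students)

        graders_of, graded_by = {}, {}        # u -> [(v, z)],  v -> [(u, z)]
        for v, u, z in grades:
            graders_of.setdefault(u, []).append((v, z))
            graded_by.setdefault(v, []).append((u, z))

        burn_in = n_iters // 2
        for it in range(n_iters):
            # Resample each submission's true score given the grades it received.
            for u, obs in graders_of.items():
                prec = gamma0 + sum(tau[v] for v, _ in obs)
                mean = (gamma0 * mu0 + sum(tau[v] * (z - b[v]) for v, z in obs)) / prec
                s[u] = rng.normal(mean, 1.0 / np.sqrt(prec))
            # Resample each grader's bias and reliability given the grades they gave.
            for v, obs in graded_by.items():
                prec = eta0 + tau[v] * len(obs)
                mean = tau[v] * sum(z - s[u] for u, z in obs) / prec
                b[v] = rng.normal(mean, 1.0 / np.sqrt(prec))
                resid = sum((z - s[u] - b[v]) ** 2 for u, z in obs)
                tau[v] = rng.gamma(alpha0 + 0.5 * len(obs), 1.0 / (beta0 + 0.5 * resid))
            if it >= burn_in:                 # average the post-burn-in draws
                s_avg += s
                b_avg += b
                tau_avg += tau

        n_kept = n_iters - burn_in
        return s_avg / n_kept, b_avg / n_kept, tau_avg / n_kept

Each sweep resamples every true score given the grades it received, then every grader's bias and reliability given the grades they gave; the posterior means over the kept draws serve as the predicted scores.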
Incentives: scoring rules can impact student behavior. Model PG3 gives high-scoring graders more “sway” in computing a submission's final score, which improves prediction accuracy. Model PG3 also gives higher homework scores to students who are accurate graders, which encourages students to grade better. See [Dasgupta & Ghosh, '13] for a theoretical look at this problem.
Prediction Accuracy • 33% reduction in RMSE • Only 3% of submissions land farther than 10% from ground truth. (Plots: baseline (median of peer grades) prediction accuracy vs. Model PG3 prediction accuracy.)
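For reference, a minimal sketch of how those two metrics could be computed, assuming per-submission score arrays (the variable names below are hypothetical):

    import numpy as np

    def rmse(predicted, true):
        predicted, true = np.asarray(predicted, float), np.asarray(true, float)
        return np.sqrt(np.mean((predicted - true) ** 2))

    # baseline_scores: median of the peer grades each ground-truth submission received
    # model_scores:    Model PG3's predicted scores for the same submissions
    # true_scores:     staff-assigned ground-truth scores
    # print(rmse(baseline_scores, true_scores), rmse(model_scores, true_scores))
    # print(np.mean(np.abs(np.asarray(model_scores) - np.asarray(true_scores)) > 10))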
Prediction Accuracy, All Models (HCI 1 and HCI 2). An improved rubric made baseline grading in HCI 2 more accurate than in HCI 1; despite this, the simplest model (PG1 with just bias) outperforms baseline grading on all metrics. Just modeling bias (with constant reliability) captures ~95% of the improvement in RMSE, and PG3 typically outperforms the other models.
Meaningful Confidence Estimates: when our model is 90% confident that its prediction is within K% of the true grade, then over 90% of the time in our experiments we are indeed within K% (i.e., our model is conservative). We can use confidence estimates to tell when a submission needs to be seen by more graders! (Plot: experiments where confidence fell between 0.90 and 0.95.)
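A minimal sketch of how such confidence estimates could drive grader allocation, assuming posterior samples of each submission's true score are available (the function name and thresholds are hypothetical):

    import numpy as np

    def needs_more_graders(score_samples, tolerance_pp=10.0, target_conf=0.90):
        """Return True if a submission should be routed to an additional grader.

        score_samples: posterior draws of the submission's true score (e.g. the
        Gibbs samples of s_u). If less than target_conf of the posterior mass
        lies within tolerance_pp points of the posterior mean, ask for another
        grader.
        """
        samples = np.asarray(score_samples, dtype=float)
        center = samples.mean()
        confidence = np.mean(np.abs(samples - center) <= tolerance_pp)
        return confidence < target_conf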
How many graders do you need? Some submissions need more graders, and some grader assignments can be reallocated to them. Note: this is quite a conservative estimate (as on the previous slide).
Understanding graders in the context of the MOOC. Question: what factors influence how well a student will grade? Better-scoring graders grade better. (Plots: mean vs. standard deviation of grading error, highlighting the “hardest” and “easiest” submissions to grade.)
Residual given grader and gradee scores: the worst students tend to inflate the best submissions (grade inflation), while the best students tend to downgrade the worst submissions (grade deflation). (Heatmap: grader grade (z-score) vs. gradee grade (z-score); color shows the residual in standard deviations from the mean.)
How much time should you spend on grading? “sweet spot of grading”: ~ 20 minutes
What your peers say about you! (Panels: feedback on the best submissions vs. the worst submissions.)
Commenting styles in HCI: students have more to say about weaknesses than about strong points. On average, comments vary from neutral to positive, with few highly negative comments. (Plots: sentiment polarity vs. feedback length in words, and vs. feedback length residual (z-score).)
Student engagement and peer grading. Task: predict whether a student will complete the last homework. (ROC curves, true positive rate vs. false positive rate, for different feature sets: all features, just grade, just bias, just reliability; reported AUC = 0.976.)
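A minimal sketch of this kind of engagement-prediction task on synthetic data (the features, labels, and classifier below are made up for illustration; this is not the course data or the authors' pipeline):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    n = 1000
    # Hypothetical per-student features produced by the peer-grading model.
    grade = rng.normal(70, 10, n)          # homework grade
    bias = rng.normal(0, 3, n)             # estimated grader bias
    reliability = rng.gamma(2.0, 1.0, n)   # estimated grader reliability
    X = np.column_stack([grade, bias, reliability])
    # Synthetic label: whether the student completes the last homework.
    y = (grade + 5 * reliability + rng.normal(0, 10, n)) > 78

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print("AUC:", roc_auc_score(y, clf.predict_proba(X)[:, 1]))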
Takeaways • Peer grading is an easy and practical way to grade open-ended assignments at scale (real-world deployment: our system was used in HCI 3!) • Reasoning jointly over all submissions and accounting for bias/reliability can significantly improve current peer grading in MOOCs • Grading performance can tell us about other learning factors such as student engagement or performance
Gradient descent for linear regression ~40,000 submissions