Tuned Models of Peer Assessment in MOOCs

Presentation Transcript


  1. Tuned Models of Peer Assessment in MOOCs. Chris Piech, Jonathan Huang (Stanford); Zhenghao Chen, Chuong Do, Andrew Ng, Daphne Koller (Coursera).

  2. A variety of assessments. How can we efficiently grade 10,000 students?

  3. The Assessment Spectrum. Multiple choice and coding assignments sit at the easy-to-automate end, with limited ability to ask expressive questions or require creativity. Proofs, essay questions, and short/long responses are hard to grade automatically, but let us assign complex assignments and provide complex feedback.

  4. Stanford/Coursera's HCI course: video lectures + embedded questions, weekly quizzes, open-ended assignments.

  5. Student work. Slide credit: Chinmay Kulkarni.

  6. Calibrated peer assessment [Russell '05; Kulkarni et al. '13]: 1) Calibration against staff-graded submissions, 2) Assess 5 peers, 3) Self-assess. A similar process is also used in Mathematical Thinking, Programming Python, Listening to World Music, Fantasy and Science Fiction, Sociology, Social Network Analysis, and more. Slide credit: Chinmay Kulkarni (http://hci.stanford.edu/research/assess/). Image credit: Debbie Morrison (http://onlinelearninginsights.wordpress.com/).

  7. Largest peer grading network to date. 77 "ground truth" submissions graded by everyone (staff included). HCI 1, Homework #5.

  8. How well did peer grading do? Across ~1400 students, up to 20% receive a grade that differs from ground truth by more than 10%; there is much room for improvement. [Histogram of peer-grade error, with bands marking errors within 5pp and within 10pp of ground truth.]

  9. Peer Grading Desiderata • A statistical model for estimating and correcting for grader reliability/bias • A simple method for reducing grader workload • A scalable estimation algorithm that easily handles MOOC-sized courses. Our work provides: • highly reliable/accurate assessment • reduced workload for both students and course staff • scalability (to, say, tens of thousands of students).

  10. How to decide if a grader is good: who should we trust? Idea: look at the other submissions graded by these graders. We need to reason over all submissions and peer grades jointly. [Illustration: a bipartite graph of graders and submissions, annotated with the percentage grades each grader assigned.]

  11. Statistical Formalization. Each submission has a true score. Each student, acting as a grader, has a bias (their average grade inflation/deflation) and a reliability (their grading variance). Every observed score is generated from the submission's true score together with the grader's bias and reliability.

  12. Model PG1: modeling grader bias and reliability. The variables are the true score of student u, the grader reliability of student v, the grader bias of student v, and student v's assessment of student u (observed). Related models in the literature: crowdsourcing [Whitehill et al. '09; Bachrach et al. '12; Kamar et al. '12], anthropology [Batchelder & Romney '88], and peer assessment [Goldin & Ashley '11; Goldin '12].
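
Reading the variables named above as a generative model, a minimal sketch of a PG1-style specification might look as follows; the notation and the hyperparameters (mu_0, gamma_0, eta_0, alpha_0, beta_0) are illustrative placeholders rather than values taken from the slides.

```latex
% Sketch of a PG1-style model; hyperparameters are assumed placeholders.
\begin{align*}
s_u &\sim \mathcal{N}(\mu_0,\ 1/\gamma_0)          && \text{true score of submission } u\\
b_v &\sim \mathcal{N}(0,\ 1/\eta_0)                && \text{bias of grader } v\\
\tau_v &\sim \mathrm{Gamma}(\alpha_0,\ \beta_0)    && \text{reliability of grader } v\\
z_u^{(v)} &\sim \mathcal{N}(s_u + b_v,\ 1/\tau_v)  && \text{grade observed from } v \text{ on } u
\end{align*}
```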

  13. Correlating bias variables across assignments. Biases estimated from assignment T correlate with biases at assignment T+1. [Scatter plot of per-grader bias on Assignment 4 (x-axis) vs. Assignment 5 (y-axis), both ranging from -20 to 20, with best-fit line y = 0.48x + 0.16.]

  14. Model PG2: temporal coherence. Grader bias at homework T depends on bias at homework T-1. As in PG1, the model has the true score of student u, the grader reliability of student v, the grader bias of student v, and student v's assessment of student u (observed).
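
In the notation of the sketch above, the temporal-coherence idea amounts to centering a grader's bias at homework T on their bias at homework T-1; the coupling precision eta_0 below is again an assumed placeholder.

```latex
% PG2-style temporal coupling of grader bias (sketch).
\begin{align*}
b_v^{(T)} &\sim \mathcal{N}\big(b_v^{(T-1)},\ 1/\eta_0\big) && \text{bias carried over from the previous homework}
\end{align*}
```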

  15. Model PG3: coupled grader score and reliability; your reliability as a grader depends on your ability (your own true score). As before, the model has the true score of student u, the grader bias of student v, and student v's assessment of student u (observed). Approximate inference: Gibbs sampling (EM and variational methods were also implemented for a subset of the models). Running time: ~5 minutes for HCI 1. ** PG3 cannot be Gibbs sampled in "closed form".
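
Since the slide names Gibbs sampling as the main inference route, here is a minimal sketch of one possible Gibbs sweep for the simpler PG1-style model sketched earlier (PG3's coupling of score and reliability is exactly what breaks the closed-form conditionals). The data format, hyperparameter values, and function names are assumptions for illustration, not the authors' implementation.

```python
# Gibbs-sampling sketch for a PG1-style peer-grading model (assumed
# hyperparameters; grades on a 0-100 scale). Not the authors' code.
import numpy as np

def gibbs_pg1(grades, n_sub, n_grader, iters=500, burn=100, seed=0,
              mu0=75.0, gamma0=1/25.0, eta0=1/16.0, alpha0=2.0, beta0=10.0):
    """grades: iterable of (submission u, grader v, observed score z)."""
    rng = np.random.default_rng(seed)
    s = np.full(n_sub, mu0)                   # true scores
    b = np.zeros(n_grader)                    # grader biases
    tau = np.full(n_grader, alpha0 / beta0)   # grader reliabilities (precisions)

    by_sub = [[] for _ in range(n_sub)]
    by_grader = [[] for _ in range(n_grader)]
    for u, v, z in grades:
        by_sub[u].append((v, z))
        by_grader[v].append((u, z))

    s_samples = []
    for it in range(iters):
        # True scores: Normal-Normal conjugate update given debiased grades.
        for u in range(n_sub):
            prec = gamma0 + sum(tau[v] for v, _ in by_sub[u])
            mean = (gamma0 * mu0 + sum(tau[v] * (z - b[v]) for v, z in by_sub[u])) / prec
            s[u] = rng.normal(mean, 1.0 / np.sqrt(prec))
        # Grader biases: Normal-Normal conjugate update given residuals.
        for v in range(n_grader):
            obs = by_grader[v]
            prec = eta0 + tau[v] * len(obs)
            mean = tau[v] * sum(z - s[u] for u, z in obs) / prec
            b[v] = rng.normal(mean, 1.0 / np.sqrt(prec))
        # Grader reliabilities: Gamma conjugate update on squared residuals.
        for v in range(n_grader):
            obs = by_grader[v]
            shape = alpha0 + 0.5 * len(obs)
            rate = beta0 + 0.5 * sum((z - s[u] - b[v]) ** 2 for u, z in obs)
            tau[v] = rng.gamma(shape, 1.0 / rate)   # numpy uses scale = 1/rate
        if it >= burn:
            s_samples.append(s.copy())

    return np.asarray(s_samples)   # posterior samples of each true score
```

Averaging the returned samples per submission gives a refined grade estimate, and their spread feeds the confidence estimates discussed a few slides later.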

  16. Incentives: scoring rules can impact student behavior. Model PG3 gives high-scoring graders more "sway" in computing a submission's final score, which improves prediction accuracy. It also gives higher homework scores to students who are accurate graders, which encourages students to grade better. See [Dasgupta & Ghosh '13] for a theoretical look at this problem.

  17. Prediction Accuracy: a 33% reduction in RMSE, and only 3% of submissions land farther than 10% from ground truth. [Histograms comparing baseline (median) prediction accuracy with model PG3 prediction accuracy.]
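
The two headline numbers correspond to straightforward metrics over the staff-graded "ground truth" set; a minimal sketch, where the variable names and the 0-100 grade scale are assumptions:

```python
# Sketch of the accuracy metrics quoted on this slide: RMSE against the
# staff-graded "ground truth" set, and the fraction of submissions whose
# predicted grade falls more than `tolerance` points away from it.
import numpy as np

def grading_metrics(predicted, ground_truth, tolerance=10.0):
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    rmse = np.sqrt(np.mean((predicted - ground_truth) ** 2))
    frac_outside = np.mean(np.abs(predicted - ground_truth) > tolerance)
    return rmse, frac_outside
```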

  18. Prediction Accuracy, all models (HCI 1 and HCI 2). Despite an improved rubric in HCI 2, the simplest model (PG1 with just bias) outperforms baseline grading on all metrics. Just modeling bias (with constant reliability) captures ~95% of the improvement in RMSE. The improved rubric made baseline grading in HCI 2 more accurate than in HCI 1. PG3 typically outperforms the other models.

  19. Meaningful Confidence Estimates. When our model is 90% confident that its prediction is within K% of the true grade, then over 90% of the time in our experiments we are indeed within K% (i.e., our model is conservative). We can use confidence estimates to tell when a submission needs to be seen by more graders. [Plot of experiments where confidence fell between 0.90 and 0.95.]
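
This calibration property can be checked directly from posterior samples; a sketch, assuming `score_samples` is the (num_samples x num_submissions) array returned by the sampler sketched earlier and `staff_grades` holds the ground-truth scores:

```python
# Coverage check (sketch): how often does the staff grade fall inside the
# model's central credible interval at the stated confidence level?
import numpy as np

def credible_interval_coverage(score_samples, staff_grades, level=0.90):
    lo = np.percentile(score_samples, 100 * (1 - level) / 2, axis=0)
    hi = np.percentile(score_samples, 100 * (1 + level) / 2, axis=0)
    covered = (np.asarray(staff_grades) >= lo) & (np.asarray(staff_grades) <= hi)
    return covered.mean()   # >= `level` indicates a conservative model
```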

  20. How many graders do you need? Some submissions need more graders! Some grader assignments can be reallocated! Note: This is quite an overconservative estimate (as in the last slide)

  21. Understanding graders in the context of the MOOC. Question: what factors influence how well a student will grade? Better-scoring graders grade better. [Scatter plots of the mean vs. standard deviation of the grades each submission received, highlighting the "hardest" and "easiest" submissions to grade.]
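
The "hardest" and "easiest" submissions called out in the plot are presumably those whose peer grades are most and least spread out; a minimal sketch of that per-submission summary, assuming a tidy table of peer grades with illustrative column names:

```python
# Per-submission mean and spread of received peer grades (sketch).
import pandas as pd

def submission_grade_spread(peer_grades: pd.DataFrame) -> pd.DataFrame:
    # peer_grades has columns: submission_id, grader_id, score
    return (peer_grades.groupby("submission_id")["score"]
            .agg(mean="mean", std="std")
            .sort_values("std", ascending=False))   # hardest to grade first
```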

  22. Residual given grader and gradee scores. The worst students tend to inflate the best submissions (grade inflation), while the best students tend to downgrade the worst submissions (grade deflation). [Heatmap of grading residuals, in standard deviations from the mean, as a function of grader grade (z-score) and gradee grade (z-score).]

  23. How much time should you spend on grading? “sweet spot of grading”: ~ 20 minutes

  24. What your peers say about you! [Sample peer feedback comments for the best and worst submissions.]

  25. Commenting styles in HCI. Students have more to say about weaknesses than about strong points. On average, comments vary from neutral to positive, with few highly negative comments. [Plots of sentiment polarity against feedback length (words) and against feedback-length residual (z-score).]

  26. Student engagement and peer grading. Task: predict whether a student will complete the last homework. [ROC curves (true positive rate vs. false positive rate) for feature sets: all features, just grade, just bias, just reliability; AUC = 0.97605.]
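
A sketch of how such an engagement predictor could be set up: logistic regression over per-student grading features (grade, estimated bias, estimated reliability), evaluated by ROC AUC. The feature construction, split, and classifier choice are assumptions, not the authors' exact pipeline.

```python
# Engagement prediction sketch: does a student finish the last homework?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def completion_auc(features, completed_last_hw):
    # features: (n_students, n_features), e.g. [grade, bias, reliability]
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, completed_last_hw, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```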

  27. Takeaways. Peer grading is an easy and practical way to grade open-ended assignments at scale; real-world deployment: our system was used in HCI 3! Reasoning jointly over all submissions and accounting for bias/reliability can significantly improve current peer grading in MOOCs. Grading performance can tell us about other learning factors such as student engagement or performance.

  28. The End

  29. Gradient descent for linear regression ~40,000 submissions
