
Language Engineering for Human-Computer Collaborative Assessment








  1. Language Engineering for Human-Computer Collaborative Assessment Mary McGee Wood, John Sargeant, Phil Reed, Craig Jones, School of Computer Science

  2. The Assess by Computer (ABC) project • Tools for setting, taking, and marking exams and for admin tasks • Internally funded by the University of Manchester • In use for diagnostic, formative, and “high stakes” summative tests, locally and remotely • HCCA philosophy throughout • Started as a pragmatic development; gradually turning into a research project.

  3. The problem • “Every hour of marking is an hour less of life” • But: we mostly want students’ answers to be constructions, not selections… • … and accurate autonomous marking of constructed answers (for content) is infeasible. • And… we also need to improve the quality and accountability of assessment.

  4. Current systems • Commercially available tools (e.g. QMP) have little or no support for constructed answers (and have other disadvantages) • Substantial work on Automated Essay Scoring in the USA, especially at the Educational Testing Service (ETS), Princeton • E-rater – “Essay rater” – concentrates on style and language use • C-rater – “Concept rater” – looks at the factual content of answers (85% computer-human agreement, against 92% human-human agreement)

  5. HCC – the key idea • Fully Automatic High Quality Machine Translation (FAHQMT) was never realistic • FAHQM Anything is probably neither possible nor reasonable • Aim to exploit the complementary strengths of the system and the user

  6. HCC assessment • Assessment is a collaborative process where human and program each do what they are good at. • Answer Representation (AR) grows dynamically during the marking process. • Aim is to improve both speed and quality of assessment.

  7. HCC software development • Software development can be a collaborative process where developers and users each do what they are good at. • System functionality grows dynamically during the use-and-development process. • Initial aim is to optimise the suitability and habitability of the system. • Real aim is to improve both speed and quality of carrying out the task in hand.

  8. A marking tool

  9. Answer types • Multiple choice - useful where appropriate. • Text – single word to essay. Most common type, can include structured text, e.g. programs, simple maths. • Slots/fill-in-the-blanks. • Simple diagrams (experimental). • Formatted maths – next phase. Can be used in any combination, structured using composite questions.

  10. Text answer types • Traditionally: “short answers” vs. “essays” • Maybe better: “factual” vs. “discursive”… • … or “objective” vs. “subjective” • Hypothesis: objective answers can usefully be semi-automatically marked using simple statistical clustering and matching techniques, while subjective answers require some amount of “natural language understanding”.

  11. What students really say • Spelling mistakes • Word variants • Context-dependent synonyms • Original answers

  12. Spelling mistakes • interpretor, interperetor, interaper (not sure about spelling), … • hierarchial, hierachical, hirarachical, … • defieciency, deficency, defiency, defficiency, definciency, dificiency, defciency, defficiency, dfficiency, … • But beware near-misses that are real words: modal / model, casual / causal
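The fuzzy matching needed to catch variants like these can be as simple as Levenshtein edit distance. A minimal sketch (illustrative only, not the ABC code; the example words come from the slide above) shows both its power and the pitfall just flagged:

```python
# Levenshtein edit distance via dynamic programming; a sketch, not ABC code.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Misspellings land close to the intended word...
print(edit_distance("defieciency", "deficiency"))   # 1
print(edit_distance("interpretor", "interpreter"))  # 1
# ...but so do genuinely different words, which is why a human stays in the loop:
print(edit_distance("modal", "model"))    # 1
print(edit_distance("casual", "causal"))  # 2
```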

  13. Word variants • “rhesus positive”: 281 students produced 52 forms, with six parameters of variation: • Upper / lower case: rhesus / Rhesus / RHESUS • Hyphenation: RH-positive / Rh positive • Spacing: Rh +ve / RH+ve • Parentheses: +ve / (+ve) • "D": Rh positive / Rh D positive • "positive": positive/ pos./ pos / + / +ve / +ive
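Variation like this can be collapsed by pre-processing before any matching or clustering. Below is a hypothetical normaliser for this one answer, covering the six variation parameters above; the function name and its rules are illustrative, not the project's actual pre-processing:

```python
import re

def normalize_rh(answer: str) -> str:
    """Collapse surface variants of 'rhesus positive' to one canonical form.

    Hypothetical sketch covering the slide's six variation parameters.
    """
    s = answer.lower()                                  # upper/lower case
    s = s.replace("(", " ").replace(")", " ")           # parentheses: (+ve)
    s = re.sub(r"\brh(esus)?\b[\s-]*", "rh ", s)        # hyphenation, spacing
    s = re.sub(r"\bd\b\s*", "", s)                      # optional "D"
    s = re.sub(r"\+\s*i?ve\b|\+|\bpos\b\.?", "positive", s)  # positive variants
    return " ".join(s.split())

for v in ["Rhesus positive", "RH-positive", "Rh +ve", "(+ve)",
          "Rh D positive", "pos.", "RH+ve"]:
    print(f"{v!r:20} -> {normalize_rh(v)!r}")
# every variant prints as 'rh positive' (or 'positive' where 'rh' is absent)
```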

  14. Context-dependent synonyms • Working memory, Rule memory, Inference Engine • Rule Memmory - which rules are avilable, Main Memory - the current state of the world , Interpretor - decides which rule fires • The knowledge, the rules that operate on the knowledge, and the Intepreter that links the these two.

  15. Original answers • “Give an original example of an exception to default inheritance.” • 9 penguins, 6 ostriches; 20 non-flying birds in total • 8 non-walking mammals, 30 other anomalous animals, 31 disabled animals • 5 plants, 28 artefacts

  16. Really original answers I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. I am a fish. In other words i dont know the answer, sorry, hope u can have a good laugh at my expense though!!!! :) p.s. If you havent seen red dwarf then you'll think im odd for the i am a fishj bit, but if you have seen it dont you think its cool!!!

  17. Simple tools can do a lot • General problem is hard • Simple tools can save significant marking time compared to paper exams • Can display all answers to one part-question together • Can order answers, e.g. by length, and highlight keywords (with optional fuzzy matching, sketched below) • …and you don’t have to read their handwriting!
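For the fuzzy keyword highlighting mentioned above, even the Python standard library suffices. A sketch, assuming a marker-supplied keyword list and **bold** markup for hits (both assumptions, not ABC's actual interface):

```python
import difflib

def highlight(answer: str, keywords: list[str], cutoff: float = 0.8) -> str:
    """Wrap (approximately) matching tokens in ** ** for display."""
    out = []
    for token in answer.split():
        word = token.strip(".,;:()").lower()
        # difflib does approximate matching, so common misspellings
        # such as "interpretor" still hit "interpreter"
        if difflib.get_close_matches(word, keywords, n=1, cutoff=cutoff):
            out.append(f"**{token}**")
        else:
            out.append(token)
    return " ".join(out)

print(highlight("working memory, rule memory, interpretor",
                ["working", "memory", "rule", "interpreter"]))
```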

  18. Clustering • Each answer (including the model answer) abstracted into a numerical form to enable measurement of similarity with other answers • Similarity of each answer with each other answer measured and stored in an answer-by-answer similarity matrix • Clustering algorithm applied to the matrix

  19. Abstraction • Vector Space Model • Vectors refined by: spelling correction, stoplist removal, stemming, and term weighting (a sketch follows)
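A sketch of this abstraction step using scikit-learn, with TF-IDF assumed as the term-weighting scheme (the slide does not name one); spelling correction and stemming would plug in via a custom preprocessor, omitted here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

answers = [
    "working memory, rule memory, interpreter",       # model answer first
    "working memory, rule memory, inference engine",
    "the knowledge, the rules, and the interpreter",
]
vectorizer = TfidfVectorizer(stop_words="english")    # stoplist removal
vectors = vectorizer.fit_transform(answers)           # term-weighted vectors
print(vectors.shape)                                  # (answers, vocabulary)
```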

  20. Similarity measurement • Cosine distance (standard) • Stored in an answer-by-answer similarity matrix • Generic: can handle many other question types, e.g. diagrams
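Continuing the sketch, cosine similarity over those vectors fills the answer-by-answer matrix (cosine distance is one minus this similarity):

```python
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(vectors)   # shape (n_answers, n_answers), symmetric
print(sim.round(2))                # sim[0, j]: similarity to the model answer
```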

  21. Clustering algorithm • Agglomerative Hierarchical Clustering • Number of clusters not known in advance • “Average Within Cluster Similarity” (AWCS) gives a clue to a cluster’s reliability as a basis for marking
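Completing the sketch with SciPy's average-linkage agglomerative clustering; cutting the tree at a distance threshold stands in for the AWCS criterion (a simplification: SciPy cuts on linkage distance, not AWCS directly), and no cluster count is fixed in advance:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

dist = 1.0 - sim                             # cosine distance matrix
condensed = squareform(dist, checks=False)   # condensed form SciPy expects
tree = linkage(condensed, method="average")  # agglomerative, average linkage
labels = fcluster(tree, t=0.5, criterion="distance")  # threshold is assumed
print(labels)                                # cluster id per answer
```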

  22. Example 1: Production systems • “What are the three components of a production system?” • 151 student answers
  Cluster 1 (50): working memory, rule memory, interpreter
  Cluster 2 (8): 1. working memory, 2. rule memory, 3. Interpreter
  Cluster 3 (6): working memory, rule memory, interpreter (inference engine)
  Cluster 4 (5): working memory, rule memory, interpretor
  Cluster 5 (3): working memory, rule memory, interpretter
  Cluster 6 (3): include the phrase “the three components”
  Cluster 7 (3): working memory, rule memory, inference engine
  …
  Outliers: 65

  23. Outliers • Unique mis-spellings: Rule memory, working memory and interperetor • Correct answers uniquely expressed: Working memory - contains state; Rule memory - contains rules; Interpreter - decides which to fire • Unique wrong answers: I am a fish. I am a fish. I am a fish.

  24. Example 2: Iron deficiency • “Name one deficiency which would give rise to a microcytic anaemia.” • 279 student answers • 17 clusters, 79 outliers
  Cluster 1 (132): iron deficiency and minor variants collapsed by pre-processing
  Cluster 2 (15): iron
  Cluster 3 (8): iron deficiency, diet
  Cluster 4 (7): iron deficiency anemia &c

  25. “What single measurement would you make to confirm that an individual is anaemic?” • 22 clusters, 85 outliers
  Cluster 1 (67): haemoglobin concentration
  Cluster 2 (42): red blood cell count
  Cluster 3 (15): packed cell volume
  Cluster 4 (13): minor variants on haemoglobin concentration in the blood &c

  26. Example 3: The frame problem • “What, in Artificial Intelligence, is the Frame Problem?” • 104 student answers • AWCS relaxed to 0.90, giving 12 clusters, 39 outliers
  Cluster 1 (19): real world, chang- (change, changes, changing, &c)
  Cluster 2 (14): world, chang-
  Cluster 3 (8): frame
  Cluster 4 (6): exceptions, inheritance
  Cluster 5 (4): chang-, repres- &c

  27. Benefits • Marking time reduced by a factor of 2–3 compared to paper scripts • Can include MCQs where appropriate – they’re not always bad • Answers genuinely anonymous • Consistency likely to improve • Clerical checking eliminated • Detailed analysis of results possible – good for “drilling down” • Lots of data generated for further research

  28. Collaboration: software development • Initial system did little more than replace paper exam books • Gradually extending real (diagnostic, formative, and summative) use • Gradually extending functionality • Priorities for development influenced by users and would-be users, e.g. current top priority is formatted maths… • …and by real student answers.

  29. Conclusions: assessment In decreasing order of confidence: • HCCA, even with very simple tools, is very effective, at least in some cases. • There are many issues of usability, procedures, education… • Simple keyword-based answers are easy for HCCA but hard for machines alone. • Discursive / subjective answers probably require a range of NLE techniques.

  30. Conclusions: NLE • Applications of NLE don’t have to be “all or nothing” … • … which is just as well, because even “simple” real data is complicated. • HCC gets the best from both machine and user… • … and means that very simple techniques can be Really Useful.
