Before GenAI models are deployed as EdTech tools, their pedagogical suitability should be validated. In this paper, we present ShAnEL-2, a novel multilingual dataset comprising 1,185 teacher-corrected student responses to short-answer language learning exercises. We use ShAnEL-2 to establish an initial benchmark of (1) "off-the-shelf" GenAI models and (2) retrieval-augmented generation (RAG) techniques for the automated correction of this exercise type. With an overall accuracy of 90% and recall of 95%, few-shot RAG (which adds previously corrected responses to the prompt) outperforms both the off-the-shelf baseline and the textbook RAG setup (which adds coursebook materials) by up to 7 percentage points in accuracy and 5 percentage points in recall. These results suggest that, for this task, LLMs benefit more from corrected examples than from additional contextual material, and they highlight GenAI's particular potential as a correction assistant for teachers.
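
To make the few-shot RAG idea concrete, the following is a minimal illustrative sketch, not the implementation evaluated in this paper. It retrieves the previously corrected responses most similar to a new response, prepends them to the prompt, and computes accuracy and recall from binary verdicts. All names (`CorrectedResponse`, `token_overlap`, `build_few_shot_prompt`, `accuracy_and_recall`), the prompt wording, the token-overlap retriever, and the choice of positive class for recall are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CorrectedResponse:
    """A previously teacher-corrected response (hypothetical schema)."""
    exercise: str
    answer: str
    is_correct: bool


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens.

    Assumption: a simple stand-in for whatever retriever a real system uses.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_few_shot_prompt(exercise, answer, pool, k=2):
    """Prepend the k most similar corrected responses to the new item."""
    query = f"{exercise} {answer}"
    ranked = sorted(
        pool,
        key=lambda ex: token_overlap(f"{ex.exercise} {ex.answer}", query),
        reverse=True,
    )
    lines = ["Decide whether each student answer is correct or incorrect.", ""]
    for ex in ranked[:k]:
        verdict = "correct" if ex.is_correct else "incorrect"
        lines += [
            f"Exercise: {ex.exercise}",
            f"Answer: {ex.answer}",
            f"Verdict: {verdict}",
            "",
        ]
    # The new item is left open for the model to complete.
    lines += [f"Exercise: {exercise}", f"Answer: {answer}", "Verdict:"]
    return "\n".join(lines)


def accuracy_and_recall(gold, predicted, positive=True):
    """Accuracy over all items and recall for the `positive` class.

    Assumption: which verdict counts as positive is a parameter here;
    this sketch does not claim to match the paper's definition.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    pairs = list(zip(gold, predicted))
    correct = sum(1 for g, p in pairs if g == p)
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return correct / len(pairs), recall
```

In this sketch, the returned prompt string would be sent to a model of choice, and the model's verdicts compared against the teacher corrections with `accuracy_and_recall`. A textbook RAG variant would differ only in what is retrieved: coursebook passages rather than corrected responses.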