I’m acting as one of the ICER 2025 program chairs, and that has me thinking about reviewer workload in Computing Education Research. (This post is entirely my own personal views, and does not represent the views of the other organisers, ICER as a conference, SIGCSE, etc.)
When we think about reviewing, what we all want is, of course, to get our papers accepted; failing that, a review system that is expert, fair, consistent, and provides reasoning behind its decisions. As we all know, reviewers are not paid; there is a quid pro quo that academics who write papers must also review papers in order to keep the system balanced. Since not everyone writing papers (e.g. early PhD students) is qualified to review, academics must in general write more reviews than they receive each year. That is a lot of reviewing, so it is worth thinking about how much time it takes us all.
The status quo
I don’t teach, which I figure gives me a bit more time than other people, and I am bad at saying no, so last year I reviewed or meta-reviewed for the SIGCSE Technical Symposium, SIGCSE Virtual, Koli Calling, UKICER, ICER, TOCE and CSE. I’ve also reviewed in recent years for ITiCSE, WiPSCE and CEP. So I have reviewed for most of the Computing Education Research venues (as both reviewer and meta-reviewer), and most of the conferences have gradually converged on quite homogeneous review processes. Papers are reviewed using a review form with scores and text over a 3-4 week period, reviewers discuss for 1-2 weeks supervised by meta-reviewers, meta-reviewers summarise the discussion into a meta-review with a score and text, and then the program chairs use all this information to choose which papers to accept.
All of this takes time for the people involved. And as the process has grown more elaborate, the time required has also grown. All of the changes have been made with good intentions: a more consistent and higher quality set of reviews. But designing a review system reminds me of designing software: everyone proposes new features, each of which is small and makes sense on its own, but if you add them all then the whole design becomes overburdened and eventually collapses in on itself. In the case of reviewing systems, people will simply refuse to review, and the system will fail to operate. I think we are on the verge of that happening, so it is very important for publication venues to consider the workload they are placing on their volunteers. I have some thoughts about reducing reviewer workload, some of which we’re implementing at ICER, and some of which might be considered by other venues.
The review form
There are two attitudes to review forms:
- Here’s a single text box, and a score/recommendation dropdown. Go for it.
- Here’s a tightly structured review form with many sections and scores to fill in.
The first has definite advantages in terms of workload. The second is better at ensuring that all the reviewers are evaluating against the same criteria. ICER is very much in the second category. I think the review form can be seen as the central part of a publication venue: the criteria tell you what is valued and what kinds of paper are accepted. For this reason I have never understood why conferences don’t post it publicly; this year we’ve posted the ICER review form on the website.
The ICER review form has gradually been evolving in ways people might not have noticed. A few years ago reviewers (myself included!) found it frustrating to fill in the mandatory “Was theory used?” section for papers where it was inappropriate. That has already been changed to say “… if appropriate”, and merged into the prior-work section. Sections have also been added about reproducibility.
We’ve continued to change the form this year, but with an eye to shortening it. With some combining and rearranging we’ve removed two whole sections. We have also tried to place more emphasis on a paper’s contribution and on whether the paper’s claims align with its execution. Our feeling was that ICER had a slight bias towards papers that were executed well but did not make a major contribution, and against papers with a larger potential contribution but more caveats. We’ll see how it turns out.
The length limit
There are broadly two opinions on paper length limits: either specify one, or say “it should be reasonable according to the content”. TOCE are currently doing the latter, which I think is a reasonable choice for a journal. All conferences that I know of specify a limit. ICER is a bit of an outlier here, for two reasons:
- The limit was specified in words, not pages.
- It has a longer limit than everywhere else.
The idea of having a word limit rather than a page limit at ICER was a worthy experiment. It stops all kinds of busywork (shaving off those single trailing words at the ends of paragraphs, shrinking figures) but introduces other problems. Papers can be engorged with figures, which don’t count towards the limit (more on that below). And counting the words in a paper is surprisingly fiddly, causing a lot of stress for authors (and work for chairs!) to check it. ICER ended up with 1.5 pages of explanatory text about how to do a word count. So we’re fixing the word-count awkwardness (and the figure explosion) by going back to a page limit.
I think having a longer limit suits ICER. My vision for ICER is that it should be the flagship conference for research papers, and the kind of depth that ICER papers go into is suited to a longer limit. If you’re over 6 pages double-column excluding references, there are only a few venues to go to, and ICER is one of them. However, it’s clear that longer papers mean increased reviewer workload. How long should ICER papers be?
Some might say that even more detail leads to even higher quality papers, but first, there is surely a limit to that (should we all be submitting PhD theses as papers?), and second, it becomes too much for reviewers. When you are asked to review for a journal, you are asked to review one paper at a time. ICER asks for 4-6 reviews in a 3-4 week period. How can you expect quality reviews if the papers are all so long that reviewers barely have time to read them?
For example, I co-wrote a paper at ICER 2023 that ended up at 20 pages, double-column. That’s 29 pages in single-column: a journal-length paper. Author-me was happy with all the detail we could provide, and it was all within the rules. Chair-me knows that author-me needs to be stopped; we can’t be sending out so many papers of that length to reviewers.
So: it’s time to push back. We’re starting with an 18-page limit for single-column format (excluding references, but including everything else: figures, appendices, etc.). As a compromise for those who feel constrained by this, we are allowing an optional extra 3 pages, but you have to specifically argue for them in your submission form, and reviewers are allowed to ask for them to be removed. I’d like to reduce these limits further, but this year is at least a start in reversing direction on paper length.
Discussion periods
Personally, I am not sure that discussion periods are very helpful most of the time. Here are the outcomes that usually seem to happen:
- Reviewers change their scores immediately to match the other reviews. Maybe they are junior or lack confidence; maybe it’s just the general human instinct not to cause conflict or be the outlier. I am often reviewer 1 because I try to get reviews out of the way early (even though in spirit I’m a reviewer 2), and I frequently see review 2 or 3 get entered, then re-entered 2-3 minutes later with a score closer to mine. Maybe they’re swayed by my persuasive argument (I doubt it!), but it happens so quickly that I think they are just adjusting their score. This defeats the purpose of the discussion period, which is to discuss why we were so far apart.
- No-one really changes their mind after discussion. Maybe I’m too stubborn, but in general: I read the paper, I give my opinion, and I stand by it. Occasionally I miss something (e.g. an inappropriate analysis) and will tweak my opinion. I don’t see other people changing their minds too often either. Usually disagreements are not factual; they are caused by people having different views on whether something is the right topic for the venue, or whether the flaws in the paper outweigh its contribution, and so on. These are things on which reasonable people may differ, and at that point it is the chairs or meta-reviewers who must decide; the reviewers discussing it is futile, and the discussion often grinds to a halt because of that.
- Review scores get changed (perhaps at the meta-reviewer’s insistence) to coalesce around one number, which creates an impression of consensus. This “consensus” usually amounts to a reviewer saying “Ok, I’ll live with this being rejected if that’s what everyone else thinks, so I’ll change my score”. That could be achieved in other ways besides a discussion period.
I suggest that the discussion period isn’t doing much to improve review quality, but does add workload for everyone involved. Maybe everyone else’s discussions operate differently, or my impression is wrong: as chair this year I’ll get to see them all, so I can find out whether it’s just me acting as a bad reviewer or meta-reviewer, and get a wider view. I also note that journals don’t have discussion periods (often they don’t even let reviewers see each other’s reviews), and yet they are not generally criticised for a lack of review quality.
Binning the discussion period entirely might be a bit too dramatic for ICER, but we’re at least going to remove the obviously unnecessary part: there’s no point discussing if everyone already agrees. We’re hoping to do away with the futile “Hi, meta-reviewer here, you all seem to be in rough agreement but let’s start a discussion to see if we’ve missed anything” request. Discussion periods will only be used where they are needed, for the meta-reviewer to explore disagreements before writing their summary.
Review timelines
Having acted as meta-reviewer, editor, and program chair, I have seen the timelines on which reviews come in. In my experience, there are two types of reviewer: the one who returns their reviews in the first week, and the one who returns them in the last week before the deadline. I’m not passing judgement (both types do good work), just making an observation. If you give 3 weeks, hardly any reviews come in during the middle week. If you give 5 weeks, you won’t see many during the middle 3 weeks. My flippant-yet-kinda-serious suggestion is that we should give everyone 2 weeks for reviews. It wouldn’t fly, of course: everyone would be up in arms about the incredibly tight timeline. But my personal bet is that if that were just the way it was, it wouldn’t make a big difference to how most people organised their time for reviewing. We’re not implementing this one at ICER, though!
Also: if you do file reviews early within a long reviewing period, the discussion period becomes harder, because it can be 3-4 weeks since you read the paper (alongside 5+ others), so it’s more work to go back and look at it again to resolve any discussions.
Meta-reviewing
I’ve left my most extreme idea to last, and it’s one that may be better suited to smaller venues. Do you need meta-reviewers at all? I suggest that not every venue does. As I understand it, SIGCSE introduced them to try to increase consistency of reviewing. But SIGCSE is absolutely massive. If you are operating a regional conference that accepts 10 papers, do you need meta-reviewers?
One thing that I think is often overlooked is that meta-reviewers reduce your reviewer pool. If you ask all your most experienced volunteers to be meta-reviewers, your reviews will come from the less experienced volunteers, which might make it seem like the meta-reviewers are contributing by refining and improving those opinions… but what if you just had them acting as normal reviewers? Less workload all round, shorter timescales, and for a small conference the chairs can resolve any conflicting reviews themselves.
Conclusion
I think the workload placed on reviewers and meta-reviewers has become too high, and it’s time for program chairs everywhere to look at how to reduce it. We’ve tried to take some first steps at ICER: shortening the review form, shortening the papers, and removing unnecessary discussion. Other conferences might consider some or all of these, plus the more dramatic steps of doing away with discussion periods and/or meta-reviewers. These are my personal views, but I’m happy to hear yours in the comments, whether as an author, reviewer, or meta-reviewer.