Putka testscript documentation

How testing works

Each user submission is evaluated independently of all other submissions. The result of the evaluation is a test report consisting of an optional preparation step (typically the compilation of the user’s source code) and one or more test cases (or tests for short).

The evaluation process is defined entirely by the test script written for the task. The script must be written in Python and should use the automatically imported modules jail and putka to run user programs in a safe manner and to log the test report.


You may write to stdout and/or stderr freely. All output will be shown to admins in the UI even if the script crashes.

Module(s) overview

Although there are quite a few functions available, you will rarely need anything but putka.testAllOutputs() and perhaps putka.diff(). If those are not enough, look first into putka.submission, jail.limits, jail.run(), putka.addPrepResult() and putka.addTestResult().

The putka module contains functions for executing predefined test suites and for interacting with the rest of putka, e.g. reporting test results. The jail module contains functions and attributes for monitoring and restricted/safe execution of user programs.

Jailed execution

Programs, unless completely trusted (including with respect to crashes, infinite loops, etc.), should be run in the jail provided by the jail module. Use jail.run(). The jail monitors and restricts programs; how exactly it does that is determined by the active jail profile. A profile includes

  • limits (time, memory, …). See jail.limits.
  • system call handlers – for restricting access to certain files, network, forking etc. Some handler sets are predefined, see jail.loadProfile(). If this is not enough, see the source code. Syscall handlers are nasty.

You may set the whole profile at once (see functions with “profile” in the name) or (more commonly) just tweak the jail.limits part.
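
For example, a testscript that keeps the active profile but adjusts only the limits might contain (a sketch; the attribute names are those of jail.limits documented below, the values are arbitrary):

```python
# Tweak only the limits; the rest of the active jail profile stays as-is.
jail.limits.time = 5.0    # CPU time, in tuba-seconds
jail.limits.memory = 64   # memory limit in MB (note: an int)
jail.limits.tasks = 2     # allow one extra thread
```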

Common data types and structures

Test results are encoded in a few data structures. Because each is used by at least two functions (the one that produces that part of information and the one that logs the test result) they are presented here. Note however that you only need to know about them when writing custom grading functions or testcase aggregators.

  • When a program is executed in the jailed environment, run statistics are recorded in a dict with the following keys:

    • status: the status of the test run. See the second bullet point.
    • time: CPU time used, in tuba-seconds; a float.
    • realtime: real time used, in tuba-seconds; a float.
    • memory: peak memory usage, in megabytes; a float.
    • tasks: peak number of threads.
    • stdout: everything the program output to stdout (for debugging and manual grading).
    • stderr: everything the program output to stderr (for debugging).
    • exitMode: program’s exit mode (e.g. killed by signal, regular exit, …).
    • exitCode: program’s exit code (e.g. signal number, exit code, …).

    All the keys except status are optional because the test does not necessarily involve running a program.

  • Each test finishes with a status. Possible values are OK, time limit exceeded, runtime error etc. The status is represented with a value from the putka.status enum, e.g. putka.status.time_limit. Check the enum for a full list.

  • Each test result can also include a snippet of the user output and the official output. Each of the snippets is represented with a tuple (L, s) where L is an integer and s is a string containing lines L, L + 1, L + 2, … of the corresponding file stream.
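
For illustration, a snippet covering lines 3 and 4 of a stream could be built like this (a sketch; whether putka counts lines from 0 or from 1 is an assumption here, so check before relying on it):

```python
def make_snippet(stream_text, first_line, n_lines):
    # Build an (L, s) snippet: s contains lines L, L + 1, ... of the stream.
    # Assumes 1-based line numbers.
    lines = stream_text.splitlines(True)  # keep the newline characters
    return (first_line, "".join(lines[first_line - 1 : first_line - 1 + n_lines]))

snippet = make_snippet("a\nb\nc\nd\ne\n", 3, 2)  # (3, "c\nd\n")
```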

Aggregating test results

Once all the testcases have executed, their points and statuses need to be aggregated. This is usually done by summing, but if you want a different mechanism, set the putka.aggregator variable to a different aggregation function. Choose one from putka.agg or roll your own.
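
To get a feel for what such a function looks like, here is a pure-Python sketch that mimics the documented behaviour of putka.agg.sum (plain strings stand in for putka.status values; the real implementation may differ):

```python
def my_sum(testcases):
    # testcases: a list of (user_points, max_points, status, name) tuples
    user_pts = sum(tc[0] for tc in testcases)
    max_pts = sum(tc[1] for tc in testcases)
    statuses = [tc[2] for tc in testcases]
    # Errors take precedence over wrong_answer, which takes precedence over ok.
    errors = [s for s in statuses if s not in ("ok", "wrong_answer")]
    if errors:
        # The most frequent error wins; ties go to the earliest testcase.
        status = max(set(errors),
                     key=lambda s: (errors.count(s), -statuses.index(s)))
    elif "wrong_answer" in statuses:
        status = "wrong_answer"
    else:
        status = "ok"
    return (user_pts, max_pts, status)

# putka.aggregator = my_sum  # how a custom aggregator would be hooked up
```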


The classic test

If you just want the traditional thing – compile the source, run the result against each of the input files, compare the output with the official output file, and award points only for a perfect match – then your test script need only contain a single line:

putka.testAllOutputs()

Approximate floating number matching

Let’s say we again want the “classic” test, but wherever the output contains floats, we only require them to be accurate up to 0.001. For such a custom evaluation, we must provide our own scoring function to putka.testAllOutputs(). Luckily, putka.diff() supports approximate float matching and additionally behaves almost exactly like the scoring function we need to provide. So the custom scoring function is simple – we just wrap diff slightly:

def gradingFunc(userIn, userOut, officialOut):
        return putka.diff(userOut, officialOut, floatEps=0.001)

Better yet, do it with a lambda expression in a single line :)

putka.testAllOutputs(grader=lambda userIn, userOut, officialOut: putka.diff(userOut, officialOut, floatEps=0.001))

Completely custom scoring function

Assume we are writing the test script for the following task: you are given a 100-by-100 maze represented with characters # (wall) and . (free). Output the steps of a walk from (0,0) to (99,99), one step per line, each step described with a space-separated pair of coordinates. You get 10*[length of shortest walk]/[length of your walk] points.

Again, we use putka.testAllOutputs(). In the “official output” files, we store the length of the shortest walk rather than the whole walk:

def gradingFunc(userIn, userOut, officialOut):
        # Parse the maze
        maze = userIn.splitlines()
        # Parse the length of the optimal walk
        optimalLen = int(officialOut)
        try:
                # Parse the user's walk
                walk = [map(int, line.split()) for line in userOut.strip().splitlines()]
                assert walk[0] == [0, 0]
                assert walk[-1] == [99, 99]
                for ((x, y), (nextx, nexty)) in zip(walk, walk[1:]):
                        assert 0 <= x < 100
                        assert 0 <= y < 100
                        assert maze[y][x] == '.'
                        assert (x == nextx and abs(y - nexty) == 1) or (y == nexty and abs(x - nextx) == 1)
                # Scores are relative (0 to 1); putka.testAllOutputs() scales them to points per test case
                return float(optimalLen) / len(walk)
        except (AssertionError, ValueError):
                # We don't return the output snippets because they make no sense for this task
                return 0

Interactive programs

All testing of interactive programs is done via standard input/output. You have to provide a controller program separately from the script. Its stdin will become the user program’s stdout and vice versa.

Before looking at the example below, read the documentation for putka.testInteractive(), which is by far the easiest way of connecting your controller and the user’s program.

Now let us have a look at the canonical interactive task, the number guessing game. The user has to guess a hidden integer between 0 and 2**30. Guesses are made by writing to stdout; the response is read from stdin, onto which the controller program puts it in real time. The response is, in our example, a single word: MORE, LESS, or BRAVO, depending on the guess. Here is the external controller (which you have to supply as an attachment to the testscript). It expects to receive, as a command-line argument, the path to a file which contains the target number:

[[ controller.py ]]
import sys

MAX_POINTS = 20  # assume 5 test cases, 20 points each, regardless of the number of guesses

def finish(feedback, score):
        "Write the testcase result into the predefined '_result' file and terminate."
        f = open("_result", "w")
        f.write("%s\n%d %d" % (feedback, score, MAX_POINTS))
        f.close()
        sys.exit(0)

# get the target number from the input file
goal = int(open(sys.argv[1]).read())

for attempt in range(31):
        # read the user's guess
        try:
                guess = int(raw_input())
        except (ValueError, EOFError):
                finish("malformed input", 0)

        # write feedback; flush it so the user program sees it immediately
        if goal > guess:
                print 'MORE'
        elif goal < guess:
                print 'LESS'
        else:
                print 'BRAVO'
                finish('', MAX_POINTS)
        sys.stdout.flush()

finish('too many attempts', 0)

The testscript itself is trivial; it merely says to run the controller against each of the (separately provided) input files:

[[ testscript ]]
putka.testInteractive("/usr/bin/python controller.py", putka.inOutPairs())

If putka.testInteractive() does not fit your design, make it fit. It is possible to set up the communication with the user program yourself, but handling all the combinations of the controller and userprog hanging up or misbehaving is relatively hairy.

Freeform exercises

Especially for beginners, it is sometimes desirable to give them exercises which require a freeform explanation of a principle or a procedural solution.

To support such exercises, we can abuse putka.aggregator slightly: create a task with no test cases, then set putka.aggregator = putka.agg.manual. No tests will be run and the task will be marked for admins as requiring manual inspection.
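
The complete testscript for such a freeform exercise is thus a single line:

```python
putka.aggregator = putka.agg.manual
```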

Reference: module members

All the members exposed to the testscript follow.

t_utils.jail.allowedFiles = None

A listing of files that the user program is allowed access to. This is a dict. The keys are file paths. The values are the allowed access modes for the files; possible values are "r" (read), "w" (write), "e" (execute) or any concatenation of these characters.


Close a file descriptor quadruple opened by openProgProxy. The input should have the same format as the output of jail.openProgProxy().

t_utils.jail.compile(srcFile=None, progFile=None, options='', lang=None, addPR=True, needOut=False, loadProf=True)

Compile source file srcFile. Compiler path and arguments are read from system config files.

  • srcFile – path to the source file to be compiled. If omitted, user’s source file is used.
  • progFile – path to the executable to be produced. This is not necessarily a binary file; e.g. for python, it is just the source file itself (compilation only checks the syntax). Use jail.run() with an appropriate lang setting to run files created by compile.
  • options – additional command-line parameters to be passed to the compiler.
  • lang – language of the source file.
  • addPR – should the results be logged (using putka.addPrepResult())?
  • needOut – if False, deletes the files with compiler stdout/stderr (_compile_out, _compile_err).
  • loadProf – load the default no-limits jail profile before compiling (and unload it straight after).
t_utils.jail.getstatus(mode, code, compile)

Converts a tester status code to the corresponding putka.status constant. Normally only needed internally.

  • mode – tester status code
  • code – tester status subcode
  • compile – bool; is this a compile job?
t_utils.jail.limits = {}

An object containing all the limits that apply to jailed execution. Contains the following read-write attributes:

  • time - seconds of CPU time (scaled to reflect execution time on a “tuba”). Type: float. Default: 2.0 tuba-seconds
  • realtime - the program is allowed to use realtime*time seconds of real time. Tester overhead is not included in the real-time measurement. Type: float. Default: 4.0
  • memory - memory limit in MB. Type: int(!). Default: 32 MB
  • filesize - maximum program output size in MB. Type: int. Default: 100 MB
  • tasks - maximum number of allowed threads/processes. Type: int. Default: 1

t_utils.jail.loadProfile(profileName)

Load the jail profile profileName. Two profiles are predefined: 'blank', which allows full access to the jail directory and imposes no time or memory limits, and 'blocked', which is loaded by default and prohibits everything, including all system calls.


t_utils.jail.openProgProxy()

Open a pair of file descriptor pairs ((serverIn, serverOut), (clientIn, clientOut)), used for linking processes together.


t_utils.jail.popProfile()

Load a jail profile from the internal stack (and remove it from the stack). See pushProfile.


t_utils.jail.pushProfile()

Save the current jail profile by pushing it onto an internal stack. See popProfile.

t_utils.jail.run(progPath, fileIn, fileOut, fileErr, options='', lang=None, bg=False, fatal=True, _compile=False)

Safely execute a program inside the jail and get the run statistics. To keep the grading scripts consistent, you should only use this if other options are not flexible enough. Consider using one of the higher-level functions.

  • progPath – path to the program to be run. Does not need to be a binary – see the lang parameter. Can be relative (but script’s working directory is not the jail root).
  • fileIn – the input file. Can be a path, a proper file object (not StringIO), None (denoting /dev/null) or a file descriptor (int)
  • fileOut – the file to which the program’s stdout is redirected
  • fileErr – the file to which the program’s stderr is redirected
  • options – command-line parameters for the binary to be executed (as a space-separated string)
  • lang – for non-binaries only. If progPath points to e.g. a python script, specify lang=’python’ and jail.run will run it with the right interpreter (platform-dependent). If progPath points to a binary executable, set lang=None.
  • bg – if True, execute OUTSIDE the jail, NON-blockingly
  • fatal – if True, kill the whole test script if this call fails (returns non-0)
  • _compile – is this a compile job? Should be false; if you need True, consider jail.compile().

Returns: for normal processes (bg=False), statistics of the run (time, memory used, …) in a dictionary – see Common data types and structures. For background processes, the pid of the process, or None if there was an error.


Save the current profile as profileName. Saved profiles are internal and are deleted once the test script finishes.

t_utils.putka.addPrepResult(status, stdout=None, stderr=None)

Add preparation (normally the compile stage) results to the test report.

t_utils.putka.addTestResult(userPoints, maxPoints, stats, userSnippet=None, officialSnippet=None, name=None)

Add the result of a single test to the test report. Interpretation: user was awarded userPoints out of maxPoints (both must be integers). User’s output was userSnippet but should be officialSnippet (both optional and for informational purposes only). stats gives the statistics of running the program (mem, time, …). name is the official input filename used for this testcase, or some other uniquely identifying string. See Common data types and structures for the structure of officialSnippet, userSnippet, stats.

class t_utils.putka.agg

A holder for testcase aggregation functions (candidates for putka.aggregator; see Aggregating test results). Each aggregator function

  • takes a list of tuples (user_points, max_points, status, name) describing individual test cases
  • returns a tuple (user_points, max_points, status) describing the submission as a whole. Optionally, the returned tuple may contain a fourth element, the string "MANUAL", to denote that the results are incomplete and should be checked/changed/finalized by a human.

static acm(testcases)

ACM-style. Gives full points if all test cases have full points, otherwise nothing.

static manual(testcases)

Like sum(), but adds the MANUAL element to the output tuple.

static subtasked(point_list, agg_subtask, agg_whole, subtask_points=<function <lambda>>, subtask_parse=<function <lambda>>)

Subtask-enabling aggregator middleware. It is a sort of decorator: it returns an aggregator, so assign the result of a call (putka.aggregator = putka.agg.subtasked(...)).

The aggregator will iterate over the in/out pairs and arrange them into sets based on the input filename (parsing can be customized). Each set will be aggregated with agg_subtask, producing a new set of per-subtask aggregated results, which are then aggregated into the final submission aggregate using agg_whole.

  • point_list – A list of integers indicating the max points for each subtask. Make sure it has as many elements as there are subtasks in the actual data.
  • agg_subtask – The aggregator used to process each subtask.
  • agg_whole – The aggregator used to process the aggregated subtask results into a final submission result.
  • subtask_points – Callable used to calculate subtask points based on the aggregate. Parameters are user_pt (user points as returned by agg_subtask for this subtask); max_pt (maximum points as returned by agg_subtask for this subtask); status (aggregated status for this subtask, as returned by agg_subtask); subtask_pt (the number of points given for this subtask in point_list). The default implementation returns subtask_pt if status == ok and the user got all points.
  • subtask_parse – Given the input filename, should return the integer subtask identifier. The default splits the filename by ‘.’ and returns the component at index -3.
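
For example, under the default subtask_parse, a file named maze.2.07.in belongs to subtask 2. A sketch of that default (whether the result is cast to int is an assumption):

```python
def default_subtask_parse(filename):
    # split by '.' and take the component at index -3
    return int(filename.split('.')[-3])
```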
static sum(testcases)

Classic aggregation: sums the points from all the testcases. Status aggregation is done as follows: errors take precedence over WA, which takes precedence over OK. If there are multiple different errors, the most frequent one is output; ties are broken in favor of the error appearing in the earliest testcase.

t_utils.putka.compareSource(officialSource, maxPoints=100)

Compares contents of file officialSource directly to the user-uploaded file. In case of a perfect match, award maxPoints points, zero otherwise.

t_utils.putka.diff(txt1, txt2, softWhitespace='eof_eoln', floatEps=None, contextSize=10)

Compares two strings, txt1 and txt2 (typically user output and correct output, respectively) and outputs lines in which the two strings differ. The comparison is “soft” as dictated by parameters.

  • softWhitespace

    specifies strictness of comparison regarding whitespace. Each relaxation listed below implicitly includes all those higher in the list. Possible values:

    • ’none’, meaning whitespace is important;
    • ’crlf’, meaning \r\n, \r and \n are treated as equivalent;
    • ’eof_eoln’, meaning an extra space or tab just before the newline
      is allowed. Also, the number of newlines at end of file does not matter (including zero newlines).
    • ’all’, meaning all sequences of whitespace characters
      (even within single lines) are treated as equivalent.
  • floatEps – treat floats which differ by less than floatEps as equivalent. NOTE: floatEps==0 still means that 30.0, 30 and 3.0e1 will be considered equivalent. Specify floatEps=None to enforce character-by-character comparison.
  • contextSize – how many lines of differing output to return. See description of the return value.

Returns: a dict with keys user_out, off_out and points. user_out and off_out are snippets from txt1 and txt2 respectively (represented by pairs as described in section Common data types and structures); each is contextSize lines long and chosen so that the first pair of lines in which txt1 and txt2 differ is centered in the snippets. points is 1 if there are no differences and 0 otherwise. If there are no differences, user_out and off_out have value None. Note that the inclusion of the otherwise redundant points key makes this function a very useful basis for a grader for putka.testAllOutputs(). See also putka.diffPEGrader().
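
The float relaxation can be pictured with a small sketch; this is not putka’s implementation, only the documented semantics of floatEps applied to a single pair of tokens:

```python
def tokens_equal(a, b, floatEps=None):
    # floatEps=None: strict character-by-character comparison
    if floatEps is None:
        return a == b
    # otherwise, tokens that both parse as floats within floatEps are equivalent
    try:
        return abs(float(a) - float(b)) <= floatEps
    except ValueError:
        return a == b
```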

t_utils.putka.diffPEGrader(userIn, userOut, officialOut, **kwargs)

A grader for use with putka.testAllOutputs(). Runs putka.diff() on user output and official output using **kwargs. If the comparison shows differences, tries to detect a presentation error; if one is found, the output dict will have "presentation_error"=True.

t_utils.putka.fail(msg, globalFail=False)

Terminate the script and generate a system error with the given message. If globalFail is true, removes this tester from the tester pool.

t_utils.putka.inOutPairs(inMask='(.*)\\.in', outMask='(.*)\\.out')

Return a list of all input and output files, paired where possible. With the (optional) parameters you can specify regexes that define what counts as an input file and what as an output file. Each regex must have at least one capturing group; filenames are paired when their first capturing groups match.

By default, input and output files are recognized by the extensions .in and .out and two files are matched into a pair if the whole filename but the extension matches.

Returns: A list of pairs of the form (path to input file, path to output file). Either path can be None if the other element has no match. The pairs are sorted by filename (or by the capturing group in the case of custom regexes).
Example: For files bla1.in, bla2.in, bla3.in, bla1.out, bla3.out, bla5.out,
the function returns [('bla1.in','bla1.out'), ('bla2.in',None), ('bla3.in','bla3.out'), (None,'bla5.out')]
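
The pairing rule can be sketched in pure Python (an illustration of the documented behaviour, not the actual implementation):

```python
import re

def in_out_pairs(filenames, inMask=r"(.*)\.in", outMask=r"(.*)\.out"):
    ins, outs = {}, {}
    for fn in filenames:
        m = re.fullmatch(inMask, fn)
        if m:
            ins[m.group(1)] = fn
        m = re.fullmatch(outMask, fn)
        if m:
            outs[m.group(1)] = fn
    # pair the files whose first capturing groups match, sorted by that group
    return [(ins.get(k), outs.get(k)) for k in sorted(set(ins) | set(outs))]
```
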
class t_utils.putka.status

An enum-like object enumerating all statuses with which a test case or the evaluation as a whole can end. Statuses:

  • ok
  • runtime_error - Exit/RunTime Error
  • time_limit - Time Limit Exceeded
  • memory_limit - Memory Limit Exceeded
  • output_limit - Output Size Limit Exceeded
  • thread_limit - Thread Count Limit Exceeded
  • syscall_limit - Illegal System Call
  • exit_error - Nonzero Exit Code
  • wrong_answer - Wrong Answer
  • presentation_error - Presentation Error
t_utils.putka.submission = Submission(source=None,lang=None)

An object containing information about the user’s submission. Contains the following attributes:
  • source - path to the submitted source file
  • lang - language code of the submitted program (a string, e.g. ‘py’)
t_utils.putka.testAllOutputs(grader=<function <lambda>>, maxPoints=100, fnPairs=None)

Compiles the program, then finds all input files (and the corresponding output files, if they exist). For each input file, runs the user’s program and grades its output with the provided grader function.

Input files and the corresponding output files are identified by names of the form something.in / something.out. If only one file of a pair is found, None is provided to grader instead of the other file – see putka.inOutPairs()

maxPoints points (default: 100) are distributed equally between all the test cases. If maxPoints is not divisible by the number of cases, later tests (in alphabetical order) get up to one point more.
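
The distribution rule can be sketched as follows (an illustration of the documented behaviour):

```python
def distribute_points(maxPoints, nTests):
    base, extra = divmod(maxPoints, nTests)
    # the last `extra` tests (in alphabetical order) get one point more
    return [base] * (nTests - extra) + [base + 1] * extra
```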


  • grader – A grading function. It must accept three parameters: input, userOutput, officialOutput. All three parameters are the full contents of the respective files. officialOutput may be None if no official output file is found. The grading function must return a float (see below) or a dict with the following keys:

  • points - a single number between 0 and 1, denoting the percentage of points achieved for this test case
  • user_out - a snippet of user’s output. See section “Common data types” for the format.
  • off_out - a snippet of the official output. See section “Common data types” for the format.
  • presentation_error - if True, the status of the testcase will be set to PE

The grading function can also return a single float; it will be interpreted as the points dict entry.

Note that you can use a thinly wrapped putka.diff() with parameters of choice as the grader.

t_utils.putka.testInteractive(controllerCmd, args)

Compiles user’s submission, then runs it for each element of args against an external interactive program (“controller program”).

For each test case: the controller program and user program’s stdin/stdout are cross-connected, i.e. controller’s stdout is user’s stdin and vice versa. Both programs are then run. The controller must create a file _result which contains, in this order:

  • 0 or more lines of feedback to the user (visible to the user in the GUI)
  • a line with two integers – the user’s score and the maximum possible score for this testcase.

Parameters:

  • controllerCmd – Path (+ optional command-line arguments) to the controller program. For example, "/usr/bin/python myController.py".
  • args – An iterable; each element El represents a test case in which the controller is run as <controllerCmd> <El>; i.e., El is used as additional command-line arguments. If El is an iterable, its elements are space-concatenated before being used as command-line arguments; otherwise, str(El) is used as-is. Examples of useful values for args: putka.inOutPairs(), range(10) and [''].

