Why did you choose, as your exploration/exploitation strategy, to decide at the beginning of a run whether to explore or to exploit, rather than deciding at each step, as is more usual in reinforcement learning experiments?

I felt it was most interesting to stay either in explore or in exploit for the whole problem. If in explore, then the problem was devoted exclusively to learning; if in exploit, then exclusively to testing what had been learned. This separation was consistent with the explore/exploit regime used in the multiplexer task. But the main reason was to have a "pure" separation between learning and testing.

It is true that many RL experiments use a "mixed" regime in which the explore/exploit decision is made on each step of a sequential task. This is advantageous, even necessary, as soon as a problem has more than a few steps; otherwise, it can take "forever" for the system to reach the goal.
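The two regimes under discussion can be contrasted in a short sketch. This is a minimal illustration on a toy bandit-style problem, not the actual experimental code; the function names, the learning rule, and the parameter values (`epsilon`, `alpha`) are my own assumptions.

```python
import random

def greedy(q):
    """Index of the arm with the highest estimated value."""
    return max(range(len(q)), key=lambda a: q[a])

def pure_regime(q, rewards, steps, explore, alpha=0.1):
    """Decide once per run: every step explores, or every step exploits.
    Learning happens only during explore runs, so exploit runs are a
    pure test of what has been learned."""
    total = 0.0
    for _ in range(steps):
        a = random.randrange(len(q)) if explore else greedy(q)
        r = rewards[a]()
        total += r
        if explore:
            q[a] += alpha * (r - q[a])  # update estimates only when exploring
    return total

def mixed_regime(q, rewards, steps, epsilon=0.1, alpha=0.1):
    """Decide at each step (the usual epsilon-greedy regime):
    explore with probability epsilon, otherwise exploit, and learn
    from every step."""
    total = 0.0
    for _ in range(steps):
        a = random.randrange(len(q)) if random.random() < epsilon else greedy(q)
        r = rewards[a]()
        total += r
        q[a] += alpha * (r - q[a])
    return total
```

In the pure regime, an explore run gathers experience and an exploit run measures performance with no learning mixed in; the mixed regime interleaves both on every step, which matters once a sequential problem has many steps before the goal.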