This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, where the same attention mask is applied at every frame, with chunked masking, where the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can work in different configurations. Finally, we investigate how a single configurable model can be used to perform both first pass streaming recognition and second pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy vs latency trade-off compared to fixed masking, both with and without FastEmit. We also show that variable masking improves the accuracy by up to 8% relative in the acoustic re-scoring scenario.
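The chunked masking described in the abstract can be illustrated with a minimal sketch: each frame attends to all frames in its own chunk plus a fixed number of preceding chunks, and never to future chunks, which bounds streaming lookahead by the chunk size. The function and parameter names (`chunked_attention_mask`, `chunk_size`, `left_chunks`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def chunked_attention_mask(num_frames: int, chunk_size: int, left_chunks: int = 1) -> np.ndarray:
    """Boolean mask where entry (i, j) is True iff query frame i may attend to key frame j.

    Frames in the same chunk attend to each other and to `left_chunks`
    preceding chunks; attention never crosses into future chunks.
    (Illustrative sketch, not the paper's exact configuration.)
    """
    chunk_idx = np.arange(num_frames) // chunk_size
    q = chunk_idx[:, None]  # chunk index of each query frame
    k = chunk_idx[None, :]  # chunk index of each key frame
    return (k <= q) & (k >= q - left_chunks)

# Example: 6 frames, chunks of 2, one left-context chunk.
mask = chunked_attention_mask(6, chunk_size=2, left_chunks=1)
```

With `left_chunks` set large (or unbounded) the same construction recovers a full-context mask suitable for second pass rescoring, which is what makes a single model configurable across streaming and non-streaming modes.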