Thanks for being pedantic and making me feel rather ruthless and non-conformist. I like your solution, except that it means I would need a more complex XPath to collect, for instance, all the stage directions for analysis, because some of the stage direction text would end up in the special speaker element.
Nadia Revenga, who knows some of these plays much better than I do, told me she noticed that in some (older) editions, the speaker information is conveyed through this combination of stage direction and speaker name, as in "Sale Flora". In later editions of the same text, she noticed that the speaker name appears separately from the stage direction (and thus as a repetition). Whether the later editors considered the earlier practice incorrect, wanted to be more consistent, had fewer constraints regarding space, or had some other reason for acting this way is not entirely clear.
Now, a critical edition of the play(s) would certainly have to take all of that into account. The more general point, however, is that people use the TEI with very different purposes in mind. My primary goal right now is to have a text whose markup is a) valid TEI and b) as easy to process for quantitative analysis as possible. I need to be able to remember the structure of my encoding easily, with as few special cases as possible, and with XPaths that are as simple as possible. There may be a trade-off with regard to editorial rigour. Of course, strict TEI conformance does make it easier for me (and others) to quickly grasp the structure and meaning of my encoding.
With this in mind, the best solution may be to encode as follows:
<l>Ya los dos estan aqui.</l>
No text that is not actually there (at least, not in my reference edition); no stage direction text in speaker elements; consistent and authoritative speaker information to be taken from @who on "sp". The only minor downside is that there will be a few "sp" without a "speaker" element in them, so the structure of the files is not entirely uniform.
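To illustrate how simple the processing stays under this scheme, here is a minimal Python sketch using only the standard library. It rests on several assumptions that are not stated above: that the stage direction "Sale Flora." sits in its own "stage" element inside the "sp", that the speaker id is "#flora", and that the file uses the usual TEI namespace. The second speech, its id "#other" and its placeholder text are purely hypothetical, added only to contrast a "sp" that does have a "speaker" element.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed here, since the post does not show the file header.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: only the verse line and "Sale Flora" come from the post.
# The wrapper elements, the ids and the second speech are illustrative assumptions.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="#flora">
    <stage>Sale Flora.</stage>
    <l>Ya los dos estan aqui.</l>
  </sp>
  <sp who="#other">
    <speaker>OTHER.</speaker>
    <l>(placeholder line)</l>
  </sp>
</body>"""


def stage_directions(root):
    """Collect the text of every stage direction with a single simple path."""
    return ["".join(s.itertext()).strip()
            for s in root.iterfind(".//tei:stage", TEI_NS)]


def speakers(root):
    """Read speaker information from @who on sp, never from speaker text."""
    return [sp.get("who") for sp in root.iterfind(".//tei:sp", TEI_NS)]


def sp_without_speaker(root):
    """List the @who of speeches lacking a speaker child (the non-uniform cases)."""
    return [sp.get("who")
            for sp in root.iterfind(".//tei:sp", TEI_NS)
            if sp.find("tei:speaker", TEI_NS) is None]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(stage_directions(root))
    print(speakers(root))
    print(sp_without_speaker(root))
```

In this sketch the stage directions are reachable with one path (the equivalent of //stage), the speakers come from @who alone, and the speeches without a "speaker" element can be listed explicitly, so the lack of uniformity at least remains easy to audit.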