On Wed, 2005-09-28 at 15:43, Wendell Piez wrote:
> Ah, this is the nub of it. I guess I just don't have the same
> experience of what people "naturally expect".
I can only speak for one field, of course. Perhaps I shouldn't expect
too much of XML :-)
> While this would be understandable, so would the counter be: "why is
> it messing with my whitespace without my say-so?
But that's precisely what it *is* doing.
> Why does it make a
> difference whether the schema's there or not?" issued in a similarly
> plaintive tone. So we simply disagree as to whether they hit the
> right line on this one.
Yep. I think the argument about the presence or absence of a schema is
misleading, though. It appears to assume that there would be no
difference in processing whether you use a schema/dtd or not, and that
has never been the case (nor IMHO was it ever intended to be: if it
was, that would be a very serious error).
> Also, in my experience such *attitudinal* adjustments as one would
> make when learning something one didn't design oneself can be helped
> by a positive spirit on the part of the instructor. "It might be nice
> if it would fix up your whitespace, but it doesn't, which can be kind
> of a pain, but that turns out to be surprisingly difficult to do in a
> way that satisfies everyone" etc.
That's exactly how I explain it to them :-)
> >Yes, that and the vast majority of other goodies. This particular
> >problem is small in the global scheme of things, but when dealing with
> >very large quantities of extremely dense markup in mixed content, it's a
> >pain in the ass.
> More reason for that Perl in your pipeline?
Ewww <gag/> <spit/> :-) I think that would just make it worse.
> >With strip-space turned on globally (the case in point), the
> >white-space-only text nodes between elements don't make it through to
> >the XSLT, so you cannot address them :-)
> So you're basically asking for a "munge whitespace intelligently"
> option that works at the document scope, but modifies whitespace
> (intelligently) rather than stripping it arbitrarily.
Not quite. "Sensibly" rather than "intelligently". I'm just asking for
strip-space in mixed content not to remove white-space-only text nodes
but to compress them to a single space.
Removing them actually breaks the document, because adjacent spans of
element-marked content willbeoutputcontiguouslylikethis. That is
unquestionably wrong, and should never occur under any circumstances
except when the user explicitly programs XSLT to do so (for some odd
> So, a spec:
> text nodes in mixed content: compress and trim whitespace, but do not
> delete whitespace-only nodes
For strip-space turned on, yes.
> text nodes not in mixed content: preserve? (So as to preserve line indenting?)
No, if you want that, leave strip-space turned off.
> definition of mixed content -- ref to a DTD?
If there is one available, yes (as it will already be in use for
determining default attribute values, entity expansions, etc.).
If there is no DTD/schema, mixed content can be detected by the
first occurrence of a non-white-space, non-markup token.
But I am not envisaging this whole white-space business to be of
any use or interest to DTDless usage: it's only of relevance to
people doing large-scale dense-markup processing like engineering
documentation, reference publishing, content-management systems,
and data-mining, where a DTD or schema is pretty much essential
from the start. And specifically text research markup like TEI.
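A rough sketch of that rule, in Python rather than XSLT, using only the
standard library's xml.dom.minidom. This is an illustration under stated
assumptions, not what any existing processor does: the names is_mixed,
sensible_strip and text_of are made up for the example, and mixed content
is detected with the no-DTD heuristic (an element is mixed if any of its
child text nodes contains something other than white-space). It also
ignores xml:space.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four white-space characters XML recognises


def is_mixed(element):
    """No-DTD heuristic (an assumption): any child text node with real text."""
    return any(
        child.nodeType == child.TEXT_NODE and child.data.strip(XML_WS)
        for child in element.childNodes
    )


def sensible_strip(element):
    """Apply the proposed rule recursively, modifying the tree in place."""
    mixed = is_mixed(element)
    for child in list(element.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            sensible_strip(child)
        elif child.nodeType == child.TEXT_NODE and not child.data.strip(XML_WS):
            if mixed:
                child.data = " "            # mixed content: keep one space
            else:
                element.removeChild(child)  # element content: strip as today


def text_of(node):
    """Concatenate all text below a node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(c) for c in node.childNodes)


doc = minidom.parseString(
    "<doc>\n  <p>See <em>these</em>\n   <strong>words</strong> here.</p>\n</doc>"
)
sensible_strip(doc.documentElement)
print(doc.documentElement.toxml())
# prints: <doc><p>See <em>these</em> <strong>words</strong> here.</p></doc>
```

The indentation inside <doc> (element content) goes away as it does now,
while the line break between the two inline elements inside <p> survives
as a single space.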
> There are people who've never heard of mixed content (who as you know
> can be surprisingly obtuse about it), who regard strip-space
> elements="*" as a feature, and use it on their banking data, numeric
> data sets or whatever. They'd be unhappy without it.
But they can continue to use it: if they are not using mixed content
then the above will not apply.
> I bet the sum total of XML in use, measured in bytes, is way over on
> the "data-centric" side at the moment. (One can hope this will
> rebalance somewhat over time.)
Yes, it's on the data side by a long way, and it will stay that way
(I don't see any need or prospect of that changing). But I don't
think that affects this argument at all.
> Yeah ... actually they're more like patterns ... wonder how XSLT 2.0 does this.
I've whinged about it to assorted people involved with XSLT, but the
best I got was a recommendation to add an between the offending
elements. Now that really *would* break the document :-) Too many
data-heads. Come to think of it, I don't recall seeing it mentioned at
all in any of the major books on XSLT (Mike's, Jeni's, etc), but I
haven't actually gone looking for it there.
> >Web browsers have an implicit version of strip-space in their default
> >handling of HTML. Imagine if the space between two adjacent elements in
> >mixed content suddenly disappeared.
> Oh I see it all the time (aforementioned horrible bug in MSXML, hence
> in IE)! Everyone hates it. It's abominable. It's probably the single
> worst presentation-level bug in common deployments of XML.
I've never seen this (what a sheltered life I must lead :-)
> But you see that one is an *MS* bug, and not in XSLT, which (even MS
> agrees) specifies this right. You only get strip-space if you
> explicitly turn on the switch.
Maybe I'm engineering the documents differently. What I was suggesting
was that if *all* browsers (unaided by MSXML) just started removing
the space between <em>these</em> <strong>words</strong>. *That* is what
strip-space in mixed content currently does.
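To make that concrete, here is a small Python sketch (standard library
only) that imitates the effect of stripping every white-space-only text
node, which is what strip-space elements="*" does to them. It is a
stand-in for illustration, not an XSLT processor: the function name
strip_whitespace_only is made up, and xml:space is ignored.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four white-space characters XML recognises


def strip_whitespace_only(element):
    """Remove every white-space-only text node below element, in place."""
    for child in list(element.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            strip_whitespace_only(child)
        elif child.nodeType == child.TEXT_NODE and not child.data.strip(XML_WS):
            element.removeChild(child)


para = minidom.parseString(
    "<p>the space between <em>these</em> <strong>words</strong>.</p>"
).documentElement
strip_whitespace_only(para)
stripped = para.toxml()
print(stripped)
# prints: <p>the space between <em>these</em><strong>words</strong>.</p>
```

The space in "between " survives because it belongs to a text node with
other content; only the node that was nothing but a space is lost, and
with it the word break.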
> >insistence on treating the document as a tree, even when it's not.
> I have an entire lecture sketched out in my head about the influence
> of the James Clark model of SGML/XML. We owe Clark so much that it is
> sometimes hard to notice the down side of this particular XML
> orthodoxy. But it isn't far to look for -- indeed if we were still
> used to parsing strings we'd probably be further ahead on the overlap
> problem by now -- but it is indubitably powerful, and its strength is
> in its simplicity. And it's not a bad place to rest, and get good
> work done, while we contemplate the next thing. So, part way up the
I was present at an argu^H^H^H^Hdiscussion between James and some of
the Omnimark people (a *long* time ago) where the problem was OM's
ability to output arbitrary start-tags and end-tags, and the evil
done to your soul by outputting the start-tag in one rule and the
end-tag in another :-) which is exactly what gets demanded by so many
posters to c.t.x.
> So I am probably as eager as you to get beyond the limitations of the
It's not really a limitation, simply inappropriate for a large class
of continuous-text documents.
> while perhaps not feeling so sore about this particular
> over-application of it as you
My own documents (those I do for clients) are designed to survive such
maltreatment. It's my users who feel sore about it, and need to be