On Wed, 2005-09-28 at 15:43, Wendell Piez wrote:
> Ah, this is the nub of it. I guess I just don't have the same 
> experience of what people "naturally expect".

I can only speak for one field, of course. Perhaps I shouldn't expect 
too much of XML :-)

> While this would be understandable, so would the counter be: "why is 
> it messing with my whitespace without my say-so? 

But that's precisely what it *is* doing.

> Why does it make a 
> difference whether the schema's there or not?" issued in a similarly 
> plaintive tone. So we simply disagree as to whether they hit the 
> right line on this one.

Yep. I think the argument about the presence or absence of a schema is
misleading, though. It appears to assume that there would be no 
difference in processing whether you use a schema/dtd or not, and that
has never been the case (nor IMHO was it ever intended to be: if it
was, that would be a very serious error). 

> Also, in my experience such *attitudinal* adjustments as one would 
> make when learning something one didn't design oneself can be helped 
> by a positive spirit on the part of the instructor. "It might be nice 
> if it would fix up your whitespace, but it doesn't, which can be kind 
> of a pain, but that turns out to be surprisingly difficult to do in a 
> way that satisfies everyone" etc.

That's exactly how I explain it to them :-)

> >Yes, that and the vast majority of other goodies. This particular
> >problem is small in the global scheme of things, but when dealing with
> >very large quantities of extremely dense markup in mixed content, it's a
> >pain in the ass.
> 
> More reason for that Perl in your pipeline?

Ewww <gag/> <spit/> :-) I think that would just make it worse.

> >With strip-space turned on globally (the case in point), the
> >white-space-only text nodes between elements don't make it through to
> >the XSLT, so you cannot address them :-)
> 
> So you're basically asking for a "munge whitespace intelligently" 
> option that works at the document scope, but modifies whitespace 
> (intelligently) rather than stripping it arbitrarily.

Not quite. "Sensibly" rather than "intelligently". I'm just asking for
strip-space in mixed content not to remove white-space-only text nodes
but to compress them to a single space.

Removing them actually breaks the document, because adjacent spans of
element-marked content willbeoutputcontiguouslylikethis. That is
unquestionably wrong, and should never occur under any circumstances 
except when the user explicitly programs XSLT to do so (for some odd
reason).
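
To make the difference concrete, here is a minimal sketch in Python
(standard-library ElementTree, not XSLT). The helper names and the
sample paragraph are made up for illustration, and for brevity it
rewrites every white-space-only text node rather than only those in
mixed content:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two inline elements separated by a line break
# and indentation, i.e. a white-space-only text node.
SRC = "<p>Adjacent <em>these</em>\n    <strong>words</strong> matter.</p>"

def rewrite_ws_only(root, replacement):
    """Replace each white-space-only text node with `replacement`."""
    for el in root.iter():
        # ElementTree keeps text nodes in .text (before the first child)
        # and .tail (directly after the element's end-tag).
        if el.text and not el.text.strip():
            el.text = replacement
        if el.tail and not el.tail.strip():
            el.tail = replacement
    return root

def flatten(root):
    """Concatenate all character data in document order."""
    return "".join(root.itertext())

# What strip-space does today: the node is removed outright.
stripped = flatten(rewrite_ws_only(ET.fromstring(SRC), ""))
# What I am asking for: the node is compressed to one space.
compressed = flatten(rewrite_ws_only(ET.fromstring(SRC), " "))

print(stripped)    # Adjacent thesewords matter.
print(compressed)  # Adjacent these words matter.
```

The first result is the broken one; the second is what a reader of
the source document would expect.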

> So, a spec:
> 
> text nodes in mixed content: compress and trim whitespace, but do not 
> delete whitespace-only nodes

For strip-space turned on, yes. 

> text nodes not in mixed content: preserve? (So as to preserve line indenting?)

No, if you want that, leave strip-space turned off.

> definition of mixed content -- ref to a DTD?

If there is one available, yes (as it will already be in use for
determining default attribute values, entity expansions, etc.).
If there is no DTD/schema, mixed content can be detected by the 
first occurrence of a non-white-space, non-markup token.
But I am not envisaging this whole white-space business to be of
any use or interest to DTDless usage: it's only of relevance to
people doing large-scale dense-markup processing like engineering
documentation, reference publishing, content-management systems,
and data-mining, where a DTD or schema is pretty much essential
from the start. And specifically text research markup like TEI.
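
For the DTDless case, the detection could be sketched roughly like
this in Python (again standard-library ElementTree; `is_mixed` and
the sample document are hypothetical, and reading "non-white-space,
non-markup token" as "non-white-space text directly inside an element
that also has child elements" is my assumption):

```python
import xml.etree.ElementTree as ET

def is_mixed(el):
    """Heuristic: el has child elements and at least one
    non-white-space text node directly inside it."""
    children = list(el)
    if not children:
        return False
    # Text before the first child element.
    if el.text and el.text.strip():
        return True
    # Text following any child element (still inside el).
    return any(c.tail and c.tail.strip() for c in children)

# Hypothetical sample: one data-style record, one paragraph of prose.
doc = ET.fromstring(
    "<doc>\n"
    "  <record>\n    <id>42</id>\n    <amount>10.50</amount>\n  </record>\n"
    "  <p>See <em>these</em> <strong>words</strong> here.</p>\n"
    "</doc>"
)

mixed = [el.tag for el in doc.iter() if is_mixed(el)]
print(mixed)  # ['p']
```

Note the weakness: a paragraph consisting solely of
<em>these</em> <strong>words</strong> has no direct text, so the
heuristic would miss it. That is one more reason to take the
definition from the DTD or schema when there is one.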

> There are people who've never heard of mixed content (who as you know 
> can be surprisingly obtuse about it), who regard strip-space 
> elements="*" as a feature, and use it on their banking data, numeric 
> data sets or whatever. They'd be unhappy without it.

But they can continue to use it: if they are not using mixed content
then the above will not apply.

> I bet the sum total of XML in use, measured in bytes, is way over on 
> the "data-centric" side at the moment. (One can hope this will 
> rebalance somewhat over time.)

Yes, it's on the data side by a long way, and it will stay that way
(I don't see any need or prospect of that changing). But I don't
think that affects this argument at all.

> Yeah ... actually they're more like patterns ... wonder how XSLT 2.0 does this.

I've whinged about it to assorted people involved with XSLT, but the 
best I got was a recommendation to add an &nbsp; between the offending 
elements. Now that really *would* break the document :-)  Too many 
data-heads. Come to think of it, I don't recall seeing it mentioned at
all in any of the major books on XSLT (Mike's, Jeni's, etc), but I
haven't actually gone looking for it there.

> >Web browsers have an implicit version of strip-space in their default
> >handling of HTML. Imagine if the space between two adjacent elements in
> >mixed content suddenly disappeared.
> 
> Oh I see it all the time (aforementioned horrible bug in MSXML, hence 
> in IE)! Everyone hates it. It's abominable. It's probably the single 
> worst presentation-level bug in common deployments of XML.

I've never seen this (what a sheltered life I must lead :-)

> But you see that one is an *MS* bug, and not in XSLT, which (even MS 
> agrees) specifies this right. You only get strip-space if you 
> explicitly turn on the switch.

Maybe I'm engineering the documents differently. What I was suggesting
was to imagine *all* browsers (unaided by MSXML) suddenly removing
the space between <em>these</em> <strong>words</strong>. *That* is what
strip-space in mixed content currently does. 

> >insistence on treating the document as a tree, even when it's not.
> 
> I have an entire lecture sketched out in my head about the influence 
> of the James Clark model of SGML/XML. We owe Clark so much that it is 
> sometimes hard to notice the down side of this particular XML 
> orthodoxy. But it isn't far to look for -- indeed if we were still 
> used to parsing strings we'd probably be further ahead on the overlap 
> problem by now -- but it is indubitably powerful, and its strength is 
> in its simplicity. And it's not a bad place to rest, and get good 
> work done, while we contemplate the next thing. So, part way up the 
> mountain.

I was present at an argu^H^H^H^Hdiscussion between James and some of
the Omnimark people (a *long* time ago) where the problem was OM's 
ability to output arbitrary start-tags and end-tags, and the evil
done to your soul by outputting the start-tag in one rule and the
end-tag in another :-) which is exactly what gets demanded by so many
posters to c.t.x.

> So I am probably as eager as you to get beyond the limitations of the 
> tree-view, 

It's not really a matter of limitations: the tree view is simply
inappropriate for a large class of continuous-text documents.

> while perhaps not feeling so sore about this particular 
> over-application of it as you

My own documents, and those I do for clients, are designed to survive such 
maltreatment. It's my users who feel sore about it, and need to be
carefully educated. 

///Peter