LISTSERV 16.5 - TEI-SOM Archives

TEI-SOM@LISTSERV.BROWN.EDU

Subject:
Bad and good examples of XPointer, and a proposal
From:
Fabio Vitali <[log in to unmask]>
Reply-To:
TEI Stand-Off Markup, XLink, XPointer WG
Date:
Fri, 10 Oct 2003 16:32:39 +0200
Hi.

> FV Post an example of a pointer to a character using
>  both the official XPointer syntax and an ad-hoc
>  better one.                                          2003-10-06
>
> We need to stick to the horrific method XPointer already uses for
> addressing characters, rather than create our own addressing
> mechanism. History is that the pointer group decided to stick with
> what the DOM and Infoset were doing.
>
> FV will post an example of what this syntax looks like.
>
>
>
> Wrapped up at ~13:54; next call Fri, 10 Oct 03 at 14:30 UTC.
> --------- end plain-text version ---------

The structure of XPointer
-------------------------

A short reminder about the actual syntax of XPointer. I assume everybody
knows about XPath. XPointer is meant as a way to specify addresses within an
XML document in the fragment part of a URI.

There exist several schemes for XPointer, each of which can be used in
conjunction with the others, but independently. This means that, given the
schemes A and B, the XPointer A(x)B(y) creates two independent mechanisms to
address the same fragment of the document; the first does not create a
context for the second to work in (say, in a hierarchical way: get the node
A(x), and use it as a basis to compute B(y)).
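
A minimal sketch of this independence, in Python, assuming the left-to-right,
first-success evaluation described in the XPointer Framework. The names
split_pointer_parts, resolve and the evaluators mapping are hypothetical, and
the circumflex escaping of parentheses is ignored for brevity.

```python
def split_pointer_parts(pointer):
    """Split 'a(x)b(y)' into [('a', 'x'), ('b', 'y')], honouring nested parentheses."""
    parts = []
    i = 0
    n = len(pointer)
    while i < n:
        if pointer[i].isspace():
            i += 1
            continue
        open_idx = pointer.find("(", i)
        if open_idx == -1:
            raise ValueError("missing '(' in pointer part")
        name = pointer[i:open_idx]
        depth = 1
        j = open_idx + 1
        while j < n and depth > 0:
            if pointer[j] == "(":
                depth += 1
            elif pointer[j] == ")":
                depth -= 1
            j += 1
        if depth != 0:
            raise ValueError("unbalanced parentheses")
        parts.append((name, pointer[open_idx + 1 : j - 1]))
        i = j
    return parts


def resolve(pointer, evaluators):
    """Evaluate each part on its own against the whole document.

    No part sees the result of an earlier one: the first part that
    identifies something wins, the others act only as fallbacks.
    """
    for scheme, data in split_pointer_parts(pointer):
        evaluate = evaluators.get(scheme)
        if evaluate is None:
            continue  # unknown scheme: skip this part
        result = evaluate(data)
        if result:
            return scheme, result
    return None
```

With evaluators for A and B, resolve("A(x)B(y)", ...) never feeds the result
of A(x) into B(y); B(y) is tried only if A(x) identifies nothing.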

Two schemes have become W3C Recommendations, element() and xmlns(). A third
and a fourth one, xpointer() and point(), were stopped while at the Proposed
Recommendation stage and are now back to being Working Drafts. They are both
described in the December 19, 2002 W3C Working Draft
http://www.w3.org/TR/2002/WD-xptr-xpointer-20021219/

XPointer on addressing characters
---------------------------------
First, a few comments on "the horrific syntax proposed by XPointer for
addressing characters."

The two main schemes provide no way to identify sub-node constructs. You
can select a whole node (an element, an attribute, even a text node). The
xpointer() and point() schemes, on the other hand, are meant exactly for
this task. Unfortunately, they do so in a very awkward way.

The xpointer() scheme is reasonably based on the XPath syntax, and adds a
few additional functions. The main function is string-range(), which
identifies a substring of a text node within a node set, provided you can
specify its content.

E.g.: xpointer(string-range(//div/text(), "TEI")) means "get the location of
each substring 'TEI' within the text nodes of any div element in the
document". Unfortunately, you already have to have the string you are
looking for, and cannot locate the string between characters 15 and 18 of a
given text node.
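
To make the behaviour concrete, here is a rough Python emulation of that
expression using only the standard library. It is a sketch, not an XPointer
processor: text_nodes and string_range are hypothetical helpers, offsets are
zero-based, and each text node is searched separately.

```python
import xml.etree.ElementTree as ET


def text_nodes(elem):
    """Yield the direct text-node children of elem, in document order."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tail:
            yield child.tail


def string_range(root, tag, needle):
    """Return (element index, text node index, start offset) for every match."""
    hits = []
    for e_idx, elem in enumerate(root.iter(tag), start=1):
        for t_idx, text in enumerate(text_nodes(elem), start=1):
            start = text.find(needle)
            while start != -1:
                hits.append((e_idx, t_idx, start))
                start = text.find(needle, start + 1)
    return hits


doc = ET.fromstring(
    "<body><div>The TEI and <hi>more</hi> TEI text</div><div>none</div></body>"
)
print(string_range(doc, "div", "TEI"))
```

The search is driven entirely by the string "TEI"; the helper has no way to
be asked for "characters 15 to 18" without knowing what they contain.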

Furthermore, string-range() is not a "step production", i.e., it cannot be
placed at the end of a complex tree traversal in plain XPath:

 xpointer(string-range(//div/text(), "TEI"))     is correct
 xpointer(//div/text()/string-range("TEI"))      is NOT correct

Syntactic sugar, you might say. Not quite. A step production exists,
range-to(), but it can hardly be used in this context.

Node tests in XPath are a way to identify constructs of a given type in a
location path. So for instance //div/comment()[2] identifies the second
comment within each div element of the document. XPointer defines two more
node tests, point() and range(), that *seem* usable for identifying the
address of a single character within a text node.

That is, they seem to allow something like
xpointer(//div[1]/text()/point()[18]) to identify the position between the
18th and 19th character of the first div element of the document.
Unfortunately, although this seems a rather natural way to express it,
nothing in the XPointer document justifies such a reading, and indeed it
seems to be explicitly denied by the XPath recommendation (and by a personal
discussion of the matter with Steve DeRose).

The syntax //div[1]/text()/point()[18] is actually a shorthand for
/descendant-or-self::node()/child::div[1]/child::text()/child::point()[18],
and it seems to imply that the child axis of a text node is composed of
points, of which one can select the 18th. Unfortunately the XPath
recommendation specifies (section 5: Data Model) that only element nodes and
root nodes have child nodes, and thus it seems clear that the child axis of
a text node is empty.
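
The DOM behaves the same way on this point, which can be checked with
Python's xml.dom.minidom. This is only an illustration on a made-up
one-element document; the DOM is a different model from the XPath data
model, and the two merely agree here.

```python
from xml.dom import minidom

doc = minidom.parseString("<div>Some text in the first div</div>")
div = doc.documentElement
text = div.firstChild

print(div.hasChildNodes())                # the element has a child
print(text.nodeType == text.TEXT_NODE)    # that child is a text node
print(len(text.childNodes))               # the text node has no children
```

So a step such as child::point() applied to a text node has nothing to
iterate over unless the specification says otherwise.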

The XPointer document does not say anything about this, and so we are left
thinking that it is not possible.

An alternative way to identify characters is to use an extension of the
element() scheme called the point() scheme. This is not an XPath-based
syntax: it is a tree-traversal syntax that counts the elements to traverse.
Interestingly, in the point() scheme a text node can be further navigated
down to the actual individual character.

So point(/1/5/3/47) may seem to refer to the 47th character of the third
element of the fifth element of the root element. No names can be specified.
Nothing except a full count of nodes can be given.
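
A sketch of that reading in Python follows. It rests on assumptions that are
mine and not the draft's: every step but the last counts element children
(with /1 being the root element), and the last step is a 1-based character
position in the string value of the element reached. resolve_point is a
hypothetical name.

```python
import xml.etree.ElementTree as ET


def resolve_point(root, pointer):
    """Resolve '/1/2/2/3': element steps, then a final 1-based character position."""
    steps = [int(s) for s in pointer.strip("/").split("/")]
    *element_steps, char_pos = steps
    if not element_steps or element_steps[0] != 1:
        raise ValueError("the first step must be /1, the root element")
    node = root
    for step in element_steps[1:]:
        node = list(node)[step - 1]  # count element children only
    value = "".join(node.itertext())  # assumed: offsets run over the string value
    return value[char_pos - 1]


doc = ET.fromstring("<a><b>x</b><c><d>hello</d><e>world</e></c></a>")
print(resolve_point(doc, "/1/2/2/3"))
```

Every step is a bare number, so inserting one element early in the document
silently changes what the pointer addresses.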

My proposal
-----------
My proposal is to create a new scheme (call it tei()), and to add a way to
identify sub-text-node fragments within an XPath syntax. Ideally this would
be a step production rule, i.e., it should be placeable as a step of a
location path:

 tei(//div[1]/text()/point()[18])

Or even better

 tei(//div[1]/value()[18])

Where we define a new node test that returns as a location set the
individual characters of the string value of the node an XPath selects (the
string value is, very naturally, the string generated by concatenating all
the text nodes, at any level of the hierarchy, within a node).
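
A small Python sketch of what value()[18] could address, assuming point n
sits between the n-th and the (n+1)-th character of the string value (the
same convention used for point()[18] earlier). string_value and chars_around
are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def string_value(elem):
    """Concatenate every text node inside elem, at any depth."""
    return "".join(elem.itertext())


def chars_around(elem, n):
    """Return the characters just before and just after point n."""
    s = string_value(elem)
    if not 0 <= n <= len(s):
        raise IndexError("point outside the string value")
    before = s[n - 1] if n >= 1 else None
    after = s[n] if n < len(s) else None
    return before, after


div = ET.fromstring("<div>Stand-off <hi>markup</hi> for the TEI</div>")
print(string_value(div))
print(chars_around(div, 18))
```

Because the count runs over the string value, inline markup such as the hi
element does not disturb the offsets.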

Alternatively it could be something other than a step production, i.e. a
plain function to be placed at the beginning of an XPointer:

 tei(point(//div[1]/text(), 18))

But it seems uglier to me.

The main reason for suggesting a step production is that you can then use
the step production range-to() to obtain a whole range, identified by the
individual characters that are its extremes.

For instance:

 tei(//div[1]/text()/point()[18]/range-to(//div[1]/text()/point()[25]))
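
In offset terms, and again assuming that point n lies between the n-th and
the (n+1)-th character, the range from point 18 to point 25 covers
characters 19 to 25, which is a plain slice in Python. The sample document
and range_between are made up for illustration.

```python
import xml.etree.ElementTree as ET


def range_between(text, start_point, end_point):
    """Return the characters lying between two points of a text node."""
    if not 0 <= start_point <= end_point <= len(text):
        raise ValueError("points must lie inside the text node, in order")
    return text[start_point:end_point]


doc = ET.fromstring("<doc><div>An example of the pointer syntax</div></doc>")
first_div_text = doc.find(".//div").text
print(range_between(first_div_text, 18, 25))
```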

Additionally, I would suggest we propose three new step productions: one
for regular expressions, one for words and one for sentences.

To summarise:
-------------
* Define a new XPointer scheme, tei().
* This is equivalent to the xpointer() scheme, with the following
exceptions:
  - It is explicitly stated that the child axis of a text node is
composed of the points between the individual characters.
  - A new node test is created, value(), that gives the location set of all
the points between the individual characters of the string value of the
node an XPath selects.
  - A new step production function is created, regexp(RE), that can
identify, within the context location set, the ranges matched by the
regular expression.
  - A new node test is created, word(), that selects all the individual
words within the context location set. A word is a set of contiguous
characters separated by an appropriate set of whitespace characters, of
which a definition and a list already exist.
  - Possibly a new node test is created, sentence(), that selects all the
individual sentences within the context location set. A sentence is a set of
contiguous words separated by an appropriate set of delimiters, whose
definition includes full stop, question mark and exclamation mark, and
possibly other characters.
* These new additions are all step productions, i.e., they can be used in
any position within a longish location path as individual steps to identify
the correct result.
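
A rough Python sketch of the ranges that regexp(), word() and sentence()
might yield over a single string, expressed as (start, end) offsets. The
function names and the exact patterns are assumptions made for illustration:
words are runs of non-whitespace, and a sentence is taken to include its
closing delimiter.

```python
import re


def regexp_ranges(text, pattern):
    """Return the (start, end) offsets of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]


def word_ranges(text):
    """Words: contiguous characters separated by whitespace."""
    return regexp_ranges(text, r"\S+")


def sentence_ranges(text):
    """Sentences: runs of words ended by '.', '?' or '!' (delimiter included)."""
    return regexp_ranges(text, r"[^.?!\s](?:[^.?!]*[^.?!\s])?[.?!]?")


text = "Why? Because it works. Fine!"
print(word_ranges(text))
print(sentence_ranges(text))
print(regexp_ranges(text, r"it"))
```

A real definition would need to settle details this sketch glosses over,
such as abbreviations, and whether punctuation belongs to a word.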



--

Fabio Vitali                            Tiger got to hunt, bird got to fly,
Dept. of Computer Science        Man got to sit and wonder "Why, why, why?"
Univ. of Bologna  ITALY               Tiger got to sleep, bird got to land,
phone:  +39 051 2094872              Man got to tell himself he understand.
e-mail: [log in to unmask]                     Kurt Vonnegut, "Cat's cradle"
http://www.cs.unibo.it/people/faculty/fabio/