It may be that the TEI needs to consider the approach taken by the
physical sciences when dealing with uncertain quantities. In the context
of data analysis, a distinction is made between "accuracy" and
"precision". 

My understanding is that "accuracy" describes how well a measurement
corresponds to the actual value. So if the actual date of something is
115, then 115 is an accurate estimate while 190 is not so accurate.
There is a big problem here, as the humanities don't have anything like
physical standards against which to measure accuracy.

"Precision", on the other hand, refers to the interval in which a
measurement is expected to lie. A narrow interval corresponds to high
precision and a broad interval to low precision. Typically, expressions
of precision involve a confidence level, with 95% being a popular one.
Thus (a +/- b) means that the probability of the actual value being
somewhere between (a - b) and (a + b) is 95%.

So given an actual value of 115 (remembering that there is often no way
to know the actual value in the humanities), we have all sorts of
possibilities. E.g.

accurate and precise: 115 +/- 10
inaccurate and precise: 190 +/- 10
accurate and imprecise: 115 +/- 100
inaccurate and imprecise: 190 +/- 100

Looking at things in this way, something like a 2nd century date for a
papyrus roll might be expressed as 150 +/- 50. However, if you asked the
palaeographer if this is indeed what is meant, you might find that he or
she is not sure that the actual date is in this range.

In view of all this, I think that we need to specify an interval and a
confidence level to properly express uncertainty concerning a quantity.
One need not use numbers for the confidence level. Actually, categories
such as "high", "medium" or "low" are preferable because something like
"47%" gives a false sense of precision to something that is more akin to
a forensic category, such as "beyond reasonable doubt". Also, a small
number of categories makes it more likely that separate encoders will
come up with the same encoding for the same thing.

Thus, an adequate description of the magnitude of an unknown quantity
requires:

(1) an estimate of the quantity (e.g. 115)
(2) an interval in which the quantity is thought to lie (e.g. +/- 50)
(3) the confidence attached to the assertion that the actual value lies
within the interval (e.g. "high", "medium", "low").
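As a rough sketch of how the three components listed above might be held
together in one record, here is a small Python example. The names
UncertainQuantity, half_width and Confidence are invented for illustration
and are not TEI constructs. I have also assumed a symmetric interval
(estimate +/- half_width), which is only one way to express item (2).

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    """Categorical confidence, as suggested in place of a numeric level."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class UncertainQuantity:
    """Hypothetical record for an uncertain quantity (illustrative only)."""
    estimate: float          # (1) the estimate, e.g. 115
    half_width: float        # (2) the b in "a +/- b", e.g. 50 (assumed symmetric)
    confidence: Confidence   # (3) confidence that the value lies in the interval

    def __post_init__(self):
        if self.half_width < 0:
            raise ValueError("half_width must be non-negative")

    @property
    def lower(self):
        return self.estimate - self.half_width

    @property
    def upper(self):
        return self.estimate + self.half_width

    def contains(self, value):
        """True if value falls inside the stated interval (bounds included)."""
        return self.lower <= value <= self.upper


# The "2nd century" papyrus example, with a confidence chosen arbitrarily.
date = UncertainQuantity(estimate=150, half_width=50, confidence=Confidence.MEDIUM)
print(date.lower, date.upper)  # 100 200
```

Note that contains() only says whether a value falls inside the stated
interval. It says nothing about accuracy, which cannot be checked without
knowing the actual value.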

Various schemes can be used to specify confidence. One possibility has
three levels corresponding to notional confidence levels (C) of C > 95%
(high), 5% < C < 95% (medium), C < 5% (low). Another possibility has
four levels: C > 95% (highly probable), 50% < C < 95% (probable), 5% < C
< 50% (improbable), C < 5% (highly improbable).
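To make the two schemes concrete, here is a minimal Python sketch that maps
a notional confidence level C (given as a fraction between 0 and 1) onto the
categories. The function names are hypothetical. The schemes as stated leave
the exact boundaries (5%, 50%, 95%) unassigned, so the choices below are my
assumption: 5% and 95% fall into the adjacent middle category, and exactly
50% is treated as "improbable".

```python
def three_level(c):
    """Map a notional confidence level c (0..1) to high / medium / low."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must be between 0 and 1")
    if c > 0.95:
        return "high"
    if c < 0.05:
        return "low"
    # Assumption: exactly 5% and exactly 95% count as "medium".
    return "medium"


def four_level(c):
    """Map a notional confidence level c (0..1) to one of four categories."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must be between 0 and 1")
    if c > 0.95:
        return "highly probable"
    if c > 0.50:
        # Assumption: exactly 95% counts as "probable".
        return "probable"
    if c >= 0.05:
        # Assumption: exactly 50% and exactly 5% count as "improbable".
        return "improbable"
    return "highly improbable"


print(three_level(0.47))  # medium
print(four_level(0.47))   # improbable
```

The point of the mapping is that an encoder would record only the category,
never the number, which avoids the false sense of precision that something
like "47%" conveys.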

Best,

Tim Finney