Design Issues for the World Wide Web

Levels of Abstraction: Net, Web, Graph

timbl@w3.org (Tim Berners-Lee) — Tue, 23 Oct 2007 00:00:00 GMT

The web of things is built on the web of documents, which is built on the web of computers controlled by Domain Name owners, which itself is built on a set of interconnected cables. This is an architecture which provides a social backing to the names for things. It allows people to find out the social aspects of the things they are dealing with, such as provenance, trust, persistence, licensing and appropriate use as well as the raw data. It allows people to figure out what has gone wrong when things don't work, by making the responsibility clear.

The value of this architecture is that each layer leverages the social components of the lower layer's architecture.

Read whole article...

Working despite Ambiguity

timbl@w3.org (Tim Berners-Lee) — Mon, 15 Oct 2012 00:00:00 GMT

(I guess this is one of these things which is perennial. I have not studied much of the history of philosophy but I do find one needs to be prepared to jump in in order to keep the course of what I otherwise regard as engineering still on track... as I have said before, this is philosophical engineering we are doing...)

The point which David Booth has brought up, not for the first time, and which Pat has expounded very well, that no symbol can ever have completely unambiguous meaning is, yes, quite valid. There are several such points which we have to go over every now and again (preferably out of the critical path of working group work) and agree we all understand it and agree that we can all continue in practice without it. And indeed continue in theory without it as well. And Pat, you have led us through that journey from philosophical foundationlessness to logical foundations before and maybe you can help us again or just point to where you did before. And Graham, you make an important distinction.

There are lots of models, I am sure, one can make of ambiguity and language and communication which will allow us to do this, and they may differ in how they work and it probably is best that we agree they exist but not get hung up arguing about which one is "right". They will all be imperfect, but good enough.

Read rest of article...

Web Architecture from 50,000 feet

timbl@w3.org (Tim Berners-Lee) — Fri, 04 Sep 1998 00:00:00 GMT

This document attempts to be a high-level view of the architecture of the World Wide Web. It is not a definitive complete explanation, but it tries to enumerate the architectural decisions which have been made, show how they are related, and give references to more detailed material for those interested. Necessarily, from 50,000 feet, large things seem to get a small mention. It is architecture, then, in the sense of how things hopefully will fit together. I have resisted the urge, and requests, to try to write an architecture for a long time: this was from a feeling that a dead and therefore less valuable document must result from any attempt to select which, of all the living ideas, seem most stable, logically connected and essential. So we should recognize that while it might be slowly changing, this is also a living document.

The document is written for those who are technically aware or intend soon to be, so it is sparse on explanation and heavy in terms of terms.

Read rest of article...

Axioms of Web architecture: URIs

timbl@w3.org (Tim Berners-Lee) — Thu, 19 Dec 1996 00:00:00 GMT

The operation of the World Wide Web, and its interoperability between platforms of differing hardware and software manufacturers, depend on the specifications of protocols such as HTTP, data formats such as HTML, and other syntaxes such as the URL or, more generally, URI specifications. Behind these specifications lie some important rules of behavior which determine the foundation of the properties of the Web. These are rules and principles upon which new designs of programs and the behavior of people must rely. And it is that reliance which makes the Web both an information space which works now, and the foundation for future applications, protocols, and extensions. The more essential of these I refer to loosely as axioms, and the most basic of these have to do with URI.

The aim of this article is to summarize in one place the axioms of Web architecture: those invariant aspects of Web design which are implied or stated in various specifications or in some cases simply part of the folk law of how the Web ought to be used. Especially for these latter cases, this article is designed to tie together the Web community in a common understanding of how we can progress, extend, and evolve the Web protocols. Terms such as "axiom" and "theorem" are used with gay abandon rather than precision, as this is not a mathematical treatise.

Read rest of article...

Linked data is like a Bag of Chips

timbl@w3.org (Tim Berners-Lee) — Wed, 26 May 2010 00:00:00 GMT

The value of data is the insight which comes when different bits of data are joined together. For that process to provide value, the world must contain all sorts of kinds of information of different types, and it must be linked together. Linked data involves using ontologies. But if you are a developer, how do you pick those ontologies? The art is to use several different ontologies in the same document, the same message. In a typical application, part of what you need to express will be a very common idea (like, say, a title of a document) while part of the information will be concepts shared with particular groups or domains, like, say, blood pressure. And some will be obscure data (like, say, blood pressure monitor calibration data) which is only understood by device engineers. Putting all this information together in a mixture of ontologies is the best thing to do. Some you will find, some you may work with others toward consensus, some you might use that day in that project. Using each of those ontologies gets you the most total interoperability. A bag of chips has all kinds of information of different types, and each user (the customer, the checkout scanner, the celiac, the nutritionist) uses different bits and ignores the rest. With its mixture of ontologies and its rule of ignoring data you don't need, or you don't understand, the world of Linked Data is quite like a bag of chips.
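As a rough illustration of mixing vocabularies and ignoring what you don't understand, the sketch below holds one record whose properties come from three different vocabularies, and lets each consumer keep only the properties it knows. Only the Dublin Core title URI is a real term; the example.org property URIs, the values, and the three consumers are assumptions made up for this sketch.

```python
# Properties are identified by URI; each comes from a different vocabulary.
DC_TITLE = "http://purl.org/dc/terms/title"                     # very common idea
BLOOD_PRESSURE = "https://example.org/health#systolicPressure"  # hypothetical domain term
CALIBRATION = "https://example.org/device#calibrationOffset"    # hypothetical obscure term

record = {
    DC_TITLE: "Morning reading",
    BLOOD_PRESSURE: 118,
    CALIBRATION: -0.4,
}


def view(record, understood):
    """Return only the properties this consumer understands; ignore the rest."""
    return {prop: value for prop, value in record.items() if prop in understood}


# A generic document browser knows titles only.
browser_view = view(record, {DC_TITLE})
# A clinician's app knows titles and blood pressure.
clinic_view = view(record, {DC_TITLE, BLOOD_PRESSURE})
# A device engineer's tool knows the calibration data as well.
engineer_view = view(record, {DC_TITLE, BLOOD_PRESSURE, CALIBRATION})
```

No consumer fails on the properties it does not recognize; like the shopper and the checkout scanner, each simply skips them.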

Read whole article...

Beneficent Apps

timbl@w3.org (Tim Berners-Lee) — Sat, 13 Jan 2018 00:00:00 GMT

It is a sign of the times (2010's) that we even have to talk about these. Back in the days of floppy disk based PCs, when you would spend your money on a cool program (or App as we say now) you would spend money to be able to do useful, fun things. Play a game. Fly a plane. Write an essay. Do your taxes. You would typically have the program in the A: drive and put a disk for your data in the B: drive.

The data on the B: drive was completely in your control. You could use it with different programs in the A: drive. The program in the A: drive was your tool. It helped you do your work. It worked for you.

Until, that is, for me, the day when Quicken, the program I had bought ages ago to do my finances and taxes, having done my taxes, asked words to the effect of "Are you sure you have enough insurance? Would you like to buy some great insurance?"

That was the end of an era: the era in which I trusted Quicken to be my representative and work on my behalf. Of course millennials are so used to doing everything on the web, using web-based tools, and so used to those web-based tools actually working in someone else's interest, that they may assume this is normal, and take this as the default.

But in fact in this world we lose something very important. The basic human ability to use a computer provides a wonderful level of empowerment. There is something important about a program which represents me.

While Beneficent Apps are not the norm on the web, or even on mobile devices, they are in fact common. Every open source app is (should be!) a beneficent app. These are apps which are developed by a community for its own use, and generally they are developed with the needs and wants of the end user in mind.

Web browsers are important Beneficent Apps. They are crucial as the tool with which the person interacts with the web and all the crazy wonderful stuff out there. In the HTTP protocol spec, the browser is in fact called the User Agent. Web browsers must protect and serve the user in lots of ways:

Help the user know and understand what party (website owner) they are actually talking to
Help the user remember where they have gone
Help the user curate a subset of the web which is valuable for them
Store safely passwords and keys
Help them avoid being tricked, fooled, or manipulated on line
and so on

Tips

When designing a Beneficent App, always just think at each design choice -- what would the user want the app to do? If you are thinking of yourself as a main driving user, then think about use cases which empower, and connect to other powerful things you can do or will be able to do. But also think of users who differ from you in many ways -- their level of tech ability, their preferences about being social or not, their situation, their personality.

If you have an income stream from selling the data from your users, then you are not likely to build a Beneficent App by default.

Metrics

How do you measure how good a beneficent app is? It is so easy to make metrics for non-beneficent apps: the engagement level, click-through, the ad revenue or sales revenue they provide. It is more difficult to measure how useful you have been to your user. A user may just be really well informed of something really important, but not do anything which your app could pick up. So yes, you can survey them, but now let's look at measuring their activity. If your app is something which helps people organize parties, then you can measure the number of parties which people organize (not to mention whether they were great parties :-). But those end goals come infrequently, so you could measure the amount of stuff going on toward the end goal: the amount of chat (how about sentiment analysis of the chat as to whether it is happy, constructive?), or the extent to which the to-do system works (do tasks get done by different people from the people who raise them, for example?). One can think of lots of potential things you could measure which may or may not be useful. Then when you have that list, you can look at previous parties and see how those things were involved with making great parties, or with actually getting the party organized at all (which of those would you want to optimize for?). You can also try your apps out, doing A/B testing in the next version with these as things to optimize for. But also beware of unexpected negative effects. Did the people building social networks imagine the effects on teenage health of a beauty-based economy? Probably not, but now we know that systems built to be happy centers of collaboration can end up being toxic for classes of users. To be beneficent, you have to also do no harm!

Regulating Agents

The concept of apps which work for you, the concept of something which is your agent, is not in fact foreign to our current world. We have it, after all, with doctors and with lawyers. A doctor takes the Hippocratic Oath, or some form of it, in which they commit to operate in the interests of the patient. Lawyers also are bound to put the interest of their client first.

There are therefore lots of laws and regulations out there to take inspiration from when wording commitments for Apps, or for the developers who create them, if anyone wanted to craft regulations or terms and conditions about Beneficent Apps.

Beneficent AI

As AI gets more powerful, every step it takes it becomes more important that it is beneficent. Beneficent for you the individual, and for us the human race. More important that you have AIs which work for you not someone or something else.

Hey online Ad system, who do you work for? When you recommend I eat at a restaurant, is that the best one for me, or the result of an instant online auction for whoever bids most to you for my custom? Hey, Siri, who do you work for? Hey, Alexa, who do you work for?

Can you even imagine an AI that works for you? I can, and he's called Charlie.

References

IETF Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing

on: Charlie is a Beneficent AI

Tim BL

Blockchain

timbl@w3.org (Tim Berners-Lee) — Tue, 31 Oct 2017 00:00:00 GMT

Blockchain and the Web

There are so many frequently asked questions around Blockchain and the web that this may be a place to tease a few things out, specifically about the relationship between the technologies. What technical problem is there in the world in 2017 where someone has not asked "Will the blockchain solve that?". The response is then typically "Well, what do you mean by blockchain exactly?", as it really depends which aspect of which system you are talking about. Are you talking about the original Bitcoin Blockchain, or some other use of blockchain technology, like Ethereum, or something which has a crypto-currency but quite a different protocol, like Ripple, and so on? And what were you using the system for: to transfer money, to claim a unique identity, or to notarize a document? It all depends. Now there are a huge number of things written on blockchain, so here I make no attempt to explain it or discuss its details, about which I am not an expert. Here I just try to fish out some of the distinctions between different things out there and how they relate to Web Architecture. Within that, many questions arise, some answered here, and some unanswered anywhere.

The Bitcoin Blockchain

When people talk about Blockchain, sometimes they mean the original Blockchain, in which Bitcoins are mined, and sometimes they mean anything which uses the same ideas but is different. These are quite different. The original Bitcoin blockchain was the first the world saw of the technology and it proudly announced a new system which included:

A shared global ledger on which things could be notarized
A currency which could be traded rapidly and cheaply on the net
Protocols to use the ledger as a first-come first-served name-allocation system
Protocols to declare your public key in the web.
Many other proposed exciting ideas

There were mutual dependencies:

The ledger depends on people mining Bitcoins to guarantee its integrity
Bitcoins depend on the ledger for being bought (split) and sold

There were a few anomalies, or ironic aspects. The ledger system works and maintains its integrity only because of the complete openness of the blocks in the chain, and the fact that many systems were constantly checking it; and yet bitcoin in fact provided a way that value could be used anonymously for criminal purposes with impunity. Material in or linked from blocks can of course be encrypted, and must be if not public. In general the practice of encrypting stuff and making the encrypted version public requires faith that the encryption method won't be cracked in any timescale which is important to the parties. The cracking of encryption becomes ever easier with more powerful computers, latent vulnerabilities in the system, and new hacks to it.

Also, for a new fast and light system of transfer of value, it was ironic that the mining needed to both generate new coin and maintain the verified integrity of the chain necessarily by design involves burning huge amounts of energy, which is not ecologically sustainable.

These protocols connect to the web in various ways:

Web authentication can use identities based on the blockchain. You could log onto a web server using a key pair whose public key is declared in the blockchain. Identity systems, some using "DID" standard URNs, like Sovrin and Veres One, allow one to create an identity which
Web Payments defines a modular system within the browser which allows new forms of payment system to be inserted. Payment with Bitcoin, or any other crypto token, could be added, and I expect several will be added, perhaps in the basic browser itself, or using a browser extension.
Any web site version could be notarized in a blockchain. Many websites are already managed using a version control system like Mercurial or git, which already keeps hashes of the entire tree of files. One could also create one's own hash of the state of a web site, or a sub-tree of the file system behind it. Then, for example to claim (say) copyright at a certain time of a certain content, you could stick the hash into the blockchain.
and so on.
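The last idea in the list, making one's own hash of the state of a web site, can be sketched as follows. This is a minimal illustration and not the format git or Mercurial actually use: the function names, the choice of SHA-256, and the sample files are all assumptions. The resulting digest is the short value one would then place in a ledger.

```python
import hashlib
from pathlib import Path


def read_tree(root):
    """Map each file's relative path (POSIX style) to its bytes."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): p.read_bytes()
            for p in root.rglob("*") if p.is_file()}


def site_digest(files):
    """Return one SHA-256 hex digest covering a {path: bytes} mapping.

    Paths are visited in sorted order so the same content always gives
    the same digest, whatever order the files were listed in.
    """
    overall = hashlib.sha256()
    for rel in sorted(files):
        file_hash = hashlib.sha256(files[rel]).hexdigest()
        # Bind each file name to the hash of its content.
        overall.update(f"{file_hash}  {rel}\n".encode("utf-8"))
    return overall.hexdigest()


# Hypothetical two-file site, held in memory for the example.
snapshot = {"index.html": b"<p>hello</p>", "css/site.css": b"p { margin: 0 }"}
digest = site_digest(snapshot)
```

For a real directory one would call site_digest(read_tree("path/to/site")). Any change to any file, or to any file name, gives a different digest.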

The value of bitcoin

Figure: The rise of the value of a Bitcoin against the US Dollar over the last few years to 2017-10-31

To first order, for lots of operations transferring money between different currencies and countries, the value (say in USD) of a bitcoin is irrelevant. For those making a remittance, say of USD from the USA to (say) Haiti, all one needs to do is own some Bitcoin for a minute between buying them in the US and selling them in Haiti.
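The arithmetic behind "to first order, irrelevant" can be checked with a small sketch. All the figures here are hypothetical (the bitcoin prices, the exchange rate to Haitian gourdes, and the amount sent), and the sketch assumes the bitcoin price does not move during the minute of the transfer and that there are no fees.

```python
def remit(usd, btc_price_usd, htg_per_usd):
    """Send USD via bitcoin and return the amount received in gourdes (HTG).

    Assumes a constant BTC price for the duration of the transfer, no fees.
    """
    btc = usd / btc_price_usd            # buy bitcoin in the US
    usd_value = btc * btc_price_usd      # sell it in Haiti at the same price
    return usd_value * htg_per_usd


# Hypothetical prices: the amount received does not depend on them.
low = remit(100, btc_price_usd=500, htg_per_usd=60)
high = remit(100, btc_price_usd=6000, htg_per_usd=60)
```

The price of a bitcoin cancels out; what matters to the sender is only how much it moves while the coins are held.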

Figure: The NASDAQ up to 1999. People bought tech stocks because of the prices they imagined other people would be prepared to pay in future, with little justification from actual revenue of the companies.

So does the current [2017] dramatic rise in the value of Bitcoin have no connection with the use of the Bitcoin Blockchain as a system? On the contrary, there is a strong connection. The promise of rise in value of Bitcoin motivates the miners, who keep the blockchain operating and ensure its integrity. In general you can imagine that people buy bitcoin for at least two reasons

In order to make a transfer like the remittance above
In order to store value, and invest in the possible rise in value of the token.

The second is more effective at raising the cost of the currency, as the first is a temporary need. Investing in bitcoin is, if you like, just to add risk to your life. There is no logic based on future revenue to attribute value to it. Its market value is only the extent to which other people imagine other people imagine other people will value it in future. It is completely speculative.

Systems which rely on the blockchain run the risk of breaking if bitcoin mining stops. The design of Bitcoin ensures that to make each new coin, more energy is needed than to make the last one. So the cost of each coin in energy, and therefore in USD, constantly rises. This, I imagine, is why people behave as though the value of bitcoins will constantly rise. However there is nothing in the math that guarantees that there will be anyone willing to pay that much in, say, US Dollars for it. If miners find they don't make a profit, then they will stop mining. When they stop mining, then presumably the system stops, and no one can use the bitcoin blockchain to trade bitcoin.

Differences between Bitcoin Blockchain and the Web

Blockchain and the web are similar only in that you can store stuff in them and later retrieve it. The web and blockchain are very different.

In a blockchain system, everybody stores everything. It is a distributed ledger where every node in the system stores a copy of the whole chain. The one blockchain is stored by the whole set of servers. Each server is responsible to the same extent for making sure the stuff in the blockchain is stored and available.
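To make "everybody stores everything" concrete, here is a toy hash-linked chain in which three nodes each hold a full copy and can each check it. It is a sketch of the linking idea only: there is no mining, no consensus and no network, and the block layout, node names and payloads are assumptions invented for the example, not Bitcoin's format.

```python
import hashlib
import json


def block_hash(block):
    """Hash a block's content, which includes the hash of its predecessor."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append(chain, payload):
    """Add a block that points at the hash of the current last block."""
    prev = block_hash(chain[-1]) if chain else None
    chain.append({"prev": prev, "payload": payload})


def is_intact(chain):
    """Check that every block still points at the true hash of the one before."""
    return all(chain[i]["prev"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))


ledger = []
for entry in ["alice pays bob 1", "bob pays carol 1", "carol pays dave 1"]:
    append(ledger, entry)

# Every node keeps its own full, independent copy of the same chain.
nodes = {name: json.loads(json.dumps(ledger)) for name in ("n1", "n2", "n3")}

# One node's copy is tampered with; that copy no longer verifies.
nodes["n2"][0]["payload"] = "alice pays mallory 1"
```

Because every node has the whole chain, any of them can detect the altered copy; the cost is that every node has to store, and keep storing, everything.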

In the web, each web site is different. Hardly anybody stores the same thing. Each web site has different authors, different fans, different plans. Different requirements for size, shape and speed of the data.

When you store something on the web, there are three things you need to keep working in order for it to be available in the future:

The Internet infrastructure
The Domain Name System
The web server which you put the stuff on

When you store something in the blockchain, you need three things:

The Internet infrastructure
The community of people and organizations which together store the blockchain
The economic and market conditions around that community to motivate the maintenance of the chain

The persistence on the web server depends on the effort you put in to set it up and have it well hosted, and the money you or whoever it is spends to maintain the server and pay for its connection to the Internet. There are important organizations like the Internet Archive which keep copies of things, but modulo those, the continued operation of the site depends either on an entity, like forbes.com, which sources the material to maintain its own site, or on a company such as facebook, twitter, or github, which serves the data for its users, for the members of its club. This doesn't mean that these will magically be there for ever, as those who plowed their creativity into their AOL Hometown web sites found when AOL turned off Hometown.

For distributed systems like the blockchain, the responsibility for the maintenance of the data you put in them is shared by a single community. When we are talking about the Bitcoin Blockchain, then it is specifically the Bitcoin community. The companies which mine for bitcoin are crucial for the integrity of the system, as without them, there would be no new blocks on which to be able to put more information, and there would be no one checking the integrity of the data, old and new.

If your vision is that everyone will use the same single blockchain, then you are asking them to accept the same "Quality of Service" properties: the same reliability, the same time it takes, the same cost. It is like requiring everyone to join the same one club with the same facilities and opening hours, and all pay the same fee. People are different and in fact want to join different sorts of clubs with very different sorts of facilities and very different fees. If the same Bitcoin blockchain is used by gamers to exchange moves in a distributed game, and retailers for consumer spending, and banks to record the transfer of ownership of real estate, then at some point aren't the banks going to object to maintaining an infrastructure used mostly by gamers, and the gamers object to paying transaction fees used by the banks? Won't the banks spin up a new system (like Ripple, say) where anyone can join so long as you're a bank? Won't the gamers spin up their own Ethereum chains because they can, and they are cheaper, and they don't need persistence?

Looked at in that light, in terms of the social space of people who use it, and the economic space of its service parameters, the Bitcoin blockchain is centralized. It's not like the web, where everyone can make their own web site, in an independent way, and make it as big or as small, and as fast or as slow, as they like.

Instability of Currency

When people invest in a blockchain-based currency, in order to benefit from its later rise in value, they are taking a risk that the currency will drop. A bit like investing in "Dot-Com" startups during the boom, they are giving a currency a value based purely on the imagination that others will in future value it at a given level -- not based on a revenue or interest which the system will provide. Risks include

The value crashing as dot com values did
Those who promote the currency being accused of running a form of Ponzi scheme where you rely on future joiners to make it worthwhile for those joining now, in an unsustainable way

So if you are looking at a blockchain as a place to store your data, be aware that you are connecting it to a financial system, whose continued functioning will be required to keep the data accessible.

Other blockchains in general

When you actually look at building an application to use the blockchain as the place where it stores its data, then there are a few serious issues. Moxie Marlinspike's blog is one of several blogs on the subject. Three of the issues are privacy, speed and transaction cost, and re-centralization.

Blockchains are Public

When you put something on a blockchain, then the way that the blockchain works is that a copy of it is held by every node in the system. So it is very public. If you are using it for claiming a public global digital identity, then that may be what you want. But if you want to use it for something private, like personal data, then this is definitely not what you want.

Yes, you can encrypt it. But encrypting your data for security and then putting it somewhere very public has two problems. One is that even if people can't decrypt it, they can see that you have put something there, which may already be revealing. The other problem is that typically encryption gets easier to crack over time, with faster computers (not to mention quantum) and sometimes discoveries of weaknesses in the algorithms.

So all the stuff on the blockchain can be held encrypted by people just waiting for a time when they are able to decrypt it.

Blockchains are Slow

If you put stuff on the blockchain, it takes a while. You have to come to an agreement with everyone using the chain about what the next block will be.

Blockchains are Expensive

Blockchains work in different ways, but a common theme is 'gas fees'. The tokens you have to spend to

Of other blockchains, there are those which pretty much use the same protocol as Bitcoin, but are a distinct chain, and those which use related sorts of algorithm.

One thing which distinguishes them is whether the crypto token value is tied to a service of value, like computation or storage, such that the protocol automatically guarantees that the service is delivered in return for the coin, and so linking the value of the coin to the costs of providing and value of the service.

Re-centralization

If the actual way a practical blockchain app gets to put something in the chain is through an online service -- the blockchain code doesn't run on the user's computer, but on one of a small number of portals -- then the system isn't really decentralized, in effect. The monopoly portals are back in control.

Filecoin

Filecoin is a cryptocurrency, from the designers of IPFS, in which the currency value is related to the amount users are willing to pay for, and providers willing to provide, two services. One is the storage of information, and the other is the retrieval of information. So you can go to a storage provider with your encrypted family photos, you specify a storage time, pay some Filecoin and then the protocol provides, as a property of the protocol, that the storage provider will store the data for the time given. Well, it provides that in order to remain a player in the system, it must.

You might ask, what happens if you can afford to store your stuff, but later on the market changes and you, or your readers, can't afford to access it? When you buy into the system, you are not guaranteed that the Filecoin world will exist, but that if it does, your data will be stored.

Decentralized non-blockchain protocols

These are only summarized rather than elaborated in depth. If you are thinking of storing things on a blockchain, then is one of these in fact what you need?

Distributed Hash Tables (DHT)

Collaborating parties make their data into chunks each of which is hashed and then stored at a server chosen by indexing into the list of servers with the hash.
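The one-sentence description above can be sketched quite literally: cut the data into chunks, hash each chunk, and use the hash as an index into the list of servers. The server names, chunk size and helper names are assumptions for the example, and the simple modulo indexing is a swapped-in simplification; deployed DHTs generally use schemes that cope with servers joining and leaving.

```python
import hashlib

# Hypothetical servers taking part in the table.
SERVERS = ["s0.example.org", "s1.example.org", "s2.example.org", "s3.example.org"]


def chunks(data, size):
    """Cut data into pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def server_for(chunk_hash, servers):
    """Pick a server by indexing into the list of servers with the hash."""
    return servers[int(chunk_hash, 16) % len(servers)]


def put(data, stores, servers, size=8):
    """Store each chunk at the server its hash selects; return the chunk keys."""
    keys = []
    for chunk in chunks(data, size):
        key = hashlib.sha256(chunk).hexdigest()
        stores[server_for(key, servers)][key] = chunk
        keys.append(key)
    return keys


def get(keys, stores, servers):
    """Fetch each chunk from the server its key selects and reassemble the data."""
    return b"".join(stores[server_for(k, servers)][k] for k in keys)


stores = {s: {} for s in SERVERS}   # each server's local key/value store
keys = put(b"collaborating parties share this data", stores, SERVERS)
restored = get(keys, stores, SERVERS)
```

Unlike a blockchain, no single server holds everything: each chunk lives only where its hash sends it, so anyone holding the list of keys can find the pieces again.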

The InterPlanetary File System (IPFS)

The "InterPlanetary File System" [sic] is a project which allows a community to share immutable files indexed by their hashes.

IPFS can be used with a URL scheme, ipfs:, which a small browser extension enables.

There is also an IPFS HTTP gateway.
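The idea of immutable files indexed by their hashes can be shown with a toy in-memory store. This is only a sketch of content addressing in general: the class name and the use of a bare SHA-256 hex string as the address are assumptions, and real IPFS addresses have their own format which is not reproduced here.

```python
import hashlib


class ContentStore:
    """A toy content-addressed store: a file's key is the hash of its bytes."""

    def __init__(self):
        self._blobs = {}

    def add(self, data):
        """Store the bytes and return their address."""
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key):
        """Return the bytes at an address, re-checking them against it."""
        data = self._blobs[key]
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
address = store.add(b"family photo bytes")
same_address = store.add(b"family photo bytes")           # same content, same address
other_address = store.add(b"family photo bytes, edited")  # new content, new address
```

An address can never come to mean different content later, which is what makes the files immutable; an edited file is simply a new file at a new address.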

Using the web as a Ledger

Now let us look at doing, on the web, some of the things which people do on blockchain.

Using an arbitrary web page as the head of a ledger list

If you are prepared to trust a particular social entity with the head of your ledger, you can of course put it on their web site. It may not be as "decentralized" as putting it on the bitcoin blockchain, but it will delegate the job of keeping the list to a known party. So one way of working is to use a trusted web site as your ledger. Then you can base all the typical blockchain operations on it, such as notarizing transfers of ownership or money, staking first claims to unique names, and so on.
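Two of the operations mentioned above, first claims to unique names and transfers of ownership, can be sketched on top of a plain append-only list kept by one trusted party. The Ledger class stands in for whatever that party publishes behind its web page; the entry layout, function names and owner labels are assumptions made for illustration.

```python
class Ledger:
    """An append-only list kept by one trusted party."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)
        return len(self._entries) - 1       # position of the entry in the ledger

    def entries(self):
        return tuple(self._entries)         # read-only view for everyone else


def owner_of(ledger, name):
    """Replay the ledger in order to find who currently holds a name."""
    owner = None
    for e in ledger.entries():
        if e["name"] != name:
            continue
        if e["type"] == "claim" and owner is None:
            owner = e["owner"]
        elif e["type"] == "transfer" and e["from"] == owner:
            owner = e["to"]
    return owner


def claim_name(ledger, name, owner):
    """First come, first served: refuse a claim on a name already held."""
    if owner_of(ledger, name) is not None:
        return False
    ledger.append({"type": "claim", "name": name, "owner": owner})
    return True


def transfer_name(ledger, name, current, new):
    """Record a transfer, but only if `current` really holds the name."""
    if owner_of(ledger, name) != current:
        return False
    ledger.append({"type": "transfer", "name": name, "from": current, "to": new})
    return True


ledger = Ledger()
first = claim_name(ledger, "alice", "key-A")              # accepted
second = claim_name(ledger, "alice", "key-B")             # refused: already claimed
moved = transfer_name(ledger, "alice", "key-A", "key-C")  # accepted
```

All the trust sits with whoever keeps the list: they must not reorder, drop or rewrite entries. That is exactly the job being delegated to a known party.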

Digitally signed linked data

It is straightforward to digitally sign data. You can sign a serialized document and convey or publish that serialization, or you can canonicalize it and sign the canonicalized data model: canonicalized RDF (or XML or JSON). You can then chain together a series of signed documents, each with a URI of, and also typically a hash of, the ones to which it refers and on which it depends. The web of linked data is particularly suitable for this of course, and a read-write store of data allows applications which operate by making chains (or in general directed acyclic graphs) of digitally signed assertions to flourish. (See the PaperTrail architecture in these notes.)
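A chain of signed documents, each carrying the URI and hash of the one it depends on, might look like the sketch below. Two simplifications are assumed and should be read as such: the "canonical form" is just JSON with sorted keys, not a real RDF or JSON canonicalization algorithm, and HMAC with a shared secret stands in for a public-key signature, because the Python standard library has no public-key signing. The URI, key and statements are hypothetical.

```python
import hashlib
import hmac
import json


def canonical(doc):
    """Crude canonical form: sorted keys, no optional whitespace."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(doc, key):
    """Stand-in for a public-key signature (HMAC needs a shared secret)."""
    return hmac.new(key, canonical(doc), hashlib.sha256).hexdigest()


def verify(doc, signature, key):
    return hmac.compare_digest(sign(doc, key), signature)


def link_to(uri, doc):
    """A reference carrying both the URI and the hash of the target document."""
    return {"uri": uri, "sha256": hashlib.sha256(canonical(doc)).hexdigest()}


def link_holds(link, doc):
    """Check that a fetched document is the one the link was made for."""
    return link["sha256"] == hashlib.sha256(canonical(doc)).hexdigest()


KEY = b"demo-secret"    # hypothetical signing key
first = {"statement": "Alice owns plot 7"}
second = {
    "statement": "Alice transfers plot 7 to Bob",
    "dependsOn": link_to("https://alice.example/notes/1", first),
}
sig1 = sign(first, KEY)
sig2 = sign(second, KEY)
```

Since the second document's signature covers the hash of the first, quietly editing the first document breaks the link even if the page at that URI is still served.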

Trees of hash-addressed data in the web

Many of these systems refer to immutable data by its hash. IPFS, for example, and the immutable part of MaidSAFE. But of course hashed trees of immutable data are no stranger to the web, and much of it is underpinned by git and mercurial repositories. Here a hash is used to refer, securely, to a given specific version of a repository. So the web is full of Merkle trees, which have similar properties to IPFS. An interesting possibility is to extend HTTP to surface the Merkle tree, so that versions, or immutable parts, of the existing web can be referred to in a secure way. This would allow a client, for example, to check out a version of a subtree of a web site, and load it into, or request it from, IPFS. This connects to the Memento framework for tracking the history of web sites. Basically, any tree of data on the web which is immutable can be secured and referred to by a hash, and this includes data, like the messages from past chats in a Solid Pod, which was once mutable but is then declared immutable.
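The property that makes these trees useful can be seen in a small sketch: the hash of a directory is computed from the names and hashes of its children, so one hash at the top pins down the whole tree, while an unchanged subtree keeps its hash from version to version. The nested-dict representation, the "blob:" and "tree:" prefixes and the sample site are assumptions for the example; this is not git's actual object format.

```python
import hashlib


def merkle(node):
    """Hash a tree given as nested dicts with bytes at the leaves."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"blob:" + node).hexdigest()
    # A directory's hash covers each child's name and each child's hash.
    listing = "".join(f"{name}:{merkle(child)}\n"
                      for name, child in sorted(node.items()))
    return hashlib.sha256(b"tree:" + listing.encode("utf-8")).hexdigest()


# Two hypothetical versions of a site; only index.html differs.
site_v1 = {
    "index.html": b"<h1>Home</h1>",
    "chat": {"2017-10.txt": b"old messages", "2017-11.txt": b"newer messages"},
}
site_v2 = {
    "index.html": b"<h1>Home, updated</h1>",
    "chat": {"2017-10.txt": b"old messages", "2017-11.txt": b"newer messages"},
}

root_v1 = merkle(site_v1)
root_v2 = merkle(site_v2)
chat_v1 = merkle(site_v1["chat"])
chat_v2 = merkle(site_v2["chat"])
```

Here the two roots differ, but the chat subtree has the same hash in both versions, so a client could refer securely to that immutable part on its own.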

Using existing web architecture in a less centralized way

If you want to use the web to store stuff, then the weak point, the main centralization, is the fact that you have to get a domain name. If you don't get your own domain name (like alice.com), then you end up with your data stored at a URL which includes the domain name of your ISP (like alice.myisp.com). The latter means you are bound to using the same ISP forever (unless they arrange forwarding to a new provider). So all efforts to make it easier for people to get a domain name and then establish a web presence there, like a Solid POD, are useful. As are top level domains which respect their users.

Conclusion

Blockchain and the crypto currency protocols solve some interesting and useful problems, but none of them is the universal panacea which some have been looking for to fix our dependency on huge monopoly platforms. It is possible to switch certain functions, like website domain names, or personal identities, to blockchain-based protocols, but when that is done, the world has to be aware of a new dependency on the community which runs that system, as we did with DHTs. Economic models for the support of the system need to be elaborated, be transparent and well understood. But in general use as a place for Web apps to store data, blockchains are too slow, too expensive, and too public.

References

Marlinspike, Moxie, My first impressions of web3. Essential reading.
Currency charts at XE.com
Ripple.com home page
Filecoin home page
IPFS home page
Distributed hash table (DHT) in Wikipedia
Memento at W3C
RFC7089 HTTP Framework for Time-Based Access to Resource States -- Memento
The World's First Autonomous Data Network: The SAFE Network
I want an iPhone 4, YouTube
Fidelity Australia, "The Nasdaq: Will history repeat or will it rhyme?"

Updates

CryptoSlate, 2022-12-12: BTC is now cheaper than the all-in-sustaining cost of mining BTC

Tim BL

Conceptual Graphs and the semantic Web

timbl@w3.org (Tim Berners-Lee) — Mon, 01 Jan 2001 00:00:00 GMT

To put it in a nutshell, Conceptual Graphs (CGs) are a logic language used for describing closed worlds of logic. They have traditionally had a strong emphasis on two-dimensional graphical representations, but there are conventional serializations: one, "Linear Form" (LF), quite comparable with N3, and one, CG Interchange Format (CGIF), which is more official. With various pros and cons, they are basically as expressive as KIF -- and so in a way only have to be webized to be a basis for the Semantic Web.

Here I go over a few differences and similarities between CGs and Semantic Web work based on RDF.

I will ignore completely "nonsemantic information" ([1], sec. 2) in this short comparison.

Webizing CGs

Let's take the principles of webizing a language and look at how that applies to CGIF or LF, to imagine a semantic web based on CGIF.

The first thing we clearly have to do is modify the CG syntaxes so that each concept and each relation can be a first class object, by having a URI. The syntax modification is just to allow the characters in a URI to be included, so that an arbitrary concept can be referenced, or an arbitrary relation used. A typical way to map URI space to CG identifiers would be to make the URI of a CGIF identifier a concatenation of the URI of the CGIF document, a hash sign, and the local CG identifier -- making the existing local identifier a fragment identifier in URI terms.
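
The concatenation described above can be sketched in a few lines of Python. The function names and the example.org document URI are hypothetical, chosen only for illustration; the standard library call urldefrag does the splitting at the hash sign.

```python
from urllib.parse import urldefrag


def cg_identifier_uri(document_uri: str, local_id: str) -> str:
    """Build the URI of a CG identifier: document URI + '#' + local identifier."""
    base, _ = urldefrag(document_uri)  # drop any fragment already present
    return f"{base}#{local_id}"


def split_cg_identifier_uri(uri: str) -> tuple:
    """Recover (document URI, local CG identifier) from a webized identifier."""
    base, fragment = urldefrag(uri)
    return base, fragment


# Hypothetical CGIF document containing a local identifier "Boston".
boston = cg_identifier_uri("http://example.org/graphs/travel.cgif", "Boston")
```

Because the local identifier becomes the fragment, a graph in one document can reference a concept or relation defined in another simply by quoting its full URI.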

Having mentally webized the language, the question is then how such a semantic web language maps onto, say, other languages. This is simplified by the fact that the CG spec [1] gives a mapping to KIF.

Types and Classes

CGs and RDF share the concept of type. CGs have the restriction that the worlds of concepts and types, and that of relationships and relationship types, are disjoint. Therefore, you cannot use a CG to express something about a relation using a relation. If one wanted a true bidirectional mapping, then CGs would have (it seems at first reading) to more or less reify -- to describe at a meta level -- an arbitrary RDF graph. However, this would not in my opinion be useful. The designers of CGs intended this disjunction, and so the natural mapping is directly from CG concept types to RDF Classes, and from CG relations to Properties, and from CG Relation Types to RDF Classes which are subclasses of rdf:Property.

The semantic web logic language has to be universal in that it must allow expression of any other language; but it certainly does not force every language to be universal itself.

Centralized Notions in CGs

The CG concept of a knowledge base (KB) contains a few centralized ideas. These are not in fact architectural problems with CGs - they are just engineering decisions which were made without the web scaling requirement. Removing them does no damage to the CG idea at all.

The ideal of a closed knowledge base, especially that there is a single catalog of all individuals. A KB contains a hierarchy of types, a hierarchy of relations, and a central catalog of individuals. The hierarchies are not trees, but acyclic graphs, so they do not pose a problem above the fact that they are closed - A KB must
The fact that a concept is associated with a single type. In the semantic web, though the original creator of a Thing may define a type, logically statements made by third parties can equally well make type assertions about a thing, and those statements may be in the form of an rdf:type statement.
A coreference set has to have a single dominant concept.

Properties and relations

The main difference which stands out at first reading is that RDF properties are always dyadic, while CG relations can be N-adic.

The RDF base model, and the N3 method of extending it to a logical framework, seem to be supported as a base structure, although the lack of N-ary forms shows up as a mismatch, but the existence of arcs explicitly in the CG model of an N-adic relation suggests a natural mapping back into dyadic RDF when n>2. This just leaves a little tension as the two forms coexist.
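
One way to picture the mapping from an N-adic relation back into dyadic RDF is to give the relation instance a node of its own and turn each numbered arc into one triple. The following Python sketch does that. The property names cg:arc1, cg:arc2 and so on, the blank node label and the :Between example are assumptions made for illustration; they come from neither the CG spec nor RDF.

```python
def nary_to_triples(node_id, relation_type, arguments):
    """Turn one N-adic relation instance into dyadic (subject, property, object) triples.

    node_id stands for the relation instance itself; each argument is
    attached to it by a property named after the position of its arc.
    """
    triples = [(node_id, "rdf:type", relation_type)]
    for position, argument in enumerate(arguments, start=1):
        triples.append((node_id, f"cg:arc{position}", argument))
    return triples


# A hypothetical triadic relation: a person between a rock and a hard place.
triples = nary_to_triples("_:b1", ":Between", ["<#Rock>", "<#HardPlace>", "<#Person>"])
```

Each resulting triple is an ordinary dyadic statement, so an N-adic relation costs one extra node and N+1 triples.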

The CG world is a bipartite graph - one composed of two kinds of node, relations and concepts, which are disjoint. The RDF world, while it does consist of links which can be thought of as going from a thing, via a property, to a thing, does not make properties and things disjoint. Everything is a Thing.

Striking similarities

Some similarities of the CG work and the semantic web to date are striking. Both are inspired largely by circles and arrows diagrams, and in LF and N3 this even shows through in some syntactic forms. People have through the ages been writing circles and arrows on whatever material they had to hand [Enquire, cavewriting] and in N3 I tried to take this very simply into unicode with

w3c:Michael  >- org:member -> w3c:team .

There was a certain feeling of recognition on seeing John Sowa's

[Go]-
   (Agnt)->[Person: John]
   (Dest)->[City: Boston]
   (Inst)->[Bus].

which in N3 would be

@prefix : <#>.
[a :Go]
   >- :agent -> [a :Person; = <#John>];
   >- :dest -> [ a :City; = <#Boston>];
   >- :inst -> [ a :Bus].

which is remarkably similar, down to the final period. Both syntaxes also have backward arrows: a <- (p) <- b in CG's LF, and a <- p -< b in N3 (the same in RDF).

Contexts

The concept of "context" occurs very equivalently in CGs and N3, where in both cases a formula is built using quotation. In N3, the braces were introduced to encapsulate a set of information and talk about it as a set. Using an example from [1], loosely "Tom believes that Mary wants to marry a sailor":

[Person: Tom]<-(Expr)<-[Believe]->(Thme)- [Proposition: [Person: Mary *x]<-(Expr)<-[Want]->(Thme)- [Situation: [?x]<-(Agnt)<-[Marry]->(Thme)->[Sailor] ]].

In N3 this would be, mapping dyadic relations to RDF properties,

<#Tom> a :Person; :believes [a :Proposition; = { <#Mary> a :Person; :wants [ a :Situation; = { <#Mary> :marriedTo [ a :Sailor ] } ] } ].

(In the above, the "=" is a statement of equivalence which makes up for the inability otherwise of N3 syntax to allow an anonymous context to be subject and object of a statement.) In RDF, my own style is to assume that often the type of a thing, when it can be deduced from the predicate's range or domain, should not be stated explicitly. For example, the object of any believes may be a proposition, and the object of any wants may be a situation. So an N3 expression of the above in practice might be more like:

<#Tom> :believes { <#Mary> :wants { <#Mary> :marriedTo [ a :Sailor ] } }.

I leave aside the question of whether this is a good model for the English sentence, and a lot of philosophy and linguistics (which I generally avoid by not trying to express natural language). The CG world often uses diagrams, such as the one given in [1], to describe the above formula.

In N3, the circle-and-arrow diagram I would draw would include an arrow from the rectangle for the situation to the [circle] for the marriage to indicate that there is a universal quantification there.

There are other mappings which one could make, none of which give quite such a neat result. One mapping of CGs to RDF would map the CG arcs to RDF properties, which for the above would be:

[ a :Belief; :expr <#Tom>; :thme [ a :Proposition; = { [ a :Want; :expr <#Mary>; :thme [ a :Situation; = { [ a :Marriage; :agent <#Mary>; :thme [ a :Sailor ] ] } ] ] } ] ].

In English this would be, "There is a belief, experienced by Tom, that 'there is a want, felt by Mary, that there should be a situation: Mary is married to a Sailor'".

Quantifiers and Lambda

I have not gone into the comparison in great detail in this area. Both N3 and CGIF have existential and universal quantification, though in CGIF universal quantification falls in an area of the spec still under development, called "defined quantifiers". Both have, like RDF, implicit existential quantification from anonymous nodes.

A question I did not resolve in CGIF is how one can determine the scope of a quantifier introduced using the "?x" and "*x" terminology. There was a clarification in [1] that (I think) universal quantifiers have a higher scope than existentials of the same scope -- the same convention as in N3. In N3, in the model, one has to link the quantified variable directly to its scope context using a log:forAll or log:forSome statement.

N3 has no Lambda as such. One can define the meaning of a new term (a Property or small set of related properties) by giving a double implication with the equivalent formula, using universally quantified variables for the formal parameters.

The issues faced in the two designs do appear to have a high overlap. The semantic web has to work also in an open context, defining the meaning, if any, of a nested expression when referred to out of context.

Conclusion

Conceptual Graphs are easily integrated with the Semantic Web as it is, the mapping being apparently very straightforward. The export of a CG in CGIF or LF into N3 looks to be a suitable exercise for the reader ;-). An interesting and more challenging exercise would be to build a CG machine -- and a modified CG syntax -- which can import a graph containing URIs which reference external concepts. The problem that relation types in CGs are not concepts is not huge, as there are many systems - especially ontological systems - which have similar restrictions and with which interchange would be possible.

There is an interesting subset of CGs, called "simple graphs", which are all one context, with no negations or "defined quantifiers", but which can contain universal quantifiers, and these map directly into the RDF M&S 1.0, or N3 without braces.

The RDF base model, and the N3 method of extending it to a logical framework, seem to be supported as a base structure, although the lack of N-ary forms shows up as a mismatch.

All in all, there is a huge overlap, making the two technologies very comparable and hopefully easily interworkable.

Charlie: An AI that works for you

timbl@w3.org (Tim Berners-Lee) — Sun, 01 Jan 2017 00:00:00 GMT

Charlie works for Bob

Bob was fed up with the AIs around him (Alexa, Siri and so on) who all seemed to work for other people, and so he got Charlie. Charlie is an AI and Charlie works for Bob. Because Charlie works for Bob, Bob gives Charlie access to much more data than he would another AI.

- Charlie, who do you work for?

- I work for you, Bob

- Good morning, Bob. Good to see you on the exercise bike. Your fitness goals are on track. In fact, because the meeting was moved, do you want to stretch this to the full hour? We can do some climbs and have time to unwind.

- Ok, sure

- Ok, so let’s start warming up at 100 cadence, and we can go over a few things. Overnight a bunch of offers came in for your art but as far as I could tell none of them really make sense to you after you’ve paid the fees. I just invested a little in one new start up, mainly because it will give you something else in common with your mother-in-law. Speaking of relatives, you have quite a bunch of vegans coming on Saturday. I took the liberty of making up a recipe for the thing you really liked at the Indian cafe the other day.

- You made up a recipe?

- Well, he hadn’t published that one, but he has published a dozen books so I read those as a training set and then extrapolated how he would cook the menu you liked. Then compared it with the Linked Open Recipe data, and adjusted it a bit for the way you like things. So I propose to get the food from Whole Foods, Waitrose, and the Farm: we can get the best stuff and save 12% on the bill. OK?

- Ok, Charlie, let’s go for it

- Ok, the recipe is in your calendar. I want to leave you to get into your workout now. When you are done there are two things: a new briefing for your meeting today, and the upcoming family birthday presents. I’ve found a bunch of things but I’m not sure they are right, so I want you to look at them. OK?

- Ok, Charlie. Who do you work for?

- Legally, ethically and algorithmically, I work for you, Bob.

Two things to notice about Charlie.

Charlie works for Bob, and so Bob trusts Charlie

Because Bob trusts Charlie, Bob gives Charlie access to any and all of the data in his life - financial, health, social, etc. Because Charlie gets access across the board, Charlie does a better job, and so Bob trusts him more.

Data is always more powerful when it is joined with other different data to give new insights.

Currently Facebook makes insights about the likes and habits of its members. Here, Charlie is getting the insights on behalf of Bob.

How could this happen? How could Bob get to the point where he has access to the data?

We may be in for a massive disruptive backlash, following Cambridge Analytica, in which people demand access to their own data.

In the banking sector in the UK, this has already happened in Open Banking, where consumers can use their data with all kinds of apps and services. The positive effects of this may spawn similar rules in other fields. The GDPR rules in Europe basically call for the sort of thing Charlie needs.

We have [2017] a project at MIT (called solid.mit.edu) where we build apps which actually run so as to store their data in one or other data store which a user points them at. So whether it is event planning or bridge building, the actual data of the creative and collaborative things Bob does are created immediately in a place over which Bob has complete control. Bob has complete control of all his data.

So Bob may end up getting his data because he gets mad and demands it, because regulations grant it to him, or because in a new architecture it has always been his.
Bob is empowered because he can share his data with whoever he likes.
Bob is empowered because he can use all kinds of very powerful apps, including Charlie
A new very different vision of the world.
A more empowered humanity

Do join me in building it.

Update

[2023] Since that piece was written, a couple of things have changed. The Solid platform has gone from being a project at MIT to being a significant movement of new standards for personal data and individual sovereignty over that data. Large corporations and governments, and organizations in the public interest have in different but complementary ways started to roll out Solid for citizens and consumers.

And AI systems based on Large Language Models have demonstrated that a fluid conversation with a human is now a thing AI can do, rather than a thing AI can't do. Now that Solid gives us the third layer of the web, we have a common standard which allows people not only to look at an aggregation of data within an individual's Pod, but also to run machine learning and other insight-extraction systems over a set of Pods, while preserving the privacy of the individual.

[2023-10] Inflection AI, a company founded in 2022, released their Personal Intelligence (pi.ai) product in May 2023. You can have a private conversation with it about your personal issues, though it does not have access to your personal data.

Tim BL

Solid: Socially Aware Cloud Storage

timbl@w3.org (Tim Berners-Lee) — Mon, 17 Aug 2009 00:00:00 GMT
There is an architecture in which a few existing or Web protocols are gathered together with some glue to make a world wide system in which applications (desktop or Web Application) can work on top of a layer of commodity read-write storage. Crucial design issues are that principals (users) and groups are identified by URIs, and so are global in scope, and that elements of storage are access controlled using those global identifiers. The result is that storage becomes a commodity, independent of the application running on it.
Read whole article...

Connecting the Sciences

timbl@w3.org (Tim Berners-Lee) — Sun, 04 Jan 2004 00:00:00 GMT

Connecting the Sciences

with the Semantic Web

Summary

It is interesting to use the Semantic Web for connecting the sciences because increasingly major problems can only be solved by using many fields at once; and because scientific information naturally tends to be "data", i.e. relational, logical and/or numeric in form, and so Semantic Web technology is easy to apply.

The need

No scientific discipline is an island. The fields of study to which we give names have fuzzy edges, and overlap one another. They are in fact connected in a loose web which evolves with time, as new fields arise, and we change our perceptions of existing ones. Consider physics, physical chemistry, organic chemistry, cell biology, proteomics, genetics, epidemiology, medicine, pharmacology: whereas one might be an expert in one without being an expert in all of them, one typically has to have a knowledge of neighboring fields.

Of the challenges which confront science, many interesting ones, particularly in the study of human biology, seem to require the tracing of pathways through many fields. In searches for cures for AIDS, for cancer, or for new viruses such as SARS, the amount of information to be brought to bear is huge, and spans many disciplines.

Now, naturally, different fields have come up with different ways of modelling their data, different standards for recording it. This makes it very difficult to try out new ideas which cross fields: one has to negotiate for the conversion and transfer of data in each case. This is normal. It takes great time and effort to bring more than one group together to use common data formats and common vocabularies.

The solution

The Semantic Web technology is designed specifically to overcome this problem in a decentralized fashion. That is, it is designed to allow conceptual connectivity between neighboring fields to be set up retrospectively and incrementally. Retrospectively, in that often the modelling has already been done in each field and the data already exists. The overlap of concepts is only partial, but adding the metadata which expresses that overlap where it does exist is valuable. Incrementally, in that one does not redesign the data models at once, but instead works at the interfaces, progressively building links between related concepts.

The Semantic Web languages rise above the level of XML, at which document structure is defined, to the level at which the classes of real things in the field in question are defined, along with the relationships between them and their properties.

Openness

During the early years of the WWW, an element of reluctance was a hesitation by companies to allow information such as their catalogs or parts lists to be available to the general public. This hesitation evaporated when it became clear that only those companies about whose products information was freely available on the web were likely to be involved in any commerce at all. Currently, funders of science have been known to bemoan the disappearance of the original data upon which reports and papers were based. We discovered with the web of human-readable information that much of the benefit was serendipitous: information was used to advantage in ways that its publisher could never have imagined, and the enquirer who started off surfing for a particular solution often finds quite different solutions to that envisaged, not to mention solutions to quite different, but equally pressing problems.

The history of science is peppered with discoveries made serendipitously - from the proverbial bath of Archimedes through the discovery of penicillin, to the discovery of the effect of Viagra. If we are to make new discoveries using information on a huge scale, we will need to emulate the openness of the minds of these researchers by making scientific data available in a Semantic Web so that crazy hypotheses can be tested in a few moments harnessing data from many diverse fields.

Indeed, science itself is not an island, as, for example, an epidemiological survey often yields results when joined with geographical and economic data. The search for a disease outbreak could take one into weather patterns, corporate financial statements, or flight timetables. It is important that the scientific Semantic Web is seen as one interoperable part of the larger Semantic Web.

One particular aspect of openness is the lab notebook. The notebook is by tradition a write-only medium in which the scientist writes what he or she did, the environmental conditions at the time, and the results observed. Often such information fades but occasionally it becomes important after the fact. Semantic Web standards, and the use of Semantic Web-aware instrumentation, may make the recording of these incidental things easier. By analogy with the lab notebook, a research group may keep a lot of metadata which it may not wish to publish at the time, but which may be useful to posterity. For this information, we need to find a suitable policy which works for everyone involved.

In the longer term, the Semantic Web will by its existence highlight issues such as privacy, the anonymizing of clinical trial information, the protection of possibly security-sensitive infrastructure information, the meaning of copyright especially of compilations, and so on.

Early Steps

Although there is much work yet to do in developing Semantic Web technology, basic standards exist. The Web Ontology Language (OWL) allows ontologies to be written so they can be read and processed by machine; the Resource Description Framework (RDF) allows data to be published using OWL ontologies, so that the data itself can be published and re-used by others.

The building of the Semantic Web is a distributed, decentralized task. It behooves those of us who have information which may be useful to others to model it carefully, to discuss ontologies with our neighbors, and expose the information on the web. (It would not be unreasonable to make such publication a condition of funding.) What can be done to encourage this in the early days, to get the snowball rolling?

Firstly, it would be useful to create some simple ontologies for common basic concepts of science. Weights and measures, the periodic table, physical constants, and simple molecules cry out for a standard description. This sort of data would be valuable as a basis for much more complex scientific data, but also would be a great resource for schools. This basic ontology and dataset would also be a service for other fields: one could see chemical data being used as a basis for hazard information, for food and drug information, and for the chemical supply industry, for example.

Secondly, a few example datasets of great general value would be demonstrators of how things should be done, and probably give rise to new tools and experience to be passed on. Geophysical, meteorological, pharmaceutical incompatibility information, and genome data come to mind, but there must be many candidates for early adoption.

Initiatives to bring scientific data to the Semantic Web could originate with individual researchers, funding groups, journals, or scientific associations or academies. If the growth can be compared with that of the early WWW, it will occur wherever an individual person understands the potential long-term global benefit, and so finds a way to put in the short-term effort to make it happen locally.

Conversations and State

timbl@w3.org (Tim Berners-Lee) — Wed, 01 Nov 2000 00:00:00 GMT

See also: Paper Trail - presented as a student project

The basic model of the web is a world of information. Theoretically, it is a mapping between URIs and representations of the resources they identify, and experientially, for a person, a space one can navigate.

Interestingly, trends at the leading edge of user interface development, and in semantic web development, both point to a world which uses a different model. Human interfaces are moving from screens to conversational mode. The semantic web, while very exciting when viewed as a

Human user interfaces use more and more devices such as speech, gestures and so on, which are not screens. What is special about a screen? A screen with a window system presents a large amount of information at the same time to a person. In practice, more or less everything which a person is concentrating on at one time can be presented in its current state. When the number of pixels on a screen broke through a certain threshold (roughly the 640x480 VGA limit) this led to the development of direct manipulation interface metaphors: folders one could open, and drag and drop. The essential thing about this is that the computer is at every instant presenting the current state, whether it or the human is manipulating it. The communication between person and machine is in terms of the mutual manipulation of a shared state. The web was intended to extend that form of communication by mutual manipulation of a shared state to remote human-human interaction. While the tools and protocols have their limitations (see UI) much of its effectiveness derived from this model. Because the fundamental thing is a shared space of information, one can talk about navigation around within the space, and use all the primeval facilities that the human memory has for navigation.

This is all very well, but it was not always so. When computer terminals had only 24 rows of 80 characters, even when they were addressable, there was a tendency for most jobs to use command line interfaces, for example when manipulating files and directories. The interface was conversational, in that the exchanges were small commands and responses. There was a shared abstract state, but it was imagined in the abstract by the person, and held in some unvisualized form by the computer. This too has its advantages, in that the imagination of a person can well exceed (on a good day) the capacity of a screen in its ability to hold complex interrelated structures. The interesting thing is that now there is a tendency to use many devices which do not have the large screen. The screens on cellphones are currently so small that, while one can scale a web page down and adapt it to a small screen, this might be choosing simply the wrong interface metaphor. When the audio phone only is used, then the shared state becomes zero and the interface is completely conversational again.

The characteristic of a conversation is that the state is the set of utterances, or messages, which have been conveyed. This is different from a shared expression of a commonly agreed state. The Paper Trail concept links these two models in the Semantic Web, by formally defining the overall agreed state as a function of messages to date. A service which allows a phone user to browse the web converts the other way: it conveys part of the space of information by means of a conversation. It is important for a number of reasons.

It allows us to formalize the models of human-machine interface which are in fact conversational for many non-screen devices;

It allows us to formalize social, for example commercial, transactions for which the paper trail is in fact the most accurate model anyway;

It provides us with tools we can use for formally analysing the infrastructure protocols such as HTTP with which the information world is actually implemented in practice.
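
The idea of defining the agreed state as a function of the messages to date can be sketched as a fold over the trail. This is only an illustration under assumed conventions: the message shape (kind, key, value) and the two kinds "set" and "delete" are hypothetical and are not taken from the Paper Trail work itself.

```python
from functools import reduce


def apply_message(state: dict, message: tuple) -> dict:
    """Return the state which results from one more message in the trail."""
    kind, key, value = message
    new_state = dict(state)  # never mutate the earlier state
    if kind == "set":
        new_state[key] = value
    elif kind == "delete":
        new_state.pop(key, None)
    else:
        raise ValueError(f"message kind not valid in this protocol: {kind!r}")
    return new_state


def state_from_trail(messages) -> dict:
    """The agreed state is a pure function of all messages conveyed so far."""
    return reduce(apply_message, messages, {})


# A hypothetical commercial conversation.
trail = [
    ("set", "item", "book"),
    ("set", "quantity", 2),
    ("set", "gift_wrap", True),
    ("delete", "gift_wrap", None),
]
agreed_state = state_from_trail(trail)
```

Two parties who hold the same trail compute the same state, and rejecting a message of unknown kind is a crude stand-in for validating that a sequence of messages is a valid run of the protocol.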

The standardization of XML protocols has, with XML (and RDF), a richness in terms of marshalling data formats to build on, and, with xml-schemas, xforms and rdfs, a richness to draw on in terms of languages for defining valid documents, but has no basis yet for defining with equivalent power the validity (and semantics) of a sequence of interrelated messages which are a protocol.

It is not as though the web today itself perfectly matches the stateless model at all. The moment it was created as a basically stateless system, many web site designers took it as their challenge to get around this model in order to create a conversational interface -- and many still do. Our concerns about privacy stem largely from the knowledge that our "reading" of documents is in fact done by a series of protocols which leave a trace. The P3P project involves quantifying the information transfer which actually takes place. Our handling of HTML forms is getting more complex, and a form itself becomes, on many sites, the definition of a protocol - a set of valid sequences of information actions.

@@ - already web privacy concerns come from in fact it being a conversation == there is implicit state. A

@@ Reasons for formalizing protocols a la Paper Trail: uses concepts of validation and will be able to reuse tools - extends semantics of documents to semantics of conversations. - Creates a formal basis for defining conversational systems of all kinds, including indirectly human language oriented systems.

@@ Machine-machines and human-human convergence

Cultures and boundaries

timbl@w3.org (Tim Berners-Lee) — Mon, 01 Jan 2007 00:00:00 GMT

When a group of people communicate amongst themselves, they develop, to a certain extent, their own language. Sometimes, they pick terms understood by one party, talk enough to develop a shared understanding of the meaning of the term, and adopt it across the group. Sometimes, as discussion proceeds within the group, meanings are adjusted so that they can be used for new concepts which are created or discovered by the group's activity. Sometimes, a group will deliberately and quite specifically make up a new term, choosing it to be hopefully different from any other word or phrase used before. While this is evident in technical groups, this process also happens in all walks of life, legal and political, as well as social and familial.

The result of this process is a new language, a new strain of language, or just a twist in the use of an existing word. The first and motivating effect of this, large or small, is to enable communication within the group. A greater shared vocabulary broadens the scope of common discussion which can be made without misunderstanding. The second, complementary and inexorable effect of this change is to create a common bond within the group, which at the same time erects a barrier around the group. In most cases, all of this is unintentional. For every linguistic development which promotes communication within the group, a corresponding step change is made in the difficulty of communicating across the boundary, between inside and outside the group.

That which makes a group culture stronger necessarily isolates it from others.

The culture of the group comprises many things, but the common terms and their meaning, and the net of concepts which interconnects those meanings, are a very significant part.

So it is, then, that a working group will, given a free rein, work in relative isolation for several months, and when they have finished have great difficulty explaining the specification documents to their peers outside the group. Often this will be a surprise to the members of the group. They may see those outside the group as rather slow to understand, and those outside the group might see the group as having a tendency to use jargon, or to misuse jargon. It may be worse: those on one side of the boundary may see those on the other as being stupid, malicious, or even heretical.

An incomplete but essential solution to the problem is for those involved to think about what those on the other side of the boundary are thinking. This is hard work. (It involves, we discover, use of specific parts of the brain! [SaxePowell] ). This is the job, in a conversation, of listening, the stuff of most manuals and self-help books on human communication. In a technical setting it can involve a careful study of the words used in the other's seemingly senseless protestations, to build up logically a conclusion of how those words must be related in the other's mind.

The process of forming a common culture for a large community is, therefore, full of this work of listening to others. It slowly builds a new set of common terms. The work of taking a specification from one group and, through review and discussion, getting it to be the subject of consensus in a wider group will typically involve reexamining the terms it uses, and often changing them, as the wider group itself goes through the process which the individuals, or for that matter smaller groups, within it had already gone through. The motivation, for technical specifications, is to get wider interoperability of systems. The motivation, for diplomatic and political things, is to get common decisions, and to reduce global strife.

There is constant tension between the need to get things done quickly with less effort, by working within a small group, and the need to get this wider understanding which takes so much more time.

Now, in practice, life is made up of a fractal tangle of overlapping communities, of overlapping cultures. This means that the tension is ever-present. It also means that there is always a small amount of common language shared by a very large number of people, and always a very large number of concepts local to an individual, and everything in between. In centuries before this one, geography played an important role in constraining groups, and so a nested two-dimensional pattern existed.

With the Internet and the Web, we can connect things without the constraint of these nested geographical areas. We can choose to be a member not just of communities such as town, region, state and country, but of specialists in a given field, or people with a particular medical condition, or people concerned about a particular global issue, worldwide. This means that the topology of the communities, and the connectedness by some metrics, may be different and in fact better than before. The topology which emerges depends on the individual choices of many people. But there is a hunch that a fractal distribution, emphasizing all scales, will be important.

The Semantic Web is a technology engineered specifically for this situation. Terms are defined in ontologies (groups of consistent, related terms). Ontologies are defined by communities. A given person is involved in many communities. A given message will mix terms from many ontologies. A given operation only requires consistency between parts of ontologies which are in use for that operation.
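The last point can be illustrated with a minimal Python sketch. The ontology URIs, the term names and the `can_perform` helper are all hypothetical, invented for this example; the only thing it tries to show is that an operation checks the terms it actually uses, not every ontology which happens to appear in the message.

```python
# A message mixing terms from several ontologies (all URIs are made up).
message = {
    "http://example.org/invoice#total": "120.00",
    "http://example.org/invoice#currency": "EUR",
    "http://example.org/parts#partNumber": "8137498237",
    "http://example.org/medical#condition": "none",
}

# The terms this particular agent understands, grouped by ontology.
known_terms = {
    "http://example.org/invoice#": {"total", "currency"},
}


def split_term(uri: str) -> tuple[str, str]:
    """Split a term URI into (ontology prefix, local name) at the '#'."""
    prefix, _, local = uri.partition("#")
    return prefix + "#", local


def can_perform(msg: dict, needed: set, known: dict) -> bool:
    """True if every term the operation needs is present and understood."""
    for uri in needed:
        ontology, local = split_term(uri)
        if uri not in msg or local not in known.get(ontology, set()):
            return False
    return True


# Two hypothetical operations, each using only part of the message.
pay = {"http://example.org/invoice#total", "http://example.org/invoice#currency"}
ship = {"http://example.org/parts#partNumber"}

print(can_perform(message, pay, known_terms))   # payment needs invoice terms only
print(can_perform(message, ship, known_terms))  # shipping needs the parts ontology
```

The payment operation goes ahead even though the agent knows nothing of the parts or medical ontologies; only the shipping operation is blocked.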

This will promote work toward greater harmonization, but it will not predicate the operation upon the establishment of a global ontology of everything. We know that a single huge ontology of everything cannot be done, as the effort of getting consensus on it becomes unimaginable. We know that stovepipe systems with only local ontologies leave us with communication, and especially the re-use of data, which just does not happen, to our great detriment. And so we engineer specifically for a fractal topology.

References

(There are many books on these topics.)

[SaxePowell] Rebecca Saxe and Lindsey J. Powell, It’s the Thought That Counts: Specific Brain Regions for One Component of Theory of Mind.

Tim BL

General Computation, Digital Rights Management, and FOSS

timbl@w3.org (Tim Berners-Lee) — Sun, 10 Dec 2017 00:00:00 GMT

2013: The discussion of the good and bad of Digital Rights Management software is wide and furious and has been for many years. It connects to the whole issue of how broken copyright law is and how musicians and film producers should be recompensed for their hard work. In the fervent discussion, very extreme positions have been taken, which has led to the debate becoming acrimonious, to the extent that much more heat than light tends to be available. Here we tease out some separate issues which have become entangled.

The Open Source Software right to have, modify and distribute a copy of the source code to a program one is running.

The right to have root [Administrator Privileges] on one's computer, i.e. to be able to completely control what software it runs.

The right to be able to make a copy of something one is listening to or watching.

fair use like quotation, parody, etc

archiving

use on a different machine

The right to be able to sell music or video in an encrypted form so it cannot be stolen.

DRM video in HTML5 is a tricky issue. Around 2013 the W3C community was split as to whether it should be allowed. The Electronic Frontier Foundation, and Cory Doctorow the author and blogger, were adamant that HTML should not allow DRM, as it represented a step toward big company control of computing platforms. It is impossible to build a DRM machine which meets the open source condition that the user can change it.

One argument was at the level of tactics: there are companies who will never put their movies on the net without DRM, so if we don't put DRM hooks in HTML they will just stick with Flash, or force you to install an app, or use a completely closed platform like a set-top box. In other words, they will use a completely locked-down platform. So it isn't as though making DRM more difficult in HTML5 will make DRM go away: it will just force users off the web.

So should W3C just say "DRM is evil we should not collaborate with it in any way" and end up driving people to native apps where the whole app is locked down on a locked down platform, or should we open up a slot in an open system to allow a locked-down system to be accessed?

Looking at the philosophical objections to DRM, there is no perfect solution. Everything violates one or more of the rights we want to preserve.

Some people just feel copyright is wrong and so there is a right to make a copy of anything you see or hear. And the business model for musicians is live gigs and donations.

Some people feel that they are the best judge of when they will copy something, as while they do often pay for music etc. they feel (a) copyright law has been twisted and applies e.g. to 30-year-old movies when it should not, and (b) DRM is too extreme as it prevents normal things like fair use and backups, and typically fails in the future when the DRM support system has changed and all your archive files become unusable.

Some people may feel that DRM is worth the value of having a thriving music industry and film industry. They are not too fussed about the archive issue as they don't really watch movies they have bought a long time ago, or they are too young to have experienced the problem. They haven't answered the problem of getting money to for example bands and singer-songwriters who do not have the blessing of a DRM distribution channel.

Some people may feel that while they don't specifically want to steal movies, they do object to having any bit of computational hardware which they don't have root on.

A related question is, what sort of systems can we build to help people give money to those who e.g. write or perform music, with or without DRM, in an open market, with no 3rd party gatekeeper?

A decade on

So W3C did allow encrypted media to be played in the browser, by standardizing Encrypted Media Extensions (EME). This allows the web site to get access to one of a small set of DRM gadgets on the device -- gadgets which the user has no control of.

Looking back in 2023, there has been a huge amount of streaming on the web and off. EME in HTML has been used massively. A certain notable part of it is user-generated content like YouTube, Vimeo, and TikTok. A huge amount is commercial movies, typically nowadays 4k resolution, and TV series, some short limited set of episodes as a genre competing with full length movies; some going on many series.

There is [still] a constant competition between web sites and apps. You can follow a link to a video clip on the web, and watch it on the web, but Netflix, YouTube, Apple, etc. will always try to get you to switch to the app so they have more control of your environment, and can store lots more data on your device.

When you share a video clip in the app, it typically generates a link which will take the recipient back to the web version. You can configure your Operating System so that it will recognize links direct to the app.

As a developer, I can still develop code on my Apple laptop and still install open source code and random apps written by other people on it.

As an artist, there is no way for me to make my own DRM platform if I want to protect my material. So artists have to find a route to the big platforms. Currently the big streaming platforms are infamous for returning very little money to the original artists. I can make my music or video available on my website, and give it away for free, or charge for it without protecting it. Patronage, live gigs and merchandise sales may provide revenue.

The RDF-diff problem

timbl@w3.org (Tim Berners-Lee) — Mon, 01 Jan 2001 00:00:00 GMT

Abstract

The problem of updating and synchronizing data in the Semantic Web motivates an analog to text diffs for RDF graphs. This paper discusses the problem of comparing two RDF graphs, generating a set of differences, and updating a graph from a set of differences. It discusses two forms of difference information, the context-sensitive weak patch, and the context-free strong patch. It gives a proposed update ontology for patch files for RDF, and discusses experience with proof of concept code.
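To give a feel for the simplest case only, here is a minimal Python sketch. It assumes ground graphs (no blank nodes), modelled as plain sets of (subject, predicate, object) tuples, and it does not represent the weak or strong patch formats or the update ontology which the paper proposes. The names `diff_graphs` and `apply_patch` and the example triples are hypothetical.

```python
Triple = tuple[str, str, str]


def diff_graphs(old: set[Triple], new: set[Triple]) -> dict[str, set[Triple]]:
    """Return the triples to delete from, and to insert into, the old graph."""
    return {"delete": old - new, "insert": new - old}


def apply_patch(graph: set[Triple], patch: dict[str, set[Triple]]) -> set[Triple]:
    """Apply a patch made by diff_graphs, returning a new graph."""
    return (graph - patch["delete"]) | patch["insert"]


old = {
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:age", "41"),
}
new = {
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:age", "42"),
}

patch = diff_graphs(old, new)
print(patch)
print(apply_patch(old, patch) == new)
```

With blank nodes present, triples can no longer be compared by simple equality, which is where the problem becomes interesting.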

Read whole article...

The Dysfunction of Social Networks

timbl@w3.org (Tim Berners-Lee) — Sat, 27 Jul 2024 00:00:00 GMT

In the early days of the internet, there was an optimistic belief that technology could foster a wise and self-governing global community. However, as social networks emerged and became monopolistic, this ideal was undermined. The 2016 elections in the US and Brexit vote demonstrated how social media could polarize societies and disrupt democratic processes. Then, the Facebook-Cambridge Analytica scandal of 2018 exposed the manipulative power of targeted advertising, sparking further widespread concern over data privacy and the influence of social media on public opinion.
Various mechanisms of social networks contributed to this dysfunction. These include the abuse of personal data through extensive profiling and tracking, the spread of misinformation and clickbait, and the optimization of content to maximize user engagement at the expense of truth and societal well-being. The consequences of these practices are far-reaching, leading to political polarization, mental health issues, and the erosion of trust in democratic institutions.
The response to these challenges must be multifaceted, involving efforts from tech companies, governments, parents, and activists to mitigate the negative impacts of social networks. Crucial elements include transparency, ethical design, and regulatory measures to create a safer and more humane digital environment. We need a collective effort to harness the positive potential of the internet while addressing its darker aspects.

Read whole article...

Intuitive hypertext editing

timbl@w3.org (Tim Berners-Lee) — Wed, 01 Apr 1998 00:00:00 GMT

Cleaning up the User Interface 2: Hypertext editing

Tim BL 3 April 1998

If you think surfing hypertext is cool, that's because you haven't tried writing it. If you have found your bookmarks/favorites have become a more and more important part of your life, that's because you have learned to put up with the simplest form of hypertext editing there is as a compromise. If you are using a really intuitive hypertext editor, then tell me about it.

Hypertext editing

The Web is universal and so should be able to encompass everything across the range from the very rough scribbled idea on the back of a virtual envelope to a beautifully polished work of art.

Somewhere near the "draft" end of the scale is its use as a hypertext communal or personal notebook, which is very close to a major original use of the Web in 1990. In this mode I can browse over notes made by people in my group, and rapidly contribute new ideas.

I'm editing this now on a pretty intuitive editor. AOLPress may not be a top-of-the-line page layout tool but it can do some of the things which my original "WorldWideWeb" program could do. I wouldn't say that either of these programs was the ideal interface, but if you look also at things like KMS and Doug Engelbart's interface, you see that for all the fancy HTML we have nowadays, there is some immediacy we have lost.

Here are some things I would like to be able to do very rapidly. Dan Connolly suggested a click count as a way of measuring the effort, with 10 clicks penalty when you have to think of a filename or anchor ID.

Imagine there's no mode, imagine there's no location

A first assumption, by the way, is that you have a modeless interface in which browsing and editing are not separate functions. If, to edit a page, you have to switch from browsing mode to editing mode, then you have lost already. If you have had to switch to edit mode, and think of a local filename in which to save the file, then you have lost doubly. If you have had to answer lots of difficult questions about where to save absolute or relative links, you have lost yet again and probably messed up the file already! You should not have to think about "where" things are.

Make a link

In WorldWideWeb, you had to

Select the target phrase;

Hit "Command/M" to mark where you were (which generated an anchor with a made-up name, and remembered it);

Switch to the document to contain the link, if different;

Select the text to be linked;

Hit "Command/L" to make the link.

In AOLPress, I can do the same thing except the "Mark" function consists of three steps: Press the "anchor" button, hit return to accept the program's suggested anchor name, and then hit the "copy URL" button.

In a drag-and-drop world, every window should have an icon for the document it holds which can be dragged to make a link. (Later versions of NeXTStep had this with alt/click on the titlebar).

Make a new linked node - Annotate

In WorldWideWeb, this was deliberately easy:

Select a phrase

Hit "Command/N". (A new node is created)

Think of a filename in response to the "SaveAs" dialog box :-(

The new node would be created from a template which could be set up to have your signature at the bottom, etc. The original phrase was automatically linked to the new node. The cursor was left ready for you to type in what you'd just thought of.

In a world with PICS servers, then a neat operation is to annotate a page you don't have access to:

Create a new node somewhere where you have write access

Create a PICS label with a pointer to it

Store the PICS label on the label server as a label about the annotated node.

The XML LINK work will allow, we hope, a link to be made into the middle of an existing unwritable document with some hope of reliability.

Here are a few other operations which would be very useful when you really use hypertext as a thinking tool.

Excerpt

Dan is always asking for this and doing it by hand. I have never seen an editor which will do it automatically (though Dan has found some javascript hacks that work pretty well).

Copy to the clipboard a BLOCKQUOTE with inside it a copy of the selected text, linked back to the original document from which it came. Make the link to an existing anchor in the document if one is there, or else a new one if one can be made, or else the document as a whole failing that.
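As a rough sketch of the excerpt operation, the following Python function builds the BLOCKQUOTE markup. It only produces the HTML string; finding or making an anchor, and putting the result on the clipboard, are left to the editor. The function name `make_excerpt` and its parameters are hypothetical.

```python
from html import escape
from typing import Optional


def make_excerpt(selected_text: str, doc_url: str,
                 existing_anchor: Optional[str] = None,
                 new_anchor: Optional[str] = None) -> str:
    """Build a BLOCKQUOTE linked back to where the text came from.

    Preference order follows the description above: an existing anchor,
    else a newly made one, else the document as a whole.
    """
    anchor = existing_anchor or new_anchor
    target = f"{doc_url}#{anchor}" if anchor else doc_url
    href = escape(target, quote=True)
    return (
        f'<blockquote cite="{href}">'
        f'<a href="{href}">{escape(selected_text)}</a>'
        "</blockquote>"
    )


print(make_excerpt("some immediacy we have lost",
                   "http://example.org/notes.html",
                   existing_anchor="immediacy"))
print(make_excerpt("A & B", "http://example.org/notes.html"))
```

The second call shows the last fallback: with no anchor available, the link points at the document as a whole.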

Insert an image

It's always nice to be able to grab a screen shot or a video frame and insert it into the minutes you are taking of a meeting -- but how many keystrokes does it take?

Evolvability

timbl@w3.org (Tim Berners-Lee) — Fri, 01 May 1998 00:00:00 GMT

Introduction

The World Wide Web Consortium was founded in 1994 on the mandate to lead the Evolution of the Web while maintaining its Interoperability as a universal space. "Interoperability" and "Evolvability" were two goals for all W3C technology, and whilst there was a good understanding of what the first meant, it was difficult to define the second in terms of technology.

Since then W3C has had first hand experience of the tension between these two goals, and has seen the process by which specifications have been advanced, fragmented and later reconverged. This has led to a desire for a technological solution which will allow specifications to evolve with the speed and freedom of many parallel developments, but also such that any message, whether "standard" or not, at least has a well defined meaning.

There have been technologies dubbed "futureproof" for years and years, whether they are languages or backplane busses. I expect you the reader to share my cynicism when encountering any such claim. We must work through exactly what we mean: what we expect to be able to do which we could not do before, and how that will make evolution more possible and less painful.

Free extension

A rule explicit or implicit in all the email-like Internet protocols has always been that if you found a mail header (or something) which you did not understand, you should ignore it. This obviously allows people to add all sorts of records to things in a very free way, and so we can call it the rule of free extension. It has the advantage of rapid prototyping and incremental deployment, and the disadvantage of ambiguity, confusion, and an inability to add a mandatory feature to an existing protocol. I adopted the rule for HTML when initially designing it - and used it myself all the time, adding elements one by one. This is one way in which HTML was unlike a conventional SGML application, but it allowed the dramatic development of HTML.
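The rule of free extension can be sketched in a few lines of Python. This is a simplified illustration, not a real mail parser: the set of known headers, the `X-New-Feature` header and the `parse_headers` function are assumptions made up for the example.

```python
# Headers this (hypothetical) receiver understands.
KNOWN_HEADERS = {"from", "to", "subject"}


def parse_headers(raw: str) -> dict[str, str]:
    """Keep the headers we understand; silently ignore the rest."""
    understood = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        key = name.strip().lower()
        if key in KNOWN_HEADERS:
            understood[key] = value.strip()
        # Anything else is dropped without complaint: free extension.
    return understood


raw_message = "From: a@example.org\nX-New-Feature: 42\nSubject: hello"
print(parse_headers(raw_message))
```

The sender of `X-New-Feature` gets no error and no acknowledgement, which shows both sides of the rule: the extension deploys freely, but nobody can tell whether it was honoured, so it can never be made mandatory.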

The HTML cycle

The development of HTML between 1994 and 1998 took place in a cycle, fuelled by the tension between the competitive urge of companies to outdo each other and the common need for standards for moving forward. The cycle starts simply because the HTML standard is open and usable by anyone: this means that any engineer, in any company or waiting for a bus, can think of new ways to extend HTML, and try them out.

The next phase is that some of these many ideas are tried out in prototypes or products, using the free extension rule that any unrecognized extensions will be ignored by everything which does not understand them. The result is a dramatic growth in features. Some of these become product differentiators, during which time their originators are loth to discuss the technology with the competition. Some features die in the market and disappear from the products. Those successful features have a fairly short lifetime as product differentiators, as they are soon emulated in some equivalent (though different) feature in competing products.

After this phase of the cycle, there are three or four ways of doing the same thing, and engineers in each company are forced to spend their time writing three or four different versions of the same thing, and coping with the software architectural problems which arise from the mix of different models. This wastes program size, and confuses users. In the case, for example, of the TABLE tag, a browser meeting one in a document had no idea which table extension it was, so the situation could become ambiguous. If the interpretation of the table was important for the safe interpretation of the document, the server would never know whether it had been done, as an unaware client would blithely ignore it in any case. This internal software mess resulting from having to implement multiple models also threatens future development. It turns the stable consistent base for future development into something fragmented and inconsistent: it is difficult to design new features in such an environment.

Now the marketing pressure which prevented discussions is off, and there is a strong call for the engineers to get around the W3C table, and iron out a common way of doing things. As this happens, a system is designed which puts together the best aspects of each other system, plus a few weeks' experience, so everyone is in the end happier with the result. The companies all go away making public promises to implement it, even though the engineering staff will be under pressure to add the next feature and start the next cycle. The result is published as a common specification open to anyone to implement. And so the cycle starts again.

This is not the way all W3C activities have worked, but it was the particular case with HTML, and it illustrates some of the advantages and disadvantages of the free extension rule.

Breaking the cycle

The HTML cycle as a method of arriving at consensus on a document has its drawbacks. By 1998, there were reasons to change the cycle. The work in the W3C, which had started off in 1994 with several years' backlog of work, had more or less caught up, and was beginning to lead, rather than trail, developments. The work was seen less as fire fighting and more as consolidation. By this time the spec was growing to a size where the principle of modularity was seriously flouted. Any new developments clearly had to be separate modules. Already style information had been moved out into the Cascading Style Sheets language, the programming interface work was a separate Document Object Model activity, and guidelines for accessibility were tackled by a separate group.

In the future it was clear that we needed somehow to set up a modular system which would allow one to add to HTML new standard modules. At the same time, it was clear that with XML available as a manageable version of SGML as a base for anyone to define their own tag sets, there was likely to be a deluge of application-specific and industry-specific XML based languages. The idea of all this happening under the free extension rule was frightening. Most applications would simply add new tags to HTML. If we continued the process of retrospectively roping these into a new bigger standard, the document would grow without limit and become totally unmanageable. The rule of free extension was no longer appropriate.

Well defined interfaces

Now let us compare this situation with the way development occurs in the world of distributed computing, specifically remote procedure call (RPC) and distributed object oriented systems. In these systems, the distributed system (equivalent to the server plus the client for the web) is viewed as a single software system which happens to be spread over several physical machines. [nelson - courier, etc]

The network protocols are defined automatically as a function of the software interfaces which happen to end up being between modules on different machines. Each interface, local or remote, has a well documented structure, and the list of functions (procedures, methods or whatever) and parameters are defined in machine-processable form. As the system is built, the compiler checks that the interfaces required by one module are exactly provided by another module. The interface, in each version of its development, typically has an identifying (typically very long) unique number.

The interface defines the parameters of a remote call, and therefore defines exactly what can occur in a message from one module to another. There is no free extension. If the interface is changed, and a new module made, any module on the other side of the interface will have to be changed too, or you can't build the system.
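By way of contrast with free extension, here is a minimal Python sketch of a receiver which enforces an interface exactly. Real RPC systems do this check at build time from a machine-processable interface definition; the run-time table below, the interface identifier `urn:example:table-service:v1` and the function names are all hypothetical stand-ins.

```python
# interface id -> function name -> exact parameter names (all made up)
INTERFACES = {
    "urn:example:table-service:v1": {"draw_table": ("rows", "columns")},
}


class InterfaceError(Exception):
    """Raised when a call does not match the declared interface exactly."""


def check_call(interface_id: str, function: str, params: dict) -> None:
    """Accept a call only if it matches the interface; no free extension."""
    interface = INTERFACES.get(interface_id)
    if interface is None:
        raise InterfaceError(f"unknown interface: {interface_id}")
    if function not in interface:
        raise InterfaceError(f"unknown function: {function}")
    expected = set(interface[function])
    if set(params) != expected:
        raise InterfaceError(
            f"{function} takes exactly {sorted(expected)}, got {sorted(params)}"
        )


# A call matching the interface passes silently.
check_call("urn:example:table-service:v1", "draw_table",
           {"rows": 2, "columns": 3})

# An extra, undeclared parameter is rejected rather than ignored.
try:
    check_call("urn:example:table-service:v1", "draw_table",
               {"rows": 2, "columns": 3, "border": 1})
except InterfaceError as err:
    print("rejected:", err)
```

Unlike the ignore-what-you-don't-understand rule, the caller here always learns whether the other side accepted the message.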

The great advantage of this is that when the system has been built, you expect it to work. There is no wondering whether a table is being displayed - if you have called the table module, you know exactly what the module is supposed to do, and there is no way the system could be without that module. Given the chaos of the HTML development world, you can imagine that many people were hankering after the well defined interfaces of the distributed computing technology.

With well-defined interfaces, either everything works, or nothing. This was in fact at least formally the case with SGML documents. Each had a document type definition (DTD) referred to at the top, which defined in principle exactly what could and could not be in the document. PICS labels were similar in that they are self-describing: they actually have a URI at the top which points to a machine-readable description of what can and can't be in that PICS label. When you see one of these documents, as when you get an RPC message with an interface number on it, you can check whether you understand the interface or not. Another interesting thing you can do, if you don't have a way of processing it, is to look it up in some index and dynamically download the code to process it.

The existence of the Web makes all this much smoother: instead of inventing arbitrary names for interfaces, you can use a real URI which can be dereferenced and return the master definition of the interface in real time. The Web can become a decentralised registry of interfaces (languages) and code modules.

The need was clearly for the best of both worlds. One must be able to freely extend a language, but do so with an extension language which is itself well defined. If, for example, documents which were HTML 2.0 plus Netscape's version of tables version 2.01 were identified as such, much of the problem of ambiguity would have been resolved, but the rest of the world left free to make their own table extensions. This was the goal of the namespaces work in XML.

Modularity in HTML

To be able to use the namespaces work in the extension of HTML, HTML has to transition from being an SGML application (with certain constraints) to being an XML based language. This will not only give it a certain ease of parsing, but allow it to build on the modularity introduced by namespaces.

In fact, already in April of 1998 there was a W3C Recommendation for "MathML", defined as an XML language and obviously aimed at being usable in the context of an HTML document, but for which there was no defined way to write a combined HTML+MathML document. MathML was already waiting for XML namespaces.

XML namespaces will allow an author (or authoring tool, hopefully) to declare exactly what set of tags he or she is using in a document. Later, schemas should allow a browser to decide what to do as a fall back when finding vocabulary which it does not understand.
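As an illustration, here is a small Python sketch using the `xmlns` declaration syntax which XML namespaces later standardized, and the standard library's `xml.etree.ElementTree` parser. The two namespace URIs and the element names are made up for the example and are not the real HTML or MathML namespaces.

```python
import xml.etree.ElementTree as ET

# A mixed-language document; both namespace URIs are hypothetical.
DOC = """\
<html xmlns="http://example.org/ns/html"
      xmlns:m="http://example.org/ns/math">
  <p>Area is <m:expr>pi r^2</m:expr></p>
</html>
"""


def namespaces_used(xml_text: str) -> set[str]:
    """Return the namespace URI of every element in the document."""
    root = ET.fromstring(xml_text)
    used = set()
    for element in root.iter():
        # ElementTree writes qualified tags as "{namespace-uri}localname".
        if element.tag.startswith("{"):
            used.add(element.tag[1:].split("}", 1)[0])
    return used


# Namespaces this (hypothetical) browser has code for.
UNDERSTOOD = {"http://example.org/ns/html"}

unknown = namespaces_used(DOC) - UNDERSTOOD
print(sorted(unknown))  # vocabularies needing a fallback
```

Because each element carries the URI of its vocabulary, a browser can tell exactly which language an unfamiliar tag belongs to instead of guessing.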

It is expected that new extensions to HTML be introduced as namespaces, possibly languages in their own right. The intent is that the new languages, where appropriate, will be able to use the existing work on style sheets, such as CSS, and the existing DOM work which defines a programming interface.

Language mixing

Language mixing is an important facility, for HTML, for the evolution of all other Web and application technology. It must allow, in a mixed language document, for both languages to be well defined. A mixed language document is quite analogous to a program which makes calls to two runtime libraries, so it is not rocket science. It is not like an RPC message, which in most systems is very strongly typed from a single rigid definition. (An RPC message can be represented as a structured document but not, in general, vice-versa.)

Language mixing is a reality. Real HTML pages are often HTML with JavaScript, or HTML plus CSS, or both. They just aren't declared as such. In real life, many documents are made from multiple vocabularies, only some of which one understands. I don't understand half the information in the tax form - but I know enough to know what applies to me. The invoice is a good example. Many different coloured copies of the same document used to serve as a packing list, restocking sheet, invoice, and delivery note. Different parts of a company would understand different bits: the financial division would check amounts and signatures, the store would understand the part numbers, and sales and marketing would define the relationship between the part numbers and prices.

No longer can the Web tolerate the laxness with which HTML and HTTP have been extended. However, it cannot constrain itself to a system as rigid as a classical distributed object oriented system.

The note on namespaces defines some requirements of a language framework which allows new schemata to be developed quite independently, and mixed within one document. This note elaborates on the sorts of things which have to be possible when the evolution occurs.

The Power of schema languages

You may notice that nowhere in the architecture do XML or RDF specify what language the schema should be written in. This is because much of the future power of the system will lie in the power of the schema and related documents, so it is important to leave that open as a path for the future. In the short term, you can think of a schema being written in HTML and English. Indeed, this is enough to tie the significance of documents written in the schema to the law of the land and make the document an effective part of serious commercial or other social interaction. You can imagine a schema being in a sort of SGML DTD language which tells a computer program what constraints there are on the structure of documents, but nothing about their meaning. This allows a certain crude validity check to be made on a document but little else.

Now let us imagine further power which we could put into a schema language.

Partial Understanding

A crucial first milestone for the system is partial understanding. Let's use the scenario of an invoice, like the scenario in the "Extensible languages" note. An invoice refers to two schemata: one is a well-known invoice schema and the other a proprietary part number schema. The requirement is that an invoice processing program can process the invoice without needing to understand the part description.

Somehow the program must find out that the invoice is from its point of view just as valid as an invoice with the details of the part description stripped out.
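The stripping itself is easy to sketch; the hard part, how the program learns that stripping is allowed, is what the rest of this note is about. The following Python example simply assumes it is allowed. The namespace URIs, the element names and the `strip_unknown` function are hypothetical.

```python
import xml.etree.ElementTree as ET

INVOICE_NS = "http://example.org/ns/invoice"  # the one schema we understand

DETAILED = """\
<inv:invoice xmlns:inv="http://example.org/ns/invoice"
             xmlns:part="http://example.org/ns/parts">
  <inv:item>
    <inv:partnumber>8137498237</inv:partnumber>
    <part:description>left-handed widget</part:description>
  </inv:item>
  <inv:total>120.00</inv:total>
</inv:invoice>
"""


def strip_unknown(element: ET.Element, known_ns: str) -> None:
    """Remove, in place, every descendant outside the known namespace."""
    for child in list(element):  # copy, since we mutate while iterating
        if child.tag.startswith("{" + known_ns + "}"):
            strip_unknown(child, known_ns)
        else:
            element.remove(child)


root = ET.fromstring(DETAILED)
strip_unknown(root, INVOICE_NS)

# Local names of what is left, in document order.
remaining = [el.tag.split("}", 1)[1] for el in root.iter()]
print(remaining)
```

What remains is a document using only the invoice vocabulary, which the invoice processing program can then handle as an ordinary invoice.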

Optional parts

One possibility is to mark the part description as "optional" on the text. We could imagine a well-known way of doing this. It could be done in the document itself [as usual, using an arbitrary syntax:]

8137498237
...

There are problems with this. One is that we are relying on the invoice schema to define what an invoice is and isn't and what it means. It would be nice if the designer of the invoice could say whether the item should contain a part description or not, or whether it is possible to add things into the item description or not. But in general if there is something to be said we like to allow it to be said anywhere (like metadata). But for the optionalness to be expressed elsewhere would save the writer of every invoice the bother of having to state it explicitly.

Partial Understanding

The other more fundamental problem is that the notion of "optional" is subjective. We can be more precise about "partial understanding" by saying that the invoice processing system needs to convert the document which contains things it doesn't understand into a document which it does completely understand: a valid invoice. However, another agent may wish to convert the same detailed invoice into, say, a delivery note: in this case, quite different information would be "optional".

To be more specific, then, we need to be able to describe a transformation from one document to another which preserves "validity" in some sense. A simple form of transformation is the removal of sections, but obviously there can be all kinds of level of transformation language ranging from the crudest to the Turing-complete. Whatever the language, it makes a statement that, given a document x, some f(x) can be deduced.

Principle of Least Power

In practice, this suggests that one should leave the actual choice of the transformation language as a flexibility point. However, as with most choices of computer language, the general "principle of least power" applies:

When expressing something, use the least powerful language you can.

(@@justify in greater depth in footnote)

While being able to express a very complex function may feel good, the result will in general be less useful. As Lao-Tse puts it, "Usefulness from what is not there". From the point of view of translation algorithms, one usefulness is for them to be reversible. In the case in which you are trying to prove something (such as access to a web site or financial credibility) you need to be able to derive a document of a given form. The rules you use are the pieces of the web of trust and you are looking for a path through the web of trust. Clearly, one approach is to enumerate all the things which can be deduced from a given document, but it is faster to have an idea of which algorithms to apply. Simple ones have input and output patterns. A deletion rule is a very simple case

s/\(.*\)foo\(.*\)/\1\2/

This is stream editor language for "Remove "foo" from any string leaving what was on either side". If this rule is allowed, it means that "foo" is optional. @@@ to be continued
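The same deletion rule can be sketched in Python with the standard re module. The function name is made up for this example. The two parenthesised groups capture what lies on either side of "foo" so that the replacement can put them back.

```python
import re


def delete_foo(line):
    """Remove one occurrence of 'foo', keeping the text on either side."""
    # The first group is greedy, so if 'foo' occurs more than once
    # it is the last occurrence that is removed.
    return re.sub(r"(.*)foo(.*)", r"\1\2", line, count=1)
```

A line without "foo" passes through unchanged, which is what makes the rule safe to apply to any input.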

Optional features and Partial Understanding

Goal: V1 software partially understands V2 document

Optional features visible as such

Example: "Mandatory" Internet Draft

Example: SMIL (P.Rec. 1998/4/9)

Conversion from unknown language to known language.

Test of Independent Invention

The test of independent invention is a thought experiment which tests one aspect of the quality of a design. When you design something, you make a number of important architectural decisions, such as how many wheels a car has, and that an arch will be used between the pillars of the vault. You make other arbitrary decisions such as the color of the car, the side of the road everyone will drive on, whether to open the egg at the big end or the little end.

Suppose it just happens that another group is designing the same sort of thing, tackling the same problem, somewhere else. They are quite unknown to you and you to them, but just suppose that being just as smart as you, they make all the same important architectural decisions. This you can expect if you believe that these decisions make logical sense. Imagine that they have the same philosophy: it is largely the philosophy which we are testing. However, imagine that they make all the arbitrary decisions differently. They complement bit 7. They drive on the other side of the road. They use red buoys on the starboard side, and use 575 lines per screen on their televisions.

Now imagine that the two systems both work (locally), and being successful, grow and grow. After a while, they meet. Suddenly you discover each other. Suddenly, people want to work across both systems. They want to connect two road systems, two telephone systems, two networks, two webs. What happens?

I tried originally to make WWW pass the test. Suppose someone had (and it was quite likely) invented a World Wide Web system somewhere else with the same principles. Suppose they called it the Multi Media Mesh ^(tm) and based it on Media Resource Identifiers^(tm), the MultiMedia Transport Protocol^(tm), and a Multi Media Markup Language^(tm). After a few years, the Web and the Mesh meet. What is the damage?

A huge battle, involving the abandonment of projects, conversion or loss of data?

Division of the world by a border commission into two separate communities?

Smooth integration with only incremental effort?

(see also WWW and Unitarian Universalism)

Obviously we are looking for the last option. Fortunately, we could immediately extend URIs to include "mmtp://" and extend MRIs to include "http:\\". We could make gateways, and on the better browsers immediately configure them to go through a gateway when finding a URI of the new type. The URI space is universal: it covers all addresses of all accessible objects. But it does not have to be the only universal space. Universal, but not unique. We could add MMML as a MIME type. And so on. However, if we required all Web servers to synchronise through one and only one master lock server in Waltdorf, we would have found the Mesh required synchronisation through a master server in Melbourne. It would have failed.
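As a rough sketch of the gateway idea, a browser could rewrite any URI of the new type into a request to a gateway it already knows how to reach. The gateway address and the path layout below are assumptions made for this example; the "mmtp" scheme is of course the imaginary one from the thought experiment.

```python
from urllib.parse import urlsplit

GATEWAY = "http://gateway.example.org/mmtp"  # hypothetical gateway address


def route(uri):
    """Return the URI a browser should actually fetch."""
    parts = urlsplit(uri)
    if parts.scheme == "mmtp":
        # Send Mesh addresses through the gateway; host and path are preserved.
        rewritten = f"{GATEWAY}/{parts.netloc}{parts.path}"
        if parts.query:
            rewritten += "?" + parts.query
        return rewritten
    return uri  # everything else is fetched directly
```

Nothing in the existing URI space had to change for this to work, which is the sense in which the integration is only incremental effort.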

No system completely passes the ToII - it is always some trouble to convert.

Not just a thought experiment

As the Web becomes the basis for many, many applications to be built on top of it, the phenomenon of independent invention will recur again and again. We have to build technology so as to make it easy for systems to pass the test, and so survive real life in an evolving world.

If systems cannot pass the ToII, then we can only achieve worldwide interoperability when one original design has originally beaten the others. This can happen if we all sit down together as a worldwide committee and do a "top down" design of the whole thing before we start. This works for a new idea but not for the automation of something which, like pharmacy or trade, has been going on for centuries and is just being represented in the Semantic Web. For example, the library community has had endless trouble trying to agree on a single library card format (MARC record) worldwide.

Another way it can happen is if one system is dropped completely, leading to a complete loss of the effort put into it. When in the late 1980s Europe eventually abandoned its suite of ISO protocols for networking because they just could not interwork with the Internet, a huge amount of work was lost. Many problems, solved in Europe but not in the US (including network addresses of more than 32 bits) had to be solved again on the Internet at great cost. Sweden actually changed from driving on the left to driving on the right. All over the world, people have changed word processor formats again and again but only at the cost of losing access to huge amounts of legacy information. The test of independent invention is not just a thought experiment, it is happening all the time.

From philosophy to requirement

So now let us get more specific about what we really need in the underlying technology of the Semantic Web to allow systems in the future to pass the test of independent invention.

We will be smarter

Our first assumption is that we will be smarter in the future. This means that we will produce better systems. We will want to move on from version 1 to version 2, from version n to version n+1.

What happens now? A group of people use version 4 of a word processor and share some documents. One touches a document using a new version 5 of the same program. One of the other people tries to load it using version 4 of the software. The version 4 program reads the file, and finds it is a version 5 file. It declares that there is no way it can read the file, as it was produced in the future, and there is no way it can predict the future to know how to read a version 5 file. A flag day occurs: everyone in the group has to upgrade immediately - and often they never even planned to.

So the first requirement is for a version 4 program to be able to read a version 5 file. Of course there will be some features in version 5 that the version 4 program will not be able to understand. But most of the time, we actually find that what we want to achieve can be done by partial understanding - understanding those parts of the document which correspond to functions which exist in version 4. But even though we know partial understanding would be acceptable, with most systems we don't know how to do even that.

We are not the smartest

The philosophical assumption that we may not be smarter than everyone else (a huge step for some!) leads us to realise that others will have great ideas too, and will independently invent the same things. It forces us to consider the test of independent invention.

The requirement for the system to pass the ToII is for one program which we write to be able to read somehow (partially if not totally) data written by the program written by the other folks. This simple operation is the key to decentralised evolution of our technology, and to the whole future of the Web.

So we have deduced two requirements for the system from our simple philosophical assumptions:

We will be smarter in the future

Technology: Moving Version 1 to Version 2

We are not smarter than everyone else

Decentralized evolution

Technology: Moving between parallel Version A and Version B

The story so far

Where are we with the requirements for evolvability so far? We are looking for a technology which has free but well defined extension. We want to do it by allowing documents to use mixed vocabularies. We have already found out (from PICS work for example) that we need to be able to know whether extension vocabulary is mandatory or can be ignored. We want to use the Web for any registry, rather than any central point. The technology has to allow an application to be able to convert the output of a future version of itself, or the output of an equivalent program written independently, into something it can process, just by looking up schema information.

Evolution of data

Now let us look at the world of data on the Web, the Semantic Web, which I expect to become a new force in the next few years. By "data" as opposed to "documents", I am talking about information on the Web in a form specifically to aid automated processing rather than human browsing. "Data" is characterised by information with a well defined structure, where the atomic parts have well defined types, such as numbers and choices from finite sets. "Data", as in a relational database, normally has well defined meaning which has rarely been written down. When someone creates a new database, they have to give the data type of each column, but don't have to explain what the field name actually means in any way. So there is a well defined semantics but not one which can be accessed. In fact, the only time you tell the machine anything about the semantics is when you define which two columns of different tables are equivalent in some way, so that they can be used for example as the basis for joining the two databases. (That the meaning of data is only defined relative to the meaning of other data is of course quite normal - we don't expect machines to have any built in understanding of what "zip code" might mean apart from where you can read it and write it and what you can compare it with). Notice that what happens with real databases is that they are defined by users one day, and they evolve. They are rarely the result of a committee sitting down and deciding on a set of concepts to use across a company or an industry, and then designing the data schema. The schema is created on the fly by the user.
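To illustrate the point about joins, here is a small sketch using Python's built-in sqlite3 module. The table and column names are invented for the example. The column declarations give only data types; the one place where anything like meaning is stated is the join condition, which asserts that two differently named columns are about the same thing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, zip TEXT);
    CREATE TABLE deliveries (zip_code TEXT, depot TEXT);
    INSERT INTO customers VALUES (1, '02139'), (2, '94105');
    INSERT INTO deliveries VALUES ('02139', 'Cambridge'), ('94105', 'San Francisco');
""")

# The ON clause is the only statement of semantics: it says that
# customers.zip and deliveries.zip_code can be compared with each other.
rows = conn.execute("""
    SELECT c.cust_id, d.depot
    FROM customers AS c JOIN deliveries AS d ON c.zip = d.zip_code
    ORDER BY c.cust_id
""").fetchall()
```

The machine still has no idea what a zip code is; it only knows what it can be compared with.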

We can distinguish two ways in which the word "schema" has been used:

Syntactic schema: A document, real or imagined, which constrains the structure and/or type of data. (pl.: Schemata).

Semantic schema: A document, real or imagined, which defines the inferences from one schema to another, thus defining the semantics of one syntactic schema in terms of another.

I will use it for the first only. In fact, a syntactic schema defines a class of document, and often is accompanied by human documentation which provides some rough semantics.

There is a huge amount ("legacy" would unfairly suggest obsolescence) of data in relational databases. A certain amount of it is being exported onto the web as virtual hypertext. There are many applications which allow one to make hypertext views of different aspects of a database, so that each server request is met by performing a database query, and then formatting the result as a report in HTML, with appropriate style and decoration.

Data about data: Metadata

Information about information is interesting in two ways. Firstly, it is interesting because the Web society desperately needs it to be able to manage social aspects of information such as endorsement (PICS labels, etc), ownership and access rights to information, privacy policies (P3P, etc), structuring and cataloguing information and a hundred other uses which I will not try to enumerate. This first aspect is discussed elsewhere. (See Metadata architecture about general treatment of metadata and labels, and the Technology and Society domain for an overview of many of the social drivers and related projects and technology)

The second interest in metadata is that it is data. If we are looking for a language for putting data onto the Web, in a machine understandable way, then metadata happens to be a first application area. Also, because metadata is fundamental to most data on the web, it is the focus of W3C effort, while many other forms of data are regarded as applications rather than core Web architecture, and so are not.

Publishing data on the web

Suppose for example that you run a server which provides online stock prices. Your application which today provides fancy web pages with a company's data in text and graphs (as GIFs) could tomorrow produce the same page as XML data, in tabular form, for machine access. The same page could even be produced at the same URL in two formats using content negotiation, or you could have a typed link between the machine-understandable and person-understandable versions.
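A minimal sketch of how a server might choose between the two formats at one URL, based on the HTTP Accept header sent by the client. This is a simplification: it ignores wildcards such as */*, and the function names and the list of available formats are assumptions for the example.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    choices = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0]
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if media_type:
            choices.append((media_type, q))
    # sorted() is stable, so ties keep the order the client listed them in.
    return sorted(choices, key=lambda c: c[1], reverse=True)


AVAILABLE = ("text/html", "application/xml")  # the two versions of the stock page


def choose_representation(accept_header, default="text/html"):
    """Pick the best available format for one URL."""
    for media_type, q in parse_accept(accept_header):
        if q > 0 and media_type in AVAILABLE:
            return media_type
    return default
```

A browser asking for text/html gets the page with graphs; a program asking for application/xml gets the tabular data from the same address.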

The XML version contains at the top (or somewhere) a pointer to a schema document. This pointer makes the document "self-describing". It is this pointer which is the key to any machine "understanding" of the page. By making the schema a first class object, in other words by giving its URL and nothing else, we are leaving the door open to many possibilities. Now it is time to look at the various sorts of schema document which it could point to.

Levels of schema language

Computer languages can be classified into various types, with various capabilities, and the sort we choose for the schema document, and the information we allow in the schema, fundamentally affect not just what the semantic web can be but, more importantly, how it can grow.

The schema document can, broadly, be one of the following:

Notional only: imaginary, non-existent but named.

Human readable

Machine-understandable and defining structure

Machine-understandable and also defining which parts are optional

A Turing-complete recipe for conversion into other languages

A logical model of document

We'll go over the pros and cons of each, because none of these should be overlooked, but some are often way better than others.

Schema option 1: URI only

No supporting documentation

Allows compatibility yes/no test

This may sound like a silly trivial example, but like many trivial examples, it is not silly. If you just name your schema somewhere in URI space, then you have identified it. This doesn't offer a lot of help to anyone to find any documentation online, but one fundamental function is possible. Anyone can check compatibility: They can compare the schema against a list of schemata they do understand, and return yes or no.
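The yes/no test amounts to comparing names. A minimal sketch, with made-up schema URIs:

```python
# Hypothetical schema URIs this program was written to understand.
UNDERSTOOD_SCHEMATA = {
    "http://example.org/schema/invoice-v1",
    "http://example.org/schema/delivery-note-v1",
}


def is_compatible(schema_uri):
    """Yes/no test: plain string equality against the known schema names."""
    return schema_uri in UNDERSTOOD_SCHEMATA
```

Note that the URI is never dereferenced here; it is used only as an identifier.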

In fact, they can also use an index to look up information about the schema, including information about suitable software to download to add understanding of the document. In fact this level is the level which many RPC systems use: the interface is given a unique but otherwise random number which cannot be dereferenced directly.

So this is the level of machine-understanding typical of distributed computing systems and should not be underestimated. There are lots of parts of URI space you can use for this: you might own some http: space (but never actually serve the document at that point), but if you don't, you can always generate a URI in a mid: or cid: space or if desperate in one of the hash spaces.

Schema option 2: Human readable

The next step up from just using the Schema identifier as a document type identifier is to make that URI one which will dereference to a human-readable document. If you're a computer, big deal. But as well as allowing a strict compatibility test (test for equality of the schema URI), this also allows human beings to get involved if there is any argument as to what a document means. This can be significant! For example, the schema could point to a complete technical spec which is crammed with legalese about what the document does and does not imply and commit to. At the end of the day, all machine-understandable descriptions of documents are all very well, but until the day that they bootstrap themselves into legality, they must all in the end be defined in terms of human-readable legalese to have social effect. Human legalese is the schema language of our society. This is level 2.

Schema option 3: Define structure

Now we move into the meat of the schema system when we start to discuss schema documents which are machine readable. Now we are starting to enable some machine understanding and automatic processing of document types which have not been pre-programmed by people. Ça commence.

The next level we consider is that when your browser (agent, whatever) dereferences the namespace URI, it finds a schema which defines the structure of the document. This is a bit like an SGML Document Type Definition (DTD). It allows you to do everything which the levels 1 and 2 allowed, if it has sufficient comments in it to allow human arguments to be settled.

In addition, a system which has a way of defining structure allows everyone to have one and only one parser to handle all manner of documents. Any document coming across the threshold can be parsed into a tree.

More than that, it allows a document to be validated against allowed structures. If a memo contains two subject fields, it is not valid. This is one of the principal uses of DTDs in SGML.
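A small sketch of such a structure check, using a generic XML parser from the Python standard library rather than a real DTD validator. The memo element names and the occurrence rules are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical structure rule: element name -> (min, max) occurrences in a memo.
MEMO_RULES = {"to": (1, 1), "from": (1, 1), "subject": (1, 1), "body": (1, 1)}


def validate_memo(xml_text):
    """Return a list of structure errors; an empty list means the memo is valid."""
    root = ET.fromstring(xml_text)  # one generic parser, any document, a tree out
    counts = Counter(child.tag for child in root)
    errors = []
    for name, (low, high) in MEMO_RULES.items():
        n = counts.get(name, 0)
        if not low <= n <= high:
            errors.append(f"{name}: found {n}, expected {low}..{high}")
    for name in counts:
        if name not in MEMO_RULES:
            errors.append(f"{name}: not allowed in a memo")
    return errors


good = "<memo><to>A</to><from>B</from><subject>Hi</subject><body>...</body></memo>"
bad = ("<memo><to>A</to><from>B</from><subject>Hi</subject>"
       "<subject>Again</subject><body>...</body></memo>")
```

The rules are data, not code, so the same validator could be pointed at a different document type simply by loading a different rule table.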

In some cases, there may be another spin-off. You can imagine that if the schema document lists the allowed structure of the document, and the types (and maybe names) of each element, then this would allow an agent to construct on the fly a graphic user interface for editing such a document. This was the intent with PICS rating systems: at least, a parent coming across a new rating system would be given a human-readable description of the various parameters and would be able to select

Schema option 4: Structure + Optional flags

The "optional" flag is a term I use here for a common crucial step which can make the difference between chaos and smooth evolution. All you need to do is to mark in the schema of a new version of the language which elements of the language can be ignored if you don't understand them. This simple step allows a processor which handled the old language, given the schema of the new language, to filter it so as to produce a document it can legitimately understand.

Now we have a technology which has all the benefits to date, plus it can handle that elusive version 2 to version 1 conversion problem!
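A sketch of that filtering step, under the assumption that a document is a list of (name, value) pairs and that the new schema can be looked up as a simple table. All element names, and the choice to treat anything the schema does not mention as mandatory, are assumptions of this example.

```python
# Element names the old (version 1) processor understands. Hypothetical.
V1_KNOWN = {"part_number", "quantity", "price"}

# What the version 2 schema says about its new elements. Hypothetical.
V2_SCHEMA = {
    "part_description": {"optional": True},
    "currency": {"optional": False},  # changes the meaning of "price"
}


class CannotUnderstand(Exception):
    """Raised when a mandatory element is not understood."""


def filter_to_v1(document, schema=V2_SCHEMA, known=V1_KNOWN):
    """Reduce a v2 document (list of (name, value) pairs) to one v1 can process."""
    result = []
    for name, value in document:
        if name in known:
            result.append((name, value))
        elif schema.get(name, {}).get("optional", False):
            continue  # the schema says this may safely be ignored
        else:
            raise CannotUnderstand(name)
    return result
```

The old processor never needs to know what a part description is, only that the schema of the new language permits it to be dropped.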

Schema option 5: Turing-complete language

Always in languages there is the balance between the declarative limited language, whose formulae can be easily manipulated, and the powerful programming language whose programs cannot be analyzed in general, but which have to be left to run to see what they do. Each end of the spectrum has its benefits. In describing a language in terms of another, one way is to provide a black box program, say in Java or Javascript, which will convert from one to the other.

Filters written in Turing-complete languages generally have to be trusted, as you can't see what rules they are based on by looking at them. But they can do weird and wonderful things. (They can also crash and loop forever of course!).

A good language for conversion from one XML-based language to another is XSL. It started off as a template-like system for building one document from another (and can be very simple) but is in fact Turing-complete.

When you do publish a program to convert language A to language B, then anyone who trusts it has that capability. A disadvantage is that they never know how it works. You can't deduce things about the individual components of the languages. You can't therefore infer much indirectly about relationships to other languages. The only way such a filter can be used is to get whatever you have into language A and then put it through the filter. This might be useful. But it isn't as fascinating as the option of blowing language A open.

Schema option 6: Expose logic of document

What is fundamentally more exciting is to write down as explicitly as possible what the new language means. Sorry, let me take that back, in case you think that I am talking about some absolute meaning of meaning. If you know me, I am not. All I mean is that we write in a machine-processable logical way the equivalences and conversions which are possible in and out of language A from other languages. And other languages.

A specific case, of course, is when we document the relationship between version 2 and version 1. The schema document for version 2 could explain that all the terms are synonyms, except for some new terms which can be converted to nothing (ie are optional) and some which affect the meaning of the document completely and so if you don't understand them you are stuck.

In a more general case, take a language like iCalendar in RDF (were it in RDF), which is for describing events as would be in a personal organizer. A schema for the language might declare equivalences between a calendar's concept of group MEMBERship and an access control system's concept of group membership; it might declare the equivalence of the concept of LOCATION to be the text description of a Geographical Information Systems standard's location, and it may declare an INDIVIDUAL to be a superset of the HR department's concept of employee. These bits of information are the stuff of the semantic web, as they allow inference to stretch across the globe and conclude things which we knew as a whole but no one person knew. This is what RDF and the Semantic Web logic built on top of it is all about.
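A toy illustration of the kind of inference such declarations permit. The prefixed names (hr:, cal:, acl:) and the two relation names are made up for this sketch; a real system would use RDF and a proper reasoner rather than a list of tuples.

```python
# Hypothetical schema statements as (subject, relation, object) triples.
SCHEMA = [
    ("hr:Employee", "subClassOf", "cal:INDIVIDUAL"),
    ("cal:MEMBER", "equivalentTo", "acl:memberOf"),
]


def classes_of(declared_class, schema=SCHEMA):
    """All classes something belongs to, following subClassOf transitively."""
    found = {declared_class}
    frontier = [declared_class]
    while frontier:
        current = frontier.pop()
        for s, rel, o in schema:
            if rel == "subClassOf" and s == current and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def equivalents(term, schema=SCHEMA):
    """Terms declared equivalent to `term`, in either direction (one step only)."""
    out = set()
    for s, rel, o in schema:
        if rel == "equivalentTo":
            if s == term:
                out.add(o)
            elif o == term:
                out.add(s)
    return out
```

With these two statements, a calendar program handed an HR employee record can conclude it is dealing with an INDIVIDUAL, and an access control system can read calendar group membership, although neither schema author knew of the other's application.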

Extensible languages and web evolution

timbl@w3.org (Tim Berners-Lee) — Sun, 01 Feb 1998 00:00:00 GMT

Up to Design Issues

Contents

Extensible languages

Introduction

Requirements

Glossary

Mixing vocabularies

Scenario

Local scope

Lack of ambiguity

Evolving new scheme languages

Correctness of documents with multiple vocabularies

Granularity

Incorporation into the language

Related resources

Feeds:

timbl@w3.org (Tim Berners-Lee) — Fri, 01 Jan 2021 00:00:00 GMT

Feeds

Feeds of various sorts have been a feature since you could first subscribe to blogs using various forms of RSS. Let's call a feed a sequence of published things which you can subscribe to, with in some cases a mechanism to inform you when there is more added to it. So a feed by itself is a one-way thing with no feedback.

While blog feeds were the rage at one point, podcasts took over the limelight with the eponymous iPod, and now it seems people happily move in between, and sometimes convert between, text, audio and video blogs, but fitness sessions and photo streams do not play in the same space.

Here is a very rough summary of some existing feeds in 2021, and some things which don't have feeds.

Medium | Post format | Dominant platform | Response actions
Text blog | HTML | -- | Blog comments
Photo | JPEG | Instagram | Like, Comment
Audio podcast | MP3 | -- | --
Video podcast | MP4? | YouTube | Comment
Movie | IMDB-RDF | Netflix, Green Tomatoes > Media Kraken | Rating (GT)
Book | LoC RDF? | Amazon | 5 Star rating
Fitness | GPX | Strava, Fitbit etc | Kudos, comment

Strava and Instagram, as closed platforms, manage the identity of their users and their feeds, in each case one user per feed, with the access control of who can see what, and social actions like likes/kudos and comments on a post. In each case positive feedback can be private or public.

Activity Streams

"Little "a" activity streams are a UI paradigm for displaying recent activity within a context. Activities are typically displayed in reverse chronological order and consist of relatively simple statements such as "John uploaded a new photo" or "12 people liked Sally's post"." -- the AS wiki

Activity Streams activities are quite broad, not just publishing something. From the OWL ontology:

as:Activity a owl:Class ;
    rdfs:label "Activity"@en ;
    rdfs:subClassOf as:Object ;
    rdfs:comment "An Object representing some form of Action that has been taken"@en .

And there are 22 subclasses defined: Accept, IntransitiveActivity, Add, Announce, Create, Delete, Dislike, Flag, Follow, Ignore, Join, Leave, Like, View, Listen, Read, Move, Offer, Reject, Remove, Undo, Update. This reminds one of social Actions which have been pulled out in schema.org, with subclasses AchieveAction, AssessAction, ConsumeAction, ControlAction, CreateAction, FindAction, InteractAction, MoveAction, OrganizeAction, PlayAction, SearchAction, SeekToAction, SolveMathAction, TradeAction, TransferAction, and UpdateAction.

A GitHub feed would maybe have things like requested review, started review, completed review, approved. In general, any collaborative system in which the shared state changes from time to time, like for example an issue tracker or action list, could well be deemed to create Activity Stream events whenever that state changes.
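A rough sketch of a state-based system emitting an event on each change. The Issue class is hypothetical, and the event dictionaries only borrow Activity Streams property names informally; they are not meant to be valid serialised Activity Streams documents.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Issue:
    """A hypothetical issue tracker entry that logs its own state changes."""
    title: str
    state: str = "open"
    events: list = field(default_factory=list)

    def set_state(self, actor, new_state):
        if new_state == self.state:
            return  # nothing changed, so nothing to report
        self.events.append({
            "type": "Update",  # one of the activity subclasses listed above
            "actor": actor,
            "object": self.title,
            "summary": f"{actor} changed '{self.title}' from {self.state} to {new_state}",
            "published": datetime.now(timezone.utc).isoformat(),
        })
        self.state = new_state

    def feed(self):
        """Events in reverse chronological order, as activity streams usually show them."""
        return list(reversed(self.events))
```

The feed is a by-product of the state change rather than something a user publishes by hand, which is the sense in which the stream is derived from the shared state.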

By agreeing on common interoperable representation and workflows for social actions, the Solid world can allow the same smooth user experience no matter what it is they are reacting to -- liking, and so on.

The Solid mantra that "you should be able to do anything with anything" clearly suggests here that whatever that thing is, a person ought to be able to record their reaction, and so on.

Scope of social reaction

Clearly the scope of a social reaction, whether it is public, or personal and confidential, or shared with different groups or different communities, is key. With Instagram and Strava each person has one group of followers, so a choice may be not sharing, sharing with followers, or making something completely public. In a Solid world, where a person has many related groups of different sizes, and can be a member of different communities for different reasons, making sure the reaction has the right scope is very important.

Possible harmful consequences

There is research, and common experience, that suggests that social reaction systems like these can lead to harmful unintended consequences, or harmful processes can be set up, deliberately or subconsciously. Examples are people's unhealthy preoccupation with the public reactions to their posts, or bullying comments, and so on. A WebFoundation report discusses, for example, the problem of Online Gender-based Violence. So any new systems we design should involve investigation and modeling of these things, and where it is possible, explicit design to avoid harm. This is beyond the scope of this article.

Sets of Feeds

If you want to use, in the Solid tradition, many different feed-consuming apps with the same set of subscriptions, then you need interop at that level: the set of feeds I subscribe to. The NetNewsWire app will export its list of subscriptions in OPML like:

Subscriptions-OnMyMac.opml
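For a sense of what another app would have to do with such an export, here is a minimal sketch that reads the feed addresses out of an OPML file using the Python standard library. The sample below is hand-written for illustration, with made-up titles and URLs; it is not the content of the file linked above. It relies on the common OPML convention of outline elements carrying an xmlUrl attribute.

```python
import xml.etree.ElementTree as ET

# A small hand-written OPML sample; the feed titles and URLs are made up.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="Example Blog" type="rss" xmlUrl="https://example.org/feed.xml"/>
    <outline text="Podcasts">
      <outline text="Example Cast" type="rss" xmlUrl="https://example.com/cast.rss"/>
    </outline>
  </body>
</opml>
"""


def feed_urls(opml_text):
    """Return the xmlUrl of every outline that has one, including nested folders."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]
```

Folder outlines have no xmlUrl and are skipped, so the result is a flat list of subscriptions that any feed-consuming app could start from.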

Relatively recently --- "NetNewsWire 6 for iOS — which includes new features iCloud syncing, home screen widgets, and a bunch more — was submitted to App Store review today. The team is super-psyched to get this release shipping!"

Self-hosted feed set sync

NetNewsWire uses a number of protocols to sync your set of RSS feeds across devices.

FreshRSS looks interesting

FreshRSS can manage +100k articles without complaining. FreshRSS works on mobile Read your RSS feeds on your mobile without requiring any third-party application. Self host-able: Your data is yours! Host your aggregator and do not depend on anyone.

Discovering feeds

You hear often "Get this from X or Y or wherever you get your podcasts". There is an assumption that the various platforms have more or less organized a list of all the podcasts by name. This works maybe for the famous ones but clearly doesn't scale if we encourage everyone to have a feed.

You can find feeds and subscribe to them by following links. Links from other posts in other feeds, links sent in email, links in the minutes of a meeting, and so on. The link you get may be to the feed itself, or it may be to a particular post, in which case you can find the feed.
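One widely used way to get from a post to its feed, sketched here as an assumption rather than something this article prescribes, is for the post's HTML page to carry a link element with rel="alternate" and a feed media type. A minimal finder using the Python standard library:

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags with a feed media type."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        media_type = (a.get("type") or "").lower()
        if "alternate" in rels and media_type in FEED_TYPES and a.get("href"):
            self.feeds.append(a["href"])


def find_feeds(html_text):
    """Return the feed addresses advertised by one HTML page."""
    finder = FeedLinkFinder()
    finder.feed(html_text)  # HTMLParser.feed() means "feed text to the parser"
    return finder.feeds
```

A page that advertises no feed yields an empty list, in which case the reader is back to following ordinary links.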

The global system needs to provide both serendipity, when you come across new resources as a surprise while not looking for them, and also the ability to track down a feed which you have heard about or you imagine might exist. The algorithms we are talking about here are those which, in a Solid world, will determine which information people come across. In a present world with much misinformation and much disinformation, they are important.

Protocols vs Apps

"This podcast was generated by Audm. Get the Audm app to follow this, and other things from ..." No, don't get the audm App, or whatever App they have made for each podcast. Use your generic podcast reader, like Overcast, or the default one on your device. Keep the RSS protocol alive! Allow yourself to customize the way you listen to things, and keep track of what you have read or listened to in one place.

This is an important issue with feeds. If the feed providers have made a web site or an app with a richer experience than the feed, would it not be great to be able to slip between the different versions while keeping track of what you have read, and in which place you read it?

Synchronized feeds

A few years ago I read Neal Stephenson's novel Termination Shock in both audio and written text versions at once, skipping between the formats quite easily. That is really the way we should be consuming media in many cases: switching from sound to text when another person is around, back to sound when we are driving, and so on. It is about efficiency, and about accessibility.

Presumably the way to do it in RSS would be to have more than one "attachment" to the feed post, and then having some sort of mapping between characters in the text and time in the audio (or video).
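One possible shape for such a mapping, sketched with invented numbers: a sorted list of sync points pairing a character offset in the text with a time in the audio, searched with binary search. Nothing here is part of RSS; it is only an illustration of the data structure.

```python
from bisect import bisect_right

# Hypothetical sync points: (character offset in the text, seconds into the audio).
# Both columns must be increasing.
SYNC_POINTS = [(0, 0.0), (480, 31.5), (1010, 64.0), (1500, 98.2)]

_OFFSETS = [c for c, _ in SYNC_POINTS]
_TIMES = [t for _, t in SYNC_POINTS]


def audio_time_for(char_offset):
    """Audio time of the last sync point at or before a text position."""
    i = bisect_right(_OFFSETS, char_offset) - 1
    return _TIMES[max(i, 0)]


def char_offset_for(seconds):
    """Text position of the last sync point at or before an audio time."""
    i = bisect_right(_TIMES, seconds) - 1
    return _OFFSETS[max(i, 0)]
```

Switching from reading to listening then means looking up the current text position and starting the audio at the returned time, and the reverse when switching back.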

[2024-12] I recently found an app Speechify which allows me to import text, googleDocs, etc etc, read-listen to it smoothly. Just what the doctor ordered. Keeps track of what I am reading and where I am across devices. Wishlist, of course: a version which will store my data about where I am in my pod instead of in the Speechify cloud.

Keep it open - 2024

This was first written in 2021. In 2024, there are different threats to the open podcast world. Spotify wants desperately to be my podcast player -- annoying me with suggesting podcasts when I am using it for music. But I use Overcast, an independent app, for my podcasts, thank you. And now, I gather Spotify and Apple Podcasts, while playing anyone's feed, also have some proprietary non-interoperable feeds you can only get on Apple Podcasts or Spotify. A good policy is not to listen to the ones which don't use the standard.

Feeds and Feedback

Originally that was the title of this post, as I was thinking of talking about protocols like Linkback and Pingback to record when a link is made to a post --typically when somebody makes a reference to it in their own blog. But that will have to wait for another time.

Updates

MNOT
Mark Nottingham, co-author of Atom, in his 25 August 2024 blog post What RSS Needs, discusses a new take on the world of feeds.

References

Google on how Podcasts work and are listed

FreshRSS - open source software (and protocol) to save RSS feed sets

GreenGeeks Web Hosting, What are Trackbacks and Pingbacks in WordPress?

Web Foundation · November 25, 2020 The impact of online gender-based violence on women in public life

Up to Design Issues

Tim BL

Filtering and censorship

timbl@w3.org (Tim Berners-Lee) — Fri, 19 Dec 1997 00:00:00 GMT

Filtering and Censorship

Information is powerful stuff. The world has been enthralled by the power which the Web, the universe of accessible information, gives to people and to groups working and playing together.

Information about information is powerful not just as information, but because it allows one to leverage one's use of information, to benefit from that which is relevant, accurate, stylish, unbiased, or timely, -- whatever one regards as being of "quality" -- without being enmired in that which is not.

Powerful tools are often usable for constructive or destructive purposes, just as paper and ink can be used for truth or lies, and metal for ploughshares or swords. The Web's power stems from its universality - for example that a hypertext link can point to any information out there, not just a subset. People have asked whether I regret that the Web has been used for some uses, but I have to reply that if somehow it had been built to control the material which was placed in it, then that would be the technology controlling society, rather than the other way around as it should be.

True, one could take the view that our society is not strong enough to be trusted with a powerful information system. One could take the view that society does not currently have the wherewithal to prevent the Web from being abused by destructive forces to an extent that the overall pain is greater than the gain. I do not believe this is true. In the western developed world, at least, I believe that the democratic process will have sufficient control over governments and the judicial process sufficient control of criminals, to continue to defend the health of the evolving society.

We should be very careful, by constant inspection, to ensure that this continues to be the case.

Filtering and Censorship

One of the threats which posed itself in 1994 was of government censorship over information on the Web. In general, there are information acts which societies regard as legal, and those which are illegal (such as fraud). The problem which arose was that in the very subjective question of what information is deemed suitable for children, there was a threat that, in order to "protect" children, seeing no other alternative, governments were contemplating making draconian legislation for example prohibiting the transmission of "indecent" material. The problems here were many.

First of all, the concept of "indecent" was being enforced as a central single concept, quite against the distributed subjective nature of its definition in society. The Web works as a decentralized system, with no hierarchical or other structure to force society into a shape imposed by technology. This works. Centralization of such an idea would prevent the Web from being an accurate mirror of society itself.

Secondly, the problem being solved was the reading of such information by children, not its transmission. Thirdly, the question of "transmission" seemed to include intermediate parties who were not responsible for the content in an editorial or authorship sense. And one could list other problems, but this is enough for the present.

Information Quality

The basic problem being addressed was that of subjective information "quality". This is the same problem reported by newcomers to the web who find (typically after a search engine search) too much "junk".

It is unreasonable to ask for information delivered from the web to be of consistently high "quality" if you can't define what "quality" is. There is a need, then, to be able to represent "quality" in a completely subjective way.

This is what the PICS project was all about. PICS was specifically aimed at demonstrating that individuals could obtain their own subjective notion of quality without the government having to try to "protect" them by enforcing some centralized notion. Politically, PICS is a system necessary for the preservation of free speech on the Internet.

The system needed a few different sorts of documents:

a "rating system"

which defines a scale or scales on which one might judge a document. The fact that anyone can create one of these is a strong force allowing decentralization of concept, breaking the problem of the global, centralized definition of for example "indecency". PICS allows communities of any size (from one up) to establish their own criteria. Agreement over a large community enhances global harmony, but threatens diversity. Agreement over a small community does the reverse. So in fact some balance is necessary.

a "label"

which is a statement about something in terms of the schema. This can be made by any party, not just author or reader, and certainly not just central government. These can be created and exchanged in all manner of ways, so the PICS standard for interoperability is essential.

a "profile"

which describes for a given person the particular rating systems and levels on those scales which represent "quality" information at a given time and in a given context. This sort of information can either be input by a person using a graphic interface (such as a set of sliders in a dialog box), or can simply be transferred from someone they trust, whether family, organization, or friend. Inability to transfer this would prevent people from building their own communities with common standards of trust: hence the importance for this (picsrules) as a standard.

These are all subsets of a general metadata language, designed to be easy for people to use. In particular, by being limited in their power, they allow graphic interfaces to be built.
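The relationship between the three kinds of document can be pictured with a small sketch. This is not the PICS syntax: the class names, the fields and the `acceptable` function are all hypothetical, chosen only to show a profile applying one reader's own limits to labels which anyone may have made.

```python
from dataclasses import dataclass


@dataclass
class RatingSystem:
    """A scale or scales on which one might judge a document."""
    name: str
    scales: tuple


@dataclass
class Label:
    """A statement made by any party about a document, in terms of a rating system."""
    system: str    # name of the rating system used
    document: str  # URI of the document being described
    by: str        # whoever made the statement
    ratings: dict  # scale name -> value on that scale


@dataclass
class Profile:
    """One person's choice of whom to trust and which levels count as acceptable."""
    trusted: set   # label makers this reader trusts
    limits: dict   # (system name, scale name) -> highest acceptable value


def acceptable(profile, labels):
    """Return True if no trusted label exceeds a limit set in the profile."""
    for label in labels:
        if label.by not in profile.trusted:
            continue  # statements from untrusted parties are ignored
        for scale, value in label.ratings.items():
            limit = profile.limits.get((label.system, scale))
            if limit is not None and value > limit:
                return False
    return True


# Hypothetical example data.
movies = RatingSystem("example-movies", ("violence",))
labels = [
    Label(movies.name, "http://example.org/film", "school", {"violence": 3}),
    Label(movies.name, "http://example.org/film", "stranger", {"violence": 0}),
]
strict = Profile(trusted={"school"}, limits={(movies.name, "violence"): 1})
relaxed = Profile(trusted={"school"}, limits={(movies.name, "violence"): 4})
```

Because a profile is just data, it could be copied from a trusted family member or organization as easily as it could be set with sliders, which is the point made above about transferring profiles.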

On social responsibility of technologists

The argument has been made that PICS technology should be suppressed as the power it gives may be abused by governments. (There are even those who have suggested that the whole scheme is a government inspired plot to promote censorship and limit free speech. This is certainly not the case, neither in the idea, the funding, nor the intent.) Whereas most readers may find this far fetched, it is worth a response on principle.

As I pointed out when closing the first International World Wide Web conference, speaking to (then a mere 350) geeky web enthusiasts, I firmly believe it is the task of scientists and technologists to be aware of and responsible for the social implications of their work. This cannot just be left to "professional socially responsible people", as each engineer and scientist is often best aware of the potential of the work. Uttered in the auditorium at CERN, whose particle physicists trace their roots through nuclear physics, I don't think the message went unheard, even though it may have sounded strange in such a new field. Now, (1997) the World Wide Web Consortium has one of its three domains dedicated to the relationship between Technology and Society.

So what about PICS?

The question basically is whether the potential danger of the technology outweighs the freedom and positive good it accords. You can certainly argue this for nuclear fission, and you can certainly argue it for the wide distribution of firearms in populous countries. Can you argue this for PICS and metadata? Is there anything about PICS specifically or metadata in general which makes it more of a danger than a boon?

The specific types of document in PICS are very general. As a system, it is quite generalist, and extremely decentralized. It does as good a job as it can of leaving policy up to others to set, although it does (compared with other systems one could imagine) tend to favor by its nature cultural diversity, and freedom of speech, including freedom to endorse others' work.

The specifications of communication protocols enable, but do not enforce, what manufactured software will or will not be able to do. One cannot, therefore, at this level say what individuals will be able to do. The technology can leave the policy up to others, which leaves other groups to ensure that the values which they hold dear are not lost in new legislation, industry practices, or public apathy.

A metainformation system allows one to talk about information. It enables all kinds of uses of information:

finding information

talking about information

making laws about information

breaking laws about information

It is not the place of a technical metadata system to try to limit the statements one can make with metadata, or the laws if any which are made. That is the role of the democratic process and whatever government the people trust. The W3C as an industry consortium can act for industry in promoting standards, but cannot act to create laws. What we can do is explain to lawmakers and others the effect and intention of technology. That is what this article attempts to do.

Conclusion

So Metadata, PICS and otherwise, is powerful, as is information in general. Constant vigilance by concerned members of the public, industry and government is a very important part of the system of controls which keeps society healthy. The PICS technology was created specifically in order to reduce the risk of government censorship in civilized countries. It was the result of members of the industrial community being concerned about the behaviour of government. The indications are that in this it will succeed, but that does not remove the need for such vigilance.

To conclude, out of fear or ignorance, that PICS is more of a danger than it is a boon would be to throw the baby out with the bathwater. Metadata is not just a new tool, it is the start of a machine-understandable web (a "web phase 2") of information whose impact should be as empowering to humanity as the human-understandable web of today. We must understand it as we build it.

Linked Data Shapes, Forms and Footprints

timbl@w3.org (Tim Berners-Lee) — Fri, 26 Apr 2019 00:00:00 GMT

In a world of linked data, in which anyone can say anything about anything, how do we build systems in which users and apps are easily allowed to express useful, helpful things? What tools can we use which allow new systems to grow easily and work well together?

Ontology languages

The RDF schema languages, RDF Schema and OWL, tell you implications one can draw from RDF Model data. They also tell you what things do not make logical sense. Therefore in a sense they indirectly have the function of constraining what RDF data one can write, though just by telling what would be nonsense (false). So they can in a rather weak way be used to guide a user interface. But that won't do what we need.

Other schema systems, like that of schema.org, give suggestions as to what predicates can be used to talk about objects of a given class. That is useful, but still is not enough.

In this document, we will discuss three kinds of technologies to help with building apps on top of data:

Shapes explain to machines what data should look like, independently of how that data is displayed to a user.

Forms are a user interface allowing people to read and write data in a specific shape.

Footprints explain to machines where new data should be stored.

Read rest of article...

Fractal web, fractal society

timbl@w3.org (Tim Berners-Lee) — Thu, 31 Dec 1998 00:00:00 GMT

The Scale-free nature of the Web

This article was originally entitled "The Fractal nature of the web". Since then, I have been assured that while many people seem to use fractal to refer to a Zipf (1/f) distribution, it should really only be used in spaces of finite dimension, like the two-dimensional planes of Mandelbrot sets. The correct term for the Web, then, is scale-free.

This isn't an observation so much as a requirement.

I have discussed elsewhere how we must avoid the two opposite social deaths of a global monoculture and a set of isolated cults, and how the fractal patterns found in nature seem to present themselves as a good compromise. It seems that the compromise between stability and diversity is served by there being the same amount of structure at all scales. I have no mathematical theory to demonstrate that this is an optimization of some metric for the resilience of society and its effectiveness as an organism, nor have I even that metric. (Mail me if you do!)

However, it seems from experience that groups are stable when they have a set of peers, when they have a substructure. Neither the set of peers nor the substructure must involve huge numbers, as groups cannot "scale", that is, work effectively with a very large number of liaisons with peers, or when composed as a set of a very large number of parts. If this is the case then by induction there must be a continuum of group sizes from the very largest to the very smallest.

This seems to be a general rule which can guide our design, and against which we can measure actual patterns of use.

It is in fact another aspect of the tension between many languages and one global language. Locally defined languages are easy to create, needing local consensus about meaning: only a limited number of people have to share a mental pattern of relationships which define the meaning. However, global languages are so much more effective at communication, reaching the parts that local languages cannot. This tension is exemplified in the standards process, when ideas have to be exposed to successively larger and larger groups, with friction and hard work at each stage.

Other interesting things to model passing through a fractal system include DNA traits in intermarrying populations; viruses propagating through schools and traveling business people; and problems propagating to someone who has a solution. These are more good exercises (State your assumptions!). On the first of these: someone suggested (who?) that the invention of the bicycle made a great difference to average health in the Welsh valleys because it allowed greater intermarrying and so increased the effective gene pool size. Clearly, global travel could end up reducing the diversity.

Zipf happens

Whether we like it or not, early measurements of web traffic by the DEC WRL firewall showed DEC employees browsing sites with a Zipf (1/n) distribution of popularity. (Anyone got any other measurements? [Neilsen 1997]). Recent analyses suggest the Web becoming smaller for its size seem to use.

How can we use knowledge of the Web's fractal nature? By planning network bandwidth between long-range and short-range communication, planning for cache usage, etc. The physical network can be expected to have a variety of scale geographically, like the road system. However, the structure of the Web is interestingly different because of the lack of two-dimensional constraint. The challenge is to use this flexibility in building an effective society on top of the Web.

Looking for a metric

What do we mean by "effective"? We mean we would like to combine scientists' creative ability and knowledge to find a cure for AIDS. We would like to preserve world peace by allowing xenophobia to disperse in a web of understanding, while at the same time preserving the diversity of culture which gives the human race its richness. These are of course the same classic problems of the management of a large organization, of combining individual creativity with corporate vision.

If the web of society has an imbalance, we pay for it. We pay for insufficient global understanding with war. We pay for insufficient family communication with broken families and unsupported individuals. At any level of scale, missing social structure at that scale will prevent problems at that scale being addressed, and also prevent resources at that scale being used. It would therefore be great to have a way of measuring for a given web the degree to which it has a balanced fractal pattern, and if not where its weaknesses are.

Those looking for the "small world" effect chose metrics such as the maximum or mean value of the shortest path between any two points. This gives us a metric for effectiveness at the global scale, but not of the chewiness.

Clustering algorithms can produce globs of various sizes, and a measure of the chewiness of a web may be that the cluster sizes have a Zipf distribution. For example, using Jon Kleinberg's algorithm (which for a link matrix A associates concepts with the eigenvectors of A*A), the strength of the cluster is the value of the eigenvalue, and (while this does not directly indicate size) an interesting test would be on the relative absolute values (squares?) of successive eigenvalues.

Looking at it from the point of view of an individual (a graph node), an interesting question is the proportion of the traffic which is to local or more distant nodes. In Marchiori's model [Marchiori] traffic flows between two nodes in inverse proportion to the resistance of the shortest path. The total "efficiency" is deemed to be the total flow between all pairs of nodes. Can we measure a "chewiness" which measures the approximation of the system to a fractal distribution of long and short range communication? If the Marchiori model were modified to use parallel conductance (more like a real signal flow system) then would this be simpler?
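A minimal sketch of that kind of measurement, under the assumption (mine, not necessarily the model's) that every link has unit resistance, so that the resistance of the shortest path is just its hop count. The function names and the four-node example graph are hypothetical. Besides the total, it breaks the flow down by distance, which is one crude way of looking at the balance of local and long-range traffic.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def flow_by_distance(graph):
    """Total flow (1/d per ordered pair of connected nodes), grouped by distance d."""
    flows = {}
    for source in graph:
        for target, d in shortest_path_lengths(graph, source).items():
            if target != source:
                flows[d] = flows.get(d, 0.0) + 1.0 / d
    return flows


def total_efficiency(graph):
    """Sum of the flows between all ordered pairs of distinct nodes."""
    return sum(flow_by_distance(graph).values())


# A hypothetical chain of four nodes: a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

Each unordered pair is counted twice here (once in each direction); halving the result gives the same ranking of graphs.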

Suppose for example we look at the amount of connection we have with nodes whose distance, or groups whose size, is of each order of magnitude and look for smoothness up to the global level.

Stop Press

2000/03

Well, here I was thinking that while it is intuitively clear that society has to be fractal, I didn't know a mathematical justification for it, when Jon Kleinberg comes up with what for me is his second cool web result.

This paper takes the case of a two-dimensional grid. It imagines each cell having a certain distribution of links of various lengths. It demonstrates that in order to achieve the connectivity a la 6 degrees of separation which scales with the log of the size of the system, then the distribution of link density as a function of distance must be precisely an inverse-square law. That is, each cell must have the same number of links (on average) to cells 1-10 squares away as to cells 10-100 away, etc. Anything more local or more global leads to less of a small-world phenomenon: this is the only scalable solution.
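The "same number of links per decade of distance" property can be checked numerically. The sketch below assumes a square grid measured in lattice (city-block) distance, where there are 4d cells at distance d, and weights each cell by distance to the power of minus the exponent; the function name and the choice of three decades are mine. It is an illustration of the arithmetic only, not of the routing result in the paper.

```python
def link_weight_per_decade(exponent, decades=3):
    """Relative number of links from one cell into each band [10^k, 10^(k+1))."""
    weights = []
    for k in range(decades):
        low, high = 10 ** k, 10 ** (k + 1)
        # 4*d cells lie at lattice distance d; each is linked with weight d^-exponent.
        weights.append(sum(4 * d * d ** -exponent for d in range(low, high)))
    return weights


inverse_square = link_weight_per_decade(2)  # roughly level from band to band
too_global = link_weight_per_decade(1)      # grows with distance
too_local = link_weight_per_decade(3)       # dies away with distance
```

With exponent 2 the per-band totals settle towards a constant as the distance grows (the first band is a little high because the grid is coarse there), whereas any other exponent tilts the balance towards either far or near cells.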

True, this applies to a geographical grid, and a square rather uniform one at that. However, he does generalize it to more dimensions. Furthermore, you can see logically how the system works. To get a postcard to an arbitrary person in Massachusetts through a network of friends, you must have enough local friends to be able to find someone who will know someone in Massachusetts. The person they find in Massachusetts must be able to pass it to people successively closer and closer to the target. This only works if there is connectivity on each scale. True, no one has derived the metric of the number of hops a message takes as being an essential metric for systems, but on the other hand there is a clear analogy with the number of hops between a problem and a solution in a large organization.

Other work:

Living semantic web

Data from Swoogle April 2005

Nice to see some Zipf-shaped curves. Swoogle notes:

All these series follows Zipf's distribution, except the tail

The sharp decrease the tail in "class populated" shows that the most populated classes highly correlated such that their are populated by almost the same amount of SWDs. Similar situation can be observed in other series.

The closeness of the sharp decrease of "class populated" and "property populated" is caused by the co-existence of certain classes and certain properties.

Postscript - A personal exercise

There will I am sure be a lot of ways in which the fractal requirement is used in web design. You can also use it in that task of figuring out how you fit in to society at large (and at small). Do your personal interactions spread across the scales? Here is a self-help chart to help think about this. You fill in the groups in your life.

Scale 1 10 1000 10k 100k 1M 10M 100M 1G

Group You family, group ... ... town? city? country? USA World population

Time spent ? ? ? ? ? ? ? ? ?

Money spent ? ? ? ? ? ? ? ? ?

etc ? ? ? ? ? ? ? ? ?

Another way to do this is find 11 jars, and label one with each scale in powers of 10. (You don't have to paint them but it helps).

Put marbles in each jar for each time period you spend on matters at a given scale, such as an international meeting, or a school sportsfield, or with your family, or alone in a treehouse. How well balanced do the jars become?

As a social person, do you spend enough time with groups of each size? If not, are there people one click from you who do, and through whom you are indirectly present in those groups? One of the concerns is that the last column - the global column - tends in my observation to get the smallest amount of money at least, as in the US federal and state and town taxes are spread around the other areas but the level of international aid is very much lower. The cool thing is that I think people are born with DNA which gives them a healthy interest at all these levels. People who stick at one scale all their lives feel very uncomfortable. Maybe our preferences have evolved to form naturally a fractal society.

Total Cost of Ontologies (2005)
(I can't remember where I originally brought this up, I think at the Web Science workshop in London 2005/9. This is from ISWC 2005 slides.)
One of the interesting things about assuming a fractal distribution is you can think about the number of ontologies and the time it takes to make them, and the total cost of using ontologies. So let us for example naively assume that
ontologies are evenly spread across orders of magnitude; committee size goes as log(community), time as committee^2, cost is shared across community.

Scale Eg Committee size Cost per ontology (weeks) Cost for me

0 Me 1 1 1.000000

10 My team 4 16 1.600000

100 Group 7 49 0.490000

1000 10 100 0.100000

10k Enterprise 13 169 0.016900

100k Business area 16 256 0.002560

1M 19 361 0.000361

10M 22 484 0.000048

100M National, State 25 625 0.000006

1G EU, US 28 784 0.000001

10G Planet 31 961 0.000000

Total cost of 10 ontologies: 3.2 weeks. Serious project: 30 ontologies, TCO = 10 weeks.
Lesson: Do your bit. Others will do theirs.
Thank those who do working groups.
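The table can be reproduced with a few lines of Python. The only thing not stated in the text is the exact committee formula: the values 1, 4, 7, ... 31 in the table fit 1 + 3 * log10(community size), so that is what this sketch assumes. The function name is hypothetical.

```python
def ontology_costs(max_exponent=10):
    """Rows of (community size, committee size, weeks per ontology, weeks per person)."""
    rows = []
    for k in range(max_exponent + 1):
        community = 10 ** k
        committee = 1 + 3 * k          # assumed fit to the table: 1, 4, 7, ... 31
        weeks = committee ** 2         # time goes as committee size squared
        rows.append((community, committee, weeks, weeks / community))
    return rows


rows = ontology_costs()
# My share of the cost of one ontology at every scale, added up.
total_for_me = sum(per_person for _, _, _, per_person in rows)

for community, committee, weeks, per_person in rows:
    print(f"{community:>12} {committee:>3} {weeks:>4} {per_person:.6f}")
print(f"total: {total_for_me:.1f} weeks")
```

The sum is dominated by the first three rows, which is the lesson: almost all of my cost lies in the ontologies of me, my team and my group, and the planet-scale ones come nearly free.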
Q: How can the semantic web work...

... when we are all in one big domain of discourse but people are all making their own local ontologies? (2007/3/3)

Rather than 'domain of discourse', or set of things considered, I think of 'community', set of agents communicating using certain terms. When one thinks in terms of domain of discourse, one tends to conclude that everyone who talks at all about a car (say) has cars in their domain of discourse and so everyone must share the model which includes the single class Car.

It isn't like that though. An agent plays a role in many different overlapping communities. When I tag a photo as being of my car, or I agree to use my car in a car pool, or when I register the car with the Registry of Motor Vehicles, I probably use different ontologies. There is some finite effort it would take to integrate the ontologies, to establish some OWL (or rules, etc) to link them.

Everyone is encouraged to reuse other people's classes and properties to the greatest extent they can.

Some ontologies will already exist and be publicly shared by many, such as ical:dtstart, geo:longitude, etc. This is the single global community.

Some ontologies will be established by smaller communities of many sizes.

Why do I think the structure should be, and will be, fractal? Clearly there will be many more small communities, local ontologies, than global ones. Why a 1/f distribution? Well, it seems to occur in many systems including the web, and may be optimal for some problems. That we should design for a fractal distribution of ontologies is a hunch. But it does solve the issue you raise. Some aspects of the web have been shown to be fractal already.
Here are some properties of the interconnections:

- The connections between the ontologies may be made after their creation, not necessarily involving the original ontology designers.

- There is a cost of connecting ontologies, figuring out how they connect, which people will pay when and only when they need the benefit of extra interoperability.

- Sometimes when connecting ontologies, it is so awkward there is pressure to change the terms that one community uses to fit in better with the other community. Again, a finite cost to make the change, against a benefit of more interop.

Yes, if web-based means an overlapping set of many ontologies in a fractal distribution. In this fractal tangle, there will be several recurring patterns at different scales. One pattern is a local integration within (say) an enterprise, which starts point-point (problems scale as n^2) and then shifts with EIA to a hub-and-spoke as you say, where the effort scales as N. Then the hub is converted to use RDF, and that means the hub then plugs into an external bus, as it connects to shared ontologies.

So the idea is that in any one message, some of the terms will be from a global ontology, some from subdomains. The amount of data which can be reused by another agent will depend on how many communities they have in common, how many ontologies they share.

In other words, one global ontology is not a solution to the problem, and a local subdomain is not a solution either. But if each agent uses a mix of a few ontologies of different scale, that forms a global solution to the problem.

Conjecture

The conjecture is that there is some model which describes these systems reasonably well, and that given that model one can show that the scale-free distribution of communities is optimal.

There are many other questions. Of course existing systems on the earth may be very much influenced by the geographical reality of a two-dimensional surface. Historical groups have been nested geographically. So though there may be aspects in which community size is scale-free, that may be a completely different optimisation problem from the one we have when on the Internet anyone can connect to anyone. If you could devise an algorithm for connecting people into groups, so that they each participated in communities of different sizes in a scale-free way, then how much more effective (at solving problems, etc) can you make a web-based society which ignores geographical borders? To what extent does humanity as currently connected by the web in fact deviate from geographical nesting anyway?

Fragment identifiers

timbl@w3.org (Tim Berners-Lee) — Tue, 08 Apr 1997 00:00:00 GMT

URI References: Fragment Identifiers on URIs

The URI by itself is a powerful thing, but there is a more powerful concept which is the URI reference.

The URI reference is a thing you build by taking a URI for an information object, adding a "#" sign and then a Fragment identifier. (The last term is historical, so try not to think of it necessarily identifying a fragment).

The fragment identifier is a string after the URI, after the hash, which identifies something specific as a function of the document. For a user interface Web document such as an HTML page, it typically identifies a part or view. For example in the object

http://foo/bar#frag

the string "frag" is the fragment identifier. It is badly named, as it can identify anything.

(Depending on where you look, the URI is considered to include the fragment identifier, or to have the fragment identifier appended to it. This is important for the BNF, but in practice you will find people using the terms URI and URL loosely to refer to things which do or do not include a possible fragment identifier. Formally, the URI does include the fragment ID.)

In practice, you can divide the processing which occurs when following a link using HTTP into three steps:

The client figures out which server to contact by parsing part of the URL, and sends the URL as a request to the server;

The server figures out which object is referred to by parsing the rest of the URL, and returns some rendition of it to the client;

The client presents all or part of the object to the user

The last part typically involves finding some software class which can handle the given MIME type, and passing it the data stream. At the same time, the fragment identifier is passed as a parameter to the created object.
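The division of labour can be sketched with Python's standard `urllib.parse`. The function name `plan_request` is hypothetical, and the sketch only splits the URL; it does not open a connection.

```python
from urllib.parse import urlsplit


def plan_request(url):
    """Split a URL into (server to contact, request target, fragment kept by the client)."""
    parts = urlsplit(url)
    # Step 1: the client works out which server to contact.
    server = parts.netloc
    # Step 2: the server is asked for the path (and query, if any), without the fragment.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # Step 3: the fragment stays with the client, to be passed as a parameter
    # to whatever software handles the MIME type which comes back.
    return server, target, parts.fragment


server, target, fragment = plan_request("http://foo/bar#frag")
```

For the example URL used earlier, this gives the server `foo`, the request target `/bar`, and the fragment identifier `frag` held back on the client side.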

For HTML, the fragment ID is an SGML ID of an element within the HTML object. For XML, if it is just a word, then it is the XML ID of an element in the document.

Axiom

The significance of the fragment identifier is a function of the MIME type of the object

This means that the fragment id is opaque for the rest of the client code. The HTTP engine cannot make any assumptions about it. The server is not even given it.

It also means that for any new data type one can be creative about using the fragment ID in a relevant way. For example, for a 3D object the fragment ID could give a viewport. For a music object, the Fragment ID could give a section in time, or a set of parts, or it could include a suggested tempo. For future versions of HTML, the fragment ID could be made more powerful to include a range or "ladder" reference to a part or parts of the SGML element tree by position. A very useful fragment ID for plain text would allow ranges to be quoted by line and character number

These things are all decisions made when the MIME type is defined. Therefore,

The fragment ID spec for a new MIME type should be part of the MIME type registration process.

Different MIME types then can have different fragment ID specifications. When HTTP for example negotiates between different content types, it is clearly useful for those types to have a consistent (hopefully identical) fragment ID syntax and semantics.
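One way to picture "the significance of the fragment identifier is a function of the MIME type" is a table of interpreters keyed by type. In the sketch below the function names, the returned dictionaries and the `line=START,END` plain-text syntax are assumptions made up for illustration, not something specified by this note.

```python
def html_fragment(fragment):
    """For HTML, the fragment names an element by its ID."""
    return {"kind": "element-id", "id": fragment}


def text_fragment(fragment):
    """Hypothetical plain-text syntax 'line=START,END' quoting a range of lines."""
    key, _, value = fragment.partition("=")
    if key != "line":
        raise ValueError("unsupported plain-text fragment: " + fragment)
    start, end = (int(n) for n in value.split(","))
    return {"kind": "line-range", "start": start, "end": end}


# Each MIME type brings its own fragment ID specification.
FRAGMENT_HANDLERS = {
    "text/html": html_fragment,
    "text/plain": text_fragment,
}


def interpret_fragment(mime_type, fragment):
    """Interpret a fragment only once the MIME type of the object is known."""
    handler = FRAGMENT_HANDLERS.get(mime_type)
    if handler is None:
        return None  # no fragment semantics known: fall back to the whole object
    return handler(fragment)
```

The rest of the client never looks inside the fragment string; only the handler chosen for the returned type does, which is what keeps the fragment opaque to the HTTP engine.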

Fragment identifiers for RDF identify concepts

The semantic web has information about anything. The fragment identifier on an RDF (or N3) document identifies not a part of the document, but whatever thing, abstract or concrete, animate or inanimate, the document describes as having that identifier.

It is important, on the Semantic Web, to be clear about what is identified. An http: URI (without fragment identifier) necessarily identifies a generic document. This is because the HTTP server response about a URI can deliver a rendition of (or location of, or apologies for) a document which is identified by the URI requested. A client which understands the http: protocol can immediately conclude that the fragment-id-less URI is a generic document. This is true even if the publisher (owner of the DNS name) has decided not to run a server. Even if it just records the fact that the document is not available online, still a client knows it refers to a document. This means that identifiers for arbitrary RDF concepts should have fragment identifiers. This, in turn, means that RDF namespaces should end with "#".
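A short sketch of the naming pattern, using Python's standard `urllib.parse.urldefrag`. The namespace URI and the helper names are hypothetical.

```python
from urllib.parse import urldefrag

# Hypothetical RDF namespace; note that it ends with "#".
NAMESPACE = "http://example.org/vocab#"


def term(local_name):
    """URI for a concept: the namespace followed by a local name."""
    return NAMESPACE + local_name


def describing_document(uri):
    """Strip the fragment to get the URI of the document which describes the concept."""
    document, _fragment = urldefrag(uri)
    return document


car = term("Car")
```

Here `car` identifies the concept, while `describing_document(car)` is the fragment-less http: URI which, following the argument above, can only identify a document.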

Object Names as fragment identifiers

When a document language (MIME type) has some form of intra-document naming for objects then it is intuitive if these names can be directly used as fragment identifiers. This is true for XML, in that the XML ID which is used to identify elements can be directly used as a fragment identifier.

Fragment IDs and Content negotiation - known bug

If content negotiation occurs across types which do NOT share a fragment ID specification, then strictly there has been an error. In practice, HTML was the only type (in 1997) which allowed fragment IDs anyway, and other types ignored them. Also, as falling back from a pointer to a specific view to a pointer to the whole document has been considered an effective fallback procedure, no harm was done. Now (2001) it becomes more of a problem. (There have been proposals to add the requested fragment identifier to the HTTP request to fix this.)

In the future, metadata returned or warnings returned should indicate to the client that this could be a problem. Also, in new access protocols, the fragment ID requested could be shipped to the server as a hint, which would allow the server and client to negotiate and if successful arrange for the fragment ID to be converted to a suitable equivalent value for an alternative MIME type.

User awareness of the form of a reference

Clearly when a fragment ID is generated and associated with a URI which is generic in any way (language, version, etc. as well as content-type), then there is a possible failure if the fragment ID refers to something which is not defined in any specific instance. It would be appropriate for a client, when generating a link (or bookmark, etc.) to provide the user with a choice of

A bookmark to the whole living document, or

A bookmark to a specific part of a "dead" version;

Intermediate combinations.

As both these options are meaningful and useful, they will have to surface at the user interface level.

Generic resources

timbl@w3.org (Tim Berners-Lee) — Fri, 01 Mar 1996 00:00:00 GMT
See also:

A proposal for an HTML "Resource" element

Historical web design note on formats

HTTP overview by W3C

Generic Resources

A URI represents a resource

A "resource" is a conceptual entity (a little like a Platonic ideal). When represented electronically, a resource may be of the kind which corresponds to only one possible bit stream representation. An example is the text version of an Internet RFC. That never changes. It will always have the same checksum.

On the other hand, a resource may be generic in that as a concept it is well specified but not so specifically specified that it can only be represented by a single bit stream. In this case, other URIs may exist which identify a resource more specifically. These other URIs identify resources too, and there is a relationship of genericity between the generic and the relatively specific resource.

As an example, successively specific resources might be

The Bible

The Bible, King James Version

The Bible, KJV, in English

A particular ASCII rendering of the KJV Bible in English
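The chain above can be sketched as a simple linked structure in which each resource points at the more generic resource it specializes. The class, its field names and the example.org URIs are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    uri: str                               # hypothetical URI
    label: str
    generic: Optional["Resource"] = None   # the more generic resource, if any

    def generic_chain(self):
        """List this resource and every more generic one above it."""
        chain, node = [], self
        while node is not None:
            chain.append(node.label)
            node = node.generic
        return chain


bible = Resource("http://example.org/bible", "The Bible")
kjv = Resource("http://example.org/bible/kjv",
               "The Bible, King James Version", bible)
kjv_en = Resource("http://example.org/bible/kjv/en",
                  "The Bible, KJV, in English", kjv)
ascii_kjv = Resource("http://example.org/bible/kjv/en.txt",
                     "A particular ASCII rendering of the KJV Bible in English",
                     kjv_en)
```

Calling `ascii_kjv.generic_chain()` walks from the most specific resource back up to "The Bible".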

Each resource may have a URI. The authority which allocates the URI is the authority which determines to what it refers. Therefore, that authority determines to what extent that resource is generic or specific.

This model is more of an observation of a requirement than an implementation decision. Multilevel genericity clearly exists in all our current life with books and electronic documents. Adoption of this model simply follows from the rule that Web design should not arbitrarily seek to constrain life in general for its own purposes.

Dimensions of genericity

When we discuss electronic resources, an interesting fact is that a small number of dimensions of genericity emerge.

Time A resource may vary with time. For example, "The Wall Street Journal" varies with time. Each issue is a time-specific resource, which does not change with time. Most home pages on the Web change with time, in a less periodic way.

Language When a document is translated, it is useful to be able to refer to it either in the generic, or to a particular specific translation.

Content-Type A given resource may have many ways in which it can be represented on the wire, using different Content-types (in HTTP terms). As an example, an image may be represented in PNG or JFIF format.

Target medium A given resource may be targeted specifically to a specific medium, such as a printer, being displayed on a laptop screen, being displayed on a cellphone, or being projected onto a large screen for an audience. (This is currently available for selecting CSS stylesheets, but is not done at the HTTP content negotiation level.)

The fact that there are such a small number of dimensions currently apparent suggests that Web software should handle them individually in its interface with the user, even though the architecture should handle them as a single concept.
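One way to treat the dimensions as a single concept in the architecture while still naming them individually for the user is a set of flags. This is only a sketch; the type and member names are assumptions.

```python
from enum import Flag, auto


class Genericity(Flag):
    """Dimensions in which the resource for a URI may be generic."""
    TIME = auto()
    LANGUAGE = auto()
    CONTENT_TYPE = auto()
    TARGET_MEDIUM = auto()


def describe(dimensions):
    """Names of the individual dimensions, for presentation to a user."""
    return [d.name for d in Genericity if d in dimensions]


newspaper = Genericity.TIME                       # varies from issue to issue
translated_page = Genericity.TIME | Genericity.LANGUAGE
rfc_text = Genericity(0)                          # generic in no dimension
```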

Derivation

When a document is translated, one of the language-specific resources may have been the original source. However, this need not always be the case. Specific resources may have been derived from unrelated sources, or multiple sources. Therefore, though it is interesting to be able to describe the "derived-from" relationship, this is not part of the genericity relationship. It is not discussed further here.

Genericity Metadata

When making statements about resources, genericity leads to two types of statement. The examples use imaginary HTML elements or HTTP headers as illustrations of the meaning.

Dimensions

A statement about the genericity of an object is important both for the user, and also for example for a cache manager. This statement takes the form of a list of dimensions in which the resource for a given URI is generic.

One proposal was the vary field in the URI: header in HTTP:

URI: http://foo.com/bar/baz vary=time,language

This is a statement about the relationship between the URI and the resource. (See also Quality of service of names)
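A cache manager reading such a field might parse it roughly as follows. The syntax is inferred from the single example in the text; anything beyond it, such as treating a missing vary field as "no variation", is an assumption of this sketch.

```python
def parse_uri_header(line):
    """Parse 'URI: <uri> vary=a,b' into (uri, set of dimension names)."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "uri":
        raise ValueError("not a URI header")
    parts = value.split()
    uri = parts[0]
    dimensions = set()
    for part in parts[1:]:
        key, _, val = part.partition("=")
        if key == "vary":
            dimensions.update(v for v in val.split(",") if v)
    return uri, dimensions


uri, varies = parse_uri_header("URI: http://foo.com/bar/baz vary=time,language")
```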

Relationships

The other statement which can be made is about a genericity relationship between two resources. Typed links provide this kind of statement. One proposal was

which means "This resource is a language-specific version of the resource identified by baz.fr". This needs to be combined with information about the particular language.

So much for the architectural ideas. In practice one would use a shorthand form for all this information such as

or

Using RDF to model this

There is now an RDF ontology for these concepts, http://www.w3.org/2006/gen/ont. The ontology does not describe the target-medium dimension. (Please use that instead of the old one described here in 2000-09.)

Old ontology RDF to model this

Added 2000/09

Now that the RDF metadata architecture is developed, we can model genericity using a set of properties to represent these relationships. The natural way to do this is to define classes for the one-parameter flags such as time-invariant, language-invariant, etc. and properties such as isLangaugeSpecificVersionOf.

Classes

Class name | Significance
u:TimeInvariant | The relationship between a representation of this resource and the URI will not change over time.
u:LanguageInvariant | The relationship between a representation of this resource and the URI will not change no matter what language is requested.
u:ContentTypeInvariant | The relationship between a representation of this resource and the URI will not change as a function of content negotiation of MIME type.
u:Fixed | The relationship between a representation of this resource and the URI will not change under any circumstances.

u:Fixed is a subclass of each of the other three. P3P policies are supposed to be in u:Fixed.

Properties

Property name | Significance | Domain | Inverse property name
u:isVersionOf | A is one of the specific versions of a time-generic resource B | u:TimeInvariant | u:hasVersion
u:isLanguageSpecficVersionOf | A is one of the specific languages (in the sense of HTTP content-language) of a language-generic resource B | u:LanguageInvariant | u:hasLanguageSpecificVersion
u:isContetntTypeSpecificOf | A is one of the specific content-type-specific resources (in the sense of HTTP Content-type) of a generic resource B | u:ContentTypeInvariant | u:hasContentTypeSpecificResource

There is no assurance when one of these properties is used that either subject or object is not itself invariant. In other words, if one states of two identical TimeInvariant resources that one is a version of the other, that is consistent. The promise that neither will change can be made later, and is consistent with an earlier promise that one will not change.
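To make the classes and properties concrete, here is a minimal sketch that holds statements as plain (subject, property, object) tuples rather than using an RDF library. Only u:isVersionOf is shown; the other two properties would follow the same pattern. The u: prefix is used as written in the tables, since its namespace URI is not given, and the example.org resources are hypothetical.

```python
INVERSE = {"u:isVersionOf": "u:hasVersion"}

# u:Fixed is a subclass of each of the other three classes.
SUBCLASS_OF = {
    "u:Fixed": {"u:TimeInvariant", "u:LanguageInvariant",
                "u:ContentTypeInvariant"},
}


def with_inverses(triples):
    """Add the inverse statement for every property that declares one."""
    out = set(triples)
    for s, p, o in triples:
        if p in INVERSE:
            out.add((o, INVERSE[p], s))
    return out


def classes_of(resource, triples):
    """Classes stated for a resource, plus their declared superclasses."""
    direct = {o for s, p, o in triples if s == resource and p == "rdf:type"}
    inferred = set(direct)
    for cls in direct:
        inferred |= SUBCLASS_OF.get(cls, set())
    return inferred


journal = "http://example.org/journal"
issue = "http://example.org/journal/1996-03-01"
policy = "http://example.org/p3p-policy"
statements = {
    (issue, "u:isVersionOf", journal),
    (issue, "rdf:type", "u:TimeInvariant"),
    (policy, "rdf:type", "u:Fixed"),
}
```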

The Good Things on the Internet

timbl@w3.org (Tim Berners-Lee) — Sat, 27 Jul 2024 00:00:00 GMT

There is a growing movement to understand, fix, and mitigate the problems of the web, and specifically of social media. Parents of children and youth worry about the potential harm to their offspring from engaging in democracy, and schools wonder whether to just ban phones for kids. There are a lot of important and good things on the web - which in fact come from the vast majority of the web sites and apps. We need to recognize that, and make sure we and our children make the best use of it, while protecting ourselves from the harms. When you look at all of the things to do on the web, or in the apps, then the majority are actually not damaging, many are in fact good - and many are actually wonderful. There are the pre-web systems like email, podcast and blog readers, and chat. There are web platforms which are beneficent, including open source systems. There are systems built on top of the Solid Protocol, which naturally provide users with a power that we call digital sovereignty.

Read whole article...

Putting Government Data on the Web

timbl@w3.org (Tim Berners-Lee) — Mon, 01 Jun 2009 00:00:00 GMT
Government data is being put online to increase accountability, contribute valuable information about the world, and to enable government, the country, and the world to function more efficiently. All of these purposes are served by putting the information on the Web as Linked Data. Start with the "low-hanging fruit". Whatever else, the raw data should be made available as soon as possible. Preferably, it should be put up as Linked Data. As a third priority, it should be linked to other sources. As a lower priority, nice user interfaces should be made to it -- if interested communities outside government have not already done it. The Linked Data technology, unlike any other technology, allows any data communication to be composed of many mixed vocabularies. Each vocabulary is from a community, be it international, national, state or local; or specific to an industry sector. This optimizes the usual trade-off between the expense and difficulty of getting wide agreement, and the practicality of working in a smaller community. Effort toward interoperability can be spent where most needed, making the evolution with time smoother and more productive.
Read whole article...

The Intimacy Gradient

timbl@w3.org (Tim Berners-Lee) — Sun, 13 Mar 2022 00:00:00 GMT

A city has public places where I can do all kinds of things, and also a private house with a private room where I may be by myself. In that house there are spaces where I do things with family, friends, colleagues. The web must, like a well-designed building, provide a gradient of intimacy between the private and the public, so I can easily recognize the difference, easily know which I am in, and easily welcome people to come into the more intimate areas. Our Solid tools should respect these ideas.

Read whole article...

HTML and XML

timbl@w3.org (Tim Berners-Lee) — Mon, 19 May 2008 00:00:00 GMT

HTML and XML

W3C AC meeting, 2008-05-19

The goal of this document is to investigate the possibility, over time, of healing the rift between the HTML5 and XML technologies, to achieve interoperability between software and markup which are currently on two sides of the fork.

The method is to try to understand the motivations of the various positions, and address those at source, and not to use them to decide that a particular fork is "right".

The content of this essay is accumulated from many sources. It was given in large part as a talk to the May 2008 W3C Advisory Committee meeting, posing a series of questions about future directions for HTML. Discussion of this topic is directed to the W3C TAG list, www-tag@w3.org (archive).

Introduction

The development of Web technology advances at different speeds on different fronts and different times. Occasionally it seems that some strategic thinking is necessary in order to ensure that the system as a whole will continue to work well and evolve smoothly. This is one of those times.

The fork

The purpose of this essay is not to detail the history, but let me start by summarizing quickly to set the context. HTML is the most widely deployed document format by a long way in the history of computing. XML, also, is very successful, being a framework for many formats public, private, in many different applications. As a simplification of the original SGML, on which HTML was based, XML allows code to be lighter and faster than SGML systems, and makes it easier for developers. We have seen in recent years a hiatus in the development of HTML, followed by a more recent surge along two branches. One branch of HTML, XHTML, which switched from using SGML to using XML, provided various new features, used the XML namespaces extensibility system, but was not widely deployed in the dominant Web browser, Internet Explorer. Another branch, HTML5, has been specified with the explicit goal of describing exactly the rather contorted behavior existing browsers implement to handle the legacy of Web pages found in practice on the Web, as well as introducing a different set of new features (video tags, etc). While it provides for an optional XML serialization, HTML5 does not in general use XML and specifically does not use XML namespaces. Below we unravel the separate criticisms of XML and XML namespaces.

The existence of the fork is a serious problem, both because a fork in standards is fundamentally costly for the whole community going forward, and because of the technological problems which are highlighted in the issues which each branch has with the other branch.

Arguments for cleaning up

Now, there may be extreme versions of the HTML5-fork style which maintain that everything is fine, and that the mess is just life; we will have to live with liberal parsers forever, and that is the only realistic approach. However, not only is the code stack horrible to maintain, but pages that are not well formed are hard to maintain, process, and reuse.

Also, there is a whole world of XML-based software in the enterprise, some of it SOAP-based services, some of it more document-oriented, whose developers could not imagine for a moment deviating from the XML path by allowing this sort of liberalness, as systems would just stop.

Can we assume that the HTML Web and the XML enterprise systems will be non-interoperable worlds? Possibly, but with a constant cost, whenever attempts are made to move data from one to another, to embed some HTML product description into an order, for example. The boundary will never be clear as in fact there is an overlap. Some suggest making a version of SVG which is in HTML5 (liberal) format, while others use XML engines to process SVG. People embed HTML in RSS and Atom feeds and RDF feeds (using RDF's XMLLiteral datatype) and RDF parsers don't have embedded HTML5 parsers, so it has to be well-formed. And so on.

To continue to promote messy code on the Web is to create problems and pain later on. To promote clean XML is a current pain for real users which they will not put up with. How can we escape from this? To understand possible paths forward let us look at how the language is typically extended, on each fork.

Centralized HTML extensibility

The HTML community has not embraced URI-based extensibility. In fact, decentralized extensibility is not a general goal to many. This is not surprising. HTML is the most widely deployed data format in history by a long way. Every Web browser is expected to be able to handle it. Its evolution is a form of ongoing negotiation between users, Web developers, and browser developers (and their management). The HTML language itself has a unique place among other languages. The model of a large number of small overlapping communities, which was the target of the RDF design, does not apply to the HTML language.

It is not surprising that requiring each HTML document to start with a namespace declaration irks those for whom the whole world is HTML. When everyone is deemed to know the HTML spec, why have it vectored to by the XHTML namespace and the namespaces specification?

Decentralized extensibility allows new modules to be added to a language by third parties, but why bother when the modules which are generally proposed for addition to HTML, such as SVG, MathML and XForms, can be counted on the fingers of one hand? In this case the HTML design authority can simply add new modules themselves. If extensions are needed, then they can just be added to the specification. The list of modules can be made available to everyone, as all systems are expected to be programmed with an inbuilt knowledge of the HTML spec.

While browser plug-ins can be dynamically downloaded from the net, in general HTML extensibility from one level to the next has not been done in that way at all. Using the foundational rule that browsers have from the beginning ignored tags they did not understand, new HTML tags have been added in a calculated way so as to hopefully maximize the benefit and minimize the damage to the community as a whole. Historically, the HTML working group did not make any commitment that the meaning of tags would not change over time, only that change would be made as responsibly as possible.
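The foundational rule, that a client ignores tags it does not understand but still shows their content, can be sketched with Python's standard HTML parser. The tiny KNOWN_TAGS vocabulary and the class name are made up for the example; no real browser works from such a list.

```python
from html.parser import HTMLParser

KNOWN_TAGS = {"p", "em", "a"}   # a deliberately tiny, made-up "old" vocabulary


class TolerantRenderer(HTMLParser):
    """Collect text, honouring known tags and silently skipping the rest."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.ignored = []

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_TAGS:
            self.ignored.append(tag)     # unknown tag: note it and carry on

    def handle_data(self, data):
        self.text.append(data)           # content is kept either way


renderer = TolerantRenderer()
renderer.feed("<p>Watch <blink>this</blink> space</p>")
renderer.close()
rendered = "".join(renderer.text)
```

An old client loses the new tag's behavior but not the words inside it, which is what allowed new tags to be introduced without breaking existing browsers.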

Decentralized extensibility

Let us investigate the philosophy, now, of the XML branch.

In a world in which there are very many XML-based technologies, and many groups needing to create new ones and extend old ones, a major motivating requirement has been decentralized extensibility. This is the requirement for a group to be able to define the terms involved in the new technology without having to get an audience with, and agreement from, a central committee. (Examples of centralized extensibility include the Dewey decimal system, the Library of Congress cataloging system, and the international phone number space.)

URI-based extensibility

In the Web environment, decentralized extensibility can be done using HTTP URIs. Basically this means that any group which can lay claim to some (normally HTTP) URI space can pick a URI for a new feature, without having to go through any centralized clearing house (other than the domain name system). It also means that the namespace URI can be used to give pointers to developers, or, with very persistent caching, machines, willing to learn about the new features.

Historically, namespaces were actually a requirement on XML Namespaces imposed by RDF, which was developed in parallel with XML. RDF is aimed at a multitude of communities all independently agreeing on different though connected sets of terms, and then being able to merge their datasets which use these terms. URI-based extensibility has been very successful in the world of RDF itself, as many ontologies have been developed without central coordination by the RDF working group, which indeed closed long ago. One might argue that arbitrary non-RDF XML applications cannot use URI-based extensibility in the same way, as they do not have the very powerful "ignore triples you don't understand" model of RDF, but a counterexample would be the use of independent namespace-qualified tag names in SOAP messages, headers and content. Another example would be the EXSLT group, which uses namespaces to extend XSLT.
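As an illustration of how namespace URIs keep independently defined vocabularies apart in one document, the following sketch groups element names by the URI that owns them. The orders namespace is hypothetical; the XHTML namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

doc = """<order xmlns="http://example.org/orders"
       xmlns:x="http://www.w3.org/1999/xhtml">
  <item>Widget</item>
  <x:p>Ships in <x:em>two</x:em> days</x:p>
</order>"""


def split_name(tag):
    """Split ElementTree's '{namespace}local' form into its two parts."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


owners = {}
for element in ET.fromstring(doc).iter():
    namespace, local = split_name(element.tag)
    owners.setdefault(namespace, []).append(local)
```

Neither group needed to ask the other before choosing its names; the URI prefix alone rules out collisions.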

Follow-your-nose principle

The use of HTTP URIs for extensibility is not just a question of allocating names unambiguously. The fact that HTTP URIs have ownership means that there is a responsible authority who can be traced and called upon to explain what a term is supposed to be for and how it relates to other terms.

In fact, as we use HTTP URIs, one can in real time look up that information. Although, for the sake of the servers, the looking up of a namespace document should be viewed as an installation process with a permanent cache, a machine can usefully pick up information at run-time which will allow a system to usefully process a vocabulary which it has not before encountered. This again is much more developed in the RDF world, where ontologies can contain enough information for a new user interface to be created on the fly.

The follow-your-nose principle, then, allows a form of bootstrapping. Like any bootstrap, though, it needs a base to start from. In this case, there is a core set of specifications which a client has to understand in order to do the bootstrapping. Examples of these core specs are Ethernet, TCP and IP, DNS, HTTP, and the Internet Content Type (also known as MIME type) registry.

By one model, a content type of text/html in an HTTP response indicates an HTML document. A content type of application/xhtml+xml indicates an XHTML document.

By another model, a content type of application/xml indicates an XML document, and if, within such a document, namespaces are used for the document element, then the XHTML namespace URI (http://www.w3.org/1999/xhtml) within it indicates an XHTML document.
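The two models can be combined in one dispatch step, sketched below. The function name and the returned labels are assumptions; the content types and the XHTML namespace URI are those given in the text.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def document_kind(content_type, body):
    """Decide what a response is, first by content type, then by namespace."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "text/html":
        return "HTML"
    if media_type == "application/xhtml+xml":
        return "XHTML"
    if media_type == "application/xml":
        # Second model: follow the namespace of the document element.
        root = ET.fromstring(body)
        if root.tag.startswith("{" + XHTML_NS + "}"):
            return "XHTML"
        return "XML"
    return "unknown"
```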

Recent controversies

Recently, controversies have arisen as various groups have attempted to create new feature sets suitable for adding to HTML and similar languages. One of these is ARIA, which allows a Web page to be annotated to explain the user interface function of various elements, and another is RDFa, which allows a Web page to be annotated to explain the meaning of various elements and add more data. Each of these technologies, like many other technologies one can imagine, works by adding new attributes (and sometimes elements) to the markup.

In ARIA, about 30 new attributes are added. In the XML fork, in one design, these were added using an aria namespace, as, say, aria:foo, while in the HTML5 fork, they were added as aria-foo in the HTML namespace. The arguments about these choices were fairly long and complex, and involved for example discussions of what exactly legacy browsers would do with the DOM in each case. The users of the spec are not just document writers, but also those who write scripts to access and interpret the attributes. In any event, there was no way one could write the same thing in both languages at both the markup and the script level, it seemed.

In RDFa (derived from "RDF in attributes"), the requirement was to add new attributes to allow semantics to be given for embedded HTML data. The GRDDL specification, an existing recommendation for pointing to a transform script which extracts RDF from a document, is a possible point of leverage in the follow-your-nose story, if one takes GRDDL as being, for semantic Web clients, part of the bootstrap core functionality.
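The script-level difference can be seen by parsing the two spellings with the corresponding kind of parser. This is a sketch only: the namespace URI below is a placeholder rather than the one actually proposed for ARIA, and "foo" is the same stand-in attribute name used in the text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

ARIA_NS = "http://example.org/aria"   # placeholder, not the real namespace URI

xml_markup = '<div xmlns:aria="%s" aria:foo="bar"/>' % ARIA_NS
html_markup = '<div aria-foo="bar"></div>'

# XML fork: the attribute name is the pair (namespace URI, "foo").
xml_attrs = dict(ET.fromstring(xml_markup).attrib)


class AttrCollector(HTMLParser):
    """Record the attributes of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        self.attrs.update(attrs)


# HTML5 fork: the attribute name is the flat string "aria-foo".
collector = AttrCollector()
collector.feed(html_markup)
html_attrs = collector.attrs
```

A script written against one of these keys finds nothing when handed the other form.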

In the XML fork, extensibility is achieved using namespaces, but in the non-XML fork, there are a number of less obvious options, which include the addition of all new attributes to the HTML world, as though they were in the HTML5 spec. In this case, the social question is: can a group just announce that it is adding attributes to the HTML namespace*, or does it have to get it put there or at least agreed with the HTML design authority? In the normal world of standards, the latter is the rule, as each specification needs, it is felt, a coordinating body. In the HTML world, though, introduction of new tags by vendors, and new attribute values (such as rel="nofollow") is often done without such coordination; the 'marketplace' decides which tags live and which don't, and the low probability of collision replaces the use of clearing houses for new names and values. [fn1]

In practice, then, ARIA and RDFa have proposed to add new attributes (and/or elements) to HTML, deeming them to be added by dint of the existence of new specifications, seeing whether they get adopted by a community of readers and writers once specified, and seeing whether they appeal enough to those involved in the mainstream HTML language evolution to be worth either inclusion or reference.

So, can we just use a different model for HTML, because of its special place among languages?

While these are two recent examples, one soon discovers many examples of the development or integration of new technologies in this area:

The SVG community has made a very modular specification intended to be mixed with other markup languages, originally using namespaces.

The Mobile HTML specs have used XHTML very cleanly, and XHTML has been integrated with SVG in some cases, following the XML fork. SOAP systems enclose all kinds of XML in their payload, and can include XHTML within that where textual data is present in a remote service invocation or response.

Meanwhile, by contrast, suggestions have surfaced that SVG should be integrated into HTML5 simply by pouring the SVG tags into the HTML specification, using no explicit extensibility controls at all.

So it is impossible to draw a line around HTML as a special case isolating it from the mass of different communities developing their individual applications. So what can be done?

Scale free space

The Web is, as I have mentioned before, composed of many different communities of different sizes, and often is seen to have scale-free properties. That is, for example, that there is no 'typical' number of inbound links to a page, but the distribution follows a power law. This is partly a measured phenomenon of the Web; it is also a phenomenon which occurs in many other systems, and also I have an unproved hunch that it represents a form of optimal arrangement for society to function effectively. It may be the optimum tradeoff between the ungainliness but great interoperability of a central language and the agility of small communities using a Babel tower of different languages.

It is a characteristic feature of such scale-free systems that they have one leading player, closely followed in the popularity ranks by other players in decreasing popularity.
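In symbols, a power law for inbound links can be written as below. The notation is an assumption of this note and does not come from the essay: k is the number of inbound links, P(k) the fraction of pages with k links, and γ a positive exponent. The second relation is what "scale-free" means: rescaling k by any factor a changes P only by a constant factor, so no value of k is singled out as typical.

```latex
P(k) \propto k^{-\gamma}, \qquad \gamma > 0
\quad\Longrightarrow\quad
\frac{P(ak)}{P(k)} = a^{-\gamma} \quad \text{for every } a > 0 .
```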

In the case of vocabularies on the Web, we have HTML at the largest scale, in which tags are just tags and everyone is supposed to know them. One could argue that SVG actually belongs at this level and should be and will be as widely deployed as HTML.

At the next level we have languages which are not HTML but still address the needs of very large communities. SVG, MathML and ARIA are examples.

There are many medium-sized communities. The Facebook Markup Language (FBML) is an example of a vocabulary proposed by one website, though a significant site. Atom feeds for various things can be considered at this level. Also, enterprise systems include a great many XML namespaces which are developed, for example, in SOAP-based applications.

Continuing on (roughly) down the scale we get to vocabularies for protein scientists and history museums, for scout troops and bird fanciers; we get vocabularies invented for today's experiment in a lab, for the import of a particular spreadsheet and so on.

It is reasonable for us to not just sit back and admire the scale-free nature of the space, but to actively engineer for it. What does this mean in this case? It means that we should engineer the system with an understanding that HTML is a dominant language (at the moment) used by a very large community of individuals, but with an understanding that there are many other communities, many other languages and specifications, and that these often have to be able to connect with the HTML architecture.

I would like to investigate the possibility of us deliberately designing ourselves a system which is optimal, in that it addresses the needs of all parties, and brings the two branches of the fork into the same space, so that there is a continuum of extensibility. We start by looking at the issues and problems that arise when attempting to use XML and namespaces as the basis for the HTML5 fork.

XML Issues

So what are some of the issues with XML which drive the HTML5 fork away from becoming closer to the XML fork?

Issue | Motivation
It is a pain to have to add quotes around attributes | Ease of use
It is a pain to have to spell the entire tag in the end tag | Ease of use
Parsers must stop on error | Unfriendly, impractical
Namespace URIs take too much space | Impractical
Non-nested begin/end tags have to be accommodated | Legacy tag soup

At the top are the ones which one could imagine being cured by a redesign of XML. To the bottom are the things which I would resist changing in HTML. In the middle are areas where one could imagine some compromise.

One fundamental difference of philosophy between the forks has been the attitude to deviations from the specification. In the past, people making Web pages have made many deviations from the specifications, so long as they worked. The result is a legacy of Web pages which have all sorts of errors. It has been essential in the market for browsers that they work with these pages. The approach taken in HTML5 has been to document the behavior of these browsers, so that everyone knows what it is. The goal is that all old pages still work, but there can now be a well-defined algorithm and a test suite, instead of a heap of connected kludges implemented separately at great cost by each browser maker. This world is, then, very liberal, in what the Web page writers are allowed to do, and in what the client software has to accept.

The initial approach taken in the XHTML fork was very different: it was completely conservative. Recognizing that the situation which had arisen with legacy HTML was a big mess, XHTML started anew. A new content type was allocated for XHTML. The XML specification required that any processor deliver no results if the input was not well-formed XML. The idea was for XHTML to start a new branch of clean content which would eventually outgrow the old, and which would be a platform for much cleaner growth, with namespace-based extensibility, and addition of SVG, MathML, XForms, and enterprise-specific extensions in a well-defined way. Organizations and individuals who have adopted XHTML are often vocal in their praises for the benefits which they experience, but this has evidently not led to any substantial inroads into the dominance of HTML in the general public web.
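The contrast between the two attitudes can be demonstrated on a single slightly broken page. In this sketch Python's standard XML parser stands in for a conservative XHTML processor and its standard HTML parser for a liberal browser; real browsers apply far more elaborate recovery rules than this.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

page = "<p>Hello <b>world</p>"      # the <b> element is never closed


def conservative_render(markup):
    """XML style: deliver nothing at all unless the input is well formed."""
    try:
        return "".join(ET.fromstring(markup).itertext())
    except ET.ParseError:
        return None


class TextCollector(HTMLParser):
    """Gather every run of character data, whatever the tags look like."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def liberal_render(markup):
    """HTML style: recover whatever text can be found."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)
```

Here `conservative_render(page)` yields nothing, while `liberal_render(page)` still shows the reader the text.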

Robustness Principle

The Internet specifications, since RFC793, have been developed with guidance from a principle that one should be conservative in what one generates, but liberal in what one accepts. This is often a useful maxim, when writing a program to send or receive messages, and when there is a possible area of the spec open to interpretation. So one would send lines of limited length, but accept lines of any length; send always the same case as in the examples, but accept either upper or lower case, and so on.
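The maxim can be sketched for a simple header line. The two function names are made up; the point is only the asymmetry between a sender that emits one canonical form and a receiver that accepts many.

```python
def send_header(name, value):
    """Conservative sender: one canonical spelling, one canonical layout."""
    canonical = "-".join(part.capitalize() for part in name.split("-"))
    return "%s: %s\r\n" % (canonical, value.strip())


def receive_header(line):
    """Liberal receiver: accept any letter case and stray whitespace."""
    name, _, value = line.partition(":")
    return name.strip().lower(), value.strip()
```

For instance, `send_header("content-type", "text/html")` always produces the same bytes, while `receive_header` treats `CONTENT-type :  text/html` and `Content-Type: text/html` alike.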

This maxim works when two programs are communicating with short-lived messages, and when there is feedback between engineers when a system doesn't work. It has not worked so well on the Web, because the Web page designers in fact paid no heed to being conservative. They were not in general engineers who had read the spec at all, but random people copying each other's Web pages, and seeing what worked when they modified them. Further, the Web pages have a long, hopefully very long, lifetime. Once a Web page is out there with badly nested tags, it is out there for good. So on the web, there are some page creators who are no longer present, and others who are around and are open to new languages and constructive feedback. Should the robustness principle be used or, if not, what?

Incentives

To look at a system which includes people, one must study the incentives for those people. Suppose there is, on the one hand (and on the X axis) a certain effort which a Web page author puts into the writing of a Web page, to eliminate various levels of error, and on the other hand (and on the Y axis) a reward given, in part, in terms of the quality of the rendered Web page on the range of clients perceived to be of interest.

In the case, shown above, of the conservative, XML fork browser, the page must be completely correct or nothing is rendered. The writer who has an almost perfect page is motivated to fix it, but the writer who has a page with several errors is not, as there will be no noticeable reward for incremental improvement. It is not very surprising that the majority of Web users whose pages would have started off near the left of the graph did not make it to the right when serving their code as XHTML.

Some errors we may consider hopeless even in HTML, in that no useful recovery seems possible for them. In the case of the liberal browser (above), the reward for a hopeless page is zero, but for a page with any other level of errors, it in fact is rendered completely by the browsers. Therefore, a writer whose page is hopeless is motivated to clean it up a little bit. But the writers of pages which have other levels of error are not motivated to clean them up at all.

So while the liberal and conservative forks have very different philosophies, they share one thing: They do not motivate the writer of a Web page to progressively improve their offering.

Bringing the fork together

The solution, as I see it, is to look at the motivating slope and fix it. When the user is provided with incremental rewards, then he or she will move, hopefully, up the slope.

Motivating slope

What does this mean?

It means distinguishing more than the two possible outcomes of success and failure. We need to make a slope, so we need different levels.

It means recognizing all the errors as errors, but also allotting them an importance level, so that users can concentrate on fixing the more important ones, or perhaps the ones which give the best improvement per effort ratio.

There has been push-back against the idea of showing error indicators on Web pages, because no one wanted to be the browser to give the sub-optimal user experience. This can change in several ways. It can change because of user attitude changes. Al Gore points out repeatedly that we need to clean up the planet. People understand when we have to do some clean up. A browser which does not have these features would be seen as irresponsible in this context.

So step one is to have a tool bar which slides down when a page has errors, giving a rating to the page out of 100, and allowing drill-down by interested users. It is true that most users are not interested and are not able to do something about a random site they visit. However, they might still be interested in the fact that the site is not clean. People who buy a business may be interested in knowing whether that business pollutes the planet. Similarly, people may be interested in knowing whether the HTML that they publish or the sites that they visit are polluting the Web.

Another possibility is to allow users to specify which Web sites they are connected with. Anyone involved in the production of a Web site (up to the CEO and board for the company!) should be able to put that site into the list of sites for which they want more detailed feedback.

Changing the browser

We can also be smarter. We can make it so much easier for people to do the right thing.

The classic way the Web spreads is by the "View Source effect". You like someone's Web page, you do a View Source operation in the browser, and then you copy it and paste it into your own Web page. This is the way Web technology has spread, and also the way of course all those problems have spread. Suppose, whenever I look at the source of a page, I see a cleaned up version? Suppose it is impossible (or very difficult) to actually see the original source without it being heavily marked with the places where it has syntactic errors. Suppose if I copy it in a clipboard, then I get the cleaned up version? Suppose this applies to "Save As" too? The code to clean up a Web page is not that big by today's standards. There are many implementations, Dave Raggett's tidy being a well-established one, now also found in Marc Gueury's HTML Validator Firefox Extension.

(One way to do it of course is to simply re-serialize the DOM tree of the page as loaded. This of course loses the formatting, which in general is a disadvantage, particularly when one needs to compare versions of source files, or use source code control systems which do so).

It wouldn't have to be perfect. It would have to move a page substantially along the curve toward the clean end of the spectrum.
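As a rough illustration of how small such a clean-up step can be, here is a minimal sketch using only Python's standard html.parser module. It is not Tidy and not any browser's code: the class name Cleaner, the function clean and the short VOID list are assumptions made for this example. It re-emits markup with quoted attributes and balanced end tags, and it knows nothing about HTML's implied end tags (for example for li or p), so it only moves a page part of the way along the curve.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag (a deliberately short, illustrative list).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Cleaner(HTMLParser):
    """Re-emit markup with quoted attributes and balanced tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of cleaned-up output
        self.stack = []  # names of elements that are currently open

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<{}{} />".format(tag, rendered))
        else:
            self.out.append("<{}{}>".format(tag, rendered))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it
        # Close anything still open inside this element, innermost first.
        while self.stack:
            top = self.stack.pop()
            self.out.append("</{}>".format(top))
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def cleaned(self):
        # Close whatever is still open at the end of the input.
        tail = "".join("</{}>".format(t) for t in reversed(self.stack))
        return "".join(self.out) + tail


def clean(source):
    parser = Cleaner()
    parser.feed(source)
    parser.close()
    return parser.cleaned()


cleaned = clean("<p class=x><b>hi<i>there</b></p>")
```

A "View Source" or clipboard hook could hand the author the cleaned string instead of the original bytes. As the parenthetical above notes, this kind of re-serialization discards the original formatting.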

There are some things which browser manufacturers could do right now, which could in fact change the ecosystem of developers and pages so that in a year or two a significant number of new pages were being produced cleanly, and in a few more years as the new content starts to dominate, the majority of the pages you see on the Web would be clean.

We are not talking here about a switch to application/xhtml+xml, but continuing to use the MIME type text/html and progressively improving the content we produce so that it becomes cleaner.

Would that be a good idea, and what exactly would it mean?

Well, in fact if everything was XML some people might regard it as actually less useful than the current HTML, when it comes to quotes around attributes. So this forces us to look at whether XML could actually change itself, to meet HTML between their current positions. So I would suggest that some of the things we would have put on the slope, some of the cleanliness goals, we simply remove and declare them non-goals. But to do that, we have to change XML.

Changing validators

It turns out that the opinion of the W3C validator has a large amount of clout in the community. Specifications such as microformats and ARIA have been affected very much by what can be done without breaking validation. Now the validator to date has been a DTD-based validator, so it checks that the document conforms to a given grammar. It requires, to be happy with the page, a DTD declaration which specifies what grammar the author of the page thought the page was written with.

DTD validation will not allow the normal forms of XML extension, the addition of new elements and attributes. This is very ironic in a way. The "X" in "XML" is for "Extensible". The whole point is that an application written in XML can be extended by adding new element and attribute types. With namespaces, these elements and attributes become grounded in the Web, and URI space provides a way of avoiding any collision.

In this vision of a way forward, validators, or perhaps one should say page checkers, as validate is a word claimed by XML for DTD-validation, should give a grade to a page, judging it on several counts, at various levels. At the error level:

Content-Type wrong

Character encoding (if marked UTF-8 is it really UTF-8?)

Well-formedness: Bad nesting, missing end tag

HTML elements misplaced according to some kind of grammar

At a warning level:

Extension tags used with no namespace

Extension tags used, in a namespace without a namespace document

At an informative level:

Extension tags used with a namespace and namespace document

Extension tags defined in other W3C recommendations

Quotes missing from attribute values which do not contain spaces
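The grading idea in the lists above can be sketched in a few lines. The penalty weights, the Finding class and the function names below are all invented for illustration; the text asks only for several levels and a rating out of 100, not for these particular numbers.

```python
from dataclasses import dataclass

# Hypothetical penalties per finding; the real weights would be a community decision.
PENALTY = {"error": 20, "warning": 5, "info": 0}
ORDER = {"error": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class Finding:
    level: str    # "error", "warning" or "info"
    message: str


def grade(findings):
    """Return a score out of 100; every finding is still reported elsewhere."""
    deducted = sum(PENALTY[f.level] for f in findings)
    return max(0, 100 - deducted)


def most_important_first(findings):
    """Order findings so an author can concentrate on the serious ones."""
    return sorted(findings, key=lambda f: ORDER[f.level])


findings = [
    Finding("info", "Quotes missing from an attribute value"),
    Finding("error", "Bad nesting: missing end tag"),
    Finding("warning", "Extension tag used with no namespace"),
]
score = grade(findings)
report = [f.level for f in most_important_first(findings)]
```

Because the score falls gradually rather than jumping between pass and fail, each fix moves the author a little way up the slope.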

In fact, it may be that the browser, now a computing platform of some power, is in fact the best development platform for a page check in the future. It is possible that the same code in fact could be deployed in a third party server checker harness as in a client-side checker.

Changing XML syntax

The arguments against changing XML are very strong. Its single great value is its single common specification, its stability. It isn't perfect but it is common across so many different applications. Attempts to create an XML 1.1 failed, even though they were just trying to introduce a few new Unicode characters.

The arguments for changing it are that the alternative could be worse: It could be that the HTML5-style syntax with errors of all kinds being completely ignored propagates into first SVG then RSS then RDF and then SOAP. The entire stack has to be built so as to be able to do HTML5 error ignoring, with special knowledge that comes with that of various HTML tags. Even if you aren't using HTML itself, you just have to use that parser.

What would be changed? It would be recommended that parsers recover from errors where they can, and indicate all errors above a certain level of seriousness to the user.

Now everyone I have spoken to about this has their own list of things they would like to change in XML, if we were to do it, so deciding what goes in would be an interesting communal decision. Here is a list of some things which have come up. Some are better ideas than others, in my humble opinion.

Allow attribute quotes to be omitted for simple values.

Allow namespace to be implicit, given Content Type

Short-hand for switching from one namespace to another? (grounded in Namespace document)

Short form of close tag ?

Remove DTDs.

URIs for grammar, cross-schema links for mixed NS

Remove PIs (Have a xml:pi element if you like?)

Multiple root elements or mixed content as document?

(See Tim Bray 2002, Norm Walsh on XML2.0 in 2004, 2008, ...)

Let's go through these in order to clarify what we are talking about.

Optional Quoting of Attributes

The quoting of attribute values I have already mentioned. The quotes in SGML were not necessary. When SGML was simplified to XML, the quotes were made mandatory. This simplified the parser, but it complicated life for writers, and required more keystrokes, disk space and bandwidth. It also made the source more difficult to read by increasing clutter.

Implied namespace

The implied namespace idea comes from a consideration of the follow-your-nose argument above. If an HTML document is delivered with a Content-Type which labels it as HTML, then why on earth does this information have, in XHTML, to be repeated in the document as the root namespace element? It is a waste of space and an imposition on the user. However, whether the page has an explicit namespace or not, I would like to be able to parse it and look for elements in the DOM using the XHTML namespace. So I would like all HTML elements to be deemed to be in the XHTML namespace. This is actually I think a sensible change to the architecture, that:

With XML-based content, the MIME type registry contains an implied namespace. For text/html, this is the XHTML namespace

The XML parser interface is extended to include an extra parameter, the implicit namespace

(Note that while this is a default for the namespace, as the term default namespace already means the namespace for elements with no prefix, we can't use it for this concept of the namespace for the default namespace when there is no namespace declaration.)

This would make SVG documents smaller as well, and who knows what else. It could be useful for cutting down the transmission time for small XHTML documents to mobile devices, and so on.
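A minimal sketch of the implied-namespace idea follows, assuming a hypothetical lookup table and function name (IMPLIED_NAMESPACE, parse_with_implicit_namespace). A real XML-2 parser would take the implicit namespace as a parameter while parsing; this example approximates that by parsing with Python's standard ElementTree and then relabelling elements that ended up in no namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: MIME type -> implied namespace.
IMPLIED_NAMESPACE = {
    "text/html": "http://www.w3.org/1999/xhtml",
    "image/svg+xml": "http://www.w3.org/2000/svg",
}


def parse_with_implicit_namespace(source, content_type):
    """Parse XML, placing un-namespaced elements in the implied namespace."""
    root = ET.fromstring(source)
    implicit = IMPLIED_NAMESPACE.get(content_type)
    if implicit is not None:
        for element in root.iter():
            # ElementTree writes namespaced tags as "{uri}local".
            if not element.tag.startswith("{"):
                element.tag = "{%s}%s" % (implicit, element.tag)
    return root


XHTML = "{http://www.w3.org/1999/xhtml}"
root = parse_with_implicit_namespace(
    "<html><head><title>Hi</title></head><body/></html>", "text/html"
)
title = root.find(XHTML + "head/" + XHTML + "title").text
```

The document carries no namespace declaration at all, yet a script can look elements up under the XHTML namespace. Attributes are left alone, as they would be under the current namespace rules.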

What about mixed documents? Well, the HTML mime type could be registered so that the implicit default namespace is XHTML, but also there is an implicit s: namespace for SVG and m: for math, and so on. A machine-readable list could be made centrally available, changed occasionally, and downloaded at install time (not run time, to save the servers!) in XML-2 parsers.

Switching namespaces

The fact that documents have a well defined meaning as grounded in the Web traces back to the terms being defined using URIs. This does not, however, mean that URIs have to be embedded in their full glory at every step. Namespaces already serve as an abbreviation system. Now we could add a sort of chaining within documents. For example, one could define an

I am not sure that this is a good idea, as it makes the amount of off-line information needed more complex, as one would have to have a way of specifying this in a schema language of some sort, and it would be impossible to parse the document correctly without that schema document. But it could be a valuable avenue to explore if there continues to be push-back against namespaces.

Remove DTDs

The DTD syntax within XML is a historical artifact. It was part of SGML, used for defining grammars of SGML applications, but not itself using SGML syntax. The DTD language was kept in XML as at the time there was nothing to replace it. Since then, DTDs have been joined by XML Schema, Relax-NG, and other languages for specifying constraints for applications of XML. Meanwhile, DTDs have fallen behind in that they do not naturally accommodate namespaces. A large amount of infrastructure has been constructed around them in the XHTML fork's HTML Modularization spec.

The main reason for keeping DTDs in XML systems has been that they are needed for defining entities, and specifically character entities. The solution to this I would suggest is to define a namespace of tools to do this in XML. One could even take part of the xml: namespace.

This would also mean that one would lose the feature of default values for attributes and fixed values for attributes. These are a strange feature of the language in many ways. They make this unfortunate difference between a raw infoset and a post-validation infoset. They allow the DTD designer to say "even if you didn't put this attribute in, you still meant it". Of course the semantics of the application language can always be defined to have defaults, even when they are not provided by a DTD processing step.

There have of course been many discussions of this topic over the years.

Processing Instructions

Processing Instructions (PIs) are a strange corner of the XML specification which could be removed to advantage. PIs provide a form of "machine-readable comments" which sit between code (normal markup and text) and comments (which should be completely ignored in the application semantics).

Where one is tempted to use a PI, one should use a namespace to add an attribute for example to the root element. That allows one to have many levels of hint to different possible processors and interpreters about different things. After all, why have three levels when you can have n? (In fact, in RDF, I would often recommend that comments be left as rdfs:comment statements so that they are preserved in the processing and enlighten people reusing the data in completely different contexts)
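For illustration only, here is how a processor hint might be carried as a namespaced attribute on the root element instead of a PI, using Python's standard ElementTree. The namespace URI and the attribute name stylesheet are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and attribute name for a processor hint.
HINT_NS = "http://example.org/hints"
HINT = "{%s}stylesheet" % HINT_NS

root = ET.fromstring("<report><item>1</item></report>")
root.set(HINT, "report.css")  # the hint lives in the ordinary data model

# The hint survives a serialize/parse round trip like any other attribute.
serialized = ET.tostring(root, encoding="unicode")
hint = ET.fromstring(serialized).get(HINT)
```

A processor that does not know the hint namespace can simply ignore the attribute, and any number of such namespaces can coexist on the same element.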

PIs are a kludge. The question of what is inside them, and what it means,

Close tag abbreviation

A commonly suggested shortcut, while we are discussing shortcuts, is to allow the closing tag to be given as </>. I understand this was a question of debate in the original XML design, and did not get in at the time. It is less self-documenting and less robust in the face of certain errors, but it can save a lot of space for enterprise applications where tag names can become very long. Also, for machine-generated code where operator error is not a problem and indenting can be done automatically, it clearly cuts down on the size of the file.

Multiple root elements, or mixed content as XML document*

A characteristic which XML does not currently have is the ability to concatenate two valid XML documents and get a new big XML document. This property would have its uses. To make it possible, one could allow mixed content (a mixture of elements and text) at the outermost level. Advantages include that

It would be possible to transmit arbitrary XML content (for example from a selection, or the answer to a question on a form) as an XML document itself.

One could concatenate XML documents and ship them as one when the information to be transferred was the union of the information in each;

One could use XML as a format in which the default was plain text, but markup could be allowed if necessary. For example, the title of a book, often plain text but sometimes needing a character entity, or a form field or e-mail in which the default might be plain text but occasionally one wants to add HTML emphasis.
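Today's parsers reject such content, so the usual workaround is to wrap it in a synthetic root element. The sketch below shows that workaround with Python's standard ElementTree; the wrapper name fragment and the function parse_fragment are assumptions for the example, and inputs carrying an XML declaration or DOCTYPE would need that stripped first.

```python
import xml.etree.ElementTree as ET


def parse_fragment(text):
    """Parse mixed content by wrapping it in a synthetic root element."""
    # "fragment" is a hypothetical wrapper name, not part of any specification.
    wrapper = ET.fromstring("<fragment>%s</fragment>" % text)
    return wrapper.text or "", list(wrapper)


# Two documents shipped as one.
doc_a = "<note>first</note>"
doc_b = "<note>second</note>"
leading_text, elements = parse_fragment(doc_a + doc_b)

# Mostly plain text, with occasional markup.
title_text, title_elements = parse_fragment("Moby <em>Dick</em>")
```

If mixed content were allowed at the outermost level, the wrapper would be unnecessary and the concatenated string would itself be a document.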

What source could look like

The markup for a page which currently in XML will typically start like

...

in the future when served as text/html could look like simply

...

and be considered perfectly valid XML2. It could be parsed by general XML2 applications, which would be passed the implicit namespace which would have come from a content-type lookup table. To HTML authors, the only non-HTML thing they would have to do is remember the /> on the end of the link tag. So there is in the end a compromise between the forks, but one in which everyone can do most of what they want. So this may be a better place to be. Is it worth trying to get there?

Costs of change

Changes to XML syntax would of course be a very major step. It would break a level of stability in the XML specification which has been one of its major advantages. It would potentially affect a very large number of parsers.

On the other hand, it would only affect the parsers. The XML data model is not changed by the surface changes to the syntax. XML1 files would be valid XML2, so serializers would not have to change. Languages such as XPath, XSLT and XQuery which are defined on the data model would not change. However, just changing XML parsers would be a very dramatic step for the industry. It would leave behind many programs whose development has stopped.

But then again, if the alternative is that all systems have two parsers, one for HTML5-like data and one for XML, that is a huge cost too.

What about the cost of change to browsers?

Browsers currently have very many ways of treating web pages, to adapt to different forms of the language out there on the web. In one sense, a merged fork track would be another variation, one chosen to be more stable in the long term. The tricks to recognize particular types of old content will presumably be necessary into the future.

Changes to the browsers to bring them toward a common DOM for HTML and XML are also going to be significant. To a certain extent, perhaps one can allow the namespaced API calls to follow the XML+Namespaces model, but the non-namespaced calls to follow the HTML model. The complications here are too great to go into here. ^dw

Conclusion

Future developers will not only use the languages we define today, they will build on them to make new more sophisticated ones. The cleaner the systems we develop are, the easier it will be. The HTML and SVG document models, for example, are powerful user interface libraries, and exciting new applications are being built on top of them. The difficulties involved in dealing with the different APIs of different forks don't help.

We, the Web technology community at large, have a duty to lead the technology toward cleaner engineering solutions. While we should retain an ability to read old web pages, we should move the community of producers (both hand-coders and authoring tools) so that the newly produced web pages become progressively cleaner.

To do that, we have to understand the motivations of website developers and browser writers and server administrators. We have to understand how changes to the software and the specifications can tweak the way people behave. We can also set new community goals and a new community attitude about unclean Web pages, so long as at the same time we move the goal of cleanliness to make it less irksome.

The direction outlined here involves quite a lot of work. It means developing new parsers, page checkers and browsers which encourage cleanliness. It means cleaning up authoring tools. It involves solving many intricate technical details of how these Web pages look to a script in the DOM. But the alternative -- the current forked track -- will be a lot of work too. Keeping both forks maintained with separate diverging code stacks. Writing scripts which explicitly check whether they are in an HTML5 or XHTML environment every few lines. Developing increasingly complex new extension methods for HTML5 to emulate namespaces. As the future unrolls, porting new developments, like the tag, within HTML5 to XHTML, and new developments, like RDFa, within XHTML to HTML. Or putting up with the burden of continual re-invention of new functionality in quite incompatible ways, on both sides of the stack.

We need to set ourselves goals of merging the forks, with some give on each side. We need to switch from strictly liberal and strictly conservative attitudes to one in which progressively cleaner pages are considered progressively better. We need to adopt an attitude that we are going to clean up the Web just as we sometimes need to clean a bedroom -- or a planet.

Grouchy Robustness Principle

Be conservative in what you produce. Be liberal about what you accept but complain about any deviations from the spec in a way to help and to motivate the producers to adhere to it better.
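A toy reading of the principle, as a sketch rather than a proposal for any real parser: the function below accepts attribute values with or without quotes, but returns a list of complaints alongside the result. The regular expression, the function name read_attributes and the wording of the complaint are all assumptions made for this example.

```python
import re

# name = "double" | 'single' | bare
ATTR = re.compile(r"""(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""")


def read_attributes(text):
    """Accept quoted and unquoted values, but complain about the unquoted ones."""
    values, complaints = {}, []
    for match in ATTR.finditer(text):
        name, double, single, bare = match.groups()
        if bare is not None:
            # Liberal: keep the value. Grouchy: tell the producer about it.
            complaints.append("attribute '%s': value should be quoted" % name)
            values[name] = bare
        else:
            values[name] = double if double is not None else single
    return values, complaints


values, complaints = read_attributes('href="/a" class=menu')
```

The caller gets a usable result in every case, and the complaints can be routed to whoever is in a position to fix the source.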

Tim Berners-Lee

Original August 2008, made public May 2019

Footnotes etc

This is a deliverable of TAG issue 145.

This is $Id: HTML-XML.html,v 1.5 2019/05/20 21:36:35 timbl Exp $

1. (There is a parallel with adding them all to the XHTML namespace, but this is unfortunately not a precise one, because attributes without explicit namespaces are deemed in the NS spec to be in no namespace, rather than to be in the namespace of the element. The fact that the XHTML element has an @href attribute, for example, does not mean that there is an attribute xhtml:href which one could consider mixing into other languages. Some, including me, regard this as a bug in the namespace specification.)

2. (Footnote: At the schema level there are issues too. There is not space to go into those here, but to oversimplify, DTDs are broken by design as they actually don't use XML syntax; XML schema got complicated; RelaxNG is a competing standard, but still needs NVDL to enable mixed namespaces. There is no simple way to say the fundamental statements for connecting two languages such as "An SVG circle can go anywhere an HTML IMG can go".)

3. (Thanks to Norm Walsh for the multiple root element suggestion)

DW: Perhaps the biggest single obstacle is the document.write() method, which binds code and markup much too intimately to allow them to evolve separately. It is too close to self-modifying code. Ironically, it is often used to write a compact declarative form (document.write(" here? ")) as an alternative to a sequence of method calls to build up the same thing. This is easier to write, and easier to read. If it were compiled into a data object (see E4X) this would be clean coding. As it is, the intricacies of when document.write() inserts what into what stream end up defining huge amounts of how code is written, and allow one to do all kinds of non-obvious things.

What do HTTP URIs identify?

timbl@w3.org (Tim Berners-Lee) — Sun, 01 Sep 2002 00:00:00 GMT

Background Note

This question has been addressed only vaguely in the specifications. However, the lack of a very concise logical definition of such things had not been a problem, until the formal systems started to use them. There were no formal systems addressing this sort of issue (as far as I know, except for Dan Connolly's Larch work [@@]), until the Semantic Web introduced languages such as RDF which have well-defined logical properties and are used to describe (among other things) web operations.

The efforts of the Technical Architecture Group to create an architecture document with common terms highlighted this problem. (It demonstrates the ambiguity of natural language that no significant problem had been noticed over the past decade, even though the original author of HTTP, and a later co-author of HTTP 1.1 who also did his PhD thesis on an analysis of the web, both of whom have worked with Web protocols ever since, had conflicting ideas of what the various terms actually mean.)

This document explains why the author finds it difficult to work in the alternative proposed philosophies. If it misrepresents those others' arguments, then it fails, for which I apologize in advance and will endeavor to correct.

1. Web Concepts as here proposed

The WWW is a space of information objects. The URI was originally called a UDI, and originally all URIs identified information objects. Now, URI schemes exist which identify more or less anything (e.g. UUIDs) or electronic mailboxes (mailto:) but if we look purely at HTTP URIs, they define a web of information objects. Information objects -- perhaps in Cyc terms ConceptualWorks -- are normally things which

Carry some sort of message, and

Can be represented, to a greater or lesser authenticity, in bits

I want to make it clear that such things are generic (See Generic Resources) -- while they are documents, they generally are abstractions which may have many different bit representations, as a function of, for example:

Time -- the contents can vary with revision --

Content-type in which the bits are encoded

Natural language in which a human-readable document is written

Machine language in which a machine-processable document is written

and a few more

but the philosophy is that an HTTP URI may identify something with a vagueness as to the dimensions above, but it still must be used to refer to a unique conceptual object whose various representations have a very large amount in common. Formally, it is the publisher which defines what an HTTP URI identifies, and so one should look to the publisher for a commitment as to the exact nature of the identity along these axes.
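The relationship between one conceptual document and its many bit representations can be sketched as a small data structure. Everything concrete here (the example.org URI, the contents, the names Representation and select) is invented for illustration; only two of the dimensions listed above are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    content_type: str
    language: str
    body: bytes


# One URI, one conceptual document, several representations in bits.
RESOURCES = {
    "http://example.org/greeting": [
        Representation("text/html", "en", b"<p>Hello</p>"),
        Representation("text/html", "fr", b"<p>Bonjour</p>"),
        Representation("text/plain", "en", b"Hello"),
    ]
}


def select(uri, content_type, language):
    """Pick the representation matching the requested dimensions, if any."""
    for rep in RESOURCES.get(uri, []):
        if rep.content_type == content_type and rep.language == language:
            return rep
    return None


chosen = select("http://example.org/greeting", "text/html", "fr")
```

The URI stays the same whichever representation is returned. What it identifies is the document, not any one of the byte strings.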

I'm going to refer to this as a document, because it needs a term and that is the best I have to date, but the reader should be sure to realize that this does not mean a conventional office document, it can be for example

A poem

An order for ball bearings

A painting

A Movie

A review of a movie

A sound clip

A record of the temperature of the furnace

An array of a million integers, all zero

and so on, as limited only by our imagination.

The Web works because, given an HTTP URI, one can in a large number of cases, get a representation of the document. For a human readable document, the person is presented with the information by virtue of some gadget which is given the bits of a representation. In the case of a hypertext document, a reference to another document is encoded such that, upon user request, the referenced document can in turn be automatically presented. In the case of a machine-readable document, identifiers of concepts, being HTTP URIs, will often allow definitive reference information about those concepts to be pulled in to guide further actions.

The web, then, is made of documents as the internet is made of cables and routers. The documents can be about anything, so when we move to talk about the contents of documents we break away from talking about information space and the whole universe of human -- and machine -- discourse is open to us. Web pages can compare renaissance choral works with jazz pop hits, and discuss whether pigs have wings. Machine-processable documents can encode information about shoes, and ships, and sealing-wax. Until recently, the Internet protocol standards out of which the Web is built had little to say about such things. They were concerned only with the human-readable side, so it was people, reading natural language (not internet specs) who formed and communicated the concepts at this level. Nowadays, however, semantic web languages allow information to be expressed not only about URIs, TCP ports and documents, but also about arbitrary concepts - the shoes, and ships and sealing wax, and whether pigs have wings. Simple semantic web applications allow one to order shoes and travel on ships, and determine that, given the data, pigs do not have wings.

For these purposes it is of course quite essential to distinguish between something described by a document and the document itself. Now that we -- for the first time -- have not only internet protocols which can talk about documents but also those which talk about real world things, we must either distinguish or be hopelessly fuzzy.

And is this bad, is it an inhibition to have to work our way through documents before we can talk about whatever we desire? I would argue not, because it is very important not to lose track of the reasons for our taking and processing any piece of information. The process of publishing and reading is a real social process between social entities, not mechanical agents. To be socially responsible, to be able to handle trust, and so on, we must be aware of these operations. The difference between a car and what some web page says about it is crucial - not only when you are buying a car.

Some have opined that the abstraction of the document is nonsense, and all that exists, when a web page describes a car, is the car and various representations of it, the HTML, PNG and GIF bit streams. This is however very weak in my opinion. The various representations have much more in common than simply the car. And the relationship to the car can be many and varied: home page, picture, catalog entry, invoice, remote control panel, weblog, and so on. The document itself is an important part of society - to dismiss its existence is to prevent us being aware of the human aspects of information without which we are impoverished. By contrast, the difference between different representations of the document (GIF or PNG image for example) is very small, and the relationship between versions of a document which changes through time is a very strong one.

2. Trying out the Alternatives

The folks who disagree with the model do so for a number of different reasons. This article, therefore, will have to take them one by one, but the ones which come to mind are as follows:

Every web page (or many of them) are in fact themselves representations of some abstract thing, and the URI really identifies that thing, not a document at all.

There are many levels of identification (representation as a set of bits, document, car which the web page is about) and the URI publisher, as owner of the URI, has the right to define it to mean whatever he or she likes;

Actually the URI has to, like in English, identify these different things ambiguously. Machines have to disambiguate using common sense and logic

Actually the URI has to, like in English, identify these different things ambiguously. Machines have to disambiguate using the fact that different properties will refer to different levels.

Actually the URI has to, like in English, identify these different things ambiguously. Machines have to disambiguate using extra information which will be provided in other ways along with the URI

Actually the URI has to, like in English, identify these different things ambiguously. Machines have to disambiguate them by context: A catalog card will talk about a document. A car catalog will talk about a car.

They may have been used to identify documents up till now, but for RDF and the Semantic Web, we should change that and start to use them as the Dublin Core and RDF Core groups have for abstract concepts.

2.1 Identify abstract things not documents

Let's take the alternatives in order. These alternatives all make sense. Each one, however, has problems I can't see any way around when we consider them as a basis as

The first was,

Every web page (or many of them) are in fact themselves representations of some abstract thing, and the URI really identifies that thing, not a document at all.

Well, that wasn't the model I had when URIs were invented and HTTP was written. However, let's see how it flies. If we stick with the principle that a URI (or URIref) must unambiguously identify the same thing in any context, then we come to the conclusion that URIs can not identify the web page. If a web page is about a car, then the URI can't be used to refer to the web page.

2.1.1 Same URI can identify a web page and a car

What, a web page can't be a car? At this point a pedantic line of reasoning suggests that we should allow web pages and cars to conceptually overlap, so that something can be both. This is counterintuitive, as a web page is, in common sense, not a concrete object whereas a car is. But sure, we could construct a mathematics in which we use the terms rather specially and something can be at the same time a web page and a car.

Frankly, this doesn't serve the social purpose of the semantic web, to be able to deal with common sense concepts and objects. A web page about a car and a car are in most people's minds quite distinct (as I argue further below). A philosophy in which they are identical does not allow me to distinguish between them. It not only conflicts with reality as I see it, but also leaves us no way to make statements individually about the two things.

2.1.2 The URI identifies the car, not the web page

So let's fall back on the idea that the URI identifies the subject of the web page, but not the web page itself. This makes sense. We can build the semantic web on top of that easily.

The problem with this is that there are a large number of systems which already do use URIs to identify the document. This is the whole metadata world. Think of a few:

The Dublin Core

RSS

The HTTP headers

The Adobe XML system

Access control systems

(I'm sticking with the machine-processable languages as examples because human-processable ones like HTML have a level of ambiguity traditional in human natural language but quite out of place in the WWW infrastructure -- or the Semantic Web. You can argue that people say "I work for w3.org" or "http://www.amazon.com/shrdlu?asin=314159265359 is a great book", just as they happily say "Moby Dick weighs over three thousand tons", "Moby Dick was finished over a century ago" and "I left Moby Dick on the beach" without expecting to be misunderstood. So we won't use human language as a guide when defining unambiguously the question of what a URI identifies. If we want to do that on the Semantic Web, we will say "I work for the organization whose home page is http://www.w3.org".)

Some argue that the URI which I associate with someone's home page actually identifies that person. They argue that conventionally people use the identifier to identify the person. However, consider another page put together by friends who found a photograph of the same person. A lot of content filtering systems would collect that URI and put it into their list. Even though the photo had many representations which different devices could download using content negotiation and/or CC/PP (color or black and white and versions of different resolutions) the URI itself would be listed as containing nudity. The public are very aware of different works on the web, even though they have the same topic.

2.1.3 Indirect identification

You can argue that a web page indirectly identifies something, of course, and I am quite happy with that. If you identify an organization as that which has home page http://www.w3.org, then you are not saying that http://www.w3.org/ itself is that organization. This scenario is very very common, just as we identify people and things by their "unambiguous properties": books by ISBN, people by email address, and so forth. So long as we don't think that the person is an email address, we are fine. Some people have thought that in saying "An HTTP URI can't identify an organization" I was ruling out this indirect identification, but not so: I am very much in favor of it. The whole SQL world, after all, only identified things indirectly by a key property. This causes no contradiction. Perhaps I should say "An HTTP URI can't directly identify an organization". But by "identify" I mean "directly identify", and "identity" is a fairly direct word and concept, so I will stick with it.
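The distinction between direct and indirect identification can be made concrete with a small sketch. It assumes a toy store of (subject, property, value) triples; the property names `homePage` and `type` and the local label `_:org` are illustrative inventions, not terms from any vocabulary discussed here. The organization is found through its key property, while the URI itself is used only for the document.

```python
# Toy triple store: (subject, property, value). All names are illustrative.
W3C_HOME = "http://www.w3.org/"

triples = [
    # "_:org" is a local label for the organization; it is not the page's URI.
    ("_:org", "homePage", W3C_HOME),
    ("_:org", "type", "Organization"),
    # A statement about the document itself uses the URI directly.
    (W3C_HOME, "type", "Document"),
]

def identified_by(prop, value, store):
    """Return the subjects that have the given key property with the given value."""
    return [s for (s, p, v) in store if p == prop and v == value]

def types_of(subject, store):
    """Return every type asserted for a subject."""
    return [v for (s, p, v) in store if s == subject and p == "type"]

# Indirect identification: "that which has home page http://www.w3.org/".
org = identified_by("homePage", W3C_HOME, triples)[0]
print(types_of(org, triples))       # ['Organization']
print(types_of(W3C_HOME, triples))  # ['Document']
```

Statements about the organization and statements about the page never collide, because they have different subjects.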

Conclusion so far: the idea that a URI identifies the thing the document is about doesn't work, because we can only use a URI to identify one thing, and we have used it, and already do use it, to identify documents on the web.

2.1.4 The argument for HTTP URIs identifying a Conceptual Work

So what's wrong with the URI being taken to identify whatever the owner says?

Let's look at what we mean by identifies. When we say there is identity, that means that there is some form of sameness that we associate with the identifier. Now, for all the philosophical argument, we can never test the identity of an abstract thing. What we can test is a representation which has been returned by the server when given that URI. When we use a URI, and get back several possible representations of it, then what expectation do we have about those representations?

Take the test case that I see the web page which has a picture of a car, and I see the URI in the URL bar in the browser. I email you the URI, "you see, the car is a Toyota?". You click on the link. Your browser shows the same URI as mine in the "URL bar" but you see a table of the car's weight, length, height, color, and registration number. We are confused. The web didn't work because you didn't get the same information as me. I expected you to get the same information, basically. That is how the Web works. That is the expectation behind every hypertext link - that the follower of the link should get basically the same information as the person who made the link. I say, "basically" because I would not have cared whether you saw a JPEG or a GIF. It probably wouldn't have mattered if you had seen a lower resolution or even black-and-white copy of the picture. If you are visually impaired, you may have been able to manage with a well-written description of the picture. But the essential information is the same, not just the subject of the page.

So now we have put the four corners on the expectation we have of a URI -- that all representations have essentially the same information content. And what we mean by "essentially" allows in fact some wriggle room, and in the end it rests on a common understanding between publisher of the information and quoter of the URI. The sameness we are after is the sameness of information content. That is what is identified by the URI. That is why we say that the URI identifies that conceptual information content, irrespective of its particular representation: the conceptual work. Without that common understanding, the web does not work.

Some people have said, "If we say that URIs identify people, nothing breaks". But all the time they, day to day, rely on sameness of the information things on the web, and use URIs with that implicit assumption. As we formalize how the web works, we have to make that assumption explicit.

2.2 Author definition

So how can we break free of that line of reasoning? We can try throwing away the rule that a URI identifies only one thing.

There are many levels of identification (representation as a set of bits, document, car which the web page is about) and the URI publisher, as owner of the URI, has the right to define it to mean whatever he or she likes.

Well, this one is tempting from the point of view that the owner of an identifier should reign supreme when it comes to saying what it identifies. It is quite a logically consistent position to take. After all, isn't this the case with UUIDs? And for a new scheme, this would be interesting. How can we do it though, with HTTP? The problem is an engineering one: I can't in practice use a URI until I have some definitive information from the publisher as to what it identifies.

2.2.1 Default

Why can't a URI default to identifying a web page until you know otherwise? Because the web is open and you will never know when you might learn some other information which will make the default incorrect. (You can't use such "closed world" reasoning).

2.2.2 Web operation

Why can't a URI identify a web page until you have done some well-defined operation -- such as HTTP HEAD or GET -- and checked for information in that? Well, that would certainly work logically. Suppose we define a return code or HTTP header which means "abstract object requested". It would mean that every web application which deals with web pages as web pages would actually be working under an ambiguity, and RDF processors could be programmed to look for that special information. We can't retrofit the millions of web servers out there, I assume.

I feel that there is a great benefit to fixing this question at the spec level. Otherwise, what happens? I read a web page, I like it and I am going to annotate it as being a great one -- but first I have to find out whether the URI in my browser is used, conceptually, by the author of the page to represent some abstract idea. Before I recommend the Vietnam War page, I have to be careful I am not recommending the Vietnam War.

There has been no way to do this before RDF, but then similarly no real need for it. (What, is this just a problem with RDF? No, it will happen with any webized knowledge representation system.) We really need to have communication in which two people use the same URI to mean the same thing. If there

We could fix HTTP so that it would return me some extra semantic headers explaining the whole thing. And in the case that the URI was deemed to be some abstract thing, I would not have the option of recommending the web page. Too bad: it has no URI.

The authors of document certainly thought that they could use "http://www.w3.org/2000/10/rdf-tests/TestSchema/NegativeParserTest" to identify an abstract thing which is a type of software test. Now they have a choice as to what to make the server return for them when I ask for it. It returns 404 "doesn't match anything we have available". It can't really, because HTTP doesn't allow one to return a class, only a document. And if it were to return a document, then I wouldn't be able to refer to that document without accidentally referring to the class of negative parser tests.

So, we could change HTTP to make this work. We could make a new form of redirect, 343 Abstract Object, please see . . ., which would tell the client that the thing requested was abstract, and would suggest a document to read about it. This avenue of argument is still outstanding. We could take it. It isn't the status quo, but we could make changes in HTTP if the community felt that this was the way to go.
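To see what such a change would ask of clients, here is a minimal sketch of how one might interpret a single response under this proposal. The 343 code is the hypothetical one floated above and is not defined by HTTP; using the `Location` header to carry the suggested document, and the function name `interpret`, are further assumptions made only for illustration.

```python
ABSTRACT_OBJECT = 343  # hypothetical status code from the text; not defined by HTTP

def interpret(status, headers):
    """Return (kind, document_uri) for one response under the proposed scheme."""
    if status == ABSTRACT_OBJECT:
        # The requested thing is abstract. The Location header (an assumption)
        # names a document to read about it.
        return ("abstract", headers.get("Location"))
    if status == 200:
        # A representation came back, so the URI is taken to be for a document.
        return ("document", None)
    # Anything else (404, for example) tells the client nothing either way.
    return ("unknown", None)

print(interpret(343, {"Location": "http://example.org/about-the-test"}))
print(interpret(200, {}))
print(interpret(404, {}))
```

Note that in the abstract case the client learns of a separate document with its own URI, so both the abstract thing and the page about it remain individually referable.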

2.3 Logic disambiguates

Otherwise, we have to try another way of letting the URI mean sometimes one thing and sometimes another. Here is another.

Actually the URI has to, like in English, identify these different things ambiguously. Machines have to disambiguate using common sense and logic

This is possible in theory. It is a mess. It fails particularly spectacularly when a URI is used ambiguously to refer to a web page and the thing that web page is about, which happens to be another web page. Anyone can write anything about anything is a Web motto, but here it falls down. Anyone can write anything about anything except those things which might get confused with the document they are writing. It breaks the axiom that we mean the same thing by a URI - in all contexts. (And RDF has a model theory in which necessarily in any interpretation, a symbol always denotes one thing).

2.4 Different Properties

Actually the URI has to, like in English, identify these different things ambiguously. Machines have to disambiguate using the fact that different properties will refer to different levels.

One way of getting here is to start by considering that HTTP headers can be divided into those which refer to the representation (or the document) and those that refer to, say, a car or a donkey. We can look at all RDF properties and other attributes in other languages and divide them in such a way. So, when I say "http://example.com/albert is a color photo", I am referring to the representation; when I say "http://example.com/albert used to work down the mill" I am referring to the person; when I say "http://example.com/albert was taken on a rainy day" I am referring to the original photograph, which is basically the representation of Albert.

This one has the problem when a web page refers to a web page. It can still be pursued, by having different verbs for talking about ownership of the web page and ownership of the car. This is a classic example of the 2-level syndrome (see also Dictionaries in the Library). The basic fallacy is that you can make the system general by introducing a second level - a new set of attributes, properties, or whatever, which allow you to refer to the metadata of something separately from the thing itself. These systems either turn out to be just limited 2-level systems (like XML and DTDs) or have to be extended to be recursive in some way later on such that in fact the two levels become unnecessary.

2.5 Extra info with URI

Actually the URI has to, like in English, identify these different things ambiguously. Machines have to disambiguate using extra information which will be provided in other ways along with the URI

This twist now relies on sending extra information with a URI. Effectively, the URI scheme has now failed to identify anything by itself. Those most familiar with URIs as used by HTML have sometimes suggested adding new attributes to the anchor tags of HTML documents to disambiguate a reference. I guess it would work if HTML anchors were the only uses of URIs. By contrast, they are used in thousands of places and ways, many of which I am unaware of. The architecture, however, is not that way: the architecture of the WWW is that a URI is a global unambiguous identifier. Not a URI and something else.

(The various designs such as WebDAV's PROPFIND which use HTTP methods apart from GET to retrieve information suffer from this same problem. The information does not have a URI: it is not on the web.)

2.6 Different meaning in different context

Actually the URI has to, like in English, identify these different things ambiguously. Machines have to disambiguate them by context: A catalog card will talk about a document. A car catalog will talk about a car.

This works in the short term, when the two contexts are disjoint groups who do not need to communicate. It is in fact the current state: the groups of people who use HTTP URIs to talk about documents, and those who have just started to use them to talk about abstract concepts haven't collided yet. (Well, they have in my code. I need to be able to model the metadata about an HTTP URI as that about a document, and it being a class at the same time doesn't jive.)

It doesn't work in the long term because it breaks the axiom that a URI must identify one thing.

2.7 Change it for the Semantic Web

They may have been used to identify documents up till now, but for RDF and the Semantic Web, we should change that and start to use them as the Dublin Core and RDF Core groups have for abstract concepts.

I think that we would have to design a new URI scheme before we change things that much. That is tempting of course. But then -- building a semantic web out of what we have is tempting too. It was tempting to rehash TCP a little when making HTTP. It wasn't practical, and we would have lost a lot more than we would have gained. There is a lot to be said for using common technology. We've got an infrastructure of documents. We want to build an infrastructure of knowledge. Let's build it using the documents. We might find that the commonality with the web of human-readable information is a boon.

2.8 Abandon any identification of abstract things

An argument which surprised me is that yes, HTTP URIs identify documents, but in fact the fragment identifier must only be used to identify parts -- fragments -- of documents. This means that RDF cannot in fact use HTTP URI schemes at all. A completely different system would have to be put together -- either a new set of URIs, or RDF conventions in which the relationship to the part of a document in which something was described became explicit. In N3 this would look like

[ is rdf:referent of <#fmyCar> ] [ is rdf:referent of <#color> ] [ is rdf:referent of <#blue> ]

Of course, languages would quickly generate special syntax for this. Alternatively, the RDF system would be built entirely on the understanding that we were referring always to that denoted by a given bit of document, not the bit of document itself. This would mean that there would be no way for the RDF system to refer to documents themselves directly.

This is actually a consistent way of working. It would be a change only for those people who use RDF to talk about documents as documents. We could change.

3. Conclusion

I didn't have this thought out a few years ago. It has only been in actually building a relatively formal system on top of the web infrastructure that I have had to clarify these concepts in my own mind. I am forced to conclude that modeling the HTTP part of the web as a web of abstract documents is the only way to go which is practical and, by the philosophical underpinnings of the WWW, tenable.

I apologize again if I have misunderstood or misrepresented others' arguments in the process of explaining my own position.

Tim Berners-Lee

2002-07-28Z

What HTTP URIs identify? II

timbl@w3.org (Tim Berners-Lee) — Wed, 01 Jun 2005 00:00:00 GMT

What HTTP URIs Identify

Abstract

HTTP URIs, in the web architecture, have been used to denote documents -- "web pages" informally, or "information resources" more formally. However, with the growth of the Semantic Web, which uses URIs to denote anything at all, the urge to use and practice of using HTTP URIs for arbitrary things grew steadily. The W3C Technical Architecture Group (TAG) eventually resolved the architectural problem by deciding that if an HTTP response code of 200 (a successful retrieval) was given, that indicated that the URI was indeed for an information resource, but with no such response, or with a different code, no such assumption could be made. This compromise resolved the issue, leaving a consistent architecture.

Introduction

HTTP URIs, in the web architecture, have been used to denote documents -- "web pages" informally, or "information resources" more formally. However, with the growth of the Semantic Web, which uses URIs to denote anything at all, the urge to use and practice of using HTTP URIs for arbitrary things grew steadily. The Dublin Core project, one of the first RDF vocabularies, and later Friend of a Friend, and various others simply used HTTP URIs to identify RDF Properties. The result was that one could no longer be sure that an HTTP URI was intended to identify the web page one got when one used the URI in a browser. In fact, there was a danger of confusion if one party used the URI for an abstract concept and another used it for the web page. The author wrote a long Design Issues note about this, What do HTTP URIs Identify?. The reader is directed to read that if more detail of the arguments is needed.

This whole issue caused, until 2005, a lot of discussion in technical circles, and much heated debate. In June 2005, the TAG resolved the issue as a function of the runtime protocol response. Basically, the argument is that if you have used a URI to get a web page, then you can use the URI to identify the Information Resource which is that web page: For example, the New York Times home page, or this page you are reading now.

Resolution

The TAG resolution effectively extends the range of things one can use HTTP URIs for. However, it does not allow one to simply serve a web page at a URI which is used for something else. Of course, it is a general principle of web architecture that it is useful to serve information to those that look up a URI. In the case that the URI is not intended to be used for an information resource.

The W3C Technical Architecture Group eventually resolved the architectural problem by deciding that if an HTTP response code of 200 (a successful retrieval) was given, that indicated that the URI was indeed for an information resource, but with no such response, or with a different code, no such assumption could be made. This compromise resolved the issue, leaving a consistent architecture.
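The rule as stated here is simple enough to write down directly. The following is a minimal sketch that encodes only what this note says (a 200 means an information resource; anything else licenses no conclusion); the function name is an illustrative choice, and the sketch does not attempt to cover anything in the TAG's text beyond that.

```python
from typing import Optional

def is_information_resource(status_code: int) -> Optional[bool]:
    """Apply the rule as described in this note to one HTTP response code.

    True  : the URI is for an information resource.
    None  : no assumption can be made either way.
    """
    if status_code == 200:
        return True
    # Any other code (or no response) leaves the question open. It does not
    # say the URI is NOT an information resource, so we never return False.
    return None

for code in (200, 303, 404):
    print(code, is_information_resource(code))
```

The three-valued result matters: treating "no assumption" as "not a document" would be exactly the closed-world reasoning the open web does not allow.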

Mapping between HTTP URLs and filenames on a server

timbl@w3.org (Tim Berners-Lee) — Fri, 10 Apr 2015 00:00:00 GMT

Mapping between HTTP URLs and filenames on a server

Icing on the cake pattern: URIS of services and metadata close to the URI of the target

timbl@w3.org (Tim Berners-Lee) — Thu, 01 Dec 2016 00:00:00 GMT

Icing on the cake

Identity: how to identify what in RDF

timbl@w3.org (Tim Berners-Lee) — Tue, 01 Jan 2008 00:00:00 GMT

Identifiers - what is identified?

When XML is used to represent a directed labelled graph which is used to represent information about things, then one must be able to make statements about parts of an XML document, parts of the DLG (such as RDF nodes) and of course the objects described.

In most cases it seems obvious to the human reader. The jam jar label text does not (normally) read "jam jar label text" or "jam jar label" or "jam jar" but "jam".

Take the case of a statement about a person in an imaginary syntax

Zoe Albert Bill Claire

The XML element has one attribute and four child elements. The RDF node has three properties (stated here). The person Albert has two children. What do we refer to if we refer to "#foo"? Of course we refer to the element - but when we make RDF statements, we normally want to refer to the RDF node, or rather the object described by the node, in RDF terms the resource.

Of course, in a typical unix programming language we would simply add a syntax character to distinguish the forms of reference: #foo would be the node, and @#foo (or something) would be the object referred to. But in this case we are trying to do everything with RDF, and what is left with XML, and so we would lose a few points by adding instead some totally new syntax. What we can do is to use different attribute names for the different forms of reference. The attribute names I used above are as follows:

Forms of reference to the object of a property

value: literal string

href: taking the string as a URI with or without fragment identifier, the text (or XML fragment or whatever medium) to which it refers.

resource: taking a string as a URI with fragment identifier, the abstract RDF object (rdf:resource) corresponding to the identified XML document fragment.

Here I have used "href" to allow RDF to refer to the XML model. This is important, as for example it is bits of XML which one digitally signs, not (in signed XML) bits of RDF. Also, it is useful for RDF to be able to talk about XML elements. It brings up the question of what an RDF fragment identifier means.

RDF and XML fragment identifiers clash

This highlights (2000/02) a bug in the relationship between XML and RDF

Consider what is identified by

http://.../foo.rdf#bar

when ...foo.rdf contains among other things the following:

Ora Lassila ora.lassila@research.nokia.com

The meaning of the fragment identifier is taken from the specification associated with the MIME type.

Therefore, if this is taken as a document of type application/rdf, then the fragment identifier identifies the thing (person in this case, Ora) described by the RDF node. This is how references are used in RDF.

However, if it is considered to be of type text/xml then the fragment identifier is defined by the XML spec, and so references an element whose XML ID attribute has value "bar". It happens that the rdf:id is not defined to be an xml:id but is defined to "act like one", whatever that means, by the RDF spec. So it isn't clear whether the reference to this would be to the XML subtree (consisting of the rdf:description element and its contents) or would be undefined or possibly a reference to some other element which happened to have id="bar".

To have a different interpretation of a URI as a function of the notional type of the document belies the fact that the point of using XML syntax for RDF was that RDF documents should be XML documents! Of course we embed RDF in regular XML documents. So this distinction is nonsense.

Of course, the RDF spec can simply use the XML definition indirectly and refer to the RDF node described by the XML element. However, this is not powerful enough for RDF. This is because RDF needs to be able to make statements about XML documents and XML elements. So for example, I might want to state that I wrote the above snippet. It would be very tempting to write that I am the author of foo.rdf#bar. But I am not the author of Ora Lassila. RDF uses parseType to resolve this for inline data: parseType=Resource indicates that the reference is to the RDF object, and parseType=Literal indicates that it is to the XML. The thing could be resolved with an interpretation property which expresses the relationship between an XML subtree and an RDF object which it describes. While it would be good to define that property, RDF syntax needs a shortcut. I would propose that "resource=" which is used to point to a resource be also used for a resource fragment id, and that a new syntax be introduced to refer to the actual RDF node. Maybe "object=" which happens here to correspond to the (subject, predicate, object) sense -- as well as a "thing" sense. (The former is the reason for choosing it - the attribute should express the relationship, not the class of the thing referred to in general!).

Naming properties and elements

We have a similar problem in the XML-RDF relationship when looking at identity at the schema level.

In RDF M&S 1.0, a property name defined in a namespace is formed by directly concatenating the namespace URI with the local tag name of the XML element.

One natural way to use this is to end the namespace URI with "#" so that the local tag name becomes the fragment identifier. When the schema is written in XML, this implies that the tag name, being a simple alphanumeric, will identify something in the document by its XML ID. This is a constraint on the schema language: the XML ID of an element must be usable as a reference to the thing being defined.
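The concatenation rule, and the effect of ending the namespace URI with "#", can be checked in a few lines. This sketch uses Python's standard `urllib.parse.urldefrag`; the namespace `http://example.org/schema#` and the tag name `author` are made-up examples.

```python
from urllib.parse import urldefrag

def property_uri(namespace_uri, local_name):
    """Form a property name by direct concatenation, as described above."""
    return namespace_uri + local_name

# Hypothetical namespace ending in "#", and a hypothetical local tag name.
uri = property_uri("http://example.org/schema#", "author")
document, fragment = urldefrag(uri)

print(uri)       # http://example.org/schema#author
print(document)  # http://example.org/schema
print(fragment)  # author
```

Because the local name lands in the fragment position, dereferencing the property URI fetches the schema document, and the fragment is then interpreted by whatever rules apply to that document's type. That is where the XML ID constraint comes from.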

When there is a 1:1 mapping between RDF properties and XML element types, there is a choice of

giving them the same URI and distinguishing which is referred to by context (as in resource= and object= above), or

giving them different URIs which are algorithmically related, like assuming that #foo-element means the element defining #foo, using a convention specified in the schema languages, or

giving them totally distinct URIs which can be connected by an assertion in the schema, or an in

Given that it is interesting to use RDF to make statements about XML element types, having different names is appealing. As writing down the relationship every time is unappealing, the algorithmic link is appealing.

A generic problem with XML identifiers

(I notice in passing that XML has currently a mixture of identifier spaces which is a little confusing.

The element and attribute namespace is very well handled in terms of abbreviations, and is grounded in URI space, using the XML namespaces spec.

The URI space is of course the same space, but when a value is typed as a URI, then it cannot use the abbreviation system of the element namespace.)

IDREF considered harmful

The local identifier space is a subset of URI space. When an attribute is defined as a URI, the simple "#" prefix gives access to the local ID space - while still allowing great power of expression by reference to anything else on the Web. When the "idref" form is used, this is not possible. The idref form is a weak form IMHO and not wise for new designs which are not to be deliberately constraining.

Others have noticed this problem and there have even been suggestions which confused the URI prefix and the namespace prefix. In fact the problem can be solved [ref eric whiteboard] with an escape of some sort. One possibility is ambushing a void URI scheme name by using a colon prefix (suggested by Eric Prud'hommeaux)

href=":rdf:description"

would be a perfectly valid URI (in an XML context) which referenced the rdf:description URI using the defined rdf: namespace. I feel this is messy, as it would have to be subject to different handling than any other URI: its expansion would be done in an XML-specific way.

The other link you need is the ability, when using an element name which only occurs once, and without changing the default namespace, it would clearly be logical to be able to just write

a

Because what follows uses the full power of what precedes with generality, we may need to see the first in use before the paper is over. But I can't see making the second change to XML.)

Limiting the damage of an inconsistency

timbl@w3.org (Tim Berners-Lee) — Sat, 26 Jun 2010 00:00:00 GMT

Inconsistent data

What, many people ask, will happen when this huge mass of classical logic meets its first inconsistency? Surely, once you have one statement that A and another somewhere on the web that not A, then doesn't the whole system fall apart? Surely, then you can deduce anything you want?

This fear of course is quite valid - or would be if all assertions in the whole world were regarded as being on equal footing. Some imagine that an RDF parser will simply search all XML documents on the web for any facts, and add them to a massive set of believed assertions. This is not how realistic systems will actually work.

On the web, a fact may be asserted in an expression. That expression may be part of a formula. The formula may involve negation, and may involve quotation. The whole formula is found by parsing some document. There is no a priori reason to believe any document on the web. The reason to believe a document will be found in some information (metadata) about the document. That metadata may be an endorsement of the document - another RDF statement, which in turn was found in another document, and so on.

[@@need picture here]

A real system may work backwards or forwards (or both). I would call working forwards a system which is given a configuration page to work from which in turn points to other pages which in turn are used as valid data. I would call working backwards a system which, when looking for an answer to a query, looks at a global index to find any document at all which mentions a given term. It then searches the documents turned up for answers to the query. Only when it has found an answer does it check back to see whether the data can be derived directly or indirectly from sources it has been set up to trust.

Digital signature (see trust) of course adds a notion of security to the whole process. The first step is that a document is not endorsed without giving the checksum it had when believed. The second step is to specify more powerful rules of the form

"whatever any document says so long it is signed with key 57832498437".

In practice, particular authorities are trusted only for specific purposes. The semantic web must support this. You must be able to restrict the information believed along the lines of,

"whatever any document says of the form xxxx is a member of W3C so long as it is signed with key 32457934759432".

for example

"whatever any document says of the form "a is an employee of IBM" so long as it is signed with key 3213123098129".

Limiting inference

There is a choice here, and I am not sure right now which appeals to me most. One is to say precisely,

"whatever any document says of the form xxxx is a member of W3C so long as it is signed with key 32457934759432".

The other is to say,

"whatever is of form xxxx and can be inferred from information signed with key 32457934759432"

In the first case, we are making an arbitrary requirement for a statement to be phrased in a particular way. This seems unnecessarily bureaucratic, and more difficult to treat consistently. Normally we like to be able to replace any set of formulae with another set which can be deduced from it. However, in this case we have to preserve the actual form in case we need to match it against a pattern. This is very messy.
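The first option amounts to matching the literal form of each statement against a pattern tied to a key. A minimal sketch follows, assuming statements are simple (subject, property, value) triples and that signature verification has already happened elsewhere; the class names, the property names `memberOf` and `employeeOf`, and the sample subjects are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Signed:
    statement: tuple  # (subject, property, value)
    key: str          # key the containing document was signed with
                      # (cryptographic verification is assumed done elsewhere)

@dataclass(frozen=True)
class TrustRule:
    prop: str   # the fixed part of the form, e.g. "memberOf"
    value: str  # e.g. "W3C"; the subject plays the role of "xxxx"
    key: str

    def accepts(self, s: Signed) -> bool:
        _, prop, value = s.statement
        return s.key == self.key and prop == self.prop and value == self.value

def believed(statements, rules):
    """Keep only statements whose literal form and key match some rule."""
    return [s.statement for s in statements if any(r.accepts(s) for r in rules)]

rules = [
    TrustRule("memberOf", "W3C", "32457934759432"),
    TrustRule("employeeOf", "IBM", "3213123098129"),
]
statements = [
    Signed(("acme", "memberOf", "W3C"), "32457934759432"),
    Signed(("joe", "employeeOf", "IBM"), "32457934759432"),  # right form, wrong key
    Signed(("sky", "color", "green"), "32457934759432"),     # trusted key, untrusted form
]
print(believed(statements, rules))  # [('acme', 'memberOf', 'W3C')]
```

The messiness described above shows up immediately: a logically equivalent statement phrased any other way would fail `accepts`, because the match is on form and not on meaning.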

In the second case, we fall prey to the inconsistency trap. Once any pair of conflicting statements can be deduced from information signed with a given key, then anything can be deduced from information signed with the key: the key is completely broken. Of course, only that key is broken, so a trust system can remove any reason it has to trust that key. However, the attacked system may not realize what has happened before it has been convinced that the sun rises in the west.
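The containment argument (only that key is broken) can be sketched as a check that withdraws trust per key. This is a deliberately crude illustration: it assumes statements arrive as (key, claim, truth value) and it only notices direct contradictions, whereas the trap described above also covers conflicts that appear only after inference. The function and data names are invented for the example.

```python
from collections import defaultdict

def broken_keys(signed):
    """Return the keys under which some claim is both asserted and denied.

    signed: iterable of (key, claim, is_true). Only direct A / not-A pairs
    are detected here; contradictions reachable by deduction are not.
    """
    seen = defaultdict(set)
    broken = set()
    for key, claim, is_true in signed:
        if (claim, not is_true) in seen[key]:
            broken.add(key)
        seen[key].add((claim, is_true))
    return broken

signed = [
    ("key1", "sun rises in the east", True),
    ("key1", "sun rises in the east", False),  # key1 contradicts itself
    ("key2", "sun rises in the east", True),
    ("key3", "sun rises in the east", False),  # disagrees with key2, but not with itself
]
bad = broken_keys(signed)
trusted = [s for s in signed if s[0] not in bad]
print(bad)      # {'key1'}
print(trusted)
```

Disagreement between different keys is left alone; deciding between key2 and key3 is a matter of which sources one has been set up to trust, not of consistency within a key.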

Is there a way to limit the domain of trust in a key while allowing information to be processed in a consistent way throughout the system? Yes - maybe - there are many. Each KR system which uses a limited logic does so in order (partly) to solve this problem. We just qualify "can be inferred" by the type of inference rules which may be used. This means the generic proof engine either has to work through a reified version of the rules or it has to know the sets - incorporate each proof engine. Maybe we only need one.

Expiry

Tortoise: What's the time, Achilles?

Achilles: Five past ten, my friend. [They chat for a minute]

Tortoise: What is the time, Achilles?

Achilles: Six minutes past ten, Mr. Tortoise.

Tortoise: But Achilles, you just told me just a minute ago it was five minutes past ten. How can I ever believe you again?

Time-varying information is one cause of apparent contradiction. People and documents change status. How does one base inference on information which may be out of date?

One part of this is to put explicit or implicit expiry dates on everything. Whenever a server sends a resource to an HTTP client, it can give an expiry date. The client can track this, and ensure that all deductions from that document are cancelled when the date arrives, unless a more recent copy can be obtained which says the same thing. In human language you might say "It is rainy" but on the semantic web that would be exported in a fully qualified way, more like "at Mon Jan 24 09:41:06 EST 2000 the measurement gauge 5 at Dublin Airport read rain as having fallen in the last hour". (A fuzzy system would conclude "Dublin is wet" and a classic logic system "at least once it rained at at least one place in Dublin"!)

I understand [Lehrmann, SW meeting in DC] (sp?) that the KIF folks developed a complete vocabulary for time-variance.

Another technique is to make any looseness which exists in the real system visible. Instead of saying

Any employee of any member organization of W3C may register

you say formally to the registration engine

Any person who was some time in the last 2 months an employee of an organization which was some time in the last 2 months a W3C member may register.

In other words, if an organization were to drop its membership, the system doesn't have to support propagating that information instantly.

I think there will be time-aware reasoning systems, and time-unaware reasoning systems which are fed data with expiry dates and whose results are used within the intersection period of the validity periods of the incoming data. Indeed, time-aware systems may contain nested time-unaware systems, and probably vice-versa.
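As a rough sketch of the "intersection period" idea, assume every fact carries an explicit validity interval. A time-unaware deduction step can then stamp its conclusion with the intersection of its premises' intervals. The `Fact` class and the `derive` and `is_usable` helpers are illustrative names, not part of any existing system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Fact:
    statement: str
    valid_from: datetime
    valid_until: datetime  # explicit expiry date

def derive(conclusion, premises):
    """A time-unaware deduction step: the conclusion is usable only within
    the intersection of the validity periods of all its premises."""
    start = max(p.valid_from for p in premises)
    end = min(p.valid_until for p in premises)
    if start >= end:
        return None  # the premises are never all valid at the same time
    return Fact(conclusion, start, end)

def is_usable(fact, now):
    """True while the fact has started and has not yet expired."""
    return fact.valid_from <= now < fact.valid_until
```

A surrounding time-aware system would call `is_usable` before acting on a conclusion, and would drop or re-derive the conclusion once any premise has expired.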

Semantics and Interpretation (and digital signature)

timbl@w3.org (Tim Berners-Lee) — Wed, 01 Dec 1999 00:00:00 GMT

Interpretation and Semantics on the Semantic Web

We need some philosophy as a basis for the architecture of digital signature and the semantic web.

The semantic web is a computer system, a distributed machine which should function so as to perform socially useful tasks. There will be various interfaces between the Semantic Web (SW) world and the social world of people, such as the physical delivery of goods, and the presentation of a document to a person for signature. However, in general, with these important exceptions, the Semantic Web will form a self-sufficient loop. The semantics of anything on the SW are then defined either in terms of more stuff on the SW, or in terms of the connection with these real-world connections. So for example I might initially define a check as something which when fed into the bank's black box will make it do a certain thing. Then within the SW all definitions of dollars and transfers can be defined back in terms of the check, and a self-sufficient system can be made where, if necessary, recourse can be made to sending a check to a bank, but in fact we can etrade using ecurrency and einvoices and edeliverynotes and so on.

This is a similar relationship with reality that coins originally had with gold, and bills with coin. (A UK pound used to read "I promise to pay the bearer on demand the sum of one Pound signed, signed: Bank of England"). From then on a pound note became what people thought of as a pound, and the notion of what exactly the "sum of one pound" was originally defined by becomes irrelevant and the paper money is self-sufficient. So we are making a computer system which will function as a machine which does a process quite equivalent to (though perhaps more crisply defined than) a social process such as trade or endorsement.

We use the applications which tie the SW to what we currently think of as reality for three reasons:

We need an interface between the SW and the current social systems; that is how the SW system will work, at least initially.

The social system machine has legislative backing (and public understanding etc.) which we want to exploit;

The social system we have works and we only want to change the machine incrementally.

Our reason is not that the current definitions are fundamental or because their specification is inherently beautiful (indeed many existing systems are really crufty). Importantly, we do not define the semantics of something to the real world in such a way as to break the loop, when the loop can be completed in the SW. Here is an example of a loop in the semantic web.

a. Web server grants access to resource d in response to a request signed with key k1.

b. key k1 is listed in a [employee list] document signed with k2;

c. Key k2 is listed in a [w3c member] list signed with k3;

d. Key k3 is the key which the web server was set up to trust.

This little system can happily run controlling our web site. Now in fact we set it up to model the following social system

A. A person P1 is allowed to read the member site

B. The person P1 is an employee of company C2

C. C2 is a member of the consortium according to Hugo;

D. Hugo is deemed responsible when it comes to defining member site access.

Now to represent the SW loop a-d is very simple. The conditions can be written in math and proved. The social loop A-D as written is always a rough approximation to the very complex web of trust which is often less dependable than the simpler SW model.
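To make the point that the loop a-d is simple enough to write down and check, here is a minimal sketch. It assumes signature verification has already happened and models each signed list as a dictionary; the function name and data layout are invented for illustration.

```python
def may_access(request_key, employee_list, member_list, root_key):
    """Decide access by walking the loop a-d.

    Each list is a dict: {"signed_with": key, "keys": set of listed keys}.
    Cryptographic verification of the signatures is assumed done elsewhere.
    """
    # d. k3 is the key the server was set up to trust
    if member_list["signed_with"] != root_key:
        return False
    # c. k2 is listed in the member list signed with k3
    k2 = employee_list["signed_with"]
    if k2 not in member_list["keys"]:
        return False
    # b. k1 is listed in the employee list signed with k2
    # a. access is granted to a request signed with k1
    return request_key in employee_list["keys"]
```

Nothing in the function refers to persons, companies or Hugo: the whole decision is made in terms of keys and lists, which is what makes the loop self-sufficient.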

Security has always been plagued by people trying to connect the SW steps (such as a-d) at every stage to the social machine (A-D). For example, this would raise the question of how to identify the person P1 with key k1, introducing the quite unnecessary x.500 directory system which is really not part of the trust loop but becomes a security hole, bringing in unnecessary "trusted" third parties. It drags up endless questions of what "identity" really is anyway. It would raise the question of whether it is Hugo or the webmaster or what that is associated with k3. Before we had finished arguing about identity we would be into arguments about "belief". We would be arguing as to whether Hugo really believes that the person is a member of the company - maybe Hugo does not have to but in his webmaster role he does! These are rat holes. (People don't just believe things: they believe to a certain extent, and they trust certain sources for certain purposes.) It would be best to use a different term ("interpretation"?) for the mapping between the semantic and real worlds. (I probably haven't got the philosophical terms right at all and I haven't said "model" once)

So what happens, after we have installed our web server access protocol based on digital signature, is that we then relate things to that. We say that invited experts can have keys on a given list. The semantic web becomes the definitive machine, and we just have rules at the edges about how it relates to things like membership payments. An invited expert becomes defined as someone whose key is on a given list.

What we are looking for from a digital signature spec is the relationship between a signature and a string of bits, and what we are looking for from a semantic web toolbox is the language for writing the conditions a-d. We are NOT looking for either to provide an interpretation language for relating a-d to A-D, or a legal language for writing the steps A-D.

Now, the much-asked question, what is the "semantics" of the digital signature in a-d above? From the SW point of view, those rules are the semantics of the system. The whole thing is self-sufficient from the machine's point of view, except for the edges where the server has to understand what to "give access" is, and where the person has to sign a request or a list. The great thing about the semantic web is that we can make it all work and never actually answer the questions "invited" in what sense? by whom? and Does this mean an invitation which has been accepted? and such other rat holes. We must be careful not to confuse what is said with where it is stored. There are basically four rules which define the access machine. We could store them anywhere. They could be sent in an HTTP request, stored on any number of different web sites, in Java rings and smartcards, sent by email or etched in marble. The SW design must not constrain where things are stored.

Where do the "semantics of the signature" lie?

The semantics in the SW are for me the whole loop a-d, which you see, to be a loop, and therefore to allow any processing, must eventually be tied down to the key. When you start to argue something on the basis of a signature by a key, the only next step can be some knowledge about the key. In the semantic web, this is a processing rule about things which are signed with that key. However, that does not mean that the signature has semantics which are stored as/with/about the key. In fact, I do not think it is useful to talk about the "semantics of the signature".

Documents have meaning. Signatures by themselves do not.

So it is not useful to ask what the semantics of a signature are. Signatures convey trust, but even that only because of a set of statements about keys and documents. There are in society many rules about the trust which is conveyed by the signature under various circumstances. We should not attempt to model those when we make the basic infrastructure of the semantic web.

Interpretation properties for units and languages

timbl@w3.org (Tim Berners-Lee) — Mon, 28 Feb 2000 00:00:00 GMT

Interpretation properties

Abstract: Natural languages, encodings, and similar relationships between one abstract thing and another, are best modeled in RDF as properties. I call these Interpretation properties in that they express the relationship between one value and that value interpreted (or processed in the imagination) in a specific way.

The problem of annotating natural language

There has to date (2000/02) been a consistent muddle in the RDF community about how to represent the natural language of a string. In XML it is simple, because you never have to exactly explain what you mean. You can mark up a span of text and declare it to be French.

His name was Jean-François but we called him Dan.

Under pressure from the XML community to be standard, the RDF spec included the xml:lang attribute as the official RDF way to record that a string was in a given language. This was a mistake, as the attribute was thrown into the syntax but not into the model which the spec was defining.

Consider the example in the identity section,

http://www.people.org/types#person Ora Yrjö Uolevi Lassila

Now that represents five nodes in the RDF graph: the anonymous node for Ora himself (who has no web address) and the four arcs specifying that this thing is of type person, and has a common name, email address and home page as given.

Where do we add the language property? Of course we could add a language attribute to the XML, but that would be lost on translation into the RDF model: no triple would result.

Attempt 1: a property of the person?

Many specifications such as iCalendar (see my notes@link) would add another property to the definition of the person.

http://www.people.org/types#person Ora Yrjö Uolevi Lassila fi ora.lassila@research.nokia.com http://www.w3.org/People/Lassila/

Here, the property play:namelang is defined to mean "A has a name which is in natural language B". In the iCalendar spec, the definition is more complex in that the lang property is in some cases the language of a name and in other cases that of the object's description. This is a modeling muddle. The nice thing about doing it this way is that the structure is kept flat, and pre-XML systems such as RFC822 (email etc) headers have a syntax which can only cope with this.

There are many drawbacks to this muddle. Ora may have two names, one in Finnish and another in English, and the model fails to be able to express that. Because the attribute is apparently tied to the person and not obviously attached to the name, automatic processing of such a thing is ruled out. Clearly, the structure does not reflect the facts of the case.

Attempt 2: a property of the string?

The second attempt is to make a graph which expresses the language as a property of the string itself. Clearly, "Ora Yrjö Uolevi Lassila" is Finnish, is it not? Yes, Ora is Finnish, but that is different. What we need to say is that the string is in the Finnish language. The problem, then, becomes that RDF does not allow literal text to be the subject of a statement. Never mind, RDF in fact invents the rdf:value property which allows us to specify that a node is really text, but say other things about it too. This is done by introducing an intermediate node.

Ora Yrjö Uolevi Lassila fi

There we have it, and in an RDF graph at least very pretty it looks. And indeed, we could work with this, apart from the fact that we have made another modeling error. It is not true that the language is a property of the text string. After all, the string "Tim" - is that English (short for Timothy) or French (short for "Timothé")? I don't need to add a long list of text strings which can be interpreted as one language or as another. A system which made the assertion that the string itself was fundamentally English would simply be not representing the case.

Attempt 3: a relationship between them.

In fact, the situation is that Ora's name is a natural language object, which is the interpretation according to Finnish of the string "Ora Yrjö Uolevi Lassila". In other words, Finnish the language is the relationship between Ora's name and the string. In RDF, we model a binary relationship with a property.

http://www.people.org/types#person Ora Yrjö Uolevi Lassila ora.lassila@research.nokia.com http://www.w3.org/People/Lassila/

This works much better. Ora has a name which is the Finnish "Ora". This allows an RDF system to create a node for that string, and a "Finnish" link from the concept of Ora the person, maybe a Danish link from the concept of the currency, and an Old English link from the concept of weight (1/15 pound), not to mention a Latin link from the concept of the shore.

A problem we may feel is we would like the language to be a string, so that we can reference the ISO spec for all such things, but there is of course no reason why the spec for the lang: space should not reference the same spec.

Another problem we might feel is that it is reasonable for the play:name to expect a string, and in most cases it may get a string: what is the poor system supposed to do in order to accommodate finding a natural language object in place of a string? I guess making a class which includes all strings and all natural language objects is the best way to go. Any use of string which did not also allow such a natural language object makes life much more difficult for multilingual software, so this is a serious problem.

[[This leads us on to another interesting question of packaging in RDF. There is a requirement in XML packaging and in email packaging and it seems quite similarly in RDF that when you ask me for something of type X I must be able to give you something of type package which happens to include the X you asked for and also some information for your edification. But that is another story.@@@ elaborate and define properties or syntax@@@]]

What is really important is that we are using the ability of RDF to talk about abstract things, just as when we identified people by the resources they were associated with, but avoided pretending that any person had a definitive URI.
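The third attempt can be pictured with plain triples. In this sketch the graph is a Python set of (subject, property, object) tuples; the `play:` and `lang:` prefixes follow the text, while the anonymous node labels, the English name and the helper functions are made up for illustration.

```python
# Triples are (subject, property, object). Labels starting with "_:" are
# anonymous nodes; they are invented here purely for illustration.
triples = {
    ("_:ora", "rdf:type", "http://www.people.org/types#person"),
    ("_:ora", "play:name", "_:name_fi"),
    ("_:name_fi", "lang:fi", "Ora Yrjö Uolevi Lassila"),
    ("_:ora", "play:name", "_:name_en"),
    ("_:name_en", "lang:en", "Ora Lassila"),  # hypothetical second name
}

def names_in(person, language, graph):
    """Strings which, interpreted in the given language, are a name of the person."""
    name_nodes = {o for (s, p, o) in graph if s == person and p == "play:name"}
    return {o for (s, p, o) in graph
            if s in name_nodes and p == "lang:" + language}

def languages_of(string, graph):
    """Every language under which some node in the graph interprets this string."""
    return {p[len("lang:"):] for (s, p, o) in graph
            if o == string and p.startswith("lang:")}
```

Because the language is the arc rather than an attribute of the person or of the string, the same person can have names in two languages, and the same string could be reached by arcs for different languages without contradiction.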

Datatypes as interpretation properties

By datatypes here I mean those in the sense of the atomic types in a programming language, or for example XML Datatypes (XML schema part 2). Defining datatypes involves defining constraints on an input string (for example specifying what a valid date is as a regular expression) and specifying the mathematical abstract individuals which instances of a type represent. One can model the relationship between the string representation and the abstract value using a property.

10
<#myshoe> shoe:size "10".

This doesn't tell us what it is 10 of. We could go through life without any model of types: we could define a shoe size as being a decimal string for a number of inches. There are many questions and tradeoffs which datatype designers make, for example:

Can you tell the type of a value from the string representation in every case? (eg 1.4e4 vs 1.4d4 for precision)

Are the values of different datatypes distinct? (Eg, is 1 = 1.0?)

Are the set of datatypes extensible? (Eg, can you add complex numbers or prime numbers?)

Does representation equality imply value equality?

Does value equality imply representation equality? (Is the only allowed representation the canonical one?)

It would be nice to be able to model these questions in general in the semantic web, in order to describe the properties of data in arbitrary systems. We can introduce interpretation properties which link a string to its decimal interpretation as a number, or a length including units. The problem is that the RDF graph which most folks use is the one above. The object of shoe:size is "10".

The simplistic system, corresponding exactly to Attempt 1 above, is to declare that shoe:size is of class integer. This implies (we then say) that any value is a decimal string. Given the string and the type we can conclude the abstract value, the integer ten. This works. It is the system used by XML datatypes whose answers for the questions above are as I understand it [No, Yes, Yes, Yes, No]. A snag is that you can't compare two values unless you know the datatypes.

To model the representation explicitly in the RDF it seems you have to introduce another node and arc, which is a pain.

10
<#myshoe> shoe:size [ rdf:value "10" ].

We can then define rdf:value to express that there is some datatype relation which relates the size of the shoe to "10". All datatype relations are subProperties of rdf:value with this system. Once it is that form, the datatype information can be added to the graph. You have the choice of asserting that the object is of a given class, and deducing that the datatype relation must be a certain one. You can nest interpretation properties - interpreting a string as a decimal and then as a length in feet. But this is not possible without that extra node. One wonders about radically changing the way all RDF is parsed into triples, so as to introduce the extra abstract node for every literal -- frightful. One wonders about declaring "10" to be a generic resource, an abstraction associated with the set of all things for which "10" is a representation under some datatype relation. This is frightful too: you don't have "equals" any more in the sense you used to have it.
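Nesting interpretation properties can be sketched by treating each one as a function from a representation to the abstract value it stands for. This is only an analogy for the RDF arcs, and all names below are hypothetical; lengths are normalised to metres so that values reached through different units can be compared.

```python
from decimal import Decimal

# Each interpretation property is modelled as a function from a
# representation to an abstract value. The names are invented.
def as_decimal(text):
    """Interpret a string as a decimal number."""
    return Decimal(text)

def as_feet(number):
    """Interpret a number as a length in feet (stored in metres)."""
    return ("length_m", number * Decimal("0.3048"))

def as_inches(number):
    """Interpret a number as a length in inches (stored in metres)."""
    return ("length_m", number * Decimal("0.0254"))

def interpret(value, *properties):
    """Apply nested interpretation properties, innermost first."""
    for prop in properties:
        value = prop(value)
    return value
```

For instance, `interpret("10", as_decimal, as_feet)` reads the string as a decimal and then as a length. The strings "10" and "10.0" are different representations that reach the same abstract value, which touches on the question above of whether value equality implies representation equality.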

Instead of adding an extra arc in series with the original, we can leave all Properties such as shoe:size as being rather vague relations between the shoe and some string representation, and then use a functional property (say rdf:actual) to relate the shoe:size to a (more useful) property whose object is a typed abstract value.

{ <#myshoe> shoe:size "10" } log:implies { <#myshoe> [is rdf:actual of shoe:size] [rdf:value "10"] } .

@@@ No clear way forward for describing datatypes in RDF/DAML (2001/1) @@

More examples

Interpretation properties was the name I have arbitrarily chosen for this sort of use. I am not sure whether it is a good word. But I want to encourage their use. Base 64 encoding is another example. It comes up everywhere, but XML Digital Signature is one place.

jksdfhher78f8e47fy87eysady87f7sea

Another example is type coercion. Suppose there is a need to take something of datetime and use it as a date:

2000-01-31 12:00ET The Bryn Poeth Uchaf Folk festival

Such properties often have uniqueness and/or unambiguity properties. enc:base64 for example is clearly a reversible transformation. It relates two strings, one printable and the other a byte string with no other constraints. The byte string could not in general be represented in an XML document. The definition of enc:base64 is that A when encoded in base 64 yields B. This allows any processor, given B, to derive A. The specification of the encoding namespace (here referred to by prefix enc:) could be that any conforming processor must be able to accept a base64 encoding of a string in any place that a string is acceptable.
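The reversibility of such a property is easy to demonstrate with Python's standard `base64` module. The two wrapper functions are hypothetical names for the two directions of the enc:base64 relation.

```python
import base64

def enc_base64(a: bytes) -> str:
    """Forward direction: the printable string B that byte string A encodes to."""
    return base64.b64encode(a).decode("ascii")

def enc_base64_inverse(b: str) -> bytes:
    """Reverse direction: recover byte string A from its printable form B."""
    return base64.b64decode(b, validate=True)

# Arbitrary bytes, including ones that could not appear in an XML document
raw = bytes([0, 255, 16, 128])
printable = enc_base64(raw)
```

Here `enc_base64_inverse(printable)` gives back `raw` exactly, which is the sense in which the property is unambiguous in both directions.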

Interpretation properties make it clear what is going on. For example,

jd8734djr08347jyd4

clearly makes a statement, using properties quite independently defined for the various processes, that the base64 encoding of the SHA-1 hash of the canonicalized form of the W3C home page is jd8734djr08347jyd4. Compare this with the HTTP situation in which the headers cannot be nested, and the encodings and compression and other things applied to the body are mentioned as unordered annotations, and the spec has to provide a way of making the right conclusion about which happened in what order.
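The nesting described here corresponds to ordinary function composition, where the order of application is explicit. In the sketch below `canonicalize` is a deliberately crude stand-in (an assumption; real XML canonicalization is far more involved), and the function names are invented.

```python
import base64
import hashlib

def canonicalize(text: str) -> bytes:
    """Stand-in canonical form (an assumption for this sketch): normalise
    line endings, strip trailing whitespace on each line, encode as UTF-8."""
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).encode("utf-8")

def sha1(data: bytes) -> bytes:
    """The 20-byte SHA-1 digest of the data."""
    return hashlib.sha1(data).digest()

def b64(data: bytes) -> str:
    """The printable base64 form of the data."""
    return base64.b64encode(data).decode("ascii")

def digest_of(page_text: str) -> str:
    """base64 of the SHA-1 hash of the canonicalized form, applied in that order."""
    return b64(sha1(canonicalize(page_text)))
```

Reading `digest_of` from the inside out gives exactly the sentence in the text, with no separate rule needed to say which transformation happened first.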

Units of Measure (2006)

This pattern applies very well to units of measure.

See, for example a simple ontology http://www.w3.org/2007/ont/unit of units of measure.

Conclusion

Representing the interpretation of one string as an abstract thing can be done easily with RDF properties. This helps make a clean accurate model. However, using the concept for datatypes in RDF is incompatible with RDF as we know it today.

Links and Laws - what does a hypertext link imply?

timbl@w3.org (Tim Berners-Lee) — Tue, 01 Apr 1997 00:00:00 GMT

Links and Law

Preface

This personal note I have put into the set of web architectural notes as it expresses fundamental understandings upon which the practical use and power of the web rest.

The questions addressed are about the relationship of the hypertext forms of linked and embedded material to the social concepts involved such as attribution, endorsement, and ownership of information.

Links in hypertext are new in that they can be followed automatically, but the concepts of reference and inclusion of material predate paper. There should not therefore be much confusion about what links imply, but as there have been some strange suggestions recently which would seriously damage the web, I write this note.

Abstract

Normal hypertext links do not of themselves imply that the document linked to is part of, is endorsed by, or endorses, or has related ownership or distribution terms as the document linked from. However, embedding material by reference (sometimes called an embedding form of hypertext link) causes the embedded material to become a part of the embedding document.

Two sorts of link

Basic HTML has three ways of linking to other material on the web: the hypertext link from an anchor (HTML "A" element), the general link with no specific source anchor within the document (HTML "LINK" element) and embedded objects and images (IMG and OBJECT). Let's call A and LINK "normal" links as they are visible to the user as a traversal between two documents. We'll call the thing between a document and an embedded image or object or subdocument "embedding" links.
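The distinction can be sketched with Python's standard `html.parser` module: A and LINK targets are collected as normal links, IMG and OBJECT targets as embedding links. The class and function names are illustrative only, and the sketch ignores other elements that later versions of HTML added.

```python
from html.parser import HTMLParser

# Tag name -> attribute holding the target, per the two kinds of link above
NORMAL = {"a": "href", "link": "href"}
EMBEDDING = {"img": "src", "object": "data"}

class LinkClassifier(HTMLParser):
    """Collects link targets, separated into normal and embedding links."""

    def __init__(self):
        super().__init__()
        self.normal = []
        self.embedding = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser delivers tag and attribute names in lower case
        attrs = dict(attrs)
        if tag in NORMAL and attrs.get(NORMAL[tag]):
            self.normal.append(attrs[NORMAL[tag]])
        elif tag in EMBEDDING and attrs.get(EMBEDDING[tag]):
            self.embedding.append(attrs[EMBEDDING[tag]])

def classify(html):
    """Return (normal link targets, embedding link targets) in document order."""
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.normal, parser.embedding
```

A tool built this way could, for example, treat everything in the second list as part of the document when archiving it, and everything in the first list as a mere reference.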

This distinction is an old one in hypertext. Some systems such as Peter Brown's original "Guide" worked only by expanding links inline, and some (such as HTML before the IMG tag was introduced) worked only with normal links.

Normal Links

The intention in the design of the web was that normal links should simply be references, with no implied meaning.

A normal hypertext link does NOT necessarily imply that

One document endorses the other; or that

One document is created by the same person as the other, or that

One document is to be considered part of another.

Typically when the user of a graphical window-oriented Web browser follows a normal link, a new window is created and the linked document is displayed in it, or the old document is deleted from its window and the linked document displayed in its place. The window system has a user interface metaphor that things in different windows are different objects.

Meaning in content

So the existence of the link itself does not carry meaning. Of course the content of the linking document can carry meaning, and often does. So, if one writes "See Fred's web pages (link) which are way cool" that is clearly some kind of endorsement. If one writes "We go into this in more detail on our sales brochure (link)" there is an implication of common authorship. If one writes "Fred's message (link) was written out of malice and is a downright lie" one is denigrating (possibly libellously) the linked document. So the content of hypertext documents often carries meaning about the linked document, and one should be responsible about this. In fact, clarifying the relative status of the linked document is often helpful to the reader.

Embedded Material

The relationship between a document and an image embedded in that document is quite different from a normal link. (In some designs it is still referred to as a sort of link).

Images, embedded objects, and background sounds and images are by default to be considered part of the document.

If I say, "To understand this you only have to read this article", or "This is the agreement between us", I am talking about a particular document. It is important that we have a clear picture of what is part of that document and what isn't. Embedded images clearly are part of the embedding document. The author of a document has responsibility for the content, even if the images he or she includes are from another web site.

(There are issues of expectations to be set about availability and security from corruption of remote material, but I do not address these here. Here I just emphasize that embedded images should be considered part of a document, but documents connected by a normal link should be regarded as separate documents.)

We compose documents out of parts, and the finished work comprises contributions from the parts and also from the arrangement. It is very important that we can include remote parts by reference without having to make a separate local copy. When an embedded image (or sound) is included by reference to its original address (URI) this allows an inquirer to know that address, and hence know the current version of the image. It allows the owner of the image to a certain extent to know and possibly to control who has access to that image. Also I expect that in the future it will allow one to find out the owner and licence terms for distribution of that image, which is important for intellectual property rights to be respected on the Web.

Explicit distinction

Advertising provides an exception to this rule: a case in which the embedded image is not part of the document. At risk of making it too easy for users to turn off advertising, it would be ideal if the distinction were made in the markup between embedded information which is or is not part of the document. This would allow, for example, a border to be placed around an advertisement to allow the user to realize that it does not come from the same source as the text. I personally feel that this would be an important step forward in the integrity of the web. A flag like

would be fine.

User Interface

When Web documents are presented to people, most current browsers (1997) make a clear distinction between embedded images, which are presented in the same window as the embedding document at the same time, and linked documents which never are. The window system's concept of a "Window" is used to convey when things are part of the same document. It is important for many reasons, some of which were mentioned above, that user interfaces continue to make this distinction.

Frames

The "frames" of HTML unfortunately provide an interface which is less clear. The parts of the document do appear within the same window, but because within a single frame (subsection of a window) one can follow hypertext links replacing content with a separate document, it is easy to create the impression that the owner of the surrounding frames is in fact responsible for the defining document. It is possible that work by the HTML community can produce explicit markup (such as the "foreign" flag above) for conveying, when frames are used, which parts of the screen are considered to be the same document. In the meantime, it is appropriate for content providers to make efforts to ensure by the design of (and/or statements on) their web pages that users are not left with the illusion that information within an embedded frame is part of their document when it is really not.

Next: Some dangerous Myths about Links

Myths about Links

timbl@w3.org (Tim Berners-Lee) — Tue, 01 Apr 1997 00:00:00 GMT

Links and Law: Myths

See Links and Law before reading this.

Linked Data

timbl@w3.org (Tim Berners-Lee) — Thu, 27 Jul 2006 00:00:00 GMT
Linked Data

The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.

Like the web of hypertext, the web of data is constructed with documents on the web. However, unlike the web of hypertext, where links are relationship anchors in hypertext documents written in HTML, for data they are links between arbitrary things described by RDF. The URIs identify any kind of object or concept. But for HTML or RDF, the same expectations apply to make the web grow:

Use URIs as names for things

Use HTTP URIs so that people can look up those names.

When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)

Include links to other URIs, so that they can discover more things.

Simple. In fact, though, a surprising amount of data isn't linked in 2006, because of problems with one or more of the steps. This article discusses solutions to these problems, details of implementation, and factors affecting choices about how you publish your data.

The four rules

I'll refer to the steps above as rules, but they are expectations of behavior. Breaking them does not destroy anything, but misses an opportunity to make data interconnected. This in turn limits the ways it can later be reused in unexpected ways. It is the unexpected re-use of information which is the value added by the web.

The first rule, to identify things with URIs, is pretty much understood by most people doing semantic web technology. If it doesn't use the universal URI set of symbols, we don't call it Semantic Web.

The second rule, to use HTTP URIs, is also widely understood. The only deviation has been, since the web started, a constant tendency for people to invent new URI schemes (and sub-schemes within the urn: scheme) such as LSIDs and handles and XRIs and DOIs and so on, for various reasons. Typically, these involve not wanting to commit to the established Domain Name System (DNS) for delegation of authority but to construct something under separate control. Sometimes it has to do with not understanding that HTTP URIs are names (not addresses) and that HTTP name lookup is a complex, powerful and evolving set of standards. This issue is discussed at length elsewhere, and time does not allow us to delve into it here. [@@ref TAG finding, etc.]

The third rule, that one should serve information on the web against a URI, is, in 2006, well followed for most ontologies, but, for some reason, not for some major datasets. One can, in general, look up the properties and classes one finds in data, and get information from the RDF, RDFS, and OWL ontologies including the relationships between the terms in the ontology.

The basic format here is RDF/XML, with its popular alternative serialization N3 (or Turtle). Large datasets provide a SPARQL query service, but the basic linked data should be provided as well.

Many research and evaluation projects in the few years of the Semantic Web technologies produced ontologies and significant data stores, but the data, if available at all, is buried in a zip archive somewhere, rather than being accessible on the web as linked data. The Biopax project and the CSAktive data on computer science research people and projects were two examples. [The CSAktive data is now (2007) available as linked data.]

There is also a large and increasing number of URIs of non-ontology data which can be looked up. Semantic wikis are one example. The "Friend of a friend" (FOAF) and Description of a Project (DOAP) ontologies are used to build social networks across the web. Typical social network portals do not provide links to other sites, nor expose their data in a standard form.

LiveJournal and Opera Community are two portal web sites which do in fact publish their data in RDF on the web. (Plaxo has a trail scheme, and I'm not sure whether they support knows links). This means that I can write in my FOAF file that I know Håkon Lie by using his URI in the Opera Community data, and a person or machine browsing that data can then follow that link and find all his friends. [Update:] Also, the Opera Community site allows you to register the RDF URI for yourself on another site. This means that public data about you from different sites can be linked together into one web, and a person or machine starting with your Opera identity can find the others.

The fourth rule, to make links elsewhere, is necessary to connect the data we have into a web, a serious, unbounded web in which one can find all kinds of things, just as on the hypertext web we have managed to build.

In hypertext web sites it is considered generally rather bad etiquette not to link to related external material. The value of your own information is very much a function of what it links to, as well as the inherent value of the information within the web page. So it is also in the Semantic Web.

So let's look at the ways of linking data, starting with the simplest way of making a link.

Basic web look-up

The simplest way to make linked data is to use, in one file, a URI which points into another.

When you write an RDF file, say <http://example.org/smith>, then you can use local identifiers within the file, say #albert, #brian and #carol. In N3 you might say

<#albert> fam:child <#brian>, <#carol>.

or in RDF/XML

The WWW architecture now gives a global identifier "http://example.org/smith#albert" to Albert. This is a valuable thing to do, as anyone on the planet can now use that global identifier to refer to Albert and give more information.

For example, in the document someone might write:

<#denise> fam:child <#edwin>, <http://example.org/smith#carol>.

or in RDF/XML

Clearly it is reasonable for anyone who comes across the identifier "http://example.org/smith#carol" to:

Form the URI of the document by truncating before the hash

Access the document to obtain information about #carol

We call this dereferencing the URI. This is basic semantic web.
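The two steps above can be sketched in a few lines of Python. This is only an illustration: it assumes the standard library's urllib.parse.urldefrag, the function name document_and_fragment is made up for this sketch, and the actual fetch of the document is left out.

```python
from urllib.parse import urldefrag


def document_and_fragment(uri: str) -> tuple[str, str]:
    """Split an identifier into the document URI and the local name.

    The part before the '#' names the document to fetch; the part after
    it identifies the thing within that document.
    """
    doc, frag = urldefrag(uri)
    return doc, frag


doc, local = document_and_fragment("http://example.org/smith#carol")
print(doc)    # the document to GET
print(local)  # the local identifier to look for within it
```

Nothing but the string is needed to find the document, which is why the hash form costs no extra round trip.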

There are several variations.

Variation: URIs without Slashes and HTTP 303

There are some circumstances in which dividing identifiers into documents doesn't work very well. There may logically be one global symbol per document, and there is a reluctance to include a # in the URI such as

http://wordnet.example.net/antidisesablishmentarianism#word
Historically, the early Dublin Core and FOAF vocabularies did not have # in their URIs. In any event when HTTP URIs without hashes are used for abstract concepts, and there is a document that carries information about them, then:

An HTTP GET request on the URI of the concept returns 303 See Other and gives, in the Location: header, the URI of the document.

The document is retrieved as normal

This method has the advantage that URIs can be made up of all forms. It has the disadvantage that an HTTP request must be made for every single one. In the case of Dublin Core, for example, dc:title and dc:creator etc are in fact served by the same ontology document, but one does not know until they have each been fetched and returned HTTP redirections.
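The exchange can be simulated without a network. The sketch below is an assumption-laden toy: the vocab.example.org URIs, the DESCRIBED_BY table and the functions respond and lookup are all invented for illustration, and a real server would do this in its HTTP configuration.

```python
# Hypothetical table: concept URI -> URI of the document describing it.
DESCRIBED_BY = {
    "http://vocab.example.org/title": "http://vocab.example.org/terms.rdf",
    "http://vocab.example.org/creator": "http://vocab.example.org/terms.rdf",
}


def respond(uri: str) -> tuple[int, dict[str, str]]:
    """Server side: answer a GET with 303 for concepts, 200 for documents."""
    if uri in DESCRIBED_BY:
        return 303, {"Location": DESCRIBED_BY[uri]}
    if uri in DESCRIBED_BY.values():
        return 200, {"Content-Type": "application/rdf+xml"}
    return 404, {}


def lookup(uri: str) -> tuple[str, int]:
    """Client side: follow at most one 303.

    Returns the URI finally requested and the number of requests made.
    """
    status, headers = respond(uri)
    if status == 303:
        target = headers["Location"]
        respond(target)  # second request, for the document itself
        return target, 2
    return uri, 1
```

Looking up the two concepts costs two requests each, and only afterwards does the client learn that both are described by the same document.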

Variation: FOAF and rdfs:seeAlso

The Friend-Of-A-Friend convention uses a form of data link, but not using either of the two forms mentioned above. To refer to another person in a FOAF file, the convention was to give two properties, one pointing to the document they are described in, and the other for identifying them within that document.

<#i> foaf:knows [
foaf:mbox <mailto:joe@example.com>;
rdfs:seeAlso ].

Read, "I know that which has email joe@example.com and about which more information is in ".

For privacy, people often don't put their email addresses on the web directly, but instead give a one-way hash (SHA-1) of the address. This clever trick allows people who already know the email address to work out that it is the same person, without giving the email away to others.

<#i> foaf:knows [
foaf:mbox_sha1sum "2738167846123764823647"; # @@ dummy
rdfs:seeAlso ].
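A sketch of how such a sum can be computed and compared, using Python's hashlib. It assumes, following the FOAF vocabulary's description as I understand it, that the hash is taken over the whole mailto: URI rather than the bare address; the function name mbox_sha1sum is simply borrowed from the property.

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """Return the hex SHA-1 digest of the mailto: URI for an address."""
    return hashlib.sha1(f"mailto:{email}".encode("utf-8")).hexdigest()


# Someone who already knows the address can recompute the digest and compare.
published = mbox_sha1sum("joe@example.com")
print(published == mbox_sha1sum("joe@example.com"))   # True
print(published == mbox_sha1sum("jane@example.com"))  # False
```

The digest is 40 hexadecimal characters, so the dummy value in the example above only stands in for a real one.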

This linking system was very successful, forming a growing social network, and dominating, in 2006, the linked data available on the web.

However, the system has the snag that it does not give URIs to people, and so basic links to them cannot be made.

I recommend (e.g. in weblogs on Links on the Semantic Web, Give yourself a URI, and Backward and Forward links in RDF just as important) that those making a FOAF file give themselves a URI as well as using the FOAF convention. Similarly, when you refer to a FOAF file which gives a URI to a person, use it in your reference to that person, so that clients which just use URIs and don't know about the FOAF convention can follow the link.

Browsable graphs
So now we have looked at ways of making a link, let's look at the choices of when to make a link.

One important pattern is a set of data which you can explore as you go link by link by fetching data. Whenever one looks up the URI for a node in the RDF graph, the server returns information about the arcs out of that node, and the arcs in. In other words, it returns any RDF statements in which the term appears as either subject or object.

Formally, call a graph G browsable if, for the URI of any node in G, looking up that URI returns information which describes the node, where describing a node means:

Returning all statements where the node is a subject or object; and

Describing all blank nodes attached to the node by one arc.

(The subgraph returned has been referred to as a "Minimum Spanning Graph" (MSG [@@ref]) or an RDF molecule [@@ref], depending on whether nodes are considered identified if they can be expressed as a path of functional, or reverse inverse functional, properties. A concise bounded description, which only follows links from subject to object, does not work.)
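As a rough sketch of what a server might compute when a node's URI is looked up, the definition can be written over a set of triples. The representation (tuples of strings, blank nodes spelled "_:name") and the function names are assumptions of this example, not part of any RDF library.

```python
Triple = tuple[str, str, str]


def is_blank(term: str) -> bool:
    """Blank nodes are written '_:name' in this sketch."""
    return term.startswith("_:")


def describe(graph: set[Triple], node: str) -> set[Triple]:
    """Statements with `node` as subject or object, plus the statements
    about any blank node attached to it by one arc."""
    result = {t for t in graph if node in (t[0], t[2])}
    blanks = {term for s, _, o in result for term in (s, o) if is_blank(term)}
    for b in blanks:
        result |= {t for t in graph if b in (t[0], t[2])}
    return result


G = {
    ("ex:timbl", "foaf:member", "ex:dig"),
    ("ex:timbl", "foaf:knows", "_:b1"),
    ("_:b1", "foaf:mbox_sha1sum", '"abc"'),
    ("ex:dig", "rdfs:label", '"DIG"'),
}
```

Here describe(G, "ex:dig") includes the membership statement even though the group is its object, which is exactly what a description that only follows subject-to-object links would miss.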

In practice, when data is stored in two documents, this means that any RDF statements which relate things in the two files must be repeated in each. So, for example, in my FOAF page I mention that I am a member of the DIG group, and that information is repeated on the DIG group data. Thus, someone starting from the concept of the group can also find out that I am a member. In fact, someone who starts off with my URI can find all the people who are in the same group.

Limitations on browsable data

So statements which relate things in the two documents must be repeated in each. This clearly is against the first rule of data storage: don't store the same data in two different places, because you will have problems keeping it consistent. This is indeed an issue with browsable data. A set of completely browsable data with links in both directions has to be completely consistent, and that takes coordination, especially if different authors or different programs are involved.
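The coordination problem can be made concrete with a small check. This is a sketch under simple assumptions: each document is a set of (subject, predicate, object) tuples, each document is said to "own" a set of nodes, and the function name missing_repeats is invented here.

```python
def missing_repeats(doc_a, doc_b, nodes_a, nodes_b):
    """Statements that link a node of one document to a node of the other
    but appear in only one of the two documents."""
    def crosses(t):
        s, _, o = t
        return (s in nodes_a and o in nodes_b) or (s in nodes_b and o in nodes_a)

    cross_a = {t for t in doc_a if crosses(t)}
    cross_b = {t for t in doc_b if crosses(t)}
    # Symmetric difference: present on one side, absent on the other.
    return cross_a ^ cross_b


foaf_doc = {("ex:timbl", "foaf:member", "ex:dig"), ("ex:timbl", "foaf:name", '"Tim"')}
group_doc = {("ex:dig", "rdfs:label", '"DIG"')}
print(missing_repeats(foaf_doc, group_doc, {"ex:timbl"}, {"ex:dig"}))
```

An empty result means the two documents agree on every statement that relates their nodes; anything else is a link that can be followed in one direction only.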

We can have completely browsable data, however, where it is automatically generated. The dbview server, for example, provides browsable virtual documents containing the data from any arbitrary relational database.

When we have data from multiple sources, then we have compromises. These are often settled by common sense, asking the question,

"If someone has the URI of that thing, what relationships to what other objects is it useful to know about?"

Sometimes, social questions determine the answer. I have links in my FOAF file that I know various people. They don't generally repeat that information in their FOAF files. Someone may say that they know me, which is an assertion which, in the FOAF convention, is theirs to assert, and the reader's to trust or not.

Other times, the number of arcs makes it impractical. A GPS track gives thousands of times at which my latitude and longitude are known. Every person loading my FOAF file can expect to get my business card information, but not all those trackpoints. It is reasonable to have a pointer from the track (or even each point) to the person whose position is represented, but not the other way.

One pattern is to have links of a certain property in a separate document. A person's homepage doesn't list all their publications, but instead puts a link to a separate document listing them. There is an understanding that foaf:made gives a work of some sort, but foaf:pubs points to a document giving a list of works. Thus, someone searching for a foaf:made link would do well to follow a foaf:pubs link. It might be useful to formalize the notion with a statement like

foaf:made link:listDocumentProperty foaf:pubs.

in one of the ontologies.

Query services

Sometimes the sheer volume of data makes serving it as lots of files possible, but cumbersome for efficient remote queries over the dataset. In this case, it seems reasonable to provide a SPARQL query service. For the data to be effectively linked, someone who only has the URI of something must be able to find their way to the SPARQL endpoint.

Here again the HTTP 303 response can be used, to refer the enquirer to a document with metadata about which query service endpoints can provide what information about which classes of URIs.
Vocabularies for doing this have not yet been standardized.
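Once the endpoint has been found, asking it for the arcs in and out of a URI is straightforward. The sketch below only builds the request; the endpoint address is hypothetical, the function names are made up, and it assumes the SPARQL protocol convention of passing the query text in a "query" parameter on a GET.

```python
from urllib.parse import urlencode


def describe_query(uri: str) -> str:
    """SPARQL asking for every arc out of, and into, the given URI."""
    return (
        "SELECT ?s ?p ?o WHERE { "
        f"{{ <{uri}> ?p ?o }} UNION {{ ?s ?p <{uri}> }} "
        "}"
    )


def request_url(endpoint: str, uri: str) -> str:
    """GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": describe_query(uri)})


print(describe_query("http://example.org/smith#carol"))
print(request_url("http://sparql.example.org/query", "http://example.org/smith#carol"))
```

How the client learns the endpoint address from the URI in the first place is the part that, as noted above, has no standard vocabulary yet.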

Is your Linked Open Data 5 Star?
(Added 2010). This year, in order to encourage people -- especially government data owners -- along the road to good linked data, I have developed this star rating system.
Linked Data is defined above. Linked Open Data (LOD) is Linked Data which is released under an open licence, which does not impede its reuse for free. Creative Commons CC-BY is an example open licence, as is the UK's Open Government Licence. Linked Data does not of course in general have to be open -- there is a lot of important use of linked data internally, and for personal and group-wide data. You can have 5-star Linked Data without it being open. However, if it claims to be Linked Open Data then it does have to be open, to get any star at all.
Under the star scheme, you get one (big!) star if the information has been made public at all, even if it is a photo of a scan of a fax of a table -- if it has an open licence. Then you get more stars as you make it progressively more powerful and easier for people to use.

★ Available on the web (whatever format) but with an open licence, to be Open Data

★★ Available as machine-readable structured data (e.g. Excel instead of image scan of a table)

★★★ As (2), plus non-proprietary format (e.g. CSV instead of Excel)

★★★★ All the above, plus: Use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff

★★★★★ All the above, plus: Link your data to other people’s data to provide context
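Because each level includes all the ones below it, the rating is simply the length of the unbroken run of satisfied conditions. A small sketch, in which the Dataset fields and the stars function are names invented for this example:

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    on_web_open_licence: bool   # public on the web under an open licence
    machine_readable: bool      # structured data rather than, say, a scan
    non_proprietary: bool       # e.g. CSV rather than Excel
    uses_rdf_uris: bool         # W3C open standards used to identify things
    links_out: bool             # links to other people's data


def stars(d: Dataset) -> int:
    """Count stars; each level requires all the levels below it."""
    levels = [d.on_web_open_licence, d.machine_readable, d.non_proprietary,
              d.uses_rdf_uris, d.links_out]
    count = 0
    for ok in levels:
        if not ok:
            break
        count += 1
    return count


# An Excel file on the web with an open licence: two stars.
print(stars(Dataset(True, True, False, False, False)))  # 2
```

Note that, under this reading, data without an open licence scores zero however well it is linked, matching the remark that Linked Open Data has to be open to get any star at all.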

How well does your data do? You can buy 5 star data mugs, T-shirts and bumper stickers from the W3C shop at cafepress: use them to get your colleagues and fellow conference-goers thinking 5 star linked data. (Profits also help W3C :-).

Now in 2010, people have been pressing me, for government data, to add a new requirement, and that is that there should be metadata about the data itself, and that that metadata should be available from a major catalog. Any open dataset (or even datasets which are not but should be open) can be registered at ckan.net. Government datasets from the UK and US should be registered at data.gov.uk or data.gov respectively. Other countries I expect to develop their own registries. Yes, there should be metadata about your dataset. That may be the subject of a new note in this series.

Conclusion

Linked data is essential to actually connect the semantic web. It is quite easy to do with a little thought, and becomes second nature. Various common sense considerations determine when to make a link and when not to.

The Tabulator client (running in a suitable browser) allows you to browse linked data using the above conventions, and can be used to check that your linked data works.

References

[Ding2005] Li Ding, et al., Tracking RDF Graph Provenance using RDF Molecules, UMBC Tech Report TR-CS-05-06

Live data

timbl@w3.org (Tim Berners-Lee) — Wed, 20 Oct 2021 00:00:00 GMT
When applications are built by sharing access-controlled read-write linked data, it is useful for one application to be informed in real time when another changes the data. By adding real-time publish/subscribe (pub/sub) functionality to the architecture, the system can react in real time without having to poll. The Solid protocol includes a basic but effective form of this using WebSockets, where any app or part of an app which is using data from a given resource can listen for changes to that resource. In 2021, the live update protocol is just a web socket 'PING' notification that the resource has changed, after which the client re-loads it. In future it would be good to instead send a PATCH with the change that had happened, to reduce both the bandwidth necessary and the number of network round trips between client and server. This will allow us to connect to more complex distributed protocols such as Conflict-free Replicated Data Types (CRDTs), and provide offline and Local First functionality in future. But right now a simple WebSocket protocol provides great user value, by allowing all kinds of apps to become live apps.
Read whole article...

Logic and the semantic web

timbl@w3.org (Tim Berners-Lee) — Fri, 01 Jan 1999 00:00:00 GMT

The Semantic Web as a language of logic

This looks at the Semantic Web design in the light of a little reading on formal logic, of the Access Limited Logic system in particular, and in the light of logical languages in general. A problem here is that I am no logician, and so I am having to step like a fascinated reporter into this world of which I do not possess intimate experience.

Introduction

The Semantic Web Toolbox discusses the step from the web as being a repository of flat data without logic to a level at which it is possible to express logic. This is something which knowledge representation systems have been wary of.

The Semantic Web has a different set of goals from most systems of logic. As Crawford and Kuipers put it in [Crawf90],

[...]a knowledge representation system must have the following properties:

It must have a reasonably compact syntax.

It must have a well defined semantics so that one can say precisely what is being represented.

It must have sufficient expressive power to represent human knowledge.

It must have an efficient, powerful, and understandable reasoning mechanism

It must be usable to build large knowledge bases.

It has proved difficult, however, to achieve the third and fourth properties simultaneously.

The semantic web goal is to be a unifying system which will (like the web for human communication) be as un-restraining as possible so that the complexity of reality can be described. Therefore item 3 becomes essential. This can be achieved by dropping 4 - or the parts of item 4 which conflict with 3, notably a single, efficient reasoning system. The idea is that, within the global semantic web, there will be a subset of the system which will be constrained in specific ways so as to achieve the tractability and efficiency which is necessary in real applications. However, the semantic web itself will not define a reasoning engine. It will define valid operations, and will require consistency for them. On the semantic web in general, a party must be able to follow a proof of a theorem but is not expected to generate one.

(This fundamental change of goals from KR systems to the semantic web is loosely analogous with the goal change from conventional hypertext systems to the original Web design, dropping link consistency in favor of expressive flexibility and scalability. The latter did not prevent individual web sites from having a strict hierarchical order or matrix structure, but it did not require it of the web as a whole.)

If there is a semantic web machine, then it is a proof validator, not a theorem prover. It can't find answers, it can't even check that an answer is right, but it can follow a simple explanation that an answer is right. The Semantic Web as a source of data should be fodder for automated reasoning systems of many kinds, but it is not as such a reasoning system.
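The difference between validating and proving can be shown with a toy checker. This is only a sketch: the formula encoding (atoms as strings, implications as ("->", A, B) tuples), the step format and the name check_proof are all assumptions of the example, and modus ponens is the only rule it knows.

```python
def check_proof(premises, steps):
    """Validate a proof; return the formula established by the last step.

    Each step is ("premise", formula) or ("mp", i, j), where step i proved A
    and step j proved ("->", A, B), so the step concludes B.
    """
    proved = []
    for step in steps:
        if step[0] == "premise":
            formula = step[1]
            if formula not in premises:
                raise ValueError(f"not a premise: {formula!r}")
        elif step[0] == "mp":
            _, i, j = step
            if not (0 <= i < len(proved) and 0 <= j < len(proved)):
                raise ValueError("step refers to a line not yet proved")
            a, imp = proved[i], proved[j]
            is_implication = isinstance(imp, tuple) and len(imp) == 3 and imp[0] == "->"
            if not (is_implication and imp[1] == a):
                raise ValueError("modus ponens does not apply")
            formula = imp[2]
        else:
            raise ValueError(f"unknown rule: {step[0]!r}")
        proved.append(formula)
    if not proved:
        raise ValueError("empty proof")
    return proved[-1]


premises = ["p", ("->", "p", "q"), ("->", "q", "r")]
steps = [
    ("premise", "p"),
    ("premise", ("->", "p", "q")),
    ("mp", 0, 1),
    ("premise", ("->", "q", "r")),
    ("mp", 2, 3),
]
print(check_proof(premises, steps))  # r
```

The checker does no search at all: it walks the steps once and either accepts each one or stops. Finding the steps is left to whoever supplies the proof.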

Most knowledge representation systems distinguish between inference "rules" and other believed information. In some cases, this is because the rules (such as substitution in a formula) cannot be written in the language - they are defined outside the language. In fact the set of rules used by the system is often not only formally quite redundant but arbitrary. However, a universal design such as the Semantic Web must be minimalist. We will ask all logical data on the web to be expressed directly or indirectly in terms of the semantic web - a strong demand - so we cannot constrain applications any further. Different machines which use data from the web will use different algorithms, different sets of inference rules. In some cases these will be powerful AI systems and in others they will be simple document conversion systems. The essential thing is that the results of either must be provably correct against the same basic minimalist rules. In fact for interchange of proof, the set of rules is an engineering choice.

There are many related ways in which subsystems can be created

The semantic web language can be subsetted, by the removal of operations and axioms and rules;

The set of statements may be limited to that from particular documents or web sites;

The form of formulas used may be constrained, for example using document schemata;

Application design decisions can be made so as to specifically guarantee tractable results using common reasoning engines.

Proofs can be constructed by completely hand-built application-specific programs

For example, Access Limited Logic is restricted (as I understand it) to relations r(a,b) available when r is accessed, and uses inference rules which only chain forward along such links. There is also a "partitioning" of the Web by partitioning the rules in order to limit complexity.

For the semantic web as a whole, then, we do require tractable

Consistency, that it must not be possible to deduce a contradiction (without having been given one)

Strength in that all applications must be subsets

Grounding in Reality

Philosophically, the semantic web produces more than a set of rules for manipulation of formulae. It defines documents on the Web as having a socially significant meaning. Therefore it is not simply sufficient to demonstrate that one can constrain the semantic web so as to make it isomorphic to a particular algebra of a given system, but one must ensure that a particular mapping is defined so that the web representation of that particular system conveys its semantics in a way that it can meaningfully be combined with the rest of the semantic web. Electronic commerce needs a solid foundation in this way, and the development of the semantic web is (in 1999) essential to provide a rigid framework in which to define electronic commerce terms, before electronic commerce expands as a mass of vaguely defined semantics and ad hoc syntax which leaves no room for automatic treatment, and in which the court of law rather than a logical derivation settles arguments.

Practically, the meaning of semantic web data is grounded in non-semantic-web applications which are interfaced to the semantic web. For example, currency transfer or ecommerce applications, which accept semantic web input, define for practical purposes what the terms in the currency transfer instrument mean.

Axiomatic Basis

@@I [DanC] think this section is outdated by recent thoughts [2002] on paradox and the excluded middle

To the level of first order logic, we don't really need to pick one set of axioms in that there are equivalent choices which lead to demonstrably the same results.

(A cute one at the propositional logic level seems [Burris, p126] to be Nicod's set in which nand (in XML toolbox .. and below [xy]) is the Sheffer (sole) connective and the only rules of inference are substitution and the modus ponens equivalent that from F and [F[G H]] one can deduce H, and the single axiom [[P[QR]][[S[SS]][[UQ][[PU][PU]]]]].)
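One can at least check by brute force that Nicod's axiom is a tautology under nand. The sketch below writes the axiom as it is usually stated, as nested pairs in which (x, y) stands for [xy]; the encoding and function names are assumptions of this example.

```python
from itertools import product


def nand(x: bool, y: bool) -> bool:
    return not (x and y)


def evaluate(formula, env) -> bool:
    """A formula is a variable name or a pair (left, right) meaning [left right]."""
    if isinstance(formula, str):
        return env[formula]
    left, right = formula
    return nand(evaluate(left, env), evaluate(right, env))


def variables(formula) -> set:
    if isinstance(formula, str):
        return {formula}
    return variables(formula[0]) | variables(formula[1])


def is_tautology(formula) -> bool:
    """True if the formula holds under every assignment of its variables."""
    names = sorted(variables(formula))
    return all(
        evaluate(formula, dict(zip(names, values)))
        for values in product([False, True], repeat=len(names))
    )


# Nicod's axiom, [[P[QR]][[S[SS]][[UQ][[PU][PU]]]]], as nested pairs.
NICOD = (
    ("P", ("Q", "R")),
    (("S", ("S", "S")), (("U", "Q"), (("P", "U"), ("P", "U")))),
)

print(is_tautology(NICOD))  # True
```

The inference rule can be checked the same way: whenever F and [F[G H]] are both true, [G H] must be false, so G and H are both true.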

Let us assume the properties of first order logic here.

If we add anything else we have to be careful that it should either be definable in terms of the first order set or that the resulting language is a subset of a well proven logical system -- or else we have a lot of work to do in establishing a new system!

Intractability and Undecidability

Tractability and decidability are two goals to which we explicitly do not aspire in the Semantic Web, in order to get expressive power in return. (We still require consistency!) The world is full of undecidable statements, and intractable problems. The semantic web has to give the power to express such things.

Crawford and Kuipers explain the same in the introduction to their Negation in ALL paper,

"Experience with formally specified knowledge representation systems has revealed a trade-off between the expressive power of knowledge representation systems and their computational complexity. If, for example, a knowledge representation system is as expressive as first order predicate calculus, then the problem of deciding what an agent could logically deduce from its knowledge base is unsolvable"

Do we need in practice to decide what an agent could deduce from its logic base? No, not in general. The agent may have various kinds of reasoning engine, and in practice also various amounts of connectivity, storage space, access to indexes, and processing power which will determine what it will actually deduce. Knowing that a certain algorithm may be nondeterministic polynomial in the size of the entire Web may not be at all helpful, as even linear time would be quite impractical. Practical computability may be assured by topological properties of the web, or the existence of known shortcuts such as precompiled indexes and definitive exclusive lists.

Keeping a language less powerful than first order predicate calculus is quite reasonable within an application, but not for the Web.

Decidability

A dream of logicians in the last century was to find languages in which all sentences were either true or false, and provably so. This involved trying to restrict the language so as to avoid the possibility of (for example) self-contradictory statements which can not be categorized as true or not true.

On the Semantic Web, this looks like a very academic problem, when in fact one anyway operates with a mass of untrustworthy data at any point, and restricts what one uses to a limited subset of the web. Clearly one must not be able to derive a self-contradictory statement, but there is no harm in the language being powerful enough to express it. Indeed, endorsement systems must give us the power to say "that statement is false" and so loops which if believed prove self-contradictory will arise by accident or design. A typical response of a system which finds a self-contradictory statement might be similar to the response to finding a contradiction, for example, to cease to trust information from the same source (or public key).

Reflection: Quoting, Context, and/or Higher Order Logic

@@hmm... better section heading? maybe just quoting, or contexts? one place where we really do seem to need more than HOL is induction.

The fact that there is [Burris p___] "no good set of axioms and rules for higher order logic" is frustrating not only in that it stumps the desire to write common sense mathematically, but also because operations which seem natural for electronic commerce seem at first sight to demand higher order logic. There is also a fundamental niceness to having a system powerful enough to describe its own rules, of course, just as one expects to be able to write a compiler for a programming language in the same language (@@need to study references from Hayes, esp "Tarski's results on meta-descriptions (a consistent language can't be the same expressive power as its own metatheory), Montague's paradox (showing that even quite weak languages can't consistently describe their own semantics)"). When Frege tried second-order logic, I understand, Russell showed that his logic was inconsistent. But can we make a language which is consistent (you can't derive a contradiction from its axioms) and yet allows enough to, for example:

Model human trust in a realistic way

Write down the mapping from XML to RDF logic to allow a theorem to be proved from the raw XML (and similarly define the XML syntax in logic to allow a theorem to be proved from the byte stream), and using it;

The sort of rule it is tempting to write is one which allows the inference of an RDF triple from a message from whose semantic content one can algebraically derive that triple.

forall message,t, r, x, y ( (signed(message,K) & derivable(t, message) & subject(t, x) & predicate(t, r) & object(t, y)) -> r(x,y) )

(where K is a specific constant public key, and t is a triple)

This breaks the boundary between the premises which deal with the mechanics of the language and the conclusion which is about the subject-matter of the language. Do we really need to do this, or can we get by with several independent levels of machinery, letting one machine prepare a "believable" message stream and parse it into a graph, and then a second machine which shares no knowledge space with the first, do the reasoning on the result? To me this seems hopeless, as one will in practice want to direct the front end's search for new documents from the needs of the reasoning by the back end. But this is all hunch.

Peregrin tries to categorize the needs for and problems with higher order logic (HOL) in [Peregrin]. His description of Henkinian Understanding of HOL in which predicates are a subclass of objects ("individuals") seems to describe my current understanding of the mapping of RDF into logic, with RDF predicates, binary relations, being a subclass of RDF nodes. Certainly in RDF the "property" type can be deduced from the use of any URI as a predicate:

forall p,x,y p(x,y) -> type(p, property)

and we assume that the "assert" predicate is equivalent to the predicate itself.

forall p,x,y assert(p,x,y) <--> p(x,y)

so we are moving between second-order formulation and first-order formulation.
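A small sketch of what treating predicates as ordinary nodes looks like in practice. Triples are (subject, predicate, object) tuples of plain strings, and the names infer_property_types and assert_ are invented for this example.

```python
def infer_property_types(triples):
    """forall p,x,y  p(x,y) -> type(p, property), applied once to each triple."""
    return {(p, "type", "property") for _, p, _ in triples}


def assert_(p, x, y):
    """assert(p,x,y) <--> p(x,y): the three-place form names the same triple."""
    return (x, p, y)


data = {assert_("child", "albert", "brian")}
inferred = infer_property_types(data)
print(inferred)                                # {('child', 'type', 'property')}
print(infer_property_types(data | inferred))   # now also types "type" itself
```

In the second call the predicate "type" is itself used as a predicate, so it too is inferred to be a property: the predicate appears in subject position like any other node, which is the move between the two formulations.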

(2000) The experience of the [PCA] work seems to demonstrate that higher order logic is a very realistic way of unifying these systems.

(2001) The treatment of contexts in [CLA] seems consistent with the design we've implemented.

Induction, primitive recursion, and generalizing to infinitely many cases

It seems clear that FOL is insufficient in that some sort of induction seems necessary.

I agree with Tait (Finitism, J. of Philosophy, 1981, 78, 524-546) that PRA is THE NECESSARY AND SUFFICIENT logic for talking about logics and proofs

Robert S. Boyer, 18 Apr 93

also: pra.n3, an N3 transcription of Peter Suber, Recursive Function Theory

also: ACL2: A Precise Description of the ACL2 Logic Kaufmann and Moore 22 Apr 1998, rdf scratchpad entry 26Mar

(for another sort of induction, i.e. as opposed to deduction, see: Circumscription by McCarthy, 1980.)

A quick look at iCalendar

timbl@w3.org (Tim Berners-Lee) — Mon, 01 Oct 2001 00:00:00 GMT

Building an RDF model:

A quick look at iCalendar

I spent a few hours reading 50 pages of the iCalendar RFC2445 with a view to evaluating proposals to put it into XML. My conclusion early on was that the spec should be written in terms of RDF properties, particularly as it has a clear property/value and parameter/value structure.

Summary

General points I noticed included

The spec is full of x-extensions and IANA registries. These would all be done using namespaces in XML.

There is no summary of properties with their domains and ranges, which would make the spec much clearer.

The parameter value type of "URI" implicitly causes dereferencing. This is not clear from the spec but is assumed by the examples.

There are a few examples of wanton reification, e.g. relationship type.

Encodings, for cleanliness: the encoding is a relationship between two objects, not the property of an object. Same comment on XML DSig.

I am concerned that I have not found very much protocol defining how agents interact, or what a message containing a calendar entry means. But maybe that is elsewhere in the spec.

Narrative

When looking for a natural representation of data in a given language in RDF, one looks at first for the natural structure of the language. iCalendar has a nested set of structures which naturally lend themselves to an RDF graph interpretation. Apart from the noted exceptions, this translation leads to a set of fairly logically defined RDF properties which could form iCalendar's contribution to the semantic web.

A "calendar" consists of a set of components, such as events, and to-do list and journal entries. These seem natural RDF types. (There is a choice of whether to introduce a specific property as the relationship between the containing calendar and a specific type of component, or whether to use a generic inclusion property and then specify the subtype of the component.)

The components have properties, even known as properties in iCalendar. Now each property is in fact a complex thing which has a "value" (implicitly named) and various "parameters" with names.

The named parameters are clearly easily represented as RDF properties.

The values are generally atomic things such as integers and strings, with two exceptions. One is when the value is a URI, which implies that the actual value is in a document with that URI. Another is that the value datatype "recur" is a string which itself has a substructure. This recurrence substructure takes the form of (guess what!) a set of attribute value pairs.
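The attribute value structure of a recurrence string is easy to expose. A sketch, assuming the semicolon-separated NAME=value form with comma-separated lists that the RFC's examples use; the function name parse_recur is made up, and no validation of rule names is attempted.

```python
def parse_recur(value: str) -> dict[str, list[str]]:
    """Split 'NAME=v1,v2;NAME=v' into a mapping from name to list of values."""
    parts = {}
    for pair in value.split(";"):
        name, _, raw = pair.partition("=")
        parts[name] = raw.split(",")
    return parts


print(parse_recur("FREQ=WEEKLY;INTERVAL=2;BYDAY=MO,WE"))
# {'FREQ': ['WEEKLY'], 'INTERVAL': ['2'], 'BYDAY': ['MO', 'WE']}
```

Each key here is a natural candidate for an RDF property of the recurrence, which is why a string with hidden substructure is a poor fit for a value.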

Detailed comments

2.3 Internationalization

If this were XML this would be done for you, with Unicode and the various encodings etc.

4.1 Content Lines

x-name and iana-token are extensions which XML would give for free using namespaces.

"Each property defines the specific ABNF for the parameters allowed on the property"

This makes general parsing impossible, and direct conversion into XML difficult. The only hope is that in fact it is not true and there is more consistency than this line leads you to believe! This sounds like a remake of the RFC822 problem which HTTP has in spades: one parser per page of the spec.
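If the hoped-for consistency holds, one generic parser would do. The sketch below assumes every content line has the shape NAME, then optional ;PARAM=value pairs, then a colon and the value, with double quotes protecting separators inside parameter values. It is an illustration of that assumption, not a conforming iCalendar parser, and the function names are invented.

```python
def split_unquoted(text: str, sep: str) -> list[str]:
    """Split on `sep`, ignoring separators inside double quotes."""
    parts, current, quoted = [], [], False
    for ch in text:
        if ch == '"':
            quoted = not quoted
            current.append(ch)
        elif ch == sep and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_content_line(line: str) -> tuple[str, dict[str, str], str]:
    """Return (name, parameters, value) for 'NAME;PARAM=x;...:value'."""
    quoted = False
    for i, ch in enumerate(line):
        if ch == '"':
            quoted = not quoted
        elif ch == ":" and not quoted:
            break  # first colon outside quotes ends the name and parameters
    else:
        raise ValueError("no ':' separating name from value")
    head, value = line[:i], line[i + 1:]
    name, *params = split_unquoted(head, ";")
    parameters = {}
    for p in params:
        key, _, val = p.partition("=")
        parameters[key] = val.strip('"')
    return name, parameters, value


print(parse_content_line("ATTACH:http://xyz.com/public/quarterly-report.doc"))
```

The result maps directly onto the structure described in the narrative: a property name, a set of named parameters, and an implicitly named value.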

4.1.3

Here in the example

ATTACH;FMTTYPE=image/basic;ENCODING=BASE64;VALUE=BINARY: MIICajCCAdOgAwIBAgICBEUwDQYJKoZIhvcNAQEEBQAwdzELMAkGA1U EBhMCVVMxLDAqBgNVBAoTI05ldHNjYXBlIENvbW11bmljYXRpb25zIE <...remainder of "BASE64" encoded binary data...>

represents the encoding as though it were a property of the value. It isn't: it is a relationship between the value and the string expressed here. Nicer to write that.

image/basic MIICajCCAdOgAwIBAgICBEUwDQYJKoZIhvcN [...]

which would mean (in XML or RDF nonstriped strawman syntax) "Something is attached which has content type image/basic and has base64 encoding MMICCblablahblah".

Note that making base64 a first class relationship (subclass of encoding) makes for brevity and extensibility: with a namespace I can introduce a new one.

Value=binary has all these problems and is unnecessary. It is assumed in base64. The earlier example with the URI

ATTACH:http://xyz.com/public/quarterly-report.doc

has an implicit dereferencing operation which it would be best to expose:

http://xyz.com/public/quarterly-report.doc

which means, consistently with the previous example, "something is attached which is identified by URI http://...."

4.2 property parameters

Property parameter values MUST NOT contain a double quote. So I guess that if I want to represent something which does... I attach it?

4.2.1

ALTREP and many of the following parameters can be represented obviously as RDF properties. There needs to be an explicit property between the introduced thing and any "value".

cid:asdfsadf@sdfsdaf.com Project XYZ review meeting

This becomes more obvious when you look at things like ATTENDEE.

4.2.2.

There seems to be an embryonic notion of type here ("properties with the CAL-ADDRESS value type"). I assume this can be formalized. It would be so much simpler if this were tabulated.

4.2.3 Calendar User Type.

"mailto:" is usually in lower case. I thought it was in fact mandatory that it be in lower case.

4.2.5 Delegatees

It is very confusing who ends up being the attendee notionally when both DELEGATED-TO and DELEGATED-FROM are specified. Changing this to RDF, or contemplating doing logical operations on it, makes one queasy about the solidity here.

ATTENDEE;DELEGATED-TO="mailto:a@y.com";DELEGATED-FROM="mailto:b@y.com":c@y.com

What is that equivalent to? I assume a@y.com goes to the meeting.

4.2.7. See comment about 4.1.3

4.2.9 Free/Busy Time type

make relationships first class

FREEBUSY=FREE: would be better as FREE: to reduce unnecessary complication and allow extension.

If that section of the spec (4.2.9) seems to be self-referential and difficult to read, that is also because it is describing an unnatural part of a clumsy syntax. You don't say "I am free or busy as follows: 12-1pm and we are talking about free here"! Because RDF makes these things first class objects and allows you to group FREE and BUSY and REALLYBUSY as subclasses of FREEBUSYTYPE, life is easier.
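A minimal Python sketch of what "first class" buys you. The subproperty table and the helper names are hypothetical, invented for the example; the point is only that a generic query for FBTYPE finds FREE and BUSY statements, and that a new term such as REALLYBUSY can be introduced without touching the query code.

```python
# Each specific relationship points at the more general one it refines.
# (Assumed table for illustration; the spec lists the FBTYPE values.)
SUBPROPERTY_OF = {
    "FREE": "FBTYPE",
    "BUSY": "FBTYPE",
    "BUSY-UNAVAILABLE": "FBTYPE",
    "BUSY-TENTATIVE": "FBTYPE",
}


def is_subproperty(prop, ancestor, table=SUBPROPERTY_OF):
    """True if `prop` is `ancestor` or refines it, directly or indirectly."""
    while prop is not None:
        if prop == ancestor:
            return True
        prop = table.get(prop)
    return False


def select(triples, prop, table=SUBPROPERTY_OF):
    """All triples whose property is `prop` or one of its refinements."""
    return [t for t in triples if is_subproperty(t[1], prop, table)]


statements = [
    ("_:me", "FREE", "19980314T120000Z/19980314T130000Z"),
    ("_:me", "BUSY", "19980314T140000Z/19980314T150000Z"),
    ("_:me", "SUMMARY", "lunch"),
]

print(select(statements, "FBTYPE"))  # both the FREE and the BUSY statement
print(select(statements, "FREE"))    # only the FREE statement

# An extension adds a term in its own vocabulary; nothing else changes.
extended = dict(SUBPROPERTY_OF, REALLYBUSY="BUSY")
print(is_subproperty("REALLYBUSY", "FBTYPE", extended))
```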

4.2.10 language

xml:lang of course is what one would get for free with XML.

4.2.15

"RELATED-TO:RELTYPE=SIBLING" is a classic wanton reification. Just say SIBLING:

Unfortunately the specification defines how calendars can be put into a hierarchical relationship but doesn't say what that relationship *means*. Maybe it does later in the spec.

4.2.18 Sent By

This is a relationship between a mailbox and another mailbox. It is that the owner of one mailbox is being represented by the owner of another. Yes, the message which asserted this data was probably sent by the agent, but the term is misleading when it crops up in the data. This will cause confusion. This is an example of the clarification which arises when you try to represent the meaning of each rdf:property (icalendar:parameter) independently.

4.2.20 Value Data Type

Note that the "URI" data type does not just constrain the value string to be a valid URI, but indicates that the value string is the document you get when you dereference the URI. Big difference, particularly when you automate the base64 decoding of something.

In general, note XML data types are defined by XML schema working group. See draft @@. A comparison would be a useful exercise.

4.8.4.1 ATTENDEE

"If the LANGUAGE property parameter is specified, the identified language applies to the CN parameter"

That is a terrible bit of design - a typical bit of interference between different headers which is so tempting for designers in these flat specs which can't use nesting. How many other clauses like this are there?

LANGUAGE is, I must admit, something RDF has a problem with in general. It is difficult to specify that a string has a language without making an intermediate node that you don't want. This is, I realize, the same as the intermediate packaging problem: how to let a system know that what it asked for is inside, but in the meantime, here is some useful information about it. Here is a number and by the way it is prime. Here is a GIF and by the way it is copyright. Here is a common name and by the way it is in English. It is interesting to see that iCalendar has the same problem.

4.8.4 UID

There is linking between components of calendars which uses "UIDs" which are mid: URIs with the prefix removed. This is a bug.

It removes calendar objects from the URI space so that one cannot refer to them with any other system which uses a URI -- unless you simply assume that you can by using mid:

The spec is full of recommendations for making identifiers unique.

It has a given length of 255 characters, which is a Y2K bug asking to happen. Never specify fixed buffer sizes.
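One way to put such identifiers back into URI space can be sketched in Python, under the assumption that the UID has the message-id form which the mid: scheme (RFC 2392) expects. The helper names are made up for the example, and the percent-encoding choice is a guess at what is reasonable rather than a reading of either spec.

```python
from urllib.parse import quote, unquote


def uid_to_mid(uid):
    """Turn a bare iCalendar UID into a mid: URI (assumes message-id form)."""
    # Keep "@" literal; percent-encode anything else that is unsafe in a URI.
    return "mid:" + quote(uid, safe="@")


def mid_to_uid(uri):
    """Recover the bare UID from a mid: URI."""
    if not uri.lower().startswith("mid:"):
        raise ValueError("not a mid: URI: " + uri)
    return unquote(uri[len("mid:"):])


uid = "19970901T130000Z-123401@host.com"
uri = uid_to_mid(uid)
print(uri)
print(mid_to_uid(uri) == uid)
```

No length limit appears anywhere in the sketch, which is rather the point.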

4.8.7.4 "SEQUENCE"

This is not in fact a property of an event, but is a property of a given expression of the state of an event. The rule is that it must be incremented by the organizer if the event changes significantly. In a peer-to-peer world, it is not obvious what to do.

Not reviewed

I skipped most of the rest of the spec but a few very similar concerns arose with some other parts I glanced at.

Conclusion

It seems that using RDF nodes for the calendar, for each event etc., and for each icalendar:property is a fairly straightforward mapping.

A spinoff would be a vocabulary which would include useful reusable models of time. The timezone work could be factored out if it is definitive.

Where RDF mapping was not obvious this sometimes coincided with unclear aspects of the specification.

There are three levels at which the RDF mapping could be made

A very direct mapping of the ical:properties and parameters onto rdf:properties. Always use the same "value" rdf:property for the VALUE of an ical:property. This would leave some things looking illogical in RDF. It would be simple to define as a mapping, but the definition of the properties would be strange in some cases.

Make a few simple adjustments to make the RDF more natural. Places to look for these have been indicated with a @@ in the table. This will make the mapping obvious to an iCal expert reading the RDF, but at the same time make the RDF queries simpler and the properties more reusable. It would move things like RELATED RELTYPE=X into a subclass relationship between X and RELATED which allows generic RDF machinery to process it.

An extensive rework in which the logic of rules was largely exposed in RDFS or something stronger would of course be great.
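The difference between the first two levels can be sketched in a few lines of Python. This is only an illustration under assumptions: the triple layout, the "subPropertyOf" label, the "_:p" node name and the PROMOTED table are all invented for the example, and the third level (rules exposed in RDFS or stronger) is not attempted.

```python
# Which parameter of which iCalendar property gets promoted to a first
# class relationship in the adjusted mapping (assumed table).
PROMOTED = {"RELATED-TO": "RELTYPE", "FREEBUSY": "FBTYPE"}


def direct_mapping(subject, name, params, value, node="_:p"):
    """Level 1: an intermediate node with a generic "value" property."""
    triples = [(subject, name, node), (node, "value", value)]
    triples.extend((node, key, val) for key, val in params.items())
    return triples


def adjusted_mapping(subject, name, params, value, node="_:p"):
    """Level 2: turn e.g. RELTYPE=SIBLING into a SIBLING relationship."""
    key = PROMOTED.get(name)
    if key is None or key not in params:
        return direct_mapping(subject, name, params, value, node)
    specific = params[key]
    rest = {k: v for k, v in params.items() if k != key}
    # Record that the specific term refines the generic one, so generic
    # machinery can still find it under the original name.
    triples = [(specific, "subPropertyOf", name)]
    if rest:
        triples.extend(direct_mapping(subject, specific, rest, value, node))
    else:
        triples.append((subject, specific, value))
    return triples


print(direct_mapping("_:ev", "RELATED-TO", {"RELTYPE": "SIBLING"}, "uid1"))
print(adjusted_mapping("_:ev", "RELATED-TO", {"RELTYPE": "SIBLING"}, "uid1"))
```

The adjusted form says SIBLING directly between the two components, and the extra schema statement is what lets a query for RELATED-TO still match.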

Appendix: Node types

Node types inferred

party implicit node in all properties with a CAL-ADDRESS value type. (person or group: anything which can have a mailbox)

cal-address A mailbox - normally mailto:... URI

CU Calendar user defined in CUTYPE

INDIVIDUAL, GROUP, RESOURCE, ROOM CU

ldap-directory starts "ldap:" (is this a standard?) URI

mime-type string

participation status needs-action, accepted, declines, tentative, delegated, ... (an enum type- could do better. Constraints in the spec.) string

component of a calendar

EVENT, TODO, etc component

TimeProperty DTSTART, DTEND, DUE, EXDATE, RDATE

Timezone see TZID string

icalobject

recur defined by recurrence properties -Really complex datatype could be broken down into RDF! Contains its own nested attr/value structure.

Appendix: rdf:Properties - from "parameters"

Properties from section 4

iCalendar name domain range Notes

ALTREP anything iCal property? URI alternative to body

CN party string

: (mailbox) party cal-address Implicit node between a party and that party's mailbox. Represented by "value" of property

CUTYPE - type

DELEGATED-FROM party cal-address

DELEGATED-TO party cal-address

DIR party URI

eightbit, base64 bits text text encodes bits according to RFC2045. Was value of encoding "property" which was faulty model. Now, subclass of generic "encoding" property

ENCODING bits text Only in schema, as superclass of eightbit and base64 See notes

FMTTYPE document mime-type Why not call it content-type?! Applies to a document. Expect the implicit URI property to tell you which object.

FBTYPE Supertype of the following

FREE, BUSY, BUSY-UNAVAILABLE, BUSY-TENTATIVE ? time-interval enum became subclasses FBTYPE property

LANGUAGE string-or-doc iso-language Equivalent xml:lang

MEMBER party cal-address group membership

PARTSTAT party enum A status: part of some protocol?

RANGE component superclass only of ...

THISANDPRIOR, THISANDFUTURE component date-time subclass of RANGE (was qualifier)

RELATED component period@@ superclass of TRIGGER-FROM-START and TRIGGER-FROM-END?

RELTYPE component component Superclass only, of

PARENT, CHILD, SIBLING component component Subclasses of RELTYPE. Hierarchical constraints. Semantics unclear@@.

ROLE party enum roleparam Attendee; role=chair could it be better "chair?". Wait and see whether it is a separate dimension.

RSVP party boolean

SENT-BY party cal-address Misleading. "Represented by" would be better. Some message was sent.

TZID anything taking time or D timezone Yuk. Should be part of the time string. Makes time complicated

VALUE string-or-doc string Superclass of the following

BINARY, BOOLEAN, CAL-ADDRESS, DATE, DATE-TIME, DURATION, FLOAT, INTEGER, PERIOD, RECUR, TEXT, TIME, URI, UTC-OFFSET string string Specifies the datatype of an associated string

URI document URI Subclass of VALUE but indicates the value is the content of the resource identified.

calprop icalobject superclass for the following

VERSION icalobject string subclass of calprop. unique.

PRODID icalobject string subclass of calprop. semantics? unique.

CALSCALE icalobject string subclass of calprop

METHOD icalobject string This is a hook for a protocol definition

VEVENT icalobject event Property VEVENT of calendar implies component is of type event. See spec for properties including this in their domain

VTODO icalobject todo similar

VJOURNAL icalobject journal similar

VFREEBUSY icalobject freebusy similar

VTIMEZONE icalobject timezonedef similar Definition of a timezone.

VALARM ?component alarm can nest in component

CALSCALE icalobject

Appendix: Calendar component Properties

See spec 4.8

The columns E, T etc indicate whether the subject of the property is permitted to be an event, todo, journal, freebusy, alarm or timezone component.

Properties of calendar components

iCalendar name E T J F A Tz range Notes

ATTACH y y y y text-or-doc

CATEGORIES y y y text List of enums

CLASS y y y classification

COMMENT y y y y y text no comment

DESCRIPTION y y y y text

GEO y y float float lat long. @@ Split into two properties?

LOCATION y y text

PERCENT-COMPLETE y integer

PRIORITY y y integer

RESOURCES y y text

STATUS y y y text enum - see the spec.

SUMMARY y y y y text

COMPLETED date-time

DTEND y y date-time or date

DUE y date-time or date

DTSTART y y y y date-time or date

DURATION y y y y duration

FREEBUSY y period

TRANSP y text really boolean!

TZID a a a a a a text

TZNAME y text

TZOFFFROM y utc-offset like -0500

TZOFFTO y utc-offset

TZURL y URI

ATTENDEE y y y y y y party @@ If language is specified, it applies to CN: Kludge! @@@

CONTACT y y y y text

ORGANIZER y y y y party Note in FREEBUSY the use is different

RECURRENCE-ID y y y date-time or date Could be a problem. Not a property of an event, but its presence makes it a reference to a specific occurrence of a repeated event.

RELATED-TO y y y text (really URI which is UID of component) Subclass only of PARENT, CHILD, SIBLING above.

PARENT , CHILD, SIBLING y y y see RELATED-TO

URI y y y y URI document "associated with" component. For more information.

UID y y y y UID - URI without mid: @@ Missing scheme!!! @@ replace with mid: URI

EXDATE y y y date-time or date Excludes the dates given @@ implicit logic makes search logic difficult.

EXRULE y y y recur

RDATE y y y date-time or date

RRULE y y y recur

Properties of Alarm components and config control and misc

name domain range Notes

ACTION A text really an enum

REPEAT A integer

TRIGGER A duration or date-time See RELATED. @ Split into two properties?

CREATED ETJ date-time

DTSTAMP ETJF date-time

LAST-MODIFIED ETJTz date-time

SEQUENCE ETJ integer fuzzy rules for incrementing this

REQUEST-STATUS ETJF text eg 3.1.1

Properties from recurrence rules

name domain range Notes

UNTIL rrule text text - all these are text with various constraints and substructure

COUNT

INTERVAL

BYSECOND

BYMINUTE

BYHOUR

BYDAY

BYMONTHDAY

BYYEARDAY

BYWEEKNO

BYMONTH

BYSETPOS

WKST

FREQ
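Since the recur value is itself a set of attribute value pairs, breaking it down is mostly a matter of splitting. A minimal Python sketch follows; the function name is an assumption, values are kept as text as the table suggests, and none of the constraints between the parts are checked.

```python
def parse_recur(text):
    """Split a recurrence rule string into its attribute value pairs.

    BYxxx parts may hold comma-separated lists, so they become lists;
    everything else is kept as a plain string.
    """
    rule = {}
    for part in text.split(";"):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("expected NAME=value, got: " + part)
        key = key.upper()
        rule[key] = value.split(",") if key.startswith("BY") else value
    return rule


print(parse_recur("FREQ=WEEKLY;INTERVAL=2;BYDAY=MO,WE"))
```

Each key of the resulting dictionary is a candidate rdf:property whose domain is the recur node.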

Properties of

name domain range Notes

Examples

@@@

References

There must be a much better list of resources for hacking calendar files of various formats - but until I find it here are some random things I found.

The iCalendar RFC: RFC2445

Jetstream: Java classes in Apache's Jetstream which represent the iCalendar properties.

Open source handheld synchronisation software at openhandheld.org

PalmOs documentation; file formats (pdf copy)

Dan Connolly's design research notebook on this

Mandatory extensions

timbl@w3.org (Tim Berners-Lee) — Sat, 01 Jan 2000 00:00:00 GMT

Mandatory extensions

There is a common requirement for the design of a language on the web that it should allow for extensions, but it must allow a clear declaration as to whether understanding of an extension is a requirement to understanding of the document or whether it may be ignored. (See Evolvability)

Historically the lack of such a "mandatory field" has led to a complete inability to get any particular guaranteed behaviour by clients on the web.

This is essential for partial understanding and the smooth evolution of the web.

A simple requirement on a language is that it not only provide for its own extension, but provides for a way to explain whether a given extension is optional or not. This is a fundamental key to smooth evolution from the language to a new version.

There are many ways in which it can be done. It can be done term by term, or in bulk about a whole new language. It can be specified in the new document, or in the schema for the new language.

XML provides in Namespaces a standard way of extending languages. It should also, in my opinion, provide a standard way to specify mandatory or optional extensions.

I propose two things:

Sublanguages

The simple assertion that language A is a sublanguage of language B means that the writer's intent is preserved if a document in language A is converted into a document in language B just by relabelling every term as being from language B. For XML, this means that a receiver of namespace A can simply process it as though the namespace had been declared as B.
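For XML the relabelling really is mechanical, as this small Python sketch suggests. It uses the standard library's ElementTree; the function name and the example.org namespace URIs are placeholders for the example, and how the sublanguage assertion itself would be declared is left open.

```python
import xml.etree.ElementTree as ET


def relabel_namespace(root, ns_a, ns_b):
    """Treat every term from namespace A as though it came from namespace B."""
    old, new = "{" + ns_a + "}", "{" + ns_b + "}"
    for elem in root.iter():
        # Element names (comments and PIs have non-string tags; skip them).
        if isinstance(elem.tag, str) and elem.tag.startswith(old):
            elem.tag = new + elem.tag[len(old):]
        # Namespace-qualified attribute names are terms too.
        for name in list(elem.attrib):
            if name.startswith(old):
                elem.attrib[new + name[len(old):]] = elem.attrib.pop(name)
    return root


doc = ET.fromstring(
    '<a:note xmlns:a="http://example.org/A"><a:body>hi</a:body></a:note>'
)
relabel_namespace(doc, "http://example.org/A", "http://example.org/B")
print(doc.tag)
```

A receiver which only knows language B can apply this before processing, provided it has some reason to believe the sublanguage assertion.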

This assertion has got to be simple enough to put into a document for cases where the functionality is needed without the receiver having to dereference a schema.

Optional/Ignoreable/Mandatory flags for elements

In XML there are three simple things you can do with an element you don't understand.

Stop, and conclude you do not understand the document, or the clause in the document; Example : logical NOT

Ignore the element and all its contents (including child elements) Example:

Replace the element with its contents (including children). Example:

The schema language needs to be able to specify these very simply, and indeed it would be neat to be able to do it in a document for a given element, or in one fell swoop for all the elements in a given namespace.
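The three behaviours above can be sketched as a small Python function over ElementTree. Everything here is an assumption made for illustration: the action names "stop", "ignore" and "unwrap", the idea of passing the policy in as a dictionary (standing in for whatever the schema or document would declare), and the choice to stop when nothing is declared.

```python
import xml.etree.ElementTree as ET


class NotUnderstood(Exception):
    """Raised when an unknown element is mandatory to understand."""


def _append_text(parent, text):
    """Append text at the current end of parent's content."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _copy_into(parent, elem, known, policy):
    """Copy elem into parent, applying the policy to unknown elements."""
    if elem.tag in known:
        target = ET.SubElement(parent, elem.tag, dict(elem.attrib))
    else:
        action = policy.get(elem.tag, "stop")
        if action == "stop":
            raise NotUnderstood(elem.tag)
        if action == "ignore":
            return  # drop the element and all its contents
        target = parent  # "unwrap": splice the contents into the parent
    _append_text(target, elem.text)
    for child in elem:
        _copy_into(target, child, known, policy)
        _append_text(target, child.tail)


def process(root, known, policy):
    """Return a copy of the tree containing only understood elements."""
    if root.tag not in known:
        raise NotUnderstood(root.tag)
    holder = ET.Element("holder")
    _copy_into(holder, root, known, policy)
    return holder[0]


source = ET.fromstring("<p>a<blink>b<em>c</em>d</blink>e<secret>x</secret>f</p>")
result = process(source, known={"p", "em"},
                 policy={"blink": "unwrap", "secret": "ignore"})
print(ET.tostring(result, encoding="unicode"))
```

An element like a logical NOT has to fall under "stop": ignoring or unwrapping it would silently reverse the writer's intent.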

Languages which do not use XML should attend to these needs in their own way!

The meaning of a document - grounding in a global namespace

timbl@w3.org (Tim Berners-Lee) — Fri, 01 Jan 1999 00:00:00 GMT

Meaning

Grounding the meaning of a document in URI space.

What is the meaning of a document?

The meaning of a document on the Web can be defined more precisely than an arbitrary paper document. Because we have the benefit of a global namespace (URIs), things become possible which were not before. One example is global hypertext; another is the rigid (though rarely absolute) specification of meaning. Just as a hypertext document can now exactly point to another document when it makes a reference (instead of making some vague natural language reference to it), so can a formal document make a precise reference to the language it uses.

A writer of a document uses the language to convey his intent to the reader. It is essential that the intent of the writer can be well defined for both parties and in general for a third party.

By "language" here I mean the set of symbols, the syntactic rules which constrain their combination, and some semantics which are conveyed by defining their interpretation in one or more other formal languages, or in some natural language.

The meaning of a document is then the product of the text of the document (in some language) and the meaning of the language.

On the Web, important things are identified by URIs. This should clearly apply both to the document itself and to the language. The party which defines what a URI refers to I call the publisher, or owner of the URI. HTTP allows a delegated system of authority for ownership (DNS) to define ownership of URIs, and it also provides a network protocol to retrieve documents representing that identified by the URI. The text of a document is defined by its publisher and the meaning of the language is defined by the publisher of the language.

Natural languages are constantly evolving and rather vague, in that no one (except Scrabble players) uses a particular dictionary as a definitive set of meanings. In practice, the meaning of a word in a natural language is the sum of the associations of that word -- logical or poetic -- in the mind of the reader or writer. Of course society works on the basis of a very strong similarity of the webs of association in different people's minds.

In the semantic web, however, meaning is not vague: the idea is that languages must be defined formally and as precisely as possible. The semantic web consists of some "terminal" languages which are defined solely in natural language terms, and some languages for which there are machine-readable interpretations into other formal languages. Whereas programs processing documents in the first sort of language will typically have to be hand coded, documents in the second set may be processed automatically to convert them into languages in the first set.

URIs can be of various sorts, with various properties depending on their scheme (and, for http URIs, the publisher), but some URIs can be dereferenced to a definitive document. The document resulting from dereferencing the URI for a language is a place where the publisher of the language can put definitive information about the meaning of a language.

Language and document subsets

As languages evolve, there can be many languages which are similar. "Similarity" doesn't mean much, but something which is well defined is when a document in one language A can be treated precisely as though it had been in another language B.

Meaning in XML

In XML, a language is a "namespace", and the document about the language is called a "schema". In XML, one document can contain a mixture of languages, and so the schema if written in XML may contain information about syntactic constraints (in XML-schema language) and/or RDF properties (in rdf-schema language), or any combination of the above. (note)

XML puts no constraints on a language apart from syntactic structure. There is not (without RDF and logic or some other higher level) any overall framework into which new languages can be introduced. So, the question of what an XML document means depends first upon the fully qualified name of the document element. No semantics can be attached to any of its descendants in the document tree except in as much as is defined by the specification of that element type in that namespace. One cannot talk about the "meaning" of a subtree of a document without understanding the semantics of the language. In fact, because languages only necessarily define meaning for documents, the only way one can talk about the meaning of a subset of a document is to define how those parts of the document can be reassembled into a second whole document. This is what must be done when a digital signature is applied to a document.
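In practice "depends first upon the fully qualified name of the document element" means a processor looks at the root before anything else. A minimal Python sketch with ElementTree; the function name is an assumption, and what to do with the result (fetch a schema, pick a handler) is left out.

```python
import xml.etree.ElementTree as ET


def document_language(xml_text):
    """Return (namespace, local name) of the document element."""
    root = ET.fromstring(xml_text)
    # ElementTree writes qualified names as "{namespace}local".
    if root.tag.startswith("{"):
        namespace, local = root.tag[1:].split("}", 1)
    else:
        namespace, local = None, root.tag
    return namespace, local


print(document_language(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'
))
```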

The Meaning of Digital Signature

The language defines semantics. On the simple philosophy that one place is enough, it is not the place of a digital signature to define semantics. A digital signature on a document may give a party reason to use the information therein for purposes it would not have otherwise. The issuer of a public key may also put constraints on what sort of guarantees are made by signature with a given key. But the signature itself must not affect the semantics - the meaning - of a document. To allow it to would be to create an inconsistency between the intent of the writer of the original document and the meaning of the signed document. So, signatures themselves have no meaning. The meaning has to be ascribed to them by other documents. For example, I may say, "If an organization is a member of W3C according to a document signed with this key, then that organization is indeed a member". That is a trust statement which gives the key a connection into the world of meaning of documents.

Style as meaning

(Although few people would think of presentation style of a document as its "meaning", and many of us spend a lot of time emphasising the difference between style and content and semantics, in fact much of what applies to style applies to semantics. Therefore the "meaning in terms of presentation" is a good test case for the architecture of the system. (For many documentation systems, the only semantics required is "H2 means a big bold block on the left"!) Style sheets provide an "interpretation" of a document by mapping it onto another well-defined language of formatting properties. The style sheet language gives a good definition (in English) of what is needed. This is an interesting comparison, and I mention it as a place where architectural consistency should be maintained, but it isn't what I normally mean by "meaning".)

Logical meaning

When XML is used to encode logic, then a document is a formula and the (see Logic on the web). Then, the way new predicates and constants interact is defined by the logic. The way fundamental new parts of the language (such as quantification) are added is part of a more general question of how arbitrary languages interact. Examples we have seen are the mixing of XHTML and XSL. What is the result - XHTML or XSL? A document or a style sheet? Both?

Mixing Languages

XML puts no constraints on a language apart from syntactic structure. There is not (without for example RDF and logic) some overall framework into which new languages can be introduced. This means that every language has to define how it can be extended by mixing with other languages. Typically it will indicate the element types which can be subclassed by extensions and therefore incorporated into documents wherever that element type is allowed.

One particular example of such a type is common to almost all languages. This is the sentence, the fully qualified assertion or statement, the formula with no free variables. Almost all whole documents count as such, though an interesting counterexample is a style sheet, which represents a function: it specifies the result document as a function of an input document, and so itself cannot be said to be a stand-alone statement. (If I sent you a message consisting only of a stylesheet with no cover letter, what would it signify? What would it mean if I digitally signed it?)

With that exception, it clearly makes sense to allow any language which has the concept of a sentence -- maybe any language at all - to allow sentences from other languages to be included anywhere where a sentence of its own could go. This should be a generic feature of XML schemas.

(It would be against the minimalist principle for XML generically to define other common subclasses. Note that the RDF spec does define properties and node types and the concept of subclassing in RDF. HTML defines things like block and inline elements, which can be subclassed in extensions; SVG and SMIL probably define similar concepts. The significance of this when looking at downloaded support code would be that, for example, in a set of Java classes implementing HTML, any subclass of "Inline element" would export the same software API to allow it to be justified and line wrapped in a text flow object. So there is a natural correspondence between element type subclassing and support class subclassing, but the two must remain distinct. Language specifications must always define what a language means without referring to implementations if they can possibly avoid it.)

Note that without the assurances given by such information you cannot just go around embedding one language in another. Every language has to address the issue which the concept of RDF transparency potentially solves for RDF. A surrounding XML context must have the ability to quote, deny, negate or whatever any element. In fact, nothing in XML says that the meaning of a fragment is not affected by things anywhere else in a document. Nothing suggests that the process of removing sub-trees creates a valid document. (How does xml fragment deal with this?)

Grounded documents

We can say a document is "grounded" if its meaning is completely defined because every term used is, directly or indirectly, an explicit reference to its definition in a document on the Web. Clearly a definition of "grounding" depends on the set of documents one considers acceptable definitions. "Grounded in W3C Recommendations" would imply that the closure under [i.e. set of all the things you can possibly end up with by repeated applications of] the operation of looking up definitions would be a subset of the set of W3C Recommendations.
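The closure test can be written down directly. In this Python sketch the "looking up definitions" operation is modelled, as an assumption, by a plain dictionary from each document to the documents which define the terms it uses; on the real Web that step would be dereferencing URIs. All names and the example identifiers are hypothetical.

```python
def definition_closure(start, defined_in):
    """Everything reachable by repeatedly looking up definitions."""
    seen, stack = set(), list(start)
    while stack:
        doc = stack.pop()
        if doc in seen:
            continue  # definitions may be cyclic; visit each document once
        seen.add(doc)
        stack.extend(defined_in.get(doc, ()))
    return seen


def is_grounded(document, defined_in, acceptable):
    """True if the closure of the document's definitions is within `acceptable`."""
    reached = definition_closure(defined_in.get(document, ()), defined_in)
    return reached <= set(acceptable)


defined_in = {
    "invoice.xml": ["http://example.org/invoice-schema"],
    "http://example.org/invoice-schema": ["rec:xml", "rec:namespaces"],
    "rec:namespaces": ["rec:xml"],
}
recommendations = {"rec:xml", "rec:namespaces"}

print(is_grounded("invoice.xml", defined_in, recommendations))
print(is_grounded("invoice.xml", defined_in,
                  recommendations | {"http://example.org/invoice-schema"}))
```

The same document is grounded or not depending on which base set one accepts, which is the sense in which the definition of grounding is relative.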

This is the basis for the entire web and internet architecture stack today. (See also: Stack.) All commercial use on the web is largely to be considered in this light, that the meaning of each message sent across the Internet is well-defined by a series of specifications.

(A sense of grounding can also be applied separately to different sorts of "understanding". When "understanding" means presentation to a human for human understanding, a presentation-grounded document points to all information such as schemata and style sheets which will enable it to be presented.)

Grounding as a myth: the Web of Meaning

The concept of grounded documents is important for predictable systems, but it is a bad model for the web -- or for life -- in the long run. Words in a natural language such as English are not grounded in a unique base set*. Every time you look one up in the dictionary all you find are more words. The world is web-like, and any attempt by the Web to constrain it to be tree-like is bound to force a misrepresentation of reality. This is the Wittgenstein view of meaning. Understanding this view sometimes confuses people about the very systematic way in which meaning in Internet protocols is defined by layers and layers of specs.

In fact, the two views both apply, one nested inside the other. Yes, meaning is use - but in the Internet protocols, society has set up social constraints - laws and other expectations - which constrain use to be according to the specs. This is a social constraint which your computer is under when you use the Internet, just as when you fill out a tax form you don't have a choice as to how to interpret the meaning of "Adjusted Gross Income on line 39 of a US IRS form 1040". There is a whole department of the government which defines what it is and which socially owns the term. So while the

What will change with the Semantic Web's development is that its grounding in legacy systems will fade into history. Right now, the meaning of "Invoice total value" is effectively defined by the software which you plug your RDF document into, and how it treats invoices. This is an important way to bootstrap the semantic web with useful terms. That will become less important as many different software products share the same term. In the end, it is weblike form which will characterize the semantic web. Everyone will be defining things in terms of other things which they feel are useful and stable enough. It will be impossible to insist that there be a global ordering between more basic and less basic specifications -- and to do so would stop the web scaling. No one will agree on a directed acyclic graph determining what terms are "more basic" than others. For any set of definitions in one direction, there can always be some reverse definitions which can be seen by others as just as valid.

So, while the concept of documents grounded in a given base set is important for interoperability, it must not be seen as a goal to force the semantic web into an acyclic structure. There will be no single Dewey decimal system for the semantic web. The concepts of well-defined stable specifications will still be essential. So will respect for the definitions of terms. The difference will be that everyone will choose their own set of languages they consider "basic", and find ways of defining other languages they come across in terms of those. A rich web of conversions and translations will grow up to support this. The web of trust will provide tools for navigating within and selecting from this web in a safe way. And of course, global standards will always make life much easier where they can be made.

FAQ: Surely meaning is only defined by use?

This is all very well, runs a popular line, except that to talk about "meaning" at all is basically bogus. The meaning of words, and therefore languages, is defined by use - by how people actually respond to them, by how they are processed. Surely the only way I can guarantee that someone will interpret a document in a particular way is to have some out-of-band agreement with them first?

Philosophically, it is indeed the case that you need some out-of-band (not in the message itself) agreement. In real life, though, there are in fact a lot of widely-held agreements. In fact, the law is a set of agreements which you are deemed to accept whether you formally agree or not. So when you are sent a tax form, you can't argue that the language of the tax form is not one you interpret in that way. They just stick you in jail.

The web works like one big agreement. By connecting your computer to it and getting email from POP and IMAP ports, there is an understanding that what you get are MIME messages, and the same thing when you pick up a web page using HTTP. So by using the web you are entering a world where the assumption can be made that messages are to be interpreted by a set of specifications. The specifications are (currently) generally written in English, and imperfect, but basically debate about them is practically about details, not about the philosophy as to whether they apply. So that is why one can in practice talk about meaning.

FAQ: Doesn't the meaning of a document depend on its context?

Of course it does. If I enclose a photocopy of a document as an attachment, it doesn't mean I am sending you that letter.

However, there are a lot of contexts for a document which have the same implication for the meaning of that document. Publication, by email to a public list, or HTTP, or FTP, or printing on paper and nailing to a tree, in each case leaves the meaning of a document defined in the same way. These contexts, in which a document is published by a party, or a message conveyed from one party to another, are so common and basic that the meaning of the document in these contexts is referred to simply as the meaning of the document (or message).

The webarchitecture separately enumerates the ways in which these contexts actually work under he hood (publication using HTTP, etc) and teh way documents are interpreted and dealt with once published. That way, XML langauegs don't ahve to keep referring to "meaning when received with a 200 code in HTTP".

Metadata Architecture

timbl@w3.org (Tim Berners-Lee) — Mon, 06 Jan 1997 00:00:00 GMT

Metadata Architecture

Preface

This document was written before the Semantic Web Roadmap, but is an introduction to the same ideas. Both introduce the world of machine-readable data on the web. This document introduces the concepts in the historical sequence at W3C, where the first driving applications of the semantic web were metadata, and the first driving metadata applications were endorsement labels (PICS).

Documents, Metadata, and Links

The thing which you get when you follow a link, when you de-reference a URI, has a lot of names. Formally we call it a resource. Sometimes it is referred to as a document because many of the things currently on the Web are human readable documents. Sometimes it is referred to as an object when the object is something which is more machine readable in nature or has hidden state. I will use the words document and resource interchangeably in what follows and sometimes may slip into using "object".

One of the characteristics of the World Wide Web is that resources, when you retrieve them, do not stand simply by themselves without explanation, but there is information about the resource. Information about information is generally known as Metadata. Specifically, in the web design,

Definition

Metadata is machine understandable information about web resources or other things

The phrase "machine understandable" is key. We are talking here about information which software agents can use in order to make life easier for us, ensure we obey our principles, the law, check that we can trust what we are doing, and make everything work more smoothly and rapidly. Metadata has well defined semantics and structure.

Metadata was called "Metadata" because it started life, and is currently still chiefly, information about web resources, so data about data. In the future, when the metadata languages and engines are more developed, it should also form a strong basis for a web of machine understandable information about anything: about the people, things, concepts and ideas. We keep this fact in our minds in the design, even though the first step is to make a system for information about information.

For an example of metadata, when an object is retrieved using the HTTP protocol, the protocol allows information about its date, its expiry date, its owner, and other arbitrary information to be sent by the server. The world of the World Wide Web is therefore a world of information and some of that information is information about information. In order to have a coherent picture of this, we need a few axioms about metadata. The first axiom is that:

Axiom

metadata is data.

That is to say, information about information is to be counted in all respects as information. There are various parts of this.

One is that metadata can be regarded as data: it can be stored in a resource. So, one resource may contain information about itself or about another resource. In current practice on the World Wide Web there are three ways in which one gets metadata. The first is the data about a document contained within the document itself, for example in the HEAD part of an HTML document or within word processor documents. The second is that during the HTTP transfer the server transfers some metadata to the client about the object which is being transferred. This, during an HTTP GET, is transferred from the server to the client and, during a PUT or a POST, is transferred from the client to the server. One of the things which we have to rationalize in our architecture of the World Wide Web is who exactly is making the statement. Whose statement, whose property is that metadata? The third way in which metadata is found is when it is looked up in another document. This practice was not very common until the PICS initiative set out to define label formats specifically for representing information about World Wide Web resources. The PICS architecture specifically allows for PICS labels which are resources about other resources to be buried within the resource itself, to be retrieved as separate resources, or to be passed over during the HTTP transaction. To conclude,

Metadata about one document can occur within the document, or within a separate document, or it may be transferred accompanying the document.

Put another way, metadata can be a first class object.

The second part of the above axiom is:

Metadata can describe metadata

That is, metadata itself may have attributes such as ownership and an expiry date, and so there is meta-metadata but we don't distinguish many levels, we just say that metadata is data and that from that it follows that it can have other data about itself. This gives the Web a certain consistency.

The Form of Metadata

Metadata consists of assertions about data, and such assertions typically, when represented in computer systems, take the form of a name or type of assertion and a set of parameters, just as in the natural language a sentence takes the form of a verb and a subject, an object and various clauses.

Axiom

The architecture is of metadata represented as a set of independent assertions.

This model implies that in general, two assertions about the same resource can stand alone and independently. When they are grouped together in one place, the combined assertion is simply the sum (actually the logical AND) of the independent ones. Therefore (because AND is commutative) collections of assertions are essentially unordered sets. This design decision rules out for example, in simple sets of data, assertions which are somehow cumulative or later ones override earlier ones. Each assertion stands independently of others.
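The unordered-AND model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout (assertion type, subject URI, value) and the example.org URI are invented here and do not come from any specification.

```python
# Hypothetical model: an assertion is an immutable tuple
# (assertion_type, subject_uri, value). All names and URIs are illustrative.
a1 = ("author", "http://example.org/doc", "Alice")
a2 = ("expires", "http://example.org/doc", "1997-12-31")

# A collection of assertions is an unordered set. Grouping two collections
# is set union, which plays the role of the logical AND.
group_one = frozenset([a1, a2])
group_two = frozenset([a2, a1])          # same assertions, different order

# Repeating an assertion adds nothing, and nothing overrides anything.
combined = group_one | frozenset([a1])


def holds(assertion, collection):
    """A collection asserts exactly the assertions that are its members."""
    return assertion in collection
```

Because the collections are sets, there is no way to express "the later assertion wins", which is exactly the restriction the axiom imposes.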

We will see below how logical expressions are formed to combine assertions in more varied ways, and syntactic rules which allow the subject at least of the assertion to be made implicit. But neither of these change the basic operation of combining assertions in unordered AND lists.

Attributes

Assertions about resources are often referred to as attributes of the resource. That is, the type of assertion is an assertion that the object, the resource in question, has a particular named property such as its author, and in that case the parameter is the name or identity of the author. Similarly, if the attribute is the document's date of expiry then the parameter is that date.

Often, a group of assertions about the same resource occur together, in which case the syntax generally omits the URI of that resource as it is implicit. In these cases, when it is clear from the context about which resource the assertion is being made, the assertion often takes the form of a list of attributes and values. In RFC822 format messages, such as mail messages and HTTP messages, metadata is transferred where the attribute name is an RFC822 header name and the rest of the RFC822 line is the value of the attribute, such as Date: and From: and To: information. The attribute value pair model is that used by most activities defining the semantics of metadata today.
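As a rough illustration of the attribute-value reading of RFC822 headers, the sketch below uses Python's standard email package to pull the header lines out of a message as (name, value) pairs. The message text, the addresses and the subject URI are made up for the example; the subject is implicit in the message and is supplied here by hand.

```python
from email import message_from_string

# A made-up RFC822-style message; addresses and date are illustrative.
raw = (
    "Date: Mon, 06 Jan 1997 00:00:00 GMT\n"
    "From: alice@example.org\n"
    "To: bob@example.org\n"
    "\n"
    "Body text.\n"
)

message = message_from_string(raw)

# Each header line is one attribute-value pair about the implicit subject.
pairs = message.items()

# Making the implicit subject explicit turns each pair into a full assertion.
# The URI below is a hypothetical identifier for this message.
subject = "http://example.org/messages/1"
assertions = [(name, subject, value) for name, value in pairs]
```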

I use the word "assertion" to emphasize the fact that the attribute value pair when it is transferred is a statement made by some party. It does not simply and directly imply that the resource at any given time has that value for the given attribute. It must be seen as a statement by a particular party with or without implicit or explicit guarantees as to validity. Throughout the World Wide Web, as trust becomes an important issue, it will be important for software -- and people -- to keep track of and take into account who said what in terms of data and metadata. So, our model of data of a resource is something about which typically we know the creator or the person responsible, and typically the date on which the information was created, which implies, in the case of a piece of information which makes an assertion, the date at which the assertion was made.

An assertion

(A u1, p, q...)

typically has as explicit parameters,

the URI of the resource about which the assertion is made (u1).

some identifier (A) for the type of assertion being made, such as author or date or expiry date.

other parameters (p, q,...) according to the type of assertion.

As implicit or explicit parameters,

The party making the assertion

The date/time of the assertion

etc...
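One possible way to hold such an assertion in a program is sketched below. The class and field names are hypothetical choices made for this illustration; the text above only fixes the roles (A, u1, p, q..., the asserting party, the date).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class Assertion:
    """One assertion (A u1, p, q...). Field names are illustrative only."""
    kind: str                                # A: the type of assertion
    subject: str                             # u1: URI of the resource
    params: Tuple[str, ...] = ()             # p, q, ...: depend on the type
    asserted_by: Optional[str] = None        # party; None when left implicit
    asserted_at: Optional[datetime] = None   # date/time; None when implicit


# An assertion with the party and date given explicitly.
claim = Assertion(
    kind="author",
    subject="http://example.org/doc",
    params=("Alice",),
    asserted_by="http://example.org/publisher",
    asserted_at=datetime(1997, 1, 6, tzinfo=timezone.utc),
)

# The same shape with party and date left to the context.
bare = Assertion("expires", "http://example.org/doc", ("1997-12-31",))
```

Keeping the party and date on the assertion itself, rather than on the resource, reflects the point that an attribute value is a statement by someone and not a direct property of the resource.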

We can often make an analogy with programming languages. An assertion in metadata can be compared with a function call in a programming language. In object oriented languages, the object of the function has a special place among the parameters just as the subject of an assertion does in metadata. In object oriented languages, though, the set of possible functions depends on the object, whereas in metadata the set of assertion types is more or less unlimited, defined by independent choice of vocabulary. Anyone can say anything about anything.

A space for attribute names

It is appropriate for the Web architecture to define like this the topology and the general concepts of links and metadata. What about the significance of individual relationships? Sometimes, as above, these are special, defined in the architecture, and having an architectural significance or a significance to the protocols. In other cases, the significance of relationships or indeed of attributes is part of other specifications, other design, or other applications, and must be defined easily by third parties. Therefore, the set of such relationship and attribute names must be extremely easily extensible and therefore extensible in a decentralized manner. This is why

the URL space is an appropriate space for the definition of attribute names.

We have already (1997) several vocabularies of attribute names: for example, the HTML elements which can occur within the HEAD element, or as another example, the headers in an HTTP request which specify attributes of the object. These are defined within the scope of particular specifications. There is always pressure to extend these specifications in a flexible way. HTTP header names are generally extended arbitrarily by those doing experiments. The same can also be true of HTML elements and extension mechanisms have been proposed for both. If we look generically at the very wide space of all such metadata attribute names, we find something in which the dictionary would be so large that ad hoc arbitrary extension would be just as chaotic as central registration would be stifling.
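A small sketch of why URL space suits attribute names: two parties can each coin a "title" attribute without registering it anywhere and without clashing. Both vocabulary URIs below are invented for the illustration.

```python
# Hypothetical vocabularies owned by two independent parties.
LIBRARY = "http://library.example.org/terms#"
SHOP = "http://shop.example.com/vocab#"

# Attribute names are full URIs, so the two "title" attributes are distinct
# names even though their local parts are spelled the same.
metadata = {
    LIBRARY + "title": "An Example Document",
    SHOP + "title": "Paperback, first printing",
    SHOP + "coolness": "high",   # a property coined by the shop alone
}


def owner_of(attribute_name):
    """The part before '#' identifies whose vocabulary a name belongs to."""
    return attribute_name.split("#", 1)[0]
```

Extension is decentralized because coining a new name only requires control of some part of URL space, not a change to a shared dictionary.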

Aside: Comparison with Entity-Relationship models.
This architecture, in which the assertion identifier is taken from (basically) URL space differs from the "Entity-relationship" (ER) model and many similar models like it, including most object-oriented programming systems. In an ER model, typically every object is typed and the type of an object defines the attributes it can have, and therefore the assertions which are being made about it. Once a person is defined as having a name, address and phone number, then the schema has to be altered or a new derived type of person must be introduced before one can make assertions about the race, color or credit card number of a person. The scope of the attribute name is the entity type, just as in OOP the scope of a method name is an object type (or interface). By contrast, in the web, the hypertext link allows statements of new forms to be made about any object, even though (before anything other than syntax checking) this may lead to nonsense or paradox. One can define a property "coolness" within one's own part of the web, and then make statements about the "coolness" of any object on the web.

This design difference is in essence a resurfacing of the decision to make links monodirectional, sacrificing consistency for scalability.

An advantage of ER systems is that they allow one to work, in the user interface for example, with a set of properties which "should" be defined for each entity. You can define these in the Metadata's predicate calculus by defining an expression for a "well specified" object. ("For all X such that X is a customer, X is well-specified if there exists n such that n is the name of X and there exists t such that t is the telephone number of X and...")
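The "well specified" expression can be sketched as an ordinary predicate over a collection of assertions. The tuple layout (type, subject, value), the attribute names "name" and "telephone", and the example URIs are assumptions made for this illustration.

```python
def is_well_specified(x, assertions):
    """X is well-specified if some name and some telephone number of X exist.

    `assertions` is an iterable of (kind, subject, value) tuples; this
    layout is an assumption of the sketch, not a defined format.
    """
    def exists(kind):
        return any(k == kind and subject == x
                   for k, subject, _value in assertions)

    return exists("name") and exists("telephone")


# Illustrative data: one complete customer, one with no telephone number.
facts = [
    ("name", "http://example.org/customers/1", "Alice"),
    ("telephone", "http://example.org/customers/1", "+1 555 0100"),
    ("name", "http://example.org/customers/2", "Bob"),
    ("coolness", "http://example.org/customers/2", "high"),
]
```

Nothing stops extra assertions such as "coolness" being made about a customer; the predicate only checks that the expected ones are present.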

end of aside.

Metadata ("Entity") headers in HTTP

In the above it is important to realize that the HTTP headers which contain what can be considered as metadata ("entity headers") should be separated quite distinctly from HTTP headers which do not. HTTP headers which contain metadata contain information which can follow the document around. For example, it is reasonable for a cache to pass such information on without treatment, it is reasonable for clients or other programs which process data to store those headers as metadata with the document for later processing. The content of those headers does not have to be associated with that particular HTTP transaction. By contrast, the RFC822 headers in HTTP which deal specifically with the transaction or deal specifically with the TCP link between the two application programs have a shorter scope and can only be regarded as parameters of the HTTP method. Making this separation clear will not only make it easier to understand HTTP and how it should be processed, it will also make it clear which pieces of HTTP can be used easily and transparently by other protocols which may use different methods with different parameters. The clarification of the architecture of HTTP such that both the metadata and the methods can be extended into other domains is an important part of the work of the World Wide Web Consortium. The Internet protocols SMTP and NNTP and HTTP as well as many new and proposed protocols share much of the semantics of the RFC822 headers. Formalizing the shared space and making it clear that there is a single design for a particular header, rather than four designs which are independent and happen to look very similar, requires a general architecture, some careful thought, and is essential for the future design of protocols. It will allow protocol design to happen in small groups which can take for granted the bulk of previous work and concentrate on independent new design.
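The separation can be sketched as a simple partition of a header set. The list of entity header names below is a partial one chosen for illustration, following the entity header fields of HTTP/1.1; a real implementation would need the complete list for the HTTP version in use, and the function name is hypothetical.

```python
# Assumption: a partial, illustrative list of entity (metadata) header names.
ENTITY_HEADERS = frozenset({
    "content-type",
    "content-language",
    "content-encoding",
    "expires",
    "last-modified",
})


def split_headers(headers):
    """Partition headers into (metadata, transaction) dictionaries.

    Metadata headers can be stored with the document and passed on by a
    cache; the rest only make sense within this one HTTP transaction.
    """
    metadata = {}
    transaction = {}
    for name, value in headers.items():
        if name.lower() in ENTITY_HEADERS:
            metadata[name] = value
        else:
            transaction[name] = value
    return metadata, transaction


# Made-up response headers for the example.
response_headers = {
    "Content-Type": "text/html",
    "Last-Modified": "Mon, 06 Jan 1997 00:00:00 GMT",
    "Expires": "Tue, 07 Jan 1997 00:00:00 GMT",
    "Connection": "close",
    "Transfer-Encoding": "chunked",
}
metadata, transaction = split_headers(response_headers)
```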

Authorship of HTTP entity headers

It may be possible to remove or at least encompass the apparent anomaly of metadata transferred from an HTTP server by creating a special link type which links the document itself to the set of attributes which the server would give in the HTTP headers. In other words, the server would be able to say, "here is a document, here is some metadata about it, and the metadata about it has the following URL". This would allow one, for example, to request a signed copy of the HTTP headers. It would allow one to ask about the intellectual property rights of those headers, and the authorship of those headers.

It is important to be completely clear about the authorship of the HTTP headers. The server should be seen as a software agent acting on behalf of a party which is the publisher or document author: the definer of the URI to resource identity mapping. The webmaster is only an administrator who is responsible for ensuring that (through an appropriately configured server) the transactions on the wire faithfully represent the statements and wishes of that party.

Links

An assertion of relationship between two resources is known as a link.

In this case, it is a triple

(A u1 u2)

of:

the type of assertion being made, that is, the relationship which is being asserted,

the first URI,

and the second URI.

These sorts of assertions, links, are the basis of navigation in the World Wide Web; they can be used for building structure within the World Wide Web and also for creating a semantic Web which can express knowledge about the world itself. That is to say, links may be used both for the structure of data, in which case they are metadata, but also they may be used as a form of data.
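A link as a triple (A u1 u2) maps directly onto a small record type, and navigation becomes a lookup over a collection of such records. The relation names and URIs below are invented for the illustration.

```python
from collections import namedtuple

# (A u1 u2): the relationship asserted, the first URI, the second URI.
Link = namedtuple("Link", ["relation", "source", "target"])

# Illustrative links; relation names and URIs are hypothetical.
links = [
    Link("previous-version", "http://example.org/doc-v2", "http://example.org/doc-v1"),
    Link("license", "http://example.org/doc-v2", "http://example.org/terms"),
    Link("previous-version", "http://example.org/doc-v3", "http://example.org/doc-v2"),
]


def follow(links, source, relation):
    """Return every target reachable from `source` by links of one type."""
    return [link.target for link in links
            if link.source == source and link.relation == relation]
```

The same list serves both roles described above: it is metadata when read as structure around the documents, and data when the relationships themselves are what one wants to query.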

Links, like all metadata can be transferred in three ways. They can be embedded in a document, which is one end of the link, they can be transferred in an HTTP message, for example what is called the header of the document, and they can be stored in a third document. This latter method has not been used widely on the World Wide Web to date.

Goal: Self-describing information

A critical part of the design of the whole system is the way that the semantics of metadata or indeed of data are defined. The semantics of metadata in our RFC822 headers in mail messages and in HTTP messages are defined by hand in English in the specifications of those protocols. The PICS system takes this one stage further in terms of flexibility by allowing a message to contain a pointer to the document which defines, in human readable terms, the semantics of each assertion made within a PICS label. In the future we would like to move toward a state in which any metadata or eventually any form of machine readable data carries a reference to the specification of the semantics of all the assertions made within it.

For example, suppose that when a link is defined between two documents, the relationship which is being asserted is defined in such a way that it can be looked up on the World Wide Web (i.e. using some form of URI). Then someone or some program which has not come across that relationship before can follow the link and extend its understanding or functionality to take advantage of this new form of assertion.

In the case of PICS, one can dynamically pick up a human readable definition of what that assertion really means. In PICS (and in theory in SGML using DTDs), one can also pick up a machine readable definition of what form that assertion can take, what syntax, what types of parameters it can take. This allows a human interface to a new PICS scheme to be built on the fly. To go one step further, one could, given a suitable logic or knowledge representation language, pick up a machine readable definition of the semantics of that assertion in terms of other relationships.

The advantage of such self-describing information is that it allows development of new applications and new functionality independently by many groups across the web. Without self-describing information, development must wait for large companies or standards committees to meet and agree on the commonly agreed semantics.

Of course a pragmatic way of extending software to handle new forms of information is to dynamically download the code to support a software object which can handle such data for one. Whereas this is a powerful technique, and one which will be used increasingly, it is not sufficient. It is not sufficient because one has to trust the implementation of the object, and the state.

Goal

As much as possible of the syntax and semantics should be able to be acquired by reference from a metadata document.

Building Applications using Link Relationships

It turns out that a very large number of applications both built on top of the web and also built within the infrastructure of the Web can largely be built by defining new relationship types. Examples of these are the document versioning problem which can be largely solved by defining link values relating documents to previous and future versions and to lists of versions; intellectual property rights, distribution terms, and other labeling which can be solved by making a link from one document to the document containing the metadata.

The Web Model: Information hiding and URI syntax

timbl@w3.org (Tim Berners-Lee) — Thu, 29 Jan 1998 00:00:00 GMT

The Web Model

The web is a very general concept -- one universal space of information. The concepts it requires such as identifiers and information resources (documents) are as general and abstract as possible. However, there have been some design decisions made which define some interfaces, and effectively define modules or agents which are independent. These agents are independent in many ways:

There is knowledge they have individually but do not share

There is knowledge their designers had individually but did not share

This is basic modularity. The interfaces are defined by the data formats and protocols, and the important features to understand about the design I have ranted about in the linked articles in this series. This modularity, the ability of different parts of the system to change independently, shows up when different specs are independent, such that you could change one without having to change the other.

The Information Resource

(Formerly, Resource)

This is the current term for a certain unit of information in the Web. In many cases on the current Web, thinking "document" will do. It is something which conveys information. The Web model is that information in the information space is in the abstract chunked into addressable things known as resources.

In the technical architecture, resources have identifiers, Universal Resource Identifiers, and the properties of these identifiers are elaborated later. In fact the concept of a unit of information is central, not only in the technical architecture, but in society's concepts of information, as a document is not only the unit for reference, retrieval and presentation (typically), but also the unit of ownership, license to use, payment, confidentiality, endorsement, etc. So though technically we can derive such things as compound documents, generic documents, and resources which look like anything but the typical notion of a "document", we have to be able to support these social aspects of information at the same time, so we can't mess with it too much.

Fragment Id and "#"

In the hypertext architecture, when making a reference, such as a hypertext link, we don't just refer to an information resource. Well, we can, but we can also refer to a particular part of or view of a resource. The string which, within the document, defines the other end of the link has two parts. It has the identifier of the document as a whole, and then optionally it has a hash sign "#" and a string representing the view of the object required. This suffix is called a fragment identifier. (Even though it doesn't necessarily represent a fragment of the document: it could represent how the document should be viewed.) The fragment identifier only has relevance in the context of the web page in question. This has an implication for how the software is built. For example, an "access" module can be given just the bit of the URI without the fragment identifier. It gets the information, and creates a software object for the hypertext page. That object is passed the fragment identifier.

In fact, analyzing the system a little more, the access function can be broken into the underlying access which creates the object by passing two things to some kind of object creator ("factory"): a data stream and a MIME type.
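The division of labour described above can be sketched as follows. Everything except urllib.parse.urldefrag is hypothetical: the class, the factory table and the canned access function stand in for real modules, and no network access takes place.

```python
from urllib.parse import urldefrag

seen_by_access = []   # records what the access module was actually given


def access(uri):
    """Stand-in access module: returns a (data stream, MIME type) pair.

    It is handed the URI without the fragment identifier. The content is
    canned for the example instead of being fetched.
    """
    seen_by_access.append(uri)
    return b"<p id='intro'>Hello</p>", "text/html"


class HypertextPage:
    """Presentation object: it has the data, not the URI it came from."""

    def __init__(self, data):
        self.data = data

    def show(self, fragment):
        # Only this object knows how to interpret the fragment identifier.
        if fragment:
            return "showing view %r" % fragment
        return "showing whole document"


# The object creator ("factory") is chosen by MIME type alone.
FACTORIES = {"text/html": HypertextPage}


def dereference(reference):
    uri, fragment = urldefrag(reference)    # split at the "#"
    data, mime_type = access(uri)           # access never sees the fragment
    page = FACTORIES[mime_type](data)       # create the object
    return page.show(fragment)              # hand the fragment to the object
```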

Generally

Hypertext is a specific application, but this principle works for other applications on the Web. In fact, when we discuss webizing an application, we take some computer language, and we take what were document-global things, say global variables in a programming language, and make them truly global by appending the URI of the document and "#".
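As a minimal sketch of webizing, the function below takes names that were only unique within one document and makes them global by prefixing the document's URI and "#". The function name and the example URI are made up for the illustration.

```python
def webize(document_uri, local_names):
    """Map each document-global name to a truly global identifier."""
    return {name: document_uri + "#" + name for name in local_names}


# Two documents can both define "total" without any clash once webized.
first = webize("http://example.org/accounts.py", ["total", "rate"])
second = webize("http://example.net/stats.py", ["total"])
```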

Clearly, in different applications the fragment identifier will have a completely different function. The independence here means that new applications (such as the Semantic Web) can be built, just like the hypertext web, just by introducing new types of document.

Independence

The model of how the web works is that there are two separate functions. The part (blue in the picture) which accesses the document deals with its identifier, but does not know what view will be required. It creates some software object which represents and presents the resource. That object does not need to know how it was created (necessarily), and so does not need to know the URI it was identified by. However, it does know how to interpret the Fragment ID.

So we have two axioms:

The access machinery does not need to look at the fragment ID.

The presentation object does not need to know the URI of the resource.

The equivalent axioms when we are talking about specifications amount to:

The specifications for access protocols are independent of the specifications for fragment identifiers.

Why?

For one thing, consider the special case of a link within a document. In this case, the link only specifies a fragment identifier. The object can follow the link itself. It doesn't have to consult the access code in order to figure out where the link goes to. Because the "#" syntax is universal to all access methods, the object can process the link internally. For a static HTML file, for example, this means that you can write an HTML file with internal links without worrying or knowing about exactly what URIs the file will get. It means you don't have to alter the file if you choose to serve it in some new name or address space. If the "#" syntax was not a universal specification for the web, this would break: you couldn't do it. As Jim Gettys points out, as the era of digitally signed documents comes upon us, changing a signed document will break the signature on it. So allowing one to make a self-consistent document with internal links in a way independent of the namespace is even more essential.

Why else?

This independence is very important for the evolution of the Web. It means that people can go off and design all kinds of new systems for naming, addressing and accessing documents, without having to worry about what sort of documents will be moved. It means that people can go off and make new media types (MIME types), each of which can have different concepts for views and fragments, without having to talk to the people developing the access technology. This has already (1998) proved incredibly enabling to the community, as HTTP has advanced in parallel with many other ways of accessing data, and the number of exciting media types has grown very rapidly, and will be the key to many new revolutions built on top of the basic Web idea.

If you look at the diagram you will notice how the fragment IDs are generated by and understood by just the one module. You see how, when designing a new MIME type, one is quite free to be creative in making new and powerful forms of fragment ID, knowing that no other specifications will refer to them, and nothing else will break.

Document sets and relative addressing

Now let us look at what happens when we follow a link. For example, say a hypertext page is clicked on. The page has a representation of the end point of the link. It hands it to the application. In fact, often, there are links between pages whose URIs are very similar and only differ in the right hand part. This isn't true of all name spaces: for example, when making links between news articles identified by their unique news ID (news:foo), you have to specify the whole thing. However, if you restrict publication of a set of documents to a hierarchical name or address space, then you can arrange for documents which are very related and have many links to be in the same part of the tree.

In this case, the links between these documents are "relative URIs".

What happens then is that the relative URI, which only has the locally different part of the URI in it, is handed back to what in the diagram I have called the "application", to be turned into an absolute URI by being combined with the absolute URI of the resource, which the application has remembered.

Note that the application is aware of the absolute URI but still the resource does not have to be.

Note that the fragment id is still circulated around a loop between the object (green) which understands it and the application (yellow) which handles it transparently but does not understand or change it.

Now there was a design decision that the application could have passed to the access module both the relative URI and the absolute URI. Then, different namespaces would have been able to have different algorithms for resolving a base URI and a relative URI into a new absolute URI. But the decision was made that the relative address format should be common across all name spaces.

Why?

Just as we considered internal links above, now consider relative links between a bunch of documents, like the sections of a book, which are close in the tree. In practice, such document sets are moved from place to place, from file systems into HTTP space or FTP space, and because the relative address rules are universal, the documents do not have to be modified every time they are moved. (Yes, if you move half the set to one place and half to another, you have to fix links). This is happening all the time. People are creating and programs are generating hypertext with relative links without knowing or caring what absolute URI will be used to refer to the material.
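The effect of a common relative address format can be seen with urllib.parse.urljoin from the Python standard library, which combines a base URI with a relative one. The host names and paths are invented for the example. Note one caveat of this particular library: it only resolves relative references for schemes it knows to be hierarchical, which includes the http and ftp schemes used here.

```python
from urllib.parse import urljoin

# The link as written inside the document: relative, with a fragment.
relative_link = "chapter2.html#summary"

# The same document set published in two different spaces (made-up URIs).
http_base = "http://example.org/book/chapter1.html"
ftp_base = "ftp://ftp.example.net/pub/book/chapter1.html"

# The application combines the remembered absolute URI with the relative one.
http_target = urljoin(http_base, relative_link)
# http://example.org/book/chapter2.html#summary
ftp_target = urljoin(ftp_base, relative_link)
# ftp://ftp.example.net/pub/book/chapter2.html#summary

# Links up the tree resolve by the same rules.
index = urljoin("http://example.org/book/part1/ch1.html", "../index.html")
# http://example.org/book/index.html
```

The document text is identical in both cases; only the base URI remembered by the application differs, and the fragment identifier passes through unchanged.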

The access scheme

The so-called "access scheme" is the first part of the URI. As we have seen above, you don't have to know anything about it to parse relative URIs or to process the fragment identifier of a URI. The knowledge of particular schemes is limited to the "access" function (blue in the above diagram).

The scheme is a very important flexibility point, and should not be abused. Anyone dereferencing a URI must have a knowledge of the scheme it uses.
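One way to picture the scheme as the single place where scheme-specific knowledge lives is a dispatch table keyed by scheme. The handler functions here are placeholders that only return a description of what they would do; the table and function names are hypothetical.

```python
from urllib.parse import urlsplit


def fetch_http(uri):
    # Placeholder: a real handler would speak HTTP here.
    return "would fetch %s over HTTP" % uri


def fetch_ftp(uri):
    # Placeholder: a real handler would speak FTP here.
    return "would fetch %s over FTP" % uri


# All knowledge of particular schemes is confined to this table.
ACCESS_HANDLERS = {"http": fetch_http, "ftp": fetch_ftp}


def access(uri):
    """Dereference a URI by handing it to the handler for its scheme."""
    scheme = urlsplit(uri).scheme
    try:
        handler = ACCESS_HANDLERS[scheme]
    except KeyError:
        # Without knowledge of the scheme there is nothing one can do.
        raise ValueError("no access handler for scheme %r" % scheme) from None
    return handler(uri)
```

Adding a scheme means adding an entry to the table and deploying the code behind it, which is why each new scheme carries a real cost for every piece of software that dereferences URIs.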

The access scheme defines a huge part of URI space. The scheme defines a subspace with particular properties.

The access scheme is by definition the highest point of flexibility. What does that mean? It means that if the whole Web develops problems which we cannot solve within the existing protocols, or if new spaces are designed which really can't be accessed through or mapped into existing spaces, then we can create a new space. We have faith that we will be able to use this flexibility point in the future, because it worked successfully for integrating the older spaces such as Gopher and FTP spaces into the Web.

If you have ported a concept between environments in the past, then there is a better hope that you can in the future.

The danger of too many access schemes

However, we do not do this lightly. When we introduce a new space, it may have very different properties and we expect that the deployment of new software will be needed to allow access to it. Some spaces may be gatewayable into HTTP space, and this will often provide a transition path. This is why early browsers allowed one to declare in a configuration file what gateways to use for what new spaces.

If we use this extension point frivolously, ironically, it will cease to work. Suppose very many schemes are introduced. The access scheme space itself becomes a namespace with all the problems which current namespaces such as DNS are trying to solve, but which are very hard problems:

Clashes in the namespace would destroy interoperability;

Ownership of the space becomes commercially valuable;

Democratic and fair management becomes essential and difficult;

Worse, though, technology will be needed to automatically dereference the schemes themselves and download code to handle them. Something like DNS will be needed. The top level namespace then becomes in fact DNS, or something like it. This, however, begs the question. What happens if later DNS needs to be replaced? There is no top-level extension switch left. The world is stuck with whatever form of access-scheme name service exists.

Therefore, I conclude that access schemes should not be open to trivial extension, and that the access scheme should only be extended by the introduction of new standards with full open review by the entire community.

Alternatives to new schemes

Whereas some schemes (like "data:") are clearly neat and new and orthogonal to HTTP, many schemes could in fact be integrated into HTTP, using HTTP extension mechanisms.

In fact, if HTTP is to be taken as a general computing protocol, then use of an extensible language system for the HTTP request message would allow a huge amount of extension, covering protocols with different functionality (exporting different interfaces).

Evolving scheme spaces

When considering the evolution of a space, it is important to remember that primarily the access scheme refers to a part of the URI space, and secondarily it refers to a protocol. Therefore, one can in fact change the protocols used to access resources within a scheme's namespace, without changing the space. For example, a new DNS protocol could be introduced which over time would replace the current one, without changing the DNS space. This would effectively redefine the HTTP and FTP protocols, but would not harm the namespaces. When touch-tone dialing was introduced, the telephone numbering system remained the same. So an indexing system could be introduced which, when deployed, would allow http:// space objects to be found with greater reliability or speed than the current protocols, while maintaining the HTTP space as being the concatenation of a DNS name and an opaque string.
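
The idea that the HTTP space is just a DNS name followed by an opaque string can be sketched in a few lines of Python. This is only an illustration: the helper name `split_http_space` and the example URI are assumptions, and the only library call used is the standard `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit

def split_http_space(uri):
    """Split an http: URI into (scheme, DNS name, opaque string).

    Hypothetical helper: the scheme names the space, the host part is
    whatever DNS (or its successor) resolves, and everything after it
    is treated as an opaque string owned by that host.
    """
    parts = urlsplit(uri)
    opaque = parts.path
    if parts.query:
        opaque += "?" + parts.query
    return parts.scheme, parts.hostname, opaque

# The example URI is made up for illustration.
scheme, dns_name, opaque = split_http_space("http://www.example.org/a/b?x=1")
print(scheme, dns_name, opaque)
```

Nothing in this decomposition mentions how the name is looked up or how the bytes are fetched, which is why the lookup protocol could in principle be replaced without changing the identifiers.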

Modularity

timbl@w3.org (Tim Berners-Lee) — Mon, 01 Oct 2007 00:00:00 GMT

Simple things make firm foundations

You can look at the development of web technology in many ways, but one way is as a major software project. In software projects, I have always felt that the independence of specs is really important. A classic example is the independence of the HTTP and HTML specifications: you can introduce many forms of new markup language to the web through the MIME Content-Type system, without changing HTTP at all.
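
One way to picture this independence is a client that chooses a handler purely from the Content-Type of a response. The following is a minimal sketch under stated assumptions: the handler functions, the `HANDLERS` table and `dispatch` are hypothetical names, and no real HTTP library is involved.

```python
def render_html(body):
    # Stand-in for a real HTML renderer (hypothetical).
    return "HTML document, %d bytes" % len(body)

def render_svg(body):
    # Stand-in for a real SVG renderer (hypothetical).
    return "SVG image, %d bytes" % len(body)

# The transfer layer only needs this table; it knows nothing about markup.
HANDLERS = {"text/html": render_html}

def dispatch(content_type, body):
    """Pick a handler from the Content-Type value, ignoring parameters."""
    media_type = content_type.split(";")[0].strip().lower()
    handler = HANDLERS.get(media_type)
    if handler is None:
        return "no handler for " + media_type
    return handler(body)

# Introducing a new format touches only the table, not the dispatch logic.
HANDLERS["image/svg+xml"] = render_svg

print(dispatch("text/html; charset=utf-8", b"<p>hi</p>"))
print(dispatch("image/svg+xml", b"<svg/>"))
```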

The modularity of HTML itself has been discussed recently, for example by Ian Hickson, co-Editor of HTML5:

Note that it really isn't that easy. For example, the HTML parsing rules are deeply integrated with the handling of

Medium	Post format	Dominant platform	Response actions
Text blog	HTML	--	Blog comments
Photo	JPEG	Instagram	Like, Comment
Audio podcast	MP3	--	--
Video podcast	MP4?	YouTube	Comment
Movie	IMDB-RDF	Netflix, Green Tomatoes > Media Kraken	Rating (GT)
Book	LoC RDF?	Amazon	5 Star rating
Fitness	GPX	Strava, Fitbit etc	Kudos, comment

Scale	1	10	1000	10k	100k	1M	10M	100M	1G
Group	You	family, group	...	...	town?	city?	country?	USA	World population
Time spent	?	?	?	?	?	?	?	?	?
Money spent	?	?	?	?	?	?	?	?	?
etc	?	?	?	?	?	?	?	?	?

Scale	Eg	Committee size	Cost per ontology (weeks)	Cost for me
0	Me	1	1	1.000000
10	My team	4	16	1.600000
100	Group	7	49	0.490000
1000		10	100	0.100000
10k	Enterprise	13	169	0.016900
100k	Business area	16	256	0.002560
1M		19	361	0.000361
10M		22	484	0.000048
100M	National, State	25	625	0.000006
1G	EU, US	28	784	0.000001
10G	Planet	31	961	0.000000
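
The figures in the table above appear to follow a simple pattern, which can be reproduced with a short Python sketch. The pattern is inferred from the numbers rather than stated in the text, so treat it as an assumption: committee size grows by 3 for each factor of ten in scale, cost is the square of the committee size, and "cost for me" is that cost divided by the number of people sharing it. The first row is read here as a scale of one person. The function name `ontology_cost` is hypothetical.

```python
def ontology_cost(exponent):
    """Return (scale, committee size, cost in weeks, cost for me).

    Assumed model, inferred from the table: scale = 10**exponent,
    committee = 1 + 3*exponent, cost = committee**2, and the cost
    is shared equally among everyone at that scale.
    """
    scale = 10 ** exponent
    committee = 1 + 3 * exponent
    cost_weeks = committee ** 2
    my_share = cost_weeks / scale
    return scale, committee, cost_weeks, my_share

for exponent in range(11):
    scale, committee, cost_weeks, my_share = ontology_cost(exponent)
    print("%d\t%d\t%d\t%.6f" % (scale, committee, cost_weeks, my_share))
```

Under this model the total cost grows only with the square of the logarithm of the scale, while the number of people sharing it grows linearly, so the cost per person falls rapidly.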

Time	A resource may vary with time. For example, "The Wall Street Journal" varies with time. Each issue is a time-specific resource, which does not change with time. Most home pages on the Web change with time, in a less periodic way.
Language	When a document is translated, it is useful to be able to refer to it either in the generic, or to a particular specific translation.
Content-Type	A given resource may have many ways in which it can be represented on the wire, using different `Content-type`s (in HTTP terms). As an example, an image may be represented in PNG or JFIF format.
Target medium	A given resource may be targeted at a specific medium, such as being printed, being displayed on a laptop screen, being displayed on a cellphone, or being projected onto a large screen for an audience. (This is currently available for selecting CSS stylesheets, but is not done at the HTTP content negotiation level.)

Class name	Significance
u:TimeInvariant	The relationship between a representation of this resource and the URI will not change over time
u:LanguageInvariant	The relationship between a representation of this resource and the URI will not change no matter what language is requested.
u:ContentTypeInvariant	The relationship between a representation of this resource and the URI will not change as a function of content negotiation of MIME type
u:Fixed	The relationship between a representation of this resource and the URI will not change under any circumstances

Property name	Significance	Domain	Inverse property name
u:isVersionOf	A is one of the specific versions of a time-generic resource B	u:TimeInvariant	u:hasVersion
u:isLanguageSpecficVersionOf	A is one of the specific languages (in the sense of HTTP Content-Language) of a language-generic resource B	u:LanguageInvariant	u:hasLanguageSpecificVersion
u:isContetntTypeSpecificOf	A is one of the specific content-type-specific resources (in the sense of HTTP Content-type) of a generic resource B	u:ContentTypeInvariant	u:hasContentTypeSpecificResource

Issue	Motivation
It is a pain to have to add quotes around attributes	Ease of use
It is a pain to have to spell the entire tag in the end tag	Ease of use
Parsers must stop on error	unfriendly, impractical
Namespace URIs take too much space	impractical
Non-nested begin/end tags have to be accommodated	Legacy TAG soup

`value`	literal string
`href`	taking the string as a URI with or without fragment identifier, the text (or XML fragment or whatever medium) to which it refers.
`resource`	taking a string as a URI with fragment identifier, the abstract RDF object (rdf:resource) corresponding to the identified XML document fragment.

★	Available on the web (whatever format) but with an open licence, to be Open Data
★★	Available as machine-readable structured data (e.g. Excel instead of image scan of a table)
★★★	As (2), plus non-proprietary format (e.g. CSV instead of Excel)
★★★★	All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
★★★★★	All the above, plus: Link your data to other people’s data to provide context
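
Because each star in the table above builds on the ones before it, the rating can be sketched as a count of consecutive criteria met. This is an illustrative Python sketch only; the function name `open_data_stars` and its parameter names are assumptions, not part of any published scheme.

```python
def open_data_stars(on_web_open_licence, machine_readable,
                    non_proprietary, uses_w3c_standards, linked_to_others):
    """Count stars: each level counts only if all earlier levels are met.

    Hypothetical helper; the five flags mirror the five rows of the table.
    """
    criteria = [
        on_web_open_licence,   # one star
        machine_readable,      # two stars
        non_proprietary,       # three stars
        uses_w3c_standards,    # four stars
        linked_to_others,      # five stars
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars

# An openly licensed Excel file: structured, but in a proprietary format.
print(open_data_stars(True, True, False, False, False))
```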

party	implicit node in all properties with a CAL-ADDRESS value type. (person or group: anything which can have a mailbox)
cal-address	A mailbox - normally mailto:...	URI
CU	Calendar user defined in CUTYPE
INDIVIDUAL, GROUP, RESOURCE, ROOM		CU
ldap-directory	starts "ldap:" (is this a standard?)	URI
mime-type		string
participation status	needs-action, accepted, declined, tentative, delegated, ... (an enum type - could do better. Constraints in the spec.)	string
component	of a calendar
EVENT, TODO, etc		component
TimeProperty	DTSTART, DTEND, DUE, EXDATE, RDATE
Timezone	see TZID	string
icalobject
recur	defined by recurrence properties	-Really complex datatype could be broken down into RDF! Contains its own nested attr/value structure.

iCalendar name	domain	range	Notes
ALTREP	anything iCal property?	URI	alternative to body
CN	party	string
: (mailbox)	party	cal-address	Implicit node between a party and that party's mailbox. Represented by "value" of property
CUTYPE - type
DELEGATED-FROM	party	cal-address
DELEGATED-TO	party	cal-address
DIR	party	URI
eightbit, base64	bits	text	text encodes bits according to RFC 2045. Was value of encoding "property", which was faulty model. Now, subclass of generic "encoding" property
ENCODING	bits	text	Only in schema, as superclass of eightbit and base64. See notes
FMTTYPE	document	mime-type	Why not call it content-type?! Applies to a document. Expect the implicit URI property to tell you which object.
FBTYPE			Supertype of the following
FREE, BUSY, BUSY-UNAVAILABLE, BUSY-TENTATIVE	?	time-interval	enum became subclasses FBTYPE property
LANGUAGE	string-or-doc	iso-language	Equivalent xml:lang
MEMBER	party	cal-address	group membership
PARTSTAT	party	enum	A status: part of some protocol?
RANGE	component		superclass only of ...
THIS-AND-PRIOR, THISANDFUTURE	component	date-time	subclass of RANGE (was qualifier)
RELATED	component	period@@	superclass of TRIGGER-FROM-START and TRIGGER-FROM-END?
RELTYPE	component	component	Superclass only, of
PARENT, CHILD, SIBLING	component	component	Subclasses of RELTYPE. Hierarchical constraints. Semantics unclear@@.
ROLE	party	enum roleparam	Attendee; role=chair could it be better "chair?". Wait and see whether it is a separate dimension.
RSVP	party	boolean
SENT-BY	party	cal-address	Misleading. "Represented by" would be better. Some message was sent.
TZID	anything taking time or D	timezone	Yuk. Should be part of the time string. Makes time complicated
VALUE	string-or-doc	string	Superclass of the following
BINARY, BOOLEAN, CAL-ADDRESS, DATE, DATE-TIME, DURATION, FLOAT, INTEGER, PERIOD, RECUR, TEXT, TIME, URI, UTC-OFFSET	string	string	Specifies the datatype of an associated string
URI	document	URI	Subclass of VALUE but indicates the value is the content of the resource identified.
calprop	icalobject		superclass for the following
VERSION	icalobject	string	subclass of calprop. unique.
PRODID	icalobject	string	subclass of calprop semantics? unique.
CALSCALE	icalobject	string	subclass of calprop
METHOD	icalobject	string	This is a hook for a protocol definition
VEVENT	icalobject	event	Property VEVENT of calendar implies component is of type event. See spec for properties including this in their domain
VTODO	icalobject	todo	similar
VJOURNAL	icalobject	journal	similar
VFREEBUSY	icalobject	freebusy	similar
VTIMEZONE	icalobject	timezonedef	similar Definition of a timezone.
VALARM	?component	alarm	can nest in component
CALSCALE	icalobject

iCalendar name	E	T	J	F	A	Tz	range	Notes
ATTACH	y	y	y		y		text-or-doc
CATEGORIES	y	y	y				text	List of enums
CLASS	y	y	y					classification
COMMENT	y	y	y	y	y		text	no comment
DESCRIPTION	y	y	y		y		text
GEO	y	y					float float	lat long. @@ Split into two properties?
LOCATION	y	y					text
PERCENT-COMPLETE		y					integer
PRIORITY	y	y					integer
RESOURCES	y	y					text
STATUS	y	y	y				text	enum - see the spec.
SUMMARY	y	y	y		y		text
COMPLETED							date-time
DTEND	y			y			date-time or date
DUE		y					date-time or date
DTSTART	y	y		y		y	date-time or date
DURATION	y	y		y	y		duration
FREEBUSY				y			period
TRANSP	y						text	really boolean!
TZID	a	a	a	a	a	a	text
TZNAME						y	text
TZOFFFROM						y	utc-offset	like -0500
TZOFFTO						y	utc-offset
TZURL						y	URI
ATTENDEE	y	y	y	y	y	y	party	@@ If language is specified, it applies to CN: Kludge! @@@
CONTACT	y	y	y	y			text
ORGANIZER	y	y	y	y			party	Note in FREEBUSY the use is different
RECURRENCE-ID	y	y	y				date-time or date	Could be a problem. Not a property of an event, but its presence makes it a reference to a specific occurrence of a repeated event.
RELATED-TO	y	y	y				text (really URI which is UID of component)	Subclass only of PARENT, CHILD, SIBLING above.
PARENT , CHILD, SIBLING	y	y	y					see RELATED-TO
URI	y	y	y	y			URI	document "associated with" component. For more information.
UID	y	y	y	y			UID - URI without mid:	@@ Missing scheme!!! @@ replace with mid: URI
EXDATE	y	y	y				date-time or date	Excludes the dates given @@ implicit logic makes search logic difficult.
EXRULE	y	y	y				recur
RDATE	y	y	y				date-time or date
RRULE	y	y	y				recur

name	domain	range	Notes
ACTION	A	text	really an enum
REPEAT	A	integer
TRIGGER	A	duration or date-time	See RELATED. @ Split into two properties?
CREATED	ETJ	date-time
DTSTAMP	ETJF	date-time
LAST-MODIFIED	ETJTz	date-time
SEQUENCE	ETJ	integer	fuzzy rules for incrementing this
REQUEST-STATUS	ETJF	text	eg 3.1.1

name	domain	range	Notes
UNTIL	rrule	text	text - all these are text with various constraints and substructure
COUNT
INTERVAL
BYSECOND
BYMINUTE
BYHOUR
BYDAY
BYMONTHDAY
BYYEARDAY
BYWEEKNO
BYMONTH
BYSETPOS
WKST
FREQ
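
The recurrence parts listed above are carried in iCalendar as a single text value made of NAME=value pairs separated by semicolons. As a rough sketch of how that nested attribute/value structure could be broken out, the following Python splits such a value into a dictionary keyed by the names in the table. The function name `parse_recur` and the sample rule are assumptions for illustration; every value is kept as text, as the table's note suggests, and no further validation is attempted.

```python
# Part names taken from the table above.
RECUR_PARTS = {
    "UNTIL", "COUNT", "INTERVAL", "BYSECOND", "BYMINUTE", "BYHOUR",
    "BYDAY", "BYMONTHDAY", "BYYEARDAY", "BYWEEKNO", "BYMONTH",
    "BYSETPOS", "WKST", "FREQ",
}

def parse_recur(value):
    """Split a recurrence value such as 'FREQ=WEEKLY;COUNT=10' into a dict.

    Hypothetical helper: values stay as plain text, and unknown part
    names raise ValueError.
    """
    parts = {}
    for item in value.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        name, _, text = item.partition("=")
        name = name.strip().upper()
        if name not in RECUR_PARTS:
            raise ValueError("unknown recurrence part: " + name)
        parts[name] = text
    return parts

# A made-up rule: weekly on Monday and Wednesday, ten occurrences.
print(parse_recur("FREQ=WEEKLY;COUNT=10;BYDAY=MO,WE"))
```

Each key in the resulting dictionary could then become a property whose domain is the recurrence rule, which is the kind of breakdown into RDF that the notes on the `recur` datatype hint at.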