The dream of voice interfaces, and of bringing the web to devices other than the desktop, is not new.
A well-known example of this dream is "Star Trek IV: The Voyage Home", where Scotty and McCoy, in 1986, try to talk to a computer.
The potential to bring the web to all telephones was explosive: the alliance between the largest network in the world, the internet, and the most numerous communication device on Earth, the telephone.

One of the very first attempts to bring the web to the phone was WAP and WML. It had limited display capabilities and data entry issues (cellphone keypads have usability concerns).
Voice recognition and synthesis technology has evolved, and at the beginning of the 21st century it reached a level of accuracy never imagined before.
It was then that the idea of using the voice as an interface to the web came up.

There are two basic types of voice interfaces today:
The most structured language is SPEECH. The scholar Noam Chomsky believes that SPEECH is INSTINCTIVE.
Voice is a natural human method of communication.
Security can be enhanced with voice, using biometric technology to authenticate users.
Terminals to access Voice Interface Applications can be simple telephones.

"Speech technology is an important ingredient for the Web to realize its full potential."
-- Tim Berners-Lee, at SpeechTEK 2004
The following are compared to IVR (Interactive Voice Response) systems from the traditional voice technology development.

A telephone works as a voice browser that connects to the VoiceXML application. Therefore all existing back ends can be reused with a new presentation layer.
There is a difference between voice recognition and VoiceXML.
Personal voice recognition systems allow for a wide grammar, but restrict the number of users (e.g. Dragon NaturallySpeaking or IBM ViaVoice).
VoiceXML restricts the grammar, but allows for a wide number of users.
About a decade prior to the World Wide Web, word processors were the first to recognize that the appearance of a document and its structure are distinctly different.
The initial release of the World Wide Web by Tim Berners-Lee contained such a concept. However, the web’s early expansion and lack of control allowed for the development of presentational HTML such as <center>, <font>, <b>, <i>, etc.

Once user agents supported CSS compatibly (back in 2000) and web developers understood the power of separating appearance from structure, the adoption of CSS began to grow. This is the concept of Web Standards, and it was the first big change toward Usable CSS.

The next big issue in CSS is alternative devices such as cell phones and personal digital assistants (PDAs). When a PDA responds to the CSS media type of "screen" instead of the appropriate media type of "handheld", numerous presentational issues are created.
Developers will wait until the manufacturers adopt the appropriate CSS media types before the developers use CSS with small screen devices.
CSS is a presentational language that delivers through multiple media all at once.

Aural CSS originally came from a W3C note back in January of 1997. The concept of CSS was well established and the technology was the next logical step as the official presentational language for the web.
Aural CSS is the sound rendering of a document. This enables the designer to work in a new medium. The designer’s canvas is now three-dimensional physical space (surround sound) and temporal space (sounds can be specified to play before, during, or after other sounds or content).
The problem with Aural CSS is not the technology, but rather the lack of support by the user agents. To date, no screen reading software supports Aural CSS.
In July 2002, Opera Software was the first to announce that it would take up the challenge of integrating extended voice support into its browser. This meant that no separate screen reading software would be necessary: Opera would build a screen reader into its user agent.
The Opera 7.6 beta released in 2004 was the very first web browser to deliver the tools to enable speech customization with multimodal support. The technology was developed by the combined efforts of IBM, Motorola and Opera Software. The technology became XHTML+Voice (X+V):
http://www.w3.org/TR/xhtml+voice/
The voice tools must be installed: go to Tools > Advanced > Voice and select the Enable Voice option on the page. A 10MB download accompanies the activation of the multimodal support.
The following is part of the XHTML content:
<p>Mary had a little lamb, whose fleece was white as snow.</p>
The aural CSS applied to the document is as follows:
p { voice-family: child female; -xv-voice-volume: loud; }
In this example all paragraphs in the document will be read loudly by a female child, unless stated otherwise in the CSS.
Example: mary.xml
The Aural CSS properties in Opera are listed in the handout. There are two different sections. To allow Opera to take advantage of the new CSS3 properties, a prefix of -xv- has been added. This allows Opera to follow those CSS3 properties, while allowing all other browsers to quietly ignore them. Sometime after CSS3 becomes a full recommendation, the prefix will have to be removed.
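The effect of the prefix can be illustrated with a small Python sketch. This is only a model of the "quietly ignore" behavior, not how any browser is implemented: the function names and the two property sets (VOICE_BROWSER, OTHER_BROWSER) are hypothetical and do not describe a real browser's support table.

```python
# The declarations from the "Mary had a little lamb" rule.
RULE_BODY = "voice-family: child female; -xv-voice-volume: loud;"

# Hypothetical sets of properties each user agent recognizes (assumptions).
VOICE_BROWSER = {"voice-family", "-xv-voice-volume"}
OTHER_BROWSER = {"voice-family", "color"}


def parse_declarations(block):
    """Split 'prop: value; prop: value;' into (property, value) pairs."""
    pairs = []
    for chunk in block.split(";"):
        if ":" in chunk:
            prop, value = chunk.split(":", 1)
            pairs.append((prop.strip(), value.strip()))
    return pairs


def applied_declarations(block, known_properties):
    """Keep only the declarations this user agent recognizes; drop the rest."""
    return [(p, v) for p, v in parse_declarations(block) if p in known_properties]


if __name__ == "__main__":
    print(applied_declarations(RULE_BODY, VOICE_BROWSER))
    print(applied_declarations(RULE_BODY, OTHER_BROWSER))
```

In this model the voice-enabled agent applies both declarations, while the other agent silently drops the -xv- prefixed one and keeps the rest of the rule.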
Aural CSS Example: Who's On First

Opera 8 is multimodal. This means the browser not only has the ability to support Aural CSS screen reading, but also supports VoiceXML. The ability to talk back to your browser and have it understand is a communication dream. VoiceXML is NOT voice recognition; rather, it is the equivalent of pattern matching of sounds based on expected grammars that have been programmed into the application.
For example a grammar file would look like the following:
#JSGF V1.0;
grammar voiceColor;
<voiceColor> = Red | Pink | Purple | Blue | Green | Orange | Yellow ;
Example: Color.xml
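The idea of matching only against expected alternatives can be sketched in Python. This is a simplified illustration, not the browser's recognizer: a real voice browser matches audio against the grammar, whereas this sketch assumes the utterance has already been turned into text. The function name match_color and the case-insensitive comparison are assumptions made for the example.

```python
# Alternatives copied from the voiceColor rule.
VOICE_COLORS = ["Red", "Pink", "Purple", "Blue", "Green", "Orange", "Yellow"]


def match_color(utterance):
    """Return the grammar alternative the utterance matches, or None.

    Anything outside the listed alternatives is "out of grammar" and is
    rejected, no matter how clearly it was spoken.
    """
    heard = utterance.strip().lower()
    for color in VOICE_COLORS:
        if heard == color.lower():
            return color
    return None


if __name__ == "__main__":
    for said in ["blue", "Pink", "turquoise"]:
        print(said, "->", match_color(said))
```

A word such as "turquoise" is simply not matched, which is why the grammar has to anticipate everything the user is expected to say.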

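Grammar is written in a format called JSGF (JSpeech Grammar Format, derived from the Java Speech API Grammar Format), which was published as a W3C Note. The W3C recommendation for representing speech grammars is SRGS (Speech Recognition Grammar Specification).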
The grammar files are essentially the dialog of the conversation between the computer and the user. Every part of the dialog (including deviations) must be considered.
#JSGF V1.0;
grammar voiceCommand;
public <voiceCommand> = [<polite>] <action> [the][a] <object>;
<polite> = Please | Could you please | I would like you to please ;
<action> = Open | Crack Open | Close | Shut ;
<object> = Door | Window ;
This grammar allows the user to say any of the following:
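The accepted phrases can be enumerated mechanically by expanding each slot of the rule. The Python sketch below is an illustration only: the list names and the function expand_voice_command are hypothetical, and the grammar is transcribed by hand from the voiceCommand rule as written, with an empty string standing for an optional part that is left unspoken.

```python
from itertools import product

# Each slot lists its alternatives; "" marks an optional slot left unspoken.
POLITE = ["", "Please", "Could you please", "I would like you to please"]
ACTION = ["Open", "Crack Open", "Close", "Shut"]
THE = ["", "the"]   # [the]
A = ["", "a"]       # [a]
OBJECT = ["Door", "Window"]


def expand_voice_command():
    """Return every word sequence the voiceCommand rule accepts."""
    phrases = []
    for parts in product(POLITE, ACTION, THE, A, OBJECT):
        phrases.append(" ".join(word for word in parts if word))
    return phrases


if __name__ == "__main__":
    phrases = expand_voice_command()
    print(len(phrases), "phrases, for example:")
    for phrase in ["Open Door", "Please Open the Door", "Could you please Shut a Window"]:
        print(phrase, "->", phrase in phrases)
```

Because [the] and [a] are two independent optional parts, the rule as written also accepts "Open the a Door". Writing the slot as [the | a] would exclude that combination, which shows how easily a small grammar admits unintended dialog.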

Voice Interfaces (VoiceXML + AuralCSS) go well beyond the accessibility systems that read visual web pages (screen readers). VoiceXML allows real dialogues.
VoiceXML+AuralCSS can be used to improve the visual interaction by building alternative dialogues.
Here are a series of examples that show the power of Aural CSS and Voice XML:

Speech applications must overcome the lack of the non-verbal!
Even text conversations try to overcome the lack of the non-verbal using emoticons!
You must predict ANY and ALL kinds of words and dialogs that may take place given the nature of your application, including regional variations in usage.
"If words of command are not clear and distinct, if orders are not thoroughly understood, then the general is to blame."
-- Sun Tzu, in "The Art of War"

Noisy environments, like bars, subway stations, etc., usually cause problems for voice interfaces.
Handling them requires a solution that combines voice interfaces with visual biometrics, called Visual Speech, which allows lip reading by computer (lip reading solutions by Intel, 2003, are already available).
Future technologies are always a challenge to predict. However, we do know some of the general directions in which the technologies are headed.
Key skills to ensure success:
Learning curve - expectations

"Speech technology helps computers to figure out what people are thinking, and people to figure out what computers are thinking."
-- Tim Berners-Lee, at SpeechTEK 2004, NY, Sept. 2004
http://www.uwplatt.edu/web/auralCSS
Email: frommelt@uwplatt.edu
Copyright Daniel M. Frommelt, 2006. This work is the intellectual property of the author. Permission is granted for this material to be shared for non-commercial, educational purposes, provided that this copyright statement appears on the reproduced materials and notice is given that the copying is by permission of the author. To disseminate otherwise or to republish requires written permission from the author.