Chapter 2: Data Mapping and Exchange
Email:[email protected]
visit: https://begnafrique.wordpress.com/
Metadata I
Metadata types
(i) Structural metadata:
It is used to describe the structure of database objects such as tables, columns, keys and
indexes.
It also indicates how compound objects are put together, e.g. how pages are ordered to form chapters.
(ii) Guide metadata:
It is used to help humans find specific items and is usually expressed as a set of keywords in a
natural language.
It describes a resource for purposes such as discovery and identification.
It can include elements such as title, abstract, author, and keywords.
(iii) Administrative metadata:
It provides information to help manage a resource, such as when and how it was created, file
type and other technical information, and who can access it.
Data Representation and Encoding I
Before natural language data can be written to a computer recording device like disk, tape or memory it
needs to be put in a format that the computer recognizes.
For example, to record data blocks on the surface of the disk the data needs to be represented as a
string of pulses, where each pulse is in either one of two states: positive or negative polarity.
Since there can be only two states, we refer to this as binary notation.
The direction of the polarity (i.e. + or -) determines if the data is interpreted as a binary one or a binary
zero.
Computer Data types:
Numeric Data: consists of only the digits 0, 1, ..., 9.
Alphabetic Data: consists of only the letters A-Z, in both uppercase
and lowercase, and the blank character.
Alphanumeric Data: is a string of symbols where a symbol may be one
of the letters A-Z in either uppercase or lowercase, one of the digits
0-9, or a special character such as + - * / , . ( ) =
Data Representation and Encoding II
So how do we tell the computer to store the letter "A", specifically a capital A?
A computer is a digital system and can only deal with 1s and 0s.
That means digital computers use the binary system to represent and
manipulate numeric values.
So to deal with letters and symbols they use alphanumeric codes.
To represent human-readable characters using only ones and zeros,
computer designers came up with various coding schemes, each consisting of
strings of ones and zeros that represent many of the common characters
needed by computer users.
Computer codes are used for internal representation of data in computers.
As computers use binary numbers for internal data representation, computer
codes use binary coding schemes.
Data Representation and Encoding III
In binary coding, every symbol that appears in the data is represented by a group
of bits.
The group of bits used to represent a symbol is called a byte.
There are three very popular coding schemes in use today:
-ASCII
-EBCDIC
-Unicode
These coding schemes made it practical for us to record and process natural
language characters on "two-state" or binary computing devices.
Data Representation and Encoding IV
ASCII
The American Standard Code for Information Interchange (ASCII), pronounced "as-kee",
can represent:
-Latin alphabet,
-Arabic numerals
-standard punctuation characters
-Plus small set of accents and other European special characters
The first ASCII code was 7-bit code.
The 7-bit code system can represent 128 characters (2^7 = 128), which means only 7 bits are
needed for each character.
They include:
o Printable characters including 26 upper-case letters (A to Z)
o 26 lowercase letters (a - z),
o 10 numerals (0 to 9) and 33 special characters such as mathematical symbols
o space character etc.
o It also defines codes for 33 non-printing control characters, most of them now obsolete
except for carriage return and/or line feed.
However, since the smallest size representation on most computers is a byte, a byte is used
to store an ASCII character.
The 7-bit ASCII code system was later extended to 8-bit codes.
An 8-bit code system can represent 256 characters.
In a byte holding a standard (7-bit) ASCII character, the most significant bit is 0.
Data Representation and Encoding VI
• Using the above ASCII conversion chart we see that a capital "A" is
a hexadecimal 41.
• If we convert this hexadecimal number to its 8-bit binary
representation we get "01000001".
• So the disk surface will have the polarity changed to record the
following string of 8 bits: 01000001
• Let's look at this again.
• This time let's also convert a lowercase "a":
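The conversion described above can be reproduced in a few lines. This is a minimal sketch; the helper name ascii_forms is only illustrative and is not part of the slides.

```python
def ascii_forms(ch: str) -> tuple[int, str, str]:
    """Return (decimal code, hex code, 8-bit binary string) for one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return code, format(code, "02X"), format(code, "08b")


for ch in ("A", "a"):
    code, hex_code, bits = ascii_forms(ch)
    print(ch, code, hex_code, bits)
# A 65 41 01000001
# a 97 61 01100001
```

The uppercase and lowercase codes differ in a single bit (hexadecimal 20), which is why the two bit strings look so similar.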
Data Representation and Encoding VIII
• A sign indicator (C for plus, D for minus and F for unsigned) is used in the zoned position
of the rightmost digit.
• EBCDIC is used primarily in the larger computer environments, specifically mainframes and
some mid-range computing platforms.
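The sign indicator in the zoned position can be illustrated with a short sketch. It assumes the usual EBCDIC zoned-decimal layout, where each digit occupies one byte with zone nibble F and the digit value in the low nibble; the function name zoned_decimal is hypothetical.

```python
def zoned_decimal(value: int, signed: bool = True) -> bytes:
    """Encode an integer as EBCDIC zoned decimal (one byte per digit)."""
    digits = [int(d) for d in str(abs(value))]
    if signed:
        sign = 0xD if value < 0 else 0xC   # D = minus, C = plus
    else:
        sign = 0xF                         # F = unsigned
    # Every digit gets zone F except the rightmost, which carries the sign.
    zones = [0xF] * (len(digits) - 1) + [sign]
    return bytes((zone << 4) | digit for zone, digit in zip(zones, digits))


print(zoned_decimal(123).hex().upper())                # F1F2C3
print(zoned_decimal(-123).hex().upper())               # F1F2D3
print(zoned_decimal(123, signed=False).hex().upper())  # F1F2F3

# Python ships an EBCDIC codec (code page 037); "A" is hex C1 there, not 41.
print("A".encode("cp037").hex().upper())               # C1
```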
Data Representation and Encoding X
Unicode(Universal Code)
• Both EBCDIC and ASCII were built around the Latin alphabet.
• As such, they are restricted in their abilities to provide data representation for the
non-Latin alphabets used by the majority of the world's population.
• As all countries began using computers, each was devising codes that would most
effectively represent their native languages.
• ASCII and EBCDIC worked fine for English and the Romance languages but didn't have
enough character combinations to support the alphabets of languages from Eastern
Europe, Asia and Africa.
• No single encoding system supports all languages, and different encoding systems conflict with one another.
Data Representation and Encoding XI
Unicode features:
Provides a consistent way of encoding multilingual plain text
Defines codes for characters used in all major languages of the world
Defines codes for special characters, mathematical symbols, technical symbols, and
diacritics
Capacity to encode as many as a million characters
Assigns each character a unique numeric value and name
Reserves a part of the code space for private use
Affords the simplicity and consistency of ASCII; corresponding characters even have the same
code
Specifies an algorithm for the presentation of text with bi-directional behavior
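Several of these features can be inspected with Python's standard unicodedata module. This is a small sketch; the helper name describe and the sample characters are chosen only for illustration.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str, str]:
    """Return (code point, name, general category, bidi class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),   # private-use characters have no name
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


print(describe("A"))        # ('U+0041', 'LATIN CAPITAL LETTER A', 'Lu', 'L')
print(describe("\u0639"))   # ('U+0639', 'ARABIC LETTER AIN', 'Lo', 'AL')
print(describe("\ue000"))   # first private-use code point, category 'Co'

# ASCII compatibility: "A" has the same code in ASCII and in Unicode (UTF-8).
print("A".encode("ascii") == "A".encode("utf-8"))   # True
```

The bidi class ('L' for left-to-right, 'AL' for Arabic letter) is the per-character input to the bi-directional algorithm mentioned above.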
Data Representation and Encoding XII
• Encoding Forms
UTF-8, UTF-16, UTF-32.
• With 16 bits Unicode can support over 65,000 characters.
• The first 128 Unicode characters are the same as ASCII (and the first 256 match ISO-8859-1, Latin-1).
• Unicode is required by web users and by modern standards such as XML, Java, ECMAScript
(JavaScript), LDAP, CORBA 3.0 and WML.
Well-Formed XML I
• To the purist there is no such thing as well-formed XML; a document is either XML
and therefore, by definition, well-formed, or it's just text.
• But in common parlance well-formed XML means a document that follows all the
rules governing the following:
-How the content is separated from the metadata (markup)
-What is used to identify the markup
-What the constituent parts are
-In what order and where these parts can appear
Well-Formed XML II
In version 1.1 you can use all these control characters although their use is a little
unusual.
You see how to specify which version you are using in the next section.
A few characters in the Unicode specification also can’t be used but you’re unlikely to
come across these.
You can find the full list in the W3C's XML Recommendation.
(ii) XML Prolog
The first part of a document is the prolog.
It is optional so you won’t see it every time, but if it does exist it must come first.
The prolog begins with an XML declaration which, in its simplest form, looks like the
following: <?xml version="1.0"?>
Well-Formed XML IV
• This declaration contains only one piece of information, the version number, and
currently this will always be either 1.0 or 1.1.
• Sometimes the declaration may also contain information about the encoding used in
the document:
<?xml version="1.0" encoding="UTF-8"?>
• When an XML processor reads a document, it has to know which encoding was used;
but it's a chicken-and-egg situation: if it does not know the encoding, how can it read
what you have put in the declaration?
Well-Formed XML V
• The simple answer to this lies in the fact that the first few bytes of a file can contain a
byte order mark, or BOM.
• This helps the parser enough to be able to read the encoding specified in the
declaration.
• Once it knows this it can decode the rest of the document.
• Two main encoding systems use Unicode: UTF-8 and UTF-16.
• UTF stands for UCS Transformation Format, and UCS itself means Universal Character
Set.
• The number refers to the size in bits of the basic unit used to encode a character, either 8
or 16 (one or two bytes, respectively); a single character may take more than one unit.
Well-Formed XML VI
UTF-8 manages with only one byte for many characters, whereas UTF-16 needs at least two, because
UTF-8 uses a single byte to represent the more commonly used (ASCII) characters and two, three or
four bytes for the less common ones.
UTF-16 uses two bytes for the majority of characters and four bytes for the rest.
All XML processors are mandated to understand UTF-8 and UTF-16 even if those are the
only encodings they can read.
UTF-8 is the default for documents without encoding information.
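The byte counts per character can be checked directly. In this sketch the sample characters are arbitrary choices, and the "-be" codec variants are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_lengths(ch: str) -> dict[str, int]:
    """Number of bytes one character occupies in each Unicode encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


samples = {
    "A": "A",                    # ASCII letter
    "e-acute": "\u00e9",         # Latin-1 range
    "euro sign": "\u20ac",       # Basic Multilingual Plane
    "emoji": "\U0001F600",       # outside the Basic Multilingual Plane
}

for label, ch in samples.items():
    print(label, encoded_lengths(ch))
# A         {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
# e-acute   {'utf-8': 2, 'utf-16-be': 2, 'utf-32-be': 4}
# euro sign {'utf-8': 3, 'utf-16-be': 2, 'utf-32-be': 4}
# emoji     {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 4}
```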
Well-Formed XML VII
• Comments are usually meant for human consumption and are not supposed to be part of
the actual data in a document.
• They are initiated by the sequence <!-- and terminated by -->.
• Following is example.xml with a comment added:
<?xml version="1.0" encoding="utf-8"?>
<!--This is a comment in XML declaration -->
• Once the XML prolog is finished you need to create the root element of the document.
• XML documents form a tree structure that starts at "the root" and branches to "the
leaves".
• The following section details elements and how to create them:
Well-Formed XML IX
Creating Elements
Elements are the basic building blocks of XML and all documents will have at least one.
All elements are defined in one of two ways.
At its simplest, an element with content consists of a start tag, which is a left angle bracket
(<) followed by the name of the element, such as myElement, and then a right angle bracket
(>).
So a full start tag might be <myElement>.
To close the element, the end tag starts with a left angle bracket, a forward slash, and then
the name of the element and a right angle bracket.
So the end tag for <myElement> would be </myElement>.
You can add spaces after the name in a start tag, such as <myElement >, but not before
the name as in < myElement>.
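The tag rules above can be tried against a real parser. This is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name is_well_formed is only illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """True when the parser accepts the text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<myElement>content</myElement>"))     # True
print(is_well_formed("<myElement >content</myElement >"))   # True: space after the name
print(is_well_formed("< myElement>content</myElement>"))    # False: space before the name
print(is_well_formed("<myElement>content</myelement>"))     # False: names are case-sensitive
```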
Well-Formed XML XI
Following are the main contenders for how you should name your elements; the main idea is
how you distinguish separate words in an element name:
Naming Style
Pascal-casing: This capitalizes separate words including the first: <MyElement/>.
Camel-casing: Similar to Pascal except that the first letter is lowercase:<myElement /> .
Underscored names: Use an underscore to separate words:<my_element />.
Hyphenated names: Separate words with a hyphen: <my-element />.
Well-Formed XML XI
Naming Specifications
• XML has certain specific rules governing which names you can use for its markup
objects.
• An element name can begin with either an underscore or an uppercase or
lowercase letter from the Unicode character set. This means you can use the
Roman alphabet used by English and many other Western languages, the Cyrillic
one used by Russian and its language relatives, characters from Greek, or any of
the other numerous scripts, such as Thai or Arabic, that are defined in the Unicode
standard.
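One quick way to explore the naming rules is to let a parser judge candidate names. The sketch below uses the standard library parser as the referee; the function name and the sample names are hypothetical, and only plain candidate strings (no angle brackets or quotes) are meant to be passed in.

```python
import xml.etree.ElementTree as ET


def is_valid_element_name(name: str) -> bool:
    """Check a candidate name by parsing an empty element that uses it."""
    try:
        return ET.fromstring(f"<{name}/>").tag == name
    except ET.ParseError:
        return False


for candidate in ("_item", "Item", "\u0438\u043c\u044f", "\u03cc\u03bd\u03bf\u03bc\u03b1",
                  "1item", "-item", "my item"):
    print(repr(candidate), is_valid_element_name(candidate))
# '_item', 'Item', the Cyrillic name and the Greek name are accepted;
# '1item', '-item' and 'my item' are rejected.
```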
Well-Formed XML XIII
Root Element
• The next step after writing the prolog is creating the root element.
• All documents must have one and only one root element.
• Everything else in the document lies under this element to form a
hierarchical tree.
• One example is when using XML as a logging format.
• A typical log file might look like this:
Well-Formed XML XVI
The problem with this format, though, is that there isn’t a unique root
element; you have to add one to make it well-formed:
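The root-element requirement can be demonstrated with a small sketch. The entry and log element names and the sample log lines are assumptions made for illustration; they are not taken from the original slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical log fragment: a run of sibling elements with no single root.
entries = (
    '<entry date="2024-01-01T10:00:00" type="audit">User logged in</entry>\n'
    '<entry date="2024-01-01T10:05:00" type="audit">User logged out</entry>\n'
)


def parse_log(fragment: str) -> ET.Element:
    """Wrap the rootless entries in one <log> root element and parse the result."""
    return ET.fromstring(f"<log>{fragment}</log>")


try:
    ET.fromstring(entries)
except ET.ParseError as err:
    print("rootless log rejected:", err)

log = parse_log(entries)
print(log.tag, len(log))   # log 2
```

A log writer can keep appending entries cheaply and leave the wrapping to the reader, which is one reason this format turns up in practice.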
Well-Formed XML XVII
Other Elements
Underneath the root element can lie other elements that follow the
same rules for naming and attributes and, as you saw earlier, there
can also be free text.
These nested elements can be used to show individual or repetitive
items of data depending on what you are trying to represent.
For example, your root element could be <person> and the elements
underneath could show the person's characteristics, such as:
<biography> and <address>.
Well-Formed XML XX
Attributes
Attributes provide additional information about an element.
Elements are one of the two main building blocks of XML; the other one
is attributes.
Attributes are name-value pairs associated with an element.
You can add a couple of attributes to the example document like so:
Well-Formed XML XXI
A number of rules govern attributes:
Attributes consist of a name and a value separated by an equals sign.
The name, for example, myFirstAttribute, follows the same rules as element names.
The attribute value must be in quotes.
You can use either single or double quotes; the choice is entirely yours.
You can use single on some attributes and double on others, but you can’t mix them
in a single attribute.
There must be a value part, even if it's just empty quotes.
You can’t have something like <option selected> as you might in HTML.
Attribute names must be unique per element.
If you use double quotes as the delimiter you can’t also use them as part of the value.
The same applies for single quotes.
Well-Formed XML XXII
There are only two more restrictions to follow regarding character content.
Two characters cannot appear in attribute values or direct element content: the ampersand
(&) and the left angle bracket (<).
You cannot use the latter because it's used to delimit elements and it can confuse the
parser.
You cannot use the former because it's used to begin entity and character references.
Well-Formed XML XIX
There are no rules about when to use attributes and when to use elements.
XML Attributes for Metadata
Sometimes ID references are assigned to elements.
These IDs can be used to identify XML elements in much the same way as the id
attribute in HTML.
This example demonstrates this
Cont..
• Another area where XML-formatted data flourishes over simple text files
is when representing a hierarchy; for instance a file system.
• This scenario needs a root with several folders and files; each folder then
may have its own subfolders, which can also contain folders and files.
• This can go on indefinitely.
• If all you had was a text file, you could try something like this, which has
a column representing the path and one to describe whether it's a folder
or a file:
XML Tree Structure II
As you can see, this is not pretty and the information is hard for us humans
to read and quickly assimilate.
Comparatively, now look at one possible XML version of the same
information:
XML Tree Structure III
There is less repetition of data and it would be fairly easy to parse.
XML documents must contain a root element.
This element is "the parent" of all other elements.
The elements in an XML document form a document tree.
The tree starts at the root and branches to the lowest level of the tree.
All elements can have sub elements (child elements).
The terms parent, child, and sibling are used to describe the relationships
between elements.
Parent elements have children.
Children on the same level are called siblings (brothers or sisters).
All elements can have text content and attributes (just like in HTML).
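The parent, child and sibling relationships can be walked programmatically. The sketch below uses a small, made-up file-system document (the folder and file element names and the name attribute are assumptions) and rebuilds the flat path listing from the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical file system expressed as an XML tree.
doc = """
<folder name="root">
  <folder name="docs">
    <file name="notes.txt"/>
    <file name="todo.txt"/>
  </folder>
  <file name="readme.txt"/>
</folder>
"""


def paths(node, prefix=""):
    """Yield (element kind, full path) for a node and all of its descendants."""
    path = f"{prefix}/{node.get('name')}"
    yield node.tag, path
    for child in node:           # the children of one node are siblings of each other
        yield from paths(child, path)


root = ET.fromstring(doc)
for kind, path in paths(root):
    print(kind, path)
# folder /root
# folder /root/docs
# file /root/docs/notes.txt
# file /root/docs/todo.txt
# file /root/readme.txt
```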
XML Tree Structure V
Example
XML Syntax Rules I
To avoid this error, replace the "<" character with an entity reference:
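There are 5 predefined entity references in XML:
&lt; (less than, <)
&gt; (greater than, >)
&amp; (ampersand, &)
&apos; (apostrophe, ')
&quot; (quotation mark, ")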
Entities are a kind of auto text; a way of entering text into an XML document without
typing it all out.
XML Syntax Rules III
Note: Only the characters "<" and "&" are strictly illegal in XML.
The greater than character is legal, but it is a good habit to replace it.
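Escaping is rarely done by hand; most XML libraries provide it. The sketch below uses Python's standard xml.sax.saxutils module, and the sample text and the message element name are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw = "if salary < 1000 & rank > 2"

# escape() replaces &, < and > with entity references.
escaped = escape(raw)
print(escaped)   # if salary &lt; 1000 &amp; rank &gt; 2

# The escaped text can be embedded safely and parses back to the original.
element = ET.fromstring(f"<message>{escaped}</message>")
print(element.text == raw)   # True

# Unescaped text in the same position is rejected by the parser.
try:
    ET.fromstring(f"<message>{raw}</message>")
except ET.ParseError:
    print("unescaped text is not well-formed")

# quoteattr() picks a quote style that does not clash with the value.
print(quoteattr('say "hi"'))   # 'say "hi"'
```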
XML DTD and XML Schema
2. A fixed value is also automatically assigned to the element, and you cannot specify
another value.
<xs:element name="color" type="xs:string" fixed="red"/>
XML Schema cont..
XSD Attributes
Simple elements cannot have attributes.
If an element has attributes, it is considered to be of a complex type. But the attribute itself
is always declared as a simple type.
The syntax for defining an attribute is:
<xs:attribute name="xxx" type="yyy"/>
Example
<lastname lang="EN">Smith</lastname>
<xs:attribute name="lang" type="xs:string"/>
Default and Fixed Values for Attributes
Attributes may have a default value or a fixed value specified.
<xs:attribute name="lang" type="xs:string" default="EN"/>
<xs:attribute name="lang" type="xs:string" fixed="EN"/>
Optional and Required Attributes
Attributes are optional by default. To specify that the attribute is required, use the "use"
attribute:
<xs:attribute name="lang" type="xs:string" use="required"/>
XML Schema cont..
• XSD Complex Elements
– A complex element is an XML element that contains other elements and/or
attributes.
– There are four kinds of complex elements:
• Example:
1. A complex XML element, "product", which is empty:
<product pid="1345"/>
2. A complex XML element, "employee", which contains only other
elements:
<employee>
  <firstname>John</firstname>
  <lastname>Smith</lastname>
</employee>
3. A complex XML element, "food", which contains only text:
<food type="dessert">Ice cream</food>
4. A complex XML element, "description", which contains both elements and
text:
<description>
  It happened on <date lang="EN">03.03.99</date>
</description>
XML Schema cont..
• How to Define a Complex Element using XML Schema
Complex XML element, "employee", which contains only other elements:
<employee>
  <firstname>John</firstname>
  <lastname>Smith</lastname>
</employee>
The "employee" element can be declared directly by naming the element:
<xs:element name="employee">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
An empty complex element cannot have contents, only attributes.
<product prodid="1345" />
It is possible to declare the "product" element more compactly:
<xs:element name="product">
  <xs:complexType>
    <xs:attribute name="prodid" type="xs:positiveInteger"/>
  </xs:complexType>
</xs:element>
XML Schema cont..
XSD Indicators
Indicators control how elements are to be used in documents.
Order indicators are used to define the order of the elements. They are:
1. All
2. Choice
3. Sequence
All Indicator:
The <all> indicator specifies that the child elements can appear in any order, and that each child
element must occur only once:
<xs:element name="person">
  <xs:complexType>
    <xs:all>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:all>
  </xs:complexType>
</xs:element>
XML Schema cont..
Choice Indicator:
The <choice> indicator specifies that either one child element or another can occur:
<xs:element name="person">
  <xs:complexType>
    <xs:choice>
      <xs:element name="employee" type="employee"/>
      <xs:element name="member" type="member"/>
    </xs:choice>
  </xs:complexType>
</xs:element>
Sequence Indicator:
The <sequence> indicator specifies that the child elements must appear in a specific
order:
<xs:element name="person">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="firstname" type="xs:string"/>
      <xs:element name="lastname" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
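The difference between the all and sequence indicators can be made concrete with a hand-written check. Note that this is not a schema validator: Python's standard library does not validate XSD, so the sketch below only compares the child element names of a person document against the expected list. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

EXPECTED = ["firstname", "lastname"]


def child_tags(xml_text: str) -> list[str]:
    """Names of the root element's direct children, in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]


def satisfies_sequence(xml_text: str, expected=EXPECTED) -> bool:
    """sequence: the children must appear in exactly the declared order."""
    return child_tags(xml_text) == list(expected)


def satisfies_all(xml_text: str, expected=EXPECTED) -> bool:
    """all: each child must appear exactly once, in any order."""
    return sorted(child_tags(xml_text)) == sorted(expected)


in_order = "<person><firstname>John</firstname><lastname>Smith</lastname></person>"
swapped = "<person><lastname>Smith</lastname><firstname>John</firstname></person>"

print(satisfies_sequence(in_order), satisfies_all(in_order))   # True True
print(satisfies_sequence(swapped), satisfies_all(swapped))     # False True
```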
XML Schema - Example
Example (shiporder.xml):
<?xml version="1.0" encoding="ISO-8859-1"?>
<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
</shiporder>
XML Schema - Example
Example "shiporder.xsd":
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="shiporder">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orderperson" type="xs:string"/>
        <xs:element name="shipto">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="address" type="xs:string"/>
              <xs:element name="city" type="xs:string"/>
              <xs:element name="country" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="orderid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
XML Parser (Parsing XML documents)
XML DOM
The XML DOM defines a standard way for accessing and manipulating XML documents.
The XML DOM views an XML document as a tree-structure.
All elements can be accessed through the DOM tree.
Their content (text and attributes) can be modified or deleted, and new elements can be
created.
The elements, their text, and their attributes are all known as nodes.
The HTML DOM
The HTML DOM defines a standard way for accessing and manipulating HTML documents.
All HTML elements can be accessed through the HTML DOM.
XML Parser (Parsing XML documents)
Load an XML File - Cross-browser
The following example parses an XML document ("note.xml") into an XML DOM object and then extracts some information from it with JavaScript:
Example
<html>
<body>
<span id="to"></span>
<span id="from"></span>
<span id="message"></span>
<script>
var xmlhttp;
if (window.XMLHttpRequest) {
  // code for IE7+, Firefox, Chrome, Opera, Safari
  xmlhttp = new XMLHttpRequest();
} else {
  // code for IE6, IE5
  xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
}
// Third argument false makes the request synchronous: send() returns once note.xml is loaded
xmlhttp.open("GET", "note.xml", false);
xmlhttp.send();
// responseXML is the parsed XML DOM document
var xmlDoc = xmlhttp.responseXML;
// Copy the text of the first <to>, <from> and <message> elements into the page
document.getElementById("to").innerHTML = xmlDoc.getElementsByTagName("to")[0].childNodes[0].nodeValue;
document.getElementById("from").innerHTML = xmlDoc.getElementsByTagName("from")[0].childNodes[0].nodeValue;
document.getElementById("message").innerHTML = xmlDoc.getElementsByTagName("message")[0].childNodes[0].nodeValue;
</script>
</body>
</html>
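The same DOM calls are available outside the browser. The sketch below uses Python's standard xml.dom.minidom; since the contents of note.xml are not shown in the slides, the document text here is a made-up stand-in with the three elements the script reads.

```python
from xml.dom import minidom

# Hypothetical stand-in for note.xml; the real file is not shown in the slides.
NOTE_XML = """<note>
  <to>Alice</to>
  <from>Bob</from>
  <message>See you at the meeting</message>
</note>"""


def first_text(doc, tag: str) -> str:
    """Text of the first element with the given tag name, via standard DOM calls."""
    return doc.getElementsByTagName(tag)[0].childNodes[0].nodeValue


doc = minidom.parseString(NOTE_XML)
for tag in ("to", "from", "message"):
    print(tag, "=", first_text(doc, tag))
# to = Alice
# from = Bob
# message = See you at the meeting
```

Elements, their text and their attributes are all nodes, which is why the text is reached through childNodes[0] rather than read from the element itself.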
XSL , XSLT AND XPATH
XSL