Computer Science Illuminated, Third Edition


Lab #14

Introduction to XML


14.1 What is XML?

XML stands for Extensible Markup Language.  It is used to create structured data files which can be easily written or read by a computer program.  XML files are quite often used to exchange data between two computers (especially two programs) where the format of the data (its appearance) is not as important as its structure (how the data is represented).

XML looks quite a bit like HTML.  Take, for example, the following fragment of an XML file used in a program I am working on:

sample.xml
<?xml version="1.0"?>

<Level>
   <Category name="Expressions">
       <Description>Alterations to variables</Description>
       <LanguageItem name="prefix">
           <Description>Before the variable</Description>
       </LanguageItem>
       <LanguageItem name="postfix">
           <Description>After the variable</Description>
       </LanguageItem>
       <LanguageItem name="typeCast">
           <Description>Cast variable to a type</Description>
       </LanguageItem>
   </Category>
   <Category name="Multiplicative">
       <Description>Non-Arithmetic operations</Description>
       <LanguageItem name="multiplication">
           <Description>Multiplication of two variables</Description>
       </LanguageItem>
       <LanguageItem name="division">
           <Description>Division of two variables</Description>
       </LanguageItem>
       <LanguageItem name="modulus">
           <Description>Remainder of division operation</Description>
       </LanguageItem>
   </Category>

(file truncated for brevity)

The file starts with the following line:
<?xml version="1.0"?>
Here, we declare that the file is an XML file, and that the version of XML we used to create the file is 1.0.  (As XML changes, the version number is modified accordingly.  A version 1.1 has since been published, but version 1.0 remains by far the most commonly used.)  Following the version line, the remainder of the file contains the data we wish to exchange in a highly structured format.

In the example above, we show a sample XML file which is created in a human-readable (text) format, with tags (Level, Category, LanguageItem, and Description).  The structure of the document dictates that a Level tag can contain one or more Category tags, and that a Category tag can contain a Description and one or more LanguageItem tags.

Like HTML, tags may contain attributes which are associated with values.  For example, the last LanguageItem tag shown above has a name attribute value of modulus.  The names of the tags and the attributes can be entirely up to the programmer creating the file.  That is, we chose the names of our tags when we designed the file you see above.  However, in doing so we needed to ensure that the program reading in the XML file would not only understand the names of the tags and their possible attributes, but also the structure (rules which define the sub-contents or sub-elements of each tag) of the file.
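
To make the idea of attributes concrete, here is a minimal Python sketch that reads the name attribute of each LanguageItem tag.  It assumes Python's standard xml.etree.ElementTree module as the reading program; the function name language_item_names is made up for this illustration, and the closing Category and Level tags were added so the excerpt parses on its own.

```python
import xml.etree.ElementTree as ET

# A closed-off excerpt of the sample file.  The closing Category and Level
# tags are added here (an assumption) so the excerpt is complete.
SAMPLE = """<?xml version="1.0"?>
<Level>
   <Category name="Multiplicative">
       <Description>Non-Arithmetic operations</Description>
       <LanguageItem name="multiplication">
           <Description>Multiplication of two variables</Description>
       </LanguageItem>
       <LanguageItem name="division">
           <Description>Division of two variables</Description>
       </LanguageItem>
       <LanguageItem name="modulus">
           <Description>Remainder of division operation</Description>
       </LanguageItem>
   </Category>
</Level>
"""

def language_item_names(xml_text):
    """Return the name attribute of every LanguageItem, in document order."""
    root = ET.fromstring(xml_text)
    return [item.get("name") for item in root.iter("LanguageItem")]

names = language_item_names(SAMPLE)
print(names[-1])  # name attribute of the last LanguageItem in the excerpt
```

The program only works because it knows the tag name (LanguageItem) and the attribute name (name) in advance, which is exactly the agreement between file designer and reading program described above.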

XML files are really not meant for human consumption.  That is, programs are expected to read XML files, not humans, so XML has many unforgiving rules compared to HTML.  If you have worked with HTML, you may have forgotten (on occasion) to insert a closing tag (e.g. </p>).  Many browsers will overlook these human errors and carry on despite the problem.  In this case the browser will generally guess at the location of the missing closing tag (based on other nearby tags).

XML, however, strictly requires a closing tag for each opening tag.  As such, you will notice that each LanguageItem opening tag has a corresponding closing tag, as does each Description tag.  This is a design decision that the creators of XML built into the language.  It removes any need for a program to guess where a tag may end.  Now, if the ending tag is missing, the file is not well-formed (incorrect) and cannot be processed.
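
The strictness described above can be observed with a short Python sketch.  It assumes the standard xml.etree.ElementTree parser; the helper name is_well_formed and the two one-line test strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Every opening tag has a matching closing tag.
COMPLETE = (
    '<LanguageItem name="prefix">'
    "<Description>Before the variable</Description>"
    "</LanguageItem>"
)

# The closing </Description> tag has been left out on purpose.
MISSING_CLOSE = (
    '<LanguageItem name="prefix">'
    "<Description>Before the variable"
    "</LanguageItem>"
)

print(is_well_formed(COMPLETE))
print(is_well_formed(MISSING_CLOSE))
```

Unlike a forgiving HTML browser, the parser does not try to guess where the Description tag should have ended; it simply refuses the second string.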

XML has also led to the development of XHTML, a reformulation of HTML for hypertext documents on the World Wide Web.  XHTML (which stands for Extensible HyperText Markup Language) merges HTML and XML, and at the time of its release it was anticipated to be the successor to HTML.

Lab Activity #1:
  1. Locate the XHTML recommendation by the World Wide Web Consortium.
What is the URL of the recommendation document? ____________________

What is the latest version of the document? _______________________
  2. Find an XML tutorial on the web.
What is the URL of the tutorial that you located?
________________________

  3. Type in the sample file above and add the closing Level tag to the end of the file.  Save the file to your computer and attempt to load the file using your web browser.
Describe what happens when you load the file. ___________________
Which web browser did you use to load the file? _______________

14.2 Examining XML Structure Definitions

The structure of the data contained in an XML document is defined by another document known as a DTD (Document Type Definition).  The DTD file (generally named with a .dtd extension) provides the supporting information so that the corresponding XML file can be checked for validity.  An XML file does not require a DTD; however, providing one is strongly encouraged.  Let us take a closer look at XML by creating a sample file around the Pete's Pet Store example used in earlier labs.

Assume that we wish to export a listing of all of the products in our store to another application.  Ideally, we want to provide the data about each product, but not include any formatting information, since we don't know how that application might wish to display our information.  This is actually a common use for XML: exporting data from one site to another.

As an example, take a look at the popular technology web site http://www.slashdot.org.  Slashdot (as it is more commonly known) publishes a variety of news stories based on categories and headlines submitted by its users.  The article listing from Slashdot's main page is available as an XML formatted file (partially shown via the link below) by downloading http://slashdot.org/slashdot.xml.

Please click here to see an example of slashdot XML.

As you can see, the XML file starts with the top-level backslash tag.  This tag can contain zero or more story tags, and so on.  The structure of the XML file is defined in another file named backslash.dtd.  This DTD file is located at the URL specified in the second line of the XML file and is also available over the World Wide Web.  backslash.dtd is shown in the following table:

backslash.dtd
<!DOCTYPE backslash [
<!ELEMENT backslash (story*)>
<!ELEMENT story (title, url, time, author, department, topic,
comments, section, image)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT department (#PCDATA)>
<!ELEMENT topic (#PCDATA)>
<!ELEMENT comments (#PCDATA)>
<!ELEMENT section (#PCDATA)>
<!ELEMENT image (#PCDATA)>
]>

The DTD file tells us the following about our XML document:
  • The DTD describes the structure of the backslash node (element) and those elements beneath it (its sub-elements).
  • The backslash element can contain zero or more story elements (defined by using the story* notation described later)
  • Any story element must contain the following sub-elements, in the order listed
    • title,
    • url,
    • time,
    • author,
    • department,
    • topic,
    • comments,
    • section, and
    • image
  • Each sub-element of story contains no sub-elements of its own; each is of the #PCDATA type.  #PCDATA stands for parsed character data (in other words, text data).  Elements can also be of the EMPTY type (note that there is no # symbol), signifying that the element has no content, but may contain attributes (which we will discuss later).
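
The rules listed above can be sketched as a small hand-written check in Python.  This is only an illustration under stated assumptions: it uses the standard xml.etree.ElementTree module (which does not read DTDs itself), the function names are made up, and the story content is placeholder text rather than real Slashdot data.

```python
import xml.etree.ElementTree as ET

# Sub-elements of story, in the order given by backslash.dtd.
STORY_FIELDS = ["title", "url", "time", "author", "department",
                "topic", "comments", "section", "image"]

def story_matches_dtd(story):
    """A story needs exactly the listed sub-elements, in order, each text-only."""
    tags_in_order = [child.tag for child in story]
    text_only = all(len(child) == 0 for child in story)
    return tags_in_order == STORY_FIELDS and text_only

def backslash_matches_dtd(xml_text):
    """The root must be backslash and may hold zero or more story elements."""
    root = ET.fromstring(xml_text)
    if root.tag != "backslash":
        return False
    return all(child.tag == "story" and story_matches_dtd(child)
               for child in root)

def make_story(field_names):
    """Build a story element filled with placeholder text (hypothetical data)."""
    fields = "".join(f"<{name}>placeholder</{name}>" for name in field_names)
    return f"<story>{fields}</story>"

complete = "<backslash>" + make_story(STORY_FIELDS) + "</backslash>"
incomplete = "<backslash>" + make_story(STORY_FIELDS[:-1]) + "</backslash>"

print(backslash_matches_dtd(complete))    # all nine sub-elements present
print(backslash_matches_dtd(incomplete))  # the image sub-element is missing
```

A validating XML parser performs this kind of check automatically from the DTD; the sketch just spells out by hand what the DTD is asking for.
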
Let's return to our pet store example.  If we wanted to create an XML file which will become a listing of each item our store carries, we would need to create a DTD structure that can accommodate such a file.  In designing the DTD, it should be clear that we would need to place the following information in the XML file:
  • item name,
  • item product number,
  • item description, and
  • item price
For now, we will create the DTD from the first two items in the list (you'll be finishing the DTD in Lab Activity #2).  Thus, it's likely we would want to create a DTD that organized the data in the following format:
<!DOCTYPE store [
<!ELEMENT store (products*)>
<!ELEMENT products (itemName, itemProductNumber)>
<!ELEMENT itemName (#PCDATA)>
<!ELEMENT itemProductNumber (#PCDATA)>
]>
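
As a rough sketch of the export side, the following Python code writes a product listing in the two-field structure above.  It assumes the standard xml.etree.ElementTree module; the function name build_store_xml and the item names and product numbers are hypothetical and are not taken from the pet store labs.

```python
import xml.etree.ElementTree as ET

def build_store_xml(items):
    """Build a store document with one products element per (name, number) pair."""
    store = ET.Element("store")
    for name, number in items:
        product = ET.SubElement(store, "products")
        ET.SubElement(product, "itemName").text = name
        ET.SubElement(product, "itemProductNumber").text = number
    return ET.tostring(store, encoding="unicode")

# Hypothetical inventory used only for this illustration.
ITEMS = [("Example item A", "0001"), ("Example item B", "0002")]

xml_text = build_store_xml(ITEMS)
print(xml_text)
```

Notice that the output carries only data and structure.  Nothing in it says how the receiving application should display the products.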

Lab Activity #2:
  1. Modify our example DTD file (store.dtd) to capture the following data fields for each product in the store.  Note that we added a few fields not previously seen in our pet store example.
  • item description
  • item price (in US Dollars)
  • quantity on hand
  • supplier name
  • restock time (number of days to resupply)
  2. Next, create the corresponding product file (product.xml) which refers to the store.dtd file you just created.  Include the three products we currently offer and add an additional three products of your choosing.

14.3 Moving the DTD file into the XML file

On some occasions it may be preferable to bundle the XML file and the DTD file together, so that everything is contained in one file.  In the examples in Section 14.2 we kept them separate.  However, with a tad more effort we can bundle our files together.  First, let's start with our XML file.  We will use the example from my recent project.  Here we have the start of the XML file (tableStructure.xml).

tableStructure.xml
<?xml version="1.0"?>

<!-- File Name:  tableInfo.xml -->

<Level>
    <Category name="Expressions">
        <Description>Alterations to variables</Description>
        <LanguageItem name="prefix">
            <Description>Before the variable</Description>
        </LanguageItem>
        <LanguageItem name="postfix">
            <Description>After the variable</Description>
        </LanguageItem>
        <LanguageItem name="typeCast">
            <Description>Cast variable to a type</Description>
        </LanguageItem>
    </Category>
    <Category name="Multiplicative">
        <Description>Non-Arithmetic operations</Description>
        <LanguageItem name="multiplication">
            <Description>Multiplication of two variables</Description>
        </LanguageItem>
        <LanguageItem name="division">
            <Description>Division of two variables</Description>
        </LanguageItem>
        <LanguageItem name="modulus">
            <Description>Remainder of division operation</Description>
        </LanguageItem>
    </Category>

(file truncated for brevity)

In the example above, we have also included a comment line.  Comments in XML (as in HTML) start with the characters <!-- and terminate with the characters -->.

Next, we want to include the DTD in the .xml file so that the .xml file contains both the data (what's already there) and the definition of the structure of the document.  To do this, we take the contents of our DTD file and embed the DTD as shown in the next table.

tableStructure.xml (revised)
<?xml version="1.0"?>

<!-- File Name:  tableInfo.xml -->

<!DOCTYPE Level
    [   
        <!ELEMENT Level (Category*, Description?)>
        <!ELEMENT Category (Description?, LanguageItem+)>
        <!ELEMENT LanguageItem (Description?, SubItem*)>
        <!ELEMENT SubItem (Description?)>
        <!ELEMENT Description (#PCDATA)>

        <!-- Higher levels must have names -->
        <!ATTLIST Category name CDATA #REQUIRED>
        <!ATTLIST LanguageItem name CDATA #REQUIRED>
        <!ATTLIST SubItem name CDATA #REQUIRED>
    ]
>

<Level>
    <Category name="Expressions">
        <Description>Alterations to variables</Description>
        <LanguageItem name="prefix">
            <Description>Before the variable</Description>
        </LanguageItem>
        <LanguageItem name="postfix">
            <Description>After the variable</Description>
        </LanguageItem>
        <LanguageItem name="typeCast">
            <Description>Cast variable to a type</Description>
        </LanguageItem>
    </Category>
    <Category name="Multiplicative">
        <Description>Non-Arithmetic operations</Description>
        <LanguageItem name="multiplication">
            <Description>Multiplication of two variables</Description>
        </LanguageItem>
        <LanguageItem name="division">
            <Description>Division of two variables</Description>
        </LanguageItem>
        <LanguageItem name="modulus">
            <Description>Remainder of division operation</Description>
        </LanguageItem>
    </Category>

(file truncated for brevity)

Note that we have placed the DTD after the initial <?xml version="1.0"?> statement, and prior to the root tag (Level).  It is also likely that you have noticed we added a few other features to our DTD file.  We'll discuss several of them now, and complete the discussion in the following section.
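
The placement just described (version line first, then the DTD, then the root tag) can be sketched in Python as follows.  The helper name embed_dtd is made up for this illustration, and the body is a shortened, closed-off excerpt of the sample.  One assumption is worth stating loudly: Python's standard xml.etree.ElementTree parser reads the embedded DTD but does not validate the document against it, so the final parse only confirms that the bundled file is well-formed.

```python
import xml.etree.ElementTree as ET

# The declarations from the DTD shown above, one per line.
DTD_DECLARATIONS = """\
<!ELEMENT Level (Category*, Description?)>
<!ELEMENT Category (Description?, LanguageItem+)>
<!ELEMENT LanguageItem (Description?, SubItem*)>
<!ELEMENT SubItem (Description?)>
<!ELEMENT Description (#PCDATA)>
<!ATTLIST Category name CDATA #REQUIRED>
<!ATTLIST LanguageItem name CDATA #REQUIRED>
<!ATTLIST SubItem name CDATA #REQUIRED>
"""

# A shortened body; the closing tags are added so it parses on its own.
BODY = """\
<Level>
    <Category name="Expressions">
        <Description>Alterations to variables</Description>
        <LanguageItem name="prefix">
            <Description>Before the variable</Description>
        </LanguageItem>
    </Category>
</Level>
"""

def embed_dtd(root_name, declarations, body):
    """Return one document: XML declaration, then DOCTYPE with the DTD, then body."""
    indented = "".join("        " + line + "\n"
                       for line in declarations.splitlines())
    return (
        '<?xml version="1.0"?>\n'
        f"<!DOCTYPE {root_name}\n    [\n{indented}    ]\n>\n"
        f"{body}"
    )

combined = embed_dtd("Level", DTD_DECLARATIONS, BODY)
root = ET.fromstring(combined)  # well-formedness check only, no validation
print(root.tag)
```

The name after <!DOCTYPE must match the root tag of the document, which is why the helper takes the root name as a parameter.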

Our DTD section appears as:
<!DOCTYPE Level
    [   
        <!ELEMENT Level (Category*, Description?)>
        <!ELEMENT Category (Description?, LanguageItem+)>
        <!ELEMENT LanguageItem (Description?, SubItem*)>
        <!ELEMENT SubItem (Description?)>
        <!ELEMENT Description (#PCDATA)>

        <!-- Higher levels must have names -->
        <!ATTLIST Category name CDATA #REQUIRED>
        <!ATTLIST LanguageItem name CDATA #REQUIRED>
        <!ATTLIST SubItem name CDATA #REQUIRED>
    ]
>
The DTD defines five elements (Level, Category, LanguageItem, SubItem, and Description).  Of these five, only the Description node does not contain another element (it only contains a textual value).  Following the definition of the elements, attribute definitions are also provided.  We will discuss attribute creation and use in the DTD in the next section of this lab.

However, there is one final point to make with regard to the definition of each element, and that is the new notation we used.  Specifically, we are referring to the "*", "?", and "+" characters.  These characters are common in what computer scientists call regular expressions.  In a nutshell, they dictate the number of sub-elements that may appear.

For example, the definition of the Level element (<!ELEMENT Level (Category*, Description?)>) states that a Level element can contain zero or more Category sub-elements, followed by zero or one Description sub-element.  In the table that follows, we define each of these new expression characters.  If an expression character is not present, then the element shown must occur exactly once as a sub-element.

Expression Character    Meaning
* (asterisk)            The element that precedes the character may appear zero or more times.
? (question mark)       The element that precedes the character may appear zero or one time.
+ (plus sign)           The element that precedes the character must appear one or more times.
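
Because these characters are borrowed from regular expressions, the element definitions of our DTD can be rewritten by hand as regular expressions and tested in Python.  This is a sketch under stated assumptions, not how a real validating parser works: the patterns below were translated manually from the DTD, the function name follows_content_model is made up, and only sub-element order and counts are checked (not attributes).

```python
import re
import xml.etree.ElementTree as ET

# Each content model from the DTD, rewritten by hand as a regular expression
# over the sequence of child tag names (each name followed by one space).
CONTENT_MODELS = {
    "Level": r"(Category )*(Description )?",
    "Category": r"(Description )?(LanguageItem )+",
    "LanguageItem": r"(Description )?(SubItem )*",
    "SubItem": r"(Description )?",
    "Description": r"",  # text only, so no child elements are allowed
}

def follows_content_model(element):
    """Check an element and all of its descendants against CONTENT_MODELS."""
    pattern = CONTENT_MODELS.get(element.tag)
    if pattern is None:
        return False  # a tag the DTD never declared
    child_sequence = "".join(child.tag + " " for child in element)
    if re.fullmatch(pattern, child_sequence) is None:
        return False
    return all(follows_content_model(child) for child in element)

VALID = (
    '<Level><Category name="Expressions">'
    "<Description>Alterations to variables</Description>"
    '<LanguageItem name="prefix"/>'
    "</Category></Level>"
)

# LanguageItem+ demands at least one LanguageItem, and this Category has none.
NO_ITEMS = (
    '<Level><Category name="Expressions">'
    "<Description>Alterations to variables</Description>"
    "</Category></Level>"
)

print(follows_content_model(ET.fromstring(VALID)))
print(follows_content_model(ET.fromstring(NO_ITEMS)))
```

Changing LanguageItem+ to LanguageItem* in the Category pattern would make the second document acceptable, which is a quick way to see the difference between the two characters.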

Lab Activity #3:
  1. For each of the following elements (based on the revised tableStructure.xml file shown above), write out which sub-elements may appear in the XML file, and how many times each may appear.
  • Category
  • LanguageItem
  • SubItem
  • Description
  2. Modify your XML file from Lab Activity #2 to include the DTD file.

  3. Rewrite your DTD to include at least two of the expression characters.

14.4 DTD Attributes

As we briefly mentioned earlier, the second half of our DTD file contained attributes for the tags we created.  These attributes were created with the following lines in the DTD:
<!ATTLIST Category name CDATA #REQUIRED>
<!ATTLIST LanguageItem name CDATA #REQUIRED>
<!ATTLIST SubItem name CDATA #REQUIRED>
Each attribute is created by first naming the tag that it is associated with (Category for the first attribute created), then the name of the attribute (in each case the attribute is called name), followed by the type of data the attribute will contain (CDATA represents character data).  Finally, #REQUIRED indicates that a value must be provided for this attribute.
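
The effect of #REQUIRED can be sketched with a small hand-written check in Python.  The assumptions: the table REQUIRED_ATTRIBUTES was copied by hand from the three ATTLIST lines above, the function name missing_required_attributes is made up, and the standard xml.etree.ElementTree module is used only to read the document (it does not enforce the DTD on its own).

```python
import xml.etree.ElementTree as ET

# Copied by hand from the ATTLIST lines: tag name -> attributes it must carry.
REQUIRED_ATTRIBUTES = {
    "Category": ["name"],
    "LanguageItem": ["name"],
    "SubItem": ["name"],
}

def missing_required_attributes(xml_text):
    """Return a list of (tag, attribute) pairs for every missing required attribute."""
    root = ET.fromstring(xml_text)
    problems = []
    for element in root.iter():
        for attribute in REQUIRED_ATTRIBUTES.get(element.tag, []):
            if attribute not in element.attrib:
                problems.append((element.tag, attribute))
    return problems

NAMED = (
    '<Level><Category name="Expressions">'
    '<LanguageItem name="prefix"/>'
    "</Category></Level>"
)

# The LanguageItem below has lost its name attribute.
UNNAMED = (
    '<Level><Category name="Expressions">'
    "<LanguageItem/>"
    "</Category></Level>"
)

print(missing_required_attributes(NAMED))
print(missing_required_attributes(UNNAMED))
```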

In practice, the first Category and LanguageItem we define in the body of the XML document use these attributes to provide additional information about the tags.
<Category name="Expressions">
    <Description>Alterations to variables</Description>
    <LanguageItem name="prefix">
        <Description>Before the variable</Description>
    </LanguageItem>
In the sample above, we define a Category node named "Expressions" which then contains a description of the node, as well as the sub-element (or sub-node) named "prefix" (which happens to be a LanguageItem).  The "prefix" element also contains a description.  By using attributes we can provide additional information about each element or node, just as we did with HTML tags.

Lab Activity #4:
  1. Edit both the store XML file and its DTD (from Lab Activity #3) to incorporate no fewer than five new attributes.  Which ones did you create?

  2. Refer to the World Wide Web Consortium's recommendation on XML and determine the other types of fields that can be used in creating attributes.  What are they and what do they accomplish?


About the labs:
These labs were developed in conjunction with the Jones and Bartlett textbook Computer Science Illuminated by Nell Dale and John Lewis.
ISBN:  0-7637-1760-6
Lab content developed by Pete DePasquale and John Lewis.
 
Copyright 2019 Jones and Bartlett Publishers