Computer Science Illuminated

Data Compression

Download the Cover Sheet for this lab as an Adobe PDF. This can be used to record your answers to the questions in the lab.

This exercise covers file compression using key-word encoding. The files associated with this exercise are all in the same directory.

wordList.cpp A file containing a C++ program that produces a list of the unique words in a file and the number of times each appears.

words.dat The output from program WordList, with the words sorted by number of occurrences. The input to the program was a data file containing 3436 non-blank characters.

  • Examine file "words.dat" and determine which words are appropriate to use in a key-word encoding scheme.
  • What symbol would you assign to each word?
  • Calculate how many characters you would save by using the key-word encoding. Calculate the compression ratio.

Program WordList is case sensitive; words beginning with an uppercase letter are considered different from the same word beginning with a lowercase letter.

  • Look carefully at program WordList. One small change would let the program ignore case. If line 160 is changed as follows, all letters are considered lowercase:
          letters[count] = tolower(letter);
  • "tolower" is a function that converts a character to lowercase, so each letter is lowercased before it is stored in letters.
  • This change was made, the program was rerun, the results were ordered by frequency, and the file was saved as wordslc.dat.
  • Calculate how many additional characters are saved if case is ignored, and recalculate the compression ratio.
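The effect of ignoring case on the word counts can be illustrated with a small sketch. It is not the code in wordList.cpp: it splits on whitespace only and applies no minimum word length, and the names `toLowerWord` and `countWords` are hypothetical. It only shows how lowercasing each letter merges counts that would otherwise be kept apart.

```cpp
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Convert a word to lowercase one character at a time, in the spirit of
// the modified line that stores tolower(letter).
std::string toLowerWord(const std::string& word) {
    std::string result;
    for (char letter : word) {
        // Casting to unsigned char avoids undefined behavior for negative char values.
        result += static_cast<char>(std::tolower(static_cast<unsigned char>(letter)));
    }
    return result;
}

// Count whitespace-separated words, optionally ignoring case.
std::map<std::string, int> countWords(const std::string& text, bool ignoreCase) {
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++counts[ignoreCase ? toLowerWord(word) : word];
    }
    return counts;
}
```

For the text "The the THE cat", the case-sensitive count has four distinct entries, while the case-insensitive count has two, with "the" appearing three times. A word that occurs more often under one spelling saves more characters when it is encoded.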

Program WordList ignores words of fewer than three characters. Would it be better to ignore words of fewer than four characters? Recalculate the compression ratio without encoding words of fewer than four characters.

Copyright 2021 Jones and Bartlett Publishers