CS420 Fall 2004 Homework, Week 4

In class we discussed the implication input data has on a given compression algorithm’s ability to reduce data size. For the purposes of this homework we are going to use the gzip utility that should be available on your Cygwin distributions on Windows. gzip is also included in most UNIX (Linux, etc.) distributions.

gzip uses LZ77 and the rfc is available online: ftp://ftp.isi.edu/in-notes/rfc1952.txt

The command: man gzip describes how to use the program and provides an in-depth description of its operation.

1) Find an example of the following file types, note their size, compress them, note the new size, and express the ratio of uncompressed to compressed size as a decimal:

- .txt file containing English prose

- .c file containing code

- .wave audio file containing music

- .mp3 audio file containing music

- .bmp windows image file

2) Manually construct a text file (source file should be > 1000 bytes) that gets larger after compression. State the compression ratio.

3) In the Physical Review Journal paper, Language Trees and Zipping, Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto suggest that codebook based compression utilities may be used to determine the degree of similarity between two documents. Basically they compare the compression ratio of documents a and b to the compression ratio achieved when compressing document ab (the two documents concatenated together). If the appended document’s compression ratio is similar or better than the level of compression achieved on the independent documents then the documents likely are similar.

One application of this technique is a classifier. It should be possible to determine the language an unknown document x is written in by: 1) creating a set of language exemplar files 2) successively append each exemplar to x. The resulting hybrid that achieves the best compression ratio (compare the total combined uncompressed size to the size of the compressed hybrid) identifies the closest match.

Create 3 files: example English text document english.txt, example html document html.txt, and example C code document c.txt.

Save this HW document to disk as text called x1.txt and as HTML (use view source to capture the html) as x2.txt. Perform the above computation comparing each of the two ‘unknown’ files x1, x2 to your 3 exemplars and fill results in the following chart. (The UNIX/Linux/Cygwin cat utility may be used to concatenate two files: cat filea fileb > fileab)

X	E	Size of X+E	Size of compressed XE	Ratio
X1 (text)	English
X1 (text)	HTML
X1 (text)	C
X2 (html)	English
X2 (html)	HTML
X2 (html)	C

Were X1 and X2 correctly identified? If not consider ‘tuning’ your exemplars to provide better discrimination.

NOTE: Please identify and submit all files with homework (try to find small files…)