Final Write Up

Project Summary

Initially, we planned to implement parallel versions of two lossless data compression algorithms, Lempel-Ziv-Storer-Szymanski (LZSS) compression and Huffman coding, on many-core CPUs. However, after implementing a naive parallel version of both algorithms, we found that the problems to consider for each are very similar, so we decided to focus on parallelizing Huffman coding, which has more parallelizable components and a higher potential parallel performance gain. We made use of the OpenMP library to perform parallel compression and also explored SIMD intrinsics and the ISPC compiler. We conducted a detailed analysis of our implementation, considering characteristics including memory bandwidth, CPU usage, and cache behavior, and compared its performance with the sequential implementation.

Background

As the Internet became popular in the 1980s, many compression algorithms were invented to overcome limited network and storage bandwidth. There are two main types of compression algorithms. Lossy compression algorithms are mainly used to compress images and audio. Lossless compression algorithms are more useful in many other situations, such as compression on storage devices, where loss of data is unacceptable. Our project focuses on parallelizing one of the most popular compression algorithms, Huffman Coding. Huffman Coding is the process of finding the optimal prefix code for a set of source symbols. After the prefix code table is built, more common source symbols are encoded using fewer bits. Huffman Coding can also be combined with other compression algorithms, such as LZSS, to provide an even higher compression ratio. Currently, Huffman Coding is implemented sequentially in many compression libraries, so parallelizing it is an important way to improve the performance of those libraries.

Approach

Huffman Coding Overview

Huffman Coding Compression has four main steps:

  1. Reading the input file and building a histogram that counts the frequency of each byte.
  2. Building a Huffman Tree from the histogram.
  3. Traversing the Huffman Tree to build the prefix code table.
  4. Encoding the input file using the prefix code table.
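
To make the pipeline concrete, below is a minimal, self-contained sketch of steps 1 through 3. The function and type names are ours for illustration and are not libhuffman's actual API; tree nodes are intentionally leaked for brevity.

    #include <array>
    #include <cstdint>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <vector>

    struct Node {
        uint64_t freq;
        int symbol;      // byte value for leaves, -1 for internal nodes
        Node* left;
        Node* right;
    };

    struct Cmp {
        bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
    };

    // Step 3: walk the tree; a left edge appends '0', a right edge appends '1'.
    void build_codes(const Node* n, const std::string& prefix,
                     std::array<std::string, 256>& codes) {
        if (n->symbol >= 0) { codes[n->symbol] = prefix.empty() ? "0" : prefix; return; }
        build_codes(n->left, prefix + '0', codes);
        build_codes(n->right, prefix + '1', codes);
    }

    int main() {
        std::vector<uint8_t> input = {'a','b','r','a','c','a','d','a','b','r','a'};

        // Step 1: histogram of byte frequencies.
        std::array<uint64_t, 256> freq{};
        for (uint8_t b : input) freq[b]++;

        // Step 2: repeatedly merge the two least frequent nodes.
        std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;
        for (int s = 0; s < 256; s++)
            if (freq[s]) pq.push(new Node{freq[s], s, nullptr, nullptr});
        while (pq.size() > 1) {
            Node* a = pq.top(); pq.pop();
            Node* b = pq.top(); pq.pop();
            pq.push(new Node{a->freq + b->freq, -1, a, b});
        }

        // Step 3: build the prefix code table. Step 4 (not shown) replaces
        // each input byte with its code and packs the bits into the output.
        std::array<std::string, 256> codes;
        build_codes(pq.top(), "", codes);
        for (int s = 0; s < 256; s++)
            if (freq[s]) std::cout << (char)s << " -> " << codes[s] << "\n";
    }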

We started from libhuffman as our sequential code base. libhuffman is a single-threaded pure C library and command line interface for Huffman Coding, released under a BSD license; it is available on GitHub and SourceForge. We ported it to C++11 and rewrote many inefficient components to make it as optimized as possible. We also read the entire file into memory before compression and decompression start, to avoid being bottlenecked by disk bandwidth. We then use this implementation as the baseline for our evaluation. Running the sequential Huffman Coding on a 5.5 GB Wiki dataset gives the following breakdown.

Figure 1. Compression Time Component

Parallelize Compression

As the graph shows, 77.8% of the time is spent encoding the input file, so parallelizing the encoding step became our first step. We mainly use OpenMP to utilize the multi-core CPU while keeping a clean interface, without dealing with the C++ thread library directly. To avoid communication between threads, we divide the input file into equal-size chunks, and each thread works on its own chunk. Before writing the compressed file, we precompute the size of the compressed chunk each thread will produce, then take a prefix sum of these sizes to get each thread's output offset, so every thread knows where to write. These offsets are also written at the front of the compressed file as metadata. The final compressed file looks like the following (a sketch of the offset computation follows Figure 2).

Figure 2. Compressed File Format
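
The offset computation can be sketched as follows. This is a simplified illustration, not our exact code: it assumes the per-byte code lengths are already known from the prefix code table, and the helper name compute_offsets is ours.

    #include <algorithm>
    #include <cstdint>
    #include <omp.h>
    #include <vector>

    // Two-pass parallel encoding, pass 1: each thread sums the compressed
    // bit-length of its own chunk; a prefix sum turns the sizes into output
    // offsets, so threads never communicate while encoding in pass 2.
    std::vector<uint64_t> compute_offsets(const std::vector<uint8_t>& input,
                                          const std::vector<uint8_t>& code_len, // bits per byte value
                                          int num_threads) {
        std::vector<uint64_t> size(num_threads, 0);
        uint64_t chunk = (input.size() + num_threads - 1) / num_threads;

        #pragma omp parallel for num_threads(num_threads)
        for (int t = 0; t < num_threads; t++) {
            uint64_t begin = (uint64_t)t * chunk;
            uint64_t end = std::min<uint64_t>(begin + chunk, input.size());
            for (uint64_t i = begin; i < end; i++)
                size[t] += code_len[input[i]];   // each thread writes only size[t]
        }

        // Sequential prefix sum over num_threads entries (negligible cost).
        // offsets[t] is where thread t starts writing in pass 2; these are
        // also the per-chunk offsets stored in the file header as metadata.
        std::vector<uint64_t> offsets(num_threads + 1, 0);
        for (int t = 0; t < num_threads; t++)
            offsets[t + 1] = offsets[t] + size[t];
        return offsets;
    }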

We analyzed the total compression time to identify the bottleneck as development went on. Figure 3 shows the percentage of time spent on each step of compression when we only parallelize the second pass over the data, which does the actual encoding. Encoding the data (yellow region) takes almost 80% of the time in the sequential version. However, when we parallelize it and increase the number of threads, the encoding time drops dramatically and histogram generation (blue region) becomes the bottleneck, which matches Amdahl's Law. Note that in our approach the tree-building step (orange region) includes pre-computing the compressed output size, and we also parallelized that pre-computation in our first implementation, which is why the tree-building time also drops as the number of threads increases.

Figure 3. Percentage of Time Spent On Steps of Compression (Parallel Encoding Only)

We then tackled the new bottleneck by parallelizing histogram construction. To avoid communication between threads, each thread is assigned a chunk of the input file, runs through it, and counts byte frequencies into a local histogram. A barrier then synchronizes all threads, after which each thread merges part of the global histogram (a sketch follows Figure 4). With these two parallel phases, we achieve linear speedup for histogram generation. Figure 4 shows the time breakdown with both parallel encoding and parallel histogram generation. After parallelizing both steps, the distribution of time across steps stays roughly constant as the number of threads increases, which shows that parallelizing both histogram construction and data encoding is the right approach.

Figure 4. Percentage of Time Spent On Steps of Compression (Parallel Encoding + Parallel Histogram)
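
A sketch of the two-phase histogram, assuming a statically chunked input; the names are illustrative rather than our exact code.

    #include <algorithm>
    #include <cstdint>
    #include <omp.h>
    #include <vector>

    // Phase 1: each thread counts into a private histogram, so there are no
    // shared writes. Phase 2: threads split the 256 bins, and each thread
    // sums one disjoint slice of the per-thread histograms into the result.
    std::vector<uint64_t> parallel_histogram(const std::vector<uint8_t>& input,
                                             int num_threads) {
        std::vector<std::vector<uint64_t>> local(num_threads,
                                                 std::vector<uint64_t>(256, 0));
        std::vector<uint64_t> global(256, 0);

        #pragma omp parallel num_threads(num_threads)
        {
            int t = omp_get_thread_num();
            uint64_t chunk = (input.size() + num_threads - 1) / num_threads;
            uint64_t begin = (uint64_t)t * chunk;
            uint64_t end = std::min<uint64_t>(begin + chunk, input.size());

            // Phase 1: count this thread's chunk into its private histogram.
            for (uint64_t i = begin; i < end; i++)
                local[t][input[i]]++;

            // All local histograms must be complete before merging begins.
            #pragma omp barrier

            // Phase 2: each thread merges a disjoint range of bins.
            #pragma omp for
            for (int bin = 0; bin < 256; bin++)
                for (int s = 0; s < num_threads; s++)
                    global[bin] += local[s][bin];
        }
        return global;
    }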

Parallelize Decompression

For decompression, the first step is to read the metadata from the file and build the Huffman Tree in memory; the second is to decode the file by traversing the Huffman Tree according to the compressed bits. Running the sequential version, we found that about 99% of the time is spent on the second step, so we focused on parallelizing it. Mirroring the parallel encoding step, parallel decoding starts by reading the chunk offsets from the header; each thread then starts reading the input file at its own offset. Since there is no communication between threads, we achieve linear speedup for decompression.
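
A simplified sketch of the decode loop follows. The BitReader type and the header arrays are illustrative stand-ins for our metadata, not the exact file format.

    #include <cstdint>
    #include <omp.h>
    #include <vector>

    // Minimal bit reader over a byte buffer, starting at an arbitrary bit offset.
    struct BitReader {
        const uint8_t* data;
        uint64_t bit;
        int next_bit() {
            int b = (data[bit >> 3] >> (7 - (bit & 7))) & 1;
            bit++;
            return b;
        }
    };

    struct Node { int symbol; Node* left; Node* right; }; // symbol < 0: internal

    // Each thread seeks to its chunk's bit offset from the header and walks
    // the Huffman Tree independently; no inter-thread communication needed.
    void parallel_decode(const uint8_t* compressed, const Node* root,
                         const std::vector<uint64_t>& bit_offset,   // from header
                         const std::vector<uint64_t>& out_offset,   // from header
                         const std::vector<uint64_t>& out_size,     // from header
                         uint8_t* output, int num_chunks) {
        #pragma omp parallel for
        for (int t = 0; t < num_chunks; t++) {
            BitReader bits{compressed, bit_offset[t]};
            uint8_t* out = output + out_offset[t];
            for (uint64_t i = 0; i < out_size[t]; i++) {
                const Node* n = root;
                while (n->symbol < 0)                    // walk down to a leaf
                    n = bits.next_bit() ? n->right : n->left;
                out[i] = (uint8_t)n->symbol;
            }
        }
    }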

Why Not ISPC and SIMD Instructions

During development, we also tried to use ISPC to utilize the SIMD units, but several issues suggest that Huffman compression and decompression are not a good fit for SIMD. Our idea was to use SIMD to read in an array of input bytes and compute how many compressed bits each byte produces, do a prefix sum to calculate the offsets, and then use SIMD to pack the compressed bits together, avoiding inefficient gather and scatter operations. However, this does not work well: the whole process for four elements requires more than 4x the instructions of the sequential code, so the SIMD version would likely be slower. Also, between SIMD iterations we need to remember how many bits were written in the previous iteration and handle bit-level conflicts, and there is no easy way to express that dependency in ISPC. Working directly with bits instead of bytes in SIMD is also very hard. Thus, we decided to focus on utilizing multiple cores instead of SIMD units.
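
The core difficulty is visible in a simplified scalar bit packer (a sketch, not our exact code): the bit position is a loop-carried dependency, so lane i cannot know where to write until every earlier code length is known, and two lanes may need to OR bits into the same output byte.

    #include <cstdint>
    #include <vector>

    // `bit_pos` is serial state carried across iterations: where byte i's
    // code lands depends on the code lengths of all earlier bytes, and a
    // code can straddle a byte boundary. Assumes `out` is pre-zeroed and
    // large enough to hold the packed bits.
    void pack_bits(const std::vector<uint8_t>& input,
                   const std::vector<uint32_t>& code,      // code bits per byte value
                   const std::vector<uint8_t>& code_len,   // code length per byte value
                   std::vector<uint8_t>& out) {
        uint64_t bit_pos = 0;
        for (uint8_t b : input) {
            uint32_t c = code[b];
            for (int k = code_len[b] - 1; k >= 0; k--) {   // emit MSB first
                if ((c >> k) & 1)
                    out[bit_pos >> 3] |= (uint8_t)(0x80 >> (bit_pos & 7));
                bit_pos++;
            }
        }
    }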

Results

We evaluated parallel Huffman compression and decompression on three platforms: a GHC machine, a Xeon Phi co-processor, and a many-core NUMA CPU.

GHC Machine (Xeon E5-1660 v4 @ 3.20GHz, 8 Cores, 16 Threads)

Figure 5. Speedup of parallel Huffman compression and decompression versus the optimized sequential implementation (Xeon E5-1660)

We first run Huffman compression and decompression on a 500 MB Wiki dataset on a GHC machine, which has 8 cores and 16 hardware threads. The results show linear speedup up to 8 threads, and little further speedup from 8 to 16 threads. This makes sense: the machine has only 8 physical cores, and hyperthreading helps little because Huffman compression and decompression are CPU bound.

Xeon Phi (KNL) Co-processor (68 Cores, 256 Threads)

Figure 6. Speedup of parallel Huffman compression and decompression versus the optimized sequential implementation (Xeon Phi)

We then run Huffman compression and decompression on the 5.5 GB Wiki dataset on Xeon Phi, which has 68 cores and 256 hardware threads. The results are similar: our approach achieves linear speedup until all physical cores are fully utilized, and benefits little from hyperthreading.

NUMA CPU (Xeon E5-2699 v4 @ 2.20GHz, 88 Cores, 4 Sockets, 88 Threads)

Figure 7. Speedup of parallel Huffman compression and decompression versus the optimized sequential implementation (Xeon E5-2699)

We also run our algorithm on a NUMA machine, which has 88 cores across 4 sockets, with 22 cores per socket. We configure OpenMP's CPU affinity policy so that threads are bound to cores according to their thread IDs. Here we plot the absolute speedup instead of the log speedup because we want to see the behavior of placing threads across sockets. The results show that while threads run within one socket, the speedup trend is unaffected; but when threads span sockets, we see almost no speedup (22 to 33 threads) or even a slowdown (77 to 88 threads). This is caused by interconnect traffic across sockets: we use the first thread to load the file into memory, so under the default first-touch policy the memory is allocated on the first socket only. We further parallelized the file-loading step to spread the allocation across sockets, but saw little improvement. We believe tackling this problem requires NUMA-aware memory allocation system calls or other NUMA-aware allocation policies, but since those are platform specific and the NUMA libraries are absent on the Intel cluster, we found it hard to improve further on this issue.
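
For reference, the parallelized loading step relies on first-touch behavior roughly as sketched below. This is illustrative, not our exact code; it assumes threads are pinned (e.g., with OMP_PROC_BIND=true) and that the buffer's pages have not been touched before.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <omp.h>

    // Under the default first-touch policy, a page is placed on the NUMA
    // node of the thread that first writes it. Having each pinned thread
    // touch (here, zero) its own chunk spreads the buffer across sockets
    // instead of leaving it all on the socket of thread 0.
    void first_touch(uint8_t* buffer, uint64_t size) {
        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            int n = omp_get_num_threads();
            uint64_t chunk = (size + n - 1) / n;
            uint64_t begin = std::min<uint64_t>((uint64_t)t * chunk, size);
            uint64_t end = std::min<uint64_t>(begin + chunk, size);
            std::memset(buffer + begin, 0, end - begin);  // first touch by owner
        }
    }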

References

Work Division


Project Middle Checkpoint (Apr 25th)

Project Status Summary

Here is the progress we have made since the beginning of the project.

Huffman Compression

LZSS Compression

Deliverables for the Competition

Issues & Concerns

Huffman Compression

Huffman implementations typically use a byte (8 bits) as the basic unit of the histogram; in other words, the maximum number of distinct leaves in the Huffman tree is 256. We will try using larger units (e.g., 16-bit or 32-bit) as the compression unit. Huffman compression works well when the symbol distribution is highly skewed, and there is a trade-off in how many bits to use per unit: the more bits we use, the more space we can save on the less frequent codes, but the less skew there tends to be in the distribution, and the more time is spent constructing the Huffman tree and the Huffman code (a 16-bit unit already allows up to 65,536 leaves).

LZSS Compression

The only remaining issue is to rewrite the file I/O of the sequential algorithm so that it reads a block of data at a time instead of one byte at a time. After we fix this, we will start parallelizing the string matching, both the linear-search version and the hash-table version, and then compare against the sequential code to see how much speedup we get.

Reference

[1] LZSS algorithm details. http://michael.dipperstein.com/bitlibs/index.html

[2] libhuffman. https://github.com/drichardson/huffman


Project Proposal

Summary

We are going to implement parallel versions of two lossless data compression algorithms, Lempel-Ziv-Storer-Szymanski (LZSS) compression and Huffman coding, on many-core CPUs. We will make use of SIMD intrinsics, the ISPC compiler, and the OpenMP library to perform parallel compression. We will also conduct a detailed analysis of our implementation, considering characteristics including memory bandwidth, CPU usage, and cache behavior, and compare its performance with the sequential implementation.

Background

As the Internet became popular in the 1980s, many compression algorithms were invented to overcome limited network and storage bandwidth. Lossy compression algorithms are mainly used to compress images and audio. However, lossless compression algorithms are more useful in many other situations, such as compression on storage devices, where loss of data is unacceptable. The most famous lossless compression algorithm, LZ77, was invented by Abraham Lempel and Jacob Ziv in 1977. Many algorithms have since been derived from LZ77; one of them is LZSS, which is used in RAR. Another very popular compression algorithm is Huffman coding, which generates a prefix code for the encoding targets. These two algorithms play an important role in many compression libraries, where they are currently implemented sequentially.

The Challenge

Goals and Deliverables

75% goal

Resources and Platform Choice

Schedule