OpenMP Sample for Discrete Cosine Transform

Discrete Cosine Transform (DCT) and quantization are the first two steps in the JPEG compression standard. This sample demonstrates how the DCT and quantization stages can be implemented to run faster using OpenMP* and Intel® Threading Building Blocks (Intel® TBB).

DCT represents each block of data points as a finite sum of cosine functions of different frequencies that are mutually orthogonal. The transform itself is invertible up to rounding error; the loss in JPEG-style compression comes from the quantization step that follows it. To make the effect of quantization on image quality visible, the output of the quantization phase is passed to the de-quantizer, followed by the Inverse Discrete Cosine Transform (IDCT), and the result is saved as an output bitmap image.

This sample includes four implementations of the 2D-DCT (two-dimensional DCT) algorithm: a serial implementation, a vectorized implementation (using OpenMP SIMD), a parallelized implementation (using Intel® TBB), and a version that combines threading and vectorization.
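The DCT, quantization, de-quantization and IDCT round trip described above can be sketched on a single 8x8 block. This is a minimal illustration, not the sample's code: the names Block, basis, dct2d, idct2d, quantize_roundtrip and max_abs_diff are hypothetical, and a single uniform quantization step is assumed in place of a JPEG quantization table.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr int N = 8;                      // block edge length (8x8 = 64 values)
using Block = std::array<double, N * N>;  // row-major 8x8 block

// Orthonormal DCT-II basis: C[u][x] = a(u) * cos((2x + 1) * u * pi / (2N)).
inline double basis(int u, int x) {
    const double pi = std::acos(-1.0);
    const double scale = (u == 0) ? std::sqrt(1.0 / N) : std::sqrt(2.0 / N);
    return scale * std::cos((2 * x + 1) * u * pi / (2.0 * N));
}

// Forward 2D DCT: out = C * in * C^T.
Block dct2d(const Block& in) {
    Block out{};
    for (int u = 0; u < N; ++u)
        for (int v = 0; v < N; ++v) {
            double sum = 0.0;
            for (int x = 0; x < N; ++x)
                for (int y = 0; y < N; ++y)
                    sum += basis(u, x) * in[x * N + y] * basis(v, y);
            out[u * N + v] = sum;
        }
    return out;
}

// Inverse 2D DCT: out = C^T * in * C.
Block idct2d(const Block& in) {
    Block out{};
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y) {
            double sum = 0.0;
            for (int u = 0; u < N; ++u)
                for (int v = 0; v < N; ++v)
                    sum += basis(u, x) * in[u * N + v] * basis(v, y);
            out[x * N + y] = sum;
        }
    return out;
}

// Quantize and immediately de-quantize with one uniform step
// (an assumption; JPEG uses a per-coefficient table).
Block quantize_roundtrip(const Block& coeffs, double step) {
    assert(step > 0.0);
    Block out{};
    for (std::size_t i = 0; i < coeffs.size(); ++i)
        out[i] = std::round(coeffs[i] / step) * step;
    return out;
}

// Largest absolute element-wise difference between two blocks.
double max_abs_diff(const Block& a, const Block& b) {
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::fmax(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

Calling idct2d(dct2d(block)) should reproduce the block up to rounding error, while inserting quantize_roundtrip between the two calls introduces the quality loss that the sample makes visible in its output image.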

System Requirements:

Hardware:
- Any Intel processor with Intel® Advanced Vector Extensions (Intel® AVX) support like 2nd Generation Intel® Core™ i3, i5, or i7 processors and Intel® Xeon® E3 or E5 processor family, or newer
Software:
For Microsoft Windows*:
- Microsoft Visual Studio 2013* standard edition or above
- Intel® Parallel Studio XE 2019 Composer Edition for C++ Windows* or newer
For Linux*:
- GNU* GCC 4.5 or newer
- Intel® Parallel Studio XE 2019 Composer Edition for C++ Linux* or newer
For macOS*:
- macOS* 10.13 or above
- Xcode* 9.x or above
- Intel® Parallel Studio XE 2019 Composer Edition for C++ macOS* or newer

Code Change Highlights:

Intel® TBB

serial version (DCT.cpp):

for(int i = 0; i < (size_of_image)/64; i++)
{
	startindex = (i * 64);
	process_image_serial(indata, outdata, startindex);
}

Intel® TBB version (DCT.cpp):

tbb::parallel_for(int(0), (size_of_image)/64, [&](int i)
{
	int startindex = (i * 64);
	process_image_SIMD(indata, outdata, startindex);
});

OpenMP SIMD

scalar version (matrix.cpp):

matrix_serial matrix_serial::operator*(matrix_serial &y){
  int size = y.row_size;
  matrix_serial temp(size);
  for(int i = 0; i < size; i++)
  {
    for(int j = 0; j < size; j++)
    {
      temp.ptr[(i * size) + j] = 0;
      for(int k = 0; k < size; k++)
        temp.ptr[(i * size) + j] += (ptr[(i * size) + k] * y.ptr[(k * size) + j]);
    }
  }
  return temp;
}

OpenMP SIMD version (matrix.cpp):

matrix_SIMD matrix_SIMD::operator*(matrix_SIMD &y){
  int size = y.row_size;
  matrix_SIMD temp(size);
  auto ptr_copy = ptr;
  auto tempptr_copy = temp.ptr;
  auto yptr_copy = y.ptr;
  
  for(int i = 0; i < size; i++)
  {
    #pragma omp simd
    for(int j = 0; j < size; j++)
    {
      tempptr_copy[(i * size) + j] = 0;
      for(int k = 0; k < size; k++)
        tempptr_copy[(i * size) + j] += (ptr_copy[(i * size) + k] * yptr_copy[(k * size) + j]);
    }
  }
  return temp;
}
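The two excerpts above depend on the sample's matrix classes. A standalone sketch of the same idea, assuming square row-major matrices stored in std::vector<float>, might look like the following. The function names multiply_scalar and multiply_simd are hypothetical and not part of the sample. The pragma takes effect only when OpenMP SIMD support is enabled at compile time (for example -fopenmp-simd with GCC or -qopenmp-simd with the Intel compiler); otherwise it is ignored and the loop runs as plain scalar code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: c = a * b for square row-major matrices of edge `size`.
std::vector<float> multiply_scalar(const std::vector<float>& a,
                                   const std::vector<float>& b, int size) {
    assert(a.size() == static_cast<std::size_t>(size) * size);
    assert(b.size() == a.size());
    std::vector<float> c(a.size(), 0.0f);
    for (int i = 0; i < size; ++i)
        for (int j = 0; j < size; ++j)
            for (int k = 0; k < size; ++k)
                c[i * size + j] += a[i * size + k] * b[k * size + j];
    return c;
}

// Same product, asking the compiler to vectorize across the j loop.
std::vector<float> multiply_simd(const std::vector<float>& a,
                                 const std::vector<float>& b, int size) {
    assert(a.size() == static_cast<std::size_t>(size) * size);
    assert(b.size() == a.size());
    std::vector<float> c(a.size(), 0.0f);
    // Raw pointers held in locals, mirroring the pointer copies in the excerpt.
    const float* pa = a.data();
    const float* pb = b.data();
    float* pc = c.data();
    for (int i = 0; i < size; ++i) {
        #pragma omp simd
        for (int j = 0; j < size; ++j) {
            float sum = 0.0f;  // private to each j iteration (one SIMD lane)
            for (int k = 0; k < size; ++k)
                sum += pa[i * size + k] * pb[k * size + j];
            pc[i * size + j] = sum;
        }
    }
    return c;
}
```

Comparing the output of multiply_simd against multiply_scalar on the same inputs is a simple way to check that vectorization has not changed the result.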

Intel® TBB + OpenMP SIMD

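Combine the Intel® TBB and OpenMP SIMD implementations shown above to compute the DCT and IDCT of the image: Intel® TBB distributes the 64-element blocks across threads, and each block is processed with the SIMD matrix routines. The combined code is in DCT.cpp.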

Performance Data:

Note: Modified Speedup shows the performance speedup with respect to the serial implementation.

Build Instructions:

For Visual Studio users:

Open the solution (.sln) file
[Optional] To collect performance numbers (runs the example 5 times and takes the average time):
- Project Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions: add PERF_NUM
Choose a configuration (for best performance, choose a release configuration):
- Intel-debug and Intel-release: use the Intel C++ compiler

For Windows Command Line users:

Enable your particular compiler environment
for Intel C++ Compiler:
- open the appropriate Intel C++ compiler command prompt
- navigate to project folder
- compile with Build.bat [perf_num]
  - perf_num: collect performance numbers (will run example 5 times and take average time)
- to run: Build.bat run

For Linux*/macOS* users:

Set the icc environment: source <icc-install-dir>/bin/compilervars.sh {ia32|intel64}
Navigate to project folder
for Intel C++ compiler:
- to compile: make [icpc] [perf_num=1]
  - perf_num=1: collect performance numbers (will run example 5 times and take average time)
- to run: make run
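
The performance option described in the build instructions (run the example 5 times and take the average time) could be implemented along the following lines. This is only a sketch under the assumption that the PERF_NUM preprocessor definition selects the number of timed runs; the names kRuns and average_seconds are hypothetical, and the sample's own timing code may differ.

```cpp
#include <cassert>
#include <chrono>

// Assumed behavior: with PERF_NUM defined the workload runs 5 times,
// otherwise it runs once.
#ifdef PERF_NUM
constexpr int kRuns = 5;
#else
constexpr int kRuns = 1;
#endif

// Runs `work` the requested number of times and returns the mean
// wall-clock time per run in seconds.
template <typename Work>
double average_seconds(Work&& work, int runs = kRuns) {
    assert(runs > 0);
    double total = 0.0;
    for (int run = 0; run < runs; ++run) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total / runs;
}
```

A caller would pass the code to be measured as a lambda, for example one that invokes the serial or SIMD block-processing loop, and print the returned average.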