2012-06-29

2D Interpolation, Part 5: Final Optimizations

The code has now the proper shape to carry out optimizations to their logical consequences, to the point that I question the wisdom in some of the previous transformations. It is true that in compiler construction some passes introduce constructs only for latter passes to eliminate them, in the hope of an overall simplification. In this case the starting point will be the elimination of the pixel-for-pixel dispatch to the required interpolator, by hoisting the switch out of the inner loop to a top-level procedure. I keep the last version of the interpolator as interp1 applying the same trick I used before to compare pixels:

void interp0() {
  switch (method) {
  case NEAREST:  interpN(); break;
  case BILINEAR: interpL(); break;
  case BICUBIC:  interpC(); break;
  }
}

The initial version of these interpolators are three identical copies of the previous interp0, appropriately renamed. For the sake of brevity I will omit these copies, showing instead the final, simplified form of each one. In the case of interpN(), I first eliminate the switch. Then I eliminate all references to the repeated border (variables having 0 or 3 as indices). Since there is no need to repeat elements, I now can access the source array directly. This in turn allows me to simplify all index computations, adjusting the condition on the last column as required. In fact, there is no need to keep count of the rows as yi already does that, so the end condition on the outer loop gets radically simpler. A further optimization I can afford is to eliminate the need for floating point constants yf, xf by inlining the integer calculation directly into the conditional:

void interpN() {
  int yr = 0;
  int yi = 1;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    int c11 = srcpix[(yi-1)*ncols+0], c12 = srcpix[(yi-1)*ncols+1];
    int c21 = srcpix[ yi   *ncols+0], c22 = srcpix[ yi   *ncols+1];
    int xr = 0;
    int xi = 1;
    for (int j = 0; j < width; j++) {
      final int c = 2*yr < height ? 2*xr < width ? c11 : c12 : 2*xr < width ? c21 : c22;
      pixels[offset++] = c;
      xr += ncols - 1;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        if (xi < ncols) {
          c11 = c12; c12 = srcpix[(yi-1)*ncols+xi];
          c21 = c22; c22 = srcpix[ yi   *ncols+xi];
        }
      }
    }
    yr += nrows - 1;
    if (yr >= height) {
      yr -= height;
      yi += 1;
    }
  }
}

The simplification on the linear interpolator interpL() proceeds exactly as before. The access pattern is identical, since the stencil is 2×2, so the end result is the same except for the pixel calculation:

void interpL() {
  int yr = 0;
  int yi = 1;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yf = (float) yr / (float) height;
    int c11 = srcpix[(yi-1)*ncols+0], c12 = srcpix[(yi-1)*ncols+1];
    int c21 = srcpix[ yi   *ncols+0], c22 = srcpix[ yi   *ncols+1];
    int xr = 0;
    int xi = 1;
    for (int j = 0; j < width; j++) {
      final float xf = (float) xr / (float) width;
      final int c = bilinear(xf, yf, c11, c12, c21, c22);
      pixels[offset++] = c;
      xr += ncols - 1;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        if (xi < ncols) {
          c11 = c12; c12 = srcpix[(yi-1)*ncols+xi];
          c21 = c22; c22 = srcpix[ yi   *ncols+xi];
        }
      }
    }
    yr += nrows - 1;
    if (yr >= height) {
      yr -= height;
      yi += 1;
    }
  }
}

This leaves me with the bicubic interpolator interpC(). On the one hand there is no need nor possibility to change anything besides eliminating the switch, as all previous transformations were based on the use of a 4×4 stencil with repetitions. On the other, I always question myself whether I can do better. I reason as follows: the stencil ranges over the image from pixel (-1, -1) to pixel (width−3, height−3); in case of under- or overflow the missing pixels are repeated. This means that using min and max as needed can fetch the repetitions for free! In general, the index will be min(max(yi-d, 0),nrows-1), but depending on the value of d and knowing the range of yi some simplifications are possible. The same is true for the repetition of the last column. In the inner loop I use the same trick I did with the arrayCopy(), that is, move the old pixels and fetch the new only if in range:

void interpC() {
  int yr = 0;
  int yi = 2;
  int offset = 0;
  for (int i = 0; i < height; i++) {
    final float yf = (float) yr / (float) height;
    int c00 = srcpix[max(yi-3,     0)*ncols+0];
    int c01 = srcpix[max(yi-3,     0)*ncols+0];
    int c02 = srcpix[max(yi-3,     0)*ncols+1];
    int c03 = srcpix[max(yi-3,     0)*ncols+2];
    int c10 = srcpix[   (yi-2       )*ncols+0];
    int c11 = srcpix[   (yi-2       )*ncols+0];
    int c12 = srcpix[   (yi-2       )*ncols+1];
    int c13 = srcpix[   (yi-2       )*ncols+2];
    int c20 = srcpix[   (yi-1       )*ncols+0];
    int c21 = srcpix[   (yi-1       )*ncols+0];
    int c22 = srcpix[   (yi-1       )*ncols+1];
    int c23 = srcpix[   (yi-1       )*ncols+2];
    int c30 = srcpix[min(yi  ,nrows-1)*ncols+0];
    int c31 = srcpix[min(yi  ,nrows-1)*ncols+0];
    int c32 = srcpix[min(yi  ,nrows-1)*ncols+1];
    int c33 = srcpix[min(yi  ,nrows-1)*ncols+2];
    int xr = 0;
    int xi = 2;
    for (int j = 0; j < width; j++) {
      final float xf = (float) xr / (float) width;
      final int c = bicubic(xf, yf,
            c00, c01, c02, c03, c10, c11, c12, c13,
            c20, c21, c22, c23, c30, c31, c32, c33);
      pixels[offset++] = c;
      xr += ncols - 1;
      if (xr >= width) {
        xr -= width;
        xi += 1;
        c00 = c01; c01 = c02; c02 = c03;
        c10 = c11; c11 = c12; c12 = c13;
        c20 = c21; c21 = c22; c22 = c23;
        c30 = c31; c31 = c32; c32 = c33;
        if (xi < ncols) {
          c03 = srcpix[max(yi-3,      0)*ncols+xi];
          c13 = srcpix[   (yi-2        )*ncols+xi];
          c23 = srcpix[   (yi-1        )*ncols+xi];
          c33 = srcpix[min(yi  ,nrows-1)*ncols+xi];
        }
      }
    }
    yr += nrows - 1;
    if (yr >= height) {
      yr -= height;
      yi += 1;
    }
  }
}

Note that the stencil is loaded at the top of the inner loop with full knowledge of the values that yi and xi can take. Note also that eliminating the conditions implicit in the min and max calculations would require inverting the loops so that the control variables are yi and xi. This is possible but would require modular arithmetic on the destination indices; I would have to start the overall process from scratch.

In the mean time, I suspect that the introduction of row buffers was a red herring (although I swear an innocent, unwitting one). I didn't think of using min and max until the code was focused enough for me to see it from another side. What with insight it appears perfectly obvious can be and often is obscured by unrelated considerations. Selecting the "best" strategy at each step requires knowing the full path to the goal, but when faced with limited information the best we can hope for is that our heuristics don't mislead us too far away. In any case backtracking is inevitable for those of us lacking the gift of foreknowledge, and I think much more honest (if boring) to fully disclose my veering into blind alleys.

The take-away lesson of this (admittedly repetitious, if not downright tedious) exercise is to show how to manually apply some of the strategies that modern parallelizing compilers use automatically to vectorize and optimize straight-loop code, including the use of comparisons with unoptimized versions to assess and ensure the correctness of the transformations. When faced with optimized code I am often left wondering what process the author followed to arrive at that result; my own process is rather pedestrian and methodical and without any magic tricks other than a basic knowledge of high-school arithmetic.

No comments: