In the blog post from August 2014 we compared the modularization of image processing algorithms via template-based policies against the strategy pattern based on dynamic polymorphism in terms of speed. We will again compare the run time of (almost) the same pixel-wise operations on images, but this time we consider different functional modularization techniques. More precisely, we will pass function pointers, lambdas, and derived functors to algorithms with corresponding argument types, such as a templated one and std::function, among others.
Our algorithm is a binary operation on two images. It simply traverses the two input images and one output image; at each position a pixel-wise binary operation is applied to the input pixels and the result is stored in the output image. The pixel-wise binary operation will be passed as a first-class citizen to function arguments of different types, namely:
- std::function
- template argument
- function pointer
- base class of functors using dynamic polymorphism
As input for these function arguments we will use lambdas, function pointers, and derived functors.
Functional Modularization
In this section we very briefly describe how to apply modularization by passing functions around. Time measurements are the subject of the next section.
The simple image traversal algorithm is depicted in the following
void binary(SomeFuncArgType op, const Image& im_in1, const Image& im_in2,
Image& im_out, unsigned width, unsigned height)
{
for (unsigned y = 0u; y < height; y++)
for (unsigned x = 0u; x < width; x++)
im_out[x + y * width] = op(im_in1[x + y * width],
im_in2[x + y * width]);
}
where SomeFuncArgType is a placeholder for the subsequent types and the class Image is just a vector of floats, i.e.,
using Image = std::vector<float>;
Subsequently, we will only show the changed signature, since the bodies are identical.
std::function
The type std::function is a generic function wrapper. You can use std::function, e.g., to store function pointers or functors/lambdas. The usage of std::function for the algorithm's modularization is depicted in the following signature.
void binary_func(std::function<float(float, float)> op,
const Image& im_in1, const Image& im_in2,
Image& im_out, unsigned width, unsigned height);
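Any callable with a matching signature can be wrapped. A sketch of a possible call site, assuming images im_in1, im_in2, im_out and the dimensions width and height have been set up:

std::function<float(float, float)> op = [](float a, float b) { return a + b; };
binary_func(op, im_in1, im_in2, im_out, width, height);
// equivalently, pass the lambda directly and let it be wrapped implicitly
binary_func([](float a, float b) { return a + b; },
            im_in1, im_in2, im_out, width, height);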
Template Argument
As for std::function, you can pass functors/lambdas and function pointers to the algorithm with the template signature.
template <typename F>
void binary_template(F op, const Image& im_in1, const Image& im_in2,
Image& im_out, unsigned width, unsigned height);
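Here the compiler deduces F from the passed callable, for example (again only a sketch with the variables from above):

// F is deduced as the closure type of the lambda
binary_template([](float a, float b) { return a + b; },
                im_in1, im_in2, im_out, width, height);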
Raw Pointer
The raw pointer signature is C-compatible and given by
void binary_pointer(float (*op)(float, float), const Image& im_in1,
const Image& im_in2, Image& im_out, unsigned width,
unsigned height);
One can pass function pointers of free functions either by address or directly, i.e.,
binary_pointer(&free_func, ...);
binary_pointer(free_func, ...);
and even lambdas, provided they do not capture anything (a capture-less lambda converts implicitly to a function pointer). The latter possibility is ignored in our experiments.
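Just for illustration, such a call could look as follows; the unary plus merely forces the implicit conversion of the capture-less lambda to a function pointer (a sketch only, not part of the measured code):

// capture-less lambda decays to float (*)(float, float)
binary_pointer(+[](float a, float b) { return a + b; },
               im_in1, im_in2, im_out, width, height);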
Base Class of Functors using Dynamic Polymorphism
Given the abstract base class BaseFunctor, the signature results in
void binary_dyn_poly(const BaseFunctor& op, const Image& im_in1,
const Image& im_in2, Image& im_out, unsigned width,
unsigned height);
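The post relies on BaseFunctor without restating it; a minimal version matching the derived functors shown below could look like this (the virtual destructor is an addition for safety):

struct BaseFunctor
{
    virtual ~BaseFunctor() = default;
    virtual float operator()(float a, float b) const = 0; // pure virtual call operator
};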
Remark
The standard library algorithms take their predicates by value. Hence, the dynamic polymorphism case is the only one where we could not replace the raw for loops with std::transform (an abstract base class cannot be passed by value). However, the execution speed reported below is similar for raw loops and std::transform in all other cases.
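For reference, a std::transform-based variant, as it could have been used for the other cases, is sketched below for the templated signature (requires <algorithm>):

template <typename F>
void binary_template(F op, const Image& im_in1, const Image& im_in2,
                     Image& im_out, unsigned width, unsigned height)
{
    // binary overload of std::transform: op is taken by value
    std::transform(im_in1.begin(), im_in1.begin() + width * height,
                   im_in2.begin(), im_out.begin(), op);
}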
Binary Pixel-Wise Operations
For two addition operations (a plain one and an unnecessarily complicated one) and a maximum operation, we use lambdas
auto add_lambda = [](float a, float b) { return a + b; };
auto add_complicated_lambda = [](float a, float b)
{
return std::sqrt(a * a) + std::sqrt(b * b);
};
auto maximum_lambda = [](float a, float b) { return std::max(a,b); };
function pointers of free functions
auto add_func(float a, float b) { return a + b; }
auto add_complicated_func(float a, float b)
{
return std::sqrt(a * a) + std::sqrt(b * b);
}
auto maximum_func(float a, float b) { return std::max(a, b); }
and functors derived from the abstract base class
struct AddFunctor : public BaseFunctor
{
virtual float operator()(float a, float b) const override
{
return a + b;
}
};
struct AddComplicatedFunctor : public BaseFunctor
{
virtual float operator()(float a, float b) const override
{
return std::sqrt(a * a) + std::sqrt(b * b);
}
};
struct MaxFunctor : public BaseFunctor
{
virtual float operator()(float a, float b) const override
{
return std::max(a, b);
}
};
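Each of these operations can be handed to the algorithms from the previous section, for instance (a sketch with the image variables from above):

AddFunctor add_functor;
binary_func(add_functor, im_in1, im_in2, im_out, width, height);      // wrapped in std::function
binary_template(add_functor, im_in1, im_in2, im_out, width, height);  // F = AddFunctor
binary_dyn_poly(add_functor, im_in1, im_in2, im_out, width, height);  // passed as BaseFunctor&
binary_pointer(add_func, im_in1, im_in2, im_out, width, height);      // free function pointer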
Speed Measurements
We apply the pixel-wise binary operations with equal operands, namely a std::vector<float> of 1024x1024 zeros, 100 times with each approach. As the benchmark we take the minimal run time in microseconds ($\mu{\rm s}$) over the 100 runs. We work with an Intel Core i7-6700HQ CPU @ 2.60GHz mobile processor. Compiler-wise, we use GCC 5.4 and Clang 3.9 on Ubuntu 16.04 and Visual Studio 2017 on Windows 10 (all 64 bit). We compiled in release mode.
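The measurement loop itself is not shown in the post; a sketch of how the minimum over 100 runs can be obtained with std::chrono (requires <algorithm>, <chrono>, <cstdint>, <limits>; the variable names are placeholders):

constexpr unsigned width = 1024u, height = 1024u, runs = 100u;
Image im_in1(width * height, 0.0f), im_in2(width * height, 0.0f), im_out(width * height);

auto best_us = std::numeric_limits<std::int64_t>::max();
for (unsigned run = 0u; run < runs; ++run)
{
    const auto start = std::chrono::steady_clock::now();
    binary_template(add_lambda, im_in1, im_in2, im_out, width, height);
    const auto stop = std::chrono::steady_clock::now();
    best_us = std::min<std::int64_t>(best_us,
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
}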
We will consider 8 pairs of argument type (signature) and passed type, namely:
- pointer passed as std::function
- lambda passed as std::function
- derived functor passed as std::function
- lambda passed to a templated argument
- pointer passed to a templated argument
- derived functor passed to a templated argument
- pointer passed as pointer
- derived functor passed as base functor
For each pixel-wise binary operation we rank the pairs by a penalty, which allows a comparison across compilers. To this end, we sort the pairs for each compiler according to the elapsed times. The fastest pair receives a penalty of 0; every other pair is assigned the penalty of the previous, faster pair plus $(t_{\rm cur} - t_{\rm prev}) // 100$, where $t_{\rm cur}$ is the elapsed time of the current pair, $t_{\rm prev}$ is the elapsed time of the previous pair, and $//$ denotes integer division, i.e., ignoring the remainder. The final ranking is given by the sum of these per-compiler penalties over the different compilers and is depicted in the following tables. Additionally, the tables contain the minimal run times in microseconds for each compiler.
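In code, the per-compiler penalty assignment might look like the following sketch (all names are placeholders; sorted_times contains the minimal run times in microseconds of the 8 pairs for one compiler, sorted in ascending order):

// cumulative penalties for one compiler; requires <cstddef>, <vector>
std::vector<long long> penalties(const std::vector<long long>& sorted_times)
{
    std::vector<long long> result(sorted_times.size(), 0);
    for (std::size_t i = 1; i < sorted_times.size(); ++i)
        result[i] = result[i - 1] + (sorted_times[i] - sorted_times[i - 1]) / 100;
    return result;
}

The final rank of a pair then follows from summing its penalties over all three compilers.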
add
rank | arg type (signature) | passed type | penalty | gcc $\mu$s | clang $\mu$s | vs $\mu$s |
---|---|---|---|---|---|---|
1 | template | lambda | 0 | 1549 | 827 | 819 |
2 | pointer | pointer | 11 | 2168 | 827 | 1552 |
3 | template | pointer | 14 | 2147 | 829 | 1866 |
4 | template | derived functor | 25 | 1881 | 2172 | 1868 |
5 | base functor | derived functor | 26 | 2064 | 2175 | 1890 |
6 | std::function | lambda | 49 | 4137 | 2021 | 2464 |
7 | std::function | derived functor | 54 | 4016 | 2567 | 2721 |
8 | std::function | pointer | 55 | 3974 | 2480 | 2879 |
complicated add
rank | arg type (signature) | passed type | penalty | gcc $\mu$s | clang $\mu$s | vs $\mu$s |
---|---|---|---|---|---|---|
1 | template | lambda | 0 | 7460 | 5918 | 846 |
2 | pointer | pointer | 63 | 8264 | 5948 | 6494 |
3 | template | pointer | 65 | 8249 | 5946 | 6731 |
4 | template | derived functor | 94 | 8230 | 8677 | 6947 |
4 | base functor | derived functor | 94 | 8239 | 8673 | 6964 |
6 | std::function | lambda | 102 | 8546 | 8740 | 7628 |
7 | std::function | pointer | 104 | 8535 | 8854 | 7773 |
8 | std::function | derived functor | 105 | 8527 | 8866 | 7907 |
max
rank | arg type (signature) | passed type | penalty | gcc $\mu$s | clang $\mu$s | vs $\mu$s |
---|---|---|---|---|---|---|
1 | template | lambda | 0 | 1631 | 834 | 904 |
2 | template | pointer | 15 | 2312 | 837 | 1922 |
2 | pointer | pointer | 15 | 2556 | 834 | 1605 |
4 | template | derived functor | 27 | 2340 | 2180 | 1872 |
5 | base functor | derived functor | 29 | 2351 | 2429 | 1865 |
6 | std::function | lambda | 45 | 3905 | 2018 | 2415 |
7 | std::function | derived functor | 60 | 4642 | 2586 | 2828 |
8 | std::function | pointer | 63 | 5000 | 2552 | 2835 |
Oddity
Visual Studio 2017 is unbelievably fast for the complicated add case, when the lambda is passed to the templated algorithm implementation. This needs further investigation and might lead to another blog post.