*[In this reprinted #altdevblogaday in-depth piece, software engineer Don Olmstead starts off his Abusing C++ metaprogramming series by examining expression templates with a valarray implementation.]*Abusing C++ is an exploration of how far a compiler can be pushed using metaprogramming techniques. Templates are typically frowned upon in game development due to the additional complexity and the increased build times, but they are an incredibly powerful mechanism that should not be discounted wholesale. In this series we'll explore situations where metaprogramming provides a new outlook for solving a problem faced in game development. In this first article we'll explore expression templates, a technique used in mathematics libraries to produce optimal assembly while presenting a simplified interface. This initial portion will demonstrate the technique via a valarray implementation. I was first exposed to Java while pursuing my Bachelor's. Coming from a C++ background, one of the first deficiencies I noticed was the lack of operator overloading. The professor explained it away as a response to people abusing this facility in C++. While I have encountered ill-advised use of operator overloading, mathematical constructs cry out for this ability. As an example, which of these is easier to understand?

result = a * b + c * d; // Makes perfect sense result = add(mul(a,b), mul(c,d)); // If you said this you're either lying or use // a language without operator overloading

// valarray implementation result = a * b + c * d; // C array implementation for (std::size_t i = 0; i < size; ++i) result[i] = a[i] * b[i] + c[i] * d[i];

**A naive implementation**Given the requirements a straightforward implementation would resemble the following.

template <typename Real> class valarray { public: //------------------------------------------------------------ // Additional methods and sanity checking removed for brevity //------------------------------------------------------------ valarray(std::size_t size) : _size(size) , _values(new Real[size]) { } ~valarray() { delete[] _values; } valarray& operator= (const valarray& copy) { assert(_size == copy._size); for (std::size_t i = 0; i < _size; ++i) _values[i] = copy._values[i]; return *this; } inline valarray operator+ (const valarray& rhs) const { valarray result(_size); for (std::size_t i = 0; i < _size; ++i) result._values[i] = _values[i] + rhs._values[i]; return result; } private: std::size_t _size; Real* _values; } ;

// result = a * b + c * d; becomes valarray<Real> temp1 = a * b; // Loop n times, multiplying elements // n items are allocated and later deallocated valarray<Real> temp2 = c * d; // Loop n times, multiplying elements // n items are allocated and later deallocated valarray<Real> temp3 = temp1 + temp2 // Loop n times, adding elements // allocation/deallocation n elements results = temp3; // Read/Write n times

const std::size_t size = a.size(); for (std::size_t i = 0; i < size; ++i) result[i] = a[i] * b[i] + c[i] * d[i];

**Expression templates**Using metaprogramming we can create mathematical constructs that provide a terse interface and generate ideal assembly. Expression templates are used extensively in a number of high performance computing libraries such as Blitz++, and linear algebra libraries like Eigen. The technique has also been utilized to create lambdas, though this usage is moot now that lambdas have been added to the C++ standard. To generate assembly that matches the optimal case, we need to implement code that builds the for loop. Actualizing this requires us to control two things; the temporary instances being created, and when the entire expression is evaluated. We'll begin by changing the type of object the mathematical operators return. We want to create an object that wraps the mathematical operation specified, and references an underlying array or the results of another operation. This requires the internal storage of the valarray to be modified to accomadate this. From there, we specify when the expression is evaluated by utilizing the assignment operator. Finally, we'll walk through the steps required by the compiler to produce the optimal assembly.

**Encoding the expression**As we saw when executing our naive implementation, the temporary instances of our valarray caused our execution time, and storage requirements, to grow substantially. Rather than returning an entirely new instance of the valarray class when encountering a portion of the expression, we can respond with an object that wraps each mathematical operation. As an example we'll encode the multiplication into an expression. This is done through a lightweight structure that wraps the operation.

template <typename Real, typename Op1, typename Op2> class valarray_mul { public: valarray_mul(const Op1& a, const Op2& b) : _op1(a) , _op2(b) { } Real operator[] (std::size_t i) const { return _op1[i] * _op2[i]; } std::size_t size() const { return _op1.size(); } private: const Op1& _op1; const Op2& _op2; } ;

template <typename Real, typename Rep> class valarray { public: valarray(const Rep& rep) : _rep(rep) { } valarray(const valarray& copy) : _rep(copy._rep) { } std::size_t size() const { return _rep.size(); } Real operator[] (std::size_t i) const { return _rep[i]; } const Rep& rep() const { return _rep; } private: Rep _rep; } ;

template <typename Real, typename Lhs, typename Rhs> valarray<Real, valarray_mul<Real, Lhs, Rhs> > operator* (const valarray<Real, Lhs>& lhs, const valarray<Real, Rhs>& rhs) { return valarray<Real, valarray_mul<Real, Lhs, Rhs> > (valarray_mul<Real, Lhs, Rhs>(lhs.rep(), rhs.rep())); }

The expression tree encoded

So the end result of our original expression would be an object of the following type.
valarray<Real, valarray_add< valarray<Real, valarray_mul<valarray<Real>, valarray<Real> > >, valarray<Real, valarray_mul<valarray<Real>, valarray<Real> > > >

// sqrt expression creation template <typename Real, typename Op> valarray<Real, valarray_sqrt<Real, Op> > sqrt(const valarray<Real, Op>& value) { return valarray<Real, valarray_sqrt<Real, Op> > (valarray_sqrt<Real, Op>(value.rep())); } // sqrt expression implementation template <typename Real, typename Op> class valarray_sqrt { public: valarray_sqrt(const Op& a) : _op(a) { } Real operator[] (std::size_t i) const { return std::sqrt(_op[i]); } std::size_t size() const { return _op.size(); } private: const Op& _op; } ;

**Evaluating the expression**The final piece of the puzzle involves controlling when the expression is computed. It's totality needs to be present before proceeding with the evaluation. This functions as a sort of lazy evaluation on the whole formula. The place within the code to invoke the expression is within an assignment operator.

template <typename Rep2> valarray& operator= (const valarray<Real, Rep2>& copy) { std::size_t count = size(); for (std::size_t i = 0; i < count; ++i) _rep[i] = copy[i]; return *this; }

**Expanding the expression**With the code in place we can explore how the compiler will expand the expression. Expression templates rely on aggressive inlining by the compiler. To transform our code to the equivalent C array each expression must be inlined in its entirety. Before walking through the inlining, let's take a quick detour into how compilers accomplish this. The inline keyword, and its nonstandard, more explicit cousin __forceinline, act as suggestions to the compiler. With the inline keyword, the compiler analyzes the benefit of inlining the code, and makes a judgement call based on that. The __forceinline keyword skips these checks, making it more likely the inlining will occur. However, in the end, the compiler makes the final call on what happens, and can ignore the directive completely. Within Visual Studio, this can be detected by upping the warning level to 4 (/W4) or by enabling warning C4710 and C4714 with a pragma. Our implementation actually requires the __forceinline specifier to result in optimal code due to the amount of nesting within the templates. With that said, let's walk through the steps the compiler will need to take in order for it to generate the optimal assembly. We'll be working in the ideal sense, as the actual output is dependent on the compiler used. To reiterate, the expression results in the following object being created.

result = valarray<float, valarray_add< valarray<float, valarray_mul<valarray<float>, valarray<float> > >, valarray<float, valarray_mul<valarray<float>, valarray<float> > > >

template <> valarray& operator= (const valarray<float, /* see above */ >& copy) { assert(size() == copy.size()); std::size_t count = size(); for (std::size_t i = 0; i < count; ++i) _rep[i] = copy[i]; return *this; }

template <> valarray<float, valarray_mul<valarray<float>, valarray<float> > { float operator[] (std::size_t i) const { return _rep[i]; } } ; template <> valarray_mul<valarray<float>, valarray<float> > { float operator[] (std::size_t i) const { return _op1[i] * _op2[i]; } } ; // Expands into a[i] * b[i];

template <> valarray<float, valarray_add< valarray<float, valarray_mul<valarray<float>, valarray<float> > >, valarray<float, valarray_mul<valarray<float>, valarray<float> > > > { float operator[] (std::size_t i) const { return _rep[i]; } } ; template <> valarray_add<valarray<float>, valarray<float> > { float operator[] (std::size_t i) const { return _op1[i] + _op2[i]; } } ; // Expands into a[i] * b[i] + c[i] * d[i];

const std::size_t count = size(); for (std::size_t i = 0; i < count; ++i) _rep[i] = a[i] * b[i] + c[i] * d[i];

**Limitations**A discussion of expression templates wouldn't be complete without mentioning its limitations. As you can see we're created quite a bit of code to essentially avoid writing a for loop. This has increased the complexity of the code that has to be maintained, and as a corollary has increased the amount of effort required to debug it. It also increases the workload of the compiler which has to expand and inline the templates. The types generated by this technique are deeply nested, and will require a significant amount of time to compile. If this technique is heavily used within a codebase, it will greatly influence the build time, which will drag down a programmer's productivity. The quality of output is also highly dependent on how well the compiler handles these transformations, which may or may not have the same performance characteristics of the C array implementations. Before utilizing this approach you should identify where the pros outweigh the cons. The technique would probably not meet the level of control required for low level engine code, as it doesn't have mechanism to apply optimization techniques such as prefetching. It may meet the needs of gameplay code as it provides a simplified interface that compiles to ideal code. Expression templates are by no means a magic bullet, but just another tool in the toolbox that can be wielded when the need arises.

**Up next**By utilizing expression templates we've managed to create a simplified client interface that results in optimal code generation. This allows us to do serious number crunching as quickly as possible. There is an additional avenue for increasing the performance of our valarray; adding SIMD support. In the next article, we'll add SIMD support to our valarray, using not one, not two, but three SIMD architectures. Implementations will be provided using intrinsics for SSE, NEON, and AVX. Finally we'll run some performance comparisons between our implementations, and examine the output generated by the compiler. The code for this and further installments is available in a Github repository.

*[This piece was reprinted from*

*#AltDevBlogADay, a shared blog initiative started by @mike_acton devoted to giving game developers of all disciplines a place to motivate each other to write regularly about their personal game development passions.]*