Theoretical transfer speeds visualised

February 5th, 2012 by Vincent Hindriksen No comments »

There are two overviews I use during my training that I would like to share with you. Normally I draw them on a whiteboard, but having them in digital form has its advantages.

Transfer speeds per bus

The image below gives an idea of theoretical transfer speeds, so you know how a fast network (1GB of data in 10 seconds) compares to GPU-memory (1GB of data in 0.01 seconds). It does not show all the ins and outs, but just gives an idea of how things compare. For instance, it does not show that many cores on a GPU need to work together to reach that maximum transfer rate. Also, I have not used very precise benchmarking methods to arrive at these numbers.
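The two numbers quoted above can be reproduced with a small calculation. This is only a sketch: the function names are made up for this illustration, and the bandwidths (0.1 GB/s for the network, 100 GB/s for GPU memory) are simply what the "1GB in 10 seconds" and "1GB in 0.01 seconds" figures imply. Latency and protocol overhead are ignored.

```c
#include <assert.h>
#include <stdio.h>

/* Seconds needed to move `gigabytes` of data over a bus with the given
   theoretical bandwidth in GB/s. Latency and overhead are ignored. */
double transfer_seconds(double gigabytes, double gb_per_second)
{
    assert(gb_per_second > 0.0);
    return gigabytes / gb_per_second;
}

/* Prints one comparison line for a bus (hypothetical helper). */
void print_transfer(const char *bus, double gigabytes, double gb_per_second)
{
    printf("%-12s %8.2f GB/s -> %.3f s for %.0f GB\n",
           bus, gb_per_second,
           transfer_seconds(gigabytes, gb_per_second), gigabytes);
}
```

Calling print_transfer("network", 1.0, 0.1) and print_transfer("GPU memory", 1.0, 100.0) from a main function puts the two extremes of the image side by side.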

We zoom in on the slower bus speeds, so all the good stuff is on the left and all the buses to avoid are on the right. What should be clear is that a read from or write to an SSD will make the software very slow if you use write-through instead of write-back.

What is important to see is that where the data lives makes a big difference. Take a look at the image and follow along with me. When using GPUs, all of the following can increase speed on the same hardware: keeping hard disks out of the computation queue, avoiding transfers to and from the GPU, and increasing the number of computations per byte of data. When an algorithm does a lot of data operations, such as transposing a matrix, it is better to have a GPU with high memory bandwidth. When the number of operations matters most, clock speed and cache speed are the most important.

You don’t see in this image how much time an operation on the CPU or GPU itself takes once the data is available. These “transfer-speeds” directly affect the FLOPS you actually achieve, whereas the theoretical maximum is simply frequency times number of cores times number of operations per core per cycle. This explains why the maximum theoretical FLOPS are not reached in all real-life software: data-transfer is part of reality.
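As a minimal sketch of that theoretical-maximum formula (the function name and the example device are assumptions for illustration, not measurements of real hardware):

```c
#include <assert.h>

/* Theoretical peak in GFLOPS:
   clock (GHz) x number of cores x floating-point operations
   per core per cycle. */
double peak_gflops(double clock_ghz, int cores, int flops_per_core_per_cycle)
{
    assert(clock_ghz > 0.0 && cores > 0 && flops_per_core_per_cycle > 0);
    return clock_ghz * cores * flops_per_core_per_cycle;
}
```

For a hypothetical device running at 1.0 GHz with 512 cores and 2 operations per core per cycle, peak_gflops(1.0, 512, 2) gives 1024 GFLOPS. That number says nothing yet about whether the data can arrive fast enough to keep those cores busy.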

Operations and Data-size

This image shows the optimal OpenCL-hardware given operations per byte and data-size. It is a relative representation of where to find the best hardware for your problem. So if you have a lot of data (and thus transfers), there comes a point where you are better off using an APU (AMD Fusion or Intel Sandy Bridge). If the operations per byte are low, then it might be best just to use a CPU (using OpenCL).

As it really depends on the actual hardware (for instance GPUs on APUs are currently not really that powerful), use this image only to ask yourself the right questions for your algorithm. More on this in the upcoming new “hardware buying guide”.
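To place your own algorithm on the operations-per-byte axis, a rough estimate is often enough. The sketch below is not taken from the image: it uses a roofline-style bound (throughput is limited by either the compute peak or by bandwidth times operations per byte), and all names and numbers are assumptions for illustration.

```c
#include <assert.h>

/* Operations per byte for a kernel that performs `operations`
   floating-point operations while moving `bytes_moved` bytes. */
double ops_per_byte(double operations, double bytes_moved)
{
    assert(bytes_moved > 0.0);
    return operations / bytes_moved;
}

/* Upper bound on throughput in GFLOPS: the lower of the compute peak
   and what the memory bus can feed (roofline-style estimate). */
double attainable_gflops(double peak_gflops, double bandwidth_gb_s,
                         double intensity)
{
    double fed = bandwidth_gb_s * intensity; /* GFLOPS the bus can sustain */
    return fed < peak_gflops ? fed : peak_gflops;
}
```

Adding two float vectors does 1 operation per element while moving 12 bytes (two reads and one write), so ops_per_byte(1.0, 12.0) is about 0.083. On a hypothetical device with a 1000 GFLOPS peak and 100 GB/s of bandwidth, attainable_gflops then returns roughly 8.3 GFLOPS: the kernel is limited by transfers, not by compute.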

I hope these images gave you an easy insight into how things work in the world of OpenCL. If you want more, just see our more extensive training-program.

OpenCL Hardware-support

December 29th, 2011 by Vincent Hindriksen No comments »

Does your computer have OpenCL-capable hardware? Read on, if you want to find out…

If you only want to run OpenCL-software and have recent hardware, just read this paragraph. If you have recent drivers for your GPU, you can be sure OpenCL is already supported and you can run OpenCL-capable software. NVidia has supported OpenCL 1.1 since driver version 280.13, so if you need OpenCL 1.1, make sure you have this version or later. If you want to use Intel-processors and you don’t have an AMD GPU installed, you need to download the Intel OpenCL runtime.

If you want to know if your X86 device is supported, you’ll find answers in this article.

Often it is not clear how OpenCL works on CPUs. If you have an 8-core processor with two hardware threads per core, it is generally understood that 16 pipelines of instructions are possible. OpenCL takes care of this threading, but also uses the parallelism provided by the SSE and AVX extensions. I talked more about this here and here. This means that an 8-core processor with AVX can compute 8 times 32 bytes (8*8 floats or 8*4 doubles) in parallel. You could see it as parallelism of parallelism. SSE is designed with multimedia-operations in mind, but has enough to be used with OpenCL. The absolute minimum requirement for OpenCL-on-a-CPU is SSE3, though.
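The "8 times 32 bytes" arithmetic can be written out as a small sketch. The function names are made up for this illustration, and the count covers only cores times vector width, not the extra hardware threads per core.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of `element_bytes` that fit in one vector register
   of `vector_bytes` (32 for AVX, 16 for SSE). */
size_t lanes_per_core(size_t vector_bytes, size_t element_bytes)
{
    assert(element_bytes > 0 && vector_bytes % element_bytes == 0);
    return vector_bytes / element_bytes;
}

/* Elements processed in parallel across all cores, one vector
   register per core. */
size_t parallel_elements(size_t cores, size_t vector_bytes,
                         size_t element_bytes)
{
    return cores * lanes_per_core(vector_bytes, element_bytes);
}
```

With 8 cores and AVX, parallel_elements(8, 32, sizeof(float)) gives 64 floats and parallel_elements(8, 32, sizeof(double)) gives 32 doubles, matching 8*8 and 8*4 above. With 16-byte SSE registers both numbers halve.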

A question I often see is what to do if you have multiple devices. There is no single OpenCL-package that covers all devices, so you need to install the drivers for each device.

Read on to find out exactly which processors are supported.

» Read more: OpenCL Hardware-support

Basic concepts: Function Qualifiers

December 27th, 2011 by Vincent Hindriksen No comments »

Both the C-code and the OpenCL-code have a run-time and a compile-time. It is very important to be clear about what you mean by compile-time of the kernel, as this can be confusing: the kernel is compiled at run-time of the host software, after the compute-devices have been queried. The OpenCL-compiler can produce better optimised code when you give it as much information as possible. One way to do so is with Function Qualifiers. A function qualifier is written as a kernel-attribute:

__kernel __attribute__((qualifier(qualification)))  void foo ( …. ) { …. }

There are three qualifiers described in OpenCL 1.x. Let’s walk through them one by one. You can also read them here in the official documentation, with more examples.

» Read more: Basic concepts: Function Qualifiers

Black-Scholes mixing on SandyBridge, Radeon and Geforce

November 29th, 2011 by Vincent Hindriksen No comments »

Intel, AMD and NVidia have all written implementations of the Black-Scholes algorithm for their devices. Intel has described a kernel in their OpenCL optimisation-document (page 28 and further) with 3 random factors as input: S, K and T, and two configuration-constants R and V. NVidia’s kernel is easy to compare to Intel’s, while AMD chose to write down the algorithm quite differently. So we have three different but comparable kernels in total. What will happen if we run these, all optimised for specific types of hardware, on the following devices?

  • Intel(R) Core(TM) i7-2600 CPU @3.4GHz, Mem @1333MHz
  • GeForce GTX 560 @810MHz, Mem @1000MHz
  • Radeon HD 6870 @930MHz, Mem @1030MHz

Three different architectures and three different drivers. To complete the comparison I also try to see if there is a difference when using Intel’s and AMD’s driver for CPUs.

» Read more: Black-Scholes mixing on SandyBridge, Radeon and Geforce

OpenCL potential: Watermarked media for content-protection

November 23rd, 2011 by Vincent Hindriksen No comments »

HTML5 has the future, now that Flash and Silverlight are leaving the market and clearing the way for HTML5-video. There is one big problem: it is hard to protect the content, and before you know it the movie is freely available. DRM is only a temporary solution and often ends up frustrating users who just want to watch the movie wherever they want.

If you look at e-books, you see a much better way to make sure PDFs don’t get all over the web: personalizing. With images and videos this could be done too. The example here at the right has a very obvious, clearly visible watermark (source), but there are many methods which are not easy to see, and thus easier to miss for people who want to clean the file. Watermarking therefore has a clear advantage over DRM, where it is obvious what has to be removed. Watermarks give the buyers freedom of use. The only disadvantage is that ownership of a personalised video cannot be transferred.

» Read more: OpenCL potential: Watermarked media for content-protection

Differences from OpenCL 1.1 to 1.2

November 19th, 2011 by Vincent Hindriksen 2 comments »

This article is of interest to you if you don’t want to read the whole new specification [PDF] for OpenCL 1.2. As not everything is clear yet without drivers out there, there will be some edits to this article in the coming time. Feedback is very welcome.

Out of the many meetings between the many members of the OpenCL task force, many ideas sprout. Every 17 or 18 months a new version of OpenCL comes out to give all these ideas form. You see ideas that are totally new, ideas that a member has already brought out in another product, and ideas that do not appear because other members voted against them. The last category is very interesting, and hopefully we’ll soon see a lot of forum-discussion about what is missing now and should be in the next version.

With the release of 1.2 it was also announced that (at least) two task forces will be set up. One of them will target integration in high-level programming languages, which tells me that phase 1 of creating the standard is complete and we can expect to go for OpenCL 2.0. In a follow-up I will discuss these phases, what you as a user, programmer or customer can expect, and how you can act on it.

Another big announcement was that Altera is starting to support OpenCL for an FPGA-product. In another article I will let you know everything there is to know. For now I concentrate on the actual differences in this version software-wise, and what you can do with them. I have added links to the 1.1 and 1.2 man-pages, so you can look things up.

» Read more: Differences from OpenCL 1.1 to 1.2

Basic Concepts: online kernel compiling

October 28th, 2011 by Vincent Hindriksen 1 comment »

Typos are a programmer’s worst nightmare, as they are bad for concentration. The code in your head is not the same as the code on the screen, and fixing that difference doesn’t have much to do with the actual problem solving. Code highlighting in the IDE helps, but it is better to use the actual OpenCL compiler without running your whole software: an Online OpenCL Compiler. In short, it is just an OpenCL-program with a variable kernel as input, which uses the compilers of Intel, AMD, NVidia or whatever you have installed to try to compile the source. I have found two solutions, which both have to be built from source, so a C-compiler is needed.

  • CLCC. It needs the boost-libraries, cmake and make to build. Works on Windows, OSX and Linux (possibly needs some fixes, see below).
  • OnlineCLC. Needs waf to build. Seems to be Linux-only.

» Read more: Basic Concepts: online kernel compiling

Kernels and the GPL. Are we safe and linking?

October 19th, 2011 by Vincent Hindriksen No comments »

Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters.

The GPL has always been a protection against somebody, or some company, running away with your code and making money with it, or at least a way to force improvements back into the community. For unprepared companies this caused quite some stress when they were forced to give their software away. Now we have host-kernel languages such as OpenCL, CUDA, DirectCompute and RenderScript, which don’t really link a kernel but load it and launch it. As the GPL is quite complicated when it comes to mixing with commercial code, I want to warn that the GPL might not be prepared for this.

If your software is dual-licensed, you cannot assume that the GPL will not be the licence chosen when it eventually ends up in commercial software. Read below why not.

I hope we can have a discussion here, so we get to the bottom of this.

» Read more: Kernels and the GPL. Are we safe and linking?

Basic Concepts: OpenCL Convenience Methods for Vector Elements and Type Conversions

October 18th, 2011 by Vincent Hindriksen No comments »

In the Basic Concepts series I try to give an alternative description to what is said everywhere else. This time my eye fell on the alternative convenience methods in two areas, which were introduced to be nice to developers with, for example, a C/C++ and/or graphics background. Too often I see the explanation start from the convenience functions, with the “preferred” functions presented as a sort of bonus for the cases where the old functions don’t get the job done. Below I do it the other way around, which I hope gives a better understanding. I assume you have read another explanation before, so that you are seeing the topic from a different view rather than for the first time.

» Read more: Basic Concepts: OpenCL Convenience Methods for Vector Elements and Type Conversions

Both NVidia GTX and AMD Radeon on Linux

October 12th, 2011 by Vincent Hindriksen 2 comments »

Want to have both your GTX and Radeon working as OpenCL-devices under Linux? The bad news is that people have failed to get the Radeon working as a compute device with the GTX as primary. The good news is that the other way around works pretty easily (with some luck). You need to install both drivers and make sure that libglx.so isn’t overwritten by NVidia’s driver, as we won’t use that GPU for graphics. This is also the reason why it is practically impossible to use the second GPU for OpenGL.

» Read more: Both NVidia GTX and AMD Radeon on Linux