<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>StreamComputing</title>
	<atom:link href="http://www.streamcomputing.eu/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.streamcomputing.eu</link>
	<description>Your Algorithm, Our Speed</description>
	<lastBuildDate>Tue, 14 Feb 2012 07:30:32 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>Theoretical transfer speeds visualised</title>
		<link>http://www.streamcomputing.eu/blog/2012-02-05/theoretical-transfer-speeds-visualised/</link>
		<comments>http://www.streamcomputing.eu/blog/2012-02-05/theoretical-transfer-speeds-visualised/#comments</comments>
		<pubDate>Sun, 05 Feb 2012 20:53:32 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=3017</guid>
		<description><![CDATA[There are two overviews I use during my training, and I would like to share with you. Normally I write them on a whiteboard, but it has advantages having it in a digital form. Transfer speeds per bus The below image gives an idea of theoretical transfer speeds, so you know how a fast network [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_3017" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p>There are two overviews I use during my training that I would like to share with you. Normally I write them on a whiteboard, but having them in digital form has its advantages.</p>
<h1>Transfer speeds per bus</h1>
<p>The image below gives an idea of theoretical transfer speeds, so you know how a fast network (1GB of data in 10 seconds) compares to <acronym title='The processor on the videocard'>GPU</acronym>-memory (1GB of data in 0.01 seconds). It does not show all the ins and outs, but just gives an idea of how things compare. For instance, it does not show that many cores on a <acronym title='The processor on the videocard'>GPU</acronym> need to work together to reach that maximum transfer rate. Also, I have not used very precise benchmark-methods to arrive at these figures.</p>
<p>We zoom in on the slower bus-speeds, so all the good stuff is on the left and all the buses to avoid are on the right. What should be clear is that a read from or a write to an SSD will make the software very slow if you use write-through instead of write-back.</p>
<p>What is important to see is that the location of the data makes a big difference. Take a look at the image and then follow along with me. When using GPUs, the following can all increase the speed on the same hardware: keeping hard-disks out of the computation-queue, avoiding transfers to and from the <acronym title='The processor on the videocard'>GPU</acronym>, and increasing the computations per byte of data. When an algorithm needs to do a lot of data-operations, such as transposing a matrix, it&#8217;s better to have a <acronym title='The processor on the videocard'>GPU</acronym> with fast memory-access. When the number of operations is what matters, clock-speed and cache-speed are most important.</p>
<p><img class="alignnone size-full wp-image-3029" title="transfer-speeds" src="http://www.streamcomputing.eu/wp-content/uploads/2012/02/transfer-speeds1.png" alt="" width="479" height="350" /></p>
<p>This image does not show how much time an operation takes on the CPU or <acronym title='The processor on the videocard'>GPU</acronym> itself once the data is available. These &#8220;transfer-speeds&#8221; are directly related to the actual FLOPS (frequency times number of cores times number of operations per core). This explains why the maximum theoretical FLOPS do not translate into very high FLOPS in all real-life software: data-transfer is part of reality.</p>
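<p>As a rough illustration of the arithmetic behind these numbers, here is a minimal sketch in C. The function names are made up for this example, and the processor figures in the usage note are illustrative assumptions, not measurements.</p>

```c
#include <assert.h>

/* Seconds needed to move `bytes` over a bus with the given
   theoretical bandwidth in bytes per second. */
double transfer_seconds(double bytes, double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second;
}

/* Theoretical peak FLOPS as described in the text:
   frequency times number of cores times operations per core. */
double peak_flops(double hertz, int cores, int ops_per_core)
{
    assert(cores > 0 && ops_per_core > 0);
    return hertz * cores * ops_per_core;
}
```

<p>With the two data points from the text, transfer_seconds(1e9, 1e8) gives 10 seconds for the fast network and transfer_seconds(1e9, 1e11) gives 0.01 seconds for GPU-memory. For a hypothetical 3.4 GHz quad-core that is assumed to do 8 operations per core per cycle, peak_flops(3.4e9, 4, 8) gives 108.8 GFLOPS, a figure that is only reached when the data arrives fast enough.</p>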
<h1>Operations and Data-size</h1>
<p>This image shows the optimal <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-hardware given operations per byte and data-size. It is a relative representation that shows where to look for the best type of device. So if you have a lot of data (and thus transfers), there is a point where you are better off using an APU (AMD Fusion or Intel Sandy Bridge). If the operations per byte are low, then it might be best just to use a CPU (using <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>).</p>
<p>As it really depends on the actual hardware (for instance GPUs on APUs are currently not really that powerful), use this image only to ask yourself the right questions for your algorithm. More on this in the upcoming new &#8220;hardware buying guide&#8221;.</p>
<p><img class="alignnone size-full wp-image-3030" title="ops-data" src="http://www.streamcomputing.eu/wp-content/uploads/2012/02/ops-data1.png" alt="" width="479" height="322" /></p>
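<p>To place your own algorithm on the operations-per-byte axis, a back-of-the-envelope count is usually enough. The sketch below is an assumption-laden illustration in C: the function names are invented, and the matrix-multiplication count assumes every input is read only once, which real code rarely achieves.</p>

```c
#include <assert.h>

/* Operations per byte: arithmetic operations divided by bytes moved. */
double ops_per_byte(double operations, double bytes_moved)
{
    assert(bytes_moved > 0.0);
    return operations / bytes_moved;
}

/* Element-wise addition of two float vectors of length n:
   n additions, and per element two reads and one write of 4 bytes. */
double vector_add_ops_per_byte(double n)
{
    return ops_per_byte(n, 3.0 * 4.0 * n);
}

/* Naive n x n float matrix multiplication: 2*n^3 operations,
   counting one read of both inputs and one write of the result. */
double matmul_ops_per_byte(double n)
{
    return ops_per_byte(2.0 * n * n * n, 3.0 * 4.0 * n * n);
}
```

<p>The vector addition always ends up at 1/12 operation per byte, however large n is, so it sits at the low end of the axis. The matrix multiplication grows with n (about 171 operations per byte for n = 1024 under the assumption above), which moves it towards the high end.</p>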
<p>I hope these images gave you an easy insight into how things work in the world of <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>. If you want more, just see our more extensive <a href="http://www.streamcomputing.eu/education/">training-program</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2012-02-05/theoretical-transfer-speeds-visualised/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>OpenCL Hardware-support</title>
		<link>http://www.streamcomputing.eu/blog/2011-12-29/opencl-hardware-support/</link>
		<comments>http://www.streamcomputing.eu/blog/2011-12-29/opencl-hardware-support/#comments</comments>
		<pubDate>Thu, 29 Dec 2011 19:51:55 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=2938</guid>
		<description><![CDATA[Does your computer have OpenCL-capable hardware? Read on, if you want to find out&#8230; For people who only want to run OpenCL-software and have recent hardware, just read this paragraph. If you have recent drivers for your GPU, you can be sure OpenCL is already supported and you can run OpenCL-capable software. NVidia has support for [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_2938" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p><img class="alignright size-medium wp-image-3015" title="chip-check" src="http://www.streamcomputing.eu/wp-content/uploads/2011/12/chip-check-300x221.jpg" alt="" width="300" height="221" />Does your computer have <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-capable hardware? Read on, if you want to find out&#8230;</p>
<p>For people who only want to run <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-software and have recent hardware, just read this paragraph. If you have recent drivers for your <acronym title='The processor on the videocard'>GPU</acronym>, you can be sure <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> is already supported and you can run <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-capable software. NVidia has support for <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> 1.1 since drivers 280.13, so if you need <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> 1.1, then make sure you <a href="http://www.nvidia.com/download/find.aspx" target="_blank">have</a> this version or later. If you want to use Intel-processors and you don&#8217;t have an AMD <acronym title='The processor on the videocard'>GPU</acronym> installed, you need to <a href="http://software.intel.com/en-us/articles/download-intel-opencl-sdk/" target="_blank">download</a> the runtime of Intel <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>.</p>
<p>If you want to know if your X86 device is supported, you&#8217;ll find answers in this article.</p>
<p>Often it is not clear how <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> works on CPUs. If you have an 8-core processor with two hardware threads per core, it is generally understood that 16 instruction streams can run at the same time. <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> takes care of this threading, but also uses the parallelism provided by the SSE and AVX extensions. This means that an 8-core processor with AVX can compute on 8 times 32 bytes (8*8 floats or 8*4 doubles) in parallel. You could see it as parallelism of parallelism. I talked more about this <a href="http://www.streamcomputing.eu/blog/2011-02-07/ssex-avx-fma-and-other-extensions-through-opencl/" target="_blank">here</a> and <a href="http://www.streamcomputing.eu/blog/2010-12-08/opencl-on-the-cpu-avx-and-sse/" target="_blank">here</a>. SSE is designed with multimedia-operations in mind, but has enough to be used with <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>. The absolute minimum requirement for <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-on-a-CPU is SSE3, though.</p>
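<p>The &#8220;parallelism of parallelism&#8221; above is simple arithmetic: the number of cores times the number of elements that fit in one vector register. A minimal sketch in C, with a function name invented for this example:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements processed in parallel when every core executes
   one vector instruction on a register of `vector_bytes` bytes. */
size_t parallel_elements(size_t cores, size_t vector_bytes, size_t element_bytes)
{
    assert(element_bytes > 0 && vector_bytes % element_bytes == 0);
    return cores * (vector_bytes / element_bytes);
}
```

<p>For the 8-core AVX processor from the text (32-byte registers), parallel_elements(8, 32, 4) gives 64 floats and parallel_elements(8, 32, 8) gives 32 doubles. With the 16-byte registers of SSE, the same 8 cores handle 32 floats at a time.</p>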
<p><strong>A question I see often is what to do if you have more devices. There is no <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-package for all devices, so you then need to install drivers for each device.</strong></p>
<p>Read on to find out exactly which processors are supported.</p>
<h1><span id="more-2938"></span>Finding useful hardware</h1>
<p>In short: hardware from 2010 and 2011 has <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-support, and hardware from late 2008 to early 2010 has reasonable support. Older hardware has support (if you bought the best of the best), but might not give a really good speed-up. To be sure (as there are always exceptions), you need to know which processor and graphics card are installed.</p>
<h2>Windows</h2>
<p>As <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> works on CPUs, GPUs and accelerators, the first step is to get a list of all devices. Under the &#8220;System and Security&#8221; section of the &#8220;Control Panel&#8221; you&#8217;ll find the device manager. This lists all devices. We are interested in &#8220;display adaptors&#8221; and &#8220;processors&#8221;. Check for the brands AMD, ATI, NVidia and Intel.</p>
<p>Alternatively you can use <a href="http://www.cpuid.com/softwares/cpu-z.html" target="_blank">CPU-Z</a> (CPUs and GPUs), <a href="http://www.techpowerup.com/gpuz/" target="_blank">GPU-Z</a> (GPUs only) or <a href="http://www.geeks3d.com/20111018/gpu-caps-viewer-1-14-4-download-opengl-opencl-utility/" target="_blank">GPU Caps Viewer</a> (GPUs only, but also a lot on <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-support) to find the information. For CPUs, CPU-Z is very useful, as it shows you immediately whether your CPU supports SSE4.2 and AVX for exquisite OpenCL-performance.</p>
<h2>Linux</h2>
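<p>The easiest way is to install &#8216;sysinfo&#8217;, which tells you everything about your computer. Alternatively you can check which vector extensions your CPU supports with the command below. Note that Linux lists SSE3 under the flag name &#8216;pni&#8217;, so a plain grep for sse3 only matches SSSE3:</p>
<blockquote><p>grep -o -w -E 'pni|ssse3|sse4_1|sse4_2|avx' /proc/cpuinfo | sort -u</p></blockquote>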
<p>A list of GPUs with:</p>
<blockquote><p>lspci | grep VGA</p></blockquote>
<h2>Apple</h2>
<p>You get the <acronym title='The processor on the videocard'>GPU</acronym>-drivers provided with the OS. I found an overview of supported GPUs <a href="http://support.apple.com/kb/HT4728" target="_blank">here</a> and <a href="http://support.apple.com/kb/HT4664" target="_blank">here</a>. Support for CPUs is also already there. If you want to develop software on a Mac, you will find examples <a href="http://developer.apple.com/library/mac/navigation/#section=Frameworks&amp;topic=OpenCL" target="_blank">here</a> and will find Xcode completely ready.</p>
<p>The only down-side is that the drivers cannot be manually updated.</p>
<h1>Intel CPUs</h1>
<p>You can choose to use Intel&#8217;s drivers, or AMD&#8217;s drivers.</p>
<h2>Intel <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> drivers</h2>
<p>Intel&#8217;s drivers require support for SSE4.1. Below is an overview of all CPUs released so far that support <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> via the Intel driver. If you have a Core processor that is not listed here, scroll down to AMD <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> drivers.</p>
<p><strong>AVX (Q1 2011)</strong></p>
<ul>
<li>2nd Generation Core i-series (SandyBridge).</li>
</ul>
<p><strong>SSE 4.2 (Q1 2009 &#8211; Q3 2010)</strong></p>
<ul>
<li>Core i7 Processors</li>
<li>Core i5 Processors</li>
<li>Core i3 Processors</li>
<li>Xeon 55XX series</li>
<li>Xeon 56XX series</li>
<li>Xeon 75XX series</li>
</ul>
<p><strong>SSE4.1 (Q2-Q3 2008)</strong></p>
<ul>
<li>Xeon 74XX series</li>
<li>Quad-Core Xeon 54XX, 33XX series</li>
<li>Dual-Core Xeon 52XX, 31XX series</li>
<li>Core 2 Extreme 9XXX series</li>
<li>Core 2 Quad 9XXX series</li>
<li>Core 2 Duo 8XXX series</li>
<li>Core 2 Duo E7200</li>
</ul>
<div>The Intel SDK can be downloaded from <a href="http://software.intel.com/en-us/articles/download-intel-opencl-sdk/">http://software.intel.com/en-us/articles/download-intel-opencl-sdk/</a>.</div>
<h2>AMD <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> drivers</h2>
<p>AMD&#8217;s drivers support the devices above, as well as processors with SSE3 or SSSE3. They claim support for SSE2.x too &#8211; but since not much extra speed can be squeezed out of these, I only list CPUs with at least SSE3 support.</p>
<p><strong>SSSE3</strong></p>
<ul>
<li>Quad-Core Xeon 73XX, 53XX, 32XX series</li>
<li>Dual-Core Xeon 72XX, 53XX, 51XX, 30XX series</li>
<li>Core 2 Extreme 7XXX, 6XXX series</li>
<li>Core 2 Quad 6XXX series</li>
<li>Core 2 Duo 7XXX (except E7200), 6XXX, 5XXX, 4XXX series</li>
<li>Core 2 Solo 2XXX series</li>
<li>Pentium dual-core processor E2XXX, T23XX series</li>
</ul>
<p><strong>SSE3</strong></p>
<ul>
<li>Atom processors (SSE3_ATOM)</li>
<li>Dual-Core Xeon 70XX, 71XX, 50XX Series</li>
<li>Dual-Core Xeon processor (ULV and LV) 1.66, 2.0, 2.16</li>
<li>Dual-Core Xeon 2.8</li>
<li>Xeon processors with SSE3 instruction set support</li>
<li>Core Duo</li>
<li>Core Solo</li>
<li>Pentium dual-core processor T21XX, T20XX series</li>
<li>Pentium processor Extreme Edition</li>
<li>Pentium D</li>
<li>Pentium 4 processors with SSE3 instruction set support</li>
</ul>
<p>It is interesting to know that AMD and Intel have different optimisation techniques and therefore sometimes AMD&#8217;s and sometimes Intel&#8217;s drivers give faster code.</p>
<p>The AMD SDK can be downloaded from <a href="http://developer.amd.com/amdapp" target="_blank">http://developer.amd.com/amdapp</a></p>
<h1>AMD CPU/APU</h1>
<p>Intel officially only supports their own processors, so you cannot use Intel&#8217;s drivers.</p>
<p>To have good support for <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>, it is a good bet to have at least SSE4a-support &#8211; which is on their <a href="http://en.wikipedia.org/wiki/AMD_K10" target="_blank">K10</a>-architecture:</p>
<ul>
<li>Phenom</li>
<li>Phenom II</li>
<li>Athlon II</li>
</ul>
<div>Bulldozer CPUs/APUs have support for AVX and thus even better performance with <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>.</div>
<div>With SSE3 you might get some speed-up. These are:</div>
<div>
<ul>
<li>Athlon 64 (since Venice Stepping E3 and San Diego Stepping E4)</li>
<li>Athlon 64 X2</li>
<li>Athlon 64 FX (since San Diego Stepping E4)</li>
<li>Opteron (since Stepping E4)</li>
<li>Sempron (since Palermo. Stepping E3)</li>
<li>Turion 64</li>
<li>Turion 64 X2</li>
</ul>
<div>The AMD SDK can be downloaded from <a href="http://developer.amd.com/amdapp" target="_blank">http://developer.amd.com/amdapp</a></div>
<h1>AMD/ATI GPUs</h1>
<p>Over the years the architecture of AMD/ATI GPUs changed to get better support for <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>. Below is a list of GPUs that have good support.</p>
<ul>
<li>Recent APUs with embedded Radeon HD.</li>
<li>Radeon HD 6450 and up.</li>
<li>Radeon HD 5450 and up.</li>
<li>FirePro V8800, V7800, V5800, V4800 and V3800.</li>
<li>Mobility Radeon HD 5430 and up.</li>
<li>Mobility FirePro M7820 and M5800.</li>
</ul>
<div>Older GPUs do have (beta) support, but won&#8217;t give really good speed-ups. It might be of interest when you have CrossFire. Here is the list:</div>
<div>
<ul>
<li>Radeon HD 4350 and up.</li>
<li>FirePro V8750, V8700, V7750, V5700 and V3750.</li>
<li>FireStream 9270 and 9250.</li>
<li>Mobility Radeon HD 4300 Series and up.</li>
<li>Mobility FirePro M7740.</li>
</ul>
</div>
<p>Recent drivers have the <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> runtime. The SDK can be downloaded from <a href="http://developer.amd.com/amdapp" target="_blank">http://developer.amd.com/amdapp</a></p>
<h1>NVidia GPUs</h1>
<p>NVidia supports <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> well on devices with Compute Capability 1.3 and up, which are the following:</p>
<ul>
<li>GeForce GTX 260 and up.</li>
<li>GeForce GTX 400 series.</li>
<li>GeForce GTX 500 series.</li>
<li>Tesla C/S 1060 and up.</li>
<li>Quadro FX 4800 and 5800</li>
</ul>
<div>Older GPUs (with compute capability 1.0 to 1.2) won&#8217;t get really good speed-ups. It might be of interest when you have SLI. The GPUs not listed above from the following series have minimal support for <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>, but the list could have some mistakes due to generalisations:</div>
<ul>
<li>GeForce 100 series.</li>
<li>GeForce 200 series.</li>
<li>GeForce 8000 series.</li>
<li>GeForce 9000 series.</li>
<li>Tesla C/D/S 870.</li>
<li>Quadro FX. Check <a href="http://en.wikipedia.org/wiki/Nvidia_Quadro#PCI_Express" target="_blank">this table</a> if it mentions <acronym title='A programming-language like OpenCL only for NVIDIA&#039;s GPUs'>CUDA</acronym>-cores.</li>
</ul>
<div>Recent drivers have the <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> runtime. The SDK can be downloaded from <a href="http://developer.nvidia.com/cuda-downloads">http://developer.nvidia.com/cuda-downloads</a>. It might seem a bit strange you need the <acronym title='A programming-language like OpenCL only for NVIDIA&#039;s GPUs'>CUDA</acronym>-SDK to develop <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>, but they chose to bundle the two.</div>
<h1>Want more <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>?</h1>
<p>Best thing to do is to stay in touch with us via <a href="https://twitter.com/#!/StreamComputing" target="_blank">Twitter</a> and via the <a href="http://streamcomputing.us2.list-manage1.com/subscribe?u=32effe2a8410acadd80ebf113&amp;id=fea430def2">newsletter</a>.</p>
<p>Take some time to look around on this site, as there is a lot of information on <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> available on for example the <a href="http://www.streamcomputing.eu/blog/">blog</a> and <a href="http://www.streamcomputing.eu/education/self-study/">self-study</a> pages.</p>
<p>StreamComputing provides <a href="http://www.streamcomputing.eu/consultancy/">consultancy</a> and <a href="http://www.streamcomputing.eu/education/">training</a> to support your business-needs. This is perfect if you want to get up-and-running quickly.</p>
<p>If you have any questions, just fill in the <a href="http://www.streamcomputing.eu/about-us/contact/">contact</a>-form or give us a call.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2011-12-29/opencl-hardware-support/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Basic concepts: Function Qualifiers</title>
		<link>http://www.streamcomputing.eu/blog/2011-12-27/basic-concepts-function-qualifiers/</link>
		<comments>http://www.streamcomputing.eu/blog/2011-12-27/basic-concepts-function-qualifiers/#comments</comments>
		<pubDate>Tue, 27 Dec 2011 20:55:22 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[OpenCL]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=2914</guid>
		<description><![CDATA[You have run-time and compile-time of the C-code and of the OpenCL-code. It is very important to make clear when you talk about compile-time of the kernel as this can be confusing. Compile-time of the kernel is at run-time of the software after the compute-devices have been queried. The OpenCL-compiler can make better optimised code [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_2914" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p><img class="alignright size-medium wp-image-2925" title="can-stock-photo_csp2301148" src="http://www.streamcomputing.eu/wp-content/uploads/2011/12/can-stock-photo_csp2301148-289x300.jpg" alt="" width="289" height="300" />You have run-time and compile-time of the C-code and of the <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-code. It is very important to be clear about which one you mean when you talk about compile-time of the kernel, as this can be confusing. The kernel is compiled at run-time of the software, after the compute-devices have been queried. The <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-compiler can produce better optimised code when you give it as much information as possible. One way of doing that is using Function Qualifiers. A function qualifier is written as a kernel-attribute:</p>
<blockquote><p>__kernel <strong>__attribute__((qualifier(qualification))) </strong> void foo ( &#8230;. ) { &#8230;. }</p></blockquote>
<p>There are three qualifiers described in <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> 1.x. Let&#8217;s walk through them one by one. You can also read about them <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/functionQualifiers.html">here</a> in the official documentation, with more examples.</p>
<h1><span id="more-2914"></span>vec_type_hint</h1>
<p>It tells the compiler what type of vectors is used in the computation, and their width. If you say for example that float4 will be used, the vectoriser tries to optimise the code for that. Intel in particular is a big fan of this hint; strangely, AMD (with VLIW in its Radeons) does not push this hint much. Example, with clear naming of a kernel that works on int8:</p>
<blockquote><p>__kernel <strong>__attribute__((vec_type_hint(int8))) </strong> void <em>foo_int8</em> ( &#8230;. ) { &#8230;. }</p></blockquote>
<p>The default assumption is that the computations will be done using int scalars(!). Intel says in its documentation that it uses vectors of four as the default to optimise for. In their SDK 1.1 they suggest hinting for float4 or int4, but in 1.5 they say this hint turns off the auto-vectorizer if the width is not 4 * 4 bytes (float4 or int4). AMD tells you to use vectors that are 4 wide, but not to use this hint &#8211; it just tries to optimise for vectors that are 4 wide (like Intel). NVidia is silent on this one. Nevertheless it is always wise to use this hint, as you never know what kind of optimisations the compiler is capable of.</p>
<p>In my experience this hint does not always work to auto-vectorise scalar-kernels. Packing the bytes together always works better. If you have a good example of a not-too-simple kernel with scalars that vectorised well using this hint, let me know in the comments.</p>
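<p>What packing the data together means can be sketched in plain C. This is an emulation, not OpenCL C: float4_t is a hypothetical stand-in for the built-in float4 type, and both function names are invented for this example. In a real kernel the loop disappears and each work-item handles one element or one float4.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's built-in float4 type. */
typedef struct { float x, y, z, w; } float4_t;

/* Scalar version: one multiply-add per step. */
void saxpy_scalar(float a, const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Packed version: four independent multiply-adds per step, which
   maps directly onto one 4-wide vector instruction. */
void saxpy_packed(float a, const float4_t *x, float4_t *y, size_t n4)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n4; ++i) {
        y[i].x = a * x[i].x + y[i].x;
        y[i].y = a * x[i].y + y[i].y;
        y[i].z = a * x[i].z + y[i].z;
        y[i].w = a * x[i].w + y[i].w;
    }
}
```

<p>Both versions compute the same result. The packed one simply shows the compiler that every step contains 4 operations that do not depend on each other, instead of hoping that the auto-vectoriser discovers this in the scalar version.</p>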
<h1>work_group_size_hint and reqd_work_group_size</h1>
<p>The compiler does not see how the kernel will be called, but it could optimise the scheduling if it knew in advance. work_group_size_hint tells the compiler the <em>possible</em> work-group size (the local size of the NDRange); reqd_work_group_size is similar, but strict: the kernel must be launched with exactly that size. It is not documented how the compilers handle these differently, or whether they use them at all. On GPUs this hint can give a good speed-up. Just try with and without to see how it works for your kernel. Example of a kernel that needs a work-group of 64x1x1:</p>
<blockquote><p>__kernel <strong>__attribute__((reqd_work_group_size(64, 1, 1))) </strong> void foo ( &#8230;. ) { &#8230;. }</p></blockquote>
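<p>The strictness of reqd_work_group_size shows up on the host side. In OpenCL 1.x, clEnqueueNDRangeKernel returns CL_INVALID_WORK_GROUP_SIZE when the local size does not match the required size, or when the global size is not a multiple of the local size. The sketch below mimics those two checks in plain C for a 1D launch; the enum and function names are hypothetical and not part of the OpenCL API.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. A real OpenCL 1.x runtime
   reports both failures as CL_INVALID_WORK_GROUP_SIZE. */
enum launch_check { LAUNCH_OK, LAUNCH_NOT_DIVISIBLE, LAUNCH_REQD_MISMATCH };

/* Checks a 1D launch. Pass reqd == 0 for a kernel that has no
   reqd_work_group_size attribute. */
enum launch_check check_launch_1d(size_t global, size_t local, size_t reqd)
{
    assert(local > 0);
    if (reqd != 0 && local != reqd)
        return LAUNCH_REQD_MISMATCH;
    if (global % local != 0)
        return LAUNCH_NOT_DIVISIBLE;
    return LAUNCH_OK;
}

/* Rounds the number of work-items up to a multiple of the work-group
   size. The kernel then has to skip the padding items itself. */
size_t padded_global_size(size_t items, size_t local)
{
    assert(local > 0);
    return ((items + local - 1) / local) * local;
}
```

<p>For the 64x1x1 kernel above, a launch of 1024 work-items in groups of 64 passes, groups of 128 are rejected, and 1000 work-items first have to be padded to padded_global_size(1000, 64), which is 1024.</p>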
<h1>Multiple attributes</h1>
<p>You can use several attributes when needed. While you might think you can use multiple attributes separated by spaces, <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/attribute.html" target="_blank">officially</a> they are written as a comma-separated list between the double parentheses. Example of a kernel that works on float4 and assumes a 1D-workgroup of 32:</p>
<blockquote><p>__kernel <strong>__attribute__((vec_type_hint(float4), work_group_size_hint(32, 1, 1))) </strong> void foo ( &#8230;. ) { &#8230;. }</p></blockquote>
<p>It makes no sense to use both work_group_size_hint and reqd_work_group_size, though.</p>
<p>For functions which are not kernels, these hints cannot (officially) be used. I could not find what the official documentation says on this, but I assume the compiler uses the same hints for the kernel and all the called functions.</p>
<h1>Last words</h1>
<p><acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> is low-level, and to give it space to evolve, less explicit coding is needed. When writing code that can be optimised by the compiler, I always tell my trainees to keep a copy of the unoptimised code, because every year the compilers get better. Using vec_type_hint also keeps you focused on packing the data together and thinking like the compiler: can 4 operations be done at the same time in every step? And if so, can I show the compiler how?</p>
<p>I still wonder why work_group_size_hint and reqd_work_group_size are actually needed. I hope this will be resolved in 2.0, so that you simply compile a kernel for a given NDRange and then run it. Want more explanation? Ask your questions in the comments.</p>
<p>And remember: always keep a copy of your unoptimised scalar kernel, as you cannot tell what future compilers are capable of.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2011-12-27/basic-concepts-function-qualifiers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Black-Scholes mixing on SandyBridge, Radeon and Geforce</title>
		<link>http://www.streamcomputing.eu/blog/2011-11-29/black-scholes-mixing-on-sandybridge-radeon-and-geforce/</link>
		<comments>http://www.streamcomputing.eu/blog/2011-11-29/black-scholes-mixing-on-sandybridge-radeon-and-geforce/#comments</comments>
		<pubDate>Tue, 29 Nov 2011 16:59:43 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=2542</guid>
		<description><![CDATA[Intel, AMD and NVidia have all written implementations of the Black-Scholes algorithm for their devices. Intel has described a kernel in their OpenCL optimisation-document (page 28 and further) with 3 random factors as input: S, K and T, and two configuration-constants R and V. NVidia is easy to compare to Intel&#8217;s, while AMD chose to write down the [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_2542" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p><img class="alignright size-full wp-image-2547" title="d1d2" src="http://www.streamcomputing.eu/wp-content/uploads/2011/10/d1d2.jpg" alt="" width="250" height="162" />Intel, AMD and NVidia have all written implementations of the Black-Scholes algorithm for their devices. Intel has described a kernel in their <a href="http://software.intel.com/file/37171">OpenCL optimisation-document</a> (page 28 and further) with 3 random factors as input: <em>S</em>, <em>K</em> and <em>T</em>, and two configuration-constants R and V. NVidia&#8217;s kernel is easy to compare to Intel&#8217;s, while AMD chose to write down the algorithm quite differently.<br />
So we have three different but comparable kernels in total. What will happen if we run these, all optimised for specific types of hardware, on the following devices?</p>
<ul>
<li>Intel(R) Core(TM) i7-2600 CPU @3.4GHz, Mem @1333MHz</li>
<li>GeForce GTX 560 @810MHz, Mem @1000MHz</li>
<li>Radeon HD 6870 @930MHz, Mem @1030MHz</li>
</ul>
<div>Three different architectures and three different drivers. To complete the comparison, I also check whether it makes a difference to use Intel&#8217;s or AMD&#8217;s driver for the CPU. <span id="more-2542"></span>The following drivers were used:</div>
<div>
<ul>
<li>Intel <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> SDK 1.5</li>
<li>AMD APP 2.5, driver version 8.902-111012a-127183C-ATI, Catalyst 11.10 (and Catalyst 11.12)</li>
<li>NVidia <acronym title='A programming-language like OpenCL only for NVIDIA&#039;s GPUs'>CUDA</acronym> 4.0 SDK, 280.13 drivers</li>
</ul>
<div>The times are an average of 5 tests, except where there was too much fluctuation; there a range is given. It is best that you look at the code of the SDKs too, so you understand what is run.</div>
<div><strong>These results will be used for future articles, so here you only get the results of the first test and no extensive discussion of them.</strong></div>
</div>
<h1>Intel&#8217;s example</h1>
<p>Since Intel provides the smallest kernel, it is shown here to give you an idea of how Black-Scholes works. As you can see, it parallelises very nicely: each work-item only references the data at its own global id <em>tid</em>.</p>
<blockquote>
<pre>__kernel __attribute__((vec_type_hint(float4)))
void <strong>BlackScholes</strong>(
	__global float4 *callResult,
	__global float4 *putResult,
	const __global float4* S,
	const __global float4* K,
	const __global float4* T,
	float r,
	float v)
{
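	size_t tid = get_global_id(0);	// one float4 per work-item, so four options at once
	float4 d1 = 0.0f;
	float4 d2 = 0.0f;
	//int4 calls = (int4)'c';
	// d1 and d2 as in the Black-Scholes formula
	d1 = (log(S[tid] / K[tid]) + (r + v * v / 2) * T[tid]) / (v * sqrt(T[tid]));
	d2 = d1 - v * sqrt(T[tid]);
	// CND4 is the float4 version of the cumulative normal distribution (not shown here)
	callResult[tid] = S[tid] * CND4(d1) - K[tid] * exp(T[tid] * -r) * CND4(d2);
	putResult[tid] = K[tid] * exp(T[tid] * -r) * CND4(-d2) - S[tid] * CND4(-d1);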
}</pre>
</blockquote>
<p>Black-Scholes is used in the financial world to approximate the best price for an option. Read this <a href="http://www.stock-options-made-easy.com/black-scholes-model.html" target="_blank">post on the matter</a>, which explains in easy language what it does. It is also interesting to know that it was developed in a time when computers were slower than a Casio calculator.</p>
<h1>Benchmarks</h1>
<p>I used the original code of the SDKs and only changed how the platform and device-type are selected.</p>
<h2>NVidia&#8217;s code</h2>
<p>NVidia GPU: 0.00082 s<br />
AMD GPU: 0.00159 s (0.00120 to 0.00202 on Catalyst 11.12)<br />
AMD CPU: 0.07400 s  (0.074520 on Catalyst 11.12)<br />
Intel CPU: 0.01525 s</p>
<p>Very clear results: this code is pretty well optimised for NVIDIA GPUs. Note that on the CPU, Intel&#8217;s driver gives a much faster result than AMD&#8217;s (almost 5 times).</p>
<h2>Intel&#8217;s code</h2>
<p>As there was no host software to go with Intel&#8217;s kernel, I hacked the kernel into NVidia&#8217;s version. I made a float4-version of NVidia&#8217;s CND and altered Intel&#8217;s code a little so it gave back both calls and puts. Unfortunately, according to the program it did not give correct results for AMD and Intel, so that is more work for later. For the code I used, see the listing above. Add &#8220;unsigned int optN&#8221; as a parameter to be complete; the trick with callMask in the original code can be reused in the float4-version of CND.</p>
<p>&#8220;__attribute__((vec_type_hint(float4)))&#8221; did not make the Intel CPU work faster, even though Intel claims so in their article. This compiler-hint also gave an error on NVIDIA and AMD.</p>
<p>NVidia GPU: 0.00019 s<br />
AMD GPU: 0.00035 s (0.00025 to 0.00090 on Catalyst 11.12)<br />
AMD CPU: 0.00529 s (0.00469 on Catalyst 11.12)<br />
Intel CPU: 0.00159 s</p>
<p>Knowing that NVidia gave correct results with this version, and that it ran over 4 times faster there than NVidia&#8217;s own code, we can assume float4 is a good thing.</p>
<h2>AMD&#8217;s code</h2>
<p>As AMD&#8217;s code is totally different (both kernel and host), you cannot compare these results with the other two.</p>
<p>NVidia GPU: 0.514 s<br />
AMD GPU: 0.154 s (0.126s on Catalyst 11.12 &#8211; not much fluctuation)<br />
AMD CPU: 0.105 s (0.102s on Catalyst 11.12)<br />
Intel CPU: 0.161 s</p>
<p>AMD&#8217;s code had the strangest results. It seems the CPU via AMD&#8217;s drivers is about 5 times as fast as NVidia&#8217;s <acronym title='The processor on the videocard'>GPU</acronym> and also faster than AMD&#8217;s own <acronym title='The processor on the videocard'>GPU</acronym>. AMD did succeed in making Radeons get faster results than Geforces on this one.</p>
<h1>A last word</h1>
<p>Next time let&#8217;s try an algorithm that needs more memory-optimisations. But as stated above, I am going to use all the results later.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2011-11-29/black-scholes-mixing-on-sandybridge-radeon-and-geforce/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>OpenCL potential: Watermarked media for content-protection</title>
		<link>http://www.streamcomputing.eu/blog/2011-11-23/opencl-potential-watermarked-media-for-content-protection/</link>
		<comments>http://www.streamcomputing.eu/blog/2011-11-23/opencl-potential-watermarked-media-for-content-protection/#comments</comments>
		<pubDate>Wed, 23 Nov 2011 08:03:01 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=2755</guid>
		<description><![CDATA[HTML5 has the future, now Flash and Silverlight are abandoning the market to make the way free for HTML5-video. There is one big problem and that is that it is hard to protect the content &#8211; before you know the movie is on the free market. DRM is only a temporary solution and many times [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_2755" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p><img class="alignright size-medium wp-image-2758" title="watermark-1" src="http://www.streamcomputing.eu/wp-content/uploads/2011/11/watermark-1-300x220.jpg" alt="" width="300" height="220" />HTML5 is the future, now that Flash and Silverlight are leaving the market and clearing the way for HTML5-video. There is one big problem: it is hard to protect the content &#8211; before you know it, the movie is on the free market. DRM is only a temporary solution and often ends in frustration for users who just want to watch the movie wherever they want.</p>
<p>If you look at e-books, you see a much better way to make sure PDFs don&#8217;t get all over the web: personalising. With images and videos this could be done too. The example here at the right has a very obvious, clearly visible watermark (<a href="http://www.dphotojournal.com/watermarking-adding-copyright-with-photoshop/" target="_blank">source</a>), but there are many methods which are not easy to see &#8211; and which are thus easier to miss for people who want to clean the file. Watermarking therefore has a clear advantage over DRM, where it is obvious what has to be removed. Watermarks also give the buyers freedom of use. The only disadvantage is that the ownership of a personalised video cannot be transferred.</p>
<h1><span id="more-2755"></span>Personalised &amp; Watermarked</h1>
<p>Awareness is the most important thing when it comes to copy-protection: people have to know the media is watermarked. Consumers should not get the feeling they were tricked when some organisation knocks at their door &#8211; they would have acted more responsibly had they known their name was printed on the media. In the previous century we bought the box around the DVD, and the content of the DVD belonged to everybody &#8211; the signature of an actor should be put on the box. This century we don&#8217;t get a box anymore, and all we know is that we get sued when sharing the movie with friends. Looking at the past 20 years, I cannot say that people have a good feeling of ownership. I believe that using the visual opportunities as shown below, or even adding a personalised message like &#8220;for your 2011 birthday, hugs from auntie&#8221;, gives the buyer a better feeling of ownership than the 1-minute-long warnings before the movie starts.</p>
<p><img class="alignnone size-full wp-image-2759" title="licensed-movie" src="http://www.streamcomputing.eu/wp-content/uploads/2011/11/licensed-movie.jpg" alt="" width="641" height="390" /></p>
<p>This <a href="http://www.watermarking.eu/thesis.pdf" target="_blank">2007 Master thesis by Martin Zlomek</a> discusses various ways of watermarking and how they can be attacked. It is quite honest about which methods are good and which are not &#8211; see the chapter &#8220;robustness&#8221;. Recompression, scaling, cropping, noising, denoising, blurring and sharpening can all deform the watermark, but they damage the movie itself too. Taking several copies of a movie and blurring out the differences remains a good attack. This means that watermarking is as bad as DRM once professionals focus on the movie. The difference is that attackers can never tell where else a watermark may be hidden, and thus must gamble that everything has been removed when releasing the movie.</p>
<p>When streaming the movie, the rental company could use the technique to build up a better customer-relation. A personalised advertisement can be added anywhere inside the movie. This could make it possible to show the movie for free, which disrupts the illegal market.</p>
<p>The only disadvantage is that all of the above costs a lot of processing-power, which could raise the price per video too much.</p>
<h1>The fast lane</h1>
<p>Using video-cards and <acronym title='General Purpose GPU, a common name for programming GPUs for non-graphics purposes'>GPGPU</acronym>, a video can be watermarked much faster than with traditional techniques. I found an implementation using GPUs <a href="http://www.site.uottawa.ca/~abrunton/publications/ccece05_wm_gpu_brunton.pdf" target="_blank">here</a> [PDF], but many more possibilities exist. Compression can be prepared quite far (motion-detection, actual compression, etc.) before the additional data is added. Finishing the video-compression with <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> means that streaming of the movie can start quickly.</p>
<p>StreamComputing thinks along with you to get ideas for your problem and gets your product developed. Look around at the site to see what we also do &#8211; ask us any question you have via the <a href="http://www.streamcomputing.eu/about-us/contact/">contact-form</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2011-11-23/opencl-potential-watermarked-media-for-content-protection/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Differences from OpenCL 1.1 to 1.2</title>
		<link>http://www.streamcomputing.eu/blog/2011-11-19/difference-between-opencl-1-2-and-1-1/</link>
		<comments>http://www.streamcomputing.eu/blog/2011-11-19/difference-between-opencl-1-2-and-1-1/#comments</comments>
		<pubDate>Sat, 19 Nov 2011 18:20:48 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=2763</guid>
		<description><![CDATA[This article is of interest for you, if you don’t want to read the whole new specifications [PDF] for OpenCL 1.2. As not everything is clear yet without drivers out there, there will be some edits to this article the coming time &#8211; feedback is very welcome. After many meetings with the many members of the OpenCL task [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_2763" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p><img class="size-full wp-image-2784 alignright" title="opencl_1.2" src="http://www.streamcomputing.eu/wp-content/uploads/2011/11/opencl_1.2.jpg" alt="" width="252" height="241" /></p>
<p>This article is for you if you don’t want to read the whole new <a href="http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf" target="_blank">specifications</a> [PDF] for OpenCL 1.2. As not everything is clear yet without drivers out there, there will be some edits to this article in the coming time &#8211; feedback is very welcome.</p>
<p>Many ideas sprout from the many meetings of the many members of the <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> task force, and every 17 or 18 months a new version of <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> comes out to give these ideas form. Some ideas are totally new, some were already released in another product by a member, and some do not appear because other members voted against them. The last category is very interesting, and hopefully we&#8217;ll soon see a lot of forum-discussion on what is missing now and should be in the next version.</p>
<p>With the <a href="http://www.khronos.org/news/press/khronos-releases-opencl-1.2-specification" target="_blank">release of 1.2</a> it was also announced that (at least) two task forces will be set up. One of them will target integration in high-level programming languages, which tells me that phase 1 of creating the standard is complete and we can expect to go for <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> 2.0. In a follow-up I will discuss these phases, what you as a user, programmer or customer can expect, and how you can act on it.</p>
<p>Another big announcement was that Altera is starting to support <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> for an FPGA-product. In another article I will let you know everything there is to know. For now I concentrate on the actual software-side differences in this version, and what you can do with them. I have added links to the 1.1 and 1.2 man-pages, so you can look things up.</p>
<h1><span id="more-2763"></span>New Kernel-functions</h1>
<p>The most rudimentary debug-tool, <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/printfFunction.html" target="_blank">printf</a>, first needed a vendor-specific extension to be enabled, but now you can flood the standard output without it. If you haven&#8217;t tried printf yet: take a global size of 1000, let the CPU print &#8220;ping\n&#8221; and the kernels &#8220;pong\n&#8221; &#8211; then you know exactly why you need to be careful with this function.</p>
<p>The function <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/popcount.html" target="_blank">popcount</a> returns the number of ones in the binary representation of a variable. So if x is 5 (binary 101), then popcount(x) is 2. A nice explanation of fast popcount on SSE is <a href="http://0x80.pl/articles/sse-popcount.html" target="_blank">here</a>. It counts bits regardless of what they represent, so it also counts the sign-bit.</p>
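<p>For comparison on the host, the same count can be sketched in portable C. This is only a reference for what the built-in returns on a 32-bit value, not how a device implements it; the function name <em>popcount32</em> is made up for this example.</p>

```c
#include <stdint.h>

/* Host-side reference: number of one-bits in a 32-bit value.
 * Each iteration clears the lowest set bit, so the loop runs once per one-bit. */
static unsigned popcount32(uint32_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

<p>Because the value is treated as a plain bit pattern, a negative int cast to uint32_t has its sign-bit counted like any other bit.</p>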
<h1>Replaced functions</h1>
<p>The <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> group prefers to change the name of functions when the parameter-list changes. Below are the &#8220;new&#8221; functions I encountered.</p>
<p><a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/deprecated.html" target="_blank">clEnqueueMarker</a>, <a href="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueBarrier.html" target="_blank">clEnqueueBarrier</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueWaitForEvents.html" target="_blank">clEnqueueWaitForEvents</a> have been merged into <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueMarkerWithWaitList.html" target="_blank">clEnqueueMarkerWithWaitList</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueBarrierWithWaitList.html" target="_blank">clEnqueueBarrierWithWaitList</a>. The barrier and marker functionality is still the same, but if a non-NULL wait-list is given, the command waits for the events in that list instead of for all previously enqueued commands. Before, this was tricky to program. A new option is that both functions can return an event, which fires when everything that was waited for has completed.</p>
<p><a href="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clCreateImage2D.html" target="_blank">clCreateImage2D</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clCreateImage3D.html" target="_blank">clCreateImage3D</a> have been merged into <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateImage.html" target="_blank">clCreateImage</a>. <a href="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clCreateFromGLTexture2D.html" target="_blank">clCreateFromGLTexture2D</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clCreateFromGLTexture3D.html" target="_blank">clCreateFromGLTexture3D</a> have been merged into <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateFromGLTexture.html" target="_blank">clCreateFromGLTexture</a>. As the functions were comparable and the parameter texture_target handles the differences, not much changed. What is new (and a major reason for merging these functions) is the addition of 1D images and support for image-arrays (see below for an explanation of how they work). 1D images were introduced to be compliant with <a href="http://www.opengl.org/sdk/docs/man/xhtml/glTexImage1D.xml" target="_blank">OpenGL 1D images</a>.</p>
<p>Mem-flags CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY and CL_MEM_HOST_NO_ACCESS have been added to describe how the host can access the object on the device, where 1.1 only described how the device could access the object and whether the memory was allocated at the device or the host.</p>
<p><a href="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clUnloadCompiler.html" target="_blank">clUnloadCompiler</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetExtensionFunctionAddress.html" target="_blank">clGetExtensionFunctionAddress</a> got changed to <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clUnloadPlatformCompiler.html" target="_blank">clUnloadPlatformCompiler</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clGetExtensionFunctionAddressForPlatform.html" target="_blank">clGetExtensionFunctionAddressForPlatform</a>, for which you now must specify the platform. This seems logical, as clUnloadCompiler probably removed the compilers of all platforms, and the function-address seems to have been unspecified when more than one platform was loaded. These functions are not used much, though.</p>
<h1>DirectX</h1>
<p>Besides the fancy 1D images, support for DirectX 9 and 11 textures has also been added. DX9 is an interesting choice, but this way such software can be given a longer life by adding <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> to speed it up. I still disagree with it having official KHR-support, as it only works on Microsoft&#8217;s platforms &#8211; under Linux (and all its derivatives like Android) and OSX it is not supported.</p>
<p>The new functions <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateFromDX9MediaSurfaceKHR.html" target="pagedisplay">clCreateFromDX9MediaSurfaceKHR</a>, <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueAcquireDX9MediaSurfacesKHR.html" target="pagedisplay">clEnqueueAcquireDX9MediaSurfacesKHR</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueReleaseDX9MediaSurfacesKHR.html" target="pagedisplay">clEnqueueReleaseDX9MediaSurfacesKHR</a> are comparable to clCreateFromD3D10Texture2DKHR, clEnqueueAcquireD3D10ObjectsKHR and clEnqueueReleaseD3D10ObjectsKHR. <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateFromD3D11BufferKHR.html" target="pagedisplay">clCreateFromD3D11BufferKHR</a>, <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateFromD3D11Texture2DKHR.html" target="pagedisplay">clCreateFromD3D11Texture2DKHR</a>, <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateFromD3D11Texture3DKHR.html" target="pagedisplay">clCreateFromD3D11Texture3DKHR</a>, <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueAcquireD3D11ObjectsKHR.html" target="pagedisplay">clEnqueueAcquireD3D11ObjectsKHR</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueReleaseD3D11ObjectsKHR.html" target="pagedisplay">clEnqueueReleaseD3D11ObjectsKHR</a> are like their D3D10-counterparts.</p>
<p>Sharing like cl_khr_d3d10_sharing for DX9 and 11 is enabled with <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/cl_khr_dx9_media_sharing.html" target="pagedisplay">cl_khr_dx9_media_sharing</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/cl_khr_d3d11_sharing.html" target="_blank">cl_khr_d3d11_sharing</a>. The counterparts of <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clGetDeviceIDsFromD3D10KHR.html" target="pagedisplay">clGetDeviceIDsFromD3D10KHR</a> are <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clGetDeviceIDsFromD3D11KHR.html" target="_blank">clGetDeviceIDsFromD3D11KHR</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clGetDeviceIDsFromDX9MediaAdapterKHR.html" target="pagedisplay">clGetDeviceIDsFromDX9MediaAdapterKHR</a>.</p>
<h1>Multi-user and Multi-device</h1>
<p>As <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-devices get more powerful, it becomes more likely that a device is better off being shared. It is also getting more common to have multiple GPUs in a system, and/or to have various capable devices, now that CPUs get better support.</p>
<p><a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueMigrateMemObjects.html" target="_blank">clEnqueueMigrateMemObjects</a> helps, when using multiple devices, to move memory objects from one device to another; previously this had to be done by copying via the host.</p>
<p><a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubDevices.html" target="_blank">clCreateSubDevices</a> partitions a device into sub-devices. It can be partitioned in equal parts, in specified sizes, or depending on specific hardware. The last option can split the device based on e.g. cache-hierarchy, so that the compute units within a sub-device share a cache at the given level. The functions <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clRetainDevice.html" target="pagedisplay">clRetainDevice</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clReleaseDevice.html" target="pagedisplay">clReleaseDevice</a> have been altered to handle sub-devices. Previously this was available through the extension <a href="http://www.khronos.org/registry/cl/extensions/ext/cl_ext_device_fission.txt">device_fission</a>.</p>
<h1>Initialisation of data</h1>
<p><a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueFillBuffer.html" target="_blank">clEnqueueFillBuffer</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueFillImage.html" target="_blank">clEnqueueFillImage</a> help with initialising data by filling it with a pattern or a colour. Previously this was best done on the host, or with a kernel specially written for it, or it was just ignored. Now our lives have been improved.</p>
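<p>To make clear what kind of host-side code this replaces, here is a minimal sketch in plain C of a pattern fill over a region of a buffer. The function name <em>fill_pattern</em> is hypothetical and no OpenCL call is involved. The check that offset and size are multiples of the pattern size mirrors the requirement of clEnqueueFillBuffer as I read the 1.2 man-page.</p>

```c
#include <stddef.h>
#include <string.h>

/* Host-side pattern fill: repeat `pattern` over the bytes
 * buf[offset] .. buf[offset + size - 1].
 * Returns 0 on success, or -1 when pattern_size is 0 or when offset or
 * size is not a multiple of pattern_size. */
static int fill_pattern(void *buf, const void *pattern, size_t pattern_size,
                        size_t offset, size_t size)
{
    if (pattern_size == 0 || offset % pattern_size != 0 || size % pattern_size != 0)
        return -1;

    unsigned char *dst = (unsigned char *)buf + offset;
    for (size_t i = 0; i < size; i += pattern_size)
        memcpy(dst + i, pattern, pattern_size);
    return 0;
}
```

<p>Filling on the host like this still means the data has to be transferred to the device afterwards, which is exactly the step the new functions save.</p>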
<h1>Building</h1>
<p>It seems that more effort is put in making sure the kernels are better protected. The function <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clBuildProgram.html" target="pagedisplay">clBuildProgram</a> can now be split up into <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCompileProgram.html">clCompileProgram</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clLinkProgram.html" target="_blank">clLinkProgram</a>. If I understand correctly, it is comparable to how <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateProgramWithBinary.html" target="pagedisplay">clCreateProgramWithBinary</a> works, as this takes compiled binaries.</p>
<p><a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clGetProgramInfo.html" target="pagedisplay">clGetProgramInfo</a> and <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clGetProgramBuildInfo.html" target="pagedisplay">clGetProgramBuildInfo</a> have been extended to get information on how the program has been built. The new function <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clGetKernelArgInfo.html" target="pagedisplay">clGetKernelArgInfo</a> returns specified information on the arguments used for building the kernel. This is useful when the building of the kernels is separated from the program, as is the case when binaries are used.</p>
<h1>Image arrays</h1>
<p>An array of 1D or 2D images can be written with write_image{f|i|ui|h}. The image-number is given by the y (1D) or z (2D) value of the coordinate. With read_image{f|i|ui|h} you likewise specify the coordinates plus the image-number: an int2 for arrays of 1D images and an int4 (of which x, y and z are used) for arrays of 2D images.</p>
<p><a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/get_image_array_size.html">get_image_array_size</a> returns the number of images in an array. It is the responsibility of the software to keep things in order, as it does not give an array of image-numbers.</p>
<h1>Other</h1>
<p><a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/preprocessorDirectives.html" target="_blank">Macros</a> CL_VERSION_1_2 and __OPENCL_C_VERSION__ have been added. The first one gives 120 just like CL_VERSION_1_1 gives 110, the last one gives 100, 110 or 120.</p>
<p>Double-precision is now an optional core feature instead of an extension. Meaning, you just need to check if the device supports it, but you don&#8217;t need to pragma it in.</p>
<p>CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE has been deprecated. It gives the smallest alignment in bytes which can be used for any data type. It is quite <a href="http://www.khronos.org/message_boards/viewtopic.php?f=28&amp;t=1988" target="_blank">comparable</a> to CL_DEVICE_MEM_BASE_ADDR_ALIGN. This could help select the best device for an alignment-optimised kernel, but is rarely used.</p>
<p>A new flag CL_MAP_WRITE_INVALIDATE_REGION has been added to <a href="http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueMapBuffer.html" target="_blank">cl_map_flags</a>. This is comparable to CL_MAP_WRITE, but it tells the runtime that the whole region will be overwritten, so there is no guarantee that the mapped pointer contains the current contents of the region.</p>
<p>Storage class specifiers extern and static are now supported. A storage class settles the scope of the variable <a href="http://www.tutorialspoint.com/ansi_c/c_storage_classes.htm" target="_blank">(c definition here)</a>. I need to get deeper into this, as I would think extern is __global, and static is __local &#8211; I&#8217;ll keep you posted to get this more clear.</p>
<h1>Video</h1>
<p>Tim Mattson of Intel explains some of the highlights of <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> 1.2 in this 12 minute video</p>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2011-11-19/difference-between-opencl-1-2-and-1-1/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Basic Concepts: online kernel compiling</title>
		<link>http://www.streamcomputing.eu/blog/2011-10-28/basic-concepts-online-kernel-compiling/</link>
		<comments>http://www.streamcomputing.eu/blog/2011-10-28/basic-concepts-online-kernel-compiling/#comments</comments>
		<pubDate>Fri, 28 Oct 2011 11:27:09 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=2640</guid>
		<description><![CDATA[Typos are a programmer&#8217;s worst nightmare, as they are bad for concentration. The code in your head is not the same as the code on the screen and therefore doesn&#8217;t have much to do with the actual problem solving. Code highlighting in the IDE helps, but better is to use the actual OpenCL compiler without [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_2640" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p><img class="alignright size-medium wp-image-2641" title="check-and-cross-marks-icon-pic" src="http://www.streamcomputing.eu/wp-content/uploads/2011/10/check-and-cross-marks-icon-pic-300x225.jpg" alt="" width="300" height="225" />Typos are a programmer&#8217;s worst nightmare, as they are bad for concentration. The code in your head is not the same as the code on the screen, and hunting for the difference doesn&#8217;t have much to do with the actual problem solving. Code highlighting in the IDE helps, but it is better to use the actual <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> compiler without running your whole software: an Online <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> Compiler. In short, it is just an <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-program with a variable kernel as input, which thus uses the compilers of Intel, AMD, NVidia or whatever you have installed to try to compile the source. I have found two solutions, which both have to be built from source &#8211; so a C-compiler is needed.</p>
<ul>
<li><a href="http://clcc.sourceforge.net/" target="_blank">CLCC</a>. It needs the <em>boost</em>-libraries, <em>cmake</em> and <em>make</em> to build. Works on Windows, OSX and Linux (possibly needs some fixes, see below).</li>
<li><a href="http://gitorious.net/onlineclc" target="_blank">OnlineCLC</a>. Needs <em>waf</em> to build. Seems to be Linux-only.</li>
</ul>
<p><span id="more-2640"></span>For both you can just give the source-file of the kernel, and optionally which driver/device to use. With OnlineCLC it is not very clear what is meant by the &#8220;device&#8221; option; I am trying to find that out and hope to update this article soon. CLCC has an extra option to specify the device-type (<acronym title='The processor on the videocard'>GPU</acronym>, CPU, etc) to use. It is for instance possible to set these commands up as an external command in your IDE, so you can check your kernel when needed. And don&#8217;t forget to write your test-cases with this &#8211; assuming the test-machine has all the devices.</p>
<p><strong>Watch out!</strong> It uses the hardware for compiling too, so it can crash your machine just as hard as your own <acronym title='General Purpose GPU, a common name for programming GPUs for non-graphics purposes'>GPGPU</acronym>-software can.</p>
<h2>Getting CLCC working on Linux</h2>
<p>The downloadable version 0.1 of CLCC does not work out-of-the-box on a recent Linux-distribution with the latest boost-libraries. You can check out the code from SVN, wait for the upcoming version 0.2, or apply the fixes below.</p>
<p>The file src/CMakeLists.txt needs &#8220;cmake_minimum_required(VERSION 2.8)&#8221; at the top and</p>
<blockquote><p>if (UNIX)<br />
target_link_libraries(clcc ${CMAKE_DL_LIBS})<br />
endif()</p></blockquote>
<p>to make linking work. In options.cpp replace &#8220;<em>ifdef __LINUX__</em>&#8221; with &#8220;<em>ifdef __linux__</em>&#8221;. In main.cpp and clpp.cpp replace &#8220;<em>#include &lt;boost/exception.hpp&gt;</em>&#8221; with &#8220;<em>#include &lt;boost/exception/all.hpp&gt;</em>&#8221;. That&#8217;s all.</p>
<p>Now run</p>
<blockquote>
<pre>cmake -G "Unix Makefiles" .
make</pre>
</blockquote>
<p>in the root of the project. Now you are ready to run CLCC. Any feedback can go to george <em>at</em> organicvectory <em>dot</em> com.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2011-10-28/basic-concepts-online-kernel-compiling/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Kernels and the GPL. Are we safe and linking?</title>
		<link>http://www.streamcomputing.eu/blog/2011-10-19/kernels-and-the-gpl-are-we-safe-and-linking/</link>
		<comments>http://www.streamcomputing.eu/blog/2011-10-19/kernels-and-the-gpl-are-we-safe-and-linking/#comments</comments>
		<pubDate>Wed, 19 Oct 2011 17:41:43 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=2580</guid>
		<description><![CDATA[Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters. GPL was always a protection that somebody or some company does not run away with your code and makes the money with it. Or at least force that improvements get back into [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_2580" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p><img class="alignright size-medium wp-image-2581" title="GPL" src="http://www.streamcomputing.eu/wp-content/uploads/2011/10/GPL-300x293.png" alt="" width="300" height="293" /></p>
<p><strong><em>Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters.</em></strong></p>
<p>The GPL has always been a protection against somebody or some company running away with your code and making money with it, or at least a way to force improvements back into the community. For unprepared companies this caused quite some stress when they were forced to give their software away. Now we have host-kernel languages such as <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>, <acronym title='A programming-language like OpenCL only for NVIDIA&#039;s GPUs'>CUDA</acronym>, DirectCompute and RenderScript, which don&#8217;t really <a href="http://en.wikipedia.org/wiki/Linker_(computing)">link</a> a kernel, but load it and launch it. As the GPL is quite <a href="http://jacobian.org/writing/gpl-questions/" target="_blank">complicated</a> when it comes to mixing with commercial code, I want to warn that the GPL might not be prepared for this.</p>
<p>If your software is dual-licensed, you cannot assume the GPL is not chosen when <em>eventually</em> used in commercial software. Read below why not.</p>
<p>I hope we can have a discussion here, so we get to the bottom of this.</p>
<h1><span id="more-2580"></span>The GPL FAQ</h1>
<p>The claim is that if a kernel is an independent piece of software that does not get linked but loaded, then the GPL does not protect the code. The <a href="http://www.gnu.org/licenses/gpl-faq.html#NFUseGPLPlugins" target="_blank">GPL FAQ</a> says this about GPL-plugins for non-free software:</p>
<blockquote><p>If the program uses fork and exec to invoke plug-ins, then the plug-ins are separate programs, so the license for the main program makes no requirements for them. So you can use the GPL for a plug-in, and there are no special requirements.</p>
<p>If the program dynamically links plug-ins, and they make function calls to each other and share data structures, we believe they form a single program, which must be treated as an extension of both the main program and the plug-ins. This means that combination of the GPL-covered plug-in with the non-free main program would violate the GPL. However, you can resolve that legal problem by adding an exception to your plug-in&#8217;s license, giving permission to link it with the non-free main program.</p></blockquote>
<p>From the first paragraph we can conclude that if the kernel is generalised to a standard plugin (for instance all have &#8220;<em>kernel void performFilter()</em>&#8221;) and is accompanied by the GPL license, it is permitted to load that kernel from non-free software. And yes, &#8216;exec&#8217; seems to be a better description of launching a kernel than &#8216;linking&#8217;.</p>
<p>The second paragraph describes what happens when loading&amp;launching a kernel would be the same as linking. I can only think of certain cases, but those are avoidable.</p>
<p>What makes it more difficult is that a lot of existing software has complete functionality, and the <acronym title='General Purpose GPU, a common name for programming GPUs for non-graphics purposes'>GPGPU</acronym>-code is truly a plugin that replaces existing functionality. This means the software does not depend on the faster code.</p>
<h1>How to get legal protection?</h1>
<p>Always be explicit. If you really want to be sure, extend the GPL by stating that loading the kernel-file is seen as dynamic linking and as such the GPL is applied, etc. etc. How exactly this can be done, ask your lawyer. For commercial kernels I would recommend you to check licenses that go with Java code &#8211; as jar-files can be decompiled.</p>
<p><em>Which license do you use yourself?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2011-10-19/kernels-and-the-gpl-are-we-safe-and-linking/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Basic Concepts: OpenCL Convenience Methods for Vector Elements and Type Conversions</title>
		<link>http://www.streamcomputing.eu/blog/2011-10-18/basic-concepts-convenience-methods/</link>
		<comments>http://www.streamcomputing.eu/blog/2011-10-18/basic-concepts-convenience-methods/#comments</comments>
		<pubDate>Tue, 18 Oct 2011 22:00:46 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[OpenCL]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=2555</guid>
		<description><![CDATA[In the series Basic Concepts I try to give an alternative description to what is said everywhere else. This time my eye fell on alternative convenience methods in two cases which were introduced there to be nice to devs with i.e. C/C++ and/or graphics backgrounds. But I see it explained too often from the convenience functions and [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_2555" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p><img class="alignright size-medium wp-image-2561" title="convenience" src="http://www.streamcomputing.eu/wp-content/uploads/2011/10/convenience-225x300.jpg" alt="" width="225" height="300" />In the series <a href="http://www.streamcomputing.eu/blog/series/basic-concepts/">Basic Concepts</a> I try to give an alternative description to what is said everywhere else. This time my eye fell on the convenience methods in two cases, which were introduced to be nice to devs with e.g. C/C++ and/or graphics backgrounds. But too often I see things explained starting from the convenience functions, with the &#8220;preferred&#8221; functions given as a sort of bonus for the cases where the old functions don&#8217;t get it done. Below it is the other way around, and I hope it gives a better understanding. I assume you have read another explanation before, so this is a different view rather than a first introduction.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h1><span id="more-2555"></span>Vector Elements</h1>
<p>Vectors can be seen as <a href="http://en.wikibooks.org/wiki/C++_Programming/Structures" target="_blank">structs</a> on which computations can be applied to all the elements at the same time. Each element can be accessed by .sX with X being 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, depending on the number of elements in the vector: a 16-vector has all 16 and an 8-vector only 0 to 7. The convenience methods are .x, .y, .z, .w, .hi, .lo, .even and .odd. Below are the methods defined in the standard. The abbreviation N.D. stands for not defined. For 3-vectors some functions are not explicitly undefined, but vague in how they should be implemented, so I put &#8220;??&#8221; there.</p>
<table>
<tbody>
<tr>
<td><strong>Convenience<br />
alternative</strong></td>
<td><strong>vector2</strong></td>
<td><strong>vector3</strong></td>
<td><strong>vector4</strong></td>
<td><strong>vector8</strong></td>
<td><strong>vector16</strong></td>
</tr>
<tr>
<td>.x</td>
<td>.s0</td>
<td>.s0</td>
<td>.s0</td>
<td>N.D.</td>
<td>N.D.</td>
</tr>
<tr>
<td>.y</td>
<td>.s1</td>
<td>.s1</td>
<td>.s1</td>
<td>N.D.</td>
<td>N.D.</td>
</tr>
<tr>
<td>.z</td>
<td>N.D.</td>
<td>.s2</td>
<td>.s2</td>
<td>N.D.</td>
<td>N.D.</td>
</tr>
<tr>
<td>.w</td>
<td>N.D.</td>
<td>N.D.</td>
<td>.s3</td>
<td>N.D.</td>
<td>N.D.</td>
</tr>
<tr>
<td>.hi</td>
<td>.s1</td>
<td>??</td>
<td>.s23</td>
<td>.s4567</td>
<td>.s89ABCDEF</td>
</tr>
<tr>
<td>.lo</td>
<td>.s0</td>
<td>??</td>
<td>.s01</td>
<td>.s0123</td>
<td>.s01234567</td>
</tr>
<tr>
<td>.even</td>
<td>.s0</td>
<td>??</td>
<td>.s02</td>
<td>.s0246</td>
<td>.s02468ACE</td>
</tr>
<tr>
<td>.odd</td>
<td>.s1</td>
<td>??</td>
<td>.s13</td>
<td>.s1357</td>
<td>.s13579BDF</td>
</tr>
</tbody>
</table>
<p>To get an idea of what a float4 is, here is an (incomplete) description:</p>
<blockquote><p>struct float4 {<br />
<span style="color: #ffffff;">&#8230;.</span>float s0, s1, s2, s3;<br />
<span style="color: #ffffff;">&#8230;.</span>float x, y, z, w;<br />
<span style="color: #ffffff;">&#8230;.</span>float2 hi, lo, odd, even;<br />
<span style="color: #ffffff;">&#8230;.</span>float2 s01, s02, s03, s10, s12, s13, s20, s21, s23, s30, s31, s32;<br />
<span style="color: #ffffff;">&#8230;.</span>float2 xy, xz, xw, yx, yz, yw, zx, zy, zw, wx, wy, wz;<br />
<span style="color: #ffffff;">&#8230;.</span>float3 s012, s021, s023, s032, s031, s013, &#8230; /* etc */<br />
<span style="color: #ffffff;">&#8230;.</span>float3 xyz, xzy, xzw, &#8230; /* etc */<br />
<span style="color: #ffffff;">&#8230;.</span>float4 s0123, s0132, &#8230; /* etc */<br />
/* etc &#8211; see remark below */</p>
<p>} float4</p></blockquote>
<p>We are missing e.g. float8 s10123422, but that is quite hard to define in a struct (and it is not well defined in the definitions either, which imply no repetition of elements). Just try whether .s0011 and .xxyy work with your drivers.</p>
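<p>To make the table above concrete, here is a small host-side sketch in Python that models the selectors as plain list operations. It is not OpenCL C, and the helper names swizzle and convenience are made up for this illustration; it only shows how .lo, .hi, .even and .odd expand to the numeric selectors for vectors of 2, 4, 8 and 16 elements.</p>

```python
# Host-side model of OpenCL vector component selection (not OpenCL C itself).
# `swizzle` and `convenience` are hypothetical helper names for this sketch.

HEX_DIGITS = "0123456789ABCDEF"


def swizzle(vec, selector):
    """Return the elements picked by a numeric selector such as 's0246'."""
    if not selector.startswith("s"):
        raise ValueError("numeric selectors start with 's'")
    indices = [HEX_DIGITS.index(ch) for ch in selector[1:].upper()]
    return [vec[i] for i in indices]


def convenience(vec, name):
    """Expand .lo, .hi, .even and .odd for vectors of 2, 4, 8 or 16 elements."""
    n = len(vec)
    if n not in (2, 4, 8, 16):
        raise ValueError("this sketch skips 3-vectors, as the table does")
    half = n // 2
    picks = {
        "lo": range(0, half),
        "hi": range(half, n),
        "even": range(0, n, 2),
        "odd": range(1, n, 2),
    }[name]
    return [vec[i] for i in picks]


v16 = list(range(16))
print(convenience(v16, "hi") == swizzle(v16, "s89ABCDEF"))
print(convenience(v16, "even") == swizzle(v16, "s02468ACE"))
print(swizzle([10.0, 20.0, 30.0, 40.0], "s13"))
```

<p>The first two lines compare the convenience names against the table entries for a 16-vector; the last one picks elements 1 and 3 of a 4-vector.</p>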
<h1>Conversions</h1>
<p>Next are conversions between types. The fully specified function is <a href="http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/convert_T.html" target="_blank">convert_destType&lt;_sat&gt;&lt;_roundingMethod&gt;</a>. Most developers are familiar with explicit conversions like:</p>
<blockquote><p>float a = 5.6f;<br />
int b = (int) a; // = 5</p></blockquote>
<p>In <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym> this is the convenience function; it only works with scalars, and with one rounding mode without saturation. An explicit conversion &#8216;<em>(destType)</em>&#8217; can be described as &#8216;<em>convert_destType</em>&#8217; with its default rounding mode: &#8216;<em>_rtz</em>&#8217; for integer destinations (as in the example above, which gives 5) and &#8216;<em>_rte</em>&#8217; for floating-point destinations.</p>
<p>You do use <em>(type)</em> when you want to <strong>widen</strong> a scalar to a vector. For example:</p>
<blockquote><p>float8 f = (float8) 1.0f;</p></blockquote>
<p>If you get used to <em>convert_</em>, then you don&#8217;t have to think about which method to use depending on whether it&#8217;s a scalar or a vector, whether you need the default rounding or another rounding mode, and whether you need saturation or not. As a bonus, here are the <a href="http://en.wikipedia.org/wiki/Floating_point#Rounding_modes" target="_blank">rounding modes</a> with a few examples.</p>
<table>
<tbody>
<tr>
<td><strong>float</strong></td>
<td><strong>convert_int_rte</strong></td>
<td><strong>convert_int_rtz</strong></td>
<td><strong>convert_int_rtp</strong></td>
<td><strong>convert_int_rtn</strong></td>
</tr>
<tr>
<td>+1.6f</td>
<td>2</td>
<td>1</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>-1.6f</td>
<td>-2</td>
<td>-1</td>
<td>-1</td>
<td>-2</td>
</tr>
<tr>
<td>+1.4f</td>
<td>1</td>
<td>1</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>-1.4f</td>
<td>-1</td>
<td>-1</td>
<td>-1</td>
<td>-2</td>
</tr>
</tbody>
</table>
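<p>The table can be reproduced on the host with a few lines of Python. This is only a sketch of the rounding behaviour, not the OpenCL implementation: convert_int below is a hypothetical stand-in for the convert_int_rte, convert_int_rtz, convert_int_rtp and convert_int_rtn family, and the sat flag models the _sat variants for a 32-bit int.</p>

```python
import math

# Hypothetical helper for this sketch; OpenCL itself spells these as
# convert_int_rte(x), convert_int_sat_rtz(x), and so on.
ROUNDERS = {
    "rte": round,        # round to nearest, ties to even
    "rtz": math.trunc,   # round toward zero
    "rtp": math.ceil,    # round toward positive infinity
    "rtn": math.floor,   # round toward negative infinity
}

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def convert_int(value, mode, sat=False):
    """Model a float to 32-bit int conversion with an explicit rounding mode."""
    result = ROUNDERS[mode](value)
    if sat:
        # Saturation clamps out-of-range results to the limits of int.
        result = max(INT_MIN, min(INT_MAX, result))
    return result


for x in (1.6, -1.6, 1.4, -1.4):
    print(x, [convert_int(x, m) for m in ("rte", "rtz", "rtp", "rtn")])
print(convert_int(3.0e10, "rte", sat=True))
```

<p>Each printed row lists the results in the same column order as the table; the last line shows a value that does not fit in an int being clamped by the saturating variant.</p>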
<h2>Thank you</h2>
<p>Thank you for your time; I hope you liked the alternative view. Check the rest of the series, while it is still small.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2011-10-18/basic-concepts-convenience-methods/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Both NVidia GTX and AMD Radeon on Linux</title>
		<link>http://www.streamcomputing.eu/blog/2011-10-12/both-nvidia-gtx-and-amd-radeon-on-linux/</link>
		<comments>http://www.streamcomputing.eu/blog/2011-10-12/both-nvidia-gtx-and-amd-radeon-on-linux/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 10:28:22 +0000</pubDate>
		<dc:creator>Vincent Hindriksen</dc:creator>
				<category><![CDATA[Home]]></category>
		<category><![CDATA[AMD]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[NVIDIA]]></category>

		<guid isPermaLink="false">http://www.streamcomputing.eu/?p=2519</guid>
		<description><![CDATA[Want to have both your GTX and Radeon working as OpenCL-devices under Linux? The bad news is that people failed trying to get Radeon as a compute device and the GTX as primary. The good news is that the other way around works pretty easy (with some luck). You need to install both drivers and [...]]]></description>
			<content:encoded><![CDATA[<span style="float:right; display:inline;"><a href="http://www.streamcomputing.eu/wp-content/plugins/kalins-pdf-creation-station/kalins_pdf_create.php?singlepost=po_2519" target="_blank" ><img src="http://www.streamcomputing.eu/wp-content/uploads/2011/02/pdf_icon-e1297035459917.png" /></a></span><p>Want to have both your GTX and Radeon working as <acronym title='Where our company is all about. A programming-language for GPUs and other massively parallel processors. OpenCL is a trademark of Apple Inc.'>OpenCL</acronym>-devices under Linux? The bad news is that people have failed to get the Radeon working as a compute device with the GTX as primary. The good news is that the other way around works pretty easily (with some luck). You need to install both drivers and watch out that libglx.so isn&#8217;t overwritten by NVidia&#8217;s driver, as we won&#8217;t use that <acronym title='The processor on the videocard'>GPU</acronym> for graphics &#8211; this is also the reason why it is practically impossible to use the second <acronym title='The processor on the videocard'>GPU</acronym> for OpenGL.<br />
<img class="size-large wp-image-2522 alignnone" title="nvidia_geforce_gtx_580_vs_amd_radeon_hd_6870" src="http://www.streamcomputing.eu/wp-content/uploads/2011/10/nvidia_geforce_gtx_580_vs_amd_radeon_hd_6870-550x318.jpg" alt="" width="550" height="318" /><br />
<span id="more-2519"></span></p>
<p>First install the NVidia driver and don&#8217;t let the installer change xorg.conf or install anything OpenGL-like (if asked) &#8211; so go for the minimum. I eventually used the trick of installing both the deb (or rpm) and using the installer, to have the system well cleaned but have no problems when there is a new kernel. Under Ubuntu I use &#8216;<em>sudo stop gdm</em>&#8217; to stop X, to give room to the installer &#8211; and &#8216;<em>sudo start gdm</em>&#8217; to get it going again. The second step is to install the fglrx-driver of Radeon the normal way. Be warned: you could be unlucky and need to try a few times to get the right order in which certain files are overwritten; I was unlucky and had strange screen behaviour (the size of the screen was not detected correctly, which made certain parts of the screen unreachable), which went away after installing the Radeon drivers again. <em>Edit: cairo-dock was also causing trouble.</em></p>
<p>When rebooting, the fglrx-driver gets loaded, but the nvidia-driver doesn&#8217;t. Below is a script that needs to be run (as root) to get nvidia loaded (check if the directory of lspci is correct when not on Ubuntu). First try if &#8216;<em>modprobe nvidia</em>&#8216; actually works for you (then run clinfo from AMD&#8217;s SDK and see if NVidia pops up). If &#8216;<em>modprobe nvidia</em>&#8216; doesn&#8217;t work, the installation of the driver was not successful.</p>
<blockquote>
<pre>#!/bin/bash

modprobe nvidia

if [ "$?" -eq 0 ]; then

  # Count the number of NVIDIA controllers found.
  N3D=`/usr/bin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
  NVGA=`/usr/bin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i;
  done

  mknod -m 666 /dev/nvidiactl c 195 255
else
  exit 1
fi</pre>
</blockquote>
<p>If you want this under <a href="http://www.ghacks.net/2009/04/04/get-to-know-linux-the-etcinitd-directory/" target="_blank">init.d</a>, then check out <a href="http://forums.nvidia.com/index.php?showtopic=52629" target="_blank">this post</a> (check if the directory of lspci is correct when not on Red Hat). The above script was taken from this <a href="http://forums.nvidia.com/index.php?showtopic=49769&amp;st=0&amp;p=272085&amp;#entry272085" target="_blank">blog post</a> by &#8216;mfatica&#8217; of NVidia. It was found to work with NVidia drivers 280.13 and AMD drivers 8.892.</p>
<p>All was tested with <a href="http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx" target="_blank">AMD APP 2.5</a> and <a href="http://developer.nvidia.com/cuda-toolkit-40" target="_blank">CUDA 4.0</a>. Let me know if it worked for you, or if you needed to do something extra/different.</p>
<p>A final note. In case you need a totally headless configuration, check <a href="http://developer.amd.com/sdks/AMDAPPSDK/assets/App_Note-Running_AMD_APP_Apps_Remotely.pdf">this PDF by AMD</a> on remote computing, as AMD needs X to be running.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.streamcomputing.eu/blog/2011-10-12/both-nvidia-gtx-and-amd-radeon-on-linux/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

