MediaWiki API result

This is the HTML representation of the JSON format. HTML is good for debugging, but is unsuitable for application use.

Specify the format parameter to change the output format. To see the non-HTML representation of the JSON format, set format=json.

See the complete documentation, or the API help for more information.

{
    "compare": {
        "fromid": 1,
        "fromrevid": 1,
        "fromns": 0,
        "fromtitle": "Main Page",
        "toid": 1,
        "torevid": 2,
        "tons": 0,
        "totitle": "Main Page",
        "*": "<tr><td colspan=\"2\" class=\"diff-lineno\" id=\"mw-diff-left-l1\">Line 1:</td>\n<td colspan=\"2\" class=\"diff-lineno\">Line 1:</td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div><del class=\"diffchange diffchange-inline\">&#039;&#039;&#039;MediaWiki has been successfully installed</del>.<del class=\"diffchange diffchange-inline\">&#039;&#039;&#039;</del></div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Revision 1</ins>.<ins class=\"diffchange diffchange-inline\">2 (GPGPU-Sim 3.1.1)</ins></div></td></tr>\n<tr><td class=\"diff-marker\"></td><td class=\"diff-context diff-side-deleted\"><br/></td><td class=\"diff-marker\"></td><td class=\"diff-context diff-side-added\"><br/></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div><del class=\"diffchange diffchange-inline\">Consult the [//meta</del>.<del class=\"diffchange diffchange-inline\">wikimedia</del>.<del class=\"diffchange diffchange-inline\">org/wiki/Help:Contents User&#039;s Guide] for information on using the wiki software</del>.</div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Editors: Tor M</ins>. <ins class=\"diffchange diffchange-inline\">Aamodt, Wilson W</ins>.<ins class=\"diffchange diffchange-inline\">L</ins>. <ins class=\"diffchange diffchange-inline\">Fung</ins></div></td></tr>\n<tr><td class=\"diff-marker\"></td><td class=\"diff-context diff-side-deleted\"><br/></td><td class=\"diff-marker\"></td><td class=\"diff-context diff-side-added\"><br/></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div>== <del class=\"diffchange diffchange-inline\">Getting started </del>==</div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{{TOClimit|4}}</ins></div></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div>* [//www.<del class=\"diffchange diffchange-inline\">mediawiki</del>.org/<del class=\"diffchange diffchange-inline\">wiki</del>/<del class=\"diffchange diffchange-inline\">Manual</del>:<del class=\"diffchange diffchange-inline\">Configuration_settings </del>Configuration <del class=\"diffchange diffchange-inline\">settings </del>list]</div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div>* [//www.<del class=\"diffchange diffchange-inline\">mediawiki</del>.org/<del class=\"diffchange diffchange-inline\">wiki</del>/<del class=\"diffchange diffchange-inline\">Manual</del>:<del class=\"diffchange diffchange-inline\">FAQ MediaWiki FAQ</del>]</div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">__TOC__</ins></div></td></tr>\n<tr><td class=\"diff-marker\" data-marker=\"\u2212\"></td><td class=\"diff-deletedline diff-side-deleted\"><div>* [<del class=\"diffchange diffchange-inline\">https</del>://<del class=\"diffchange diffchange-inline\">lists</del>.<del class=\"diffchange diffchange-inline\">wikimedia</del>.org/<del class=\"diffchange diffchange-inline\">mailman</del>/<del class=\"diffchange diffchange-inline\">listinfo</del>/<del class=\"diffchange diffchange-inline\">mediawiki</del>-<del class=\"diffchange diffchange-inline\">announce MediaWiki </del>release <del class=\"diffchange diffchange-inline\">mailing </del>list]</div></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">= Introduction =</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This manual provides documentation for GPGPU-Sim 3.x, a cycle-level GPU performance simulator that focuses on &quot;GPU computing&quot; (general purpose computation on GPUs).\u00a0 GPGPU-Sim 3.x is the latest version of GPGPU-Sim.\u00a0 It includes many enhancements to GPGPU-Sim 2.x.\u00a0 If you are trying to install GPGPU-Sim, please refer to the README file in the GPGPU-Sim distribution you are using. The README for most recent version of GPGPU-Sim can also be browsed online [https://dev.ece.ubc.ca/projects/gpgpu-sim/browser/v3.x/README here].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This manual contains three major parts: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A [[#Microarchitecture Model |Microarchitecture Model]] section that describes the microarchitecture that GPGPU-Sim 3.x models.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A [[#Using GPGPU-Sim |Usage]] section that provides documentations on how to use GPGPU-Sim.\u00a0 This section provides information on the following:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Different modes of simulation</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Configuration options (how to change high level parameters of the microarchitecture simulated)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Simulation output (e.g., microarchitecture statistics)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Visualizing microarchitecture behavior (useful for performance debugging)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Strategies for debugging GPGPU-Sim when performance simulations crashes or deadlocks due to errors in the timing model.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A [[#Software Design of GPGPU-Sim |Software Design]] section that explains the internal software design of GPGPU-Sim 3.x.\u00a0 The goal of that section is to provide a starting point for the users to extend GPGPU-Sim for their own research.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">If you use GPGPU-Sim in your work please cite our [http://ieeexplore.ieee.org:80/xpl/articleDetails.jsp?reload=true&amp;arnumber=4919648 ISPASS 2009 paper]:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> Ali Bakhoda, George Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> Analyzing CUDA Workloads Using a Detailed GPU Simulator, in IEEE International</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> Symposium on Performance Analysis of Systems and Software (ISPASS), Boston, MA,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> April 19-21, 2009.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To help reviewers you should indicate the version of GPGPU-Sim you used (e.g., &quot;GPGPU-Sim version 3.1.0&quot;, &quot;GPGPU-Sim version 3.0.2&quot;, &quot;GPGPU-Sim version 2.1.2b&quot;, etc...).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The GPGPU-Sim 3.x source is available under a BSD style copyright from the UBC ECE department&#039;s [https://dev.ece.ubc.ca/projects/gpgpu-sim/ GIT server].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim version 3.1.0 running PTXPlus has a correlation of 98.3% and 97.3% versus GT200 and Fermi hardware on the [https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Main_Page RODINIA benchmark suite] with scaled down problem sizes (see &lt;xr id=&quot;fig:correl_gt200&quot;/&gt; and &lt;xr id=&quot;fig:correl_fermi&quot;/&gt;).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Please submit bug reports through the [http://www.gpgpu-sim.org/bugs/ GPGPU-Sim Bug Tracking System].\u00a0 If you have further questions after reading the manual and searching the bugs database, you may want to sign up to the [http://groups.google.com/group/gpgpu-sim/ GPGPU-Sim Google Group].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Besides this manual, you may also want to consult the [http://www.gpgpu-sim.org/isca2012-tutorial/GPGPU-Sim-Tutorial-ISCA2012.pdf slides] from our\u00a0 [http://www.gpgpu-sim.org/isca2012-tutorial/ tutorial at ISCA 2012]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Contributors ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Contributing Authors to this Manual ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Tor M. Aamodt, Wilson W. L. Fung, Inderpreet Singh, Ahmed El-Shafiey, Jimmy Kwa, Tayler Hetherington, Ayub Gubran, Andrew Boktor, Tim Rogers, Ali Bakhoda, Hadi Jooybar</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Contributors to GPGPU-Sim version 3.x ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Tor M. Aamodt, Wilson W. L. Fung, Jimmy Kwa, Andrew Boktor, Ayub Gubran, Andrew Turner, Tim Rogers, Tayler Hetherington</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">= Microarchitecture Model =</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This section describes the microarchitecture modelled by GPGPU-Sim 3.x. The model is more detailed than the timing model in GPGPU-Sim 2.x.\u00a0 Some of the new details result from examining various NVIDIA patents.\u00a0 This includes the modelling of instruction fetching, scoreboard logic, and register file access.\u00a0 Other improvements in 3.x include a more detailed texture cache model based upon the [http://www-graphics.stanford.edu/papers/texture_prefetch/ prefetching texture cache architecture].\u00a0 The overall microarchitecture is first described, then the individual components including SIMT cores and clusters, interconnection network and memory partitions.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>== <ins class=\"diffchange diffchange-inline\">Overview </ins>==</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim 3.x runs program binaries that are composed of a CPU portion and a GPU portion.\u00a0 However, the microarchitecture (timing) model in GPGPU-Sim 3.x reports the cycles where the GPU is busy--it does not model either CPU timing or PCI Express timing (i.e. memory transfer time between CPU and GPU).\u00a0 Several efforts are under way to provide combined CPU plus GPU simulators where the GPU portion is modeled by GPGPU-Sim.\u00a0 For example, see http://www.fusionsim.ca/.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim 3.x models GPU microarchitectures similar to those in the NVIDIA GeForce 8x, 9x, and Fermi series.\u00a0 The intention of GPGPU-Sim is to provide a substrate for architecture research rather than to exactly model any particular commercial GPU.\u00a0 That said, GPGPU-Sim 3.x has been calibrated against an NVIDIA GT 200 and NVIDIA Fermi GPU architectures.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Accuracy ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">We calculated the correlation of the IPC (Instructions per Clock) versus that of real NVIDIA GPUs.\u00a0 When configured to use the native hardware instruction set (PTXPlus, using the -gpgpu_ptx_convert_to_ptxplus option),GPGPU-Sim 3.1.0 obtains IPC correlation of 98.3% (&lt;xr id=&quot;fig:correl_gt200&quot;/&gt;) and 97.3% (&lt;xr id=&quot;fig:correl_fermi&quot;/&gt;) respectively on a scaled down version of the RODINIA benchmark suite (about 260 kernel launches).\u00a0 All of the benchmarks described in &#039;&#039;[Che et. al. 2009]&#039;&#039; were included in our tests in addition to some other benchmarks from later versions of RODINIA.\u00a0 Each data point in &lt;xr id=&quot;fig:correl_gt200&quot;/&gt; and &lt;xr id=&quot;fig:correl_fermi&quot;/&gt; represents an individual kernel launch.\u00a0 Average absolute errors are 35% and 62% respectively due to some outliers.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">We have included our spreadsheet used to calculate those correlations to demonstrate how the correlation coefficients were computed: [[File:correlation.xls]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=0 cellpadding=&quot;3&quot; cellspacing=&quot;0&quot; class=&quot;wikitable&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &lt;figure id=&quot;fig:correl_gt200&quot;&gt; [[Image:correl_gt200.png|center|frame|&lt;caption&gt;Correlation Versus GT200 Architecture&lt;/caption&gt;]] &lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &lt;figure id=&quot;fig:correl_fermi&quot;&gt;[[Image:correl_fermi.png|center|frame|&lt;caption&gt;Correlation Versus Fermi Architecture&lt;/caption&gt;]]&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Top-Level Organization ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The GPU modeled by GPGPU-Sim is composed of Single Instruction Multiple Thread (SIMT) cores connected via an on-chip connection network to memory partitions that interface to graphics GDDR DRAM.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">An SIMT core models a highly multithreaded pipelined SIMD processor roughly equivalent to what NVIDIA calls an Streaming Multiprocessor (SM) or what AMD calls a Compute Unit (CU).\u00a0 The organization of an SIMT core is described in &lt;xr id=&quot;fig:overall_arch&quot;/&gt; below.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:overall_arch&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:overall-arch.png|center|frame|&lt;caption&gt;Overall GPU Architecture Modeled by GPGPU-Sim&lt;/caption&gt;]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Clock Domains ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim supports four independent clock domains: (1) the SIMT Core Cluster clock domain (2) the interconnection network clock domain (3) the L2 cache clock domain, which applies to all logic in the memory partition unit except DRAM, and (4) the DRAM clock domain.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Clock frequencies can have any arbitrary value (they do not need to be multiples of each other).\u00a0 In other words, we assume the existence of synchronizers between clock domains.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In the GPGPU-Sim 3.x simulation model, units in adjacent clock domains communicate through clock crossing buffers that are filled at the source domain&#039;s clock rate and drained at the destination domain&#039;s clock rate.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== SIMT Core Clusters ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The SIMT Cores are grouped into SIMT Core Clusters.\u00a0 The SIMT Cores in a SIMT Core Cluster share a common port to the interconnection network as shown in &lt;xr id=&quot;fig:simt_clusters&quot;/&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:simt_clusters&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:Cluster-arch.png|center|frame|&lt;caption&gt;SIMT Core Clusters&lt;/caption&gt;]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As illustrated in &lt;xr id=&quot;fig:simt_clusters&quot;/&gt;, each SIMT Core Cluster has a single &#039;&#039;response FIFO&#039;&#039; which holds the packets ejected from the interconnection network. The packets are directed to either a SIMT Core&#039;s instruction cache (if it is a memory response servicing an instruction fetch miss) or its memory pipeline (LDST unit). The packets exit in FIFO fashion. The response FIFO is stalled if a core is unable to accept the packet at the head of the FIFO. For generating memory requests at the LDST unit, each SIMT Core has its own injection port into the interconnection network. The injection port buffer however is shared by all the SIMT Cores in a cluster.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== SIMT Cores ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;xr id=&quot;fig:simt_core&quot;/&gt; below illustrates the SIMT core microarchitecture simulated by GPGPU-Sim 3.x.\u00a0 An SIMT core models a highly multithreaded pipelined SIMD processor roughly equivalent to what NVIDIA calls an Streaming Multiprocessor (SM) [http://developer.nvidia.com/nvidia-gpu-computing-documentation] or what AMD calls a Compute Unit (CU) [http://developer.amd.com/gpu_assets/GPU%20Computing%20-%20Past%20Present%20and%20Future%20with%20ATI%20Stream%20Technology.pdf].\u00a0 A Stream Processor (SP) or a CUDA Core would correspond to a lane within an ALU pipeline in the SIMT core.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:simt_core&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:simt-core.png|center|frame|&lt;caption&gt;Detailed Microarchitecture Model of SIMT Core&lt;/caption&gt;]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This microarchitecture model contains many details not found in earlier versions of GPGPU-Sim.\u00a0 The main differences include:\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>* <ins class=\"diffchange diffchange-inline\">A new front-end that models instruction caches and separates the warp scheduling (issue) stage from the fetch and decode stage </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Scoreboard logic enabling multiple instructions from a single warp to be in the pipeline at once</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A detailed model of an operand collector that schedules operand access to single ported register file banks (used to reduce area and power of the register file)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Flexible model that supports multiple SIMD functional units. This allows memory instructions and ALU instructions to operate in different pipelines.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The following subsections describe the details in &lt;xr id=&quot;fig:simt_core&quot;/&gt; by going through each stage of the pipeline.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Front End ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As described below, the major stages in the front end include instruction cache access and instruction buffering logic, scoreboard and scheduling logic, SIMT stack.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Fetch and Decode ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The instruction buffer (I-Buffer) block in &lt;xr id=&quot;fig:simt_core&quot;/&gt; is used to buffer instructions after they are fetched from the instruction cache.\u00a0 It is statically partitioned so that all warps running on SIMT core have dedicated storage to place instructions.\u00a0 In the current model, each warp has two I-Buffer entries.\u00a0 Each I-Buffer entry has a valid bit, ready bit and a single decoded instruction for this warp. The valid bit of an entry indicates that there is a non-issued decoded instruction within this entry in the I-Buffer.\u00a0 While the ready bit indicates that the decoded instructions of this warp are ready to be issued to the execution pipeline. Conceptually, the ready bit is set in the schedule and issue stage using the scoreboard logic and availability of hardware resources (in the simulator software, rather than actually set a ready bit, a readiness check is performed). The I-Buffer is initially empty with all valid bits and ready bits deactivated. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">A warp is eligible for instruction fetch if it does not have any valid instructions within the I-Buffer. Eligible warps are scheduled to access the instruction cache in round robin order.\u00a0 Once selected, a read request is sent to instruction cache with the address of the next instruction in the currently scheduled warp. By default, two consecutive instructions are fetched. Once a warp is scheduled for an instruction fetch, its valid bit in the I-Buffer is activated until all the fetched instructions of this warp are issued to the execution pipeline. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The instruction cache is a read-only, non-blocking set-associative cache that can model both FIFO and LRU replacement policies with on-miss or on-fill allocation policies. A request to the instruction cache results in either a hit, miss or a reservation fail. The reservation fail results if either the miss status holding register (MSHR) is full or there are no replaceable blocks in the cache set because all block are reserved by prior pending requests (see section [</ins>[<ins class=\"diffchange diffchange-inline\">#Memory_Pipeline_.28LDST_unit.29 | Caches]] for more details). In both cases of hit and miss the round robin fetch scheduler moves to the next warp. In case of hit, the fetched instructions are sent to the decode stage.\u00a0 In the case of a miss a request will be generated by the instruction cache.\u00a0 When the miss response is received the block is filled into the instruction cache and the warp will again need to access the instruction cache.\u00a0 While the miss is pending, the warp does not access the instruction cache.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">A warp finishes execution and is not considered by the fetch scheduler anymore if all its threads have finished execution without any outstanding stores or pending writes to local registers. The thread block is considered done once all warps within it are finished and have no pending operations. Once all thread blocks dispatched at a kernel launch finish, then this kernel is considered done.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">At the decode stage, the recent fetched instructions are decoded and stored in their corresponding entry in the I-Buffer waiting to be issued.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The simulator software design for this stage is described in [[GPGPU-Sim_3.x_Manual#Fetch_and_Decode_Software_Model | Fetch and Decode]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Instruction Issue ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">A second round robin arbiter chooses a warp to issue from the I-Buffer to rest of the pipeline.\u00a0 This round robin arbiter is decoupled from the round robin arbiter used to schedule instruction cache accesses.\u00a0 The issue scheduler can be configured to issue multiple instructions from the same warp per cycle. Each valid instruction (i.e. decoded and not issued) in the currently checked warp is eligible for issuing if (1) its warp is not waiting at a barrier, (2) it has valid instructions in its I-Buffer entries (valid bit is set), (3) the scoreboard check passes (see section [[GPGPU-Sim_3.x_Manual#Scoreboard | Scoreboard]] for more details), and (4) the operand access stage of the instruction pipeline is not stalled.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Memory instructions (Load, store, or memory barriers) are issued to the memory pipeline. For other instructions, it always prefers the SP pipe for operations that can use both SP and SFU pipelines. However, if a control hazard is detected then instructions in the I-Buffer corresponding to this warp are flushed. The warp&#039;s next pc is updated to point to the next instruction (assuming all branches as not-taken). For more information about handling control flow, refer to [[GPGPU-Sim_3.x_Manual#SIMT_Stack | SIMT Stack]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">At the issue stage barrier operations are executed. Also, the SIMT stack is updated (refer to [[GPGPU-Sim_3.x_Manual#SIMT_Stack | SIMT Stack]] for more details) and register dependencies are tracked (refer to [[GPGPU-Sim_3.x_Manual#Scoreboard | Scoreboard]] for more details). Warps wait for barriers (&quot;__syncthreads()&quot;) at the issue stage.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== SIMT Stack ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">A per-warp SIMT stack is used to handle the execution of branch divergence on single-instruction, multiple thread (SIMT) architectures.\u00a0 Since divergence reduces the efficiency of these architectures, different techniques can be adapted to reduce this effect. One of the simplest techniques is the post-dominator stack-based reconvergence mechanism. This technique synchronizes the divergent branches at the earliest guaranteed reconvergence point in order to increase the efficiency of the SIMT architecture. Like previous versions of GPGPU-Sim, GPGPU-Sim 3.x adopts this mechanism. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Entries of the SIMT stack represents a different divergence level. At each divergence branch, a new entry is pushed to the top of the stack.\u00a0 The top-of-stack entry is popped when the warp reaches its reconvergence point. Each entry stores the target PC of the new branch, the immediate post dominator reconvergence PC and the active mask of threads that are diverging to this branch. In our model, the SIMT stack of each warp is updated after each instruction issue of this warp. The target PC, in case of no divergence, is normally updated to the next PC. However, in case of divergence, new entries are pushed to the stack with the new target PC, the active mask that corresponds to threads that diverge to this PC and their immediate reconvergence point PC. Hence, a control hazard is detected if the next PC at top entry of the SIMT stack does not equal to the PC of the instruction currently under check.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">See [http://doi.acm.org/10.1145/1543753.1543756 Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware] for more details.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Note that it is known that NVIDIA and AMD actually modify the contents of their divergence stack using special instructions.\u00a0 These divergence stack instructions are not exposed in PTX but are visible in the actual hardware SASS instruction set (visible using decuda or NVIDIA&#039;s cuobjdump).\u00a0 When the current version of GPGPU-Sim 3.x is configured to execute SASS via PTXPlus (see [[GPGPU-Sim_3.x_Manual#PTX_vs._PTXPlus|PTX vs. PTXPlus]]) it ignores these low level instructions and instead a comparable control flow graph is created to identify immediate post-dominators.\u00a0 We plan to support execution of the low level branch instructions in a future version of GPGPU-Sim 3.x.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Scoreboard ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The Scoreboard algorithm checks for WAW and RAW dependency hazards. As explained [[#Instruction_Issue|above]], the registers written to by a warp are reserved at the issue stage. The scoreboard algorithm indexed by warp IDs.\u00a0 It stores the required register numbers in an entry that corresponds to the warp ID. The reserved registers are released at the write back stage.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As mentioned [[#Instruction_Issue|above]], the decoded instruction of a warp is not scheduled for issue until the scoreboard indicates no WAW or RAW hazards exist. The scoreboard detects WAW and RAW hazards by tracking which registers will be written to by an instruction that has issued but not yet written its results back to the register file.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Register Access and the Operand Collector ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Various NVIDIA patents describe a structure called an &quot;operand collector&quot;.\u00a0 The operand collector is a set of buffers and arbitration logic used to provide the appearance of a multiported register file using multiple banks of single ported RAMs.\u00a0 The overall arrangement saves energy and area which is important to improving throughput.\u00a0 Note that AMD also uses banked register files, but the compiler is responsible for ensuring these are accessed so that no bank conflicts occur. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;xr id=&quot;fig:operand_collector&quot;/&gt; provides an illustration of the detailed way in which GPGPU-Sim 3.x models the operand collector. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:operand_collector&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:operandcollector.png|alt text|center|frame|&lt;caption&gt;Operand collector microarchitecture&lt;/caption&gt;]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">After an instruction is decoded, a hardware unit called a collector unit is allocated to buffer the source operands of the instruction.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The collector units are not used to eliminate name dependencies via register renaming, but rather as a way to space register operand accesses out in time so that no more than one access to a bank occurs in a single cycle. In the organization shown in the figure, each of the four collector units contains three operand entries. Each operand entry has four fields: a valid bit, a register identifier, a ready bit, and operand data. Each operand data field can hold a single 128 byte source operand composed of 32 four byte elements (one four byte value for each scalar thread in a warp). In addition, the collector unit contains an identifier indicating which warp the instruction belongs to. The arbitrator contains a read request queue for each bank to hold access requests until they are granted.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">When an instruction is received from the decode stage and a collector unit is available it is allocated to the instruction and the operand, warp, register identifier and valid bits are set. In addition, source operand read requests are queued in the arbiter. To simplify the design, data being written back by the execution units is always prioritized over read requests. The arbitrator selects a group of up to four non-conflicting accesses to send to the register file. To reduce crossbar and collector unit area the selection is made so that each collector unit only receives one operand per cycle.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As each operand is read out of the register file and placed into the corresponding collector unit, a \u201cready bit\u201d is set. Finally, when all the operands are ready the instruction is issued to a SIMD execution unit.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In our model, each back-end pipeline (SP, SFU, MEM) has a set of dedicated collector units, and they share a pool of general collector units.\u00a0 The number of units available to each pipeline and the capacity of the pool of the general units are configurable.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== ALU Pipelines ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim v3.x models two types of ALU functional units.\u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* SP units executes all types of ALU instructions except transcendentals.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* SFU units executes transcendental instructions (Sine, Cosine, Log... etc.).\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Both types of units are pipelined and SIMDized.\u00a0 The SP unit can usually execute one warp instruction per cycle, while the SFU unit may only execute a new warp instruction every few cycles, depending on the instruction type.\u00a0 For example, the SFU unit can execute a sine instruction every 4 cycles or a reciprocal instruction every 2 cycles.\u00a0 Different types of instructions also has different execution latency. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Each SIMT core has one SP unit and one SFU unit.\u00a0 Each unit has an independent issue port from the operand collector.\u00a0 Both units share the same output pipeline register that connects to a common writeback stage.\u00a0 There is a result bus allocator at the output of the operand collector to ensure that the units will never be stalled due to the shared writeback.\u00a0 Each instruction will need to allocate a cycle slot in the result bus before being issued to either unit.\u00a0 Notice that the memory pipeline has its own writeback stage and is not managed by this result bus allocator.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The [[#Software_Design_of_GPGPU-Sim|software design]] section contains more implementation detail of the model.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Memory Pipeline (LDST unit) === </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim Supports the various memory spaces in CUDA as visible in PTX.\u00a0 In our model, each SIMT core has 4 different on-chip level 1 memories: shared memory, data cache, constant cache,\u00a0 and texture cache.\u00a0 The following table shows which on chip memories service which type of memory access:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=&quot;3&quot; cellspacing=&quot;0&quot; class=&quot;wikitable&quot; style=&quot;text-align:left&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! &#039;&#039;&#039;Core Memory&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! &#039;&#039;&#039;PTX Accesses&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Shared memory (R/W)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| CUDA shared memory (OpenCL local memory) accesses only</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Constant cache (Read Only)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Constant memory and parameter memory</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Texture cache (Read Only)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Texture accesses only</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Data cache (R/W - evict-on-write for global memory, writeback for local memory)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Global and Local memory accesses (Local memory = Private data in OpenCL) </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Although these are modelled as separate physical structures, they are all components of the memory pipeline (LDST unit) and therefore they all share the same writeback stage.\u00a0 The following describes how each of these spaces is serviced:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># Texture Memory - Accesses to texture memory are cached in the L1 texture cache (reserved for texture accesses only) and also in the L2 cache (if enabled).\u00a0 The L1 texture cache is a special design described in the 1998 paper [http:</ins>//www<ins class=\"diffchange diffchange-inline\">-graphics.stanford.edu/papers/texture_prefetch/].\u00a0 Threads on GPU cannot write to the texture memory space, thus the L1 texture cache is read-only.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># Shared Memory - Each SIMT core contains a configurable amount of shared scratchpad memory that can be shared by threads within a thread block.\u00a0 This memory space is not backed by any L2, and is explicitly managed by the programmer.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># Constant Memory - Constants and parameter memory is cached in a read-only constant cache.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># Parameter Memory - See above</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># Local Memory - Cached in the L1 data cache and backed by the L2. Treated in a fashion similar to global memory below except values are written back on eviction since there can be no sharing of local (private) data.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># Global Memory - Global and local accesses are both serviced by the L1 data cache.\u00a0 Accesses by scalar threads from the same warp are coalesced on a half-warp basis as described in the CUDA 3.1 programming guide. [http://developer.nvidia.com/nvidia-gpu-computing-documentation].\u00a0 These accesses are processed at a rate of 2 per SIMT core cycle, such that a memory instruction that is perfectly coalesced into 2 accesses (one per half-warp) can be serviced in a single cycle.\u00a0 For those instructions that generate more than 2 accesses, these will access the memory system at a rate of 2 per cycle.\u00a0 So, if a memory instruction generates 32 accesses (one per lane in the warp), it will take at least 16 SIMT core cycles to move the instruction to the next pipeline stage.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The subsections below describe the first level memory structures.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== L1 Data Cache ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The L1 data cache is a private, per-SIMT core, non-blocking first level cache for local and global memory accesses. The L1 cache is not banked and is able to service two coalesced memory request per SIMT core cycle. An incoming memory request must not span two or more cache lines in the L1 data cache. Note also that the L1 data caches are not coherent.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The table below summarizes the write policies for the L1 data cache.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=&quot;3&quot; cellspacing=&quot;0&quot; class=&quot;wikitable&quot; style=&quot;text-align:left&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=3 | L1 data cache write policy</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 \u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Local Memory</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Global Memory</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write Hit\u00a0 \u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write-back</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write-evict</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write Miss\u00a0 \u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write no-allocate</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write no-allocate</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For local memory, the L1 data cache acts as a write-back cache with write no-allocate. For global memory, write hits cause eviction of the block. This mimics the default policy for global stores as outlined in the PTX ISA specification [http://developer.nvidia.com/nvidia-gpu-computing-documentation].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Memory accesses that hit in the L1 data cache are serviced in one SIMT core clock cycle. Missed accesses are inserted into a FIFO miss queue. One fill request per SIMT clock cycle is generated by the L1 data cache (given the interconnection injection buffers are able to accept the request).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The cache uses Miss Status Holding Registers (MSHR) to hold the status of misses in progress.\u00a0 These are modeled as a fully-associative array. Redundant accesses to the memory system that take place while one request is in flight are merged in the MSHRs. The MSHR table has a fixed number of MSHR entries. Each MSHR entry can service a fixed number of miss requests for a single cache line. The number of MSHR entries and maximum number of requests per entry are configurable. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">A memory request that misses in the cache is added to the MSHR table and a fill request is generated if there is no pending request for that cache line. When a fill response to the fill request is received at the cache, the cache line is inserted into the cache and the corresponding MSHR entry is marked as filled. Responses for filled MSHR entries are generated at one request per cycle. Once all the requests waiting at the filled MSHR entry have been responded to and serviced, the MSHR entry is freed.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Texture Cache ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The texture cache model is a [http://www-graphics.stanford.edu/papers/texture_prefetch/ prefetching texture cache].\u00a0 Texture memory accesses exhibit mostly spatial locality and this locality has been found to be mostly captured with [http://graphics.stanford.edu/papers/texture_cache/ about 16 KB of storage].\u00a0 In realistic graphics usage scenarios many texture cache accesses miss.\u00a0 The latency to access texture in DRAM is on the order of many 100&#039;s of cycles.\u00a0 Given the large memory access latency and small cache size the problem of when to allocate lines in the cache becomes paramount.\u00a0 The prefetching texture cache solves the problem by temporally decoupling the state of the cache tags from the state of the cache blocks.\u00a0 The tag array represents the state the cache will be in when misses have been serviced 100&#039;s of cycles later.\u00a0 The data array represents the state after misses have been serviced.\u00a0 The key to making this decoupling work is to use a reorder buffer to ensure returning texture miss data is placed into the data array in the same order the tag array saw the access.\u00a0  For more details see the [http://www-graphics.stanford.edu/papers/texture_prefetch/ original paper].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Constant (Read only) Cache ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Accesses to constant and parameter memory run through the L1 constant cache.\u00a0 This cache is implemented with a tag array and is like the L1 data cache with the exception that it cannot be written to.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Thread Block / CTA / Work Group Scheduling ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Thread blocks, Cooperative Thread Arrays (CTAs) in CUDA terminology or Work Groups in OpenCL terminology, are issued to SIMT Cores one at a time. Every SIMT Core clock cycle, the thread block issuing mechanism selects and cycles through the SIMT Core Clusters in a round robin fashion. For each selected SIMT Core Cluster, the SIMT Cores are selected and cycled through in a round robin fashion. For every selected SIMT Core, a single thread block will be issued to the core from the selected kernel if the there are enough resources free on that SIMT Core.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">If multiple CUDA Streams or command queues are used in the application, then multiple kernels can be executed concurrently in GPGPU-Sim. Different kernels can be executed across different SIMT Cores; a single SIMT Core can only execute thread blocks from a single kernel at a time. If multiple kernels are concurrently being executed, then the selection of the kernel to issue to each SIMT Core is also round robin. Concurrent kernel execution on CUDA architectures is described in the [http://developer.nvidia.com/nvidia-gpu-computing-documentation NVIDIA CUDA Programming Guide]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Interconnection Network ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Interconnection network is responsible for the communications between SIMT core clusters and Memory Partition units. To simulate the interconnection network we have interfaced the &quot;booksim&quot; simulator to GPGPU-Sim. Booksim is a stand alone network simulator that can be found [http://cva.stanford.edu/books/ppin/ here]. Booksim is capable of simulating virtual channel based Tori and Fly networks and is highly configurable. It can be best understood by referring to &quot;Principles and Practices of Interconnection Networks&quot; book by Dally and Towles.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">We refer to our modified version of the booksim as &#039;&#039;Intersim&#039;&#039;. Intersim has it own clock domain. The original booksim only supports a single interconnection network. We have made some changes to be able to simulate two interconnection networks: one for traffic from the SIMT core clusters to Memory Partitions and one network for traffic from Memory partitions back to SIMT core clusters. This is one way of avoiding circular dependencies that might cause protocol deadlocks in the system. Another way would be having dedicated virtual channels for request and response traffic on a single physical network but this capability is not fully supported in the current version of our public release.\u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Note: A newer version of Booksim (Booksim 2.0) is now available from Stanford, but GPGPU-Sim 3.x does not yet use it.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Please note that SIMT Core Clusters do not directly communicate which each other and hence there is no notion of coherence traffic in the interconnection network. There are only four packet types: (1)read-request and (2)write-requests sent from SIMT core clusters to Memory partitions and (3)read-replys and write-acknowledges sent from Memory Partitions to SIMT Core Clusters. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Concentration ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">SIMT core Clusters act as external concentrators in GPGPU-Sim. From the interconnection network&#039;s point of view a SIMT core cluster is a single node and the routers connected to this node only have one injection and one ejection port.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Interface with GPGPU-Sim ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The interconnection network regardless of its internal configuration provides a simple interface to communicate with SIMT core Clusters and Memory partitions that are connected to it. For injecting packets SIMT core clusters or Memory controllers first check if the network has enough buffer space to accept their packet and then send their packet to the network. For ejection they check if there is packet waiting for ejection in the network and then pop it. These action happen in each units&#039; clock domain. The serialization of packets is handled inside the network interface, e.g. SIMT core cluster injects a packet in SIMT core cluster clock domain but the router accepts only one flit per interconnect clock cycle. More implementation details can be found in the Software design section.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Memory Partition ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The memory system in GPGPU-Sim is modelled by a set of memory partitions.\u00a0 As shown in &lt;xr id=&quot;fig:mem_partition&quot;/&gt; each memory partition contain an L2 cache bank, a DRAM access scheduler and the off-chip DRAM channel. The functional execution of atomic operations also occurs in the memory partitions in the Atomic Operation Execution phase. Each memory partition is assigned a sub-set of physical memory addresses for which it is responsible. By default the global linear address space is interleaved among partitions in chunks of 256 bytes.\u00a0 This partitioning of the address space along with the detailed mapping of the address space to DRAM row, banks and columns in each partition is configurable and described in the [[#Address Decoding | Address Decoding]] section.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:mem_partition&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:Mempart-arch.png|center|frame|&lt;caption&gt;Memory Partition Components&lt;/caption&gt;]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The L2 cache (when enabled) services the incoming texture and (when configured to do so) non-texture memory requests.\u00a0 Note the Quadro FX 5800 (GT200) configuration enables the L2 for texture references only.\u00a0 On a cache miss, the L2 cache bank generates memory requests to the DRAM channel to be serviced by off-chip GDDR3 DRAM.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The subsections below describe in more detail how traffic flows through the memory partition along with the individual components mentioned above.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Memory Partition Connections and Traffic Flow ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;xr id=&quot;fig:mem_partition&quot;/&gt;\u00a0 above shows the three sub-components inside a single Memory Partition and the various FIFO queues that facilitate the flow of memory requests and responses between them. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The memory request packets enter the Memory Partition from the interconnect via the &#039;&#039;ICNT-&gt;L2&#039;&#039; queue. Non-texture accesses are directed through the &#039;&#039;Raster Operations Pipeline (ROP)&#039;&#039; queue to model a minimum pipeline latency of 460 L2 clock cycles, as observed by a [http://www.stuffedcow.net/files/gpuarch-ispass2010.pdf GT200 micro-benchmarking study]. The L2 cache bank pops one request per L2 clock cycle from the &#039;&#039;ICNT-&gt;L2&#039;&#039; queue for servicing. Any memory requests for the off-chip DRAM generated by the L2 are pushed into the &#039;&#039;L2-&gt;dram&#039;&#039; queue. If the L2 cache is disabled, packets are popped from the &#039;&#039;ICNT-&gt;L2&#039;&#039; and pushed directly into the &#039;&#039;L2-&gt;DRAM&#039;&#039; queue, still at the L2 clock frequency. Fill requests returning from off-chip DRAM are popped from &#039;&#039;DRAM-&gt;L2&#039;&#039; queue and consumed by the L2 cache bank. Read replies from the L2 to the SIMT core are pushed through the &#039;&#039;L2-&gt;ICNT&#039;&#039; queue.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The &#039;&#039;DRAM latency&#039;&#039; queue is a fixed latency queue that models the minimum latency difference between a L2 access and a DRAM access (an access that has missed L2 cache).\u00a0 This latency is observed via micro-benchmarking and this queue simply modeling this observation (instead of the real hardware that causes this delay).\u00a0 Requests exiting the &#039;&#039;L2-&gt;DRAM&#039;&#039; queue reside in the &#039;&#039;DRAM latency&#039;&#039; queue for a fixed number of SIMT core clock cycles. Each DRAM clock cycle, the DRAM channel can pop memory request from the &#039;&#039;DRAM latency&#039;&#039; queue to be serviced by off-chip DRAM, and push one serviced memory request into the &#039;&#039;DRAM-&gt;L2&#039;&#039; queue.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Note that ejection from the interconnect to Memory Partition (&#039;&#039;ROP&#039;&#039; or &#039;&#039;ICNT-&gt;L2&#039;&#039; queues) occurs in L2 clock domain while injection into the interconnect from Memory Partition (&#039;&#039;L2-&gt;ICNT&#039;&#039; queue) occurs in the interconnect (ICNT) clock domain.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== L2 Cache Model and Cache Hierarchy ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The L2 cache model is very similar to the L1 [[#L1_Data_Cache|data caches]] in SIMT cores (see [[#L1_Data_Cache|that section]] for more details). When enabled to cache the global memory space data, the L2 acts as a read/write cache with write policies as summarized in the table below.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=&quot;3&quot; cellspacing=&quot;0&quot; class=&quot;wikitable&quot; style=&quot;text-align:left&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=3 | L2 cache write policy</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 \u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Local Memory</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Global Memory</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write Hit\u00a0 \u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write-back for L1 write-backs</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write-evict</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write Miss\u00a0 \u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write no-allocate</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Write no-allocate</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Additionally, note that the L2 cache is a unified last level cache that is shared by all SIMT cores, whereas the L1 caches are private to each SIMT core.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The private L1 data caches are not coherent (the other L1 caches are for read only address spaces). The cache hierarchy in GPGPU-Sim is non-inclusive non-exclusive. Additionally, a non-decreasing cache line size going down the cache hierarchy (i.e. increasing cache level) is enforced. A memory request from the first level cache also cannot span across two cache lines in the L2 cache. These two restrictions ensure:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># A request from a lower level cache can be serviced by one cache line in the higher level cache. This ensures that requests from the L1 can be serviced atomically by the L2 cache.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># Atomic operations do not need to access multiple blocks at the L2</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This restriction simplifies the cache design and prevents having to deal with live-lock related issues with servicing a request from L1 non-atomically.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Atomic Operation Execution Phase ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The Atomic Operation Execution phase is a very idealized model of atomic instruction execution. Atomic instructions with non-conflicting memory accesses that were coalesced into one memory request are executed at the memory partition in one cycle.\u00a0 In the performance model, we currently model an atomic operation as a global load operation that skips the L1 data cache.\u00a0 This generates all the necessary register writeback traffic (and data hazard stalls) within the SIMT core.\u00a0 At L2 cache, the atomic operation marks the accessed cache line dirty (changing its status to &#039;&#039;modified&#039;&#039;) to generate the writeback traffic to DRAM.\u00a0 If L2 cache is not enabled (or used for texture access only), then no DRAM write traffic will be generated for atomic operations (a very idealized model).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== DRAM Scheduling and Timing Model ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim models both DRAM scheduling and timing. GPGPU-Sim implements two open page mode DRAM schedulers: a FIFO (First In First Out) scheduler and a FR-FCFS (First-Ready First-Come-First-Serve) scheduler, both described below.\u00a0 These can be selected using the configuration option -gpgpu_dram_scheduler.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">====FIFO Scheduler====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The FIFO scheduler services requests in the order they are received.\u00a0 This will tend to cause a large number of precharges and activates and hence result in poorer performance especially for applications that generate a large amount of memory traffic relative to the amount of computation they perform.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">====FR-FCFS====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The First-Row First-Come-First-Served scheduler gives higher priority to requests to a currently open row in any of the DRAM banks. The scheduler will schedule all requests in the queue to open rows first. If no such request exists it will open a new row for the oldest request. The code for this scheduler is located in dram_sched.h/.cc.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">====DRAM Timing Model ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim accurately models graphics DRAM memory.\u00a0 Currently GPGPU-Sim 3.x models GDDR3 DRAM, though we are working on adding a detailed GDDR5.\u00a0 The following DRAM timing parameters can be set using the option -gpgpu_dram_timing_opt nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tCDLR:tWR. Currently, we do not model the timing of DRAM refresh operations. Please refer to [http://www.hynix.com/datasheet/pdf/dram/HY5RS123235FP(Rev1.3).pdf GDDR3 specifications] for more details about each parameter.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* nbk: number of banks</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tCCD: Column to Column Delay (RD/WR to RD/WR different banks)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRRD: Row to Row Delay (Active to Active different banks)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRCD: Row to Column Delay (Active to RD/WR/RTR/WTR/LTR)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRAS: Active to PRECHARGE command period</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRP: PRECHARGE command period </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRC: Active to Active command period (same bank)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* CL: CAS Latency</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* WL: WRITE latency</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tCDLR: Last data-in to Read Delay (switching from write to read)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tWR: WRITE recovery time</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In our model, commands for each memory bank are scheduled in a round-robin fashion. The banks are arranged in a circular array with a pointer to the bank with the highest priority. The scheduler goes through the banks in order and issues commands. Whenever an activate or precharge command is issued for a bank, the priority pointer is set to the next bank guaranteeing that other pending commands for other banks will be eventually scheduled.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Instruction Set Architecture (ISA) ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== PTX and SASS ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim simulates the Parallel Thread eXecution (PTX) instruction set used by NVIDIA. PTX is a pseudo-assembly instruction set; i.e. it does not execute directly on the hardware. ptxas is the assembler released by NVIDIA to assemble PTX into the native instruction set run by the hardware (SASS). Each hardware generation supports a different version of SASS. For this reason, PTX is compiled into multiple versions of SASS that correspond to different hardware generations at compile time. Despite that, the PTX code is still embedded into the binary to enable support for future hardware. At runtime, the runtime system selects the appropriate version of SASS to run based on the available hardware. If there is none, the runtime system invokes a just-in-time (JIT) compiler on the embedded PTX to compile it into the SASS corresponding to the available hardware.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== PTXPlus ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim is capable of running PTX. However, since PTX is not the actual code that runs on the hardware, there is a limit to how accurate it can be. This is mainly due to compiler passes such as strength reduction, instruction scheduling, register allocation to mention a few.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To enable running SASS code in GPGPU-Sim, new features had to be added:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* New addressing modes</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* More elaborate condition codes and predicates</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Additional instructions</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Additional datatypes</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In order to avoid developing and maintaining two parsers and two functional execution engines (one for PTX and the other for SASS), we chose to extend PTX with the required features in order to provide a one-to-one mapping to SASS. PTX along with the extentions is called PTXPlus. To run SASS, we perform a syntax conversion from SASS to PTXPlus.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">PTXPlus has a very similar syntax when compared to PTX with the addition of new addressing modes, more elaborate condition codes and predicates, additional instructions and more data types. It is important to keep in mind that PTXPlus is a superset of PTX, which means that valid PTX is also valid PTXPlus. More details about the exact differences between PTX and PTXPlus can be found in [[#PTX vs. PTXPlus]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== From SASS to PTXPlus ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">When the configuration file instructs GPGPU-Sim to run SASS, a conversion tool, cuobjdump_to_ptxplus, is used to convert the SASS embedded within the binary to PTXPlus. For the full details of the conversion process see [[#PTXPlus Conversion ]]. The PTXPlus is then used in the simulation. When SASS is converted to PTXPlus, only the syntax is changed, the instructions and their order is preserved exactly as in the SASS. Thus, the effect of compiler optimizations applied to the native code is fully captured. Currently, GPGPU-Sim only supports the conversion of GT200 SASS to PTXPlus.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">= Using GPGPU-Sim =</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Refer to the README file in the top level GPGPU-Sim directory for instructions on building and running GPGPU-Sim 3.x.\u00a0 This section provides other important guidance on using GPGPU-Sim 3.x, covering topics such as different [[#Simulation_Modes |simulation modes]], how to [[#Configuration_Options |modify the timing model configuration]], a description of the default [[#Understanding_Simulation_Output |simulation statistics]], and description of approaches for analyzing bugs at the functional level via [[#Environment_Variables_for_Debugging |tracing simulation state]] and a [[#Interactive_Debugger_Mode |GDB-like interface]].\u00a0 GPGPU-Sim 3.x also provides extensive support for debugging performance simulation bugs including both a [[#Visualizing_High-Level_GPGPU-Sim_Microarchitecture_Behavior |high level microarchitecture visualizer]] and [[#Visualizing_Cycle_by_Cycle_Microarchitecture_Behavior |cycle by cycle pipeline state visualization]].\u00a0 Next, we describe strategies for [[#Debugging_Errors_in_Performance_Simulation |debugging GPGPU-Sim when it crashes or deadlocks in performance simulation mode]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Finally, it conclude with [[#Frequently_Asked_Questions |answers to frequently asked questions]]. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Simulation Modes ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">By default most users will want to use GPGPU-Sim 3.x to estimate the number of GPU clock cycles it takes to run an application. This is known as performance simulation mode.\u00a0 When trying to run a new application on GPGPU-Sim it is always possible that application may not run correctly--i.e., it is possible it may generate the wrong output.\u00a0 To help debugging such applications, GPGPU-Sim 3.x also supports a fast functional simulation only mode.\u00a0 This mode may also be helpful for compiler research and/or when making changes to the functional simulation engine.\u00a0 Orthogonal to the distinction between performance and functional simulation, GPGPU-Sim 3.x also support execution of the native hardware ISA on NVIDIA GPUs (currently GT200 and earlier only), via an extended PTX syntax we call PTXPlus.\u00a0 The following subsections describe these features in turn.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| class=&quot;wikitable sortable&quot; style=&quot;font-size: smaller&quot; border=&quot;1&quot; cellpadding=&quot;3&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|+ Modes of operation for GPGPU-Sim</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! CUDA Version</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! PTX</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! PTXPlus</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! cuobjdump+PTX</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! cuobjdump+PTXPlus</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 2.3</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{no}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{no}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{no}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 3.1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{yes}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{no}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{no}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{no}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 4.0</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{no}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{no}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{yes}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| {{yes}}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Performance Simulation ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Performance simulation is the default mode of simulation and collects performance statistics in exchange for slower simulation speed. GPGPU-Sim simulates the microarchitecture described in the [[#Microarchitecture_Model|Microarchitecture Model]] section.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To select the performance simulation mode, add the following line to the gpgpusim.config file:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> -gpgpu_ptx_sim_mode 0</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For information regarding understanding the simulation output refer to the section on [[#Understanding_Simulation_Output | understanding simulation output]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Pure Functional Simulation ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Pure functional simulations run faster than performance simulations but only perform the execution of the CUDA/OpenCL program and does not collect performance statistics. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To select the pure functional simulation mode, add the following line to the gpgpusim.config file: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> -gpgpu_ptx_sim_mode 1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Alternatively, you can set the environmental variable PTX_SIM_MODE_FUNC to 1. Then execute the program normally as you would do in performance simulation mode.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Simulating only the functionality of a GPU device, GPGPU-Sim pure functional simulation mode execute a CUDA/OpenCL program as if it runs on a real GPU device, so no performance measures are collected in this mode, only the regular output of a GPU program is shown. As you expect the pure simulation mode is significantly faster than the performance simulation mode (about 5~10 times faster).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This\u00a0 mode is very useful if you want to quickly check that your code is working correctly on GPGPU-Sim, or if you want to experience using CUDA/OpenCL without the need to have a real GPU computation device. Pure functional simulation supports the same versions of CUDA as the performance simulation (CUDA v3.1) and (CUDA v2.3) for PTX Plus. The pure functional simulation mode execute programs as a group of warps, where warps of each Cooperative Thread Array (CTA) get executed till they all finish or all wait at a barrier, in the latter case once all the warps meet at the barrier they are cleared to go ahead and cross the barrier.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Software design details for Pure Functional Simulation can be found [[#Pure_Functional_Simulation_2 | below]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Interactive Debugger Mode ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Interactive debugger mode offers a GDB-like interface for debugging functional behavior in GPGPU-Sim.\u00a0  However, currently it only works with performance simulation. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To enable interactive debugger mode, set environment variable &lt;tt&gt;GPGPUSIM_DEBUG&lt;/tt&gt; to 1.\u00a0 Here are supported commands: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=&quot;3&quot; cellspacing=&quot;0&quot; class=&quot;wikitable&quot; style=&quot;text-align:left&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Command</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| dp &lt;id&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Dump pipeline: Display the state (pipeline content) of the SIMT core &lt;id&gt;.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| q</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Quit</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| b &lt;file&gt;:&lt;line&gt; &lt;thread uid&gt; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Set breakpoint at &lt;file&gt;:&lt;line&gt; for thread with &lt;thread uid&gt;.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| d &lt;uid&gt; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Delete breakpoint. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| s</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Single step execution to next core cycle for all cores. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| c</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Continue execution without single stepping. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| w &lt;address&gt; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Set watchpoint at &lt;address&gt;.\u00a0 &lt;address&gt; is specified as a hexadecimal number.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| l </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| List PTX around current breakpoint. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| h</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Display help message. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">It is implemented in files &lt;tt&gt;debug.h&lt;/tt&gt; and &lt;tt&gt;debug.cc&lt;/tt&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Cuobjdump Support ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As of GPGPU-Sim version 3.1.0, support for using cuobjdump was added. cuobjdump is a software provided by NVidia to extract information like SASS and PTX from binaries. GPGPU-Sim supports using cuobjdump to extract the information it needs to run either SASS or PTX instead of obtaining them through the cubin files. Using cuobjdump is supported only with CUDA 4.0. cuobjdump is enabled by default if the simulator is compiled with CUDA 4.0. To enable/disable cuobjdump, add one of the following option to your configuration file:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> # disable cuobjdump</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> -gpgpu_ptx_use_cuobjdump 0</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> # enable cuobjdump</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> -gpgpu_ptx_use_cuobjdump 1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== PTX vs. PTXPlus ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">By default, GPGPU-Sim 3.x simulates [http://developer.nvidia.com/nvidia-gpu-computing-documentation PTX] instructions. However, when executing on an actual GPU, PTX is recompiled to a native GPU ISA (SASS). This recompilation is not fully accounted for in the simulation of normal PTX instructions. To address this issue we created PTXPlus. PTXPlus is an extended form of PTX, introduced by GPGPU-Sim 3.x, that allows for a near 1 to 1 mapping of most GT200 SASS instructions to PTXPlus instructions. It includes new instructions and addressing modes that don&#039;t exist in regular PTX. When the conversion to PTXPlus option is activated, the SASS instructions that make up the program are translated into PTXPlus instructions that can be simulated by GPGPU-Sim. Use of the PTXPlus conversion option can lead to significantly more accurate results. However, conversion to PTXPlus does not yet fully support all programs that could be simulated using normal PTX. Currently, only CUDA Toolkit later than 4.0 is supported for conversion to PTXPlus.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">SASS is the term NVIDIA uses for the native instruction set used by the GPUs according to their released documentation of the instruction sets. This documentation can be found in the file &quot;cuobjdump.pdf&quot; released with the [http://developer.nvidia.com/cuda-toolkit-40 CUDA Toolkit].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To convert the SASS from an executable, GPGPU-Sim cuobjdump -- a software release along with the CUDA toolkit by NVIDIA that extracts PTX, SASS and other information from CUDA executables. GPGPU-Sim 3.x includes a stand alone program called cuobjdump_to_ptxplus that is invoked to convert the output of cuobjdump into PTXPlus which GPGPU-Sim can simulate. cuobjdump_to_ptxplus is a program written in C++ and its source is provided with the GPGPU-Sim distribution. See the [[#PTXPlus Conversion | PTXPlus Conversion]] section for a detailed description on the PTXPlus conversion process. Currently, cuobjdump_to_ptxplus supports the conversion of SASS for sm versions &lt; sm_20.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To enable PTXPlus simulation, add the following line to the gpgpusim.config file:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> -gpgpu_ptx_convert_to_ptxplus 1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Additionally the converted PTXPlus can be saved to files named &quot;_#.ptxplus&quot; by adding the following line to the gpgpusim.config file:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> -gpgpu_ptx_save_converted_ptxplus 1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To turn off either feature, either remove the line or change the value from 1 to 0. More details about using PTXPlus can be found in [[#PTXPlus support | PTXPlus support]]. If the option above is enabled, GPGPU-Sim will attempt to convert the SASS code to PTXPlus and then run the resulting PTXPlus. However, as mentioned before, not all programs are supported in this mode.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The subsections below describe the additions we made to PTX to obtain PTXPlus. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Addressing Modes ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To support GT200 SASS, PTXPlus increases the number of addressing modes available to most instructions. Non-load/non-store are now able to directly access memory. The following instruction adds the value in register r0 to the value store in shared memory at address 0x0010 and stores the values in register r1:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> add.u32 $r1, s[0x0010], $r0;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Operands such as s[$r2] or s[$ofs1+0x0010] can also be used. PTXPlus also introduces the following addressing modes that are not present in original PTX:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* g = global memory</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* s = shared memory</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* ce#c# = constant memory (first number is the kernel number, second number is the constant segment)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> g[$ofs1+$r0]\u00a0 \u00a0  //global memory address determined by the sum of register ofs1 and register r0.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> s[$ofs1+=0x0010] //shared memory address determined by value in register $ofs1. Register $ofs1 is then incremented by 0x0010.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> ce1c2[$ofs1+=$r1] //first kernel&#039;s second constant segment&#039;s memory address determined by value in register $ofs1. Register $ofs1 is then incremented by the value in register $r1.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The implementation details of these addressing modes is described in [[#PTXPlus Implementation | PTXPlus Implementation]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== New Data Types ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Instructions have also been upgraded to more accurately represent how 64 bit and 128 bit values are stored across multiple 32 bit registers. The least significant 32 bits are stored in the far left register while the most significant 32 bits are stored in the far right registers. The following is a list of the new data types and an example of an add instruction adding two 64 bit floating point numbers:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* .ff64 = PTXPlus version of 64 bit floating point number</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* .bb64 = PTXPlus version of 64 bit untyped</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* .bb128 = PTXPlus version of 128 bit untyped</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> add.rn.ff64 {$r0,$r1}, {$r2,$r3}, {$r4,$r5};</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== PTXPlus Instructions ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=&quot;3&quot; cellspacing=&quot;0&quot; class=&quot;wikitable&quot; style=&quot;text-align:left&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | PTXPlus Instructions</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| nop</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Do nothing</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| andn</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| a andn b\u00a0 = a and ~b</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| norn</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| a norn b\u00a0 = a nor ~b</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| orn</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| a orn b\u00a0  = a or ~b</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| nandn</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| a nandn b = a nand ~b</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| callp</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| A new call instruction added in PTXPlus. It jumps to the indicated label</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| retp</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| A new return instruction added in PTXPlus. It jumps back to the instruction after the previous callp instruction</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| breakaddr</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Pushes the address indicated by the operand on the thread&#039;s break address stack</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| break</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Jumps to the address at the top of the thread&#039;s break address stack and pops off the entry</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== PTXPlus Condition Codes and Instruction Predication ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Instead of the normal true-false predicate system in PTX, SASS instructions use 4-bit condition codes to specify more complex predicate behaviour. As such, PTXPlus uses the same 4-bit predicate system. GPGPU-Sim uses the predicate translation table from decuda for simulating PTXPlus instructions.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The highest bit represents the overflow flag followed by the carry flag and sign flag. The last and lowest bit is the zero flag. Separate condition codes can be stored in separate predicate registers and instructions can indicate which predicate register to use or modify. The following instruction adds the value in register $r0 to the value in register $r1 and stores the result in register $r2. At the same the, the appropriate flags are set in predicate register $p0.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> add.u32 $p0|$r2, $r0, $r1;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Different test conditions can be used on predicated instructions. For example the next instruction is only performed if the carry flag bit in predicate register $p0 is set:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> @$p0.cf add.u32 $r2, $r0, $r1;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Parameter and Thread ID (tid) Initialization\u00a0 ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">PTXPlus does not use an explicit parameter state space to store the kernel parameters. Instead, the input parameters are copied in order into shared memory starting at address 0x0010. The copying of parameters is performed during GPGPU-Sim&#039;s thread initialization process. The thread initialization process occurs when a thread block is issued to a SIMT core, as described in\u00a0 [[#Thread Block / CTA / Work Group Scheduling | Thread Block / CTA / Work Group Scheduling]]. The [[#Kernel Launch: Parameter Hookup | Kernel Launch: Parameter Hookup]] section describes the implementation for this procedure.\u00a0 Also during this process, the values of special registers %tid.x, %tid.y and %tid.z are copied into register $r0.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> Register $r0:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> |%tid.z|\u00a0 %tid.y\u00a0 |\u00a0 NA\u00a0 |\u00a0 %tid.x\u00a0 |</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 31\u00a0 26 25\u00a0 \u00a0 \u00a0 16 15\u00a0 10 9\u00a0 \u00a0 \u00a0 \u00a0 0</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Debugging via Prints and Traces ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">There are two built-in facilities for debugging gpgpu-sim.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The first mechanism is through environment variables.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This is useful for debugging elements of GPGPU-Sim that take place before the configuration file (gpgpusim.config) is parsed, however this can be a clumsy way to implement tracing information in the performance simulator.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As of version 3.2.1 GPGPU-Sim includes a tracing system implemented in &#039;src/trace.h&#039;, which allows the user to turn traces on and off via the config file and enable traces by their string name.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Both these systems are described below.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Please note that many of the environment variable prints could be implemented via the tracing system, but exist as env variables becuase they are in legacy code.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Also, GPGPU-Sim prints a large amount of information that is not controlled through the tracing system which is also a result of legacy code.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Environment Variables for Debugging ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Some behavior of GPGPU-Sim 3.x relevant to debugging can be configured via environment variables.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">When debugging it may be helpful to generate additional information about what is going on in the simulator and print this out to standard output.\u00a0 This is done by using the following environment variable:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> export PTX_SIM_DEBUG=&lt;#&gt; enable debugging and set the verbose level</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The currently supported levels are enumerated below:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=3 cellspacing=0 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Level || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|1 || Verbose logs for CTA allocation </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 2 || Print verbose output from dominator analysis</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 3 || Verbose logs for GPU malloc/memcpy/memset</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 5 || Display the instruction executed </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 6 || Display the modified register(s) by each executed instruction </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 10 || Display the entire register file of the thread executing the instruction </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 50 || Print verbose output from control flow analysis</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| 100 || Print the loaded PTX files</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">If a benchmark does not run correctly on GPGPU-Sim you may need to debug the functional simulation engine</ins>. <ins class=\"diffchange diffchange-inline\"> The way we do this is to print out the functional state of a single thread that generates an incorrect output.\u00a0 To enable printing out functional simulation state for a single thread, use the following environment variable (and set the appropriate level for PTX_SIM_DEBUG):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> export PTX_SIM_DEBUG_THREAD_UID=&lt;#&gt; ID of thread to debug</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Other environment configuration options:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> export PTX_SIM_USE_PTX_FILE=&lt;non-empty-string&gt; override PTX embedded in the binary and revert to old strategy of looking for *.ptx files (good for hand-tweaking PTX)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> export PTX_SIM_KERNELFILE=&lt;filename&gt; use this to specify the name of the PTX file</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== GPGPU-Sim debug tracing ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The tracing system is controlled by variables in the gpgpusim.config file:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=3 cellspacing=0 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Variable || Values || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|trace_enabled || 0 or 1 || Globally enable or disable all tracing. If enabled, then trace_components are printed.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|trace_components || &lt;WARP_SCHEDULER,SCOREBOARD,...&gt; || A comma separated list of tracing elements to enable, a complete list is available in &#039;&#039;src/trace_streams.tup&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|trace_sampling_core || &lt;0 through num_cores-1&gt; || For elements associated with a given shader core (such as the warp scheduler or scoreboard), only print traces from this core</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The Code files that implement the system are:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=3 cellspacing=0 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Variable || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|&#039;&#039;src/trace_streams.tup&#039;&#039; || Lists the names of each print stream</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|&#039;&#039;src/trace.cc&#039;&#039; || Some setup implementation and initialization</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|&#039;&#039;src/trace.h&#039;&#039; || Defines all the high level interfaces for the tracing system</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|&#039;&#039;src/gpgpu-sim/shader_trace.h&#039;&#039; || Defines some convenient prints for debugging a specific shader core </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Configuration Options ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Configuration options are passed into GPGPU-Sim with gpgpusim.config and an interconnection configuration file (specified with option -inter_config_file inside gpgpusim.config).\u00a0 GPGPU-Sim 3.0.2 comes with calibrated configuration files in the configs directory for the NVIDIA GT200 (configs/QuadroFX5800/) and Fermi (configs/Fermi/).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Here is a list of the configuration options: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=&quot;3&quot; cellspacing=&quot;0&quot; class=&quot;wikitable&quot; style=&quot;text-align:left&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;Simulation Run Configuration</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_max_cycle &lt;# cycles&gt;\u00a0 \u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Terminate GPU simulation early after a maximum number of cycle is reached (0 = no limit)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_max_insn &lt;# insns&gt;\u00a0 \u00a0 \u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Terminate GPU simulation early after a maximum number of instructions (0 = no limit)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_ptx_sim_mode &lt;0=performance (default), 1=functional&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Select between performance or functional simulation (note that functional simulation may incorrectly simulate some PTX code that requires each element of a warp to execute in lock-step)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_deadlock_detect &lt;0=off, 1=on (default)&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Stop the simulation at deadlock </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_max_cta</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Terminates GPU simulation early (0 = no limit)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_max_concurrent_kernel</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Maximum kernels that can run concurrently on GPU</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;Statistics Collection Options</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_ptx_instruction_classification &lt;0=off, 1=on (default)&gt;\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Enable instruction classification</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_runtime_stat &lt;frequency&gt;:&lt;flag&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Display runtime statistics</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_memlatency_stat</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Collect memory latency statistics (0x2 enables MC, 0x4 enables queue logs)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -visualizer_enabled &lt;0=off, 1=on (default)&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Turn on visualizer output (use [[AerialVision_Manual|AerialVision]] visualizer tool to plot data saved in log)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -visualizer_outputfile &lt;filename&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Specifies the output log file for visualizer. Set to NULL for automatically generated filename (Done by default).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -visualizer_zlevel &lt;compression level&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Compression level of the visualizer output log (0=no compression, 9=max compression)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -save_embedded_ptx</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| saves ptx files embedded in binary as &lt;n&gt;.ptx</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -enable_ptx_file_line_stats &lt;0=off, 1=on (default)&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Turn on PTX source line statistic profiling</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -ptx_line_stats_filename &lt;output file name&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Output file for PTX source line statistics.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_warpdistro_shader</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Specify which shader core to collect the warp size distribution from</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_cflog_interval</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Interval between each snapshot in control flow logger</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -keep</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| keep intermediate files created by GPGPU-Sim when interfacing with external programs</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;High-Level Architecture Configuration (See ISPASS paper for more details on what is being modeled)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_n_mem &lt;# memory controller&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Number of memory controllers (DRAM channels) in this configuration. Read [[#Topology Configuration]] before modifying this option. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_clock_domains &lt;Core Clock&gt;:&lt;Interconnect Clock&gt;:&lt;L2 Clock&gt;:&lt;DRAM Clock&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Clock domain frequencies in MhZ (See [[#Clock Domain Configuration]])</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_n_clusters\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Number of processing clusters</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_n_cores_per_cluster</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Number of SIMD cores per cluster</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;Additional Architecture Configuration</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_n_cluster_ejection_buffer_size</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Number of packets in ejection buffer</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_n_ldst_response_buffer_size</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Number of response packets in LD/ST unit ejection buffer</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_coalesce_arch</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Coalescing arch (default = 13, anything else is off for now)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt; Scheduler</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_num_sched_per_core</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Number of warp schedulers per core</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_max_insn_issue_per_warp</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Max number of instructions that can be issued per warp in one cycle by scheduler</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;Shader Core Pipeline Configuration</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_shader_core_pipeline &lt;# thread/shader core&gt;:&lt;warp size&gt;:&lt;pipeline SIMD width&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Shader core pipeline configuration </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_shader_registers &lt;# registers/shader core, default=8192&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Number of registers per shader core. Limits number of concurrent CTAs. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_shader_cta &lt;# CTA/shader core, default=8&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Maximum number of concurrent CTAs in shader </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_simd_model &lt;1=immediate post-dominator, others are not supported for now&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 SIMD Branch divergence handling policy </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -ptx_opcode_latency_int/fp/dp &lt;ADD,MAX,MUL,MAD,DIV&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Opcode latencies</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -ptx_opcode_initiation_int/fp/dp &lt;ADD,MAX,MUL,MAD,DIV&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Opcode initiation period. For this number of cycles the inputs of the ALU is held constant. The ALU cannot consume new values during this time. i.e. if this value is 4, then that means this unit can consume new values once every 4 cycles.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;Memory Sub-System Configuration</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_perfect_mem &lt;0=off (default), 1=on&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Enable perfect memory mode (zero memory latency with no cache misses)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_tex_cache:l1 &lt;nsets&gt;:&lt;bsize&gt;:&lt;assoc&gt;,&lt;rep&gt;:&lt;wr&gt;:&lt;alloc&gt;:&lt;wr_alloc&gt;,&lt;mshr&gt;:&lt;N&gt;:&lt;merge&gt;,&lt;mq&gt;:&lt;fifo_entry&gt; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Texture cache (Read-Only) configuration. Evict policy: L = LRU, F = FIFO, R = Random.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_const_cache:l1 &lt;nsets&gt;:&lt;bsize&gt;:&lt;assoc&gt;,&lt;rep&gt;:&lt;wr&gt;:&lt;alloc&gt;:&lt;wr_alloc&gt;,&lt;mshr&gt;:&lt;N&gt;:&lt;merge&gt;,&lt;mq&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Constant cache (Read-Only) configuration. Evict policy: L = LRU, F = FIFO, R = Random </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_cache:il1 &lt;nsets&gt;:&lt;bsize&gt;:&lt;assoc&gt;,&lt;rep&gt;:&lt;wr&gt;:&lt;alloc&gt;:&lt;wr_alloc&gt;,&lt;mshr&gt;:&lt;N&gt;:&lt;merge&gt;,&lt;mq&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Shader L1 instruction cache (for global and local memory) configuration. Evict policy: L = LRU, F = FIFO, R = Random</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_cache:dl1 &lt;nsets&gt;:&lt;bsize&gt;:&lt;assoc&gt;,&lt;rep&gt;:&lt;wr&gt;:&lt;alloc&gt;:&lt;wr_alloc&gt;,&lt;mshr&gt;:&lt;N&gt;:&lt;merge&gt;,&lt;mq&gt; -- set to &quot;none&quot; for no DL1 --</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 L1 data cache (for global and local memory) configuration. Evict policy: L = LRU, F = FIFO, R = Random </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_cache:dl2 &lt;nsets&gt;:&lt;bsize&gt;:&lt;assoc&gt;,&lt;rep&gt;:&lt;wr&gt;:&lt;alloc&gt;:&lt;wr_alloc&gt;,&lt;mshr&gt;:&lt;N&gt;:&lt;merge&gt;,&lt;mq&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Unified banked L2 data cache configuration.\u00a0 This specifies the configuration for the L2 cache bank in one of the memory partitions.\u00a0 The total L2 cache capacity = &lt;nsets&gt; x &lt;bsize&gt; x &lt;assoc&gt; x &lt;# memory controller&gt;. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_shmem_size &lt;shared memory size, default=16kB&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Size of shared memory per SIMT core (aka shader core)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_shmem_warp_parts</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Number of portions a warp is divided into for shared memory bank conflict check </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_flush_cache &lt;0=off (default), 1=on&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Flush cache at the end of each kernel call</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_local_mem_map</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Mapping from local memory space address to simulated GPU physical address space (default = enabled)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_num_reg_banks</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Number of register banks (default = 8)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_reg_bank_use_warp_id</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Use warp ID in mapping registers to banks (default = off)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_cache:dl2_texture_only</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| L2 cache used for texture only (0=no, 1=yes, &#039;&#039;&#039;default=1&#039;&#039;&#039;)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;Operand Collector Configuration</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_units_sp</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector units (default = 4)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_units_sfu</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector units (default = 4)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_units_mem</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector units (default = 2)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_units_gen</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector units (default = 0)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_in_ports_sp</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector unit in ports (default = 1)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_in_ports_sfu</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector unit in ports (default = 1)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_in_ports_mem</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector unit in ports (default = 1)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_in_ports_gen</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 number of collector unit in ports (default = 0)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_out_ports_sp</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector unit in ports (default = 1)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_out_ports_sfu</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector unit in ports (default = 1)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_out_ports_mem </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector unit in ports (default = 1)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_operand_collector_num_out_ports_gen</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| number of collector unit in ports (default = 0)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;DRAM/Memory Controller Configuration</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_dram_scheduler &lt;0 = fifo, 1 = fr-fcfs&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 DRAM scheduler type</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_frfcfs_dram_sched_queue_size &lt;# entries&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 DRAM FRFCFS scheduler queue size (0 = unlimited (default); # entries per chip).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">(Note: FIFO scheduler queue size is fixed to 2).\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 -gpgpu_dram_return_queue_size &lt;# entries&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 DRAM requests return queue size (0 = unlimited (default); # entries per chip). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_dram_buswidth &lt;# bytes/DRAM bus cycle, default=4 bytes, i.e. 8 bytes/command clock cycle&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Bus bandwidth of a single DRAM chip at command bus frequency (default = 4 bytes (8 bytes per command clock cycle)).\u00a0 The number of DRAM chip per memory controller is set by option -gpgpu_n_mem_per_ctrlr.\u00a0 Each memory partition has (gpgpu_dram_buswidth X gpgpu_n_mem_per_ctrlr) bits of DRAM data bus pins.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For example, Quadro FX5800 has a 512-bit DRAM data bus, which is divided among 8 memory partitions.\u00a0 Each memory partition a 512/8=64 bits of DRAM data bus.\u00a0 This 64-bit bus is split into 2 DRAM chips for each memory partition.\u00a0 Each chip will have 32-bit=4-byte of DRAM bus width.\u00a0 We therefore set -gpgpu_dram_buswidth to 4.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_dram_burst_length &lt;# burst per DRAM request&gt; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Burst length of each DRAM request (default = 4 data clock cycle, which runs at 2X command clock frequency in GDDR3)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_dram_timing_opt &lt;nbk:tCCD:tRRD:tRCD:tRAS:tRP:tRC:CL:WL:tWTR&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 DRAM timing parameters:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* nbk = number of banks</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tCCD = CAS to CAS command delay (always = half of burst length)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRRD = Row active to row active delay</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRCD = RAW to CAS delay</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRAS = Row active time</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRP = Row precharge time</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tRC = Row cycle time</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* CL = CAS latency</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* WL = Write latency</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* tWTR = Write to read delay</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_mem_address_mask &lt;address decoding scheme&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 &#039;&#039;&#039;Obsolete&#039;&#039;&#039;: Select different address decoding scheme to spread memory access across different memory banks.\u00a0 (0 = old addressing mask, 1 = new addressing mask, 2 = new add. mask + flipped bank sel and chip sel bits)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_mem_addr_mapping dramid@&lt;start bit&gt;;&lt;memory address map&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Mapping memory address to DRAM model: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &lt;start bit&gt; = where the bits used to specify the DRAM channel ID starts. (This means the next Log2(#DRAM channel) bits will be used as the DRAM channel ID, and the whole address map will be shifted depending on how many bits are used.) &lt;br&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &lt;memory address map&gt; = a 64-character string specify how each bit in a memory address is decoded into row (R), column (C), bank (B) addresses. Part of the address that will be inside a single DRAM burst should be specified with (S). &lt;br&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">See configuration file for Quadro FX 5800 for example. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_n_mem_per_ctrlr &lt;# DRAM chips/memory controller&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 Number of DRAM chip per memory controller (aka DRAM channel)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_dram_partition_queues </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| i2$:$2d:d2$:$2i</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -rop_latency &lt;# minimum cycle before L2 cache access&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Specify the minimum latency (in &#039;&#039;core clock cycles&#039;&#039;) between when a memory request arrived at the memory partition and when it accesses the L2 cache / moves into the queue to access DRAM.\u00a0 It models the minimum L2 hit latency.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -dram_latency &lt;# minimum cycle after L2 cache access and before DRAM access&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Specify the minimum latency (in &#039;&#039;core clock cycles&#039;&#039;) between when a memory request has accessed the L2 cache and when it is pushed into the DRAM scheduler.\u00a0 This option works together with -rop_latency to model the minimum DRAM access latency (= rop_latency + dram_latency).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;Interconnection Configuration</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -inter_config_file &lt;Path to Interconnection Config file&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 The file containing Interconnection Network simulator&#039;s options. For more details about interconnection configurations see Manual provided with the original code at [http://cva.stanford.edu/books/ppin/].\u00a0 NOTE that options under &quot;4.6 Traffic&quot; and &quot;4.7 Simulation parameters&quot; should not be used in our simulator. Also see [[#Interconnection Configuration]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -network_mode</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Interconnection network mode to be used (default = 1).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=2 | &lt;BR&gt;PTX Configurations</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Option || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_ptx_use_cuobjdump</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Use cuobjdump to extract ptx/sass (0=no, 1=yes) Only allowed with CUDA 4.0</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_ptx_convert_to_ptxplus</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Convert embedded ptx to ptxplus (0=no, 1=yes)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_ptx_save_converted_ptxplus</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Save converted ptxplus to a file (0=no, 1=yes)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_ptx_force_max_capability</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Force maximum compute capability (default 0)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_ptx_inst_debug_to_file</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Dump executed instructions&#039; debug information to a file (0=no, 1=yes)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_ptx_inst_debug_file</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Output file to dump debug information for executed instructions</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| -gpgpu_ptx_inst_debug_thread_uid</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Thread UID for executed instructions&#039; debug output</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Interconnection Configuration ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim 3.x uses the booksim router simulator to model the interconnection network.\u00a0 For the most part you will want to consult the booksim documentation for how to configure the interconnect.\u00a0 However, below we list special considerations that need to be taken into account to ensure your modifications work with GPGPU-Sim.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Topology Configuration ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Note that the total number of network nodes as specified in the interconnection network config file should match the total nodes in GPGPU-Sim.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim&#039;s total number of nodes would be the sum of SIMT Core Cluster count and the number of Memory controllers. E.g. in the QuadroFX5800 configuration there are 10 SIMT Core Clusters and 8 Memory Controllers. That is a total of 18 nodes. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Therefore in the interconnection config file the network also has 18 nodes as demonstrated below:\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 topology = fly;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 k = 18;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 n = 1;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 routing_function = dest_tag;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The configuration snippet above sets up a single stage butterfly network with destination tag routing and 18 nodes. Generally, in both butterfly and mesh networks the total number of network nodes would be k*n. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Note that if you choose to use a mesh network you will want to consider configuring the memory controller placement. In the current release there are a few predefined mappings that can be enabled by setting &quot;use_map=1;&quot; In particular the mesh network used in our [http://ieeexplore.ieee</ins>.org<ins class=\"diffchange diffchange-inline\">:80</ins>/<ins class=\"diffchange diffchange-inline\">xpl/articleDetails.jsp?reload=true&amp;arnumber=4919648 ISPASS 2009 paper] paper can be configured using this setting and the following topology:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* a 6x6 mesh network (topology=mesh, k=6, n=2) : 28 SIMT cores + 8 dram channels assuming the SIMT core Cluster size is one</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">You can create your own mappings by modifying &lt;tt&gt;create_node_map()&lt;</ins>/<ins class=\"diffchange diffchange-inline\">tt&gt; in interconnect_interface.cpp (and set use_map=1)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Booksim options added by GPGPU-Sim ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">These options are specific to GPGPU-Sim and not part of the original booksim:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*perfect_icnt: if set the interconnect is not simulated all packets that are injected to the network will appear at their destination after one cycle. This is true even when multiple sources send packets to one destination.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*fixed_lat_per_hop: similar to perfect_icnt above except that the packet appears in destination after &quot;Manhattan distance hop count times fixed_lat_per_hop&quot; cycles.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*use_map: changes the way memory and shader cores are placed. See Topology Configuration.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*flit_size: specifies the flit_size in bytes. This is used to identify the number of flits per packet based on the size of packet as passed to icnt_push() functions.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*network_count: Number of independent interconnection networks. Should be set to 2 unless you know what you are doing.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*output_extra_latency: Adds extra cycles to each router. Used to create Figure 10 of ISPASS paper.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*enable_link_stats: prints extra statistics for each link</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*input_buf_size: Input buffer size of each node in flits. If left zero the simulator will set it automatically. See &quot;Injecting a packet from the outside world to network&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*ejection_buffer_size: ejection buffer size. If left zero the simulator will set it automatically. See &quot;Ejecting a packet from network to the outside world&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*boundary_buffer_size: boundary buffer size. If left zero the simulator will set it automatically. See &quot;Ejecting a packet from network to the outside world&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">These four options were set using #define in original booksim but we have made them configurable via intersim&#039;s config file:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*MATLAP_OUTPUT (generates Matlab friendly outputs), DISPLAY_LAT_DIST (shows a distribution of packet latencies), DISPLAY_HOP_DIST (shows a distribution of hop counts), DISPLAY_PAIR_LATENCY (shows average latency for each source destination pair)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Booksim Options ignored by GPGPU-Sim ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Please note the following options that are part of original booksim are either ignored or should not be changed from default.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*Traffic Options (section 4.6 of booksim manual): injection_rate, injection_process, burst_alpha, burst_beta, &quot;const_flit_per_packet&quot;, traffic</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*Simulation parameters (section 4.7 of booksim manual)</ins>: <ins class=\"diffchange diffchange-inline\">sim_type, sample_period, warmup_periods, max_samples, latency_thres, sim_count, reorder</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Clock Domain </ins>Configuration <ins class=\"diffchange diffchange-inline\">===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim supports four clock domains that can be controlled by the &lt;tt&gt;-gpgpu_clock_domains&lt;/tt&gt; option:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* DRAM clock domain = frequency of the real DRAM clock (command clock) and not the data clock (i.e. 2x of the command clock frequency)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* SIMT Core Cluster clock domain = frequency of the pipeline stages in a core clock (i.e. the rate at which &lt;tt&gt; simt_core_cluster::core_cycle()&lt;/tt&gt; is called)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Icnt clock domain = frequency of the interconnection network (usually this can be regarded as the &#039;&#039;core&#039;&#039; clock in NVIDIA GPU specs)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* L2 clock domain = frequency of the L2 cache (We usually set this equal to ICNT clock frequency)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Note that in GPGPU-Sim the width of the pipeline is equal to warp size. To compensate for this we adjust the SIMT Core Cluster clock domain.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For example we model the superpipelined stages in NVIDIA&#039;s Quadro FX 5800 (GT200) SM running at the fast clock rate (1GHz+) with a single-slower pipeline stage running at 1/4 the frequency. So a 1.3GHz shader clock rate of FX 5800 corresponds to a 325MHz SIMT core clock in GPGPU-Sim.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The DRAM clock domain is specified in the frequency of the command clock.\u00a0 To simplify the peak memory bandwidth calculation, most GPU specs report the data clock, which runs at 2X the command clock frequency.\u00a0 For example, Quadro FX5800 has a memory data clock of 1600MHz, while the command clock is only running at 800MHz.\u00a0 Therefore, our configuration sets the DRAM clock domain at 800.0MHz.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== clock Special Register ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In ptx, there is a special register %clock that reads the a clock cycle counter. On the hardware, this register is called SR1. It is a clock cycle counter that silently wraps around. In Quadro, this counter is incremented twice per scheduler clock. In Fermi, it is incremented only once per scheduler clock. GPGPU-Sim will return a value for a counter that is incremented once per scheduler clock (which is the same as the SIMT core clock).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In PTXPlus, the nvidia compiler generates the instructions accessing the %clock register as follows</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> //SASS accessing clock register</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> S2R R1, SR1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> SHL R1, R1, 0x1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> //PTXPlus accessing clock register</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> mov %r1, %clock</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> shl %r1, %r1, 0x1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This basically multiplies the value in the clock register by two. In PTX, however, the clock register is accessed directly. Those conditions must be taken into consideration when calculating results based on the clock register.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> //PTX accessing clock register</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> mov r1, %clock</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Understanding Simulation Output ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">At the end of each CUDA grid launch, GPGPU-Sim prints out the performance statistics to the console (&lt;tt&gt;stdout&lt;/tt&gt;).\u00a0 These performance statistics provide insights into how the CUDA application performs with the simulated GPU architecture.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Here is a brief </ins>list <ins class=\"diffchange diffchange-inline\">of the important performance statistics:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== General Simulation Statistics ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=3 cellspacing=0 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Statistic || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpu_sim_cycle || Number of cycles (in Core clock) required to execute this kernel.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpu_sim_insn\u00a0 || Number of instructions executed in this kernel. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpu_ipc\u00a0 \u00a0 \u00a0  || gpu_sim_insn / gpu_sim_cycle</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpu_tot_sim_cycle\u00a0 || Total number of cycles (in Core clock) simulated for all the kernels launched so far. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpu_tot_sim_insn\u00a0  || Total number of instructions executed for all the kernels launched so far. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpu_tot_ipc\u00a0 \u00a0 \u00a0 \u00a0 || gpu_tot_sim_insn / gpu_tot_sim_cycle</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpu_total_sim_rate\u00a0 \u00a0 \u00a0 \u00a0 || gpu_tot_sim_insn / wall_time</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Simple Bottleneck Analysis ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">These performance counters track stall events at different high-level parts of the GPU.\u00a0 In combination, they give a broad sense of how where the bottleneck is in the GPU for an application. &lt;xr id=&quot;fig:mem_request_flow&quot;/&gt; illustrates a simplified flow of memory requests through the memory sub-system in GPGPU-Sim, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:mem_request_flow&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:memreqflow.png|center|frame|&lt;caption&gt;Memory request flow diagram&lt;/caption&gt;]] </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Here are the description for each counter: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=3 cellspacing=0 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Statistic || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpgpu_n_stall_shd_mem\u00a0  || Number of pipeline stall cycles at the memory stage caused by one of the following reasons:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* shared memory bank conflict </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* non-coalesced memory access </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* serialized constant memory access </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpu_stall_dramfull\u00a0 || Number of cycles that the interconnect outputs to dram channel is stalled.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpu_stall_icnt2sh\u00a0  || Number of cycles that the dram channels are stalled due to the interconnect congestion.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Memory Access Statistics ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=3 cellspacing=0 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Statistic || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpgpu_n_load_insn\u00a0  || Number of global/local load instructions executed.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpgpu_n_store_insn\u00a0 || Number of global/local store instructions executed.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpgpu_n_shmem_insn\u00a0 || Number of shared memory instructions executed.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpgpu_n_tex_insn\u00a0 \u00a0 || Number of texture memory instructions executed.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpgpu_n_const_mem_insn\u00a0 || Number of constant memory instructions executed.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpgpu_n_param_mem_insn\u00a0 || Number of parameter read instructions executed.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|gpgpu_n_cmem_portconflict\u00a0 || Number of constant memory bank conflict. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|maxmrqlatency\u00a0 || Maximum memory queue latency (amount of time a memory request spent in the DRAM memory queue)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|maxmflatency\u00a0  || Maximum memory fetch latency (round trip latency from shader core to DRAM and back)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|averagemflatency || Average memory fetch latency</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|max_icnt2mem_latency || Maximum latency for a memory request to traverse from a shader core to the destinated DRAM channel</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|max_icnt2sh_latency\u00a0 || Maximum latency for a memory request to traverse from a DRAM channel back to the specified shader core</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Memory Sub-System Statistics ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=3 cellspacing=0 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Statistic || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpgpu_n_mem_read_local\u00a0  || Number of local memory reads placed on the interconnect from the shader cores.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpgpu_n_mem_write_local\u00a0 || Number of local memory writes placed on the interconnect from the shader cores. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpgpu_n_mem_read_global\u00a0 || Number of global memory reads placed on the interconnect from the shader cores. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpgpu_n_mem_write_global || Number of global memory writes placed on the interconnect from the shader cores. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpgpu_n_mem_texture\u00a0 \u00a0 \u00a0 || Number of texture memory reads placed on the interconnect from the shader cores.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpgpu_n_mem_const\u00a0 \u00a0 \u00a0 \u00a0 || Number of constant memory reads placed on the interconnect from the shader cores.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Control-Flow Statistics ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim reports the warp occupancy distribution which measures performance penalty due to branch divergence in the CUDA application. This information is reported on a single line following the text &quot;Warp Occupancy Distribution:&quot;.\u00a0 Alternatively, you may want to grep for W0_Idle.\u00a0 The distribution is display in format: &lt;tt&gt;&lt;bin&gt;:&lt;cycle count&gt;&lt;/tt&gt;. Here is the meaning of each bin:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellspacing=0 cellpadding=3 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Statistic || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Stall || The number of cycles when the shader core pipeline is stalled and cannot issue any instructions.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| W0_Idle || The number of cycles when all available warps are issued to the pipeline and are not ready to execute the next instruction. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| W0_Scoreboard\u00a0 || The number of cycles when all available warps are waiting for data from memory. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| W&#039;&#039;X&#039;&#039; (where &#039;&#039;X&#039;&#039; = 1 to 32) || The number of cycles when a warp with &#039;&#039;X&#039;&#039; active threads is scheduled into the pipeline. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Code that has no branch divergence should result in no cycles with W&quot;X&quot; where X is between 1 and 31. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">See [http://doi.acm.org/10.1145/1543753.1543756 Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware</ins>] <ins class=\"diffchange diffchange-inline\">for more detail.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== DRAM Statistics ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">By default, GPGPU-Sim reports the following statistics for each DRAM channel:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellspacing=0 cellpadding=3 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Statistic || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|n_cmd || Total number of command cycles the memory controller in a DRAM channel has elapsed.\u00a0 The controller can issue a single command per command cycle.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|n_nop || Total number of NOP commands issued by the memory controller. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|n_act || Total number of Row Activation commands issued by the memory controller. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|n_pre || Total number of Precharge commands issued by the memory controller.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|n_req || Total number of memory requests processed by the DRAM channel. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|n_rd\u00a0 || Total number of read commands issued by the memory controller. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|n_write || Total number of write commands issued by the memory controller. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|bw_util || DRAM bandwidth utilization = 2 * (n_rd + n_write) / n_cmd</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|n_activity || Total number of active cycles, or command cycles when the memory controller has a pending request at its queue. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|dram_eff || DRAM efficiency = 2 </ins>* <ins class=\"diffchange diffchange-inline\">(n_rd + n_write) / n_activity\u00a0 (i.e. DRAM bandwidth utilization when there is a pending request waiting to be processed) </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|mrqq: max || Maximum memory request queue occupancy. (i.e. Maximum number of pending entries in the queue)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|mrqq: avg || Average memory request queue occupancy. (i.e. Average number of pending entries in the queue)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Cache Statistics ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For each cache (normal data cache, constant cache, texture cache alike), GPGPU-Sim reports the following statistics:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Access = Total number of access to the cache</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Miss = Total number of misses to the cache.\u00a0 The number in parenthesis is the cache miss rate.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* PendingHit = Total number of pending hits in the cache.\u00a0 An pending hit access has hit a cache line in RESERVED state, which means there is already an inflight memory request sent by a previous cache miss on the same line.\u00a0 This access can be merged that previous memory access so that it does not produce memory traffic.\u00a0 The number in parenthesis is the ratio of accesses that exhibit pending hits. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Notice that pending hits are not counted as cache misses.\u00a0 Also, we do not count pending hits for cache that employs allocate-on-fill policy (e.g. read-only caches such as constant cache and texture cache). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim also calculates the total miss rate for all instances of the L1 data cache:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">total_dl1_misses</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">total_dl1_accesses</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">total_dl1_miss_rate</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Notice that data for L1 Total Miss Rate should be ignored when the l1 cache is turned off: &lt;tt&gt;-gpgpu_cache:dl1 none&lt;/tt&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Interconnect Statistics ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In GPGPU-Sim, the user can configure whether to run all traffic on a single interconnection network, or on two separate physical networks (one relaying data from the shader cores to the DRAM channels and the other relaying the data back).\u00a0 (The motivation for using two separate networks, besides increasing bandwidth, is often to avoid &quot;protocol deadlock&quot; which otherwise requires additional dedicated virtual channels.)\u00a0  GPGPU-Sim reports the following statistics for each individual interconnection network:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=3 cellspacing=0 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Statistic || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| average latency || Average latency for a single flit to traverse from a source node to a destination node. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| average accepted rate || Measured average throughput of the network relative to its total input channel throughput. Notice that when using two separate networks for traffics in different directions, some nodes will never inject data into the network (i.e. the output only nodes such as DRAM channels on the cores-to-dram network). To get the real ratio, total input channel throughput from these nodes should be ignored. That means one should multiply this rate with the ratio (total # nodes / # input nodes in this network) to get the real average accepted rate. Note that by default we use two separate networks which is set by network_count option in interconnection network config file. The two networks serve to break circular dependancies that might cause deadlocks.\u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| min accepted rate || Always 0, as there are nodes that do not inject flits into the network due to the fact that we simulate two separate networks for traffic in different directions. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| latency_stat_0_freq || A histogram showing the distribution of latency of flits traversed in the network. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Note: Accepted traffic or throughput of a network is the amount of traffic delivered to the destination terminals of the network. If the network is below saturation all the offered traffic is accepted by the network and offered traffic would be equal to throughput of the network. The interconnect simulator calculates the accepted rate of each node by dividing the total number of packets received at a node by the total network cycles.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Visualizing High-Level GPGPU-Sim Microarchitecture Behavior ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">AerialVision is a GPU performance analysis tool for GPGPU-Sim that is includes with the GPGPU-Sim source code introduced in a </ins>[<ins class=\"diffchange diffchange-inline\">http:</ins>//www.<ins class=\"diffchange diffchange-inline\">ece.ubc.ca/~aamodt/papers/aerialvision.ispass2010.pdf ISPASS 2010 paper].\u00a0 A detailed manual describing how to use AerialVision can be found [[AerialVision_Manual|here]].\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Two examples of the type of high level analysis possible with AerialVision are illustrated below. &lt;xr id=&quot;fig:aerialvision_example1&quot;/&gt; illustrates the use of AerialVision to understand microarchitecture behavior versus time.\u00a0 The top row is average memory access latency versus time, the second row plots load per SIMT core versus time (vertical axis is SIMT core, color represents average instructions per cycle), the bottom row shows load per memory controller channel.\u00a0 &lt;xr id=&quot;fig:aerialvision_example2&quot;/&gt; illustrates the use of AerialVision to understand microarchiteture behavior at the source code level.\u00a0 This helps identify lines of code associated with uncoalesced or branch divergence.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=0 cellpadding=&quot;3&quot; cellspacing=&quot;0&quot; class=&quot;wikitable&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &lt;figure id=&quot;fig:aerialvision_example1&quot;&gt; [[Image:Timelapse_customize.png|center|frame|&lt;caption&gt;AerialVision Time Lapse View Example&lt;/caption&gt;]] &lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &lt;figure id=&quot;fig:aerialvision_example2&quot;&gt;[[Image:Sourcecode_nav.png|center|frame|&lt;caption&gt;AerialVision Source Code View Example&lt;/caption&gt;]]&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To get GPGPU-Sim to generate a visualizer log file for the Time Lapse View, add the following option to gpgpusim.config: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 -visualizer_enabled 1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The sampling frequency in this log file can be set by option -gpgpu_runtime_stat.\u00a0 One can also specify the name of the visualizer log file with option -visualizer_outputfile. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To generate the output for the Source Code Viewer, add the following option to gpgpusim.config: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 -enable_ptx_file_line_stats 1 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">One can specify the output file name with option -ptx_line_stats_filename. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">If you use plots generated by AerialVision in your publications, please cite the above linked ISPASS 2010 paper.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Visualizing Cycle by Cycle Microarchitecture Behavior ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim 3.x provides a set of GDB macros that can be used to visualize the detail states of each SIMT core and memory partition.\u00a0 By setting the global variable &quot;g_single_step&quot; in GDB to the shader clock cycle at which you would like to start &quot;single stepping&quot; the shader clock, these macros can be used to observe how the microarchitecture states changes cycle-by-cycle.\u00a0 You can use the macros anytime GDB has stopped the simulator, but the global variable &quot;g_single_step&quot; is used in gpu-sim.cc to trigger a call to a hard coded software breakpoint instruction placed after all shader cores have advanced simulation by one cycle.\u00a0 Stopping simulation here tends to lead to more easy to interpret state.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Visibility at this level is useful for debugging and can help you gain deeper insight into how the simulated GPU micro architecture works.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">These GDB macros are available in the &lt;tt&gt;.gdbinit&lt;/tt&gt; file that comes with the GPGPU-Sim distribution.\u00a0 To use these macros, either copy the file to your home directory or to the same directory where GDB is launched.\u00a0 GDB will automatically detect the presence of the macro file, load it and display the following message: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 ** loading GPGPU-Sim debugging macros... **</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellspacing=0 cellpadding=3 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Macro || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| dp &lt;core_id&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Display pipeline state of the SIMT core with ID=&lt;core_id&gt;.\u00a0 See below for an example of the display.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| dpc &lt;core_id&gt; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Display the pipeline state, then continue to the next breakpoint.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This version is useful if you set &quot;g_single_step&quot; to trigger the hard coded breakpoint where gpu_sim_cycle is incremented in gpgpu-sim::cycle() in src/gpgpu-sim/gpu-sim.cc.\u00a0 Repeatedly hitting enter will advance to show the pipeline contents in successive cycles.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| dm &lt;partition_id&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Display the internal states of a memory partition with ID=&lt;partition_id&gt;.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptxdis &lt;start_pc&gt; &lt;end_pc&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Display PTX instructions between &lt;start_pc&gt; and &lt;end_pc&gt;.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptxdis_func &lt;core_id&gt; &lt;thread_id&gt; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Display all PTX instructions inside the kernel function executed by thread &lt;thread_id&gt; in SIMT core &lt;core_id&gt;. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptx_tids2pcs &lt;thread_ids&gt; &lt;length&gt; &lt;core_id&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Display the current PCs of the threads in SIMT core &lt;core_id&gt; represented by an array &lt;thread_ids&gt; of length=&lt;length&gt;. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Example of the output by dp: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:dump_pipeline&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:dump_pipeline_anatomy.png|center|frame|&lt;caption&gt;Example pipeline state print out in gdb&lt;/caption&gt;]] </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Debugging Errors in Performance Simulation ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This section describes strategies for debugging errors that crash or deadlock GPGPU-Sim while running in performance simulation mode.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">If you ran a benchmark we haven&#039;t tested and it crashed we encourage you to [http://www.gpgpu-sim</ins>.org/<ins class=\"diffchange diffchange-inline\">bugs file a bug].\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">However, suppose GPGPU-Sim crashes or deadlocks after you have made your own changes?\u00a0 What should you do then?\u00a0 This section describes some generic techniques we use at UBC for debugging errors in the GPGPU-Sim timing model.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Segmentation Faults, Aborts and Failed Assertions ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Deadlocks ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Frequently Asked Questions ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Which Linux platform is best for GPGPU-Sim?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Currently we use SUSE 11.3 for developing GPGPU-Sim. However, many people ran it successfully on other distributions. In principle, there should be no problems in doing so. We did very minor testing for GPGPU-Sim with Ubuntu 10.04 LTS.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Can GPGPU-Sim simulate graphics?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">We have some plans to model graphics in the future (no ETA on when</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">that might be available).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Is there an option to enable a pure functional simulation without the timing model?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Yes, it is available now. It was removed when we first refactored the GPGPU-Sim 2.x code base into GPGPU-Sim 3.x to simplify the development process, but now it has been reintroduced again.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">How to change configuration files?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim searches for a file called gpgpusim.conf in the current directory. If you need to change the configuration file on the fly, you can create a new directory and create a symbolic link to the configuration file in that directory and use it as your working directory when running GPGPU-Sim. Changing the symbolic link to another file will change the file seen by GPGPU-Sim. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Can I run OpenCL applications on GPGPU-Sim if I don&#039;t have a GPU?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Do I need a GPU to run GPGPU-Sim?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Building and running GPGPU-Sim does not require the presence of a GPU on your system. However, running OpenCL applications requires the NVIDIA Driver which in turn requires the presence of the graphics card. Despite that, we have included support for executing the compilation of OpenCL applications on a remote machine. This means that you only need access to a remote machine that has a graphics card, but the machine you are actually using for running GPGPU-Sim doesn&#039;t.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">I got a parsing error from GPGPU-Sim for an OpenCL program that runs fine on real GPU.\u00a0 What is going on? </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Unlike most CUDA applications that contains compiled PTX code in the binary, the kernel code in an OpenCL program are compiled by the video driver at runtime.\u00a0 Depending on the version of the video driver, different versions of PTX may be produced from the same OpenCL kernel.\u00a0 GPGPU-Sim 3.x is developed with NVIDIA driver version 256.40 (see README file that comes with GPGPU-Sim).\u00a0 The newer driver that comes with CUDA Toolkit 4.0 or newer has introduced some new keywords in PTX.\u00a0 Some of these keywords are not difficult to support in GPGPU-Sim (such as &lt;tt&gt;.ptr&lt;/tt&gt;), while others are not as simple (such as the use of &lt;tt&gt;%envreg&lt;/tt&gt;, which is setup by the driver).\u00a0 For now, we would recommend downgrading the video driver, or setup a remote machine with the supported driver version for OpenCL compilation.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Does/Will GPGPU-Sim support CUDA 4.0?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Supporting CUDA 4 is something we are currently working on implementing (as usual, no ETA on when it will be released).\u00a0 Using multiple versions of CUDA is not hard with GPGPU-Sim 3.x:\u00a0 The setup_environment script for GPGPU-Sim 3.x can be modified to point to the 3.1 installation so you do not need to modify your .bashrc</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Why are there two networks (reply and request)?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Those too networks do not necessarily need to be two different physical networks, they can be two different logical networks e.g. each one can use a dedicated set of virtual channels on a single physical network. If the request and reply networks share the same network then a &quot;Protocol Deadlock&quot; may happen. To understand it better read section 14.1.5 of Dally&#039;s Interconnection Network book.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Is it normal to get &#039;NaN&#039; in the simulator output? </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">You may get it with the cache miss rates when the cache module has never been accessed.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Why do all CTAs finishes at cycle X, while gpu_sim_cycle says (X + Y)? (i.e. Why is GPGPU-Sim still simulating after all the CTAs/shader cores are done?)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The difference from when a CTA is considered finished by GPGPU-Sim to when GPGPU-Sim thinks the simulation is done can be due to global\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">memory write traffic.\u00a0  Basically, it takes some time from issuing a write command until that command is processed by the memory system. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">How to calculate the Peak off-chip DRAM bandwidth given a GPGPU-Sim configuration?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Peak off-chip DRAM bandwidth = gpgpu_n_mem * gpgpu_n_mem_per_ctrlr * gpgpu_dram_buswidth * DRAM Clock * 2 (for DDR)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* gpgpu_n_mem = Number of memory channels in the GPU (each memory channel has an independent controller for DRAM command scheduling)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* gpgpu_n_mem_per_ctrlr = Number of DRAM chips attached to a memory channel (default = 2, for 64-bit memory channel)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* gpgpu_dram_buswidth = Bus width of each DRAM chip (default = 32-bit = 4 bytes)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* DRAM Clock = the real clock of the DRAM chip (as opposed to the effective clock used in marketing - See [[#Clock Domain Configuration]])</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">How to get the DRAM utilization?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Each memory controller prints out some statistics at the end of the simulation using &quot;dram_print()&quot;.\u00a0 DRAM utilization is &quot;bw_util&quot;.\u00a0 Take the average of this number across all the memory controllers (the number for each controller can differ if each DRAM channel gets a different amount of memory traffic).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Inside the simulator&#039;s code, &#039;bwutil&#039; is incremented by 2 for every read or write operation because it takes two DRAM command cycles to service a single read or write operation (given burst length = 4).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Why isn&#039;t DRAM utilization improving with more shader cores (with the same number of DRAM channels) for a memory-limited application?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">DRAM utilization may not improve with having more inflight threads for many reasons. One reason could the DRAM precharge/activate overheads.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">(See e.g., [http://www.ece.ubc.ca/~aamodt/papers</ins>/<ins class=\"diffchange diffchange-inline\">gyuan.mobs2009.pdf Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures])</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">How to get the interconnect utilization?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer</ins>:<ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The definition of the interconnect&#039;s untilization highly depends on the topology of the interconnection network itself, so it is quite difficult to give a single &quot;utilization&quot; metric that is consistent across all types of topology.\u00a0 If you are looking into whether the interconnection is the bottleneck of an application, you may want to look at &lt;tt&gt;gpu_stall_icnt2sh&lt;/tt&gt; and &lt;tt&gt;gpu_stall_sh2icnt&lt;/tt&gt; instead.\u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The throughput (accepted rate) is also a good indicator for the utilization of each network. Note that by default we use two separate networks for traffics from SIMT core clusters to memory partitions and the traffics heading back; therefore you will see two accepted rate numbers reported at the end of simulation (one for each network). See [[#Interconnect Statistics</ins>]<ins class=\"diffchange diffchange-inline\">] for more detail.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Does/Will GPGPU-Sim 3.x support DWF (dynamic warp formation) or TBC (thread block compaction)?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The current release of GPGPU-Sim 3.x does not support DWF nor TBC.\u00a0 For now, only PDOM is supported.\u00a0 We are currently working on implementing TBC on GPGPU-Sim 3.x (evaluations of TBC in the HPCA 2011 paper was done on GPGPU-Sim 2.x).\u00a0 While we do not have any plan to release the implementation yet, it may be released under a separate branch in the future (this depends on how modular the final implementation is). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Question:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Why this simulator is claimed to be timing accurate/cycle accurate? How can I verify this fact?</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Answer:&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">A cycle-accurate simulator reports the timing behavior of the simulated architecture - it is possible for the user to stop the simulator at cycle boundaries and observe the states. All the hardware behavior within a cycle is approximated with C/C++ (as opposed to implementing them in HDLs) to speed up the simulation time. It is also common for architectural simulator to simplify some detailed implementations covering corner cases of a hardware design to emphasize what dictates the overall performance of a system - this is what we try to achieve with GPGPU-Sim. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">So, like all other cycle-accurate simulators used for architectural research/development, we do not guarantee 100% matching with real GPUs. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The normal way to verify a simulator would involve comparing reported timing result of an application running on the simulator against measured runtime of the same application running on the actual hardware simulation target.\u00a0 With PTX-ISA, this is a little tricky, because PTX-ISA is recompiled by the GPU driver into native GPU ISA for execution on the actual GPU, whereas GPGPU-Sim execute PTX-ISA directly.\u00a0 Also, the limited amount of publicly available information on the actual NVIDIA GPU microarchitecture has posed a big challenge on implementing the exact matching behavior in the simulator. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">(i.e. We do not know what is actually implemented inside a GPU. We just implement our best guess in the simulator!)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Nevertheless, we have been continually trying to improve the accuracy of our architecture model.\u00a0 In our ISPASS paper in 2009, we have compared simulated timing performance of various benchmarks against their hardware runtime on a GeForce 8600GT.\u00a0 The correlation coefficient was calculated to be 0.899.\u00a0 GPGPU-Sim 3.0 has been calibrated against an NVIDIA GT 200 GPU and shows IPC correlation of 0.976 on a subset of applications from the NVIDIA CUDA SDK.\u00a0 We welcome feedback from users regarding the accuracy of GPGPU-Sim.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">= Software Design of GPGPU-Sim =</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To perform substantial architecture research with GPGPU-Sim 3.x you will need to modify the source code.\u00a0 This section documents the high level software design of GPGPU-Sim 3.x, which differs from version 2.x. In addition to the software descriptions found here you may find it helpful to consult the doxygen generated documentation for GPGPU-Sim 3.x.\u00a0 Please see the README file for instructions for building the doxygen documentation from the GPGPU-Sim 3.x source.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Below we summarize the [[#File_list_and_brief_description |source organization]],\u00a0 [[#Option_Parser |command line option parser]], object oriented [[#Abstract_Hardware_Model |abstract hardware model]] that provides an interface between the [[#GPGPU-sim_-_Performance_Simulation_Engine |software organization of the performance simulation engine]] and the [[#CUDA-sim_-_Functional_Simulation_Engine |software organization of the functional simulation engine]].\u00a0 Finally, we describe the software design of the [[#Interface_with_outside_world |interface with CUDA/OpenCL applications]].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== File list and brief description ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim 3.x consists of three major modules (each located in its own directory):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &#039;&#039;&#039;cuda-sim&#039;&#039;&#039; - The functional simulator that executes PTX kernels generated by NVCC or OpenCL compiler</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &#039;&#039;&#039;gpgpu-sim&#039;&#039;&#039; - The performance simulator that simulates the timing behavior of a GPU (or other many core accelerator architectures)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>* <ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;intersim&#039;&#039;&#039; - The interconnection network simulator adopted from Bill Dally&#039;s [http://cva.stanford.edu/books/ppin/ BookSim]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Here are the files in each module:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Overall/Utilities ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellspacing=0 cellpadding=3 class=&quot;wikitable&quot; align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! File name || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Makefile\u00a0 || Makefile that builds gpgpu-sim and calls other the Makefile in cuda-sim and intersim. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| abstract_hardware_model.h &lt;br&gt; abstract_hardware_model.cc\u00a0 || Provide a set of classes that interface between functional and timing simulator.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| debug.h &lt;br&gt; debug.cc\u00a0 || Implements the [</ins>[<ins class=\"diffchange diffchange-inline\">#Interactive Debugger Mode |Interactive Debugger Mode]]. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpgpusim_entrypoint.c\u00a0 || Contains functions that interface with the CUDA/OpenCL API stub libraries.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| option_parser.h &lt;BR&gt;option_parser.cc\u00a0 || Implements the command-line option parser.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| stream_manager.h &lt;br&gt; stream_manager.cc\u00a0 || Implements the stream manager to support CUDA streams. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| tr1_hash_map.h\u00a0 || Wrapper code for std::unordered_map in C++11.\u00a0 Falls back to std::map or GNU hash_map if the compile does not support C++11. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| .gdb_init\u00a0 || Contains macros that simplify low-level visualization of simulation states with GDB</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== cuda-sim ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellspacing=0 cellpadding=3\u00a0 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! File name || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Makefile\u00a0 \u00a0 \u00a0 \u00a0  || Makefile for cuda-sim. Called by Makefile one level up. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| cuda_device_print.h &lt;br&gt; cuda_device_print.cc\u00a0 || Implementation to support printf() within CUDA device functions (i.e. calling printf() within GPU kernel functions). Please note that the device printf works only with CUDA 3.1</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| cuda-math.h\u00a0 \u00a0 \u00a0 || Contains interfaces to CUDA Math header files. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| cuda-sim.h &lt;br&gt; cuda-sim.cc\u00a0 \u00a0 \u00a0 || Implements the interface between gpgpu-sim and cuda-sim.\u00a0 It also contains a standalone simulator for functional simulation. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|instructions.h &lt;br&gt; instructions.cc\u00a0 || This is where the emulation code of all PTX instructions is implemented. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| memory.h &lt;BR&gt; memory.cc\u00a0 || Functional memory space emulation. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| opcodes.def\u00a0 \u00a0 \u00a0 || DEF file that links between various information of each instruction (eg. string name, implementation, internal opcode...)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| opcodes.h\u00a0 \u00a0 \u00a0 \u00a0 || Defines enum for each PTX instruction.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptxinfo.l &lt;BR&gt; ptxinfo.y || Lex and Yacc files for parsing ptxinfo file. (To obtain kernel resource requirement)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptx_ir.h &lt;BR&gt; ptx_ir.cc\u00a0 || Static structures in CUDA - kernels, functions, symbols... etc. Also contain code to perform static analysis for extracting immediate-post-dominators from kernels at load time. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptx.l &lt;BR&gt; ptx.y\u00a0 \u00a0 \u00a0 \u00a0  || Lex and Yacc files for parsing .ptx files and embedded cubin structure to obtain PTX code of the CUDA kernels</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptx_loader.h &lt;br&gt; ptx_loader.cc\u00a0 || Contains functions for loading and parsing and printing PTX and PTX info file</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptx_parser.h &lt;br&gt; ptx_parser.cc\u00a0 || Contains functions called by Yacc during parsing which creating infra structures needed for functional and performance simulation</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptx_parser_decode.def\u00a0 || Contains token definition of parser used in PTX extraction.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptx_sim.h &lt;BR&gt; ptx_sim.cc|| Dynamic structures in CUDA - Grids, CTA, threads</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| ptx-stats.h &lt;BR&gt;ptx-stats.cc || PTX source line profiler</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== gpgpu-sim ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellspacing=0 cellpadding=3 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! File name || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Makefile\u00a0 \u00a0 \u00a0 \u00a0  || Makefile for gpgpu-sim. Called by Makefile one level up. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| addrdec.h &lt;BR&gt; addrdec.cc\u00a0 \u00a0 \u00a0 \u00a0  || Address decoder - Maps a given address to a specific row, bank, column, in a DRAM channel. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| delayqueue.h\u00a0  || An implementation of a flexible pipelined queue. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| dram_sched.h &lt;BR&gt; dram_sched.cc\u00a0 || FR-FCFS DRAM request scheduler. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| dram.h &lt;BR&gt; dram.cc\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  || DRAM timing model + interface to other parts of gpgpu-sim. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpu-cache.h &lt;BR&gt; gpu-cache.cc\u00a0 \u00a0  || Cache model for GPGPU-Sim</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpu-misc.h &lt;BR&gt; gpu-misc.cc\u00a0 \u00a0 \u00a0  || Contains misc. functionality that is needed by parts of gpgpu-sim</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| gpu-sim.h &lt;BR&gt; gpu-sim.cc\u00a0 \u00a0 \u00a0 \u00a0  || Gluing different timing models in GPGPU-Sim into one.\u00a0 It contains implementations to support multiple clock domains and implements the thread block dispatcher. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| histogram.h &lt;BR&gt; histogram.cc\u00a0 \u00a0  || Defines several classes that implement different kinds of histograms. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| icnt_wrapper.h &lt;BR&gt; icnt_wrapper.c\u00a0 || Interconnection network interface for gpgpu-sim. It provides a completely decoupled interface allows intersim to work as a interconnection network timing simulator for gpgpu-sim. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| l2cache.h &lt;BR&gt; l2cache.cc\u00a0 \u00a0  || Implements the timing model of a memory partition.\u00a0 It also implements the L2 cache and interfaces it with the rest of the memory partition (e.g. DRAM timing model). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| mem_fetch.h &lt;BR&gt; mem_fetch.cc\u00a0 \u00a0 \u00a0 \u00a0 || Defines &lt;tt&gt;mem_fetch&lt;/tt&gt;, a communication structure that models a memory request. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| mem_fetch_status.tup\u00a0 \u00a0  || Defines the status of a memory request. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| mem_latency_stat.h &lt;BR&gt; mem_latency_stat.cc\u00a0 || Contains various code for memory system statistic collection.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| scoreboard.h &lt;BR&gt; scoreboard.cc\u00a0 \u00a0  || Implements the scoreboard used in SIMT core. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| shader.h &lt;BR&gt; shader.cc\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 || SIMT core timing model. It calls cudu-sim for functional simulation of a particular thread and cuda-sim would return with performance-sensitive information for the thread. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| stack.h &lt;BR&gt; stack.cc\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 || Simple stack used by immediate post-dominator thread scheduler. (deprecated) </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| stats.h\u00a0 \u00a0  || Defines the enums that categorize memory accesses and various stall conditions at the memory pipeline. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| stat-tool.h &lt;BR&gt; stat-tool.cc\u00a0 \u00a0  || Implements various tools for performance measurements.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| visualizer.h &lt;BR&gt; visualizer.cc\u00a0 \u00a0  || Output dynamic statistics for the visualizer </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== intersim ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Only files modified from original booksim are listed. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellspacing=0 cellpadding=3 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! File name || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| booksim_config.cpp || intersim&#039;s configuration options are defined here and given a default value. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| flit.hpp || Modified to add capability of carrying data to the flits. Flits also know which network they belong to.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| interconnect_interface.cpp &lt;br&gt; interconnect_interface.h || The interface between GPGPU-Sim and intersim is implemented here. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| iq_router.cpp &lt;br&gt; iq_router.hpp || Modified to add support for output_extra_latency (Used to create Figure 10 of [http</ins>://<ins class=\"diffchange diffchange-inline\">ieeexplore</ins>.<ins class=\"diffchange diffchange-inline\">ieee</ins>.org<ins class=\"diffchange diffchange-inline\">:80/xpl/articleDetails.jsp?reload=true&amp;arnumber=4919648 ISPASS 2009 paper]).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| islip.cpp || Some minor edits to fix an out of array bound error.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| Makefile || Modified to create a library instead of the standalone network simulator.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| stats.cpp &lt;br&gt;\u00a0 stats.hpp || Stat collection functions are in this file. We have made some minor tweaks. E.g. a new function called NeverUsed is added that tell if that particular stat is ever updated or not. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| statwraper.cpp &lt;br&gt;\u00a0 statwraper.h || A wrapper that enables using the stat collection capabilities implemented in Stat class in stats.cpp in C files.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|- </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| trafficmanager.cpp &lt;br&gt; trafficmanager.hpp || Heavily modified from original booksim. Many high level operations are done here.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Option Parser ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As you modify GPGPU-Sim for your research, you will likely add features that you want to configure differently in different simulations.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim 3.x provides a generic command-line option parser that allows different software modules to register their options through a simple interface.\u00a0 The option parser is instantiated in gpgpu_ptx_sim_init_perf() in gpgpusim_entrypoint.cc. Options are added in gpgpu_sim_config::reg_options() using the function:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> void option_parser_register(option_parser_t opp, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const char *name, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  enum option_dtype type, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  void *variable, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const char *desc,\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const char *defaultvalue);</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Here is the description for each parameter:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* option_parser_t opp - The option parser identifier.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* const char *name - The string the identify the command-line option.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* enum option_dtype type - Data type of the option. It can be one of the following:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** int</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** unsigned int</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** long long</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** unsigned long long</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** bool (as int in C)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** float</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** double</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** c-string (a.k.a. char*) </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* void *variable - Pointer to the variable.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* const char *desc - Description of the option as displayed</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* const char *defaultvalue - Default value of the option (the string value will be automatically parser). You can set this to NULL for this c-string variables. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Look inside gpgpu-sim/gpu-sim.cc for examples. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The option parser is implemented with the &lt;tt&gt;OptionParser&lt;/tt&gt; class in option_parser.cc (exposed in the C interface as &lt;tt&gt;option_parser_t&lt;/tt&gt; in option_parser.h).\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Here is the full C interface that is used by the rest of GPGPU-Sim: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &#039;&#039;&#039;option_parser_register&#039;&#039;&#039;() - Create a binding between an option name (a string) and a variable in the simulator.\u00a0 The variable is passed by reference (pointer), and it will be modified when option_parser_cmdline() or option_parser_cfgfile() is called.\u00a0 Notice that each option can only be bind to a single variable.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &#039;&#039;&#039;option_parser_cmdline&#039;&#039;&#039;() - Parse the given command line options.\u00a0 Calls option_parser_cmdline() for option -config &lt;config filename&gt;. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &#039;&#039;&#039;option_parser_cfgfile&#039;&#039;&#039;() - Parse a given file containing the configuration options. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &#039;&#039;&#039;option_parser_print&#039;&#039;&#039;() - Dump all the registered options and their parsed value. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Only one &lt;tt&gt;OptionParser&lt;</ins>/<ins class=\"diffchange diffchange-inline\">tt&gt; object is instantiated in GPGPU-Sim, inside &lt;tt&gt;gpgpu_ptx_sim_init_perf()&lt;</ins>/<ins class=\"diffchange diffchange-inline\">tt&gt; in gpgpusim_entrypoint.cc.\u00a0 This &lt;tt&gt;OptionParser&lt;</ins>/<ins class=\"diffchange diffchange-inline\">tt&gt; object converts the simulation options in gpgpusim.config into variables values that can be accessed within the simulator.\u00a0 Different modules in GPGPU</ins>-<ins class=\"diffchange diffchange-inline\">Sim registers their options into the &lt;tt&gt;OptionParser&lt;/tt&gt; object (i.e. specifying which option corresponds to which variable in the simulator).\u00a0 After that, the simulator calls option_parser_cmdline() to parse the simulation options contained in gpgpusim.config.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Internally, the &lt;tt&gt;OptionParser&lt;/tt&gt; class contains as set of &lt;tt&gt;OptionRegistry&lt;/tt&gt;.\u00a0 &lt;tt&gt;OptionRegistry&lt;/tt&gt; is a template class that uses the &gt;&gt; operator for parsing.\u00a0 Each instance of &lt;tt&gt;OptionRegistry&lt;/tt&gt; is responsible for parsing one option to a particular type of variable.\u00a0 Currently the parser only supports the following data types, but it is possible to extend supports to more complex data types by overloading the &gt;&gt; operator: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* 32-bit/64-bit integers</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* floating points (float and doubles)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* booleans</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* c-style strings (char*)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Abstract Hardware Model ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The files abstract_hardware_model{.h,.cc} provide a set of classes that interface between functional and timing simulator.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Hardware Abstraction Model Objects ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=2 cellpadding=3 cellspacing=0 class=&quot;wikitable&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Enum Name !! Description </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| _memory_space_t\u00a0 \u00a0 \u00a0 \u00a0 || Memory space type (register, local, shared, global, ...)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| uarch_op_t\u00a0 \u00a0 \u00a0  || Type of operation (ALU_OP, SFU_OP, LOAD_OP, STORE_OP, ...)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| _memory_op_t\u00a0  || Defines whether instruction accesses (load or store) memory. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| cudaTextureAddressMode\u00a0  || CUDA texture address modes. It can specify wrapping address mode, clamp to edge address mode, mirror address mode or border address mode.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| cudaTextureFilterMode\u00a0 \u00a0 \u00a0 \u00a0  || CUDA texture filter modes (point or linear filter mode).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| cudaTextureReadMode\u00a0 \u00a0 \u00a0  || CUDA texture read mode. Specifies read texture mode as element type or normalized float</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| cudaChannelFormatKind\u00a0 \u00a0  || Data type used by CUDA runtime which specifies channel format. This is an argument of cudaCreateChannelDesc(...).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| mem_access_type\u00a0 || Different types of accesses in the timing simulator to different types of memories. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| cache_operator_type\u00a0 || Different types of L1 data cache access behavior provided by PTX</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| divergence_support_t\u00a0 \u00a0 \u00a0  || Different control flow divergence handling model. (post dominator is supported)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Class Name || Description</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| class kernel_info_t\u00a0 \u00a0  || Holds information of a kernel. It contains information like kernel function(function_info), grid and block size and list of active threads inside that kernel (ptx_thread_info).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| class core_t\u00a0  || Abstract base class of a core for both functional and performance model.\u00a0 shader_core_ctx (the class that implements a SIMT core in timing model) is derived from this class. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| struct cudaChannelFormatDesc\u00a0 \u00a0 \u00a0 || Channel descriptor structure. It keeps channel format and number of bits of each component.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| struct cudaArray\u00a0 \u00a0 \u00a0 \u00a0 || Structure for saving data of arrays in GPGPU-Sim. It uses whenever main program calls cuda_malloc, cuda_memcopy, cuda_free and etc.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| struct textureReference\u00a0 \u00a0 \u00a0 || Data type used by cuda runtime to specifies texture references. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| class gpgpu_functional_sim_config\u00a0 \u00a0 \u00a0  || Functional simulator configuration options.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| class gpgpu_t\u00a0  ||\u00a0 The top-level class that implements a functional GPU simulator.\u00a0 It contains the functional simulator configuration (gpgpu_functional_sim_config) and holds that actual buffer that implements global/texture memory spaces (instances of class memory_space).\u00a0 It has a set of member functions that provides managements to the simulated GPU memory space (malloc, memcpy, texture-bindin, ...).\u00a0 These member functions are called by the CUDA/OpenCL API implementations.\u00a0 Class gpgpu_sim (the top-level GPU timing simulation model) is derived from this class.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| struct gpgpu_ptx_sim_kernel_info\u00a0 \u00a0 || Holds properties of a kernel function such as PTX version and target machine. Also holds amount of memory and registers using by that kernel.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| struct gpgpu_ptx_sim_arg || Holds information of kernel arguments/paramters which is set in cudaSetupArgument(...) and _cl_kernel::bind_args(...) for CUDA and OpenCL respectively.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| class memory_space_t\u00a0 || Information of a memory space like type of memory and number of banks for this memory space. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| class mem_access_t || Contains information of each memory access in the timing simulator. This class has information about type of memory access, requested address, size of data and active masks of threads inside warp accessing the memory.This class is used as one of parameters of mem_fetch class, which basically instantiated for each memory access. This class is for interfacing between two different level of memory and passing through interconnects. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| struct dram_callback_t|| This class is the one who is responsible for atomic operations. The function pointer is set during functional simulation(atom_impl()) to atom_callback(...). During timing simulation this function pointer is being called when in l2cache memory controller pop memory partition unit to interconnect. This function is supposed to compute result of atomic operation and saving it in memory.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| class inst_t || Base class of all instruction classes. This class contains information about type and size of instruction, address of instruction, inputs and outputs, latency and memory scope (memory_space_t) of the instruction. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| class warp_inst_t|| Data of instructions need during timing simulation. Each instruction (ptx_instruction) which is inherited from warp_inst_t contains data for timing and functional simulation. ptx_instruction is filled during functional simulation. After this level program needs only timing information so it casts ptx_instruction to warp_inst_t (some data is being loosed) for timing simulation. warp_inst_t inherited from inst_t. It holds warp_id, active thread mask inside the warp, list of memory accesses (mem_access_t) and information of threads inside that warp (per_thread_info)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== GPGPU-sim - Performance Simulation Engine ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In GPGPU-Sim 3.x the performance simulation engine is implemented via numerous classes defined and implemented in the files under &lt;gpgpu-sim_root&gt;/src/gpgpu-sim/.\u00a0 These classes are brought together via the top-level class gpgpu_sim, which is derived from gpgpu_t (its functional simulation counter part).\u00a0 In the current version of GPGPU-Sim 3.x, only one instance of gpgpu_sim, g_the_gpu, is present in the simulator.\u00a0 Simulation of multiple GPU simultaneously is not currently supported but may be provided in future versions.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This section describes the various classes in the performance simulation engine.\u00a0 These include a set of software objects that models the microarchitecture described\u00a0 [[#Microarchitecture Model |earlier]], this section also describes how the performance simulation engine interfaces with the functional simulation engine, how it interfaces with\u00a0 AerialVision, and various non-trivial software designs we employ in the performance simulation engine. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Performance Model Software Objects ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">One of the more significant changes in GPGPU-Sim 3.x versus 2.x is the introduction of a C++ (mostly) object oriented design for the performance simulation engine.\u00a0 The high level design of the various classes used to implement the performance simulation engine are described in this subsection.\u00a0 These closely correspond to the hardware blocks described earlier.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== SIMT Core Cluster Class ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The [[#SIMT_Core_Clusters |SIMT core clusters]] are modelled by the &lt;tt&gt;simt_core_cluster&lt;/tt&gt; class. This class contains an array of SIMT core objects in &lt;tt&gt;m_core&lt;/tt&gt;. The &lt;tt&gt;simt_core_cluster::core_cycle()&lt;/tt&gt; method simply cycles each of the SIMT cores in order. The &lt;tt&gt;simt_core_cluster::icnt_cycle()&lt;/tt&gt; method pushes memory requests into the SIMT Core Cluster&#039;s &#039;&#039;response FIFO&#039;&#039; from the interconnection network. It also pops the requests from the FIFO and sends them to the appropriate core&#039;s instruction cache or LDST unit. The &lt;tt&gt;simt_core_cluster::icnt_inject_request_packet(...)&lt;/tt&gt; method provides the SIMT cores with an interface to inject packets into the network.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== SIMT Core Class ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The [[#SIMT_Cores |SIMT core]] microarchitecture shown in &lt;xr id=&quot;fig:simt_core&quot;/&gt; is implemented with the class shader_core_ctx in shader.h/cc.\u00a0 Derived from class core_t (the abstract functional class for a core), this class combines all the different objects that implements various parts of the SIMT core microarchitecture model: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A collection of shd_warp_t objects which models the simulation state of each warp in the core.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A SIMT stack, simt_stack object, for each warp to handle branch divergence. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A set of scheduler_unit objects, each responsible for selecting one or more instructions from its set of warps and issuing those instructions for execution. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A Scoreboard object for detecting data hazard. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* An opndcoll_rfu_t object, which models an operand collector. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A set of simd_function_unit objects, which implements the SP unit and the SFU unit (the ALU pipelines). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A ldst_unit object, which implements the memory pipeline.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A shader_memory_interface which connects the SIMT core to the corresponding SIMT core cluster.\u00a0 Each memory request goes through this interface to be serviced by one of the memory partitions.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Every core cycle, shader_core_ctx::cycle() is called to simulate one cycle at the SIMT core.\u00a0 This function calls a set of member functions that simulate the core&#039;s pipeline stages in reverse order to model the pipelining effect: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* fetch()</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* decode()</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* issue()</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* read_operand()</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* execute()</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* writeback() </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The various pipeline stages are connected via a set of pipeline registers which are pointers to warp_inst_t objects (with the exception of Fetch and Decode, which connects via a ifetch_buffer_t object).\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Each shader_core_ctx object refers to a common shader_core_config object when accessing configuration options specific to the SIMT core.\u00a0 All shader_core_ctx objects also link to a common instance of a shader_core_stats object which keeps track of a set of performance measurements for all the SIMT cores. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Fetch and Decode Software Model =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This section describes the software modelling [[GPGPU-Sim_3.x_Manual#Fetch_and_Decode | Fetch and Decode]].\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The I-Buffer shown in &lt;xr id=&quot;fig:overall_arch&quot;/&gt;\u00a0 is implemented as an array of shd_warp_t objects inside shader_core_ctx. Each shd_warp_t has a set m_ibuffer of I-Buffer entries (ibuffer_entry) holding a configurable number of instructions (the maximum allowable instructions to fetch in one cycle).\u00a0 Also, shd_warp_t has flags that are used by the schedulers to determine the eligibility of the warp for issue.\u00a0 The decoded instructions stored in an ibuffer_entry as a pointer to a warp_inst_t object. The warp_inst_t holds information about the type of the operation of this instruction and the operands used. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Also, in the fetch stage, the shader_core_ctx::m_inst_fetch_buffer variable acts as a pipeline register between the fetch (instruction cache access) and the decode stage. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">If the decode stage is not stalled (i.e. shader_core_ctx::m_inst_fetch_buffer is free of valid instructions), the fetch unit works. The outer for loop implements the round robin scheduler, the last scheduled warp id is stored in m_last_warp_fetched. The first if-statement checks if the warp has finished execution, while inside the second if-statement, the actual fetch from the instruction cache, in case of hit or the memory access generation, in case of miss are done. The second if-statement mainly checks if there are no valid instructions already stored in the entry that corresponds the currently checked warp.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The decode stage simply checks the shader_core_ctx::m_inst_fetch_buffer and start to store the decoded instructions (current configuration decode up to two instructions per cycle) in the instruction buffer entry (m_ibuffer, an object of shd_warp_t::ibuffer_entry) that corresponds to the warp in the shader_core_ctx::m_inst_fetch_buffer.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Schedule and Issue Software Model =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Within each core, there are a configurable number of scheduler units. The function shader_core_ctx::issue() iterates over these units where each one of them executes scheduler_unit::cycle(), where a round robin algorithm is applied on the warps. In the scheduler_unit::cycle(), the instruction is issued to its suitable execution pipeline using the function shader_core_ctx::issue_warp(). Within this function, instructions are functionally executed by calling shader_core_ctx::func_exec_inst() and the SIMT stack (m_simt_stack[warp_id]) is updated by calling simt_stack::update(). Also, in this function, the warps are held/released due to barriers by shd_warp_t:set_membar() and barrier_set_t::warp_reaches_barrier. On the other hand, registers are reserved by Scoreboard::reserveRegisters() to be used later by the scoreboard algorithm.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The scheduler_unit::m_sp_out, scheduler_unit::m_sfu_out, scheduler_unit::m_mem_out points to the first pipeline register between the issue stage and the execution stage of SP, SFU and Mem pipline receptively. That is why they are checked before issuing any instruction to its corresponding pipeline using shader_core_ctx::issue_warp().</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== SIMT Stack Software Model =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For each scheduler unit there is an array of SIMT stacks. Each SIMT stack corresponds to one warp. In the scheduler_unit::cycle(), the top of the stack entry for the SIMT stack of the scheduled warp determines the issued instruction. The program counter of the top of the stack entry is normally consistent with the program counter of the next instruction in the I-Buffer that corresponds to the scheduled warp (Refer to [[GPGPU-Sim_3.x_Manual#SIMT_Stack | SIMT Stack]]). Otherwise, in case of control hazard, they will not be matched and the instructions within the I-Buffer are flushed.\u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The implementation of the SIMT stack is in the simt_stack class in shader.h. The SIMT stack is updated after each issue using this function simt_stack::update(...). This function implements the algorithm required at divergence and reconvergence points. Functional execution (refer to [[#CUDA-sim_-_Functional_Simulation_Engine|Instruction Execution]]) is performed at the issue stage before updating the SIMT stack. This allows the issue stage to have information of the next pc of each thread, hence, to update the SIMT stack as required.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Scoreboard Software Model =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The scoreboard unit is instantiated in shader_core_ctx as a member object, and passed to scheduler_unit via reference (pointer). It stores both shader core id and a register table index by the warp ids. This register table stores the number of registers reserved by each warp. The functions Scoreboard::reserveRegisters(...), Scoreboard::releaseRegisters(...) and Scoreboard::checkCollision(...) are used to reserve registers, </ins>release <ins class=\"diffchange diffchange-inline\">register and to check for collision before issuing a warp respectively.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Operand Collector Software Model =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The operand collector is modeled as one stage in the main pipeline executed by the function shader_core_ctx::cycle(). This stage is represented by the shader_core_ctx::read_operands() function. Refer to [[GPGPU-Sim_3.x_Manual#ALU_Pipelines|ALU Pipeline]] for more details about the interfaces of the operand collector.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The class opndcoll_rfu_t models the operand collector based register file unit. It contains classes that abstracts the collector unit sets, the arbiter and the dispatch units.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The opndcoll_rfu_t::allocate_cu(...) is responsible to allocate warp_inst_t to a free operand collector unit within its assigned sets of operand collectors. Also it adds a read requests for all source operands in their corresponding bank queues in the arbitrator.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">However, opndcoll_rfu_t::allocate_reads(...) processes read requests that do not have conflicts, in other words, the read requests that are in different register banks and do not go to the same operand collector are popped from the arbitrator queues. This accounts for write request priority over read requests.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The function opndcoll_rfu_t::dispatch_ready_cu() dispatches the operand registers of ready operand collectors (with all operands are collected) to the execute stage.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The function opndcoll_rfu_t::writeback( const warp_inst_t &amp;inst ) is called at the write back stage of the memory pipeline. It is responsible to the allocation of writes.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This summarizes the highlights of the main functions used to model the operand collector, however, more details are in the implementations of the opndcoll_rfu_t class in both shader.cc and shader.h.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== ALU Pipeline Software Model =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The timing model of SP unit and SFU unit are mostly implemented in the &lt;tt&gt;pipelined_simd_unit&lt;/tt&gt; class defined in &lt;tt&gt;shader.h&lt;/tt&gt;.\u00a0 The specific classes modelling the units (&lt;tt&gt;sp_unit&lt;/tt&gt; and &lt;tt&gt;sfu&lt;/tt&gt; class) are derived from this class with overridden &lt;tt&gt;can_issue()&lt;/tt&gt; member function to specify the types of instruction executable by the unit.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The SP unit is connected to the operation collector unit via the &lt;tt&gt;OC_EX_SP&lt;/tt&gt; pipeline register;\u00a0 the SFU unit is connected to the operand collector unit via the &lt;tt&gt;OC_EX_SFU&lt;/tt&gt; pipeline register.\u00a0 Both units shares a common writeback stage via the &lt;tt&gt;WB_EX&lt;/tt&gt; pipeline register.\u00a0 To prevent two units from stalling for writeback stage conflict, each instruction going into either unit has to allocate a slot in the result bus (m_result_bus) before it is issued into the destined unit (see &lt;tt&gt;shader_core_ctx::execute()&lt;/tt&gt;). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The following figure provides an overview to how &lt;tt&gt;pipelined_simd_unit&lt;/tt&gt; models the throughput and latency for different types of instruction.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:simd_unit_sw_arch&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:simd_function_unit.png|center|frame|&lt;caption&gt;Software Design of Pipelined SIMD Unit&lt;/caption&gt;]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In each &lt;tt&gt;pipelined_simd_unit&lt;/tt&gt;, the &lt;tt&gt;issue(warp_inst_t*&amp;)&lt;/tt&gt; member function moves the contents of the given pipeline registers into &lt;tt&gt;m_dispatch_reg&lt;/tt&gt;.\u00a0 The instruction then waits at &lt;tt&gt;m_dispatch_reg&lt;/tt&gt; for &lt;tt&gt;initiation_interval&lt;/tt&gt; cycles.\u00a0 In the meantime, no other instruction can be issued into this unit, so this wait models the throughput of the instruction.\u00a0 After the wait, the instruction is dispatched to the internal pipeline registers &lt;tt&gt;m_pipeline_reg&lt;/tt&gt; for latency modelling.\u00a0 The dispatching position is determined so that time spent in &lt;tt&gt;m_dispatch_reg&lt;/tt&gt; are accounted towards the latency as well.\u00a0 Every cycle, the instructions will advances through the pipeline registers and eventually into &lt;tt&gt;m_result_port&lt;/tt&gt;, which is the shared pipeline register leading to the common writeback stage for both SP and SFU units.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The throughput and latency of each type of instruction are specified at &lt;tt&gt;ptx_instruction::set_opcode_and_latency()&lt;/tt&gt; in &lt;tt&gt;cuda-sim.cc&lt;/tt&gt;.\u00a0 This function is called during pre-decode.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Memory Stage Software Model =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The &lt;tt&gt;ldst_unit&lt;/tt&gt; class inside shader.cc implements the memory stage of the shader pipeline.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The class instantiates and operates on all the in-shader memories: texture (&lt;tt&gt;m_L1T&lt;/tt&gt;), constant (&lt;tt&gt;m_L1C&lt;/tt&gt;) and data (&lt;tt&gt;m_L1D&lt;/tt&gt;). &lt;tt&gt;ldst_unit::cycle()&lt;/tt&gt; implements the guts of the unit&#039;s operation and is pumped &lt;tt&gt;m_config-&gt;mem_warp_parts&lt;/tt&gt; times pre core cycle.\u00a0 This is so fully coalesced memory accesses can be processed in one shader cycle.\u00a0 &lt;tt&gt;ldst_unit::cycle()&lt;/tt&gt; processes the memory responses from the interconnect (stored in &lt;tt&gt;m_response_fifo&lt;/tt&gt;), filling the caches and marking stores as complete.\u00a0 The function also cycles the caches so they can send their requests for missed data to the interconnect.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Cache accesses to each type of L1 memory are done in &lt;tt&gt;shared_cycle(), constant_cycle(), texture_cycle() and memory_cycle()&lt;/tt&gt; respectively.\u00a0 &lt;tt&gt;memory_cycle&lt;/tt&gt; is used to access the L1 data cache.\u00a0 Each of these functions then calls &lt;tt&gt;process_memory_access_queue()&lt;/tt&gt; which is a universal function that pulls an access off the instructions internal access queue and sends this request to the cache.\u00a0 If this access cannot be processed in this cycle (i.e. it neither misses nor hits in the cache which can happen when various system queues are full or when all the lines in a particular way have been reserved and are not yet filled) then the access is attempted again next cycle. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">It is worth noting that not all instructions reach the writeback stage of the unit. All store instructions and load instructions where all requested cache blocks hit exit the pipeline in the &lt;tt&gt;cycle&lt;/tt&gt; function.\u00a0 This is because they do not have to wait for a response from the interconnect and can by-pass the writeback logic that book-keeps the cache lines requested by the instruction and those that have been returned.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Cache Software Model =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">gpu-cache.h implements all the caches used by the &lt;tt&gt;ldst_unit&lt;/tt&gt;. Both the constant cache and the data cache contain a member &lt;tt&gt;tag_array&lt;/tt&gt; object which implements the reservation and replacement logic.\u00a0 The &lt;tt&gt;probe()&lt;/tt&gt; function checks for a block address without effecting the LRU position of the data in question, while &lt;tt&gt;access()&lt;/tt&gt; is meant to model a look-up that effects the LRU position and is the function that generates the miss and access statistics.\u00a0 MSHR&#039;s are modeled with the &lt;tt&gt;mshr_table&lt;/tt&gt; class emulates a fully associative table with a finite number of merged requests. Requests are released from the MSHR through the &lt;tt&gt;next_access()&lt;/tt&gt; function.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The &lt;tt&gt;read_only_cache&lt;/tt&gt; class is used for the constant cache and as the base-class for the &lt;tt&gt;data_cache&lt;/tt&gt; class.\u00a0 This hierarchy can be somewhat confusing because R/W data cache extends from the &lt;tt&gt;read_only_cache&lt;/tt&gt;. The only reason for this is that they share much of the same functionality, with the exception of the &lt;tt&gt;access&lt;/tt&gt; function which deals has to deal with writes in the &lt;tt&gt;data_cache&lt;/tt&gt;.\u00a0 The L2 cache is also implemented with the &lt;tt&gt;data_cache&lt;/tt&gt; class.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The &lt;tt&gt;tex_cache&lt;/tt&gt; class implements the texture cache outlined in the architectural description above.\u00a0 It does not use the &lt;tt&gt;tag_array&lt;/tt&gt; or &lt;tt&gt;mshr_table&lt;/tt&gt; since it&#039;s operation is significantly different from that of a conventional cache.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Thread Block / CTA / Work Group Scheduling =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The scheduling of Thread Blocks to SIMT cores occurs in &lt;tt&gt;shader_core_ctx::issue_block2core(...)&lt;/tt&gt;. The maximum number of thread blocks (or CTAs or Work Groups) that can be concurrently scheduled on a core is calculated by the function &lt;tt&gt;shader_core_config::max_cta(...)&lt;/tt&gt;. This function determines the maximum number of thread blocks that can be concurrently assigned to a single SIMT core based on the number of threads per thread block specified by the program, the per-thread register usage, the shared memory usage, and the configured limit on maximum number of thread blocks per core. Specifically, the number of thread blocks that could be assigned to a SIMT core if each of the above criteria was the limiting factor is computed. The minimum of these is the maximum number of thread blocks that can be assigned to the SIMT core.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In &lt;tt&gt;shader_core_ctx::issue_block2core(...)&lt;/tt&gt;, the thread block size is first padded to be an exact multiple of the warp size. Then a range of free hardware thread ids is determined. The functional state for each thread is initialized by calling &lt;tt&gt;ptx_sim_init_thread&lt;/tt&gt;. The SIMT stacks and warp states are initialized by calling &lt;tt&gt;shader_core_ctx::init_warps&lt;/tt&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">When each thread finishes, the SIMT core calls &lt;tt&gt;register_cta_thread_exit(...)&lt;/tt&gt; to update the active thread block&#039;s state. When all threads in a thread block have finished, the same function decreases the count of thread blocks active on the core, allowing more thread blocks to be scheduled in the next cycle. New thread blocks to be scheduled are selected from pending kernels.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Interconnection Network ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The interconnection network interface has a few functions as follows. These function are implemented in interconnect_interface.cpp. These function are wrapped in icnt_wrapper.cpp.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The original intention for having icnt_wrapper.cpp was to allow other network simulators to hook up to GPGPU-Sim.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*init_interconnect(): Initialize the network simulator. Its inputs are the interconnection network&#039;s configuration file and the number of SIMT core clusters and memory nodes.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*interconnect_push(): which specifies a source node, a destination node, a pointer to the packet to be transmitted and the packet size (in bytes).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*interconnect_pop(): gets a node number as input and it returns a pointer to the packet that was waiting to be ejected at that node. If there is no packet it returns NULL.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*interconnect_has_buffer(): gets a node number and the packet size to be sent as input and returns one(true) if the input buffer of the source node has enough space.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*advance_interconnect(): Should be called every interconnection clock cycle. As name says it perform all the internal steps of the network for one cycle.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*interconnect_busy(): Returns one(true) if there is a packet in flight inside the network.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">*interconnect_stats(): Prints network statistics.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">====Clock domain crossing for intersim ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Ejecting a packet from network =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">We effectively have a two stage buffer per virtual channel at the output, the first stage contains a buffer per virtual channel that has the same space as the buffers internal to the network, the next stage buffer per virtual channel is where we cross from one clock domain to the other--we push flits into the second stage buffer in the interconnect clock domain, and remove whole packets from the second stage buffer in the shader or L2/DRAM clock domain. We return a credit only when we are able to move a flit from the first stage buffer to the second stage buffer (and this occurs at the interconnect clock frequency).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=====\u00a0 Ejection interface details =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Here is a more detailed explanation of the clock boundary implementation: At the ejection port of each router we have as many buffers as the number of Virtual Channels. Size of each buffer is exactly equal to VC buffer size. These are the first stage of buffers mentioned above. Let&#039;s call the second stage of buffers (again as many as VCs) boundary buffers. These buffers are sized to hold 16-flits each by default (this is a configurable option called boudry_buf_size). When a router tries to eject a flit, the flit is put in the corresponding first stage buffers based on the VC it&#039;s coming from ( No credit is sent back yet). Then the boundary buffers are checked to see if they have space; a flit is popped from the corresponding ejection buffer and pushed to the boundary buffer if it has space (this is done for all buffers in the same cycle). At this point the flit is also pushed to a credit return queue. Routers can pop 1 flit per network cycle from this credit return queue and generate its corresponding credit. The shader (or L2/DRAM) side pops the boundary buffer every shader or (DRAM/L2 cycle) and gets a full &quot;Packet&quot;. i.e. If the packet is 4 flits it frees up 4 slots in the boundary buffer;if it&#039;s 1 flit it only frees up 1 flit. Since boundary buffers are as many as VCs shader (or DRAM) pops them in round robin. (It can only get 1 packet per cycle) In this design the first stage buffer always has space for the flits coming from router and as boundary buffers get full the flow of credits backwards will stop.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Note that the implementation described above is just our way of implementing the interface logic in the simulator and not necessarily the way the network interface is actually implemented in real hardware.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:icnt_ejection&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:Ejection.png|center|frame|&lt;caption&gt;Clock Boundary Implementation&lt;/caption&gt;]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Injecting a packet to the network =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Each node of the network has an input buffer. This input buffer size is configurable via input_buffer_size option in the interconnect config file. In order to inject a packet into the interconnect first the input buffer capacity is checked by calling interconnect_has_buffer(). If there is enough space the packet will be pushed to interconnect by calling interconnect_push().</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;!--TODO:CHECK--&gt;These steps are done in the shader clock domain (in the memory stage) and in the interconnect clock domain for memory nodes.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Every-time advance_interconnect() function is called (in the interconnect clock domain) flits are taken out of the input buffer on each node and actually start traveling in the network (if possible).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Memory Partition ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The Memory Partition is modelled by the &lt;tt&gt;memory_partition_unit&lt;/tt&gt; class defined inside &lt;tt&gt;l2cache.h&lt;/tt&gt; and &lt;tt&gt;l2cache.cc&lt;/tt&gt;. These files also define an extended version of the &lt;tt&gt;mem_fetch_allocator&lt;/tt&gt;, &lt;tt&gt;partition_mf_allocator&lt;/tt&gt;, for generation of &lt;tt&gt;mem_fetch&lt;/tt&gt; objects (memory requests) by the Memory Partition and L2 cache.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">From the sub-components described in the [[#Memory Partition | Memory Partition]] micro-architecture model section, the member object of type &lt;tt&gt;data_cache&lt;/tt&gt; models the L2 cache and type &lt;tt&gt;dram_t&lt;/tt&gt; the off-chip DRAM channel. The various queues are modelled using the &lt;tt&gt;fifo_pipeline&lt;/tt&gt; class. The minimum latency ROP queue is modelled as a queue of &lt;tt&gt;rop_delay_t&lt;/tt&gt; structs. The &lt;tt&gt;rop_delay_t&lt;/tt&gt; structs store the minimum time at which each memory request can exit the ROP queue (push time + constant ROP delay). The &lt;tt&gt;m_request_tracket&lt;/tt&gt; object tracks all in-flight requests not fully serviced yet by the Memory Partition to determine if the Memory Partition is currently active.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The Atomic Operation Unit does not have have an associated class. This component is modelled simply by functionally executing the atomic operations of memory requests leaving the &#039;&#039;L2-&gt;icnt queue&#039;&#039;. The next section presents further details.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== Memory Partition Connections and Traffic Flow =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The &lt;tt&gt;gpgpu_sim::cycle()&lt;/tt&gt; method clock all the architectural components in GPGPU-Sim, including the Memory Partition&#039;s queues, DRAM channel and L2 cache bank.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The code segment</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 ::icnt_push( m_shader_config-&gt;mem2device(i), mf-&gt;get_tpc(), mf, response_size );</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 m_memory_partition_unit[i]-&gt;pop();</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">injects memory requests into the interconnect from the Memory Partition&#039;s &#039;&#039;L2-&gt;icnt&#039;&#039; queue. The call to &lt;tt&gt;memory_partition_unit::pop()&lt;/tt&gt; functionally executes atomic instructions. The request tracker also discards the entry for that memory request here indicating that the Memory Partition is done servicing this request.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The call to &lt;tt&gt;memory_partition_unit::dram_cycle()&lt;/tt&gt; moves memory requests from &#039;&#039;L2-&gt;dram&#039;&#039; queue to the DRAM channel, DRAM channel to &#039;&#039;dram-&gt;L2&#039;&#039; queue, and cycles the off-chip GDDR3 DRAM memory.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The call to &lt;tt&gt;memory_partition_unit::push()&lt;/tt&gt; ejects packets from the interconnection network and passes them to the Memory Partition. The request tracker is notified of the request. Texture accesses are pushed directly into the &#039;&#039;icnt-&gt;L2&#039;&#039; queue, while non-texture accesses are pushed into the minimum latency &#039;&#039;ROP&#039;&#039; queue. Note that the push operations into both the &#039;&#039;icnt-&gt;L2&#039;&#039; and &#039;&#039;ROP&#039;&#039; queues are throttled by the size of &#039;&#039;icnt-&gt;L2&#039;&#039; queue as defined in the &lt;tt&gt;memory_partition_unit::full()&lt;/tt&gt; method.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The call to &lt;tt&gt;memory_partition_unit::cache_cycle()&lt;/tt&gt; clocks the L2 cache bank and moves requests into or out of the L2 cache. The next section describes the internals of &lt;tt&gt;memory_partition_unit::cache_cycle()&lt;/tt&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">===== L2 Cache Model =====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Inside &lt;tt&gt;memory_partition_unit::cache_cycle()&lt;/tt&gt;, the call</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 mem_fetch *mf = m_L2cache-&gt;next_access();</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">generates replies for memory requests waiting in filled MSHR entries, as described in the [[#L1 Data Cache | MSHR description]]. Fill responses, i.e. response messages to memory requests generated by the L2 on read misses, are passed to the L2 cache by popping from the &#039;&#039;dram-&gt;L2&#039;&#039; queue and calling</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 m_L2cache-&gt;fill(mf,gpu_sim_cycle+gpu_tot_sim_cycle);</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Fill requests that are generated by the L2 due to read misses are popped from the L2&#039;s miss queue and pushed into the &#039;&#039;L2-&gt;dram&#039;&#039; queue by calling</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 m_L2cache-&gt;cycle();</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">L2 access for memory request exiting the &#039;&#039;icnt-&gt;L2&#039;&#039; queue is done by the call</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 enum cache_request_status status = m_L2cache-&gt;access(mf-&gt;get_partition_addr(),mf,gpu_sim_cycle+gpu_tot_sim_cycle,events);</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">On a L2 cache hit, a response is immediately generated and pushed into the &#039;&#039;L2-&gt;icnt&#039;&#039; queue. On a miss, no request is generated here as the code internal to the cache class has generated a memory request in its miss queue. If the L2 cache is disabled, then memory requests are pushed straight from the &#039;&#039;icnt-&gt;L2&#039;&#039; queue to the &#039;&#039;L2-&gt;dram&#039;&#039; queue.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Also in &lt;tt&gt;memory_partition_unit::cache_cycle()&lt;/tt&gt;, memory requests are popped from the &#039;&#039;ROP&#039;&#039; queue and inserted into the &#039;&#039;icnt-&gt;L2&#039;&#039; queue.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=====DRAM Scheduling and Timing Model=====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The DRAM timing model is implemented in the files dram.h and dram.cc. The timing model also includes an implementation of a FIFO scheduler. The more complicated FRFCFS scheduler is located in dram_sched.h and dram_sched.cc.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The function &lt;tt&gt;dram_t::cycle()&lt;/tt&gt; represents a DRAM cycle. In each cycle, the DRAM pops a request from the request queue then calls the scheduler function to allow the scheduler to select a request to be serviced based on the scheduling policy. Before the requests are sent to the scheduler, they wait in the &#039;&#039;DRAM latency&#039;&#039; queue for a fixed number of SIMT core cycles.\u00a0 This functionality is also implemented inside &lt;tt&gt;dram_t::cycle()&lt;/tt&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 case DRAM_FIFO: scheduler_fifo(); break;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 case DRAM_FRFCFS: scheduler_frfcfs(); break;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The DRAM timing model then checks if any bank is ready to issue a new request based on the different timing constraints specified in the configuration file. Those constraints are represented in the DRAM model by variables similar to this one</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> unsigned int CCDc; //Column to Column Delay</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Those variables are decremented at the end of each cycle.\u00a0 An action is only taken when all of its constraint variables have reached zero.\u00a0 Each taken action resets a set of constraint variables to their original configured values.\u00a0 For example, when a column is activated, the variable CCDc is reset to its original configured value, then decremented by one every cycle. We cannot scheduler a new column until this variable reaches zero.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The Macro DEC2ZERO decrements a variable until it reaches zero, and then it keeps it at zero until another action resets it.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Interface between CUDA-Sim and GPGPU-Sim ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The timing simulator (GPGPU-Sim) interfaces with the functional simulator (CUDA-sim) through the &lt;tt&gt;ptx_thread_info&lt;/tt&gt; class. The &lt;tt&gt;m_thread&lt;/tt&gt; member variable is an array of &lt;tt&gt;ptx_thread_info&lt;/tt&gt; in the SIMT core class &lt;tt&gt;shader_core_ctx&lt;/tt&gt; and maintains a functional state of all threads active in that SIMT core. The timing model communicates with the functional model through the &lt;tt&gt;warp_inst_t&lt;/tt&gt; class which represents a dynamic instance of an instruction being executed by a single warp.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The timing model communicates with the functional model at the following three stages of simulation.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Decoding&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In the decoding stage at &lt;tt&gt;shader_core_ctx::decode()&lt;/tt&gt;, the timing simulator obtains the instruction from the functional simulator given a PC. This is done by calling the &lt;tt&gt;ptx_fetch_inst&lt;/tt&gt; function.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Instruction execution&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># Functional execution: The timing model advances the functional state of a thread by one instruction by calling the &lt;tt&gt;ptx_exec_inst&lt;/tt&gt; method of class &lt;tt&gt;ptx_thread_info&lt;/tt&gt;. This is done inside &lt;tt&gt;core_t::execute_warp_inst_t&lt;/tt&gt;. The timing simulator passes the dynamic instance of the instruction to execute, and the functional model advances the thread&#039;s state accordingly.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># SIMT stack update: After functional execution of an instruction for a warp, the timing model updates the next PC in the SIMT stack by requesting it from the functional model. This happens inside &lt;tt&gt;simt_stack::update&lt;/tt&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"># Atomic callback: If the instruction is an atomic operation, then functional execution of the instruction does not take place in &lt;tt&gt;core_t::execute_warp_inst_t&lt;/tt&gt;. Instead, in the functional execution stage the functional simulator stores a pointer to the atomic instruction in the &lt;tt&gt;warp_inst_t&lt;/tt&gt; object by calling &lt;tt&gt;warp_inst_t::add_callback&lt;/tt&gt;. The timing simulator executes this callback function as the request is leaving the L2 cache (see [[#Memory Partition Connections and Traffic Flow | Memory Partition Connections and Traffic Flow]]).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;Launching Thread Blocks&#039;&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">When new thread blocks are launched in &lt;tt&gt;shader_core_ctx::issue_block2core&lt;/tt&gt;, the timing simulator initializes the per-thread functional state by calling the functional model method &lt;tt&gt;ptx_sim_init_thread&lt;/tt&gt;. Additionally, the timing model also initializes the SIMT stack and warp states by fetching the starting PC from the functional model.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Address Decoding ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Address decoding is responsible for translating linear addresses to raw addresses, which are used to access the appropriate row, column, and bank in DRAM. Address decoding is also responsible for determining which memory controller to send the memory request to. The code for address decoding is found in &#039;&#039;&#039;addrdec.h&#039;&#039;&#039; and &#039;&#039;&#039;addrdec.cc&#039;&#039;&#039;; located in &quot;gpgpu-sim_root/src/gpgpu-sim/&quot;. When a load or store instruction is encountered in the kernel code, a &quot;memory fetch&quot; object is created (defined in mem_fetch.h/mem_fetch.cc). Upon creation, the mem_fetch object decodes the linear address by calling &#039;&#039;ddrdec_tlx(new_addr_type addr /*linear address*/, addrdec_t *tlx /*raw address struct*/)&#039;&#039;. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The interpretation of the linear address can be set to one of 13 predefined configurations by setting &quot;-gpgpu_mem_address_mask&quot; in a &quot;gpgpusim.config&quot; file to one of (0, 1, 2, 3, 5, 6, 14, 15, 16, 100, 103, 106, 160). These configurations specify the bit masks used to extract the chip (memory controller), row, col, bank, and burst from the linear address. A custom mapping can be chosen by setting &quot;-gpgpu_mem_addr_mapping&quot; in a &quot;gpgpusim.config&quot; file to a desired mapping, such as</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 -gpgpu_mem_addr_mapping dramid@8;00000000.00000000.00000000.00000000.0000RRRR.RRRRRRRR.RRBBBCCC.CCCSSSSS</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Where R(r)=row, B(b)=bank, C(c)=column, S(s)=Burst, and D(d) [not shown]=chip.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Also, dramid@&lt;#&gt;\u00a0 means that the address decoder will insert the dram/chip ID starting at bit &lt;#&gt; (counting from LSB) -- i.e., dramid@8 will start at bit 8.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Output to AerialVision Performance Visualizer ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In gpgpu_sim::cycle(), gpgpu_sim::visualizer_printstat() (in gpgpu-sim/visualizer.cc) is called every sampling interval to append a snapshot of the monitored performance metrics to a log file.\u00a0 This log file is the input for the time-lapse view in AerialVision.\u00a0 The log file is compressed via zlib as it is created to minimize disk usage.\u00a0 The sampling interval can be configured by the option &lt;tt&gt;-gpgpu_runtime_stat&lt;/tt&gt;. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">gpgpu_sim::visualizer_printstat() calls a set of functions to sample the performance metrics in various modules: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* cflog_visualizer_gzprint(): Generating data for PC-Histogram (See ISPASS 2010 paper for detail). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* shader_CTA_count_visualizer_gzprint(): The number of CTAs active in each SIMT core. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* shader_core_stats::visualizer_print(): Performance metrics for SIMT cores, including the warp cycle breakdown. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* memory_stats_t::visualizer_print(): Performance metric for memory accesses. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* memory_partition_unit::visualizer_print(): Performance metric for each memory partition. Calls dram_t::visualizer_print(). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* time_vector_print_interval2gzfile(): Latency distribution for memory accesses. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The PC-Histogram is implemented using two classes: thread_insn_span and thread_CFlocality.\u00a0 Both classes can be found in gpgpu-sim/stat-tool.{h,cc}.\u00a0 It is interfaced to the SIMT cores via a C interface: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* create_thread_CFlogger(): Create one thread_CFlocality object for each SIMT core. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* cflog_update_thread_pc(): Update PC of a thread.\u00a0 The new PC will be added to the list of PC that was touched by this thread. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* cflog_visualizer_gzprint(): Output the PC-Histogram of this current sampling interval to the log file.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Histogram ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim provides several types of histogram data types that simplifies generation of value breakdown for any metric.\u00a0 These histogram classes are implemented in histogram.{h,cc}: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &#039;&#039;&#039;binned_histogram&#039;&#039;&#039;: The base histogram with each unique integer value occupying a bin.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &#039;&#039;&#039;pow2_histogram&#039;&#039;&#039;: A Power-Of-Two histogram with each bin representing log2 of the input value.\u00a0 This is useful when the value of a metric can span a large range (differs by orders of magnitude).\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &#039;&#039;&#039;linear_histogram&#039;&#039;&#039;: A histogram with each bin representing a range of values specified by the stride. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">All of the histogram classes offer the same interface: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* constructor([stride], name, nbins, [bins]): Create a histogram with a given name, with nbins # of bins, and with bins located by the given pointer (optional).\u00a0 The stride option is only available to &#039;&#039;&#039;linear_histogram&#039;&#039;&#039;. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* reset_bins(): Reset all the bins to zero.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* add2bin(sample): Add a sample to the histogram.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* fprint(fout): Print the histogram to the given file handle.\u00a0 Here is the output format: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 &lt;name&gt; = &lt;number of samples in each bin&gt; max=&lt;maximum among the samples&gt; avg=&lt;average value of the samples&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Dump Pipeline ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">See [[GPGPU-Sim_3.x_Manual#Visualizing_Cycle_by_Cycle_Microarchitecture_Behavior]] for how this is used. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The top level dump pipeline code is implemented in &lt;tt&gt;gpgpu_sim::pipeline(...)&lt;/tt&gt; in gpgpu-sim/gpu-sim.cc.\u00a0 It calls &lt;tt&gt;shader_core_ctx::display_pipeline(...)&lt;/tt&gt; in gpgpu-sim/shader.cc for the pipeline states in each SIMT core.\u00a0 For memory partition states, it calls &lt;tt&gt;memory_partition_unit::print(...)&lt;/tt&gt; in gpgpu-sim/l2cache.cc.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== CUDA-sim - Functional Simulation Engine ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The src/cuda-sim directory contains files that implement the functional simulation engine used by GPGPU-Sim.\u00a0 For increased flexibility the functional simulation engine interprets instructions at the level of individual scalar operations per vector lane.\u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Key Objects Descriptions ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;kernel_info&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/abstract_hardware_model.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* The &#039;&#039;kernel_info_t&#039;&#039; object contains the GPU grid and block dimensions, the &#039;&#039;function_info&#039;&#039; object associated with the kernel entry point, and memory allocated for the kernel arguments in &#039;&#039;param&#039;&#039; memory.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;ptx_cta_info&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/ptx_sim.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Contains a the thread state (ptx_thread_info) for the set of threads within a cooperative thread array (CTA) (or workgroup in OpenCL). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;ptx_thread_info&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/ptx_sim.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Contains functional simulation state for a single scalar thread (work item in OpenCL).\u00a0 This includes the following: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Register value storage </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Local memory storage (private memory in OpenCL) </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Shared memory storage (local memory in OpenCL).\u00a0 Notice that all scalar threads from the same thread block/workgroup accesses the same shared memory storage. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Program counter (PC)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Call stack </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** Thread IDs (the software ID within a grid launch, and the hardware ID indicating which hardware thread slot it occupies in timing model)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The current functional simulation engine was developed to support NVIDIA&#039;s PTX.\u00a0 PTX is essentially a low level compiler intermediate representation but not the actual machine representation used by NVIDIA hardware (which is known as SASS).\u00a0 Since PTX does not define a binary representation, GPGPU-Sim does not store a binary view of instructions (e.g., as you would learn about when studying instruction set design in an undergraduate computer architecture course).\u00a0 Instead, the text representation of PTX is parsed into a list of objects somewhat akin to a low level compiler intermediate representation.\u00a0  </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Individual PTX instructions are found inside of PTX functions that are either kernel entry points or subroutines that can be called on the GPU.\u00a0 Each PTX function has a function_info object:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;function_info&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/cuda-sim/ptx_ir.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Contains a list of static PTX instructions (ptx_instruction&#039;s) that can be functionally simulated.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* For kernel entry points, stores each of the kernel arguments in a map; &#039;&#039;m_ptx_kernel_param_info&#039;&#039;; however, this might not always be the case for OpenCL applications. In OpenCL, the associated constant memory space can be allocated in two ways: It can be explicitly initialized in the .ptx file where it is declared, or it can be allocated using the clCreateBuffer on the host. In this later case, the .ptx file will contain a global declaration of the parameter, but it will have an unknown array size.\u00a0 Thus, the symbol&#039;s address will not be set and need to be set in the &#039;&#039;function_info::add_param_data(...)&#039;&#039; function before executing the PTX. In this case, the address of the kernel argument is stored in a symbol table in the &#039;&#039;function_info&#039;&#039; object.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The list below describe the class hierarchy used to represent instructions in GPGPU-Sim 3.x.\u00a0 The hierarchy was designed to support future expansion of instruction sets beyond PTX and to isolate functional simulation objects from the timing model.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;inst_t&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/abstract_hardware_model.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Contains an abstract view of a static instruction relevant to the microarchitecture. This includes the opcode type, source and destination register identifiers, instruction address, instruction size, reconvergence point instruction address, instruction latency and initiation interval, and for memory operations, the memory space accessed.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;warp_inst_t&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/abstract_hardware_model.h/cc) (derived from inst_t):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Contains the view of a dynamic instruction relevant to the microarchitecture.\u00a0 This includes per lane dynamic information such as mask status and memory address accessed.\u00a0 To support accurate functional execution of global memory atomic operations this class includes a callback interface to functionally simulate atomic memory operations when they reach the DRAM interface in performance simulation.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;ptx_instruction&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/cuda-sim/ptx_ir.h/cc) (derived from warp_inst_t):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Contains the full state of a dynamic instruction including the interfaces required for functional simulation.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To support functional simulation GPGPU-Sim must access data in the various memory spaces defined in the CUDA and OpenCL memory models.\u00a0 This requires both a way to name locations and a place to store the values in those locations.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For naming locations, GPGPU-Sim initially builds up a &quot;symbol table&quot; representation while parsing the input PTX.\u00a0 This is done using the following classes:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;symbol_table&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/cuda-sim/ptx_ir.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Contains a mapping from the textual representation of a memory location in PTX (e.g., &quot;%r2&quot;, &quot;input_data&quot;, etc...) to a symbol object that contains information about the data type and location.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;symbol&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/cuda-sim/ptx_ir.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Contains information about the name and type of data and its location (address or register identifier) in the simulated GPU memory space.\u00a0 Also tracks where the name was declared in the PTX source.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;type_info&#039;&#039;&#039; and &#039;&#039;&#039;type_info_key&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/cuda-sim/ptx_ir.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Contains information about the type of a data object (used during instruction interpretation).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;operand_info&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/cuda-sim/ptx_ir.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* A wrapper class containing a source operand for an instruction which may be either a register identifier, memory operand (including displacement mode information), or immediate operand.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Storage of dynamic data values used in functional simulation uses different classes for registers and memory spaces.\u00a0 Register values are contained in ptx_thread_info::m_regs which is a mapping from symbol pointer to a C union called &#039;&#039;&#039;ptx_reg_t&#039;&#039;&#039;. Registers are accessed using the method ptx_thread_info::get_operand_value() which uses operand_info as input.\u00a0 For memory operands this method returns the effective address of the memory operand. Each memory space in the programming model is contained in an object of type &#039;&#039;&#039;memory_space&#039;&#039;&#039;.\u00a0 Memory spaces visible to all threads in the GPU are contained in &#039;&#039;&#039;gpgpu_t&#039;&#039;&#039; and accessed via interfaces in ptx_thread_info (e.g., ptx_thread_info::get_global_memory).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;memory_space&#039;&#039;&#039;\u00a0 (&lt;gpgpu-sim_root&gt;/src/cuda-sim/memory.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Abstract base class for implementing memory storage for functional simulation state.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039;&#039;memory_space_impl&#039;&#039;&#039; (&lt;gpgpu-sim_root&gt;/src/cuda-sim/memory.h/cc):</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* To optimize functional simulation performance, memory is implemented using a hash table.\u00a0 The hash table block size is a template argument for the template class memory_space_impl.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== PTX extraction ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Depending on the configuration file, PTX is extracted either from cubin files or using cuobjdump. This section describes the flow of information for extracting PTX and other information. &lt;xr id=&quot;fig:ptxplus_compile_flow&quot;/&gt; shows the possible flows for the extraction.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== From cubin ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">__cudaRegisterFatBinary( void *fatCubin ) in the cuda_runtime_api.cc is the function which is responsible for extracting PTX. This function is called by program for each CUDA file. FatbinCubin is a structure which contains different versions of PTX and cubin corresponded to that CUDA file. GPGPU-Sim extract the newest version of PTX which is not newer than forced_max_capability (defines in simulation parameters). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Using cuobjdump ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In CUDA version 4.0 and later, the fat cubin file used to extract the ptx and sass is not available any more. Instead, cuobjdump is used. cuobjdump is a tool provided by NVidia along with the toolkit that can extract the PTX, SASS as well as other information from the executable. If the option -gpgpu_ptx_use_cuobjdump is set to &quot;1&quot; then GPGPU-Sim will invoke cuobjdump to extract the PTX, SASS and other information from the binary. If conversion to PTXPlus is enabled, the simulator will invoke cuobjdump_to_ptxplus to convert the SASS to PTXPlus. The resulting program is then loaded.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== PTX/PTXPlus loading ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">When the PTX/PTXPlus program is ready, gpgpu_ptx_sim_load_ptx_from_string(...) is called. This function basically use Lex/Yacc to parse the PTX code and create symbol table for that PTX file. Then add_binary(...) is called. This function add created symbol table to CUctx structure which saves all function and symbol table information. Function gpgpu_ptxinfo_load_from_string(...) is invoked in order to extract some information from PTXinfo file. This function run &lt;tt&gt;ptxas&lt;/tt&gt; (the PTX assembler tool from CUDA Toolkit) on PTX file and parse the output file using Lex and Yacc. It extract some information like number of registers using by each kernel from ptxinfo file. Also gpgpu_ptx_sim_convert_ptx_to_ptxplus(...) invoked to to create PTXPlus. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">__cudaRegisterFunction(...) function invoked by application for each device function. This function is generate a mapping between device and host functions. Inside register_function(...) GPGPU-sim searches for symbol table associated with that fatCubin in which device function is located. This function generate a map between kernel entry point and CUDA application function address (host function).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== PTXPlus support ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This subsection describes how PTXPlus is implemented in GPGPU-Sim 3.x.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== PTXPlus Conversion ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim version 3.1.0 and later implement support for native hardware ISA execution (PTXPlus) by using NVIDIA&#039;s &#039;cuobjdump&#039; utility. Currently, PTXPlus is only supported with CUDA 4.0.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">When PTXPlus is enabled, the simulator uses cuobjdump to extract into text format the embedded SASS (NVIDIA&#039;s hardware ISA) image included in CUDA binaries.\u00a0 This text representation of SASS is then converted to our own extension of PTX, called PTXPlus, using a separate executable called cuobjdump_to_ptxplus. In the conversion process, more information is needed than available in the SASS text representation. This information is acquired from the ELF and PTX code also extracted using cuobjdump. cuobjdump_to_ptxplus bundles all this information into a single PTXPlus file. &lt;xr id=&quot;fig:ptxplus_compile_flow&quot;/&gt; depicts the slight differences in run time execution flow when using PTXPlus. Note that there are no changes required in the compilation process of CUDA executables. The conversion process is completely handled by GPGPU-Sim at run-time.\u00a0 Note this flow illustrated in &lt;xr id=&quot;fig:ptxplus_compile_flow&quot;/&gt; is different than the one illustrated in Figure 3(b) of our [http://ieeexplore.ieee.org:80/xpl/articleDetails.jsp?reload=true&amp;arnumber=4919648 ISPASS 2009 paper].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;figure id=&quot;fig:ptxplus_compile_flow&quot;&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">[[Image:Ptx-vs-ptxplus-runtime-flow.png|center|frame|&lt;caption&gt;PTX vs PTXPlus Compile Flow&lt;/caption&gt;]]</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/figure&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The translation from PTX to PTXPlus is performed by &lt;tt&gt;gpgpu_ptx_sim_convert_ptx_and_sass_to_ptxplus()&lt;/tt&gt; located in &lt;tt&gt;ptx_loader.cc&lt;/tt&gt;. &lt;tt&gt;gpgpu_ptx_sim_convert_ptx_and_sass_to_ptxplus()&lt;/tt&gt; is called by called by &lt;tt&gt;usecuobjdump()&lt;/tt&gt; which passes in the SASS, PTX and ELF information. &lt;tt&gt;gpgpu_ptx_sim_convert_ptx_and_sass_to_ptxplus()&lt;/tt&gt; calls cuobjdump_to_ptxplus on those inputs. cuobjdump_to_ptxplus uses the three inputs to create the final PTXPlus version of the original program and this is returned from &lt;tt&gt;gpgpu_ptx_sim_convert_ptx_and_sass_to_ptxplus()&lt;/tt&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== Operation of cuobjdump_to_ptxplus ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">cuobjdump_to_ptxplus uses three files to generate PTXPlus. First, before cuobjdump_to_ptxplus is executed, GPGPU-Sim parses the information output by NVIDIA&#039;s cuobjdump and merely divides that information into multiple files. For each section (a section corresponds to one CUDA binary), three files are generated: .ptx, .sass and .elf. Those files are merely a split of the output of cuobjdump so it can be easily handled by cuobjdump_to_ptxplus.\u00a0 A description of each is provided below:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* .ptx: contains the PTX code corresponding to the CUDA binary</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* .sass: contains the SASS generated by building the PTX code</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* .elf: contains a textual dump of the ELF object</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">cuobjdump_to_ptxplus takes the three files corresponding to a single binary as input and generates a PTXPlus file. Multiple calls are made to cuobjdump_to_ptxplus to convert multiple binaries as needed. Each of the files are parsed, and an elaborate intermediate representation is generated. Multiple functions are then called to output this representation in the form of a PTXPlus file. Below is a description of the information extracted from each of the files:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* .ptx: The ptx file is used to extract information about the available kernels, their function signatures, and information about textures.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* .sass: The sass file is used to extract the actual instructions that will be converted to PTXPlus instructions in the output PTXPlus file. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* .elf: The elf file is used to extract constant and local memory values as well as constant memory pointers.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">==== PTXPlus Implementation ====</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;tt&gt;ptx_thread_info::get_operand_value()&lt;/tt&gt; in &lt;tt&gt;instructions.cc&lt;/tt&gt; determines the current value of an input operand. The following extensions to &lt;tt&gt;get_operand_value&lt;/tt&gt; are meant for PTXPlus execution.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* If a register operands has a &quot;.lo&quot; modifier, only the lower 16 bits are read. If a register operands has a &quot;.hi&quot; modifier, only the higher 16 bits are read. This information is stored in the &lt;tt&gt;m_operand_lohi&lt;/tt&gt; property of the operand. A value of 0 is default while a value of 1 means a &quot;.lo&quot; modifier and a value of 2 means a &quot;.hi&quot; modifier.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* For PTXPlus style 64bit or 128bit operands, the &lt;tt&gt;get_reg()&lt;/tt&gt; function is passed each register name and the final result is constructed by combining the data in each register.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* The return value from &lt;tt&gt;get_double_operand_type()&lt;/tt&gt; indicates the use of one of the new ways of determining the memory address. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** If it&#039;s 1, the value of two registers must be added together. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** If it&#039;s 2, the address is stored in a register and the value in the register is postincremented by a value in a second register.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** If it&#039;s 3, the address is stored in a register and the value in the register is postincremented by a constant.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* For memory operands, the first half of the &lt;tt&gt;get_operand_value()&lt;/tt&gt; function calculates the memory address to access. This value is stored in &lt;tt&gt;result&lt;/tt&gt;. &lt;tt&gt;result&lt;/tt&gt; is used as the address and the appropriate data is fetched from the address space indicated by the operand and returned. For postincrementing memory accesses, the register holding the address is also incremented in &lt;tt&gt;get_operand_value()&lt;/tt&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* If it isn&#039;t a memory operand, the value of the register is returned.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &lt;tt&gt;get_operand_value&lt;/tt&gt; checks for a negative sign on the operand and takes the negative of &lt;tt&gt;finalResult&lt;/tt&gt; if necessary before returning it.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Control Flow Analysis + Pre-decode ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Each kernel function is analyzed and &#039;&#039;pre-decoded&#039;&#039; as it is loaded into GPGPU-Sim.\u00a0 When the PTX parser detects the end of a kernel function, it calls &lt;tt&gt;function_info::ptx_assemble()&lt;/tt&gt; (in cuda-sim/cuda-sim.cc).\u00a0 This function does the following: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Assign each instruction in the function with a unique PC</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Resolve each branch label in the function to a corresponding instruction/PC </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** I.e. Determine the branch target for each branch instruction</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Create control flow graph for the function</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Perform control-flow analysis</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Pre-decode each instruction (to speed up simulation)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Creation of the control flow graph is done via two member functions in &lt;tt&gt;function_info&lt;/tt&gt;: </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &lt;tt&gt;create_basic_blocks&lt;/tt&gt;() groups individual instructions into basic block (&lt;tt&gt;basic_block_t&lt;/tt&gt;). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* &lt;tt&gt;connect_basic_blocks&lt;/tt&gt;() connects the basic blocks to form a control flow graph.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">After creating the control flow graph, two control flow analysis will be done. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Determine the target of each &lt;tt&gt;break&lt;/tt&gt; instruction:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** This is a make-shift solution to support &lt;tt&gt;break&lt;/tt&gt; instruction in PTXPlus, which implements break-statements in while-loops.\u00a0 A long-term solution is to extend the SIMT stack with proper break entries. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** The goal is to determine the latest &lt;tt&gt;breakaddr&lt;/tt&gt; instruction that precedes each &lt;tt&gt;break&lt;/tt&gt; instruction.\u00a0 Assuming that the code has structured control flow.\u00a0 This information can be determined by traversing upstream through the dominator tree (constructed by calling member functions &lt;tt&gt;find_dominators()&lt;/tt&gt; and &lt;tt&gt;find_idominators()&lt;/tt&gt;).\u00a0 However, the control flow graph changes after break instructions are connected to their targets.\u00a0 The current solution is to perform this analysis iteratively until both the dominator tree and the break targets become stable. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** The algorithm for finding dominators are described in Muchnick&#039;s Adv. Compiler Design and Implementation (Figure 7.14 and Figure 7.15).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Find immediate post-dominator of each branch instruction:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** This information is used by the SIMT stack for reconvergence point at a divergent branch. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">** The analysis is done by calling member functions &lt;tt&gt;find_postdominators()&lt;/tt&gt; and &lt;tt&gt;find_ipostdominators()&lt;/tt&gt;.\u00a0 The algorithm is described in Muchnick&#039;s Adv. Compiler Design and Implementation (Figure 7.14 and Figure 7.15).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Pre-decode is performed by calling &lt;tt&gt;ptx_instruction::pre_decode()&lt;/tt&gt; for each instruction.\u00a0 It extracts information that is useful to the timing simulator.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Detect LD/ST instruction.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Determine if the instruction writes to a destination register. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Obtain the reconvergence PC if this is a branch instruction. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Extract the register operands of the instruction.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* Detect predicated instruction. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The extracted information is stored inside the &lt;tt&gt;ptx_instruction&lt;/tt&gt; object corresponding to the instruction.\u00a0 This speeds up simulation because all scalar threads in a kernel launch executes the same kernel function.\u00a0 Extracting these information once as the function is loaded is significantly more efficient than repeating the same extraction for each thread during simulation.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Memory Space Buffer ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In CUDA-sim, the various memory spaces in CUDA/OpenCL are implemented functionally with memory space buffers (the &lt;tt&gt;memory_space_impl&lt;/tt&gt; class in &lt;tt&gt;memory.h&lt;/tt&gt;).\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* All of global, texture, constant memory spaces are implemented with a single &lt;tt&gt;memory_space&lt;/tt&gt; object inside the top-level &lt;tt&gt;gpgpu_t&lt;/tt&gt; class (as member object &lt;tt&gt;m_global_memory&lt;/tt&gt;).\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* The local memory space of each thread is contained in the &lt;tt&gt;ptx_thread_info&lt;/tt&gt; object corresponding to each thread.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">* The shared memory space is common to the entire CTA (thread block), and a unique &lt;tt&gt;memory_space&lt;/tt&gt; object is allocated for each CTA when it is dispatched for execution (in function &lt;tt&gt;ptx_sim_init_thread()&lt;/tt&gt;).\u00a0 The object is deallocated when the CTA has completed execution.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The &lt;tt&gt;memory_space_impl&lt;/tt&gt; class implements the read-write interface defined by abstract class &lt;tt&gt;memory_space&lt;/tt&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Internally, each &lt;tt&gt;memory_space_impl&lt;/tt&gt; object contains a set of memory pages (implemented by class template &lt;tt&gt;mem_storage&lt;/tt&gt;).\u00a0 It uses a STL unordered map (reverts to STL map if unordered map is not available) to associate pages with their corresponding addresses.\u00a0 Each &lt;tt&gt;mem_storage&lt;/tt&gt; object is an array of bytes with read and write functions.\u00a0 Initially, each &lt;tt&gt;memory_space&lt;/tt&gt; object is empty, and pages are allocated on demand as an address corresponding to the individual page in the memory space is accessed (either via an LD/ST instruction or &lt;tt&gt;cudaMemcpy()&lt;/tt&gt;).\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The implementation of &lt;tt&gt;memory_space&lt;/tt&gt;, &lt;tt&gt;memory_space_impl&lt;/tt&gt; and &lt;tt&gt;mem_storage&lt;/tt&gt; can be found in files &lt;tt&gt;memory.h&lt;/tt&gt; and &lt;tt&gt;memory.cc&lt;/tt&gt;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Global/Constant Memory Initialization ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In CUDA, programmer can declare device variables that are accessible to all kernel/device functions.\u00a0 These variables can be in global (e.g. x_d) or constant memory space (e.g. y_c): </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 __device__ int x_d = 100; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 __constant__ int y_c = 70; </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">These variables, and their initial values, are compiled into PTX variables.\u00a0 </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In GPGPU-Sim, these variables are parsed via the PTX parser into (symbol, value) pairs.\u00a0 After the all the PTX are loaded, two functions, &lt;tt&gt;load_stat_globals(...)&lt;/tt&gt; and &lt;tt&gt;load_constants(...)&lt;/tt&gt; are called to assign each variable with a memory address in the simulated global memory space, and copy the initial value to the assigned memory location.\u00a0 The two functions are located inside cuda_runtime_api.cc. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">These variables can also be declared as __global__ in CUDA.\u00a0 In this case, they are accessible by both the host (CPU) and the device (GPU).\u00a0 CUDA accomplish this by having two copies of the same variable in both host memory and device memory.\u00a0 The linkage between the two copies is established using function &lt;tt&gt;__cudaRegisterVar(...)&lt;/tt&gt;.\u00a0 GPGPU-Sim intercept call to this function to acquire this information, and establish a similar linkage by calling functions &lt;tt&gt;gpgpu_ptx_sim_register_const_variable(...)&lt;/tt&gt; or &lt;tt&gt;gpgpu_ptx_sim_register_global_variable(...)&lt;/tt&gt; (implemented\u00a0 in cuda-sim/cuda-sim.cc).\u00a0 With this linkage established, the host may call &lt;tt&gt;cudaMemcpyToSymbol()&lt;/tt&gt; or &lt;tt&gt;cudaMemcpyFromSymbol()&lt;/tt&gt; to access these __global__ variables.\u00a0 GPGPU-Sim implements these functions with &lt;tt&gt;gpgpu_ptx_memcpy_symbol(...)&lt;/tt&gt; in cuda-sim/cuda-sim.cc. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Notice that &lt;tt&gt;__cudaRegisterVar(...)&lt;/tt&gt; is not part of CUDA Runtime API, and future versions of CUDA may implement __global__ variables in a different way.\u00a0 In that case, GPGPU-Sim will need to be modified to support the new implementation.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Kernel Launch: Parameter Hookup ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Kernel parameters in GPGPU-Sim are set using the same methods as in regular CUDA and OpenCL applications;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 &#039;&#039;kernel_name &lt;&lt;x,y&gt;&gt; (param1, param2, ..., paramN)&#039;&#039; and</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 &#039;&#039;clSetKernelArg(kernel_name, arg_index, arg_size, arg_value)&#039;&#039;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">respectively. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Another method to pass the kernel arguments in CUDA is with the use of &#039;&#039;cudaSetupArgument(void* arg, size_t count, size_t offset)&#039;&#039; .</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This function pushes &#039;&#039;count&#039;&#039; bytes of the argument pointed to by &#039;&#039;arg&#039;&#039; at &#039;&#039;offset bytes&#039;&#039; from the start of the parameter passing area, which starts at offset 0. The arguments are stored in the top of the execution stack. For example, if a CUDA kernel is to have 3 arguments, &#039;&#039;a&#039;&#039;, &#039;&#039;b&#039;&#039;, and &#039;&#039;c&#039;&#039; (in this order), the offset for &#039;&#039;a&#039;&#039; is 0, offset for &#039;&#039;b&#039;&#039; is sizeof(a), and offset for &#039;&#039;c&#039;&#039; is sizeof(a)+sizeof(b). </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For both CUDA and OpenCL, GPGPU-Sim creates a &#039;&#039;gpgpu_ptx_sim_arg&#039;&#039; object per kernel argument and maintains a </ins>list <ins class=\"diffchange diffchange-inline\">of all kernel arguments. Prior to executing the kernel, an initialization function is called to setup the GPU grid dimensions and parameters: &#039;&#039;gpgpu_cuda_ptx_sim_init_grid(...)&#039;&#039; or &#039;&#039;gpgpu_opencl_ptx_sim_init_grid(...)&#039;&#039;. Two main objects are used within these functions, the &#039;&#039;function_info&#039;&#039;\u00a0 and &#039;&#039;kernel_info_t&#039;&#039; objects, which are described [[#Key_Objects_Descriptions | above]]. In the init_grid functions, the kernel arguments are added to the &#039;&#039;function_info&#039;&#039; object by calling &#039;&#039;function_info_object::add_param_data(arg #, gpgpu_ptx_sim_arg *)&#039;&#039;. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">After adding all of the parameters to the &#039;&#039;function_info&#039;&#039; object, &#039;&#039;function_info::finalize(...)&#039;&#039; is called, which copies over the kernel arguments, stored in the &#039;&#039;function_info::m_ptx_kernel_param_info&#039;&#039; map, into the parameter memory allocated in the &#039;&#039;kernel_info_t&#039;&#039; object mentioned above. If it was not done previously in &#039;&#039;function_info::add_param_data(...)&#039;&#039;, the address of each kernel argument is added to the symbol table in the &#039;&#039;function_info&#039;&#039; object.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">PTXPlus support requires copying kernel parameters to shared memory. The kernel parameters can be copied from &#039;&#039;Param&#039;&#039; memory to &#039;&#039;Shared&#039;&#039; memory by calling the &#039;&#039;function_info::param_to_shared(shared_mem_ptr, symbol_table)&#039;&#039; function. This function iterates over the kernel parameters stored in the &#039;&#039;function_info::m_ptx_kernel_param_info&#039;&#039; map and copies each parameter from &#039;&#039;Param&#039;&#039; memory to the appropriate location in &#039;&#039;Shared&#039;&#039; memory pointed to by &#039;&#039;ptx_thread_info::shared_mem_ptr&#039;&#039;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The &#039;&#039;function_info::add_param_data(...), function_info::finalize(...), function_info::param_to_shared(...),&#039;&#039; and &#039;&#039;gpgpu_opencl_ptx_sim_init_grid(...)&#039;&#039; functions are defined in &lt;gpgpu-sim_root&gt;/distribution/src/cuda-sim/cuda-sim.cc. The &#039;&#039;gpgpu_cuda_ptx_sim_init_grid(...)&#039;&#039; function is implemented in &lt;gpgpu-sim_root&gt;/distribution/libcuda/cuda_runtime_api.cc.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Generic Memory Space ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Generic addressing is a feature that was introduced in NVIDIA&#039;s PTX 2.0, with generic addressing supported by instructions ld, ldu, st, prefetch, prefetchu, isspacep, cvta, atom, and red. In generic addressing, an address maps to global memory unless it falls within the local memory window or the shared memory window. Within these windows, an address maps to the corresponding location in local or shared memory, i.e. to the address formed by subtracting the window base from the generic address to form the offset in the implied state space. So an instruction can use generic addressing to deal with addresses that may correspond to global, local or shared memory space. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The generic addressing in GPGPU-Sim is supported with instructions ld, ldu, st, isspacep and cvta. Functions generic_to_{shared, local, global}, {shared, local, global}_to_generic and isspace_{shared, local, global} (which are all defined in &quot;cuda_sim.cc&quot;) are used to support the generic addressing in GPGPU-Sim with the previously mentioned instructions. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Identifiers (SHARED_GENERIC_START, SHARED_MEM_SIZE_MAX, LOCAL_GENERIC_START, TOTAL_LOCAL_MEM_PER_SM, LOCAL_MEM_SIZE_MAX, GLOBAL_HEAP_START and STATIC_ALLOC_LIMIT) which are defined in &quot;abstract_hardware_model.h&quot; are used to define the different memory spaces boundaries (windows). These identifiers are used to derive and generate different address spaces which are needed to support generic addressing. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The follwing table shows an example of how the spaces are defined in the code:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellspacing=0 cellpadding=3 align=&quot;center&quot; class=&quot;wikitable&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! Identifier || Value</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| GLOBAL_HEAP_START || 0x80000000</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| SHARED_MEM_SIZE_MAX || 64*1024</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| LOCAL_MEM_SIZE_MAX || 8*1024</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| MAX_STREAMING_MULTIPROCESSORS || 64</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| MAX_THREAD_PER_SM || 2048</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| TOTAL_LOCAL_MEM_PER_SM || MAX_THREAD_PER_SM*LOCAL_MEM_SIZE_MAX</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| TOTAL_SHARED_MEM || MAX_STREAMING_MULTIPROCESSORS*SHARED_MEM_SIZE_MAX</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| TOTAL_LOCAL_MEM || MAX_STREAMING_MULTIPROCESSORS*MAX_THREAD_PER_SM*LOCAL_MEM_SIZE_MAX</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| SHARED_GENERIC_START || GLOBAL_HEAP_START-TOTAL_SHARED_MEM</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| LOCAL_GENERIC_START || SHARED_GENERIC_START-TOTAL_LOCAL_MEM</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| STATIC_ALLOC_LIMIT || GLOBAL_HEAP_START - (TOTAL_LOCAL_MEM+TOTAL_SHARED_MEM)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Notice that with this address space partitioning, each thread may only have up to 8kB of local memory (LOCAL_MEM_SIZE_MAX).\u00a0 With CUDA compute capability 1.3 and below, each thread can have up to 16kB of local memory.\u00a0 With CUDA compute capability 2.0, this limit has increased to 512kB [http://developer.nvidia.com/nvidia-gpu-computing-documentation].\u00a0 The user may increase LOCAL_MEM_SIZE_MAX to support applications that require more than 8kB of local memory per thread.\u00a0 However, one should always ensure that GLOBAL_HEAP_START &gt; (TOTAL_LOCAL_MEM + TOTAL_SHARED_MEM).\u00a0 Failure to do so may result in erroneous simulation behavior. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">To get more information about the generic addressing in general refer to NVIDIA&#039;s PTX: Parallel Thread Execution ISA Version 2.0 manual[http://developer.nvidia.com/nvidia-gpu-computing-documentation].</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Instruction Execution ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">After parsing, instructions used for functional execution are represented as a ptx_instruction object contained within a function_info object (see cuda-sim/ptx_ir.{h,cc}). Each scalar thread is represented by a ptx_thread_info object. Executing an instruction (functionally) is mainly accomplished by calling the ptx_thread_info::ptx_exec_inst().</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The abstract class core_t has the most basic data structures and procedures required for instruction execution functionally. This class is the base class for the shader_core_ctx and functionalSimCore which are used for performance and pure functional simulation respectively. core_t most important members are objects of types simt_stack and ptx_thread_info, which are used in the functional exectuion to keep track of the warps branch divergence and to handle the threads&#039; instructions execution.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Executing instruction simply starts by initializing scalar threads using the function ptx_sim_init_thread (in cuda-sim.cc), then we execute the scalar threads in warps using the function_info::ptx_exec_inst(). In this version, keeping track of threads as warps is done using a simt_stack object for each warp of scalar threads (this is the assumed model here and other models can be used instead), the simt_stack tells which threads are active and which instruction to execute at each cycle so we can execute the scalar threads\u00a0 in warps.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">In ptx_thread_info::ptx_exec_inst, is where acutally the instructions get functionally executed. We check the instruction opcode and call the corresponding funciton, the file opcodes.def contains the functions used to execute each instruction. Every instruction function takes two parameters of type ptx_instruction and ptx_thread_info which hold the data for the instruction and the thread in execution receptively.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Information are communicated back from the execution of ptx_exec_inst to the function that executes the warps through modifying warp_inst_t parameter that is passed to the ptx_exec_inst by reference, so for atomics we indicate that the executed warp instruction is atomic and add a call back to the warp_inst_t and which set the atomic flag, the flag is then checked by the warp execution function in order to do the callbacks which are used to execute the atomics (check functionalCoreSim::executeWarp in cuda-sim.cc).</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As you might have expected more communication is made in the performance simulation than the pure functional simulation. The pure functional execution with functionalSimCore (in cuda-sim{.h,.cc}) can be checked to get more details on the functional execution.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Interface to Source Code View in AerialVision ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Source Code View in AerialVision is a view where it is possible to plot different kinds of metrics vs. ptx source code. For example, one could plot the DRAM traffic generated by each line in ptx source code. GPGPU-Sim exports the statistics needed to construct the Source Code View in AerialVision to statistics files that are read by AerialVision.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">If the options &quot;-enable_ptx_file_line_stats 1&quot; and &quot;-visualizer_enabled 1&quot; are defined, GPGPU-Sim will save files with the statistics. The name of the file can be specified using the option &quot;-ptx_line_stats_filename filename&quot;.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">For each line in the executed ptx files, one line is added to the line stats file in the following format:</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\"> kernel line : count latency dram_traffic smem_bk_conflicts smem_warp gmem_access_generated gmem_warp exposed_latency warp_divergence</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Using AerialVision, one could plot/view the statistics collected by this interface in different ways. AerialVision can also map those statistics to CUDA C++ source files. Please refer the AerialVision manual for more details about how to do that.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">This functionality is implemented in src/cuda-sim/ptx-stats.h(.cc). The stats for each ptx line are held in an instance of the class &#039;&#039;ptx_file_line_stats&#039;&#039;. The function &#039;&#039;void ptx_file_line_stats_write_file()&#039;&#039; is responsible for printing the statistics to the statistics file in the above format. A number of other functions, similar to &#039;&#039;void ptx_file_line_stats_add_dram_traffic(unsigned pc, unsigned dram_traffic)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&#039;&#039; are called by the rest of the simulator to record different statistics about ptx source code lines.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Pure Functional Simulation ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Pure functional simulation (bypassing performance simulation) is implemented in files cuda-sim{.h,.cc}, in function gpgpu_cuda_ptx_sim_main_func(...) and using the functionalCoreSim class. The functionalCoreSim class is inherited from the core_t abstract class, which contains many of the functional simulation data structures and procedures that are used by the pure functional simulation as well as performance simulation.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">== Interface with outside world ==</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim is compiled into stub libraries that dynamically link at runtime to the CUDA/OpenCL application. Those libraries intercept the call intended to be executed by the CUDA runtime environment and instead initialize the simulator and run the kernels on it instead of on the hardware</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== Entry Point and Stream Manager ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPU-Sim is initialized by the function GPGPUSim_Init() which is called when the CUDA or OpenCL application performs its first CUDA/OpenCL API call.\u00a0 Our implementation of the CUDA/OpenCL API function implementations either directly call GPGPUSim_Init() or they call GPGPUSim_Context() which in turn calls GPGPUSim_Init(). An example API call that calls GPGPUSim_Context() is cudaMalloc(). Note that by utilizing static variables GPGPUSim_Init() is not called every time a cudaMalloc() is called.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">First call to GPGPUSim_Init() would call the function gpgpu_ptx_sim_init_perf() located in gpgpusim_entrypoint.cc.\u00a0 Inside gpgpu_ptx_sim_init_perf() all the environmental variables, command line parameters and configuration files are processed. Based on the options a gpgpu_sim object is instantiated and assigned to the global variable g_the_gpu. Also a stream_manager object is instantiated and assigned to the global variable g_stream_manager. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">GPGPUSim_Init() also calls start_sim_thread() function located in gpgpusim_entrypoint.cc. start_sim_thread() creates starts a new pthread which is actually responsible for running the simulation. For OpenCL application, the simulator pthread runs gpgpu_sim_thread_sequential(), which simulates the execution of kernels one at a time. For CUDA application, the simulator pthread runs gpgpu_sim_thread_concurrent(), which simulates the concurrent execution of multiple kernels.\u00a0 The maximum number of kernels that may concurrently execute on the simulated GPU is configured by the option &#039;-gpgpu_max_concurrent_kernel&#039;. </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">gpgpu_sim_thread_sequential() will wait for a start signal to start the simulation (g_sim_signal_start).\u00a0 The start signal is set by the gpgpu_opencl_ptx_sim_main_perf() function used to start OpenCL performance simulation.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">gpgpu_sim_thread_concurrent() initializes the performance simulator structures once and then enters a loop waiting for a job from the stream manager (implemented by class stream_manager in stream_manager.h/cc).\u00a0 The stream manager itself gets the jobs from CUDA API calls to functions such as cudaStreamCreate() and cudaMemcpy() and kernel launches.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== CUDA runtime library (libcudart) ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">When building a CUDA application, NVIDIA&#039;s nvcc translates each kernel launch into a series of API calls to the CUDA runtime API, which prepares the GPU hardware for kernel execution. libcudart.so is the library provided by NVIDIA that implements this API. In order for GPGPU-Sim to intercept those calls and run the kernels on the simulator, GPGPU-Sim also implements this library. The implementation resides in libcuda/cuda_runtime_api.cc. The resulting shared object resides in gpgpu-sim_root/lib/&lt;build_type&gt;/libcudart.so. By including this path into your LD_LIBRARY_PATH, you instruct your system to dynamically link against GPGPU-Sim at runtime instead of the NVIDIA provided library, thus allowing the simulator to run your code. Setting your LD_LIBRARY_PATH should be done through the setup_environment script as instructed in the README file distributed with GPGPU-Sim. Our implementation of libcudart is not compatible with all versions of cuda because of different interfaces that NVIDIA uses between different versions.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">=== OpenCL library (libopencl) ===</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">Similar to libcuda described above, libopencl is a library included with GPGPU-Sim that implements the OpenCL API found in libOpencl.so. GPGPU-Sim currently supports OpenCL v1.1. OpenCL function calls are intercepted by GPGPU-Sim and handled by the simulator instead of the physical hardware. The resulting shared object resides in &lt;gpgpu-sim_root&gt;/lib/&lt;build_type&gt;/libOpenCL.so. By including this path into your LD_LIBRARY_PATH, you instruct your system to dynamically link against GPGPU-Sim at runtime instead of the NVIDIA provided library, thus allowing the simulator to run your code. Setting your LD_LIBRARY_PATH should be done through the setup_environment script as instructed in the [https://dev.ece.ubc.ca/projects/gpgpu-sim/browser/v3.x/README README file in the v3.x directory</ins>]<ins class=\"diffchange diffchange-inline\">.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">As GPGPU-Sim executes PTX, OpenCL applications must be compiled and converted into PTX. This is handled by nvopencl_wrapper.cc (found in &lt;gpgpu-sim_root&gt;/distribution/libopencl/). The OpenCL kernel is passed to the nvopencl_wrapper, compiled using the standard OpenCL clCreateProgramWithSource(...) and clBuildProgram(...) functions, converted into PTX and stored into a temporary PTX file (_ptx_XXXXXX), which is then read into GPGPU-Sim. Compiling OpenCL applications requires a physical device capable of supporting OpenCL. Thus, it may be necessary to perform the compilation process on a remote system containing such a device. GPGPU-Sim supports this through use of the &lt;OPENCL_REMOTE_GPU_HOST&gt; environment variable. If necessary, the compilation and conversion to PTX processes will be performed on the remote system and will return the resulting PTX files to the local system to be read into GPGPU-Sim.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">The following table provides a list of OpenCL functions currently implemented in GPGPU-Sim. See the OpenCL specifications document for more details on the behaviour of these functions. The OpenCL API implementation for GPGPU-Sim can be found in &lt;gpgpu-sim_root&gt;/distribution/libopencl/opencl_runtime_api.cc.</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div>\u00a0</div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">{| border=1 cellpadding=&quot;3&quot; cellspacing=&quot;0&quot; class=&quot;wikitable&quot; style=&quot;text-align:left&quot; align=&quot;center&quot;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">! colspan=1 | &lt;BR&gt;OpenCL API </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|\u00a0 &#039;&#039;&#039;clCreateContextFromType&#039;&#039;&#039;&lt;nowiki&gt;(cl_context_properties *properties, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_ulong\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 device_type,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 void (*pfn_notify)(const char *, const void *, size_t, void *),</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 user_data,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_int *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 errcode_ret) </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clCreateContext&#039;&#039;&#039;&lt;nowiki&gt;(\u00a0 const cl_context_properties * properties,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_uint num_devices,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 const cl_device_id *devices,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 void (*pfn_notify)(const char *, const void *, size_t, void *),</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 user_data,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_int *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 errcode_ret)</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clGetContextInfo&#039;&#039;&#039;&lt;nowiki&gt;(cl_context\u00a0 \u00a0 \u00a0 \u00a0  context, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_context_info\u00a0 \u00a0 param_name, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value_size, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value_size_ret )</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clCreateCommandQueue&#039;&#039;&#039;&lt;nowiki&gt;(cl_context\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  context, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_device_id\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  device, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_command_queue_properties\u00a0 \u00a0 properties,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_int *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  errcode_ret) </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clCreateBuffer&#039;&#039;&#039;&lt;nowiki&gt;(cl_context\u00a0  context,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_mem_flags flags,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t\u00a0 \u00a0 \u00a0  size ,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  void *\u00a0 \u00a0 \u00a0  host_ptr,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_int *\u00a0 \u00a0  errcode_ret )&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clCreateProgramWithSource&#039;&#039;&#039;&lt;nowiki&gt;(cl_context\u00a0 \u00a0 \u00a0 \u00a0 context,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  count,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 const char **\u00a0 \u00a0  strings,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 const size_t *\u00a0 \u00a0 lengths,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_int *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 errcode_ret)&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clBuildProgram&#039;&#039;&#039;&lt;nowiki&gt;(cl_program\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  program,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 num_devices,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const cl_device_id * device_list,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const char *\u00a0 \u00a0 \u00a0 \u00a0  options, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  void (*pfn_notify)(cl_program /* program */, void * /* user_data */),</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  user_data )&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clCreateKernel&#039;&#039;&#039;&lt;nowiki&gt;(cl_program\u00a0 \u00a0 \u00a0 program,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const char *\u00a0 \u00a0 kernel_name,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_int *\u00a0 \u00a0 \u00a0 \u00a0 errcode_ret) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clSetKernelArg&#039;&#039;&#039;&lt;nowiki&gt;(cl_kernel\u00a0 \u00a0 kernel,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint\u00a0 \u00a0 \u00a0 arg_index,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t\u00a0 \u00a0 \u00a0  arg_size,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const void * arg_value )&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clEnqueueNDRangeKernel&#039;&#039;&#039;&lt;nowiki&gt;(cl_command_queue command_queue,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_kernel\u00a0 \u00a0 \u00a0 \u00a0 kernel,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 work_dim,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const size_t *\u00a0  global_work_offset,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const size_t *\u00a0  global_work_size,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const size_t *\u00a0  local_work_size,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 num_events_in_wait_list,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const cl_event * event_wait_list,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_event *\u00a0 \u00a0 \u00a0  event)&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clEnqueueReadBuffer&#039;&#039;&#039;&lt;nowiki&gt;(cl_command_queue\u00a0 \u00a0 command_queue,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_mem\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 buffer,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_bool\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  blocking_read,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 offset,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cb, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 ptr,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  num_events_in_wait_list,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 const cl_event *\u00a0 \u00a0 event_wait_list,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_event *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 event ) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clEnqueueWriteBuffer&#039;&#039;&#039;&lt;nowiki&gt;(cl_command_queue\u00a0  command_queue, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_mem\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  buffer, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_bool\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 blocking_write, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  offset, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cb, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const void *\u00a0 \u00a0 \u00a0  ptr, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 num_events_in_wait_list, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const cl_event *\u00a0  event_wait_list, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_event *\u00a0 \u00a0 \u00a0 \u00a0  event ) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clReleaseMemObject&#039;&#039;&#039;(&lt;nowiki&gt;cl_mem /* memobj */) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clReleaseKernel&#039;&#039;&#039;&lt;nowiki&gt;(cl_kernel\u00a0  /* kernel */)&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clReleaseProgram&#039;&#039;&#039;&lt;nowiki&gt;(cl_program /* program */)&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clReleaseCommandQueue&#039;&#039;&#039;&lt;nowiki&gt;(cl_command_queue /* command_queue */)&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clReleaseContext&#039;&#039;&#039;&lt;nowiki&gt;(cl_context /* context */)&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clGetPlatformIDs&#039;&#039;&#039;&lt;nowiki&gt;(cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms )&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clGetPlatformInfo&#039;&#039;&#039;&lt;nowiki&gt;(cl_platform_id\u00a0  platform, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_platform_info param_name,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value_size, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t *\u00a0 \u00a0 \u00a0 \u00a0  param_value_size_ret ) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clGetDeviceIDs&#039;&#039;&#039;&lt;nowiki&gt;(cl_platform_id\u00a0  platform,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_device_type\u00a0  device_type, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 num_entries, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_device_id *\u00a0  devices, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint *\u00a0 \u00a0 \u00a0 \u00a0 num_devices ) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clGetDeviceInfo&#039;&#039;&#039;&lt;nowiki&gt;(cl_device_id\u00a0 \u00a0 device,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_device_info\u00a0 param_name, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 param_value_size, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 param_value,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t *\u00a0 \u00a0 \u00a0 \u00a0 param_value_size_ret) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clFinish&#039;&#039;&#039;&lt;nowiki&gt;(cl_command_queue /* command_queue */) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clGetProgramInfo&#039;&#039;&#039;&lt;nowiki&gt;(cl_program\u00a0 \u00a0 \u00a0 \u00a0  program,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_program_info\u00a0 \u00a0 param_name,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value_size,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value_size_ret )&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clEnqueueCopyBuffer&#039;&#039;&#039;&lt;nowiki&gt;(cl_command_queue\u00a0 \u00a0 command_queue, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_mem\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 src_buffer,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_mem\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 dst_buffer, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 src_offset,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 dst_offset,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cb, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  num_events_in_wait_list,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 const cl_event *\u00a0 \u00a0 event_wait_list,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_event *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 event )&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clGetKernelWorkGroupInfo&#039;&#039;&#039;&lt;nowiki&gt;(cl_kernel\u00a0 \u00a0 \u00a0 \u00a0 kernel,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_device_id\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  device,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_kernel_work_group_info\u00a0 param_name,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value_size,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  param_value_size_ret ) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clWaitForEvents&#039;&#039;&#039;&lt;nowiki&gt;(cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  /* num_events */,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 const cl_event *\u00a0 \u00a0 /* event_list */)&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clReleaseEvent&#039;&#039;&#039;&lt;nowiki&gt;(cl_event /* event */) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clGetCommandQueueInfo&#039;&#039;&#039;&lt;nowiki&gt;(cl_command_queue\u00a0 \u00a0 \u00a0 command_queue,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 cl_command_queue_info param_name,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 param_value_size,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 void *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 param_value,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 size_t *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 param_value_size_ret )&lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clFlush&#039;&#039;&#039;&lt;nowiki&gt;(cl_command_queue /* command_queue */) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clGetSupportedImageFormats&#039;&#039;&#039;&lt;nowiki&gt;(cl_context\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  context,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_mem_flags\u00a0 \u00a0 \u00a0 \u00a0  flags,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_mem_object_type\u00a0  image_type,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 num_entries,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_image_format *\u00a0 \u00a0 image_formats,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint *\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 num_image_formats) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">| &#039;&#039;&#039;clEnqueueMapBuffer&#039;&#039;&#039;&lt;nowiki&gt;(cl_command_queue command_queue,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_mem\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  buffer,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_bool\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 blocking_map, </ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_map_flags\u00a0 \u00a0  map_flags,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  offset,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  size_t\u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cb,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_uint\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 num_events_in_wait_list,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  const cl_event * event_wait_list,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_event *\u00a0 \u00a0 \u00a0  event,</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0  cl_int *\u00a0 \u00a0 \u00a0 \u00a0  errcode_ret ) &lt;/nowiki&gt;</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|-</ins></div></td></tr>\n<tr><td colspan=\"2\" class=\"diff-side-deleted\"></td><td class=\"diff-marker\" data-marker=\"+\"></td><td class=\"diff-addedline diff-side-added\"><div><ins class=\"diffchange diffchange-inline\">|}</ins></div></td></tr>\n"
    }
}