WEBVTT

1 00:09:28.288 --> 00:09:42.568 Okay, good afternoon, parallel class. I had a little issue here getting things set up. I think they're working now. Sorry for the delay.
2 00:09:45.028 --> 00:09:51.509 And one more thing to set up, and then...
3 00:10:01.168 --> 00:10:07.499 Yeah, just one minute here.
4 00:10:35.038 --> 00:10:40.109 Cool.
5 00:10:44.129 --> 00:10:50.548 Hey, sharing is actually working, and...
6 00:10:50.548 --> 00:10:55.259 Excuse me, professor, may I ask a question before we start the lecture?
7 00:10:58.318 --> 00:11:04.619 Cool. So conceivably things are theoretically working.
8 00:11:04.619 --> 00:11:11.578 Can't say that they're working in practice, but my universal question is:
9 00:11:11.578 --> 00:11:19.558 can you hear me? I've got the chat window open here, so, you know: can you hear me, or...
10 00:11:19.558 --> 00:11:24.599 I can hear you. Beautiful. Okay.
11 00:11:24.599 --> 00:11:33.089 So, finally things are working. So what we're doing now is we're continuing.
12 00:11:33.089 --> 00:11:48.058 Is there a place to submit homework 4? No, I'll give you another couple of days to submit it, since it was taking a bit of time. There's a question in voice; yes, the question was, is there a place to submit homework 4?
13 00:11:48.058 --> 00:11:57.028 I'll give you to Monday to do it, since it may be taking a little time, and I'll give you a day or two to submit it after I put it up on Gradescope.
14 00:11:57.028 --> 00:12:09.928 Okay, so we're continuing on with NVIDIA. I think we're at approximately lecture 2.3, which we're speed-reading through.
15 00:12:10.464 --> 00:12:22.943 And we're talking about the architecture. Again, my teaching style is that I like to teach from examples that are used in the real world.
16 00:12:23.303 --> 00:12:34.464 So the example here is NVIDIA, and as you look at how they solved the problem, and they've solved the problem very successfully, you can get an idea for
17 00:12:36.149 --> 00:12:41.609 some general hardware principles, and so on, some ways to solve the issues here.
18 00:12:41.933 --> 00:12:53.933 So, NVIDIA has the thread: it's an execution stream, it's got some local memory, it's sharing some global memory, and it's got a program counter pointing to the next instruction to be executed.
19 00:12:54.264 --> 00:13:01.073 And again, the threads in a warp of threads all execute the same instruction, unless a particular thread is disabled.
20 00:13:01.408 --> 00:13:09.298 Okay, and the threads are very lightweight. This is review here: the threads are very lightweight.
21 00:13:09.298 --> 00:13:16.678 It is much less expensive to start a thread than it is to start a process or a coroutine
22 00:13:16.678 --> 00:13:22.499 in Linux or Windows. And so, this is review here, you might profitably,
23 00:13:22.499 --> 00:13:30.504 let's say if you're adding two vectors element by element, you might perhaps have a thread for each addition,
24 00:13:30.504 --> 00:13:38.783 something like that, because on the NVIDIA you can have thousands of threads.
25 00:13:39.239 --> 00:13:42.328 Okay.
26 00:13:43.469 --> 00:13:58.229 So here your program would alternate: some serial code, then some parallel code, some serial code, then parallel code. And we're assuming the serial code here, to make it easy, has one thread
27 00:13:58.229 --> 00:14:04.318 executing, and then the parallel code might have a thousand. You could also
28 00:14:04.318 --> 00:14:09.958 do them simultaneously. When you start off parallel threads,
29 00:14:09.958 --> 00:14:21.568 your host code that starts the device threads returns immediately and leaves the device running, which means you have to do a synchronization later.
30 00:14:21.568 --> 00:14:26.129 And this, again, is the syntax here that starts off
31 00:14:26.129 --> 00:14:31.438 the parallel threads. And just to review: a kernel
32 00:14:31.438 --> 00:14:37.318 is the name for a parallel program on the device. The device is
33 00:14:37.318 --> 00:14:41.188 the NVIDIA GPU.
34 00:14:41.188 --> 00:14:45.448 You give it some arguments you can pass, and this
35 00:14:45.448 --> 00:14:51.269 gives the threads that the kernel should use. A kernel is also called a grid.
36 00:14:51.269 --> 00:15:04.918 So the grid, the kernel, contains some number of thread blocks, and you specify how many thread blocks it should have. It could be anywhere from 1 up, because there's some limit, and the limit,
37 00:15:05.364 --> 00:15:16.854 I think, is pretty large. And then this here is the number of threads in each thread block; thread block and block, they're synonymous,
38 00:15:17.124 --> 00:15:26.634 so I can't accuse NVIDIA of complete consistency. So this here is where you tell it how many threads each block should have.
39 00:15:27.269 --> 00:15:37.739 And that would be anywhere from 1 up to a limit, which is 1024 threads per block, I believe, and effectively as many blocks
40 00:15:37.739 --> 00:15:46.168 as you want. Now, there's a limit to how many blocks can run at one time; the other blocks are sitting in a queue.
41 00:15:46.168 --> 00:15:51.688 And there's an operating system on the device.
42 00:15:51.688 --> 00:15:59.009 Not a lot of explicit details about it, but one thing it does, excuse me, is queue blocks up for resources to run them.
43 00:16:00.028 --> 00:16:11.364 Okay, it also happens down inside the block. Suppose you have a thousand threads in the block. Now, each thread has resources; the big one,
44 00:16:11.364 --> 00:16:15.714 I've told you, is registers, which are fast local memory. Then
45 00:16:18.504 --> 00:16:32.543 the block has a pool of registers that are shared by all of the threads in the block. If each thread wants a lot of registers, then maybe all the threads of the block can't run simultaneously. So warps of threads are queued up in a queue for the block
46 00:16:32.543 --> 00:16:43.073 and run as they get resources. Another shared resource would be floating point. Floating point is done in separate coprocessors, separate cores; it's the same thing on the Xeon on the host.
47 00:16:44.339 --> 00:16:55.798 Well, not quite the same, because each CPU core on the host will have some double-precision floating point, and here on the device there are fewer double-precision
48 00:16:55.798 --> 00:17:00.479 processors than there are integer processors, you might call them.
49 00:17:01.708 --> 00:17:12.538 So it may happen that a thread, or a warp of threads, is queued up waiting for some double-precision processor to become available, perhaps.
50 00:17:12.538 --> 00:17:19.528 And the queuing process is said to be free: there's no overhead in running the queue; the overhead is less than a cycle.
51 00:17:19.528 --> 00:17:24.838 And here's one thing NVIDIA does from generation to generation.
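To make the launch pattern just described concrete, here is a minimal sketch; the kernel name, sizes, and the use of managed memory are illustrative, not from the slides:

```cpp
#include <cstdio>

// Hypothetical kernel: doubles each element of x. __global__ marks it as
// device code that is launched from the host. Each thread computes its own
// global index from the built-in variables (covered a couple of slides on).
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;   // bounds check: the last block may be partial
}

int main() {
    int n = 1000000;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // managed memory, for brevity
    for (int i = 0; i < n; ++i) x[i] = i;

    int threadsPerBlock = 256;                 // 1 up to the 1024-per-block limit
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    scale<<<blocks, threadsPerBlock>>>(x, n);  // <<<blocks in grid, threads per block>>>

    // The launch returns immediately and leaves the device running,
    // so the host synchronizes before reading the results.
    cudaDeviceSynchronize();
    printf("x[1] = %f\n", x[1]);
    cudaFree(x);
    return 0;
}
```

The ceiling-division idiom in the grid size is the same one that comes up again a few slides later for fractional blocks.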
52 00:17:25.979 --> 00:17:37.648 From generation to generation, that is: the current one, before that is Turing, before that, I think, was Pascal, and so on. From generation to generation NVIDIA changes the
53 00:17:37.648 --> 00:17:42.989 proportion of the die that's used for single precision, double precision,
54 00:17:42.989 --> 00:17:47.098 half-precision floats, and integers and so on.
55 00:17:47.098 --> 00:18:00.989 Okay, so we've got this hierarchy here, which we saw last time: electrons used to build circuits; a low-level microarchitecture, which is not visible to the user; the actual instruction set, involving
56 00:18:00.989 --> 00:18:10.288 instructions on the CUDA cores; a language like C++, which is used to implement an algorithm; and then perhaps at the top they're doing
57 00:18:10.288 --> 00:18:14.788 some natural language thing. Okay.
58 00:18:15.838 --> 00:18:22.618 Nothing new here: instruction set architecture, nothing new here.
59 00:18:22.618 --> 00:18:35.519 Nothing new here: instruction register, for those of you that have forgotten your computer organization, I guess; program counter; register file;
60 00:18:35.519 --> 00:18:39.929 and so on. Standard.
61 00:18:40.979 --> 00:18:44.489 Oh, okay. So again, getting to
62 00:18:44.489 --> 00:18:52.048 something slightly new here. So again, all of the threads: they're lightweight, they're executing.
63 00:18:52.048 --> 00:18:59.249 So, the grid again; well, maybe the grid's the hardware and the kernel's the software, I guess. And all the threads,
64 00:18:59.249 --> 00:19:05.669 they run the same instruction, so it's called single program, multiple data, but they've got separate data.
65 00:19:05.669 --> 00:19:12.239 However, each thread has private indices, so each thread knows its number in the whole
66 00:19:12.239 --> 00:19:15.449 set of all threads, so it can access private data.
67 00:19:15.449 --> 00:19:27.088 It can index: well, first there's private data for the thread, the registers, and second, the thread can use its index to go into the global data and get its
68 00:19:27.088 --> 00:19:36.239 share of the global data. And I showed you these before: threadIdx.x is the index of the thread in the block. It goes from 0 up
69 00:19:36.239 --> 00:19:41.338 to 1023, perhaps. blockDim is the number of
70 00:19:41.338 --> 00:19:48.449 threads the block was declared, created, with; that's the number of threads that the block has total.
71 00:19:48.449 --> 00:19:54.838 And blockIdx is the index of the block in the grid, 0 up to the number of blocks in the grid.
72 00:19:54.838 --> 00:20:00.538 And it's all dot-x because, syntactically, these are 3-dimensional
73 00:20:00.538 --> 00:20:05.098 indices, not 1-dimensional indices. Okay.
74 00:20:05.098 --> 00:20:09.388 So, again, it shows the blocks again in more detail:
75 00:20:09.388 --> 00:20:15.778 block zero, block 1, up to block N minus 1. We have blocks,
76 00:20:15.778 --> 00:20:26.519 and in this hierarchy here, inside a block, I mentioned before, there's a fixed amount of shared memory per block. It's a very small amount; it's
77 00:20:26.519 --> 00:20:33.118 48 kilobytes or something, I can't remember for sure. It's very small and it's fast.
78 00:20:33.118 --> 00:20:43.019 And it's shared by the threads in the block. So if a thread writes to shared address 7,
79 00:20:44.153 --> 00:20:58.134 all of the threads in the block have access to it, but perhaps you do a barrier synchronization to ensure that. And you do have atomic operations for the threads in a block, like we saw with OpenMP and OpenACC.
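As a sketch of how those pieces fit together, the built-in indices, the per-block shared memory, the barrier, and an atomic; the kernel is hypothetical and assumes a 256-thread block:

```cpp
// Hypothetical kernel: each block sums its 256 elements in shared memory,
// then thread 0 atomically adds the block's partial sum into a global total.
// Assumes a launch of the form blockSum<<<n / 256, 256>>>(in, out).
__global__ void blockSum(const float *in, float *out) {
    __shared__ float partial[256];     // fast per-block shared memory
    int t = threadIdx.x;               // this thread's index within its block
    partial[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                   // barrier: warps in a block don't run in lockstep
    if (t == 0) {
        float s = 0;
        for (int i = 0; i < blockDim.x; ++i) s += partial[i];
        atomicAdd(out, s);             // atomic read-modify-write on global memory
    }
}
```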
80 00:20:58.949 --> 00:21:04.798 You can do an atomic read-modify-write, for example, on the shared memory, so that you can
81 00:21:04.798 --> 00:21:11.219 update a counter correctly. The shared memory is actually implemented in the
82 00:21:11.219 --> 00:21:14.459 fast high-speed cache
83 00:21:14.459 --> 00:21:23.844 fronting the global memory. Okay, so these are the threads in the block, and you do have to synchronize, for the following reason:
84 00:21:24.413 --> 00:21:36.054 because, as I said, the threads in a warp run synchronously, but the warps in the block, 32 threads in a warp, up to 32 warps in the block, do not necessarily run synchronously.
85 00:21:36.054 --> 00:21:50.933 If there's competition for some resource, such as a floating point processor, then the warps will not run at the same time in the block, but they've all got to access that shared memory. So maybe you want to do a synchronization.
86 00:21:52.169 --> 00:21:55.169 You sort of want to do that.
87 00:21:55.169 --> 00:21:59.939 Now, the different blocks do not interact, except
88 00:21:59.939 --> 00:22:03.298 that they can access the global memory.
89 00:22:03.298 --> 00:22:10.858 So there is no synchronization, if I recall right, for the threads in different blocks.
90 00:22:10.858 --> 00:22:25.679 You could create one with something accessing global memory, but that would be a bad idea, for the following reason: there are no fairness guarantees for how the different blocks execute. You see, the threads in the block, they're running the same instruction,
91 00:22:25.679 --> 00:22:31.919 you know, but they're not necessarily doing it at the same time.
92 00:22:31.919 --> 00:22:35.308 So, conceptually, it's a single,
93 00:22:35.308 --> 00:22:39.118 you know, single program, multiple data, but the different
94 00:22:39.118 --> 00:22:49.199 blocks, as they're running, it could be one block after another after another, or it could be simultaneous. So you do not want
95 00:22:49.199 --> 00:22:56.459 blocks to try to interact; they don't naturally interact, and you could,
96 00:22:56.459 --> 00:23:01.348 with some hackery, have them interacting, but that would kill your performance,
97 00:23:01.348 --> 00:23:05.669 if you see what I mean. Okay, so we've got the threads in the block,
98 00:23:05.669 --> 00:23:08.878 and then the different blocks; that's the hierarchy here.
99 00:23:10.824 --> 00:23:23.364 Now, you might ask: why don't they just have, you know, thousands of threads in a block? They've got a limit of 1024 threads per block, and even with the faster, newer NVIDIA architectures they don't increase that.
100 00:23:23.364 --> 00:23:28.854 They don't have more threads per block in the newer architectures. So you might ask yourself why, and my answer:
101 00:23:32.038 --> 00:23:39.959 I haven't seen them state it anywhere, but the obvious thing is that the operations inside one block are very expensive to implement
102 00:23:39.959 --> 00:23:50.729 and take a lot of hardware, in particular some of the asynchronous logic being used to implement some of this. They call it zero-overhead waiting:
103 00:23:50.729 --> 00:24:04.648 there are these warps in the block, and as I said, there's an invisible queue of warps waiting to run, and that's done as a zero-overhead thing, so a warp runs as soon as the resource, a CUDA processor say, becomes available.
104 00:24:04.648 --> 00:24:11.098 If there's a warp that wants to do a floating point operation, the next cycle it gets it.
105 00:24:11.098 --> 00:24:18.419 And this is done with some sort of asynchronous logic. For your software types:
106 00:24:18.894 --> 00:24:31.013 you have operations in the CPU which are clocked, where with every cycle something happens, and then there are asynchronous, untimed operations:
107 00:24:31.403 --> 00:24:34.884 you just have your logic gates and whatever,
108 00:24:35.429 --> 00:24:44.429 and as their inputs change, their outputs change immediately after; well, depending on how fast the hardware operates. And this
109 00:24:44.429 --> 00:24:50.038 asynchronous operation, it's a horribly complicated mess, but it's very fast.
110 00:24:50.038 --> 00:24:59.429 So, again, in a computer organization class they teach some of the issues of, um,
111 00:25:00.449 --> 00:25:10.463 problems where the inputs to a gate change at slightly different times: you get this temporary false signal coming out of the gate before the inputs have stabilized, and stuff. You've got to worry about all of that.
112 00:25:10.644 --> 00:25:19.403 But the upside is it's very fast, and that's the sort of thing that NVIDIA uses to do the scheduling inside one block.
113 00:25:19.888 --> 00:25:32.699 But it doesn't scale up; that's why there's 1024 threads per block. Also, all of the NVIDIA
114 00:25:32.699 --> 00:25:40.949 architectures, and NVIDIA has been around for more than 20 years, always have 32 threads per warp.
115 00:25:40.949 --> 00:25:48.538 They recently do some new stuff with the warps, getting almost fractional warps, but they don't have more, again because of
116 00:25:48.834 --> 00:26:03.294 the cost in the hardware to do it. In any case, the 3-dimensional indexing: that's just syntactic sugar, I'd call it. My view of this is, I don't care that the compiler does this, because
117 00:26:03.989 --> 00:26:11.608 you know, it's what's possible: you can write class conversion routines in C++ and do it. In fact, I do that sort of thing.
118 00:26:11.608 --> 00:26:16.739 Okay: grid of threads in a block, a grid, and the 3-dimensional indexing.
119 00:26:17.909 --> 00:26:24.689 Okay, that was lecture 2.3, from NVIDIA's point of view.
120 00:26:28.828 --> 00:26:32.669 Here, it's a...
121 00:26:34.048 --> 00:26:39.269 Hello.
122 00:26:57.719 --> 00:27:05.669 No, I can't zoom in; it takes the whole height of my screen, so there's no point in making it wider here. Okay.
123 00:27:05.669 --> 00:27:11.608 More stuff here. Same here.
124 00:27:11.608 --> 00:27:22.888 The compiler. So, NVIDIA has got a lot of compilers for you. I showed you nvc++ for straight C++; nvcc then is for the CUDA code.
125 00:27:22.888 --> 00:27:29.429 And I showed you a little of this before: your hello-world program on the host.
126 00:27:29.429 --> 00:27:44.038 Your hello-world program on the device doesn't do anything. A quick reminder here: mykernel is the name of your kernel, your device routine; that's mykernel up here. This is
127 00:27:44.038 --> 00:27:48.148 an extension to C++, and this tells
128 00:27:48.148 --> 00:27:58.618 the compiler that this routine, mykernel, is what's called a global, and that means it's called from a host routine; it's called down here,
129 00:27:58.618 --> 00:28:02.368 and it's executed on the device.
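A sketch reconstructing the hello-world pair being described; the launch configuration in the triple angle brackets is explained just below:

```cpp
#include <cstdio>

// __global__: runs on the device, called from host code.
// The device side of hello-world does nothing at all.
__global__ void mykernel() { }

int main() {
    mykernel<<<1, 1>>>();       // launch: 1 block, 1 thread per block
    printf("Hello World!\n");   // the host side does the printing
    cudaDeviceSynchronize();    // wait for the (empty) kernel to finish
    return 0;
}
```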
130 00:28:02.368 --> 00:28:08.939 That's what NVIDIA calls a global. There are no arguments, and this says,
131 00:28:08.939 --> 00:28:13.919 inside the triple angle brackets, that there's 1 thread in
132 00:28:13.919 --> 00:28:17.038 1 block. Okay.
133 00:28:17.038 --> 00:28:23.638 And I showed you this last time; quick thing.
134 00:28:25.588 --> 00:28:31.318 Okay, and it will execute. I ran it before for you.
135 00:28:31.318 --> 00:28:45.659 A quick review here, mentioning also: you can put in debugging symbols. They make the executables bigger; they don't, I don't think, slow down the execution.
136 00:28:45.659 --> 00:28:49.469 Okay, and I showed you last time cuda-memcheck,
137 00:28:49.469 --> 00:28:54.358 which checks every address for validity,
138 00:28:54.358 --> 00:28:59.368 and gave you an example. It will check for various cool things here;
139 00:28:59.368 --> 00:29:06.388 some things have to be aligned, too. Okay.
140 00:29:06.388 --> 00:29:11.759 Showed you example 2.
141 00:29:12.173 --> 00:29:21.084 We have cuda-gdb, the debugger, and you can go in and look at data in certain threads and so on. I mentioned that last time; doing a quick review here.
142 00:29:21.923 --> 00:29:27.203 And I showed, I ran that quickly.
143 00:29:28.433 --> 00:29:43.074 Just a quick thing: not all of these things run, because these slides were written before the latest version, the latest architecture, and in the latest architecture they changed something incompatibly. So we have to use a different tool; well,
144 00:29:43.348 --> 00:29:49.679 the visual profiler and so on. I may demo it Monday for you, but for now I'm just walking you through slides.
145 00:29:49.679 --> 00:29:59.459 Okay, and again, some of this does not work with the current version, which is why I am
146 00:29:59.459 --> 00:30:04.348 going through it fast. Yeah, some of it does and some doesn't, say.
147 00:30:04.348 --> 00:30:08.368 Okay.
148 00:30:08.368 --> 00:30:12.868 Yeah, I'll show you some of the profiler later.
149 00:30:12.868 --> 00:30:18.598 Okay: performance.
150 00:30:18.598 --> 00:30:23.219 That was quick.
151 00:30:23.219 --> 00:30:31.679 I like to keep moving, and that was all of that one.
152 00:30:36.538 --> 00:30:40.138 Okay.
153 00:30:40.138 --> 00:30:44.969 Here.
154 00:30:48.749 --> 00:30:57.598 I'm interested in watching the delay in the synchronization with Webex, because I've got 2 laptops in front of me: the one that I'm
155 00:30:57.598 --> 00:31:08.519 running on, and the one where I'm watching what you see, because what you see is different than what my main laptop sees. Okay. And I showed you this error thing here before.
156 00:31:08.519 --> 00:31:16.858 And the point about this, again, is that it may happen that the number of threads is not a multiple of,
157 00:31:18.773 --> 00:31:30.834 not a multiple of, the number of threads per block. So in the last block you only want some of the threads in the block, but all of the threads are going to get executed, perhaps. So you need a bounds check
158 00:31:32.189 --> 00:31:43.644 for this last, fractional, block, you might call it. And if you don't do this check, this will be executing; this array actually is in global memory.
159 00:31:44.663 --> 00:31:52.733 So there are several different places that data can be. This here is going to be in the global memory, which on parallel's
160 00:31:52.979 --> 00:31:56.669 good GPU is 48 gigabytes.
161 00:31:58.554 --> 00:31:59.273 So,
162 00:31:59.364 --> 00:32:01.584 if you don't have this check right here,
163 00:32:01.794 --> 00:32:07.733 you'll be walking off the end of the arrays in global memory, and reading is probably okay,
164 00:32:07.733 --> 00:32:08.483 but writing:
165 00:32:08.483 --> 00:32:11.453 you're going to be smashing some other,
166 00:32:11.513 --> 00:32:15.804 someone else's, data, which may cause their code to crash.
167 00:32:16.108 --> 00:32:20.459 There's some security on
168 00:32:20.459 --> 00:32:23.999 the device, but it's not perfect. So,
169 00:32:25.979 --> 00:32:38.638 and so, this again: here you need enough threads to cover the elements. I showed you this before; well, I showed a different way to do it before: n plus 255,
170 00:32:38.638 --> 00:32:47.459 divided by 256. An equivalent way to do it is a ceiling function: calculate this as a float and take the ceiling; it would be the same thing.
171 00:32:47.459 --> 00:32:53.068 In any case, the d underscore is a
172 00:32:53.068 --> 00:33:01.318 naming convention to say that the arrays of data are on the device instead of on the host.
173 00:33:01.318 --> 00:33:04.348 With the managed memory, that's,
174 00:33:04.348 --> 00:33:12.148 that's an obsolete idea, because with managed memory it's paged back and forth between the host and the device as needed.
175 00:33:12.148 --> 00:33:23.729 There might be a performance penalty, depending on your code, but your life gets easier; it's easier on you, because you don't have to explicitly copy the data back and forth.
176 00:33:23.729 --> 00:33:28.769 So the d underscore idea is a bit silly, but
177 00:33:28.769 --> 00:33:34.378 you'll see it everywhere. The h underscore means that it's on the host. Okay.
178 00:33:35.999 --> 00:33:42.328 Host code: dim3 just means
179 00:33:43.193 --> 00:33:44.064 3-dimensional.
180 00:33:44.153 --> 00:33:46.763 dim3 is a provided class,
181 00:33:47.094 --> 00:33:48.713 and nothing interesting here,
182 00:33:49.104 --> 00:33:52.733 except that the number of blocks in the grid, instead of being a scalar,
183 00:33:52.733 --> 00:33:54.354 could be a 3-D
184 00:33:54.443 --> 00:33:58.374 dim3, as is the number of threads in a block.
185 00:33:58.763 --> 00:34:01.794 And so we're passing in, the
186 00:34:02.038 --> 00:34:05.429 point is, pointers to 3 global arrays, and then just an
187 00:34:05.429 --> 00:34:13.708 integer here. dimGrid: there would be a constructor that would take
188 00:34:13.708 --> 00:34:24.929 3 integer expressions, let's say, and construct the dim3 variable, a variable of class dim3, and you could also do a constructor, same thing, here.
189 00:34:24.929 --> 00:34:31.679 If you're going to have ones for some of the dimensions of the 3, they'd be the last few.
190 00:34:31.679 --> 00:34:38.579 Okay.
191 00:34:38.579 --> 00:34:41.759 Showing the same concept again.
192 00:34:41.759 --> 00:34:52.858 Okay, so here's another thing; we're seeing new things here. Underscore-underscore host underscore-underscore is an explicit statement that this routine runs on the host.
193 00:34:52.858 --> 00:35:05.159 That's the default, so you don't need it; they're just putting it here to be explicit. And global means that this routine runs on the device but is called from the host.
194 00:35:05.159 --> 00:35:12.088 There are going to be some other routines that run on the device and are callable only from the device.
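Here is a sketch of the whole pattern from this part of the lecture: the d_/h_ naming, the explicit copies, the dim3 constructors, and the ceiling division. The names are illustrative:

```cpp
// Hypothetical vector add: one thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];      // the bounds check for the fractional last block
}

void hostCaller(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;             // d_ : these pointers are device memory
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // h_ : host arrays
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(256);                 // unspecified (last) dims default to 1
    dim3 dimGrid((n + 255) / 256);      // ceil(n / 256) blocks covers every element
    vecAdd<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // copy the result back
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```

With managed memory, as just noted, the cudaMalloc and cudaMemcpy pairs disappear.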
195 00:35:12.088 --> 00:35:22.498 Okay, um, there is a complication here which I haven't seen documented that much; you hit it when you're programming.
196 00:35:22.498 --> 00:35:35.909 If you've got a device routine, and you want to get data from the host to the device routine, you sort of have to put it in arguments, pointers and so on.
197 00:35:35.909 --> 00:35:47.998 There's no common global memory that they can both easily access. Well, yeah, there's the managed memory, but it's a little messy sometimes, getting data back and forth.
198 00:35:48.773 --> 00:36:00.653 In any case: in your host code, you set the size of the grid and the blocks, the number of blocks in the grid and the number of threads in a block, and then you call this. And you've seen this before:
199 00:36:00.653 --> 00:36:05.244 this is how you pass in the sizes of the grid and the block, in the triple angle brackets.
200 00:36:06.628 --> 00:36:12.929 Then you write your global routine name and your argument list, as it shows down here.
201 00:36:14.784 --> 00:36:28.554 Inside here, just to review: this local variable in the global routine, running on the device; it's a local variable, it's private to the thread, and it's in a register, if possible.
202 00:36:28.858 --> 00:36:40.289 If there are not enough registers available, then this will be put in a larger local memory that is available
203 00:36:40.289 --> 00:36:51.239 to each device routine, but it's very slow. What it is, is just a chunk of the global memory that's made private to each thread.
204 00:36:51.239 --> 00:37:02.068 So there's more of it, but it's unbelievably slow. So you want to have few enough local variables in the thread that they'll fit in the available registers.
205 00:37:02.068 --> 00:37:05.849 By very slow, I mean you've got a latency of 100 cycles or something.
206 00:37:05.849 --> 00:37:13.409 Okay, so each grid again has blocks, and each block has lots of threads,
207 00:37:13.409 --> 00:37:18.208 and then the GPU perhaps has only a limited number of
208 00:37:18.208 --> 00:37:22.018 processors on the GPU.
209 00:37:24.264 --> 00:37:36.983 Here it's talking about these declarations. I told you about global for the last two days: a global routine runs on the device and it's called from the host.
210 00:37:37.289 --> 00:37:40.588 I mentioned we saw host today:
211 00:37:40.588 --> 00:37:55.018 it's on the host, called from the host. A new one, device, I don't know that you've seen yet: this is for a routine which runs on the device and is called from the device.
212 00:37:55.018 --> 00:38:08.728 So there's like a 2-dimensional array here: where the function runs, and where it can be called from.
213 00:38:10.588 --> 00:38:24.869 Now, you can do this: suppose you want a routine to run both on the host and the device. You can have the 2 declarations for a routine; you can say host device, or device host, in front of the routine name.
214 00:38:24.869 --> 00:38:34.679 And what this will do is the compiler will produce 2 versions of the routine: 1 version
215 00:38:34.679 --> 00:38:39.750 to run on the host, and a 2nd version to run on the device. So
216 00:38:39.750 --> 00:38:48.210 this is if you want a routine that's sometimes running on the host and other times running on the device: you just put the 2 declarations in front of it.
217 00:38:48.210 --> 00:38:52.349 Now, one caution.
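A minimal sketch of that two-dimensional table of qualifiers; the function names are made up:

```cpp
// Compiled twice, once for the host and once for the device. The body must
// stick to the intersection of what both sides support.
__host__ __device__ float sq(float x) { return x * x; }

// __device__ : runs on the device, callable only from device code.
__device__ float cube(float x) { return x * x * x; }

// __global__ : runs on the device, called from the host.
__global__ void kern(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = sq((float)i) + cube((float)i);  // device-side calls
}

// sq() can equally be called from ordinary host code; cube() cannot.
```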
218 00:38:52.349 --> 00:38:55.500 The concern here is that
219 00:38:57.085 --> 00:39:11.454 the host and the device are not completely identical. So if you're going to declare a routine to be both host and device, then the code inside it is limited to, like, the intersection of what works on the host and what works on the device.
220 00:39:11.454 --> 00:39:13.795 So you have to be careful
221 00:39:14.099 --> 00:39:18.809 in what you're doing. To give you an example:
222 00:39:19.974 --> 00:39:33.894 I don't know, fancy C++ class stuff doesn't run on the device. So if you're going to get fancy with that kind of stuff, memory allocation and constructors, destructors and all that stuff,
223 00:39:34.170 --> 00:39:37.409 and, I don't know, interrupts,
224 00:39:38.429 --> 00:39:42.809 maybe you don't do that on the device. Okay.
225 00:39:42.809 --> 00:39:51.869 Although, as time goes on, what can be done on the device is getting more and more. But if it's a routine that's for both the host and the device, then
226 00:39:51.869 --> 00:39:56.159 you're limited in what you can do in the routine, or do efficiently, or do at all.
227 00:39:56.159 --> 00:40:01.079 So, okay.
228 00:40:01.079 --> 00:40:11.699 So you've got your source code, CUDA code, call it, a program. Your program has an extension .cu.
229 00:40:11.699 --> 00:40:18.300 nvcc splits it into 2 pieces: code for the host and code for the device.
230 00:40:18.300 --> 00:40:25.320 And the host code runs through a, um,
231 00:40:25.320 --> 00:40:28.380 C++ compiler, um,
232 00:40:28.380 --> 00:40:32.789 I think GCC or Clang or something.
233 00:40:32.789 --> 00:40:37.019 The device code, that's complicated;
234 00:40:37.019 --> 00:40:42.179 I'll get to it in a second. But then the whole thing gets merged into 1 executable that runs.
235 00:40:42.179 --> 00:40:46.019 Okay, the device code:
236 00:40:47.280 --> 00:40:51.210 what's happening here?
237 00:40:51.210 --> 00:40:54.300 NVIDIA has a,
238 00:40:54.300 --> 00:41:07.230 it's a little reminiscent of Java: NVIDIA compiles the CUDA code into an intermediate, just-in-time sort of code,
239 00:41:07.230 --> 00:41:15.210 a device code called PTX. It's not down at the assembly level; it's a level above it.
240 00:41:16.769 --> 00:41:20.610 This is what the nvcc compiler produces.
241 00:41:20.610 --> 00:41:31.769 And then what happens is that when you execute your nvcc CUDA program, at execution time it's actually compiled for the device.
242 00:41:31.769 --> 00:41:37.170 So your executable doesn't contain low-level
243 00:41:37.170 --> 00:41:40.710 device code; it contains this intermediate
244 00:41:40.710 --> 00:41:47.489 device code, PTX, which is a step above the actual hardware instructions.
245 00:41:48.264 --> 00:41:59.574 Now, this means that the first time you run your program, it's going to be slower, because of all the just-in-time compiling; at run time it has to be compiled.
246 00:42:00.175 --> 00:42:05.695 Now, the reason NVIDIA does that is they are future-proofing,
247 00:42:05.969 --> 00:42:14.969 because the next generation of the GPU will have different low-level assembly instructions,
248 00:42:14.969 --> 00:42:21.179 and with this 2-step process, your executable
249 00:42:21.179 --> 00:42:32.909 will run on the future GPU, which has different hardware instructions, because the just-in-time compiler, which you don't directly see, will be different.
250 00:42:32.909 --> 00:42:41.699 So your so-called executable has this PTX code. You run your old executable on your new,
251 00:42:42.719 --> 00:42:49.889 your next, GPU, which has different hardware instructions, and it will work, because the just-in-time compiler
252 00:42:49.889 --> 00:42:53.610 will compile your code into
253 00:42:53.610 --> 00:43:02.159 the new assembly instructions. So it makes things complicated, but it future-proofs your executables, and that's sort of nice.
254 00:43:02.159 --> 00:43:05.610 So you don't have to recompile
255 00:43:05.610 --> 00:43:16.050 your program. Well, you may if there's some user-visible novelty, but you don't have to: your old executable will run on the new GPU. So that's sort of nice.
256 00:43:17.610 --> 00:43:26.969 I'm leaving out some details, and it may not necessarily always run completely, but the intent is that it will run on the new thing.
257 00:43:28.139 --> 00:43:32.789 There are all sorts of architecture levels, which describe what
258 00:43:32.789 --> 00:43:47.550 capabilities, it's called compute capability, which describes what capabilities are available for you. So, for example, look at parallel: there are 2 GPUs on parallel, 2 different generations.
259 00:43:47.550 --> 00:43:58.619 So you could run the same executable on both of them, because the PTX code would compile into different hardware instructions for the 2,
260 00:43:58.619 --> 00:44:02.909 for the 2 different
261 00:44:02.909 --> 00:44:08.760 architectures. In fact, just a second here, see if I can
262 00:44:08.760 --> 00:44:13.559 start my VPN.
263 00:44:13.559 --> 00:44:28.079 Okay, this window runs on parallel. Oh, 27 security upgrades. Okay. Just for fun, I check who's running on the machine when I'm about to start a class.
264 00:44:28.079 --> 00:44:31.769 So.
265 00:44:31.769 --> 00:44:35.639 Okay, so what we have:
266 00:44:37.500 --> 00:44:45.389 you see, so, the device there, that's the Quadro 8000, and its compute capability is 7.5,
267 00:44:45.389 --> 00:44:49.679 CUDA runtime 11.2; that's fairly new.
268 00:44:49.679 --> 00:44:53.849 If I go to the old machine, the older one:
269 00:44:53.849 --> 00:44:58.260 the 2nd one is a 1080,
270 00:44:58.260 --> 00:45:01.409 and it is capability 6.1.
271 00:45:02.670 --> 00:45:09.030 So, what that means is the newer one has hardware capabilities that are not available
272 00:45:09.030 --> 00:45:13.889 on the older one. Each increase in compute capability means new facilities are available.
273 00:45:13.889 --> 00:45:19.289 So, in any case: however, if your program
274 00:45:19.289 --> 00:45:27.360 is written using only capabilities at the 6.1 level, you could run it on either,
275 00:45:27.360 --> 00:45:31.889 and it would efficiently use the new one, because
276 00:45:31.889 --> 00:45:36.989 the run-time, just-in-time compiler would do that.
277 00:45:36.989 --> 00:45:40.079 Well, I've got this thing up here.
278 00:45:40.079 --> 00:45:45.449 Warp size: there's always 32 threads per warp. A block has 1024.
279 00:45:46.530 --> 00:45:52.074 Here is your thread block; so the threads per block is 1024,
280 00:45:52.074 --> 00:45:53.844 but that could be 1024 by 1,
281 00:45:53.844 --> 00:46:01.914 or it could be 32 by 32 or something; they're 3-dimensional, and these are the max sizes. And then the grid size, that's 2 to the 31st minus 1.
282 00:46:03.389 --> 00:46:07.260 So, lots of
283 00:46:07.260 --> 00:46:15.000 blocks in the grid. But if we go up a little, there's the shared memory.
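What's being read off here is essentially the output of NVIDIA's deviceQuery sample. A sketch of fetching the same numbers through the runtime API; the print format is mine:

```cpp
#include <cstdio>

// Query the compute capability and hardware limits being read off on screen.
int main() {
    int count;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("device %d: %s, compute capability %d.%d\n",
               d, p.name, p.major, p.minor);
        printf("  multiprocessors: %d, max threads/block: %d\n",
               p.multiProcessorCount, p.maxThreadsPerBlock);
        printf("  shared mem/block: %zu bytes, registers/block: %d, warp size: %d\n",
               p.sharedMemPerBlock, p.regsPerBlock, p.warpSize);
    }
    return 0;
}
```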
284 00:46:15.000 --> 00:46:20.340 So, per block: 64K bytes of shared memory total.
285 00:46:20.340 --> 00:46:25.289 What about the registers? They're separate, and this is 64K registers at 4 bytes each. So,
286 00:46:25.289 --> 00:46:28.440 threads per block:
287 00:46:28.440 --> 00:46:37.739 maximum 1024. The multiprocessors are another level here I haven't mentioned; let me mention it, because I've got this up on the screen.
288 00:46:37.739 --> 00:46:41.340 Okay, so what we have:
289 00:46:43.380 --> 00:46:49.320 so, what are they
290 00:46:49.320 --> 00:46:57.179 called? Well, it's a multiprocessor, or a streaming multiprocessor or something,
291 00:46:57.179 --> 00:47:04.530 and there are 72 of them on the 8000.
292 00:47:04.530 --> 00:47:08.159 And each multi, so they're actual physical,
293 00:47:08.159 --> 00:47:11.340 physical areas on the chip.
294 00:47:11.340 --> 00:47:19.019 And each multiprocessor can have up to 64 CUDA cores, so it's 4608 CUDA cores total.
295 00:47:19.019 --> 00:47:25.139 So, um, if you're running,
296 00:47:25.139 --> 00:47:32.909 so, 64: that would be 2 warps of cores. And so you've got lots of
297 00:47:32.909 --> 00:47:38.130 warps in the block; they can be allocated to different multiprocessors,
298 00:47:38.130 --> 00:47:41.280 perhaps. I think so.
299 00:47:41.280 --> 00:47:47.820 Yeah, okay, we'll come back to other stuff here at other times.
300 00:47:51.989 --> 00:48:00.570 Constant memory: this is some fast memory that is visible to all of the threads.
301 00:48:00.570 --> 00:48:03.989 So, it's just constant, and it's fast,
302 00:48:03.989 --> 00:48:10.829 implemented by a cache. And that's the major interesting stuff here.
303 00:48:13.769 --> 00:48:28.559 Okay, so again: your program is split into host code and device code, and the device code is compiled into PTX code, which at run time is just-in-time compiled.
304 00:48:29.760 --> 00:48:38.760 No questions about, yeah. Okay.
305 00:48:38.760 --> 00:48:49.289 Question: can you show how it would access these different memory areas? Yeah, there are some declarators.
306 00:48:49.289 --> 00:48:59.789 We'll see programs later on that do that. So by default, your local scalars are in registers if they can be; if not, they spill over to local memory.
307 00:48:59.789 --> 00:49:08.070 The globals come in as arguments; it's a global, and you access it just like a scalar. Thank you, Isaac. We'll see that in more detail
308 00:49:08.070 --> 00:49:13.769 later. Okay.
309 00:49:29.639 --> 00:49:35.820 Oh, okay: multi-dimensional code. I don't think this is very interesting, but that's just me.
310 00:49:36.840 --> 00:49:43.170 They're distinguishing between the kernel and the grid: the kernel's the software and the grid's the hardware, I guess.
311 00:49:43.170 --> 00:49:47.789 Okay, um,
312 00:49:48.869 --> 00:50:00.269 I think we saw a little of this before, for a few seconds. So you have the 2-D grid of threads in the block, and you can map them to your problem; like, you're processing a 2-D picture.
313 00:50:00.269 --> 00:50:05.070 That's what they're talking about here.
314 00:50:06.329 --> 00:50:17.730 The relevance here: row-major versus column-major layout. The topic here is as follows. You have a 2-dimensional array, and you're mapping it, storing it, in linear memory.
315 00:50:17.730 --> 00:50:21.690 Do you want the
316 00:50:21.690 --> 00:50:29.159 2nd subscript to be varying fastest, or the 1st subscript varying fastest, as you step up through the memory?
317 00:50:29.159 --> 00:50:33.150 Most languages, like C,
318 00:50:33.150 --> 00:50:36.539 do row-major layout, where
319 00:50:36.539 --> 00:50:41.610 the rows are contiguous, and if you,
320 00:50:42.414 --> 00:50:54.204 you're going across: so if we look here at this 4-by-4 array, the row, the 4 yellow elements, are contiguous in memory, then the 4 red ones. So the rows are contiguous in memory.
321 00:50:54.835 --> 00:51:00.655 The exception to this is Fortran, which does column-major layout. So in Fortran,
322 00:51:00.960 --> 00:51:05.219 the columns are contiguous in memory.
323 00:51:06.449 --> 00:51:12.030 And Fortran did it first, because Fortran was invented, I think, in 1957,
324 00:51:12.030 --> 00:51:16.590 the same era as Lisp, I think. So Fortran at this point
325 00:51:16.590 --> 00:51:21.300 is 64 years old.
326 00:51:21.300 --> 00:51:27.840 The language has been extended somewhat, so your grandparents might have used it,
327 00:51:27.840 --> 00:51:42.599 and it's still used; it has inertia. Okay. In any case, row-major layout: this is relevant. Well, I'll tell you why this is relevant; I'm anticipating the next few slides.
328 00:51:42.599 --> 00:51:47.369 So each thread processes another element.
329 00:51:47.369 --> 00:51:50.940 For efficiency reasons, it's nice if adjacent threads
330 00:51:50.940 --> 00:51:56.250 are processing adjacent elements in memory.
331 00:51:56.250 --> 00:52:01.590 That just makes things more efficient.
332 00:52:01.590 --> 00:52:08.400 One thing is that if you're reading global memory, it reads 128 bytes at a time,
333 00:52:08.400 --> 00:52:14.849 and it's nice if you can use all 128 bytes, which would be 32 4-byte words, which would be a warp of threads.
334 00:52:14.849 --> 00:52:20.730 But, okay, that's going to be illegible for you, um,
335 00:52:22.019 --> 00:52:31.500 so it's just showing how, and let me see if I can zoom it in.
336 00:52:31.500 --> 00:52:34.530 Good, okay.
337 00:52:34.530 --> 00:52:42.420 Okay, it's scaling every pixel value. Let me walk you through what's happening here.
338 00:52:42.420 --> 00:52:45.719 We've got 2 arguments, which give the width and the height
339 00:52:45.719 --> 00:52:56.639 of the image in pixels; well, you can read the comments. So we take the thread index,
340 00:52:56.639 --> 00:53:10.260 and here we're assuming thread indices; everything's threadIdx.y and blockDim.y and blockIdx.y, so we're computing a row, and the column is dot-x, dot-x, dot-x.
341 00:53:10.260 --> 00:53:13.949 So we're assuming that the threads in the block,
342 00:53:13.949 --> 00:53:20.429 that the block is 2-dimensional: blockDim is now,
343 00:53:20.429 --> 00:53:27.329 got it, 2-dimensional, not just a scalar. So we can calculate a 2-dimensional row and column,
344 00:53:27.329 --> 00:53:30.780 and then what we can do is
345 00:53:31.949 --> 00:53:38.760 map it back down to a 1-dimensional index here, and grab the pixel and write it.
346 00:53:39.900 --> 00:53:45.750 You know, I frankly don't see the point of the 2-dimensional threads,
347 00:53:45.750 --> 00:53:49.469 but I'm presenting it to you since I have it here.
348 00:53:49.469 --> 00:53:53.670 If somebody sees a point for it, then,
349 00:53:53.670 --> 00:53:57.000 okay, um,
350 00:53:57.000 --> 00:54:01.349 here. Okay. So,
351 00:54:03.389 --> 00:54:07.710 how you do it up at the host side, so:
352 00:54:08.760 --> 00:54:14.190 we're assuming that the size of a block is 16 by 16 threads.
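A sketch of the 2-D pattern being walked through; the names are hypothetical. The row comes from the .y components, the column from the .x components, and the 16-by-16 launch covers a width-by-height image with ceiling division per axis:

```cpp
// Each thread scales one pixel. Row-major layout: rows are contiguous,
// so the flattened index is row * width + col.
__global__ void scalePixels(unsigned char *img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // .x : varies fastest
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // .y : the row
    if (row < height && col < width)                  // edge blocks hang off the image
        img[row * width + col] /= 2;                  // scale the pixel value
}

void launch(unsigned char *img, int width, int height) {
    dim3 dimBlock(16, 16);                                 // 16x16 = 256 threads per block
    dim3 dimGrid((width + 15) / 16, (height + 15) / 16);   // ceiling division per axis
    scalePixels<<<dimGrid, dimBlock>>>(img, width, height);
}
```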
353 00:54:14.190 --> 00:54:17.699 That's 256 threads, less than the max of 1024.
354 00:54:18.869 --> 00:54:22.889 And so this is the dimension of the
355 00:54:22.889 --> 00:54:27.809 threads in the block, that's the block dimension, and this would be the blocks in the grid. So,
356 00:54:30.239 --> 00:54:37.650 okay, and we're allowing for the fact that the image might not be a multiple of 16.
357 00:54:43.139 --> 00:54:51.750 Okay, so the point is that if your threads form a 16-by-16 block, you can block out your
358 00:54:51.750 --> 00:54:55.650 2-D block of data in your image, and so on.
359 00:54:55.650 --> 00:54:59.880 The point here: not all threads take the same control paths,
360 00:54:59.880 --> 00:55:07.650 and the things that differ are just all of your edge conditions.
361 00:55:10.440 --> 00:55:14.159 Okay, that was simple.
362 00:55:14.159 --> 00:55:17.849 A few seconds to ask questions.
363 00:55:24.150 --> 00:55:31.920 Nothing interesting on this slide, I think; I'm being unfair.
364 00:55:34.170 --> 00:55:40.170 Okay, multi-dimensional grid kernel: why you'd want to use this.
365 00:55:41.215 --> 00:55:54.625 So you're doing some RGB scaling, and you've got, this is 3-dimensional data, naturally: one dimension row, a 2nd dimension column, and the 3rd dimension is red versus green versus blue. Okay.
366 00:55:54.925 --> 00:55:56.454 We want to do some operation.
367 00:55:56.789 --> 00:56:02.730 This thing here: I've got to show you this thing in the lower right, for those of you that haven't seen it.
368 00:56:02.730 --> 00:56:12.690 Um, this is cool. It's unrelated to this course, but it's related to computer graphics.
369 00:56:12.690 --> 00:56:15.929 Okay, what's happening here?
370 00:56:17.940 --> 00:56:30.989 Your human visual system processes colors nonlinearly, and what this diagram here shows is how your human visual system will mix colors.
371 00:56:30.989 --> 00:56:39.659 The diagram is from a commission, the CIE,
372 00:56:39.659 --> 00:56:47.309 which did this in 1931. It's some French, Commission Internationale de l'Eclairage,
373 00:56:47.309 --> 00:56:50.550 which is the international lighting commission.
374 00:56:50.550 --> 00:56:55.650 And this maps colors that are the same intensity
375 00:56:55.650 --> 00:56:59.309 into a 2-dimensional coordinate system, x and y.
376 00:56:59.309 --> 00:57:06.750 Y is brightness, effectively, and what this shows is how colors will appear to mix.
377 00:57:06.750 --> 00:57:19.199 So if we have a color up here on the right and a color on the left, and we mix them, you take the linear combo of the coordinates, and
378 00:57:19.199 --> 00:57:27.630 that shows how the colors will mix in your visual system. So red, which is over here at perhaps 0.7,
379 00:57:27.630 --> 00:57:32.400 0.3: mix it 50-50 with cyan and you'll get white in the middle.
380 00:57:32.400 --> 00:57:38.909 Red and green will mix to get yellow. So this shows the apparent effect,
381 00:57:38.909 --> 00:57:46.320 to a human being, of how colors mix. So red and green mix to get yellow; red and cyan mix to get white.
382 00:57:46.320 --> 00:57:53.309 Red and blue mix to get something down here, which is not a spectral color; it's what we call purple.
383 00:57:53.309 --> 00:57:58.679 And the pure spectral colors are around the outside
384 00:57:58.679 --> 00:58:03.300 curve: from long-wavelength, low-frequency red
385 00:58:03.300 --> 00:58:08.909 to the short-wavelength, high-frequency violet here.
386 00:58:08.909 --> 00:58:14.340 And what's very nice is, this curve was determined by
387 00:58:14.340 --> 00:58:22.949 experiments on people. And the triangle here would be: if you have a 3-dimensional
388 00:58:24.269 --> 00:58:36.510 color system, printing on paper, or mixing some pure colors from some color sources, and if your 3 sources are the vertices of the triangle,
389 00:58:36.510 --> 00:58:46.739 then the colors that you can generate are points in the interior of the triangle. So if these are the 3 primary colors available for the triangle, you cannot generate anything out here.
390 00:58:46.739 --> 00:58:50.760 Okay; that diagram.
391 00:58:52.050 --> 00:58:55.320 It's sort of fun. It's not parallel computing, but,
392 00:58:55.320 --> 00:59:03.750 it looks like I'm teaching computer graphics again next fall, since no one else can teach it and there's student demand for it. So I'll be teaching this again.
393 00:59:03.750 --> 00:59:08.460 Okay, how do you do something like this on the
394 00:59:08.460 --> 00:59:13.590 device? Doing something: maybe you want to
395 00:59:13.590 --> 00:59:27.510 mix things with some formula, and this here is the official weighting formula, I think, for how you generate grayscale from RGB.
396 00:59:27.510 --> 00:59:36.000 To people, green appears brighter than red, and blue does not appear very bright at all.
397 00:59:36.000 --> 00:59:42.179 And these are the official weights to convert R and G and B to gray.
398 00:59:42.179 --> 00:59:46.019 We want to do that very fast; it's just a working example.
399 00:59:51.179 --> 00:59:57.780 Your skeleton code: you're bringing in the RGB image, which will be input,
400 00:59:57.780 --> 01:00:02.969 and your grayscale image, which will be output, and you have to know the width and height.
401 01:00:02.969 --> 01:00:10.110 Unsigned char, because the primaries, the intensities, are just 8 bits per pixel.
402 01:00:10.110 --> 01:00:14.489 Just as an aside: if you're doing high-quality
403 01:00:14.489 --> 01:00:19.320 processing, 8 bits is not enough to represent
404 01:00:19.320 --> 01:00:23.909 a color in a pixel. You probably want 12 bits,
405 01:00:23.909 --> 01:00:38.545 or 16 if you could do it. You can actually see green with better than 1-part-in-256 resolution; you can see the difference if the low bit changes for the green channel,
406 01:00:38.545 --> 01:00:39.295 for example, sometimes.
407 01:00:40.800 --> 01:00:44.460 And obviously, if you're doing mixing, you want to have extra bits to avoid
408 01:00:44.460 --> 01:00:49.469 truncation error. Okay. Nothing interesting here.
409 01:00:52.045 --> 01:01:06.985 And so we get the offset to where the colors for a pixel start, and we're just reading R, G, and B from the image into register variables,
410 01:01:08.940 --> 01:01:14.159 um,
411 01:01:15.809 --> 01:01:22.320 and then computing an output value down here.
412 01:01:22.320 --> 01:01:27.239 Floating point, yeah. Okay.
413 01:01:27.239 --> 01:01:34.320 And this would be your device program to convert from a color image to a
414 01:01:34.320 --> 01:01:41.250 grayscale image, doing every pixel in parallel, because it's 1 thread per pixel.
415 01:01:41.250 --> 01:01:46.949 So, nothing weird here.
416 01:01:46.949 --> 01:01:52.110 I'll leave that up for a second or two in case there are questions.
417 01:01:57.239 --> 01:02:05.429 Okay.
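A sketch of that grayscale kernel; the names are hypothetical, and the weights shown are the ITU-R BT.601 luminance coefficients, which may differ slightly from the slide's exact constants:

```cpp
// One thread per pixel: read the 3 color channels, write 1 gray value.
__global__ void rgbToGray(unsigned char *gray, const unsigned char *rgb,
                          int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int i = row * width + col;   // row-major pixel index
        int o = 3 * i;               // 3 bytes per pixel, so the RGB offset
        unsigned char r = rgb[o], g = rgb[o + 1], b = rgb[o + 2];
        // Green counts most, blue least: the perceptual weighting.
        gray[i] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
    }
}
```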
418 01:02:08.159 --> 01:02:18.059 We're indexing in here to determine the access for the thread, and this works because we've got a 2-dimensional array of threads in the block.
419 01:02:23.670 --> 01:02:27.480 Silence.
420 01:02:30.420 --> 01:02:35.760 Some more fun stuff.
421 01:02:39.780 --> 01:02:47.789 Suppose we want to blur the data like this; we use some little convolution filter to blur it.
422 01:02:47.789 --> 01:02:53.820 Okay, so here, what's new is that
423 01:02:55.710 --> 01:03:08.400 each thread not only gets the pixel that it's processing, but has to get this convolution support region of adjacent pixels. Now,
424 01:03:09.659 --> 01:03:19.050 those of you that took computer graphics, with vertex shaders and fragment shaders:
425 01:03:19.050 --> 01:03:26.730 this sort of thing cannot be done in that vertex/fragment
426 01:03:26.730 --> 01:03:31.170 model in computer graphics, actually, because
427 01:03:31.170 --> 01:03:44.190 the parallel idea in OpenGL, where you have fragment shaders and each fragment shader processes 1 pixel, has no access to the data in adjacent pixels.
428 01:03:44.190 --> 01:03:49.920 To do it with OpenGL would take two steps. But OpenGL's getting obsolete now.
429 01:03:49.920 --> 01:03:58.980 So, in any case, here the thread needs to have access to adjacent pixels.
430 01:03:58.980 --> 01:04:04.679 Blurring. Okay. So,
431 01:04:04.679 --> 01:04:14.369 this will be the global routine that runs on the device; that's the one where a CUDA thread will compute 1 pixel.
432 01:04:15.659 --> 01:04:20.730 We do our bounds checks, because the last, um,
433 01:04:20.730 --> 01:04:31.920 thread block may go off the edge of the image, and we don't want to be processing data off the edge of the image, because it's someone else's data.
434 01:04:31.920 --> 01:04:39.000 Reading may be okay; I don't want to write it. So, what is happening here?
435 01:04:42.420 --> 01:04:50.039 What we're doing here is we are
436 01:04:50.039 --> 01:04:53.610 iterating over the adjacent pixels.
437 01:04:53.610 --> 01:04:57.360 You know, we're computing a pixel:
438 01:04:59.099 --> 01:05:07.650 up here, we're computing the row and column for the pixel that we're computing, and we're computing it from
439 01:05:07.650 --> 01:05:11.070 the index of this particular thread.
440 01:05:12.179 --> 01:05:16.980 Okay, there are the row and column. What we're doing down here
441 01:05:16.980 --> 01:05:29.969 is we're iterating over the adjacent pixels. So maybe our convolution window is 3 by 3, so we want to go to the left, the right, above, and below, and that's what we're doing here: iterating.
442 01:05:31.170 --> 01:05:42.300 And then down here, we're going in and just summing in the adjacent pixel values,
443 01:05:42.300 --> 01:05:46.170 if we're within bounds. So here
444 01:05:46.170 --> 01:05:49.769 we're adding in, um,
445 01:05:50.789 --> 01:06:04.650 adding up the values of all the pixels in the window around our current pixel. That's what's happening here, and we do it only if it's within the image; we've got to check that we don't go off the edge of the image in either direction,
446 01:06:04.650 --> 01:06:08.099 so: greater than minus 1, and less than the size.
447 01:06:08.099 --> 01:06:16.710 If it was in bounds, we add it into our running total brightness, and we keep track of how many pixels we added.
448 01:06:16.710 --> 01:06:22.889 So we iterate over our convolution window, and then we write our output pixel here:
449 01:06:22.889 --> 01:06:29.010 our pixel value, and then we normalize by the number of pixels we actually added in.
450 01:06:30.389 --> 01:06:34.139 And we convert back to unsigned char,
451 01:06:34.139 --> 01:06:40.800 assuming, you know, we're assuming that a char is 8 bits,
452 01:06:40.800 --> 01:06:55.500 which on, I guess, all modern architectures is true; not on the one I used as a student. And unsigned: watch this one, because if you don't say unsigned, perhaps the char is signed
453 01:06:55.500 --> 01:06:59.610 and goes from minus 128 to plus 127.
454 01:07:01.530 --> 01:07:08.670 And I'm not even completely certain what the C++ standard says about a char if you don't say signed or unsigned;
455 01:07:09.840 --> 01:07:13.019 maybe it's signed by default, I don't know.
456 01:07:13.019 --> 01:07:20.340 Okay, so this was showing how to do this convolution on the GPU.
457 01:07:20.340 --> 01:07:23.789 So now, if you think about this here,
458 01:07:23.789 --> 01:07:29.730 okay, think about how this thing is implemented in hardware. It's
459 01:07:29.730 --> 01:07:38.190 still tricky. See, what's happening here is we're reading stuff from the global memory,
460 01:07:39.329 --> 01:07:47.940 and I said that the global memory has a latency of maybe 100 cycles, depending.
461 01:07:47.940 --> 01:07:59.190 And so this would mean that that line right here in your program is going to wait 100 cycles, and that sort of kills your parallel performance;
462 01:07:59.190 --> 01:08:02.849 the whole thing might be only 100 cycles, or a few hundred cycles.
463 01:08:02.849 --> 01:08:07.860 Okay, so why is this not a performance killer?
464 01:08:09.030 --> 01:08:17.189 A couple of reasons. The first reason is that adjacent threads,
465 01:08:17.189 --> 01:08:20.250 as we're iterating in row and column,
466 01:08:21.899 --> 01:08:29.640 adjacent threads, well, not just as we iterate, I mean the base pixel here:
467 01:08:29.640 --> 01:08:34.050 you see, adjacent threads are reading adjacent
468 01:08:34.050 --> 01:08:38.130 pixels. Now,
469 01:08:38.130 --> 01:08:41.340 a read instruction from the global memory
470 01:08:41.340 --> 01:08:46.890 reads, I think, 128 bytes in one go. There is
471 01:08:46.890 --> 01:08:49.920 this 100-cycle latency,
472 01:08:49.920 --> 01:08:54.569 but then, bang, the 128 bytes come in, and I believe
473 01:08:54.569 --> 01:08:58.859 it can read the next 128 bytes in the next cycle. So
474 01:08:58.859 --> 01:09:08.729 there's a latency, but once you pay the latency, the bandwidth is fast; really, really, really fast, actually. So
475 01:09:08.729 --> 01:09:17.460 what this means is that adjacent threads are reading adjacent
476 01:09:18.539 --> 01:09:30.090 pixels, and they're physically adjacent in the global memory. So one 128-byte read from the global memory provides data for 32 threads, the whole warp.
477 01:09:31.199 --> 01:09:40.229 So there's a 100-cycle latency, but bang: the 32 threads, the warp, in the next cycle all get their pixel.
478 01:09:41.250 --> 01:09:45.930 And then in the next cycle, the next 32 threads,
479 01:09:45.930 --> 01:09:52.439 all those threads, get their data. So this is the design philosophy
480 01:09:53.640 --> 01:10:07.920 underlying it: it goes for bandwidth. That is a really big bandwidth when you've got a lot of threads, because each cycle you get 128 bytes
481 01:10:07.920 --> 01:10:16.109 of data from the global memory.
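Putting the walkthrough together, here is a sketch of such a blur kernel; the names and the BLUR_SIZE constant are illustrative, and BLUR_SIZE of 1 gives the 3-by-3 window mentioned:

```cpp
#define BLUR_SIZE 1   // window half-width: 1 gives a 3x3 box

// Each thread averages the pixels in a (2*BLUR_SIZE+1)^2 window around its
// own pixel, clipping the window at the image edges.
__global__ void blur(const unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {                              // the fractional-block check
        int sum = 0, count = 0;
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr)       // above and below
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) { // left and right
                int r = row + dr, c = col + dc;
                if (r > -1 && r < h && c > -1 && c < w) {      // stay inside the image
                    sum += in[r * w + c];                      // sum in the neighbor
                    ++count;                                   // pixels actually added
                }
            }
        out[row * w + col] = (unsigned char)(sum / count);     // normalize and store
    }
}
```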
So what NVIDIA did is they traded off latency for bandwidth.
482 01:10:16.109 --> 01:10:20.430 You've got that latency to get started, but once it goes, it
483 01:10:20.430 --> 01:10:26.939 goes. And we had a slide, like two days ago or something, showing it: whereas on the host, on the Intel,
484 01:10:26.939 --> 01:10:37.289 you don't have this sort of latency, but your bandwidth is slower. So the bandwidth inside the GPU is really fast,
485 01:10:37.289 --> 01:10:48.329 but you have to work with it. And the way you work with it is, well, you have thousands of threads, so the high bandwidth: the thousands of threads need
486 01:10:48.329 --> 01:10:55.289 a lot of data. And the thing is, the threads are accessing adjacent words
487 01:10:55.289 --> 01:11:01.409 in the memory, and that's how the high bandwidth is useful.
488 01:11:01.409 --> 01:11:08.609 You see? So lots of threads means the bandwidth is useful, if the programmer
489 01:11:08.609 --> 01:11:13.979 cooperates. You see, so there is your trade-off: high latency, but
490 01:11:13.979 --> 01:11:21.000 very high bandwidth. That's one of the keys, and thousands of threads running in parallel.
491 01:11:21.000 --> 01:11:26.640 However, the code in the threads has to be simple,
492 01:11:26.640 --> 01:11:38.100 which is why you've got the single instruction, multiple thread concept; SIMT is an acronym that NVIDIA tends to use, single instruction, multiple thread.
493 01:11:38.100 --> 01:11:47.699 And this is why, as I say, weird data structures, like the stuff they love to teach in CS1 or Data Structures or something, are not
494 01:11:47.699 --> 01:11:57.329 totally efficient on the device. Pointer chasing, for example; anything which throws the threads in the warp out of sync with each other
495 01:11:57.329 --> 01:12:04.949 is going to be horribly slow. So: pointer chasing, recursion; you want your nice and simple
496 01:12:04.949 --> 01:12:12.270 data structures. It's called a structure of arrays here. The ideal data structure
497 01:12:12.270 --> 01:12:19.199 for the device is an array of plain old data types: an array of ints, an array of floats,
498 01:12:21.385 --> 01:12:34.345 not even an array of 3-D coordinates; that's bad. Down here: you would have an array of Xs, an array of Ys, and an array of Zs, not an array of 3-D points. So, a structure of arrays; that's the way
499 01:12:34.890 --> 01:12:39.569 you structure your data and your code, so as to use the hardware.
500 01:12:39.569 --> 01:12:45.539 Okay, so this is showing, and again, so,
501 01:12:45.539 --> 01:12:50.520 again, so the thread is accessing adjacent pixels, and
502 01:12:50.520 --> 01:12:55.619 along the row they're adjacent, and along the column
503 01:12:55.619 --> 01:13:02.970 they're a fixed offset from each other. But again, for adjacent threads the offsets work out, so probably,
504 01:13:02.970 --> 01:13:08.729 with a cache that we're not talking about, this sort of thing is going to be fast.
505 01:13:08.729 --> 01:13:13.920 So, okay, so that's the deep lesson on this slide here,
506 01:13:13.920 --> 01:13:18.779 and I'll leave it up in case there are questions. So,
507 01:13:18.779 --> 01:13:21.930 so, it's exploiting the, um,
508 01:13:21.930 --> 01:13:29.850 the hardware: the pixels, the whole image, are in global memory,
509 01:13:29.850 --> 01:13:36.149 and adjacent pixels are adjacent in memory. There's a latency to start reading data.
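To make the structure-of-arrays point concrete, a hypothetical illustration, not from the slides:

```cpp
// Array of structures: thread i reading pts[i].x makes the warp's 32 loads
// stride 12 bytes apart, wasting most of each 128-byte memory transaction.
struct Point3 { float x, y, z; };
Point3 pts[10000];                 // discouraged layout for device code

// Structure of arrays: thread i reads xs[i], so a warp's 32 loads are
// 32 consecutive 4-byte words: one coalesced 128-byte read.
struct Points {
    float xs[10000];
    float ys[10000];
    float zs[10000];
};
```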
511 01:13:40.800 --> 01:13:46.140 So the key is, a read fetches 128 bytes from the global memory,
512 01:13:46.140 --> 01:13:54.750 and that's not for one thread. So one thread requests a byte or a word — 32 threads, 32 words — but the thing is, that read is available to all the threads in the warp.
513 01:13:54.750 --> 01:13:59.039 So that's all handled invisibly. That's the key.
514 01:13:59.039 --> 01:14:05.729 Okay, let's look at —
515 01:14:05.729 --> 01:14:09.149 let's see what's happening next.
516 01:14:09.149 --> 01:14:13.170 [silence]
517 01:14:16.380 --> 01:14:20.159 Okay, what do we have here?
518 01:14:22.020 --> 01:14:28.859 Oh, okay, so now we're getting into some hardware stuff.
519 01:14:28.859 --> 01:14:38.100 I mean, a thread block is a software concept, but some hardware has to then execute it.
520 01:14:38.100 --> 01:14:41.609 And capacity constraints, as I mentioned:
521 01:14:42.659 --> 01:14:46.260 there are limited amounts of stuff like registers,
522 01:14:46.260 --> 01:14:49.289 floating-point processors, and so on.
523 01:14:49.289 --> 01:15:02.100 And so this may mean that some blocks and warps will run one after the other, not all at the same time. And the zero overhead — this is this thing that you've got a queue,
524 01:15:02.100 --> 01:15:09.119 or — maybe not a queue at the thread level, but you have a number of warps that want to run,
525 01:15:09.119 --> 01:15:12.960 and as soon as resources are available, the next cycle,
526 01:15:12.960 --> 01:15:16.170 a warp runs. Zero overhead. And —
527 01:15:16.170 --> 01:15:22.020 I don't know the details; I'm inferring that it's done with asynchronous logic,
528 01:15:22.020 --> 01:15:27.149 which is tricky to design —
529 01:15:27.149 --> 01:15:30.989 subject to a lot of hazards; that's the buzzword used.
530 01:15:30.989 --> 01:15:37.229 A lot of hazards. But if you can get it to work, it's fast.
531 01:15:37.229 --> 01:15:41.880 Hazards are the sort of thing that, um —
532 01:15:41.880 --> 01:15:48.329 you know, you've got gates: an AND gate, an OR gate, or physically a NAND gate.
533 01:15:48.329 --> 01:15:54.539 So you change the inputs, and the output changes a nanosecond later, let's say.
534 01:15:55.619 --> 01:16:07.710 So now, let's suppose one of the inputs to the gate has a NOT on it. The NOT takes a fraction of a nanosecond after its input changes. So
535 01:16:07.710 --> 01:16:11.760 if one input to, say, a, um,
536 01:16:11.760 --> 01:16:15.270 NAND gate has a NOT, but not the other one,
537 01:16:15.270 --> 01:16:23.970 so to speak, then the two inputs to the gate are available at different times. The NAND gate's output immediately reflects its input changes —
538 01:16:23.970 --> 01:16:32.430 well, a nanosecond later, perhaps. So if the inputs to the gate are not available at the same time,
539 01:16:32.430 --> 01:16:38.640 there's a little interval when one input has the proper value, but the other input does not yet,
540 01:16:38.640 --> 01:16:45.659 and in that little interval the output from the NAND gate will be wrong — it'll be fake; there'll be a fake blip,
541 01:16:45.659 --> 01:16:49.560 which will go away once all the inputs have,
542 01:16:49.560 --> 01:16:53.430 you know, stabilized. But that little output blip,
543 01:16:53.430 --> 01:16:57.600 which in the software shouldn't be there — you know, this,
544 01:16:57.600 --> 01:17:10.229 this might be a problem if you don't design for it, because if you're counting blips, this is an extra blip, let's say. So there's such an issue that happens with asynchronous hardware design, but
545 01:17:10.229 --> 01:17:15.180 if you can make it work, it's fast. The reason they go synchronous —
546 01:17:15.180 --> 01:17:23.159 CPUs, where you have a clock and everything waits for the next clock cycle — is that the clock slows things down, yeah,
547 01:17:23.159 --> 01:17:29.310 but it makes things work. So you have a data bus that's got 32
548 01:17:29.310 --> 01:17:40.260 bits on it. The thing is, the 32 bits may arrive at different times. Well, you don't look at them until the next clock cycle. See, it slows you down, but stuff gets reliable. Okay.
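Going back to the capacity constraints a moment ago — you can actually ask the CUDA runtime how many blocks of a given kernel fit on one streaming multiprocessor at a time. A sketch, assuming a hypothetical kernel named myKernel; cudaOccupancyMaxActiveBlocksPerMultiprocessor is a real runtime call:

    // Sketch: query how many blocks of myKernel can be resident
    // per SM at once, given its register and shared-memory use.
    // Blocks beyond that wait in the queue, as described above.
    #include <cstdio>

    __global__ void myKernel(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] += 1.0f;                  // placeholder work
    }

    int main()
    {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, myKernel, /*blockSize=*/256,
            /*dynamicSMemBytes=*/0);
        printf("resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }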
549 01:17:40.260 --> 01:17:45.180 What is transparent scalability?
550 01:17:46.380 --> 01:17:49.529 And then I'll stop in a minute, because we'll get into it next class.
551 01:17:49.529 --> 01:17:55.500 This is a big idea. What it's saying is that NVIDIA's hardware —
552 01:17:55.500 --> 01:18:02.850 there are different versions of the device, of the GPU, with different numbers
553 01:18:03.354 --> 01:18:12.295 of hardware resources: stuff like the number of streaming multiprocessors and, accordingly, the number of CUDA cores and whatever.
554 01:18:12.295 --> 01:18:20.725 So if you've got lots of threads — if there are more threads in your software than there are CUDA cores in the hardware —
555 01:18:21.029 --> 01:18:29.760 it doesn't matter. The threads just wait, and then they run when they can. And if you run your program on a bigger, faster GPU,
556 01:18:29.760 --> 01:18:36.239 it will run faster, but if you did it right, you get the same answer.
557 01:18:36.239 --> 01:18:42.840 So this is transparent scalability: you buy a more expensive GPU and you plug it in,
558 01:18:42.840 --> 01:18:46.260 and your program will run; it will just run faster. So
559 01:18:46.260 --> 01:18:53.939 the hardware scales up, and it's transparent to the user, unless you're checking some real-time microsecond behavior.
560 01:18:53.939 --> 01:19:00.119 This is a powerful idea here, and it's got a good history, actually.
561 01:19:00.119 --> 01:19:08.699 Many, many years ago — the early 1960s — the Itty Bitty Machine Corporation —
562 01:19:08.699 --> 01:19:12.720 oh, I'm sorry, the International Business Machines Corporation —
563 01:19:12.720 --> 01:19:26.159 they did the same thing. They had this series — it was called their System/360 — and they had like half a dozen machines, from small and cheap and slow up to big and expensive and fast.
564 01:19:26.159 --> 01:19:34.979 And they had this idea of transparent scalability: all their machines ran the same instruction set, ran the same program.
565 01:19:34.979 --> 01:19:47.279 It's just that the small, slow machines did a lot of emulation and so on, and the fast machines threw lots of gates at it — the expensive machines did it fast. But the same program,
566 01:19:47.279 --> 01:19:54.630 in principle, would run. And NVIDIA is doing the same, and IBM became the biggest computer company in the world
567 01:19:54.630 --> 01:20:03.659 by doing things like this. At the start they had a number of competitors; at the end they did not. So,
568 01:20:03.659 --> 01:20:07.949 so NVIDIA is doing the same thing here: transparent scalability.
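One common way to write your own code so it scales transparently across GPUs of different sizes is a grid-stride loop — a minimal sketch, not tied to anything specific in the lecture:

    // Sketch: a grid-stride loop. The same kernel is correct
    // whether the hardware runs 2 blocks at a time or 200; a
    // bigger GPU just means fewer trips around the loop per
    // thread, i.e. the program simply runs faster.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }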
569 01:20:07.949 --> 01:20:13.859 So that's a good point to stop. What am I on here? Lecture —
570 01:20:15.659 --> 01:20:26.395 whatever — 3.5, transparent scalability. We'll pick up there, and we'll actually run some programs. I wanted to do that today, but I've been at the desktop presenting slides to you today.
571 01:20:26.395 --> 01:20:39.145 Now, if you think I'm going too slowly, you are welcome to read ahead of me. I'm just going through this at a natural speed, because this is interesting stuff here. So, what we're doing for several weeks is
572 01:20:40.229 --> 01:20:54.569 seeing how GPUs work. Okay, what will happen next? Anticipating the future of the course: we may see another software tool or something to use the hardware, perhaps something like
573 01:20:55.614 --> 01:21:09.204 Thrust, which is a parallel version of the Standard Template Library — a way to write C++ code that will run fast. It's functional programming, and it's designed to run fast on the device.
574 01:21:09.925 --> 01:21:13.375 We might see that; we might see some parallel
575 01:21:13.680 --> 01:21:17.430 stuff with the current C++ standard, um,
576 01:21:17.430 --> 01:21:21.960 and then the next chunk will be —
577 01:21:22.944 --> 01:21:33.744 we'll do a chunk on quantum computing, which I had a full course on in the fall, but I noticed none of you in my parallel class were also in my quantum class in the fall.
578 01:21:33.744 --> 01:21:39.324 So I think you would like to have a good chunk — a month or so — on quantum computing, which we will do.
579 01:21:39.630 --> 01:21:44.460 Oh, and by the way, RPI is thinking about
580 01:21:44.460 --> 01:21:48.000 emphasizing quantum computing more in the curriculum.
581 01:21:48.000 --> 01:21:57.779 So, we don't know quite what that means, but we can see it as a competitive advantage if we make quantum computing more important in the curriculum, whatever that takes. So,
582 01:21:57.779 --> 01:22:01.289 okay, so that's enough new stuff for today.
583 01:22:01.289 --> 01:22:06.810 If there are
584 01:22:06.810 --> 01:22:11.279 questions, I'll stay around a minute or two; other than that,
585 01:22:11.279 --> 01:22:14.789 we can all go off and get lunch.
586 01:22:14.789 --> 01:22:20.220 Questions? Yes.
587 01:22:20.220 --> 01:22:23.310 How does GPU parallelization
588 01:22:23.310 --> 01:22:27.689 interact with CPU-based parallelization?
589 01:22:27.689 --> 01:22:33.930 It does not. They are two unrelated things. You can run multicore
590 01:22:33.930 --> 01:22:39.119 on the host at the same time as you're doing the many cores —
591 01:22:39.119 --> 01:22:46.020 the thousand threads — on the device.
592 01:22:46.020 --> 01:22:50.250 They don't affect each other. You can write one program which
593 01:22:50.250 --> 01:22:57.539 does both. If you're going to be calling a global routine from
594 01:22:57.539 --> 01:23:01.109 inside a parallel region in OpenACC —
595 01:23:01.109 --> 01:23:06.989 that could be a fun project. So, how you would —
596 01:23:08.159 --> 01:23:15.210 yeah, there's no reason you can't. It's just, if you're inside a parallel block in OpenACC —
597 01:23:17.699 --> 01:23:20.909 well, what would that mean, to then call —
598 01:23:23.069 --> 01:23:29.880 you know, call CUDA. What I haven't gotten to yet is that your host program can start several kernels
599 01:23:29.880 --> 01:23:36.899 on the device. So you could have multiple threads in OpenMP, each starting a separate kernel on the device. You've got to keep your data —
600 01:23:36.899 --> 01:23:43.619 you know, your addressing — straight, but sure.
So what happens — the device is like, it's like a mini
601 01:23:43.619 --> 01:23:51.479 time-sharing operating system, actually; they don't call it that. And so, if you started a hundred
602 01:23:51.479 --> 01:23:59.640 kernels on the device — if you go back up to here, I'm going to —
603 01:23:59.640 --> 01:24:09.449 so, there's a limit to how many kernels you can run at once, but if you've got more — I think you can have a thousand or more — they queue up and they wait.
604 01:24:09.449 --> 01:24:15.180 So, yeah, you could have an OpenACC parallel loop there
605 01:24:15.180 --> 01:24:23.460 starting up lots of kernels, and then on the device they just sit and wait until they can run, and
606 01:24:23.460 --> 01:24:28.170 it probably would run very fast because, you know,
607 01:24:28.170 --> 01:24:32.609 if they're using different resources on the device, that would fit together nicely.
608 01:24:34.199 --> 01:24:38.069 Anything else? That could be a really fun thing to try.
609 01:24:38.069 --> 01:24:42.899 Thanks for the suggestion. Other suggestions?
610 01:24:44.909 --> 01:24:47.909 No? In that case —
611 01:24:49.050 --> 01:24:53.760 cool. Let me just —
612 01:24:55.560 --> 01:24:59.220 I need to save this.
613 01:24:59.220 --> 01:25:06.779 And —
614 01:25:08.250 --> 01:25:16.439 and see you Monday. Have a good weekend; get out, get some exercise or something.
615 01:25:16.439 --> 01:25:21.689 Okay.
616 01:25:44.010 --> 01:25:48.659 Okay.
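As a postscript to the last question — a host program starting several kernels that the device then queues and runs as resources allow — here is a minimal sketch using CUDA streams. The kernel, sizes, and names are all made up for illustration:

    #include <cuda_runtime.h>

    __global__ void myKernel(float *p)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        p[i] *= 2.0f;                  // placeholder work
    }

    int main()
    {
        const int K = 4;               // number of kernels to queue
        float *d_data[K];
        cudaStream_t streams[K];

        for (int k = 0; k < K; ++k) {
            cudaMalloc(&d_data[k], 64 * 256 * sizeof(float));
            cudaStreamCreate(&streams[k]);
            // Each launch returns immediately; the device queues
            // the kernels and runs them as resources free up.
            myKernel<<<64, 256, 0, streams[k]>>>(d_data[k]);
        }

        cudaDeviceSynchronize();       // wait for all of them
        for (int k = 0; k < K; ++k) {
            cudaStreamDestroy(streams[k]);
            cudaFree(d_data[k]);
        }
        return 0;
    }

Putting each launch in its own stream lets independent kernels overlap on the device, which is one way the multi-threaded host program described above could fit together nicely.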