WEBVTT 1 00:00:10.648 --> 00:00:13.829 Silence. 2 00:01:42.989 --> 00:01:52.200 To be clear, did anybody find the homework? I think he just posted it today. 3 00:01:52.200 --> 00:01:56.549 So nobody did it because nobody could find it, right? Like, yeah, it wasn't. 4 00:01:56.549 --> 00:01:59.640 There was no writing in the document. 5 00:01:59.640 --> 00:02:07.260 Until today, I think, because I couldn't find it, and I don't think I, no, no. 6 00:02:07.260 --> 00:02:10.919 Thank you. 7 00:02:43.110 --> 00:02:48.719 Silence. 8 00:02:51.990 --> 00:02:57.419 Silence. 9 00:03:01.830 --> 00:03:05.400 Silence. 10 00:03:08.819 --> 00:03:13.979 Silence. 11 00:03:31.740 --> 00:03:56.879 Silence. 12 00:03:56.879 --> 00:04:00.060 Okay. 13 00:04:00.060 --> 00:04:09.719 Playback again. Let's see. Now. 14 00:04:30.389 --> 00:04:34.079 So, so. 15 00:04:43.918 --> 00:05:02.369 Silence. 16 00:05:03.838 --> 00:05:08.879 Let's see now, no audio. 17 00:05:10.588 --> 00:05:22.528 Silence. Right? 18 00:05:22.528 --> 00:05:32.788 Silence. 19 00:05:32.788 --> 00:05:37.798 Right. Silence. 20 00:05:42.209 --> 00:05:49.709 Silence. 21 00:05:58.168 --> 00:06:02.009 Silence. 22 00:06:02.009 --> 00:06:07.738 Well. 23 00:06:12.449 --> 00:06:21.059 So, if someone can hear me, then mention it in chat; I cannot tell if you're hearing me at the moment. 24 00:06:22.379 --> 00:06:27.959 Good. Okay. Thank you. 25 00:06:27.959 --> 00:06:31.769 So, Parallel Computing, 26 00:06:31.769 --> 00:06:34.949 March 4th. 27 00:06:34.949 --> 00:06:39.658 And, Terry, you can see the screen. 28 00:06:39.658 --> 00:06:44.278 So, we're continuing on talking about NVIDIA, 29 00:06:44.278 --> 00:06:49.019 and we're continuing on with their work, 30 00:06:49.019 --> 00:06:56.939 with material from their website. I've also linked here to some newer stuff, which we'll cover in detail later. 31 00:06:56.939 --> 00:07:00.209 Ampere is the current 32 00:07:01.314 --> 00:07:10.163 architecture, out in the last year, and there's also a white paper that talks about things in more detail, about caches and so on. 33 00:07:10.673 --> 00:07:17.153 But meanwhile, I just put a new homework up, on what we're talking about, to do in parallel. 34 00:07:17.579 --> 00:07:20.579 So implement that and see how that works out. 35 00:07:20.579 --> 00:07:26.129 Okay, and I'm going to be going fairly quickly through some of these next set of lectures. 36 00:07:26.129 --> 00:07:32.668 So, module 6.2, let us see here. 37 00:07:34.288 --> 00:07:40.858 Silence. 38 00:07:44.488 --> 00:07:48.449 Silence. 39 00:07:50.788 --> 00:07:55.978 Okay, Eva. Okay. 40 00:07:55.978 --> 00:08:03.238 Memory coalescing. So what's happening here is that you read a chunk of memory: 41 00:08:03.238 --> 00:08:08.759 you read some data from the global memory on the GPU, 42 00:08:08.759 --> 00:08:12.598 and it reads, it reads a chunk. It's a 128- 43 00:08:12.598 --> 00:08:17.129 byte chunk, and this is implemented as the cache, actually. So 44 00:08:17.129 --> 00:08:21.538 you read it, and it reads the whole chunk into the cache. Now that's available 45 00:08:21.538 --> 00:08:25.678 for other threads, that is, 46 00:08:25.678 --> 00:08:29.038 if the other threads are accessing adjacent memory. 47 00:08:29.038 --> 00:08:35.578 And they call them burst sections here, but they mean chunks that go into the cache. So, 48 00:08:36.839 --> 00:08:46.193 128 bytes, as I said. So that if consecutive threads are reading consecutive words from the global memory, that's nice, 49 00:08:46.193 --> 00:08:52.374 because those consecutive words now got loaded into the cache, and the other threads can get them really fast.
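Roughly, the contrast being described looks like this. A minimal sketch, not the slides' code; the array A, its size n, and the kernel names are this example's own, and it assumes the grid supplies at least n threads:

```cuda
// Sketch: row access vs. column access on an n-by-n row-major array.
__global__ void sum_coalesced(const float *A, float *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    float sum = 0.0f;
    for (int row = 0; row < n; ++row)
        sum += A[row * n + t];   // threads t and t+1 read adjacent words:
                                 // one 128-byte burst serves the whole warp
    out[t] = sum;
}

__global__ void sum_uncoalesced(const float *A, float *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += A[t * n + col];   // threads t and t+1 are n words apart:
                                 // separate trips back to global memory
    out[t] = sum;
}
```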
50 00:08:53.938 --> 00:08:58.889 If they're not reading consecutive words, then 51 00:08:58.889 --> 00:09:03.989 they're not necessarily all in the cache, so this requires multiple 52 00:09:03.989 --> 00:09:09.839 trips back to the global memory. 53 00:09:09.839 --> 00:09:19.889 And how you do it: so, threads in the warp read consecutive words, and if you have the array in row-major order, 54 00:09:19.889 --> 00:09:25.408 and adjacent threads are doing adjacent columns in the same row, then everything works out very nicely, 55 00:09:25.408 --> 00:09:29.458 like here on the left. 56 00:09:29.458 --> 00:09:34.708 On the right, if you're going by column, it's not so nice. 57 00:09:34.708 --> 00:09:37.889 So. 58 00:09:40.139 --> 00:09:47.009 And somehow they say B is not coalesced; they're saying some columns 59 00:09:47.009 --> 00:09:54.028 are coalesced, whatever; it depends on how you're labeling stuff, but okay. 60 00:09:54.028 --> 00:09:59.969 And one way to help, as mentioned before, is if you can group the, um, 61 00:09:59.969 --> 00:10:08.428 divide the input arrays into tiles, and then once you get it, the whole tile goes into the cache, 62 00:10:08.428 --> 00:10:15.688 or into fast shared memory, and then you can get at it easily. So we talked about that last time. So I'm going through this quickly. 63 00:10:15.688 --> 00:10:20.308 That was that one, past 6.2. Okay, to move on. 64 00:10:27.239 --> 00:10:30.599 Silence. 65 00:10:36.899 --> 00:10:42.208 Oh, okay. 66 00:10:42.208 --> 00:10:47.668 Okay, so here we're talking about what we will be, 67 00:10:47.668 --> 00:10:52.109 oops, back up. Okay. Now. 68 00:10:53.188 --> 00:10:56.548 It's a nice example of a program that 69 00:10:56.548 --> 00:10:59.818 demonstrates parallel computing. 70 00:10:59.818 --> 00:11:06.028 So, we'll use this for several classes. You all know what histograms are; I can skip through that. 71 00:11:06.028 --> 00:11:10.649 Nothing interesting here. 72 00:11:10.649 --> 00:11:14.369 Histogram in groups of 4 letters. 73 00:11:16.889 --> 00:11:22.589 Okay, now, the interesting content of these few slides 74 00:11:22.589 --> 00:11:27.958 is that there are different ways you can do the histogram program in parallel, 75 00:11:27.958 --> 00:11:35.129 and the obvious way might not be the best way. So there's actually information content here. 76 00:11:36.749 --> 00:11:44.938 And so the first way, if we wish to do the histogram in parallel: we might partition our input data, 77 00:11:46.379 --> 00:11:50.879 and each thread takes a chunk of the input data and does a, 78 00:11:50.879 --> 00:11:54.269 and, um, 79 00:11:54.269 --> 00:12:03.688 here the input text is "programming massively parallel processors". Thread 0 takes "programming", thread 1 takes "massively", 80 00:12:04.769 --> 00:12:07.769 then "parallel", and then "processors". 81 00:12:07.769 --> 00:12:12.028 And they are totaling into, here's a histogram 82 00:12:12.028 --> 00:12:17.278 array down at the bottom. Now, the potential problem 83 00:12:17.278 --> 00:12:27.778 is that different threads may increment the same counter in the histogram, right? Simultaneously. Like here, three threads, er, 84 00:12:28.374 --> 00:12:37.344 purely by chance, they've each got a letter in the m-to-p range. So all three threads simultaneously want to increment this counter here.
85 00:12:37.854 --> 00:12:46.224 Now what this means is, of course, you have to do some sort of locking or mutex or atomic read-modify-write instruction 86 00:12:46.528 --> 00:12:52.048 to keep this total correct. And, anticipating the next few slides: 87 00:12:52.048 --> 00:12:56.729 whatever you do, whatever you do, is going to be slow. 88 00:13:00.899 --> 00:13:06.119 Here now they're just going through this, incrementing counters here and so on. 89 00:13:12.354 --> 00:13:23.514 There's another problem with this. If I just go back here: so thread 0 gets this chunk of the input data, thread 1 gets that chunk, thread 2 gets the third chunk, 90 00:13:23.609 --> 00:13:30.538 and thread 3 gets the fourth chunk. A second problem with this partitioning method 91 00:13:30.538 --> 00:13:39.269 is that consecutive threads are not accessing successive words of memory, so we cannot take advantage of coalescing. 92 00:13:39.269 --> 00:13:45.089 And that's what it talks about here. We've got the red thread's reads, the green thread's, and the blue thread's. 93 00:13:45.089 --> 00:13:53.129 So, accesses are not coalesced. So if you're thinking ahead of me, what we want to do is intersperse things, perhaps. 94 00:13:53.129 --> 00:14:00.328 Okay, this is a bad way, so what do we do? We interleave, down here. So 95 00:14:00.328 --> 00:14:04.769 consecutive threads read consecutive bytes from memory, 96 00:14:04.769 --> 00:14:11.099 and this way, the memory accesses will be coalesced and will be a lot faster. 97 00:14:11.099 --> 00:14:15.958 Silence. 98 00:14:15.958 --> 00:14:29.219 So, thread 0 gets the characters that are 0 mod 4, thread 1 gets the characters that are 1 mod 4, et cetera, et cetera: interleaved partitioning. And that will give you better performance from the global memory. 99 00:14:30.298 --> 00:14:41.969 We still have the issue that multiple threads may be incrementing the same counter simultaneously, but we've improved one problem. 100 00:14:41.969 --> 00:14:47.639 That was the interleaved partitioning. 101 00:15:02.609 --> 00:15:16.589 Okay, this slide set is going to talk about implementing the shared global counter. And the problem that happens, it describes, is a data race, 102 00:15:17.818 --> 00:15:24.928 where two threads are updating the same thing. And, as we've seen before, like a month ago, atomic operations help with that, 103 00:15:24.928 --> 00:15:30.568 fix the problem. So, two threads accessing the same counter here. 104 00:15:35.969 --> 00:15:39.028 Oh, okay. There's nothing interesting there. 105 00:15:39.028 --> 00:15:43.528 I'm skipping through this, um, 106 00:15:43.528 --> 00:15:47.759 data race; you all know what data races are by now. 107 00:15:49.168 --> 00:16:02.759 Because they all read from the common memory location, update their private new value, write it back: you can all see what's happening. I don't need to go through this in detail, I think. If I do, then tell me. 108 00:16:04.469 --> 00:16:11.788 So, we have the atomic operations that we've talked about before. 109 00:16:11.788 --> 00:16:16.649 That was, ah, 7.2. 110 00:16:16.649 --> 00:16:20.219 That's what I said about going through some of these things fast. 111 00:16:23.639 --> 00:16:29.068 So, CUDA has an atomic operation to do this for you. 112 00:16:32.698 --> 00:16:36.778 Again, the data race problem. 113 00:16:38.908 --> 00:16:47.879 And what we have to do is ensure that no other threads can access that location until the first thread is finished.
114 00:16:47.879 --> 00:16:56.038 So, what this does is it makes that an atomic operation, so when we finish this here, it will be correct. 115 00:16:56.038 --> 00:17:03.389 However, it will be slower. And so, CUDA has these function calls. So, 116 00:17:03.389 --> 00:17:09.989 CUDA is some minor syntax extensions to C++, and some functions. 117 00:17:09.989 --> 00:17:18.388 And before, we've seen functions like memory copy and so on, allocate managed data; this one here does an atomic add. 118 00:17:18.388 --> 00:17:29.788 So, we give it the address that we want to add to, and how much we want to add to it. And we've got atomic operations: add, subtract, increment, decrement, min and max, swap. 119 00:17:29.788 --> 00:17:34.769 Um, compare-and-swap is a useful instruction, and also, 120 00:17:34.769 --> 00:17:40.769 okay, with compare-and-swap, you can use that sometimes as, like, 121 00:17:40.769 --> 00:17:46.679 a standard thing that is used to build the other atomic operations. In any case, atomic add, 122 00:17:48.838 --> 00:17:52.949 and it returns the old value. 123 00:17:54.388 --> 00:18:00.328 And there are others; there's 32-bit, 64-bit. This is a point here 124 00:18:00.328 --> 00:18:08.519 which I didn't understand at first: in parallel programming, if you're working with long integers, or long long, whatever 64-bit is called, 125 00:18:10.048 --> 00:18:22.019 you cannot assume that the 64-bit integer is actually being handled atomically. I mean, even reading it might take two cycles, depending on the hardware, depending on the generation. So 126 00:18:22.019 --> 00:18:33.509 you have to be careful with the 8-byte variables, because you cannot automatically assume that they're handled atomically; of course, something like this will do that. And floats, 127 00:18:33.509 --> 00:18:37.528 and double-precision floats: 128 00:18:37.528 --> 00:18:46.979 if I were using them in a parallel program, I might be a little skeptical. I would hope that they'd be handled atomically, loads and stores and so on, but 129 00:18:46.979 --> 00:18:51.479 I wouldn't be surprised if they weren't. 130 00:18:51.479 --> 00:18:59.459 Okay, so, kernel program, parallel program, running on the device. Each thread 131 00:18:59.459 --> 00:19:05.909 processes one character. So, input variables: buffer is the input data, 132 00:19:05.909 --> 00:19:13.888 and we know how long it is, and we get a pointer to the histogram. So, this is a common global histogram array. 133 00:19:13.888 --> 00:19:21.179 And by default, these pointers point to variables in the global memory. I mean, there are exceptions, but 134 00:19:21.179 --> 00:19:24.929 if there's nothing else stated, these variables will be in global memory. 135 00:19:24.929 --> 00:19:30.929 Okay, so we get a unique index for that thread from its index in the block and the block's index in the grid. 136 00:19:30.929 --> 00:19:35.969 And we, 137 00:19:35.969 --> 00:19:40.648 there's that character, buffer sub i; we treat it as a char, 138 00:19:40.648 --> 00:19:44.699 and we add it in. I'm sorry, 139 00:19:44.699 --> 00:19:51.778 go back. Yeah. Okay. It's double subscripted here, buffer sub i: 140 00:19:51.778 --> 00:19:59.098 it's the character whose frequency we're recording, and the histogram is here, right, where we're recording it. So we do that, 141 00:19:59.098 --> 00:20:02.848 and this happens atomically. So 142 00:20:02.848 --> 00:20:05.999 now we're just keeping track of things here.
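As a sketch of the kind of kernel being walked through here, assuming the 7-bin grouping of letters in fours described in these slides; the names are illustrative, not the slides' exact code:

```cuda
#include <cuda_runtime.h>

// One thread per input character; the histogram lives in global memory.
__global__ void histo_kernel(const char *buffer, long size,
                             unsigned int *histo) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < size) {
        int alpha = buffer[i] - 'a';                // treat the byte as a char
        if (alpha >= 0 && alpha < 26)               // only count real letters
            atomicAdd(&histo[alpha / 4], 1);        // atomic read-modify-write
    }
}
```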
143 00:20:09.443 --> 00:20:22.433 With subscripts like this, again, if there was something funny happening in the program, I would start looking at that, to see if something funny could happen in this double subscript that would cause problems, like 144 00:20:22.739 --> 00:20:27.808 another thread working in the middle. But okay. 145 00:20:28.888 --> 00:20:39.868 And nothing interesting happening here; we're just doing reasonable checks and so on, that they're credible letters. 146 00:20:40.919 --> 00:20:44.759 Okay. 147 00:20:44.759 --> 00:20:48.118 Now. 148 00:20:55.919 --> 00:21:00.388 So. 149 00:21:00.388 --> 00:21:03.898 Okay. 150 00:21:06.479 --> 00:21:10.019 Silence. 151 00:21:11.729 --> 00:21:20.159 So, there's latency of this cache and stuff like that. And again, this cache is 152 00:21:20.159 --> 00:21:26.818 why coalescing works, because it's getting 128 bytes of the global memory into the cache. 153 00:21:26.818 --> 00:21:36.838 The global memory, again, has perhaps a couple of hundred cycles latency, but then it becomes fast: the latency is high, but the throughput is also high. 154 00:21:36.838 --> 00:21:43.709 And to remind you again, the shared memory is a small amount of fast memory available to each block: 155 00:21:43.709 --> 00:21:48.058 64 Kbytes give or take, 48 Kbytes, whatever. 156 00:21:49.108 --> 00:21:56.638 Oh, the cache: hardware-wise, the cache and the shared memory might actually be 157 00:21:56.638 --> 00:22:00.778 the same memory, sort of partitioned to one or the other. 158 00:22:00.778 --> 00:22:06.298 If it's shared memory, it's directly visible to the threads. If it's cache, 159 00:22:06.298 --> 00:22:13.318 it's not directly visible to the threads and is handled by the cache manager. And to some extent, I mean, it 160 00:22:13.318 --> 00:22:20.338 might be coming from a 64 Kbyte total, which might be partitioned between the cache and the shared memory. 161 00:22:21.868 --> 00:22:32.278 Okay, so we've got our DRAM up here. It's in banks; we're not talking about that. SMs, those are just streaming multiprocessors. 162 00:22:32.278 --> 00:22:39.719 And the ports, we're not talking about, really. So, 163 00:22:39.719 --> 00:22:45.028 in any case, the global latency here is several hundred cycles, actually, 164 00:22:45.028 --> 00:22:50.999 to read, and again to write. So that's an incredible slowdown on the system 165 00:22:52.798 --> 00:22:58.469 if we don't use these various techniques. 166 00:23:00.719 --> 00:23:11.848 So, we do a few hundred cycles for the read, a few hundred cycles for the write, and meanwhile that whole chunk of the memory is locked. 167 00:23:11.848 --> 00:23:17.429 Now, how big the chunk is, I'm not certain; maybe 128 bytes. But 168 00:23:18.628 --> 00:23:27.689 nothing else can access that. So the answer is correct, but we just took an enormous hit in performance. 169 00:23:27.689 --> 00:23:34.469 The latency, so, and they are saying here, could be a thousand cycles. That's bad. 170 00:23:35.909 --> 00:23:40.828 Right, we cut our bandwidth by a factor of a thousand. 171 00:23:40.828 --> 00:23:44.308 Silence. 172 00:23:45.628 --> 00:23:51.058 Um, so the 173 00:23:51.058 --> 00:23:55.588 cache comes into play here; it'll help things a lot. 174 00:23:55.588 --> 00:24:02.608 The L2 cache, they're saying the latency maybe is only 100 cycles instead of a thousand cycles, 175 00:24:02.608 --> 00:24:09.689 and so that improves the bandwidth a lot. And the other thing is, with the cache being available
176 00:24:09.689 --> 00:24:15.568 to all the threads, again, we have the coalescing things, so we'll get another improvement here. 177 00:24:16.588 --> 00:24:22.739 And the cache, which is invisible to you, is handled automatically, of course. 178 00:24:25.318 --> 00:24:32.848 So, we've got three steps here of improving. If I go back a couple of slides: we have just the 179 00:24:32.848 --> 00:24:39.239 atomic operation on the global memory, thousand-cycle latency or something. 180 00:24:39.239 --> 00:24:52.949 If we take advantage of the cache, we cut the latency down a factor of 10 or something. Now, this is useful to the extent that successive threads can use the data that's already in the cache. 181 00:24:52.949 --> 00:25:01.048 But again, the cache is automatic; remember, you don't actually see it, it's just handled by the memory manager. 182 00:25:01.048 --> 00:25:06.148 The next step is, we explicitly use the shared memory. 183 00:25:06.148 --> 00:25:10.288 So, this is getting similar to the paper I mentioned, 184 00:25:11.453 --> 00:25:12.713 the one I presented with collaborators; 185 00:25:12.713 --> 00:25:15.473 this is on computing visibility on terrain, 186 00:25:15.894 --> 00:25:19.824 and we did our own cache management, 187 00:25:19.824 --> 00:25:26.693 and we beat the virtual memory manager, because we understood the access pattern for the terrain data, 188 00:25:26.969 --> 00:25:35.759 and we knew, for example, when we weren't going to be using a certain block of data anymore, so we could reuse that 189 00:25:35.759 --> 00:25:40.108 cache location, for example. So this is the same thing happening here: 190 00:25:40.108 --> 00:25:45.269 you might say that you're managing the cache yourself, in a way. 191 00:25:45.269 --> 00:25:48.749 This is a shared, this is a shared memory, 192 00:25:48.749 --> 00:25:56.519 private to each thread block, which means all the threads in the block can use it, but you have to modify your program. 193 00:25:56.519 --> 00:25:59.759 But it's going to be very fast. So, 194 00:25:59.759 --> 00:26:04.378 once you get the data into the shared memory, 195 00:26:04.378 --> 00:26:15.179 so, you read the data into the shared memory once, and then all the threads that want that data have access to it fast, like in a cycle or whatever. 196 00:26:16.409 --> 00:26:25.888 So, we have three stages here, no, three stages of optimization. With the atomic 197 00:26:25.888 --> 00:26:29.338 increment instruction going straight to the global memory: 198 00:26:29.338 --> 00:26:35.249 very big latency. Use the cache, which you use if you're coalescing accesses: 199 00:26:35.249 --> 00:26:39.479 better. And if you explicitly cache the data into the shared memory: 200 00:26:39.479 --> 00:26:51.929 better still. 201 00:26:56.009 --> 00:27:02.338 Now, here, we're going to see another 202 00:27:02.338 --> 00:27:06.118 technique for improving things, which is 203 00:27:06.118 --> 00:27:09.179 privatization. Privatizing output means 204 00:27:09.179 --> 00:27:17.489 that instead of incrementing into one global histogram array, we're going to have 205 00:27:17.489 --> 00:27:28.199 several private histogram arrays, and each block of threads will use its private histogram array. And then at the end, we combine them all; we merge them all. 206 00:27:28.199 --> 00:27:34.318 So, the advantage of this will be that we have less contention for the atomic 207 00:27:34.318 --> 00:27:37.949 read-modify-write. Now.
208 00:27:39.269 --> 00:27:51.239 As you're seeing this set of slides: the example that we're using is the histogram, but these techniques, the reason we're spending time on them, is that these techniques are generally useful. 209 00:27:51.239 --> 00:27:54.719 Using, 210 00:27:54.719 --> 00:28:06.449 um, explicitly caching certain data in shared memory: you have to modify your program to do it, but that's a generally valuable technique for parallel programming. And this thing that we're seeing in this slide set 211 00:28:06.449 --> 00:28:14.249 is having private counters that we then merge together; that's another valuable idea. This thing here, 212 00:28:14.249 --> 00:28:21.239 I mean, this is what's implicit in OpenMP and OpenACC when we declare something 213 00:28:21.239 --> 00:28:25.169 as a reduction variable that we're accumulating into. 214 00:28:25.169 --> 00:28:34.739 And what happens is that the compiler, the OpenMP system, will implement this without even telling you: 215 00:28:34.739 --> 00:28:39.719 it will have private counters for each, 216 00:28:39.719 --> 00:28:43.288 for each thread, and then, 217 00:28:43.288 --> 00:28:47.128 so we don't have contention problems, and at the end it will merge them all. 218 00:28:47.128 --> 00:28:56.969 So that's done. So OpenACC, OpenMP, they're higher level than what I'm showing you now; I'm showing you the lower-level version of this. 219 00:28:59.098 --> 00:29:04.288 Okay, so, so what's going to happen here 220 00:29:04.288 --> 00:29:18.749 is that we have atomic updates of whatever, histogram, could be something else. And here they're talking about several blocks, so they've changed the problem slightly, to your particular problem. 221 00:29:18.749 --> 00:29:22.588 Okay, let's say, talking about histograms: let's suppose we have 222 00:29:22.588 --> 00:29:35.459 so many threads that they don't fit in a block, because, again, a block has up to 1024 threads. Suppose for fun we want to have 10,000 threads. You feel a homework problem coming on: 223 00:29:35.459 --> 00:29:42.989 threads are cheap; 10,000, try 100,000 threads maybe, see what happens. Now, 224 00:29:44.068 --> 00:29:50.699 if it's 100,000 threads, that's going to be 100 blocks of 1,000 threads each. Now, 225 00:29:50.699 --> 00:29:59.818 remember that the blocks are running completely independently; the blocks are not synchronized with each other. So if you tried to use this global 226 00:29:59.818 --> 00:30:05.939 atomic increment instruction going in the separate blocks, 227 00:30:05.939 --> 00:30:16.858 you could, but, I don't know, it's just a lot of things stepping on each other's toes. So, what they're talking about here is, each separate block perhaps might have a private histogram array, 228 00:30:16.858 --> 00:30:23.189 and then there are fewer threads potentially accessing the same counter. 229 00:30:23.189 --> 00:30:29.249 But we have multiple copies of the private histogram, and then at the end, we merge them together. 230 00:30:29.249 --> 00:30:41.368 Okay, so the downside is the programming got a little more complicated, and it used a small amount of extra memory, but in this case an insignificant amount, 231 00:30:41.368 --> 00:30:45.148 because histogram arrays are pretty small. And the upside is, it's 232 00:30:45.148 --> 00:30:50.519 faster: less contention and serialization. 233 00:30:50.519 --> 00:30:53.909 Oh, okay. 234 00:30:53.909 --> 00:31:00.118 The final merge here: we probably don't have all that many copies, so that's not going to be a difficult problem. 235 00:31:00.118 --> 00:31:11.368 So the overhead is minor and the benefit is good. The real overhead is that the programming got a touch more complicated. 236 00:31:11.368 --> 00:31:17.969 The final accumulating, merging the final private copies, is going to be really cheap. So. 237 00:31:20.999 --> 00:31:24.328 And they are claiming a 10 times improvement, so.
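For comparison, here is a sketch of how that implicit version looks at the OpenMP level, as just described. Array-section reductions are OpenMP 4.5 and later; the function and names are this example's own, not from the slides:

```cpp
#include <omp.h>

// The reduction clause makes the runtime give each thread a private copy
// of hist and merge the copies at the end -- privatization, done for you.
void histogram_omp(const char *buffer, long size, unsigned int hist[7]) {
    #pragma omp parallel for reduction(+ : hist[0:7])
    for (long i = 0; i < size; ++i) {
        int alpha = buffer[i] - 'a';
        if (alpha >= 0 && alpha < 26)
            hist[alpha / 4] += 1;   // no atomic needed: hist is private here
    }
}
```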
238 00:31:26.398 --> 00:31:35.038 Okay, so again, the shared memory that I mentioned many times: 239 00:31:35.038 --> 00:31:41.848 it's 48K, 64 Kbytes for the block; all the threads in the block have access to it, 240 00:31:41.848 --> 00:31:48.298 and it might take typically one cycle, or a few cycles. Level 2 cache: 10 times slower. 241 00:31:48.298 --> 00:31:51.358 DRAM: a hundred times slower. 242 00:31:51.358 --> 00:31:57.568 Contention, okay. This is a prime example of where shared memory is valuable. 243 00:31:59.098 --> 00:32:02.429 Well, coming back here also: 244 00:32:03.989 --> 00:32:13.138 this is also where we get into optimization issues. You can start playing games of how many threads per block, as I said, 245 00:32:13.138 --> 00:32:23.249 which I mentioned before, but they're valuable optimizing games. You can have up to 1024 threads per block, but you don't have to have that many; you might have 32 threads in the block, then you're going to need more blocks, 246 00:32:23.249 --> 00:32:29.489 but it might actually go faster, because 247 00:32:29.489 --> 00:32:33.808 that's more memory per thread, when you've got fewer threads per block. So, 248 00:32:35.038 --> 00:32:42.479 okay, global: that just means it's a device routine called from the host. 249 00:32:42.479 --> 00:32:47.338 And this, in the device routine, is how you allocate a shared 250 00:32:47.338 --> 00:32:51.298 array. And again, um, 251 00:32:51.298 --> 00:32:55.229 it's going to be visible to all the threads in the block. 252 00:32:55.229 --> 00:32:59.608 If you did not say this, it would be private 253 00:32:59.608 --> 00:33:05.909 to the thread, and would actually be in slow local memory. You say this, it's 254 00:33:05.909 --> 00:33:09.328 private to the block, and it's in fast memory. 255 00:33:12.384 --> 00:33:25.763 Okay, and you've got to throw in a syncthreads, of course, from time to time. So what's happening here, you see, is that they're not assuming that the new variable is zeroed, 256 00:33:26.003 --> 00:33:30.054 so they're explicitly zeroing it. But this is being done in parallel, 257 00:33:30.298 --> 00:33:35.909 so sync it. Okay, the thing is, as I said, the 32 threads in the warp 258 00:33:35.909 --> 00:33:45.719 are going sort of in lockstep, but if there are more than 32 threads involved here, that's several warps, and again, who knows, they may be going serially, one after the other. 259 00:33:47.034 --> 00:34:01.523 Okay, so again, we're inside the, this is the code inside the global routine, and we're doing an atomic increment on the private histogram in shared memory. So. 260 00:34:04.919 --> 00:34:09.208 Okay, and then, um, 261 00:34:10.498 --> 00:34:20.878 so, if I can go back two slides. Okay. Okay. So let's go through this again. So this is the thread program to do the 262 00:34:20.878 --> 00:34:25.228 increment from one character. 263 00:34:25.228 --> 00:34:28.559 We created the private histogram, 264 00:34:28.559 --> 00:34:33.719 we zeroed the private histogram data 265 00:34:33.719 --> 00:34:36.719 in parallel, and then we synchronized
266 00:34:36.719 --> 00:34:43.798 to make sure that it all got zeroed. And remember, we have this shared private histogram variable. 267 00:34:43.798 --> 00:34:49.018 On to the next slide: here's our shared private histogram variable, 268 00:34:49.018 --> 00:34:53.369 and we atomically increment it with, 269 00:34:54.389 --> 00:35:07.378 whatever the character of buffer sub i is, divided by 4, because, again, we're bunching the characters in the input data into groups of 4: a to d, 270 00:35:07.378 --> 00:35:12.568 e to h, and so on, to make the histogram smaller. Okay. 271 00:35:17.668 --> 00:35:20.909 What's happening with plus stride here? 272 00:35:20.909 --> 00:35:26.489 We actually have fewer threads than the, um, 273 00:35:26.489 --> 00:35:33.449 total size of the buffer, perhaps. So each thread will actually count up 274 00:35:33.449 --> 00:35:39.509 many characters in the input data. And this is what we had a couple of slides ago: 275 00:35:39.509 --> 00:35:48.389 the thread is not doing a chunk of adjacent data; it's doing a chunk of characters that are stride apart, stride being the number of 276 00:35:48.389 --> 00:35:51.809 threads. So, the point about this here 277 00:35:51.809 --> 00:36:04.079 is that this coalesces the memory accesses: the fact that this thread is accessing multiple characters that are stride characters apart, not one character apart. 278 00:36:04.079 --> 00:36:10.469 Now, we also, so we have this while loop in the, 279 00:36:10.469 --> 00:36:15.869 you know, in the thread. 280 00:36:15.869 --> 00:36:20.309 And so it might happen that 281 00:36:20.309 --> 00:36:27.809 one thread does the loop more times than another thread, in which case we'll lose the benefits on those extra iterations. 282 00:36:27.809 --> 00:36:34.829 But as long as this condition is true for all the threads, then all 32 threads in the warp are doing this in parallel. 283 00:36:34.829 --> 00:36:40.498 Although, of course, 284 00:36:41.909 --> 00:36:46.079 what happens if one thread has to wait for the atomic add, 285 00:36:46.079 --> 00:36:49.619 because another thread is accessing the same thing? 286 00:36:49.619 --> 00:36:55.349 I'm not completely certain what happens there: do all 32 threads have to wait, or what? So, 287 00:36:55.349 --> 00:37:00.898 that would be a good question. You see the problem I'm thinking about here. 288 00:37:00.898 --> 00:37:13.259 Okay, so the 32 threads in the warp are operating in lockstep, but the time that the atomic add takes, it varies; it's not constant, it's variable. It will take more if another thread 289 00:37:13.259 --> 00:37:16.588 happens to be accessing the address at the same time. 290 00:37:16.588 --> 00:37:20.608 So, that will slow down this instruction here. 291 00:37:20.608 --> 00:37:24.989 Now, does it slow down all the threads in the warp? 292 00:37:24.989 --> 00:37:29.759 I think it does, actually, but I'm not positive. I believe it would. 293 00:37:29.759 --> 00:37:35.699 So, if two threads are updating the same counter, 294 00:37:36.054 --> 00:37:49.974 then the whole warp slows down, which could cause another warp to be running faster, because the two warps don't have to be going in lockstep; just the threads in one warp go in lockstep. I mean, two warps, they could be sequential, perhaps, if there aren't enough resources. 295 00:37:50.514 --> 00:37:55.224 But as I said, this is another reason here you really want to minimize contention, 296 00:37:55.528 --> 00:37:59.099 apart from the latency, if you go out to 297 00:37:59.099 --> 00:38:02.998 the global memory:
298 00:38:02.998 --> 00:38:12.000 the larger fraction of the time you have contention, the greater the probability this atomic add is going to go slower. 299 00:38:12.000 --> 00:38:15.539 So, it's an argument for having multiple private copies, 300 00:38:15.539 --> 00:38:19.590 if people see what I mean. Okay. Okay. So, 301 00:38:19.590 --> 00:38:23.519 what happened on the previous slide is, 302 00:38:23.519 --> 00:38:33.059 going back a couple of slides here, the threads were going through the array and summing counts into their private histogram. 303 00:38:33.059 --> 00:38:36.690 And all the threads, again, 304 00:38:36.690 --> 00:38:40.380 it's the same private histogram variable for 305 00:38:40.380 --> 00:38:43.710 all the threads in the block: shared. 306 00:38:43.710 --> 00:38:49.590 So you actually could have fun here computing the probability of there being, 307 00:38:49.590 --> 00:38:54.449 you know, a slowdown here. What's the probability that two threads would access the same, 308 00:38:54.449 --> 00:38:57.570 you know, buffer item at the same time, 309 00:38:57.570 --> 00:39:02.159 in the histogram? Okay. 310 00:39:03.179 --> 00:39:12.059 Okay, so now we've accumulated all the private counts. But again, remember that the different warps, they're not necessarily in lockstep; 311 00:39:12.059 --> 00:39:18.480 particularly, if there are a lot of threads, they could be sequential: one warp starts after the previous warp finishes. 312 00:39:20.844 --> 00:39:34.525 And then we synchronize: we wait until the slowest warp is finished. As I said, you know, you could have 10,000 threads, maybe; I don't know what the limit is, for all I know, 100,000. But your device, perhaps it has, say, 4,000 CUDA cores, 313 00:39:34.554 --> 00:39:36.025 I think that's what 314 00:39:36.989 --> 00:39:43.320 parallel has; that means only at best 4,000 threads can be running at once. So if you've got 315 00:39:43.320 --> 00:39:54.719 10,000 threads, then, no, you're certainly not running them all at the same time, because the max that can run simultaneously is 4,000, probably less than that, depending on resources. So, 316 00:39:54.719 --> 00:40:01.199 it would be, you can use some of the debugging tools to see the thread, the core utilization. Okay. 317 00:40:02.610 --> 00:40:09.449 In any case, we now wait until the slowest warp has finished accumulating into the private histogram, 318 00:40:09.449 --> 00:40:16.050 and then what we do is, 319 00:40:17.639 --> 00:40:22.260 this is for the threads in the block; now, different blocks are doing other stuff, 320 00:40:22.260 --> 00:40:30.900 of course, so it doesn't wait for the other blocks to finish. Waiting for the other blocks to finish would be a bad idea; it might be a very long wait. 321 00:40:30.900 --> 00:40:40.320 Okay, so now, what we do here is, we have the private histogram for each block, but we had many blocks, perhaps, plus the global histogram. 322 00:40:40.320 --> 00:40:44.460 It's in the, well, it's in the L2 cache, 323 00:40:44.460 --> 00:40:52.440 just automatically; that's what the cache does. And now, what we do is, the first 8 threads, 324 00:40:53.730 --> 00:40:58.289 or the first 7 threads, now go in parallel, 325 00:40:59.639 --> 00:41:04.440 um, indexing, 326 00:41:04.440 --> 00:41:07.559 adding into the global thing here. 327 00:41:07.559 --> 00:41:13.559 So, there's a lot of latency here: we're incrementing into a slow array, 328 00:41:13.559 --> 00:41:19.170 and it has only, you know, not very many elements, like 7. And so 329 00:41:19.170 --> 00:41:22.230 the probability is close to 1 that it is going to be 330 00:41:22.230 --> 00:41:25.829 slowed down, serializing the atomic add. 331 00:41:25.829 --> 00:41:30.570 But that's okay, because you're only doing like 7 operations 332 00:41:30.570 --> 00:41:37.289 per block, times however many blocks. So this thing is going to be really slow, 333 00:41:37.289 --> 00:41:41.070 but that's tolerable, because, 334 00:41:41.070 --> 00:41:44.639 well, it's not very many variables being added.
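Putting the pieces of this walkthrough together, a sketch of the privatized kernel, again assuming the 7-bin letter histogram; names are illustrative, not the slides' exact code:

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 7   // a-d, e-h, i-l, m-p, q-t, u-x, y-z

__global__ void histo_private_kernel(const char *buffer, long size,
                                     unsigned int *histo) {
    __shared__ unsigned int histo_s[NUM_BINS];  // one private copy per block

    // Zero the private histogram in parallel, then wait for every warp.
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
        histo_s[bin] = 0;
    __syncthreads();

    // Interleaved (strided) loop: thread i handles elements i, i+stride, ...
    // so consecutive threads read consecutive characters -- coalesced.
    long i = blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;  // total number of threads
    while (i < size) {
        int alpha = buffer[i] - 'a';
        if (alpha >= 0 && alpha < 26)
            atomicAdd(&histo_s[alpha / 4], 1);   // contention only in-block
        i += stride;
    }
    __syncthreads();                             // wait for the slowest warp

    // The first NUM_BINS threads merge the private copy into the global one.
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
        atomicAdd(&histo[bin], histo_s[bin]);
}
```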
335 00:41:47.789 --> 00:41:58.500 In summary: this, what they call privatization, is local caches for each thread block, or, if you're doing this in OpenACC, in each 336 00:41:58.500 --> 00:42:06.900 process, each thread. And it's only certain operations that can be done that way, but the obvious operations. 337 00:42:06.900 --> 00:42:10.829 Okay, and you've got to fit into shared memory. 338 00:42:10.829 --> 00:42:16.380 Okay, this last paragraph here: 339 00:42:16.380 --> 00:42:19.500 if you can't fit the histogram into 340 00:42:19.500 --> 00:42:25.469 the shared memory, now you have to 341 00:42:25.469 --> 00:42:31.585 escalate your techniques. Now, I have a programming style, I do parallel geometry algorithms, 342 00:42:31.585 --> 00:42:41.813 that's my research, and I could be doing a histogram where there's a million buckets, let's say. And if it's a million buckets, then we have to go to other techniques, and 343 00:42:42.090 --> 00:42:45.690 we'll see something later on; they talk about other techniques, 344 00:42:45.690 --> 00:42:50.639 for when the histogram is too large to fit into this small fast memory. 345 00:42:51.750 --> 00:42:55.199 Okay, so privatization: 346 00:42:55.199 --> 00:43:03.150 this is a technique, and it's only for parallel programs. It's not an interesting technique for sequential programs, obviously. 347 00:43:03.150 --> 00:43:06.360 Okay, so. 348 00:43:07.440 --> 00:43:14.969 That was 7.5. 349 00:43:14.969 --> 00:43:20.429 Hey. 350 00:43:20.429 --> 00:43:23.760 Okay. 351 00:43:23.760 --> 00:43:29.250 So this set of modules 352 00:43:29.250 --> 00:43:37.590 is showing us some paradigms of parallel programming. So a paradigm is a programming technique which is useful; 353 00:43:37.590 --> 00:43:41.309 it's not explicit in the programming language. 354 00:43:41.309 --> 00:43:48.329 There's nothing in which, privatization is not part of the standard; 355 00:43:48.329 --> 00:43:56.340 it's a technique, but it's a technique which is useful. You put it into your toolkit of useful parallel techniques. 356 00:43:56.340 --> 00:44:06.269 We're also going to see some others, stencils, and we're going to see some other ideas for making efficient parallel programs. 357 00:44:08.369 --> 00:44:11.909 Okay, you all know what a convolution is, I think, 358 00:44:13.110 --> 00:44:18.030 also going to see an idea called a stencil, or whatever. 359 00:44:20.760 --> 00:44:23.760 Convolutions smooth stuff, sharpen stuff. 360 00:44:23.760 --> 00:44:28.829 If somebody does not know that, then you can tell me and I'll give a quick 361 00:44:30.599 --> 00:44:35.250 thing. But they do low-pass filtering for smoothing, and high-pass filtering 362 00:44:35.250 --> 00:44:42.510 to amplify, to look for objects. Like, just as an aside: 363 00:44:43.739 --> 00:44:56.610 the retina of your eye does a convolution. And, you know, neural nets are important; well, the retina of your eye is a genuine neural net, not a computer simulation.
364 00:44:56.610 --> 00:45:04.949 And the neurons inhibit their neighbors, apparently, so that if one neuron gets a lot of photons 365 00:45:04.949 --> 00:45:09.059 and it's firing a signal, it tends to inhibit its neighboring 366 00:45:09.059 --> 00:45:21.360 neurons from firing. So the retina of your eye, it appears, and I say "it appears" because there's some uncertainty about this, but the retina of your eye does a high-pass filter, 367 00:45:21.360 --> 00:45:25.559 which makes small objects more apparent. 368 00:45:25.559 --> 00:45:30.090 It's called the Mach band effect, after the Austrian physicist Ernst Mach. 369 00:45:30.090 --> 00:45:35.760 Mach is the same guy that calculated, measured, the speed of sound. 370 00:45:35.760 --> 00:45:42.239 He's also the same guy that constructed an experiment to see if the whole universe was rotating. 371 00:45:42.239 --> 00:45:49.409 He did lots of good stuff. In any case, the Mach band effect is relevant for computer graphics, which I also teach. 372 00:45:50.215 --> 00:46:03.505 I thought I was finished with computer graphics, but the department has clawed me back to teach it next fall, because, apparently, I don't know why, a lot of students want to take graphics, so the department tells me, and no one else can teach it, so I'm back into it next fall. 373 00:46:03.954 --> 00:46:05.454 But the Mach band effect: 374 00:46:05.760 --> 00:46:16.380 it makes it harder to compute nice images, because small differences between color intensities will be more apparent 375 00:46:16.380 --> 00:46:28.405 to the viewer than you might hope. So there's this high-pass filter in the retina. In any case, here's the convolution relevance. For example, if you have 8 bits for color parameters, 376 00:46:28.405 --> 00:46:34.315 so, if you do a color ramp going from 0 to 127, to 255, I'm sorry, and it's green, 377 00:46:38.695 --> 00:46:50.304 viewers can sometimes see the edge, the boundary, between adjacent colors, and it changes by one part in 256. So that's the Mach band effect, because your eye appears to amplify that boundary. Now, 378 00:46:52.349 --> 00:46:58.050 why your eye does it: it makes small objects 379 00:46:58.050 --> 00:47:03.510 more visible to you. And the assumption is that 380 00:47:03.510 --> 00:47:11.880 a small object might be something hidden in the bushes that's going to jump out and eat you; not a small object, though, but an edge of a sort-of-hidden 381 00:47:11.880 --> 00:47:15.960 predator. So it's useful if things like that are 382 00:47:15.960 --> 00:47:20.940 more conspicuous. Okay. In any case, coming back to NVIDIA: convolution. 383 00:47:20.940 --> 00:47:26.610 It's a weighted sum of the neighboring elements, 384 00:47:26.610 --> 00:47:29.909 and the mask is the weights. 385 00:47:31.559 --> 00:47:39.210 Now, if you're thinking ahead of me, to where I'm perhaps going to follow, you might think the mask is the sort of thing, it's a constant, 386 00:47:39.210 --> 00:47:45.300 and it's the sort of thing you would put in constant memory, let's say, 387 00:47:45.300 --> 00:47:49.110 because it's the same mask that all the threads would read, 388 00:47:49.110 --> 00:48:01.110 read-only. And so it's going to be, effectively, it will be in the L2 cache, or, if you have space available, you put it into the shared memory, 389 00:48:01.110 --> 00:48:04.889 or what have you; but there's not much of it. 390 00:48:06.150 --> 00:48:10.500 Okay, 1D convolution.
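A sketch of the basic 1D convolution kernel of the kind these slides develop, with mask width 5 as an example; N is the input, M the mask, P the output, and the names are illustrative:

```cuda
#define MASK_WIDTH 5

// Each thread computes one output element; the if test handles the boundary.
__global__ void convolution_1d(const float *N, const float *M, float *P,
                               int width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < width) {
        float val = 0.0f;
        int start = i - MASK_WIDTH / 2;              // leftmost input touched
        for (int j = 0; j < MASK_WIDTH; ++j) {
            if (start + j >= 0 && start + j < width) // skip ghost elements
                val += N[start + j] * M[j];
        }
        P[i] = val;
    }
}
```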
391 00:48:12.059 --> 00:48:19.079 So, I'll go through this fast; if you know what convolution is, you don't have to read it. 392 00:48:21.000 --> 00:48:34.679 These slides, they talk about boundary conditions; done, going through this fast. This is the sort of thing: you write a program that works, you've got to handle special cases. 393 00:48:34.679 --> 00:48:42.269 And my horrible experience is that in some of my programs, perhaps half the total code might be handling special cases. That's life. 394 00:48:42.269 --> 00:48:48.510 You accept it if you want your programs to work. Okay. 395 00:48:49.619 --> 00:48:54.510 Nothing interesting here: how they would do the convolution. Well, 396 00:48:55.889 --> 00:48:58.949 what's interesting here? 397 00:49:01.949 --> 00:49:07.409 Good. 398 00:49:09.869 --> 00:49:21.900 Well, you're looping here. So, all that's happening down here is that the one thread is accessing several values 399 00:49:21.900 --> 00:49:26.789 from the input array, and several values from the mask. 400 00:49:26.789 --> 00:49:31.800 So the data access patterns are a little complicated. 401 00:49:31.800 --> 00:49:36.780 That's the relevant thing there. So, 402 00:49:38.429 --> 00:49:42.840 and the other relevant thing is, this is the boundary condition test, 403 00:49:42.840 --> 00:49:50.369 to stop you going off the edge of the array, if the size of the array is not a multiple of whatever, 404 00:49:50.369 --> 00:49:56.400 the number of threads. Okay, we just talked about that. 405 00:49:56.400 --> 00:50:00.119 Now, 2D gets interesting, 406 00:50:00.119 --> 00:50:08.699 because, again, your data access patterns are not so totally local, because you're not just going left and right; you're going up and down. 407 00:50:08.699 --> 00:50:15.480 Now, there's something on the device that we haven't talked about much, but there is 408 00:50:15.480 --> 00:50:28.440 a texture cache, which you can put stuff into, for read-only 2D arrays, which are going to have 2D sorts of accessing patterns. And in the texture cache, 409 00:50:28.440 --> 00:50:38.760 the data is laid out in a space-filling-curve pattern, a Peano curve or something, and the idea is to try to minimize the average access time. 410 00:50:38.760 --> 00:50:44.070 The documentation doesn't talk about that too much, but I'm just mentioning it to you for your interest. 411 00:50:44.070 --> 00:50:49.170 If you wanted to learn more, you could dig deep into NVIDIA white papers or something. 412 00:50:51.719 --> 00:50:56.219 Ghost cells are just off the edge of the array. You can just 413 00:50:56.219 --> 00:51:01.650 put zeroes in there or something; then your code will get simpler, although it takes more data. 414 00:51:01.650 --> 00:51:16.199 I'm skipping, see, I don't actually find this boundary condition interesting, because I know how to handle it. I know two different ways to handle it: you put conditions in your program, or you do the padding here, right? I call it padding there, right? 415 00:51:16.199 --> 00:51:21.510 So, I can do that. That was 8.1. 416 00:51:25.679 --> 00:51:32.820 Silence. 417 00:51:36.300 --> 00:51:43.590 Okay, go through this. So, tiling, again, is: we chunk up the input array into blocks, 418 00:51:43.590 --> 00:51:53.070 and then, although you may have to read each block in several times, once you read it in, it gets accessed more, and on the average it reduces total reading time. 419 00:51:53.070 --> 00:51:58.050 I just gave you the summary of the slides; I'll talk through them fast.
420 00:51:58.050 --> 00:52:09.659 In the one-dimensional case, we read a block in; N is the input array and P is the output, right, right. We read it in, and 421 00:52:09.659 --> 00:52:17.760 once we read this chunk of data into our local cache, shared memory, we can use these elements several times. 422 00:52:17.760 --> 00:52:21.570 And you could even imagine a sliding window of some sort. 423 00:52:21.570 --> 00:52:26.429 Okay, output tile. 424 00:52:28.139 --> 00:52:38.489 Okay, there is something new here, when you're doing things like convolution. The programs that we saw so far took one output, 425 00:52:38.489 --> 00:52:43.530 er, computed one output element, and iterated over 426 00:52:43.530 --> 00:52:51.030 the input arrays; for each output element, for the next output element, they iterate through the input arrays again. 427 00:52:51.030 --> 00:52:54.750 There is a complementary way, where instead of 428 00:52:54.750 --> 00:52:59.190 doing each output element once, you do each input element once. 429 00:52:59.190 --> 00:53:12.329 And so you read an input element, and it affects several output elements, and so you access the output elements several times, but each input element only once. Two complementary ways of looking at things. 430 00:53:12.329 --> 00:53:21.750 I think that second one's not as good, perhaps, because you're distributing the writes more, and I'd rather distribute the reads more, reads being faster than writes, maybe. 431 00:53:21.750 --> 00:53:29.880 In any case, they're talking about, you chunk up the output; you could sort of combine the two ideas. You 432 00:53:29.880 --> 00:53:39.869 partition the output array, assuming it's very big. If it's an image, it could well be very big, a few million pixels. So that, 433 00:53:41.460 --> 00:53:51.059 so, you work with the thread blocks, and you partition the output array into chunks called tiles, and each thread block tries to process one tile. 434 00:53:52.289 --> 00:53:55.320 And that's what they're talking about here. So, 435 00:53:55.320 --> 00:54:03.150 the general lesson from the slide set is that you might work with the 436 00:54:03.150 --> 00:54:09.539 thread block, because what happens inside a thread block 437 00:54:09.539 --> 00:54:18.090 is a lot easier to handle than stuff that happens between multiple thread blocks. Within the one thread block, you have the 438 00:54:18.090 --> 00:54:29.159 shared array, and you also have a chance of synchronizing threads, and, if you're lucky, the warps in the thread block may run sort of at the same time, 439 00:54:29.159 --> 00:54:42.420 may even run at the same time if you're lucky. So if you can do a computation inside one thread block, you've got these winning features. Once the computation has to spill over into several thread blocks, 440 00:54:42.420 --> 00:54:54.900 then you lose something. Well, you've got scalability, you've got lots and lots and lots of thread blocks, but you lose this shared idea. So you can work with this idea and figure out 441 00:54:54.900 --> 00:54:58.710 the trade-offs of what you do inside a thread block and what you do, 442 00:54:58.710 --> 00:55:02.550 you know, in multiple thread blocks. It would be an optimization thing. 443 00:55:02.550 --> 00:55:07.139 And again, NVIDIA has this spreadsheet you can play with that tries to help you. 444 00:55:08.309 --> 00:55:14.250 Input tiles, again, nothing interesting there: stuff gets partitioned into tiles. 445 00:55:15.449 --> 00:55:22.769 Um, so here's the game again: different design options.
446 00:55:22.769 --> 00:55:26.369 So you have a thread block, 447 00:55:26.369 --> 00:55:32.400 um, so you have an output tile per thread block, or an input tile per thread block, 448 00:55:32.400 --> 00:55:37.170 and they talk about the various trade-offs here. If, 449 00:55:37.170 --> 00:55:42.179 if you have a thread block per output tile, 450 00:55:42.179 --> 00:55:47.369 then you have to read the data more than once. 451 00:55:47.369 --> 00:55:52.500 If you have a thread block per input tile, 452 00:55:52.500 --> 00:55:55.920 then you may not use all the threads, 453 00:55:55.920 --> 00:55:59.880 um, because of these boundary condition things, 454 00:55:59.880 --> 00:56:08.550 but it may be faster. So, the general lesson from this slide set is, there are trade-offs you want to think about, if you care about optimizing. 455 00:56:08.550 --> 00:56:15.659 It depends; you may not care about fine-tuned optimizing. But if it's a big program and you do, you think about how you map 456 00:56:15.659 --> 00:56:19.199 your data onto the thread blocks. So, 457 00:56:20.309 --> 00:56:29.820 we get through all of this; I just gave you the content. And again, what I've said many times: your block size 458 00:56:29.820 --> 00:56:33.239 is something you design to optimize. 459 00:56:33.239 --> 00:56:44.820 Um, the relevance here: the shared memory, again, you try to load data into the shared memory in a way that it's used several times once you load it in. 460 00:56:45.324 --> 00:56:57.414 So, what they're talking about here, this is what I call padding; they call them ghost cells. So you just pad an array with zeros at the end. 461 00:56:57.775 --> 00:57:00.833 What this does is, it makes your code simpler: 462 00:57:01.110 --> 00:57:07.949 okay, you don't have to test for going off the end of the array, because you won't go off the end, because the end just got made wider, 463 00:57:07.949 --> 00:57:16.500 padded with zeros. So it's a trade-off again: you use a small amount of extra storage, which I think is not worth worrying about, but 464 00:57:16.500 --> 00:57:24.269 the code got faster. The other thing, also, again: in a warp, all the threads have got to do the same thing. So doing special-case testing 465 00:57:24.269 --> 00:57:28.139 in the warp, 466 00:57:28.139 --> 00:57:35.280 well, certain things will serialize; but if you're adding these zeros, it didn't serialize, though some of it was pointless work. That's probably a wash. 467 00:57:35.280 --> 00:57:38.639 So, okay. 468 00:57:46.110 --> 00:57:54.329 I'll go through this one fast. We haven't really talked about what a stencil is, but that's sort of 469 00:57:54.329 --> 00:58:02.699 the shape of the convolution kernel; the idea of a stencil is going to come up. And so, what's going to happen here: 470 00:58:03.840 --> 00:58:07.769 okay, use the cache; 471 00:58:07.769 --> 00:58:22.110 you think of how you lay out your data, when your data is 2D but ultimately the memory is really 1D; you think of how stuff gets laid out; again, you think about how, 472 00:58:23.250 --> 00:58:30.059 you know, you map the threads to the data. Boundary conditions, I don't think, are interesting. 473 00:58:30.059 --> 00:58:33.869 Well, this is very interesting: 474 00:58:33.869 --> 00:58:42.360 you know, how do you parallelize the program? So, threads to data is very interesting. 475 00:58:42.360 --> 00:58:45.389 How do you map your 2D data to 1D 476 00:58:45.389 --> 00:58:50.489 memory is also interesting. And how you use the cache is interesting.
477 00:58:50.489 --> 00:58:53.760 So those are the interesting parts here. 478 00:59:00.000 --> 00:59:03.690 Okay, what's happening here? 479 00:59:03.690 --> 00:59:09.210 You've got your cache; the cache bursts are 128 bytes, 480 00:59:09.210 --> 00:59:12.869 and they start at multiples of 128. So 481 00:59:14.489 --> 00:59:23.489 you want arrays in global memory to start at multiples of 128; you pad them or something to make that happen. 482 00:59:23.489 --> 00:59:29.309 If you do not do this thing here, something may take 483 00:59:29.755 --> 00:59:41.784 two bursts instead of one burst; so not doing this might double your reading time, actually, because it had to go out to the global memory twice instead of once. 484 00:59:41.784 --> 00:59:45.204 So you perhaps want to align the start of your rows, 485 00:59:45.480 --> 00:59:50.130 keep things aligned. Okay. 486 00:59:51.179 --> 00:59:56.610 And you may find that when you allocate the matrix, they do that for you. 487 00:59:57.780 --> 01:00:03.480 Okay, so if it's 3 by 3, pad it to 4 by 4. 488 01:00:03.480 --> 01:00:07.289 Okay, programming techniques: 489 01:00:07.289 --> 01:00:13.860 nothing interesting there; things are padded. 490 01:00:13.860 --> 01:00:17.400 Um, 491 01:00:17.400 --> 01:00:23.820 nothing interesting there. Well, you've got a structure for your image; 492 01:00:23.820 --> 01:00:27.780 it is storing assorted things, whatever. 493 01:00:29.280 --> 01:00:32.909 Silence. 494 01:00:32.909 --> 01:00:41.639 If I back up a little: you're probably using some package-management software that does this for you, and 495 01:00:41.639 --> 01:00:48.119 you should probably, possibly, not be creating classes like that if you can find them already created. 496 01:00:48.119 --> 01:00:56.820 Okay. Um, but 497 01:00:56.820 --> 01:01:04.590 if you were not lucky enough to have all this stuff created for you, you've got to go and write your class; 498 01:01:04.590 --> 01:01:11.579 that's what they're talking about there. Yeah, so it's provided for you. 499 01:01:11.579 --> 01:01:15.150 Block size again: 500 01:01:17.010 --> 01:01:21.780 pick a block size, now, pick a block size that matches the data. And so, 501 01:01:21.780 --> 01:01:33.539 again, all this is doing is padding the tile out to a multiple of 32 or something; nothing interesting there. Okay, what's happening here is they're saying that the, 502 01:01:35.994 --> 01:01:50.364 they're saying that the threads in the block are considered to be a two-dimensional array of threads, not just a one-dimensional, right? And I consider this totally syntactic sugar. I don't know why they put it in the standard; it just cruds up the standard, in my view, and you could do it yourself. 503 01:01:50.610 --> 01:01:54.630 And again, the block, the grid is 504 01:01:54.630 --> 01:01:58.289 all the blocks, and again, it's assuming that they are a, 505 01:01:58.289 --> 01:02:08.340 a 3D number, so, a 2D array of blocks in the grid, and I consider that just totally syntactic. The point is, it's 506 01:02:08.340 --> 01:02:13.320 to make the way you index the threads in the block match the rows and columns 507 01:02:13.320 --> 01:02:19.500 in your image. 508 01:02:19.500 --> 01:02:27.360 Okay, finally, uh, talking about constant memory. And it's not very much; it's 48 Kbytes or something, 509 01:02:27.360 --> 01:02:37.980 depends on the generation of NVIDIA hardware, but they don't actually change it over time. For some reason, NVIDIA does not make this cache larger as they get to the newer and fancier 510 01:02:37.980 --> 01:02:49.380 generations. What that suggests to me is that the cache is very expensive to implement in hardware, 511 01:02:49.380 --> 01:02:52.440 so they don't make it bigger. Okay.
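A sketch of what declaring the mask in constant memory looks like. __constant__ and cudaMemcpyToSymbol are the actual CUDA mechanisms; the symbol names M and MASK_WIDTH are this example's own:

```cuda
#include <cuda_runtime.h>

#define MASK_WIDTH 5

__constant__ float M[MASK_WIDTH];  // read-only to kernels, cached aggressively

// Copy the host-side mask into the constant-memory symbol once, up front;
// kernels then read M directly instead of taking the mask as an argument.
void setup_mask(const float *h_mask) {
    cudaMemcpyToSymbol(M, h_mask, MASK_WIDTH * sizeof(float));
}
```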
512 01:02:52.440 --> 01:02:58.019 But it's global: all the threads can read the constant memory. 513 01:02:58.019 --> 01:03:02.309 The mask in the constant memory is cached aggressively. 514 01:03:02.309 --> 01:03:16.050 Okay, so this is the thing: they've got special hardware, so that if multiple threads are reading the same cached value, it all happens in the same cycle. 515 01:03:16.050 --> 01:03:21.389 So, they've got hardware to do that. So, 516 01:03:21.389 --> 01:03:24.480 effectively, 517 01:03:24.480 --> 01:03:28.019 fast memory bandwidth. Okay. Um, 518 01:03:29.610 --> 01:03:36.900 and, okay, so you can tell the compiler that some array 519 01:03:36.900 --> 01:03:46.110 is eligible for constant caching: it's going to be read-only, you define this constant, so the compiler knows that. And, 520 01:03:48.869 --> 01:03:52.739 and so it's something which is worthwhile 521 01:03:52.739 --> 01:04:03.719 trying to put in the constant cache. You have to tell the compiler, because just because something's constant doesn't mean you want to put it in the cache. Why you'd want to put it in the cache is, 522 01:04:03.719 --> 01:04:08.730 all of the threads are going to be wanting to read the same data again and again and again and again, 523 01:04:08.730 --> 01:04:12.360 and the compiler may not be able to determine that. 524 01:04:12.360 --> 01:04:18.030 So, you tell the compiler. Okay. Um, 525 01:04:18.030 --> 01:04:21.809 and this is cutting things up, with this padding here. 526 01:04:23.159 --> 01:04:27.840 Ignore it. 527 01:04:27.840 --> 01:04:35.610 Boundaries: ignore the boundaries; threads off at the edge don't always participate. Yeah, that's 528 01:04:35.610 --> 01:04:41.969 not interesting. I mean, you've got to handle it, but it's not interesting intellectually. 529 01:04:41.969 --> 01:04:46.500 We saw this before: three kernels, three channels, for an image. 530 01:04:46.500 --> 01:04:52.289 Um, so, 531 01:04:52.289 --> 01:04:59.579 yeah, I would not say do it this way, because this is, 532 01:04:59.579 --> 01:05:13.590 putting red, green, blue for each pixel adjacent, this is creating an array of structures, which is bad, because the red channel for adjacent pixels is not adjacent; it's got the green and blue in between. So, 533 01:05:13.590 --> 01:05:18.840 yeah, I actually have a complaint about this: as I said, it'll work, but it's not optimal. 534 01:05:18.840 --> 01:05:22.349 Okay. 535 01:05:26.579 --> 01:05:30.239 I warned you, I do lots of slides today. 536 01:05:38.550 --> 01:05:45.900 Okay, now, the general lesson from the slides: 537 01:05:45.900 --> 01:05:49.650 it's, if you want to write efficient 538 01:05:49.650 --> 01:05:56.340 programs, you have to understand how your algorithm accesses the data, 539 01:05:56.340 --> 01:06:08.639 and the access patterns. And this is something that I hit in my research: there is a really strong argument for keeping things simple. 540 01:06:08.639 --> 01:06:12.179 You want to have simple, regular 541 01:06:12.179 --> 01:06:21.119 data structures, accessed regularly, and this means that it works with the hardware, and it will parallelize. 542 01:06:21.119 --> 01:06:30.840 So, what this means is, you want to forget all these complicated, beautiful things you learned in CS 1, because they do not parallelize. You do not want
543 01:06:30.840 --> 01:06:34.769 You do not want trees, you do not want pointers, you do not want recursion.
544 01:06:37.284 --> 01:06:48.864 I mean, everything has exceptions, I'll give you that, but you want to try to minimize the use of things like trees and pointers and recursion. Because if you've got a pointer.
545 01:06:48.864 --> 01:07:00.505 The threads have got to chase the pointer, and they're chasing the pointer to some random place in memory, and bang, any concept of locality is gone. It kills your parallel performance. It kills your performance because you've got that.
546 01:07:01.349 --> 01:07:05.940 Global memory, which has that latency of hundreds of cycles, maybe.
547 01:07:05.940 --> 01:07:20.070 And the cache will be useless, because each separate pointer will go to some random place. So you really want to think about keeping stuff together, so the cache will be useful, so your shared memory can be used. So you want.
548 01:07:21.144 --> 01:07:31.074 You want simple data structures. The thing with trees is that a tree sprouts out in different directions; it's not uniform. Again, the pointers, the recursion. I know.
549 01:07:31.074 --> 01:07:38.485 I showed you doing the Fibonacci thing with recursion back with OpenACC and OpenMP. You can use recursion.
550 01:07:40.019 --> 01:07:48.869 You can use recursion if it serves the purpose, but don't just automatically do everything with recursion.
551 01:07:49.525 --> 01:07:58.195 I mean, in fact, if you want to do recursion, NVIDIA for a few years now has had something where a kernel can initiate another kernel.
552 01:07:58.195 --> 01:08:13.105 So, if your data has to be really uneven, NVIDIA tries to work with you, actually, so you can do that. So, the kernel is that parallel program that runs on the grid of blocks, as we showed you.
553 01:08:14.664 --> 01:08:28.944 You start your kernel from your host program; you know, you have that routine that's tagged global, which means run it on the device, call it from the host. Well, inside your parallel kernel, you can, in fact.
554 01:08:29.604 --> 01:08:43.435 Create and start another kernel, and it will go in on the device. It will sit in the queue of kernels waiting to run, remember there's a mini operating system, and it will run. So, if you want to do stuff like that, there are tools available. (There's a little sketch of it after this stretch.)
555 01:08:43.435 --> 01:08:50.935 And as the years go on, NVIDIA actually makes these tools better for you, but you won't see them described in these slide sets and so on, because they're fairly new.
556 01:08:51.835 --> 01:09:06.204 Or they existed when these slides were created, but the authors of the slides didn't know about them or something, hadn't learned them yet. In any case. So, what I'm saying again and again, because I find it important: you want to think about reuse patterns, you want to think about uniformity, and so on.
557 01:09:06.510 --> 01:09:12.239 Not everything can be completely uniform, but you can try, and it will be worth it.
558 01:09:12.239 --> 01:09:19.109 So, okay, so these slides that we have, um.
559 01:09:20.729 --> 01:09:25.500 Do convolution again.
560 01:09:27.750 --> 01:09:34.590 Okay, the mask width is 5, and so we add stuff. I'm going to go through this fast.
561 01:09:34.590 --> 01:09:38.369 Um, because it's obvious: here you add 5 elements.
562 01:09:38.369 --> 01:09:48.119 We're not showing the mask; the mask would be M. It's not shown here, but 5 elements of N get added into the output element.
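Here is the kernel-launching-a-kernel idea (dynamic parallelism) as a minimal sketch. The kernel names and sizes are invented; what is real is the device-side <<<...>>> launch, which needs a GPU of compute capability 3.5 or later and relocatable device code.

```cuda
#include <cstdio>

// Child kernel: stands in for whatever per-region work you need.
__global__ void child(int region)
{
    printf("region %d, thread %d\n", region, threadIdx.x);
}

// Parent kernel: each thread may itself launch a kernel; the child
// launches go into the device's queue of kernels waiting to run.
__global__ void parent(int nRegions)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nRegions)
        child<<<1, 32>>>(r);   // device-side launch
}

// Host side: parent<<<1, 8>>>(8);
// Build with something like: nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt
```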
563 01:09:52.020 --> 01:10:06.390 So, what they're talking about here, if I go back (I won't make you seasick flipping back), is that we had a tile of elements of N that we explicitly loaded into the shared memory. And then.
564 01:10:06.390 --> 01:10:15.210 The average element of N that we loaded into shared memory got used several times; in fact, it got used maybe 3 times or something.
565 01:10:15.210 --> 01:10:19.020 So, we explicitly did the tile, and then we.
566 01:10:19.020 --> 01:10:23.069 Got, you know, a bandwidth reduction of a factor of 3, and that's.
567 01:10:23.069 --> 01:10:32.729 Worth it, you know; that's nice. And it's better than what would be done automatically with the L2 cache, let's say.
568 01:10:32.729 --> 01:10:39.810 Because you're loading it into shared memory, and you know when you're finished with a tile; you can free the thing.
569 01:10:39.810 --> 01:10:43.890 There's an advantage also if you do your explicit.
570 01:10:45.779 --> 01:10:54.840 Caching, as we call it, loading into shared memory. Okay, think about how virtual memory or caches work: how do they choose what to eject from the cache?
571 01:10:58.494 --> 01:11:11.184 You know, you need a replacement strategy. Well, the theoretically best method is that you eject the item which is not going to be used for the longest time into the future. So, if you could see into the future.
572 01:11:11.850 --> 01:11:26.579 You'd eject the item which is not going to be used for the longest time. But you cannot see into the future. So, in fact, what you do instead is eject the item that has not been used for the longest amount of time, the one that has been idle the longest. And that's your LRU, least recently used; it's simply a really nice method.
573 01:11:26.579 --> 01:11:34.590 I mean, that's what these cache management things do. But the thing is, if you're managing stuff yourself, you know.
574 01:11:34.590 --> 01:11:40.680 When your data will not be needed again. So you know when you can eject something from your.
575 01:11:40.680 --> 01:11:52.680 Own custom cache, and that's where you get the trade-off, the pay-off rather: if you're putting stuff into the shared memory, you know when you're finished with something, and you can reuse that memory address.
576 01:11:52.680 --> 01:11:56.460 And so optimize your tiling.
577 01:11:56.460 --> 01:11:59.699 Okay, nothing interesting here.
578 01:12:00.960 --> 01:12:04.800 Or here; computing, or here.
579 01:12:06.329 --> 01:12:11.310 Or here: bandwidth reduction.
580 01:12:14.154 --> 01:12:25.645 Well, basically, with a reasonable-size tile, like 256, and a wide mask, basically the reduction is the size of the mask. So mask width equals.
581 01:12:25.645 --> 01:12:37.314 9 means each item you load into shared memory gets used 9 times, and practically a factor of 9 improvement, if the tile is large enough that the edge conditions don't.
582 01:12:40.079 --> 01:12:48.149 Bother you. Okay, now 2-D gets a touch interesting, because it's a 2-D chunk of.
583 01:12:48.149 --> 01:12:53.310 Data, but you'll still get a very nice reduction in the.
584 01:12:56.159 --> 01:13:00.239 Bandwidth. Um.
585 01:13:02.430 --> 01:13:16.500 So, again, we can compute what's happening here. I may do this in some more detail Monday, perhaps; if not, well, the executive summary here is: with 2-D.
586 01:13:18.390 --> 01:13:29.670 The reason it's more than the mask width is, remember, that a mask width of 9 means there are 81 things in the mask.
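A minimal sketch of the explicit tile-caching just described: 1-D convolution where each block loads its tile of N, plus a halo, into shared memory. Assumptions: the block size equals the tile size of 256, the mask sits in constant memory, out-of-range elements count as zero, and all names are illustrative.

```cuda
#define MASK_WIDTH 5
#define TILE 256

__constant__ float d_mask[MASK_WIDTH];

// Each block explicitly caches TILE elements of N plus a halo of
// MASK_WIDTH/2 elements on each side; assumes blockDim.x == TILE.
__global__ void tiledConv1D(const float *N, float *P, int n)
{
    __shared__ float tile[TILE + MASK_WIDTH - 1];
    const int r = MASK_WIDTH / 2;                // halo radius
    int i = blockIdx.x * TILE + threadIdx.x;     // global index

    // Load the center element; the first r threads also load the halos.
    tile[threadIdx.x + r] = (i < n) ? N[i] : 0.0f;
    if (threadIdx.x < r) {
        int left  = blockIdx.x * TILE - r + threadIdx.x;
        int right = blockIdx.x * TILE + TILE + threadIdx.x;
        tile[threadIdx.x]            = (left  >= 0) ? N[left]  : 0.0f;
        tile[threadIdx.x + TILE + r] = (right < n)  ? N[right] : 0.0f;
    }
    __syncthreads();   // the whole tile is now in shared memory

    if (i < n) {
        float sum = 0.0f;
        for (int j = 0; j < MASK_WIDTH; j++)
            sum += tile[threadIdx.x + j] * d_mask[j];
        P[i] = sum;
    }
}
```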
Okay, this is, you know, a square. So a mask width of 5.
587 01:13:29.670 --> 01:13:36.029 Means that there are 25 things in the mask, and a mask width of 9 means you store 81 items of the mask.
588 01:13:36.654 --> 01:13:49.135 Each item could get used up to 81 times, so in the limit, as the tile width gets big enough, this would approach 81; with a mask width of 5, you're getting almost 25 here. This gets more efficient as the tile width gets larger. (There's a little worked version of this arithmetic below.)
589 01:13:51.390 --> 01:13:55.979 If only because what they're doing is worrying about the edge effects, the things around the boundary.
590 01:13:57.090 --> 01:14:00.090 Um, so, yeah.
591 01:14:00.090 --> 01:14:06.899 So, you have this mask, and you chop things up; I call it chunking up.
592 01:14:06.899 --> 01:14:10.289 You partition things into these little blocks.
593 01:14:10.289 --> 01:14:22.050 So the point is, they're talking about larger shared memory sizes. Well, yeah: shared memory is very fast, and larger is better. But I notice that NVIDIA.
594 01:14:22.050 --> 01:14:25.859 Does not increase their shared memory size as they go to the new.
595 01:14:25.859 --> 01:14:30.359 Generations.
596 01:14:30.359 --> 01:14:38.609 And so I have to ask myself why, and my guess is that shared memory is very expensive to implement in the hardware.
597 01:14:38.609 --> 01:14:50.729 It's accessible to all of the threads in the block. Also, NVIDIA has not increased the maximum number of threads per block: a block has up to 1024 threads, and this has not increased. So.
598 01:14:50.729 --> 01:14:58.140 I'm guessing that's because it's really expensive to do. And what NVIDIA does.
599 01:14:59.399 --> 01:15:04.319 Is they'll have more streaming multiprocessors on the device.
600 01:15:04.319 --> 01:15:07.890 So the maximum total number of threads will increase.
601 01:15:09.479 --> 01:15:17.550 Though not by very much; it's up to 4000 or something on parallel. I mean, my laptop here has a couple of thousand threads.
602 01:15:17.550 --> 01:15:22.500 Because, as I said, my laptop's almost as expensive as parallel. No, actually Joe said.
603 01:15:22.500 --> 01:15:30.420 It's a very good laptop. But as NVIDIA goes to more powerful hardware, actually, one thing they do is.
604 01:15:30.420 --> 01:15:37.890 They increase the memory speed; NVIDIA decreases the clock period, so the memory is accessed faster.
605 01:15:37.890 --> 01:15:49.619 They increase the bandwidth for getting data out of the memory. They do not increase the number of threads per block. They do not increase the shared memory size.
606 01:15:49.619 --> 01:15:53.760 That's interesting, you know, what they.
607 01:15:53.760 --> 01:15:57.449 Choose to do. And they also do things like.
608 01:15:57.449 --> 01:16:01.229 Well, they'll introduce new data types, like these.
609 01:16:03.090 --> 01:16:13.260 These processors that work with 16-bit floats to do neural nets, because with the data in neural nets you don't notice the lack of precision.
610 01:16:13.944 --> 01:16:25.555 So, they call it FP16, half-precision floats. And they put in these tensor processors that will do a 4 by 4 matrix multiply-add in a cycle or so.
611 01:16:25.555 --> 01:16:29.545 So they spend the gates on adding stuff.
612 01:16:30.239 --> 01:16:36.840 For the neural-net side: the tensor processors, the half-precision floats.
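To pin down that reuse arithmetic: a small worked version for the internal-tile case (ignoring image boundaries), with tile width t and mask width m; the specific numbers are my own example, not from the slides.

```latex
% Reuse factor for 2-D tiled convolution, ignoring image boundaries.
% A block computes a t x t output tile but loads the tile plus its halo,
% (t + m - 1)^2 input elements; each output element reads m^2 inputs.
\[
\text{reuse} \;=\; \frac{t^2\,m^2}{(t+m-1)^2}
\longrightarrow m^2 \quad \text{as } t \gg m .
\]
% Example: m = 5, t = 16 gives 256*25/400 = 16, already well on the way
% to the limit of 25; with m = 9 and a large tile it approaches 81.
```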
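And the half-precision data type he mentions, in toy form: a minimal sketch using CUDA's __half type. The kernel is invented, and the tensor cores themselves are normally driven through libraries such as cuBLAS or the wmma API rather than code like this.

```cuda
#include <cuda_fp16.h>

// Toy elementwise multiply-accumulate in 16-bit floats (FP16).
// __hfma is the half-precision fused multiply-add intrinsic.
__global__ void fmaHalf(const __half *a, const __half *b, __half *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hfma(a[i], b[i], c[i]);   // c = a * b + c
}
```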
613 01:16:37.854 --> 01:16:50.875 They also change the area on the chip devoted to things like double precision versus single precision, because doing double precision fast, that's separate hardware. You could implement double precision in terms of single precision, but you take a factor of.
614 01:16:51.899 --> 01:17:01.914 5 or something in performance, if you do a double-precision operation as a single-precision function call implemented in software; I'm guessing a factor of 5 or something as the hit.
615 01:17:02.125 --> 01:17:09.654 So they decide how much of it they want to do in hardware, and they trade that off from generation to generation. For Maxwell.
616 01:17:09.654 --> 01:17:18.414 They had less floating point, and then they decided they'd gone too far, and in the Pascal generation, which followed Maxwell, they went back to more.
617 01:17:18.659 --> 01:17:22.079 Floating-point operations to improve their performance.
618 01:17:23.850 --> 01:17:28.590 Oh, just as an aside about the performance.
619 01:17:28.590 --> 01:17:32.850 Because this is getting into the teraflop range.
620 01:17:32.850 --> 01:17:43.199 I'm thinking of the initial version of parallel a couple of years ago now; it had an older processor, not the one that you're using now. The older one's still there.
621 01:17:43.199 --> 01:17:52.590 And its floating-point performance was so high that, for our lawyer, it triggered an.
622 01:17:52.590 --> 01:18:00.750 Issue with the International Traffic in Arms Regulations: computers whose performance is above a certain.
623 01:18:00.750 --> 01:18:06.840 Speed are considered to be munitions.
624 01:18:06.840 --> 01:18:20.909 And so the lawyer was asking me who would be using this card. This was a 600-dollar card that you could buy on Amazon. You could also buy it on Amazon in the UK, in fact, for 600 pounds.
625 01:18:20.909 --> 01:18:32.579 You know, for computers, that's how you do the conversion: you change the dollar sign to a pound sign and keep the digits the same. This is not a joke. In any case, so I had to.
626 01:18:32.579 --> 01:18:43.890 Tell our lawyer who would be using the 600-dollar NVIDIA card. It was rather funny, actually, because I was also buying the Intel.
627 01:18:43.890 --> 01:18:47.250 Multicore thing.
628 01:18:47.250 --> 01:18:57.420 Which is a product they've dropped now, and it was actually faster than the NVIDIA, but it wasn't something that the lawyer recognized, so the lawyer didn't query me about that.
629 01:18:57.420 --> 01:19:04.140 Okay, in any case. So partitioning things into the right-size tiles and masks is good.
630 01:19:04.140 --> 01:19:09.569 Okay, that's a reasonable point to stop now. So.
631 01:19:09.569 --> 01:19:15.329 What you saw today was several paradigms.
632 01:19:15.329 --> 01:19:21.060 For writing efficient parallel programs: there were masks.
633 01:19:21.060 --> 01:19:32.189 And so on; there were private copies of output arrays, like histogram arrays. So we saw a couple of techniques which are useful.
634 01:19:32.189 --> 01:19:41.010 For parallel programs. They're not officially part of the language standard, but they're things that were developed that turn out to be useful.
635 01:19:42.689 --> 01:19:49.680 So, I put up a homework. Also, I'm asking you to do some programming: implement the histogram thing on your own in parallel and see.
636 01:19:49.680 --> 01:19:58.770 On multicore, that would be the Intel; on manycore, that would be the NVIDIA. It may not have to be on parallel, in fact, but you have to tell me.
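Since the homework mentions it, here is a minimal sketch of the privatization technique from today: per-block private histogram copies in shared memory, merged into the global histogram at the end. The 256-bin count and all names are made up for illustration; this is the idea, not the assigned solution.

```cuda
#define NBINS 256   // illustrative bin count

__global__ void histogramPrivatized(const unsigned char *data, int n,
                                    unsigned int *histo)
{
    // Private copy of the histogram for this block, in fast shared memory.
    __shared__ unsigned int priv[NBINS];
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        priv[b] = 0;
    __syncthreads();

    // Each thread strides over the input; contention for a bin is now
    // only among the threads of one block, not the whole grid.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&priv[data[i]], 1u);
    __syncthreads();

    // Merge the private copy into the global histogram.
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        atomicAdd(&histo[b], priv[b]);
}
```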
637 01:19:59.850 --> 01:20:03.840 Other than that, have a good weekend.
638 01:20:03.840 --> 01:20:12.989 Not part of the class, but I'm enjoying my solar panels up on the roof, because on sunny days like this.
639 01:20:12.989 --> 01:20:19.529 Um, how is it going, the panels versus the house? So several days this week I've actually.
640 01:20:19.529 --> 01:20:29.520 Generated more electricity than the house uses. So, Monday, no, because Monday was cloudy. So come the summer, this could be interesting; I'll keep you in tune with how it's going.
641 01:20:29.520 --> 01:20:32.850 Okay, so.
642 01:20:32.850 --> 01:20:36.750 If there's no question.
643 01:20:37.829 --> 01:20:41.819 If not.
644 01:20:41.819 --> 01:20:45.989 Good day.