WEBVTT 1 00:06:34.499 --> 00:06:45.569 Silence. 2 00:06:45.569 --> 00:06:49.559 Silence. 3 00:06:51.449 --> 00:06:54.988 Silence. 4 00:06:56.428 --> 00:06:59.639 Silence. 5 00:06:59.639 --> 00:07:12.569 Silence. 6 00:07:14.309 --> 00:07:18.149 Silence. 7 00:07:21.478 --> 00:07:26.608 Silence. 8 00:08:00.269 --> 00:08:03.358 Front. 9 00:08:03.358 --> 00:08:09.209 Okay, so good. 10 00:08:09.209 --> 00:08:15.869 Good afternoon, people. This is Parallel Computing. 11 00:08:17.968 --> 00:08:23.488 And class 10, Monday, March 1st. So. 12 00:08:25.588 --> 00:08:29.848 Let's see what we can do to share stuff. 13 00:08:33.208 --> 00:08:38.788 And. 14 00:08:44.489 --> 00:08:48.568 Silence. 15 00:08:50.428 --> 00:08:57.808 So, my usual questions are. 16 00:08:59.308 --> 00:09:02.639 Can you hear me. 17 00:09:02.639 --> 00:09:09.239 And can you see the screen. 18 00:09:11.489 --> 00:09:14.879 Great. Okay. So. 19 00:09:14.879 --> 00:09:20.038 We are continuing on with. 20 00:09:20.038 --> 00:09:23.849 The GPU modules prepared at Illinois. 21 00:09:23.849 --> 00:09:29.399 And just to note where we're starting from here: 3.5. 22 00:09:29.399 --> 00:09:35.009 And see it. 23 00:09:39.719 --> 00:09:44.369 Okay. 24 00:09:44.369 --> 00:09:49.589 And again, we're speed reading through this and. 25 00:09:51.089 --> 00:09:57.298 There, here. Okay. 26 00:09:58.583 --> 00:10:13.014 And again, I'm teaching from specifics, and the goal is that you can infer general principles from the specifics. This is the way I like to teach. So, give you practical stuff. 27 00:10:13.528 --> 00:10:17.428 And, okay. 28 00:10:17.428 --> 00:10:22.859 So, we're seeing some more details about CUDA. 29 00:10:23.938 --> 00:10:29.369 CUDA kernels and so on. Okay, so just to remind you. 30 00:10:29.369 --> 00:10:33.479 What we have here is that. 31 00:10:33.479 --> 00:10:36.839 The device, the GPU, and. 32 00:10:36.839 --> 00:10:40.438 The, um, it has these. 33 00:10:40.438 --> 00:10:44.969 Blocks, and each block has threads. 34 00:10:44.969 --> 00:10:59.428 Okay, and actually it shows the organization at the left here, and then the device might have a number of these sets of blocks, because the device may actually be running several. 35 00:10:59.428 --> 00:11:06.658 Unrelated parallel programs at the same time. So if 2 of you in class start off. 36 00:11:06.658 --> 00:11:13.288 Parallel jobs on the GPU, in theory, if the resources are available, they might run simultaneously. 37 00:11:13.288 --> 00:11:22.708 I wouldn't bet the farm on the security. There's some sort of security protecting the kernels from each other, but how good it is. 38 00:11:22.708 --> 00:11:31.048 I don't know. So I'm also assuming, you know, I would not use the GPU in a hostile environment running, you know. 39 00:11:31.048 --> 00:11:34.499 Programs that you think are going to try to attack the computer. 40 00:11:35.519 --> 00:11:39.749 Now, what this on the right shows here, the timeline. 41 00:11:39.749 --> 00:11:44.009 Is that, um, the blocks. 42 00:11:44.009 --> 00:11:51.239 Again, a block might have up to 1024 threads in it. What the. 43 00:11:51.239 --> 00:11:54.568 What the timeline on the right is showing. 44 00:11:54.568 --> 00:12:02.729 Is that several blocks will run simultaneously, again depending on what hardware resources there are.
45 00:12:03.264 --> 00:12:18.114 And then as blocks complete, the next blocks run, and that just shows you, in this little example, what happens when there are more blocks that want to run than there are resources available to 46 00:12:18.114 --> 00:12:19.644 run them simultaneously. 47 00:12:20.183 --> 00:12:22.254 Then the order that the blocks run in is. 48 00:12:22.528 --> 00:12:27.389 Totally not predictable. Also. 49 00:12:27.389 --> 00:12:35.129 If you have a more expensive GPU, which has more hardware, then more blocks. 50 00:12:35.129 --> 00:12:40.859 Are able to run in parallel, and so your job finishes faster. 51 00:12:42.533 --> 00:12:49.793 Which is a nice way to design things. You can throw more hardware at the problem, but the software stays the same. 52 00:12:50.094 --> 00:13:04.644 I mean, again, I wouldn't go crazy with this concept that the program could run on a whole hierarchy of different GPUs. If you start going for corner cases and so on, yeah, it may not. But in general. 53 00:13:04.889 --> 00:13:12.208 The concept is that the blocks you set up will run on different levels of hardware. 54 00:13:12.208 --> 00:13:18.239 This is traditionally a nice way to design. 55 00:13:18.239 --> 00:13:31.649 A family of hardware. IBM became the biggest computer company in the world in the 1960s by doing something like this. They invented something called the System/360 and they had. 56 00:13:31.649 --> 00:13:37.288 Initially half a dozen different machines in the line, and. 57 00:13:37.288 --> 00:13:42.149 They all ran the same programs, and at that time, that was the big advance. 58 00:13:42.149 --> 00:13:47.099 So, okay, now. 59 00:13:47.099 --> 00:14:01.558 Here's a new idea on this slide that I mentioned briefly. There's something called streaming multiprocessors, and that is, in a high-level sense, the core in the GPU, at a higher level than a separate CUDA core. 60 00:14:01.558 --> 00:14:09.328 And again, a GPU may have a couple of streaming multiprocessors, or. 61 00:14:09.328 --> 00:14:14.249 12 of them or 15 of them or something, and. 62 00:14:14.249 --> 00:14:20.849 So, a streaming multiprocessor, in some versions of the hardware, or something. 63 00:14:20.849 --> 00:14:28.078 Okay, and there are certain resources per streaming multiprocessor, as the slide shows. So. 64 00:14:28.524 --> 00:14:42.923 You can run several blocks simultaneously in one streaming multiprocessor, depending on resources. There are fixed amounts of certain types of resources, like shared memory. And if a block. 65 00:14:43.139 --> 00:14:47.009 If the threads in a block need more, then fewer blocks can run at the same time. 66 00:14:47.009 --> 00:14:57.028 Now, they talk about Fermi here. This is a generation of GPU, and it's several generations old. 67 00:14:57.624 --> 00:14:59.693 Fermi was succeeded by Kepler, 68 00:14:59.693 --> 00:15:10.134 that was succeeded by Maxwell, that was succeeded by Pascal, that was succeeded by Volta, 69 00:15:10.313 --> 00:15:13.344 which is now being succeeded by Ampere, 70 00:15:13.344 --> 00:15:15.774 I think. So it's quite a few generations back. 71 00:15:18.239 --> 00:15:25.198 In any case, the point here is that the streaming multiprocessor can in total run so many threads. 72 00:15:25.198 --> 00:15:31.288 And you can have more threads per block and fewer blocks, or vice versa. So. 73 00:15:31.288 --> 00:15:38.188 A streaming multiprocessor has this mini operating system in it that schedules things. So. 
74 00:15:38.188 --> 00:15:42.749 There's blocks waiting to run, there's warps waiting to run, and so on. 75 00:15:44.068 --> 00:15:49.048 Now, the von Neumann model, after John von Neumann. 76 00:15:49.048 --> 00:15:53.458 You're all familiar with that. 77 00:15:55.073 --> 00:16:09.833 SIMD is single instruction, multiple data stream. You've got the one program counter, and the instruction register can then control multiple ALUs and register files. So if you look at the space layout on the silicon. 78 00:16:10.073 --> 00:16:11.124 Then this will have. 79 00:16:11.369 --> 00:16:16.739 Less space for instruction decoding as a proportion of the total, and more space. 80 00:16:16.739 --> 00:16:25.379 For ALUs and register files. This is a good idea to the extent that your program can take advantage of that. 81 00:16:25.379 --> 00:16:28.558 Okay. 82 00:16:28.558 --> 00:16:37.019 So, the threads are grouped into warps, as I mentioned before, 32 threads in a warp, and this 32 has stayed constant forever with NVIDIA. 83 00:16:37.019 --> 00:16:44.788 So it's not part of the CUDA programming model, well, not formally, but it hasn't changed for 20 years. 84 00:16:44.788 --> 00:16:50.519 So, in any case. 85 00:16:52.229 --> 00:17:00.504 And the streaming multiprocessor then schedules the warps, which are part of the blocks, the thread blocks. 86 00:17:00.504 --> 00:17:15.263 Now, why warps need scheduling: there are other resources available in limited quantity besides registers. One of them is floating point units, single and double, the 2 separate types of units. And there's not enough. 87 00:17:15.509 --> 00:17:22.048 Floating point units for all the threads to simultaneously do a floating point operation. So. 88 00:17:22.048 --> 00:17:27.269 This would be a reason for a warp to wait until an earlier warp had finished. 89 00:17:27.269 --> 00:17:32.459 Example here. 90 00:17:32.459 --> 00:17:39.209 Do the math, so they're assuming the green and the purple warp and. 91 00:17:40.888 --> 00:17:47.939 And so when you do the math and whatever, so okay. And again. 92 00:17:47.939 --> 00:17:58.558 How many get scheduled depends on the resources each warp needs, the resources each thread needs actually, because all the threads in the warp are identical, basically. 93 00:17:58.558 --> 00:18:06.239 There's a footnote there, but so again, more resources per thread means fewer threads can run simultaneously. 94 00:18:07.648 --> 00:18:15.179 This is the thing I mentioned a little before: zero-overhead warp scheduling. 95 00:18:15.179 --> 00:18:21.449 They've got some fancy logic that I believe involves asynchronous logic. 96 00:18:21.449 --> 00:18:25.288 To maintain the. 97 00:18:25.288 --> 00:18:31.318 The group of warps that are waiting to run, and when all the resources are available. 98 00:18:31.318 --> 00:18:34.618 It then picks a warp. 99 00:18:34.618 --> 00:18:41.278 And runs it. So, I don't know the details of how that's done with zero overhead, but. 100 00:18:46.259 --> 00:18:52.108 I assume it works because the queues are not incredibly big, but. 101 00:18:53.278 --> 00:18:56.429 Okay, excuse me. Now. 102 00:18:57.628 --> 00:19:01.288 Again, although it says Fermi here, excuse me. 103 00:19:01.288 --> 00:19:08.878 Nothing here is particular to Fermi; these are all general lessons. That's why I'm showing them to you. 104 00:19:08.878 --> 00:19:12.088 And. 105 00:19:12.088 --> 00:19:24.384 So, the question, and the example, is matrix multiplication. Now, parallel computing people love matrix multiplication as an example, as a test case, for the following reason. 
106 00:19:24.834 --> 00:19:27.804 Matrix multiplication is compute intensive. 107 00:19:28.048 --> 00:19:32.459 If you're multiplying two N by N matrices. 108 00:19:32.459 --> 00:19:37.828 You've got order of N squared data, but you've got order of N cubed. 109 00:19:37.828 --> 00:19:52.798 Computation. There are a lot of programs where the data dominates: you spend more of your time in data transmission than you do in processing. So this is the opposite. It's actually a compute intensive job. 110 00:19:52.798 --> 00:19:56.368 That's 1 reason parallel people like it. 111 00:19:56.368 --> 00:20:00.479 So, what we're going to do for matrix multiplication is. 112 00:20:00.479 --> 00:20:03.538 Chop the matrices up into blocks and. 113 00:20:03.538 --> 00:20:09.028 And assign blocks of the matrices to threads. So then the question is. 114 00:20:09.028 --> 00:20:14.638 How big are the blocks and so on? And the next few slides are talking about that. 115 00:20:14.638 --> 00:20:19.648 Now, NVIDIA provides, well, it's actually a linear programming problem. 116 00:20:19.648 --> 00:20:33.989 Because you've got certain resources available, and you want to optimize, say, processing time. Say you want to minimize processing time, but you have to stay within the limits of the various resources, like registers. 117 00:20:33.989 --> 00:20:38.759 And that sort of thing, and say floating point units and that. 118 00:20:38.759 --> 00:20:44.878 That's what linear programming does: it optimizes some objective function. 119 00:20:44.878 --> 00:20:52.648 While keeping each resource at no more than 100% usage. 120 00:20:52.648 --> 00:20:56.189 Geometrically. 121 00:20:56.189 --> 00:21:05.429 If you've got N different resources that you have to watch, it's an N-dimensional polytope, and you have to find the lowest vertex. 122 00:21:05.429 --> 00:21:10.318 Inside the N-dimensional polytope, actually defined by. 123 00:21:10.318 --> 00:21:16.769 Its faces, not by its vertices, and in high dimensions there can actually be. 124 00:21:16.769 --> 00:21:23.159 Exponentially more vertices than faces, perhaps. So it's a search procedure. 125 00:21:23.159 --> 00:21:31.709 Economists and operations research people love linear programming problems; I may talk more about them later, perhaps. In any case. 126 00:21:31.709 --> 00:21:39.689 So you want to do matrix multiplication fast on the GPU, so you have to decide what size of blocks to. 127 00:21:39.689 --> 00:21:44.759 Chop up the matrix into, and how many threads per thread block and so on. 128 00:21:53.068 --> 00:22:06.659 Silence. 129 00:22:08.308 --> 00:22:09.294 Yeah, okay. 130 00:22:16.074 --> 00:22:25.703 And various resources here, registers, shared memory, and so on. What scope and lifetime mean is, for the different types of memory, the scope is. 131 00:22:27.088 --> 00:22:32.519 Who can see the memory. So a register is visible to only 1 thread. 132 00:22:32.519 --> 00:22:40.138 Very limited, very narrow scope. Global memory is visible to everyone, the broadest possible scope. 133 00:22:40.138 --> 00:22:46.019 And then lifetime also: a register's lifetime might be the one. 134 00:22:47.278 --> 00:22:54.328 You know, the 1 thread, but the global memory's lifetime would be the whole kernel, say. 
135 00:22:54.773 --> 00:22:55.314 Okay, 136 00:22:55.314 --> 00:23:00.473 so this is okay, 137 00:23:00.473 --> 00:23:01.314 so last time, 138 00:23:01.344 --> 00:23:03.953 the last example we saw was the convolution, 139 00:23:04.223 --> 00:23:08.334 we were blurring pixels and so we had a thread per pixel, 140 00:23:08.574 --> 00:23:10.554 but the thread that was computing, 141 00:23:10.673 --> 00:23:15.624 the blurred pixel value had to look at the adjacent pixels itself. 142 00:23:15.624 --> 00:23:15.804 It's. 143 00:23:16.259 --> 00:23:24.179 With a 3 by 3 convolution window, the thread blurring a pixel had to look at the 8 adjacent pixels, for example. 144 00:23:25.439 --> 00:23:31.409 Now, why that potentially is a problem is. 145 00:23:31.409 --> 00:23:41.729 It's going to different places in the global memory, and that sort of thing potentially very badly hurts the performance. It's a legal thing to do. 146 00:23:41.729 --> 00:23:51.269 Any thread can read and write any word in global memory. Well, we probably shouldn't try writing; it can read any word in global memory. 147 00:23:51.269 --> 00:24:02.638 The reason for not writing is that the threads are running asynchronously. So you've got to seriously think about what it means to have threads writing words in global memory, unless of course each thread has a separate private chunk of the global memory. 148 00:24:02.638 --> 00:24:05.729 Which is perfectly fine. Okay. 149 00:24:05.729 --> 00:24:14.459 So what's on the table here is that this convolution program, they call it a blurring kernel, is accessing. 150 00:24:14.459 --> 00:24:17.939 Several, each thread is accessing several pixels. 151 00:24:17.939 --> 00:24:22.199 In the global memory here. 152 00:24:24.209 --> 00:24:33.058 And you see what we have here is this double for loop iterating over the elements of the. 153 00:24:33.058 --> 00:24:36.749 The filter and so on. 154 00:24:38.098 --> 00:24:42.179 Okay. 155 00:24:44.729 --> 00:24:51.028 So, the problem is that, um. 156 00:24:51.384 --> 00:25:05.273 So, they're looking here, so the GPU compute rate is actually a reasonable speed, 1 and a half teraflops in this example, with an old GPU, by the way. 157 00:25:06.479 --> 00:25:11.219 So, say it can do the computations at 1 and a half teraflops. 158 00:25:11.219 --> 00:25:15.838 But the trouble is that the bandwidth to global memory. 159 00:25:15.838 --> 00:25:18.838 Limits that, so. 160 00:25:20.368 --> 00:25:29.219 So this is going to be an I/O-limited program here, and they run through some of the math saying that. 161 00:25:29.219 --> 00:25:35.489 In this particular case, the global memory access will be 200 gigabytes a second, let's say. 162 00:25:36.989 --> 00:25:40.169 So, we're getting 3% of the. 163 00:25:40.169 --> 00:25:44.249 Computation power of the GPU. Now. 164 00:25:44.249 --> 00:25:48.088 That may not be bad. You have to look at the global picture. 165 00:25:48.088 --> 00:25:56.489 In the global context, maybe it's okay if that's all you can do, but you might want to perhaps do better. 166 00:25:57.808 --> 00:26:02.939 And we're going to do better by organizing the data. 167 00:26:03.959 --> 00:26:10.949 So here's the issue here, coming back to the matrix multiplication again. 168 00:26:10.949 --> 00:26:22.888 Now, we're multiplying matrix M times matrix N to make matrix P, for product. So you dot a vector, which is 1 row of M, with a vector, which is 1 column of N. 169 00:26:22.888 --> 00:26:25.919 And then you get 1 element of P.
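To pin that down: assuming square Width-by-Width matrices and the usual zero-based indexing (the names row, col, Width are mine, not necessarily the slide's), the dot product being described is just

P[row][col] = M[row][0]*N[0][col] + M[row][1]*N[1][col] + ... + M[row][Width-1]*N[Width-1][col], i.e. the sum over k from 0 to Width-1 of M[row][k] * N[k][col].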
170 00:26:25.919 --> 00:26:32.038 So now here's the problem: if the matrices are stored in row-major order. 171 00:26:32.038 --> 00:26:38.429 Then accessing a row of M is good. The elements are all contiguous. 172 00:26:38.429 --> 00:26:43.409 The column of N is a different matter, because the elements are all discontiguous. 173 00:26:43.409 --> 00:26:49.739 So, you're going to take a performance hit accessing that column of N. 174 00:26:49.739 --> 00:26:55.169 You're also going to take perhaps a hit on accessing the row of M, because. 175 00:26:55.169 --> 00:27:01.259 You're reading it from the global memory and only using it once. So the theme of this slide set. 176 00:27:01.259 --> 00:27:06.689 Is going to be that if we have to read data from the global memory. 177 00:27:06.689 --> 00:27:10.679 Get as much use out of that data as you can. 178 00:27:10.679 --> 00:27:16.348 You know, maybe use the data more than once if you can. 179 00:27:16.348 --> 00:27:29.638 In any case, here's the basic matrix multiplication, and it takes pointers to the 3 relevant arrays, the matrices, and they're stored in the global memory. 180 00:27:29.638 --> 00:27:35.729 If it's a pointer, it probably, almost always I think, points into the global memory, and. 181 00:27:35.729 --> 00:27:40.679 So, in any case, as usual here, each thread. 182 00:27:40.679 --> 00:27:43.709 Is computing 1 output element. 183 00:27:43.709 --> 00:27:48.028 So, given the thread number and the block number. 184 00:27:48.028 --> 00:27:52.409 We compute which row and column; that's in red here. 185 00:27:52.409 --> 00:27:56.189 And then what we do here. 186 00:27:57.419 --> 00:28:03.058 Is that we. 187 00:28:03.058 --> 00:28:06.719 Okay, this gets to be the slow part here, this loop here. 188 00:28:06.719 --> 00:28:15.419 It's going down and computing 1 output element by taking a whole row of M and the whole column of N and then. 189 00:28:15.419 --> 00:28:22.288 Dotting them, doing the dot product there. So, this here is going to kill your performance, perhaps. 190 00:28:25.499 --> 00:28:28.528 All right, and they're just putting that in red because. 191 00:28:28.528 --> 00:28:39.179 If you went to a video or something, I think this stuff's available as video also, if you want to find someone else describing the same slides here, but perhaps I'm doing it in less time. 192 00:28:39.179 --> 00:28:42.538 Than the video, because I'm hitting just the high points. 193 00:28:44.068 --> 00:28:48.898 Okay, so what they're talking about here. 194 00:28:48.898 --> 00:28:52.499 Is take the output matrix and. 195 00:28:52.499 --> 00:28:56.368 Partition it into blocks; here they are 2 by 2 blocks. 196 00:28:56.368 --> 00:29:00.088 And within 1 block. 197 00:29:00.088 --> 00:29:03.298 Again, the blocks of the matrix could be mapped to. 198 00:29:03.298 --> 00:29:08.098 Threads in a thread block, and we compute within 1 block at a time. 199 00:29:09.209 --> 00:29:14.278 And the goal will be that maybe the data that you have to read from the global memory might get used. 200 00:29:14.278 --> 00:29:19.828 More intensively. 201 00:29:22.078 --> 00:29:27.628 And what they're showing here is this is computing a 2 by 2 block. 202 00:29:27.628 --> 00:29:32.489 Of the output matrix, and so. 203 00:29:32.489 --> 00:29:40.648 You're computing 4 output elements, but we have 2 rows of M and 2 columns of N. 204 00:29:40.648 --> 00:29:47.999 Now, what that means is that each element of M is actually used twice, not once.
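For reference, here is a minimal sketch of the kind of basic, untiled kernel being described: one thread per output element, with the slow dot-product loop reading a whole row of M and a whole column of N from global memory. The names (M, N, P, Width) are my assumptions, not necessarily the slide's exact code.

```cuda
// Naive matrix multiply: each thread computes one element of P = M * N.
__global__ void matMulNaive(const float *M, const float *N, float *P, int Width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float sum = 0.0f;
        // The slow part: Width reads of M and Width reads of N from global
        // memory per output element, each loaded value used exactly once.
        for (int k = 0; k < Width; ++k)
            sum += M[row * Width + k] * N[k * Width + col];
        P[row * Width + col] = sum;
    }
}
```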
205 00:29:47.999 --> 00:30:00.473 On the slide I showed you a few slides back, we would read a row of M and a column of N and dot them, and they would be used once to compute that element of P. Then we'd do another element of P. 206 00:30:00.653 --> 00:30:03.624 Well, here, what we're doing is a 2 by 2 block of P. 207 00:30:04.078 --> 00:30:12.808 So, if we can store these 2 rows of M and these 2 columns of N locally, perhaps, anticipating a little, each. 208 00:30:12.808 --> 00:30:21.568 Element will get used twice instead of once. So we will have halved the requirements on the global memory, and. 209 00:30:23.338 --> 00:30:27.838 At the cost, we will have doubled our use of. 210 00:30:27.838 --> 00:30:31.618 We will double our computation use on the GPU. 211 00:30:32.183 --> 00:30:46.794 Because the limiting thing is reading from the global memory. What that requires is, if we're going to use these 2 rows and these 2 columns twice, that we have to be able to store them locally somehow. So now, I'm sure you're starting to think of. 212 00:30:47.398 --> 00:30:53.638 Trade-offs and stuff like that. So it's a space-time. 213 00:30:53.638 --> 00:30:59.068 Computation time trade-off. Now, where we're going to do that. 214 00:30:59.068 --> 00:31:03.959 Is that we have this local register file available. 215 00:31:03.959 --> 00:31:08.189 So, and again. 216 00:31:08.189 --> 00:31:15.388 The CUDA cores: each thread can have up to 255 words of data. 217 00:31:15.388 --> 00:31:18.538 Maybe 256, I'm not certain. 218 00:31:19.828 --> 00:31:23.699 Words of data that it stores locally that are very fast. 219 00:31:23.699 --> 00:31:32.308 Okay, so here is the hierarchical memory. You've seen most of this before; it has an extra. 220 00:31:32.308 --> 00:31:36.088 It has one extra feature on this. So this is. 221 00:31:36.088 --> 00:31:41.038 A partial version of how the memory is laid out on the device. 222 00:31:41.038 --> 00:31:46.919 Okay, we've got the grid, which is the whole parallel program you're running, sort of. 223 00:31:46.919 --> 00:31:56.969 The grid has some global memory at the bottom, 48 gigabytes on the GPU on parallel. It also has a small amount of constant memory. 224 00:31:56.969 --> 00:32:02.368 Which is fast, which is very small. I think it's like. 225 00:32:02.368 --> 00:32:05.608 Give or take 48 K bytes. 226 00:32:05.608 --> 00:32:08.788 But the thing is that the assumption. 227 00:32:08.788 --> 00:32:18.628 It's fast, low latency. And the assumption is that, well, the requirement is that the same constant memory is visible to all of the threads. 228 00:32:18.628 --> 00:32:24.328 So, if you have something that's read-only and all the threads want to be able to see it. 229 00:32:24.328 --> 00:32:37.199 Put it in constant memory. In any case, the grid is partitioned into thread blocks, and each thread block has an amount of shared memory, fast shared memory, available to all the threads in the thread block, but it's private. 230 00:32:37.199 --> 00:32:41.159 To the block. The block terminates, the shared memory goes away. 231 00:32:41.159 --> 00:32:52.588 Then we've got the block contains, so the yellow block contains the green threads, up to 1024. Well, the grid can contain very many blocks, actually. 232 00:32:52.588 --> 00:32:57.719 A million or so, I think, not positive. Okay. In any case. So. 233 00:32:57.719 --> 00:33:03.749 You have up to the 1024 green threads, and each thread has its private.
234 00:33:03.749 --> 00:33:09.898 Fast registers, and it has access to the same shared memory as the other threads, right? 235 00:33:09.898 --> 00:33:13.169 What's not shown here is. 236 00:33:13.169 --> 00:33:18.148 Is local memory for each thread, which is. 237 00:33:18.148 --> 00:33:31.169 Slow. Each thread has basically a private chunk of the global memory that's called local memory, so it's slow, but it's for overflowing other stuff. That's not shown here. Um. 238 00:33:31.169 --> 00:33:35.429 And whatever, textures and stuff relating to graphics, but. 239 00:33:35.429 --> 00:33:39.028 Okay. 240 00:33:39.028 --> 00:33:43.259 So, now this is showing actually 4 types of memory. 241 00:33:43.259 --> 00:33:49.558 And the program can have. 242 00:33:49.558 --> 00:33:53.038 The, if this is defined inside. 243 00:33:53.038 --> 00:33:56.999 The routine running on the device. 244 00:33:56.999 --> 00:34:00.598 So, for example, in a global routine would be. 245 00:34:00.598 --> 00:34:03.808 An example. So if you just say integer. 246 00:34:03.808 --> 00:34:08.458 It by default goes into, it's stored in, a register. 247 00:34:08.458 --> 00:34:12.838 And it's only visible to that 1 thread. 248 00:34:12.838 --> 00:34:18.778 You can declare something as device shared. 249 00:34:18.778 --> 00:34:22.349 Which means that it's in shared memory, so it's fast. 250 00:34:22.349 --> 00:34:25.829 But again, I forget, 64 K. 251 00:34:25.829 --> 00:34:30.119 Bytes or something, and it's visible to all the threads in the block. 252 00:34:30.119 --> 00:34:39.958 If you just say device, it's in the global memory; everyone on the device can access it. 253 00:34:39.958 --> 00:34:47.338 Its scope, well, the whole grid, the grid is the parallel program, and the lifetime is as long as the parallel program is running. 254 00:34:47.338 --> 00:34:50.818 You can say something is device constant. 255 00:34:50.818 --> 00:34:54.239 So, it's in that small, constant, fast cache. 256 00:34:54.239 --> 00:35:02.338 Okay, the device stuff has the large latency, 100 cycles or whatever; the device constant is very fast. 257 00:35:02.338 --> 00:35:06.838 But it's read-only and everyone sees the same constant. 258 00:35:06.838 --> 00:35:12.298 Cool. But again, I think it's like 48 K bytes or something. 259 00:35:12.298 --> 00:35:17.398 Okay, and again, this doesn't show the local. 260 00:35:17.398 --> 00:35:29.128 Memory, the per-thread local memory that sits in the global memory. So it's a way threads can have more memory. 261 00:35:29.128 --> 00:35:34.739 But it's slow. Question. 262 00:35:34.739 --> 00:35:40.199 Only the host can write to the constant memory? I don't know, I'll have to check on that. So. 263 00:35:41.728 --> 00:35:51.509 It's probable that there's some way for the device to write to the constant memory, but I don't know what that is. So, good question. 264 00:35:51.509 --> 00:35:54.958 Okay, so this. 265 00:35:54.958 --> 00:35:58.048 Let's see, if we go back here, what does shared mean? 266 00:35:58.048 --> 00:36:06.599 It goes into shared memory up here. Okay. So what this does is it puts a tile of data. 267 00:36:06.599 --> 00:36:10.349 Into the shared memory. 268 00:36:10.349 --> 00:36:14.878 So, all the threads in the block can read and write it. So. 269 00:36:14.878 --> 00:36:18.659 Up to the size of the shared memory. This is 1 of your hard constraints.
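As a concrete illustration of those four declaration forms (the lecture's spoken "device shared" and "device constant" correspond to CUDA's __shared__ and __constant__ qualifiers; the names and sizes here are my own sketch, and the launch is assumed to be a single block of 256 threads):

```cuda
__device__ float bigTable[1 << 20];   // global memory: visible to every thread, lives for the whole program
__constant__ float coeffs[64];        // constant memory: small (~48 KB), read-only on the device, cached and fast

__global__ void memorySpacesDemo(float *out)
{
    int i = threadIdx.x;              // plain automatic variable: held in a register, private to this thread
    __shared__ float tile[256];       // shared memory: one copy per block, fast, freed when the block ends

    tile[i] = coeffs[i % 64] * bigTable[i];
    __syncthreads();                  // make sure every thread's element is written before anyone reads it
    out[i] = tile[(i + 1) % 256];     // reading an element another thread wrote is why the barrier is needed
}
```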
270 00:36:18.659 --> 00:36:29.759 But you do this, and, anticipating a little, this will get loaded cooperatively by a lot of threads, actually. So a lot of threads will load the shared. 271 00:36:29.759 --> 00:36:34.289 Array, and then we'll start the threads that will use it. So. 272 00:36:37.079 --> 00:36:42.148 Where to declare variables. 273 00:36:43.798 --> 00:36:47.248 Yeah, so probably the host can access the. 274 00:36:47.248 --> 00:36:53.248 Can write the constant, but I'm not positive. So, where are you defining stuff and so on. 275 00:36:53.248 --> 00:37:02.009 So shared memory again, it's low latency and high throughput. 276 00:37:03.059 --> 00:37:10.438 But it's small, so it's a scratchpad environment; it's implemented by some very expensive hardware. 277 00:37:10.438 --> 00:37:21.478 Which is why, as NVIDIA creates advances, the new generations, they do not increase the size of the shared memory. 278 00:37:21.478 --> 00:37:29.369 Or the number of registers available. What they do is they add more hardware: they create more streaming multiprocessors. 279 00:37:29.369 --> 00:37:34.648 More CUDA threads can run in parallel, but they do not increase the shared memory and registers. 280 00:37:34.648 --> 00:37:42.809 Okay, so you've got shared memory that everyone can get at, and global memory everyone can get at, and so on. So. 281 00:37:42.809 --> 00:37:45.898 Yeah. Okay. Nothing much. 282 00:37:45.898 --> 00:37:49.289 No new content here, and. 283 00:37:49.289 --> 00:37:52.949 Everyone can get at everything, basically, if there were enough arrows here. 284 00:37:55.469 --> 00:37:59.278 Okay, that was. 285 00:38:03.958 --> 00:38:07.978 Let's see. 286 00:38:10.168 --> 00:38:13.260 Can to. 287 00:38:15.449 --> 00:38:19.829 Silence. 288 00:38:19.829 --> 00:38:23.159 Okay, so. 289 00:38:23.159 --> 00:38:34.590 We're getting more into this tiled parallel algorithm idea. And the idea is that we want to read some global memory into the fast. 290 00:38:34.590 --> 00:38:37.920 Cache, you might almost call it, and then use it several times. 291 00:38:37.920 --> 00:38:43.230 Um, so. 292 00:38:43.230 --> 00:38:46.619 Basic matrix multiplication: each thread. 293 00:38:46.619 --> 00:38:52.619 Is accessing data right from global memory, you might say. 294 00:38:52.619 --> 00:39:00.900 But what they're showing here is that different threads are accessing the same global memory, but separately. 295 00:39:02.280 --> 00:39:12.420 In any case, so we propose the red cache. It's a cache; that's what the shared memory is. It is a cache that you explicitly manage. 296 00:39:12.420 --> 00:39:16.079 So. 297 00:39:16.079 --> 00:39:26.460 And so we load the cache with a chunk of the global memory, process it, load it with another chunk of global memory, and process it. 298 00:39:28.469 --> 00:39:32.909 Relating to carpools, so the. 299 00:39:32.909 --> 00:39:39.300 Interesting thing in things like traffic design: there are some paradoxes. 300 00:39:39.300 --> 00:39:45.989 Where closing a road, okay, if every car optimizes. 301 00:39:47.159 --> 00:39:53.039 Its route home, there are cases, there is a paradox, where closing a highway. 302 00:39:53.039 --> 00:39:56.550 Putting a barrier across the highway so no one can take it. 303 00:39:56.550 --> 00:40:09.750 Will increase everyone's speed, will decrease everyone's time to get home. It sounds counterintuitive. I mean, I can draw it for you if you're interested, but it's called the Braess paradox: closing a highway. 
304 00:40:09.750 --> 00:40:13.469 Can increase throughput on the highway system. 305 00:40:13.469 --> 00:40:17.550 If every driver locally optimizes. Crazy, but. 306 00:40:18.659 --> 00:40:29.550 There's also another sort of paradox where, if you have a highway running at capacity, if you take a few random drivers and pull them off the road, tell them to park for an hour. 307 00:40:29.550 --> 00:40:33.210 Then the throughput again increases. 308 00:40:33.210 --> 00:40:45.000 Including averaged over the drivers that you pulled off the road. So they took an hour more to get home, but a lot of other drivers got home fast enough that the average improved. 309 00:40:45.000 --> 00:40:48.210 Counterintuitive things. Okay. 310 00:40:48.210 --> 00:40:53.760 Yeah, you all know what's happening here. So. 311 00:40:53.760 --> 00:40:57.780 Asking for riders, so they can get in the carpool lane. 312 00:40:57.780 --> 00:41:01.050 Nothing interesting here. 313 00:41:01.050 --> 00:41:07.710 The point about these slides is that this. 314 00:41:07.710 --> 00:41:12.690 Caching works only if the different threads want the same data at the same time. 315 00:41:13.980 --> 00:41:17.760 And you've got to synchronize stuff. Okay. 316 00:41:17.760 --> 00:41:21.210 The different threads are taking different times. 317 00:41:21.210 --> 00:41:26.039 It all works, but for that reason, you synchronize occasionally. 318 00:41:26.039 --> 00:41:29.940 Only threads in the same block. 319 00:41:29.940 --> 00:41:34.320 Okay, um, nothing deep here. 320 00:41:34.320 --> 00:41:39.090 You know, identify memory that's accessed by multiple threads, cache it. 321 00:41:39.090 --> 00:41:43.230 Synchronize to make sure that all the data has been loaded. 322 00:41:43.230 --> 00:41:48.659 Process it, synchronize again to make sure that it's all been processed, and then move on. 323 00:41:48.659 --> 00:41:57.059 Okay. 324 00:42:05.070 --> 00:42:15.869 So, what we're going to do is take a strip of several rows of M and several columns of N, as much as will fit into the local shared memory. 325 00:42:15.869 --> 00:42:23.369 And fit it into the local shared memory and compute a block of the matrix of several rows and columns. 326 00:42:23.369 --> 00:42:30.420 And this would, yeah. Okay. So this depends on the size of M and N. If M. 327 00:42:30.420 --> 00:42:35.190 And N are bigger, then you can put fewer rows into the. 328 00:42:35.190 --> 00:42:41.579 Into the shared memory, of course, because you want to put a whole row in, maybe, unless you're doing another level of blocking. 329 00:42:41.579 --> 00:42:45.989 Okay, nothing new there. 330 00:42:45.989 --> 00:42:58.380 Well, there is something here. Instead of putting, okay, this is a new idea, I alluded to it a minute ago, instead of putting several complete rows and several complete columns of N into the cache. 331 00:42:58.380 --> 00:43:02.429 Like caching them in the local shared memory, put blocks. 332 00:43:03.510 --> 00:43:11.099 Partition M and N into blocks, and here you are, the smaller squares, and put several blocks. 333 00:43:12.210 --> 00:43:19.110 Put blocks of M and N into the shared memory. So this concept here scales up no matter how big M and N are. 334 00:43:21.420 --> 00:43:30.300 Now, you'll have to read each block into the shared memory several times, perhaps. In fact, you will, but. 335 00:43:30.300 --> 00:43:34.650 But it still pays off. 336 00:43:36.150 --> 00:43:40.230 So.
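A rough way to see the payoff, under the assumption of square T-by-T tiles: each element of M or N that is brought into shared memory gets used by T different threads (one per row or column of the output tile), so global-memory traffic drops by roughly a factor of T. For example, T = 16 cuts the global loads about 16-fold, at the price of two 16 x 16 float tiles, about 2 KB of shared memory per block.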
337 00:43:40.230 --> 00:43:48.000 And I'm going to skip through this somewhat, but the concept here is, M and N have been partitioned into blocks. 338 00:43:48.000 --> 00:43:56.159 You load a block into shared memory, a 2 by 2 block of M and a 2 by 2 block of N, and then you can, um, compute a 2 by 2 block of. 339 00:43:56.159 --> 00:44:02.789 P, and then you say, maybe keep the block of M and then load a different block of N into memory, perhaps. 340 00:44:02.789 --> 00:44:07.139 It's a partitioned memory allocation thing. 341 00:44:07.139 --> 00:44:19.650 Well, by the way, some of you are aware there are faster ways to multiply matrices. This method here, for N by N matrices, takes order of N cubed. 342 00:44:19.650 --> 00:44:26.429 Multiplications. There are methods like the Strassen method. 343 00:44:26.429 --> 00:44:34.170 What, 40 years, 50 years old? It takes N to the 2.7 operations, asymptotically. 344 00:44:34.914 --> 00:44:47.275 Because it can multiply two 2 by 2 matrices in 7 multiplications instead of 8, which is the sort of thing that was just intuitively assumed to be impossible until Strassen did it. And then it was obvious. 345 00:44:48.324 --> 00:44:54.235 These paradigm shifts. But the trouble with methods like that is they're much more complicated. 346 00:44:54.510 --> 00:44:57.780 So, they do not. 347 00:44:57.780 --> 00:45:05.070 Lend themselves; they're recursive, they're hierarchical, complicated. They don't have regular patterns. 348 00:45:05.070 --> 00:45:12.210 So, they're not so easy to parallelize effectively. So, the basic N cubed method is, um. 349 00:45:13.320 --> 00:45:16.889 It's asymptotically not the best, but it's simple. 350 00:45:16.889 --> 00:45:21.655 And simple is worth a lot, even if it's not the best way asymptotically. 351 00:45:21.864 --> 00:45:32.335 They've bashed that exponent down to N to the 2.3 or 2.4, I think, but the constant factor at the front of that time expression keeps growing as you bash down the exponent. 352 00:45:35.039 --> 00:45:39.329 It's an open problem how small you can make the exponent. So. 353 00:45:40.380 --> 00:45:43.469 Okay, so again, what I'm showing here: you. 354 00:45:43.469 --> 00:45:46.949 You multiply a block of M times a block of N. 355 00:45:48.570 --> 00:45:58.860 Okay, bang, bang, and then you keep the old block of M, you grab a new block of N, and you're computing stuff and so on. 356 00:46:02.519 --> 00:46:07.679 Well, it is still totalling into the same block of P, by the way. 357 00:46:07.679 --> 00:46:10.710 Because a block of P needs the whole column of N. 358 00:46:12.869 --> 00:46:18.329 And then we can add up operations and stuff like that. I'll skip over that. 359 00:46:18.329 --> 00:46:22.139 You've got to do synchronization. 360 00:46:22.139 --> 00:46:32.969 Because again, you've got all the threads in the block working off the same shared memory for the block. But the threads in the block are not necessarily running at the same time. In fact, they're probably not. 361 00:46:32.969 --> 00:46:39.570 And so the warps are not running at the same time. They're scheduled. 362 00:46:39.570 --> 00:46:50.400 Especially because there are fewer floating point processors than there are threads possible. So you have to synchronize to make sure all the threads in the block have completed. So. 363 00:46:52.920 --> 00:46:56.940 Yeah, okay. 364 00:46:56.940 --> 00:47:02.280 Okay, that was, let me get the number here.
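Putting the pieces together, here is a minimal sketch of a tiled kernel of the kind being described, assuming square matrices whose Width is a multiple of TILE_WIDTH (the boundary checks come later in the lecture); the names are mine, not necessarily the slide's exact code:

```cuda
#define TILE_WIDTH 16

__global__ void matMulTiled(const float *M, const float *N, float *P, int Width)
{
    __shared__ float Mtile[TILE_WIDTH][TILE_WIDTH];   // one tile of M, shared by the block
    __shared__ float Ntile[TILE_WIDTH][TILE_WIDTH];   // one tile of N, shared by the block

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    // Walk the pair of tiles across M's row strip and down N's column strip.
    for (int t = 0; t < Width / TILE_WIDTH; ++t) {
        // Cooperative load: each thread brings in one element of each tile.
        Mtile[threadIdx.y][threadIdx.x] = M[row * Width + t * TILE_WIDTH + threadIdx.x];
        Ntile[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();                              // every element loaded before anyone reads it

        for (int k = 0; k < TILE_WIDTH; ++k)          // each loaded element is reused TILE_WIDTH times
            sum += Mtile[threadIdx.y][k] * Ntile[k][threadIdx.x];
        __syncthreads();                              // everyone done reading before the tiles are overwritten
    }
    P[row * Width + col] = sum;                       // one private accumulator, written once; no atomics needed
}
```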
365 00:47:02.280 --> 00:47:08.730 4.3. We're going through a good number of slide sets today. 366 00:47:11.639 --> 00:47:17.219 4.4. 367 00:47:20.039 --> 00:47:26.610 I'm going to be going through this fast here. Just give you the highlights. 368 00:47:27.929 --> 00:47:31.469 Interesting here: the indexing. 369 00:47:31.469 --> 00:47:35.909 The details of how you index, you could figure that out yourself. 370 00:47:35.909 --> 00:47:39.179 Um. 371 00:47:39.179 --> 00:47:47.699 You can look at the code yourself if you want, but here, what this sync threads is, this is in the global. 372 00:47:47.699 --> 00:47:51.389 The program function that runs for each thread. So. 373 00:47:53.099 --> 00:47:59.579 So, what they're doing is they're adding into the total for that pixel. 374 00:47:59.579 --> 00:48:03.659 Into a local subtotal variable. 375 00:48:03.659 --> 00:48:08.849 And then when all the threads are done, have computed their local P values. 376 00:48:08.849 --> 00:48:13.559 In a register belonging to the thread, and then what you do. 377 00:48:13.559 --> 00:48:18.809 Is you add it into, you write it into, the. 378 00:48:21.030 --> 00:48:26.070 The global element for that. 379 00:48:26.070 --> 00:48:29.070 That's where you would write that pixel. So. 380 00:48:31.139 --> 00:48:34.650 That's the interesting part of that. 381 00:48:35.699 --> 00:48:40.019 So you want all the threads to be ready, computing this, before you. 382 00:48:40.019 --> 00:48:47.099 Do that. I'm not actually certain here why you have to sync threads there; I'm looking at that, but. 383 00:48:48.989 --> 00:48:54.570 So you have to sync threads here because. 384 00:48:54.570 --> 00:49:01.920 You're computing the M tile and the N tile, and they're up here in shared memory. Okay. 385 00:49:01.920 --> 00:49:13.920 And the different components of these tiles for M and N are being computed by different threads. So, the sync threads means all the components of the 2 tiles have now been computed. 386 00:49:13.920 --> 00:49:22.530 So, now you can go and read them, because this thread is reading elements of those tiles that were written by other threads. 387 00:49:22.530 --> 00:49:31.559 That's why you have to synchronize here. As I'm talking, I cannot completely understand, or at all understand, why you. 388 00:49:31.559 --> 00:49:34.590 Have to synchronize there, actually. 389 00:49:34.590 --> 00:49:39.329 Anyone have any ideas? I mean, you're inside a loop. 390 00:49:39.329 --> 00:49:45.780 But, oh, well. 391 00:49:48.929 --> 00:49:54.090 Question: isn't there an atomic add at the incrementing of the P value? 392 00:49:56.039 --> 00:50:00.420 Well, that's separate from the, yeah, that's going to have to be an atomic. 393 00:50:00.420 --> 00:50:04.260 Uh, increment here, right? 394 00:50:04.260 --> 00:50:09.989 Now, if this program, if this code, is actually correct. 395 00:50:09.989 --> 00:50:14.820 Then this is, despite appearances, an atomic increment. 396 00:50:14.820 --> 00:50:18.000 You know, I would have to check the documentation to see. 397 00:50:18.000 --> 00:50:21.539 But that still doesn't explain why the sync threads is needed. 398 00:50:21.539 --> 00:50:28.619 Oh, well. Okay. Talking about resource limits here. 399 00:50:28.619 --> 00:50:37.650 And here, they're computing how many floating point operations you do for each memory load and so on. 400 00:50:38.880 --> 00:50:45.179 Okay, here's the detail for a different tile size. That is, so.
401 00:50:45.179 --> 00:50:53.159 The max should be 32 by 32, because that's 1024 threads, and that's how many threads you're allowed to have in the block. So. 402 00:50:55.139 --> 00:51:06.179 So, what happens if there's 32 by 32 blocks: you need a tile, a block, from M and one from N. Each is a K of floats, so, yes. 403 00:51:06.179 --> 00:51:12.389 2 K float loads, and then this is how much you're going to use the. 404 00:51:12.389 --> 00:51:19.769 The blocks, with 32 floating point operations for each memory load, and. 405 00:51:19.769 --> 00:51:23.250 But the thing is that you might wonder. 406 00:51:23.250 --> 00:51:33.030 Well, how's that possible, lots of flops for 1 load? But the thing is that the loads are, when you do a memory load, it's available to all of the threads. 407 00:51:33.030 --> 00:51:37.679 And that's the key, and the floating point operations are being done. 408 00:51:37.679 --> 00:51:46.590 You know, in parallel on each separate thread. That's why, you might look at this and say, how could you do that? Well, that's the reason, if you think about it. 409 00:51:46.590 --> 00:51:51.179 So the fact that the memory loads are being done in parallel. 410 00:51:52.619 --> 00:52:01.230 First, and then the floating point operations. I'm being a little vague, but you might be able to see why there is actually a reasonable amount of parallelism there. 411 00:52:04.440 --> 00:52:10.559 Okay, and then they start talking about will it fit in the shared memory and so on. So. 412 00:52:13.349 --> 00:52:19.530 And, um. 413 00:52:19.530 --> 00:52:23.460 Okay, and each thread needs some of the shared memory. 414 00:52:23.460 --> 00:52:33.269 This is the older architecture: there's only 16 K bytes of shared memory. It's more nowadays. Okay. And the thing here is that. 415 00:52:33.269 --> 00:52:38.429 If you have fewer threads, perhaps, or more thread blocks or fewer. 416 00:52:38.429 --> 00:52:49.019 More thread blocks means fewer threads per block. So the total number of threads will be the same. So, this is the point that there are more threads, but each. 417 00:52:49.019 --> 00:52:52.920 So, what this is talking about: more blocks. 418 00:52:52.920 --> 00:52:56.940 Fewer threads per block, so the threads in the block. 419 00:52:56.940 --> 00:53:04.019 Have more resources. So what this means is there's more shared memory per thread. 420 00:53:05.340 --> 00:53:10.679 And that sometimes is a win. There is a. 421 00:53:10.679 --> 00:53:16.199 There was a talk at the GPU technology conference a couple of years ago, demonstrating this. 422 00:53:16.199 --> 00:53:21.210 Okay. Questions. 423 00:53:27.900 --> 00:53:31.380 Silence. 424 00:53:32.579 --> 00:53:38.130 Okay, I'm going to go through these slides fast. 425 00:53:39.510 --> 00:53:45.869 Let me give the intellectual content: you're partitioning matrices into blocks. 426 00:53:45.869 --> 00:53:52.320 And threads into thread blocks. Well, it may not go evenly. You've got a fractional block. 427 00:53:52.320 --> 00:53:56.429 At the end. So for your matrix, you're going to have a fractional. 428 00:53:56.429 --> 00:54:00.809 Block at the right side of the matrix and fractional blocks at the bottom of the matrix. 429 00:54:02.099 --> 00:54:07.199 Yeah, it just makes the programming a little messier. That's all. I just summarized it. 430 00:54:07.199 --> 00:54:17.730 Arbitrary size matrices and so on. This is talking about, one way is to pad the matrix up to the next multiple of the.
431 00:54:17.730 --> 00:54:23.130 Of the block size. It makes the programming easier, but it takes more space. 432 00:54:23.130 --> 00:54:27.150 Significant or not depends on how big the blocks are. 433 00:54:27.150 --> 00:54:30.809 Yeah. Okay. Nothing new there. 434 00:54:31.980 --> 00:54:35.579 Nothing new there, nothing new there. Um. 435 00:54:35.579 --> 00:54:40.409 And nothing new there. 436 00:54:43.619 --> 00:54:51.630 Okay, let me see what's happening here. I'm going to give you a summary. 437 00:54:51.630 --> 00:55:01.889 The threads are doing different things. So the 1st thing the threads do is read data from the global memory into the shared memory, read in that block of data. 438 00:55:01.889 --> 00:55:05.610 And then the next thing, then they synchronize, and then they use it. 439 00:55:05.610 --> 00:55:14.730 And what they're talking about is a thread may have valid work in the 1st step of reading data in, but not in the 2nd step of computing an output. 440 00:55:14.730 --> 00:55:17.820 Element, because in the 2nd step. 441 00:55:17.820 --> 00:55:22.050 The element it would compute is off the boundary of the matrix. So. 442 00:55:22.050 --> 00:55:26.730 So what they're showing here: you've got. 443 00:55:26.730 --> 00:55:30.960 The blank elements of the, the actual matrix is the numbered. 444 00:55:30.960 --> 00:55:35.159 Entries here; the blank elements are padding it out to the next block size. So. 445 00:55:35.159 --> 00:55:44.579 Yeah, I don't actually find these slides particularly deep and interesting. I just gave you the content. Yeah, the blocks go off of the. 446 00:55:44.579 --> 00:55:48.030 Edge of the green matrix. Yeah. So you. 447 00:55:48.030 --> 00:55:51.119 You know, you code it so that you don't access. 448 00:55:51.119 --> 00:55:56.760 Invalid memory, by doing conditionals like that and so on. 449 00:55:56.760 --> 00:56:02.610 What's happening here is a fine point of C or C++. 450 00:56:02.610 --> 00:56:10.079 This logical AND is required to be a lazy evaluation. It is not optional. 451 00:56:10.079 --> 00:56:14.039 If row less than Width is false. 452 00:56:14.039 --> 00:56:17.159 It is prohibited to evaluate the right-hand side. 453 00:56:18.659 --> 00:56:24.539 Which is good, because if it's false, this might be an illegal. 454 00:56:25.679 --> 00:56:34.739 Operation. It would be, it's doing some computation here. So this computation could be invalid if the. 455 00:56:34.739 --> 00:56:41.250 1st clause was false, but that's okay, because this condition won't get evaluated if the 1st clause is false. 456 00:56:42.599 --> 00:56:47.909 I use something like this in a little C program that I wrote many years ago, a. 457 00:56:47.909 --> 00:56:55.829 Point inside a polygon test. It's 8 lines of code, actually, and it confuses people. I use something like this and. 458 00:56:55.829 --> 00:57:01.980 It messes people up, because other languages, Java, I guess, don't have this required lazy evaluation. 459 00:57:01.980 --> 00:57:08.550 You can't just take my C program and do the simple translation to Java. It will fail. 460 00:57:08.550 --> 00:57:18.690 Okay, nothing interesting: you check boundary conditions like this. Okay. And so you will have thread divergence with this. 461 00:57:19.710 --> 00:57:27.360 What that means is that if this is true, then the then-body gets executed. 462 00:57:27.360 --> 00:57:30.389 And the else-body does not. 463 00:57:30.389 --> 00:57:45.269 If the predicate is false, the then-body is not executed, and the else-body is.
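A minimal sketch of that lazy-evaluation point, using hypothetical names (i, n, data, out) rather than the slide's exact boundary check:

```cuda
// C and C++ guarantee that && short-circuits: when (i < n) is false, the
// right-hand operand is not evaluated, so data[i] is never read out of
// bounds.  Relying on this is legal, and it is exactly what the boundary
// checks in these kernels do.
if (i < n && data[i] > 0.0f)
    out[i] = data[i];
```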
So this will take twice as long to execute, because all the threads in the warp are doing the one or the other; they're not doing both at once. It's called thread divergence. 464 00:57:45.269 --> 00:57:56.789 Okay, well, you know, it's tolerable to a point. You maybe don't want to have multiple nested if-then-else blocks with thread divergence in them. You start getting. 465 00:57:56.789 --> 00:58:00.480 To the point where utilization will fall a lot. So. 466 00:58:00.480 --> 00:58:04.050 Okay. 467 00:58:05.730 --> 00:58:10.380 So, what you're doing is you're adding stuff in here. 468 00:58:10.380 --> 00:58:15.780 Well, these are all inside these enabling clauses. 469 00:58:15.780 --> 00:58:20.369 Nothing deep; you could figure it all out if you read the slides. Okay. 470 00:58:20.369 --> 00:58:23.789 And again, it's called control divergence. 471 00:58:23.789 --> 00:58:31.139 General rectangular matrices. There's nothing interesting here. You just have to do it. 472 00:58:31.139 --> 00:58:34.679 Yourself. 473 00:58:34.679 --> 00:58:38.789 Okay, good questions. 474 00:58:38.789 --> 00:58:42.840 Okay, let's. 475 00:58:42.840 --> 00:58:48.360 Move on. 476 00:58:48.360 --> 00:58:54.210 Silence. 477 00:58:54.210 --> 00:59:01.110 Okay. 478 00:59:01.110 --> 00:59:11.010 Thread execution efficiency. Okay. So the threads are bundled into warps, 32 at a time. We've got the. 479 00:59:11.010 --> 00:59:15.809 All right, the control divergence I told you about, we're going to see it in more detail here. 480 00:59:15.809 --> 00:59:21.269 Okay, so warps, the 32 threads, shaded the green, the red, and the purple warp. 481 00:59:21.269 --> 00:59:30.599 So, when you program, you might not actually ever be aware of warps; they're. 482 00:59:30.599 --> 00:59:36.150 An efficiency implementation technique; they don't necessarily affect your program at all. 483 00:59:38.250 --> 00:59:41.460 So that. 484 00:59:41.460 --> 00:59:44.880 You know, your CUDA program never sees. 485 00:59:44.880 --> 00:59:47.909 Never has an explicit warp in it, but they. 486 00:59:47.909 --> 00:59:51.539 You want to be aware of them because they affect the efficiency. 487 00:59:51.539 --> 00:59:58.289 Scheduling units are the warps; again, there's a queue of warps or pool of warps waiting to run. 488 00:59:58.289 --> 01:00:01.500 And then the streaming multiprocessor runs them. 489 01:00:01.500 --> 01:00:04.769 As resources become available. 490 01:00:06.000 --> 01:00:16.619 This is some syntactic sugaring where the thread block could be a 2-D thread block. It's just laid out in row-major order. I don't even know why they added this to the. 491 01:00:16.619 --> 01:00:28.949 Architecture specification, because you could realize it yourself so easily, at least in C++, and as I said, I've done it in C++ with little implicit conversion routines in the classes. 492 01:00:28.949 --> 01:00:32.340 Okay. 493 01:00:34.050 --> 01:00:44.760 I mean, that's nice in C++: I index into an array, I can either use a scalar or I can use a 2-vector, let's say, and it just calls the implicit conversion. 494 01:00:44.760 --> 01:00:51.300 That makes the programming nice. Okay. Threads in a warp. 495 01:00:52.739 --> 01:00:58.409 They may change from generation to generation, but in 20 years they haven't changed. 496 01:00:58.409 --> 01:01:07.260 And again, just the point I keep saying: the separate warps get scheduled independently. They may run.
497 01:01:07.260 --> 01:01:11.190 Side by side or 1 after the other, whatever. 498 01:01:11.190 --> 01:01:14.579 You saw this figure before. 499 01:01:15.929 --> 01:01:29.519 Nothing new. NVIDIA actually, it says single instruction, multiple thread; it's a slightly different acronym here, but. 500 01:01:29.519 --> 01:01:36.300 Sort of the same thing. Okay. And again, the point about this is less control overhead. 501 01:01:36.300 --> 01:01:45.510 What you don't have in the GPU is all of the speculative execution stuff, the stuff that makes, you know. 502 01:01:45.510 --> 01:01:53.130 Intel so big, and gets them their really low latency performance; that all got stripped out of the GPU. 503 01:01:53.130 --> 01:01:56.940 In order to have this parallelism. 504 01:01:56.940 --> 01:02:01.710 Okay. 505 01:02:03.420 --> 01:02:11.639 Okay, I mentioned if-then-else: if a thread makes a different decision, then the thing waits for it. 506 01:02:11.639 --> 01:02:17.610 Okay, here's the thing: loops that are inside the thread may not iterate the same number of times. 507 01:02:17.610 --> 01:02:21.449 Well, what will happen is if 1. 508 01:02:21.449 --> 01:02:28.110 Thread's loop terminates, it pauses while the slower threads finish their looping. 509 01:02:29.340 --> 01:02:33.510 So, control divergence. Okay. 510 01:02:35.159 --> 01:02:46.949 What I just said up here: if there are different paths, they get serialized. So the threads taking 1 path run, and then the threads taking the other path. 511 01:02:48.239 --> 01:02:51.929 And so perhaps nesting is a bad idea. 512 01:02:51.929 --> 01:02:55.590 The total number of paths will grow exponentially. 513 01:02:58.739 --> 01:03:07.980 Okay, we'll see an example here. Okay, so if you do something like this in a thread. Okay, it's the thing I highlighted. 514 01:03:07.980 --> 01:03:16.710 Okay, so this is a problem, because depending on the thread index, the body gets executed or does not get executed. So, 2 different control paths. 515 01:03:17.789 --> 01:03:20.789 And it will. 516 01:03:20.789 --> 01:03:27.809 Take longer. Now, this here is okay. 517 01:03:27.809 --> 01:03:37.409 You're branching based on the block number, the block index, but that's okay because the different blocks are different warps. 518 01:03:37.409 --> 01:03:41.460 And they're running different control things anyway. So here's the thing. 519 01:03:41.460 --> 01:03:45.989 You've got, say, 1024 threads in the block, in the thread block. 520 01:03:45.989 --> 01:03:51.210 32 warps of 32 threads. So inside each warp. 521 01:03:51.210 --> 01:03:55.469 The threads are doing the same thing, but the different warps in the block. 522 01:03:55.469 --> 01:03:59.909 Have no constraints on them, so the different warps in the block. 523 01:03:59.909 --> 01:04:04.679 Can certainly be running different execution paths. So. 524 01:04:06.179 --> 01:04:11.190 Yeah, well. 525 01:04:11.190 --> 01:04:20.489 With some footnotes. And the different blocks in the grid can absolutely be doing different things. They don't even have access to the same memory, apart from the global memory. 526 01:04:20.489 --> 01:04:24.449 The different warps in the block, I mean. 527 01:04:24.449 --> 01:04:30.719 Be careful about this: they're running the same instruction sequence. 528 01:04:30.719 --> 01:04:34.559 But they can be at, the instruction. 529 01:04:34.559 --> 01:04:40.199 Register, pointing to the current instruction, can be different for the different warps.
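A small hedged illustration of that contrast; the function and array names are mine, not the slide's:

```cuda
__global__ void divergenceDemo(int *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Diverges: within one warp, threads 0-15 and 16-31 take different
    // paths, so the warp runs the then-body and the else-body one after
    // the other, roughly doubling the time for this statement.
    if (threadIdx.x < 16) a[i] = 1; else a[i] = 2;

    // Does not diverge within a warp: every thread in a block (and hence
    // in each of its warps) sees the same blockIdx.x, so a whole warp
    // takes a single path.
    if (blockIdx.x % 2 == 0) a[i] += 10; else a[i] += 20;
}
```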
530 01:04:41.639 --> 01:04:48.179 So, if you're not all totally confused yet: see, if you've got that if-then-else divergence thing. 531 01:04:48.179 --> 01:04:57.539 For different warps, 1 warp could be running the then-block at the same time as another warp is running the else-block. 532 01:04:57.539 --> 01:05:06.900 So, they're running the same instruction stream, but they're at different places in it. So have I totally confused you? Did I succeed? No. Okay. 533 01:05:06.900 --> 01:05:09.900 It actually does make sense if you think about it. 534 01:05:09.900 --> 01:05:15.420 So, the different warps in a block can be. 535 01:05:15.420 --> 01:05:19.889 At any given time on a different instruction; that's why you have to synchronize. 536 01:05:19.889 --> 01:05:24.300 Okay, and the different, the different blocks, that is. 537 01:05:24.300 --> 01:05:33.389 Have no connection at all; they can be running, probably running, at different times. So you've probably got more blocks that want to run than you have. 538 01:05:33.389 --> 01:05:39.059 Resources to run them. Vector addition: you saw this 1 before. 539 01:05:43.829 --> 01:05:48.179 Now, um. 540 01:05:48.179 --> 01:05:56.070 They're assuming that you want to add a vector that's, whatever, a certain size, and do. 541 01:05:56.070 --> 01:05:59.489 Let me summarize this slide for you. 542 01:06:01.260 --> 01:06:07.170 The last block will not be full of threads, so it's got threads that are idle. 543 01:06:07.170 --> 01:06:10.170 And that's called divergence. So. 544 01:06:10.170 --> 01:06:17.820 That's the content of that, because they made the number of elements not a multiple of the number of. 545 01:06:17.820 --> 01:06:21.750 Threads per block. Okay. 546 01:06:22.860 --> 01:06:26.550 No questions. 547 01:06:30.000 --> 01:06:33.059 Silence. 548 01:06:39.659 --> 01:06:43.860 Okay, boundary condition checking: you've got to do it. 549 01:06:43.860 --> 01:06:56.130 But there's nothing deep in it. And control divergence: a point here is you might have a conditional that depends on the data, so you cannot necessarily find your control divergence with static code analysis. 550 01:06:56.130 --> 01:07:00.960 Because it could dynamically depend on the data. You know, such as, if a data element equals 5. 551 01:07:00.960 --> 01:07:05.190 Then do something. Okay, so that just depends on the data. 552 01:07:05.190 --> 01:07:08.489 And what they're going to talk about here. Okay. 553 01:07:09.869 --> 01:07:15.150 Yeah, you know, don't write data that's outside the matrix. 554 01:07:15.150 --> 01:07:21.659 There's divergence here. Okay. And don't read data that's outside the matrix. 555 01:07:21.659 --> 01:07:25.139 Actually, I was wrong: don't read outside. Okay. 556 01:07:25.139 --> 01:07:29.039 This is loading the local. 557 01:07:29.039 --> 01:07:33.630 Shared tile from the global memory, so, okay. 558 01:07:33.630 --> 01:07:36.659 Boundary checks. 559 01:07:36.659 --> 01:07:43.380 You saw this before; nothing deep there. 560 01:07:43.380 --> 01:07:52.920 And you can compute the effect of the control divergence. The thing is, for the last warp, you're not using all the threads in the warp. Some are idle. So. 561 01:07:52.920 --> 01:07:56.039 You could say that that's an inefficiency, so it computes. 562 01:07:56.039 --> 01:08:00.989 The inefficiency. Nothing interesting there. 563 01:08:00.989 --> 01:08:04.650 Or there, or there. 564 01:08:05.730 --> 01:08:11.369 Okay, um, or here even, actually. Okay. 565 01:08:13.650 --> 01:08:17.220 There can be some natural control divergence.
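To make the last-block point concrete, here is a minimal sketch, with an assumed size of n = 1000 and 256 threads per block; the slide's exact numbers may differ:

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // threads past the end of the vector do nothing
        c[i] = a[i] + b[i];
}

// Launch: ceil(1000 / 256) = 4 blocks of 256 threads = 1024 threads total.
// The first 3 blocks (768 threads) are fully active; the 4th block has
// 232 active and 24 idle threads, and only its last warp actually
// diverges on the if.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```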
566 01:08:17.220 --> 01:08:21.630 Okay. 567 01:08:21.630 --> 01:08:25.140 That was that. Any questions? 568 01:08:25.140 --> 01:08:29.340 No? I'm looking at the chat window before we move on. 569 01:08:31.890 --> 01:08:36.930 All right, that was module 5. 570 01:08:36.930 --> 01:08:42.750 There are 21 modules here and we've done through 5, so we've done quite a bit today. Okay. 571 01:08:42.750 --> 01:08:48.090 Let's go into 6, and again, I'm summarizing things. 572 01:08:50.430 --> 01:08:56.069 Oh, by the way, what I'm doing here is I'm using this virtual 573 01:08:56.069 --> 01:09:05.729 file system idea. All this stuff is in this big zip file, and I just mounted a virtual file system that looks into the big zip file 574 01:09:05.729 --> 01:09:12.479 and pulls out pieces. That's a thing I told you about. So it doesn't stress the file system so much, if you care. 575 01:09:12.479 --> 01:09:18.630 Moving on: okay, memory access. 576 01:09:20.189 --> 01:09:32.729 Okay, as I've been saying before, more often than not your program is limited by getting the data to the processors. 577 01:09:32.729 --> 01:09:37.140 I/O dominates. 578 01:09:41.909 --> 01:09:45.659 I'm not quite certain what that means. Okay, this is 579 01:09:45.659 --> 01:09:51.270 water coming out of a dam. Ideally, you'd want to have high bandwidth, but in reality you're 580 01:09:51.270 --> 01:09:55.319 sipping through a straw. 581 01:09:56.729 --> 01:10:02.100 Again, I don't know if everyone here is a hardware person. 582 01:10:02.100 --> 01:10:07.949 For you software types: your DRAM, dynamic random access memory. 583 01:10:07.949 --> 01:10:11.340 Each bit is a little capacitor, effectively, 584 01:10:11.340 --> 01:10:16.859 which is controlled by a transistor. So the thing with capacitors 585 01:10:16.859 --> 01:10:24.930 is they discharge through a resistor. A resistor-capacitor circuit has a time constant for how fast it discharges. 586 01:10:24.930 --> 01:10:28.680 And if you make them 587 01:10:28.680 --> 01:10:37.170 smaller, they discharge faster. So the DRAM has to be refreshed as the capacitors run down. If the capacitors are smaller, 588 01:10:37.170 --> 01:10:47.454 then you have to refresh them more. The other thing is this time constant limits how quickly you can do things to the memory. This is why 589 01:10:47.454 --> 01:10:58.704 the memories are not getting faster at the same rate that the processors are getting faster. In the processors, you make your gates smaller and 590 01:10:59.100 --> 01:11:09.510 they switch faster, but you cannot necessarily make the DRAM smaller and faster: it has to be refreshed more, and that takes more of your time. 591 01:11:09.510 --> 01:11:19.470 So that's a quick review of the limits of DRAM and why it doesn't get faster, faster. Now, you could have static random access memory, where each 592 01:11:19.470 --> 01:11:24.180 bit is a little flip-flop that takes a couple of transistors. 593 01:11:24.180 --> 01:11:27.539 So, it's synchronous logic, 594 01:11:27.539 --> 01:11:38.430 no capacitors. The static RAM can get faster. The problem is it's much more expensive. So static RAM might be used for the cache, 595 01:11:38.430 --> 01:11:42.930 but it's more expensive, so that's why you don't get humongous static RAM 596 01:11:42.930 --> 01:11:46.710 memories. 597 01:11:46.710 --> 01:11:51.595 If you want to see really expensive static RAM, look at what they use in spacecraft.
598 01:11:51.595 --> 01:12:03.864 Some of the spacecraft that go out of the solar system have static memory where a bit might be two wires wrapped around each other, and if you wrap 599 01:12:04.079 --> 01:12:11.939 clockwise, it would be a 1; wrap counterclockwise, that might be a 0. So a bit is really big. However, 600 01:12:11.939 --> 01:12:20.250 it's also really stable. You run this twisted memory through a Van Allen belt at Jupiter 601 01:12:20.250 --> 01:12:23.430 and it survives, where these other things won't. 602 01:12:23.430 --> 01:12:34.109 So there are trade-offs. When I was a student, static memory was actually magnetic core. It was a little torus, 603 01:12:34.109 --> 01:12:37.260 about a millimeter or two millimeters across, 604 01:12:37.260 --> 01:12:46.619 made of ferrite, and it would magnetize clockwise or counterclockwise, and once you magnetized it, it would stay magnetized effectively forever. 605 01:12:46.619 --> 01:12:56.579 That was static, and you have a couple of wires going through the hole in the center of the core, and 606 01:12:56.579 --> 01:13:03.359 well, you would magnetize it by running a current one way or the other way, and you would sense it, 607 01:13:03.359 --> 01:13:08.489 actually, by re-magnetizing it, running a current through and measuring 608 01:13:08.489 --> 01:13:12.689 the flux change. So again, static memory lasted forever, but 609 01:13:12.689 --> 01:13:18.300 big, expensive. That's when machines had, 610 01:13:18.300 --> 01:13:22.739 they talked about K, thousands of bytes of memory, not 611 01:13:22.739 --> 01:13:27.180 gigabytes of memory. Okay. Nothing interesting there, 612 01:13:28.500 --> 01:13:32.670 or there. They're slow; I just told you they're slow. 613 01:13:32.670 --> 01:13:40.020 But you can read out a chunk of memory at one time. So 614 01:13:42.989 --> 01:13:48.510 there's a latency, but then the I/O is a touch faster. That's the burst mode. 615 01:13:48.510 --> 01:13:55.170 The burst mode is useful if you're accessing sequential elements, several sequential elements. 616 01:13:55.170 --> 01:13:59.850 Okay, banks: not relevant to the course. 617 01:13:59.850 --> 01:14:10.470 Okay, yeah, so here's where this is relevant: we're leading into the global memory on the GPU, which is 48 gigabytes 618 01:14:10.470 --> 01:14:25.409 and which is going to be DRAM like this. So this is all relevant to that global memory. It's got the latency, but the thing is it's got the burst mode. In the global memory it's 128 bytes 619 01:14:25.409 --> 01:14:34.140 read in one cycle. So it's going to be maybe a hundred cycles to read the first byte, but then, wham, in one cycle you get all 128 bytes. 620 01:14:34.140 --> 01:14:38.399 So, 621 01:14:38.399 --> 01:14:42.510 you would use that if a warp of 32 threads 622 01:14:42.510 --> 01:14:56.850 wanted 32 consecutive words from the global memory, a word being 4 bytes. So you see that burst of 128 bytes from global memory will serve the whole warp of 32 threads; the two things fit together nicely. Yeah. 623 01:14:57.899 --> 01:15:04.289 If the consecutive threads in the warp want consecutive words from the global memory, that works well. So this means 624 01:15:04.289 --> 01:15:11.550 this is how you've got to design your program.
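Here is a small, hedged CUDA illustration of that rule; the kernel names are assumptions, not code from the module. Consecutive threads reading consecutive 4-byte words coalesce into one 128-byte burst, while a strided pattern does not.

// coalesce_demo.cu: coalesced versus strided global-memory access
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Thread k of each warp reads word k of a 128-byte segment,
    // so the warp's 32 four-byte loads are served by one burst.
    if (i < n)
        out[i] = in[i];
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads now touch addresses that are 'stride' words
    // apart, so one warp needs several separate memory transactions.
    if (i * stride < n)
        out[i * stride] = in[i * stride];
}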
625 01:15:12.175 --> 01:15:26.935 That's also why, in your program, if you have, say, 2D points with X and Y, you don't have an array of structures of the X's and Y's; you have a structure with an array of X's and an array of Y's, so all the X's are consecutive and all the Y's are consecutive. So this burst idea will actually be useful. 626 01:15:27.210 --> 01:15:30.689 That's the idea here; there's a small layout sketch of this below. Okay. 627 01:15:33.175 --> 01:15:47.664 Talking about the global memory: the global memory on the card with the GPU, they actually worked really hard to make it fast, and it is quite fast. I talk about it being slow compared to the registers. Well, yeah, but it's fast compared to anything else. 628 01:15:47.909 --> 01:15:55.890 Okay, so I might even run a program to show you how fast. 629 01:15:55.890 --> 01:16:00.449 In any case, I'm talking about speeds there. 630 01:16:01.859 --> 01:16:04.979 Okay. 631 01:16:04.979 --> 01:16:08.130 That's a reasonable point 632 01:16:08.130 --> 01:16:12.119 to stop. We went up through 633 01:16:12.119 --> 01:16:18.300 6.1. Just to show you what 6.2 might be, without doing it: 634 01:16:18.300 --> 01:16:21.359 it talks about memory coalescing. 635 01:16:21.359 --> 01:16:29.250 This is what I just told you. The memory coalescing idea is that if the 32 consecutive threads 636 01:16:29.250 --> 01:16:39.029 go for 32 consecutive words of global memory, then the accesses get coalesced, so it's only 1 read, not 32 reads. 637 01:16:39.029 --> 01:16:43.649 Let me show you what some of the bandwidths are, just for fun. 638 01:16:50.729 --> 01:16:55.050 I'm running on my local machine here. 639 01:17:07.649 --> 01:17:10.949 Silence. 640 01:17:10.949 --> 01:17:15.420 Silence. 641 01:17:15.420 --> 01:17:20.640 Okay, um, 642 01:17:20.640 --> 01:17:33.359 just waiting on my laptop. Okay, it's doing a bandwidth test. Device to device, that's inside the device: 300 gigabytes a second. It's not so awful. Host to device 643 01:17:33.359 --> 01:17:40.590 is some gigabytes a second. So the bus from the device to the host, it's the fastest 644 01:17:40.590 --> 01:17:45.060 bus on the whole computer, I think, or thereabouts. 645 01:17:45.060 --> 01:17:49.529 Okay, but let me try parallel and see what it is. 646 01:17:50.939 --> 01:17:54.090 Silence. 647 01:17:57.600 --> 01:18:05.430 On the system. Okay. 648 01:18:05.430 --> 01:18:09.210 Silence. 649 01:18:13.770 --> 01:18:17.399 Silence. 650 01:18:17.399 --> 01:18:21.510 Silence. 651 01:18:23.399 --> 01:18:28.409 Okay, I tried to do a demo. What this means is 652 01:18:28.409 --> 01:18:34.140 I have to recompile stuff someday, because when I try to do a demo it doesn't work. But that's 653 01:18:34.140 --> 01:18:43.229 okay. What I was hoping is that if I did this on parallel, I would get higher speeds than if I did it 654 01:18:43.229 --> 01:18:49.079 on my laptop, which has a mobile GPU, for our examples. 655 01:18:49.079 --> 01:18:52.409 Other things here, just to show you, 656 01:18:52.409 --> 01:18:57.930 remind you. 657 01:19:02.069 --> 01:19:05.699 Showing you features of this GPU. 658 01:19:05.699 --> 01:19:12.960 It's still a Quadro; there's only 16 gigabytes of memory, only 3000 cores, only 659 01:19:12.960 --> 01:19:16.319 1500 megahertz. 660 01:19:16.319 --> 01:19:21.810 I say only, but I'm being sarcastic. So for the memory, 661 01:19:21.810 --> 01:19:26.520 we're seeing some fairly reasonable speeds here. 662 01:19:26.520 --> 01:19:31.380 You see the memory bus width, 256 bits. Okay. 663 01:19:31.380 --> 01:19:37.500 This here is the cache size; this is used for various things.
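Going back to the X and Y point from a few minutes ago, here is a hedged sketch of array-of-structures versus structure-of-arrays; the type and kernel names are made up for illustration.

// layout_demo.cu: AoS versus SoA for 2D points
struct PointAoS { float x, y; };   // array of structures: x,y,x,y,... interleaved

struct PointsSoA {                 // structure of arrays: all x's consecutive,
    float *x;                      // then all y's consecutive
    float *y;
};

__global__ void shiftX_AoS(PointAoS *p, int n, float dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The p[i].x addresses are 8 bytes apart, so a warp's 32 reads span
    // 256 bytes: two bursts, with half of each burst wasted.
    if (i < n)
        p[i].x += dx;
}

__global__ void shiftX_SoA(PointsSoA p, int n, float dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The p.x[i] addresses are 4 bytes apart, so the warp's reads coalesce
    // into a single 128-byte burst.
    if (i < n)
        p.x[i] += dx;
}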
664 01:19:37.500 --> 01:19:40.680 The cache is only 4 megabytes. 665 01:19:40.680 --> 01:19:47.430 That goes into various things. Okay. So the constant memory: 64 kilobytes. 666 01:19:47.430 --> 01:19:54.449 Shared memory: 48 kilobytes per block, and there's the shared memory per multiprocessor. 667 01:19:54.449 --> 01:19:58.529 So, there are 48 multiprocessors 668 01:19:58.529 --> 01:20:02.130 and 64 cores per multiprocessor. 669 01:20:02.130 --> 01:20:09.989 So what you're seeing here is shared memory per block and shared memory per multiprocessor. 670 01:20:09.989 --> 01:20:14.399 You could see some optimization issues with how many blocks per 671 01:20:14.399 --> 01:20:19.859 multiprocessor you can have. Registers per multiprocessor: this many registers. 672 01:20:19.859 --> 01:20:23.640 Threads per multiprocessor, threads per block. 673 01:20:23.640 --> 01:20:29.850 Right, so the thing is, if there are more blocks than a multiprocessor 674 01:20:29.850 --> 01:20:36.689 can run, then the others wait to run until they've got 675 01:20:36.689 --> 01:20:43.859 space available. So you see here, because you see a kernel launch could have 676 01:20:43.859 --> 01:20:48.180 2 to the 20th, 2 to the 26th threads, say, but 677 01:20:48.180 --> 01:20:55.770 only a thousand or so are going to run at once per multiprocessor. I haven't talked about texture memory and so on yet, but 678 01:20:57.960 --> 01:21:05.130 unified addressing and managed memory I mentioned. So, 679 01:21:05.130 --> 01:21:15.840 you know, this stuff was originally intended for graphics, so there's this texture memory, 680 01:21:15.840 --> 01:21:19.739 which I haven't talked about, and it's actually 681 01:21:19.739 --> 01:21:24.119 stored using some sort of space-filling curve, like a Peano curve. 682 01:21:24.119 --> 01:21:27.539 Okay. 683 01:21:27.539 --> 01:21:32.069 Well, that's enough stuff for today; you want to get to your next class. 684 01:21:32.069 --> 01:21:36.840 So, what we did is another chunk of the 685 01:21:36.840 --> 01:21:40.920 NVIDIA teaching kit stuff on there. 686 01:21:40.920 --> 01:21:48.359 And I'm hitting the highlights, pointing you to the stuff that I think is interesting and skipping over the stuff that I think is 687 01:21:48.359 --> 01:21:53.250 not so interesting. And I will 688 01:21:53.250 --> 01:22:01.109 fix parallel, so you can run this on parallel also. The source code, you'll also be able to look at that. Basically, I think I have to 689 01:22:01.109 --> 01:22:15.029 rebuild it. And maybe where you compile it matters: there are runtime modules that get loaded at run time, and if they get updated, if there's any sort of version clash 690 01:22:15.029 --> 01:22:18.359 between what version your CUDA program was compiled with 691 01:22:18.359 --> 01:22:23.880 and what version the runtime modules are, you're going to get these error messages. So, things have to be kept in sync. 692 01:22:25.079 --> 01:22:30.659 Okay, so, are there any questions? 693 01:22:30.659 --> 01:22:34.199 No? Other than that, have a good 694 01:22:34.199 --> 01:22:38.579 week. 695 01:22:39.779 --> 01:22:44.970 Oh, I thought of something. 696 01:22:44.970 --> 01:22:49.590 D-Wave is, um, basically, 697 01:22:49.590 --> 01:22:54.689 Silence. 698 01:22:54.689 --> 01:22:58.199 Silence. 699 01:22:58.199 --> 01:23:03.479 D-Wave is one of the major quantum computing companies. 700 01:23:03.479 --> 01:23:10.829 There's another seminar by one of them, maybe, and it's out of class time. I cannot require these sorts of things,
701 01:23:10.829 --> 01:23:18.180 but they're valuable for you to look at, to see presentations by the major players 702 01:23:18.180 --> 01:23:21.630 in parallel computing and in quantum computing. 703 01:23:21.630 --> 01:23:27.090 So, oh, and no homework today; I'll give you a break and do it Thursday or something. 704 01:23:29.550 --> 01:23:34.770 Other than that, if no questions, then goodbye.