All right. And if you're wondering why I upload the videos to my server instead of to the RPI video/media service: my server is less hassle. Okay. I put up a homework, which is to implement the histogram thing on both multicore, with OpenMP or OpenACC, and also CUDA. Multicore is the Intel Xeon; many-core is the NVIDIA GPU.

Okay, so we're continuing on, looking at the NVIDIA Accelerated Computing Teaching Kit. And I see my value added, apart from pointing you to it, as selecting the parts that I think are worth presenting and going quickly. It's of uneven quality: it's got parts with a very low signal-to-noise ratio, and very short parts that are higher. So that's my value added.

What I did is I just unzipped everything to my local directory, and, what did I say, we're starting at module 7. Yes. We go here... maybe not... okay. And again, to remind you, the e-book: I don't think they have the whole book necessarily, but they've got some chapters of the book available for free, and it's very well written. And I also pick some homework questions out of it. Okay.

An interesting hardware issue, by the way: Linux is not perfectly supported by Lenovo on this ThinkPad. The left and right mouse buttons on the trackpad don't work, nor do the speakers. Okay. So, blah blah.

Histogramming is a nice example because it illustrates some issues with parallel computing. You all know what histogramming is: we have these bins, and we read in text and we count the frequencies. Okay, and I'm anticipating a little: what makes it different on a parallel computer is that you have these global counters of the frequencies, and you read in a letter and you update the corresponding counter. That update is a read-modify-write, so it has to be done atomically. Otherwise, if two different threads try to update the same counter, it will not get updated properly, and as the number of parallel threads grows, the probability of this happening increases.

And if you have this problem, then every time you run your program, if you're lucky, you'll get a different wrong answer. If you're not lucky, you'll get the same wrong answer every time. I'm being serious: if the answer is different each run, you suspect an issue. That's the thing with parallel computing; getting the same wrong answer every time, that's something.
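To make the race concrete, here is a minimal sketch of my own (not the teaching kit's code) of a naive CUDA histogram kernel where the increment is an unprotected read-modify-write; the names buffer, size, and histo are assumptions.

// Illustrative only: a naive histogram kernel with the read-modify-write race.
__global__ void histo_naive(const unsigned char *buffer, long size,
                            unsigned int *histo) {
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        // This ++ is really a load, an add, and a store.  Two threads that
        // see the same character can both load the old count, both add 1,
        // and both store, losing one increment.  The fix (atomicAdd) comes
        // later, in module 7.3.
        histo[buffer[i]]++;
    }
}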
That effect hit the first space shuttle flight, in fact. This is going back a few decades, but they had four primary computers that were IBM, and in critical moments of the flight, like before launch, they were supposed to be synchronously running the same thing. So that's a check against hardware issues. But then the NASA programmers, being paranoid people, also had a fifth computer, designed by a different contractor with a different operating system: a backup flight system. It was also supposed to synchronize with the primary avionics software system, and just before launch the backup system refused to sync; it disagreed with the primaries. I suppose the primaries were IBM and the backup was some other brand. But they did the responsible thing, and they scrubbed the flight until they figured it out. It turned out that the four primary computers together were wrong and the other vendor's backup was right. It was a synchronization glitch with odds of roughly 1 in 70: every so often the primaries would get the clock wrong. They had observed it during a pre-flight test, and they had logged it, because they log everything, but they logged it and left it for future analysis. And then it happened again, that roughly 1-in-70 chance, just before the flight. So the primaries: it's a parallel computer, the four primaries and then the backup, and they did it for increased reliability. Now, with the space shuttle, during non-critical parts of the flight, like when they're in orbit, the primary computers could do different things; they only synced up when it was critical.

NASA had other stories too; I quite like old NASA programmers, a lot of the time. They did a lot of interesting programming, and this is a little separate from parallel. For example, those little things running around Mars are running real-time operating systems, because they're getting real-time interrupts and they have to handle stuff. Actually, even when it's in flight to Mars, the computers are running at a low speed, collecting a small amount of data, and at one point they collected so much data that it overflowed the storage. So the people on the ground used some back door that they had, hacking in and clearing things out and rebooting it. Now, think about it: you've got a latency of what, 10 minutes, 15 minutes? And I don't know what the bandwidth of the link is, 8 kilobits a second or something. And no, it's not a GUI.
So some of you are complaining about using a command-line interface. Well, if you're ever debugging something on Mars over a link like that, you're not going to call it ugly; I don't think so. But they got it working. They also had things like a high-priority interrupt that would dominate the whole system because it fired more often than it should. So: beautiful examples of reliable programming.

Examples of read-only memory: for some of the read-only memory in the stuff they sent to Jupiter and so on, they would twist two wires around each other; a clockwise twist would be a 0 and a counterclockwise twist would be a 1, or something like that. You think core memory is not very dense? This is even less dense. Expensive, heavy, and not very dense; those are the disadvantages. So why would they do such a thing? Because you put this memory through the Van Allen belt and it survives really serious radiation. So: hardware appropriate to the task.

Okay, well, there was one programming error the space shuttle had that was announced. If it left orbit, or in the simulation if it left orbit, they pushed something onto a stack. And at one point in simulation they had to leave orbit, then come back, and then leave orbit a second time, and the stack overflowed, because who would have thought the space shuttle would leave orbit twice in one mission? Okay.

So, what we're doing here is that we have a long text, Programming Massively Parallel Processors, and we wish to histogram it in parallel. We wish to partition the input and assign different chunks of the input to different threads, and the threads will histogram it in parallel. This assumes, I guess, that the I/O time is less than the computation time. Your first question is: how do you partition the text among the processing threads?

This slide here shows the obvious solution. Suppose we have a gigabyte of text: the first 250 megabytes goes to thread 0, the next 250 megabytes goes to thread 1, and so on. A quarter, a quarter, a quarter, a quarter, and we do it like that. And by the way, here we've got 4 threads, but you could imagine 1,000 threads or 10,000 threads; remember that our GPU has 5,000 parallel threads, give or take.

Okay, so here's the problem. Oh, and just to make the diagram readable, we're bucketing the letters in groups.
It may happen here that two or three of the threads want to update the m-to-p counter simultaneously. Okay, that's iteration 1. And looking ahead a little: later we'll see a different way to assign the input text to the different threads, which will be better. Oops, let me see, I skipped something. Okay, so iteration 2, and we look at the second letter of each chunk. There will be a different version of this later; we're refining this first scheme. And two threads update this counter, one thread that one, one thread that one, and so on.

Okay, so here's the coalescing thing that was in module 6, which I mentioned last time. The way the DRAM hardware is designed, and it's designed that way for good hardware-design reasons which are described in the accompanying book chapter (it's divided into banks, and so on; I'm a software person), the effect is this: the big global memory on the GPU, the 48 gigabytes on ours, you read from that DRAM, from the global memory, in chunks of 128 bytes. If you want 1 byte, you have to read 128 and throw away, or ignore, the other 127. That's just the way the hardware is designed.

The implication: suppose each thread wants only 4 bytes, and you've got 32 threads working together synchronously in a warp. Notice how the different design features work together. I mean, at some point in the future NVIDIA is going to squander its advantage, but at the moment they're really brilliant people; the different pieces fit together. The global memory reads data in chunks of 128 bytes; 32 threads are in a warp and operate synchronously; so what you really want, if each thread wants a 4-byte word, is for the threads to want adjacent 4-byte words, so that all 32 threads together want 128 contiguous bytes of global memory. Because if you can do that, you get one 128-byte read from the global memory, and then each of the 32 threads pulls its own little 4-byte word out of that 128-byte chunk. You have minimized the I/O from the global memory and you have maximized the efficiency. That's the good way. It's called coalesced reading, because the 32 threads' 32 separate requests coalesce into one 128-byte read from the global memory; writes work the same way.
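To make the contrast concrete, here is a hedged sketch of my own contrasting the two access patterns for one warp; the kernels, array names, and the chunk parameter are illustrative, not the teaching kit's code.

// Coalesced: thread t reads element t, so consecutive threads of a warp hit
// consecutive 4-byte words and the warp's 32 loads become one 128-byte read.
__global__ void copy_coalesced(const float *data, float *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = data[t];
}

// Not coalesced: each thread starts its own big contiguous chunk (the
// "first quarter to thread 0" partitioning), so the 32 loads of a warp land
// in 32 different 128-byte segments and you pay for 32 transactions.
__global__ void copy_blocked(const float *data, float *out, int n, int chunk) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)t * chunk;            // thread t's chunk begins here
    if (j < n) out[t] = data[j];
}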
The problem is, if I go back a slide: here, what the different threads are reading from the global memory is not adjacent, and these reads do not coalesce. So I'll go back to this color slide. The colors show what each thread reads: the red thread, the gray thread, the green thread, the blue thread. At step 0, iteration 0, the red thread reads here, the gray thread there, the green thread there, and the blue thread there. So the 4 threads are reading widely separated chunks from the global memory, and this roughly quadruples the read traffic, the I/O, from the global memory.

Now, this is relevant because this bus, it's on the card for the GPU, it's a good fast bus, but it's still slower than something that's in the registers. So this poor access efficiency, I'm guessing, will be the rate-limiting problem for the whole program: they're getting data from the global memory and their reads are not coalesced.

There's another thing also. There are these caches hidden away in the system, and they're documented poorly. There's a cache between the global memory and the streaming multiprocessors, and it's the same hardware as the fast shared memory, I think, and the constant memory; it's this one bank of, give or take, 128 kilobytes (I could be wrong on that number) of very fast memory. It can be used as shared memory for the thread blocks, or it can be used just to cache reads, and it can also, I think, be used for the constant memory. So if you're coalescing stuff, then it goes through the cache; but an access pattern like this one is too big and won't fit in the cache. So if you coalesce stuff you get to use the cache, and you don't have to program that; it happens automatically. There's also a possibility, if you coalesce stuff, that the system might even read ahead; I don't know how sophisticated the GPU is there. NVIDIA has design documents on their website, but they don't necessarily highlight what they're doing for efficiency. You can tell they're doing something, though, because it works so well, in some ways.

Okay, so this slide is just writing down what I just said: if you give the first chunk of the data to the first thread, and so on, then the memory accesses are inefficient.
The bottom half of that slide is what I just told you: what you want is what's called interleaved partitioning. This here is the input text, byte 1, 2, 3, and so on, and the colors show which thread is accessing each byte. Here the threads are interleaved, so the first 4 bytes are processed by the 4 threads; adjacent thread numbers read adjacent data from the memory. Coalescing, that's the buzzword: they're coalescing. And here they show that the first 4 bytes get partitioned among the threads. Okay, so this reduces your total I/O from the global memory. We still have the stepping-on-your-own-toes problem with the counters; we'll get to that next. But this is the first new idea today: interleaving increases memory-access performance.

Okay, and in the next slide set we do that. Oh, by the way, you'll read NVIDIA documentation, and also documentation written by third-party people, and they'll give you various tricks and hacks to make your programs run faster on the GPU. Now, you always have the question: when is the hack worth using, and when should you ignore it? And this is something you have to do as software designers, because some of the tips that I've seen written up, (a) they may not be worth your time, and (b) the next generation of GPU may invalidate them. If you get too cute with your optimization, it won't help you with the next version of the chips in two years. That's something to think about. In fact, the next version of the chip may actually make your optimization run slower. The reason is that with NVIDIA, the cost is always real estate on the chip, and what NVIDIA does from generation to generation is change the allocation of the chip to the different functionality: how much for floating point, how much for double precision, how much for cache, whatever. When they change that allocation, they're going to invalidate things if you over-optimized. Just a note. But the interleaving thing, I think, is a fairly long-lasting idea that's worth doing in general. So that was that. Okay, there wasn't a lot in this slide set.

Okay, the next thing is the data race, or what I call stepping on your own toes: two threads update the same counter. I have an example, say, from a bank or something; I'll skip through.
You're booking something, say a seat in the theater (theaters are open again, or whatever), online on the web, and two people want to book the same seat, so the site holds the seat for 10 minutes or something while you put in your payment. Whatever; I have typed this up on the wiki, on the blog actually. The two threads in parallel read the old value, update it, and then in parallel write it back, so the value gets updated only once, not twice. So depending on how the two threads interleave internally: no guarantees. Okay, I'll skip through that; I've talked about it a lot.

If I'm going too fast, tell me. Oh, and I guess one more point: this sort of badly written program will behave differently on different GPUs. On a cheap GPU, the threads will run one after another (one thread has to finish before the second thread starts, for lack of resources); on an expensive GPU they'll run in parallel, because the resources are available. Okay.

So that slide set did not answer anything; all it did was present a question, and slide set 7.3 will answer the question. Yeah: atomic operations. To summarize the slides: you want the read-modify-write to be an atomic, non-interruptible, hardware-level instruction. That's what they provide, and then the threads can do it safely. Most of you have seen this.

Okay, so specifically, CUDA can do lots of different operations atomically, and they're implemented as one machine instruction on the GPU, in the core. Compare-and-swap is another one. Basically, they'll do these various things as one machine instruction, not interruptible. At the programming level, at the C/C++ level, inside the kernel, this is what the function call looks like: atomicAdd. You give it a pointer to an address, in global memory let's say, and a value; it adds the value into that location and returns the old value, and it's not interruptible. Besides add, there are various others. And in OpenMP you've got the atomic pragma, which will do the same thing. So you do something like this in CUDA and C, and it translates to one machine instruction, for an int, an unsigned, a 64-bit long long, and so on.
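Here is a hedged sketch of what a single atomicAdd call looks like and what it logically does; the counter array and the bump kernel are made-up illustrations, though atomicAdd itself is the real CUDA intrinsic.

// 'counter' is a made-up device array, 256 bins just for illustration.
__device__ unsigned int counter[256];

__global__ void bump(int bin) {
    // Atomically: old = counter[bin]; counter[bin] = old + 1; return old.
    // The whole read-modify-write is one uninterruptible hardware operation.
    unsigned int old = atomicAdd(&counter[bin], 1u);
    (void)old;   // the old value is returned but is often ignored
}

// The OpenMP analogue mentioned in the lecture is the atomic pragma, e.g.
//   #pragma omp atomic
//   hist[c]++;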
One point: in the C/C++ standard, the number of bits that these types take is not defined; it's implementation-dependent. And, in quotes, "half": this was an NVIDIA addition, the half-precision float for machine learning, because in machine learning the data is low precision and 2-byte floats are actually useful.

So, we're going back to the __global__ function for our text-histogramming thing. The concept is that each thread adds in another byte from the data and counts it up. The arguments are the obvious ones, but we're implementing the striding, so the stride is the spacing between consecutive elements handled by one thread, or something like that. And then we just add it in with the atomicAdd: we take the character... well, here we're just adding; I'm waving my hands and ignoring details. So the concept is that we're using an atomicAdd here, and we're doing it the right way. And i is the element we're working on: we're working on character number i, the character is buffer[i], and we want to increment histo at that character's bucket, so by atomicAdd we add 1 to it, because we saw that character.

And this here: we've got the concept that we have thread blocks, so we may have so many threads that they're distributed among several blocks. This is the thread number within the block, plus the block number of the current thread times the number of threads per block, so this gives a unique counter from 0 up to the number of threads. And now what's happening here: as I said, each pass through this loop does one character, but then we add stride, so the thread actually loops and keeps repeating. The thread doesn't just do one character; it does a number of characters, but separated by stride, so that adjacent threads are doing adjacent characters and this thread does every stride-th character. And stride is basically the total number of threads, so each iteration the thread moves to the next chunk up in the memory. So this is written to handle very large numbers of threads and even much longer arrays of data.
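Here is a sketch approximating the kernel being described, written from the description rather than copied from the slide; the names, the lowercase-only check, and the 7-bucket grouping of 4 letters each are my reconstruction of what the teaching kit does.

// Interleaved (strided) histogram kernel with global-memory atomics.
__global__ void histo_kernel(const unsigned char *buffer, long size,
                             unsigned int *histo) {
    // Unique thread index across the whole grid.
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    // Total number of threads in the grid: the distance between the
    // characters one thread handles, so adjacent threads read adjacent
    // bytes and the reads coalesce.
    long stride = (long)blockDim.x * gridDim.x;

    while (i < size) {
        int pos = buffer[i] - 'a';          // alphabet position, 0..25
        if (pos >= 0 && pos < 26)           // ignore non-letters
            atomicAdd(&histo[pos / 4], 1);  // 7 buckets of 4 letters each
        i += stride;                        // next character for this thread
    }
}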
Okay. And this part, there's nothing particularly interesting here: they're doing characters, and they're assuming the characters are ASCII-ish things, so they can convert from a character code to an integer from 0 on up. This is a sanity check: is it a printable character or not? And this here, because we're chunking the letters in groups of four, a to d, e to h, and so on, that's what the division by 4 does. And by adding stride up here, each thread is doing every stride-th character, and stride can be quite big.

Okay, so this also shows that, again, running on the GPU, you can have conditionals and so on. But as I said, the threads in a warp are synchronous, so if for some thread this Boolean is false, then the body of the loop is just idled for that thread; for the threads for which the conditional is true, the body of the while statement is executed. And ditto for whatever else is in here: if you've got this nested thing, you get thread divergence. Okay, that was about all for this one; this was 7.3.
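A small sketch of my own of the divergence point, with made-up arrays, just to picture what "idling" means inside a warp.

// Divergence in a warp, schematically.  All 32 threads of a warp share one
// instruction stream, so both sides of a branch get issued, and the lanes
// for which the condition is false simply sit idle during the other side.
__global__ void divergence_demo(const int *in, int *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    if (in[t] > 0)          // some lanes take this side...
        out[t] = in[t] * 2;
    else                    // ...the rest idle, then take this side
        out[t] = 0;
    // A while loop behaves the same way: lanes that finish early idle until
    // the slowest lane in the warp is done.
}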
So, the intellectual content of slide set 7.3 was to introduce the atomic operations; the intellectual content of 7.4 is to discuss their performance. Your global memory has a humongous latency but a tolerably fast bandwidth once you've waited out that latency. You can do your atomic operations on different types of memory: shared memory, or through the cache, which is the cache for the global memory. What they mean by a shared cache is that all the threads access it; it's the one cache for the global memory, so anything accessed in global memory is going to get cached in that same cache.

Okay, and then they talk about the thing where, when you do a read, you grab 128 bytes. This is why they do that: they've got these ports on the memory controller, and banks, and so on. The hardware people can look at this; it's the hardware reason that a read fetches 128 bytes at a shot. And there are different streaming multiprocessors, and of course they're all going at the memory at the same time, so you have a number of controllers and a number of ports per controller. The concept is that you've got these banks of memory, and you can read a word from each bank in parallel, which multiplies your I/O speed by 32, because all 32 ports, and banks, work in parallel. That's how you increase your speed.

The intellectual content here is the latency. You're doing an atomic operation on the global memory, and you've got this humongous latency there, and that latency is big, and you've got several of them in sequence. So, what we're leading into is that these atomic operations on the global memory will make your program correct, but they're going to slow it down. Again, this latency is 100 cycles or more, maybe 1,000, and it's the dominant factor. The thing with parallel programs is that your I/O is usually the limiting factor: usually with parallel computers you're sitting waiting for the data to get to you. They're I/O limited, and that's why we spend time talking about making the I/O more efficient.

It's also the reason that these tutorials on parallel computing love to use matrix multiplication as an example: matrix multiplication is one of the cases where you're potentially not I/O limited. With matrix multiplication you're processing order-n-squared data but doing order-n-cubed operations on it, so it's potentially compute limited if your I/O is done right. For example, multiplying two N-by-N matrices reads 2N-squared numbers but does about 2N-cubed arithmetic operations, so the compute-to-I/O ratio grows with N. That's why people like to use it. Usually the computation is linear in the size of the problem; matrix multiplication goes up super-linearly, as the data size to the three-halves power.

Okay, so here they're looking at the latency; in this case it could be 1,000 cycles, which is just awful, horrible. So you really want to do anything you can. I'll skip the other examples. Fermi is many generations old; they haven't updated this slide in a while. It went Fermi, Kepler, Maxwell, Pascal, and Ampere, I think; you can correct me, something like that. So it's like six generations, ten years ago, but in any case, the idea stays the same.

But these atomics can be shared, if you can... oops, back here. Okay, hardware improvements: again, if you put things into the shared memory, it has much less latency, but it's private to each block, right?
Okay, so that's the content there. The content of this module 7.4 was that atomic operations, especially to the global memory, have a humongous, horrible latency, so do everything you can to avoid them: coalesce things (though I don't know whether coalescing helps with atomics), and use shared memory, which we'll talk about next; it's private to the block. Okay.

Let me summarize what you're going to see here. This is the thing we first saw with OpenMP, where the example was a reduction, which is a really simple histogram: just adding up all the elements of an array. And you had the same issue, worse when you're just summing an array, because you always get clashes. So OpenMP and OpenACC introduced the reduction clause on the pragma, and what it means is that each separate thread gets a separate counter. You have a for loop that's summing up the array, so each thread sums into a private subtotal, and then at the end the several private subtotals are summed into the grand total. OpenMP does that automatically; the compiler does it and you don't have to worry about it. OpenACC also.

What we are seeing here is how that is implemented: the concept of each thread having a private subtotal, which then gets summed up at the end. And because of this, while you're working through the array there's no possibility of a clash, because those private subtotals are private; they're not shared by different threads. Yes, the final merge also happens in parallel, and that does require locking; that requires the atomic add. But it's a very short part of the code.

And, you know, this is an optimization question: you could have private totals per warp, per block, or whatever. My guess is it's not worth worrying about incredibly much, but you can decide at what level you want that. (A student asks about very large arrays: instead of adding to the front as each thread finishes an entry, do four elements at a time?) Yeah, exactly right. And some people like to go for broad, shallow trees; as an aside, I don't like trees, but that's a matter of argument. Okay.
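For reference, here is a minimal sketch of the OpenMP reduction the lecture refers to, using the sum-of-an-array example that was mentioned; the file layout and numbers are mine, and it assumes compilation with OpenMP enabled (e.g. -fopenmp).

// Each thread accumulates a private subtotal; the runtime merges the
// subtotals at the end, so no atomics are needed inside the loop.
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> a(1000000, 1.0f);
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < (long)a.size(); ++i)
        sum += a[i];                 // private per-thread subtotal
    std::printf("sum = %f\n", sum);  // subtotals merged here, once
    return 0;
}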
Okay, anticipating a little: probably you'd put the subtotals in registers, let's say. If you've got only, like, 200 different categories, buckets, you could perhaps put the subtotals in registers, and each thread might do a private addition, and then you'd start merging things in shared memory, perhaps. Okay, here they talk about shared memory: if it's private to the thread, then no atomics are needed, but then you go up to the next level, shared memory, and up another level, global memory, or something. Okay.

I just told you what's happening here. The first picture is everyone adding into the same totals; nothing interesting there. The thing on the left is like one total array, and here are all these separate copies that you then merge. I just summarized the intellectual content of this, so we can skip through it. The cost is creating these private copies, and the benefit is fewer slow atomics; they really like to say it can increase performance 10 times, because you really want to minimize the atomic locking. So, okay.

Yeah, and their partitioning is into thread blocks. Just to remind you, you can have 1,024 threads in a thread block (it's hardware dependent, but typically 1,024), and the shared memory is shared by all 1,024 threads in the thread block. If you've got more than 1,024 threads, you need multiple thread blocks, and the separate thread blocks do not talk to each other except via the global memory, so they do not synchronize, more or less. Inside the thread block you might have 32 warps of threads, and the different warps might be running separately, but they're accessing the same shared memory, adding in. For that to work you need the atomics to update the totals in the shared memory; but it being shared memory, it's faster; shared memory is fast. Okay, but again, to remind you, the shared memory might be 48 kilobytes for the whole block, let's say.

And one more thing: this 48 kilobytes, I think, is actually for all the blocks that are currently running on the streaming multiprocessor. So if one block doesn't need a lot of shared memory, more blocks can run in parallel, and there's an optimization there; otherwise the blocks get queued up.

Okay, so what are we doing here? This is a new thing here; it's inside the kernel. So again, this function is running on the device; __global__ means it's called from the host and runs on the device. These are the arguments. And this declaration here is new.
It says that this array will be allocated in shared memory, visible to all of the threads in the block. So this is a new language extension here. And what it means is private for each block: each different thread block will have its own version of this array, but all (up to) 1,024 threads in the block will have access to that block's version. And again, for the different blocks: there's this global pool of shared memory that the different blocks share, but it's chunked; each block has a piece that's private to that block.

So, what's new here? First we allocate this array, and then we want to zero it. Let's see... here we go. Allocating it doesn't zero it, so we're going to zero this private histogram in parallel. This kernel routine (it's called a kernel) is going to be doing something, but the first thing it does is clear out, zero, the histogram. Each thread zeroes one element of the histogram, and so that we don't run off the end of the histogram, we have a boundary check. And then we sync, because we don't know what order the threads run in (there are, say, 8 bins and many more threads), so we do a sync, and that synchronizes all the threads in the block, 1,024 of them perhaps. Okay. And "private" here means, to remind you, local to the block; different blocks have different copies. So you can initialize the array in parallel, but there are probably fewer elements in the array than there are threads, and we are required to synchronize at the end. Or required only if you want the answer to be right, of course; who knows what you want. Okay.

So now we're continuing on in the thread. Separate threads do separate elements, as before. We compute an index, a character number, from the thread ID and the block ID, and again we're going to start at i and do every stride-th character: i, i plus stride, i plus 2 times stride, and so on. So this is the coalesced concept: in successive iterations of this while loop over every stride-th character, adjacent threads will do adjacent characters. That's the reason for it. So we get the stride, and it's going to be quite large: blockDim is the number of threads per block, gridDim is the number of blocks in the grid, and the stride is their product.
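As a sketch of where blockDim.x and gridDim.x come from, here is a hypothetical host-side launch of the histo_kernel sketched earlier; the block and grid sizes are arbitrary illustrations and error checking is omitted.

// Host side: the launch configuration fixes blockDim.x and gridDim.x,
// and therefore the stride used inside the kernel.
void launch_histo(const unsigned char *d_buffer, long size,
                  unsigned int *d_histo) {
    int threadsPerBlock = 256;   // becomes blockDim.x in the kernel
    int blocks = 120;            // becomes gridDim.x in the kernel
    // stride = blockDim.x * gridDim.x = 256 * 120 = 30720, so each thread
    // handles roughly size / 30720 characters, spaced 30720 bytes apart.
    histo_kernel<<<blocks, threadsPerBlock>>>(d_buffer, size, d_histo);
}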
So then we loop, and again, the threads in a warp are synchronous, so for some threads this while loop will end before it does for other threads, in which case those threads just idle while the slower threads finish. That's okay. Then the atomicAdd: the private histogram is in the shared memory, and it works. So again, the point here is that each block has its private histogram in shared memory, which is fast; separate blocks fill separate private histograms, and later on we'll bring them together. We still have the atomic, but the point is that it's locking the shared memory, which is much faster than having to lock global memory. Okay, so that's the intellectual content here: have separate histograms, and then later we'll merge them.

To build the final histogram, we're continuing on in the same __global__ routine. We've computed all the private histograms; we sync. And how long the sync takes depends on your hardware: if you've got expensive hardware, all the threads are running in parallel; if you've got cheap hardware, the warps are running serially, in which case the sync could wait a while. And now, see, we have the private histogram here and we add it into the global histogram, which is in global memory. This one is going to be slow; this atomicAdd has the big latency, but you're not doing it very often. And only the first 7 or so threads are doing it. You know, I could almost be persuaded we have an error here; I would almost say it should be less-than-or-equal-to 7 rather than less-than, but I could be wrong; it depends how many bins there are.

So, if I recap what's happening here: this is a __global__ function, it runs on the device, and it's got three stages in it. The first stage is to zero the private histogram array, done in parallel. The second stage is to populate the private histogram array, done in parallel. And the third stage is to merge the private histogram arrays into the global histogram, done in parallel. Between the three stages we have syncthreads. So this shows how the reduction pragma in OpenMP, or OpenACC, is implemented.
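Putting the three stages together, here is a hedged sketch of the whole privatized kernel as just described; NUM_BINS, the 7-bucket letter scheme, and all the names are my assumptions rather than the slide's exact code.

#define NUM_BINS 7

__global__ void histo_private_kernel(const unsigned char *buffer, long size,
                                     unsigned int *histo) {
    // Stage 1: zero a per-block private copy of the bins in shared memory.
    __shared__ unsigned int histo_private[NUM_BINS];
    if (threadIdx.x < NUM_BINS)          // fewer bins than threads
        histo_private[threadIdx.x] = 0;
    __syncthreads();                     // everyone sees zeroed bins

    // Stage 2: interleaved accumulation into the block's private bins.
    // These shared-memory atomics are cheap compared to global atomics.
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;
    while (i < size) {
        int pos = buffer[i] - 'a';
        if (pos >= 0 && pos < 26)
            atomicAdd(&histo_private[pos / 4], 1);
        i += stride;
    }
    __syncthreads();                     // this block's counting is done

    // Stage 3: merge into the global histogram.  Only NUM_BINS threads per
    // block pay the slow global-memory atomic, once each.
    if (threadIdx.x < NUM_BINS)
        atomicAdd(&histo[threadIdx.x], histo_private[threadIdx.x]);
}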
Questions? So, once you see how it's implemented, you'd probably never write it yourself in OpenACC or OpenMP, whichever; you'd use the reduction clause. But this again shows the implementation. Okay; the arguments, okay.

A more powerful idea: you can do this with any operator that's associative and commutative. People know what associative and commutative mean, I assume; yeah, you've got the backgrounds. Okay.

And again, a reminder about shared memory: everything has to fit. I've noticed that, for NVIDIA, as the years go on they do not increase the size of the shared memory on the chip. They increase the number of CUDA cores, for example, and the global memory, but they do not increase the shared memory, which tells me that however it's implemented, it is really expensive. They increase the clock speed, but not the size of the shared memory and not the number of registers. So however they're doing it, there are a lot of gates involved; who can tell. Okay.

You can spill over from shared memory into global memory, but you're going to pay a horrible penalty if you do. And again, if you define local scalar variables in your device function, they go, I believe, into registers by default, as long as registers are available. If you declare a lot of local variables in your function, it just invisibly spills over to global memory, and when you run the profiler it will tell you about it. So you don't have a wall, you've got a cliff, and you fall off it. Local arrays that you declare in the function go, by default, to global memory; well, it's a private piece of global memory, it's called local memory: it's in global memory but it's private to the thread.

And I actually found a compiler bug in NVCC a few years ago. I found that one particular size of local array, like 256 or something, threw the compiler into an infinite loop; the compiler never terminated. So, cool: I break software. It was only for that one size of array; if you were one element bigger or smaller, the compiler finished. So clearly they had an off-by-one somewhere. I posted it on a couple of NVIDIA and other blogs, and someone there reposted it as an error, and not that long after there was a minor release that fixed it. I've found a few compiler errors over the years.
Okay, so this slide set's content was how you implement the reduction, with private histograms. Good. If there are no real questions, we can move on.

So, what we're seeing are programming paradigms: programming techniques which are useful for parallel computing only; they don't help you with serial computing. We saw one, which is this private histogram. And this module here talks about another important operation, convolution, and how we do convolution efficiently on a parallel computer, well, in CUDA, for example. The concept is to do the convolution efficiently in parallel.

Most of you know what convolution is, I assume. Okay, good. It's a low-pass filter, for example: every output element is a weighted average of a sliding window of input elements. You know what that is; I'll skip through it.

Okay, so convolution has this mask, and what we're doing here is one-dimensional: each output element is a weighted sum of 5 input elements. The array N holds the input elements, P the output elements, and M is the mask. The mask is fixed, and we slide the mask over the input array. So, for example, here, if the mask is centered on the red input element, then these 5 input elements, indices 0 through 4, get weighted by the 5 elements in the mask (those are the 5 multiplications), then we sum all of those together and produce the one output element, and we slide the window along.

If I stop right here and you think about how you would do it efficiently in parallel: number one, the mask is read-only and it's used all the time. So the mask, for example, you would put, say, in constant memory, which in CUDA is fast; constant memory is very fast, but read-only, and it's implemented as something in the cache. If you didn't have constant memory, you'd stick it in shared memory, because it's read all the time and you want it to be fast. The next thing is that each element in the input array is read 5 times, so you're thinking you want to somehow avoid re-reading it. I mean, the input and output arrays are going to be in global memory ultimately, because they're big.
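Here is a hedged sketch of the basic 1-D convolution kernel being described: the mask lives in constant memory (read-only, cached, fast), each thread computes one output element, and out-of-range input elements are treated as zero. MAX_MASK_WIDTH and the exact names are assumptions of mine.

#define MAX_MASK_WIDTH 11
__constant__ float M[MAX_MASK_WIDTH];   // filled from the host with cudaMemcpyToSymbol

__global__ void conv1d_basic(const float *N, float *P, int mask_width, int width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width) return;

    float acc = 0.0f;
    int start = i - mask_width / 2;      // left edge of the sliding window
    for (int j = 0; j < mask_width; ++j) {
        int k = start + j;
        if (k >= 0 && k < width)         // boundary handling: pad with zeros
            acc += N[k] * M[j];
    }
    P[i] = acc;                          // one weighted sum per thread
}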
But you're reading each element of the input array, say, 5 times for a mask that's 5 long, and you don't want to be reading it repetitively, 5 times, out of the global memory. Now, I guess today the cache might do that for you, so you might not have to worry about it; but if you did want to worry about it, you'd want to do some explicit caching.

Just as an aside: even though the systems have caches, you can sometimes do better than the system cache. I'll give an example. With a Brazilian collaborator of mine, we do some work on visibility: computing visibility on terrain; you've got an observer, and what targets can you see from the observer, you run lines of sight out, and so on. We found an example where, if we explicitly cached chunks of the terrain, we could do better than the system virtual memory manager, because we could look ahead: we knew when we were finished with a chunk of the terrain, and we would swap it out of our cache, whereas the system virtual memory manager could not see into the future; it did not know that. So we actually (and this is a peer-reviewed published paper) beat the virtual memory manager in that example. So you can do that. Now, whether it's worth it is a different question, but it was cool.

Okay, so that's what's happening: I told you, you slide along the input array and you sum, reduce, and so on. Boundary conditions: I hate boundary conditions. You've got to handle them, but it's not interesting. You pad with zeros, or you change the weighting, or something. A lot of errors occur when people do boundary conditions wrong. I'm ignoring that; it's important but not interesting, if you know what I mean. I'm ignoring everything to do with boundary-condition handling.

In 2-D now, the interesting thing is that elements that are close in two dimensions may not be close when the array is linearized into one dimension: in row-major order a pixel's left and right neighbors are adjacent in memory, but its neighbors above and below are a whole row width away. Which you might want to worry about. That's why space-filling curves were invented: to try to reduce the average distance between adjacent elements when the 2-D array is linearized. NVIDIA uses some undocumented space-filling curve for the texture memory, I think, because texture memory, again, is read-only and read a lot, and so I think that's how NVIDIA stores textures; NVIDIA has special hardware for texture memory (it's graphics, after all).
589 00:55:59.849 --> 00:56:04.110 They have some sort of zigzag curve to 590 00:56:04.110 --> 00:56:08.010 linearize the texture memory; I think they sort of talk about it. 591 00:56:08.010 --> 00:56:11.730 In any case, so. 592 00:56:11.730 --> 00:56:16.320 You slide the 2D filter over the input array. 593 00:56:16.320 --> 00:56:21.239 Multiply the K by K elements, get this, um, that. 594 00:56:21.239 --> 00:56:25.650 Okay, boundary conditions: get them wrong and 595 00:56:25.650 --> 00:56:28.650 bad things happen, but I'm going to ignore that. 596 00:56:30.000 --> 00:56:36.269 Um. 597 00:56:36.269 --> 00:56:39.630 Just, except for the one thing: again, you're going to get thread divergence. 598 00:56:39.630 --> 00:56:44.369 Ignore that again, because for some threads in the warp the 599 00:56:44.369 --> 00:56:50.309 conditional will be true, for other threads the conditional will be false, and the ones for which it's false just idle. 600 00:56:50.309 --> 00:56:57.780 Generally all the threads will do the same number of iterations here, because the mask width is a constant. 601 00:56:57.780 --> 00:57:05.670 That's okay. And again, these device functions can have for loops and while loops in them and all that stuff. That's fine. 602 00:57:05.670 --> 00:57:09.900 If you do nothing else. 603 00:57:10.949 --> 00:57:15.900 Again, conditionals are fine inside the device function. 604 00:57:16.949 --> 00:57:21.210 And, okay, um. 605 00:57:21.210 --> 00:57:27.869 Maybe I skipped over a little too much here. 606 00:57:35.010 --> 00:57:39.780 Yeah, okay, well, this will get emphasized a touch more later. 607 00:57:39.780 --> 00:57:44.400 So, this thread is computing one output pixel. 608 00:57:44.400 --> 00:57:49.289 This is iterating over the adjacent pixels, this thing here 609 00:57:49.289 --> 00:57:57.989 is iterating over adjacent pixels, so we have j from 0 up to the mask width, and so on. 610 00:57:57.989 --> 00:58:02.429 So the one output pixel depends on this block of input pixels. 611 00:58:02.429 --> 00:58:05.550 And, um. 612 00:58:05.550 --> 00:58:08.940 So the i and j indices get linearized, and 613 00:58:08.940 --> 00:58:14.429 whatever, whatever; the relevant thing is that this loop is going over a number 614 00:58:14.429 --> 00:58:17.730 of input pixels repeatedly. 615 00:58:17.730 --> 00:58:21.239 So each input pixel gets read mask-width-squared times. 616 00:58:21.239 --> 00:58:24.869 We'll worry about that later. Okay. 617 00:58:26.429 --> 00:58:30.690 I pointed out that was module 618 00:58:33.719 --> 00:58:41.730 8.1. Questions? So again, this is a second paradigm 619 00:58:41.730 --> 00:58:45.360 for how to do some parallel computation. We saw 620 00:58:45.360 --> 00:58:51.000 histogramming; here we're working into convolution. 621 00:58:52.559 --> 00:58:56.219 8.1, 8.2. 622 00:58:59.309 --> 00:59:12.179 What's going to happen is we're going to partition the data into tiles, and each tile might be small enough that we can cache it 623 00:59:12.179 --> 00:59:16.079 into some fast memory, and we will reduce the latency. 624 00:59:18.480 --> 00:59:22.530 So, if you're... 625 00:59:22.530 --> 00:59:26.699 If I can pause on this slide and you think ahead now: 626 00:59:26.699 --> 00:59:34.349 so the tiles, maybe you want them to be as big as they can be, but small enough that everything fits into shared memory. 627 00:59:34.349 --> 00:59:38.340 You think about how many tiles have to be in shared memory together.
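(A minimal sketch of the straightforward 2D kernel just walked through: one thread per output pixel, a mask-width by mask-width loop over the neighbourhood, so each input pixel gets read mask-width-squared times; that is exactly what the tiling discussed next tries to fix. The names are mine, and this is one plausible rendering of the slide, not its exact code.)

#define MASK_WIDTH 5
__constant__ float d_M2[MASK_WIDTH][MASK_WIDTH];    // 2D mask in constant memory

__global__ void convolution_2d(const float *N, float *P, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;      // edge threads idle: thread divergence

    float sum = 0.0f;
    int startRow = row - MASK_WIDTH / 2;
    int startCol = col - MASK_WIDTH / 2;
    for (int i = 0; i < MASK_WIDTH; ++i) {           // every thread does the same
        for (int j = 0; j < MASK_WIDTH; ++j) {       // MASK_WIDTH * MASK_WIDTH iterations
            int r = startRow + i, c = startCol + j;
            if (r >= 0 && r < height && c >= 0 && c < width)   // zero-pad the boundary
                sum += N[r * width + c] * d_M2[i][j];
        }
    }
    P[row * width + col] = sum;                      // one output pixel per thread
}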
628 00:59:38.340 --> 00:59:43.829 That determines the tile size, and again you have a trade-off. 629 00:59:43.829 --> 00:59:51.480 Smaller tiles means more blocks, thread blocks, can run in parallel. Do you want to do that? I don't know. 630 00:59:51.480 --> 00:59:57.030 You're going to have boundary conditions that are even worse. 631 00:59:57.030 --> 01:00:01.260 Um, you're going to decide, 632 01:00:02.639 --> 01:00:06.269 well, input tiles versus output tiles, meaning: 633 01:00:06.269 --> 01:00:12.269 you can iterate through the input data repeatedly to compute one output pixel, 634 01:00:12.269 --> 01:00:18.480 or you can stick on one input pixel and iterate through all the output pixels that that input pixel goes into. 635 01:00:18.480 --> 01:00:21.570 And that's sort of what they're saying there, and then 636 01:00:21.570 --> 01:00:24.570 this would affect your tiling, so. 637 01:00:28.170 --> 01:00:33.599 With matrix multiplication it's the same thing: the conventional way, you iterate 638 01:00:34.980 --> 01:00:43.320 through the input, um, matrices to compute one output element. You can also 639 01:00:43.320 --> 01:00:47.099 iterate through the input elements and, for each input element, 640 01:00:47.099 --> 01:00:52.619 sum into all the output elements that it affects. Different ways to look at things. 641 01:00:53.670 --> 01:00:58.469 Okay, um. 642 01:00:58.469 --> 01:01:04.260 So we are running the sliding window down, and nothing new on this slide. Um. 643 01:01:06.059 --> 01:01:09.269 And this is the new point here, that, um, 644 01:01:11.639 --> 01:01:17.369 this might be a chunk of the input data. We put it in, we cache it in, the shared memory. 645 01:01:17.369 --> 01:01:21.059 Hello. 646 01:01:21.059 --> 01:01:25.739 And again, who knows, maybe the cache handler does it for you. 647 01:01:25.739 --> 01:01:29.010 I don't know. 648 01:01:29.010 --> 01:01:32.789 So nothing new there; the only new thing on this 649 01:01:32.789 --> 01:01:37.829 slide is it's now talking about caching stuff into the shared memory. 650 01:01:37.829 --> 01:01:44.460 So, and what they're saying again is one particular input element is used several times. So. 651 01:01:46.769 --> 01:01:52.409 Yeah, um. 652 01:01:52.409 --> 01:01:57.510 You could even imagine a sliding cache, actually. 653 01:01:57.510 --> 01:02:01.920 That maybe you've got these elements in your shared memory, 654 01:02:01.920 --> 01:02:07.679 so once you finish with element 2, you replace it with element 10, perhaps. 655 01:02:07.679 --> 01:02:17.460 When you finish with element 3, you replace it with element 11; in shared memory you can do something like that. It'd be really cool. So you're sliding this cache down 656 01:02:17.460 --> 01:02:21.239 the input memory. Really cool idea. 657 01:02:21.239 --> 01:02:24.809 Um, programming it would be fine. 658 01:02:25.889 --> 01:02:30.210 So, as I said, this is the access pattern. Here you see this, this: 659 01:02:30.210 --> 01:02:33.750 the heavy green box is what you've got in shared memory. 660 01:02:33.750 --> 01:02:39.750 You have a window that's sliding down; when, as I said, you don't need element 2 anymore, 661 01:02:39.750 --> 01:02:43.739 you replace it with 10, and you've got to keep track of it all. That would be cool. 662 01:02:45.389 --> 01:02:48.900 So, um. 663 01:02:50.429 --> 01:02:53.940 And again, what they're saying here on the output side: 664 01:02:53.940 --> 01:02:59.429 you've got to have some sort of a cache.
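(A sketch of the input-tiling idea in shared memory, using the same constant-memory mask assumed earlier. Each block loads its tile plus the halo cells it needs into shared memory once, and then every thread reads its whole window out of shared memory instead of re-reading global memory. TILE_SIZE is assumed to equal the block size; all names are mine.)

#define TILE_SIZE 256
#define MASK_WIDTH 5
__constant__ float d_M[MASK_WIDTH];

__global__ void convolution_1d_tiled(const float *N, float *P, int width) {
    __shared__ float tile[TILE_SIZE + MASK_WIDTH - 1];

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread's output
    int halo = MASK_WIDTH / 2;

    // Each thread loads its own element into the middle of the tile (zero-padded past the end).
    tile[threadIdx.x + halo] = (i < width) ? N[i] : 0.0f;

    // The first 'halo' threads also load the left and right halo cells.
    if (threadIdx.x < halo) {
        int left  = i - halo;
        int right = i + blockDim.x;
        tile[threadIdx.x] = (left >= 0) ? N[left] : 0.0f;
        tile[threadIdx.x + blockDim.x + halo] = (right < width) ? N[right] : 0.0f;
    }
    __syncthreads();                                 // whole tile is now in shared memory

    if (i < width) {
        float sum = 0.0f;
        for (int j = 0; j < MASK_WIDTH; ++j)         // every read below hits shared memory
            sum += tile[threadIdx.x + j] * d_M[j];
        P[i] = sum;
    }
}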
665 01:02:59.429 --> 01:03:08.670 For what you're doing with the output memory, and when you're writing, you write a tile at once. I don't know that it's so helpful, but you can imagine a cache of output elements that you're sliding. 666 01:03:10.800 --> 01:03:17.250 Okay, um, okay, what they're talking about here is interesting. 667 01:03:18.300 --> 01:03:23.909 Okay, so your global memory is accessed in chunks of, say, 128 bytes. 668 01:03:23.909 --> 01:03:28.920 So if you want to write one word of the output memory, it still has to 669 01:03:28.920 --> 01:03:32.519 effectively update the 128 bytes. 670 01:03:32.519 --> 01:03:38.849 And so what they're saying here is: have a local chunk of your global memory, 671 01:03:38.849 --> 01:03:45.510 and you're updating elements, and once you're finished, maybe you write that local chunk back to global memory as one operation. 672 01:03:45.510 --> 01:03:49.320 So you've reduced your latency, so. 673 01:03:49.320 --> 01:03:53.250 So you've got this tile, so. 674 01:03:53.250 --> 01:03:57.030 You split the output array into tiles. 675 01:03:57.030 --> 01:04:01.559 And because you're computing elements of the output array in a predictable way, 676 01:04:01.559 --> 01:04:05.369 you create the whole tile locally 677 01:04:05.369 --> 01:04:09.690 and then you send it back to the global memory, and 678 01:04:09.690 --> 01:04:15.119 it's much more efficient than sending each element of the tile to the global memory one by one by one, 679 01:04:15.119 --> 01:04:19.920 because of this horrible latency on the global memory. 680 01:04:19.920 --> 01:04:26.969 Okay, output tile. So, and then they make these tiles correspond to the thread blocks. 681 01:04:26.969 --> 01:04:30.329 So. 682 01:04:30.329 --> 01:04:35.519 And again, the size depends on all the usual suspects. 683 01:04:35.519 --> 01:04:40.469 So, um. 684 01:04:40.469 --> 01:04:43.650 So that's, um. 685 01:04:45.239 --> 01:04:51.420 An input tile, same thing. So, um. 686 01:04:51.420 --> 01:04:56.159 And as I mentioned, the input tile could slide down the array, but the tiles probably have 687 01:04:56.159 --> 01:05:01.409 a fixed position you calculate, right? So. 688 01:05:01.409 --> 01:05:08.880 Okay, and. 689 01:05:08.880 --> 01:05:12.420 So, what they're talking about here, 690 01:05:12.420 --> 01:05:18.269 as I mentioned, is analogous to matrix multiplication. The... 691 01:05:19.800 --> 01:05:23.849 You've got your thread block, um. 692 01:05:24.869 --> 01:05:32.010 Well, you can decide how many threads are in a thread block. Okay, the max is 1024; it could be a lot less. 693 01:05:32.010 --> 01:05:35.429 And they're just giving different choices for 694 01:05:35.429 --> 01:05:38.880 how you size the thread blocks. 695 01:05:38.880 --> 01:05:47.369 So, and now you can, you can read it, basically. 696 01:05:47.369 --> 01:05:50.670 Some cores are idle some of the time. 697 01:05:51.809 --> 01:05:55.289 And, okay, so. 698 01:05:56.730 --> 01:06:04.320 Um, the issue is reading stuff multiple times. So. 699 01:06:04.320 --> 01:06:09.389 I may rerun it again Thursday, but I'm going to overlap this, I think, 700 01:06:09.389 --> 01:06:14.670 this module, with Thursday; I'll just do it preliminarily here. 701 01:06:14.670 --> 01:06:19.110 And, um. 702 01:06:19.110 --> 01:06:24.030 Yeah, the thread is reading a window, writing a window, and 703 01:06:24.030 --> 01:06:31.380 yeah, it's probably getting late enough. I'll finish this thing off; I'm sort of running low.
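(To make the block-size choice above concrete, here is a hedged sketch of a launch configuration for the 2D kernel sketched earlier. The 16 by 16 figure is just an illustration of one of many legal choices, anything up to the 1024-thread-per-block limit; d_N and d_P are assumed device pointers, not names from the slides.)

// One output tile per thread block; adjacent threads in x write adjacent output
// elements, so the 128-byte global-memory transactions are used fully.
dim3 block(16, 16);                               // 256 threads; 32x32 = 1024 is the maximum
dim3 grid((width  + block.x - 1) / block.x,       // enough blocks to cover the whole output
          (height + block.y - 1) / block.y);
convolution_2d<<<grid, block>>>(d_N, d_P, width, height);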
704 01:06:31.380 --> 01:06:39.210 I'll finish this thing off on Thursday and you can read it, but your design question is setting the various block sizes. So you can... 705 01:06:39.210 --> 01:06:43.860 So, that's giving you the managerial view of it. 706 01:06:43.860 --> 01:06:48.780 And our goal is to use shared memory to reduce global memory accesses, 707 01:06:48.780 --> 01:06:55.349 by however many times each element is used. Okay. Um, both slides are doing the 708 01:06:55.349 --> 01:07:02.460 boundary conditions now, and, what's that... So I'll continue on with this one on Thursday. Um. 709 01:07:02.460 --> 01:07:07.739 So, to review what we did today: 710 01:07:07.739 --> 01:07:13.199 well, we saw the hardware side of these atomic 711 01:07:13.199 --> 01:07:19.469 operations. They do a read-modify-write, typically; it could also be a compare-and-swap 712 01:07:19.469 --> 01:07:27.480 or an atomic add. They're one machine instruction that cannot be interrupted by another thread; they run to completion 713 01:07:27.480 --> 01:07:33.840 down at the machine level. At the CUDA level they're implemented by these function calls that we saw. 714 01:07:33.840 --> 01:07:39.210 And the first example of why they're used is updating a histogram 715 01:07:39.210 --> 01:07:44.699 in parallel; this is a common operation, variants of the histogram. 716 01:07:44.699 --> 01:07:49.289 So we saw how that could be implemented in CUDA, and it requires these atomic operations. 717 01:07:49.289 --> 01:07:54.329 And then, call it a paradigm perhaps, the second paradigm 718 01:07:54.329 --> 01:07:59.460 is convolution, and we're in the middle of seeing how you do a convolution in parallel. 719 01:07:59.460 --> 01:08:03.869 And the goal... with the histogram, the problem was 720 01:08:03.869 --> 01:08:08.909 we needed these atomic updates; the issue with the convolution 721 01:08:08.909 --> 01:08:20.909 is that we wish to minimize accesses to the global memory, because they have a very large latency, and we minimize them by caching data explicitly in the small, fast shared memory. 722 01:08:20.909 --> 01:08:24.869 And we do it explicitly, and 723 01:08:24.869 --> 01:08:29.220 again, once you've done it once you'd call a library, but we are seeing how things work through it. 724 01:08:30.840 --> 01:08:35.460 And I put a homework up to play with some of this stuff too, as I mentioned, and that's 725 01:08:35.460 --> 01:08:38.609 enough new stuff for today, so 726 01:08:38.609 --> 01:08:42.090 if you have any questions, then... 727 01:08:43.500 --> 01:08:49.979 Let's see. 728 01:08:52.409 --> 01:08:55.710 And... 729 01:08:55.710 --> 01:08:59.250 Hmm, okay. 730 01:08:59.250 --> 01:09:02.670 Hello. 731 01:09:02.670 --> 01:09:06.047 Cool.
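(Since the wrap-up above mentions both atomic adds and the private-histogram reduction, here is a hedged sketch tying them together: each block accumulates into a private shared-memory histogram with cheap shared-memory atomics, then merges into the global bins with one global atomic per bin per block. NUM_BINS and all the names are mine, not the course's code.)

#define NUM_BINS 256

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int priv[NUM_BINS];           // this block's private histogram

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        priv[b] = 0;                                  // zero the private copy cooperatively
    __syncthreads();

    // Grid-stride loop; atomicAdd is the uninterruptible read-modify-write.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&priv[data[i]], 1u);
    __syncthreads();

    // Merge the private counts into the global histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], priv[b]);
}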