WEBVTT

00:04:41.098 --> 00:06:56.908
Silence.

00:06:56.908 --> 00:07:13.559
So, good afternoon, class. My universal question is: can you hear me? ... Good, thank you. Okay, so.

00:07:14.639 --> 00:07:35.788
Parallel Computing, February 11th, class 6. I'm going to see if I can clone this; I want to see what you're seeing, though it may cause things to hang up. But...

00:07:38.218 --> 00:08:08.699
Silence.

00:08:10.048 --> 00:08:43.589
Okay. This should be sharing... it was screen sharing, and then it stopped. Wow.

00:08:43.589 --> 00:08:55.649
Okay, things occasionally work. So, what's happening today? First some general stuff, then we'll get to OpenACC.

00:08:55.649 --> 00:09:33.923
I installed the NVIDIA compiler suite. If you want to browse around it, it's freely available; you could also install it on your own machine if you'd like, if your machine has an NVIDIA GPU. To make it work, what you want to do is add its bin directory onto your PATH variable and so on. I've got a little file there which, if you source it, will modify your path. These are the compilers I'd recommend using on parallel.

00:09:35.129 --> 00:10:03.958
They certainly work, but no compiler is better at everything. I did a little example last night where I compiled an OpenMP program, and it ran faster when compiled with g++ than when compiled with nvc++. So no one compiler is the best for all cases, but for anything that's going to target the GPU, use the NVIDIA compiler. And what the NVIDIA compiler is, is just the PGI compiler suite, rebadged and updated a little.

00:10:03.958 --> 00:10:11.999
We're going to do OpenACC first; well, first some general announcements, then I'll get back to that.

00:10:11.999 --> 00:10:27.808
First, the parallel machine: you're welcome to use this machine. Oh, for homework 3, a question: due Monday or Thursday? A week after it was assigned, so...

00:10:27.808 --> 00:10:52.589
Due... when was it put online? Just a sec. Well, we put it online on Monday, so it'd be due Monday, I guess. Um.

00:10:52.589 --> 00:11:06.778
Is Monday a holiday or something? But this is a small class, only 10 students, so I'm being lenient about these things. Okay, come on now.

00:11:06.778 --> 00:11:16.889
Okay, to announce: the machine parallel is available.
00:11:16.889 --> 00:11:54.328
You're welcome to use it for any legal, ethical purpose, even unrelated to this course. You want to use it for your research, or for other people in your lab if you're in a lab? That's fine by me. You want to have fun with it? Fine by me also. Just, you know: no coin mining, no mining, nothing that makes money; those would be the rules. For example, I was running the MilkyWay@home thing using BOINC for a couple of years, though I stopped; not lately. In fact, I'm user 359 in total credit, and as a percentage that's fairly small. But.

00:11:55.558 --> 00:12:00.178
Also, how often is it taken offline?

00:12:00.178 --> 00:12:49.918
For parallel, the intent is to keep parallel online all the time. However, it is a research machine, and so if something happens, I'm the one that has to fix it, and if there were a hardware failure it would be offline permanently, unless the department wanted to spend money to replace it. So that is a risk: you're using a research machine, not a machine with guaranteed permanence. On the other hand, of course, that's true if you use anything. Our supercomputer center used to have a Blue Gene, and they took the Blue Gene offline, so anyone that used the Blue Gene now has to change their code. So that's your risk. But the flip side is it's a reasonably fast, big machine. Okay, um.

00:12:55.438 --> 00:13:28.019
Oh, a new teaching tool I'm playing with at the moment in class... any questions? ... Hello? ... It's that I'm mirroring my iPad onto a window here.

00:13:29.698 --> 00:13:52.019
I want to try that and see if it works out well. Before, I had a laptop with a touch screen that did not work very well at all; it didn't have palm rejection, and it had lag and so on. That was Linux, actually, not handling new devices like touchscreens very well. So we'll see what happens with that.

00:13:52.019 --> 00:14:47.604
Oh, just for fun, real-world electrical engineering: I like gadgets, so my house has two Tesla Powerwalls. They're big batteries, their total capacity is 27 kilowatt-hours, and I've got 8 kilowatts of peak solar panels on the roof. They finally got working, like, Tuesday; I only started the project last August, ahem. In any case, it's fun to see what happens. At the moment the solar panels are generating 2.8 kilowatts of power. The goal is that, you know, over the year my net electrical consumption approaches zero, and if I don't use much power I could survive a two-day blackout. Not that there are very many blackouts here; of course it's fairly reliable. But still, it's cool.

00:14:47.879 --> 00:14:52.438
One more point.
00:14:52.438 --> 00:15:32.759
If you were looking at last year's blog for this course: I change things from time to time. Last year I used Docker for the compilers. It's complicated, not necessary, and it was a security risk, so I'm dropping it; I'm not doing it this year. But Docker is an important industrial tool, and if anyone would like me to spend a little class time on Docker, just so you could put on your resume that you're familiar with Docker, well, then mention it. Other than that, now we are back to OpenACC.

00:15:34.078 --> 00:15:57.688
And... the OpenACC site: we looked at one of these, and the Q&As there are worth looking at, by the way. So... and again, okay.

00:16:05.724 --> 00:16:22.438
Simpler. Okay. Generally motherhood stuff here: analyzing your code is the hardest part; your algorithm has to be parallelizable. Okay, again, this is something... let me actually write it down.

00:16:22.438 --> 00:17:14.189
A chance to use this... let me get... okay. So, okay. Okay, it's not mirroring. Give me a second here; it was mirroring 20 minutes ago, it's not mirroring now. ... Okay, good.

00:17:15.689 --> 00:17:20.249
Silence.

00:17:27.449 --> 00:17:50.909
Unfortunately, I cannot get away from that black boundary, so all I can do is things like this; I just overlap. Well, you'll have to speak up; if I expose the chat window, then things are... It's okay. So, OpenACC, um.

00:17:53.818 --> 00:18:34.078
It's higher level than, say, CUDA, or even OpenMP actually, so it's easier to use, but, you know, perhaps less efficient: slower execution. Okay, so those are your trade-offs here. Okay, and so we can get this up.

00:18:35.699 --> 00:19:01.588
Some overview of Docker, okay. Okay, so some Docker maybe next class or something. Truly.

00:19:02.699 --> 00:19:52.828
Oh, okay, good. Actually, just a second here... okay, I can see the chat window now. If you're curious what my setup is: I've got my main laptop that I'm running Webex on, displaying windows for the mirror of the iPad and the slides that I'm showing. And then I've got a second laptop here also running Webex; if you look you'll see I'm signed in twice, and on the second one I can see the chat window. Okay. OpenACC. So you analyze... this is the motherhood stuff. And just a reminder, the review from last time: you've got your directives, which you can ignore, so you can compile the code without OpenACC at all. Okay, so this is a review.
00:19:52.828 --> 00:20:24.898
Just a review of the reduction thing. If you're doing some operator, like plus or max, each iteration of the loop is applying it, updating this total, like here. By the way, if this is too small for you to see the slides, I'll enlarge them; something I can enlarge now, actually. There, okay, then.

00:20:24.898 --> 00:21:11.219
The reduction clause, which works for a limited set of operators, will have a separate subtotal variable for each thread. Each thread will accumulate its subtotal, and then all the threads' subtotals will be combined, so it's very efficient. Okay. Just a reminder that this was compiling serial, this was compiling to the GPU, and multicore was compiling to the multicore CPU on their particular machine. That's the review. Okay, this slide is new, and this slide deals with

00:21:12.239 --> 00:21:47.183
differences between the CPU memory and the GPU memory. The CPU memory is larger, but the GPU memory is faster, so they're complementary, and they have a bus connecting them, which may be the fastest bus on the computer, sometimes. To throw some numbers at you: on parallel, the CPU memory is 256 gigabytes and the GPU memory is 48 gigabytes, and 48 is very large for a GPU, by the way. Okay. And in any case you're transferring stuff back and forth.

00:21:50.098 --> 00:22:32.038
Now, the thing with the GPU memory is it's very fast going to the CUDA cores; the CUDA cores are the execution cores on the GPU. So the thing with GPU memory, and we'll get to spend some time on it, is that it's very fast, but it also has a very high latency. Getting one byte of data from the GPU memory into a core is going to take a hundred cycles or so, but getting each successive word of data is really fast. Okay. Now, one other thing, anticipating a little: with current versions of the GPU,

00:22:32.038 --> 00:23:06.358
first, there's a common address space for these two memories. You can address a word in either memory; you don't need a separate tag, the tag is a high-order bit of the address, I guess. And there's also a memory manager in current versions, so that blocks of data are copied back and forth automatically as needed. Although if you do it deliberately, you'll get higher performance, perhaps.

00:23:06.358 --> 00:23:31.828
To give you an example: the virtual memory manager on your CPU is pretty good, but I had a paper published with one of my Brazilian collaborators, computing visibility on some terrain, and we actually did better than the virtual memory manager on the host, because we knew what the access pattern would be for the blocks of terrain data.
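For concreteness, the reduction idea just reviewed looks roughly like this in C++ with OpenACC; the array, its size, and the + operator are illustrative rather than taken from the slide.

```cpp
// Sketch of the reduction clause: each thread keeps a private subtotal,
// and the subtotals are combined with + when the loop finishes.
#include <cstdio>

int main() {
    const int n = 1000000;
    static float a[n];
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    double total = 0.0;
    #pragma acc parallel loop reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += a[i];

    printf("total = %f\n", total);   // expect 1000000
}
```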
00:23:31.828 --> 00:24:18.058
That said, usually, almost always, let the computer do the management, just as on the CPU, where it's the paging. Okay, so this is this unified thing. There are two separate ideas here that people blend together. Unified memory is just that the two memories have a common address space: you use an address, and the system can tell at run time where it is. Managed memory takes that and moves the data back and forth as needed. Think of it as a virtual memory manager, with one memory as the backing store and the other as the actual high-speed thing.

00:24:19.618 --> 00:25:07.259
It talks about it here. So the managed part of it is the copying: in the past, when you wrote a program you had to explicitly copy the data back and forth, which was a bit of a pain; now it's handled automatically. Of course, if you copy explicitly you can do fun things like doing it asynchronously: you call the function that starts the data copying, the function returns immediately, you do something else on the CPU, and then you check a flag, and when the GPU has got the data, you do something with it. So if you do it explicitly you can do this overlapping thing, or you can let the GPU manage it, so you can concentrate on high-level stuff. Okay, here, what you see is

00:25:07.259 --> 00:26:20.249
nvc++. Just to hit you with the command line here; maybe we'll run the programs on Monday or something. A couple of ideas. First, the options: -fast says do a reasonable set of optimizations. There's a very large number of different optimization flags, and it says take a sensible subset of them. -acc says compile the OpenACC directives; if you don't give this, it will just ignore them all. The target architecture is Tesla; read NVIDIA. There's a historical reason why they call NVIDIA GPUs Tesla, I mentioned it quickly last time, you can ignore it. If you want to compile for the GPU, you call it tesla. managed says use the managed memory, so the system will page the data back and forth; I don't know what the page size is on the GPU, 1K, 4K, I don't know, but it will page that data back and forth. And -Minfo says print out debugging information; -Minfo=accel prints debugging information about the acceleration.

00:26:20.249 --> 00:26:43.618
And if anyone is unfortunate enough to use Fortran, well, my sympathies to you. I've used Fortran for very many years; I don't actually like it, I like C++ better. Okay, so this is your managed memory where the system pages; if you want to spend the time you can do it better.
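As a sketch of the flags just walked through: the build lines in the comment are an assumption about spelling (they differ between older PGI releases and the newer NVIDIA HPC SDK), and the little program simply relies on managed memory instead of explicit copy clauses.

```cpp
// Possible build lines (check nvc++ --help on your own system):
//   nvc++ -fast -acc -ta=tesla:managed -Minfo=accel saxpy.cpp -o saxpy
//   nvc++ -fast -acc -gpu=managed      -Minfo=accel saxpy.cpp -o saxpy
// Without -acc the pragma is ignored and this is ordinary serial C++.
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float *px = x.data(), *py = y.data();

    // With managed memory the runtime pages x and y between host and GPU
    // as needed; no explicit copyin/copyout clauses are written here.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        py[i] = 2.0f * px[i] + py[i];

    printf("y[0] = %f\n", py[0]);    // expect 4.0
}
```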
00:26:43.618 --> 00:27:18.509
Um, but... this is the asynchronous thing I mentioned here. It's going to take your time, though, so the trade-off is: who's worth more, you or the computer? To give you an idea, parallel, plus the graphics card and everything, is about 10,000 dollars; you could duplicate the parallel machine for less than 10,000 dollars today. So you're saving time on a 10,000-dollar computer versus how much you make. That's how much you optimize the problem. Okay.

00:27:18.509 --> 00:28:43.648
So here they are testing the unified memory. Okay, there are different terms floating around: there's unified memory, there's unified plus managed, and there are all these other things also, which I guess are sort of obsolete now. Another thing you could do in the past was to lock pages of memory on the host into real memory, so on the host they would not be moved by the host virtual memory manager; they'd be locked into real memory on the host. What this does is the device then knew where the data was on the host; it did not have to go through the host virtual memory manager, and therefore it didn't have to work with that. It's not just an efficiency thing. It's also that, if the page on the host is pinned, the GPU, any time it wanted to, could just go onto the bus and read and write to it, which was nice for the GPU. And it's more than a matter of speed; it's a matter of not having to synchronize stuff. Of course that would tie up pages on the host, but for a host like mine with a lot of pages that's not an issue. But now I think they figure, yeah, that's efficient, but you don't need it. So here it's showing you that the unified memory,

00:28:43.648 --> 00:29:09.179
in every case but this one here, and I can't even read which one it is, is within 10% of doing it by hand, and 10% is not a meaningful efficiency difference. Anything under a factor of 2 or 3 for efficiency doesn't matter. Okay, so unified memory, as I mentioned.

00:29:09.179 --> 00:29:57.118
So, basic data management. It's saying everything three times, but that's pedagogically good, to say everything three times. Okay, it's getting the data back and forth. So that bus there is fast, host to device, but device memory to device is even faster. Basic data management: we saw this thing before. If you're going to use data on the GPU, you allocate it on the GPU and you've got to keep stuff in sync, eventually. Okay. Okay, here they're compiling it without managed, and it's just showing some of the flags you've got.
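The overlap idea mentioned above can be written in OpenACC itself with async and wait; a rough sketch, with the queue number, the array size, and the stand-in CPU work all made up for illustration.

```cpp
// Start a host-to-device copy, keep working on the CPU, then wait for it.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[n];
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    // Begin the copy on async queue 1; control returns to the host right away.
    #pragma acc enter data copyin(a[0:n]) async(1)

    double cpu_work = 0.0;                  // meanwhile, do something on the CPU
    for (int i = 0; i < 100000; ++i) cpu_work += i;

    #pragma acc wait(1)                     // block until the copy has finished

    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i) a[i] *= 2.0f;

    #pragma acc exit data copyout(a[0:n])
    printf("a[0] = %f, cpu_work = %f\n", a[0], cpu_work);
}
```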
00:29:57.118 --> 00:30:10.769
The -Minfo=accel flag: you could just say -Minfo and get all of it, I think; hold on a tad. It's just talking about which loops it accelerates here, so you can see what the compiler is thinking.

00:30:11.848 --> 00:31:53.128
Now, data shaping. Okay, so what's happening here: you've got your OpenACC program, loops running on the device. If your data is a simple structure, the compiler can tell what to move; we've just been talking about it, you've got to move the data, copy the data back and forth. If things are simple, the compiler can figure this out on its own. But if things are not simple, you may want to tell the compiler what to copy and in which direction, and even if the compiler can figure it out, you might still understand your own program better than the compiler can infer. So if you tell the compiler how to copy the data, it may do better. In particular, you may realize you don't need to do some copies: you copy data to the GPU and you do not need to copy it back; you don't need it back on the host, the GPU didn't modify it. But here's the thing: unless the compiler can prove that the data is not going to get modified on the GPU, and can also prove that the host is not going to need it again, it's going to have to generate code to copy that input data back from the GPU, from the device to the host, which is possibly a wasted copy. But you can tell the compiler: no, this data goes in to the device, but it doesn't need to come out from the device. You see that sort of thing.

00:31:53.128 --> 00:32:27.088
So the compiler would generate correct code if you didn't do this, but it's going to be slow correct code. So these are these copy directives here: copy, where you tell the compiler (this is going to be one of my lines) to copy this array in at the start and out at the end of using the device; copyin, just in at the start; copyout, just out at the end; and create, which is just like a malloc on the device. So, okay, now what's going on in here

00:32:28.739 --> 00:33:08.909
is that you may have to tell it how big the array is. I'm going to see if I can actually get this a touch bigger for you, because I'm thinking... Okay, I got it a touch bigger for you, until I need the iPad again. This may help you here, and I can still see the chat window if you've got questions. Yeah, okay. I'm trying to set this; it doesn't work.

00:33:08.909 --> 00:33:14.818
Okay, now array shaping: when you do your copying, you may have to tell it the size.
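One way the copy directions just described look on a kernel, with illustrative arrays and arithmetic:

```cpp
// copyin for inputs the device only reads, copyout for the result, so no
// wasted transfers are generated. Without these clauses the compiler must be
// conservative and may move data both ways.
#include <cstdio>

int main() {
    const int n = 1 << 16;
    static float a[n], b[n], c[n];
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2.0f * i; }

    #pragma acc parallel loop copyin(a, b) copyout(c)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);   // expect 30
}
```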
00:33:14.818 --> 00:33:48.118
You have to tell it the size of the array, the length, and if it's a two-dimensional array, the sizes. That's what that is. Okay. Again, if it's some complicated data type, the compiler may not be able to easily determine the size. Okay, so that's just the copyin part. Here's an example: you might want to copy in only part of the array, right? The compiler is not going to know that you just want part of it. Okay.

00:33:48.118 --> 00:35:13.018
Here's an example: we've got this loop, and it's doing your iteration for your heat-flow problem. This iterates inside the GPU, and the second one copies stuff out and back. So we're copying in A, and we're copying Anew both ways. And in the second one we're copying Anew in and A out, because of what's inside it. So in the first one, why are we copying Anew in both directions and not just in? Well, because inside the loop here, it is both reading and writing Anew. I think what's going to happen is each iteration inside the loop here is being put on a separate thread, and because they could affect each other, that's why we've got two arrays, Anew and A. And so we say Anew goes both ways; it gets read and then it gets written, and so on. Although I'm a little uncertain why you don't just need copyout there, but okay.

00:35:13.018 --> 00:35:58.378
Okay, and we compile with and without managed. Here the system is determining these copies and generating them. So it can do that, and it turned out that when we tried to get explicit, it got slower: three times slower than serial, and a hundred times slower than what we had before. So what happened? Well, you can profile the thing.

00:35:58.378 --> 00:37:12.389
They'll show you some profiling tools later, but they're showing what's running, and I'll hit this in more detail later; you can see what overlaps and what's taking the time. A stream in CUDA is basically just a sequential sequence of calls, effectively. And again, going through this quickly, what it determines is that most of the time is spent on the data copying. Very little time is spent on the computation; most of the time is waiting for the data, which takes a surprising amount of the time. What they do here is find the data copying, the data movement: this is device to host, this is host to device. Host to device is like 35%, device to host is like 60%, and everything else is like 5%.
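A sketch of the kind of loop pair being discussed, a Jacobi-style heat-flow update; the names A and Anew follow the discussion, while the bounds, indexing, and missing driver are assumptions (it is a fragment meant to be called from your own code). With the data clauses attached to each loop like this, the arrays cross the bus every time the loops run, which is the slowdown examined next.

```cpp
// Heat-flow style update with per-loop data clauses and array shaping.
// The [0:n*m] shape tells the compiler how much of each array to move.
void step(float *A, float *Anew, int n, int m) {
    #pragma acc parallel loop copyin(A[0:n*m]) copy(Anew[0:n*m])
    for (int j = 1; j < n - 1; ++j)
        for (int i = 1; i < m - 1; ++i)
            Anew[j*m + i] = 0.25f * (A[j*m + i + 1] + A[j*m + i - 1]
                                   + A[(j+1)*m + i] + A[(j-1)*m + i]);

    #pragma acc parallel loop copyin(Anew[0:n*m]) copyout(A[0:n*m])
    for (int j = 1; j < n - 1; ++j)
        for (int i = 1; i < m - 1; ++i)
            A[j*m + i] = Anew[j*m + i];
}
```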
00:37:14.278 --> 00:38:20.820
Um, why device to host is more than host to device is a good question. So the problem here, and this is getting into subtleties of what OpenACC does, is that it's doing the complete copying separately for each iteration of the loop, which is crazy. Each iteration uses, like, four elements of A and one element of Anew, but it's copying everything each time. When you tell it to copy explicitly, because this is applying to each separate parallel thread (the detail down at the bottom), for each inner iteration it's copying everything both ways. So that's what's taking all the time. Well, it's not copying everything, but each iteration is copying, and that is crazy.

00:38:20.820 --> 00:39:08.099
Okay, optimize. And they're just talking about here that you have to be careful, because this is applying to each separate thread. I'm going through these fast, giving my take on it; I can slow down if you want. And what they're saying is what I just said: the copying is happening basically on each iteration of the loops.

00:39:10.800 --> 00:39:49.949
And they're talking about ways here to speed things up, and what's happening here is that the fix will be reducing the amount of copying. We have another, higher-level data construct here; basically we're copying Anew, and where before it was copied in and out, now we're just copying it in. So, rebuild the code; it generates some things, and what happens is,

00:39:52.320 --> 00:41:11.730
well, there's some interesting stuff here. What's happening is, this is the -Minfo flag with the compiler: it's generating information about how it's mapping the program to the NVIDIA GPU. So again, it's got threads. Well, there's a warp of 32 threads, and you group several warps together, actually, so you get a block of threads, and then you get basically a number of blocks. And what it's talking about here is how it's mapping it: it's going to take 128 iterations of the loop to be one block of threads, and what this here is, if you were writing CUDA, is how you'd be indexing that particular thread within the block. If you have more than 128 iterations, it will generate separate blocks, and this would be the index for which block; and the lot of blocks would be called a gang of blocks.
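And a sketch of the fix being described: one enclosing data region so the arrays move once rather than once per pass (the iteration count, bounds, and driver are again illustrative).

```cpp
// A is copied in once and back out once at the closing brace of the data
// region; Anew lives only on the device (create), never on the bus.
void solve(float *A, float *Anew, int n, int m, int iters) {
    #pragma acc data copy(A[0:n*m]) create(Anew[0:n*m])
    {
        for (int it = 0; it < iters; ++it) {
            #pragma acc parallel loop
            for (int j = 1; j < n - 1; ++j)
                for (int i = 1; i < m - 1; ++i)
                    Anew[j*m + i] = 0.25f * (A[j*m + i + 1] + A[j*m + i - 1]
                                           + A[(j+1)*m + i] + A[(j-1)*m + i]);

            #pragma acc parallel loop
            for (int j = 1; j < n - 1; ++j)
                for (int i = 1; i < m - 1; ++i)
                    A[j*m + i] = Anew[j*m + i];
        }
    }   // A is copied back to the host here, at the end of the data region
}
```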
00:41:11.730 --> 00:41:28.079
The threads in a block are called a vector of threads. Why it's .x is that you can actually imagine your threads and blocks to be in a three-dimensional array of threads and blocks; that's syntactic sugar, actually.

00:41:31.530 --> 00:41:54.960
Okay, so here's what happened. You tried to get explicit with the data copying in and out; the first attempt got it wrong, and the program got a hundred times slower. So now you've got it right, and what you've got is something a few percent faster than when you let the compiler do it. So what's the lesson? Let the compiler do it.

00:41:56.880 --> 00:42:49.139
Unified managed memory, okay. One point is that the code here was nice, simple code; it was going through the array in a sequential, predictable manner. If you had a random type of access... oh, what do they love in CS 1, linked lists: you've got some linked list, say you're doing pointer chasing; that would be horribly slow on the GPU. But this, simply working your way through an array, goes fast. Although actually, NVIDIA is aware that people like to use pointers and linked lists, so they are trying to make that faster in their current hardware. I don't often find pointers useful; that's just me. Okay.

00:42:52.500 --> 00:43:37.110
Other things: you can explicitly synchronize data any time you want. So, the synchronization thing. Again, you've got your many cores on the host, so you can be doing something on the CPU at the same time you're doing something on the GPU; you need lower-level code to do it, certainly, but then you might occasionally want to synchronize explicitly, not just wait for the thread to end, and that's what this does. Update self and device: self is the host. Okay.

00:43:41.789 --> 00:44:23.429
An example would be: okay, we have some loop that's going on for a while, so the braces here and here, and whatever. We want to ensure that the data on the device, because it's been sitting around on the device, gets updated back to the host. And we want to do this while we're still inside this bigger block. We could end the block and start a new block, but that's slow because it'll be copying more.

00:44:25.650 --> 00:44:46.349
Unstructured data. Okay. Now, what did we have up to now? Go back a page or two: you'd have a block (stop it), you'd have a block, and you'd copy data in at the start and copy data out at the end.
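A small sketch of that update idea: refresh the host copy in the middle of a larger data region instead of closing and reopening it; the sizes and the progress print are made up.

```cpp
// 'update self' brings the device values back to the host while the
// enclosing data region stays open (self is a synonym for host).
#include <cstdio>

int main() {
    const int n = 1000;
    static double x[n];

    #pragma acc data copy(x[0:n])
    {
        for (int step = 0; step < 10; ++step) {
            #pragma acc parallel loop present(x[0:n])
            for (int i = 0; i < n; ++i)
                x[i] += 1.0;

            #pragma acc update self(x[0:n])
            printf("step %d: x[0] = %f\n", step, x[0]);
        }
    }
}
```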
00:44:46.349 --> 00:45:18.929
And it's a syntactic hierarchy; lexical scoping, they would call it, I guess. The thing is, maybe your program has some producer-consumer relationship between different routines or something, and there's not a simple hierarchy, or it would be difficult to put your program into a simple hierarchy. So what we're talking about here are explicit allocations and deallocations.

00:45:18.929 --> 00:46:05.010
To tell you what I'm talking about: in C++, a variable can start its lifetime when you enter a block; it gets allocated, created, and then it gets destroyed when you leave the block. That's what we had before. Or you can do things like mallocs and frees, which are explicit, and put things on the heap: you explicitly create and allocate the variable whenever you want, and you explicitly free it whenever you want, when you're finished with it, which could be in another routine. There's not this inclusion hierarchy where stuff gets created at the start of a block and destroyed at the end. So we've got that with OpenACC also.

00:46:05.010 --> 00:47:04.559
The enter data directive: you say it whenever you want, and it creates the data; then exit data destroys the data. You can do it whenever you want. So they talk about that here. And they could exist in different functions. You've got some complicated producer-consumer thing; your window manager is creating some data structure and giving it to the user. If you look at how the X Window System was implemented, they had a lot of problems deciding at what point, you know, who constructs an array that's needed by someone else, and then who destroys it. It's a real mess and leads to a lot of programming errors; you'll get that here too if you do it a lot, but...

00:47:04.559 --> 00:47:52.949
Okay, I'm skipping through here. Unstructured: your simple thing, a parallel loop. Okay, you could say here, in the first fragment, we're going to copy a and b to the device and we're going to create an array c on the device. Then we run the loop, and at the end we copy c out to the host and we delete a and b from the device. So you can get explicit like that if you want. It's doing mallocs and frees on the device, essentially. Well, exactly, actually. Okay.
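A sketch of exactly that sequence with enter data and exit data; the names a, b, c follow the discussion, and the sizes are illustrative.

```cpp
// Copy a and b to the device, create c there, run the loop, then copy c out
// and delete a and b from the device.
#include <cstdio>

int main() {
    const int n = 1 << 16;
    static float a[n], b[n], c[n];
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2.0f * i; }

    #pragma acc enter data copyin(a[0:n], b[0:n]) create(c[0:n])

    #pragma acc parallel loop present(a, b, c)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    // copyout moves c back and frees it on the device; delete just frees
    // a and b on the device without copying them back.
    #pragma acc exit data copyout(c[0:n]) delete(a, b)

    printf("c[42] = %f\n", c[42]);
}
```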
00:47:54.329 --> 00:48:40.320
So the structured thing is only within a single function. Again, my best case for the unstructured version is when the structure doesn't work and you've got some producer-consumer, caller-callee concept. Windowing systems, say: you've got an event loop, you get an event handler that gets called when an event happens, like a key press, and then something gets put on a queue and given to the user or whatever. It's not simple and hierarchical. Then you use the unstructured thing; but if you don't explicitly deallocate, things start growing. So.

00:48:40.320 --> 00:49:36.750
Giving an example: they allocate in one function, called allocate, and free in another function. So if you look at what's happening up here: in the allocate-array function it's allocating something on the host with malloc, and it's allocating something on the device with the enter data create. And then the deallocate frees it on the device and then frees it on the host. And then what main does is it calls allocate-array, allocating everything on host and device; then there's a parallel loop, and this is going to run on the device; and then it deallocates everything.

00:49:37.949 --> 00:49:56.340
Now, if you tried to compile a program like this and you turned optimization on, how fast do you think it would run? Any idea, with a good optimizer?

00:50:06.989 --> 00:50:30.269
Silence.

00:50:31.289 --> 00:50:35.070
And I think it's locked up again. I'll use the chat window.

00:50:42.449 --> 00:50:54.539
Silence.

00:50:54.539 --> 00:51:19.079
Any ideas? This program here?

00:51:21.900 --> 00:52:07.559
Silence. ... Okay.

00:52:19.079 --> 00:52:25.710
Silence.

00:52:25.710 --> 00:53:04.469
You see, this is not just a silly thing; this is a point. If you're trying to do timing tests on computers, you do a program like this, and you have to be careful: the optimizer will go crazy. If you don't have print statements and so on, the optimizer will say: yeah, if I don't do any work at all, if I compile the program down to nothing, we'll get the same answer, which is nothing. You see the problem. Again, when you're doing timing tests, you've got to worry about that.

00:53:06.989 --> 00:53:17.670
Okay, next point: structs. Okay, this is an issue called deep copies.
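A sketch of the allocate-in-one-function, free-in-another pattern; the function names are invented, not the slide's, and the final print illustrates the timing-test point, since without some observable result an aggressive optimizer may legally discard the work.

```cpp
// Allocation happens on both host and device in one function, deallocation
// in another, with a parallel loop in between.
#include <cstdlib>
#include <cstdio>

float *allocate_array(int n) {
    float *a = (float *)malloc(n * sizeof(float));   // host allocation
    #pragma acc enter data create(a[0:n])            // device allocation
    return a;
}

void deallocate_array(float *a, int n) {
    #pragma acc exit data delete(a[0:n])             // free on the device
    free(a);                                         // free on the host
}

int main() {
    const int n = 1 << 20;
    float *a = allocate_array(n);

    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * i;

    // Print one element so the computation has a visible effect and cannot
    // be optimized away wholesale.
    #pragma acc update self(a[100:1])
    printf("a[100] = %f\n", a[100]);

    deallocate_array(a, n);
}
```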
00:53:17.670 --> 00:53:49.440
Here we're getting into the issue of deep copies. So you've got these hierarchical classes, in C++ in particular, where some of the members are pointers or they have variable size. This creates an issue when you're copying data of this type from host to device, or if you're serializing it, storing it somewhere. Um, so.

00:53:49.440 --> 00:54:12.119
This one is easy: this struct here, float3, is three floats, four bytes each probably. It's easy to copy that. So you say data create, no trouble; the compiler knows each element of float3 is 12 bytes, and that one is easy.

00:54:13.380 --> 00:55:10.110
The hard part is something like this. You see the data type: the vector contains a pointer to another variable, which is who knows where, probably on the heap, but you can't guarantee that. So now, what happens when you copy a variable of this class to the device? If you do the simple copy, you're copying this pointer here, okay, float star, but you're not copying the target of the pointer. And in fact, if you copy the pointer itself down, with unified addressing it's a valid pointer on the device, pointing back to the host, which is going to be really inefficient to use.

00:55:10.110 --> 00:55:46.050
Okay, so what you actually want to do, if you're copying a variable of this type to the device, is the simple top-level copy, and then, as a second step, you want to allocate space for this on the device and update the pointer. 'Deep copy' is the term here. Okay, my mirror program hung up; let me start it again so I can write that down, just a second here.

00:55:50.489 --> 00:56:00.059
Silence.

00:56:03.750 --> 00:57:35.190
Okay... just realized that. Okay. So, right, the term here is deep copy: a deep copy, basically of a class, say with pointers inside it, to the device. Okay, you see, you can't just do the superficial top-level copy. So the dynamic member is this thing here. OpenACC cannot easily do that automatically; you have to do it. So you copy the struct, and then you allocate space there and copy that; you can put it in a function, but that's a pain.
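A sketch of the manual deep copy being described: shallow-copy the struct, then allocate and copy the pointed-to buffer so the device copy's pointer gets fixed up; the struct layout is an assumption, a length plus a float pointer.

```cpp
// Manual deep copy of a struct with a dynamic member.
#include <cstdlib>
#include <cstdio>

struct Vector {
    int    n;
    float *data;     // dynamic member: the hard part of the copy
};

int main() {
    Vector v;
    v.n = 1000;
    v.data = (float *)malloc(v.n * sizeof(float));
    for (int i = 0; i < v.n; ++i) v.data[i] = 1.0f;

    // Step 1: shallow copy of the struct. Step 2: allocate and copy the
    // buffer; because v is already present, the runtime attaches it, i.e.
    // rewrites the device copy's data pointer to the device buffer.
    #pragma acc enter data copyin(v)
    #pragma acc enter data copyin(v.data[0:v.n])

    #pragma acc parallel loop present(v)
    for (int i = 0; i < v.n; ++i)
        v.data[i] *= 2.0f;

    #pragma acc exit data copyout(v.data[0:v.n])
    #pragma acc exit data delete(v)

    printf("v.data[0] = %f\n", v.data[0]);   // expect 2.0
    free(v.data);
}
```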
00:57:37.050 --> 00:58:05.010
A programming technique: one way I handle stuff like this in my own code, for variable-size arrays, is I just pick a maximum size that's something reasonable and allocate all the arrays at the maximum size. Now they're not variable any more, which makes life easier for me. It wastes some memory; the question is how much memory it's wasting.

00:58:05.010 --> 00:58:46.949
Okay, so we have this space here... C++, same thing. Well, here is a cool concept: you're writing your class, and in your constructor (you see, this is a reason these are not hierarchical) you do the enter data, and you do the exit data in the destructor. So again, the enter does an allocate on the device, and the exit effectively does a free on the device. So you can do something like this here, but... and then here, I guess, you'd also have to update pointers or something, maybe.

00:58:48.360 --> 00:59:43.380
Okay, so this is synchronization, and this is the issue with deep copying; so over here you deep copy. Okay. And this is a case where you need to do the updating, and so on. Okay. Closing remarks: we saw this unified memory, and I just mentioned it without much more detail. The second point is that you may want to tell the OpenACC system which data is going into and coming back from the device, though if you do it badly you'll make things worse. And unstructured data is like malloc and free on the device: enter and exit. So.

00:59:44.550 --> 01:00:25.949
Okay, questions? That's about week 2 here. Oh, yeah... okay, I don't see anything in the chat window. Just looking at 3 here now. ... Okay, okay, still on? No, it's that my mirroring program for the iPad keeps hanging. Okay, more stuff here. Okay.

01:00:30.630 --> 01:02:46.559
Okay, frankly, I don't find their way of demystifying this... it doesn't demystify it for me. I'm going to skip through this somewhat. Okay, what's going on here is that... I'd write this down, but my... again, just a second. ... Okay. ... Oh, okay. Okay, so we have here your... and then you might see, too, here... So we have a worker, which is like a thread; a vector, which is a block of threads; and, again, a gang, which is a set of blocks or something. Okay.
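A sketch of that constructor/destructor idea: the enter data happens when the object is created and the exit data when it is destroyed, so the device allocation follows the C++ object's lifetime; the class and member names are invented.

```cpp
// Device lifetime tied to object lifetime via constructor/destructor.
#include <cstdio>

class DeviceArray {
public:
    explicit DeviceArray(int n) : n_(n), data_(new float[n]) {
        // Put the object itself and its buffer on the device.
        #pragma acc enter data copyin(this[0:1]) create(data_[0:n_])
    }
    ~DeviceArray() {
        #pragma acc exit data delete(data_[0:n_], this[0:1])
        delete[] data_;
    }
    void fill(float v) {
        #pragma acc parallel loop present(this[0:1])
        for (int i = 0; i < n_; ++i) data_[i] = v;
    }
    float first() {
        #pragma acc update self(data_[0:1])   // sync one element back
        return data_[0];
    }
private:
    int    n_;
    float *data_;
};

int main() {
    DeviceArray a(1000);
    a.fill(3.0f);
    printf("%f\n", a.first());
}   // the destructor frees the device copy here
```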
01:02:46.559 --> 01:03:27.690
The point is that the threads in a vector cooperate much more closely than the blocks in a gang. And they're saying that up here... we're going way too far. It's sort of silly here: gangs operate independently. Well, yeah. So, the set of blocks here... again, just a second here.

01:03:33.119 --> 01:03:52.349
Silence.

01:03:52.349 --> 01:03:57.420
Well, my problem is that my mirroring program keeps hanging on me.

01:03:57.420 --> 01:05:22.409
Silence. ... Okay. ... Up again. ... Um. ... Oh, okay. So.

01:05:26.070 --> 01:06:16.679
Yeah, so the threads in a vector can cooperate more. Okay, so this is somewhat on the fly here; I don't even understand it that much. Okay. But the point is that we have different levels of cooperation here, I'm saying. A gang... they have optional shared memory, and they can synchronize, and so on. Okay.

01:06:16.679 --> 01:06:48.389
Okay, now you can profile stuff; I'll hit that more later. The executive summary of these slides is that you can profile stuff and you can see how much time is spent on copying data both ways. Yeah, okay. That's the executive summary of that. Okay.

01:06:55.440 --> 01:07:34.050
Okay, here's a new thing. If you've got nested loops, you can collapse the nested loops into one one-dimensional loop, and that can sometimes... well, one bigger loop may be optimized better. So there are lots of clauses like that. So here it's accessing a two-dimensional array, right? Collapsing it effectively turns that into accessing a 16-element one-dimensional array. All right.

01:07:35.639 --> 01:08:22.350
There's some overhead in starting and stopping each thread, and there's the number of iterations of the parallel loop. So the concept is: see, up here we've got 16 separate iterations and each one's very small. If we do some collapsing and merging, maybe there are fewer iterations and each iteration's got more work in it, so it may work better. And you tell the compiler to do that with collapse, and... wow, we got 3% faster.

01:08:24.060 --> 01:08:42.239
Another thing you can do is to say: again, you're iterating over a big two-dimensional array, and you may want to split it up into tiles and put each tile on a separate parallel thread or something. And again, it depends on the locality of reference of the data.
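A sketch of the collapse clause just described, with made-up sizes:

```cpp
// collapse(2) fuses the j and i loops into one 16-iteration loop before the
// iterations are split across gangs and vector lanes, giving the compiler a
// single larger iteration space to schedule.
#include <cstdio>

int main() {
    const int n = 4, m = 4;                 // 16 tiny iterations total
    static float a[n][m];

    #pragma acc parallel loop collapse(2)
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            a[j][i] = j * m + i;

    printf("a[3][3] = %f\n", a[3][3]);      // expect 15
}
```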
01:08:43.560 --> 01:09:11.850
It might help, and might perhaps be more efficient. And you do that with the tile clause here. In matrix multiplication or something, this is something you might almost have to do in your code: rewrite your algorithm so that it's local. But okay, you do tiling and you get... up here.

01:09:15.539 --> 01:10:07.319
And how does it work? Executive summary: there's no point to it on the CPU; it doesn't matter there, because of the Xeon. Okay, so again, on your host it takes some time to get something out of the physical DRAM, so there's the caching, but the Xeon does caching so well that you don't need to worry about it. I once wrote a program to try to determine the effect of having the working set, the amount of memory it actually used, be bigger than the size of the small high-speed cache, and I could not detect the difference, actually, because the Xeon was smarter than me in that sense; it was just that good. Yeah.

01:10:07.319 --> 01:10:38.100
Okay, on the GPU they used this tiling idea in this example here, and it got a little faster sometimes, 10% faster; if your tiles were too small, 25% slower. So: big tiles, it's a little faster, probably not worth it; another one, 13%. Okay.

01:10:42.720 --> 01:11:43.649
Now, this can be interesting: here you're telling OpenACC what you want to try to put into separate threads in the same thread block versus separate blocks or something; telling it what level of parallelism. So what they're saying here is, okay, basically use the finer levels of parallelism on the innermost loop. Vector would be like the separate threads in a thread block, and then there are separate blocks; worker is an intermediate thing that's sort of vaguely defined, and gang would be the separate blocks. So.

01:11:48.810 --> 01:12:39.180
This one says... it's like a critical loop in OpenMP, so that this gets run, in this particular loop at least, sequentially. And the implications up the line... again, when we see CUDA directly, this will be mapped to how many threads are in a thread block and how many blocks and so on. Q is going to be something like 32, or some multiple of 32; 1024 would be a typical value for Q here.

01:12:39.180 --> 01:12:49.140
Now, you might ask yourself: well, why not just have a really high value for this innermost parallelization?
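A sketch of the tile clause and the gang/vector markup being discussed; the tile sizes, array sizes, and the vector length are illustrative choices, not recommendations.

```cpp
// Two loop nests: one tiled, one with explicit gang/vector levels.
#include <cstdio>

const int n = 512, m = 512;
static float a[n][m], b[n][m];

int main() {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            a[j][i] = j + i;

    // tile(32,32): strip-mine the j/i loops into 32x32 tiles handled
    // together, which can improve locality of reference on the GPU.
    #pragma acc parallel loop tile(32, 32)
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            b[j][i] = 2.0f * a[j][i];

    // Explicit levels: outer loop over gangs (thread blocks), inner loop
    // over vector lanes (threads within a block), with a chosen length.
    #pragma acc parallel loop gang vector_length(128)
    for (int j = 0; j < n; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < m; ++i)
            b[j][i] += 1.0f;
    }

    printf("b[5][5] = %f\n", b[5][5]);
}
```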
01:12:49.140 --> 01:13:39.180
That is, why not just a lot of threads in a thread block? Why do you need the higher levels of parallelism, like multiple thread blocks? Well, the reason is that there are some limited resources available, and if you have more parallel threads in the same block, you're using up those limited resources (we'll get into more detail later), which will slow down the program. So sometimes, if you have less parallelism at the lower level, your program will actually run faster. Okay, so you can have fun here collapsing and vectoring and... it didn't help. Okay.

01:13:42.720 --> 01:14:22.229
Basically, don't worry about these fine details of optimization; I'll just give the executive summary of this slide. OpenACC doesn't cleanly map to the NVIDIA hardware, but vectors are threads in a thread block, those are the interior level, and then the gangs are the multiple blocks. Okay, what's that? Okay, that's a nice point to stop now.

01:14:23.939 --> 01:15:10.470
So what we did today is we mostly finished off OpenACC. I may hit some advanced topics on Monday, and maybe show and run some simple programs, and then we'll move on to getting more directly onto NVIDIA. If you wish to get ahead of me, you can: I just downloaded their teaching kit here, I'm going through some of their slides, and you can actually look at that yourself if you'd like to get ahead of me. Any questions?

01:15:10.470 --> 01:15:58.170
Silence. ... Good, anyone still there that's seeing this now? Wow, my solar panels... okay, they were at 3 kilowatts; generating only one and a half kilowatts now. You're still here? Okay, Joe, that's good. I never quite know; in a physical class I can look up and see, but... Okay, questions?

01:15:59.729 --> 01:16:36.989
If there are no questions, then have a good weekend. I'll do some skiing or something, and... feedback is welcome; I'll do a little blurb on Docker, maybe Monday, I don't know. Other than that, if no questions, then see you next time. Okay.

01:16:43.260 --> 01:17:47.189
Silence. ... Okay. ... Oh. ... Silence.