WEBVTT

1
00:10:41.308 --> 00:10:45.058
It's more... more settings.

2
00:10:47.129 --> 00:10:50.698
Oh, okay, cool. Thank you.

3
00:10:50.698 --> 00:10:54.989
I hate computers, so.

4
00:10:58.974 --> 00:11:12.173
Terry, you're seeing the screen and I'm seeing the chat window. So, parallel computing — and we're recording — so, parallel computing, class 7.

5
00:11:12.869 --> 00:11:22.918
What we're going to do today is some random stuff: we're going to finish off OpenACC, and then get into NVIDIA and so on.

6
00:11:22.918 --> 00:11:27.328
First, because of popular request, I put a

7
00:11:27.328 --> 00:11:33.178
new item here, top of the menu bar, and it goes to the

8
00:11:34.678 --> 00:11:40.499
Mediasite, where you see the class lectures here now.

9
00:11:40.499 --> 00:11:51.509
If they're not readable, then tell me. I try to make them readable, but then some of them revert — okay, so they revert to not readable, and I don't know why.

10
00:11:51.509 --> 00:11:54.899
Some of them I've made readable twice.

11
00:11:54.899 --> 00:12:02.009
Okay, point 1 here. Um, you might also be wondering about the

12
00:12:02.573 --> 00:12:17.364
stuff on the machine, parallel — how I show it to you. I'll show some PDFs today and so on, and run some programs. I actually run it from my local laptop here: it's a git repository, and I've got a copy on my local laptop, and I moved this stuff over.

13
00:12:17.364 --> 00:12:17.634
So.

14
00:12:19.494 --> 00:12:24.293
Well, one other cool little bit of bookkeeping — I like showing you fun programming things.

15
00:12:25.764 --> 00:12:34.823
If you have a tarball — you know, a lot of files and directories tied up — or a zip file or something, and you want to look at files inside it:

16
00:12:35.124 --> 00:12:41.844
you could extract them all into a directory, but if they're compressed then they will get a lot bigger, and this also could be hundreds or thousands of —

17
00:12:42.479 --> 00:12:57.208
hundreds or thousands of files.
It's also a lot of inodes, and if you're running it inside git, that really starts clogging up git. So there's a cool program called archivemount. What it does: it creates a virtual file system.

18
00:12:57.208 --> 00:13:05.458
And in Linux, it's a command called archivemount, in a package called archivemount.

19
00:13:05.458 --> 00:13:20.308
So this is what I use if I'm just reading some files inside some big zip file or something. The other thing is, if I leave the zip file alone, I've got more confidence in its integrity. If I've got a hierarchy of directories and

20
00:13:20.783 --> 00:13:35.274
a thousand files, who knows if a few got deleted or something. For some formats you can even write into an archive virtual file system, and when you unmount it, it will write a new zip file or a new tarball or something. In any case, that's my cool programming

21
00:13:35.274 --> 00:13:36.474
tip for today.

22
00:13:38.308 --> 00:13:44.759
So, to do, to do... okay, OpenACC.

23
00:13:44.759 --> 00:13:52.438
Finish it off. And a good book I recommend to you, actually, is

24
00:13:52.438 --> 00:13:56.188
this one: OpenACC for Programmers.

25
00:13:56.188 --> 00:14:04.889
It came out a couple of years ago, so I bought it, and I would recommend it to you if you want more information and so on.

26
00:14:04.889 --> 00:14:08.489
And I've got the link here — here I've got the —

27
00:14:08.489 --> 00:14:15.538
we can see, Amazon and so on; not that expensive.

28
00:14:16.889 --> 00:14:22.708
Also, it has a git site here, and the GitHub site

29
00:14:22.708 --> 00:14:26.759
has some code and solutions and so on, and I'll show that to you.

30
00:14:28.619 --> 00:14:38.999
So, OpenACC for Programmers: chapter 4 is available online. We've looked at something relating to that. We'll try some programs in it.

31
00:14:38.999 --> 00:14:47.639
So, we may just —

32
00:14:47.639 --> 00:14:52.168
and I'll just get through some of this quickly, a couple of things.
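For reference, the archivemount workflow mentioned in the tip above looks roughly like this in a shell. The archive and file names here are made up for illustration, and archivemount needs FUSE installed, so this sketch falls back to a plain tar listing when it isn't available:

```shell
# Build a small sample tarball to demonstrate on (names are illustrative).
mkdir -p demo_dir
echo 'hello' > demo_dir/note.txt
tar czf demo.tar.gz demo_dir

# archivemount (a FUSE file system) mounts the archive in place, with no
# extraction; for writable formats, unmounting writes a new archive back.
mkdir -p mnt
if command -v archivemount >/dev/null 2>&1; then
  archivemount demo.tar.gz mnt   # mount the archive as a directory
  ls mnt/demo_dir                # browse files without extracting
  fusermount -u mnt              # unmount
else
  # Fallback when archivemount/FUSE is not installed: just list contents.
  tar tzf demo.tar.gz
fi
```

The original archive is never modified while you read it, which is the integrity point made above.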
33
00:14:52.168 --> 00:15:02.759
This is the Laplace thing — we saw it before, somewhat. We saw basically this code before: each point is the average of its 4 neighbors.

34
00:15:02.759 --> 00:15:08.308
It's given in C. I'm just speed-reading through this.

35
00:15:08.308 --> 00:15:17.099
The serial solver has the two steps we saw last time: first step, compute the new temperatures; second step, copy them back to the old temperatures.

36
00:15:17.933 --> 00:15:28.614
Okay, if I'm going through this too fast, just put a note up on chat and so on. Now, the book uses the compilers I've been showing you —

37
00:15:29.453 --> 00:15:40.793
well, the NVIDIA compiler is just the PGI compiler updated slightly, and the NVIDIA setup has those compilers. So you can actually try something like that.

38
00:15:41.729 --> 00:15:45.719
I can even show you, in fact, if you want.

39
00:15:45.719 --> 00:15:49.769
If I go to here —

40
00:15:49.769 --> 00:15:54.778
let's see, where are we... what do I want...

41
00:16:00.359 --> 00:16:04.619
Got a different —

42
00:16:08.759 --> 00:16:17.879
Okay, so what I meant to do is make it a little bigger for you, so you can see.

43
00:16:22.589 --> 00:16:29.908
Okay, if that's too small, then let me know. Okay, so we have the, the —

44
00:16:29.908 --> 00:16:33.899
Let's see.

45
00:16:33.899 --> 00:16:37.948
And —

46
00:16:37.948 --> 00:16:42.479
a different one here.

47
00:16:48.928 --> 00:16:52.678
I'm trying to find it in here.

48
00:16:57.599 --> 00:17:01.469
Yeah.

49
00:17:06.058 --> 00:17:13.919
Okay, it's got the code examples, free, for all the chapters online. So we can just do something.

50
00:17:13.919 --> 00:17:18.689
Silence.

51
00:17:21.929 --> 00:17:25.709
And that sort of thing.

52
00:17:32.098 --> 00:17:45.358
There, right? Okay. Compiling the code — the bad version, before it was optimized. So they say at the end here it'll be 21 seconds; let's see what happens here.
53
00:17:45.358 --> 00:17:53.578
19 seconds — so my laptop is insignificantly faster than the demo computer that the book author used.

54
00:17:53.578 --> 00:17:59.519
3372 iterations. Okay.

55
00:17:59.519 --> 00:18:04.858
So, it's showing parallelization, and it's showing it by putting in things like

56
00:18:04.858 --> 00:18:10.169
this, and so on, here. Now, there's one nice thing:

57
00:18:10.169 --> 00:18:20.219
we could compile it again. What this does — this is the flag to compile with OpenACC; this is a flag to get some

58
00:18:20.219 --> 00:18:25.078
information. Here, let me show you what sort of information you might get.

59
00:18:25.078 --> 00:18:30.868
Um.

60
00:18:30.868 --> 00:18:34.648
I never would even have put this in; I just heard about it. This is —

61
00:18:34.648 --> 00:18:46.259
it's sort of crazy. Yeah, so it's showing the loops here — are they parallelizable — "Generating Tesla code", that means NVIDIA code, and what it's inferring and so on. So this is useful for you.

62
00:18:46.259 --> 00:18:58.259
Okay, there's one more thing as I'm coming through here. This is a review... this is a cool thing right here: if you do that,

63
00:18:58.259 --> 00:19:02.638
then when you run the program, it prints a pile of useful information.

64
00:19:02.638 --> 00:19:05.909
And let's try that.

65
00:19:07.558 --> 00:19:11.038
There, and let's run the program.

66
00:19:12.298 --> 00:19:16.558
Silence.

67
00:19:38.669 --> 00:19:50.098
Okay, so up here we'd have to refer back to the source program — I could probably do that — but it shows the time it takes for various

68
00:19:51.538 --> 00:19:56.278
things, and we're seeing copying takes a fair bit of time, surely.

69
00:19:56.278 --> 00:20:01.588
Okay, but it does help you with some simple profiling, which can be useful.

70
00:20:02.818 --> 00:20:06.328
It talks about it here.

71
00:20:06.328 --> 00:20:14.098
And this is a point I made last time. Oh, I can bring up the source for the program.
It's doing a lot of pointless copying. So —

72
00:20:16.709 --> 00:20:20.638
Um.

73
00:20:23.249 --> 00:20:27.509
Inside here, we're doing too much

74
00:20:27.509 --> 00:20:35.278
pointless copying inside the iterations, and that's taking too much time.

75
00:20:35.278 --> 00:20:38.308
And that's what they're talking about here.

76
00:20:39.959 --> 00:20:45.659
Okay, now, what they get to is a way to optimize it.

77
00:20:45.659 --> 00:20:52.138
And again, I'll let you read through this on your own if you're interested, since we covered it, sort of, last time.

78
00:20:52.138 --> 00:20:57.749
But in any case, there's a version here called laplace final.

79
00:20:57.749 --> 00:21:01.618
And if we look at laplace final —

80
00:21:04.409 --> 00:21:13.318
what it's doing is, at the start here, it added a new line up at the top.

81
00:21:13.318 --> 00:21:18.298
It says, basically: do less copying of the data —

82
00:21:18.298 --> 00:21:24.388
giving the executive summary of this. And if we take this thing here —

83
00:21:24.388 --> 00:21:27.719
it cut down on the data transfers.

84
00:21:31.078 --> 00:21:36.659
Copyout, or something.

85
00:21:39.269 --> 00:21:46.858
It did the whole thing in about a second — remember, the previous time was like 30 seconds.

86
00:21:46.858 --> 00:21:54.328
So, it was very much faster, and these times here that were millions of microseconds are now hundreds of thousands of microseconds.

87
00:21:54.328 --> 00:22:00.929
So this was the case: you get the first version running in parallel, and then you optimize the thing.

88
00:22:00.929 --> 00:22:08.459
And again, you speed up by a non-trivial factor here.

89
00:22:13.769 --> 00:22:21.358
And it talks about the optimization things here. Now, in this directory, we also have an

90
00:22:21.358 --> 00:22:24.749
OpenMP version, and so we could also —

91
00:22:45.868 --> 00:22:50.969
Note: the optimized OpenACC took a second;

92
00:22:50.969 --> 00:22:58.108
the unoptimized one took 30 seconds. This one is taking
93
00:22:59.548 --> 00:23:12.209
like 20 seconds. So it's faster than the unoptimized OpenACC, but much slower than the optimized one. We could also run the serial one.

94
00:23:16.048 --> 00:23:20.219
Silence.

95
00:23:35.699 --> 00:23:38.939
I could also, for fun, be running htop

96
00:23:38.939 --> 00:23:45.868
and see what's happening: four threads, each using 100% of the CPU.

97
00:23:48.179 --> 00:23:59.578
And that was faster than the non-optimized OpenACC; it's at the same speed as the OpenMP, actually. So OpenMP did not help here. We'll also just try, for fun, a

98
00:24:01.588 --> 00:24:06.058
higher optimization setting.

99
00:24:19.019 --> 00:24:23.398
So, optimizing on the serial version made a difference.

100
00:24:23.398 --> 00:24:28.588
We could also try optimizing the OpenACC, for fun.

101
00:24:49.499 --> 00:24:52.919
I don't know... okay.

102
00:24:52.919 --> 00:24:57.358
Silence.

103
00:25:01.108 --> 00:25:05.038
Let's see what happens with this one.

104
00:25:05.038 --> 00:25:09.659
See, it didn't help — same speed.

105
00:25:09.659 --> 00:25:14.009
Okay, so —

106
00:25:15.419 --> 00:25:23.489
that was this book here. Any final questions on OpenACC?

107
00:25:25.199 --> 00:25:31.348
So, what I would like to do is, okay, now transition to the next

108
00:25:31.348 --> 00:25:35.999
bulk of the course. The first block of the course was portable tools:

109
00:25:36.233 --> 00:25:47.483
OpenMP, OpenACC. Now I want to get more directly into NVIDIA — picking NVIDIA as currently it's the most common GPU out there. In five years,

110
00:25:47.483 --> 00:25:57.534
if NVIDIA gets arrogant and overconfident, they may vanish. I've seen this happen, actually, with various computer companies that went from being some of the biggest companies in the business

111
00:25:58.229 --> 00:26:06.509
to, you know, merging away. So, in any case — so, NVIDIA has a lot of stuff online.

112
00:26:06.509 --> 00:26:09.929
There's this here, where you can request membership.
113
00:26:09.929 --> 00:26:16.888
I've done that. And what we have here online, if I go back —

114
00:26:16.888 --> 00:26:23.848
we have the GPU Teaching Kit here, accelerated computing, right here.

115
00:26:23.848 --> 00:26:28.588
And what I've done with the zip file is, in fact, I —

116
00:26:28.588 --> 00:26:39.989
well, one thing I did not do is chapter 4, that I was looking at. Okay. So I used archivemount on it, and in fact, if you do a —

117
00:26:39.989 --> 00:26:51.298
down at the end here, it's a FUSE file system — file system in user space, that's what FUSE is — archivemount, an archivemount file type. Okay.

118
00:26:54.898 --> 00:26:59.489
And we're just going to look at some of the slides.

119
00:27:02.669 --> 00:27:07.739
And we're going to speed-read through the slides. Some of them are fairly basic, but —

120
00:27:11.068 --> 00:27:15.959
don't ask me what's happening there.

121
00:27:15.959 --> 00:27:20.009
Okay.

122
00:27:20.009 --> 00:27:27.028
Much bigger.

123
00:27:27.028 --> 00:27:32.398
Okay, they're something from Illinois, but they are quite recent.

124
00:27:35.548 --> 00:27:39.989
And legally they're free — I'm actually legally using them.

125
00:27:39.989 --> 00:27:44.669
So, motherhood stuff.

126
00:27:47.213 --> 00:28:02.124
What we're going to see is what CUDA is, which I've alluded to before; we're going to see more detail. It's, you might say, the assembly-level language for GPU programming, and more about parallelism, the architecture.

127
00:28:02.368 --> 00:28:06.778
Talking about memory.

128
00:28:06.778 --> 00:28:11.608
And in this context, a kernel runs on the GPU, the device.

129
00:28:11.608 --> 00:28:17.848
Performance; atomic operations — we've seen that before.

130
00:28:17.848 --> 00:28:26.249
Now, modules 8, 9 are interesting — 9, 10, and 11. With any tool that you use —

131
00:28:26.249 --> 00:28:30.239
so, with any computing tools — there are certain paradigms.
132
00:28:30.683 --> 00:28:45.443
There are ways to do things efficiently, and they may not be obvious if you just look at the tool. And this will be important stuff to learn. These are techniques for writing parallel programs —

133
00:28:45.443 --> 00:28:52.253
there are techniques which have been shown to be actually useful and allow you to be productive in writing

134
00:28:52.528 --> 00:28:55.919
parallel programs — they're patterns.

135
00:28:55.919 --> 00:29:00.538
And these are these things here. Okay. Um —

136
00:29:00.538 --> 00:29:04.108
we'll see more of that, and talk about things.

137
00:29:04.108 --> 00:29:15.479
Okay, I may not do all of this. OpenCL is a competing thing to CUDA; it's more platform-independent, but it's not as mature.

138
00:29:15.479 --> 00:29:20.038
And it talks about OpenACC and so on. Okay, that was the first slide set.

139
00:29:20.038 --> 00:29:23.939
That was fast.

140
00:29:26.189 --> 00:29:29.249
This will be fast also.

141
00:29:29.249 --> 00:29:37.229
What is going on here...

142
00:29:37.229 --> 00:29:41.548
Oh, just a second — how do you —

143
00:29:41.548 --> 00:29:50.638
Okay. Um, okay, there is an important thing here,

144
00:29:50.638 --> 00:30:05.219
and that is that there are different types of computer architecture, and there is an essential way in which the GPU design is different from the CPU design: it's latency versus throughput.

145
00:30:05.219 --> 00:30:13.288
They talk about it here. So, the different types of architecture, of course — certain types of architecture do different things efficiently. There are some that do

146
00:30:13.288 --> 00:30:17.368
just signal processing very efficiently, for example.

147
00:30:18.568 --> 00:30:28.499
The CPU cores, or what are called latency cores in this context — they're designed to have low latency, whereas the GPU cores are designed to have high throughput.

148
00:30:29.608 --> 00:30:32.638
So, the —

149
00:30:33.749 --> 00:30:38.368
they have a very large local cache,
150
00:30:38.368 --> 00:30:45.659
and that's to hide the fact that having to pull something off of memory is very slow.

151
00:30:45.659 --> 00:30:57.148
And then a few registers and a lot of control unit — so, you know, pipelining and all that stuff; the control gets very big.

152
00:30:57.148 --> 00:31:06.808
But the effect is low latency: you can effectively grab data out of memory without noticing the delay — hyperthreading and so on. The GPU

153
00:31:06.808 --> 00:31:11.368
does not hide the latency so much: grabbing

154
00:31:11.368 --> 00:31:17.189
some data may take a long time. The cache is effectively smaller here;

155
00:31:17.189 --> 00:31:23.939
there are a lot more registers — we'll get to that later. But the thing is that they have a very high throughput,

156
00:31:23.939 --> 00:31:29.999
because they'll run many threads in parallel. So

157
00:31:29.999 --> 00:31:35.759
the GPU can do a lot of processing and can process more data,

158
00:31:35.759 --> 00:31:46.019
if your algorithm is organized right. The CPU is designed to have low latency, to do random reads, and they generally are fairly good at

159
00:31:46.019 --> 00:31:56.608
doing it efficiently, but you only have a few threads on the CPU. The GPU's got many threads — thousands of threads. The latency is high to start getting some data, but once you start getting some data,

160
00:31:56.608 --> 00:32:08.608
it comes fast. So the CPU: powerful ALUs, floating point and so on, large control unit, large cache.

161
00:32:08.608 --> 00:32:13.828
If you look at the design for an

162
00:32:13.828 --> 00:32:24.269
Intel Xeon, they can do a lot of things in one cycle — double-precision floating point. On the GPU, a double-precision float in one cycle —

163
00:32:24.269 --> 00:32:32.878
it may take several cycles, in fact, depending on which GPU you're using. So the CPU is designed for low latency; the GPU is designed for

164
00:32:32.878 --> 00:32:37.919
throughput. So —
165
00:32:37.919 --> 00:32:46.108
you've got a lot of threads running — hundreds of threads. The caches are much smaller. Another big difference they cite: simple control.

166
00:32:46.108 --> 00:32:49.469
The CPU's control does a lot — it does branch prediction,

167
00:32:49.469 --> 00:32:59.818
speculative execution, all that sort of powerful stuff. That is not in the GPU; it is designed to handle straight-line code,

168
00:32:59.818 --> 00:33:04.229
running the same code on a lot of threads in parallel. So:

169
00:33:04.229 --> 00:33:08.159
memory throughput, simple control.

170
00:33:08.159 --> 00:33:15.778
So, pipelined for high throughput, but not pipelined for speculative execution and so on.

171
00:33:15.778 --> 00:33:20.669
So, there's going to be a latency to get data, even from the global memory on the —

172
00:33:20.669 --> 00:33:25.439
on the GPU — not even talking about going back to the host. It may be 100 cycles.

173
00:33:25.439 --> 00:33:36.298
There's latency here, but that 100 cycles gets amortized over — there might be a thousand threads executing in parallel. So a 100-cycle

174
00:33:36.298 --> 00:33:40.019
latency — if you keep stuff running, it's tolerable.

175
00:33:44.513 --> 00:33:56.513
And this is the point I've mentioned: well, I was figuring that a host core was 20 times faster than a device core; they're saying 10 times faster. The point is, your —

176
00:33:57.088 --> 00:34:00.209
your Xeon is fast,

177
00:34:00.209 --> 00:34:05.699
but the GPUs do a lot of things in parallel. That's the difference.

178
00:34:05.699 --> 00:34:13.259
And it's got some books here on GPU computing you're welcome to look at.

179
00:34:13.259 --> 00:34:16.469
Next slide set in a few minutes.

180
00:34:16.469 --> 00:34:20.668
This is moving. Okay.

181
00:34:28.168 --> 00:34:35.068
No, this is okay.

182
00:34:41.608 --> 00:34:49.409
Where CUDA fits into this: we're accelerating — some motherhood slides. How do you accelerate an application?
183
00:34:49.409 --> 00:34:53.969
You call libraries, you add directives to the program, or you use a

184
00:34:53.969 --> 00:35:03.958
special-purpose language. Nothing complicated there. Nothing complicated there. Okay, this starts having some content, actually.

185
00:35:03.958 --> 00:35:09.028
There, NVIDIA — sort of, they're coming and going with tools.

186
00:35:11.338 --> 00:35:22.409
Libraries: they have a lot of libraries — libraries for fast transforms, for numerics; BLAS, basically linear algebra;

187
00:35:22.409 --> 00:35:28.528
also sparse matrices, all that sort of thing. So they provide a lot of linear algebra tools for you,

188
00:35:28.528 --> 00:35:31.619
and some big things on —

189
00:35:31.619 --> 00:35:35.429
math libraries, all that sort of thing.

190
00:35:35.429 --> 00:35:38.969
Thrust is something we'll look at later. It's a —

191
00:35:38.969 --> 00:35:46.108
it's a GPU analog to the Standard Template Library, actually, with parallel constructs; it has sorts and scans and so on.

192
00:35:46.108 --> 00:35:49.708
Libraries for media stuff,

193
00:35:49.708 --> 00:36:01.199
image processing stuff. So there are a lot of accelerated libraries, and in fact, if you just have an application, you may be better off just picking up a good library and not being down at the low level.

194
00:36:01.199 --> 00:36:12.869
Just to show you — this code's a little confusing, but I'll talk about it since they have it — what Thrust is.

195
00:36:12.869 --> 00:36:17.759
So, it's library extensions to C++; there are no language extensions at all.

196
00:36:17.759 --> 00:36:21.989
Um, again, it's like — and

197
00:36:21.989 --> 00:36:27.028
you have the — it's functional programming, functional programming. There's a copy,

198
00:36:27.028 --> 00:36:34.259
and it copies some vector for you. Oh, let's go up to the top here. These are ways to construct data

199
00:36:35.608 --> 00:36:41.398
on the host or device — and of course, this is obsolete now; you would do a managed array,
200
00:36:41.398 --> 00:36:46.498
and then you don't have to worry about host and device — of course, you let the system worry about it. But this shows you.

201
00:36:47.998 --> 00:36:54.809
Let's see — a device vector. The only interesting data type here is a vector. So this will be on the device:

202
00:36:54.809 --> 00:37:00.659
it's a vector of floats, and you give — standard thing — the name and the size and so on.

203
00:37:00.659 --> 00:37:09.929
So, there are copies... copies. The interesting thing here is, this is a functional-programming thing here, and what this does

204
00:37:09.929 --> 00:37:13.559
is it takes — it takes two input

205
00:37:13.559 --> 00:37:17.849
vectors, device input 1 and device input 2,

206
00:37:17.849 --> 00:37:23.128
and an output vector, device output, and it adds them element by element.

207
00:37:23.128 --> 00:37:30.958
The last — the last argument to the transform is a plus.

208
00:37:30.958 --> 00:37:34.829
You know, this is

209
00:37:34.829 --> 00:37:38.699
templates and so on, so it's a plus on floats.

210
00:37:38.699 --> 00:37:44.068
And so what transform does is, it applies this function

211
00:37:44.068 --> 00:37:48.329
to, um —

212
00:37:48.329 --> 00:38:02.458
to these two input vectors, and you can just imagine how that could be compiled for parallelism. Now, what Thrust does is, when you compile it, you give it directives that say what you want the target architecture to be —

213
00:38:02.458 --> 00:38:08.188
the host or the device, and so on — and to a very large extent,

214
00:38:08.188 --> 00:38:12.630
the source code does not have to change at all. That's not completely true, but it's

215
00:38:12.630 --> 00:38:17.309
to a large extent true. We'll get to it later.

216
00:38:17.309 --> 00:38:23.130
This here is not a function call. It is a, um —

217
00:38:23.130 --> 00:38:26.219
plus-of-float is a class;

218
00:38:26.219 --> 00:38:30.929
it's a template, and the class —
219
00:38:31.949 --> 00:38:37.710
And what the parens here — they're constructing it; it's calling the default constructor

220
00:38:37.710 --> 00:38:46.440
on this class, plus of float, and the class overloads — not so coincidentally overloads —

221
00:38:46.440 --> 00:38:50.489
the paren operator. And so what this returns is

222
00:38:50.489 --> 00:38:58.199
an object whose operator does addition, but in an indirect way: by creating a default-constructed object which happens to have an overloaded

223
00:38:58.199 --> 00:39:03.719
paren. Confusing. And motherhood stuff here.

224
00:39:03.719 --> 00:39:11.460
OpenACC — you've seen that before. What's happening here is, we're being explicit about what gets copied in and out

225
00:39:11.460 --> 00:39:17.489
of the kernel, and we give the name of the array,

226
00:39:17.489 --> 00:39:23.849
and which part of the array to copy in and out; we're assuming that the compiler maybe cannot figure it out.

227
00:39:25.500 --> 00:39:31.590
So the previous slide showed using OpenACC; this is not using that stuff, just using simple C++.

228
00:39:33.119 --> 00:39:36.929
Nothing new here.

229
00:39:36.929 --> 00:39:44.909
Nothing new here — parallel stuff. Obviously, all of your major packages can use parallel computing:

230
00:39:44.909 --> 00:39:51.389
CUDA with Python, yes, and so on. Nothing interesting there.

231
00:39:51.389 --> 00:39:56.159
That was our third slide set of the day.

232
00:39:57.480 --> 00:40:01.289
Fourth.

233
00:40:04.500 --> 00:40:08.010
Okay, um —

234
00:40:08.010 --> 00:40:20.909
what this is showing is data parallelism — how the GPUs tend to operate. You've got two vectors of data, you want to add them element by element. So each addition is done by a separate thread,

235
00:40:20.909 --> 00:40:28.110
ideally. So these are very lightweight threads. All the thread is doing is adding two floats and producing another float.
236
00:40:28.110 --> 00:40:35.550
And the only reason this can possibly be efficient is that the overhead to start and stop a thread is negligible —

237
00:40:35.550 --> 00:40:39.300
and because we're starting maybe a thousand of them or something.

238
00:40:39.300 --> 00:40:45.030
So that's implicit in this diagram here: the threads are very lightweight.

239
00:40:45.030 --> 00:40:48.239
Um —

240
00:40:50.309 --> 00:40:58.079
okay, what would be happening here?

241
00:40:59.190 --> 00:41:08.969
Here we're transitioning to see how it would be done in CUDA. We've got a main program, which adds two vectors and produces a third vector, and n is the number of

242
00:41:08.969 --> 00:41:16.860
words. This is the function — again, I use "function" and "routine" synonymously. In the function, or routine,

243
00:41:16.860 --> 00:41:20.250
we just have a loop which adds things element by element.

244
00:41:20.250 --> 00:41:28.199
There's a convention: h underscore means the data is on the host — that's the Intel. d underscore means it's on the device.

245
00:41:31.530 --> 00:41:43.619
Okay, so this is getting to the next level of detail here, about how we would do this addition thing, starting to use the GPU — that's the device.

246
00:41:43.619 --> 00:41:50.039
A comment that indicates this. So:

247
00:41:50.039 --> 00:41:58.469
we allocate memory. Okay, the data is on our host; we have to allocate memory on the device,

248
00:41:58.469 --> 00:42:06.630
and then we have to copy the data from the host to the device. Again, with managed memory these things are automatic, but

249
00:42:06.630 --> 00:42:13.949
before managed memory, you allocate a vector on the host, you allocate it on the device, and you copy data back and forth.

250
00:42:13.949 --> 00:42:21.570
And then we launch the kernel. Now, terminology: the kernel is a parallel program running on the GPU.

251
00:42:21.570 --> 00:42:26.460
So, we launch the kernel — we launch a parallel program on the GPU that does the work.
252
00:42:26.460 --> 00:42:29.940
And then finally, we copy the data back

253
00:42:29.940 --> 00:42:37.949
to the host, and if we care, we free the device vectors. I would never care — at the end of the program they're going to be freed anyway.

254
00:42:37.949 --> 00:42:51.329
This is another step: we may have to wait for the kernel to finish, because it's asynchronous. The CPU can start doing something — the CPU can do something else while it's waiting for the GPU. It doesn't have to wait; it does something else and checks if it's finished.

255
00:42:52.949 --> 00:42:59.820
Okay, um, so there was some new stuff there. There is a lot of new stuff on this

256
00:42:59.820 --> 00:43:09.840
simple-looking slide. What do we have here? This is a high-level architecture description of how the GPU works.

257
00:43:09.840 --> 00:43:18.420
I'll do the simple thing first: you've got global memory. Okay.

258
00:43:18.420 --> 00:43:22.139
On parallel, the —

259
00:43:22.139 --> 00:43:30.420
the GPU I could buy at the time, which was about a year ago — it has like 48 gigabytes of global memory, if I recall.

260
00:43:30.420 --> 00:43:33.900
My laptop's got 12 gigabytes.

261
00:43:33.900 --> 00:43:37.739
My laptop, I thought: hey, it's practically as fast as parallel.

262
00:43:38.909 --> 00:43:41.940
It's also practically as expensive — so, that matches.

263
00:43:41.940 --> 00:43:50.400
Okay, you've got some global memory. Global memory: it's big by GPU terms, it's fast, but it has latency.

264
00:43:51.599 --> 00:44:00.539
Now, inside it, you've got threads — you've got what are called CUDA cores, and you may have a few thousand of them, like 7000.

265
00:44:00.539 --> 00:44:08.400
And they're running threads. A thread is, like, a unit of execution: it's data and a program counter and so on.

266
00:44:08.400 --> 00:44:13.469
And we just index the threads 0, 1, and so on.

267
00:44:13.469 --> 00:44:19.920
And again, threads have some private registers — only 255 per thread.
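A minimal sketch of the whole host-side flow described above — allocate on host and device, copy in, launch the kernel, wait, copy back, free. This is my sketch, assuming the standard CUDA runtime API; the `h_`/`d_` names follow the slide's convention, and the block size of 256 is an arbitrary illustrative choice:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// The kernel: each thread does one addition.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's index
  if (i < n) c[i] = a[i] + b[i];                  // bounds guard
}

int main() {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);

  // h_ = host (the Intel side); d_ = device (the GPU).
  float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
        *h_c = (float *)malloc(bytes);
  for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

  float *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

  // Launch: enough 256-thread blocks to cover n elements.
  vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
  cudaDeviceSynchronize();   // launches are asynchronous; wait for it

  cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
  printf("h_c[0] = %f\n", h_c[0]);

  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);  // optional at program end
  free(h_a); free(h_b); free(h_c);
  return 0;
}
```

With managed (unified) memory, the explicit `cudaMalloc`/`cudaMemcpy` pairs collapse into `cudaMallocManaged` calls, which is the "obsolete now" point made in the lecture.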
268
00:44:19.920 --> 00:44:33.599
And the threads — this is not in this slide, but the threads are in warps, and the 32 threads in a warp run synchronously: they're executing the same instruction, or they're idle.

269
00:44:33.599 --> 00:44:37.440
If they're executing the same instruction,

270
00:44:37.440 --> 00:44:43.079
they can be on different data — but ideally, the data is consecutive:

271
00:44:43.079 --> 00:44:51.809
thread 1's data would be right after thread 0's data, and so on. So you've got these threads — there could be thousands of them —

272
00:44:51.809 --> 00:44:55.260
and each thread has a small bank of registers.

273
00:44:55.260 --> 00:45:00.539
The thread can also get at the global memory. Now, the threads are grouped into blocks.

274
00:45:01.889 --> 00:45:07.199
So, a block can have up to a thousand threads — 32 warps of 32 threads, so 1024.

275
00:45:07.199 --> 00:45:11.309
It doesn't have to have a thousand — it can be up to a thousand.

276
00:45:11.309 --> 00:45:15.480
So that's the yellow thing here; there may be a thousand threads in the block.

277
00:45:17.099 --> 00:45:30.630
They're running the same instructions, but the different warps don't have to be running at the same time. The threads in a warp run at the same time; the warps in the block actually could be scheduled differently. They're doing the same instructions,

278
00:45:30.630 --> 00:45:33.869
but maybe not at the same time.

279
00:45:33.869 --> 00:45:44.219
And inside a block, there's also some shared memory that's private to the block — that's available to all the threads of the block.

280
00:45:45.204 --> 00:45:59.215
So, the warps in the block — they're running the same instructions, but maybe not at the same time. What's going on is, there's a queue of warps waiting to run, and when resources are available, then the —

281
00:46:00.000 --> 00:46:05.099
imagine a mini operating system — it pulls the next warp off the queue in the block and runs it.
282
00:46:05.099 --> 00:46:12.719
So, you've got a block: it could have up to 32 warps, and each warp is 32 threads. You can have many blocks — you might have hundreds of blocks in the program.

283
00:46:12.719 --> 00:46:18.119
And the blocks, again, they're running the same program, and they're running the same instructions —

284
00:46:18.119 --> 00:46:23.760
but the different blocks do not communicate with each other at all. So there's a queue of blocks waiting to run,

285
00:46:23.760 --> 00:46:27.269
and the only

286
00:46:27.269 --> 00:46:33.599
shared data is the global memory. So they're running at different times; there are no fairness guarantees on the different blocks.

287
00:46:33.599 --> 00:46:36.989
And so they're basically off on their own. You can sync —

288
00:46:36.989 --> 00:46:44.309
they're running the same instructions, but at different times, and you can do synchronization; you can force things to wait, and so on.

289
00:46:44.309 --> 00:46:53.639
It's not so bad to force the warps in a block to wait until they're all completed. Forcing all the blocks to wait is probably a bad idea.

290
00:46:53.639 --> 00:47:04.289
Okay, so you've got 32 threads in a warp, up to 32 warps in a block, up to a thousand threads in a block, and you can have

291
00:47:04.289 --> 00:47:17.039
effectively unlimited blocks — hundreds or thousands. See, everything here is really lightweight. It's unlike a higher-level operating system, where starting a process takes time and so on; everything here

292
00:47:17.039 --> 00:47:20.070
is really cheap and simple to start up.

293
00:47:20.070 --> 00:47:30.780
That's the point of it. Okay, so all the blocks then form a grid, and the grid is also called the kernel. So the grid is a parallel program running

294
00:47:30.780 --> 00:47:35.579
on the — your —

295
00:47:35.579 --> 00:47:40.110
let's see, am I still aimed here? Good. Your

296
00:47:40.110 --> 00:47:44.219
GPU. So it has the grid — the kernel, the parallel program.
297 00:47:44.219 --> 00:47:51.809 The GPU is the device, and there could be a number of kernels on the GPU. A couple of them could be running at the same time. I don't know how many. 298 00:47:51.809 --> 00:47:55.920 And again, there's a queue of stuff waiting to run. 299 00:47:57.030 --> 00:48:00.929 Okay, so this is a very substantive slide about 300 00:48:02.130 --> 00:48:13.500 how the GPU works inside, and it's designed like this in order to get high performance when you're doing a lot of data parallelism. So. 301 00:48:13.500 --> 00:48:16.650 And they talk about here, you've got, 302 00:48:16.650 --> 00:48:23.309 there are registers per thread, there is global memory, and in between here there's 2 other things, shared memory and 303 00:48:23.309 --> 00:48:31.500 local memory. It's like 5 levels of memory or something, and I'm not even thinking of things like read-only memory and so on. 304 00:48:32.635 --> 00:48:45.684 Basically, the smaller memories are more local and faster. There's also all these resource constraints on the system. I said a thread could have up to 255 registers. It doesn't have to have that many, it could have fewer. 305 00:48:47.460 --> 00:48:51.869 But the thing is, all the registers in a block, the, 306 00:48:51.869 --> 00:48:56.099 there is a pool of registers for the block. It's 307 00:48:56.099 --> 00:49:09.420 65000 or something, and all the threads in the block are getting their registers from that 1 pool. So if each thread wants 255 registers, you're not going to be running a 1000 308 00:49:09.420 --> 00:49:16.980 threads at once, there's not enough registers. It's going to be running 65000 divided by 255 threads at a time. 309 00:49:16.980 --> 00:49:22.650 So, sometimes if a thread uses fewer registers, 310 00:49:22.650 --> 00:49:26.940 you might get higher throughput, because it means you can run more threads at once. 311 00:49:26.940 --> 00:49:33.599 Okay, it starts talking about here the malloc and free.
312 00:49:33.599 --> 00:49:39.869 They go into the global memory. Again, with managed unified memory you don't need to do that, but, 313 00:49:41.190 --> 00:49:47.039 like malloc and free. Free, I don't see the point of free: the program ends, stuff gets freed. 314 00:49:47.039 --> 00:49:50.760 So, unless you're mallocing and freeing repeatedly. 315 00:49:50.760 --> 00:49:56.760 And I do not know how the global memory does garbage collection and stuff like that. I suspect it might not. 316 00:49:56.760 --> 00:50:03.000 And it certainly does not compact. So I'd say you don't want to get overenthusiastic with mallocs and frees. 317 00:50:03.000 --> 00:50:08.369 Copying stuff back and forth, again, that's obsolete now. 318 00:50:08.369 --> 00:50:20.429 Oh, 1 cool thing: it's asynchronous. You fire off a copy, the memcpy immediately returns to you while the copy is still going on. If you're copying a few gigabytes, this might take some time. 319 00:50:21.480 --> 00:50:26.519 So, if you're worried about that, you can check that it completed, but 320 00:50:26.519 --> 00:50:29.519 it can return to you and you do something else. 321 00:50:29.519 --> 00:50:33.659 And let me check. Okay. 322 00:50:33.659 --> 00:50:38.250 A program, again, nothing new here. 323 00:50:38.250 --> 00:50:42.480 Mallocs and mem copies back and forth. 324 00:50:42.480 --> 00:50:49.920 Or use managed memory. Okay, good idea here: 325 00:50:49.920 --> 00:51:04.650 error checking. I know no 1 ever does it in practice. Commercial software does not do it in practice, complain, complain, but you're not writing commercial software. You're writing good quality code. So I encourage you to check for errors. 326 00:51:04.650 --> 00:51:09.630 These things return error codes sometimes; you can check. 327 00:51:09.630 --> 00:51:16.230 And they return some sort of number, 328 00:51:16.230 --> 00:51:20.639 and you can call this, it converts from the number to a string, even. 329 00:51:20.639 --> 00:51:30.449 And look at these cool things here.
These are macros in C++, they're in the standard. This first one returns the name of the file 330 00:51:31.530 --> 00:51:35.639 that the source code was in, and this returns the line number 331 00:51:35.639 --> 00:51:44.820 that this line was on. That's very useful. I like this stuff, very nice. The only problem with the line number is that, 332 00:51:44.820 --> 00:51:52.409 if you have a macro, then this is the line number of the thing in the macro, not who called the macro. 333 00:51:54.150 --> 00:51:57.300 Questions. 334 00:52:08.039 --> 00:52:11.550 Can you have 3. 335 00:52:11.550 --> 00:52:16.380 Okay. 336 00:52:17.400 --> 00:52:23.159 Threads are hierarchical. I mentioned that before: threads are grouped into warps, into blocks, and into grids. 337 00:52:23.159 --> 00:52:36.360 And threads have ID numbers. So you're firing off a 1000 threads in a thread block, also called a block. Each thread knows which thread it is. 338 00:52:36.360 --> 00:52:40.409 Like OpenMP and so on, you can tell. 339 00:52:40.409 --> 00:52:46.559 You can get the number here. All this information is available to the user, which thread you are. 340 00:52:46.559 --> 00:52:54.030 Okay, here we're adding 2 vectors, and each 341 00:52:54.030 --> 00:53:02.280 pair of elements will be a separate thread, might be a 1000 threads. And each thread is very lightweight. 342 00:53:02.280 --> 00:53:17.190 This is a design style that they use in CUDA. So this is 343 00:53:17.190 --> 00:53:21.269 an execution of a thread or a process or something. 344 00:53:21.269 --> 00:53:25.440 So, you're running something on the host, and we're assuming single threaded on the host. 345 00:53:25.440 --> 00:53:33.360 Keep life easy. Then we fire off a parallel kernel on the device, a parallel kernel, a parallel program on the device. 346 00:53:33.360 --> 00:53:42.929 And it may have many separate blocks. A block and a thread block are the same thing, thread block is just more explicit.
So they're running many blocks. And each block has many threads. 347 00:53:42.929 --> 00:53:49.949 Then you got a serial component again, and then you've got a parallel component again, and this is how your programs work, somewhat. 348 00:53:51.000 --> 00:53:56.639 Now, while you got the serial parts here, you could be running another parallel program. Of course, you could overlap stuff. 349 00:53:56.639 --> 00:54:02.159 If you want to get ahead of me on that, you'd look into CUDA streams. 350 00:54:02.605 --> 00:54:15.114 Now, this would be 1 CUDA stream. These things are serialized: you do serial, then parallel, then third is another serial block, and fourth another parallel block. This whole thing is called 1 CUDA stream. 351 00:54:15.414 --> 00:54:17.815 It could be in parallel with another CUDA stream, 352 00:54:18.119 --> 00:54:23.849 which would do a serial thing, then do a parallel thing, et cetera, et cetera. Okay. 353 00:54:23.849 --> 00:54:27.000 In any case, new terminology here. 354 00:54:28.019 --> 00:54:31.860 We're going to get to this, so, the term. 355 00:54:31.860 --> 00:54:35.130 There's a syntax extension to 356 00:54:35.130 --> 00:54:39.869 C++. This here 357 00:54:39.869 --> 00:54:43.440 will give the name of the routine we're calling on the device, 358 00:54:43.440 --> 00:54:47.699 with a syntax extension, triple angle brackets, 359 00:54:47.699 --> 00:54:53.699 and we'll tell it how many threads per block and how many blocks, and we give it some arguments to pass in. 360 00:54:56.610 --> 00:55:02.309 Hierarchy here. Nothing interesting. 361 00:55:04.469 --> 00:55:08.730 Silence. 362 00:55:10.139 --> 00:55:13.230 Nothing interesting here, a program: 363 00:55:13.230 --> 00:55:16.530 instructions, data it reads and writes, it executes that. 364 00:55:16.530 --> 00:55:20.670 Yeah, I think everyone sees this here. 365 00:55:20.670 --> 00:55:25.860 Von Neumann style here, CUDA does that.
366 00:55:25.860 --> 00:55:38.400 Program counter pointing to the next instruction to execute, instructions are a series, the current instruction, local data, registers, the inputs and outputs. 367 00:55:38.400 --> 00:55:42.000 A real machine has lots of each of these things and so on. 368 00:55:43.050 --> 00:55:50.369 A CUDA kernel, again, is a grid of threads, an array of threads. 369 00:55:50.369 --> 00:55:54.329 It's single program, multiple data, or whatever. 370 00:55:57.239 --> 00:56:02.369 Okay, what's happening here? This 371 00:56:02.369 --> 00:56:08.940 is what the syntax would look like. Each thread does this for a different value of i. Okay, 372 00:56:08.940 --> 00:56:14.429 a[i] plus b[i], and in parallel, for maybe a 1000 threads. 373 00:56:14.429 --> 00:56:17.639 Well, how does the thread compute i? 374 00:56:17.639 --> 00:56:23.010 It would use an instruction like up here. Now, what's happening here 375 00:56:23.010 --> 00:56:32.579 is threadIdx.x. Ignore the .x for the moment. threadIdx is the index of the thread in the block. 376 00:56:33.750 --> 00:56:37.590 And blockDim is the 377 00:56:37.590 --> 00:56:40.980 number of threads in a block. 378 00:56:40.980 --> 00:56:46.829 And blockIdx is the index of the block. Each block might have a 1000 threads, maybe. 379 00:56:46.829 --> 00:56:53.460 So, this line here, it computes a unique i for each thread. So the index of the thread in the block 380 00:56:53.460 --> 00:56:57.480 plus the index of the block times the number of threads per block. 381 00:56:57.480 --> 00:57:02.639 And each thread gets a unique element, a subscript i, and does the addition. 382 00:57:02.639 --> 00:57:10.920 So this is showing the threads are doing the same instruction, but doing the same instruction on different data, because each thread has a different thread index, and 383 00:57:10.920 --> 00:57:14.670 different blocks have different block indices.
384 00:57:17.190 --> 00:57:25.980 Okay, so this is the hierarchy I told you about, where threads are in blocks. 385 00:57:26.485 --> 00:57:41.125 And then you got multiple blocks. We're ignoring warps here. So here they're showing a thread block, the block has 256 threads. I said it could have up to a 1000, but also that it doesn't have to have a 1000, it could have fewer. So here the blocks are 256 threads each, and we're seeing 3 blocks. 386 00:57:44.400 --> 00:57:48.389 And some point about here: inside a block 387 00:57:48.389 --> 00:57:52.349 we've got shared memory. 388 00:57:52.349 --> 00:57:57.690 It's small, it's like 48 K or something, 389 00:57:57.690 --> 00:58:02.519 shared by all the threads in a block, but it's very fast memory. 390 00:58:03.840 --> 00:58:11.730 And there's atomic operations, so if the threads are accessing the shared memory, 391 00:58:11.730 --> 00:58:17.039 doing an increment, they can do it as an atomic operation, so 392 00:58:17.039 --> 00:58:23.369 that's done correctly. You can synchronize all the threads in the block if you have to. 393 00:58:25.050 --> 00:58:28.800 And the different blocks are independent. 394 00:58:28.800 --> 00:58:33.360 The only way they interact is reading and writing global memory, which would be 395 00:58:33.360 --> 00:58:39.840 very slow and probably very stupid. Okay. 396 00:58:39.840 --> 00:58:46.559 So, 2 levels: threads in the block, and multiple blocks. And this is how each thread knows which 397 00:58:46.559 --> 00:58:51.179 element to access. 398 00:58:51.179 --> 00:58:55.889 Now, why you might not want a 1000 threads in your block 399 00:58:55.889 --> 00:58:59.190 is that the shared memory, for example, and the registers 400 00:58:59.190 --> 00:59:02.820 are shared among fewer threads. Each thread gets more. 401 00:59:02.820 --> 00:59:07.710 Block and thread index, so: 402 00:59:07.710 --> 00:59:11.400 the threads in the block are indexed. 403 00:59:11.400 --> 00:59:19.110 It could be up to 3D.
This is, I think, syntactic sugar for when you're accessing an image or something. 404 00:59:20.670 --> 00:59:24.090 I don't know what hardware support you have for this, really. 405 00:59:25.739 --> 00:59:28.889 I just think of them as 1D. In any case, you got this 406 00:59:28.889 --> 00:59:32.579 block of threads, and then the grid has 407 00:59:32.579 --> 00:59:35.610 arrays of blocks, and they're indexed, so. 408 00:59:35.610 --> 00:59:40.320 Again, so this multi-dimensional index is only for, 409 00:59:40.320 --> 00:59:43.469 it's syntactic sugar for multi-dimensional data. 410 00:59:43.469 --> 00:59:47.219 In C++, what I do is I write little 411 00:59:47.219 --> 00:59:54.420 classes, and I've got little conversion routines, implicit conversion routines, that will convert back and forth 412 00:59:54.420 --> 00:59:59.639 between 1D and 3D. That's my personal programming style for this sort of stuff. 413 00:59:59.639 --> 01:00:02.940 So. 414 01:00:06.179 --> 01:00:12.630 We're going through this fast. 415 01:00:12.630 --> 01:00:17.909 1 more and there'll be time to leave: Introduction to CUDA. 416 01:00:17.909 --> 01:00:24.840 This is a long 1, so I'll start it, then I'll restart it on Monday. 417 01:00:26.039 --> 01:00:29.820 Okay. 418 01:00:36.719 --> 01:00:40.380 Yeah, so this will show basic, 419 01:00:41.610 --> 01:00:45.510 I mean, I've got something bigger here. I'm going ahead here. 420 01:00:45.510 --> 01:00:49.139 Silence. 421 01:00:50.400 --> 01:00:53.400 Well, I'll show you more detail on Monday. 422 01:00:53.400 --> 01:00:57.030 What we have up here is a really basic 423 01:00:57.030 --> 01:01:03.840 CUDA program. This thing runs on the GPU. It doesn't do anything. 424 01:01:03.840 --> 01:01:08.880 This thing runs on the CPU, it calls the thing running on the GPU here. 425 01:01:08.880 --> 01:01:18.539 We'll do this next time, so a reasonable point to stop on is
426 01:01:21.989 --> 01:01:31.380 here. And I'll put a note about how far we got, where we finished off OpenACC and we're getting into CUDA now. 427 01:01:31.380 --> 01:01:35.610 And I have a homework thing, which is to play with that, 428 01:01:35.610 --> 01:01:46.650 the sample programs I just showed you. Try them and report your experience. So you can put on your resume that you programmed OpenMP plus OpenACC. 429 01:01:48.360 --> 01:01:52.019 Any questions now. 430 01:01:52.019 --> 01:01:55.469 Silence. 431 01:02:00.900 --> 01:02:06.210 Time to wake up, anything to. 432 01:02:07.980 --> 01:02:12.090 Okay. 433 01:02:12.090 --> 01:02:19.110 Silence. 434 01:02:23.820 --> 01:02:29.130 CUDA, by the way, is an acronym: Compute Unified Device Architecture. 435 01:02:32.519 --> 01:02:36.269 Well, if there is, um. 436 01:02:39.840 --> 01:02:47.969 How basic are the operations? What do you mean by an optimal operation? 437 01:02:49.590 --> 01:02:58.795 How do if statements perform? And I mentioned something very briefly on Docker. 438 01:02:59.065 --> 01:03:06.144 I realized I'd uninstalled Docker off of parallel when I stopped using it, when I upgraded to the latest version. 439 01:03:06.960 --> 01:03:14.610 So, what I'll have to do, well, I can tell you about it, I may just do that. If I was going to run anything, I'd have to reinstall it. 440 01:03:14.610 --> 01:03:21.360 How do if statements perform? It's that the then 441 01:03:21.360 --> 01:03:27.809 block gets run while the threads that would do the else block are idle, and then it reverses. 442 01:03:27.809 --> 01:03:33.630 What other types of operations do they do well? 443 01:03:33.630 --> 01:03:36.750 Linear algebra. 444 01:03:36.750 --> 01:03:45.030 Floating point, they do double precision and float. Well, on some versions they do floats faster than ints, I think. 445 01:03:45.030 --> 01:03:50.519 It depends, because it keeps changing the mix for the different generations.
446 01:03:50.519 --> 01:03:59.940 What do they do, I'll tell you what they do badly: pointer chasing, anything that's dynamic. Pointer chasing is very slow. 447 01:03:59.940 --> 01:04:09.000 Recursion, I think, is slow. So pointer chasing is a bad idea, recursion is a bad idea, trees are a bad idea. 448 01:04:09.000 --> 01:04:14.940 Um, stuff like, you know, lots and lots of, 449 01:04:14.940 --> 01:04:21.329 anything weird. Exceptions would be a bad idea. 450 01:04:21.329 --> 01:04:25.679 Throw and catch would be a bad idea, anything complicated like that. 451 01:04:25.679 --> 01:04:29.670 Um, it would be a bad idea. Simple straight-line stuff, 452 01:04:29.670 --> 01:04:36.929 float operations and so on, I think, work. 453 01:04:38.130 --> 01:04:45.570 Floats may work slower, because that's done in a separate unit on the GPU, and there may be fewer floating point 454 01:04:45.570 --> 01:04:50.489 units than simple CUDA cores. So floats may take several cycles, actually. 455 01:04:50.489 --> 01:04:55.559 Doubles, it depends how many double units there are. 456 01:04:55.559 --> 01:05:01.199 I said with the if-else it gets serialized, so 457 01:05:02.280 --> 01:05:11.400 the threads for which it was true execute it, and then after that the threads for which the condition is false execute, 1 after the other. 458 01:05:11.400 --> 01:05:14.699 That's an idea of what works and what doesn't work. 459 01:05:16.739 --> 01:05:25.019 There are actually techniques for turning apparently conditional code into straight-line code by using 460 01:05:25.019 --> 01:05:35.400 bit masks and stuff like that. I might even show that. Actually, these techniques go back decades in computer graphics, where conditionals were slow even on sequential 461 01:05:35.400 --> 01:05:44.280 processors, but they're useful again. Other questions.
462 01:05:46.530 --> 01:05:55.199 But following up on your thing, Isaac, again, it's another reason why I say, if you want to make your application parallel, 463 01:05:55.199 --> 01:06:00.480 your 1st version of this is probably just on the Intel, 464 01:06:00.480 --> 01:06:04.559 with the multi core. 465 01:06:04.559 --> 01:06:10.199 Threads on the multi core can do different things. So don't jump to the GPU initially. 466 01:06:10.199 --> 01:06:14.789 Silence. 467 01:06:14.789 --> 01:06:19.530 Other questions. 468 01:06:19.530 --> 01:06:28.349 Okay, have a good weekend, go skiing or something. Hope you're not in Texas, unfortunately, and. 469 01:06:37.289 --> 01:06:42.269 Well, the other professors may know it better than me. So, listen to them. 470 01:06:42.269 --> 01:06:45.989 Um, seriously. 471 01:06:45.989 --> 01:06:52.019 If you're going to do something virtual, it's a question of what you virtualize, 472 01:06:52.019 --> 01:06:55.980 and at what level. Like, at the really low level, you could just 473 01:06:55.980 --> 01:07:01.050 emulate the hardware, and that's very general, but incredibly slow. 474 01:07:01.050 --> 01:07:09.659 And you could emulate different types of hardware, or you could do different machines using the same hardware. 475 01:07:09.659 --> 01:07:17.400 Like, they're all running Intel but different operating systems, like VMware maybe. And then there's another level up where you've 476 01:07:17.400 --> 01:07:25.889 got separate machines, they're all running Linux, but they're isolated from each other, sharing some of the low level stuff, but it's protected. 477 01:07:25.889 --> 01:07:31.500 And then you could get to an even higher level still, where the machines are 478 01:07:31.500 --> 01:07:34.679 sharing more and it's more efficient. 479 01:07:37.284 --> 01:07:50.454 Currently in Linux now, you can give each process, like, a separate private view of the processes.
So you can't even see the other processes. That's sort of what the Docker level is. So it's efficient and it's high level. 480 01:07:50.730 --> 01:08:01.769 So each, you might say, process or process group is seeing a private view of the computer: a private view of the file system, and the process space, and so on. 481 01:08:03.239 --> 01:08:07.949 And so is it virtual? It's virtual at a very high level, but it's more efficient. 482 01:08:11.010 --> 01:08:16.979 But I'll dig up something then, since you're interested in that, and then you can go 483 01:08:16.979 --> 01:08:20.909 compare the different profs and tell us what the others are saying. 484 01:08:20.909 --> 01:08:25.470 Okay, that gives me a little class to-do, to dig that stuff up. 485 01:08:26.789 --> 01:08:30.239 Other questions. Okay. 486 01:08:31.260 --> 01:08:36.960 Bye bye.