Okay, hi class, just a second. Hello. This works better. Okay. So: Parallel Computing, March 18, 2021, class 15. My universal question is, if you can hear me, somebody type something in the chat window, or on mic just say hi or something. Because otherwise... thanks, Justin. Cool. Okay. So what is on tap for today?

So, I was a little slow in getting back to you about talks, because you're doing a talk, you know, a student class presentation, either Monday or Thursday, and several people have not replied. So what we'll do later today or tomorrow is just assign you talks, and you'll be on Thursday and running over.

Okay, so. We're finishing off the slide sets from the book, just a little more, and accelerated computing. I mentioned OpenGL for a minute, but we've got a talk coming on OpenGL from Joseph, so I won't really talk about it much; I'll just have a minute or two on that. I'll point you to some NVIDIA primary documentation on this stuff, which is more up to date and more concise, but drier. And I bet you've spent the last couple of days reading a lot of it. In any case, I read technical documentation, and I've got a section on some other NVIDIA features that we haven't covered, and some stuff that I've typed up. I had it in Monday's blog, but I moved it to today's and didn't really talk about much of it Monday. So this is my summary, almost for my own purposes, of the essentials of NVIDIA, so I'll talk about that.

And then two blurbs on stuff that I find interesting: programming stuff on memory allocation, and the union file system. And then we'll move on to the next major topic, which is Thrust; we might spend two weeks on it or something. It's a parallel extension to C++, and it also shows you paradigms for efficient parallel programming, and a number of these parallel paradigms are useful only for parallel programming. They're not good for, um, serial code.

So, I don't know what got lost when the audio went out for a minute. I think that things are okay now, but don't trust computers. You won't believe how long I've been programming computers when I say I don't trust them.

OpenCL: I'll just have a blurb on that. I won't show the whole thing; you can read it on your own, and I'll summarize it. It's Apple's competition to CUDA, and it's CUDA where they changed the names, made it look like OpenGL, and made it clunkier.
And as long as NVIDIA is dominant, you should use CUDA, not OpenCL. OpenCL is more device independent, like OpenGL; however, to obtain that independence, it gets clunkier. So, as long as NVIDIA stays sane and competent, you want to stay with CUDA.

The reason I put the qualification in there is that I have seen computer companies go insane, go crazy, and basically go from a dominant position to basically nonexistent in a few years. When I was a grad student, the predominant computer company for computer science departments was the Digital Equipment Corporation, DEC. At their peak, they were the second largest computer company in the United States, and they even had an annual conference on an ocean liner. And then the guy who built it up, Ken Olsen, who had done everything right for decades, started doing everything wrong. He said he didn't like UNIX, he didn't like Ethernet, he didn't like open standards. He got everything wrong. They started filing these patent lawsuits, and they crashed; they basically went to nothing and merged into Compaq. So that happens.

Okay, now, NVIDIA primary documentation, just to point you to this. This is their source website. They've got several different things here, which I think NVIDIA manages as somewhat separate projects. The CUDA stuff is the toolkit documentation, and it's got several things here. The programming guide is quite interesting and quite detailed, so you can have fun going through that. I showed you little pieces of it. Basically, if you want sort of the definitive summary of the core NVIDIA stuff and so on, you'd want to look at the programming guide.

Another one, and they themselves overlap somewhat: they've also got high performance computing documentation, and this talks, among other things, about parallel algorithms. It's worth reading some of that stuff. They have a compiler, a C++ compiler; I think it's basically the PGI compiler, which they're extending. And they talk about extensions to the standard template library: you put in an extra argument and it will try to compile your code for either multicore, meaning Intel-style CPUs, or for manycore, which is NVIDIA. So they talk about that there. They talk about their compiler here, which is separate from CUDA's nvcc: nvc++ is a C++ compiler, but it will emit code for the NVIDIA GPU if possible. So if you want to learn more about that, you can browse around in here.
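To give the flavor of that: here is a minimal sketch of the kind of standard C++ they mean. With NVIDIA's nvc++ and its -stdpar option, a standard parallel algorithm like this can be compiled for the GPU; with another compiler, the same code runs multicore or serially. The array size and the lambda here are just made-up examples, not from their documentation.

    // Standard C++17 parallel algorithm; nvc++ -stdpar can offload this to the GPU.
    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<float> x(1 << 20, 1.0f), y(1 << 20);
        // The execution policy asks for a parallel run; the compiler decides
        // whether that means multicore CPU threads or GPU threads.
        std::transform(std::execution::par, x.begin(), x.end(), y.begin(),
                       [](float v) { return 2.0f * v; });
        return 0;
    }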
That compiler does not do CUDA, though; if you want to put CUDA code in your C++ program directly, then you call nvcc. All the names are slightly different from each other; you need a little cheat sheet to keep them apart, actually.

Okay, so we've got that there. Their high performance computing documentation talks about some tools. They've got a catalog; we could even look at some of these packages here. They've got stuff that runs on top of their low-level stuff to do things such as linear algebra in parallel, fast Fourier transforms, signal processing stuff, sparse matrix stuff, BLAS. BLAS is a spec, an API for some low-level basic matrix-vector operations, and your higher-level things are written in terms of BLAS routines. So BLAS itself is a spec, and then there are implementations of it, open and commercial; the commercial ones might actually be good. In any case: cuBLAS, and the other things for deep learning, machine learning, and stuff like that. So, as you're interested, you can browse around in here. Another thing: they have virtual GPUs now, so you can run where the host is doing time sharing and have several users on one GPU. Okay, so that's some NVIDIA documentation there.

Now, other NVIDIA features. We've seen most of NVIDIA by now. Some stuff we've not seen, for any of you who have taken my graphics course or other graphics courses, are things called texture maps and surface maps. Well, in graphics you have a texture for you to fill a polygon with, and you can interpolate on it, and there are the levels of detail, the mipmaps themselves. Well, that's all done in hardware on NVIDIA. Surface maps map to the surface of a cube, and so on.

There's machine learning hardware, and the primary operation there, for 4x4 matrices, is A = B*C + D, or they can be vectors, and the matrix elements can be several different data types, such as half-precision floats, 16-bit floats. So that's done fast in the machine learning modules.

They've got hardware now, with their real-time ray tracing, to do some ray tracing operations fast, such as to run a ray through some primitives. Unfortunately, the user-visible API is a little higher level than I wish it was. So you can run a ray through various boxes, but you cannot explicitly do an intersection test between a ray and a box; you have to make the intersection test part of a ray traversal, which is a touch annoying. In any case.

They also have an extension for groups. All along we've seen that 32 threads make a warp, and a warp runs together. They've now started to break that idea up slightly: in Ampere, their latest architecture, subsets of a warp can synchronize. Currently what we've seen is that all the threads, or all the active threads, of a warp can synchronize. But I guess they observed that users wanted smaller numbers of threads to synchronize; if you force all the active threads to synchronize, it's a little too coarse.
So now they've got things called cooperative groups, which let, say, maybe 16 threads synchronize or something. They all still have to run the same instruction if they're active, but.

They're also thinking that their GPU might be an expensive resource running on a time sharing system. So you can have virtual GPUs, like you have virtual machines running under Linux: the virtual machine gets certain resources, and in theory it's hidden from the other virtual machines. You can do that now with the GPU. I'm thinking NVIDIA is thinking of the fact that on Amazon you can now construct virtual machines that use GPUs, for a fee. And in fact, last year, when I looked at the pricing, NVIDIA's latest GPU was something like 5 or 10 thousand dollars if you bought it, and you could use the thing in a virtual machine for less than a dollar an hour. In other words, it was cheaper to use it from Amazon than to buy it yourself, unless you're going to use it solidly for a full year, let's say, or several years. So I'm guessing that, since you can now use NVIDIA GPUs on Amazon virtual machines, to help enable that, NVIDIA can now define a virtual subset of a full hardware GPU.

Another cool thing, and I got the formatting wrong here, is global memory compression. On parallel, the GPU has 48 gigabytes of global memory. There is a way you can tell the GPU to compress the data to be written to global memory. It takes time to compress and uncompress, but it means the data takes less memory. So it's a space-time tradeoff, and given that a lot of parallel programs are memory limited, where memory is the constraining thing, this is a win.

Now, what I have here as point 6.1 really should be point 7; I should have tabbed left on that. The latest thing I've noticed is that NVIDIA is not using the idea of a core anymore. You've got your streaming multiprocessors; you might have a few dozen of them. And just to remind you, let's see, on my current machine here, how many streaming multiprocessors do I have? I have 48 streaming multiprocessors. That's printed out by the deviceQuery program. Each of these 48 multiprocessors has, say, 64 integer units, 64 float units, 32 double units, and some dispatch units or something. And all of these integer, float, double, and instruction dispatch units operate independently inside that multiprocessor. So the concept that you've got these organized into CUDA cores, 64 CUDA cores per multiprocessor: in their latest generation, the CUDA core is an obsolete concept.
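Just to show where a number like that comes from, here's a minimal sketch of asking the runtime yourself; this is essentially what the deviceQuery sample does internally (error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        std::printf("%s: %d SMs, compute capability %d.%d\n",
                    prop.name, prop.multiProcessorCount, prop.major, prop.minor);
        return 0;
    }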
So, okay, that's some hot new things that we haven't seen in the documentation so far. The conceptual hierarchy I mentioned before; I won't go through it again. It's a complicated mess. You know, within the one generation, Pascal, let's say, you've got different levels of GPUs; different manufacturers buy the chips from NVIDIA and configure them differently, or NVIDIA will bin stuff. And there might be some chips where one of the multiprocessors doesn't work, so they'll lower the price and sell it in a lower product line, all that sort of stuff, change the board it's on. So it's a complicated mess. Okay, and then they've got things called Nsight and Nexus, and I don't even know what they are.

The range of speeds I mentioned last time. I mentioned compute capability versions; it's another complicated mess. The hardware has a version, like Volta and Ampere, and the compute capability is like a software version: it sort of matches the hardware version, but not exactly. And then the compiler compiles to a byte code, sort of like Java, and what's in the executable is the byte code; and then when you run it, it does just-in-time compiling of the byte code to the actual machine language. So the first time you run the executable, well, it's going to take longer, which is going to mess with any timing you do. So you run it a couple of times and throw away the first time. When it compiles, it actually caches the compiled program, and I don't know where it caches that; I think on the host somewhere. And I mentioned this sort of thing about optimization and whatever else affects CUDA, and so on.

Okay, something relevant to efficient programming, parallel and sequential, which I typed up here because I haven't actually seen it defined nicely anywhere. I'm biased towards C++. There are basically three types of memory. There's static memory, which is assigned when the program is compiled: it would be a fixed-size global array, and the space is allocated along with the code and so on, and things have addresses. So it would be a global variable. If you're going to do this, you maybe don't want to initialize it, because the initial values would have to be assigned as it starts executing, if it's a big array. Now, the thing with this: by default, the static space is not that big, but you can make it bigger by saying -mcmodel=medium or large, and that'll do things like cause the compiler to use wider addresses. Initially it might use 32-bit addresses, which means your static data plus your executable cannot be more than 2 gigabytes. If you make it medium or large, it uses wider addresses. So the program will be an insignificant amount slower, but now you could have,
you know, umpteen gigabytes of static data. And it doesn't increase the executable file if it's not initialized; it increases your virtual memory space, but that's not a problem, because virtual memory is not actually allocated until you touch a page, which I think is 4K bytes. So in the past, I would by default actually put my arrays in static storage, because it avoids some messy problems, although I might want to modify that advice. Okay, so that's static.

The next thing is, there's a push-down stack. I would guess you're probably all familiar with the concept of push-down stacks. Each time you call a new function or routine, it puts another frame on the stack and allocates the variables that are local to this new routine. And then when you return, it pops that frame off the stack and deallocates all those variables. So it's a nice way to handle variables that are local to that routine, or to routines that are called from it. Now, by default, there's only 8 megabytes of stack space, which is sort of small. You can increase it with the ulimit command in the shell, or you can call a C routine that does that, and you can make it gigabytes. So if you want 10 gigabytes of variables local to a routine: increase the stack size, make the data a local variable in the routine, and then it's allocated when you call the routine. And again, it can be as big as virtual memory.

Now, if you have multithreading, then each thread, so far as I can tell, has a separate stack. You might be wondering how this is possible, because you might have seen in a compiler class that you can run one stack up from the bottom and one down from the top of your memory space; that's two stacks. How could you have, say, 16 stacks? Well, with virtual memory, that's easy, because they each start at a different location, far apart in your virtual address space. And again, that space is not allocated until you touch it. So you can have 100 stacks, and each stack is 100 gigabytes or something, no problem; if you start touching all this data, you're starting to use real memory, but until you touch it, fine. So you can have stacks with multiprogramming and so on; in OpenMP, so far as I can tell, each thread gets a separate stack.

So, some initialization has to be done if you create your multiple stacks. But for data that you just need in a routine and its callees, the routines it calls, the stack is nice. But to do it, you have to know the size of the data at the time you call the routine.
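Here's a minimal sketch of raising that limit from inside a program instead of with ulimit, assuming Linux and the POSIX setrlimit call; the one-gigabyte figure is just an example, and the raised soft limit still can't exceed the hard limit:

    #include <cstdio>
    #include <sys/resource.h>   // getrlimit, setrlimit (POSIX)

    int main() {
        rlimit rl;
        getrlimit(RLIMIT_STACK, &rl);
        std::printf("stack soft limit: %ld bytes\n", (long)rl.rlim_cur);
        rl.rlim_cur = 1L << 30;        // ask for 1 GiB, like "ulimit -s" in the shell
        setrlimit(RLIMIT_STACK, &rl);  // further stack growth can use the new limit
        return 0;
    }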
By the way, needing the size at call time is something that C actually does better than C++, with its variable-length arrays. Okay.

And then there's the one that people use automatically: the heap, which is a global space of random memory, and it's got a heap manager, and you create stuff in it. In C you do malloc and free; in C++ you do new and delete. It just creates stuff on the heap, and the heap can be as big as virtual memory, in pages. Now, the thing with the heap is that there has to be a structure on it to keep track of the stuff. Like you have, perhaps, the linked list they teach in CS courses: a linked list of the free spaces on the heap, and if they've got two contiguous free spaces, they coalesce. And if you do a lot of new-ing and deleting, the free space can get fragmented. So perhaps you do a compacting garbage collection, which scrunches stuff together; but that destroys all your pointers, so it can be done in only limited cases. In any case.

And things like your vector class just put stuff on the heap; the documentation didn't even say that for a long time. So, everything you allocate on the heap has some hidden bytes of data before it, which hold some identification information and a pointer to the next space or something. There are a couple of problems with the heap. It's a nice idea, you can do whatever you like with new and delete, but it's got two problems.

First, you have to keep track of stuff so you can delete it when you're finished. Well, there are smart pointers in C++ that will try to do that for you. With a unique pointer, you have only one pointer to an object, you cannot have two; so with one pointer, it can keep track of when no one is referring to the object anymore and free the stuff. Same as with the stack: stuff gets freed automatically when you return. So you have to keep track of stuff and free it when you're finished with it, and that can be difficult if you have a producer-consumer programming model, where one part of the program produces stuff, another part of the program consumes stuff, and they're quite distant from each other. The producer has no idea when the consumer is finished with something, but for the consumer to free it, it has to know things about the producer's internals, and it's a complicated mess. A big place this happens is graphics, actually. You're calling a graphics subsystem, something like X Windows or OpenGL, and the called routine creates some object and returns it to the caller, and there's the question of who should free it, eventually. Now, in a small case you don't care; it's just a little memory leak, just let it go. I mean, you've got lots of memory. But it gets to a point where maybe you've got to keep track. That's the first problem with new and delete: keeping track of things so you know when to delete them.
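A tiny sketch of the smart pointer idea; the Blob type and the function names are my own toy example, not from the slides:

    #include <memory>

    struct Blob { int data[1024]; };   // stand-in for some object on the heap

    // The producer hands ownership back to the caller inside a unique_ptr;
    // whoever ends up holding it last frees the Blob automatically.
    std::unique_ptr<Blob> produce() {
        return std::make_unique<Blob>();
    }

    void consume(std::unique_ptr<Blob> b) {
        // ... use *b ...
    }   // the Blob is deleted here, when the last owner goes out of scope

    int main() {
        consume(produce());   // no explicit delete anywhere
        return 0;
    }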
A real-world example: you all know the DARPA Grand Challenge, with the autonomous vehicles. Well, in one of the Grand Challenges, one of the vehicles failed because it was using Java to keep track of objects that were seen by a forward-looking radar, and as it passed an object, it would free it from the heap. And it accidentally kept a pointer to each object, so no objects got freed, so the heap got bigger and bigger. And as it got bigger, it got slower, which is a point I'll come to in a minute, and eventually the computer on their vehicle was operating in less than real time; it was taking more than real time to process the heap in the Java program. Using Java was another mistake, but in any case, their vehicle stopped.

So, the second problem with the heap is that as you get more and more objects on it, the time to create and destroy objects grows; it's not constant. It goes up with the number of existing objects, because there's a data structure keeping track of the objects that has to be managed. So you do not want to allocate billions of 4-byte objects on the heap. For one thing, each 4-byte object will have an 8-byte header or something. So if you're going to use the heap, you want to allocate fewer, bigger objects, not more, smaller objects.

In the context of using vectors: if you've got a million two-dimensional coordinates, you might say the nice way to think about it is that you have a million vectors of two floats each. Yeah, but then you're allocating a million tiny things on the heap, and that's a bad idea. What you'd do instead is use a fixed-size array or something; perhaps that would be the way I would do it, maybe. Or you do something called placement new, where you separate allocating the storage from constructing the object in the storage. It's something in C++ no one knows about. Normally, when you do a new, it grabs space off the heap, allocates it, and constructs your variable in it. If it's a fundamental data type, construction is nothing much, but with a fancy data type, construction involves building some structure. So with placement new, you manage the storage allocation yourself: maybe you've got this humongous array, a gigabyte, and you cut pieces off it and hand them to placement new to construct your two-dimensional points in, or something, and you keep track of your storage. So that's the second problem with the heap: too many heap objects slow you down.
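Here's a minimal sketch of placement new in that style; the Point2 type and the slab size are made-up examples:

    #include <cstddef>
    #include <new>

    struct Point2 {
        float x, y;
        Point2(float x_, float y_) : x(x_), y(y_) {}
    };

    // One big slab, allocated once; no per-object heap headers, no heap traffic.
    alignas(Point2) static unsigned char arena[1000000 * sizeof(Point2)];

    int main() {
        for (std::size_t i = 0; i < 1000000; ++i) {
            void* slot = arena + i * sizeof(Point2);
            new (slot) Point2(float(i), float(2 * i));   // construct in place
        }
        // Point2 is trivially destructible, so there's nothing to clean up here;
        // a type with a real destructor would need an explicit p->~Point2() each.
        return 0;
    }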
The third problem with the heap is that the heap is initially one global object, and if you're doing parallel programming and doing news and deletes, that has to be serialized, because these different threads are modifying that global storage heap; they're not just reading it. So there's a problem: in parallel programming especially, you do not want to do lots of operations on the heap.

Now, if you're doing a lot of stuff with malloc, Google has an allocator that's better than the g++ allocator, and you can use it by linking it in, as I've got there. And there's another cool thing you can do if your program is dynamically linked: there's an environment variable called LD_PRELOAD, which is basically a list of libraries to load first, and you can tell it to load Google's allocator. If your program is dynamically linked, it will load Google's malloc instead of GNU's, and that saves a little time.

A little thing: Intel servers have a non-uniform memory architecture, which has a very small effect, I think. But on parallel, each 14-core Xeon, or whatever you call the unit at the top level, has half the memory attached to it directly. So accessing the roughly 200 gigabytes that's attached to the local block of 14 cores is faster than accessing the other half. And I mentioned here, I think, that a page of memory is assigned when it's first touched, written to, let's say. So which cores access it matters; you can have some fun there. And caching is a separate thing.

Okay, so that's my summary of the three types of storage in a C++ program: static array data, assigned at compile time; your stack, which is pushed when you call a routine and freed when the frame is popped off the stack when you return; and your global heap, where you control the creation and destruction.

Okay, back to NVIDIA, and some more details I typed up based on my understanding. The numbers keep changing, but basically: we've got the host, we've got the device, and the device has streaming multiprocessors. NVIDIA has used different names at different times. A thread: it's a program, it's got a program counter and some private memory, like registers and some local memory, and it accesses shared memory. 32 threads make a warp. Up to 32 warps make a thread block. And you don't directly see the warps in your programming, but they affect performance. So you've got your thread blocks.
The next level up is a grid of thread blocks. And the kernel is the program that the grid runs, you might say. So a kernel can have thousands of threads; a kernel can create other kernels and wait. And you want to have threads waiting to execute, because the thing with the GPU is that it has high throughput, but it also has latency. One way it gets throughput is by high occupancy, and it gets that because if a warp blocks on some resource it needs, then there are other warps waiting to run.

There are registers, and slow local memory for each thread. And there's one instruction dispatcher that dispatches one instruction to all 32 threads in the warp. So instruction-level parallelism is that the warp schedulers that emit the instructions are always running, because there's always some warp waiting to be dispatched; that's instruction-level parallelism. And there are things like shuffle instructions and so on, voting.

The block-level things I've mentioned several times. The block has fixed resources, like so many registers, that get distributed among all the active threads; so more threads means fewer registers per thread. You've got some fast shared memory, up next to the cache, that's global to everything in the block, but private to the block. And there's some explicit cache. The different warps run asynchronously: well, they're running the same instruction sequence, but they're at different points in it. So all the threads in a block are running the same program, but not necessarily in step, unless you synchronize.

The streaming multiprocessor contains the thread blocks. And my guess is that one CUDA core, though the term is getting obsolete, is a twentieth as powerful as a Xeon core, because of the slower clock rate, and it can't do as many things at once; although this is changing, so this one-twentieth here could be changing.

And again, there are these various units; there are also special instruction units in the streaming multiprocessor. So this is a reason that if your program has a lot of double precision, it's going to run slower: there are fewer double-precision units than single-precision units. And there are some special function units, and if you need them, your warps are going to be waiting. And all this stuff changes from generation to generation.
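To make the thread hierarchy concrete, here's a minimal sketch of a kernel launch; the kernel and all the names are mine, not from the slides. The point is the triple angle brackets, and the fact that the grid has far more warps than the hardware has units, so something is always ready to run:

    #include <cuda_runtime.h>

    __global__ void scale(float* x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
        if (i < n) x[i] *= a;
    }

    int main() {
        int n = 1 << 24;
        float* x;
        cudaMallocManaged(&x, n * sizeof(float));   // unified memory
        int threads = 256;                          // 8 warps per block
        int blocks = (n + threads - 1) / threads;   // tens of thousands of blocks
        scale<<<blocks, threads>>>(x, n, 2.0f);     // grid of blocks of threads
        cudaDeviceSynchronize();
        cudaFree(x);
        return 0;
    }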
There was one generation of NVIDIA hardware where integer instructions ran very slowly. Who knows why; they decided integer instructions weren't important. So back then, you'd actually get a big speedup by converting your ints to floats, doing floating-point adds and multiplies, and converting back to ints. That's not true anymore. So: you've got schedulers, instruction dispatchers, a couple of levels of memory.

Grid-level resources: these are all the blocks in the grid. And your grid is your parallel program, the thing you call from your C++ with the triple angle brackets. And if it's got a lot of blocks, and you have maybe 48 streaming multiprocessors, the different blocks in the grid could be running on different streaming multiprocessors. That's okay, because the blocks aren't talking to each other. And again, you've got a big queue of blocks, and they're picked off as resources free up to run them; so you want to have extra blocks available to run.

Device-level resources: you've got your global memory, which has this latency. Host memory mapping: well, now it's somewhat unified. And thread-level parallelism is that when a thread tries to read something from global memory and gets blocked for maybe 200 instructions, then some other threads can run. Thread-level parallelism is a big reason for performance: you want the CUDA cores in the GPU to always be occupied; something gets blocked, something else is waiting to run.

Caches: caches are a great advantage. A new unit is like a little operating system. Hyper-Q just means various different applications on the CPU can be launching parallel programs. And at the GPU level, things stack up even higher: lots of warps. Okay.

If there are no questions, here's a summary of more about CUDA. Some of these things are just a summary of what I've said before. Point 3 there may be obsolete now, but again, you've got C++ programs compiled with nvcc. You can have a global function: it runs on the device, called from the host. A device function runs on the device, called from the device. A host function runs on the host, called from the host. And then there are variable classes for data; we'll have a slide talking about some of this. Shared is local to the block, and it's fast device memory. Global is managed, or paged. Constant: how do you load constant data? There's a special routine that loads it. And some of this stuff: they're loosening the rules a little as time goes on, and Ampere loosens some of these rules.

One more Linux trick. I like to intersperse Linux tricks, because I find them useful, and this is a cool one. It's called a union file system, or a transparent, or translucent, or overlay file system. So you can layer file systems; it's like layers:
in your CAD program or your drawing program, you've got layers of graphics, and if layer 1 is on top of layer 2, then you can see layer 2 through the transparent parts of layer 1. And when you create a new object, you pick what layer it goes onto. Well, you can do that for file systems.

So you can have, let's say, a couple of file systems; I have an example here. I've got two directories, A and B, and they get combined to make a unified directory M. So what happens is that if there's the same file in both directory A and directory B, you'll see the version in directory A; it will hide the version in directory B. But if a file is in directory B and it's not in A, you'll see it from B. So M is the merged version of these two directories, like two layers in our CAD program or our drawing program.

And if you write something, it will go into the top layer, A. And if you modify a file: well, you can say when you create the union that A is read-write and B is read-only or something, and the union file system then does copy-on-write. So I have the rules here, and you get all these weird cases. What happens if you change a file that's in B? It writes a new version into A, which will hide the old version in B. If you delete a file that's only in A, it gets deleted, and if there was something with the same name in B, that becomes visible. If you delete a file that's in B: well, since B is not writable, because I just said A was read-write but not B, it puts a whiteout note into A that says this file in B is really deleted.

And this is all user-level; it's not superuser. You can create it with the union file system command, and you then delete it with the FUSE unmount command yourself. Sort of cool. Now, what's an application? Suppose you have a read-only directory, like it's on a DVD or something that you cannot write: you could put a read-write layer on top of it. Or if you've got some big database that you have read-only access to, then you put a read-write layer on top of it to make changes that are private to you. And each separate user could do a separate union file system on top of this read-only base, and each separate user would have a separate, private, read-write view of the file system. This is really cool; I think it's nice.

Now, IBM had a commercial version of this 50 years ago, as part of a system which also had virtual operating systems done efficiently, long before anyone else could do an efficient virtual operating system; this was a commercial product they had for decades. IBM has occasionally been a leader in some interesting ideas. And this was not even their mainstream operating system; the mainstream operating system was OS/360. This was a secondary one.
CP was the control program; that was the low-level virtual machine system. And CMS was, I think, the Cambridge Monitor System, because it came out of their Cambridge, Mass. lab, which wasn't even a research center; it was a "scientific center." In any case.

So that's enough specific NVIDIA stuff, specific CUDA stuff. Before I move on, I'll give you a chance to ask questions.

Okay. So what I have here is a class at Stanford on parallel computing, which has been made freely available. So I'll show it to you. There are pluses and minuses with the Stanford course. The minus is that it's ten years old; the plus is that it's very well done. So for stuff from ten years ago that's still interesting today, it's something to look at, but anything since 2010 won't be in the course. So that's why it's not the primary material for my course. However, it does have some nice stuff in it, and it's very well presented. So what it has here is some general stuff at the front, which I'll fast-forward through, just because they present it well; and then we'll get into some paradigms for parallel programming, and a map-reduce model, and then it'll get into a specific example of these paradigms, Thrust. This will be a couple of days.

That's not what I want; my mistake here... there. Better, better, better. Okay. So, CS 193G. And Hoberock is one of the people that did Thrust.

So: how do things get faster? You know that you're going parallel, because they can't increase the clock rate. There are different types of parallelism. Intel, with its superscalar stuff, does instruction-level parallelism, with the cache, with the pipeline; it's nice, but there are limits, and you can read this yourself. Then massive parallelism. Okay, this is nicely done here.

So, different hardware organizations; and one of the keys to NVIDIA's success is that they picked a nice hardware organization. So: you might have several processors, like Intel, each one with some local memory, and then some big amount of global memory; the on-chip stuff is a cache and so on. That's one way you can do it; this is sort of Intel. Again, the local memory is a cache. Okay. Manycore, that's the other one: manycore means thousands of threads; multicore means dozens of threads. Okay, so you do something like this:
there are lots of processors with one global memory. GPUs go for throughput, because they were invented to do graphics: you've got to paint a million pixels 60 times a second, with lots of threads, a thread per pixel or something. So thread creation has no overhead. And there's latency, and you don't care: if the new frame buffer you're writing to starts being written a thousandth of a second late, okay; you're refreshing 60 times a second, maybe. So latency there doesn't matter; well, a thousandth of a second is a big latency from a computer's point of view, but it's high throughput. So basically, lots of threads running together. You can read this some more; I'm just hitting the high points. The point again: the green is compute, so more of the chip is spent doing math computation.

Some numbers here. Again, you've got instruction schedulers that dispatch instructions for a whole warp at a time and so on, some load-store units that load data for a whole warp or whatever. And they get the speed because they've got a lot of computation cores, the dark green things. Single instruction, multiple thread: again, this is a rerun. Okay. Latency hiding and so on, kernels with lots of threads; you've seen all this before. And once a thread starts, it doesn't get preempted; it runs to completion.

So this would be the bad way to do it: lots of processors and no local memory. The way NVIDIA does it is that there's a block with some local memory, and in the block there are a thousand threads, perhaps. And not this one either, where each thread has all local memory; that's too distributed, so communication kills you. And that we've seen before. Lots of examples, nice speedups.

Right. Okay, that was slide lecture 1; and again, if you want me to slow down, tell me. Lecture 2 in 5 minutes.

And this you've seen... you've seen... you've seen... Yeah, you've seen these classes of routines and memory. This is slightly obsolete now. And the main thing is, if I get time, I might write a demo program showing the unified memory, but I haven't got time to do it yet, because the first three versions of my attempt failed, of course; I have to punch at it each time. You've seen that; and finally, that blocks are independent. Okay.

The graphics pipeline:
for those of you who aren't graphics types, you've got your triangles defined by vertices, and vertices can be rotated and scaled. And then they're combined to make primitives, such as triangles, which get rasterized, which means find the pixels of the triangle and put them into the frame buffer, with the depth buffer, and blend stuff, and so on. And these vertex processing units and pixel processing units used to be fixed-function, but then they started getting fancier; that's what the "modes grew with each generation" there says. Hardware used to look like that, but they made things more programmable, so the shader execution units did the pixels, and the vertex unit did the rotating and projecting and so on of vertices. Initially things were fairly limited, and then they started adding more instructions, because there's a lot of parallelism. The thing is, your vertices can all be transformed independently; in fact, your triangles can all be transformed in parallel, if you've got the hardware. And you don't care if there's a little serialization bottleneck. Okay, and more efficiency... okay. That was 5 minutes for lecture 2.

Again, since you've seen so much of this, you can read it on your own if you like. Your program might have a lot of concurrency: step 1 is concurrent and finishes, step 2 finishes, and so on. Worry about race conditions; you all know what a race condition is by now, and you have atomics to handle that. What's interesting here: note that Fermi is like five generations back for NVIDIA, and atomics were slow then, and they're sequential, so, you know, be disciplined in your use of them. And, yeah, that was lecture 3.

Okay, so this, again, is to show you the different types of memory. I've mentioned them before; this is just a nice presentation. Shared memory: fast, per block. Register memory: per thread. Slow global memory. Faster constant global memory, because it's cached. And things to talk to the host. So here are like five types of memory, and you've got the scope, thread, block, or grid, and the lifetime, and you've also got the speed, the penalty. Okay, so five types of memory, with different levels of locality and different sizes and speeds. Okay, and you can have 100,000 threads of which perhaps only 5,000 run at a time.

And where to declare things: I'll give you the ideas here; if you want the details, you can read them. The thing is that newer versions of CUDA relax some of the restrictions. So, for example, you can declare a variable in your C++ program as managed, and now the variable is also accessible on the device.
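A minimal sketch of that; the variable and the kernel are my own toy example:

    #include <cstdio>
    #include <cuda_runtime.h>

    // A __managed__ variable lives in unified memory: host code and device code
    // can both read and write it by name, with no explicit copies.
    __managed__ float result;

    __global__ void add(float a, float b) { result = a + b; }

    int main() {
        add<<<1, 1>>>(2.0f, 3.0f);
        cudaDeviceSynchronize();               // wait for the kernel to finish
        std::printf("result = %f\n", result);  // the host reads it directly
        return 0;
    }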
It's visible in your global routine, so it's a quick way to get data into a kernel; or you want constant for read-only data. And they're talking here about ways to speed things up with your loads and your stores. The point here is: yeah, don't chase pointers. And pointers have gotten easier lately; in the past, you had to worry whether a pointer was a device or a host pointer, and so on. Yeah: don't chase pointers, don't propagate them.

And we've seen examples of tiling your global data; like we saw with the matrix multiplication, you read tiles into the fast shared memory. They're showing that here, so I'll go through it fast: read subsets in, and then when you're finished, copy them back. You have to design the partition size. Okay. And watch out: synchronization stuff, atomic stuff.

This is, so, I showed you: for the reduce, you do a reduction inside each block, and then you reduce the reduced totals globally. So it's a two-level hierarchy. This is efficient; just to remind you about it, it's important, because inside the one block you've got the shared memory, which is fast to read and write. So: hierarchical atomics, they call it here. Okay.

Matrix multiply as well; I'll tell you again, you partition into blocks, and then they're showing how they speed it up. Their first version got some number of gigaflops, and then they do it better, and then they get it faster still; they got 17 times faster or something. So they're experimenting with different tile sizes, and 16 by 16 was best. Okay. The thing in this slide is that these resources are finite; I've been saying this many times. And I've been saying that if I can get you interested in this stuff, then if you want to read more, you can do it on your own.

Performance considerations: we've seen this before, so I won't repeat it, just this. Chunked-up memory access is faster, but you need adjacent threads to read adjacent addresses of global memory. This means that, at the programming level, an array of structures is a bad idea. Here's a structure, which is the way you might program nicely, the way you're taught in CS; maybe it's nice. But the trouble is that for adjacent records, the keys have a stride between their addresses; there's a gap between them for the value and the flag. So this does not coalesce the accesses. What you want is a structure of arrays.
So there are three arrays: all the keys are in one array, they're not in a structure; and then all the values are in one array and the flags are in one array. And if you want, you put the three arrays in a structure. Okay: now I'm coalescing.

I'm ignoring shared memory banks. The shared memory is in banks, and depending on how you access it, you get conflicts that slow you down; don't worry about it. I mentioned texture and constant memory quickly; it's small and it's fast, and if all the threads in the warp want the same address, it goes nicely. Control flow, we've seen many times.

Occupancy is to keep all of your resources running, all your resources in use, and you do that by having lots of warps waiting to run. You want more warps waiting to run; you always want to have stuff waiting to run, so that it will use free hardware. That's what that talks about. So if all of your warps are stalled for some data, trying to read some global array, then you're losing performance waiting on global memory. So maybe you want to have some other threads that are able to compute while some threads are waiting on global memory; we saw that a day or two ago.

And occupancy, as I've talked about before: you've got these limited resources, registers and shared memory, and you want to optimize. There is a spreadsheet that does this; I'll show you, I have the source linked, it's on the NVIDIA developer site. More threads per block is good. And there are debugging flags which will print out information about your program: how many registers the kernels use and so on, how much shared memory. This is a good occupancy calculator. You put in some information about how many threads you set per block, how many registers per thread, et cetera, et cetera, and then it tells you how efficiently you're using the resources on your machine. And then it varies stuff for you, like block size and registers per thread, and you can look and see how your occupancy changes. So you put in the basic information that's settable by you in your program, and then the spreadsheet goes through and tells you the implications of that. Nice. You put in the shared memory you need for each block and so on, and then it tells you what limits you; it's like doing a linear program. The red thing here is what's limiting your performance: you don't have enough registers here; what you said here is you specified too many registers per thread, and that's what's slowing you down.
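Besides the spreadsheet, the CUDA runtime can answer the same sort of question from inside a program. A hedged sketch, with a stand-in kernel name:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float* x) { /* ... */ }

    int main() {
        int blocksPerSM = 0;
        // How many blocks of 256 threads (with no dynamic shared memory) can be
        // resident on one multiprocessor, given this kernel's register usage?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
        std::printf("resident blocks per SM at 256 threads: %d\n", blocksPerSM);
        return 0;
    }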
444 01:04:29.099 --> 01:04:34.679 So: the number of thread blocks per multiprocessor, and so on. 445 01:04:38.969 --> 01:04:43.949 And it flags the limiting point here, and you can see, if you vary stuff, what happens. 446 01:04:43.949 --> 01:04:52.889 Oh, okay. And other options, you can do things like this: launching a kernel takes some time, 447 01:04:52.889 --> 01:05:01.409 and maybe you want to fuse kernels into bigger, fewer kernels to optimize, to amortize that time. 448 01:05:01.409 --> 01:05:05.639 And that's the context there. Okay. 449 01:05:07.349 --> 01:05:11.010 No questions? 450 01:05:13.469 --> 01:05:17.309 Okay, um. 451 01:05:20.519 --> 01:05:33.269 Yeah, nothing new here; these are your parallel paradigms. You do a reduction, as we've seen: you split the operation out over many threads, and 452 01:05:33.269 --> 01:05:36.599 nothing interesting there. 453 01:05:36.599 --> 01:05:40.349 We've seen these partition ideas. 454 01:05:40.349 --> 01:05:43.980 Cool. 455 01:05:43.980 --> 01:05:49.739 Okay, the reduction, just to review; we've seen it before: sum up a whole array. 456 01:05:49.739 --> 01:05:55.289 Okay, this is starting to get slightly newer here. 457 01:05:55.289 --> 01:06:09.360 So, we had this a couple of days ago: you want to do the reduction in parallel. This is the way which has less locality, so it's not as good. In the first 458 01:06:10.014 --> 01:06:22.735 step of the reduction, you add adjacent elements and store the results in the even slots; in the next step of the reduction, you add elements that are two apart; in the third step, you add elements that are four apart; and finally you get the total sum. 459 01:06:23.215 --> 01:06:33.175 The thing is, you don't get good thread occupancy and utilization, and the elements you're adding get farther and farther apart, so you don't have good coherency there. 460 01:06:33.750 --> 01:06:38.699 This is a better way. So, 461 01:06:40.045 --> 01:06:54.114 basically, our operands are initially farther apart in the array, but the output is packed to the front of the array. What this does is use fewer warps' worth of threads, and in a warp, all the threads get used, 462 01:06:54.385 --> 01:07:01.494 and the active data stays contiguous. So this has better efficiency; it's a better use of the GPU. 463 01:07:03.300 --> 01:07:06.780 And here's the code to do that. 464 01:07:06.780 --> 01:07:14.309 Okay, I'm skipping through that, but note that putting a raw __syncthreads() in the wrong place will hang the machine. 465 01:07:15.449 --> 01:07:22.440 Okay. And you might, okay. 466 01:07:22.440 --> 01:07:34.260 And they do the calculation here: this takes log n steps, and the total amount of work is linear, which is good. The initial one, again, takes log n steps and a linear amount of total work. 467 01:07:34.260 --> 01:07:38.400 Okay. 468 01:07:38.400 --> 01:07:42.150 Here's a new operation called a split operation. So, 469 01:07:43.320 --> 01:07:56.250 it's like a sort. What's happening here is you've got two parallel arrays: the first is an array of flags, the second is an array of elements. 470 01:07:56.250 --> 01:08:09.690 And the flags are just a bit in this case, true and false. What we want to do is partition so that all the true elements come before any of the false elements; it's the sort of thing a quicksort does, for example. 471 01:08:09.690 --> 01:08:17.460 So we want to do this fast, and again, it's a fundamental operation that occurs lots of times; 472 01:08:19.050 --> 01:08:24.869 you build other things on top of the split operation. So.
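[Before the split is worked out, here is what the better reduction just shown might look like; a sketch, not the slide's code, with the kernel name and launch shape assumed. Operands start a large stride apart, the active threads stay packed at the front so whole warps retire early, __syncthreads() sits outside the divergent branch, and each block finishes with one atomicAdd, the two-level hierarchical atomic from earlier:]

```cpp
__global__ void reduceSum(const float* in, float* total, int n) {
    extern __shared__ float s[];              // one float per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Stride starts large and halves each step; the active threads
    // stay contiguous at the front of the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();  // every thread must reach this, hence outside the if
    }

    // One global atomic per block, not per element.
    if (tid == 0) atomicAdd(total, s[0]);
}
// Launch: reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_total, n);
```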
473 01:08:24.869 --> 01:08:33.930 We'd like to do the split fast. You could do it as a sort on the flags, but that's inefficient. So, 474 01:08:35.399 --> 01:08:40.979 an example is to compact out elements; this is a common application. 475 01:08:40.979 --> 01:08:44.729 So, 476 01:08:45.750 --> 01:08:58.229 it's used in things like binning stuff, hashing stuff; it gives you a dope vector for an irregular array. 477 01:08:58.229 --> 01:09:03.479 We'll see later how this segmentation thing is used, 478 01:09:03.479 --> 01:09:06.930 and the splitting thing. 479 01:09:06.930 --> 01:09:10.140 So the question becomes: 480 01:09:10.140 --> 01:09:14.939 how do we do that split 481 01:09:14.939 --> 01:09:22.590 in parallel? You might have a thread for each element, and it writes the element where it should go. But where does it go? 482 01:09:22.590 --> 01:09:26.789 Okay, um. 483 01:09:28.050 --> 01:09:35.789 Let me go back a few slides to show you what's happening here. 484 01:09:35.789 --> 01:09:39.329 We're some slides along already; I'll go back a few. 485 01:09:39.329 --> 01:09:42.569 Okay, to slide 26. Okay. 486 01:09:44.845 --> 01:09:59.725 So, in parallel, we want to write each element where it goes: the 3 goes to there, the 1 goes to the right to there, the 7 goes to the right to there, the 0 comes to the left to there, the 4 goes to the right, the 1 goes to the right, the 6 comes to the left, 487 01:10:00.359 --> 01:10:04.079 and so on; the 3 stays at the end. 488 01:10:04.079 --> 01:10:07.350 So how do we know where this 489 01:10:07.350 --> 01:10:12.270 true 0 goes? 490 01:10:12.270 --> 01:10:18.840 The way we know where it goes depends on the number of true flags to the left of it. 491 01:10:18.840 --> 01:10:33.024 Okay, and let's talk 0-indexing, counting from 0. So this first one goes here because it's a true flag and this element has no true-flagged elements to its left, zero, so it goes to element 0. 492 01:10:33.024 --> 01:10:38.095 This true has one true to its left, so it goes to element 1. 493 01:10:38.880 --> 01:10:43.920 This true has two trues to its left, so it goes to element 2. 494 01:10:43.920 --> 01:10:54.270 So you see, this is a place for the scan: you do an exclusive scan on the true bits, and that gives you an array of where each true 495 01:10:54.270 --> 01:10:59.310 payload variable should be copied to. So this is an application of the exclusive scan. 496 01:11:02.215 --> 01:11:11.635 And that's it for the trues. For the falses, it's the same thing, but you have to add the total number of trues. So here there are 3 trues altogether, 497 01:11:11.635 --> 01:11:21.534 so each false goes to location 3 plus the number of falses to its left: this first false goes to location 3; this false has 1 false to its left, 498 01:11:21.534 --> 01:11:35.635 so it goes to location 4; this false has 2 falses to its left, so it goes to location 5. So to do the split, what we have to do is an exclusive scan on the flag array, counting the trues; 499 01:11:35.935 --> 01:11:38.904 we have to do another exclusive scan on it, counting the falses; 500 01:11:39.270 --> 01:11:52.350 and then to the falses we add in the total number of trues. And now we know where each payload variable should go, and we can copy. So we did two scans and a little arithmetic and, bang, we can do all the copies in parallel. 501 01:11:52.350 --> 01:11:59.039 That's what they're talking about here with the scan operations and so on.
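[The index arithmetic just described, written out sequentially so the two exclusive scans are visible; a host-side sketch, since on the GPU each scan and the final scatter would themselves run in parallel:]

```cpp
#include <vector>

// Split: all true-flagged elements before all false-flagged ones,
// stably, via two exclusive scans plus a scatter.
std::vector<int> split(const std::vector<int>& elem,
                       const std::vector<char>& flag) {
    int n = (int)elem.size();
    std::vector<int> out(n), trueIdx(n), falseIdx(n);
    int trues = 0, falses = 0;
    for (int i = 0; i < n; ++i) {              // the two exclusive scans
        trueIdx[i]  = trues;   trues  += (flag[i] != 0);
        falseIdx[i] = falses;  falses += (flag[i] == 0);
    }
    // Scatter: each copy is independent, so in parallel they can all
    // go at once; two scans and a little arithmetic.
    for (int i = 0; i < n; ++i)
        out[flag[i] ? trueIdx[i] : trues + falseIdx[i]] = elem[i];
    return out;
}
```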
502 01:11:59.039 --> 01:12:06.899 So that shows you an application of scans; it's got many other applications: radix sort, sorting, and so on. 503 01:12:06.899 --> 01:12:15.810 With radix sort, you're sorting based on the least significant 504 01:12:15.810 --> 01:12:22.470 digit; not just a bit, that'd be silly, but a byte or 505 01:12:22.795 --> 01:12:29.154 two bytes, and you want to have all the output sub-arrays contiguous without wasting memory. 506 01:12:29.154 --> 01:12:41.725 So where does each one start? Well, you run through and count the frequency of the least significant digits, and then you do a scan on that, and you have a dope vector for where to write the next set of elements. 507 01:12:42.354 --> 01:12:47.574 So these are applications of scan, and then they show how to engineer the scan operation itself. 508 01:12:47.850 --> 01:13:01.560 It's useless in sequential computing, it doesn't help you, but in parallel computing it helps you a lot. So this is an operation which is a very important operation, but only for parallel computing. 509 01:13:03.899 --> 01:13:09.659 A big idea for today is the scan, as you see. 510 01:13:09.659 --> 01:13:13.529 How do you do it? 511 01:13:13.529 --> 01:13:16.890 We saw before how you do it, so 512 01:13:16.890 --> 01:13:23.729 I won't do it again. So, 513 01:13:23.729 --> 01:13:30.000 if you've got many blocks, you do it inside each block and then you combine the results of the different blocks. 514 01:13:30.000 --> 01:13:39.600 So: results local to each block, and then you take the end number from the first block and add it to all the results of the second block in parallel, and so on. 515 01:13:40.739 --> 01:13:50.100 Yeah, and so these are some operations which, well, these are the base 516 01:13:50.100 --> 01:13:53.100 tools for parallel programming, basically: 517 01:13:53.100 --> 01:13:59.970 splitting, scanning, reduction. And we're going to see generalizations of these. So. 518 01:13:59.970 --> 01:14:12.539 Yeah, sure. 519 01:14:15.180 --> 01:14:20.579 Just a second here. 520 01:14:24.539 --> 01:14:30.210 And let me show you some new things. 521 01:14:30.210 --> 01:14:33.960 Segmented scan is a new one. 522 01:14:34.975 --> 01:14:45.354 So, we had the scan where you compute all the subtotals. In the segmented scan, you put barriers in the array, and you do not propagate the subtotals across a barrier. 523 01:14:45.805 --> 01:14:52.795 So it effectively splits your bigger array up into a number of smaller arrays, and you do a scan on each smaller array. 524 01:14:53.130 --> 01:14:56.250 But the smaller arrays can all be of different sizes. 525 01:14:57.270 --> 01:15:05.460 So the barriers are firewalls; but the big thing is, the smaller arrays are all different sizes, and the whole thing is done in parallel, 526 01:15:05.460 --> 01:15:09.239 in log n time. Um. 527 01:15:10.739 --> 01:15:18.210 So, what we have here, um: 528 01:15:18.210 --> 01:15:22.439 so the flags are 529 01:15:24.149 --> 01:15:28.590 our barriers, so we don't propagate across a barrier. So, 530 01:15:30.300 --> 01:15:33.840 basically, the two barriers here. 531 01:15:35.430 --> 01:15:47.130 And let me look at the results at the bottom; you can skip several things in the middle. We're just going to look at the top row here: it's the head flags, the barriers. 532 01:15:47.130 --> 01:15:52.500 The indices are the things that we are 533 01:15:55.409 --> 01:16:04.529 summing. Okay, I'm getting confused now.
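[One pass of the radix sort described a moment ago, sketched sequentially on the low byte: count the digit frequencies, exclusive-scan the counts into starting offsets (the dope vector), then scatter. Repeating the pass on successive digits, least significant first, gives the full sort:]

```cpp
#include <vector>
#include <cstdint>

// One stable radix-sort pass on the byte at bit position 'shift'.
void radixPass(std::vector<uint32_t>& a, int shift) {
    std::vector<uint32_t> out(a.size());
    int count[256] = {0};
    for (uint32_t x : a) ++count[(x >> shift) & 0xFF];  // digit histogram
    int start[256], sum = 0;
    for (int d = 0; d < 256; ++d) {                     // exclusive scan:
        start[d] = sum; sum += count[d];                // the dope vector
    }
    for (uint32_t x : a)                                // scatter, in order
        out[start[(x >> shift) & 0xFF]++] = x;
    a.swap(out);
}
// Full sort of 32-bit keys: call with shift = 0, 8, 16, 24 in turn.
```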
534 01:16:06.810 --> 01:16:12.270 I'll do this segmented-scan slide next time; we'll just go through it. I'll come back to that slide next time. 535 01:16:14.609 --> 01:16:18.810 But that's how it's implemented, and, um, 536 01:16:18.810 --> 01:16:22.409 so 537 01:16:22.409 --> 01:16:28.319 you can read ahead if you want. And sorting: we'll see how to sort fast and so on. 538 01:16:28.319 --> 01:16:38.699 A reasonable point to stop here. You're welcome to read ahead of me; this is lecture 7 539 01:16:38.699 --> 01:16:42.180 in the Stanford set. And 540 01:16:42.180 --> 01:16:53.130 actually, for fun, I want to spend a couple of minutes introducing you to Thrust 541 01:16:53.130 --> 01:16:59.850 in C++, and then I'll come back to that. 542 01:17:04.319 --> 01:17:07.319 It's all demonstrated by 543 01:17:07.319 --> 01:17:17.069 examples. This is a Thrust program, a C++ program. You include various things there; they're header-only files, parallel versions of the STL. 544 01:17:17.069 --> 01:17:26.970 And the only data type is a vector, and you can define a vector on the host or the device, like an STL vector. 545 01:17:26.970 --> 01:17:34.920 So here is a host vector of integers, and it's 2 to the 24 elements long; its name is there on the slide. 546 01:17:34.920 --> 01:17:39.960 And then you can define a device vector of integers, 547 01:17:39.960 --> 01:17:46.380 int in the angle brackets, and so you can define vectors on the host and on the device. 548 01:17:46.380 --> 01:17:53.699 And you can copy them: this here initializes the device vector from the host vector. 549 01:17:53.699 --> 01:17:58.920 So this will generate a cudaMemcpy, 550 01:17:58.920 --> 01:18:08.369 sort of thing. And so you can copy from host to device and back again, and you can actually access individual elements: 551 01:18:08.369 --> 01:18:12.270 you could subscript it, element 5 or something. 552 01:18:12.270 --> 01:18:18.539 So the only data type is a vector; it's one class, it could be on the host or the device, and you can give it 553 01:18:18.539 --> 01:18:21.689 ints, floats, whatever. 554 01:18:21.689 --> 01:18:31.199 Um, and then every operation, it's like a map/reduce operation. And it's like, so, 555 01:18:31.199 --> 01:18:37.380 well, let me look at sort here, and I'll come back to generate. So we look at sort here. 556 01:18:37.380 --> 01:18:44.729 It takes two arguments: it takes the begin and the end iterators of the vector, and 557 01:18:44.729 --> 01:18:58.289 it obviously does a sort. And these being integers, this will sort on the GPU and will be very fast; it'll be a radix sort. In fact, the sort is fast enough that if you want to sort a big array, 558 01:18:58.289 --> 01:19:07.890 you can make your program faster by just copying from the host to the device, sorting, and copying the result back. Sorting on the device is so fast that it 559 01:19:07.890 --> 01:19:14.489 pays for the overhead of the two-way copy. In any case, this shows the idea here. You see, 560 01:19:14.489 --> 01:19:17.850 you give it the 561 01:19:17.850 --> 01:19:25.104 begin and end, and it sorts on the device. If you gave it a host vector, it would sort on the host. 562 01:19:25.104 --> 01:19:36.114 It's compile-time dispatch: it dispatches because sort is overloaded, one per class, and the dispatch happens at compile time depending on the class of the argument.
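[The program being walked through is essentially the standard first Thrust example; a sketch of the whole thing, with the vector names h_vec and d_vec assumed, including the generate and copy calls discussed next:]

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main() {
    thrust::host_vector<int> h_vec(1 << 24);             // 2^24 ints on the host
    thrust::generate(h_vec.begin(), h_vec.end(), rand);  // fill with rand()
    thrust::device_vector<int> d_vec = h_vec;            // host-to-device copy
    thrust::sort(d_vec.begin(), d_vec.end());            // radix sort on the GPU
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); // copy back
    return 0;
}
```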
563 01:19:37.135 --> 01:19:50.064 Back up to generate: again, it takes begin and end iterators for a vector, and it takes a third argument, which is a function with one output, and it calls the function in parallel for each element of the vector. 564 01:19:50.515 --> 01:19:58.975 So this will initialize the host vector, because it calls rand, which we assume exists somewhere, in parallel for each element, to create each element of it. 565 01:19:59.250 --> 01:20:03.300 Yes, so we can do this here: we can have a 566 01:20:03.300 --> 01:20:07.949 function which takes no arguments and just stores the output. 567 01:20:07.949 --> 01:20:12.779 And then copy, down here, does the copy. 568 01:20:12.779 --> 01:20:22.050 It takes the begin and end iterators for the source vector, then it takes the begin iterator for the destination vector, 569 01:20:22.050 --> 01:20:28.140 and it copies the result back from the device to the host. 570 01:20:28.140 --> 01:20:36.239 So this shows you some examples of Thrust: the only class is a vector, 571 01:20:36.239 --> 01:20:48.779 a vector class, whatever, a vector of something, on the host or the device, and you do things like this. This is a map operation; you do map operations and copies. 572 01:20:50.994 --> 01:21:04.944 It shows other examples here: you can access individual elements, you can print them out, and so on. Okay, a good point to stop. So what we'll do next time is more about the scan with barriers, 573 01:21:05.520 --> 01:21:09.689 because it's important, and then we'll get explicitly into Thrust. 574 01:21:11.159 --> 01:21:21.750 So, other than that: see you Thursday... see you Monday, actually, Monday. 575 01:21:25.560 --> 01:21:31.380 We're going to see Connor, Joseph, Lane, Dan, and Randy. 576 01:21:32.430 --> 01:21:37.890 And then if there's time, maybe Ben's Thursday, and I will take the other 577 01:21:37.890 --> 01:21:44.909 five people in the course, four people in the course, and assign you to Thursday and assign you topics. So. 578 01:21:46.649 --> 01:21:52.439 Questions? No? 579 01:21:54.720 --> 01:21:59.100 Have a good weekend. It looks like it's too warm to go skiing. 580 01:21:59.100 --> 01:22:02.880 But.