Is this any better? How's the sound now? Hello.

Great, sorry about the delay. Parallel computing, class 14, March 15, 2021. What's on tap for today? We're continuing with this accelerated computing material from NVIDIA, which, by the way, is from a book by Kirk and Hwu, Programming Massively Parallel Processors: A Hands-on Approach. It's an excellent book. It costs money, but by the standards of university textbooks very little money, so if you like the slides I recommend that you get the book.

So we're moving along in that. Some other stuff I'll mention before I continue with the slides; I've pointed you at some of this before. First, NVIDIA has some primary documentation that I recommend, and you should go read it. It's more up to date than the books, but it's drier. It's also more accurate; the books actually sometimes have errors. But if you want a nice concise summary of what's going on, like what's the difference between a CPU and a GPU: CPUs have lower latency, and GPUs have more compute capability for the electrical power that you're using, and so on.

They also show the real-estate layout: a much, much higher percentage of the chip in a GPU is used for actual processing, as opposed to things like caches and control. You need less control area because the warp of 32 threads is all running the same instruction, for example. They talk about that there, so I encourage you to go through it.

Some other points I'll just hit: NVIDIA has spent a lot of effort on tools to help you use their hardware, and they've got different types of tools; let me actually go through this a little, since it's stuff I haven't covered so much. They've got libraries using the GPU (we've seen a little about that and will see a touch more) and programming languages, and then this two-dimensional array of GPU products. Horizontally goes with time, so the compute capability goes up: the latest GPU on parallel, my server, is compute capability 7, and their latest one, which came out less than a year ago, is compute capability 8. They keep adding more features. The other axis is sort of function: there's the embedded market, say for autonomous vehicles, and so on. I believe Tesla for a while was using NVIDIA, possibly, but they may do their own now. I have an Audi Q5 automobile; it has two NVIDIA chips in it, I believe, because of the car's two displays.
So that's the embedded market. Then there's the consumer market for gaming, and the professional market. The difference between consumer and professional is a couple of things. The professional cards, for example, will do double precision a lot better, and they may worry about IEEE 754 standard floating point and so on. So the Quadro line does more accurate computations, for computer-aided design and so on. The Quadro line also costs more; it's premium priced, because NVIDIA understands that engineers are willing to pay more than gamers are. So if you look at functionality, the Quadro is more expensive for what you get than the GeForce line, which is why some professionals try to use the GeForce line, because it's a lot cheaper. That caused NVIDIA at one point to change the licensing so that some of their Quadro tools were not allowed to run on the GeForce.

And then the last special area is the data center, the supercomputing line, which they call Tesla, which is unrelated to the car. There are three uses of the word Tesla in common use. There's the automobile; forget that, it's not relevant to this. It's NVIDIA's professional line of supercomputing GPUs, which often don't even have video out, since that takes space on the chip. And it's also the name of a generation: indexing through time, it's the name of an obsolete generation of chips.

In any case, CUDA is a scalable model: you can add more cores and get the same thing done, but faster. They talk about that here, and then you just go down and see some new stuff, some interesting things that they've got; I may hit a little more of it later.

Cooperative groups: they've got ways to work with fractions of a warp now. And dynamic parallelism: a kernel can actually start another kernel. When I told you that a global routine was called from the host and runnable on the device, that's a simplification; in some cases you can call a global routine from the device also, and it fires off an independent kernel.

Texture fetching is another type of memory I've just alluded to briefly in CUDA, because remember, the origins of all this is graphics, and graphics needs textures. You can have up to three-dimensional textures: you could have a block of wood, and for every x, y, z point you've got a color, so you'd need a three-dimensional texture.
More commonly they might be two-dimensional. And the thing is that they have special hardware tools to do texture mapping fast, such as: if an index is a fraction, a real number rather than an integer, it will interpolate from the adjacent integer coordinates of the texture. So the interpolation in the texture is considered part of the texture fetching. It also (the details are proprietary, I think) may use a space-filling curve to order the texture elements, stuff like that.

Important terms: the kernel is the parallel program running on the device, specified with __global__. Dynamic parallelism is, like I said, where a kernel can start another kernel; an application might be, I don't know, you're walking down a tree and part of the tree is a lot bushier, so you start a new kernel, perhaps. Unified memory is where the host and the device have the same address space; you might even get swapping. So you can walk your way through this and have fun: unified memory, managed memory, and so on. And you can mix serial and parallel code, which would be a really efficient use of your machine: you're running stuff on the host, you're running stuff on the device, and they are interleaved, perhaps. So you can have fun with this; go through it, I've got a link to it there.

Some other stuff I've got here before I go back to these slides. This is my understanding, which I wrote up a little while ago, because again, these NVIDIA terms are not completely logical. So this is my attempt to pull out the important points.

You've got a microarchitecture, which defines the available operations. A shuffle might be an operation, a permute would be an operation where you permute results among all the threads in a warp; floating point, shaders, those would be operations. So the microarchitecture, you might say, defines the machine language of opcodes. Then you've got these generations; the last several are Tesla, Fermi, Kepler, Maxwell, Pascal, Volta, and Ampere.

Okay, now the microarchitecture is implemented in microprocessors, which are different: they have the same instruction set, but a more expensive microprocessor might have more streaming multiprocessors, might have a higher clock rate, et cetera, et cetera. The code name for the microprocessor tends to be something like GK110: the G means graphics (they all start with a G), K is the initial letter of the generation, Kepler or Pascal, and then the number is the code for the particular microprocessor. It's not always true that a higher number means faster; a higher number may mean slower. And then you get these suffixes and stuff, so it's a mess; I don't fully understand it, but I'm trying to help you out here.

The microprocessor gets put into a graphics card, also called an accelerator or a module, and that's the thing that you plug into your server, let's say. And they've got these series: GeForce, Quadro, and so on.
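To make the generation / compute-capability / SM-count distinctions above concrete, here is a minimal sketch (my addition, not from the slides or the book) that asks the CUDA runtime what it reports for whatever card you actually have; the fields are the standard cudaDeviceProp ones:

    // devquery.cu -- compile with nvcc; prints what CUDA knows about each visible GPU.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);                       // how many GPUs are visible
        for (int d = 0; d < n; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);           // fill in the property struct
            std::printf("device %d: %s\n", d, p.name);
            std::printf("  compute capability %d.%d\n", p.major, p.minor);
            std::printf("  %d streaming multiprocessors, warp size %d\n",
                        p.multiProcessorCount, p.warpSize);
            std::printf("  %.1f GB global memory\n",
                        p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;
    }

Two cards with the same generation and compute capability can still differ in the SM count and memory size reported here, which is exactly the binning and product-line variation being described.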
And the thing is, they have different cards: the same microprocessor can go into different cards with different capabilities, like different amounts of memory on the card. So it's a horrible mess. Oh, and then another thing: the card may be manufactured by another vendor, who buys the microprocessor from NVIDIA and puts it into its own card. PNY, say, would have a card, and they might twiddle various features, play games with the memory and bus rate and so on. So NVIDIA may produce one card as an example of how to do it, what we call a reference implementation; you could buy a reference implementation card directly from NVIDIA, or you could buy a card from someone else.

Okay, so you get these product names, and then you'll get little suffixes, like the X in K20X, and the X or XM will mean various things. One customization: does it have its own fan, does it have on-board cooling, or does it rely on a fan from the server case, something like that? Another is the number of processors. On an NVIDIA die, for example, I think they tried to build, say, six streaming multiprocessors onto the chip, but fabrication fails sometimes, so they bin them: if on a given chip only five of the six were working, they'd sell it as a cheaper variant with a slightly different model number.

So that's my summary, sort of, of what's happening. The term GPU is sometimes used for the whole card; it's sometimes used for just the processor. Again, it's not standardized. Speeds: they change the type of memory, they change the number of cores, they change the bus and the bus rate, the clock rate; there are all sorts of variants.

Now, CUDA has a couple of different version numbers. There's the compute capability version, which tells you what hardware properties are available: as the number goes up you get different things available — there are tables of what you can do, how much shared memory, whatever you can imagine. As time goes on they add more capabilities and they bump the major version number. That's down at the hardware architecture level. There's also a version for the software, for CUDA itself. These slides are a year out of date, I haven't updated them for this year; the current CUDA software version is up at 12.2, I think. So they bump up the compute capability for the hardware, and they bump up the software driver as they add more features to the software.

It gets worse. nvcc, the CUDA compiler, takes
C++ code that has been slightly extended with some syntactic extensions: the triple angle brackets, and the routine modifiers, __global__ and __shared__ and stuff like that. The compiler takes this CUDA program, with the extensions, and splits it apart into the C++ code and the CUDA code. The C++ code goes through Clang, or so I believe; it naturally went through Clang at one point, though that may not be true anymore.

And then the CUDA code does not get compiled straight down to low-level assembly. The CUDA code gets compiled down to an intermediate abstract machine, a bytecode so to speak — just as Java gets compiled to a bytecode, and Python might get compiled to a bytecode — the CUDA gets compiled to this intermediate bytecode (PTX), and that's what's stored in your executable file. Then when you run your executable, at run time there's basically a just-in-time compiler, which compiles from the bytecode down to the actual CUDA assembly. And the reason for that is that they can change the back end: if you want to run on a newer generation of GPU hardware, the run-time just-in-time compiler has to change, but the bytecode is still valid. So it adds an extra level of complexity, but it means your executables don't have to be recompiled, and you get some forward and backward compatibility, which is nice.

That's also true to some extent in Linux, actually. I've got some executables approaching twenty years old, compiled from C programs, and the executables might still run. For minor things the Linux kernel has maintained API compatibility — not for system stuff. For internal system stuff the Linux kernel developers definitely state they don't even try to maintain compatibility, and it seems like they go out of their way to be incompatible from version to version. What they want from people working at that level is to build official modules integrated into the kernel; they don't want people writing their own kernel-level stuff on the side. In Linux they can't stop them, but they make it hard for them. So in any case, I have twenty-year-old Linux executables, as I said, that will still run; the API is still compatible. The crazy thing is my C programs won't compile anymore: the C language has changed more recently than the Linux API. It's sort of crazy. I say bad things about Linux; well, that's a good thing about Linux.

And just a point: the key thing with the GPU is optimizing data flow as well as processing. Okay, so that's my summary of parts of NVIDIA; I'll get to memory allocation later. Let me go back to some of these slide sets here.
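Since the syntactic extensions nvcc handles are easy to show, here is a trivial, self-contained sketch of my own (not code from the slides) using __global__, __shared__, managed memory, and the triple-angle-bracket launch; everything else is ordinary C++:

    // scale.cu -- nvcc splits this into host C++ and device code, emits PTX,
    // and the driver JIT-compiles the PTX for whatever GPU you run it on.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, float a, int n) {    // __global__: runs on the device, launched from the host
        __shared__ float cache[256];                      // __shared__: per-block fast memory (used trivially, just to show the qualifier)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { cache[threadIdx.x] = x[i]; x[i] = a * cache[threadIdx.x]; }
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));         // unified (managed) memory, mentioned earlier
        for (int i = 0; i < n; ++i) x[i] = 1.0f;
        scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);      // the <<<blocks, threads>>> extension
        cudaDeviceSynchronize();
        std::printf("x[0] = %f\n", x[0]);                 // expect 2.0
        cudaFree(x);
        return 0;
    }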
So we're starting with some case studies. Let me just show you; again, I'm just hitting highlights for you. This is slide set 16.1. I downloaded the zip file of the slides and opened it up, okay. Let's see if I can make it any bigger for you... no, I can't really make it any bigger.

Okay, so this is another application. What we have here is a couple of applications of various techniques. Work efficiency means: don't have a lot of threads in the warp being idle, if you can avoid it. So, I don't know, if you have 128 threads doing things, it'd be better for those 128 threads to be only 4 warps instead of, say, 128 warps, because if you're using fewer warps, then the warp machinery on the GPU, like the instruction dispatch unit on the streaming multiprocessor, is free to process other warps or some other application. So that's work efficiency. Then divergence: ways to avoid conditionals in your code, because again, conditionals are bad for warps.

So this is working with a real application, some molecular dynamics things. One of the problems is that we want to calculate potentials. You have a charged body here; it exerts an electric field throughout the rest of the universe, but the magnitude follows an inverse square law. So how do you compute it: here's a target point, you want to compute the potential there from all of the charged bodies in your application, so you have to sum them up. And just to anticipate a little, we're going to put a grid on the scene and coarsen things.

So the direct way is: you just iterate over all the other charged atoms in the system and sum up the potentials. That's simple, but it takes linear time for each target atom, so quadratic time for the whole problem.

Now, if you've got something like this, your first question: you basically have a double iteration, a double loop. You're looping over the target bodies whose potentials you want to calculate, and you're looping over the source bodies that are charged, which you might say are the sources of the fields. And the first question is, which is the outer loop and which is the inner loop?

The next thing here is this cutoff idea, another big policy design question when you're designing an application like this: at a certain point you say the sources are so far away that they don't matter, and I'm not going to consider them.
If you're summing up gravitational potential, say, at my home here in Loudonville, I would sum the effect of the moon and the sun — I'd look at the moon and the sun, and maybe, if I'm superstitious, Jupiter, but I think that's silly. But I would not consider the effect on my home of Pluto, and I would not consider the effect on my home of stars in the Milky Way in some distant cluster, and so on. They're too far away.

Or another thing I would do is cluster stuff. If I want to consider the effect on my home of the sun, I would treat the whole sun as one point body, because it's far enough away compared to its size that that's a reasonable approximation. So I would start clustering stuff. Same thing for the moon: I treat the moon as one point body, a few thousand miles across and 200,000 miles away; that's reasonable. But if I were close to the moon, like trying to orbit around the moon, I would not treat the whole moon as one point body; that would be seriously wrong, that would calculate a seriously wrong orbit — the calculation would be bad enough to be useless, or something. So you get various approximation things you have to think of.

So you could imagine some sort of hierarchy, clustering objects. Think of something like the Milky Way; let's say you want to compute the evolution of the Milky Way. You've got these arms in the Milky Way, and the arms are interesting: they're a wave phenomenon, density waves going through the stars. The arm as a concept exists, but the stars in the arm change. So if you wanted to work on the evolution of the Milky Way, such as some of the physicists are doing at RPI with this MilkyWay@home project, using volunteer cycles, you would need some sort of adaptable, dynamic clustering. A star would be in one cluster when it's in an arm, and then it travels out of the arm, because the arm — this density wave — is going faster than the stars. So a star enters an arm, is in the arm, then leaves the arm. It's like a wave on the ocean: the wave is an organizational concept, not the individual water molecules; the wave travels over the ocean, the water molecules do not. That's the perfect example, and Milky Way arms are like that. So again, you try to bucket stuff up, cluster stuff to compute, but then your clustering has to be dynamic.

Okay, so here they're trying to model things like proteins — complicated structures. This is a hard problem, and it's important.
It's hard because of these potentials: a protein, excuse me, is this long molecule. The different atoms have different charges, so it wants to fold up; if it folds up, it reduces its potential energy. And so it folds up in all different ways. There's a geometric component — you know, this blob fits into that slot or something — and there's a charge component: you want complementary charges close to each other. So the proteins are bouncing around, folding and unfolding, but they're preferentially going to end up in some folded state. And it's very complicated to compute what that resulting folded state will be. Of course, it's not just one protein: if there are a couple of proteins in the solution, they may fold together. The temperature will affect what the stable state is; that's a principle in chemistry. The optimum state, again, depends on everything in the system, so it's very difficult to compute what the folded state will be.

However, it's important to compute it. This hard problem is a problem where we're interested in the result, because how the protein ends up folded determines what it does; it determines its activity, because it's the stuff on the surface, you might say, that can interact with other things, which controls what the protein will do. So this problem, protein folding, is difficult and it's important, and it affects, say, drug design.

A goal might be... I mean, drug design now is sort of like alchemy, really. Well, I'm being a touch unfair; it has a strong resemblance to alchemy in that they don't know what they're doing from first principles, and a lot of it is trying everything: wander through the Amazon, find an interesting plant, try it and see if it does something. That's a little unfair, but there's truth in what I'm saying: randomly trying stuff to see what happens. It would be really nice if, from first principles — going from alchemy to chemistry — you could start designing this stuff. But to do that, you have to add some serious computation. So this is how this stuff is relevant to the real world.

In any case, the sequential version: we want to compute the potential energy at each point on a grid, so we iterate over the grid cells and over the input atoms and sum up the potentials. In one limited sense, that's optimal. I'll skip through the code, but look at the loops: the outer loop is iterating over the atoms, and the atoms are the sources of potential. The inner loops iterate over the grid; we want to compute the potential at each grid cell, and in this case it's a 2D grid, so we've got two nested loops over the grid.
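For concreteness, here is roughly what that input-oriented sequential loop nest looks like. This is my sketch of the idea, not the book's exact code; the flat atom layout, the grid spacing, and the simple charge-over-distance potential are assumptions:

    // Sequential, input-oriented: outer loop over source atoms, inner loops over a 2D grid slice.
    // energygrid[j*nx + i] accumulates the potential at grid point (i, j);
    // atoms[] holds x, y, z, charge packed as 4 floats per atom.
    #include <math.h>

    void cenergy_seq(float *energygrid, int nx, int ny, float spacing,
                     const float *atoms, int numatoms, float gridz) {
        for (int n = 0; n < numatoms; ++n) {              // input (source) atoms: outer loop
            float ax = atoms[4*n],   ay = atoms[4*n+1],
                  az = atoms[4*n+2], q  = atoms[4*n+3];
            float dz = gridz - az;                        // constant for the whole slice
            for (int j = 0; j < ny; ++j) {                // output grid: inner loops
                float dy = j * spacing - ay;
                for (int i = 0; i < nx; ++i) {
                    float dx = i * spacing - ax;
                    energygrid[j*nx + i] += q / sqrtf(dx*dx + dy*dy + dz*dz);
                }
            }
        }
    }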
So we're iterating over x and over y, and we're just summing into the grid. It's called input oriented because we're iterating over the input: the outer loop is iterating over the input. As I said, it's your choice what the outer loop is, the input or the output. And if I'm going too fast for anyone, tell me.

Okay, that was the serial implementation, the sequential implementation. Now we want to do it in parallel. We might do one slice of the target data at a time, or you might be able to put an array on the GPU and work in slices of the data.

One interesting thing here: we're sort of changing the iteration, because the outer loop here is basically over one slice of the output data, and for each grid point in that slice we're iterating over all the input atoms. So what we do is take those input atoms and cache them: put them in the shared memory. Well, you could let the cache do it implicitly, or, in a case like this where you think you know better than the cache (and you'd be right), you copy the input data to the shared memory yourself, and now it's really fast to iterate over it many times. You're writing data to the output, so there's a latency there, but it doesn't actually matter if you think about it.

Obviously you're going to do your slicing so you put as much data in the shared memory as you can. And when I say as much, here's a place you have to ask yourself: can you get away with, say, half-precision floats, or do you need 4-byte data, or can you work with 2-byte data? I mean, you're already doing an approximation because you've got a grid.

Let me describe what's happening here. One thing we're talking about is that one thread might work with one atom. That's slightly different from what I said on the last slide, but it means the thread computes output that will go into various different places. This is called a scatter pattern, because the output is being written to scattered, random places.

And they're talking about different ways you can optimize. If you find a constant that's being used again and again, then on a host compiler I assume the compiler would optimize it; maybe not here on the device.
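Here is a minimal sketch of the scatter-style kernel just described — my reconstruction with assumed names, not the slide code: one thread per input atom, each writing its contributions into the whole output slice with atomicAdd, which is exactly the serialization cost discussed next:

    // Scatter: one thread per source atom; each thread updates many output grid cells.
    // atomicAdd is needed because many threads may hit the same cell concurrently.
    __global__ void cenergy_scatter(float *energygrid, int nx, int ny, float spacing,
                                    const float4 *atoms, int numatoms, float gridz) {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= numatoms) return;
        float4 a = atoms[n];                              // x, y, z, charge
        float dz = gridz - a.z;
        for (int j = 0; j < ny; ++j) {
            float dy = j * spacing - a.y;
            for (int i = 0; i < nx; ++i) {
                float dx = i * spacing - a.x;
                float contrib = a.w * rsqrtf(dx*dx + dy*dy + dz*dz);
                atomicAdd(&energygrid[j*nx + i], contrib);   // serialized when threads collide
            }
        }
    }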
Though speaking of compilers on the device: there is that new NVIDIA C++ compiler, which I haven't tried yet. So here they explicitly calculate stuff that's going to be used again and again: play the game, be your own optimizer, don't trust the compiler. And when you're updating the target potential, you have to serialize that, in case some other thread is updating it, so it's an atomic operation. Here's a case where having lots of threads might make that not so bad: this thread hangs on the atomic operation, but there's going to be another warp somewhere else that's waiting, ready to run. So this is the work-efficiency concept again.

So, what's happening here: the input is the source atoms that have charges; the output is the grid cells whose potentials we want to compute. Basically what they're saying is there's a thread for each input atom, and it computes output for all the nearby grid cells — maybe not all the grid cells in the universe, maybe just the nearby ones. So the output is scattered. There are ways to un-scatter it, like: you write pairs of output coordinate and output value, you sort that array, and then you do a reduction on it; in fact that's a valuable technique.

So this is the input-oriented version: you iterate over the input, and you don't iterate over all the outputs, only the outputs that are close to the inputs. And they don't like this method. They just said: it's easy, but there's that scattered write to the output; each thread computes all the output grid cells it affects and does an atomic add to sum them in. So this is a bad idea. They presented it to you just to give you a chance to see that it's a bad idea, because it's something you might think of doing, and they're saying: don't.

Okay, that was 16.1: it introduced the problem and gave you a naive way to solve it.

So this next one will be a better way to solve it: more output oriented. The previous one was more input oriented; this one is output oriented, meaning the outer loop iterates over the output, the target grid cells that we want to compute. So for each target grid cell, we iterate over the atoms and add them in. Now, with a naive data structure here, we have to iterate over all the atoms, because we don't know which atoms are close. You could imagine putting the atoms in some sort of bucketing structure: before you did this computation, you bucketed the atoms into buckets, maybe the same size as the grid cells or somewhat coarser. And, of course, bucketing the atoms
is a little harder than it looks to do in parallel, and that's something I've actually worked on; my ideas, I like them, they worked very well. So sorting the atoms so they're in buckets, and then for each output grid cell just looping over the close atoms, is going to make this faster, but you need that initial step.

You see, the trouble with building a grid containing all the atoms, where each cell contains the atoms in it, is that the number of atoms in each grid cell is variable and unpredictable. So you have to have some data structure that can handle that. I guess what they might teach you in something like CS 1 would be a linked list of all the atoms in each grid cell, and if you know me, if I say it's something they teach in CS 1 or Math 1, I'm going to follow that by saying it's a bad idea. Linked lists on a parallel computer are a very bad idea.

I'll tell you two things I have done. One thing is I make two passes through the input data. In the first pass I do a histogram on the input data and count how many atoms will go in each cell; if the grid is not too big, that histogram will fit in fast shared memory. That's the first step. Then I allocate a ragged array for the grid. A ragged array has a dope vector: it allocates just as much space in each cell as the atoms that are going to go in that cell, with the dope vector pointing to the start of each cell. So it's one big array that holds all of the atoms in all of the cells, and the dope vector points to where each cell's atoms start. You can compute the dope vector from the histogram; it's a scan operation — scans are useful. Then you make a second pass through the input atoms and populate the grid, and now we know the atoms that are close to each grid cell. That's one way to do it.

A second way is to double the memory usage temporarily. For each atom, we compute what cell it will go into and create an ordered pair of cell number and atom number, and then we sort by cell number. This doubles the memory, because each atom now carries a cell number (actually we could compute that dynamically — there's an interesting time-space trade-off there). Once it's sorted by cell number, we have all of the atoms in each cell gathered together; we just have to be able to get at them fast, so we do a scan operation through the array — scans are useful, I keep saying — and now we have a dope vector pointing to the start of the atoms in each cell. So those are two different ways to do it.
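Here is a compact sketch of that first, two-pass approach: histogram, then an exclusive scan to build the dope vector, then a second pass to place the atoms. It's my illustration of the idea under assumed names, done on the device and using Thrust for the scan rather than a hand-written one; cellOf[i] is assumed to already hold the cell index of atom i:

    // Two-pass bucketing of atoms into grid cells.
    #include <cuda_runtime.h>
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    __global__ void histogram(const int *cellOf, int natoms, int *count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < natoms) atomicAdd(&count[cellOf[i]], 1);          // pass 1: count atoms per cell
    }

    __global__ void place(const int *cellOf, int natoms, int *cursor, int *bucketed) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < natoms) {
            int slot = atomicAdd(&cursor[cellOf[i]], 1);          // pass 2: claim a slot in this cell's run
            bucketed[slot] = i;                                   // store the atom index
        }
    }

    void bucketAtoms(const thrust::device_vector<int> &cellOf, int ncells,
                     thrust::device_vector<int> &dope,            // ncells+1 entries: start of each cell's run
                     thrust::device_vector<int> &bucketed) {      // natoms entries: atom indices grouped by cell
        int natoms = cellOf.size();
        thrust::device_vector<int> count(ncells + 1, 0);
        int threads = 256, blocks = (natoms + threads - 1) / threads;
        histogram<<<blocks, threads>>>(thrust::raw_pointer_cast(cellOf.data()), natoms,
                                       thrust::raw_pointer_cast(count.data()));
        dope.resize(ncells + 1);
        thrust::exclusive_scan(count.begin(), count.end(), dope.begin());   // dope vector = prefix sums
        thrust::device_vector<int> cursor = dope;                 // running write position per cell
        bucketed.resize(natoms);
        place<<<blocks, threads>>>(thrust::raw_pointer_cast(cellOf.data()), natoms,
                                   thrust::raw_pointer_cast(cursor.data()),
                                   thrust::raw_pointer_cast(bucketed.data()));
        cudaDeviceSynchronize();
    }

After this, the atoms near output cell c are bucketed[dope[c]] through bucketed[dope[c+1]-1] for that cell, plus the neighboring cells.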
The thing with the sorting approach is that sorting in parallel can be slow. In fact, in some of my programs working with parallel stuff like this, the slowest part of the whole program is the sort, because parallel sorts involve multiple passes through the data. They're using bucket sorts and, sorry, radix sorts; quicksort, the CS 1 favorite, is a horrible idea here — radix sort is in many cases much faster, and it also parallelizes nicely. But it's still slow, because it involves making multiple passes, reading and writing the data. In any case, this is how I would improve this: I wouldn't, down here, iterate over all of the atoms; I would try to have some sort of spatial data structure imposed on the atoms, so that for each output grid cell I know the nearby atoms — and they're not just the atoms in the same cell, they're the atoms within several cells of the target cell, of course.

Okay, redundant computation: they're trying not to repeat work. So what they've done here is still sequential code, but they're prepping it; they're doing a layout and a setup to prep it for the parallel code. And what they've done with these source-code changes is that the output energy grid array doesn't get accessed as much, but there are more accesses to the input. More reads and fewer writes — you might be thinking that that's good, and you'd be right — and some simple extra calculations on the coordinates.

Again, remember: more often than not we are data limited. Moving the data takes more time than computing with the data. That's why a lot of the demos of parallel processing speed work with matrix multiplication: that's a compute-bound application, and it shows off the parallel computer very nicely. So if you have to do more computation to get less data movement, that's probably a win. And when they look at the cons here: many more calculations on the coordinates — not so bad, probably; more accesses to the atom array — but those are read-only accesses, and if you're lucky they're to shared memory. So: more calculations, and maybe slower running on the host.

Okay, so what they're talking about here, the intellectual content, is that we're going to partition the data, and we have to figure out what we can put in each thread block, because the thread block is sharing the same shared memory; you figure out how much you can fit in one thread block.

So this is gather parallelization: each output grid cell's potential depends on many inputs.
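A minimal sketch of what such a gather kernel can look like — again my reconstruction with assumed names, not the slide code, and assuming a 2D grid of thread blocks covering the output slice: one thread per output grid point, accumulating into a register and writing the result exactly once, with no atomics:

    // Gather: one thread per output grid point; each thread reads many (read-only) atoms.
    // Accumulation happens in a register, so no atomicAdd is needed on the output.
    __global__ void cenergy_gather(float *energygrid, int nx, int ny, float spacing,
                                   const float4 *atoms, int numatoms, float gridz) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // grid x index
        int j = blockIdx.y * blockDim.y + threadIdx.y;    // grid y index
        if (i >= nx || j >= ny) return;
        float x = i * spacing, y = j * spacing, energy = 0.0f;
        for (int n = 0; n < numatoms; ++n) {              // every thread scans the atom list
            float4 a = atoms[n];                          // x, y, z, charge
            float dx = x - a.x, dy = y - a.y, dz = gridz - a.z;
            energy += a.w * rsqrtf(dx*dx + dy*dy + dz*dz);
        }
        energygrid[j*nx + i] = energy;                    // single write per output cell
    }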
But what we're doing here is that our outer loop is over the output cells: for each output cell, we then loop over the input atoms that affect it. So there'll be a thread for the output cell, which iterates over the input. Now, you've got to be worrying about thread divergence, of course, but okay: this is gather parallelization. The previous version was scatter parallelization, where each thread wrote to many scattered places; here each thread reads from many scattered places and gathers data instead of scattering data.

Okay, I'll skip through this a little; it's iterating over obvious things. So in this gather kernel we don't have an atomic add or increment operation to update the output cells. We don't have that atomic thing, which is very slow because it serializes, so right there, bang, we get a big performance win.

So one moral of this is that what's fast in sequential code is not always what's fast in parallel code, because of the differing resources on the two architectures. And cache effectiveness is important, often, because it improves data speed. But again, this is different on a CPU: on a CPU you read from memory and it automatically caches a whole line from memory, perhaps, and scattered reads might slow that down. If I go back to the previous slide here, this scatter... although I actually once wrote a C program trying to determine the effect of bad cache performance on a CPU, and I could not see an effect. I thought that if I was randomly reading from so much different data that it wouldn't all fit in the cache, I'd be thrashing the cache and would see performance drop off. I did not see it. So it was weird. I think Intel was smarter than me, and they were caching in a sophisticated enough fashion that they still ran my program efficiently.

Okay — oh, I just saw your question, Isaac: is the choice significant for performance? What sort of choice here — how you loop? Well, you can ask me again later. So, fast sequential code: it depends on temporary arrays, summing in, okay. Sorry I didn't see it earlier; my second laptop is a touch out of my direct line of sight, off to the side a little, so I'm focused on the primary laptop.

Okay: doing pre-computation stuff, tricks. The executive summary here is we've got tricks to use the threads more efficiently; I'll just leave it at that, sort of.
A thread might compute several potentials, and so on. Unrolling a loop means you do several iterations in one pass; in some cases it improves performance, and nowadays optimizing compilers do it without telling you. They're trying to reuse temporary results; the way they walk through the spatial points gives them more efficiency, and then finally, at the end, they do one update of the output (there's a small sketch of this idea below). Let me skip over that somewhat. Okay, so what this was: they didn't complete their example here, actually, but it was a start at showing you how to take this potential computation, which, by the way, is an important problem — there are papers on this; this potential calculation is the sort of problem that's worth spending time on.

Okay. So, now some stuff on thinking about parallel programs in general, not specifically NVIDIA; we've gone up to a higher level of thinking about parallel programming. Mostly motherhood stuff: decompose the problem into sub-problems and so on.

There are different types of scaling here. Strong scaling is when you can solve the same problem faster; weak scaling is when you can solve a bigger problem in the same time. And weak scaling is easier to do than strong scaling: if you've got a million cores, that doesn't mean you can solve a problem a million times faster, but it may mean you can solve a million-times-bigger problem.

The point about all of this — you'd like to always think of what your goal is, strategy, not just tactics — is that the scaling is tactics. The reason we spend time on this is to do better science, to solve problems that we couldn't solve before at all. Weather prediction: I would like to predict tomorrow's weather before tomorrow comes. So if it's faster, I can do something I couldn't do before, like predict tomorrow's weather fast enough that it's useful to me.

Okay, so now I'm talking about shared memory: OpenMP, and OpenCL, which I'll have a little blurb on later, and a variant, Apple's version of it. Now, message passing is a different thing that's been taught over in CS somewhat. One reason — okay, you're hiring me to give you opinions, I guess, to some extent — I believe that shared memory algorithms are something people do not think about; it's an important technique that's under-used.
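Going back for a moment to the thread-coarsening and unrolling trick from the potential-map slides: this is my sketch of the general idea (an assumed coarsening factor of 4, not the slides' code) — each thread computes several adjacent grid points in x, so the per-atom dy and dz terms and the charge are fetched once and reused, and each output cell is still written exactly once at the end:

    // Thread coarsening: one thread handles COARSEN adjacent x points.
    constexpr int COARSEN = 4;

    __global__ void cenergy_coarsened(float *energygrid, int nx, int ny, float spacing,
                                      const float4 *atoms, int numatoms, float gridz) {
        int i0 = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;  // first of my x points
        int j  =  blockIdx.y * blockDim.y + threadIdx.y;
        if (j >= ny || i0 >= nx) return;
        float y = j * spacing;
        float energy[COARSEN] = {0.0f};                              // accumulate in registers
        for (int n = 0; n < numatoms; ++n) {
            float4 a = atoms[n];
            float dy = y - a.y, dz = gridz - a.z;
            float dyz2 = dy*dy + dz*dz;                              // reused across the COARSEN points
            for (int k = 0; k < COARSEN; ++k) {
                float dx = (i0 + k) * spacing - a.x;
                energy[k] += a.w * rsqrtf(dx*dx + dyz2);
            }
        }
        for (int k = 0; k < COARSEN && i0 + k < nx; ++k)             // single store per output cell
            energygrid[j*nx + i0 + k] = energy[k];
    }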
I think if people want to do something fast, they immediately jump to something like MPI, or jump to the sophisticated external cloud-based things, Hadoop and cloud-based techniques. And I think that for many applications that is not necessary; you can work in shared memory. When you want to do something fast, the first thing is: don't use a GPU. On the host, do it on the multi-core Intel, using the parallel tools: OpenMP, OpenCL, some stuff I haven't shown you.

You know, like the laptop I'm lecturing from: this is a laptop with 128 gigabytes of memory. Why would I go to message passing? It's a very good laptop. And parallel is 256 gig — parallel has a quarter terabyte of memory, and when I bought that much memory on parallel it cost me about two and a half thousand dollars. You could buy a server for a few tens of K that, at this point, has a couple of terabytes of shared memory. That's a powerful concept, and you really ought to exploit it before you move on to message passing. Let me just show you, actually: this is my laptop; look at the memory here, 125 gig, because a little got used for something or other, and, okay, 12 basically — 12 threads running at once; that's the 6 cores with hyperthreading and so on.

Okay, so that's why I like shared memory. And if you do have to go to the GPU, then again: my laptop — this is a laptop — has 16 gigabytes of memory on the GPU. It's got the latency, blah, blah, blah, but the point is, it's available to all of the 3,000 CUDA cores, and this is a laptop. And if you go to parallel again, parallel's GPU has 48 gigabytes on it. So why go to message passing? Most problems will fit in that much data. This is my opinion, but it's a very strongly held opinion: shared memory is not being used enough.

Okay, in any case: data sharing. You've got to do it right, of course; especially read-only sharing is nice. And task groups: what's happening in the second paragraph here is that, effectively, you can be doing I/O with one stream and computation with another stream, and they overlap — good work efficiency. We saw a little of that when I showed you CUDA streams. And you've got to synchronize stuff, especially threads in a block, perhaps with barriers. In a barrier you're not getting work done, but there might be another stream that can do work then.
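Since overlapping transfer and compute with streams came up again, here is a bare-bones sketch of the pattern (mine, not from the slides): two streams, each doing async copy in, kernel, async copy out, so one stream's copies can overlap the other's compute. The kernel name, chunk layout, and pinned host buffer are assumptions:

    // Overlapping data movement and computation with two CUDA streams.
    // The host buffer must be pinned (cudaMallocHost) for the async copies to truly overlap.
    #include <cuda_runtime.h>

    __global__ void process(float *d, int n) {            // stand-in kernel: double each element
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    void run_chunks(float *h_data, int nchunks, int chunk) {  // h_data: pinned, nchunks*chunk floats
        cudaStream_t s[2];
        float *d[2];
        for (int k = 0; k < 2; ++k) {
            cudaStreamCreate(&s[k]);
            cudaMalloc(&d[k], chunk * sizeof(float));
        }
        for (int c = 0; c < nchunks; ++c) {
            int k = c % 2;                                 // alternate between the two streams
            float *h = h_data + (size_t)c * chunk;
            cudaMemcpyAsync(d[k], h, chunk * sizeof(float), cudaMemcpyHostToDevice, s[k]);
            process<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d[k], chunk);
            cudaMemcpyAsync(h, d[k], chunk * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();                           // wait for both streams to drain
        for (int k = 0; k < 2; ++k) { cudaStreamDestroy(s[k]); cudaFree(d[k]); }
    }

Within one stream the copy-kernel-copy sequence stays ordered, so reusing each stream's device buffer is safe; the overlap happens between the two streams.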
They're just talking here about different models; nothing deep. Single program, multiple data: this is NVIDIA's flavor of SIMD, single instruction, multiple data streams. It's "single program" because different threads can be at different instructions in the program, multiple data. In any case, different fork-joins up at the host level and so on; nothing really new here. So, single program means the threads are at different points in the same program; nothing really deep in the slide.

There are other ways you can parallelize. We saw this in OpenMP: if we go way, way back, I showed you a Fibonacci program where you recursively fired off a separate task that went into a queue to get run when resources were available. Loop parallelism: iterate, with the loop body run in parallel; you see this stuff at the host level also, in OpenMP. And fork-join: thread divergence is essentially a special case of fork-join, sort of.

Okay, this is a nice slide, I'd say: different ways you can parallelize. You can do tasks in parallel, you can iterate over the data in parallel, or data flow, which would be the same data but different stages in processing the data done in parallel, so it's a pipeline. That's a nice way to look at things. Of course, this can be less efficient, or misleading.

Nothing really deep on this slide, but one objection I have is that message passing has a horrible overhead. Really nothing interesting on this slide to my eye; the points are all obvious ideas, and so on.

Okay, I've got a slight beef with this slide here: if you make it fast enough, then you can do different stuff. There's a proverb that quantity has a quality all of its own — a military proverb, actually: if you have enough quantity, then the quality starts changing. And that's the point here. It just reminds me: a few decades ago, they started using computers to compute chemical synthesis paths, and chemists would start writing papers on "gee, I computed a synthesis pathway for this new molecule" and get a paper out of it. And other chemists would be sort of insulted: there was nothing interesting, they just took this tool, turned the crank again, got another paper out. And, yeah, but...
But there was a guy, a chemistry professor at Harvard when I was a grad student there, named E. J. Corey, and I knew one of his grad students; in fact, he was using computers for chemical synthesis, and I was working in this area, and he got some of those insults. And then some years later — ten years later or so — I'm reading the morning newspaper one morning in October, and there's an article with E. J. Corey on page 1: he'd just gotten the Nobel Prize in chemistry. So when he was doing it, people were saying, yeah, it's not interesting. It became interesting.

Okay. So, just talking about stuff here: doing parallel computing faster and so on, nothing interesting there; applications, why you'd want to do it, okay. A backup slide: protein aggregation, the folding process. Some degenerative diseases are caused by proteins folding up wrong: something gets in the way of the protein folding, prevents it folding in the normal, good way, so it folds in a bad way and can't do its normal function, and you get sick.

Okay, so they're trying to screen drugs against, say, some flu viruses and things; this is what I was telling you a few minutes ago. Electrostatic interactions play a role. A big thing is the water, because water is a polar molecule, which means it's got a plus area and a minus area — water is one of the most polar molecules, in fact. And so how the water molecules fit around the folded protein affects what happens. And some buzzwords here: boundary integral problem, fast multipole method; the fast multipole method is a way of computing what's going on here.

Okay, there's a Wikipedia article on the n-body problem. What's happening here — maybe it's so small you can't read it — is that computing these potentials is also called an n-body problem. There's a different use of the term n-body, computing orbits, let's say; this is an n-body problem for electrostatics. With a lot of bodies the naive way is quadratic time, so there are ways to speed it up; a multipole expansion is an approximation expansion. And there are lots of papers on this.

Okay, and so: a strategy for doing something faster in parallel. Again, you've got an application, you just want to use this course to do something faster. Well, the first thing you would do, as a domain scientist,
As a domain scientist, you see if there's a package, a library, that you can use. So if your 468 01:12:26.369 --> 01:12:37.109 application is doing fast Fourier transforms, then you would use NVIDIA's fast Fourier code, a parallel package, quote unquote, for example. 469 01:12:37.109 --> 01:12:40.529 You don't even have to know what a core is, that sort of thing. 470 01:12:40.529 --> 01:12:44.250 And MATLAB has 471 01:12:44.250 --> 01:12:57.989 parallel tools. Then the next level is you're writing C++ code with pragma fragments, like OpenACC, and also Thrust, which I'll show you later; it's a parallel version of the standard template library routines with some parallel algorithms, like 472 01:12:57.989 --> 01:13:05.850 scan. Okay, and then that's the first thing. The next thing is you rethink your algorithm. 473 01:13:06.989 --> 01:13:10.560 And then you start getting more fundamental about things. 474 01:13:11.579 --> 01:13:24.539 Nothing deep there; you should know that. So eventually you have to start thinking deeply about your algorithm, because a good parallel algorithm is sometimes not a good sequential algorithm. 475 01:13:24.895 --> 01:13:25.465 Okay, 476 01:13:25.465 --> 01:13:38.274 and then you have to start worrying about things at this level. And there are tools actually in between CUDA and OpenACC, from NVIDIA Labs or somewhere, for example; 477 01:13:38.515 --> 01:13:40.435 it's higher level than the raw CUDA, 478 01:13:40.465 --> 01:13:49.555 but it gives you more control over threads and warps and blocks than OpenACC; it's in between. 479 01:13:49.890 --> 01:14:00.359 So, what's the appropriate data structure? Not stuff involving pointers, not stuff involving trees, and not so much recursion. 480 01:14:00.359 --> 01:14:06.840 Okay, that was 481 01:14:06.840 --> 01:14:12.569 some more general philosophy on parallel thinking. 482 01:14:19.800 --> 01:14:25.619 Silence. 483 01:14:25.619 --> 01:14:30.989 This is a backup. 484 01:14:36.689 --> 01:14:41.310 Yeah, a little blurb on MPI here. 485 01:14:44.250 --> 01:14:49.109 I'm going to go through it fast, but. 486 01:14:51.210 --> 01:14:59.340 So, the point about MPI: you've got distributed nodes. They're more loosely coupled than on a GPU. 487 01:14:59.340 --> 01:15:05.729 So, they don't share any resources, but you can send messages back and forth 488 01:15:05.729 --> 01:15:08.850 and synchronize stuff. 489 01:15:08.850 --> 01:15:19.199 So the nodes have to be more stand-alone. You've got a nice API. Your model is, you're sending and receiving messages 490 01:15:19.199 --> 01:15:22.380 and waiting and so on. 491 01:15:22.380 --> 01:15:32.069 Okay, if you want to add two vectors, you'd send messages around, and then the sums, 492 01:15:32.069 --> 01:15:35.729 you know, a process would send a message back to you, and you wait for it. 493 01:15:35.729 --> 01:15:41.039 Okay. 494 01:15:41.039 --> 01:15:46.470 The obvious arguments here: 495 01:15:46.470 --> 01:15:51.569 sending, receiving, and so on. 496 01:15:51.569 --> 01:15:57.569 If you had to invent this, this is what you would invent it to look like; it's done nicely. 497 01:15:59.010 --> 01:16:03.840 And there's checking of error returns. 498 01:16:03.840 --> 01:16:16.079 It's nice to check for errors. And there's waiting, there are barriers; you've got to wait for results and so on. 499 01:16:16.079 --> 01:16:19.560 You can imagine a wait could be a blocking wait or a non-blocking wait. 500 01:16:19.560 --> 01:16:23.369 Okay, um.
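[Aside on that add-two-vectors-by-messages idea: a minimal MPI sketch in C++, not from the slides; the chunk size and message tags are made up, and it assumes at least two ranks. Rank 0 sends each worker a slice of the two vectors, each worker adds its slice and sends the result back, rank 0 blocks on the receives, and everyone meets at a barrier. The MPI_* calls are the standard MPI API.]

    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) { MPI_Finalize(); return 0; }       // needs at least one worker
        const int chunk = 4;                                // elements per worker (illustrative)
        if (rank == 0) {
            // Coordinator: send each worker its slices of a and b, then wait for the sums.
            std::vector<double> a((nprocs - 1) * chunk, 1.0), b((nprocs - 1) * chunk, 2.0);
            std::vector<double> c((nprocs - 1) * chunk);
            for (int w = 1; w < nprocs; ++w) {
                MPI_Send(&a[(w - 1) * chunk], chunk, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
                MPI_Send(&b[(w - 1) * chunk], chunk, MPI_DOUBLE, w, 1, MPI_COMM_WORLD);
            }
            for (int w = 1; w < nprocs; ++w)                // blocking wait on each worker's reply
                MPI_Recv(&c[(w - 1) * chunk], chunk, MPI_DOUBLE, w, 2, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            std::printf("c[0] = %g\n", c[0]);               // expect 3
        } else {
            // Worker: receive the two slices, add them, send the result back.
            std::vector<double> a(chunk), b(chunk), c(chunk);
            MPI_Recv(a.data(), chunk, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(b.data(), chunk, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int i = 0; i < chunk; ++i) c[i] = a[i] + b[i];
            MPI_Send(c.data(), chunk, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
        }
        MPI_Barrier(MPI_COMM_WORLD);                        // everyone waits here before finishing
        MPI_Finalize();
        return 0;
    }

Compile with mpicxx and run with something like mpirun -np 4 to try it.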
501 01:16:23.369 --> 01:16:28.619 The compute process waits for messages telling it what to compute, computes, and then. 502 01:16:29.819 --> 01:16:33.390 Right, types of barriers. 503 01:16:55.255 --> 01:17:02.635 Okay, this chapter 18 has got a funny numbering; the intro would be 1, but it's not called 1. 504 01:17:03.149 --> 01:17:08.430 And. 505 01:17:08.430 --> 01:17:12.600 Um, the stencil again is, like, 506 01:17:12.600 --> 01:17:19.260 in convolution terms, you know, the other neighboring cells that affect this cell. 507 01:17:21.270 --> 01:17:30.239 I'm going to go through it fast. You're going to tile; everything has to be bunched up into tiles. It doesn't matter whether it's MPI or CUDA. So. 508 01:17:30.239 --> 01:17:36.390 And we do the third one. 509 01:17:48.210 --> 01:18:02.039 And you could always combine MPI and CUDA: if you had several GPUs, you could use MPI to connect them together. Perhaps there are better ways to do it. But. 510 01:18:02.039 --> 01:18:09.479 NVIDIA is providing ways to connect different GPU cards together, if you have the IBM 511 01:18:09.479 --> 01:18:12.989 POWER workstation architecture. 512 01:18:12.989 --> 01:18:24.029 And this is how some of the fastest supercomputers work: you have a lot of the IBM workstations. Well, they're not separate workstations, obviously; they're, you know, trays 513 01:18:24.029 --> 01:18:27.840 in your rack, but the point is, that's the architecture. But then each one 514 01:18:27.840 --> 01:18:31.170 can take maybe 6 515 01:18:31.524 --> 01:18:44.125 NVIDIA GPU cards, and the 6 have a very fast bus called NVLink connecting them together. It's something we do not have access to on parallel. This does not exist for an Intel backbone; 516 01:18:44.515 --> 01:18:47.784 it only exists if your backbone is the IBM. 517 01:18:48.119 --> 01:18:54.720 But if you do that, then you can connect the 6 together with a very fast bus, 518 01:18:54.720 --> 01:19:01.140 fast enough that it's worthwhile having all of their memories in the same address space. So you've got 6 of these, 519 01:19:01.140 --> 01:19:05.279 48 gigabytes per card; you've now got 6 times 48 gigabytes. 520 01:19:05.279 --> 01:19:12.359 You've now got a quarter terabyte of common shared memory available to the GPUs without going back to the host. 521 01:19:13.380 --> 01:19:25.050 And so now you can have fun with that. There's going to be latency and such, but it's better than going through the host. So that's a tool that's available, and this is what is actually used on some of the 522 01:19:25.050 --> 01:19:33.720 fastest computers you'd see right at the top of the Top500 list. Okay. So nothing much interesting here; 523 01:19:33.720 --> 01:19:38.430 it's sort of all the same idea, I think. 524 01:19:41.515 --> 01:19:55.795 There are obviously messages to scatter and gather and reduce, but okay. That's a good point to stop. Let me just say what's coming up in the future: we've got a few more chapters in the book, 525 01:19:56.100 --> 01:20:01.800 some related programming things here. 526 01:20:01.800 --> 01:20:07.619 And then what I'll do is, I've got some other stuff I've typed in 527 01:20:07.619 --> 01:20:14.100 over the last few years that I taught the course; you can read it. I've got some more information on NVIDIA that I've typed in. 528 01:20:14.100 --> 01:20:23.819 And then we'll see a tool called Thrust, which is again this parallel 529 01:20:24.899 --> 01:20:28.439 extension to the STL, sponsored by NVIDIA. 530 01:20:28.439 --> 01:20:31.619 So, it's not just the STL; it also has routines that are useful only in parallel.
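[Aside: a minimal sketch of the Thrust style in CUDA C++, not one of their example programs; the numbers are made up. A device_vector lives in GPU memory, and inclusive_scan is the parallel prefix sum that comes up again just below.]

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <cstdio>

    int main() {
        thrust::device_vector<int> d(8, 1);                     // eight 1's, stored on the GPU
        thrust::inclusive_scan(d.begin(), d.end(), d.begin());  // parallel prefix sum, in place
        int last = d.back();                                    // copies one element back to the host
        std::printf("last prefix sum = %d\n", last);            // prints 8
        return 0;
    }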
531 01:20:31.619 --> 01:20:35.430 Routines like 532 01:20:35.430 --> 01:20:45.840 scatter and gather and scan, and you'll be amazed what you can do with scan. And we'll teach Thrust by examples; they have a bunch of example programs. 533 01:20:45.840 --> 01:20:50.520 So, what we saw today is more parallel stuff from the 534 01:20:50.520 --> 01:20:56.909 slides here, and then what I typed is my summary of some of the key features 535 01:20:56.909 --> 01:21:06.510 of NVIDIA. Now, coming up in the course: in a couple of weeks NVIDIA has a developer conference coming up soon. It's free. 536 01:21:06.510 --> 01:21:14.369 And so if you want to learn more, I would say do a free registration as an NVIDIA developer, and then, you know, 537 01:21:14.369 --> 01:21:22.949 watch the NVIDIA developer conference and see what fun things there might be. Let me just show you here for a minute or two. 538 01:21:34.529 --> 01:21:38.340 Contrarian. 539 01:21:38.340 --> 01:21:44.039 The developer program, and you can browse around here, so. 540 01:21:45.750 --> 01:21:54.060 And have fun too. So if I'm going too slowly for you and you want to learn more stuff, then 541 01:21:54.060 --> 01:22:02.220 you can look at this. And this is actually the GPU Technology Conference; some of that might cost money. 542 01:22:03.329 --> 01:22:06.510 But we want to get to the developer conference. 543 01:22:06.510 --> 01:22:18.810 And the resources and stuff, it's free, so. 544 01:22:21.329 --> 01:22:27.449 Yeah, so you might have fun going through some of that. Again, the, 545 01:22:27.449 --> 01:22:31.649 so, when I looked at this page earlier, it said free. Okay. 546 01:22:31.649 --> 01:22:38.039 You see, it averages out: they charge you a lot of money for some things, but this thing here is free, so it averages out. 547 01:22:38.039 --> 01:22:42.899 Okay, that's enough stuff for today. Both of us want lunch. 548 01:22:42.899 --> 01:22:46.380 I'll stay around for a minute or two in case there are questions. 549 01:22:46.380 --> 01:22:54.300 Other than that, see you Thursday. 550 01:23:02.069 --> 01:23:07.979 Yeah, it's not. Oh, okay.