WEBVTT 1 00:00:00.000 --> 00:00:03.419 Message. 2 00:00:03.419 --> 00:00:06.780 Okay. 3 00:00:06.780 --> 00:00:11.218 Hello. 4 00:00:14.788 --> 00:00:18.509 1. 5 00:00:18.509 --> 00:00:22.769 That's. 6 00:00:33.179 --> 00:00:37.439 I wonder if. 7 00:00:37.439 --> 00:00:41.909 Hello. 8 00:00:48.210 --> 00:00:51.450 That. 9 00:00:57.270 --> 00:01:01.560 Okay. 10 00:01:06.239 --> 00:01:09.840 Huh. 11 00:01:09.840 --> 00:01:18.540 Okay, so good afternoon. This is class 13, if I'm counting right. 12 00:01:19.680 --> 00:01:28.859 Parallel computing, not probability. So we're continuing on with the NVIDIA accelerated computing material, so it's a mixture of hardware 13 00:01:28.859 --> 00:01:33.329 and some algorithms and so on; we'll see more algorithms today. 14 00:01:33.329 --> 00:01:40.079 And I have some links here for some recent NVIDIA material. I'll turn it into a homework, actually, 15 00:01:40.079 --> 00:01:43.079 so you can read it and answer questions on it. 16 00:01:44.159 --> 00:01:50.939 And so, where we are: 17 00:01:50.939 --> 00:01:54.870 we'll look at a new example today, 18 00:01:54.870 --> 00:02:03.390 the potential calculation, and then look at some ways to optimize programs for parallel computing. 19 00:02:03.390 --> 00:02:08.460 So, going on here. 20 00:02:08.460 --> 00:02:11.490 Okay. 21 00:02:11.490 --> 00:02:15.900 Okay. 22 00:02:15.900 --> 00:02:21.930 Okay. 23 00:02:21.930 --> 00:02:24.990 And that is actually working. 24 00:02:24.990 --> 00:02:28.050 So. 25 00:02:29.250 --> 00:02:32.729 Let's try something like this, even. 26 00:02:35.819 --> 00:02:40.830 Okay, so we're going to see some of these programming techniques. 27 00:02:40.830 --> 00:02:45.240 And the thing is that we have 28 00:02:45.240 --> 00:02:51.750 some sort of molecule, a protein or whatever, and we want to calculate some potentials. Or there are 29 00:02:51.750 --> 00:02:56.039 planets and suns in the Milky Way, or something like 30 00:02:56.039 --> 00:03:01.439 that, and that's the kind of thing we're going to be doing. 31 00:03:21.235 --> 00:03:30.354 So, the atoms are at irregular places; each atom, you know, has got some charge, or some mass. 32 00:03:30.629 --> 00:03:35.099 And then we want to calculate the potential or the force field 33 00:03:35.099 --> 00:03:43.710 at each regular grid point. And so for the force it's an inverse square law; for the potential it's an inverse distance law. 34 00:03:43.710 --> 00:03:49.710 Now, if I stop right here, think how you would do it before I go to the next slides. 35 00:03:49.710 --> 00:03:55.259 You're doing a lot of iterating, and the first question you would ask yourself is what 36 00:03:55.259 --> 00:04:04.830 you want your outermost loop to be. The outermost loop could iterate over the atoms, or iterate over 37 00:04:04.830 --> 00:04:14.400 the grid points, and then the innermost loop iterates over the other. So, one way you might do this is that you could iterate over the atoms, 38 00:04:14.400 --> 00:04:18.089 and for each atom you calculate its effect on each grid point 39 00:04:18.089 --> 00:04:21.149 and sum it into a subtotal for each grid point. 40 00:04:21.149 --> 00:04:26.819 The opposite way is that you iterate over the grid points, and for each grid point you 41 00:04:26.819 --> 00:04:33.629 then iterate over the atoms and compute each atom's contribution, just compute it right away for that grid point. So, 42 00:04:33.629 --> 00:04:36.658 those are the two opposite philosophies.
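To make the two philosophies concrete, here is a minimal sequential sketch in C; the names natoms, ngrid, atom, pot and contrib are invented for illustration, not the slide's code:

```c
/* Scatter style: outer loop over atoms, each atom updates every grid point. */
for (int a = 0; a < natoms; a++)
    for (int g = 0; g < ngrid; g++)
        pot[g] += contrib(atom[a], g);      /* read-modify-write on pot[g] */

/* Gather style: outer loop over grid points, each one reads every atom. */
for (int g = 0; g < ngrid; g++) {
    float sum = 0.0f;
    for (int a = 0; a < natoms; a++)
        sum += contrib(atom[a], g);         /* accumulate privately */
    pot[g] = sum;                           /* one write per grid point */
}
```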
43 00:04:36.658 --> 00:04:41.369 And again, think about it, you know, 44 00:04:41.369 --> 00:04:45.778 how you would solve it yourself before I go to the next few slides. 45 00:04:45.778 --> 00:04:50.759 Your outermost loop you're going to do in parallel, 46 00:04:50.759 --> 00:04:54.298 obviously, because this is a parallel course. 47 00:04:54.298 --> 00:05:01.348 So, let's suppose the outermost loop is the atoms; you're iterating over the atoms. 48 00:05:01.348 --> 00:05:05.278 Then inside, for each atom, 49 00:05:05.278 --> 00:05:08.819 you iterate over the grid cells. So, 50 00:05:08.819 --> 00:05:16.168 when you're summing potentials at each grid point, that's going to have to be a read-modify-write operation. That's going to have to be an atomic operation. 51 00:05:16.168 --> 00:05:20.158 So, if your outermost loop is the atoms, 52 00:05:20.158 --> 00:05:25.408 you're going to have to have a read-modify-write for each grid point. 53 00:05:25.408 --> 00:05:29.488 And you're doing it in parallel over the atoms, so if you do that, 54 00:05:29.488 --> 00:05:36.478 you're going to have a lot of contention. Well, first, the atomic operations are expensive, 55 00:05:36.478 --> 00:05:44.728 and you're going to have a lot of them, and in any reasonable case you're also going to get clashes, which will slow things down. 56 00:05:45.869 --> 00:05:52.048 If that's your outermost loop, that's called a scatter operation; this will be written down in the slides. But 57 00:05:52.048 --> 00:05:55.858 can you think how you would solve it yourself? 58 00:05:55.858 --> 00:06:01.439 If you iterate over the atoms, red atom, blue atom and so on, 59 00:06:01.439 --> 00:06:06.358 then for each atom you're scattering contributions across the whole grid. 60 00:06:06.358 --> 00:06:11.459 So, the red atom, you compute its effect on these grid points; 61 00:06:11.459 --> 00:06:14.908 here you've got 1, 2, 3, 62 00:06:14.908 --> 00:06:24.928 you've got 6 times 4, you've got 24 grid points here. The red atom contributes to the subtotals for the 24 grid points, and so it's scattering 63 00:06:24.928 --> 00:06:30.119 the contribution of the red atom over the whole grid; it's a scatter paradigm. 64 00:06:31.348 --> 00:06:35.639 And it actually turns out that if you have a sequential machine, that's efficient. 65 00:06:36.809 --> 00:06:43.559 But the problem on a parallel machine is you have to have these atomic read-modify-writes 66 00:06:43.559 --> 00:06:47.819 to add the contributions into each 67 00:06:47.819 --> 00:07:01.019 grid point. So the opposite method is that your outermost loop is the grid, the grid cells, and that one's in parallel over the grid cells. Each grid cell, each grid point, is say a thread. 68 00:07:02.218 --> 00:07:12.178 Then each thread iterates over all the atoms and gathers the contributions of the atoms that affect itself. 69 00:07:12.178 --> 00:07:19.108 So this method is better, because you don't have any read-modify-write issues, 70 00:07:19.108 --> 00:07:26.158 because the gather operation reads all the atoms, but all the writing is occurring onto the current 71 00:07:26.158 --> 00:07:29.908 grid point. So there's no problem 72 00:07:29.908 --> 00:07:36.689 with atomic locking, so right there you figure it's probably going to be more efficient. 73 00:07:37.769 --> 00:07:41.488 The other thing, another thing, 74 00:07:41.488 --> 00:07:46.199 is that again, you're reading
75 00:07:46.199 --> 00:07:49.798 information about all of the atoms, but 76 00:07:51.209 --> 00:07:58.798 you start thinking now, well, the atoms are constant, they're not moving, so you're reading information from this constant array. So 77 00:07:58.798 --> 00:08:07.649 you could imagine caching things already; the hardware might do it for you. It probably won't fit in constant memory, but you can see the idea, 78 00:08:07.649 --> 00:08:13.348 that all the threads are reading this constant information 79 00:08:13.348 --> 00:08:22.738 about the atoms, so there could be a big latency if the atom information is up in the global memory. But latencies are okay; this is 80 00:08:22.738 --> 00:08:28.978 the design decision that NVIDIA made: they don't care about latency, 81 00:08:28.978 --> 00:08:32.548 because they've got lots of threads, and if a thread gets blocked 82 00:08:32.548 --> 00:08:37.769 on something, another thread executes, another warp, and eventually it all gets done. 83 00:08:37.769 --> 00:08:44.818 Okay, so it will turn out that the gather paradigm on a parallel machine 84 00:08:44.818 --> 00:08:52.948 is more efficient. So this is what you would be thinking, perhaps, if you were given this as a problem 85 00:08:52.948 --> 00:08:58.889 and you had to find a solution; you might think about which would be your outermost loop. 86 00:08:58.889 --> 00:09:04.198 The grid points, and you gather information from the atoms; 87 00:09:04.198 --> 00:09:10.349 or else the outermost loop might be the atoms, and you scatter information from the atoms 88 00:09:10.349 --> 00:09:13.619 to the grid points. Okay. 89 00:09:13.619 --> 00:09:19.078 So, the rest of these two slide decks will be talking about these two choices in more detail. 90 00:09:19.078 --> 00:09:25.499 So, electrostatic potential. 91 00:09:25.499 --> 00:09:32.219 Oh, let me go back to this previous one here. 92 00:09:32.219 --> 00:09:35.698 Again, let's suppose you're thinking about this. 93 00:09:35.698 --> 00:09:40.828 A couple of other things you might think about: 94 00:09:40.828 --> 00:09:47.038 you might imagine some sort of hierarchy, some quadtree, for the atoms. 95 00:09:47.038 --> 00:09:51.778 Because if there's a bunch of atoms that are a long distance away, 96 00:09:51.778 --> 00:09:57.149 you could cluster them together and create one pseudo-molecule or something from them, 97 00:09:57.149 --> 00:10:04.649 and then figure out the contribution of that cluster, let's say, to all the distant points. So 98 00:10:04.649 --> 00:10:09.448 you can imagine some hierarchical clustering thing, and that's in fact what people do. 99 00:10:09.448 --> 00:10:18.658 And that actually cuts the time, if you're counting the number of operations. The naive thing is a quadratic number of operations, 100 00:10:18.658 --> 00:10:23.099 because each atom affects each grid point. 101 00:10:23.099 --> 00:10:27.719 Well, if you had some hierarchy of clustering, 102 00:10:27.719 --> 00:10:32.188 you actually can cut the sequential time from N squared down to N log N. 103 00:10:32.188 --> 00:10:35.788 So this is a standard method that's used for these things. 104 00:10:35.788 --> 00:10:42.389 Now, what makes it interesting: say people are doing this for the Milky Way, 105 00:10:42.389 --> 00:10:50.099 an N-body problem, and they want to compute, say, how the Milky Way will evolve. 106 00:10:50.099 --> 00:10:53.099 And so a star moves depending on the
107 00:10:53.099 --> 00:10:56.938 net gravitational force on it from all the other stars. 108 00:10:58.078 --> 00:11:01.859 Well, the problem is that the stars move around differently. 109 00:11:01.859 --> 00:11:08.249 If you look at the Milky Way, it's got these famous spiral arms. These arms are a dynamic phenomenon. 110 00:11:08.249 --> 00:11:14.219 You can imagine the speed at which the arm is rotating around the galaxy, 111 00:11:14.219 --> 00:11:21.418 but if you look at the individual stars in the arm, they're moving at a different speed. An individual star 112 00:11:21.418 --> 00:11:27.658 moves into an arm and then moves out of the arm again; these stars are moving at a different speed around the galaxy. 113 00:11:27.658 --> 00:11:33.989 It's like a group speed versus a phase speed, let's say; 114 00:11:33.989 --> 00:11:38.428 we talk about that sort of thing in physics. 115 00:11:38.428 --> 00:11:42.658 So, if you cluster the stars in this hierarchical way to 116 00:11:42.658 --> 00:11:47.698 do this potential map more efficiently, you have to update the clustering, 117 00:11:47.698 --> 00:11:52.229 because the stars move in and out of different regions. 118 00:11:52.229 --> 00:11:57.958 You've got an analogy for this dynamic thing: if you look at a water wave in the ocean, 119 00:11:57.958 --> 00:12:02.578 you look at the speed that the wave, as a wave, is moving, 120 00:12:02.578 --> 00:12:06.178 and you look at the speed of the water molecules in the wave. 121 00:12:06.178 --> 00:12:16.198 The wave as a whole is moving faster than the individual water molecules. So if you're clustering the molecules into the wave, you have to keep updating your clustering. 122 00:12:17.759 --> 00:12:25.229 In physics, the physicists talk about things, I've forgotten some of my physics, like a wave 123 00:12:25.229 --> 00:12:32.038 speed; the phase speed of a wave can actually go faster than the speed of light, but the speed at which information is moving 124 00:12:32.038 --> 00:12:35.129 is always less. 125 00:12:35.129 --> 00:12:39.928 So, in any case, you might be thinking about something like that, a hierarchical way 126 00:12:39.928 --> 00:12:44.399 to make this process more efficient, and of course then you'd have to think: 127 00:12:44.399 --> 00:12:49.499 will this hierarchical way work in parallel? Because your data structure is now weird. 128 00:12:50.698 --> 00:12:56.788 Another thing you might think of, an obvious thing to do that we haven't talked about, is called thread coarsening. 129 00:12:56.788 --> 00:13:03.269 We've just been talking about how one thread does one atom, or one thread does one grid point. 130 00:13:03.269 --> 00:13:08.849 Well, you might say, let's have a thread do 10 atoms or something, or 10 grid points. 131 00:13:08.849 --> 00:13:12.719 And that could be more efficient, 132 00:13:12.719 --> 00:13:17.759 because you might be able to localize some information in the registers or something. 133 00:13:17.759 --> 00:13:24.688 And so, if you're willing to make your programming more complicated, because with coarsening the one thread 134 00:13:24.688 --> 00:13:29.879 does more than one little piece of data, you might get faster. 135 00:13:29.879 --> 00:13:34.139 Okay, so those are the things you might be thinking of. 136 00:13:34.139 --> 00:13:38.369 Let's move on and see what they talk about here. So this is 137 00:13:38.369 --> 00:13:41.519 computing potentials, or else
138 00:13:43.889 --> 00:13:49.528 computing forces would be the same thing. 139 00:13:49.528 --> 00:13:55.349 And then you could calculate movements, just as an aside. 140 00:13:55.349 --> 00:13:59.578 You know, if it's orbits, you'd want them to come out as ellipses and that sort of thing. 141 00:13:59.578 --> 00:14:04.198 Oh, by the way, if you wonder 142 00:14:05.428 --> 00:14:09.089 how we know a formula like this here, 143 00:14:11.578 --> 00:14:17.188 well, you know, this is an obvious formula, it has to be true because we teach it in the physics class, but 144 00:14:17.188 --> 00:14:25.048 how do we know that it's correct? Well, you use this formula to calculate movement, calculate orbits, let's say. 145 00:14:25.048 --> 00:14:32.129 And if this was not an inverse law, if there was a different exponent up here, 146 00:14:32.129 --> 00:14:38.249 you would not get stable orbits. So the fact that you do, in fact, see stable orbits 147 00:14:38.249 --> 00:14:42.989 is a proof that this is the right potential. 148 00:14:42.989 --> 00:14:46.678 Well, so okay. 149 00:14:46.678 --> 00:14:49.889 Um. 150 00:14:52.558 --> 00:15:00.328 So here's the idea they're talking about here: iterate over the lattice points, where each lattice point is a separate 151 00:15:00.328 --> 00:15:04.109 thread, 152 00:15:04.109 --> 00:15:07.649 and then the inner loop iterates over the atoms. 153 00:15:07.649 --> 00:15:14.009 And they mention obvious things here, that you have a cutoff threshold and all that fun stuff. So, 154 00:15:14.009 --> 00:15:18.958 okay. 155 00:15:20.818 --> 00:15:24.119 So, 156 00:15:24.119 --> 00:15:31.499 the input is the positions of the atoms and the charge on each one, or else the mass or something. 157 00:15:31.499 --> 00:15:42.298 The position of each atom is known more accurately than the grid. So the atoms are not being snapped to grid points; 158 00:15:42.298 --> 00:15:46.859 the grid is coarser than the positions of the atoms. 159 00:15:46.859 --> 00:15:52.948 Because if you snap the atoms to grid points, you're losing some information which 160 00:15:52.948 --> 00:15:59.188 you don't have to lose. So, the thing is that the atoms are at irregular positions, and that just makes 161 00:15:59.188 --> 00:16:07.979 things messy, but again, the point is that we're going to compute the potentials at the grid points. 162 00:16:09.149 --> 00:16:12.389 Oh, now why do we want the potentials? 163 00:16:12.389 --> 00:16:18.058 Well, an application might be protein folding. The protein is very long and 164 00:16:18.058 --> 00:16:23.339 flexible, and it can fold in on itself. 165 00:16:23.339 --> 00:16:28.259 And how it folds in on itself 166 00:16:28.259 --> 00:16:31.769 affects what the protein does, because 167 00:16:31.769 --> 00:16:37.558 it's the parts of the protein that are on the outside that can interact with other molecules, 168 00:16:37.558 --> 00:16:41.609 and what's on the outside depends on how it folded. 169 00:16:41.609 --> 00:16:51.119 So, how the protein folds is important. Now, how does it fold? It folds so that different parts of the molecule align with each other and attract each other 170 00:16:51.119 --> 00:16:56.999 and minimize the potential; you're minimizing an objective function 171 00:16:56.999 --> 00:17:00.928 to find the folding. So the folding itself is the big problem, 172 00:17:00.928 --> 00:17:03.958 and very computationally intensive.
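For reference, the quantity being summed at each grid point is the standard Coulomb potential; the slide's exact constants and cutoff handling are not in the transcript, but the form is:

```latex
% Potential falls off as 1/r, force magnitude as 1/r^2 (standard Coulomb form).
V(\mathbf{r}_j) \;=\; \sum_{i=1}^{N_{\text{atoms}}} \frac{q_i}{4\pi\varepsilon_0\,\lVert \mathbf{r}_j - \mathbf{r}_i \rVert},
\qquad
\lVert \mathbf{F} \rVert \;\propto\; \frac{1}{r^{2}}
```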
173 00:17:03.958 --> 00:17:10.138 And it's important for, for example, drug design, because how the protein folds 174 00:17:10.138 --> 00:17:15.179 will control how active the protein is and what other molecules 175 00:17:15.179 --> 00:17:23.068 it will align with. And how the protein folds is predicted by computing, among other things, these potentials. 176 00:17:23.068 --> 00:17:27.838 So, it's a practical problem as well as being an interesting problem. 177 00:17:27.838 --> 00:17:32.729 So, the biophysics people like this; again, the astronomers 178 00:17:32.729 --> 00:17:37.739 like a related problem. Okay. 179 00:17:41.159 --> 00:17:47.098 So, the sequential version. This would be the first of the two choices I mentioned: 180 00:17:47.098 --> 00:17:50.759 iterate over the atoms, that's the outer loop, 181 00:17:50.759 --> 00:17:54.479 and the inner loop iterates over the lattice points. 182 00:17:54.479 --> 00:18:00.598 But again, the atom information is read-only, it's constant. 183 00:18:01.949 --> 00:18:05.098 And so these methods, they mention, 184 00:18:05.098 --> 00:18:09.269 well, they asymptotically have the same time, because asymptotically 185 00:18:09.269 --> 00:18:13.679 they compute the effect of each atom on each lattice point once. 186 00:18:13.679 --> 00:18:17.699 So, in that sense, they are equivalent. 187 00:18:20.578 --> 00:18:25.259 In that sense they're all taking the same amount of time, until you get down into the details. 188 00:18:25.259 --> 00:18:32.038 So, yeah. 189 00:18:32.038 --> 00:18:36.538 Just some interesting things in this sample C code: 190 00:18:36.538 --> 00:18:39.778 here, 191 00:18:41.398 --> 00:18:44.878 the outermost loop iterates over the atoms, 192 00:18:44.878 --> 00:18:49.739 the second loop iterates over grid cells in y, and the third loop 193 00:18:49.739 --> 00:18:58.469 iterates over grid cells in x. So what it's doing is that, for each atom, it computes something, 194 00:18:58.469 --> 00:19:01.588 and it then does 195 00:19:01.588 --> 00:19:07.348 an update on the total potential for each grid cell. So, 196 00:19:07.348 --> 00:19:12.479 again, if you're thinking ahead, this line here is going to be a read-modify-write. 197 00:19:13.979 --> 00:19:21.598 Okay, so this is the one version of the program, the version where the atoms are the outermost loop, iterating over the atoms. Now, 198 00:19:23.429 --> 00:19:32.729 if you look at this right here, there are various optimization games that you could play, if you actually cared. 199 00:19:32.729 --> 00:19:40.499 The thing is that a good optimizing compiler would play these games for you. 200 00:19:40.499 --> 00:19:44.759 What would be an example? 201 00:19:46.919 --> 00:19:51.148 Well, they're doing some stuff here they don't actually have to worry about, like 202 00:19:51.148 --> 00:19:55.378 they're computing dy and dy squared. 203 00:19:56.848 --> 00:20:01.679 Yeah, well, you know, you could just 204 00:20:01.679 --> 00:20:06.959 compute dy and dz or whatever you need, and the optimizing compiler will do that for you, 205 00:20:06.959 --> 00:20:16.348 hoisting dy squared out of the inner loop. So there are some things here that the compiler is quite good at actually fixing. 206 00:20:17.848 --> 00:20:22.078 Okay, and then what are some other things? 207 00:20:22.078 --> 00:20:29.969 Here, you could optimize the square root, but square root takes only a cycle or two nowadays, so you don't have to worry about it much.
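Here is a sketch of what such a sequential scatter version might look like for one z-slice, assuming the atoms are packed as (x, y, z, charge) quadruples; the variable names are mine, not necessarily the slide's:

```c
#include <math.h>

/* Sequential scatter: outer loop over atoms, inner loops over the grid.
   All names are illustrative. */
void potential_slice_seq(float *energygrid, int nx, int ny,
                         float gridspacing, float z,
                         const float *atoms, int numatoms)
{
    for (int n = 0; n < numatoms; n++) {            /* outer loop: atoms */
        float dz  = z - atoms[4*n + 2];
        float dz2 = dz * dz;                        /* hoisted out of inner loops */
        float q   = atoms[4*n + 3];
        for (int j = 0; j < ny; j++) {              /* second loop: grid y */
            float dy    = gridspacing * j - atoms[4*n + 1];
            float dydz2 = dy * dy + dz2;            /* hoisted out of x loop */
            for (int i = 0; i < nx; i++) {          /* third loop: grid x */
                float dx = gridspacing * i - atoms[4*n + 0];
                /* read-modify-write on the grid subtotal */
                energygrid[j*nx + i] += q / sqrtf(dx*dx + dydz2);
            }
        }
    }
}
```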
For floats, you don't care much. 208 00:20:29.969 --> 00:20:33.929 So, there are various things you could do, 209 00:20:33.929 --> 00:20:38.878 games you could play, that I don't worry about. That's what compilers will do. 210 00:20:41.068 --> 00:20:51.209 I mean, I used to joke that what I wanted from an optimizing compiler is that it would produce correct code. I didn't even care if it was fast; I just cared that it didn't destroy my program, 211 00:20:51.209 --> 00:20:54.989 because compilers had optimizers that would actually produce bad code. 212 00:20:54.989 --> 00:21:02.489 They actually produce faster code now. So, in any case, they talk about various tricks we don't need to worry about. 213 00:21:02.489 --> 00:21:05.608 Okay, I'll skip through this. 214 00:21:05.608 --> 00:21:10.499 Okay, now what they're doing here, I'll give you the executive summary. 215 00:21:10.499 --> 00:21:18.989 They're avoiding recomputing constants: if there's something in a loop which is constant inside it, hoist that up out of the loop 216 00:21:18.989 --> 00:21:24.449 and compute it once. Compilers do that nowadays; 217 00:21:24.449 --> 00:21:28.828 the optimizers are actually useful for once. Okay. 218 00:21:30.808 --> 00:21:34.648 And we'll skip all that. Okay. Now, 219 00:21:37.828 --> 00:21:43.019 how are you going to do this in parallel? 220 00:21:47.009 --> 00:21:50.038 So, there are various tricks you can do. 221 00:21:53.128 --> 00:22:03.538 I mean, the GPU might not be big enough to store the whole lattice array. Perhaps your lattice is big, say a billion points. 222 00:22:03.538 --> 00:22:08.038 And maybe you're not lucky enough to have a GPU that can store this 223 00:22:08.038 --> 00:22:14.729 array with a billion points in it. We do, actually, but, you know, most cannot. 224 00:22:14.729 --> 00:22:19.259 So, what they're talking about here is 225 00:22:19.259 --> 00:22:26.429 a slice of the map; a slice would be a chunk of the map that's small enough to fit on the GPU. 226 00:22:26.429 --> 00:22:29.729 So they would essentially, 227 00:22:29.729 --> 00:22:33.088 they are implementing a virtual memory manager, you might say: 228 00:22:33.088 --> 00:22:37.199 take a slice of the map, put it on the GPU, and process it. 229 00:22:37.199 --> 00:22:40.949 That's what they're talking about here. So, 230 00:22:40.949 --> 00:22:45.148 a slice, it's a slice of the lattice array. Okay. 231 00:22:45.148 --> 00:22:49.949 Because again, we're scattering information from the atoms over 232 00:22:49.949 --> 00:22:56.699 the lattice, right? So you might say here, instead of, for each atom, computing its effect on the whole lattice, 233 00:22:56.699 --> 00:23:06.568 we loop over slices of the lattice and compute one slice at a time, where the slice is small enough to fit on the GPU. 234 00:23:06.568 --> 00:23:11.068 And that's what they're talking about here. So I gave you the executive summary 235 00:23:11.068 --> 00:23:17.249 of that: we compute one slice, we page it back to the host, 236 00:23:17.249 --> 00:23:21.239 and then we compute the next slice, and so on. 237 00:23:21.239 --> 00:23:29.308 So, with this, we're reading the atoms more than once, but that may be okay; they're read-only. 238 00:23:29.308 --> 00:23:33.838 So that's the idea here.
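A sketch of that virtual-memory-manager idea on the host side, assuming some kernel like the ones sketched further on fills one slice, and that the row count divides evenly; all names are illustrative:

```cpp
// Keep the constant atom data resident on the GPU; page one lattice slice at a
// time through device memory and copy each finished slice back to the host.
float *d_atoms, *d_slice;
cudaMalloc(&d_atoms, (size_t)numatoms * 4 * sizeof(float));
cudaMemcpy(d_atoms, h_atoms, (size_t)numatoms * 4 * sizeof(float),
           cudaMemcpyHostToDevice);

size_t sliceBytes = (size_t)nx * sliceRows * sizeof(float);
cudaMalloc(&d_slice, sliceBytes);

for (int row0 = 0; row0 < ny; row0 += sliceRows) {
    cudaMemset(d_slice, 0, sliceBytes);              // fresh subtotals
    potential_slice_kernel<<<blocks, threads>>>(     // hypothetical kernel
        d_slice, nx, sliceRows, row0, gridspacing, d_atoms, numatoms);
    cudaMemcpy(h_grid + (size_t)row0 * nx, d_slice, sliceBytes,
               cudaMemcpyDeviceToHost);              // page the slice back
}
cudaFree(d_slice);
cudaFree(d_atoms);
```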
239 00:23:35.308 --> 00:23:44.578 Okay, now we're going to parallelize it. So the outer loop loops over atoms; each atom is a thread. 240 00:23:44.578 --> 00:23:49.348 So you can have 100,000 threads. 241 00:23:49.348 --> 00:23:53.519 NVIDIA made the design decision to say, hey, you can have lots of threads. 242 00:23:53.519 --> 00:23:58.348 If you can run only, say, 5,000 at a time and you've got 100,000 threads, 243 00:23:58.348 --> 00:24:01.949 the rest are queued up waiting to run, so that's okay, that works. 244 00:24:01.949 --> 00:24:07.318 And, in fact, it's nice to have more threads waiting than you have threads executing. 245 00:24:07.318 --> 00:24:15.898 This handles the latency issue. So latency is not a problem if you've got more threads than you've got CUDA cores. 246 00:24:15.898 --> 00:24:20.878 Okay, nothing weird there. 247 00:24:20.878 --> 00:24:27.148 So, this is a straightforward, obvious method. The only problem here: this is a scatter thing. 248 00:24:27.148 --> 00:24:32.519 For each atom, we're scattering information to the lattice points of the slice, 249 00:24:33.959 --> 00:24:39.838 and we're going to have this read-modify-write problem. 250 00:24:41.429 --> 00:24:45.088 Again, the problem with the read-modify-write is that 251 00:24:45.088 --> 00:24:48.419 it slows things down: 252 00:24:48.419 --> 00:24:54.598 thread one is acting on this lattice point, and thread two has to wait if it wants to write to this point. 253 00:24:54.598 --> 00:25:00.388 Okay, so. 254 00:25:02.939 --> 00:25:06.598 Now, what they're talking about here is another issue, by the way. 255 00:25:08.189 --> 00:25:14.398 Why are we spending time on this example? Like, why did NVIDIA take the time 256 00:25:14.398 --> 00:25:17.848 to talk about it, and why am I taking your time in class? 257 00:25:17.848 --> 00:25:21.179 There are two reasons. The specific reason 258 00:25:21.179 --> 00:25:26.759 is that this is actually a very common application for parallel computing. 259 00:25:26.759 --> 00:25:31.648 It's in different domains; if it's not in biophysics, it's astronomy and so on. 260 00:25:31.648 --> 00:25:35.338 So, the specific problem we're spending time on 261 00:25:35.338 --> 00:25:39.269 is a useful problem that is so compute-intensive you want to parallelize it. 262 00:25:39.269 --> 00:25:45.148 The general reason for this is that the techniques that we see here 263 00:25:45.148 --> 00:25:48.298 are also used in other parallel programs, 264 00:25:48.298 --> 00:25:57.898 so we see general ideas that are worth putting in your toolkit; you develop a toolkit of tools you can use. 265 00:25:57.898 --> 00:26:02.368 You know, it's not different types of screwdrivers, it's different programming techniques. 266 00:26:03.598 --> 00:26:10.469 And the idea is that when you've got nested loops, you look at which should be the outside loop; that's an important consideration. 267 00:26:10.469 --> 00:26:20.878 For example, whether you have too many read-modify-writes that are going to slow each other down; that's an important consideration, scatter versus gather. 268 00:26:20.878 --> 00:26:24.028 So, 269 00:26:24.773 --> 00:26:37.523 if you're programming in Thrust, which is this parallel library we'll see maybe next class, you've got a scatter operator and you've got a gather operator. You might think, maybe gather might be more efficient on a multithreaded machine. Okay. 270 00:26:37.523 --> 00:26:40.614 Now, here, we're going to see an idea which is a little subtle.
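Before that, here is roughly what the scatter version just described might look like as a CUDA kernel; this is a sketch with invented names, not the slide's actual code:

```cuda
// One thread per atom; every thread loops over the whole slice, so every
// accumulation needs an atomicAdd.
__global__ void potential_scatter_kernel(float *slice, int nx, int ny,
                                         float gridspacing, float z,
                                         const float *atoms, int numatoms)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's atom
    if (n >= numatoms) return;

    float ax = atoms[4*n], ay = atoms[4*n+1], az = atoms[4*n+2], q = atoms[4*n+3];
    float dz2 = (z - az) * (z - az);

    for (int j = 0; j < ny; j++) {
        float dy = gridspacing * j - ay;
        float dydz2 = dy * dy + dz2;
        for (int i = 0; i < nx; i++) {
            float dx = gridspacing * i - ax;
            // Every atom-thread hits the same grid words in the same order,
            // so these atomics serialize badly -- the point of the slide.
            atomicAdd(&slice[j*nx + i], q * rsqrtf(dx*dx + dydz2));
        }
    }
}
```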
271 00:26:40.919 --> 00:26:47.068 So, some information here. 272 00:26:47.068 --> 00:26:51.419 So each thread executes this function. 273 00:26:51.419 --> 00:26:54.538 It's cenergy; you've got this code 274 00:26:54.538 --> 00:26:59.699 in here that's executed by each thread in parallel. 275 00:26:59.699 --> 00:27:08.278 And it's the same code, so the threads are in parallel computing the same information. 276 00:27:08.278 --> 00:27:12.148 So, 277 00:27:12.148 --> 00:27:15.479 you might ask yourself, is this a problem? 278 00:27:15.479 --> 00:27:20.699 Now, if you had a sequential machine, a serial machine, 279 00:27:20.699 --> 00:27:26.729 then the outer loop is calling cenergy many times, 280 00:27:26.729 --> 00:27:32.459 and inside that, the same code is getting re-executed. On a sequential machine 281 00:27:32.459 --> 00:27:35.578 you'd want to worry about that. Well, 282 00:27:35.578 --> 00:27:39.959 I would look at that and say, okay, I have a nice optimizing compiler, 283 00:27:39.959 --> 00:27:46.469 and the worry is a reasonable thing; you know, it's going to slow down a sequential machine. On a parallel machine, 284 00:27:46.469 --> 00:27:51.598 so what if 5,000 threads compute the same thing? 285 00:27:51.598 --> 00:27:55.439 You might say, so what, and that could be a reasonable answer. 286 00:27:55.439 --> 00:28:00.659 So it's redundantly computed for every thread, but 287 00:28:01.949 --> 00:28:14.009 maybe computing it once wouldn't help, because the threads are running in parallel anyway. So, what I'm pointing out is that this is repeated computation of the same thing, but 288 00:28:14.009 --> 00:28:17.548 maybe it's not a problem on a parallel machine. 289 00:28:17.548 --> 00:28:22.169 So, you know, don't automatically think you have to fix it. 290 00:28:22.169 --> 00:28:25.169 In any case, they're bringing this thing up here. 291 00:28:25.169 --> 00:28:28.259 So. 292 00:28:28.259 --> 00:28:35.818 Now, one point is, again, let's talk about this slide before I go to the next slide. 293 00:28:35.818 --> 00:28:39.749 So, again, this function cenergy is being run 294 00:28:39.749 --> 00:28:48.929 on each thread in parallel. You've got these local variables: floats, the z squared term, the slice offset, 295 00:28:48.929 --> 00:28:53.189 the distance squared, the atom's charge and so on. 296 00:28:53.189 --> 00:28:59.848 These by default are in registers, their fast memory that's private to the 297 00:28:59.848 --> 00:29:04.919 thread. And again, a thread has up to 255, give or take, 298 00:29:04.919 --> 00:29:11.818 registers. Now there is an issue: each thread's 299 00:29:11.818 --> 00:29:15.689 version of, say, dz is 300 00:29:18.058 --> 00:29:23.429 stored separately, so that is wasted register storage, and you don't have a lot of registers, you know, 301 00:29:23.429 --> 00:29:27.088 64K registers total across all the concurrent threads. 302 00:29:28.229 --> 00:29:31.769 So that could be an argument to 303 00:29:31.769 --> 00:29:40.679 take this stuff and compute it once and put it in shared memory, because shared memory is visible to the whole thread block.
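One common way the shared-memory idea plays out in this kind of kernel is to stage a tile of the constant atom data per block, so the block reads it from fast per-block memory instead of each thread re-reading global memory. This sketch is on the gather formulation, not the slide's code; TILE, the one-dimensional block per lattice row, and all names are assumptions:

```cuda
#define TILE 256                     // atoms staged per pass (assumption)

__global__ void gather_tiled(float *slice, int nx, float gridspacing, float z,
                             const float4 *atoms, int numatoms)
{
    __shared__ float4 s_atoms[TILE];                 // shared by the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's grid x
    float x = gridspacing * i, y = gridspacing * blockIdx.y;
    float sum = 0.0f;

    for (int base = 0; base < numatoms; base += TILE) {
        // cooperative load: the block fills the tile once
        for (int t = threadIdx.x; t < TILE && base + t < numatoms; t += blockDim.x)
            s_atoms[t] = atoms[base + t];
        __syncthreads();
        int limit = min(TILE, numatoms - base);
        for (int a = 0; a < limit; a++) {
            float dx = x - s_atoms[a].x, dy = y - s_atoms[a].y, dz = z - s_atoms[a].z;
            sum += s_atoms[a].w * rsqrtf(dx*dx + dy*dy + dz*dz);
        }
        __syncthreads();                             // done with this tile
    }
    if (i < nx) slice[blockIdx.y * nx + i] = sum;
}
```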
304 00:29:41.663 --> 00:29:54.473 And so there's an argument, perhaps, instead of having it in registers that are private to the thread, to put it in shared memory that's global to the thread block, or rather private to the thread block. And shared memory is very fast, but you don't have very much; 305 00:29:54.473 --> 00:29:57.443 it's 48K bytes or something for the whole 306 00:29:58.199 --> 00:30:04.709 thread block. So that could be an idea, if you look at this in your optimizing state of mind. 307 00:30:04.709 --> 00:30:11.969 So you, first of all, put it in shared memory, and then maybe 308 00:30:11.969 --> 00:30:19.378 compute it once and then sync or something. So, okay, this is what you would be thinking before you look at the next slide, perhaps. 309 00:30:19.378 --> 00:30:23.909 And then you've got the loop in here, and then the loop inside the loop. 310 00:30:23.909 --> 00:30:28.409 Okay, the thread's going to have nested loops; you've got no problem there. 311 00:30:28.409 --> 00:30:38.638 And there's really not going to be any thread divergence, probably; maybe one thread does more iterations than another, but probably not. 312 00:30:38.638 --> 00:30:42.358 So, the thing works quite nicely on a 313 00:30:42.358 --> 00:30:46.739 single-program multiple-data 314 00:30:46.739 --> 00:30:52.679 architecture. By the way, the difference between SIMD and SPMD: 315 00:30:52.679 --> 00:30:57.929 SIMD, single instruction multiple data, is that all of the 316 00:30:57.929 --> 00:31:03.538 threads are in lock step, so the threads in a warp are SIMD. 317 00:31:03.538 --> 00:31:07.318 The warps in a block are SPMD. 318 00:31:07.318 --> 00:31:15.808 The warps in a block are running the same program, but they could be at different points in the program; the different warps are not synced with each other unless you do a __syncthreads. 319 00:31:15.808 --> 00:31:20.818 So that's why they've added a new term, single program multiple data. 320 00:31:20.818 --> 00:31:25.199 So your 1,000 threads in the block are running the same program, 321 00:31:25.199 --> 00:31:32.489 but the 1,000 threads form 32 warps, and the 32 warps could be at 32 different places in that common program. 322 00:31:32.489 --> 00:31:37.019 And quite commonly some of the warps 323 00:31:37.019 --> 00:31:45.388 might not be started yet and others might have finished. That's SPMD versus SIMD. Okay, so you look at this and you might say, we will 324 00:31:45.388 --> 00:31:50.788 optimize our hardware resources if we take this stuff, perhaps, and put it in shared memory. 325 00:31:51.989 --> 00:32:01.499 Okay, and again, if you're writing a small program, who cares; obviously your time is worth something. 326 00:32:01.499 --> 00:32:05.548 But, you know, if you're writing something really big, then 327 00:32:05.548 --> 00:32:11.068 your time is worth less than the computer's time eventually, at some point. So you want to optimize. 328 00:32:11.068 --> 00:32:17.038 Okay, looking at the same program further down, and this is our 329 00:32:17.038 --> 00:32:23.249 scatter operation, and it's the read-modify-write. You see, you've got the plus-equals here, 330 00:32:23.249 --> 00:32:27.538 and that line has to be 331 00:32:27.538 --> 00:32:31.378 atomic. So, 332 00:32:31.378 --> 00:32:36.358 that's the interesting thing down there. 333 00:32:36.358 --> 00:32:44.308 Here's a graph that shows what I mean by the scatter.
334 00:32:44.308 --> 00:32:48.689 So, we've got our input, that's the atoms. 335 00:32:48.689 --> 00:32:53.638 We've got the threads that do the work, and the output, that's the lattice points. 336 00:32:53.638 --> 00:32:58.888 And the output from each thread is scattered over all the lattice points. Okay. 337 00:33:00.179 --> 00:33:04.019 And for input, each thread reads one input, 338 00:33:04.019 --> 00:33:11.068 from one atom, and its output is scattered across all of the lattice points. So, 339 00:33:14.128 --> 00:33:23.608 if you look at this and start thinking, the read-modify-write backlogs because of that could be worse than you would think, because 340 00:33:23.608 --> 00:33:29.578 you're in a loop, for each thread, iterating over the lattice. So basically 341 00:33:29.578 --> 00:33:33.598 all the threads are going to bang on this first word in the lattice, 342 00:33:33.598 --> 00:33:39.568 and then all the threads are going to write to the second word in the lattice, and then all the threads are going to write to the third word in the lattice. 343 00:33:39.568 --> 00:33:50.009 You see a problem here: this program is particularly bad for causing backlogs from that read-modify-write. So, 344 00:33:50.009 --> 00:33:57.388 if you did want to do a scatter, you might want to think of some way so that all the threads are not trying to write to the same 345 00:33:57.388 --> 00:34:04.259 lattice point at the same time. So you start the different threads off at different lattice points or something, 346 00:34:04.259 --> 00:34:09.239 something like that, maybe. Okay. 347 00:34:10.619 --> 00:34:14.278 So, that's the scatter. 348 00:34:15.509 --> 00:34:20.068 Um. 349 00:34:20.068 --> 00:34:24.509 And so here, they're talking about another thing. 350 00:34:24.509 --> 00:34:29.998 You could do this: if the atom's really far from the lattice point, don't do anything, 351 00:34:31.224 --> 00:34:44.994 don't try to update it. This improves your read-modify-write situation, because you're not writing to all the lattice points: if the lattice point is too far away, don't write to it, and that will help your program. But this will cause thread divergence in your code, 352 00:34:44.994 --> 00:34:49.494 of course, since different threads will be doing different work at that point; some will want to write and some will not 353 00:34:49.768 --> 00:34:55.378 need to write. So they talk about stuff like that. 354 00:34:55.378 --> 00:34:58.829 Inputs, as before, are the irregular atom positions, and 355 00:34:58.829 --> 00:35:03.509 outputs are the regular lattice. 356 00:35:03.509 --> 00:35:13.739 Okay, so the scatter concept: you iterate over the atoms, and each atom scatters output to 357 00:35:13.739 --> 00:35:19.378 the whole lattice, or scatters output to one slice of the lattice if you iterate over slices. 358 00:35:22.079 --> 00:35:25.168 It follows the sequential version, so it's easy. 359 00:35:25.168 --> 00:35:32.489 But this atomicAdd is going to kill your performance. 360 00:35:32.489 --> 00:35:36.568 Their conclusion is: don't do this in parallel. 361 00:35:38.309 --> 00:35:44.099 But by the way, thinking ahead, we'll talk about term projects at some point. 362 00:35:45.389 --> 00:35:50.159 You know, just because they say this, you don't have to believe them. 363 00:35:50.159 --> 00:35:53.639 So, if you're the sort of person who is skeptical towards authority, 364 00:35:53.639 --> 00:35:59.068 you might actually implement this and see if this is true. Okay, do an experiment and
365 00:35:59.068 --> 00:36:02.278 see if this is as bad as they say it is. 366 00:36:02.278 --> 00:36:06.088 Yeah, that would make a fun term project. 367 00:36:06.088 --> 00:36:09.838 And again, you could then make the project more sophisticated: 368 00:36:09.838 --> 00:36:21.509 the simple version 0 is really bad, but in version 1 you start thinking of some things, like you don't have all the threads banging on the same lattice points simultaneously; you rotate things a little somehow. 369 00:36:21.509 --> 00:36:24.958 You might actually mitigate this problem. Who knows? 370 00:36:24.958 --> 00:36:28.168 A project idea. Okay. 371 00:36:28.168 --> 00:36:34.289 That was the first lecture segment. 372 00:36:38.909 --> 00:36:42.449 Okay, part 2. 373 00:36:45.059 --> 00:36:50.818 Now, we're continuing on this calculation. 374 00:36:53.369 --> 00:36:56.458 Oh, fancy, up here. Okay. 375 00:36:58.168 --> 00:37:02.789 Okay, so now this is the second method. This will be the gather method. 376 00:37:02.789 --> 00:37:07.768 The gather method is: the outermost iteration iterates over lattice points, 377 00:37:07.768 --> 00:37:12.418 and for each lattice point, you then iterate over the atoms. 378 00:37:12.418 --> 00:37:16.259 So, it's a 2D thing they're doing. 379 00:37:16.259 --> 00:37:22.079 The outermost things are grid points, and then the innermost loop here, 380 00:37:22.079 --> 00:37:32.878 we're iterating over atoms, and for each grid point we gather the information from all the atoms; we gather the potentials. 381 00:37:32.878 --> 00:37:38.369 Our read-modify-write is to a local variable, so that's local to the thread, private to the thread. 382 00:37:38.369 --> 00:37:41.759 So then we get the total 383 00:37:41.759 --> 00:37:49.768 contribution to the energy, and then we do one read-modify-write to update 384 00:37:49.768 --> 00:37:54.509 that lattice point. 385 00:37:54.509 --> 00:37:58.739 The reason that last line might want to be 386 00:37:58.739 --> 00:38:02.458 locked, as an atomic operation, is we might have 387 00:38:02.458 --> 00:38:06.659 multiple blocks or something if we've got lots and lots of threads, 388 00:38:06.659 --> 00:38:13.380 and so this is to prevent the blocks from stepping on each other's toes. They don't show that yet, but this last line I'd 389 00:38:13.380 --> 00:38:21.719 make atomic. Okay, so this has inverted what's in and what's out in the nested loops; this is gathering information from the atoms 390 00:38:21.719 --> 00:38:27.659 for each grid point. So, 391 00:38:29.340 --> 00:38:33.719 okay, here's a design decision we've mentioned before: 392 00:38:33.719 --> 00:38:42.210 you've got lots of threads, thousands, tens of thousands of threads, maybe 100,000 threads; I haven't personally tried it, that could be another project. 393 00:38:42.210 --> 00:38:45.269 But any one thread block, 394 00:38:45.269 --> 00:38:50.849 on the fast GPU on parallel, any one thread block 395 00:38:50.849 --> 00:38:54.510 can have only 1,024 threads. 396 00:38:55.889 --> 00:38:58.980 So, we're doing a 2D problem here, 397 00:38:58.980 --> 00:39:04.260 to make it easier to draw; so your lattice 398 00:39:04.260 --> 00:39:09.150 is whatever, maybe, you know, 200 by 200 or 399 00:39:09.150 --> 00:39:17.789 1,000 by 1,000. So you've got to partition the lattice points into thread blocks, because one lattice point is one thread.
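As a sketch, assuming the same invented names as before, the gather version and a possible launch configuration might look like this; the slide's real kernel differs in detail, and if several kernels could update the same slice entry, that final store would become an atomicAdd as just mentioned:

```cuda
// One thread per lattice point; inner loop over all atoms; accumulation in a
// private register; one global write at the end.
__global__ void potential_gather_kernel(float *slice, int nx, int ny,
                                        float gridspacing, float z,
                                        const float *atoms, int numatoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // grid x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // grid y index
    if (i >= nx || j >= ny) return;

    float x = gridspacing * i, y = gridspacing * j;
    float sum = 0.0f;                                // private to this thread
    for (int n = 0; n < numatoms; n++) {             // gather from every atom
        float dx = x - atoms[4*n], dy = y - atoms[4*n+1], dz = z - atoms[4*n+2];
        sum += atoms[4*n+3] * rsqrtf(dx*dx + dy*dy + dz*dz);
    }
    slice[j*nx + i] = sum;   // only write; no atomics, and adjacent threads
                             // in x write adjacent words (coalesced)
}

// Possible launch: 16x16 = 256 threads per block (an illustrative choice).
dim3 block(16, 16);
dim3 grid((nx + 15) / 16, (ny + 15) / 16);
potential_gather_kernel<<<grid, block>>>(d_slice, nx, ny, gridspacing, z,
                                         d_atoms, numatoms);
```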
400 00:39:17.789 --> 00:39:24.570 You've got this big lattice; you want to partition it into blocks of about 1,000 threads. 401 00:39:25.619 --> 00:39:29.610 Now, you could... that was weird, one second here. 402 00:39:29.610 --> 00:39:42.360 You could say each row; you could do it literally row, row, row, block, block, block. But what they're saying is, it might be better, say, to have a 2D organization like this. So, some 2D tile of 403 00:39:42.360 --> 00:39:52.230 lattice points is one block; whatever, you know, it's a design decision you want to make, how do you partition. 404 00:39:52.230 --> 00:40:01.139 When you've got more threads than fit in a block, sorry, some auto-timeout thing here; when you've got more threads than fit in a block, how do you 405 00:40:01.139 --> 00:40:04.800 group, bundle, the threads into blocks? 406 00:40:04.800 --> 00:40:08.969 Now, how this would affect performance 407 00:40:08.969 --> 00:40:15.840 is that, again, inside each block you've got shared memory that's available to all the threads in the block. 408 00:40:15.840 --> 00:40:20.280 So, perhaps you can put some information 409 00:40:20.280 --> 00:40:23.969 in that fast shared memory and work on it. 410 00:40:23.969 --> 00:40:31.260 So, that's why some organizations of thread blocks might be faster than others, because 411 00:40:32.340 --> 00:40:39.480 of what you're putting in the shared memory. So, okay. 412 00:40:39.480 --> 00:40:45.059 In any case, this is the gather thing. 413 00:40:45.059 --> 00:40:50.969 The input here is the atoms, which are spaced irregularly, 414 00:40:50.969 --> 00:40:55.530 and the threads do the work, and the output is the potentials at the lattice points. 415 00:40:55.530 --> 00:41:02.460 So thread 1 computes the potential at lattice point 1, thread 2 at lattice point 2. So, 416 00:41:02.460 --> 00:41:06.630 the threads are gathering information from all the atoms. 417 00:41:09.804 --> 00:41:22.735 I don't know, if it becomes too annoying I'll figure out how to stop the timing thing, but at the moment, okay. So you look at this, it's going to go nicely; there's no read-modify-write issue on the output. 418 00:41:22.980 --> 00:41:26.730 And even better, 419 00:41:26.730 --> 00:41:31.139 adjacent threads are writing to adjacent output words in memory. 420 00:41:31.139 --> 00:41:34.409 Okay, this is getting... 421 00:41:34.409 --> 00:41:38.880 Let's see here. Okay. 422 00:41:42.059 --> 00:41:51.750 Yes, oh, I know what my problem was. Okay. So, look at this: adjacent threads write to adjacent output words; that's coalescing. 423 00:41:51.750 --> 00:41:56.070 Coalescing is good. Okay. 424 00:41:56.070 --> 00:42:09.420 And the inputs, well, again, look at it and think: all the threads read input word 1, and then all the threads read input word 2, then all the threads read input word 3. 425 00:42:09.420 --> 00:42:14.610 Think how that hits the memory; that's good also. Okay. 426 00:42:14.610 --> 00:42:21.389 You just look at this: it works, it plays nicely with how the GPU is designed. 427 00:42:21.389 --> 00:42:26.159 It's efficient on reading and it's efficient on writing. 428 00:42:26.159 --> 00:42:29.369 Cool. So you like the gather paradigm. 429 00:42:30.869 --> 00:42:34.230 You ought to like it; well, I don't know if you do. 430 00:42:35.429 --> 00:42:39.480 Okay, so here's the executive summary: 431 00:42:39.480 --> 00:42:46.380 you iterate over grid points, that's the outer thing, and then inside, each thread does one grid point.
432 00:42:46.380 --> 00:42:51.690 Okay, and then each thread computes some information about its grid point, 433 00:42:51.690 --> 00:42:56.250 such as which grid point it is, which depends on the thread number and the block number. 434 00:42:56.250 --> 00:43:01.409 And then, at the bottom half of the page, it 435 00:43:01.409 --> 00:43:05.789 iterates over the atoms, and again, 436 00:43:05.789 --> 00:43:08.969 all the threads are doing the same loop. 437 00:43:08.969 --> 00:43:14.429 Good, so it's just a program that's going to run nicely on the GPU. 438 00:43:15.599 --> 00:43:20.880 They're talking about that there. 439 00:43:20.880 --> 00:43:25.889 It works, and there are coalesced writes: adjacent threads are writing adjacent 440 00:43:25.889 --> 00:43:34.980 elements in memory most of the time. So a coalesced write means we're accessing the 441 00:43:34.980 --> 00:43:40.800 global memory in chunks of, I don't know, 32 words perhaps; 442 00:43:40.800 --> 00:43:48.179 it's a design decision. So if 32 threads are going to be writing adjacent 32 words of the global memory, 443 00:43:48.179 --> 00:43:51.570 those 32 writes get coalesced into one write, 444 00:43:51.570 --> 00:43:54.659 one access out to the global memory, 445 00:43:54.659 --> 00:43:58.500 and that access is 32 words, 128 bytes. 446 00:43:58.500 --> 00:44:01.500 That works nicely. 447 00:44:01.500 --> 00:44:04.650 Okay, so the gather is good. 448 00:44:04.650 --> 00:44:11.190 Good. Okay, so what they're saying is: what was the fast sequential 449 00:44:11.190 --> 00:44:14.760 program design was not the fast parallel program design. 450 00:44:14.760 --> 00:44:18.329 And that is the lesson from this one. 451 00:44:20.670 --> 00:44:24.360 Okay. 452 00:44:27.059 --> 00:44:33.239 Compute efficiency, total number of cycles, 453 00:44:35.130 --> 00:44:43.710 matters less; your metric for parallel programs is almost always elapsed wall-clock time. 454 00:44:43.710 --> 00:44:49.230 So, if your warp is sort of spinning, 455 00:44:49.230 --> 00:44:59.070 waiting for data, it's not executing any instructions; you don't care about that, you care that it's waiting. 456 00:44:59.070 --> 00:45:05.579 And it will wait less if you work the cache better and so on. 457 00:45:05.579 --> 00:45:10.440 And they mention here also that 458 00:45:10.440 --> 00:45:16.590 for reasonable sizes you've got more lattice points than you've got atoms, maybe say 20 times more. 459 00:45:16.590 --> 00:45:21.480 And iterating through the big lattice array 460 00:45:21.480 --> 00:45:26.909 as the outside loop is better than iterating through the atoms, as they mention. So you want to think about these design issues. 461 00:45:26.909 --> 00:45:30.119 Okay. 462 00:45:30.119 --> 00:45:33.869 The issue is how you lay out and sweep through the data, again; 463 00:45:33.869 --> 00:45:38.969 play nicely with your cache. That's the lesson from here. 464 00:45:40.139 --> 00:45:44.010 Do do do do do. 465 00:45:46.050 --> 00:45:50.789 And in the sequential program they play games with temporary arrays, 466 00:45:50.789 --> 00:46:01.769 pre-calculate stuff, I don't know. Okay, the pre-computation is: you identify common things, hoist them out of the loop, and do them once. 467 00:46:01.769 --> 00:46:06.840 Yeah, I find it hard to get interested in that sort of thing. 468 00:46:06.840 --> 00:46:11.130 Again, in many cases, the compiler will do it for you.
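To illustrate the coalescing point from a moment ago, here is a fragment that would sit inside some kernel; the indexing and names are assumed, not the slide's:

```cuda
int t = blockIdx.x * blockDim.x + threadIdx.x;
float v = in[t];          // coalesced read: thread t reads word t
out[t] = v;               // coalesced write: thread t writes word t
// By contrast, a strided pattern such as out[t * 32] = v; breaks a warp's
// 32 accesses into many separate memory transactions instead of one.
```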
469 00:46:11.130 --> 00:46:17.010 Okay, the coarsening thing is: on a sequential program there's the concept of unrolling a loop, 470 00:46:17.010 --> 00:46:21.269 so you do several iterations 471 00:46:21.269 --> 00:46:25.409 of the index variable in one iteration of the loop. 472 00:46:25.409 --> 00:46:30.269 So the loop executes fewer times, but each time does more work. You can do that in parallel: 473 00:46:30.269 --> 00:46:35.789 as I mentioned a few minutes ago, each thread could do several lattice points instead of one lattice point. 474 00:46:35.789 --> 00:46:39.360 Why this would help is that 475 00:46:39.360 --> 00:46:48.480 if the threads are sharing some computed information, it may use fewer registers in total, and registers are a scarce resource. They don't talk about that here, but that 476 00:46:48.480 --> 00:46:56.849 would be the big thing. Like I mentioned, each thread has its private registers, so you want to reduce the total; you either put stuff in shared memory, 477 00:46:56.849 --> 00:47:02.039 making your program a little harder, or if you coarsen the threads, the total number of registers 478 00:47:02.039 --> 00:47:06.420 is fewer, because the stuff you put in a register is used more. So 479 00:47:06.420 --> 00:47:12.329 it's related to this concept of loop unrolling, which good compilers do nowadays also. 480 00:47:12.329 --> 00:47:16.980 Okay. 481 00:47:18.329 --> 00:47:25.230 NVIDIA now has some tools to work with fractions of a warp, 482 00:47:25.230 --> 00:47:32.730 and that's in the most recent versions of CUDA. I'm not talking about that; well, I don't fully understand it, so 483 00:47:32.730 --> 00:47:38.699 I've got some things, but okay. And there's the padding around the edge. 484 00:47:38.699 --> 00:47:45.750 Okay, and 485 00:47:46.980 --> 00:47:52.800 so, they're picking, for their coarsening and for the partitioning of the lattice into blocks, 486 00:47:52.800 --> 00:47:58.380 what takes the greatest advantage of memory coalescing, 487 00:47:58.380 --> 00:48:03.840 coalescing operations on the global memory. They are 488 00:48:03.840 --> 00:48:07.199 computing temporary results and reusing them, perhaps. 489 00:48:07.199 --> 00:48:12.780 And again, realize that registers are a fixed resource. 490 00:48:12.780 --> 00:48:18.750 That's the information content, and the cost for this is your program got messier. 491 00:48:18.750 --> 00:48:25.889 Now, if you use a higher-level tool above this, 492 00:48:25.889 --> 00:48:35.670 such as Thrust, in Thrust you're programming in terms of, I mean, the functions are called things like scatter and gather 493 00:48:35.670 --> 00:48:39.510 and transform and so on. 494 00:48:39.510 --> 00:48:45.179 So the hope is that if you use a higher-level tool, some of this stuff is done for you by 495 00:48:45.179 --> 00:48:49.679 the library. And you have a choice 496 00:48:49.679 --> 00:48:59.670 of a number of different tools at different levels. The first one we'll talk about, there's nothing about threads and warps; it's functional programming, map-reduce style programming. 497 00:48:59.670 --> 00:49:03.000 It's higher level, but maybe you lose a touch of efficiency. 498 00:49:03.000 --> 00:49:15.929 There are these intermediate things that the energy lab people recommend if you want to worry about some more detailed things, but not down at this level. One of the intermediate tools is called Kokkos; 499 00:49:17.039 --> 00:49:21.090 it will give you more control but is still fairly easy.
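Going back to the coarsening idea from the top of this segment, here is a sketch of the gather kernel with each thread doing several lattice points; the factor of 4 and all names are illustrative choices, not the slide's:

```cuda
#define COARSEN 4   // lattice points per thread (assumption)

__global__ void potential_gather_coarsened(float *slice, int nx, int ny,
                                           float gridspacing, float z,
                                           const float *atoms, int numatoms)
{
    int i0 = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
    int j  =  blockIdx.y * blockDim.y + threadIdx.y;
    if (j >= ny) return;

    float y = gridspacing * j;
    float sum[COARSEN] = {};                          // zero-initialized

    for (int n = 0; n < numatoms; n++) {
        float ax = atoms[4*n], dy = y - atoms[4*n+1], dz = z - atoms[4*n+2];
        float dydz2 = dy*dy + dz*dz;                  // computed once,
        for (int c = 0; c < COARSEN; c++) {           // reused COARSEN times
            float dx = gridspacing * (i0 + c) - ax;
            sum[c] += atoms[4*n+3] * rsqrtf(dx*dx + dydz2);
        }
    }
    for (int c = 0; c < COARSEN; c++)
        if (i0 + c < nx) slice[j*nx + i0 + c] = sum[c];
}
```

One trade-off not shown: with consecutive points per thread the writes are no longer perfectly coalesced; real coarsened codes often stride the extra points by the block width instead.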
500 00:49:21.090 --> 00:49:25.440 So maybe you don't want to worry about this low-level stuff too often. 501 00:49:25.440 --> 00:49:30.599 Okay, in any case, the point of these two slide decks was 502 00:49:30.599 --> 00:49:34.289 to show the example of the potential calculation, 503 00:49:34.289 --> 00:49:41.940 which is specifically an important example, but generally teaches you tools for parallel programming in CUDA. 504 00:49:41.940 --> 00:49:45.329 Any questions about that? 505 00:49:45.329 --> 00:49:48.869 Okay. 506 00:49:51.150 --> 00:49:54.809 Let's continue on. 507 00:49:54.809 --> 00:49:59.760 Parallel thinking. 508 00:50:00.929 --> 00:50:06.900 Again, I think this one goes fast. 509 00:50:08.400 --> 00:50:17.369 Just some general parallel thinking things; I may go through this one fast. 510 00:50:19.590 --> 00:50:26.039 Oh, what's the content on this one? The second paragraph here. 511 00:50:27.599 --> 00:50:33.449 So, you've got different goals for parallel computing. 512 00:50:33.449 --> 00:50:37.829 Strong scaling is: you can do the same program faster. 513 00:50:39.179 --> 00:50:43.590 Weak scaling is: you can do a bigger program in the same time. 514 00:50:44.610 --> 00:50:48.960 And then, of course, a better solution. So, 515 00:50:50.579 --> 00:50:56.309 you want to do weather prediction. Okay, the weather people tell us there's going to be a storm tomorrow; 516 00:50:56.309 --> 00:51:01.800 that's what they said a few hours ago; I haven't checked the weather report lately, they may have changed their mind. 517 00:51:01.800 --> 00:51:05.789 They do. Okay, so you want to 518 00:51:05.789 --> 00:51:15.090 look at the problem of weather. Now, you buy a faster computer, you compute the weather in less time; that's strong scaling. 519 00:51:15.090 --> 00:51:19.500 Weak scaling is: you buy a better computer, you can 520 00:51:19.500 --> 00:51:23.219 compute the weather approximation on a finer grid. 521 00:51:23.219 --> 00:51:28.530 So, they decompose the atmosphere into this grid; it's surprisingly coarse. 522 00:51:28.530 --> 00:51:33.179 So you make the grid finer, more grid cells, 523 00:51:33.179 --> 00:51:36.300 but, you know, compute it in the same time. 524 00:51:36.300 --> 00:51:42.570 That's weak scaling. So, okay. And this thing here, 525 00:51:42.570 --> 00:51:51.300 advancing science, qualitatively new stuff: you do the potential problem faster, 526 00:51:51.300 --> 00:51:54.690 you know, design a new drug, let's say. 527 00:51:54.690 --> 00:51:58.110 That's advancing science. 528 00:51:59.460 --> 00:52:03.780 And nothing interesting there. 529 00:52:03.780 --> 00:52:13.230 Okay, so here it's introducing the topic that everyone else loves in parallel computing, which is message passing. 530 00:52:13.230 --> 00:52:16.409 I personally think it's much overrated. 531 00:52:16.409 --> 00:52:20.039 And I'm willing to debate it with anyone; no one ever takes me up on it. 532 00:52:20.039 --> 00:52:24.929 So, you have lots of small nodes that 533 00:52:24.929 --> 00:52:28.260 do not have shared memory; you pass messages back and forth. 534 00:52:29.460 --> 00:52:34.889 Again, that's nice if you have to, but I've got 40 gigabytes of memory on my machine; 535 00:52:34.889 --> 00:52:41.460 I don't have problems so big that I have to pass messages. 536 00:52:41.460 --> 00:52:48.510 Okay, and you've got humongous latency when you pass a message. 537 00:52:48.510 --> 00:52:52.019 Oh, they mention some alternatives here. Okay, we've seen
538 00:52:52.019 --> 00:52:56.550 CUDA now, and we've seen OpenMP. OpenCL 539 00:52:56.550 --> 00:53:01.139 is a competitor; OpenCL is somewhat like CUDA, 540 00:53:01.139 --> 00:53:07.500 but it's more agnostic, it's not specifically targeted to NVIDIA, 541 00:53:07.500 --> 00:53:12.840 and it has a little different design philosophy. But also, since it's less targeted to NVIDIA, 542 00:53:12.840 --> 00:53:19.260 it doesn't work as well on NVIDIA. So in designing this course, and I created this course, 543 00:53:19.260 --> 00:53:26.849 I made a design decision to talk about NVIDIA, because NVIDIA is the big thing in parallel computing today. 544 00:53:26.849 --> 00:53:31.139 In five years they could be bankrupt; I've seen it happen. 545 00:53:31.139 --> 00:53:36.480 A really big company, they get big because they're competent and hardworking, 546 00:53:36.480 --> 00:53:42.989 and then they become arrogant and they do everything wrong, and you'd be surprised how fast a company can vanish 547 00:53:42.989 --> 00:53:48.389 when they get foolish. And it could happen to NVIDIA, but it hasn't happened yet. 548 00:53:48.389 --> 00:53:53.219 So. 549 00:53:53.219 --> 00:53:59.699 Okay, nothing interesting there. So, 550 00:54:01.135 --> 00:54:15.804 well, an example: when I was a grad student, Digital Equipment Corporation, DEC, was the standard company for computer science departments. They grew to become the second biggest computer company after IBM. They had their annual conference one year on an ocean liner that 551 00:54:16.050 --> 00:54:20.130 was, I think, in Boston Harbor or something. 552 00:54:20.130 --> 00:54:29.639 And a couple of years later, they said UNIX was snake oil. The guy who built the company up, he had done everything right; 553 00:54:29.639 --> 00:54:33.329 I don't know what happened to him, but he started doing everything wrong in the company. 554 00:54:33.329 --> 00:54:38.849 And they also went from being an open model and friendly, to being very nasty about everything. 555 00:54:38.849 --> 00:54:45.630 If you pulled a memory chip out of one of their cards and put it into another card, they said that was illegal, let's say. 556 00:54:45.630 --> 00:54:52.320 If you tried to duplicate their terminal, their dumb terminal, they'd sue you for an IP violation, that sort of thing. 557 00:54:52.320 --> 00:54:57.869 The cost of the memory chip varied by a factor of 15, depending on what it was plugged into, 558 00:54:57.869 --> 00:55:01.619 and you couldn't clone it because the bus interface was patented. 559 00:55:01.619 --> 00:55:11.909 So the way third parties did memories: they'd buy cheap, obsolete components, pull the memory out, and put it into the new expensive DEC computers, and DEC would say that's illegal. 560 00:55:11.909 --> 00:55:17.550 Okay, so people just stopped buying DEC stuff. 561 00:55:17.550 --> 00:55:21.090 That solved the problem. Okay, in any case. 562 00:55:21.090 --> 00:55:25.260 Synchronization: nothing interesting here. 563 00:55:25.260 --> 00:55:32.909 The last line is relevant; it's sort of a problem for you guys, the programmers. 564 00:55:32.909 --> 00:55:37.050 If you don't synchronize enough, you get fast answers that are wrong; 565 00:55:38.099 --> 00:55:43.559 synchronize too much, it gets slow. Okay. So, 566 00:55:45.570 --> 00:55:52.710 what they're talking about here: different models for parallel computing. Okay, single program multiple data, that's the
567 00:55:52.710 --> 00:55:56.610 CUDA, NVIDIA thing. The master-worker thing 568 00:55:56.610 --> 00:56:00.119 is that, um, basically 569 00:56:00.119 --> 00:56:07.530 the boss, um, the professor/grad-student model, whatever you want to call it: the boss, 570 00:56:07.530 --> 00:56:15.269 um, partitions out parallel problems and assigns them to the workers. In this case the workers are, 571 00:56:15.269 --> 00:56:24.960 just like in an office, and the workers do the work. Loop parallelism is, again, you have a for loop, a do loop, pick your language, and 572 00:56:24.960 --> 00:56:29.190 the parallelism is over the iterations of the loop, if the iterations don't affect each other. 573 00:56:29.190 --> 00:56:35.400 Um, and fork-join, this is pthreads. 574 00:56:35.400 --> 00:56:39.570 You know what it means. So, different ways of looking at stuff. 575 00:56:41.429 --> 00:56:46.920 The idea being that if your application fits one of these models, 576 00:56:46.920 --> 00:56:50.670 it can then call on a lot of prior art to help you. 577 00:56:50.670 --> 00:56:55.800 So if it's loop parallelism, hey, 578 00:56:55.800 --> 00:57:00.510 put on OpenMP and it will, 579 00:57:00.510 --> 00:57:08.130 it will do the work for you, that sort of thing. And different models, um: just shared data, this common fast data; 580 00:57:08.130 --> 00:57:18.809 that doesn't scale up, of course, with too many things. So on parallel I've got 256 gigabytes of memory and it's got the dual 24-core, 581 00:57:18.809 --> 00:57:22.079 never mind exactly however many cores 582 00:57:22.079 --> 00:57:26.190 it has, so you can have, give or take, um, 583 00:57:26.190 --> 00:57:30.480 40 or 80 threads all accessing the same data. 584 00:57:30.480 --> 00:57:36.269 Shared data, shared queue: you've got a queue of jobs that they pick jobs off of. 585 00:57:36.269 --> 00:57:41.670 With OpenMP, if you've got a lot of... suppose you iterate, do i equals 1 to a million, 586 00:57:41.670 --> 00:57:45.269 and suppose you have only 32 threads available. 587 00:57:45.269 --> 00:57:51.420 Well, then it will pick chunks of iterations off the queue and execute them; there are different ways it does it. So. 588 00:57:51.420 --> 00:57:57.570 Okay, they talk about that here. Um. 589 00:57:58.829 --> 00:58:04.739 The thing is that the different warps are executing at different times. So. 590 00:58:05.820 --> 00:58:15.480 Okay, um, and again, the reason the concept of a warp exists: you decode the instruction once and it gets executed 591 00:58:15.480 --> 00:58:22.679 on 32 threads' data, 32 threads; that's less space used for instruction decoding and so on. 592 00:58:22.679 --> 00:58:25.980 Okay, and, um, 593 00:58:25.980 --> 00:58:29.130 this is what I just told you about. 594 00:58:31.920 --> 00:58:35.460 Fork-join, you probably don't want to do it at too low a level, so, 595 00:58:37.440 --> 00:58:42.090 or not low enough. Well, these different things are here. So, OpenMP I spent a lot of time on. 596 00:58:42.090 --> 00:58:52.139 OpenACC is a little higher level than OpenMP. OpenACC does GPUs better; OpenMP for a long time did not handle GPUs, it only 597 00:58:52.139 --> 00:58:59.730 did CPUs, multicore. OpenMP for a long time did not do GPUs; recently 598 00:58:59.730 --> 00:59:03.150 it does. I haven't played with it myself, and I don't know how well it does it. 599 00:59:03.150 --> 00:59:08.760 OpenACC, I guess, is sort of intended to replace OpenMP; a little higher level.
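(A minimal sketch of that loop-parallelism and chunking idea, not from the slides; the chunk size of 1000 is just illustrative.)

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];

        /* A million iterations, maybe only a few dozen hardware threads:
           OpenMP hands out chunks of iterations to whichever thread is free. */
        #pragma omp parallel for schedule(dynamic, 1000)
        for (int i = 0; i < N; i++) {
            a[i] = (double)i * i;   /* iterations are independent, so this is safe */
        }

        printf("OpenMP may use up to %d threads\n", omp_get_max_threads());
        return 0;
    }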
600 00:59:08.760 --> 00:59:15.900 Well, that woke up the OpenMP people; I think the OpenMP people, they saw OpenACC starting to do stuff, 601 00:59:15.900 --> 00:59:19.349 and that woke them up. So then OpenMP started getting updated. 602 00:59:21.269 --> 00:59:27.239 TBB is an acronym for Threading Building Blocks; this is Intel's 603 00:59:27.239 --> 00:59:32.429 parallel toolkit, and so it's only for Intel, 604 00:59:32.429 --> 00:59:39.210 the multicore. And I haven't played with it, the Threading Building Blocks. 605 00:59:39.210 --> 00:59:42.539 It being specific to Intel, 606 00:59:42.539 --> 00:59:49.530 it might be more efficient. It's probably not worth your time, but you could look at that for a fun project. 607 00:59:49.530 --> 00:59:56.670 What I think happens with Intel, the same thing as the Intel compiler: 608 00:59:56.670 --> 01:00:01.380 it competes with GCC, and it has a goal of, uh, 609 01:00:01.380 --> 01:00:04.949 being compatible with GCC; that's a stated goal now. 610 01:00:04.949 --> 01:00:12.030 So, so, I think the Intel compiler people, you know, they try to make a good compiler that's, you know, 611 01:00:12.030 --> 01:00:17.309 better than GCC, and then that wakes up the GCC people, 612 01:00:17.309 --> 01:00:22.829 and they see Intel getting ahead of them and they improve GCC, 613 01:00:22.829 --> 01:00:30.690 which is good; I like competition. But the effect is that you don't need to worry about it; you figure GCC will probably catch up, 614 01:00:30.690 --> 01:00:37.559 not too long behind. So I'm thinking it would be the same thing here, because I'm thinking if TBB does anything really good, 615 01:00:37.559 --> 01:00:41.610 the OpenMP implementation on Intel will probably copy it. 616 01:00:41.610 --> 01:00:44.610 But I could be wrong; a project idea, so. 617 01:00:44.610 --> 01:00:48.210 Okay, C++ AMP I don't know anything about, so. 618 01:00:50.400 --> 01:00:59.909 Again, going back here, POSIX: so this is a standard here, though it's implemented in terms of the hardware on a particular machine, so it's already aware of 619 01:00:59.909 --> 01:01:06.179 the actual low-level threading stuff. So, okay. Um. 620 01:01:06.179 --> 01:01:13.320 The cool thing about looking at different ways to do a program in parallel: 621 01:01:13.320 --> 01:01:16.440 um, you know, 622 01:01:16.440 --> 01:01:21.420 you've got different jobs, each a different task, and you line them up separately, 623 01:01:21.420 --> 01:01:26.039 a parallel kernel, let's say; or parallelize over the data, 624 01:01:26.039 --> 01:01:29.159 parallelize over whatever, uh. 625 01:01:30.630 --> 01:01:35.789 So these are interesting things to look at, at the bottom; you probably know what most of them are. Um. 626 01:01:35.789 --> 01:01:45.150 Decomposition: so you've got your molecule, let's say; you could imagine some recursive, hierarchical grouping, say with an 627 01:01:45.150 --> 01:01:49.409 octree. Everyone knows what an octree is, sort of, or people do not. 628 01:01:49.409 --> 01:01:57.389 Okay, the point about this is you've got some irregular 3D geometric data. So the universe is a cube; 629 01:01:57.389 --> 01:02:00.780 you cut the cube into 8 sub-cubes. 630 01:02:00.780 --> 01:02:09.690 And then each sub-cube that has too much data, you partition it into sub-sub-cubes. So you've got a tree; each node of the tree has 8 kids, 631 01:02:09.690 --> 01:02:16.289 and then you do it again, and you partition the kids into 8 grandkids each, and so on, 632 01:02:16.289 --> 01:02:19.679 until one node is simple enough.
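(A minimal sketch of the octree subdivision just described, not from the slides; the types and the capacity threshold are made up for illustration, and the step that pushes each body into the right kid is only indicated by a comment.)

    #include <stdlib.h>

    typedef struct { float x, y, z, charge; } Body;   /* a star or an atom */

    typedef struct OctNode {
        struct OctNode *kid[8];   /* 8 sub-cubes; NULL if this node is a leaf */
        Body  *bodies;            /* bodies stored here if the node is a leaf */
        int    nbodies;
    } OctNode;

    /* Recursively subdivide until a node holds few enough bodies. */
    void subdivide(OctNode *node, int threshold) {
        if (node->nbodies <= threshold) return;   /* simple enough: stop */
        for (int k = 0; k < 8; k++) {
            node->kid[k] = calloc(1, sizeof(OctNode));
            /* ... here a real code would move each body into the kid
                   whose sub-cube contains it ... */
        }
        for (int k = 0; k < 8; k++)
            subdivide(node->kid[k], threshold);
    }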
633 01:02:19.679 --> 01:02:25.019 So it's a nice way to handle irregular spatial information. 634 01:02:25.019 --> 01:02:29.550 Like the Milky Way; you're doing galaxies. 635 01:02:29.550 --> 01:02:38.159 So the universe is a cube, and you cube it up; most of the sub-cubes will be empty, or have only, you know, one star. 636 01:02:38.159 --> 01:02:42.119 The ones with a lot of stuff, you subdivide them again. 637 01:02:42.119 --> 01:02:48.360 So it adapts the data structure to the data. That's good. 638 01:02:48.360 --> 01:02:55.260 What's bad is it's irregular, and in parallel computing that's a little more work. So that's, 639 01:02:55.260 --> 01:02:58.860 recursive data here, in the middle, is what we're talking about: 640 01:02:58.860 --> 01:03:03.480 the octree. And again, I've had arguments with people about it; 641 01:03:03.480 --> 01:03:07.050 I think it's not as good as people think it is. Um. 642 01:03:07.050 --> 01:03:12.929 It's hard to do some things; if the data is really, really, really irregular, an octree is not going to save you. 643 01:03:12.929 --> 01:03:17.940 And it's hard to do things like finding nearby elements, because a nearby 644 01:03:17.940 --> 01:03:23.460 star might be in a totally different part of the tree; you might have to go up many levels and over and down again. 645 01:03:23.460 --> 01:03:29.309 So, but it's popular, so. 646 01:03:29.309 --> 01:03:32.519 Okay, that's the recursive thing. 647 01:03:32.519 --> 01:03:36.750 Oh, now, something new NVIDIA has had the last couple of years. 648 01:03:36.750 --> 01:03:42.389 So the problem with this octree is some parts of the octree will have 649 01:03:42.389 --> 01:03:50.309 gone down more levels than other parts. So how do you, how do you map that onto CUDA? 650 01:03:50.309 --> 01:03:53.309 Um, well. 651 01:03:53.309 --> 01:03:57.539 What NVIDIA now has is that, 652 01:03:57.539 --> 01:04:01.920 inside your kernel, you can, you, 653 01:04:01.920 --> 01:04:10.619 the thread, one thread is picking one chunk of the tree; that thread can create a whole new kernel. 654 01:04:10.619 --> 01:04:13.710 So if you've got one thread for each 655 01:04:13.710 --> 01:04:17.010 kid at, say, generation 1 of the tree, 656 01:04:17.010 --> 01:04:21.210 if this thread has, say, a million 657 01:04:21.210 --> 01:04:27.630 stars in it, that one thread can create a whole new kernel to handle this cell. 658 01:04:28.769 --> 01:04:33.360 It's dynamic creation of kernels; there's going to be an overhead, but 659 01:04:33.360 --> 01:04:42.449 if this one thread determines it's just got too much work to do, this will be a win. You've got the overhead of creating a new kernel, and the whole new kernel has its, um, 660 01:04:42.449 --> 01:04:45.869 you know, thousands of threads maybe. So. 661 01:04:45.869 --> 01:04:50.610 So this is the thing NVIDIA has to handle recursive data. 662 01:04:50.610 --> 01:04:56.909 It's, kernel threads can create whole new parallel kernels. So. 663 01:04:56.909 --> 01:05:00.989 And what happens is they go into the queue, because CUDA has got this queue of kernels, 664 01:05:00.989 --> 01:05:04.829 you know, with kernels having thread blocks and blah, blah, blah, blah. 665 01:05:04.829 --> 01:05:12.030 It just creates a new kernel. Before, you had to create it in your C program, your host program; now you can create a new kernel 666 01:05:12.030 --> 01:05:19.739 inside the existing one. Now the two kernels become sort of independent of each other; you might say the parent kernel and the child kernel.
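(A minimal sketch of that device-side kernel launch, CUDA dynamic parallelism; the kernel names and the work threshold are invented, and it assumes compute capability 3.5 or later with relocatable device code, e.g. nvcc -rdc=true.)

    // child kernel: many threads cooperate on one heavy octree cell
    __global__ void child(int cell, int nBodies) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nBodies) {
            /* ... process body i of this heavy cell ... */
        }
    }

    // parent kernel: one thread per octree cell at this level
    __global__ void parent(const int *bodiesPerCell, int nCells) {
        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        if (cell >= nCells) return;

        int n = bodiesPerCell[cell];
        if (n > 100000) {
            // This cell alone is too much work for one thread:
            // launch a whole new kernel from inside this kernel.
            child<<<(n + 255) / 256, 256>>>(cell, n);
        } else {
            /* ... handle the small cell directly in this thread ... */
        }
    }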
667 01:05:19.739 --> 01:05:30.869 They can establish synchronization, all the obvious stuff: the parent kernel can say I'm going to wait till the child kernel finishes, so it's just like fork-join or something, or master-worker. 668 01:05:30.869 --> 01:05:37.769 It's the obvious synchronization stuff. In any case, you can have, like, recursive kernels now on CUDA. 669 01:05:37.769 --> 01:05:47.159 And others: pipeline, event-driven; most of you know what these terms mean, and if you don't, then I can describe them. 670 01:05:47.159 --> 01:05:51.719 Different ways to look at complex parallel issues. So. 671 01:05:53.489 --> 01:05:58.980 Linear and regular parallelizes better, but maybe the data doesn't like it. So. 672 01:05:58.980 --> 01:06:04.739 Okay, um, more on SPMD being 673 01:06:04.739 --> 01:06:13.019 dominant, um, yeah, so okay. 674 01:06:15.869 --> 01:06:21.960 Mostly, whatever it is, the software matches the hardware, so. 675 01:06:23.070 --> 01:06:30.630 You're going to do a job in parallel; your hardest problem may be 676 01:06:30.630 --> 01:06:36.630 to take your job, take your application, and make it look parallel. Okay, this is, um, 677 01:06:36.630 --> 01:06:41.940 you know, we talk about potential calculations and calculations like that 678 01:06:41.940 --> 01:06:45.119 as examples because it turns out they parallelize very nicely. 679 01:06:45.119 --> 01:06:51.809 One might imagine other big jobs that don't parallelize very well, so we don't talk about them. 680 01:06:51.809 --> 01:06:57.750 Okay, nothing interesting here. 681 01:06:57.750 --> 01:07:01.679 Read it; nothing interesting there. 682 01:07:01.679 --> 01:07:04.949 Uh. 683 01:07:06.239 --> 01:07:10.829 The Nobel Prizes and other awards will go to, um, 684 01:07:10.829 --> 01:07:14.550 qualitatively new science. Um. 685 01:07:17.039 --> 01:07:22.739 To give an example: a few decades ago there was this chemist called E. J. Corey 686 01:07:22.739 --> 01:07:26.820 who was using computers 687 01:07:26.820 --> 01:07:32.369 to work with chemical synthesis, and, 688 01:07:32.369 --> 01:07:35.969 this is talking 50 years ago. So the thing was, 689 01:07:35.969 --> 01:07:40.050 the chemists, some were just starting to use computers to help them, 690 01:07:40.050 --> 01:07:48.659 and they're producing little papers on how I used the computer to compute something, and something else slightly different. 691 01:07:48.659 --> 01:07:56.940 And the major chemists were rather insulting towards the people that were using computers to do incrementally new things. 692 01:07:56.940 --> 01:08:01.230 Okay, um, and. 693 01:08:01.230 --> 01:08:06.480 So E. J. Corey was one of these people, but he figured out qualitatively new things, 694 01:08:06.480 --> 01:08:09.510 and then he got a Nobel Prize in Chemistry for it. 695 01:08:09.510 --> 01:08:18.029 So the thing is, so he took the faster stuff 696 01:08:18.029 --> 01:08:22.199 and used it to do better stuff. So 697 01:08:22.199 --> 01:08:26.520 there is a proverb that 698 01:08:26.520 --> 01:08:32.489 quantity is a quality all its own: you get enough stuff and it starts looking qualitatively different. 699 01:08:32.489 --> 01:08:39.149 So you get something faster, or you do something different, and that's what will get you the prizes. 700 01:08:40.289 --> 01:08:45.840 Okay, so, so what E. J. Corey was doing was, 701 01:08:45.840 --> 01:08:49.260 it was in a general field where the chemists thought it was
702 01:08:49.260 --> 01:08:56.460 silly and superficial, but he morphed it into something that was not silly and superficial; it was fundamental. 703 01:08:56.460 --> 01:09:00.539 And it worked for him, so. 704 01:09:00.539 --> 01:09:07.590 Okay, well, he was at Harvard; he actually got the prize about 40 years ago, so. 705 01:09:07.590 --> 01:09:16.079 Okay; at that time Harvard was picking up a Nobel Prize every two years or so. 706 01:09:16.079 --> 01:09:25.020 Okay, um, nothing interesting there. Um. 707 01:09:27.149 --> 01:09:30.689 It's the thing and so on, um. 708 01:09:32.520 --> 01:09:37.979 Well, again, just to bring it back to what's going on here, and, uh, it, 709 01:09:37.979 --> 01:09:42.779 it can compute energy levels of bonds and therefore can tell, 710 01:09:42.779 --> 01:09:47.609 well, the angle of the bond; you can tell a lot about what's happening. So. 711 01:09:47.609 --> 01:09:52.020 Okay, um, we saw this slide before. 712 01:09:52.020 --> 01:09:56.100 Um, what makes the, the resonance imaging thing 713 01:09:56.100 --> 01:10:00.930 interesting is you're not on a regular Cartesian grid, perhaps because, 714 01:10:00.930 --> 01:10:06.359 oh, the machine is going in a spiral and all sorts of other things. Yeah. 715 01:10:07.680 --> 01:10:12.300 So reconstructing it, and, and, 716 01:10:12.300 --> 01:10:16.140 conceptually it involves inverting a big matrix, 717 01:10:16.140 --> 01:10:20.489 although they do it roughly, perhaps. Nothing there. 718 01:10:20.489 --> 01:10:27.810 Personalized medicine, this is the thing of the future. 719 01:10:27.810 --> 01:10:33.600 If you go in for some operation, for brain surgery or something, cancer there, 720 01:10:33.600 --> 01:10:36.840 they're going to do a model of your personal brain. 721 01:10:36.840 --> 01:10:40.319 They're going to do an individual model of you 722 01:10:40.319 --> 01:10:44.550 and use that to guide the surgery and so on, nowadays. So. 723 01:10:46.680 --> 01:10:50.250 What this means: screening of drugs. You have new drugs; 724 01:10:51.390 --> 01:10:56.159 well, nowadays they test, that is, a trial run, in silico. 725 01:10:56.159 --> 01:11:01.770 That means they would like to compute what the drug will do, 726 01:11:01.770 --> 01:11:06.000 on the supercomputer, rather than having to test it on real people. 727 01:11:06.000 --> 01:11:16.500 Just like when you design a new aircraft, you'd like to maybe model its performance using your CFD software, not actually build it and test-fly it. 728 01:11:16.500 --> 01:11:23.250 You kill fewer pilots that way. And if you can predict the drugs on the computer, what they're going to do... but that's, you know, 729 01:11:23.250 --> 01:11:30.090 easy to say, hard to do. And the next step after that would be to try to, 730 01:11:30.090 --> 01:11:35.220 from first principles, compute what the future drug would be. 731 01:11:37.710 --> 01:11:42.539 Okay, I told you I like to throw numbers around real quick; let me throw some dollar figures at you for drugs. 732 01:11:44.460 --> 01:11:51.750 The pharmaceutical industry grosses a trillion dollars a year worldwide; that would be 10^12 dollars a year. 733 01:11:51.750 --> 01:11:55.199 Okay, and the United States is half of that. 734 01:11:55.199 --> 01:11:59.010 So 5 times 10^11 dollars a year gross 735 01:11:59.010 --> 01:12:04.199 in the United States; so it works out to roughly 1,500 dollars a person for drugs in the United States. 736 01:12:04.199 --> 01:12:07.439 So a new drug costs, depending on
737 01:12:07.439 --> 01:12:13.350 how you do the accounting, a few billion dollars, but that includes the cost of the failed drugs. 738 01:12:14.579 --> 01:12:20.880 But of course the new drug, you know, that works, will contribute to that trillion dollars of annual revenue. 739 01:12:20.880 --> 01:12:28.560 So, and one new drug that's good enough can affect the bottom line of the whole company. So. 740 01:12:29.609 --> 01:12:35.970 So, in other words, there's a lot of money 741 01:12:35.970 --> 01:12:40.920 involved in getting better ways to do new drugs. That's the 742 01:12:40.920 --> 01:12:47.819 takeaway here, so. So, by the way, as an aside: if I hear, you know, somebody 743 01:12:47.819 --> 01:12:54.149 from a drug company complaining that it's, you know, 5 billion dollars for a new drug, 744 01:12:54.149 --> 01:12:57.960 again, it depends on your accounting, and, and, 745 01:12:57.960 --> 01:13:09.630 well, the industry grosses 200 times that in the year, so you can afford to develop several new drugs in a year without, you know, infringing on that trillion-dollar revenue. 746 01:13:09.630 --> 01:13:18.930 Okay, well, we're at 64 minutes; 32 minutes can handle these slides, the numbers. Okay. And so. 747 01:13:20.520 --> 01:13:28.229 In silico, that's a take on in vivo; in vivo would be live tests on living people. 748 01:13:28.229 --> 01:13:32.609 In vitro means in glass, which means tests in a 749 01:13:32.609 --> 01:13:38.760 Petri dish and so on; and in silico would be, in silico would be tests on the computer. 750 01:13:38.760 --> 01:13:47.010 Oh, okay. Um, and. 751 01:13:50.729 --> 01:13:56.579 So, doing it, you know, doing things on the computer, tests for potential drugs and so on. So. 752 01:13:59.220 --> 01:14:06.750 And the thing with electrostatics, okay now, um: so you're worried about how, 753 01:14:06.750 --> 01:14:12.569 for the different parts of the molecule, their potential effect on each other, that affects how they fold. 754 01:14:12.569 --> 01:14:16.380 It affects, one of the things also, the next level up, is this: 755 01:14:18.270 --> 01:14:29.729 so this folded molecule then does things like attract water molecules; it's, it's dissolved in water, also, so the drug molecule is in water, 756 01:14:29.729 --> 01:14:36.000 and the surface of the drug molecule now starts attracting water molecules as well as attracting other proteins. 757 01:14:36.000 --> 01:14:39.119 And that will affect 758 01:14:39.119 --> 01:14:45.060 its activity. So they talk about here the boundary integral equation. So 759 01:14:46.199 --> 01:14:49.829 you're worried about what it's doing in the water and so on. 760 01:14:49.829 --> 01:14:55.770 And then there's polarity issues. Um, you see what's happening. 761 01:14:55.770 --> 01:15:02.069 And these, well, you get polarity issues also: some are polar, some are non-polar, and it affects how they attract each other. 762 01:15:02.069 --> 01:15:05.819 So, the fast multipole method, that, that's what I mentioned, 763 01:15:05.819 --> 01:15:14.159 where you group, um, you have this hierarchy, you hierarchically cluster things together, um. 764 01:15:14.159 --> 01:15:17.340 What multipole means is that, 765 01:15:17.340 --> 01:15:27.869 yeah, you do a Taylor expansion on the field, or on the potential, same thing; the potential is related to the field, the field is the derivative of the potential. 766 01:15:27.869 --> 01:15:39.180 So one atom, that's a simple single-pole potential thing;
two atoms, that's a dipole; multiple atoms, it's a multipole thing. 767 01:15:40.229 --> 01:15:47.250 And mathematically you can do, now, an analogy to a Taylor expansion 768 01:15:47.250 --> 01:15:53.130 on the, on the field, for different points at different distances, and that 769 01:15:53.130 --> 01:16:00.810 all ties into the multipole method, so, okay. And they want, 770 01:16:00.810 --> 01:16:06.510 you know, massive calculations. So, there's a Wikipedia article on that. 771 01:16:06.510 --> 01:16:12.359 Worried about that, worried about that. So the point they're saying here is, yeah, 772 01:16:12.359 --> 01:16:15.420 it's important, um. 773 01:16:17.010 --> 01:16:20.699 This is, um, motherhood: the hardest part is to make your 774 01:16:20.699 --> 01:16:26.670 global thing, um, parallelizable and doing stuff. So, 775 01:16:26.670 --> 01:16:30.119 nothing interesting there. 776 01:16:33.149 --> 01:16:39.000 And if you're going to take your application and parallelize it: 777 01:16:39.000 --> 01:16:44.279 first, put in your pragmas; let OpenMP or OpenACC do it for you. 778 01:16:44.279 --> 01:16:51.270 And, oh, the next thing here is, I've gone very light on, but this is of practical importance again. Um. 779 01:16:53.039 --> 01:16:59.010 The common applications already have parallel code. 780 01:16:59.010 --> 01:17:07.289 Parallel FFT and so on, probably better than you can do it, that sort of thing. MATLAB has some parallel algorithms. 781 01:17:08.550 --> 01:17:14.520 So if what you want to do is something MATLAB has in parallel, say a big matrix, then use MATLAB. 782 01:17:14.520 --> 01:17:19.770 You've got licensing issues: I had an issue where I'm on a machine with lots of cores, 783 01:17:19.770 --> 01:17:23.909 but the license for MATLAB allowed only 8 cores. 784 01:17:23.909 --> 01:17:27.899 So 8 times parallel, that was that, even though the machine could handle more. 785 01:17:27.899 --> 01:17:31.560 cuBLAS, BLAS; what is this stuff here? 786 01:17:31.560 --> 01:17:36.329 BLAS is basically your linear algebra system, 787 01:17:36.329 --> 01:17:44.369 and these are standard, really simple linear algebra things like multiply matrix times vector 788 01:17:44.369 --> 01:17:51.630 that are done a lot. And the basic BLAS is a reference implementation; so it's the, 789 01:17:51.630 --> 01:17:56.640 it's the API definition, and then different people implement the API differently. 790 01:17:56.640 --> 01:18:01.079 There's commercial code, which may be the best; if you wanted to do it really well, maybe 791 01:18:01.079 --> 01:18:06.270 license some commercial code, or the free ones, ATLAS and GotoBLAS and so on. 792 01:18:07.319 --> 01:18:12.359 PGI, that's the NVIDIA compiler; PGI was a company that developed that 793 01:18:12.359 --> 01:18:19.079 and NVIDIA bought it. Nothing... and then the next thing, you know, rewrite and so on. 794 01:18:19.079 --> 01:18:24.960 Okay, nothing interesting there. 795 01:18:28.260 --> 01:18:33.329 Oh, nothing interesting there. Um. 796 01:18:35.100 --> 01:18:38.850 How to do stuff in parallel. 797 01:18:38.850 --> 01:18:44.399 And that was that; let me leave that slide off. Let's see. 798 01:18:46.050 --> 01:18:52.229 Okay, so that was, um, 799 01:18:52.229 --> 01:18:56.039 module 18, and 800 01:18:56.039 --> 01:19:01.829 what is happening after that, from there to the end, is 801 01:19:05.550 --> 01:19:08.699 some other little stuff. 802 01:19:08.699 --> 01:19:16.890 I'll go through some of that. So we're sort of winding down this accelerated computing stuff from NVIDIA.
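(Backing up to the BLAS point a moment ago: a sketch of calling the standard CBLAS matrix-times-vector routine, assuming some BLAS implementation such as OpenBLAS or ATLAS is installed and linked; the sizes here are arbitrary.)

    #include <cblas.h>
    #include <stdio.h>

    int main(void) {
        /* y = alpha*A*x + beta*y, the classic BLAS "gemv" operation */
        double A[2 * 3] = { 1, 2, 3,
                            4, 5, 6 };   /* 2x3 matrix, row-major */
        double x[3] = { 1, 1, 1 };
        double y[2] = { 0, 0 };

        cblas_dgemv(CblasRowMajor, CblasNoTrans,
                    2, 3,         /* M rows, N columns */
                    1.0, A, 3,    /* alpha, A, leading dimension */
                    x, 1,         /* x and its stride */
                    0.0, y, 1);   /* beta, y and its stride */

        printf("y = [%g, %g]\n", y[0], y[1]);   /* expect [6, 15] */
        return 0;
    }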
But you've seen a lot of 803 01:19:16.890 --> 01:19:23.010 things, and I think the next topic will be to hit, 804 01:19:23.010 --> 01:19:26.520 this has been sort of hardware related, so I want to do some Thrust or something. 805 01:19:28.770 --> 01:19:37.500 And you're free to go do this as you need to. So I'm open for questions and so on, and I encourage you to read the book, and I'm encouraging you every class to look at the stuff online. 806 01:19:39.329 --> 01:19:45.600 If you want more NVIDIA stuff, go to their website; they've got white papers and developer blogs and 807 01:19:45.600 --> 01:19:52.350 all sorts of stuff online. So, and NVIDIA really wants to help people learn their system; 808 01:19:52.350 --> 01:19:56.850 they sell more. Okay, so. 809 01:19:56.850 --> 01:20:05.189 Open for questions; other than that, have a good weekend. Maybe go skiing this weekend if we get some snow, and see you Monday. So. 810 01:20:06.960 --> 01:20:10.739 Yeah, I was skiing whenever there was snow last. 811 01:20:10.739 --> 01:20:15.449 Well, you can program faster if you take a day a week off, I think, so. 812 01:20:15.449 --> 01:20:24.029 And maybe, if you, so, basically: read about a big problem and then go skiing, and while you're getting some exercise you may think of a solution. 813 01:20:24.029 --> 01:20:27.210 Really. 814 01:20:27.210 --> 01:20:31.739 Okay, so. 815 01:20:31.739 --> 01:20:35.729 In any case. 816 01:20:35.729 --> 01:20:39.960 Yes, that was. 817 01:20:39.960 --> 01:20:47.699 No, I'm going to assign one, but I'll give you a week after I assign it. What it's going to say 818 01:20:47.699 --> 01:20:53.609 is: look at these NVIDIA parallel docs and write a page summarizing them. So. 819 01:20:53.609 --> 01:20:57.210 Okay, I was curious. 820 01:20:57.210 --> 01:21:04.289 Just, like, just put the thing in a PDF, a listing, so 821 01:21:04.289 --> 01:21:10.500 you're comparing, comparing different things, so write, you know, your conclusion of this. 822 01:21:10.500 --> 01:21:16.500 Include the source code for me to look at; I'm not going to run it, though. So. 823 01:21:16.500 --> 01:21:29.069 And there are tools... I forget if I told you, I told someone: if you've got different PDF pieces, there are tools out there that will combine PDF files into one PDF file. 824 01:21:29.069 --> 01:21:33.180 You need to upload it to Gradescope or whatever. 825 01:21:33.180 --> 01:21:39.510 And should there be something that will not upload, you could always have a pointer to somewhere else. 826 01:21:39.510 --> 01:21:47.250 Yeah, yeah, fine; one set is fine. Yeah, sure. Okay. 827 01:21:47.250 --> 01:21:55.680 I had a really cool thing I used for a final report on a project: I had lots of separate PDF files, and 828 01:21:55.680 --> 01:22:03.869 I just combined them; I took each PDF file, embedded it in a big LaTeX document with running headers and footers and page numbers. 829 01:22:03.869 --> 01:22:08.609 It's really nice. It's really optional. 830 01:22:08.609 --> 01:22:11.640 I wanted to ask you, like, 831 01:22:11.640 --> 01:22:26.454 do you want us to, like, have some of this stuff in a text file, loading the text from the file, the text that needs to be converted anyway; or do you want us to, like, have some variables inside of the code that already have the string? 832 01:22:26.760 --> 01:22:31.260 I don't care, whichever is easier. 833 01:22:31.260 --> 01:22:36.569 It takes a little work to read a text file in C, but not that much work. Yeah. Yeah. 834 01:22:36.569 --> 01:22:43.050 You know, line by line.
Well, I actually don't like lines that much; I would just read the whole thing in. 835 01:22:43.050 --> 01:22:49.590 Um, you know, yeah, so. 836 01:22:49.590 --> 01:22:53.579 Oh, I think you could do it. Okay. Okay. You want, 837 01:22:53.579 --> 01:23:01.289 here's one way you could do it: there's a way in C for us to map the whole file into virtual memory. 838 01:23:01.289 --> 01:23:05.159 So, I figure it's, 839 01:23:05.159 --> 01:23:10.920 mmap, the file-map thing. So what happens is, you open the file, you call this magic function, 840 01:23:10.920 --> 01:23:15.329 and it returns a pointer to the file in virtual memory, 841 01:23:15.329 --> 01:23:21.420 and also the length. So now you can just leave it as it is, or take the length, define 842 01:23:21.420 --> 01:23:25.770 an internal variable of your own and copy it in. 843 01:23:25.770 --> 01:23:29.220 It's really, it's also efficient. 844 01:23:29.220 --> 01:23:34.020 It's also cool. So it's kind of hidden, sort of an artificial idea. 845 01:23:34.020 --> 01:23:37.380 That's an aside, like, that sort of style. 846 01:23:37.380 --> 01:23:41.220 You can also open the file and get the length and then do a read. 847 01:23:41.220 --> 01:23:49.050 There's, like, an fread function, and it just reads n bytes; you give it n and it reads that many bytes in one function call. Yeah. 848 01:23:49.050 --> 01:23:56.010 Again, or you could iterate over getc, byte by byte, get a line, get a character, and that seems, 849 01:23:56.010 --> 01:24:00.390 Character by character... you can load in, though, 850 01:24:00.390 --> 01:24:04.199 a full sentence, load in the whole file? 851 01:24:04.199 --> 01:24:09.869 Yeah; no, what I'm talking about works only if your memory is bigger than the file, but yeah. 852 01:24:09.869 --> 01:24:16.800 Yeah, a lot of these techniques you see were designed decades ago, when your files were bigger than your memory. 853 01:24:16.800 --> 01:24:20.819 The whole idea of passes for compilers was, was, 854 01:24:20.819 --> 01:24:25.949 the computer was too small to store the whole program all at once; you had to read it in multiple times. And yeah. 855 01:24:25.949 --> 01:24:30.539 But ultimately, for you guys: you've got memory, use it. 856 01:24:30.539 --> 01:24:34.319 Yeah, yeah. 857 01:24:36.420 --> 01:24:41.729 Okay. 858 01:24:48.630 --> 01:24:57.000 Okay. 859 01:25:02.699 --> 01:25:06.930 Hello. 860 01:25:06.930 --> 01:25:13.199 Hello. 861 01:25:13.199 --> 01:25:17.039 Absolutely. 862 01:26:02.760 --> 01:26:07.439 Hello. 863 01:26:43.890 --> 01:26:46.920 For it. 864 01:26:48.029 --> 01:26:51.569 Hello. 865 01:26:59.340 --> 01:27:04.109 Huh. 866 01:27:46.470 --> 01:27:51.090 Hello. 867 01:28:07.470 --> 01:28:11.100 Okay. 868 01:29:08.579 --> 01:29:11.909 Hello. 869 01:29:18.180 --> 01:29:21.569 Okay. 870 01:29:48.810 --> 01:29:52.680 Okay. 871 01:30:28.890 --> 01:30:32.159 Hello. 872 01:31:40.949 --> 01:31:47.399 Okay. 873 01:32:12.599 --> 01:32:16.679 Hello. 874 01:32:28.259 --> 01:32:32.248 Okay.
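(For the file-reading question in the Q&A above, a minimal sketch of the mmap approach on Linux/Unix; error handling is mostly omitted, and the alternative mentioned in class, fstat plus one fread, works the same way without the mapping.)

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;

        int fd = open(argv[1], O_RDONLY);   /* open the file */
        struct stat st;
        fstat(fd, &st);                     /* get its length */

        /* the magic function: map the whole file into virtual memory */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* p now points at the file contents and st.st_size is the length;
           use it in place, or copy it into your own string. */
        fwrite(p, 1, st.st_size, stdout);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }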