WEBVTT 1 00:00:00.000 --> 00:00:03.419 Message. 2 00:00:03.419 --> 00:00:06.780 Okay. 3 00:00:06.780 --> 00:00:11.218 Hello. 4 00:00:14.788 --> 00:00:18.509 1. 5 00:00:18.509 --> 00:00:22.769 That's. 6 00:00:33.179 --> 00:00:37.439 I wonder if. 7 00:00:37.439 --> 00:00:41.909 Hello. 8 00:00:48.210 --> 00:00:51.450 That. 9 00:00:57.270 --> 00:01:01.560 Okay. 10 00:01:06.239 --> 00:01:09.840 Huh. 11 00:01:09.840 --> 00:01:18.540 Okay, so good afternoon. This is class 13, if I'm counting right. 12 00:01:19.680 --> 00:01:28.859 Parallel computing, not probability. So we're continuing on with the NVIDIA accelerated computing material, so it's a mixture of hardware 13 00:01:28.859 --> 00:01:33.329 and some algorithms and so on; we'll see more algorithms today. 14 00:01:33.329 --> 00:01:40.079 And I have some links here for some recent NVIDIA material. I'll turn it into a homework, actually, 15 00:01:40.079 --> 00:01:43.079 so you can read it and answer questions on it. 16 00:01:44.159 --> 00:01:50.939 And so, where we are: 17 00:01:50.939 --> 00:01:54.870 we'll look at a new example today, 18 00:01:54.870 --> 00:02:03.390 the potential calculation, and then look at some ways to optimize programs for parallel computing. 19 00:02:03.390 --> 00:02:08.460 So, going on here. 20 00:02:08.460 --> 00:02:11.490 Okay. 21 00:02:11.490 --> 00:02:15.900 Okay. 22 00:02:15.900 --> 00:02:21.930 Okay. 23 00:02:21.930 --> 00:02:24.990 And that is actually working. 24 00:02:24.990 --> 00:02:28.050 So. 25 00:02:29.250 --> 00:02:32.729 Let's try something like this, even. 26 00:02:35.819 --> 00:02:40.830 Okay, so we're going to see some of these programming techniques. 27 00:02:40.830 --> 00:02:45.240 And the thing is that we have 28 00:02:45.240 --> 00:02:51.750 some sort of molecule, a protein or whatever, and we want to calculate some potentials. Or there are 29 00:02:51.750 --> 00:02:56.039 planets and suns in the Milky Way, or something like 30 00:02:56.039 --> 00:03:01.439 that, and that's the kind of thing we're going to be doing. 31 00:03:21.235 --> 00:03:30.354 So, the atoms are at irregular places; each atom, you know, has got some charge, or some mass. 32 00:03:30.629 --> 00:03:35.099 And then we want to calculate the potential or the force field 33 00:03:35.099 --> 00:03:43.710 at each regular grid point. And so for the force it's an inverse square law; for the potential it's an inverse distance law. 34 00:03:43.710 --> 00:03:49.710 Now, if I stop right here, think how you would do it before I go to the next slides. 35 00:03:49.710 --> 00:03:55.259 You're doing a lot of iterating, and the first question you would ask yourself is what 36 00:03:55.259 --> 00:04:04.830 you want your outermost loop to be. The outermost loop could iterate over the atoms, or iterate over 37 00:04:04.830 --> 00:04:14.400 the grid points, and then the innermost loop iterates over the other. So, one way you might do this is that you could iterate over the atoms, 38 00:04:14.400 --> 00:04:18.089 and for each atom you calculate its effect on each grid point 39 00:04:18.089 --> 00:04:21.149 and sum it into a subtotal for each grid point. 40 00:04:21.149 --> 00:04:26.819 The opposite way is that you iterate over the grid points, and for each grid point you 41 00:04:26.819 --> 00:04:33.629 then iterate over the atoms and compute each atom's contribution, just compute it right away for that grid point. So, 42 00:04:33.629 --> 00:04:36.658 those are the two opposite philosophies.
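To make the two philosophies concrete, here is a minimal sequential sketch in C; the names natoms, ngrid, atom, pot and contrib are invented for illustration, not the slide's code:

```c
/* Scatter style: outer loop over atoms, each atom updates every grid point. */
for (int a = 0; a < natoms; a++)
    for (int g = 0; g < ngrid; g++)
        pot[g] += contrib(atom[a], g);      /* read-modify-write on pot[g] */

/* Gather style: outer loop over grid points, each one reads every atom. */
for (int g = 0; g < ngrid; g++) {
    float sum = 0.0f;
    for (int a = 0; a < natoms; a++)
        sum += contrib(atom[a], g);         /* accumulate privately */
    pot[g] = sum;                           /* one write per grid point */
}
```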
43 00:04:36.658 --> 00:04:41.369 And again, think about it, you know, 44 00:04:41.369 --> 00:04:45.778 how you would solve it yourself before I go to the next few slides. 45 00:04:45.778 --> 00:04:50.759 Your outermost loop you're going to do in parallel, 46 00:04:50.759 --> 00:04:54.298 obviously, because this is a parallel course. 47 00:04:54.298 --> 00:05:01.348 So, let's suppose the outermost loop is the atoms; you're iterating over the atoms. 48 00:05:01.348 --> 00:05:05.278 Then inside, for each atom, 49 00:05:05.278 --> 00:05:08.819 you iterate over the grid cells. So, 50 00:05:08.819 --> 00:05:16.168 when you're summing potentials at each grid point, that's going to have to be a read-modify-write operation. That's going to have to be an atomic operation. 51 00:05:16.168 --> 00:05:20.158 So, if your outermost loop is the atoms, 52 00:05:20.158 --> 00:05:25.408 you're going to have to have a read-modify-write for each grid point. 53 00:05:25.408 --> 00:05:29.488 And you're doing it in parallel over the atoms, so if you do that, 54 00:05:29.488 --> 00:05:36.478 you're going to have a lot of contention. Well, first, the atomic operations are expensive, 55 00:05:36.478 --> 00:05:44.728 and you're going to have a lot of them, and in any reasonable case you're also going to get clashes, which will slow things down. 56 00:05:45.869 --> 00:05:52.048 If that's your outermost loop, that's called a scatter operation; this will be written down in the slides. But 57 00:05:52.048 --> 00:05:55.858 can you think how you would solve it yourself? 58 00:05:55.858 --> 00:06:01.439 If you iterate over the atoms, red atom, blue atom and so on, 59 00:06:01.439 --> 00:06:06.358 then for each atom you're scattering contributions across the whole grid. 60 00:06:06.358 --> 00:06:11.459 So, the red atom, you compute its effect on these grid points; 61 00:06:11.459 --> 00:06:14.908 here you've got 1, 2, 3, 62 00:06:14.908 --> 00:06:24.928 you've got 6 times 4, you've got 24 grid points here. The red atom contributes to the subtotals for the 24 grid points, and so it's scattering 63 00:06:24.928 --> 00:06:30.119 the contribution of the red atom over the whole grid; it's a scatter paradigm. 64 00:06:31.348 --> 00:06:35.639 And it actually turns out that if you have a sequential machine, that's efficient. 65 00:06:36.809 --> 00:06:43.559 But the problem on a parallel machine is you have to have these atomic read-modify-writes 66 00:06:43.559 --> 00:06:47.819 to add the contributions into each 67 00:06:47.819 --> 00:07:01.019 grid point. So the opposite method is that your outermost loop is the grid, the grid cells, and that one's in parallel over the grid cells. Each grid cell, each grid point, is say a thread. 68 00:07:02.218 --> 00:07:12.178 Then each thread iterates over all the atoms and gathers the contributions of the atoms that affect itself. 69 00:07:12.178 --> 00:07:19.108 So this method is better, because you don't have any read-modify-write issues, 70 00:07:19.108 --> 00:07:26.158 because the gather operation reads all the atoms, but all the writing is occurring onto the current 71 00:07:26.158 --> 00:07:29.908 grid point. So there's no problem 72 00:07:29.908 --> 00:07:36.689 with atomic locking, so right there you figure it's probably going to be more efficient. 73 00:07:37.769 --> 00:07:41.488 The other thing, another thing, 74 00:07:41.488 --> 00:07:46.199 is that again, you're reading
75 00:07:46.199 --> 00:07:49.798 information about all of the atoms, but 76 00:07:51.209 --> 00:07:58.798 you start thinking now, well, the atoms are constant, they're not moving, so you're reading information from this constant array. So 77 00:07:58.798 --> 00:08:07.649 you could imagine caching things already; the hardware might do it for you. It probably won't fit in constant memory, but you can see the idea, 78 00:08:07.649 --> 00:08:13.348 that all the threads are reading this constant information 79 00:08:13.348 --> 00:08:22.738 about the atoms, so there could be a big latency if the atom information is up in the global memory. But latencies are okay; this is 80 00:08:22.738 --> 00:08:28.978 the design decision that NVIDIA made: they don't care about latency, 81 00:08:28.978 --> 00:08:32.548 because they've got lots of threads, and if a thread gets blocked 82 00:08:32.548 --> 00:08:37.769 on something, another thread executes, another warp, and eventually it all gets done. 83 00:08:37.769 --> 00:08:44.818 Okay, so it will turn out that the gather paradigm on a parallel machine 84 00:08:44.818 --> 00:08:52.948 is more efficient. So this is what you would be thinking, perhaps, if you were given this as a problem 85 00:08:52.948 --> 00:08:58.889 and you had to find a solution; you might think about which would be your outermost loop. 86 00:08:58.889 --> 00:09:04.198 The grid points, and you gather information from the atoms; 87 00:09:04.198 --> 00:09:10.349 or else the outermost loop might be the atoms, and you scatter information from the atoms 88 00:09:10.349 --> 00:09:13.619 to the grid points. Okay. 89 00:09:13.619 --> 00:09:19.078 So, the rest of these two slide decks will be talking about these two choices in more detail. 90 00:09:19.078 --> 00:09:25.499 So, electrostatic potential. 91 00:09:25.499 --> 00:09:32.219 Oh, let me go back to this previous one here. 92 00:09:32.219 --> 00:09:35.698 Again, let's suppose you're thinking about this. 93 00:09:35.698 --> 00:09:40.828 A couple of other things you might think about: 94 00:09:40.828 --> 00:09:47.038 you might imagine some sort of hierarchy, some quadtree, for the atoms. 95 00:09:47.038 --> 00:09:51.778 Because if there's a bunch of atoms that are a long distance away, 96 00:09:51.778 --> 00:09:57.149 you could cluster them together and create one pseudo-molecule or something from them, 97 00:09:57.149 --> 00:10:04.649 and then figure out the contribution of that cluster, let's say, to all the distant points. So 98 00:10:04.649 --> 00:10:09.448 you can imagine some hierarchical clustering thing, and that's in fact what people do. 99 00:10:09.448 --> 00:10:18.658 And that actually cuts the time, if you're counting the number of operations. The naive thing is a quadratic number of operations, 100 00:10:18.658 --> 00:10:23.099 because each atom affects each grid point. 101 00:10:23.099 --> 00:10:27.719 Well, if you had some hierarchy of clustering, 102 00:10:27.719 --> 00:10:32.188 you actually can cut the sequential time from N squared down to N log N. 103 00:10:32.188 --> 00:10:35.788 So this is a standard method that's used for these things. 104 00:10:35.788 --> 00:10:42.389 Now, what makes it interesting: say people are doing this for the Milky Way, 105 00:10:42.389 --> 00:10:50.099 an N-body problem, and they want to compute, say, how the Milky Way will evolve. 106 00:10:50.099 --> 00:10:53.099 And so a star moves depending on the
107 00:10:53.099 --> 00:10:56.938 net gravitational force on it from all the other stars. 108 00:10:58.078 --> 00:11:01.859 Well, the problem is that the stars move around differently. 109 00:11:01.859 --> 00:11:08.249 If you look at the Milky Way, it's got these famous spiral arms. These arms are a dynamic phenomenon. 110 00:11:08.249 --> 00:11:14.219 You can imagine the speed at which the arm is rotating around the galaxy, 111 00:11:14.219 --> 00:11:21.418 but if you look at the individual stars in the arm, they're moving at a different speed. An individual star 112 00:11:21.418 --> 00:11:27.658 moves into an arm and then moves out of the arm again; these stars are moving at a different speed around the galaxy. 113 00:11:27.658 --> 00:11:33.989 It's like a group speed versus a phase speed, let's say; 114 00:11:33.989 --> 00:11:38.428 we talk about that sort of thing in physics. 115 00:11:38.428 --> 00:11:42.658 So, if you cluster the stars in this hierarchical way to 116 00:11:42.658 --> 00:11:47.698 do this potential map more efficiently, you have to update the clustering, 117 00:11:47.698 --> 00:11:52.229 because the stars move in and out of different regions. 118 00:11:52.229 --> 00:11:57.958 You've got an analogy for this dynamic thing: if you look at a water wave in the ocean, 119 00:11:57.958 --> 00:12:02.578 you look at the speed that the wave, as a wave, is moving, 120 00:12:02.578 --> 00:12:06.178 and you look at the speed of the water molecules in the wave. 121 00:12:06.178 --> 00:12:16.198 The wave as a whole is moving faster than the individual water molecules. So if you're clustering the molecules into the wave, you have to keep updating your clustering. 122 00:12:17.759 --> 00:12:25.229 In physics, the physicists talk about things, I've forgotten some of my physics, like a wave 123 00:12:25.229 --> 00:12:32.038 speed; the phase speed of a wave can actually go faster than the speed of light, but the speed at which information is moving 124 00:12:32.038 --> 00:12:35.129 is always less. 125 00:12:35.129 --> 00:12:39.928 So, in any case, you might be thinking about something like that, a hierarchical way 126 00:12:39.928 --> 00:12:44.399 to make this process more efficient, and of course then you'd have to think: 127 00:12:44.399 --> 00:12:49.499 will this hierarchical way work in parallel? Because your data structure is now weird. 128 00:12:50.698 --> 00:12:56.788 Another thing you might think of, an obvious thing to do that we haven't talked about, is called thread coarsening. 129 00:12:56.788 --> 00:13:03.269 We've just been talking about how one thread does one atom, or one thread does one grid point. 130 00:13:03.269 --> 00:13:08.849 Well, you might say, let's have a thread do 10 atoms or something, or 10 grid points. 131 00:13:08.849 --> 00:13:12.719 And that could be more efficient, 132 00:13:12.719 --> 00:13:17.759 because you might be able to localize some information in the registers or something. 133 00:13:17.759 --> 00:13:24.688 And so, if you're willing to make your programming more complicated, because with coarsening the one thread 134 00:13:24.688 --> 00:13:29.879 does more than one little piece of data, you might get faster. 135 00:13:29.879 --> 00:13:34.139 Okay, so those are the things you might be thinking of. 136 00:13:34.139 --> 00:13:38.369 Let's move on and see what they talk about here. So this is 137 00:13:38.369 --> 00:13:41.519 computing potentials, or else
138 00:13:43.889 --> 00:13:49.528 computing forces would be the same thing. 139 00:13:49.528 --> 00:13:55.349 And then you could calculate movements, just as an aside. 140 00:13:55.349 --> 00:13:59.578 You know, if it's orbits, you'd want them to come out as ellipses and that sort of thing. 141 00:13:59.578 --> 00:14:04.198 Oh, by the way, if you wonder 142 00:14:05.428 --> 00:14:09.089 how we know a formula like this here, 143 00:14:11.578 --> 00:14:17.188 well, you know, this is an obvious formula, it has to be true because we teach it in the physics class, but 144 00:14:17.188 --> 00:14:25.048 how do we know that it's correct? Well, you use this formula to calculate movement, calculate orbits, let's say. 145 00:14:25.048 --> 00:14:32.129 And if this was not an inverse law, if there was a different exponent up here, 146 00:14:32.129 --> 00:14:38.249 you would not get stable orbits. So the fact that you do, in fact, see stable orbits 147 00:14:38.249 --> 00:14:42.989 is a proof that this is the right potential. 148 00:14:42.989 --> 00:14:46.678 Well, so okay. 149 00:14:46.678 --> 00:14:49.889 Um. 150 00:14:52.558 --> 00:15:00.328 So here's the idea they're talking about here: iterate over the lattice points, where each lattice point is a separate 151 00:15:00.328 --> 00:15:04.109 thread, 152 00:15:04.109 --> 00:15:07.649 and then the inner loop iterates over the atoms. 153 00:15:07.649 --> 00:15:14.009 And they mention obvious things here, that you have a cutoff threshold and all that fun stuff. So, 154 00:15:14.009 --> 00:15:18.958 okay. 155 00:15:20.818 --> 00:15:24.119 So, 156 00:15:24.119 --> 00:15:31.499 the input is the positions of the atoms and the charge on each one, or else the mass or something. 157 00:15:31.499 --> 00:15:42.298 The position of each atom is known more accurately than the grid. So the atoms are not being snapped to grid points; 158 00:15:42.298 --> 00:15:46.859 the grid is coarser than the positions of the atoms. 159 00:15:46.859 --> 00:15:52.948 Because if you snap the atoms to grid points, you're losing some information which 160 00:15:52.948 --> 00:15:59.188 you don't have to lose. So, the thing is that the atoms are at irregular positions, and that just makes 161 00:15:59.188 --> 00:16:07.979 things messy, but again, the point is that we're going to compute the potentials at the grid points. 162 00:16:09.149 --> 00:16:12.389 Oh, now why do we want the potentials? 163 00:16:12.389 --> 00:16:18.058 Well, an application might be protein folding. The protein is very long and 164 00:16:18.058 --> 00:16:23.339 flexible, and it can fold in on itself. 165 00:16:23.339 --> 00:16:28.259 And how it folds in on itself 166 00:16:28.259 --> 00:16:31.769 affects what the protein does, because 167 00:16:31.769 --> 00:16:37.558 it's the parts of the protein that are on the outside that can interact with other molecules, 168 00:16:37.558 --> 00:16:41.609 and what's on the outside depends on how it folded. 169 00:16:41.609 --> 00:16:51.119 So, how the protein folds is important. Now, how does it fold? It folds so that different parts of the molecule align with each other and attract each other 170 00:16:51.119 --> 00:16:56.999 and minimize the potential; you're minimizing an objective function 171 00:16:56.999 --> 00:17:00.928 to find the folding. So the folding itself is the big problem, 172 00:17:00.928 --> 00:17:03.958 and very computationally intensive.
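For reference, the quantity being summed at each grid point is the standard Coulomb potential; the slide's exact constants and cutoff handling are not in the transcript, but the form is:

```latex
% Potential falls off as 1/r, force magnitude as 1/r^2 (standard Coulomb form).
V(\mathbf{r}_j) \;=\; \sum_{i=1}^{N_{\text{atoms}}} \frac{q_i}{4\pi\varepsilon_0\,\lVert \mathbf{r}_j - \mathbf{r}_i \rVert},
\qquad
\lVert \mathbf{F} \rVert \;\propto\; \frac{1}{r^{2}}
```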
173 00:17:03.958 --> 00:17:10.138 And it's important for, for example, drug design, because how the protein folds 174 00:17:10.138 --> 00:17:15.179 will control how active the protein is and what other molecules 175 00:17:15.179 --> 00:17:23.068 it will align with. And how the protein folds is predicted by computing, among other things, these potentials. 176 00:17:23.068 --> 00:17:27.838 So, it's a practical problem as well as being an interesting problem. 177 00:17:27.838 --> 00:17:32.729 So, the biophysics people like this; again, the astronomers 178 00:17:32.729 --> 00:17:37.739 like a related problem. Okay. 179 00:17:41.159 --> 00:17:47.098 So, the sequential version. This would be the first of the two choices I mentioned: 180 00:17:47.098 --> 00:17:50.759 iterate over the atoms, that's the outer loop, 181 00:17:50.759 --> 00:17:54.479 and the inner loop iterates over the lattice points. 182 00:17:54.479 --> 00:18:00.598 But again, the atom information is read-only, it's constant. 183 00:18:01.949 --> 00:18:05.098 And so these methods, they mention, 184 00:18:05.098 --> 00:18:09.269 well, they asymptotically have the same time, because asymptotically 185 00:18:09.269 --> 00:18:13.679 they compute the effect of each atom on each lattice point once. 186 00:18:13.679 --> 00:18:17.699 So, in that sense, they are equivalent. 187 00:18:20.578 --> 00:18:25.259 In that sense they're all taking the same amount of time, until you get down into the details. 188 00:18:25.259 --> 00:18:32.038 So, yeah. 189 00:18:32.038 --> 00:18:36.538 Just some interesting things in this sample C code: 190 00:18:36.538 --> 00:18:39.778 here, 191 00:18:41.398 --> 00:18:44.878 the outermost loop iterates over the atoms, 192 00:18:44.878 --> 00:18:49.739 the second loop iterates over grid cells in y, and the third loop 193 00:18:49.739 --> 00:18:58.469 iterates over grid cells in x. So what it's doing is that, for each atom, it computes something, 194 00:18:58.469 --> 00:19:01.588 and it then does 195 00:19:01.588 --> 00:19:07.348 an update on the total potential for each grid cell. So, 196 00:19:07.348 --> 00:19:12.479 again, if you're thinking ahead, this line here is going to be a read-modify-write. 197 00:19:13.979 --> 00:19:21.598 Okay, so this is the one version of the program, the version where the atoms are the outermost loop, iterating over the atoms. Now, 198 00:19:23.429 --> 00:19:32.729 if you look at this right here, there are various optimization games that you could play, if you actually cared. 199 00:19:32.729 --> 00:19:40.499 The thing is that a good optimizing compiler would play these games for you. 200 00:19:40.499 --> 00:19:44.759 What would be an example? 201 00:19:46.919 --> 00:19:51.148 Well, they're doing some stuff here they don't actually have to worry about, like 202 00:19:51.148 --> 00:19:55.378 they're computing dy and dy squared. 203 00:19:56.848 --> 00:20:01.679 Yeah, well, you know, you could just 204 00:20:01.679 --> 00:20:06.959 compute dy and dz or whatever you need, and the optimizing compiler will do that for you, 205 00:20:06.959 --> 00:20:16.348 hoisting dy squared out of the inner loop. So there are some things here that the compiler is quite good at actually fixing. 206 00:20:17.848 --> 00:20:22.078 Okay, and then what are some other things? 207 00:20:22.078 --> 00:20:29.969 Here, you could optimize the square root, but square root takes only a cycle or two nowadays, so you don't have to worry about it much.
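Here is a sketch of what such a sequential scatter version might look like for one z-slice, assuming the atoms are packed as (x, y, z, charge) quadruples; the variable names are mine, not necessarily the slide's:

```c
#include <math.h>

/* Sequential scatter: outer loop over atoms, inner loops over the grid.
   All names are illustrative. */
void potential_slice_seq(float *energygrid, int nx, int ny,
                         float gridspacing, float z,
                         const float *atoms, int numatoms)
{
    for (int n = 0; n < numatoms; n++) {            /* outer loop: atoms */
        float dz  = z - atoms[4*n + 2];
        float dz2 = dz * dz;                        /* hoisted out of inner loops */
        float q   = atoms[4*n + 3];
        for (int j = 0; j < ny; j++) {              /* second loop: grid y */
            float dy    = gridspacing * j - atoms[4*n + 1];
            float dydz2 = dy * dy + dz2;            /* hoisted out of x loop */
            for (int i = 0; i < nx; i++) {          /* third loop: grid x */
                float dx = gridspacing * i - atoms[4*n + 0];
                /* read-modify-write on the grid subtotal */
                energygrid[j*nx + i] += q / sqrtf(dx*dx + dydz2);
            }
        }
    }
}
```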
For floats, you don't care much. 208 00:20:29.969 --> 00:20:33.929 So, there are various things you could do, 209 00:20:33.929 --> 00:20:38.878 games you could play, that I don't worry about. That's what compilers will do. 210 00:20:41.068 --> 00:20:51.209 I mean, I used to joke that what I wanted from an optimizing compiler is that it would produce correct code. I didn't even care if it was fast; I just cared that it didn't destroy my program, 211 00:20:51.209 --> 00:20:54.989 because compilers had optimizers that would actually produce bad code. 212 00:20:54.989 --> 00:21:02.489 They actually produce faster code now. So, in any case, they talk about various tricks we don't need to worry about. 213 00:21:02.489 --> 00:21:05.608 Okay, I'll skip through this. 214 00:21:05.608 --> 00:21:10.499 Okay, now what they're doing here, I'll give you the executive summary. 215 00:21:10.499 --> 00:21:18.989 They're avoiding recomputing constants: if there's something in a loop which is constant inside it, hoist that up out of the loop 216 00:21:18.989 --> 00:21:24.449 and compute it once. Compilers do that nowadays; 217 00:21:24.449 --> 00:21:28.828 the optimizers are actually useful for once. Okay. 218 00:21:30.808 --> 00:21:34.648 And we'll skip all that. Okay. Now, 219 00:21:37.828 --> 00:21:43.019 how are you going to do this in parallel? 220 00:21:47.009 --> 00:21:50.038 So, there are various tricks you can do. 221 00:21:53.128 --> 00:22:03.538 I mean, the GPU might not be big enough to store the whole lattice array. Perhaps your lattice is big, say a billion points. 222 00:22:03.538 --> 00:22:08.038 And maybe you're not lucky enough to have a GPU that can store this 223 00:22:08.038 --> 00:22:14.729 array with a billion points in it. We do, actually, but, you know, most cannot. 224 00:22:14.729 --> 00:22:19.259 So, what they're talking about here is 225 00:22:19.259 --> 00:22:26.429 a slice of the map; a slice would be a chunk of the map that's small enough to fit on the GPU. 226 00:22:26.429 --> 00:22:29.729 So they would essentially, 227 00:22:29.729 --> 00:22:33.088 they are implementing a virtual memory manager, you might say: 228 00:22:33.088 --> 00:22:37.199 take a slice of the map, put it on the GPU, and process it. 229 00:22:37.199 --> 00:22:40.949 That's what they're talking about here. So, 230 00:22:40.949 --> 00:22:45.148 a slice, it's a slice of the lattice array. Okay. 231 00:22:45.148 --> 00:22:49.949 Because again, we're scattering information from the atoms over 232 00:22:49.949 --> 00:22:56.699 the lattice, right? So you might say here, instead of, for each atom, computing its effect on the whole lattice, 233 00:22:56.699 --> 00:23:06.568 we loop over slices of the lattice and compute one slice at a time, where the slice is small enough to fit on the GPU. 234 00:23:06.568 --> 00:23:11.068 And that's what they're talking about here. So I gave you the executive summary 235 00:23:11.068 --> 00:23:17.249 of that: we compute one slice, we page it back to the host, 236 00:23:17.249 --> 00:23:21.239 and then we compute the next slice, and so on. 237 00:23:21.239 --> 00:23:29.308 So, with this, we're reading the atoms more than once, but that may be okay; they're read-only. 238 00:23:29.308 --> 00:23:33.838 So that's the idea here.
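A sketch of that virtual-memory-manager idea on the host side, assuming some kernel like the ones sketched further on fills one slice, and that the row count divides evenly; all names are illustrative:

```cpp
// Keep the constant atom data resident on the GPU; page one lattice slice at a
// time through device memory and copy each finished slice back to the host.
float *d_atoms, *d_slice;
cudaMalloc(&d_atoms, (size_t)numatoms * 4 * sizeof(float));
cudaMemcpy(d_atoms, h_atoms, (size_t)numatoms * 4 * sizeof(float),
           cudaMemcpyHostToDevice);

size_t sliceBytes = (size_t)nx * sliceRows * sizeof(float);
cudaMalloc(&d_slice, sliceBytes);

for (int row0 = 0; row0 < ny; row0 += sliceRows) {
    cudaMemset(d_slice, 0, sliceBytes);              // fresh subtotals
    potential_slice_kernel<<<blocks, threads>>>(     // hypothetical kernel
        d_slice, nx, sliceRows, row0, gridspacing, d_atoms, numatoms);
    cudaMemcpy(h_grid + (size_t)row0 * nx, d_slice, sliceBytes,
               cudaMemcpyDeviceToHost);              // page the slice back
}
cudaFree(d_slice);
cudaFree(d_atoms);
```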
239 00:23:35.308 --> 00:23:44.578 Okay, now we're going to parallelize it. So the outer loop loops over atoms; each atom is a thread. 240 00:23:44.578 --> 00:23:49.348 So you can have 100,000 threads. 241 00:23:49.348 --> 00:23:53.519 NVIDIA made the design decision to say, hey, you can have lots of threads. 242 00:23:53.519 --> 00:23:58.348 If you can run only, say, 5,000 at a time and you've got 100,000 threads, 243 00:23:58.348 --> 00:24:01.949 the rest are queued up waiting to run, so that's okay, that works. 244 00:24:01.949 --> 00:24:07.318 And, in fact, it's nice to have more threads waiting than you have threads executing. 245 00:24:07.318 --> 00:24:15.898 This handles the latency issue. So latency is not a problem if you've got more threads than you've got CUDA cores. 246 00:24:15.898 --> 00:24:20.878 Okay, nothing weird there. 247 00:24:20.878 --> 00:24:27.148 So, this is a straightforward, obvious method. The only problem here: this is a scatter thing. 248 00:24:27.148 --> 00:24:32.519 For each atom, we're scattering information to the lattice points of the slice, 249 00:24:33.959 --> 00:24:39.838 and we're going to have this read-modify-write problem. 250 00:24:41.429 --> 00:24:45.088 Again, the problem with the read-modify-write is that 251 00:24:45.088 --> 00:24:48.419 it slows things down: 252 00:24:48.419 --> 00:24:54.598 thread one is acting on this lattice point, and thread two has to wait if it wants to write to this point. 253 00:24:54.598 --> 00:25:00.388 Okay, so. 254 00:25:02.939 --> 00:25:06.598 Now, what they're talking about here is another issue, by the way. 255 00:25:08.189 --> 00:25:14.398 Why are we spending time on this example? Like, why did NVIDIA take the time 256 00:25:14.398 --> 00:25:17.848 to talk about it, and why am I taking your time in class? 257 00:25:17.848 --> 00:25:21.179 There are two reasons. The specific reason 258 00:25:21.179 --> 00:25:26.759 is that this is actually a very common application for parallel computing. 259 00:25:26.759 --> 00:25:31.648 It's in different domains; if it's not in biophysics, it's astronomy and so on. 260 00:25:31.648 --> 00:25:35.338 So, the specific problem we're spending time on 261 00:25:35.338 --> 00:25:39.269 is a useful problem that is so compute-intensive you want to parallelize it. 262 00:25:39.269 --> 00:25:45.148 The general reason for this is that the techniques that we see here 263 00:25:45.148 --> 00:25:48.298 are also used in other parallel programs, 264 00:25:48.298 --> 00:25:57.898 so we see general ideas that are worth putting in your toolkit; you develop a toolkit of tools you can use. 265 00:25:57.898 --> 00:26:02.368 You know, it's not different types of screwdrivers, it's different programming techniques. 266 00:26:03.598 --> 00:26:10.469 And the idea is that when you've got nested loops, you look at which should be the outside loop; that's an important consideration. 267 00:26:10.469 --> 00:26:20.878 For example, whether you have too many read-modify-writes that are going to slow each other down; that's an important consideration, scatter versus gather. 268 00:26:20.878 --> 00:26:24.028 So, 269 00:26:24.773 --> 00:26:37.523 if you're programming in Thrust, which is this parallel library we'll see maybe next class, you've got a scatter operator and you've got a gather operator. You might think, maybe gather might be more efficient on a multithreaded machine. Okay. 270 00:26:37.523 --> 00:26:40.614 Now, here, we're going to see an idea which is a little subtle.
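Before that, here is roughly what the scatter version just described might look like as a CUDA kernel; this is a sketch with invented names, not the slide's actual code:

```cuda
// One thread per atom; every thread loops over the whole slice, so every
// accumulation needs an atomicAdd.
__global__ void potential_scatter_kernel(float *slice, int nx, int ny,
                                         float gridspacing, float z,
                                         const float *atoms, int numatoms)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's atom
    if (n >= numatoms) return;

    float ax = atoms[4*n], ay = atoms[4*n+1], az = atoms[4*n+2], q = atoms[4*n+3];
    float dz2 = (z - az) * (z - az);

    for (int j = 0; j < ny; j++) {
        float dy = gridspacing * j - ay;
        float dydz2 = dy * dy + dz2;
        for (int i = 0; i < nx; i++) {
            float dx = gridspacing * i - ax;
            // Every atom-thread hits the same grid words in the same order,
            // so these atomics serialize badly -- the point of the slide.
            atomicAdd(&slice[j*nx + i], q * rsqrtf(dx*dx + dydz2));
        }
    }
}
```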
271 00:26:40.919 --> 00:26:47.068 So, some information here. 272 00:26:47.068 --> 00:26:51.419 So each thread executes this function. 273 00:26:51.419 --> 00:26:54.538 It's cenergy; you've got this code 274 00:26:54.538 --> 00:26:59.699 in here that's executed by each thread in parallel. 275 00:26:59.699 --> 00:27:08.278 And it's the same code, so the threads are in parallel computing the same information. 276 00:27:08.278 --> 00:27:12.148 So, 277 00:27:12.148 --> 00:27:15.479 you might ask yourself, is this a problem? 278 00:27:15.479 --> 00:27:20.699 Now, if you had a sequential machine, a serial machine, 279 00:27:20.699 --> 00:27:26.729 then the outer loop is calling cenergy many times, 280 00:27:26.729 --> 00:27:32.459 and inside that, the same code is getting re-executed. On a sequential machine 281 00:27:32.459 --> 00:27:35.578 you'd want to worry about that. Well, 282 00:27:35.578 --> 00:27:39.959 I would look at that and say, okay, I have a nice optimizing compiler, 283 00:27:39.959 --> 00:27:46.469 and the worry is a reasonable thing; you know, it's going to slow down a sequential machine. On a parallel machine, 284 00:27:46.469 --> 00:27:51.598 so what if 5,000 threads compute the same thing? 285 00:27:51.598 --> 00:27:55.439 You might say, so what, and that could be a reasonable answer. 286 00:27:55.439 --> 00:28:00.659 So it's redundantly computed for every thread, but 287 00:28:01.949 --> 00:28:14.009 maybe computing it once wouldn't help, because the threads are running in parallel anyway. So, what I'm pointing out is that this is repeated computation of the same thing, but 288 00:28:14.009 --> 00:28:17.548 maybe it's not a problem on a parallel machine. 289 00:28:17.548 --> 00:28:22.169 So, you know, don't automatically think you have to fix it. 290 00:28:22.169 --> 00:28:25.169 In any case, they're bringing this thing up here. 291 00:28:25.169 --> 00:28:28.259 So. 292 00:28:28.259 --> 00:28:35.818 Now, one point is, again, let's talk about this slide before I go to the next slide. 293 00:28:35.818 --> 00:28:39.749 So, again, this function cenergy is being run 294 00:28:39.749 --> 00:28:48.929 on each thread in parallel. You've got these local variables: floats, the z squared term, the slice offset, 295 00:28:48.929 --> 00:28:53.189 the distance squared, the atom's charge and so on. 296 00:28:53.189 --> 00:28:59.848 These by default are in registers, their fast memory that's private to the 297 00:28:59.848 --> 00:29:04.919 thread. And again, a thread has up to 255, give or take, 298 00:29:04.919 --> 00:29:11.818 registers. Now there is an issue: each thread's 299 00:29:11.818 --> 00:29:15.689 version of, say, dz is 300 00:29:18.058 --> 00:29:23.429 stored separately, so that is wasted register storage, and you don't have a lot of registers, you know, 301 00:29:23.429 --> 00:29:27.088 64K registers total across all the concurrent threads. 302 00:29:28.229 --> 00:29:31.769 So that could be an argument to 303 00:29:31.769 --> 00:29:40.679 take this stuff and compute it once and put it in shared memory, because shared memory is visible to the whole thread block.
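One common way the shared-memory idea plays out in this kind of kernel is to stage a tile of the constant atom data per block, so the block reads it from fast per-block memory instead of each thread re-reading global memory. This sketch is on the gather formulation, not the slide's code; TILE, the one-dimensional block per lattice row, and all names are assumptions:

```cuda
#define TILE 256                     // atoms staged per pass (assumption)

__global__ void gather_tiled(float *slice, int nx, float gridspacing, float z,
                             const float4 *atoms, int numatoms)
{
    __shared__ float4 s_atoms[TILE];                 // shared by the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's grid x
    float x = gridspacing * i, y = gridspacing * blockIdx.y;
    float sum = 0.0f;

    for (int base = 0; base < numatoms; base += TILE) {
        // cooperative load: the block fills the tile once
        for (int t = threadIdx.x; t < TILE && base + t < numatoms; t += blockDim.x)
            s_atoms[t] = atoms[base + t];
        __syncthreads();
        int limit = min(TILE, numatoms - base);
        for (int a = 0; a < limit; a++) {
            float dx = x - s_atoms[a].x, dy = y - s_atoms[a].y, dz = z - s_atoms[a].z;
            sum += s_atoms[a].w * rsqrtf(dx*dx + dy*dy + dz*dz);
        }
        __syncthreads();                             // done with this tile
    }
    if (i < nx) slice[blockIdx.y * nx + i] = sum;
}
```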
304 00:29:41.663 --> 00:29:54.473 And so there's an argument, perhaps, instead of having it in registers that are private to the thread, to put it in shared memory that's global to the thread block, or rather private to the thread block. And shared memory is very fast, but you don't have very much; 305 00:29:54.473 --> 00:29:57.443 it's 48K bytes or something for the whole 306 00:29:58.199 --> 00:30:04.709 thread block. So that could be an idea, if you look at this in your optimizing state of mind. 307 00:30:04.709 --> 00:30:11.969 So you, first of all, put it in shared memory, and then maybe 308 00:30:11.969 --> 00:30:19.378 compute it once and then sync or something. So, okay, this is what you would be thinking before you look at the next slide, perhaps. 309 00:30:19.378 --> 00:30:23.909 And then you've got the loop in here, and then the loop inside the loop. 310 00:30:23.909 --> 00:30:28.409 Okay, the thread's going to have nested loops; you've got no problem there. 311 00:30:28.409 --> 00:30:38.638 And there's really not going to be any thread divergence, probably; maybe one thread does more iterations than another, but probably not. 312 00:30:38.638 --> 00:30:42.358 So, the thing works quite nicely on a 313 00:30:42.358 --> 00:30:46.739 single-program multiple-data 314 00:30:46.739 --> 00:30:52.679 architecture. By the way, the difference between SIMD and SPMD: 315 00:30:52.679 --> 00:30:57.929 SIMD, single instruction multiple data, is that all of the 316 00:30:57.929 --> 00:31:03.538 threads are in lock step, so the threads in a warp are SIMD. 317 00:31:03.538 --> 00:31:07.318 The warps in a block are SPMD. 318 00:31:07.318 --> 00:31:15.808 The warps in a block are running the same program, but they could be at different points in the program; the different warps are not synced with each other unless you do a __syncthreads. 319 00:31:15.808 --> 00:31:20.818 So that's why they've added a new term, single program multiple data. 320 00:31:20.818 --> 00:31:25.199 So your 1,000 threads in the block are running the same program, 321 00:31:25.199 --> 00:31:32.489 but the 1,000 threads form 32 warps, and the 32 warps could be at 32 different places in that common program. 322 00:31:32.489 --> 00:31:37.019 And quite commonly some of the warps 323 00:31:37.019 --> 00:31:45.388 might not be started yet and others might have finished. That's SPMD versus SIMD. Okay, so you look at this and you might say, we will 324 00:31:45.388 --> 00:31:50.788 optimize our hardware resources if we take this stuff, perhaps, and put it in shared memory. 325 00:31:51.989 --> 00:32:01.499 Okay, and again, if you're writing a small program, who cares; obviously your time is worth something. 326 00:32:01.499 --> 00:32:05.548 But, you know, if you're writing something really big, then 327 00:32:05.548 --> 00:32:11.068 your time is worth less than the computer's time eventually, at some point. So you want to optimize. 328 00:32:11.068 --> 00:32:17.038 Okay, looking at the same program further down, and this is our 329 00:32:17.038 --> 00:32:23.249 scatter operation, and it's the read-modify-write. You see, you've got the plus-equals here, 330 00:32:23.249 --> 00:32:27.538 and that line has to be 331 00:32:27.538 --> 00:32:31.378 atomic. So, 332 00:32:31.378 --> 00:32:36.358 that's the interesting thing down there. 333 00:32:36.358 --> 00:32:44.308 Here's a graph that shows what I mean by the scatter.
334 00:32:44.308 --> 00:32:48.689 So, we've got our input, that's the atoms. 335 00:32:48.689 --> 00:32:53.638 We've got the threads that do the work, and the output, that's the lattice points. 336 00:32:53.638 --> 00:32:58.888 And the output from each thread is scattered over all the lattice points. Okay. 337 00:33:00.179 --> 00:33:04.019 And for input, each thread reads one input, 338 00:33:04.019 --> 00:33:11.068 from one atom, and its output is scattered across all of the lattice points. So, 339 00:33:14.128 --> 00:33:23.608 if you look at this and start thinking, the read-modify-write backlogs because of that could be worse than you would think, because 340 00:33:23.608 --> 00:33:29.578 you're in a loop, for each thread, iterating over the lattice. So basically 341 00:33:29.578 --> 00:33:33.598 all the threads are going to bang on this first word in the lattice, 342 00:33:33.598 --> 00:33:39.568 and then all the threads are going to write to the second word in the lattice, and then all the threads are going to write to the third word in the lattice. 343 00:33:39.568 --> 00:33:50.009 You see a problem here: this program is particularly bad for causing backlogs from that read-modify-write. So, 344 00:33:50.009 --> 00:33:57.388 if you did want to do a scatter, you might want to think of some way so that all the threads are not trying to write to the same 345 00:33:57.388 --> 00:34:04.259 lattice point at the same time. So you start the different threads off at different lattice points or something, 346 00:34:04.259 --> 00:34:09.239 something like that, maybe. Okay. 347 00:34:10.619 --> 00:34:14.278 So, that's the scatter. 348 00:34:15.509 --> 00:34:20.068 Um. 349 00:34:20.068 --> 00:34:24.509 And so here, they're talking about another thing. 350 00:34:24.509 --> 00:34:29.998 You could do this: if the atom's really far from the lattice point, don't do anything, 351 00:34:31.224 --> 00:34:44.994 don't try to update it. This improves your read-modify-write situation, because you're not writing to all the lattice points: if the lattice point is too far away, don't write to it, and that will help your program. But this will cause thread divergence in your code, 352 00:34:44.994 --> 00:34:49.494 of course, since different threads will be doing different work at that point; some will want to write and some will not 353 00:34:49.768 --> 00:34:55.378 need to write. So they talk about stuff like that. 354 00:34:55.378 --> 00:34:58.829 Inputs, as before, are the irregular atom positions, and 355 00:34:58.829 --> 00:35:03.509 outputs are the regular lattice. 356 00:35:03.509 --> 00:35:13.739 Okay, so the scatter concept: you iterate over the atoms, and each atom scatters output to 357 00:35:13.739 --> 00:35:19.378 the whole lattice, or scatters output to one slice of the lattice if you iterate over slices. 358 00:35:22.079 --> 00:35:25.168 It follows the sequential version, so it's easy. 359 00:35:25.168 --> 00:35:32.489 But this atomicAdd is going to kill your performance. 360 00:35:32.489 --> 00:35:36.568 Their conclusion is: don't do this in parallel. 361 00:35:38.309 --> 00:35:44.099 But by the way, thinking ahead, we'll talk about term projects at some point. 362 00:35:45.389 --> 00:35:50.159 You know, just because they say this, you don't have to believe them. 363 00:35:50.159 --> 00:35:53.639 So, if you're the sort of person who is skeptical towards authority, 364 00:35:53.639 --> 00:35:59.068 you might actually implement this and see if this is true. Okay, do an experiment and
365 00:35:59.068 --> 00:36:02.278 see if this is as bad as they say it is. 366 00:36:02.278 --> 00:36:06.088 Yeah, that would make a fun term project. 367 00:36:06.088 --> 00:36:09.838 And again, you could then make the project more sophisticated: 368 00:36:09.838 --> 00:36:21.509 the simple version 0 is really bad, but in version 1 you start thinking of some things, like you don't have all the threads banging on the same lattice points simultaneously; you rotate things a little somehow. 369 00:36:21.509 --> 00:36:24.958 You might actually mitigate this problem. Who knows? 370 00:36:24.958 --> 00:36:28.168 A project idea. Okay. 371 00:36:28.168 --> 00:36:34.289 That was the first lecture segment. 372 00:36:38.909 --> 00:36:42.449 Okay, part 2. 373 00:36:45.059 --> 00:36:50.818 Now, we're continuing on this calculation. 374 00:36:53.369 --> 00:36:56.458 Oh, fancy, up here. Okay. 375 00:36:58.168 --> 00:37:02.789 Okay, so now this is the second method. This will be the gather method. 376 00:37:02.789 --> 00:37:07.768 The gather method is: the outermost iteration iterates over lattice points, 377 00:37:07.768 --> 00:37:12.418 and for each lattice point, you then iterate over the atoms. 378 00:37:12.418 --> 00:37:16.259 So, it's a 2D thing they're doing. 379 00:37:16.259 --> 00:37:22.079 The outermost things are grid points, and then the innermost loop here, 380 00:37:22.079 --> 00:37:32.878 we're iterating over atoms, and for each grid point we gather the information from all the atoms; we gather the potentials. 381 00:37:32.878 --> 00:37:38.369 Our read-modify-write is to a local variable, so that's local to the thread, private to the thread. 382 00:37:38.369 --> 00:37:41.759 So then we get the total 383 00:37:41.759 --> 00:37:49.768 contribution to the energy, and then we do one read-modify-write to update 384 00:37:49.768 --> 00:37:54.509 that lattice point. 385 00:37:54.509 --> 00:37:58.739 The reason that last line might want to be 386 00:37:58.739 --> 00:38:02.458 locked, as an atomic operation, is we might have 387 00:38:02.458 --> 00:38:06.659 multiple blocks or something if we've got lots and lots of threads, 388 00:38:06.659 --> 00:38:13.380 and so this is to prevent the blocks from stepping on each other's toes. They don't show that yet, but this last line I'd 389 00:38:13.380 --> 00:38:21.719 make atomic. Okay, so this has inverted what's in and what's out in the nested loops; this is gathering information from the atoms 390 00:38:21.719 --> 00:38:27.659 for each grid point. So, 391 00:38:29.340 --> 00:38:33.719 okay, here's a design decision we've mentioned before: 392 00:38:33.719 --> 00:38:42.210 you've got lots of threads, thousands, tens of thousands of threads, maybe 100,000 threads; I haven't personally tried it, that could be another project. 393 00:38:42.210 --> 00:38:45.269 But any one thread block, 394 00:38:45.269 --> 00:38:50.849 on the fast GPU on parallel, any one thread block 395 00:38:50.849 --> 00:38:54.510 can have only 1,024 threads. 396 00:38:55.889 --> 00:38:58.980 So, we're doing a 2D problem here, 397 00:38:58.980 --> 00:39:04.260 to make it easier to draw; so your lattice 398 00:39:04.260 --> 00:39:09.150 is whatever, maybe, you know, 200 by 200 or 399 00:39:09.150 --> 00:39:17.789 1,000 by 1,000. So you've got to partition the lattice points into thread blocks, because one lattice point is one thread.
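As a sketch, assuming the same invented names as before, the gather version and a possible launch configuration might look like this; the slide's real kernel differs in detail, and if several kernels could update the same slice entry, that final store would become an atomicAdd as just mentioned:

```cuda
// One thread per lattice point; inner loop over all atoms; accumulation in a
// private register; one global write at the end.
__global__ void potential_gather_kernel(float *slice, int nx, int ny,
                                        float gridspacing, float z,
                                        const float *atoms, int numatoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // grid x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // grid y index
    if (i >= nx || j >= ny) return;

    float x = gridspacing * i, y = gridspacing * j;
    float sum = 0.0f;                                // private to this thread
    for (int n = 0; n < numatoms; n++) {             // gather from every atom
        float dx = x - atoms[4*n], dy = y - atoms[4*n+1], dz = z - atoms[4*n+2];
        sum += atoms[4*n+3] * rsqrtf(dx*dx + dy*dy + dz*dz);
    }
    slice[j*nx + i] = sum;   // only write; no atomics, and adjacent threads
                             // in x write adjacent words (coalesced)
}

// Possible launch: 16x16 = 256 threads per block (an illustrative choice).
dim3 block(16, 16);
dim3 grid((nx + 15) / 16, (ny + 15) / 16);
potential_gather_kernel<<<grid, block>>>(d_slice, nx, ny, gridspacing, z,
                                         d_atoms, numatoms);
```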
400 00:39:17.789 --> 00:39:24.570 You've got this big lattice; you want to partition it into blocks of about 1,000 threads. 401 00:39:25.619 --> 00:39:29.610 Now, you could... that was weird, one second here. 402 00:39:29.610 --> 00:39:42.360 You could say each row; you could do it literally row, row, row, block, block, block. But what they're saying is, it might be better, say, to have a 2D organization like this. So, some 2D tile of 403 00:39:42.360 --> 00:39:52.230 lattice points is one block; whatever, you know, it's a design decision you want to make, how do you partition. 404 00:39:52.230 --> 00:40:01.139 When you've got more threads than fit in a block, sorry, some auto-timeout thing here; when you've got more threads than fit in a block, how do you 405 00:40:01.139 --> 00:40:04.800 group, bundle, the threads into blocks? 406 00:40:04.800 --> 00:40:08.969 Now, how this would affect performance 407 00:40:08.969 --> 00:40:15.840 is that, again, inside each block you've got shared memory that's available to all the threads in the block. 408 00:40:15.840 --> 00:40:20.280 So, perhaps you can put some information 409 00:40:20.280 --> 00:40:23.969 in that fast shared memory and work on it. 410 00:40:23.969 --> 00:40:31.260 So, that's why some organizations of thread blocks might be faster than others, because 411 00:40:32.340 --> 00:40:39.480 of what you're putting in the shared memory. So, okay. 412 00:40:39.480 --> 00:40:45.059 In any case, this is the gather thing. 413 00:40:45.059 --> 00:40:50.969 The input here is the atoms, which are spaced irregularly, 414 00:40:50.969 --> 00:40:55.530 and the threads do the work, and the output is the potentials at the lattice points. 415 00:40:55.530 --> 00:41:02.460 So thread 1 computes the potential at lattice point 1, thread 2 at lattice point 2. So, 416 00:41:02.460 --> 00:41:06.630 the threads are gathering information from all the atoms. 417 00:41:09.804 --> 00:41:22.735 I don't know, if it becomes too annoying I'll figure out how to stop the timing thing, but at the moment, okay. So you look at this, it's going to go nicely; there's no read-modify-write issue on the output. 418 00:41:22.980 --> 00:41:26.730 And even better, 419 00:41:26.730 --> 00:41:31.139 adjacent threads are writing to adjacent output words in memory. 420 00:41:31.139 --> 00:41:34.409 Okay, this is getting... 421 00:41:34.409 --> 00:41:38.880 Let's see here. Okay. 422 00:41:42.059 --> 00:41:51.750 Yes, oh, I know what my problem was. Okay. So, look at this: adjacent threads write to adjacent output words; that's coalescing. 423 00:41:51.750 --> 00:41:56.070 Coalescing is good. Okay. 424 00:41:56.070 --> 00:42:09.420 And the inputs, well, again, look at it and think: all the threads read input word 1, and then all the threads read input word 2, then all the threads read input word 3. 425 00:42:09.420 --> 00:42:14.610 Think how that hits the memory; that's good also. Okay. 426 00:42:14.610 --> 00:42:21.389 You just look at this: it works, it plays nicely with how the GPU is designed. 427 00:42:21.389 --> 00:42:26.159 It's efficient on reading and it's efficient on writing. 428 00:42:26.159 --> 00:42:29.369 Cool. So you like the gather paradigm. 429 00:42:30.869 --> 00:42:34.230 You ought to like it; well, I don't know if you do. 430 00:42:35.429 --> 00:42:39.480 Okay, so here's the executive summary: 431 00:42:39.480 --> 00:42:46.380 you iterate over grid points, that's the outer thing, and then inside, each thread does one grid point.
432 00:42:46.380 --> 00:42:51.690 Okay, and then each thread computes some information about its grid point, 433 00:42:51.690 --> 00:42:56.250 such as which grid point it is, which depends on the thread number and the block number. 434 00:42:56.250 --> 00:43:01.409 And then, at the bottom half of the page, it 435 00:43:01.409 --> 00:43:05.789 iterates over the atoms, and again, 436 00:43:05.789 --> 00:43:08.969 all the threads are doing the same loop. 437 00:43:08.969 --> 00:43:14.429 Good, so it's just a program that's going to run nicely on the GPU. 438 00:43:15.599 --> 00:43:20.880 They're talking about that there. 439 00:43:20.880 --> 00:43:25.889 It works, and there are coalesced writes: adjacent threads are writing adjacent 440 00:43:25.889 --> 00:43:34.980 elements in memory most of the time. So a coalesced write means we're accessing the 441 00:43:34.980 --> 00:43:40.800 global memory in chunks of, I don't know, 32 words perhaps; 442 00:43:40.800 --> 00:43:48.179 it's a design decision. So if 32 threads are going to be writing adjacent 32 words of the global memory, 443 00:43:48.179 --> 00:43:51.570 those 32 writes get coalesced into one write, 444 00:43:51.570 --> 00:43:54.659 one access out to the global memory, 445 00:43:54.659 --> 00:43:58.500 and that access is 32 words, 128 bytes. 446 00:43:58.500 --> 00:44:01.500 That works nicely. 447 00:44:01.500 --> 00:44:04.650 Okay, so the gather is good. 448 00:44:04.650 --> 00:44:11.190 Good. Okay, so what they're saying is: what was the fast sequential 449 00:44:11.190 --> 00:44:14.760 program design was not the fast parallel program design. 450 00:44:14.760 --> 00:44:18.329 And that is the lesson from this one. 451 00:44:20.670 --> 00:44:24.360 Okay. 452 00:44:27.059 --> 00:44:33.239 Compute efficiency, total number of cycles, 453 00:44:35.130 --> 00:44:43.710 matters less; your metric for parallel programs is almost always elapsed wall-clock time. 454 00:44:43.710 --> 00:44:49.230 So, if your warp is sort of spinning, 455 00:44:49.230 --> 00:44:59.070 waiting for data, it's not executing any instructions; you don't care about that, you care that it's waiting. 456 00:44:59.070 --> 00:45:05.579 And it will wait less if you work the cache better and so on. 457 00:45:05.579 --> 00:45:10.440 And they mention here also that 458 00:45:10.440 --> 00:45:16.590 for reasonable sizes you've got more lattice points than you've got atoms, maybe say 20 times more. 459 00:45:16.590 --> 00:45:21.480 And iterating through the big lattice array 460 00:45:21.480 --> 00:45:26.909 as the outside loop is better than iterating through the atoms, as they mention. So you want to think about these design issues. 461 00:45:26.909 --> 00:45:30.119 Okay. 462 00:45:30.119 --> 00:45:33.869 The issue is how you lay out and sweep through the data, again; 463 00:45:33.869 --> 00:45:38.969 play nicely with your cache. That's the lesson from here. 464 00:45:40.139 --> 00:45:44.010 Do do do do do. 465 00:45:46.050 --> 00:45:50.789 And in the sequential program they play games with temporary arrays, 466 00:45:50.789 --> 00:46:01.769 pre-calculate stuff, I don't know. Okay, the pre-computation is: you identify common things, hoist them out of the loop, and do them once. 467 00:46:01.769 --> 00:46:06.840 Yeah, I find it hard to get interested in that sort of thing. 468 00:46:06.840 --> 00:46:11.130 Again, in many cases, the compiler will do it for you.
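To illustrate the coalescing point from a moment ago, here is a fragment that would sit inside some kernel; the indexing and names are assumed, not the slide's:

```cuda
int t = blockIdx.x * blockDim.x + threadIdx.x;
float v = in[t];          // coalesced read: thread t reads word t
out[t] = v;               // coalesced write: thread t writes word t
// By contrast, a strided pattern such as out[t * 32] = v; breaks a warp's
// 32 accesses into many separate memory transactions instead of one.
```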
469 00:46:11.130 --> 00:46:17.010 Okay, the coarsening thing is: on a sequential program there's the concept of unrolling a loop, 470 00:46:17.010 --> 00:46:21.269 so you do several iterations 471 00:46:21.269 --> 00:46:25.409 of the index variable in one iteration of the loop. 472 00:46:25.409 --> 00:46:30.269 So the loop executes fewer times, but each time does more work. You can do that in parallel: 473 00:46:30.269 --> 00:46:35.789 as I mentioned a few minutes ago, each thread could do several lattice points instead of one lattice point. 474 00:46:35.789 --> 00:46:39.360 Why this would help is that 475 00:46:39.360 --> 00:46:48.480 if the threads are sharing some computed information, it may use fewer registers in total, and registers are a scarce resource. They don't talk about that here, but that 476 00:46:48.480 --> 00:46:56.849 would be the big thing. Like I mentioned, each thread has its private registers, so you want to reduce the total; you either put stuff in shared memory, 477 00:46:56.849 --> 00:47:02.039 making your program a little harder, or if you coarsen the threads, the total number of registers 478 00:47:02.039 --> 00:47:06.420 is fewer, because the stuff you put in a register is used more. So 479 00:47:06.420 --> 00:47:12.329 it's related to this concept of loop unrolling, which good compilers do nowadays also. 480 00:47:12.329 --> 00:47:16.980 Okay. 481 00:47:18.329 --> 00:47:25.230 NVIDIA now has some tools to work with fractions of a warp, 482 00:47:25.230 --> 00:47:32.730 and that's in the most recent versions of CUDA. I'm not talking about that; well, I don't fully understand it, so 483 00:47:32.730 --> 00:47:38.699 I've got some things, but okay. And there's the padding around the edge. 484 00:47:38.699 --> 00:47:45.750 Okay, and 485 00:47:46.980 --> 00:47:52.800 so, they're picking, for their coarsening and for the partitioning of the lattice into blocks, 486 00:47:52.800 --> 00:47:58.380 what takes the greatest advantage of memory coalescing, 487 00:47:58.380 --> 00:48:03.840 coalescing operations on the global memory. They are 488 00:48:03.840 --> 00:48:07.199 computing temporary results and reusing them, perhaps. 489 00:48:07.199 --> 00:48:12.780 And again, realize that registers are a fixed resource. 490 00:48:12.780 --> 00:48:18.750 That's the information content, and the cost for this is your program got messier. 491 00:48:18.750 --> 00:48:25.889 Now, if you use a higher-level tool above this, 492 00:48:25.889 --> 00:48:35.670 such as Thrust, in Thrust you're programming in terms of, I mean, the functions are called things like scatter and gather 493 00:48:35.670 --> 00:48:39.510 and transform and so on. 494 00:48:39.510 --> 00:48:45.179 So the hope is that if you use a higher-level tool, some of this stuff is done for you by 495 00:48:45.179 --> 00:48:49.679 the library. And you have a choice 496 00:48:49.679 --> 00:48:59.670 of a number of different tools at different levels. The first one we'll talk about, there's nothing about threads and warps; it's functional programming, map-reduce style programming. 497 00:48:59.670 --> 00:49:03.000 It's higher level, but maybe you lose a touch of efficiency. 498 00:49:03.000 --> 00:49:15.929 There are these intermediate things that the energy lab people recommend if you want to worry about some more detailed things, but not down at this level. One of the intermediate tools is called Kokkos; 499 00:49:17.039 --> 00:49:21.090 it will give you more control but is still fairly easy.
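Going back to the coarsening idea from the top of this segment, here is a sketch of the gather kernel with each thread doing several lattice points; the factor of 4 and all names are illustrative choices, not the slide's:

```cuda
#define COARSEN 4   // lattice points per thread (assumption)

__global__ void potential_gather_coarsened(float *slice, int nx, int ny,
                                           float gridspacing, float z,
                                           const float *atoms, int numatoms)
{
    int i0 = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
    int j  =  blockIdx.y * blockDim.y + threadIdx.y;
    if (j >= ny) return;

    float y = gridspacing * j;
    float sum[COARSEN] = {};                          // zero-initialized

    for (int n = 0; n < numatoms; n++) {
        float ax = atoms[4*n], dy = y - atoms[4*n+1], dz = z - atoms[4*n+2];
        float dydz2 = dy*dy + dz*dz;                  // computed once,
        for (int c = 0; c < COARSEN; c++) {           // reused COARSEN times
            float dx = gridspacing * (i0 + c) - ax;
            sum[c] += atoms[4*n+3] * rsqrtf(dx*dx + dydz2);
        }
    }
    for (int c = 0; c < COARSEN; c++)
        if (i0 + c < nx) slice[j*nx + i0 + c] = sum[c];
}
```

One trade-off not shown: with consecutive points per thread the writes are no longer perfectly coalesced; real coarsened codes often stride the extra points by the block width instead.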
500 00:49:21.090 --> 00:49:25.440 So maybe you don't want to worry about this low-level stuff too often. 501 00:49:25.440 --> 00:49:30.599 Okay, in any case, the point of these two slide decks was 502 00:49:30.599 --> 00:49:34.289 to show the example of the potential calculation, 503 00:49:34.289 --> 00:49:41.940 which is specifically an important example, but generally teaches you tools for parallel programming in CUDA. 504 00:49:41.940 --> 00:49:45.329 Any questions about that? 505 00:49:45.329 --> 00:49:48.869 Okay. 506 00:49:51.150 --> 00:49:54.809 Let's continue on. 507 00:49:54.809 --> 00:49:59.760 Parallel thinking. 508 00:50:00.929 --> 00:50:06.900 Again, I think this one goes fast. 509 00:50:08.400 --> 00:50:17.369 Just some general parallel thinking things; I may go through this one fast. 510 00:50:19.590 --> 00:50:26.039 Oh, what's the content on this one? The second paragraph here. 511 00:50:27.599 --> 00:50:33.449 So, you've got different goals for parallel computing. 512 00:50:33.449 --> 00:50:37.829 Strong scaling is: you can do the same program faster. 513 00:50:39.179 --> 00:50:43.590 Weak scaling is: you can do a bigger program in the same time. 514 00:50:44.610 --> 00:50:48.960 And then, of course, a better solution. So, 515 00:50:50.579 --> 00:50:56.309 you want to do weather prediction. Okay, the weather people tell us there's going to be a storm tomorrow; 516 00:50:56.309 --> 00:51:01.800 that's what they said a few hours ago; I haven't checked the weather report lately, they may have changed their mind. 517 00:51:01.800 --> 00:51:05.789 They do. Okay, so you want to 518 00:51:05.789 --> 00:51:15.090 look at the problem of weather. Now, you buy a faster computer, you compute the weather in less time; that's strong scaling. 519 00:51:15.090 --> 00:51:19.500 Weak scaling is: you buy a better computer, you can 520 00:51:19.500 --> 00:51:23.219 compute the weather approximation on a finer grid. 521 00:51:23.219 --> 00:51:28.530 So, they decompose the atmosphere into this grid; it's surprisingly coarse. 522 00:51:28.530 --> 00:51:33.179 So you make the grid finer, more grid cells, 523 00:51:33.179 --> 00:51:36.300 but, you know, compute it in the same time. 524 00:51:36.300 --> 00:51:42.570 That's weak scaling. So, okay. And this thing here, 525 00:51:42.570 --> 00:51:51.300 advancing science, qualitatively new stuff: you do the potential problem faster, 526 00:51:51.300 --> 00:51:54.690 you know, design a new drug, let's say. 527 00:51:54.690 --> 00:51:58.110 That's advancing science. 528 00:51:59.460 --> 00:52:03.780 And nothing interesting there. 529 00:52:03.780 --> 00:52:13.230 Okay, so here it's introducing the topic that everyone else loves in parallel computing, which is message passing. 530 00:52:13.230 --> 00:52:16.409 I personally think it's much overrated. 531 00:52:16.409 --> 00:52:20.039 And I'm willing to debate it with anyone; no one ever takes me up on it. 532 00:52:20.039 --> 00:52:24.929 So, you have lots of small nodes that 533 00:52:24.929 --> 00:52:28.260 do not have shared memory; you pass messages back and forth. 534 00:52:29.460 --> 00:52:34.889 Again, that's nice if you have to, but I've got 40 gigabytes of memory on my machine; 535 00:52:34.889 --> 00:52:41.460 I don't have problems so big that I have to pass messages. 536 00:52:41.460 --> 00:52:48.510 Okay, and you've got humongous latency when you pass a message. 537 00:52:48.510 --> 00:52:52.019 Oh, they mention some alternatives here. Okay, we've seen
538 00:52:52.019 --> 00:52:56.550 CUDA now, and we've seen OpenMP. OpenCL 539 00:52:56.550 --> 00:53:01.139 is a competitor; OpenCL is somewhat like CUDA, 540 00:53:01.139 --> 00:53:07.500 but it's more agnostic, it's not specifically targeted to NVIDIA, 541 00:53:07.500 --> 00:53:12.840 and it has a little different design philosophy. But also, since it's less targeted to NVIDIA, 542 00:53:12.840 --> 00:53:19.260 it doesn't work as well on NVIDIA. So in designing this course, and I created this course, 543 00:53:19.260 --> 00:53:26.849 I made a design decision to talk about NVIDIA, because NVIDIA is the big thing in parallel computing today. 544 00:53:26.849 --> 00:53:31.139 In five years they could be bankrupt; I've seen it happen. 545 00:53:31.139 --> 00:53:36.480 A really big company, they get big because they're competent and hardworking, 546 00:53:36.480 --> 00:53:42.989 and then they become arrogant and they do everything wrong, and you'd be surprised how fast a company can vanish 547 00:53:42.989 --> 00:53:48.389 when they get foolish. And it could happen to NVIDIA, but it hasn't happened yet. 548 00:53:48.389 --> 00:53:53.219 So. 549 00:53:53.219 --> 00:53:59.699 Okay, nothing interesting there. So, 550 00:54:01.135 --> 00:54:15.804 well, an example: when I was a grad student, Digital Equipment Corporation, DEC, was the standard company for computer science departments. They grew to become the second biggest computer company after IBM. They had their annual conference one year on an ocean liner that 551 00:54:16.050 --> 00:54:20.130 was, I think, in Boston Harbor or something. 552 00:54:20.130 --> 00:54:29.639 And a couple of years later, they said UNIX was snake oil. The guy who built the company up, he had done everything right; 553 00:54:29.639 --> 00:54:33.329 I don't know what happened to him, but he started doing everything wrong in the company. 554 00:54:33.329 --> 00:54:38.849 And they also went from being an open model and friendly, to being very nasty about everything. 555 00:54:38.849 --> 00:54:45.630 If you pulled a memory chip out of one of their cards and put it into another card, they said that was illegal, let's say. 556 00:54:45.630 --> 00:54:52.320 If you tried to duplicate their terminal, their dumb terminal, they'd sue you for an IP violation, that sort of thing. 557 00:54:52.320 --> 00:54:57.869 The cost of the memory chip varied by a factor of 15, depending on what it was plugged into, 558 00:54:57.869 --> 00:55:01.619 and you couldn't clone it because the bus interface was patented. 559 00:55:01.619 --> 00:55:11.909 So the way third parties did memories: they'd buy cheap, obsolete components, pull the memory out, and put it into the new expensive DEC computers, and DEC would say that's illegal. 560 00:55:11.909 --> 00:55:17.550 Okay, so people just stopped buying DEC stuff. 561 00:55:17.550 --> 00:55:21.090 That solved the problem. Okay, in any case. 562 00:55:21.090 --> 00:55:25.260 Synchronization: nothing interesting here. 563 00:55:25.260 --> 00:55:32.909 The last line is relevant; it's sort of a problem for you guys, the programmers. 564 00:55:32.909 --> 00:55:37.050 If you don't synchronize enough, you get fast answers that are wrong; 565 00:55:38.099 --> 00:55:43.559 synchronize too much, it gets slow. Okay. So, 566 00:55:45.570 --> 00:55:52.710 what they're talking about here: different models for parallel computing. Okay, single program multiple data, that's the
567 00:55:52.710 --> 00:55:56.610 CUDA, NVIDIA thing. The master-worker thing 568 00:55:56.610 --> 00:56:00.119 is that, um, basically 569 00:56:00.119 --> 00:56:07.530 the boss, um, the professor/grad-student model, whatever you want to call it: the boss, 570 00:56:07.530 --> 00:56:15.269 um, partitions out parallel problems and assigns them to the workers. In this case the workers are, 571 00:56:15.269 --> 00:56:24.960 just like in an office, and the workers do the work. Loop parallelism is, again, you have a for loop, a do loop, pick your language, and 572 00:56:24.960 --> 00:56:29.190 the parallelism is over the iterations of the loop, if the iterations don't affect each other. 573 00:56:29.190 --> 00:56:35.400 Um, and fork-join, this is pthreads. 574 00:56:35.400 --> 00:56:39.570 You know what it means. So, different ways of looking at stuff. 575 00:56:41.429 --> 00:56:46.920 The idea being that if your application fits one of these models, 576 00:56:46.920 --> 00:56:50.670 it can then call on a lot of prior art to help you. 577 00:56:50.670 --> 00:56:55.800 So if it's loop parallelism, hey, 578 00:56:55.800 --> 00:57:00.510 put on OpenMP and it will, 579 00:57:00.510 --> 00:57:08.130 it will do the work for you, that sort of thing. And different models, um: just shared data, this common fast data; 580 00:57:08.130 --> 00:57:18.809 that doesn't scale up, of course, with too many things. So on parallel I've got 256 gigabytes of memory and it's got the dual 24-core, 581 00:57:18.809 --> 00:57:22.079 never mind exactly however many cores 582 00:57:22.079 --> 00:57:26.190 it has, so you can have, give or take, um, 583 00:57:26.190 --> 00:57:30.480 40 or 80 threads all accessing the same data. 584 00:57:30.480 --> 00:57:36.269 Shared data, shared queue: you've got a queue of jobs that they pick jobs off of. 585 00:57:36.269 --> 00:57:41.670 With OpenMP, if you've got a lot of... suppose you iterate, do i equals 1 to a million, 586 00:57:41.670 --> 00:57:45.269 and suppose you have only 32 threads available. 587 00:57:45.269 --> 00:57:51.420 Well, then it will pick chunks of iterations off the queue and execute them; there are different ways it does it. So. 588 00:57:51.420 --> 00:57:57.570 Okay, they talk about that here. Um. 589 00:57:58.829 --> 00:58:04.739 The thing is that the different warps are executing at different times. So. 590 00:58:05.820 --> 00:58:15.480 Okay, um, and again, the reason the concept of a warp exists: you decode the instruction once and it gets executed 591 00:58:15.480 --> 00:58:22.679 on 32 threads' data, 32 threads; that's less space used for instruction decoding and so on. 592 00:58:22.679 --> 00:58:25.980 Okay, and, um, 593 00:58:25.980 --> 00:58:29.130 this is what I just told you about. 594 00:58:31.920 --> 00:58:35.460 Fork-join, you probably don't want to do it at too low a level, so, 595 00:58:37.440 --> 00:58:42.090 or not low enough. Well, these different things are here. So, OpenMP I spent a lot of time on. 596 00:58:42.090 --> 00:58:52.139 OpenACC is a little higher level than OpenMP. OpenACC does GPUs better; OpenMP for a long time did not handle GPUs, it only 597 00:58:52.139 --> 00:58:59.730 did CPUs, multicore. OpenMP for a long time did not do GPUs; recently 598 00:58:59.730 --> 00:59:03.150 it does. I haven't played with it myself, and I don't know how well it does it. 599 00:59:03.150 --> 00:59:08.760 OpenACC, I guess, is sort of intended to replace OpenMP; a little higher level.
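(A minimal sketch of that loop-parallelism and chunking idea, not from the slides; the chunk size of 1000 is just illustrative.)

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];

        /* A million iterations, maybe only a few dozen hardware threads:
           OpenMP hands out chunks of iterations to whichever thread is free. */
        #pragma omp parallel for schedule(dynamic, 1000)
        for (int i = 0; i < N; i++) {
            a[i] = (double)i * i;   /* iterations are independent, so this is safe */
        }

        printf("OpenMP may use up to %d threads\n", omp_get_max_threads());
        return 0;
    }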
600 00:59:08.760 --> 00:59:15.900 Well, that woke up the OpenMP people; I think the OpenMP people, they saw OpenACC starting to do stuff, 601 00:59:15.900 --> 00:59:19.349 and that woke them up. So then OpenMP started getting updated. 602 00:59:21.269 --> 00:59:27.239 TBB is an acronym for Threading Building Blocks; this is Intel's 603 00:59:27.239 --> 00:59:32.429 parallel toolkit, and so it's only for Intel, 604 00:59:32.429 --> 00:59:39.210 the multicore. And I haven't played with it, the Threading Building Blocks. 605 00:59:39.210 --> 00:59:42.539 It being specific to Intel, 606 00:59:42.539 --> 00:59:49.530 it might be more efficient. It's probably not worth your time, but you could look at that for a fun project. 607 00:59:49.530 --> 00:59:56.670 What I think happens with Intel, the same thing as the Intel compiler: 608 00:59:56.670 --> 01:00:01.380 it competes with GCC, and it has a goal of, uh, 609 01:00:01.380 --> 01:00:04.949 being compatible with GCC; that's a stated goal now. 610 01:00:04.949 --> 01:00:12.030 So, so, I think the Intel compiler people, you know, they try to make a good compiler that's, you know, 611 01:00:12.030 --> 01:00:17.309 better than GCC, and then that wakes up the GCC people, 612 01:00:17.309 --> 01:00:22.829 and they see Intel getting ahead of them and they improve GCC, 613 01:00:22.829 --> 01:00:30.690 which is good; I like competition. But the effect is that you don't need to worry about it; you figure GCC will probably catch up, 614 01:00:30.690 --> 01:00:37.559 not too long behind. So I'm thinking it would be the same thing here, because I'm thinking if TBB does anything really good, 615 01:00:37.559 --> 01:00:41.610 the OpenMP implementation on Intel will probably copy it. 616 01:00:41.610 --> 01:00:44.610 But I could be wrong; a project idea, so. 617 01:00:44.610 --> 01:00:48.210 Okay, C++ AMP I don't know anything about, so. 618 01:00:50.400 --> 01:00:59.909 Again, going back here, POSIX: so this is a standard here, though it's implemented in terms of the hardware on a particular machine, so it's already aware of 619 01:00:59.909 --> 01:01:06.179 the actual low-level threading stuff. So, okay. Um. 620 01:01:06.179 --> 01:01:13.320 The cool thing about looking at different ways to do a program in parallel: 621 01:01:13.320 --> 01:01:16.440 um, you know, 622 01:01:16.440 --> 01:01:21.420 you've got different jobs, each a different task, and you line them up separately, 623 01:01:21.420 --> 01:01:26.039 a parallel kernel, let's say; or parallelize over the data, 624 01:01:26.039 --> 01:01:29.159 parallelize over whatever, uh. 625 01:01:30.630 --> 01:01:35.789 So these are interesting things to look at, at the bottom; you probably know what most of them are. Um. 626 01:01:35.789 --> 01:01:45.150 Decomposition: so you've got your molecule, let's say; you could imagine some recursive, hierarchical grouping, say with an 627 01:01:45.150 --> 01:01:49.409 octree. Everyone knows what an octree is, sort of, or people do not. 628 01:01:49.409 --> 01:01:57.389 Okay, the point about this is you've got some irregular 3D geometric data. So the universe is a cube; 629 01:01:57.389 --> 01:02:00.780 you cut the cube into 8 sub-cubes. 630 01:02:00.780 --> 01:02:09.690 And then each sub-cube that has too much data, you partition it into sub-sub-cubes. So you've got a tree; each node of the tree has 8 kids, 631 01:02:09.690 --> 01:02:16.289 and then you do it again, and you partition the kids into 8 grandkids each, and so on, 632 01:02:16.289 --> 01:02:19.679 until one node is simple enough.
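(A minimal sketch of the octree subdivision just described, not from the slides; the types and the capacity threshold are made up for illustration, and the step that pushes each body into the right kid is only indicated by a comment.)

    #include <stdlib.h>

    typedef struct { float x, y, z, charge; } Body;   /* a star or an atom */

    typedef struct OctNode {
        struct OctNode *kid[8];   /* 8 sub-cubes; NULL if this node is a leaf */
        Body  *bodies;            /* bodies stored here if the node is a leaf */
        int    nbodies;
    } OctNode;

    /* Recursively subdivide until a node holds few enough bodies. */
    void subdivide(OctNode *node, int threshold) {
        if (node->nbodies <= threshold) return;   /* simple enough: stop */
        for (int k = 0; k < 8; k++) {
            node->kid[k] = calloc(1, sizeof(OctNode));
            /* ... here a real code would move each body into the kid
                   whose sub-cube contains it ... */
        }
        for (int k = 0; k < 8; k++)
            subdivide(node->kid[k], threshold);
    }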
633 01:02:19.679 --> 01:02:25.019 So it's a nice way to handle irregular spatial information. 634 01:02:25.019 --> 01:02:29.550 Like the Milky Way; you're doing galaxies. 635 01:02:29.550 --> 01:02:38.159 So the universe is a cube, and you cube it up; most of the sub-cubes will be empty, or have only, you know, one star. 636 01:02:38.159 --> 01:02:42.119 The ones with a lot of stuff, you subdivide them again. 637 01:02:42.119 --> 01:02:48.360 So it adapts the data structure to the data. That's good. 638 01:02:48.360 --> 01:02:55.260 What's bad is it's irregular, and in parallel computing that's a little more work. So that's, 639 01:02:55.260 --> 01:02:58.860 recursive data here, in the middle, is what we're talking about: 640 01:02:58.860 --> 01:03:03.480 the octree. And again, I've had arguments with people about it; 641 01:03:03.480 --> 01:03:07.050 I think it's not as good as people think it is. Um. 642 01:03:07.050 --> 01:03:12.929 It's hard to do some things; if the data is really, really, really irregular, an octree is not going to save you. 643 01:03:12.929 --> 01:03:17.940 And it's hard to do things like finding nearby elements, because a nearby 644 01:03:17.940 --> 01:03:23.460 star might be in a totally different part of the tree; you might have to go up many levels and over and down again. 645 01:03:23.460 --> 01:03:29.309 So, but it's popular, so. 646 01:03:29.309 --> 01:03:32.519 Okay, that's the recursive thing. 647 01:03:32.519 --> 01:03:36.750 Oh, now, something new NVIDIA has had the last couple of years. 648 01:03:36.750 --> 01:03:42.389 So the problem with this octree is some parts of the octree will have 649 01:03:42.389 --> 01:03:50.309 gone down more levels than other parts. So how do you, how do you map that onto CUDA? 650 01:03:50.309 --> 01:03:53.309 Um, well. 651 01:03:53.309 --> 01:03:57.539 What NVIDIA now has is that, 652 01:03:57.539 --> 01:04:01.920 inside your kernel, you can, you, 653 01:04:01.920 --> 01:04:10.619 the thread, one thread is picking one chunk of the tree; that thread can create a whole new kernel. 654 01:04:10.619 --> 01:04:13.710 So if you've got one thread for each 655 01:04:13.710 --> 01:04:17.010 kid at, say, generation 1 of the tree, 656 01:04:17.010 --> 01:04:21.210 if this thread has, say, a million 657 01:04:21.210 --> 01:04:27.630 stars in it, that one thread can create a whole new kernel to handle this cell. 658 01:04:28.769 --> 01:04:33.360 It's dynamic creation of kernels; there's going to be an overhead, but 659 01:04:33.360 --> 01:04:42.449 if this one thread determines it's just got too much work to do, this will be a win. You've got the overhead of creating a new kernel, and the whole new kernel has its, um, 660 01:04:42.449 --> 01:04:45.869 you know, thousands of threads maybe. So. 661 01:04:45.869 --> 01:04:50.610 So this is the thing NVIDIA has to handle recursive data. 662 01:04:50.610 --> 01:04:56.909 It's, kernel threads can create whole new parallel kernels. So. 663 01:04:56.909 --> 01:05:00.989 And what happens is they go into the queue, because CUDA has got this queue of kernels, 664 01:05:00.989 --> 01:05:04.829 you know, with kernels having thread blocks and blah, blah, blah, blah. 665 01:05:04.829 --> 01:05:12.030 It just creates a new kernel. Before, you had to create it in your C program, your host program; now you can create a new kernel 666 01:05:12.030 --> 01:05:19.739 inside the existing one. Now the two kernels become sort of independent of each other; you might say the parent kernel and the child kernel.
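(A minimal sketch of that device-side kernel launch, CUDA dynamic parallelism; the kernel names and the work threshold are invented, and it assumes compute capability 3.5 or later with relocatable device code, e.g. nvcc -rdc=true.)

    // child kernel: many threads cooperate on one heavy octree cell
    __global__ void child(int cell, int nBodies) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nBodies) {
            /* ... process body i of this heavy cell ... */
        }
    }

    // parent kernel: one thread per octree cell at this level
    __global__ void parent(const int *bodiesPerCell, int nCells) {
        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        if (cell >= nCells) return;

        int n = bodiesPerCell[cell];
        if (n > 100000) {
            // This cell alone is too much work for one thread:
            // launch a whole new kernel from inside this kernel.
            child<<<(n + 255) / 256, 256>>>(cell, n);
        } else {
            /* ... handle the small cell directly in this thread ... */
        }
    }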
667 01:05:19.739 --> 01:05:30.869 They can establish synchronization, all the obvious stuff: the parent kernel can say I'm going to wait till the child kernel finishes, so it's just like fork-join or something, or master-worker. 668 01:05:30.869 --> 01:05:37.769 It's the obvious synchronization stuff. In any case, you can have, like, recursive kernels now on CUDA. 669 01:05:37.769 --> 01:05:47.159 And others: pipeline, event-driven; most of you know what these terms mean, and if you don't, then I can describe them. 670 01:05:47.159 --> 01:05:51.719 Different ways to look at complex parallel issues. So. 671 01:05:53.489 --> 01:05:58.980 Linear and regular parallelizes better, but maybe the data doesn't like it. So. 672 01:05:58.980 --> 01:06:04.739 Okay, um, more on SPMD being 673 01:06:04.739 --> 01:06:13.019 dominant, um, yeah, so okay. 674 01:06:15.869 --> 01:06:21.960 Mostly, whatever it is, the software matches the hardware, so. 675 01:06:23.070 --> 01:06:30.630 You're going to do a job in parallel; your hardest problem may be 676 01:06:30.630 --> 01:06:36.630 to take your job, take your application, and make it look parallel. Okay, this is, um, 677 01:06:36.630 --> 01:06:41.940 you know, we talk about potential calculations and calculations like that 678 01:06:41.940 --> 01:06:45.119 as examples because it turns out they parallelize very nicely. 679 01:06:45.119 --> 01:06:51.809 One might imagine other big jobs that don't parallelize very well, so we don't talk about them. 680 01:06:51.809 --> 01:06:57.750 Okay, nothing interesting here. 681 01:06:57.750 --> 01:07:01.679 Read it; nothing interesting there. 682 01:07:01.679 --> 01:07:04.949 Uh. 683 01:07:06.239 --> 01:07:10.829 The Nobel Prizes and other awards will go to, um, 684 01:07:10.829 --> 01:07:14.550 qualitatively new science. Um. 685 01:07:17.039 --> 01:07:22.739 To give an example: a few decades ago there was this chemist called E. J. Corey 686 01:07:22.739 --> 01:07:26.820 who was using computers 687 01:07:26.820 --> 01:07:32.369 to work with chemical synthesis, and, 688 01:07:32.369 --> 01:07:35.969 this is talking 50 years ago. So the thing was, 689 01:07:35.969 --> 01:07:40.050 the chemists, some were just starting to use computers to help them, 690 01:07:40.050 --> 01:07:48.659 and they're producing little papers on how I used the computer to compute something, and something else slightly different. 691 01:07:48.659 --> 01:07:56.940 And the major chemists were rather insulting towards the people that were using computers to do incrementally new things. 692 01:07:56.940 --> 01:08:01.230 Okay, um, and. 693 01:08:01.230 --> 01:08:06.480 So E. J. Corey was one of these people, but he figured out qualitatively new things, 694 01:08:06.480 --> 01:08:09.510 and then he got a Nobel Prize in Chemistry for it. 695 01:08:09.510 --> 01:08:18.029 So the thing is, so he took the faster stuff 696 01:08:18.029 --> 01:08:22.199 and used it to do better stuff. So 697 01:08:22.199 --> 01:08:26.520 there is a proverb that 698 01:08:26.520 --> 01:08:32.489 quantity is a quality all its own: you get enough stuff and it starts looking qualitatively different. 699 01:08:32.489 --> 01:08:39.149 So you get something faster, or you do something different, and that's what will get you the prizes. 700 01:08:40.289 --> 01:08:45.840 Okay, so, so what E. J. Corey was doing was, 701 01:08:45.840 --> 01:08:49.260 it was in a general field where the chemists thought it was
702 01:08:49.260 --> 01:08:56.460 silly and superficial, but he morphed it into something that was not silly and superficial; it was fundamental. 703 01:08:56.460 --> 01:09:00.539 And it worked for him, so. 704 01:09:00.539 --> 01:09:07.590 Okay, well, he was at Harvard; he actually got the prize about 40 years ago, so. 705 01:09:07.590 --> 01:09:16.079 Okay; at that time Harvard was picking up a Nobel Prize every two years or so. 706 01:09:16.079 --> 01:09:25.020 Okay, um, nothing interesting there. Um. 707 01:09:27.149 --> 01:09:30.689 It's the thing and so on, um. 708 01:09:32.520 --> 01:09:37.979 Well, again, just to bring it back to what's going on here, and, uh, it, 709 01:09:37.979 --> 01:09:42.779 it can compute energy levels of bonds and therefore can tell, 710 01:09:42.779 --> 01:09:47.609 well, the angle of the bond; you can tell a lot about what's happening. So. 711 01:09:47.609 --> 01:09:52.020 Okay, um, we saw this slide before. 712 01:09:52.020 --> 01:09:56.100 Um, what makes the, the resonance imaging thing 713 01:09:56.100 --> 01:10:00.930 interesting is you're not on a regular Cartesian grid, perhaps because, 714 01:10:00.930 --> 01:10:06.359 oh, the machine is going in a spiral and all sorts of other things. Yeah. 715 01:10:07.680 --> 01:10:12.300 So reconstructing it, and, and, 716 01:10:12.300 --> 01:10:16.140 conceptually it involves inverting a big matrix, 717 01:10:16.140 --> 01:10:20.489 although they do it roughly, perhaps. Nothing there. 718 01:10:20.489 --> 01:10:27.810 Personalized medicine, this is the thing of the future. 719 01:10:27.810 --> 01:10:33.600 If you go in for some operation, for brain surgery or something, cancer there, 720 01:10:33.600 --> 01:10:36.840 they're going to do a model of your personal brain. 721 01:10:36.840 --> 01:10:40.319 They're going to do an individual model of you 722 01:10:40.319 --> 01:10:44.550 and use that to guide the surgery and so on, nowadays. So. 723 01:10:46.680 --> 01:10:50.250 What this means: screening of drugs. You have new drugs; 724 01:10:51.390 --> 01:10:56.159 well, nowadays they test, that is, a trial run, in silico. 725 01:10:56.159 --> 01:11:01.770 That means they would like to compute what the drug will do, 726 01:11:01.770 --> 01:11:06.000 on the supercomputer, rather than having to test it on real people. 727 01:11:06.000 --> 01:11:16.500 Just like when you design a new aircraft, you'd like to maybe model its performance using your CFD software, not actually build it and test-fly it. 728 01:11:16.500 --> 01:11:23.250 You kill fewer pilots that way. And if you can predict the drugs on the computer, what they're going to do... but that's, you know, 729 01:11:23.250 --> 01:11:30.090 easy to say, hard to do. And the next step after that would be to try to, 730 01:11:30.090 --> 01:11:35.220 from first principles, compute what the future drug would be. 731 01:11:37.710 --> 01:11:42.539 Okay, I told you I like to throw numbers around real quick; let me throw some dollar figures at you for drugs. 732 01:11:44.460 --> 01:11:51.750 The pharmaceutical industry grosses a trillion dollars a year worldwide; that would be 10^12 dollars a year. 733 01:11:51.750 --> 01:11:55.199 Okay, and the United States is half of that. 734 01:11:55.199 --> 01:11:59.010 So 5 times 10^11 dollars a year gross 735 01:11:59.010 --> 01:12:04.199 in the United States; so it works out to roughly 1,500 dollars a person for drugs in the United States. 736 01:12:04.199 --> 01:12:07.439 So a new drug costs, depending on
737 01:12:07.439 --> 01:12:13.350 how you do the accounting, a few billion dollars, but that includes the cost of the failed drugs. 738 01:12:14.579 --> 01:12:20.880 But of course the new drug, you know, that works, will contribute to that trillion dollars of annual revenue. 739 01:12:20.880 --> 01:12:28.560 So, and one new drug that's good enough can affect the bottom line of the whole company. So. 740 01:12:29.609 --> 01:12:35.970 So, in other words, there's a lot of money 741 01:12:35.970 --> 01:12:40.920 involved in getting better ways to do new drugs. That's the 742 01:12:40.920 --> 01:12:47.819 takeaway here, so. So, by the way, as an aside: if I hear, you know, somebody 743 01:12:47.819 --> 01:12:54.149 from a drug company complaining that it's, you know, 5 billion dollars for a new drug, 744 01:12:54.149 --> 01:12:57.960 again, it depends on your accounting, and, and, 745 01:12:57.960 --> 01:13:09.630 well, the industry grosses 200 times that in the year, so you can afford to develop several new drugs in a year without, you know, infringing on that trillion-dollar revenue. 746 01:13:09.630 --> 01:13:18.930 Okay, well, we're at 64 minutes; 32 minutes can handle these slides, the numbers. Okay. And so. 747 01:13:20.520 --> 01:13:28.229 In silico, that's a take on in vivo; in vivo would be live tests on living people. 748 01:13:28.229 --> 01:13:32.609 In vitro means in glass, which means tests in a 749 01:13:32.609 --> 01:13:38.760 Petri dish and so on; and in silico would be, in silico would be tests on the computer. 750 01:13:38.760 --> 01:13:47.010 Oh, okay. Um, and. 751 01:13:50.729 --> 01:13:56.579 So, doing it, you know, doing things on the computer, tests for potential drugs and so on. So. 752 01:13:59.220 --> 01:14:06.750 And the thing with electrostatics, okay now, um: so you're worried about how, 753 01:14:06.750 --> 01:14:12.569 for the different parts of the molecule, their potential effect on each other, that affects how they fold. 754 01:14:12.569 --> 01:14:16.380 It affects, one of the things also, the next level up, is this: 755 01:14:18.270 --> 01:14:29.729 so this folded molecule then does things like attract water molecules; it's, it's dissolved in water, also, so the drug molecule is in water, 756 01:14:29.729 --> 01:14:36.000 and the surface of the drug molecule now starts attracting water molecules as well as attracting other proteins. 757 01:14:36.000 --> 01:14:39.119 And that will affect 758 01:14:39.119 --> 01:14:45.060 its activity. So they talk about here the boundary integral equation. So 759 01:14:46.199 --> 01:14:49.829 you're worried about what it's doing in the water and so on. 760 01:14:49.829 --> 01:14:55.770 And then there's polarity issues. Um, you see what's happening. 761 01:14:55.770 --> 01:15:02.069 And these, well, you get polarity issues also: some are polar, some are non-polar, and it affects how they attract each other. 762 01:15:02.069 --> 01:15:05.819 So, the fast multipole method, that, that's what I mentioned, 763 01:15:05.819 --> 01:15:14.159 where you group, um, you have this hierarchy, you hierarchically cluster things together, um. 764 01:15:14.159 --> 01:15:17.340 What multipole means is that, 765 01:15:17.340 --> 01:15:27.869 yeah, you do a Taylor expansion on the field, or on the potential, same thing; the potential is related to the field, the field is the derivative of the potential. 766 01:15:27.869 --> 01:15:39.180 So one atom, that's a simple single-pole potential thing;
two atoms, that's a dipole; multiple atoms, it's a multipole thing. 767 01:15:40.229 --> 01:15:47.250 And mathematically you can do, now, an analogy to a Taylor expansion 768 01:15:47.250 --> 01:15:53.130 on the, on the field, for different points at different distances, and that 769 01:15:53.130 --> 01:16:00.810 all ties into the multipole method, so, okay. And they want, 770 01:16:00.810 --> 01:16:06.510 you know, massive calculations. So, there's a Wikipedia article on that. 771 01:16:06.510 --> 01:16:12.359 Worried about that, worried about that. So the point they're saying here is, yeah, 772 01:16:12.359 --> 01:16:15.420 it's important, um. 773 01:16:17.010 --> 01:16:20.699 This is, um, motherhood: the hardest part is to make your 774 01:16:20.699 --> 01:16:26.670 global thing, um, parallelizable and doing stuff. So, 775 01:16:26.670 --> 01:16:30.119 nothing interesting there. 776 01:16:33.149 --> 01:16:39.000 And if you're going to take your application and parallelize it: 777 01:16:39.000 --> 01:16:44.279 first, put in your pragmas; let OpenMP or OpenACC do it for you. 778 01:16:44.279 --> 01:16:51.270 And, oh, the next thing here is, I've gone very light on, but this is of practical importance again. Um. 779 01:16:53.039 --> 01:16:59.010 The common applications already have parallel code. 780 01:16:59.010 --> 01:17:07.289 Parallel FFT and so on, probably better than you can do it, that sort of thing. MATLAB has some parallel algorithms. 781 01:17:08.550 --> 01:17:14.520 So if what you want to do is something MATLAB has in parallel, say a big matrix, then use MATLAB. 782 01:17:14.520 --> 01:17:19.770 You've got licensing issues: I had an issue where I'm on a machine with lots of cores, 783 01:17:19.770 --> 01:17:23.909 but the license for MATLAB allowed only 8 cores. 784 01:17:23.909 --> 01:17:27.899 So 8 times parallel, that was that, even though the machine could handle more. 785 01:17:27.899 --> 01:17:31.560 cuBLAS, BLAS; what is this stuff here? 786 01:17:31.560 --> 01:17:36.329 BLAS is basically your linear algebra system, 787 01:17:36.329 --> 01:17:44.369 and these are standard, really simple linear algebra things like multiply matrix times vector 788 01:17:44.369 --> 01:17:51.630 that are done a lot. And the basic BLAS is a reference implementation; so it's the, 789 01:17:51.630 --> 01:17:56.640 it's the API definition, and then different people implement the API differently. 790 01:17:56.640 --> 01:18:01.079 There's commercial code, which may be the best; if you wanted to do it really well, maybe 791 01:18:01.079 --> 01:18:06.270 license some commercial code, or the free ones, ATLAS and GotoBLAS and so on. 792 01:18:07.319 --> 01:18:12.359 PGI, that's the NVIDIA compiler; PGI was a company that developed that 793 01:18:12.359 --> 01:18:19.079 and NVIDIA bought it. Nothing... and then the next thing, you know, rewrite and so on. 794 01:18:19.079 --> 01:18:24.960 Okay, nothing interesting there. 795 01:18:28.260 --> 01:18:33.329 Oh, nothing interesting there. Um. 796 01:18:35.100 --> 01:18:38.850 How to do stuff in parallel. 797 01:18:38.850 --> 01:18:44.399 And that was that; let me leave that slide off. Let's see. 798 01:18:46.050 --> 01:18:52.229 Okay, so that was, um, 799 01:18:52.229 --> 01:18:56.039 module 18, and 800 01:18:56.039 --> 01:19:01.829 what is happening after that, from there to the end, is 801 01:19:05.550 --> 01:19:08.699 some other little stuff. 802 01:19:08.699 --> 01:19:16.890 I'll go through some of that. So we're sort of winding down this accelerated computing stuff from NVIDIA.
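(Backing up to the BLAS point a moment ago: a sketch of calling the standard CBLAS matrix-times-vector routine, assuming some BLAS implementation such as OpenBLAS or ATLAS is installed and linked; the sizes here are arbitrary.)

    #include <cblas.h>
    #include <stdio.h>

    int main(void) {
        /* y = alpha*A*x + beta*y, the classic BLAS "gemv" operation */
        double A[2 * 3] = { 1, 2, 3,
                            4, 5, 6 };   /* 2x3 matrix, row-major */
        double x[3] = { 1, 1, 1 };
        double y[2] = { 0, 0 };

        cblas_dgemv(CblasRowMajor, CblasNoTrans,
                    2, 3,         /* M rows, N columns */
                    1.0, A, 3,    /* alpha, A, leading dimension */
                    x, 1,         /* x and its stride */
                    0.0, y, 1);   /* beta, y and its stride */

        printf("y = [%g, %g]\n", y[0], y[1]);   /* expect [6, 15] */
        return 0;
    }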
But you've seen a lot of 803 01:19:16.890 --> 01:19:23.010 things, and I think the next topic will be to hit, 804 01:19:23.010 --> 01:19:26.520 this has been sort of hardware related, so I want to do some Thrust or something. 805 01:19:28.770 --> 01:19:37.500 And you're free to go do this as you need to. So I'm open for questions and so on, and I encourage you to read the book, and I'm encouraging you every class to look at the stuff online. 806 01:19:39.329 --> 01:19:45.600 If you want more NVIDIA stuff, go to their website; they've got white papers and developer blogs and 807 01:19:45.600 --> 01:19:52.350 all sorts of stuff online. So, and NVIDIA really wants to help people learn their system; 808 01:19:52.350 --> 01:19:56.850 they sell more. Okay, so. 809 01:19:56.850 --> 01:20:05.189 Open for questions; other than that, have a good weekend. Maybe go skiing this weekend if we get some snow, and see you Monday. So. 810 01:20:06.960 --> 01:20:10.739 Yeah, I was skiing whenever there was snow last. 811 01:20:10.739 --> 01:20:15.449 Well, you can program faster if you take a day a week off, I think, so. 812 01:20:15.449 --> 01:20:24.029 And maybe, if you, so, basically: read about a big problem and then go skiing, and while you're getting some exercise you may think of a solution. 813 01:20:24.029 --> 01:20:27.210 Really. 814 01:20:27.210 --> 01:20:31.739 Okay, so. 815 01:20:31.739 --> 01:20:35.729 In any case. 816 01:20:35.729 --> 01:20:39.960 Yes, that was. 817 01:20:39.960 --> 01:20:47.699 No, I'm going to assign one, but I'll give you a week after I assign it. What it's going to say 818 01:20:47.699 --> 01:20:53.609 is: look at these NVIDIA parallel docs and write a page summarizing them. So. 819 01:20:53.609 --> 01:20:57.210 Okay, I was curious. 820 01:20:57.210 --> 01:21:04.289 Just, like, just put the thing in a PDF, a listing, so 821 01:21:04.289 --> 01:21:10.500 you're comparing, comparing different things, so write, you know, your conclusion of this. 822 01:21:10.500 --> 01:21:16.500 Include the source code for me to look at; I'm not going to run it, though. So. 823 01:21:16.500 --> 01:21:29.069 And there are tools... I forget if I told you, I told someone: if you've got different PDF pieces, there are tools out there that will combine PDF files into one PDF file. 824 01:21:29.069 --> 01:21:33.180 You need to upload it to Gradescope or whatever. 825 01:21:33.180 --> 01:21:39.510 And should there be something that will not upload, you could always have a pointer to somewhere else. 826 01:21:39.510 --> 01:21:47.250 Yeah, yeah, fine; one set is fine. Yeah, sure. Okay. 827 01:21:47.250 --> 01:21:55.680 I had a really cool thing I used for a final report on a project: I had lots of separate PDF files, and 828 01:21:55.680 --> 01:22:03.869 I just combined them; I took each PDF file, embedded it in a big LaTeX document with running headers and footers and page numbers. 829 01:22:03.869 --> 01:22:08.609 It's really nice. It's really optional. 830 01:22:08.609 --> 01:22:11.640 I wanted to ask you, like, 831 01:22:11.640 --> 01:22:26.454 do you want us to, like, have some of this stuff in a text file, loading the text from the file, the text that needs to be converted anyway; or do you want us to, like, have some variables inside of the code that already have the string? 832 01:22:26.760 --> 01:22:31.260 I don't care, whichever is easier. 833 01:22:31.260 --> 01:22:36.569 It takes a little work to read a text file in C, but not that much work. Yeah. Yeah. 834 01:22:36.569 --> 01:22:43.050 You know, line by line.
Well, I actually don't like lines that much; I would just read the whole thing in. 835 01:22:43.050 --> 01:22:49.590 Um, you know, yeah, so. 836 01:22:49.590 --> 01:22:53.579 Oh, I think you could do it. Okay. Okay. You want, 837 01:22:53.579 --> 01:23:01.289 here's one way you could do it: there's a way in C for us to map the whole file into virtual memory. 838 01:23:01.289 --> 01:23:05.159 So, I figure it's, 839 01:23:05.159 --> 01:23:10.920 mmap, the file-map thing. So what happens is, you open the file, you call this magic function, 840 01:23:10.920 --> 01:23:15.329 and it returns a pointer to the file in virtual memory, 841 01:23:15.329 --> 01:23:21.420 and also the length. So now you can just leave it as it is, or take the length, define 842 01:23:21.420 --> 01:23:25.770 an internal variable of your own and copy it in. 843 01:23:25.770 --> 01:23:29.220 It's really, it's also efficient. 844 01:23:29.220 --> 01:23:34.020 It's also cool. So it's kind of hidden, sort of an artificial idea. 845 01:23:34.020 --> 01:23:37.380 That's an aside, like, that sort of style. 846 01:23:37.380 --> 01:23:41.220 You can also open the file and get the length and then do a read. 847 01:23:41.220 --> 01:23:49.050 There's, like, an fread function, and it just reads n bytes; you give it n and it reads that many bytes in one function call. Yeah. 848 01:23:49.050 --> 01:23:56.010 Again, or you could iterate over getc, byte by byte, get a line, get a character, and that seems, 849 01:23:56.010 --> 01:24:00.390 Character by character... you can load in, though, 850 01:24:00.390 --> 01:24:04.199 a full sentence, load in the whole file? 851 01:24:04.199 --> 01:24:09.869 Yeah; no, what I'm talking about works only if your memory is bigger than the file, but yeah. 852 01:24:09.869 --> 01:24:16.800 Yeah, a lot of these techniques you see were designed decades ago, when your files were bigger than your memory. 853 01:24:16.800 --> 01:24:20.819 The whole idea of passes for compilers was, was, 854 01:24:20.819 --> 01:24:25.949 the computer was too small to store the whole program all at once; you had to read it in multiple times. And yeah. 855 01:24:25.949 --> 01:24:30.539 But ultimately, for you guys: you've got memory, use it. 856 01:24:30.539 --> 01:24:34.319 Yeah, yeah. 857 01:24:36.420 --> 01:24:41.729 Okay. 858 01:24:48.630 --> 01:24:57.000 Okay. 859 01:25:02.699 --> 01:25:06.930 Hello. 860 01:25:06.930 --> 01:25:13.199 Hello. 861 01:25:13.199 --> 01:25:17.039 Absolutely. 862 01:26:02.760 --> 01:26:07.439 Hello. 863 01:26:43.890 --> 01:26:46.920 For it. 864 01:26:48.029 --> 01:26:51.569 Hello. 865 01:26:59.340 --> 01:27:04.109 Huh. 866 01:27:46.470 --> 01:27:51.090 Hello. 867 01:28:07.470 --> 01:28:11.100 Okay. 868 01:29:08.579 --> 01:29:11.909 Hello. 869 01:29:18.180 --> 01:29:21.569 Okay. 870 01:29:48.810 --> 01:29:52.680 Okay. 871 01:30:28.890 --> 01:30:32.159 Hello. 872 01:31:40.949 --> 01:31:47.399 Okay. 873 01:32:12.599 --> 01:32:16.679 Hello. 874 01:32:28.259 --> 01:32:32.248 Okay.
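(For the file-reading question in the Q&A above, a minimal sketch of the mmap approach on Linux/Unix; error handling is mostly omitted, and the alternative mentioned in class, fstat plus one fread, works the same way without the mapping.)

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;

        int fd = open(argv[1], O_RDONLY);   /* open the file */
        struct stat st;
        fstat(fd, &st);                     /* get its length */

        /* the magic function: map the whole file into virtual memory */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* p now points at the file contents and st.st_size is the length;
           use it in place, or copy it into your own string. */
        fwrite(p, 1, st.st_size, stdout);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }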