Is this any better? How's the sound now? Hello.

Great, sorry about the delay. Parallel computing, class 14, March 15, 2021. What's on tap for today? We're continuing with this accelerated computing material from NVIDIA, which, by the way, is from a book by Kirk and Hwu, Programming Massively Parallel Processors: A Hands-on Approach. It's an excellent book. It costs money, but by the standards of university textbooks very little money, so if you like the slides I recommend that you get the book.

So we're moving along in that. Some other stuff I'll mention before I continue with the slides; I've pointed you at some of this before. First, NVIDIA has some primary documentation that I recommend, and you should go read it. It's more up to date than the books, but it's drier. It's also more accurate; the books actually sometimes have errors. But if you want a nice concise summary of what's going on, like what's the difference between a CPU and a GPU: CPUs have lower latency, and GPUs have more compute capability for the electrical power that you're using, and so on.

They also show the real-estate layout: a much, much higher percentage of the chip in a GPU is used for actual processing, as opposed to things like caches and control. You need less control area because the warp of 32 threads is all running the same instruction, for example. They talk about that there, so I encourage you to go through it.

Some other points I'll just hit: NVIDIA has spent a lot of effort on tools to help you use their hardware, and they've got different types of tools; let me actually go through this a little, since it's stuff I haven't covered so much. They've got libraries using the GPU (we've seen a little about that and will see a touch more) and programming languages, and then this two-dimensional array of GPU products. Horizontally goes with time, so the compute capability goes up: the latest GPU on parallel, my server, is compute capability 7, and their latest one, which came out less than a year ago, is compute capability 8. They keep adding more features. The other axis is sort of function: there's the embedded market, say for autonomous vehicles, and so on. I believe Tesla for a while was using NVIDIA, possibly, but they may do their own now. I have an Audi Q5 automobile; it has two NVIDIA chips in it, I believe, because of the car's two displays.
So that's the embedded market. Then there's the consumer market for gaming, and the professional market. The difference between consumer and professional is a couple of things. The professional cards, for example, will do double precision a lot better, and they may worry about IEEE 754 standard floating point and so on. So the Quadro line does more accurate computations, for computer-aided design and so on. The Quadro line also costs more; it's premium priced, because NVIDIA understands that engineers are willing to pay more than gamers are. So if you look at functionality, the Quadro is more expensive for what you get than the GeForce line, which is why some professionals try to use the GeForce line, because it's a lot cheaper. That caused NVIDIA at one point to change the licensing so that some of their Quadro tools were not allowed to run on the GeForce.

And then the last special area is the data center, the supercomputing line, which they call Tesla, which is unrelated to the car. There are three uses of the word Tesla in common use. There's the automobile; forget that, it's not relevant to this. It's NVIDIA's professional line of supercomputing GPUs, which often don't even have video out, since that takes space on the chip. And it's also the name of a generation: indexing through time, it's the name of an obsolete generation of chips.

In any case, CUDA is a scalable model: you can add more cores and get the same thing done, but faster. They talk about that here, and then you just go down and see some new stuff, some interesting things that they've got; I may hit a little more of it later.

Cooperative groups: they've got ways to work with fractions of a warp now. And dynamic parallelism: a kernel can actually start another kernel. When I told you that a global routine was called from the host and runnable on the device, that's a simplification; in some cases you can call a global routine from the device also, and it fires off an independent kernel.

Texture fetching is another type of memory I've just alluded to briefly in CUDA, because remember, the origins of all this is graphics, and graphics needs textures. You can have up to three-dimensional textures: you could have a block of wood, and for every x, y, z point you've got a color, so you'd need a three-dimensional texture.
More commonly they might be two-dimensional. And the thing is that they have special hardware tools to do texture mapping fast, such as: if an index is a fraction, a real number rather than an integer, it will interpolate from the adjacent integer coordinates of the texture. So the interpolation in the texture is considered part of the texture fetching. It also (the details are proprietary, I think) may use a space-filling curve to order the texture elements, stuff like that.

Important terms: the kernel is the parallel program running on the device, specified with __global__. Dynamic parallelism is, like I said, where a kernel can start another kernel; an application might be, I don't know, you're walking down a tree and part of the tree is a lot bushier, so you start a new kernel, perhaps. Unified memory is where the host and the device have the same address space; you might even get swapping. So you can walk your way through this and have fun: unified memory, managed memory, and so on. And you can mix serial and parallel code, which would be a really efficient use of your machine: you're running stuff on the host, you're running stuff on the device, and they are interleaved, perhaps. So you can have fun with this; go through it, I've got a link to it there.

Some other stuff I've got here before I go back to these slides. This is my understanding, which I wrote up a little while ago, because again, these NVIDIA terms are not completely logical. So this is my attempt to pull out the important points.

You've got a microarchitecture, which defines the available operations. A shuffle might be an operation, a permute would be an operation where you permute results among all the threads in a warp; floating point, shaders, those would be operations. So the microarchitecture, you might say, defines the machine language of opcodes. Then you've got these generations; the last several are Tesla, Fermi, Kepler, Maxwell, Pascal, Volta, and Ampere.

Okay, now the microarchitecture is implemented in microprocessors, which are different: they have the same instruction set, but a more expensive microprocessor might have more streaming multiprocessors, might have a higher clock rate, et cetera, et cetera. The code name for the microprocessor tends to be something like GK110: the G means graphics (they all start with a G), K is the initial letter of the generation, Kepler or Pascal, and then the number is the code for the particular microprocessor. It's not always true that a higher number means faster; a higher number may mean slower. And then you get these suffixes and stuff, so it's a mess; I don't fully understand it, but I'm trying to help you out here.

The microprocessor gets put into a graphics card, also called an accelerator or a module, and that's the thing that you plug into your server, let's say. And they've got these series: GeForce, Quadro, and so on.
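To make the generation / compute-capability / SM-count distinctions above concrete, here is a minimal sketch (my addition, not from the slides or the book) that asks the CUDA runtime what it reports for whatever card you actually have; the fields are the standard cudaDeviceProp ones:

    // devquery.cu -- compile with nvcc; prints what CUDA knows about each visible GPU.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);                       // how many GPUs are visible
        for (int d = 0; d < n; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);           // fill in the property struct
            std::printf("device %d: %s\n", d, p.name);
            std::printf("  compute capability %d.%d\n", p.major, p.minor);
            std::printf("  %d streaming multiprocessors, warp size %d\n",
                        p.multiProcessorCount, p.warpSize);
            std::printf("  %.1f GB global memory\n",
                        p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;
    }

Two cards with the same generation and compute capability can still differ in the SM count and memory size reported here, which is exactly the binning and product-line variation being described.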
And the thing is, they have different cards: the same microprocessor can go into different cards with different capabilities, like different amounts of memory on the card. So it's a horrible mess. Oh, and then another thing: the card may be manufactured by another vendor, who buys the microprocessor from NVIDIA and puts it into its own card. PNY, say, would have a card, and they might twiddle various features, play games with the memory and bus rate and so on. So NVIDIA may produce one card as an example of how to do it, what we call a reference implementation; you could buy a reference implementation card directly from NVIDIA, or you could buy a card from someone else.

Okay, so you get these product names, and then you'll get little suffixes, like the X in K20X, and the X or XM will mean various things. One customization: does it have its own fan, does it have on-board cooling, or does it rely on a fan from the server case, something like that? Another is the number of processors. On an NVIDIA die, for example, I think they tried to build, say, six streaming multiprocessors onto the chip, but fabrication fails sometimes, so they bin them: if on a given chip only five of the six were working, they'd sell it as a cheaper variant with a slightly different model number.

So that's my summary, sort of, of what's happening. The term GPU is sometimes used for the whole card; it's sometimes used for just the processor. Again, it's not standardized. Speeds: they change the type of memory, they change the number of cores, they change the bus and the bus rate, the clock rate; there are all sorts of variants.

Now, CUDA has a couple of different version numbers. There's the compute capability version, which tells you what hardware properties are available: as the number goes up you get different things available — there are tables of what you can do, how much shared memory, whatever you can imagine. As time goes on they add more capabilities and they bump the major version number. That's down at the hardware architecture level. There's also a version for the software, for CUDA itself. These slides are a year out of date, I haven't updated them for this year; the current CUDA software version is up at 12.2, I think. So they bump up the compute capability for the hardware, and they bump up the software driver as they add more features to the software.

It gets worse. nvcc, the CUDA compiler, takes
C++ code that has been slightly extended with some syntactic extensions: the triple angle brackets, and the routine modifiers, __global__ and __shared__ and stuff like that. The compiler takes this CUDA program, with the extensions, and splits it apart into the C++ code and the CUDA code. The C++ code goes through Clang, or so I believe; it naturally went through Clang at one point, though that may not be true anymore.

And then the CUDA code does not get compiled straight down to low-level assembly. The CUDA code gets compiled down to an intermediate abstract machine, a bytecode so to speak — just as Java gets compiled to a bytecode, and Python might get compiled to a bytecode — the CUDA gets compiled to this intermediate bytecode (PTX), and that's what's stored in your executable file. Then when you run your executable, at run time there's basically a just-in-time compiler, which compiles from the bytecode down to the actual CUDA assembly. And the reason for that is that they can change the back end: if you want to run on a newer generation of GPU hardware, the run-time just-in-time compiler has to change, but the bytecode is still valid. So it adds an extra level of complexity, but it means your executables don't have to be recompiled, and you get some forward and backward compatibility, which is nice.

That's also true to some extent in Linux, actually. I've got some executables approaching twenty years old, compiled from C programs, and the executables might still run. For minor things the Linux kernel has maintained API compatibility — not for system stuff. For internal system stuff the Linux kernel developers definitely state they don't even try to maintain compatibility, and it seems like they go out of their way to be incompatible from version to version. What they want from people working at that level is to build official modules integrated into the kernel; they don't want people writing their own kernel-level stuff on the side. In Linux they can't stop them, but they make it hard for them. So in any case, I have twenty-year-old Linux executables, as I said, that will still run; the API is still compatible. The crazy thing is my C programs won't compile anymore: the C language has changed more recently than the Linux API. It's sort of crazy. I say bad things about Linux; well, that's a good thing about Linux.

And just a point: the key thing with the GPU is optimizing data flow as well as processing. Okay, so that's my summary of parts of NVIDIA; I'll get to memory allocation later. Let me go back to some of these slide sets here.
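Since the syntactic extensions nvcc handles are easy to show, here is a trivial, self-contained sketch of my own (not code from the slides) using __global__, __shared__, managed memory, and the triple-angle-bracket launch; everything else is ordinary C++:

    // scale.cu -- nvcc splits this into host C++ and device code, emits PTX,
    // and the driver JIT-compiles the PTX for whatever GPU you run it on.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, float a, int n) {    // __global__: runs on the device, launched from the host
        __shared__ float cache[256];                      // __shared__: per-block fast memory (used trivially, just to show the qualifier)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { cache[threadIdx.x] = x[i]; x[i] = a * cache[threadIdx.x]; }
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));         // unified (managed) memory, mentioned earlier
        for (int i = 0; i < n; ++i) x[i] = 1.0f;
        scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);      // the <<<blocks, threads>>> extension
        cudaDeviceSynchronize();
        std::printf("x[0] = %f\n", x[0]);                 // expect 2.0
        cudaFree(x);
        return 0;
    }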
So we're starting with some case studies. Let me just show you; again, I'm just hitting highlights for you. This is slide set 16.1. I downloaded the zip file of the slides and opened it up, okay. Let's see if I can make it any bigger for you... no, I can't really make it any bigger.

Okay, so this is another application. What we have here is a couple of applications of various techniques. Work efficiency means: don't have a lot of threads in the warp being idle, if you can avoid it. So, I don't know, if you have 128 threads doing things, it'd be better for those 128 threads to be only 4 warps instead of, say, 128 warps, because if you're using fewer warps, then the warp machinery on the GPU, like the instruction dispatch unit on the streaming multiprocessor, is free to process other warps or some other application. So that's work efficiency. Then divergence: ways to avoid conditionals in your code, because again, conditionals are bad for warps.

So this is working with a real application, some molecular dynamics things. One of the problems is that we want to calculate potentials. You have a charged body here; it exerts an electric field throughout the rest of the universe, but the magnitude follows an inverse square law. So how do you compute it: here's a target point, you want to compute the potential there from all of the charged bodies in your application, so you have to sum them up. And just to anticipate a little, we're going to put a grid on the scene and coarsen things.

So the direct way is: you just iterate over all the other charged atoms in the system and sum up the potentials. That's simple, but it takes linear time for each target atom, so quadratic time for the whole problem.

Now, if you've got something like this, your first question: you basically have a double iteration, a double loop. You're looping over the target bodies whose potentials you want to calculate, and you're looping over the source bodies that are charged, which you might say are the sources of the fields. And the first question is, which is the outer loop and which is the inner loop?

The next thing here is this cutoff idea, another big policy design question when you're designing an application like this: at a certain point you say the sources are so far away that they don't matter, and I'm not going to consider them.
If you're summing up gravitational potential, say, at my home here in Loudonville, I would sum the effect of the moon and the sun — I'd look at the moon and the sun, and maybe, if I'm superstitious, Jupiter, but I think that's silly. But I would not consider the effect on my home of Pluto, and I would not consider the effect on my home of stars in the Milky Way in some distant cluster, and so on. They're too far away.

Or another thing I would do is cluster stuff. If I want to consider the effect on my home of the sun, I would treat the whole sun as one point body, because it's far enough away compared to its size that that's a reasonable approximation. So I would start clustering stuff. Same thing for the moon: I treat the moon as one point body, a few thousand miles across and 200,000 miles away; that's reasonable. But if I were close to the moon, like trying to orbit around the moon, I would not treat the whole moon as one point body; that would be seriously wrong, that would calculate a seriously wrong orbit — the calculation would be bad enough to be useless, or something. So you get various approximation things you have to think of.

So you could imagine some sort of hierarchy, clustering objects. Think of something like the Milky Way; let's say you want to compute the evolution of the Milky Way. You've got these arms in the Milky Way, and the arms are interesting: they're a wave phenomenon, density waves going through the stars. The arm as a concept exists, but the stars in the arm change. So if you wanted to work on the evolution of the Milky Way, such as some of the physicists are doing at RPI with this MilkyWay@home project, using volunteer cycles, you would need some sort of adaptable, dynamic clustering. A star would be in one cluster when it's in an arm, and then it travels out of the arm, because the arm — this density wave — is going faster than the stars. So a star enters an arm, is in the arm, then leaves the arm. It's like a wave on the ocean: the wave is an organizational concept, not the individual water molecules; the wave travels over the ocean, the water molecules do not. That's the perfect example, and Milky Way arms are like that. So again, you try to bucket stuff up, cluster stuff to compute, but then your clustering has to be dynamic.

Okay, so here they're trying to model things like proteins — complicated structures. This is a hard problem, and it's important.
It's hard because of these potentials: a protein, excuse me, is this long molecule. The different atoms have different charges, so it wants to fold up; if it folds up, it reduces its potential energy. And so it folds up in all different ways. There's a geometric component — you know, this blob fits into that slot or something — and there's a charge component: you want complementary charges close to each other. So the proteins are bouncing around, folding and unfolding, but they're preferentially going to end up in some folded state. And it's very complicated to compute what that resulting folded state will be. Of course, it's not just one protein: if there are a couple of proteins in the solution, they may fold together. The temperature will affect what the stable state is; that's a principle in chemistry. The optimum state, again, depends on everything in the system, so it's very difficult to compute what the folded state will be.

However, it's important to compute it. This hard problem is a problem where we're interested in the result, because how the protein ends up folded determines what it does; it determines its activity, because it's the stuff on the surface, you might say, that can interact with other things, which controls what the protein will do. So this problem, protein folding, is difficult and it's important, and it affects, say, drug design.

A goal might be... I mean, drug design now is sort of like alchemy, really. Well, I'm being a touch unfair; it has a strong resemblance to alchemy in that they don't know what they're doing from first principles, and a lot of it is trying everything: wander through the Amazon, find an interesting plant, try it and see if it does something. That's a little unfair, but there's truth in what I'm saying: randomly trying stuff to see what happens. It would be really nice if, from first principles — going from alchemy to chemistry — you could start designing this stuff. But to do that, you have to add some serious computation. So this is how this stuff is relevant to the real world.

In any case, the sequential version: we want to compute the potential energy at each point on a grid, so we iterate over the grid cells and over the input atoms and sum up the potentials. In one limited sense, that's optimal. I'll skip through the code, but look at the loops: the outer loop is iterating over the atoms, and the atoms are the sources of potential. The inner loops iterate over the grid; we want to compute the potential at each grid cell, and in this case it's a 2D grid, so we've got two nested loops over the grid.
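For concreteness, here is roughly what that input-oriented sequential loop nest looks like. This is my sketch of the idea, not the book's exact code; the flat atom layout, the grid spacing, and the simple charge-over-distance potential are assumptions:

    // Sequential, input-oriented: outer loop over source atoms, inner loops over a 2D grid slice.
    // energygrid[j*nx + i] accumulates the potential at grid point (i, j);
    // atoms[] holds x, y, z, charge packed as 4 floats per atom.
    #include <math.h>

    void cenergy_seq(float *energygrid, int nx, int ny, float spacing,
                     const float *atoms, int numatoms, float gridz) {
        for (int n = 0; n < numatoms; ++n) {              // input (source) atoms: outer loop
            float ax = atoms[4*n],   ay = atoms[4*n+1],
                  az = atoms[4*n+2], q  = atoms[4*n+3];
            float dz = gridz - az;                        // constant for the whole slice
            for (int j = 0; j < ny; ++j) {                // output grid: inner loops
                float dy = j * spacing - ay;
                for (int i = 0; i < nx; ++i) {
                    float dx = i * spacing - ax;
                    energygrid[j*nx + i] += q / sqrtf(dx*dx + dy*dy + dz*dz);
                }
            }
        }
    }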
So we're iterating over x and over y, and we're just summing into the grid. It's called input oriented because we're iterating over the input: the outer loop is iterating over the input. As I said, it's your choice what the outer loop is, the input or the output. And if I'm going too fast for anyone, tell me.

Okay, that was the serial implementation, the sequential implementation. Now we want to do it in parallel. We might do one slice of the target data at a time, or you might be able to put an array on the GPU and work in slices of the data.

One interesting thing here: we're sort of changing the iteration, because the outer loop here is basically over one slice of the output data, and for each grid point in that slice we're iterating over all the input atoms. So what we do is take those input atoms and cache them: put them in the shared memory. Well, you could let the cache do it implicitly, or, in a case like this where you think you know better than the cache (and you'd be right), you copy the input data to the shared memory yourself, and now it's really fast to iterate over it many times. You're writing data to the output, so there's a latency there, but it doesn't actually matter if you think about it.

Obviously you're going to do your slicing so you put as much data in the shared memory as you can. And when I say as much, here's a place you have to ask yourself: can you get away with, say, half-precision floats, or do you need 4-byte data, or can you work with 2-byte data? I mean, you're already doing an approximation because you've got a grid.

Let me describe what's happening here. One thing we're talking about is that one thread might work with one atom. That's slightly different from what I said on the last slide, but it means the thread computes output that will go into various different places. This is called a scatter pattern, because the output is being written to scattered, random places.

And they're talking about different ways you can optimize. If you find a constant that's being used again and again, then on a host compiler I assume the compiler would optimize it; maybe not here on the device.
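Here is a minimal sketch of the scatter-style kernel just described — my reconstruction with assumed names, not the slide code: one thread per input atom, each writing its contributions into the whole output slice with atomicAdd, which is exactly the serialization cost discussed next:

    // Scatter: one thread per source atom; each thread updates many output grid cells.
    // atomicAdd is needed because many threads may hit the same cell concurrently.
    __global__ void cenergy_scatter(float *energygrid, int nx, int ny, float spacing,
                                    const float4 *atoms, int numatoms, float gridz) {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= numatoms) return;
        float4 a = atoms[n];                              // x, y, z, charge
        float dz = gridz - a.z;
        for (int j = 0; j < ny; ++j) {
            float dy = j * spacing - a.y;
            for (int i = 0; i < nx; ++i) {
                float dx = i * spacing - a.x;
                float contrib = a.w * rsqrtf(dx*dx + dy*dy + dz*dz);
                atomicAdd(&energygrid[j*nx + i], contrib);   // serialized when threads collide
            }
        }
    }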
Though speaking of compilers on the device: there is that new NVIDIA C++ compiler, which I haven't tried yet. So here they explicitly calculate stuff that's going to be used again and again: play the game, be your own optimizer, don't trust the compiler. And when you're updating the target potential, you have to serialize that, in case some other thread is updating it, so it's an atomic operation. Here's a case where having lots of threads might make that not so bad: this thread hangs on the atomic operation, but there's going to be another warp somewhere else that's waiting, ready to run. So this is the work-efficiency concept again.

So, what's happening here: the input is the source atoms that have charges; the output is the grid cells whose potentials we want to compute. Basically what they're saying is there's a thread for each input atom, and it computes output for all the nearby grid cells — maybe not all the grid cells in the universe, maybe just the nearby ones. So the output is scattered. There are ways to un-scatter it, like: you write pairs of output coordinate and output value, you sort that array, and then you do a reduction on it; in fact that's a valuable technique.

So this is the input-oriented version: you iterate over the input, and you don't iterate over all the outputs, only the outputs that are close to the inputs. And they don't like this method. They just said: it's easy, but there's that scattered write to the output; each thread computes all the output grid cells it affects and does an atomic add to sum them in. So this is a bad idea. They presented it to you just to give you a chance to see that it's a bad idea, because it's something you might think of doing, and they're saying: don't.

Okay, that was 16.1: it introduced the problem and gave you a naive way to solve it.

So this next one will be a better way to solve it: more output oriented. The previous one was more input oriented; this one is output oriented, meaning the outer loop iterates over the output, the target grid cells that we want to compute. So for each target grid cell, we iterate over the atoms and add them in. Now, with a naive data structure here, we have to iterate over all the atoms, because we don't know which atoms are close. You could imagine putting the atoms in some sort of bucketing structure: before you did this computation, you bucketed the atoms into buckets, maybe the same size as the grid cells or somewhat coarser. And, of course, bucketing the atoms
is a little harder than it looks to do in parallel, and that's something I've actually worked on; my ideas, I like them, they worked very well. So sorting the atoms so they're in buckets, and then for each output grid cell just looping over the close atoms, is going to make this faster, but you need that initial step.

You see, the trouble with building a grid containing all the atoms, where each cell contains the atoms in it, is that the number of atoms in each grid cell is variable and unpredictable. So you have to have some data structure that can handle that. I guess what they might teach you in something like CS 1 would be a linked list of all the atoms in each grid cell, and if you know me, if I say it's something they teach in CS 1 or Math 1, I'm going to follow that by saying it's a bad idea. Linked lists on a parallel computer are a very bad idea.

I'll tell you two things I have done. One thing is I make two passes through the input data. In the first pass I do a histogram on the input data and count how many atoms will go in each cell; if the grid is not too big, that histogram will fit in fast shared memory. That's the first step. Then I allocate a ragged array for the grid. A ragged array has a dope vector: it allocates just as much space in each cell as the atoms that are going to go in that cell, with the dope vector pointing to the start of each cell. So it's one big array that holds all of the atoms in all of the cells, and the dope vector points to where each cell's atoms start. You can compute the dope vector from the histogram; it's a scan operation — scans are useful. Then you make a second pass through the input atoms and populate the grid, and now we know the atoms that are close to each grid cell. That's one way to do it.

A second way is to double the memory usage temporarily. For each atom, we compute what cell it will go into and create an ordered pair of cell number and atom number, and then we sort by cell number. This doubles the memory, because each atom now carries a cell number (actually we could compute that dynamically — there's an interesting time-space trade-off there). Once it's sorted by cell number, we have all of the atoms in each cell gathered together; we just have to be able to get at them fast, so we do a scan operation through the array — scans are useful, I keep saying — and now we have a dope vector pointing to the start of the atoms in each cell. So those are two different ways to do it.
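Here is a compact sketch of that first, two-pass approach: histogram, then an exclusive scan to build the dope vector, then a second pass to place the atoms. It's my illustration of the idea under assumed names, done on the device and using Thrust for the scan rather than a hand-written one; cellOf[i] is assumed to already hold the cell index of atom i:

    // Two-pass bucketing of atoms into grid cells.
    #include <cuda_runtime.h>
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    __global__ void histogram(const int *cellOf, int natoms, int *count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < natoms) atomicAdd(&count[cellOf[i]], 1);          // pass 1: count atoms per cell
    }

    __global__ void place(const int *cellOf, int natoms, int *cursor, int *bucketed) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < natoms) {
            int slot = atomicAdd(&cursor[cellOf[i]], 1);          // pass 2: claim a slot in this cell's run
            bucketed[slot] = i;                                   // store the atom index
        }
    }

    void bucketAtoms(const thrust::device_vector<int> &cellOf, int ncells,
                     thrust::device_vector<int> &dope,            // ncells+1 entries: start of each cell's run
                     thrust::device_vector<int> &bucketed) {      // natoms entries: atom indices grouped by cell
        int natoms = cellOf.size();
        thrust::device_vector<int> count(ncells + 1, 0);
        int threads = 256, blocks = (natoms + threads - 1) / threads;
        histogram<<<blocks, threads>>>(thrust::raw_pointer_cast(cellOf.data()), natoms,
                                       thrust::raw_pointer_cast(count.data()));
        dope.resize(ncells + 1);
        thrust::exclusive_scan(count.begin(), count.end(), dope.begin());   // dope vector = prefix sums
        thrust::device_vector<int> cursor = dope;                 // running write position per cell
        bucketed.resize(natoms);
        place<<<blocks, threads>>>(thrust::raw_pointer_cast(cellOf.data()), natoms,
                                   thrust::raw_pointer_cast(cursor.data()),
                                   thrust::raw_pointer_cast(bucketed.data()));
        cudaDeviceSynchronize();
    }

After this, the atoms near output cell c are bucketed[dope[c]] through bucketed[dope[c+1]-1] for that cell, plus the neighboring cells.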
The thing with the sorting approach is that sorting in parallel can be slow. In fact, in some of my programs working with parallel stuff like this, the slowest part of the whole program is the sort, because parallel sorts involve multiple passes through the data. They're using bucket sorts and, sorry, radix sorts; quicksort, the CS 1 favorite, is a horrible idea here — radix sort is in many cases much faster, and it also parallelizes nicely. But it's still slow, because it involves making multiple passes, reading and writing the data. In any case, this is how I would improve this: I wouldn't, down here, iterate over all of the atoms; I would try to have some sort of spatial data structure imposed on the atoms, so that for each output grid cell I know the nearby atoms — and they're not just the atoms in the same cell, they're the atoms within several cells of the target cell, of course.

Okay, redundant computation: they're trying not to repeat work. So what they've done here is still sequential code, but they're prepping it; they're doing a layout and a setup to prep it for the parallel code. And what they've done with these source-code changes is that the output energy grid array doesn't get accessed as much, but there are more accesses to the input. More reads and fewer writes — you might be thinking that that's good, and you'd be right — and some simple extra calculations on the coordinates.

Again, remember: more often than not we are data limited. Moving the data takes more time than computing with the data. That's why a lot of the demos of parallel processing speed work with matrix multiplication: that's a compute-bound application, and it shows off the parallel computer very nicely. So if you have to do more computation to get less data movement, that's probably a win. And when they look at the cons here: many more calculations on the coordinates — not so bad, probably; more accesses to the atom array — but those are read-only accesses, and if you're lucky they're to shared memory. So: more calculations, and maybe slower running on the host.

Okay, so what they're talking about here, the intellectual content, is that we're going to partition the data, and we have to figure out what we can put in each thread block, because the thread block is sharing the same shared memory; you figure out how much you can fit in one thread block.

So this is gather parallelization: each output grid cell's potential depends on many inputs.
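A minimal sketch of what such a gather kernel can look like — again my reconstruction with assumed names, not the slide code, and assuming a 2D grid of thread blocks covering the output slice: one thread per output grid point, accumulating into a register and writing the result exactly once, with no atomics:

    // Gather: one thread per output grid point; each thread reads many (read-only) atoms.
    // Accumulation happens in a register, so no atomicAdd is needed on the output.
    __global__ void cenergy_gather(float *energygrid, int nx, int ny, float spacing,
                                   const float4 *atoms, int numatoms, float gridz) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // grid x index
        int j = blockIdx.y * blockDim.y + threadIdx.y;    // grid y index
        if (i >= nx || j >= ny) return;
        float x = i * spacing, y = j * spacing, energy = 0.0f;
        for (int n = 0; n < numatoms; ++n) {              // every thread scans the atom list
            float4 a = atoms[n];                          // x, y, z, charge
            float dx = x - a.x, dy = y - a.y, dz = gridz - a.z;
            energy += a.w * rsqrtf(dx*dx + dy*dy + dz*dz);
        }
        energygrid[j*nx + i] = energy;                    // single write per output cell
    }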
But what we're doing here is that our outer loop is over the output cells: for each output cell, we then loop over the input atoms that affect it. So there'll be a thread for the output cell, which iterates over the input. Now, you've got to be worrying about thread divergence, of course, but okay: this is gather parallelization. The previous version was scatter parallelization, where each thread wrote to many scattered places; here each thread reads from many scattered places and gathers data instead of scattering data.

Okay, I'll skip through this a little; it's iterating over obvious things. So in this gather kernel we don't have an atomic add or increment operation to update the output cells. We don't have that atomic thing, which is very slow because it serializes, so right there, bang, we get a big performance win.

So one moral of this is that what's fast in sequential code is not always what's fast in parallel code, because of the differing resources on the two architectures. And cache effectiveness is important, often, because it improves data speed. But again, this is different on a CPU: on a CPU you read from memory and it automatically caches a whole line from memory, perhaps, and scattered reads might slow that down. If I go back to the previous slide here, this scatter... although I actually once wrote a C program trying to determine the effect of bad cache performance on a CPU, and I could not see an effect. I thought that if I was randomly reading from so much different data that it wouldn't all fit in the cache, I'd be thrashing the cache and would see performance drop off. I did not see it. So it was weird. I think Intel was smarter than me, and they were caching in a sophisticated enough fashion that they still ran my program efficiently.

Okay — oh, I just saw your question, Isaac: is the choice significant for performance? What sort of choice here — how you loop? Well, you can ask me again later. So, fast sequential code: it depends on temporary arrays, summing in, okay. Sorry I didn't see it earlier; my second laptop is a touch out of my direct line of sight, off to the side a little, so I'm focused on the primary laptop.

Okay: doing pre-computation stuff, tricks. The executive summary here is we've got tricks to use the threads more efficiently; I'll just leave it at that, sort of.
A thread might compute several potentials, and so on. Unrolling a loop means you do several iterations in one pass; in some cases it improves performance, and nowadays optimizing compilers do it without telling you. They're trying to reuse temporary results; the way they walk through the spatial points gives them more efficiency, and then finally, at the end, they do one update of the output (there's a small sketch of this idea below). Let me skip over that somewhat. Okay, so what this was: they didn't complete their example here, actually, but it was a start at showing you how to take this potential computation, which, by the way, is an important problem — there are papers on this; this potential calculation is the sort of problem that's worth spending time on.

Okay. So, now some stuff on thinking about parallel programs in general, not specifically NVIDIA; we've gone up to a higher level of thinking about parallel programming. Mostly motherhood stuff: decompose the problem into sub-problems and so on.

There are different types of scaling here. Strong scaling is when you can solve the same problem faster; weak scaling is when you can solve a bigger problem in the same time. And weak scaling is easier to do than strong scaling: if you've got a million cores, that doesn't mean you can solve a problem a million times faster, but it may mean you can solve a million-times-bigger problem.

The point about all of this — you'd like to always think of what your goal is, strategy, not just tactics — is that the scaling is tactics. The reason we spend time on this is to do better science, to solve problems that we couldn't solve before at all. Weather prediction: I would like to predict tomorrow's weather before tomorrow comes. So if it's faster, I can do something I couldn't do before, like predict tomorrow's weather fast enough that it's useful to me.

Okay, so now I'm talking about shared memory: OpenMP, and OpenCL, which I'll have a little blurb on later, and a variant, Apple's version of it. Now, message passing is a different thing that's been taught over in CS somewhat. One reason — okay, you're hiring me to give you opinions, I guess, to some extent — I believe that shared memory algorithms are something people do not think about; it's an important technique that's under-used.
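Going back for a moment to the thread-coarsening and unrolling trick from the potential-map slides: this is my sketch of the general idea (an assumed coarsening factor of 4, not the slides' code) — each thread computes several adjacent grid points in x, so the per-atom dy and dz terms and the charge are fetched once and reused, and each output cell is still written exactly once at the end:

    // Thread coarsening: one thread handles COARSEN adjacent x points.
    constexpr int COARSEN = 4;

    __global__ void cenergy_coarsened(float *energygrid, int nx, int ny, float spacing,
                                      const float4 *atoms, int numatoms, float gridz) {
        int i0 = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;  // first of my x points
        int j  =  blockIdx.y * blockDim.y + threadIdx.y;
        if (j >= ny || i0 >= nx) return;
        float y = j * spacing;
        float energy[COARSEN] = {0.0f};                              // accumulate in registers
        for (int n = 0; n < numatoms; ++n) {
            float4 a = atoms[n];
            float dy = y - a.y, dz = gridz - a.z;
            float dyz2 = dy*dy + dz*dz;                              // reused across the COARSEN points
            for (int k = 0; k < COARSEN; ++k) {
                float dx = (i0 + k) * spacing - a.x;
                energy[k] += a.w * rsqrtf(dx*dx + dyz2);
            }
        }
        for (int k = 0; k < COARSEN && i0 + k < nx; ++k)             // single store per output cell
            energygrid[j*nx + i0 + k] = energy[k];
    }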
I think if people want to do something fast, they immediately jump to something like MPI, or jump to the sophisticated external cloud-based things, Hadoop and cloud-based techniques. And I think that for many applications that is not necessary; you can work in shared memory. When you want to do something fast, the first thing is: don't use a GPU. On the host, do it on the multi-core Intel, using the parallel tools: OpenMP, OpenCL, some stuff I haven't shown you.

You know, like the laptop I'm lecturing from: this is a laptop with 128 gigabytes of memory. Why would I go to message passing? It's a very good laptop. And parallel is 256 gig — parallel has a quarter terabyte of memory, and when I bought that much memory on parallel it cost me about two and a half thousand dollars. You could buy a server for a few tens of K that, at this point, has a couple of terabytes of shared memory. That's a powerful concept, and you really ought to exploit it before you move on to message passing. Let me just show you, actually: this is my laptop; look at the memory here, 125 gig, because a little got used for something or other, and, okay, 12 basically — 12 threads running at once; that's the 6 cores with hyperthreading and so on.

Okay, so that's why I like shared memory. And if you do have to go to the GPU, then again: my laptop — this is a laptop — has 16 gigabytes of memory on the GPU. It's got the latency, blah, blah, blah, but the point is, it's available to all of the 3,000 CUDA cores, and this is a laptop. And if you go to parallel again, parallel's GPU has 48 gigabytes on it. So why go to message passing? Most problems will fit in that much data. This is my opinion, but it's a very strongly held opinion: shared memory is not being used enough.

Okay, in any case: data sharing. You've got to do it right, of course; especially read-only sharing is nice. And task groups: what's happening in the second paragraph here is that, effectively, you can be doing I/O with one stream and computation with another stream, and they overlap — good work efficiency. We saw a little of that when I showed you CUDA streams. And you've got to synchronize stuff, especially threads in a block, perhaps with barriers. In a barrier you're not getting work done, but there might be another stream that can do work then.
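Since overlapping transfer and compute with streams came up again, here is a bare-bones sketch of the pattern (mine, not from the slides): two streams, each doing async copy in, kernel, async copy out, so one stream's copies can overlap the other's compute. The kernel name, chunk layout, and pinned host buffer are assumptions:

    // Overlapping data movement and computation with two CUDA streams.
    // The host buffer must be pinned (cudaMallocHost) for the async copies to truly overlap.
    #include <cuda_runtime.h>

    __global__ void process(float *d, int n) {            // stand-in kernel: double each element
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    void run_chunks(float *h_data, int nchunks, int chunk) {  // h_data: pinned, nchunks*chunk floats
        cudaStream_t s[2];
        float *d[2];
        for (int k = 0; k < 2; ++k) {
            cudaStreamCreate(&s[k]);
            cudaMalloc(&d[k], chunk * sizeof(float));
        }
        for (int c = 0; c < nchunks; ++c) {
            int k = c % 2;                                 // alternate between the two streams
            float *h = h_data + (size_t)c * chunk;
            cudaMemcpyAsync(d[k], h, chunk * sizeof(float), cudaMemcpyHostToDevice, s[k]);
            process<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d[k], chunk);
            cudaMemcpyAsync(h, d[k], chunk * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();                           // wait for both streams to drain
        for (int k = 0; k < 2; ++k) { cudaStreamDestroy(s[k]); cudaFree(d[k]); }
    }

Within one stream the copy-kernel-copy sequence stays ordered, so reusing each stream's device buffer is safe; the overlap happens between the two streams.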
They're just talking here about different models; nothing deep. Single program, multiple data: this is NVIDIA's flavor of SIMD, single instruction, multiple data streams. It's "single program" because different threads can be at different instructions in the program, multiple data. In any case, different fork-joins up at the host level and so on; nothing really new here. So, single program means the threads are at different points in the same program; nothing really deep in the slide.

There are other ways you can parallelize. We saw this in OpenMP: if we go way, way back, I showed you a Fibonacci program where you recursively fired off a separate task that went into a queue to get run when resources were available. Loop parallelism: iterate, with the loop body run in parallel; you see this stuff at the host level also, in OpenMP. And fork-join: thread divergence is essentially a special case of fork-join, sort of.

Okay, this is a nice slide, I'd say: different ways you can parallelize. You can do tasks in parallel, you can iterate over the data in parallel, or data flow, which would be the same data but different stages in processing the data done in parallel, so it's a pipeline. That's a nice way to look at things. Of course, this can be less efficient, or misleading.

Nothing really deep on this slide, but one objection I have is that message passing has a horrible overhead. Really nothing interesting on this slide to my eye; the points are all obvious ideas, and so on.

Okay, I've got a slight beef with this slide here: if you make it fast enough, then you can do different stuff. There's a proverb that quantity has a quality all of its own — a military proverb, actually: if you have enough quantity, then the quality starts changing. And that's the point here. It just reminds me: a few decades ago, they started using computers to compute chemical synthesis paths, and chemists would start writing papers on "gee, I computed a synthesis pathway for this new molecule" and get a paper out of it. And other chemists would be sort of insulted: there was nothing interesting, they just took this tool, turned the crank again, got another paper out. And, yeah, but...
But there was a guy, a chemistry professor at Harvard when I was a grad student there, named E. J. Corey, and I knew one of his grad students; in fact, he was using computers for chemical synthesis, and I was working in this area, and he got some of those insults. And then some years later — ten years later or so — I'm reading the morning newspaper one morning in October, and there's an article with E. J. Corey on page 1: he'd just gotten the Nobel Prize in chemistry. So when he was doing it, people were saying, yeah, it's not interesting. It became interesting.

Okay. So, just talking about stuff here: doing parallel computing faster and so on, nothing interesting there; applications, why you'd want to do it, okay. A backup slide: protein aggregation, the folding process. Some degenerative diseases are caused by proteins folding up wrong: something gets in the way of the protein folding, prevents it folding in the normal, good way, so it folds in a bad way and can't do its normal function, and you get sick.

Okay, so they're trying to screen drugs against, say, some flu viruses and things; this is what I was telling you a few minutes ago. Electrostatic interactions play a role. A big thing is the water, because water is a polar molecule, which means it's got a plus area and a minus area — water is one of the most polar molecules, in fact. And so how the water molecules fit around the folded protein affects what happens. And some buzzwords here: boundary integral problem, fast multipole method; the fast multipole method is a way of computing what's going on here.

Okay, there's a Wikipedia article on the n-body problem. What's happening here — maybe it's so small you can't read it — is that computing these potentials is also called an n-body problem. There's a different use of the term n-body, computing orbits, let's say; this is an n-body problem for electrostatics. With a lot of bodies the naive way is quadratic time, so there are ways to speed it up; a multipole expansion is an approximation expansion. And there are lots of papers on this.

Okay, and so: a strategy for doing something faster in parallel. Again, you've got an application, you just want to use this course to do something faster. Well, the first thing you would do, as a domain scientist,
As a domain scientist, you see if there's a package, a library, that you can use. So if your 468 01:12:26.369 --> 01:12:37.109 application is doing fast Fourier transforms, then you would use NVIDIA's fast Fourier code, a parallel package, quote unquote, for example. 469 01:12:37.109 --> 01:12:40.529 You don't even have to know what a core is, that sort of thing. 470 01:12:40.529 --> 01:12:44.250 And MATLAB has 471 01:12:44.250 --> 01:12:57.989 parallel tools. Then the next level is you're writing C++ code with pragma fragments, like OpenACC, and also Thrust, which I'll show you later; it's a parallel version of the standard template library routines with some parallel algorithms, like 472 01:12:57.989 --> 01:13:05.850 scan. Okay, and then that's the first thing. The next thing is you rethink your algorithm. 473 01:13:06.989 --> 01:13:10.560 And then you start getting more fundamental about things. 474 01:13:11.579 --> 01:13:24.539 Nothing deep there; you should know that. So eventually you have to start thinking deeply about your algorithm, because a good parallel algorithm is sometimes not a good sequential algorithm. 475 01:13:24.895 --> 01:13:25.465 Okay, 476 01:13:25.465 --> 01:13:38.274 and then you have to start worrying about things at this level. And there are tools actually in between CUDA and OpenACC, from NVIDIA Labs or somewhere, for example; 477 01:13:38.515 --> 01:13:40.435 it's higher level than the raw CUDA, 478 01:13:40.465 --> 01:13:49.555 but it gives you more control over threads and warps and blocks than OpenACC; it's in between. 479 01:13:49.890 --> 01:14:00.359 So, what's the appropriate data structure? Not stuff involving pointers, not stuff involving trees, and not so much recursion. 480 01:14:00.359 --> 01:14:06.840 Okay, that was 481 01:14:06.840 --> 01:14:12.569 some more general philosophy on parallel thinking. 482 01:14:19.800 --> 01:14:25.619 Silence. 483 01:14:25.619 --> 01:14:30.989 This is a backup. 484 01:14:36.689 --> 01:14:41.310 Yeah, a little blurb on MPI here. 485 01:14:44.250 --> 01:14:49.109 I'm going to go through it fast, but. 486 01:14:51.210 --> 01:14:59.340 So, the point about MPI: you've got distributed nodes. They're more loosely coupled than on a GPU. 487 01:14:59.340 --> 01:15:05.729 So, they don't share any resources, but you can send messages back and forth 488 01:15:05.729 --> 01:15:08.850 and synchronize stuff. 489 01:15:08.850 --> 01:15:19.199 So the nodes have to be more stand-alone. You've got a nice API. Your model is, you're sending and receiving messages 490 01:15:19.199 --> 01:15:22.380 and waiting and so on. 491 01:15:22.380 --> 01:15:32.069 Okay, if you want to add two vectors, you'd send messages around, and then the sums, 492 01:15:32.069 --> 01:15:35.729 you know, a process would send a message back to you, and you wait for it. 493 01:15:35.729 --> 01:15:41.039 Okay. 494 01:15:41.039 --> 01:15:46.470 The obvious arguments here: 495 01:15:46.470 --> 01:15:51.569 sending, receiving, and so on. 496 01:15:51.569 --> 01:15:57.569 If you had to invent this, this is what you would invent it to look like; it's done nicely. 497 01:15:59.010 --> 01:16:03.840 And there's checking of error returns. 498 01:16:03.840 --> 01:16:16.079 It's nice to check for errors. And there's waiting, there are barriers; you've got to wait for results and so on. 499 01:16:16.079 --> 01:16:19.560 You can imagine a wait could be a blocking wait or a non-blocking wait. 500 01:16:19.560 --> 01:16:23.369 Okay, um.
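[Aside on that add-two-vectors-by-messages idea: a minimal MPI sketch in C++, not from the slides; the chunk size and message tags are made up, and it assumes at least two ranks. Rank 0 sends each worker a slice of the two vectors, each worker adds its slice and sends the result back, rank 0 blocks on the receives, and everyone meets at a barrier. The MPI_* calls are the standard MPI API.]

    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) { MPI_Finalize(); return 0; }       // needs at least one worker
        const int chunk = 4;                                // elements per worker (illustrative)
        if (rank == 0) {
            // Coordinator: send each worker its slices of a and b, then wait for the sums.
            std::vector<double> a((nprocs - 1) * chunk, 1.0), b((nprocs - 1) * chunk, 2.0);
            std::vector<double> c((nprocs - 1) * chunk);
            for (int w = 1; w < nprocs; ++w) {
                MPI_Send(&a[(w - 1) * chunk], chunk, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
                MPI_Send(&b[(w - 1) * chunk], chunk, MPI_DOUBLE, w, 1, MPI_COMM_WORLD);
            }
            for (int w = 1; w < nprocs; ++w)                // blocking wait on each worker's reply
                MPI_Recv(&c[(w - 1) * chunk], chunk, MPI_DOUBLE, w, 2, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            std::printf("c[0] = %g\n", c[0]);               // expect 3
        } else {
            // Worker: receive the two slices, add them, send the result back.
            std::vector<double> a(chunk), b(chunk), c(chunk);
            MPI_Recv(a.data(), chunk, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(b.data(), chunk, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int i = 0; i < chunk; ++i) c[i] = a[i] + b[i];
            MPI_Send(c.data(), chunk, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
        }
        MPI_Barrier(MPI_COMM_WORLD);                        // everyone waits here before finishing
        MPI_Finalize();
        return 0;
    }

Compile with mpicxx and run with something like mpirun -np 4 to try it.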
501 01:16:23.369 --> 01:16:28.619 The compute process waits for messages telling it what to compute, computes, and then. 502 01:16:29.819 --> 01:16:33.390 Right, types of barriers. 503 01:16:55.255 --> 01:17:02.635 Okay, this chapter 18 has got a funny numbering; the intro would be 1, but it's not called 1. 504 01:17:03.149 --> 01:17:08.430 And. 505 01:17:08.430 --> 01:17:12.600 Um, the stencil again is, like, 506 01:17:12.600 --> 01:17:19.260 in convolution terms, you know, the other neighboring cells that affect this cell. 507 01:17:21.270 --> 01:17:30.239 I'm going to go through it fast. You're going to tile; everything has to be bunched up into tiles. It doesn't matter whether it's MPI or CUDA. So. 508 01:17:30.239 --> 01:17:36.390 And we do the third one. 509 01:17:48.210 --> 01:18:02.039 And you could always combine MPI and CUDA: if you had several GPUs, you could use MPI to connect them together. Perhaps there are better ways to do it. But. 510 01:18:02.039 --> 01:18:09.479 NVIDIA is providing ways to connect different GPU cards together, if you have the IBM 511 01:18:09.479 --> 01:18:12.989 POWER workstation architecture. 512 01:18:12.989 --> 01:18:24.029 And this is how some of the fastest supercomputers work: you have a lot of the IBM workstations. Well, they're not separate workstations, obviously; they're, you know, trays 513 01:18:24.029 --> 01:18:27.840 in your rack, but the point is, that's the architecture. But then each one 514 01:18:27.840 --> 01:18:31.170 can take maybe 6 515 01:18:31.524 --> 01:18:44.125 NVIDIA GPU cards, and the 6 have a very fast bus called NVLink connecting them together. It's something we do not have access to on parallel. This does not exist for an Intel backbone; 516 01:18:44.515 --> 01:18:47.784 it only exists if your backbone is the IBM. 517 01:18:48.119 --> 01:18:54.720 But if you do that, then you can connect the 6 together with a very fast bus, 518 01:18:54.720 --> 01:19:01.140 fast enough that it's worthwhile having all of their memories in the same address space. So you've got 6 of these, 519 01:19:01.140 --> 01:19:05.279 48 gigabytes per card; you've now got 6 times 48 gigabytes. 520 01:19:05.279 --> 01:19:12.359 You've now got a quarter terabyte of common shared memory available to the GPUs without going back to the host. 521 01:19:13.380 --> 01:19:25.050 And so now you can have fun with that. There's going to be latency and such, but it's better than going through the host. So that's a tool that's available, and this is what is actually used on some of the 522 01:19:25.050 --> 01:19:33.720 fastest computers you'd see right at the top of the Top500 list. Okay. So nothing much interesting here; 523 01:19:33.720 --> 01:19:38.430 it's sort of all the same idea, I think. 524 01:19:41.515 --> 01:19:55.795 There are obviously messages to scatter and gather and reduce, but okay. That's a good point to stop. Let me just say what's coming up in the future: we've got a few more chapters in the book, 525 01:19:56.100 --> 01:20:01.800 some related programming things here. 526 01:20:01.800 --> 01:20:07.619 And then what I'll do is, I've got some other stuff I've typed in 527 01:20:07.619 --> 01:20:14.100 over the last few years that I taught the course; you can read it. I've got some more information on NVIDIA that I've typed in. 528 01:20:14.100 --> 01:20:23.819 And then we'll see a tool called Thrust, which is again this parallel 529 01:20:24.899 --> 01:20:28.439 extension to the STL, sponsored by NVIDIA. 530 01:20:28.439 --> 01:20:31.619 So, it's not just the STL; it also has routines that are useful only in parallel.
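[Aside: a minimal sketch of the Thrust style in CUDA C++, not one of their example programs; the numbers are made up. A device_vector lives in GPU memory, and inclusive_scan is the parallel prefix sum that comes up again just below.]

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <cstdio>

    int main() {
        thrust::device_vector<int> d(8, 1);                     // eight 1's, stored on the GPU
        thrust::inclusive_scan(d.begin(), d.end(), d.begin());  // parallel prefix sum, in place
        int last = d.back();                                    // copies one element back to the host
        std::printf("last prefix sum = %d\n", last);            // prints 8
        return 0;
    }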
531 01:20:31.619 --> 01:20:35.430 Routines like 532 01:20:35.430 --> 01:20:45.840 scatter and gather and scan, and you'll be amazed what you can do with scan. And we'll teach Thrust by examples; they have a bunch of example programs. 533 01:20:45.840 --> 01:20:50.520 So, what we saw today is more parallel stuff from the 534 01:20:50.520 --> 01:20:56.909 slides here, and then what I typed is my summary of some of the key features 535 01:20:56.909 --> 01:21:06.510 of NVIDIA. Now, coming up in the course: in a couple of weeks NVIDIA has a developer conference coming up soon. It's free. 536 01:21:06.510 --> 01:21:14.369 And so if you want to learn more, I would say do a free registration as an NVIDIA developer, and then, you know, 537 01:21:14.369 --> 01:21:22.949 watch the NVIDIA developer conference and see what fun things there might be. Let me just show you here for a minute or two. 538 01:21:34.529 --> 01:21:38.340 Contrarian. 539 01:21:38.340 --> 01:21:44.039 The developer program, and you can browse around here, so. 540 01:21:45.750 --> 01:21:54.060 And have fun too. So if I'm going too slowly for you and you want to learn more stuff, then 541 01:21:54.060 --> 01:22:02.220 you can look at this. And this is actually the GPU Technology Conference; some of that might cost money. 542 01:22:03.329 --> 01:22:06.510 But we want to get to the developer conference. 543 01:22:06.510 --> 01:22:18.810 And the resources and stuff, it's free, so. 544 01:22:21.329 --> 01:22:27.449 Yeah, so you might have fun going through some of that. Again, the, 545 01:22:27.449 --> 01:22:31.649 so, when I looked at this page earlier, it said free. Okay. 546 01:22:31.649 --> 01:22:38.039 You see, it averages out: they charge you a lot of money for some things, but this thing here is free, so it averages out. 547 01:22:38.039 --> 01:22:42.899 Okay, that's enough stuff for today. Both of us want lunch. 548 01:22:42.899 --> 01:22:46.380 I'll stay around for a minute or two in case there are questions. 549 01:22:46.380 --> 01:22:54.300 Other than that, see you Thursday. 550 01:23:02.069 --> 01:23:07.979 Yeah, it's not. Oh, okay.