WEBVTT

1
00:10:41.308 --> 00:10:45.058
It's more... more settings.

2
00:10:47.129 --> 00:10:50.698
Oh, okay, cool. Thank you.

3
00:10:50.698 --> 00:10:54.989
I hate computers, so.

4
00:10:58.974 --> 00:11:12.173
Terry, you're seeing the screen and I'm seeing the chat window. So, parallel computing — and we're recording — so, parallel computing, class 7.

5
00:11:12.869 --> 00:11:22.918
What we're going to do today is some random stuff: we're going to finish off OpenACC, and then get into NVIDIA and so on.

6
00:11:22.918 --> 00:11:27.328
First, because of popular request, I put a

7
00:11:27.328 --> 00:11:33.178
new item here, top of the menu bar, and it goes to the

8
00:11:34.678 --> 00:11:40.499
Mediasite, where you see the class lectures here now.

9
00:11:40.499 --> 00:11:51.509
If they're not readable, then tell me. I try to make them readable, but then some of them revert — okay, so they revert to not readable, and I don't know why.

10
00:11:51.509 --> 00:11:54.899
Some of them I've made readable twice.

11
00:11:54.899 --> 00:12:02.009
Okay, point 1 here. Um, you might also be wondering about the

12
00:12:02.573 --> 00:12:17.364
stuff on the machine, parallel — how I show it to you. I'll show some PDFs today and so on, and run some programs. I actually run it from my local laptop here: it's a git repository, and I've got a copy on my local laptop, and I moved this stuff over.

13
00:12:17.364 --> 00:12:17.634
So.

14
00:12:19.494 --> 00:12:24.293
Well, one other cool little bit of bookkeeping — I like showing you fun programming things.

15
00:12:25.764 --> 00:12:34.823
If you have a tarball — you know, a lot of files and directories tied up — or a zip file or something, and you want to look at files inside it:

16
00:12:35.124 --> 00:12:41.844
you could extract them all into a directory, but if they're compressed then they will get a lot bigger, and this also could be hundreds or thousands of —

17
00:12:42.479 --> 00:12:57.208
hundreds or thousands of files.
It's also a lot of inodes, and if you're running it inside git, that really starts clogging up git. So there's a cool program called archivemount. What it does: it creates a virtual file system.

18
00:12:57.208 --> 00:13:05.458
And in Linux, it's a command called archivemount, in a package called archivemount.

19
00:13:05.458 --> 00:13:20.308
So this is what I use if I'm just reading some files inside some big zip file or something. The other thing is, if I leave the zip file alone, I've got more confidence in its integrity. If I've got a hierarchy of directories and

20
00:13:20.783 --> 00:13:35.274
a thousand files, who knows if a few got deleted or something. For some formats you can even write into an archive virtual file system, and when you unmount it, it will write a new zip file or a new tarball or something. In any case, that's my cool programming

21
00:13:35.274 --> 00:13:36.474
tip for today.

22
00:13:38.308 --> 00:13:44.759
So, to do, to do... okay, OpenACC.

23
00:13:44.759 --> 00:13:52.438
Finish it off. And a good book I recommend to you, actually, is

24
00:13:52.438 --> 00:13:56.188
this one: OpenACC for Programmers.

25
00:13:56.188 --> 00:14:04.889
It came out a couple of years ago, so I bought it, and I would recommend it to you if you want more information and so on.

26
00:14:04.889 --> 00:14:08.489
And I've got the link here — here I've got the —

27
00:14:08.489 --> 00:14:15.538
we can see, Amazon and so on; not that expensive.

28
00:14:16.889 --> 00:14:22.708
Also, it has a git site here, and the GitHub site

29
00:14:22.708 --> 00:14:26.759
has some code and solutions and so on, and I'll show that to you.

30
00:14:28.619 --> 00:14:38.999
So, OpenACC for Programmers: chapter 4 is available online. We've looked at something relating to that. We'll try some programs in it.

31
00:14:38.999 --> 00:14:47.639
So, we may just —

32
00:14:47.639 --> 00:14:52.168
and I'll just get through some of this quickly, a couple of things.
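For reference, the archivemount workflow mentioned in the tip above looks roughly like this in a shell. The archive and file names here are made up for illustration, and archivemount needs FUSE installed, so this sketch falls back to a plain tar listing when it isn't available:

```shell
# Build a small sample tarball to demonstrate on (names are illustrative).
mkdir -p demo_dir
echo 'hello' > demo_dir/note.txt
tar czf demo.tar.gz demo_dir

# archivemount (a FUSE file system) mounts the archive in place, with no
# extraction; for writable formats, unmounting writes a new archive back.
mkdir -p mnt
if command -v archivemount >/dev/null 2>&1; then
  archivemount demo.tar.gz mnt   # mount the archive as a directory
  ls mnt/demo_dir                # browse files without extracting
  fusermount -u mnt              # unmount
else
  # Fallback when archivemount/FUSE is not installed: just list contents.
  tar tzf demo.tar.gz
fi
```

The original archive is never modified while you read it, which is the integrity point made above.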
33
00:14:52.168 --> 00:15:02.759
This is the Laplace thing — we saw it before, somewhat. We saw basically this code before: each point is the average of its 4 neighbors.

34
00:15:02.759 --> 00:15:08.308
It's given in C. I'm just speed-reading through this.

35
00:15:08.308 --> 00:15:17.099
The serial solver has the two steps we saw last time: first step, compute the new temperatures; second step, copy them back to the old temperatures.

36
00:15:17.933 --> 00:15:28.614
Okay, if I'm going through this too fast, just put a note up on chat and so on. Now, the book uses the compilers I've been showing you —

37
00:15:29.453 --> 00:15:40.793
well, the NVIDIA compiler is just the PGI compiler updated slightly, and the NVIDIA setup has those compilers. So you can actually try something like that.

38
00:15:41.729 --> 00:15:45.719
I can even show you, in fact, if you want.

39
00:15:45.719 --> 00:15:49.769
If I go to here —

40
00:15:49.769 --> 00:15:54.778
let's see, where are we... what do I want...

41
00:16:00.359 --> 00:16:04.619
Got a different —

42
00:16:08.759 --> 00:16:17.879
Okay, so what I meant to do is make it a little bigger for you, so you can see.

43
00:16:22.589 --> 00:16:29.908
Okay, if that's too small, then let me know. Okay, so we have the, the —

44
00:16:29.908 --> 00:16:33.899
Let's see.

45
00:16:33.899 --> 00:16:37.948
And —

46
00:16:37.948 --> 00:16:42.479
a different one here.

47
00:16:48.928 --> 00:16:52.678
I'm trying to find it in here.

48
00:16:57.599 --> 00:17:01.469
Yeah.

49
00:17:06.058 --> 00:17:13.919
Okay, it's got the code examples, free, for all the chapters online. So we can just do something.

50
00:17:13.919 --> 00:17:18.689
Silence.

51
00:17:21.929 --> 00:17:25.709
And that sort of thing.

52
00:17:32.098 --> 00:17:45.358
There, right? Okay. Compiling the code — the bad version, before it was optimized. So they say at the end here it'll be 21 seconds; let's see what happens here.
53
00:17:45.358 --> 00:17:53.578
19 seconds — so my laptop is insignificantly faster than the demo computer that the book author used.

54
00:17:53.578 --> 00:17:59.519
3372 iterations. Okay.

55
00:17:59.519 --> 00:18:04.858
So, it's showing parallelization, and it's showing it by putting in things like

56
00:18:04.858 --> 00:18:10.169
this, and so on, here. Now, there's one nice thing:

57
00:18:10.169 --> 00:18:20.219
we could compile it again. What this does — this is the flag to compile with OpenACC; this is a flag to get some

58
00:18:20.219 --> 00:18:25.078
information. Here, let me show you what sort of information you might get.

59
00:18:25.078 --> 00:18:30.868
Um.

60
00:18:30.868 --> 00:18:34.648
I never would even have put this in; I just heard about it. This is —

61
00:18:34.648 --> 00:18:46.259
it's sort of crazy. Yeah, so it's showing the loops here — are they parallelizable — "Generating Tesla code", that means NVIDIA code, and what it's inferring and so on. So this is useful for you.

62
00:18:46.259 --> 00:18:58.259
Okay, there's one more thing as I'm coming through here. This is a review... this is a cool thing right here: if you do that,

63
00:18:58.259 --> 00:19:02.638
then when you run the program, it prints a pile of useful information.

64
00:19:02.638 --> 00:19:05.909
And let's try that.

65
00:19:07.558 --> 00:19:11.038
There, and let's run the program.

66
00:19:12.298 --> 00:19:16.558
Silence.

67
00:19:38.669 --> 00:19:50.098
Okay, so up here we'd have to refer back to the source program — I could probably do that — but it shows the time it takes for various

68
00:19:51.538 --> 00:19:56.278
things, and we're seeing copying takes a fair bit of time, surely.

69
00:19:56.278 --> 00:20:01.588
Okay, but it does help you with some simple profiling, which can be useful.

70
00:20:02.818 --> 00:20:06.328
It talks about it here.

71
00:20:06.328 --> 00:20:14.098
And this is a point I made last time. Oh, I can bring up the source for the program.
It's doing a lot of pointless copying. So —

72
00:20:16.709 --> 00:20:20.638
Um.

73
00:20:23.249 --> 00:20:27.509
Inside here, we're doing too much

74
00:20:27.509 --> 00:20:35.278
pointless copying inside the iterations, and that's taking too much time.

75
00:20:35.278 --> 00:20:38.308
And that's what they're talking about here.

76
00:20:39.959 --> 00:20:45.659
Okay, now, what they get to is a way to optimize it.

77
00:20:45.659 --> 00:20:52.138
And again, I'll let you read through this on your own if you're interested, since we covered it, sort of, last time.

78
00:20:52.138 --> 00:20:57.749
But in any case, there's a version here called laplace final.

79
00:20:57.749 --> 00:21:01.618
And if we look at laplace final —

80
00:21:04.409 --> 00:21:13.318
what it's doing is, at the start here, it added a new line up at the top.

81
00:21:13.318 --> 00:21:18.298
It says, basically: do less copying of the data —

82
00:21:18.298 --> 00:21:24.388
giving the executive summary of this. And if we take this thing here —

83
00:21:24.388 --> 00:21:27.719
it cut down on the data transfers.

84
00:21:31.078 --> 00:21:36.659
Copyout, or something.

85
00:21:39.269 --> 00:21:46.858
It did the whole thing in about a second — remember, the previous time was like 30 seconds.

86
00:21:46.858 --> 00:21:54.328
So, it was very much faster, and these times here that were millions of microseconds are now hundreds of thousands of microseconds.

87
00:21:54.328 --> 00:22:00.929
So this was the case: you get the first version running in parallel, and then you optimize the thing.

88
00:22:00.929 --> 00:22:08.459
And again, you speed up by a non-trivial factor here.

89
00:22:13.769 --> 00:22:21.358
And it talks about the optimization things here. Now, in this directory, we also have an

90
00:22:21.358 --> 00:22:24.749
OpenMP version, and so we could also —

91
00:22:45.868 --> 00:22:50.969
Note: the optimized OpenACC took a second;

92
00:22:50.969 --> 00:22:58.108
the unoptimized one took 30 seconds. This one is taking
93
00:22:59.548 --> 00:23:12.209
like 20 seconds. So it's faster than the unoptimized OpenACC, but much slower than the optimized one. We could also run the serial one.

94
00:23:16.048 --> 00:23:20.219
Silence.

95
00:23:35.699 --> 00:23:38.939
I could also, for fun, be running htop

96
00:23:38.939 --> 00:23:45.868
and see what's happening: four threads, each using 100% of the CPU.

97
00:23:48.179 --> 00:23:59.578
And that was faster than the non-optimized OpenACC; it's at the same speed as the OpenMP, actually. So OpenMP did not help here. We'll also just try, for fun, a

98
00:24:01.588 --> 00:24:06.058
higher optimization setting.

99
00:24:19.019 --> 00:24:23.398
So, optimizing on the serial version made a difference.

100
00:24:23.398 --> 00:24:28.588
We could also try optimizing the OpenACC, for fun.

101
00:24:49.499 --> 00:24:52.919
I don't know... okay.

102
00:24:52.919 --> 00:24:57.358
Silence.

103
00:25:01.108 --> 00:25:05.038
Let's see what happens with this one.

104
00:25:05.038 --> 00:25:09.659
See, it didn't help — same speed.

105
00:25:09.659 --> 00:25:14.009
Okay, so —

106
00:25:15.419 --> 00:25:23.489
that was this book here. Any final questions on OpenACC?

107
00:25:25.199 --> 00:25:31.348
So, what I would like to do is, okay, now transition to the next

108
00:25:31.348 --> 00:25:35.999
bulk of the course. The first block of the course was portable tools:

109
00:25:36.233 --> 00:25:47.483
OpenMP, OpenACC. Now I want to get more directly into NVIDIA — picking NVIDIA as currently it's the most common GPU out there. In five years,

110
00:25:47.483 --> 00:25:57.534
if NVIDIA gets arrogant and overconfident, they may vanish. I've seen this happen, actually, with various computer companies that went from being some of the biggest companies in the business

111
00:25:58.229 --> 00:26:06.509
to, you know, merging away. So, in any case — so, NVIDIA has a lot of stuff online.

112
00:26:06.509 --> 00:26:09.929
There's this here, where you can request membership.
113
00:26:09.929 --> 00:26:16.888
I've done that. And what we have here online, if I go back —

114
00:26:16.888 --> 00:26:23.848
we have the GPU Teaching Kit here, accelerated computing, right here.

115
00:26:23.848 --> 00:26:28.588
And what I've done with the zip file is, in fact, I —

116
00:26:28.588 --> 00:26:39.989
well, one thing I did not do is chapter 4, that I was looking at. Okay. So I used archivemount on it, and in fact, if you do a —

117
00:26:39.989 --> 00:26:51.298
down at the end here, it's a FUSE file system — file system in user space, that's what FUSE is — archivemount, an archivemount file type. Okay.

118
00:26:54.898 --> 00:26:59.489
And we're just going to look at some of the slides.

119
00:27:02.669 --> 00:27:07.739
And we're going to speed-read through the slides. Some of them are fairly basic, but —

120
00:27:11.068 --> 00:27:15.959
don't ask me what's happening there.

121
00:27:15.959 --> 00:27:20.009
Okay.

122
00:27:20.009 --> 00:27:27.028
Much bigger.

123
00:27:27.028 --> 00:27:32.398
Okay, they're something from Illinois, but they are quite recent.

124
00:27:35.548 --> 00:27:39.989
And legally they're free — I'm actually legally using them.

125
00:27:39.989 --> 00:27:44.669
So, motherhood stuff.

126
00:27:47.213 --> 00:28:02.124
What we're going to see is what CUDA is, which I've alluded to before; we're going to see more detail. It's, you might say, the assembly-level language for GPU programming, and more about parallelism, the architecture.

127
00:28:02.368 --> 00:28:06.778
Talking about memory.

128
00:28:06.778 --> 00:28:11.608
And in this context, a kernel runs on the GPU, the device.

129
00:28:11.608 --> 00:28:17.848
Performance; atomic operations — we've seen that before.

130
00:28:17.848 --> 00:28:26.249
Now, modules 8, 9 are interesting — 9, 10, and 11. With any tool that you use —

131
00:28:26.249 --> 00:28:30.239
so, with any computing tools — there are certain paradigms.
132
00:28:30.683 --> 00:28:45.443
There are ways to do things efficiently, and they may not be obvious if you just look at the tool. And this will be important stuff to learn. These are techniques for writing parallel programs —

133
00:28:45.443 --> 00:28:52.253
there are techniques which have been shown to be actually useful and allow you to be productive in writing

134
00:28:52.528 --> 00:28:55.919
parallel programs — they're patterns.

135
00:28:55.919 --> 00:29:00.538
And these are these things here. Okay. Um —

136
00:29:00.538 --> 00:29:04.108
we'll see more of that, and talk about things.

137
00:29:04.108 --> 00:29:15.479
Okay, I may not do all of this. OpenCL is a competing thing to CUDA; it's more platform-independent, but it's not as mature.

138
00:29:15.479 --> 00:29:20.038
And it talks about OpenACC and so on. Okay, that was the first slide set.

139
00:29:20.038 --> 00:29:23.939
That was fast.

140
00:29:26.189 --> 00:29:29.249
This will be fast also.

141
00:29:29.249 --> 00:29:37.229
What is going on here...

142
00:29:37.229 --> 00:29:41.548
Oh, just a second — how do you —

143
00:29:41.548 --> 00:29:50.638
Okay. Um, okay, there is an important thing here,

144
00:29:50.638 --> 00:30:05.219
and that is that there are different types of computer architecture, and there is an essential way in which the GPU design is different from the CPU design: it's latency versus throughput.

145
00:30:05.219 --> 00:30:13.288
They talk about it here. So, the different types of architecture, of course — certain types of architecture do different things efficiently. There are some that do

146
00:30:13.288 --> 00:30:17.368
just signal processing very efficiently, for example.

147
00:30:18.568 --> 00:30:28.499
The CPU cores, or what are called latency cores in this context — they're designed to have low latency, whereas the GPU cores are designed to have high throughput.

148
00:30:29.608 --> 00:30:32.638
So, the —

149
00:30:33.749 --> 00:30:38.368
they have a very large local cache,
150
00:30:38.368 --> 00:30:45.659
and that's to hide the fact that having to pull something off of memory is very slow.

151
00:30:45.659 --> 00:30:57.148
And then a few registers and a lot of control unit — so, you know, pipelining and all that stuff; the control gets very big.

152
00:30:57.148 --> 00:31:06.808
But the effect is low latency: you can effectively grab data out of memory without noticing the delay — hyperthreading and so on. The GPU

153
00:31:06.808 --> 00:31:11.368
does not hide the latency so much: grabbing

154
00:31:11.368 --> 00:31:17.189
some data may take a long time. The cache is effectively smaller here;

155
00:31:17.189 --> 00:31:23.939
there are a lot more registers — we'll get to that later. But the thing is that they have a very high throughput,

156
00:31:23.939 --> 00:31:29.999
because they'll run many threads in parallel. So

157
00:31:29.999 --> 00:31:35.759
the GPU can do a lot of processing and can process more data,

158
00:31:35.759 --> 00:31:46.019
if your algorithm is organized right. The CPU is designed to have low latency, to do random reads, and they generally are fairly good at

159
00:31:46.019 --> 00:31:56.608
doing it efficiently, but you only have a few threads on the CPU. The GPU's got many threads — thousands of threads. The latency is high to start getting some data, but once you start getting some data,

160
00:31:56.608 --> 00:32:08.608
it comes fast. So the CPU: powerful ALUs, floating point and so on, large control unit, large cache.

161
00:32:08.608 --> 00:32:13.828
If you look at the design for an

162
00:32:13.828 --> 00:32:24.269
Intel Xeon, they can do a lot of things in one cycle — double-precision floating point. On the GPU, a double-precision float in one cycle —

163
00:32:24.269 --> 00:32:32.878
it may take several cycles, in fact, depending on which GPU you're using. So the CPU is designed for low latency; the GPU is designed for

164
00:32:32.878 --> 00:32:37.919
throughput. So —
165
00:32:37.919 --> 00:32:46.108
you've got a lot of threads running — hundreds of threads. The caches are much smaller. Another big difference they cite: simple control.

166
00:32:46.108 --> 00:32:49.469
The CPU's control does a lot — it does branch prediction,

167
00:32:49.469 --> 00:32:59.818
speculative execution, all that sort of powerful stuff. That is not in the GPU; it is designed to handle straight-line code,

168
00:32:59.818 --> 00:33:04.229
running the same code on a lot of threads in parallel. So:

169
00:33:04.229 --> 00:33:08.159
memory throughput, simple control.

170
00:33:08.159 --> 00:33:15.778
So, pipelined for high throughput, but not pipelined for speculative execution and so on.

171
00:33:15.778 --> 00:33:20.669
So, there's going to be a latency to get data, even from the global memory on the —

172
00:33:20.669 --> 00:33:25.439
on the GPU — not even talking about going back to the host. It may be 100 cycles.

173
00:33:25.439 --> 00:33:36.298
There's latency here, but that 100 cycles gets amortized over — there might be a thousand threads executing in parallel. So a 100-cycle

174
00:33:36.298 --> 00:33:40.019
latency — if you keep stuff running, it's tolerable.

175
00:33:44.513 --> 00:33:56.513
And this is the point I've mentioned: well, I was figuring that a host core was 20 times faster than a device core; they're saying 10 times faster. The point is, your —

176
00:33:57.088 --> 00:34:00.209
your Xeon is fast,

177
00:34:00.209 --> 00:34:05.699
but the GPUs do a lot of things in parallel. That's the difference.

178
00:34:05.699 --> 00:34:13.259
And it's got some books here on GPU computing you're welcome to look at.

179
00:34:13.259 --> 00:34:16.469
Next slide set in a few minutes.

180
00:34:16.469 --> 00:34:20.668
This is moving. Okay.

181
00:34:28.168 --> 00:34:35.068
No, this is okay.

182
00:34:41.608 --> 00:34:49.409
Where CUDA fits into this: we're accelerating — some motherhood slides. How do you accelerate an application?
183
00:34:49.409 --> 00:34:53.969
You call libraries, you add directives to the program, or you use a

184
00:34:53.969 --> 00:35:03.958
special-purpose language. Nothing complicated there. Nothing complicated there. Okay, this starts having some content, actually.

185
00:35:03.958 --> 00:35:09.028
There, NVIDIA — sort of, they're coming and going with tools.

186
00:35:11.338 --> 00:35:22.409
Libraries: they have a lot of libraries — libraries for fast transforms, for numerics; BLAS, basically linear algebra;

187
00:35:22.409 --> 00:35:28.528
also sparse matrices, all that sort of thing. So they provide a lot of linear algebra tools for you,

188
00:35:28.528 --> 00:35:31.619
and some big things on —

189
00:35:31.619 --> 00:35:35.429
math libraries, all that sort of thing.

190
00:35:35.429 --> 00:35:38.969
Thrust is something we'll look at later. It's a —

191
00:35:38.969 --> 00:35:46.108
it's a GPU analog to the Standard Template Library, actually, with parallel constructs; it has sorts and scans and so on.

192
00:35:46.108 --> 00:35:49.708
Libraries for media stuff,

193
00:35:49.708 --> 00:36:01.199
image processing stuff. So there are a lot of accelerated libraries, and in fact, if you just have an application, you may be better off just picking up a good library and not being down at the low level.

194
00:36:01.199 --> 00:36:12.869
Just to show you — this code's a little confusing, but I'll talk about it since they have it — what Thrust is.

195
00:36:12.869 --> 00:36:17.759
So, it's library extensions to C++; there are no language extensions at all.

196
00:36:17.759 --> 00:36:21.989
Um, again, it's like — and

197
00:36:21.989 --> 00:36:27.028
you have the — it's functional programming, functional programming. There's a copy,

198
00:36:27.028 --> 00:36:34.259
and it copies some vector for you. Oh, let's go up to the top here. These are ways to construct data

199
00:36:35.608 --> 00:36:41.398
on the host or device — and of course, this is obsolete now; you would do a managed array,
200
00:36:41.398 --> 00:36:46.498
and then you don't have to worry about host and device — of course, you let the system worry about it. But this shows you.

201
00:36:47.998 --> 00:36:54.809
Let's see — a device vector. The only interesting data type here is a vector. So this will be on the device:

202
00:36:54.809 --> 00:37:00.659
it's a vector of floats, and you give — standard thing — the name and the size and so on.

203
00:37:00.659 --> 00:37:09.929
So, there are copies... copies. The interesting thing here is, this is a functional-programming thing here, and what this does

204
00:37:09.929 --> 00:37:13.559
is it takes — it takes two input

205
00:37:13.559 --> 00:37:17.849
vectors, device input 1 and device input 2,

206
00:37:17.849 --> 00:37:23.128
and an output vector, device output, and it adds them element by element.

207
00:37:23.128 --> 00:37:30.958
The last — the last argument to the transform is a plus.

208
00:37:30.958 --> 00:37:34.829
You know, this is

209
00:37:34.829 --> 00:37:38.699
templates and so on, so it's a plus on floats.

210
00:37:38.699 --> 00:37:44.068
And so what transform does is, it applies this function

211
00:37:44.068 --> 00:37:48.329
to, um —

212
00:37:48.329 --> 00:38:02.458
to these two input vectors, and you can just imagine how that could be compiled for parallelism. Now, what Thrust does is, when you compile it, you give it directives that say what you want the target architecture to be —

213
00:38:02.458 --> 00:38:08.188
the host or the device, and so on — and to a very large extent,

214
00:38:08.188 --> 00:38:12.630
the source code does not have to change at all. That's not completely true, but it's

215
00:38:12.630 --> 00:38:17.309
to a large extent true. We'll get to it later.

216
00:38:17.309 --> 00:38:23.130
This here is not a function call. It is a, um —

217
00:38:23.130 --> 00:38:26.219
plus-of-float is a class;

218
00:38:26.219 --> 00:38:30.929
it's a template, and the class —
219
00:38:31.949 --> 00:38:37.710
And what the parens here — they're constructing it; it's calling the default constructor

220
00:38:37.710 --> 00:38:46.440
on this class, plus of float, and the class overloads — not so coincidentally overloads —

221
00:38:46.440 --> 00:38:50.489
the paren operator. And so what this returns is

222
00:38:50.489 --> 00:38:58.199
an object whose operator does addition, but in an indirect way: by creating a default-constructed object which happens to have an overloaded

223
00:38:58.199 --> 00:39:03.719
paren. Confusing. And motherhood stuff here.

224
00:39:03.719 --> 00:39:11.460
OpenACC — you've seen that before. What's happening here is, we're being explicit about what gets copied in and out

225
00:39:11.460 --> 00:39:17.489
of the kernel, and we give the name of the array,

226
00:39:17.489 --> 00:39:23.849
and which part of the array to copy in and out; we're assuming that the compiler maybe cannot figure it out.

227
00:39:25.500 --> 00:39:31.590
So the previous slide showed using OpenACC; this is not using that stuff, just using simple C++.

228
00:39:33.119 --> 00:39:36.929
Nothing new here.

229
00:39:36.929 --> 00:39:44.909
Nothing new here — parallel stuff. Obviously, all of your major packages can use parallel computing:

230
00:39:44.909 --> 00:39:51.389
CUDA with Python, yes, and so on. Nothing interesting there.

231
00:39:51.389 --> 00:39:56.159
That was our third slide set of the day.

232
00:39:57.480 --> 00:40:01.289
Fourth.

233
00:40:04.500 --> 00:40:08.010
Okay, um —

234
00:40:08.010 --> 00:40:20.909
what this is showing is data parallelism — how the GPUs tend to operate. You've got two vectors of data, you want to add them element by element. So each addition is done by a separate thread,

235
00:40:20.909 --> 00:40:28.110
ideally. So these are very lightweight threads. All the thread is doing is adding two floats and producing another float.
236
00:40:28.110 --> 00:40:35.550
And the only reason this can possibly be efficient is that the overhead to start and stop a thread is negligible —

237
00:40:35.550 --> 00:40:39.300
and because we're starting maybe a thousand of them or something.

238
00:40:39.300 --> 00:40:45.030
So that's implicit in this diagram here: the threads are very lightweight.

239
00:40:45.030 --> 00:40:48.239
Um —

240
00:40:50.309 --> 00:40:58.079
okay, what would be happening here?

241
00:40:59.190 --> 00:41:08.969
Here we're transitioning to see how it would be done in CUDA. We've got a main program, which adds two vectors and produces a third vector, and n is the number of

242
00:41:08.969 --> 00:41:16.860
words. This is the function — again, I use "function" and "routine" synonymously. In the function, or routine,

243
00:41:16.860 --> 00:41:20.250
we just have a loop which adds things element by element.

244
00:41:20.250 --> 00:41:28.199
There's a convention: h underscore means the data is on the host — that's the Intel. d underscore means it's on the device.

245
00:41:31.530 --> 00:41:43.619
Okay, so this is getting to the next level of detail here, about how we would do this addition thing, starting to use the GPU — that's the device.

246
00:41:43.619 --> 00:41:50.039
A comment that indicates this. So:

247
00:41:50.039 --> 00:41:58.469
we allocate memory. Okay, the data is on our host; we have to allocate memory on the device,

248
00:41:58.469 --> 00:42:06.630
and then we have to copy the data from the host to the device. Again, with managed memory these things are automatic, but

249
00:42:06.630 --> 00:42:13.949
before managed memory, you allocate a vector on the host, you allocate it on the device, and you copy data back and forth.

250
00:42:13.949 --> 00:42:21.570
And then we launch the kernel. Now, terminology: the kernel is a parallel program running on the GPU.

251
00:42:21.570 --> 00:42:26.460
So, we launch the kernel — we launch a parallel program on the GPU that does the work.
252
00:42:26.460 --> 00:42:29.940
And then finally, we copy the data back

253
00:42:29.940 --> 00:42:37.949
to the host, and if we care, we free the device vectors. I would never care — at the end of the program they're going to be freed anyway.

254
00:42:37.949 --> 00:42:51.329
This is another step: we may have to wait for the kernel to finish, because it's asynchronous. The CPU can start doing something — the CPU can do something else while it's waiting for the GPU. It doesn't have to wait; it does something else and checks if it's finished.

255
00:42:52.949 --> 00:42:59.820
Okay, um, so there was some new stuff there. There is a lot of new stuff on this

256
00:42:59.820 --> 00:43:09.840
simple-looking slide. What do we have here? This is a high-level architecture description of how the GPU works.

257
00:43:09.840 --> 00:43:18.420
I'll do the simple thing first: you've got global memory. Okay.

258
00:43:18.420 --> 00:43:22.139
On parallel, the —

259
00:43:22.139 --> 00:43:30.420
the GPU I could buy at the time, which was about a year ago — it has like 48 gigabytes of global memory, if I recall.

260
00:43:30.420 --> 00:43:33.900
My laptop's got 12 gigabytes.

261
00:43:33.900 --> 00:43:37.739
My laptop, I thought: hey, it's practically as fast as parallel.

262
00:43:38.909 --> 00:43:41.940
It's also practically as expensive — so, that matches.

263
00:43:41.940 --> 00:43:50.400
Okay, you've got some global memory. Global memory: it's big by GPU terms, it's fast, but it has latency.

264
00:43:51.599 --> 00:44:00.539
Now, inside it, you've got threads — you've got what are called CUDA cores, and you may have a few thousand of them, like 7000.

265
00:44:00.539 --> 00:44:08.400
And they're running threads. A thread is, like, a unit of execution: it's data and a program counter and so on.

266
00:44:08.400 --> 00:44:13.469
And we just index the threads 0, 1, and so on.

267
00:44:13.469 --> 00:44:19.920
And again, threads have some private registers — only 255 per thread.
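A minimal sketch of the whole host-side flow described above — allocate on host and device, copy in, launch the kernel, wait, copy back, free. This is my sketch, assuming the standard CUDA runtime API; the `h_`/`d_` names follow the slide's convention, and the block size of 256 is an arbitrary illustrative choice:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// The kernel: each thread does one addition.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's index
  if (i < n) c[i] = a[i] + b[i];                  // bounds guard
}

int main() {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);

  // h_ = host (the Intel side); d_ = device (the GPU).
  float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
        *h_c = (float *)malloc(bytes);
  for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

  float *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

  // Launch: enough 256-thread blocks to cover n elements.
  vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
  cudaDeviceSynchronize();   // launches are asynchronous; wait for it

  cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
  printf("h_c[0] = %f\n", h_c[0]);

  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);  // optional at program end
  free(h_a); free(h_b); free(h_c);
  return 0;
}
```

With managed (unified) memory, the explicit `cudaMalloc`/`cudaMemcpy` pairs collapse into `cudaMallocManaged` calls, which is the "obsolete now" point made in the lecture.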
268
00:44:19.920 --> 00:44:33.599
And the threads — this is not in this slide, but the threads are in warps, and the 32 threads in a warp run synchronously: they're executing the same instruction, or they're idle.

269
00:44:33.599 --> 00:44:37.440
If they're executing the same instruction,

270
00:44:37.440 --> 00:44:43.079
they can be on different data — but ideally, the data is consecutive:

271
00:44:43.079 --> 00:44:51.809
thread 1's data would be right after thread 0's data, and so on. So you've got these threads — there could be thousands of them —

272
00:44:51.809 --> 00:44:55.260
and each thread has a small bank of registers.

273
00:44:55.260 --> 00:45:00.539
The thread can also get at the global memory. Now, the threads are grouped into blocks.

274
00:45:01.889 --> 00:45:07.199
So, a block can have up to a thousand threads — 32 warps of 32 threads, so 1024.

275
00:45:07.199 --> 00:45:11.309
It doesn't have to have a thousand — it can be up to a thousand.

276
00:45:11.309 --> 00:45:15.480
So that's the yellow thing here; there may be a thousand threads in the block.

277
00:45:17.099 --> 00:45:30.630
They're running the same instructions, but the different warps don't have to be running at the same time. The threads in a warp run at the same time; the warps in the block actually could be scheduled differently. They're doing the same instructions,

278
00:45:30.630 --> 00:45:33.869
but maybe not at the same time.

279
00:45:33.869 --> 00:45:44.219
And inside a block, there's also some shared memory that's private to the block — that's available to all the threads of the block.

280
00:45:45.204 --> 00:45:59.215
So, the warps in the block — they're running the same instructions, but maybe not at the same time. What's going on is, there's a queue of warps waiting to run, and when resources are available, then the —

281
00:46:00.000 --> 00:46:05.099
imagine a mini operating system — it pulls the next warp off the queue in the block and runs it.
282
00:46:05.099 --> 00:46:12.719
So, you've got a block: it could have up to 32 warps, and each warp is 32 threads. You can have many blocks — you might have hundreds of blocks in the program.

283
00:46:12.719 --> 00:46:18.119
And the blocks, again, they're running the same program, and they're running the same instructions —

284
00:46:18.119 --> 00:46:23.760
but the different blocks do not communicate with each other at all. So there's a queue of blocks waiting to run,

285
00:46:23.760 --> 00:46:27.269
and the only

286
00:46:27.269 --> 00:46:33.599
shared data is the global memory. So they're running at different times; there are no fairness guarantees on the different blocks.

287
00:46:33.599 --> 00:46:36.989
And so they're basically off on their own. You can sync —

288
00:46:36.989 --> 00:46:44.309
they're running the same instructions, but at different times, and you can do synchronization; you can force things to wait, and so on.

289
00:46:44.309 --> 00:46:53.639
It's not so bad to force the warps in a block to wait until they're all completed. Forcing all the blocks to wait is probably a bad idea.

290
00:46:53.639 --> 00:47:04.289
Okay, so you've got 32 threads in a warp, up to 32 warps in a block, up to a thousand threads in a block, and you can have

291
00:47:04.289 --> 00:47:17.039
effectively unlimited blocks — hundreds or thousands. See, everything here is really lightweight. It's unlike a higher-level operating system, where starting a process takes time and so on; everything here

292
00:47:17.039 --> 00:47:20.070
is really cheap and simple to start up.

293
00:47:20.070 --> 00:47:30.780
That's the point of it. Okay, so all the blocks then form a grid, and the grid is also called the kernel. So the grid is a parallel program running

294
00:47:30.780 --> 00:47:35.579
on the — your —

295
00:47:35.579 --> 00:47:40.110
let's see, am I still aimed here? Good. Your

296
00:47:40.110 --> 00:47:44.219
GPU. So it has the grid — the kernel, the parallel program.
297 00:47:44.219 --> 00:47:51.809 The GPU is the device, and there could be a number of kernels on the GPU. A couple of them could be running at the same time. I don't know how many. 298 00:47:51.809 --> 00:47:55.920 And again, there's a queue of stuff waiting to run. 299 00:47:57.030 --> 00:48:00.929 Okay, so this is a very substantive slide about 300 00:48:02.130 --> 00:48:13.500 how the GPU works inside, and it's designed like this in order to get high performance when you're doing a lot of data parallelism. So. 301 00:48:13.500 --> 00:48:16.650 And they talk about here, you've got, 302 00:48:16.650 --> 00:48:23.309 there are registers per thread, there is global memory, and in between here there's 2 other things, shared memory and 303 00:48:23.309 --> 00:48:31.500 local memory. It's like 5 levels of memory or something, and I'm not even thinking of things like read-only memory and so on. 304 00:48:32.635 --> 00:48:45.684 Basically, the smaller memories are more local and faster. There's also all these resource constraints on the system. I said a thread could have up to 255 registers. It doesn't have to have that many, it could have fewer. 305 00:48:47.460 --> 00:48:51.869 But the thing is, all the registers in a block, the, 306 00:48:51.869 --> 00:48:56.099 there is a pool of registers for the block. It's 307 00:48:56.099 --> 00:49:09.420 65000 or something, and all the threads in the block are getting their registers from that 1 pool. So if each thread wants 255 registers, you're not going to be running a 1000 308 00:49:09.420 --> 00:49:16.980 threads at once, there's not enough registers. It's going to be running 65000 divided by 255 threads at a time. 309 00:49:16.980 --> 00:49:22.650 So, sometimes if a thread uses fewer registers, 310 00:49:22.650 --> 00:49:26.940 you might get higher throughput, because it means you can run more threads at once. 311 00:49:26.940 --> 00:49:33.599 Okay, it starts talking about here the malloc and free.
312 00:49:33.599 --> 00:49:39.869 They go into the global memory. Again, with managed unified memory you don't need to do that, but, 313 00:49:41.190 --> 00:49:47.039 like malloc and free. Free, I don't see the point of free: the program ends, stuff gets freed. 314 00:49:47.039 --> 00:49:50.760 So, unless you're mallocing and freeing repeatedly. 315 00:49:50.760 --> 00:49:56.760 And I do not know how the global memory does garbage collection and stuff like that. I suspect it might not. 316 00:49:56.760 --> 00:50:03.000 And it certainly does not compact. So I'd say you don't want to get overenthusiastic with mallocs and frees. 317 00:50:03.000 --> 00:50:08.369 Copying stuff back and forth, again, that's obsolete now. 318 00:50:08.369 --> 00:50:20.429 Oh, 1 cool thing: it's asynchronous. You fire off a copy, the memcpy immediately returns to you while the copy is still going on. If you're copying a few gigabytes, this might take some time. 319 00:50:21.480 --> 00:50:26.519 So, if you're worried about that, you can check that it completed, but 320 00:50:26.519 --> 00:50:29.519 it can return to you and you do something else. 321 00:50:29.519 --> 00:50:33.659 And let me check. Okay. 322 00:50:33.659 --> 00:50:38.250 A program, again, nothing new here. 323 00:50:38.250 --> 00:50:42.480 Mallocs and mem copies back and forth. 324 00:50:42.480 --> 00:50:49.920 Or use managed memory. Okay, good idea here: 325 00:50:49.920 --> 00:51:04.650 error checking. I know no 1 ever does it in practice. Commercial software does not do it in practice, complain, complain, but you're not writing commercial software. You're writing good quality code. So I encourage you to check for errors. 326 00:51:04.650 --> 00:51:09.630 These things return error codes sometimes; you can check. 327 00:51:09.630 --> 00:51:16.230 And they return some sort of number, 328 00:51:16.230 --> 00:51:20.639 and you can call this, it converts from the number to a string, even. 329 00:51:20.639 --> 00:51:30.449 And look at these cool things here.
These are macros in C++, they're in the standard. This first one returns the name of the file 330 00:51:31.530 --> 00:51:35.639 that the source code was in, and this returns the line number 331 00:51:35.639 --> 00:51:44.820 that this line was on. That's very useful. I like this stuff, very nice. The only problem with the line number is that, 332 00:51:44.820 --> 00:51:52.409 if you have a macro, then this is the line number of the thing in the macro, not who called the macro. 333 00:51:54.150 --> 00:51:57.300 Questions. 334 00:52:08.039 --> 00:52:11.550 Can you have 3. 335 00:52:11.550 --> 00:52:16.380 Okay. 336 00:52:17.400 --> 00:52:23.159 Threads are hierarchical. I mentioned that before: threads are grouped into warps, into blocks, and into grids. 337 00:52:23.159 --> 00:52:36.360 And threads have ID numbers. So you're firing off a 1000 threads in a thread block, also called a block. Each thread knows which thread it is. 338 00:52:36.360 --> 00:52:40.409 Like OpenMP and so on, you can tell. 339 00:52:40.409 --> 00:52:46.559 You can get the number here. All this information is available to the user, which thread you are. 340 00:52:46.559 --> 00:52:54.030 Okay, here we're adding 2 vectors, and each 341 00:52:54.030 --> 00:53:02.280 pair of elements will be a separate thread, might be a 1000 threads. And each thread is very lightweight. 342 00:53:02.280 --> 00:53:17.190 This is a design style that they use in CUDA. So this is 343 00:53:17.190 --> 00:53:21.269 an execution of a thread or a process or something. 344 00:53:21.269 --> 00:53:25.440 So, you're running something on the host, and we're assuming single threaded on the host. 345 00:53:25.440 --> 00:53:33.360 Keep life easy. Then we fire off a parallel kernel on the device, a parallel kernel, a parallel program on the device. 346 00:53:33.360 --> 00:53:42.929 And it may have many separate blocks. A block and a thread block are the same thing, thread block is just more explicit.
So they're running many blocks. And each block has many threads. 347 00:53:42.929 --> 00:53:49.949 Then you got a serial component again, and then you've got a parallel component again, and this is how your programs work, somewhat. 348 00:53:51.000 --> 00:53:56.639 Now, while you got the serial parts here, you could be running another parallel program. Of course, you could overlap stuff. 349 00:53:56.639 --> 00:54:02.159 If you want to get ahead of me on that, you'd look into CUDA streams. 350 00:54:02.605 --> 00:54:15.114 Now, this would be 1 CUDA stream. These things are serialized: you do serial, then parallel, then third is another serial block, and fourth another parallel block. This whole thing is called 1 CUDA stream. 351 00:54:15.414 --> 00:54:17.815 It could be in parallel with another CUDA stream, 352 00:54:18.119 --> 00:54:23.849 which would do a serial thing, then do a parallel thing, et cetera, et cetera. Okay. 353 00:54:23.849 --> 00:54:27.000 In any case, new terminology here. 354 00:54:28.019 --> 00:54:31.860 We're going to get to this, so, the term. 355 00:54:31.860 --> 00:54:35.130 There's a syntax extension to 356 00:54:35.130 --> 00:54:39.869 C++. This here 357 00:54:39.869 --> 00:54:43.440 will give the name of the routine we're calling on the device, 358 00:54:43.440 --> 00:54:47.699 with a syntax extension, triple angle brackets, 359 00:54:47.699 --> 00:54:53.699 and we'll tell it how many threads per block and how many blocks, and we give it some arguments to pass in. 360 00:54:56.610 --> 00:55:02.309 Hierarchy here. Nothing interesting. 361 00:55:04.469 --> 00:55:08.730 Silence. 362 00:55:10.139 --> 00:55:13.230 Nothing interesting here, a program: 363 00:55:13.230 --> 00:55:16.530 instructions, data it reads and writes, it executes that. 364 00:55:16.530 --> 00:55:20.670 Yeah, I think everyone sees this here. 365 00:55:20.670 --> 00:55:25.860 Von Neumann style here, CUDA does that.
366 00:55:25.860 --> 00:55:38.400 Program counter pointing to the next instruction to execute, instructions are a series, the current instruction, local data, registers, the inputs and outputs. 367 00:55:38.400 --> 00:55:42.000 A real machine has lots of each of these things and so on. 368 00:55:43.050 --> 00:55:50.369 A CUDA kernel, again, is a grid of threads, an array of threads. 369 00:55:50.369 --> 00:55:54.329 It's single program, multiple data, or whatever. 370 00:55:57.239 --> 00:56:02.369 Okay, what's happening here? This 371 00:56:02.369 --> 00:56:08.940 is what the syntax would look like. Each thread does this for a different value of i. Okay, 372 00:56:08.940 --> 00:56:14.429 a[i] plus b[i], and in parallel, for maybe a 1000 threads. 373 00:56:14.429 --> 00:56:17.639 Well, how does the thread compute i? 374 00:56:17.639 --> 00:56:23.010 It would use an instruction like up here. Now, what's happening here 375 00:56:23.010 --> 00:56:32.579 is threadIdx.x. Ignore the .x for the moment. threadIdx is the index of the thread in the block. 376 00:56:33.750 --> 00:56:37.590 And blockDim is the 377 00:56:37.590 --> 00:56:40.980 number of threads in a block. 378 00:56:40.980 --> 00:56:46.829 And blockIdx is the index of the block. Each block might have a 1000 threads, maybe. 379 00:56:46.829 --> 00:56:53.460 So, this line here, it computes a unique i for each thread. So the index of the thread in the block 380 00:56:53.460 --> 00:56:57.480 plus the index of the block times the number of threads per block. 381 00:56:57.480 --> 00:57:02.639 And each thread gets a unique element, a subscript i, and does the addition. 382 00:57:02.639 --> 00:57:10.920 So this is showing the threads are doing the same instruction, but doing the same instruction on different data, because each thread has a different thread index, and 383 00:57:10.920 --> 00:57:14.670 different blocks have different block indices.
384 00:57:17.190 --> 00:57:25.980 Okay, so this is the hierarchy I told you about, where threads are in blocks. 385 00:57:26.485 --> 00:57:41.125 And then you got multiple blocks. We're ignoring warps here. So here they're showing a thread block, the block has 256 threads. I said it could have up to a 1000, but also that it doesn't have to have a 1000, it could have fewer. So here the blocks are 256 threads each, and we're seeing 3 blocks. 386 00:57:44.400 --> 00:57:48.389 And some point about here: inside a block 387 00:57:48.389 --> 00:57:52.349 we've got shared memory. 388 00:57:52.349 --> 00:57:57.690 It's small, it's like 48 K or something, 389 00:57:57.690 --> 00:58:02.519 shared by all the threads in a block, but it's very fast memory. 390 00:58:03.840 --> 00:58:11.730 And there's atomic operations, so if the threads are accessing the shared memory, 391 00:58:11.730 --> 00:58:17.039 doing an increment, they can do it as an atomic operation, so 392 00:58:17.039 --> 00:58:23.369 that's done correctly. You can synchronize all the threads in the block if you have to. 393 00:58:25.050 --> 00:58:28.800 And the different blocks are independent. 394 00:58:28.800 --> 00:58:33.360 The only way they interact is reading and writing global memory, which would be 395 00:58:33.360 --> 00:58:39.840 very slow and probably very stupid. Okay. 396 00:58:39.840 --> 00:58:46.559 So, 2 levels: threads in the block, and multiple blocks. And this is how each thread knows which 397 00:58:46.559 --> 00:58:51.179 element to access. 398 00:58:51.179 --> 00:58:55.889 Now, why you might not want a 1000 threads in your block 399 00:58:55.889 --> 00:58:59.190 is that the shared memory, for example, and the registers 400 00:58:59.190 --> 00:59:02.820 are shared among fewer threads. Each thread gets more. 401 00:59:02.820 --> 00:59:07.710 Block and thread index, so: 402 00:59:07.710 --> 00:59:11.400 the threads in the block are indexed. 403 00:59:11.400 --> 00:59:19.110 It could be up to 3D.
This is, I think, syntactic sugar for when you're accessing an image or something. 404 00:59:20.670 --> 00:59:24.090 I don't know what hardware support you have for this, really. 405 00:59:25.739 --> 00:59:28.889 I just think of them as 1D. In any case, you got this 406 00:59:28.889 --> 00:59:32.579 block of threads, and then the grid has 407 00:59:32.579 --> 00:59:35.610 arrays of blocks, and they're indexed, so. 408 00:59:35.610 --> 00:59:40.320 Again, so this multi-dimensional index is only for, 409 00:59:40.320 --> 00:59:43.469 it's syntactic sugar for multi-dimensional data. 410 00:59:43.469 --> 00:59:47.219 In C++, what I do is I write little 411 00:59:47.219 --> 00:59:54.420 classes, and I've got little conversion routines, implicit conversion routines, that will convert back and forth 412 00:59:54.420 --> 00:59:59.639 between 1D and 3D. That's my personal programming style for this sort of stuff. 413 00:59:59.639 --> 01:00:02.940 So. 414 01:00:06.179 --> 01:00:12.630 We're going through this fast. 415 01:00:12.630 --> 01:00:17.909 1 more and there'll be time to leave: Introduction to CUDA. 416 01:00:17.909 --> 01:00:24.840 This is a long 1, so I'll start it, then I'll restart it on Monday. 417 01:00:26.039 --> 01:00:29.820 Okay. 418 01:00:36.719 --> 01:00:40.380 Yeah, so this will show basic, 419 01:00:41.610 --> 01:00:45.510 I mean, I've got something bigger here. I'm going ahead here. 420 01:00:45.510 --> 01:00:49.139 Silence. 421 01:00:50.400 --> 01:00:53.400 Well, I'll show you more detail on Monday. 422 01:00:53.400 --> 01:00:57.030 What we have up here is a really basic 423 01:00:57.030 --> 01:01:03.840 CUDA program. This thing runs on the GPU. It doesn't do anything. 424 01:01:03.840 --> 01:01:08.880 This thing runs on the CPU, it calls the thing running on the GPU here. 425 01:01:08.880 --> 01:01:18.539 We'll do this next time, so a reasonable point to stop on is
426 01:01:21.989 --> 01:01:31.380 here. And I'll put a note about how far we got, where we finished off OpenACC and we're getting into CUDA now. 427 01:01:31.380 --> 01:01:35.610 And I have a homework thing, which is to play with that, 428 01:01:35.610 --> 01:01:46.650 the sample programs I just showed you. Try them and report your experience. So you can put on your resume that you programmed OpenMP plus OpenACC. 429 01:01:48.360 --> 01:01:52.019 Any questions now. 430 01:01:52.019 --> 01:01:55.469 Silence. 431 01:02:00.900 --> 01:02:06.210 Time to wake up, anything to. 432 01:02:07.980 --> 01:02:12.090 Okay. 433 01:02:12.090 --> 01:02:19.110 Silence. 434 01:02:23.820 --> 01:02:29.130 CUDA, by the way, is an acronym: Compute Unified Device Architecture. 435 01:02:32.519 --> 01:02:36.269 Well, if there is, um. 436 01:02:39.840 --> 01:02:47.969 How basic are the operations? What do you mean by an optimal operation? 437 01:02:49.590 --> 01:02:58.795 How do if statements perform? And I mentioned something very briefly on Docker. 438 01:02:59.065 --> 01:03:06.144 I realized I'd uninstalled Docker off of parallel when I stopped using it, when I upgraded to the latest version. 439 01:03:06.960 --> 01:03:14.610 So, what I'll have to do, well, I can tell you about it, I may just do that. If I was going to run anything, I'd have to reinstall it. 440 01:03:14.610 --> 01:03:21.360 How do if statements perform? It's that the then 441 01:03:21.360 --> 01:03:27.809 block gets run while the threads that would do the else block are idle, and then it reverses. 442 01:03:27.809 --> 01:03:33.630 What other types of operations do they do well? 443 01:03:33.630 --> 01:03:36.750 Linear algebra. 444 01:03:36.750 --> 01:03:45.030 Floating point, they do double precision and float. Well, on some versions they do floats faster than ints, I think. 445 01:03:45.030 --> 01:03:50.519 It depends, because it keeps changing the mix for the different generations.
446 01:03:50.519 --> 01:03:59.940 What do they do, I'll tell you what they do badly: pointer chasing, anything that's dynamic. Pointer chasing is very slow. 447 01:03:59.940 --> 01:04:09.000 Recursion, I think, is slow. So pointer chasing is a bad idea, recursion is a bad idea, trees are a bad idea. 448 01:04:09.000 --> 01:04:14.940 Um, stuff like, you know, lots and lots of, 449 01:04:14.940 --> 01:04:21.329 anything weird. Exceptions would be a bad idea. 450 01:04:21.329 --> 01:04:25.679 Throw and catch would be a bad idea, anything complicated like that. 451 01:04:25.679 --> 01:04:29.670 Um, it would be a bad idea. Simple straight-line stuff, 452 01:04:29.670 --> 01:04:36.929 float operations and so on, I think, work. 453 01:04:38.130 --> 01:04:45.570 Floats may work slower, because that's done in a separate unit on the GPU, and there may be fewer floating point 454 01:04:45.570 --> 01:04:50.489 units than simple CUDA cores. So floats may take several cycles, actually. 455 01:04:50.489 --> 01:04:55.559 Doubles, it depends how many double units there are. 456 01:04:55.559 --> 01:05:01.199 I said with the if-else it gets serialized, so 457 01:05:02.280 --> 01:05:11.400 the threads for which it was true execute it, and then after that the threads for which the condition is false execute, 1 after the other. 458 01:05:11.400 --> 01:05:14.699 That's an idea of what works and what doesn't work. 459 01:05:16.739 --> 01:05:25.019 There are actually techniques for turning apparently conditional code into straight-line code by using 460 01:05:25.019 --> 01:05:35.400 bit masks and stuff like that. I might even show that. Actually, these techniques go back decades in computer graphics, where conditionals were slow even on sequential 461 01:05:35.400 --> 01:05:44.280 processors, but they're useful again. Other questions.
462 01:05:46.530 --> 01:05:55.199 But following up on your thing, Isaac, again, it's another reason why I say, if you want to make your application parallel, 463 01:05:55.199 --> 01:06:00.480 your 1st version of this is probably just on the Intel, 464 01:06:00.480 --> 01:06:04.559 with the multi core. 465 01:06:04.559 --> 01:06:10.199 Threads on the multi core can do different things. So don't jump to the GPU initially. 466 01:06:10.199 --> 01:06:14.789 Silence. 467 01:06:14.789 --> 01:06:19.530 Other questions. 468 01:06:19.530 --> 01:06:28.349 Okay, have a good weekend, go skiing or something. Hope you're not in Texas, unfortunately, and. 469 01:06:37.289 --> 01:06:42.269 Well, the other professors may know it better than me. So, listen to them. 470 01:06:42.269 --> 01:06:45.989 Um, seriously. 471 01:06:45.989 --> 01:06:52.019 If you're going to do something virtual, it's a question of what you virtualize, 472 01:06:52.019 --> 01:06:55.980 and at what level. Like, at the really low level, you could just 473 01:06:55.980 --> 01:07:01.050 emulate the hardware, and that's very general, but incredibly slow. 474 01:07:01.050 --> 01:07:09.659 And you could emulate different types of hardware, or you could do different machines using the same hardware. 475 01:07:09.659 --> 01:07:17.400 Like, they're all running Intel but different operating systems, like VMware maybe. And then there's another level up where you've 476 01:07:17.400 --> 01:07:25.889 got separate machines, they're all running Linux, but they're isolated from each other, sharing some of the low level stuff, but it's protected. 477 01:07:25.889 --> 01:07:31.500 And then you could get to an even higher level still, where the machines are 478 01:07:31.500 --> 01:07:34.679 sharing more and it's more efficient. 479 01:07:37.284 --> 01:07:50.454 Currently in Linux now, you can give each process, like, a separate private view of the processes.
So you can't even see the other processes. That's sort of what the Docker level is. So it's efficient and it's high level. 480 01:07:50.730 --> 01:08:01.769 So each, you might say, process or process group is seeing a private view of the computer: a private view of the file system, and the process space, and so on. 481 01:08:03.239 --> 01:08:07.949 And so is it virtual? It's virtual at a very high level, but it's more efficient. 482 01:08:11.010 --> 01:08:16.979 But I'll dig up something then, since you're interested in that, and then you can go 483 01:08:16.979 --> 01:08:20.909 compare the different profs and tell us what the others are saying. 484 01:08:20.909 --> 01:08:25.470 Okay, that gives me a little class to-do, to dig that stuff up. 485 01:08:26.789 --> 01:08:30.239 Other questions. Okay. 486 01:08:31.260 --> 01:08:36.960 Bye bye.