WEBVTT

1 00:09:28.288 --> 00:09:42.568 Okay, good afternoon, parallel class. I had a little issue here getting things set up. I think they're working now. Sorry for the delay.
2 00:09:45.028 --> 00:09:51.509 And one more thing to set up, and then...
3 00:10:01.168 --> 00:10:07.499 Yeah, just one minute here.
4 00:10:35.038 --> 00:10:40.109 Cool.
5 00:10:44.129 --> 00:10:50.548 Hey, sharing is actually working, and...
6 00:10:50.548 --> 00:10:55.259 Excuse me, professor, may I ask a question before we start the lecture?
7 00:10:58.318 --> 00:11:04.619 Cool. So conceivably things are theoretically working.
8 00:11:04.619 --> 00:11:11.578 Can't say that they're working in practice, but my universal question is:
9 00:11:11.578 --> 00:11:19.558 can you hear me? I've got the chat window open here, so, you know: can you hear me, or...
10 00:11:19.558 --> 00:11:24.599 I can hear you. Beautiful. Okay.
11 00:11:24.599 --> 00:11:33.089 So, finally things are working. So what we're doing now is we're continuing.
12 00:11:33.089 --> 00:11:48.058 Is there a place to submit homework 4? No, I'll give you another couple of days to submit it, since it was taking a bit of time. There's a question in voice; yes, the question was, is there a place to submit homework 4?
13 00:11:48.058 --> 00:11:57.028 I'll give you to Monday to do it, since it may be taking a little time, and I'll give you a day or two to submit it after I put it up on Gradescope.
14 00:11:57.028 --> 00:12:09.928 Okay, so we're continuing on with NVIDIA. I think we're at approximately lecture 2.3, which we're speed-reading through.
15 00:12:10.464 --> 00:12:22.943 And we're talking about the architecture. Again, my teaching style is that I like to teach from examples that are used in the real world.
16 00:12:23.303 --> 00:12:34.464 So the example here is NVIDIA, and as you look at how they solved the problem, and they've solved the problem very successfully, you can get an idea for
17 00:12:36.149 --> 00:12:41.609 some general hardware principles, and so on, some ways to solve the issues here.
18 00:12:41.933 --> 00:12:53.933 So, NVIDIA has the thread: it's an execution stream, it's got some local memory, it's sharing some global memory, and it's got a program counter pointing to the next instruction to be executed.
19 00:12:54.264 --> 00:13:01.073 And again, the threads in a warp of threads all execute the same instruction, unless a particular thread is disabled.
20 00:13:01.408 --> 00:13:09.298 Okay, and the threads are very lightweight. This is review here: the threads are very lightweight.
21 00:13:09.298 --> 00:13:16.678 It is much less expensive to start a thread than it is to start a process or a coroutine
22 00:13:16.678 --> 00:13:22.499 in Linux or Windows. And so, this is review here, you might profitably,
23 00:13:22.499 --> 00:13:30.504 let's say if you're adding two vectors element by element, you might perhaps have a thread for each addition,
24 00:13:30.504 --> 00:13:38.783 something like that, because on the NVIDIA you can have thousands of threads.
25 00:13:39.239 --> 00:13:42.328 Okay.
26 00:13:43.469 --> 00:13:58.229 So here your program would alternate: some serial code, then some parallel code, some serial code, then parallel code. And we're assuming the serial code here, to make it easy, has one thread
27 00:13:58.229 --> 00:14:04.318 executing, and then the parallel code might have a thousand. You could also
28 00:14:04.318 --> 00:14:09.958 do them simultaneously. When you start off parallel threads,
29 00:14:09.958 --> 00:14:21.568 your host code that starts the device threads returns immediately and leaves the device running, which means you have to do a synchronization later.
30 00:14:21.568 --> 00:14:26.129 And this, again, is the syntax here that starts off
31 00:14:26.129 --> 00:14:31.438 the parallel threads. And just to review: a kernel
32 00:14:31.438 --> 00:14:37.318 is the name for a parallel program on the device. The device is
33 00:14:37.318 --> 00:14:41.188 the NVIDIA GPU.
34 00:14:41.188 --> 00:14:45.448 You give it some arguments you can pass, and this
35 00:14:45.448 --> 00:14:51.269 gives the threads that the kernel should use. A kernel is also called a grid.
36 00:14:51.269 --> 00:15:04.918 So the grid, the kernel, contains some number of thread blocks, and you specify how many thread blocks it should have. It could be anywhere from 1 up, because there's some limit, and the limit,
37 00:15:05.364 --> 00:15:16.854 I think, is pretty large. And then this here is the number of threads in each thread block; thread block and block, they're synonymous,
38 00:15:17.124 --> 00:15:26.634 so I can't accuse NVIDIA of complete consistency. So this here is where you tell it how many threads each block should have.
39 00:15:27.269 --> 00:15:37.739 And that would be anywhere from 1 up to a limit, which is 1024 threads per block, I believe, and effectively as many blocks
40 00:15:37.739 --> 00:15:46.168 as you want. Now, there's a limit to how many blocks can run at one time; the other blocks are sitting in a queue.
41 00:15:46.168 --> 00:15:51.688 And there's an operating system on the device.
42 00:15:51.688 --> 00:15:59.009 Not a lot of explicit details about it, but one thing it does, excuse me, is queue blocks up for resources to run them.
43 00:16:00.028 --> 00:16:11.364 Okay, it also happens down inside the block. Suppose you have a thousand threads in the block. Now, each thread has resources; the big one,
44 00:16:11.364 --> 00:16:15.714 I've told you, is registers, which are fast local memory. Then
45 00:16:18.504 --> 00:16:32.543 the block has a pool of registers that are shared by all of the threads in the block. If each thread wants a lot of registers, then maybe all the threads of the block can't run simultaneously. So warps of threads are queued up in a queue for the block
46 00:16:32.543 --> 00:16:43.073 and run as they get resources. Another shared resource would be floating point. Floating point is done in separate coprocessors, separate cores; it's the same thing on the Xeon on the host.
47 00:16:44.339 --> 00:16:55.798 Well, not quite the same, because each CPU core on the host will have some double-precision floating point, and here on the device there are fewer double-precision
48 00:16:55.798 --> 00:17:00.479 processors than there are integer processors, you might call them.
49 00:17:01.708 --> 00:17:12.538 So it may happen that a thread, or a warp of threads, is queued up waiting for some double-precision processor to become available, perhaps.
50 00:17:12.538 --> 00:17:19.528 And the queuing process is said to be free: there's no overhead in running the queue; the overhead is less than a cycle.
51 00:17:19.528 --> 00:17:24.838 And here's one thing NVIDIA does from generation to generation.
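To make the launch pattern just described concrete, here is a minimal sketch; the kernel name, sizes, and the use of managed memory are illustrative, not from the slides:

```cpp
#include <cstdio>

// Hypothetical kernel: doubles each element of x. __global__ marks it as
// device code that is launched from the host. Each thread computes its own
// global index from the built-in variables (covered a couple of slides on).
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;   // bounds check: the last block may be partial
}

int main() {
    int n = 1000000;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // managed memory, for brevity
    for (int i = 0; i < n; ++i) x[i] = i;

    int threadsPerBlock = 256;                 // 1 up to the 1024-per-block limit
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    scale<<<blocks, threadsPerBlock>>>(x, n);  // <<<blocks in grid, threads per block>>>

    // The launch returns immediately and leaves the device running,
    // so the host synchronizes before reading the results.
    cudaDeviceSynchronize();
    printf("x[1] = %f\n", x[1]);
    cudaFree(x);
    return 0;
}
```

The ceiling-division idiom in the grid size is the same one that comes up again a few slides later for fractional blocks.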
52 00:17:25.979 --> 00:17:37.648 From generation to generation, that is: the current one, before that is Turing, before that, I think, was Pascal, and so on. From generation to generation NVIDIA changes the
53 00:17:37.648 --> 00:17:42.989 proportion of the die that's used for single precision, double precision,
54 00:17:42.989 --> 00:17:47.098 half-precision floats, and integers and so on.
55 00:17:47.098 --> 00:18:00.989 Okay, so we've got this hierarchy here, which we saw last time: electrons used to build circuits; a low-level microarchitecture, which is not visible to the user; the actual instruction set, involving
56 00:18:00.989 --> 00:18:10.288 instructions on the CUDA cores; a language like C++, which is used to implement an algorithm; and then perhaps at the top they're doing
57 00:18:10.288 --> 00:18:14.788 some natural language thing. Okay.
58 00:18:15.838 --> 00:18:22.618 Nothing new here: instruction set architecture, nothing new here.
59 00:18:22.618 --> 00:18:35.519 Nothing new here: instruction register, for those of you that have forgotten your computer organization, I guess; program counter; register file;
60 00:18:35.519 --> 00:18:39.929 and so on. Standard.
61 00:18:40.979 --> 00:18:44.489 Oh, okay. So again, getting to
62 00:18:44.489 --> 00:18:52.048 something slightly new here. So again, all of the threads: they're lightweight, they're executing.
63 00:18:52.048 --> 00:18:59.249 So, the grid again; well, maybe the grid's the hardware and the kernel's the software, I guess. And all the threads,
64 00:18:59.249 --> 00:19:05.669 they run the same instruction, so it's called single program, multiple data, but they've got separate data.
65 00:19:05.669 --> 00:19:12.239 However, each thread has private indices, so each thread knows its number in the whole
66 00:19:12.239 --> 00:19:15.449 set of all threads, so it can access private data.
67 00:19:15.449 --> 00:19:27.088 It can index: well, first there's private data for the thread, the registers, and second, the thread can use its index to go into the global data and get its
68 00:19:27.088 --> 00:19:36.239 share of the global data. And I showed you these before: threadIdx.x is the index of the thread in the block. It goes from 0 up
69 00:19:36.239 --> 00:19:41.338 to 1023, perhaps. blockDim is the number of
70 00:19:41.338 --> 00:19:48.449 threads the block was declared, created, with; that's the number of threads that the block has total.
71 00:19:48.449 --> 00:19:54.838 And blockIdx is the index of the block in the grid, 0 up to the number of blocks in the grid.
72 00:19:54.838 --> 00:20:00.538 And it's all dot-x because, syntactically, these are 3-dimensional
73 00:20:00.538 --> 00:20:05.098 indices, not 1-dimensional indices. Okay.
74 00:20:05.098 --> 00:20:09.388 So, again, it shows the blocks again in more detail:
75 00:20:09.388 --> 00:20:15.778 block zero, block 1, up to block N minus 1. We have blocks,
76 00:20:15.778 --> 00:20:26.519 and in this hierarchy here, inside a block, I mentioned before, there's a fixed amount of shared memory per block. It's a very small amount; it's
77 00:20:26.519 --> 00:20:33.118 48 kilobytes or something, I can't remember for sure. It's very small and it's fast.
78 00:20:33.118 --> 00:20:43.019 And it's shared by the threads in the block. So if a thread writes to shared address 7,
79 00:20:44.153 --> 00:20:58.134 all of the threads in the block have access to it, but perhaps you do a barrier synchronization to ensure that. And you do have atomic operations for the threads in a block, like we saw with OpenMP and OpenACC.
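As a sketch of how those pieces fit together, the built-in indices, the per-block shared memory, the barrier, and an atomic; the kernel is hypothetical and assumes a 256-thread block:

```cpp
// Hypothetical kernel: each block sums its 256 elements in shared memory,
// then thread 0 atomically adds the block's partial sum into a global total.
// Assumes a launch of the form blockSum<<<n / 256, 256>>>(in, out).
__global__ void blockSum(const float *in, float *out) {
    __shared__ float partial[256];     // fast per-block shared memory
    int t = threadIdx.x;               // this thread's index within its block
    partial[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                   // barrier: warps in a block don't run in lockstep
    if (t == 0) {
        float s = 0;
        for (int i = 0; i < blockDim.x; ++i) s += partial[i];
        atomicAdd(out, s);             // atomic read-modify-write on global memory
    }
}
```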
80 00:20:58.949 --> 00:21:04.798 You can do an atomic read-modify-write, for example, on the shared memory, so that you can
81 00:21:04.798 --> 00:21:11.219 update a counter correctly. The shared memory is actually implemented in the
82 00:21:11.219 --> 00:21:14.459 fast high-speed cache
83 00:21:14.459 --> 00:21:23.844 fronting the global memory. Okay, so these are the threads in the block, and you do have to synchronize, for the following reason:
84 00:21:24.413 --> 00:21:36.054 because, as I said, the threads in a warp run synchronously, but the warps in the block, 32 threads in a warp, up to 32 warps in the block, do not necessarily run synchronously.
85 00:21:36.054 --> 00:21:50.933 If there's competition for some resource, such as a floating point processor, then the warps will not run at the same time in the block, but they've all got to access that shared memory. So maybe you want to do a synchronization.
86 00:21:52.169 --> 00:21:55.169 You sort of want to do that.
87 00:21:55.169 --> 00:21:59.939 Now, the different blocks do not interact, except
88 00:21:59.939 --> 00:22:03.298 that they can access the global memory.
89 00:22:03.298 --> 00:22:10.858 So there is no synchronization, if I recall right, for the threads in different blocks.
90 00:22:10.858 --> 00:22:25.679 You could create one with something accessing global memory, but that would be a bad idea, for the following reason: there are no fairness guarantees for how the different blocks execute. You see, the threads in the block, they're running the same instruction,
91 00:22:25.679 --> 00:22:31.919 you know, but they're not necessarily doing it at the same time.
92 00:22:31.919 --> 00:22:35.308 So, conceptually, it's a single,
93 00:22:35.308 --> 00:22:39.118 you know, single program, multiple data, but the different
94 00:22:39.118 --> 00:22:49.199 blocks, as they're running, it could be one block after another after another, or it could be simultaneous. So you do not want
95 00:22:49.199 --> 00:22:56.459 blocks to try to interact; they don't naturally interact, and you could,
96 00:22:56.459 --> 00:23:01.348 with some hackery, have them interacting, but that would kill your performance,
97 00:23:01.348 --> 00:23:05.669 if you see what I mean. Okay, so we've got the threads in the block,
98 00:23:05.669 --> 00:23:08.878 and then the different blocks; that's the hierarchy here.
99 00:23:10.824 --> 00:23:23.364 Now, you might ask: why don't they just have, you know, thousands of threads in a block? They've got a limit of 1024 threads per block, and even with the faster, newer NVIDIA architectures they don't increase that.
100 00:23:23.364 --> 00:23:28.854 They don't have more threads per block in the newer architectures. So you might ask yourself why, and my answer:
101 00:23:32.038 --> 00:23:39.959 I haven't seen them state it anywhere, but the obvious thing is that the operations inside one block are very expensive to implement
102 00:23:39.959 --> 00:23:50.729 and take a lot of hardware, in particular some of the asynchronous logic being used to implement some of this. They call it zero-overhead waiting:
103 00:23:50.729 --> 00:24:04.648 there are these warps in the block, and as I said, there's an invisible queue of warps waiting to run, and that's done as a zero-overhead thing, so a warp runs as soon as the resource, a CUDA processor say, becomes available.
104 00:24:04.648 --> 00:24:11.098 If there's a warp that wants to do a floating point operation, the next cycle it gets it.
105 00:24:11.098 --> 00:24:18.419 And this is done with some sort of asynchronous logic. For your software types:
106 00:24:18.894 --> 00:24:31.013 you have operations in the CPU which are clocked, where with every cycle something happens, and then there are asynchronous, untimed operations:
107 00:24:31.403 --> 00:24:34.884 you just have your logic gates and whatever,
108 00:24:35.429 --> 00:24:44.429 and as their inputs change, their outputs change immediately after; well, depending on how fast the hardware operates. And this
109 00:24:44.429 --> 00:24:50.038 asynchronous operation, it's a horribly complicated mess, but it's very fast.
110 00:24:50.038 --> 00:24:59.429 So, again, in a computer organization class they teach some of the issues of, um,
111 00:25:00.449 --> 00:25:10.463 problems where the inputs to a gate change at slightly different times: you get this temporary false signal coming out of the gate before the inputs have stabilized, and stuff. You've got to worry about all of that.
112 00:25:10.644 --> 00:25:19.403 But the upside is it's very fast, and that's the sort of thing that NVIDIA uses to do the scheduling inside one block.
113 00:25:19.888 --> 00:25:32.699 But it doesn't scale up; that's why there's 1024 threads per block. Also, all of the NVIDIA
114 00:25:32.699 --> 00:25:40.949 architectures, and NVIDIA has been around for more than 20 years, always have 32 threads per warp.
115 00:25:40.949 --> 00:25:48.538 They recently do some new stuff with the warps, getting almost fractional warps, but they don't have more, again because of
116 00:25:48.834 --> 00:26:03.294 the cost in the hardware to do it. In any case, the 3-dimensional indexing: that's just syntactic sugar, I'd call it. My view of this is, I don't care that the compiler does this, because
117 00:26:03.989 --> 00:26:11.608 you know, it's what's possible: you can write class conversion routines in C++ and do it. In fact, I do that sort of thing.
118 00:26:11.608 --> 00:26:16.739 Okay: grid of threads in a block, a grid, and the 3-dimensional indexing.
119 00:26:17.909 --> 00:26:24.689 Okay, that was lecture 2.3, from NVIDIA's point of view.
120 00:26:28.828 --> 00:26:32.669 Here, it's a...
121 00:26:34.048 --> 00:26:39.269 Hello.
122 00:26:57.719 --> 00:27:05.669 No, I can't zoom in; it takes the whole height of my screen, so there's no point in making it wider here. Okay.
123 00:27:05.669 --> 00:27:11.608 More stuff here. Same here.
124 00:27:11.608 --> 00:27:22.888 The compiler. So, NVIDIA has got a lot of compilers for you. I showed you nvc++ for straight C++; nvcc then is for the CUDA code.
125 00:27:22.888 --> 00:27:29.429 And I showed you a little of this before: your hello-world program on the host.
126 00:27:29.429 --> 00:27:44.038 Your hello-world program on the device doesn't do anything. A quick reminder here: mykernel is the name of your kernel, your device routine; that's mykernel up here. This is
127 00:27:44.038 --> 00:27:48.148 an extension to C++, and this tells
128 00:27:48.148 --> 00:27:58.618 the compiler that this routine, mykernel, is what's called a global, and that means it's called from a host routine; it's called down here,
129 00:27:58.618 --> 00:28:02.368 and it's executed on the device.
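A sketch reconstructing the hello-world pair being described; the launch configuration in the triple angle brackets is explained just below:

```cpp
#include <cstdio>

// __global__: runs on the device, called from host code.
// The device side of hello-world does nothing at all.
__global__ void mykernel() { }

int main() {
    mykernel<<<1, 1>>>();       // launch: 1 block, 1 thread per block
    printf("Hello World!\n");   // the host side does the printing
    cudaDeviceSynchronize();    // wait for the (empty) kernel to finish
    return 0;
}
```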
130 00:28:02.368 --> 00:28:08.939 That's what NVIDIA calls a global. There are no arguments, and this says,
131 00:28:08.939 --> 00:28:13.919 inside the triple angle brackets, that there's 1 thread in
132 00:28:13.919 --> 00:28:17.038 1 block. Okay.
133 00:28:17.038 --> 00:28:23.638 And I showed you this last time; quick thing.
134 00:28:25.588 --> 00:28:31.318 Okay, and it will execute. I ran it before for you.
135 00:28:31.318 --> 00:28:45.659 A quick review here, mentioning also: you can put in debugging symbols. They make the executables bigger; they don't, I don't think, slow down the execution.
136 00:28:45.659 --> 00:28:49.469 Okay, and I showed you last time cuda-memcheck,
137 00:28:49.469 --> 00:28:54.358 which checks every address for validity,
138 00:28:54.358 --> 00:28:59.368 and gave you an example. It will check for various cool things here;
139 00:28:59.368 --> 00:29:06.388 some things have to be aligned, too. Okay.
140 00:29:06.388 --> 00:29:11.759 Showed you example 2.
141 00:29:12.173 --> 00:29:21.084 We have cuda-gdb, the debugger, and you can go in and look at data in certain threads and so on. I mentioned that last time; doing a quick review here.
142 00:29:21.923 --> 00:29:27.203 And I showed, I ran that quickly.
143 00:29:28.433 --> 00:29:43.074 Just a quick thing: not all of these things run, because these slides were written before the latest version, the latest architecture, and in the latest architecture they changed something incompatibly. So we have to use a different tool; well,
144 00:29:43.348 --> 00:29:49.679 the visual profiler and so on. I may demo it Monday for you, but for now I'm just walking you through slides.
145 00:29:49.679 --> 00:29:59.459 Okay, and again, some of this does not work with the current version, which is why I am
146 00:29:59.459 --> 00:30:04.348 going through it fast. Yeah, some of it does and some doesn't, say.
147 00:30:04.348 --> 00:30:08.368 Okay.
148 00:30:08.368 --> 00:30:12.868 Yeah, I'll show you some of the profiler later.
149 00:30:12.868 --> 00:30:18.598 Okay: performance.
150 00:30:18.598 --> 00:30:23.219 That was quick.
151 00:30:23.219 --> 00:30:31.679 I like to keep moving, and that was all of that one.
152 00:30:36.538 --> 00:30:40.138 Okay.
153 00:30:40.138 --> 00:30:44.969 Here.
154 00:30:48.749 --> 00:30:57.598 I'm interested in watching the delay in the synchronization with Webex, because I've got 2 laptops in front of me: the one that I'm
155 00:30:57.598 --> 00:31:08.519 running on, and the one where I'm watching what you see, because what you see is different than what my main laptop sees. Okay. And I showed you this error thing here before.
156 00:31:08.519 --> 00:31:16.858 And the point about this, again, is that it may happen that the number of threads is not a multiple of,
157 00:31:18.773 --> 00:31:30.834 not a multiple of, the number of threads per block. So in the last block you only want some of the threads in the block, but all of the threads are going to get executed, perhaps. So you need a bounds check
158 00:31:32.189 --> 00:31:43.644 for this last, fractional, block, you might call it. And if you don't do this check, this will be executing; this array actually is in global memory.
159 00:31:44.663 --> 00:31:52.733 So there are several different places that data can be. This here is going to be in the global memory, which on parallel's
160 00:31:52.979 --> 00:31:56.669 good GPU is 48 gigabytes.
161 00:31:58.554 --> 00:31:59.273 So,
162 00:31:59.364 --> 00:32:01.584 if you don't have this check right here,
163 00:32:01.794 --> 00:32:07.733 you'll be walking off the end of the arrays in global memory, and reading is probably okay,
164 00:32:07.733 --> 00:32:08.483 but writing:
165 00:32:08.483 --> 00:32:11.453 you're going to be smashing some other,
166 00:32:11.513 --> 00:32:15.804 someone else's, data, which may cause their code to crash.
167 00:32:16.108 --> 00:32:20.459 There's some security on
168 00:32:20.459 --> 00:32:23.999 the device, but it's not perfect. So,
169 00:32:25.979 --> 00:32:38.638 and so, this again: here you need enough threads to cover the elements. I showed you this before; well, I showed a different way to do it before: n plus 255,
170 00:32:38.638 --> 00:32:47.459 divided by 256. An equivalent way to do it is a ceiling function: calculate this as a float and take the ceiling; it would be the same thing.
171 00:32:47.459 --> 00:32:53.068 In any case, the d underscore is a
172 00:32:53.068 --> 00:33:01.318 naming convention to say that the arrays of data are on the device instead of on the host.
173 00:33:01.318 --> 00:33:04.348 With the managed memory, that's,
174 00:33:04.348 --> 00:33:12.148 that's an obsolete idea, because with managed memory it's paged back and forth between the host and the device as needed.
175 00:33:12.148 --> 00:33:23.729 There might be a performance penalty, depending on your code, but your life gets easier; it's easier on you, because you don't have to explicitly copy the data back and forth.
176 00:33:23.729 --> 00:33:28.769 So the d underscore idea is a bit silly, but
177 00:33:28.769 --> 00:33:34.378 you'll see it everywhere. The h underscore means that it's on the host. Okay.
178 00:33:35.999 --> 00:33:42.328 Host code: dim3 just means
179 00:33:43.193 --> 00:33:44.064 3-dimensional.
180 00:33:44.153 --> 00:33:46.763 dim3 is a provided class,
181 00:33:47.094 --> 00:33:48.713 and nothing interesting here,
182 00:33:49.104 --> 00:33:52.733 except that the number of blocks in the grid, instead of being a scalar,
183 00:33:52.733 --> 00:33:54.354 could be a 3-D
184 00:33:54.443 --> 00:33:58.374 dim3, as is the number of threads in a block.
185 00:33:58.763 --> 00:34:01.794 And so we're passing in, the
186 00:34:02.038 --> 00:34:05.429 point is, pointers to 3 global arrays, and then just an
187 00:34:05.429 --> 00:34:13.708 integer here. dimGrid: there would be a constructor that would take
188 00:34:13.708 --> 00:34:24.929 3 integer expressions, let's say, and construct the dim3 variable, a variable of class dim3, and you could also do a constructor, same thing, here.
189 00:34:24.929 --> 00:34:31.679 If you're going to have ones for some of the dimensions of the 3, they'd be the last few.
190 00:34:31.679 --> 00:34:38.579 Okay.
191 00:34:38.579 --> 00:34:41.759 Showing the same concept again.
192 00:34:41.759 --> 00:34:52.858 Okay, so here's another thing; we're seeing new things here. Underscore-underscore host underscore-underscore is an explicit statement that this routine runs on the host.
193 00:34:52.858 --> 00:35:05.159 That's the default, so you don't need it; they're just putting it here to be explicit. And global means that this routine runs on the device but is called from the host.
194 00:35:05.159 --> 00:35:12.088 There are going to be some other routines that run on the device and are callable only from the device.
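Here is a sketch of the whole pattern from this part of the lecture: the d_/h_ naming, the explicit copies, the dim3 constructors, and the ceiling division. The names are illustrative:

```cpp
// Hypothetical vector add: one thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];      // the bounds check for the fractional last block
}

void hostCaller(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;             // d_ : these pointers are device memory
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // h_ : host arrays
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(256);                 // unspecified (last) dims default to 1
    dim3 dimGrid((n + 255) / 256);      // ceil(n / 256) blocks covers every element
    vecAdd<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // copy the result back
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```

With managed memory, as just noted, the cudaMalloc and cudaMemcpy pairs disappear.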
195 00:35:12.088 --> 00:35:22.498 Okay, um, there is a complication here which I haven't seen documented that much; you hit it when you're programming.
196 00:35:22.498 --> 00:35:35.909 If you've got a device routine, and you want to get data from the host to the device routine, you sort of have to put it in arguments, pointers and so on.
197 00:35:35.909 --> 00:35:47.998 There's no common global memory that they can both easily access. Well, yeah, there's the managed memory, but it's a little messy sometimes, getting data back and forth.
198 00:35:48.773 --> 00:36:00.653 In any case: in your host code, you set the size of the grid and the blocks, the number of blocks in the grid and the number of threads in a block, and then you call this. And you've seen this before:
199 00:36:00.653 --> 00:36:05.244 this is how you pass in the sizes of the grid and the block, in the triple angle brackets.
200 00:36:06.628 --> 00:36:12.929 Then you write your global routine name and your argument list, as it shows down here.
201 00:36:14.784 --> 00:36:28.554 Inside here, just to review: this local variable in the global routine, running on the device; it's a local variable, it's private to the thread, and it's in a register, if possible.
202 00:36:28.858 --> 00:36:40.289 If there are not enough registers available, then this will be put in a larger local memory that is available
203 00:36:40.289 --> 00:36:51.239 to each device routine, but it's very slow. What it is, is just a chunk of the global memory that's made private to each thread.
204 00:36:51.239 --> 00:37:02.068 So there's more of it, but it's unbelievably slow. So you want to have few enough local variables in the thread that they'll fit in the available registers.
205 00:37:02.068 --> 00:37:05.849 By very slow, I mean you've got a latency of 100 cycles or something.
206 00:37:05.849 --> 00:37:13.409 Okay, so each grid again has blocks, and each block has lots of threads,
207 00:37:13.409 --> 00:37:18.208 and then the GPU perhaps has only a limited number of
208 00:37:18.208 --> 00:37:22.018 processors on the GPU.
209 00:37:24.264 --> 00:37:36.983 Here it's talking about these declarations. I told you about global for the last two days: a global routine runs on the device and it's called from the host.
210 00:37:37.289 --> 00:37:40.588 I mentioned we saw host today:
211 00:37:40.588 --> 00:37:55.018 it's on the host, called from the host. A new one, device, I don't know that you've seen yet: this is for a routine which runs on the device and is called from the device.
212 00:37:55.018 --> 00:38:08.728 So there's like a 2-dimensional array here: where the function runs, and where it can be called from.
213 00:38:10.588 --> 00:38:24.869 Now, you can do this: suppose you want a routine to run both on the host and the device. You can have the 2 declarations for a routine; you can say host device, or device host, in front of the routine name.
214 00:38:24.869 --> 00:38:34.679 And what this will do is the compiler will produce 2 versions of the routine: 1 version
215 00:38:34.679 --> 00:38:39.750 to run on the host, and a 2nd version to run on the device. So
216 00:38:39.750 --> 00:38:48.210 this is if you want a routine that's sometimes running on the host and other times running on the device: you just put the 2 declarations in front of it.
217 00:38:48.210 --> 00:38:52.349 Now, one caution.
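A minimal sketch of that two-dimensional table of qualifiers; the function names are made up:

```cpp
// Compiled twice, once for the host and once for the device. The body must
// stick to the intersection of what both sides support.
__host__ __device__ float sq(float x) { return x * x; }

// __device__ : runs on the device, callable only from device code.
__device__ float cube(float x) { return x * x * x; }

// __global__ : runs on the device, called from the host.
__global__ void kern(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = sq((float)i) + cube((float)i);  // device-side calls
}

// sq() can equally be called from ordinary host code; cube() cannot.
```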
218 00:38:52.349 --> 00:38:55.500 The concern here is that
219 00:38:57.085 --> 00:39:11.454 the host and the device are not completely identical. So if you're going to declare a routine to be both host and device, then the code inside it is limited to, like, the intersection of what works on the host and what works on the device.
220 00:39:11.454 --> 00:39:13.795 So you have to be careful
221 00:39:14.099 --> 00:39:18.809 in what you're doing. To give you an example:
222 00:39:19.974 --> 00:39:33.894 I don't know, fancy C++ class stuff doesn't run on the device. So if you're going to get fancy with that kind of stuff, memory allocation and constructors, destructors and all that stuff,
223 00:39:34.170 --> 00:39:37.409 and, I don't know, interrupts,
224 00:39:38.429 --> 00:39:42.809 maybe you don't do that on the device. Okay.
225 00:39:42.809 --> 00:39:51.869 Although, as time goes on, what can be done on the device is getting more and more. But if it's a routine that's for both the host and the device, then
226 00:39:51.869 --> 00:39:56.159 you're limited in what you can do in the routine, or do efficiently, or do at all.
227 00:39:56.159 --> 00:40:01.079 So, okay.
228 00:40:01.079 --> 00:40:11.699 So you've got your source code, CUDA code, call it, a program. Your program has an extension .cu.
229 00:40:11.699 --> 00:40:18.300 nvcc splits it into 2 pieces: code for the host and code for the device.
230 00:40:18.300 --> 00:40:25.320 And the host code runs through a, um,
231 00:40:25.320 --> 00:40:28.380 C++ compiler, um,
232 00:40:28.380 --> 00:40:32.789 I think GCC or Clang or something.
233 00:40:32.789 --> 00:40:37.019 The device code, that's complicated;
234 00:40:37.019 --> 00:40:42.179 I'll get to it in a second. But then the whole thing gets merged into 1 executable that runs.
235 00:40:42.179 --> 00:40:46.019 Okay, the device code:
236 00:40:47.280 --> 00:40:51.210 what's happening here?
237 00:40:51.210 --> 00:40:54.300 NVIDIA has a,
238 00:40:54.300 --> 00:41:07.230 it's a little reminiscent of Java: NVIDIA compiles the CUDA code into an intermediate, just-in-time sort of code,
239 00:41:07.230 --> 00:41:15.210 a device code called PTX. It's not down at the assembly level; it's a level above it.
240 00:41:16.769 --> 00:41:20.610 This is what the nvcc compiler produces.
241 00:41:20.610 --> 00:41:31.769 And then what happens is that when you execute your nvcc CUDA program, at execution time it's actually compiled for the device.
242 00:41:31.769 --> 00:41:37.170 So your executable doesn't contain low-level
243 00:41:37.170 --> 00:41:40.710 device code; it contains this intermediate
244 00:41:40.710 --> 00:41:47.489 device code, PTX, which is a step above the actual hardware instructions.
245 00:41:48.264 --> 00:41:59.574 Now, this means that the first time you run your program, it's going to be slower, because of all the just-in-time compiling; at run time it has to be compiled.
246 00:42:00.175 --> 00:42:05.695 Now, the reason NVIDIA does that is they are future-proofing,
247 00:42:05.969 --> 00:42:14.969 because the next generation of the GPU will have different low-level assembly instructions,
248 00:42:14.969 --> 00:42:21.179 and with this 2-step process, your executable
249 00:42:21.179 --> 00:42:32.909 will run on the future GPU, which has different hardware instructions, because the just-in-time compiler, which you don't directly see, will be different.
250 00:42:32.909 --> 00:42:41.699 So your so-called executable has this PTX code. You run your old executable on your new,
251 00:42:42.719 --> 00:42:49.889 your next, GPU, which has different hardware instructions, and it will work, because the just-in-time compiler
252 00:42:49.889 --> 00:42:53.610 will compile your code into
253 00:42:53.610 --> 00:43:02.159 the new assembly instructions. So it makes things complicated, but it future-proofs your executables, and that's sort of nice.
254 00:43:02.159 --> 00:43:05.610 So you don't have to recompile
255 00:43:05.610 --> 00:43:16.050 your program. Well, you may if there's some user-visible novelty, but you don't have to: your old executable will run on the new GPU. So that's sort of nice.
256 00:43:17.610 --> 00:43:26.969 I'm leaving out some details, and it may not necessarily always run completely, but the intent is that it will run on the new thing.
257 00:43:28.139 --> 00:43:32.789 There are all sorts of architecture levels, which describe what
258 00:43:32.789 --> 00:43:47.550 capabilities, it's called compute capability, which describes what capabilities are available for you. So, for example, look at parallel: there are 2 GPUs on parallel, 2 different generations.
259 00:43:47.550 --> 00:43:58.619 So you could run the same executable on both of them, because the PTX code would compile into different hardware instructions for the 2,
260 00:43:58.619 --> 00:44:02.909 for the 2 different
261 00:44:02.909 --> 00:44:08.760 architectures. In fact, just a second here, see if I can
262 00:44:08.760 --> 00:44:13.559 start my VPN.
263 00:44:13.559 --> 00:44:28.079 Okay, this window runs on parallel. Oh, 27 security upgrades. Okay. Just for fun, I check who's running on the machine when I'm about to start a class.
264 00:44:28.079 --> 00:44:31.769 So.
265 00:44:31.769 --> 00:44:35.639 Okay, so what we have:
266 00:44:37.500 --> 00:44:45.389 you see, so, the device there, that's the Quadro 8000, and its compute capability is 7.5,
267 00:44:45.389 --> 00:44:49.679 CUDA runtime 11.2; that's fairly new.
268 00:44:49.679 --> 00:44:53.849 If I go to the old machine, the older one:
269 00:44:53.849 --> 00:44:58.260 the 2nd one is a 1080,
270 00:44:58.260 --> 00:45:01.409 and it is capability 6.1.
271 00:45:02.670 --> 00:45:09.030 So, what that means is the newer one has hardware capabilities that are not available
272 00:45:09.030 --> 00:45:13.889 on the older one. Each increase in compute capability means new facilities are available.
273 00:45:13.889 --> 00:45:19.289 So, in any case: however, if your program
274 00:45:19.289 --> 00:45:27.360 is written using only capabilities at the 6.1 level, you could run it on either,
275 00:45:27.360 --> 00:45:31.889 and it would efficiently use the new one, because
276 00:45:31.889 --> 00:45:36.989 the run-time, just-in-time compiler would do that.
277 00:45:36.989 --> 00:45:40.079 Well, I've got this thing up here.
278 00:45:40.079 --> 00:45:45.449 Warp size: there's always 32 threads per warp. A block has 1024.
279 00:45:46.530 --> 00:45:52.074 Here is your thread block; so the threads per block is 1024,
280 00:45:52.074 --> 00:45:53.844 but that could be 1024 by 1,
281 00:45:53.844 --> 00:46:01.914 or it could be 32 by 32 or something; they're 3-dimensional, and these are the max sizes. And then the grid size, that's 2 to the 31st minus 1.
282 00:46:03.389 --> 00:46:07.260 So, lots of
283 00:46:07.260 --> 00:46:15.000 blocks in the grid. But if we go up a little, there's the shared memory.
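What's being read off here is essentially the output of NVIDIA's deviceQuery sample. A sketch of fetching the same numbers through the runtime API; the print format is mine:

```cpp
#include <cstdio>

// Query the compute capability and hardware limits being read off on screen.
int main() {
    int count;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("device %d: %s, compute capability %d.%d\n",
               d, p.name, p.major, p.minor);
        printf("  multiprocessors: %d, max threads/block: %d\n",
               p.multiProcessorCount, p.maxThreadsPerBlock);
        printf("  shared mem/block: %zu bytes, registers/block: %d, warp size: %d\n",
               p.sharedMemPerBlock, p.regsPerBlock, p.warpSize);
    }
    return 0;
}
```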
284 00:46:15.000 --> 00:46:20.340 So, per block: 64K bytes of shared memory total.
285 00:46:20.340 --> 00:46:25.289 What about the registers? They're separate, and this is 64K registers at 4 bytes each. So,
286 00:46:25.289 --> 00:46:28.440 threads per block:
287 00:46:28.440 --> 00:46:37.739 maximum 1024. The multiprocessors are another level here I haven't mentioned; let me mention it, because I've got this up on the screen.
288 00:46:37.739 --> 00:46:41.340 Okay, so what we have:
289 00:46:43.380 --> 00:46:49.320 so, what are they
290 00:46:49.320 --> 00:46:57.179 called? Well, it's a multiprocessor, or a streaming multiprocessor or something,
291 00:46:57.179 --> 00:47:04.530 and there are 72 of them on the 8000.
292 00:47:04.530 --> 00:47:08.159 And each multi, so they're actual physical,
293 00:47:08.159 --> 00:47:11.340 physical areas on the chip.
294 00:47:11.340 --> 00:47:19.019 And each multiprocessor can have up to 64 CUDA cores, so it's 4608 CUDA cores total.
295 00:47:19.019 --> 00:47:25.139 So, um, if you're running,
296 00:47:25.139 --> 00:47:32.909 so, 64: that would be 2 warps of cores. And so you've got lots of
297 00:47:32.909 --> 00:47:38.130 warps in the block; they can be allocated to different multiprocessors,
298 00:47:38.130 --> 00:47:41.280 perhaps. I think so.
299 00:47:41.280 --> 00:47:47.820 Yeah, okay, we'll come back to other stuff here at other times.
300 00:47:51.989 --> 00:48:00.570 Constant memory: this is some fast memory that is visible to all of the threads.
301 00:48:00.570 --> 00:48:03.989 So, it's just constant, and it's fast,
302 00:48:03.989 --> 00:48:10.829 implemented by a cache. And that's the major interesting stuff here.
303 00:48:13.769 --> 00:48:28.559 Okay, so again: your program is split into host code and device code, and the device code is compiled into PTX code, which at run time is just-in-time compiled.
304 00:48:29.760 --> 00:48:38.760 No questions about, yeah. Okay.
305 00:48:38.760 --> 00:48:49.289 Question: can you show how it would access these different memory areas? Yeah, there are some declarators.
306 00:48:49.289 --> 00:48:59.789 We'll see programs later on that do that. So by default, your local scalars are in registers if they can be; if not, they spill over to local memory.
307 00:48:59.789 --> 00:49:08.070 The globals come in as arguments; it's a global, and you access it just like a scalar. Thank you, Isaac. We'll see that in more detail
308 00:49:08.070 --> 00:49:13.769 later. Okay.
309 00:49:29.639 --> 00:49:35.820 Oh, okay: multi-dimensional code. I don't think this is very interesting, but that's just me.
310 00:49:36.840 --> 00:49:43.170 They're distinguishing between the kernel and the grid: the kernel's the software and the grid's the hardware, I guess.
311 00:49:43.170 --> 00:49:47.789 Okay, um,
312 00:49:48.869 --> 00:50:00.269 I think we saw a little of this before, for a few seconds. So you have the 2-D grid of threads in the block, and you can map them to your problem; like, you're processing a 2-D picture.
313 00:50:00.269 --> 00:50:05.070 That's what they're talking about here.
314 00:50:06.329 --> 00:50:17.730 The relevance here: row-major versus column-major layout. The topic here is as follows. You have a 2-dimensional array, and you're mapping it, storing it, in linear memory.
315 00:50:17.730 --> 00:50:21.690 Do you want the
316 00:50:21.690 --> 00:50:29.159 2nd subscript to be varying fastest, or the 1st subscript varying fastest, as you step up through the memory?
317 00:50:29.159 --> 00:50:33.150 Most languages, like C,
318 00:50:33.150 --> 00:50:36.539 do row-major layout, where
319 00:50:36.539 --> 00:50:41.610 the rows are contiguous, and if you,
320 00:50:42.414 --> 00:50:54.204 you're going across: so if we look here at this 4-by-4 array, the row, the 4 yellow elements, are contiguous in memory, then the 4 red ones. So the rows are contiguous in memory.
321 00:50:54.835 --> 00:51:00.655 The exception to this is Fortran, which does column-major layout. So in Fortran,
322 00:51:00.960 --> 00:51:05.219 the columns are contiguous in memory.
323 00:51:06.449 --> 00:51:12.030 And Fortran did it first, because Fortran was invented, I think, in 1957,
324 00:51:12.030 --> 00:51:16.590 the same era as Lisp, I think. So Fortran at this point
325 00:51:16.590 --> 00:51:21.300 is 64 years old.
326 00:51:21.300 --> 00:51:27.840 The language has been extended somewhat, so your grandparents might have used it,
327 00:51:27.840 --> 00:51:42.599 and it's still used; it has inertia. Okay. In any case, row-major layout: this is relevant. Well, I'll tell you why this is relevant; I'm anticipating the next few slides.
328 00:51:42.599 --> 00:51:47.369 So each thread processes another element.
329 00:51:47.369 --> 00:51:50.940 For efficiency reasons, it's nice if adjacent threads
330 00:51:50.940 --> 00:51:56.250 are processing adjacent elements in memory.
331 00:51:56.250 --> 00:52:01.590 That just makes things more efficient.
332 00:52:01.590 --> 00:52:08.400 One thing is that if you're reading global memory, it reads 128 bytes at a time,
333 00:52:08.400 --> 00:52:14.849 and it's nice if you can use all 128 bytes, which would be 32 4-byte words, which would be a warp of threads.
334 00:52:14.849 --> 00:52:20.730 But, okay, that's going to be illegible for you, um,
335 00:52:22.019 --> 00:52:31.500 so it's just showing how, and let me see if I can zoom it in.
336 00:52:31.500 --> 00:52:34.530 Good, okay.
337 00:52:34.530 --> 00:52:42.420 Okay, it's scaling every pixel value. Let me walk you through what's happening here.
338 00:52:42.420 --> 00:52:45.719 We've got 2 arguments, which give the width and the height
339 00:52:45.719 --> 00:52:56.639 of the image in pixels; well, you can read the comments. So we take the thread index,
340 00:52:56.639 --> 00:53:10.260 and here we're assuming thread indices; everything's threadIdx.y and blockDim.y and blockIdx.y, so we're computing a row, and the column is dot-x, dot-x, dot-x.
341 00:53:10.260 --> 00:53:13.949 So we're assuming that the threads in the block,
342 00:53:13.949 --> 00:53:20.429 that the block is 2-dimensional: blockDim is now,
343 00:53:20.429 --> 00:53:27.329 got it, 2-dimensional, not just a scalar. So we can calculate a 2-dimensional row and column,
344 00:53:27.329 --> 00:53:30.780 and then what we can do is
345 00:53:31.949 --> 00:53:38.760 map it back down to a 1-dimensional index here, and grab the pixel and write it.
346 00:53:39.900 --> 00:53:45.750 You know, I frankly don't see the point of the 2-dimensional threads,
347 00:53:45.750 --> 00:53:49.469 but I'm presenting it to you since I have it here.
348 00:53:49.469 --> 00:53:53.670 If somebody sees a point for it, then,
349 00:53:53.670 --> 00:53:57.000 okay, um,
350 00:53:57.000 --> 00:54:01.349 here. Okay. So,
351 00:54:03.389 --> 00:54:07.710 how you do it up at the host side, so:
352 00:54:08.760 --> 00:54:14.190 we're assuming that the size of a block is 16 by 16 threads.
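A sketch of the 2-D pattern being walked through; the names are hypothetical. The row comes from the .y components, the column from the .x components, and the 16-by-16 launch covers a width-by-height image with ceiling division per axis:

```cpp
// Each thread scales one pixel. Row-major layout: rows are contiguous,
// so the flattened index is row * width + col.
__global__ void scalePixels(unsigned char *img, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // .x : varies fastest
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // .y : the row
    if (row < height && col < width)                  // edge blocks hang off the image
        img[row * width + col] /= 2;                  // scale the pixel value
}

void launch(unsigned char *img, int width, int height) {
    dim3 dimBlock(16, 16);                                 // 16x16 = 256 threads per block
    dim3 dimGrid((width + 15) / 16, (height + 15) / 16);   // ceiling division per axis
    scalePixels<<<dimGrid, dimBlock>>>(img, width, height);
}
```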
353 00:54:14.190 --> 00:54:17.699 That's 256 threads, less than the max of 1024.
354 00:54:18.869 --> 00:54:22.889 And so this is the dimension of the
355 00:54:22.889 --> 00:54:27.809 threads in the block, that's the block dimension, and this would be the blocks in the grid. So,
356 00:54:30.239 --> 00:54:37.650 okay, and we're allowing for the fact that the image might not be a multiple of 16.
357 00:54:43.139 --> 00:54:51.750 Okay, so the point is that if your threads form a 16-by-16 block, you can block out your
358 00:54:51.750 --> 00:54:55.650 2-D block of data in your image, and so on.
359 00:54:55.650 --> 00:54:59.880 The point here: not all threads take the same control paths,
360 00:54:59.880 --> 00:55:07.650 and the things that differ are just all of your edge conditions.
361 00:55:10.440 --> 00:55:14.159 Okay, that was simple.
362 00:55:14.159 --> 00:55:17.849 A few seconds to ask questions.
363 00:55:24.150 --> 00:55:31.920 Nothing interesting on this slide, I think; I'm being unfair.
364 00:55:34.170 --> 00:55:40.170 Okay, multi-dimensional grid kernel: why you'd want to use this.
365 00:55:41.215 --> 00:55:54.625 So you're doing some RGB scaling, and you've got, this is 3-dimensional data, naturally: one dimension row, a 2nd dimension column, and the 3rd dimension is red versus green versus blue. Okay.
366 00:55:54.925 --> 00:55:56.454 We want to do some operation.
367 00:55:56.789 --> 00:56:02.730 This thing here: I've got to show you this thing in the lower right, for those of you that haven't seen it.
368 00:56:02.730 --> 00:56:12.690 Um, this is cool. It's unrelated to this course, but it's related to computer graphics.
369 00:56:12.690 --> 00:56:15.929 Okay, what's happening here?
370 00:56:17.940 --> 00:56:30.989 Your human visual system processes colors nonlinearly, and what this diagram here shows is how your human visual system will mix colors.
371 00:56:30.989 --> 00:56:39.659 The diagram is from a commission, the CIE,
372 00:56:39.659 --> 00:56:47.309 which did this in 1931. It's some French, Commission Internationale de l'Eclairage,
373 00:56:47.309 --> 00:56:50.550 which is the international lighting commission.
374 00:56:50.550 --> 00:56:55.650 And this maps colors that are the same intensity
375 00:56:55.650 --> 00:56:59.309 into a 2-dimensional coordinate system, x and y.
376 00:56:59.309 --> 00:57:06.750 Y is brightness, effectively, and what this shows is how colors will appear to mix.
377 00:57:06.750 --> 00:57:19.199 So if we have a color up here on the right and a color on the left, and we mix them, you take the linear combo of the coordinates, and
378 00:57:19.199 --> 00:57:27.630 that shows how the colors will mix in your visual system. So red, which is over here at perhaps 0.7,
379 00:57:27.630 --> 00:57:32.400 0.3: mix it 50-50 with cyan and you'll get white in the middle.
380 00:57:32.400 --> 00:57:38.909 Red and green will mix to get yellow. So this shows the apparent effect,
381 00:57:38.909 --> 00:57:46.320 to a human being, of how colors mix. So red and green mix to get yellow; red and cyan mix to get white.
382 00:57:46.320 --> 00:57:53.309 Red and blue mix to get something down here, which is not a spectral color; it's what we call purple.
383 00:57:53.309 --> 00:57:58.679 And the pure spectral colors are around the outside
384 00:57:58.679 --> 00:58:03.300 curve: from long-wavelength, low-frequency red
385 00:58:03.300 --> 00:58:08.909 to the short-wavelength, high-frequency violet here.
386 00:58:08.909 --> 00:58:14.340 And what's very nice is, this curve was determined by
387 00:58:14.340 --> 00:58:22.949 experiments on people. And the triangle here would be: if you have a 3-dimensional
388 00:58:24.269 --> 00:58:36.510 color system, printing on paper, or mixing some pure colors from some color sources, and if your 3 sources are the vertices of the triangle,
389 00:58:36.510 --> 00:58:46.739 then the colors that you can generate are points in the interior of the triangle. So if these are the 3 primary colors available for the triangle, you cannot generate anything out here.
390 00:58:46.739 --> 00:58:50.760 Okay; that diagram.
391 00:58:52.050 --> 00:58:55.320 It's sort of fun. It's not parallel computing, but,
392 00:58:55.320 --> 00:59:03.750 it looks like I'm teaching computer graphics again next fall, since no one else can teach it and there's student demand for it. So I'll be teaching this again.
393 00:59:03.750 --> 00:59:08.460 Okay, how do you do something like this on the
394 00:59:08.460 --> 00:59:13.590 device? Doing something: maybe you want to
395 00:59:13.590 --> 00:59:27.510 mix things with some formula, and this here is the official weighting formula, I think, for how you generate grayscale from RGB.
396 00:59:27.510 --> 00:59:36.000 To people, green appears brighter than red, and blue does not appear very bright at all.
397 00:59:36.000 --> 00:59:42.179 And these are the official weights to convert R and G and B to gray.
398 00:59:42.179 --> 00:59:46.019 We want to do that very fast; it's just a working example.
399 00:59:51.179 --> 00:59:57.780 Your skeleton code: you're bringing in the RGB image, which will be input,
400 00:59:57.780 --> 01:00:02.969 and your grayscale image, which will be output, and you have to know the width and height.
401 01:00:02.969 --> 01:00:10.110 Unsigned char, because the primaries, the intensities, are just 8 bits per pixel.
402 01:00:10.110 --> 01:00:14.489 Just as an aside: if you're doing high-quality
403 01:00:14.489 --> 01:00:19.320 processing, 8 bits is not enough to represent
404 01:00:19.320 --> 01:00:23.909 a color in a pixel. You probably want 12 bits,
405 01:00:23.909 --> 01:00:38.545 or 16 if you could do it. You can actually see green with better than 1-part-in-256 resolution; you can see the difference if the low bit changes for the green channel,
406 01:00:38.545 --> 01:00:39.295 for example, sometimes.
407 01:00:40.800 --> 01:00:44.460 And obviously, if you're doing mixing, you want to have extra bits to avoid
408 01:00:44.460 --> 01:00:49.469 truncation error. Okay. Nothing interesting here.
409 01:00:52.045 --> 01:01:06.985 And so we get the offset to where the colors for a pixel start, and we're just reading R, G, and B from the image into register variables,
410 01:01:08.940 --> 01:01:14.159 um,
411 01:01:15.809 --> 01:01:22.320 and then computing an output value down here.
412 01:01:22.320 --> 01:01:27.239 Floating point, yeah. Okay.
413 01:01:27.239 --> 01:01:34.320 And this would be your device program to convert from a color image to a
414 01:01:34.320 --> 01:01:41.250 grayscale image, doing every pixel in parallel, because it's 1 thread per pixel.
415 01:01:41.250 --> 01:01:46.949 So, nothing weird here.
416 01:01:46.949 --> 01:01:52.110 I'll leave that up for a second or two in case there are questions.
417 01:01:57.239 --> 01:02:05.429 Okay.
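A sketch of that grayscale kernel; the names are hypothetical, and the weights shown are the ITU-R BT.601 luminance coefficients, which may differ slightly from the slide's exact constants:

```cpp
// One thread per pixel: read the 3 color channels, write 1 gray value.
__global__ void rgbToGray(unsigned char *gray, const unsigned char *rgb,
                          int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int i = row * width + col;   // row-major pixel index
        int o = 3 * i;               // 3 bytes per pixel, so the RGB offset
        unsigned char r = rgb[o], g = rgb[o + 1], b = rgb[o + 2];
        // Green counts most, blue least: the perceptual weighting.
        gray[i] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
    }
}
```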
418 01:02:08.159 --> 01:02:18.059 We're indexing in here to determine the access for the thread, and this works because we've got a 2-dimensional array of threads in the block.
419 01:02:23.670 --> 01:02:27.480 Silence.
420 01:02:30.420 --> 01:02:35.760 Some more fun stuff.
421 01:02:39.780 --> 01:02:47.789 Suppose we want to blur the data like this; we use some little convolution filter to blur it.
422 01:02:47.789 --> 01:02:53.820 Okay, so here, what's new is that
423 01:02:55.710 --> 01:03:08.400 each thread not only gets the pixel that it's processing, but has to get this convolution support region of adjacent pixels. Now,
424 01:03:09.659 --> 01:03:19.050 those of you that took computer graphics, with vertex shaders and fragment shaders:
425 01:03:19.050 --> 01:03:26.730 this sort of thing cannot be done in that vertex/fragment
426 01:03:26.730 --> 01:03:31.170 model in computer graphics, actually, because
427 01:03:31.170 --> 01:03:44.190 the parallel idea in OpenGL, where you have fragment shaders and each fragment shader processes 1 pixel, has no access to the data in adjacent pixels.
428 01:03:44.190 --> 01:03:49.920 To do it with OpenGL would take two steps. But OpenGL's getting obsolete now.
429 01:03:49.920 --> 01:03:58.980 So, in any case, here the thread needs to have access to adjacent pixels.
430 01:03:58.980 --> 01:04:04.679 Blurring. Okay. So,
431 01:04:04.679 --> 01:04:14.369 this will be the global routine that runs on the device; that's the one where a CUDA thread will compute 1 pixel.
432 01:04:15.659 --> 01:04:20.730 We do our bounds checks, because the last, um,
433 01:04:20.730 --> 01:04:31.920 thread block may go off the edge of the image, and we don't want to be processing data off the edge of the image, because it's someone else's data.
434 01:04:31.920 --> 01:04:39.000 Reading may be okay; I don't want to write it. So, what is happening here?
435 01:04:42.420 --> 01:04:50.039 What we're doing here is we are
436 01:04:50.039 --> 01:04:53.610 iterating over the adjacent pixels.
437 01:04:53.610 --> 01:04:57.360 You know, we're computing a pixel:
438 01:04:59.099 --> 01:05:07.650 up here, we're computing the row and column for the pixel that we're computing, and we're computing it from
439 01:05:07.650 --> 01:05:11.070 the index of this particular thread.
440 01:05:12.179 --> 01:05:16.980 Okay, there are the row and column. What we're doing down here
441 01:05:16.980 --> 01:05:29.969 is we're iterating over the adjacent pixels. So maybe our convolution window is 3 by 3, so we want to go to the left, the right, above, and below, and that's what we're doing here: iterating.
442 01:05:31.170 --> 01:05:42.300 And then down here, we're going in and just summing in the adjacent pixel values,
443 01:05:42.300 --> 01:05:46.170 if we're within bounds. So here
444 01:05:46.170 --> 01:05:49.769 we're adding in, um,
445 01:05:50.789 --> 01:06:04.650 adding up the values of all the pixels in the window around our current pixel. That's what's happening here, and we do it only if it's within the image; we've got to check that we don't go off the edge of the image in either direction,
446 01:06:04.650 --> 01:06:08.099 so: greater than minus 1, and less than the size.
447 01:06:08.099 --> 01:06:16.710 If it was in bounds, we add it into our running total brightness, and we keep track of how many pixels we added.
448 01:06:16.710 --> 01:06:22.889 So we iterate over our convolution window, and then we write our output pixel here:
449 01:06:22.889 --> 01:06:29.010 our pixel value, and then we normalize by the number of pixels we actually added in.
450 01:06:30.389 --> 01:06:34.139 And we convert back to unsigned char,
451 01:06:34.139 --> 01:06:40.800 assuming, you know, we're assuming that a char is 8 bits,
452 01:06:40.800 --> 01:06:55.500 which on, I guess, all modern architectures is true; not on the one I used as a student. And unsigned: watch this one, because if you don't say unsigned, perhaps the char is signed
453 01:06:55.500 --> 01:06:59.610 and goes from minus 128 to plus 127.
454 01:07:01.530 --> 01:07:08.670 And I'm not even completely certain what the C++ standard says about a char if you don't say signed or unsigned;
455 01:07:09.840 --> 01:07:13.019 maybe it's signed by default, I don't know.
456 01:07:13.019 --> 01:07:20.340 Okay, so this was showing how to do this convolution on the GPU.
457 01:07:20.340 --> 01:07:23.789 So now, if you think about this here,
458 01:07:23.789 --> 01:07:29.730 okay, think about how this thing is implemented in hardware. It's
459 01:07:29.730 --> 01:07:38.190 still tricky. See, what's happening here is we're reading stuff from the global memory,
460 01:07:39.329 --> 01:07:47.940 and I said that the global memory has a latency of maybe 100 cycles, depending.
461 01:07:47.940 --> 01:07:59.190 And so this would mean that that line right here in your program is going to wait 100 cycles, and that sort of kills your parallel performance;
462 01:07:59.190 --> 01:08:02.849 the whole thing might be only 100 cycles, or a few hundred cycles.
463 01:08:02.849 --> 01:08:07.860 Okay, so why is this not a performance killer?
464 01:08:09.030 --> 01:08:17.189 A couple of reasons. The first reason is that adjacent threads,
465 01:08:17.189 --> 01:08:20.250 as we're iterating in row and column,
466 01:08:21.899 --> 01:08:29.640 adjacent threads, well, not just as we iterate, I mean the base pixel here:
467 01:08:29.640 --> 01:08:34.050 you see, adjacent threads are reading adjacent
468 01:08:34.050 --> 01:08:38.130 pixels. Now,
469 01:08:38.130 --> 01:08:41.340 a read instruction from the global memory
470 01:08:41.340 --> 01:08:46.890 reads, I think, 128 bytes in one go. There is
471 01:08:46.890 --> 01:08:49.920 this 100-cycle latency,
472 01:08:49.920 --> 01:08:54.569 but then, bang, the 128 bytes come in, and I believe
473 01:08:54.569 --> 01:08:58.859 it can read the next 128 bytes in the next cycle. So
474 01:08:58.859 --> 01:09:08.729 there's a latency, but once you pay the latency, the bandwidth is fast; really, really, really fast, actually. So
475 01:09:08.729 --> 01:09:17.460 what this means is that adjacent threads are reading adjacent
476 01:09:18.539 --> 01:09:30.090 pixels, and they're physically adjacent in the global memory. So one 128-byte read from the global memory provides data for 32 threads, the whole warp.
477 01:09:31.199 --> 01:09:40.229 So there's a 100-cycle latency, but bang: the 32 threads, the warp, in the next cycle all get their pixel.
478 01:09:41.250 --> 01:09:45.930 And then in the next cycle, the next 32 threads,
479 01:09:45.930 --> 01:09:52.439 all those threads, get their data. So this is the design philosophy
480 01:09:53.640 --> 01:10:07.920 underlying it: it goes for bandwidth. That is a really big bandwidth when you've got a lot of threads, because each cycle you get 128 bytes
481 01:10:07.920 --> 01:10:16.109 of data from the global memory.
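Putting the walkthrough together, here is a sketch of such a blur kernel; the names and the BLUR_SIZE constant are illustrative, and BLUR_SIZE of 1 gives the 3-by-3 window mentioned:

```cpp
#define BLUR_SIZE 1   // window half-width: 1 gives a 3x3 box

// Each thread averages the pixels in a (2*BLUR_SIZE+1)^2 window around its
// own pixel, clipping the window at the image edges.
__global__ void blur(const unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {                              // the fractional-block check
        int sum = 0, count = 0;
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr)       // above and below
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) { // left and right
                int r = row + dr, c = col + dc;
                if (r > -1 && r < h && c > -1 && c < w) {      // stay inside the image
                    sum += in[r * w + c];                      // sum in the neighbor
                    ++count;                                   // pixels actually added
                }
            }
        out[row * w + col] = (unsigned char)(sum / count);     // normalize and store
    }
}
```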
So what NVIDIA did is they traded off latency for bandwidth.
482 01:10:16.109 --> 01:10:20.430 You've got that latency to get started, but once it goes, it
483 01:10:20.430 --> 01:10:26.939 goes. And we had a slide, like two days ago or something, showing it: whereas on the host, on the Intel,
484 01:10:26.939 --> 01:10:37.289 you don't have this sort of latency, but your bandwidth is slower. So the bandwidth inside the GPU is really fast,
485 01:10:37.289 --> 01:10:48.329 but you have to work with it. And the way you work with it is, well, you have thousands of threads, so the high bandwidth: the thousands of threads need
486 01:10:48.329 --> 01:10:55.289 a lot of data. And the thing is, the threads are accessing adjacent words
487 01:10:55.289 --> 01:11:01.409 in the memory, and that's how the high bandwidth is useful.
488 01:11:01.409 --> 01:11:08.609 You see? So lots of threads means the bandwidth is useful, if the programmer
489 01:11:08.609 --> 01:11:13.979 cooperates. You see, so there is your trade-off: high latency, but
490 01:11:13.979 --> 01:11:21.000 very high bandwidth. That's one of the keys, and thousands of threads running in parallel.
491 01:11:21.000 --> 01:11:26.640 However, the code in the threads has to be simple,
492 01:11:26.640 --> 01:11:38.100 which is why you've got the single instruction, multiple thread concept; SIMT is an acronym that NVIDIA tends to use, single instruction, multiple thread.
493 01:11:38.100 --> 01:11:47.699 And this is why, as I say, weird data structures, like the stuff they love to teach in CS1 or Data Structures or something, are not
494 01:11:47.699 --> 01:11:57.329 totally efficient on the device. Pointer chasing, for example; anything which throws the threads in the warp out of sync with each other
495 01:11:57.329 --> 01:12:04.949 is going to be horribly slow. So: pointer chasing, recursion; you want your nice and simple
496 01:12:04.949 --> 01:12:12.270 data structures. It's called a structure of arrays here. The ideal data structure
497 01:12:12.270 --> 01:12:19.199 for the device is an array of plain old data types: an array of ints, an array of floats,
498 01:12:21.385 --> 01:12:34.345 not even an array of 3-D coordinates; that's bad. Down here: you would have an array of Xs, an array of Ys, and an array of Zs, not an array of 3-D points. So, a structure of arrays; that's the way
499 01:12:34.890 --> 01:12:39.569 you structure your data and your code, so as to use the hardware.
500 01:12:39.569 --> 01:12:45.539 Okay, so this is showing, and again, so,
501 01:12:45.539 --> 01:12:50.520 again, so the thread is accessing adjacent pixels, and
502 01:12:50.520 --> 01:12:55.619 along the row they're adjacent, and along the column
503 01:12:55.619 --> 01:13:02.970 they're a fixed offset from each other. But again, for adjacent threads the offsets work out, so probably,
504 01:13:02.970 --> 01:13:08.729 with a cache that we're not talking about, this sort of thing is going to be fast.
505 01:13:08.729 --> 01:13:13.920 So, okay, so that's the deep lesson on this slide here,
506 01:13:13.920 --> 01:13:18.779 and I'll leave it up in case there are questions. So,
507 01:13:18.779 --> 01:13:21.930 so, it's exploiting the, um,
508 01:13:21.930 --> 01:13:29.850 the hardware: the pixels, the whole image, are in global memory,
509 01:13:29.850 --> 01:13:36.149 and adjacent pixels are adjacent in memory. There's a latency to start reading data.
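To make the structure-of-arrays point concrete, a hypothetical illustration, not from the slides:

```cpp
// Array of structures: thread i reading pts[i].x makes the warp's 32 loads
// stride 12 bytes apart, wasting most of each 128-byte memory transaction.
struct Point3 { float x, y, z; };
Point3 pts[10000];                 // discouraged layout for device code

// Structure of arrays: thread i reads xs[i], so a warp's 32 loads are
// 32 consecutive 4-byte words: one coalesced 128-byte read.
struct Points {
    float xs[10000];
    float ys[10000];
    float zs[10000];
};
```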
511 01:13:40.800 --> 01:13:46.140 So the key is, a read fetches 128 bytes from the global memory,
512 01:13:46.140 --> 01:13:54.750 and that's not for one thread. So one thread requests a byte or a word — 32 threads, 32 words — but the thing is, that read is available to all the threads in the warp.
513 01:13:54.750 --> 01:13:59.039 So that's all handled invisibly. That's the key.
514 01:13:59.039 --> 01:14:05.729 Okay, let's look at —
515 01:14:05.729 --> 01:14:09.149 let's see what's happening next.
516 01:14:09.149 --> 01:14:13.170 [silence]
517 01:14:16.380 --> 01:14:20.159 Okay, what do we have here?
518 01:14:22.020 --> 01:14:28.859 Oh, okay, so now we're getting into some hardware stuff.
519 01:14:28.859 --> 01:14:38.100 I mean, a thread block is a software concept, but some hardware has to then execute it.
520 01:14:38.100 --> 01:14:41.609 And capacity constraints, as I mentioned:
521 01:14:42.659 --> 01:14:46.260 there are limited amounts of stuff like registers,
522 01:14:46.260 --> 01:14:49.289 floating-point processors, and so on.
523 01:14:49.289 --> 01:15:02.100 And so this may mean that some blocks and warps will run one after the other, not all at the same time. And the zero overhead — this is this thing that you've got a queue,
524 01:15:02.100 --> 01:15:09.119 or — maybe not a queue at the thread level, but you have a number of warps that want to run,
525 01:15:09.119 --> 01:15:12.960 and as soon as resources are available, the next cycle,
526 01:15:12.960 --> 01:15:16.170 a warp runs. Zero overhead. And —
527 01:15:16.170 --> 01:15:22.020 I don't know the details; I'm inferring that it's done with asynchronous logic,
528 01:15:22.020 --> 01:15:27.149 which is tricky to design —
529 01:15:27.149 --> 01:15:30.989 subject to a lot of hazards; that's the buzzword used.
530 01:15:30.989 --> 01:15:37.229 A lot of hazards. But if you can get it to work, it's fast.
531 01:15:37.229 --> 01:15:41.880 Hazards are the sort of thing that, um —
532 01:15:41.880 --> 01:15:48.329 you know, you've got gates: an AND gate, an OR gate, or physically a NAND gate.
533 01:15:48.329 --> 01:15:54.539 So you change the inputs, and the output changes a nanosecond later, let's say.
534 01:15:55.619 --> 01:16:07.710 So now, let's suppose one of the inputs to the gate has a NOT on it. The NOT takes a fraction of a nanosecond after its input changes. So
535 01:16:07.710 --> 01:16:11.760 if one input to, say, a, um,
536 01:16:11.760 --> 01:16:15.270 NAND gate has a NOT, but not the other one,
537 01:16:15.270 --> 01:16:23.970 so to speak, then the two inputs to the gate are available at different times. The NAND gate's output immediately reflects its input changes —
538 01:16:23.970 --> 01:16:32.430 well, a nanosecond later, perhaps. So if the inputs to the gate are not available at the same time,
539 01:16:32.430 --> 01:16:38.640 there's a little interval when one input has the proper value, but the other input does not yet,
540 01:16:38.640 --> 01:16:45.659 and in that little interval the output from the NAND gate will be wrong — it'll be fake; there'll be a fake blip,
541 01:16:45.659 --> 01:16:49.560 which will go away once all the inputs have,
542 01:16:49.560 --> 01:16:53.430 you know, stabilized. But that little output blip,
543 01:16:53.430 --> 01:16:57.600 which in the software shouldn't be there — you know, this,
544 01:16:57.600 --> 01:17:10.229 this might be a problem if you don't design for it, because if you're counting blips, this is an extra blip, let's say. So there's such an issue that happens with asynchronous hardware design, but
545 01:17:10.229 --> 01:17:15.180 if you can make it work, it's fast. The reason they go synchronous —
546 01:17:15.180 --> 01:17:23.159 CPUs, where you have a clock and everything waits for the next clock cycle — is that the clock slows things down, yeah,
547 01:17:23.159 --> 01:17:29.310 but it makes things work. So you have a data bus that's got 32
548 01:17:29.310 --> 01:17:40.260 bits on it. The thing is, the 32 bits may arrive at different times. Well, you don't look at them until the next clock cycle. See, it slows you down, but stuff gets reliable. Okay.
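Going back to the capacity constraints a moment ago — you can actually ask the CUDA runtime how many blocks of a given kernel fit on one streaming multiprocessor at a time. A sketch, assuming a hypothetical kernel named myKernel; cudaOccupancyMaxActiveBlocksPerMultiprocessor is a real runtime call:

    // Sketch: query how many blocks of myKernel can be resident
    // per SM at once, given its register and shared-memory use.
    // Blocks beyond that wait in the queue, as described above.
    #include <cstdio>

    __global__ void myKernel(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] += 1.0f;                  // placeholder work
    }

    int main()
    {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, myKernel, /*blockSize=*/256,
            /*dynamicSMemBytes=*/0);
        printf("resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }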
549 01:17:40.260 --> 01:17:45.180 What is transparent scalability?
550 01:17:46.380 --> 01:17:49.529 And then I'll stop in a minute, because we'll get into it next class.
551 01:17:49.529 --> 01:17:55.500 This is a big idea. What it's saying is that NVIDIA's hardware —
552 01:17:55.500 --> 01:18:02.850 there are different versions of the device, of the GPU, with different numbers
553 01:18:03.354 --> 01:18:12.295 of hardware resources: stuff like the number of streaming multiprocessors and, accordingly, the number of CUDA cores and whatever.
554 01:18:12.295 --> 01:18:20.725 So if you've got lots of threads — if there are more threads in your software than there are CUDA cores in the hardware —
555 01:18:21.029 --> 01:18:29.760 it doesn't matter. The threads just wait, and then they run when they can. And if you run your program on a bigger, faster GPU,
556 01:18:29.760 --> 01:18:36.239 it will run faster, but if you did it right, you get the same answer.
557 01:18:36.239 --> 01:18:42.840 So this is transparent scalability: you buy a more expensive GPU and you plug it in,
558 01:18:42.840 --> 01:18:46.260 and your program will run; it will just run faster. So
559 01:18:46.260 --> 01:18:53.939 the hardware scales up, and it's transparent to the user, unless you're checking some real-time microsecond behavior.
560 01:18:53.939 --> 01:19:00.119 This is a powerful idea here, and it's got a good history, actually.
561 01:19:00.119 --> 01:19:08.699 Many, many years ago — the early 1960s — the Itty Bitty Machine Corporation —
562 01:19:08.699 --> 01:19:12.720 oh, I'm sorry, the International Business Machines Corporation —
563 01:19:12.720 --> 01:19:26.159 they did the same thing. They had this series — it was called their System/360 — and they had like half a dozen machines, from small and cheap and slow up to big and expensive and fast.
564 01:19:26.159 --> 01:19:34.979 And they had this idea of transparent scalability: all their machines ran the same instruction set, ran the same program.
565 01:19:34.979 --> 01:19:47.279 It's just that the small, slow machines did a lot of emulation and so on, and the fast machines threw lots of gates at it — the expensive machines did it fast. But the same program,
566 01:19:47.279 --> 01:19:54.630 in principle, would run. And NVIDIA is doing the same, and IBM became the biggest computer company in the world
567 01:19:54.630 --> 01:20:03.659 by doing things like this. At the start they had a number of competitors; at the end they did not. So,
568 01:20:03.659 --> 01:20:07.949 so NVIDIA is doing the same thing here: transparent scalability.
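One common way to write your own code so it scales transparently across GPUs of different sizes is a grid-stride loop — a minimal sketch, not tied to anything specific in the lecture:

    // Sketch: a grid-stride loop. The same kernel is correct
    // whether the hardware runs 2 blocks at a time or 200; a
    // bigger GPU just means fewer trips around the loop per
    // thread, i.e. the program simply runs faster.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }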
569 01:20:07.949 --> 01:20:13.859 So that's a good point to stop. What am I on here? Lecture —
570 01:20:15.659 --> 01:20:26.395 whatever — 3.5, transparent scalability. We'll pick up there, and we'll actually run some programs. I wanted to do that today, but I've been at the desktop presenting slides to you today.
571 01:20:26.395 --> 01:20:39.145 Now, if you think I'm going too slowly, you are welcome to read ahead of me. I'm just going through this at a natural speed, because this is interesting stuff here. So, what we're doing for several weeks is
572 01:20:40.229 --> 01:20:54.569 seeing how GPUs work. Okay, what will happen next? Anticipating the future of the course: we may see another software tool or something to use the hardware, perhaps something like
573 01:20:55.614 --> 01:21:09.204 Thrust, which is a parallel version of the Standard Template Library — a way to write C++ code that will run fast. It's functional programming, and it's designed to run fast on the device.
574 01:21:09.925 --> 01:21:13.375 We might see that; we might see some parallel
575 01:21:13.680 --> 01:21:17.430 stuff with the current C++ standard, um,
576 01:21:17.430 --> 01:21:21.960 and then the next chunk will be —
577 01:21:22.944 --> 01:21:33.744 we'll do a chunk on quantum computing, which I had a full course on in the fall, but I noticed none of you in my parallel class were also in my quantum class in the fall.
578 01:21:33.744 --> 01:21:39.324 So I think you would like to have a good chunk — a month or so — on quantum computing, which we will do.
579 01:21:39.630 --> 01:21:44.460 Oh, and by the way, RPI is thinking about
580 01:21:44.460 --> 01:21:48.000 emphasizing quantum computing more in the curriculum.
581 01:21:48.000 --> 01:21:57.779 So, we don't know quite what that means, but we can see it as a competitive advantage if we make quantum computing more important in the curriculum, whatever that takes. So,
582 01:21:57.779 --> 01:22:01.289 okay, so that's enough new stuff for today.
583 01:22:01.289 --> 01:22:06.810 If there are
584 01:22:06.810 --> 01:22:11.279 questions, I'll stay around a minute or two; other than that,
585 01:22:11.279 --> 01:22:14.789 we can all go off and get lunch.
586 01:22:14.789 --> 01:22:20.220 Questions? Yes.
587 01:22:20.220 --> 01:22:23.310 How does GPU parallelization
588 01:22:23.310 --> 01:22:27.689 interact with CPU-based parallelization?
589 01:22:27.689 --> 01:22:33.930 It does not. They are two unrelated things. You can run multicore
590 01:22:33.930 --> 01:22:39.119 on the host at the same time as you're doing the many cores —
591 01:22:39.119 --> 01:22:46.020 the thousand threads — on the device.
592 01:22:46.020 --> 01:22:50.250 They don't affect each other. You can write one program which
593 01:22:50.250 --> 01:22:57.539 does both. If you're going to be calling a global routine from
594 01:22:57.539 --> 01:23:01.109 inside a parallel region in OpenACC —
595 01:23:01.109 --> 01:23:06.989 that could be a fun project. So, how you would —
596 01:23:08.159 --> 01:23:15.210 yeah, there's no reason you can't. It's just, if you're inside a parallel block in OpenACC —
597 01:23:17.699 --> 01:23:20.909 well, what would that mean, to then call —
598 01:23:23.069 --> 01:23:29.880 you know, call CUDA. What I haven't gotten to yet is that your host program can start several kernels
599 01:23:29.880 --> 01:23:36.899 on the device. So you could have multiple threads in OpenMP, each starting a separate kernel on the device. You've got to keep your data —
600 01:23:36.899 --> 01:23:43.619 you know, your addressing — straight, but sure.
So what happens — the device is like, it's like a mini
601 01:23:43.619 --> 01:23:51.479 time-sharing operating system, actually; they don't call it that. And so, if you started a hundred
602 01:23:51.479 --> 01:23:59.640 kernels on the device — if you go back up to here, I'm going to —
603 01:23:59.640 --> 01:24:09.449 so, there's a limit to how many kernels you can run at once, but if you've got more — I think you can have a thousand or more — they queue up and they wait.
604 01:24:09.449 --> 01:24:15.180 So, yeah, you could have an OpenACC parallel loop there
605 01:24:15.180 --> 01:24:23.460 starting up lots of kernels, and then on the device they just sit and wait until they can run, and
606 01:24:23.460 --> 01:24:28.170 it probably would run very fast because, you know,
607 01:24:28.170 --> 01:24:32.609 if they're using different resources on the device, that would fit together nicely.
608 01:24:34.199 --> 01:24:38.069 Anything else? That could be a really fun thing to try.
609 01:24:38.069 --> 01:24:42.899 Thanks for the suggestion. Other suggestions?
610 01:24:44.909 --> 01:24:47.909 No? In that case —
611 01:24:49.050 --> 01:24:53.760 cool. Let me just —
612 01:24:55.560 --> 01:24:59.220 I need to save this.
613 01:24:59.220 --> 01:25:06.779 And —
614 01:25:08.250 --> 01:25:16.439 and see you Monday. Have a good weekend; get out, get some exercise or something.
615 01:25:16.439 --> 01:25:21.689 Okay.
616 01:25:44.010 --> 01:25:48.659 Okay.
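As a postscript to the last question — a host program starting several kernels that the device then queues and runs as resources allow — here is a minimal sketch using CUDA streams. The kernel, sizes, and names are all made up for illustration:

    #include <cuda_runtime.h>

    __global__ void myKernel(float *p)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        p[i] *= 2.0f;                  // placeholder work
    }

    int main()
    {
        const int K = 4;               // number of kernels to queue
        float *d_data[K];
        cudaStream_t streams[K];

        for (int k = 0; k < K; ++k) {
            cudaMalloc(&d_data[k], 64 * 256 * sizeof(float));
            cudaStreamCreate(&streams[k]);
            // Each launch returns immediately; the device queues
            // the kernels and runs them as resources free up.
            myKernel<<<64, 256, 0, streams[k]>>>(d_data[k]);
        }

        cudaDeviceSynchronize();       // wait for all of them
        for (int k = 0; k < K; ++k) {
            cudaStreamDestroy(streams[k]);
            cudaFree(d_data[k]);
        }
        return 0;
    }

Putting each launch in its own stream lets independent kernels overlap on the device, which is one way the multi-threaded host program described above could fit together nicely.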