WEBVTT 1 00:00:10.648 --> 00:00:13.829 Silence. 2 00:01:42.989 --> 00:01:52.200 To be clear, did anybody find the homework? I think he just posted it today. 3 00:01:52.200 --> 00:01:56.549 So nobody did it because nobody could find it, right? Like, yeah, it wasn't. 4 00:01:56.549 --> 00:01:59.640 There was no writing in the document. 5 00:01:59.640 --> 00:02:07.260 Until today, I think, because I couldn't find it, and I don't think I, no, no. 6 00:02:07.260 --> 00:02:10.919 Thank you. 7 00:02:43.110 --> 00:02:48.719 Silence. 8 00:02:51.990 --> 00:02:57.419 Silence. 9 00:03:01.830 --> 00:03:05.400 Silence. 10 00:03:08.819 --> 00:03:13.979 Silence. 11 00:03:31.740 --> 00:03:56.879 Silence. 12 00:03:56.879 --> 00:04:00.060 Okay. 13 00:04:00.060 --> 00:04:09.719 Playback again. Let's see. Now. 14 00:04:30.389 --> 00:04:34.079 So, so. 15 00:04:43.918 --> 00:05:02.369 Silence. 16 00:05:03.838 --> 00:05:08.879 Let's see now, no audio. 17 00:05:10.588 --> 00:05:22.528 Silence. Right? 18 00:05:22.528 --> 00:05:32.788 Silence. 19 00:05:32.788 --> 00:05:37.798 Right. Silence. 20 00:05:42.209 --> 00:05:49.709 Silence. 21 00:05:58.168 --> 00:06:02.009 Silence. 22 00:06:02.009 --> 00:06:07.738 Well. 23 00:06:12.449 --> 00:06:21.059 So, if someone can hear me, then mention it in chat; I cannot tell if you're hearing me at the moment. 24 00:06:22.379 --> 00:06:27.959 Good. Okay. Thank you. 25 00:06:27.959 --> 00:06:31.769 So, Parallel Computing, 26 00:06:31.769 --> 00:06:34.949 March 4th. 27 00:06:34.949 --> 00:06:39.658 And, Terry, you can see the screen. 28 00:06:39.658 --> 00:06:44.278 So, we're continuing on talking about NVIDIA, 29 00:06:44.278 --> 00:06:49.019 and we're continuing on with their work, 30 00:06:49.019 --> 00:06:56.939 with material from their website. I've also linked here to some newer stuff, which we'll cover in detail later. 31 00:06:56.939 --> 00:07:00.209 Ampere is the current 32 00:07:01.314 --> 00:07:10.163 architecture, out in the last year, and there's also a white paper that talks about things in more detail, about caches and so on. 33 00:07:10.673 --> 00:07:17.153 But meanwhile, I just put a new homework up, on what we're talking about, to do in parallel. 34 00:07:17.579 --> 00:07:20.579 So implement that and see how that works out. 35 00:07:20.579 --> 00:07:26.129 Okay, and I'm going to be going fairly quickly through some of these next set of lectures. 36 00:07:26.129 --> 00:07:32.668 So, module 6.2, let us see here. 37 00:07:34.288 --> 00:07:40.858 Silence. 38 00:07:44.488 --> 00:07:48.449 Silence. 39 00:07:50.788 --> 00:07:55.978 Okay, Eva. Okay. 40 00:07:55.978 --> 00:08:03.238 Memory coalescing. So what's happening here is that you read a chunk of memory: 41 00:08:03.238 --> 00:08:08.759 you read some data from the global memory on the GPU, 42 00:08:08.759 --> 00:08:12.598 and it reads, it reads a chunk. It's a 128- 43 00:08:12.598 --> 00:08:17.129 byte chunk, and this is implemented as the cache, actually. So 44 00:08:17.129 --> 00:08:21.538 you read it, and it reads the whole chunk into the cache. Now that's available 45 00:08:21.538 --> 00:08:25.678 for other threads, that is, 46 00:08:25.678 --> 00:08:29.038 if the other threads are accessing adjacent memory. 47 00:08:29.038 --> 00:08:35.578 And they call them burst sections here, but they mean chunks that go into the cache. So, 48 00:08:36.839 --> 00:08:46.193 128 bytes, as I said. So that if consecutive threads are reading consecutive words from the global memory, that's nice, 49 00:08:46.193 --> 00:08:52.374 because those consecutive words now got loaded into the cache, and the other threads can get them really fast.
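Roughly, the contrast being described looks like this. A minimal sketch, not the slides' code; the array A, its size n, and the kernel names are this example's own, and it assumes the grid supplies at least n threads:

```cuda
// Sketch: row access vs. column access on an n-by-n row-major array.
__global__ void sum_coalesced(const float *A, float *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    float sum = 0.0f;
    for (int row = 0; row < n; ++row)
        sum += A[row * n + t];   // threads t and t+1 read adjacent words:
                                 // one 128-byte burst serves the whole warp
    out[t] = sum;
}

__global__ void sum_uncoalesced(const float *A, float *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += A[t * n + col];   // threads t and t+1 are n words apart:
                                 // separate trips back to global memory
    out[t] = sum;
}
```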
50 00:08:53.938 --> 00:08:58.889 If they're not reading consecutive words, then 51 00:08:58.889 --> 00:09:03.989 they're not necessarily all in the cache, so this requires multiple 52 00:09:03.989 --> 00:09:09.839 trips back to the global memory. 53 00:09:09.839 --> 00:09:19.889 And how you do it: so, threads in the warp read consecutive words, and if you have the array in row-major order, 54 00:09:19.889 --> 00:09:25.408 and adjacent threads are doing adjacent columns in the same row, then everything works out very nicely, 55 00:09:25.408 --> 00:09:29.458 like here on the left. 56 00:09:29.458 --> 00:09:34.708 On the right, if you're going by column, it's not so nice. 57 00:09:34.708 --> 00:09:37.889 So. 58 00:09:40.139 --> 00:09:47.009 And somehow they say B is not coalesced; they're saying some columns 59 00:09:47.009 --> 00:09:54.028 are coalesced, whatever; it depends on how you're labeling stuff, but okay. 60 00:09:54.028 --> 00:09:59.969 And one way to help, as mentioned before, is if you can group the, um, 61 00:09:59.969 --> 00:10:08.428 divide the input arrays into tiles, and then once you get it, the whole tile goes into the cache, 62 00:10:08.428 --> 00:10:15.688 or into fast shared memory, and then you can get at it easily. So we talked about that last time. So I'm going through this quickly. 63 00:10:15.688 --> 00:10:20.308 That was that one, past 6.2. Okay, to move on. 64 00:10:27.239 --> 00:10:30.599 Silence. 65 00:10:36.899 --> 00:10:42.208 Oh, okay. 66 00:10:42.208 --> 00:10:47.668 Okay, so here we're talking about what we will be, 67 00:10:47.668 --> 00:10:52.109 oops, back up. Okay. Now. 68 00:10:53.188 --> 00:10:56.548 It's a nice example of a program that 69 00:10:56.548 --> 00:10:59.818 demonstrates parallel computing. 70 00:10:59.818 --> 00:11:06.028 So, we'll use this for several classes. You all know what histograms are; I can skip through that. 71 00:11:06.028 --> 00:11:10.649 Nothing interesting here. 72 00:11:10.649 --> 00:11:14.369 Histogram in groups of 4 letters. 73 00:11:16.889 --> 00:11:22.589 Okay, now, the interesting content of these few slides 74 00:11:22.589 --> 00:11:27.958 is that there are different ways you can do the histogram program in parallel, 75 00:11:27.958 --> 00:11:35.129 and the obvious way might not be the best way. So there's actually information content here. 76 00:11:36.749 --> 00:11:44.938 And so the first way, if we wish to do the histogram in parallel: we might partition our input data, 77 00:11:46.379 --> 00:11:50.879 and each thread takes a chunk of the input data and does a, 78 00:11:50.879 --> 00:11:54.269 and, um, 79 00:11:54.269 --> 00:12:03.688 here the input text is "programming massively parallel processors". Thread 0 takes "programming", thread 1 takes "massively", 80 00:12:04.769 --> 00:12:07.769 then "parallel", and then "processors". 81 00:12:07.769 --> 00:12:12.028 And they are totaling into, here's a histogram 82 00:12:12.028 --> 00:12:17.278 array down at the bottom. Now, the potential problem 83 00:12:17.278 --> 00:12:27.778 is that different threads may increment the same counter in the histogram, right? Simultaneously. Like here, three threads, er, 84 00:12:28.374 --> 00:12:37.344 purely by chance, they've each got a letter in the m-to-p range. So all three threads simultaneously want to increment this counter here.
85 00:12:37.854 --> 00:12:46.224 Now what this means is, of course, you have to do some sort of locking or mutex or atomic read-modify-write instruction 86 00:12:46.528 --> 00:12:52.048 to keep this total correct. And, anticipating the next few slides: 87 00:12:52.048 --> 00:12:56.729 whatever you do, whatever you do, is going to be slow. 88 00:13:00.899 --> 00:13:06.119 Here now they're just going through this, incrementing counters here and so on. 89 00:13:12.354 --> 00:13:23.514 There's another problem with this. If I just go back here: so thread 0 gets this chunk of the input data, thread 1 gets that chunk, thread 2 gets the third chunk, 90 00:13:23.609 --> 00:13:30.538 and thread 3 gets the fourth chunk. A second problem with this partitioning method 91 00:13:30.538 --> 00:13:39.269 is that consecutive threads are not accessing successive words of memory, so we cannot take advantage of coalescing. 92 00:13:39.269 --> 00:13:45.089 And that's what it talks about here. We've got the red thread's reads, the green thread's, and the blue thread's. 93 00:13:45.089 --> 00:13:53.129 So, accesses are not coalesced. So if you're thinking ahead of me, what we want to do is intersperse things, perhaps. 94 00:13:53.129 --> 00:14:00.328 Okay, this is a bad way, so what do we do? We interleave, down here. So 95 00:14:00.328 --> 00:14:04.769 consecutive threads read consecutive bytes from memory, 96 00:14:04.769 --> 00:14:11.099 and this way, the memory accesses will be coalesced and will be a lot faster. 97 00:14:11.099 --> 00:14:15.958 Silence. 98 00:14:15.958 --> 00:14:29.219 So, thread 0 gets the characters that are 0 mod 4, thread 1 gets the characters that are 1 mod 4, et cetera, et cetera: interleaved partitioning. And that will give you better performance from the global memory. 99 00:14:30.298 --> 00:14:41.969 We still have the issue that multiple threads may be incrementing the same counter simultaneously, but we've improved one problem. 100 00:14:41.969 --> 00:14:47.639 That was the interleaved partitioning. 101 00:15:02.609 --> 00:15:16.589 Okay, this slide set is going to talk about implementing the shared global counter. And the problem that happens, it describes, is a data race, 102 00:15:17.818 --> 00:15:24.928 where two threads are updating the same thing. And, as we've seen before, like a month ago, atomic operations help with that, 103 00:15:24.928 --> 00:15:30.568 fix the problem. So, two threads accessing the same counter here. 104 00:15:35.969 --> 00:15:39.028 Oh, okay. There's nothing interesting there. 105 00:15:39.028 --> 00:15:43.528 I'm skipping through this, um, 106 00:15:43.528 --> 00:15:47.759 data race; you all know what data races are by now. 107 00:15:49.168 --> 00:16:02.759 Because they all read from the common memory location, update their private new value, write it back: you can all see what's happening. I don't need to go through this in detail, I think. If I do, then tell me. 108 00:16:04.469 --> 00:16:11.788 So, we have the atomic operations that we've talked about before. 109 00:16:11.788 --> 00:16:16.649 That was, ah, 7.2. 110 00:16:16.649 --> 00:16:20.219 That's what I said about going through some of these things fast. 111 00:16:23.639 --> 00:16:29.068 So, CUDA has an atomic operation to do this for you. 112 00:16:32.698 --> 00:16:36.778 Again, the data race problem. 113 00:16:38.908 --> 00:16:47.879 And what we have to do is ensure that no other threads can access that location until the first thread is finished.
114 00:16:47.879 --> 00:16:56.038 So, what this does is it makes that an atomic operation, so when we finish this here, it will be correct. 115 00:16:56.038 --> 00:17:03.389 However, it will be slower. And so, CUDA has these function calls. So, 116 00:17:03.389 --> 00:17:09.989 CUDA is some minor syntax extensions to C++, and some functions. 117 00:17:09.989 --> 00:17:18.388 And before, we've seen functions like memory copy and so on, allocate managed data; this one here does an atomic add. 118 00:17:18.388 --> 00:17:29.788 So, we give it the address that we want to add to, and how much we want to add to it. And we've got atomic operations: add, subtract, increment, decrement, min and max, swap. 119 00:17:29.788 --> 00:17:34.769 Um, compare-and-swap is a useful instruction, and also, 120 00:17:34.769 --> 00:17:40.769 okay, with compare-and-swap, you can use that sometimes as, like, 121 00:17:40.769 --> 00:17:46.679 a standard thing that is used to build the other atomic operations. In any case, atomic add, 122 00:17:48.838 --> 00:17:52.949 and it returns the old value. 123 00:17:54.388 --> 00:18:00.328 And there are others; there's 32-bit, 64-bit. This is a point here 124 00:18:00.328 --> 00:18:08.519 which I didn't understand at first: in parallel programming, if you're working with long integers, or long long, whatever 64-bit is called, 125 00:18:10.048 --> 00:18:22.019 you cannot assume that the 64-bit integer is actually being handled atomically. I mean, even reading it might take two cycles, depending on the hardware, depending on the generation. So 126 00:18:22.019 --> 00:18:33.509 you have to be careful with the 8-byte variables, because you cannot automatically assume that they're handled atomically; of course, something like this will do that. And floats, 127 00:18:33.509 --> 00:18:37.528 and double-precision floats: 128 00:18:37.528 --> 00:18:46.979 if I were using them in a parallel program, I might be a little skeptical. I would hope that they'd be handled atomically, loads and stores and so on, but 129 00:18:46.979 --> 00:18:51.479 I wouldn't be surprised if they weren't. 130 00:18:51.479 --> 00:18:59.459 Okay, so, kernel program, parallel program, running on the device. Each thread 131 00:18:59.459 --> 00:19:05.909 processes one character. So, input variables: buffer is the input data, 132 00:19:05.909 --> 00:19:13.888 and we know how long it is, and we get a pointer to the histogram. So, this is a common global histogram array. 133 00:19:13.888 --> 00:19:21.179 And by default, these pointers point to variables in the global memory. I mean, there are exceptions, but 134 00:19:21.179 --> 00:19:24.929 if there's nothing else stated, these variables will be in global memory. 135 00:19:24.929 --> 00:19:30.929 Okay, so we get a unique index for that thread from its index in the block and the block's index in the grid. 136 00:19:30.929 --> 00:19:35.969 And we, 137 00:19:35.969 --> 00:19:40.648 there's that character, buffer sub i; we treat it as a char, 138 00:19:40.648 --> 00:19:44.699 and we add it in. I'm sorry, 139 00:19:44.699 --> 00:19:51.778 go back. Yeah. Okay. It's double subscripted here, buffer sub i: 140 00:19:51.778 --> 00:19:59.098 it's the character whose frequency we're recording, and the histogram is here, right, where we're recording it. So we do that, 141 00:19:59.098 --> 00:20:02.848 and this happens atomically. So 142 00:20:02.848 --> 00:20:05.999 now we're just keeping track of things here.
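As a sketch of the kind of kernel being walked through here, assuming the 7-bin grouping of letters in fours described in these slides; the names are illustrative, not the slides' exact code:

```cuda
#include <cuda_runtime.h>

// One thread per input character; the histogram lives in global memory.
__global__ void histo_kernel(const char *buffer, long size,
                             unsigned int *histo) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < size) {
        int alpha = buffer[i] - 'a';                // treat the byte as a char
        if (alpha >= 0 && alpha < 26)               // only count real letters
            atomicAdd(&histo[alpha / 4], 1);        // atomic read-modify-write
    }
}
```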
143 00:20:09.443 --> 00:20:22.433 With subscripts like this, again, if there was something funny happening in the program, I would start looking at that, to see if something funny could happen in this double subscript that would cause problems, like 144 00:20:22.739 --> 00:20:27.808 another thread working in the middle. But okay. 145 00:20:28.888 --> 00:20:39.868 And nothing interesting happening here; we're just doing reasonable checks and so on, that they're credible letters. 146 00:20:40.919 --> 00:20:44.759 Okay. 147 00:20:44.759 --> 00:20:48.118 Now. 148 00:20:55.919 --> 00:21:00.388 So. 149 00:21:00.388 --> 00:21:03.898 Okay. 150 00:21:06.479 --> 00:21:10.019 Silence. 151 00:21:11.729 --> 00:21:20.159 So, there's latency of this cache and stuff like that. And again, this cache is 152 00:21:20.159 --> 00:21:26.818 why coalescing works, because it's getting 128 bytes of the global memory into the cache. 153 00:21:26.818 --> 00:21:36.838 The global memory, again, has perhaps a couple of hundred cycles latency, but then it becomes fast: the latency is high, but the throughput is also high. 154 00:21:36.838 --> 00:21:43.709 And to remind you again, the shared memory is a small amount of fast memory available to each block: 155 00:21:43.709 --> 00:21:48.058 64 Kbytes give or take, 48 Kbytes, whatever. 156 00:21:49.108 --> 00:21:56.638 Oh, the cache: hardware-wise, the cache and the shared memory might actually be 157 00:21:56.638 --> 00:22:00.778 the same memory, sort of partitioned to one or the other. 158 00:22:00.778 --> 00:22:06.298 If it's shared memory, it's directly visible to the threads. If it's cache, 159 00:22:06.298 --> 00:22:13.318 it's not directly visible to the threads and is handled by the cache manager. And to some extent, I mean, it 160 00:22:13.318 --> 00:22:20.338 might be coming from a 64 Kbyte total, which might be partitioned between the cache and the shared memory. 161 00:22:21.868 --> 00:22:32.278 Okay, so we've got our DRAM up here. It's in banks; we're not talking about that. SMs, those are just streaming multiprocessors. 162 00:22:32.278 --> 00:22:39.719 And the ports, we're not talking about, really. So, 163 00:22:39.719 --> 00:22:45.028 in any case, the global latency here is several hundred cycles, actually, 164 00:22:45.028 --> 00:22:50.999 to read, and again to write. So that's an incredible slowdown on the system 165 00:22:52.798 --> 00:22:58.469 if we don't use these various techniques. 166 00:23:00.719 --> 00:23:11.848 So, we do a few hundred cycles for the read, a few hundred cycles for the write, and meanwhile that whole chunk of the memory is locked. 167 00:23:11.848 --> 00:23:17.429 Now, how big the chunk is, I'm not certain; maybe 128 bytes. But 168 00:23:18.628 --> 00:23:27.689 nothing else can access that. So the answer is correct, but we just took an enormous hit in performance. 169 00:23:27.689 --> 00:23:34.469 The latency, so, and they are saying here, could be a thousand cycles. That's bad. 170 00:23:35.909 --> 00:23:40.828 Right, we cut our bandwidth by a factor of a thousand. 171 00:23:40.828 --> 00:23:44.308 Silence. 172 00:23:45.628 --> 00:23:51.058 Um, so the 173 00:23:51.058 --> 00:23:55.588 cache comes into play here; it'll help things a lot. 174 00:23:55.588 --> 00:24:02.608 The L2 cache, they're saying the latency maybe is only 100 cycles instead of a thousand cycles, 175 00:24:02.608 --> 00:24:09.689 and so that improves the bandwidth a lot. And the other thing is, with the cache being available
176 00:24:09.689 --> 00:24:15.568 to all the threads, again, we have the coalescing things, so we'll get another improvement here. 177 00:24:16.588 --> 00:24:22.739 And the cache, which is invisible to you, is handled automatically, of course. 178 00:24:25.318 --> 00:24:32.848 So, we've got three steps here of improving. If I go back a couple of slides: we have just the 179 00:24:32.848 --> 00:24:39.239 atomic operation on the global memory, thousand-cycle latency or something. 180 00:24:39.239 --> 00:24:52.949 If we take advantage of the cache, we cut the latency down a factor of 10 or something. Now, this is useful to the extent that successive threads can use the data that's already in the cache. 181 00:24:52.949 --> 00:25:01.048 But again, the cache is automatic; remember, you don't actually see it, it's just handled by the memory manager. 182 00:25:01.048 --> 00:25:06.148 The next step is, we explicitly use the shared memory. 183 00:25:06.148 --> 00:25:10.288 So, this is getting similar to the paper I mentioned, 184 00:25:11.453 --> 00:25:12.713 the one I presented with collaborators; 185 00:25:12.713 --> 00:25:15.473 this is on computing visibility on terrain, 186 00:25:15.894 --> 00:25:19.824 and we did our own cache management, 187 00:25:19.824 --> 00:25:26.693 and we beat the virtual memory manager, because we understood the access pattern for the terrain data, 188 00:25:26.969 --> 00:25:35.759 and we knew, for example, when we weren't going to be using a certain block of data anymore, so we could reuse that 189 00:25:35.759 --> 00:25:40.108 cache location, for example. So this is the same thing happening here: 190 00:25:40.108 --> 00:25:45.269 you might say that you're managing the cache yourself, in a way. 191 00:25:45.269 --> 00:25:48.749 This is a shared, this is a shared memory, 192 00:25:48.749 --> 00:25:56.519 private to each thread block, which means all the threads in the block can use it, but you have to modify your program. 193 00:25:56.519 --> 00:25:59.759 But it's going to be very fast. So, 194 00:25:59.759 --> 00:26:04.378 once you get the data into the shared memory, 195 00:26:04.378 --> 00:26:15.179 so, you read the data into the shared memory once, and then all the threads that want that data have access to it fast, like in a cycle or whatever. 196 00:26:16.409 --> 00:26:25.888 So, we have three stages here, no, three stages of optimization. With the atomic 197 00:26:25.888 --> 00:26:29.338 increment instruction going straight to the global memory: 198 00:26:29.338 --> 00:26:35.249 very big latency. Use the cache, which you use if you're coalescing accesses: 199 00:26:35.249 --> 00:26:39.479 better. And if you explicitly cache the data into the shared memory: 200 00:26:39.479 --> 00:26:51.929 better still. 201 00:26:56.009 --> 00:27:02.338 Now, here, we're going to see another 202 00:27:02.338 --> 00:27:06.118 technique for improving things, which is 203 00:27:06.118 --> 00:27:09.179 privatization. Privatizing output means 204 00:27:09.179 --> 00:27:17.489 that instead of incrementing into one global histogram array, we're going to have 205 00:27:17.489 --> 00:27:28.199 several private histogram arrays, and each block of threads will use its private histogram array. And then at the end, we combine them all; we merge them all. 206 00:27:28.199 --> 00:27:34.318 So, the advantage of this will be that we have less contention for the atomic 207 00:27:34.318 --> 00:27:37.949 read-modify-write. Now.
208 00:27:39.269 --> 00:27:51.239 As you're seeing this set of slides: the example that we're using is the histogram, but these techniques, the reason we're spending time on them, is that these techniques are generally useful. 209 00:27:51.239 --> 00:27:54.719 Using, 210 00:27:54.719 --> 00:28:06.449 um, explicitly caching certain data in shared memory: you have to modify your program to do it, but that's a generally valuable technique for parallel programming. And this thing that we're seeing in this slide set 211 00:28:06.449 --> 00:28:14.249 is having private counters that we then merge together; that's another valuable idea. This thing here, 212 00:28:14.249 --> 00:28:21.239 I mean, this is what's implicit in OpenMP and OpenACC when we declare something 213 00:28:21.239 --> 00:28:25.169 as a reduction variable that we're accumulating into. 214 00:28:25.169 --> 00:28:34.739 And what happens is that the compiler, the OpenMP system, will implement this without even telling you: 215 00:28:34.739 --> 00:28:39.719 it will have private counters for each, 216 00:28:39.719 --> 00:28:43.288 for each thread, and then, 217 00:28:43.288 --> 00:28:47.128 so we don't have contention problems, and at the end it will merge them all. 218 00:28:47.128 --> 00:28:56.969 So that's done. So OpenACC, OpenMP, they're higher level than what I'm showing you now; I'm showing you the lower-level version of this. 219 00:28:59.098 --> 00:29:04.288 Okay, so, so what's going to happen here 220 00:29:04.288 --> 00:29:18.749 is that we have atomic updates of whatever, histogram, could be something else. And here they're talking about several blocks, so they've changed the problem slightly, to your particular problem. 221 00:29:18.749 --> 00:29:22.588 Okay, let's say, talking about histograms: let's suppose we have 222 00:29:22.588 --> 00:29:35.459 so many threads that they don't fit in a block, because, again, a block has up to 1024 threads. Suppose for fun we want to have 10,000 threads. You feel a homework problem coming on: 223 00:29:35.459 --> 00:29:42.989 threads are cheap; 10,000, try 100,000 threads maybe, see what happens. Now, 224 00:29:44.068 --> 00:29:50.699 if it's 100,000 threads, that's going to be 100 blocks of 1,000 threads each. Now, 225 00:29:50.699 --> 00:29:59.818 remember that the blocks are running completely independently; the blocks are not synchronized with each other. So if you tried to use this global 226 00:29:59.818 --> 00:30:05.939 atomic increment instruction going in the separate blocks, 227 00:30:05.939 --> 00:30:16.858 you could, but, I don't know, it's just a lot of things stepping on each other's toes. So, what they're talking about here is, each separate block perhaps might have a private histogram array, 228 00:30:16.858 --> 00:30:23.189 and then there are fewer threads potentially accessing the same counter. 229 00:30:23.189 --> 00:30:29.249 But we have multiple copies of the private histogram, and then at the end, we merge them together. 230 00:30:29.249 --> 00:30:41.368 Okay, so the downside is the programming got a little more complicated, and it used a small amount of extra memory, but in this case an insignificant amount, 231 00:30:41.368 --> 00:30:45.148 because histogram arrays are pretty small. And the upside is, it's 232 00:30:45.148 --> 00:30:50.519 faster: less contention and serialization. 233 00:30:50.519 --> 00:30:53.909 Oh, okay. 234 00:30:53.909 --> 00:31:00.118 The final merge here: we probably don't have all that many copies, so that's not going to be a difficult problem. 235 00:31:00.118 --> 00:31:11.368 So the overhead is minor and the benefit is good. The real overhead is that the programming got a touch more complicated. 236 00:31:11.368 --> 00:31:17.969 The final accumulating, merging the final private copies, is going to be really cheap. So. 237 00:31:20.999 --> 00:31:24.328 And they are claiming a 10 times improvement, so.
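For comparison, here is a sketch of how that implicit version looks at the OpenMP level, as just described. Array-section reductions are OpenMP 4.5 and later; the function and names are this example's own, not from the slides:

```cpp
#include <omp.h>

// The reduction clause makes the runtime give each thread a private copy
// of hist and merge the copies at the end -- privatization, done for you.
void histogram_omp(const char *buffer, long size, unsigned int hist[7]) {
    #pragma omp parallel for reduction(+ : hist[0:7])
    for (long i = 0; i < size; ++i) {
        int alpha = buffer[i] - 'a';
        if (alpha >= 0 && alpha < 26)
            hist[alpha / 4] += 1;   // no atomic needed: hist is private here
    }
}
```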
238 00:31:26.398 --> 00:31:35.038 Okay, so again, the shared memory that I mentioned many times: 239 00:31:35.038 --> 00:31:41.848 it's 48K, 64 Kbytes for the block; all the threads in the block have access to it, 240 00:31:41.848 --> 00:31:48.298 and it might take typically one cycle, or a few cycles. Level 2 cache: 10 times slower. 241 00:31:48.298 --> 00:31:51.358 DRAM: a hundred times slower. 242 00:31:51.358 --> 00:31:57.568 Contention, okay. This is a prime example of where shared memory is valuable. 243 00:31:59.098 --> 00:32:02.429 Well, coming back here also: 244 00:32:03.989 --> 00:32:13.138 this is also where we get into optimization issues. You can start playing games of how many threads per block, as I said, 245 00:32:13.138 --> 00:32:23.249 which I mentioned before, but they're valuable optimizing games. You can have up to 1024 threads per block, but you don't have to have that many; you might have 32 threads in the block, then you're going to need more blocks, 246 00:32:23.249 --> 00:32:29.489 but it might actually go faster, because 247 00:32:29.489 --> 00:32:33.808 that's more memory per thread, when you've got fewer threads per block. So, 248 00:32:35.038 --> 00:32:42.479 okay, global: that just means it's a device routine called from the host. 249 00:32:42.479 --> 00:32:47.338 And this, in the device routine, is how you allocate a shared 250 00:32:47.338 --> 00:32:51.298 array. And again, um, 251 00:32:51.298 --> 00:32:55.229 it's going to be visible to all the threads in the block. 252 00:32:55.229 --> 00:32:59.608 If you did not say this, it would be private 253 00:32:59.608 --> 00:33:05.909 to the thread, and would actually be in slow local memory. You say this, it's 254 00:33:05.909 --> 00:33:09.328 private to the block, and it's in fast memory. 255 00:33:12.384 --> 00:33:25.763 Okay, and you've got to throw in a syncthreads, of course, from time to time. So what's happening here, you see, is that they're not assuming that the new variable is zeroed, 256 00:33:26.003 --> 00:33:30.054 so they're explicitly zeroing it. But this is being done in parallel, 257 00:33:30.298 --> 00:33:35.909 so sync it. Okay, the thing is, as I said, the 32 threads in the warp 258 00:33:35.909 --> 00:33:45.719 are going sort of in lockstep, but if there are more than 32 threads involved here, that's several warps, and again, who knows, they may be going serially, one after the other. 259 00:33:47.034 --> 00:34:01.523 Okay, so again, we're inside the, this is the code inside the global routine, and we're doing an atomic increment on the private histogram in shared memory. So. 260 00:34:04.919 --> 00:34:09.208 Okay, and then, um, 261 00:34:10.498 --> 00:34:20.878 so, if I can go back two slides. Okay. Okay. So let's go through this again. So this is the thread program to do the 262 00:34:20.878 --> 00:34:25.228 increment from one character. 263 00:34:25.228 --> 00:34:28.559 We created the private histogram, 264 00:34:28.559 --> 00:34:33.719 we zeroed the private histogram data 265 00:34:33.719 --> 00:34:36.719 in parallel, and then we synchronized
266 00:34:36.719 --> 00:34:43.798 to make sure that it all got zeroed. And remember, we have this shared private histogram variable. 267 00:34:43.798 --> 00:34:49.018 On to the next slide: here's our shared private histogram variable, 268 00:34:49.018 --> 00:34:53.369 and we atomically increment it with, 269 00:34:54.389 --> 00:35:07.378 whatever the character of buffer sub i is, divided by 4, because, again, we're bunching the characters in the input data into groups of 4: a to d, 270 00:35:07.378 --> 00:35:12.568 e to h, and so on, to make the histogram smaller. Okay. 271 00:35:17.668 --> 00:35:20.909 What's happening with plus stride here? 272 00:35:20.909 --> 00:35:26.489 We actually have fewer threads than the, um, 273 00:35:26.489 --> 00:35:33.449 total size of the buffer, perhaps. So each thread will actually count up 274 00:35:33.449 --> 00:35:39.509 many characters in the input data. And this is what we had a couple of slides ago: 275 00:35:39.509 --> 00:35:48.389 the thread is not doing a chunk of adjacent data; it's doing a chunk of characters that are stride apart, stride being the number of 276 00:35:48.389 --> 00:35:51.809 threads. So, the point about this here 277 00:35:51.809 --> 00:36:04.079 is that this coalesces the memory accesses: the fact that this thread is accessing multiple characters that are stride characters apart, not one character apart. 278 00:36:04.079 --> 00:36:10.469 Now, we also, so we have this while loop in the, 279 00:36:10.469 --> 00:36:15.869 you know, in the thread. 280 00:36:15.869 --> 00:36:20.309 And so it might happen that 281 00:36:20.309 --> 00:36:27.809 one thread does the loop more times than another thread, in which case we'll lose the benefits on those extra iterations. 282 00:36:27.809 --> 00:36:34.829 But as long as this condition is true for all the threads, then all 32 threads in the warp are doing this in parallel. 283 00:36:34.829 --> 00:36:40.498 Although, of course, 284 00:36:41.909 --> 00:36:46.079 what happens if one thread has to wait for the atomic add, 285 00:36:46.079 --> 00:36:49.619 because another thread is accessing the same thing? 286 00:36:49.619 --> 00:36:55.349 I'm not completely certain what happens there: do all 32 threads have to wait, or what? So, 287 00:36:55.349 --> 00:37:00.898 that would be a good question. You see the problem I'm thinking about here. 288 00:37:00.898 --> 00:37:13.259 Okay, so the 32 threads in the warp are operating in lockstep, but the time that the atomic add takes, it varies; it's not constant, it's variable. It will take more if another thread 289 00:37:13.259 --> 00:37:16.588 happens to be accessing the address at the same time. 290 00:37:16.588 --> 00:37:20.608 So, that will slow down this instruction here. 291 00:37:20.608 --> 00:37:24.989 Now, does it slow down all the threads in the warp? 292 00:37:24.989 --> 00:37:29.759 I think it does, actually, but I'm not positive. I believe it would. 293 00:37:29.759 --> 00:37:35.699 So, if two threads are updating the same counter, 294 00:37:36.054 --> 00:37:49.974 then the whole warp slows down, which could cause another warp to be running faster, because the two warps don't have to be going in lockstep; just the threads in one warp go in lockstep. I mean, two warps, they could be sequential, perhaps, if there aren't enough resources. 295 00:37:50.514 --> 00:37:55.224 But as I said, this is another reason here you really want to minimize contention, 296 00:37:55.528 --> 00:37:59.099 apart from the latency, if you go out to 297 00:37:59.099 --> 00:38:02.998 the global memory:
298 00:38:02.998 --> 00:38:12.000 the larger fraction of the time you have contention, the greater the probability this atomic add is going to go slower. 299 00:38:12.000 --> 00:38:15.539 So, it's an argument for having multiple private copies, 300 00:38:15.539 --> 00:38:19.590 if people see what I mean. Okay. Okay. So, 301 00:38:19.590 --> 00:38:23.519 what happened on the previous slide is, 302 00:38:23.519 --> 00:38:33.059 going back a couple of slides here, the threads were going through the array and summing counts into their private histogram. 303 00:38:33.059 --> 00:38:36.690 And all the threads, again, 304 00:38:36.690 --> 00:38:40.380 it's the same private histogram variable for 305 00:38:40.380 --> 00:38:43.710 all the threads in the block: shared. 306 00:38:43.710 --> 00:38:49.590 So you actually could have fun here computing the probability of there being, 307 00:38:49.590 --> 00:38:54.449 you know, a slowdown here. What's the probability that two threads would access the same, 308 00:38:54.449 --> 00:38:57.570 you know, buffer item at the same time, 309 00:38:57.570 --> 00:39:02.159 in the histogram? Okay. 310 00:39:03.179 --> 00:39:12.059 Okay, so now we've accumulated all the private counts. But again, remember that the different warps, they're not necessarily in lockstep; 311 00:39:12.059 --> 00:39:18.480 particularly, if there are a lot of threads, they could be sequential: one warp starts after the previous warp finishes. 312 00:39:20.844 --> 00:39:34.525 And then we synchronize: we wait until the slowest warp is finished. As I said, you know, you could have 10,000 threads, maybe; I don't know what the limit is, for all I know, 100,000. But your device, perhaps it has, say, 4,000 CUDA cores, 313 00:39:34.554 --> 00:39:36.025 I think that's what 314 00:39:36.989 --> 00:39:43.320 parallel has; that means only at best 4,000 threads can be running at once. So if you've got 315 00:39:43.320 --> 00:39:54.719 10,000 threads, then, no, you're certainly not running them all at the same time, because the max that can run simultaneously is 4,000, probably less than that, depending on resources. So, 316 00:39:54.719 --> 00:40:01.199 it would be, you can use some of the debugging tools to see the thread, the core utilization. Okay. 317 00:40:02.610 --> 00:40:09.449 In any case, we now wait until the slowest warp has finished accumulating into the private histogram, 318 00:40:09.449 --> 00:40:16.050 and then what we do is, 319 00:40:17.639 --> 00:40:22.260 this is for the threads in the block; now, different blocks are doing other stuff, 320 00:40:22.260 --> 00:40:30.900 of course, so it doesn't wait for the other blocks to finish. Waiting for the other blocks to finish would be a bad idea; it might be a very long wait. 321 00:40:30.900 --> 00:40:40.320 Okay, so now, what we do here is, we have the private histogram for each block, but we had many blocks, perhaps, plus the global histogram. 322 00:40:40.320 --> 00:40:44.460 It's in the, well, it's in the L2 cache, 323 00:40:44.460 --> 00:40:52.440 just automatically; that's what the cache does. And now, what we do is, the first 8 threads, 324 00:40:53.730 --> 00:40:58.289 or the first 7 threads, now go in parallel, 325 00:40:59.639 --> 00:41:04.440 um, indexing, 326 00:41:04.440 --> 00:41:07.559 adding into the global thing here. 327 00:41:07.559 --> 00:41:13.559 So, there's a lot of latency here: we're incrementing into a slow array, 328 00:41:13.559 --> 00:41:19.170 and it has only, you know, not very many elements, like 7. And so 329 00:41:19.170 --> 00:41:22.230 the probability is close to 1 that it is going to be 330 00:41:22.230 --> 00:41:25.829 slowed down, serializing the atomic add. 331 00:41:25.829 --> 00:41:30.570 But that's okay, because you're only doing like 7 operations 332 00:41:30.570 --> 00:41:37.289 per block, times however many blocks. So this thing is going to be really slow, 333 00:41:37.289 --> 00:41:41.070 but that's tolerable, because, 334 00:41:41.070 --> 00:41:44.639 well, it's not very many variables being added.
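Putting the pieces of this walkthrough together, a sketch of the privatized kernel, again assuming the 7-bin letter histogram; names are illustrative, not the slides' exact code:

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 7   // a-d, e-h, i-l, m-p, q-t, u-x, y-z

__global__ void histo_private_kernel(const char *buffer, long size,
                                     unsigned int *histo) {
    __shared__ unsigned int histo_s[NUM_BINS];  // one private copy per block

    // Zero the private histogram in parallel, then wait for every warp.
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
        histo_s[bin] = 0;
    __syncthreads();

    // Interleaved (strided) loop: thread i handles elements i, i+stride, ...
    // so consecutive threads read consecutive characters -- coalesced.
    long i = blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;  // total number of threads
    while (i < size) {
        int alpha = buffer[i] - 'a';
        if (alpha >= 0 && alpha < 26)
            atomicAdd(&histo_s[alpha / 4], 1);   // contention only in-block
        i += stride;
    }
    __syncthreads();                             // wait for the slowest warp

    // The first NUM_BINS threads merge the private copy into the global one.
    for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
        atomicAdd(&histo[bin], histo_s[bin]);
}
```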
335 00:41:47.789 --> 00:41:58.500 In summary: this, what they call privatization, is local caches for each thread block, or, if you're doing this in OpenACC, in each 336 00:41:58.500 --> 00:42:06.900 process, each thread. And it's only certain operations that can be done that way, but the obvious operations. 337 00:42:06.900 --> 00:42:10.829 Okay, and you've got to fit into shared memory. 338 00:42:10.829 --> 00:42:16.380 Okay, this last paragraph here: 339 00:42:16.380 --> 00:42:19.500 if you can't fit the histogram into 340 00:42:19.500 --> 00:42:25.469 the shared memory, now you have to 341 00:42:25.469 --> 00:42:31.585 escalate your techniques. Now, I have a programming style, I do parallel geometry algorithms, 342 00:42:31.585 --> 00:42:41.813 that's my research, and I could be doing a histogram where there's a million buckets, let's say. And if it's a million buckets, then we have to go to other techniques, and 343 00:42:42.090 --> 00:42:45.690 we'll see something later on; they talk about other techniques, 344 00:42:45.690 --> 00:42:50.639 for when the histogram is too large to fit into this small fast memory. 345 00:42:51.750 --> 00:42:55.199 Okay, so privatization: 346 00:42:55.199 --> 00:43:03.150 this is a technique, and it's only for parallel programs. It's not an interesting technique for sequential programs, obviously. 347 00:43:03.150 --> 00:43:06.360 Okay, so. 348 00:43:07.440 --> 00:43:14.969 That was 7.5. 349 00:43:14.969 --> 00:43:20.429 Hey. 350 00:43:20.429 --> 00:43:23.760 Okay. 351 00:43:23.760 --> 00:43:29.250 So this set of modules 352 00:43:29.250 --> 00:43:37.590 is showing us some paradigms of parallel programming. So a paradigm is a programming technique which is useful; 353 00:43:37.590 --> 00:43:41.309 it's not explicit in the programming language. 354 00:43:41.309 --> 00:43:48.329 There's nothing in which, privatization is not part of the standard; 355 00:43:48.329 --> 00:43:56.340 it's a technique, but it's a technique which is useful. You put it into your toolkit of useful parallel techniques. 356 00:43:56.340 --> 00:44:06.269 We're also going to see some others, stencils, and we're going to see some other ideas for making efficient parallel programs. 357 00:44:08.369 --> 00:44:11.909 Okay, you all know what a convolution is, I think, 358 00:44:13.110 --> 00:44:18.030 also going to see an idea called a stencil, or whatever. 359 00:44:20.760 --> 00:44:23.760 Convolutions smooth stuff, sharpen stuff. 360 00:44:23.760 --> 00:44:28.829 If somebody does not know that, then you can tell me and I'll give a quick 361 00:44:30.599 --> 00:44:35.250 thing. But they do low-pass filtering for smoothing, and high-pass filtering 362 00:44:35.250 --> 00:44:42.510 to amplify, to look for objects. Like, just as an aside: 363 00:44:43.739 --> 00:44:56.610 the retina of your eye does a convolution. And, you know, neural nets are important; well, the retina of your eye is a genuine neural net, not a computer simulation.
364 00:44:56.610 --> 00:45:04.949 And the neurons inhibit their neighbors, apparently, so that if one neuron gets a lot of photons 365 00:45:04.949 --> 00:45:09.059 and it's firing a signal, it tends to inhibit its neighboring 366 00:45:09.059 --> 00:45:21.360 neurons from firing. So the retina of your eye, it appears, and I say "it appears" because there's some uncertainty about this, but the retina of your eye does a high-pass filter, 367 00:45:21.360 --> 00:45:25.559 which makes small objects more apparent. 368 00:45:25.559 --> 00:45:30.090 It's called the Mach band effect, after the Austrian physicist Ernst Mach. 369 00:45:30.090 --> 00:45:35.760 Mach is the same guy that calculated, measured, the speed of sound. 370 00:45:35.760 --> 00:45:42.239 He's also the same guy that constructed an experiment to see if the whole universe was rotating. 371 00:45:42.239 --> 00:45:49.409 He did lots of good stuff. In any case, the Mach band effect is relevant for computer graphics, which I also teach. 372 00:45:50.215 --> 00:46:03.505 I thought I was finished with computer graphics, but the department has clawed me back to teach it next fall, because, apparently, I don't know why, a lot of students want to take graphics, so the department tells me, and no one else can teach it, so I'm back into it next fall. 373 00:46:03.954 --> 00:46:05.454 But the Mach band effect: 374 00:46:05.760 --> 00:46:16.380 it makes it harder to compute nice images, because small differences between color intensities will be more apparent 375 00:46:16.380 --> 00:46:28.405 to the viewer than you might hope. So there's this high-pass filter in the retina. In any case, here's the convolution relevance. For example, if you have 8 bits for color parameters, 376 00:46:28.405 --> 00:46:34.315 so, if you do a color ramp going from 0 to 127, to 255, I'm sorry, and it's green, 377 00:46:38.695 --> 00:46:50.304 viewers can sometimes see the edge, the boundary, between adjacent colors, and it changes by one part in 256. So that's the Mach band effect, because your eye appears to amplify that boundary. Now, 378 00:46:52.349 --> 00:46:58.050 why your eye does it: it makes small objects 379 00:46:58.050 --> 00:47:03.510 more visible to you. And the assumption is that 380 00:47:03.510 --> 00:47:11.880 a small object might be something hidden in the bushes that's going to jump out and eat you; not a small object, though, but an edge of a sort-of-hidden 381 00:47:11.880 --> 00:47:15.960 predator. So it's useful if things like that are 382 00:47:15.960 --> 00:47:20.940 more conspicuous. Okay. In any case, coming back to NVIDIA: convolution. 383 00:47:20.940 --> 00:47:26.610 It's a weighted sum of the neighboring elements, 384 00:47:26.610 --> 00:47:29.909 and the mask is the weights. 385 00:47:31.559 --> 00:47:39.210 Now, if you're thinking ahead of me, to where I'm perhaps going to follow, you might think the mask is the sort of thing, it's a constant, 386 00:47:39.210 --> 00:47:45.300 and it's the sort of thing you would put in constant memory, let's say, 387 00:47:45.300 --> 00:47:49.110 because it's the same mask that all the threads would read, 388 00:47:49.110 --> 00:48:01.110 read-only. And so it's going to be, effectively, it will be in the L2 cache, or, if you have space available, you put it into the shared memory, 389 00:48:01.110 --> 00:48:04.889 or what have you; but there's not much of it. 390 00:48:06.150 --> 00:48:10.500 Okay, 1D convolution.
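A sketch of the basic 1D convolution kernel of the kind these slides develop, with mask width 5 as an example; N is the input, M the mask, P the output, and the names are illustrative:

```cuda
#define MASK_WIDTH 5

// Each thread computes one output element; the if test handles the boundary.
__global__ void convolution_1d(const float *N, const float *M, float *P,
                               int width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < width) {
        float val = 0.0f;
        int start = i - MASK_WIDTH / 2;              // leftmost input touched
        for (int j = 0; j < MASK_WIDTH; ++j) {
            if (start + j >= 0 && start + j < width) // skip ghost elements
                val += N[start + j] * M[j];
        }
        P[i] = val;
    }
}
```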
391 00:48:12.059 --> 00:48:19.079 So, I'll go through this fast; if you know what convolution is, you don't have to read it. 392 00:48:21.000 --> 00:48:34.679 These slides, they talk about boundary conditions; done, going through this fast. This is the sort of thing: you write a program that works, you've got to handle special cases. 393 00:48:34.679 --> 00:48:42.269 And my horrible experience is that in some of my programs, perhaps half the total code might be handling special cases. That's life. 394 00:48:42.269 --> 00:48:48.510 You accept it if you want your programs to work. Okay. 395 00:48:49.619 --> 00:48:54.510 Nothing interesting here: how they would do the convolution. Well, 396 00:48:55.889 --> 00:48:58.949 what's interesting here? 397 00:49:01.949 --> 00:49:07.409 Good. 398 00:49:09.869 --> 00:49:21.900 Well, you're looping here. So, all that's happening down here is that the one thread is accessing several values 399 00:49:21.900 --> 00:49:26.789 from the input array, and several values from the mask. 400 00:49:26.789 --> 00:49:31.800 So the data access patterns are a little complicated. 401 00:49:31.800 --> 00:49:36.780 That's the relevant thing there. So, 402 00:49:38.429 --> 00:49:42.840 and the other relevant thing is, this is the boundary condition test, 403 00:49:42.840 --> 00:49:50.369 to stop you going off the edge of the array, if the size of the array is not a multiple of whatever, 404 00:49:50.369 --> 00:49:56.400 the number of threads. Okay, we just talked about that. 405 00:49:56.400 --> 00:50:00.119 Now, 2D gets interesting, 406 00:50:00.119 --> 00:50:08.699 because, again, your data access patterns are not so totally local, because you're not just going left and right; you're going up and down. 407 00:50:08.699 --> 00:50:15.480 Now, there's something on the device that we haven't talked about much, but there is 408 00:50:15.480 --> 00:50:28.440 a texture cache, which you can put stuff into, for read-only 2D arrays, which are going to have 2D sorts of accessing patterns. And in the texture cache, 409 00:50:28.440 --> 00:50:38.760 the data is laid out in a space-filling-curve pattern, a Peano curve or something, and the idea is to try to minimize the average access time. 410 00:50:38.760 --> 00:50:44.070 The documentation doesn't talk about that too much, but I'm just mentioning it to you for your interest. 411 00:50:44.070 --> 00:50:49.170 If you wanted to learn more, you could dig deep into NVIDIA white papers or something. 412 00:50:51.719 --> 00:50:56.219 Ghost cells are just off the edge of the array. You can just 413 00:50:56.219 --> 00:51:01.650 put zeroes in there or something; then your code will get simpler, although it takes more data. 414 00:51:01.650 --> 00:51:16.199 I'm skipping, see, I don't actually find this boundary condition interesting, because I know how to handle it. I know two different ways to handle it: you put conditions in your program, or you do the padding here, right? I call it padding there, right? 415 00:51:16.199 --> 00:51:21.510 So, I can do that. That was 8.1. 416 00:51:25.679 --> 00:51:32.820 Silence. 417 00:51:36.300 --> 00:51:43.590 Okay, go through this. So, tiling, again, is: we chunk up the input array into blocks, 418 00:51:43.590 --> 00:51:53.070 and then, although you may have to read each block in several times, once you read it in, it gets accessed more, and on the average it reduces total reading time. 419 00:51:53.070 --> 00:51:58.050 I just gave you the summary of the slides; I'll talk through them fast.
420 00:51:58.050 --> 00:52:09.659 In the one-dimensional case, we read a block in; N is the input array and P is the output, right, right. We read it in, and 421 00:52:09.659 --> 00:52:17.760 once we read this chunk of data into our local cache, shared memory, we can use these elements several times. 422 00:52:17.760 --> 00:52:21.570 And you could even imagine a sliding window of some sort. 423 00:52:21.570 --> 00:52:26.429 Okay, output tile. 424 00:52:28.139 --> 00:52:38.489 Okay, there is something new here, when you're doing things like convolution. The programs that we saw so far took one output, 425 00:52:38.489 --> 00:52:43.530 er, computed one output element, and iterated over 426 00:52:43.530 --> 00:52:51.030 the input arrays; for each output element, for the next output element, they iterate through the input arrays again. 427 00:52:51.030 --> 00:52:54.750 There is a complementary way, where instead of 428 00:52:54.750 --> 00:52:59.190 doing each output element once, you do each input element once. 429 00:52:59.190 --> 00:53:12.329 And so you read an input element, and it affects several output elements, and so you access the output elements several times, but each input element only once. Two complementary ways of looking at things. 430 00:53:12.329 --> 00:53:21.750 I think that second one's not as good, perhaps, because you're distributing the writes more, and I'd rather distribute the reads more, reads being faster than writes, maybe. 431 00:53:21.750 --> 00:53:29.880 In any case, they're talking about, you chunk up the output; you could sort of combine the two ideas. You 432 00:53:29.880 --> 00:53:39.869 partition the output array, assuming it's very big. If it's an image, it could well be very big, a few million pixels. So that, 433 00:53:41.460 --> 00:53:51.059 so, you work with the thread blocks, and you partition the output array into chunks called tiles, and each thread block tries to process one tile. 434 00:53:52.289 --> 00:53:55.320 And that's what they're talking about here. So, 435 00:53:55.320 --> 00:54:03.150 the general lesson from the slide set is that you might work with the 436 00:54:03.150 --> 00:54:09.539 thread block, because what happens inside a thread block 437 00:54:09.539 --> 00:54:18.090 is a lot easier to handle than stuff that happens between multiple thread blocks. Within the one thread block, you have the 438 00:54:18.090 --> 00:54:29.159 shared array, and you also have a chance of synchronizing threads, and, if you're lucky, the warps in the thread block may run sort of at the same time, 439 00:54:29.159 --> 00:54:42.420 may even run at the same time if you're lucky. So if you can do a computation inside one thread block, you've got these winning features. Once the computation has to spill over into several thread blocks, 440 00:54:42.420 --> 00:54:54.900 then you lose something. Well, you've got scalability, you've got lots and lots and lots of thread blocks, but you lose this shared idea. So you can work with this idea and figure out 441 00:54:54.900 --> 00:54:58.710 the trade-offs of what you do inside a thread block and what you do, 442 00:54:58.710 --> 00:55:02.550 you know, in multiple thread blocks. It would be an optimization thing. 443 00:55:02.550 --> 00:55:07.139 And again, NVIDIA has this spreadsheet you can play with that tries to help you. 444 00:55:08.309 --> 00:55:14.250 Input tiles, again, nothing interesting there: stuff gets partitioned into tiles. 445 00:55:15.449 --> 00:55:22.769 Um, so here's the game again: different design options.
446 00:55:22.769 --> 00:55:26.369 So you have a thread block, 447 00:55:26.369 --> 00:55:32.400 um, so you have an output tile per thread block, or an input tile per thread block, 448 00:55:32.400 --> 00:55:37.170 and they talk about the various trade-offs here. If, 449 00:55:37.170 --> 00:55:42.179 if you have a thread block per output tile, 450 00:55:42.179 --> 00:55:47.369 then you have to read the data more than once. 451 00:55:47.369 --> 00:55:52.500 If you have a thread block per input tile, 452 00:55:52.500 --> 00:55:55.920 then you may not use all the threads, 453 00:55:55.920 --> 00:55:59.880 um, because of these boundary condition things, 454 00:55:59.880 --> 00:56:08.550 but it may be faster. So, the general lesson from this slide set is, there are trade-offs you want to think about, if you care about optimizing. 455 00:56:08.550 --> 00:56:15.659 It depends; you may not care about fine-tuned optimizing. But if it's a big program and you do, you think about how you map 456 00:56:15.659 --> 00:56:19.199 your data onto the thread blocks. So, 457 00:56:20.309 --> 00:56:29.820 we get through all of this; I just gave you the content. And again, what I've said many times: your block size 458 00:56:29.820 --> 00:56:33.239 is something you design to optimize. 459 00:56:33.239 --> 00:56:44.820 Um, the relevance here: the shared memory, again, you try to load data into the shared memory in a way that it's used several times once you load it in. 460 00:56:45.324 --> 00:56:57.414 So, what they're talking about here, this is what I call padding; they call them ghost cells. So you just pad an array with zeros at the end. 461 00:56:57.775 --> 00:57:00.833 What this does is, it makes your code simpler: 462 00:57:01.110 --> 00:57:07.949 okay, you don't have to test for going off the end of the array, because you won't go off the end, because the end just got made wider, 463 00:57:07.949 --> 00:57:16.500 padded with zeros. So it's a trade-off again: you use a small amount of extra storage, which I think is not worth worrying about, but 464 00:57:16.500 --> 00:57:24.269 the code got faster. The other thing, also, again: in a warp, all the threads have got to do the same thing. So doing special-case testing 465 00:57:24.269 --> 00:57:28.139 in the warp, 466 00:57:28.139 --> 00:57:35.280 well, certain things will serialize; but if you're adding these zeros, it didn't serialize, though some of it was pointless work. That's probably a wash. 467 00:57:35.280 --> 00:57:38.639 So, okay. 468 00:57:46.110 --> 00:57:54.329 I'll go through this one fast. We haven't really talked about what a stencil is, but that's sort of 469 00:57:54.329 --> 00:58:02.699 the shape of the convolution kernel; the idea of a stencil is going to come up. And so, what's going to happen here: 470 00:58:03.840 --> 00:58:07.769 okay, use the cache; 471 00:58:07.769 --> 00:58:22.110 you think of how you lay out your data, when your data is 2D but ultimately the memory is really 1D; you think of how stuff gets laid out; again, you think about how, 472 00:58:23.250 --> 00:58:30.059 you know, you map the threads to the data. Boundary conditions, I don't think, are interesting. 473 00:58:30.059 --> 00:58:33.869 Well, this is very interesting: 474 00:58:33.869 --> 00:58:42.360 you know, how do you parallelize the program? So, threads to data is very interesting. 475 00:58:42.360 --> 00:58:45.389 How do you map your 2D data to 1D 476 00:58:45.389 --> 00:58:50.489 memory is also interesting. And how you use the cache is interesting.
477 00:58:50.489 --> 00:58:53.760 So those are the interesting parts here. 478 00:59:00.000 --> 00:59:03.690 Okay, what's happening here? 479 00:59:03.690 --> 00:59:09.210 You've got your cache; the cache bursts are 128 bytes, 480 00:59:09.210 --> 00:59:12.869 and they start at multiples of 128. So 481 00:59:14.489 --> 00:59:23.489 you want arrays in global memory to start at multiples of 128; you pad them or something to make that happen. 482 00:59:23.489 --> 00:59:29.309 If you do not do this thing here, something may take 483 00:59:29.755 --> 00:59:41.784 two bursts instead of one burst; so not doing this might double your reading time, actually, because it had to go out to the global memory twice instead of once. 484 00:59:41.784 --> 00:59:45.204 So you perhaps want to align the start of your rows, 485 00:59:45.480 --> 00:59:50.130 keep things aligned. Okay. 486 00:59:51.179 --> 00:59:56.610 And you may find that when you allocate the matrix, they do that for you. 487 00:59:57.780 --> 01:00:03.480 Okay, so if it's 3 by 3, pad it to 4 by 4. 488 01:00:03.480 --> 01:00:07.289 Okay, programming techniques: 489 01:00:07.289 --> 01:00:13.860 nothing interesting there; things are padded. 490 01:00:13.860 --> 01:00:17.400 Um, 491 01:00:17.400 --> 01:00:23.820 nothing interesting there. Well, you've got a structure for your image; 492 01:00:23.820 --> 01:00:27.780 it is storing assorted things, whatever. 493 01:00:29.280 --> 01:00:32.909 Silence. 494 01:00:32.909 --> 01:00:41.639 If I back up a little: you're probably using some package-management software that does this for you, and 495 01:00:41.639 --> 01:00:48.119 you should probably, possibly, not be creating classes like that if you can find them already created. 496 01:00:48.119 --> 01:00:56.820 Okay. Um, but 497 01:00:56.820 --> 01:01:04.590 if you were not lucky enough to have all this stuff created for you, you've got to go and write your class; 498 01:01:04.590 --> 01:01:11.579 that's what they're talking about there. Yeah, so it's provided for you. 499 01:01:11.579 --> 01:01:15.150 Block size again: 500 01:01:17.010 --> 01:01:21.780 pick a block size, now, pick a block size that matches the data. And so, 501 01:01:21.780 --> 01:01:33.539 again, all this is doing is padding the tile out to a multiple of 32 or something; nothing interesting there. Okay, what's happening here is they're saying that the, 502 01:01:35.994 --> 01:01:50.364 they're saying that the threads in the block are considered to be a two-dimensional array of threads, not just a one-dimensional, right? And I consider this totally syntactic sugar. I don't know why they put it in the standard; it just cruds up the standard, in my view, and you could do it yourself. 503 01:01:50.610 --> 01:01:54.630 And again, the block, the grid is 504 01:01:54.630 --> 01:01:58.289 all the blocks, and again, it's assuming that they are a, 505 01:01:58.289 --> 01:02:08.340 a 3D number, so, a 2D array of blocks in the grid, and I consider that just totally syntactic. The point is, it's 506 01:02:08.340 --> 01:02:13.320 to make the way you index the threads in the block match the rows and columns 507 01:02:13.320 --> 01:02:19.500 in your image. 508 01:02:19.500 --> 01:02:27.360 Okay, finally, uh, talking about constant memory. And it's not very much; it's 48 Kbytes or something, 509 01:02:27.360 --> 01:02:37.980 depends on the generation of NVIDIA hardware, but they don't actually change it over time. For some reason, NVIDIA does not make this cache larger as they get to the newer and fancier 510 01:02:37.980 --> 01:02:49.380 generations. What that suggests to me is that the cache is very expensive to implement in hardware, 511 01:02:49.380 --> 01:02:52.440 so they don't make it bigger. Okay.
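A sketch of what declaring the mask in constant memory looks like. __constant__ and cudaMemcpyToSymbol are the actual CUDA mechanisms; the symbol names M and MASK_WIDTH are this example's own:

```cuda
#include <cuda_runtime.h>

#define MASK_WIDTH 5

__constant__ float M[MASK_WIDTH];  // read-only to kernels, cached aggressively

// Copy the host-side mask into the constant-memory symbol once, up front;
// kernels then read M directly instead of taking the mask as an argument.
void setup_mask(const float *h_mask) {
    cudaMemcpyToSymbol(M, h_mask, MASK_WIDTH * sizeof(float));
}
```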
512 01:02:52.440 --> 01:02:58.019 But it's global: all the threads can read the constant memory. 513 01:02:58.019 --> 01:03:02.309 The mask in the constant memory is cached aggressively. 514 01:03:02.309 --> 01:03:16.050 Okay, so this is the thing: they've got special hardware, so that if multiple threads are reading the same cached value, it all happens in the same cycle. 515 01:03:16.050 --> 01:03:21.389 So, they've got hardware to do that. So, 516 01:03:21.389 --> 01:03:24.480 effectively, 517 01:03:24.480 --> 01:03:28.019 fast memory bandwidth. Okay. Um, 518 01:03:29.610 --> 01:03:36.900 and, okay, so you can tell the compiler that some array 519 01:03:36.900 --> 01:03:46.110 is eligible for constant caching: it's going to be read-only, you define this constant, so the compiler knows that. And, 520 01:03:48.869 --> 01:03:52.739 and so it's something which is worthwhile 521 01:03:52.739 --> 01:04:03.719 trying to put in the constant cache. You have to tell the compiler, because just because something's constant doesn't mean you want to put it in the cache. Why you'd want to put it in the cache is, 522 01:04:03.719 --> 01:04:08.730 all of the threads are going to be wanting to read the same data again and again and again and again, 523 01:04:08.730 --> 01:04:12.360 and the compiler may not be able to determine that. 524 01:04:12.360 --> 01:04:18.030 So, you tell the compiler. Okay. Um, 525 01:04:18.030 --> 01:04:21.809 and this is cutting things up, with this padding here. 526 01:04:23.159 --> 01:04:27.840 Ignore it. 527 01:04:27.840 --> 01:04:35.610 Boundaries: ignore the boundaries; threads off at the edge don't always participate. Yeah, that's 528 01:04:35.610 --> 01:04:41.969 not interesting. I mean, you've got to handle it, but it's not interesting intellectually. 529 01:04:41.969 --> 01:04:46.500 We saw this before: three kernels, three channels, for an image. 530 01:04:46.500 --> 01:04:52.289 Um, so, 531 01:04:52.289 --> 01:04:59.579 yeah, I would not say do it this way, because this is, 532 01:04:59.579 --> 01:05:13.590 putting red, green, blue for each pixel adjacent, this is creating an array of structures, which is bad, because the red channel for adjacent pixels is not adjacent; it's got the green and blue in between. So, 533 01:05:13.590 --> 01:05:18.840 yeah, I actually have a complaint about this: as I said, it'll work, but it's not optimal. 534 01:05:18.840 --> 01:05:22.349 Okay. 535 01:05:26.579 --> 01:05:30.239 I warned you, I do lots of slides today. 536 01:05:38.550 --> 01:05:45.900 Okay, now, the general lesson from the slides: 537 01:05:45.900 --> 01:05:49.650 it's, if you want to write efficient 538 01:05:49.650 --> 01:05:56.340 programs, you have to understand how your algorithm accesses the data, 539 01:05:56.340 --> 01:06:08.639 and the access patterns. And this is something that I hit in my research: there is a really strong argument for keeping things simple. 540 01:06:08.639 --> 01:06:12.179 You want to have simple, regular 541 01:06:12.179 --> 01:06:21.119 data structures, accessed regularly, and this means that it works with the hardware, and it will parallelize. 542 01:06:21.119 --> 01:06:30.840 So, what this means is, you want to forget all these complicated, beautiful things you learned in CS 1, because they do not parallelize. You do not want
543 01:06:30.840 --> 01:06:34.769 You do not want trees, you do not want pointers, you do not want recursion.
544 01:06:37.284 --> 01:06:48.864 I mean, everything has exceptions, I'll give you that, but you want to try to minimize the use of things like trees and pointers and recursion. Because if you've got a pointer.
545 01:06:48.864 --> 01:07:00.505 The threads have got to chase the pointer, and they're chasing the pointer to some random place in memory, and bang, any concept of locality is gone. It kills your parallel performance. It kills your performance because you've got that.
546 01:07:01.349 --> 01:07:05.940 Global memory, which has that latency of hundreds of cycles, maybe.
547 01:07:05.940 --> 01:07:20.070 And the cache will be useless, because each separate pointer will go to some random place. So you really want to think about keeping stuff together, so the cache will be useful, so your shared memory can be used. So you want.
548 01:07:21.144 --> 01:07:31.074 You want simple data structures. The thing with trees is that a tree sprouts out in different directions; it's not uniform. Again, the pointers, the recursion. I know.
549 01:07:31.074 --> 01:07:38.485 I showed you doing the Fibonacci thing with recursion back with OpenACC and OpenMP. You can use recursion.
550 01:07:40.019 --> 01:07:48.869 You can use recursion if it serves the purpose, but don't just automatically do everything with recursion.
551 01:07:49.525 --> 01:07:58.195 I mean, in fact, if you want to do recursion, NVIDIA for a few years now has had something where a kernel can initiate another kernel.
552 01:07:58.195 --> 01:08:13.105 So, if your data has to be really uneven, NVIDIA tries to work with you, actually, so you can do that. So, the kernel is that parallel program that runs on the grid of blocks, as we showed you.
553 01:08:14.664 --> 01:08:28.944 You start your kernel from your host program; you know, you have that routine that's tagged global, which means run it on the device, call it from the host. Well, inside your parallel kernel, you can, in fact.
554 01:08:29.604 --> 01:08:43.435 Create and start another kernel, and it will go in on the device. It will sit in the queue of kernels waiting to run, remember there's a mini operating system, and it will run. So, if you want to do stuff like that, there are tools available. (There's a little sketch of it after this stretch.)
555 01:08:43.435 --> 01:08:50.935 And as the years go on, NVIDIA actually makes these tools better for you, but you won't see them described in these slide sets and so on, because they're fairly new.
556 01:08:51.835 --> 01:09:06.204 Or they existed when these slides were created, but the authors of the slides didn't know about them or something, hadn't learned them yet. In any case. So, what I'm saying again and again, because I find it important: you want to think about reuse patterns, you want to think about uniformity, and so on.
557 01:09:06.510 --> 01:09:12.239 Not everything can be completely uniform, but you can try, and it will be worth it.
558 01:09:12.239 --> 01:09:19.109 So, okay, so these slides that we have, um.
559 01:09:20.729 --> 01:09:25.500 Do convolution again.
560 01:09:27.750 --> 01:09:34.590 Okay, the mask width is 5, and so we add stuff. I'm going to go through this fast.
561 01:09:34.590 --> 01:09:38.369 Um, because it's obvious: here you add 5 elements.
562 01:09:38.369 --> 01:09:48.119 We're not showing the mask; the mask would be M. It's not shown here, but 5 elements of N get added into the output element.
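Here is the kernel-launching-a-kernel idea (dynamic parallelism) as a minimal sketch. The kernel names and sizes are invented; what is real is the device-side <<<...>>> launch, which needs a GPU of compute capability 3.5 or later and relocatable device code.

```cuda
#include <cstdio>

// Child kernel: stands in for whatever per-region work you need.
__global__ void child(int region)
{
    printf("region %d, thread %d\n", region, threadIdx.x);
}

// Parent kernel: each thread may itself launch a kernel; the child
// launches go into the device's queue of kernels waiting to run.
__global__ void parent(int nRegions)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < nRegions)
        child<<<1, 32>>>(r);   // device-side launch
}

// Host side: parent<<<1, 8>>>(8);
// Build with something like: nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt
```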
563 01:09:52.020 --> 01:10:06.390 So, what they're talking about here, if I go back (I won't make you seasick flipping back), is that we had a tile of elements of N that we explicitly loaded into the shared memory. And then.
564 01:10:06.390 --> 01:10:15.210 The average element of N that we loaded into shared memory got used several times; in fact, it got used maybe 3 times or something.
565 01:10:15.210 --> 01:10:19.020 So, we explicitly did the tile, and then we.
566 01:10:19.020 --> 01:10:23.069 Got, you know, a bandwidth reduction of a factor of 3, and that's.
567 01:10:23.069 --> 01:10:32.729 Worth it, you know; that's nice. And it's better than what would be done automatically with the L2 cache, let's say.
568 01:10:32.729 --> 01:10:39.810 Because you're loading it into shared memory, and you know when you're finished with a tile; you can free the thing.
569 01:10:39.810 --> 01:10:43.890 There's an advantage also if you do your explicit.
570 01:10:45.779 --> 01:10:54.840 Caching, as we call it, loading into shared memory. Okay, think about how virtual memory or caches work: how do they choose what to eject from the cache?
571 01:10:58.494 --> 01:11:11.184 You know, you need a replacement strategy. Well, the theoretically best method is that you eject the item which is not going to be used for the longest time into the future. So, if you could see into the future.
572 01:11:11.850 --> 01:11:26.579 You'd eject the item which is not going to be used for the longest time. But you cannot see into the future. So, in fact, what you do instead is eject the item that has not been used for the longest amount of time, the one that has been idle the longest. And that's your LRU, least recently used; it's simply a really nice method.
573 01:11:26.579 --> 01:11:34.590 I mean, that's what these cache management things do. But the thing is, if you're managing stuff yourself, you know.
574 01:11:34.590 --> 01:11:40.680 When your data will not be needed again. So you know when you can eject something from your.
575 01:11:40.680 --> 01:11:52.680 Own custom cache, and that's where you get the trade-off, the pay-off rather: if you're putting stuff into the shared memory, you know when you're finished with something, and you can reuse that memory address.
576 01:11:52.680 --> 01:11:56.460 And so optimize your tiling.
577 01:11:56.460 --> 01:11:59.699 Okay, nothing interesting here.
578 01:12:00.960 --> 01:12:04.800 Or here; computing, or here.
579 01:12:06.329 --> 01:12:11.310 Or here: bandwidth reduction.
580 01:12:14.154 --> 01:12:25.645 Well, basically, with a reasonable-size tile, like 256, and a wide mask, basically the reduction is the size of the mask. So mask width equals.
581 01:12:25.645 --> 01:12:37.314 9 means each item you load into shared memory gets used 9 times, and practically a factor of 9 improvement, if the tile is large enough that the edge conditions don't.
582 01:12:40.079 --> 01:12:48.149 Bother you. Okay, now 2-D gets a touch interesting, because it's a 2-D chunk of.
583 01:12:48.149 --> 01:12:53.310 Data, but you'll still get a very nice reduction in the.
584 01:12:56.159 --> 01:13:00.239 Bandwidth. Um.
585 01:13:02.430 --> 01:13:16.500 So, again, we can compute what's happening here. I may do this in some more detail Monday, perhaps; if not, well, the executive summary here is: with 2-D.
586 01:13:18.390 --> 01:13:29.670 The reason it's more than the mask width is, remember, that a mask width of 9 means there are 81 things in the mask.
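A minimal sketch of the explicit tile-caching just described: 1-D convolution where each block loads its tile of N, plus a halo, into shared memory. Assumptions: the block size equals the tile size of 256, the mask sits in constant memory, out-of-range elements count as zero, and all names are illustrative.

```cuda
#define MASK_WIDTH 5
#define TILE 256

__constant__ float d_mask[MASK_WIDTH];

// Each block explicitly caches TILE elements of N plus a halo of
// MASK_WIDTH/2 elements on each side; assumes blockDim.x == TILE.
__global__ void tiledConv1D(const float *N, float *P, int n)
{
    __shared__ float tile[TILE + MASK_WIDTH - 1];
    const int r = MASK_WIDTH / 2;                // halo radius
    int i = blockIdx.x * TILE + threadIdx.x;     // global index

    // Load the center element; the first r threads also load the halos.
    tile[threadIdx.x + r] = (i < n) ? N[i] : 0.0f;
    if (threadIdx.x < r) {
        int left  = blockIdx.x * TILE - r + threadIdx.x;
        int right = blockIdx.x * TILE + TILE + threadIdx.x;
        tile[threadIdx.x]            = (left  >= 0) ? N[left]  : 0.0f;
        tile[threadIdx.x + TILE + r] = (right < n)  ? N[right] : 0.0f;
    }
    __syncthreads();   // the whole tile is now in shared memory

    if (i < n) {
        float sum = 0.0f;
        for (int j = 0; j < MASK_WIDTH; j++)
            sum += tile[threadIdx.x + j] * d_mask[j];
        P[i] = sum;
    }
}
```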
Okay, this is, you know, a square. So a mask width of 5.
587 01:13:29.670 --> 01:13:36.029 Means that there are 25 things in the mask, and a mask width of 9 means you store 81 items of the mask.
588 01:13:36.654 --> 01:13:49.135 Each item could get used up to 81 times, so in the limit, as the tile width gets big enough, this would approach 81; with a mask width of 5, you're getting almost 25 here. This gets more efficient as the tile width gets larger. (There's a little worked version of this arithmetic below.)
589 01:13:51.390 --> 01:13:55.979 If only because what they're doing is worrying about the edge effects, the things around the boundary.
590 01:13:57.090 --> 01:14:00.090 Um, so, yeah.
591 01:14:00.090 --> 01:14:06.899 So, you have this mask, and you chop things up; I call it chunking up.
592 01:14:06.899 --> 01:14:10.289 You partition things into these little blocks.
593 01:14:10.289 --> 01:14:22.050 So the point is, they're talking about larger shared memory sizes. Well, yeah: shared memory is very fast, and larger is better. But I notice that NVIDIA.
594 01:14:22.050 --> 01:14:25.859 Does not increase their shared memory size as they go to the new.
595 01:14:25.859 --> 01:14:30.359 Generations.
596 01:14:30.359 --> 01:14:38.609 And so I have to ask myself why, and my guess is that shared memory is very expensive to implement in the hardware.
597 01:14:38.609 --> 01:14:50.729 It's accessible to all of the threads in the block. Also, NVIDIA has not increased the maximum number of threads per block: a block has up to 1024 threads, and this has not increased. So.
598 01:14:50.729 --> 01:14:58.140 I'm guessing that's because it's really expensive to do. And what NVIDIA does.
599 01:14:59.399 --> 01:15:04.319 Is they'll have more streaming multiprocessors on the device.
600 01:15:04.319 --> 01:15:07.890 So the maximum total number of threads will increase.
601 01:15:09.479 --> 01:15:17.550 Though not by very much; it's up to 4000 or something on parallel. I mean, my laptop here has a couple of thousand threads.
602 01:15:17.550 --> 01:15:22.500 Because, as I said, my laptop's almost as expensive as parallel. No, actually Joe said.
603 01:15:22.500 --> 01:15:30.420 It's a very good laptop. But as NVIDIA goes to more powerful hardware, actually, one thing they do is.
604 01:15:30.420 --> 01:15:37.890 They increase the memory speed; NVIDIA decreases the clock period, so the memory is accessed faster.
605 01:15:37.890 --> 01:15:49.619 They increase the bandwidth for getting data out of the memory. They do not increase the number of threads per block. They do not increase the shared memory size.
606 01:15:49.619 --> 01:15:53.760 That's interesting, you know, what they.
607 01:15:53.760 --> 01:15:57.449 Choose to do. And they also do things like.
608 01:15:57.449 --> 01:16:01.229 Well, they'll introduce new data types, like these.
609 01:16:03.090 --> 01:16:13.260 These processors that work with 16-bit floats to do neural nets, because with the data in neural nets you don't notice the lack of precision.
610 01:16:13.944 --> 01:16:25.555 So, they call it FP16, half-precision floats. And they put in these tensor processors that will do a 4 by 4 matrix multiply-add in a cycle or so.
611 01:16:25.555 --> 01:16:29.545 So they spend the gates on adding stuff.
612 01:16:30.239 --> 01:16:36.840 For the neural-net side: the tensor processors, the half-precision floats.
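To pin down that reuse arithmetic: a small worked version for the internal-tile case (ignoring image boundaries), with tile width t and mask width m; the specific numbers are my own example, not from the slides.

```latex
% Reuse factor for 2-D tiled convolution, ignoring image boundaries.
% A block computes a t x t output tile but loads the tile plus its halo,
% (t + m - 1)^2 input elements; each output element reads m^2 inputs.
\[
\text{reuse} \;=\; \frac{t^2\,m^2}{(t+m-1)^2}
\longrightarrow m^2 \quad \text{as } t \gg m .
\]
% Example: m = 5, t = 16 gives 256*25/400 = 16, already well on the way
% to the limit of 25; with m = 9 and a large tile it approaches 81.
```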
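And the half-precision data type he mentions, in toy form: a minimal sketch using CUDA's __half type. The kernel is invented, and the tensor cores themselves are normally driven through libraries such as cuBLAS or the wmma API rather than code like this.

```cuda
#include <cuda_fp16.h>

// Toy elementwise multiply-accumulate in 16-bit floats (FP16).
// __hfma is the half-precision fused multiply-add intrinsic.
__global__ void fmaHalf(const __half *a, const __half *b, __half *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hfma(a[i], b[i], c[i]);   // c = a * b + c
}
```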
613 01:16:37.854 --> 01:16:50.875 They also change the area on the chip devoted to things like double precision versus single precision, because doing double precision fast, that's separate hardware. You could implement double precision in terms of single precision, but you take a factor of.
614 01:16:51.899 --> 01:17:01.914 5 or something in performance, if you do a double-precision operation as a single-precision function call implemented in software; I'm guessing a factor of 5 or something as the hit.
615 01:17:02.125 --> 01:17:09.654 So they decide how much of it they want to do in hardware, and they trade that off from generation to generation. For Maxwell.
616 01:17:09.654 --> 01:17:18.414 They had less floating point, and then they decided they'd gone too far, and in the Pascal generation, which followed Maxwell, they went back to more.
617 01:17:18.659 --> 01:17:22.079 Floating-point operations to improve their performance.
618 01:17:23.850 --> 01:17:28.590 Oh, just as an aside about the performance.
619 01:17:28.590 --> 01:17:32.850 Because this is getting into the teraflop range.
620 01:17:32.850 --> 01:17:43.199 I'm thinking of the initial version of parallel a couple of years ago now; it had an older processor, not the one that you're using now. The older one's still there.
621 01:17:43.199 --> 01:17:52.590 And its floating-point performance was so high that, for our lawyer, it triggered an.
622 01:17:52.590 --> 01:18:00.750 Issue with the International Traffic in Arms Regulations: computers whose performance is above a certain.
623 01:18:00.750 --> 01:18:06.840 Speed are considered to be munitions.
624 01:18:06.840 --> 01:18:20.909 And so the lawyer was asking me who would be using this card. This was a 600-dollar card that you could buy on Amazon. You could also buy it on Amazon in the UK, in fact, for 600 pounds.
625 01:18:20.909 --> 01:18:32.579 You know, for computers, that's how you do the conversion: you change the dollar sign to a pound sign and keep the digits the same. This is not a joke. In any case, so I had to.
626 01:18:32.579 --> 01:18:43.890 Tell our lawyer who would be using the 600-dollar NVIDIA card. It was rather funny, actually, because I was also buying the Intel.
627 01:18:43.890 --> 01:18:47.250 Multicore thing.
628 01:18:47.250 --> 01:18:57.420 Which is a product they've dropped now, and it was actually faster than the NVIDIA, but it wasn't something that the lawyer recognized, so the lawyer didn't query me about that.
629 01:18:57.420 --> 01:19:04.140 Okay, in any case. So partitioning things into the right-size tiles and masks is good.
630 01:19:04.140 --> 01:19:09.569 Okay, that's a reasonable point to stop now. So.
631 01:19:09.569 --> 01:19:15.329 What you saw today was several paradigms.
632 01:19:15.329 --> 01:19:21.060 For writing efficient parallel programs: there were masks.
633 01:19:21.060 --> 01:19:32.189 And so on; there were private copies of output arrays, like histogram arrays. So we saw a couple of techniques which are useful.
634 01:19:32.189 --> 01:19:41.010 For parallel programs. They're not officially part of the language standard, but they're things that were developed that turn out to be useful.
635 01:19:42.689 --> 01:19:49.680 So, I put up a homework. Also, I'm asking you to do some programming: implement the histogram thing on your own in parallel and see.
636 01:19:49.680 --> 01:19:58.770 On multicore, that would be the Intel; on manycore, that would be the NVIDIA. It may not have to be on parallel, in fact, but you have to tell me.
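Since the homework mentions it, here is a minimal sketch of the privatization technique from today: per-block private histogram copies in shared memory, merged into the global histogram at the end. The 256-bin count and all names are made up for illustration; this is the idea, not the assigned solution.

```cuda
#define NBINS 256   // illustrative bin count

__global__ void histogramPrivatized(const unsigned char *data, int n,
                                    unsigned int *histo)
{
    // Private copy of the histogram for this block, in fast shared memory.
    __shared__ unsigned int priv[NBINS];
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        priv[b] = 0;
    __syncthreads();

    // Each thread strides over the input; contention for a bin is now
    // only among the threads of one block, not the whole grid.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&priv[data[i]], 1u);
    __syncthreads();

    // Merge the private copy into the global histogram.
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        atomicAdd(&histo[b], priv[b]);
}
```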
637 01:19:59.850 --> 01:20:03.840 Other than that, have a good weekend.
638 01:20:03.840 --> 01:20:12.989 Not part of the class, but I'm enjoying my solar panels up on the roof, because on sunny days like this.
639 01:20:12.989 --> 01:20:19.529 Um, how is it going, the panels versus the house? So several days this week I've actually.
640 01:20:19.529 --> 01:20:29.520 Generated more electricity than the house uses. So, Monday, no, because Monday was cloudy. So come the summer, this could be interesting; I'll keep you in tune with how it's going.
641 01:20:29.520 --> 01:20:32.850 Okay, so.
642 01:20:32.850 --> 01:20:36.750 If there's no question.
643 01:20:37.829 --> 01:20:41.819 If not.
644 01:20:41.819 --> 01:20:45.989 Good day.