WEBVTT

1
00:06:34.499 --> 00:06:45.569
Silence.

2
00:06:45.569 --> 00:06:49.559
Silence.

3
00:06:51.449 --> 00:06:54.988
Silence.

4
00:06:56.428 --> 00:06:59.639
Silence.

5
00:06:59.639 --> 00:07:12.569
Silence.

6
00:07:14.309 --> 00:07:18.149
Silence.

7
00:07:21.478 --> 00:07:26.608
Silence.

8
00:08:00.269 --> 00:08:03.358
Front.

9
00:08:03.358 --> 00:08:09.209
Okay, so good.

10
00:08:09.209 --> 00:08:15.869
Good afternoon people this says parallel computing.

11
00:08:17.968 --> 00:08:23.488
And class 10, Monday, March. 1st, so.

12
00:08:25.588 --> 00:08:29.848
See, what we can do to share stuff.

13
00:08:33.208 --> 00:08:38.788
And.

14
00:08:44.489 --> 00:08:48.568
Silence.

15
00:08:50.428 --> 00:08:57.808
So, my usual questions are.

16
00:08:59.308 --> 00:09:02.639
Can you hear me.

17
00:09:02.639 --> 00:09:09.239
And you see the screen.

18
00:09:11.489 --> 00:09:14.879
Great. Okay. So.

19
00:09:14.879 --> 00:09:20.038
We are continuing on with.

20
00:09:20.038 --> 00:09:23.849
gpo modules paired with Illinois.

21
00:09:23.849 --> 00:09:29.399
And just put where we're starting from here. 3.5.

22
00:09:29.399 --> 00:09:35.009
And see, it.

23
00:09:39.719 --> 00:09:44.369
Okay.

24
00:09:44.369 --> 00:09:49.589
And again, we're speed reading through this and.

25
00:09:51.089 --> 00:09:57.298
There here. Okay.

26
00:09:58.583 --> 00:10:13.014
And again, I'm teaching from specifics and so that your goal is to, you can infer general principles from the specifics. This is the way I like to teach. So, give you practical stuff.

27
00:10:13.528 --> 00:10:17.428
And, okay.

28
00:10:17.428 --> 00:10:22.859
So, we're seeing in some more details about the code of.

29
00:10:23.938 --> 00:10:29.369
Good at current so on. Okay so just to remind you.

30
00:10:29.369 --> 00:10:33.479
What we have here is that.

31
00:10:33.479 --> 00:10:36.839
The devices the, and.

32
00:10:36.839 --> 00:10:40.438
The, um, it has this.

33
00:10:40.438 --> 00:10:44.969
Blocks and each block has threads.

34
00:10:44.969 --> 00:10:59.428
Okay, and actually it shows the organization at the left here and then you and then that's his device might have a number of these sets of blocks because the device may actually be running several.

35
00:10:59.428 --> 00:11:06.658
Unrelated parallel programs at the same time. So if 2 of you in class start off.

36
00:11:06.658 --> 00:11:13.288
Parallel jobs on the PO, in theory, if the resources are available, they might run simultaneously.

37
00:11:13.288 --> 00:11:22.708
I wouldn't bet the farm on the security. There's some sort of security protecting the kernels from each other, but how good it is.

38
00:11:22.708 --> 00:11:31.048
I don't know so I'm also assuming, you know, I would not use the GPU and a hostile environment running, you know.

39
00:11:31.048 --> 00:11:34.499
Programs that you think are going to try to attack the computer.

40
00:11:35.519 --> 00:11:39.749
Now, what this on the right shows here the timeline.

41
00:11:39.749 --> 00:11:44.009
Is that, um, the blocks.

42
00:11:44.009 --> 00:11:51.239
Again, so block might have at the up to a 1024 threads in it. What the.

43
00:11:51.239 --> 00:11:54.568
What the timeline on the right is showing.

44
00:11:54.568 --> 00:12:02.729
Is that several blocks will run simultaneously again? Depending on where is hardware resources?

45
00:12:03.264 --> 00:12:18.114
And then blocks complete the next blocks run and that just showed you this little examples open when there are more blocks I want to run than there are resources available to

46
00:12:18.114 --> 00:12:19.644
run them simultaneously.

47
00:12:20.183 --> 00:12:22.254
Then the order that the blocks run in is.

48
00:12:22.528 --> 00:12:27.389
Totally, not predictable also.

49
00:12:27.389 --> 00:12:35.129
If you have a more expensive, which has more hardware than more blocks will run.

50
00:12:35.129 --> 00:12:40.859
Are able to run in parallel and so your job finishes faster.

51
00:12:42.533 --> 00:12:49.793
Which is a nice way to design things. You can throw more hardware at the problem. And then, but the software stays the same.

52
00:12:50.094 --> 00:13:04.644
I mean, again, I wouldn't go crazy with this concept that the program could run on a whole hierarchy of different. If you start going for corner cases and so on Yeah, it may not. But in general.

53
00:13:04.889 --> 00:13:12.208
The concept is that the blocks yourself for run on different levels of.

54
00:13:12.208 --> 00:13:18.239
This is a traditionally nice way to design.

55
00:13:18.239 --> 00:13:31.649
A group of hardware said IBM became the biggest computer company in the world in the 19 sixties by doing something like this. They had half, they invented something called a system 360 and they had.

56
00:13:31.649 --> 00:13:37.288
I calf initially half a dozen different machines in the line and.

57
00:13:37.288 --> 00:13:42.149
They all ran the same programs and that was at that time, that was the big advance.

58
00:13:42.149 --> 00:13:47.099
So, okay now.

59
00:13:47.099 --> 00:14:01.558
Here's a new idea on this slide that I mentioned briefly. There's something called streaming multi processors and that is at a high level sense. The core in the GPU higher level than than a separate could a core.

60
00:14:01.558 --> 00:14:09.328
And again, a may have a couple of streaming multi processors, or.

61
00:14:09.328 --> 00:14:14.249
12 of them or 15 of them or something and.

62
00:14:14.249 --> 00:14:20.849
So, a streaming multi processor, some versions of the or something.

63
00:14:20.849 --> 00:14:28.078
Okay, and there are certain resources for streaming multi processor so as a slide show so.

64
00:14:28.524 --> 00:14:42.923
You can, you can run several blocks at it simultaneously in 1, streaming multi process, or depending on resources like this fixed amounts of certain types of resources, like shared memory. And if a block.

65
00:14:43.139 --> 00:14:47.009
It's the threads in a block need more than fewer blocks can run at the same time.

66
00:14:47.009 --> 00:14:57.028
Now, talk about, from me here, this is a generation of and it's several generations old.

67
00:14:57.624 --> 00:14:59.693
For me was succeeded by capital,

68
00:14:59.693 --> 00:15:10.134
or that was succeeded by Maxwell that was received by succeeded bypass gal that was succeeded by Volta,

69
00:15:10.313 --> 00:15:13.344
which is now being succeeded by ampcare.

70
00:15:13.344 --> 00:15:15.774
I think so. It's quite a few generations back.

71
00:15:18.239 --> 00:15:25.198
In any case, when the point is here that the stripping multi processor can in total run so many threads.

72
00:15:25.198 --> 00:15:31.288
And you can have more threads for block and fewer blocks or vice versa. So.

73
00:15:31.288 --> 00:15:38.188
And streaming multi processor has this many operating system in it that schedules things. So.

74
00:15:38.188 --> 00:15:42.749
This blocks waiting to run, there's warps waiting to run and so on.

75
00:15:44.068 --> 00:15:49.048
No, I'm in model after John.

76
00:15:49.048 --> 00:15:53.458
You're all familiar with that.

77
00:15:55.073 --> 00:16:09.833
The street is single instruction, multiple data stream. You got the 1 program counter and instruction register can then control multiple and register files. So you look at the space layout on the Silicon.

78
00:16:10.073 --> 00:16:11.124
Then this will have.

79
00:16:11.369 --> 00:16:16.739
Less space for instruction, decoding as a proportion of the total and more space.

80
00:16:16.739 --> 00:16:25.379
Or a, on using register files. This is a good idea to the extent that your program can take advantage of that.

81
00:16:25.379 --> 00:16:28.558
Okay.

82
00:16:28.558 --> 00:16:37.019
So, the threads are grouped into warps as I mentioned before 32 threads in a warm and this 32 a state of constant forever within video.

83
00:16:37.019 --> 00:16:44.788
So, not part of the code of programming model. Well, not formally, but it hasn't changed for 20 years.

84
00:16:44.788 --> 00:16:50.519
So any case.

85
00:16:52.229 --> 00:17:00.504
And the streaming multi processor, then schedules the warps, which are part of the part of the blocks for adblocks.

86
00:17:00.504 --> 00:17:15.263
Now, why warps need scheduling is other resources available in limited quantity besides registers 1 of them are floating point units, single and doubles the 2 separate types of units. And there's not enough.

87
00:17:15.509 --> 00:17:22.048
Floating point units for all the threads to simultaneously do a floating operation. So.

88
00:17:22.048 --> 00:17:27.269
This would be a reason for a work to wait until ordinarily or warp had finished.

89
00:17:27.269 --> 00:17:32.459
Example here.

90
00:17:32.459 --> 00:17:39.209
Do the math, so they're assuming the green and the purple warp and.

91
00:17:40.888 --> 00:17:47.939
And so when you do the math and whatever, so okay. And again.

92
00:17:47.939 --> 00:17:58.558
How many schedule depends on the resources each work needs the resources. Each thread needs actually and because all the threads in the war for identical, basically.

93
00:17:58.558 --> 00:18:06.239
The footnote there, but so again, more resources for thread means who, or threads can run simultaneously.

94
00:18:07.648 --> 00:18:15.179
This is the thing I mentioned a little before 0T overhead, warp, scheduling.

95
00:18:15.179 --> 00:18:21.449
They've got some fancy logic that I believe involves a synchronous logic.

96
00:18:21.449 --> 00:18:25.288
To maintain the.

97
00:18:25.288 --> 00:18:31.318
Them a group of Forbes that are waiting to run, and when all the resources are available.

98
00:18:31.318 --> 00:18:34.618
Then picks a warp.

99
00:18:34.618 --> 00:18:41.278
And Ron said, so, I don't know the details of how that's done. We'll 0T overhead, but.

100
00:18:46.259 --> 00:18:52.108
I assume it works because the cues are not incredibly big, but.

101
00:18:53.278 --> 00:18:56.429
Okay, excuse me. No.

102
00:18:57.628 --> 00:19:01.288
Again, although it says here, excuse me.

103
00:19:01.288 --> 00:19:08.878
Nothing here is particular to, for me, these are all general lessons. That's why I'm showing them to you.

104
00:19:08.878 --> 00:19:12.088
And.

105
00:19:12.088 --> 00:19:24.384
So, it's a question, and the example is matrix multiplication now parallel computing people love matrix multiplication as an examples as test cases. For the following reason.

106
00:19:24.834 --> 00:19:27.804
Matrix multiplication is compute intensive.

107
00:19:28.048 --> 00:19:32.459
If you're multiplying to and by end matrices.

108
00:19:32.459 --> 00:19:37.828
You've got order of N squared data, but you've got order of cubes.

109
00:19:37.828 --> 00:19:52.798
Computation there's a lot of programs where the data dominates it. You spend more of your time in data transmission than you do in processing. So this is the opposite. It's actually a compute intensive job.

110
00:19:52.798 --> 00:19:56.368
That's 1 reason a parallel people like it.

111
00:19:56.368 --> 00:20:00.479
So, what we're going to do for matrix multiplication is.

112
00:20:00.479 --> 00:20:03.538
Chop the matrices up into blocks and use.

113
00:20:03.538 --> 00:20:09.028
And assign blocks of the matrices to threads. So then the question is.

114
00:20:09.028 --> 00:20:14.638
How big are the blocks and so on? And this is the next few slides are talking about that.

115
00:20:14.638 --> 00:20:19.648
Now, NVIDIA provide, it's actually it's a linear programming problem.

116
00:20:19.648 --> 00:20:33.989
Because you've got certain resources available, and you want to optimize, say, processing times that you want to say minimized processing time. But you have to stay within the limits of the various resources like, registers.

117
00:20:33.989 --> 00:20:38.759
And that sort of thing and say floating point units and that.

118
00:20:38.759 --> 00:20:44.878
That's what linear programming guys is to optimize some objective function given.

119
00:20:44.878 --> 00:20:52.648
In keeping, and each resource with not keeping no more than a 100% usage of any given resource.

120
00:20:52.648 --> 00:20:56.189
Geometrically.

121
00:20:56.189 --> 00:21:05.429
It's, if you've got end different resources that you have to watch, it's an end dimensional and you have to find the lowest vertex.

122
00:21:05.429 --> 00:21:10.318
Inside the end dimensional poly Cal Poly talks actually defined by.

123
00:21:10.318 --> 00:21:16.769
The of its faces, not by, and there can be in high dimensions is actually.

124
00:21:16.769 --> 00:21:23.159
Exponentially more vertices in paces perhaps. So it's a search procedure.

125
00:21:23.159 --> 00:21:31.709
Economists, attrition loved linear programming problems may talk more about them later. Perhaps in any case.

126
00:21:31.709 --> 00:21:39.689
So you wanted to Matrix multiplication fast on the GPU so you have to decide what size of blocks of.

127
00:21:39.689 --> 00:21:44.759
Chuck up the matrix and how many threads per thread block and so on.

128
00:21:53.068 --> 00:22:06.659
Silence.

129
00:22:08.308 --> 00:22:09.294
Yeah, okay.

130
00:22:16.074 --> 00:22:25.703
And various resources here, register, shared memory and so on what scoping lifetime means is different types of memory. The scope is.

131
00:22:27.088 --> 00:22:32.519
Who can see the memory so I register it visible to only 1 thread.

132
00:22:32.519 --> 00:22:40.138
Very limited, very narrow scope. Global memory is visible to every 1, the broadest possible scope.

133
00:22:40.138 --> 00:22:46.019
And then lifetime also registers lifetime might be the 1.

134
00:22:47.278 --> 00:22:54.328
You know, the 1 thread, but the global memory's lifetime would be the whole colonel say.

135
00:22:54.773 --> 00:22:55.314
Okay,

136
00:22:55.314 --> 00:23:00.473
so this is okay,

137
00:23:00.473 --> 00:23:01.314
so last time,

138
00:23:01.344 --> 00:23:03.953
the last example we saw was the convolution,

139
00:23:04.223 --> 00:23:08.334
we were blurring pixels and so we had a thread per pixel,

140
00:23:08.574 --> 00:23:10.554
but the thread that was computing,

141
00:23:10.673 --> 00:23:15.624
the blurred pixel value had to look at the adjacent Excel self.

142
00:23:15.624 --> 00:23:15.804
It's.

143
00:23:16.259 --> 00:23:24.179
3 by 3 convolution window, the thread blurring a pixel had to look at the 8 adjacent pixels. For example.

144
00:23:25.439 --> 00:23:31.409
Now, why that potentially is a problem is.

145
00:23:31.409 --> 00:23:41.729
It's going to different places in a global memory so that sort of thing potentially very badly. Hurts the performance. It's a legal thing to do.

146
00:23:41.729 --> 00:23:51.269
Any thread can read and write any word and global memory. Well, we probably shouldn't try writing. The can read any word in global memory.

147
00:23:51.269 --> 00:24:02.638
The reason for writing is the threads are running a synchronously. So you got her seriously think about what it means to a threads writing words in global memory. Unless of course, each thread is a separate private trunk of the global memory.

148
00:24:02.638 --> 00:24:05.729
Which is perfectly fine. Okay.

149
00:24:05.729 --> 00:24:14.459
So so what's on the table here is this convolution program they call it a blurring kernel is accessing.

150
00:24:14.459 --> 00:24:17.939
Several each threads accessing several pixels.

151
00:24:17.939 --> 00:24:22.199
In the global memory here.

152
00:24:24.209 --> 00:24:33.058
And you see what we have here is this double 4 loop iterating over the elements of the.

153
00:24:33.058 --> 00:24:36.749
As a filter and so on.

154
00:24:38.098 --> 00:24:42.179
Okay.

155
00:24:44.729 --> 00:24:51.028
So, the problem is that, um.

156
00:24:51.384 --> 00:25:05.273
So, the, they're looking here, so the GPU computing right is actually at a reasonable speed 1, and a half terra flops in this example, with an old GPU. By the way.

157
00:25:06.479 --> 00:25:11.219
So, the say can do the computations at 1 and a half terra flops.

158
00:25:11.219 --> 00:25:15.838
But the probably trouble is that the bandwidth physical memory.

159
00:25:15.838 --> 00:25:18.838
Limit set so.

160
00:25:20.368 --> 00:25:29.219
So, you're going to be this is gonna be an aisle limited program here and they run into some of the mass saying that.

161
00:25:29.219 --> 00:25:35.489
In this particular case, the memory global memory, successful, 200 gigabytes a 2nd, let's say.

162
00:25:36.989 --> 00:25:40.169
So, we're getting 3% of the.

163
00:25:40.169 --> 00:25:44.249
Computation power of the now.

164
00:25:44.249 --> 00:25:48.088
That may not be bad. You have to look at the global picture that.

165
00:25:48.088 --> 00:25:56.489
It's in the global context, maybe. Okay if that's all you can do, but you might want to perhaps do better.

166
00:25:57.808 --> 00:26:02.939
And we're going to do better by organizing the data.

167
00:26:03.959 --> 00:26:10.949
So here's the issue here that is coming back to the matrix multiplication again.

168
00:26:10.949 --> 00:26:22.888
Now, multiplying Matrix, Adam types matrix end to make matrix P for product. So you docked a vector, which is 1 role of them with a vector, which is 1 column of and.

169
00:26:22.888 --> 00:26:25.919
And then to get 1 element of.

170
00:26:25.919 --> 00:26:32.038
Pete, so now here's the problem that if the matrices are stored in major order.

171
00:26:32.038 --> 00:26:38.429
Than accessing all of them is good. They're all contiguous.

172
00:26:38.429 --> 00:26:43.409
The column of them is a different matter, because they're all they're all discontinuous.

173
00:26:43.409 --> 00:26:49.739
So, you're going to take a performance hit access saying that call them within.

174
00:26:49.739 --> 00:26:55.169
You're also going to take perhaps a hit on accessing the role of them because.

175
00:26:55.169 --> 00:27:01.259
You're reading it on a global memory and only using 1. so the theme of this slide set.

176
00:27:01.259 --> 00:27:06.689
Is going to be that if we have to read data from the global memory.

177
00:27:06.689 --> 00:27:10.679
Get as much use out of that data as you can.

178
00:27:10.679 --> 00:27:16.348
You know, maybe use the data more than once if you can.

179
00:27:16.348 --> 00:27:29.638
Any case, here's the basic matrix modification and it takes pointers to the 3 irrelevant erase matrices and they're stored in the global memory.

180
00:27:29.638 --> 00:27:35.729
Is a pointer it probably almost always, I think might be in the global memory and.

181
00:27:35.729 --> 00:27:40.679
So, in any case, so, each again as usual here, each thread.

182
00:27:40.679 --> 00:27:43.709
Is computing 1 output element.

183
00:27:43.709 --> 00:27:48.028
So, we compute, so given the thread number and the block number.

184
00:27:48.028 --> 00:27:52.409
And compute which row and column that's in red here.

185
00:27:52.409 --> 00:27:56.189
And then what we do here.

186
00:27:57.419 --> 00:28:03.058
Is that we.

187
00:28:03.058 --> 00:28:06.719
Okay, this gets to be the slow part here this loop here.

188
00:28:06.719 --> 00:28:15.419
Is going down and computing 1 output element by taking a whole role of, and the whole call and then.

189
00:28:15.419 --> 00:28:22.288
And docking them dotting them and do their job that's there. So, this here is going to kill your performance. Perhaps.

190
00:28:25.499 --> 00:28:28.528
All right, and they're just putting that in red because.

191
00:28:28.528 --> 00:28:39.179
If you went to someone, a video or something, I think this stuff's available in video. Also. If you want to find someone else some describing the slave slides here. But I, perhaps I'm doing it in less time.

192
00:28:39.179 --> 00:28:42.538
In the video, because I'm hitting just the high points.

193
00:28:44.068 --> 00:28:48.898
Okay, so if they're talking about here.

194
00:28:48.898 --> 00:28:52.499
Is take the output matrix and.

195
00:28:52.499 --> 00:28:56.368
Partition it into blocks here there are 2 by 2 blocks.

196
00:28:56.368 --> 00:29:00.088
And within 1 block.

197
00:29:00.088 --> 00:29:03.298
Again, the blocks of the matrix could be mapped to.

198
00:29:03.298 --> 00:29:08.098
Threads and a thread block, but compute within 1 block at a time.

199
00:29:09.209 --> 00:29:14.278
And the goal will be maybe the data that you have to read for the golf memory might get used.

200
00:29:14.278 --> 00:29:19.828
More intensively.

201
00:29:22.078 --> 00:29:27.628
And what they're showing here is this is computing a 2 by 2 block.

202
00:29:27.628 --> 00:29:32.489
Of the output matrix and so.

203
00:29:32.489 --> 00:29:40.648
You computing 4 output elements, but we have 2 rows of M and 2 columns of and.

204
00:29:40.648 --> 00:29:47.999
Now, what that means to so so each element of them is actually used twice, not once the previous.

205
00:29:47.999 --> 00:30:00.473
Slide I showed you a few sides back, we would read a roll of them and a column of adopt them and tell them in a minute and would be used once to compute that element of P. then we'd do another element of PII.

206
00:30:00.653 --> 00:30:03.624
Well, here, what we're doing is a 2 by 2 block of P.

207
00:30:04.078 --> 00:30:12.808
So so, if we can store these 2 rows of them, and these 2 columns of and locally, perhaps anticipating a little each.

208
00:30:12.808 --> 00:30:21.568
Element will get used twice instead of once. So we will have have the requirements on the global memory and.

209
00:30:23.338 --> 00:30:27.838
The cost, we will have doubled our use of.

210
00:30:27.838 --> 00:30:31.618
We will double our computation use on the GPU.

211
00:30:32.183 --> 00:30:46.794
Because it's limiting, it is reading from Nicole, the memory. What that requires is if we're going to use these 2 rows and these 2 columns twice that we have to be able to store them locally somehow. So now you can me. I'm sure you're starting to think of.

212
00:30:47.398 --> 00:30:53.638
Trade offs and stuff like that. So it's a time it's a space time.

213
00:30:53.638 --> 00:30:59.068
Computation time trade off now where we're going to do that.

214
00:30:59.068 --> 00:31:03.959
Is that we have this local register file available?

215
00:31:03.959 --> 00:31:08.189
So, and again.

216
00:31:08.189 --> 00:31:15.388
The CUDA cores each thread kind of up to 255 words of data.

217
00:31:15.388 --> 00:31:18.538
Maybe 256, I'm not certain.

218
00:31:19.828 --> 00:31:23.699
Words of that that I stores locally that are very fast.

219
00:31:23.699 --> 00:31:32.308
Okay, okay so here is the hierarchical memory you've seen most of this before it has an extra.

220
00:31:32.308 --> 00:31:36.088
It has every 1 extra feature on this so this is.

221
00:31:36.088 --> 00:31:41.038
A partial version of how the memory is laid out on the device.

222
00:31:41.038 --> 00:31:46.919
It's okay, we've got the grid, which is the whole program you're running sort of it's.

223
00:31:46.919 --> 00:31:56.969
The grid has some global memory at the bottom, 48 gigabytes and the good on parallel. It has also a small amount of constant memory.

224
00:31:56.969 --> 00:32:02.368
Which is fast, which is very small. I think it's like.

225
00:32:02.368 --> 00:32:05.608
Give or take 48 K bytes.

226
00:32:05.608 --> 00:32:08.788
But the thing is that the assumption.

227
00:32:08.788 --> 00:32:18.628
It's fast, no latency. And the assumption is that well, the requirements at the same constant memories visible to all of the threads.

228
00:32:18.628 --> 00:32:24.328
So, if you have something that's read, only and all and all the threads want to be able to see it.

229
00:32:24.328 --> 00:32:37.199
Put it in constant memory, any case, the greatest partition into thread blocks and each thread block has an amount of shared memory, fast shared memory available to all the threads in the thread block but it's private.

230
00:32:37.199 --> 00:32:41.159
To the block, the block terminates, the shared memory goes away.

231
00:32:41.159 --> 00:32:52.588
Then we got the block contains the so the yellow block contains the green threads up to a 1024. well, the grid contain very many blocks. Actually.

232
00:32:52.588 --> 00:32:57.719
1M insight Thank not positive. Okay. In any case. So.

233
00:32:57.719 --> 00:33:03.749
You have the up to up to the 1000 green threads and each thread has its private.

234
00:33:03.749 --> 00:33:09.898
Past registers and has access to the same shared memory as the other threads fried, right?

235
00:33:09.898 --> 00:33:13.169
What's not shown here is.

236
00:33:13.169 --> 00:33:18.148
Is local memory for each thread, which is.

237
00:33:18.148 --> 00:33:31.169
Slow each thread has basically a private chunk of a global memory that's called local memory so it's slow, but it's overflowing other stuff. That's not shown here. Um.

238
00:33:31.169 --> 00:33:35.429
And whatever, textures and stuff like relating to graphics, but.

239
00:33:35.429 --> 00:33:39.028
Okay.

240
00:33:39.028 --> 00:33:43.259
So, now here is showing actually 4 types of memory.

241
00:33:43.259 --> 00:33:49.558
And you could have program can have.

242
00:33:49.558 --> 00:33:53.038
The, if this is defined inside.

243
00:33:53.038 --> 00:33:56.999
The program routine, running on the device.

244
00:33:56.999 --> 00:34:00.598
So, for example, in a global routine would be.

245
00:34:00.598 --> 00:34:03.808
An example, so if you just say, integer.

246
00:34:03.808 --> 00:34:08.458
It by default goes into the, it started in to register.

247
00:34:08.458 --> 00:34:12.838
And it's own visible to only that 1 thread.

248
00:34:12.838 --> 00:34:18.778
You can declare something as device shared.

249
00:34:18.778 --> 00:34:22.349
Which means that it's shared memory so that it's fast.

250
00:34:22.349 --> 00:34:25.829
But again, I forget 64 K.

251
00:34:25.829 --> 00:34:30.119
Bites or something, and it's visible to all the threads and the block.

252
00:34:30.119 --> 00:34:39.958
If you just say device, it's the global memory that everyone on the device can access it.

253
00:34:39.958 --> 00:34:47.338
It's cool. Well, the whole grid, the great is the parallel program and the lifetime is as long as the parallel program is running.

254
00:34:47.338 --> 00:34:50.818
You can say something is device constant.

255
00:34:50.818 --> 00:34:54.239
So, it's in that small, constant, fast cash.

256
00:34:54.239 --> 00:35:02.338
Okay, the device stuff has the large latency of 100 cycles or whatever the device constant is very fast.

257
00:35:02.338 --> 00:35:06.838
But it's read only and everyone sees the same constant.

258
00:35:06.838 --> 00:35:12.298
Cool. But again, I think it's like 48 K bytes or something.

259
00:35:12.298 --> 00:35:17.398
Okay, and again, this doesn't show a local.

260
00:35:17.398 --> 00:35:29.128
Memory these per thread arise, that's the per thread local memory that sits in the globe. So it's a way threads can have more memory.

261
00:35:29.128 --> 00:35:34.739
But it's slow question.

262
00:35:34.739 --> 00:35:40.199
Only, the host can write to the constant I don't know, I'll have to check on that. So.

263
00:35:41.728 --> 00:35:51.509
It's probable that there's some way for the device to write to the constant Mary, but I don't know what that is. So good question.

264
00:35:51.509 --> 00:35:54.958
Okay, so this.

265
00:35:54.958 --> 00:35:58.048
Sets if we go back here, what does shared mean?

266
00:35:58.048 --> 00:36:06.599
Goes into shared memory up here. Okay. So what this does is it puts a tile of data.

267
00:36:06.599 --> 00:36:10.349
Into the shared memory.

268
00:36:10.349 --> 00:36:14.878
So, all the threads in the block can read and write it. So.

269
00:36:14.878 --> 00:36:18.659
Up to the size of the shared memory. This is 1 of your hard commitments.

270
00:36:18.659 --> 00:36:29.759
But you do this, and this is going to anticipating a little this will get loaded and cooperatively by a lot of threads. Actually. So be a lot of threads will load the shared.

271
00:36:29.759 --> 00:36:34.289
Array and then we'll start threads that we'll use it. So.

272
00:36:37.079 --> 00:36:42.148
Where to declare variables.

273
00:36:43.798 --> 00:36:47.248
Yeah, so probably the host can access the.

274
00:36:47.248 --> 00:36:53.248
Write the content, but I'm not positive. So where are you defining stuff and so on.

275
00:36:53.248 --> 00:37:02.009
So shared memory again, it's low latency and high throughput.

276
00:37:03.059 --> 00:37:10.438
But it's small, so scratch bad environment it's implemented by some very expensive hardware.

277
00:37:10.438 --> 00:37:21.478
Which is why, as invidia create advances in the general new generations, they do not increase the size of the shared memory.

278
00:37:21.478 --> 00:37:29.369
Or the number of registers available, what they do is they add more hardware is they create more streaming multi processors.

279
00:37:29.369 --> 00:37:34.648
More coded threads could run parallel, but they did not increase the shared memory and register.

280
00:37:34.648 --> 00:37:42.809
Okay, so you got shared memory that everyone can get added a global memory. Everyone can get at it and sell on. So.

281
00:37:42.809 --> 00:37:45.898
Yeah. Okay. Nothing much.

282
00:37:45.898 --> 00:37:49.289
No, new content here and.

283
00:37:49.289 --> 00:37:52.949
Everyone could get it everything. Basically it was enough arrows here.

284
00:37:55.469 --> 00:37:59.278
Okay, that was.

285
00:38:03.958 --> 00:38:07.978
Let's see.

286
00:38:10.168 --> 00:38:13.260
Can to.

287
00:38:15.449 --> 00:38:19.829
Silence.

288
00:38:19.829 --> 00:38:23.159
Okay, so.

289
00:38:23.159 --> 00:38:34.590
We're getting more into this tile parallel algorithm idea. And the idea is that we want to read some global memory into the past.

290
00:38:34.590 --> 00:38:37.920
Cash you might almost call it and then use it several times.

291
00:38:37.920 --> 00:38:43.230
Um, so.

292
00:38:43.230 --> 00:38:46.619
Basic multiplication each thread.

293
00:38:46.619 --> 00:38:52.619
Is accessing data tried to to global memory you might say.

294
00:38:52.619 --> 00:39:00.900
So, but what they're showing here is a different threads are accessing the same global memory but separately.

295
00:39:02.280 --> 00:39:12.420
But any case, so we have we proposed the red cash. It's a cash. That's what the shared memory is. What is the cash that you explicitly manage?

296
00:39:12.420 --> 00:39:16.079
So.

297
00:39:16.079 --> 00:39:26.460
And so we load the cash with a chunk of the global memory process at the moment with another chocolate, global memory and process it.

298
00:39:28.469 --> 00:39:32.909
Relating to carpool, so the.

299
00:39:32.909 --> 00:39:39.300
Interesting thing and things like traffic design. There's some paradoxes.

300
00:39:39.300 --> 00:39:45.989
Where closing a road if okay, if every car optimizes.

301
00:39:47.159 --> 00:39:53.039
His root home, there are cases there is a paradox. We're closing a highway.

302
00:39:53.039 --> 00:39:56.550
Putting a barrier across the highway, so no, 1 can take it.

303
00:39:56.550 --> 00:40:09.750
Will increase everyone's speed will decrease everyone's time to get home. It sounds counter intuitive. I mean, I can draw it for you if you're interested, but it's called the grace paradox, closing a highway.

304
00:40:09.750 --> 00:40:13.469
Can increase through put on the highway system.

305
00:40:13.469 --> 00:40:17.550
If every driver locally optimizes crazy, but.

306
00:40:18.659 --> 00:40:29.550
There's also another sort of paradox where, if you have a highway running at capacity, if you take a few random drivers and pull them off the road, tell them park for an hour.

307
00:40:29.550 --> 00:40:33.210
Then the throughput again increases.

308
00:40:33.210 --> 00:40:45.000
Including averaged over the drivers that you pulled off the road so they took an hour more to get home, but a, very, but a lot of other drivers got home past enough that the average improved.

309
00:40:45.000 --> 00:40:48.210
Counterintuitive things. Okay.

310
00:40:48.210 --> 00:40:53.760
Yeah, you all know what's happening here. So.

311
00:40:53.760 --> 00:40:57.780
Asking for riders, so they can get in the carpool lane.

312
00:40:57.780 --> 00:41:01.050
You do do do nothing interesting here.

313
00:41:01.050 --> 00:41:07.710
The point about these slides is that this.

314
00:41:07.710 --> 00:41:12.690
Cashing works only if the different threads want the same data at the same time.

315
00:41:13.980 --> 00:41:17.760
And you got to synchronize stuff. Okay.

316
00:41:17.760 --> 00:41:21.210
The different periods are taking different times.

317
00:41:21.210 --> 00:41:26.039
I know what works, but for some reason, you synchronize occasionally.

318
00:41:26.039 --> 00:41:29.940
Only threads in the same block.

319
00:41:29.940 --> 00:41:34.320
Okay um, nothing deep here.

320
00:41:34.320 --> 00:41:39.090
You know, identify membranes, access and multiple threads cash.

321
00:41:39.090 --> 00:41:43.230
Synchronized to make sure that all the data has been loaded.

322
00:41:43.230 --> 00:41:48.659
Process that synchronize again to make sure that it's all been processed and then move on.

323
00:41:48.659 --> 00:41:57.059
Okay.

324
00:42:05.070 --> 00:42:15.869
So, what we're going to do is take a strip of several rows of them and several columns of and as much as will fit into the local shared memory.

325
00:42:15.869 --> 00:42:23.369
And fit it into the local shared environment and compute a block of the matrix of several rows and columns.

326
00:42:23.369 --> 00:42:30.420
And this would, yeah. Okay. So this depends on the size of Eminem. If I'm.

327
00:42:30.420 --> 00:42:35.190
And is bigger than you can put fewer rolls into the.

328
00:42:35.190 --> 00:42:41.579
Into the shared memory, of course, because you want to put a whole row in maybe unless you're doing another level of walking.

329
00:42:41.579 --> 00:42:45.989
Okay, nothing new there.

330
00:42:45.989 --> 00:42:58.380
Well, there is something here is instead of putting okay, this is a new ideas. I alluded to it a minute ago, instead of putting several complete rows and several complete columns event into the cash.

331
00:42:58.380 --> 00:43:02.429
Like, cashing me in the shared local shared memory, put blocked.

332
00:43:03.510 --> 00:43:11.099
Partition and into blocks, and here you are, the smaller squares and put several blocks.

333
00:43:12.210 --> 00:43:19.110
Put blocks of, em, and into the shared memory. So this concept here scales off no matter how big admin, and are.

334
00:43:21.420 --> 00:43:30.300
Now, you'll have to read each flock into the shared memory several times. Perhaps in fact, you will, but.

335
00:43:30.300 --> 00:43:34.650
But it's still, it still pays off.

336
00:43:36.150 --> 00:43:40.230
So.

337
00:43:40.230 --> 00:43:48.000
And I'm going to skip through this somewhat, but what the concept is here, and had been partition into blocks.

338
00:43:48.000 --> 00:43:56.159
You load a block into shared memory 2 by 2 block of attitude by 2 block of an, and then you can, um, computer 2 by 2 o'clock of.

339
00:43:56.159 --> 00:44:02.789
Pete, and then you say, maybe keep the block of them and then load a different block of into memory. Perhaps.

340
00:44:02.789 --> 00:44:07.139
It's a partitioned memory. I'm on vacation thing.

341
00:44:07.139 --> 00:44:19.650
Well, by the way, some of you are where there are faster ways to multiply matrices this method here for end by matrices takes order of and cubed.

342
00:44:19.650 --> 00:44:26.429
Multiplications there are methods like the and method.

343
00:44:26.429 --> 00:44:34.170
What 40 years? 50 years old? It takes end of the 2.7 operations. Asymptotically.

344
00:44:34.914 --> 00:44:47.275
And because it can multiply 2, 2 by 2, major season, 7 multiplication instead of an 8, which is a sort of thing. That was just intuitively assumed to be impossible until Strauss and did it. And then it was obvious.

345
00:44:48.324 --> 00:44:54.235
These paradigm shifts, but the trouble with methods like that is they're much more complicated.

346
00:44:54.510 --> 00:44:57.780
So, they do not.

347
00:44:57.780 --> 00:45:05.070
Lend themselves, so their recursive, their hierarchy over complicated. They're not so they don't have regular tad of patterns.

348
00:45:05.070 --> 00:45:12.210
So, they're not so easy to parallelize effectively. So, the basic and cube method is, um.

349
00:45:13.320 --> 00:45:16.889
It's asymptotically not the best, but it's simple.

350
00:45:16.889 --> 00:45:21.655
And simple is worth a lot by the is not the best way.

351
00:45:21.864 --> 00:45:32.335
They've batch that exponent down to end to the 2.3 or 2.4 I think, but the constant factor at the front of that time expression keeps growing as you bash down the exponent.

352
00:45:35.039 --> 00:45:39.329
It's an open problem, but how you can make the exponent. So.

353
00:45:40.380 --> 00:45:43.469
Okay, so again, it's what I'm showing here, you.

354
00:45:43.469 --> 00:45:46.949
Take a, you multiply block of times of block of in.

355
00:45:48.570 --> 00:45:58.860
Okay, bang, bang and then you grab you keep the old block of bam, you grab a new block, a van and your computing stuff and so on.

356
00:46:02.519 --> 00:46:07.679
Well, you still, it is still totalling into the same block of P. by the way.

357
00:46:07.679 --> 00:46:10.710
Because a block of them needs the whole column of.

358
00:46:12.869 --> 00:46:18.329
And then we can add up operations and stuff like that. I'll skip over that.

359
00:46:18.329 --> 00:46:22.139
You got to do synchronization.

360
00:46:22.139 --> 00:46:32.969
Because again, you've all the threads and the block working off the same shared memory for the block. But the threads in the block are not necessarily running at the same time. In fact, they're probably not.

361
00:46:32.969 --> 00:46:39.570
And so the warps are not running at the same time. Their schedule.

362
00:46:39.570 --> 00:46:50.400
Especially because as fewer floating point processors, and there are threats possible. So you have to synchronize to make sure all the threads in the block have completed. So.

363
00:46:52.920 --> 00:46:56.940
Yeah, okay.

364
00:46:56.940 --> 00:47:02.280
Okay, that was getting the number here.

365
00:47:02.280 --> 00:47:08.730
4.3, we're going to a good number of slides sets today.

366
00:47:11.639 --> 00:47:17.219
4.4.

367
00:47:20.039 --> 00:47:26.610
We're going to be I'm going to be going through this past here. Just give you the highlights.

368
00:47:27.929 --> 00:47:31.469
Interesting here the indexing.

369
00:47:31.469 --> 00:47:35.909
The details of how you index that you could figure that out to yourself.

370
00:47:35.909 --> 00:47:39.179
Um.

371
00:47:39.179 --> 00:47:47.699
You can look at the code yourself if you want, but here what this sync threads this is and the global.

372
00:47:47.699 --> 00:47:51.389
But program function that runs for each thread. So.

373
00:47:53.099 --> 00:47:59.579
So, you're, what they're doing is they're adding into the total for that pixel.

374
00:47:59.579 --> 00:48:03.659
And into a local sub, total variable.

375
00:48:03.659 --> 00:48:08.849
And then when all the threads have done, have computed their local p value P values.

376
00:48:08.849 --> 00:48:13.559
Register leading to the thread and then what you do.

377
00:48:13.559 --> 00:48:18.809
Is you add it into the you write it into the.

378
00:48:21.030 --> 00:48:26.070
The global element for that.

379
00:48:26.070 --> 00:48:29.070
But that's where I view right that pixel. So.

380
00:48:31.139 --> 00:48:34.650
That's the interesting part of that.

381
00:48:35.699 --> 00:48:40.019
So you want all the threads to be ready computing this before you.

382
00:48:40.019 --> 00:48:47.099
Do that I'm not actually certain here why you have to sync thread there and looking at that, but.

383
00:48:48.989 --> 00:48:54.570
So you have to sync threads here because.

384
00:48:54.570 --> 00:49:01.920
You're computing the tile and the tile and they're up here and shared memory. Okay.

385
00:49:01.920 --> 00:49:13.920
And the different components of these tiles for Eminem are being computed by different threads. So, the sync threads mean, all the components of the 2 tiles have now been computed.

386
00:49:13.920 --> 00:49:22.530
So, now you can go and you read them because this thread is reading elements of those tiles that were written by other threads.

387
00:49:22.530 --> 00:49:31.559
That's why you have to synchronize here as I'm talking. I'm cannot completely understand or at all understand why you.

388
00:49:31.559 --> 00:49:34.590
After synchronized there actually.

389
00:49:34.590 --> 00:49:39.329
Anyone has any ideas I mean, you're inside a loop.

390
00:49:39.329 --> 00:49:45.780
But, oh, well.

391
00:49:48.929 --> 00:49:54.090
If anyone isn't, isn't a Tom at the increasing p value.

392
00:49:56.039 --> 00:50:00.420
Well, that's separate from the yeah, that's going to have to be an atomic.

393
00:50:00.420 --> 00:50:04.260
Uh, increment here, right?

394
00:50:04.260 --> 00:50:09.989
Now, if this program, if this code is actually correct.

395
00:50:09.989 --> 00:50:14.820
Then the Tom, then this is despite a false and atomic increment.

396
00:50:14.820 --> 00:50:18.000
You know, I would have to check the documentation to see.

397
00:50:18.000 --> 00:50:21.539
But that still doesn't explain why syncs variances needed.

398
00:50:21.539 --> 00:50:28.619
Oh, well, okay. Talking about resource limits here.

399
00:50:28.619 --> 00:50:37.650
And here, they're computing how many floating operations you need for each memory load and so on.

400
00:50:38.880 --> 00:50:45.179
Okay, let's get to detail for a different tile size. That is so.

401
00:50:45.179 --> 00:50:53.159
The Macs should be 32 by 32, because that's a 1000 threads. And that's how many threads you're allowed to happen in the block. So.

402
00:50:55.139 --> 00:51:06.179
So, what happens if there's 32 by 32 blocks you need a, you need a tile block from them and 1 command. Each is a so yes.

403
00:51:06.179 --> 00:51:12.389
To K, float loads and then this is how much you're going to use the.

404
00:51:12.389 --> 00:51:19.769
The blocks with 32 floating operations for each memory load and.

405
00:51:19.769 --> 00:51:23.250
But the thing is that you might wonder.

406
00:51:23.250 --> 00:51:33.030
Well, How's that possible? Lots of floats for 1 load. But the thing is that the loads are, when you do a memory load, it's available to all of the threads.

407
00:51:33.030 --> 00:51:37.679
And that's the key and the floating operations are per are being done.

408
00:51:37.679 --> 00:51:46.590
You know, in parallel on each separate thread, that's why that you might look at this and say, how could you do that? Well, that's the reason if you think about it.

409
00:51:46.590 --> 00:51:51.179
So the fact that the memory loads are being done in parallel.

410
00:51:52.619 --> 00:52:01.230
Before, and then the floating operations, I'm being a little vague, but you might be able to see why this is actually a reasonable amount of parallelism there.

411
00:52:04.440 --> 00:52:10.559
Okay, and then they start talking will it fit in the shared memory and so on? So.

412
00:52:13.349 --> 00:52:19.530
And, um.

413
00:52:19.530 --> 00:52:23.460
Okay, and each thread needs somewhat to the shared memory.

414
00:52:23.460 --> 00:52:33.269
This is the older architecture is only 16 K bytes a shared memory. It's more nowadays. Okay. And the thing about here is that.

415
00:52:33.269 --> 00:52:38.429
If you have fewer threads, perhaps or more thread box are fewer.

416
00:52:38.429 --> 00:52:49.019
More thread blocks means fewer threads for blocks. So a total number of threads will be the same. So, this is this point that more threads, but each.

417
00:52:49.019 --> 00:52:52.920
So, what this is talking about more blogs.

418
00:52:52.920 --> 00:52:56.940
Fewer threads for block, so the threads in the block.

419
00:52:56.940 --> 00:53:04.019
I have more resources, so what this means is it's more shared memory per thread.

420
00:53:05.340 --> 00:53:10.679
And that sometimes is a win. There is a.

421
00:53:10.679 --> 00:53:16.199
There was a talk at the GPU technology conference a couple of years ago, demonstrating this.

422
00:53:16.199 --> 00:53:21.210
Okay, Westerns.

423
00:53:27.900 --> 00:53:31.380
Silence.

424
00:53:32.579 --> 00:53:38.130
Okay, I'm going to go through the slides that passed.

425
00:53:39.510 --> 00:53:45.869
Let me give intellectual content, your partitioning matrices and blocks.

426
00:53:45.869 --> 00:53:52.320
And threads into thread blocks. Well, it may not go evenly. You've got a fraction on block.

427
00:53:52.320 --> 00:53:56.429
At the end, so your Matrix, you're going to a fractional.

428
00:53:56.429 --> 00:54:00.809
Block at the right side of the matrix and fractional blocks at the bottom of the matrix.

429
00:54:02.099 --> 00:54:07.199
Yeah, it just makes the programming a little Messier. That's all. I just summarized it.

430
00:54:07.199 --> 00:54:17.730
Arbitrary size matrices and so on this talking about you could pad 1 way is a pad, the matrix up to the next multiple of the.

431
00:54:17.730 --> 00:54:23.130
Of the block size, it makes it programming easier, but it takes more space.

432
00:54:23.130 --> 00:54:27.150
Significant or not depends on how big the blocks are.

433
00:54:27.150 --> 00:54:30.809
Yeah. Okay. Nothing new there.

434
00:54:31.980 --> 00:54:35.579
Nothing new there, nothing new there. Um.

435
00:54:35.579 --> 00:54:40.409
And nothing new there.

436
00:54:43.619 --> 00:54:51.630
Okay, let me what's happening here. I'm going to give you a summary.

437
00:54:51.630 --> 00:55:01.889
The threads are doing different things so the 1st thing, the threads do is read data from the global memory into the shared memory read, read in that block of data.

438
00:55:01.889 --> 00:55:05.610
And then the next thing, and then they synchronize and then they use it.

439
00:55:05.610 --> 00:55:14.730
And they're talking about a thread may have a valid use in the 1st, step of reading data in but not in the 2nd, step of computing and output.

440
00:55:14.730 --> 00:55:17.820
Element because in the 2nd step.

441
00:55:17.820 --> 00:55:22.050
The element it would compute is off the boundary of the matrix. So.

442
00:55:22.050 --> 00:55:26.730
So the showing here, you've got.

443
00:55:26.730 --> 00:55:30.960
The blank elements of the, the actual matrix is the number.

444
00:55:30.960 --> 00:55:35.159
Entries here, the blank elements are padding it out to the next block size. So.

445
00:55:35.159 --> 00:55:44.579
Yeah, I don't actually find the slides have particularly deep and interesting. I just gave you the content. Yeah the blocks go off of the.

446
00:55:44.579 --> 00:55:48.030
Edge of the green matrix. Yeah. So you.

447
00:55:48.030 --> 00:55:51.119
You know, you code, so if you don't access.

448
00:55:51.119 --> 00:55:56.760
Invalid memory by doing conditionals like that and so on.

449
00:55:56.760 --> 00:56:02.610
What's happening here is a fine point of the C or C. plus plus.

450
00:56:02.610 --> 00:56:10.079
This logical and is required to be a lazy evaluation. It is not an option.

451
00:56:10.079 --> 00:56:14.039
If role lesson with is false.

452
00:56:14.039 --> 00:56:17.159
It is prohibited to evaluate that.

453
00:56:18.659 --> 00:56:24.539
Which is good because of, which is false. This might be an illegal.

454
00:56:25.679 --> 00:56:34.739
Operation width would be so is doing some computation here. So this computation could be invalid if the.

455
00:56:34.739 --> 00:56:41.250
1st clause was false, but that's okay. Cause this condition won't get evaluated. If the 1st cause this false.

456
00:56:42.599 --> 00:56:47.909
I use something like this in a little C program that I wrote many years ago. A.

457
00:56:47.909 --> 00:56:55.829
Points inside a Polygon it uses it's 8 lines of code actually, and confuses people. I use something like this and.

458
00:56:55.829 --> 00:57:01.980
Makes out people, because other languages, Java, I guess, don't have this required lazy evaluation.

459
00:57:01.980 --> 00:57:08.550
You can't just take my seat program and do the simple translation of Java. It will fail.

460
00:57:08.550 --> 00:57:18.690
Okay, nothing interest you check boundary, conditions like this. Okay. And so you will have thread divergence with this.

461
00:57:19.710 --> 00:57:27.360
What that means is that if this is true, then the body gets executed.

462
00:57:27.360 --> 00:57:30.389
And then the L spotty does not.

463
00:57:30.389 --> 00:57:45.269
If the predicate is false, then body is not executed in the body. So this will take twice as long to execute because the threads are all the threads in the war are doing the 1 or the other. They're not doing both spread. It's called thread divergence.

464
00:57:45.269 --> 00:57:56.789
Okay, well, you know, it's tolerable to a point. You may don't want to have perhaps multiple nest and if then else blocks within your thread divergence. So, start getting.

465
00:57:56.789 --> 00:58:00.480
To the point here utilization will fall a lot. So.

466
00:58:00.480 --> 00:58:04.050
Okay.

467
00:58:05.730 --> 00:58:10.380
So, what you're doing is you're doing you're adding stuff in here.

468
00:58:10.380 --> 00:58:15.780
Well, these are all inside these enabling clauses.

469
00:58:15.780 --> 00:58:20.369
Nothing deep you could figure it all out if you didn't read. Decides that okay.

470
00:58:20.369 --> 00:58:23.789
And again, so it's called controlled divergence.

471
00:58:23.789 --> 00:58:31.139
General rectangular matrices. There's nothing interesting here. You just have to do it.

472
00:58:31.139 --> 00:58:34.679
Himself.

473
00:58:34.679 --> 00:58:38.789
Okay, good questions.

474
00:58:38.789 --> 00:58:42.840
Okay, let's.

475
00:58:42.840 --> 00:58:48.360
On.

476
00:58:48.360 --> 00:58:54.210
Silence.

477
00:58:54.210 --> 00:59:01.110
Okay.

478
00:59:01.110 --> 00:59:11.010
Thread execution efficiency. Okay. So the threads are bundled into warps 32 at a time. We got the 70.

479
00:59:11.010 --> 00:59:15.809
Alright, we're under control divergence I told you about was going to see it in more detail here.

480
00:59:15.809 --> 00:59:21.269
Okay, so warps the 32 threads, aged the green, the red and the purple warp.

481
00:59:21.269 --> 00:59:30.599
So, when you program, you might not actually be ever aware of warps there.

482
00:59:30.599 --> 00:59:36.150
An efficiency, implementation technique, they don't necessarily affect your program at all.

483
00:59:38.250 --> 00:59:41.460
So that.

484
00:59:41.460 --> 00:59:44.880
You know, your quota program never never sees.

485
00:59:44.880 --> 00:59:47.909
Never has an explicit or in it, but they.

486
00:59:47.909 --> 00:59:51.539
You want to be aware of them because they affect the efficiency.

487
00:59:51.539 --> 00:59:58.289
Scheduling units the warps again, there's a cue of warps or pool of warps waiting to run.

488
00:59:58.289 --> 01:00:01.500
And then the streaming multi process runs them.

489
01:00:01.500 --> 01:00:04.769
As resources become available.

490
01:00:06.000 --> 01:00:16.619
This is some tactic shuttering where the thread could be a 2 D thread block. It's just laid out in Rome major order. I don't even know why they added this to the.

491
01:00:16.619 --> 01:00:28.949
Architecture specification, because you can implement it you could realize it yourself so easily lease and C plus plus and I said I've done it and C plus plus of little conversion, implicit conversion routines in the classes.

492
01:00:28.949 --> 01:00:32.340
Okay.

493
01:00:34.050 --> 01:00:44.760
I mean, that's nice. And C plus plus so I index into an array, I can either use a scaler or I can use a 2 vector let's say, and it just calls the implicit conversion.

494
01:00:44.760 --> 01:00:51.300
That makes the programming nice. Okay. Threads in a war.

495
01:00:52.739 --> 01:00:58.409
They may change some generation to generation, but in 20 years they haven't changed.

496
01:00:58.409 --> 01:01:07.260
And again, just the point I keep saying, but the separate warps get scheduled independently, they may run.

497
01:01:07.260 --> 01:01:11.190
Side by side or 1 after the other, whatever.

498
01:01:11.190 --> 01:01:14.579
You saw this figure before.

499
01:01:15.929 --> 01:01:29.519
Nothing new and video actually does it says single process or single data? Multiple thread. These are slightly different acronym here, but.

500
01:01:29.519 --> 01:01:36.300
Sort of the same thing. Okay. And again, the point about this is less control overhead.

501
01:01:36.300 --> 01:01:45.510
What you don't have in the gpo is all of the speculative execution stuff, the stuff that makes, you know.

502
01:01:45.510 --> 01:01:53.130
Intel so big, and get them their really high low latency performance that all got stripped out of the.

503
01:01:53.130 --> 01:01:56.940
So, in order to have this parallelism.

504
01:01:56.940 --> 01:02:01.710
Okay.

505
01:02:03.420 --> 01:02:11.639
Once okay, I mentioned if then else, if a thread makes a different decision, then the thing waits for.

506
01:02:11.639 --> 01:02:17.610
Okay, here's the thing loops that are inside the thread. Iterate the same number of times.

507
01:02:17.610 --> 01:02:21.449
Well, what will happen is if 1.

508
01:02:21.449 --> 01:02:28.110
Threads loop 2 or times, and it pauses while the slower threads, finish their looping.

509
01:02:29.340 --> 01:02:33.510
So controlled divergence. Okay.

510
01:02:35.159 --> 01:02:46.949
What I just set up here, and if there's different paths, they could serialize. So the thread is taking 1 path, rather than the threads taking the other path.

511
01:02:48.239 --> 01:02:51.929
And if perhaps nesting is a bad idea.

512
01:02:51.929 --> 01:02:55.590
Number total number of pass will grow exponentially.

513
01:02:58.739 --> 01:03:07.980
Okay, we'll see an example here. Okay so if you do something like this in a thread. Okay, it's not the thing I highlighted.

514
01:03:07.980 --> 01:03:16.710
Okay, so this is a problem because depending on the threat index, the body gets executed, or does not get executed. So, 2 different control.

515
01:03:17.789 --> 01:03:20.789
Is it so.

516
01:03:20.789 --> 01:03:27.809
Take longer now, this here is okay.

517
01:03:27.809 --> 01:03:37.409
You're branching based on the block number the block index, but that's okay because the different blocks of different warps.

518
01:03:37.409 --> 01:03:41.460
And they're running different control things. So so here's the thing.

519
01:03:41.460 --> 01:03:45.989
You got to say a 1024 threads in the block and the thread block.

520
01:03:45.989 --> 01:03:51.210
32 walk to 32 thread so inside each warp.

521
01:03:51.210 --> 01:03:55.469
The threads are doing the same thing, but the different warps in the block.

522
01:03:55.469 --> 01:03:59.909
Have no constraints on them so the different warps in the block.

523
01:03:59.909 --> 01:04:04.679
Can certainly be running different execution pass so.

524
01:04:06.179 --> 01:04:11.190
Yeah, well.

525
01:04:11.190 --> 01:04:20.489
With some footnotes and the different blocks, and the grid can absolutely be doing different things. They don't even have access to the same memory and part of the global memory.

526
01:04:20.489 --> 01:04:24.449
The different warps in the block I mean.

527
01:04:24.449 --> 01:04:30.719
Be careful about this they're running the same instruction sequence.

528
01:04:30.719 --> 01:04:34.559
But they can be at point the instruction.

529
01:04:34.559 --> 01:04:40.199
Register pointing to the current instruction can be different for the different warps.

530
01:04:41.639 --> 01:04:48.179
So, if you're not all totally confused yet, see, if you got that if that else divergence thing.

531
01:04:48.179 --> 01:04:57.539
Or different warps 1 war could be running the then block at the same time as another warp is running the block.

532
01:04:57.539 --> 01:05:06.900
So, they're running the same instruction scheme, but there are different places in it. So I've totally confused too. I succeeded. No. Okay.

533
01:05:06.900 --> 01:05:09.900
It actually does make sense if you think about it.

534
01:05:09.900 --> 01:05:15.420
So, the different warps in a block can be.

535
01:05:15.420 --> 01:05:19.889
At any given time on different instruction, that's why you have to synchronize.

536
01:05:19.889 --> 01:05:24.300
Okay, and the different mix of the different blocks that is.

537
01:05:24.300 --> 01:05:33.389
No connection at all can be running. It probably running a different time. So you probably got more blocks that want to run. Then you have.

538
01:05:33.389 --> 01:05:39.059
Resources to run them Qatar edition you saw this 1 before.

539
01:05:43.829 --> 01:05:48.179
Now, um.

540
01:05:48.179 --> 01:05:56.070
They're assuming that you want to add a factor that's 700, whatever, certain size 768 and do.

541
01:05:56.070 --> 01:05:59.489
Let me summarize this slide for you.

542
01:06:01.260 --> 01:06:07.170
The last block will not be full of threads so it's got threads in our Idol.

543
01:06:07.170 --> 01:06:10.170
And that's called diverged so.

544
01:06:10.170 --> 01:06:17.820
That's the content of that, because they made the number of elements more than a multiple of the number of.

545
01:06:17.820 --> 01:06:21.750
Threads per block. Okay.

546
01:06:22.860 --> 01:06:26.550
No question.

547
01:06:30.000 --> 01:06:33.059
Silence.

548
01:06:39.659 --> 01:06:43.860
Okay, boundary condition checking you got to do it.

549
01:06:43.860 --> 01:06:56.130
But does nothing deep in it and controlled divergence a point. Here is you might have a conditional depends on the data so you cannot necessarily find your control diverges with static code analysis.

550
01:06:56.130 --> 01:07:00.960
Because it could be dynamically depend on the data, you know, it's such a, if a data element equals 5.

551
01:07:00.960 --> 01:07:05.190
Then do something okay that so that I just depends on the data.

552
01:07:05.190 --> 01:07:08.489
And they're going to talk about here. Okay.

553
01:07:09.869 --> 01:07:15.150
Yeah, you know, don't write data that's outside the matrix.

554
01:07:15.150 --> 01:07:21.659
Done is divergence here. Okay. I don't read data. That's outside the matrix.

555
01:07:21.659 --> 01:07:25.139
Action I was wrong. Don't read outside. Okay.

556
01:07:25.139 --> 01:07:29.039
This is this is loading the local.

557
01:07:29.039 --> 01:07:33.630
Shared tile from the global memory, so okay.

558
01:07:33.630 --> 01:07:36.659
Boundary checks.

559
01:07:36.659 --> 01:07:43.380
You saw this before you're off nothing deep there.

560
01:07:43.380 --> 01:07:52.920
And you can compute to control the effect of the thing is so the last war, you're not using all the threads in the war. Some are idle. So.

561
01:07:52.920 --> 01:07:56.039
You could say that that's any fashion so it computes.

562
01:07:56.039 --> 01:08:00.989
The inefficiency nothing interesting there.

563
01:08:00.989 --> 01:08:04.650
Or there or there.

564
01:08:05.730 --> 01:08:11.369
Okay, um, or here, even actually. Okay.

565
01:08:13.650 --> 01:08:17.220
There could be some natural control divergence.

566
01:08:17.220 --> 01:08:21.630
Okay.

567
01:08:21.630 --> 01:08:25.140
That was that the questions.

568
01:08:25.140 --> 01:08:29.340
No, I'm looking at the chat window because to move on.

569
01:08:31.890 --> 01:08:36.930
All right it was module 5 of 1.

570
01:08:36.930 --> 01:08:42.750
On here your 21 modules we've done through 5. we've quite a bit today. Okay.

571
01:08:42.750 --> 01:08:48.090
Let's go into 6 and again I'm summarizing things.

572
01:08:50.430 --> 01:08:56.069
Oh, by the way what I'm doing here is I'm using this virtual.

573
01:08:56.069 --> 01:09:05.729
File system idea all this stuff as far as this big zip file and I just mounted a virtual file system that looks into the big zip file.

574
01:09:05.729 --> 01:09:12.479
And pulls out piece says, that's a thing I told you about. So it doesn't stress the file system so much, but if I care.

575
01:09:12.479 --> 01:09:18.630
I'm on okay, memory access.

576
01:09:20.189 --> 01:09:32.729
Okay, what I've been saying before more often than not your program's limited by getting the data to the processors.

577
01:09:32.729 --> 01:09:37.140
I oh dominate.

578
01:09:41.909 --> 01:09:45.659
I'm not quite certain what that means. Okay, this is.

579
01:09:45.659 --> 01:09:51.270
Water coming out of a dam, ideally, you'd want to have a high bandwidth, but in reality, you.

580
01:09:51.270 --> 01:09:55.319
Shipping to us draw.

581
01:09:56.729 --> 01:10:02.100
Again, I don't know if everyone here is a hardware person.

582
01:10:02.100 --> 01:10:07.949
For your software types, your direct dynamic, random access memory.

583
01:10:07.949 --> 01:10:11.340
Each bit is a little capacitor effectively.

584
01:10:11.340 --> 01:10:16.859
Which is controlled by a transistor so the thing with capacitors.

585
01:10:16.859 --> 01:10:24.930
Is a discharge to resist a capacitors little circuit, which computes a time concept of how fast they discharge.

586
01:10:24.930 --> 01:10:28.680
And if you make them.

587
01:10:28.680 --> 01:10:37.170
Smaller they discharge faster, so the DRAM has to be refreshed as the capacitors are on down. If the capacitors are smaller.

588
01:10:37.170 --> 01:10:47.454
Then you have to refresh or more. The other thing is this time concept controls how quickly you can do things to the memory limits. This is why?

589
01:10:47.454 --> 01:10:58.704
That the memories are not getting faster at the same rate that process that the processors are getting faster because the processors are asynchronous memory. You make your gate smaller.

590
01:10:59.100 --> 01:11:09.510
They gate faster, but you cannot necessarily make the you make the DRAM smaller and past. It has to be refreshed more. And that takes more of your time.

591
01:11:09.510 --> 01:11:19.470
Software review of the limits of DRAM. Why doesn't get faster faster? So you could have static random access memory reach.

592
01:11:19.470 --> 01:11:24.180
Bit is a little flip flop that takes a couple of transistors.

593
01:11:24.180 --> 01:11:27.539
So, it's a synchronous logic.

594
01:11:27.539 --> 01:11:38.430
No, capacitors. It can the static gram can get faster. The problem is it's much more expensive. So static ramp might be used for the cash.

595
01:11:38.430 --> 01:11:42.930
But it's more expensive, so that's why you don't get humungous static RAM.

596
01:11:42.930 --> 01:11:46.710
Memories.

597
01:11:46.710 --> 01:11:51.595
You want to see really expensive static RAM look what they use in space crap.

598
01:11:51.595 --> 01:12:03.864
Some time some of the space craft that they go out out of the solar system has static memory and a bit might be 2 wires wrapped around each other and you wrap.

599
01:12:04.079 --> 01:12:11.939
Clock wise, so it would be a 1 year. Rep counterclockwise that might be as 0T. So a bit is really big. However.

600
01:12:11.939 --> 01:12:20.250
It's also really stable. You run this twisted memory through a Van Allen belt on Jupiter.

601
01:12:20.250 --> 01:12:23.430
And it survives, and these things won't.

602
01:12:23.430 --> 01:12:34.109
Tony are trade offs when I was a student static memory was actually magnetic. Course it was a little.

603
01:12:34.109 --> 01:12:37.260
About a millimeter 2 millimeters across.

604
01:12:37.260 --> 01:12:46.619
Made with, and it would magnetize clockwise or counterclockwise and once you magnetized it, it would stay magnetize effectively forever.

605
01:12:46.619 --> 01:12:56.579
That was static and you re, and you have some, a couple of wires going to the hole in the center of the and.

606
01:12:56.579 --> 01:13:03.359
Well, you would magnetize it by running occurrence 1 way or the other way and you would sense it.

607
01:13:03.359 --> 01:13:08.489
Actually, by magnetize if I'm running a current through and measuring.

608
01:13:08.489 --> 01:13:12.689
The flux change, so again, static lasted forever, but.

609
01:13:12.689 --> 01:13:18.300
Big expensive. That's when machines had.

610
01:13:18.300 --> 01:13:22.739
They talked about K, thousands of bites of memory, not.

611
01:13:22.739 --> 01:13:27.180
Gigabytes of memory. Okay. Nothing interesting. There.

612
01:13:28.500 --> 01:13:32.670
Are there they're slow I just told you, they're slow.

613
01:13:32.670 --> 01:13:40.020
But you can read and a chunk of memory at 1 time. So.

614
01:13:42.989 --> 01:13:48.510
There's a latency also, but sometimes the IO is a touch faster. That's the burst mode.

615
01:13:48.510 --> 01:13:55.170
Diverse mode is useful. Is your accessing sequential elements? Several sequential elements.

616
01:13:55.170 --> 01:13:59.850
Kay banks not relevant to the course.

617
01:13:59.850 --> 01:14:10.470
Okay, yeah, so where this is irrelevant, we're leading into the global memory okay on the gpo, which is a 48 gigabytes.

618
01:14:10.470 --> 01:14:25.409
And which is going to be D, RAM like this. So, the thing is so this is all relevant to that global memory. It's got the latency. But the thing is, it's got the burst and it's the global memory. It's 128 bytes.

619
01:14:25.409 --> 01:14:34.140
I read in 1 cycle, so it's gonna be maybe a 100 cycles to read the 1st bite, but then, wham, in 1 cycle, you could all 1, 2008 bytes.

620
01:14:34.140 --> 01:14:38.399
So, and.

621
01:14:38.399 --> 01:14:42.510
You would use them if a war or 32 threads.

622
01:14:42.510 --> 01:14:56.850
Wanted 32 consecutive words from the global memory toward being 4 bites. So you see that burst of 128 bytes. Somewhat global memory will serve the whole war for 32 to 3 things fit together nicely. Yeah.

623
01:14:57.899 --> 01:15:04.289
If the consecutive threads, and the war want consecutive words from the global memory, that's an IP. So this means.

624
01:15:04.289 --> 01:15:11.550
This is how you got a design, you program. That's why also in your program if you have say.

625
01:15:12.175 --> 01:15:26.935
To D, points X and Y, you don't have an array of the structure of the X wise you have a structure of an array of axes and array of Y, so all the X's consecutive, all the wiser consecutive. So this 1st thing idea will actually be useful.

626
01:15:27.210 --> 01:15:30.689
I'm just doing here. Okay.

627
01:15:33.175 --> 01:15:47.664
Talking about the global, so the global memory on the card with the gpo, it's actually that, you know, they worked really hard to make it fast and it is quite fast. So I talk about it being slow compared to the registers. Well, yeah, but it's past compared to anything else.

628
01:15:47.909 --> 01:15:55.890
Okay, so I might even run a program to show you how fast.

629
01:15:55.890 --> 01:16:00.449
In any case I'm talking about speech there.

630
01:16:01.859 --> 01:16:04.979
Okay.

631
01:16:04.979 --> 01:16:08.130
That's a reasonable point.

632
01:16:08.130 --> 01:16:12.119
To stop we went up through.

633
01:16:12.119 --> 01:16:18.300
6.1, just to show you what 6.2 might be without doing it.

634
01:16:18.300 --> 01:16:21.359
Talks about memory coalescing.

635
01:16:21.359 --> 01:16:29.250
This is what I just told you, the memory coalescing idea is that the 32 consecutive threads.

636
01:16:29.250 --> 01:16:39.029
Goal for 32 consecutive words of global memory and the accessing gets coalesced. So it's only 1 read not 32 reads.

637
01:16:39.029 --> 01:16:43.649
Show you what some of the bandwidth is so just for fun.

638
01:16:50.729 --> 01:16:55.050
I'm running on my local machine here.

639
01:17:07.649 --> 01:17:10.949
Silence.

640
01:17:10.949 --> 01:17:15.420
Silence.

641
01:17:15.420 --> 01:17:20.640
Okay, um.

642
01:17:20.640 --> 01:17:33.359
Just getting my laptop. Okay. It's doing a bandwidth device. The device that's inside the device 300 gigabytes a 2nd it's not not so awful hosted device.

643
01:17:33.359 --> 01:17:40.590
Is gigabytes a 2nd, so the bus from the device to the host it's the fastest.

644
01:17:40.590 --> 01:17:45.060
Bus on the whole computer, I think are there somewhat.

645
01:17:45.060 --> 01:17:49.529
Okay, but let me try parallel and see what it is.

646
01:17:50.939 --> 01:17:54.090
Silence.

647
01:17:57.600 --> 01:18:05.430
On the system. Okay.

648
01:18:05.430 --> 01:18:09.210
Silence.

649
01:18:13.770 --> 01:18:17.399
Silence.

650
01:18:17.399 --> 01:18:21.510
Silence.

651
01:18:23.399 --> 01:18:28.409
Okay, try to do a demo what this means is.

652
01:18:28.409 --> 01:18:34.140
I have to recompile stuff in Sunday, because they try to do a demo. It doesn't work, but that's.

653
01:18:34.140 --> 01:18:43.229
Okay, what I was hoping is I did this on paralleled. I would get higher speeds than if I did it on.

654
01:18:43.229 --> 01:18:49.079
On my laptop, which it has a mobile GPU for our examples.

655
01:18:49.079 --> 01:18:52.409
Other things here just to show you a.

656
01:18:52.409 --> 01:18:57.930
Remind you.

657
01:19:02.069 --> 01:19:05.699
Showing you features about this.

658
01:19:05.699 --> 01:19:12.960
It's still a Quadro, there's only 16 gigabytes of memory, only 3000 cores only.

659
01:19:12.960 --> 01:19:16.319
1500 megahertz.

660
01:19:16.319 --> 01:19:21.810
I say only, but I'm being sarcastic so the memory.

661
01:19:21.810 --> 01:19:26.520
So, we're seeing some fairly reasonable speeds here.

662
01:19:26.520 --> 01:19:31.380
You see, so the memory bus with 256 bits. Okay.

663
01:19:31.380 --> 01:19:37.500
This is this called the cash size this is used for things like.

664
01:19:37.500 --> 01:19:40.680
The only 4 megabytes.

665
01:19:40.680 --> 01:19:47.430
That goes into various things cash. Okay. So the constant memory 64 K bites.

666
01:19:47.430 --> 01:19:54.449
Shared memory 48 K bytes for block and shared memory for multi processor.

667
01:19:54.449 --> 01:19:58.529
So, there's 48 multi process, there's.

668
01:19:58.529 --> 01:20:02.130
And 64 cars per multi processor.

669
01:20:02.130 --> 01:20:09.989
So so what you're seeing here shared memory for blocking shared variable multi processor.

670
01:20:09.989 --> 01:20:14.399
You could see some optimization issues with how many blocks for.

671
01:20:14.399 --> 01:20:19.859
All due process multi processes are the same disagreement multi processes. This many registers.

672
01:20:19.859 --> 01:20:23.640
Threads and multi process or threads per block.

673
01:20:23.640 --> 01:20:29.850
Right, so the thing is, if there's more blocks and multi processor.

674
01:20:29.850 --> 01:20:36.689
Than the others wait to run, they've got.

675
01:20:36.689 --> 01:20:43.859
Space available so you see here because you see a block could have.

676
01:20:43.859 --> 01:20:48.180
To the chancellor of the 20 s to the 26 threads, but.

677
01:20:48.180 --> 01:20:55.770
Only a 1000 are going to run at once. I haven't talked about texture and so on yet but.

678
01:20:57.960 --> 01:21:05.130
Unified addressing and management I mentioned, so.

679
01:21:05.130 --> 01:21:15.840
And, you know, the stuff's originally intended for graphics so this texture memory.

680
01:21:15.840 --> 01:21:19.739
Which I haven't talked about and it's actually.

681
01:21:19.739 --> 01:21:24.119
Stored using some sort of space, filling curve, like a piano curve.

682
01:21:24.119 --> 01:21:27.539
Okay.

683
01:21:27.539 --> 01:21:32.069
Well, that's enough stuff for today you want to get to your next class.

684
01:21:32.069 --> 01:21:36.840
So, what we did is another chunk of the.

685
01:21:36.840 --> 01:21:40.920
And various teaching kit stuff on their.

686
01:21:40.920 --> 01:21:48.359
And I'm hitting the highlights and pointing it to the stuff that I think is interesting and skipping over. The stuff that I think is.

687
01:21:48.359 --> 01:21:53.250
Not so interesting and I will.

688
01:21:53.250 --> 01:22:01.109
6 parallel, so you can run this in parallel. Also the source codes also be able to look at that. I basically, I think have to.

689
01:22:01.109 --> 01:22:15.029
Read it, and maybe where you compile, it is very, there's run time modules, get loaded at run time and if they get updated, you get if there's any sort of version clash.

690
01:22:15.029 --> 01:22:18.359
Between 1 year could a program was compiled.

691
01:22:18.359 --> 01:22:23.880
And what version of run time molecules are, you're going to get these error messages. So, things have to keep in sync.

692
01:22:25.079 --> 01:22:30.659
Okay, so if there's any questions.

693
01:22:30.659 --> 01:22:34.199
No, other than that have a good.

694
01:22:34.199 --> 01:22:38.579
Week.

695
01:22:39.779 --> 01:22:44.970
Oh, I thought of something.

696
01:22:44.970 --> 01:22:49.590
P wave is, um, basically.

697
01:22:49.590 --> 01:22:54.689
Silence.

698
01:22:54.689 --> 01:22:58.199
Silence.

699
01:22:58.199 --> 01:23:03.479
D, wave is 1 of the major quantum computing things.

700
01:23:03.479 --> 01:23:10.829
Another seminar or 1 of them I maybe not and it's out of class time. I cannot require these sorts of things.

701
01:23:10.829 --> 01:23:18.180
But they're valuable for you to look at to see presentations by the major players.

702
01:23:18.180 --> 01:23:21.630
In parallel computing in quantum computing.

703
01:23:21.630 --> 01:23:27.090
So oh, and no homework today I'll give you a break, do it Thursday or something.

704
01:23:29.550 --> 01:23:34.770
Other than that no questions, then goodbye.