All right. And if you're wondering why I upload the videos to my server instead of to the RPI video/media service: my server is less hassle. Okay. I put up a homework, which is to implement the histogram thing on both multicore, with OpenMP or OpenACC, and also CUDA. Multicore is the Intel Xeon; many-core is the NVIDIA GPU.

Okay, so we're continuing on, looking at the NVIDIA Accelerated Computing Teaching Kit. And I see my value added, apart from pointing you to it, as selecting the parts that I think are worth presenting and going quickly. It's of uneven quality: it's got parts with a very low signal-to-noise ratio, and very short parts that are higher. So that's my value added.

What I did is I just unzipped everything to my local directory, and, what did I say, we're starting at module 7. Yes. We go here... maybe not... okay. And again, to remind you, the e-book: I don't think they have the whole book necessarily, but they've got some chapters of the book available for free, and it's very well written. And I also pick some homework questions out of it. Okay.

An interesting hardware issue, by the way: Linux is not perfectly supported by Lenovo on this ThinkPad. The left and right mouse buttons on the trackpad don't work, nor do the speakers. Okay. So, blah blah.

Histogramming is a nice example because it illustrates some issues with parallel computing. You all know what histogramming is: we have these bins, and we read in text and we count the frequencies. Okay, and I'm anticipating a little: what makes it different on a parallel computer is that you have these global counters of the frequencies, and you read in a letter and you update the corresponding counter. That update is a read-modify-write, so it has to be done atomically. Otherwise, if two different threads try to update the same counter, it will not get updated properly, and as the number of parallel threads grows, the probability of this happening increases.

And if you have this problem, then every time you run your program, if you're lucky, you'll get a different wrong answer. If you're not lucky, you'll get the same wrong answer every time. I'm being serious: if the answer is different each run, you suspect an issue. That's the thing with parallel computing; getting the same wrong answer every time, that's something.
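To make the race concrete, here is a minimal sketch of my own (not the teaching kit's code) of a naive CUDA histogram kernel where the increment is an unprotected read-modify-write; the names buffer, size, and histo are assumptions.

// Illustrative only: a naive histogram kernel with the read-modify-write race.
__global__ void histo_naive(const unsigned char *buffer, long size,
                            unsigned int *histo) {
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        // This ++ is really a load, an add, and a store.  Two threads that
        // see the same character can both load the old count, both add 1,
        // and both store, losing one increment.  The fix (atomicAdd) comes
        // later, in module 7.3.
        histo[buffer[i]]++;
    }
}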
That effect hit the first space shuttle flight, in fact. This is going back a few decades, but they had four primary computers that were IBM, and in critical moments of the flight, like before launch, they were supposed to be synchronously running the same thing. So that's a check against hardware issues. But then the NASA programmers, being paranoid people, also had a fifth computer, designed by a different contractor with a different operating system: a backup flight system. It was also supposed to synchronize with the primary avionics software system, and just before launch the backup system refused to sync; it disagreed with the primaries. I suppose the primaries were IBM and the backup was some other brand. But they did the responsible thing, and they scrubbed the flight until they figured it out. It turned out that the four primary computers together were wrong and the other vendor's backup was right. It was a synchronization glitch with odds of roughly 1 in 70: every so often the primaries would get the clock wrong. They had observed it during a pre-flight test, and they had logged it, because they log everything, but they logged it and left it for future analysis. And then it happened again, that roughly 1-in-70 chance, just before the flight. So the primaries: it's a parallel computer, the four primaries and then the backup, and they did it for increased reliability. Now, with the space shuttle, during non-critical parts of the flight, like when they're in orbit, the primary computers could do different things; they only synced up when it was critical.

NASA had other stories too; I quite like old NASA programmers, a lot of the time. They did a lot of interesting programming, and this is a little separate from parallel. For example, those little things running around Mars are running real-time operating systems, because they're getting real-time interrupts and they have to handle stuff. Actually, even when it's in flight to Mars, the computers are running at a low speed, collecting a small amount of data, and at one point they collected so much data that it overflowed the storage. So the people on the ground used some back door that they had, hacking in and clearing things out and rebooting it. Now, think about it: you've got a latency of what, 10 minutes, 15 minutes? And I don't know what the bandwidth of the link is, 8 kilobits a second or something. And no, it's not a GUI.
So some of you are complaining about using a command-line interface. Well, if you're ever debugging something on Mars over a link like that, you're not going to call it ugly; I don't think so. But they got it working. They also had things like a high-priority interrupt that would dominate the whole system because it fired more often than it should. So: beautiful examples of reliable programming.

Examples of read-only memory: for some of the read-only memory in the stuff they sent to Jupiter and so on, they would twist two wires around each other; a clockwise twist would be a 0 and a counterclockwise twist would be a 1, or something like that. You think core memory is not very dense? This is even less dense. Expensive, heavy, and not very dense; those are the disadvantages. So why would they do such a thing? Because you put this memory through the Van Allen belt and it survives really serious radiation. So: hardware appropriate to the task.

Okay, well, there was one programming error the space shuttle had that was announced. If it left orbit, or in the simulation if it left orbit, they pushed something onto a stack. And at one point in simulation they had to leave orbit, then come back, and then leave orbit a second time, and the stack overflowed, because who would have thought the space shuttle would leave orbit twice in one mission? Okay.

So, what we're doing here is that we have a long text, Programming Massively Parallel Processors, and we wish to histogram it in parallel. We wish to partition the input and assign different chunks of the input to different threads, and the threads will histogram it in parallel. This assumes, I guess, that the I/O time is less than the computation time. Your first question is: how do you partition the text among the processing threads?

This slide here shows the obvious solution. Suppose we have a gigabyte of text: the first 250 megabytes goes to thread 0, the next 250 megabytes goes to thread 1, and so on. A quarter, a quarter, a quarter, a quarter, and we do it like that. And by the way, here we've got 4 threads, but you could imagine 1,000 threads or 10,000 threads; remember that our GPU has 5,000 parallel threads, give or take.

Okay, so here's the problem. Oh, and just to make the diagram readable, we're bucketing the letters in groups.
It may happen here that two or three of the threads want to update the m-to-p counter simultaneously. Okay, that's iteration 1. And looking ahead a little: later we'll see a different way to assign the input text to the different threads, which will be better. Oops, let me see, I skipped something. Okay, so iteration 2, and we look at the second letter of each chunk. There will be a different version of this later; we're refining this first scheme. And two threads update this counter, one thread that one, one thread that one, and so on.

Okay, so here's the coalescing thing that was in module 6, which I mentioned last time. The way the DRAM hardware is designed, and it's designed that way for good hardware-design reasons which are described in the accompanying book chapter (it's divided into banks, and so on; I'm a software person), the effect is this: the big global memory on the GPU, the 48 gigabytes on ours, you read from that DRAM, from the global memory, in chunks of 128 bytes. If you want 1 byte, you have to read 128 and throw away, or ignore, the other 127. That's just the way the hardware is designed.

The implication: suppose each thread wants only 4 bytes, and you've got 32 threads working together synchronously in a warp. Notice how the different design features work together. I mean, at some point in the future NVIDIA is going to squander its advantage, but at the moment they're really brilliant people; the different pieces fit together. The global memory reads data in chunks of 128 bytes; 32 threads are in a warp and operate synchronously; so what you really want, if each thread wants a 4-byte word, is for the threads to want adjacent 4-byte words, so that all 32 threads together want 128 contiguous bytes of global memory. Because if you can do that, you get one 128-byte read from the global memory, and then each of the 32 threads pulls its own little 4-byte word out of that 128-byte chunk. You have minimized the I/O from the global memory and you have maximized the efficiency. That's the good way. It's called coalesced reading, because the 32 threads' 32 separate requests coalesce into one 128-byte read from the global memory; writes work the same way.
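To make the contrast concrete, here is a hedged sketch of my own contrasting the two access patterns for one warp; the kernels, array names, and the chunk parameter are illustrative, not the teaching kit's code.

// Coalesced: thread t reads element t, so consecutive threads of a warp hit
// consecutive 4-byte words and the warp's 32 loads become one 128-byte read.
__global__ void copy_coalesced(const float *data, float *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = data[t];
}

// Not coalesced: each thread starts its own big contiguous chunk (the
// "first quarter to thread 0" partitioning), so the 32 loads of a warp land
// in 32 different 128-byte segments and you pay for 32 transactions.
__global__ void copy_blocked(const float *data, float *out, int n, int chunk) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)t * chunk;            // thread t's chunk begins here
    if (j < n) out[t] = data[j];
}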
The problem is, if I go back a slide: here, what the different threads are reading from the global memory is not adjacent, and these reads do not coalesce. So I'll go back to this color slide. The colors show what each thread reads: the red thread, the gray thread, the green thread, the blue thread. At step 0, iteration 0, the red thread reads here, the gray thread there, the green thread there, and the blue thread there. So the 4 threads are reading widely separated chunks from the global memory, and this roughly quadruples the read traffic, the I/O, from the global memory.

Now, this is relevant because this bus, it's on the card for the GPU, it's a good fast bus, but it's still slower than something that's in the registers. So this poor access efficiency, I'm guessing, will be the rate-limiting problem for the whole program: they're getting data from the global memory and their reads are not coalesced.

There's another thing also. There are these caches hidden away in the system, and they're documented poorly. There's a cache between the global memory and the streaming multiprocessors, and it's the same hardware as the fast shared memory, I think, and the constant memory; it's this one bank of, give or take, 128 kilobytes (I could be wrong on that number) of very fast memory. It can be used as shared memory for the thread blocks, or it can be used just to cache reads, and it can also, I think, be used for the constant memory. So if you're coalescing stuff, then it goes through the cache; but an access pattern like this one is too big and won't fit in the cache. So if you coalesce stuff you get to use the cache, and you don't have to program that; it happens automatically. There's also a possibility, if you coalesce stuff, that the system might even read ahead; I don't know how sophisticated the GPU is there. NVIDIA has design documents on their website, but they don't necessarily highlight what they're doing for efficiency. You can tell they're doing something, though, because it works so well, in some ways.

Okay, so this slide is just writing down what I just said: if you give the first chunk of the data to the first thread, and so on, then the memory accesses are inefficient.
The bottom half of that slide is what I just told you: what you want is what's called interleaved partitioning. This here is the input text, byte 1, 2, 3, and so on, and the colors show which thread is accessing each byte. Here the threads are interleaved, so the first 4 bytes are processed by the 4 threads; adjacent thread numbers read adjacent data from the memory. Coalescing, that's the buzzword: they're coalescing. And here they show that the first 4 bytes get partitioned among the threads. Okay, so this reduces your total I/O from the global memory. We still have the stepping-on-your-own-toes problem with the counters; we'll get to that next. But this is the first new idea today: interleaving increases memory-access performance.

Okay, and in the next slide set we do that. Oh, by the way, you'll read NVIDIA documentation, and also documentation written by third-party people, and they'll give you various tricks and hacks to make your programs run faster on the GPU. Now, you always have the question: when is the hack worth using, and when should you ignore it? And this is something you have to do as software designers, because some of the tips that I've seen written up, (a) they may not be worth your time, and (b) the next generation of GPU may invalidate them. If you get too cute with your optimization, it won't help you with the next version of the chips in two years. That's something to think about. In fact, the next version of the chip may actually make your optimization run slower. The reason is that with NVIDIA, the cost is always real estate on the chip, and what NVIDIA does from generation to generation is change the allocation of the chip to the different functionality: how much for floating point, how much for double precision, how much for cache, whatever. When they change that allocation, they're going to invalidate things if you over-optimized. Just a note. But the interleaving thing, I think, is a fairly long-lasting idea that's worth doing in general. So that was that. Okay, there wasn't a lot in this slide set.

Okay, the next thing is the data race, or what I call stepping on your own toes: two threads update the same counter. I have an example, say, from a bank or something; I'll skip through.
You're booking something, say a seat in the theater (theaters are open again, or whatever), online on the web, and two people want to book the same seat, so the site holds the seat for 10 minutes or something while you put in your payment. Whatever; I have typed this up on the wiki, on the blog actually. The two threads in parallel read the old value, update it, and then in parallel write it back, so the value gets updated only once, not twice. So depending on how the two threads interleave internally: no guarantees. Okay, I'll skip through that; I've talked about it a lot.

If I'm going too fast, tell me. Oh, and I guess one more point: this sort of badly written program will behave differently on different GPUs. On a cheap GPU, the threads will run one after another (one thread has to finish before the second thread starts, for lack of resources); on an expensive GPU they'll run in parallel, because the resources are available. Okay.

So that slide set did not answer anything; all it did was present a question, and slide set 7.3 will answer the question. Yeah: atomic operations. To summarize the slides: you want the read-modify-write to be an atomic, non-interruptible, hardware-level instruction. That's what they provide, and then the threads can do it safely. Most of you have seen this.

Okay, so specifically, CUDA can do lots of different operations atomically, and they're implemented as one machine instruction on the GPU, in the core. Compare-and-swap is another one. Basically, they'll do these various things as one machine instruction, not interruptible. At the programming level, at the C/C++ level, inside the kernel, this is what the function call looks like: atomicAdd. You give it a pointer to an address, in global memory let's say, and a value; it adds the value into that location and returns the old value, and it's not interruptible. Besides add, there are various others. And in OpenMP you've got the atomic pragma, which will do the same thing. So you do something like this in CUDA and C, and it translates to one machine instruction, for an int, an unsigned, a 64-bit long long, and so on.
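Here is a hedged sketch of what a single atomicAdd call looks like and what it logically does; the counter array and the bump kernel are made-up illustrations, though atomicAdd itself is the real CUDA intrinsic.

// 'counter' is a made-up device array, 256 bins just for illustration.
__device__ unsigned int counter[256];

__global__ void bump(int bin) {
    // Atomically: old = counter[bin]; counter[bin] = old + 1; return old.
    // The whole read-modify-write is one uninterruptible hardware operation.
    unsigned int old = atomicAdd(&counter[bin], 1u);
    (void)old;   // the old value is returned but is often ignored
}

// The OpenMP analogue mentioned in the lecture is the atomic pragma, e.g.
//   #pragma omp atomic
//   hist[c]++;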
One point: in the C/C++ standard, the number of bits that these types take is not defined; it's implementation-dependent. And, in quotes, "half": this was an NVIDIA addition, the half-precision float for machine learning, because in machine learning the data is low precision and 2-byte floats are actually useful.

So, we're going back to the __global__ function for our text-histogramming thing. The concept is that each thread adds in another byte from the data and counts it up. The arguments are the obvious ones, but we're implementing the striding, so the stride is the spacing between consecutive elements handled by one thread, or something like that. And then we just add it in with the atomicAdd: we take the character... well, here we're just adding; I'm waving my hands and ignoring details. So the concept is that we're using an atomicAdd here, and we're doing it the right way. And i is the element we're working on: we're working on character number i, the character is buffer[i], and we want to increment histo at that character's bucket, so by atomicAdd we add 1 to it, because we saw that character.

And this here: we've got the concept that we have thread blocks, so we may have so many threads that they're distributed among several blocks. This is the thread number within the block, plus the block number of the current thread times the number of threads per block, so this gives a unique counter from 0 up to the number of threads. And now what's happening here: as I said, each pass through this loop does one character, but then we add stride, so the thread actually loops and keeps repeating. The thread doesn't just do one character; it does a number of characters, but separated by stride, so that adjacent threads are doing adjacent characters and this thread does every stride-th character. And stride is basically the total number of threads, so each iteration the thread moves to the next chunk up in the memory. So this is written to handle very large numbers of threads and even much longer arrays of data.
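Here is a sketch approximating the kernel being described, written from the description rather than copied from the slide; the names, the lowercase-only check, and the 7-bucket grouping of 4 letters each are my reconstruction of what the teaching kit does.

// Interleaved (strided) histogram kernel with global-memory atomics.
__global__ void histo_kernel(const unsigned char *buffer, long size,
                             unsigned int *histo) {
    // Unique thread index across the whole grid.
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    // Total number of threads in the grid: the distance between the
    // characters one thread handles, so adjacent threads read adjacent
    // bytes and the reads coalesce.
    long stride = (long)blockDim.x * gridDim.x;

    while (i < size) {
        int pos = buffer[i] - 'a';          // alphabet position, 0..25
        if (pos >= 0 && pos < 26)           // ignore non-letters
            atomicAdd(&histo[pos / 4], 1);  // 7 buckets of 4 letters each
        i += stride;                        // next character for this thread
    }
}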
Okay. And this part, there's nothing particularly interesting here: they're doing characters, and they're assuming the characters are ASCII-ish things, so they can convert from a character code to an integer from 0 on up. This is a sanity check: is it a printable character or not? And this here, because we're chunking the letters in groups of four, a to d, e to h, and so on, that's what the division by 4 does. And by adding stride up here, each thread is doing every stride-th character, and stride can be quite big.

Okay, so this also shows that, again, running on the GPU, you can have conditionals and so on. But as I said, the threads in a warp are synchronous, so if for some thread this Boolean is false, then the body of the loop is just idled for that thread; for the threads for which the conditional is true, the body of the while statement is executed. And ditto for whatever else is in here: if you've got this nested thing, you get thread divergence. Okay, that was about all for this one; this was 7.3.
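A small sketch of my own of the divergence point, with made-up arrays, just to picture what "idling" means inside a warp.

// Divergence in a warp, schematically.  All 32 threads of a warp share one
// instruction stream, so both sides of a branch get issued, and the lanes
// for which the condition is false simply sit idle during the other side.
__global__ void divergence_demo(const int *in, int *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    if (in[t] > 0)          // some lanes take this side...
        out[t] = in[t] * 2;
    else                    // ...the rest idle, then take this side
        out[t] = 0;
    // A while loop behaves the same way: lanes that finish early idle until
    // the slowest lane in the warp is done.
}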
So, the intellectual content of slide set 7.3 was to introduce the atomic operations; the intellectual content of 7.4 is to discuss their performance. Your global memory has a humongous latency but a tolerably fast bandwidth once you've waited out that latency. You can do your atomic operations on different types of memory: shared memory, or through the cache, which is the cache for the global memory. What they mean by a shared cache is that all the threads access it; it's the one cache for the global memory, so anything accessed in global memory is going to get cached in that same cache.

Okay, and then they talk about the thing where, when you do a read, you grab 128 bytes. This is why they do that: they've got these ports on the memory controller, and banks, and so on. The hardware people can look at this; it's the hardware reason that a read fetches 128 bytes at a shot. And there are different streaming multiprocessors, and of course they're all going at the memory at the same time, so you have a number of controllers and a number of ports per controller. The concept is that you've got these banks of memory, and you can read a word from each bank in parallel, which multiplies your I/O speed by 32, because all 32 ports, and banks, work in parallel. That's how you increase your speed.

The intellectual content here is the latency. You're doing an atomic operation on the global memory, and you've got this humongous latency there, and that latency is big, and you've got several of them in sequence. So, what we're leading into is that these atomic operations on the global memory will make your program correct, but they're going to slow it down. Again, this latency is 100 cycles or more, maybe 1,000, and it's the dominant factor. The thing with parallel programs is that your I/O is usually the limiting factor: usually with parallel computers you're sitting waiting for the data to get to you. They're I/O limited, and that's why we spend time talking about making the I/O more efficient.

It's also the reason that these tutorials on parallel computing love to use matrix multiplication as an example: matrix multiplication is one of the cases where you're potentially not I/O limited. With matrix multiplication you're processing order-n-squared data but doing order-n-cubed operations on it, so it's potentially compute limited if your I/O is done right. For example, multiplying two N-by-N matrices reads 2N-squared numbers but does about 2N-cubed arithmetic operations, so the compute-to-I/O ratio grows with N. That's why people like to use it. Usually the computation is linear in the size of the problem; matrix multiplication goes up super-linearly, as the data size to the three-halves power.

Okay, so here they're looking at the latency; in this case it could be 1,000 cycles, which is just awful, horrible. So you really want to do anything you can. I'll skip the other examples. Fermi is many generations old; they haven't updated this slide in a while. It went Fermi, Kepler, Maxwell, Pascal, and Ampere, I think; you can correct me, something like that. So it's like six generations, ten years ago, but in any case, the idea stays the same.

But these atomics can be shared, if you can... oops, back here. Okay, hardware improvements: again, if you put things into the shared memory, it has much less latency, but it's private to each block, right?
Okay, so that's the content there. The content of this module 7.4 was that atomic operations, especially to the global memory, have a humongous, horrible latency, so do everything you can to avoid them: coalesce things (though I don't know whether coalescing helps with atomics), and use shared memory, which we'll talk about next; it's private to the block. Okay.

Let me summarize what you're going to see here. This is the thing we first saw with OpenMP, where the example was a reduction, which is a really simple histogram: just adding up all the elements of an array. And you had the same issue, worse when you're just summing an array, because you always get clashes. So OpenMP and OpenACC introduced the reduction clause on the pragma, and what it means is that each separate thread gets a separate counter. You have a for loop that's summing up the array, so each thread sums into a private subtotal, and then at the end the several private subtotals are summed into the grand total. OpenMP does that automatically; the compiler does it and you don't have to worry about it. OpenACC also.

What we are seeing here is how that is implemented: the concept of each thread having a private subtotal, which then gets summed up at the end. And because of this, while you're working through the array there's no possibility of a clash, because those private subtotals are private; they're not shared by different threads. Yes, the final merge also happens in parallel, and that does require locking; that requires the atomic add. But it's a very short part of the code.

And, you know, this is an optimization question: you could have private totals per warp, per block, or whatever. My guess is it's not worth worrying about incredibly much, but you can decide at what level you want that. (A student asks about very large arrays: instead of adding to the front as each thread finishes an entry, do four elements at a time?) Yeah, exactly right. And some people like to go for broad, shallow trees; as an aside, I don't like trees, but that's a matter of argument. Okay.
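For reference, here is a minimal sketch of the OpenMP reduction the lecture refers to, using the sum-of-an-array example that was mentioned; the file layout and numbers are mine, and it assumes compilation with OpenMP enabled (e.g. -fopenmp).

// Each thread accumulates a private subtotal; the runtime merges the
// subtotals at the end, so no atomics are needed inside the loop.
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> a(1000000, 1.0f);
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < (long)a.size(); ++i)
        sum += a[i];                 // private per-thread subtotal
    std::printf("sum = %f\n", sum);  // subtotals merged here, once
    return 0;
}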
Okay, anticipating a little: probably you'd put the subtotals in registers, let's say. If you've got only, like, 200 different categories, buckets, you could perhaps put the subtotals in registers, and each thread might do a private addition, and then you'd start merging things in shared memory, perhaps. Okay, here they talk about shared memory: if it's private to the thread, then no atomics are needed, but then you go up to the next level, shared memory, and up another level, global memory, or something. Okay.

I just told you what's happening here. The first picture is everyone adding into the same totals; nothing interesting there. The thing on the left is like one total array, and here are all these separate copies that you then merge. I just summarized the intellectual content of this, so we can skip through it. The cost is creating these private copies, and the benefit is fewer slow atomics; they really like to say it can increase performance 10 times, because you really want to minimize the atomic locking. So, okay.

Yeah, and their partitioning is into thread blocks. Just to remind you, you can have 1,024 threads in a thread block (it's hardware dependent, but typically 1,024), and the shared memory is shared by all 1,024 threads in the thread block. If you've got more than 1,024 threads, you need multiple thread blocks, and the separate thread blocks do not talk to each other except via the global memory, so they do not synchronize, more or less. Inside the thread block you might have 32 warps of threads, and the different warps might be running separately, but they're accessing the same shared memory, adding in. For that to work you need the atomics to update the totals in the shared memory; but it being shared memory, it's faster; shared memory is fast. Okay, but again, to remind you, the shared memory might be 48 kilobytes for the whole block, let's say.

And one more thing: this 48 kilobytes, I think, is actually for all the blocks that are currently running on the streaming multiprocessor. So if one block doesn't need a lot of shared memory, more blocks can run in parallel, and there's an optimization there; otherwise the blocks get queued up.

Okay, so what are we doing here? This is a new thing here; it's inside the kernel. So again, this function is running on the device; __global__ means it's called from the host and runs on the device. These are the arguments. And this declaration here is new.
It says that this array will be allocated in shared memory, visible to all of the threads in the block. So this is a new language extension here. And what it means is private for each block: each different thread block will have its own version of this array, but all (up to) 1,024 threads in the block will have access to that block's version. And again, for the different blocks: there's this global pool of shared memory that the different blocks share, but it's chunked; each block has a piece that's private to that block.

So, what's new here? First we allocate this array, and then we want to zero it. Let's see... here we go. Allocating it doesn't zero it, so we're going to zero this private histogram in parallel. This kernel routine (it's called a kernel) is going to be doing something, but the first thing it does is clear out, zero, the histogram. Each thread zeroes one element of the histogram, and so that we don't run off the end of the histogram, we have a boundary check. And then we sync, because we don't know what order the threads run in (there are, say, 8 bins and many more threads), so we do a sync, and that synchronizes all the threads in the block, 1,024 of them perhaps. Okay. And "private" here means, to remind you, local to the block; different blocks have different copies. So you can initialize the array in parallel, but there are probably fewer elements in the array than there are threads, and we are required to synchronize at the end. Or required only if you want the answer to be right, of course; who knows what you want. Okay.

So now we're continuing on in the thread. Separate threads do separate elements, as before. We compute an index, a character number, from the thread ID and the block ID, and again we're going to start at i and do every stride-th character: i, i plus stride, i plus 2 times stride, and so on. So this is the coalesced concept: in successive iterations of this while loop over every stride-th character, adjacent threads will do adjacent characters. That's the reason for it. So we get the stride, and it's going to be quite large: blockDim is the number of threads per block, gridDim is the number of blocks in the grid, and the stride is their product.
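As a sketch of where blockDim.x and gridDim.x come from, here is a hypothetical host-side launch of the histo_kernel sketched earlier; the block and grid sizes are arbitrary illustrations and error checking is omitted.

// Host side: the launch configuration fixes blockDim.x and gridDim.x,
// and therefore the stride used inside the kernel.
void launch_histo(const unsigned char *d_buffer, long size,
                  unsigned int *d_histo) {
    int threadsPerBlock = 256;   // becomes blockDim.x in the kernel
    int blocks = 120;            // becomes gridDim.x in the kernel
    // stride = blockDim.x * gridDim.x = 256 * 120 = 30720, so each thread
    // handles roughly size / 30720 characters, spaced 30720 bytes apart.
    histo_kernel<<<blocks, threadsPerBlock>>>(d_buffer, size, d_histo);
}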
So then we loop, and again, the threads in a warp are synchronous, so for some threads this while loop will end before it does for other threads, in which case those threads just idle while the slower threads finish. That's okay. Then the atomicAdd: the private histogram is in the shared memory, and it works. So again, the point here is that each block has its private histogram in shared memory, which is fast; separate blocks fill separate private histograms, and later on we'll bring them together. We still have the atomic, but the point is that it's locking the shared memory, which is much faster than having to lock global memory. Okay, so that's the intellectual content here: have separate histograms, and then later we'll merge them.

To build the final histogram, we're continuing on in the same __global__ routine. We've computed all the private histograms; we sync. And how long the sync takes depends on your hardware: if you've got expensive hardware, all the threads are running in parallel; if you've got cheap hardware, the warps are running serially, in which case the sync could wait a while. And now, see, we have the private histogram here and we add it into the global histogram, which is in global memory. This one is going to be slow; this atomicAdd has the big latency, but you're not doing it very often. And only the first 7 or so threads are doing it. You know, I could almost be persuaded we have an error here; I would almost say it should be less-than-or-equal-to 7 rather than less-than, but I could be wrong; it depends how many bins there are.

So, if I recap what's happening here: this is a __global__ function, it runs on the device, and it's got three stages in it. The first stage is to zero the private histogram array, done in parallel. The second stage is to populate the private histogram array, done in parallel. And the third stage is to merge the private histogram arrays into the global histogram, done in parallel. Between the three stages we have syncthreads. So this shows how the reduction pragma in OpenMP, or OpenACC, is implemented.
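Putting the three stages together, here is a hedged sketch of the whole privatized kernel as just described; NUM_BINS, the 7-bucket letter scheme, and all the names are my assumptions rather than the slide's exact code.

#define NUM_BINS 7

__global__ void histo_private_kernel(const unsigned char *buffer, long size,
                                     unsigned int *histo) {
    // Stage 1: zero a per-block private copy of the bins in shared memory.
    __shared__ unsigned int histo_private[NUM_BINS];
    if (threadIdx.x < NUM_BINS)          // fewer bins than threads
        histo_private[threadIdx.x] = 0;
    __syncthreads();                     // everyone sees zeroed bins

    // Stage 2: interleaved accumulation into the block's private bins.
    // These shared-memory atomics are cheap compared to global atomics.
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    long stride = (long)blockDim.x * gridDim.x;
    while (i < size) {
        int pos = buffer[i] - 'a';
        if (pos >= 0 && pos < 26)
            atomicAdd(&histo_private[pos / 4], 1);
        i += stride;
    }
    __syncthreads();                     // this block's counting is done

    // Stage 3: merge into the global histogram.  Only NUM_BINS threads per
    // block pay the slow global-memory atomic, once each.
    if (threadIdx.x < NUM_BINS)
        atomicAdd(&histo[threadIdx.x], histo_private[threadIdx.x]);
}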
Questions? So, once you see how it's implemented, you'd probably never write it yourself in OpenACC or OpenMP, whichever; you'd use the reduction clause. But this again shows the implementation. Okay; the arguments, okay.

A more powerful idea: you can do this with any operator that's associative and commutative. People know what associative and commutative mean, I assume; yeah, you've got the backgrounds. Okay.

And again, a reminder about shared memory: everything has to fit. I've noticed that, for NVIDIA, as the years go on they do not increase the size of the shared memory on the chip. They increase the number of CUDA cores, for example, and the global memory, but they do not increase the shared memory, which tells me that however it's implemented, it is really expensive. They increase the clock speed, but not the size of the shared memory and not the number of registers. So however they're doing it, there are a lot of gates involved; who can tell. Okay.

You can spill over from shared memory into global memory, but you're going to pay a horrible penalty if you do. And again, if you define local scalar variables in your device function, they go, I believe, into registers by default, as long as registers are available. If you declare a lot of local variables in your function, it just invisibly spills over to global memory, and when you run the profiler it will tell you about it. So you don't have a wall, you've got a cliff, and you fall off it. Local arrays that you declare in the function go, by default, to global memory; well, it's a private piece of global memory, it's called local memory: it's in global memory but it's private to the thread.

And I actually found a compiler bug in NVCC a few years ago. I found that one particular size of local array, like 256 or something, threw the compiler into an infinite loop; the compiler never terminated. So, cool: I break software. It was only for that one size of array; if you were one element bigger or smaller, the compiler finished. So clearly they had an off-by-one somewhere. I posted it on a couple of NVIDIA and other blogs, and someone there reposted it as an error, and not that long after there was a minor release that fixed it. I've found a few compiler errors over the years.
Okay, so this slide set's content was how you implement the reduction, with private histograms. Good. If there are no real questions, we can move on.

So, what we're seeing are programming paradigms: programming techniques which are useful for parallel computing only; they don't help you with serial computing. We saw one, which is this private histogram. And this module here talks about another important operation, convolution, and how we do convolution efficiently on a parallel computer, well, in CUDA, for example. The concept is to do the convolution efficiently in parallel.

Most of you know what convolution is, I assume. Okay, good. It's a low-pass filter, for example: every output element is a weighted average of a sliding window of input elements. You know what that is; I'll skip through it.

Okay, so convolution has this mask, and what we're doing here is one-dimensional: each output element is a weighted sum of 5 input elements. The array N holds the input elements, P the output elements, and M is the mask. The mask is fixed, and we slide the mask over the input array. So, for example, here, if the mask is centered on the red input element, then these 5 input elements, indices 0 through 4, get weighted by the 5 elements in the mask (those are the 5 multiplications), then we sum all of those together and produce the one output element, and we slide the window along.

If I stop right here and you think about how you would do it efficiently in parallel: number one, the mask is read-only and it's used all the time. So the mask, for example, you would put, say, in constant memory, which in CUDA is fast; constant memory is very fast, but read-only, and it's implemented as something in the cache. If you didn't have constant memory, you'd stick it in shared memory, because it's read all the time and you want it to be fast. The next thing is that each element in the input array is read 5 times, so you're thinking you want to somehow avoid re-reading it. I mean, the input and output arrays are going to be in global memory ultimately, because they're big.
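Here is a hedged sketch of the basic 1-D convolution kernel being described: the mask lives in constant memory (read-only, cached, fast), each thread computes one output element, and out-of-range input elements are treated as zero. MAX_MASK_WIDTH and the exact names are assumptions of mine.

#define MAX_MASK_WIDTH 11
__constant__ float M[MAX_MASK_WIDTH];   // filled from the host with cudaMemcpyToSymbol

__global__ void conv1d_basic(const float *N, float *P, int mask_width, int width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width) return;

    float acc = 0.0f;
    int start = i - mask_width / 2;      // left edge of the sliding window
    for (int j = 0; j < mask_width; ++j) {
        int k = start + j;
        if (k >= 0 && k < width)         // boundary handling: pad with zeros
            acc += N[k] * M[j];
    }
    P[i] = acc;                          // one weighted sum per thread
}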
But you're reading each element of the input array, say, 5 times for a mask that's 5 long, and you don't want to be reading it repetitively, 5 times, out of the global memory. Now, I guess today the cache might do that for you, so you might not have to worry about it; but if you did want to worry about it, you'd want to do some explicit caching.

Just as an aside: even though the systems have caches, you can sometimes do better than the system cache. I'll give an example. With a Brazilian collaborator of mine, we do some work on visibility: computing visibility on terrain; you've got an observer, and what targets can you see from the observer, you run lines of sight out, and so on. We found an example where, if we explicitly cached chunks of the terrain, we could do better than the system virtual memory manager, because we could look ahead: we knew when we were finished with a chunk of the terrain, and we would swap it out of our cache, whereas the system virtual memory manager could not see into the future; it did not know that. So we actually (and this is a peer-reviewed published paper) beat the virtual memory manager in that example. So you can do that. Now, whether it's worth it is a different question, but it was cool.

Okay, so that's what's happening: I told you, you slide along the input array and you sum, reduce, and so on. Boundary conditions: I hate boundary conditions. You've got to handle them, but it's not interesting. You pad with zeros, or you change the weighting, or something. A lot of errors occur when people do boundary conditions wrong. I'm ignoring that; it's important but not interesting, if you know what I mean. I'm ignoring everything to do with boundary-condition handling.

In 2-D now, the interesting thing is that elements that are close in two dimensions may not be close when the array is linearized into one dimension: in row-major order a pixel's left and right neighbors are adjacent in memory, but its neighbors above and below are a whole row width away. Which you might want to worry about. That's why space-filling curves were invented: to try to reduce the average distance between adjacent elements when the 2-D array is linearized. NVIDIA uses some undocumented space-filling curve for the texture memory, I think, because texture memory, again, is read-only and read a lot, and so I think that's how NVIDIA stores textures; NVIDIA has special hardware for texture memory (it's graphics, after all).
589 00:55:59.849 --> 00:56:04.110 They have some sort of zigzag curve to 590 00:56:04.110 --> 00:56:08.010 linearize the texture memory; I think they sort of talk about it. 591 00:56:08.010 --> 00:56:11.730 In any case, so. 592 00:56:11.730 --> 00:56:16.320 You slide the 2D filter over the input array. 593 00:56:16.320 --> 00:56:21.239 Multiply the K by K elements, get this, um, that. 594 00:56:21.239 --> 00:56:25.650 Okay, boundary conditions: get them wrong and 595 00:56:25.650 --> 00:56:28.650 bad things happen, but I'm going to ignore that. 596 00:56:30.000 --> 00:56:36.269 Um. 597 00:56:36.269 --> 00:56:39.630 Just, except for the one thing: again, you're going to get thread divergence. 598 00:56:39.630 --> 00:56:44.369 Ignore that again, because for some threads in the warp the 599 00:56:44.369 --> 00:56:50.309 conditional will be true, for other threads the conditional will be false, and the ones for which it's false just idle. 600 00:56:50.309 --> 00:56:57.780 Generally all the threads will do the same number of iterations here, because the mask width is a constant. 601 00:56:57.780 --> 00:57:05.670 That's okay. And again, these device functions can have for loops and while loops in them and all that stuff. That's fine. 602 00:57:05.670 --> 00:57:09.900 If you do nothing else. 603 00:57:10.949 --> 00:57:15.900 Again, conditionals are fine inside the device function. 604 00:57:16.949 --> 00:57:21.210 And, okay, um. 605 00:57:21.210 --> 00:57:27.869 Maybe I skipped over a little too much here. 606 00:57:35.010 --> 00:57:39.780 Yeah, okay, well, this will get emphasized a touch more later. 607 00:57:39.780 --> 00:57:44.400 So, this thread is computing one output pixel. 608 00:57:44.400 --> 00:57:49.289 This is iterating over the adjacent pixels, this thing here 609 00:57:49.289 --> 00:57:57.989 is iterating over adjacent pixels, so we have j from 0 up to the mask width, and so on. 610 00:57:57.989 --> 00:58:02.429 So the one output pixel depends on this block of input pixels. 611 00:58:02.429 --> 00:58:05.550 And, um. 612 00:58:05.550 --> 00:58:08.940 So the i and j indices get linearized, and 613 00:58:08.940 --> 00:58:14.429 whatever, whatever; the relevant thing is that this loop is going over a number 614 00:58:14.429 --> 00:58:17.730 of input pixels repeatedly. 615 00:58:17.730 --> 00:58:21.239 So each input pixel gets read mask-width-squared times. 616 00:58:21.239 --> 00:58:24.869 We'll worry about that later. Okay. 617 00:58:26.429 --> 00:58:30.690 I pointed out that was module 618 00:58:33.719 --> 00:58:41.730 8.1. Questions? So again, this is a second paradigm 619 00:58:41.730 --> 00:58:45.360 for how to do some parallel computation. We saw 620 00:58:45.360 --> 00:58:51.000 histogramming; here we're working into convolution. 621 00:58:52.559 --> 00:58:56.219 8.1, 8.2. 622 00:58:59.309 --> 00:59:12.179 What's going to happen is we're going to partition the data into tiles, and each tile might be small enough that we can cache it 623 00:59:12.179 --> 00:59:16.079 into some fast memory, and we will reduce the latency. 624 00:59:18.480 --> 00:59:22.530 So, if you're... 625 00:59:22.530 --> 00:59:26.699 If I can pause on this slide and you think ahead now: 626 00:59:26.699 --> 00:59:34.349 so the tiles, maybe you want them to be as big as they can be, but small enough that everything fits into shared memory. 627 00:59:34.349 --> 00:59:38.340 You think about how many tiles have to be in shared memory together.
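(A minimal sketch of the straightforward 2D kernel just walked through: one thread per output pixel, a mask-width by mask-width loop over the neighbourhood, so each input pixel gets read mask-width-squared times; that is exactly what the tiling discussed next tries to fix. The names are mine, and this is one plausible rendering of the slide, not its exact code.)

#define MASK_WIDTH 5
__constant__ float d_M2[MASK_WIDTH][MASK_WIDTH];    // 2D mask in constant memory

__global__ void convolution_2d(const float *N, float *P, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;      // edge threads idle: thread divergence

    float sum = 0.0f;
    int startRow = row - MASK_WIDTH / 2;
    int startCol = col - MASK_WIDTH / 2;
    for (int i = 0; i < MASK_WIDTH; ++i) {           // every thread does the same
        for (int j = 0; j < MASK_WIDTH; ++j) {       // MASK_WIDTH * MASK_WIDTH iterations
            int r = startRow + i, c = startCol + j;
            if (r >= 0 && r < height && c >= 0 && c < width)   // zero-pad the boundary
                sum += N[r * width + c] * d_M2[i][j];
        }
    }
    P[row * width + col] = sum;                      // one output pixel per thread
}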
628 00:59:38.340 --> 00:59:43.829 That determines the tile size, and again you have a trade-off. 629 00:59:43.829 --> 00:59:51.480 Smaller tiles means more blocks, thread blocks, can run in parallel. Do you want to do that? I don't know. 630 00:59:51.480 --> 00:59:57.030 You're going to have boundary conditions that are even worse. 631 00:59:57.030 --> 01:00:01.260 Um, you're going to decide, 632 01:00:02.639 --> 01:00:06.269 well, input tiles versus output tiles, meaning: 633 01:00:06.269 --> 01:00:12.269 you can iterate through the input data repeatedly to compute one output pixel, 634 01:00:12.269 --> 01:00:18.480 or you can stick on one input pixel and iterate through all the output pixels that that input pixel goes into. 635 01:00:18.480 --> 01:00:21.570 And that's sort of what they're saying there, and then 636 01:00:21.570 --> 01:00:24.570 this would affect your tiling, so. 637 01:00:28.170 --> 01:00:33.599 With matrix multiplication it's the same thing: the conventional way, you iterate 638 01:00:34.980 --> 01:00:43.320 through the input, um, matrices to compute one output element. You can also 639 01:00:43.320 --> 01:00:47.099 iterate through the input elements and, for each input element, 640 01:00:47.099 --> 01:00:52.619 sum into all the output elements that it affects. Different ways to look at things. 641 01:00:53.670 --> 01:00:58.469 Okay, um. 642 01:00:58.469 --> 01:01:04.260 So we are running the sliding window down, and nothing new on this slide. Um. 643 01:01:06.059 --> 01:01:09.269 And this is the new point here, that, um, 644 01:01:11.639 --> 01:01:17.369 this might be a chunk of the input data. We put it in, we cache it in, the shared memory. 645 01:01:17.369 --> 01:01:21.059 Hello. 646 01:01:21.059 --> 01:01:25.739 And again, who knows, maybe the cache handler does it for you. 647 01:01:25.739 --> 01:01:29.010 I don't know. 648 01:01:29.010 --> 01:01:32.789 So nothing new there; the only new thing on this 649 01:01:32.789 --> 01:01:37.829 slide is it's now talking about caching stuff into the shared memory. 650 01:01:37.829 --> 01:01:44.460 So, and what they're saying again is one particular input element is used several times. So. 651 01:01:46.769 --> 01:01:52.409 Yeah, um. 652 01:01:52.409 --> 01:01:57.510 You could even imagine a sliding cache, actually. 653 01:01:57.510 --> 01:02:01.920 That maybe you've got these elements in your shared memory, 654 01:02:01.920 --> 01:02:07.679 so once you finish with element 2, you replace it with element 10, perhaps. 655 01:02:07.679 --> 01:02:17.460 When you finish with element 3, you replace it with element 11; in shared memory you can do something like that. It'd be really cool. So you're sliding this cache down 656 01:02:17.460 --> 01:02:21.239 the input memory. Really cool idea. 657 01:02:21.239 --> 01:02:24.809 Um, programming it would be fine. 658 01:02:25.889 --> 01:02:30.210 So, as I said, this is the access pattern. Here you see this, this: 659 01:02:30.210 --> 01:02:33.750 the heavy green box is what you've got in shared memory. 660 01:02:33.750 --> 01:02:39.750 You have a window that's sliding down; when, as I said, you don't need element 2 anymore, 661 01:02:39.750 --> 01:02:43.739 you replace it with 10, and you've got to keep track of it all. That would be cool. 662 01:02:45.389 --> 01:02:48.900 So, um. 663 01:02:50.429 --> 01:02:53.940 And again, what they're saying here on the output side: 664 01:02:53.940 --> 01:02:59.429 you've got to have some sort of a cache.
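(A sketch of the input-tiling idea in shared memory, using the same constant-memory mask assumed earlier. Each block loads its tile plus the halo cells it needs into shared memory once, and then every thread reads its whole window out of shared memory instead of re-reading global memory. TILE_SIZE is assumed to equal the block size; all names are mine.)

#define TILE_SIZE 256
#define MASK_WIDTH 5
__constant__ float d_M[MASK_WIDTH];

__global__ void convolution_1d_tiled(const float *N, float *P, int width) {
    __shared__ float tile[TILE_SIZE + MASK_WIDTH - 1];

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread's output
    int halo = MASK_WIDTH / 2;

    // Each thread loads its own element into the middle of the tile (zero-padded past the end).
    tile[threadIdx.x + halo] = (i < width) ? N[i] : 0.0f;

    // The first 'halo' threads also load the left and right halo cells.
    if (threadIdx.x < halo) {
        int left  = i - halo;
        int right = i + blockDim.x;
        tile[threadIdx.x] = (left >= 0) ? N[left] : 0.0f;
        tile[threadIdx.x + blockDim.x + halo] = (right < width) ? N[right] : 0.0f;
    }
    __syncthreads();                                 // whole tile is now in shared memory

    if (i < width) {
        float sum = 0.0f;
        for (int j = 0; j < MASK_WIDTH; ++j)         // every read below hits shared memory
            sum += tile[threadIdx.x + j] * d_M[j];
        P[i] = sum;
    }
}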
665 01:02:59.429 --> 01:03:08.670 For what you're doing with the output memory, and when you're writing, you write a tile at once. I don't know that it's so helpful, but you can imagine a cache of output elements that you're sliding. 666 01:03:10.800 --> 01:03:17.250 Okay, um, okay, what they're talking about here is interesting. 667 01:03:18.300 --> 01:03:23.909 Okay, so your global memory is accessed in chunks of, say, 128 bytes. 668 01:03:23.909 --> 01:03:28.920 So if you want to write one word of the output memory, it still has to 669 01:03:28.920 --> 01:03:32.519 effectively update the 128 bytes. 670 01:03:32.519 --> 01:03:38.849 And so what they're saying here is: have a local chunk of your global memory, 671 01:03:38.849 --> 01:03:45.510 and you're updating elements, and once you're finished, maybe you write that local chunk back to global memory as one operation. 672 01:03:45.510 --> 01:03:49.320 So you've reduced your latency, so. 673 01:03:49.320 --> 01:03:53.250 So you've got this tile, so. 674 01:03:53.250 --> 01:03:57.030 You split the output array into tiles. 675 01:03:57.030 --> 01:04:01.559 And because you're computing elements of the output array in a predictable way, 676 01:04:01.559 --> 01:04:05.369 you create the whole tile locally 677 01:04:05.369 --> 01:04:09.690 and then you send it back to the global memory, and 678 01:04:09.690 --> 01:04:15.119 it's much more efficient than sending each element of the tile to the global memory one by one by one, 679 01:04:15.119 --> 01:04:19.920 because of this horrible latency on the global memory. 680 01:04:19.920 --> 01:04:26.969 Okay, output tile. So, and then they make these tiles correspond to the thread blocks. 681 01:04:26.969 --> 01:04:30.329 So. 682 01:04:30.329 --> 01:04:35.519 And again, the size depends on all the usual suspects. 683 01:04:35.519 --> 01:04:40.469 So, um. 684 01:04:40.469 --> 01:04:43.650 So that's, um. 685 01:04:45.239 --> 01:04:51.420 An input tile, same thing. So, um. 686 01:04:51.420 --> 01:04:56.159 And as I mentioned, the input tile could slide down the array, but the tiles probably have 687 01:04:56.159 --> 01:05:01.409 a fixed position you calculate, right? So. 688 01:05:01.409 --> 01:05:08.880 Okay, and. 689 01:05:08.880 --> 01:05:12.420 So, what they're talking about here, 690 01:05:12.420 --> 01:05:18.269 as I mentioned, is analogous to matrix multiplication. The... 691 01:05:19.800 --> 01:05:23.849 You've got your thread block, um. 692 01:05:24.869 --> 01:05:32.010 Well, you can decide how many threads are in a thread block. Okay, the max is 1024; it could be a lot less. 693 01:05:32.010 --> 01:05:35.429 And they're just giving different choices for 694 01:05:35.429 --> 01:05:38.880 how you size the thread blocks. 695 01:05:38.880 --> 01:05:47.369 So, and now you can, you can read it, basically. 696 01:05:47.369 --> 01:05:50.670 Some cores are idle some of the time. 697 01:05:51.809 --> 01:05:55.289 And, okay, so. 698 01:05:56.730 --> 01:06:04.320 Um, the issue is reading stuff multiple times. So. 699 01:06:04.320 --> 01:06:09.389 I may rerun it again Thursday, but I'm going to overlap this, I think, 700 01:06:09.389 --> 01:06:14.670 this module, with Thursday; I'll just do it preliminarily here. 701 01:06:14.670 --> 01:06:19.110 And, um. 702 01:06:19.110 --> 01:06:24.030 Yeah, the thread is reading a window, writing a window, and 703 01:06:24.030 --> 01:06:31.380 yeah, it's probably getting late enough. I'll finish this thing off; I'm sort of running low.
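(To make the block-size choice above concrete, here is a hedged sketch of a launch configuration for the 2D kernel sketched earlier. The 16 by 16 figure is just an illustration of one of many legal choices, anything up to the 1024-thread-per-block limit; d_N and d_P are assumed device pointers, not names from the slides.)

// One output tile per thread block; adjacent threads in x write adjacent output
// elements, so the 128-byte global-memory transactions are used fully.
dim3 block(16, 16);                               // 256 threads; 32x32 = 1024 is the maximum
dim3 grid((width  + block.x - 1) / block.x,       // enough blocks to cover the whole output
          (height + block.y - 1) / block.y);
convolution_2d<<<grid, block>>>(d_N, d_P, width, height);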
704 01:06:31.380 --> 01:06:39.210 I'll finish this thing off on Thursday and you can read it, but your design question is setting the various block sizes. So you can... 705 01:06:39.210 --> 01:06:43.860 So, that's giving you the managerial view of it. 706 01:06:43.860 --> 01:06:48.780 And our goal is to use shared memory to reduce global memory accesses, 707 01:06:48.780 --> 01:06:55.349 by however many times each element is used. Okay. Um, both slides are doing the 708 01:06:55.349 --> 01:07:02.460 boundary conditions now, and, what's that... So I'll continue on with this one on Thursday. Um. 709 01:07:02.460 --> 01:07:07.739 So, to review what we did today: 710 01:07:07.739 --> 01:07:13.199 well, we saw the hardware side of these atomic 711 01:07:13.199 --> 01:07:19.469 operations. They do a read-modify-write, typically; it could also be a compare-and-swap 712 01:07:19.469 --> 01:07:27.480 or an atomic add. They're one machine instruction that cannot be interrupted by another thread; they run to completion 713 01:07:27.480 --> 01:07:33.840 down at the machine level. At the CUDA level they're implemented by these function calls that we saw. 714 01:07:33.840 --> 01:07:39.210 And the first example of why they're used is updating a histogram 715 01:07:39.210 --> 01:07:44.699 in parallel; this is a common operation, variants of the histogram. 716 01:07:44.699 --> 01:07:49.289 So we saw how that could be implemented in CUDA, and it requires these atomic operations. 717 01:07:49.289 --> 01:07:54.329 And then, call it a paradigm perhaps, the second paradigm 718 01:07:54.329 --> 01:07:59.460 is convolution, and we're in the middle of seeing how you do a convolution in parallel. 719 01:07:59.460 --> 01:08:03.869 And the goal... with the histogram, the problem was 720 01:08:03.869 --> 01:08:08.909 we needed these atomic updates; the issue with the convolution 721 01:08:08.909 --> 01:08:20.909 is that we wish to minimize accesses to the global memory, because they have a very large latency, and we minimize them by caching data explicitly in the small, fast shared memory. 722 01:08:20.909 --> 01:08:24.869 And we do it explicitly, and 723 01:08:24.869 --> 01:08:29.220 again, once you've done it once you'd call a library, but we are seeing how things work through it. 724 01:08:30.840 --> 01:08:35.460 And I put a homework up to play with some of this stuff too, as I mentioned, and that's 725 01:08:35.460 --> 01:08:38.609 enough new stuff for today, so 726 01:08:38.609 --> 01:08:42.090 if you have any questions, then... 727 01:08:43.500 --> 01:08:49.979 Let's see. 728 01:08:52.409 --> 01:08:55.710 And... 729 01:08:55.710 --> 01:08:59.250 Hmm, okay. 730 01:08:59.250 --> 01:09:02.670 Hello. 731 01:09:02.670 --> 01:09:06.047 Cool.
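(Since the wrap-up above mentions both atomic adds and the private-histogram reduction, here is a hedged sketch tying them together: each block accumulates into a private shared-memory histogram with cheap shared-memory atomics, then merges into the global bins with one global atomic per bin per block. NUM_BINS and all the names are mine, not the course's code.)

#define NUM_BINS 256

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int priv[NUM_BINS];           // this block's private histogram

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        priv[b] = 0;                                  // zero the private copy cooperatively
    __syncthreads();

    // Grid-stride loop; atomicAdd is the uninterruptible read-modify-write.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&priv[data[i]], 1u);
    __syncthreads();

    // Merge the private counts into the global histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], priv[b]);
}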