WEBVTT 1 00:04:52.019 --> 00:04:56.399 Silence. 2 00:05:11.939 --> 00:05:20.069 Silence. 3 00:05:23.459 --> 00:05:30.209 Silence. 4 00:05:35.548 --> 00:05:39.389 Silence. 5 00:05:42.389 --> 00:06:06.959 Silence. 6 00:06:28.978 --> 00:07:06.478 Site 7 00:07:07.223 --> 00:07:07.553 right? 8 00:07:11.639 --> 00:07:19.108 Okay, good afternoon, parallel computing class. 9 00:07:19.108 --> 00:07:26.939 Can anyone hear me first? Because I'm not completely certain I've got the audio working. 10 00:07:26.939 --> 00:07:30.749 Yeah, thank you, Connor. Great. 11 00:07:30.749 --> 00:07:36.928 So, where we are: 12 00:07:36.928 --> 00:07:40.499 class 13, and it's March 13 00:07:40.499 --> 00:07:48.749 11, 2021, and what's on tap for today is 14 00:07:48.749 --> 00:08:00.718 parallel computing. We're still in the NVIDIA course notes, and we've graduated from specifically NVIDIA stuff on 15 00:08:00.718 --> 00:08:13.319 to general parallel computing paradigms, that is, styles of programming which will make your parallel programs more efficient. They're generally useful paradigms; they're not restricted to NVIDIA. 16 00:08:13.319 --> 00:08:19.619 But first, a couple of general notes. I'll put them in the blurb for Monday. Various 17 00:08:19.619 --> 00:08:20.783 parallel companies, 18 00:08:20.783 --> 00:08:29.184 quantum computing companies like D-Wave, have online tutorials, and if anyone's interested I'll put the blurb up on the website. 19 00:08:29.483 --> 00:08:35.394 This is a way for you to learn current topics outside of the course. I can't make them officially part of the course; 20 00:08:35.394 --> 00:08:48.024 they're not in class time. But if anyone's interested, I'll put some blurbs up. D-Wave has seminars from time to time, for example. D-Wave's approach is one of the three major quantum computing paradigms. 21 00:08:48.293 --> 00:08:52.283 We'll get to it later in the class, after this module. 22 00:08:52.734 --> 00:08:59.634 And D-Wave sells quantum computers that do what's called quantum annealing. 23 00:09:00.053 --> 00:09:08.602 So you have a function to be optimized, and it will find the optimum for you, doing it in parallel on your quantum computer. 24 00:09:08.849 --> 00:09:11.999 And, oh, another 25 00:09:11.999 --> 00:09:20.428 thing with the current NVIDIA architecture, the latest NVIDIA: they've gotten away from the idea of a core. They talk about streaming multiprocessors. 26 00:09:20.428 --> 00:09:26.788 And within the rest of the course, they don't call them CUDA cores anymore. 27 00:09:26.788 --> 00:09:33.418 Yes, I'm calling on you. What's your question? I'll enable your microphone if you'd like, actually. 28 00:09:33.418 --> 00:09:41.668 Yeah, of course. It was just regarding homework 5: where are we supposed to actually submit anything for that? 29 00:09:43.349 --> 00:09:46.918 I can't remember, what is homework 5? 30 00:09:49.469 --> 00:10:04.229 Well, it would be, I mean, I'm grading, I'm going to grade easily, but the idea would be to 31 00:10:04.229 --> 00:10:09.418 submit a PDF with your report on what you... okay. 32 00:10:09.418 --> 00:10:13.889 Okay, yeah, I was just curious because of the posted due date. 33 00:10:13.889 --> 00:10:17.639 It's supposed to be due today. 34 00:10:17.639 --> 00:10:21.479 Yeah, I'll extend the due date. Remind me, so. 35 00:10:21.479 --> 00:10:36.208 Wonderful, thank you. Yeah, you're welcome. There are so few of you; for a small course like this I can be more lenient, and I'm figuring you learn what you want to learn.
I'm presenting you with things. If you'd like to learn them, fine. If you would not like to learn them, well, 36 00:10:36.208 --> 00:10:41.009 we still have your tuition. 37 00:10:41.009 --> 00:10:46.889 I hope you'd like to learn it. Okay, other questions? Okay. 38 00:10:46.889 --> 00:10:56.489 So, yes, I'll post a link for Monday about NVIDIA's... well, they have a nice 39 00:10:56.489 --> 00:11:06.774 blurb on their developer website about how they're changing things from generation to generation and so on. And their current terminology is approaching more Intel's. 40 00:11:06.774 --> 00:11:20.783 Actually, they have their streaming multiprocessors, and they're sort of analogous to Intel cores, and each streaming multiprocessor might have 32 floating point units and 32 integer units, and 41 00:11:22.198 --> 00:11:31.438 maybe 32 to 64 floating point units, 32 double precision floating point units perhaps, and 32 instruction dispatchers. 42 00:11:31.438 --> 00:11:37.318 And each instruction dispatcher would dispatch for a full thread... 43 00:11:37.318 --> 00:11:51.808 for the full 32, a full warp of threads, I'm sorry. And so they've downplayed the term CUDA core now, which is interesting. I mean, the reason is that the streaming multiprocessor, it's got 44 00:11:51.808 --> 00:11:54.989 a bank of 45 00:11:55.703 --> 00:12:06.894 warps waiting to run, and they're waiting because they need some resource: this might need floating point units, that might need integer units, and so on. 46 00:12:07.283 --> 00:12:15.293 So, as these processing units become available and there are threads that need processing, it just assigns them, dispatches instructions, and so on. 47 00:12:15.599 --> 00:12:25.589 So, it's interesting to watch their architecture migrating from year to year, and it's instructive: you can look at that and think about why they're doing it. 48 00:12:25.589 --> 00:12:29.818 Um, another point NVIDIA makes is that, 49 00:12:29.818 --> 00:12:39.028 because they've got warps that need resources, floating point, double precision, integer or whatever, and they've got resources, this is the efficient way to do 50 00:12:39.028 --> 00:12:47.278 this, even just within 1 thread block. And then, of course, the multiple thread blocks could be running in parallel on multiple streaming multiprocessors, 51 00:12:47.278 --> 00:12:55.408 if they're available. And NVIDIA makes the point that you use their hardware more efficiently when you've got actually many more 52 00:12:55.408 --> 00:13:05.879 warps waiting to run than you've got resources, because what you want to have is a lot of threads, actually a lot of warps, that want to be executed. 53 00:13:05.879 --> 00:13:14.399 And this way, you'll always have something needing execution whenever some hardware resource becomes available to execute it. And so 54 00:13:14.399 --> 00:13:22.408 their model works better when you've got thousands of threads, and not just 1000, maybe several thousand. I mean, 55 00:13:22.408 --> 00:13:26.009 the GPU can run a thousand threads... well, 1, 56 00:13:26.009 --> 00:13:30.778 so 1 thread block can run a thousand threads at a time, 57 00:13:30.778 --> 00:13:34.048 and then the whole machine can run maybe 58 00:13:34.048 --> 00:13:43.168 4,000 threads at a time, depending. So, if it can run up to 4,000 threads, that would suggest maybe you want 10,000 threads 59 00:13:43.168 --> 00:13:48.538 waiting, trying to execute, because then there will always be something waiting to execute when
60 00:13:48.538 --> 00:13:52.379 a resource becomes available. So, 61 00:13:52.379 --> 00:13:56.519 um, and again, because there's zero overhead... the idea where 62 00:13:56.519 --> 00:13:59.578 it doesn't take... 63 00:13:59.578 --> 00:14:03.178 you know, the scheduling is... I 64 00:14:03.178 --> 00:14:17.333 don't know enough about how it's implemented, but it's implemented so that it's fast. You don't have a lot of context swapping time; context swapping is free or something. That's why I'm guessing it's using asynchronous logic. Okay. So, learning about 65 00:14:18.538 --> 00:14:22.739 scanning and so on. 66 00:14:22.739 --> 00:14:29.099 And again, I've got a 2nd laptop 67 00:14:29.099 --> 00:14:33.538 to my side here, which is showing the chat window, and every so often I look over at it 68 00:14:33.538 --> 00:14:37.889 and, um, 69 00:14:39.688 --> 00:14:45.629 can see what's happening. Okay. Um. 70 00:14:45.629 --> 00:14:48.629 So, okay, so what we saw last time 71 00:14:48.629 --> 00:14:52.019 is a new, 72 00:14:52.019 --> 00:14:56.668 basically a new style of programming, a new 73 00:14:56.668 --> 00:15:01.379 paradigm, called a scan algorithm. And 74 00:15:01.379 --> 00:15:15.028 the scan does a series of parallel reductions. So the scan input here, for example, is this array of 8 elements, 3 0 7 0 4 1 6 3, and the output: 75 00:15:15.028 --> 00:15:19.918 the i-th output element is the sum of the first i input elements. 76 00:15:19.918 --> 00:15:27.688 So, 3, the first output is 3, then the next is the sum of 3 and 0, and so on. This is a partial stage here. 77 00:15:27.688 --> 00:15:31.288 So this contains the reduction. 78 00:15:31.288 --> 00:15:37.019 So, the k-th output is the reduction, the sum, of the first k inputs. 79 00:15:37.019 --> 00:15:45.538 Okay, interesting idea. Why do we spend time on it? It turns out to be a tool 80 00:15:45.538 --> 00:15:51.568 for a surprisingly wide variety of parallel algorithms, for doing them efficiently. 81 00:15:51.568 --> 00:15:58.649 Just like, on a sequential machine, sorting can be used for a lot of different things. And 82 00:15:58.649 --> 00:16:13.589 well, the obvious one is run-length decoding. For example, if the input is a list of run lengths, then the output will be where each run starts in the output vector. So that would be called a dope vector, actually: 83 00:16:13.589 --> 00:16:17.158 a list of the base points, the starts, is called a dope vector. 84 00:16:17.158 --> 00:16:22.889 That's just 1 example; it's used for a lot of other things. It's used for, um, actually for 85 00:16:22.889 --> 00:16:27.418 bucket sorting, frequency counts. 86 00:16:27.418 --> 00:16:35.188 So, frequency counts: we saw that quick example 2 days ago, where we didn't have that many buckets. 87 00:16:35.188 --> 00:16:40.318 You can use this idea for when there are very many output buckets. Okay. 88 00:16:40.318 --> 00:16:52.528 So we want to do this fast, and what we saw last time is a way to do it in parallel, stride by stride by stride, and it's sort of counterintuitive. So, in the 1st stride, 89 00:16:52.528 --> 00:16:56.879 each output element becomes the sum of the 90 00:16:57.203 --> 00:16:58.134 2 adjacent elements, 91 00:16:58.134 --> 00:16:59.484 but in the 2nd stride, 92 00:16:59.724 --> 00:17:12.683 it's sums of elements that are 2 apart; the stride lengths are going up by powers of 2. At each stride, each output element is the sum of the same output element added with the one a stride to the left.
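(A minimal sketch of that stride-by-stride scan, assuming a single thread block small enough that every element has its own thread; the kernel and variable names are illustrative, not code from the course notes:)

    // Naive inclusive scan: strides of 1, 2, 4, ...
    // Each thread adds in the element one stride to its left; off the left
    // edge it effectively adds 0, which handles the boundary case.
    __global__ void naiveScan(float *x, int n) {
        extern __shared__ float buf[];      // block-local copy in shared memory
        int i = threadIdx.x;
        if (i < n) buf[i] = x[i];
        __syncthreads();
        for (int stride = 1; stride < n; stride *= 2) {
            float t = 0.0f;
            if (i >= stride && i < n) t = buf[i - stride];
            __syncthreads();                // everyone reads before anyone writes
            if (i < n) buf[i] += t;
            __syncthreads();                // this stride's writes visible to the next
        }
        if (i < n) x[i] = buf[i];
    }
    // e.g. naiveScan<<<1, 1024, 1024 * sizeof(float)>>>(d_x, n);

(Log n strides, but every stride keeps nearly all n threads adding; that n log n total work is what the work-efficient version later in the lecture improves on.)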
93 00:17:12.989 --> 00:17:18.088 And if you do it right, you can add in place. It may take a little thinking. And 94 00:17:18.088 --> 00:17:25.919 so it takes log n strides, okay, which is nice. This takes log n steps, and each step takes 95 00:17:25.919 --> 00:17:30.118 constant time if you've got enough threads, so the whole thing takes log n time. 96 00:17:30.118 --> 00:17:33.659 Um, now, 97 00:17:33.659 --> 00:17:43.709 there are some tricks you can use to make it faster. One problem with this... oh, one thing also here: when you're adding, when I say add each element to the one a stride to the left, if 98 00:17:43.709 --> 00:17:46.709 the one to the left would go off the start of the array, then you just 99 00:17:46.709 --> 00:17:51.449 add 0. You have to handle the boundary cases. 100 00:17:51.449 --> 00:17:56.308 And as I've said, I've had programs where the boundary cases are one half of all my lines of code. 101 00:17:56.308 --> 00:18:01.048 Okay, not fun, but necessary. 102 00:18:01.048 --> 00:18:10.588 So, and you see, you've got issues here: you're not adding elements which are adjacent. So the questions are whether this plays nicely with the cache, 103 00:18:10.588 --> 00:18:13.949 and when you're writing. 104 00:18:13.949 --> 00:18:28.169 So you want to use the shared memory if possible. One way to think of shared memory is as a level 2 cache. So you have your global memory: it's big and it's high latency. Okay, so we've got 48 gigabytes of 105 00:18:28.169 --> 00:18:32.189 global memory on the machine, on 106 00:18:32.189 --> 00:18:40.644 the GPU on parallel. So, 48 gigabytes, and the latency to read something from it might be a couple of hundred cycles, but it's reading 128 bytes. 107 00:18:40.644 --> 00:18:47.544 So it goes into a cache. The cache is, I can't remember, several megabytes, and it's chunked up into 128-byte things. 108 00:18:50.159 --> 00:18:56.969 So, again, if you have to read 128 bytes, it'd be nice if all 128 bytes were actually useful, 109 00:18:56.969 --> 00:19:01.138 which is why adjacent threads want to be reading adjacent addresses in the global memory. 110 00:19:01.138 --> 00:19:14.519 Okay, so that's a big... you could call that the level 1 cache, at some megabytes; I'll give you a link to a developer paper on this. And that cache is visible to everything on the GPU. 111 00:19:14.519 --> 00:19:27.179 So, you read it into the cache, anyone can use it, which is another thing to tie into the constant memory cache, for example, also. So the constant memory is like a read-only cache: you get something into it and everyone can read it. 112 00:19:27.179 --> 00:19:35.903 Okay, that's the level 1 cache. Now inside each thread block, you could imagine there's a level 2 cache. That's the same hardware as the shared memory. 113 00:19:36.084 --> 00:19:48.473 In fact, in the current NVIDIA architecture you have something like 128 K bytes, and you can say how much is explicit shared memory and how much is the implicit level 2 cache. So you read something into the level 2 cache; it's got a smaller chunk size, 32 bytes 114 00:19:48.473 --> 00:19:52.344 I think, and it's visible to all the threads in that thread block. 115 00:19:54.989 --> 00:20:05.939 So: level 1 cache, visible to everyone; level 2 cache, visible inside 1 thread block. And each thread block has a separate level 2 cache, and it's fast to read and write.
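(A tiny sketch of what playing nicely means here, with a made-up kernel, not from the slides: adjacent threads read adjacent global addresses, so each 128-byte line fetched is fully used, and the block then works out of its fast shared memory:)

    __global__ void stage(const float *g, float *out, int n) {
        __shared__ float s[256];            // the explicitly controlled per-block memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes blockDim.x == 256
        if (i < n) s[threadIdx.x] = g[i];   // thread k reads address k: coalesced
        __syncthreads();                    // all loads land before anyone uses s[]
        if (i < n) out[i] = 2.0f * s[threadIdx.x];  // stand-in for real work on s[]
    }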
116 00:20:05.939 --> 00:20:20.699 And again, it's the same hardware, the same shared memory and level 2 cache: the shared memory you control explicitly, the level 2 cache is controlled implicitly by the cache manager. Okay, so you want to have your program 117 00:20:20.699 --> 00:20:30.384 play nicely with the caches. If it plays nicely, your program runs faster in real time. And again, the metric here is not CPU time. 118 00:20:30.594 --> 00:20:44.574 The performance metric is wall clock real time, because, see, CPU time is not so meaningful. Well, again, if you have a core that would be idle... if it's idle, if it's spinning its wheels waiting for something, you don't actually care. 119 00:20:45.294 --> 00:20:47.784 Well, if it's spinning its wheels it's using some power, but... 120 00:20:48.358 --> 00:20:58.499 and on a supercomputer you'd actually care about the power, but not in this course. So it's wall clock time which you want to minimize. 121 00:20:58.794 --> 00:21:11.483 Okay, here also we're mentioning: you have your separate levels, you've got to synchronize. Let me go back a page. Okay, so here we've got only 8 threads, no trouble. Suppose you had a thousand threads here. 122 00:21:11.723 --> 00:21:23.993 There's no guarantee that those thousand threads all write at the same time. Again, because you've got limited resources available, each warp runs all at the same time, but the multiple warps, they could run simultaneously or they could run sequentially, depending on what's available. 123 00:21:24.269 --> 00:21:33.088 So, after you do each stride, you have to synchronize, to make sure the data that stride writes is available, because the next stride will read it. 124 00:21:33.088 --> 00:21:36.959 Okay, they talk about that here. 125 00:21:36.959 --> 00:21:40.318 Okay. 126 00:21:40.318 --> 00:21:45.148 Code, I'll skip. Lots of sync threads. Um. 127 00:21:45.148 --> 00:21:51.509 Work efficiency. And, 128 00:21:52.943 --> 00:22:07.074 well, the work efficiency is: are you using the, I'll call them CUDA cores here, efficiently? Because again, if there's a CUDA core that's running something that doesn't have to run, well, something that does have to run is not going to run until it has resources available. 129 00:22:07.314 --> 00:22:12.653 So, in spite of what I said a minute ago, this is a reason to, 130 00:22:14.699 --> 00:22:24.269 you know, be efficient with the executing cores, because they may slow down, in wall clock time, the things waiting to execute. So, they're talking about some of that here. 131 00:22:24.269 --> 00:22:29.699 Okay, now the implication is that if some cores are 132 00:22:29.699 --> 00:22:35.608 idle and some cores are executing, you want to pack all the executing ones into the smallest number of warps. 133 00:22:35.608 --> 00:22:41.759 Okay, not an awful lot in that slide set, but some new stuff. 134 00:22:41.759 --> 00:22:46.318 Excuse me. 135 00:22:56.578 --> 00:23:10.499 Okay, so this is going to be a new way to traverse the tree, reducing control divergence. Reducing control divergence means that we're packing the 136 00:23:10.499 --> 00:23:14.489 threads that want to execute into a small number of warps. 137 00:23:14.489 --> 00:23:19.858 And this means things are going to be a little more complicated. Okay. 138 00:23:22.528 --> 00:23:25.769 And I've got a concept here: 139 00:23:25.769 --> 00:23:33.209 if you look at each output number at the bottom, over the whole computation we've got, like, a binary tree going up,
140 00:23:33.209 --> 00:23:41.489 up to the root. We're adding numbers 2 by 2: adjacent numbers, then we're adding numbers that are 2 apart, then adding numbers 4 apart, and so on, 141 00:23:41.489 --> 00:23:45.989 up to n over 2 apart. So conceptually we have a binary tree here. 142 00:23:46.584 --> 00:24:01.314 And what we're doing is, we start with building partial sums, sums of 2 elements, then sums of 2 partial sums, excuse me, sums of 4 elements, and we're working our way down with bigger and bigger partial sums. 143 00:24:01.558 --> 00:24:05.999 Okay, that's what they're talking about here. 144 00:24:05.999 --> 00:24:12.719 Um, and it's sort of showing what happens here. 145 00:24:14.338 --> 00:24:25.048 Time is going down the page here; the thread number, the ID, is going across the page. So in the 1st step, we add... 146 00:24:25.048 --> 00:24:39.388 they're showing something slightly differently. We're adding each element to the number to its left, so X0 gets added into X1, X2 gets added into X3, and so on. In the next stage, we're adding each element to the one 2 to its left, 147 00:24:39.388 --> 00:24:44.909 and then the one 4 to its left. This would be for a simple reduction here, not a full scan. Okay. 148 00:24:46.048 --> 00:24:52.378 And it has log n steps, and at the end of it we've summed all 8 elements: the reduction phase. 149 00:24:52.378 --> 00:24:55.528 We're working our way up to the scan. 150 00:24:55.528 --> 00:25:00.689 Okay, ignore the code for now. 151 00:25:01.949 --> 00:25:13.499 And what we're going to be doing, we're doing more, the executive summary: we're doing more additions here to create more, 152 00:25:13.499 --> 00:25:16.798 more partial reduction sums. 153 00:25:16.798 --> 00:25:26.068 Skip that for the moment; I'll give you the executive summary, which is that in this computation 154 00:25:26.068 --> 00:25:32.219 we're doing more partial... skipping over the details to move along. 155 00:25:33.328 --> 00:25:36.659 But putting it all together, um, 156 00:25:36.659 --> 00:25:40.199 what we're doing here: 157 00:25:40.199 --> 00:25:45.088 the first, the top half of the tree is where we're doing the reduction 158 00:25:45.088 --> 00:25:54.598 of the whole array, and we're also doing partial reductions of pieces. So we've got reductions of 4 elements, reductions of 2 elements, and so on. 159 00:25:54.598 --> 00:26:02.159 That's the top half of the array, right? Then in the bottom half of the tree, you might say, we're branching out again, and 160 00:26:02.159 --> 00:26:12.838 taking these partial reductions and doing more additions, and at the end of it we're going to have our scan operation. I'll leave this up for a minute. So, 161 00:26:12.838 --> 00:26:18.328 if you look at the ID number of a thread: the one whose ID is a multiple of 8, 162 00:26:18.328 --> 00:26:29.939 there's only 1 here, it's got the sum of 8 elements. The one that's a multiple of 4, whose ID is a multiple of 4, has the sum of 4 elements. The ones that are multiples of... 163 00:26:31.648 --> 00:26:35.638 sorry, I've got them off by 1, as we have 16 here, not 8. 164 00:26:35.638 --> 00:26:41.009 Okay: the thread whose ID is a multiple of 16 has the sum of all 16 elements. 165 00:26:41.009 --> 00:26:44.308 The threads whose IDs are a multiple of 8, but not 16, 166 00:26:44.308 --> 00:26:50.009 have the sum of 8 elements. The threads whose ID numbers are a multiple of 4, but not 8,
167 00:26:50.933 --> 00:27:05.903 have the sums of the 4 elements to their left. And the threads whose IDs are a multiple of 2, but not a multiple of 4, have the sum of 2 elements. And the odd numbered threads haven't been changed. That's the state of the system in the middle here, again, time going down. 168 00:27:06.209 --> 00:27:12.538 Now, what we do is take all these partial sums, and we start adding more stuff to them. 169 00:27:12.538 --> 00:27:16.378 And so we're doing another 170 00:27:16.378 --> 00:27:20.068 branching out, and as the end result we will get 171 00:27:20.068 --> 00:27:25.169 all the scans. If I look at this here: say here we've got the sum of the left 8 elements, 172 00:27:25.169 --> 00:27:37.253 and we add in the sum of 4 more elements, we add in the sum of 2 more elements and 1 more element. And at this point, where the hand shows, this is thread 15, counting from 1 just to make it easy: 173 00:27:38.213 --> 00:27:40.193 it's the sum of the 15 elements to the left. 174 00:27:40.499 --> 00:27:45.179 Let me take another thread just for fun. Let me take this thread here. 175 00:27:45.179 --> 00:27:53.038 So, this thread is the sum of itself and the element to its left, and then we add in this, which is the sum of 176 00:27:53.038 --> 00:27:59.128 the first 8 elements. So at this point this thread here is the sum of 10 elements, and it'd be thread number 10. 177 00:28:00.209 --> 00:28:09.689 So, an interesting 2 stage process. A 3rd example here: let's take this. This would be thread 7, I guess. 178 00:28:09.689 --> 00:28:21.868 It's not affected in the 1st stage. We add in the sum of the 1st 4 threads, we add in the sum of the next 2 threads, we add them into here, and at this point this thread is the sum 179 00:28:21.868 --> 00:28:25.318 of the 1st 7 threads. 180 00:28:25.318 --> 00:28:29.608 So this is another way to do the scan operation. At the end of it, 181 00:28:29.608 --> 00:28:33.568 each thread is the sum of all the threads 182 00:28:33.568 --> 00:28:40.888 to its left, including itself. The 1st thread doesn't get changed, the 2nd thread is the sum of 2 threads, and so on. 183 00:28:40.888 --> 00:28:51.239 And the top half took log n stages and the bottom half took log n stages, so 2 log n stages, 184 00:28:51.239 --> 00:28:55.019 and we've got all the scans. 185 00:28:56.249 --> 00:29:01.949 Now, what makes this better than the thing I showed, say, 2 slide sets ago, slide 10.1? 186 00:29:01.949 --> 00:29:07.469 If we look at all the little plus signs, each plus sign is a core that did something. 187 00:29:07.469 --> 00:29:13.409 This is actually only 188 00:29:13.409 --> 00:29:18.449 2n additions. So this whole thing, that took 2 log n 189 00:29:18.449 --> 00:29:23.159 stages, took only a total of 2n 190 00:29:23.159 --> 00:29:34.409 additions. So this is what we mean by work efficient. The 1st version of the scan operation, there were log n stages, but in each stage all n 191 00:29:34.409 --> 00:29:40.499 cores did something, so the 1st thing did n log n work. This thing is only order n work. 192 00:29:40.499 --> 00:29:43.679 And again, although I said that idle CUDA cores, 193 00:29:43.679 --> 00:29:46.858 you don't care about them... but you care about them if they're queued up waiting to execute.
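(A sketch of that two-phase, work-efficient pattern: the up-sweep reduction tree followed by the down-sweep that distributes the partial sums. One block, n a power of 2; the index arithmetic is my illustration of the idea, not the slides' code:)

    __global__ void workEfficientScan(float *x, int n) {
        extern __shared__ float s[];
        int i = threadIdx.x;
        if (i < n) s[i] = x[i];
        // Up-sweep: partial sums accumulate at indices that are multiples of 2, 4, 8, ...
        for (int stride = 1; stride < n; stride *= 2) {
            __syncthreads();
            int j = (i + 1) * 2 * stride - 1;   // only every (2*stride)-th slot is active
            if (j < n) s[j] += s[j - stride];
        }
        // Down-sweep: branch back out, adding each partial sum into later elements.
        for (int stride = n / 4; stride >= 1; stride /= 2) {
            __syncthreads();
            int j = (i + 1) * 2 * stride - 1;
            if (j + stride < n) s[j + stride] += s[j];
        }
        __syncthreads();
        if (i < n) x[i] = s[i];                 // inclusive scan result
    }

(Counting the plus signs: the up-sweep does n-1 additions and the down-sweep fewer than n, about 2n total, versus n log n for the naive version. Note the active indices j are spread out, which is exactly the warp-packing and cache complaint coming next.)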
194 00:29:46.858 --> 00:30:00.778 If they're queued up waiting to execute, they're slowing down cores that do want to do something. So if a core is not doing anything useful, if all it's going to do is, say, add in something off the beginning of the array, add in a 0, 195 00:30:00.778 --> 00:30:07.199 well, you would rather be able to determine statically that it's going to add a zero, and not execute it. Okay. 196 00:30:07.199 --> 00:30:10.679 Good. So this is more work efficient: 197 00:30:10.679 --> 00:30:14.489 it has fewer total additions occurring. 198 00:30:15.354 --> 00:30:29.663 Now, the only problems with this, which we'll get to later, are that the active cores are not adjacent to each other, so they're not packed into the smallest number of warps, and the operands 199 00:30:29.969 --> 00:30:33.328 to each core are not adjacent to each other either. 200 00:30:33.328 --> 00:30:36.358 So, this doesn't play nicely 201 00:30:36.358 --> 00:30:41.219 with the concept of a warp, and it does not play nicely with the cache manager. 202 00:30:41.219 --> 00:30:46.078 However, it does have the fewest number of cores executing. 203 00:30:46.078 --> 00:30:53.128 So, we have progress. Um, skip the code. Um, 204 00:30:53.128 --> 00:30:56.159 notice the liberal use of sync threads. 205 00:30:56.159 --> 00:31:03.148 Okay, so we'll see what they say about what I just said next. 206 00:31:13.108 --> 00:31:16.769 So, we're going to analyze it. 207 00:31:18.868 --> 00:31:21.929 Okay, so: a total of a linear amount of work, 208 00:31:21.929 --> 00:31:26.669 total, so 2n adds. 209 00:31:28.078 --> 00:31:31.138 So, the efficient sequential thing would have n adds. 210 00:31:31.138 --> 00:31:40.199 That takes n additions, and this takes 2n. So we doubled the amount of work, the number of additions, but we cut the wall clock time. 211 00:31:40.199 --> 00:31:45.568 Parallel is going to take more operations than sequential; 212 00:31:45.568 --> 00:31:49.138 it always happens. A factor of 2 is quite good. 213 00:31:49.138 --> 00:31:54.388 Okay. And now, if you're 214 00:31:54.388 --> 00:31:59.068 running something P ways in parallel, you ain't gonna get a factor of P speedup. 215 00:31:59.068 --> 00:32:04.648 So, the work efficiency, it's nice. 216 00:32:04.648 --> 00:32:13.229 Okay, work inefficiency might be fine for some things, but, um... 217 00:32:15.419 --> 00:32:30.328 Okay, so here's the next thing. Suppose we've got a big, big, big input vector, and it's too big to fit in 1 thread block. I said a thread block has a thousand threads max, 218 00:32:30.328 --> 00:32:37.019 1024. So suppose you want to scan a 1 million element vector. 219 00:32:37.019 --> 00:32:40.648 What you would do is you would fire up a 220 00:32:41.699 --> 00:32:49.048 kernel with a thousand thread blocks, each with a thousand threads, and each thread block 221 00:32:49.048 --> 00:32:53.368 would do the scan independently on its thousand elements. 222 00:32:53.368 --> 00:32:57.209 And at some point at the end, we then have to 223 00:32:57.209 --> 00:33:01.439 merge the results and update each thread block. 224 00:33:01.439 --> 00:33:04.588 And they talk about it: they scan the sums array. 225 00:33:04.588 --> 00:33:10.618 So, you do a 2nd level scan on the totals, 226 00:33:10.618 --> 00:33:18.568 one for each thread block, and this gives a dope vector, which you then go back and add into each thread block. And now you've got your final scanned version.
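(The shape of that two-level hierarchy, as host-side pseudocode; scanBlock, scanSums, and addOffsets are hypothetical kernel names standing in for the pieces just described, and d_x and d_sums are assumed to be device arrays allocated earlier:)

    const int N = 1000000;
    int threads = 1000;                         // 1000 threads per block
    int blocks  = (N + threads - 1) / threads;  // 1000 blocks for N = 1,000,000

    // 1. Each thread block scans its own 1000-element piece independently,
    //    writing its block total into d_sums[blockIdx.x].
    scanBlock<<<blocks, threads>>>(d_x, d_sums, N);

    // 2. A 2nd level scan of the 1000 block totals; an exclusive scan here
    //    gives each block's starting offset, the dope vector.
    scanSums<<<1, blocks>>>(d_sums, blocks);

    // 3. Broadcast back out: every element in block b gets d_sums[b] added in.
    addOffsets<<<blocks, threads>>>(d_x, d_sums, N);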
227 00:33:18.568 --> 00:33:21.808 That's the scanned version of the 1 million element array. 228 00:33:21.808 --> 00:33:27.388 So, large vectors: again, if your vector's too big to fit in 1 thread block, 229 00:33:27.388 --> 00:33:39.628 you partition it into a separate piece for each... you run multiple thread blocks, you scan each thread block separately, and then you do a combo scan of the, 230 00:33:39.628 --> 00:33:43.378 basically, of the total from each thread block, 231 00:33:43.378 --> 00:33:49.048 broadcast that back out to the thread blocks, and they update themselves, and now you've got the whole vector done. 232 00:33:49.048 --> 00:33:56.489 A 2 step process. And again, you have to do it something like this, because with separate thread blocks 233 00:33:56.489 --> 00:34:00.778 you've got no guarantee of when they're running. 234 00:34:00.778 --> 00:34:08.759 Okay, they're calling them scan blocks, or thread blocks. The big array is partitioned into blocks, and each block gets, you 235 00:34:08.759 --> 00:34:15.838 know, scanned; scan will be a verb here and an adjective. Each block gets scanned separately, then you take the total 236 00:34:15.838 --> 00:34:25.289 from each block, you put it in an auxiliary array, you scan it, you broadcast it out, and you update the scanned things. I'll leave this up for a second or 2. 237 00:34:29.969 --> 00:34:35.608 Multiple levels: parallel computing, and you've got hierarchy. So, okay, 238 00:34:35.608 --> 00:34:39.688 hierarchies of everything: algorithms, memory, 239 00:34:39.688 --> 00:34:49.918 and so on. But it's not a full binary tree hierarchy; the hierarchy is not very high. 240 00:34:51.568 --> 00:34:55.559 So, okay. Okay. 241 00:34:55.559 --> 00:35:00.088 What we... what this scan that we talked about before was, is called an inclusive scan. 242 00:35:00.088 --> 00:35:04.949 There's a variant called an exclusive scan, where you put a 0 in front, 243 00:35:04.949 --> 00:35:13.289 and the last element is the sum of the first n minus 1 elements, and nowhere in this is there the sum of all the elements. 244 00:35:13.289 --> 00:35:16.528 Inclusive scan, exclusive scan. 245 00:35:16.528 --> 00:35:19.559 So, it's just... 246 00:35:19.559 --> 00:35:26.128 you know, conceptually the same, but the exclusive scan is easier for working with dope vectors; it's 247 00:35:26.128 --> 00:35:32.518 easier for different purposes. I'd use the exclusive scan, but then I don't get the sum of all the elements if I would need it. So, 248 00:35:32.518 --> 00:35:42.628 okay: say I've just allocated a buffer, as I call this array, and the beginning addresses are what I'd call the dope vector. So, 249 00:35:42.628 --> 00:35:51.208 you've got the difference: in the inclusive scan, the elements are the sums of all the elements up through here; exclusive, the sum of all the elements to the left of here. So, 250 00:35:52.498 --> 00:35:55.889 okay, inclusive, exclusive: minor point. 251 00:35:58.829 --> 00:36:06.358 What they're saying here, you can get an idea from this. 252 00:36:07.829 --> 00:36:12.929 Oh, okay. So, what we saw in this set of slides 253 00:36:12.929 --> 00:36:22.588 was this parallel scan operation, which is a widely useful operation for parallel algorithms: how to do it, and then how to do it efficiently. 254 00:36:24.239 --> 00:36:32.369 Next is chapter 12; they skipped chapter 11, 255 00:36:32.369 --> 00:36:38.548 there's nothing interesting there. Okay. 256 00:36:40.199 --> 00:36:47.579 A touch on floating point. You've seen some of this before, exactly. Well, one thing relevant to the current NVIDIA architecture 257 00:36:47.579 --> 00:36:52.228 is, since many programs are limited by I/O time,
258 00:36:52.228 --> 00:36:56.699 they invented a half precision floating point data format, 259 00:36:56.699 --> 00:37:05.068 which is half the size. Okay, floating point: this goes back a few decades. 260 00:37:06.268 --> 00:37:14.068 You'd like to have the hardware... have some standards for floating point, with round off and stuff like that. 261 00:37:14.068 --> 00:37:23.213 And so there's an IEEE standard for this, and the problem with the standard is it's expensive to implement. 262 00:37:23.213 --> 00:37:31.494 And when this was first proposed, a few decades ago, there was actually a lot of professional debate about whether this standard was overkill. 263 00:37:31.768 --> 00:37:36.418 Was it being too finicky about round offs and stuff like that? 264 00:37:36.418 --> 00:37:43.798 Would it be too expensive to implement? And in fact, Cray refused to accept the standard. 265 00:37:45.778 --> 00:37:53.878 So, Cray was a major supercomputer manufacturer, and they refused to implement the floating point standard; they said it takes too much hardware. 266 00:37:53.878 --> 00:37:57.869 In any case, now everyone accepts it, but 267 00:37:57.869 --> 00:38:02.579 NVIDIA has a way to ignore it: fast math operations. Oh, okay. 268 00:38:02.579 --> 00:38:06.539 So, floating point: you've got a sign, you've got an exponent, you've got a mantissa. 269 00:38:06.864 --> 00:38:17.574 And actually a cool thing is, the floating point number is laid out in a way... the bits are laid out in a pattern 270 00:38:17.934 --> 00:38:26.903 so that you can compare two floating point numbers with an integer binary comparison. That is cool. So the comparison operator for ints, if you apply it to 271 00:38:27.210 --> 00:38:41.574 floating point numbers, it still works. Which is also the hardware point: to the hardware, bits is bits. You've got 32 bits; there's nothing in the hardware that says what they mean. It could be a 32 bit integer, it could be a 32 bit float, it could be 4 8-bit characters. 272 00:38:44.099 --> 00:38:49.679 You know, it could be 5 6-bit characters plus 2 spare bits, who knows. 273 00:38:49.679 --> 00:38:53.130 It's how you interpret the bits. 274 00:38:53.905 --> 00:39:06.175 Normalized numbers: I'm going to skip some of the details. The way we get an extra bit is, the leading bit of the floating point number is a 1, always, because it's 1 point something times 2 to an exponent. Then, since it's always 1, 275 00:39:06.175 --> 00:39:09.594 you don't store it, and you get 1 more bit of precision. 276 00:39:11.219 --> 00:39:18.510 The details... ah, I'll just do some historical notes. 277 00:39:20.304 --> 00:39:24.894 Again, it takes some thinking to figure out how to do floating point numbers right. 278 00:39:25.315 --> 00:39:39.655 And now that it's been figured out, we forget that it took some time. And IBM, even when they were the biggest computer company, actually did floating point numbers in a very inefficient way. The way IBM implemented floating point for years 279 00:39:39.684 --> 00:39:40.494 actually 280 00:39:40.769 --> 00:39:45.510 had fewer effective significant bits than necessary. So: 281 00:39:45.510 --> 00:39:57.420 IBM didn't actually have binary floats, they actually had base 16 floats, which sounds like it'd be the same thing, but no, because you get more leading binary zeros. Okay, 282 00:39:57.420 --> 00:40:02.489 I'll skip some of the details here, details on how you implement it.
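(A small runnable sketch of that bit-layout point, my example in ordinary C++ host code: pull a float apart into sign, exponent, and mantissa, and compare two non-negative floats by their raw bit patterns:)

    #include <cstdio>
    #include <cstring>
    #include <cstdint>

    int main() {
        float f = 6.5f;                      // 1.625 * 2^2
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);      // same 32 bits, reinterpreted
        uint32_t sign     = bits >> 31;            // 1 bit
        uint32_t exponent = (bits >> 23) & 0xFF;   // 8 bits, biased by 127 (here 129)
        uint32_t mantissa = bits & 0x7FFFFF;       // 23 bits; the leading 1 is implicit
        printf("sign %u exp %u mantissa 0x%06x\n", sign, exponent, mantissa);

        // Sign, then exponent, then mantissa, high bits to low: so for two
        // non-negative floats, integer comparison of the bit patterns agrees
        // with floating point comparison.
        float a = 2.5f, b = 100.0f;
        uint32_t ua, ub;
        memcpy(&ua, &a, 4);
        memcpy(&ub, &b, 4);
        printf("%d %d\n", ua < ub, a < b);   // prints 1 1
        return 0;
    }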
283 00:40:02.489 --> 00:40:09.360 Floating point numbers are... representative of another example of how hard it is to get this implementation right 284 00:40:09.360 --> 00:40:16.769 is that, some years ago, there were various different implementations; I mentioned Cray, I believe. So 285 00:40:16.769 --> 00:40:30.929 people wrote subroutines to try to query their computer, to try to determine what the actual number of significant bits in the mantissa was. Because you couldn't just call 286 00:40:30.929 --> 00:40:45.385 a system routine and have it tell you; you had to sort of probe your system, do additions and see what the result was, and determine what the actual precision of your computer was. So, Communications of the ACM published a subroutine 287 00:40:45.385 --> 00:40:46.764 that would determine that. 288 00:40:47.400 --> 00:40:52.349 And then somebody found a real piece of hardware that would cause this routine to go into an infinite loop. 289 00:40:52.349 --> 00:41:00.389 So, crazy issues. Other things with implementations: you might have intermediate registers which have more precision 290 00:41:00.389 --> 00:41:07.530 than your memory format for a float. If you've got more temporary precision, then you've got less round off error, which is good. 291 00:41:07.530 --> 00:41:14.159 But also, then, addition was not commutative: a plus b would not be b plus a, because one might be in a temporary register that's bigger. 292 00:41:14.159 --> 00:41:20.519 You know, subtleties. Other little subtleties: you might want your major 293 00:41:20.519 --> 00:41:28.199 built-in functions, like sine and exponential, to be monotonic: if you increase the argument, you'd want the result at least not to decrease. 294 00:41:28.199 --> 00:41:34.019 Well, that wasn't true sometimes, just from implementation... weird little things like that. 295 00:41:34.019 --> 00:41:37.739 Again, I've skipped some details about this. Um, 296 00:41:38.880 --> 00:41:44.400 the takeaway from this is, it's surprisingly hard to do floating point right. 297 00:41:44.400 --> 00:41:49.320 Skip over that. 298 00:41:49.320 --> 00:41:52.949 So, IEEE single precision: 299 00:41:52.949 --> 00:42:01.380 a 23 bit fraction, the mantissa. That's actually not enough for a lot of things; you continue to lose precision as you do operations. 300 00:42:01.380 --> 00:42:15.869 So you actually have to be a touch careful doing scientific computation with single precision. And the exponent is not enough, because your biggest representable number is about 10 to the 38th, and again, that's not enough for a lot of things. So, double. So, 301 00:42:15.869 --> 00:42:26.610 what I tell people is: on Intel, use double precision. It takes twice the space, but on Intel you're not I/O bound so much, and it's a 52 bit fraction and a 302 00:42:26.610 --> 00:42:35.880 bigger exponent; you're fine. Of course, on the GPU, which is I/O bound typically, you can't just automatically go to double precision: 303 00:42:35.880 --> 00:42:41.880 first, because it doubles the I/O time, and second, you have fewer double precision processors, 304 00:42:41.880 --> 00:42:46.650 so a thread might be waiting for processors. 305 00:42:46.650 --> 00:42:52.199 Okay. Also, some cool things in the IEEE standard that make 306 00:42:52.199 --> 00:42:57.269 people freak out at the start: it has ways to represent plus and minus infinity, 307 00:42:57.269 --> 00:43:05.550 and it has a bit pattern which means not a number. So if you divide 0 by 0, it should output the not-a-number bit pattern.
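(A quick runnable sketch, my own example rather than the slides', showing those special values and the arithmetic rules they obey:)

    #include <cstdio>
    #include <cmath>

    int main() {
        float zero = 0.0f;
        float inf  = 1.0f / zero;            // +infinity
        float nan_ = zero / zero;            // the not-a-number bit pattern
        printf("%f %f\n", inf, nan_);        // prints inf and nan
        printf("%d\n", nan_ == nan_);        // 0: NaN compares unequal, even to itself
        printf("%f\n", 0.0f * nan_);         // nan: 0 does not collapse it to 0
        printf("%d\n", std::isnan(0.0f * inf)); // 1: 0 times infinity is NaN too
        return 0;
    }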
308 00:43:06.869 --> 00:43:14.070 And then there are weird effects: you now start violating normal rules of arithmetic, I guess, though you might understand them. 309 00:43:14.070 --> 00:43:22.110 Like, 0 times not-a-number is still not a number, so 0 does not always collapse everything to 0. You see the problem. 310 00:43:22.110 --> 00:43:26.340 And so you get weird, counterintuitive things happening with this. 311 00:43:26.340 --> 00:43:33.449 The thing is that no one hardly ever uses these things. I thought it would be cool to use something like not-a-number in one of my 312 00:43:33.449 --> 00:43:46.289 C++ programs, to represent when it wasn't outputting a legal number, and it killed my performance. It turned out it was being simulated in software or something, and no, the documentation didn't say that. 313 00:43:46.289 --> 00:43:50.639 Okay. Um, accuracy and rounding. 314 00:43:52.110 --> 00:43:58.019 You all know what that is. This means that addition is not associative, of course: 315 00:43:58.019 --> 00:44:01.739 rounding error. 316 00:44:03.119 --> 00:44:09.119 I think you know what rounding is; if you don't know, if you want me to slow down, I will. 317 00:44:09.119 --> 00:44:16.260 Okay, so in hardware, you'd like to internally have 2 more bit positions than you visibly have, 318 00:44:16.260 --> 00:44:20.369 and this will help the rounding 319 00:44:20.369 --> 00:44:24.630 make your results accurate to the last visible bit, typically. 320 00:44:24.630 --> 00:44:27.750 Not associative: 321 00:44:27.750 --> 00:44:34.710 so, (large plus small) plus small is not large plus (small plus small), because the small plus the 322 00:44:34.710 --> 00:44:41.369 large may just be the large: the small may get lost, it may not affect the large, it's too small. So things like this 323 00:44:41.369 --> 00:44:44.789 are relevant. 324 00:44:48.775 --> 00:45:00.625 If I back up 2 stages here, this is relevant if you're adding up a big array, because a subtotal might start getting much bigger than the next element you're adding, and then this is relevant here. 325 00:45:00.929 --> 00:45:06.030 I'm also teaching probability this semester: if you're computing a variance as 326 00:45:06.030 --> 00:45:10.079 the mean of the x squareds minus the square of the mean of the x's, 327 00:45:10.079 --> 00:45:16.050 you may get hit by this; it may come out wrong. 328 00:45:16.050 --> 00:45:20.579 Run time math library. 329 00:45:20.579 --> 00:45:23.610 So, what NVIDIA has: 330 00:45:23.610 --> 00:45:31.409 they sort of say, IEEE 754 is nice, but maybe it's too slow, so we have fast hardware versions 331 00:45:31.409 --> 00:45:34.469 which are faster but may be less accurate. So, 332 00:45:34.469 --> 00:45:43.920 and you can pick. Yeah, you want to be careful about using it. Again, I had one of my geometry programs where I turned on fast math: 333 00:45:43.920 --> 00:45:52.710 cool, fast must be better. And it broke my program, actually, because I was implicitly assuming that floats were done properly. 334 00:45:52.710 --> 00:46:02.159 I just implicitly assumed it when I designed my algorithms, and when I put fast math in as the compiler option, our program no longer gave the right answer. So, 335 00:46:02.159 --> 00:46:05.400 be safe, 336 00:46:06.869 --> 00:46:12.690 be careful. Okay, so, an introduction to floats and so on. 337 00:46:14.909 --> 00:46:19.920 No questions? 338 00:46:19.920 --> 00:46:24.510 Able to... 339 00:46:24.510 --> 00:46:32.940 So, this is not strictly... well, it's parallel computing in the sense that stability becomes harder to achieve.
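(A two-line demonstration of that large-plus-small effect, my example in single precision, where the spacing between representable numbers near 10^7 is 1:)

    #include <cstdio>

    int main() {
        float large = 1.0e7f, small = 0.5f;
        // Feeding the smalls into the large one at a time loses them both:
        printf("%.1f\n", (large + small) + small);   // 10000000.0
        // Grouping the smalls first keeps enough magnitude to survive rounding:
        printf("%.1f\n", large + (small + small));   // 10000001.0
        return 0;
    }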
340 00:46:32.940 --> 00:46:38.309 Stability means: when does round off error go crazy? 341 00:46:38.309 --> 00:46:41.639 So, give you some examples. 342 00:46:42.989 --> 00:46:47.639 Again, your backgrounds are variable somewhat, but 343 00:46:47.639 --> 00:46:54.360 the point is that stability affects the outcome. So, the standard way to multiply 2 matrices, n by n by n, the n cubed time: 344 00:46:54.360 --> 00:46:59.969 there are asymptotically faster ways to multiply matrices. The first one was 345 00:46:59.969 --> 00:47:03.360 Strassen's, the n to the 2.8 time; 346 00:47:03.360 --> 00:47:07.559 the 2.8 was log to the base 2 of 3, actually... 347 00:47:07.559 --> 00:47:10.619 log to the base 2 of 7. 348 00:47:10.619 --> 00:47:16.079 And that exponent's been bashed down since. But they're not used so much, 349 00:47:16.079 --> 00:47:23.309 in spite of the smaller exponent: first, the constant factor in front of the time is bigger, and second, they're numerically less stable. 350 00:47:23.309 --> 00:47:27.510 So, they're adding and subtracting things, and 351 00:47:27.510 --> 00:47:33.360 you get more round off errors. So the simple, obvious thing: 352 00:47:34.409 --> 00:47:37.980 it's slower, but it's simple, 353 00:47:37.980 --> 00:47:43.380 and it has better round off properties. So. 354 00:47:44.880 --> 00:47:50.460 And again, with things like inverting a matrix, solving a system of linear equations: 355 00:47:50.460 --> 00:48:04.500 there are algorithms which may look better but are unstable. So the round off may go... I mean, it's not just round off, it's not just a few least significant bits that may be wrong; the algorithm may just crash. 356 00:48:04.500 --> 00:48:08.519 It may end up with 0 significant bits, effectively. So: 357 00:48:09.960 --> 00:48:17.190 a review of how you solve a set of linear equations, 3 equations in 3 unknowns. 358 00:48:17.190 --> 00:48:21.059 Well, 359 00:48:21.059 --> 00:48:25.980 the simple way: well, first you normalize it, so the leading 360 00:48:25.980 --> 00:48:29.190 coefficient, the coefficient on x, is always 1. 361 00:48:30.780 --> 00:48:43.500 And then what you can do is, you can take the 1st equation, you can subtract it from the 2nd and 3rd and get this. So now you've eliminated x from the 2nd and 3rd. And now you can guess what we're going to do: we're going to scale 362 00:48:43.500 --> 00:48:50.880 the 2nd equation, subtract it from the 3rd, and now the 3rd equation gives us z. So now we walk back up, and we've solved it. 363 00:48:50.880 --> 00:48:57.059 Like this. This is nice, but, um, 364 00:48:58.889 --> 00:49:02.519 depending on what the relative coefficients are... 365 00:49:02.519 --> 00:49:17.130 let me show you right here. You see, you've got 16y in the 2nd equation, 4y in the 3rd equation. So you have to double the 2nd equation, then add it into the 3rd equation. So that means all the coefficients of the 2nd equation get doubled. 366 00:49:17.130 --> 00:49:26.039 So they get bigger. So you see, they might start swamping the coefficients in the 3rd equation, causing significant bits to be lost. In a case like this, it would be better 367 00:49:26.039 --> 00:49:34.860 to take the 3rd equation and add half of it to the 2nd equation, because now the numbers in the coefficients are getting smaller, not bigger. 368 00:49:34.860 --> 00:49:40.110 So when these coefficients of the 3rd equation are added into the 2nd, they don't swamp the 2nd equation's coefficients, and they're not causing so much loss of significance.
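(A small sequential sketch of that elimination, my C++ illustration with made-up numbers; it includes the swap-the-biggest-coefficient-up idea, the pivoting the next slides turn to:)

    #include <cstdio>
    #include <cmath>
    #define N 3

    // Solve A x = b by Gaussian elimination; b is stored as column N of A.
    void solve(double A[N][N + 1], double x[N]) {
        for (int col = 0; col < N; col++) {
            int best = col;                       // pivot: largest coefficient in the column
            for (int r = col + 1; r < N; r++)
                if (fabs(A[r][col]) > fabs(A[best][col])) best = r;
            for (int c = 0; c <= N; c++) {        // swap the pivot row up
                double t = A[col][c]; A[col][c] = A[best][c]; A[best][c] = t;
            }
            for (int r = col + 1; r < N; r++) {   // eliminate this column below the pivot
                double f = A[r][col] / A[col][col];
                for (int c = col; c <= N; c++) A[r][c] -= f * A[col][c];
            }
        }
        for (int r = N - 1; r >= 0; r--) {        // walk back up: back substitution
            x[r] = A[r][N];
            for (int c = r + 1; c < N; c++) x[r] -= A[r][c] * x[c];
            x[r] /= A[r][r];
        }
    }

    int main() {
        double A[N][N + 1] = {{1, 2, 1, 8}, {2, 16, 2, 40}, {3, 4, 1, 14}};
        double x[N];
        solve(A, x);
        printf("x=%g y=%g z=%g\n", x[0], x[1], x[2]);   // prints x=1 y=2 z=3
        return 0;
    }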
369 00:49:40.110 --> 00:49:44.519 That's the better way to do the elimination. 370 00:49:44.519 --> 00:49:48.690 This is so... 371 00:49:48.690 --> 00:49:51.929 and, um, 372 00:49:51.929 --> 00:49:57.329 they talk a little; I'll skip through it. 373 00:49:58.679 --> 00:50:05.789 Okay. And the problem... so it parallelizes nicely. 374 00:50:07.320 --> 00:50:10.469 So, this problem with stability: 375 00:50:13.019 --> 00:50:17.760 what you would like to do is actually find the largest element in the, 376 00:50:17.760 --> 00:50:23.489 in the array of coefficients, and 377 00:50:23.489 --> 00:50:28.380 use that: swap it up to the top left, and then start 378 00:50:28.380 --> 00:50:31.949 subtracting multiples, and it will turn out to yield 379 00:50:31.949 --> 00:50:36.030 better precision in the result. 380 00:50:36.030 --> 00:50:40.710 They're talking about that here; I just gave you the context of it. 381 00:50:44.309 --> 00:50:50.429 But you've got to find the largest element, and that takes a scan, which takes some time. 382 00:50:50.429 --> 00:51:00.840 So, they may not look for the absolutely largest element; they may find the largest element in a row, or in a column or something, and work with it. That's called partial pivoting. 383 00:51:00.840 --> 00:51:04.829 It's faster to find the pivot, but the pivot's not as good. But... 384 00:51:06.539 --> 00:51:09.570 that's what they're talking about here. So. 385 00:51:12.179 --> 00:51:22.650 So, the message here is: you'd like to have the best numerical precision, it's called stability, and it's harder to achieve with parallel algorithms. 386 00:51:25.110 --> 00:51:39.389 Okay, so now we're getting back to specifically NVIDIA stuff here. 387 00:51:43.500 --> 00:51:48.719 The GPU is attached to the 388 00:51:48.719 --> 00:51:56.010 CPU by quite a fast bus, but still, we want to get an idea of how they work together. 389 00:51:56.010 --> 00:52:00.750 So, a different theme. 390 00:52:01.949 --> 00:52:12.119 Okay, back. So this is your CUDA program: it's C++, it's got these minor syntactic extensions, and it's got these new routines, cudaMalloc 391 00:52:12.119 --> 00:52:25.409 and cudaMemcpy. And again, with managed memory, which automatically pages, it'd be cudaMallocManaged, and you would never have to do a memcpy, unless you thought you could do it better 392 00:52:25.409 --> 00:52:31.500 than the system. And maybe you could, actually, but it takes your time. 393 00:52:32.760 --> 00:52:37.079 And again, so this is the call in your 394 00:52:37.079 --> 00:52:43.230 main program to call the kernel on the GPU: you've got the triple angle bracket 395 00:52:43.230 --> 00:52:49.949 extension, and in it you specify how many thread blocks and how many threads per block. 396 00:52:51.000 --> 00:53:01.590 Okay. And the kernel routine, again: it's called from the host and executed on the device, and you can pass in arguments. 397 00:53:01.590 --> 00:53:07.739 And in the routine you can define variables as local, or you've got register variables. 398 00:53:07.739 --> 00:53:11.460 Local variables are local to the thread, but they are slow; 399 00:53:11.460 --> 00:53:15.000 that's if you don't have enough registers. And shared variables, they are 400 00:53:15.000 --> 00:53:18.750 global to the threads, local to the block. 401 00:53:18.750 --> 00:53:23.039 And lots of sync threads. So this is the general structure of a 402 00:53:23.039 --> 00:53:31.739 CUDA program. Bandwidth is important.
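(Putting those pieces in one place, a minimal sketch of that general structure; the kernel, array, and sizes are mine:)

    #include <cstdio>

    // Kernel: called from the host, executed on the device.
    __global__ void addOne(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // register variables
        __shared__ float s[256];   // shared: global to the threads, local to the block
        if (i < n) s[threadIdx.x] = a[i];
        __syncthreads();
        if (i < n) a[i] = s[threadIdx.x] + 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *a;
        cudaMallocManaged(&a, n * sizeof(float)); // managed memory: pages automatically,
                                                  // so no explicit cudaMemcpy is needed
        for (int i = 0; i < n; i++) a[i] = i;
        addOne<<<(n + 255) / 256, 256>>>(a, n);   // <<<thread blocks, threads per block>>>
        cudaDeviceSynchronize();                  // wait before the host reads the result
        printf("%f\n", a[0]);                     // prints 1.0
        cudaFree(a);
        return 0;
    }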
403 00:53:39.809 --> 00:53:46.139 So, this is an obsolete architecture that they mention because it was important for so long. 404 00:53:46.139 --> 00:53:54.030 You used to have a North Bridge and a South Bridge concentrator: the North Bridge had the fast peripherals, the South Bridge has the slow peripherals. 405 00:53:54.030 --> 00:54:01.079 Okay, and the thing in white on light yellow. 406 00:54:01.079 --> 00:54:06.750 Okay, historical. 407 00:54:08.400 --> 00:54:13.050 And originally you had a bus and so on; it was slow. 408 00:54:13.050 --> 00:54:16.920 And, okay. 409 00:54:18.630 --> 00:54:25.769 There was the concept of memory mapped I/O, of course; reviewing. 410 00:54:25.769 --> 00:54:33.780 So, the devices on the bus, they would read and write directly to physical memory. But of course, that means that 411 00:54:34.860 --> 00:54:40.619 the register has to be in physical memory; it can't get swapped out by the virtual memory manager. 412 00:54:40.619 --> 00:54:47.519 So, you might hard lock in the addresses that the devices access, for example. 413 00:54:49.920 --> 00:55:02.219 But the nice concept is, or here's another slightly different concept: you've got your virtual memory space, and every address... some addresses might map to memory, and some addresses map to the 414 00:55:02.219 --> 00:55:12.869 peripherals. So the way that's implemented is, the peripherals are just watching the address bus, and when they see an address that applies to them, then they take action: they read or write. So, 415 00:55:12.869 --> 00:55:18.090 a nice unifying concept, putting everything, numbered, in the virtual memory space. 416 00:55:18.090 --> 00:55:23.429 Okay, newer... 417 00:55:27.000 --> 00:55:33.300 it got faster. I'm skipping through this too. 418 00:55:35.250 --> 00:55:45.900 Again, faster: lanes, where several bits can go through. An interesting thing, 8b/10b encoding and so on. 419 00:55:47.519 --> 00:55:58.139 Right, because if you have too many of the same, too many zeros or ones in a row on the bus, you get crosstalk, perhaps, 420 00:55:59.250 --> 00:56:03.000 and you don't want too much... 421 00:56:03.000 --> 00:56:06.449 too many ones means there's a DC current, perhaps. So, 422 00:56:06.449 --> 00:56:12.449 this is not strictly parallel computing, so I'm going through it fast. 423 00:56:15.000 --> 00:56:18.150 Your card: 424 00:56:18.150 --> 00:56:21.570 a few years old, but, 425 00:56:21.570 --> 00:56:27.269 graphics: you've got lots of video outs, PCI Express, um, 426 00:56:28.650 --> 00:56:32.369 and now the connector has to do things, and so on. 427 00:56:32.369 --> 00:56:42.929 This is why NVIDIA has compute servers with no graphics output: the graphics takes so much space. This is an old chip by them, 428 00:56:42.929 --> 00:56:50.849 an old board; gives you some idea. The 3D part, not so interesting. 429 00:56:52.230 --> 00:57:06.030 DMA: again, it's got to write to pinned memory if it's writing to actual memory, right, pinned memory. Because, again, 430 00:57:06.030 --> 00:57:11.280 what happens if this has been swapped out by the virtual memory manager? So, 431 00:57:12.360 --> 00:57:17.849 the GPU can be doing direct memory access to the main memory, so 432 00:57:17.849 --> 00:57:23.489 it's accessing pinned memory, right? 433 00:57:23.489 --> 00:57:33.570 So, pinned memory is memory that you can't do virtual management on, so you've got less memory you can page. On parallel, I've actually got so much real memory, 434 00:57:33.570 --> 00:57:36.840 um, that...
435 00:57:36.840 --> 00:57:41.460 what, I've got 256 gigabytes, whatever; that doesn't matter. 436 00:57:43.289 --> 00:57:48.150 Page locked memory: sort of obsolete now, that you can access that 437 00:57:48.150 --> 00:57:57.659 pinned memory. So, things like memcpy are faster with pinned memory, because you don't have to wait for it to get paged, maybe. 438 00:57:57.659 --> 00:58:02.730 Oversubscription: not on parallel, you can't oversubscribe it easily. 439 00:58:04.769 --> 00:58:08.969 Yeah. 440 00:58:12.000 --> 00:58:18.389 And that was not a lot of content there, because it's been partly supplanted with newer stuff. But... 441 00:58:27.389 --> 00:58:32.429 okay, going to skip through this fast here. 442 00:58:32.429 --> 00:58:40.530 You all know about virtual memory management. 443 00:58:42.474 --> 00:58:55.855 It's a touch tricky to implement virtual memory management properly. For example, you're paging your instructions in and out also, and an instruction might be multiple bytes. And what if an instruction spans 444 00:58:56.130 --> 00:59:08.219 a page boundary? You can see where an instruction, maybe half of it gets paged out, and as you page it back in, that might then cause the first half to get paged out. You can imagine some crazy deadlock issues. 445 00:59:08.219 --> 00:59:20.489 They've been solved now, but yeah, you could get crashing if you're paging in... the point is that both pages have to be in memory at the real... 446 00:59:20.489 --> 00:59:25.590 at the same time. 447 00:59:27.630 --> 00:59:37.590 Pinning stuff helps with this. Yeah. 448 00:59:38.820 --> 00:59:49.530 So, the way they implement things like memcpy is that the data actually gets copied to pinned memory, and then gets copied to the virtual memory where you need it. 449 00:59:49.530 --> 00:59:54.510 It takes 2 stages. 450 00:59:56.010 --> 00:59:59.550 And if you want... 451 01:00:04.500 --> 01:00:11.610 okay. 452 01:00:13.679 --> 01:00:20.010 Here is something new now: the concept of streams. We haven't talked about it before. 453 01:00:20.010 --> 01:00:25.829 What is happening here is that you can run... okay, 454 01:00:25.829 --> 01:00:32.219 so far we've seen parallelism within 1 CUDA kernel. 455 01:00:33.989 --> 01:00:42.929 A kernel has thousands of thread blocks, each thread block has a thousand threads; so your 1 kernel is parallel. 456 01:00:44.190 --> 01:00:47.699 But we still have... so we had 1 sequential 457 01:00:47.699 --> 01:00:56.010 thing: in your host program, you allocate memory, you do copying, you fire up a parallel kernel, you wait, 458 01:00:56.010 --> 01:01:04.050 you synchronize, and you read the data. Okay, well, what we're talking about here: that's like 1 task, and the task itself is sequential. 459 01:01:04.050 --> 01:01:11.460 What we're going to learn in this slide set is having multiple parallel tasks in your 460 01:01:11.460 --> 01:01:14.519 CUDA program, and they're called streams. 461 01:01:14.519 --> 01:01:21.570 So, you start 1 stream, you allocate data, and you do some copying, 462 01:01:22.769 --> 01:01:31.500 and then execute. And while the kernel's executing, you could be out copying data for another stream. And this will give you 463 01:01:31.500 --> 01:01:35.699 smaller real time. So the streaming facility 464 01:01:36.414 --> 01:01:50.934 allows you to do different parts of your C++ CUDA program in parallel, independently of each other. So, this means that your separate streams are competing for the fixed resources:
465 01:01:51.179 --> 01:01:54.329 well, first, streaming multiprocessors, thread blocks, 466 01:01:54.329 --> 01:02:01.769 you know, arithmetic units and stuff like that. And so these separate streams, they're doing different things, so 467 01:02:01.769 --> 01:02:04.860 perhaps they compete, or they want different 468 01:02:04.860 --> 01:02:19.320 resources on the GPU. So the 2 streams may actually play well together, because they want different resources at the same time, and therefore you'll get greater work... greater efficiency on the GPU. 469 01:02:21.000 --> 01:02:26.039 As well as, if you think of your algorithm as having multiple parallel streams, then hey, 470 01:02:26.039 --> 01:02:31.050 let's do it. Okay. 471 01:02:31.050 --> 01:02:37.380 Okay, so, an example here: 472 01:02:37.380 --> 01:02:41.250 1 stream, and trans means transfer. 473 01:02:42.690 --> 01:02:49.980 It's a vector add, so we're transferring 2 arrays to the GPU, doing a computation, and transferring the result back. 474 01:02:52.199 --> 01:02:55.769 It would be good if another stream was computing while this stream was transferring, 475 01:02:55.769 --> 01:03:00.090 and so... 476 01:03:02.880 --> 01:03:09.179 you can have some overlap; it shows it here. 477 01:03:11.909 --> 01:03:15.239 Now we've got 4: 478 01:03:15.239 --> 01:03:23.429 possibly we're adding 4 pairs of arrays, the A B 0, the A B 1, A B 2, and A B 3. 479 01:03:24.869 --> 01:03:33.150 So, we start off stream 0, and when stream 0 is computing we start off stream 1, which is transferring, 480 01:03:33.150 --> 01:03:46.170 using different parts of the hardware. So then, when stream 0 is copying stuff back to the host, stream 1 is computing, and then we start stream 2, copying data host to device. 481 01:03:46.170 --> 01:03:50.039 So, in the big black square blocks there, we've got some 482 01:03:50.039 --> 01:03:55.440 parallelism between the different streams. 483 01:03:56.519 --> 01:03:59.940 Okay. 484 01:04:03.780 --> 01:04:06.869 Okay, task parallelism. 485 01:04:08.039 --> 01:04:13.619 We had warp parallelism, we've got block parallelism, and now the next level up: task parallelism. 486 01:04:16.860 --> 01:04:23.460 Okay. And 487 01:04:23.460 --> 01:04:27.599 we can do it with kernel launches and synchronizing and so on. 488 01:04:29.460 --> 01:04:36.480 So there is a queue here. Yeah. So, 489 01:04:37.860 --> 01:04:44.940 start the 2 streams, and inside the streams we can do event querying and so on. 490 01:04:50.849 --> 01:04:54.449 Here's a view of the streams: stream 0, stream 1. 491 01:04:55.710 --> 01:04:59.969 You fire them up, and, um, 492 01:04:59.969 --> 01:05:04.619 okay, time is going top to bottom here. So... 493 01:05:07.530 --> 01:05:13.260 you might imagine you've got hardware that does copying between host and device, and you've got hardware that does computing, 494 01:05:13.260 --> 01:05:23.579 and the stream... actually, which hardware the stream runs on can swap back and forth. That's what this is showing here. 495 01:05:25.320 --> 01:05:31.050 So, stream 0 can start on 496 01:05:31.050 --> 01:05:34.679 hardware 0 and then swap. So that's the context here. 497 01:05:36.449 --> 01:05:47.789 Silence. 498 01:05:47.789 --> 01:05:51.659 So, how do we do this overlapping? 499 01:05:56.039 --> 01:06:03.000 Well, okay, so you can create separate streams, stream create, and this will take a data structure. 500 01:06:04.500 --> 01:06:11.099 And the separate streams want to be working with separate data, so allocate separate data.
501 01:06:11.099 --> 01:06:21.329 Allocate the data, and then you can do things like cudaMemcpyAsync, which is new. 502 01:06:21.329 --> 01:06:32.159 Again, you can play games with managed memory, but the concept of the asynchronous memcpy here is that you give another argument, which is the stream that this executes in. 503 01:06:32.159 --> 01:06:39.659 So, the async memcpy returns to the host immediately, while it's still executing on the device. 504 01:06:41.250 --> 01:06:51.210 So you've got these two async copies and they're running in parallel on stream 0, because that's okay; they're accessing different memory. 505 01:06:51.210 --> 01:06:54.269 And then the vector add here 506 01:06:54.269 --> 01:06:57.300 will work on stream 0. 507 01:06:59.099 --> 01:07:04.500 And then stream 1: here you do the memcpys on stream 1. 508 01:07:04.500 --> 01:07:12.659 And so stream 1 is executing in parallel with stream 0; they're not affecting each other. They're working on different memory. 509 01:07:12.659 --> 01:07:16.619 Except, of course, for the global routine here. 510 01:07:16.619 --> 01:07:20.489 It's the same global routine, but it's running 511 01:07:20.489 --> 01:07:31.139 on different data: stream 0, stream 1. So again, these are thread blocks being created on the same GPU, so with all the thread blocks, 512 01:07:31.139 --> 01:07:34.769 you're going to have this big pool of thread blocks waiting to run. So. 513 01:07:36.599 --> 01:07:40.800 And that means when there's a hardware resource available, it's more likely that 514 01:07:40.800 --> 01:07:44.099 there'll be a thread block that can use it. 515 01:07:44.099 --> 01:07:49.230 Okay, so we've got some issues here where we'd like some synchronization, but. 516 01:07:49.230 --> 01:07:59.309 That's your basic idea: in your CUDA program, your C++ program on the host, you can fire up asynchronous things 517 01:07:59.309 --> 01:08:07.079 in multiple streams. Nothing more interesting there. 518 01:08:11.699 --> 01:08:19.109 Yeah, so we want to figure out the best overlap. 519 01:08:25.619 --> 01:08:30.060 And what are they trying to do here? They're copying. 520 01:08:31.079 --> 01:08:34.409 Ways to reorder things: so we do the 521 01:08:34.409 --> 01:08:37.739 stream 0 copy, then the stream 1 copy. 522 01:08:38.939 --> 01:08:42.149 Let me go back two pages. 523 01:08:45.569 --> 01:08:49.829 Three pages. So the idea is: 524 01:08:49.829 --> 01:08:58.590 this thing starts, and you might say it doesn't start stream 1 quickly enough. You want to 525 01:08:58.590 --> 01:09:03.899 start all the streams trying to do stuff, and then the overlap is better. 526 01:09:06.300 --> 01:09:10.229 So, start all the streams, both streams, copying, and then 527 01:09:11.460 --> 01:09:16.800 the additions. So we're trying to get stuff at the task level 528 01:09:16.800 --> 01:09:21.600 running in parallel and overlapping. 529 01:09:25.020 --> 01:09:33.869 And you can get even more complicated code, lots of buffers. 530 01:09:35.880 --> 01:09:40.979 Yeah, so Hyper-Q: each engine, each streaming multiprocessor, 531 01:09:40.979 --> 01:09:44.340 they want to have stuff waiting to run. 532 01:09:44.340 --> 01:09:52.770 So the executive summary here: the thing is that we've got streams. 533 01:09:52.770 --> 01:10:01.470 Each stream has a sequence of things waiting to run, and the last thing is the GPU, each streaming multiprocessor, perhaps.
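Pulling those calls together: a minimal sketch of the reordered, breadth-first pattern, continuing the vecAdd sketch from earlier. The per-stream buffer names are illustrative, and the host arrays must be pinned for cudaMemcpyAsync to be truly asynchronous.

int half = N / 2;
size_t bytes = half * sizeof(float);
float *da0, *db0, *dc0, *da1, *db1, *dc1;   // separate device data per stream
cudaMalloc(&da0, bytes); cudaMalloc(&db0, bytes); cudaMalloc(&dc0, bytes);
cudaMalloc(&da1, bytes); cudaMalloc(&db1, bytes); cudaMalloc(&dc1, bytes);

// Both streams' host-to-device copies first; each call returns immediately.
cudaMemcpyAsync(da0, ha,        bytes, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(db0, hb,        bytes, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(da1, ha + half, bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(db1, hb + half, bytes, cudaMemcpyHostToDevice, stream1);

// Then both kernels; work stays ordered within each stream, not across them.
vecAdd<<<(half + 255) / 256, 256, 0, stream0>>>(da0, db0, dc0, half);
vecAdd<<<(half + 255) / 256, 256, 0, stream1>>>(da1, db1, dc1, half);

// Then both device-to-host copies, and finally wait on each stream.
cudaMemcpyAsync(hc,        dc0, bytes, cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(hc + half, dc1, bytes, cudaMemcpyDeviceToHost, stream1);
cudaStreamSynchronize(stream0);   // wait for everything queued on stream0
cudaStreamSynchronize(stream1);   // likewise for stream1

Issuing all the copies before either kernel is what gives both streams a chance to overlap a transfer with a computation.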
534 01:10:01.470 --> 01:10:06.029 These are things running, and we want to have 535 01:10:06.029 --> 01:10:12.090 lots of things in your stream waiting to execute. 536 01:10:12.090 --> 01:10:16.229 And so this way, the work queue on the GPU on the left gets fully 537 01:10:16.229 --> 01:10:23.909 occupied. And, um, 538 01:10:25.199 --> 01:10:29.880 you know, trying to get stuff parallel as much as possible. So. 539 01:10:31.770 --> 01:10:38.909 And, okay, so there's a new synchronization routine we haven't seen yet: cudaStreamSynchronize. 540 01:10:38.909 --> 01:10:45.569 So, the thing is, within the one stream there's various tasks that were asynchronous, 541 01:10:45.569 --> 01:10:54.329 and this waits until everything in that stream has been completed. 542 01:10:54.329 --> 01:11:01.859 Like, the data got copied before you add it, let's say, just for that stream. We saw the device synchronize before; that did 543 01:11:01.859 --> 01:11:05.100 all streams. 544 01:11:05.100 --> 01:11:11.310 Okay, so the creative content in this chapter, 545 01:11:11.310 --> 01:11:14.609 which was 546 01:11:16.079 --> 01:11:29.189 chapter 4, module 14: we have streams; this gives us task-level parallelism, and you'd like to reorder stuff so the different streams can execute in parallel. 547 01:11:44.670 --> 01:11:48.689 So we're going to see an example that fits some of this together. 548 01:11:49.800 --> 01:11:53.274 Historical note: MRI was originally 549 01:11:53.274 --> 01:11:53.755 nuclear 550 01:11:53.755 --> 01:12:01.225 magnetic resonance, NMR, when the physicists invented it many decades ago. When the medical community started using it, 551 01:12:01.225 --> 01:12:05.484 they renamed it, because I think they were afraid that the word nuclear would frighten people. 552 01:12:06.810 --> 01:12:10.949 I'm not joking. 553 01:12:10.949 --> 01:12:16.680 Okay, so I've got a bigger example here that should fit the things together. 554 01:12:18.329 --> 01:12:23.579 You do a scan, so: 555 01:12:24.840 --> 01:12:29.399 it's one of these things called an inverse problem in applied mathematics. 556 01:12:29.399 --> 01:12:35.520 The unknowns are the densities at each voxel inside the 557 01:12:35.520 --> 01:12:41.729 patient, and what you know is: you run these rays 558 01:12:41.729 --> 01:12:45.779 through it, and what you observe is 559 01:12:45.779 --> 01:12:49.560 the intensity at the end of the ray. 560 01:12:49.560 --> 01:12:55.590 And so those are the knowns, and the unknowns are the data inside; you want to solve for them. 561 01:12:58.500 --> 01:13:05.369 There are different ways you can scan, which is irrelevant; I'm going to skip through the details here. 562 01:13:07.050 --> 01:13:12.600 In any case, this is what the output might look like. Okay. 563 01:13:14.310 --> 01:13:17.970 Oh. 564 01:13:17.970 --> 01:13:24.449 Iterative solvers tend to be more efficient in many cases than simply explicitly inverting. 565 01:13:24.449 --> 01:13:38.039 If you're solving Ax = b, the explicit way is to say x = A⁻¹b and solve it directly. It turns out iteratively approaching the value of x is often more efficient. 566 01:13:38.039 --> 01:13:44.640 You get the idea; this is just a setup chapter. 567 01:13:51.600 --> 01:13:56.819 Silence. 568 01:13:57.175 --> 01:14:12.085 Okay, so you can be doing stuff in the kernel on the GPU that's more complicated than what we've seen so far. We've been seeing kernels where you'd, like, add two elements. This is a serious thing: it's got floating point and it's got calls to sine and cosine and so on.
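As a one-line illustration of that iterative idea (a generic scheme such as Richardson iteration, not necessarily the solver the slides use): instead of forming x = A⁻¹b, you repeat

    xₖ₊₁ = xₖ + α (b − A xₖ)

until the residual b − A xₖ is small enough. Each step is just matrix-vector work, which is exactly the kind of thing that parallelizes well on a GPU.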
569 01:14:12.085 --> 01:14:13.465 Okay. 570 01:14:13.710 --> 01:14:16.800 Lots of arguments. 571 01:14:16.800 --> 01:14:21.300 Yeah. 572 01:14:23.520 --> 01:14:28.680 Things you can do when you've got multiple loops. 573 01:14:30.720 --> 01:14:39.149 So, what we have here is we've got an outer loop over m; the inner loop over n is inside the m loop, and we're iterating. 574 01:14:39.149 --> 01:14:49.470 Well, the m loop had two stages: this first stage up here, and then the second stage, which 575 01:14:49.470 --> 01:14:59.454 has the n loop. So we fission: we split the outer loop into two pieces, this initial stage 576 01:14:59.965 --> 01:15:09.685 and then the next stage here, twice like this. Why we want to do it is that it's a setup toward the next step. We can 577 01:15:09.960 --> 01:15:16.170 use this: you split loops and you combine loops; it's fission and fusion. 578 01:15:18.119 --> 01:15:25.590 So, fission. And if I back up a little here, this first piece is a nice, simple thing, and it's the sort of thing 579 01:15:25.590 --> 01:15:30.270 you do lots of times in parallel, and so it becomes a 580 01:15:30.270 --> 01:15:34.109 separate kernel, which is very small. 581 01:15:35.130 --> 01:15:41.279 It spawns threads, and just after you do this, you've got to synchronize, of course. 582 01:15:41.279 --> 01:15:45.779 This is the rest of the m loop and the inner n loop. 583 01:15:48.149 --> 01:15:54.119 Okay, we can play games here. We're iterating on m and n, and we're just going to swap those, and 584 01:15:54.119 --> 01:15:59.130 it'll just allow some things to be done more efficiently here. So. 585 01:16:01.319 --> 01:16:05.729 See here: m was outer and n was inner here, and now it's outer n and inner m. So: 586 01:16:05.729 --> 01:16:11.699 interchange, and this will be a prep to do some other stuff fast. 587 01:16:13.020 --> 01:16:16.229 So. 588 01:16:19.409 --> 01:16:27.510 I'm skipping over some details. Well, and this is the inner loop now; it's a kernel, you do it 589 01:16:27.510 --> 01:16:39.449 in parallel, and we're using registers here. And so the executive summary of the slide set is: 590 01:16:39.449 --> 01:16:47.609 the loop we have up here has got all these things that are getting used. 591 01:16:47.609 --> 01:16:51.600 Let me go back two slides to show you what's happening. 592 01:16:52.890 --> 01:16:56.963 Okay, here, so, okay. 593 01:16:56.963 --> 01:17:10.015 We initially had m outer and n inner, and we swapped. So in this inner loop we've got various things, x sub n, y sub n, and z sub n, and they're constant inside the inner loop. 594 01:17:10.289 --> 01:17:14.399 So, we can take these constants and pull them out, 595 01:17:14.399 --> 01:17:17.760 just inside the outer loop, and put them in registers. 596 01:17:17.760 --> 01:17:25.439 This is the sort of thing that a good optimizing compiler will do. So, these next couple of slides are 597 01:17:25.439 --> 01:17:28.739 you imitating a good compiler. 598 01:17:28.739 --> 01:17:33.149 But maybe the compilers haven't got to this stage yet automatically. 599 01:17:33.149 --> 01:17:41.489 So this may be done automatically: this loop interchange and this fission, good optimizing compilers will do on 600 01:17:41.489 --> 01:17:45.720 sequential machines; they're catching up on parallel machines still. 601 01:17:45.720 --> 01:17:51.750 Okay, so we've got this loop with x sub n and y sub n and so on. 602 01:17:54.510 --> 01:17:59.520 Pull them out and put them in registers. Oh, okay. 603 01:18:00.569 --> 01:18:06.239 So, in the loop, we are working with a lot of registers.
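A toy sketch of that interchange-plus-registers idea (not the slides' actual MRI kernel; the array names are illustrative): after the interchange makes n the outer loop, x[n], y[n], and z[n] are invariant across the inner m loop, so load them into registers once per outer iteration.

// Toy sketch, assuming this loop shape; after interchange, n is outer.
for (int n = 0; n < N; n++) {
    float xn = x[n], yn = y[n], zn = z[n];  // loop-invariant: load once into registers
    float acc = out[n];                     // keep the accumulator in a register too
    for (int m = 0; m < M; m++)
        acc += xn * kx[m] + yn * ky[m] + zn * kz[m];
    out[n] = acc;                           // one store per outer iteration
}

On the GPU, the outer n loop then becomes one thread per n, which is what makes the interchange a prep step.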
Nice, nice. 604 01:18:09.659 --> 01:18:20.579 Next thing: we're trying to find data that successive threads access and put it together. 605 01:18:21.659 --> 01:18:26.039 Coalescing it, like this. 606 01:18:28.380 --> 01:18:32.760 And playing with constant memory also. So. 607 01:18:35.039 --> 01:18:40.710 So, the inner loop again; I'm just skipping over it for you. The n loop gets efficient because 608 01:18:40.710 --> 01:18:44.489 we've chunked data together and we're using registers and so on. 609 01:18:44.489 --> 01:18:50.159 The next thing is to use the hardware sine and cosine, which are not 610 01:18:50.159 --> 01:18:58.050 IEEE compliant. They're not going to be as accurate, but they may be good enough for you. So you call them, and they will be faster. 611 01:18:59.430 --> 01:19:04.350 These are specific hardware compute units in the 612 01:19:04.350 --> 01:19:11.159 streaming multiprocessors, so if everyone's trying to do trig at the same time, there'll be some waiting. 613 01:19:12.329 --> 01:19:22.500 There's only so many of these units. And unless you're feeling confident, you might want to validate your answer. 614 01:19:23.579 --> 01:19:30.689 So, thinking of validation and being conscious of it is a current theme going on. 615 01:19:32.189 --> 01:19:37.710 Some neural net published papers have results that apparently cannot be duplicated. 616 01:19:37.710 --> 01:19:46.260 Oops. So the stuff in these papers, the published results, cannot be independently validated. 617 01:19:48.329 --> 01:19:52.050 The point here is that they're confident, so they validate their stuff. 618 01:19:53.279 --> 01:19:57.989 I am confident in the stuff I do; I encourage people to validate my published stuff. 619 01:19:59.880 --> 01:20:03.779 And they're checking speedups. 620 01:20:07.109 --> 01:20:13.289 And they got some nice speedups. Okay, speedups of a couple of hundred. So. 621 01:20:14.789 --> 01:20:17.880 On various things here: 622 01:20:17.880 --> 01:20:21.270 something took 623 01:20:21.270 --> 01:20:24.569 2,700, now takes 8, and so on. 624 01:20:26.069 --> 01:20:35.970 So, it worked. Getting it on the GPU in a naive way got maybe a factor of 10 speedup. 625 01:20:35.970 --> 01:20:39.569 Getting it on the GPU in an intelligent way, 626 01:20:39.569 --> 01:20:45.060 another factor of 30 or something, 627 01:20:45.060 --> 01:20:50.909 or 12 in this case. So these techniques, in this case, actually were useful. 628 01:20:52.500 --> 01:20:57.510 Okay, good point to stop now. 629 01:20:57.510 --> 01:21:06.359 Let's review what we did today. We saw this scan, the parallel scan operation, which is 630 01:21:06.359 --> 01:21:12.029 a useful parallel paradigm, and we saw how to do it more efficiently, 631 01:21:12.029 --> 01:21:15.300 reordering the algorithm to do it more efficiently. 632 01:21:16.439 --> 01:21:25.050 We saw some hardware layout issues with buses; those slides are a touch obsolete by now, 633 01:21:25.050 --> 01:21:31.590 so I didn't spend much time on them. We saw a new type of parallelism called task 634 01:21:31.590 --> 01:21:42.810 parallelism. I mean, in your CUDA code, the C++ code running on the host, you can have several parallel tasks running together, and they're using a concept called a CUDA stream. 635 01:21:42.810 --> 01:21:52.890 And the separate CUDA streams run independently of each other, and you can synchronize inside each of them. And then we saw an NMR reconstruction example.
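A minimal sketch of that hardware-trig tradeoff; the kernel and its names are illustrative. CUDA's __sinf and __cosf intrinsics run on the special function units, and nvcc's -use_fast_math flag swaps them in for sinf and cosf wholesale.

// Illustrative kernel: fast hardware trig, reduced accuracy in the last bits.
__global__ void phase(const float *t, float *re, float *im, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        re[i] = __cosf(t[i]);   // special-function-unit cosine, not IEEE-accurate
        im[i] = __sinf(t[i]);   // special-function-unit sine
    }
}

The accurate library versions are plain sinf and cosf; the underscored intrinsics trade the last bit or two of accuracy for speed.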
636 01:21:52.890 --> 01:22:00.510 In that example, it got hundreds of times faster by doing it on the GPU and using these techniques. 637 01:22:00.510 --> 01:22:08.369 Good point to stop. You've got to go out and get lunch, I've got to go and get lunch, and see you 638 01:22:08.369 --> 01:22:14.609 Monday. I'll hang around for a minute or two. What applications does fast math do well on, and... 639 01:22:14.609 --> 01:22:17.909 Sorry, I didn't look over for a few minutes. 640 01:22:17.909 --> 01:22:24.899 Isaac: well, it's not going to be accurate in the last bit or two. So, 641 01:22:24.899 --> 01:22:31.470 if you don't care about precise accuracy, then that'll be good. So, in this thing, 642 01:22:32.880 --> 01:22:36.960 it was approximate, but they figured their data is not that accurate anyway, probably. So. 643 01:22:38.340 --> 01:22:47.819 You might do well, maybe, to run a sample run of your program with the accurate trig and then compare it to fast math. 644 01:22:47.819 --> 01:22:51.930 And then, if it works, use fast math in the future. 645 01:22:53.880 --> 01:22:58.500 Where the fast math did not work for me: 646 01:22:59.760 --> 01:23:05.399 it's that I design algorithms that just assume that the math is good. 647 01:23:05.399 --> 01:23:12.449 I mean, I know the floating point standard carefully; I just automatically incorporate it into my algorithm design. 648 01:23:12.449 --> 01:23:24.060 So, I assume round-off works the way it's supposed to, and with the fast math it does not; that's why it broke my program. 649 01:23:26.939 --> 01:23:30.810 Other stuff? You're welcome to ask. 650 01:23:30.810 --> 01:23:33.840 Anything else? Sorry if I didn't look over quickly enough; 651 01:23:33.840 --> 01:23:37.859 my second laptop is a little off to my side, actually. So, 652 01:23:39.090 --> 01:23:43.920 if not, have a good weekend and enjoy the sunny weather. 653 01:23:43.920 --> 01:23:51.539 And I'm enjoying looking at my solar panels; so far in March they've produced about 10% more than I've used. 654 01:23:51.539 --> 01:23:57.390 But it's been sunny.
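Tying off that validation suggestion: a minimal sketch of the comparison, assuming you've copied the two runs' outputs back to hypothetical host arrays accurate[] and fast[].

#include <math.h>

// Returns the largest absolute difference between the accurate-trig run
// and the fast-math run; judge from your data whether that's acceptable.
float maxAbsError(const float *accurate, const float *fast, int n) {
    float maxErr = 0.0f;
    for (int i = 0; i < n; i++)
        maxErr = fmaxf(maxErr, fabsf(fast[i] - accurate[i]));
    return maxErr;
}

If the error is within what your data's own noise allows, fast math is probably safe for that program; if your algorithm depends on IEEE round-off behaving exactly to spec, as described above, it is not.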