Okay, good afternoon, parallel computing, class 11. Before I start, if you've got any questions, anything you'd like to ask about parallel or NVIDIA programming, or the low-level recursion there... Okay, so I'm trying to record this and upload it.

I encourage you to look at the textbook, Programming Massively Parallel Processors, which is available for free on the class account. It goes through what I'm talking about in more detail, and it's very well written. It has a couple of obsolete things in it, but they updated it a couple of years ago: the book is originally about 15 years old, but they updated it again a few years ago, so it actually mentions unified memory, which is good to see, and it mentions some of the recent changes from NVIDIA.

I've mentioned unified memory before: the host and the device are in the same address space, and it will do demand paging of pages back and forth. NVIDIA got to it in several stages; doing it requires hardware support from the Linux kernel, which is what delayed them. Before they had demand paging, they had something like demand swapping of entire variables back and forth. And before that they had a unified address space without demand paging, where you could look at the upper bits of a pointer and know where the object was; so they had something where they would copy the whole object to the other side if it was needed, and then they got to paging on demand. Before that, they had something where they would lock a host variable into real memory on the host, which would make it easier for the device to get at it, because with virtual memory on the host you never know where the object is, or whether it's in real memory at all. So they have something where you can lock a variable into real memory, and I should warn you that that restricts the amount of memory you can use as well. We've got 256 gigabytes on parallel, so I'm not particularly concerned if you block off a few megabytes. Okay.

Yes, where is the... yeah, I'll show you. So this is the GPU accelerated computing teaching kit I pulled off the NVIDIA website. There are another half dozen kits there also, which I can pull over too if you want; they cover things like machine learning and so on. So remind me, so I don't forget,
or at least I'll pull down the list of them, and anything that anyone wants, I'll pull over.

[Question: after you unzip it... I just want to know more about how unified memory works.] Yeah — unified memory. Do you know how virtual memory works? No? Okay.

So, computers have varying kinds of memory. A computer's got real memory. Like this laptop here — I can't even remember, because I've got so many laptops, whether it's got 8 gigabytes or 16 gigabytes; let's find out... Actually, it has 16 gigabytes of real memory. Now, the virtual memory would be larger — well, this one, I don't actually have any virtual memory configured, but that's a separate story; I think it's an obsolete idea in some cases.

The virtual memory is the virtual address space. You have a pointer that points to a virtual address, and the space is divided into pages that I think are 4 kilobytes, so about 1000 words. When you actually read or write an address, it pages that 4-kilobyte page into real memory, and it has a translation table to give the real memory address. So the idea is that you can address more data than you can fit into the real memory. Then it needs an eviction strategy: once it's used up all the real memory, it has a policy to evict pages — to write them back to the backing store, to then page in something else. The typical eviction strategy is least recently used: it updates a marker when each page is used, and then the one that hasn't been used in a while gets written back to the backing store — which, I forgot to say, is on disk. And the backing store is typically double the size of real memory, if you're looking for a tolerable policy. On Windows, it'll be called a page file or a swap file; there are small differences between them that we won't worry about — like, if you're hibernating the system, it may copy all of real memory out to the swap file so it can power down the memory.

Okay, so virtual memory is a very nice idea, but you see, it requires some serious hardware support.
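To make the translation arithmetic concrete: with 4 KB pages, the page number is just the high bits of the virtual address and the offset is the low 12 bits. A toy illustration (my example, not from the lecture materials):

```cuda
// Toy illustration: splitting a virtual address into page number and offset,
// assuming the 4 KB pages discussed above. The OS's translation table maps
// page number -> physical frame; the offset is reused unchanged.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t PAGE_SIZE = 4096;        // 4 KB
    uint64_t vaddr = 0x7f3a12345678ULL;     // made-up virtual address

    uint64_t page   = vaddr / PAGE_SIZE;    // index into the page table
    uint64_t offset = vaddr % PAGE_SIZE;    // byte within the page

    printf("page %llu, offset %llu\n",
           (unsigned long long)page, (unsigned long long)offset);
    return 0;
}
```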
Another thing with virtual memory — it's been around for many decades — is that it's a way to handle devices. You have a device you're reading from or writing to — tape drive, keyboard, whatever — and it has addresses in virtual memory. The device is sitting on the address bus, watching the traffic, and if you want, say, to read the last key from the keyboard, you read a particular address; the keyboard is watching, and if it sees that address being read, it provides the contents of the last key that was pressed, let's say. Or you want to get status — is the keyboard ready? — you read that particular register address. It doesn't exist in real memory; it's an address, but the device is watching the address bus, and when it sees this address being read, it provides the status word. So there's a unifying idea: accesses to different devices, from a software point of view, are reads and writes to memory addresses, but there's no real word in the DRAM which corresponds to them — the peripheral is just sitting on the bus, watching. Obviously, there are a lot of implementation issues: the peripheral is going to be a lot slower — by how many factors, a thousand? — so you get timing issues, but it's a really unifying idea. It's really nice, and it's been around for decades. Then you get concepts like the working set: how much memory is a program actually using?

Now, there is a powerful concept in Linux with the virtual memory, which is that if you allocate a chunk of virtual memory — say a gigabyte, 10 gigabytes, whatever — Linux doesn't reserve the space on disk until you actually touch a page. So if you read or write an address — bang, Linux allocates 4K bytes in the page file. Oh, where it is in Linux: the backing store could be a separate partition, or it could be just a file. You can say: here's a 10-gigabyte file, I'll declare it to be a swap file, and now it pages back to that; or it could be a whole separate partition, or a whole separate disk. I think if you use a file, it may do some tricks to make it more efficient, like figuring out which physical blocks on disk the file uses and going straight to the disk. Okay.

So, Linux has a really intelligent way to do virtual memory, which people don't understand and don't exploit. You can allocate a humongous chunk of virtual memory, but until you actually use it, it doesn't physically use up blocks on the disk, and that's cool. So, for example, you can reserve a gigabyte for a stack, for your recursively called local functions. That's a lot of pages, but unless you actually do a lot of recursive calls, it's not actually used.
So there's no downside to reserving a very large virtual range. As I mentioned at the start of the class, if you've got a lot of separate threads, each thread might reserve, say, a gigabyte of stack for local variables. You know, when you're calling a function in C++ or C, a local variable, unless you're doing something weird, gets put on the stack. A new stack frame gets allocated when you call a new function, and the function's local variables get put on the stack; you call another one, and its frame gets put on the stack; and then when you return, they all just vanish — the stack pointer, which points at the end of the stack, just gets reset. So you don't know how many local variables you're going to have; perhaps you allocate a gigabyte of stack for each thread — no problem, it doesn't use any resources unless you use it, and that's a really cool idea. So in general, the extra level of indirection — you have a virtual address, and it goes to a real address — that extra level of indirection gives you a lot of power. Okay, so that's virtual memory. Currently, I think, when you have humongous amounts of real memory, the time it takes to page virtual objects into real space means virtual memory is not as useful.

In any case — NVIDIA now. The unified address space is 49 bits, I believe; pointers are 64 bits, so they ignore the top 15. The hardware looks at 49 bits, and the top couple of bits say where the data lives. In any case, they page between host and device — host meaning Intel Xeons; I think it wants a Xeon, I don't think it works with consumer-grade Intel, but you can correct me. The Xeons have various enterprise-type features, like better support for virtual machines, for example. So eventually the pages get paged back and forth. Now, you can imagine that if the host and the device try to write the same page at the same time, you have some consistency issues — and honestly, I don't know how NVIDIA resolves those. Maybe it's a bad idea to do it.

So now, here: if you want a variable to get paged, it has to be on the global heap, which programs have — you know, you do a malloc, or you use a vector, and it grabs space off the global heap. The heap has an internal data structure which keeps track of what's used and what's free — free lists and stuff, I think.
And if it's on the global heap, it can be unified memory. There are at least three different places you can store variables in Linux — I've got a blurb on that; I don't know if I've shared it with you yet. They can be on the global heap — that's where vector allocations and mallocs come from. They can be local variables that go on the stack. Or they can be global variables whose space is reserved at compile time: if you have a fixed-size array that's a global variable in your C++ program, it's allocated at a fixed address at compile time, and so it's part of the executable file. Now, if you don't initialize it, it doesn't take any space in the executable file, but that's what it is. So those are the different places variables go.

Which type you should use — what's the best type — changes from time to time. At the moment, the best one is the heap, but then you've got problems if you put millions of small items on the heap: that will slow things down, because it has to keep track of them. There are solutions to that. Also, the heap is not multithreaded very efficiently, so if you've got different threads, each of them mallocing stuff on the heap, that will really slow your program down. So that's an argument for putting things on a local stack. Or there's another thing — there's something in C++ that not many people realize. In C++ you construct a new object of type Foo: it grabs memory on the heap, and it runs the constructor function — if it's a non-default constructor — to build that object in the new memory it grabbed, and then it returns you the pointer. Well, the "grab the memory" and the "construct the new object" can be split into two separate steps, if you want. So you can, for example, grab the memory however you want — you can have your own memory allocator if you want. You could allocate a 10-gigabyte array and do your own sub-allocation inside that 10-gigabyte array, cutting off pieces as you need them. So you could have your own memory management, and then what you do in C++ is: you do your own memory management to reserve the actual space, and then you use something called a placement new — it's been in C++ a while — and a placement new constructs the new object, but you give it the memory; it doesn't grab the chunk of memory itself. So you can separate the memory allocation from the new-object construction, if you wish.
So if you've got lots of small objects, you allocate an array of small objects — that's one operation. So, memory management things — and why this is interesting to you is that, for a parallel program, it can have a big effect on efficiency. I've observed it with some of my research in parallel geometry: even on a Xeon, even with just a few dozen threads, the built-in heap construction routines noticeably slow things down, because the essential parts of them are mostly single-threaded. There are other things you can do: there are other versions of malloc — Google has one, for example — that are more efficient than the standard C++ one, and they're drop-in replacements. You can actually provide a different library at run time; it doesn't even involve recompiling the program, because at run time you can specify a library for dynamic loading of shared objects, and you specify Google's instead of the default one, and bang — all your memory stuff gets twice as fast, or something.

So, more than you wanted to learn about memory. Other memory questions?

I don't know how much the computer science operating systems courses talk about real-world operating systems and file systems and so on, because sometimes the academics lag behind the real world. I mean, such as: parallel has a quarter terabyte of RAM — what are the implications of that for your operating system? You think that's a lot? My big laptop at home — it's a laptop — has 128 gigabytes on it. It's a laptop that weighs 8 pounds, but it does kind of everything. And file systems: if you've got file systems on God-knows-what petabyte-size distributed disks, what does that imply for the file system? Or you're doing communication — you've got an Internet-style thing, but one of your nodes is maybe on Mars, with a 15-minute latency. What does that imply for your network algorithms? If you're doing round-trip error checking and stalling, it takes 15 minutes — so for parity codes or whatever, this has implications for what the efficient networking strategies are.

Another example: databases. In database classes, there's this ACID paradigm or something, and they teach normal forms, stuff like that. The people that do the big databases — that would be Amazon, Google, whatever — their databases, I think, don't follow that. They do their own, partly because
they don't follow the theory, and partly because they found that not doing it the theoretically nice way actually gives better performance. Like, Google may not require their database to be completely globally consistent, if it's a map or something. Okay. Other questions?

So, we're doing parallel computing, including NVIDIA as an example. I'd like to teach specifics and figure you'll induce the general principles, somewhat. I'm teaching from their slides instead of writing stuff down. I know people like professors to write stuff down in class, but it seems illogical, because I'd be learning the information from the slides and writing it down for you — I figure I'll show you the slides directly; it's like a game of telephone, and this way has fewer errors. And you can read the slides. So, you know, if you want, I can write stuff down. But also there's the technology I'm using: because I'm broadcasting the class for people that can't be here, it's hard to do flexible broadcasting together with handwriting — it involves linking together multiple laptops. Okay, so in any case, back to the material.

Okay, so we're up around module 9 now. What we're seeing now are paradigms. A paradigm is a widely used way to do something. So these are techniques for parallel programming that have turned out, from experience, to be useful. Some of them were not thought of in advance; they were just developed to solve problems. And I'll be running through the slides quickly where they don't deserve a lot of time — they have variable information density.

Well, let's see... sharing the screen — and theoretically, that is working. And let me make sure I'm recording — yeah, I'm recording. Good. Okay. If any of you are watching me remotely: if I have problems, let me know, because I cannot always determine what is happening. I actually have two other people — oh, okay, good. So, as I said, module 9. And again, I've just created a virtual file system that I unzip the modules into, and so on.

Okay, so the first thing is something called reduction, which is, um, a partial sum — you scan your way down an array and scrunch it down. The reason this is worth spending time on is that you'd be surprised at the things you can do with a parallel reduction.
It's used, for example, in things like Thrust, which we'll talk about later — Thrust is a parallel version of the STL — and in map-reduce, for example, which is a widely used paradigm; Google uses it a lot. So in map-reduce, your data structure is the vector. The map is: you apply a function to each element of the vector. That obviously can be done in parallel, because the applications of the function to different elements of the vector don't affect each other. And then the reduction: you take that vector and you scrunch it down to a shorter vector, or to one number. The obvious reduction is that you sum the vector, but there are more powerful reduction operations, and that's the point. So the first question is how to do it fast, and the second question is what it's useful for. And there are partial reductions: you cut the long vector into pieces and reduce each piece, or something.

I mentioned Google — now, some people think that Google invented map-reduce. They did not, and they don't pretend that they did, actually. The first commercial language that I can think of with map and reduce goes back to the 1960s or so. Some of these ideas have been around a long time. Virtual memory is old — one thing about virtual memory, industrially: it's been around 50 years or more. IBM has used virtual memory in commercial systems since the seventies, I think — '70 or '80. And IBM has used virtual machines — you know about things like VMware — IBM has used commercial virtual machines for 50 years, because their hardware, I think accidentally, supports it. You need hardware support to do this efficiently, and the IBM hardware has had this support for decades; I read that it was accidental, they didn't plan it. So they've had commercial systems where you have the bare-metal hardware running separate virtual machines, and that's been true for IBM for a very long time. Okay, in any case: there was an IBM language called APL with map and reduce — I don't know, 1970, give or take. Okay. And Hadoop is a popular distributed thing; Hadoop by itself, I think, is obsolete by now — it's not my specific area, but there are things that have replaced it.

So from this slide — you'll understand more later, but reduction is a fundamental technique that you can do other things with. So, again: you take an array and you reduce it to the maximum, or the minimum, or the sum, or something.
In mathematical terms, these arithmetic operations are part of what abstract algebra calls a field. A field is an abstract way to model things like the rational numbers or the real numbers, which have operations where, for example, the order you do an addition in doesn't matter: a + b = b + a, and so on. So the concept of a field is an abstraction for these operations that the rational numbers, the real numbers, the complex numbers, and so on can do. Now, as a side point: of course, floating-point numbers don't satisfy these rules — floating-point arithmetic is not associative. On those machines, a + b + c depends on where you put the parentheses. But we'll forget about that part for now.

Okay. So, to do this in parallel: the obvious thing, if you want to sum this array in parallel, is to sum pair by pair, and then pair up the results, and pair up again. If the array is n elements, it takes log base 2 of n steps; if it's 8 elements, it takes 3 steps, and so on. That's the obvious way to do it in parallel. Now, one thing: whenever I say something is obvious, most of the time I follow it by saying "but wrong" — or, it works, but it's not efficient. It's like a tournament. You can do a quick analysis: it takes about 2n operations total. But okay, the next few slide sets will show you more efficient ways to do it.

Some things we're worried about — I'll come back to this slide. So, we're talking CUDA, and threads are grouped into warps: 32 threads make a warp, and the threads in the warp are SIMD — they're synchronized in lock step. So that hierarchical tree I showed you: how would you implement that on a machine that has warps of 32 threads? That's the problem we want to think about. So there's this control divergence concept: the 32 threads in the warp have to do the same instruction, or be turned off — that's a separate thing NVIDIA has. So it's SIMD within the warp, because it's synchronous. NVIDIA also has a concept of single program, multiple data — SPMD — instead of SIMD, and that expresses the idea that the different threads in the thread block are following the same program, but they're not synchronized in lock step, because the different warps are scheduled independently of each other. The different warps of threads in the thread block — if you paid a lot of money for expensive hardware, they may run at the same time; if you didn't pay much money, they may run one after the other. So the different warps are running the same program, but at different times.
Anyway — you synchronize them with sync threads. Okay, but we've got this divergence thing. So the first question up there is: how do we assign operations to the threads? I'm going to hold that till we see it. Well, one other thing they mention up here is that I only have so many threads in a block — perhaps 2000, whatever; it depends.

Okay, so here's the way we talked about before. You're summing that long array: thread 0 sums 2 elements, thread 1 the next 2 elements, and so on. Then after the first step, your subtotals are in every 2nd element. Then you sum those, and after the 2nd step the subtotals are in every 4th element, then every 8th element.

Okay, so what are the issues with this? First, if you've got more than a certain number of elements: this uses one thread per 2 elements, so suppose you can have 2000 threads in the block — that depends on the implementation — then your vector could not be more than 4,000 long, because you need a thread for every 2 elements. More than 4,000 elements needs more than 2000 threads, and more than 2000 threads won't fit in a thread block. That's your first issue.

The second issue is, again, this warp idea. At the first step here, fine, all the threads are in use. Well — there's one more issue here; let me just go back to the start. I've mentioned a couple of times that you'd like to have data coalesced: you'd like adjacent threads to access adjacent elements of the global memory. But here, adjacent threads, in the first step, are accessing elements that are 8 bytes apart, not 4 bytes apart — each variable is 4 bytes. So right at the start, you're not optimizing your coalescing. That's one issue there.

Then we move down to the next stage, and at the next stage you've got a bigger stride, and it's even less efficient with coalescing, and so on. And when we get down to, say, step 2, we're not even using all the threads. In step 1, you're using only the even-numbered threads; in step 2, you're using every 4th thread, perhaps — you see, thread 0 will add 11 and 14 to make 25. So as you go down the tree, you're using a smaller and smaller fraction of the threads. So you're really using
the parallel hardware less and less efficiently. Now, you know, it's not the be-all and end-all to use every CUDA core all the time, but still: you look at this thing — this process is log n levels deep, and basically, for most of the computation of the reduction of this array, most of the threads are not being used. But they're in the middle of a warp, so you can't use them for something else. So in this naive way, in the first step each thread adds two adjacent items; then after that, half the threads add items that are one step apart, and a quarter of the threads add items that are four steps apart, and so on. The naive, obvious thing has horrible thread divergence as you go down the tree, and it also doesn't use the global memory efficiently. Those are the two big problems with this simple way. So, what are we going to do about it? You see the issue here.

And again, the problems we're talking about are general issues; it's not just this one specific thing. This is one specific program — who cares about it — but these are general ideas, and that's why we spend time on it. Okay. So, problems with the simple naive method — here it is in words, what I just told you: after each step, half of the remaining threads are not needed, and one of the inputs comes from farther and farther away. Farther and farther away means you're going to have latency — it's not adjacent; they say the latency could be a thousand cycles. Okay.

I don't have to look at the code unless you want — this is how you would program it. They're using shared memory here, which is nice. Again, it's something like 48K bytes, but if we're talking, say, 2000 elements at 4 bytes each, that's 8,000 bytes; so all the threads in the block could load that vector into fast shared memory, and then the 2000 threads start working on it, and there's much less latency at that point. Okay — reduction steps: they've got sync threads, and I won't go through the code line by line; you can read it. They've got multiple sync threads in it. You've got those multiple stages, and you have to synchronize before you do the next stage of the reduction, because, again, the different threads may be running one after the other, if there are more threads than there is hardware. So you have to sync — and if you don't sync, well, you know, see what happens.
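Putting that in code: here's roughly what the naive kernel looks like — my reconstruction from the description, not the slides' exact code. partialSum and BLOCK_SIZE are assumed names, and the block is assumed to have BLOCK_SIZE threads handling 2*BLOCK_SIZE elements:

```cuda
#define BLOCK_SIZE 1024

__global__ void naiveReduce(const float *input, float *output, int n) {
    __shared__ float partialSum[2 * BLOCK_SIZE];
    unsigned t = threadIdx.x;
    unsigned start = 2 * blockIdx.x * blockDim.x;

    // Each block loads its 2*blockDim.x elements into fast shared memory.
    partialSum[t] = (start + t < n) ? input[start + t] : 0.0f;
    partialSum[blockDim.x + t] =
        (start + blockDim.x + t < n) ? input[start + blockDim.x + t] : 0.0f;

    // Stride doubles each step: 1, 2, 4, ... The surviving threads scatter
    // across the warps, which is exactly the divergence problem above.
    for (unsigned stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();                       // previous step must finish
        if (t % stride == 0)
            partialSum[2 * t] += partialSum[2 * t + stride];
    }
    if (t == 0) output[blockIdx.x] = partialSum[0];   // one subtotal per block
}
```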
Okay. What they're talking about here is — I said you could maybe have only perhaps 4,000 elements in a thread block, if a thread block can have 2000 threads. So what if you've got more than 4,000 elements? Well, then you have multiple thread blocks — you could have hundreds of thread blocks — and each thread block takes 4,000 elements, and each thread block operates independently of the other thread blocks. I mean, they can all read from global memory; read orders don't matter. So each thread block processes its 4,000 elements and reduces them to a sum. And again, the thread blocks are all independent of each other; they don't talk to each other — you can't even sync them. But finally, all the thread blocks have finished, and each thread block has got a subtotal. Now you bring them all together, and that's what they talk about down here. So you're going to have to do some sort of synchronization to know that all the thread blocks are done — but it's only once, and there are different ways you could do it. By the way, if you're wondering about the overhead to move data to the device, do some work, and bring it back: even with that vector addition, I think, on parallel, it speeds things up, perhaps, to move it to the GPU. So okay, that was number 2.

Okay, we want to do better. We need to reduce control divergence and so on. Let me get past this and go to the picture. So here's a better way to do it — you can read the words; I'll lecture from the picture. What we saw before, in the previous picture, was that thread 0 added elements 0 and 1, thread 1 added elements 2 and 3, and so on, and as we went down the diagram, there was a bigger and bigger stride. Here is a new way, which is better: thread 0 adds elements 0 and 4, thread 1 adds elements 1 and 5, thread 2 adds elements 2 and 6, and thread 3 adds elements 3 and 7. So we start out with the threads having a big stride of 4 between the two elements that they add, and the stride gets smaller as we go; at the beginning, with the larger stride, there are more threads active. So why is this better? In the old way, the threads that are still executing end up scattered throughout the warps, whereas with this, you see, the threads that are still active are all together.
So they're all in one warp — I guess that's the big reason this is better. You see here, after the first step, half the threads no longer have anything to do, but the half that are active are all packed together, and the half that are inactive are all packed together. So the inactive ones can be doing something else — I guess that's the big reason. [Student: but couldn't it be worse?] You know what could be wrong... yeah, but that's the big thing, I think. [Student: the number of active threads gets smaller by half every time, at the beginning and at every level, in both versions.] Yeah — so these threads here, if they can do something else — well, the resources that would be dedicated to those threads can do something else. So I guess that's it.

But you're packing — yeah, and the other thing is, if we look over time, the remaining active data is packed closer and closer together. So converging is better than diverging — when it's packed closer together, there's less of it. Let me go back to the previous thing here, which I have somewhere... Yeah, so if we do this, then — I don't know, it might take a more detailed analysis why, and who knows, they could be wrong, but I think they're right on that. [Student question about creating gaps.] Well, right at the end you could start creating gaps, but in the old way you've got all these gaps between active threads, and that's what you don't like. And they would claim even the first row of the old way is not incredibly efficient, because the hardware likes having adjacent threads reading data that's 4 bytes apart, not data that's 8 bytes apart. Yeah — that would be the issue, and if we go back to the other one: further down, the data's not even 8 bytes apart, but a large number of bytes apart.

Well, here, they could say the coalescing is actually better. Look at this: each thread reads 2 words from memory — which, if you can put it in shared memory, great; if not, it's global memory. So each thread has 2 arguments at that first stage: for all the threads, the first arguments are adjacent, and for all the threads, the second arguments are adjacent.
So it actually is fully coalesced, you see: for all the threads, the second arguments are adjacent, and also the first arguments. In that sense, if you have to go to global memory, you're reading chunks of 128 bytes, and you've basically got two sets of active chunks. Yeah — and I don't know the details of how the cache works, and it changes with the generation, so I don't know what the size of a cache line is. But right — so that's the argument for this over the other one.

Okay, but there is a deeper point underneath what you're asking, which is that there's an optimum amount of optimization. You can spend too much of your life trying to micro-optimize a program, and it's not worth it, for two reasons. There's an optimum: you can do too little — I see people doing n-squared algorithms when they could be doing n log n — but you can also do too much, for two problems. One: your time may be worth more than the machine time. The GPU on parallel was worth 2000 dollars or so, and it's got a lifetime of 5 years or something — say 50,000 hours. So the capital cost of that GPU on parallel — and it was about the best I could buy two years ago — is maybe 10 cents an hour. And as validation, you can rent one on Amazon for — I can't remember — less than a dollar an hour. So how much money are you going to make when you get out of here? You see the trade-off. Okay.

The second point is that as time goes on and NVIDIA updates the hardware, what was a good optimization in the past may hurt you tomorrow. So, the Maxwell generation had very few double-precision cores; you would want to rewrite your code to use single precision, perhaps, if you cared. That's the stuff I had to do as a student — in fact, the double-precision hardware did not exist on the machines I used, and there are a lot of cool techniques to do double-precision work with single-precision hardware. They're obsolete now. So you don't necessarily want to spend too much time over-optimizing, because of next year's hardware — like, they embed cache. And some of the ideas you read about — all this stuff with the CUDA copy functions, basically: obviously, if you've got unified memory with paging, a lot of that stuff people spent time on is obsolete now. It's creative disruption. Okay, so the question is, why am I spending time here?
Apart from the fact that I learned it, and want a way to amortize the time I spent learning it? No. It's a powerful idea; it's worth learning the idea, I think, about how you have to concentrate things together. That's why we spend time on it. Okay. I'll skip the code.

And they talk about, say, 1024 threads in the original block — which is more obvious than my 2000, I think. Then for the first several stages, there's no divergence. For the last 5 stages, there'll be some divergence, but when you have the divergence, you've got a lot fewer warps — you've got divergence over a smaller number of threads. So, not bad. So that was module 9, telling you about a different way you can do the parallel reduction that has better coalescing and less divergence. Okay. Let's see some more of this.

Okay. Now, by the way, one thing for the reduction that the slides did not mention: instead of just reducing the whole array down to one element, you can put fence posts in the array — it'd be like a stencil vector, another vector that puts fence posts in the array — and you do the reduction, but you don't reduce over a fence post. So the fence posts cut the bigger array into multiple little pieces of varying length, and you reduce each little piece in parallel, still taking log n time. And this is a way, for example, to do run-length encoding: you have this big array and you want to run-length encode it, so you put a fence post wherever it changes from 0 to 1 or back again — and that can be computed in parallel — and then you do the parallel reduction, and the output of that is the length of every run and where it started, perhaps. So that's an example of where the reduction is more powerful than you might think: you're doing a parallel run-length computation.

Okay. What we're seeing here is another very common operation in parallel computing: the scan, also called a prefix sum. Again, it's the basis for a surprising number of things — such as, you have a list of run lengths, and the scan operation will tell you where each run starts. Or if I've got a list of buckets of varying sizes, because I'm doing a hash — sort of like using a grid, where each grid element has a different number of items in it — and I want to lay out all the elements sequentially, a scan will tell me where each bucket starts. And it does it in parallel. Okay.
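To make the fence-post idea concrete, here's a tiny sequential illustration of mine (made-up data): mark a fence post wherever the value changes, and a scan of the flags numbers the runs — and both the flagging and the scan parallelize:

```cuda
#include <cstdio>

int main() {
    int a[12] = {0,0,0,1,1,0,0,0,0,1,1,1};
    int flag[12], runid[12];

    flag[0] = 1;                           // a run starts at element 0
    for (int i = 1; i < 12; i++)
        flag[i] = (a[i] != a[i-1]);        // fence posts (each independent)

    runid[0] = flag[0];
    for (int i = 1; i < 12; i++)
        runid[i] = runid[i-1] + flag[i];   // inclusive scan of the flags

    for (int i = 0; i < 12; i++) printf("%d ", runid[i]);
    printf("\n");   // 1 1 1 2 2 3 3 3 3 4 4 4: each element's run number
    return 0;
}
```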
Let me demonstrate — okay, let's go by the figures down here. This shows you these types of scans, like the STL's. The input is this vector: 3, 1, 7, 0, 4, 1, 6, 3. In the output, the i-th element is the sum of the first i input elements: 3; then 4 is 3+1; 11 is 3+1+7; and 25, at the end, is the sum of everything — it includes the whole scan. Okay, and obviously you can do it sequentially, going down the input vector. What you might want to think about is: if you already have the input vector on the GPU, imagine how you might do the scan in parallel, taking log n time instead of linear time. Think about how you might do that: you'd have some sort of tree you'd build up — you scan little chunks, and you combine the chunks and propagate, and in log n time you've done your scan on the whole vector. In any case, okay.

So one application would be, like I said: if you've done a run-length encoding and these are the lengths of the runs — the first run is 3 elements, the second run is 1 element, the third run is 7 elements, the fourth run doesn't exist — the scan will tell you where each run starts after you've expanded the run-length encoding. Okay. That's called an inclusive scan. There are lots of examples. So how are you going to do it? There are lots of applications — you'd be surprised what applications you can use it for. Radix sort is a nice one — I won't spend time on this, but radix sort is in many cases much faster than quicksort. Quicksort is the favorite of CS courses because it's recursive, and computer scientists love recursion; radix sort does not have recursion, but radix sort is as fast. Okay. Things like that.

Ignore the applications; ignore this, ignore this... Sequentially, it's obvious how you would do it — I won't spend time on this: you compute the i-th subtotal from the previous one, but that's not faster than sequential. Okay, so the content of this slide set was to introduce you to the scan operation. That was number 1 — it took them 10 slides to introduce the scan operation.
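For reference, the sequential inclusive scan of the slide's example is just one loop:

```cuda
#include <cstdio>

int main() {
    int x[8] = {3, 1, 7, 0, 4, 1, 6, 3};   // the slide's input
    int y[8];
    y[0] = x[0];
    for (int i = 1; i < 8; i++)
        y[i] = y[i-1] + x[i];              // y[i] = sum of x[0..i]
    for (int i = 0; i < 8; i++) printf("%d ", y[i]);
    printf("\n");                          // 3 4 11 11 15 16 22 25
    return 0;
}
```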
Okay. Well, what we're getting to here is a step on the way to doing it in parallel. Here's our input, called XY for some reason. And what we do is: in parallel, every thread adds its element to the previous element — except for thread 0. So thread 0 adds 3 to nothing; thread 1 adds 3 and 1; thread 2 adds x1, which is 1, and 7; thread 3 adds 7 and 0; then 0 and 4, 4 and 1, 1 and 6, and 6 and 3. So every thread did an addition. Okay. Now, we repeat this log n times and we'll have all the subtotals. So that's step 0, or 1, depending on how you're counting, of this parallel prefix sum — prefix sum and scan are synonyms; it depends what you want to call it. Okay. So the first step takes constant time: we've just added pairs of adjacent elements.

Then we do it again, except with the next stride. So, counting from 1: in stride 1 we added adjacent pairs; in stride 2 we add elements that are two apart — that's what the stride is, the stride is the size of the jump. So, stride 2: the first thread does nothing, it just propagates; the second thread does nothing; the third thread adds its own element to the element 2 earlier. The last thread adds its own element to the element 2 earlier — 9+5 equals 14; and this thread: 4 plus the element 2 earlier, which is 8, equals 12. So what we've done is, we now have the first 2 elements of the output. We've lost a little coalescing here — and we're going to lose more and more — but we do have full thread efficiency: all the threads are doing the same thing, those first two aside, and if we can use shared memory, we're cool.

Next time, stride 4: each thread adds its own element to the element 4 back — except that if 4 elements back would fall off the end of the array, it does nothing. So this element does nothing, and this one — 4 elements back, there's nothing — nothing, nothing; then this one adds the element 4 back: the 15 is 12+3, the 16 is 12+4, the 22 is 11+11, and the 25 is 14+11 — and now we've got the next stage. And you can guess: then we'd have stride 8, and so on. And after this step, the first 4 elements of the output have been computed. So eventually these first few threads are going to be idle, but it just takes log n steps, and the thing is, the threads are mostly being used efficiently. We don't have the divergence problem here.
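In code, that algorithm (it's the Kogge-Stone scan) comes out something like this for one block — my sketch following the slides' structure, not their exact code. XY is the slides' array name; SECTION_SIZE is assumed to equal the block size; and the two sync points per step are what the next slides discuss:

```cuda
#define SECTION_SIZE 1024

__global__ void koggeStoneScan(const float *X, float *Y, int n) {
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;

    // Stride doubles each step: 1, 2, 4, ... log2(blockDim.x) steps in all.
    for (unsigned stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                      // previous step must finish
        float temp = 0.0f;
        if (threadIdx.x >= stride)
            temp = XY[threadIdx.x - stride];  // read before anyone overwrites
        __syncthreads();
        if (threadIdx.x >= stride)
            XY[threadIdx.x] += temp;          // add the element `stride` back
    }
    if (i < n) Y[i] = XY[threadIdx.x];
}
```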
547 00:53:50.760 --> 00:53:54.630 the two reads are nicely coalesced, you see. 548 00:53:54.630 --> 00:53:58.289 So each thread here takes 2 arguments: 549 00:53:58.289 --> 00:54:04.050 for each thread, the 1st argument is its corresponding element, and the 2nd 550 00:54:04.050 --> 00:54:08.550 argument is 4 elements to the left, for each thread, so it coalesces 551 00:54:08.550 --> 00:54:15.210 nicely. So we have coalescing of the input data, 552 00:54:16.260 --> 00:54:21.869 if it matters: if we're reading from shared memory it doesn't matter, but if we're reading from global memory it matters. 553 00:54:21.869 --> 00:54:28.199 And the threads are being used efficiently, because the active threads are adjacent. 554 00:54:28.199 --> 00:54:33.869 As we go on, more and more threads are inactive, but the inactive ones are adjacent too. 555 00:54:33.869 --> 00:54:37.920 So we're using threads efficiently, and we're using 556 00:54:37.920 --> 00:54:42.599 contiguous memory efficiently. 557 00:54:43.800 --> 00:54:47.519 Um. 559 00:54:52.800 --> 00:54:57.000 Okay, here's a big point here; let me go back. 560 00:54:57.000 --> 00:55:01.829 So, this thing, um, I didn't mention yet: 561 00:55:03.659 --> 00:55:12.659 in the 1st step each thread adds a pair of elements, and then in the 2nd step each thread adds elements again, but the inputs 562 00:55:12.659 --> 00:55:19.320 to the 2nd step are outputs from the 1st step. So you've got to do a __syncthreads() here at every step, 563 00:55:19.320 --> 00:55:26.730 because you don't know that all of the threads in a line here are running at the same time; because again, you have more threads 564 00:55:26.730 --> 00:55:31.469 than CUDA cores: you've got a bigger array than the hardware. 565 00:55:31.469 --> 00:55:38.489 The threads on a nice horizontal line in the diagram might be running one after another; the different warps, again, might run 566 00:55:38.489 --> 00:55:41.820 one after another. So you need a __syncthreads() every step. 567 00:55:42.960 --> 00:55:47.039 And that's what this slide talks about: the synchronization. 568 00:55:47.039 --> 00:55:53.340 So, I'll ignore this. 569 00:55:53.340 --> 00:55:58.829 Um. 570 00:56:01.260 --> 00:56:06.059 So, what's their complaint here? Let me go back to the figure. 571 00:56:08.250 --> 00:56:13.110 The complaint is that if the 572 00:56:13.110 --> 00:56:21.929 vector is n elements long, it takes log n stages: log to the base 2 of n. 573 00:56:21.929 --> 00:56:27.150 But the total number of additions is about n log n, because 574 00:56:27.150 --> 00:56:32.519 at the k-th stage you're reducing the number of additions only by 575 00:56:32.519 --> 00:56:41.010 2 to the k. So the total number of additions here 576 00:56:41.010 --> 00:56:51.449 is n-1 plus n-2 plus n-4 plus n-8, and that sums up to 577 00:56:51.449 --> 00:56:58.889 about n log n minus n if you subtract out the idle threads; and if you keep counting the idle threads, 578 00:56:58.889 --> 00:57:02.099 it's n log n work. And they 579 00:57:02.099 --> 00:57:05.130 want to see if they can do better. But in any case, 580 00:57:05.130 --> 00:57:08.519 the contribution of this slide set here 581 00:57:08.519 --> 00:57:12.719 is that we do this parallel scan in log n stages.
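Putting those pieces together, stride doubling plus a __syncthreads() between steps, here is a sketch of the naive one-block scan kernel. It is my reconstruction, not the slides' code, and it assumes the whole vector fits in one block's shared memory (n at most 1024, one thread per element):

```cuda
// Naive stride-doubling (Hillis-Steele style) inclusive scan for ONE block.
// A sketch under the assumptions above; the kernel name is mine.
__global__ void scan_naive(const float *in, float *out, int n) {
    __shared__ float buf[1024];
    int i = threadIdx.x;
    if (i < n) buf[i] = in[i];      // stage the input in shared memory
    __syncthreads();

    for (int stride = 1; stride < n; stride *= 2) {
        float left = 0.0f;
        if (i >= stride && i < n)
            left = buf[i - stride]; // read BEFORE anyone overwrites it
        __syncthreads();
        if (i >= stride && i < n)
            buf[i] += left;         // add the element `stride` to the left
        __syncthreads();            // finish this step before the next stride
    }
    if (i < n) out[i] = buf[i];
}
```

Note the two __syncthreads() per step: one so every thread reads its left neighbor before it gets overwritten, and one so the whole step finishes before the next stride begins, which is exactly the synchronization point made above.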
582 00:57:12.719 --> 00:57:17.250 And so in time log n, using n threads at each step, we've 583 00:57:17.250 --> 00:57:22.110 done the scan. Okay. 584 00:57:22.110 --> 00:57:25.739 That was the point here. And, 585 00:57:25.739 --> 00:57:31.440 questions? This was number 2. 586 00:57:32.460 --> 00:57:39.449 How can we do it better? 587 00:57:39.449 --> 00:57:42.719 Okay, um. 588 00:57:44.219 --> 00:57:47.969 So, what we're going to do now is get trickier and 589 00:57:47.969 --> 00:57:51.090 more efficient, 590 00:57:51.090 --> 00:57:54.329 but our algorithm gets weird. 591 00:57:55.590 --> 00:58:01.650 And, um, 592 00:58:04.139 --> 00:58:08.760 you can read this; I'll 593 00:58:14.605 --> 00:58:23.844 take it a little quickly. Um, it's a little weird. 594 00:58:23.875 --> 00:58:30.625 I'm going to skip over some details, but: they sum some adjacent things, and then some things that are 2 apart, and some things that are 4 apart, 595 00:58:30.900 --> 00:58:34.380 summing toward the rightmost element. At the end, in 596 00:58:34.380 --> 00:58:40.619 log n stages, they've got the grand total, um, 597 00:58:43.050 --> 00:58:48.840 but they don't necessarily have all the subtotals yet. That's the issue here. 598 00:58:48.840 --> 00:58:52.019 So, this thing now: 599 00:58:52.019 --> 00:58:55.320 this is only half the total algorithm. 600 00:58:55.320 --> 00:59:01.710 They don't have all of the scan subtotals; they've got weird little partial subtotals, 601 00:59:01.710 --> 00:59:07.260 like the sum of elements 2 and 3, the sum of 4 and 5, the sum of 4 through 7, and so on. 602 00:59:07.260 --> 00:59:13.019 Then we sum some of these weird little subtotals and get what we want. 603 00:59:13.019 --> 00:59:26.400 It's a 2-stage operation. Um, the argument in favor of it is that it does fewer additions; it will use a smaller total amount of work. 604 00:59:26.400 --> 00:59:34.230 So, you see, the thing with this 1st stage is we don't have as many threads being active. 605 00:59:34.230 --> 00:59:41.219 Okay, um, at each stage we have 606 00:59:41.219 --> 00:59:51.360 basically half as many threads being active as at the previous stage. So if you ignore thread divergence for the moment (we'll get back to it), 607 00:59:51.360 --> 00:59:57.420 this 1st stage here, which is half the total algorithm, does a lot fewer additions. 608 00:59:57.420 --> 01:00:00.900 Oh, 609 01:00:00.900 --> 01:00:07.590 and, um, 610 01:00:07.590 --> 01:00:15.750 I'll skip through it somewhat, and then we do another weird multi-stage thing; I'll just wave my hands. Okay, if you want, 611 01:00:15.750 --> 01:00:22.050 this slide puts all the pieces together. Um, this is the 1st stage up here: 612 01:00:22.050 --> 01:00:25.110 we did this thing where we 613 01:00:25.110 --> 01:00:30.059 sum some of the data elements and some we didn't, you see. 614 01:00:30.059 --> 01:00:36.480 Before, every thread was summing elements 2 apart, then every thread was summing elements 4 apart, and so on. 615 01:00:36.480 --> 01:00:45.900 Here, only the even-numbered threads sum elements 2 apart, and only the threads that are multiples of 4 sum elements 4 apart, and so on. So in the 1st stage 616 01:00:45.900 --> 01:00:49.079 we have fewer total additions, 617 01:00:49.079 --> 01:00:54.750 which will be good if we find a way to handle the thread divergence, which will be next. 618 01:00:54.750 --> 01:01:00.690 That's the 1st half; then in the 2nd half we do another bit of propagation where, again,
619 01:01:00.690 --> 01:01:04.409 there are fewer threads doing additions. So, 620 01:01:04.409 --> 01:01:07.619 at the bottom of this, we've done a parallel scan. 621 01:01:07.619 --> 01:01:15.239 And although I skipped over the details, we've done the parallel scan with a lot less total work, 622 01:01:15.239 --> 01:01:18.989 if we can find a way to pack it into a small number of 623 01:01:18.989 --> 01:01:25.320 contiguous threads, a small number of warps. So this is an efficient way. The broader lesson here is 624 01:01:26.519 --> 01:01:31.500 how you assign the work to threads, for example so that reads coalesce, 625 01:01:31.500 --> 01:01:34.769 so that the active threads are in the same warps, 626 01:01:34.769 --> 01:01:41.730 and so that memory is still accessed contiguously, 627 01:01:41.730 --> 01:01:46.230 the coalesced way; so that also works. Um, 628 01:01:46.230 --> 01:01:49.619 there are lots of ways to map work onto threads. 629 01:01:51.030 --> 01:01:56.070 So, okay, this slide captures the executive summary. 630 01:01:56.070 --> 01:02:04.199 Um, we took the previous slide set's algorithm and reordered everything, so there's a smaller total number of additions. 631 01:02:04.199 --> 01:02:10.800 The previous thing: at every step, all the threads were doing an operation, except the threads at the edge. 632 01:02:10.800 --> 01:02:17.250 Um, here, at each successive step, only half as many threads are actually active. 633 01:02:17.250 --> 01:02:22.469 So we reduced the number of active threads: more efficient use of the hardware. 634 01:02:22.469 --> 01:02:27.000 But it's more complicated: it's a 2-stage operation that 635 01:02:27.000 --> 01:02:30.090 accumulates some stuff and then propagates some stuff (see the sketch just after this passage). So, 636 01:02:32.460 --> 01:02:35.670 but we've left out the issue of warps, which we'll get to. Okay, so 637 01:02:35.670 --> 01:02:42.000 that's the executive summary: the algorithm gets more complicated, the hardware use gets more efficient. 638 01:02:43.829 --> 01:02:49.289 And, 639 01:02:49.289 --> 01:02:55.079 um, 640 01:02:56.099 --> 01:03:02.130 the work analysis of it: it cuts the number of additions down from n log n to 2n. 641 01:03:03.300 --> 01:03:10.500 Um, so. 642 01:03:10.500 --> 01:03:20.190 And again, if the vector is larger than the number of threads in a block, we partition it among multiple blocks. 643 01:03:20.190 --> 01:03:23.760 Okay, this takes some thinking here. 644 01:03:27.360 --> 01:03:30.840 We do the scan over the threads in each block, 645 01:03:30.840 --> 01:03:35.519 and we also get, at the end, the total of the elements in each block. Then we, um, 646 01:03:38.940 --> 01:03:43.710 we bring all of the block totals into 1 block, and we do a scan over that. 647 01:03:43.710 --> 01:03:52.349 We distribute it back among the blocks, we update stuff, and at the end of it we've done a scan operation over a very long vector that we partitioned into blocks. 648 01:03:52.349 --> 01:03:55.920 Well, I guess, 649 01:03:55.920 --> 01:04:00.360 and very large inputs. Um, okay, here's what 650 01:04:00.360 --> 01:04:09.690 I just told you: our goal is to scan a vector which does not fit in a block, because a block can do, say, a thousand elements, 651 01:04:09.690 --> 01:04:13.050 um, what, 2000 maybe. 652 01:04:13.050 --> 01:04:16.110 So, a very big vector, 653 01:04:16.110 --> 01:04:21.869 big being very long. We split it up among the thread blocks. And again, the thread blocks can't talk to each other.
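Before going on to multiple blocks, here is a sketch of that two-phase, work-efficient scan for a single block; again my reconstruction, not the slides' code, assuming n is a power of two that fits in shared memory. Notice the index arithmetic: at every step the active threads are threads 0, 1, 2, ... in a row, so the shrinking amount of work stays packed into a few contiguous warps, which is the divergence fix just mentioned. Each thread block in the multi-block scheme that follows could run something like this:

```cuda
// Work-efficient (Brent-Kung style) inclusive scan for one block: a sketch
// assuming n is a power of two, n <= 1024, one thread per element.
__global__ void scan_efficient(const float *in, float *out, int n) {
    __shared__ float buf[1024];
    int t = threadIdx.x;
    if (t < n) buf[t] = in[t];
    __syncthreads();

    // Phase 1, accumulate: a reduction tree. Half as many additions each
    // step, building the "weird partial subtotals" (sums of 2, of 4, ...).
    for (int stride = 1; stride < n; stride *= 2) {
        int idx = (t + 1) * 2 * stride - 1;  // thread t handles index 2s(t+1)-1
        if (idx < n) buf[idx] += buf[idx - stride];
        __syncthreads();
    }

    // Phase 2, propagate: push the partial sums back down the tree so
    // every position becomes a complete prefix sum.
    for (int stride = n / 4; stride >= 1; stride /= 2) {
        int idx = (t + 1) * 2 * stride - 1;
        if (idx + stride < n) buf[idx + stride] += buf[idx];
        __syncthreads();
    }
    if (t < n) out[t] = buf[t];
}
```

Phase 1 does about n/2 + n/4 + ... additions, a bit under n, and phase 2 about the same, for roughly 2n total, versus about n log n for the naive version.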
654 01:04:21.869 --> 01:04:25.710 Each thread block does a scan. 655 01:04:27.690 --> 01:04:33.840 And the last element of that scan is the total of all the elements in that block. 656 01:04:33.840 --> 01:04:37.289 We bring all of these totals together into one 657 01:04:37.289 --> 01:04:42.389 array, which will probably fit in shared memory if you've got, 658 01:04:42.389 --> 01:04:45.719 with 48 kilobytes, fewer than about 12,000 blocks. 659 01:04:47.550 --> 01:04:54.539 And so each block will then take its total and copy it into this one global 660 01:04:54.539 --> 01:04:59.670 array, and then we do a scan of that array. 661 01:05:00.840 --> 01:05:08.610 And again, this array here has one element per block, a thousandth as many elements as the original vector. So this array is probably 662 01:05:08.610 --> 01:05:14.429 small enough to fit; well, unless you're scanning a billion elements, maybe. 663 01:05:14.429 --> 01:05:19.289 And then we scan this, we get a pile of subtotals, and we distribute them 664 01:05:19.289 --> 01:05:24.869 back to the blocks, and we update all of the blocks' scan values. At the end of it, 665 01:05:24.869 --> 01:05:30.449 we've got the scan of the original big array. So, 666 01:05:30.449 --> 01:05:33.659 so you've got a sense of it. Um, I may 667 01:05:33.659 --> 01:05:36.780 work an example by hand next time. 668 01:05:38.429 --> 01:05:44.699 The problem is, like I said, it's very hard for me to simultaneously project stuff, 669 01:05:44.699 --> 01:05:51.929 say PDF files, and also write stuff by hand; I can't really do both on the same machine. So, well, 670 01:05:53.099 --> 01:05:56.369 I'll have to do more research to find some more tools. Okay. 671 01:05:56.369 --> 01:06:02.639 In any case, the executive summary: the point of this was to scan really long arrays, in 672 01:06:02.639 --> 01:06:06.690 three stages (there's a host-side sketch of the three stages just after this passage). So now, 673 01:06:06.690 --> 01:06:16.559 here's a minor thing. The scan I showed you before, where the i-th output is the sum of the first i inputs, is called an inclusive scan. 674 01:06:16.559 --> 01:06:20.610 There is a small variant called an exclusive scan, 675 01:06:20.610 --> 01:06:24.539 where the output is shifted right by 1. 676 01:06:24.539 --> 01:06:29.909 So the i-th output is the sum of only the first i-1 677 01:06:29.909 --> 01:06:34.559 inputs. The 1st output is always 0; the 2nd output 678 01:06:34.559 --> 01:06:39.869 is the 1st input; the 3rd output is the sum of the first 2 inputs; and so on. 679 01:06:41.909 --> 01:06:50.789 A totally trivial difference; why do they have a separate name for it? Because there are times when the exclusive scan is the useful one. So, 680 01:06:50.789 --> 01:06:55.110 yeah, and, 681 01:06:55.110 --> 01:06:58.260 no matter; you know, 682 01:06:58.260 --> 01:07:02.130 exclusive scan, inclusive scan: variants of the same thing. 683 01:07:02.130 --> 01:07:07.769 Okay, so the information content of this slide set is in this figure here: 684 01:07:07.769 --> 01:07:11.130 it's how to do the scan, the partial sums, over 685 01:07:11.130 --> 01:07:14.400 a multi-block vector. 686 01:07:16.260 --> 01:07:22.920 Yes, and again, the scan is a parallel operation which has a lot of uses: 687 01:07:22.920 --> 01:07:29.670 um, allocating lots of little arrays of different sizes, so-called ragged arrays; 688 01:07:29.670 --> 01:07:33.449 run-length encoding and decoding; and so on. 689 01:07:34.650 --> 01:07:38.969 A histogram would be one.
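Here is that host-side sketch of the three stages. The kernel names, signatures, and the 1024-element block size are my assumptions; scan_block stands in for a per-block kernel along the lines sketched earlier, extended to also write its block's total when asked:

```cuda
// Assumed per-block kernel: inclusive scan of one block's slice of `in`,
// optionally writing the block's total into block_sums[blockIdx.x].
__global__ void scan_block(const float *in, float *out,
                           float *block_sums, int n);

// Stage 3: every element of block b adds the scanned total of blocks 0..b-1.
__global__ void add_offsets(float *data, const float *scanned_sums, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && blockIdx.x > 0)
        data[i] += scanned_sums[blockIdx.x - 1];  // inclusive scan of sums
}

void big_scan(const float *d_in, float *d_out, float *d_block_sums, int n) {
    const int B = 1024;                // elements per block (assumed)
    int nblocks = (n + B - 1) / B;     // also assume nblocks <= 1024 here
    scan_block<<<nblocks, B>>>(d_in, d_out, d_block_sums, n);   // stage 1
    scan_block<<<1, nblocks>>>(d_block_sums, d_block_sums,
                               nullptr, nblocks);               // stage 2
    add_offsets<<<nblocks, B>>>(d_out, d_block_sums, n);        // stage 3
}
```

Stage 2 assumes the array of block totals itself fits in one block; for the billion-element case mentioned above, you'd apply the same trick one more level down.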
690 01:07:38.969 --> 01:07:44.340 There are a lot of things where your output is a lot of buckets of varying sizes. 691 01:07:44.340 --> 01:07:48.090 Histogramming would be one; some sort of 692 01:07:49.230 --> 01:07:56.460 set operation, unions and differences and so on: you end up with blocks of different sizes. 693 01:07:56.460 --> 01:08:00.809 Oh, okay. 697 01:08:21.689 --> 01:08:28.470 This is what I mentioned an hour ago. What's going on here? 698 01:08:28.470 --> 01:08:32.550 You've got this nice abstract idea of a field: 699 01:08:32.550 --> 01:08:38.279 the addition and multiplication operations are associative and commutative, 700 01:08:38.279 --> 01:08:43.859 multiplication distributes over addition, you've got 701 01:08:43.859 --> 01:08:50.789 additive inverses, you've got reciprocals, and so on; there are 11 axioms in all 702 01:08:50.789 --> 01:08:53.909 that a field satisfies. 703 01:08:53.909 --> 01:08:56.970 And then you've got floats, 704 01:08:56.970 --> 01:09:03.720 which basically break them. I mean, there have been machines in the past where, at one time, 1*x was not equal to x. Okay? 705 01:09:03.720 --> 01:09:13.079 So you've got the ideal, and you've got the imperfect implementation: that's floating point numbers. 706 01:09:13.079 --> 01:09:17.399 The context here 707 01:09:17.399 --> 01:09:21.060 is that different computers implement floats in different ways: 708 01:09:21.060 --> 01:09:25.619 a different number of significant bits, a different base, whatever. 709 01:09:25.619 --> 01:09:30.779 The IEEE did a standard a couple of decades back 710 01:09:30.779 --> 01:09:36.810 to try to establish good practices, what a good floating point system 711 01:09:38.220 --> 01:09:48.390 would be. And, um, here's the thing: for example, suppose you've got some common operation, square root of x, okay, for x being positive. 712 01:09:49.409 --> 01:09:54.840 A bad implementation might have square root of x jittering up and down in the last bit 713 01:09:54.840 --> 01:09:59.609 as x increases slightly. It's conceivable square root of x could go down as x went up, 714 01:09:59.609 --> 01:10:06.090 you know, in the last bit. Why is this bad? It might cause, say, a secant-search algorithm to fail. 715 01:10:06.090 --> 01:10:12.090 You see, you violate these little properties that you just intuitively assume, even if they're not written down: that 716 01:10:12.090 --> 01:10:21.359 your standard functions should be monotonic in practice where they're monotonic in theory, stuff like that. You would assume a plus b would equal b plus a; 717 01:10:21.359 --> 01:10:27.720 it's so obvious to you, but there were machines where that was not true. As I said, there was a machine where at one time 1*x was not equal to x, sometimes. 718 01:10:27.720 --> 01:10:37.590 The reason for some of this is that the machine might have an invisible register that's higher precision. You do a plus b: 719 01:10:37.590 --> 01:10:42.779 perhaps a is in a register at higher precision, and b comes out of memory at lower precision. 720 01:10:42.779 --> 01:10:50.609 For b plus a, maybe it's b that's in the high-precision register and not a (they're coming from some earlier operations), and the result is, 721 01:10:50.609 --> 01:10:54.060 a plus b would not be b plus a, exactly. Stuff like that. 722 01:10:54.060 --> 01:10:57.569 Um, so the IEEE,
723 01:10:57.569 --> 01:11:01.140 well, William Kahan at Berkeley was leading it, um, 724 01:11:01.140 --> 01:11:06.420 came up with a standard, and it got a lot of complaints because 725 01:11:06.420 --> 01:11:19.260 the standard was expensive to implement. So Cray, which had some fast parallel computers, never would do IEEE floating point. The floating point standard was theoretically nice, but 726 01:11:19.260 --> 01:11:25.529 it was expensive; it took a lot of gates. But now everyone does it, because they decided it's nice 727 01:11:25.529 --> 01:11:30.210 to do things right. Um, and 728 01:11:30.210 --> 01:11:34.409 even GPUs mostly do it. So that's the context here: 729 01:11:34.409 --> 01:11:40.829 does the hardware manufacturer's implementation follow this standard? 730 01:11:40.829 --> 01:11:45.750 If it follows this, the obvious things will happen. If it doesn't follow this, 731 01:11:45.750 --> 01:11:53.970 your hardware will perhaps be cheaper and faster, but you're going to get some crazy errors. 732 01:11:53.970 --> 01:11:58.020 So, um, 733 01:11:58.020 --> 01:12:02.100 that's the point about that. 734 01:12:05.189 --> 01:12:10.979 People know how floats are implemented, perhaps, so you can read this on your own. Um, 735 01:12:10.979 --> 01:12:19.770 you can read the details; I just gave you the quick executive summary of 736 01:12:19.770 --> 01:12:27.119 representable numbers. The IEEE standard also has bit patterns to represent infinity; 737 01:12:27.119 --> 01:12:32.909 if you divide 1 by 0, you get infinity, perhaps. And not-a-number: 0 divided by 0 is not-a-number. 738 01:12:32.909 --> 01:12:39.119 But no one uses them much. I'll skip all of this. 739 01:12:40.800 --> 01:12:46.590 Read it if you want. Um, questions? 740 01:12:48.210 --> 01:12:51.449 Well, this gives you some limits here. Um, 741 01:12:52.590 --> 01:12:56.670 normalization; you can read about precision and so on. 742 01:12:56.670 --> 01:13:00.510 Single precision floats: your biggest number is like 10 to the 38th, 743 01:13:00.510 --> 01:13:05.880 which is not big enough for everything. So one fix is basically to want double precision floats, 744 01:13:05.880 --> 01:13:12.060 unless you're doing machine learning, where you don't care about accuracy, so you go to half precision floats. 745 01:13:12.060 --> 01:13:16.739 I'm willing to debate that point about machine learning, but: 746 01:13:16.739 --> 01:13:20.039 special bit patterns do not-a-number, and 747 01:13:20.039 --> 01:13:25.710 accuracy and rounding, um. 748 01:13:27.029 --> 01:13:32.520 Well, one thing in the IEEE standard: if you're rounding a number, 749 01:13:32.520 --> 01:13:40.829 they require the last bit to be correct, and when you have to round, you can say whether you want to round down, round up, or round toward 0; that's useful at times. 750 01:13:40.829 --> 01:13:48.359 The fact that the last bit is guaranteed to be accurate means the hardware is computing to an extra bit or two and then rounding. 751 01:13:48.359 --> 01:13:53.220 So there's a lot of sophisticated stuff in the standard, um, which I'll skip through. 752 01:13:53.220 --> 01:13:57.270 Yeah. 753 01:13:57.270 --> 01:14:01.500 And addition is not associative: 754 01:14:01.500 --> 01:14:07.649 the thing is, (large plus small) plus small does not equal large plus (small plus small). 755 01:14:07.649 --> 01:14:11.130 Yeah, okay.
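That non-associativity is easy to demonstrate; here is a tiny demo, mine rather than the slides'. It uses 16777216, which is 2 to the 24th: at that magnitude a 32-bit float's last bit is worth 2, so an added 1.0 rounds away but an added 2.0 survives:

```cuda
// Floating point addition is not associative: a minimal host-side demo.
#include <cstdio>

int main() {
    float large = 16777216.0f;             // 2^24: last bit is worth 2 here
    float a = (large + 1.0f) + 1.0f;       // each 1.0 is lost to rounding
    float b = large + (1.0f + 1.0f);       // the 2.0 is representable
    printf("a = %.1f\nb = %.1f\n", a, b);  // 16777216.0 vs 16777218.0
    printf("equal? %s\n", a == b ? "yes" : "no");  // no
    return 0;
}
```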
756 01:14:11.130 --> 01:14:16.859 Well, here is the example I mentioned. So: 757 01:14:16.859 --> 01:14:23.010 you can do the IEEE standard, or you can do the cheap crappy version, sorry, 758 01:14:23.010 --> 01:14:27.060 which will be faster but less accurate. So you pick. 759 01:14:28.770 --> 01:14:34.199 You can do fast math, but if you use the fast math option, 760 01:14:34.199 --> 01:14:38.699 you've got to worry that you're going to get some surprising failures in your program. 761 01:14:38.699 --> 01:14:48.569 You'll have a loop where you're iterating until something converges, and your loop will become infinite because it won't converge properly. 762 01:14:48.569 --> 01:14:51.779 Something that terminated, 763 01:14:51.779 --> 01:14:55.770 that's the sort of thing: something that terminated 764 01:14:55.770 --> 01:14:59.819 using proper floats may turn into an infinite loop with fast math. 765 01:14:59.819 --> 01:15:06.000 Okay, um. 766 01:15:06.000 --> 01:15:09.390 Or, make everything double, 767 01:15:10.590 --> 01:15:15.989 on Intel. Okay, so I'll give you the executive summary of this. 768 01:15:15.989 --> 01:15:21.000 Floating point is a pain. It has a lot of sophisticated problems. 769 01:15:21.000 --> 01:15:28.439 They've been handled as much as they can be with the standard, but the standard is expensive to implement. 770 01:15:28.439 --> 01:15:31.859 So hardware people resist it, but 771 01:15:31.859 --> 01:15:35.880 if your floating point uses the standard, it will 772 01:15:35.880 --> 01:15:40.199 do the obvious things most of the time. Most of the time. 773 01:15:40.199 --> 01:15:43.380 Um, 774 01:15:43.380 --> 01:15:46.529 if you want extra speed, you can 775 01:15:46.529 --> 01:15:50.069 not use the standard; but be aware. 776 01:15:50.069 --> 01:15:58.199 If people like, I can expand as much as you like on numerical problems; I've actually spent some time on that. 777 01:15:58.199 --> 01:16:01.710 Um, other stuff about floats. 778 01:16:04.170 --> 01:16:09.090 On NVIDIA, the floating point operations, 779 01:16:09.090 --> 01:16:15.720 an example of how hard it is to do floats right: Intel has gotten it wrong more than once. Okay, 780 01:16:15.720 --> 01:16:20.220 Intel got floating division wrong once. Okay. 781 01:16:20.220 --> 01:16:25.710 This was some years ago: Intel implemented floating division; 782 01:16:25.710 --> 01:16:29.909 it was an iterative process, but it started with a lookup table. 783 01:16:29.909 --> 01:16:38.430 Well, it was doing a reciprocal first, a reciprocal and a multiply. It used a lookup table to get the first few bits of the reciprocal, 1 over y. 784 01:16:38.430 --> 01:16:45.239 The lookup table was, I don't know, a thousand elements long or something, and a few of the entries in the lookup table were wrong. 785 01:16:45.239 --> 01:16:49.409 So the Intel commercial chip, 786 01:16:49.409 --> 01:16:55.229 for certain values, computed x divided by y wrong. 787 01:16:55.229 --> 01:16:59.729 And it was out there for a while. 788 01:16:59.729 --> 01:17:08.039 But most of the time it was right, and when it was wrong, I think it was only wrong slightly. So some 789 01:17:08.039 --> 01:17:18.659 professor somewhere discovered it and went to Intel and pointed it out to them, and they ignored him. So he went public, and it became a major story in the business.
790 01:17:18.659 --> 01:17:24.149 So that's an example of how hard floating point is: Intel got it wrong, and that's not the only time they got it wrong. 791 01:17:24.149 --> 01:17:29.579 Okay. So, the floats are done with separate hardware, 792 01:17:29.579 --> 01:17:37.560 even on Intel. On NVIDIA, double precision may be separate hardware still, or, depending on the chip, it may be emulated, 793 01:17:37.560 --> 01:17:43.439 simulated in single precision. And the decision for NVIDIA is how many hardware 794 01:17:43.439 --> 01:17:47.039 floating point cores to put on the chip. 795 01:17:47.039 --> 01:17:57.420 The idea is that the threads that want to do floats queue up waiting for the available hardware cores, for the warp; it's all done warp by warp. 796 01:17:57.420 --> 01:18:03.239 So if you've got a lot of floating point computation 797 01:18:03.239 --> 01:18:06.510 in your program and not many hardware cores, 798 01:18:06.510 --> 01:18:09.630 your program's going to be waiting; 799 01:18:09.630 --> 01:18:15.029 your parallel kernel's going to be waiting; you're going to be limited by the lack of hardware cores. 800 01:18:15.029 --> 01:18:22.050 Of course, sometimes NVIDIA has done floats faster than ints. 801 01:18:22.050 --> 01:18:26.579 Oh, it depends. Um, I think at one point, 802 01:18:26.579 --> 01:18:34.979 basically, a float was faster. Because again, you've got the special-purpose cores and the warps queue up to use them. 803 01:18:34.979 --> 01:18:39.869 And I think at best a float may take a couple of cycles (I could be wrong on that), 804 01:18:39.869 --> 01:18:43.500 and at worst, more. So, okay. 805 01:18:45.239 --> 01:18:53.430 So that's a reasonable point about floats. Say, I don't know, 806 01:18:53.430 --> 01:19:00.180 what they're saying here, that's true for a sequential program: 807 01:19:00.180 --> 01:19:05.789 if you can afford it, the quick and dirty fix is to do everything in double. 808 01:19:05.789 --> 01:19:14.550 On a Xeon it may actually be about the same speed (the cores are the same speed), but you're moving twice as much data around, and that may slow you down. 809 01:19:15.689 --> 01:19:21.210 But if you think you're doing everything in double, you might accidentally do a few things in single precision. 810 01:19:21.210 --> 01:19:25.260 And so what they're saying here is that 811 01:19:26.460 --> 01:19:31.739 you may have to be careful about constants and so on, that they're the right precision. 812 01:19:31.739 --> 01:19:36.659 If you get it wrong, your program may be slower than you would think, or it may just be 813 01:19:36.659 --> 01:19:44.399 less accurate than you would think (there's a small demo of this just after this passage). Okay, so that's, um, this was 814 01:19:46.680 --> 01:19:53.550 12.1. And again, the executive summary: floating point is a big topic that 815 01:19:53.550 --> 01:20:00.750 you probably want to ignore and usually can't. There's a big field of numerical accuracy worrying about this; 816 01:20:00.750 --> 01:20:10.229 how do you invert big matrices accurately, or whatever. That's enough new stuff for today. So, unless there are any questions, 817 01:20:12.029 --> 01:20:15.210 I'll take questions for a minute or two, and then I'll go get lunch. 818 01:20:16.500 --> 01:20:25.079 So, we're working our way through. I'm open to feedback, if you think I'm going too slow or too quickly. 819 01:20:25.079 --> 01:20:30.869 I'm skipping over major areas and just summarizing for you, but you can dig into them if you want.
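Here is that small demo of the constants pitfall, mine rather than the slides'. Both variables below are doubles, but one of them was actually computed in single precision because its literals were floats; the same slip in the other direction (a stray double literal like 0.1 instead of 0.1f in float-heavy CUDA code) costs speed instead of accuracy:

```cuda
// "Everything in double" silently degraded by float constants: a host demo.
#include <cstdio>

int main() {
    double accurate  = 1.0  / 3.0;   // computed in double precision
    double truncated = 1.0f / 3.0f;  // computed in FLOAT, then widened
    printf("%.17f\n", accurate);     // 0.33333333333333331
    printf("%.17f\n", truncated);    // 0.33333334326744080
    // Both variables are doubles; the damage is invisible in the types.
    return 0;
}
```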
820 01:20:32.399 --> 01:20:40.680 If you want to learn more about floating point arithmetic and its risks, William Kahan is the expert on this, and he did the IEEE standard. 821 01:20:40.680 --> 01:20:46.020 He also, I think, did the floating point for HP's processors, calculators and so on, 822 01:20:46.020 --> 01:20:49.439 as a consultant to HP. So he's the guy behind 823 01:20:49.439 --> 01:20:52.470 a lot of the implementations. 824 01:20:52.470 --> 01:20:59.010 And he has a program, a subroutine, called paranoia, and he has writings on how 825 01:20:59.010 --> 01:21:02.640 badly this stuff can go wrong. And, um, 826 01:21:04.260 --> 01:21:07.859 an example of how complicated it is: 827 01:21:07.859 --> 01:21:12.180 the Communications of the ACM, some years ago, published a 828 01:21:12.180 --> 01:21:19.109 subroutine that would determine what the floating point precision was on your computer, the number of bits in the 829 01:21:19.109 --> 01:21:27.510 exponent and so on, et cetera. They published it, and then they discovered that on some commercial computers it was dysfunctional: it would go into an infinite loop, 830 01:21:27.510 --> 01:21:33.270 not even terminate, because it was iterating until 831 01:21:33.270 --> 01:21:39.720 some numbers stopped changing, and that would be related to the number of significant bits; but on some hardware 832 01:21:39.720 --> 01:21:44.250 it would never stop; it would just iterate forever. And so, 833 01:21:44.250 --> 01:21:49.050 okay, so: 834 01:21:50.579 --> 01:21:56.609 open to questions. Other than that, see you Tuesday.