Okay, hi class, just a second. Hello. This works better. Okay. So: Parallel Computing, March 18, 2021, class 15. My universal question is, if you can hear me, somebody type something in the chat window, or on mic just say hi or something. Because otherwise... thanks, Justin. Cool. Okay. So what is on tap for today?

So, I was a little slow in getting back to you about talks, because you're doing a talk, you know, a student class presentation, either Monday or Thursday, and several people have not replied. So what we'll do later today or tomorrow is just assign you talks, and you'll be on Thursday and running over.

Okay, so. We're finishing off the slide sets from the book, just a little more, and accelerated computing. I mentioned OpenGL for a minute, but we've got a talk coming on OpenGL from Joseph, so I won't really talk about it much; I'll just have a minute or two on that. I'll point you to some NVIDIA primary documentation on this stuff, which is more up to date and more concise, but drier. And I bet you've spent the last couple of days reading a lot of it. In any case, I read technical documentation, and I've got a section on some other NVIDIA features that we haven't covered, and some stuff that I've typed up. I had it in Monday's blog, but I moved it to today's and didn't really talk about much of it Monday. So this is my summary, almost for my own purposes, of the essentials of NVIDIA, so I'll talk about that.

And then two blurbs on stuff that I find interesting: programming stuff on memory allocation, and the union file system. And then we'll move on to the next major topic, which is Thrust; we might spend two weeks on it or something. It's a parallel extension to C++, and it also shows you paradigms for efficient parallel programming, and a number of these parallel paradigms are useful only for parallel programming. They're not good for, um, serial code.

So, I don't know what got lost when the audio went out for a minute. I think that things are okay now, but don't trust computers. You won't believe how long I've been programming computers when I say I don't trust them.

OpenCL: I'll just have a blurb on that. I won't show the whole thing; you can read it on your own, and I'll summarize it. It's Apple's competition to CUDA, and it's CUDA where they changed the names, made it look like OpenGL, and made it clunkier.
And as long as NVIDIA is dominant, you should use CUDA, not OpenCL. OpenCL is more device independent, like OpenGL; however, to obtain that independence, it gets clunkier. So, as long as NVIDIA stays sane and competent, you want to stay with CUDA.

The reason I put the qualification in there is that I have seen computer companies go insane, go crazy, and basically go from a dominant position to basically nonexistent in a few years. When I was a grad student, the predominant computer company for computer science departments was the Digital Equipment Corporation, DEC. At their peak, they were the second largest computer company in the United States, and they even had an annual conference on an ocean liner. And then the guy who built it up, Ken Olsen, who had done everything right for decades, started doing everything wrong. He said he didn't like UNIX, he didn't like Ethernet, he didn't like open standards. He got everything wrong. They started filing these patent lawsuits, and they crashed; they basically went to nothing and merged into Compaq. So that happens.

Okay, now, NVIDIA primary documentation, just to point you to this. This is their source website. They've got several different things here, which I think NVIDIA manages as somewhat separate projects. The CUDA stuff is the toolkit documentation, and it's got several things here. The programming guide is quite interesting and quite detailed, so you can have fun going through that. I showed you little pieces of it. Basically, if you want sort of the definitive summary of the core NVIDIA stuff and so on, you'd want to look at the programming guide.

Another one, and they themselves overlap somewhat: they've also got high performance computing documentation, and this talks, among other things, about parallel algorithms. It's worth reading some of that stuff. They have a compiler, a C++ compiler; I think it's basically the PGI compiler, which they're extending. And they talk about extensions to the standard template library: you put in an extra argument and it will try to compile your code for either multicore, meaning Intel-style CPUs, or for manycore, which is NVIDIA. So they talk about that there. They talk about their compiler here, which is separate from CUDA's nvcc: nvc++ is a C++ compiler, but it will emit code for the NVIDIA GPU if possible. So if you want to learn more about that, you can browse around in here.
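To give the flavor of that: here is a minimal sketch of the kind of standard C++ they mean. With NVIDIA's nvc++ and its -stdpar option, a standard parallel algorithm like this can be compiled for the GPU; with another compiler, the same code runs multicore or serially. The array size and the lambda here are just made-up examples, not from their documentation.

    // Standard C++17 parallel algorithm; nvc++ -stdpar can offload this to the GPU.
    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<float> x(1 << 20, 1.0f), y(1 << 20);
        // The execution policy asks for a parallel run; the compiler decides
        // whether that means multicore CPU threads or GPU threads.
        std::transform(std::execution::par, x.begin(), x.end(), y.begin(),
                       [](float v) { return 2.0f * v; });
        return 0;
    }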
That compiler does not do CUDA, though; if you want to put CUDA code in your C++ program directly, then you call nvcc. All the names are slightly different from each other; you need a little cheat sheet to keep them apart, actually.

Okay, so we've got that there. Their high performance computing documentation talks about some tools. They've got a catalog; we could even look at some of these packages here. They've got stuff that runs on top of their low-level stuff to do things such as linear algebra in parallel, fast Fourier transforms, signal processing stuff, sparse matrix stuff, BLAS. BLAS is a spec, an API for some low-level basic matrix-vector operations, and your higher-level things are written in terms of BLAS routines. So BLAS itself is a spec, and then there are implementations of it, open and commercial; the commercial ones might actually be good. In any case: cuBLAS, and the other things for deep learning, machine learning, and stuff like that. So, as you're interested, you can browse around in here. Another thing: they have virtual GPUs now, so you can run where the host is doing time sharing and have several users on one GPU. Okay, so that's some NVIDIA documentation there.

Now, other NVIDIA features. We've seen most of NVIDIA by now. Some stuff we've not seen, for any of you who have taken my graphics course or other graphics courses, are things called texture maps and surface maps. Well, in graphics you have a texture for you to fill a polygon with, and you can interpolate on it, and there are the levels of detail, the mipmaps themselves. Well, that's all done in hardware on NVIDIA. Surface maps map to the surface of a cube, and so on.

There's machine learning hardware, and the primary operation there, for 4x4 matrices, is A = B*C + D, or they can be vectors, and the matrix elements can be several different data types, such as half-precision floats, 16-bit floats. So that's done fast in the machine learning modules.

They've got hardware now, with their real-time ray tracing, to do some ray tracing operations fast, such as to run a ray through some primitives. Unfortunately, the user-visible API is a little higher level than I wish it was. So you can run a ray through various boxes, but you cannot explicitly do an intersection test between a ray and a box; you have to make the intersection test part of a ray traversal, which is a touch annoying. In any case.

They also have an extension for groups. All along we've seen that 32 threads make a warp, and a warp runs together. They've now started to break that idea up slightly: in Ampere, their latest architecture, subsets of a warp can synchronize. Currently what we've seen is that all the threads, or all the active threads, of a warp can synchronize. But I guess they observed that users wanted smaller numbers of threads to synchronize; if you force all the active threads to synchronize, it's a little too coarse.
So now they've got things called cooperative groups, which let, say, maybe 16 threads synchronize or something. They all still have to run the same instruction if they're active, but.

They're also thinking that their GPU might be an expensive resource running on a time sharing system. So you can have virtual GPUs, like you have virtual machines running under Linux: the virtual machine gets certain resources, and in theory it's hidden from the other virtual machines. You can do that now with the GPU. I'm thinking NVIDIA is thinking of the fact that on Amazon you can now construct virtual machines that use GPUs, for a fee. And in fact, last year, when I looked at the pricing, NVIDIA's latest GPU was something like 5 or 10 thousand dollars if you bought it, and you could use the thing in a virtual machine for less than a dollar an hour. In other words, it was cheaper to use it from Amazon than to buy it yourself, unless you're going to use it solidly for a full year, let's say, or several years. So I'm guessing that, since you can now use NVIDIA GPUs on Amazon virtual machines, to help enable that, NVIDIA can now define a virtual subset of a full hardware GPU.

Another cool thing, and I got the formatting wrong here, is global memory compression. On parallel, the GPU has 48 gigabytes of global memory. There is a way you can tell the GPU to compress the data to be written to global memory. It takes time to compress and uncompress, but it means the data takes less memory. So it's a space-time tradeoff, and given that a lot of parallel programs are memory limited, where memory is the constraining thing, this is a win.

Now, what I have here as point 6.1 really should be point 7; I should have tabbed left on that. The latest thing I've noticed is that NVIDIA is not using the idea of a core anymore. You've got your streaming multiprocessors; you might have a few dozen of them. And just to remind you, let's see, on my current machine here, how many streaming multiprocessors do I have? I have 48 streaming multiprocessors. That's printed out by the deviceQuery program. Each of these 48 multiprocessors has, say, 64 integer units, 64 float units, 32 double units, and some dispatch units or something. And all of these integer, float, double, and instruction dispatch units operate independently inside that multiprocessor. So the concept that you've got these organized into CUDA cores, 64 CUDA cores per multiprocessor: in their latest generation, the CUDA core is an obsolete concept.
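Just to show where a number like that comes from, here's a minimal sketch of asking the runtime yourself; this is essentially what the deviceQuery sample does internally (error checking omitted):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        std::printf("%s: %d SMs, compute capability %d.%d\n",
                    prop.name, prop.multiProcessorCount, prop.major, prop.minor);
        return 0;
    }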
So, okay, that's some hot new things that we haven't seen in the documentation so far. The conceptual hierarchy I mentioned before; I won't go through it again. It's a complicated mess. You know, within the one generation, Pascal, let's say, you've got different levels of GPUs; different manufacturers buy the chips from NVIDIA and configure them differently, or NVIDIA will bin stuff. And there might be some chips where one of the multiprocessors doesn't work, so they'll lower the price and sell it in a lower product line, all that sort of stuff, change the board it's on. So it's a complicated mess. Okay, and then they've got things called Nsight and Nexus, and I don't even know what they are.

The range of speeds I mentioned last time. I mentioned compute capability versions; it's another complicated mess. The hardware has a version, like Volta and Ampere, and the compute capability is like a software version: it sort of matches the hardware version, but not exactly. And then the compiler compiles to a byte code, sort of like Java, and what's in the executable is the byte code; and then when you run it, it does just-in-time compiling of the byte code to the actual machine language. So the first time you run the executable, well, it's going to take longer, which is going to mess with any timing you do. So you run it a couple of times and throw away the first time. When it compiles, it actually caches the compiled program, and I don't know where it caches that; I think on the host somewhere. And I mentioned this sort of thing about optimization and whatever else affects CUDA, and so on.

Okay, something relevant to efficient programming, parallel and sequential, which I typed up here because I haven't actually seen it defined nicely anywhere. I'm biased towards C++. There are basically three types of memory. There's static memory, which is assigned when the program is compiled: it would be a fixed-size global array, and the space is allocated along with the code and so on, and things have addresses. So it would be a global variable. If you're going to do this, you maybe don't want to initialize it, because the initial values would have to be assigned as it starts executing, if it's a big array. Now, the thing with this: by default, the static space is not that big, but you can make it bigger by saying -mcmodel=medium or large, and that'll do things like cause the compiler to use wider addresses. Initially it might use 32-bit addresses, which means your static data plus your executable cannot be more than 2 gigabytes. If you make it medium or large, it uses wider addresses. So the program will be an insignificant amount slower, but now you could have,
you know, umpteen gigabytes of static data. And it doesn't increase the executable file if it's not initialized; it increases your virtual memory space, but that's not a problem, because virtual memory is not actually allocated until you touch a page, which I think is 4K bytes. So in the past, I would by default actually put my arrays in static storage, because it avoids some messy problems, although I might want to modify that advice. Okay, so that's static.

The next thing is, there's a push-down stack. I would guess you're probably all familiar with the concept of push-down stacks. Each time you call a new function or routine, it puts another frame on the stack and allocates the variables that are local to this new routine. And then when you return, it pops that frame off the stack and deallocates all those variables. So it's a nice way to handle variables that are local to that routine, or to routines that are called from it. Now, by default, there's only 8 megabytes of stack space, which is sort of small. You can increase it with the ulimit command in the shell, or you can call a C routine that does that, and you can make it gigabytes. So if you want 10 gigabytes of variables local to a routine: increase the stack size, make the data a local variable in the routine, and then it's allocated when you call the routine. And again, it can be as big as virtual memory.

Now, if you have multithreading, then each thread, so far as I can tell, has a separate stack. You might be wondering how this is possible, because you might have seen in a compiler class that you can run one stack up from the bottom and one down from the top of your memory space; that's two stacks. How could you have, say, 16 stacks? Well, with virtual memory, that's easy, because they each start at a different location, far apart in your virtual address space. And again, that space is not allocated until you touch it. So you can have 100 stacks, and each stack is 100 gigabytes or something, no problem; if you start touching all this data, you're starting to use real memory, but until you touch it, fine. So you can have stacks with multiprogramming and so on; in OpenMP, so far as I can tell, each thread gets a separate stack.

So, some initialization has to be done if you create your multiple stacks. But for data that you just need in a routine and its callees, the routines it calls, the stack is nice. But to do it, you have to know the size of the data at the time you call the routine.
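Here's a minimal sketch of raising that limit from inside a program instead of with ulimit, assuming Linux and the POSIX setrlimit call; the one-gigabyte figure is just an example, and the raised soft limit still can't exceed the hard limit:

    #include <cstdio>
    #include <sys/resource.h>   // getrlimit, setrlimit (POSIX)

    int main() {
        rlimit rl;
        getrlimit(RLIMIT_STACK, &rl);
        std::printf("stack soft limit: %ld bytes\n", (long)rl.rlim_cur);
        rl.rlim_cur = 1L << 30;        // ask for 1 GiB, like "ulimit -s" in the shell
        setrlimit(RLIMIT_STACK, &rl);  // further stack growth can use the new limit
        return 0;
    }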
By the way, needing the size at call time is something that C actually does better than C++, with its variable-length arrays. Okay.

And then there's the one that people use automatically: the heap, which is a global space of random memory, and it's got a heap manager, and you create stuff in it. In C you do malloc and free; in C++ you do new and delete. It just creates stuff on the heap, and the heap can be as big as virtual memory, in pages. Now, the thing with the heap is that there has to be a structure on it to keep track of the stuff. Like you have, perhaps, the linked list they teach in CS courses: a linked list of the free spaces on the heap, and if they've got two contiguous free spaces, they coalesce. And if you do a lot of new-ing and deleting, the free space can get fragmented. So perhaps you do a compacting garbage collection, which scrunches stuff together; but that destroys all your pointers, so it can be done in only limited cases. In any case.

And things like your vector class just put stuff on the heap; the documentation didn't even say that for a long time. So, everything you allocate on the heap has some hidden bytes of data before it, which hold some identification information and a pointer to the next space or something. There are a couple of problems with the heap. It's a nice idea, you can do whatever you like with new and delete, but it's got two problems.

First, you have to keep track of stuff so you can delete it when you're finished. Well, there are smart pointers in C++ that will try to do that for you. With a unique pointer, you have only one pointer to an object, you cannot have two; so with one pointer, it can keep track of when no one is referring to the object anymore and free the stuff. Same as with the stack: stuff gets freed automatically when you return. So you have to keep track of stuff and free it when you're finished with it, and that can be difficult if you have a producer-consumer programming model, where one part of the program produces stuff, another part of the program consumes stuff, and they're quite distant from each other. The producer has no idea when the consumer is finished with something, but for the consumer to free it, it has to know things about the producer's internals, and it's a complicated mess. A big place this happens is graphics, actually. You're calling a graphics subsystem, something like X Windows or OpenGL, and the called routine creates some object and returns it to the caller, and there's the question of who should free it, eventually. Now, in a small case you don't care; it's just a little memory leak, just let it go. I mean, you've got lots of memory. But it gets to a point where maybe you've got to keep track. That's the first problem with new and delete: keeping track of things so you know when to delete them.
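A tiny sketch of the smart pointer idea; the Blob type and the function names are my own toy example, not from the slides:

    #include <memory>

    struct Blob { int data[1024]; };   // stand-in for some object on the heap

    // The producer hands ownership back to the caller inside a unique_ptr;
    // whoever ends up holding it last frees the Blob automatically.
    std::unique_ptr<Blob> produce() {
        return std::make_unique<Blob>();
    }

    void consume(std::unique_ptr<Blob> b) {
        // ... use *b ...
    }   // the Blob is deleted here, when the last owner goes out of scope

    int main() {
        consume(produce());   // no explicit delete anywhere
        return 0;
    }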
A real-world example: you all know the DARPA Grand Challenge, with the autonomous vehicles. Well, in one of the Grand Challenges, one of the vehicles failed because it was using Java to keep track of objects that were seen by a forward-looking radar, and as it passed an object, it would free it from the heap. And it accidentally kept a pointer to each object, so no objects got freed, so the heap got bigger and bigger. And as it got bigger, it got slower, which is a point I'll come to in a minute, and eventually the computer on their vehicle was operating in less than real time; it was taking more than real time to process the heap in the Java program. Using Java was another mistake, but in any case, their vehicle stopped.

So, the second problem with the heap is that as you get more and more objects on it, the time to create and destroy objects grows; it's not constant. It goes up with the number of existing objects, because there's a data structure keeping track of the objects that has to be managed. So you do not want to allocate billions of 4-byte objects on the heap. For one thing, each 4-byte object will have an 8-byte header or something. So if you're going to use the heap, you want to allocate fewer, bigger objects, not more, smaller objects.

In the context of using vectors: if you've got a million two-dimensional coordinates, you might say the nice way to think about it is that you have a million vectors of two floats each. Yeah, but then you're allocating a million tiny things on the heap, and that's a bad idea. What you'd do instead is use a fixed-size array or something; perhaps that would be the way I would do it, maybe. Or you do something called placement new, where you separate allocating the storage from constructing the object in the storage. It's something in C++ no one knows about. Normally, when you do a new, it grabs space off the heap, allocates it, and constructs your variable in it. If it's a fundamental data type, construction is nothing much, but with a fancy data type, construction involves building some structure. So with placement new, you manage the storage allocation yourself: maybe you've got this humongous array, a gigabyte, and you cut pieces off it and hand them to placement new to construct your two-dimensional points in, or something, and you keep track of your storage. So that's the second problem with the heap: too many heap objects slow you down.
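Here's a minimal sketch of placement new in that style; the Point2 type and the slab size are made-up examples:

    #include <cstddef>
    #include <new>

    struct Point2 {
        float x, y;
        Point2(float x_, float y_) : x(x_), y(y_) {}
    };

    // One big slab, allocated once; no per-object heap headers, no heap traffic.
    alignas(Point2) static unsigned char arena[1000000 * sizeof(Point2)];

    int main() {
        for (std::size_t i = 0; i < 1000000; ++i) {
            void* slot = arena + i * sizeof(Point2);
            new (slot) Point2(float(i), float(2 * i));   // construct in place
        }
        // Point2 is trivially destructible, so there's nothing to clean up here;
        // a type with a real destructor would need an explicit p->~Point2() each.
        return 0;
    }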
The third problem with the heap is that the heap is initially one global object, and if you're doing parallel programming and doing news and deletes, that has to be serialized, because these different threads are modifying that global storage heap; they're not just reading it. So there's a problem: in parallel programming especially, you do not want to do lots of operations on the heap.

Now, if you're doing a lot of stuff with malloc, Google has an allocator that's better than the g++ allocator, and you can use it by linking it in, as I've got there. And there's another cool thing you can do if your program is dynamically linked: there's an environment variable called LD_PRELOAD, which is basically a list of libraries to load first, and you can tell it to load Google's allocator. If your program is dynamically linked, it will load Google's malloc instead of GNU's, and that saves a little time.

A little thing: Intel servers have a non-uniform memory architecture, which has a very small effect, I think. But on parallel, each 14-core Xeon, or whatever you call the unit at the top level, has half the memory attached to it directly. So accessing the roughly 200 gigabytes that's attached to the local block of 14 cores is faster than accessing the other half. And I mentioned here, I think, that a page of memory is assigned when it's first touched, written to, let's say. So which cores access it matters; you can have some fun there. And caching is a separate thing.

Okay, so that's my summary of the three types of storage in a C++ program: static array data, assigned at compile time; your stack, which is pushed when you call a routine and freed when the frame is popped off the stack when you return; and your global heap, where you control the creation and destruction.

Okay, back to NVIDIA, and some more details I typed up based on my understanding. The numbers keep changing, but basically: we've got the host, we've got the device, and the device has streaming multiprocessors. NVIDIA has used different names at different times. A thread: it's a program, it's got a program counter and some private memory, like registers and some local memory, and it accesses shared memory. 32 threads make a warp. Up to 32 warps make a thread block. And you don't directly see the warps in your programming, but they affect performance. So you've got your thread blocks.
The next level up is a grid of thread blocks. And the kernel is the program that the grid runs, you might say. So a kernel can have thousands of threads; a kernel can create other kernels and wait. And you want to have threads waiting to execute, because the thing with the GPU is that it has high throughput, but it also has latency. One way it gets throughput is by high occupancy, and it gets that because if a warp blocks on some resource it needs, then there are other warps waiting to run.

There are registers, and slow local memory for each thread. And there's one instruction dispatcher that dispatches one instruction to all 32 threads in the warp. So instruction-level parallelism is that the warp schedulers that emit the instructions are always running, because there's always some warp waiting to be dispatched; that's instruction-level parallelism. And there are things like shuffle instructions and so on, voting.

The block-level things I've mentioned several times. The block has fixed resources, like so many registers, that get distributed among all the active threads; so more threads means fewer registers per thread. You've got some fast shared memory, up next to the cache, that's global to everything in the block, but private to the block. And there's some explicit cache. The different warps run asynchronously: well, they're running the same instruction sequence, but they're at different points in it. So all the threads in a block are running the same program, but not necessarily in step, unless you synchronize.

The streaming multiprocessor contains the thread blocks. And my guess is that one CUDA core, though the term is getting obsolete, is a twentieth as powerful as a Xeon core, because of the slower clock rate, and it can't do as many things at once; although this is changing, so this one-twentieth here could be changing.

And again, there are these various units; there are also special instruction units in the streaming multiprocessor. So this is a reason that if your program has a lot of double precision, it's going to run slower: there are fewer double-precision units than single-precision units. And there are some special function units, and if you need them, your warps are going to be waiting. And all this stuff changes from generation to generation.
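To make the thread hierarchy concrete, here's a minimal sketch of a kernel launch; the kernel and all the names are mine, not from the slides. The point is the triple angle brackets, and the fact that the grid has far more warps than the hardware has units, so something is always ready to run:

    #include <cuda_runtime.h>

    __global__ void scale(float* x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element
        if (i < n) x[i] *= a;
    }

    int main() {
        int n = 1 << 24;
        float* x;
        cudaMallocManaged(&x, n * sizeof(float));   // unified memory
        int threads = 256;                          // 8 warps per block
        int blocks = (n + threads - 1) / threads;   // tens of thousands of blocks
        scale<<<blocks, threads>>>(x, n, 2.0f);     // grid of blocks of threads
        cudaDeviceSynchronize();
        cudaFree(x);
        return 0;
    }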
There was one generation of NVIDIA hardware where integer instructions ran very slowly. Who knows why; they decided integer instructions weren't important. So back then, you'd actually get a big speedup by converting your ints to floats, doing floating-point adds and multiplies, and converting back to ints. That's not true anymore. So: you've got schedulers, instruction dispatchers, a couple of levels of memory.

Grid-level resources: these are all the blocks in the grid. And your grid is your parallel program, the thing you call from your C++ with the triple angle brackets. And if it's got a lot of blocks, and you have maybe 48 streaming multiprocessors, the different blocks in the grid could be running on different streaming multiprocessors. That's okay, because the blocks aren't talking to each other. And again, you've got a big queue of blocks, and they're picked off as resources free up to run them; so you want to have extra blocks available to run.

Device-level resources: you've got your global memory, which has this latency. Host memory mapping: well, now it's somewhat unified. And thread-level parallelism is that when a thread tries to read something from global memory and gets blocked for maybe 200 instructions, then some other threads can run. Thread-level parallelism is a big reason for performance: you want the CUDA cores in the GPU to always be occupied; something gets blocked, something else is waiting to run.

Caches: caches are a great advantage. A new unit is like a little operating system. Hyper-Q just means various different applications on the CPU can be launching parallel programs. And at the GPU level, things stack up even higher: lots of warps. Okay.

If there are no questions, here's a summary of more about CUDA. Some of these things are just a summary of what I've said before. Point 3 there may be obsolete now, but again, you've got C++ programs compiled with nvcc. You can have a global function: it runs on the device, called from the host. A device function runs on the device, called from the device. A host function runs on the host, called from the host. And then there are variable classes for data; we'll have a slide talking about some of this. Shared is local to the block, and it's fast device memory. Global is managed, or paged. Constant: how do you load constant data? There's a special routine that loads it. And some of this stuff: they're loosening the rules a little as time goes on, and Ampere loosens some of these rules.

One more Linux trick. I like to intersperse Linux tricks, because I find them useful, and this is a cool one. It's called a union file system, or a transparent, or translucent, or overlay file system. So you can layer file systems; it's like layers:
in your CAD program or your drawing program, you've got layers of graphics, and if layer 1 is on top of layer 2, then you can see layer 2 through the transparent parts of layer 1. And when you create a new object, you pick what layer it goes onto. Well, you can do that for file systems.

So you can have, let's say, a couple of file systems; I have an example here. I've got two directories, A and B, and they get combined to make a unified directory M. So what happens is that if there's the same file in both directory A and directory B, you'll see the version in directory A; it will hide the version in directory B. But if a file is in directory B and it's not in A, you'll see it from B. So M is the merged version of these two directories, like two layers in our CAD program or our drawing program.

And if you write something, it will go into the top layer, A. And if you modify a file: well, you can say when you create the union that A is read-write and B is read-only or something, and the union file system then does copy-on-write. So I have the rules here, and you get all these weird cases. What happens if you change a file that's in B? It writes a new version into A, which will hide the old version in B. If you delete a file that's only in A, it gets deleted, and if there was something with the same name in B, that becomes visible. If you delete a file that's in B: well, since B is not writable, because I just said A was read-write but not B, it puts a whiteout note into A that says this file in B is really deleted.

And this is all user-level; it's not superuser. You can create it with the union file system command, and you then delete it with the FUSE unmount command yourself. Sort of cool. Now, what's an application? Suppose you have a read-only directory, like it's on a DVD or something that you cannot write: you could put a read-write layer on top of it. Or if you've got some big database that you have read-only access to, then you put a read-write layer on top of it to make changes that are private to you. And each separate user could do a separate union file system on top of this read-only base, and each separate user would have a separate, private, read-write view of the file system. This is really cool; I think it's nice.

Now, IBM had a commercial version of this 50 years ago, as part of a system which also had virtual operating systems done efficiently, long before anyone else could do an efficient virtual operating system; this was a commercial product they had for decades. IBM has occasionally been a leader in some interesting ideas. And this was not even their mainstream operating system; the mainstream operating system was OS/360. This was a secondary one.
CP was the control program; that was the low-level virtual machine system. And CMS was, I think, the Cambridge Monitor System, because it came out of their Cambridge, Mass. lab, which wasn't even a research center; it was a "scientific center." In any case.

So that's enough specific NVIDIA stuff, specific CUDA stuff. Before I move on, I'll give you a chance to ask questions.

Okay. So what I have here is a class at Stanford on parallel computing, which has been made freely available. So I'll show it to you. There are pluses and minuses with the Stanford course. The minus is that it's ten years old; the plus is that it's very well done. So for stuff from ten years ago that's still interesting today, it's something to look at, but anything since 2010 won't be in the course. So that's why it's not the primary material for my course. However, it does have some nice stuff in it, and it's very well presented. So what it has here is some general stuff at the front, which I'll fast-forward through, just because they present it well; and then we'll get into some paradigms for parallel programming, and a map-reduce model, and then it'll get into a specific example of these paradigms, Thrust. This will be a couple of days.

That's not what I want; my mistake here... there. Better, better, better. Okay. So, CS 193G. And Hoberock is one of the people that did Thrust.

So: how do things get faster? You know that you're going parallel, because they can't increase the clock rate. There are different types of parallelism. Intel, with its superscalar stuff, does instruction-level parallelism, with the cache, with the pipeline; it's nice, but there are limits, and you can read this yourself. Then massive parallelism. Okay, this is nicely done here.

So, different hardware organizations; and one of the keys to NVIDIA's success is that they picked a nice hardware organization. So: you might have several processors, like Intel, each one with some local memory, and then some big amount of global memory; the on-chip stuff is a cache and so on. That's one way you can do it; this is sort of Intel. Again, the local memory is a cache. Okay. Manycore, that's the other one: manycore means thousands of threads; multicore means dozens of threads. Okay, so you do something like this:
there are lots of processors with one global memory. GPUs go for throughput, because they were invented to do graphics: you've got to paint a million pixels 60 times a second, with lots of threads, a thread per pixel or something. So thread creation has no overhead. And there's latency, and you don't care: if the new frame buffer you're writing to starts being written a thousandth of a second late, okay; you're refreshing 60 times a second, maybe. So latency there doesn't matter; well, a thousandth of a second is a big latency from a computer's point of view, but it's high throughput. So basically, lots of threads running together. You can read this some more; I'm just hitting the high points. The point again: the green is compute, so more of the chip is spent doing math computation.

Some numbers here. Again, you've got instruction schedulers that dispatch instructions for a whole warp at a time and so on, some load-store units that load data for a whole warp or whatever. And they get the speed because they've got a lot of computation cores, the dark green things. Single instruction, multiple thread: again, this is a rerun. Okay. Latency hiding and so on, kernels with lots of threads; you've seen all this before. And once a thread starts, it doesn't get preempted; it runs to completion.

So this would be the bad way to do it: lots of processors and no local memory. The way NVIDIA does it is that there's a block with some local memory, and in the block there are a thousand threads, perhaps. And not this one either, where each thread has all local memory; that's too distributed, so communication kills you. And that we've seen before. Lots of examples, nice speedups.

Right. Okay, that was slide lecture 1; and again, if you want me to slow down, tell me. Lecture 2 in 5 minutes.

And this you've seen... you've seen... you've seen... Yeah, you've seen these classes of routines and memory. This is slightly obsolete now. And the main thing is, if I get time, I might write a demo program showing the unified memory, but I haven't got time to do it yet, because the first three versions of my attempt failed, of course; I have to punch at it each time. You've seen that; and finally, that blocks are independent. Okay.

The graphics pipeline:
for those of you who aren't graphics types, you've got your triangles defined by vertices, and vertices can be rotated and scaled. And then they're combined to make primitives, such as triangles, which get rasterized, which means find the pixels of the triangle and put them into the frame buffer, with the depth buffer, and blend stuff, and so on. And these vertex processing units and pixel processing units used to be fixed-function, but then they started getting fancier; that's what the "modes grew with each generation" there says. Hardware used to look like that, but they made things more programmable, so the shader execution units did the pixels, and the vertex unit did the rotating and projecting and so on of vertices. Initially things were fairly limited, and then they started adding more instructions, because there's a lot of parallelism. The thing is, your vertices can all be transformed independently; in fact, your triangles can all be transformed in parallel, if you've got the hardware. And you don't care if there's a little serialization bottleneck. Okay, and more efficiency... okay. That was 5 minutes for lecture 2.

Again, since you've seen so much of this, you can read it on your own if you like. Your program might have a lot of concurrency: step 1 is concurrent and finishes, step 2 finishes, and so on. Worry about race conditions; you all know what a race condition is by now, and you have atomics to handle that. What's interesting here: note that Fermi is like five generations back for NVIDIA, and atomics were slow then, and they're sequential, so, you know, be disciplined in your use of them. And, yeah, that was lecture 3.

Okay, so this, again, is to show you the different types of memory. I've mentioned them before; this is just a nice presentation. Shared memory: fast, per block. Register memory: per thread. Slow global memory. Faster constant global memory, because it's cached. And things to talk to the host. So here are like five types of memory, and you've got the scope, thread, block, or grid, and the lifetime, and you've also got the speed, the penalty. Okay, so five types of memory, with different levels of locality and different sizes and speeds. Okay, and you can have 100,000 threads of which perhaps only 5,000 run at a time.

And where to declare things: I'll give you the ideas here; if you want the details, you can read them. The thing is that newer versions of CUDA relax some of the restrictions. So, for example, you can declare a variable in your C++ program as managed, and now the variable is also accessible on the device.
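A minimal sketch of that; the variable and the kernel are my own toy example:

    #include <cstdio>
    #include <cuda_runtime.h>

    // A __managed__ variable lives in unified memory: host code and device code
    // can both read and write it by name, with no explicit copies.
    __managed__ float result;

    __global__ void add(float a, float b) { result = a + b; }

    int main() {
        add<<<1, 1>>>(2.0f, 3.0f);
        cudaDeviceSynchronize();               // wait for the kernel to finish
        std::printf("result = %f\n", result);  // the host reads it directly
        return 0;
    }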
It's visible in your global routine, so it's a quick way to get data into a kernel; or you want constant for read-only data. And they're talking here about ways to speed things up with your loads and your stores. The point here is: yeah, don't chase pointers. And pointers have gotten easier lately; in the past, you had to worry whether a pointer was a device or a host pointer, and so on. Yeah: don't chase pointers, don't propagate them.

And we've seen examples of tiling your global data; like we saw with the matrix multiplication, you read tiles into the fast shared memory. They're showing that here, so I'll go through it fast: read subsets in, and then when you're finished, copy them back. You have to design the partition size. Okay. And watch out: synchronization stuff, atomic stuff.

This is, so, I showed you: for the reduce, you do a reduction inside each block, and then you reduce the reduced totals globally. So it's a two-level hierarchy. This is efficient; just to remind you about it, it's important, because inside the one block you've got the shared memory, which is fast to read and write. So: hierarchical atomics, they call it here. Okay.

Matrix multiply as well; I'll tell you again, you partition into blocks, and then they're showing how they speed it up. Their first version got some number of gigaflops, and then they do it better, and then they get it faster still; they got 17 times faster or something. So they're experimenting with different tile sizes, and 16 by 16 was best. Okay. The thing in this slide is that these resources are finite; I've been saying this many times. And I've been saying that if I can get you interested in this stuff, then if you want to read more, you can do it on your own.

Performance considerations: we've seen this before, so I won't repeat it, just this. Chunked-up memory access is faster, but you need adjacent threads to read adjacent addresses of global memory. This means that, at the programming level, an array of structures is a bad idea. Here's a structure, which is the way you might program nicely, the way you're taught in CS; maybe it's nice. But the trouble is that for adjacent records, the keys have a stride between their addresses; there's a gap between them for the value and the flag. So this does not coalesce the accesses. What you want is a structure of arrays.
So there are three arrays: all the keys are in one array, they're not in a structure; and then all the values are in one array and the flags are in one array. And if you want, you put the three arrays in a structure. Okay: now I'm coalescing.

I'm ignoring shared memory banks. The shared memory is in banks, and depending on how you access it, you get conflicts that slow you down; don't worry about it. I mentioned texture and constant memory quickly; it's small and it's fast, and if all the threads in the warp want the same address, it goes nicely. Control flow, we've seen many times.

Occupancy is to keep all of your resources running, all your resources in use, and you do that by having lots of warps waiting to run. You want more warps waiting to run; you always want to have stuff waiting to run, so that it will use free hardware. That's what that talks about. So if all of your warps are stalled for some data, trying to read some global array, then you're losing performance waiting on global memory. So maybe you want to have some other threads that are able to compute while some threads are waiting on global memory; we saw that a day or two ago.

And occupancy, as I've talked about before: you've got these limited resources, registers and shared memory, and you want to optimize. There is a spreadsheet that does this; I'll show you, I have the source linked, it's on the NVIDIA developer site. More threads per block is good. And there are debugging flags which will print out information about your program: how many registers the kernels use and so on, how much shared memory. This is a good occupancy calculator. You put in some information about how many threads you set per block, how many registers per thread, et cetera, et cetera, and then it tells you how efficiently you're using the resources on your machine. And then it varies stuff for you, like block size and registers per thread, and you can look and see how your occupancy changes. So you put in the basic information that's settable by you in your program, and then the spreadsheet goes through and tells you the implications of that. Nice. You put in the shared memory you need for each block and so on, and then it tells you what limits you; it's like doing a linear program. The red thing here is what's limiting your performance: you don't have enough registers here; what you said here is you specified too many registers per thread, and that's what's slowing you down.
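Besides the spreadsheet, the CUDA runtime can answer the same sort of question from inside a program. A hedged sketch, with a stand-in kernel name:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float* x) { /* ... */ }

    int main() {
        int blocksPerSM = 0;
        // How many blocks of 256 threads (with no dynamic shared memory) can be
        // resident on one multiprocessor, given this kernel's register usage?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
        std::printf("resident blocks per SM at 256 threads: %d\n", blocksPerSM);
        return 0;
    }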
444 01:04:29.099 --> 01:04:34.679 So: the number of thread blocks per multiprocessor, and so on. 445 01:04:38.969 --> 01:04:43.949 And it flags the limiting point here, and you can see, if you vary stuff, what happens. 446 01:04:43.949 --> 01:04:52.889 Oh, okay. And other options, you can do things like this: launching a kernel takes some time, 447 01:04:52.889 --> 01:05:01.409 and maybe you want to fuse kernels into bigger, fewer kernels to optimize, to amortize that time. 448 01:05:01.409 --> 01:05:05.639 And that's the context there. Okay. 449 01:05:07.349 --> 01:05:11.010 No questions? 450 01:05:13.469 --> 01:05:17.309 Okay, um. 451 01:05:20.519 --> 01:05:33.269 Yeah, nothing new here; these are your parallel paradigms. You do a reduction, as we've seen: you split the operation out over many threads, and 452 01:05:33.269 --> 01:05:36.599 nothing interesting there. 453 01:05:36.599 --> 01:05:40.349 We've seen these partition ideas. 454 01:05:40.349 --> 01:05:43.980 Cool. 455 01:05:43.980 --> 01:05:49.739 Okay, the reduction, just to review; we've seen it before: sum up a whole array. 456 01:05:49.739 --> 01:05:55.289 Okay, this is starting to get slightly newer here. 457 01:05:55.289 --> 01:06:09.360 So, we had this a couple of days ago: you want to do the reduction in parallel. This is the way which has less locality, so it's not as good. In the first 458 01:06:10.014 --> 01:06:22.735 step of the reduction, you add adjacent elements and store the results in the even slots; in the next step of the reduction, you add elements that are two apart; in the third step, you add elements that are four apart; and finally you get the total sum. 459 01:06:23.215 --> 01:06:33.175 The thing is, you don't get good thread occupancy and utilization, and the elements you're adding get farther and farther apart, so you don't have good coherency there. 460 01:06:33.750 --> 01:06:38.699 This is a better way. So, 461 01:06:40.045 --> 01:06:54.114 basically, our operands are initially farther apart in the array, but the output is packed to the front of the array. What this does is use fewer warps' worth of threads, and in a warp, all the threads get used, 462 01:06:54.385 --> 01:07:01.494 and the active data stays contiguous. So this has better efficiency; it's a better use of the GPU. 463 01:07:03.300 --> 01:07:06.780 And here's the code to do that. 464 01:07:06.780 --> 01:07:14.309 Okay, I'm skipping through that, but note that putting a raw __syncthreads() in the wrong place will hang the machine. 465 01:07:15.449 --> 01:07:22.440 Okay. And you might, okay. 466 01:07:22.440 --> 01:07:34.260 And they do the calculation here: this takes log n steps, and the total amount of work is linear, which is good. The initial one, again, takes log n steps and a linear amount of total work. 467 01:07:34.260 --> 01:07:38.400 Okay. 468 01:07:38.400 --> 01:07:42.150 Here's a new operation called a split operation. So, 469 01:07:43.320 --> 01:07:56.250 it's like a sort. What's happening here is you've got two parallel arrays: the first is an array of flags, the second is an array of elements. 470 01:07:56.250 --> 01:08:09.690 And the flags are just a bit in this case, true and false. What we want to do is partition so that all the true elements come before any of the false elements; it's the sort of thing a quicksort does, for example. 471 01:08:09.690 --> 01:08:17.460 So we want to do this fast, and again, it's a fundamental operation that occurs lots of times; 472 01:08:19.050 --> 01:08:24.869 you build other things on top of the split operation. So.
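[Before the split is worked out, here is what the better reduction just shown might look like; a sketch, not the slide's code, with the kernel name and launch shape assumed. Operands start a large stride apart, the active threads stay packed at the front so whole warps retire early, __syncthreads() sits outside the divergent branch, and each block finishes with one atomicAdd, the two-level hierarchical atomic from earlier:]

```cpp
__global__ void reduceSum(const float* in, float* total, int n) {
    extern __shared__ float s[];              // one float per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Stride starts large and halves each step; the active threads
    // stay contiguous at the front of the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();  // every thread must reach this, hence outside the if
    }

    // One global atomic per block, not per element.
    if (tid == 0) atomicAdd(total, s[0]);
}
// Launch: reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_total, n);
```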
473 01:08:24.869 --> 01:08:33.930 We'd like to do the split fast. You could do it as a sort on the flags, but that's inefficient. So, 474 01:08:35.399 --> 01:08:40.979 an example is to compact out elements; this is a common application. 475 01:08:40.979 --> 01:08:44.729 So, 476 01:08:45.750 --> 01:08:58.229 it's used in things like binning stuff, hashing stuff; it gives you a dope vector for an irregular array. 477 01:08:58.229 --> 01:09:03.479 We'll see later how this segmentation thing is used, 478 01:09:03.479 --> 01:09:06.930 and the splitting thing. 479 01:09:06.930 --> 01:09:10.140 So the question becomes: 480 01:09:10.140 --> 01:09:14.939 how do we do that split 481 01:09:14.939 --> 01:09:22.590 in parallel? You might have a thread for each element, and it writes the element where it should go. But where does it go? 482 01:09:22.590 --> 01:09:26.789 Okay, um. 483 01:09:28.050 --> 01:09:35.789 Let me go back a few slides to show you what's happening here. 484 01:09:35.789 --> 01:09:39.329 We're some slides along already; I'll go back a few. 485 01:09:39.329 --> 01:09:42.569 Okay, to slide 26. Okay. 486 01:09:44.845 --> 01:09:59.725 So, in parallel, we want to write each element where it goes: the 3 goes to there, the 1 goes to the right to there, the 7 goes to the right to there, the 0 comes to the left to there, the 4 goes to the right, the 1 goes to the right, the 6 comes to the left, 487 01:10:00.359 --> 01:10:04.079 and so on; the 3 stays at the end. 488 01:10:04.079 --> 01:10:07.350 So how do we know where this 489 01:10:07.350 --> 01:10:12.270 true 0 goes? 490 01:10:12.270 --> 01:10:18.840 The way we know where it goes depends on the number of true flags to the left of it. 491 01:10:18.840 --> 01:10:33.024 Okay, and let's talk 0-indexing, counting from 0. So this first one goes here because it's a true flag and this element has no true-flagged elements to its left, zero, so it goes to element 0. 492 01:10:33.024 --> 01:10:38.095 This true has one true to its left, so it goes to element 1. 493 01:10:38.880 --> 01:10:43.920 This true has two trues to its left, so it goes to element 2. 494 01:10:43.920 --> 01:10:54.270 So you see, this is a place for the scan: you do an exclusive scan on the true bits, and that gives you an array of where each true 495 01:10:54.270 --> 01:10:59.310 payload variable should be copied to. So this is an application of the exclusive scan. 496 01:11:02.215 --> 01:11:11.635 And that's it for the trues. For the falses, it's the same thing, but you have to add the total number of trues. So here there are 3 trues altogether, 497 01:11:11.635 --> 01:11:21.534 so each false goes to location 3 plus the number of falses to its left: this first false goes to location 3; this false has 1 false to its left, 498 01:11:21.534 --> 01:11:35.635 so it goes to location 4; this false has 2 falses to its left, so it goes to location 5. So to do the split, what we have to do is an exclusive scan on the flag array, counting the trues; 499 01:11:35.935 --> 01:11:38.904 we have to do another exclusive scan on it, counting the falses; 500 01:11:39.270 --> 01:11:52.350 and then to the falses we add in the total number of trues. And now we know where each payload variable should go, and we can copy. So we did two scans and a little arithmetic and, bang, we can do all the copies in parallel. 501 01:11:52.350 --> 01:11:59.039 That's what they're talking about here with the scan operations and so on.
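[The index arithmetic just described, written out sequentially so the two exclusive scans are visible; a host-side sketch, since on the GPU each scan and the final scatter would themselves run in parallel:]

```cpp
#include <vector>

// Split: all true-flagged elements before all false-flagged ones,
// stably, via two exclusive scans plus a scatter.
std::vector<int> split(const std::vector<int>& elem,
                       const std::vector<char>& flag) {
    int n = (int)elem.size();
    std::vector<int> out(n), trueIdx(n), falseIdx(n);
    int trues = 0, falses = 0;
    for (int i = 0; i < n; ++i) {              // the two exclusive scans
        trueIdx[i]  = trues;   trues  += (flag[i] != 0);
        falseIdx[i] = falses;  falses += (flag[i] == 0);
    }
    // Scatter: each copy is independent, so in parallel they can all
    // go at once; two scans and a little arithmetic.
    for (int i = 0; i < n; ++i)
        out[flag[i] ? trueIdx[i] : trues + falseIdx[i]] = elem[i];
    return out;
}
```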
502 01:11:59.039 --> 01:12:06.899 So that shows you an application of scans; it's got many other applications: radix sort, sorting, and so on. 503 01:12:06.899 --> 01:12:15.810 With radix sort, you're sorting based on the least significant 504 01:12:15.810 --> 01:12:22.470 digit; not just a bit, that'd be silly, but a byte or 505 01:12:22.795 --> 01:12:29.154 two bytes, and you want to have all the output sub-arrays contiguous without wasting memory. 506 01:12:29.154 --> 01:12:41.725 So where does each one start? Well, you run through and count the frequency of the least significant digits, and then you do a scan on that, and you have a dope vector for where to write the next set of elements. 507 01:12:42.354 --> 01:12:47.574 So these are applications of scan, and then they show how to engineer the scan operation itself. 508 01:12:47.850 --> 01:13:01.560 It's useless in sequential computing, it doesn't help you, but in parallel computing it helps you a lot. So this is an operation which is a very important operation, but only for parallel computing. 509 01:13:03.899 --> 01:13:09.659 A big idea for today is the scan, as you see. 510 01:13:09.659 --> 01:13:13.529 How do you do it? 511 01:13:13.529 --> 01:13:16.890 We saw before how you do it, so 512 01:13:16.890 --> 01:13:23.729 I won't do it again. So, 513 01:13:23.729 --> 01:13:30.000 if you've got many blocks, you do it inside each block and then you combine the results of the different blocks. 514 01:13:30.000 --> 01:13:39.600 So: results local to each block, and then you take the end number from the first block and add it to all the results of the second block in parallel, and so on. 515 01:13:40.739 --> 01:13:50.100 Yeah, and so these are some operations which, well, these are the base 516 01:13:50.100 --> 01:13:53.100 tools for parallel programming, basically: 517 01:13:53.100 --> 01:13:59.970 splitting, scanning, reduction. And we're going to see generalizations of these. So. 518 01:13:59.970 --> 01:14:12.539 Yeah, sure. 519 01:14:15.180 --> 01:14:20.579 Just a second here. 520 01:14:24.539 --> 01:14:30.210 And let me show you some new things. 521 01:14:30.210 --> 01:14:33.960 Segmented scan is a new one. 522 01:14:34.975 --> 01:14:45.354 So, we had the scan where you compute all the subtotals. In the segmented scan, you put barriers in the array, and you do not propagate the subtotals across a barrier. 523 01:14:45.805 --> 01:14:52.795 So it effectively splits your bigger array up into a number of smaller arrays, and you do a scan on each smaller array. 524 01:14:53.130 --> 01:14:56.250 But the smaller arrays can all be of different sizes. 525 01:14:57.270 --> 01:15:05.460 So the barriers are firewalls; but the big thing is, the smaller arrays are all different sizes, and the whole thing is done in parallel, 526 01:15:05.460 --> 01:15:09.239 in log n time. Um. 527 01:15:10.739 --> 01:15:18.210 So, what we have here, um: 528 01:15:18.210 --> 01:15:22.439 so the flags are 529 01:15:24.149 --> 01:15:28.590 our barriers, so we don't propagate across a barrier. So, 530 01:15:30.300 --> 01:15:33.840 basically, the two barriers here. 531 01:15:35.430 --> 01:15:47.130 And let me look at the results at the bottom; you can skip several things in the middle. We're just going to look at the top row here: it's the head flags, the barriers. 532 01:15:47.130 --> 01:15:52.500 The indices are the things that we are 533 01:15:55.409 --> 01:16:04.529 summing. Okay, I'm getting confused now.
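[One pass of the radix sort described a moment ago, sketched sequentially on the low byte: count the digit frequencies, exclusive-scan the counts into starting offsets (the dope vector), then scatter. Repeating the pass on successive digits, least significant first, gives the full sort:]

```cpp
#include <vector>
#include <cstdint>

// One stable radix-sort pass on the byte at bit position 'shift'.
void radixPass(std::vector<uint32_t>& a, int shift) {
    std::vector<uint32_t> out(a.size());
    int count[256] = {0};
    for (uint32_t x : a) ++count[(x >> shift) & 0xFF];  // digit histogram
    int start[256], sum = 0;
    for (int d = 0; d < 256; ++d) {                     // exclusive scan:
        start[d] = sum; sum += count[d];                // the dope vector
    }
    for (uint32_t x : a)                                // scatter, in order
        out[start[(x >> shift) & 0xFF]++] = x;
    a.swap(out);
}
// Full sort of 32-bit keys: call with shift = 0, 8, 16, 24 in turn.
```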
534 01:16:06.810 --> 01:16:12.270 I'll do this segmented-scan slide next time; we'll just go through it. I'll come back to that slide next time. 535 01:16:14.609 --> 01:16:18.810 But that's how it's implemented, and, um, 536 01:16:18.810 --> 01:16:22.409 so 537 01:16:22.409 --> 01:16:28.319 you can read ahead if you want. And sorting: we'll see how to sort fast and so on. 538 01:16:28.319 --> 01:16:38.699 A reasonable point to stop here. You're welcome to read ahead of me; this is lecture 7 539 01:16:38.699 --> 01:16:42.180 in the Stanford set. And 540 01:16:42.180 --> 01:16:53.130 actually, for fun, I want to spend a couple of minutes introducing you to Thrust 541 01:16:53.130 --> 01:16:59.850 in C++, and then I'll come back to that. 542 01:17:04.319 --> 01:17:07.319 It's all demonstrated by 543 01:17:07.319 --> 01:17:17.069 examples. This is a Thrust program, a C++ program. You include various things there; they're header-only files, parallel versions of the STL. 544 01:17:17.069 --> 01:17:26.970 And the only data type is a vector, and you can define a vector on the host or the device, like an STL vector. 545 01:17:26.970 --> 01:17:34.920 So here is a host vector of integers, and it's 2 to the 24 elements long; its name is there on the slide. 546 01:17:34.920 --> 01:17:39.960 And then you can define a device vector of integers, 547 01:17:39.960 --> 01:17:46.380 int in the angle brackets, and so you can define vectors on the host and on the device. 548 01:17:46.380 --> 01:17:53.699 And you can copy them: this here initializes the device vector from the host vector. 549 01:17:53.699 --> 01:17:58.920 So this will generate a cudaMemcpy, 550 01:17:58.920 --> 01:18:08.369 sort of thing. And so you can copy from host to device and back again, and you can actually access individual elements: 551 01:18:08.369 --> 01:18:12.270 you could subscript it, element 5 or something. 552 01:18:12.270 --> 01:18:18.539 So the only data type is a vector; it's one class, it could be on the host or the device, and you can give it 553 01:18:18.539 --> 01:18:21.689 ints, floats, whatever. 554 01:18:21.689 --> 01:18:31.199 Um, and then every operation, it's like a map/reduce operation. And it's like, so, 555 01:18:31.199 --> 01:18:37.380 well, let me look at sort here, and I'll come back to generate. So we look at sort here. 556 01:18:37.380 --> 01:18:44.729 It takes two arguments: it takes the begin and the end iterators of the vector, and 557 01:18:44.729 --> 01:18:58.289 it obviously does a sort. And these being integers, this will sort on the GPU and will be very fast; it'll be a radix sort. In fact, the sort is fast enough that if you want to sort a big array, 558 01:18:58.289 --> 01:19:07.890 you can make your program faster by just copying from the host to the device, sorting, and copying the result back. Sorting on the device is so fast that it 559 01:19:07.890 --> 01:19:14.489 pays for the overhead of the two-way copy. In any case, this shows the idea here. You see, 560 01:19:14.489 --> 01:19:17.850 you give it the 561 01:19:17.850 --> 01:19:25.104 begin and end, and it sorts on the device. If you gave it a host vector, it would sort on the host. 562 01:19:25.104 --> 01:19:36.114 It's compile-time dispatch: it dispatches because sort is overloaded, one per class, and the dispatch happens at compile time depending on the class of the argument.
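[The program being walked through is essentially the standard first Thrust example; a sketch of the whole thing, with the vector names h_vec and d_vec assumed, including the generate and copy calls discussed next:]

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main() {
    thrust::host_vector<int> h_vec(1 << 24);             // 2^24 ints on the host
    thrust::generate(h_vec.begin(), h_vec.end(), rand);  // fill with rand()
    thrust::device_vector<int> d_vec = h_vec;            // host-to-device copy
    thrust::sort(d_vec.begin(), d_vec.end());            // radix sort on the GPU
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin()); // copy back
    return 0;
}
```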
563 01:19:37.135 --> 01:19:50.064 Back up to generate: again, it takes begin and end iterators for a vector, and it takes a third argument, which is a function with one output, and it calls the function in parallel for each element of the vector. 564 01:19:50.515 --> 01:19:58.975 So this will initialize the host vector, because it calls rand, which we assume exists somewhere, in parallel for each element, to create each element of it. 565 01:19:59.250 --> 01:20:03.300 Yes, so we can do this here: we can have a 566 01:20:03.300 --> 01:20:07.949 function which takes no arguments and just stores the output. 567 01:20:07.949 --> 01:20:12.779 And then copy, down here, does the copy. 568 01:20:12.779 --> 01:20:22.050 It takes the begin and end iterators for the source vector, then it takes the begin iterator for the destination vector, 569 01:20:22.050 --> 01:20:28.140 and it copies the result back from the device to the host. 570 01:20:28.140 --> 01:20:36.239 So this shows you some examples of Thrust: the only class is a vector, 571 01:20:36.239 --> 01:20:48.779 a vector class, whatever, a vector of something, on the host or the device, and you do things like this. This is a map operation; you do map operations and copies. 572 01:20:50.994 --> 01:21:04.944 It shows other examples here: you can access individual elements, you can print them out, and so on. Okay, a good point to stop. So what we'll do next time is more about the scan with barriers, 573 01:21:05.520 --> 01:21:09.689 because it's important, and then we'll get explicitly into Thrust. 574 01:21:11.159 --> 01:21:21.750 So, other than that: see you Thursday... see you Monday, actually, Monday. 575 01:21:25.560 --> 01:21:31.380 We're going to see Connor, Joseph, Lane, Dan, and Randy. 576 01:21:32.430 --> 01:21:37.890 And then if there's time, maybe Ben's Thursday, and I will take the other 577 01:21:37.890 --> 01:21:44.909 five people in the course, four people in the course, and assign you to Thursday and assign you topics. So. 578 01:21:46.649 --> 01:21:52.439 Questions? No? 579 01:21:54.720 --> 01:21:59.100 Have a good weekend. It looks like it's too warm to go skiing. 580 01:21:59.100 --> 01:22:02.880 But.