WEBVTT

00:04:41.098 --> 00:06:56.908
Silence.

00:06:56.908 --> 00:07:13.559
So, good afternoon, class. My universal question is: can you hear me? ... Good, thank you. Okay, so.

00:07:14.639 --> 00:07:35.788
Parallel Computing, February 11th, class 6. I'm going to see if I can clone this; I want to see what you're seeing, though it may cause things to hang up. But...

00:07:38.218 --> 00:08:08.699
Silence.

00:08:10.048 --> 00:08:43.589
Okay. This should be sharing... it was screen sharing, and then it stopped. Wow.

00:08:43.589 --> 00:08:55.649
Okay, things occasionally work. So, what's happening today? First some general stuff, then we'll get to OpenACC.

00:08:55.649 --> 00:09:33.923
I installed the NVIDIA compiler suite. If you want to browse around it, it's freely available; you could also install it on your own machine if you'd like, if your machine has an NVIDIA GPU. To make it work, what you want to do is add its bin directory onto your PATH variable and so on. I've got a little file there which, if you source it, will modify your path. These are the compilers I'd recommend using on parallel.

00:09:35.129 --> 00:10:03.958
They certainly work, but no compiler is better at everything. I did a little example last night where I compiled an OpenMP program, and it ran faster when compiled with g++ than when compiled with nvc++. So no one compiler is the best for all cases, but for anything that's going to target the GPU, use the NVIDIA compiler. And what the NVIDIA compiler is, is just the PGI compiler suite, rebadged and updated a little.

00:10:03.958 --> 00:10:11.999
We're going to do OpenACC first; well, first some general announcements, then I'll get back to that.

00:10:11.999 --> 00:10:27.808
First, the parallel machine: you're welcome to use this machine. Oh, for homework 3, a question: due Monday or Thursday? A week after it was assigned, so...

00:10:27.808 --> 00:10:52.589
Due... when was it put online? Just a sec. Well, we put it online on Monday, so it'd be due Monday, I guess. Um.

00:10:52.589 --> 00:11:06.778
Is Monday a holiday or something? But this is a small class, only 10 students, so I'm being lenient about these things. Okay, come on now.

00:11:06.778 --> 00:11:16.889
Okay, to announce: the machine parallel is available.
00:11:16.889 --> 00:11:54.328
You're welcome to use it for any legal, ethical purpose, even unrelated to this course. You want to use it for your research, or for other people in your lab if you're in a lab? That's fine by me. You want to have fun with it? Fine by me also. Just, you know: no coin mining, no mining, nothing that makes money; those would be the rules. For example, I was running the MilkyWay@home thing using BOINC for a couple of years, though I stopped; not lately. In fact, I'm user 359 in total credit, and as a percentage that's fairly small. But.

00:11:55.558 --> 00:12:00.178
Also, how often is it taken offline?

00:12:00.178 --> 00:12:49.918
For parallel, the intent is to keep parallel online all the time. However, it is a research machine, and so if something happens, I'm the one that has to fix it, and if there were a hardware failure it would be offline permanently, unless the department wanted to spend money to replace it. So that is a risk: you're using a research machine, not a machine with guaranteed permanence. On the other hand, of course, that's true if you use anything. Our supercomputer center used to have a Blue Gene, and they took the Blue Gene offline, so anyone that used the Blue Gene now has to change their code. So that's your risk. But the flip side is it's a reasonably fast, big machine. Okay, um.

00:12:55.438 --> 00:13:28.019
Oh, a new teaching tool I'm playing with at the moment in class... any questions? ... Hello? ... It's that I'm mirroring my iPad onto a window here.

00:13:29.698 --> 00:13:52.019
I want to try that and see if it works out well. Before, I had a laptop with a touch screen that did not work very well at all; it didn't have palm rejection, and it had lag and so on. That was Linux, actually, not handling new devices like touchscreens very well. So we'll see what happens with that.

00:13:52.019 --> 00:14:47.604
Oh, just for fun, real-world electrical engineering: I like gadgets, so my house has two Tesla Powerwalls. They're big batteries, their total capacity is 27 kilowatt-hours, and I've got 8 kilowatts of peak solar panels on the roof. They finally got working, like, Tuesday; I only started the project last August, ahem. In any case, it's fun to see what happens. At the moment the solar panels are generating 2.8 kilowatts of power. The goal is that, you know, over the year my net electrical consumption approaches zero, and if I don't use much power I could survive a two-day blackout. Not that there are very many blackouts here; of course it's fairly reliable. But still, it's cool.

00:14:47.879 --> 00:14:52.438
One more point.
00:14:52.438 --> 00:15:32.759
If you were looking at last year's blog for this course: I change things from time to time. Last year I used Docker for the compilers. It's complicated, not necessary, and it was a security risk, so I'm dropping it; I'm not doing it this year. But Docker is an important industrial tool, and if anyone would like me to spend a little class time on Docker, just so you could put on your resume that you're familiar with Docker, well, then mention it. Other than that, now we are back to OpenACC.

00:15:34.078 --> 00:15:57.688
And... the OpenACC site: we looked at one of these, and the Q&As there are worth looking at, by the way. So... and again, okay.

00:16:05.724 --> 00:16:22.438
Simpler. Okay. Generally motherhood stuff here: analyzing your code is the hardest part; your algorithm has to be parallelizable. Okay, again, this is something... let me actually write it down.

00:16:22.438 --> 00:17:14.189
A chance to use this... let me get... okay. So, okay. Okay, it's not mirroring. Give me a second here; it was mirroring 20 minutes ago, it's not mirroring now. ... Okay, good.

00:17:15.689 --> 00:17:20.249
Silence.

00:17:27.449 --> 00:17:50.909
Unfortunately, I cannot get away from that black boundary, so all I can do is things like this; I just overlap. Well, you'll have to speak up; if I expose the chat window, then things are... It's okay. So, OpenACC, um.

00:17:53.818 --> 00:18:34.078
It's higher level than, say, CUDA, or even OpenMP actually, so it's easier to use, but, you know, perhaps less efficient: slower execution. Okay, so those are your trade-offs here. Okay, and so we can get this up.

00:18:35.699 --> 00:19:01.588
Some overview of Docker, okay. Okay, so some Docker maybe next class or something. Truly.

00:19:02.699 --> 00:19:52.828
Oh, okay, good. Actually, just a second here... okay, I can see the chat window now. If you're curious what my setup is: I've got my main laptop that I'm running Webex on, displaying windows for the mirror of the iPad and the slides that I'm showing. And then I've got a second laptop here also running Webex; if you look you'll see I'm signed in twice, and on the second one I can see the chat window. Okay. OpenACC. So you analyze... this is the motherhood stuff. And just a reminder, the review from last time: you've got your directives, which you can ignore, so you can compile the code without OpenACC at all. Okay, so this is a review.
00:19:52.828 --> 00:20:24.898
Just a review of the reduction thing. If you're doing some operator, like plus or max, each iteration of the loop is applying it, updating this total, like here. By the way, if this is too small for you to see the slides, I'll enlarge them; something I can enlarge now, actually. There, okay, then.

00:20:24.898 --> 00:21:11.219
The reduction clause, which works for a limited set of operators, will have a separate subtotal variable for each thread. Each thread will accumulate its subtotal, and then all the threads' subtotals will be combined, so it's very efficient. Okay. Just a reminder that this was compiling serial, this was compiling to the GPU, and multicore was compiling to the multicore CPU on their particular machine. That's the review. Okay, this slide is new, and this slide deals with

00:21:12.239 --> 00:21:47.183
differences between the CPU memory and the GPU memory. The CPU memory is larger, but the GPU memory is faster, so they're complementary, and they have a bus connecting them, which may be the fastest bus on the computer, sometimes. To throw some numbers at you: on parallel, the CPU memory is 256 gigabytes and the GPU memory is 48 gigabytes, and 48 is very large for a GPU, by the way. Okay. And in any case you're transferring stuff back and forth.

00:21:50.098 --> 00:22:32.038
Now, the thing with the GPU memory is it's very fast going to the CUDA cores; the CUDA cores are the execution cores on the GPU. So the thing with GPU memory, and we'll get to spend some time on it, is that it's very fast, but it also has a very high latency. Getting one byte of data from the GPU memory into a core is going to take a hundred cycles or so, but getting each successive word of data is really fast. Okay. Now, one other thing, anticipating a little: with current versions of the GPU,

00:22:32.038 --> 00:23:06.358
first, there's a common address space for these two memories. You can address a word in either memory; you don't need a separate tag, the tag is a high-order bit of the address, I guess. And there's also a memory manager in current versions, so that blocks of data are copied back and forth automatically as needed. Although if you do it deliberately, you'll get higher performance, perhaps.

00:23:06.358 --> 00:23:31.828
To give you an example: the virtual memory manager on your CPU is pretty good, but I had a paper published with one of my Brazilian collaborators, computing visibility on some terrain, and we actually did better than the virtual memory manager on the host, because we knew what the access pattern would be for the blocks of terrain data.
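For concreteness, the reduction idea just reviewed looks roughly like this in C++ with OpenACC; the array, its size, and the + operator are illustrative rather than taken from the slide.

```cpp
// Sketch of the reduction clause: each thread keeps a private subtotal,
// and the subtotals are combined with + when the loop finishes.
#include <cstdio>

int main() {
    const int n = 1000000;
    static float a[n];
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    double total = 0.0;
    #pragma acc parallel loop reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += a[i];

    printf("total = %f\n", total);   // expect 1000000
}
```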
00:23:31.828 --> 00:24:18.058
That said, usually, almost always, let the computer do the management, just as on the CPU, where it's the paging. Okay, so this is this unified thing. There are two separate ideas here that people blend together. Unified memory is just that the two memories have a common address space: you use an address, and the system can tell at run time where it is. Managed memory takes that and moves the data back and forth as needed. Think of it as a virtual memory manager, with one memory as the backing store and the other as the actual high-speed thing.

00:24:19.618 --> 00:25:07.259
It talks about it here. So the managed part of it is the copying: in the past, when you wrote a program you had to explicitly copy the data back and forth, which was a bit of a pain; now it's handled automatically. Of course, if you copy explicitly you can do fun things like doing it asynchronously: you call the function that starts the data copying, the function returns immediately, you do something else on the CPU, and then you check a flag, and when the GPU has got the data, you do something with it. So if you do it explicitly you can do this overlapping thing, or you can let the GPU manage it, so you can concentrate on high-level stuff. Okay, here, what you see is

00:25:07.259 --> 00:26:20.249
nvc++. Just to hit you with the command line here; maybe we'll run the programs on Monday or something. A couple of ideas. First, the options: -fast says do a reasonable set of optimizations. There's a very large number of different optimization flags, and it says take a sensible subset of them. -acc says compile the OpenACC directives; if you don't give this, it will just ignore them all. The target architecture is Tesla; read NVIDIA. There's a historical reason why they call NVIDIA GPUs Tesla, I mentioned it quickly last time, you can ignore it. If you want to compile for the GPU, you call it tesla. managed says use the managed memory, so the system will page the data back and forth; I don't know what the page size is on the GPU, 1K, 4K, I don't know, but it will page that data back and forth. And -Minfo says print out debugging information; -Minfo=accel prints debugging information about the acceleration.

00:26:20.249 --> 00:26:43.618
And if anyone is unfortunate enough to use Fortran, well, my sympathies to you. I've used Fortran for very many years; I don't actually like it, I like C++ better. Okay, so this is your managed memory where the system pages; if you want to spend the time you can do it better.
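As a sketch of the flags just walked through: the build lines in the comment are an assumption about spelling (they differ between older PGI releases and the newer NVIDIA HPC SDK), and the little program simply relies on managed memory instead of explicit copy clauses.

```cpp
// Possible build lines (check nvc++ --help on your own system):
//   nvc++ -fast -acc -ta=tesla:managed -Minfo=accel saxpy.cpp -o saxpy
//   nvc++ -fast -acc -gpu=managed      -Minfo=accel saxpy.cpp -o saxpy
// Without -acc the pragma is ignored and this is ordinary serial C++.
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float *px = x.data(), *py = y.data();

    // With managed memory the runtime pages x and y between host and GPU
    // as needed; no explicit copyin/copyout clauses are written here.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        py[i] = 2.0f * px[i] + py[i];

    printf("y[0] = %f\n", py[0]);    // expect 4.0
}
```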
00:26:43.618 --> 00:27:18.509
Um, but... this is the asynchronous thing I mentioned here. It's going to take your time, though, so the trade-off is: who's worth more, you or the computer? To give you an idea, parallel, plus the graphics card and everything, is about 10,000 dollars; you could duplicate the parallel machine for less than 10,000 dollars today. So you're saving time on a 10,000-dollar computer versus how much you make. That's how much you optimize the problem. Okay.

00:27:18.509 --> 00:28:43.648
So here they are testing the unified memory. Okay, there are different terms floating around: there's unified memory, there's unified plus managed, and there are all these other things also, which I guess are sort of obsolete now. Another thing you could do in the past was to lock pages of memory on the host into real memory, so on the host they would not be moved by the host virtual memory manager; they'd be locked into real memory on the host. What this does is the device then knew where the data was on the host; it did not have to go through the host virtual memory manager, and therefore it didn't have to work with that. It's not just an efficiency thing. It's also that, if the page on the host is pinned, the GPU, any time it wanted to, could just go onto the bus and read and write to it, which was nice for the GPU. And it's more than a matter of speed; it's a matter of not having to synchronize stuff. Of course that would tie up pages on the host, but for a host like mine with a lot of pages that's not an issue. But now I think they figure, yeah, that's efficient, but you don't need it. So here it's showing you that the unified memory,

00:28:43.648 --> 00:29:09.179
in every case but this one here, and I can't even read which one it is, is within 10% of doing it by hand, and 10% is not a meaningful efficiency difference. Anything under a factor of 2 or 3 for efficiency doesn't matter. Okay, so unified memory, as I mentioned.

00:29:09.179 --> 00:29:57.118
So, basic data management. It's saying everything three times, but that's pedagogically good, to say everything three times. Okay, it's getting the data back and forth. So that bus there is fast, host to device, but device memory to device is even faster. Basic data management: we saw this thing before. If you're going to use data on the GPU, you allocate it on the GPU and you've got to keep stuff in sync, eventually. Okay. Okay, here they're compiling it without managed, and it's just showing some of the flags you've got.
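The overlap idea mentioned above can be written in OpenACC itself with async and wait; a rough sketch, with the queue number, the array size, and the stand-in CPU work all made up for illustration.

```cpp
// Start a host-to-device copy, keep working on the CPU, then wait for it.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[n];
    for (int i = 0; i < n; ++i) a[i] = 1.0f;

    // Begin the copy on async queue 1; control returns to the host right away.
    #pragma acc enter data copyin(a[0:n]) async(1)

    double cpu_work = 0.0;                  // meanwhile, do something on the CPU
    for (int i = 0; i < 100000; ++i) cpu_work += i;

    #pragma acc wait(1)                     // block until the copy has finished

    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i) a[i] *= 2.0f;

    #pragma acc exit data copyout(a[0:n])
    printf("a[0] = %f, cpu_work = %f\n", a[0], cpu_work);
}
```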
00:29:57.118 --> 00:30:10.769
The -Minfo=accel flag: you could just say -Minfo and get all of it, I think; hold on a tad. It's just talking about which loops it accelerates here, so you can see what the compiler is thinking.

00:30:11.848 --> 00:31:53.128
Now, data shaping. Okay, so what's happening here: you've got your OpenACC program, loops running on the device. If your data is a simple structure, the compiler can tell what to move; we've just been talking about it, you've got to move the data, copy the data back and forth. If things are simple, the compiler can figure this out on its own. But if things are not simple, you may want to tell the compiler what to copy and in which direction, and even if the compiler can figure it out, you might still understand your own program better than the compiler can infer. So if you tell the compiler how to copy the data, it may do better. In particular, you may realize you don't need to do some copies: you copy data to the GPU and you do not need to copy it back; you don't need it back on the host, the GPU didn't modify it. But here's the thing: unless the compiler can prove that the data is not going to get modified on the GPU, and can also prove that the host is not going to need it again, it's going to have to generate code to copy that input data back from the GPU, from the device to the host, which is possibly a wasted copy. But you can tell the compiler: no, this data goes in to the device, but it doesn't need to come out from the device. You see that sort of thing.

00:31:53.128 --> 00:32:27.088
So the compiler would generate correct code if you didn't do this, but it's going to be slow correct code. So these are these copy directives here: copy, where you tell the compiler (this is going to be one of my lines) to copy this array in at the start and out at the end of using the device; copyin, just in at the start; copyout, just out at the end; and create, which is just like a malloc on the device. So, okay, now what's going on in here

00:32:28.739 --> 00:33:08.909
is that you may have to tell it how big the array is. I'm going to see if I can actually get this a touch bigger for you, because I'm thinking... Okay, I got it a touch bigger for you, until I need the iPad again. This may help you here, and I can still see the chat window if you've got questions. Yeah, okay. I'm trying to set this; it doesn't work.

00:33:08.909 --> 00:33:14.818
Okay, now array shaping: when you do your copying, you may have to tell it the size.
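One way the copy directions just described look on a kernel, with illustrative arrays and arithmetic:

```cpp
// copyin for inputs the device only reads, copyout for the result, so no
// wasted transfers are generated. Without these clauses the compiler must be
// conservative and may move data both ways.
#include <cstdio>

int main() {
    const int n = 1 << 16;
    static float a[n], b[n], c[n];
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2.0f * i; }

    #pragma acc parallel loop copyin(a, b) copyout(c)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);   // expect 30
}
```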
00:33:14.818 --> 00:33:48.118
You have to tell it the size of the array, the length, and if it's a two-dimensional array, the sizes. That's what that is. Okay. Again, if it's some complicated data type, the compiler may not be able to easily determine the size. Okay, so that's just the copyin part. Here's an example: you might want to copy in only part of the array, right? The compiler is not going to know that you just want part of it. Okay.

00:33:48.118 --> 00:35:13.018
Here's an example: we've got this loop, and it's doing your iteration for your heat-flow problem. This iterates inside the GPU, and the second one copies stuff out and back. So we're copying in A, and we're copying Anew both ways. And in the second one we're copying Anew in and A out, because of what's inside it. So in the first one, why are we copying Anew in both directions and not just in? Well, because inside the loop here, it is both reading and writing Anew. I think what's going to happen is each iteration inside the loop here is being put on a separate thread, and because they could affect each other, that's why we've got two arrays, Anew and A. And so we say Anew goes both ways; it gets read and then it gets written, and so on. Although I'm a little uncertain why you don't just need copyout there, but okay.

00:35:13.018 --> 00:35:58.378
Okay, and we compile with and without managed. Here the system is determining these copies and generating them. So it can do that, and it turned out that when we tried to get explicit, it got slower: three times slower than serial, and a hundred times slower than what we had before. So what happened? Well, you can profile the thing.

00:35:58.378 --> 00:37:12.389
They'll show you some profiling tools later, but they're showing what's running, and I'll hit this in more detail later; you can see what overlaps and what's taking the time. A stream in CUDA is basically just a sequential sequence of calls, effectively. And again, going through this quickly, what it determines is that most of the time is spent on the data copying. Very little time is spent on the computation; most of the time is waiting for the data, which takes a surprising amount of the time. What they do here is find the data copying, the data movement: this is device to host, this is host to device. Host to device is like 35%, device to host is like 60%, and everything else is like 5%.
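A sketch of the kind of loop pair being discussed, a Jacobi-style heat-flow update; the names A and Anew follow the discussion, while the bounds, indexing, and missing driver are assumptions (it is a fragment meant to be called from your own code). With the data clauses attached to each loop like this, the arrays cross the bus every time the loops run, which is the slowdown examined next.

```cpp
// Heat-flow style update with per-loop data clauses and array shaping.
// The [0:n*m] shape tells the compiler how much of each array to move.
void step(float *A, float *Anew, int n, int m) {
    #pragma acc parallel loop copyin(A[0:n*m]) copy(Anew[0:n*m])
    for (int j = 1; j < n - 1; ++j)
        for (int i = 1; i < m - 1; ++i)
            Anew[j*m + i] = 0.25f * (A[j*m + i + 1] + A[j*m + i - 1]
                                   + A[(j+1)*m + i] + A[(j-1)*m + i]);

    #pragma acc parallel loop copyin(Anew[0:n*m]) copyout(A[0:n*m])
    for (int j = 1; j < n - 1; ++j)
        for (int i = 1; i < m - 1; ++i)
            A[j*m + i] = Anew[j*m + i];
}
```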
00:37:14.278 --> 00:38:20.820
Um, why device to host is more than host to device is a good question. So the problem here, and this is getting into subtleties of what OpenACC does, is that it's doing the complete copying separately for each iteration of the loop, which is crazy. Each iteration uses, like, four elements of A and one element of Anew, but it's copying everything each time. When you tell it to copy explicitly, because this is applying to each separate parallel thread (the detail down at the bottom), for each inner iteration it's copying everything both ways. So that's what's taking all the time. Well, it's not copying everything, but each iteration is copying, and that is crazy.

00:38:20.820 --> 00:39:08.099
Okay, optimize. And they're just talking about here that you have to be careful, because this is applying to each separate thread. I'm going through these fast, giving my take on it; I can slow down if you want. And what they're saying is what I just said: the copying is happening basically on each iteration of the loops.

00:39:10.800 --> 00:39:49.949
And they're talking about ways here to speed things up, and what's happening here is that the fix will be reducing the amount of copying. We have another, higher-level data construct here; basically we're copying Anew, and where before it was copied in and out, now we're just copying it in. So, rebuild the code; it generates some things, and what happens is,

00:39:52.320 --> 00:41:11.730
well, there's some interesting stuff here. What's happening is, this is the -Minfo flag with the compiler: it's generating information about how it's mapping the program to the NVIDIA GPU. So again, it's got threads. Well, there's a warp of 32 threads, and you group several warps together, actually, so you get a block of threads, and then you get basically a number of blocks. And what it's talking about here is how it's mapping it: it's going to take 128 iterations of the loop to be one block of threads, and what this here is, if you were writing CUDA, is how you'd be indexing that particular thread within the block. If you have more than 128 iterations, it will generate separate blocks, and this would be the index for which block; and the lot of blocks would be called a gang of blocks.
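And a sketch of the fix being described: one enclosing data region so the arrays move once rather than once per pass (the iteration count, bounds, and driver are again illustrative).

```cpp
// A is copied in once and back out once at the closing brace of the data
// region; Anew lives only on the device (create), never on the bus.
void solve(float *A, float *Anew, int n, int m, int iters) {
    #pragma acc data copy(A[0:n*m]) create(Anew[0:n*m])
    {
        for (int it = 0; it < iters; ++it) {
            #pragma acc parallel loop
            for (int j = 1; j < n - 1; ++j)
                for (int i = 1; i < m - 1; ++i)
                    Anew[j*m + i] = 0.25f * (A[j*m + i + 1] + A[j*m + i - 1]
                                           + A[(j+1)*m + i] + A[(j-1)*m + i]);

            #pragma acc parallel loop
            for (int j = 1; j < n - 1; ++j)
                for (int i = 1; i < m - 1; ++i)
                    A[j*m + i] = Anew[j*m + i];
        }
    }   // A is copied back to the host here, at the end of the data region
}
```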
00:41:11.730 --> 00:41:28.079
The threads in a block are called a vector of threads. Why it's .x is that you can actually imagine your threads and blocks to be in a three-dimensional array of threads and blocks; that's syntactic sugar, actually.

00:41:31.530 --> 00:41:54.960
Okay, so here's what happened. You tried to get explicit with the data copying in and out; the first attempt got it wrong, and the program got a hundred times slower. So now you've got it right, and what you've got is something a few percent faster than when you let the compiler do it. So what's the lesson? Let the compiler do it.

00:41:56.880 --> 00:42:49.139
Unified managed memory, okay. One point is that the code here was nice, simple code; it was going through the array in a sequential, predictable manner. If you had a random type of access... oh, what do they love in CS 1, linked lists: you've got some linked list, say you're doing pointer chasing; that would be horribly slow on the GPU. But this, simply working your way through an array, goes fast. Although actually, NVIDIA is aware that people like to use pointers and linked lists, so they are trying to make that faster in their current hardware. I don't often find pointers useful; that's just me. Okay.

00:42:52.500 --> 00:43:37.110
Other things: you can explicitly synchronize data any time you want. So, the synchronization thing. Again, you've got your many cores on the host, so you can be doing something on the CPU at the same time you're doing something on the GPU; you need lower-level code to do it, certainly, but then you might occasionally want to synchronize explicitly, not just wait for the thread to end, and that's what this does. Update self and device: self is the host. Okay.

00:43:41.789 --> 00:44:23.429
An example would be: okay, we have some loop that's going on for a while, so the braces here and here, and whatever. We want to ensure that the data on the device, because it's been sitting around on the device, gets updated back to the host. And we want to do this while we're still inside this bigger block. We could end the block and start a new block, but that's slow because it'll be copying more.

00:44:25.650 --> 00:44:46.349
Unstructured data. Okay. Now, what did we have up to now? Go back a page or two: you'd have a block (stop it), you'd have a block, and you'd copy data in at the start and copy data out at the end.
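A small sketch of that update idea: refresh the host copy in the middle of a larger data region instead of closing and reopening it; the sizes and the progress print are made up.

```cpp
// 'update self' brings the device values back to the host while the
// enclosing data region stays open (self is a synonym for host).
#include <cstdio>

int main() {
    const int n = 1000;
    static double x[n];

    #pragma acc data copy(x[0:n])
    {
        for (int step = 0; step < 10; ++step) {
            #pragma acc parallel loop present(x[0:n])
            for (int i = 0; i < n; ++i)
                x[i] += 1.0;

            #pragma acc update self(x[0:n])
            printf("step %d: x[0] = %f\n", step, x[0]);
        }
    }
}
```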
00:44:46.349 --> 00:45:18.929
And it's a syntactic hierarchy; lexical scoping, they would call it, I guess. The thing is, maybe your program has some producer-consumer relationship between different routines or something, and there's not a simple hierarchy, or it would be difficult to put your program into a simple hierarchy. So what we're talking about here are explicit allocations and deallocations.

00:45:18.929 --> 00:46:05.010
To tell you what I'm talking about: in C++, a variable can start its lifetime when you enter a block; it gets allocated, created, and then it gets destroyed when you leave the block. That's what we had before. Or you can do things like mallocs and frees, which are explicit, and put things on the heap: you explicitly create and allocate the variable whenever you want, and you explicitly free it whenever you want, when you're finished with it, which could be in another routine. There's not this inclusion hierarchy where stuff gets created at the start of a block and destroyed at the end. So we've got that with OpenACC also.

00:46:05.010 --> 00:47:04.559
The enter data directive: you say it whenever you want, and it creates the data; then exit data destroys the data. You can do it whenever you want. So they talk about that here. And they could exist in different functions. You've got some complicated producer-consumer thing; your window manager is creating some data structure and giving it to the user. If you look at how the X Window System was implemented, they had a lot of problems deciding at what point, you know, who constructs an array that's needed by someone else, and then who destroys it. It's a real mess and leads to a lot of programming errors; you'll get that here too if you do it a lot, but...

00:47:04.559 --> 00:47:52.949
Okay, I'm skipping through here. Unstructured: your simple thing, a parallel loop. Okay, you could say here, in the first fragment, we're going to copy a and b to the device and we're going to create an array c on the device. Then we run the loop, and at the end we copy c out to the host and we delete a and b from the device. So you can get explicit like that if you want. It's doing mallocs and frees on the device, essentially. Well, exactly, actually. Okay.
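A sketch of exactly that sequence with enter data and exit data; the names a, b, c follow the discussion, and the sizes are illustrative.

```cpp
// Copy a and b to the device, create c there, run the loop, then copy c out
// and delete a and b from the device.
#include <cstdio>

int main() {
    const int n = 1 << 16;
    static float a[n], b[n], c[n];
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2.0f * i; }

    #pragma acc enter data copyin(a[0:n], b[0:n]) create(c[0:n])

    #pragma acc parallel loop present(a, b, c)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    // copyout moves c back and frees it on the device; delete just frees
    // a and b on the device without copying them back.
    #pragma acc exit data copyout(c[0:n]) delete(a, b)

    printf("c[42] = %f\n", c[42]);
}
```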
00:47:54.329 --> 00:48:40.320
So the structured thing is only within a single function. Again, my best case for the unstructured version is when the structure doesn't work and you've got some producer-consumer, caller-callee concept. Windowing systems, say: you've got an event loop, you get an event handler that gets called when an event happens, like a key press, and then something gets put on a queue and given to the user or whatever. It's not simple and hierarchical. Then you use the unstructured thing; but if you don't explicitly deallocate, things start growing. So.

00:48:40.320 --> 00:49:36.750
Giving an example: they allocate in one function, called allocate, and free in another function. So if you look at what's happening up here: in the allocate-array function it's allocating something on the host with malloc, and it's allocating something on the device with the enter data create. And then the deallocate frees it on the device and then frees it on the host. And then what main does is it calls allocate-array, allocating everything on host and device; then there's a parallel loop, and this is going to run on the device; and then it deallocates everything.

00:49:37.949 --> 00:49:56.340
Now, if you tried to compile a program like this and you turned optimization on, how fast do you think it would run? Any idea, with a good optimizer?

00:50:06.989 --> 00:50:30.269
Silence.

00:50:31.289 --> 00:50:35.070
And I think it's locked up again. I'll use the chat window.

00:50:42.449 --> 00:50:54.539
Silence.

00:50:54.539 --> 00:51:19.079
Any ideas? This program here?

00:51:21.900 --> 00:52:07.559
Silence. ... Okay.

00:52:19.079 --> 00:52:25.710
Silence.

00:52:25.710 --> 00:53:04.469
You see, this is not just a silly thing; this is a point. If you're trying to do timing tests on computers, you do a program like this, and you have to be careful: the optimizer will go crazy. If you don't have print statements and so on, the optimizer will say: yeah, if I don't do any work at all, if I compile the program down to nothing, we'll get the same answer, which is nothing. You see the problem. Again, when you're doing timing tests, you've got to worry about that.

00:53:06.989 --> 00:53:17.670
Okay, next point: structs. Okay, this is an issue called deep copies.
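A sketch of the allocate-in-one-function, free-in-another pattern; the function names are invented, not the slide's, and the final print illustrates the timing-test point, since without some observable result an aggressive optimizer may legally discard the work.

```cpp
// Allocation happens on both host and device in one function, deallocation
// in another, with a parallel loop in between.
#include <cstdlib>
#include <cstdio>

float *allocate_array(int n) {
    float *a = (float *)malloc(n * sizeof(float));   // host allocation
    #pragma acc enter data create(a[0:n])            // device allocation
    return a;
}

void deallocate_array(float *a, int n) {
    #pragma acc exit data delete(a[0:n])             // free on the device
    free(a);                                         // free on the host
}

int main() {
    const int n = 1 << 20;
    float *a = allocate_array(n);

    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * i;

    // Print one element so the computation has a visible effect and cannot
    // be optimized away wholesale.
    #pragma acc update self(a[100:1])
    printf("a[100] = %f\n", a[100]);

    deallocate_array(a, n);
}
```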
00:53:17.670 --> 00:53:49.440
Here we're getting into the issue of deep copies. So you've got these hierarchical classes, in C++ in particular, where some of the members are pointers or they have variable size. This creates an issue when you're copying data of this type from host to device, or if you're serializing it, storing it somewhere. Um, so.

00:53:49.440 --> 00:54:12.119
This one is easy: this struct here, float3, is three floats, four bytes each probably. It's easy to copy that. So you say data create, no trouble; the compiler knows each element of float3 is 12 bytes, and that one is easy.

00:54:13.380 --> 00:55:10.110
The hard part is something like this. You see the data type: the vector contains a pointer to another variable, which is who knows where, probably on the heap, but you can't guarantee that. So now, what happens when you copy a variable of this class to the device? If you do the simple copy, you're copying this pointer here, okay, float star, but you're not copying the target of the pointer. And in fact, if you copy the pointer itself down, with unified addressing it's a valid pointer on the device, pointing back to the host, which is going to be really inefficient to use.

00:55:10.110 --> 00:55:46.050
Okay, so what you actually want to do, if you're copying a variable of this type to the device, is the simple top-level copy, and then, as a second step, you want to allocate space for this on the device and update the pointer. 'Deep copy' is the term here. Okay, my mirror program hung up; let me start it again so I can write that down, just a second here.

00:55:50.489 --> 00:56:00.059
Silence.

00:56:03.750 --> 00:57:35.190
Okay... just realized that. Okay. So, right, the term here is deep copy: a deep copy, basically of a class, say with pointers inside it, to the device. Okay, you see, you can't just do the superficial top-level copy. So the dynamic member is this thing here. OpenACC cannot easily do that automatically; you have to do it. So you copy the struct, and then you allocate space there and copy that; you can put it in a function, but that's a pain.
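A sketch of the manual deep copy being described: shallow-copy the struct, then allocate and copy the pointed-to buffer so the device copy's pointer gets fixed up; the struct layout is an assumption, a length plus a float pointer.

```cpp
// Manual deep copy of a struct with a dynamic member.
#include <cstdlib>
#include <cstdio>

struct Vector {
    int    n;
    float *data;     // dynamic member: the hard part of the copy
};

int main() {
    Vector v;
    v.n = 1000;
    v.data = (float *)malloc(v.n * sizeof(float));
    for (int i = 0; i < v.n; ++i) v.data[i] = 1.0f;

    // Step 1: shallow copy of the struct. Step 2: allocate and copy the
    // buffer; because v is already present, the runtime attaches it, i.e.
    // rewrites the device copy's data pointer to the device buffer.
    #pragma acc enter data copyin(v)
    #pragma acc enter data copyin(v.data[0:v.n])

    #pragma acc parallel loop present(v)
    for (int i = 0; i < v.n; ++i)
        v.data[i] *= 2.0f;

    #pragma acc exit data copyout(v.data[0:v.n])
    #pragma acc exit data delete(v)

    printf("v.data[0] = %f\n", v.data[0]);   // expect 2.0
    free(v.data);
}
```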
00:57:37.050 --> 00:58:05.010
A programming technique: one way I handle stuff like this in my own code, for variable-size arrays, is I just pick a maximum size that's something reasonable and allocate all the arrays at the maximum size. Now they're not variable any more, which makes life easier for me. It wastes some memory; the question is how much memory it's wasting.

00:58:05.010 --> 00:58:46.949
Okay, so we have this space here... C++, same thing. Well, here is a cool concept: you're writing your class, and in your constructor (you see, this is a reason these are not hierarchical) you do the enter data, and you do the exit data in the destructor. So again, the enter does an allocate on the device, and the exit effectively does a free on the device. So you can do something like this here, but... and then here, I guess, you'd also have to update pointers or something, maybe.

00:58:48.360 --> 00:59:43.380
Okay, so this is synchronization, and this is the issue with deep copying; so over here you deep copy. Okay. And this is a case where you need to do the updating, and so on. Okay. Closing remarks: we saw this unified memory, and I just mentioned it without much more detail. The second point is that you may want to tell the OpenACC system which data is going into and coming back from the device, though if you do it badly you'll make things worse. And unstructured data is like malloc and free on the device: enter and exit. So.

00:59:44.550 --> 01:00:25.949
Okay, questions? That's about week 2 here. Oh, yeah... okay, I don't see anything in the chat window. Just looking at 3 here now. ... Okay, okay, still on? No, it's that my mirroring program for the iPad keeps hanging. Okay, more stuff here. Okay.

01:00:30.630 --> 01:02:46.559
Okay, frankly, I don't find their way of demystifying this... it doesn't demystify it for me. I'm going to skip through this somewhat. Okay, what's going on here is that... I'd write this down, but my... again, just a second. ... Okay. ... Oh, okay. Okay, so we have here your... and then you might see, too, here... So we have a worker, which is like a thread; a vector, which is a block of threads; and, again, a gang, which is a set of blocks or something. Okay.
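A sketch of that constructor/destructor idea: the enter data happens when the object is created and the exit data when it is destroyed, so the device allocation follows the C++ object's lifetime; the class and member names are invented.

```cpp
// Device lifetime tied to object lifetime via constructor/destructor.
#include <cstdio>

class DeviceArray {
public:
    explicit DeviceArray(int n) : n_(n), data_(new float[n]) {
        // Put the object itself and its buffer on the device.
        #pragma acc enter data copyin(this[0:1]) create(data_[0:n_])
    }
    ~DeviceArray() {
        #pragma acc exit data delete(data_[0:n_], this[0:1])
        delete[] data_;
    }
    void fill(float v) {
        #pragma acc parallel loop present(this[0:1])
        for (int i = 0; i < n_; ++i) data_[i] = v;
    }
    float first() {
        #pragma acc update self(data_[0:1])   // sync one element back
        return data_[0];
    }
private:
    int    n_;
    float *data_;
};

int main() {
    DeviceArray a(1000);
    a.fill(3.0f);
    printf("%f\n", a.first());
}   // the destructor frees the device copy here
```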
01:02:46.559 --> 01:03:27.690
The point is that the threads in a vector cooperate much more closely than the blocks in a gang. And they're saying that up here... we're going way too far. It's sort of silly here: gangs operate independently. Well, yeah. So, the set of blocks here... again, just a second here.

01:03:33.119 --> 01:03:52.349
Silence.

01:03:52.349 --> 01:03:57.420
Well, my problem is that my mirroring program keeps hanging on me.

01:03:57.420 --> 01:05:22.409
Silence. ... Okay. ... Up again. ... Um. ... Oh, okay. So.

01:05:26.070 --> 01:06:16.679
Yeah, so the threads in a vector can cooperate more. Okay, so this is somewhat on the fly here; I don't even understand it that much. Okay. But the point is that we have different levels of cooperation here, I'm saying. A gang... they have optional shared memory, and they can synchronize, and so on. Okay.

01:06:16.679 --> 01:06:48.389
Okay, now you can profile stuff; I'll hit that more later. The executive summary of these slides is that you can profile stuff and you can see how much time is spent on copying data both ways. Yeah, okay. That's the executive summary of that. Okay.

01:06:55.440 --> 01:07:34.050
Okay, here's a new thing. If you've got nested loops, you can collapse the nested loops into one one-dimensional loop, and that can sometimes... well, one bigger loop may be optimized better. So there are lots of clauses like that. So here it's accessing a two-dimensional array, right? Collapsing it effectively turns that into accessing a 16-element one-dimensional array. All right.

01:07:35.639 --> 01:08:22.350
There's some overhead in starting and stopping each thread, and there's the number of iterations of the parallel loop. So the concept is: see, up here we've got 16 separate iterations and each one's very small. If we do some collapsing and merging, maybe there are fewer iterations and each iteration's got more work in it, so it may work better. And you tell the compiler to do that with collapse, and... wow, we got 3% faster.

01:08:24.060 --> 01:08:42.239
Another thing you can do is to say: again, you're iterating over a big two-dimensional array, and you may want to split it up into tiles and put each tile on a separate parallel thread or something. And again, it depends on the locality of reference of the data.
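A sketch of the collapse clause just described, with made-up sizes:

```cpp
// collapse(2) fuses the j and i loops into one 16-iteration loop before the
// iterations are split across gangs and vector lanes, giving the compiler a
// single larger iteration space to schedule.
#include <cstdio>

int main() {
    const int n = 4, m = 4;                 // 16 tiny iterations total
    static float a[n][m];

    #pragma acc parallel loop collapse(2)
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            a[j][i] = j * m + i;

    printf("a[3][3] = %f\n", a[3][3]);      // expect 15
}
```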
01:08:43.560 --> 01:09:11.850
It might help, and might perhaps be more efficient. And you do that with the tile clause here. In matrix multiplication or something, this is something you might almost have to do in your code: rewrite your algorithm so that it's local. But okay, you do tiling and you get... up here.

01:09:15.539 --> 01:10:07.319
And how does it work? Executive summary: there's no point to it on the CPU; it doesn't matter there, because of the Xeon. Okay, so again, on your host it takes some time to get something out of the physical DRAM, so there's the caching, but the Xeon does caching so well that you don't need to worry about it. I once wrote a program to try to determine the effect of having the working set, the amount of memory it actually used, be bigger than the size of the small high-speed cache, and I could not detect the difference, actually, because the Xeon was smarter than me in that sense; it was just that good. Yeah.

01:10:07.319 --> 01:10:38.100
Okay, on the GPU they used this tiling idea in this example here, and it got a little faster sometimes, 10% faster; if your tiles were too small, 25% slower. So: big tiles, it's a little faster, probably not worth it; another one, 13%. Okay.

01:10:42.720 --> 01:11:43.649
Now, this can be interesting: here you're telling OpenACC what you want to try to put into separate threads in the same thread block versus separate blocks or something; telling it what level of parallelism. So what they're saying here is, okay, basically use the finer levels of parallelism on the innermost loop. Vector would be like the separate threads in a thread block, and then there are separate blocks; worker is an intermediate thing that's sort of vaguely defined, and gang would be the separate blocks. So.

01:11:48.810 --> 01:12:39.180
This one says... it's like a critical loop in OpenMP, so that this gets run, in this particular loop at least, sequentially. And the implications up the line... again, when we see CUDA directly, this will be mapped to how many threads are in a thread block and how many blocks and so on. Q is going to be something like 32, or some multiple of 32; 1024 would be a typical value for Q here.

01:12:39.180 --> 01:12:49.140
Now, you might ask yourself: well, why not just have a really high value for this innermost parallelization?
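A sketch of the tile clause and the gang/vector markup being discussed; the tile sizes, array sizes, and the vector length are illustrative choices, not recommendations.

```cpp
// Two loop nests: one tiled, one with explicit gang/vector levels.
#include <cstdio>

const int n = 512, m = 512;
static float a[n][m], b[n][m];

int main() {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            a[j][i] = j + i;

    // tile(32,32): strip-mine the j/i loops into 32x32 tiles handled
    // together, which can improve locality of reference on the GPU.
    #pragma acc parallel loop tile(32, 32)
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            b[j][i] = 2.0f * a[j][i];

    // Explicit levels: outer loop over gangs (thread blocks), inner loop
    // over vector lanes (threads within a block), with a chosen length.
    #pragma acc parallel loop gang vector_length(128)
    for (int j = 0; j < n; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < m; ++i)
            b[j][i] += 1.0f;
    }

    printf("b[5][5] = %f\n", b[5][5]);
}
```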
01:12:49.140 --> 01:13:39.180
That is, why not just a lot of threads in a thread block? Why do you need the higher levels of parallelism, like multiple thread blocks? Well, the reason is that there are some limited resources available, and if you have more parallel threads in the same block, you're using up those limited resources (we'll get into more detail later), which will slow down the program. So sometimes, if you have less parallelism at the lower level, your program will actually run faster. Okay, so you can have fun here collapsing and vectoring and... it didn't help. Okay.

01:13:42.720 --> 01:14:22.229
Basically, don't worry about these fine details of optimization; I'll just give the executive summary of this slide. OpenACC doesn't cleanly map to the NVIDIA hardware, but vectors are threads in a thread block, those are the interior level, and then the gangs are the multiple blocks. Okay, what's that? Okay, that's a nice point to stop now.

01:14:23.939 --> 01:15:10.470
So what we did today is we mostly finished off OpenACC. I may hit some advanced topics on Monday, and maybe show and run some simple programs, and then we'll move on to getting more directly onto NVIDIA. If you wish to get ahead of me, you can: I just downloaded their teaching kit here, I'm going through some of their slides, and you can actually look at that yourself if you'd like to get ahead of me. Any questions?

01:15:10.470 --> 01:15:58.170
Silence. ... Good, anyone still there that's seeing this now? Wow, my solar panels... okay, they were at 3 kilowatts; generating only one and a half kilowatts now. You're still here? Okay, Joe, that's good. I never quite know; in a physical class I can look up and see, but... Okay, questions?

01:15:59.729 --> 01:16:36.989
If there are no questions, then have a good weekend. I'll do some skiing or something, and... feedback is welcome; I'll do a little blurb on Docker, maybe Monday, I don't know. Other than that, if no questions, then see you next time. Okay.

01:16:43.260 --> 01:17:47.189
Silence. ... Okay. ... Oh. ... Silence.