WEBVTT 1 00:04:09.688 --> 00:04:09.778 I 2 00:04:12.503 --> 00:06:07.163 am. 3 00:06:10.108 --> 00:06:22.678 Silence. 4 00:06:22.678 --> 00:06:25.918 Silence. 5 00:06:28.108 --> 00:06:33.658 Silence. 6 00:06:40.949 --> 00:06:48.088 Silence. 7 00:07:10.858 --> 00:07:14.788 Okay, good afternoon, parallel class. So. 8 00:07:14.788 --> 00:07:22.079 Monday, March, continuing on talking about NVIDIA parallel stuff and so on. And. 9 00:07:22.079 --> 00:07:26.278 I assume that people can hear me, but just in case. 10 00:07:28.108 --> 00:07:37.228 If you can hear me, thank you. Okay, so. 11 00:07:37.228 --> 00:07:41.579 What we have happening today is. 12 00:07:43.709 --> 00:07:54.478 A blurb on virtualization and Docker, there was a request for it, while continuing on with NVIDIA, because it's the biggest. 13 00:07:54.478 --> 00:08:03.988 Supercomputer architecture. Not what it started out doing, but really the biggest supercomputer architecture now. And. 14 00:08:03.988 --> 00:08:16.079 Also, I have another homework up, which is a chance for you to do, let me show the homework first, a chance for you to do another talk. 15 00:08:18.149 --> 00:08:22.319 So, a second student talk, starting. 16 00:08:23.879 --> 00:08:31.228 Well, Thursday in 10 days, and for that and the next few classes, and so I'm giving you freedom: if 1 week is. 17 00:08:31.228 --> 00:08:35.879 Easier for you than another week, fine, and I'll just fill in unused time with new material. 18 00:08:35.879 --> 00:08:43.948 So do it in teams of 2. So, like you did before, present another parallel tool that we haven't covered in class; there's a lot of them, I've just covered a sampling. 19 00:08:43.948 --> 00:08:47.219 And, for example. 20 00:08:47.219 --> 00:08:54.119 The energy labs have tools, like Kokkos; some cloud-based things; C++ and its parallel. 21 00:08:54.119 --> 00:09:04.198 Facilities in the current version; OpenCL, a competitor to CUDA; or cover 1 of the debugging tools that I've mentioned, but haven't actually shown you.
22 00:09:04.198 --> 00:09:11.938 GPU Technology Conference coming up; you can go to last year's GPU Technology Conference and find something interesting. 23 00:09:12.533 --> 00:09:15.053 And email me your team, 24 00:09:15.083 --> 00:09:16.374 your team name, 25 00:09:16.614 --> 00:09:20.514 who's in your team and what dates you prefer, 26 00:09:20.514 --> 00:09:27.714 and your topic, or maybe even 2 topics; I want to try and have different teams doing different topics. 27 00:09:27.714 --> 00:09:28.134 So. 28 00:09:28.408 --> 00:09:38.519 If we run out of interesting topics, I'll try to take up more, but your wild card is you go to the GPU Technology Conference and find something; there's a few 100 talks there, literally. So. 29 00:09:40.168 --> 00:09:44.639 Okay. 30 00:09:46.438 --> 00:09:49.979 For, I don't think you're here. 31 00:09:49.979 --> 00:09:53.219 So. 32 00:09:54.264 --> 00:10:01.014 So, a virtual view of a system is an idealized, different view that hides certain features from the user. 33 00:10:01.644 --> 00:10:12.533 So, for example, in any modern operating system, modern being defined as the last 50 years or more, the file system is a virtual view into the disk. 34 00:10:12.839 --> 00:10:22.229 You don't access raw blocks, you access files. Um, the virtual memory manager is a virtual view into the memory. You don't access. 35 00:10:22.229 --> 00:10:26.729 You usually don't access real memory addresses; you go through the virtual. 36 00:10:26.729 --> 00:10:33.719 The virtual memory manager, and that adds pluses and it adds minuses. 37 00:10:33.719 --> 00:10:41.458 A big plus is security: with the virtual memory manager, you cannot get at other processes' memory. 38 00:10:41.458 --> 00:10:44.908 Unless you exploit 1 of these holes in Intel. 39 00:10:44.908 --> 00:10:47.908 And. 40 00:10:47.908 --> 00:11:02.879 You know, you get some standardization with a virtual memory manager.
It's not as important how much real memory the machine has; it affects the performance, but it doesn't so much affect what kind of program runs. So the virtualization standardizes things and. 41 00:11:02.879 --> 00:11:13.739 Offers security and protection, and also can make facilities available easily that might not be available otherwise. 42 00:11:13.739 --> 00:11:22.349 I mean, this is pushing the name virtualization, but early machines, for example, did not have hardware floating point. They emulated it. 43 00:11:22.349 --> 00:11:29.489 If you did a floating add, it really called a little function using the integer. 44 00:11:29.489 --> 00:11:34.558 Instructions on the machine, so it ran, I don't know, 5 times slower, but. 45 00:11:34.558 --> 00:11:44.068 It used less hardware. So, in a sense, you could say that the floating point instructions are a virtual instruction set. 46 00:11:44.068 --> 00:11:48.239 That supplemented the actual physical instruction set with more. 47 00:11:48.239 --> 00:11:54.269 With more tools. So, in a sense, it's syntactic sugar. Anybody not know syntactic sugar? 48 00:11:54.269 --> 00:11:57.808 Okay, I mean, the goal is programmer productivity. 49 00:11:57.808 --> 00:12:03.538 What syntactic sugar means is you add new. 50 00:12:03.538 --> 00:12:10.259 New features to a language that make it easier to read and to program, but they're not. 51 00:12:10.259 --> 00:12:15.359 Deep new powerful things in the language; they're convenience things. 52 00:12:15.359 --> 00:12:19.619 Like referring to threads. 53 00:12:19.619 --> 00:12:29.339 And blocks on the GPU as being 2-dimensional or 3-dimensional arrays, when they're really 1-dimensional arrays in the hardware; that's just syntactic sugar. 54 00:12:29.339 --> 00:12:37.918 Okay, so virtualization: there's lots of different levels you can do it at. I've got several levels here, working my way up to things like Docker.
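To make the sugar point concrete, here is a small sketch (plain Python, not CUDA; the function names are mine) of how a 2-dimensional thread index is just a convenience view over a single linear thread number:

```python
# Sketch of the "syntactic sugar" point: a 2-D thread index (x, y) in a
# block of width dim_x is a convenience view over one linear thread
# number, using row-major order (the way CUDA lays threads out).
def flatten_2d(x, y, dim_x):
    return y * dim_x + x          # the real, 1-D thread number

def unflatten_2d(i, dim_x):
    return i % dim_x, i // dim_x  # recover the 2-D convenience view

# Thread (3, 2) in a 16-wide block is linear thread 35.
print(flatten_2d(3, 2, 16))
```

The hardware only ever sees the linear number; the 2-D and 3-D forms exist to make the programmer's index arithmetic easier to read.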
55 00:12:37.918 --> 00:12:50.339 Oh, by the way, reality check: why this is worth spending time on is that Docker is a hot commercial idea. And if you're applying to a company that has some. 56 00:12:50.339 --> 00:12:53.458 Program that scans resumes for. 57 00:12:53.458 --> 00:13:03.658 Keywords, then you can honestly say you know Docker, put it on your resume; and if you go, outside this course, program something in Docker, you can honestly say you've. 58 00:13:03.658 --> 00:13:10.649 Programmed in it. Okay. You're going to defeat the superficial screening tools at their own game. 59 00:13:10.649 --> 00:13:17.038 So, at the very low level with virtualization, you emulate the hardware. 60 00:13:17.038 --> 00:13:23.099 And it can have different instructions, different word lengths, and so on. 61 00:13:23.099 --> 00:13:27.119 Um, it's very flexible, but it's very slow. 62 00:13:27.119 --> 00:13:40.708 And you have some minimal operating system on the hardware, just a minimal, very thin layer that can run the virtual guests on top of the host. The host is the low-level thing; the guests run on top of it. 63 00:13:44.693 --> 00:13:54.774 So at the low level, the big commercial company is VMware; there's several free alternatives, then. 64 00:13:55.948 --> 00:14:03.688 Like KVM and so on. Microsoft has a virtual thing; it's getting better, I believe. I don't know as much about it. 65 00:14:03.688 --> 00:14:15.089 My guess is that VMware is probably better than the free alternatives, if only because they put so much effort into it, and I believe them to be competent people there. 66 00:14:15.089 --> 00:14:20.879 Um, but VMware has actually some free parts to it, but the full thing costs money. 67 00:14:20.879 --> 00:14:27.568 You can run VMware virtual machines for free, basically, but creating them costs money. 68 00:14:28.589 --> 00:14:34.859 So the concept is, you've got your host and then you run your client machines, and if you're.
69 00:14:34.859 --> 00:14:48.359 Virtualized at a very low level, your separate virtual machines might be totally different operating systems. So I've run VMware on laptops, and 1 guest would be Windows and a 2nd guest might be Linux, and they're running simultaneously. 70 00:14:48.359 --> 00:14:58.313 On the same machine. Now, how do they do it? Well, they're seeing their own views into the file system. So they're seeing separate parts of the file system, although there are ways to share parts of the file system. 71 00:14:58.313 --> 00:15:09.744 So, I can create a shared partition that's accessible from both the Linux guest and the Windows guest, let's say. You could have different Windows guests running different versions of Windows. This is actually a commercial. 72 00:15:10.019 --> 00:15:18.869 1 of the commercial appeals of something like this is you can run different variants. You could run a Windows 10 and a Windows. 73 00:15:18.869 --> 00:15:22.979 I don't know, whatever. 74 00:15:22.979 --> 00:15:28.528 You know, different versions of Windows as different guests, all running on the same. 75 00:15:28.528 --> 00:15:32.788 Host. And so this is actually a commercial reason for things like VMware. 76 00:15:32.788 --> 00:15:41.278 Okay, so how do you do it efficiently? You can't just have little subroutines for everything and kill performance by orders of magnitude. 77 00:15:41.278 --> 00:15:44.729 Well, the thing is that in your client program, your guest. 78 00:15:44.729 --> 00:15:59.158 Your client, most of the machine instructions are harmless, and you can prove they're harmless, like you're trapping memory addresses perhaps, and you can statically look at most of the machine instructions and know that they're no danger to the host. 79 00:15:59.158 --> 00:16:02.668 So you let them run, so no penalty. 80 00:16:02.668 --> 00:16:09.658 The powerful instructions, say analogous to things that would, you know, be used in superuser mode.
81 00:16:09.658 --> 00:16:14.759 Ring 0, whatever you call it, you can identify them statically. 82 00:16:14.759 --> 00:16:29.099 In general; this is assuming you're not doing things like creating new code, that is, new instructions, and then executing them. Well, that would be a dangerous instruction. So the powerful ones, you can trap them, and if it's. 83 00:16:29.099 --> 00:16:37.168 Good hardware, a good architecture, it will provide tools that make it easy to trap these powerful instructions. And then you emulate. 84 00:16:37.168 --> 00:16:41.639 To make sure that they're not stomping on someone else's memory and so on. 85 00:16:41.639 --> 00:16:50.129 So, doing this efficiently requires a good instruction set, where the powerful instructions, the dangerous instructions. 86 00:16:50.129 --> 00:16:55.619 Are identifiable, and you set a bit, a status bit, and then. 87 00:16:55.619 --> 00:17:05.249 They get trapped. So if you have the right instruction set, this is efficient; if you don't have the right instruction set, this is horrible. 88 00:17:05.249 --> 00:17:09.989 So, IBM has actually been doing virtual machines for 40 years. They started. 89 00:17:09.989 --> 00:17:22.949 It started in the '60s as a research control program and something called CMS, the Cambridge Monitor System, for Cambridge, Massachusetts, and they changed it to Conversational. 90 00:17:22.949 --> 00:17:27.449 Monitor System or something. And. 91 00:17:28.648 --> 00:17:40.229 And they built it into their instruction set, on something called the System/360, and they've expanded it. So, this has been part of IBM's product line for 40 years. And. 92 00:17:40.229 --> 00:17:47.638 Because now they have their mainframes, they've got the base, and then they can run different clients on the mainframes, and it's efficient. 93 00:17:47.638 --> 00:17:59.189 Another thing VMware does is they actually tweak the code of a guest, perhaps.
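The trap-and-emulate idea above can be sketched as a toy model (pure illustration in Python; the instruction names are invented, and a real hypervisor works at the hardware level, not like this):

```python
# Toy trap-and-emulate model: most guest "instructions" are harmless and
# run directly at full speed; the few privileged ones trap to the host,
# which emulates them after checking they are safe.
PRIVILEGED = {"write_page_table", "io_out", "halt"}  # hypothetical names

def run_guest(program):
    ran_natively, emulated = [], []
    for insn in program:
        if insn in PRIVILEGED:
            emulated.append(insn)      # trap: host checks and emulates
        else:
            ran_natively.append(insn)  # harmless: no penalty at all
    return ran_natively, emulated

native, trapped = run_guest(["add", "io_out", "mul", "halt"])
print(native, trapped)
```

The efficiency argument in the lecture is exactly that the `PRIVILEGED` set is small and statically identifiable, so almost everything falls into the no-penalty branch.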
So they will, if you're running Linux or Windows as a guest. 94 00:17:59.574 --> 00:18:14.273 On VMware, they actually tweak it a little, tweak the operating system. They run a slightly customized version of Windows or Linux perhaps, customized so that the dangerous instructions can be trapped efficiently. They may even, I believe, modify the code. 95 00:18:14.578 --> 00:18:20.548 So what was a dangerous instruction will be replaced with a trap instruction of some sort. So this means that it runs fast. 96 00:18:21.838 --> 00:18:27.479 There's more, but, as I mentioned, VMware is the leader that put a lot of money into this for many years. 97 00:18:27.479 --> 00:18:34.409 Okay, now, done right, a compute-intensive program, I think, has no overhead. 98 00:18:34.409 --> 00:18:47.394 Really, a few percent. But in my experience, and I've actually used VMware off and on for many years, more than 10 years, my experience is that the emulated file system can get awfully bad. You're running. 99 00:18:47.394 --> 00:18:52.374 Say I'm running a Windows guest and I'm doing a system update, a system update which might take. 100 00:18:52.648 --> 00:18:59.308 Half an hour; the Windows update that takes half an hour on a native system will take hours and hours. 101 00:18:59.308 --> 00:19:03.929 Running as a VMware guest, perhaps, and it's slow. 102 00:19:03.929 --> 00:19:18.239 Stuff just gets mapped too many times, and anything that was simple and efficient, maybe contiguous, on a native machine now is going through virtual blocks scattered around the real disk or something. 103 00:19:18.239 --> 00:19:21.808 Okay, but the nice thing is. 104 00:19:21.808 --> 00:19:24.898 Your guests can be all different. 105 00:19:24.898 --> 00:19:38.159 A compute-intensive example, question from Isaac, would be something like matrix multiplication in the guest, perhaps, because you've got n squared data and n cubed computation.
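The n-squared-data, n-cubed-compute point is easy to check with a back-of-the-envelope sketch (Python, 8-byte doubles assumed):

```python
# Classical n x n matrix multiply: ~2*n^3 flops over ~3*n^2 matrix
# elements, so flops per byte grow linearly with n. Big multiplies are
# therefore compute-bound, and virtualization's I/O overhead matters
# little for them.
def flops(n):
    return 2 * n**3                   # n^2 dot products, each ~2n ops

def data_bytes(n, bytes_per_elem=8):
    return 3 * n**2 * bytes_per_elem  # matrices A, B, and C

def intensity(n):
    return flops(n) / data_bytes(n)   # flops per byte; grows with n

print(intensity(12), intensity(1200))
```

So for a large multiply the CPU does enormous work per byte touched, which is why the guest's CPU time should be nearly native even when its file system is slow.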
106 00:19:38.159 --> 00:19:43.499 So, you've got a big matrix multiplication program, a linear algebra thing. 107 00:19:43.499 --> 00:19:48.778 Running in the guest, a big MATLAB job running in a guest in VMware. 108 00:19:48.778 --> 00:19:55.078 My guess is that the CPU time will be almost the same. I haven't tried it, but that's my guess. 109 00:19:55.344 --> 00:20:00.054 It will use more memory, because both the host needs memory and the guest, 110 00:20:00.084 --> 00:20:09.023 and each guest needs memory, and you do not want to swap that memory; you do not want your guests to be swapping memory except rarely. I mean, 111 00:20:09.054 --> 00:20:10.523 that gets into the file system. 112 00:20:12.778 --> 00:20:16.259 Shifting from the host to the VM? 113 00:20:16.733 --> 00:20:25.284 Well, yeah, if you go and read data from the disk. Okay, so the guest is creating a virtual disk, which gets mapped into a file on the real disk. 114 00:20:25.794 --> 00:20:38.544 And you've got different options in VMware: you can assign a whole partition to VMware to be used for the guest disk, or you can assign files in the host file system. 115 00:20:38.544 --> 00:20:49.433 So now, you see, you're going through 2 file systems, and actually the guest file system could be multiple 2-gigabyte files in the host file system, partitioned into 2-gigabyte pieces for management purposes. 116 00:20:50.548 --> 00:20:55.409 And you're going through 2 levels of file system, and my experience from. 117 00:20:55.409 --> 00:21:01.108 Using computers is that virtualizing layer on layer, virtual file systems on file systems, is a very bad idea. 118 00:21:01.108 --> 00:21:05.638 So I'll give you a simple example, if you just look at. 119 00:21:05.638 --> 00:21:09.628 This. Okay, here: old rotating hard drives. 120 00:21:09.628 --> 00:21:15.898 In many cases had 512-byte blocks; your newer. 121 00:21:15.898 --> 00:21:24.388 Drives tend to have 4-kilobyte blocks down at the hardware level.
But then they virtualize. 122 00:21:24.388 --> 00:21:31.888 The file system on top of a virtual view: the drive may pretend that it has 512-byte. 123 00:21:32.394 --> 00:21:47.213 Hardware blocks, to be compatible with the old rotating hard drives. But this means that a virtual 4K block can get not aligned right; things don't get aligned. 124 00:21:47.213 --> 00:21:49.763 Right? So what would be 1. 125 00:21:50.098 --> 00:21:59.969 Access to the disk, if it was native, becomes 2 accesses, because there's a misalignment. You just doubled the cost. 126 00:21:59.969 --> 00:22:03.628 This is why virtualizing file systems, I think, is bad. 127 00:22:03.628 --> 00:22:13.588 Okay, in any case, so VMware: the point is, you've got clients that are totally different. Well, they all have to be Intel-based. If you try to virtualize. 128 00:22:13.588 --> 00:22:21.628 An ARM-type operating system on an Intel base, now you're down to virtualizing at a really low level, and it's going to be bad. 129 00:22:21.628 --> 00:22:36.328 Okay, just as an aside there: when I tried to run VMware in the last year, I guess, it doesn't actually run on Ubuntu anymore, because behind the scenes Linux has been upgrading security. 130 00:22:36.328 --> 00:22:41.519 For example, you now cannot just boot a random file system. 131 00:22:41.519 --> 00:22:44.638 Or a random operating system on an Intel machine now. 132 00:22:45.894 --> 00:22:54.144 There's protection against that, to protect against certain security holes, so they have to be signed, and Windows is signed. 133 00:22:54.864 --> 00:23:09.233 There was a worry that this would freeze out Linux, but the critical Linux modules are actually signed, a cryptographic signature that is checked by the BIOS on your Intel machine now, and there are ways actually for you to sign extra modules of your own.
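The doubling from misalignment is simple to count (a sketch assuming 4 KiB physical sectors, as in the example above):

```python
# If a filesystem's 4 KiB block starts at an offset that is not a
# multiple of the drive's physical 4 KiB sector size, it straddles two
# physical sectors, so one logical access costs two physical accesses.
PHYS_SECTOR = 4096

def physical_accesses(offset, length=4096):
    first = offset // PHYS_SECTOR
    last = (offset + length - 1) // PHYS_SECTOR
    return last - first + 1

print(physical_accesses(0))    # aligned 4 KiB block
print(physical_accesses(512))  # off by one legacy 512-byte sector
```

An aligned block touches one physical sector; shift it by one legacy 512-byte sector and every access touches two, doubling the disk traffic exactly as described.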
134 00:23:09.808 --> 00:23:22.824 If you trust a module, you can generate a cryptographic signature so it can be loaded. But in general, in the Linux kernel now, root does not have infinite powers anymore. It cannot just load a random module into the kernel. 135 00:23:22.824 --> 00:23:32.034 There's the load command, but you can load only signed modules that have been cryptographically signed by someone like NVIDIA or whatever. And. 136 00:23:32.308 --> 00:23:38.638 The escape hatch is you can sign your own modules, but there's some failure here. So. 137 00:23:38.638 --> 00:23:43.348 I can't figure out how to sign the critical VMware modules, and therefore VMware can't run. 138 00:23:43.348 --> 00:23:47.128 So this is restricting the power of root in Linux. So. 139 00:23:47.128 --> 00:23:53.009 And you can see why: to protect against certain low-level security holes. 140 00:23:53.009 --> 00:23:58.318 They don't like to talk about this in Linux; obviously it's documented at a low level, but. 141 00:23:58.318 --> 00:24:08.338 When I read announcements about new versions of Ubuntu in my favorite OMG! Ubuntu! or whatever, they don't actually talk about this. 142 00:24:08.338 --> 00:24:15.479 Okay, so that was a fairly low level of virtualization. So there's overhead in memory and so on, overhead in disk space. 143 00:24:15.479 --> 00:24:29.878 The next level of virtualization restricts us in that we can only run clients that are basically running the same operating system. So with this, we have a Linux host; we can run Linux guests. 144 00:24:29.878 --> 00:24:35.249 Multiple Linux guests at the same time, but they're all drawing on the. 145 00:24:35.249 --> 00:24:39.989 Key facilities of the host operating system, so they have to be the same. 146 00:24:39.989 --> 00:24:44.699 But they see a private view: processes, file system, other resources.
147 00:24:44.699 --> 00:24:52.259 So normally, normally, you do the ps command and, with the right options, you see all the processes on the whole system. 148 00:24:52.259 --> 00:24:56.368 And in fact, you can see who's running each one; you can see the name. 149 00:24:56.368 --> 00:25:06.328 The command name; you can see the environment, in fact. Which is why they tell you, if you're on a multi-user system, when you're running, say, encryption. 150 00:25:06.328 --> 00:25:18.209 You can put the cryptographic key on the command line, which makes it easier to type, but if you do that, anyone else on the system can see it with a ps command. Okay. Now, with this level of virtualization. 151 00:25:18.209 --> 00:25:23.818 A client cannot see the other processes. It cannot see that they exist at all. It can see. 152 00:25:23.818 --> 00:25:31.169 Only its own. The file system, again: it's not that you see that there are other files there you can't read; it's that you don't see them. 153 00:25:31.169 --> 00:25:34.769 And other resources, there's other system resources. 154 00:25:34.769 --> 00:25:40.528 So, you get a private view of your piece of the system. 155 00:25:40.528 --> 00:25:49.259 The only way the rest of the computer affects you is, obviously, resource consumption: somebody else is running a compute-bound job. 156 00:25:49.259 --> 00:26:03.564 Well, those are cycles you're not using. Now, even that can be controlled: the host can limit what fraction of the machine each client can use. So, it could be set up that this virtual client can use no more than half of the CPU cycles. 157 00:26:03.683 --> 00:26:04.344 For example. 158 00:26:04.648 --> 00:26:15.509 Now, this resource consumption effect is actually a surprising information leak that can actually be used to leak information out of the virtual client.
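The command-line leak mentioned above is easy to see for yourself: on Linux, every process's arguments are world-readable under /proc (a small sketch; Linux-only, since it relies on the /proc pseudo-filesystem):

```python
# On Linux, /proc/<pid>/cmdline exposes any process's arguments to every
# user on the system -- which is why a cryptographic key should never be
# passed on the command line of a multi-user machine.
def cmdline(pid="self"):
    # Arguments are separated by NUL bytes in this pseudo-file.
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return [a.decode() for a in f.read().split(b"\0") if a]

print(cmdline())  # our own command line, readable by anyone
```

Under the namespace-isolated level of virtualization described here, a client enumerating /proc simply does not see the other clients' process directories at all.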
159 00:26:15.509 --> 00:26:19.679 The fact that you're using a lot of cycles, you're slowing everyone else down, and so on. 160 00:26:21.929 --> 00:26:27.538 In any case, there's various terms for this; paravirtualization might be 1 of the terms, you can Google it. 161 00:26:27.538 --> 00:26:34.048 And Linux has tools; so again, to do this efficiently requires that the operating system be designed to support it. 162 00:26:34.048 --> 00:26:45.118 So, Linux now has stuff called namespace isolation. Namespace isolation means that you don't see someone else's processes: in the process namespace, each client is isolated. 163 00:26:45.564 --> 00:26:57.743 File systems: isolated. Resources: isolated. Namespace isolation is a keyword you can Google if you like. And Linux has something called control groups also; they can group processes. 164 00:26:57.743 --> 00:27:00.953 They can have a hierarchy of privileges, actually, which is really nice. 165 00:27:01.199 --> 00:27:09.479 And this is actually the base that's used for a lot of other tools. So this higher level of virtualization has got a lot less overhead. 166 00:27:09.479 --> 00:27:14.308 But the clients have to be the same operating system. 167 00:27:14.308 --> 00:27:27.778 With something like this, you can run a command, and it could fire up a virtual machine, do something, and then end; the overhead is a lot less with this. 168 00:27:27.778 --> 00:27:31.679 Paravirtualization, or whatever it's called; I couldn't swear to that precise name, but. 169 00:27:31.679 --> 00:27:40.019 Okay, next level up: you just have normal Linux security, file system protections. You can see other resources, but the theory is you can't access them. 170 00:27:40.019 --> 00:27:50.818 And also at the normal level, at the next level, you've got the secure Linux, SELinux, that was funded by No Such Agency.
171 00:27:50.818 --> 00:27:54.388 The National Security Agency, and so on. 172 00:27:54.388 --> 00:28:02.189 Anyhow, this has less overhead, but you can still see some other stuff. Now, the normal Linux level: it's really hard to make secure. Like. 173 00:28:02.189 --> 00:28:05.999 You know, I worry; I run Firefox, so I examine. 174 00:28:05.999 --> 00:28:10.858 My system, and I think: what are the biggest potential security holes? And. 175 00:28:10.858 --> 00:28:19.588 Well, you know, I load useful programs, some over the web, things to do geometry and so on; I have to trust them. Firefox: I've got. 176 00:28:19.588 --> 00:28:30.594 Plugins in Firefox; I have to trust them. And how would I make that secure? That's surprisingly hard: even if I can describe exactly the security I want, it's really hard to do. 177 00:28:30.594 --> 00:28:34.614 Like, maybe I want to declare that Firefox is allowed to write nothing. 178 00:28:35.068 --> 00:28:40.919 On the computer, except /tmp and except in a subdirectory of my home directory called. 179 00:28:42.239 --> 00:28:51.689 I cannot write commands that effectively enforce that, because Firefox by default stomps over my whole home directory trying to read stuff, and if I stop it, it fails. 180 00:28:51.689 --> 00:28:56.999 So, and there are tools like AppArmor. 181 00:28:56.999 --> 00:29:02.878 And so on, which pretend to help, and they help security somewhat, but the other thing. 182 00:29:02.878 --> 00:29:12.058 Is that people doing graphical user interfaces are working as hard as they can to write new convenience toys that evade the security. 183 00:29:12.058 --> 00:29:15.058 For example, AppArmor, it. 184 00:29:15.058 --> 00:29:20.308 Traps new process spawning, and so if. 185 00:29:20.308 --> 00:29:23.578 Something, you know, Firefox or. 186 00:29:23.578 --> 00:29:28.679 Whatever, spawns a process, then I could write a rule which.
187 00:29:28.679 --> 00:29:35.189 Says whether Firefox is allowed to spawn such a process, and forces a subprocess to inherit any restrictions on Firefox. 188 00:29:35.189 --> 00:29:41.038 Nice. But now, in our GUI in Linux, we've got these. 189 00:29:41.038 --> 00:29:51.298 Communication channels; I forget what they're called now. They're trying to make Linux have the security level of Windows, actually, but they've got these convenience things. So. 190 00:29:51.298 --> 00:30:04.019 Things on my desktop, my computer desktop, can send messages, can send commands, to each other, which is very nice. But without any concept of security in this, you start figuring out how to. 191 00:30:04.019 --> 00:30:07.409 You know, trap that. In any case. 192 00:30:08.548 --> 00:30:20.038 The advantage for security of virtual machines is you put the app in the virtual machine and it's now in a walled garden, that's 1 of the terms, and it can't get out, can't escape from the walled garden. 193 00:30:20.038 --> 00:30:24.778 And endanger the rest of your machine. Well, that's nice. Except. 194 00:30:24.778 --> 00:30:29.338 That was also the theory for Java security, and we see how secure Java is. 195 00:30:29.338 --> 00:30:37.378 I don't know what's hard about writing a competent walled garden, but it's hard; programmers are not able to do it. So. 196 00:30:37.378 --> 00:30:51.239 Now, apart from security, there's another big advantage of virtual machines: they isolate you from changes in the host. 197 00:30:51.239 --> 00:31:01.979 So, it's taken hold in virtual memory: virtual memory to some extent isolates you from the amount of real memory on the machine. If you don't have enough real memory, it creates a virtual space, virtual memory; it hurts the performance. 198 00:31:01.979 --> 00:31:06.838 But it would still run. And if you have a. 199 00:31:06.838 --> 00:31:16.648 If you have a virtual machine client, it's got this idealized version of the real machine.
The actual hardware is hidden from it. 200 00:31:16.648 --> 00:31:27.084 So, you can run on different virtual machines, and you could even, say, spill over to something cloudy. So you could maybe run some virtual machines locally. 201 00:31:27.084 --> 00:31:35.634 And then if you need more power, you spill over to Amazon Elastic Compute Cloud, Amazon and so on. 202 00:31:36.269 --> 00:31:39.388 Which is following the same standard. 203 00:31:39.388 --> 00:31:44.759 And in theory, you could just take your local virtual machine, run it on Amazon. So you've got surge capability. 204 00:31:45.808 --> 00:31:56.788 And places like, well, Harvard, for example, Harvard University, where I was a grad student, they do it now for their main low-level computer science course, which is incredibly popular. 205 00:31:56.788 --> 00:32:03.328 Actually taught by my former advisor, who just retired 2 years ago, but. 206 00:32:04.134 --> 00:32:18.503 So, what Harvard does is, when they need more computing power, they get it, I think, from Amazon. So you've got the virtual machine, you've got the surge capability, and you're isolated from the actual hardware. You can buy: so, if you're going to run something, you know, 24/7, it's much cheaper to buy your own hardware. 207 00:32:18.503 --> 00:32:21.233 If you're going to run it occasionally, you run it on Amazon's service. 208 00:32:24.239 --> 00:32:31.709 So that's very nice. The flip side of that is that. 209 00:32:31.709 --> 00:32:42.598 You're also isolated from really high-performance features of the hardware. So it took a while for NVIDIA stuff to be usefully accessible to virtual machines. 210 00:32:42.598 --> 00:32:47.578 You know, it hides all the differences, and that includes the high-performance stuff. 211 00:32:47.578 --> 00:32:52.828 Okay, so before I move on to Docker: if you have any questions. And.
212 00:32:52.828 --> 00:32:58.769 Trivia question, related to that: I'm drinking coffee from a cup that says a little. 213 00:32:58.769 --> 00:33:03.419 So, no points for anyone who can tell me where that is. 214 00:33:03.419 --> 00:33:07.138 I was there 2 years ago. Okay. 215 00:33:07.138 --> 00:33:13.019 You can Google it. Um, okay, so Docker. 216 00:33:14.909 --> 00:33:19.709 Docker is a popular, lightweight virtualization system. So it's the lightweight thing. 217 00:33:19.709 --> 00:33:32.574 Everything I tell you is as I understand it; I could be wrong, I could be out of date, things change. The key to a successful company is they look at what their customers want, and they consider adding it to their capabilities. 218 00:33:32.574 --> 00:33:34.943 So, but at least in the past, Docker. 219 00:33:35.219 --> 00:33:39.118 You'd say, had to run Linux guests on a Linux host, and so on. 220 00:33:39.118 --> 00:33:45.568 It's very popular commercially, and NVIDIA uses it to distribute software, because again, you see. 221 00:33:45.568 --> 00:33:57.384 How do you package software so customers can use it? On Linux, 1st, you put it in a .deb file, you put it in an RPM, you put it in a tarball; but the tarball, when you extract it, makes assumptions about the operating system. 222 00:33:58.344 --> 00:34:05.903 Now they've got a couple of competing things; Ubuntu has something called snap. They're ways to try and. 223 00:34:06.179 --> 00:34:15.929 Package the required dependencies with your application, so it puts fewer demands on your host operating system. On Linux, this is a problem because. 224 00:34:16.764 --> 00:34:29.963 Linux is its own nonstandard thing. In fact, you feel sorry for NVIDIA: go to NVIDIA's developer website, where you download the latest version of CUDA, and they've got, like, 4 different ways to install it now. I feel sorry for them. 225 00:34:29.963 --> 00:34:42.773 They've got an RPM.
They've got a .deb, they've got a tarball, you can put it in your local software repository sources for downloading, and the normal install process will download it. 226 00:34:43.043 --> 00:34:47.213 And then they've got different versions of this for every different version of Linux. 227 00:34:47.818 --> 00:34:56.338 And then if NVIDIA doesn't hop quickly enough when Linux changes, and sometimes Linux changes incompatibly and invalidates old code. 228 00:34:56.338 --> 00:35:03.628 So, if NVIDIA doesn't hop smartly enough, then they get, and so on. 229 00:35:03.628 --> 00:35:06.628 So, you feel sorry for them. 230 00:35:06.628 --> 00:35:19.889 So 1 of their multiple ways to distribute software is as a Docker image. So you run Docker on your system, and it's free, and you download an image from NVIDIA's website. 231 00:35:19.889 --> 00:35:25.469 And in theory, you can run it. I keep emphasizing "in theory" because I've tried to do it. 232 00:35:25.469 --> 00:35:29.849 And you might notice I'm no longer using Docker on parallel. There's a reason. 233 00:35:29.849 --> 00:35:43.619 Okay, in any case, this is the theory, and the theory is also that for a simple image, again an application, the overhead to start them up and take them down is really low. A command. 234 00:35:43.619 --> 00:35:58.139 You know, the compiler, the C++ compiler, could be a Docker image. In fact, that was actually why I initially installed Docker. So it's worth learning Docker. Now, what can happen is that. 235 00:35:59.518 --> 00:36:07.528 With Docker, you can end up with dozens or hundreds of Docker images, maybe even running simultaneously, maybe just sitting there. 236 00:36:07.528 --> 00:36:11.278 In a client-server concept, and. 237 00:36:11.278 --> 00:36:20.039 Just waiting to run. So Kubernetes is a tool to manage lots of Docker images. 238 00:36:20.039 --> 00:36:26.338 So, if you want more information, I gave you a couple of links. You can also.
239 00:36:28.318 --> 00:36:42.748 Okay — "build and ship apps" and so on, and they've got a conference, probably free. So the thing with things like, again, Amazon EC2: your Docker image could run on your own private machine, or could run on Amazon. 240 00:36:42.748 --> 00:36:51.599 And then you get stuff where, in theory, the thing can migrate. Microsoft has Docker, 241 00:36:51.599 --> 00:36:55.018 security and stuff like that. 242 00:36:57.028 --> 00:37:03.449 So, free stuff, money stuff — I call it the cocaine pricing model, but okay. 243 00:37:03.449 --> 00:37:06.659 Silence. 244 00:37:08.188 --> 00:37:12.628 Oh, okay. So you're gonna have fun with all of that. So now, 245 00:37:12.628 --> 00:37:25.588 other references here — piles and piles of references you can play with if you've got spare time. Uh huh. So, my use case with Docker is when I went to using the Portland Group 246 00:37:25.588 --> 00:37:38.489 compiler, because I got annoyed — I get annoyed — at g++, because it didn't do OpenACC. So I was looking for replacements. 247 00:37:38.489 --> 00:37:45.838 pgc++ is sponsored by NVIDIA, you might say, so I initially tried to have it running as 248 00:37:45.838 --> 00:37:54.059 an image on parallel. Okay. So that's running a private file system, but you've got a hook. So you can nominate 249 00:37:55.318 --> 00:38:06.599 file trees on the host file system and mount them on the guest. That's how you get files back and forth between the host and the guest. You say that a certain 250 00:38:06.599 --> 00:38:16.050 file tree on the host, or whatever, is visible on the guest. Just like in Windows 251 00:38:16.050 --> 00:38:23.400 you mount some remote file system, or in Linux — as I showed you, on my laptop here I can mount the parallel file system 252 00:38:23.400 --> 00:38:28.199 and access it just as a local file system on my laptop.
253 00:38:28.199 --> 00:38:31.559 I don't try to do fancy things. 254 00:38:31.559 --> 00:38:36.300 So, it might not work with access control lists, perhaps, and 255 00:38:36.300 --> 00:38:45.119 subtleties of read-modify-write in the file system might get lost, and the performance is horrible, but I can do that. So you do that with Docker also. 256 00:38:45.119 --> 00:38:50.789 The problem is that it was really hard — like, impossible — to get the security right. 257 00:38:50.789 --> 00:38:59.219 I couldn't see how to specify a host file system for the guest in a way that the guest could not escape and get onto the 258 00:38:59.219 --> 00:39:10.320 host, and it got worse because I was having to run Docker as root, actually. And this made it really bad. So I just couldn't see how to make it secure. 259 00:39:10.320 --> 00:39:21.780 And then I figured out how to install the compilers as, like, a normal Debian package or something — or as a whole repository — and then I didn't need to do that anymore. So I killed it. 260 00:39:23.400 --> 00:39:29.369 Okay, so that is your half class on virtual machines and Docker. 261 00:39:29.369 --> 00:39:37.710 Questions? No. Okay. 262 00:39:37.710 --> 00:39:48.150 The next subject for today is we're continuing on NVIDIA GPUs, and again, why this is worth spending time on is that it 263 00:39:48.150 --> 00:39:55.199 is the latest new supercomputer architecture, and also 264 00:39:55.199 --> 00:40:04.405 how they solve things is instructive. NVIDIA the company is over 20 years old now; they're a successful company; they've outlasted competitors. 265 00:40:04.405 --> 00:40:11.394 The competitors have mostly failed because NVIDIA — they've got good people, and they listened to their users. 266 00:40:11.940 --> 00:40:16.409 And they grow because they're providing a service. Let me show you a 267 00:40:16.409 --> 00:40:28.590 couple of important ways NVIDIA has listened to what its customers want.
So, this is the key, you know: if you're running a business and customers want to give you money to do something, 268 00:40:28.590 --> 00:40:36.690 you seriously want to consider accepting the money and doing what they want. Just don't reflexively say no, we don't do that. 269 00:40:36.690 --> 00:40:44.340 Companies do — former companies, I guess you'd call them. Okay. Things NVIDIA's done right in the past: 270 00:40:44.340 --> 00:40:49.980 they started out doing graphics accelerators. 271 00:40:49.980 --> 00:41:00.659 Okay, this is a rehash of my computer graphics course. You want to render polygons on your screen — 272 00:41:00.659 --> 00:41:08.460 maybe thousands, maybe millions of polygons. So you've got the vertices of the triangles. 273 00:41:08.460 --> 00:41:17.965 You want to rotate them and project them. The vertices are independent, so that can all be done in parallel in hardware — the more vertices you can process in parallel, the better. 274 00:41:17.965 --> 00:41:23.695 So you'd have things called vertex shaders that would rotate and project the vertices of your 275 00:41:24.864 --> 00:41:35.994 polygons, and then the vertices would be connected up to make triangles and so on, and then you'd have fragment shaders that would process the pixels in your frame buffer and your depth buffer, 276 00:41:36.414 --> 00:41:44.905 so that if you draw 2 objects to the same pixel, the frontmost 1 gets drawn — and, you know, it's the frontmost 1 because you're maintaining a depth buffer and so on. 277 00:41:45.179 --> 00:41:51.269 So NVIDIA did hardware that did that specific thing very fast. 278 00:41:51.269 --> 00:41:57.360 And the only arrays, for example, were when they invented texture memory — the textures, right? 279 00:41:57.360 --> 00:42:04.650 So, they were very limited, but what they did, they did very fast. And customers — researchers —
280 00:42:04.650 --> 00:42:13.559 basically tortured their NVIDIA GPUs to do non-graphics stuff fast also, embodied in a thing called GPGPU — general-purpose 281 00:42:13.559 --> 00:42:21.510 programming on GPUs. NVIDIA observed this happening and they added more general instructions to the GPU, 282 00:42:21.510 --> 00:42:27.059 so that you can now do general-purpose programming on it without having to 283 00:42:27.059 --> 00:42:33.840 hack your way using the vertex shaders and the fragment shaders and texture maps and so on. That was a 284 00:42:33.840 --> 00:42:38.849 horrible, horrible kludge. Okay. Now you've got CUDA and stuff. Good. 285 00:42:38.849 --> 00:42:49.139 That was a big thing NVIDIA did. A more recent thing is, of course, they're expanding — for several years now — from graphics into machine learning. So they've been 286 00:42:49.139 --> 00:42:52.440 observing that machine learning is a big business. 287 00:42:52.440 --> 00:43:00.719 There's some argument that everything that's current in computer engineering is machine learning, because if it's current, you call it part of machine learning. 288 00:43:00.719 --> 00:43:07.050 Okay, but machine learning is very compute-bound — determining the coefficients. 289 00:43:07.050 --> 00:43:15.630 Or take, say, autonomous vehicles. So I was test-driving a Tesla on Saturday. I'm going to buy a Tesla real soon now, I think. 290 00:43:15.630 --> 00:43:19.050 So, in a Tesla — 291 00:43:19.050 --> 00:43:26.219 1 option in Teslas is autonomous driving. It is a 10000 dollar option on top of the base car. 292 00:43:26.219 --> 00:43:32.309 They sell it because people are willing to pay 10000 dollars to make their car 293 00:43:32.309 --> 00:43:42.840 autonomous — or mostly autonomous, not completely — because of the effort NVIDIA has put into things like machine learning. And they sell their chips to 294 00:43:42.840 --> 00:43:47.099 companies — I don't know if they still sell to Tesla; they used to.
295 00:43:47.099 --> 00:44:01.320 But they provide the servers to compute the coefficients that go into these programs, and they have added hardware — NVIDIA, for some years now, has been adding hardware to 296 00:44:01.320 --> 00:44:06.989 their GPUs to make machine learning fast. Specific hardware they add is 297 00:44:06.989 --> 00:44:15.510 half-precision floating point — FP16, 16-bit floating point. They've also been adding things called — these — 298 00:44:15.510 --> 00:44:23.280 I'm kind of blanking on the name now — these processors. These processors take a 4 by 4 matrix, 299 00:44:23.280 --> 00:44:27.630 which can have different data types, integers or 16-bit floats, 300 00:44:27.630 --> 00:44:32.309 and can do a matrix — basically a 301 00:44:32.309 --> 00:44:38.610 matrix multiply and add — very quickly, because they devote special hardware to it. This is 302 00:44:38.610 --> 00:44:43.500 NVIDIA providing facilities that 303 00:44:43.500 --> 00:44:49.380 the customer wants. So that is why we are spending some time on it. 304 00:44:49.380 --> 00:44:57.239 So we're going to start with lecture 9.1 in their slide set, if I can find it. 305 00:44:58.889 --> 00:45:02.880 Okay, and again, I'm speed-reading it. 306 00:45:10.289 --> 00:45:13.710 So, what we're seeing today 307 00:45:13.710 --> 00:45:17.010 is 1 of the paradigms — 308 00:45:17.010 --> 00:45:22.829 we'll see a little of it — called reduction. We have not seen reduction yet; well, we've seen it sort of. 309 00:45:22.829 --> 00:45:29.280 We're going to see it in more detail now. So it's a paradigm; it's a programming pattern. 310 00:45:29.280 --> 00:45:34.679 It's used in parallel programs because it does a lot of things efficiently. 311 00:45:34.679 --> 00:45:38.519 So, we're going to see a little in this module 9.1 312 00:45:38.519 --> 00:45:43.289 on how to do it fast on parallel machines. 313 00:45:43.289 --> 00:45:46.739 Okay, so it's a class of computation. It's —
314 00:45:46.739 --> 00:45:49.800 you put it in your tool kit if you're doing parallel programming. 315 00:45:49.800 --> 00:45:54.119 And we're going to look at how efficient it is and do it better. 316 00:45:55.139 --> 00:46:04.710 Okay, so basically — and, again, you want to have buzzwords — MapReduce. 317 00:46:04.710 --> 00:46:09.989 MapReduce — I'll tell you what MapReduce is, since they mention it here. Again, it's a paradigm 318 00:46:09.989 --> 00:46:13.170 for processing large data sets. It has 2 319 00:46:13.170 --> 00:46:23.969 types of operations, and you work with a set of elements: the map operation applies a function to each element in the set and creates a new set. 320 00:46:23.969 --> 00:46:27.389 The reduce operation 321 00:46:28.980 --> 00:46:38.429 combines those things — like, computing the sum, let's say, is the most common thing, or perhaps finding the maximum is another common thing. 322 00:46:38.429 --> 00:46:41.550 And so it's called MapReduce, and 323 00:46:42.235 --> 00:46:52.795 the MapReduce idea has become popular in the last 5 or 10 years — Google popularized it, I said, and so on. There are tools, like Hadoop, which use it; Hadoop has HDFS. 324 00:46:52.795 --> 00:46:57.804 Hadoop, I've heard, might be becoming passé now, which is a reason I try to avoid very new tools. 325 00:46:58.530 --> 00:47:09.300 In any case, the MapReduce idea — the earliest reference I can find to it is IBM had a commercial language called APL 326 00:47:09.300 --> 00:47:16.650 almost 60 years ago — 50-some years ago or something — and it had array- 327 00:47:16.650 --> 00:47:24.269 manipulation instructions in it. And the thing is, it had a special and large character set; you needed a special keyboard 328 00:47:24.269 --> 00:47:28.650 and typewriter — electric typewriter. 329 00:47:28.650 --> 00:47:34.590 But so it had operations to do things like map and reduce and so on. 330 00:47:34.590 --> 00:47:42.480 This is, again — IBM sometimes does these things very early. In any case,
So, IBM had this in a commercial language in the 331 00:47:42.480 --> 00:47:47.460 sixties, and I guess it had its people. 332 00:47:47.460 --> 00:47:51.989 And then Google popularized it in the teens. 333 00:47:51.989 --> 00:48:05.639 Oh, well, Google didn't pretend they invented it. Okay. So this is a reduction operation, and there's an extension of it called scanning. So the fine points up here are: 334 00:48:05.639 --> 00:48:10.289 perhaps we want to reduce a set with a billion elements. 335 00:48:10.289 --> 00:48:16.079 It doesn't fit into 1 thread block; it may not fit into 1 grid, actually. 336 00:48:16.079 --> 00:48:23.039 So, we have to partition the data, they say, into chunks; each thread processes a chunk, and so on. 337 00:48:23.039 --> 00:48:29.579 Okay. And it's, um — 338 00:48:29.579 --> 00:48:43.079 and we've seen a little about this before, because tools like OpenMP and OpenACC have got reduction operators you can use on a loop. And I've shown you a little about how we implement it; now we're going to see more detail. 339 00:48:43.079 --> 00:48:47.760 Okay, so — 340 00:48:47.760 --> 00:48:52.139 your operation has to be associative and commutative. 341 00:48:54.210 --> 00:49:00.989 Everyone knows what commutative means, I guess — if you don't, ask. Okay. Um — 342 00:49:02.880 --> 00:49:08.280 that would make it a commutative group, I guess. 343 00:49:08.280 --> 00:49:15.599 Okay — not enough people take modern abstract algebra. You should. Okay. Um — 344 00:49:16.860 --> 00:49:20.610 so the sequential thing is you scan your way down the array, 345 00:49:20.610 --> 00:49:24.659 accumulating the sum and so on. So it 346 00:49:24.659 --> 00:49:30.599 is an O(n) algorithm — n is the number of elements in the array — and if you're sequential, that's what you do. 347 00:49:30.599 --> 00:49:33.809 Okay, but we're not sequential.
348 00:49:35.789 --> 00:49:44.670 In parallel, you've got something like this — it's like a tournament tree, if anyone watches BattleBots Thursday night. 349 00:49:44.670 --> 00:49:49.829 They're now down to 8 contestants, I guess; they have their tournament tree. 350 00:49:51.300 --> 00:49:55.590 Why don't students have an entry in BattleBots? Poly does. 351 00:49:55.590 --> 00:50:01.949 Okay, tournament tree. And, um — 352 00:50:05.190 --> 00:50:11.159 the thing is — suppose you want to do it on a parallel machine. Let's go back to the slides. 353 00:50:11.159 --> 00:50:24.989 I haven't given you seasickness lately. So you might say, okay, how to do this in parallel? Well, the trouble is — so you've got this tree here, and your 1st level here is not very parallel. 354 00:50:24.989 --> 00:50:33.989 Um, it takes n items you're combining; then that 1st step there is n over 2 355 00:50:33.989 --> 00:50:38.849 operations all being done at the same time, perhaps. And 356 00:50:38.849 --> 00:50:46.920 so, they talk about — your peak resource requirement is quite high; your average parallelism is a lot lower. 357 00:50:46.920 --> 00:50:50.400 You can read this detail on your own if you want. 358 00:50:50.400 --> 00:51:02.039 Oh, by the way, if you're not familiar with how the numbers work out: if there are n elements in the original array, there's a total of n-1 additions, because each addition reduces the number of elements by 1. 359 00:51:02.039 --> 00:51:06.510 So, in terms of total operations, it's actually efficient. 360 00:51:06.510 --> 00:51:18.090 Okay, so it's work-efficient in the sense that the total amount of work for the parallel thing is comparable to the sequential thing. 361 00:51:18.090 --> 00:51:23.369 But at that 1st level, there's a lot of work all at the same time. So it 362 00:51:23.369 --> 00:51:29.369 may not be resource-efficient in terms of parallelism, but it's work-efficient. You're not wasting cycles.
363 00:51:31.230 --> 00:51:34.500 Okay, so how are we going to improve that? 364 00:51:37.650 --> 00:51:43.710 Silence. 365 00:51:43.710 --> 00:51:56.280 Okay, so in the parallel implementation, you add 2 values in each step, and initially you've got n over 2 threads, and then you work your way down. 366 00:51:56.280 --> 00:52:01.320 And 367 00:52:01.320 --> 00:52:10.050 now, you can work in place. So you might say you're creating new arrays, and each array is half as big — except you reuse, you overwrite, the array you've got. 368 00:52:11.099 --> 00:52:17.670 Now, the way you can use the shared memory to speed things up is 369 00:52:17.670 --> 00:52:27.659 by loading a chunk of the original array, which is in global memory — you put it in a shared memory array, as much as will fit — 370 00:52:27.659 --> 00:52:37.619 and then you've got all the threads in the block attacking that shared memory array, and overwriting the array with the 2-value sums. So it minimizes memory usage. 371 00:52:37.619 --> 00:52:42.480 And that's what we're going to do. And then each 372 00:52:42.480 --> 00:52:48.719 thread block ends with its shared memory holding a reduction, and then you combine the thread blocks. 373 00:52:48.719 --> 00:52:54.960 So here's an example of what you could do. Excuse me. 374 00:52:56.550 --> 00:53:06.840 You've got 8 elements in your shared memory, and so thread 0 adds elements 0 and 1, thread 1 adds elements 2 and 375 00:53:06.840 --> 00:53:17.250 3, thread 2 elements 4 and 5, and thread 3 elements 6 and 7. So thread number K adds elements 2K and 2K plus 1, 376 00:53:17.250 --> 00:53:22.590 and overwrites element 2K. No extra memory needed. 377 00:53:23.610 --> 00:53:27.750 Okay, now, this 1st step here — 378 00:53:27.750 --> 00:53:31.800 you know, in a thread block of a 1000 threads, 379 00:53:31.800 --> 00:53:36.059 we're going to be processing 2000 elements. But it's possible 380 00:53:36.059 --> 00:53:42.480 that your hardware cannot run a 1000 threads all at the same time,
381 00:53:42.480 --> 00:53:51.420 depending on resources. So this 1st step may use a number of cycles. So the threads in that 1st step may — 382 00:53:51.420 --> 00:54:04.800 you know, the warps may be running consecutively, not in parallel, depending on the hardware resources. In any case, after the 1st step, now, all the items in the shared memory or global memory, whichever, 383 00:54:04.800 --> 00:54:08.909 have your subtotals in the even-numbered ones, and the odd-numbered ones 384 00:54:08.909 --> 00:54:11.969 you don't care about. In the next step, 385 00:54:13.260 --> 00:54:18.389 you see, thread number K adds element 4K to element 4K plus 2 — 386 00:54:18.389 --> 00:54:27.659 sorry, sorry — no, just the even-numbered threads 2K do that; the odd-numbered threads sit idle. 387 00:54:27.659 --> 00:54:37.199 And then here, we only use threads that are multiples of 4, and we end up here. Now, this is 1 way to do it. 388 00:54:37.199 --> 00:54:40.710 The problem with this — it works fine, 389 00:54:40.710 --> 00:54:43.920 but before I go to the next slide set — 390 00:54:43.920 --> 00:54:49.829 um, why this is not as efficient as it might be: 391 00:54:51.269 --> 00:54:55.110 if we look in the middle here, what's happening is 392 00:54:55.110 --> 00:54:59.219 we're not using consecutive threads; we're using 393 00:54:59.219 --> 00:55:04.739 alternate threads — like, in step 2 we're using threads that are 2 apart from each other. 394 00:55:04.739 --> 00:55:10.079 And this doesn't play nicely with the concept of 32 threads forming a warp. 395 00:55:10.079 --> 00:55:15.539 So, we've got 32 threads in a warp here, and the alternate threads are sitting idle. So that's — 396 00:55:15.539 --> 00:55:23.730 and as you go further down the tree, we're having more and more threads sitting idle. So we're really not playing nicely with the concept of a warp of threads. 397 00:55:23.730 --> 00:55:28.650 The 2nd point is we're also not playing nicely with the concept that
398 00:55:28.650 --> 00:55:32.070 data accesses should be contiguous. 399 00:55:32.070 --> 00:55:35.730 Um, so here — 400 00:55:35.730 --> 00:55:47.460 okay, we're accessing data elements that have gaps — a stride — between them. And if you're in the global memory, this is really bad. And even in the shared memory — well, you know, you're wasting stuff, maybe. 401 00:55:47.460 --> 00:55:53.969 You know, it'd be nicer if the active stuff was packed together. So then you could maybe free up things. 402 00:55:53.969 --> 00:56:00.090 So, this 1st way to do a parallel sum reduction 403 00:56:00.090 --> 00:56:03.989 um — 404 00:56:06.510 --> 00:56:10.320 is, um — 405 00:56:10.320 --> 00:56:14.909 I just realized I might be — 406 00:56:16.260 --> 00:56:20.099 I am recording. Good. I got worried. Okay. 407 00:56:20.099 --> 00:56:23.219 You see, what you're — 408 00:56:23.219 --> 00:56:28.920 it's parallel, but it's wasting thread resources: an inefficient use of the threads and an inefficient use of memory. 409 00:56:30.449 --> 00:56:38.369 And they're talking about that here: 1 of the inputs comes from an increasing distance away. For global memory, that's bad. 410 00:56:38.369 --> 00:56:44.010 Shared memory — well, not directly bad, but you're wasting memory. 411 00:56:44.010 --> 00:56:48.210 You'd like to pack stuff together. Okay. 412 00:56:51.690 --> 00:56:54.960 And here's how we would implement this. 413 00:56:54.960 --> 00:57:06.000 __shared__ means — this is in your __global__ routine, which is called from the host and runs on the device — the __shared__ array is an array which fits in shared memory, if there's room. 414 00:57:06.000 --> 00:57:13.619 And you do the reduction step — again, you're doing __syncthreads() and things. 415 00:57:13.619 --> 00:57:18.809 So — __syncthreads(). 416 00:57:19.860 --> 00:57:24.059 Yeah, so we've got this tree of partial sums. 417 00:57:24.059 --> 00:57:30.659 We've got to remember that. Let me give you a little seasickness and
418 00:57:30.659 --> 00:57:35.010 scroll back a few slides here. Gotcha. 419 00:57:35.010 --> 00:57:39.780 Okay, that 1st level, step 0 here — lots of threads. 420 00:57:39.780 --> 00:57:45.869 Maybe it's more threads than can run simultaneously. So, some threads — 421 00:57:45.869 --> 00:57:51.960 some warps are going to run 1st, while other warps are queued up, waiting to run. 422 00:57:51.960 --> 00:57:55.170 So the 1st warp finishes, 423 00:57:55.170 --> 00:57:59.880 and then you've got some warp processors, as they're called, 424 00:57:59.880 --> 00:58:04.980 waiting for work, so they get assigned more warps that are waiting. 425 00:58:04.980 --> 00:58:10.800 But as this thing starts finishing, now some more processors are going to be finished, 426 00:58:10.800 --> 00:58:14.639 and there's going to be no more work for them to do. Okay. 427 00:58:14.639 --> 00:58:20.639 So they might start running — by default, they'd start running this stuff in level 1 here, 428 00:58:20.639 --> 00:58:23.699 because there are warps waiting to run. 429 00:58:23.699 --> 00:58:27.840 So level 1 might start wanting to run 430 00:58:27.840 --> 00:58:31.050 before all of the threads in level 0 have finished. 431 00:58:32.130 --> 00:58:38.639 You see, because some of the level-0 threads finished before others, and then there are some more processors that 432 00:58:38.639 --> 00:58:42.599 are waiting idle, because all of the level-0 threads have actually started, 433 00:58:42.599 --> 00:58:49.349 but some of them have already finished, because they started earlier. So now there are more processors waiting — waiting to run level 1. 434 00:58:49.644 --> 00:59:03.085 But they shouldn't run level 1, because all the data is not available to them yet, because some of the level-0 threads haven't finished. So the level-1 threads cannot start running until all the level- 435 00:59:03.114 --> 00:59:04.764 0 threads have finished.
436 00:59:05.070 --> 00:59:09.059 That's why the __syncthreads(). 437 00:59:10.409 --> 00:59:17.010 Okay, and you might experiment with omitting __syncthreads() from a program. 438 00:59:17.010 --> 00:59:23.219 You might get a different answer every time, or you might get the same answer every time — and if you get the same answer, 439 00:59:23.219 --> 00:59:28.079 maybe it might even be right. Who knows? Why not? It might be consistently wrong. 440 00:59:29.550 --> 00:59:32.639 Okay, so, um — 441 00:59:35.610 --> 00:59:39.539 so in any case, so now, assume what we're doing here is 442 00:59:39.539 --> 00:59:43.860 threads in a block adding up elements. 443 00:59:43.860 --> 00:59:48.030 So, maybe there are more elements in my original array than 444 00:59:48.030 --> 00:59:54.690 you can have threads in the block — again, a block can have up to 1024 threads. 445 00:59:56.844 --> 01:00:08.065 Apparently a constant — oh, by the way, when I say things are fairly constant: I notice the latest version of NVIDIA, the Ampere architecture, is changing some of these numbers that have stayed fairly constant for years. 446 01:00:08.065 --> 01:00:13.255 They've increased the shared memory size, for example. Still 32 threads in a warp, but the number of bytes 447 01:00:14.130 --> 01:00:18.809 in shared memory and so on got larger. 448 01:00:18.809 --> 01:00:23.880 Okay, in any case — so we've got more elements to sum, to reduce, 449 01:00:23.880 --> 01:00:33.210 than we can have threads in a block. So chunks of the global data are going to be reduced in separate thread blocks. And the separate thread blocks are not 450 01:00:33.210 --> 01:00:42.000 talking to each other — again, they could be running consecutively, so any attempt to make them communicate would be horribly inefficient. 451 01:00:42.000 --> 01:00:50.940 And they're saying here the host might even start separate kernels. 452 01:00:50.940 --> 01:00:58.289 Okay, so now — then we have to merge these partial results.
453 01:00:58.289 --> 01:01:01.769 Well, we could just copy them back to the host and add them up. 454 01:01:01.769 --> 01:01:13.769 Or thread 0 of each block could collect the results, or the threads in each block could accumulate with atomics — we saw a little of this, actually — yesterday? Not yesterday, but — 455 01:01:13.769 --> 01:01:18.630 okay, we're starting to get into the idea now of how to do this more efficiently. 456 01:01:19.949 --> 01:01:32.460 Improving resource efficiency: the thread-to-data mapping, reducing control divergence. Divergence means 457 01:01:32.460 --> 01:01:43.170 some threads in a warp are active and some aren't; we want to pack the active threads together. And to do this, we're going to have a more complicated algorithm. 458 01:01:44.340 --> 01:01:52.170 What it's going to do is run faster — it's a trade-off. Okay: 459 01:01:52.170 --> 01:01:56.400 you know, more code, more thinking, but faster execution. So — 460 01:01:56.400 --> 01:02:00.210 and I'll show you what's going to be happening here. 461 01:02:01.889 --> 01:02:13.139 We're going to pack the partial sums into the front of the array and keep the active threads consecutive, 462 01:02:13.139 --> 01:02:17.670 and shift the index usage — that's the same thing. 463 01:02:17.670 --> 01:02:22.769 "Improves divergence behavior" means we want the threads that are running the same code to be 464 01:02:22.769 --> 01:02:27.960 in the same warp — reordering computations. Okay. 465 01:02:30.539 --> 01:02:39.510 So, it shows what's happening here: we've got this array of 8 elements at the top — 3, 1, 7, 0, 4, 1, 6, 3. 466 01:02:39.510 --> 01:02:43.860 Okay, so in the 1st stage we are adding 467 01:02:43.860 --> 01:02:48.900 pairs of elements. But in what I showed you before, in the last slide set 9.2, 468 01:02:48.900 --> 01:02:57.659 thread 0 added, like, element 0 and element 1, thread 1 added element 2 and element 3, and so on. Here, 469 01:02:57.659 --> 01:03:00.869 thread 0 adds element 0 and element 4.
470 01:03:02.909 --> 01:03:07.739 So they're not adjacent to each other — that is true. 471 01:03:07.739 --> 01:03:11.639 However, thread 1 adds element 1 and element 5, 472 01:03:11.639 --> 01:03:21.090 thread 2 adds elements 2 and 6, and thread 3 adds elements 3 and 7. So the 2 elements added by thread 0 are not adjacent, 473 01:03:21.090 --> 01:03:34.170 but the elements added by thread 0 and added by thread 1 are adjacent. So the threads are going sequentially through 1 run of elements right here and sequentially through another run right there. 474 01:03:34.170 --> 01:03:48.539 So that plays nicely with the cache manager, and the outputs now are adjacent. So we went in with 8 elements; we come out with 4 subtotals, and they're adjacent at the start of the array. 475 01:03:49.619 --> 01:03:54.389 We could, arguably, be overwriting the original array, or this could be a new array, whichever. 476 01:03:55.769 --> 01:04:04.079 You know, at some point you might treat the 1st level as being read-only global memory, and then the next level is in shared memory, perhaps. 477 01:04:04.079 --> 01:04:07.139 You know, you do these sorts of trades — that would actually be a nice thing to do. 478 01:04:07.139 --> 01:04:12.809 So — scratch that — you're reading 479 01:04:12.809 --> 01:04:19.559 the original array, reading it in a systematic way, and then writing to the 480 01:04:19.559 --> 01:04:23.820 shared memory. So the shared memory only has to be big enough for half the original array. 481 01:04:23.820 --> 01:04:31.949 Okay, so we did this different thing, where the threads are adding non-adjacent elements, 482 01:04:31.949 --> 01:04:35.309 and their subtotals are packed together. 483 01:04:35.309 --> 01:04:49.530 Okay, but now, after the 1st step, there are 4 threads active — but they're 4 consecutive threads, and they do the same thing. So you see, the data that's used is always packed to the front of the array. So —
484 01:04:49.530 --> 01:04:57.630 you know, the tail of the array you might free, or use for something else if you want. And the active threads are all packed together. 485 01:04:57.630 --> 01:05:02.099 So, what that means is, initially, if you had — 486 01:05:02.099 --> 01:05:09.210 suppose we had 1024 threads. Okay, so we're summing up 2048 elements of the array. 487 01:05:09.210 --> 01:05:17.820 Well, after the 1st step, only 512 threads are active. The last 512 threads — 488 01:05:17.820 --> 01:05:21.869 numbers 512 to 1023 — are no longer needed. So, 489 01:05:21.869 --> 01:05:27.690 you know, those threads can end. And then after the next step, only 256 threads are used, 490 01:05:27.690 --> 01:05:42.659 and so there are fewer and fewer active threads. So we don't have a lot of threads in the warps sitting idle — the active threads are all packed together. So we're using the thread resources more efficiently, because the other threads — 491 01:05:42.659 --> 01:05:54.929 they've finished; you know, they're not continuing to run, you might say. So we're packing the active data together, and we're packing the active threads together, 492 01:05:54.929 --> 01:06:04.139 and packing is good. And how we would do it — you can look at the code yourself; it's not interesting. 493 01:06:04.139 --> 01:06:11.219 So — again, divergence means some of the threads of the warp are active and some are passive. So 494 01:06:11.219 --> 01:06:17.190 if we have 1024 threads, until we get down to only 32 active threads, 495 01:06:17.190 --> 01:06:21.630 um, there's no divergence — the active threads form warps of all-active threads. 496 01:06:21.630 --> 01:06:28.559 Everything's nice powers of 2 also, which helps. And the final 5 steps are, um, 497 01:06:28.559 --> 01:06:31.769 divergent. 498 01:06:31.769 --> 01:06:35.579 The final 5 steps you almost might want to do sequentially.
499 01:06:35.579 --> 01:06:42.329 There's a programming paradigm that I have, which is: if you've got a lot of data, you start munching it down, 500 01:06:42.329 --> 01:06:45.449 maybe in parallel, and at some point you switch modes. 501 01:06:45.449 --> 01:06:52.380 And so I could easily see here that you do the 1st 5 steps in parallel — well, for the 1st step you've got a 1000 threads 502 01:06:52.380 --> 01:06:56.400 in parallel — and for the last 5 steps, you know, you're talking 503 01:06:56.400 --> 01:07:00.750 32 threads. You might almost say, what the hell, just add them 504 01:07:00.750 --> 01:07:04.500 with 1 sequential process. That's what I would do here. 505 01:07:04.500 --> 01:07:09.900 So, I would shift modes: you start parallel and you go to sequential. 506 01:07:12.090 --> 01:07:24.420 It's the same concept — if I were worried about memory, say, implementing some binary search tree with pointers, I would implement the top half of the tree with pointers, let's say, and the bottom half of the tree packed. 507 01:07:25.440 --> 01:07:32.280 And I would have the effect of something that I could update fairly efficiently — 508 01:07:32.280 --> 01:07:35.579 which requires pointers — and something which doesn't 509 01:07:35.579 --> 01:07:38.940 double the space that the tree requires for pointers. And — 510 01:07:38.940 --> 01:07:44.039 so, you switch modes in the middle. Powerful paradigm; you don't see it described much. 511 01:07:44.039 --> 01:07:49.050 Okay, that was — 512 01:07:49.050 --> 01:07:52.199 what was it, 9.3? 513 01:07:53.219 --> 01:07:57.150 A chance to ask questions. 514 01:08:04.920 --> 01:08:07.920 Oh, okay, good. 515 01:08:07.920 --> 01:08:14.159 Okay, what happens here: we're seeing an extension of reduction called 516 01:08:14.425 --> 01:08:28.975 a scan operation. A scan operation is a powerful thing — it's a powerful parallel programming paradigm. It's only useful in parallel programming; it's actually not useful in sequential programming. You can use it in sequential programming —
517 01:08:29.005 --> 01:08:30.864 what I mean by not 518 01:08:31.170 --> 01:08:36.420 useful is it doesn't give you a performance gain; for parallel programming, it gives a performance gain. 519 01:08:36.420 --> 01:08:39.630 What it is, is: 520 01:08:39.630 --> 01:08:47.520 the reduction reduces the array to one sum; the scan, the parallel scan, produces an array of subtotals. 521 01:08:48.810 --> 01:09:03.810 Okay, so instead of adding all the elements to get the total, you've also got a total of the first one element, another total of the first two, a total of the first three, and so on. And it's done in parallel; it takes log n time, like the reduction does. 522 01:09:03.810 --> 01:09:08.010 And, like I said, you would be surprised how useful it is. 523 01:09:08.010 --> 01:09:19.229 Okay, so "foundational," nice word. NVIDIA has got something like a... it's several years old now; some of the ideas are a little obsolete. 524 01:09:19.229 --> 01:09:27.689 But it's on the web for free now; I think they used to charge for it. 525 01:09:27.689 --> 01:09:35.699 And that link is dead, so how you would find it, how I found this: 526 01:09:35.699 --> 01:09:40.710 on the web, I Googled "339" 527 01:09:40.710 --> 01:09:45.149 and I found that it's still on the NVIDIA site; they just moved it somewhere. 528 01:09:45.149 --> 01:09:51.930 I'm not going to go to it now, because the interesting stuff is on this slide set. 529 01:09:51.930 --> 01:09:57.810 Okay, what is it? There's 530 01:09:57.810 --> 01:10:04.890 two versions, inclusive and exclusive. Here is the inclusive scan; I'll show it by example. 531 01:10:04.890 --> 01:10:08.880 Input: 3 1 7 0 4 1 6 3. 532 01:10:08.880 --> 01:10:18.720 Eight elements. The output array has eight elements, but look: the first element is 3, the second is 3 plus 1, the third is 3 plus 1 plus 7. 533 01:10:18.720 --> 01:10:22.380 So the k-th element, using index-1 534 01:10:22.380 --> 01:10:25.439 addressing, origin-1 addressing:
535 01:10:25.439 --> 01:10:30.210 the k-th output element is the sum of the first k input elements. 536 01:10:30.210 --> 01:10:34.710 So that's called an inclusive scan, 537 01:10:34.710 --> 01:10:45.420 or a prefix sum. That's what it is. What would be a quick application of it? Suppose you have a run-length encoding of 538 01:10:45.420 --> 01:10:54.630 whatever, an image. You would use something like this to decode your run-length-encoded image. So the first row there is your run lengths, 539 01:10:54.630 --> 01:10:58.920 and the second row is where each run would start in the decoded image. 540 01:10:58.920 --> 01:11:02.819 So you'd use this, you'd use the 541 01:11:02.819 --> 01:11:07.710 prefix sum, the inclusive scan, of your array of run lengths 542 01:11:07.710 --> 01:11:11.939 to compute where in the output each run will start. 543 01:11:11.939 --> 01:11:18.029 Each expanded run. This is an example of the use of the inclusive scan. 544 01:11:18.029 --> 01:11:23.550 A lot of other examples, but this is a nice one. Another example: it's used 545 01:11:23.550 --> 01:11:27.029 for bucket 546 01:11:27.029 --> 01:11:37.680 sorting. We did the bucket totaling, the counts, the frequency counts. We showed you a simple frequency-count idea two days ago, I guess. 547 01:11:37.680 --> 01:11:44.579 That simple idea assumes that the output array of counts is 548 01:11:44.579 --> 01:11:50.159 not that big. It works when you've got a small number of possible counts, for counting frequencies. 549 01:11:50.159 --> 01:11:53.189 We were doing histogramming of text, remember, and we even 550 01:11:53.189 --> 01:12:02.310 batched up the letters, so we didn't have that many output possibilities. Well, suppose you had a million or a couple million possible 551 01:12:02.310 --> 01:12:12.630 keys, and you're doing a histogram count on that. What we saw two days ago has problems: the output histogram matrix is too big to store in fast memory.
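The run-length-decoding use of the scan is easy to see in a few lines. This sketch (my own illustration, not the slide code) computes the inclusive scan sequentially, then uses it to find where each run starts in the decoded output:

```python
# Inclusive scan: out[k] is the sum of the first k+1 inputs
# (origin-1: the k-th output is the sum of the first k inputs).
def inclusive_scan(a):
    out, total = [], 0
    for x in a:
        total += x
        out.append(total)
    return out

# Run-length decoding application: scanning the run lengths tells you
# where each run begins. The start positions are the exclusive-scan
# view: shift the inclusive result right by one, with 0 first.
runs = [3, 1, 7, 0, 4, 1, 6, 3]      # example input from the slides
sums = inclusive_scan(runs)
starts = [0] + sums[:-1]             # offset where each run begins
```

Sequentially this is trivial, which is the lecture's point: the scan only becomes interesting when you want those subtotals computed in parallel in log n steps.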
552 01:12:12.630 --> 01:12:26.069 So we use other tricks, tricks, techniques, whatever you want to call them, paradigms, and they involve the inclusive scan. In any case: new concept, inclusive scan. You see what it is. 553 01:12:26.069 --> 01:12:35.640 It takes an array of run lengths and outputs an array of, you could call it a dope vector, that's a buzzword, of elements showing where each 554 01:12:35.640 --> 01:12:39.569 run would start in the output. Okay, that's what it is. 555 01:12:39.569 --> 01:12:44.489 How do you do it fast? Sequentially it's obvious; we want to do it fast in parallel. 556 01:12:46.170 --> 01:12:49.770 Okay, submarine sandwich example. 557 01:12:51.000 --> 01:12:55.859 How to calculate it. Cute example. 558 01:12:57.539 --> 01:13:04.829 Lecturers struggle to find examples, okay, so don't knock it. 559 01:13:04.829 --> 01:13:08.609 If you can find better examples, go ahead. 560 01:13:10.439 --> 01:13:14.159 Yeah. 561 01:13:14.159 --> 01:13:18.119 Oh, it's doing fast string searching, 562 01:13:18.119 --> 01:13:22.859 counting the occurrences, comparing strings, all this. So you can read this. 563 01:13:22.859 --> 01:13:34.529 I showed you the obvious one; it could also be used for run-length encoding, it goes both ways. It has a surprising number of applications. We'll see them in a week or so. 564 01:13:34.529 --> 01:13:37.949 Typical applications of the scan. 565 01:13:39.420 --> 01:13:43.140 Yeah, they're getting silly. That's the definition. 566 01:13:43.140 --> 01:13:48.750 Obviously you can find it sequentially, but we're in a parallel course, of course. 567 01:13:48.750 --> 01:13:55.260 The sequential version is work-efficient; it takes 568 01:13:55.260 --> 01:14:01.350 n additions for n elements. 569 01:14:03.090 --> 01:14:12.390 That does not need a picture. The naive inclusive scan: 570 01:14:12.390 --> 01:14:15.960 the i-th thread calculates y_i. So, by 571 01:14:15.960 --> 01:14:23.189 itself this is no faster than linear, worse than linear actually, because
572 01:14:23.189 --> 01:14:26.220 threads running on the GPU are slower than on the Intel CPU. 573 01:14:26.220 --> 01:14:34.470 Oh, by the way, I strongly disagree with this point that appears here, that parallel programming is easy if you don't care about performance. It's still hard. 574 01:14:34.470 --> 01:14:39.300 You've still got locking and sequential-serialization issues. 575 01:14:39.300 --> 01:14:42.750 Okay, that was 10.1. 576 01:14:46.020 --> 01:14:51.420 I showed you the problem. 577 01:14:55.380 --> 01:14:59.579 How to do it: so this is how we could do it in parallel. 578 01:15:01.319 --> 01:15:07.590 Yeah, we're starting to cut the time down to something like log n. 579 01:15:07.590 --> 01:15:13.170 Here's our initial array. It's called XY because we're updating in place. 580 01:15:13.170 --> 01:15:18.270 So we add each element to its neighbor to the right. 581 01:15:18.270 --> 01:15:21.930 Sorry, we add each element to its neighbor to the left. 582 01:15:21.930 --> 01:15:34.109 Okay, so, for example, 1 got replaced by 4 plus 1. So the input array was our n elements that we want to scan; in the output array, each element has had its neighbor to the left added to it. 583 01:15:35.460 --> 01:15:39.239 Okay, so you have an array of pairwise sums here. 584 01:15:41.310 --> 01:15:45.390 And then we synchronize. So we do this 585 01:15:45.390 --> 01:15:49.140 all in parallel, and now we sync the threads. 586 01:15:49.140 --> 01:15:54.810 Next step, we do it again: we add each element to the element 587 01:15:54.810 --> 01:16:01.890 to the, um... 588 01:16:03.090 --> 01:16:12.390 In the first step, we added each element to the element adjacent to the left. In the next step, we add each element to the element 589 01:16:12.390 --> 01:16:15.720 two to the left. So 9 gets added to 5, 590 01:16:15.720 --> 01:16:25.800 7 gets added to 4 to make 11, 5 gets added to 7 to make 12, 4 gets added to 8 to make 12, 7 gets added to 4 to make 11,
591 01:16:25.800 --> 01:16:31.770 8 gets added to 3 to make 11, and 4 doesn't change because two to the left is off the start of the array, right? 592 01:16:31.770 --> 01:16:36.119 Boundary test, edge test. Okay. 593 01:16:37.140 --> 01:16:47.189 What we have now: here are sums of four elements. The 14 is a sum of the last four elements, 3 plus 6 plus 1 plus 4. 594 01:16:48.750 --> 01:16:55.380 So after the second step, the stride-2 step, every element is the sum of four elements, 595 01:16:55.380 --> 01:17:01.649 except the first two and, um, 596 01:17:01.649 --> 01:17:06.359 the first three, in fact. The 11 here is a sum of only the first three elements. 597 01:17:06.359 --> 01:17:10.199 You might imagine that it's padded to the left with zeros. 598 01:17:10.199 --> 01:17:13.739 Getting ahead of me. Now we do this k times. 599 01:17:14.760 --> 01:17:22.949 Now every element is the sum of eight elements. So what we did: each element here was added to the element four to the left. 14 got added to 600 01:17:22.949 --> 01:17:32.399 11, making 25, and now the output elements are the sum of eight elements, except for the first batch, 601 01:17:32.399 --> 01:17:36.449 where they'd be adding stuff going off the start of the array. So 602 01:17:36.449 --> 01:17:42.510 each element here is the sum of eight elements, except 603 01:17:44.189 --> 01:17:47.579 where there aren't eight elements to the left, 604 01:17:50.640 --> 01:17:56.340 at the third level, I think. Okay, and I think we've done it. 605 01:17:56.340 --> 01:18:02.460 This is it. We took three steps 606 01:18:02.460 --> 01:18:07.109 and we got the parallel scan of the original array. So it took three parallel steps 607 01:18:07.109 --> 01:18:17.460 and we did it. Now, we're adding non-adjacent elements, so we might want to talk about that later. 608 01:18:17.460 --> 01:18:21.180 But we've parallelized the scan. Nice. Okay. 609 01:18:21.180 --> 01:18:29.010 And we've got some thread divergence; maybe it can't be helped. Well, the active threads are still adjacent, they're at the end of the array. So.
610 01:18:29.010 --> 01:18:32.939 So, in log n time, we did the parallel scan. Cool. 611 01:18:32.939 --> 01:18:39.090 Some dependencies, um. 612 01:18:39.090 --> 01:18:42.539 We're overwriting in place. 613 01:18:42.539 --> 01:18:49.920 So if I just jump back here: we have to do a sync between each step, 614 01:18:49.920 --> 01:19:00.539 because we're overwriting the array in place, and we've got to make certain that no one wants the old value before we replace it with the new value. So you have to sync, because there are no guarantees about ordering. 615 01:19:00.539 --> 01:19:03.600 Okay, dependencies. 616 01:19:03.600 --> 01:19:06.659 Everyone does reading and writing. 617 01:19:06.659 --> 01:19:10.079 Consider doing it in 618 01:19:10.079 --> 01:19:16.470 shared memory. And this is how it would be implemented; you can read the code on your own. 619 01:19:18.090 --> 01:19:22.050 It does log n parallel iterations. Nice. 620 01:19:22.050 --> 01:19:35.550 But there is an issue here: work inefficiency. The threads at the start of the array, maybe they're still doing stuff which is useless. 621 01:19:35.550 --> 01:19:41.640 So we've got some issues; we're not using the threads efficiently. That's not work-efficient. 622 01:19:41.640 --> 01:19:50.100 So we may be saturating things; it may actually be running slower. Let me scroll back to what's going on here. 623 01:19:50.100 --> 01:20:02.939 These first couple of threads here are not doing anything, because they don't have elements to stride back to. Each thread is adding its element to the element four to the left; well, if we don't have four elements to the left, the thread's not doing anything. 624 01:20:02.939 --> 01:20:06.689 And so it's going to be idle. But maybe we want to be more 625 01:20:06.689 --> 01:20:11.279 explicit about being idle; I guess that's what they're talking about. 626 01:20:11.279 --> 01:20:15.779 Um, not work-efficient and so on.
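The stride-doubling walkthrough above (the naive, work-inefficient parallel scan) can be simulated in Python. In this sketch of mine, each pass builds a fresh array from the old one, which stands in for the barrier the lecture insists on: every thread must read the old values before anyone overwrites them:

```python
# Naive parallel inclusive scan, simulated sequentially.
# Each "parallel step" reads the old array and writes a new one;
# building a new list per step plays the role of the sync barrier
# between steps when the real kernel updates the array in place.
def naive_scan(a):
    stride = 1
    while stride < len(a):
        # every position adds the element `stride` to its left, if any
        a = [a[i] + (a[i - stride] if i >= stride else 0)
             for i in range(len(a))]
        stride *= 2            # strides 1, 2, 4, ...: log2(n) steps
    return a

result = naive_scan([3, 1, 7, 0, 4, 1, 6, 3])   # inclusive prefix sums
```

With 8 elements this takes 3 parallel steps, but note the work inefficiency the lecture points out: every step still launches one addition per element, even for the positions near the start that have nothing left to add.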
627 01:20:15.779 --> 01:20:23.699 That was that. So, what happened today? First, we spent time on 628 01:20:23.699 --> 01:20:27.060 virtual machines at different levels, 629 01:20:27.060 --> 01:20:31.439 ending up with Docker and so on, because that's the commercially valuable one. 630 01:20:31.439 --> 01:20:42.899 You might also think about how you'd implement this stuff. You saw different levels of virtual machines; you might be having it in the back of your mind: what hardware resources do you want, to make this stuff fast? 631 01:20:42.899 --> 01:20:52.949 Hooks into the hardware. I mentioned things such as: harmless instructions have to be easily distinguishable from harmful instructions. 632 01:20:52.949 --> 01:21:00.689 I think in the IBM System/360, it's actually determined by a few bits of the opcode or something. 633 01:21:00.689 --> 01:21:05.699 So for the powerful opcodes, you just look at the bits of the opcode and you can tell. 634 01:21:06.744 --> 01:21:18.354 And they're trapped, or even replaced with trap instructions, or the hardware traps. I'm sorry, what happens is the hardware traps the powerful instructions if you've set your virtual bit. 635 01:21:18.564 --> 01:21:27.864 And I think what happens is the instruction turns into a supervisor call, and it's done in the hardware. So you don't have to modify the code, and there's no overhead in it 636 01:21:28.140 --> 01:21:33.029 until you start executing the little protected subroutine; but there's no overhead in the trap. 637 01:21:33.029 --> 01:21:36.720 And then we saw some parallel tools 638 01:21:36.720 --> 01:21:39.779 with virtual machines, seeing some nice 639 01:21:39.779 --> 01:21:45.630 parallel tools. The stuff I'm showing you here is not specific to NVIDIA; 640 01:21:45.630 --> 01:21:49.109 you pick your parallel architecture.
641 01:21:49.109 --> 01:21:59.399 These parallel reductions are a powerful, foundational paradigm for any parallel architecture. So this part of the course is 642 01:21:59.399 --> 01:22:04.409 reaching beyond NVIDIA; it's part of the parallel-paradigms theme. 643 01:22:08.880 --> 01:22:15.989 Yeah, okay. So you're asking, yeah, hybrid, or...? Let me go back to this specific thing here. 644 01:22:15.989 --> 01:22:24.149 Well, we're going to see in the next slides that it's packing things together in a way that the other thread warps will just, 645 01:22:24.149 --> 01:22:28.739 they'll terminate, and the resources are freed. So. 646 01:22:30.449 --> 01:22:36.270 Yeah, well, the thread stuff as we reorganize; the slide I'm showing you here is somewhat simplified. 647 01:22:36.270 --> 01:22:40.800 And the idle threads, they'll finish; they just run off the bottom of the kernel code. 648 01:22:40.800 --> 01:22:45.840 They finish, and if all the threads in the warp finish, 649 01:22:45.840 --> 01:22:49.319 that warp ends, and now those resources are 650 01:22:49.319 --> 01:22:53.369 available. What resources am I talking about? 651 01:22:53.369 --> 01:22:57.090 This concept of a CUDA core that I've been showing you is 652 01:22:57.090 --> 01:23:00.899 a little simplified from reality. 653 01:23:00.899 --> 01:23:05.789 There is not one specific CUDA core; there are sets of 654 01:23:05.789 --> 01:23:12.899 functional units that will internally execute code of different types, and 655 01:23:12.899 --> 01:23:18.300 you need them to decode an instruction and 656 01:23:18.300 --> 01:23:30.689 have a thread executing it. So those instruction units are now free for other threads. The registers that a thread would use, that are private to the thread, they come from a block of registers that the whole, 657 01:23:30.689 --> 01:23:40.229 the whole block shares. So those are now free.
So, yeah, now these resources are available for other thread warps to use. Yes. 658 01:23:41.729 --> 01:23:51.750 Other stuff... so we're running late now. So see you Thursday; head off to lunch or your next class, and enjoy the week. 659 01:23:51.750 --> 01:23:57.539 And I'm enjoying the sunshine; my solar panels are generating lots of power now. 660 01:23:58.949 --> 01:24:08.039 So, today we ran up through section 10.2. I'll put a note of that on the blog. 661 01:24:47.970 --> 01:24:53.909 Silence. 662 01:24:55.229 --> 01:24:59.159 Silence. 663 01:25:01.560 --> 01:25:06.090 Silence.