WEBVTT 1 00:04:09.688 --> 00:04:09.778 I 2 00:04:12.503 --> 00:06:07.163 am. 3 00:06:10.108 --> 00:06:22.678 Silence. 4 00:06:22.678 --> 00:06:25.918 Silence. 5 00:06:28.108 --> 00:06:33.658 Silence. 6 00:06:40.949 --> 00:06:48.088 Silence. 7 00:07:10.858 --> 00:07:14.788 Okay, good afternoon, parallel class. So. 8 00:07:14.788 --> 00:07:22.079 Monday, March, continuing on talking about NVIDIA parallel stuff and so on. And. 9 00:07:22.079 --> 00:07:26.278 I assume that people can hear me, but just in case. 10 00:07:28.108 --> 00:07:37.228 If you can hear me, thank you. Okay, so. 11 00:07:37.228 --> 00:07:41.579 What we have happening today is. 12 00:07:43.709 --> 00:07:54.478 A blurb on virtualization and Docker, there was a request for it, while continuing on with NVIDIA, because it's the biggest. 13 00:07:54.478 --> 00:08:03.988 Supercomputer architecture. Not what it started out doing, but really the biggest supercomputer architecture now. And. 14 00:08:03.988 --> 00:08:16.079 Also, I have another homework up, which is a chance for you to do, let me show the homework first, a chance for you to do another talk. 15 00:08:18.149 --> 00:08:22.319 So, a second student talk, starting. 16 00:08:23.879 --> 00:08:31.228 Well, Thursday in 10 days, and for that and the next few classes, and so I'm giving you freedom: if 1 week is. 17 00:08:31.228 --> 00:08:35.879 Easier for you than another week, fine, and I'll just fill in unused time with new material. 18 00:08:35.879 --> 00:08:43.948 So do it in teams of 2. So, like you did before, present another parallel tool that we haven't covered in class; there's a lot of them, I've just covered a sampling. 19 00:08:43.948 --> 00:08:47.219 And, for example. 20 00:08:47.219 --> 00:08:54.119 The energy labs have tools, like Kokkos; some cloud-based things; C++ and its parallel. 21 00:08:54.119 --> 00:09:04.198 Facilities in the current version; OpenCL, a competitor to CUDA; or cover 1 of the debugging tools that I've mentioned, but haven't actually shown you.
22 00:09:04.198 --> 00:09:11.938 GPU Technology Conference coming up; you can go to last year's GPU Technology Conference and find something interesting. 23 00:09:12.533 --> 00:09:15.053 And email me your team, 24 00:09:15.083 --> 00:09:16.374 your team name, 25 00:09:16.614 --> 00:09:20.514 who's in your team and what dates you prefer, 26 00:09:20.514 --> 00:09:27.714 and your topic, or maybe even 2 topics; I want to try and have different teams doing different topics. 27 00:09:27.714 --> 00:09:28.134 So. 28 00:09:28.408 --> 00:09:38.519 If we run out of interesting topics, I'll try to take up more, but your wild card is you go to the GPU Technology Conference and find something; there's a few 100 talks there, literally. So. 29 00:09:40.168 --> 00:09:44.639 Okay. 30 00:09:46.438 --> 00:09:49.979 For, I don't think you're here. 31 00:09:49.979 --> 00:09:53.219 So. 32 00:09:54.264 --> 00:10:01.014 So, a virtual view of a system is an idealized, different view that hides certain features from the user. 33 00:10:01.644 --> 00:10:12.533 So, for example, in any modern operating system, modern being defined as the last 50 years or more, the file system is a virtual view into the disk. 34 00:10:12.839 --> 00:10:22.229 You don't access raw blocks, you access files. Um, the virtual memory manager is a virtual view into the memory. You don't access. 35 00:10:22.229 --> 00:10:26.729 You usually don't access real memory addresses; you go through the virtual. 36 00:10:26.729 --> 00:10:33.719 The virtual memory manager, and that adds pluses and it adds minuses. 37 00:10:33.719 --> 00:10:41.458 A big plus is security: with the virtual memory manager, you cannot get at other processes' memory. 38 00:10:41.458 --> 00:10:44.908 Unless you exploit 1 of these holes in Intel. 39 00:10:44.908 --> 00:10:47.908 And. 40 00:10:47.908 --> 00:11:02.879 You know, you get some standardization with a virtual memory manager.
It's not as important how much real memory the machine has; it affects the performance, but it doesn't so much affect what kind of program runs. So the virtualization standardizes things and. 41 00:11:02.879 --> 00:11:13.739 Offers security and protection, and also can make facilities available easily that might not be available otherwise. 42 00:11:13.739 --> 00:11:22.349 I mean, this is pushing the name virtualization, but early machines, for example, did not have hardware floating point. They emulated it. 43 00:11:22.349 --> 00:11:29.489 If you did a floating add, it really called a little function using the integer. 44 00:11:29.489 --> 00:11:34.558 Instructions on the machine, so it ran, I don't know, 5 times slower, but. 45 00:11:34.558 --> 00:11:44.068 It used less hardware. So, in a sense, you could say that the floating point instructions are a virtual instruction set. 46 00:11:44.068 --> 00:11:48.239 That supplemented the actual physical instruction set with more. 47 00:11:48.239 --> 00:11:54.269 With more tools. So, in a sense, it's syntactic sugar. Anybody not know syntactic sugar? 48 00:11:54.269 --> 00:11:57.808 Okay, I mean, the goal is programmer productivity. 49 00:11:57.808 --> 00:12:03.538 What syntactic sugar means is you add new. 50 00:12:03.538 --> 00:12:10.259 New features to a language that make it easier to read and to program, but they're not. 51 00:12:10.259 --> 00:12:15.359 Deep new powerful things in the language; they're convenience things. 52 00:12:15.359 --> 00:12:19.619 Like referring to threads. 53 00:12:19.619 --> 00:12:29.339 And blocks on the GPU as being 2-dimensional or 3-dimensional arrays, when they're really 1-dimensional arrays in the hardware; that's just syntactic sugar. 54 00:12:29.339 --> 00:12:37.918 Okay, so virtualization: there's lots of different levels you can do it at. I've got several levels here, working my way up to things like Docker.
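To make the sugar point concrete, here is a small sketch (plain Python, not CUDA; the function names are mine) of how a 2-dimensional thread index is just a convenience view over a single linear thread number:

```python
# Sketch of the "syntactic sugar" point: a 2-D thread index (x, y) in a
# block of width dim_x is a convenience view over one linear thread
# number, using row-major order (the way CUDA lays threads out).
def flatten_2d(x, y, dim_x):
    return y * dim_x + x          # the real, 1-D thread number

def unflatten_2d(i, dim_x):
    return i % dim_x, i // dim_x  # recover the 2-D convenience view

# Thread (3, 2) in a 16-wide block is linear thread 35.
print(flatten_2d(3, 2, 16))
```

The hardware only ever sees the linear number; the 2-D and 3-D forms exist to make the programmer's index arithmetic easier to read.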
55 00:12:37.918 --> 00:12:50.339 Oh, by the way, reality check: why this is worth spending time on is that Docker is a hot commercial idea. And if you're applying to a company that has some. 56 00:12:50.339 --> 00:12:53.458 Program that scans resumes for. 57 00:12:53.458 --> 00:13:03.658 Keywords, then you can honestly say you know Docker, put it on your resume; and if you go, outside this course, program something in Docker, you can honestly say you've. 58 00:13:03.658 --> 00:13:10.649 Programmed in it. Okay. You're going to defeat the superficial screening tools at their own game. 59 00:13:10.649 --> 00:13:17.038 So, at the very low level with virtualization, you emulate the hardware. 60 00:13:17.038 --> 00:13:23.099 And it can have different instructions, different word lengths, and so on. 61 00:13:23.099 --> 00:13:27.119 Um, it's very flexible, but it's very slow. 62 00:13:27.119 --> 00:13:40.708 And you have some minimal operating system on the hardware, just a minimal, very thin layer that can run the virtual guests on top of the host. The host is the low-level thing; the guests run on top of it. 63 00:13:44.693 --> 00:13:54.774 So at the low level, the big commercial company is VMware; there's several free alternatives, then. 64 00:13:55.948 --> 00:14:03.688 Like KVM and so on. Microsoft has a virtual thing; it's getting better, I believe. I don't know as much about it. 65 00:14:03.688 --> 00:14:15.089 My guess is that VMware is probably better than the free alternatives, if only because they put so much effort into it, and I believe them to be competent people there. 66 00:14:15.089 --> 00:14:20.879 Um, but VMware has actually some free parts to it, but the full thing costs money. 67 00:14:20.879 --> 00:14:27.568 You can run VMware virtual machines for free, basically, but creating them costs money. 68 00:14:28.589 --> 00:14:34.859 So the concept is, you've got your host and then you run your client machines, and if you're.
69 00:14:34.859 --> 00:14:48.359 Virtualized at a very low level, your separate virtual machines might be totally different operating systems. So I've run VMware on laptops, and 1 guest would be Windows and a 2nd guest might be Linux, and they're running simultaneously. 70 00:14:48.359 --> 00:14:58.313 On the same machine. Now, how do they do it? Well, they're seeing their own views into the file system. So they're seeing separate parts of the file system, although there are ways to share parts of the file system. 71 00:14:58.313 --> 00:15:09.744 So, I can create a shared partition that's accessible from both the Linux guest and the Windows guest, let's say. You could have different Windows guests running different versions of Windows. This is actually a commercial. 72 00:15:10.019 --> 00:15:18.869 1 of the commercial appeals of something like this is you can run different variants. You could run a Windows 10 and a Windows. 73 00:15:18.869 --> 00:15:22.979 I don't know, whatever. 74 00:15:22.979 --> 00:15:28.528 You know, different versions of Windows as different guests, all running on the same. 75 00:15:28.528 --> 00:15:32.788 Host. And so this is actually a commercial reason for things like VMware. 76 00:15:32.788 --> 00:15:41.278 Okay, so how do you do it efficiently? You can't just have little subroutines for everything and kill performance by orders of magnitude. 77 00:15:41.278 --> 00:15:44.729 Well, the thing is that in your client program, your guest. 78 00:15:44.729 --> 00:15:59.158 Your client, most of the machine instructions are harmless, and you can prove they're harmless, like you're trapping memory addresses perhaps, and you can statically look at most of the machine instructions and know that they're no danger to the host. 79 00:15:59.158 --> 00:16:02.668 So you let them run, so no penalty. 80 00:16:02.668 --> 00:16:09.658 The powerful instructions, say analogous to things that would, you know, be used in superuser mode.
81 00:16:09.658 --> 00:16:14.759 Ring 0, whatever you call it, you can identify them statically. 82 00:16:14.759 --> 00:16:29.099 In general; this is assuming you're not doing things like creating new code, that is, new instructions, and then executing them. Well, that would be a dangerous instruction. So the powerful ones, you can trap them, and if it's. 83 00:16:29.099 --> 00:16:37.168 Good hardware, a good architecture, it will provide tools that make it easy to trap these powerful instructions. And then you emulate. 84 00:16:37.168 --> 00:16:41.639 To make sure that they're not stomping on someone else's memory and so on. 85 00:16:41.639 --> 00:16:50.129 So, doing this efficiently requires a good instruction set, where the powerful instructions, the dangerous instructions. 86 00:16:50.129 --> 00:16:55.619 Are identifiable, and you set a bit, a status bit, and then. 87 00:16:55.619 --> 00:17:05.249 They get trapped. So if you have the right instruction set, this is efficient; if you don't have the right instruction set, this is horrible. 88 00:17:05.249 --> 00:17:09.989 So, IBM has actually been doing virtual machines for 40 years. They started. 89 00:17:09.989 --> 00:17:22.949 It started in the '60s as a research control program and something called CMS, the Cambridge Monitor System, for Cambridge, Massachusetts, and they changed it to Conversational. 90 00:17:22.949 --> 00:17:27.449 Monitor System or something. And. 91 00:17:28.648 --> 00:17:40.229 And they built it into their instruction set, on something called the System/360, and they've expanded it. So, this has been part of IBM's product line for 40 years. And. 92 00:17:40.229 --> 00:17:47.638 Because now they have their mainframes, they've got the base, and then they can run different clients on the mainframes, and it's efficient. 93 00:17:47.638 --> 00:17:59.189 Another thing VMware does is they actually tweak the code of a guest, perhaps.
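The trap-and-emulate idea above can be sketched as a toy model (pure illustration in Python; the instruction names are invented, and a real hypervisor works at the hardware level, not like this):

```python
# Toy trap-and-emulate model: most guest "instructions" are harmless and
# run directly at full speed; the few privileged ones trap to the host,
# which emulates them after checking they are safe.
PRIVILEGED = {"write_page_table", "io_out", "halt"}  # hypothetical names

def run_guest(program):
    ran_natively, emulated = [], []
    for insn in program:
        if insn in PRIVILEGED:
            emulated.append(insn)      # trap: host checks and emulates
        else:
            ran_natively.append(insn)  # harmless: no penalty at all
    return ran_natively, emulated

native, trapped = run_guest(["add", "io_out", "mul", "halt"])
print(native, trapped)
```

The efficiency argument in the lecture is exactly that the `PRIVILEGED` set is small and statically identifiable, so almost everything falls into the no-penalty branch.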
So they will, if you're running Linux or Windows as a guest. 94 00:17:59.574 --> 00:18:14.273 On VMware, they actually tweak it a little, tweak the operating system. They run a slightly customized version of Windows or Linux perhaps, customized so that the dangerous instructions can be trapped efficiently. They may even, I believe, modify the code. 95 00:18:14.578 --> 00:18:20.548 So what was a dangerous instruction will be replaced with a trap instruction of some sort. So this means that it runs fast. 96 00:18:21.838 --> 00:18:27.479 There's more, but, as I mentioned, VMware is the leader that put a lot of money into this for many years. 97 00:18:27.479 --> 00:18:34.409 Okay, now, done right, a compute-intensive program, I think, has no overhead. 98 00:18:34.409 --> 00:18:47.394 Really, a few percent. But in my experience, and I've actually used VMware off and on for many years, more than 10 years, my experience is that the emulated file system can get awfully bad. You're running. 99 00:18:47.394 --> 00:18:52.374 Say I'm running a Windows guest and I'm doing a system update, a system update which might take. 100 00:18:52.648 --> 00:18:59.308 Half an hour; the Windows update that takes half an hour on a native system will take hours and hours. 101 00:18:59.308 --> 00:19:03.929 Running as a VMware guest, perhaps, and it's slow. 102 00:19:03.929 --> 00:19:18.239 Stuff just gets mapped too many times, and anything that was simple and efficient, maybe contiguous, on a native machine now is going through virtual blocks scattered around the real disk or something. 103 00:19:18.239 --> 00:19:21.808 Okay, but the nice thing is. 104 00:19:21.808 --> 00:19:24.898 Your guests can be all different. 105 00:19:24.898 --> 00:19:38.159 A compute-intensive example, question from Isaac, would be something like matrix multiplication in the guest, perhaps, because you've got n squared data and n cubed computation.
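The n-squared-data, n-cubed-compute point is easy to check with a back-of-the-envelope sketch (Python, 8-byte doubles assumed):

```python
# Classical n x n matrix multiply: ~2*n^3 flops over ~3*n^2 matrix
# elements, so flops per byte grow linearly with n. Big multiplies are
# therefore compute-bound, and virtualization's I/O overhead matters
# little for them.
def flops(n):
    return 2 * n**3                   # n^2 dot products, each ~2n ops

def data_bytes(n, bytes_per_elem=8):
    return 3 * n**2 * bytes_per_elem  # matrices A, B, and C

def intensity(n):
    return flops(n) / data_bytes(n)   # flops per byte; grows with n

print(intensity(12), intensity(1200))
```

So for a large multiply the CPU does enormous work per byte touched, which is why the guest's CPU time should be nearly native even when its file system is slow.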
106 00:19:38.159 --> 00:19:43.499 So, you've got a big matrix multiplication program, a linear algebra thing. 107 00:19:43.499 --> 00:19:48.778 Running in the guest, a big MATLAB job running in a guest in VMware. 108 00:19:48.778 --> 00:19:55.078 My guess is that the CPU time will be almost the same. I haven't tried it, but that's my guess. 109 00:19:55.344 --> 00:20:00.054 It will use more memory, because both the host needs memory and the guest, 110 00:20:00.084 --> 00:20:09.023 and each guest needs memory, and you do not want to swap that memory; you do not want your guests to be swapping memory except rarely. I mean, 111 00:20:09.054 --> 00:20:10.523 that gets into the file system. 112 00:20:12.778 --> 00:20:16.259 Shifting from the host to the VM? 113 00:20:16.733 --> 00:20:25.284 Well, yeah, if you go and read data from the disk. Okay, so the guest is creating a virtual disk, which gets mapped into a file on the real disk. 114 00:20:25.794 --> 00:20:38.544 And you've got different options in VMware: you can assign a whole partition to VMware to be used for the guest disk, or you can assign files in the host file system. 115 00:20:38.544 --> 00:20:49.433 So now, you see, you're going through 2 file systems, and actually the guest file system could be multiple 2-gigabyte files in the host file system, partitioned into 2-gigabyte pieces for management purposes. 116 00:20:50.548 --> 00:20:55.409 And you're going through 2 levels of file system, and my experience from. 117 00:20:55.409 --> 00:21:01.108 Using computers is that virtualizing layer on layer, virtual file systems on file systems, is a very bad idea. 118 00:21:01.108 --> 00:21:05.638 So I'll give you a simple example, if you just look at. 119 00:21:05.638 --> 00:21:09.628 This. Okay, here: old rotating hard drives. 120 00:21:09.628 --> 00:21:15.898 In many cases had 512-byte blocks; your newer. 121 00:21:15.898 --> 00:21:24.388 Drives tend to have 4-kilobyte blocks down at the hardware level.
But then they virtualize. 122 00:21:24.388 --> 00:21:31.888 The file system on top of a virtual view: the drive may pretend that it has 512-byte. 123 00:21:32.394 --> 00:21:47.213 Hardware blocks, to be compatible with the old rotating hard drives. But this means that a virtual 4K block can get not aligned right; things don't get aligned. 124 00:21:47.213 --> 00:21:49.763 Right? So what would be 1. 125 00:21:50.098 --> 00:21:59.969 Access to the disk, if it was native, becomes 2 accesses, because there's a misalignment. You just doubled the cost. 126 00:21:59.969 --> 00:22:03.628 This is why virtualizing file systems, I think, is bad. 127 00:22:03.628 --> 00:22:13.588 Okay, in any case, so VMware: the point is, you've got clients that are totally different. Well, they all have to be Intel-based. If you try to virtualize. 128 00:22:13.588 --> 00:22:21.628 An ARM-type operating system on an Intel base, now you're down to virtualizing at a really low level, and it's going to be bad. 129 00:22:21.628 --> 00:22:36.328 Okay, just as an aside there: when I tried to run VMware in the last year, I guess, it doesn't actually run on Ubuntu anymore, because behind the scenes Linux has been upgrading security. 130 00:22:36.328 --> 00:22:41.519 For example, you now cannot just boot a random file system. 131 00:22:41.519 --> 00:22:44.638 Or a random operating system on an Intel machine now. 132 00:22:45.894 --> 00:22:54.144 There's protection against that, to protect against certain security holes, so they have to be signed, and Windows is signed. 133 00:22:54.864 --> 00:23:09.233 There was a worry that this would freeze out Linux, but the critical Linux modules are actually signed, a cryptographic signature that is checked by the BIOS on your Intel machine now, and there are ways actually for you to sign extra modules of your own.
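The doubling from misalignment is simple to count (a sketch assuming 4 KiB physical sectors, as in the example above):

```python
# If a filesystem's 4 KiB block starts at an offset that is not a
# multiple of the drive's physical 4 KiB sector size, it straddles two
# physical sectors, so one logical access costs two physical accesses.
PHYS_SECTOR = 4096

def physical_accesses(offset, length=4096):
    first = offset // PHYS_SECTOR
    last = (offset + length - 1) // PHYS_SECTOR
    return last - first + 1

print(physical_accesses(0))    # aligned 4 KiB block
print(physical_accesses(512))  # off by one legacy 512-byte sector
```

An aligned block touches one physical sector; shift it by one legacy 512-byte sector and every access touches two, doubling the disk traffic exactly as described.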
134 00:23:09.808 --> 00:23:22.824 If you trust a module, you can generate a cryptographic signature so it can be loaded. But in general, in the Linux kernel now, root does not have infinite powers anymore. It cannot just load a random module into the kernel. 135 00:23:22.824 --> 00:23:32.034 There's the load command, but you can load only signed modules that have been cryptographically signed by someone like NVIDIA or whatever. And. 136 00:23:32.308 --> 00:23:38.638 The escape hatch is you can sign your own modules, but there's some failure here. So. 137 00:23:38.638 --> 00:23:43.348 I can't figure out how to sign the critical VMware modules, and therefore VMware can't run. 138 00:23:43.348 --> 00:23:47.128 So this is restricting the power of root in Linux. So. 139 00:23:47.128 --> 00:23:53.009 And you can see why: to protect against certain low-level security holes. 140 00:23:53.009 --> 00:23:58.318 They don't like to talk about this in Linux; obviously it's documented at a low level, but. 141 00:23:58.318 --> 00:24:08.338 When I read announcements about new versions of Ubuntu in my favorite OMG! Ubuntu! or whatever, they don't actually talk about this. 142 00:24:08.338 --> 00:24:15.479 Okay, so that was a fairly low level of virtualization. So there's overhead in memory and so on, overhead in disk space. 143 00:24:15.479 --> 00:24:29.878 The next level of virtualization restricts us in that we can only run clients that are basically running the same operating system. So with this, we have a Linux host; we can run Linux guests. 144 00:24:29.878 --> 00:24:35.249 Multiple Linux guests at the same time, but they're all drawing on the. 145 00:24:35.249 --> 00:24:39.989 Key facilities of the host operating system, so they have to be the same. 146 00:24:39.989 --> 00:24:44.699 But they see a private view: processes, file system, other resources.
147 00:24:44.699 --> 00:24:52.259 So normally, normally, you do the ps command and, with the right options, you see all the processes on the whole system. 148 00:24:52.259 --> 00:24:56.368 And in fact, you can see who's running each one; you can see the name. 149 00:24:56.368 --> 00:25:06.328 The command name; you can see the environment, in fact. Which is why they tell you, if you're on a multi-user system, when you're running, say, encryption. 150 00:25:06.328 --> 00:25:18.209 You can put the cryptographic key on the command line, which makes it easier to type, but if you do that, anyone else on the system can see it with a ps command. Okay. Now, with this level of virtualization. 151 00:25:18.209 --> 00:25:23.818 A client cannot see the other processes. It cannot see that they exist at all. It can see. 152 00:25:23.818 --> 00:25:31.169 Only its own. The file system, again: it's not that you see that there are other files there you can't read; it's that you don't see them. 153 00:25:31.169 --> 00:25:34.769 And other resources, there's other system resources. 154 00:25:34.769 --> 00:25:40.528 So, you get a private view of your piece of the system. 155 00:25:40.528 --> 00:25:49.259 The only way the rest of the computer affects you is, obviously, resource consumption: somebody else is running a compute-bound job. 156 00:25:49.259 --> 00:26:03.564 Well, those are cycles you're not using. Now, even that can be controlled: the host can limit what fraction of the machine each client can use. So, it could be set up that this virtual client can use no more than half of the CPU cycles. 157 00:26:03.683 --> 00:26:04.344 For example. 158 00:26:04.648 --> 00:26:15.509 Now, this resource consumption effect is actually a surprising information leak that can actually be used to leak information out of the virtual client.
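The command-line leak mentioned above is easy to see for yourself: on Linux, every process's arguments are world-readable under /proc (a small sketch; Linux-only, since it relies on the /proc pseudo-filesystem):

```python
# On Linux, /proc/<pid>/cmdline exposes any process's arguments to every
# user on the system -- which is why a cryptographic key should never be
# passed on the command line of a multi-user machine.
def cmdline(pid="self"):
    # Arguments are separated by NUL bytes in this pseudo-file.
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return [a.decode() for a in f.read().split(b"\0") if a]

print(cmdline())  # our own command line, readable by anyone
```

Under the namespace-isolated level of virtualization described here, a client enumerating /proc simply does not see the other clients' process directories at all.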
159 00:26:15.509 --> 00:26:19.679 The fact that you're using a lot of cycles, you're slowing everyone else down, and so on. 160 00:26:21.929 --> 00:26:27.538 In any case, there's various terms for this; paravirtualization might be 1 of the terms, you can Google it. 161 00:26:27.538 --> 00:26:34.048 And Linux has tools; so again, to do this efficiently requires that the operating system be designed to support it. 162 00:26:34.048 --> 00:26:45.118 So, Linux now has stuff called namespace isolation. Namespace isolation means that you don't see someone else's processes: in the process namespace, each client is isolated. 163 00:26:45.564 --> 00:26:57.743 File systems: isolated. Resources: isolated. Namespace isolation is a keyword you can Google if you like. And Linux has something called control groups also; they can group processes. 164 00:26:57.743 --> 00:27:00.953 They can have a hierarchy of privileges, actually, which is really nice. 165 00:27:01.199 --> 00:27:09.479 And this is actually the base that's used for a lot of other tools. So this higher level of virtualization has got a lot less overhead. 166 00:27:09.479 --> 00:27:14.308 But the clients have to be the same operating system. 167 00:27:14.308 --> 00:27:27.778 With something like this, you can run a command, and it could fire up a virtual machine, do something, and then end; the overhead is a lot less with this. 168 00:27:27.778 --> 00:27:31.679 Paravirtualization, or whatever it's called; I couldn't swear to that precise name, but. 169 00:27:31.679 --> 00:27:40.019 Okay, next level up: you just have normal Linux security, file system protections. You can see other resources, but the theory is you can't access them. 170 00:27:40.019 --> 00:27:50.818 And also at the normal level, at the next level, you've got the secure Linux, SELinux, that was funded by No Such Agency.
171 00:27:50.818 --> 00:27:54.388 The National Security Agency, and so on. 172 00:27:54.388 --> 00:28:02.189 Anyhow, this has less overhead, but you can still see some other stuff. Now, the normal Linux level: it's really hard to make secure. Like. 173 00:28:02.189 --> 00:28:05.999 You know, I worry; I run Firefox, so I examine. 174 00:28:05.999 --> 00:28:10.858 My system, and I think: what are the biggest potential security holes? And. 175 00:28:10.858 --> 00:28:19.588 Well, you know, I load useful programs, some over the web, things to do geometry and so on; I have to trust them. Firefox: I've got. 176 00:28:19.588 --> 00:28:30.594 Plugins in Firefox; I have to trust them. And how would I make that secure? That's surprisingly hard: even if I can describe exactly the security I want, it's really hard to do. 177 00:28:30.594 --> 00:28:34.614 Like, maybe I want to declare that Firefox is allowed to write nothing. 178 00:28:35.068 --> 00:28:40.919 On the computer, except /tmp and except in a subdirectory of my home directory called. 179 00:28:42.239 --> 00:28:51.689 I cannot write commands that effectively enforce that, because Firefox by default stomps over my whole home directory trying to read stuff, and if I stop it, it fails. 180 00:28:51.689 --> 00:28:56.999 So, and there are tools like AppArmor. 181 00:28:56.999 --> 00:29:02.878 And so on, which pretend to help, and they help security somewhat, but the other thing. 182 00:29:02.878 --> 00:29:12.058 Is that people doing graphical user interfaces are working as hard as they can to write new convenience toys that evade the security. 183 00:29:12.058 --> 00:29:15.058 For example, AppArmor, it. 184 00:29:15.058 --> 00:29:20.308 Traps new process spawning, and so if. 185 00:29:20.308 --> 00:29:23.578 Something, you know, Firefox or. 186 00:29:23.578 --> 00:29:28.679 Whatever, spawns a process, then I could write a rule which.
187 00:29:28.679 --> 00:29:35.189 Says whether Firefox is allowed to spawn such a process, and forces a subprocess to inherit any restrictions on Firefox. 188 00:29:35.189 --> 00:29:41.038 Nice. But now, in our GUI in Linux, we've got these. 189 00:29:41.038 --> 00:29:51.298 Communication channels; I forget what they're called now. They're trying to make Linux have the security level of Windows, actually, but they've got these convenience things. So. 190 00:29:51.298 --> 00:30:04.019 Things on my desktop, my computer desktop, can send messages, can send commands, to each other, which is very nice. But without any concept of security in this, you start figuring out how to. 191 00:30:04.019 --> 00:30:07.409 You know, trap that. In any case. 192 00:30:08.548 --> 00:30:20.038 The advantage for security of virtual machines is you put the app in the virtual machine and it's now in a walled garden, that's 1 of the terms, and it can't get out, can't escape from the walled garden. 193 00:30:20.038 --> 00:30:24.778 And endanger the rest of your machine. Well, that's nice. Except. 194 00:30:24.778 --> 00:30:29.338 That was also the theory for Java security, and we see how secure Java is. 195 00:30:29.338 --> 00:30:37.378 I don't know what's hard about writing a competent walled garden, but it's hard; programmers are not able to do it. So. 196 00:30:37.378 --> 00:30:51.239 Now, apart from security, there's another big advantage of virtual machines: they isolate you from changes in the host. 197 00:30:51.239 --> 00:31:01.979 So, it's taken hold in virtual memory: virtual memory to some extent isolates you from the amount of real memory on the machine. If you don't have enough real memory, it creates a virtual space, virtual memory; it hurts the performance. 198 00:31:01.979 --> 00:31:06.838 But it would still run. And if you have a. 199 00:31:06.838 --> 00:31:16.648 If you have a virtual machine client, it's got this idealized version of the real machine.
The actual hardware is hidden from it. 200 00:31:16.648 --> 00:31:27.084 So, you can run on different virtual machines, and you could even, say, spill over to something cloudy. So you could maybe run some virtual machines locally. 201 00:31:27.084 --> 00:31:35.634 And then if you need more power, you spill over to Amazon Elastic Compute Cloud, Amazon and so on. 202 00:31:36.269 --> 00:31:39.388 Which is following the same standard. 203 00:31:39.388 --> 00:31:44.759 And in theory, you could just take your local virtual machine, run it on Amazon. So you've got surge capability. 204 00:31:45.808 --> 00:31:56.788 And places like, well, Harvard, for example, Harvard University, where I was a grad student, they do it now for their main low-level computer science course, which is incredibly popular. 205 00:31:56.788 --> 00:32:03.328 Actually taught by my former advisor, who just retired 2 years ago, but. 206 00:32:04.134 --> 00:32:18.503 So, what Harvard does is, when they need more computing power, they get it, I think, from Amazon. So you've got the virtual machine, you've got the surge capability, and you're isolated from the actual hardware. You can buy: so, if you're going to run something, you know, 24/7, it's much cheaper to buy your own hardware. 207 00:32:18.503 --> 00:32:21.233 If you're going to run it occasionally, you run it on Amazon's service. 208 00:32:24.239 --> 00:32:31.709 So that's very nice. The flip side of that is that. 209 00:32:31.709 --> 00:32:42.598 You're also isolated from really high-performance features of the hardware. So it took a while for NVIDIA stuff to be usefully accessible to virtual machines. 210 00:32:42.598 --> 00:32:47.578 You know, it hides all the differences, and that includes the high-performance stuff. 211 00:32:47.578 --> 00:32:52.828 Okay, so before I move on to Docker: if you have any questions. And.
212 00:32:52.828 --> 00:32:58.769 Trivia question, related to that: I'm drinking coffee from a cup that says a little. 213 00:32:58.769 --> 00:33:03.419 So, no points for anyone who can tell me where that is. 214 00:33:03.419 --> 00:33:07.138 I was there 2 years ago. Okay. 215 00:33:07.138 --> 00:33:13.019 You can Google it. Um, okay, so Docker. 216 00:33:14.909 --> 00:33:19.709 Docker is a popular, lightweight virtualization system. So it's the lightweight thing. 217 00:33:19.709 --> 00:33:32.574 Everything I tell you is as I understand it; I could be wrong, I could be out of date, things change. The key to a successful company is they look at what their customers want, and they consider adding it to their capabilities. 218 00:33:32.574 --> 00:33:34.943 So, but at least in the past, Docker. 219 00:33:35.219 --> 00:33:39.118 You'd say, had to run Linux guests on a Linux host, and so on. 220 00:33:39.118 --> 00:33:45.568 It's very popular commercially, and NVIDIA uses it to distribute software, because again, you see. 221 00:33:45.568 --> 00:33:57.384 How do you package software so customers can use it? On Linux, 1st, you put it in a .deb file, you put it in an RPM, you put it in a tarball; but the tarball, when you extract it, makes assumptions about the operating system. 222 00:33:58.344 --> 00:34:05.903 Now they've got a couple of competing things; Ubuntu has something called snap. They're ways to try and. 223 00:34:06.179 --> 00:34:15.929 Package the required dependencies with your application, so it puts fewer demands on your host operating system. On Linux, this is a problem because. 224 00:34:16.764 --> 00:34:29.963 Linux is its own nonstandard thing. In fact, you feel sorry for NVIDIA: go to NVIDIA's developer website, where you download the latest version of CUDA, and they've got, like, 4 different ways to install it now. I feel sorry for them. 225 00:34:29.963 --> 00:34:42.773 They've got an RPM.
They've got a .deb, they've got a tarball, you can put it in your local software repository sources for downloading, and the normal install process will download it. 226 00:34:43.043 --> 00:34:47.213 And then they've got different versions of this for every different version of Linux. 227 00:34:47.818 --> 00:34:56.338 And then if NVIDIA doesn't hop quickly enough when Linux changes, and sometimes Linux changes incompatibly and invalidates old code. 228 00:34:56.338 --> 00:35:03.628 So, if NVIDIA doesn't hop smartly enough, then they get, and so on. 229 00:35:03.628 --> 00:35:06.628 So, you feel sorry for them. 230 00:35:06.628 --> 00:35:19.889 So 1 of their multiple ways to distribute software is as a Docker image. So you run Docker on your system, and it's free, and you download an image from NVIDIA's website. 231 00:35:19.889 --> 00:35:25.469 And in theory, you can run it. I keep emphasizing "in theory" because I've tried to do it. 232 00:35:25.469 --> 00:35:29.849 And you might notice I'm no longer using Docker on parallel. There's a reason. 233 00:35:29.849 --> 00:35:43.619 Okay, in any case, this is the theory, and the theory is also that for a simple image, again an application, the overhead to start them up and take them down is really low. A command. 234 00:35:43.619 --> 00:35:58.139 You know, the compiler, the C++ compiler, could be a Docker image. In fact, that was actually why I initially installed Docker. So it's worth learning Docker. Now, what can happen is that. 235 00:35:59.518 --> 00:36:07.528 With Docker, you can end up with dozens or hundreds of Docker images, maybe even running simultaneously, maybe just sitting there. 236 00:36:07.528 --> 00:36:11.278 In a client-server concept, and. 237 00:36:11.278 --> 00:36:20.039 Just waiting to run. So Kubernetes is a tool to manage lots of Docker images. 238 00:36:20.039 --> 00:36:26.338 So, if you want more information, I gave you a couple of links. You can also.
239 00:36:28.318 --> 00:36:42.748 Okay — "build and ship apps" and so on, and they've got a conference, probably free. So the thing with things like, again, Amazon EC2: your Docker image could run on your own private machine, or could run on Amazon. 240 00:36:42.748 --> 00:36:51.599 And then you get stuff where, in theory, the thing can migrate. Microsoft has Docker, 241 00:36:51.599 --> 00:36:55.018 security and stuff like that. 242 00:36:57.028 --> 00:37:03.449 So, free stuff, money stuff — I call it the cocaine pricing model, but okay. 243 00:37:03.449 --> 00:37:06.659 Silence. 244 00:37:08.188 --> 00:37:12.628 Oh, okay. So you're gonna have fun with all of that. So now, 245 00:37:12.628 --> 00:37:25.588 other references here — piles and piles of references you can play with if you've got spare time. Uh huh. So, my use case with Docker is when I went to using the Portland Group 246 00:37:25.588 --> 00:37:38.489 compiler, because I got annoyed — I get annoyed — at g++, because it didn't do OpenACC. So I was looking for replacements. 247 00:37:38.489 --> 00:37:45.838 pgc++ is sponsored by NVIDIA, you might say, so I initially tried to have it running as 248 00:37:45.838 --> 00:37:54.059 an image on parallel. Okay. So that's running a private file system, but you've got a hook. So you can nominate 249 00:37:55.318 --> 00:38:06.599 file trees on the host file system and mount them on the guest. That's how you get files back and forth between the host and the guest. You say that a certain 250 00:38:06.599 --> 00:38:16.050 file tree on the host, or whatever, is visible on the guest. Just like in Windows 251 00:38:16.050 --> 00:38:23.400 you mount some remote file system, or in Linux — as I showed you, on my laptop here I can mount the parallel file system 252 00:38:23.400 --> 00:38:28.199 and access it just as a local file system on my laptop.
253 00:38:28.199 --> 00:38:31.559 I don't try to do fancy things. 254 00:38:31.559 --> 00:38:36.300 So, it might not work with access control lists, perhaps, and 255 00:38:36.300 --> 00:38:45.119 subtleties of read-modify-write in the file system might get lost, and the performance is horrible, but I can do that. So you do that with Docker also. 256 00:38:45.119 --> 00:38:50.789 The problem is that it was really hard — like, impossible — to get the security right. 257 00:38:50.789 --> 00:38:59.219 I couldn't see how to specify a host file system for the guest in a way that the guest could not escape and get onto the 258 00:38:59.219 --> 00:39:10.320 host, and it got worse because I was having to run Docker as root, actually. And this made it really bad. So I just couldn't see how to make it secure. 259 00:39:10.320 --> 00:39:21.780 And then I figured out how to install the compilers as, like, a normal Debian package or something — or as a whole repository — and then I didn't need to do that anymore. So I killed it. 260 00:39:23.400 --> 00:39:29.369 Okay, so that is your half class on virtual machines and Docker. 261 00:39:29.369 --> 00:39:37.710 Questions? No. Okay. 262 00:39:37.710 --> 00:39:48.150 The next subject for today is we're continuing on NVIDIA GPUs, and again, why this is worth spending time on is that it 263 00:39:48.150 --> 00:39:55.199 is the latest new supercomputer architecture, and also 264 00:39:55.199 --> 00:40:04.405 how they solve things is instructive. NVIDIA the company is over 20 years old now; they're a successful company; they've outlasted competitors. 265 00:40:04.405 --> 00:40:11.394 The competitors have mostly failed because NVIDIA — they've got good people, and they listened to their users. 266 00:40:11.940 --> 00:40:16.409 And they grow because they're providing a service. Let me show you a 267 00:40:16.409 --> 00:40:28.590 couple of important ways NVIDIA has listened to what its customers want.
So, this is the key, you know: if you're running a business and customers want to give you money to do something, 268 00:40:28.590 --> 00:40:36.690 you seriously want to consider accepting the money and doing what they want. Just don't reflexively say no, we don't do that. 269 00:40:36.690 --> 00:40:44.340 Companies do — former companies, I guess you'd call them. Okay. Things NVIDIA's done right in the past: 270 00:40:44.340 --> 00:40:49.980 they started out doing graphics accelerators. 271 00:40:49.980 --> 00:41:00.659 Okay, this is a rehash of my computer graphics course. You want to render polygons on your screen — 272 00:41:00.659 --> 00:41:08.460 maybe thousands, maybe millions of polygons. So you've got the vertices of the triangles. 273 00:41:08.460 --> 00:41:17.965 You want to rotate them and project them. The vertices are independent, so that can all be done in parallel in hardware — the more vertices you can process in parallel, the better. 274 00:41:17.965 --> 00:41:23.695 So you'd have things called vertex shaders that would rotate and project the vertices of your 275 00:41:24.864 --> 00:41:35.994 polygons, and then the vertices would be connected up to make triangles and so on, and then you'd have fragment shaders that would process the pixels in your frame buffer and your depth buffer, 276 00:41:36.414 --> 00:41:44.905 so that if you draw 2 objects to the same pixel, the frontmost 1 gets drawn — and, you know, it's the frontmost 1 because you're maintaining a depth buffer and so on. 277 00:41:45.179 --> 00:41:51.269 So NVIDIA did hardware that did that specific thing very fast. 278 00:41:51.269 --> 00:41:57.360 And the only arrays, for example, were when they invented texture memory — the textures, right? 279 00:41:57.360 --> 00:42:04.650 So, they were very limited, but what they did, they did very fast. And customers — researchers —
280 00:42:04.650 --> 00:42:13.559 basically tortured their NVIDIA GPUs to do non-graphics stuff fast also, embodied in a thing called GPGPU — general-purpose 281 00:42:13.559 --> 00:42:21.510 programming on GPUs. NVIDIA observed this happening and they added more general instructions to the GPU, 282 00:42:21.510 --> 00:42:27.059 so that you can now do general-purpose programming on it without having to 283 00:42:27.059 --> 00:42:33.840 hack your way using the vertex shaders and the fragment shaders and texture maps and so on. That was a 284 00:42:33.840 --> 00:42:38.849 horrible, horrible kludge. Okay. Now you've got CUDA and stuff. Good. 285 00:42:38.849 --> 00:42:49.139 That was a big thing NVIDIA did. A more recent thing is, of course, they're expanding — for several years now — from graphics into machine learning. So they've been 286 00:42:49.139 --> 00:42:52.440 observing that machine learning is a big business. 287 00:42:52.440 --> 00:43:00.719 There's some argument that everything that's current in computer engineering is machine learning, because if it's current, you call it part of machine learning. 288 00:43:00.719 --> 00:43:07.050 Okay, but machine learning is very compute-bound — determining the coefficients. 289 00:43:07.050 --> 00:43:15.630 Or take, say, autonomous vehicles. So I was test-driving a Tesla on Saturday. I'm going to buy a Tesla real soon now, I think. 290 00:43:15.630 --> 00:43:19.050 So, in a Tesla — 291 00:43:19.050 --> 00:43:26.219 1 option in Teslas is autonomous driving. It is a 10000 dollar option on top of the base car. 292 00:43:26.219 --> 00:43:32.309 They sell it because people are willing to pay 10000 dollars to make their car 293 00:43:32.309 --> 00:43:42.840 autonomous — or mostly autonomous, not completely — because of the effort NVIDIA has put into things like machine learning. And they sell their chips to 294 00:43:42.840 --> 00:43:47.099 companies — I don't know if they still sell to Tesla; they used to.
295 00:43:47.099 --> 00:44:01.320 But they provide the servers to compute the coefficients that go into these programs, and they have added hardware — NVIDIA, for some years now, has been adding hardware to 296 00:44:01.320 --> 00:44:06.989 their GPUs to make machine learning fast. Specific hardware they add is 297 00:44:06.989 --> 00:44:15.510 half-precision floating point — FP16, 16-bit floating point. They've also been adding things called — these — 298 00:44:15.510 --> 00:44:23.280 I'm kind of blanking on the name now — these processors. These processors take a 4 by 4 matrix, 299 00:44:23.280 --> 00:44:27.630 which can have different data types, integers or 16-bit floats, 300 00:44:27.630 --> 00:44:32.309 and can do a matrix — basically a 301 00:44:32.309 --> 00:44:38.610 matrix multiply and add — very quickly, because they devote special hardware to it. This is 302 00:44:38.610 --> 00:44:43.500 NVIDIA providing facilities that 303 00:44:43.500 --> 00:44:49.380 the customer wants. So that is why we are spending some time on it. 304 00:44:49.380 --> 00:44:57.239 So we're going to start with lecture 9.1 in their slide set, if I can find it. 305 00:44:58.889 --> 00:45:02.880 Okay, and again, I'm speed-reading it. 306 00:45:10.289 --> 00:45:13.710 So, what we're seeing today 307 00:45:13.710 --> 00:45:17.010 is 1 of the paradigms — 308 00:45:17.010 --> 00:45:22.829 we'll see a little of it — called reduction. We have not seen reduction yet; well, we've seen it sort of. 309 00:45:22.829 --> 00:45:29.280 We're going to see it in more detail now. So it's a paradigm; it's a programming pattern. 310 00:45:29.280 --> 00:45:34.679 It's used in parallel programs because it does a lot of things efficiently. 311 00:45:34.679 --> 00:45:38.519 So, we're going to see a little in this module 9.1 312 00:45:38.519 --> 00:45:43.289 on how to do it fast on parallel machines. 313 00:45:43.289 --> 00:45:46.739 Okay, so it's a class of computation. It's —
314 00:45:46.739 --> 00:45:49.800 you put it in your tool kit if you're doing parallel programming. 315 00:45:49.800 --> 00:45:54.119 And we're going to look at how efficient it is and do it better. 316 00:45:55.139 --> 00:46:04.710 Okay, so basically — and, again, you want to have buzzwords — MapReduce. 317 00:46:04.710 --> 00:46:09.989 MapReduce — I'll tell you what MapReduce is, since they mention it here. Again, it's a paradigm 318 00:46:09.989 --> 00:46:13.170 for processing large data sets. It has 2 319 00:46:13.170 --> 00:46:23.969 types of operations, and you work with a set of elements: the map operation applies a function to each element in the set and creates a new set. 320 00:46:23.969 --> 00:46:27.389 The reduce operation 321 00:46:28.980 --> 00:46:38.429 combines those things — like, computing the sum, let's say, is the most common thing, or perhaps finding the maximum is another common thing. 322 00:46:38.429 --> 00:46:41.550 And so it's called MapReduce, and 323 00:46:42.235 --> 00:46:52.795 the MapReduce idea has become popular in the last 5 or 10 years — Google popularized it, I said, and so on. There are tools, like Hadoop, which use it; Hadoop has HDFS. 324 00:46:52.795 --> 00:46:57.804 Hadoop, I've heard, might be becoming passé now, which is a reason I try to avoid very new tools. 325 00:46:58.530 --> 00:47:09.300 In any case, the MapReduce idea — the earliest reference I can find to it is IBM had a commercial language called APL 326 00:47:09.300 --> 00:47:16.650 almost 60 years ago — 50-some years ago or something — and it had array- 327 00:47:16.650 --> 00:47:24.269 manipulation instructions in it. And the thing is, it had a special and large character set; you needed a special keyboard 328 00:47:24.269 --> 00:47:28.650 and typewriter — electric typewriter. 329 00:47:28.650 --> 00:47:34.590 But so it had operations to do things like map and reduce and so on. 330 00:47:34.590 --> 00:47:42.480 This is, again — IBM sometimes does these things very early. In any case,
So, IBM had this in a commercial language in the 331 00:47:42.480 --> 00:47:47.460 sixties, and I guess it had its people. 332 00:47:47.460 --> 00:47:51.989 And then Google popularized it in the teens. 333 00:47:51.989 --> 00:48:05.639 Oh, well, Google didn't pretend they invented it. Okay. So this is a reduction operation, and there's an extension of it called scanning. So the fine points up here are: 334 00:48:05.639 --> 00:48:10.289 perhaps we want to reduce a set with a billion elements. 335 00:48:10.289 --> 00:48:16.079 It doesn't fit into 1 thread block; it may not fit into 1 grid, actually. 336 00:48:16.079 --> 00:48:23.039 So, we have to partition the data, they say, into chunks; each thread processes a chunk, and so on. 337 00:48:23.039 --> 00:48:29.579 Okay. And it's, um — 338 00:48:29.579 --> 00:48:43.079 and we've seen a little about this before, because tools like OpenMP and OpenACC have got reduction operators you can use on a loop. And I've shown you a little about how we implement it; now we're going to see more detail. 339 00:48:43.079 --> 00:48:47.760 Okay, so — 340 00:48:47.760 --> 00:48:52.139 your operation has to be associative and commutative. 341 00:48:54.210 --> 00:49:00.989 Everyone knows what commutative means, I guess — if you don't, ask. Okay. Um — 342 00:49:02.880 --> 00:49:08.280 that would make it a commutative group, I guess. 343 00:49:08.280 --> 00:49:15.599 Okay — not enough people take modern abstract algebra. You should. Okay. Um — 344 00:49:16.860 --> 00:49:20.610 so the sequential thing is you scan your way down the array, 345 00:49:20.610 --> 00:49:24.659 accumulating the sum and so on. So it 346 00:49:24.659 --> 00:49:30.599 is an O(n) algorithm — n is the number of elements in the array — and if you're sequential, that's what you do. 347 00:49:30.599 --> 00:49:33.809 Okay, but we're not sequential.
348 00:49:35.789 --> 00:49:44.670 In parallel, you've got something like this — it's like a tournament tree, if anyone watches BattleBots Thursday night. 349 00:49:44.670 --> 00:49:49.829 They're now down to 8 contestants, I guess; they have their tournament tree. 350 00:49:51.300 --> 00:49:55.590 Why don't students have an entry in BattleBots? Poly does. 351 00:49:55.590 --> 00:50:01.949 Okay, tournament tree. And, um — 352 00:50:05.190 --> 00:50:11.159 the thing is — suppose you want to do it on a parallel machine. Let's go back to the slides. 353 00:50:11.159 --> 00:50:24.989 I haven't given you seasickness lately. So you might say, okay, how to do this in parallel? Well, the trouble is — so you've got this tree here, and your 1st level here is not very parallel. 354 00:50:24.989 --> 00:50:33.989 Um, it takes n items you're combining; then that 1st step there is n over 2 355 00:50:33.989 --> 00:50:38.849 operations all being done at the same time, perhaps. And 356 00:50:38.849 --> 00:50:46.920 so, they talk about — your peak resource requirement is quite high; your average parallelism is a lot lower. 357 00:50:46.920 --> 00:50:50.400 You can read this detail on your own if you want. 358 00:50:50.400 --> 00:51:02.039 Oh, by the way, if you're not familiar with how the numbers work out: if there are n elements in the original array, there's a total of n-1 additions, because each addition reduces the number of elements by 1. 359 00:51:02.039 --> 00:51:06.510 So, in terms of total operations, it's actually efficient. 360 00:51:06.510 --> 00:51:18.090 Okay, so it's work-efficient in the sense that the total amount of work for the parallel thing is comparable to the sequential thing. 361 00:51:18.090 --> 00:51:23.369 But at that 1st level, there's a lot of work all at the same time. So it 362 00:51:23.369 --> 00:51:29.369 may not be resource-efficient in terms of parallelism, but it's work-efficient. You're not wasting cycles.
363 00:51:31.230 --> 00:51:34.500 Okay, so how are we going to improve that? 364 00:51:37.650 --> 00:51:43.710 Silence. 365 00:51:43.710 --> 00:51:56.280 Okay, so in the parallel implementation, you add 2 values in each step, and initially you've got n over 2 threads, and then you work your way down. 366 00:51:56.280 --> 00:52:01.320 And 367 00:52:01.320 --> 00:52:10.050 now, you can work in place. So you might say you're creating new arrays, and each array is half as big — except you reuse, you overwrite, the array you've got. 368 00:52:11.099 --> 00:52:17.670 Now, the way you can use the shared memory to speed things up is 369 00:52:17.670 --> 00:52:27.659 by loading a chunk of the original array, which is in global memory — you put it in a shared memory array, as much as will fit — 370 00:52:27.659 --> 00:52:37.619 and then you've got all the threads in the block attacking that shared memory array, and overwriting the array with the 2-value sums. So it minimizes memory usage. 371 00:52:37.619 --> 00:52:42.480 And that's what we're going to do. And then each 372 00:52:42.480 --> 00:52:48.719 thread block ends with its shared memory holding a reduction, and then you combine the thread blocks. 373 00:52:48.719 --> 00:52:54.960 So here's an example of what you could do. Excuse me. 374 00:52:56.550 --> 00:53:06.840 You've got 8 elements in your shared memory, and so thread 0 adds elements 0 and 1, thread 1 adds elements 2 and 375 00:53:06.840 --> 00:53:17.250 3, thread 2 elements 4 and 5, and thread 3 elements 6 and 7. So thread number K adds elements 2K and 2K plus 1, 376 00:53:17.250 --> 00:53:22.590 and overwrites element 2K. No extra memory needed. 377 00:53:23.610 --> 00:53:27.750 Okay, now, this 1st step here — 378 00:53:27.750 --> 00:53:31.800 you know, in a thread block of a 1000 threads, 379 00:53:31.800 --> 00:53:36.059 we're going to be processing 2000 elements. But it's possible 380 00:53:36.059 --> 00:53:42.480 that your hardware cannot run a 1000 threads all at the same time,
381 00:53:42.480 --> 00:53:51.420 depending on resources. So this 1st step may use a number of cycles. So the threads in that 1st step may — 382 00:53:51.420 --> 00:54:04.800 you know, the warps may be running consecutively, not in parallel, depending on the hardware resources. In any case, after the 1st step, now, all the items in the shared memory or global memory, whichever, 383 00:54:04.800 --> 00:54:08.909 have your subtotals in the even-numbered ones, and the odd-numbered ones 384 00:54:08.909 --> 00:54:11.969 you don't care about. In the next step, 385 00:54:13.260 --> 00:54:18.389 you see, thread number K adds element 4K to element 4K plus 2 — 386 00:54:18.389 --> 00:54:27.659 sorry, sorry — no, just the even-numbered threads 2K do that; the odd-numbered threads sit idle. 387 00:54:27.659 --> 00:54:37.199 And then here, we only use threads that are multiples of 4, and we end up here. Now, this is 1 way to do it. 388 00:54:37.199 --> 00:54:40.710 The problem with this — it works fine, 389 00:54:40.710 --> 00:54:43.920 but before I go to the next slide set — 390 00:54:43.920 --> 00:54:49.829 um, why this is not as efficient as it might be: 391 00:54:51.269 --> 00:54:55.110 if we look in the middle here, what's happening is 392 00:54:55.110 --> 00:54:59.219 we're not using consecutive threads; we're using 393 00:54:59.219 --> 00:55:04.739 alternate threads — like, in step 2 we're using threads that are 2 apart from each other. 394 00:55:04.739 --> 00:55:10.079 And this doesn't play nicely with the concept of 32 threads forming a warp. 395 00:55:10.079 --> 00:55:15.539 So, we've got 32 threads in a warp here, and the alternate threads are sitting idle. So that's — 396 00:55:15.539 --> 00:55:23.730 and as you go further down the tree, we're having more and more threads sitting idle. So we're really not playing nicely with the concept of a warp of threads. 397 00:55:23.730 --> 00:55:28.650 The 2nd point is we're also not playing nicely with the concept that
398 00:55:28.650 --> 00:55:32.070 data accesses should be contiguous. 399 00:55:32.070 --> 00:55:35.730 Um, so here — 400 00:55:35.730 --> 00:55:47.460 okay, we're accessing data elements that have gaps — a stride — between them. And if you're in the global memory, this is really bad. And even in the shared memory — well, you know, you're wasting stuff, maybe. 401 00:55:47.460 --> 00:55:53.969 You know, it'd be nicer if the active stuff was packed together. So then you could maybe free up things. 402 00:55:53.969 --> 00:56:00.090 So, this 1st way to do a parallel sum reduction 403 00:56:00.090 --> 00:56:03.989 um — 404 00:56:06.510 --> 00:56:10.320 is, um — 405 00:56:10.320 --> 00:56:14.909 I just realized I might be — 406 00:56:16.260 --> 00:56:20.099 I am recording. Good. I got worried. Okay. 407 00:56:20.099 --> 00:56:23.219 You see, what you're — 408 00:56:23.219 --> 00:56:28.920 it's parallel, but it's wasting thread resources: an inefficient use of the threads and an inefficient use of memory. 409 00:56:30.449 --> 00:56:38.369 And they're talking about that here: 1 of the inputs comes from an increasing distance away. For global memory, that's bad. 410 00:56:38.369 --> 00:56:44.010 Shared memory — well, not directly bad, but you're wasting memory. 411 00:56:44.010 --> 00:56:48.210 You'd like to pack stuff together. Okay. 412 00:56:51.690 --> 00:56:54.960 And here's how we would implement this. 413 00:56:54.960 --> 00:57:06.000 __shared__ means — this is in your __global__ routine, which is called from the host and runs on the device — the __shared__ array is an array which fits in shared memory, if there's room. 414 00:57:06.000 --> 00:57:13.619 And you do the reduction step — again, you're doing __syncthreads() and things. 415 00:57:13.619 --> 00:57:18.809 So — __syncthreads(). 416 00:57:19.860 --> 00:57:24.059 Yeah, so we've got this tree of partial sums. 417 00:57:24.059 --> 00:57:30.659 We've got to remember that. Let me give you a little seasickness and
418 00:57:30.659 --> 00:57:35.010 scroll back a few slides here. Gotcha. 419 00:57:35.010 --> 00:57:39.780 Okay, that 1st level, step 0 here — lots of threads. 420 00:57:39.780 --> 00:57:45.869 Maybe it's more threads than can run simultaneously. So, some threads — 421 00:57:45.869 --> 00:57:51.960 some warps are going to run 1st, while other warps are queued up, waiting to run. 422 00:57:51.960 --> 00:57:55.170 So the 1st warp finishes, 423 00:57:55.170 --> 00:57:59.880 and then you've got some warp processors, as they're called, 424 00:57:59.880 --> 00:58:04.980 waiting for work, so they get assigned more warps that are waiting. 425 00:58:04.980 --> 00:58:10.800 But as this thing starts finishing, now some more processors are going to be finished, 426 00:58:10.800 --> 00:58:14.639 and there's going to be no more work for them to do. Okay. 427 00:58:14.639 --> 00:58:20.639 So they might start running — by default, they'd start running this stuff in level 1 here, 428 00:58:20.639 --> 00:58:23.699 because there are warps waiting to run. 429 00:58:23.699 --> 00:58:27.840 So level 1 might start wanting to run 430 00:58:27.840 --> 00:58:31.050 before all of the threads in level 0 have finished. 431 00:58:32.130 --> 00:58:38.639 You see, because some of the level-0 threads finished before others, and then there are some more processors that 432 00:58:38.639 --> 00:58:42.599 are waiting idle, because all of the level-0 threads have actually started, 433 00:58:42.599 --> 00:58:49.349 but some of them have already finished, because they started earlier. So now there are more processors waiting — waiting to run level 1. 434 00:58:49.644 --> 00:59:03.085 But they shouldn't run level 1, because all the data is not available to them yet, because some of the level-0 threads haven't finished. So the level-1 threads cannot start running until all the level- 435 00:59:03.114 --> 00:59:04.764 0 threads have finished.
436 00:59:05.070 --> 00:59:09.059 That's why the __syncthreads(). 437 00:59:10.409 --> 00:59:17.010 Okay, and you might experiment with omitting __syncthreads() from a program. 438 00:59:17.010 --> 00:59:23.219 You might get a different answer every time, or you might get the same answer every time — and if you get the same answer, 439 00:59:23.219 --> 00:59:28.079 maybe it might even be right. Who knows? Why not? It might be consistently wrong. 440 00:59:29.550 --> 00:59:32.639 Okay, so, um — 441 00:59:35.610 --> 00:59:39.539 so in any case, so now, assume what we're doing here is 442 00:59:39.539 --> 00:59:43.860 threads in a block adding up elements. 443 00:59:43.860 --> 00:59:48.030 So, maybe there are more elements in my original array than 444 00:59:48.030 --> 00:59:54.690 you can have threads in the block — again, a block can have up to 1024 threads. 445 00:59:56.844 --> 01:00:08.065 Apparently a constant — oh, by the way, when I say things are fairly constant: I notice the latest version of NVIDIA, the Ampere architecture, is changing some of these numbers that have stayed fairly constant for years. 446 01:00:08.065 --> 01:00:13.255 They've increased the shared memory size, for example. Still 32 threads in a warp, but the number of bytes 447 01:00:14.130 --> 01:00:18.809 in shared memory and so on got larger. 448 01:00:18.809 --> 01:00:23.880 Okay, in any case — so we've got more elements to sum, to reduce, 449 01:00:23.880 --> 01:00:33.210 than we can have threads in a block. So chunks of the global data are going to be reduced in separate thread blocks. And the separate thread blocks are not 450 01:00:33.210 --> 01:00:42.000 talking to each other — again, they could be running consecutively, so any attempt to make them communicate would be horribly inefficient. 451 01:00:42.000 --> 01:00:50.940 And they're saying here the host might even start separate kernels. 452 01:00:50.940 --> 01:00:58.289 Okay, so now — then we have to merge these partial results.
453 01:00:58.289 --> 01:01:01.769 Well, we could just copy them back to the host and add them up. 454 01:01:01.769 --> 01:01:13.769 Or thread 0 of each block could collect the results, or the threads in each block could accumulate with atomics — we saw a little of this, actually — yesterday? Not yesterday, but — 455 01:01:13.769 --> 01:01:18.630 okay, we're starting to get into the idea now of how to do this more efficiently. 456 01:01:19.949 --> 01:01:32.460 Improving resource efficiency: the thread-to-data mapping, reducing control divergence. Divergence means 457 01:01:32.460 --> 01:01:43.170 some threads in a warp are active and some aren't; we want to pack the active threads together. And to do this, we're going to have a more complicated algorithm. 458 01:01:44.340 --> 01:01:52.170 What it's going to do is run faster — it's a trade-off. Okay: 459 01:01:52.170 --> 01:01:56.400 you know, more code, more thinking, but faster execution. So — 460 01:01:56.400 --> 01:02:00.210 and I'll show you what's going to be happening here. 461 01:02:01.889 --> 01:02:13.139 We're going to pack the partial sums into the front of the array and keep the active threads consecutive, 462 01:02:13.139 --> 01:02:17.670 and shift the index usage — that's the same thing. 463 01:02:17.670 --> 01:02:22.769 "Improves divergence behavior" means we want the threads that are running the same code to be 464 01:02:22.769 --> 01:02:27.960 in the same warp — reordering computations. Okay. 465 01:02:30.539 --> 01:02:39.510 So, it shows what's happening here: we've got this array of 8 elements at the top — 3, 1, 7, 0, 4, 1, 6, 3. 466 01:02:39.510 --> 01:02:43.860 Okay, so in the 1st stage we are adding 467 01:02:43.860 --> 01:02:48.900 pairs of elements. But in what I showed you before, in the last slide set 9.2, 468 01:02:48.900 --> 01:02:57.659 thread 0 added, like, element 0 and element 1, thread 1 added element 2 and element 3, and so on. Here, 469 01:02:57.659 --> 01:03:00.869 thread 0 adds element 0 and element 4.
470 01:03:02.909 --> 01:03:07.739 So they're not adjacent to each other — that is true. 471 01:03:07.739 --> 01:03:11.639 However, thread 1 adds element 1 and element 5, 472 01:03:11.639 --> 01:03:21.090 thread 2 adds elements 2 and 6, and thread 3 adds elements 3 and 7. So the 2 elements added by thread 0 are not adjacent, 473 01:03:21.090 --> 01:03:34.170 but the elements added by thread 0 and added by thread 1 are adjacent. So the threads are going sequentially through 1 run of elements right here and sequentially through another run right there. 474 01:03:34.170 --> 01:03:48.539 So that plays nicely with the cache manager, and the outputs now are adjacent. So we went in with 8 elements; we come out with 4 subtotals, and they're adjacent at the start of the array. 475 01:03:49.619 --> 01:03:54.389 We could, arguably, be overwriting the original array, or this could be a new array, whichever. 476 01:03:55.769 --> 01:04:04.079 You know, at some point you might treat the 1st level as being read-only global memory, and then the next level is in shared memory, perhaps. 477 01:04:04.079 --> 01:04:07.139 You know, you do these sorts of trades — that would actually be a nice thing to do. 478 01:04:07.139 --> 01:04:12.809 So — scratch that — you're reading 479 01:04:12.809 --> 01:04:19.559 the original array, reading it in a systematic way, and then writing to the 480 01:04:19.559 --> 01:04:23.820 shared memory. So the shared memory only has to be big enough for half the original array. 481 01:04:23.820 --> 01:04:31.949 Okay, so we did this different thing, where the threads are adding non-adjacent elements, 482 01:04:31.949 --> 01:04:35.309 and their subtotals are packed together. 483 01:04:35.309 --> 01:04:49.530 Okay, but now, after the 1st step, there are 4 threads active — but they're 4 consecutive threads, and they do the same thing. So you see, the data that's used is always packed to the front of the array. So —
484 01:04:49.530 --> 01:04:57.630 you know, the tail of the array you might free, or use for something else if you want. And the active threads are all packed together. 485 01:04:57.630 --> 01:05:02.099 So, what that means is, initially, if you had — 486 01:05:02.099 --> 01:05:09.210 suppose we had 1024 threads. Okay, so we're summing up 2048 elements of the array. 487 01:05:09.210 --> 01:05:17.820 Well, after the 1st step, only 512 threads are active. The last 512 threads — 488 01:05:17.820 --> 01:05:21.869 numbers 512 to 1023 — are no longer needed. So, 489 01:05:21.869 --> 01:05:27.690 you know, those threads can end. And then after the next step, only 256 threads are used, 490 01:05:27.690 --> 01:05:42.659 and so there are fewer and fewer active threads. So we don't have a lot of threads in the warps sitting idle — the active threads are all packed together. So we're using the thread resources more efficiently, because the other threads — 491 01:05:42.659 --> 01:05:54.929 they've finished; you know, they're not continuing to run, you might say. So we're packing the active data together, and we're packing the active threads together, 492 01:05:54.929 --> 01:06:04.139 and packing is good. And how we would do it — you can look at the code yourself; it's not interesting. 493 01:06:04.139 --> 01:06:11.219 So — again, divergence means some of the threads of the warp are active and some are passive. So 494 01:06:11.219 --> 01:06:17.190 if we have 1024 threads, until we get down to only 32 active threads, 495 01:06:17.190 --> 01:06:21.630 um, there's no divergence — the active threads form warps of all-active threads. 496 01:06:21.630 --> 01:06:28.559 Everything's nice powers of 2 also, which helps. And the final 5 steps are, um, 497 01:06:28.559 --> 01:06:31.769 divergent. 498 01:06:31.769 --> 01:06:35.579 The final 5 steps you almost might want to do sequentially.
499 01:06:35.579 --> 01:06:42.329 There's a programming paradigm that I have, which is: if you've got a lot of data, you start munching it down, 500 01:06:42.329 --> 01:06:45.449 maybe in parallel, and at some point you switch modes. 501 01:06:45.449 --> 01:06:52.380 And so I could easily see here that you do the 1st 5 steps in parallel — well, for the 1st step you've got a 1000 threads 502 01:06:52.380 --> 01:06:56.400 in parallel — and for the last 5 steps, you know, you're talking 503 01:06:56.400 --> 01:07:00.750 32 threads. You might almost say, what the hell, just add them 504 01:07:00.750 --> 01:07:04.500 with 1 sequential process. That's what I would do here. 505 01:07:04.500 --> 01:07:09.900 So, I would shift modes: you start parallel and you go to sequential. 506 01:07:12.090 --> 01:07:24.420 It's the same concept — if I were worried about memory, say, implementing some binary search tree with pointers, I would implement the top half of the tree with pointers, let's say, and the bottom half of the tree packed. 507 01:07:25.440 --> 01:07:32.280 And I would have the effect of something that I could update fairly efficiently — 508 01:07:32.280 --> 01:07:35.579 which requires pointers — and something which doesn't 509 01:07:35.579 --> 01:07:38.940 double the space that the tree requires for pointers. And — 510 01:07:38.940 --> 01:07:44.039 so, you switch modes in the middle. Powerful paradigm; you don't see it described much. 511 01:07:44.039 --> 01:07:49.050 Okay, that was — 512 01:07:49.050 --> 01:07:52.199 what was it, 9.3? 513 01:07:53.219 --> 01:07:57.150 A chance to ask questions. 514 01:08:04.920 --> 01:08:07.920 Oh, okay, good. 515 01:08:07.920 --> 01:08:14.159 Okay, what happens here: we're seeing an extension of reduction called 516 01:08:14.425 --> 01:08:28.975 a scan operation. A scan operation is a powerful thing — it's a powerful parallel programming paradigm. It's only useful in parallel programming; it's actually not useful in sequential programming. You can use it in sequential programming —
517 01:08:29.005 --> 01:08:30.864 what I mean by not 518 01:08:31.170 --> 01:08:36.420 useful is it doesn't give you a performance gain; for parallel programming, it gives a performance gain. 519 01:08:36.420 --> 01:08:39.630 What it is, is: 520 01:08:39.630 --> 01:08:47.520 the reduction reduces the array to one sum; the scan, the parallel scan, produces an array of subtotals. 521 01:08:48.810 --> 01:09:03.810 Okay, so instead of adding all the elements to get the total, you've also got a total of the first one element, another total of the first two, a total of the first three, and so on. And it's done in parallel; it takes log n time, like the reduction does. 522 01:09:03.810 --> 01:09:08.010 And, like I said, you would be surprised how useful it is. 523 01:09:08.010 --> 01:09:19.229 Okay, so "foundational," nice word. NVIDIA has got something like a... it's several years old now; some of the ideas are a little obsolete. 524 01:09:19.229 --> 01:09:27.689 But it's on the web for free now; I think they used to charge for it. 525 01:09:27.689 --> 01:09:35.699 And that link is dead, so how you would find it, how I found this: 526 01:09:35.699 --> 01:09:40.710 on the web, I Googled "339" 527 01:09:40.710 --> 01:09:45.149 and I found that it's still on the NVIDIA site; they just moved it somewhere. 528 01:09:45.149 --> 01:09:51.930 I'm not going to go to it now, because the interesting stuff is on this slide set. 529 01:09:51.930 --> 01:09:57.810 Okay, what is it? There's 530 01:09:57.810 --> 01:10:04.890 two versions, inclusive and exclusive. Here is the inclusive scan; I'll show it by example. 531 01:10:04.890 --> 01:10:08.880 Input: 3 1 7 0 4 1 6 3. 532 01:10:08.880 --> 01:10:18.720 Eight elements. The output array has eight elements, but look: the first element is 3, the second is 3 plus 1, the third is 3 plus 1 plus 7. 533 01:10:18.720 --> 01:10:22.380 So the k-th element, using index-1 534 01:10:22.380 --> 01:10:25.439 addressing, origin-1 addressing:
535 01:10:25.439 --> 01:10:30.210 the k-th output element is the sum of the first k input elements. 536 01:10:30.210 --> 01:10:34.710 So that's called an inclusive scan, 537 01:10:34.710 --> 01:10:45.420 or a prefix sum. That's what it is. What would be a quick application of it? Suppose you have a run-length encoding of 538 01:10:45.420 --> 01:10:54.630 whatever, an image. You would use something like this to decode your run-length-encoded image. So the first row there is your run lengths, 539 01:10:54.630 --> 01:10:58.920 and the second row is where each run would start in the decoded image. 540 01:10:58.920 --> 01:11:02.819 So you'd use this, you'd use the 541 01:11:02.819 --> 01:11:07.710 prefix sum, the inclusive scan, of your array of run lengths 542 01:11:07.710 --> 01:11:11.939 to compute where in the output each run will start. 543 01:11:11.939 --> 01:11:18.029 Each expanded run. This is an example of the use of the inclusive scan. 544 01:11:18.029 --> 01:11:23.550 A lot of other examples, but this is a nice one. Another example: it's used 545 01:11:23.550 --> 01:11:27.029 for bucket 546 01:11:27.029 --> 01:11:37.680 sorting. We did the bucket totaling, the counts, the frequency counts. We showed you a simple frequency-count idea two days ago, I guess. 547 01:11:37.680 --> 01:11:44.579 That simple idea assumes that the output array of counts is 548 01:11:44.579 --> 01:11:50.159 not that big. It works when you've got a small number of possible counts, for counting frequencies. 549 01:11:50.159 --> 01:11:53.189 We were doing histogramming of text, remember, and we even 550 01:11:53.189 --> 01:12:02.310 batched up the letters, so we didn't have that many output possibilities. Well, suppose you had a million or a couple million possible 551 01:12:02.310 --> 01:12:12.630 keys, and you're doing a histogram count on that. What we saw two days ago has problems: the output histogram matrix is too big to store in fast memory.
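The run-length-decoding use of the scan is easy to see in a few lines. This sketch (my own illustration, not the slide code) computes the inclusive scan sequentially, then uses it to find where each run starts in the decoded output:

```python
# Inclusive scan: out[k] is the sum of the first k+1 inputs
# (origin-1: the k-th output is the sum of the first k inputs).
def inclusive_scan(a):
    out, total = [], 0
    for x in a:
        total += x
        out.append(total)
    return out

# Run-length decoding application: scanning the run lengths tells you
# where each run begins. The start positions are the exclusive-scan
# view: shift the inclusive result right by one, with 0 first.
runs = [3, 1, 7, 0, 4, 1, 6, 3]      # example input from the slides
sums = inclusive_scan(runs)
starts = [0] + sums[:-1]             # offset where each run begins
```

Sequentially this is trivial, which is the lecture's point: the scan only becomes interesting when you want those subtotals computed in parallel in log n steps.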
552 01:12:12.630 --> 01:12:26.069 So we use other tricks, tricks, techniques, whatever you want to call them, paradigms, and they involve the inclusive scan. In any case: new concept, inclusive scan. You see what it is. 553 01:12:26.069 --> 01:12:35.640 It takes an array of run lengths and outputs an array of, you could call it a dope vector, that's a buzzword, of elements showing where each 554 01:12:35.640 --> 01:12:39.569 run would start in the output. Okay, that's what it is. 555 01:12:39.569 --> 01:12:44.489 How do you do it fast? Sequentially it's obvious; we want to do it fast in parallel. 556 01:12:46.170 --> 01:12:49.770 Okay, submarine sandwich example. 557 01:12:51.000 --> 01:12:55.859 How to calculate it. Cute example. 558 01:12:57.539 --> 01:13:04.829 Lecturers struggle to find examples, okay, so don't knock it. 559 01:13:04.829 --> 01:13:08.609 If you can find better examples, go ahead. 560 01:13:10.439 --> 01:13:14.159 Yeah. 561 01:13:14.159 --> 01:13:18.119 Oh, it's doing fast string searching, 562 01:13:18.119 --> 01:13:22.859 counting the occurrences, comparing strings, all this. So you can read this. 563 01:13:22.859 --> 01:13:34.529 I showed you the obvious one; it could also be used for run-length encoding, it goes both ways. It has a surprising number of applications. We'll see them in a week or so. 564 01:13:34.529 --> 01:13:37.949 Typical applications of the scan. 565 01:13:39.420 --> 01:13:43.140 Yeah, they're getting silly. That's the definition. 566 01:13:43.140 --> 01:13:48.750 Obviously you can find it sequentially, but we're in a parallel course, of course. 567 01:13:48.750 --> 01:13:55.260 The sequential version is work-efficient; it takes 568 01:13:55.260 --> 01:14:01.350 n additions for n elements. 569 01:14:03.090 --> 01:14:12.390 That does not need a picture. The naive inclusive scan: 570 01:14:12.390 --> 01:14:15.960 the i-th thread calculates y_i. So, by 571 01:14:15.960 --> 01:14:23.189 itself this is no faster than linear, worse than linear actually, because
572 01:14:23.189 --> 01:14:26.220 threads running on the GPU are slower than on the Intel CPU. 573 01:14:26.220 --> 01:14:34.470 Oh, by the way, I strongly disagree with this point that appears here, that parallel programming is easy if you don't care about performance. It's still hard. 574 01:14:34.470 --> 01:14:39.300 You've still got locking and sequential-serialization issues. 575 01:14:39.300 --> 01:14:42.750 Okay, that was 10.1. 576 01:14:46.020 --> 01:14:51.420 I showed you the problem. 577 01:14:55.380 --> 01:14:59.579 How to do it: so this is how we could do it in parallel. 578 01:15:01.319 --> 01:15:07.590 Yeah, we're starting to cut the time down to something like log n. 579 01:15:07.590 --> 01:15:13.170 Here's our initial array. It's called XY because we're updating in place. 580 01:15:13.170 --> 01:15:18.270 So we add each element to its neighbor to the right. 581 01:15:18.270 --> 01:15:21.930 Sorry, we add each element to its neighbor to the left. 582 01:15:21.930 --> 01:15:34.109 Okay, so, for example, 1 got replaced by 4 plus 1. So the input array was our n elements that we want to scan; in the output array, each element has had its neighbor to the left added to it. 583 01:15:35.460 --> 01:15:39.239 Okay, so you have an array of pairwise sums here. 584 01:15:41.310 --> 01:15:45.390 And then we synchronize. So we do this 585 01:15:45.390 --> 01:15:49.140 all in parallel, and now we sync the threads. 586 01:15:49.140 --> 01:15:54.810 Next step, we do it again: we add each element to the element 587 01:15:54.810 --> 01:16:01.890 to the, um... 588 01:16:03.090 --> 01:16:12.390 In the first step, we added each element to the element adjacent to the left. In the next step, we add each element to the element 589 01:16:12.390 --> 01:16:15.720 two to the left. So 9 gets added to 5, 590 01:16:15.720 --> 01:16:25.800 7 gets added to 4 to make 11, 5 gets added to 7 to make 12, 4 gets added to 8 to make 12, 7 gets added to 4 to make 11,
591 01:16:25.800 --> 01:16:31.770 8 gets added to 3 to make 11, and 4 doesn't change because two to the left is off the start of the array, right? 592 01:16:31.770 --> 01:16:36.119 Boundary test, edge test. Okay. 593 01:16:37.140 --> 01:16:47.189 What we have now: here are sums of four elements. The 14 is a sum of the last four elements, 3 plus 6 plus 1 plus 4. 594 01:16:48.750 --> 01:16:55.380 So after the second step, the stride-2 step, every element is the sum of four elements, 595 01:16:55.380 --> 01:17:01.649 except the first two and, um, 596 01:17:01.649 --> 01:17:06.359 the first three, in fact. The 11 here is a sum of only the first three elements. 597 01:17:06.359 --> 01:17:10.199 You might imagine that it's padded to the left with zeros. 598 01:17:10.199 --> 01:17:13.739 Getting ahead of me. Now we do this k times. 599 01:17:14.760 --> 01:17:22.949 Now every element is the sum of eight elements. So what we did: each element here was added to the element four to the left. 14 got added to 600 01:17:22.949 --> 01:17:32.399 11, making 25, and now the output elements are the sum of eight elements, except for the first batch, 601 01:17:32.399 --> 01:17:36.449 where they'd be adding stuff going off the start of the array. So 602 01:17:36.449 --> 01:17:42.510 each element here is the sum of eight elements, except 603 01:17:44.189 --> 01:17:47.579 where there aren't eight elements to the left, 604 01:17:50.640 --> 01:17:56.340 at the third level, I think. Okay, and I think we've done it. 605 01:17:56.340 --> 01:18:02.460 This is it. We took three steps 606 01:18:02.460 --> 01:18:07.109 and we got the parallel scan of the original array. So it took three parallel steps 607 01:18:07.109 --> 01:18:17.460 and we did it. Now, we're adding non-adjacent elements, so we might want to talk about that later. 608 01:18:17.460 --> 01:18:21.180 But we've parallelized the scan. Nice. Okay. 609 01:18:21.180 --> 01:18:29.010 And we've got some thread divergence; maybe it can't be helped. Well, the active threads are still adjacent, they're at the end of the array. So.
610 01:18:29.010 --> 01:18:32.939 So, in log n time, we did the parallel scan. Cool. 611 01:18:32.939 --> 01:18:39.090 Some dependencies, um. 612 01:18:39.090 --> 01:18:42.539 We're overwriting in place. 613 01:18:42.539 --> 01:18:49.920 So if I just jump back here: we have to do a sync between each step, 614 01:18:49.920 --> 01:19:00.539 because we're overwriting the array in place, and we've got to make certain that no one wants the old value before we replace it with the new value. So you have to sync, because there are no guarantees about ordering. 615 01:19:00.539 --> 01:19:03.600 Okay, dependencies. 616 01:19:03.600 --> 01:19:06.659 Everyone does reading and writing. 617 01:19:06.659 --> 01:19:10.079 Consider doing it in 618 01:19:10.079 --> 01:19:16.470 shared memory. And this is how it would be implemented; you can read the code on your own. 619 01:19:18.090 --> 01:19:22.050 It does log n parallel iterations. Nice. 620 01:19:22.050 --> 01:19:35.550 But there is an issue here: work inefficiency. The threads at the start of the array, maybe they're still doing stuff which is useless. 621 01:19:35.550 --> 01:19:41.640 So we've got some issues; we're not using the threads efficiently. That's not work-efficient. 622 01:19:41.640 --> 01:19:50.100 So we may be saturating things; it may actually be running slower. Let me scroll back to what's going on here. 623 01:19:50.100 --> 01:20:02.939 These first couple of threads here are not doing anything, because they don't have elements to stride back to. Each thread is adding its element to the element four to the left; well, if we don't have four elements to the left, the thread's not doing anything. 624 01:20:02.939 --> 01:20:06.689 And so it's going to be idle. But maybe we want to be more 625 01:20:06.689 --> 01:20:11.279 explicit about being idle; I guess that's what they're talking about. 626 01:20:11.279 --> 01:20:15.779 Um, not work-efficient and so on.
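The stride-doubling walkthrough above (the naive, work-inefficient parallel scan) can be simulated in Python. In this sketch of mine, each pass builds a fresh array from the old one, which stands in for the barrier the lecture insists on: every thread must read the old values before anyone overwrites them:

```python
# Naive parallel inclusive scan, simulated sequentially.
# Each "parallel step" reads the old array and writes a new one;
# building a new list per step plays the role of the sync barrier
# between steps when the real kernel updates the array in place.
def naive_scan(a):
    stride = 1
    while stride < len(a):
        # every position adds the element `stride` to its left, if any
        a = [a[i] + (a[i - stride] if i >= stride else 0)
             for i in range(len(a))]
        stride *= 2            # strides 1, 2, 4, ...: log2(n) steps
    return a

result = naive_scan([3, 1, 7, 0, 4, 1, 6, 3])   # inclusive prefix sums
```

With 8 elements this takes 3 parallel steps, but note the work inefficiency the lecture points out: every step still launches one addition per element, even for the positions near the start that have nothing left to add.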
627 01:20:15.779 --> 01:20:23.699 That was that. So, what happened today? First, we spent time on 628 01:20:23.699 --> 01:20:27.060 virtual machines at different levels, 629 01:20:27.060 --> 01:20:31.439 ending up with Docker and so on, because that's the commercially valuable one. 630 01:20:31.439 --> 01:20:42.899 You might also think about how you'd implement this stuff. You saw different levels of virtual machines; you might be having it in the back of your mind: what hardware resources do you want, to make this stuff fast? 631 01:20:42.899 --> 01:20:52.949 Hooks into the hardware. I mentioned things such as: harmless instructions have to be easily distinguishable from harmful instructions. 632 01:20:52.949 --> 01:21:00.689 I think in the IBM System/360, it's actually determined by a few bits of the opcode or something. 633 01:21:00.689 --> 01:21:05.699 So for the powerful opcodes, you just look at the bits of the opcode and you can tell. 634 01:21:06.744 --> 01:21:18.354 And they're trapped, or even replaced with trap instructions, or the hardware traps. I'm sorry, what happens is the hardware traps the powerful instructions if you've set your virtual bit. 635 01:21:18.564 --> 01:21:27.864 And I think what happens is the instruction turns into a supervisor call, and it's done in the hardware. So you don't have to modify the code, and there's no overhead in it 636 01:21:28.140 --> 01:21:33.029 until you start executing the little protected subroutine; but there's no overhead in the trap. 637 01:21:33.029 --> 01:21:36.720 And then we saw some parallel tools 638 01:21:36.720 --> 01:21:39.779 with virtual machines, seeing some nice 639 01:21:39.779 --> 01:21:45.630 parallel tools. The stuff I'm showing you here is not specific to NVIDIA; 640 01:21:45.630 --> 01:21:49.109 you pick your parallel architecture.
641 01:21:49.109 --> 01:21:59.399 These parallel reductions are a powerful, foundational paradigm for any parallel architecture. So this part of the course is 642 01:21:59.399 --> 01:22:04.409 reaching beyond NVIDIA; it's part of the parallel-paradigms theme. 643 01:22:08.880 --> 01:22:15.989 Yeah, okay. So you're asking, yeah, hybrid, or...? Let me go back to this specific thing here. 644 01:22:15.989 --> 01:22:24.149 Well, we're going to see in the next slides that it's packing things together in a way that the other thread warps will just, 645 01:22:24.149 --> 01:22:28.739 they'll terminate, and the resources are freed. So. 646 01:22:30.449 --> 01:22:36.270 Yeah, well, the thread stuff as we reorganize; the slide I'm showing you here is somewhat simplified. 647 01:22:36.270 --> 01:22:40.800 And the idle threads, they'll finish; they just run off the bottom of the kernel code. 648 01:22:40.800 --> 01:22:45.840 They finish, and if all the threads in the warp finish, 649 01:22:45.840 --> 01:22:49.319 that warp ends, and now those resources are 650 01:22:49.319 --> 01:22:53.369 available. What resources am I talking about? 651 01:22:53.369 --> 01:22:57.090 This concept of a CUDA core that I've been showing you is 652 01:22:57.090 --> 01:23:00.899 a little simplified from reality. 653 01:23:00.899 --> 01:23:05.789 There is not one specific CUDA core; there are sets of 654 01:23:05.789 --> 01:23:12.899 functional units that will internally execute code of different types, and 655 01:23:12.899 --> 01:23:18.300 you need them to decode an instruction and 656 01:23:18.300 --> 01:23:30.689 have a thread executing it. So those instruction units are now free for other threads. The registers that a thread would use, that are private to the thread, they come from a block of registers that the whole, 657 01:23:30.689 --> 01:23:40.229 the whole block shares. So those are now free.
So, yeah, now these resources are available for other thread warps to use. Yes. 658 01:23:41.729 --> 01:23:51.750 Other stuff... so we're running late now. So see you Thursday; head off to lunch or your next class, and enjoy the week. 659 01:23:51.750 --> 01:23:57.539 And I'm enjoying the sunshine; my solar panels are generating lots of power now. 660 01:23:58.949 --> 01:24:08.039 So, today we ran up through section 10.2. I'll put a note of that on the blog. 661 01:24:47.970 --> 01:24:53.909 Silence. 662 01:24:55.229 --> 01:24:59.159 Silence. 663 01:25:01.560 --> 01:25:06.090 Silence.