Okay, good morning, good afternoon, people. Let's see — my universal question: can anyone hear me? Okay. Thank you.

Okay, so this is Parallel Computing, class 5 I guess, December 8, 2021. I say that so that if the video gets detached from its title, we know what it is. So, what's happening today: I've got some utility information about connecting to remote computers and file systems and so on, working with operating systems, and then we'll finish off OpenMP and talk about OpenACC. We'll be moving on after that into NVIDIA, and I've got a homework 3 as a chance for you to program with this.

So, connecting to other computers. When you're connecting to parallel, you can just type in your password, but ssh is actually a public-key system, and it's actually better to create key pairs. If you create a key pair, then you do not need to type passwords to connect in. For example, if I show you — I bring up a window here, make it bigger — if I want to connect to parallel, I just say ssh parallel. You see, I'm in. And you'll notice I did not have to type a password, because on parallel I created a public/private key pair and I copied the public key over from my local laptop. I'll let you read the manuals and so on, and if you're on a Windows host you'd use PuTTY or whatever. And, if you look at the top left corner up here, by default it will also forward X connections and so on. Okay, so let's close that now. Are there any questions?

So you can create key pairs, and I've got some information on that here. Now, another advantage this gives you is that you can mount remote file systems back on your local computer — you can access remote files. Let me demo that. I go to the Files application here, go down to Other Locations, and I say Connect to Server.
Maybe I want to do this — let's see, what am I doing here? How did I type it? sftp:// — let's say. Okay, so what I've now done here: this has now mapped parallel's file system over to my local laptop. And if I don't like using a browser, there's another way I can get at it. I'm on a Linux host here, so it's part of my local file system and I can use all my command-line tools. It's under /run/user — right there — so I can connect to it, and again there is parallel's file system as part of the namespace of my local computer, so I can go in here and just access it like a local file.

A couple of points: don't try to do really cute things — simultaneous reading and writing and so on; anything complicated and fancy might not work that well. It's a FUSE thing: Filesystem in Userspace. And you do not get more rights than you would normally have. If I go to the root of the file system here and say, touch foo, like that — you see, it's going to complain, because I don't get root rights. So this is a nice way to access the remote file system locally, use emacs on it, and so on.

So, coming back here, other things you can do: you can also use ssh to run commands on the remote machine. For example, let me go back to it — this is just ls. Well, here's another cool thing you get on a Linux file system: /dev/shm is a file system that lives in memory, in DRAM, core main memory, and its size is one half the physical amount of main memory. So this machine here has 128 gigabytes of memory, and this temporary file system is potentially 64 gigabytes, and since it's actually in DRAM it is really, really fast. So if you don't want to worry about latency and everything with your disk, put your files in /dev/shm — shm for shared memory — and you won't have any I/O time. That's just one little hint there.

In any case, other things you can do: you can run a single command on the remote machine, so this will run a single command on parallel — for example, if you just want to run one quick command. Another thing I can do is copy single files back and forth and so on. So I have a file foo here; okay, I want to copy that to, let's say, foo2 on parallel. And now this has just copied foo over to parallel, and we can look — there it is. I could put a sequence of commands after it, and even run something interactive if you wanted, to some extent.
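As a tiny illustration of that /dev/shm tip — my own sketch, with a made-up file name, not something from the course — a program can use /dev/shm as scratch space like any other directory, just a much faster one:

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Hypothetical scratch file; /dev/shm lives in RAM, so reads and writes here
    // avoid disk latency entirely.  Anything stored here vanishes on reboot.
    const std::string path = "/dev/shm/scratch_demo.txt";

    std::ofstream out(path);
    for (int i = 0; i < 1000; ++i) out << i << '\n';
    out.close();

    std::ifstream in(path);
    int first;
    in >> first;
    std::cout << "first value read back: " << first << '\n';   // prints 0
}
```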
Okay, so what I'm showing you are tools to work with remote computers, to enlarge your tool set. You can copy files, copy whole directories. The cool thing is that it even does file-name completion on remote files, single commands, and so on. So that's one thing.

Another thing about file systems: on parallel — here's my parallel window — okay, if I go here. This file system has some features that might be useful for you. One of them is that it transparently compresses, so you have no need to run gzip and so on on your files, because you store them on this file system — and that's where your user directories are — and your files are automatically compressed. That just makes life easier.

Another thing it has — this is under the ZFS pool, you might call it — another thing it has, if it's working; let me see if it is... yeah. Okay, so it is doing automatic snapshots: every 15 minutes, then every hour — there are the hourly ones — then every day, then every week, then every month, and it deletes the old ones. So if you delete a file, it may be in one of these snapshots, and you can just go back and get it.

Now, the way ZFS is able to do this is that it has a copy-on-write philosophy — so this is getting into an operating-systems course, a file-systems course. So, nice things about ZFS: one, it compresses; two, there's the snapshot thing. These snapshots are atomic — it snapshots the whole file system at the same instant — but it just makes a note, so the snapshot's cost is basically free. If you overwrite a file, then it makes a new copy, but as long as a file was not changed it doesn't make a separate copy, so it's not using enormous quantities of disk space. Which is nice.

Now, just to give you an example, let me look at a snapshot — oh, I don't know, let me look at a frequent one. Okay, let me give you an example; let me just look at this. Okay, so this is what the snapshots look like — these names are in universal time. Okay, so let me look at the snapshot from a while back — oh, say, December...
And we don't have a snapshot from back then any more. So, if a file existed long enough to be captured in a snapshot, and long enough that that snapshot didn't get deleted — like I said, the frequent ones are only stored for a day, because you've got the hourly ones — then you can get it back. In any case, this is something I've used once or twice when I accidentally deleted something. Okay, so that's nice stuff with ZFS. Any questions?

That's another operating-system thing — I don't know if our operating-systems courses teach current file systems. ZFS also has some other nice features: you can clone a whole file system, and now you've got two versions of it, and again it doesn't duplicate the space until you start changing stuff. So you've got a tree-structured thing, two versions of the file system which you can now do whatever you want with. They are separate file systems, but if a file is the same in both clones, only one copy of it is stored.

Okay, another operating-system thing which is relevant is stacks. I think most of you are aware of this: you've got this push-down stack for local variables on your computer. You call a function — subroutine, whatever; the names are synonymous — and it puts a new stack frame on the stack, and local variables are allocated on the stack. Then when you return from the function, the stack is unwound and all the local variables are freed automatically. This is separate from the heap, which is a global thing where you explicitly allocate stuff and explicitly free it — malloc and free, or new with constructors and delete with destructors, and so on.

Now, with the stack, you might wonder what happens when you've got threads, like in OpenMP. The answer is that every thread has its own independent stack, created when the thread starts and destroyed when the thread is finished. By default they're very small, but you can make them bigger. So, if I come over here — let me make things big — if I do ulimit, the stack size is the -s option. Here I've made it quite big, actually, but by default — if I open a new tab, that's on parallel — it's very small, 8 megabytes. And if you run a program which is using the stack, and you try to put more local variables than that on the stack, the program will crash. However, you can increase it: that would be ulimit -s with something bigger. Let's say we do that.
Now the stack size is a reasonable size. So if you're going to be running programs that use the stack, you want to make the stack size bigger.

Now, you might be worrying: this is parallel, with 56 hyper-threads, and if you give each hyper-thread a few gigabytes of stack, you're really wasting a lot of memory. Well, no, you're not, because on Linux a page of virtual memory is initially just an entry in a table, and the physical memory is not actually allocated until you touch it. So if you make a humongous stack, it doesn't matter until you actually touch it; there's no problem with having big stacks, because of the way the virtual-memory manager works. This also solves a problem they do in some operating-systems courses: if you have one stack in your program, it grows up from the bottom; if you have a second stack, it grows down from the top of your available memory; and what if you want more stacks? On a paged memory-management system, it doesn't matter.

There's also a program which will show this... I have not updated it, okay. I've got it with the OpenMP material, I think — oh, okay, good, it's the stack-size program; I'll copy it over. What the program does — it's also using a couple of nice things here; let me show you locales. So, in C++, the locale for a program sets characteristics such as how you print numbers. Here we separate every three digits with a comma; in Europe they might separate every three digits with a period and mark the decimal point with a comma. It says how you print your numbers, and so on. What I've done here is set this locale, so that when I print big numbers it will put the commas in after every three digits. That's fun — I just set the locale.

Now, for the stack size: you can get resource limits with getrlimit, and it feeds into a structure, rlim, whose type is struct rlimit, up here. What we have here: rlim has these fields, such as rlim_cur, which is the current limit on the resource, and rlim_max, which is the maximum. So here I go in and just double it — getrlimit gets the size, setrlimit sets the resource — and it sets it in here, and then I go down here...
...and I try to set it, and then I print the new value down here. So, see — if I run the program, you see initially the current limit was 1 gigabyte, and the maximum was more than you could possibly want to use — you see the advantage of putting a comma every three digits — and after doubling, it went from 1 gigabyte up to 2. So it really is reading the current value.

I could also make it really small, something like that. Now I run the stack-size program: you see, the 100 here was in 1 K units, so it was initially 100 KB, and now I've made it bigger. If I run the program, now it doubled, and so on. Let me look at the program one more time — oh, what happened here is I made the stack size so small the program won't even run, so maybe I'd better make it a little bigger. Good. And at the end here there's a test: if things fail and I try to access this local variable on the stack, then I'll get a segfault. So the segfault is the message that your local stack was too small.

Okay, so these are programming tools that will help you, and this concept of a large local stack for each thread — I think it's a useful programming tool that is under-used. A workman needs his toolkit, his toolbox, and these are tools for the toolbox of the Linux programmer: multiple stacks, large stacks, a local file system in DRAM. Very powerful tools that are under-used. Okay, if you have favorite tools yourself, mention them to me.
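Here is a condensed sketch along the lines of that stack-size demo — my reconstruction for reference, not the actual course file; the doubling and the messages are just illustrative, and it assumes the soft limit is finite:

```cpp
#include <iostream>
#include <locale>
#include <sys/resource.h>   // getrlimit, setrlimit, RLIMIT_STACK

int main() {
    // Locale trick: print big numbers with thousands separators (per the system locale).
    std::cout.imbue(std::locale(""));

    rlimit rlim;                                   // fields: rlim_cur (current), rlim_max (maximum)
    getrlimit(RLIMIT_STACK, &rlim);
    std::cout << "current stack limit: " << rlim.rlim_cur
              << "  maximum: " << rlim.rlim_max << '\n';

    rlim.rlim_cur *= 2;                            // try to double the soft limit
    if (setrlimit(RLIMIT_STACK, &rlim) != 0)       // fails if it would exceed rlim_max
        std::cout << "setrlimit failed\n";

    getrlimit(RLIMIT_STACK, &rlim);
    std::cout << "new stack limit: " << rlim.rlim_cur << '\n';
}
```

Plain g++ compiles this; no special flags are needed.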
Just a sec — I'm ignoring the phone, because the chat is there when it's available for people to talk.

Okay, now to OpenMP specifically. OpenMP — just to remind you, we have the website here, with lots of information, a lot of it free. Stuff is being added gradually; at the moment OpenMP is weak at handling a GPU back end, so I'm giving the demonstrations here on the multi-core side. Also, some of the best documentation is obsolete. So I'm showing you various examples — Lawrence Livermore National Labs has some material, also more information; I'm not going to cover all of it, I'll skip parts, and it's available if you want to read it. Well, let me go through and look at some of these. So, there are a lot of directives here, for things like defining how data gets copied into the parallel threads — what data is shared, what data is private, what gets copied in. Reduction is important; I'll hit reduction in a minute.

And atomic — okay, atomic directives, again, mark a section of code which will be done by only one thread at a time, so you don't get these problems with two threads trying to write the same data at the same time. The point is to force things to get serialized. Barriers have the obvious meaning. An atomic serializes the next simple instruction; a critical serializes an arbitrarily big block of code, but the overhead to start a critical block is much larger. The other directives are fairly obvious.

Okay, so I'm mentioning the problem here about serialization — we talked about it last time. I'm spending more time on this because this is the curse of parallel computing: if you've got two threads doing a load and a store, they could go in any order, a different order every time, and give a different answer every time — or, of course, they may go in the same order every time and give you a consistently wrong answer. So, I mentioned critical and atomic; I mentioned this sort of thing before.

Now, how do you compile your programs? Well, if I go back to here — I have things in the makefile; I assume everyone's aware of make — I say use the -fopenmp flag. You have to add a flag to use OpenMP, and the name of the flag depends on the compiler, so, grumble. Okay, and I mentioned before, I believe, the real-number properties.
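As a quick illustration of that serialization point — my own minimal sketch, not one of the course examples — here is the classic shared-counter race and the two OpenMP fixes just mentioned:

```cpp
#include <cstdio>

int main() {
    long total = 0;
    #pragma omp parallel for
    for (int i = 0; i < 1000000; ++i) {
        // total += 1;              // RACE: loads and stores interleave, total comes out wrong

        #pragma omp atomic          // serializes just this one simple update -- cheap
        total += 1;

        // #pragma omp critical     // serializes an arbitrary block -- correct, but higher overhead
        // { total += 1; }
    }
    std::printf("total = %ld (expect 1000000)\n", total);
}
```

Compile it with g++ -fopenmp (the flag name differs on other compilers, as noted above).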
Okay, tasks — I showed you these quickly last time. I love this; it's a beautiful thing — this is exactly how you do Fibonacci. Let me just review tasks, and I'll show you some different task examples.

The concept here is that you can explicitly create parallel tasks, and this directive starts one. What it does, more technically, is put the task on a queue, because you may have more tasks than you have threads. I've got 56 hyper-threads here, but you're not restricted to 56 tasks in your program — you can have as many as you want, and there's an argument for having more than 56, actually, because then there will always be some queued up ready to run if all the current ones are blocked, say waiting on I/O or something.

So, in any case, you can create explicit tasks, and this creates an explicit task. It's doing Fibonacci; the tasks go into the queue and run in parallel, as many at a time as the computer can run. My laptop is dual 6-core, so it could run the same program as parallel, but parallel is dual 14-core and my laptop is only dual 6-core, so there would be more tasks waiting in the queue on my laptop. In any case, this recursively starts two parallel tasks, each computing a Fibonacci number recursively; they run and then return. Back in the caller we're going to wait — and we also have an atomic here: we want to total up the number of tasks, so that's done atomically; incrementing a variable is one of the legal operations for an atomic. And then at the end we have a taskwait, which waits for the tasks that were fired up here. So inside the else we've got a task, and a task, and a taskwait — wait until those two tasks have finished — and those tasks are also firing up other tasks recursively. So you can do this, but you want to be reasonable about it, because there is an overhead with all of them.
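For reference, a minimal sketch of that pattern — my paraphrase of the idea rather than the course's Fibonacci file — with task, taskwait, and an atomic task counter:

```cpp
#include <cstdio>

long ntasks = 0;                       // how many tasks we created

long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)         // child task computes fib(n-1)
    a = fib(n - 1);
    #pragma omp task shared(b)         // child task computes fib(n-2)
    b = fib(n - 2);
    #pragma omp atomic                 // count the two tasks we just queued
    ntasks += 2;
    #pragma omp taskwait               // wait for both children before summing
    return a + b;
}

int main() {
    long result;
    #pragma omp parallel               // create the thread team
    #pragma omp single                 // only one thread starts the recursion
    result = fib(25);
    std::printf("fib(25) = %ld using %ld tasks\n", result, ntasks);
}
```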
Let me show you another task program. A couple of things in this program I want to show you first. I've got some really cool macros up here. What these macros do: the print macro, given an argument, will print the argument as a string, and then it will evaluate it and print the value, with a comma and a newline. So if I look down here — this will print, literally, the text of the get-threads call, and then evaluate it and print the result. I wrote this to help me debug programs; I think it's a great concept. And what we also have here are some control sequences for the terminal that cause it to change color.

Okay, so what's happening here? I'm not doing it recursively this time. I'm starting a lot of parallel threads here, and what this will do is run the contents of that block in parallel on every available thread — or however many threads the program is configured for. So this will run that block 56 times in parallel, which might not be what you want, but that's what it will do. So this block here will run on each of the 56 threads, but the thread number will be different for each thread. Then the critical says: do this on only one thread at a time. The barrier waits until all the threads are done — so I was wrong, this parallel block extends down as far as here, my mistake — and the barrier makes sure everything is finished. Then master says: run something on only the master thread. And then we run a pile of tasks in parallel.

Okay, so what we get in the output up here is a mess, because this part was not serialized. What we got is everything saying "starting parallel" — the max number of threads was 56, so "starting parallel" gets written 56 times — and then these are the thread numbers and so on; a big mess, which shows things have to be serialized. And getting the thread number — this is the concept: you see it prints the expression, in red, and then evaluates it and prints the value. Okay, so that's another example: we start a lot of tasks, they run in parallel, and then they all finish. There are other ways to do things in parallel.

Let me show you some other stuff here. This next one just shows examples of the various things that you can read in OpenMP: the number of threads, whether you're in a parallel block at the time, and so on, and the wall-clock time — it just shows examples of getting a lot of them. Oops, let me come back here — in the main program, I'm doing the block in parallel, but what this says here is do it single: only do this in one of the 56 threads, don't repeat this block for every possible thread. The reason I'm doing parallel and then single is so that we can get the number of threads and so on, and get all the various values here.

Okay, and you can use omp_get_wtime to get the elapsed time for your program. The thing is, Linux has a high-resolution and a low-resolution clock; the standard clock ticks something like a sixtieth of a second, which isn't actually fine enough, so the high-resolution clock here is better. Okay, and by the way, you can set the number of threads, as I showed you last time, with that environment variable — that was here.

"Hey, Professor?" — Yes? — "Quick question: the delta wall-time function in the common file, how accurate is it?" — Delta wall time, I think it's called, my clock function? I hope it's good; I don't guarantee it. If you're going to time something for publication, you want to run it a couple of times: the first time you run a program it may take more time, and the second time stuff will be in the cache and it will be faster. So you run it a couple of times, and make sure nothing else is running on the system.
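Since the question came up, here is a small sketch — my own illustration, not the course's common file — of timing a region with omp_get_wtime, the high-resolution wall clock mentioned above:

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>
#include <cmath>

int main() {
    std::vector<double> v(10'000'000);

    double t0 = omp_get_wtime();            // wall-clock seconds, high resolution
    #pragma omp parallel for
    for (long i = 0; i < (long)v.size(); ++i)
        v[i] = std::sqrt((double)i);        // some work to time
    double t1 = omp_get_wtime();

    std::printf("elapsed %.6f s, clock resolution %g s\n",
                t1 - t0, omp_get_wtick());  // omp_get_wtick() reports the tick size
}
```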
Okay, so: single and barrier. What's happening here is we're just putting barriers around things, so this executes only once, and inside the parallel block the number of threads will be 56. And the critical means the "Hello world" will get printed without being all scrambled, because the whole line gets executed by one thread at a time. Let's try this. Yeah — all the Hello-world lines were not scrambled up, except for that last line, which got scrambled because I needed one more barrier here or something; and every time I run it, of course, the thread numbers are different. Okay, so I think I need one more barrier at the end here so things would not get scrambled — oh, it's okay, it's not scrambled — well, except that last bit. Okay.

Another way to do things in parallel is that you can explicitly parallelize pieces of your program. What's happening here: here's a delta-clock time, and here I'm setting a locale so it prints with commas. Delta clock is a routine I wrote; what it does is print the elapsed time since the last time I called it. Okay, now, what's happening here is that I have explicit parallel sections. They're not in a for loop and they're not recursive tasks or anything; they're just two parts of my program that I say don't depend on each other, so I can do them in parallel. So I've got two sections here: this section is a for loop that creates c, and this section here is another; those two will run in parallel. It's the programmer's job to ensure they don't step on each other's toes. The way I do that: each one is a section, singular, and I put all of the sections inside a sections, plural, pragma. So now it's like a case statement or something: I've got each of these — as many as I want — and they all run simultaneously to the extent possible; if not, they get thrown on the queue to run when threads are available. And I have to do all of that inside a parallel block, so I create the parallel environment, which extends as far as here, and inside the parallel environment everything gets done multiple times on every thread — except if it says master, it's done only on the master thread — and then inside the sections construct, the sections are each farmed out to separate threads.

And if we look at this, the CPU load was 165%, so on average I was using more than one thread to do it. Very useful.
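A minimal sketch of that structure — my own example, with made-up array names — parallel, then sections, then one section per independent piece of work:

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 10'000'000;
    std::vector<double> b(n), c(n);

    #pragma omp parallel            // create the team of threads
    {
        #pragma omp sections        // the sections below are farmed out to different threads
        {
            #pragma omp section     // first independent piece of work
            for (int i = 0; i < n; ++i) b[i] = 2.0 * i;

            #pragma omp section     // second independent piece, no dependence on the first
            for (int i = 0; i < n; ++i) c[i] = 0.5 * i * i;
        }                           // implicit barrier: both sections are done here
    }
    std::printf("b[7]=%g c[7]=%g\n", b[7], c[7]);
}
```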
Now, the reason this time report gets printed out is that I've got an environment variable set, so any command that takes more than a few seconds will automatically have time run on it — and there's the locale putting the commas in; I said I like that. The way I get the time to print out automatically is REPORTTIME: this causes a report if the command takes more than a second, and then it prints the time the command took, up here.

Okay, sections. So, to do a section: you set up a parallel block, you define all your sections inside a sections construct, and you define the individual sections one by one. So, IMNSHO — my abbreviation for "in my not so humble opinion" — OpenMP is easier than pthreads and fork.

Okay, so that's about as much as I'm going to show you of OpenMP; Stack Overflow has questions on it. With a task or a section you're being explicit. With the section construct, the sections run at the same time, more or less synchronously, to the extent possible; tasks are totally asynchronous — you fire off a task and return to the caller, and the task just sits there and runs when it can; you can wait for it to finish, which might be a good idea. The sections directive typically waits until they all finish. So, okay, that's OpenMP. It's a step above pthreads, but you're still very prescriptive about what you're doing. Now, a problem with OpenMP is that it's still weak on GPUs — they've only lately been adding GPU support — so to parallelize on a GPU I'd recommend other tools, like the next one I'm going to talk about. But I wanted to introduce you to OpenMP because it is a major parallelization tool; it's been around — it started, I'd say, 20 years ago — so it's mature, with a wide base of users, and you can put on your resume that you have written an OpenMP program. And I've got various other things here, and other stuff from last year.

Oh, one more thing I forgot to show you, my mistake: reduce. Let me show you the sum problem again — okay, you remember this. You've got your parallel for, and it's got this sum, and the computed total is going to be wrong because the different threads step on each other. Now, you could put this update in a critical — that's very slow — or you could put it in an atomic, which is faster. But for things like this, where you're summing into a total, there's what's called a reduction operation: we're reducing a vector of arguments to a total, or something else.
Okay, there's a special construct to do this, because it's a common thing people want to do, and there's a construct that does it much faster: the reduction clause. Let me show you that. Notice that inside the for loop we do not have an atomic or a critical or anything. What we have, on the starting pragma — pragma omp parallel for — is this new clause. What it says is that the variable computed, that's down here, is the output of a reduction, and the reduction operator is plus. So this tells OpenMP that inside the for loop, computed is going to be a sum of a lot of local values, and to do it fast. What OpenMP will do is keep a separate local version of computed for each thread, so each thread will not sum into the global computed — it will sum into a local subtotal variable — and at the end, all the local subtotals are summed into the global one. So it's very efficient: you don't need any locks or atomics or criticals at all, so it's going to be fast and it's going to be correct. And we get the correct answer.

Now, looking at this again: this was doing a reduce with a sum, and there are a number of other operators you can use. The basic requirement, if this is to work, is that the operator has to be commutative and associative. So you can reduce with a product, a max or a min, a logical or bitwise or and and — but you could not reduce with, for example, a minus, because subtraction is not commutative. There's only a specific list of operators you can reduce with; they're in the documentation.

Let me show you some others. What are we doing here? We're looking at numbers of threads, we're looking at times and so on; here we do a reduce and print all sorts of stuff. And this one is an attempt to do it on the GPU — an attempt to have it compile for the GPU, with target and teams and whatever — and the concept is that it's a little faster, perhaps. We could also sum a series; that's interesting.
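Here is a minimal sketch of the reduction clause as just described — my example; the variable names are made up:

```cpp
#include <cstdio>

int main() {
    const long n = 100000000;
    double computed = 0.0;

    // Each thread accumulates into its own private copy of 'computed';
    // OpenMP combines the per-thread subtotals at the end.  No atomics needed.
    #pragma omp parallel for reduction(+:computed)
    for (long i = 1; i <= n; ++i)
        computed += 1.0 / (i * 1.0 * i);        // partial sums converge toward pi^2/6

    std::printf("sum = %.10f\n", computed);
}
```

The same clause form also accepts *, max, min, and the logical and bitwise operators, per the list above.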
Other things to show you: let me show you another working program, where you can get sizes. This shows how you can get the sizes of different data types: sizeof — you give it a data type as an argument, and it returns the size in bytes. This can be useful because nothing in the C++ standard defines the absolute size of anything; it only defines relative sizes — a short can be no longer than an int, and an int can be no longer than a long, and so on, but they could be the same size. So this is a way to get the sizes of different data types, which can be useful. In fact, an int here is 4 bytes, and long does the common-sense thing, but notice that long long is no longer than long — they're both 8 bytes. So this sort of thing is useful.

Other useful things to show you: things we can do with matrices. This is just me playing around — I never did get it working quite right — matrix multiplication; some playing with doing matrices in parallel. This is just the sequential version here; we can play with it, we can copy it in. What's happening here is we're trying to do the thing in parallel and see what happens. And, yeah — previously it took almost 4 seconds of real time, and now it took a third of a second real, so apparently it went a lot faster. And by the way, I've turned off optimization in the compile here, just so as not to confuse things; if you turned on optimization, everything would go very much faster. What are we doing here? Nothing too interesting.
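A bare-bones version of that experiment — my sketch, not the course file — parallelizing the classic triple loop over the rows of the result; compile with -fopenmp as before:

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 800;                           // small enough to finish quickly even unoptimized
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);

    // Each row of c is independent, so the outer loop parallelizes cleanly.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)              // k before j keeps the inner accesses contiguous
            for (int j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];

    std::printf("c[0] = %g (expect %g)\n", c[0], 2.0 * n);
}
```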
Okay, so that's a good executive introduction to OpenMP.

OpenACC — well, first I want to talk about compilers, lots and lots of compilers; and that's not even all of them, there's also Intel. The examples I've been giving you here have been using g++; it's a very nice compiler and does OpenMP well. Now, we're migrating toward using CUDA in a few days — CUDA is NVIDIA's low-level language, some extensions to C++ — and for it NVIDIA has a compiler called nvcc; you need something like nvcc to compile it, because g++ doesn't know CUDA. Now, at times I get annoyed with g++: it takes a while to adapt to new hardware. There's a commercial compiler called pgc++ — it's commercial, but it's free for, like, amateur usage — and I think it may be better than g++, especially for compiling to NVIDIA and so on. We have pgc++ there on parallel. So that might almost be better than g++; the thing is, the names of the flags and so on will be different. In any case, I may start switching over to it for, say, OpenMP. Now, NVIDIA and pgc++ — I believe PGI may have been partly sponsored by NVIDIA, and my interpretation is that NVIDIA took over pgc++ and rebadged it as nvc++, NVIDIA C++, and they've added some new features, so I'm thinking that is possibly the best one here. I have not yet installed it on parallel. But in any case, for OpenMP, g++ works fine; you could use the others, but some of the flags are different. pgc++ has some nice debugging features I may show you. And I have a homework 3 here, playing with OpenMP.

Okay, OpenACC. This is a newer thing than OpenMP, and it's more abstract. With OpenMP you're very specific about what you do: you say, parallelize this for loop, run these sections in parallel, or fire off tasks to go into the queue. OpenMP is very specific about the parallelization, but it hides some of the low-level bookkeeping that you have to worry about with pthreads. So you might almost say OpenMP has the same power as pthreads but is easier; OpenMP may give you a little more. And the hardest part of any of this is that your algorithm has to be parallelizable — that's the hard part. Now, OpenACC is higher level and newer than OpenMP, and I think it's useful by now — I have a rule that I don't like using something until it's 10 years old, but OpenACC is useful. Again, it has wide industry support, so it's a living system: people use it, it gets extended, and that's nice. I like living systems that are widely used, not just toy systems — that's my insulting term. And compared to OpenMP, OpenACC is higher level, and OpenACC also works with GPU devices. OpenMP has been really late adding GPU access, and it gets done badly and non-standard — the thing is, once the standard does it, then of course the compilers have to do better than that. So those are reasons for OpenACC. It has wide support, and there's a lot of information here. What I want to do is walk you through some of the tutorials, and maybe next time I'll run a few programs. Okay, so you can watch the recording; I'm going to walk you through the slides, and this will also give you an introduction.
There are piles of information available, and again it's also supported by NVIDIA. Okay — oops, just a second, I don't want to do that — okay, so I'm going to walk you through this and just hit the highlights. I notice NVIDIA supports this; they talk about taking a whole week for it, and I'm going to do it in 20 minutes or whatever.

Okay. So, OpenACC is compiler directives, like OpenMP's compiler directives, plus some library stuff; I'll show you — you add pragmas instead of rewriting the code, to parallelize it. There are different types of pragmas here. These sorts of pragmas talk about the data: do you want to copy the data into the parallel region at the start, copy it out at the end, or both? The compiler can actually determine that much of the time, but perhaps you know better than the compiler, and if you get explicit about the data movement, the program might compile better. So if your program is simple, let the compiler figure it out; otherwise, specify it. Then you set up a parallel region. Okay, this one says compile it for the device and throw everything you've got at it, basically. So there's a loop coming up, and gang means run it on the CUDA cores on the GPU. Okay, and this one is similar to OpenMP.

Many-core: this refers to things like the Intel Xeon Phi co-processor card that I talked about for a few years in this course, and stopped, because Intel dropped the product a couple of years ago. The Xeon Phi was a co-processor card that plugged into your machine and had about 60 cores on it, each running several threads — so it's called many-core, 60 cores on the card. They were stripped-down Xeons — they stripped out a lot of things like speculative execution and so on, so they took less hardware to build; but if your code did not require things like speculative execution, it ran very fast. The card ran a stripped-down embedded version of Linux, and you could connect to it with ssh and so on and use shared file systems. But that's sort of obsolete now. The regular Xeon is what we call multi-core — parallel is dual 14-core, and there are dual 20-core ones — whereas on the Phi each core was very small.
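To make those data-movement clauses concrete, here is a minimal sketch — my example, not taken from the tutorial — of telling the compiler explicitly what to copy in and what to copy back out around a kernels region:

```cpp
#include <cstdio>

int main() {
    const int n = 1000000;
    static float x[n], y[n];
    for (int i = 0; i < n; ++i) { x[i] = i; y[i] = 1.0f; }

    // copyin: host -> device at entry only; copy: in at entry and back out at exit.
    // Without these clauses the compiler guesses the data movement itself.
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc kernels          // let the compiler parallelize what's inside
        for (int i = 0; i < n; ++i)
            y[i] = 2.0f * x[i] + y[i];
    }
    std::printf("y[10] = %g\n", y[10]);   // 2*10 + 1 = 21
}
```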
Okay, the concept is that you write your code — your algorithm has to be parallelizable; you'll hear that from me again and again — and you add annotations, and the compiler then determines how to parallelize it for the machine. You can still run the program on your sequential machine, and the theory is that the compiler will compile for different architectures. The concept is that you don't have to worry about some of the low-level details; you're perhaps not going to get the same total power as if you were aware of the low-level details, but your development time is faster. So this is basically the same idea as OpenMP. And you can just say, parallelize the loop — it's your responsibility that the separate iterations of the loop don't affect each other.

Lots of target devices. IBM POWER is a very nice architecture, actually, used in a number of the Top500 supercomputers. IBM, in fact — I ask myself, what does IBM do well today? And I can think of only two things. There are their cloud-computing services — they're number 4 or 5, with a couple percent of the market; well, they have their mainframes, which are nice but becoming a little obsolete; so their cloud computing, I think, is almost pointless. They just, a week or two ago, closed off their blockchain group, so they've decided blockchain is not a money maker and isn't going anywhere. But their POWER architecture is very nice, and they plug NVIDIA cards into it, and that makes a very nice supercomputer — some of the top supercomputers are doing this, and the POWER machines have a very fast bus to the NVIDIA cards, faster than you can get using a Xeon, actually. So IBM does that very well; that's one thing IBM has — the components for supercomputers. The other thing I think IBM does very well is quantum computing; they're perhaps a leader there. Nothing else.

Okay, back to OpenACC. So you've got the CPU — this is sort of showing multi-core — and then the GPU; and since NVIDIA has the biggest part of the GPU market, everything I talk about will be NVIDIA. In five years it might be something else. So, lots of CUDA cores; the programmer hands this to the compiler, and the same source can also run on the CUDA cores, perhaps. That's the same slide; they're trying to hammer in some important things. And that's nice advertising — well, good, nothing wrong with advertising; lots of slides to tell you OpenACC is wonderful. Please stick on your resume that you've used OpenACC, that you've programmed in OpenACC. Syntax: pragmas and so on — nothing new there. You can also do it in Fortran, if you're unfortunate enough to have to use Fortran.
Okay, so now we're going to use an example, the classic heat-transfer problem: we just iterate so that every node becomes the average of its four neighbors, and do it in parallel. So we're iterating — every node is set to the average of its neighbors — and we keep doing it until it converges. We're ignoring things like over-relaxation and so on; this is just an example. Here's the program: they compute the average of the four neighbors, we compute how much it changed so we know how it's converging, and the next loop copies the result back. Okay, a sequential program, and we repeat this either until the error gets small or until we've iterated too many times.

Okay, they're making a big point here — I've got to show you some analysis and profiling tools — because it may not be obvious what's taking the time; what you think is taking the time may not be what is actually taking the time. So, profiling — and it turns out that the swap is taking almost half the time. I/O is always slow. And there are these tools — the examples use the PGI compilers and profiling tools; I'll run them for you on Thursday, I think; this is the introduction. You can profile sequential code and it will show you, for the different parts of the program, what is taking the time — and 46% of the time was copying the array over. Okay, and if we go down another level, eventually you see the low-level routines that are taking the time. Okay, nice things, the profiling tools.

Let me just, for the moment, scroll back to remind you what the program looked like, and then we'll come back to this page 34 — just a second here. Gotcha. Okay, so this thing here, just the copy, was like 46% of the time, and this thing here, computing the new value, was like 54% of the time. In other words, the copying was slow and the computation was comparatively fast — and that's often the case in parallel computing: the cost is dominated by the I/O and data-movement time. Okay, back to page 34 — if I'm scrolling back and forth so quickly that it's a problem, let me know.

Okay, parallelizing it. These gangs are just groups of threads on the GPU — if you're ahead of me in GPU knowledge, they're tied into things like thread blocks. In any case, this says we've got these gangs of threads here; I'm anticipating a little.
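For reference, a stripped-down sketch of the parallelized step — my paraphrase of the tutorial's Jacobi-style loop; the array names and sizes are illustrative — with a max reduction for the convergence error:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const int n = 512, m = 512;
    static float A[n][m] = {}, Anew[n][m] = {};
    for (int j = 0; j < m; ++j) A[0][j] = 100.0f;      // hot boundary row

    float err = 1.0f;
    int iter = 0;
    while (err > 0.01f && iter < 1000) {
        err = 0.0f;
        // Each interior point becomes the average of its four neighbors.
        #pragma acc parallel loop reduction(max:err)
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < m - 1; ++j) {
                Anew[i][j] = 0.25f * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
                err = std::fmax(err, std::fabs(Anew[i][j] - A[i][j]));
            }
        // Copy back -- the step the profiler showed eating roughly half the time.
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < m - 1; ++j)
                A[i][j] = Anew[i][j];
        ++iter;
    }
    std::printf("iterations=%d  final error=%g\n", iter, err);
}
```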
554 01:10:27.720 --> 01:10:35.310 have about 3 levels of hierarchy, and I'll write this on a future slide. At the very lowest level, 555 01:10:35.310 --> 01:10:39.899 you've got 32 threads forming a warp. 556 01:10:39.899 --> 01:10:53.250 And the 32 threads in a warp are executing the same instruction. So an instruction decoder decodes an instruction and distributes it, and all 32 threads are running the same instruction 557 01:10:53.250 --> 01:11:06.715 on different data, I'm sorry, not different code. The only difference is that a thread can be disabled. So a thread has an enabled status bit, and if the thread is disabled, it's not running the instruction; it's idle. 558 01:11:06.925 --> 01:11:10.375 But if the thread is enabled, all 32 threads in the warp are running the same, 559 01:11:12.210 --> 01:11:15.539 the same instruction. So that's a warp. 560 01:11:15.539 --> 01:11:24.930 Now, the warps of threads are grouped into what's called a thread block. So a block might have 1024 threads in it, 561 01:11:24.930 --> 01:11:28.350 32 warps of 32 threads. 562 01:11:28.350 --> 01:11:34.140 And the warps in a block, 563 01:11:34.140 --> 01:11:48.984 they're scheduled independently. There's a little operating system sitting on the GPU, and for the warps in the block there's a queue of warps waiting to run; there are queues everywhere. And so the warps can execute independently. 564 01:11:49.260 --> 01:12:00.930 But they still have connections to each other, in that all the threads in the block have a shared memory, a block of shared memory that they can all read and write. 565 01:12:00.930 --> 01:12:04.050 So, the warps in a block have it; 566 01:12:04.050 --> 01:12:13.560 they share some memory if they want to, they're not forced to. I mean, a thread has private memory local to the thread, but there is a shared memory that's 567 01:12:13.560 --> 01:12:17.970 shared by all the threads in the block. And 568 01:12:19.289 --> 01:12:33.689 also the threads in a block can synchronize: they can set up a barrier and wait till all of the threads in the block hit that barrier. So we've got the threads in a warp and then the warps in a block. That's 2 levels. 569 01:12:33.689 --> 01:12:37.739 3rd level, you've got separate blocks. 570 01:12:37.739 --> 01:12:42.750 So, you're going to have multiple blocks in your program, 571 01:12:42.750 --> 01:12:52.590 as many as you want, basically, of 1000 threads each, and the separate blocks are scheduled separately, and there's a queue of blocks. 572 01:12:52.590 --> 01:13:00.239 And they scarcely communicate with each other. There's global memory that they can read and write to; like, 573 01:13:00.239 --> 01:13:11.760 on parallel, that card has 16 gigabytes of global memory. All of the blocks have access to the global memory, but basically the separate blocks 574 01:13:11.760 --> 01:13:22.350 don't interact with each other. They could synchronize with each other, but that's probably a bad idea; it's going to really slow things down. So we've got the thread warps, 575 01:13:22.350 --> 01:13:28.229 the thread block, a single block, and then the multiple blocks. That's 3 levels. 576 01:13:28.229 --> 01:13:33.300 The multiple blocks form a kernel. A kernel's like a parallel program. 577 01:13:33.300 --> 01:13:37.409 Your GPU can run multiple kernels, so that's 4 levels. 578 01:13:37.409 --> 01:13:48.899 And the separate kernels don't interact with each other; while they could read and write to the same global memory, they're probably not even doing that.
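As a sketch of how OpenACC's levels relate to this hierarchy (the mapping is ultimately the compiler's decision; roughly, a gang corresponds to a thread block and the vector lanes to the threads/warps inside one), the clauses below show the idea. The loop, the names, and the choice of vector_length(128) are illustrative, not required values.

    // SAXPY-style loop spread over gangs (thread blocks) and vector lanes (threads).
    void saxpy(long n, float a, const float *x, float *y) {
        #pragma acc parallel loop gang vector vector_length(128) copyin(x[0:n]) copy(y[0:n])
        for (long i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }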
The separate kernels are like separate 579 01:13:48.899 --> 01:13:58.770 jobs on a parallel computer. Well, it is a parallel computer. And there's a queue of kernels. So 4 levels, 580 01:13:58.770 --> 01:14:03.779 a hierarchy of 4 levels on the GPU. 581 01:14:05.010 --> 01:14:11.250 And I could probably extend it to a 5th level if I thought about it. Okay. But basically, 4 levels of 582 01:14:11.250 --> 01:14:21.119 parallelism with threads. Okay, so here we're in OpenACC: a gang of threads in a block and then separate blocks. Basically, that's not a hard thing. 583 01:14:21.119 --> 01:14:25.560 Okay. 584 01:14:25.560 --> 01:14:34.710 And so, nothing interesting there. Do the loop in parallel, do iterations of the loop. You saw this before; I'm going to go through this fast. 585 01:14:34.710 --> 01:14:38.760 Oh. 586 01:14:38.760 --> 01:14:43.199 Like in OpenMP, nothing new there. 587 01:14:45.270 --> 01:14:51.090 And watch your data dependencies; that's your problem. 588 01:14:54.149 --> 01:14:58.500 It's too hard for the compiler. Nothing new there. 589 01:14:58.500 --> 01:15:08.100 Oh, the parallel directive? Yeah, that says, do everything inside the block here on every thread in parallel, unless 590 01:15:08.100 --> 01:15:12.060 there's something like a loop. Nothing new there. 591 01:15:14.545 --> 01:15:28.944 Okay, this is still like OpenMP, but here we have a reduction. In my example before, the reduction operator was plus; here the reduction operator is maximum. Maximum is associative and commutative, so 592 01:15:29.364 --> 01:15:30.444 that's okay. 593 01:15:30.720 --> 01:15:40.079 So here, this max is pulled out by the compiler, and it does a max separately on each thread and then combines all the sub-maxes into a global max. 594 01:15:40.079 --> 01:15:43.199 And then we parallelize the 2nd thing. 595 01:15:44.640 --> 01:15:48.989 And it mentions the reduction clause here. 596 01:15:48.989 --> 01:15:52.710 I told you what it does; the syntax is right there. 597 01:15:53.970 --> 01:16:03.989 There's only a fixed set of legal reduction operators, because they have compiler support. They're actually supported at a low level: plus, max, and so on. 598 01:16:03.989 --> 01:16:11.130 Likewise, to run the code: pgc++. 599 01:16:11.130 --> 01:16:20.279 And pgc++ has an enormous number of flags, including an enormous number of optimization flags. 600 01:16:20.279 --> 01:16:26.250 And -fast does a nice set of optimization flags. 601 01:16:26.250 --> 01:16:31.199 -Minfo is a really nice flag. It prints incredible amounts of debugging information. 602 01:16:31.199 --> 01:16:38.819 And -Minfo=all; I'm just introducing stuff, I'll demo it. 603 01:16:38.819 --> 01:16:45.149 And we've got various flags to say what to compile it for. I'll review this on Thursday. 604 01:16:45.149 --> 01:16:52.140 This says Tesla; Tesla means NVIDIA GPU, for historical reasons. It doesn't mean one specific generation. 605 01:16:52.140 --> 01:16:58.289 NVIDIA has generations of their GPUs, Kepler or 606 01:16:58.289 --> 01:17:09.300 Volta, and the current one is Ampere. The previous one is Volta, the previous one is Pascal, the previous one is Maxwell, the previous one is Kepler. So Tesla 607 01:17:09.300 --> 01:17:17.279 was 1 generation of it; it's now being used by the PGI compilers to refer to all of them. 608 01:17:17.279 --> 01:17:24.630 It's no relation to the car, and it's no relation to the 609 01:17:24.630 --> 01:17:28.680 marketing level for NVIDIA, where Tesla means
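A sketch of the pieces just discussed, with the convergence loop carrying a max reduction (each gang or thread keeps its own running max and the compiler combines them). The commented compile line shows the general shape of the pgc++ command with the flags mentioned here; exact flag spellings vary by compiler version, so treat both the code and the command as illustrative rather than the course's exact files.

    #include <cmath>

    // Illustrative compile line (flags as discussed: OpenACC, -fast, -Minfo,
    // Tesla target, managed memory):
    //     pgc++ -acc -fast -Minfo=all -ta=tesla:managed jacobi.cpp -o jacobi
    //
    // One sweep of the grid; returns the largest change for the convergence test.
    double sweep(const double *A, double *Anew, int N, int M) {
        double error = 0.0;
        #pragma acc parallel loop reduction(max:error)
        for (int j = 1; j < N - 1; ++j)
            for (int i = 1; i < M - 1; ++i) {
                Anew[j*M + i] = 0.25 * (A[j*M + i+1] + A[j*M + i-1]
                                      + A[(j-1)*M + i] + A[(j+1)*M + i]);
                error = std::fmax(error, std::fabs(Anew[j*M + i] - A[j*M + i]));
            }
        return error;
    }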
610 01:17:30.329 --> 01:17:36.180 and I forget which, the supercomputing level or something. Unrelated. Okay. 611 01:17:37.350 --> 01:17:46.829 So, something like this would say: the target architecture to compile for, compile to run on the GPU, managed memory, which I'll tell you about in a minute, 612 01:17:46.829 --> 01:17:50.760 and print lots of information and optimize it. So. 613 01:17:50.760 --> 01:17:58.949 And you compile it, and this is the sort of information it prints out: it prints out the optimization information about what it can optimize. 614 01:17:58.949 --> 01:18:05.550 And the speedup on multicore here, this would be actually, 615 01:18:05.550 --> 01:18:10.199 this should still be on the Intel, 3 times faster. 616 01:18:10.199 --> 01:18:16.560 And here you see the system generated implicit copy-ins and copy-outs. 617 01:18:16.560 --> 01:18:20.250 You didn't have to specify it, um. 618 01:18:23.489 --> 01:18:29.310 Okay, so the 1st compile was just for the Intel, 619 01:18:29.310 --> 01:18:36.420 about 3 times faster on their particular machine. We then compile it to run on the GPU, 620 01:18:36.420 --> 01:18:46.590 and it got 37 times faster on the Volta. So, that's the 2nd-newest architecture; this is quite a new architecture here. So. 621 01:18:49.199 --> 01:18:53.310 Oh, and here's what their multicore was. Okay. 622 01:18:54.329 --> 01:19:02.850 Portable, GPU, CPU; closing remarks. Okay, good point to stop. Now, let me go back to my 623 01:19:02.850 --> 01:19:06.449 page. Here we go. 624 01:19:07.800 --> 01:19:14.970 Yeah, so what I did today, just to remind you, 625 01:19:14.970 --> 01:19:19.500 is that 626 01:19:19.500 --> 01:19:29.220 I gave you some operating system information and useful tools about ssh. Oh, there's one I didn't mention that I should have, about 627 01:19:29.220 --> 01:19:35.100 stack size: make your stacks bigger, and then it allocates pages when needed. 628 01:19:35.100 --> 01:19:38.579 It doesn't matter if you have a large virtual memory that you don't use. 629 01:19:38.579 --> 01:19:43.050 Well, that's also true when allocating stuff on the heap, but 630 01:19:44.189 --> 01:19:48.329 allocate a big array; until you touch it, it doesn't cost anything. 631 01:19:48.329 --> 01:20:00.689 And I finished off OpenMP. It's a very nice thing for the Intel, and then, to move on, it's fairly low level, so I started OpenACC for you, 632 01:20:00.689 --> 01:20:07.619 which will be better as a high level for compiling to 633 01:20:07.619 --> 01:20:10.619 the GPU, and 634 01:20:10.619 --> 01:20:14.069 I'll continue that next time; we'll run some programs. 635 01:20:14.069 --> 01:20:25.979 I've installed nvc++, I guess, so I would recommend, if you're starting, the nvc++ compiler; I'm thinking it's possibly better than g++, and so on. 636 01:20:25.979 --> 01:20:31.949 And then what we're doing is we're migrating into NVIDIA, with 637 01:20:31.949 --> 01:20:36.270 and so on, slowly. If you want to read ahead of me, 638 01:20:36.270 --> 01:20:40.409 read the next tutorials and have fun. 639 01:20:40.409 --> 01:20:47.399 Okay, so that's enough stuff for today. If anyone has any questions, then. 640 01:20:48.569 --> 01:20:54.329 Hey, professor, can you go over reductions again? I kind of missed that. 641 01:20:54.329 --> 01:21:03.989 Sure, so this is on parallel. 642 01:21:03.989 --> 01:21:07.199 And. 643 01:21:07.199 --> 01:21:11.670 Silence. 644 01:21:11.670 --> 01:21:16.560 So, what we want to do here, let's ignore the pragma.
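On the stack and heap point, a small illustration (my own example, not the course's code, and assuming a typical Linux system with memory overcommit): asking for a big array mostly reserves virtual address space, and physical pages are generally only allocated when you first touch them, so an untouched allocation is close to free.

    #include <cstdio>

    int main() {
        const long n = 1000000000L;        // ~8 GB of doubles, virtual
        double *a = new double[n];         // cheap: pages are not yet resident
        for (long i = 0; i < n; i += 1L << 20)
            a[i] = 1.0;                    // touching a page is what makes it real
        std::printf("touched a few pages of a %ld-element array\n", n);
        delete[] a;
        return 0;
    }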
645 01:21:16.560 --> 01:21:20.310 We want to sum up something, okay, inside a loop. 646 01:21:20.310 --> 01:21:25.800 Like, we want to, say, sum up the variable i, or something more complicated. 647 01:21:27.090 --> 01:21:36.390 And we want to do it in parallel. Now, the problem is that there's that global total variable being computed, and if we access it in parallel, 648 01:21:36.390 --> 01:21:44.579 you know, the different threads will try to write to it. You see, the way this is implemented: you read, you add, and you write back, 649 01:21:44.579 --> 01:21:47.670 and they step on each other, 650 01:21:47.670 --> 01:21:51.270 and you're going to get the wrong answer. So. 651 01:21:51.270 --> 01:21:55.350 Silence. 652 01:21:55.350 --> 01:21:59.520 Let's see, this is, oh, that was right. And do. 653 01:21:59.520 --> 01:22:09.899 Well, if I did not have the reduction, I would get the wrong answer here. So I could say. 654 01:22:18.300 --> 01:22:24.329 Silence. 655 01:22:33.899 --> 01:22:38.880 What? No, that's my laptop. Okay, so. 656 01:22:43.140 --> 01:22:47.970 Silence. 657 01:22:52.050 --> 01:22:57.300 Come on. 658 01:22:58.949 --> 01:23:02.460 It takes a while to run. 659 01:23:03.539 --> 01:23:08.250 Okay, here is the correct answer and the wrong answer. 660 01:23:08.250 --> 01:23:13.199 And I have it mentioned somewhere on here. 661 01:23:15.750 --> 01:23:20.880 You see this problem with the addition: the 2 threads step on each other. 662 01:23:20.880 --> 01:23:24.149 So, the answer is, 663 01:23:25.229 --> 01:23:38.819 if we have this, then it will get compiled so that each separate thread has a local copy of the computed total variable. So each thread will be 664 01:23:38.819 --> 01:23:43.199 summing into a local subtotal variable, so there's no problem. 665 01:23:43.199 --> 01:23:50.310 And then, at the end, all the local subtotal variables will be summed together to make the global computed total. 666 01:23:51.689 --> 01:23:55.680 So this can run in parallel, and we get the right answer. 667 01:23:57.359 --> 01:24:03.960 Does that... Yeah, that's because it's different than, 668 01:24:03.960 --> 01:24:07.710 like, critical or atomic or something, right? 669 01:24:07.710 --> 01:24:13.949 Well, this is more limited in what it can do. You can reduce with only a small fixed set of operators: 670 01:24:13.949 --> 01:24:19.170 sum; the other example, for OpenACC, was a max. 671 01:24:19.170 --> 01:24:22.170 So, there's a fixed, limited set of operators. 672 01:24:22.170 --> 01:24:26.850 But if that's 1 of the things you want to do, 673 01:24:26.850 --> 01:24:32.789 it does it really fast. Gotcha, thank you. The atomic is 674 01:24:32.789 --> 01:24:40.409 more general; the statement following an atomic, again, is limited to what the allowed operations are. 675 01:24:41.430 --> 01:24:49.979 But it's less limited than a reduction, and it's slower than a reduction, but still pretty good. The critical block: 676 01:24:49.979 --> 01:24:56.189 you can put anything you want in the critical block, but there's a big overhead just to start the critical block. 677 01:24:56.189 --> 01:25:00.630 Mm, okay. 678 01:25:00.630 --> 01:25:04.350 Other questions? 679 01:25:04.350 --> 01:25:07.500 Silence. 680 01:25:07.500 --> 01:25:12.510 If not, see you Thursday. Time for lunch. 681 01:25:12.510 --> 01:25:15.689 Sorry, I actually have another 1. Sure, go ahead. 682 01:25:15.689 --> 01:25:19.770 From homework 2, the question about
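A minimal sketch of the demo just described (the variable names and loop body are mine, not the demo file's): without the reduction clause, the threads race on the shared total and you can get the wrong answer; with reduction(+:total), each thread sums into a private copy and the copies are combined at the end.

    #include <cstdio>

    int main() {
        const long long n = 100000000LL;
        long long total = 0;
        // Remove the reduction clause and the threads race on "total":
        // read, add, write back, stepping on each other.
        #pragma omp parallel for reduction(+:total)
        for (long long i = 0; i < n; ++i)
            total += i;                     // each thread adds into its own subtotal
        std::printf("total = %lld (expected %lld)\n", total, n * (n - 1) / 2);
        return 0;
    }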
683 01:25:19.770 --> 01:25:23.789 CUDA cores versus the Intel Xeon. 684 01:25:23.789 --> 01:25:28.409 I wasn't sure about that 1. Can you kind of go over that? Sure. 685 01:25:28.409 --> 01:25:32.189 Silence. 686 01:25:32.189 --> 01:25:37.770 Here, yeah, yeah, it's on 1 of the handouts. 687 01:25:37.770 --> 01:25:42.210 But the Intel Xeon core, it's superscalar. 688 01:25:42.210 --> 01:25:53.310 You can be running 2 hyperthreads, and then you can do maybe a floating add, a floating multiply, an integer add, some sort of conditional test, all in 1 cycle. 689 01:25:53.310 --> 01:26:00.090 So, in 1 cycle on the Intel Xeon, you can do several operations. 690 01:26:00.090 --> 01:26:04.079 Whereas the CUDA core, in 1 cycle, it's 691 01:26:04.079 --> 01:26:13.109 much less; it maybe even cannot do a floating point operation for every CUDA core, for every thread, on the GPU, 692 01:26:13.109 --> 01:26:19.170 because there are fewer floating point 693 01:26:19.170 --> 01:26:23.909 units on the GPU than there are actual CUDA cores. 694 01:26:23.909 --> 01:26:29.789 So that's why I estimate that a CUDA core is 695 01:26:29.789 --> 01:26:34.260 5% of a Xeon core. 696 01:26:35.850 --> 01:26:43.260 Thank you. And that said, if you've got 4000 CUDA cores, that's still faster than, 697 01:26:43.260 --> 01:26:47.250 you know, 20 Xeon cores, but. 698 01:26:47.250 --> 01:26:51.899 Oh, by the way, this is an interesting design issue; I'll mention it more later, but 699 01:26:51.899 --> 01:26:58.020 when NVIDIA is designing a GPU, they have to decide 700 01:26:58.020 --> 01:27:03.300 how many floating point processors to put on the GPU, 701 01:27:03.300 --> 01:27:12.744 and how many double precision; that's a separate chunk of hardware, single versus double precision. And as they go from generation to generation, NVIDIA keeps changing things around. 702 01:27:13.645 --> 01:27:25.314 So, from Kepler to Maxwell, they especially reduced the number of double precision cores, so double precision went a lot slower, and then in the following generation they reversed their decision somewhat. 703 01:27:25.619 --> 01:27:37.979 And then in 1 of these generations they brought in half precision floating point, 16-bit floats. So if you had half precision floats, it went very fast, but if you had double precision floats, it went a lot slower. So. 704 01:27:39.000 --> 01:27:43.319 It's a design decision that the hardware designers make, as to how much, 705 01:27:43.319 --> 01:27:49.380 you know, how much area on the silicon to allocate to the different functions. 706 01:27:49.380 --> 01:27:57.659 And the chip still computes the right answers, but it's how much time it takes for the different functions. 707 01:27:59.489 --> 01:28:03.000 Other questions? 708 01:28:04.380 --> 01:28:07.680 Anyone else? Okay. 709 01:28:08.909 --> 01:28:12.270 And I'm not going to bother saving this chat window; there's nothing in it. So. 710 01:28:20.279 --> 01:28:21.390 Hello.
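Taking that rough estimate at face value, and purely as back-of-the-envelope arithmetic rather than a measurement: 4000 CUDA cores times 0.05 is about 200 Xeon-core equivalents, versus the 20 Xeon cores mentioned, so on the order of 10 times the throughput, provided the code can keep all of those CUDA cores busy.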