WEBVTT 1 00:02:43.830 --> 00:03:06.629 Silence. 2 00:03:13.349 --> 00:03:21.479 Silence. 3 00:03:23.400 --> 00:03:27.270 Silence. 4 00:03:27.270 --> 00:03:30.719 Silence. 5 00:03:30.719 --> 00:03:36.870 Silence. 6 00:03:44.129 --> 00:03:48.150 Silence. 7 00:03:48.150 --> 00:03:54.240 Silence. 8 00:03:57.240 --> 00:04:02.669 Okay. 9 00:04:04.169 --> 00:04:17.189 Silence. 10 00:04:21.149 --> 00:04:24.300 Silence. 11 00:04:33.149 --> 00:04:37.019 Okay, good afternoon class. 12 00:04:38.189 --> 00:04:42.418 I would just. 13 00:04:43.858 --> 00:04:50.069 That's better. 2nd, try. 14 00:04:50.069 --> 00:04:55.858 Good afternoon class I hope you can hear me. 15 00:04:55.858 --> 00:04:59.579 Somebody tell me if you can hear me. So, this is. 16 00:04:59.579 --> 00:05:03.629 Arrow class 7, I guess. 17 00:05:03.629 --> 00:05:07.858 22nd 2012. 18 00:05:07.858 --> 00:05:19.288 And thank you, Justin. So, what is happening today is continuing on. We are talking about NVIDIA now in the, and. 19 00:05:19.288 --> 00:05:28.978 I like to teach from specifics so, but as I gave you specific facts about present information about video. 20 00:05:28.978 --> 00:05:36.209 Then you can see, I hope you, and for general principals of a parallel hardware and software. 21 00:05:36.209 --> 00:05:43.559 I mean, there's different ways 1 can solve a problem, but there are a successful company. So their solutions to. 22 00:05:43.559 --> 00:05:47.519 Parallel hardware and the software to use it. 23 00:05:47.519 --> 00:05:51.778 Solutions that are worth looking at so, but 1st. 24 00:05:51.778 --> 00:05:57.509 There was a question about compilers and so I have an answer here. 25 00:05:57.509 --> 00:06:00.928 On. 26 00:06:00.928 --> 00:06:07.019 See, here, that's better. 27 00:06:07.019 --> 00:06:14.759 Okay, if I pull up a window a window on me, a 2nd, here. 28 00:06:15.928 --> 00:06:19.678 Silence. 29 00:06:21.269 --> 00:06:26.939 Okay, so now this window. 30 00:06:26.939 --> 00:06:31.918 It's open on parallel. 31 00:06:31.918 --> 00:06:38.009 And the invidious packet I installed into opt and video. 32 00:06:41.309 --> 00:06:47.218 And it's under their several levels deep so I created a file and it ends. 33 00:06:48.269 --> 00:06:53.309 And that will add the compilers to your path. 34 00:06:54.173 --> 00:07:08.843 Path is an environment variables, a list of directories separated by colons and if you type a command, you just type the name without the full path name. Then the shell searches through the directories and the path environment variable to find out. 35 00:07:09.149 --> 00:07:17.309 Where your to find out where to execute it. So, if we have something like. 36 00:07:17.309 --> 00:07:21.658 C, plus, plus. 37 00:07:22.738 --> 00:07:28.108 Okay, so what I have to do is take that invariable thing and source. 38 00:07:28.108 --> 00:07:35.338 Now, if I say it finds it in, it's complaining, and we could also say, which. 39 00:07:36.358 --> 00:07:49.259 That shows where it is, there's other ways to do it, but it seemed to me, the easiest thing is to create this little thing, which, and we have, and you could even look in here if you wanted to see what it was. 40 00:07:49.259 --> 00:07:53.728 In here to see what other programs. 41 00:07:53.728 --> 00:08:00.718 There's also the compilers here, which are basically the same. So. 42 00:08:00.718 --> 00:08:06.629 Okay, so there's any questions about that on mute you Mike's her, I'm also. 43 00:08:06.629 --> 00:08:10.228 From time to time, watching the chat window off to my left. 
44 00:08:10.228 --> 00:08:17.218 Yes, quick question. So I was trying to use. 45 00:08:17.218 --> 00:08:20.579 For the homework yes. And I couldn't. 46 00:08:20.579 --> 00:08:25.408 Like, on parallel, I couldn't find it. So, is that the process you show? 47 00:08:25.408 --> 00:08:28.798 Help with that I didn't hear what's. 48 00:08:28.798 --> 00:08:36.719 What executed what were you looking for? I was trying to compile the C code with ACC so using the compile. 49 00:08:36.719 --> 00:08:39.749 Yeah, and is in here, so. 50 00:08:39.749 --> 00:08:42.928 So, I would have to. 51 00:08:42.928 --> 00:08:50.428 Copy this directory you all all you do is in the shell that you're running, you do. 52 00:08:50.428 --> 00:08:58.438 You will actually the full thing would be dot means to source opt in and f, environment. So. 53 00:08:58.438 --> 00:09:02.759 You do that and your shell and they'll be available to you. 54 00:09:02.759 --> 00:09:06.869 Okay, sounds good. Thanks. Okay. 55 00:09:06.869 --> 00:09:17.969 So, today we is looking at more video stuff and video has this nice teaching kit download, which I will run from. 56 00:09:19.948 --> 00:09:24.149 And I'm running from my laptop, so I don't have to do remote graphics. 57 00:09:24.149 --> 00:09:27.808 Okay. 58 00:09:27.808 --> 00:09:33.749 And I've actually, um. 59 00:09:33.749 --> 00:09:36.778 Oops, I clicked the thing. 60 00:09:41.548 --> 00:09:46.048 Oh, okay. 61 00:09:46.048 --> 00:09:49.469 The thing I do. 62 00:09:53.068 --> 00:09:56.519 Silence. 63 00:09:56.519 --> 00:10:00.509 Well, I'll grab it again. 64 00:10:01.528 --> 00:10:07.408 I'm going to zip, but instead of doing the virtual file mount, because I'll be writing into it. 65 00:10:07.408 --> 00:10:10.948 And. 66 00:10:10.948 --> 00:10:14.308 Silence. 67 00:10:17.278 --> 00:10:26.938 Silence. 68 00:10:26.938 --> 00:10:31.678 Silence. 69 00:10:31.678 --> 00:10:39.298 Okay, and the slide we did the 1st 3, so work on the 4th 1 now. 70 00:10:42.658 --> 00:10:46.528 Silence. 71 00:10:46.528 --> 00:10:52.828 And. 72 00:10:56.099 --> 00:11:00.599 Over here. 73 00:11:00.599 --> 00:11:11.548 Okay, so we'll be seeing a B for introduction to somebody B*** and profilers also, although some of the things in the slides set here. 74 00:11:11.548 --> 00:11:23.999 Don't actually work on this machine. Okay. And again, lots of ways you can do parallel stuff on the at different levels. You can program down in could you can use Python packages, you can use. 75 00:11:23.999 --> 00:11:28.379 Things like Matlab Mathematica and so on. 76 00:11:28.379 --> 00:11:34.109 And this is okay, so here. 77 00:11:34.109 --> 00:11:44.038 So, what this is a new thing to you, if you write down at the CUDA level. So, let me back up here maybe. So kudo has. 78 00:11:45.803 --> 00:11:51.803 Or CC plus plus, I'm not seeing a big difference. It's a minor extension to the language. 79 00:11:52.494 --> 00:12:05.693 They're not but it's a small extension to the language, which calls various libraries and this you can more directly access the, the, from your C. plus plus program. 80 00:12:07.708 --> 00:12:18.173 And what they have to do it is a compiler called different from NBC plus plus and you give your program, 81 00:12:18.173 --> 00:12:23.573 you C plus plus program that contains fragments of code a code, 82 00:12:23.933 --> 00:12:26.874 and the NBC compiler. 
83 00:12:27.744 --> 00:12:37.943 It identifies the fragments of CUDA code and separate them out and compiles them separately calls calls a C plus, plus compiler to call the C plus plus code. 84 00:12:38.094 --> 00:12:44.783 And then it links everything together into 1 package, which you can then run and it will run combined on the host and device. 85 00:12:46.078 --> 00:12:57.028 So the compiling job is somewhat slower, but machines are fast. So that's tolerable. But you can be amazed at how slow is actually. 86 00:12:58.168 --> 00:13:08.188 Here's a simple example, I'll also show you the code what we have here. If you can read it is your basic C plus plus program. 87 00:13:08.188 --> 00:13:14.908 It's not even executed. Well, of course, because it might need an include file or whatever, but will ignore that. 88 00:13:16.229 --> 00:13:20.038 Okay, you would call this with g plus boss. 89 00:13:20.038 --> 00:13:26.278 N. B. C. plus plus now what we've done here. 90 00:13:26.278 --> 00:13:29.369 Is this you might say. 91 00:13:29.369 --> 00:13:33.149 Is the Hello world program? 92 00:13:33.149 --> 00:13:38.369 It's a really basic program what it does here. So this is mostly the same. 93 00:13:38.369 --> 00:13:45.538 This is the new thing here and that may be a little small for you to see I'm going to. 94 00:13:45.538 --> 00:13:49.769 Actually show you the program itself. 95 00:13:49.769 --> 00:13:55.259 Silence. 96 00:13:55.259 --> 00:14:02.759 And I'm going to have to zip that. 97 00:14:02.759 --> 00:14:06.239 Silence. 98 00:14:06.239 --> 00:14:13.589 If you zip and watch it, it's not inside a subdirectory. 99 00:14:13.589 --> 00:14:18.269 And I'll show you the lab too thing. 100 00:14:19.798 --> 00:14:23.339 Silence. 101 00:14:23.339 --> 00:14:28.708 She is a little smaller than I want to. 102 00:14:28.708 --> 00:14:32.339 Let me see here. 103 00:14:33.359 --> 00:14:38.428 1, here. 104 00:14:38.428 --> 00:14:43.259 No, let's see here. Okay the program just as a bit of mess in it. 105 00:14:43.259 --> 00:14:49.288 Silence. 106 00:14:49.288 --> 00:14:53.788 Silence. 107 00:14:53.788 --> 00:14:59.938 Okay, here it is okay here is your basic. 108 00:15:01.168 --> 00:15:05.668 This is your basic C plus plus program? There is no. 109 00:15:05.668 --> 00:15:09.599 Good and again. 110 00:15:09.599 --> 00:15:13.948 I'm going to speak to so I could have. 111 00:15:13.948 --> 00:15:18.418 I could hand these out as examples, but I'm just going to show them to you straight. 112 00:15:19.589 --> 00:15:25.619 Okay, see, maybe this is a little easier to read now and again, if it's too small, you can let me know. 113 00:15:25.619 --> 00:15:28.859 So some with bigger. 114 00:15:28.859 --> 00:15:37.918 So, what we have here is your basic C plus plus program was 2 extensions we've got this and we've got this stuff up here. 115 00:15:39.089 --> 00:15:45.448 What we have here is a program that will run on the GPU. 116 00:15:45.448 --> 00:15:51.808 And, you know, it's going to run in the GPU on the device because it's got this modifier in front. 117 00:15:51.808 --> 00:16:05.458 Global ETS 2 underscores global to underscores. So this is a minor extension to C. plus plus and a global routine. Don't ask me why it's called that. This is a routine. 118 00:16:05.458 --> 00:16:11.969 That with from the host, but executes on the device. 
119 00:16:13.048 --> 00:16:24.178 And it actually always returns nothing and this particular case has no arguments, although it could have arguments and grace grace and it's actually doing nothing. 120 00:16:24.178 --> 00:16:30.719 So, it's your empty program, but it is a program that will run on the device and do nothing. 121 00:16:32.188 --> 00:16:35.759 This is the line in your host program that will call it. 122 00:16:35.759 --> 00:16:40.259 You give it the name. My kernel called my colonel. 123 00:16:40.259 --> 00:16:43.979 There were no arguments here and there's no arguments. 124 00:16:43.979 --> 00:16:51.418 Here is the syntax extension triple angle brackets are less than greater than with a 1 1 inside. 125 00:16:52.499 --> 00:16:56.158 So, what do the 1 in the 1 mean? 126 00:16:56.158 --> 00:17:02.009 The 1st 1 is how many thread blocks you want to execute this program on. 127 00:17:02.009 --> 00:17:05.969 That says 1. 128 00:17:05.969 --> 00:17:12.209 The 2nd thing says, how many threads do you want to execute on in each thread block? 129 00:17:12.209 --> 00:17:19.769 1, so, what this means is that this program up here will execute on 1 thread in 1 block. 130 00:17:19.769 --> 00:17:23.338 Not very parallel of course, but still. 131 00:17:23.338 --> 00:17:27.898 That's your basic this is your basic program. 132 00:17:27.898 --> 00:17:40.409 To do nothing on the device if you wonder why we're not even printing inside the device while you can't really easily do statements from the device. So that's why we don't have a print statement here. 133 00:17:40.409 --> 00:17:43.798 And we can actually go down here. 134 00:17:45.269 --> 00:17:49.409 Well, if you're wondering what asked with some flags on it. 135 00:17:50.638 --> 00:17:56.249 And so on okay, and then the make program. 136 00:17:56.249 --> 00:17:59.398 We'll run it and what it will do. 137 00:17:59.398 --> 00:18:09.929 As it will run, it will run, which is required because you're giving it a program and the extension to the program is for. 138 00:18:09.929 --> 00:18:14.548 And for no particular reason optimizing it. 139 00:18:14.548 --> 00:18:20.669 And this does nothing because of course, that's a default, but I could just say, and the. 140 00:18:20.669 --> 00:18:23.999 C. C. 141 00:18:23.999 --> 00:18:26.999 And we look now we've got a name out. 142 00:18:28.138 --> 00:18:33.659 Hello, that's coming from. Okay, so if I count this again. 143 00:18:33.659 --> 00:18:42.989 This is here basic program running something on the device. Okay so it added instructions. 144 00:18:42.989 --> 00:18:54.719 This this global thing, okay, this is called a kernel. And again, a kernel is invidious term for a parallel program running on the device that's on the GPU. 145 00:18:54.719 --> 00:19:03.298 You could, even if you want, you know, log in to start your VPN log into parallel and private, if you'd like. 146 00:19:03.298 --> 00:19:08.999 Or they ignored that this is what they tried to use. 147 00:19:08.999 --> 00:19:15.719 G, plus plus Ford is going to fail. Oh, and it has to have an extension. See you. So. 148 00:19:19.439 --> 00:19:26.969 What was the last thing here? You could play various games if you don't want to have the extension see, you, you could do other stuff. 149 00:19:26.969 --> 00:19:33.419 Okay, now and this is how we we compile that. 150 00:19:33.419 --> 00:19:36.419 And ran it. Oh, okay. 151 00:19:36.419 --> 00:19:40.138 What I just showed you. 152 00:19:40.138 --> 00:19:45.269 So. 
153 00:19:47.699 --> 00:19:53.759 You've got a program, but perhaps you're 1 of those people who don't write perfect programs. The 1st time. 154 00:19:53.759 --> 00:19:59.128 I'm 1 of them also, so you try to program it fails. 155 00:19:59.128 --> 00:20:03.358 Or worse than failing as it produces an answer that happens to be wrong. 156 00:20:03.358 --> 00:20:07.348 In a way, you're lucky if the programme fails. So there are some tools. 157 00:20:07.348 --> 00:20:14.249 To to help, you do it and. 158 00:20:14.249 --> 00:20:20.519 Ma'am check is a useful 1, which does validity checks on memory accesses. 159 00:20:20.519 --> 00:20:33.473 And so that's a nice way. It's like Val grind or something to to check array bounce. This does it in parallel some M, checks a useful program. Gdb could a gdb. 160 00:20:33.473 --> 00:20:43.523 Will you go in and step through the parallel programs that's also show you basic things and the site is a more powerful interactive B*** tool. 161 00:20:43.769 --> 00:20:49.949 I don't have it running at the moment the instructions in the slides that don't actually work. 162 00:20:49.949 --> 00:20:55.138 The silly thing is that they don't they don't work on the newest version of. 163 00:20:55.138 --> 00:20:58.199 There's some commercial things here. 164 00:20:58.199 --> 00:21:06.719 Which say they're good, there's 1 or 2 free things not from a video. 1 of the energy Labs says something. I mean, try to show you total view. 165 00:21:06.719 --> 00:21:12.538 Is said to be a very good B*** for parallel of programs. It's also very expensive. 166 00:21:12.538 --> 00:21:19.919 And I've been asking myself, am I willing to buy it or not? It's, it's incredibly expensive, but it claims that it's good. 167 00:21:19.919 --> 00:21:30.028 So our tools to help, because the bugging is hard debugging parallel programs remember they're not deterministic. Every time you run into anything a different answer. Probably. So. 168 00:21:31.378 --> 00:21:42.479 Now, there's various flags you can have also so that's the generals the interesting ones you may turn on some new buggy. What these do is they cause the execute to contain a table. 169 00:21:42.479 --> 00:21:46.979 Of of the variable names and addresses. 170 00:21:46.979 --> 00:21:50.278 So that if there's a problem at a certain address. 171 00:21:50.278 --> 00:21:54.989 It can tell you the name of the variable that caused it perhaps. 172 00:21:55.523 --> 00:22:07.614 And line numbers, that's also useful. You see what line of problem occurred on. Perhaps now these things here, they're not fully compatible with optimising, optimizing changes to code around so much. 173 00:22:08.034 --> 00:22:11.333 So, perhaps if you're doing your bugging, you do not turn on optimizing. 174 00:22:11.608 --> 00:22:22.169 And other things here, this is a way to send an escape clauses and flags through and vcc to the G. plus plus compiler any case there's flags that will turn on debugging. 175 00:22:22.169 --> 00:22:29.548 Is an argument that the minus Gee, you should have on all the time anyway. They make a larger they don't make it any slower. 176 00:22:30.628 --> 00:22:35.368 Line info. Okay. So. 177 00:22:37.469 --> 00:22:43.048 You can do mem check and I'll, I'll show you could've. mmhmm check on. 178 00:22:43.048 --> 00:22:57.538 In a little while with the next example. So we'll have an error in the checkbook. Well, maybe perhaps try to help us find any case it checks memory validity address. It's like a parallel version of Val grind. You might say. 
179 00:22:57.538 --> 00:23:03.659 And leaks and possibly someone like this sort of stuff. Okay. 180 00:23:03.659 --> 00:23:07.858 So, it looks like it's a useful program and it's available for them and video. 181 00:23:07.858 --> 00:23:12.358 So Here's an example, I'll go over here. 182 00:23:13.378 --> 00:23:18.719 Silence. 183 00:23:18.719 --> 00:23:22.199 So. 184 00:23:23.788 --> 00:23:34.858 We have a program here there's a part of debugging stuff. I'll walk you through the program in my own way and then I'll show you what the slides say about this. 185 00:23:34.858 --> 00:23:41.249 So, what we have here, the executive summary is a slightly larger. 186 00:23:42.419 --> 00:23:54.179 Could a program and what it will do is, it will just do it will initialize an array. So what we have here is. 187 00:23:54.179 --> 00:24:07.169 In your main program well, if we look at decided this isn't the be running on the so each thread will set the ice element of a, to I. so, each thread is doing like, a couple of instructions. 188 00:24:07.169 --> 00:24:11.818 Okay, so we have the main program here. 189 00:24:11.818 --> 00:24:17.638 And we're going to make the vector a 4097 long. 190 00:24:17.638 --> 00:24:21.479 And will do a later to do that. 191 00:24:21.479 --> 00:24:35.189 And when we're going to do in parallel again, remind you the hierarchy, the threads go into blocks. So you've got are also called thread blocks. So the kernel has a number of blocks and each block has a number of threads. 192 00:24:35.189 --> 00:24:38.608 Though I'm ignoring the work part right now. 193 00:24:38.608 --> 00:24:43.709 So the 128 thread, so it would before works for 32. I'm ignoring that part right now. 194 00:24:43.709 --> 00:24:51.749 So, in any case, a block can have up to a 1024 threads. Typically, that depends on the architecture. 195 00:24:51.749 --> 00:24:56.608 But you don't have to have a 1024 threads in the block. 196 00:24:56.608 --> 00:25:01.888 What we're going to have here is we're going to have only 128 threads in the block. 197 00:25:01.888 --> 00:25:06.028 Now, okay, you might ask yourself why would we not. 198 00:25:06.028 --> 00:25:14.638 Have as many threads in the block as you're allowed to the answer is that each block has a certain limited amount of resources. 199 00:25:14.638 --> 00:25:19.078 Some a fixed amount of shared memory, for example. 200 00:25:19.078 --> 00:25:24.868 And and the. 201 00:25:24.868 --> 00:25:28.528 And the fixed space for registers and. 202 00:25:28.528 --> 00:25:36.808 So the blocks allocation, a shared memory register of registers in particular is shared by all of the threads and the block. 203 00:25:36.808 --> 00:25:47.038 So, if you've got more threads in the block, each thread has a smaller proportion of the shared resources and that's going to slow down the thread. Maybe. 204 00:25:47.038 --> 00:25:50.098 And so if you have more threads and the block. 205 00:25:51.148 --> 00:25:57.868 Each threat, if he's thread is constrained enough and may run so slow that even though this more threads, the whole block run slower. 206 00:25:57.868 --> 00:26:04.828 So, it may be better to have fewer threads per block, but each thread gets more resources. 207 00:26:04.828 --> 00:26:08.759 I'll give you an example on the blue Jean. 208 00:26:08.759 --> 00:26:15.148 If I recall these numbers are off the back of my head, but there were each. 209 00:26:15.148 --> 00:26:24.209 Processor had 4 JG of memory, but there were 2 processes would be paired together. Actually. 
So they'd past access to each other's resources. 210 00:26:24.209 --> 00:26:37.199 And a useful programming paradigm on the blue jeans would be to use only half of the processors when each processor would then get access to each neighbor's memory also. So it'd be half as many processors. 211 00:26:37.199 --> 00:26:46.378 Running on the blue Jean, but each processor would get twice as much memory gigabytes instead of 4. and that would actually make your job run faster. 212 00:26:46.378 --> 00:26:52.709 So, same idea here, we're going to specify that each block. It's only 128 threads. 213 00:26:52.709 --> 00:27:00.749 Okay, so how many blocks that we have got to have a total of 4097 threads because 1 thread for each element of the array. 214 00:27:00.749 --> 00:27:06.358 So, how many blocks do we have? Well, if we divide the number of threads by the. 215 00:27:06.358 --> 00:27:13.679 Number of total number of threads by the number for block, we're going to get a fraction and this is a programming paradigm here to round up. 216 00:27:13.679 --> 00:27:27.239 So, this will take 4097 divided by 128 rounded up ceiling to the next manager. That's the way to do that. I think many of you have seen this paradigm if you have, if you haven't this is the way. 217 00:27:27.239 --> 00:27:31.199 You do an inter division round it up to the next manager. 218 00:27:31.199 --> 00:27:38.729 Okay, so we're going to have this many threads for block and this many blocks. 219 00:27:38.729 --> 00:27:45.358 There's also limits in how many blocks the colonel can run, but we're under that. 220 00:27:47.608 --> 00:27:56.189 Okay, so now here we allocate the array and so managed it's going to be a managed array. 221 00:27:56.189 --> 00:28:01.348 So this means that it will be available both on the host and on the device. 222 00:28:01.348 --> 00:28:05.159 Very nice. You do not have to do explicit. 223 00:28:05.159 --> 00:28:12.659 Copies back and forth. The trade off is that if you did explicit copies, you might be more efficient. 224 00:28:12.659 --> 00:28:16.048 If you knew exactly the data access pattern, but. 225 00:28:16.048 --> 00:28:19.888 Generally you prefer to optimize your time. 226 00:28:19.888 --> 00:28:24.058 Even if your program run slower, and in fact, it probably will not run slower. 227 00:28:24.058 --> 00:28:35.159 You know, the software is getting good. So any case could a Malik manage to allocate an array that's accessible both on the host and on the device, it will get paged back and forth as needed. 228 00:28:35.159 --> 00:28:44.189 That also means, maybe we don't access random bites on ultimately on both. Okay so it takes the address or return the point here. 229 00:28:44.189 --> 00:28:56.784 This is how many bites to access now the reason we do size of okay you all know a size of an end is for, but that's not guaranteed. Of course, there's nothing in the C. C. 230 00:28:56.784 --> 00:29:01.163 plus plus language definitions that prescribe the size of the data. That's deliberate. 231 00:29:01.409 --> 00:29:12.358 But it's probably 4, when I was a grad student, the machine that we used, which was a very popular machine at 36 bit words and. 232 00:29:12.358 --> 00:29:21.358 So, and that was the unit of address thing. So, an integer with there was 1 word or 36 beds. So, size up would have been 1 in that case. 233 00:29:21.358 --> 00:29:27.209 Characters characters were 7 bits and a 36 words stored. 234 00:29:27.209 --> 00:29:31.618 5, 7 bit characters is 1 pet leftover. 
235 00:29:31.618 --> 00:29:36.959 It was a beautiful architecture. It's 1 of the best architectures, ever designed people said. 236 00:29:37.374 --> 00:29:43.794 But the trouble is, it had 18 that addresses to the 18th, 36 words, which was fine when it was invented. 237 00:29:43.794 --> 00:29:55.644 But then when when you got more memory, the 18 pointers were not enough, and the architecture was so thoroughly intricately designed that there was actually, no good. Way to expand the size of the pointers. 238 00:29:56.489 --> 00:30:01.138 Or you always do something with some sort of hack like Intel used to do with. 239 00:30:01.138 --> 00:30:06.388 Segments and so on back, but that destroyed the beauty of the system and. 240 00:30:06.388 --> 00:30:09.898 Okay, any case so this allocates. 241 00:30:09.898 --> 00:30:14.578 Is now a pointer to an array of 4097. 242 00:30:14.578 --> 00:30:19.858 This calls the kernel, the parallel program on the device. 243 00:30:19.858 --> 00:30:25.348 That's its name, colonel, it's up here. 244 00:30:25.348 --> 00:30:30.419 Kernel, that's the name of the device routine. 245 00:30:31.979 --> 00:30:35.909 This okay, triple angled brackets or less and that's extension. 246 00:30:35.909 --> 00:30:40.348 This is the number of blocks that are going to run in parallel. 247 00:30:40.348 --> 00:30:44.159 And this is the number of threads that are going to run in each block. 248 00:30:44.159 --> 00:30:52.138 In parallel, so this starts up. So colonel is going to run in parallel lots of times this many blocks times as many threads. 249 00:30:52.138 --> 00:30:55.828 A lot of threads running in parallel. 250 00:30:55.828 --> 00:31:00.509 And gives it here, it has 2 arguments. 251 00:31:00.509 --> 00:31:03.749 A, is the pointer to the array. 252 00:31:03.749 --> 00:31:15.058 And then is the size of the array so you can pass in integers and can pass and pointers. And the reason the pointer works is because it's a managed array. 253 00:31:16.288 --> 00:31:20.519 Otherwise he could not just go passing point is within host and device. 254 00:31:20.519 --> 00:31:24.358 Okay, so that runs now what. 255 00:31:24.358 --> 00:31:29.848 This syntax does, so it fires up a routine running on the device. 256 00:31:29.848 --> 00:31:34.618 It starts asynchronously and just runs and then. 257 00:31:34.618 --> 00:31:42.028 But it returns immediately to the host program. So this line here immediately returns to the host program. 258 00:31:43.288 --> 00:31:46.769 While the device program is still running. 259 00:31:47.848 --> 00:31:52.259 If you want to wait for the device to finish, you'd have to do a synchronize. 260 00:31:54.388 --> 00:32:02.429 So, you fire up the a synchronous device program, then wait till it done what you're doing other stuff here of course. 261 00:32:02.429 --> 00:32:06.479 There's also facility here you could even. 262 00:32:06.479 --> 00:32:12.209 Fire up several parallel device programs and parallel with each other. 263 00:32:12.209 --> 00:32:17.308 Like, you have a big new device, this. 264 00:32:17.308 --> 00:32:22.348 Little program is not going to use the whole device. So what you could do is. 265 00:32:22.348 --> 00:32:27.719 You could fire up, you could start other colonels. You could start a number several kernels dawson's. I think. 266 00:32:27.719 --> 00:32:38.489 And they're all running in parallel with each other and in parallel with the host. 
So, at the end, when you start them all up and do your host worked and you'd have to start synchronizing and waiting for everything. 267 00:32:38.489 --> 00:32:43.709 Okay, I haven't showed you the device. 268 00:32:43.709 --> 00:32:46.769 Code in detail yet, but let me finish with the host code. 269 00:32:46.769 --> 00:32:51.148 So, this, this prints the 1st, 10 items to see it. 270 00:32:51.148 --> 00:32:55.709 There what you expect them to be. 271 00:32:56.909 --> 00:33:03.989 This freeze the data again, the program terminate stuff gets freed anyway, but you want to be explicit. 272 00:33:06.209 --> 00:33:11.189 And you could also check if there's errors. Now, what this does is. 273 00:33:11.189 --> 00:33:14.969 Since your device programs are running asynchronously. 274 00:33:14.969 --> 00:33:19.528 If various error conditions occur, what they do is they set a narrow flag. 275 00:33:19.528 --> 00:33:29.939 Which sits around until you check it, so you can check for errors if you wish, or if you don't wish, you know, why is up for bad news I guess you could not check. 276 00:33:29.939 --> 00:33:35.909 And then this is another pointless statement at the end because, of course, if you fall off the end of the block. 277 00:33:35.909 --> 00:33:47.608 You just did a return so the free I think is pointless. And the returns 0T I think is pointless, but that's just me. I kind of want lines of code. So, this was your host program again. 278 00:33:47.608 --> 00:33:58.229 Which allocates vector it sets how many blocks are going to need how many threads block allocates the data. 279 00:33:58.229 --> 00:34:04.378 Calls the device program in parallel synchronizes waits for the output and print some stuff. 280 00:34:04.378 --> 00:34:09.478 Up here here is the kernel program. 281 00:34:10.619 --> 00:34:15.179 We know that it's a colonel program, because it's got this modifier here global. 282 00:34:15.179 --> 00:34:19.528 Was surrounded by underscores. Okay. 283 00:34:19.528 --> 00:34:25.438 It was a name colonel, um, if we can also let means good. 284 00:34:25.438 --> 00:34:29.969 Here is where it's called and here is it. 285 00:34:29.969 --> 00:34:33.748 Takes 2 arguments a pointer to the array. 286 00:34:33.748 --> 00:34:40.648 And the size of the array. Okay. And down here, it's going to be doing things. Okay. Now. 287 00:34:40.648 --> 00:34:45.389 So, we're firing up the. 288 00:34:45.389 --> 00:34:50.099 All of these threads in parallel. 289 00:34:51.298 --> 00:35:02.969 Each now these threads can identify themselves, they don't have processed numbers, but they've got block indexes and thread index. So a thread can. 290 00:35:02.969 --> 00:35:08.938 Tell which blockapps in and can tell which thread it is in the block. 291 00:35:08.938 --> 00:35:18.838 And it can also get environmental information, like the number of threads for block. So each thread can determine its, the relevant parts of its environment. 292 00:35:18.838 --> 00:35:25.798 And there are these read, only reserve variables, block index dot X here. 293 00:35:25.798 --> 00:35:32.548 Is the ID number of the block that this thread is running in 01234 whatever. 294 00:35:32.548 --> 00:35:38.278 This 4097 threads, 128 threats for block. 295 00:35:38.278 --> 00:35:44.668 That's approximately 33 blocks if I did the division. Right? So block index goes from. 296 00:35:44.668 --> 00:35:47.789 0T to 3233. 297 00:35:47.789 --> 00:35:51.509 Um, X. 
298 00:35:51.509 --> 00:35:56.998 Is a number of threads per block that's 1, 2008 because we defined it to be 128. 299 00:35:58.259 --> 00:36:02.878 So, and then thread index X. 300 00:36:03.958 --> 00:36:08.068 Is the ID number of the thread in that block? 301 00:36:08.068 --> 00:36:15.358 So we have 128 threads per block thread index dot X will go from 0T to 107. 302 00:36:16.528 --> 00:36:20.548 So each thread can determine which thread it is in the block. 303 00:36:20.548 --> 00:36:23.998 And a total number of threads per block. 304 00:36:23.998 --> 00:36:28.498 And which block it is, and the number total number of blocks there are. 305 00:36:28.498 --> 00:36:31.889 And is this calculation here? 306 00:36:31.889 --> 00:36:34.918 This will compute for us an index. 307 00:36:34.918 --> 00:36:43.918 For that thread, you might say in global parallel space, whatever you want to call it and this index will go from 0T to 4096. 308 00:36:43.918 --> 00:36:47.009 Because there are 4097 threats. 309 00:36:49.588 --> 00:36:55.108 And so now, if we do an a sub equals, I hear. 310 00:36:56.219 --> 00:37:03.059 Then this will actually access each element of a once. 311 00:37:03.059 --> 00:37:09.418 For the between all the different threads I will be from 0T up to 4096. 312 00:37:09.418 --> 00:37:16.588 Because we did this calculation here, so this is and this is a programming style that's. 313 00:37:16.588 --> 00:37:22.079 It's a reasonable programming paradigm here. You've got many threads and each thread does. 314 00:37:22.079 --> 00:37:30.418 Something very simple. I don't know that I necessarily have a thread to do something this simple, but you get the idea. 315 00:37:31.829 --> 00:37:37.289 Any questions with that now, they don't access here. 316 00:37:37.289 --> 00:37:45.148 Just do them what they all are is because thread index is actually a 3 factor not just a scaler. 317 00:37:45.148 --> 00:37:55.768 And block index and block them they're actually 3 vectors not scalers is this is the syntactic thing. You can consider the threads in a block to be a 3 dimensional. 318 00:37:55.768 --> 00:37:59.099 Array of threads and you consider all the blocks in the colonel. 319 00:37:59.099 --> 00:38:06.838 To be a 3 dimensional array of blocks some, there's some examples using that fact I think it's a pretty minor thing. 320 00:38:06.838 --> 00:38:14.670 I would not have put it in the design, but data. Okay. Didn't ask me obviously. Okay so this is an example of a nice, simple. 321 00:38:14.670 --> 00:38:19.320 Program to run on the device, that's the. 322 00:38:19.320 --> 00:38:31.559 Gpu called a colonel for some reason. A colonel is the name for a parallel. Well, this is just a variable name. It could be anything and wants it could be Providence. But the thing is that. 323 00:38:31.559 --> 00:38:35.610 Then kernel is the parallel program, right? Okay. 324 00:38:37.170 --> 00:38:42.539 So, what do we have up at the top here? Include files? 325 00:38:45.269 --> 00:38:54.929 This is this checks if there is an error. 326 00:38:54.929 --> 00:38:58.980 And it's a macro. 327 00:38:58.980 --> 00:39:03.900 I'm not completely certain White was written as a macro. 328 00:39:03.900 --> 00:39:15.059 Instead of as a function, but it was written as a macro and now the thing with macros and CC plus plus they're the same. 329 00:39:15.059 --> 00:39:18.869 Is they all have to be the body of the macro has to be on 1 line. 
330 00:39:18.869 --> 00:39:26.610 And so the way they put this on 1 line, but still keep it human readable as they had backslash, new lines separating everything here. 331 00:39:26.610 --> 00:39:30.480 So this is a nice programming style here. 332 00:39:30.480 --> 00:39:37.380 I like this style, what they did is they took the built in a routine it's called couldn't get last error. 333 00:39:37.380 --> 00:39:43.889 And they wrapped it around to make it easier to use. So, this is a nice programming style. 334 00:39:43.889 --> 00:39:47.010 So, could I get last error? Well. 335 00:39:48.210 --> 00:39:53.159 It will return the last error message that could've program generated. 336 00:39:53.159 --> 00:39:59.010 If there was 1 and return said in a variable of class could. 337 00:39:59.010 --> 00:40:08.280 Error and if you're lucky, the error code was could a success. 338 00:40:08.280 --> 00:40:12.360 We should be lucky if not. 339 00:40:12.360 --> 00:40:17.760 What it does is it's a return 8 years. It's a. 340 00:40:17.760 --> 00:40:20.940 Class is a structure is not something totally sample. 341 00:40:20.940 --> 00:40:30.449 And in particular, you can take a, and you can get a human readable version for the. 342 00:40:30.449 --> 00:40:33.960 Do this this is a human readable. 343 00:40:35.099 --> 00:40:39.329 Human readable description of what the error was. 344 00:40:40.349 --> 00:40:46.409 And, oh, this is why they made it a macro. I love this style. Not enough. People do this. 345 00:40:46.409 --> 00:40:54.239 Line is a reserved macro name and C. plus plus it's in the standard. 346 00:40:54.239 --> 00:40:57.960 Which is the line number that. 347 00:40:57.960 --> 00:41:02.099 The line is called from and this is the file name. 348 00:41:02.099 --> 00:41:09.179 So, what's going to happen here is if could at last error, returned an error code. 349 00:41:09.179 --> 00:41:14.940 Then this gets a human readable version of what the error was in print out a message saying, could a failure. 350 00:41:14.940 --> 00:41:19.920 Gets the file, gives the line and gets the human readable message in the next sets. 351 00:41:21.269 --> 00:41:32.159 So, this works, the line works only because this is at a macro because it gets expanded at the line, or could a checker was called if we made could a check error a function. 352 00:41:32.159 --> 00:41:40.829 Then line would be the line number of this line in the function not where the function was called from. 353 00:41:42.269 --> 00:41:50.579 So, I guess my only real complaint with this is that they're making the macro look like a function call. 354 00:41:50.579 --> 00:41:57.030 And I, that is a programming style to hide the difference between and macros and functions. 355 00:41:57.030 --> 00:42:00.570 I'm against that, because in fact, they are different. 356 00:42:00.570 --> 00:42:09.420 And if you try to do bug the program, you immediately see what the difference is they debugged completely differently. So I. 357 00:42:09.420 --> 00:42:13.320 I am opposed and pretending that this they're the same because That'll get you. 358 00:42:13.320 --> 00:42:19.710 They're not the same thing would be nice if they were the same, but they're not. Okay. 359 00:42:19.710 --> 00:42:27.690 So, what this example too, does it shows you a slightly bigger parallel program, which is. 360 00:42:27.690 --> 00:42:35.369 initializes a vector with each spread initializing a different element of their factor. We said how many threads there are per block and how many blocks. 
361 00:42:35.369 --> 00:42:38.820 And then we check whoops here. 362 00:42:39.900 --> 00:42:43.739 And then after it, we check to see if it worked correctly. 363 00:42:45.420 --> 00:42:49.019 And then we return now if I. 364 00:42:49.019 --> 00:42:53.820 Okay, let me hit you and see some other little points here. 365 00:42:53.820 --> 00:42:59.099 Okay, we look at this colonel here, this. 366 00:42:59.099 --> 00:43:08.039 Wellness program that runs on the device. We have a variable here so programs running on the device can have local variables. 367 00:43:08.039 --> 00:43:14.639 By default, it's in something called a parallel register. 368 00:43:14.639 --> 00:43:25.284 So, I register is very fast memory that's private to the thread. See different types of memory. They have different degrees of locality for where they're valid. 369 00:43:25.704 --> 00:43:31.434 The global memory is that is visible to every 1 running on a gpo, even different kernels. 370 00:43:31.679 --> 00:43:37.440 As the other extra, and there's a lot of it, like, gone 48 gigabytes on this GPU. 371 00:43:37.440 --> 00:43:41.820 The, on the other hand, we can a very. 372 00:43:41.820 --> 00:43:48.150 There's a store, which is private to each separate thread. Each separate thread has its separate. 373 00:43:48.150 --> 00:43:51.989 Private log for local variables they're called registers. 374 00:43:51.989 --> 00:43:55.320 So, their private to the thread and their fast. 375 00:43:55.320 --> 00:44:01.949 The problem is that there's only 255 of them per thread maximum. 376 00:44:01.949 --> 00:44:07.139 And it might be less because, you know, past memory is expensive and hardware. 377 00:44:07.139 --> 00:44:10.469 So, it's very limited and in fact. 378 00:44:10.469 --> 00:44:18.900 The whole block as only, I think 64 K registers for all the threads and the block. 379 00:44:20.039 --> 00:44:23.130 So, if there's 128 threads in the block. 380 00:44:23.130 --> 00:44:35.820 Then each thread can have only so many in 64 K divide. Well, that's more than 255. that's a max actually. So, the thing is, if we had a 1000 threads in the block, each thread could have only 64 registers, not 255. 381 00:44:35.820 --> 00:44:40.559 This is the 1 of the examples if you allow more threads and the block. 382 00:44:40.559 --> 00:44:46.769 Then each thread will have fewer resource, and particularly the fewer possible registers available to it. 383 00:44:46.769 --> 00:44:55.380 And register is nice, because they're fast. Now we'll see more of this later, but if it may happen at the thread means more. 384 00:44:55.380 --> 00:45:00.809 Private data, so if it needs more data, then it's got. 385 00:45:01.829 --> 00:45:11.550 Register wants and suppose the trade wants private data private to that thread not visible by the other threads. There's also a concept called vocal memory. 386 00:45:11.550 --> 00:45:14.699 Which is private to the thread and there's more of it. 387 00:45:14.699 --> 00:45:24.809 But it's very much slower. It's really incredibly slow. It's basically a private share part of the global memory, but a thread can have more data if it wants, but it's going to be slow. 388 00:45:24.809 --> 00:45:28.829 To just a programming style to try not to have to use that. 389 00:45:30.210 --> 00:45:34.380 Well, 1 more thing is, I actually found a bug. 390 00:45:34.380 --> 00:45:37.889 An. N. B. C. C. and an invidious compiler. 391 00:45:37.889 --> 00:45:48.750 Relating to the number of threads, register and stop for a number of registrations per threat. 
I found a corner case would screw into an infinite loop. 392 00:45:48.750 --> 00:45:53.039 Not the executer loop the compiler loop. 393 00:45:53.039 --> 00:45:56.579 So, I reported in a couple of. 394 00:45:56.579 --> 00:46:03.269 Bulletin boards, I never got a response, but the fairly really quickly a new version of came out that did not have that bug. 395 00:46:03.269 --> 00:46:08.190 Okay, so this thing shows you a simple. 396 00:46:08.190 --> 00:46:13.800 Kernel program and this new idea here this is how. 397 00:46:13.800 --> 00:46:18.030 The thread can distinguish it from the other threads, because it can determine. 398 00:46:18.030 --> 00:46:26.820 Its index in the block of threads, the number of threads in each block of threads, and the index of the block and the grid of all the blocks. 399 00:46:26.820 --> 00:46:31.559 And the final new thing we showed, you was how you can test if there was an error. 400 00:46:33.750 --> 00:46:38.429 Any questions about that so far. 401 00:46:38.429 --> 00:46:41.969 Ice and simple. Okay. 402 00:46:41.969 --> 00:46:46.650 This nice simple program has a major error in it. 403 00:46:48.329 --> 00:46:53.039 I mean, I'm going to go to the slides in a minute, but before I do. 404 00:46:53.039 --> 00:46:58.469 Is anyone. 405 00:46:58.469 --> 00:47:05.010 Have an idea what this nice simple program writes on Initialized memory. 406 00:47:05.010 --> 00:47:08.219 It has a memory bounds violation. 407 00:47:09.420 --> 00:47:15.809 Can anyone see where it is? 408 00:47:17.730 --> 00:47:22.320 I'll let you think about that. Well, let me go back to the slide is now. 409 00:47:22.320 --> 00:47:25.349 Okay. 410 00:47:28.349 --> 00:47:35.460 Well, you can do a good amount. We can run the thing. Excuse me? 411 00:47:35.460 --> 00:47:39.119 Um, and make a. 412 00:47:40.559 --> 00:47:44.849 Excuse me. 413 00:47:48.389 --> 00:47:57.780 1st tenant almost look correct. The program's giving the output that we expected. 414 00:47:57.780 --> 00:48:00.960 But if we do. 415 00:48:00.960 --> 00:48:09.869 Well, before I run could amend check. Let me look at the program again and try to explain to you what happens then we'll look. Excuse me. 416 00:48:11.400 --> 00:48:20.130 This 1, Alias 2 more tonight. Okay. Here's the problem. 417 00:48:21.989 --> 00:48:26.369 What is happening down here? 418 00:48:26.369 --> 00:48:31.769 Is that 4097 thread? 419 00:48:31.769 --> 00:48:34.980 128 threads for block. 420 00:48:34.980 --> 00:48:39.239 So, what we have here, if I did my math, right? 421 00:48:39.239 --> 00:48:42.389 Is, um, we have 32 blocks. 422 00:48:42.389 --> 00:48:54.510 With 128 threads each and 1 block with 1 thread in it. The 33rd block is only 1 thread and it because 4097 is I didn't even multiple at 128. 423 00:48:55.800 --> 00:49:01.019 So, but what this is going to do here. 424 00:49:01.019 --> 00:49:06.360 Is it's going to call in parallel 33 blocks? 425 00:49:06.360 --> 00:49:12.719 And each block will execute a 128 thread. So this thing gets run. 426 00:49:12.719 --> 00:49:16.320 On 33 blocks of 128 threads. 427 00:49:16.320 --> 00:49:23.039 And goes accessing that, but now here's your problem. 428 00:49:23.039 --> 00:49:30.719 You see a goes out of bounds because the 33rd block, all 128 threads. 429 00:49:30.719 --> 00:49:35.940 That's actually going to be going up to somewhere like 4200 or so. 430 00:49:35.940 --> 00:49:43.050 Because this thing is going to do every boss is going to run in parallel on 30. 
431 00:49:43.050 --> 00:49:47.760 On 128 threads, but we don't, but the last block. 432 00:49:47.760 --> 00:49:51.539 We want the last block to run only once not. 433 00:49:51.539 --> 00:49:56.010 1 thread, because there's only 1 element of a. 434 00:49:56.010 --> 00:50:00.449 For the last block to process, because we. 435 00:50:00.449 --> 00:50:04.230 Deliberately constructed the number of threads to be. 436 00:50:04.230 --> 00:50:13.110 Not a multiple number of total number of threads to be. Not a multiple number of threads. Walk is the problem. There's, there's a fraction of a block at the end. 437 00:50:13.110 --> 00:50:16.199 The last block has only. 438 00:50:16.199 --> 00:50:21.210 1 element of a that we need to be Initialized not all 128. 439 00:50:21.210 --> 00:50:29.400 So the last blog is going to have this thing wrong on 127 elements of a, that are not part of a. 440 00:50:29.400 --> 00:50:32.760 You see, this is going to run. 441 00:50:32.760 --> 00:50:36.599 A 127 words off the end of a. 442 00:50:38.070 --> 00:50:41.730 You see the problem here and. 443 00:50:41.730 --> 00:50:46.139 So, we've locked a here to be 4097 long. 444 00:50:46.139 --> 00:50:53.789 But this is going to run, this is going to be accessing elements that are past the end of a, and it's going to be writing them. It's going to be stomping on them. 445 00:50:53.789 --> 00:50:58.019 So, as it turns out here, nothing happened. 446 00:50:59.309 --> 00:51:04.800 Whether that's lucky or unlucky. I'll leave that to you. 447 00:51:04.800 --> 00:51:10.739 We might be more lucky if it crashed, but in any case, this program has a major. 448 00:51:10.739 --> 00:51:21.269 Writing on it, if it's a bad point or it's not writing on initialize memory, you don't care if it's a neutralize. If you're right it's writing memory. That hasn't been. It's not yours. 449 00:51:21.269 --> 00:51:24.630 It's someone else's memory, so if there was another. 450 00:51:24.630 --> 00:51:27.690 Kernel unrelated colonel running on a GPU. 451 00:51:27.690 --> 00:51:34.559 That memory might belong to it so doing this might cause someone else's parallel program to crash. 452 00:51:34.559 --> 00:51:39.119 If the GPU doesn't have thorough on. 453 00:51:39.119 --> 00:51:46.829 You know, bounds protection and stuff like that and it has some stuff, but I don't know that I'd bet the farm on. It's working. 454 00:51:46.829 --> 00:51:52.829 In any case, so we can do CUDA. 455 00:51:52.829 --> 00:51:58.949 Ma'am check on a dot out. 456 00:51:58.949 --> 00:52:05.010 And so what happened here, it's writing due to due to normal. 457 00:52:06.840 --> 00:52:10.110 And we got a problem here. 458 00:52:10.110 --> 00:52:13.920 So, it writes up as far as 9 and then. 459 00:52:16.769 --> 00:52:22.980 Well, things are happening in parallel that's not totally meaningful, but you see, it has it's really complaining here. 460 00:52:22.980 --> 00:52:27.030 So, it caught the. 461 00:52:27.030 --> 00:52:32.159 Why it is only 35 errors. I don't know, but it caught the fact that it was writing. 462 00:52:32.159 --> 00:52:38.460 It was doing bounce checks here, so, and. 463 00:52:38.460 --> 00:52:41.760 Maybe I want to. 464 00:52:41.760 --> 00:52:47.550 If I put some more information here, I got a hope of getting a line number. Perhaps. 465 00:52:47.550 --> 00:52:53.909 Let's try that. 466 00:52:53.909 --> 00:52:57.840 Silence. 467 00:53:07.500 --> 00:53:11.550 Silence. 468 00:53:12.900 --> 00:53:19.469 And now if we do this, hey, gave us line numbers. Okay. So. 
469 00:53:19.469 --> 00:53:23.489 Line 33 there was some sort of problem. 470 00:53:25.050 --> 00:53:29.190 Okay. 471 00:53:30.269 --> 00:53:34.769 And and. 472 00:53:37.199 --> 00:53:41.039 And well, that was not very helpful line. 33. 473 00:53:41.039 --> 00:53:44.309 Was it complained about the. 474 00:53:46.230 --> 00:53:52.679 And especially is that is not my childhood complained about find 33, which is already check for errors, but. 475 00:53:52.679 --> 00:53:59.489 Okay, so that's the point at which it caught the error was and couldn't get last error. 476 00:53:59.489 --> 00:54:04.679 Which wasn't the most help frankly, but at least it told us that there was an error. So. 477 00:54:04.679 --> 00:54:13.980 Okay, so it doesn't tell you where the legal right was in this case but at least it tells you there was an illegal right? 478 00:54:15.000 --> 00:54:21.989 We could go in was could a gdb and. 479 00:54:23.670 --> 00:54:30.449 Have fun looking at that also. Let me show you more of that later, but it. 480 00:54:33.059 --> 00:54:39.989 And there's no debugging symbols and so on, because I didn't compile it with. 481 00:54:44.130 --> 00:54:49.050 Okay, I do. 482 00:54:52.530 --> 00:54:59.010 Okay, and it's negative B*** thing, but you can. 483 00:54:59.010 --> 00:55:03.420 You can put a break point in it and stuff like that and whatever. So. 484 00:55:03.420 --> 00:55:07.289 Have fun. 485 00:55:08.369 --> 00:55:15.539 No problem and, um. 486 00:55:16.619 --> 00:55:21.570 Stuff like that. 487 00:55:21.570 --> 00:55:25.530 So. 488 00:55:25.530 --> 00:55:28.619 Have fun here. 489 00:55:28.619 --> 00:55:33.929 Oh, okay. Very basic. You can also look at the threads in some sense. 490 00:55:33.929 --> 00:55:38.460 I hear the showings and commands here. 491 00:55:38.460 --> 00:55:43.289 We could break inside the kernel actually. 492 00:55:43.289 --> 00:55:47.369 And look at the thread number. 493 00:55:47.369 --> 00:55:53.940 Silence. 494 00:55:57.210 --> 00:56:06.239 Yeah, sorry to 0T and so on. So you can look inside here. 495 00:56:06.239 --> 00:56:11.460 Have fun. 496 00:56:11.460 --> 00:56:16.530 Yeah, there's some extensions here. 497 00:56:24.630 --> 00:56:29.190 Yeah, you can see where you are, and you can, um. 498 00:56:31.739 --> 00:56:36.869 Go to another thread and print stuff here, so you gotta have fun right here. So. 499 00:56:38.699 --> 00:56:41.820 We could set RAM check on actually. 500 00:56:41.820 --> 00:56:48.420 Kill all the break points. 501 00:56:48.420 --> 00:56:55.559 And to do, and then we got a place. 502 00:56:55.559 --> 00:56:58.769 Or to execute and whatever. 503 00:56:58.769 --> 00:57:06.150 So, obviously there's going to be some issues debugging in parallel, but at least there are some tools that will help you. 504 00:57:08.250 --> 00:57:15.570 And so there's towel, I believe is free, which is also another powerful 1. 505 00:57:15.570 --> 00:57:20.309 Some other things and from. 506 00:57:22.860 --> 00:57:26.340 From NVIDIA. 507 00:57:27.599 --> 00:57:32.849 Silence. 508 00:57:35.099 --> 00:57:45.030 So, it will give us ideas about how fast some stuff is and so on. 509 00:57:46.530 --> 00:57:51.510 Okay, I don't know how much it affects the program, but. 510 00:57:52.860 --> 00:58:04.914 And if I'm looking at, that doesn't work on newer versions of, could I think so very vampire trace I don't know if it's free or not. 511 00:58:05.184 --> 00:58:08.724 Any case you get non trivial sized programs, you can. 512 00:58:09.030 --> 00:58:13.650 Debugging tools yeah. 
513 00:58:15.445 --> 00:58:28.554 And then you don't have to recompile this to run profile. I don't know how it's implemented, but it's possible. There are hardware parks, for example, on ziam's there are hardware hooks inside to see on cpu's. 514 00:58:28.644 --> 00:58:31.855 You can use them for profiling and debugging. 515 00:58:32.130 --> 00:58:43.199 They're not used, but people don't use them directly because it's too hard to use, I think, but I hi. Powerful. Powerful profiler can use these hardware hooks that entails built into the. 516 00:58:43.199 --> 00:58:46.349 They're quite nice. 517 00:58:47.579 --> 00:58:55.679 This is an overhead they're, they're implemented a low level in the hardware. You've got softer interruptions and overhead on these hard visa level hardware things. 518 00:58:55.679 --> 00:59:01.050 At the overhead any case, you can look and see the time. 519 00:59:02.309 --> 00:59:06.239 So, the chat with the time is spent in. 520 00:59:06.239 --> 00:59:09.449 At seeing the colonel duck. 521 00:59:10.559 --> 00:59:13.559 You know, it's 200 microseconds. 522 00:59:13.559 --> 00:59:18.840 Managed took a lot of time, 200 milliseconds. 523 00:59:20.550 --> 00:59:26.969 And everything else is fairly, you know. 524 00:59:26.969 --> 00:59:30.989 And significant and so on. 525 00:59:32.010 --> 00:59:35.909 Down the Matlock took time. Yeah. Well. 526 00:59:35.909 --> 00:59:39.389 Alex, do, I guess, and the whole thing. 527 00:59:39.389 --> 00:59:45.000 And you get down here running on the device to host time. 528 00:59:45.000 --> 00:59:51.659 Um. 529 00:59:53.639 --> 01:00:02.159 8, 8, micro seconds, it's not very bad. There's no host device because the erase be Initialized on the device. 530 01:00:02.159 --> 01:00:05.340 And then read back onto the host, so. 531 01:00:05.340 --> 01:00:08.340 And, okay. 532 01:00:10.380 --> 01:00:14.280 And some page faulting. 533 01:00:14.280 --> 01:00:19.139 That's how it wrote it back to the device, I guess, with page faulting and its. 534 01:00:19.139 --> 01:00:23.730 All horribly passed, but this is a very simple program also. 535 01:00:23.730 --> 01:00:29.969 Okay, for off this is an introduction is also more powerful things. 536 01:00:31.199 --> 01:00:37.320 Okay, now there is a problem here if I try to run this. 537 01:00:38.969 --> 01:00:43.349 Silence. 538 01:00:43.349 --> 01:00:48.329 Silence. 539 01:00:48.329 --> 01:00:51.840 Silence. 540 01:00:51.840 --> 01:00:57.719 I've got a complaint. 541 01:00:57.719 --> 01:01:00.840 Too. 542 01:01:03.659 --> 01:01:07.949 Right. 543 01:01:07.949 --> 01:01:13.860 The gpo is too. No. Okay. So you have to switch to another tool for that. 544 01:01:13.860 --> 01:01:21.719 Palmer okay. The compute capability here. 545 01:01:21.719 --> 01:01:31.769 This they keep implementing this number of each new version of the, and each new number means there are new facilities available on the GPU. So. 546 01:01:31.769 --> 01:01:36.869 And you can get the current version with device. 547 01:01:36.869 --> 01:01:40.920 Where are you. 548 01:01:42.900 --> 01:01:50.280 Yeah capabilities or there's a driver version that's separate from the capability. Very capability version here at 7.5. 549 01:01:50.280 --> 01:01:54.210 We also see relevant stuff here. There is. 550 01:01:54.210 --> 01:01:59.969 3000 cars available, if you want to use more than that, it will just cue stuff up. 551 01:01:59.969 --> 01:02:04.920 Sorry, this is only 16 bytes of memory. 
552 01:02:09.210 --> 01:02:18.929 Oh, I'm sorry, this is my laptop that I'm showing you. I'm not showing you parallel. Of course, parallel here is 48 gigabytes of memory. Let me run this parallel. 553 01:02:20.760 --> 01:02:24.420 Silence. 554 01:02:25.440 --> 01:02:29.460 Good. 555 01:02:32.340 --> 01:02:35.460 Yeah, 48 gigabytes of global memory. 556 01:02:35.460 --> 01:02:39.329 4600 quarter cores. 557 01:02:40.739 --> 01:02:45.630 Shared memory per block. This is the fast memory that's, um. 558 01:02:45.630 --> 01:02:49.289 Available. 559 01:02:49.289 --> 01:03:00.420 You see, it's not that big 65000 registers for block divided by all of the threads in the block registers 4 bites. So. 560 01:03:00.420 --> 01:03:05.699 So, a thread block. 561 01:03:05.699 --> 01:03:09.840 Well, could have this many threads blocked total I have to multiply those numbers out. 562 01:03:09.840 --> 01:03:16.980 And this should be the number of blocks in the grid, multiply the mouth. These numbers are touch bigger than on my laptop. 563 01:03:18.840 --> 01:03:26.039 So, and then there's a 2nd GPU on parallel you can use also, which is a much older capability number and. 564 01:03:26.039 --> 01:03:31.559 Much less memory and so on. Yeah. 565 01:03:33.449 --> 01:03:37.469 Okay, any case you're going to be profiling. 566 01:03:38.579 --> 01:03:45.630 The visual profiler again, I'll show it to you more later. I'm having trouble getting it working. 567 01:03:48.210 --> 01:03:52.650 Again, we were having trouble with so I'll ignore that. 568 01:03:53.699 --> 01:04:00.210 But it ways for a larger program to see what's happening inside it. 569 01:04:03.719 --> 01:04:12.179 Another debugging tool that I may show you more later. The executive summary is at NVIDIA tries to help you to bug programs. 570 01:04:15.059 --> 01:04:18.869 Your items inside eclipse on the Linux. 571 01:04:18.869 --> 01:04:21.869 Or Visual Studio, so. 572 01:04:21.869 --> 01:04:29.070 Skipping over this executive summary. Profiling tools are available. So. 573 01:04:29.070 --> 01:04:32.369 Okay. 574 01:04:32.369 --> 01:04:39.719 Find hot spots in parallel and the hotspots may not be what you expect. You're all told all about that. 575 01:04:39.719 --> 01:04:42.780 Nothing interesting here. 576 01:04:45.150 --> 01:04:48.300 Nothing interesting here. 577 01:04:49.800 --> 01:04:56.010 Interesting point here is, as the profiler finds a problem. 578 01:04:56.010 --> 01:05:01.199 Maybe the, maybe the profiler caused a problem. 579 01:05:01.199 --> 01:05:07.380 Profiling, you know, it adds overhead and it's not, it's not going into this whole level. 580 01:05:08.670 --> 01:05:15.389 You know, hooks into the system. Okay. 581 01:05:15.389 --> 01:05:19.139 Introduction to some initial code of stuff and the B**. 582 01:05:30.239 --> 01:05:33.239 Silence. 583 01:05:35.760 --> 01:05:39.210 And God here. 584 01:05:39.210 --> 01:05:43.619 Okay. 585 01:05:43.619 --> 01:05:50.789 So, we're going to see a kernel here that adds to erase Hector. 586 01:05:50.789 --> 01:05:54.179 Element by element, it's. 587 01:05:54.179 --> 01:05:58.409 Separate thread for each element. 588 01:05:58.409 --> 01:06:02.519 This is your host program that would do it. 589 01:06:02.519 --> 01:06:06.570 You know, you could. 590 01:06:06.570 --> 01:06:10.139 Something very simple it just, you know. 591 01:06:10.139 --> 01:06:13.710 To the 2 source to raise a and B and. 592 01:06:13.710 --> 01:06:21.570 Great to see some sample. This is a style in this could of programming. 
593 01:06:21.570 --> 01:06:32.880 There is a naming style in CUDA programming: you may identify where your array lives by prefixing the variable with h_ or d_, for host or device. With managed data that's obsolete, I would say, but 594 01:06:34.769 --> 01:06:38.610 in any case. So what's going to happen here is that 595 01:06:38.610 --> 01:06:46.500 the CPU starts things up and transfers data to the GPU, the GPU does something, and it transfers back to the CPU to finish up. 596 01:06:46.500 --> 01:06:53.309 So part 1, here, we're allocating data on the host, 597 01:06:53.309 --> 01:06:58.260 copying it to the device — with managed memory that step is obsolete — and then the device does the addition. 598 01:06:58.260 --> 01:07:05.909 And part 3, we copy it back, which is now also automatic. In any case. Okay. 599 01:07:05.909 --> 01:07:14.190 Now, this figure has a lot of substance, and I showed it to you last time with 600 01:07:14.190 --> 01:07:17.579 an earlier slide set. I'm going to go through it again because, 601 01:07:17.579 --> 01:07:21.000 like I said, there's a lot of stuff in this. Let me 602 01:07:21.000 --> 01:07:27.360 enlarge it as much as possible, and even larger. 603 01:07:29.760 --> 01:07:33.030 And larger. 604 01:07:35.280 --> 01:07:39.119 Cool. Okay. 605 01:07:39.119 --> 01:07:43.590 You can see it now: host off to the left. 606 01:07:43.590 --> 01:07:49.679 This whole big thing on the right, that's the device. The device is the GPU. 607 01:07:49.679 --> 01:07:57.179 A program running on the device is called a kernel sometimes and called a grid at other times. 608 01:07:57.179 --> 01:08:04.500 Don't ask me why; that's the point I had on my last slide set that said the terminology is inconsistent, 609 01:08:04.500 --> 01:08:10.500 even inside NVIDIA. The device is the GPU. 610 01:08:10.500 --> 01:08:16.859 Oh, by the way, you know, on your host account, we've got two GPUs on parallel. 611 01:08:18.270 --> 01:08:21.840 Apparently, you could be using them both in parallel. If you wanted to, you could. 612 01:08:21.840 --> 01:08:26.729 You could tell your kernels: this kernel runs on this GPU, that kernel runs on the other one. That's fine. 613 01:08:26.729 --> 01:08:31.020 Assuming the power supply for parallel can handle it, but 614 01:08:31.020 --> 01:08:39.149 I think it can. They use power when they're running, a couple of hundred watts actually. Okay. The device: 615 01:08:39.149 --> 01:08:45.569 it's called a grid sometimes, because there's a grid of blocks of threads. 616 01:08:46.979 --> 01:08:53.310 But first, inside the device, we have global memory here. 617 01:08:53.310 --> 01:08:58.350 On that nice fast GPU on parallel, the global memory is 48 gigabytes. 618 01:08:58.350 --> 01:09:03.960 Okay, it's accessible by everyone on the device. 619 01:09:03.960 --> 01:09:08.460 But it's slow; it's got latency to start accessing it. 620 01:09:08.460 --> 01:09:14.850 Okay, so the grid has blocks of threads, and then 621 01:09:14.850 --> 01:09:19.409 each block here is a big orange rectangle. 622 01:09:19.409 --> 01:09:24.630 This figure shows two blocks; they're called thread blocks, it's synonymous. 623 01:09:24.630 --> 01:09:27.659 So, there's two thread blocks here. 624 01:09:27.659 --> 01:09:32.489 And they are indexed, in this case in a 2-D, or it could be a 3-D, array, right? 625 01:09:32.489 --> 01:09:42.960 Block (0,0) and block (0,1). Each block is a block of threads; the threads are the little green rectangles. 626 01:09:42.960 --> 01:09:47.609 So block (0,0) has thread (0,0), thread (0,1), and so on. 627 01:09:48.085 --> 01:09:56.875 And block (0,1) has the same.
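(A minimal sketch of how a kernel turns the block index and thread index from that figure into one unique position; the kernel name, the image array, and the 2-D shapes here are all just illustrative.)

__global__ void touch2d(int width, int height, float *img) {
    // Column: which block along x, times the block width, plus our x inside the block.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Row: the same thing along y.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        img[row * width + col] = 0.0f;   // each thread handles one element
}

// Launch with 2-D blocks and a 2-D grid, rounded up so the whole array is covered:
//     dim3 block(16, 16);
//     dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
//     touch2d<<<grid, block>>>(width, height, d_img);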
So, thread (0,0) doesn't tell you which thread it is overall; it only tells you which thread it is relative to the block. 628 01:09:57.324 --> 01:10:03.085 So, to completely identify a thread, you need to know what block you're in and what thread you are in that block. 629 01:10:03.300 --> 01:10:07.949 Okay, so we've got threads, and the threads could be a 3-D array of threads. 630 01:10:07.949 --> 01:10:11.729 So, there's green threads inside of, 631 01:10:11.729 --> 01:10:15.000 um, yellow, I guess, blocks, inside of the blue 632 01:10:15.000 --> 01:10:24.239 grid. Memory is red: we have the global, the big global memory. We also have the registers, which are small, fast memory. 633 01:10:24.239 --> 01:10:27.689 And each thread has its own set of registers. 634 01:10:27.689 --> 01:10:39.090 Each thread could have as many as 255 registers, but it could have fewer, because all the registers in the block are allocated from a pool of registers for the whole block. That's like 635 01:10:39.090 --> 01:10:42.210 64,000 registers in the block. 636 01:10:42.210 --> 01:10:51.539 So this is part of the hierarchy here. Already here we see types of memory: the registers in the thread, and the global memory for the whole grid. 637 01:10:51.539 --> 01:10:55.199 And what's not shown here: there's host memory. 638 01:10:56.670 --> 01:11:01.350 That's three types of memory. There's another two or three types that are not in this figure. 639 01:11:01.350 --> 01:11:04.590 Zoom down to a reasonable size. 640 01:11:04.590 --> 01:11:08.909 Okay, they talk about it here: device code. 641 01:11:08.909 --> 01:11:12.930 So the registers, they're read-write per thread. 642 01:11:12.930 --> 01:11:16.829 The global memory is readable by everyone, and then you can transfer data 643 01:11:16.829 --> 01:11:23.819 between the host and the global memory. And that's just your introduction. 644 01:11:23.819 --> 01:11:29.670 It gets fun later, if you have a sadistic idea of fun. Okay. 645 01:11:29.670 --> 01:11:34.260 Um. 646 01:11:34.260 --> 01:11:42.659 This transfer here is handled automatically by the memory manager, should you use the memory manager, which you should use. 647 01:11:42.659 --> 01:11:46.680 Virtual memory — think of it as virtual 648 01:11:46.680 --> 01:11:54.060 between the host and the GPU. It's not virtual between the host and the backing store on the host; that's a different type of virtual memory. Okay. 649 01:11:56.880 --> 01:12:02.579 cudaMalloc mallocs space here in the global memory. 650 01:12:02.579 --> 01:12:05.789 And you give it the size, and it returns the address. 651 01:12:07.260 --> 01:12:10.439 cudaFree does the reverse. 652 01:12:10.439 --> 01:12:16.260 Maybe you don't need to do that. cudaMemcpy: 653 01:12:18.449 --> 01:12:21.689 it copies data back and forth for you. 654 01:12:21.689 --> 01:12:25.199 Oh. 655 01:12:25.199 --> 01:12:33.779 The async version: you fire up a copy and the routine immediately returns to you while it is copying. 656 01:12:33.779 --> 01:12:41.460 So, if you're a good programmer, you can even have it copying while it's executing something else. 657 01:12:41.460 --> 01:12:44.909 Excuse me. 658 01:12:44.909 --> 01:12:48.960 Um. 659 01:12:48.960 --> 01:12:54.989 It's parallelism, okay: you can overlap computation and communication, and it's probably a very good idea 660 01:12:54.989 --> 01:12:59.489 on the parallel machine. Okay. 661 01:12:59.489 --> 01:13:03.510 Here's what cudaMalloc and cudaMemcpy calls look like. 662 01:13:05.399 --> 01:13:10.800 For cudaMemcpy, you give the pointer to the destination, the pointer to the source, the number of bytes, and which way you're going.
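(For contrast with managed memory, a sketch of the same element-wise addition with explicit cudaMalloc / cudaMemcpy / cudaFree calls, following the h_ / d_ naming style mentioned earlier; the sizes and values are just illustrative, and it also spells out the part 1 / part 2 / part 3 flow described above.)

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void add(int n, const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Part 1: host arrays (h_ prefix), ordinary malloc, initialized on the CPU.
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device arrays (d_ prefix): cudaMalloc takes the address of the pointer and
    // the size in bytes, and fills in an address in global memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // cudaMemcpy: destination, source, byte count, and which way the copy goes.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Part 2: the device does the addition.
    const int blockSize = 256;
    add<<<(n + blockSize - 1) / blockSize, blockSize>>>(n, d_a, d_b, d_c);

    // Part 3: copy the result back; this call also waits for the kernel to finish.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

The asynchronous variant mentioned above, cudaMemcpyAsync, returns immediately; issued on its own stream, with pinned host memory, it is what lets a copy overlap a kernel running in another stream. With managed memory (cudaMallocManaged) the explicit copies disappear entirely and pages migrate on demand.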
663 01:13:10.800 --> 01:13:17.100 Which way meaning to the device, 664 01:13:17.100 --> 01:13:20.579 or from the device back. Okay. 665 01:13:21.689 --> 01:13:28.319 If you're paranoid — I am paranoid, you should be paranoid — 666 01:13:28.319 --> 01:13:33.180 check that it worked. 667 01:13:33.180 --> 01:13:38.789 So, you can check this: the calls return an error code. 668 01:13:38.789 --> 01:13:46.470 And this is what I showed you before: if the code was not cudaSuccess, 669 01:13:46.470 --> 01:13:50.699 this will give you a human-readable character string. 670 01:13:50.699 --> 01:14:02.909 And this you can do. What I do is, if I'm using cudaMalloc, I actually have another routine which wraps around cudaMalloc. I call my wrapper, 671 01:14:02.909 --> 01:14:10.920 and my wrapper calls cudaMalloc and checks for errors. Now, there is a point here: since the calls are asynchronous, 672 01:14:10.920 --> 01:14:16.680 it's not — um, if you do this, even if the malloc caused an error, 673 01:14:16.680 --> 01:14:20.279 this will not give you the error, because the error hasn't occurred yet. 674 01:14:20.279 --> 01:14:28.470 So what my wrapper actually does is: it calls cudaMalloc, I synchronize, 675 01:14:28.470 --> 01:14:43.289 and then I check for errors. It kills performance, but I'm more interested in correctness than performance. Once I think my program is perhaps correct, then I can delete the error checks and synchronization. There's a sketch of such a wrapper below. 676 01:14:43.289 --> 01:14:49.739 Okay, so the slide set was showing us some basic, 677 01:14:49.739 --> 01:14:54.810 more information about the memory design and so on. 678 01:14:58.289 --> 01:15:03.510 I want to show you the start of the next one, but I won't do the whole thing. 679 01:15:06.779 --> 01:15:18.630 So, this is going to show more about the execution model and threads and stuff like that. 680 01:15:18.630 --> 01:15:22.319 I'm going to finish now; it's 681 01:15:22.319 --> 01:15:28.319 late enough, so you can get to your next class. But in any case, let me maybe 682 01:15:28.319 --> 01:15:35.399 do example 3 for fun. 683 01:15:39.899 --> 01:15:52.500 Okay, well, this is the program which has the memory error, so how would we fix the error? So, the trouble is, the final block of threads 684 01:15:52.500 --> 01:15:55.890 is going to run this on every thread, but 685 01:15:55.890 --> 01:16:00.569 we should only run it on threads for the elements of the array that were allocated. 686 01:16:00.569 --> 01:16:06.239 And the way to fix this would be to test i, make sure that i is less than n, 687 01:16:06.239 --> 01:16:10.229 and if i is greater than or equal to n, then not execute this. 688 01:16:10.229 --> 01:16:17.670 And that's actually the solution here. 689 01:16:21.539 --> 01:16:24.539 We are bringing in n as an argument. 690 01:16:24.539 --> 01:16:34.739 And we are checking here: if i is less than n, then do this. So, this will be a thread divergence. Okay. 691 01:16:36.420 --> 01:16:41.250 Because this will actually not get executed for every thread in the block, necessarily. 692 01:16:41.250 --> 01:16:45.420 We don't want it to, either; in this case we're good with that. 693 01:16:45.420 --> 01:16:54.810 But if the conditional, the then-block, were very big, we would start to lose parallelization, because some threads should not be executing it. 694 01:16:54.810 --> 01:17:05.069 In any case. 695 01:17:09.119 --> 01:17:13.590 No errors. Good.
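(A sketch of the kind of paranoid checking wrapper just described; the macro name CHECK is my own. Synchronous calls such as cudaMalloc and cudaMemcpy report their own errors, while a kernel launch is asynchronous, so the wrapper checks the launch and then synchronizes and checks again — exactly the correctness-over-performance trade-off mentioned above.)

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Run a CUDA runtime call; if it did not return cudaSuccess, print a
// human-readable message (cudaGetErrorString) and quit.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: CUDA error: %s\n",                  \
                    __FILE__, __LINE__, cudaGetErrorString(err_));      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Usage on synchronous calls:
//     CHECK(cudaMalloc(&d_a, bytes));
//     CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
//
// A kernel launch is asynchronous, so check it in two steps while debugging:
//     add<<<numBlocks, blockSize>>>(n, d_a, d_b, d_c);
//     CHECK(cudaGetLastError());        // bad launch configuration, etc.
//     CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran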
696 01:17:13.590 --> 01:17:23.399 Okay, and this is a problem, because probably the number of elements does not fill the last block of threads, and you've got to have some sort of bounds check like that. 697 01:17:23.399 --> 01:17:28.470 Okay, so this is a reasonable point to stop. 698 01:17:28.470 --> 01:17:33.060 So, we saw some more details of NVIDIA CUDA, 699 01:17:33.060 --> 01:17:37.020 and Thursday we will continue it. 700 01:17:38.069 --> 01:17:42.449 So I'll hang around for a couple of minutes in case there's anything. 701 01:17:42.449 --> 01:17:47.340 And if you have a question, unmute your mic 702 01:17:47.340 --> 01:17:53.550 or use the chat window. Other than that, enjoy — if you happen to be in the 703 01:17:53.550 --> 01:18:02.789 Troy area, enjoy the snow that I see outside my window. I'm only 10 miles from RPI at the moment, actually. So. 704 01:18:04.500 --> 01:18:13.890 Oh, and some news on good weather, not related to the parallel thing. But remember, I've got 8 kilowatts of solar panels on the roof and Tesla 705 01:18:13.890 --> 01:18:22.710 Powerwalls. Yesterday was a very sunny day, and I wasn't using much electricity because I was skiing, actually, and the 706 01:18:22.710 --> 01:18:29.489 solar panels on my house generated a third more electricity than the house used. So, yesterday I sent 707 01:18:29.489 --> 01:18:33.210 10 kilowatt-hours or so of electricity 708 01:18:33.210 --> 01:18:39.449 back into the grid. And this is February, so when we get to a few months later, 709 01:18:39.449 --> 01:18:42.960 I'll be running very positive, I think. 710 01:18:45.689 --> 01:18:49.800 Well, if there are none... 711 01:18:49.800 --> 01:18:55.350 Oh, I saw a question here. Would it be wrong to think of the registers as cache equivalents? 712 01:18:55.350 --> 01:19:00.899 Um. 713 01:19:00.899 --> 01:19:05.340 I'm not certain what you mean by cache equivalents, but 714 01:19:06.420 --> 01:19:10.979 I mean, they're not going into any global cache, because they're local to the thread. 715 01:19:12.029 --> 01:19:15.449 So, in the sense that they're very fast, 716 01:19:17.130 --> 01:19:25.350 they're like cache memory, I guess. But when I hear cache, I think of something that's going to get synchronized later on, 717 01:19:25.350 --> 01:19:30.810 some global synchronization eventually, and the registers, 718 01:19:30.810 --> 01:19:35.819 they're local; they don't later get synchronized with the other threads. So. 719 01:19:37.079 --> 01:19:44.250 The way they're implemented might involve something like a cache that's totally invisible to the user. 720 01:19:47.189 --> 01:19:53.250 Other questions? 721 01:19:54.810 --> 01:20:00.180 If not, then... 722 01:20:04.949 --> 01:20:14.250 Yeah, thank you, Eva.