WEBVTT 1 00:04:52.019 --> 00:04:56.399 Silence. 2 00:05:11.939 --> 00:05:20.069 Silence. 3 00:05:23.459 --> 00:05:30.209 Silence. 4 00:05:35.548 --> 00:05:39.389 Silence. 5 00:05:42.389 --> 00:06:06.959 Silence. 6 00:06:28.978 --> 00:07:06.478 Site 7 00:07:07.223 --> 00:07:07.553 right? 8 00:07:11.639 --> 00:07:19.108 Okay, good afternoon, parallel computing class. 9 00:07:19.108 --> 00:07:26.939 Can anyone hear me first? Because I'm not completely certain I've got the audio working. 10 00:07:26.939 --> 00:07:30.749 Yeah, thank you, Connor. Great. 11 00:07:30.749 --> 00:07:36.928 So, where we are: 12 00:07:36.928 --> 00:07:40.499 class 13, and it's March 13 00:07:40.499 --> 00:07:48.749 11, 2021, and what's on tap for today is 14 00:07:48.749 --> 00:08:00.718 parallel computing. We're still in the NVIDIA course notes, and we've graduated from specifically NVIDIA stuff on 15 00:08:00.718 --> 00:08:13.319 to general parallel computing paradigms, that is, styles of programming which will make your parallel programs more efficient. They're generally useful paradigms; they're not restricted to NVIDIA. 16 00:08:13.319 --> 00:08:19.619 But first, a couple of general notes. I'll put them in the blurb for Monday. Various 17 00:08:19.619 --> 00:08:20.783 parallel companies, 18 00:08:20.783 --> 00:08:29.184 quantum computing companies like D-Wave, have online tutorials, and if anyone's interested I'll put the blurb up on the website. 19 00:08:29.483 --> 00:08:35.394 This is a way for you to learn current topics outside of the course. I can't make them officially part of the course; 20 00:08:35.394 --> 00:08:48.024 they're not in class time. But if anyone's interested, I'll put some blurbs up. D-Wave has seminars from time to time, for example. D-Wave's approach is one of the three major quantum computing paradigms. 21 00:08:48.293 --> 00:08:52.283 We'll get to it later in the class, after this module. 22 00:08:52.734 --> 00:08:59.634 And D-Wave sells quantum computers that do what's called quantum annealing. 23 00:09:00.053 --> 00:09:08.602 So you have a function to be optimized, and it will find the optimum for you, doing it in parallel on your quantum computer. 24 00:09:08.849 --> 00:09:11.999 And, oh, another 25 00:09:11.999 --> 00:09:20.428 thing with the current NVIDIA architecture, the latest NVIDIA: they've gotten away from the idea of a core. They talk about streaming multiprocessors. 26 00:09:20.428 --> 00:09:26.788 And within the rest of the course, they don't call them CUDA cores anymore. 27 00:09:26.788 --> 00:09:33.418 Yes, I'm calling on you. What's your question? I'll enable your microphone if you'd like, actually. 28 00:09:33.418 --> 00:09:41.668 Yeah, of course. It was just regarding homework 5: where are we supposed to actually submit anything for that? 29 00:09:43.349 --> 00:09:46.918 I can't remember, what is homework 5? 30 00:09:49.469 --> 00:10:04.229 Well, it would be, I mean, I'm grading, I'm going to grade easily, but the idea would be to 31 00:10:04.229 --> 00:10:09.418 submit a PDF with your report on what you... okay. 32 00:10:09.418 --> 00:10:13.889 Okay, yeah, I was just curious because of the posted due date. 33 00:10:13.889 --> 00:10:17.639 It's supposed to be due today. 34 00:10:17.639 --> 00:10:21.479 Yeah, I'll extend the due date. Remind me, so. 35 00:10:21.479 --> 00:10:36.208 Wonderful, thank you. Yeah, you're welcome. There are so few of you; for a small course like this I can be more lenient, and I'm figuring you learn what you want to learn.
I'm presenting you with things. If you'd like to learn them, fine. If you would not like to learn them, well, 36 00:10:36.208 --> 00:10:41.009 we still have your tuition. 37 00:10:41.009 --> 00:10:46.889 I hope you'd like to learn it. Okay, other questions? Okay. 38 00:10:46.889 --> 00:10:56.489 So, yes, I'll post a link for Monday about NVIDIA's... well, they have a nice 39 00:10:56.489 --> 00:11:06.774 blurb on their developer website about how they're changing things from generation to generation and so on. And their current terminology is approaching more Intel's. 40 00:11:06.774 --> 00:11:20.783 Actually, they have their streaming multiprocessors, and they're sort of analogous to Intel cores, and each streaming multiprocessor might have 32 floating point units and 32 integer units, and 41 00:11:22.198 --> 00:11:31.438 maybe 32 to 64 floating point units, 32 double precision floating point units perhaps, and 32 instruction dispatchers. 42 00:11:31.438 --> 00:11:37.318 And each instruction dispatcher would dispatch for a full thread... 43 00:11:37.318 --> 00:11:51.808 for the full 32, a full warp of threads, I'm sorry. And so they've downplayed the term CUDA core now, which is interesting. I mean, the reason is that the streaming multiprocessor, it's got 44 00:11:51.808 --> 00:11:54.989 a bank of 45 00:11:55.703 --> 00:12:06.894 warps waiting to run, and they're waiting because they need some resource: this might need floating point units, that might need integer units, and so on. 46 00:12:07.283 --> 00:12:15.293 So, as these processing units become available and there are threads that need processing, it just assigns them, dispatches instructions, and so on. 47 00:12:15.599 --> 00:12:25.589 So, it's interesting to watch their architecture migrating from year to year, and it's instructive: you can look at that and think about why they're doing it. 48 00:12:25.589 --> 00:12:29.818 Um, another point NVIDIA makes is that, 49 00:12:29.818 --> 00:12:39.028 because they've got warps that need resources, floating point, double precision, integer or whatever, and they've got resources, this is the efficient way to do 50 00:12:39.028 --> 00:12:47.278 this, even just within 1 thread block. And then, of course, the multiple thread blocks could be running in parallel on multiple streaming multiprocessors, 51 00:12:47.278 --> 00:12:55.408 if they're available. And NVIDIA makes the point that you use their hardware more efficiently when you've got actually many more 52 00:12:55.408 --> 00:13:05.879 warps waiting to run than you've got resources, because what you want to have is a lot of threads, actually a lot of warps, that want to be executed. 53 00:13:05.879 --> 00:13:14.399 And this way, you'll always have something needing execution whenever some hardware resource becomes available to execute it. And so 54 00:13:14.399 --> 00:13:22.408 their model works better when you've got thousands of threads, and not just 1000, maybe several thousand. I mean, 55 00:13:22.408 --> 00:13:26.009 the GPU can run a thousand threads... well, 1, 56 00:13:26.009 --> 00:13:30.778 so 1 thread block can run a thousand threads at a time, 57 00:13:30.778 --> 00:13:34.048 and then the whole machine can run maybe 58 00:13:34.048 --> 00:13:43.168 4,000 threads at a time, depending. So, if it can run up to 4,000 threads, that would suggest maybe you want 10,000 threads 59 00:13:43.168 --> 00:13:48.538 waiting, trying to execute, because then there will always be something waiting to execute when
60 00:13:48.538 --> 00:13:52.379 a resource becomes available. So, 61 00:13:52.379 --> 00:13:56.519 um, and again, because there's zero overhead... the idea where 62 00:13:56.519 --> 00:13:59.578 it doesn't take... 63 00:13:59.578 --> 00:14:03.178 you know, the scheduling is... I 64 00:14:03.178 --> 00:14:17.333 don't know enough about how it's implemented, but it's implemented so that it's fast. You don't have a lot of context swapping time; context swapping is free or something. That's why I'm guessing it's using asynchronous logic. Okay. So, learning about 65 00:14:18.538 --> 00:14:22.739 scanning and so on. 66 00:14:22.739 --> 00:14:29.099 And again, I've got a 2nd laptop 67 00:14:29.099 --> 00:14:33.538 to my side here, which is showing the chat window, and every so often I look over at it 68 00:14:33.538 --> 00:14:37.889 and, um, 69 00:14:39.688 --> 00:14:45.629 can see what's happening. Okay. Um. 70 00:14:45.629 --> 00:14:48.629 So, okay, so what we saw last time 71 00:14:48.629 --> 00:14:52.019 is a new, 72 00:14:52.019 --> 00:14:56.668 basically a new style of programming, a new 73 00:14:56.668 --> 00:15:01.379 paradigm, called a scan algorithm. And 74 00:15:01.379 --> 00:15:15.028 the scan does a series of parallel reductions. So the scan input here, for example, is this array of 8 elements, 3 0 7 0 4 1 6 3, and the output: 75 00:15:15.028 --> 00:15:19.918 the i-th output element is the sum of the first i input elements. 76 00:15:19.918 --> 00:15:27.688 So, 3, the first output is 3, then the next is the sum of 3 and 0, and so on. This is a partial stage here. 77 00:15:27.688 --> 00:15:31.288 So this contains the reduction. 78 00:15:31.288 --> 00:15:37.019 So, the k-th output is the reduction, the sum, of the first k inputs. 79 00:15:37.019 --> 00:15:45.538 Okay, interesting idea. Why do we spend time on it? It turns out to be a tool 80 00:15:45.538 --> 00:15:51.568 for a surprisingly wide variety of parallel algorithms, for doing them efficiently. 81 00:15:51.568 --> 00:15:58.649 Just like, on a sequential machine, sorting can be used for a lot of different things. And 82 00:15:58.649 --> 00:16:13.589 well, the obvious one is run-length decoding. For example, if the input is a list of run lengths, then the output will be where each run starts in the output vector. So that would be called a dope vector, actually: 83 00:16:13.589 --> 00:16:17.158 a list of the base points, the starts, is called a dope vector. 84 00:16:17.158 --> 00:16:22.889 That's just 1 example; it's used for a lot of other things. It's used for, um, actually for 85 00:16:22.889 --> 00:16:27.418 bucket sorting, frequency counts. 86 00:16:27.418 --> 00:16:35.188 So, frequency counts: we saw that quick example 2 days ago, where we didn't have that many buckets. 87 00:16:35.188 --> 00:16:40.318 You can use this idea for when there are very many output buckets. Okay. 88 00:16:40.318 --> 00:16:52.528 So we want to do this fast, and what we saw last time is a way to do it in parallel, stride by stride by stride, and it's sort of counterintuitive. So, in the 1st stride, 89 00:16:52.528 --> 00:16:56.879 each output element becomes the sum of the 90 00:16:57.203 --> 00:16:58.134 2 adjacent elements, 91 00:16:58.134 --> 00:16:59.484 but in the 2nd stride, 92 00:16:59.724 --> 00:17:12.683 it's sums of elements that are 2 apart; the stride lengths are going up by powers of 2. At each stride, each output element is the sum of the same output element added with the one a stride to the left.
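(A minimal sketch of that stride-by-stride scan, assuming a single thread block small enough that every element has its own thread; the kernel and variable names are illustrative, not code from the course notes:)

    // Naive inclusive scan: strides of 1, 2, 4, ...
    // Each thread adds in the element one stride to its left; off the left
    // edge it effectively adds 0, which handles the boundary case.
    __global__ void naiveScan(float *x, int n) {
        extern __shared__ float buf[];      // block-local copy in shared memory
        int i = threadIdx.x;
        if (i < n) buf[i] = x[i];
        __syncthreads();
        for (int stride = 1; stride < n; stride *= 2) {
            float t = 0.0f;
            if (i >= stride && i < n) t = buf[i - stride];
            __syncthreads();                // everyone reads before anyone writes
            if (i < n) buf[i] += t;
            __syncthreads();                // this stride's writes visible to the next
        }
        if (i < n) x[i] = buf[i];
    }
    // e.g. naiveScan<<<1, 1024, 1024 * sizeof(float)>>>(d_x, n);

(Log n strides, but every stride keeps nearly all n threads adding; that n log n total work is what the work-efficient version later in the lecture improves on.)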
93 00:17:12.989 --> 00:17:18.088 And if you do it right, you can add in place. It may take a little thinking. And 94 00:17:18.088 --> 00:17:25.919 so it takes log n strides, okay, which is nice. This takes log n steps, and each step takes 95 00:17:25.919 --> 00:17:30.118 constant time if you've got enough threads, so the whole thing takes log n time. 96 00:17:30.118 --> 00:17:33.659 Um, now, 97 00:17:33.659 --> 00:17:43.709 there are some tricks you can use to make it faster. One problem with this... oh, one thing also here: when you're adding, when I say add each element to the one a stride to the left, if 98 00:17:43.709 --> 00:17:46.709 the one to the left would go off the start of the array, then you just 99 00:17:46.709 --> 00:17:51.449 add 0. You have to handle the boundary cases. 100 00:17:51.449 --> 00:17:56.308 And as I've said, I've had programs where the boundary cases are one half of all my lines of code. 101 00:17:56.308 --> 00:18:01.048 Okay, not fun, but necessary. 102 00:18:01.048 --> 00:18:10.588 So, and you see, you've got issues here: you're not adding elements which are adjacent. So the questions are whether this plays nicely with the cache, 103 00:18:10.588 --> 00:18:13.949 and when you're writing. 104 00:18:13.949 --> 00:18:28.169 So you want to use the shared memory if possible. One way to think of shared memory is as a level 2 cache. So you have your global memory: it's big and it's high latency. Okay, so we've got 48 gigabytes of 105 00:18:28.169 --> 00:18:32.189 global memory on the machine, on 106 00:18:32.189 --> 00:18:40.644 the GPU on parallel. So, 48 gigabytes, and the latency to read something from it might be a couple of hundred cycles, but it's reading 128 bytes. 107 00:18:40.644 --> 00:18:47.544 So it goes into a cache. The cache is, I can't remember, several megabytes, and it's chunked up into 128-byte things. 108 00:18:50.159 --> 00:18:56.969 So, again, if you have to read 128 bytes, it'd be nice if all 128 bytes were actually useful, 109 00:18:56.969 --> 00:19:01.138 which is why adjacent threads want to be reading adjacent addresses in the global memory. 110 00:19:01.138 --> 00:19:14.519 Okay, so that's a big... you could call that the level 1 cache, at some megabytes; I'll give you a link to a developer paper on this. And that cache is visible to everything on the GPU. 111 00:19:14.519 --> 00:19:27.179 So, you read it into the cache, anyone can use it, which is another thing to tie into the constant memory cache, for example, also. So the constant memory is like a read-only cache: you get something into it and everyone can read it. 112 00:19:27.179 --> 00:19:35.903 Okay, that's the level 1 cache. Now inside each thread block, you could imagine there's a level 2 cache. That's the same hardware as the shared memory. 113 00:19:36.084 --> 00:19:48.473 In fact, in the current NVIDIA architecture you have something like 128 K bytes, and you can say how much is explicit shared memory and how much is the implicit level 2 cache. So you read something into the level 2 cache; it's got a smaller chunk size, 32 bytes 114 00:19:48.473 --> 00:19:52.344 I think, and it's visible to all the threads in that thread block. 115 00:19:54.989 --> 00:20:05.939 So: level 1 cache, visible to everyone; level 2 cache, visible inside 1 thread block. And each thread block has a separate level 2 cache, and it's fast to read and write.
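(A tiny sketch of what playing nicely means here, with a made-up kernel, not from the slides: adjacent threads read adjacent global addresses, so each 128-byte line fetched is fully used, and the block then works out of its fast shared memory:)

    __global__ void stage(const float *g, float *out, int n) {
        __shared__ float s[256];            // the explicitly controlled per-block memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes blockDim.x == 256
        if (i < n) s[threadIdx.x] = g[i];   // thread k reads address k: coalesced
        __syncthreads();                    // all loads land before anyone uses s[]
        if (i < n) out[i] = 2.0f * s[threadIdx.x];  // stand-in for real work on s[]
    }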
116 00:20:05.939 --> 00:20:20.699 And again, it's the same hardware, the same shared memory and level 2 cache: the shared memory you control explicitly, the level 2 cache is controlled implicitly by the cache manager. Okay, so you want to have your program 117 00:20:20.699 --> 00:20:30.384 play nicely with the caches. If it plays nicely, your program runs faster in real time. And again, the metric here is not CPU time. 118 00:20:30.594 --> 00:20:44.574 The performance metric is wall clock real time, because, see, CPU time is not so meaningful. Well, again, if you have a core that would be idle... if it's idle, if it's spinning its wheels waiting for something, you don't actually care. 119 00:20:45.294 --> 00:20:47.784 Well, if it's spinning its wheels it's using some power, but... 120 00:20:48.358 --> 00:20:58.499 and on a supercomputer you'd actually care about the power, but not in this course. So it's wall clock time which you want to minimize. 121 00:20:58.794 --> 00:21:11.483 Okay, here also we're mentioning: you have your separate levels, you've got to synchronize. Let me go back a page. Okay, so here we've got only 8 threads, no trouble. Suppose you had a thousand threads here. 122 00:21:11.723 --> 00:21:23.993 There's no guarantee that those thousand threads all write at the same time. Again, because you've got limited resources available, each warp runs all at the same time, but the multiple warps, they could run simultaneously or they could run sequentially, depending on what's available. 123 00:21:24.269 --> 00:21:33.088 So, after you do each stride, you have to synchronize, to make sure the data that stride writes is available, because the next stride will read it. 124 00:21:33.088 --> 00:21:36.959 Okay, they talk about that here. 125 00:21:36.959 --> 00:21:40.318 Okay. 126 00:21:40.318 --> 00:21:45.148 Code, I'll skip. Lots of sync threads. Um. 127 00:21:45.148 --> 00:21:51.509 Work efficiency. And, 128 00:21:52.943 --> 00:22:07.074 well, the work efficiency is: are you using the, I'll call them CUDA cores here, efficiently? Because again, if there's a CUDA core that's running something that doesn't have to run, well, something that does have to run is not going to run until it has resources available. 129 00:22:07.314 --> 00:22:12.653 So, in spite of what I said a minute ago, this is a reason to, 130 00:22:14.699 --> 00:22:24.269 you know, be efficient with the executing cores, because they may slow down, in wall clock time, the things waiting to execute. So, they're talking about some of that here. 131 00:22:24.269 --> 00:22:29.699 Okay, now the implication is that if some cores are 132 00:22:29.699 --> 00:22:35.608 idle and some cores are executing, you want to pack all the executing ones into the smallest number of warps. 133 00:22:35.608 --> 00:22:41.759 Okay, not an awful lot in that slide set, but some new stuff. 134 00:22:41.759 --> 00:22:46.318 Excuse me. 135 00:22:56.578 --> 00:23:10.499 Okay, so this is going to be a new way to traverse the tree, reducing control divergence. Reducing control divergence means that we're packing the 136 00:23:10.499 --> 00:23:14.489 threads that want to execute into a small number of warps. 137 00:23:14.489 --> 00:23:19.858 And this means things are going to be a little more complicated. Okay. 138 00:23:22.528 --> 00:23:25.769 And I've got a concept here: 139 00:23:25.769 --> 00:23:33.209 if you look at each output number at the bottom, over the whole computation we've got, like, a binary tree going up,
140 00:23:33.209 --> 00:23:41.489 up to the root. We're adding numbers 2 by 2: adjacent numbers, then we're adding numbers that are 2 apart, then adding numbers 4 apart, and so on, 141 00:23:41.489 --> 00:23:45.989 up to n over 2 apart. So conceptually we have a binary tree here. 142 00:23:46.584 --> 00:24:01.314 And what we're doing is, we start with building partial sums, sums of 2 elements, then sums of 2 partial sums, excuse me, sums of 4 elements, and we're working our way down with bigger and bigger partial sums. 143 00:24:01.558 --> 00:24:05.999 Okay, that's what they're talking about here. 144 00:24:05.999 --> 00:24:12.719 Um, and it's sort of showing what happens here. 145 00:24:14.338 --> 00:24:25.048 Time is going down the page here; the thread number, the ID, is going across the page. So in the 1st step, we add... 146 00:24:25.048 --> 00:24:39.388 they're showing something slightly differently. We're adding each element to the number to its left, so X0 gets added into X1, X2 gets added into X3, and so on. In the next stage, we're adding each element to the one 2 to its left, 147 00:24:39.388 --> 00:24:44.909 and then the one 4 to its left. This would be for a simple reduction here, not a full scan. Okay. 148 00:24:46.048 --> 00:24:52.378 And it has log n steps, and at the end of it we've summed all 8 elements: the reduction phase. 149 00:24:52.378 --> 00:24:55.528 We're working our way up to the scan. 150 00:24:55.528 --> 00:25:00.689 Okay, ignore the code for now. 151 00:25:01.949 --> 00:25:13.499 And what we're going to be doing, we're doing more, the executive summary: we're doing more additions here to create more, 152 00:25:13.499 --> 00:25:16.798 more partial reduction sums. 153 00:25:16.798 --> 00:25:26.068 Skip that for the moment; I'll give you the executive summary, which is that in this computation 154 00:25:26.068 --> 00:25:32.219 we're doing more partial... skipping over the details to move along. 155 00:25:33.328 --> 00:25:36.659 But putting it all together, um, 156 00:25:36.659 --> 00:25:40.199 what we're doing here: 157 00:25:40.199 --> 00:25:45.088 the first, the top half of the tree is where we're doing the reduction 158 00:25:45.088 --> 00:25:54.598 of the whole array, and we're also doing partial reductions of pieces. So we've got reductions of 4 elements, reductions of 2 elements, and so on. 159 00:25:54.598 --> 00:26:02.159 That's the top half of the array, right? Then in the bottom half of the tree, you might say, we're branching out again, and 160 00:26:02.159 --> 00:26:12.838 taking these partial reductions and doing more additions, and at the end of it we're going to have our scan operation. I'll leave this up for a minute. So, 161 00:26:12.838 --> 00:26:18.328 if you look at the ID number of a thread: the one whose ID is a multiple of 8, 162 00:26:18.328 --> 00:26:29.939 there's only 1 here, it's got the sum of 8 elements. The one that's a multiple of 4, whose ID is a multiple of 4, has the sum of 4 elements. The ones that are multiples of... 163 00:26:31.648 --> 00:26:35.638 sorry, I've got them off by 1, as we have 16 here, not 8. 164 00:26:35.638 --> 00:26:41.009 Okay: the thread whose ID is a multiple of 16 has the sum of all 16 elements. 165 00:26:41.009 --> 00:26:44.308 The threads whose IDs are a multiple of 8, but not 16, 166 00:26:44.308 --> 00:26:50.009 have the sum of 8 elements. The threads whose ID numbers are a multiple of 4, but not 8,
167 00:26:50.933 --> 00:27:05.903 have the sums of the 4 elements to their left. And the threads whose IDs are a multiple of 2, but not a multiple of 4, have the sum of 2 elements. And the odd numbered threads haven't been changed. That's the state of the system in the middle here, again, time going down. 168 00:27:06.209 --> 00:27:12.538 Now, what we do is take all these partial sums, and we start adding more stuff to them. 169 00:27:12.538 --> 00:27:16.378 And so we're doing another 170 00:27:16.378 --> 00:27:20.068 branching out, and as the end result we will get 171 00:27:20.068 --> 00:27:25.169 all the scans. If I look at this here: say here we've got the sum of the left 8 elements, 172 00:27:25.169 --> 00:27:37.253 and we add in the sum of 4 more elements, we add in the sum of 2 more elements and 1 more element. And at this point, where the hand shows, this is thread 15, counting from 1 just to make it easy: 173 00:27:38.213 --> 00:27:40.193 it's the sum of the 15 elements to the left. 174 00:27:40.499 --> 00:27:45.179 Let me take another thread just for fun. Let me take this thread here. 175 00:27:45.179 --> 00:27:53.038 So, this thread is the sum of itself and the element to its left, and then we add in this, which is the sum of 176 00:27:53.038 --> 00:27:59.128 the first 8 elements. So at this point this thread here is the sum of 10 elements, and it'd be thread number 10. 177 00:28:00.209 --> 00:28:09.689 So, an interesting 2 stage process. A 3rd example here: let's take this. This would be thread 7, I guess. 178 00:28:09.689 --> 00:28:21.868 It's not affected in the 1st stage. We add in the sum of the 1st 4 threads, we add in the sum of the next 2 threads, we add them into here, and at this point this thread is the sum 179 00:28:21.868 --> 00:28:25.318 of the 1st 7 threads. 180 00:28:25.318 --> 00:28:29.608 So this is another way to do the scan operation. At the end of it, 181 00:28:29.608 --> 00:28:33.568 each thread is the sum of all the threads 182 00:28:33.568 --> 00:28:40.888 to its left, including itself. The 1st thread doesn't get changed, the 2nd thread is the sum of 2 threads, and so on. 183 00:28:40.888 --> 00:28:51.239 And the top half took log n stages and the bottom half took log n stages, so 2 log n stages, 184 00:28:51.239 --> 00:28:55.019 and we've got all the scans. 185 00:28:56.249 --> 00:29:01.949 Now, what makes this better than the thing I showed, say, 2 slide sets ago, slide 10.1? 186 00:29:01.949 --> 00:29:07.469 If we look at all the little plus signs, each plus sign is a core that did something. 187 00:29:07.469 --> 00:29:13.409 This is actually only 188 00:29:13.409 --> 00:29:18.449 2n additions. So this whole thing, that took 2 log n 189 00:29:18.449 --> 00:29:23.159 stages, took only a total of 2n 190 00:29:23.159 --> 00:29:34.409 additions. So this is what we mean by work efficient. The 1st version of the scan operation, there were log n stages, but in each stage all n 191 00:29:34.409 --> 00:29:40.499 cores did something, so the 1st thing did n log n work. This thing is only order n work. 192 00:29:40.499 --> 00:29:43.679 And again, although I said that idle CUDA cores, 193 00:29:43.679 --> 00:29:46.858 you don't care about them... but you care about them if they're queued up waiting to execute.
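(A sketch of that two-phase, work-efficient pattern: the up-sweep reduction tree followed by the down-sweep that distributes the partial sums. One block, n a power of 2; the index arithmetic is my illustration of the idea, not the slides' code:)

    __global__ void workEfficientScan(float *x, int n) {
        extern __shared__ float s[];
        int i = threadIdx.x;
        if (i < n) s[i] = x[i];
        // Up-sweep: partial sums accumulate at indices that are multiples of 2, 4, 8, ...
        for (int stride = 1; stride < n; stride *= 2) {
            __syncthreads();
            int j = (i + 1) * 2 * stride - 1;   // only every (2*stride)-th slot is active
            if (j < n) s[j] += s[j - stride];
        }
        // Down-sweep: branch back out, adding each partial sum into later elements.
        for (int stride = n / 4; stride >= 1; stride /= 2) {
            __syncthreads();
            int j = (i + 1) * 2 * stride - 1;
            if (j + stride < n) s[j + stride] += s[j];
        }
        __syncthreads();
        if (i < n) x[i] = s[i];                 // inclusive scan result
    }

(Counting the plus signs: the up-sweep does n-1 additions and the down-sweep fewer than n, about 2n total, versus n log n for the naive version. Note the active indices j are spread out, which is exactly the warp-packing and cache complaint coming next.)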
194 00:29:46.858 --> 00:30:00.778 If they're queued up waiting to execute, they're slowing down cores that do want to do something. So if a core is not doing anything useful, if all it's going to do is, say, add in something off the beginning of the array, add in a 0, 195 00:30:00.778 --> 00:30:07.199 well, you would rather be able to determine statically that it's going to add a zero, and not execute it. Okay. 196 00:30:07.199 --> 00:30:10.679 Good. So this is more work efficient: 197 00:30:10.679 --> 00:30:14.489 it has fewer total additions occurring. 198 00:30:15.354 --> 00:30:29.663 Now, the only problems with this, which we'll get to later, are that the active cores are not adjacent to each other, so they're not packed into the smallest number of warps, and the operands 199 00:30:29.969 --> 00:30:33.328 to each core are not adjacent to each other either. 200 00:30:33.328 --> 00:30:36.358 So, this doesn't play nicely 201 00:30:36.358 --> 00:30:41.219 with the concept of a warp, and it does not play nicely with the cache manager. 202 00:30:41.219 --> 00:30:46.078 However, it does have the fewest number of cores executing. 203 00:30:46.078 --> 00:30:53.128 So, we have progress. Um, skip the code. Um, 204 00:30:53.128 --> 00:30:56.159 notice the liberal use of sync threads. 205 00:30:56.159 --> 00:31:03.148 Okay, so we'll see what they say about what I just said next. 206 00:31:13.108 --> 00:31:16.769 So, we're going to analyze it. 207 00:31:18.868 --> 00:31:21.929 Okay, so: a total of a linear amount of work, 208 00:31:21.929 --> 00:31:26.669 total, so 2n adds. 209 00:31:28.078 --> 00:31:31.138 So, the efficient sequential thing would have n adds. 210 00:31:31.138 --> 00:31:40.199 That takes n additions, and this takes 2n. So we doubled the amount of work, the number of additions, but we cut the wall clock time. 211 00:31:40.199 --> 00:31:45.568 Parallel is going to take more operations than sequential; 212 00:31:45.568 --> 00:31:49.138 it always happens. A factor of 2 is quite good. 213 00:31:49.138 --> 00:31:54.388 Okay. And now, if you're 214 00:31:54.388 --> 00:31:59.068 running something P ways in parallel, you ain't gonna get a factor of P speedup. 215 00:31:59.068 --> 00:32:04.648 So, the work efficiency, it's nice. 216 00:32:04.648 --> 00:32:13.229 Okay, work inefficiency might be fine for some things, but, um... 217 00:32:15.419 --> 00:32:30.328 Okay, so here's the next thing. Suppose we've got a big, big, big input vector, and it's too big to fit in 1 thread block. I said a thread block has a thousand threads max, 218 00:32:30.328 --> 00:32:37.019 1024. So suppose you want to scan a 1 million element vector. 219 00:32:37.019 --> 00:32:40.648 What you would do is you would fire up a 220 00:32:41.699 --> 00:32:49.048 kernel with a thousand thread blocks, each with a thousand threads, and each thread block 221 00:32:49.048 --> 00:32:53.368 would do the scan independently on its thousand elements. 222 00:32:53.368 --> 00:32:57.209 And at some point at the end, we then have to 223 00:32:57.209 --> 00:33:01.439 merge the results and update each thread block. 224 00:33:01.439 --> 00:33:04.588 And they talk about it: they scan the sums array. 225 00:33:04.588 --> 00:33:10.618 So, you do a 2nd level scan on the totals, 226 00:33:10.618 --> 00:33:18.568 one for each thread block, and this gives a dope vector, which you then go back and add into each thread block. And now you've got your final scanned version.
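(The shape of that two-level hierarchy, as host-side pseudocode; scanBlock, scanSums, and addOffsets are hypothetical kernel names standing in for the pieces just described, and d_x and d_sums are assumed to be device arrays allocated earlier:)

    const int N = 1000000;
    int threads = 1000;                         // 1000 threads per block
    int blocks  = (N + threads - 1) / threads;  // 1000 blocks for N = 1,000,000

    // 1. Each thread block scans its own 1000-element piece independently,
    //    writing its block total into d_sums[blockIdx.x].
    scanBlock<<<blocks, threads>>>(d_x, d_sums, N);

    // 2. A 2nd level scan of the 1000 block totals; an exclusive scan here
    //    gives each block's starting offset, the dope vector.
    scanSums<<<1, blocks>>>(d_sums, blocks);

    // 3. Broadcast back out: every element in block b gets d_sums[b] added in.
    addOffsets<<<blocks, threads>>>(d_x, d_sums, N);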
227 00:33:18.568 --> 00:33:21.808 That's the scanned version of the 1 million element array. 228 00:33:21.808 --> 00:33:27.388 So, large vectors: again, if your vector's too big to fit in 1 thread block, 229 00:33:27.388 --> 00:33:39.628 you partition it into a separate piece for each... you run multiple thread blocks, you scan each thread block separately, and then you do a combo scan of the, 230 00:33:39.628 --> 00:33:43.378 basically, of the total from each thread block, 231 00:33:43.378 --> 00:33:49.048 broadcast that back out to the thread blocks, and they update themselves, and now you've got the whole vector done. 232 00:33:49.048 --> 00:33:56.489 A 2 step process. And again, you have to do it something like this, because with separate thread blocks 233 00:33:56.489 --> 00:34:00.778 you've got no guarantee of when they're running. 234 00:34:00.778 --> 00:34:08.759 Okay, they're calling them scan blocks, or thread blocks. The big array is partitioned into blocks, and each block gets, you 235 00:34:08.759 --> 00:34:15.838 know, scanned; scan will be a verb here and an adjective. Each block gets scanned separately, then you take the total 236 00:34:15.838 --> 00:34:25.289 from each block, you put it in an auxiliary array, you scan it, you broadcast it out, and you update the scanned things. I'll leave this up for a second or 2. 237 00:34:29.969 --> 00:34:35.608 Multiple levels: parallel computing, and you've got hierarchy. So, okay, 238 00:34:35.608 --> 00:34:39.688 hierarchies of everything: algorithms, memory, 239 00:34:39.688 --> 00:34:49.918 and so on. But it's not a full binary tree hierarchy; the hierarchy is not very high. 240 00:34:51.568 --> 00:34:55.559 So, okay. Okay. 241 00:34:55.559 --> 00:35:00.088 What we... what this scan that we talked about before was, is called an inclusive scan. 242 00:35:00.088 --> 00:35:04.949 There's a variant called an exclusive scan, where you put a 0 in front, 243 00:35:04.949 --> 00:35:13.289 and the last element is the sum of the first n minus 1 elements, and nowhere in this is there the sum of all the elements. 244 00:35:13.289 --> 00:35:16.528 Inclusive scan, exclusive scan. 245 00:35:16.528 --> 00:35:19.559 So, it's just... 246 00:35:19.559 --> 00:35:26.128 you know, conceptually the same, but the exclusive scan is easier for working with dope vectors; it's 247 00:35:26.128 --> 00:35:32.518 easier for different purposes. I'd use the exclusive scan, but then I don't get the sum of all the elements if I would need it. So, 248 00:35:32.518 --> 00:35:42.628 okay: say I've just allocated a buffer, as I call this array, and the beginning addresses are what I'd call the dope vector. So, 249 00:35:42.628 --> 00:35:51.208 you've got the difference: in the inclusive scan, the elements are the sums of all the elements up through here; exclusive, the sum of all the elements to the left of here. So, 250 00:35:52.498 --> 00:35:55.889 okay, inclusive, exclusive: minor point. 251 00:35:58.829 --> 00:36:06.358 What they're saying here, you can get an idea from this. 252 00:36:07.829 --> 00:36:12.929 Oh, okay. So, what we saw in this set of slides 253 00:36:12.929 --> 00:36:22.588 was this parallel scan operation, which is a widely useful operation for parallel algorithms: how to do it, and then how to do it efficiently. 254 00:36:24.239 --> 00:36:32.369 Next is chapter 12; they skipped chapter 11, 255 00:36:32.369 --> 00:36:38.548 there's nothing interesting there. Okay. 256 00:36:40.199 --> 00:36:47.579 A touch on floating point. You've seen some of this before, exactly. Well, one thing relevant to the current NVIDIA architecture 257 00:36:47.579 --> 00:36:52.228 is, since many programs are limited by I/O time,
258 00:36:52.228 --> 00:36:56.699 they invented a half precision floating point data format, 259 00:36:56.699 --> 00:37:05.068 which is half the size. Okay, floating point: this goes back a few decades. 260 00:37:06.268 --> 00:37:14.068 You'd like to have the hardware... have some standards for floating point, with round off and stuff like that. 261 00:37:14.068 --> 00:37:23.213 And so there's an IEEE standard for this, and the problem with the standard is it's expensive to implement. 262 00:37:23.213 --> 00:37:31.494 And when this was first proposed, a few decades ago, there was actually a lot of professional debate about whether this standard was overkill. 263 00:37:31.768 --> 00:37:36.418 Was it being too finicky about round offs and stuff like that? 264 00:37:36.418 --> 00:37:43.798 Would it be too expensive to implement? And in fact, Cray refused to accept the standard. 265 00:37:45.778 --> 00:37:53.878 So, Cray was a major supercomputer manufacturer, and they refused to implement the floating point standard; they said it takes too much hardware. 266 00:37:53.878 --> 00:37:57.869 In any case, now everyone accepts it, but 267 00:37:57.869 --> 00:38:02.579 NVIDIA has a way to ignore it: fast math operations. Oh, okay. 268 00:38:02.579 --> 00:38:06.539 So, floating point: you've got a sign, you've got an exponent, you've got a mantissa. 269 00:38:06.864 --> 00:38:17.574 And actually a cool thing is, the floating point number is laid out in a way... the bits are laid out in a pattern 270 00:38:17.934 --> 00:38:26.903 so that you can compare two floating point numbers with an integer binary comparison. That is cool. So the comparison operator for ints, if you apply it to 271 00:38:27.210 --> 00:38:41.574 floating point numbers, it still works. Which is also the hardware point: to the hardware, bits is bits. You've got 32 bits; there's nothing in the hardware that says what they mean. It could be a 32 bit integer, it could be a 32 bit float, it could be 4 8-bit characters. 272 00:38:44.099 --> 00:38:49.679 You know, it could be 5 6-bit characters plus 2 spare bits, who knows. 273 00:38:49.679 --> 00:38:53.130 It's how you interpret the bits. 274 00:38:53.905 --> 00:39:06.175 Normalized numbers: I'm going to skip some of the details. The way we get an extra bit is, the leading bit of the floating point number is a 1, always, because it's 1 point something times 2 to an exponent. Then, since it's always 1, 275 00:39:06.175 --> 00:39:09.594 you don't store it, and you get 1 more bit of precision. 276 00:39:11.219 --> 00:39:18.510 The details... ah, I'll just do some historical notes. 277 00:39:20.304 --> 00:39:24.894 Again, it takes some thinking to figure out how to do floating point numbers right. 278 00:39:25.315 --> 00:39:39.655 And now that it's been figured out, we forget that it took some time. And IBM, even when they were the biggest computer company, actually did floating point numbers in a very inefficient way. The way IBM implemented floating point for years 279 00:39:39.684 --> 00:39:40.494 actually 280 00:39:40.769 --> 00:39:45.510 had fewer effective significant bits than necessary. So: 281 00:39:45.510 --> 00:39:57.420 IBM didn't actually have binary floats, they actually had base 16 floats, which sounds like it'd be the same thing, but no, because you get more leading binary zeros. Okay, 282 00:39:57.420 --> 00:40:02.489 I'll skip some of the details here, details on how you implement it.
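(A small runnable sketch of that bit-layout point, my example in ordinary C++ host code: pull a float apart into sign, exponent, and mantissa, and compare two non-negative floats by their raw bit patterns:)

    #include <cstdio>
    #include <cstring>
    #include <cstdint>

    int main() {
        float f = 6.5f;                      // 1.625 * 2^2
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);      // same 32 bits, reinterpreted
        uint32_t sign     = bits >> 31;            // 1 bit
        uint32_t exponent = (bits >> 23) & 0xFF;   // 8 bits, biased by 127 (here 129)
        uint32_t mantissa = bits & 0x7FFFFF;       // 23 bits; the leading 1 is implicit
        printf("sign %u exp %u mantissa 0x%06x\n", sign, exponent, mantissa);

        // Sign, then exponent, then mantissa, high bits to low: so for two
        // non-negative floats, integer comparison of the bit patterns agrees
        // with floating point comparison.
        float a = 2.5f, b = 100.0f;
        uint32_t ua, ub;
        memcpy(&ua, &a, 4);
        memcpy(&ub, &b, 4);
        printf("%d %d\n", ua < ub, a < b);   // prints 1 1
        return 0;
    }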
283 00:40:02.489 --> 00:40:09.360 Floating point numbers are... representative of another example of how hard it is to get this implementation right 284 00:40:09.360 --> 00:40:16.769 is that, some years ago, there were various different implementations; I mentioned Cray, I believe. So 285 00:40:16.769 --> 00:40:30.929 people wrote subroutines to try to query their computer, to try to determine what the actual number of significant bits in the mantissa was. Because you couldn't just call 286 00:40:30.929 --> 00:40:45.385 a system routine and have it tell you; you had to sort of probe your system, do additions and see what the result was, and determine what the actual precision of your computer was. So, Communications of the ACM published a subroutine 287 00:40:45.385 --> 00:40:46.764 that would determine that. 288 00:40:47.400 --> 00:40:52.349 And then somebody found a real piece of hardware that would cause this routine to go into an infinite loop. 289 00:40:52.349 --> 00:41:00.389 So, crazy issues. Other things with implementations: you might have intermediate registers which have more precision 290 00:41:00.389 --> 00:41:07.530 than your memory format for a float. If you've got more temporary precision, then you've got less round off error, which is good. 291 00:41:07.530 --> 00:41:14.159 But also, then, addition was not commutative: a plus b would not be b plus a, because one might be in a temporary register that's bigger. 292 00:41:14.159 --> 00:41:20.519 You know, subtleties. Other little subtleties: you might want your major 293 00:41:20.519 --> 00:41:28.199 built-in functions, like sine and exponential, to be monotonic: if you increase the argument, you'd want the result at least not to decrease. 294 00:41:28.199 --> 00:41:34.019 Well, that wasn't true sometimes, just from implementation... weird little things like that. 295 00:41:34.019 --> 00:41:37.739 Again, I've skipped some details about this. Um, 296 00:41:38.880 --> 00:41:44.400 the takeaway from this is, it's surprisingly hard to do floating point right. 297 00:41:44.400 --> 00:41:49.320 Skip over that. 298 00:41:49.320 --> 00:41:52.949 So, IEEE single precision: 299 00:41:52.949 --> 00:42:01.380 a 23 bit fraction, the mantissa. That's actually not enough for a lot of things; you continue to lose precision as you do operations. 300 00:42:01.380 --> 00:42:15.869 So you actually have to be a touch careful doing scientific computation with single precision. And the exponent is not enough, because your biggest representable number is about 10 to the 38th, and again, that's not enough for a lot of things. So, double. So, 301 00:42:15.869 --> 00:42:26.610 what I tell people is: on Intel, use double precision. It takes twice the space, but on Intel you're not I/O bound so much, and it's a 52 bit fraction and a 302 00:42:26.610 --> 00:42:35.880 bigger exponent; you're fine. Of course, on the GPU, which is I/O bound typically, you can't just automatically go to double precision: 303 00:42:35.880 --> 00:42:41.880 first, because it doubles the I/O time, and second, you have fewer double precision processors, 304 00:42:41.880 --> 00:42:46.650 so a thread might be waiting for processors. 305 00:42:46.650 --> 00:42:52.199 Okay. Also, some cool things in the IEEE standard that make 306 00:42:52.199 --> 00:42:57.269 people freak out at the start: it has ways to represent plus and minus infinity, 307 00:42:57.269 --> 00:43:05.550 and it has a bit pattern which means not a number. So if you divide 0 by 0, it should output the not-a-number bit pattern.
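(A quick runnable sketch, my own example rather than the slides', showing those special values and the arithmetic rules they obey:)

    #include <cstdio>
    #include <cmath>

    int main() {
        float zero = 0.0f;
        float inf  = 1.0f / zero;            // +infinity
        float nan_ = zero / zero;            // the not-a-number bit pattern
        printf("%f %f\n", inf, nan_);        // prints inf and nan
        printf("%d\n", nan_ == nan_);        // 0: NaN compares unequal, even to itself
        printf("%f\n", 0.0f * nan_);         // nan: 0 does not collapse it to 0
        printf("%d\n", std::isnan(0.0f * inf)); // 1: 0 times infinity is NaN too
        return 0;
    }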
308 00:43:06.869 --> 00:43:14.070 And then there are weird effects: you now start violating normal rules of arithmetic, I guess, though you might understand them. 309 00:43:14.070 --> 00:43:22.110 Like, 0 times not-a-number is still not a number, so 0 does not always collapse everything to 0. You see the problem. 310 00:43:22.110 --> 00:43:26.340 And so you get weird, counterintuitive things happening with this. 311 00:43:26.340 --> 00:43:33.449 The thing is that no one hardly ever uses these things. I thought it would be cool to use something like not-a-number in one of my 312 00:43:33.449 --> 00:43:46.289 C++ programs, to represent when it wasn't outputting a legal number, and it killed my performance. It turned out it was being simulated in software or something, and no, the documentation didn't say that. 313 00:43:46.289 --> 00:43:50.639 Okay. Um, accuracy and rounding. 314 00:43:52.110 --> 00:43:58.019 You all know what that is. This means that addition is not associative, of course: 315 00:43:58.019 --> 00:44:01.739 rounding error. 316 00:44:03.119 --> 00:44:09.119 I think you know what rounding is; if you don't know, if you want me to slow down, I will. 317 00:44:09.119 --> 00:44:16.260 Okay, so in hardware, you'd like to internally have 2 more bit positions than you visibly have, 318 00:44:16.260 --> 00:44:20.369 and this will help the rounding 319 00:44:20.369 --> 00:44:24.630 make your results accurate to the last visible bit, typically. 320 00:44:24.630 --> 00:44:27.750 Not associative: 321 00:44:27.750 --> 00:44:34.710 so, (large plus small) plus small is not large plus (small plus small), because the small plus the 322 00:44:34.710 --> 00:44:41.369 large may just be the large: the small may get lost, it may not affect the large, it's too small. So things like this 323 00:44:41.369 --> 00:44:44.789 are relevant. 324 00:44:48.775 --> 00:45:00.625 If I back up 2 stages here, this is relevant if you're adding up a big array, because a subtotal might start getting much bigger than the next element you're adding, and then this is relevant here. 325 00:45:00.929 --> 00:45:06.030 I'm also teaching probability this semester: if you're computing a variance as 326 00:45:06.030 --> 00:45:10.079 the mean of the x squareds minus the square of the mean of the x's, 327 00:45:10.079 --> 00:45:16.050 you may get hit by this; it may come out wrong. 328 00:45:16.050 --> 00:45:20.579 Run time math library. 329 00:45:20.579 --> 00:45:23.610 So, what NVIDIA has: 330 00:45:23.610 --> 00:45:31.409 they sort of say, IEEE 754 is nice, but maybe it's too slow, so we have fast hardware versions 331 00:45:31.409 --> 00:45:34.469 which are faster but may be less accurate. So, 332 00:45:34.469 --> 00:45:43.920 and you can pick. Yeah, you want to be careful about using it. Again, I had one of my geometry programs where I turned on fast math: 333 00:45:43.920 --> 00:45:52.710 cool, fast must be better. And it broke my program, actually, because I was implicitly assuming that floats were done properly. 334 00:45:52.710 --> 00:46:02.159 I just implicitly assumed it when I designed my algorithms, and when I put fast math in as the compiler option, our program no longer gave the right answer. So, 335 00:46:02.159 --> 00:46:05.400 be safe, 336 00:46:06.869 --> 00:46:12.690 be careful. Okay, so, an introduction to floats and so on. 337 00:46:14.909 --> 00:46:19.920 No questions? 338 00:46:19.920 --> 00:46:24.510 Able to... 339 00:46:24.510 --> 00:46:32.940 So, this is not strictly... well, it's parallel computing in the sense that stability becomes harder to achieve.
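(A two-line demonstration of that large-plus-small effect, my example in single precision, where the spacing between representable numbers near 10^7 is 1:)

    #include <cstdio>

    int main() {
        float large = 1.0e7f, small = 0.5f;
        // Feeding the smalls into the large one at a time loses them both:
        printf("%.1f\n", (large + small) + small);   // 10000000.0
        // Grouping the smalls first keeps enough magnitude to survive rounding:
        printf("%.1f\n", large + (small + small));   // 10000001.0
        return 0;
    }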
340 00:46:32.940 --> 00:46:38.309 Stability means: when does round off error go crazy? 341 00:46:38.309 --> 00:46:41.639 So, give you some examples. 342 00:46:42.989 --> 00:46:47.639 Again, your backgrounds are variable somewhat, but 343 00:46:47.639 --> 00:46:54.360 the point is that stability affects the outcome. So, the standard way to multiply 2 matrices, n by n by n, the n cubed time: 344 00:46:54.360 --> 00:46:59.969 there are asymptotically faster ways to multiply matrices. The first one was 345 00:46:59.969 --> 00:47:03.360 Strassen's, the n to the 2.8 time; 346 00:47:03.360 --> 00:47:07.559 the 2.8 was log to the base 2 of 3, actually... 347 00:47:07.559 --> 00:47:10.619 log to the base 2 of 7. 348 00:47:10.619 --> 00:47:16.079 And that exponent's been bashed down since. But they're not used so much, 349 00:47:16.079 --> 00:47:23.309 in spite of the smaller exponent: first, the constant factor in front of the time is bigger, and second, they're numerically less stable. 350 00:47:23.309 --> 00:47:27.510 So, they're adding and subtracting things, and 351 00:47:27.510 --> 00:47:33.360 you get more round off errors. So the simple, obvious thing: 352 00:47:34.409 --> 00:47:37.980 it's slower, but it's simple, 353 00:47:37.980 --> 00:47:43.380 and it has better round off properties. So. 354 00:47:44.880 --> 00:47:50.460 And again, with things like inverting a matrix, solving a system of linear equations: 355 00:47:50.460 --> 00:48:04.500 there are algorithms which may look better but are unstable. So the round off may go... I mean, it's not just round off, it's not just a few least significant bits that may be wrong; the algorithm may just crash. 356 00:48:04.500 --> 00:48:08.519 It may end up with 0 significant bits, effectively. So: 357 00:48:09.960 --> 00:48:17.190 a review of how you solve a set of linear equations, 3 equations in 3 unknowns. 358 00:48:17.190 --> 00:48:21.059 Well, 359 00:48:21.059 --> 00:48:25.980 the simple way: well, first you normalize it, so the leading 360 00:48:25.980 --> 00:48:29.190 coefficient, the coefficient on x, is always 1. 361 00:48:30.780 --> 00:48:43.500 And then what you can do is, you can take the 1st equation, you can subtract it from the 2nd and 3rd and get this. So now you've eliminated x from the 2nd and 3rd. And now you can guess what we're going to do: we're going to scale 362 00:48:43.500 --> 00:48:50.880 the 2nd equation, subtract it from the 3rd, and now the 3rd equation gives us z. So now we walk back up, and we've solved it. 363 00:48:50.880 --> 00:48:57.059 Like this. This is nice, but, um, 364 00:48:58.889 --> 00:49:02.519 depending on what the relative coefficients are... 365 00:49:02.519 --> 00:49:17.130 let me show you right here. You see, you've got 16y in the 2nd equation, 4y in the 3rd equation. So you have to double the 2nd equation, then add it into the 3rd equation. So that means all the coefficients of the 2nd equation get doubled. 366 00:49:17.130 --> 00:49:26.039 So they get bigger. So you see, they might start swamping the coefficients in the 3rd equation, causing significant bits to be lost. In a case like this, it would be better 367 00:49:26.039 --> 00:49:34.860 to take the 3rd equation and add half of it to the 2nd equation, because now the numbers in the coefficients are getting smaller, not bigger. 368 00:49:34.860 --> 00:49:40.110 So when these coefficients of the 3rd equation are added into the 2nd, they don't swamp the 2nd equation's coefficients, and they're not causing so much loss of significance.
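(A small sequential sketch of that elimination, my C++ illustration with made-up numbers; it includes the swap-the-biggest-coefficient-up idea, the pivoting the next slides turn to:)

    #include <cstdio>
    #include <cmath>
    #define N 3

    // Solve A x = b by Gaussian elimination; b is stored as column N of A.
    void solve(double A[N][N + 1], double x[N]) {
        for (int col = 0; col < N; col++) {
            int best = col;                       // pivot: largest coefficient in the column
            for (int r = col + 1; r < N; r++)
                if (fabs(A[r][col]) > fabs(A[best][col])) best = r;
            for (int c = 0; c <= N; c++) {        // swap the pivot row up
                double t = A[col][c]; A[col][c] = A[best][c]; A[best][c] = t;
            }
            for (int r = col + 1; r < N; r++) {   // eliminate this column below the pivot
                double f = A[r][col] / A[col][col];
                for (int c = col; c <= N; c++) A[r][c] -= f * A[col][c];
            }
        }
        for (int r = N - 1; r >= 0; r--) {        // walk back up: back substitution
            x[r] = A[r][N];
            for (int c = r + 1; c < N; c++) x[r] -= A[r][c] * x[c];
            x[r] /= A[r][r];
        }
    }

    int main() {
        double A[N][N + 1] = {{1, 2, 1, 8}, {2, 16, 2, 40}, {3, 4, 1, 14}};
        double x[N];
        solve(A, x);
        printf("x=%g y=%g z=%g\n", x[0], x[1], x[2]);   // prints x=1 y=2 z=3
        return 0;
    }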
369 00:49:40.110 --> 00:49:44.519 That's the better way to do the elimination. 370 00:49:44.519 --> 00:49:48.690 This is so... 371 00:49:48.690 --> 00:49:51.929 and, um, 372 00:49:51.929 --> 00:49:57.329 they talk a little; I'll skip through it. 373 00:49:58.679 --> 00:50:05.789 Okay. And the problem... so it parallelizes nicely. 374 00:50:07.320 --> 00:50:10.469 So, this problem with stability: 375 00:50:13.019 --> 00:50:17.760 what you would like to do is actually find the largest element in the, 376 00:50:17.760 --> 00:50:23.489 in the array of coefficients, and 377 00:50:23.489 --> 00:50:28.380 use that: swap it up to the top left, and then start 378 00:50:28.380 --> 00:50:31.949 subtracting multiples, and it will turn out to yield 379 00:50:31.949 --> 00:50:36.030 better precision in the result. 380 00:50:36.030 --> 00:50:40.710 They're talking about that here; I just gave you the context of it. 381 00:50:44.309 --> 00:50:50.429 But you've got to find the largest element, and that takes a scan, which takes some time. 382 00:50:50.429 --> 00:51:00.840 So, they may not look for the absolutely largest element; they may find the largest element in a row, or in a column or something, and work with it. That's called partial pivoting. 383 00:51:00.840 --> 00:51:04.829 It's faster to find the pivot, but the pivot's not as good. But... 384 00:51:06.539 --> 00:51:09.570 that's what they're talking about here. So. 385 00:51:12.179 --> 00:51:22.650 So, the message here is: you'd like to have the best numerical precision, it's called stability, and it's harder to achieve with parallel algorithms. 386 00:51:25.110 --> 00:51:39.389 Okay, so now we're getting back to specifically NVIDIA stuff here. 387 00:51:43.500 --> 00:51:48.719 The GPU is attached to the 388 00:51:48.719 --> 00:51:56.010 CPU by quite a fast bus, but still, we want to get an idea of how they work together. 389 00:51:56.010 --> 00:52:00.750 So, a different theme. 390 00:52:01.949 --> 00:52:12.119 Okay, back. So this is your CUDA program: it's C++, it's got these minor syntactic extensions, and it's got these new routines, cudaMalloc 391 00:52:12.119 --> 00:52:25.409 and cudaMemcpy. And again, with managed memory, which automatically pages, it'd be cudaMallocManaged, and you would never have to do a memcpy, unless you thought you could do it better 392 00:52:25.409 --> 00:52:31.500 than the system. And maybe you could, actually, but it takes your time. 393 00:52:32.760 --> 00:52:37.079 And again, so this is the call in your 394 00:52:37.079 --> 00:52:43.230 main program to call the kernel on the GPU: you've got the triple angle bracket 395 00:52:43.230 --> 00:52:49.949 extension, and in it you specify how many thread blocks and how many threads per block. 396 00:52:51.000 --> 00:53:01.590 Okay. And the kernel routine, again: it's called from the host and executed on the device, and you can pass in arguments. 397 00:53:01.590 --> 00:53:07.739 And in the routine you can define variables as local, or you've got register variables. 398 00:53:07.739 --> 00:53:11.460 Local variables are local to the thread, but they are slow; 399 00:53:11.460 --> 00:53:15.000 that's if you don't have enough registers. And shared variables, they are 400 00:53:15.000 --> 00:53:18.750 global to the threads, local to the block. 401 00:53:18.750 --> 00:53:23.039 And lots of sync threads. So this is the general structure of a 402 00:53:23.039 --> 00:53:31.739 CUDA program. Bandwidth is important.
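(Putting those pieces in one place, a minimal sketch of that general structure; the kernel, array, and sizes are mine:)

    #include <cstdio>

    // Kernel: called from the host, executed on the device.
    __global__ void addOne(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // register variables
        __shared__ float s[256];   // shared: global to the threads, local to the block
        if (i < n) s[threadIdx.x] = a[i];
        __syncthreads();
        if (i < n) a[i] = s[threadIdx.x] + 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *a;
        cudaMallocManaged(&a, n * sizeof(float)); // managed memory: pages automatically,
                                                  // so no explicit cudaMemcpy is needed
        for (int i = 0; i < n; i++) a[i] = i;
        addOne<<<(n + 255) / 256, 256>>>(a, n);   // <<<thread blocks, threads per block>>>
        cudaDeviceSynchronize();                  // wait before the host reads the result
        printf("%f\n", a[0]);                     // prints 1.0
        cudaFree(a);
        return 0;
    }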
403 00:53:39.809 --> 00:53:46.139 So, this is an obsolete architecture that they mention because it was important for so long. 404 00:53:46.139 --> 00:53:54.030 You used to have a North Bridge and a South Bridge concentrator: the North Bridge had the fast peripherals, the South Bridge has the slow peripherals. 405 00:53:54.030 --> 00:54:01.079 Okay, and the thing in white on light yellow. 406 00:54:01.079 --> 00:54:06.750 Okay, historical. 407 00:54:08.400 --> 00:54:13.050 And originally you had a bus and so on; it was slow. 408 00:54:13.050 --> 00:54:16.920 And, okay. 409 00:54:18.630 --> 00:54:25.769 There was the concept of memory mapped I/O, of course; reviewing. 410 00:54:25.769 --> 00:54:33.780 So, the devices on the bus, they would read and write directly to physical memory. But of course, that means that 411 00:54:34.860 --> 00:54:40.619 the register has to be in physical memory; it can't get swapped out by the virtual memory manager. 412 00:54:40.619 --> 00:54:47.519 So, you might hard lock in the addresses that the devices access, for example. 413 00:54:49.920 --> 00:55:02.219 But the nice concept is, or here's another slightly different concept: you've got your virtual memory space, and every address... some addresses might map to memory, and some addresses map to the 414 00:55:02.219 --> 00:55:12.869 peripherals. So the way that's implemented is, the peripherals are just watching the address bus, and when they see an address that applies to them, then they take action: they read or write. So, 415 00:55:12.869 --> 00:55:18.090 a nice unifying concept, putting everything, numbered, in the virtual memory space. 416 00:55:18.090 --> 00:55:23.429 Okay, newer... 417 00:55:27.000 --> 00:55:33.300 it got faster. I'm skipping through this too. 418 00:55:35.250 --> 00:55:45.900 Again, faster: lanes, where several bits can go through. An interesting thing, 8b/10b encoding and so on. 419 00:55:47.519 --> 00:55:58.139 Right, because if you have too many of the same, too many zeros or ones in a row on the bus, you get crosstalk, perhaps, 420 00:55:59.250 --> 00:56:03.000 and you don't want too much... 421 00:56:03.000 --> 00:56:06.449 too many ones means there's a DC current, perhaps. So, 422 00:56:06.449 --> 00:56:12.449 this is not strictly parallel computing, so I'm going through it fast. 423 00:56:15.000 --> 00:56:18.150 Your card: 424 00:56:18.150 --> 00:56:21.570 a few years old, but, 425 00:56:21.570 --> 00:56:27.269 graphics: you've got lots of video outs, PCI Express, um, 426 00:56:28.650 --> 00:56:32.369 and now the connector has to do things, and so on. 427 00:56:32.369 --> 00:56:42.929 This is why NVIDIA has compute servers with no graphics output: the graphics takes so much space. This is an old chip by them, 428 00:56:42.929 --> 00:56:50.849 an old board; gives you some idea. The 3D part, not so interesting. 429 00:56:52.230 --> 00:57:06.030 DMA: again, it's got to write to pinned memory if it's writing to actual memory, right, pinned memory. Because, again, 430 00:57:06.030 --> 00:57:11.280 what happens if this has been swapped out by the virtual memory manager? So, 431 00:57:12.360 --> 00:57:17.849 the GPU can be doing direct memory access to the main memory, so 432 00:57:17.849 --> 00:57:23.489 it's accessing pinned memory, right? 433 00:57:23.489 --> 00:57:33.570 So, pinned memory is memory that you can't do virtual management on, so you've got less memory you can page. On parallel, I've actually got so much real memory, 434 00:57:33.570 --> 00:57:36.840 um, that...
435 00:57:36.840 --> 00:57:41.460 what, I've got 256 gigabytes, whatever; that doesn't matter. 436 00:57:43.289 --> 00:57:48.150 Page locked memory: sort of obsolete now, that you can access that 437 00:57:48.150 --> 00:57:57.659 pinned memory. So, things like memcpy are faster with pinned memory, because you don't have to wait for it to get paged, maybe. 438 00:57:57.659 --> 00:58:02.730 Oversubscription: not on parallel, you can't oversubscribe it easily. 439 00:58:04.769 --> 00:58:08.969 Yeah. 440 00:58:12.000 --> 00:58:18.389 And that was not a lot of content there, because it's been partly supplanted with newer stuff. But... 441 00:58:27.389 --> 00:58:32.429 okay, going to skip through this fast here. 442 00:58:32.429 --> 00:58:40.530 You all know about virtual memory management. 443 00:58:42.474 --> 00:58:55.855 It's a touch tricky to implement virtual memory management properly. For example, you're paging your instructions in and out also, and an instruction might be multiple bytes. And what if an instruction spans 444 00:58:56.130 --> 00:59:08.219 a page boundary? You can see where an instruction, maybe half of it gets paged out, and as you page it back in, that might then cause the first half to get paged out. You can imagine some crazy deadlock issues. 445 00:59:08.219 --> 00:59:20.489 They've been solved now, but yeah, you could get crashing if you're paging in... the point is that both pages have to be in memory at the real... 446 00:59:20.489 --> 00:59:25.590 at the same time. 447 00:59:27.630 --> 00:59:37.590 Pinning stuff helps with this. Yeah. 448 00:59:38.820 --> 00:59:49.530 So, the way they implement things like memcpy is that the data actually gets copied to pinned memory, and then gets copied to the virtual memory where you need it. 449 00:59:49.530 --> 00:59:54.510 It takes 2 stages. 450 00:59:56.010 --> 00:59:59.550 And if you want... 451 01:00:04.500 --> 01:00:11.610 okay. 452 01:00:13.679 --> 01:00:20.010 Here is something new now: the concept of streams. We haven't talked about it before. 453 01:00:20.010 --> 01:00:25.829 What is happening here is that you can run... okay, 454 01:00:25.829 --> 01:00:32.219 so far we've seen parallelism within 1 CUDA kernel. 455 01:00:33.989 --> 01:00:42.929 A kernel has thousands of thread blocks, each thread block has a thousand threads; so your 1 kernel is parallel. 456 01:00:44.190 --> 01:00:47.699 But we still have... so we had 1 sequential 457 01:00:47.699 --> 01:00:56.010 thing: in your host program, you allocate memory, you do copying, you fire up a parallel kernel, you wait, 458 01:00:56.010 --> 01:01:04.050 you synchronize, and you read the data. Okay, well, what we're talking about here: that's like 1 task, and the task itself is sequential. 459 01:01:04.050 --> 01:01:11.460 What we're going to learn in this slide set is having multiple parallel tasks in your 460 01:01:11.460 --> 01:01:14.519 CUDA program, and they're called streams. 461 01:01:14.519 --> 01:01:21.570 So, you start 1 stream, you allocate data, and you do some copying, 462 01:01:22.769 --> 01:01:31.500 and then execute. And while the kernel's executing, you could be out copying data for another stream. And this will give you 463 01:01:31.500 --> 01:01:35.699 smaller real time. So the streaming facility 464 01:01:36.414 --> 01:01:50.934 allows you to do different parts of your C++ CUDA program in parallel, independently of each other. So, this means that your separate streams are competing for the fixed resources:
465 01:01:51.179 --> 01:01:54.329 well, first, streaming multiprocessors, thread blocks, 466 01:01:54.329 --> 01:02:01.769 you know, arithmetic units and stuff like that. And so these separate streams, they're doing different things, so 467 01:02:01.769 --> 01:02:04.860 perhaps they compete, or they want different 468 01:02:04.860 --> 01:02:19.320 resources on the GPU. So the 2 streams may actually play well together, because they want different resources at the same time, and therefore you'll get greater work... greater efficiency on the GPU. 469 01:02:21.000 --> 01:02:26.039 As well as, if you think of your algorithm as having multiple parallel streams, then hey, 470 01:02:26.039 --> 01:02:31.050 let's do it. Okay. 471 01:02:31.050 --> 01:02:37.380 Okay, so, an example here: 472 01:02:37.380 --> 01:02:41.250 1 stream, and trans means transfer. 473 01:02:42.690 --> 01:02:49.980 It's a vector add, so we're transferring 2 arrays to the GPU, doing a computation, and transferring the result back. 474 01:02:52.199 --> 01:02:55.769 It would be good if another stream was computing while this stream was transferring, 475 01:02:55.769 --> 01:03:00.090 and so... 476 01:03:02.880 --> 01:03:09.179 you can have some overlap; it shows it here. 477 01:03:11.909 --> 01:03:15.239 Now we've got 4: 478 01:03:15.239 --> 01:03:23.429 possibly we're adding 4 pairs of arrays, the A B 0, the A B 1, A B 2, and A B 3. 479 01:03:24.869 --> 01:03:33.150 So, we start off stream 0, and when stream 0 is computing we start off stream 1, which is transferring, 480 01:03:33.150 --> 01:03:46.170 using different parts of the hardware. So then, when stream 0 is copying stuff back to the host, stream 1 is computing, and then we start stream 2, copying data host to device. 481 01:03:46.170 --> 01:03:50.039 So, in the big black square blocks there, we've got some 482 01:03:50.039 --> 01:03:55.440 parallelism between the different streams. 483 01:03:56.519 --> 01:03:59.940 Okay. 484 01:04:03.780 --> 01:04:06.869 Okay, task parallelism. 485 01:04:08.039 --> 01:04:13.619 We had warp parallelism, we've got block parallelism, and now the next level up: task parallelism. 486 01:04:16.860 --> 01:04:23.460 Okay. And 487 01:04:23.460 --> 01:04:27.599 we can do it with kernel launches and synchronizing and so on. 488 01:04:29.460 --> 01:04:36.480 So there is a queue here. Yeah. So, 489 01:04:37.860 --> 01:04:44.940 start the 2 streams, and inside the streams we can do event querying and so on. 490 01:04:50.849 --> 01:04:54.449 Here's a view of the streams: stream 0, stream 1. 491 01:04:55.710 --> 01:04:59.969 You fire them up, and, um, 492 01:04:59.969 --> 01:05:04.619 okay, time is going top to bottom here. So... 493 01:05:07.530 --> 01:05:13.260 you might imagine you've got hardware that does copying between host and device, and you've got hardware that does computing, 494 01:05:13.260 --> 01:05:23.579 and the stream... actually, which hardware the stream runs on can swap back and forth. That's what this is showing here. 495 01:05:25.320 --> 01:05:31.050 So, stream 0 can start on 496 01:05:31.050 --> 01:05:34.679 hardware 0 and then swap. So that's the context here. 497 01:05:36.449 --> 01:05:47.789 Silence. 498 01:05:47.789 --> 01:05:51.659 So, how do we do this overlapping? 499 01:05:56.039 --> 01:06:03.000 Well, okay, so you can create separate streams, stream create, and this will take a data structure. 500 01:06:04.500 --> 01:06:11.099 And the separate streams want to be working with separate data, so allocate separate data.
501 01:06:11.099 --> 01:06:21.329 Allocate the data, and then you can do things like cudaMemcpyAsync, which is new. 502 01:06:21.329 --> 01:06:32.159 Again, you can play games with managed memory, but the concept of the asynchronous memcpy here is that you give another argument, which is the stream that this executes in. 503 01:06:32.159 --> 01:06:39.659 So, the async memcpy returns to the host immediately, while it's still executing on the device. 504 01:06:41.250 --> 01:06:51.210 So you've got these two async copies and they're running in parallel on stream 0, because that's okay; they're accessing different memory. 505 01:06:51.210 --> 01:06:54.269 And then the vector add here 506 01:06:54.269 --> 01:06:57.300 will work on stream 0. 507 01:06:59.099 --> 01:07:04.500 And then stream 1: here you do the memcpys on stream 1. 508 01:07:04.500 --> 01:07:12.659 And so stream 1 is executing in parallel with stream 0; they're not affecting each other. They're working on different memory. 509 01:07:12.659 --> 01:07:16.619 Except, of course, for the global routine here. 510 01:07:16.619 --> 01:07:20.489 It's the same global routine, but it's running 511 01:07:20.489 --> 01:07:31.139 on different data: stream 0, stream 1. So again, these are thread blocks being created on the same GPU, so with all the thread blocks, 512 01:07:31.139 --> 01:07:34.769 you're going to have this big pool of thread blocks waiting to run. So. 513 01:07:36.599 --> 01:07:40.800 And that means when there's a hardware resource available, it's more likely that 514 01:07:40.800 --> 01:07:44.099 there'll be a thread block that can use it. 515 01:07:44.099 --> 01:07:49.230 Okay, so we've got some issues here where we'd like some synchronization, but. 516 01:07:49.230 --> 01:07:59.309 That's your basic idea: in your CUDA program, your C++ program on the host, you can fire up asynchronous things 517 01:07:59.309 --> 01:08:07.079 in multiple streams. Nothing more interesting there. 518 01:08:11.699 --> 01:08:19.109 Yeah, so we want to figure out the best overlap. 519 01:08:25.619 --> 01:08:30.060 And what are they trying to do here? They're copying. 520 01:08:31.079 --> 01:08:34.409 Ways to reorder things: so we do the 521 01:08:34.409 --> 01:08:37.739 stream 0 copy, then the stream 1 copy. 522 01:08:38.939 --> 01:08:42.149 Let me go back two pages. 523 01:08:45.569 --> 01:08:49.829 Three pages. So the idea is: 524 01:08:49.829 --> 01:08:58.590 this thing starts, and you might say it doesn't start stream 1 quickly enough. You want to 525 01:08:58.590 --> 01:09:03.899 start all the streams trying to do stuff, and then the overlap is better. 526 01:09:06.300 --> 01:09:10.229 So, start all the streams, both streams, copying, and then 527 01:09:11.460 --> 01:09:16.800 the additions. So we're trying to get stuff at the task level 528 01:09:16.800 --> 01:09:21.600 running in parallel and overlapping. 529 01:09:25.020 --> 01:09:33.869 And you can get even more complicated code, lots of buffers. 530 01:09:35.880 --> 01:09:40.979 Yeah, so Hyper-Q: each engine, each streaming multiprocessor, 531 01:09:40.979 --> 01:09:44.340 they want to have stuff waiting to run. 532 01:09:44.340 --> 01:09:52.770 So the executive summary here: the thing is that we've got streams. 533 01:09:52.770 --> 01:10:01.470 Each stream has a sequence of things waiting to run, and the last thing is the GPU, each streaming multiprocessor, perhaps.
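Pulling those calls together: a minimal sketch of the reordered, breadth-first pattern, continuing the vecAdd sketch from earlier. The per-stream buffer names are illustrative, and the host arrays must be pinned for cudaMemcpyAsync to be truly asynchronous.

int half = N / 2;
size_t bytes = half * sizeof(float);
float *da0, *db0, *dc0, *da1, *db1, *dc1;   // separate device data per stream
cudaMalloc(&da0, bytes); cudaMalloc(&db0, bytes); cudaMalloc(&dc0, bytes);
cudaMalloc(&da1, bytes); cudaMalloc(&db1, bytes); cudaMalloc(&dc1, bytes);

// Both streams' host-to-device copies first; each call returns immediately.
cudaMemcpyAsync(da0, ha,        bytes, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(db0, hb,        bytes, cudaMemcpyHostToDevice, stream0);
cudaMemcpyAsync(da1, ha + half, bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(db1, hb + half, bytes, cudaMemcpyHostToDevice, stream1);

// Then both kernels; work stays ordered within each stream, not across them.
vecAdd<<<(half + 255) / 256, 256, 0, stream0>>>(da0, db0, dc0, half);
vecAdd<<<(half + 255) / 256, 256, 0, stream1>>>(da1, db1, dc1, half);

// Then both device-to-host copies, and finally wait on each stream.
cudaMemcpyAsync(hc,        dc0, bytes, cudaMemcpyDeviceToHost, stream0);
cudaMemcpyAsync(hc + half, dc1, bytes, cudaMemcpyDeviceToHost, stream1);
cudaStreamSynchronize(stream0);   // wait for everything queued on stream0
cudaStreamSynchronize(stream1);   // likewise for stream1

Issuing all the copies before either kernel is what gives both streams a chance to overlap a transfer with a computation.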
534 01:10:01.470 --> 01:10:06.029 These are things running, and we want to have 535 01:10:06.029 --> 01:10:12.090 lots of things in your stream waiting to execute. 536 01:10:12.090 --> 01:10:16.229 And so this way, the work queue on the GPU on the left gets fully 537 01:10:16.229 --> 01:10:23.909 occupied. And, um, 538 01:10:25.199 --> 01:10:29.880 you know, trying to get stuff parallel as much as possible. So. 539 01:10:31.770 --> 01:10:38.909 And, okay, so there's a new synchronization routine we haven't seen yet: cudaStreamSynchronize. 540 01:10:38.909 --> 01:10:45.569 So, the thing is, within the one stream there's various tasks that were asynchronous, 541 01:10:45.569 --> 01:10:54.329 and this waits until everything in that stream has been completed. 542 01:10:54.329 --> 01:11:01.859 Like, the data got copied before you add it, let's say, just for that stream. We saw the device synchronize before; that did 543 01:11:01.859 --> 01:11:05.100 all streams. 544 01:11:05.100 --> 01:11:11.310 Okay, so the creative content in this chapter, 545 01:11:11.310 --> 01:11:14.609 which was 546 01:11:16.079 --> 01:11:29.189 chapter 4, module 14: we have streams; this gives us task-level parallelism, and you'd like to reorder stuff so the different streams can execute in parallel. 547 01:11:44.670 --> 01:11:48.689 So we're going to see an example that fits some of this together. 548 01:11:49.800 --> 01:11:53.274 Historical note: MRI was originally 549 01:11:53.274 --> 01:11:53.755 nuclear 550 01:11:53.755 --> 01:12:01.225 magnetic resonance, NMR, when the physicists invented it many decades ago. When the medical community started using it, 551 01:12:01.225 --> 01:12:05.484 they renamed it, because I think they were afraid that the word nuclear would frighten people. 552 01:12:06.810 --> 01:12:10.949 I'm not joking. 553 01:12:10.949 --> 01:12:16.680 Okay, so I've got a bigger example here that should fit the things together. 554 01:12:18.329 --> 01:12:23.579 You do a scan, so: 555 01:12:24.840 --> 01:12:29.399 it's one of these things called an inverse problem in applied mathematics. 556 01:12:29.399 --> 01:12:35.520 The unknowns are the densities at each voxel inside the 557 01:12:35.520 --> 01:12:41.729 patient, and what you know is: you run these rays 558 01:12:41.729 --> 01:12:45.779 through it, and what you observe is 559 01:12:45.779 --> 01:12:49.560 the intensity at the end of the ray. 560 01:12:49.560 --> 01:12:55.590 And so those are the knowns, and the unknowns are the data inside; you want to solve for them. 561 01:12:58.500 --> 01:13:05.369 There are different ways you can scan, which is irrelevant; I'm going to skip through the details here. 562 01:13:07.050 --> 01:13:12.600 In any case, this is what the output might look like. Okay. 563 01:13:14.310 --> 01:13:17.970 Oh. 564 01:13:17.970 --> 01:13:24.449 Iterative solvers tend to be more efficient in many cases than simply explicitly inverting. 565 01:13:24.449 --> 01:13:38.039 If you're solving Ax = b, the explicit way is to say x = A⁻¹b and solve it directly. It turns out iteratively approaching the value of x is often more efficient. 566 01:13:38.039 --> 01:13:44.640 You get the idea; this is just a setup chapter. 567 01:13:51.600 --> 01:13:56.819 Silence. 568 01:13:57.175 --> 01:14:12.085 Okay, so you can be doing stuff in the kernel on the GPU that's more complicated than what we've seen so far. We've been seeing kernels where you'd, like, add two elements. This is a serious thing: it's got floating point and it's got calls to sine and cosine and so on.
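As a one-line illustration of that iterative idea (a generic scheme such as Richardson iteration, not necessarily the solver the slides use): instead of forming x = A⁻¹b, you repeat

    xₖ₊₁ = xₖ + α (b − A xₖ)

until the residual b − A xₖ is small enough. Each step is just matrix-vector work, which is exactly the kind of thing that parallelizes well on a GPU.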
569 01:14:12.085 --> 01:14:13.465 Okay. 570 01:14:13.710 --> 01:14:16.800 Lots of arguments. 571 01:14:16.800 --> 01:14:21.300 Yeah. 572 01:14:23.520 --> 01:14:28.680 Things you can do when you've got multiple loops. 573 01:14:30.720 --> 01:14:39.149 So, what we have here is we've got an outer loop over m; the inner loop over n is inside the m loop, and we're iterating. 574 01:14:39.149 --> 01:14:49.470 Well, the m loop had two stages: this first stage up here, and then the second stage, which 575 01:14:49.470 --> 01:14:59.454 has the n loop. So we fission: we split the outer loop into two pieces, this initial stage 576 01:14:59.965 --> 01:15:09.685 and then the next stage here, twice like this. Why we want to do it is that it's a setup toward the next step. We can 577 01:15:09.960 --> 01:15:16.170 use this: you split loops and you combine loops; it's fission and fusion. 578 01:15:18.119 --> 01:15:25.590 So, fission. And if I back up a little here, this first piece is a nice, simple thing, and it's the sort of thing 579 01:15:25.590 --> 01:15:30.270 you do lots of times in parallel, and so it becomes a 580 01:15:30.270 --> 01:15:34.109 separate kernel, which is very small. 581 01:15:35.130 --> 01:15:41.279 It spawns threads, and just after you do this, you've got to synchronize, of course. 582 01:15:41.279 --> 01:15:45.779 This is the rest of the m loop and the inner n loop. 583 01:15:48.149 --> 01:15:54.119 Okay, we can play games here. We're iterating on m and n, and we're just going to swap those, and 584 01:15:54.119 --> 01:15:59.130 it'll just allow some things to be done more efficiently here. So. 585 01:16:01.319 --> 01:16:05.729 See here: m was outer and n was inner here, and now it's outer n and inner m. So: 586 01:16:05.729 --> 01:16:11.699 interchange, and this will be a prep to do some other stuff fast. 587 01:16:13.020 --> 01:16:16.229 So. 588 01:16:19.409 --> 01:16:27.510 I'm skipping over some details. Well, and this is the inner loop now; it's a kernel, you do it 589 01:16:27.510 --> 01:16:39.449 in parallel, and we're using registers here. And so the executive summary of the slide set is: 590 01:16:39.449 --> 01:16:47.609 the loop we have up here has got all these things that are getting used. 591 01:16:47.609 --> 01:16:51.600 Let me go back two slides to show you what's happening. 592 01:16:52.890 --> 01:16:56.963 Okay, here, so, okay. 593 01:16:56.963 --> 01:17:10.015 We initially had m outer and n inner, and we swapped. So in this inner loop we've got various things, x sub n, y sub n, and z sub n, and they're constant inside the inner loop. 594 01:17:10.289 --> 01:17:14.399 So, we can take these constants and pull them out, 595 01:17:14.399 --> 01:17:17.760 just inside the outer loop, and put them in registers. 596 01:17:17.760 --> 01:17:25.439 This is the sort of thing that a good optimizing compiler will do. So, these next couple of slides are 597 01:17:25.439 --> 01:17:28.739 you imitating a good compiler. 598 01:17:28.739 --> 01:17:33.149 But maybe the compilers haven't got to this stage yet automatically. 599 01:17:33.149 --> 01:17:41.489 So this may be done automatically: this loop interchange and this fission, good optimizing compilers will do on 600 01:17:41.489 --> 01:17:45.720 sequential machines; they're catching up on parallel machines still. 601 01:17:45.720 --> 01:17:51.750 Okay, so we've got this loop with x sub n and y sub n and so on. 602 01:17:54.510 --> 01:17:59.520 Pull them out and put them in registers. Oh, okay. 603 01:18:00.569 --> 01:18:06.239 So, in the loop, we are working with a lot of registers.
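A toy sketch of that interchange-plus-registers idea (not the slides' actual MRI kernel; the array names are illustrative): after the interchange makes n the outer loop, x[n], y[n], and z[n] are invariant across the inner m loop, so load them into registers once per outer iteration.

// Toy sketch, assuming this loop shape; after interchange, n is outer.
for (int n = 0; n < N; n++) {
    float xn = x[n], yn = y[n], zn = z[n];  // loop-invariant: load once into registers
    float acc = out[n];                     // keep the accumulator in a register too
    for (int m = 0; m < M; m++)
        acc += xn * kx[m] + yn * ky[m] + zn * kz[m];
    out[n] = acc;                           // one store per outer iteration
}

On the GPU, the outer n loop then becomes one thread per n, which is what makes the interchange a prep step.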
Nice, nice. 604 01:18:09.659 --> 01:18:20.579 Next thing: we're trying to find data that successive threads access and put it together. 605 01:18:21.659 --> 01:18:26.039 Coalescing it, like this. 606 01:18:28.380 --> 01:18:32.760 And playing with constant memory also. So. 607 01:18:35.039 --> 01:18:40.710 So, the inner loop again; I'm just skipping over it for you. The n loop gets efficient because 608 01:18:40.710 --> 01:18:44.489 we've chunked data together and we're using registers and so on. 609 01:18:44.489 --> 01:18:50.159 The next thing is to use the hardware sine and cosine, which are not 610 01:18:50.159 --> 01:18:58.050 IEEE compliant. They're not going to be as accurate, but they may be good enough for you. So you call them, and they will be faster. 611 01:18:59.430 --> 01:19:04.350 These are specific hardware compute units in the 612 01:19:04.350 --> 01:19:11.159 streaming multiprocessors, so if everyone's trying to do trig at the same time, there'll be some waiting. 613 01:19:12.329 --> 01:19:22.500 There's only so many of these units. And unless you're feeling confident, you might want to validate your answer. 614 01:19:23.579 --> 01:19:30.689 So, thinking of validation and being conscious of it is a current theme going on. 615 01:19:32.189 --> 01:19:37.710 Some neural net published papers have results that apparently cannot be duplicated. 616 01:19:37.710 --> 01:19:46.260 Oops. So the stuff in these papers, the published results, cannot be independently validated. 617 01:19:48.329 --> 01:19:52.050 The point here is that they're confident, so they validate their stuff. 618 01:19:53.279 --> 01:19:57.989 I am confident in the stuff I do; I encourage people to validate my published stuff. 619 01:19:59.880 --> 01:20:03.779 And they're checking speedups. 620 01:20:07.109 --> 01:20:13.289 And they got some nice speedups. Okay, speedups of a couple of hundred. So. 621 01:20:14.789 --> 01:20:17.880 On various things here: 622 01:20:17.880 --> 01:20:21.270 something took 623 01:20:21.270 --> 01:20:24.569 2,700, now takes 8, and so on. 624 01:20:26.069 --> 01:20:35.970 So, it worked. Getting it on the GPU in a naive way got maybe a factor of 10 speedup. 625 01:20:35.970 --> 01:20:39.569 Getting it on the GPU in an intelligent way, 626 01:20:39.569 --> 01:20:45.060 another factor of 30 or something, 627 01:20:45.060 --> 01:20:50.909 or 12 in this case. So these techniques, in this case, actually were useful. 628 01:20:52.500 --> 01:20:57.510 Okay, good point to stop now. 629 01:20:57.510 --> 01:21:06.359 Let's review what we did today. We saw this scan, the parallel scan operation, which is 630 01:21:06.359 --> 01:21:12.029 a useful parallel paradigm, and we saw how to do it more efficiently, 631 01:21:12.029 --> 01:21:15.300 reordering the algorithm to do it more efficiently. 632 01:21:16.439 --> 01:21:25.050 We saw some hardware layout issues with buses; those slides are a touch obsolete by now, 633 01:21:25.050 --> 01:21:31.590 so I didn't spend much time on them. We saw a new type of parallelism called task 634 01:21:31.590 --> 01:21:42.810 parallelism. I mean, in your CUDA code, the C++ code running on the host, you can have several parallel tasks running together, and they're using a concept called a CUDA stream. 635 01:21:42.810 --> 01:21:52.890 And the separate CUDA streams run independently of each other, and you can synchronize inside each of them. And then we saw an NMR reconstruction example.
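A minimal sketch of that hardware-trig tradeoff; the kernel and its names are illustrative. CUDA's __sinf and __cosf intrinsics run on the special function units, and nvcc's -use_fast_math flag swaps them in for sinf and cosf wholesale.

// Illustrative kernel: fast hardware trig, reduced accuracy in the last bits.
__global__ void phase(const float *t, float *re, float *im, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        re[i] = __cosf(t[i]);   // special-function-unit cosine, not IEEE-accurate
        im[i] = __sinf(t[i]);   // special-function-unit sine
    }
}

The accurate library versions are plain sinf and cosf; the underscored intrinsics trade the last bit or two of accuracy for speed.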
636 01:21:52.890 --> 01:22:00.510 In that example, it got hundreds of times faster by doing it on the GPU and using these techniques. 637 01:22:00.510 --> 01:22:08.369 Good point to stop. You've got to go out and get lunch, I've got to go and get lunch, and see you 638 01:22:08.369 --> 01:22:14.609 Monday. I'll hang around for a minute or two. What applications does fast math do well on, and... 639 01:22:14.609 --> 01:22:17.909 Sorry, I didn't look over for a few minutes. 640 01:22:17.909 --> 01:22:24.899 Isaac: well, it's not going to be accurate in the last bit or two. So, 641 01:22:24.899 --> 01:22:31.470 if you don't care about precise accuracy, then that'll be good. So, in this thing, 642 01:22:32.880 --> 01:22:36.960 it was approximate, but they figured their data is not that accurate anyway, probably. So. 643 01:22:38.340 --> 01:22:47.819 You might do well, maybe, to run a sample run of your program with the accurate trig and then compare it to fast math. 644 01:22:47.819 --> 01:22:51.930 And then, if it works, use fast math in the future. 645 01:22:53.880 --> 01:22:58.500 Where the fast math did not work for me: 646 01:22:59.760 --> 01:23:05.399 it's that I design algorithms that just assume that the math is good. 647 01:23:05.399 --> 01:23:12.449 I mean, I know the floating point standard carefully; I just automatically incorporate it into my algorithm design. 648 01:23:12.449 --> 01:23:24.060 So, I assume round-off works the way it's supposed to, and with the fast math it does not; that's why it broke my program. 649 01:23:26.939 --> 01:23:30.810 Other stuff? You're welcome to ask. 650 01:23:30.810 --> 01:23:33.840 Anything else? Sorry if I didn't look over quickly enough; 651 01:23:33.840 --> 01:23:37.859 my second laptop is a little off to my side, actually. So, 652 01:23:39.090 --> 01:23:43.920 if not, have a good weekend and enjoy the sunny weather. 653 01:23:43.920 --> 01:23:51.539 And I'm enjoying looking at my solar panels; so far in March they've produced about 10% more than I've used. 654 01:23:51.539 --> 01:23:57.390 But it's been sunny.
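Tying off that validation suggestion: a minimal sketch of the comparison, assuming you've copied the two runs' outputs back to hypothetical host arrays accurate[] and fast[].

#include <math.h>

// Returns the largest absolute difference between the accurate-trig run
// and the fast-math run; judge from your data whether that's acceptable.
float maxAbsError(const float *accurate, const float *fast, int n) {
    float maxErr = 0.0f;
    for (int i = 0; i < n; i++)
        maxErr = fmaxf(maxErr, fabsf(fast[i] - accurate[i]));
    return maxErr;
}

If the error is within what your data's own noise allows, fast math is probably safe for that program; if your algorithm depends on IEEE round-off behaving exactly to spec, as described above, it is not.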