Okay, good afternoon, parallel computing, class 11. Before I start, if you've got any questions, anything you'd like to ask about parallel or NVIDIA programming, or the low-level recursion there... Okay, so I'm trying to record this and upload it.

I encourage you to look at the textbook, Programming Massively Parallel Processors, which is available for free on the class account. It goes through what I'm talking about in more detail, and it's very well written. It has a couple of obsolete things in it, but they updated it a couple of years ago: the book is originally about 15 years old, but they updated it again a few years ago, so it actually mentions unified memory, which is good to see, and it mentions some of the recent changes from NVIDIA.

I've mentioned unified memory before: the host and the device are in the same address space, and it will do demand paging of pages back and forth. NVIDIA got to it in several stages; doing it requires hardware support from the Linux kernel, which is what delayed them. Before they had demand paging, they had something like demand swapping of entire variables back and forth. And before that they had a unified address space without demand paging, where you could look at the upper bits of a pointer and know where the object was; so they had something where they would copy the whole object to the other side if it was needed, and then they got to paging on demand. Before that, they had something where they would lock a host variable into real memory on the host, which would make it easier for the device to get at it, because with virtual memory on the host you never know where the object is, or whether it's in real memory at all. So they have something where you can lock a variable into real memory, and I should warn you that that restricts the amount of memory you can use as well. We've got 256 gigabytes on parallel, so I'm not particularly concerned if you block off a few megabytes. Okay.

Yes, where is the... yeah, I'll show you. So this is the GPU accelerated computing teaching kit I pulled off the NVIDIA website. There are another half dozen kits there also, which I can pull over too if you want; they cover things like machine learning and so on. So remind me, so I don't forget,
or at least I'll pull down the list of them, and anything that anyone wants, I'll pull over.

[Question: after you unzip it... I just want to know more about how unified memory works.] Yeah — unified memory. Do you know how virtual memory works? No? Okay.

So, computers have varying kinds of memory. A computer's got real memory. Like this laptop here — I can't even remember, because I've got so many laptops, whether it's got 8 gigabytes or 16 gigabytes; let's find out... Actually, it has 16 gigabytes of real memory. Now, the virtual memory would be larger — well, this one, I don't actually have any virtual memory configured, but that's a separate story; I think it's an obsolete idea in some cases.

The virtual memory is the virtual address space. You have a pointer that points to a virtual address, and the space is divided into pages that I think are 4 kilobytes, so about 1000 words. When you actually read or write an address, it pages that 4-kilobyte page into real memory, and it has a translation table to give the real memory address. So the idea is that you can address more data than you can fit into the real memory. Then it needs an eviction strategy: once it's used up all the real memory, it has a policy to evict pages — to write them back to the backing store, to then page in something else. The typical eviction strategy is least recently used: it updates a marker when each page is used, and then the one that hasn't been used in a while gets written back to the backing store — which, I forgot to say, is on disk. And the backing store is typically double the size of real memory, if you're looking for a tolerable policy. On Windows, it'll be called a page file or a swap file; there are small differences between them that we won't worry about — like, if you're hibernating the system, it may copy all of real memory out to the swap file so it can power down the memory.

Okay, so virtual memory is a very nice idea, but you see, it requires some serious hardware support.
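To make the translation arithmetic concrete: with 4 KB pages, the page number is just the high bits of the virtual address and the offset is the low 12 bits. A toy illustration (my example, not from the lecture materials):

```cuda
// Toy illustration: splitting a virtual address into page number and offset,
// assuming the 4 KB pages discussed above. The OS's translation table maps
// page number -> physical frame; the offset is reused unchanged.
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t PAGE_SIZE = 4096;        // 4 KB
    uint64_t vaddr = 0x7f3a12345678ULL;     // made-up virtual address

    uint64_t page   = vaddr / PAGE_SIZE;    // index into the page table
    uint64_t offset = vaddr % PAGE_SIZE;    // byte within the page

    printf("page %llu, offset %llu\n",
           (unsigned long long)page, (unsigned long long)offset);
    return 0;
}
```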
Another thing with virtual memory — it's been around for many decades — is that it's a way to handle devices. You have a device you're reading from or writing to — tape drive, keyboard, whatever — and it has addresses in virtual memory. The device is sitting on the address bus, watching the traffic, and if you want, say, to read the last key from the keyboard, you read a particular address; the keyboard is watching, and if it sees that address being read, it provides the contents of the last key that was pressed, let's say. Or you want to get status — is the keyboard ready? — you read that particular register address. It doesn't exist in real memory; it's an address, but the device is watching the address bus, and when it sees this address being read, it provides the status word. So there's a unifying idea: accesses to different devices, from a software point of view, are reads and writes to memory addresses, but there's no real word in the DRAM which corresponds to them — the peripheral is just sitting on the bus, watching. Obviously, there are a lot of implementation issues: the peripheral is going to be a lot slower — by how many factors, a thousand? — so you get timing issues, but it's a really unifying idea. It's really nice, and it's been around for decades. Then you get concepts like the working set: how much memory is a program actually using?

Now, there is a powerful concept in Linux with the virtual memory, which is that if you allocate a chunk of virtual memory — say a gigabyte, 10 gigabytes, whatever — Linux doesn't reserve the space on disk until you actually touch a page. So if you read or write an address — bang, Linux allocates 4K bytes in the page file. Oh, where it is in Linux: the backing store could be a separate partition, or it could be just a file. You can say: here's a 10-gigabyte file, I'll declare it to be a swap file, and now it pages back to that; or it could be a whole separate partition, or a whole separate disk. I think if you use a file, it may do some tricks to make it more efficient, like figuring out which physical blocks on disk the file uses and going straight to the disk. Okay.

So, Linux has a really intelligent way to do virtual memory, which people don't understand and don't exploit. You can allocate a humongous chunk of virtual memory, but until you actually use it, it doesn't physically use up blocks on the disk, and that's cool. So, for example, you can reserve a gigabyte for a stack, for your recursively called local functions. That's a lot of pages, but unless you actually do a lot of recursive calls, it's not actually used.
So there's no downside to reserving a very large virtual range. As I mentioned at the start of the class, if you've got a lot of separate threads, each thread might reserve, say, a gigabyte of stack for local variables. You know, when you're calling a function in C++ or C, a local variable, unless you're doing something weird, gets put on the stack. A new stack frame gets allocated when you call a new function, and the function's local variables get put on the stack; you call another one, and its frame gets put on the stack; and then when you return, they all just vanish — the stack pointer, which points at the end of the stack, just gets reset. So you don't know how many local variables you're going to have; perhaps you allocate a gigabyte of stack for each thread — no problem, it doesn't use any resources unless you use it, and that's a really cool idea. So in general, the extra level of indirection — you have a virtual address, and it goes to a real address — that extra level of indirection gives you a lot of power. Okay, so that's virtual memory. Currently, I think, when you have humongous amounts of real memory, the time it takes to page virtual objects into real space means virtual memory is not as useful.

In any case — NVIDIA now. The unified address space is 49 bits, I believe; pointers are 64 bits, so they ignore the top 15. The hardware looks at 49 bits, and the top couple of bits say where the data lives. In any case, they page between host and device — host meaning Intel Xeons; I think it wants a Xeon, I don't think it works with consumer-grade Intel, but you can correct me. The Xeons have various enterprise-type features, like better support for virtual machines, for example. So eventually the pages get paged back and forth. Now, you can imagine that if the host and the device try to write the same page at the same time, you have some consistency issues — and honestly, I don't know how NVIDIA resolves those. Maybe it's a bad idea to do it.

So now, here: if you want a variable to get paged, it has to be on the global heap, which programs have — you know, you do a malloc, or you use a vector, and it grabs space off the global heap. The heap has an internal data structure which keeps track of what's used and what's free — free lists and stuff, I think.
And if it's on the global heap, it can be unified memory. There are at least three different places you can store variables in Linux — I've got a blurb on that; I don't know if I've shared it with you yet. They can be on the global heap — that's where vector allocations and mallocs come from. They can be local variables that go on the stack. Or they can be global variables whose space is reserved at compile time: if you have a fixed-size array that's a global variable in your C++ program, it's allocated at a fixed address at compile time, and so it's part of the executable file. Now, if you don't initialize it, it doesn't take any space in the executable file, but that's what it is. So those are the different places variables go.

Which type you should use — what's the best type — changes from time to time. At the moment, the best one is the heap, but then you've got problems if you put millions of small items on the heap: that will slow things down, because it has to keep track of them. There are solutions to that. Also, the heap is not multithreaded very efficiently, so if you've got different threads, each of them mallocing stuff on the heap, that will really slow your program down. So that's an argument for putting things on a local stack. Or there's another thing — there's something in C++ that not many people realize. In C++ you construct a new object of type Foo: it grabs memory on the heap, and it runs the constructor function — if it's a non-default constructor — to build that object in the new memory it grabbed, and then it returns you the pointer. Well, the "grab the memory" and the "construct the new object" can be split into two separate steps, if you want. So you can, for example, grab the memory however you want — you can have your own memory allocator if you want. You could allocate a 10-gigabyte array and do your own sub-allocation inside that 10-gigabyte array, cutting off pieces as you need them. So you could have your own memory management, and then what you do in C++ is: you do your own memory management to reserve the actual space, and then you use something called a placement new — it's been in C++ a while — and a placement new constructs the new object, but you give it the memory; it doesn't grab the chunk of memory itself. So you can separate the memory allocation from the new-object construction, if you wish.
So if you've got lots of small objects, you allocate an array of small objects — that's one operation. So, memory management things — and why this is interesting to you is that, for a parallel program, it can have a big effect on efficiency. I've observed it with some of my research in parallel geometry: even on a Xeon, even with just a few dozen threads, the built-in heap construction routines noticeably slow things down, because the essential parts of them are mostly single-threaded. There are other things you can do: there are other versions of malloc — Google has one, for example — that are more efficient than the standard C++ one, and they're drop-in replacements. You can actually provide a different library at run time; it doesn't even involve recompiling the program, because at run time you can specify a library for dynamic loading of shared objects, and you specify Google's instead of the default one, and bang — all your memory stuff gets twice as fast, or something.

So, more than you wanted to learn about memory. Other memory questions?

I don't know how much the computer science operating systems courses talk about real-world operating systems and file systems and so on, because sometimes the academics lag behind the real world. I mean, such as: parallel has a quarter terabyte of RAM — what are the implications of that for your operating system? You think that's a lot? My big laptop at home — it's a laptop — has 128 gigabytes on it. It's a laptop that weighs 8 pounds, but it does kind of everything. And file systems: if you've got file systems on God-knows-what petabyte-size distributed disks, what does that imply for the file system? Or you're doing communication — you've got an Internet-style thing, but one of your nodes is maybe on Mars, with a 15-minute latency. What does that imply for your network algorithms? If you're doing round-trip error checking and stalling, it takes 15 minutes — so for parity codes or whatever, this has implications for what the efficient networking strategies are.

Another example: databases. In database classes, there's this ACID paradigm or something, and they teach normal forms, stuff like that. The people that do the big databases — that would be Amazon, Google, whatever — their databases, I think, don't follow that. They do their own, partly because
they don't follow the theory, and partly because they found that not doing it the theoretically nice way actually gives better performance. Like, Google may not require their database to be completely globally consistent, if it's a map or something. Okay. Other questions?

So, we're doing parallel computing, including NVIDIA as an example. I'd like to teach specifics and figure you'll induce the general principles, somewhat. I'm teaching from their slides instead of writing stuff down. I know people like professors to write stuff down in class, but it seems illogical, because I'd be learning the information from the slides and writing it down for you — I figure I'll show you the slides directly; it's like a game of telephone, and this way has fewer errors. And you can read the slides. So, you know, if you want, I can write stuff down. But also there's the technology I'm using: because I'm broadcasting the class for people that can't be here, it's hard to do flexible broadcasting together with handwriting — it involves linking together multiple laptops. Okay, so in any case, back to the material.

Okay, so we're up around module 9 now. What we're seeing now are paradigms. A paradigm is a widely used way to do something. So these are techniques for parallel programming that have turned out, from experience, to be useful. Some of them were not thought of in advance; they were just developed to solve problems. And I'll be running through the slides quickly where they don't deserve a lot of time — they have variable information density.

Well, let's see... sharing the screen — and theoretically, that is working. And let me make sure I'm recording — yeah, I'm recording. Good. Okay. If any of you are watching me remotely: if I have problems, let me know, because I cannot always determine what is happening. I actually have two other people — oh, okay, good. So, as I said, module 9. And again, I've just created a virtual file system that I unzip the modules into, and so on.

Okay, so the first thing is something called reduction, which is, um, a partial sum — you scan your way down an array and scrunch it down. The reason this is worth spending time on is that you'd be surprised at the things you can do with a parallel reduction.
It's used, for example, in things like Thrust, which we'll talk about later — Thrust is a parallel version of the STL — and in map-reduce, for example, which is a widely used paradigm; Google uses it a lot. So in map-reduce, your data structure is the vector. The map is: you apply a function to each element of the vector. That obviously can be done in parallel, because the applications of the function to different elements of the vector don't affect each other. And then the reduction: you take that vector and you scrunch it down to a shorter vector, or to one number. The obvious reduction is that you sum the vector, but there are more powerful reduction operations, and that's the point. So the first question is how to do it fast, and the second question is what it's useful for. And there are partial reductions: you cut the long vector into pieces and reduce each piece, or something.

I mentioned Google — now, some people think that Google invented map-reduce. They did not, and they don't pretend that they did, actually. The first commercial language that I can think of with map and reduce goes back to the 1960s or so. Some of these ideas have been around a long time. Virtual memory is old — one thing about virtual memory, industrially: it's been around 50 years or more. IBM has used virtual memory in commercial systems since the seventies, I think — '70 or '80. And IBM has used virtual machines — you know about things like VMware — IBM has used commercial virtual machines for 50 years, because their hardware, I think accidentally, supports it. You need hardware support to do this efficiently, and the IBM hardware has had this support for decades; I read that it was accidental, they didn't plan it. So they've had commercial systems where you have the bare-metal hardware running separate virtual machines, and that's been true for IBM for a very long time. Okay, in any case: there was an IBM language called APL with map and reduce — I don't know, 1970, give or take. Okay. And Hadoop is a popular distributed thing; Hadoop by itself, I think, is obsolete by now — it's not my specific area, but there are things that have replaced it.

So from this slide — you'll understand more later, but reduction is a fundamental technique that you can do other things with. So, again: you take an array and you reduce it to the maximum, or the minimum, or the sum, or something.
In mathematical terms, these arithmetic operations are part of what abstract algebra calls a field. A field is an abstract way to model things like the rational numbers or the real numbers, which have operations where, for example, the order you do an addition in doesn't matter: a + b = b + a, and so on. So the concept of a field is an abstraction for these operations that the rational numbers, the real numbers, the complex numbers, and so on can do. Now, as a side point: of course, floating-point numbers don't satisfy these rules — floating-point arithmetic is not associative. On those machines, a + b + c depends on where you put the parentheses. But we'll forget about that part for now.

Okay. So, to do this in parallel: the obvious thing, if you want to sum this array in parallel, is to sum pair by pair, and then pair up the results, and pair up again. If the array is n elements, it takes log base 2 of n steps; if it's 8 elements, it takes 3 steps, and so on. That's the obvious way to do it in parallel. Now, one thing: whenever I say something is obvious, most of the time I follow it by saying "but wrong" — or, it works, but it's not efficient. It's like a tournament. You can do a quick analysis: it takes about 2n operations total. But okay, the next few slide sets will show you more efficient ways to do it.

Some things we're worried about — I'll come back to this slide. So, we're talking CUDA, and threads are grouped into warps: 32 threads make a warp, and the threads in the warp are SIMD — they're synchronized in lock step. So that hierarchical tree I showed you: how would you implement that on a machine that has warps of 32 threads? That's the problem we want to think about. So there's this control divergence concept: the 32 threads in the warp have to do the same instruction, or be turned off — that's a separate thing NVIDIA has. So it's SIMD within the warp, because it's synchronous. NVIDIA also has a concept of single program, multiple data — SPMD — instead of SIMD, and that expresses the idea that the different threads in the thread block are following the same program, but they're not synchronized in lock step, because the different warps are scheduled independently of each other. The different warps of threads in the thread block — if you paid a lot of money for expensive hardware, they may run at the same time; if you didn't pay much money, they may run one after the other. So the different warps are running the same program, but at different times.
Anyway — you synchronize them with sync threads. Okay, but we've got this divergence thing. So the first question up there is: how do we assign operations to the threads? I'm going to hold that till we see it. Well, one other thing they mention up here is that I only have so many threads in a block — perhaps 2000, whatever; it depends.

Okay, so here's the way we talked about before. You're summing that long array: thread 0 sums 2 elements, thread 1 the next 2 elements, and so on. Then after the first step, your subtotals are in every 2nd element. Then you sum those, and after the 2nd step the subtotals are in every 4th element, then every 8th element.

Okay, so what are the issues with this? First, if you've got more than a certain number of elements: this uses one thread per 2 elements, so suppose you can have 2000 threads in the block — that depends on the implementation — then your vector could not be more than 4,000 long, because you need a thread for every 2 elements. More than 4,000 elements needs more than 2000 threads, and more than 2000 threads won't fit in a thread block. That's your first issue.

The second issue is, again, this warp idea. At the first step here, fine, all the threads are in use. Well — there's one more issue here; let me just go back to the start. I've mentioned a couple of times that you'd like to have data coalesced: you'd like adjacent threads to access adjacent elements of the global memory. But here, adjacent threads, in the first step, are accessing elements that are 8 bytes apart, not 4 bytes apart — each variable is 4 bytes. So right at the start, you're not optimizing your coalescing. That's one issue there.

Then we move down to the next stage, and at the next stage you've got a bigger stride, and it's even less efficient with coalescing, and so on. And when we get down to, say, step 2, we're not even using all the threads. In step 1, you're using only the even-numbered threads; in step 2, you're using every 4th thread, perhaps — you see, thread 0 will add 11 and 14 to make 25. So as you go down the tree, you're using a smaller and smaller fraction of the threads. So you're really using
the parallel hardware less and less efficiently. Now, you know, it's not the be-all and end-all to use every CUDA core all the time, but still: you look at this thing — this process is log n levels deep, and basically, for most of the computation of the reduction of this array, most of the threads are not being used. But they're in the middle of a warp, so you can't use them for something else. So in this naive way, in the first step each thread adds two adjacent items; then after that, half the threads add items that are one step apart, and a quarter of the threads add items that are four steps apart, and so on. The naive, obvious thing has horrible thread divergence as you go down the tree, and it also doesn't use the global memory efficiently. Those are the two big problems with this simple way. So, what are we going to do about it? You see the issue here.

And again, the problems we're talking about are general issues; it's not just this one specific thing. This is one specific program — who cares about it — but these are general ideas, and that's why we spend time on it. Okay. So, problems with the simple naive method — here it is in words, what I just told you: after each step, half of the remaining threads are not needed, and one of the inputs comes from farther and farther away. Farther and farther away means you're going to have latency — it's not adjacent; they say the latency could be a thousand cycles. Okay.

I don't have to look at the code unless you want — this is how you would program it. They're using shared memory here, which is nice. Again, it's something like 48K bytes, but if we're talking, say, 2000 elements at 4 bytes each, that's 8,000 bytes; so all the threads in the block could load that vector into fast shared memory, and then the 2000 threads start working on it, and there's much less latency at that point. Okay — reduction steps: they've got sync threads, and I won't go through the code line by line; you can read it. They've got multiple sync threads in it. You've got those multiple stages, and you have to synchronize before you do the next stage of the reduction, because, again, the different threads may be running one after the other, if there are more threads than there is hardware. So you have to sync — and if you don't sync, well, you know, see what happens.
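Putting that in code: here's roughly what the naive kernel looks like — my reconstruction from the description, not the slides' exact code. partialSum and BLOCK_SIZE are assumed names, and the block is assumed to have BLOCK_SIZE threads handling 2*BLOCK_SIZE elements:

```cuda
#define BLOCK_SIZE 1024

__global__ void naiveReduce(const float *input, float *output, int n) {
    __shared__ float partialSum[2 * BLOCK_SIZE];
    unsigned t = threadIdx.x;
    unsigned start = 2 * blockIdx.x * blockDim.x;

    // Each block loads its 2*blockDim.x elements into fast shared memory.
    partialSum[t] = (start + t < n) ? input[start + t] : 0.0f;
    partialSum[blockDim.x + t] =
        (start + blockDim.x + t < n) ? input[start + blockDim.x + t] : 0.0f;

    // Stride doubles each step: 1, 2, 4, ... The surviving threads scatter
    // across the warps, which is exactly the divergence problem above.
    for (unsigned stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();                       // previous step must finish
        if (t % stride == 0)
            partialSum[2 * t] += partialSum[2 * t + stride];
    }
    if (t == 0) output[blockIdx.x] = partialSum[0];   // one subtotal per block
}
```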
Okay. What they're talking about here is — I said you could maybe have only perhaps 4,000 elements in a thread block, if a thread block can have 2000 threads. So what if you've got more than 4,000 elements? Well, then you have multiple thread blocks — you could have hundreds of thread blocks — and each thread block takes 4,000 elements, and each thread block operates independently of the other thread blocks. I mean, they can all read from global memory; read orders don't matter. So each thread block processes its 4,000 elements and reduces them to a sum. And again, the thread blocks are all independent of each other; they don't talk to each other — you can't even sync them. But finally, all the thread blocks have finished, and each thread block has got a subtotal. Now you bring them all together, and that's what they talk about down here. So you're going to have to do some sort of synchronization to know that all the thread blocks are done — but it's only once, and there are different ways you could do it. By the way, if you're wondering about the overhead to move data to the device, do some work, and bring it back: even with that vector addition, I think, on parallel, it speeds things up, perhaps, to move it to the GPU. So okay, that was number 2.

Okay, we want to do better. We need to reduce control divergence and so on. Let me get past this and go to the picture. So here's a better way to do it — you can read the words; I'll lecture from the picture. What we saw before, in the previous picture, was that thread 0 added elements 0 and 1, thread 1 added elements 2 and 3, and so on, and as we went down the diagram, there was a bigger and bigger stride. Here is a new way, which is better: thread 0 adds elements 0 and 4, thread 1 adds elements 1 and 5, thread 2 adds elements 2 and 6, and thread 3 adds elements 3 and 7. So we start out with the threads having a big stride of 4 between the two elements that they add, and the stride gets smaller as we go; at the beginning, with the larger stride, there are more threads active. So why is this better? In the old way, the threads that are still executing end up scattered throughout the warps, whereas with this, you see, the threads that are still active are all together.
So they're all in one warp — I guess that's the big reason this is better. You see here, after the first step, half the threads no longer have anything to do, but the half that are active are all packed together, and the half that are inactive are all packed together. So the inactive ones can be doing something else — I guess that's the big reason. [Student: but couldn't it be worse?] You know what could be wrong... yeah, but that's the big thing, I think. [Student: the number of active threads gets smaller by half every time, at the beginning and at every level, in both versions.] Yeah — so these threads here, if they can do something else — well, the resources that would be dedicated to those threads can do something else. So I guess that's it.

But you're packing — yeah, and the other thing is, if we look over time, the remaining active data is packed closer and closer together. So converging is better than diverging — when it's packed closer together, there's less of it. Let me go back to the previous thing here, which I have somewhere... Yeah, so if we do this, then — I don't know, it might take a more detailed analysis why, and who knows, they could be wrong, but I think they're right on that. [Student question about creating gaps.] Well, right at the end you could start creating gaps, but in the old way you've got all these gaps between active threads, and that's what you don't like. And they would claim even the first row of the old way is not incredibly efficient, because the hardware likes having adjacent threads reading data that's 4 bytes apart, not data that's 8 bytes apart. Yeah — that would be the issue, and if we go back to the other one: further down, the data's not even 8 bytes apart, but a large number of bytes apart.

Well, here, they could say the coalescing is actually better. Look at this: each thread reads 2 words from memory — which, if you can put it in shared memory, great; if not, it's global memory. So each thread has 2 arguments at that first stage: for all the threads, the first arguments are adjacent, and for all the threads, the second arguments are adjacent.
So it actually is fully coalesced, you see: for all the threads, the second arguments are adjacent, and also the first arguments. In that sense, if you have to go to global memory, you're reading chunks of 128 bytes, and you've basically got two sets of active chunks. Yeah — and I don't know the details of how the cache works, and it changes with the generation, so I don't know what the size of a cache line is. But right — so that's the argument for this over the other one.

Okay, but there is a deeper point underneath what you're asking, which is that there's an optimum amount of optimization. You can spend too much of your life trying to micro-optimize a program, and it's not worth it, for two reasons. There's an optimum: you can do too little — I see people doing n-squared algorithms when they could be doing n log n — but you can also do too much, for two problems. One: your time may be worth more than the machine time. The GPU on parallel was worth 2000 dollars or so, and it's got a lifetime of 5 years or something — say 50,000 hours. So the capital cost of that GPU on parallel — and it was about the best I could buy two years ago — is maybe 10 cents an hour. And as validation, you can rent one on Amazon for — I can't remember — less than a dollar an hour. So how much money are you going to make when you get out of here? You see the trade-off. Okay.

The second point is that as time goes on and NVIDIA updates the hardware, what was a good optimization in the past may hurt you tomorrow. So, the Maxwell generation had very few double-precision cores; you would want to rewrite your code to use single precision, perhaps, if you cared. That's the stuff I had to do as a student — in fact, the double-precision hardware did not exist on the machines I used, and there are a lot of cool techniques to do double-precision work with single-precision hardware. They're obsolete now. So you don't necessarily want to spend too much time over-optimizing, because of next year's hardware — like, they embed cache. And some of the ideas you read about — all this stuff with the CUDA copy functions, basically: obviously, if you've got unified memory with paging, a lot of that stuff people spent time on is obsolete now. It's creative disruption. Okay, so the question is, why am I spending time here?
Apart from the fact that I learned it, and want a way to amortize the time I spent learning it? No. It's a powerful idea; it's worth learning the idea, I think, about how you have to concentrate things together. That's why we spend time on it. Okay. I'll skip the code.

And they talk about, say, 1024 threads in the original block — which is more obvious than my 2000, I think. Then for the first several stages, there's no divergence. For the last 5 stages, there'll be some divergence, but when you have the divergence, you've got a lot fewer warps — you've got divergence over a smaller number of threads. So, not bad. So that was module 9, telling you about a different way you can do the parallel reduction that has better coalescing and less divergence. Okay. Let's see some more of this.

Okay. Now, by the way, one thing for the reduction that the slides did not mention: instead of just reducing the whole array down to one element, you can put fence posts in the array — it'd be like a stencil vector, another vector that puts fence posts in the array — and you do the reduction, but you don't reduce over a fence post. So the fence posts cut the bigger array into multiple little pieces of varying length, and you reduce each little piece in parallel, still taking log n time. And this is a way, for example, to do run-length encoding: you have this big array and you want to run-length encode it, so you put a fence post wherever it changes from 0 to 1 or back again — and that can be computed in parallel — and then you do the parallel reduction, and the output of that is the length of every run and where it started, perhaps. So that's an example of where the reduction is more powerful than you might think: you're doing a parallel run-length computation.

Okay. What we're seeing here is another very common operation in parallel computing: the scan, also called a prefix sum. Again, it's the basis for a surprising number of things — such as, you have a list of run lengths, and the scan operation will tell you where each run starts. Or if I've got a list of buckets of varying sizes, because I'm doing a hash — sort of like using a grid, where each grid element has a different number of items in it — and I want to lay out all the elements sequentially, a scan will tell me where each bucket starts. And it does it in parallel. Okay.
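To make the fence-post idea concrete, here's a tiny sequential illustration of mine (made-up data): mark a fence post wherever the value changes, and a scan of the flags numbers the runs — and both the flagging and the scan parallelize:

```cuda
#include <cstdio>

int main() {
    int a[12] = {0,0,0,1,1,0,0,0,0,1,1,1};
    int flag[12], runid[12];

    flag[0] = 1;                           // a run starts at element 0
    for (int i = 1; i < 12; i++)
        flag[i] = (a[i] != a[i-1]);        // fence posts (each independent)

    runid[0] = flag[0];
    for (int i = 1; i < 12; i++)
        runid[i] = runid[i-1] + flag[i];   // inclusive scan of the flags

    for (int i = 0; i < 12; i++) printf("%d ", runid[i]);
    printf("\n");   // 1 1 1 2 2 3 3 3 3 4 4 4: each element's run number
    return 0;
}
```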
Let me demonstrate — okay, let's go by the figures down here. This shows you these types of scans, like the STL's. The input is this vector: 3, 1, 7, 0, 4, 1, 6, 3. In the output, the i-th element is the sum of the first i input elements: 3; then 4 is 3+1; 11 is 3+1+7; and 25, at the end, is the sum of everything — it includes the whole scan. Okay, and obviously you can do it sequentially, going down the input vector. What you might want to think about is: if you already have the input vector on the GPU, imagine how you might do the scan in parallel, taking log n time instead of linear time. Think about how you might do that: you'd have some sort of tree you'd build up — you scan little chunks, and you combine the chunks and propagate, and in log n time you've done your scan on the whole vector. In any case, okay.

So one application would be, like I said: if you've done a run-length encoding and these are the lengths of the runs — the first run is 3 elements, the second run is 1 element, the third run is 7 elements, the fourth run doesn't exist — the scan will tell you where each run starts after you've expanded the run-length encoding. Okay. That's called an inclusive scan. There are lots of examples. So how are you going to do it? There are lots of applications — you'd be surprised what applications you can use it for. Radix sort is a nice one — I won't spend time on this, but radix sort is in many cases much faster than quicksort. Quicksort is the favorite of CS courses because it's recursive, and computer scientists love recursion; radix sort does not have recursion, but radix sort is as fast. Okay. Things like that.

Ignore the applications; ignore this, ignore this... Sequentially, it's obvious how you would do it — I won't spend time on this: you compute the i-th subtotal from the previous one, but that's not faster than sequential. Okay, so the content of this slide set was to introduce you to the scan operation. That was number 1 — it took them 10 slides to introduce the scan operation.
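For reference, the sequential inclusive scan of the slide's example is just one loop:

```cuda
#include <cstdio>

int main() {
    int x[8] = {3, 1, 7, 0, 4, 1, 6, 3};   // the slide's input
    int y[8];
    y[0] = x[0];
    for (int i = 1; i < 8; i++)
        y[i] = y[i-1] + x[i];              // y[i] = sum of x[0..i]
    for (int i = 0; i < 8; i++) printf("%d ", y[i]);
    printf("\n");                          // 3 4 11 11 15 16 22 25
    return 0;
}
```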
Okay. Well, what we're getting to here is a step on the way to doing it in parallel. Here's our input, called XY for some reason. And what we do is: in parallel, every thread adds its element to the previous element — except for thread 0. So thread 0 adds 3 to nothing; thread 1 adds 3 and 1; thread 2 adds x1, which is 1, and 7; thread 3 adds 7 and 0; then 0 and 4, 4 and 1, 1 and 6, and 6 and 3. So every thread did an addition. Okay. Now, we repeat this log n times and we'll have all the subtotals. So that's step 0, or 1, depending on how you're counting, of this parallel prefix sum — prefix sum and scan are synonyms; it depends what you want to call it. Okay. So the first step takes constant time: we've just added pairs of adjacent elements.

Then we do it again, except with the next stride. So, counting from 1: in stride 1 we added adjacent pairs; in stride 2 we add elements that are two apart — that's what the stride is, the stride is the size of the jump. So, stride 2: the first thread does nothing, it just propagates; the second thread does nothing; the third thread adds its own element to the element 2 earlier. The last thread adds its own element to the element 2 earlier — 9+5 equals 14; and this thread: 4 plus the element 2 earlier, which is 8, equals 12. So what we've done is, we now have the first 2 elements of the output. We've lost a little coalescing here — and we're going to lose more and more — but we do have full thread efficiency: all the threads are doing the same thing, those first two aside, and if we can use shared memory, we're cool.

Next time, stride 4: each thread adds its own element to the element 4 back — except that if 4 elements back would fall off the end of the array, it does nothing. So this element does nothing, and this one — 4 elements back, there's nothing — nothing, nothing; then this one adds the element 4 back: the 15 is 12+3, the 16 is 12+4, the 22 is 11+11, and the 25 is 14+11 — and now we've got the next stage. And you can guess: then we'd have stride 8, and so on. And after this step, the first 4 elements of the output have been computed. So eventually these first few threads are going to be idle, but it just takes log n steps, and the thing is, the threads are mostly being used efficiently. We don't have the divergence problem here.
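In code, that algorithm (it's the Kogge-Stone scan) comes out something like this for one block — my sketch following the slides' structure, not their exact code. XY is the slides' array name; SECTION_SIZE is assumed to equal the block size; and the two sync points per step are what the next slides discuss:

```cuda
#define SECTION_SIZE 1024

__global__ void koggeStoneScan(const float *X, float *Y, int n) {
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    XY[threadIdx.x] = (i < n) ? X[i] : 0.0f;

    // Stride doubles each step: 1, 2, 4, ... log2(blockDim.x) steps in all.
    for (unsigned stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();                      // previous step must finish
        float temp = 0.0f;
        if (threadIdx.x >= stride)
            temp = XY[threadIdx.x - stride];  // read before anyone overwrites
        __syncthreads();
        if (threadIdx.x >= stride)
            XY[threadIdx.x] += temp;          // add the element `stride` back
    }
    if (i < n) Y[i] = XY[threadIdx.x];
}
```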
547 00:53:50.760 --> 00:53:54.630 the two reads are nicely coalesced, you see. 548 00:53:54.630 --> 00:53:58.289 So each thread here takes 2 arguments: 549 00:53:58.289 --> 00:54:04.050 for each thread, the 1st argument is its corresponding element, and the 2nd 550 00:54:04.050 --> 00:54:08.550 argument is 4 elements to the left, for each thread, so it coalesces 551 00:54:08.550 --> 00:54:15.210 nicely. So we have coalescing of the input data, 552 00:54:16.260 --> 00:54:21.869 if it matters: if we're reading from shared memory it doesn't matter, but if we're reading from global memory it matters. 553 00:54:21.869 --> 00:54:28.199 And the threads are being used efficiently, because the active threads are adjacent. 554 00:54:28.199 --> 00:54:33.869 As we go on, more and more threads are inactive, but the inactive ones are adjacent too. 555 00:54:33.869 --> 00:54:37.920 So we're using threads efficiently, and we're using 556 00:54:37.920 --> 00:54:42.599 contiguous memory efficiently. 557 00:54:43.800 --> 00:54:47.519 Um. 559 00:54:52.800 --> 00:54:57.000 Okay, here's a big point here; let me go back. 560 00:54:57.000 --> 00:55:01.829 So, this thing, um, I didn't mention yet: 561 00:55:03.659 --> 00:55:12.659 in the 1st step each thread adds a pair of elements, and then in the 2nd step each thread adds elements again, but the inputs 562 00:55:12.659 --> 00:55:19.320 to the 2nd step are outputs from the 1st step. So you've got to do a __syncthreads() here at every step, 563 00:55:19.320 --> 00:55:26.730 because you don't know that all of the threads in a line here are running at the same time; because again, you have more threads 564 00:55:26.730 --> 00:55:31.469 than CUDA cores: you've got a bigger array than the hardware. 565 00:55:31.469 --> 00:55:38.489 The threads on a nice horizontal line in the diagram might be running one after another; the different warps, again, might run 566 00:55:38.489 --> 00:55:41.820 one after another. So you need a __syncthreads() every step. 567 00:55:42.960 --> 00:55:47.039 And that's what this slide talks about: the synchronization. 568 00:55:47.039 --> 00:55:53.340 So, I'll ignore this. 569 00:55:53.340 --> 00:55:58.829 Um. 570 00:56:01.260 --> 00:56:06.059 So, what's their complaint here? Let me go back to the figure. 571 00:56:08.250 --> 00:56:13.110 The complaint is that if the 572 00:56:13.110 --> 00:56:21.929 vector is n elements long, it takes log n stages: log to the base 2 of n. 573 00:56:21.929 --> 00:56:27.150 But the total number of additions is about n log n, because 574 00:56:27.150 --> 00:56:32.519 at the k-th stage you're reducing the number of additions only by 575 00:56:32.519 --> 00:56:41.010 2 to the k. So the total number of additions here 576 00:56:41.010 --> 00:56:51.449 is n-1 plus n-2 plus n-4 plus n-8, and that sums up to 577 00:56:51.449 --> 00:56:58.889 about n log n minus n if you subtract out the idle threads; and if you keep counting the idle threads, 578 00:56:58.889 --> 00:57:02.099 it's n log n work. And they 579 00:57:02.099 --> 00:57:05.130 want to see if they can do better. But in any case, 580 00:57:05.130 --> 00:57:08.519 the contribution of this slide set here 581 00:57:08.519 --> 00:57:12.719 is that we do this parallel scan in log n stages.
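Putting those pieces together, stride doubling plus a __syncthreads() between steps, here is a sketch of the naive one-block scan kernel. It is my reconstruction, not the slides' code, and it assumes the whole vector fits in one block's shared memory (n at most 1024, one thread per element):

```cuda
// Naive stride-doubling (Hillis-Steele style) inclusive scan for ONE block.
// A sketch under the assumptions above; the kernel name is mine.
__global__ void scan_naive(const float *in, float *out, int n) {
    __shared__ float buf[1024];
    int i = threadIdx.x;
    if (i < n) buf[i] = in[i];      // stage the input in shared memory
    __syncthreads();

    for (int stride = 1; stride < n; stride *= 2) {
        float left = 0.0f;
        if (i >= stride && i < n)
            left = buf[i - stride]; // read BEFORE anyone overwrites it
        __syncthreads();
        if (i >= stride && i < n)
            buf[i] += left;         // add the element `stride` to the left
        __syncthreads();            // finish this step before the next stride
    }
    if (i < n) out[i] = buf[i];
}
```

Note the two __syncthreads() per step: one so every thread reads its left neighbor before it gets overwritten, and one so the whole step finishes before the next stride begins, which is exactly the synchronization point made above.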
582 00:57:12.719 --> 00:57:17.250 And so in time log n, using n threads at each step, we've 583 00:57:17.250 --> 00:57:22.110 done the scan. Okay. 584 00:57:22.110 --> 00:57:25.739 That was the point here. And, 585 00:57:25.739 --> 00:57:31.440 questions? This was number 2. 586 00:57:32.460 --> 00:57:39.449 How can we do it better? 587 00:57:39.449 --> 00:57:42.719 Okay, um. 588 00:57:44.219 --> 00:57:47.969 So, what we're going to do now is get trickier and 589 00:57:47.969 --> 00:57:51.090 more efficient, 590 00:57:51.090 --> 00:57:54.329 but our algorithm gets weird. 591 00:57:55.590 --> 00:58:01.650 And, um, 592 00:58:04.139 --> 00:58:08.760 you can read this; I'll 593 00:58:14.605 --> 00:58:23.844 take it a little quickly. Um, it's a little weird. 594 00:58:23.875 --> 00:58:30.625 I'm going to skip over some details, but: they sum some adjacent things, and then some things that are 2 apart, and some things that are 4 apart, 595 00:58:30.900 --> 00:58:34.380 summing toward the rightmost element. At the end, in 596 00:58:34.380 --> 00:58:40.619 log n stages, they've got the grand total, um, 597 00:58:43.050 --> 00:58:48.840 but they don't necessarily have all the subtotals yet. That's the issue here. 598 00:58:48.840 --> 00:58:52.019 So, this thing now: 599 00:58:52.019 --> 00:58:55.320 this is only half the total algorithm. 600 00:58:55.320 --> 00:59:01.710 They don't have all of the scan subtotals; they've got weird little partial subtotals, 601 00:59:01.710 --> 00:59:07.260 like the sum of elements 2 and 3, the sum of 4 and 5, the sum of 4 through 7, and so on. 602 00:59:07.260 --> 00:59:13.019 Then we sum some of these weird little subtotals and get what we want. 603 00:59:13.019 --> 00:59:26.400 It's a 2-stage operation. Um, the argument in favor of it is that it does fewer additions; it will use a smaller total amount of work. 604 00:59:26.400 --> 00:59:34.230 So, you see, the thing with this 1st stage is we don't have as many threads being active. 605 00:59:34.230 --> 00:59:41.219 Okay, um, at each stage we have 606 00:59:41.219 --> 00:59:51.360 basically half as many threads being active as at the previous stage. So if you ignore thread divergence for the moment (we'll get back to it), 607 00:59:51.360 --> 00:59:57.420 this 1st stage here, which is half the total algorithm, does a lot fewer additions. 608 00:59:57.420 --> 01:00:00.900 Oh, 609 01:00:00.900 --> 01:00:07.590 and, um, 610 01:00:07.590 --> 01:00:15.750 I'll skip through it somewhat, and then we do another weird multi-stage thing; I'll just wave my hands. Okay, if you want, 611 01:00:15.750 --> 01:00:22.050 this slide puts all the pieces together. Um, this is the 1st stage up here: 612 01:00:22.050 --> 01:00:25.110 we did this thing where we 613 01:00:25.110 --> 01:00:30.059 sum some of the data elements and some we didn't, you see. 614 01:00:30.059 --> 01:00:36.480 Before, every thread was summing elements 2 apart, then every thread was summing elements 4 apart, and so on. 615 01:00:36.480 --> 01:00:45.900 Here, only the even-numbered threads sum elements 2 apart, and only the threads that are multiples of 4 sum elements 4 apart, and so on. So in the 1st stage 616 01:00:45.900 --> 01:00:49.079 we have fewer total additions, 617 01:00:49.079 --> 01:00:54.750 which will be good if we find a way to handle the thread divergence, which will be next. 618 01:00:54.750 --> 01:01:00.690 That's the 1st half; then in the 2nd half we do another bit of propagation where, again,
619 01:01:00.690 --> 01:01:04.409 there are fewer threads doing additions. So, 620 01:01:04.409 --> 01:01:07.619 at the bottom of this, we've done a parallel scan. 621 01:01:07.619 --> 01:01:15.239 And although I skipped over the details, we've done the parallel scan with a lot less total work, 622 01:01:15.239 --> 01:01:18.989 if we can find a way to pack it into a small number of 623 01:01:18.989 --> 01:01:25.320 contiguous threads, a small number of warps. So this is an efficient way. The broader lesson here is 624 01:01:26.519 --> 01:01:31.500 how you assign the work to threads, for example so that reads coalesce, 625 01:01:31.500 --> 01:01:34.769 so that the active threads are in the same warps, 626 01:01:34.769 --> 01:01:41.730 and so that memory is still accessed contiguously, 627 01:01:41.730 --> 01:01:46.230 the coalesced way; so that also works. Um, 628 01:01:46.230 --> 01:01:49.619 there are lots of ways to map work onto threads. 629 01:01:51.030 --> 01:01:56.070 So, okay, this slide captures the executive summary. 630 01:01:56.070 --> 01:02:04.199 Um, we took the previous slide set's algorithm and reordered everything, so there's a smaller total number of additions. 631 01:02:04.199 --> 01:02:10.800 The previous thing: at every step, all the threads were doing an operation, except the threads at the edge. 632 01:02:10.800 --> 01:02:17.250 Um, here, at each successive step, only half as many threads are actually active. 633 01:02:17.250 --> 01:02:22.469 So we reduced the number of active threads: more efficient use of the hardware. 634 01:02:22.469 --> 01:02:27.000 But it's more complicated: it's a 2-stage operation that 635 01:02:27.000 --> 01:02:30.090 accumulates some stuff and then propagates some stuff (see the sketch just after this passage). So, 636 01:02:32.460 --> 01:02:35.670 but we've left out the issue of warps, which we'll get to. Okay, so 637 01:02:35.670 --> 01:02:42.000 that's the executive summary: the algorithm gets more complicated, the hardware use gets more efficient. 638 01:02:43.829 --> 01:02:49.289 And, 639 01:02:49.289 --> 01:02:55.079 um, 640 01:02:56.099 --> 01:03:02.130 the work analysis of it: it cuts the number of additions down from n log n to 2n. 641 01:03:03.300 --> 01:03:10.500 Um, so. 642 01:03:10.500 --> 01:03:20.190 And again, if the vector is larger than the number of threads in a block, we partition it among multiple blocks. 643 01:03:20.190 --> 01:03:23.760 Okay, this takes some thinking here. 644 01:03:27.360 --> 01:03:30.840 We do the scan over the threads in each block, 645 01:03:30.840 --> 01:03:35.519 and we also get, at the end, the total of the elements in each block. Then we, um, 646 01:03:38.940 --> 01:03:43.710 we bring all of the block totals into 1 block, and we do a scan over that. 647 01:03:43.710 --> 01:03:52.349 We distribute it back among the blocks, we update stuff, and at the end of it we've done a scan operation over a very long vector that we partitioned into blocks. 648 01:03:52.349 --> 01:03:55.920 Well, I guess, 649 01:03:55.920 --> 01:04:00.360 and very large inputs. Um, okay, here's what 650 01:04:00.360 --> 01:04:09.690 I just told you: our goal is to scan a vector which does not fit in a block, because a block can do, say, a thousand elements, 651 01:04:09.690 --> 01:04:13.050 um, what, 2000 maybe. 652 01:04:13.050 --> 01:04:16.110 So, a very big vector, 653 01:04:16.110 --> 01:04:21.869 big being very long. We split it up among the thread blocks. And again, the thread blocks can't talk to each other.
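Before going on to multiple blocks, here is a sketch of that two-phase, work-efficient scan for a single block; again my reconstruction, not the slides' code, assuming n is a power of two that fits in shared memory. Notice the index arithmetic: at every step the active threads are threads 0, 1, 2, ... in a row, so the shrinking amount of work stays packed into a few contiguous warps, which is the divergence fix just mentioned. Each thread block in the multi-block scheme that follows could run something like this:

```cuda
// Work-efficient (Brent-Kung style) inclusive scan for one block: a sketch
// assuming n is a power of two, n <= 1024, one thread per element.
__global__ void scan_efficient(const float *in, float *out, int n) {
    __shared__ float buf[1024];
    int t = threadIdx.x;
    if (t < n) buf[t] = in[t];
    __syncthreads();

    // Phase 1, accumulate: a reduction tree. Half as many additions each
    // step, building the "weird partial subtotals" (sums of 2, of 4, ...).
    for (int stride = 1; stride < n; stride *= 2) {
        int idx = (t + 1) * 2 * stride - 1;  // thread t handles index 2s(t+1)-1
        if (idx < n) buf[idx] += buf[idx - stride];
        __syncthreads();
    }

    // Phase 2, propagate: push the partial sums back down the tree so
    // every position becomes a complete prefix sum.
    for (int stride = n / 4; stride >= 1; stride /= 2) {
        int idx = (t + 1) * 2 * stride - 1;
        if (idx + stride < n) buf[idx + stride] += buf[idx];
        __syncthreads();
    }
    if (t < n) out[t] = buf[t];
}
```

Phase 1 does about n/2 + n/4 + ... additions, a bit under n, and phase 2 about the same, for roughly 2n total, versus about n log n for the naive version.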
654 01:04:21.869 --> 01:04:25.710 Each thread block does a scan. 655 01:04:27.690 --> 01:04:33.840 And the last element of that scan is the total of all the elements in that block. 656 01:04:33.840 --> 01:04:37.289 We bring all of these totals together into one 657 01:04:37.289 --> 01:04:42.389 array, which will probably fit in shared memory if you've got, 658 01:04:42.389 --> 01:04:45.719 with 48 kilobytes, fewer than about 12,000 blocks. 659 01:04:47.550 --> 01:04:54.539 And so each block will then take its total and copy it into this one global 660 01:04:54.539 --> 01:04:59.670 array, and then we do a scan of that array. 661 01:05:00.840 --> 01:05:08.610 And again, this array here has one element per block, a thousandth as many elements as the original vector. So this array is probably 662 01:05:08.610 --> 01:05:14.429 small enough to fit; well, unless you're scanning a billion elements, maybe. 663 01:05:14.429 --> 01:05:19.289 And then we scan this, we get a pile of subtotals, and we distribute them 664 01:05:19.289 --> 01:05:24.869 back to the blocks, and we update all of the blocks' scan values. At the end of it, 665 01:05:24.869 --> 01:05:30.449 we've got the scan of the original big array. So, 666 01:05:30.449 --> 01:05:33.659 so you've got a sense of it. Um, I may 667 01:05:33.659 --> 01:05:36.780 work an example by hand next time. 668 01:05:38.429 --> 01:05:44.699 The problem is, like I said, it's very hard for me to simultaneously project stuff, 669 01:05:44.699 --> 01:05:51.929 say PDF files, and also write stuff by hand; I can't really do both on the same machine. So, well, 670 01:05:53.099 --> 01:05:56.369 I'll have to do more research to find some more tools. Okay. 671 01:05:56.369 --> 01:06:02.639 In any case, the executive summary: the point of this was to scan really long arrays, in 672 01:06:02.639 --> 01:06:06.690 three stages (there's a host-side sketch of the three stages just after this passage). So now, 673 01:06:06.690 --> 01:06:16.559 here's a minor thing. The scan I showed you before, where the i-th output is the sum of the first i inputs, is called an inclusive scan. 674 01:06:16.559 --> 01:06:20.610 There is a small variant called an exclusive scan, 675 01:06:20.610 --> 01:06:24.539 where the output is shifted right by 1. 676 01:06:24.539 --> 01:06:29.909 So the i-th output is the sum of only the first i-1 677 01:06:29.909 --> 01:06:34.559 inputs. The 1st output is always 0; the 2nd output 678 01:06:34.559 --> 01:06:39.869 is the 1st input; the 3rd output is the sum of the first 2 inputs; and so on. 679 01:06:41.909 --> 01:06:50.789 A totally trivial difference; why do they have a separate name for it? Because there are times when the exclusive scan is the useful one. So, 680 01:06:50.789 --> 01:06:55.110 yeah, and, 681 01:06:55.110 --> 01:06:58.260 no matter; you know, 682 01:06:58.260 --> 01:07:02.130 exclusive scan, inclusive scan: variants of the same thing. 683 01:07:02.130 --> 01:07:07.769 Okay, so the information content of this slide set is in this figure here: 684 01:07:07.769 --> 01:07:11.130 it's how to do the scan, the partial sums, over 685 01:07:11.130 --> 01:07:14.400 a multi-block vector. 686 01:07:16.260 --> 01:07:22.920 Yes, and again, the scan is a parallel operation which has a lot of uses: 687 01:07:22.920 --> 01:07:29.670 um, allocating lots of little arrays of different sizes, so-called ragged arrays; 688 01:07:29.670 --> 01:07:33.449 run-length encoding and decoding; and so on. 689 01:07:34.650 --> 01:07:38.969 A histogram would be one.
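Here is that host-side sketch of the three stages. The kernel names, signatures, and the 1024-element block size are my assumptions; scan_block stands in for a per-block kernel along the lines sketched earlier, extended to also write its block's total when asked:

```cuda
// Assumed per-block kernel: inclusive scan of one block's slice of `in`,
// optionally writing the block's total into block_sums[blockIdx.x].
__global__ void scan_block(const float *in, float *out,
                           float *block_sums, int n);

// Stage 3: every element of block b adds the scanned total of blocks 0..b-1.
__global__ void add_offsets(float *data, const float *scanned_sums, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && blockIdx.x > 0)
        data[i] += scanned_sums[blockIdx.x - 1];  // inclusive scan of sums
}

void big_scan(const float *d_in, float *d_out, float *d_block_sums, int n) {
    const int B = 1024;                // elements per block (assumed)
    int nblocks = (n + B - 1) / B;     // also assume nblocks <= 1024 here
    scan_block<<<nblocks, B>>>(d_in, d_out, d_block_sums, n);   // stage 1
    scan_block<<<1, nblocks>>>(d_block_sums, d_block_sums,
                               nullptr, nblocks);               // stage 2
    add_offsets<<<nblocks, B>>>(d_out, d_block_sums, n);        // stage 3
}
```

Stage 2 assumes the array of block totals itself fits in one block; for the billion-element case mentioned above, you'd apply the same trick one more level down.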
690 01:07:38.969 --> 01:07:44.340 There are a lot of things where your output is a lot of buckets of varying sizes. 691 01:07:44.340 --> 01:07:48.090 Histogramming would be one; some sort of 692 01:07:49.230 --> 01:07:56.460 set operation, unions and differences and so on: you end up with blocks of different sizes. 693 01:07:56.460 --> 01:08:00.809 Oh, okay. 697 01:08:21.689 --> 01:08:28.470 This is what I mentioned an hour ago. What's going on here? 698 01:08:28.470 --> 01:08:32.550 You've got this nice abstract idea of a field: 699 01:08:32.550 --> 01:08:38.279 the addition and multiplication operations are associative and commutative, 700 01:08:38.279 --> 01:08:43.859 multiplication distributes over addition, you've got 701 01:08:43.859 --> 01:08:50.789 additive inverses, you've got reciprocals, and so on; there are 11 axioms in all 702 01:08:50.789 --> 01:08:53.909 that a field satisfies. 703 01:08:53.909 --> 01:08:56.970 And then you've got floats, 704 01:08:56.970 --> 01:09:03.720 which basically break them. I mean, there have been machines in the past where, at one time, 1*x was not equal to x. Okay? 705 01:09:03.720 --> 01:09:13.079 So you've got the ideal, and you've got the imperfect implementation: that's floating point numbers. 706 01:09:13.079 --> 01:09:17.399 The context here 707 01:09:17.399 --> 01:09:21.060 is that different computers implement floats in different ways: 708 01:09:21.060 --> 01:09:25.619 a different number of significant bits, a different base, whatever. 709 01:09:25.619 --> 01:09:30.779 The IEEE did a standard a couple of decades back 710 01:09:30.779 --> 01:09:36.810 to try to establish good practices, what a good floating point system 711 01:09:38.220 --> 01:09:48.390 would be. And, um, here's the thing: for example, suppose you've got some common operation, square root of x, okay, for x being positive. 712 01:09:49.409 --> 01:09:54.840 A bad implementation might have square root of x jittering up and down in the last bit 713 01:09:54.840 --> 01:09:59.609 as x increases slightly. It's conceivable square root of x could go down as x went up, 714 01:09:59.609 --> 01:10:06.090 you know, in the last bit. Why is this bad? It might cause, say, a secant-search algorithm to fail. 715 01:10:06.090 --> 01:10:12.090 You see, you violate these little properties that you just intuitively assume, even if they're not written down: that 716 01:10:12.090 --> 01:10:21.359 your standard functions should be monotonic in practice where they're monotonic in theory, stuff like that. You would assume a plus b would equal b plus a; 717 01:10:21.359 --> 01:10:27.720 it's so obvious to you, but there were machines where that was not true. As I said, there was a machine where at one time 1*x was not equal to x, sometimes. 718 01:10:27.720 --> 01:10:37.590 The reason for some of this is that the machine might have an invisible register that's higher precision. You do a plus b: 719 01:10:37.590 --> 01:10:42.779 perhaps a is in a register at higher precision, and b comes out of memory at lower precision. 720 01:10:42.779 --> 01:10:50.609 For b plus a, maybe it's b that's in the high-precision register and not a (they're coming from some earlier operations), and the result is, 721 01:10:50.609 --> 01:10:54.060 a plus b would not be b plus a, exactly. Stuff like that. 722 01:10:54.060 --> 01:10:57.569 Um, so the IEEE,
723 01:10:57.569 --> 01:11:01.140 well, William Kahan at Berkeley was leading it, um, 724 01:11:01.140 --> 01:11:06.420 came up with a standard, and it got a lot of complaints because 725 01:11:06.420 --> 01:11:19.260 the standard was expensive to implement. So Cray, which had some fast parallel computers, never would do IEEE floating point. The floating point standard was theoretically nice, but 726 01:11:19.260 --> 01:11:25.529 it was expensive; it took a lot of gates. But now everyone does it, because they decided it's nice 727 01:11:25.529 --> 01:11:30.210 to do things right. Um, and 728 01:11:30.210 --> 01:11:34.409 even GPUs mostly do it. So that's the context here: 729 01:11:34.409 --> 01:11:40.829 does the hardware manufacturer's implementation follow this standard? 730 01:11:40.829 --> 01:11:45.750 If it follows this, the obvious things will happen. If it doesn't follow this, 731 01:11:45.750 --> 01:11:53.970 your hardware will perhaps be cheaper and faster, but you're going to get some crazy errors. 732 01:11:53.970 --> 01:11:58.020 So, um, 733 01:11:58.020 --> 01:12:02.100 that's the point about that. 734 01:12:05.189 --> 01:12:10.979 People know how floats are implemented, perhaps, so you can read this on your own. Um, 735 01:12:10.979 --> 01:12:19.770 you can read the details; I just gave you the quick executive summary of 736 01:12:19.770 --> 01:12:27.119 representable numbers. The IEEE standard also has bit patterns to represent infinity; 737 01:12:27.119 --> 01:12:32.909 if you divide 1 by 0, you get infinity, perhaps. And not-a-number: 0 divided by 0 is not-a-number. 738 01:12:32.909 --> 01:12:39.119 But no one uses them much. I'll skip all of this. 739 01:12:40.800 --> 01:12:46.590 Read it if you want. Um, questions? 740 01:12:48.210 --> 01:12:51.449 Well, this gives you some limits here. Um, 741 01:12:52.590 --> 01:12:56.670 normalization; you can read about precision and so on. 742 01:12:56.670 --> 01:13:00.510 Single precision floats: your biggest number is like 10 to the 38th, 743 01:13:00.510 --> 01:13:05.880 which is not big enough for everything. So one fix is basically to want double precision floats, 744 01:13:05.880 --> 01:13:12.060 unless you're doing machine learning, where you don't care about accuracy, so you go to half precision floats. 745 01:13:12.060 --> 01:13:16.739 I'm willing to debate that point about machine learning, but: 746 01:13:16.739 --> 01:13:20.039 special bit patterns do not-a-number, and 747 01:13:20.039 --> 01:13:25.710 accuracy and rounding, um. 748 01:13:27.029 --> 01:13:32.520 Well, one thing in the IEEE standard: if you're rounding a number, 749 01:13:32.520 --> 01:13:40.829 they require the last bit to be correct, and when you have to round, you can say whether you want to round down, round up, or round toward 0; that's useful at times. 750 01:13:40.829 --> 01:13:48.359 The fact that the last bit is guaranteed to be accurate means the hardware is computing to an extra bit or two and then rounding. 751 01:13:48.359 --> 01:13:53.220 So there's a lot of sophisticated stuff in the standard, um, which I'll skip through. 752 01:13:53.220 --> 01:13:57.270 Yeah. 753 01:13:57.270 --> 01:14:01.500 And addition is not associative: 754 01:14:01.500 --> 01:14:07.649 the thing is, (large plus small) plus small does not equal large plus (small plus small). 755 01:14:07.649 --> 01:14:11.130 Yeah, okay.
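That non-associativity is easy to demonstrate; here is a tiny demo, mine rather than the slides'. It uses 16777216, which is 2 to the 24th: at that magnitude a 32-bit float's last bit is worth 2, so an added 1.0 rounds away but an added 2.0 survives:

```cuda
// Floating point addition is not associative: a minimal host-side demo.
#include <cstdio>

int main() {
    float large = 16777216.0f;             // 2^24: last bit is worth 2 here
    float a = (large + 1.0f) + 1.0f;       // each 1.0 is lost to rounding
    float b = large + (1.0f + 1.0f);       // the 2.0 is representable
    printf("a = %.1f\nb = %.1f\n", a, b);  // 16777216.0 vs 16777218.0
    printf("equal? %s\n", a == b ? "yes" : "no");  // no
    return 0;
}
```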
756 01:14:11.130 --> 01:14:16.859 Well, here is the example I mentioned. So: 757 01:14:16.859 --> 01:14:23.010 you can do the IEEE standard, or you can do the cheap crappy version, sorry, 758 01:14:23.010 --> 01:14:27.060 which will be faster but less accurate. So you pick. 759 01:14:28.770 --> 01:14:34.199 You can do fast math, but if you use the fast math option, 760 01:14:34.199 --> 01:14:38.699 you've got to worry that you're going to get some surprising failures in your program. 761 01:14:38.699 --> 01:14:48.569 You'll have a loop where you're iterating until something converges, and your loop will become infinite because it won't converge properly. 762 01:14:48.569 --> 01:14:51.779 Something that terminated, 763 01:14:51.779 --> 01:14:55.770 that's the sort of thing: something that terminated 764 01:14:55.770 --> 01:14:59.819 using proper floats may turn into an infinite loop with fast math. 765 01:14:59.819 --> 01:15:06.000 Okay, um. 766 01:15:06.000 --> 01:15:09.390 Or, make everything double, 767 01:15:10.590 --> 01:15:15.989 on Intel. Okay, so I'll give you the executive summary of this. 768 01:15:15.989 --> 01:15:21.000 Floating point is a pain. It has a lot of sophisticated problems. 769 01:15:21.000 --> 01:15:28.439 They've been handled as much as they can be with the standard, but the standard is expensive to implement. 770 01:15:28.439 --> 01:15:31.859 So hardware people resist it, but 771 01:15:31.859 --> 01:15:35.880 if your floating point uses the standard, it will 772 01:15:35.880 --> 01:15:40.199 do the obvious things most of the time. Most of the time. 773 01:15:40.199 --> 01:15:43.380 Um, 774 01:15:43.380 --> 01:15:46.529 if you want extra speed, you can 775 01:15:46.529 --> 01:15:50.069 not use the standard; but be aware. 776 01:15:50.069 --> 01:15:58.199 If people like, I can expand as much as you like on numerical problems; I've actually spent some time on that. 777 01:15:58.199 --> 01:16:01.710 Um, other stuff about floats. 778 01:16:04.170 --> 01:16:09.090 On NVIDIA, the floating point operations, 779 01:16:09.090 --> 01:16:15.720 an example of how hard it is to do floats right: Intel has gotten it wrong more than once. Okay, 780 01:16:15.720 --> 01:16:20.220 Intel got floating division wrong once. Okay. 781 01:16:20.220 --> 01:16:25.710 This was some years ago: Intel implemented floating division; 782 01:16:25.710 --> 01:16:29.909 it was an iterative process, but it started with a lookup table. 783 01:16:29.909 --> 01:16:38.430 Well, it was doing a reciprocal first, a reciprocal and a multiply. It used a lookup table to get the first few bits of the reciprocal, 1 over y. 784 01:16:38.430 --> 01:16:45.239 The lookup table was, I don't know, a thousand elements long or something, and a few of the entries in the lookup table were wrong. 785 01:16:45.239 --> 01:16:49.409 So the Intel commercial chip, 786 01:16:49.409 --> 01:16:55.229 for certain values, computed x divided by y wrong. 787 01:16:55.229 --> 01:16:59.729 And it was out there for a while. 788 01:16:59.729 --> 01:17:08.039 But most of the time it was right, and when it was wrong, I think it was only wrong slightly. So some 789 01:17:08.039 --> 01:17:18.659 professor somewhere discovered it and went to Intel and pointed it out to them, and they ignored him. So he went public, and it became a major story in the business.
790 01:17:18.659 --> 01:17:24.149 So that's an example of how hard floating point is: Intel got it wrong, and that's not the only time they got it wrong. 791 01:17:24.149 --> 01:17:29.579 Okay. So, the floats are done with separate hardware, 792 01:17:29.579 --> 01:17:37.560 even on Intel. On NVIDIA, double precision may be separate hardware still, or, depending on the chip, it may be emulated, 793 01:17:37.560 --> 01:17:43.439 simulated in single precision. And the decision for NVIDIA is how many hardware 794 01:17:43.439 --> 01:17:47.039 floating point cores to put on the chip. 795 01:17:47.039 --> 01:17:57.420 The idea is that the threads that want to do floats queue up waiting for the available hardware cores, for the warp; it's all done warp by warp. 796 01:17:57.420 --> 01:18:03.239 So if you've got a lot of floating point computation 797 01:18:03.239 --> 01:18:06.510 in your program and not many hardware cores, 798 01:18:06.510 --> 01:18:09.630 your program's going to be waiting; 799 01:18:09.630 --> 01:18:15.029 your parallel kernel's going to be waiting; you're going to be limited by the lack of hardware cores. 800 01:18:15.029 --> 01:18:22.050 Of course, sometimes NVIDIA has done floats faster than ints. 801 01:18:22.050 --> 01:18:26.579 Oh, it depends. Um, I think at one point, 802 01:18:26.579 --> 01:18:34.979 basically, a float was faster. Because again, you've got the special-purpose cores and the warps queue up to use them. 803 01:18:34.979 --> 01:18:39.869 And I think at best a float may take a couple of cycles (I could be wrong on that), 804 01:18:39.869 --> 01:18:43.500 and at worst, more. So, okay. 805 01:18:45.239 --> 01:18:53.430 So that's a reasonable point about floats. Say, I don't know, 806 01:18:53.430 --> 01:19:00.180 what they're saying here, that's true for a sequential program: 807 01:19:00.180 --> 01:19:05.789 if you can afford it, the quick and dirty fix is to do everything in double. 808 01:19:05.789 --> 01:19:14.550 On a Xeon it may actually be about the same speed (the cores are the same speed), but you're moving twice as much data around, and that may slow you down. 809 01:19:15.689 --> 01:19:21.210 But if you think you're doing everything in double, you might accidentally do a few things in single precision. 810 01:19:21.210 --> 01:19:25.260 And so what they're saying here is that 811 01:19:26.460 --> 01:19:31.739 you may have to be careful about constants and so on, that they're the right precision. 812 01:19:31.739 --> 01:19:36.659 If you get it wrong, your program may be slower than you would think, or it may just be 813 01:19:36.659 --> 01:19:44.399 less accurate than you would think (there's a small demo of this just after this passage). Okay, so that's, um, this was 814 01:19:46.680 --> 01:19:53.550 12.1. And again, the executive summary: floating point is a big topic that 815 01:19:53.550 --> 01:20:00.750 you probably want to ignore and usually can't. There's a big field of numerical accuracy worrying about this; 816 01:20:00.750 --> 01:20:10.229 how do you invert big matrices accurately, or whatever. That's enough new stuff for today. So, unless there are any questions, 817 01:20:12.029 --> 01:20:15.210 I'll take questions for a minute or two, and then I'll go get lunch. 818 01:20:16.500 --> 01:20:25.079 So, we're working our way through. I'm open to feedback, if you think I'm going too slow or too quickly. 819 01:20:25.079 --> 01:20:30.869 I'm skipping over major areas and just summarizing for you, but you can dig into them if you want.
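Here is that small demo of the constants pitfall, mine rather than the slides'. Both variables below are doubles, but one of them was actually computed in single precision because its literals were floats; the same slip in the other direction (a stray double literal like 0.1 instead of 0.1f in float-heavy CUDA code) costs speed instead of accuracy:

```cuda
// "Everything in double" silently degraded by float constants: a host demo.
#include <cstdio>

int main() {
    double accurate  = 1.0  / 3.0;   // computed in double precision
    double truncated = 1.0f / 3.0f;  // computed in FLOAT, then widened
    printf("%.17f\n", accurate);     // 0.33333333333333331
    printf("%.17f\n", truncated);    // 0.33333334326744080
    // Both variables are doubles; the damage is invisible in the types.
    return 0;
}
```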
820 01:20:32.399 --> 01:20:40.680 If you want to learn more about floating point arithmetic and its risks, William Kahan is the expert on this, and he did the IEEE standard. 821 01:20:40.680 --> 01:20:46.020 He also, I think, did the floating point for HP's processors, calculators and so on, 822 01:20:46.020 --> 01:20:49.439 as a consultant to HP. So he's the guy behind 823 01:20:49.439 --> 01:20:52.470 a lot of the implementations. 824 01:20:52.470 --> 01:20:59.010 And he has a program, a subroutine, called paranoia, and he has writings on how 825 01:20:59.010 --> 01:21:02.640 badly this stuff can go wrong. And, um, 826 01:21:04.260 --> 01:21:07.859 an example of how complicated it is: 827 01:21:07.859 --> 01:21:12.180 the Communications of the ACM, some years ago, published a 828 01:21:12.180 --> 01:21:19.109 subroutine that would determine what the floating point precision was on your computer, the number of bits in the 829 01:21:19.109 --> 01:21:27.510 exponent and so on, et cetera. They published it, and then they discovered that on some commercial computers it was dysfunctional: it would go into an infinite loop, 830 01:21:27.510 --> 01:21:33.270 not even terminate, because it was iterating until 831 01:21:33.270 --> 01:21:39.720 some numbers stopped changing, and that would be related to the number of significant bits; but on some hardware 832 01:21:39.720 --> 01:21:44.250 it would never stop; it would just iterate forever. And so, 833 01:21:44.250 --> 01:21:49.050 okay, so: 834 01:21:50.579 --> 01:21:56.609 open to questions. Other than that, see you Tuesday.